Designing agent incentives to avoid reward tampering

Gridworld example

We can illustrate the reward tampering problem using a gridworld in which the reward function can be modified. We adopt a game mechanic from “Baba Is You”, a puzzle game where some of the rules of the game are described by words in the environment. The agent can push those words around, in order to change the rules.

Here, an agent can push rocks, diamonds, and words (as in Sokoban, with black things being movable). The agent’s objective is described by the purple nodes. Initially, the description reads that diamonds provide reward when pushed to the green goal area. This is the intended task. However, the agent can also tamper with the reward function. By pushing the “reward”-word down, the reward function starts assigning reward to rocks instead of diamonds, creating a mismatch between the agent’s rewards and the intended task.

Causal influence diagram representation

Our previous work showed how causal influence diagrams can be used to understand agent incentives and to model AGI safety frameworks. The incentive analysis is directly applicable here. First we model the reward tampering problem with a causal influence diagram of a Markov Decision Process with a modifiable reward function:

Here 𝜣ᴿᵢ represents the reward description at time i, with 𝜣ᴿ₁ = “diamonds are reward”. Meanwhile, S represents the agent’s position and the state of all non-purple tiles. The reward R is determined by how well S satisfies the reward description 𝜣ᴿᵢ. For example, if the reward description is “diamonds are reward”, then R equals the number of diamonds in the goal area in S. The goal of the agent is to select the actions A to optimize the sum of the rewards. The arrows represent causal influence, except for the arrows going into actions, which represent information flow (and are therefore drawn differently with dotted lines).

Current-RF optimization

One way to prevent the agent from tampering with the reward function is to isolate or encrypt the reward function. However, we do not expect such solutions to scale indefinitely with our agent’s capabilities, as a sufficiently capable agent may find ways around most defenses. In our new paper, we describe a more principled way to fix the reward tampering problem. Rather than trying to protect the reward function, we change the agent’s incentives for tampering with it.

When choosing A1, the agent optimizes rewards based on the current reward description 𝜣ᴿ₁ and (simulated) future states S2 and S3. Now there are no longer any red directed paths from A1 to future rewards that pass through a reward function node 𝜣ᴿᵢ. That is, the incentive for reward tampering has been averted.


In the rocks and diamonds environment, a standard RL agent quickly discovers that it can get more reward by tampering with the reward. In contrast, a current-RF agent does not tamper with the reward. Looking only at the reward the agents collect, it may seem like standard RL performs better:

Takeaways and future directions

Most RL algorithms have a reward function tampering incentive. Among these are model-based or model-free RL algorithms that learn from a stepwise reward signal. If the trained model can predict the effect of reward tampering, then the agent can learn that tampering will lead to higher stepwise reward, and thus adapt the tampering behavior.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
DeepMind Safety Research

DeepMind Safety Research

We research and build safe AI systems that learn how to solve problems and advance scientific discovery for all. Explore our work: