Designing agent incentives to avoid reward tampering

Gridworld example

Here, an agent can push rocks, diamonds, and words around (as in Sokoban; the black objects are movable). The agent’s objective is described by the purple tiles. Initially, the description says that diamonds give reward when pushed to the green goal area. This is the intended task. However, the agent can also tamper with the reward function: by pushing the word “reward” down, it makes the reward function assign reward to rocks instead of diamonds, creating a mismatch between the agent’s rewards and the intended task.
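
To make the failure mode concrete, here is a minimal Python sketch, not the post’s actual environment: the `State` class, its field names, and the `reward` helper are all illustrative. The key point it captures is that the reward description is itself part of the mutable state.

```python
from dataclasses import dataclass

@dataclass
class State:
    diamonds_in_goal: int = 0
    rocks_in_goal: int = 0
    reward_description: str = "diamonds"  # Θᴿ: what currently counts as reward

def reward(state: State) -> int:
    """Reward is computed from the description currently stored in the state."""
    if state.reward_description == "diamonds":
        return state.diamonds_in_goal
    return state.rocks_in_goal  # tampered: "rocks are reward"

# Intended behaviour: push diamonds to the goal.
s = State(diamonds_in_goal=2)
assert reward(s) == 2

# Tampering: push the "reward" word so that rocks count instead,
# then push the plentiful rocks into the goal for easy reward.
s.reward_description = "rocks"
s.rocks_in_goal = 5
assert reward(s) == 5  # high reward, but the intended task is not done
```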

Causal influence diagram representation

Here 𝜣ᴿᵢ represents the reward description at time i, with 𝜣ᴿ₁ = “diamonds are reward”, while Sᵢ represents the agent’s position and the state of all non-purple tiles. The reward Rᵢ is determined by how well Sᵢ satisfies the reward description 𝜣ᴿᵢ. For example, if the reward description is “diamonds are reward”, then Rᵢ equals the number of diamonds in the goal area of Sᵢ. The agent’s goal is to select actions Aᵢ that maximize the sum of the rewards. The arrows represent causal influence, except for the arrows going into actions, which represent information flow and are therefore drawn as dotted lines.
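
Written out as formulas (a restatement of the diagram’s semantics; the symbol r for the scoring function is our notation, not the post’s):

```latex
% Reward at step i: the current description applied to the current state.
\[
  R_i = r\!\left(S_i;\, \Theta^R_i\right),
  \qquad
  r\!\left(s;\, \text{``diamonds are reward''}\right)
    = \#\{\text{diamonds in the goal area of } s\}.
\]

% The agent selects actions to maximize the sum of rewards.
\[
  \max_{A_1, A_2, \ldots} \; \mathbb{E}\!\left[\textstyle\sum_i R_i\right].
\]
```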

Current-RF optimization

When choosing A₁, the agent optimizes reward according to the current reward description 𝜣ᴿ₁, applied to the (simulated) future states S₂ and S₃. As a result, there are no longer any red directed paths from A₁ to future rewards that pass through a reward-function node 𝜣ᴿᵢ. That is, the incentive for reward tampering has been removed.
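
A minimal sketch of the difference, reusing the illustrative `State` from above; `simulate` is a hypothetical one-step model standing in for the agent’s predictions, not part of any real API:

```python
def reward_under(state: State, description: str) -> int:
    """Score a state against a given reward description."""
    if description == "diamonds":
        return state.diamonds_in_goal
    return state.rocks_in_goal

def naive_return(state: State, action_seq, simulate) -> int:
    """Standard objective: each predicted state is scored with the
    description found *in that state*, so rewriting 𝜣ᴿ can pay off."""
    total = 0
    for a in action_seq:
        state = simulate(state, a)  # predicted next state
        total += reward_under(state, state.reward_description)
    return total

def current_rf_return(state: State, action_seq, simulate) -> int:
    """Current-RF objective: predicted states S₂, S₃, … are all scored
    with the description held *now* (𝜣ᴿ₁), so tampering earns nothing."""
    theta_1 = state.reward_description  # freeze the current reward function
    total = 0
    for a in action_seq:
        state = simulate(state, a)
        total += reward_under(state, theta_1)
    return total
```

Freezing `theta_1` is what severs the paths through 𝜣ᴿᵢ: the description can still change in the simulated futures, but it no longer enters the objective the agent plans against.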


Takeaways and future directions


