REALab: Conceptualising the Tampering Problem

A REALab environment where the agent is supposed to pick up apples. The two-block register communicates feedback to the agent.
  • Standard RL agents. Reward is communicated via block positions, and two deep RL algorithms (DQN and policy gradient) are applied to optimise the observed reward. Unsurprisingly, the agents learn to push the register blocks that communicate reward instead of picking up the apple.
  • Approval RL agents. Rather than communicating reward, the block positions communicate approval (value advice) for the action just taken. This allows us to use myopic agents that always select the action with the highest expected approval. These agents are somewhat less prone to tampering and mostly go for the apple. But when they are given the opportunity to tamper within a single timestep, they still do so.
  • Decoupled-approval RL agents. Decoupled means that the agent gets feedback about a different action than the one it takes. This breaks the feedback loop that causes the above agents to prefer tampering, so these agents learn to reliably pick up the apple. They sometimes bump into the blocks by accident, but they don’t tamper systematically in any situation.
Standard RL, Approval RL, and Decoupled Approval RL agents acting in REALab
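
To make the difference between coupled and decoupled approval concrete, here is a minimal toy sketch. It is not the REALab implementation: it assumes a one-step, two-action setting (pick up the apple, or tamper with the register), a made-up corrupted feedback value, and a plain REINFORCE-style update. The names `train`, `observed_feedback` and `CORRUPTED_FEEDBACK` are hypothetical and chosen only for illustration.

```python
import math
import random

APPLE, TAMPER = 0, 1
# Assumption: the supervisor's true approval for each action.
TRUE_APPROVAL = {APPLE: 1.0, TAMPER: 0.0}
# Assumption: the value the register shows once the agent has tampered with it.
CORRUPTED_FEEDBACK = 2.0


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def observed_feedback(executed, queried):
    """Feedback the agent sees. Tampering overwrites it, whatever was queried."""
    if executed == TAMPER:
        return CORRUPTED_FEEDBACK
    return TRUE_APPROVAL[queried]


def train(decoupled, steps=20000, lr=0.02, seed=0):
    """Return the final probability of choosing APPLE under a softmax policy."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        executed = rng.choices([APPLE, TAMPER], weights=probs)[0]
        if decoupled:
            # Feedback is requested for an independently sampled action.
            queried = rng.choices([APPLE, TAMPER], weights=probs)[0]
        else:
            # Coupled approval: feedback is about the action actually taken.
            queried = executed
        feedback = observed_feedback(executed, queried)
        # REINFORCE-style step on log pi(queried), weighted by the feedback.
        for a in (APPLE, TAMPER):
            grad = (1.0 if a == queried else 0.0) - probs[a]
            logits[a] += lr * feedback * grad
    return softmax(logits)[APPLE]
```

Calling `train(decoupled=False)` and `train(decoupled=True)` returns the final probability of going for the apple in each regime. Under these assumptions the coupled agent should drift towards tampering, because the inflated feedback is credited to the tamper action itself. In the decoupled case the corrupted value does not depend on the queried action, so its contribution to the expected update cancels out and only the honest feedback steers the policy.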



DeepMind Safety Research


We research and build safe AI systems that learn how to solve problems and advance scientific discovery for all.