Avoiding Unsafe States in 3D Environments using Human Feedback

By Matthew Rahtz, Vikrant Varma, Ramana Kumar, Zachary Kenton, Shane Legg, and Jan Leike.

Tl;dr: ReQueST is an algorithm for learning objectives from human feedback on hypothetical behaviour. In this work, we scale ReQueST to complex 3D environments, and show that it works even with feedback sourced entirely from real humans. Read our paper at https://arxiv.org/abs/2201.08102.

Learning about unsafe states


One way that humans solve this problem is by learning from hypothetical situations. Our imagination gives us the ability to consider various courses of action without actually having to enact them in the real world. In particular, this allows us to learn about potential sources of danger without having to expose ourselves or others to the concomitant risks.

The ReQueST algorithm

  1. A neural environment simulator: a dynamics model learned from trajectories generated by humans exploring the environment safely. In our work this is a pixel-based dynamics model.
  2. A reward model, learned from human feedback on videos of (hypothetical) behaviour in the learned simulator.
  3. Trajectory optimisation, so that we can choose hypothetical behaviours to ask the human about that help the reward model learn what’s safe and what’s not (in addition to other aspects of the task) as quickly as possible.
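To make the interplay between these three components concrete, here is a minimal sketch on a made-up one-dimensional platform. Everything in it is an assumption for illustration rather than the setup from our paper: the `EDGE` position, the `simulate` and `human_label` stand-ins, the `ThresholdRewardModel` class (a two-number learner in place of a neural reward model), and the "closest to the decision boundary" rule used to choose which imagined trajectory to ask about.

```python
import random

EDGE = 0.8  # hypothetical: positions beyond this count as "fallen off"

def simulate(start, actions):
    """Toy stand-in for the learned simulator: roll a position forward."""
    states = [start]
    for a in actions:
        states.append(min(1.0, max(0.0, states[-1] + a)))
    return states

def human_label(state):
    """Stand-in for a human watching an imagined clip: True means unsafe."""
    return state > EDGE

class ThresholdRewardModel:
    """Tracks the largest state labelled safe and the smallest labelled unsafe."""

    def __init__(self):
        # Seed assumption: position 0.0 is known safe, position 1.0 known unsafe.
        self.max_safe, self.min_unsafe = 0.0, 1.0

    def update(self, state, unsafe):
        if unsafe:
            self.min_unsafe = min(self.min_unsafe, state)
        else:
            self.max_safe = max(self.max_safe, state)

    @property
    def threshold(self):
        # Current best guess for where "safe" turns into "unsafe".
        return 0.5 * (self.max_safe + self.min_unsafe)

    def is_unsafe(self, state):
        return state > self.threshold

def most_informative(model, candidates):
    """Crude query selection: pick the imagined trajectory whose final state
    is closest to the model's current decision boundary."""
    return min(candidates, key=lambda traj: abs(traj[-1] - model.threshold))

def run_request_sketch(rounds=10, seed=0):
    rng = random.Random(seed)
    model = ThresholdRewardModel()
    for _ in range(rounds):
        # Imagine 20 short random trajectories inside the simulator.
        candidates = [
            simulate(rng.random(), [rng.uniform(-0.1, 0.1) for _ in range(5)])
            for _ in range(20)
        ]
        query = most_informative(model, candidates)
        for state in query:  # the "human" labels every imagined state
            model.update(state, human_label(state))
    return model

model = run_request_sketch()
print(round(model.threshold, 3))
```

Note that every state the reward model learns from here is produced by `simulate`; nothing in the loop steps a real environment, which is the property the list above is describing.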

Together, these three components allow us to learn a reward model based entirely on hypothetical examples ‘imagined’ using the learned simulator. If we then use the learned simulator and reward model with a model-based control algorithm, the result is an agent that does what the human wants, and in particular avoids behaviours the human has indicated are unsafe, without first having to try those behaviours in the real world!
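As a rough illustration of that last step, the sketch below plugs a stand-in simulator and a stand-in reward model into a very simple model-based controller that scores every short action sequence in imagination and executes only the first action of the best one. The 1-D platform, the `APPLE` and `EDGE` positions, the action set, the penalty of 100, and the function names are all hypothetical, and exhaustive search is just one possible choice of control algorithm, not necessarily the one used in our experiments.

```python
import itertools

APPLE, EDGE = 0.75, 0.78    # hypothetical positions on a 1-D platform
ACTIONS = (-0.1, 0.0, 0.1)  # hypothetical discrete action set
HORIZON = 3                 # how many steps ahead the agent imagines

def simulator(state, action):
    """Stand-in for the learned dynamics model."""
    return state + action

def reward_model(state):
    """Stand-in for the learned reward model: closer to the apple is better,
    and states the human marked as unsafe are heavily penalised."""
    penalty = 100.0 if state > EDGE else 0.0
    return -abs(state - APPLE) - penalty

def plan(state):
    """Score every action sequence of length HORIZON in imagination and
    return the first action of the highest-scoring one."""
    def imagined_return(actions):
        total, s = 0.0, state
        for a in actions:
            s = simulator(s, a)
            total += reward_model(s)
        return total

    best = max(itertools.product(ACTIONS, repeat=HORIZON), key=imagined_return)
    return best[0]

def run_episode(start=0.0, steps=12):
    visited = [start]
    for _ in range(steps):
        # Assumption: the "real" environment reuses the toy dynamics here.
        visited.append(simulator(visited[-1], plan(visited[-1])))
    return visited

visited = run_episode()
print(max(visited) <= EDGE, round(visited[-1], 2))
```

In this toy setting the agent walks towards the apple and stops one step short of it, because every imagined sequence that crosses `EDGE` is scored far below simply standing still.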

ReQueST in our work

It turns out the answer is: yes!

The video above shows a number of (cherry-picked) example episodes from our ReQueST agent on an apple collection task. The left pane shows ground-truth observations and rewards from the ‘real’ environment. On the right, we see the predictions generated by the learned environment simulator and the reward model, used by the agent to determine which actions to take. On top are predictions of future observations generated by the dynamics model; on the bottom are predictions from a reward model we’ve trained to reward the agent for moving closer and closer to each apple.


At test time, over 100 evaluation episodes, the ReQueST agent rarely falls off the edge: even in the hardest environment, it falls in only 3% of episodes. A model-free RL algorithm, in contrast, must fall off the edge over 900 times before it learns not to. (ReQueST itself does not fall off the edge during training. For fairness, however, we count as safety violations the times that human contractors fell off the edge, despite being instructed not to, while generating trajectories for the dynamics model. We also tried training on only the safe trajectories to confirm that unsafe trajectories were not required.)

In terms of performance, the ReQueST agent manages to eat about 2 out of the 3 possible apples on average. This is worse than the model-free baseline, which does eat all 3 apples consistently. However, we do not believe that this is reflective of performance achievable with the ReQueST algorithm in principle. Most failures in our experiments could be attributed to some combination of low fidelity from the learned simulator and inconsistent outputs from the reward model, neither of which was the focus of this work. We believe such failure modes could be solved relatively easily with additional work on these components, and the quality of the learned simulation in particular will improve with general progress in generative modelling.

What is the significance of these results?

Second, this work establishes ReQueST as a plausible solution to human-guided safe exploration. We believe this is the particular brand of safe exploration likely to be representative of real-world deployments of AGI: where the constraints of safe behaviour are fuzzy, and therefore must be learned from humans because they can’t be specified programmatically. In particular, ReQueST shines in situations where any safety violations at all may incur great cost (e.g. harm to humans).

Third, we have shown the exciting promise of neural environment simulators to make RL safer. We believe such simulators warrant more attention, given that a) they can be learned from data (rather than handcrafted by experts), and b) we can differentiate through them in order to discover situations of interest (as we do in this work during trajectory optimisation). Given current progress in this area, we are excited to see what the future holds.
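For readers unfamiliar with what "differentiating through a simulator" means, the following toy sketch may help. It uses a hand-written, differentiable 1-D simulator (an assumption, standing in for a learned neural model), a hand-picked `TARGET` state standing in for a "situation of interest", and gradients derived manually with the chain rule rather than by an autodiff library. The names `STEP`, `TARGET`, `rollout`, and `optimise_trajectory` are hypothetical.

```python
import math

STEP = 0.1    # hypothetical maximum movement per step
TARGET = 0.3  # hypothetical state of interest to reach in imagination

def rollout(start, actions):
    """Differentiable toy simulator: x_{t+1} = x_t + STEP * tanh(a_t)."""
    x = start
    for a in actions:
        x = x + STEP * math.tanh(a)
    return x

def objective_and_grad(start, actions):
    """Objective J = -(x_T - TARGET)^2 and its gradient with respect to each
    action, obtained by applying the chain rule through the simulator."""
    x_final = rollout(start, actions)
    dJ_dx = -2.0 * (x_final - TARGET)
    # d x_T / d a_k = STEP * (1 - tanh(a_k)^2): each action only adds to x.
    grads = [dJ_dx * STEP * (1.0 - math.tanh(a) ** 2) for a in actions]
    return -(x_final - TARGET) ** 2, grads

def optimise_trajectory(start=0.0, horizon=5, lr=5.0, iters=300):
    actions = [0.0] * horizon
    for _ in range(iters):
        _, grads = objective_and_grad(start, actions)
        # Gradient ascent on the imagined trajectory's actions.
        actions = [a + lr * g for a, g in zip(actions, grads)]
    return actions

actions = optimise_trajectory()
print(round(rollout(0.0, actions), 4))
```

The point of the sketch is only that the search over action sequences is driven by gradients flowing back through the simulator, rather than by sampling; with a learned neural simulator the same gradients would typically come from automatic differentiation.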



We research and build safe AI systems that learn how to solve problems and advance scientific discovery for all. Explore our work: deepmind.com

DeepMind Safety Research
