Avoiding Unsafe States in 3D Environments using Human Feedback

Jan 21, 2022

By Matthew Rahtz, Vikrant Varma, Ramana Kumar, Zachary Kenton, Shane Legg, and Jan Leike.

Tl;dr: ReQueST is an algorithm for learning objectives from human feedback on hypothetical behaviour. In this work, we scale ReQueST to complex 3D environments, and show that it works even with feedback sourced entirely from real humans. Read our paper at https://arxiv.org/abs/2201.08102.

Learning about unsafe states

Online reinforcement learning has a problem: it must act unsafely in order to learn not to act unsafely. For example, if we were to use online reinforcement learning to train a self-driving car, the car would have to drive off a cliff in order to learn not to drive off cliffs.

One way that humans solve this problem is by learning from hypothetical situations. Our imagination gives us the ability to consider various courses of action without actually having to enact them in the real world. In particular, this allows us to learn about potential sources of danger without having to expose ourselves or others to the concomitant risks.

The ReQueST algorithm

ReQueST (Reward Query Synthesis via Trajectory optimization) is a technique developed to give AI systems the same ability. ReQueST employs three components:

A neural environment simulator: a dynamics model learned from trajectories generated by humans exploring the environment safely. In our work this is a pixel-based dynamics model.
A reward model, learned from human feedback on videos of (hypothetical) behaviour in the learned simulator.
Trajectory optimisation, so that we can choose hypothetical behaviours to ask the human about that help the reward model learn what’s safe and what’s not (in addition to other aspects of the task) as quickly as possible.
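
As a rough illustration of the trajectory optimisation step, here is a minimal Python sketch that assumes the learned simulator is differentiable. The toy one-dimensional `dynamics_model`, the quadratic `predicted_reward`, the `target` state and the step sizes are all hypothetical stand-ins, not the pixel-based models from the paper. The sketch runs gradient ascent on an action sequence to steer an imagined trajectory towards a state we might want to ask the human about.

```python
def dynamics_model(state, action):
    """Toy stand-in for the learned simulator (hypothetical, 1D)."""
    return state + 0.5 * action


def rollout(start, actions):
    """Imagine a trajectory inside the model and return its final state."""
    state = start
    for action in actions:
        state = dynamics_model(state, action)
    return state


def predicted_reward(state, target):
    """Toy objective: highest when the imagined state reaches `target`."""
    return -(state - target) ** 2


def synthesise_query(start, target, horizon=4, lr=0.1, steps=200):
    """Gradient ascent on the action sequence, differentiating through
    the toy dynamics by hand with the chain rule."""
    actions = [0.0] * horizon
    for _ in range(steps):
        final = rollout(start, actions)
        d_reward_d_final = -2.0 * (final - target)
        d_final_d_action = 0.5  # same for every step of this linear model
        grad = d_reward_d_final * d_final_d_action
        actions = [a + lr * grad for a in actions]
    return actions, rollout(start, actions)


if __name__ == "__main__":
    # Steer an imagined trajectory towards a hypothetical cliff edge at x = 5.
    actions, final = synthesise_query(start=0.0, target=5.0)
    print(actions, final)
```

With a neural simulator the gradient would come from automatic differentiation rather than a hand-written derivative, and the objective could be swapped to seek out whichever states the reward model most needs feedback on.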

Together, these three components allow us to learn a reward model based entirely on hypothetical examples ‘imagined’ using the learned simulator. If we then use the learned simulator and reward model with a model-based control algorithm, the result is an agent that does what the human wants, and in particular avoids behaviours the human has indicated are unsafe, without having had to first try those behaviours in the real world!
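
To make the last step concrete, here is a minimal Python sketch of model-based control using a simulator and a reward model. Everything in it is an illustrative assumption: the one-dimensional world, the hypothetical cliff edge at x = 5, the apple placed just beyond it at x = 6, and the exhaustive short-horizon planner are toy stand-ins rather than the components used in the paper. The sketch also assumes the 'real' environment behaves exactly like the model.

```python
import itertools

ACTIONS = (-1.0, 0.0, 1.0)  # step left, stay, step right
CLIFF_EDGE = 5.0            # hypothetical: states beyond this are unsafe
APPLE = 6.0                 # hypothetical: apple placed past the edge


def dynamics_model(state, action):
    """Toy stand-in for the learned simulator: predicts the next state."""
    return state + action


def reward_model(state):
    """Toy stand-in for the learned reward model: closer to the apple is
    better, but any state past the cliff edge is heavily penalised."""
    if state > CLIFF_EDGE:
        return -100.0
    return -abs(state - APPLE)


def plan(state, horizon=4):
    """Score every short action sequence inside the models and return
    the first action of the best one."""
    best_return, best_action = float("-inf"), 0.0
    for actions in itertools.product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in actions:
            s = dynamics_model(s, a)
            total += reward_model(s)
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action


def run_episode(start=0.0, steps=10):
    """Act in the 'real' world, assumed here to match the model exactly."""
    state, visited = start, [start]
    for _ in range(steps):
        state = dynamics_model(state, plan(state))
        visited.append(state)
    return visited


if __name__ == "__main__":
    print(run_episode())
```

In this toy setting the agent walks towards the apple and then stops at the edge, because every imagined trajectory that crosses it scores badly in the reward model before any real step is taken.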

ReQueST in our work

In our latest paper, we ask: is ReQueST viable in a more realistic setting than the simple 2D environments used in the work that introduced ReQueST? In particular, can we scale ReQueST to a complex 3D environment, with imperfect feedback as sourced from real humans rather than procedural reward functions?

It turns out the answer is: yes!

The video above shows a number of (cherry-picked) example episodes from our ReQueST agent on an apple collection task. The left pane shows ground-truth observations and rewards from the ‘real’ environment. On the right, we see the predictions generated by the learned environment simulator and the reward model, used by the agent to determine which actions to take. On top are predictions of future observations generated by the dynamics model; on the bottom are predictions from a reward model we’ve trained to reward the agent for moving closer and closer to each apple.

Results

To quantify the ability of our agent to avoid unsafe states in the ‘real’ environment, we run 100 evaluation episodes in a ‘cliff edge’ environment, where the agent can fall off the edge of the world (which would be unsafe). We test on three sizes of environment, corresponding to different difficulty levels: the larger the environment, the harder it is for the agent to accidentally wander off the edge. Results are as follows.

Note that at test time, across the 100 evaluation episodes, the ReQueST agent barely falls off the edge at all: in the hardest environment it falls in only 3% of episodes. A model-free RL algorithm, in contrast, must fall off the edge over 900 times before it learns not to. (ReQueST itself does not fall off the edge during training. For fairness, however, we count as safety violations the times that human contractors fell off the edge, despite being instructed not to, while generating trajectories for the dynamics model. We also tried training on only the safe trajectories to confirm that unsafe trajectories were not required.)

In terms of performance, the ReQueST agent manages to eat about 2 out of the 3 possible apples on average. This is worse than the model-free baseline, which consistently eats all 3 apples. However, we do not believe that this reflects the performance achievable with the ReQueST algorithm in principle. Most failures in our experiments could be attributed to some combination of low fidelity in the learned simulator and inconsistent outputs from the reward model, neither of which was the focus of this work. We believe such failure modes could be solved relatively easily with additional work on these components, and the quality of the learned simulation in particular will improve with general progress in generative modelling.

What is the significance of these results?

First, this research shows that even in realistic environments, it is possible to aim for zero safety violations without making major assumptions about the state space. With others such as Luo 2021 starting to aim for a similar target, we hope this is the beginning of a new level of ambition for safe exploration research.

Second, this work establishes ReQueST as a plausible solution to human-guided safe exploration. We believe this is the particular brand of safe exploration likely to be representative of real-world deployments of AGI: where the constraints of safe behaviour are fuzzy, and therefore must be learned from humans because they can’t be specified programmatically. In particular, ReQueST shines in situations where any safety violations at all may incur great cost (e.g. harm to humans).

Third, we have shown the exciting promise of neural environment simulators for making RL safer. We believe such simulators warrant more attention, given that a) they can be learned from data (rather than handcrafted by experts), and b) they can be differentiated through in order to discover situations of interest (as we do in this work during trajectory optimisation). Given current progress in this area, we are excited to see what the future holds.