Avoiding Unsafe States in 3D Environments using Human Feedback

Learning about unsafe states

Source: Getty Images.

The ReQueST algorithm

  1. A neural environment simulator: a dynamics model learned from trajectories generated by humans exploring the environment safely. In our work this is a pixel-based dynamics model.
  2. A reward model, learned from human feedback on videos of (hypothetical) behaviour in the learned simulator.
  3. Trajectory optimisation, used to choose which hypothetical behaviours to show the human, so that the reward model learns what is safe and what is not (along with other aspects of the task) as quickly as possible.
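
To make the interplay of the three components more concrete, here is a minimal toy sketch in Python. It is not our implementation: the 1D world, the hand-written `dynamics_model`, the threshold-based reward ensemble, the random-shooting optimiser, and the use of ensemble disagreement as the query criterion are all assumptions chosen for illustration. It only shows how a hypothetical trajectory might be picked inside a learned simulator so that a human label on it would be informative.

```python
import random
import statistics

# Hypothetical stand-in for a learned dynamics model: a 1D world where the
# action shifts the position. (Assumption: the real model is pixel-based
# and learned from safe human trajectories; this one is hand-written.)
def dynamics_model(state, action):
    return state + action

def rollout(start, actions):
    """Imagine a trajectory inside the simulator; no real environment steps."""
    states = [start]
    for action in actions:
        states.append(dynamics_model(states[-1], action))
    return states

# Hypothetical reward-model ensemble: each member guesses a different
# boundary beyond which states count as unsafe (1.0 = "unsafe").
def make_member(threshold):
    return lambda state: 1.0 if state > threshold else 0.0

def disagreement(ensemble, states):
    """Mean variance of the ensemble's predictions along a trajectory."""
    return statistics.fmean(
        statistics.pvariance([member(s) for member in ensemble])
        for s in states
    )

def choose_query(ensemble, start, horizon=5, candidates=200, seed=0):
    """Random-shooting trajectory optimisation: sample action sequences,
    roll them out in the model, keep the one the ensemble disagrees on most."""
    rng = random.Random(seed)
    best_states, best_score = None, -1.0
    for _ in range(candidates):
        actions = [rng.choice([-1.0, 0.0, 1.0]) for _ in range(horizon)]
        states = rollout(start, actions)
        score = disagreement(ensemble, states)
        if score > best_score:
            best_states, best_score = states, score
    return best_states, best_score

ensemble = [make_member(t) for t in (1.5, 2.5, 3.5)]
query_states, score = choose_query(ensemble, start=0.0)
print(query_states, score)
```

In this sketch the selected trajectory visits the region where the ensemble members disagree about safety, which is where a human label would tell the reward model the most. A fuller loop would then collect the label and refit the ensemble before choosing the next query.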

ReQueST in our work

Results

What is the significance of these results?

We research and build safe AI systems that learn how to solve problems and advance scientific discovery for all. Explore our work: deepmind.com

DeepMind Safety Research