By Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg

Cross-posted to the Alignment Forum

About two years ago, we released the first few papers on understanding agent incentives using causal influence diagrams. This blog post summarizes the progress made since then.

What are causal influence diagrams?

A key problem in AI…

By Adam Gleave, Michael Dennis, Shane Legg, Stuart Russell and Jan Leike.

TL;DR: Equivalent-Policy Invariant Comparison (EPIC) provides a fast and reliable way to compute how similar a pair of reward functions are to one another. EPIC can be used to benchmark reward learning algorithms by comparing learned reward functions…

By Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik and Geoffrey Irving

Would your AI deceive you? This is a central question when considering the safety of AI, underlying many of the most pressing risks from current systems to future AGI. We have recently seen impressive advances in…

By the Safety Analysis Team: Grégoire Delétang, Jordi Grau-Moya, Miljan Martic, Tim Genewein, Tom McGrath, Vladimir Mikulik, Markus Kunesch, Shane Legg, and Pedro A. Ortega.

TL;DR: To study agent behaviour we must use the tools of causal analysis rather than rely on observation alone.

By Grégoire Delétang, Tom McGrath, Tim Genewein, Vladimir Mikulik, Markus Kunesch, Jordi Grau-Moya, Miljan Martic, Shane Legg, Pedro A. Ortega

TL;DR: In our recent paper we show that meta-trained recurrent neural networks implement Bayes-optimal algorithms.

One of the most challenging problems in modern AI research is understanding the learned algorithms…

By Tom Everitt, Ramana Kumar, Jonathan Uesato, Victoria Krakovna, Richard Ngo, Shane Legg

In two new papers, we study tampering in simulation. The first paper describes a platform, called REALab, which makes tampering a natural part of the physics of the environment. …

By Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, Shane Legg

This article is cross-posted on the DeepMind website.

Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had experiences with…

By Siddharth Reddy and Jan Leike. Cross-posted from the DeepMind website.

TL;DR: We present a method for training reinforcement learning agents from human feedback in the presence of unknown unsafe states.

When we train reinforcement learning (RL) agents in the real world, we don’t want them to explore unsafe states…

By Tom Everitt, Ramana Kumar, and Marcus Hutter

From an AI safety perspective, having a clear design principle and a crisp characterization of what problem it solves means that we don’t have to guess which agents are safe. …

By Pushmeet Kohli, Krishnamurthy (Dj) Dvijotham, Jonathan Uesato, Sven Gowal, and the Robust & Verified Deep Learning group. This article is cross-posted from

Bugs and software have gone hand in hand since the beginning of computer programming. Over time, software developers have established a set of best practices for…

DeepMind Safety Research

We research and build safe AI systems that learn how to solve problems and advance scientific discovery for all. Explore our work:
