By Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg

Crossposted to the Alignment Forum

About two years ago, we released the first few papers on understanding agent incentives using causal influence diagrams. This blog post summarizes the progress made since then.

What are causal influence diagrams?

A key problem in AI alignment is understanding agent incentives. Concerns have been raised that agents may be incentivized to avoid correction, manipulate users, or inappropriately influence their learning. This is particularly worrying as training schemes often shape incentives in subtle and surprising ways. …

By Adam Gleave, Michael Dennis, Shane Legg, Stuart Russell and Jan Leike.

TL;DR: Equivalent-Policy Invariant Comparison (EPIC) is a fast and reliable way to measure how similar two reward functions are. EPIC can be used to benchmark reward learning algorithms by comparing learned reward functions to a ground-truth reward. EPIC is up to 1000 times faster than alternative evaluation methods, and requires little to no hyperparameter tuning. Moreover, we show both theoretically and empirically that reward functions judged as similar by EPIC induce policies with similar returns, even in unseen environments.

Figure 1: EPIC compares reward functions Rᵤ and Rᵥ by first mapping them to canonical representatives and then computing the Pearson distance between the canonical representatives on a coverage distribution 𝒟. Canonicalization removes the effect of potential shaping, and Pearson distance is invariant to positive affine transformations.
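The two steps in Figure 1 can be sketched for a small tabular setting. This is a minimal illustration, not the authors' implementation: it assumes rewards stored as an array indexed by (state, action, next state), uniform state and action distributions for canonicalization, and a uniform coverage distribution over all transitions. The function names and the toy rewards are hypothetical.

```python
import numpy as np


def canonicalize(reward, gamma):
    """Canonically shaped version of a tabular reward[s, a, s_next].

    Assumes uniform distributions over states and actions (an assumption
    made for this sketch, not taken from the post).
    """
    mean_from = reward.mean(axis=(1, 2))  # E[R(x, A, S')] for each state x
    overall = reward.mean()               # E[R(S, A, S')]
    return (
        reward
        + gamma * mean_from[None, None, :]  # gamma * E[R(s', A, S')]
        - mean_from[:, None, None]          # E[R(s, A, S')]
        - gamma * overall                   # gamma * E[R(S, A, S')]
    )


def pearson_distance(x, y):
    """sqrt((1 - rho) / 2), with rho the Pearson correlation over all entries."""
    rho = np.corrcoef(x.ravel(), y.ravel())[0, 1]
    # Clamp to guard against rho marginally exceeding 1 from rounding.
    return float(np.sqrt(max(0.0, 1.0 - rho) / 2.0))


def epic_distance(reward_u, reward_v, gamma):
    """Pearson distance between the canonicalized rewards."""
    return pearson_distance(
        canonicalize(reward_u, gamma), canonicalize(reward_v, gamma)
    )


# Toy rewards (hypothetical data for illustration only).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
reward = rng.normal(size=(n_states, n_actions, n_states))

# Potential shaping plus a positive affine transformation of `reward`.
potential = rng.normal(size=n_states)
shaping = gamma * potential[None, None, :] - potential[:, None, None]
equivalent = 3.0 * reward + shaping + 5.0

unrelated = rng.normal(size=reward.shape)

d_equivalent = epic_distance(reward, equivalent, gamma)
d_opposite = epic_distance(reward, -reward, gamma)
d_unrelated = epic_distance(reward, unrelated, gamma)
print(d_equivalent, d_opposite, d_unrelated)
```

In this sketch the shaped and rescaled reward sits at a distance of approximately zero from the original, while the negated reward sits at the maximum distance of one. This matches the invariances described in the caption.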

Specifying a reward function…

By Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik and Geoffrey Irving

Would your AI deceive you? This is a central question when considering the safety of AI, underlying many of the most pressing risks from current systems to future AGI. We have recently seen impressive advances in language agents — AI systems that use natural language. This motivates a more careful investigation of their safety properties.

In our recent paper, we consider the safety of language agents through the lens of AI alignment, which is about how to get the behaviour of an AI agent to match…

By the Safety Analysis Team: Grégoire Delétang, Jordi Grau-Moya, Miljan Martic, Tim Genewein, Tom McGrath, Vladimir Mikulik, Markus Kunesch, Shane Legg, and Pedro A. Ortega.

TL;DR: To study agent behaviour we must use the tools of causal analysis rather than rely on observation alone. Our paper outlines a rigorous methodology for uncovering the agents’ causal mechanisms.

Understanding the mechanisms that drive agent behaviour is an important challenge in AI safety. In order to diagnose faulty behaviour, we need to understand why agents do what they do. As is the case in medical trials, it is not sufficient to observe that…

By Grégoire Delétang, Tom McGrath, Tim Genewein, Vladimir Mikulik, Markus Kunesch, Jordi Grau-Moya, Miljan Martic, Shane Legg, Pedro A. Ortega

TL;DR: In our recent paper we show that meta-trained recurrent neural networks implement Bayes-optimal algorithms.

One of the most challenging problems in modern AI research is understanding the learned algorithms that arise from training machine learning systems. This issue is at the heart of building robust, reliable, and safe AI systems. …

By Tom Everitt, Ramana Kumar, Jonathan Uesato, Victoria Krakovna, Richard Ngo, Shane Legg

In two new papers, we study tampering in simulation. The first paper describes a platform, called REALab, which makes tampering a natural part of the physics of the environment. The second paper studies the tampering behaviour of several deep learning algorithms and shows that decoupled approval algorithms avoid tampering in both theory and practice.

Supplying the objective for an AI agent can be a difficult problem. One difficulty is coming up with the right objective (the specification gaming problem). But a second difficulty is ensuring that the…

By Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, Shane Legg

This article is cross-posted on the DeepMind website.

Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had experiences with specification gaming, even if not by this name. Readers may have heard the myth of King Midas and the golden touch, in which the king asks that anything he touches be turned to gold — but soon finds that even food and drink turn to metal in his hands. …

By Siddharth Reddy and Jan Leike. Cross-posted from the DeepMind website.

TL;DR: We present a method for training reinforcement learning agents from human feedback in the presence of unknown unsafe states.

When we train reinforcement learning (RL) agents in the real world, we don’t want them to explore unsafe states, such as driving a mobile robot into a ditch or writing an embarrassing email to one’s boss. Training RL agents in the presence of unsafe states is known as the safe exploration problem. We tackle the hardest version of this problem, in which the agent initially doesn’t know how the…

By Tom Everitt, Ramana Kumar, and Marcus Hutter

From an AI safety perspective, having a clear design principle and a crisp characterization of what problem it solves means that we don’t have to guess which agents are safe. In this post and paper we describe how a design principle called current-RF optimization avoids the reward function tampering problem.

Reinforcement learning (RL) agents are designed to maximize reward. For example, Chess and Go agents are rewarded for winning the game, while a manufacturing robot may be rewarded for correctly assembling some given pieces. …

By Pushmeet Kohli, Krishnamurthy (Dj) Dvijotham, Jonathan Uesato, Sven Gowal, and the Robust & Verified Deep Learning group. This article is cross-posted from

Bugs and software have gone hand in hand since the beginning of computer programming. Over time, software developers have established a set of best practices for testing and debugging before deployment, but these practices are not suited for modern deep learning systems. Today, the prevailing practice in machine learning is to train a system on a training data set, and then test it on another set. While this reveals the average-case performance of models, it is…

DeepMind Safety Research

We research and build safe AI systems that learn how to solve problems and advance scientific discovery for all.
