Building safe artificial intelligence: specification, robustness, and assurance

Three AI safety problem areas. Each box highlights some representative challenges and approaches. The three areas are not disjoint but rather aspects that interact with each other. In particular, a given specific safety problem might involve solving more than one aspect.

Specification: define the purpose of the system

  • ideal specification (the “wishes”), corresponding to the hypothetical (but hard to articulate) description of an ideal AI system that is fully aligned to the desires of the human operator;
  • design specification (the “blueprint”), corresponding to the specification that we actually use to build the AI system, e.g. the reward function that a reinforcement learning system maximises;
  • and revealed specification (the “behaviour”), which is the specification that best describes what actually happens, e.g. the reward function we can reverse-engineer from observing the system’s behaviour using, say, inverse reinforcement learning. This is typically different from the one provided by the human operator because AI systems are not perfect optimisers or because of other unforeseen consequences of the design specification.
From Faulty Reward Functions in the Wild by OpenAI: a reinforcement learning agent discovers an unintended strategy for achieving a higher score.

Robustness: design the system to withstand perturbations

From AI Safety Gridworlds. During training the agent learns to avoid the lava; but when we test it in a new situation where the location of the lava has changed, it fails to generalise and runs straight into the lava.
An adversarial input, overlaid on a typical image, can cause a classifier to miscategorise a sloth as a race car. The two images differ by at most 0.0078 in each pixel. The first one is classified as a three-toed sloth with >99% confidence. The second one is classified as a race car with >99% probability.

Assurance: monitor and control system activity

ToMNet discovers two subspecies of agents and predicts their behaviour (from “Machine Theory of Mind”)
A problem with interruptions: human interventions (i.e. pressing the stop button) can change the task. In the figure, the interruption adds a transition (in red) to the Markov decision process that changes the original task (in black). See Orseau and Armstrong, 2016.

Looking ahead

If you are interested in working with us on the research areas outlined in this post, we are hiring! Please check our open roles at and note your interest in AI safety when you apply. We would love to hear from talented researchers and non-researchers alike.





We research and build safe AI systems that learn how to solve problems and advance scientific discovery for all. Explore our work:

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Breaking Taboos with AI in India: SnehAI

Artificial Intelligence Machine Learning Arrays — Perfect team

How Chatbots Are Affecting and Revolutionizing the Automotive Industry

Innovating With an Eye Toward Safer AI

IoT Smart Camera Landscape

Scenario of RPA in Banking

RPA in Banking

Five Insights from Releasing a Banking Chatbot in Brooklyn

Women Leading The AI Industry: “Take risks; Traditionally women are taught to think first about…

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
DeepMind Safety Research

DeepMind Safety Research

We research and build safe AI systems that learn how to solve problems and advance scientific discovery for all. Explore our work:

More from Medium

What Is Explainable AI?

AI Human Value Alignment: Supporting human values and human flourishing.

Gilbane Advisor 6–15–22 — LinkBERT, VALHALLA, text networks

AI Needs Embodied Intelligence and Mortality