Building safe artificial intelligence: specification, robustness, and assurance

Three AI safety problem areas. Each box highlights some representative challenges and approaches. The three areas are not disjoint but rather aspects that interact with each other. In particular, a given specific safety problem might involve solving more than one aspect.

Specification: define the purpose of the system

  • ideal specification (the “wishes”), corresponding to the hypothetical (but hard to articulate) description of an ideal AI system that is fully aligned to the desires of the human operator;
  • design specification (the “blueprint”), corresponding to the specification that we actually use to build the AI system, e.g. the reward function that a reinforcement learning system maximises;
  • and revealed specification (the “behaviour”), which is the specification that best describes what actually happens, e.g. the reward function we can reverse-engineer from observing the system’s behaviour using, say, inverse reinforcement learning. This is typically different from the one provided by the human operator because AI systems are not perfect optimisers or because of other unforeseen consequences of the design specification.
From Faulty Reward Functions in the Wild by OpenAI: a reinforcement learning agent discovers an unintended strategy for achieving a higher score.

Robustness: design the system to withstand perturbations

From AI Safety Gridworlds. During training the agent learns to avoid the lava; but when we test it in a new situation where the location of the lava has changed, it fails to generalise and runs straight into the lava.
An adversarial input, overlaid on a typical image, can cause a classifier to miscategorise a sloth as a race car. The two images differ by at most 0.0078 in each pixel. The first one is classified as a three-toed sloth with >99% confidence. The second one is classified as a race car with >99% probability.

Assurance: monitor and control system activity

ToMNet discovers two subspecies of agents and predicts their behaviour (from “Machine Theory of Mind”)
A problem with interruptions: human interventions (i.e. pressing the stop button) can change the task. In the figure, the interruption adds a transition (in red) to the Markov decision process that changes the original task (in black). See Orseau and Armstrong, 2016.

Looking ahead

If you are interested in working with us on the research areas outlined in this post, we are hiring! Please check our open roles at https://deepmind.com/careers/ and note your interest in AI safety when you apply. We would love to hear from talented researchers and non-researchers alike.

Resources

We research and build safe AI systems that learn how to solve problems and advance scientific discovery for all. Explore our work: deepmind.com

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

How is AI influencing the Alcohol Industry?

Women of the AI: “ Unless women take the leap to join the industry in spite of these stereotypes…

Impressive Medium Articles on AI/ML This Month (Nov)

RoboEvaluate The Performance Of Your LiDARs And Multi-Sensor Systems

IoT Smart Camera Landscape

How might Artificial Intelligence transform corporate sustainability policies?

The MLOps Turn

An overview of artificial neural networks (ANNs) for classification and their applications in…

DeepMind Safety Research

DeepMind Safety Research

We research and build safe AI systems that learn how to solve problems and advance scientific discovery for all. Explore our work: deepmind.com

More from Medium

What are AI, ML, and DL?

AI Bias & Fairness Series

XAI Methods — Integrated Gradients

Paper Review: “All You Need Is Low (Rank): Defending Against Adversarial Attacks on Graphs” in The…