Building safe artificial intelligence: specification, robustness, and assurance

Three AI safety problem areas. Each box highlights some representative challenges and approaches. The three areas are not disjoint but rather aspects that interact with each other. In particular, a given specific safety problem might involve solving more than one aspect.

Specification: define the purpose of the system

  • ideal specification (the “wishes”), corresponding to the hypothetical (but hard to articulate) description of an ideal AI system that is fully aligned to the desires of the human operator;
  • design specification (the “blueprint”), corresponding to the specification that we actually use to build the AI system, e.g. the reward function that a reinforcement learning system maximises;
  • and revealed specification (the “behaviour”), which is the specification that best describes what actually happens, e.g. the reward function we can reverse-engineer from observing the system’s behaviour using, say, inverse reinforcement learning. This is typically different from the one provided by the human operator because AI systems are not perfect optimisers or because of other unforeseen consequences of the design specification.
From Faulty Reward Functions in the Wild by OpenAI: a reinforcement learning agent discovers an unintended strategy for achieving a higher score.

Robustness: design the system to withstand perturbations

From AI Safety Gridworlds. During training the agent learns to avoid the lava; but when we test it in a new situation where the location of the lava has changed, it fails to generalise and runs straight into the lava.
An adversarial input, overlaid on a typical image, can cause a classifier to miscategorise a sloth as a race car. The two images differ by at most 0.0078 in each pixel. The first one is classified as a three-toed sloth with >99% confidence. The second one is classified as a race car with >99% probability.

Assurance: monitor and control system activity

ToMNet discovers two subspecies of agents and predicts their behaviour (from “Machine Theory of Mind”)
A problem with interruptions: human interventions (i.e. pressing the stop button) can change the task. In the figure, the interruption adds a transition (in red) to the Markov decision process that changes the original task (in black). See Orseau and Armstrong, 2016.

Looking ahead

We are building the foundations of a technology which will be used for many important applications in the future. It is worth bearing in mind that design decisions which are not safety-critical at the time of deployment can still have a large impact when the technology becomes widely used. Although convenient at the time, once these design choices have been irreversibly integrated into important systems the tradeoffs look different, and we may find they cause problems that are hard to fix without a complete redesign.

If you are interested in working with us on the research areas outlined in this post, we are hiring! Please check our open roles at https://deepmind.com/careers/ and note your interest in AI safety when you apply. We would love to hear from talented researchers and non-researchers alike.

Resources

For related reading, below is a collection of other articles, agendas, or taxonomies that have informed our thinking or present a helpful alternative view on problem framing for technical AI safety:

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
DeepMind Safety Research

DeepMind Safety Research

We research and build safe AI systems that learn how to solve problems and advance scientific discovery for all. Explore our work: deepmind.com