Towards Robust and Verified AI: Specification Testing, Robust Training, and Formal Verification

DeepMind Safety Research
9 min read · Mar 28, 2019


By Pushmeet Kohli, Krishnamurthy (Dj) Dvijotham, Jonathan Uesato, Sven Gowal, and the Robust & Verified Deep Learning group. This article is cross-posted from

Bugs and software have gone hand in hand since the beginning of computer programming. Over time, software developers have established a set of best practices for testing and debugging before deployment, but these practices are not suited for modern deep learning systems. Today, the prevailing practice in machine learning is to train a system on a training data set, and then test it on another set. While this reveals the average-case performance of models, it is also crucial to ensure robustness, or acceptably high performance even in the worst case. In this article, we describe three approaches for rigorously identifying and eliminating bugs in learned predictive models: adversarial testing, robust learning, and formal verification.

Machine learning systems are not robust by default. Even systems that outperform humans in a particular domain can fail at solving simple problems if subtle differences are introduced. For example, consider the problem of image perturbations: a neural network that can classify images better than a human can be easily fooled into believing that a sloth is a race car if a small amount of carefully calculated noise is added to the input image.

An adversarial input, overlaid on a typical image, can cause a classifier to miscategorise a sloth as a race car. The two images differ by at most 0.0078 in each pixel. The first one is classified as a three-toed sloth with >99% confidence. The second one is classified as a race car with >99% probability.
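A perturbation of this kind can be sketched in a few lines. Below is a minimal fast-gradient-sign-style attack on a toy linear classifier; the weights, the input, and the reuse of the 0.0078 per-pixel budget are illustrative stand-ins, not the actual model or sloth image from the figure:

```python
import numpy as np

# Toy "classifier": logits = W @ x. Everything here is a placeholder.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 784))
x = rng.uniform(size=784)               # a toy "image" with pixels in [0, 1]
true_label = int(np.argmax(W @ x))      # treat the clean prediction as ground truth

# For a linear model, the gradient of the true-class logit w.r.t. the input
# is simply that class's weight row; step *against* it to reduce confidence.
grad = W[true_label]
epsilon = 0.0078                        # same per-pixel budget as the sloth example
x_adv = np.clip(x - epsilon * np.sign(grad), 0.0, 1.0)

# The perturbation is imperceptibly small per pixel.
assert np.max(np.abs(x_adv - x)) <= epsilon + 1e-12
```

In a real network the gradient comes from backpropagation rather than a weight row, but the principle is the same: a tiny, carefully directed step in input space can move the model's prediction.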

This is not an entirely new problem. Computer programs have always had bugs. Over decades, software engineers have assembled an impressive toolkit of techniques, ranging from unit testing to formal verification. These methods work well on traditional software, but adapting them to rigorously test machine learning models like neural networks is extremely challenging due to the scale and lack of structure in these models, which may contain hundreds of millions of parameters. This necessitates novel approaches for ensuring that machine learning systems are robust at deployment.

From a programmer’s perspective, a bug is any behaviour that is inconsistent with the specification, i.e. the intended functionality, of a system. As part of our mission of solving intelligence, we conduct research into techniques for evaluating whether machine learning systems are consistent not only with the training and test sets, but also with a list of specifications describing desirable properties of a system. Such properties might include robustness to sufficiently small perturbations in inputs, safety constraints to avoid catastrophic failures, or producing predictions consistent with the laws of physics.

In this article, we discuss three important technical challenges for the machine learning community to take on, as we collectively work towards rigorous development and deployment of machine learning systems that are reliably consistent with desired specifications:

  1. Testing consistency with specifications efficiently. We explore efficient ways to test that machine learning systems are consistent with properties (such as invariance or robustness) desired by the designer and users of the system. One approach to uncover cases where the model might be inconsistent with the desired behaviour is to systematically search for worst-case outcomes during evaluation.
  2. Training machine learning models to be specification-consistent. Even with copious training data, standard machine learning algorithms can produce predictive models that make predictions inconsistent with desirable specifications like robustness or fairness — this requires us to reconsider training algorithms that produce models that not only fit training data well, but also are consistent with a list of specifications.
  3. Formally proving that machine learning models are specification-consistent. There is a need for algorithms that can verify that the model predictions are provably consistent with a specification of interest for all possible inputs. While the field of formal verification has studied such algorithms for several decades, these approaches do not easily scale to modern deep learning systems despite impressive progress.

Testing consistency with specifications

Robustness to adversarial examples is a relatively well-studied problem in deep learning. One major theme that has come out of this work is the importance of evaluating against strong attacks, and designing transparent models which can be efficiently analysed. Alongside other researchers from the community, we have found that many models appear robust when evaluated against weak adversaries. However, they show essentially 0% adversarial accuracy when evaluated against stronger adversaries (Athalye et al., 2018, Uesato et al., 2018, Carlini and Wagner, 2017).

While most work has focused on rare failures in the context of supervised learning (largely image classification), there is a need to extend these ideas to other settings. In recent work on adversarial approaches for uncovering catastrophic failures, we apply these ideas towards testing reinforcement learning agents intended for use in safety-critical settings (Ruderman et al., 2018, Uesato et al., 2018b). One challenge in developing autonomous systems is that because a single mistake may have large consequences, even very small failure probabilities are unacceptable.

Our objective is to design an “adversary” that allows us to detect such failures in advance (e.g., in a controlled environment). If the adversary can efficiently identify the worst-case input for a given model, this allows us to catch rare failure cases before deploying the model. As with image classifiers, evaluating against a weak adversary provides a false sense of security during deployment. This is similar to the software practice of red-teaming, though it extends beyond failures caused by malicious adversaries to include failures that arise naturally, for example from a lack of generalisation.

We developed two complementary approaches for adversarial testing of RL agents. In the first, we use derivative-free optimisation to directly minimise the expected reward of an agent. In the second, we learn an adversarial value function which predicts from experience which situations are most likely to cause failures for the agent. We then use this learned function for optimisation to focus the evaluation on the most problematic inputs. These approaches form only a small part of a rich, growing space of potential algorithms, and we are excited about future development in rigorous evaluation of agents.
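The first approach can be illustrated with a minimal sketch: sample candidate environment settings without using any gradients, and keep whichever setting produces the lowest reward. Here `run_episode` and its parameters are hypothetical stand-ins for a real agent/environment rollout, not our actual testing harness:

```python
import random

def run_episode(agent_skill, start_distance):
    # Toy environment: reward decays with distance and improves with skill.
    return max(0.0, agent_skill - 0.1 * start_distance)

def adversarial_search(n_iters=200, seed=0):
    """Derivative-free search for the environment setting that minimises reward."""
    rng = random.Random(seed)
    worst_reward, worst_setting = float("inf"), None
    for _ in range(n_iters):
        # No gradients: just sample a candidate setting and evaluate it.
        start_distance = rng.uniform(0.0, 100.0)
        reward = run_episode(agent_skill=5.0, start_distance=start_distance)
        if reward < worst_reward:
            worst_reward, worst_setting = reward, start_distance
    return worst_reward, worst_setting

reward, distance = adversarial_search()
```

Random sampling alone already finds a zero-reward setting in this toy; in practice, more sophisticated derivative-free optimisers (e.g., evolutionary methods) make the search far more sample-efficient.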

Already, both approaches result in large improvements over random testing. Using our method, failures that would have taken days to uncover, or even gone undetected entirely, can be detected in minutes (Uesato et al., 2018b). We also found that adversarial testing may uncover qualitatively different behaviour in our agents from what might be expected from evaluation on a random test set. In particular, using adversarial environment construction we found that agents performing a 3D navigation task, which match human-level performance on average, could still completely fail to find the goal in surprisingly simple mazes (Ruderman et al., 2018). Our work also highlights that we need to design systems that are secure against natural failures, not only against adversaries.

With random sampling, we almost never observe maps with high failure probabilities, but adversarial testing reveals such maps do exist. These maps retain high failure probabilities even after many walls are removed, yielding simpler maps than the original.

Training specification-consistent models

Adversarial testing aims to find a counterexample that violates a specification. When the adversary fails to find one, it can therefore overestimate how consistent a model truly is with that specification. Mathematically, a specification is some relationship that has to hold between the inputs and outputs of a neural network. This can take the form of upper and lower bounds on certain key input and output parameters.

Motivated by this observation, several researchers (Raghunathan et al., 2018; Wong et al., 2018; Mirman et al., 2018; Wang et al., 2018), including our team at DeepMind (Dvijotham et al., 2018; Gowal et al., 2018), have worked on algorithms that are agnostic to the adversarial testing procedure (used to assess consistency with the specification). This can be understood geometrically: given a bounded set of inputs, an initial bounding box is propagated through each layer of the network (e.g., using interval bound propagation; Ehlers 2017, Katz et al. 2017, Mirman et al., 2018), bounding the space of outputs and hence the worst violation of a specification. If this bound is differentiable with respect to the network parameters and can be computed quickly, it can be used during training.
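The bound propagation itself is simple. Below is a minimal sketch of interval bound propagation through one affine layer followed by a ReLU; the weights, input, and perturbation radius are illustrative placeholders:

```python
import numpy as np

def affine_bounds(lo, hi, W, b):
    # For y = W x + b, split W into positive and negative parts so that
    # each interval endpoint propagates to the correct output endpoint.
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def relu_bounds(lo, hi):
    # ReLU is monotone, so it maps interval endpoints directly.
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), rng.normal(size=3)
x = rng.uniform(size=4)
eps = 0.1

lo, hi = affine_bounds(x - eps, x + eps, W, b)
lo, hi = relu_bounds(lo, hi)

# Soundness check: the output for the nominal input lies inside the box.
y = np.maximum(W @ x + b, 0.0)
assert np.all(lo <= y + 1e-9) and np.all(y <= hi + 1e-9)
```

Each additional layer applies the same two steps, so the cost of computing the bound is comparable to a couple of extra forward passes, which is what makes it cheap enough to use inside a training loop.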

We show that interval bound propagation is fast, efficient, and — contrary to prior belief — can achieve strong results (Gowal et al., 2018). In particular, we demonstrate that it can reduce the provable error rate (i.e., the maximal error rate achievable by any adversary) below the previous state of the art in image classification on both the MNIST and CIFAR-10 datasets.
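One way such a bound can enter training, sketched here with illustrative numbers rather than our actual implementation, is to form “worst-case” logits from the interval bounds (the lower bound for the true class, the upper bound for all others) and feed them through an ordinary cross-entropy loss:

```python
import numpy as np

def worst_case_logits(logit_lo, logit_hi, true_label):
    # Pessimistic logits: the true class takes its lower bound,
    # every other class takes its upper bound.
    z = logit_hi.copy()
    z[true_label] = logit_lo[true_label]
    return z

def cross_entropy(logits, label):
    shifted = logits - np.max(logits)          # numerically stable softmax
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

# Toy interval bounds on three logits, with class 0 as the true class.
logit_lo = np.array([1.0, -2.0, -1.5])
logit_hi = np.array([3.0,  0.5,  0.0])

robust_loss = cross_entropy(worst_case_logits(logit_lo, logit_hi, 0), 0)
nominal_loss = cross_entropy((logit_lo + logit_hi) / 2.0, 0)
assert robust_loss >= nominal_loss  # the bound penalises potential violations
```

Minimising the robust loss pushes the network towards parameters for which no point inside the input box can flip the prediction, which is exactly the specification-consistency the training aims for.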

Going forward, the next frontier will be to learn the right geometric abstractions to compute tighter overapproximations of the space of outputs. We also want to train networks to be consistent with more complex specifications capturing desirable behaviour, such as the invariances mentioned above and consistency with physical laws.

Formal verification

Rigorous testing and training can go a long way towards building robust machine learning systems. However, no amount of testing can formally guarantee that a system will behave as we want. In large-scale models, exhaustively evaluating all possible inputs (for example, every infinitesimal perturbation of an image) is intractable due to the astronomical number of input perturbations. However, as in the case of training, we can find more efficient approaches by computing geometric bounds on the set of outputs. Formal verification is a subject of ongoing research at DeepMind.

The machine learning community has developed several interesting ideas on how to compute precise geometric bounds on the space of outputs of the network (Katz et al. 2017, Weng et al., 2018; Singh et al., 2018). Our approach (Dvijotham et al., 2018), based on optimisation and duality, consists of formulating the verification problem as an optimisation problem that tries to find the largest violation of the property being verified. Using ideas from duality in optimisation, we make this problem computationally tractable. This results in additional constraints, so-called cutting planes, that refine the bounding boxes computed by interval bound propagation. The approach is sound but incomplete: there may be cases where the property of interest holds, but the bound computed by this algorithm is not tight enough to prove it. However, once we obtain a bound, it formally guarantees that there can be no violation of the property. The figure below graphically illustrates the approach.
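The sound-but-incomplete character of such verification can be illustrated with a toy linear model (a deliberately simplified stand-in; it omits the duality-based refinement the real approach uses to tighten the bound). We check the specification “the true class stays on top for every input in the box” by upper-bounding the worst-case logit margin:

```python
import numpy as np

def verify_margin(W, b, x, true_label, eps):
    """Return True only if no input in the box [x-eps, x+eps] can flip the label."""
    for j in range(W.shape[0]):
        if j == true_label:
            continue
        # Specification violation: max over the box of (logit_j - logit_true).
        # For a linear model c @ x', this maximum is reached at the corner of
        # the box matching the signs of c.
        c = W[j] - W[true_label]
        worst = c @ x + eps * np.sum(np.abs(c)) + (b[j] - b[true_label])
        if worst >= 0:          # cannot rule out a misclassification
            return False
    return True

W = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.zeros(2)
x = np.array([1.0, 0.1])

# A small box is provably safe; a large box cannot be certified.
assert verify_margin(W, b, x, true_label=0, eps=0.1)
assert not verify_margin(W, b, x, true_label=0, eps=2.0)
```

When `verify_margin` returns True, the guarantee is unconditional; when it returns False, the model may still be robust — the bound was simply too loose to prove it, which is precisely the incompleteness described above.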

This approach enables us to extend the applicability of verification algorithms to more general networks (activation functions, architectures), more sophisticated deep learning models (generative models, neural processes, etc.), and specifications beyond adversarial robustness (Qin et al., 2018).


Deployment of machine learning in high-stakes situations presents unique challenges, and requires the development of evaluation techniques that reliably detect unlikely failure modes. More broadly, we believe that learning consistency with specifications can provide large efficiency improvements over approaches where specifications only arise implicitly from training data. We are excited about ongoing research into adversarial evaluation, learning robust models, and verification of formal specifications.

Much more work is needed to build automated tools for ensuring that AI systems in the real world will do the “right thing”. In particular, we are excited about progress in the following directions:

  1. Learning for adversarial evaluation and verification: As AI systems scale and become more complex, it will become increasingly difficult to design adversarial evaluation and verification algorithms that are well-adapted to the AI model. If we can leverage the power of AI to facilitate evaluation and verification, this process can be bootstrapped to scale.
  2. Development of publicly-available tools for adversarial evaluation and verification: It is important to provide AI engineers and practitioners with easy-to-use tools that shed light on the possible failure modes of the AI system before it leads to widespread negative impact. This would require some degree of standardisation of adversarial evaluation and verification algorithms.
  3. Broadening the scope of adversarial examples: To date, most work on adversarial examples has focused on model invariances to small perturbations, typically of images. This has provided an excellent testbed for developing approaches to adversarial evaluation, robust learning, and verification. We have begun to explore alternate specifications for properties directly relevant in the real world, and are excited by future research in this direction.
  4. Learning specifications: Specifications that capture “correct” behaviour in AI systems are often difficult to state precisely. As we build increasingly intelligent agents capable of exhibiting complex behaviours and acting in unstructured environments, we will need systems that can use partial human specifications and learn further specifications from evaluative feedback.

DeepMind is dedicated to positive social impact through responsible development and deployment of machine learning systems. To make sure that the contributions of developers are reliably positive, we need to tackle many technical challenges. We are committed to taking part in this effort and are excited to work with the community on solving these challenges.

This post describes the work of the Robust and Verified Deep Learning group (Pushmeet Kohli, Krishnamurthy (Dj) Dvijotham, Jonathan Uesato, Sven Gowal, Chongli Qin, Robert Stanforth, Po-Sen Huang) performed in collaboration with various contributors across DeepMind including Avraham Ruderman, Alhussein Fawzi, Ananya Kumar, Brendan O’Donoghue, Bristy Sikder, Chenglong Wang, Csaba Szepesvari, Hubert Soyer, Relja Arandjelovic, Richard Everett, Rudy Bunel, Timothy Mann, Grzegorz Swirszcz, and Tom Erez.

Thanks to Vishal Maini, Aleš Flidr, Damien Boudot, and Jan Leike for their contributions to the post.


