Towards Robust and Verified AI: Specification Testing, Robust Training, and Formal Verification

An adversarial input, overlaid on a typical image, can cause a classifier to miscategorise a sloth as a race car. The two images differ by at most 0.0078 in each pixel. The first one is classified as a three-toed sloth with >99% confidence. The second one is classified as a race car with >99% probability.
  1. Testing consistency with specifications efficiently. We explore efficient ways to test that machine learning systems are consistent with properties (such as invariance or robustness) desired by the designer and users of the system. One approach to uncover cases where the model might be inconsistent with the desired behaviour is to systematically search for worst-case outcomes during evaluation.
  2. Training machine learning models to be specification-consistent. Even with copious training data, standard machine learning algorithms can produce predictive models that make predictions inconsistent with desirable specifications like robustness or fairness — this requires us to reconsider training algorithms that produce models that not only fit training data well, but also are consistent with a list of specifications.
  3. Formally proving that machine learning models are specification-consistent. There is a need for algorithms that can verify that the model predictions are provably consistent with a specification of interest for all possible inputs. While the field of formal verification has studied such algorithms for several decades, these approaches do not easily scale to modern deep learning systems despite impressive progress.

Testing consistency with specifications

Robustness to adversarial examples is a relatively well-studied problem in deep learning. One major theme that has come out of this work is the importance of evaluating against strong attacks, and designing transparent models which can be efficiently analysed. Alongside other researchers from the community, we have found that many models appear robust when evaluated against weak adversaries. However, they show essentially 0% adversarial accuracy when evaluated against stronger adversaries (Athalye et al., 2018, Uesato et al., 2018, Carlini and Wagner, 2017).

With random sampling, we almost never observe maps with high failure probabilities, but adversarial testing reveals such maps do exist. These maps retain high failure probabilities even after many walls are removed, yielding simpler maps than the original.

Training specification-consistent models

Adversarial testing aims to find a counter example that violates specifications. As such, it often leads to overestimating the consistency of models with respect to these specifications. Mathematically, a specification is some relationship that has to hold between the inputs and outputs of a neural network. This can take the form of upper and lower bounds on certain key input and output parameters.

Formal verification

Rigorous testing and training can go a long way towards building robust machine learning systems. However, no amount of testing can formally guarantee that a system will behave as we want. In large-scale models, enumerating all possible outputs for a given set of inputs (for example, infinitesimal perturbations to an image) is intractable due to the astronomical number of choices for the input perturbation. However, as in the case of training, we can find more efficient approaches by setting geometric bounds on the set of outputs. Formal verification is a subject of ongoing research at DeepMind.


Deployment of machine learning in high-stakes situations presents unique challenges, and requires the development of evaluation techniques that reliably detect unlikely failure modes. More broadly, we believe that learning consistency with specifications can provide large efficiency improvements over approaches where specifications only arise implicitly from training data. We are excited about ongoing research into adversarial evaluation, learning robust models, and verification of formal specifications.

  1. Learning for adversarial evaluation and verification: As AI systems scale and become more complex, it will become increasingly difficult to design adversarial evaluation and verification algorithms that are well-adapted to the AI model. If we can leverage the power of AI to facilitate evaluation and verification, this process can be bootstrapped to scale.
  2. Development of publicly-available tools for adversarial evaluation and verification: It is important to provide AI engineers and practitioners with easy-to-use tools that shed light on the possible failure modes of the AI system before it leads to widespread negative impact. This would require some degree of standardisation of adversarial evaluation and verification algorithms.
  3. Broadening the scope of adversarial examples: To date, most work on adversarial examples has focused on model invariances to small perturbations, typically of images. This has provided an excellent testbed for developing approaches to adversarial evaluation, robust learning, and verification. We have begun to explore alternate specifications for properties directly relevant in the real world, and are excited by future research in this direction.
  4. Learning specifications: Specifications that capture “correct” behavior in AI systems are often difficult to precisely state. Building systems that can use partial human specifications and learn further specifications from evaluative feedback would be required as we build increasingly intelligent agents capable of exhibiting complex behaviors and acting in unstructured environments.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
DeepMind Safety Research

DeepMind Safety Research

We research and build safe AI systems that learn how to solve problems and advance scientific discovery for all. Explore our work: