An EPIC way to evaluate reward functions

Figure 1: EPIC compares reward functions Rᵤ and Rᵥ by first mapping them to canonical representatives and then computing the Pearson distance between the canonical representatives on a coverage distribution 𝒟. Canonicalization removes the effect of potential shaping, and Pearson distance is invariant to positive affine transformations.
Figure 2: There exist a variety of techniques to specify a reward function. EPIC can help you decide which one works best for a given task.
  • Benchmarking in tasks where the reward function specification problem has already been solved, giving a “ground-truth” reward. We can then compare learned rewards directly to this “ground-truth” to gauge the performance of a new reward learning method.
  • Validation of reward functions prior to deployment. In particular, we often have a collection of reward functions specified by different people, methods or data sources. If multiple distinct approaches produce similar reward functions (i.e. a low EPIC distance to one another) then we can have more confidence that the resulting reward is correct. More generally, if two reward functions have low EPIC distance to one another, then information we gain about one (such as by using interpretability methods) also helps us understand the other.

Introducing EPIC

EPIC is designed to be invariant to two kinds of transformation:

  1. Potential shaping, which moves reward earlier or later in time.
  2. Positive affine transformations, which add a constant or rescale by a positive factor.
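
To make the two transformations concrete, they can be written as follows. The notation is an assumption introduced here for illustration: Φ is an arbitrary potential function over states, γ is the discount factor, λ is a positive scale and c is a constant offset.

```latex
% Potential shaping with potential \Phi and discount \gamma
R'(s, a, s') = R(s, a, s') + \gamma \Phi(s') - \Phi(s)

% Positive affine transformation with scale \lambda > 0 and offset c
R'(s, a, s') = \lambda R(s, a, s') + c, \qquad \lambda > 0
```

In the shaping case, the extra terms telescope along a trajectory, which is why shaping only moves reward earlier or later in time.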

EPIC computes the distance between two reward functions in two steps:

  1. First, we canonicalize the rewards being compared. Rewards that are the same up to potential shaping are mapped to the same canonical representative.
  2. Next, we compute the Pearson distance between the canonical representatives over a coverage distribution 𝒟 of transitions. Pearson distance is derived from the Pearson correlation, so it is invariant to positive affine transformations.
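
The two steps above can be sketched for small tabular rewards. This is a minimal illustration under stated assumptions, not the reference implementation: rewards are stored as arrays indexed by (state, action, next state), the state, action and coverage distributions are all uniform, and the function names `canonicalize`, `pearson_distance` and `epic_distance` are hypothetical. The canonicalization and the distance formula sqrt((1 − ρ) / 2) follow our reading of the definitions in the EPIC paper.

```python
import numpy as np


def canonicalize(reward, gamma, state_dist, action_dist):
    """Map a tabular reward R[s, a, s'] to its canonically shaped form.

    Assumed form: C(R)(s, a, s') = R(s, a, s')
        + E[gamma * R(s', A, S') - R(s, A, S') - gamma * R(S, A, S')],
    with S, S' drawn from state_dist and A from action_dist.
    """
    # mean_from[s] = E_{A, S'}[R(s, A, S')]
    mean_from = np.einsum("sat,a,t->s", reward, action_dist, state_dist)
    # mean_all = E_{S, A, S'}[R(S, A, S')]
    mean_all = state_dist @ mean_from
    return (
        reward
        + gamma * mean_from[None, None, :]  # depends on next state s'
        - mean_from[:, None, None]          # depends on current state s
        - gamma * mean_all
    )


def pearson_distance(x, y, weights):
    """Weighted Pearson distance sqrt((1 - rho) / 2), which lies in [0, 1].

    Undefined (division by zero) if either input is constant.
    """
    w = weights / weights.sum()
    mean_x, mean_y = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mean_x) * (y - mean_y))
    var_x = np.sum(w * (x - mean_x) ** 2)
    var_y = np.sum(w * (y - mean_y) ** 2)
    rho = np.clip(cov / np.sqrt(var_x * var_y), -1.0, 1.0)
    return np.sqrt((1.0 - rho) / 2.0)


def epic_distance(reward_a, reward_b, gamma, state_dist, action_dist, coverage):
    """Canonicalize both rewards, then take their Pearson distance on coverage."""
    canon_a = canonicalize(reward_a, gamma, state_dist, action_dist)
    canon_b = canonicalize(reward_b, gamma, state_dist, action_dist)
    return pearson_distance(canon_a, canon_b, coverage)


# Toy check with made-up random rewards (illustrative data only).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9
reward = rng.normal(size=(n_states, n_actions, n_states))
potential = rng.normal(size=n_states)

# Rescale, shift, then apply potential shaping: same reward up to both invariances.
shaped = (
    2.0 * reward
    + 5.0
    + gamma * potential[None, None, :]
    - potential[:, None, None]
)

state_dist = np.full(n_states, 1.0 / n_states)
action_dist = np.full(n_actions, 1.0 / n_actions)
coverage = np.full(reward.shape, 1.0 / reward.size)

d_same = epic_distance(reward, shaped, gamma, state_dist, action_dist, coverage)
d_opposite = epic_distance(reward, -reward, gamma, state_dist, action_dist, coverage)
print("shaped and rescaled copy:", d_same)   # numerically close to 0
print("negated reward:", d_opposite)          # numerically close to 1
```

The shaped and rescaled copy has a distance of approximately zero from the original, while the negated reward sits at the maximum distance. In continuous environments the expectations would have to be estimated from samples rather than computed exactly as in this tabular sketch.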

Why use EPIC?

Figure 3: Runtime needed to perform pairwise comparison of 5 reward functions in a simple continuous control task.
Figure 4: EPIC distance between rewards is similar across different distributions (colored bars), while baselines (NPEC and ERC) are highly sensitive to distribution. The coverage distribution consists of rollouts from: a policy that takes actions uniformly at random, an expert optimal policy and a mixed policy that randomly transitions between the other two.
Figure 5: The PointMaze environment: the blue agent must reach the green goal by navigating around the wall that is on the left at train time and on the right at test time.
Figure 6: EPIC distance (blue) predicts policy regret in the train (orange) and test (green) tasks across three different reward learning methods.


Figure 7: Evaluation by RL training reveals that the reward function was faulty only after the agent has destroyed the vase. EPIC can warn you that the reward function differs from others before you train an agent.




We research and build safe AI systems that learn how to solve problems and advance scientific discovery for all. Explore our work:

DeepMind Safety Research
