An EPIC way to evaluate reward functions

Figure 1: EPIC compares reward functions Rᵤ and Rᵥ by first mapping them to canonical representatives and then computing the Pearson distance between those representatives over a coverage distribution 𝒟. Canonicalization removes the effect of potential shaping, and Pearson distance is invariant to positive affine transformations.
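The whole pipeline in Figure 1 fits in a short sketch. Here is a minimal tabular illustration in NumPy; it is not the paper's reference implementation, and the function names and distributions below are choices made for this post, but the canonicalization follows the definition from the EPIC paper.

```python
import numpy as np

def canonicalize(R, d_s, d_a, gamma):
    """Map a tabular reward R[s, a, s'] to its canonical representative.

    Following the EPIC paper, this removes any potential shaping by adding
        E[gamma * R(s', A, S'') - R(s, A, S'') - gamma * R(S, A, S'')],
    where S, S'' ~ d_s and A ~ d_a are drawn independently.
    """
    # mean_out[x] = E_{A ~ d_a, S'' ~ d_s}[ R(x, A, S'') ]
    mean_out = np.einsum("xay,a,y->x", R, d_a, d_s)
    const = mean_out @ d_s  # E_{S ~ d_s}[ mean_out[S] ]
    return (R
            + gamma * mean_out[None, None, :]  # bonus for the state you land in
            - mean_out[:, None, None]          # penalty for the state you leave
            - gamma * const)                   # constant offset

def epic_distance(R1, R2, d_s, d_a, d_coverage, gamma):
    """Pearson distance between canonicalized rewards under a coverage distribution.

    d_coverage[s, a, s'] is a probability distribution weighting each transition.
    """
    C1 = canonicalize(R1, d_s, d_a, gamma).ravel()
    C2 = canonicalize(R2, d_s, d_a, gamma).ravel()
    w = d_coverage.ravel()
    m1, m2 = C1 @ w, C2 @ w                    # weighted means
    cov = ((C1 - m1) * (C2 - m2)) @ w          # weighted covariance
    rho = cov / np.sqrt((((C1 - m1) ** 2) @ w) * (((C2 - m2) ** 2) @ w))
    return np.sqrt((1.0 - rho) / 2.0)          # Pearson distance, in [0, 1]
```

As a sanity check, a reward that differs from another only by potential shaping and a positive rescaling should be at distance zero, exactly the two invariances named in the caption:

```python
rng = np.random.default_rng(0)
n_s, n_a, gamma = 8, 3, 0.9
R = rng.normal(size=(n_s, n_a, n_s))
phi = rng.normal(size=n_s)  # arbitrary potential function
R_shaped = 2.5 * (R + gamma * phi[None, None, :] - phi[:, None, None])
d_s, d_a = np.full(n_s, 1 / n_s), np.full(n_a, 1 / n_a)
d_cov = np.einsum("x,a,y->xay", d_s, d_a, d_s)  # uniform coverage distribution
print(epic_distance(R, R_shaped, d_s, d_a, d_cov, gamma))  # ≈ 0
```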
Figure 2: A variety of techniques exist for specifying a reward function. EPIC can help you decide which one works best for a given task.

Introducing EPIC

Why use EPIC?

Figure 3: Runtime needed to perform pairwise comparison of 5 reward functions in a simple continuous control task.
Figure 4: EPIC distances between rewards are similar across different coverage distributions (colored bars), while the baselines (NPEC and ERC) are highly sensitive to the distribution. The coverage distributions consist of rollouts from: a policy that takes actions uniformly at random, an optimal expert policy, and a mixed policy that randomly switches between the two.
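This robustness is easy to see in the tabular sketch above: rewards that differ only by potential shaping and positive rescaling have identical canonical representatives up to scale, so their Pearson distance is zero under any full-support coverage distribution. The Dirichlet draw below is just a hypothetical stand-in for an expert policy's visitation distribution:

```python
# EPIC distance between the shaping-equivalent rewards from the sketch above,
# under three coverage distributions: uniform, "expert"-skewed, and a 50/50 mix.
d_expert = rng.dirichlet(np.ones(n_s * n_a * n_s)).reshape(n_s, n_a, n_s)
d_mixed = 0.5 * d_cov + 0.5 * d_expert
for name, d in [("uniform", d_cov), ("expert", d_expert), ("mixed", d_mixed)]:
    print(name, epic_distance(R, R_shaped, d_s, d_a, d, gamma))  # all ≈ 0
```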
Figure 5: The PointMaze environment: the blue agent must reach the green goal by navigating around a wall, which is on the left at train time and on the right at test time.
Figure 6: EPIC distance (blue) predicts policy regret in the train (orange) and test (green) tasks across three different reward learning methods.
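Regret here means the return an agent forgoes by optimizing the learned reward instead of the true one, which is the quantity Figure 6 plots against EPIC distance. A minimal tabular sketch, continuing the NumPy setup above and assuming (hypothetically) known transition dynamics P[s, a, s'] and an initial-state distribution d0:

```python
def value_iteration(R, P, gamma, n_iters=1000):
    """Q-values for a tabular MDP with reward R[s, a, s'] and dynamics P[s, a, s']."""
    Q = np.zeros(R.shape[:2])
    for _ in range(n_iters):
        V = Q.max(axis=1)
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
    return Q

def regret(R_true, R_learned, P, d0, gamma):
    """Return lost by deploying the R_learned-optimal policy, judged under R_true."""
    pi_hat = value_iteration(R_learned, P, gamma).argmax(axis=1)  # greedy learned policy
    V_star = value_iteration(R_true, P, gamma).max(axis=1)        # true optimal values
    idx = np.arange(len(d0))
    P_pi = P[idx, pi_hat]                            # dynamics under pi_hat, shape (S, S)
    r_pi = (P_pi * R_true[idx, pi_hat]).sum(axis=1)  # expected true reward per state
    V_hat = np.linalg.solve(np.eye(len(d0)) - gamma * P_pi, r_pi)  # policy evaluation
    return d0 @ (V_star - V_hat)  # non-negative (up to value-iteration convergence error)
```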


Figure 7: Evaluation by RL training only concludes that the reward function is faulty after the agent has destroyed the vase. EPIC can warn you that a reward function differs from others before you ever train an agent.


