An EPIC way to evaluate reward functions

Figure 1: EPIC compares reward functions Rᵤ and Rᵥ by first mapping them to canonical representatives and then computing the Pearson distance between those representatives on a coverage distribution 𝒟. Canonicalization removes the effect of potential shaping, and Pearson distance is invariant to positive affine transformations.
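To make the procedure in Figure 1 concrete, here is a minimal sketch of EPIC for tabular rewards. The function names (canonicalize, pearson_distance, epic_distance) and the tabular R[s, a, s'] representation are my own illustration, not the authors' released implementation, which handles general reward functions by sampling the expectations rather than computing them exactly.

```python
import numpy as np

def canonicalize(R, gamma, s_dist, a_dist):
    """Canonically shape a tabular reward R[s, a, s'] (shape |S| x |A| x |S|),
    removing any potential-based shaping term. s_dist and a_dist are the
    probability vectors of the state and action distributions D_S and D_A."""
    # E_{A ~ D_A, S' ~ D_S}[R(x, A, S')] for every state x
    mean_from = np.einsum("xas,a,s->x", R, a_dist, s_dist)
    # E_{S, A, S'}[R(S, A, S')]
    mean_all = mean_from @ s_dist
    # C(R)(s, a, s') = R(s, a, s') + gamma E[R(s', A, S')]
    #                  - E[R(s, A, S')] - gamma E[R(S, A, S')]
    return (R
            + gamma * mean_from[None, None, :]
            - mean_from[:, None, None]
            - gamma * mean_all)

def pearson_distance(x, y):
    """D_rho(X, Y) = sqrt((1 - rho(X, Y)) / 2), which lies in [0, 1]."""
    rho = np.clip(np.corrcoef(x, y)[0, 1], -1.0, 1.0)
    return np.sqrt((1.0 - rho) / 2.0)

def epic_distance(R_u, R_v, gamma, s_dist, a_dist, coverage):
    """EPIC distance between two tabular rewards: canonicalize both, then
    take the Pearson distance of their values on transitions drawn from the
    coverage distribution (an array of (s, a, s') index triples)."""
    C_u = canonicalize(R_u, gamma, s_dist, a_dist)
    C_v = canonicalize(R_v, gamma, s_dist, a_dist)
    s, a, sp = coverage.T
    return pearson_distance(C_u[s, a, sp], C_v[s, a, sp])
```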
Figure 2: There exist a variety of techniques to specify a reward function. EPIC can help you decide which one works best for a given task.

Introducing EPIC

Why use EPIC?

Figure 3: Runtime needed to perform pairwise comparison of 5 reward functions in a simple continuous control task.
Figure 4: EPIC distance between rewards is similar across different distributions (colored bars), while baselines (NPEC and ERC) are highly sensitive to the distribution. The coverage distributions consist of rollouts from three policies: one that takes actions uniformly at random, an expert (optimal) policy, and a mixed policy that randomly switches between the other two.
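To see the invariances from Figures 1 and 4 in action, the snippet below compares a reward to a shaped copy of itself under two different coverage distributions, reusing epic_distance from the sketch above. The random tabular setup is purely illustrative, and for simplicity it samples (s, a, s') triples i.i.d. from two state distributions rather than collecting policy rollouts as in the actual experiment; the EPIC distance comes out approximately zero in both cases.

```python
rng = np.random.default_rng(0)
n_states, n_actions, n_samples, gamma = 10, 4, 50_000, 0.99

# R_v differs from R_u only by potential-based shaping, so an agent trained
# on either reward behaves identically and their EPIC distance should be ~0.
R_u = rng.normal(size=(n_states, n_actions, n_states))
phi = rng.normal(size=n_states)  # arbitrary potential function
R_v = R_u + gamma * phi[None, None, :] - phi[:, None, None]

s_dist = np.full(n_states, 1 / n_states)    # uniform D_S
a_dist = np.full(n_actions, 1 / n_actions)  # uniform D_A

def sample_coverage(state_probs, n):
    """Draw (s, a, s') index triples i.i.d. from a given state distribution."""
    s = rng.choice(n_states, size=n, p=state_probs)
    a = rng.integers(n_actions, size=n)
    sp = rng.choice(n_states, size=n, p=state_probs)
    return np.stack([s, a, sp], axis=1)

skewed = np.linspace(1, 2, n_states)
skewed /= skewed.sum()

for name, probs in [("uniform coverage", s_dist), ("skewed coverage", skewed)]:
    cov = sample_coverage(probs, n_samples)
    d = epic_distance(R_u, R_v, gamma, s_dist, a_dist, cov)
    print(f"{name}: EPIC distance = {d:.4f}")  # ~0.0 in both cases
```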
Figure 5: The PointMaze environment: the blue agent must reach the green goal by navigating around a wall, which is on the left at train time and on the right at test time.
Figure 6: EPIC distance (blue) predicts policy regret in the train (orange) and test (green) tasks across three different reward learning methods.

Conclusions

Figure 7: Evaluation by RL training concludes the reward function is faulty only after the agent has destroyed the vase. EPIC can warn you that the reward function differs from others before you train an agent.
