OpenAI Retro Contest – Everything I know about JERK agent

The first approach I started to implement, test, and modify in the OpenAI Retro Contest was JERK. The JERK agent is one of the baseline scripts for this contest.

You can find it here: https://github.com/openai/retro-baselines

I think it is the easiest algorithm to understand for programmers who don't have any machine learning experience.

The pseudo-code for the JERK algorithm looks like this:
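Roughly (this is my own paraphrase based on the baseline script shown later in this post, not the exact listing from the paper):

while timesteps remain:
    if solutions exist and random() < EXPLOIT_BIAS + steps_so_far / TOTAL_TIMESTEPS:
        replay the best action sequence recorded so far
    else:
        run a new episode:
            move right for 100 steps, jumping from time to time
            if no reward was gained, move left for 70 steps
        store the episode's action sequence together with its reward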

Why is this the easiest approach? Because this algorithm is based on rewards, but not the same kind of rewards as Rainbow or PPO2. The JERK algorithm has its moves scripted in advance; it doesn't learn the same way as the other two. Sonic runs forward and jumps, and if he scores points or progresses further through the level, he gets rewarded. The agent learns from those rewards and tries not to repeat its mistakes, because a mistake costs it "reward points". It's a bit like us humans: we are motivated to do something if there is a possible reward at the end.

The name is just an acronym for “Just Enough Retained Knowledge”.

What's the problem with JERK? It's neither good nor bad.

In one environment it can score higher than Rainbow or PPO2, but in another it fails completely. So on some Sonic levels this approach works very well, while on others it fails badly.

And if we look at the statistics from running this algorithm on the test levels, that is exactly what we see: it scores higher on some levels and fails completely on others.

But we can modify this script, and in my experience modifying it lets us score higher than the standard baseline algorithm.

That's all for the theory. Let's move on to the practice.

Environment and Scenario File

Before we start there are some things which you need to know.

The baseline algorithm connects to the 'tmp/sock' remote environment. To run it locally we need to change that:
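In jerk_agent.py that means swapping the remote environment for a locally created one (the game, state, and scenario values below are simply the ones I use for local testing):

# baseline: connect to the contest's remote environment over a socket
env = grc.RemoteEnv('tmp/sock')

# local version: create the environment directly with gym-retro
from retro import make
env = make(game='SonicTheHedgehog-Genesis', state='GreenHillZone.Act1', scenario='scenario.json')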

We added a scenario.json file to our environment. The problem is that we don't know exactly what kind of scenario file runs on the OpenAI test server. Based on the Gotta Learn Fast report:

https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/retro-contest/gotta_learn_fast_report.pdf

Section 3.8 (Rewards) tells us that the contest reward uses two components: the horizontal offset (x offset) and a completion bonus (level_end_bonus). Beyond that we can only guess.

So this scenario.json file only works in our local environment and is made purely for testing purposes.

You can find it in the retro-contest folder: ..\retro-contest\gym-retro\data\SonicTheHedgehog-Genesis

And it looks like this:

{
  "done": {
    "variables": {
      "lives": {
        "op": "zero"
      }
    }
  },
  "reward": {
    "variables": {
      "score": {
        "reward": 10.0
      }
    }
  }
}

Based on this guide: https://github.com/openai/retro#scenario-information-scenariojson

I would add the variables “score”, “x”, “screen_x”, and “level_end_bonus” to our scenario.json file with adequate rewards.

What kind of rewards can we apply here? All the available variable names are listed in this file:

https://github.com/openai/retro/blob/master/data/SonicTheHedgehog-Genesis/data.json
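For example, an extended scenario.json could look like this (the variable names come from the data.json linked above; the reward weights are only my guesses for local testing and certainly not the contest's actual reward function):

{
  "done": {
    "variables": {
      "lives": {
        "op": "zero"
      }
    }
  },
  "reward": {
    "variables": {
      "score": {
        "reward": 10.0
      },
      "x": {
        "reward": 1.0
      },
      "level_end_bonus": {
        "reward": 100.0
      }
    }
  }
}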

To get the scenario file running, we need to copy it into our script folder.

As I said before, we can't upload the scenario file to the Docker server, but adding some extra variables will help us improve our JERK agent locally.

After changing the environment, don't forget to add "from retro import make" to the Python script.

And to actually see the game while it runs, we need to call env.render() somewhere inside the step loop.

This is the modified jerk script to get running on local evaluation:

#!/usr/bin/env python

"""
A scripted agent called "Just Enough Retained Knowledge".
"""

import random

import gym
import numpy as np

import gym_remote.client as grc
import gym_remote.exceptions as gre

from retro import make

EXPLOIT_BIAS = 0.25         # base probability of replaying the best run instead of exploring
TOTAL_TIMESTEPS = int(1e6)  # timestep budget; the exploit probability grows as it is used up

def main():
    """Run JERK on the attached environment."""
    env = make(game='SonicTheHedgehog-Genesis', state='GreenHillZone.Act1', scenario='scenario.json')
    env = TrackedEnv(env)
    new_ep = True
    solutions = []
    while True:
        if new_ep:
            if (solutions and
                    random.random() < EXPLOIT_BIAS + env.total_steps_ever / TOTAL_TIMESTEPS):
                solutions = sorted(solutions, key=lambda x: np.mean(x[0]))
                best_pair = solutions[-1]
                new_rew = exploit(env, best_pair[1])
                best_pair[0].append(new_rew)
                print('replayed best with reward %f' % new_rew)
                continue
            else:
                env.reset()
                new_ep = False
        rew, new_ep = move(env, 100)
        if not new_ep and rew <= 0:
            print('backtracking due to negative reward: %f' % rew)
            _, new_ep = move(env, 70, left=True)
        if new_ep:
            solutions.append(([max(env.reward_history)], env.best_sequence()))

def move(env, num_steps, left=False, jump_prob=1.0 / 10.0, jump_repeat=4):
    """
    Move right or left for a certain number of steps,
    jumping periodically.
    """
    total_rew = 0.0
    done = False
    steps_taken = 0
    jumping_steps_left = 0
    while not done and steps_taken < num_steps:
        env.render()  # render every step so the whole run is visible in the local window
        action = np.zeros((12,), dtype=np.bool)
        action[6] = left
        action[7] = not left
        if jumping_steps_left > 0:
            action[0] = True
            jumping_steps_left -= 1
        else:
            if random.random() < jump_prob:
                jumping_steps_left = jump_repeat - 1
                action[0] = True
        _, rew, done, _ = env.step(action)
        total_rew += rew
        steps_taken += 1
        if done:
            break
    return total_rew, done

def exploit(env, sequence):
    """
    Replay an action sequence; pad with NOPs if needed.

    Returns the final cumulative reward.
    """
    env.reset()
    done = False
    idx = 0
    while not done:
        if idx >= len(sequence):
            _, _, done, _ = env.step(np.zeros((12,), dtype='bool'))
        else:
            _, _, done, _ = env.step(sequence[idx])
        idx += 1
    return env.total_reward

class TrackedEnv(gym.Wrapper):
    """
    An environment that tracks the current trajectory and
    the total number of timesteps ever taken.
    """
    def __init__(self, env):
        super(TrackedEnv, self).__init__(env)
        self.action_history = []
        self.reward_history = []
        self.total_reward = 0
        self.total_steps_ever = 0

    def best_sequence(self):
        """
        Get the prefix of the trajectory with the best
        cumulative reward.
        """
        max_cumulative = max(self.reward_history)
        for i, rew in enumerate(self.reward_history):
            if rew == max_cumulative:
                return self.action_history[:i+1]
        raise RuntimeError('unreachable')

    # pylint: disable=E0202
    def reset(self, **kwargs):
        self.action_history = []
        self.reward_history = []
        self.total_reward = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        self.total_steps_ever += 1
        self.action_history.append(action.copy())
        obs, rew, done, info = self.env.step(action)
        self.total_reward += rew
        self.reward_history.append(self.total_reward)
        return obs, rew, done, info

if __name__ == '__main__':
    try:
        main()
    except gre.GymRemoteError as exc:
        print('exception', exc)

I can run it from bash with python jerk_agent.py and watch the game play out on screen.

Scripting

Which parameters of the code are most important? Where should we focus our attention to get better results?

In my opinion it's hard to say, because every parameter affects the others. To understand the JERK agent, I started with the main() and move() functions.

At the beginning I modified this part of the code:

rew, new_ep = move(env, 100)
if not new_ep and rew <= 0:
    print('backtracking due to negative reward: %f' % rew)
    _, new_ep = move(env, 70, left=True)
if new_ep:

The 100 in the first move() call controls how far Sonic runs forward, and the 70 in the second call controls how far he backtracks. So our Sonic runs 100 steps forward, but if he reaches an obstacle he can't pass (and therefore gains no reward), he goes 70 steps backward. These are only two numbers, but they have a very large impact on the final score.
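For example, a hypothetical tweak could shorten both runs (the numbers below are arbitrary and only worth anything after trial and error on a concrete level):

rew, new_ep = move(env, 80)
if not new_ep and rew <= 0:
    print('backtracking due to negative reward: %f' % rew)
    _, new_ep = move(env, 50, left=True)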

The other important variable is:

EXPLOIT_BIAS = 0.25

This variable influences how likely our agent is to exploit an already successful run instead of exploring a new one.

In our main function we have this statement:

if new_ep:
    if (solutions and
            random.random() < EXPLOIT_BIAS + env.total_steps_ever / TOTAL_TIMESTEPS):
        solutions = sorted(solutions, key=lambda x: np.mean(x[0]))
        best_pair = solutions[-1]
        new_rew = exploit(env, best_pair[1])
        best_pair[0].append(new_rew)
        print('replayed best with reward %f' % new_rew)
        continue
    else:
        env.reset()
        new_ep = False
rew, new_ep = move(env, 100)

Which calls this function:

def exploit(env, sequence):
    """
    Replay an action sequence; pad with NOPs if needed.

    Returns the final cumulative reward.
    """
    env.reset()
    done = False
    idx = 0
    while not done:
        if idx >= len(sequence):
            _, _, done, _ = env.step(np.zeros((12,), dtype='bool'))
        else:
            _, _, done, _ = env.step(sequence[idx])
        idx += 1
    return env.total_reward

Together with the timestep counter, this variable decides whether the best solution found so far should be replayed again or whether we should try a new run instead.
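To get a feeling for the numbers, here is the exploit condition with example values plugged in (the step count is made up):

EXPLOIT_BIAS = 0.25
TOTAL_TIMESTEPS = int(1e6)
total_steps_ever = 500000   # pretend we are halfway through the time budget
exploit_prob = EXPLOIT_BIAS + total_steps_ever / TOTAL_TIMESTEPS
print(exploit_prob)         # 0.75 -> the agent replays its best run about 75% of the time

So early in a run the agent mostly explores, and the closer it gets to the timestep budget, the more it sticks to replaying its best sequence.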

In the move() function we find the jump probability. This variable decides how often Sonic should jump.

If we comment out the action[] assignments and set jump_prob to 1.0, the jump button gets pressed on every single step and our agent effectively stops jumping or moving.

The default is jump_prob = 1.0 / 10.0, so Sonic jumps on roughly every tenth step. Changing the denominator to 2.0 makes Sonic jump very often, and changing it to 80.0 makes him jump only occasionally.

(Recording: the agent with the variable set to 80.0.)

The other variable, jump_repeat, decides for how many consecutive steps the jump button is held. Changing it to 10 makes Sonic jump higher, and changing it to 1 makes him jump lower.
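Since both values are keyword arguments of move(), they are easy to experiment with (the values below are arbitrary examples, not recommendations):

rew, new_ep = move(env, 100, jump_prob=1.0 / 2.0)   # jump on roughly every second step
rew, new_ep = move(env, 100, jump_prob=1.0 / 80.0)  # jump only occasionally
rew, new_ep = move(env, 100, jump_repeat=10)        # hold the button longer -> higher jumps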

Which buttons are pressed can be seen in these lines of code:

action = np.zeros((12,), dtype=np.bool)
action[6] = left
action[7] = not left

The action array has 12 elements, all initialized to False. In order, they map to the buttons: B, A, MODE, START, UP, DOWN, LEFT, RIGHT, C, Y, X, Z.

So if we set

action[0] = True

we make Sonic jump, because we press the “B” button.

Looking at the controls: https://strategywiki.org/wiki/Sonic_the_Hedgehog/Controls

Pressing A, B or C makes no difference, because all those buttons trigger the same action.

The same goes for action[6] and action[7]: the first presses LEFT (run left) and the second presses RIGHT (run right).
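As a small sanity check, this is how a single "run right and jump" action would be built by hand, using the same convention as move():

action = np.zeros((12,), dtype=np.bool)
action[7] = True   # RIGHT -> run to the right
action[0] = True   # B     -> jump
env.step(action)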

Rewards

To see how the reward variable changes, I added these two lines of code to my script:

print("rew" , rew)
print("total_rew",  total_rew)

This way I know how much reward my agent receives and when the reward variable is reset.

My Opinion on the JERK Approach

After testing and editing the script for two weeks, I think it is not a bad approach. But as I said before, JERK is only as good as the level Sonic has to run on. I am not 100% sure, but I would guess there are obstacles that give the JERK agent a lot of problems while the machine learning algorithms have none.

All in all, it is a very good beginner algorithm, so if you are starting with AI I would give it a try. I think it is a very good baseline on which you can build your AI knowledge.

The JERK pseudo-code snippet is from the Gotta Learn Fast paper:

https://arxiv.org/pdf/1804.03720.pdf

It's a very nice paper and definitely a must-read before starting this contest. It also contains the scores of the JERK, PPO2, and Rainbow agents on all the test levels.


That’s all for this post. Thanks for reading.
