“How to hit 6k points on the leaderboard.”

~The Easy Way.

Well, not really.

It took me around 1.5 months to hit the 6k milestone. I spent the first week configuring the environment and the second learning the “basics” of reinforcement learning.

I made a list of milestones and checked each one off when I hit the mark.

- 4k score – check
- 5k score – check
- 6k score – check
- 7k score – no check
- 9k score on one of the evaluation levels – check

#### So… what did I do to hit a 4k score?

I took one of the baseline algorithms and edited it.

https://github.com/openai/retro-baselines/tree/master/agents

At the beginning I took the JERK algorithm because of its evaluation time. It takes only 2 hours to finish all the tasks, so I could test many possible solutions in a very short time.

Here is a pretty good write-up about how this algorithm actually works.

To test and improve the JERK algorithm I kept notes on each change of variables and its outcome.

The standard JERK evaluation gives this on the test server:

And the video:

As you can see, it gets stuck on this one obstacle.

    rew, new_ep = move(env, 100)
    if not new_ep and rew <= 0:
        print('backtracking due to negative reward: %f' % rew)
        _, new_ep = move(env, 70, left=True)

Changing this line of code to:

    rew, new_ep = move(env, 50)
    if not new_ep and rew <= 0:
        print('backtracking due to negative reward: %f' % rew)
        _, new_ep = move(env, 70, left=True)

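gives a score of 3915.15, which is better. From here on I write these settings as forward steps/backtrack steps, so this change is 50/70 and the standard setting is 100/70.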

Improved score on #3 and #5

Same for local evaluations. We get a score of around 5k-5.3k on Green Hill Zone Act 1.

On the Labyrinth level the standard 100/70 approach gives better outcomes, but our approach works well too.

50/70 gives overall scores of 2738 and 2857, and the standard 100/70 approach gives scores of around 2840 and 2976.

So we can call it a 1:1 draw.

After some testing on local and online evaluation we get these outcomes:

50/70 = 3808.37

50/70 = 3802.75

40/70 = 3970.68

40/70 = 3708.13

45/55 = 3839.05

45/80 = 3864.03

So as we see… it doesn’t change much. The settings probably do change the score and behavior, but we need to account for run-to-run randomness. The algorithm learns a little differently every time, after all.
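To make that point concrete, here is a small sketch that takes the two settings that were each evaluated twice online and compares the gap between repeated runs with the gap between settings. The scores are copied from the list above; the helper name `summarize` is made up for this illustration.

```python
from statistics import mean

# Online evaluation scores copied from the list above (forward/backtrack steps).
scores = {
    "50/70": [3808.37, 3802.75],
    "40/70": [3970.68, 3708.13],
}


def summarize(runs):
    """Return (mean, spread), where spread is max - min across repeated runs."""
    return mean(runs), max(runs) - min(runs)


for setting, runs in scores.items():
    avg, spread = summarize(runs)
    print(f"{setting}: mean={avg:.2f} spread={spread:.2f}")
```

With only two runs per setting this is far from a real statistical test, but it already shows that two runs of the same 40/70 setting differ by more than the two settings differ on average.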

On local evaluations:

Exploit Bias = 1 and 45/70 with

def move(env, num_steps, left=False, jump_prob=1.0 / 20.0, jump_repeat=4):

is definitely a bad approach.

It was made for test purposes.

Same with:

Exploit Bias = 1 and 45/70 is bad too.

But we can already see what gives good and what gives bad outcomes.

#### And then…

Exploit Bias = 0.5 and 45/70 gives us our first milestone.

Nice score of 4042.52

So… Exploit Bias? Huh?
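For readers who have not opened the baseline: as far as I can tell from the JERK agent, the exploit bias is the starting probability of replaying the best action sequence found so far instead of exploring, and that probability grows as more steps are used up. The sketch below is a minimal reconstruction under that assumption. The function name `exploit_probability`, its argument names and the 1e6 step budget are mine, not the baseline's.

```python
def exploit_probability(exploit_bias, steps_so_far, total_timesteps=int(1e6)):
    """Chance of replaying the best known run at the start of an episode.

    Assumption: probability = bias + fraction of the step budget used,
    capped at 1.0 so it stays a valid probability.
    """
    return min(1.0, exploit_bias + steps_so_far / total_timesteps)


for bias in (0.45, 0.5, 0.6, 1.0):
    start = exploit_probability(bias, 0)
    halfway = exploit_probability(bias, 500_000)
    print(f"bias={bias}: start={start:.2f} halfway={halfway:.2f}")
```

If this reading is right, a bias of 1 means Sonic stops exploring as soon as one run has been recorded, which would fit the bad local results with Exploit Bias = 1.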

Here is the run:

As you can see, Sonic is pretty shaky.

Improve Exploit Bias!

Exploit Bias = 0.6 and 45/70 gives us 3523.32, so… more doesn’t mean better.

What about the run left and right variables? Will they give us a better outcome?

Well… not really.

Exploit Bias = 0.5 and 45/12 gives us a score of 3729.96.

#### Well… maybe we should lower the exploit bias then?

Exploit Bias = 0.45 and 45/55 gives us a score of 4104.32 and a nice position, #17, on the leaderboard.

Let’s run the evaluation again with a slightly higher exploit bias:

Exploit Bias = 0.5 and 45/55 gives us a score of 4094.27 and position #19 on the leaderboard.

A little worse.

Exploit Bias = 0.4 and 45/55 gives us a score of 4039.90.

Exploit Bias = 0.45 and 45/70 gives us a score of 3936.65.

As we see, this change doesn’t change the score much.

Exploit Bias = 0.45 and 45/45 gives us a score of 4058.88.

Exploit Bias = 0.45 and 45/55 and

def move(env, num_steps, left=False, jump_prob=1.0 / 8.0, jump_repeat=4):

gives us a score of 3833.29.

Exploit Bias = 0.45 and 45/55 and

def move(env, num_steps, left=False, jump_prob=1.0 / 12.0, jump_repeat=4):

gives us a score of 4110.23.

New personal high score! But the high score on the leaderboard is around 4.8k, so it doesn’t get us much further.

We are going to run some more tests and then, I guess, give up on this approach, because it doesn’t perform like the top 10 runs on the leaderboard.

Exploit Bias = 0.45 and 45/55 and

def move(env, num_steps, left=False, jump_prob=1.0 / 14.0, jump_repeat=4):

gives us a score of 3963.03.

Exploit Bias = 0.45 and 45/50 and

def move(env, num_steps, left=False, jump_prob=1.0 / 12.0, jump_repeat=4):

gives us a score of 3971.64.

Exploit Bias = 0.45 and 45/55 and

def move(env, num_steps, left=False, jump_prob=1.0 / 12.0, jump_repeat=4):

and timesteps = 1e7 gives us a score of 3433.87.

#### So changing the timesteps lowers our score.

Exploit Bias = 0.45 and 45/55 and

def move(env, num_steps, left=False, jump_prob=1.0 / 10.0, jump_repeat=7):

gives us a score of 4082.68.

Exploit Bias = 0.45 and 45/55 and

def move(env, num_steps, left=False, jump_prob=1.0 / 10.0, jump_repeat=10):

gives us a score of 4061.07.

Exploit Bias = 0.45 and 45/55 and

def move(env, num_steps, left=False, jump_prob=1.0 / 12.0, jump_repeat=7):

gives us a score of 3806.66.
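For anyone wondering what jump_prob and jump_repeat actually do: my reading of the baseline's move() is that on every step without a jump in progress Sonic starts a jump with probability jump_prob, and then holds the jump button for jump_repeat steps in total. The sketch below simulates only that button schedule, without any game environment. The function name `jump_schedule` and the `rng` argument are made up for this illustration.

```python
import random


def jump_schedule(num_steps, jump_prob=1.0 / 10.0, jump_repeat=4, rng=None):
    """Return one boolean per step: True where the jump button is held.

    Assumption: on a step with no jump in progress, a jump starts with
    probability jump_prob and the button is then held for jump_repeat
    consecutive steps in total.
    """
    rng = rng or random.Random()
    schedule = []
    jumping_steps_left = 0
    for _ in range(num_steps):
        if jumping_steps_left > 0:
            schedule.append(True)
            jumping_steps_left -= 1
        elif rng.random() < jump_prob:
            schedule.append(True)
            jumping_steps_left = jump_repeat - 1
        else:
            schedule.append(False)
    return schedule


# Compare two of the settings tried above over one 45-step forward move.
for prob, repeat in ((1.0 / 12.0, 4), (1.0 / 10.0, 7)):
    held = sum(jump_schedule(45, prob, repeat, random.Random(0)))
    print(f"jump_prob={prob:.3f} jump_repeat={repeat}: jump held on {held}/45 steps")
```

A higher jump_prob or jump_repeat simply means more of each move is spent in the air, which may explain why the scores react to these values at all.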

At this point we give up on the JERK approach.

It’s a pretty time-efficient algorithm and it works pretty well on local evaluation, because we don’t need to wait until Sonic learns to walk as with the ppo2 or Rainbow approaches, but… it won’t let us win.

Sonic gets stuck on the same obstacle over and over.

So we start doing a little research and try to develop our own algorithm.

Trying OpenCV and YOLO

We find out that this approach won’t work and, well… we lack the math skills. After some more research we decide that we can’t outsmart AI researchers.

We try the ppo2 algorithm because its evaluation time is shorter than Rainbow’s.

    ppo2.learn(policy=policies.CnnPolicy,
               env=DummyVecEnv([make_env]),
               nsteps=4096,
               nminibatches=8,
               lam=0.95,
               gamma=0.99,
               noptepochs=3,
               log_interval=1,
               ent_coef=0.01,
               lr=lambda _: 2e-4,
               cliprange=lambda _: 0.1,
               total_timesteps=int(1e7))

These are our standard hyperparameters.

https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)

And they give us a score of 3265.79.

The run is far from great… but we can change the parameters, right?

Changing these to:

    ppo2.learn(policy=policies.CnnPolicy,
               env=DummyVecEnv([make_env]),
               nsteps=1024,
               nminibatches=8,
               lam=0.95,
               gamma=0.99,
               noptepochs=3,
               log_interval=1,
               ent_coef=0.01,
               lr=lambda _: 2e-4,
               cliprange=lambda _: 0.2,
               total_timesteps=int(1e7))

doesn’t help us much. We hit 2269.29.

#### But there is a bright side.

We got through the obstacle!

Looking at other runs on the leaderboard, it seems other players could be using ppo2.

    ppo2.learn(policy=policies.CnnPolicy,
               env=DummyVecEnv([make_env]),
               nsteps=128,
               nminibatches=4,
               lam=0.95,
               gamma=0.99,
               noptepochs=4,
               log_interval=1,
               ent_coef=0.01,
               lr=lambda _: 2e-4,
               cliprange=lambda _: 0.1,
               total_timesteps=int(1e7))

gives a score of 1768.87, and changing nsteps to 8192 gives a score of 2678.39.
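To get a feeling for why nsteps and nminibatches matter, here is a small sketch of the batch arithmetic as I understand it from the baselines ppo2 code: each update collects nsteps samples per environment, splits them into nminibatches chunks, and repeats until total_timesteps is used up. The helper name `ppo2_batch_layout` is made up, and one environment is assumed because DummyVecEnv([make_env]) wraps a single one.

```python
def ppo2_batch_layout(nsteps, nminibatches, total_timesteps, nenvs=1):
    """Derive the rollout batch size, minibatch size and update count."""
    nbatch = nenvs * nsteps                  # samples collected per update
    nbatch_train = nbatch // nminibatches    # samples per gradient step
    nupdates = total_timesteps // nbatch     # policy updates in the budget
    return nbatch, nbatch_train, nupdates


# The (nsteps, nminibatches) pairs tried so far.
for nsteps, nminibatches in ((4096, 8), (1024, 8), (128, 4), (8192, 4)):
    nbatch, nbatch_train, nupdates = ppo2_batch_layout(nsteps, nminibatches, int(1e7))
    print(f"nsteps={nsteps}: batch={nbatch} minibatch={nbatch_train} updates={nupdates}")
```

So a small nsteps means many updates on tiny, noisy batches, while a large nsteps means fewer updates on bigger batches. Which one wins on Sonic is exactly what the scores have to tell us.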

    ppo2.learn(policy=policies.CnnPolicy,
               env=DummyVecEnv([make_env]),
               nsteps=8192,
               nminibatches=32,
               lam=0.95,
               gamma=0.99,
               noptepochs=10,
               log_interval=1,
               ent_coef=0.01,
               lr=lambda _: 3e-4,
               cliprange=lambda _: 0.2,
               total_timesteps=int(1e7))

gives a score of 1773.42, and with lr=lambda _: 2e-4 we get a score of 1805.50.

Going the other way and using the standard baseline, but changing the reward value in the sonic_util.py script from 0.001 to 0.05, gives a score of 2808.74, and a reward value of 0.007 gives a score of 3199.44.

I tried some other approaches which failed totally.

The best approach was:

    ppo2.learn(policy=policies.CnnPolicy,
               env=DummyVecEnv([make_env]),
               nsteps=4096,
               nminibatches=8,
               lam=0.97,
               gamma=0.99,
               noptepochs=3,
               log_interval=1,
               ent_coef=0.01,
               lr=lambda _: 3e-4,
               cliprange=lambda _: 0.2,
               total_timesteps=int(1e7))

which gave us a score of 3465.19.

Tuning the cliprange and lam values gave us a slightly better score, but still worse than JERK.

But we still get stuck. A little further than with the JERK approach, but still…

Rainbow was our last hope. Looking at the leaderboard, the algorithms there go smoothly through the “elevators”. Overall, Rainbow DQN scores a lot better than ppo2, based on the Gotta Learn Fast paper:

https://arxiv.org/pdf/1804.03720.pdf

#### The price we pay on the other hand is the evaluation time.

Reading around the internet about DQN algorithms, we find out that Rainbow could be the one algorithm that helps us get higher scores than JERK.

Rainbow Baseline:

    with tf.Session(config=config) as sess:
        dqn = DQN(*rainbow_models(sess,
                                  env.action_space.n,
                                  gym_space_vectorizer(env.observation_space),
                                  min_val=-200,
                                  max_val=200))
        player = NStepPlayer(BatchedPlayer(env, dqn.online_net), 3)
        optimize = dqn.optimize(learning_rate=1e-4)
        sess.run(tf.global_variables_initializer())
        dqn.train(num_steps=2000000, # Make sure an exception arrives before we stop.
                  player=player,
                  replay_buffer=PrioritizedReplayBuffer(500000, 0.5, 0.4, epsilon=0.1),
                  optimize_op=optimize,
                  train_interval=1,
                  target_interval=8192,
                  batch_size=32,
                  min_buffer_size=20000)

On the first run with the baseline algorithm we score 3970.24.

And on the second run 3714.10.

So it looks like we are scoring lower in points, but we get further in the level.

And it looks like other high-scoring players on the leaderboard are using Rainbow DQN too.

But wait a second…

We find this blog post.

#### Which says we can hit a 4.8k+ score with the Rainbow baseline.

Is it true? We need to find out.

It is 15 May at this moment. Our highest score is 4.1k.

And at this moment we would be happy to hit 4.5k.

After some testing on local evaluation we decide to upload the best approaches:

    with tf.Session(config=config) as sess:
        dqn = DQN(*rainbow_models(sess,
                                  env.action_space.n,
                                  gym_space_vectorizer(env.observation_space),
                                  min_val=-250,
                                  max_val=250))
        player = NStepPlayer(BatchedPlayer(env, dqn.online_net), 3)
        optimize = dqn.optimize(learning_rate=1e-4)
        sess.run(tf.global_variables_initializer())
        dqn.train(num_steps=2000000, # Make sure an exception arrives before we stop.
                  player=player,
                  replay_buffer=PrioritizedReplayBuffer(500000, 0.5, 0.4, epsilon=0.1),
                  optimize_op=optimize,
                  train_interval=1,
                  target_interval=8192,
                  batch_size=32,
                  min_buffer_size=20000)

gives a score of 3961.18.

Changing min_val to -150 and max_val to 150 gives us a score of 3610.73.

So the min/max value does indeed make a difference.

-320/320 gives us a score of 3843.79.

And a min/max value of -420/420 gives us scores of 3752.08 and 4053.46.

With only one changed variable we hit 4k. Not bad. But can we score higher?
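A word on what min_val and max_val seem to control. My understanding is that the distributional part of Rainbow predicts a probability for each of a fixed set of evenly spaced return values (atoms) between min_val and max_val, so widening the range also makes the grid coarser. The sketch below just computes that grid. The helper name `value_support` is made up, and the 51 atoms are an assumption taken from the Rainbow paper, not something I checked in the library.

```python
def value_support(min_val, max_val, num_atoms=51):
    """Evenly spaced return values (atoms) between min_val and max_val.

    Assumption: 51 atoms, the number used in the Rainbow paper; the
    library's actual default may differ.
    """
    step = (max_val - min_val) / (num_atoms - 1)
    return [min_val + i * step for i in range(num_atoms)], step


for bound in (150, 200, 250, 320, 420):
    atoms, step = value_support(-bound, bound)
    print(f"min/max=+/-{bound}: {len(atoms)} atoms, spacing {step:.1f}")
```

If that picture is right, the min/max value is a trade-off: too narrow and large returns get clipped, too wide and nearby returns fall onto the same atom.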

#### And here it comes.

Changing the value of target_interval to 4096 gave us a score of 4335.63, our best personal score so far. And we actually hit place #10 on the leaderboard. That’s something.

The video doesn’t look better. But that’s a start.

Can we score higher?

Changing the min/max value to -500/500 with target_interval = 4096 gave us a score of 4139.82.

Going back to a min/max value of -200/200 with target_interval = 4096 gave us a score of 4051.23.

With a min/max value of -420/420, target_interval = 4096 and a learning_rate of 1e-5 we get an error on the evaluation server. It works locally, so we try again.

We hit a score of 3885.06.

Changing back to our learning rate of 1e-4, with a min/max value of -420/420 and a target_interval of 4096, we hit a nice score of 4385.69.

On the video it doesn’t look better. Local evaluations gave us good runs… so we try again…

At this point we know that local evaluations and evaluations on the contest server give different outcomes.

The good thing is the score is stable.

Using the same parameters and changing min_buffer_size to 15000 gives a score of 4194.96, and changing it to 25000 gives a score of 4334.76.

Going further, on the next evaluation we change the min/max value to 400. We get a score of 3947.71. Changing the batch size to 64, we get a score of **4401.63.**

#### New personal high score!

Changing the target_interval to 1024, but with a batch size of 32, gives us a new high score again! **4440.39.**

That’s our highest score for the day, 25 May 2018.

And the run looks like this:

It’s not bad. What can we learn from this? Well… that Rainbow DQN is actually a pretty good approach. It takes time, but changing only some of those values has a massive effect. Because changing the target_interval gave us more points, we are going to change it further.
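As far as I understand it, target_interval is the number of training steps between copies of the online network into the target network that DQN uses to compute its learning targets. The toy loop below is only a sketch of that idea under this assumption. The integer "weights" and the function name `train_with_target_sync` are made up for illustration and are not part of anyrl.

```python
def train_with_target_sync(num_steps, target_interval):
    """Count target syncs and the largest gap between online and target weights."""
    online = 0    # stand-in for the online network's weights
    target = 0    # stand-in for the target network's weights
    syncs = 0
    max_lag = 0
    for step in range(1, num_steps + 1):
        online += 1                          # pretend one gradient step moved the weights
        max_lag = max(max_lag, online - target)
        if step % target_interval == 0:
            target = online                  # copy online weights into the target network
            syncs += 1
    return syncs, max_lag


for interval in (8192, 4096, 1024):
    syncs, max_lag = train_with_target_sync(100_000, interval)
    print(f"target_interval={interval}: {syncs} syncs, target up to {max_lag} steps behind")
```

A smaller interval keeps the target network closer to what the agent has just learned, at the price of less stable targets.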

A target_interval of 512 gives us a score of **4914.14**,

but changing it to 256 gave us a score of 4656.51.

So… I thought this was the limit. But changing it to 128 gave us a new high score again: **4916.98.**

And with a target_interval of 64 we made a score of **5028.05**!

#### We did it!

We hit another milestone!

As you can see in the video, the algorithm has a much easier time with the obstacles than before.

With a target_interval of 32 we score 4721.00.

Running it again with a target_interval of 64 we hit another high score of **5526.41**.

At this moment things start to get a little shaky. Our algorithm varies by about +/- 500 on the evaluation server.

But the good thing is we hit our 9k mark on one of the evaluation levels. Another achievement unlocked!

At this moment I wanted to test other things, so I changed the target_interval to 16 and the min/max value to 400, while also changing the reward value in the sonic_util.py file to 0.015. We hit a score of 4306.06.

A target_interval of 16 and a reward of 0.009 gives us a score of 4867.84.

A target_interval of 16 and a min/max value of 420 gives us a score of 4311.39.

Going back to the best approach, also known as a target_interval of 64 and a min/max value of 420, gives us scores of **6000.52** and 5833.22.
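To see the whole target_interval sweep at a glance, the sketch below collects the scores quoted above per interval and ranks them by their average. I left out the runs where the text says the reward value, learning rate, batch size or min_buffer_size was changed as well. Other settings were not always identical between the remaining runs, so treat this as a rough summary rather than a clean experiment. The helper name `rank_by_mean` is made up.

```python
from statistics import mean

# Scores quoted in the text above, grouped by target_interval.
scores_by_interval = {
    4096: [4335.63, 4385.69],
    1024: [4440.39],
    512: [4914.14],
    256: [4656.51],
    128: [4916.98],
    64: [5028.05, 5526.41, 6000.52, 5833.22],
    32: [4721.00],
    16: [4311.39],
}


def rank_by_mean(scores):
    """Return (interval, mean score) pairs, best mean first."""
    return sorted(((k, mean(v)) for k, v in scores.items()),
                  key=lambda pair: pair[1], reverse=True)


for interval, avg in rank_by_mean(scores_by_interval):
    print(f"target_interval={interval}: mean score {avg:.2f}")
```

Even with the +/- 500 noise in mind, 64 stands out, and both much larger and much smaller intervals do worse.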

#### That’s not bad!

We hit our 6k mark and first place on the leaderboard.

So going back to the Medium post, which predicted that you can hit a 4.8k+ score with the Rainbow DQN algorithm… it was right!

Myth Approved!

And here is the source code for the full Rainbow DQN algorithm with hyperparameters:

    #!/usr/bin/env python

    """
    Train an agent on Sonic using an open source Rainbow DQN
    implementation.
    """

    import tensorflow as tf

    from anyrl.algos import DQN
    from anyrl.envs import BatchedGymEnv
    from anyrl.envs.wrappers import BatchedFrameStack
    from anyrl.models import rainbow_models
    from anyrl.rollouts import BatchedPlayer, PrioritizedReplayBuffer, NStepPlayer
    from anyrl.spaces import gym_space_vectorizer
    import gym_remote.exceptions as gre

    from sonic_util import AllowBacktracking, make_env

    def main():
        """Run DQN until the environment throws an exception."""
        env = AllowBacktracking(make_env(stack=False, scale_rew=False))
        env = BatchedFrameStack(BatchedGymEnv([[env]]), num_images=4, concat=False)
        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True # pylint: disable=E1101
        with tf.Session(config=config) as sess:
            dqn = DQN(*rainbow_models(sess,
                                      env.action_space.n,
                                      gym_space_vectorizer(env.observation_space),
                                      min_val=-421,
                                      max_val=421))
            player = NStepPlayer(BatchedPlayer(env, dqn.online_net), 3)
            optimize = dqn.optimize(learning_rate=1e-4)
            sess.run(tf.global_variables_initializer())
            dqn.train(num_steps=2000000, # Make sure an exception arrives before we stop.
                      player=player,
                      replay_buffer=PrioritizedReplayBuffer(500000, 0.5, 0.4, epsilon=0.1),
                      optimize_op=optimize,
                      train_interval=1,
                      target_interval=64,
                      batch_size=32,
                      min_buffer_size=25000)

    if __name__ == '__main__':
        try:
            main()
        except gre.GymRemoteError as exc:
            print('exception', exc)

It is 30 May 2018. We have 6 more days for online evaluations.

#### After all, we still want to hit our 7k mark.

Changing the reward value to 0.009 crashes the evaluation at 5.3-5.4k.

We try a min/max value of 440 and get a score of 5159.65.

A min/max value of 400 gives us a score of 5330.80.

And going back to 420 again gives us a score of 5838.92.

So it looks like a min/max value of 420 is the best one.

Changing the min_buffer_size to 28000 gives us a score of 4426.11.

We go back to a min_buffer_size of 25000 and try a min/max value of 415. We get a score of 5564.51.

Trying again with a min/max value of 425 we hit 4953.63.

Setting the min/max value back to 420 we hit 5535.94 and 5395.93.

We try for the last time, changing the min/max value to 421.

The score after the last change is **5815.51**.

There are 2 more days of evaluation. But we decide that second place on the leaderboard is a good place after all!

Here is the final video: https://contest.openai.com/videos/499.mp4

So… are you able to hit the 7k mark? I guess it is possible. With a little more tuning and testing of more hyperparameters, I guess you can hit 7k.

What can we learn from this?

**That hyperparameters matter and that we can double our performance just by tuning them.**

The contest has ended. Well, at least the algorithm phase.

Now we need to wait 2 weeks for the final evaluations to find out who’s the winner.

After all… I guess everyone is somehow a winner. If you were able to run Sonic locally or on the evaluation server, you have learned how to set up a reinforcement learning environment. If you sat down and started to learn machine learning or reinforcement learning on your own, or even started to learn calculus, probability theory, linear algebra or statistics on your own… you are already a winner.

#### Because whether I win or lose doesn’t really matter.

This contest finally gave me a little push to start learning machine learning and reinforcement learning.

I wanted to learn this stuff two years ago but never had the motivation to do it.

And this contest was a great opportunity to do it.

So after all, if other participants feel the same way I do, I can only guess that OpenAI did a great job of motivating young programmers and students to learn and research this kind of topic.

Other than that, this contest was a great place to meet new people. People interested in the same topic as you are.

Ideas were exchanged, friendships were made. Bugs were fixed and errors were troubleshot. New algorithms were developed.

Maybe they didn’t work the way you wanted them to and weren’t better than the baseline one, but they worked!

And if, after this long journey, you learned something from it, you are already a winner.

Thanks for reading!