Learning Breakout More Quickly

philoxenic

Jun 30, 2020

CPOL

In this article, we will see how you can use a different learning algorithm (plus more cores and a GPU) to train much faster on the Breakout environment.

The previous article introduced the OpenAI Gym environment for Atari Breakout, together with some code for training an agent to solve it using reinforcement learning.

Now we are going to take a closer look at the details of this environment, and use a more sophisticated algorithm to train an agent on it much more quickly.

Inspecting the Environment

You can use the following simple Python code to play the game interactively (and, it has to be said, more slowly than usual). The keys you need are A for left, D for right, and Space to launch the ball. This will only work if you’re in an environment with a real graphical display; otherwise, you can just read this bit.

import gym
from gym.utils.play import play, PlayPlot

def callback(obs_t, obs_tp1, action, rew, done, info):
    return [rew]

plotter = PlayPlot(callback, 30 * 5, ["reward"])
env = gym.make("Breakout-ramNoFrameskip-v4")
play(env, callback=plotter.callback, zoom=4)

We use a callback function to show the reward received over time. As you can see, we get no reward except when the ball hits and removes a brick. There is no negative reward for losing a life.
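If you would also like to see a running total, the callback can keep a little state. The following is only a sketch: `RewardTracker` is a hypothetical helper, not part of Gym, and it assumes the callback is invoked once per step with the same six arguments as above. The fake transitions at the end stand in for a real game so the logic can be checked without a display.

```python
class RewardTracker:
    """Hypothetical helper: tracks the per-step reward and the running episode return."""

    def __init__(self):
        self.episode_return = 0.0    # reward accumulated in the current episode
        self.finished_returns = []   # total return of each completed episode

    def callback(self, obs_t, obs_tp1, action, rew, done, info):
        # Same signature as the callback in the article's code.
        self.episode_return += rew
        current = self.episode_return
        if done:
            # Record the finished episode and start counting from zero again.
            self.finished_returns.append(current)
            self.episode_return = 0.0
        # One value per plotted series: step reward, then running return.
        return [rew, current]


# Fake transitions (reward, done) standing in for real play.
tracker = RewardTracker()
for rew, done in [(0.0, False), (1.0, False), (4.0, True)]:
    values = tracker.callback(None, None, 0, rew, done, {})

print(values, tracker.finished_returns)
```

To plot both series, you would pass `tracker.callback` to `PlayPlot` along with two plot names instead of one, for example `["reward", "episode return"]`.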

Several things aren’t apparent from this screenshot:

You get a higher reward for knocking out bricks in the higher rows
After a while, the bat gets smaller
The velocity of the ball varies a great deal

So, a few challenges for an agent to overcome!
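One common way to cope with rewards of different sizes is to clip them, which the training configuration later in this article turns on with `clip_rewards`. As a rough sketch, assuming clipping means reducing each reward to its sign, and assuming brick values of 1, 4 and 7 points (taken from the classic Atari game, not measured here), every brick ends up worth the same to the learner:

```python
def clip_reward(reward):
    """Reduce a reward to its sign: -1.0, 0.0 or 1.0."""
    if reward > 0:
        return 1.0
    if reward < 0:
        return -1.0
    return 0.0


# Assumed point values for low, middle and high rows of bricks.
ASSUMED_BRICK_POINTS = [1, 4, 7]

clipped = [clip_reward(r) for r in ASSUMED_BRICK_POINTS]
print(clipped)
```

The agent then learns to knock out as many bricks as possible rather than to chase the high-value rows, which tends to make the learning signal more uniform.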

Overview of Ray’s Architecture

Since we are inspecting things, this is a good opportunity to have a brief overview of Ray’s architecture and, in particular, the things we might like to tweak to change its performance. In the previous article, we ran on a single CPU; this time we are going to make use of more cores and a GPU.

The architecture of Ray consists of one trainer and zero or more external worker processes, which feed back batches of observations. Each worker can run one or more environments, based on what you have configured.

Here are some of the common parameters you can change to affect performance and scaling:

num_cpus_per_worker: the number of CPUs each worker is allowed to use; there’s no benefit to this being more than one for any of the standard environments, but it might be useful if you have an expensive custom environment
num_envs_per_worker: the number of environments spun up by each worker process
num_gpus: the total number of GPUs to use for training
num_gpus_per_worker: the number of GPUs to use per worker, typically zero
num_workers: the number of worker processes
rollout_fragment_length: the number of observations to take from each environment before a worker sends it back to the trainer
train_batch_size: the number of observations in each batch when training the policy
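To get a feel for how these settings interact, here is some back-of-the-envelope arithmetic using the values from the configuration later in this article. It is a sketch under the assumption that each worker returns `rollout_fragment_length` observations from every one of its environments in a single batch; the exact batching depends on the algorithm, and the function names are made up for illustration.

```python
import math


def steps_per_worker_batch(num_envs_per_worker, rollout_fragment_length):
    """Observations one worker sends back at a time (assumed: one fragment per environment)."""
    return num_envs_per_worker * rollout_fragment_length


def worker_batches_per_train_batch(train_batch_size, worker_batch_steps):
    """How many worker batches are needed to fill one training batch."""
    return math.ceil(train_batch_size / worker_batch_steps)


# Values from the training configuration used in this article.
num_workers = 7
num_envs_per_worker = 5
rollout_fragment_length = 50
train_batch_size = 500

per_worker = steps_per_worker_batch(num_envs_per_worker, rollout_fragment_length)
needed = worker_batches_per_train_batch(train_batch_size, per_worker)
per_round = num_workers * per_worker  # if every worker reports once

print(per_worker, needed, per_round)
```

Under these assumptions, two worker batches fill one training batch, so seven workers reporting once supply enough observations for several training steps.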

Using Impala for Learning Breakout More Quickly

The following code sets up seven Ray workers, each running five Breakout environments. We are also switching to use the IMPALA algorithm instead of DQN.

import ray
from ray import tune
from ray.rllib.agents.impala import ImpalaTrainer

ray.shutdown()
ray.init(include_webui=False, ignore_reinit_error=True)

ENV = "BreakoutNoFrameskip-v4"
TARGET_REWARD = 200
TRAINER = ImpalaTrainer

tune.run(
    TRAINER,
    stop={"episode_reward_mean": TARGET_REWARD},
    config={
      "env": ENV,
      "monitor": True,
      "evaluation_num_episodes": 25,

      # from https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/impala/atari-impala.yaml
      "rollout_fragment_length": 50,
      "train_batch_size": 500,
      "num_workers": 7,
      "num_envs_per_worker": 5,
      "clip_rewards": True,
      "lr_schedule": [
          [0, 0.0005],
          [20_000_000, 1e-12],
      ]
    }
)
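The `lr_schedule` entry is a list of `[timestep, learning_rate]` pairs. The sketch below shows how such a schedule could be evaluated, assuming linear interpolation between the listed points and a constant value after the last one. The `scheduled_lr` function is written for illustration and is not RLlib's own implementation.

```python
def scheduled_lr(timestep, schedule):
    """Linearly interpolate the learning rate between [timestep, lr] points."""
    # Before the first point, use the first learning rate.
    if timestep <= schedule[0][0]:
        return schedule[0][1]
    # Find the segment containing the timestep and interpolate within it.
    for (t0, lr0), (t1, lr1) in zip(schedule, schedule[1:]):
        if t0 <= timestep < t1:
            fraction = (timestep - t0) / (t1 - t0)
            return lr0 + fraction * (lr1 - lr0)
    # Past the last point, hold the final learning rate.
    return schedule[-1][1]


# The schedule from the configuration above.
SCHEDULE = [[0, 0.0005], [20_000_000, 1e-12]]

for step in (0, 5_000_000, 10_000_000, 20_000_000):
    print(step, scheduled_lr(step, SCHEDULE))
```

In other words, the learning rate starts at 0.0005 and decays steadily to practically zero over 20 million timesteps.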

Using eight CPU cores and a GPU, this took about 0.6 hours to reach a mean episode reward of 200, much quicker than the DQN model we used in the previous article.

Progress wasn’t exactly linear. In particular, it had a very wobbly moment towards the end, where the mean score dropped right back.

Having learned to solve the Breakout environment in half the time, you might think we are done with it. But no, this is only half the battle. Learning Breakout from RAM, instead of from pixels, throws up some interesting challenges, as we will discover in the next article.