In earlier articles in this series we looked at the Humanoid Bullet environment, where the objective is to teach a model of a humanoid to walk forwards without falling over.
This time we will look at how we can tweak the environment to make the agent learn a slightly different challenge: to walk backwards instead of forwards.
You won’t find much, if anything, about this in the machine learning literature. Indeed, getting it to work at all was a struggle! Naively, I thought I could simply set a different target x-coordinate on the environment and be done. When I tried that, however, I ended up with agents that learnt to throw themselves backwards from the starting line, at which point the episode would finish. It’s fun to watch, but it’s not what I was trying to achieve.
Instead, it turned out that I needed to override the target x-coordinate on both the environment and the robot, as well as move the starting point away from the origin (0, 0, 0). It took a lot of trial and error to get this working! I never did track down where in the code the episode was being forcibly ended when the x-coordinate went negative – if you work this out, please let me know. The PyBullet code wasn’t designed with this sort of extensibility in mind.
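The failure mode of the naive attempt is easy to see in miniature: the environment and the robot each keep their own copy of `walk_target_x`, so changing only the environment’s copy leaves the robot aiming forwards. Here is a sketch with stand-in classes (`DummyRobot` and `DummyEnv` are mocks for illustration, not PyBullet’s real classes):

```python
class DummyRobot:
    """Stand-in for PyBullet's robot object, which keeps its own copy
    of the target x-coordinate."""
    def __init__(self):
        self.walk_target_x = 1e3  # default: walk towards +x

class DummyEnv:
    """Stand-in for HumanoidBulletEnv, which holds a separate copy."""
    def __init__(self):
        self.robot = DummyRobot()
        self.walk_target_x = 1e3

env = DummyEnv()
env.walk_target_x = -1e3               # the naive override changes one copy...
assert env.robot.walk_target_x == 1e3  # ...but the robot still aims at +x
```

This is why the reset override below has to set the target in both places.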
Here is the code I used. The interesting part is in the custom environment’s reset function, where it sets the new target location and start position:
import pyvirtualdisplay

_display = pyvirtualdisplay.Display(visible=False, size=(1400, 900))
_ = _display.start()

from ray import tune
from ray.rllib.agents.sac import SACTrainer
from ray.tune.registry import register_env
from pybullet_envs.gym_locomotion_envs import HumanoidBulletEnv


class RunBackwardsEnv(HumanoidBulletEnv):
    def reset(self):
        state = super().reset()
        # The target must be overridden on both the environment and the robot.
        self.walk_target_x = -1e3
        self.robot.walk_target_x = -1e3
        # Start far from the origin so that walking backwards does not
        # immediately take the x-coordinate negative.
        self.robot.start_pos_x = 500
        self.robot.body_xyz = [500, 0, 0]
        return state


ENV = 'HumanoidBulletEnvReverseReward-v0'
register_env(ENV, lambda env_config: RunBackwardsEnv())

TARGET_REWARD = 6000
TRAINER = SACTrainer
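These constants presumably feed the same generic Tune training loop used earlier in the series. A minimal sketch under that assumption follows – everything here except the environment name and target reward is an illustrative guess, not the article’s actual settings:

```python
# Sketch of a training call; the hyperparameters are illustrative
# assumptions, not the settings used for the run described here.
stop_criteria = {"episode_reward_mean": 6000}       # TARGET_REWARD
config = {
    "env": "HumanoidBulletEnvReverseReward-v0",     # ENV, registered above
    "framework": "torch",
}
# tune.run(SACTrainer, config=config, stop=stop_criteria,
#          checkpoint_freq=10, checkpoint_at_end=True)
```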
By default, the environment’s target x-coordinate is 1000. I set it to -1000, though I doubt the agent ever gets that far: I suspect the episode would still be forcibly terminated once the x-coordinate passed zero.
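A rough sanity check of these numbers: PyBullet’s locomotion reward includes a progress term proportional to the reduction in distance to the target each step, so with the start moved to x = 500 and the target at x = -1000 there are 500 units of backwards runway before the agent even reaches zero. The sketch below models that progress term in simplified form (the potential formula and the `dt` value are my assumptions about PyBullet’s internals, not code from the article):

```python
def progress_reward(old_x, new_x, target_x, dt=0.0165):
    """Simplified progress term of the locomotion reward: the change in
    potential, where potential = -distance_to_target / dt. The exact
    formula and dt are assumptions for illustration."""
    old_potential = -abs(target_x - old_x) / dt
    new_potential = -abs(target_x - new_x) / dt
    return new_potential - old_potential

# Starting at x = 500 with the target at x = -1000, a step backwards
# is rewarded, and a step forwards is penalised:
backwards = progress_reward(500.0, 499.9, -1000.0)  # positive
forwards = progress_reward(500.0, 500.1, -1000.0)   # negative
```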
Here is a graph of the average reward over the course of the 46.8-hour training run.
As you can see, the learning process was not particularly smooth, and it looks like the agent might have continued to improve if I had left it for longer.
Here is what the trained agent looked like. Perhaps it’s not an elegant gait, but given how many previous experiments had been thwarted by the agent being unable to go backwards past the origin, I was thrilled to finally see this working.
In the next and final article in this series we will look at even deeper customisation: editing the XML-based model of the figure and then training the result.