World Cup Reinforcement Learning

In honor of the ongoing World Cup, I worked this week on training agents to play soccer as a team. I used the Unity game engine and based the scene on a demo scene in their ML-Agents library.

For fun, I created a demo version where you can play as striker for the red team (yours is the player with the white top). Your teammates will be neural nets, as will the opposing team. Use the arrow keys to run. I’m exploring better ways to publish it, but on a Windows machine you can download and try it out here.

Unity Setup

Unity launched ML-Agents last September, and the documentation is now quite good. Visit this guide for information on setting everything up. If you already have Tensorflow and Unity set up, it’s simply a matter of downloading ML-Agents from GitHub and opening unity-environment as a new project.

The RL Training framework in Unity. In the soccer game, I have one academy, three brains, and eight agents.

To train an ML agent in a scene, you need to define three things:

  • Agent – Each agent collects observations to pass to the brain, and acts on the actions output by the brain. In the soccer world, each player is an agent. (2 have a goalie brain, 2 have a defender brain, and 4 have a striker brain).
  • Brain – Each brain is a neural net, taking in observations and outputting actions. Set the brain to Internal for inference, and External for training.
  • Academy – The overall training regime (and connection to Python/Tensorflow)

Separately, within the python folder, you can define a specific training regime for the scene. Use trainer_config.yaml to define hyperparameters for each brain (the PPO algorithm is selected by default). The PPO implementation is at python/unitytrainers/ppo/, and the trained models are output to the python/models folder (this will contain the bytes file you’ll need to copy back to the Unity scene).
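For example, a trainer_config.yaml entry for one brain might look roughly like this (the brain name and all values here are placeholders, not the settings I actually used):

```yaml
StrikerBrain:
  batch_size: 128
  buffer_size: 2048
  beta: 5.0e-3
  gamma: 0.99
  hidden_units: 256
  num_layers: 2
  learning_rate: 3.0e-4
  max_steps: 5.0e5
```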


Unity provides a demo soccer scene, with one goalie brain and one striker brain. The striker gets a reward when it scores a goal, and the goalie gets a penalty when the opposing team scores. Additionally, an existential penalty is put on the striker (the striker should want to score and thereby end an episode as quickly as possible). The goalie instead is rewarded for long episodes.

I changed up the demo scene a fair amount. First, I made the game 4v4 instead of 2v2, and I made the agent observations more elaborate. I also added a new brain type – Defender. The defender is equally motivated to score and to avoid being scored on.

4v4, with purely random policies. At this point, the players are running around randomly, and the goalies (dark purple for the blue team, yellow for the red team) have not yet learned they should stay near their own goals.

I removed the demo scene’s artificial restrictions that goalies and strikers could only roam in certain parts of the field. I compensated for this by giving the goalie a slight reward for staying in its own goalie box (presumably comparable to the human reward of being able to use its hands there).

Looking distinctly more soccer-like. Goalies have learned to stay in their own box, unless there’s a good reason to leave. Often, the defender stays closer to its own goal, since it is penalized highly when the other team scores.

I also added a new reward type – possession. Here, everyone on a team is rewarded very slightly whenever their team touches the ball, and penalized when the other team touches it.

After a few hours of training, the game was definitely seeming more soccer-like, though the players learned to purposely bounce the ball off the walls. I’m now training with the wall taken out, the field a bit bigger, and a penalty to whichever team kicks the ball out. Initially I made the out-of-bounds penalty too big, and the agents became afraid to touch the ball at all. Now, with a more balanced penalty, the training looks more promising. I’m curious to see what team behaviors might emerge after more time.
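Putting the pieces together, a sketch of the per-step reward shaping described above might look like this (the roles, flags, and constants are all illustrative, not my actual values):

```python
# Hypothetical per-step reward shaping for one agent; the constant
# values are illustrative, not the ones actually used in the scene.
EXISTENTIAL = -0.001   # strikers should want to end episodes quickly
POSSESSION = 0.0005    # tiny nudge whenever your team touches the ball
OUT_OF_BOUNDS = -0.1   # kept small so agents still dare to kick

def step_reward(role, scored, conceded, team_touched,
                opponents_touched, team_kicked_out):
    """Reward for a single agent at one time step."""
    r = 0.0
    if role == "striker":
        r += EXISTENTIAL            # existential penalty: score fast
        if scored:
            r += 1.0
    elif role == "goalie":
        r -= EXISTENTIAL            # goalies are rewarded for long episodes
        if conceded:
            r -= 1.0
    elif role == "defender":        # equally motivated both ways
        if scored:
            r += 1.0
        if conceded:
            r -= 1.0
    if team_touched:
        r += POSSESSION
    if opponents_touched:
        r -= POSSESSION
    if team_kicked_out:
        r += OUT_OF_BOUNDS
    return r
```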

Humanoid Walking

I also focused this week on the walking humanoid. For simplicity’s sake, I’m calling him Ralph for the rest of this blog. I realize I’ve gotten into the habit of anthropomorphizing my models these days, but I still love how amazing it is that you can wiggle the numbers in some randomized matrices and end up with models that walk and play soccer.

I first modified Ralph so that his arms default to resting near his sides (the Unity scene has his arms out in a T-shape – probably because it makes balance a lot easier). I trained a policy from scratch, until he was able to successfully take a few steps. I then experimented with slightly changing the shape of Ralph’s feet. Perhaps unsurprisingly, even a small shift broke the trained policy, and he fell over at every try.

Ralph wasn’t too happy when I added a toe joint.

I decided to make a few changes at once and then retrain from scratch using PPO. To more closely mimic human walking, I added a toe joint to the model. I also changed the observation state to include the last 3 frames (joint angles and speeds at the last 3 time steps), rather than just the current frame. I speculated that having information about the recent past would help the model learn about the recent trajectories, and would be more useful than just a single snapshot. I still feed all observations directly to a densely connected neural net. In the near future, I’ll also try switching this to be an RNN.
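The frame-stacking idea can be sketched in a few lines (the class and names here are invented for illustration):

```python
from collections import deque

# Sketch of stacking the last 3 observation frames (joint angles and
# speeds) into one input vector for the policy network.
FRAMES = 3

class ObservationStacker:
    def __init__(self, obs_size):
        # Start padded with zero-frames until real observations arrive.
        self.frames = deque([[0.0] * obs_size for _ in range(FRAMES)],
                            maxlen=FRAMES)

    def add(self, obs):
        self.frames.append(obs)   # oldest frame falls off automatically

    def stacked(self):
        # Concatenate oldest-to-newest: the net sees a short trajectory.
        return [x for frame in self.frames for x in frame]

stacker = ObservationStacker(obs_size=2)
stacker.add([0.1, 0.2])
stacker.add([0.3, 0.4])
obs = stacker.stacked()   # 3 frames x 2 values = 6 numbers
```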

After several hours on my laptop, Ralph did figure out a way to move to the target.

He didn’t learn to walk (on this training run), but at least it gets the job done.

I’m now completely fascinated by how easy it is to break these models. I have been thinking a lot about how humans walk, and how effortlessly we pick up a backpack or wear high heels, and we don’t have to relearn how to walk. (Well, maybe high heels do take some practice, but it’s certainly not as disastrous as changing Ralph’s shoes!)

I’m currently experimenting with adding some noise to the training process to see if I can build in more robustness. At each episode, I start the model with a very slightly different foot shape, weight distribution, and floor friction.
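A sketch of that per-episode randomization, with invented parameter names and base values:

```python
import random

# Sketch of per-episode domain randomization: jitter the body and the
# floor slightly at every reset so the policy can't overfit to one
# exact configuration. All names and base values are illustrative.
def randomized_episode_params(rng, jitter=0.05):
    def scale(base):
        return base * rng.uniform(1 - jitter, 1 + jitter)
    return {
        "foot_length": scale(0.12),
        "torso_mass": scale(1.0),
        "floor_friction": scale(0.6),
    }

rng = random.Random(0)
params = randomized_episode_params(rng)   # fresh values each episode
```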

I’m also curious if there’s a better way to damp down all the extraneous arm movements. I’ve been penalizing the model for high hand velocities, but I’m looking to find a more systematic way to reduce energetic motions. I noticed this particularly when I tried putting Ralph in a pool of water to see if he could learn any sort of swimming motion. He soon developed very fast motions that flung his body out of the water (some sort of butterfly stroke on steroids).

In general, I’m interested in some of these things that humans do well, but which Ralph can’t do at all:

  • Ability to make small modifications without needing to retrain
  • Ability to walk on new and uneven surfaces
  • Ability to change to running gait at a certain speed
  • Instinct to expend as little energy as possible

I’m also wondering if it could be effective to split the walking model into two separate neural nets – one only for the lower body, perhaps with only limited observations about upper-body velocity and center of mass. This is partly inspired by biology, since the human body creates a walking rhythm partly from a circuit that loops only to the spinal cord (and not all the way up to the brain). Read more about spinal-cord-driven cyclical walking rhythms here. Most dramatically, meet Mike, the chicken who ran around for 18 months with his head cut off.

Studying other RL algorithms

Throughout the week I often found myself with long stretches of downtime while models were training.

Rather than trying my hand at swordfighting, I watched several more of the Berkeley CS 294 lectures. This is really an excellent course, and Sergey Levine presents the material in such a clear and logically ordered way.

Last week I promised a summary of specific RL algorithms, but then I found this awesome writeup. For the deepest understanding though, I’d highly recommend going straight to the Berkeley lectures. They do take some time (and mathematical fortitude), but are well worth the effort.


Next week I’m going to continue exploring how to make policies more robust. I’m also planning to switch back over to OpenAI’s Gym.

Reinforcement Learning – The Big Picture

So, I survived my first week of studying Reinforcement Learning. I know a good amount of deep learning and I love math, so I assumed RL would be an interesting but fairly comfortable excursion. Turns out, it’s much more of a wild frontier – tons of different algorithms, varying and often contradictory advice about what’s best when, and a lot of math.

‘Everything is still exciting — literally nothing is solved yet!’
-Vlad Mnih, Google DeepMind

I dove in first by watching the lecture videos from last year’s Deep RL Bootcamp and working through the accompanying labs. I also read the first three chapters of Sutton and Barto’s Reinforcement Learning textbook. Lastly, I started watching some of the lecture videos from Berkeley’s CS 294 (Fall 2017).

In retrospect, I’d recommend the opposite order – begin with CS 294, then work through the Deep RL Bootcamp labs, and use Sutton & Barto as a reference. I really enjoyed the bootcamp labs – they provide a great framework, but they don’t feed you every answer, and you do have to understand the various RL algorithms in order to solve them. Lab #2 is specifically about Chainer; if you already know PyTorch, you can probably safely skip it, since it’s just a slight change of notation.

The CS 294 lecture videos go much more deeply into the math, but since it’s presented in such a careful and step-wise manner, I found it much easier to follow. Of the bootcamp lectures, definitely check out Nuts and Bolts of Deep RL Research which is full of great practical tips.

For me, the most difficult part was getting a good overall view of RL. There are so many different algorithms, and it’s easy to get deep into the math only to realize you have no idea why you went there in the first place, or where you are in the bigger picture. I decided to use this blog post to write up a guide to the buzzwords and overall ideas I wish I’d known a week ago.

How is Reinforcement Learning different from Deep Learning?

In deep learning, we train a network by giving it input and then telling it what its output should be. We assign a cost based on how far the network’s answer is from the correct answer, and then train the network to minimize that cost.

In reinforcement learning, we no longer immediately tell our system what the correct answer should be. If our program is playing tic-tac-toe, it has to make several different decisions, and only at the end does it find out whether it won, lost, or tied. The difficulty in RL is how to take that final reward and use it to improve the game strategy. If the program lost, what does it do with this information? Does it blame all its moves, only the most recent moves, or the moves that were different from the games it won? To deal with this uncertainty, we need to play many, many games.
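One standard first answer to this credit-assignment question is discounting: pass the final reward backwards, shrinking it by a factor γ for each earlier step. A quick sketch (the episode and γ here are made up):

```python
# Compute the return-to-go G_t = r_t + gamma * G_{t+1} for each step
# of an episode, so earlier moves get a (discounted) share of the credit.
def discounted_returns(rewards, gamma=0.9):
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# A five-move game with a single win reward at the end:
returns = discounted_returns([0, 0, 0, 0, 1])
```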

S, A, R, and sometimes O

RL deals in States (S), Actions (A), and Rewards (R). State S is all the information about the world now. For a video game, this might be all the screen pixels. For a robot, it might be the camera pixels and all of its own joint angles. Given S, our agent has to choose action A.  We tell the environment our choice, and it tells us what our next state is.

Our agent’s goal in life is to maximize expected reward R.  It does this by learning to choose “good” actions.

Given this Pacman state S, we’d choose A=”Move Right”, expecting soon to get cherry points R

We might live in a deterministic world (like Pacman) where taking action A on state S automatically puts us in a new state S’, but often we’re in a stochastic one, where the outcome is probabilistic.

Observations O come into play if we as the agent don’t know everything about the world state S. We usually accept that O is a good approximation of S, but it’s good to remember these are separate.

Expected reward deals with uncertainty in a stochastic world. If my odds of winning the $10M lottery are 1 in 5, I’ll buy a ticket (a $2M expected reward for my $1 ticket sounds great!). If the odds of winning are 1 in a billion, I’ll probably pass.
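The arithmetic behind that decision is just a probability-weighted payoff:

```python
# Expected profit of a lottery ticket: probability-weighted jackpot
# minus the ticket price.
def expected_profit(p_win, jackpot, ticket_price):
    return p_win * jackpot - ticket_price

great_odds = expected_profit(1 / 5, 10_000_000, 1)    # ~ $2M profit: buy!
terrible_odds = expected_profit(1e-9, 10_000_000, 1)  # ~ -$0.99: pass
```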

Markov Decision Process

This is a fancy name for a simple idea – the way the environment transitions from S(t) to S(t+1) depends only on S(t) and A(t). There’s no looking back in time, and the environment doesn’t care about any of the states and actions in the past.

At first glance, it’s not obvious how this relates to real life at all. If you’re looking at a ball flying through the air, the current position is clearly not enough to tell you what the ball’s next position will be. To make this work, you have to be careful about what information S includes. In the case of our ball, S also needs to include the velocity and even things like the wind speed.

Although a simple idea, the MDP assumption is quite powerful. It means we can choose our action based only on our current state; we don’t need to consider all past states and actions. Also, the transition operator – the way the world moves from the current state-action pair [S(t), A(t)] to the next state S(t+1) – is linear when applied to probability distributions over states.
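Concretely, for a fixed policy the next-state distribution is just a matrix product with the current one. A toy two-state sketch (transition probabilities invented for illustration):

```python
# For a fixed policy, the next-state distribution is a linear function
# of the current one: p' = T p. A made-up two-state example:
T = [[0.9, 0.5],   # row i, column j = P(next = i | current = j)
     [0.1, 0.5]]

def step(p):
    # Matrix-vector product in plain Python.
    return [sum(T[i][j] * p[j] for j in range(2)) for i in range(2)]

p = [1.0, 0.0]     # certainly in state 0 now
p = step(p)        # one transition: [0.9, 0.1]
```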

Policy Gradients

Policy Gradient is a very popular category of RL methods. Here we look to learn a policy, which prescribes what action to take, given the state of the world.  We usually want policy π to be continuous: π(A|S) tells us the probability of choosing action A, given state S.

In a few moments, a good policy would tell me to turn my car about 90° to the right. I’m ok with a chance I’ll be anywhere in the range 85-95°. Hopefully there’s a vanishingly small chance I’ll accidentally turn 90° left.

In addition to mirroring real life, a continuous policy means we have a continuous expected reward function. This is hugely important, since we can then take a gradient and use this to improve our policy.

Often we implement the policy as a neural net, with weights θ. In traditional deep learning, we would train this neural net by defining a loss function, and then optimizing the model to minimize this loss. At each step we take the gradient of the loss function with respect to each of our weights (this is backprop), and then we take a step in the direction that minimizes loss.

We’d like to do the same thing here, except we no longer have a nice expression to differentiate. We’re now looking to maximize expected reward, and given that we’ll get stochastic rewards over time based on our policy, this quickly becomes a mess. Rather than try to use backprop to figure this out analytically, we’ll use…

Monte Carlo Sampling

In RL, we’re often faced with integrals that we can’t solve analytically. We can express our expected reward as an integral over all possible trajectories, but we don’t have a good way to express any of this as a specific function we could integrate. Instead, we can sample a lot of trial runs, and use that to figure out our expected reward.


Monte Carlo sampling is used to calculate difficult integrals. In RL, we use it to calculate the expected reward (using the rewards we get from all our different simulation runs).
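The classic toy example is estimating π by throwing random darts at the unit square and counting the fraction that land inside the quarter circle:

```python
import random

# Estimate pi by "throwing darts" uniformly at the unit square and
# counting how many land inside the quarter circle of radius 1.
random.seed(0)
n = 100_000
hits = sum(1 for _ in range(n)
           if random.random() ** 2 + random.random() ** 2 <= 1.0)
pi_estimate = 4 * hits / n   # the area ratio is pi / 4
```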

One trouble — what if our dart-thrower is biased, and more often hits the left-hand side of the board? This is the problem we run into when sampling the Policy Gradient (we’re sampling trajectories according to our policy, not uniformly over all possible trajectories). We use a very neat trick to rescue this situation: the log-likelihood trick (also known as the likelihood-ratio trick).
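To make the trick concrete, here’s a minimal REINFORCE-style sketch on a two-armed bandit, where each sampled reward is weighted by the gradient of the log-probability of the action we happened to take (all numbers are illustrative):

```python
import math
import random

# REINFORCE on a two-armed bandit: policy is a softmax over two
# preferences; each sampled reward is weighted by d/dtheta log pi(a).
random.seed(0)
theta = [0.0, 0.0]        # preferences for arm 0 and arm 1
lr = 0.1
REWARDS = [0.2, 1.0]      # arm 1 is the better arm

def policy(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(2000):
    p = policy(theta)
    a = 0 if random.random() < p[0] else 1   # sample from the policy
    r = REWARDS[a]
    for i in range(2):
        # Gradient of log softmax: 1[i == a] - p[i]
        grad = (1.0 if i == a else 0.0) - p[i]
        theta[i] += lr * r * grad
```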

Another big issue – if we change our policy slightly, we then have to redo all our samples. There are a few common buzzwords to describe this – policy gradient is an “on-policy” learning technique, and it is not very “sample efficient”. Maybe we have a fast simulator and don’t care about this cost, but maybe we have an expensive and slow robot and would be better off choosing a different technique, such as…

The Q-Function

The other main approach to finding a good policy is to first represent the value of each state. We set Value V(S) to be the expected reward we’d get if we start at S and follow our policy until the end of our trial run. Adding a very slight twist, we could instead track a Q-function. This is nearly the same thing, except here we express the expected reward from state S if we now take action A.

For both V and Q, we can look at the values for a particular policy, or we could consider V* and Q* for an ideal policy – the best expected reward we could get from each state. If we knew that ideal Q*, then at each state S we should just choose the action that maximizes it.

The trouble of course is that we don’t know V* or Q* a priori. We generally start with a randomized, nonsense Q, and then look for ways to improve it.

Tabular Q-Learning vs. Deep Q-Networks (DQN)

The choice depends on the size of our world — can we fit all our states S and actions A in a nice lookup table? If so, we’re in the world of Tabular Q-Learning. If not, we need a formula, or some other way to generate our Q for each S and A. We could look to fit some linear function, or, more often, we throw a neural net at the problem.


There is a constant tension between Exploration (trying out new paths) and Exploitation (sticking to the actions you know will yield good rewards). In Q-Learning, our “best” action is the one that maximizes Q. However, if we only ever take this action, we won’t explore other options, and we might totally miss a different route that is much more rewarding.

We use ε-greedy to address this. Some small percentage of the time (say ε = 0.1), we take a random action; the rest of the time, we choose the greedy option (the action with the highest Q-value). We could start out with a high ε and then lower it as our Q-function gets more accurate and we become more confident about which states are good.
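A minimal tabular Q-learning loop with ε-greedy exploration, on a made-up five-cell corridor world (all hyperparameters are illustrative):

```python
import random

# Tabular Q-learning with epsilon-greedy exploration on a 5-cell
# corridor: start at cell 0, reward +1 for stepping off the right end.
random.seed(0)
N_STATES, ACTIONS = 5, (-1, +1)          # move left / move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2

def greedy(s):
    # Break ties randomly so the initial all-zero Q still explores.
    return max(ACTIONS, key=lambda a: (Q[(s, a)], random.random()))

for _ in range(300):                      # episodes
    s = 0
    for _ in range(100):                  # cap episode length
        a = random.choice(ACTIONS) if random.random() < eps else greedy(s)
        s2 = s + a
        if s2 >= N_STATES:                # stepped off the right end: +1
            Q[(s, a)] += alpha * (1.0 - Q[(s, a)])
            break
        s2 = max(0, s2)                   # wall on the left
        target = gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
```

After training, moving right beats moving left from every cell, which is exactly the greedy policy we’d hope for.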

Fake it ’til you make it

Turns out, this is not just my own favorite motto, but also the way most of reinforcement learning works.

The goal in RL is to find a way to maximize expected reward. We do this by learning a good policy π (in Policy Gradient methods) or a good Q-Function (in Q learning).  Or we could learn some combination of the two (hint: Actor-Critic methods), but I’ll get to that next week.

However, in any of these cases, we usually start things out by initializing all our parameters randomly.

TFW someone just reinitialized all your parameters.

The magic comes in when we evaluate how our policy performed, and use this to figure out how to improve π or Q. As we iterate over and over, we hope that little by little we’re getting better, and eventually we’ll reach our goal.

This raises several interesting questions — how do we know we’re going to get better?  Are we guaranteed to converge to some final policy?  Will it be a good final policy?   How long will it take to converge?  Could starting out at a different random initialization yield a wildly different outcome? Are there tricks we could use to reduce variance and make our algorithm more likely to succeed? How do we pick which algorithm to use?

A lot is already known about these topics, but in many cases, these are still open questions.

Google DeepMind’s humanoid run. Converging on a policy doesn’t automatically guarantee it’s the best policy…



Next week I’m planning to continue my exploration of RL. In my blog post I’ll do a quick tour through various RL algorithms and I’ll write a bit about the math behind this all. I’m also planning to watch more Berkeley CS 294 lectures and work through those labs, so I’ll report back on what I learn!

Training Neural Nets in the Browser

Two months ago, Google announced its new TensorflowJS, for training and inference in the browser. This opens the door to a lot of amazing web applications.

Try out Webcam Pacman and Mobile Phone Emoji Hunt for a taste of what’s possible.

With TensorflowJS, we can take a pretrained model and personalize it with the user’s own data (keeping that data local and private!). We can also make fast predictions, without needing to wait to send data to a model up in the cloud.

I’m interested in how to teach deep learning well, and so I decided to take a week and explore TensorflowJS thoroughly. As Tensorflow Playground demonstrated, it’s so helpful for a deep learning newcomer to have the chance to tinker easily.

Continuing the idea of experimenting with hyperparameter and architectural choices, I created a demo page where you can try out training different kinds of models on the MNIST set and see in real time how your parameter choices affect the final accuracy. This is roughly based on one of the TFJS tutorials, but I added several different model and parameter options, and I fixed the bias in their train/test split. Hop over to my demo page to try it out.

Like everything in deep learning, TFJS is moving fast. Keep an eye on ml5.js. They’re building a wrapper around TensorflowJS that aims to be a “friendly machine learning library” for artists, hobbyists, and students. They have some beautiful looking demos. Personally I wouldn’t call it “friendly” just yet — they’re still missing a lot of documentation, and at this point nothing I tried out worked easily. I suspect in another month or two it’ll be in great shape.

Keep an eye on ml5js. It’s still developing (this is from the demo on their homepage and I’m 99.07% sure that’s *not* a robin!), but looks like it’ll be very exciting soon, especially for artists!

Getting to know TensorflowJS

I came to TensorflowJS with a fair amount of Tensorflow experience, but only a little web design experience. I suspect the path would actually be easier coming from the opposite direction. TensorflowJS itself is quite easy and a very natural extension of regular Tensorflow. I had no problems creating fun local extensions of the tutorials, though I had a trickier time sorting out deployment to production (long story short, tfjs-angular-firebase seems a good way to go).

I began the week by working through the TensorflowJS Tutorials. Much like Tensorflow itself, you can work with TFJS at different levels of abstraction. Each tutorial focuses on one of these levels:

    1. Math: Polynomial Regression introduces the lowest level. Here you can do math operations directly (adding, matrix multiplications, etc.). It’s not the level where you’d normally implement a model, but the neat thing is that here we don’t need to be doing neural nets at all. In fact, check out the Realtime tSNE Visualization. It uses TFJS to create real-time interactive visualizations of high-dimensional data.


    2. Layers: MNIST Training moves one level up. This will feel very familiar to anyone who has used Keras. We start with model = tf.sequential() and then add layers (convolutional, fully connected, pooling, relu, etc.) to this model. Then we can compile the model and train it with model.fit().
    3. Pretrained Model: The Pacman Tutorial is by far the most fun. This introduces importing a pretrained model (here we use mobilenet), which we then fine-tune on a set of webcam images. We start with four categories (up/down/left/right), although it’s trivial to change this in the code. After fine-tuning, we switch to prediction mode and feed the streaming results directly to the Pacman game, so we can now control the game with our webcam.


Personally, I like to work through tutorials by looking over the code, then switching to a blank page and trying to recreate it myself. It’s more painful than just reading through the code samples, but I highly recommend this for code that you want to understand well — it’s amazing how many little details you notice by taking this extra step. In my case, the Pacman tutorial took me a morning to recreate from scratch (initially it involved a fair amount of glancing back at the original code, but I soon felt increasingly independent).

TensorflowJS Summary

Based on my few days of playing with TensorflowJS, it works best when you have a straightforward pretrained model that needs fine-tuning. TFJS isn’t geared for customized loss functions, lambda layers, or other personalizations. Things are changing fast though, and this may soon be easier to do. My main annoyance with TFJS is the lack of guidance – the tutorials are nice, but I couldn’t find good documentation beyond that. I often found I’d try to use a Tensorflow function, only to discover that it doesn’t exist in TFJS.

Between TFJS and TFLite (Tensorflow geared for mobile devices), new deep learning web and mobile apps will be very exciting to watch. In particular, I think it’ll be a fantastic tool for artists, musicians, and educators.

Demo Page

As part of my tinkering this week, I created a TensorflowJS demo page that lets you try out several different models for training on MNIST. You can try convolutional nets with different filter sizes, or you can go the fully connected route and see how that compares. You can try removing the pooling layers or the ReLU layers. Most variations eventually reach good accuracy, although some train more slowly than others. You can also play around with the learning rate and the number of batches to train.


I’m excited for next week as I begin my dive into Reinforcement Learning. I’m planning to study the Deep RL Bootcamp and UC Berkeley CS 294 over the next three weeks and will continue to tell the tales here.

Learning About Deep Learning

I’m thrilled to be starting in the Scholars Program at OpenAI this June. I was a physics major and I’ve always loved math, but a year ago I didn’t have any deep learning or AI knowledge. I’d also stepped away from science for a few years while my kids were very young. This is how I got back up to speed. I’d love to hear from everyone else what they’ve found useful – please add comments about ideas, courses, competitions, scholarships, etc. that I’ve missed.

It’s both thrilling and completely overwhelming how much you can learn online at this point. Here are the main challenges I found in learning independently:

  1. Choosing what to work on
  2. Staying on schedule
  3. Learning actively (not passively!)
  4. Proving how much you’ve learned

Choosing what to work on

These are the courses I’ve found to be very high-yield and worth the time. I do browse other courses, particularly if there’s a specific topic I need to learn, but these are the ones I recommend start-to-finish.

  • Jeremy Howard & Rachel Thomas’ FastAI sequence. The 2017-2018 course uses PyTorch. There’s also an older version of the course which has mostly similar material in Keras and Tensorflow.
  • Andrew Ng’s Deep Learning Specialization on Coursera. This dives more into the math behind deep learning and is a fantastic overall introduction. Sometimes the techniques taught are less cutting-edge than the FastAI ones.
  • Jose Portilla’s Python courses on Udemy, particularly Python for Data Science. I came into this not knowing Python at all, so appreciated having this great introduction to python, numpy, scipy, and pandas.
  • Kaggle competitions – a great way to test your skills and learn the current cutting edge.
  • HackerRank – great place to prepare for interviews and programming tests (it’s not deep learning specific)

Staying on Schedule

Here I think the most important thing is to know your own personality and play to it. I’m very project- and goal-oriented. Once I’m working on a specific task, I have no trouble staying with it for hours, so I’ve tended to binge-watch courses (particularly Andrew Ng’s series). On the other hand, I know I’m not very good at juggling several different projects, or at working without a specific goal. I try to keep this in mind when planning my schedule for the week.

I also like Jeremy Howard’s advice for working on Kaggle competitions. He suggests working on the project for half an hour *every* day, without fail. Each day you make some slight incremental progress.

With that in mind, I try to learn one new topic every day (even if it’s a relatively small detail), whether by reading a paper, watching a course video, or reading a blog post. Recently I met some of the Google Brain team, and when the topic turned to Variational Auto-Encoders, by chance I knew all about them, since they’d happened to be my chosen topic one day. I keep a small journal with half a page of notes on whatever I learn each day.

Learning actively (not passively!)

The big danger of online courses is that it’s far too easy to watch a bunch of videos (on 1.5x speed) and then a week later not remember any of it.

Both the FastAI and Deep Learning Specialization have very active forums. It’s definitely worth participating in those – both asking questions and trying to answer others. After taking the Coursera sequence, I was invited to become a mentor for the CNN and RNN courses, and I’m sure I’ve learned far more from trying to teach others than I did taking the course on my own.

This is also where the Kaggle competitions are extremely valuable. It’s one thing to be able to follow through a Jupyter Notebook. It’s something totally different to start with just the data and create all the structure from scratch.

Proving how much you’ve learned

After all the online courses, it’s helpful to create some tangible proof of what you know. My suggestions are a github project, a Kaggle competition, and some blog posts.

Jeremy Howard gave me the advice to focus on one great project, rather than several decent, half-baked ones. He says to really polish it up, even to make a nice web interface.

Along the same lines, it’s great practice to try out several different Kaggle competitions, but it’s important to focus on one and score highly on that one.

I’ve written a lot as a mentor for Andrew Ng’s courses. I’ve always been impressed how I have to understand things so much more deeply in order to explain them well in writing. This is my first foray into blog post writing – I’m naturally a fairly quiet and reserved person, so I’m having to consciously push myself to do this, but it’s also an exciting way to connect with the data science world.