World Cup Soccer: Team Spirit
This past week I continued to focus on my soccer RL environment. I’m thinking about how team dynamics could emerge, and I’m particularly interested to see if I can establish inter-agent communication.
I wrote previously about the problem of training multiple strikers at once: if the team scores, all strikers are rewarded, even if 2 of 3 are doing nothing useful. I had addressed this by training smaller numbers of players at a time, but a better solution is to introduce “team spirit”. This value is gradually annealed from 0 to 1 so that early on the players are only interested in individual rewards, whereas later on they care about maximizing team rewards.
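A minimal sketch of how a team spirit value could blend rewards, not the code from this project. It assumes the team reward is the mean of the players' individual rewards and that the annealing is linear; the function names and the `anneal_steps` parameter are hypothetical.

```python
def blended_rewards(individual_rewards, team_spirit):
    """Mix each player's own reward with the team's mean reward.

    team_spirit = 0 -> purely individual rewards
    team_spirit = 1 -> every player receives the team mean
    """
    if not 0.0 <= team_spirit <= 1.0:
        raise ValueError("team_spirit must be in [0, 1]")
    team_mean = sum(individual_rewards) / len(individual_rewards)
    return [(1.0 - team_spirit) * r + team_spirit * team_mean
            for r in individual_rewards]


def annealed_team_spirit(step, anneal_steps):
    """Linearly ramp team spirit from 0 to 1 over anneal_steps (assumed schedule)."""
    return min(1.0, max(0.0, step / anneal_steps))
```

With a low team spirit, a striker who stands still while a teammate scores gets almost none of the credit, which is the behavior the idle-striker problem calls for early in training.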
Team Evolution vs. Individual Player Evolution
The Unity demo had one brain controlling both goalies, and a separate brain controlling both strikers (it didn’t include defenders). I wanted to add the ability for the two teams to diverge and develop different skills. In TensorFlow, this is a matter of having copies of the neural nets (one for each team, and each within a separate variable scope). Then at certain time points, I copy the values from the more successful brain over to the others.
This way I have the option to evolve individual players (at a given stage, pick the best goalie, best striker, and best defender, and reset all the teams with those values). Alternatively, I can allow the teams to evolve as a whole (this is the traditional self-play model, as in AlphaGo Zero). My guess is that initially it will be better to evolve individuals, and then as the level of play gets high, it will be better to switch to evolving entire teams.
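The two evolution modes can be sketched roughly as follows. This is a framework-free illustration, not the TensorFlow code: each brain's weights are stood in for by a plain Python object, and the `teams` and `scores` dictionaries and both function names are hypothetical.

```python
import copy


def evolve_individuals(teams, scores):
    """For each role, copy the best-scoring brain's weights to every team.

    teams:  {team_name: {role: weights}}
    scores: {team_name: {role: float}}
    """
    roles = list(next(iter(teams.values())).keys())
    for role in roles:
        best_team = max(teams, key=lambda t: scores[t][role])
        best_weights = teams[best_team][role]
        for t in teams:
            if t != best_team:
                teams[t][role] = copy.deepcopy(best_weights)


def evolve_teams(teams, team_scores):
    """Copy every brain of the best-scoring team over to the other teams."""
    best_team = max(teams, key=lambda t: team_scores[t])
    for t in teams:
        if t != best_team:
            teams[t] = copy.deepcopy(teams[best_team])
```

In TensorFlow the deep copy would instead be an assignment from the variables in one scope to the matching variables in another, but the selection logic stays the same.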
Individual players send out rays in several different directions, and then track if and where a ray hits an object of interest (the ball, the goal, an opponent, etc.). Initially, I thought having more observations would be better, and I increased the number of rays and the number of object categories. However, I realized this actually slows down training quite a bit. In theory, more input is helpful, but in practice it means much larger matrices in the neural net, which in turn require more training data (and bring more risk of overfitting to that data).
Since the players also have the ability to rotate quickly (and thereby get a broad sweep of all the angles around them), it works well to give them a more limited number of input angles.
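To make the cost of extra rays concrete, here is a back-of-the-envelope sketch. It assumes each ray is encoded as a one-hot over object categories plus a "hit nothing" flag and a normalized distance; that encoding, the function names, and the hidden layer size are assumptions for illustration, not details from this project.

```python
def observation_size(num_rays, num_categories):
    """Length of the observation vector under the assumed per-ray encoding:
    one-hot category + one 'hit nothing' flag + one normalized distance."""
    return num_rays * (num_categories + 2)


def first_layer_weights(num_rays, num_categories, hidden_units):
    """Number of weights in a dense first layer fed by the ray observations."""
    return observation_size(num_rays, num_categories) * hidden_units


# Example: a small setup vs. a larger one, both with a 256-unit first layer.
small = first_layer_weights(num_rays=7, num_categories=5, hidden_units=256)
large = first_layer_weights(num_rays=20, num_categories=10, hidden_units=256)
```

Because the input size is a product of rays and categories, increasing both at once grows the first weight matrix much faster than increasing either alone.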
I experimented with changing the possible actions available to the players. I added the option to kick the ball further and up in the air. (I also accidentally added in an awesome little bug which caused my players to start flying around as if they were playing Quidditch.)
I also tried giving the goalies the option to jump. Currently I don’t include this (having too many action options is similar to having too large an input space — in an environment with sparse rewards, it makes it that much harder for the agents to learn a good policy).
Next week, I’m planning to try switching the action space to be continuous. Right now, the players can choose discrete actions (move left, right, forward, back, turn, kick). I’d like to give them the option of moving more or less, and the ability to choose the kick strength and direction. This also will make it easier to add in communication, as described below.
The Unity demo scene puts up walls around the field, to avoid the issue of out-of-bounds entirely. I wanted to make a more realistic soccer simulation, and so I removed the walls (and also made the field larger and added players). Originally, I treated out-of-bounds as the end of an episode. However, since the strikers and defenders have an existential penalty (designed to make them want to score as quickly as possible), I found they were too willing to hit the ball out of bounds to end an episode.
When I instead penalized hitting the ball out, they became too afraid to go near the ball when it was by the side line. I decided to add in a simple possession switch when the ball goes out of bounds. The ball’s position is reset to the sideline, and an opposing team player is moved next to it. For corner kicks and goal kicks, the ball is placed accordingly.
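The restart decision could look something like the sketch below. It assumes a field centered at the origin with the x axis running goal to goal, that red defends the negative-x end, and that goals are detected before this check runs; the function name and coordinate convention are hypothetical.

```python
def restart_after_out(ball_x, ball_z, last_touch, half_length, half_width):
    """Return (restart_type, team_in_possession), or None if the ball is in play.

    last_touch is "red" or "blue": the team that last touched the ball.
    """
    other = "blue" if last_touch == "red" else "red"
    if abs(ball_x) > half_length:
        # Ball crossed a goal line. Assumption: red defends the -x end.
        defending = "red" if ball_x < 0 else "blue"
        if last_touch == defending:
            return ("corner_kick", other)
        return ("goal_kick", defending)
    if abs(ball_z) > half_width:
        # Ball crossed a sideline: the opposing team restarts from there.
        return ("throw_in", other)
    return None
```

In every case possession goes to the team that did not put the ball out; only the placement of the ball differs between the three restart types.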
I added the optional ability to start the players and/or the goals at a larger size. This makes it easier for the agents to find rewards (it makes it more likely they’ll accidentally hit the ball, and more likely the ball will touch the goal). Qualitatively, it works well to start the players and goals at a larger size and then gradually step them down to normal size.
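One possible shape for that size curriculum is a simple step schedule. The starting scale, number of stages, and step counts below are made-up defaults for illustration, and `scale_at` is a hypothetical helper.

```python
def scale_at(step, start_scale=2.0, total_steps=1_000_000, num_stages=5):
    """Step the size multiplier down from start_scale toward 1.0 in equal stages.

    Returns start_scale during the first stage and exactly 1.0 (normal size)
    once total_steps have elapsed.
    """
    if step >= total_steps:
        return 1.0
    stage = step * num_stages // total_steps  # integer in 0 .. num_stages - 1
    return start_scale - (start_scale - 1.0) * stage / num_stages
```

The same multiplier could be applied to player colliders, goal colliders, or both, depending on which flag is switched on.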
I also added the ability to set subgoals: a reward for touching the ball, a reward for passing, a reward for the ball reaching the opponent’s goalie box, and a reward for the goalie staying in the goalie box. I normally set these to zero, but they are available for experimentation. The reward for touching the ball seems to be useful (especially at early training stages), but it needs to be carefully balanced: if set too high, the agent grabs the ball and then goes to a corner and spins round and round with the ball – racking up points far away from the other team.
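A sketch of how such optional subgoal rewards might be organized, with every weight defaulting to zero. The field names and the event-set interface are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SubgoalWeights:
    """Per-event shaping rewards; all default to zero (disabled)."""
    touch_ball: float = 0.0
    pass_ball: float = 0.0
    ball_in_opponent_box: float = 0.0
    goalie_in_box: float = 0.0


def subgoal_reward(events, weights):
    """Sum the weights of the subgoal events that fired for one agent this step.

    events: a set of names matching the SubgoalWeights fields.
    """
    return sum(getattr(weights, name) for name in events)
```

Keeping the weights in one place makes it easier to notice when a single term, such as the touch reward, is large enough to dominate the reward for scoring.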
Modelling Energy and Athleticism
I suspect that the need to pass the ball comes from limitations in player speed. I set an adjustable factor so that “dribbling” the ball is slower than running without it. Karthik suggested that passing also comes from each individual player not having infinite energy to run. I added a slight reward for “resting” (choosing to rotate or stay stationary, rather than to run). This reward also needs to be set very carefully, and I haven’t yet found the best way to manage it. When set too high (especially early on in training), the players prefer to stand in place and they ignore the ball entirely. I’ll experiment with starting it at zero and then gradually phasing in a penalty for expending energy.
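The planned schedule (nothing at first, then a slowly growing cost for running) might be sketched like this. The action names, step counts, and penalty size are placeholders, not tuned values.

```python
RESTING_ACTIONS = {"rotate", "stay"}  # assumed names for non-running actions


def energy_penalty(action, step, warmup_steps=200_000,
                   ramp_steps=800_000, max_penalty=0.001):
    """Per-step reward term for expending energy.

    Zero for resting actions and during the warm-up period; afterwards the
    penalty for running ramps linearly up to max_penalty.
    """
    if action in RESTING_ACTIONS or step < warmup_steps:
        return 0.0
    frac = min(1.0, (step - warmup_steps) / ramp_steps)
    return -max_penalty * frac
```

Delaying the penalty gives the players time to learn that chasing the ball pays off before standing still starts to look attractive.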
Additionally, I may add in a factor to model the chance that an opponent steals the ball (probably proportional to the number of opponents at close range).
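A minimal sketch of such a steal chance, assuming it scales linearly with the number of opponents inside a fixed radius and is capped at 1. The radius and per-opponent probability are invented values.

```python
import math


def steal_probability(carrier_pos, opponent_positions,
                      radius=1.5, per_opponent=0.1):
    """Chance per step that the ball carrier loses the ball.

    Proportional to the number of opponents within `radius` of the carrier,
    capped at 1.0. Positions are (x, z) tuples.
    """
    nearby = sum(1 for p in opponent_positions
                 if math.dist(carrier_pos, p) <= radius)
    return min(1.0, per_opponent * nearby)
```

The environment would then compare this value against a uniform random draw each step to decide whether possession changes.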
My next big project is to add in communication between the players. I’d like each player to generate a short message (a vector of floats) that will then be passed to other players on its team as input. Per Karthik’s suggestion, I’ll first start with only a pair of players, and then I’ll try to build up from there if it seems successful. Ideally, it would be interesting if the players could learn to coordinate where to move. I’m planning to test for successful communication by having the team play against an identical copy that has messaging zeroed out.
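One way the message passing and the zeroed-out control could be wired up is sketched below. It assumes each player's network input is its own observation followed by its teammates' messages from the previous step; `build_inputs` and its arguments are hypothetical.

```python
def build_inputs(observations, messages, messaging_enabled=True):
    """Append teammates' previous-step messages to each player's observation.

    observations: list of per-player observation vectors (lists of floats)
    messages:     list of per-player message vectors from the previous step
    With messaging_enabled=False the message slots are filled with zeros, so
    the input size stays the same but no information is shared.
    """
    inputs = []
    for i, obs in enumerate(observations):
        extra = []
        for j, msg in enumerate(messages):
            if j != i:  # a player does not receive its own message
                extra.extend(msg if messaging_enabled else [0.0] * len(msg))
        inputs.append(list(obs) + extra)
    return inputs
```

Keeping the input shape identical in both modes means the same trained networks can be run with and without messages, which is what the comparison against a silenced copy of the team requires.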
Along with reorganizing the reward structure, I also spent time this week cleaning up the soccer code. Not much to say here – it was a mess! (Part my fault, and part what I’d inherited from the Unity demo scene.) I also added in the option to train individual players (placing the ball in ways they would need to react to) and I added some metrics (number of passes, time ball is sitting untouched, team possession time).
Yet again I’m reminded how tricky it can be to debug reinforcement learning projects. In the course of refactoring, I accidentally flipped a team name so that I wasn’t fully rewarding the red team strikers when they scored. Eventually looking back through the code, I noticed my mistake, but there were no red flags in training. Everything appeared to be training normally, and I thought it was just by chance that the red team happened to be slightly worse than the blue team.
My original plan had been to pivot at this point (four weeks into the OpenAI Scholar Program) and focus on medical applications for the rest of my time here. However, after talking with Karthik, I’ve decided to focus on this soccer multi-agent problem for a while and see where it leads. During any long stretches where I’m waiting for models to train, I’ll start doing a survey of the current state of medical deep learning.