Multi-Agent Training: Lessons learned from training 3 separate competing networks
Over the course of the weekend, I dug further into my soccer RL project. I realized that while the Striker and Goalie brains had learned very good policies, the Defenders were terrible.
To test this, I set up a defenders-only game, and the results were abysmal.
I realized my mistake: the defenders were getting small rewards and penalties when the team scored or was scored on. However, this had very little to do with how each individual defender was acting. The strikers and goalies were doing all the work, and the defenders were not getting a strong signal to train on (any true signal was getting lost in the noise of what the strikers and goalies were doing).
I decided to train the defenders on their own. However, I made two main mistakes (obvious in retrospect, but I’m posting them here in the hope they’re helpful to others).
First, I invented a different reward system for the defenders. Since I didn’t want them just to turn into strikers, I gave them only a small reward for scoring. Instead, I figured they should be rewarded for moving the ball away from their own goal, so I gave them a penalty according to how deep the ball was in their own half (it was proportional to the ball’s distance away from the centerline).
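A minimal sketch of this first reward scheme, assuming a field whose long axis runs from `-half_length` to `+half_length` with the defender's goal at the negative end; the field size and penalty scale are illustrative, not values from the actual project:

```python
def defender_step_reward(ball_x: float, half_length: float = 15.0,
                         depth_scale: float = 0.001) -> float:
    """Per-step reward: penalize the defender in proportion to how far
    the ball sits behind the centerline (x < 0 is the defender's half)."""
    if ball_x < 0:
        # Deeper in the defender's own half -> proportionally larger penalty.
        return -depth_scale * (abs(ball_x) / half_length)
    return 0.0
```

As the next paragraph explains, the problem with this shaping is that the ball's position changes for many reasons unrelated to the defender's own actions, so the signal is mostly noise.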
After some time, I realized this made an extremely confusing reward for the players. On one run, a player might get a better reward just because the ball was slightly further from its own goal, even if that had nothing to do with any action the player took. Without an enormous number of samples, it's difficult to attribute a reward or penalty to the player's own actions.
To fix this, I split the field into three parts, and gave the defender a large existential penalty if the ball was in the back third, a smaller penalty in the middle, and a small bonus in the attacking third. Unfortunately, this then led to a crazy policy where the defender would bring the ball to the attacking third and then spin around indefinitely. Finally, I made even the attacking third a very small penalty, to entice the defender to score and thereby end the episode.
I set the defenders training 4v4. Watching the game, this seemed a good choice, since they would at least sometimes get to the ball and occasionally score. However, I realized that yet again the signal for individual action was too small compared with the size of the team rewards. All players got a bonus when the team scored, even if three of the four were doing nothing.
To fix this, I trained 1v1 for a while, then moved to 2v2. I didn't want to train only 1v1, since ideally I'd like to see only the nearest player go to the ball, or even see some preference for passing emerge. Once the defender network seemed decent, I sometimes set one defender against one striker.
I automated this training process, and now at the start of each episode, the program randomly decides between 1v1, 2v2, and 3v3. It adds in goalies 10% of the time. For each team, players are randomly assigned to be strikers or defenders.
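The randomized episode setup described above can be sketched roughly as follows. The function name and the config structure are hypothetical (the real project resets a Unity environment); the probabilities mirror the text:

```python
import random

def sample_match_config(rng: random.Random) -> dict:
    """Randomly pick a match configuration at the start of each episode."""
    team_size = rng.choice([1, 2, 3])       # 1v1, 2v2, or 3v3
    with_goalies = rng.random() < 0.10      # goalies added 10% of the time
    roles = {
        team: [rng.choice(["striker", "defender"]) for _ in range(team_size)]
        for team in ("home", "away")
    }
    return {"team_size": team_size,
            "with_goalies": with_goalies,
            "roles": roles}
```

Sampling the configuration per episode keeps a single set of networks exposed to the whole curriculum, rather than training separate runs for each matchup.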
Plans for future training
I’m still curious to see whether true team dynamics can emerge naturally. My guess is that the game simulation will need to change slightly for this to happen: players have no motivation to pass if they are fast enough to run with the ball straight to the goal and there is little chance an opponent can take it away. Next, I’m slowing the overall running speed and making “dribbling” slower than running without the ball.