The first great conquest of artificial intelligence was chess. The game has a staggering number of possible combinations, but it was relatively tractable because it is structured by a set of clear rules. An algorithm could always have perfect knowledge of the state of play and know every possible move that both it and its opponent could make. And the state of the game could be judged simply by looking at the board.
But many other games are not so simple. Take something like Pac-Man: figuring out the ideal move would involve considering the shape of the maze, the location of the ghosts, the location of any additional areas to clear, the availability of power-ups, and so on, and the best plan can end in disaster if Blinky or Clyde makes an unexpected move. We've developed AIs that can handle these games, too, but they had to take a very different approach from the ones that conquered chess and Go.
At least they did until now. Today, Google's DeepMind division published a paper describing the structure of an AI that can handle both chess and Atari classics.
Reinforcement of trees
The algorithms that have worked on games like chess and Go plan their moves using a tree-based approach, essentially looking ahead at all the branches that result from each possible action in the present. This approach is computationally expensive, and the algorithms rely on knowing the rules of the game, which allow them to project the current game state into possible future game states.
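The tree-based idea can be sketched with a toy example. Below is a minimal negamax search over the game of Nim; this is not AlphaZero's actual Monte Carlo tree search (which is guided by a neural network), but it shows the key ingredient: because the rules are known, the algorithm can project the current state into every possible future state.

```python
from functools import lru_cache

# Toy rules-based game: Nim. Players alternate removing 1-3 stones;
# whoever takes the last stone wins.

@lru_cache(maxsize=None)
def value(stones):
    """Outcome for the player to move: +1 = forced win, -1 = forced loss."""
    if stones == 0:
        return -1  # the opponent just took the last stone
    # Try every legal move; the opponent's loss is our win (negamax).
    return max(-value(stones - take) for take in (1, 2, 3) if take <= stones)

def best_move(stones):
    """Pick the removal whose resulting position is worst for the opponent."""
    legal = [take for take in (1, 2, 3) if take <= stones]
    return max(legal, key=lambda take: -value(stones - take))
```

With these rules, any pile that is a multiple of four is a forced loss for the player to move, and the search discovers that on its own simply by exhausting the tree.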
Other games have been mastered by algorithms that don't model the state of the game at all. Instead, these algorithms simply evaluate what they "see" (typically something like the positions of pixels on an arcade game's screen) and choose an action based on that. There is no internal model of the game's state, and the training process largely consists of figuring out which action is appropriate given that visual information. There have been attempts to model a game's state from inputs such as pixel information, but they have not fared as well as the successful algorithms that respond only to what's displayed on the screen.
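A minimal sketch of that model-free setup follows. A random linear layer stands in for a trained deep network (the 84x84 frame size matches common Atari preprocessing, but everything here is illustrative): the policy maps raw pixels straight to an action, with no planning and no internal game state.

```python
import random

ACTIONS = ["up", "down", "left", "right"]

# Stand-in for a trained policy network: one row of weights per action.
# A real agent would learn these weights from play; here they are random
# numbers, just to show the shape of the computation.
random.seed(0)
WEIGHTS = [[random.gauss(0, 1) for _ in range(84 * 84)] for _ in ACTIONS]

def act(screen):
    """Map a flattened 84x84 grayscale frame directly to an action.

    Pure reaction: no model of the game state, no lookahead.
    """
    scores = [sum(w * p for w, p in zip(row, screen)) for row in WEIGHTS]
    return ACTIONS[scores.index(max(scores))]
```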
The new system, which DeepMind calls MuZero, is based in part on DeepMind’s work with the AlphaZero AI, which taught itself to master rules-based games such as chess and Go. But MuZero also adds a new twist that makes it significantly more flexible.
That twist is called "model-based reinforcement learning." In a system that uses this approach, the software uses what it can see of a game to build an internal model of the game's state. Critically, that state is not pre-structured based on any understanding of the game; the AI has a lot of flexibility regarding what information is or isn't included in it. The reinforcement learning part of the name refers to the training process, which lets the AI learn to recognize when its model is both accurate and contains the information it needs to make decisions.
The model it creates is used to make a number of predictions. These include the best possible move given the current state and the state of play that results from that move. Critically, these predictions are based on the internal model of game states, not on the actual visual representation of the game, such as the locations of chess pieces. The predictions themselves are made based on past experience, which is also subject to training.
Finally, the value of a move is evaluated using the algorithm's predictions of any immediate rewards gained from that move (the point value of a piece taken in chess, for example) and of the final state of the game, such as the win-or-lose outcome of chess. These evaluations can involve the same searches through trees of potential game states done by earlier chess algorithms, but in this case, the trees consist of the AI's own internal game models.
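The way immediate rewards combine with a predicted end state can be written as a standard discounted n-step return. This is a generic reinforcement learning formula rather than MuZero's exact training target, and the discount factor here is purely illustrative:

```python
def n_step_value(rewards, final_value, discount=0.997):
    """value = r_0 + d*r_1 + ... + d^(n-1)*r_(n-1) + d^n * final_value

    rewards: immediate rewards predicted for each of the next n moves
    final_value: predicted value of the state reached after those moves
    discount: how much less a reward counts for each step of delay
    """
    value = 0.0
    for step, reward in enumerate(rewards):
        value += (discount ** step) * reward
    # The predicted final outcome is discounted by the full lookahead depth.
    return value + (discount ** len(rewards)) * final_value
```

So a capture worth one point now plus a predicted eventual win both contribute to a move's value, with the distant outcome weighted down by the discount.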
If that’s confusing, think of it this way: MuZero runs three evaluations in parallel. One (the policy process) chooses the next move given the current game state model. A second predicts the new state that results, along with any immediate rewards from the move. And a third takes past experience into account to inform the policy decision. Each of these is the product of training, which focuses on minimizing the errors between these predictions and what actually happens in the game.
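Those three evaluations correspond to the paper's three learned functions: a representation function that encodes the observation into an internal state, a dynamics function that predicts the next state and its reward, and a prediction function that outputs move scores and a value. The sketch below replaces each neural network with trivial stand-in arithmetic, and a greedy rollout stands in for MuZero's actual Monte Carlo tree search; the point is the interfaces, and the fact that after the first observation, planning runs entirely inside the learned model.

```python
def representation(observation):
    """h: raw observation -> internal state (stand-in: a toy encoding)."""
    return sum(observation)

def dynamics(state, action):
    """g: (state, action) -> (next state, immediate reward) -- all imagined."""
    return state + action, float(action)

def prediction(state):
    """f: state -> (scores over 3 possible actions, value estimate)."""
    return [state % 5, 1.0, 2.5], float(state)

def plan(observation, depth=3):
    """Greedy rollout inside the model: encode the observation once, then
    repeatedly pick the highest-scoring action and imagine its outcome.
    The real game's rules and pixels are never consulted after the first
    observation."""
    state = representation(observation)
    total_reward = 0.0
    for _ in range(depth):
        scores, _value = prediction(state)
        action = scores.index(max(scores))
        state, reward = dynamics(state, action)
        total_reward += reward
    return total_reward
```

Training would adjust all three functions at once so that the imagined states, rewards, and scores track what actually happens in the real game.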
Obviously, the folks at DeepMind wouldn’t have a paper in Nature if this didn’t work. MuZero took just under a million games against its predecessor AlphaZero to reach a similar level of performance at chess and shogi. For Go, it surpassed AlphaZero after just half a million games. In all three cases, MuZero can be considered superior to any human player.
But MuZero also excelled at a suite of Atari games, something that had previously required a completely different AI approach. Compared to the previous best algorithm, which uses no internal model at all, MuZero had a higher average and median score in 42 of the 57 games tested. So while there are still some circumstances where it lags, it has now made model-based AIs competitive at these games while retaining the ability to take on rule-based games like chess and Go.
All in all, this is an impressive achievement and an indication of how AIs are becoming increasingly sophisticated. A few years ago, training an AI for just one task, such as recognizing a cat in photos, was a feat. Now we can train multiple aspects of an AI at the same time: here, the algorithm that builds the model, the algorithm that chooses moves, and the algorithm that predicts future rewards were all trained simultaneously.
In part, that's a product of the availability of greater computing power, which makes it possible to play millions of games of chess. But in part, it's a recognition that this is what we'll need to do if an AI is ever going to be flexible enough to master multiple, distantly related tasks.
Nature, 2020. DOI: 10.1038/s41586-020-03051-4 (About DOIs).
Listing image by Richard Heaven / Flickr