Board games like chess are a widely studied field in the history of artificial intelligence. Pioneers like Turing, Babbage, von Neumann and Shannon developed theories, algorithms and hardware to analyze and play chess. And in the last couple of years, we have seen similar programs outperform humans in much more complex games like Go and Shogi (Japanese chess).
Google’s Deepmind has a phenomenal track record when it comes to beating humans at board games. In 2015, its AlphaGo project became the first computer Go program to beat a professional human Go player. And now the team has developed a program, AlphaZero, that can learn the game of chess by itself and beat any human or computer program (including Stockfish, the strongest conventional engine) after roughly 4 hours of training.
Conventional game-playing AI programs are highly optimized for their domain and can’t be generalized to other problems without human intervention. The AlphaZero program, on the other hand, can achieve superhuman performance in several challenging domains. With no prior knowledge except the game rules, and starting from random play, AlphaZero reached a superhuman level of play within 24 hours in chess, Shogi and Go, defeating a world-champion program in each case. How did they do this, and what exactly were the results? Let’s find out.
In October 2017, Deepmind announced that its AlphaGo Zero algorithm had achieved superhuman performance using deep convolutional neural networks trained solely by reinforcement learning. Engineers used the same approach to build a more general algorithm, called AlphaZero, which replaces the domain-specific augmentations and handcrafted knowledge used in conventional game-playing algorithms with deep neural networks and a tabula rasa reinforcement learning algorithm.
AlphaZero uses a general-purpose Monte Carlo Tree Search (MCTS) algorithm rather than alpha-beta search. It learns value estimates and move probabilities by playing against itself, then uses that learned information to guide its search.
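To give a flavor of how the learned move probabilities guide the search, here is a minimal sketch of the PUCT-style child selection used in AlphaZero-style MCTS. The function name and the dict-based node representation are illustrative, not Deepmind's actual code; each node stores a network prior P, a visit count N and a total simulation value W.

```python
import math

def puct_select(children, c_puct=1.5):
    """Pick the child maximizing the PUCT score used in AlphaZero-style MCTS.

    Score = Q + U, where Q is the mean simulation value and U is an
    exploration bonus that is large for moves with high network prior
    and few visits.
    """
    total_visits = sum(ch["N"] for ch in children)
    def score(ch):
        q = ch["W"] / ch["N"] if ch["N"] > 0 else 0.0          # mean value
        u = c_puct * ch["P"] * math.sqrt(total_visits) / (1 + ch["N"])
        return q + u
    return max(children, key=score)

# An unexplored move with a decent prior can outrank a well-explored one:
children = [
    {"P": 0.6, "N": 10, "W": 4.0},   # explored, Q = 0.4
    {"P": 0.3, "N": 0,  "W": 0.0},   # unexplored, pure prior bonus
]
best = puct_select(children)
```

As visits accumulate, the exploration term shrinks and the search converges toward the moves with the best observed values.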
How It Differs from the AlphaGo Zero Algorithm
The AlphaGo Zero algorithm estimates and optimizes the probability of winning, assuming binary win-or-loss outcomes. AlphaZero, on the other hand, estimates and optimizes the expected outcome, which also accounts for draws and other potential outcomes.
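The distinction can be illustrated with a tiny sketch (the function is hypothetical, using the common win=+1 / draw=0 / loss=-1 scale): instead of a single win probability, the value target becomes an expectation over all possible outcomes.

```python
def expected_outcome(p_win, p_draw, p_loss):
    """Expected game outcome on the scale win=+1, draw=0, loss=-1."""
    return 1.0 * p_win + 0.0 * p_draw - 1.0 * p_loss

# In Go there are no draws, so the expectation is just the win probability
# rescaled to [-1, 1]. In chess, a drawish position scores near 0 even
# when an outright win is unlikely:
v = expected_outcome(p_win=0.10, p_draw=0.85, p_loss=0.05)
```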
The rules of Go are invariant to reflection and rotation. Both AlphaGo and its successor AlphaGo Zero exploit this fact in two ways:
- Augmenting the training data by generating 8 symmetric variants of every position.
- Transforming each position by a randomly selected reflection or rotation before it is evaluated by the neural network during MCTS, so that the computation is averaged over different biases.
In chess and shogi, however, the rules are asymmetric, so symmetries can’t be assumed in general. In AlphaZero, the training data is therefore not augmented, and board positions are not transformed during MCTS.
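The 8 symmetries mentioned above are the dihedral group of the square: 4 rotations, each with an optional reflection. A minimal NumPy sketch (the function name is illustrative) shows how a single Go position yields 8 training examples:

```python
import numpy as np

def go_symmetries(board):
    """Return the 8 dihedral symmetries of a square board:
    4 rotations, each taken with and without a horizontal flip."""
    syms = []
    for k in range(4):
        rotated = np.rot90(board, k)
        syms.append(rotated)
        syms.append(np.fliplr(rotated))
    return syms

board = np.arange(9).reshape(3, 3)   # toy 3x3 "board" standing in for 19x19
positions = go_symmetries(board)     # 8 transformed copies of one position
```

Because chess and shogi pieces move asymmetrically (pawns advance in one direction, shogi pieces promote in enemy territory), none of these transformations preserve the rules, which is why AlphaZero drops the trick.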
AlphaGo Zero uses the best player from previous iterations to generate self-play games. After each iteration completes, the new player’s performance is evaluated against the current best player; if it wins by a 55 percent margin, it replaces the best player, and subsequent self-play games are generated by the new player. AlphaZero, however, maintains a single, continuously updated neural network rather than waiting for an iteration to complete.
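The gating rule AlphaGo Zero applies (and AlphaZero drops) can be sketched in a few lines; the function and the string network labels are purely illustrative:

```python
def gated_update(best_player, candidate, win_rate):
    """AlphaGo Zero's gating rule: the candidate becomes the new self-play
    generator only if it beats the current best player in at least 55%
    of evaluation games. AlphaZero skips this check entirely and always
    generates self-play games with the latest network."""
    return candidate if win_rate >= 0.55 else best_player

generator = gated_update("net_v1", "net_v2", win_rate=0.60)  # candidate wins
```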
AlphaZero Optimization and Training
AlphaZero uses the same hyper-parameters for all games, without any game-specific tuning. To ensure exploration, noise is added to the prior move probabilities at the root of the search, scaled in inverse proportion to the typical number of legal moves for that type of game.
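Concretely, the paper mixes Dirichlet noise into the root priors with concentration parameter alpha = 0.3 for chess, 0.15 for shogi and 0.03 for Go (smaller alpha for games with more legal moves). A minimal sketch, with an illustrative function name and the paper's mixing weight of 0.25:

```python
import numpy as np

def add_root_noise(priors, alpha, epsilon=0.25, rng=None):
    """Mix Dirichlet noise into the root move priors to force exploration.

    AlphaZero uses alpha = 0.3 (chess), 0.15 (shogi), 0.03 (Go), chosen in
    inverse proportion to the typical number of legal moves per position.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.dirichlet([alpha] * len(priors))
    return (1 - epsilon) * np.asarray(priors) + epsilon * noise

noisy = add_root_noise([0.5, 0.3, 0.2], alpha=0.3)
# the result is still a valid probability distribution over the legal moves
```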
As in AlphaGo Zero, the board state is encoded by spatial planes, and actions are encoded either by spatial planes or by a flat vector, depending on the basic rules of each game.
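To make "spatial planes" concrete, here is a deliberately simplified sketch of a chess-state encoding: one 8x8 binary plane per (piece type, color) pair. This is a hypothetical minimal version; AlphaZero's real input stack also includes move history, repetition counts, castling rights and side to move.

```python
import numpy as np

PIECES = "PNBRQK"  # pawn, knight, bishop, rook, queen, king

def encode_board(piece_map):
    """Encode a chess position as 12 binary 8x8 planes.

    piece_map maps (row, col) -> piece letter; uppercase is white,
    lowercase is black. Planes 0-5 are white pieces, 6-11 black.
    """
    planes = np.zeros((12, 8, 8), dtype=np.float32)
    for (row, col), piece in piece_map.items():
        channel = PIECES.index(piece.upper()) + (6 if piece.islower() else 0)
        planes[channel, row, col] = 1.0
    return planes

planes = encode_board({(0, 4): "K", (7, 4): "k"})  # just the two kings
```

Encoding the state this way lets the convolutional network exploit the spatial locality of the board, exactly as it does for Go stones.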
The developers applied AlphaZero to chess, shogi and Go, using the same network architecture, hyper-parameters and settings for all 3 games. A separate instance of the algorithm was trained for each game. Starting from randomly initialized parameters, training ran for 700,000 steps, using 5,000 first-generation Tensor Processing Units (TPUs) to generate self-play games and 64 second-generation TPUs to train the neural networks.
As you can see in the figure, AlphaZero outperformed Stockfish after 300,000 steps (about 4 hours) in chess, Elmo after 110,000 steps (within 2 hours) in shogi, and AlphaGo Lee after 165,000 steps (about 8 hours) in Go.
The fully trained instances of AlphaZero (trained for 3 days) were tested against AlphaGo Zero, Elmo and Stockfish in 100-game matches at a time control of 1 minute per move. The results, shown in the table below, were quite impressive.
AlphaZero and AlphaGo Zero each ran on a single machine with 4 TPUs, while Elmo and Stockfish played at full strength with 64 threads and a 1 GB hash size. AlphaZero defeated them all, losing 8 games to Elmo and none to Stockfish.
Google’s developers also examined the performance of AlphaZero’s MCTS search. It searches 40,000 positions per second in shogi and 80,000 in chess, compared to 35,000,000 for Elmo and 70,000,000 for Stockfish. AlphaZero compensates for the thousand-fold difference by using its deep neural network to focus far more selectively on the most promising variations, arguably a more human-like approach.
While AlphaZero is still in its infancy, it constitutes an important step toward general-purpose learning algorithms. If similar approaches can be applied to other structured problems, such as protein folding, discovering new materials or reducing energy consumption, the outcomes have the potential to impact our future in a positive manner.