Learning via Self-Play: An AlphaGo/AlphaGo Zero Story
Since its historic win against Go champion Lee Sedol in 2016, AlphaGo has made headlines around the world; before that match, many thought AI would need another decade to surpass humans at Go. AlphaGo uses an initial supervised learning procedure to learn from the games of human professionals, then improves itself further through self-play reinforcement learning. AlphaGo Zero took this one step further: it learnt the game from the rules alone, without any human knowledge, and ended up outperforming the original AlphaGo!
It is exciting how reinforcement learning methods can reach superhuman levels through self-play, and this presentation gives a beginner's overview of the winning methods behind AlphaGo/AlphaGo Zero - namely:
(1) Monte Carlo Tree Search (which balances the explore-exploit tradeoff and serves as a way to look ahead and self-improve),
(2) a neural network to estimate how good a board position is (the value network), and
(3) a neural network to decide which moves to focus on (the policy network); a rough sketch of how these pieces fit together is shown below.
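The sketch below is a minimal, hypothetical Python illustration of how these pieces combine in an AlphaGo-Zero-style search; it is not DeepMind's code. The policy_network and value_network stubs, the Nim toy game, and the exploration constant c_puct are all assumptions made for illustration; the real system uses a single deep residual network evaluated on Go positions, plus many engineering details omitted here.

```python
import math
import random

# --- Hypothetical network stubs (assumptions, not the real networks) ---
# AlphaGo Zero uses one deep residual network with policy and value heads;
# these stubs return a uniform prior and a random value so the sketch runs.

def policy_network(state, legal_moves):
    """Prior probability for each legal move (controls search breadth)."""
    p = 1.0 / len(legal_moves)
    return {move: p for move in legal_moves}

def value_network(state):
    """Estimated value of the position for the side to move (controls depth)."""
    return random.uniform(-1.0, 1.0)

# --- Minimal MCTS with PUCT-style selection ---

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a): prior from the policy network
        self.visits = 0           # N(s, a): visit count
        self.value_sum = 0.0      # W(s, a): total backed-up value
        self.children = {}        # move -> Node

    def q(self):
        """Mean value, from the perspective of the player who moved into this node."""
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.5):
    """Explore-exploit tradeoff: high Q (exploit) vs. high prior / few visits (explore)."""
    sqrt_total = math.sqrt(sum(c.visits for c in node.children.values()) + 1)
    def puct(item):
        _, child = item
        return child.q() + c_puct * child.prior * sqrt_total / (1 + child.visits)
    return max(node.children.items(), key=puct)

def mcts(root_state, num_simulations=200):
    root = Node(prior=1.0)
    for _ in range(num_simulations):
        node, state, path = root, root_state.copy(), [root]
        # 1. Select: walk down the tree with PUCT until reaching a leaf.
        while node.children:
            move, node = select_child(node)
            state.play(move)
            path.append(node)
        # 2. Expand: let the policy network propose children (breadth).
        if not state.is_terminal():
            priors = policy_network(state, state.legal_moves())
            for move, p in priors.items():
                node.children[move] = Node(prior=p)
        # 3. Evaluate: the value network replaces a full random rollout (depth).
        value = state.result() if state.is_terminal() else value_network(state)
        # 4. Back up: alternate the sign so each node scores its own mover.
        for n in reversed(path):
            value = -value
            n.visits += 1
            n.value_sum += value
    # Play the most-visited root move (in AlphaGo Zero, the root visit
    # distribution also becomes the training target for the policy network).
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

# --- A toy game so the sketch actually runs (illustrative only) ---

class Nim:
    """Take 1-3 stones per turn; whoever takes the last stone wins."""
    def __init__(self, stones=10):
        self.stones = stones
    def copy(self):
        return Nim(self.stones)
    def legal_moves(self):
        return [n for n in (1, 2, 3) if n <= self.stones]
    def play(self, move):
        self.stones -= move
    def is_terminal(self):
        return self.stones == 0
    def result(self):
        return -1.0  # side to move faces an empty pile: the opponent just won

if __name__ == "__main__":
    print("MCTS picks:", mcts(Nim(10)))  # optimal Nim play takes 2, leaving a multiple of 4
```

The closing comment hints at the self-play loop covered later in the talk: because the searched move distribution is stronger than the raw policy, the root visit counts serve as training targets, so the network and the search improve each other without any human games.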
0:00 Intro (AlphaGo Movie)
3:11 Start of Talk
4:35 Explore-Exploit Tradeoff
8:28 Monte Carlo
11:40 Monte Carlo Tree Search
18:58 AlphaGo (Neural Networks + MCTS)
23:46 Policy Network (Breadth)
26:34 Value Network (Depth)
28:14 AlphaGo: An Overview
31:19 AlphaGo Zero (no human expert knowledge)
34:51 MCTS in AlphaGo Zero
37:40 Self-play
39:00 Simplicity is better: Human features can be distracting
39:23 AlphaGo Zero Performance
40:18 How to achieve superhuman performance?
41:27 Q&A