Facebook AI Research (FAIR) published a paper on Recursive Belief-based Learning (ReBeL), their new AI for playing imperfect-information games that can defeat top human players in poker. The algorithm combines reinforcement learning (RL) with state-space search and provably converges to a Nash equilibrium in any two-player zero-sum game. Code for training the algorithm to play Liar's Dice has been open-sourced.
FAIR researchers Noam Brown and Anton Bakhtin gave an overview of the system in a blog post. ReBeL is a general-purpose algorithm for any two-player zero-sum game, including games with imperfect information, where players do not have complete knowledge of the game state. By modeling the probability distribution of the beliefs that players may hold about the game state, ReBeL can apply AI techniques originally developed for perfect-information games. The algorithm is proven to converge to the optimal policy for the game, and FAIR's implementation for Heads-Up No-Limit (HUNL) Texas Hold 'Em poker outperformed previous AI benchmarks and defeated a top human professional poker player. Brown and Bakhtin note:
[W]e view this as a major step toward developing universal techniques for multiagent interactions, and thus as a step toward complex real-world applications like fraud detection and cybersecurity.
Modeling a game for AI automation typically involves encoding the state of the game: for example, the positions of all the pieces on a chess board. The AI agent then uses an algorithm (or policy) to choose its next move, which updates the game's state. In many games where agents have perfect information about the current state of the game, such as chess, a common implementation searches the game's state space: that is, it simulates many possible sequences of moves by both players to find the best next move, where "best" is measured using a value function. However, for games where the state space is very large, such as Go, search alone becomes impractical. Instead, researchers working on these games have turned to RL, where the agent plays the game and updates its policy based on the results of its moves. DeepMind's AlphaGo Zero combines both RL and search, using a learned policy to guide search by narrowing down the search space.
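To make the idea of value-guided search concrete, the sketch below shows a depth-limited negamax search for a two-player zero-sum, perfect-information game: when the search reaches its depth limit, a value function (which in systems like AlphaGo Zero would be learned) estimates the outcome instead of playing the game out. The game-interface functions and the toy take-away game are illustrative assumptions, not FAIR's or DeepMind's code.

```python
def search(state, depth, legal_moves, apply_move, is_terminal, value_fn):
    """Negamax value of `state` for the player to move, cut off at `depth`."""
    if is_terminal(state) or depth == 0:
        # At the search horizon, a (possibly learned) value function
        # estimates the outcome instead of searching to the end of the game.
        return value_fn(state)
    # In a zero-sum game, the opponent's value is the negation of ours.
    return max(-search(apply_move(state, m), depth - 1,
                       legal_moves, apply_move, is_terminal, value_fn)
               for m in legal_moves(state))

# Toy usage: a one-pile take-away game (remove 1 or 2 stones; whoever takes
# the last stone wins), searched deeply enough that the cutoff is never hit.
value = search(5, 10,
               legal_moves=lambda n: [m for m in (1, 2) if m <= n],
               apply_move=lambda n, m: n - m,
               is_terminal=lambda n: n == 0,
               value_fn=lambda n: -1.0)  # no stones left: the mover has lost
print(value)  # 1.0: the player to move can force a win from 5 stones
```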
For games with imperfect information, such as card games where the players' cards are hidden from each other, agents cannot accurately determine the game's full state. While RL techniques have had some success in this setting, they lack a guarantee that the learned policy will approach a Nash equilibrium. Because none of the players in the game have full knowledge of the game state, they form beliefs about the unknown state. Furthermore, to determine the likely actions of other players, an agent must form beliefs about the other players' beliefs. DeepStack, the first AI to defeat human professionals at HUNL, uses such recursive reasoning to produce inputs to a neural network that assigns values to game states, in conjunction with state-space search.
ReBeL also uses this recursive process to develop probability distributions over the unknown state of the game, and formalizes these distributions as public belief states (PBSs). A PBS is formed by using Bayesian techniques and observations of players' actions to update an initial uniform distribution; thus all information required to produce a PBS is available to all players. By reformulating an imperfect-information game as a continuous-state perfect-information game based on PBSs, ReBeL can apply search in the PBS space. The value function for evaluating states during search is trained using RL during self-play, which reduces the need for expert domain knowledge and makes the algorithm applicable to a wider range of games. While DeepStack also used PBSs and search, its value function was trained on PBSs randomly generated from a distribution built using expert knowledge rather than by self-play. FAIR's previous poker-playing AI, Pluribus, does use self-play, but does not adapt its strategy based on observations of other players.
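The Bayesian update at the heart of a public belief state can be sketched in a few lines: starting from a uniform prior over an opponent's possible private states (for example, hole cards), the belief is reweighted by how likely the observed public action is under each private state, then normalized. The `policy` argument and the toy hand/action probabilities below are assumptions for illustration only, not ReBeL's actual networks or released code.

```python
def update_belief(belief, observed_action, policy):
    """Return P(private state | observed action) given the prior `belief`.

    belief: dict mapping private state -> prior probability
    policy: function (private_state, action) -> P(action | private_state)
    """
    # Bayes' rule: posterior is proportional to prior times action likelihood.
    posterior = {s: p * policy(s, observed_action) for s, p in belief.items()}
    total = sum(posterior.values())
    if total == 0:
        return belief  # action impossible under the model; keep the prior
    return {s: p / total for s, p in posterior.items()}

# Toy usage: two hidden hands, where "strong" hands are assumed to raise more.
uniform_prior = {"strong": 0.5, "weak": 0.5}
toy_policy = lambda hand, action: {"strong": {"raise": 0.8, "call": 0.2},
                                   "weak":   {"raise": 0.3, "call": 0.7}}[hand][action]
print(update_belief(uniform_prior, "raise", toy_policy))
# {'strong': 0.727..., 'weak': 0.272...}: a raise shifts belief toward a strong hand
```

Because the update depends only on publicly observed actions and a commonly known policy, every player can compute the same PBS, which is what lets ReBeL treat it as the "state" of an equivalent perfect-information game.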
The FAIR team proved that ReBeL converges to a Nash equilibrium policy, and tested the algorithm on both HUNL Texas Hold 'Em poker and Liar's Dice. In poker, ReBeL outperformed two benchmark bots, BabyTartanian8 and Slumbot, and also defeated Dong Kim, a professional poker player. Although FAIR has not released the code for their poker implementation, the source code and model checkpoints for the Liar's Dice experiment are available on GitHub.