Researchers from OpenAI have open-sourced Video PreTraining (VPT), a semi-supervised imitation learning technique for training game-playing agents. In a zero-shot setting, VPT performs tasks that agents trained via reinforcement learning (RL) alone cannot, and with fine-tuning it is the first AI to craft a diamond pickaxe in Minecraft.
The model and several experiments were described in a paper published on arXiv. To train VPT, the team first contracted players to perform specific actions in the game, producing a labeled dataset of around 2,000 hours of video. Using this data, the researchers trained an Inverse Dynamics Model (IDM) that infers which keystrokes or mouse actions produced the behavior shown in a video. The team then used this model to label around 70k hours of internet videos of Minecraft play, and this larger dataset was used to pretrain a VPT foundation model. Without fine-tuning, the model performed complex game behaviors that have so far proven impossible for RL models to learn, including multi-step crafting. When fine-tuned on additional contractor data, VPT learned to craft a diamond pickaxe, a task that can require over 24k in-game actions. According to the OpenAI team:
VPT paves the path toward allowing agents to learn to act by watching the vast numbers of videos on the internet...While we only experiment in Minecraft, the game is very open-ended and the native human interface (mouse and keyboard) is very generic, so we believe our results bode well for other similar domains, e.g. computer usage.
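The overall recipe can be summarized in a toy, end-to-end sketch. Everything below is a deliberately tiny stand-in (the real VPT and IDM are large neural networks operating on raw video frames and a rich keyboard-and-mouse action space), but the three-stage data flow mirrors the description above:

```python
import torch
import torch.nn as nn

N_ACTIONS = 8    # stand-in for Minecraft's keyboard/mouse action space
FRAME_DIM = 32   # stand-in for encoded video-frame features

def make_model():
    return nn.Sequential(nn.Linear(FRAME_DIM, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

def supervised_fit(model, frames, actions, steps=200):
    """Plain supervised learning: predict the action for each frame."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(frames), actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# Stage 1: train the IDM on the small labeled contractor dataset.
contractor_frames = torch.randn(512, FRAME_DIM)
contractor_actions = torch.randint(0, N_ACTIONS, (512,))
idm = supervised_fit(make_model(), contractor_frames, contractor_actions)

# Stage 2: use the IDM to pseudo-label the much larger unlabeled
# internet video dataset with inferred control inputs.
internet_frames = torch.randn(4096, FRAME_DIM)
with torch.no_grad():
    pseudo_actions = idm(internet_frames).argmax(dim=-1)

# Stage 3: behavioral cloning on the pseudo-labeled data produces the
# VPT foundation model; fine-tuning repeats the same supervised step
# on a small task-specific contractor dataset.
vpt = supervised_fit(make_model(), internet_frames, pseudo_actions)
```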
Recent research in natural language processing (NLP) and computer vision (CV) has shown that pretraining models on large, noisy datasets scraped from the web can produce state-of-the-art results on a variety of downstream tasks. These large pretrained models, sometimes called foundation models, are typically fine-tuned on relatively small task-specific datasets. In contrast, most game-playing agents are trained using RL, which requires many thousands of episodes of the agent playing the game; this is time-consuming and may still leave much of the game's state space unexplored, especially in "open-world" games such as Minecraft.
While internet video-sharing sites such as YouTube host hundreds of thousands of hours of gameplay videos an agent could learn from, these videos show only the game screen, not the control inputs, which are crucial for learning. The OpenAI solution was to train an IDM to infer the control inputs from a series of video frames. To do this, the team first hired contractors to play Minecraft; during play, their screens were recorded along with their keystrokes and mouse inputs. This produced the labeled dataset used to train the IDM.
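A key property of the IDM is that it labels recorded video offline, so it can condition on frames both before and after a given timestep; inferring an action from its surrounding context is an easier learning problem than causally predicting the next action. Below is a minimal sketch, assuming a small convolutional encoder over a window of grayscale frames; the architecture is illustrative, not OpenAI's released model:

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Predicts the control input at the center of a temporal window
    of video frames (illustrative stand-in for OpenAI's IDM)."""

    def __init__(self, n_actions: int, window: int = 5):
        super().__init__()
        # Stack the window of grayscale frames on the channel axis, so
        # the model sees past *and* future frames around the timestep.
        self.encoder = nn.Sequential(
            nn.Conv2d(window, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened feature size
            n_features = self.encoder(torch.zeros(1, window, 128, 128)).shape[1]
        self.head = nn.Linear(n_features, n_actions)

    def forward(self, frames):  # frames: (batch, window, 128, 128)
        return self.head(self.encoder(frames))

idm = InverseDynamicsModel(n_actions=8)
logits = idm(torch.randn(2, 5, 128, 128))  # two 5-frame clips -> (2, 8)
```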
Next, the team collected and cleaned Minecraft gameplay videos from the internet, then used the IDM to label this dataset with the inferred control inputs driving the game. This larger dataset was used to train the VPT foundation model via "standard behavioral cloning." Behavioral cloning is a form of imitation learning in which an agent is trained on the observed states and actions of another agent (usually a human teacher) and learns to estimate the teacher's underlying policy. In contrast with RL, behavioral cloning does not require the learning agent to interact with the environment directly.
VPT Pretraining Overview (image source: https://arxiv.org/abs/2206.11795)
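Concretely, the behavioral cloning step is ordinary supervised learning: the policy is trained to maximize the likelihood of the teacher's (here, IDM-inferred) actions, and no Minecraft environment is queried anywhere in the loop. A minimal sketch with hypothetical dimensions:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 32-dim encoded observations, 8 discrete actions.
policy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def behavioral_cloning_step(observations, teacher_actions):
    """One gradient step: minimize cross-entropy between the policy's
    action distribution and the teacher's actions (equivalently,
    maximize the log-likelihood of the teacher's behavior)."""
    loss = nn.functional.cross_entropy(policy(observations), teacher_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One step on a synthetic batch of 64 pseudo-labeled examples; note
# that the game environment is never stepped or reset.
obs = torch.randn(64, 32)
acts = torch.randint(0, 8, (64,))
print(behavioral_cloning_step(obs, acts))
```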
In addition to releasing the VPT code and model weights, OpenAI has partnered with this year's MineRL NeurIPS competition. This competition offers prizes to teams who train agents that can perform tasks in the MineRL Benchmark for Agents that Solve Almost-Lifelike Tasks (MineRL BASALT). Besides OpenAI, several other large tech companies are supporting AI research efforts using Minecraft as a platform. In 2019, InfoQ covered Meta's open-source CraftAssist framework for building bots to assist players in the game. More recently, NVIDIA open-sourced MineDojo, a framework for embodied agent research in Minecraft.
The VPT code and models are available on GitHub.