A research team from Google recently published a paper on GameNGen, a generative AI model that can simulate the video game Doom. GameNGen can simulate the game at 20 frames per second (FPS), and in human evaluations its output was preferred only slightly less often than the actual game.
GameNGen (pronounced "game engine") is based on the open-source Stable Diffusion v1.4 text-to-image model. Google modified it so that instead of generating an image from a text prompt, it generates a frame of gameplay from previous frames and an action input (such as a key press or mouse click). To create training data at scale, Google used a game-playing agent trained with reinforcement learning (RL), collecting around 900M frames along with corresponding actions. After training, the model is able to simulate and maintain the complex state of the real game, including player health and items. Google evaluated GameNGen by showing human judges a side-by-side comparison of video clips from the simulated game with clips from the real game. The judges preferred the simulated game clip 40% of the time. According to Google,
While many important questions remain, we are hopeful that this paradigm could have important benefits. For example, the development process for video games under this new paradigm might be less costly and more accessible, whereby games could be developed and edited via textual descriptions or example images. A small part of this vision, namely creating modifications or novel behaviors for existing games, might be achievable in the shorter term. For example, we might be able to convert a set of frames into a new playable level or create a new character just based on example images, without having to author code. Other advantages of this new paradigm include strong guarantees on frame rates and memory footprints.
GameNGen Architecture. Image Source: GameNGen Project Website
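The frame-by-frame generation described above, where each new frame is produced from previous frames and an action input, can be illustrated with a minimal sketch. This is not Google's code: `rollout`, `predict_next_frame`, `toy_model`, and the `context_len` value are all hypothetical names and settings chosen for illustration, and the toy model stands in for the modified diffusion model.

```python
from collections import deque


def rollout(predict_next_frame, first_frame, actions, context_len=4):
    """Generate one frame per action, feeding each output back in as context.

    predict_next_frame is a hypothetical callable standing in for the model;
    context_len is an illustrative value, not a figure from the article.
    """
    frames = deque([first_frame], maxlen=context_len)
    past_actions = deque(maxlen=context_len)
    generated = []
    for action in actions:
        past_actions.append(action)
        # The model only sees recent frames and actions, never the game code.
        frame = predict_next_frame(list(frames), list(past_actions))
        frames.append(frame)  # the generated frame becomes future context
        generated.append(frame)
    return generated


def toy_model(frames, actions):
    """Toy stand-in: a 'frame' is just an integer player position."""
    step = 1 if actions[-1] == "right" else -1
    return frames[-1] + step


print(rollout(toy_model, 0, ["right", "right", "left"]))  # prints [1, 2, 1]
```

Because every generated frame is fed back in as input, any error in one frame can carry over into the frames that follow.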
Google's research paper on GameNGen cited the It Runs Doom subreddit, which is dedicated to "odd hardware that runs Doom." Users in that subreddit started a discussion thread about GameNGen, with one user describing it thus:
You ever dreamt of being in a game? Details are hazy, things aren't all where they should be, but in general the game is recognizable. That's what this is. The AI is recalling what should happen from its memory, but it literally doesn't know a single thing about the game it's in. It doesn't know what the game's code is or what the next level is, it's just going off of memory because it's watched so much Doom gameplay.
Users on Hacker News also discussed the model. One user noted that:
Apparently there’s more cause, effect, and sequencing in diffusion models than what I expected, which would be roughly ‘none’. Google here uses [Stable Diffusion] 1.4, as the core of the diffusion model, which is a nice reminder that open models are useful to even giant cloud monopolies.
The same user also called out a problem the researchers discovered while developing GameNGen. The researchers noticed that initially, when the model generated game frames, it suffered from "error accumulation and fast degradation in sample quality." To correct this, they added noise to the training data and included a noise level input to the model. This allowed the model to learn to "de-noise" its autoregressive output.
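The idea of adding noise to the training data and passing the noise level to the model can be sketched roughly as follows. This is an illustration under stated assumptions, not the researchers' implementation: the function name `corrupt_context`, the uniform sampling of the noise level, and the `max_noise` and `num_buckets` values are all assumptions made for this example, and frames are represented as flat lists of floats for simplicity.

```python
import random


def corrupt_context(frames, max_noise=0.7, num_buckets=10, rng=random):
    """Add Gaussian noise to context frames and report how much was added.

    frames: list of frames, each a flat list of floats.
    max_noise and num_buckets are illustrative assumptions.
    Returns the noisy frames and a discrete noise-level bucket index.
    """
    # Assumption: sample one noise level per training example, uniformly.
    noise_level = rng.uniform(0.0, max_noise)
    noisy = [
        [value + rng.gauss(0.0, noise_level) for value in frame]
        for frame in frames
    ]
    # Discretise the level so it could index a learned embedding table.
    bucket = min(int(noise_level / max_noise * num_buckets), num_buckets - 1)
    return noisy, bucket


# Usage: corrupt two tiny 4-value "frames" with a seeded generator.
context = [[0.5, 0.5, 0.5, 0.5], [0.2, 0.2, 0.2, 0.2]]
noisy_context, level_bucket = corrupt_context(context, rng=random.Random(0))
```

During training, the model would receive both the noisy frames and the bucket index, so it can learn to produce a clean next frame even when its context is imperfect, which is the situation it faces when consuming its own output.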
Although Google did not release the GameNGen code, the weights for the underlying open-source Stable Diffusion model are available on Hugging Face.