The PyTorch Foundation recently released PyTorch version 2.5, which adds support for Intel GPUs. The release also includes several performance enhancements, such as the FlexAttention API, TorchInductor CPU backend optimizations, and a regional compilation feature that reduces compilation time. Overall, the release contains 4095 commits since PyTorch 2.4.
The Intel GPU support was previewed at the recent PyTorch conference. Intel engineers Eikan Wang and Min Jean Cho described the PyTorch changes made to support the hardware. These included generalizing the PyTorch runtime and device layers, which makes it easier to integrate new hardware backends. Intel-specific backends were also implemented for torch.compile and torch.distributed. According to Kismat Singh, Intel's VP of engineering for AI frameworks:
We have added support for Intel client GPUs in PyTorch 2.5 and that basically means that you'll be able to run PyTorch on the Intel laptops and desktops that are built using the latest Intel processors. We think it's going to unlock 40 million laptops and desktops for PyTorch users this year and we expect the number to go to around 100 million by the end of next year.
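For readers who want to try the new backend, the following is a minimal sketch of device selection. It assumes a PyTorch build that exposes the torch.xpu module for Intel GPUs; the helper name pick_device is hypothetical, and the fallback order (Intel GPU, then CUDA, then CPU) is an illustrative choice rather than a recommendation from the release notes.

```python
import torch


def pick_device() -> torch.device:
    """Return an Intel GPU device if one is usable, else CUDA, else CPU.

    `pick_device` is a hypothetical helper for illustration only.
    """
    # Older or CPU-only builds may not ship the torch.xpu module at all,
    # so look it up defensively instead of assuming it exists.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return torch.device("xpu")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")


device = pick_device()
# Tensors and modules are placed on the selected device in the usual way.
x = torch.ones(2, 2, device=device)
y = x @ x
```

On a machine without a supported Intel GPU, the sketch simply falls back to CUDA or the CPU, so the same script can run unchanged across hardware.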
The release includes a new FlexAttention API, which makes it easier for PyTorch users to experiment with different attention mechanisms in their models. Typically, researchers who want to try a new attention variant need to hand-code it directly from PyTorch operators, which can result in "slow runtime and CUDA OOMs." With the new API, these variants can instead be written in "a few lines of idiomatic PyTorch code." The compiler then converts this code to an optimized kernel "that doesn’t materialize any extra memory and has performance competitive with handwritten ones."
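To make the idea concrete, here is a minimal sketch of a causal attention variant expressed as a score-modification function. The names causal_score_mod, reference_attention, and flex_causal_attention are hypothetical; reference_attention is a plain, unoptimized stand-in written only to show what the score function means, and it does materialize the full score matrix. The last function shows how the same score function could be handed to flex_attention, assuming PyTorch 2.5 or later.

```python
import math

import torch


def causal_score_mod(score, b, h, q_idx, kv_idx):
    # Keep the score where the query position can see the key position,
    # otherwise mask it out so softmax assigns it zero weight.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))


def reference_attention(q, k, v, score_mod):
    """Unfused reference: applies `score_mod` to the full score matrix.

    Illustration only; this is the memory-hungry path that a fused
    kernel is meant to avoid.
    """
    B, H, L, D = q.shape
    S = k.shape[-2]
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(D)
    # Broadcastable index tensors for batch, head, query and key positions.
    b = torch.arange(B, device=q.device).view(B, 1, 1, 1)
    h = torch.arange(H, device=q.device).view(1, H, 1, 1)
    q_idx = torch.arange(L, device=q.device).view(1, 1, L, 1)
    kv_idx = torch.arange(S, device=q.device).view(1, 1, 1, S)
    scores = score_mod(scores, b, h, q_idx, kv_idx)
    return torch.softmax(scores, dim=-1) @ v


def flex_causal_attention(q, k, v):
    """Same variant via FlexAttention (assumes PyTorch 2.5 or later)."""
    # Imported lazily so the sketch still loads on older PyTorch versions.
    from torch.nn.attention.flex_attention import flex_attention

    # In practice this call is usually wrapped in torch.compile so that
    # the score function is fused into a single attention kernel.
    return flex_attention(q, k, v, score_mod=causal_score_mod)
```

Other variants, such as relative position biases, follow the same pattern: only the body of the score function changes, while the call site stays the same.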
Several performance improvements have been released in beta status. A new backend, Fused Flash Attention, provides "up to 75% speed-up over FlashAttentionV2" on NVIDIA H100 GPUs. A regional compilation feature for torch.compile reduces the need for full model compilation; instead, it compiles repeated nn.Modules, such as Transformer layers. This can reduce compilation latency while incurring only a few percent of performance degradation. There are also several optimizations to the TorchInductor CPU backend.
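The regional compilation pattern can be sketched as follows. This is an illustration under stated assumptions, not the release's reference code: Block, TinyModel, and compile_repeated_blocks are hypothetical names, and the compile_fn parameter is added here only so a different backend or options can be swapped in. The idea is to call torch.compile on each repeated block instead of on the whole model.

```python
import torch
from torch import nn


class Block(nn.Module):
    """A small repeated unit, standing in for a Transformer layer."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return x + torch.relu(self.linear(x))


class TinyModel(nn.Module):
    """A stack of identical blocks (hypothetical example model)."""

    def __init__(self, dim: int = 16, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([Block(dim) for _ in range(num_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


def compile_repeated_blocks(model: TinyModel, compile_fn=torch.compile):
    """Wrap each repeated block with `compile_fn` instead of the full model.

    torch.compile is lazy, so actual compilation happens on the first
    forward pass through each wrapped block.
    """
    for i, layer in enumerate(list(model.layers)):
        model.layers[i] = compile_fn(layer)
    return model


# Usage (not executed here, since the first call triggers compilation):
#   model = compile_repeated_blocks(TinyModel())
#   out = model(torch.randn(2, 16))
```

Compiling at the block level trades some cross-layer optimization for shorter startup, which matches the small performance cost described above.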
Flight Recorder, a new debugging tool for stuck jobs, was also included in the release. Stuck jobs can occur during distributed training and can have many root causes, including data starvation, network issues, or software bugs. Flight Recorder uses an in-memory circular buffer to capture diagnostic information. When it detects a stuck job, it dumps the diagnostics to a file; the data can then be analyzed with a script of heuristics to identify the root cause.
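The circular-buffer approach can be illustrated with a short, self-contained sketch. This is not Flight Recorder's implementation or API; the class RingRecorder, its methods, and the timeout-based notion of "stuck" are all assumptions made for illustration. It shows how a fixed-size buffer keeps only the most recent events and how they can be dumped to a file once progress stalls.

```python
import json
import time
from collections import deque


class RingRecorder:
    """Hypothetical fixed-size event recorder (illustration only)."""

    def __init__(self, capacity: int, timeout_s: float):
        # A deque with maxlen drops the oldest entry once it is full,
        # which gives circular-buffer behavior with bounded memory.
        self._events = deque(maxlen=capacity)
        self._timeout_s = timeout_s
        self._last_progress = time.monotonic()

    def record(self, name: str, **details):
        now = time.monotonic()
        self._events.append({"t": now, "name": name, "details": details})
        self._last_progress = now

    def is_stuck(self, now=None) -> bool:
        # Assumption: "stuck" means no event was recorded within the timeout.
        now = time.monotonic() if now is None else now
        return now - self._last_progress > self._timeout_s

    def snapshot(self):
        return list(self._events)

    def dump(self, path: str):
        # Write the retained events so an offline script can analyze them.
        with open(path, "w") as f:
            json.dump(self.snapshot(), f)
```

A watchdog could periodically call is_stuck() and, if it returns True, call dump() before the job is terminated, leaving the last few events available for analysis.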
In discussions about the release on Reddit, many users were glad to see support for Intel GPUs, calling it a "game changer." Another user wrote:
Excited to see the improvements in torch.compile, especially the ability to reuse repeated modules to speed up compilation. That could be a game-changer for large models with lots of similar components. The FlexAttention API also looks really promising - being able to implement various attention mechanisms with just a few lines of code and get near-handwritten performance is huge. Kudos to the PyTorch team and contributors for another solid release!
The PyTorch 2.5 code and release notes are available on GitHub.