On September 18 and 19, 2024, The Linux Foundation hosted PyTorch Conference 2024 at Fort Mason in San Francisco. The conference showcased the latest advancements in PyTorch 2.4 and Llama 3.1, as well as some upcoming changes in PyTorch 2.5. Matt White, executive director of the PyTorch Foundation and GM of AI at the Linux Foundation, opened the conference on Day 1 by highlighting the importance of open-source initiatives in advancing responsible generative AI.
Hanna Hajishirzi detailed the OLMo project, aimed at building robust language models and making them fully accessible to researchers. This includes open-source code for data management, training, inference, and interaction. There was also discussion of DOLMa, a 3T-token open dataset curated for training language models; Tulu, an instruction-tuned language model; and OLMo v1, a fully open 7B-parameter language model trained from scratch.
Piotr Bialecki from NVIDIA, Peng Wu from Meta, and others gave a technical deep dive into PyTorch, charting its evolution from 2016 to 2024. They highlighted how PyTorch has become more straightforward, debuggable, and hackable over the years. They also shared figures on PyTorch's growth: with over 20,000 research papers and 140,000 GitHub repositories utilizing PyTorch in the past year alone, its adoption has been remarkable.
The conference highlighted several libraries within the ecosystem. Torchtune, a PyTorch library, offers a flexible and accessible solution for fine-tuning LLMs. It addresses memory efficiency challenges through techniques like activation checkpointing, 8-bit AdamW optimizers, and chunked cross entropy. The integration of torch.compile and techniques like sample packing and FlexAttention significantly boost training speed. Torchtune's modular design and training recipes cater to users with varying levels of expertise, democratizing the process of fine-tuning LLMs.
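For readers curious what those memory-saving levers look like in plain PyTorch, here is a minimal sketch (not torchtune's actual recipe code) combining activation checkpointing, an 8-bit AdamW optimizer, and torch.compile. The bitsandbytes optimizer and the toy model are stand-in assumptions for illustration:

```python
import torch
import bitsandbytes as bnb  # assumed stand-in for an 8-bit AdamW implementation
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy feed-forward block standing in for a transformer layer."""
    def __init__(self, dim=1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(), torch.nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        # Activation checkpointing: recompute activations in backward instead of storing them.
        return x + checkpoint(self.ff, x, use_reentrant=False)

model = torch.nn.Sequential(*[Block() for _ in range(4)]).cuda()
model = torch.compile(model)  # fuse kernels to speed up each training step

# 8-bit AdamW keeps optimizer state in int8, cutting optimizer memory roughly 4x.
optim = bnb.optim.AdamW8bit(model.parameters(), lr=2e-5)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optim.step()
```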
TorchChat, a PyTorch library, aims to streamline running LLMs locally, enabling seamless and performant execution on laptops, desktops, and mobile devices. It leverages core PyTorch components like torch.compile, torch.export, AOT Inductor, and ExecuTorch to optimize and deploy models in both Python and non-Python environments. TorchChat's focus on composability, debuggability, and hackability empowers developers to build and deploy LLMs efficiently.
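As a rough illustration of the export path referenced here, the following sketch captures a toy module with torch.export; it is not TorchChat's own code, and the module is hypothetical. The exported graph is the kind of artifact that backends such as AOT Inductor or ExecuTorch can then consume:

```python
import torch

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x @ x.mT)

model = TinyModel().eval()
example_inputs = (torch.randn(4, 8),)

# torch.export captures a whole-graph, Python-free representation of the model.
exported = torch.export.export(model, example_inputs)
print(exported.graph_module.code)  # inspect the captured graph
```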
TorchAO, a library for quantization and sparsification, tackles the memory and computational demands of large models. Hardware optionality was discussed, with torchao enabling low-precision optimization in PyTorch. PyTorch 2.0's inference story was explored, showcasing advancements in exporting models for diverse deployment scenarios.
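Below is a minimal sketch of what weight-only post-training quantization with torchao can look like; the exact API names (quantize_, int8_weight_only) may differ between torchao releases, and the toy model is an assumption:

```python
import torch
from torchao.quantization import quantize_, int8_weight_only  # names may vary by torchao version

# A toy model standing in for an LLM's linear layers.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).eval()

# In-place post-training quantization: weights are stored in int8 and dequantized on the fly.
quantize_(model, int8_weight_only())

with torch.inference_mode():
    out = model(torch.randn(1, 4096))
```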
The poster session on the first night of the conference featured contributions from Meta, NVIDIA, Google, Intel, and others. Key topics included improvements in PyTorch's data handling, inference performance, and support for new hardware through tools like torch.compile, TensorRT, and AI edge quantization. Google Research presented a graph visualization tool that helps users understand, debug, and optimize ML models. The winning poster from Meta, "PyTorch Performance Debugging in N-Dimensional Parallelism", discussed identifying and mitigating performance inefficiencies for training across 16K H100 GPUs on a single training cluster.
"At this magnitude, it is crucial to delve deep into performance inefficiencies for new model paradigms, which is essential for large-scale training. This platform helps observe and quickly debug large-scale model performance and scaling bottlenecks." - Sreen Tallam
Chip Huyen, VP of AI & OSS at Voltron Data, started the second day by discussing the limitations of external evaluation tools in AI, emphasizing the importance of critical thinking in the evaluation process. Sebastian Raschka, PhD, a staff research engineer at Lightning AI, took attendees on a journey through the evolution of large language models (LLMs). Raschka highlighted key developments in attention mechanisms and the latest "tricks of the trade" that have improved the training processes and performance of state-of-the-art LLMs.
Jerry Liu also discussed the challenges and building blocks of creating a reliable multi-agent system. Liu's presentation highlighted the shift from simple RAG stacks to more autonomous agents that can reason over diverse inputs to produce sophisticated outputs.
Woosuk Kwon and Xiaoxuan Liu presented vLLM, a high-performance LLM inference engine built on PyTorch that enables fast and efficient deployment on various hardware, including AMD GPUs, Google TPUs, and AWS Inferentia. Omar Sanseviero discussed Hugging Face's efforts to distribute over a million open models, highlighting the platform's role in democratizing access to powerful AI tools.
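For context, a minimal vLLM offline-inference call looks roughly like the sketch below; the model identifier, prompt, and sampling settings are illustrative assumptions rather than anything shown in the talk:

```python
from vllm import LLM, SamplingParams

# Example model identifier; substitute any model you have access to.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["What makes PagedAttention efficient?"], params)
print(outputs[0].outputs[0].text)
```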
The second day also discussed pushing the boundaries of LLM deployment. Chen Lai and Kimish Patel from Meta's PyTorch Edge team tackled the challenges of deploying LLMs on edge devices. They discussed the constraints of these resource-limited environments and presented ExecuTorch, a framework for efficient LLM execution on edge hardware, including CPUs, GPUs, and specialized AI accelerators. Mark Moyou from NVIDIA explored the intricacies of sizing production-grade LLM deployments, delving into topics like quantization, parallelism, and KV Cache management.
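To give a sense of why KV cache management matters when sizing a deployment, here is a back-of-the-envelope calculation using Llama-3.1-8B-like numbers; all figures are illustrative assumptions, not values from the talk:

```python
# Back-of-the-envelope KV cache sizing (illustrative, Llama-3.1-8B-like configuration).
num_layers = 32
num_kv_heads = 8          # grouped-query attention
head_dim = 128
bytes_per_elem = 2        # fp16/bf16; int8 or fp8 KV cache quantization would halve this
seq_len = 8192
batch_size = 16

# Factor of 2 accounts for storing both keys and values.
kv_cache_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
print(f"KV cache: {kv_cache_bytes / 1e9:.1f} GB")  # ~17.2 GB for this configuration
```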
"No training dataset is entirely free of bias. Even if it is largely unbiased for one use case, that doesn't guarantee it will be unbiased in another." - Shailvi Wakhlu
The conference also featured insightful discussions on the ethical considerations of AI deployment. Rashmi Nagpal, a machine learning engineer at Patchstack, addressed the need to build interpretable models and the importance of navigating the maze of ethical considerations. Amber Hasan, owner of Ethical Tech AI, discussed the potential environmental impact of AI, particularly on water resources.
Developers who would like to learn more about the conference can watch videos on YouTube in the coming weeks or check out the conference schedule for some of the material presented.