
An Open Source Infrastructure for PyTorch


Summary

Mark Saroufim discusses tools and techniques to deploy PyTorch in production.

Bio

Mark Saroufim is an Applied AI engineer in the Business Engineering group at Meta who spends most of his time maintaining or contributing to PyTorch. He's passionate about building in the open and even more passionate about online communities.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Saroufim: I'm Mark Saroufim. I'm here to give you a talk titled, "An Open Source Infrastructure for PyTorch." I'm an Applied AI engineer here at Meta. Some people think I spend the majority of my time on Twitter, so you can follow me there for my updates. For the most part, I have the luxury of having the majority of my work be out in the open, so you can go check me out on GitHub for the latest and greatest. Let's unpack this title a bit more. Why is it that open source is important? What does it really mean to have an infrastructure that's open source? We all know what PyTorch is. PyTorch is a language to author machine learning models. What does it really mean to have an infrastructure around it? Let's just dive deeper. The PyTorch organization maintains a bunch of projects that are more related to how we actually use PyTorch in production. There's a couple that I personally work on, like PyTorch Serve, used for model inferencing, where I'm one of the maintainers. I also contribute to a bunch of other projects like PyTorch Core, Serve, TorchX for infrastructure management, and torchdata for dataset management. The examples repo is more of an educational repository. My charter is, how do we make it so PyTorch is a joy to deploy in production? This photo on the right is me right at work, right there on the GitHub interface.

Outline

One way of viewing the agenda for this talk is going to be this. We're going to have lots of digressions. I'm sure you've seen lots of pictures by lots of people describing how to do machine learning, and these are all the tools you should use. This is not going to be one of those talks. In this talk, the reason I'm categorizing things into subproblems is that there's a lot more to say about each of them. Ideally, I'm really here to show you what's interesting about a lot of those problems, and just to get you curious about those subproblems. Because outlines are generally helpful, the way to think about it is that you're going to have some data, and that data is not going to be in a tensor format. It may be some images. It may be some text. It may be some audio snippets. How do we actually load, decode, and preprocess that data? The tools we're going to talk about within this context are what we call the PyTorch domain libraries. There's going to be things like torchvision, torchtext, torchaudio. There's going to be our data loading libraries, so things like torchdata. Then when we're defining a model architecture, hopefully, you've had an experience already defining some model with PyTorch. It's actually really easy. If you don't want to do that, there are also pretrained models that you can get from companies like Weights & Biases.

Even then, something that's becoming more important, as you may have heard from the news, is that models are getting bigger. What does it really mean to define a large model? We're going to be discussing tools like DDP and FSDP. That's distributed data parallel and fully sharded data parallel. Then we're going to discuss how you actually deploy those models. You want to launch a training job, but if it's a large job with multiple nodes in it, some of those nodes could fail. What does that really mean? How do you check the job status when the job is running over multiple machines? Of course, after deploying those models, you may realize, this model that I'm trying to train is not cost effective at all, and it is going to take months or years to train. What are the various tools that people use to make models faster, ranging from quantization to more inference-specific solutions like TensorRT, IPEX, and ONNX? Finally, once we've done all of that, and we've defined the exact models and weights that we want to serve in production, what does serving a model in production really mean? Here I'm going to talk to you more about TorchServe.

PyTorch Foundation - Democratic Governance

Some piece of news that we're really excited about, and that sets the context for why open source is important, is the PyTorch Foundation. The goal of the PyTorch Foundation is to foster an open source ecosystem for PyTorch, where the governance is distributed outside of Meta. By governance, this really means both the technical governance, which will be a meritocratic process where, if you've demonstrated outstanding technical ability and vision for what PyTorch should look like, you can get nominated as a maintainer, even if you don't work at Meta. The other aspect is the business governance. This includes things like the website, and all the various assets like the PyTorch trademark. So far in the governing board, we're really fortunate to have some of the largest and best companies in machine learning. Companies like AMD, Google, Amazon, Meta, of course, Microsoft, and NVIDIA. This is all going to be managed by the Linux Foundation. Personally, I'm very excited about this. This basically means that the project is growing up. It's becoming too important to basically only belong to Meta.

Why PyTorch?

What's up with PyTorch? Why is it that people like PyTorch? My two cents is that PyTorch is easy to use, learn, and debug. Generally, researchers really love it, because you can find lots of pretrained models that you can compose. There aren't too many abstractions, and it's mostly Python, which means that you can dive into the internals and debug stuff. You can add your log statements, you can use pdb. That's one of the benefits of what people call an eager mode framework, as opposed to a graph mode framework. It also means you can freely intersperse Python and PyTorch code. You can have for loops. You can have if conditions. You can have print statements. For the most part, if you want a model that works, you're mostly constrained by, am I really building the right thing, and not really constrained by your tooling. As a researcher, this is great, which means you can wrestle with the foundations of the universe.
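To make that eager-mode point concrete, here is a minimal, made-up module (the shapes and the debug condition are purely illustrative) showing ordinary Python control flow and print debugging interleaved with PyTorch code:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        # Ordinary Python control flow and print/pdb debugging just work here.
        if x.shape[0] > 8:
            print("large batch:", x.shape)  # or: import pdb; pdb.set_trace()
        return self.fc(x)

out = TinyNet()(torch.randn(16, 4))
print(out.shape)  # torch.Size([16, 2])
```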

This is a dashboard, for example, from Papers With Code, which is a website that tracks framework usage in new papers. We can see that PyTorch is an extremely popular option among researchers. Again, this is really because of those three things that I highlighted. It's easy to use, easy to learn, and it's easy to debug. When you're thinking about production, the design space becomes larger. While you still want the code that was easy to learn and debug, you now also want it to be easy to maintain. You want it to be easy to accelerate. You want this thing to run fast, because if it's running fast, it means you can run even more servers, you can make your models even better. It also needs to plug into an existing infrastructure. Fundamentally, if you look at the research to production journey, it's very different, because the road to production isn't like a point. It's almost like a journey, and it's long, and it's perilous.

The reason that's the case is because, one, there's lots of options. There's all sorts of tools you can use, like managed services, open source infrastructure versus something homegrown. If it's homegrown, that's good. That means you built something that solves a new problem. The problem is, if you're successful, now this is something you need to maintain forever, in addition to the work you've already done. It's not like you can just learn deep learning; now you have to learn operating systems, CUDA, networking, distributed programming, computer architecture, but also your own domain. The tools are complex. They're complex also because of human-imposed complexity. A lot of tools may not necessarily be available just by putting your credit card online. You need to talk to salespeople. You may need to prioritize asks via certain PMs. This is fundamentally why this problem is so complex, and why you can't essentially have a single engineer cover the entire research to production stack.

Why Open Source?

I gave this option of, let's say you're using something like a managed service, and you have a new feature request. It's really hard to prioritize it, and you're just waiting. What do you do? It feels like maybe you're not some big-shot VP, you're just some random person, but you're pretty sure something is a problem, so what do you do? Often, I think historically, this has been like, too bad, there's not much you can do. One of the things I love about open source, and why I think the internet is so amazing, is that the users of a product are often more aware of its limitations than its maintainers. If you have a complaint or a feature request, it's meritocratic. Because it's on GitHub, you may not necessarily have your corporate credentials attached to it. As long as you make a good argument, that means you can control the roadmap of a project, even if it's backed by a large company. This also means that you control your own destiny, as in, you can add your own features, and you can add your own abstractions on top in a fork. Or if they're useful enough for the broader community, you can make a pull request and then you yourself control how people use this product. I think most importantly, whenever you're dependent on a single project for your infrastructure, it means you're also dependent on the company that produces this infrastructure. Companies come and go over time. One way of ensuring the long-term survival of a project is effectively open sourcing it, because then that project becomes in service of its community, and not just in service of its creators.

Good News: Getting Better at Research to Prod

There is some good news here. We are becoming better at converting people from research to production. You can go check out our PyTorch Medium, https://medium.com/pytorch/tagged/case-study. There's a bunch of case studies there. Ones that I personally liked are things like AI for Agriculture. This was a case study done with Blue River Technology. Others are like using ground penetrating radar: you send radar through a material and look for potential flaws in its manufacturing, without breaking it. This is one thing that I personally never liked about roads, that they're not particularly debuggable; you need to break everything before you can look into it. Imagine you needed to do this for software. Another case study was on studying the properties of glass. You can imagine a molecule of glass (I'm not a chemist), it's a graph, and you want to analyze its properties. Again, this is telling of how expressive, in general, deep learning is. As long as you can express your problem as a tensor, from input tensors to labels, which could also be output tensors, you can pretty much solve a large chunk of problems, whether it's text, images, or graphs. In this case, the first are images, and the last ones are graphs. Odds are, you could probably try out using deep learning and maybe get pleasantly surprised.

Data Loading - PyTorch/Data

Let's dive deep into the tooling. The first problem I want to cover is data loading. What is data loading? The simplest way to view the problem is that you have some data, maybe in some cloud storage, and you want to convert it to a tensor. Historically, this has meant that you need to first bring that remote dataset locally, and then you convert it. Then you can do something like for element in batch, like model.fit, or model.optimizer, model.step, whatever your favorite API is. One trend we're noticing here is that large models in and of themselves are not particularly useful. It's really that large models are also great at working with larger datasets. That's the key. Where are those larger datasets located? They certainly won't be on your local disk; instead they'll be in a remote object store. These are things like AWS, Azure, or GCP, maybe it's the Hugging Face Hub, maybe it's NVIDIA AIStore, or webdataset. Maybe it's a TFRecord. Maybe it's on Gdrive or Dropbox, if you're doing some small experiments, or maybe just a local compressed file. Fundamentally, we should be able to work with all of those datasets without requiring you to change your code. That's exactly one of the things that torchdata was really built to address.

The other problem is, once you have this dataset, you may want to read it, decode it, and do some preprocessing on it very quickly. Generally, this has meant that you do the conversions on CPU, and then you send the data over to GPU. As GPUs get faster, memory bandwidth becomes the bottleneck. Instead, there are multiple technologies, like NVIDIA's DALI or ffcv, that are working on direct GPU decoding and accelerated preprocessing, so you can send images directly to the GPU. If you do this, you're basically going to get much higher GPU utilization, since data loading, especially with newer generations of GPUs, is becoming the primary bottleneck for model training.

Back to torchdata. What was the problem this library was really looking to solve? Historically, you may already know that we have DataLoaders in PyTorch. It was entirely possible for you to write something like for elem in dataset, and start loading batches one at a time into your training pipeline. The problem was there was a complete lack of modularity and composition in this, so people were writing their own datasets, and it wasn't really obvious how to reuse different parts of them. We redesigned this from the ground up to be much more modular and composable. The main idea is that you want to define a DAG, a directed acyclic graph, and execute it. Executing that graph essentially lets you do something like for element in graph. I have an example here. Let's say you're reading some data from S3, and you're reading some data from AIStore. You want to concatenate these two, maybe text, datasets, then you want to parse them. Then you want to filter out, for example, all the parts that may not have, let's say, English sentences. Then you want to load these in one at a time. This is something I just made up on the spot. There's a lot more potential in what those nodes could be. Maybe the nodes are related to archiving, so they could compress or decompress ZIP data, Gzip, Tar, whatever. They could be more combinatorial, as in, how do I take a dataset and combine it, or split it, or rotate through it cyclically? It could be grouping, like what we just discussed. Maybe it's mapping, as in, you want to map a function over every element of a dataset, which is very useful for anything like preprocessing. Selecting, which is essentially a filter. Or parsing, which is more specific, like parsing for text; decoding for images would be the same idea.
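As a rough illustration of what such a DAG looks like in code, here is a minimal sketch using torchdata's datapipes. Local CSV files stand in for the remote S3/AIStore readers from the example, and the file layout and "language" column are made up:

```python
from torchdata.datapipes.iter import FileLister, FileOpener

# Local CSV files stand in for remote readers; the schema is illustrative only.
dp = FileLister(root="data/", masks="*.csv")
dp = FileOpener(dp, mode="rt")
dp = dp.parse_csv(skip_lines=1)               # parse rows out of each file
dp = dp.filter(lambda row: row[2] == "en")    # keep only English rows
dp = dp.shuffle(buffer_size=10_000)           # buffered shuffle for streaming data
dp = dp.batch(32)                             # group rows into batches

for batch in dp:                              # executing the DAG is just iterating it
    pass
```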

What's really interesting about this is that, once you have a DAG like this, you can execute it. We already provide an execution engine for it, and you can create your own. As in, you could have your own engine, and we call this a reading service. You could potentially add support for multi-threading, or maybe you have some really custom infrastructure. We've effectively split responsibilities between how the preprocessing DAG is defined versus how it's executed. This even lets us swap things in and out to make things easier. A core requirement for us, even early on, was that we wanted to make sure that this graph is serializable. As in, you can take this graph and write it to disk. Then once you write it to disk, maybe you can modify it, so maybe someone can create a DAG, and then you come in and add more optimized versions of a node. Of course, you can create your own nodes. It's actually one of the best ramp-up tasks for the library. For example, I personally created a Hugging Face Hub reader, and you can come in and create your own, or somebody within your company can, and you can share them freely.

Fundamentally, I think torchdata is focused on the scenario of streaming datasets. As in, the constraint is that the data does not fit in RAM, and it does not fit in the on-device memory of GPUs, because that's not growing exponentially. To clarify this distinction in the documentation, we make the distinction between map-style datasets, which is a dataset that you can index into. For example, you can write dataset[index]. The benefit of this is that these datasets are really easy to shuffle, which is something we need to do, and they're familiar. The con is that you need to load the entire dataset in memory, because you're never really sure which index you're going to be asked for, unless you're willing to just pay for the reads or cache misses. On the other hand, and this is the future if you go towards streaming-only scenarios, with iterable-style datasets you effectively can only look at the next element of a dataset. This is what the Iter really means. The benefit of this is clear. It's that you only ever need to load the next few batches in memory. I'm really only saying the next few batches instead of the next batch, because you may actually want to do something like prefetching here to have it work efficiently.

Then the problem really becomes shuffling. How can you shuffle data if you can't see it all? Thankfully, this is a problem that we've solved for you in torchdata. I want to make you curious about the library, https://github.com/pytorch/data. Make sure to go to the library and check out how we've solved this problem. Just to help you appreciate why this is maybe not an entirely obvious problem: you may think, maybe I can shuffle elements as I see them. That doesn't quite work, because then you're going to be biased towards the elements that are ordered first in the dataset. Maybe you keep a buffer. Then, how big should this buffer be? You keep a buffer, you fill this buffer, and then you shuffle the elements within that buffer. When the limit of the buffer is the size of the whole dataset, then you get map-style datasets, but then you get the cons of map-style datasets. We don't want that. How small can we make it? Then, of course, there's a wrench that gets thrown into all of this, which is, what if you're running a distributed DataLoader, so now the dataset is split over multiple shards? Then you have a random seed that may decide how you're reading datasets in one at a time. Then, do you share the random seed or not? Do you share the random seed generator? These are all hard problems. You can go check out the answer in the torchdata repo.
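To make the buffer idea concrete, here is a simplified, standalone sketch of a buffered shuffle over a stream. This is the general idea rather than torchdata's actual implementation, which also handles distributed sharding and seed sharing:

```python
import random

def buffered_shuffle(stream, buffer_size=10_000, seed=None):
    """Shuffle a stream you can only see one element at a time by keeping a
    bounded buffer and trading each new element for a random buffered one."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)          # fill the buffer first
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]            # emit a random buffered element
            buffer[idx] = item           # keep the new one in its place
    rng.shuffle(buffer)                  # drain whatever is left
    yield from buffer
```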

Distributed Training - PyTorch/PyTorch w/FSDP

The next thing I want to talk about is distributed training. Distributed training is hard. Let me set the definitions before I start talking about FSDP, or DDP, or anything like that. One form of parallelism is what we call data parallelism. Data parallelism is when you can fit the model on an individual GPU, but the dataset does not fit. What you do is you split the dataset over multiple devices, and you have the same model on all of those devices. Then you sync, and then you do a gradient step over that individual shard. The problem with this approach is that, since the shards are not seeing the same examples, they're going to have different gradients. To resolve this problem, you need to synchronize the gradients occasionally, either synchronously or asynchronously, to get them to work. When it comes to large models, effectively, what we mean is model parallelism. Model parallelism is, let's assume the data can fit on a single machine. The problem is, we can only fit an individual layer at a time on a device. The problem should be somewhat clear. As in, let's say model part 2 is really slow, but model part 1 is really fast. Then the bottleneck is going to be in model part 2. What does it mean? Should you just wait? No. Ideally, you can just pipeline examples. This is what we call pipeline parallelism. Of course, closely related to model parallelism is CPU offloading, which means, let's say you don't have a GPU per layer, and a lot of us don't, what do you do? What you could do is offload extra weights to your CPU, and only load them in exactly when you need them. This way, you can take advantage of CPU RAM, which could potentially be much larger than the on-device GPU memory, but still benefit from the GPU's acceleration when you need it.

To combine all of these ideas, we have FullyShardedDataParallel, which basically lets you just wrap a model automatically. You can just wrap your model in FullyShardedDataParallel. Your model is just a typical nn.Module. There's no magic here. This is exactly the same model you would write if it ran on a single GPU. You can also enable CPU offloading to make sure that you avoid OOMs. This is great, because now you can offload params, you can potentially offload gradients. You can store the weights in various kinds of precision. Maybe you want to store them in 32-bit, maybe 16-bit, and take advantage of data and model parallelism and pipeline parallelism and CPU offloading. This way, you don't need to think about it.
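Here is a minimal sketch of what that wrapping looks like, assuming a CUDA machine and a torchrun launch so that the process group environment is already set up; the model and sizes are placeholders:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    CPUOffload,
    MixedPrecision,
)

# Assumes this script is launched with torchrun so the process group
# environment variables are set; the model itself is just a plain nn.Module.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()

sharded = FSDP(
    model,
    cpu_offload=CPUOffload(offload_params=True),                # park params in CPU RAM
    mixed_precision=MixedPrecision(param_dtype=torch.float16),  # store params in 16-bit
)
```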

Deploy Models - PyTorch/TorchX

Let's say now you've defined the model, but you want to actually deploy it on some infrastructure. Where do you run python large_model.py? You may run it on Kubernetes or on a slurm cluster. Whatever it is that you're running, TorchX is an easy-to-use option for these things. Specifically, if you have your app.py script, deploying that model on various clusters, whether it's a ray cluster, whether it's a slurm cluster, whether it's aws_batch, whether it's Kubernetes, is really just a question of changing the scheduler parameter. That's it. Once you do that, you can pretty much deploy wherever you want. Once you deploy it, there are a couple of verbs that are going to be [inaudible 00:24:18]. Maybe you want to set up this infrastructure, so you would run something like ray up or eksctl. Then you want to submit your job against that infrastructure. This is like running something like torchrun. Then there's getting the status of that job, that's torchx status, so this is where you can know, did the job fail? Did it succeed? Or maybe you want to get the logs for this job. All of these are just verbs. This makes it really easy for you, if you haven't settled on an infrastructure yet, to try things out. You can try out Kubernetes. You can try out ray. You can try out slurm, and just see what makes the most sense to you. Once you've done this, we have built-in components to make it really easy for you to do distributed training or even elastic training. Which is, assume someone trips on a cable while a job is running, instead of getting a [inaudible 00:25:08] deadlock error, you can just keep going. These are a lot of the benefits that TorchX brings.
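As a hedged sketch of what those verbs look like in practice, the snippet below drives the TorchX CLI from Python. The dist.ddp component, the -j nodes x procs flag, and the scheduler names reflect my understanding of TorchX; exact flags and any extra scheduler config (queues, namespaces, and so on) depend on your version and cluster setup:

```python
import subprocess

def submit(scheduler: str = "local_cwd") -> None:
    # Switching clusters (kubernetes, slurm, aws_batch, ray, ...) is mostly a
    # matter of changing the --scheduler argument.
    subprocess.run(
        ["torchx", "run", "--scheduler", scheduler,
         "dist.ddp", "-j", "1x2", "--script", "app.py"],
        check=True,
    )

if __name__ == "__main__":
    submit()
    # Status and logs are separate verbs, e.g.:
    #   torchx status <app_handle>
    #   torchx log <app_handle>
```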

Model Optimization

The other aspect is, great, now you know how to train a model, you know how to scale it, you know how to load data, and you know how to manage your model. Let's say your model is just fundamentally too big, but this is the size that you know you want. How do you make things run faster? Here, I'm going to take my time in explaining these things, because this is really an important idea. A lot of these pictures are borrowed from one of my colleagues, Horace, in his article called Brrr. It's a really good article, so I'd recommend you check it out. I'm going to summarize it for you here. GPUs are now memory bandwidth bound. What this means is, essentially, imagine you're sending work to a factory. You send this factory work, as in, tensors, and you want to run model.forward. Then this factory is going to send you some output tensors back. Fundamentally, the factories are now becoming so fast that the bottleneck is sending data quickly enough. The problem is actually closer to the right-hand side, which is, our factories keep getting bigger, and so we're not able to send enough data on the bus.

One way to solve this problem is what's called fusion. Imagine you have a model that looks something like this. It's running a torch.randn, and then it's running a cosine, and then another cosine. The way this typically works in eager mode PyTorch is that you have the model, so you're doing some operation, and then you write a back to memory. Then you read a from memory. Then you do an operation, and then you write b back to memory. Then the same thing again: you read it from memory, then you write it back. This is going to be extremely slow. We don't ever really need b. Instead of doing this, we can fuse all of these operations and then do a single read/write for all of them in a single fused_op. Instead of thrashing back and forth like this, you can do fusions. The reason I bring this up is because fusions are critically important to understand whether you're talking about ONNX, TensorRT, or AITemplate. The majority of those tools take advantage of faster fusions.
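A small sketch of that exact shape, and one way (assuming PyTorch 2.x) to ask for the pointwise ops to be fused so the intermediate never makes a round trip to memory:

```python
import torch

def f():
    a = torch.randn(1_000_000)  # a is materialized and written out to memory
    b = a.cos()                 # a is read back in, b is written out
    c = b.cos()                 # b is read back in just to compute c
    return c

eager_out = f()

# torch.compile can fuse the two cosines into a single kernel, so the
# intermediate b never needs to be written out and read back.
f_fused = torch.compile(f)
compiled_out = f_fused()
```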

There are two kinds of important fusions I want you to be aware of. One important fusion is vertical fusion. This is what I showed you earlier here, which is, you have a sequential pipeline where you have lots of reads and writes. You can fuse those into a single read and a single write. This is vertical fusion. Another kind of fusion is called horizontal fusion. I know these are confusing terms. Horizontal fusion is something you already do with batch inference. The way batch inference works is you have a single model, but you're running it on two different examples at once. Now imagine that you have a model that's taking advantage of a single layer at multiple points, so imagine something with residual connections. You could essentially schedule those operations concurrently with horizontal fusion. Don't worry about that example I just mentioned, just think of batching. That is a form of horizontal fusion. Optimization providers typically benefit from those two operations.

The way you use those runtimes is usually fairly simple. They all look relatively the same, as in, there's some sort of runtime_load operation that takes in the serialized weights on disk with some runtime_config. Then you make an inference on those things. There's vertical fusion, which is fusing pointwise ops. Pointwise ops are things like cosine, but the more important one that you may care about is something like ReLU. Horizontal fusion is about parallel general matrix multiplications, or batching. Then memory fusions are fusions over operations that do either concatenations or splits, where, instead of allocating more memory, you can just read or view the existing tensors. If you are interested in how to find these fusion opportunities automatically, make sure to check out projects like AITemplate, or if you want to see what it looks like to spend a lot of time writing them by hand, there are a lot of vendors that will write these fusions by hand in things like CUTLASS. They'll be in C++, and they're harder to use. If you want something more Pythonic, make sure to check out AITemplate, https://github.com/facebookincubator/AITemplate.
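To make the "runtime_load plus infer" pattern concrete, here is what it looks like with ONNX Runtime as one example of such a runtime; the model path and input shape are placeholders:

```python
import numpy as np
import onnxruntime as ort

# Load the serialized, already-optimized graph from disk with a runtime config
# (here, just the execution provider), then run inference against it.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
```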

Model Serving - PyTorch/Serve

The last thing I want to talk about is TorchServe. TorchServe is a solution for model inferencing. What is the model inferencing problem? Fundamentally, the model inferencing problem is you have some model weights, and now, in the real world, you're not operating over tensors, you're operating over binary images, text, and audio. You may want to decode and preprocess this data and then run model.forward. Even then, after you run model.forward, you get a tensor back. Like, who can read a tensor? You also need a post-processing pipeline that can turn this tensor into a label, or maybe an image if it's something generative, or text, or audio if it's something closer to the Stable Diffusion stuff. We have a whole bunch of case studies of people using TorchServe. You can check them out on our news page, https://github.com/pytorch/serve#-news.
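That decode, preprocess, forward, post-process shape is what a TorchServe handler captures. Below is a minimal sketch of a custom handler; the request parsing is deliberately oversimplified, the inference step (model.forward) is inherited from BaseHandler, and real handlers in the serve repo deal with batching, devices, and error handling:

```python
import torch
from ts.torch_handler.base_handler import BaseHandler

class MyHandler(BaseHandler):
    """Hypothetical handler for a model that scores a single float per request."""

    def preprocess(self, requests):
        # Each request carries raw bytes/JSON under "data" or "body"; turning
        # that into a batched input tensor is the preprocessing step.
        values = [float(r.get("data") or r.get("body")) for r in requests]
        return torch.tensor(values).unsqueeze(1)

    def postprocess(self, outputs):
        # Turn the output tensor back into something a caller can read.
        return outputs.argmax(dim=1).tolist()
```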

I want to briefly talk about how a model inferencing framework works at a high level. At a high level, you're going to be making curl requests, really. Those curl requests will say something like, I want you to make an inference here. Then, as the TorchServe instance gets more inference requests that it needs to handle, it'll batch them and basically treat them all at once. This is what's called optional batching. It's one of the important optimizations in TorchServe. Inferencing is one aspect of an inferencing framework. The other is the Management API, which lets you roll back to older versions of a model, do canary rollouts, so you can progressively do 10% rollouts with technologies like KServe, or autoscale a model if the load gets really large. Fundamentally, you have a TorchServe instance. This TorchServe instance can deal with multiple kinds of models concurrently. Those models are all in a worker pool. This worker pool will take in requests from a queue, one at a time, for inference, or take in management requests from this queue as well to change their state. Their state will be written in snapshots, so you can roll back to older versions of a model. There are also metrics and logs, so you can actually debug what was going on. Regardless of these internals, TorchServe is the default way to serve models on SageMaker. We also have partnerships with Docker, Google Vertex, and KServe. You can also deploy it locally. The core engine right now is written in Java, but we're moving towards having the core engine written in C++ for future releases, to take advantage of multi-threaded inference. You can follow our progress in the TorchServe repo, specifically the PyTorch C++ backend, where you can see the latest and greatest for the C++ frontend.

Torch MultiPy

When you say parallelism in deep learning, there are a lot of ideas here. Batching and vectorization are actually both forms of concurrency, because you can take multiple examples and batch them all together at once, or you can take one large example and split it up into many subproblems. There's actually another one. Historically, in Python, multi-threading is not great because of the GIL. This is also fundamentally a constraint that PyTorch has. One way around it is that you can run multiple Python sub-interpreters at once, and have those sub-interpreters share the same memory-mapped PyTorch storage tensors under the hood. That way, you effectively get multi-threading in C++. This used to be called torch::deploy, now we call it torch MultiPy. For now, it's C++ only, but if you'd like Python bindings, let me know, because this is something I've been interested in working on myself. We want to make this easier to use. Effectively, you have a whole bunch of tricks to make inference faster, whether it's using fusions, whether it's tuning your model inferencing framework, whether it's using batching, whether it's using vectorization. Now, multiple sub-interpreters will be the new kid on the block.

Questions and Answers

Luu: Maybe you can share a bit about the recent news about PyTorch.

Saroufim: There was a super exciting announcement that happened at the PyTorch Conference. We launched PyTorch 2.0. I know the term 2.0 scares people, they're like, no, you're going to break backwards compatibility. This is fully backwards compatible, actually. The reason we're calling it 2.0 is because it's just a big change to how we use PyTorch. We still want you to write your eager mode code, so you just write your model.py, and it's easy to use. Then you can also compile that code. Historically, there has been a tradeoff here where some frameworks are hard to program but performant, and others are easy to program but slow. Effectively, with 2.0, we've removed this distinction. Basically, we want it to be both easy to program and easy to compile. In particular, there's one compiler that we've introduced called inductor that we're very excited about. What that does is basically, it lets you write CUDA kernels in Python, and also have them be really performant. Inductor is essentially a code generator from Python to Triton. There's a lot more. There's a training compiler, which is very rare. You can introduce multiple backend compilers, whether it's something like XLA, or TVM, or whether it's TensorRT, or ONNX. I expect to see a lot more innovation in the compiler space now. It's going to be a lot more possible for people to become compiler hackers, because it's all in Python.

Luu: Are there any new features that users were not able to do before that now they can do?

Saroufim: The main one is going to be torch.compile, which is, you basically take your model.py, and then you can compile it into a bunch of intermediate representations, and then take that intermediate representation and pass it to various compiler vendors with just a single flag. You can say torch.compile with backend equals inductor, or TensorRT, or ONNX, and that's it. You just set a string, and then you can try out different compilers. Feature-wise, it's not that you can do more things, it's that you can do the same thing more quickly. Now you can either run a bigger model, or run it for longer and do more iterations.
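A minimal sketch of that single-flag workflow, assuming PyTorch 2.x; the model is a stand-in, and the exact backend strings available depend on which vendor packages are installed:

```python
import torch

model = torch.nn.Linear(8, 8)                        # stand-in for your model.py module
compiled = torch.compile(model, backend="inductor")  # the default compiler

out = compiled(torch.randn(4, 8))

# Swapping compiler vendors is just a different backend string, for example
# backend="onnxrt", or a TensorRT backend if the corresponding package is
# installed; the available names vary by environment and version.
```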

How do you think the fastai library compares to PyTorch? Is fastai supposed to be higher level and easier to use, and it works on top of PyTorch, and easier for beginners?

The main difference is that PyTorch in and of itself doesn't provide a training loop. What I mean by that is, it has different facilities that make it easy for you to define your architecture. We try to stay relatively low level with how you're supposed to use it. That's why there are a couple of other libraries that are a bit higher level, like fastai or PyTorch Lightning, that will make it a lot more user friendly. The two projects somewhat have different design goals. From our perspective, we just want to make it easy for people to hack on even the low-level bits, whereas higher-level libraries may be more concerned with getting people to solve a business problem more quickly. If you're a beginner, definitely go for the higher-level libraries. As you get more comfortable and you want more control, then I'd recommend you start diving into the lower-level stuff.

Is the PyTorch versus TensorFlow debate still relevant today? Which would you recommend for a new POC project? Do you see some unification or interop between the two in the future?

Maybe the debate is like Pepsi versus Coke. It's really whatever you prefer. I can't comment too much on the tradeoff. One thing that PyTorch did really well, because we focused so much on debuggability and being able to add print statements to your code, is that it ended up becoming a very researcher friendly framework. The flip side of that is, it's easy to hack, but because it's in Python, it's slow. There's the Python overhead. You can't do optimizations that cover the whole model, because it's just executing Python one line at a time. That's really why we feel like the torch compiler will be such a powerful feature. In terms of interop, it's not like the libraries will interop directly. I think that may be a bit tricky. However, I've definitely seen people convert a PyTorch model to TensorFlow and vice versa using intermediate representations like ONNX, depending on what your infrastructure leverages more. Yes, they do interop, but not directly.

Are there plans to support more GPU types in PyTorch? I know NVIDIA was the initial one, and now there's support for Apple Silicon. What about AMD GPUs?

AMD GPUs are actually already supported. Granted, torch.compile uses our new compiler called inductor, and inductor generates Triton kernels in Python, which, as far as I know, today only target NVIDIA. The support for NVIDIA tends to be broader just because the library has been around for longer. It's one of those things where the performance speaks for itself. I would expect over time to have more hardware vendors, and also the performance for those hardware vendors to get better. You should benchmark and see what works for you.

I think one of the interesting things is, historically, it's been difficult to support a new backend for PyTorch if you're a new hardware vendor, because you need to support all the PyTorch operations, and there are about 2000 of them. One thing we did, and this was inspired by jax.lax and a bunch of other compilers, was to decompose all of those operations into a smaller subset. I think last I checked it was around 250. That way, because the larger set is a composition of the smaller ones, as a hardware vendor you only need to implement the smaller ones to get full coverage for PyTorch, as opposed to getting things like unsupported op errors, which are a frustrating user experience for people. Again, I feel like a lot of the infrastructure that was built for 2.0 will make it a lot easier for people to support new backends and new hardware architectures more easily.

Luu: Extensibility, it sounds like, or flexibility.

Saroufim: Yes, and hackability. No one knows how to read C++ anymore. I certainly don't.

 


 

Recorded at:

Sep 19, 2023
