Transcript
Montani: GenAI, large language models, have been truly transformative. The innovation has almost always been shockingly simple: just make the models a lot bigger. With more and bigger models coming out, it's only fair to wonder, are we heading further into a black box era, with bigger models obscured behind APIs controlled by big tech monopolies? As you can tell from the title of my talk, I don't think so. I actually think open source software means there is no monopoly to be gained in AI. It's also the reason why I ended up founding an open source company called Explosion. That's me. That's my co-founder and CTO, Matt. To give you a very quick background on what we've been doing in the space, we're probably best known for spaCy, which is a free, open source library for natural language processing in Python. You take text, you want to find out more about it and get structured data out of it: that's what spaCy is for. We've always put a lot of work into the annoying aspects of backwards compatibility, making sure the API stays stable. That had a really nice side effect more recently, which is that ChatGPT is actually pretty good at writing spaCy code, which made me proud. We also developed Prodigy, which is an annotation tool for machine learning developers that's fully scriptable in Python and runs entirely on the user's hardware. We're currently running a beta of Prodigy Teams, which brings the whole thing into the cloud and lets cross-functional teams collaborate on their training and data development, while also keeping data privacy in check and letting users host their own data processing components. A lot of the ideas and philosophies I'll be talking about are also things that directly inspired this design.
Why Open Source?
Who has used or uses open source at work? There are tons of reasons why companies especially choose open source. It's transparent: you know what you're going to get. You're not locked into a specific vendor. Yes, there's some commitment, but you'll never lose access. It's extensible. You can even fork it and just run it yourself if you want to. It runs in-house, which is especially important if you're working with private data and don't want to send your stuff to someone else's servers. Open source software can just run on your machine. It's easy to get started: just pip install it, download it. That also leads to a community vetting aspect: you can see what's popular and who uses what. It's very programmable. Very rarely is what you're doing an end-to-end thing; it slots into an existing process, and you can use code with it. It's also usually quite up to date. That's not always the case, but open source projects, often surprisingly run by very small teams, can accept pull requests and community contributions, so you're very likely to find the latest research implemented somewhere in a repository. Looking at this, you might notice that one aspect people very commonly associate with open source software isn't even listed here, which is that it's also usually free. That's because I actually believe that companies don't just pick open source software because it's free. Being free helps with getting started and other aspects, but there are a lot of other compelling reasons that make open source so dominant. It's not that it's free.
Open Source Models
Of course, we're talking about AI here, machine learning, and that's not just software. Machine learning is code plus data. Especially more recently, there's been a big and growing ecosystem of open source models out there that you can use; often the code, data, and weights are all available. To put it into perspective, I've divided it up into three categories. On one hand, we have task-specific models. Those are, for example, the models we distribute for spaCy, community projects for different domains. Stanza from Stanford also publishes models for specific tasks. There are tons, of course, on the Hugging Face Hub. These models are small, especially by today's standards. They are often fast and usually very cheap to run, but they don't always generalize well. They're trained to do a specific task, so you need data to fine-tune them for your use case. Then there's a second category called encoder models. Those are probably also models you're familiar with, like Google's BERT, and various localized variations like CamemBERT for French or AlBERTo for Italian. These models also often power the task-specific models. You use them for embeddings, and then train something specific on top. They're also relatively small, fast, and affordable to run. You can do that in-house. They generalize a lot better, but they do need some data to fine-tune them if you want them to do something specific. Then, finally, the development on encoder models has more recently led to a third category that I'm calling large generative models here. I was initially thinking, how deep should I go into all of the specific ones? I've just given a general overview of some examples that are quite popular, like Falcon, the models Mistral publishes, and all kinds of variations and plays on Llamas and Alpacas. You get the idea: all kinds of different use cases, different sizes, and so on. These are, as the name implies, very large. They're often slower and quite expensive to run. Of course, they generalize and adapt very well. They need little to no data to fine-tune and adapt them and make them do something specific, which makes them very attractive. If you're looking at this, one thing that doesn't really help is that pretty much all of these models have at some point or another been called LLMs or large language models. That's incredibly confusing, because then we don't know what we're talking about, especially when it comes to the encoder models and the large generative models. There is actually a very significant difference, and if we just call them all the same thing, that really muddies the discussion, because the question is, how do we make these models do something that we want them to do? With encoder models, you actually have a task-specific network. You use the model for the embeddings, and then you train this task network on top of it that does a specific thing, predicting something like categories over the whole text. Then, based on that, you get task-specific output, like structured data. With the large generative models, there's no task-specific network and nothing you're training on top; you just rely on the prompt, and then you get freeform text out of it. Then you need to implement some logic to turn that human-readable text into something specific that you need, like categories or structured predictions over the text. Again, that's a pretty significant difference, and it really informs how we're working with them. I'll also try to mostly avoid the term LLMs throughout this talk, to not cause even more confusion.
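To make that difference concrete, here's a minimal sketch using the Hugging Face transformers library. The model names are just illustrative stand-ins (GPT-2 stands in for a much larger generative model), not a recommendation:

```python
# Minimal sketch: encoder model + task-specific head vs. generative model + prompt.
from transformers import pipeline

# Encoder route: an encoder with a classification head fine-tuned for one task.
# The output is structured (label + score), not free-form text.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("The new release is fantastic."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

# Generative route: no task head at all. The behavior comes from the prompt,
# and the output is free-form text that still has to be parsed into a label.
generator = pipeline("text-generation", model="gpt2")
prompt = (
    "Classify the sentiment of this review as POSITIVE or NEGATIVE.\n"
    "Review: The new release is fantastic.\n"
    "Sentiment:"
)
raw = generator(prompt, max_new_tokens=3)[0]["generated_text"]
completion = raw[len(prompt):]                    # keep only the model's continuation
label = "POSITIVE" if "positive" in completion.lower() else "NEGATIVE"  # naive parsing step
print(label)
```

The second half of the sketch is exactly the "implement some logic to turn human-readable text into something specific" step described above.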
To go back to these models: the task-specific and encoder models have probably been in production at many companies for years now. Many have also started using large generative models, exploring, at least prototyping, playing around with them. Since they're very large and expensive, it's not very easy to just spin one up yourself. A number of providers have sprung up that provide access to those models via an API, including models whose knowledge and resources are available for free. How does that work? How can companies take something that's freely available and offer it for money? How does that make sense? The reason is economies of scale. Companies like OpenAI and Google have a lot of advantages. First, they have access to talent and can hire the best people. They can buy compute at wholesale prices. That's all really great. There are lots of other business reasons why more output becomes cheaper. In the context of AI specifically, GPUs are a huge part, because they're incredibly parallel. You want to batch up requests, and you can't just chop up a user's text. You either need to wait until enough requests have come in so you can batch them up, or your API needs to be so incredibly popular that you have tons of requests coming in in real time and you don't really have that problem. You can think of it a bit like a train schedule. If you're in a small town, it's not really viable for trains to come every 5 minutes, whereas here in London, you can have trains and tubes run all the time, and there are always enough people. It's the same here. That's, of course, tricky if you want to operate these models yourself, and now that's where you are. Are you doomed? Can you keep up with this? There are ways, which I'm going to be talking about. You're not quite doomed just yet.
AI Products Are More Than Just a Model
Another distinction that, in my opinion, people are not making enough, and that's super important, is the distinction between human-facing systems and machine-facing models. On the one hand, we have ChatGPT, which you probably all know, and Google's Bard or Gemini. On the other hand, we have the underlying models, like GPT-4, that are powering these systems and these products. That's really important, because for the human-facing systems, the most important differentiation is product. That's stuff like UI/UX, marketing, how it's presented, but also the customization and everything around it. We don't quite know how ChatGPT works under the hood. Very likely, there are a lot of constraints implemented around the model that prevent it from saying something that's really offensive. All of that is really on the product dimension. The machine-facing models, on the other hand, are really swappable components. They're based on research that's openly published and openly available, often on data that's also openly out there. Their impact is quantifiable: speed, accuracy, latency, cost. I think not making this distinction is actually what leads to a lot of the confusion people have around monopolizing or winning at AI, because we're not just talking about machine-facing models, we're throwing them in with actual products like a human chat assistant. While OpenAI might very well dominate that category, it doesn't necessarily mean they will dominate the AI and the software components. Now you might be thinking, but what about the data? Everyone's talking about data. Doesn't OpenAI have this enormous amount of user data that they can use to make their AI better? The thing is, yes, they definitely have that data. But user data is really an advantage for the product, not for the underlying foundation, not for the machine-facing components. Again, to make your human chat interaction better, user data is really valuable. One thing we've learned from the whole development around GenAI and large generative models, though, is that we don't need specific knowledge and specific data to gain general knowledge. That's kind of the whole point, if you think about it. Again, yes, there's an advantage to be gained here, but we shouldn't throw products in with the machine-facing models, because the machine-facing models are really what we're talking about here, and that's what we care about. That's what we're using and applying.
Use Cases in Industry
If we're looking at what people are doing with these software components in practice, there are two types of capabilities and tasks that companies are working on. On the one hand, we have the generative tasks, which are things like summarization, reasoning, problem solving, question answering, paraphrasing, and style transfer. These are capabilities that are really new, things that large generative models only recently made possible, or at least made possible to do well. That's one category. On the other hand, we have the predictive tasks. That's stuff like text classification, entity recognition, relation extraction, coreference resolution, grammar, morphology, semantic parsing, discourse structure: everything where, basically, text goes in and a structured representation comes out. Usually, that's not really the end application; it's one component, and the structured data then goes into a database, for example. If we're looking at the actual use cases and what people have been doing, we'll notice that a lot of industry problems have actually largely remained the same and mostly changed in scale. One fundamental problem we've always been trying to solve is putting structure into something that's fundamentally unstructured, like language. Of course, if we're working with index cards, there's a limit to how much you can structure, because it's index cards. Even before the introduction of computers, though, that was something we've been trying to do. There's also a limit to how many projects any team can do at a given point and how likely they are to succeed. Being able to solve a lot of these problems with AI and machine learning means that, yes, we have all of these other cool things that we can do. We're also able to do a lot more of what we've been trying to solve since before computers, which is just creating more structured data and doing more projects.
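To make the predictive "text goes in, structured data comes out" idea concrete, here's a minimal sketch with spaCy's small off-the-shelf English pipeline (it assumes you've run `python -m spacy download en_core_web_sm`):

```python
# A predictive task: named entity recognition with a small, task-specific model.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Explosion was founded in Berlin by Ines Montani and Matthew Honnibal.")

# The output is structured data you can put straight into a database,
# not free-form text you'd still have to parse.
rows = [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]
print(rows)
```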
Evolution of Problem Definitions
Ultimately, the fundamental reason we're all here, probably, is that we are trying to tell computers what to do. That's really what this is all about, and what unites all of us. We are at least trying. That process has gone through a few iterations over time. In the beginning, how do we tell a computer what to do? We give it rules or instructions. That can be conditional logic, regular expressions, and so on. With machine learning, we unlocked a new way of doing that, which is programming by example. That's also often called supervised learning. Then, with in-context learning more recently, we can also go back and provide rules and instructions in natural language this time, in the form of prompts. There are pros and cons to both approaches. Instructions are great because they're inherently human shaped. They really mimic the way we tell a human what to do. They also make it easy for non-experts to get in and get started, because just as you can talk to humans, you can talk to a machine like that. That's pretty intuitive. There's, of course, also a risk of drift. If your model changes, in the case of large generative models that you're prompting, or if your input text changes, your rules and instructions might not apply anymore, and your whole system falls apart. That's something you have to work around. Examples, on the other hand, also have a lot of advantages, because they let you express nuanced behavior: you see it and you know it's right, but you can't necessarily put it into words or into specific instructions. They're also incredibly specific to your use case. Instead of providing something general, you can really provide examples of the data you need to process and train your system, and tell the computer what you want so that it hopefully generalizes. The biggest problem so far is that this process is very labor intensive. You somehow need to generate these examples, and that's been pretty difficult. Especially if you're starting out from scratch, if you have no way to initialize the model, it's like training it up from birth. How do we find a good middle ground? We've seen that both of these ways are good, so how can we take this into a modern workflow and use the best of both worlds in a practical sense?
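As a toy illustration of the two styles, here's a minimal Python sketch. The invoice-ID task, the regex, and the labels are all made up for illustration; the example format loosely follows spaCy's character-offset convention for entity annotations:

```python
# Two ways of telling the computer what to do for the same (made-up) task:
# finding invoice IDs in text.
import re

# 1) Rules/instructions: explicit and human-readable, but brittle if the
#    input drifts away from what the rule anticipated.
INVOICE_PATTERN = re.compile(r"\bINV-\d{4,}\b")

def find_invoice_ids_by_rule(text: str) -> list:
    return INVOICE_PATTERN.findall(text)

# 2) Programming by example: show the desired behavior instead of describing it.
#    These labeled examples would be fed to a trainable model.
TRAINING_EXAMPLES = [
    ("Please settle INV-20231 by Friday.", {"entities": [(14, 23, "INVOICE_ID")]}),
    ("Ref INV-88410 was paid in full.", {"entities": [(4, 13, "INVOICE_ID")]}),
]

print(find_invoice_ids_by_rule("Reminder: INV-55502 is overdue."))
```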
Workflow Example
Here's a workflow example, in slightly abstract terms. Let's say you have this large general-purpose model, big weights that can do a bunch of stuff, and you have your domain-specific data. Now you want to apply this model to your domain-specific data. What you can do, of course, is prompt it and ask it to use all these weights that it has in order to give you an answer. On top of that, you can of course evaluate it. You can see, how well does this thing do out of the box if I don't do anything? If I just prompt it, what's the baseline I'm getting? You can also look at these examples in context, in an iterative process, and really see, ok, here's what my model is predicting. You can go in and correct these predictions, and then basically create data that takes everything that's available in the model, in the weights and outputs, and saves only the part that is relevant to you. Now, with the help of transfer learning, you can create this distilled, task-specific model that really only does the one thing you want it to do, and keep doing that and evaluating it until you beat that baseline. The thing about transfer learning is that it has really driven deep learning since at least 2018. Just because we now have new technologies available, like in-context learning, which is of course super interesting for research and is what people mostly talk about, doesn't mean that transfer learning is somehow outdated or has been replaced. It's just a different technology. A lot of research has moved on, but that doesn't mean it's not useful for practical applications.
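Here's a rough sketch of that workflow in Python, using spaCy for the distilled model. The `call_generative_model` helper, the example texts, and the spans it returns are hypothetical placeholders: in practice, that's where you'd prompt your large model of choice and then review and correct its output (for example in an annotation tool) before trusting it as training data:

```python
# Distillation workflow sketch: bootstrap annotations with a large generative
# model at development time, then train a small task-specific model you own.
import spacy
from spacy.training import Example


def call_generative_model(text: str):
    """Hypothetical placeholder: prompt a large model and parse its answer
    into (start_char, end_char, label) spans. Swap in your real provider."""
    return [(0, 9, "ORG")] if text.startswith("Acme Corp") else []


texts = [
    "Acme Corp opened a new office in Madrid.",
    "The quarterly report is due next week.",
]

# 1) Bootstrap annotations with the large model, then review/correct them.
annotations = [{"entities": call_generative_model(text)} for text in texts]

# 2) Distill into a small NER pipeline that runs in-house at runtime.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for ann in annotations:
    for _, _, label in ann["entities"]:
        ner.add_label(label)

examples = [
    Example.from_dict(nlp.make_doc(text), ann)
    for text, ann in zip(texts, annotations)
]
optimizer = nlp.initialize(lambda: examples)
for _ in range(20):                 # a few passes over the data, just to sketch the loop
    nlp.update(examples, sgd=optimizer)

nlp.to_disk("distilled_ner_model")  # a small artifact you control from start to finish
```

In a real project you would then evaluate this distilled model against the prompted baseline and keep iterating until it beats it.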
Prototype to Prod
If we're looking at the workflow here, how can this really benefit you in a real-world setting? One big thing that large generative models help you with is getting over the cold start problem, getting over the point where you have nothing. You can start out with a prototype that doesn't need any examples to be trained, instead of starting out by putting 40 hours of work into it before you probably get zero accuracy and have to wonder, what's the problem? Is it my code, my model, my hyperparameters? That really sucks, and that's probably what made up a lot of the labor-intensive aspect of programming by example, and what held companies back. Here, you have a prototype that works out of the box. Even though the output of your generative model is text, you can use all kinds of ways to parse it and put it into a structured format. One idea is what we've implemented in spaCy: it's all free, it's all open source, you can check it out. The idea is that it takes care of the prompting and outputs a structured format, in data structures you can work with, instead of just this freeform text that you can't really compute with. Then, for runtime, there's actually no need to use this large generative model that does all kinds of things you might not even need in your production pipeline. You can go mix and match these components. You can swap them out for other approaches. You can benchmark them against each other. For highly regulated environments, explainability, and so on, you can actually solve a lot of these problems with a distilled approach, because, at runtime, you have a system that really just outputs a set of vectors, or a set of integer IDs. It can't go rogue. No matter what you tell it to do, it will output these IDs, and that's it.
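One open source implementation of this idea is the spacy-llm plugin, which wraps the prompting and parses the model's free-form answer into a structured Doc object. Here's a rough sketch of the documented usage pattern; the exact registry names for the task and model (such as "spacy.NER.v3" or "spacy.GPT-4.v2"), the labels, and the API key setup depend on the spacy-llm version and provider you choose, so treat this as an outline rather than copy-paste configuration:

```python
# Sketch: prompt a large generative model through a spaCy pipeline and get
# structured Doc objects (entities with labels and offsets) instead of raw text.
# Registry strings and model/provider setup vary by spacy-llm version.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "llm",
    config={
        "task": {"@llm_tasks": "spacy.NER.v3", "labels": ["COMPANY", "PRODUCT"]},
        "model": {"@llm_models": "spacy.GPT-4.v2"},  # needs an API key in the environment
    },
)

doc = nlp("Explosion builds spaCy and Prodigy.")
print([(ent.text, ent.label_) for ent in doc.ents])  # structured output, not free-form text
```

Because the entities come back as spans on a Doc, this component can later be swapped for a trained, distilled one without changing the rest of the pipeline.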
Results and Case Studies
I'll show you some benchmarks and some examples later, but you will have a lot of advantages, because, again, these things are also going to be quite small, and you get this structured, machine-facing object back. To show you some results and case studies, that part's a bit more dry, but still very interesting. This is an analysis we did on CoNLL 2003, which is a named entity recognition task: you're predicting named entity categories over spans of text. It's a very common benchmark task. What we can see here is that the current state of the art on few-shot prompting is pretty impressive. This is just out of the box, with no training, and we get pretty high accuracy. That's very cool. We also see that we're nowhere near the state of the art of 2023. In terms of speed, if you're looking at the state of the art from 2003, which was when this task was introduced, we've never gotten anywhere close to that again. There are clearly tradeoffs to be made here. Even if the prompted models get much better, these results don't mean you need a massive black box that can also do quite a lot of other things; there are still methods you can use to get to a comparable system without one. Another nice example we ran used the Claude 2 API, on a task that is also about predicting spans of text, with a lot of labels. Here, you see the benchmark results. The dotted or dashed line is very low: we're getting an accuracy of 20% out of the box. We're able to beat that with only 20 labeled examples. You might say that's maybe a bit unfair, and also that these models are getting better. But even if we were able to double or triple that out-of-the-box accuracy, you'd still be able to top it and beat that baseline with under 500 examples. That means that if we initialize with some existing knowledge, we don't need a huge army of annotators creating data. We also don't need a massive model that runs at runtime and has a lot of disadvantages. We can basically take the best out of the latest models that we have and put it into something smaller, faster, and more specific. One more thing here: you might have heard about studies and papers showing that ChatGPT is actually better than human annotators, and wonder, how does that make sense? A lot of these analyses are looking at crowd workers: people over the internet with no connection to the task, creating large amounts of data. That part is true, ChatGPT is about as good as those armies of annotators. In fact, that says a lot more about the crowd worker methodology, and how it's really a relic of the past from before we were able to initialize models. It doesn't say very much about the real task at hand here. Humans can still do a lot, we just don't need armies of them that are completely disconnected from the task. If we only need a few hundred examples, we can bring the whole data process in-house, and that has a lot of advantages.
Distilled Task-Specific Components
To sum up this part, we've seen this approach. It's incredibly modular, which really matches the way we develop software. We don't have to throw all of these best practices overboard. There's a reason we ended up with them as best practices, and the same can hold for our model development. We're also not locked into any provider. We can use providers at development time, but at runtime, we can actually own the weights that we're running and shipping. We have testability. We can test these components individually. We can run interpretability metrics on them. They're doing one specific thing, and it's much easier to see when they're failing than if we have a single black box. It's extensible. It's one component in your system, it's flexible, and it's really cheap to run. We've had examples of teams that have trained models under 10 megabytes, or models that can run on a CPU, which is pretty amazing if you compare it to what it takes to run the largest models at the other end of the spectrum, and at the same accuracy, because you only need to do a subset of what any large model can do. It runs in-house, which is often important. If you're working with sensitive data, you're often not allowed to send your data to an API, and in a lot of industries I do think you also shouldn't. I wouldn't want some large company to send my personal data to some random startup. You can run these models in-house, on your own infrastructure, even on air-gapped machines; after the libraries are installed, you don't need internet. That solves a lot of problems and probably also makes your security teams happy. It's programmable: very rarely is machine learning the end goal. You're usually doing this to solve a business problem. For that, there is something that happens before and something that happens after, and you want to be able to program around it. Having one part of this happen somewhere else really introduces a lot of problems. If you can avoid that, that's great. Predictable goes in the same direction. You know what the model is doing. It outputs a prediction. It doesn't do anything else. You don't have to worry about it having access to stuff it shouldn't, or leaking information, because, again, its job is to predict one part of the output you're looking for. You're also bringing back transparency. We do not need large black box models. We can use them, and they're useful for a lot of reasons, but we can also develop transparent systems, just like we have been developing transparent software before. You might notice that this reminds us of the slide from the very beginning, with the reasons why companies use open source. That's no coincidence. In the end, we're talking about AI development here, and that is just a type of development. The same reasons why companies choose open source software, and why they implement things the way they do, also apply to AI. It's just a type of software development. Of course, the reasons here are similar, and we don't have to throw it all overboard because "it's a completely new paradigm, we're doing everything differently now." No, it's the same kind of work.
Monopoly Strategies
To go back to our monopoly question: yes, there's been a lot written about how to get to that position, to get rid of all this pesky competition that's so bad for your business. Ultimately, you want to tick as many boxes as possible. One way you can do this is with a compounding advantage. By that I mean, for example, something like network effects, which is mostly the case for social networks, or even Google for ads, or economies of scale, which is what we just looked at. If we think about it, economies of scale are actually a pretty lame advantage, or a pretty lame moat to have, because, first, we've seen that you don't even really need them. And the fact that there are a lot of companies competing on offering it cheaper, yes, that's the opposite of a monopoly. That's competition. Finally, it's not even that expensive to get involved, if you're thinking about the scales of traditional manufacturing. The Airbus A380 cost 25 billion to develop. The Tesla Gigafactory in Berlin, 5 billion. Those are the scales we're looking at in traditional manufacturing. These models are big and expensive for you as a company at the top of this economies-of-scale curve, but for anyone else to get involved, it's not that big of a thing. Economies of scale is a pretty lame advantage. That's not going to give anyone a monopoly. Similarly, resources. We don't have that problem here. There are no phone lines. There's no mine with resources to control. One thing we might want to watch out for is regulation, because that's a really awesome strategy if you want a monopoly: you have a monopoly because the government says so. We definitely need to make sure that we're distinguishing between the human-facing products and the machine-facing software components. If we're not doing that, and we're regulating both, then yes, we might be gifting a monopoly to whoever lobbies for it the hardest. On the other hand, I've seen a lot of drafts, and even the EU AI Act has a lot in there about regulating the actions and the use cases, and not the technology itself, and actually making that distinction. I think this distinction is very important, and we should keep it in mind. Because otherwise, yes, that's the one avenue we have left for accidentally ending up with a monopoly: if we mess up the regulation question.
Summary
I think the AI revolution won't be monopolized. There's no secret sauce, no secret knowledge, that others don't have access to. Research gets published, knowledge gets published, data gets published, models get published; nobody is gaining a monopoly because they have some secret that other people don't have. Similarly, user data. Everyone talks about data being the new oil, but usage data is great for improving your product. If you have a product, awesome. It does not generalize, and it certainly doesn't apply to software components and give someone a monopoly because it gives them better models. That's simply not the case. What we're calling LLMs, the large generative models, are often part of a product or a process; they're not the product itself. They can also be swapped out for different approaches. You can start out with a generative model to help you recognize U.S. addresses, and if you're a developer, you quickly realize, I don't need this massive model for that; I can probably get by with regular expressions, once I have some time to sit down and write some code. These are components that can be swapped out. This kind of interoperability, which is largely backed by open source software, really is the absolute opposite of a monopoly. Finally, yes, regulation is one thing that can give someone a monopoly if we're not careful and if we let it. We have to make sure that the regulation we pass focuses on actions, on the products, and on the things you do with the technology, and not just on a software component, because we're not regulating other software components either. That doesn't make sense. If you see big tech leaders lobbying in front of the U.S. Congress, or wherever, talking about the dangers of AI and how we need regulation for models, their intentions might not be entirely pure. We need to be careful there, and watch out.
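As an aside, the "regular expressions for U.S. addresses" point can be made concrete with a deliberately simple sketch. The pattern below is illustrative only; real addresses are far messier, so it's a starting point rather than a parser:

```python
# Illustrative only: a simple regex for common U.S. street addresses, standing in
# for the "swap the massive model for a few lines of code" point above.
import re

ADDRESS_PATTERN = re.compile(
    r"\b\d{1,5}\s+"                                   # house number
    r"[A-Za-z]+(?:\s[A-Za-z]+)*\s+"                   # street name
    r"(?:St|Street|Ave|Avenue|Rd|Road|Blvd|Boulevard|Dr|Drive)\b",
    re.IGNORECASE,
)

text = "Ship it to 221 Baker Street and send a copy to 1600 Pennsylvania Avenue."
print(ADDRESS_PATTERN.findall(text))
# ['221 Baker Street', '1600 Pennsylvania Avenue']
```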
Questions and Answers
Participant 1: You were talking about regulation, and we just had a European regulation; maybe you've taken a look at that. How would you evaluate this regulation with regard to the monopolies it could create?
Montani: I have to preface this: I'm not an expert on it. Regulation is obviously a topic I care about. One thing I did notice with the EU AI Act is that it really does distinguish quite specifically between different applications, different levels of applications, different use cases, and businesses. I think that's overall very good. There are some aspects where we'll have to see how it develops. One worry I've always had when it comes to EU regulation is that it's easy to jump the gun and end up with regulations that don't actually achieve the intended goal. It's like, we should have just banned targeted advertising; instead, what we ended up with is these cookie popups. That looks bad in hindsight. I'm hoping we're not getting to a point like that, and also not to a point where people can hold this against the EU and say, we should all leave this organization because all they care about is how bent bananas are, and they're not actually solving anything. I would hate that, because I do think the EU is overall a very good initiative.
Participant 2: I've heard some concern about open source models being more accessible for bad actors. I'm curious how you respond to that and weigh that against the upsides of having more open source LLMs?
Montani: This is a good point, because if we're saying, use an open source model, we're downloading this blob of binary data and then calling someone else's code to load it. That's definitely a point where we have to be careful, especially the bigger this blob of data gets. One thing in the approach that I outlined, which I think can be a good mitigation for this problem, is that you're not actually plugging in these models at runtime, if you can avoid it. If you're building a chatbot, there are some use cases where this is quite hard, but for a lot of industry use cases, if you can set it up so that your models run at development time, and you basically use them to create data and distill that down into a model that you control from start to finish, you end up with an artifact that ideally avoids a lot of the problems you'd otherwise have at runtime. That is, if it's possible. If you're doing anything with structured data, that's usually a possibility. There are, of course, bigger problems, like the fact that we might look back at this time as the golden age, when the data models were trained on was still pure, before the internet got polluted with autogenerated content, which is already really getting out of control. The other day, I read something about our software which was completely wrong, and it was probably generated by ChatGPT. I was like, no, you're poisoning all of the nice capabilities here. I don't know how that's going to develop, but I think getting your model out of runtime is good if you can do it.
Participant 3: What is your opinion on Microsoft scooping up the biggest model from Mistral and making it closed source?
Montani: For every model that's closed source, there's something else that's open source. It's like research going dark. A lot of people have this idea: why doesn't Google go dark with its research? It's because it doesn't work like that. If you don't publish a paper, someone else will. You're not going to keep a lead for very long by doing that, and it doesn't even cost that much on the scale of things.