Namee Oberst on Small Language Models and How They are Enabling AI-Powered PCs

In this podcast, Namee Oberst, co-founder of AI Bloks, the company behind the AI framework LLMware, discusses a recent trend in generative AI and language model technologies: Small Language Models (SLMs), and how these smaller models are powering edge computing on devices and enabling AI-powered PCs.

Key Takeaways

  • A small language model is generally a model under seven to eight billion parameters, because once quantized, it can run on a local machine.
  • Small language models will democratize AI and expand the AI use cases we have available today.
  • With small language models running locally on laptops, you no longer have to worry about any data privacy issues.
  • Small language models are enabling AI-powered PCs through frameworks like OpenVINO, which can run inference on CPUs alone; no GPU hardware is needed.
  • Some use cases include information retrieval from documents, auditing, observability, and workflow automation.

Transcript

Srini Penchikala: Hi, everyone, my name is Srini Penchikala. I am the lead editor for the AI, ML and Data Engineering community at InfoQ and I'm also a podcast host. Thank you for tuning into the podcast. In today's episode, I will be speaking with Namee Oberst. Namee is the co-founder of AI Bloks, the company behind the open source LLM framework called LLMware that is used for Gen AI based applications in the financial services and legal industries. We will discuss the current state of language models in Gen AI adoption and, more importantly, the recent emergence of small language models, also called SLMs, and how these new types of language models compare with the large language models we have been familiar with for the last few years.

And more importantly, how these smaller models are enabling mobile devices and edge computing servers to leverage Gen AI solutions that have been limited to large language models until now. Hi Namee. Thank you for joining me today. Can you introduce yourself and tell our listeners about your career and what areas you have been focusing on recently?

Introductions [01:28]

Namee Oberst: Hi Srini. Thank you so much for having me on. It's such a pleasure to be here. I'm Namee Oberst. I'm the co-founder of AI Bloks. And as Srini mentioned, we are the folks behind LLMware. We are an open source project and our mission is really to make generative AI easy to use, easy to deploy, and easy to develop, all for the enterprise and for regulated industries. And we really emphasize secure, safe, and cost-effective development and deployment. We've been focused on using small language models; we're pioneers in doing that. We started 20 months ago. We launched our open source project about a year ago, but we really started our work to prepare for that launch well before.

And a little bit about my background, I think, will be helpful in explaining why and how we came to focus on small language models. I started as a corporate attorney working in big law. And working in big law, I had a lot of repetitive tasks, really boring, mundane tasks that were, to be honest, soul crushing. I loved working at my big law firm for the camaraderie, for the colleagues, and I loved the perks of all that, but honestly, there were so many repetitive tasks that I was doing, and that was really the motivation behind starting an AI company: for knowledge workers, there are so many repetitive tasks that could be automated. Why aren't we doing that? So that's really the impetus and the aha moment.

But then the second aha moment was, because I was working with a lot of highly regulated industries and large corporate clients, I also knew that the ChatGPT way of sending all this potentially sensitive data out was not going to fly with a lot of enterprise customers. I know how lawyers think because I am a lawyer. So I thought there is just no way that this is going to be sustainable for the long term. So we really looked at small language models from the beginning and how small language models could do very focused and targeted tasks, like, say, contract analysis as an example. Could I have small language models review 200 documents and tell me whether these contracts are assignable, or what the state of jurisdiction is, give me some concrete facts, and do a lot of information retrieval?

And we found, lo and behold, they could, and even 20 months ago we were finding that they could do it, in some ways, more reliably and predictably than our OpenAI calls, because OpenAI had so many outages. If you pinged it 100 times, we were getting no response back 20 to 30% of the time. At that time, OpenAI also had significant issues with accuracy as well. So we were looking at it thinking, "Wow, this model that you can run locally can actually produce very comparable results, if you have a very specific use case in mind". That's how we really started down this path. And so, not surprisingly, when we launched in open source, one of the first things we did was release a whole bunch of RAG fine-tuned models.

I think we were the first to even do that. Our first RAG fine-tunes are called Dragon, and they're seven billion parameter models that are specifically trained to not hallucinate, to really only answer the facts, and with quality scores, because the idea is we wanted users to have the kind of accuracy that you could get from, basically, a corporate lawyer, as if you had another lawyer or a very intelligent colleague working alongside you like a coworker. So that was the idea behind using small language models. Since then, we've evolved to smaller RAG fine-tunes, and most recently our very small function calling models that are one to three billion parameters and can really step in and help you automate a lot of your workflows and processes.

And that's a very interesting space that we're getting a lot of usage on and I think a lot of our community members really like that as well.

Srini Penchikala: Thanks, Namee. I think this is probably the first time we have a guest who is not only passionate about the innovation side of the AI space but also has the attorney background to comment on, and opine about, the regulation and compliance side. So it's a great combination because we need both of them. I definitely want to talk more about the small language models. I'm very excited about this because, like other innovations in other technology spaces, I think the small language models have a lot of potential to make a big impact on the overall adoption of these solutions. But before we get into the main discussion, let me set some context with a quick background on small language models.

Small Language Models (SLMs) [06:09]

So as we all know, large language models or LLMs have been one of the major innovations in the AI and ML space for the last few years, since GPT and ChatGPT were released and took the software world by storm by introducing the power of generative AI technologies, compared to the traditional predictive AI that we've been doing for a long time. Now, we are seeing them being used in so many different use cases, and we have witnessed LLMs being dominantly used in various business and technology use cases to create exciting new opportunities that we had never even thought about before, not only for end users but also for software engineers, DevOps engineers, and other groups to be more productive.

So they have empowered pretty much the full life cycle of the process. LLMs are powerful, but they do come with some challenges. Mainly, they require significant computing resources to operate. You cannot just run them on a small machine or a small cluster of machines. That's why only the big tech companies are able to fully leverage these technologies. And there's also a privacy factor. You have to send the data to the cloud to leverage these LLMs, and sending the data may not be possible or allowed for many use cases in companies. So that's where LLMs are not the best choice.

So the latest trend in the language model evolution is small language models, or SLMs, which offer many of the same benefits as LLMs, but they're smaller in size, they're trained using smaller data sets, and they don't require a lot of computing resources. Again, these models are obviously not for every use case out there; they're not a one-size-fits-all solution. But for use cases where you have a constraint on resources or you want to localize the model execution, they are going to be a great help. I think they are also opening up a lot of new opportunities to run SLMs on smartphones and other mobile devices that are used for edge computing and are not always connected to the cloud.

So this detached model of language model execution is definitely paving the way to use SLMs in these different applications. They keep the data within the device and are great candidates for use cases where there are privacy, latency, or other concerns with sending the data to the cloud. So can you talk about how SLMs have come about? I know you've been focusing on this for some time. For anybody in our audience who is new to SLMs, how do these small models compare to the large models, and what are the pros and cons of the small models?

Namee Oberst: First, just to start, let's define what constitutes a small language model. The reason why I bring that up is because my own definition has changed. If you had asked me three months ago what a small language model is, I had literally a slide that I presented at an AI conference in London where I said, and I'd been saying it for a few months, that a small language model is a model that's generally under seven to eight billion parameters, because quantized, you can run that on a local machine. But then I had, as one of the bullet points, that this definition is subject to change when the models get better and also when the hardware gets better.

And so, lo and behold, I'm going to have to change that definition now, because using the latest AI PCs that are out there in the market today, and we're going to get to this, we are currently running experiments where we are able to run models of up to 14 billion parameters, that's the latest that I've been experimenting with, just as easily on an Intel-based AI PC as I can run a seven billion parameter model on the Mac, all quantized of course, but that would be quantized as GGUF for the Mac versus quantized with OpenVINO for the Intel. So that size is changing, and that's actually really, really interesting and exciting. And that is coupled with the fact that the small language models themselves are getting really, really good. They're getting better by the week, practically.

And that is what's making all this innovation possible. So between the small language models getting really good and the fact that they are now able to run on literally just a commodity laptop, I don't know what the MSRP is, but one of the earlier versions of the Intel laptops we purchased for about $1,100 from Dell this summer. And I think the Lunar Lakes are just coming out, so I haven't purchased one for myself yet, but we have a test machine that we've been able to use. So it's really exciting to see the small language models getting exceptionally good and the hardware on the edge device side getting exceptionally good, so that the definition of a small language model is changing. And then, you add to that mix the innovation in the way that we're able to train these small language models.

Since you mentioned smartphones, I'll talk about Apple Intelligence, which is coming out at the end of this month. Apple Intelligence is a 3 billion parameter model that's going to be, I think, on device, and then if something larger is needed, they're going to access it in the cloud. But that is made possible by pruning a 6.4 billion parameter model. So they took a larger model and they pruned it, which means they basically took away some of the parts that they didn't think they needed, and then they turned it into a three billion parameter on-device version. And then there's pruning coupled with distillation techniques; NVIDIA did both pruning and distillation to come up with their Minitron in July or August. It's really exciting. There are so many innovations around small language models happening.

I'm not even sure you can say that large language models per se are necessarily better, because I would have to say, tell me what the use case is and tell me what the task is, right? So for a given task, I honestly am not sure; given the fact that small language models are so efficient and you can train them to be so specific to your use case, I actually think there are arguments to be had on both sides.

Srini Penchikala: That's a great point. Yes, use the right size model for the right problem. Also, like you mentioned, by leveraging these small models with the company's proprietary data, which is more specific to their domain, the accuracy can be comparable with a large model, right?

Namee Oberst: That's right. In the NVIDIA study that they released in July, they were able to use about 40 times fewer training tokens to come up with their Minitron. That's, I think they said, less than 3% of what it would take to train even a small language model from scratch. If you keep taking that concept and applying it, distillation plus pruning resulted in a model that was apparently 16% better than a model of the same size trained from scratch. So you take all these innovations, you keep fine-tuning with your own data, and then you combine that with the fact that you can run this on a laptop, and not a special laptop, just an AI PC coming soon to a desktop near you.

I mean, I think it's really exciting, and you no longer need a GPU farm. You don't need IT to go spend millions and millions of dollars on separate GPU clusters. I think that's really exciting. That's truly democratizing AI and its use cases. So I'm really excited about that, and I think it's going to be revolutionary and it's going to blow open the AI use cases that we have available today. And one of the things that you and I wanted to talk about were possibilities: what is possible now? If everybody has access to these AI models on their laptop, what is possible compared to before? I think some of the earlier trough of disillusionment, or the winter of discontent, around AI came from really thinking about these use cases with a capital A, capital I, like the killer use case.

Going back to my story about why I was motivated to start AI Bloks and LLMware, it was these micro day-to-day tasks; I just needed help. I wanted something to help me with these day-to-day tasks that would save me an hour or a couple of hours a day, and maybe I wouldn't even use it every day, but it'd be so amazing just to have that tool. A tool that I could use to very easily query my documents, chat with them, and not have to worry about data privacy, data leakage, or data copies, because that is every enterprise's nightmare. Obviously you don't want data copies sitting around everywhere. But how are your workers working?

They're already working on the laptop, everybody is distributed, so everybody has access to laptops. So if you can really bring AI to where they are at their fingertips, I think that blows open a whole bunch of use cases and truly fulfills the promise of AI, which is like worker productivity, right?

SLM Use Cases [15:20]

Srini Penchikala: Makes sense. Yes. Actually, I have recently tried Ollama with the Microsoft Phi-3 model, and I can definitely see the difference. It was obviously a lot faster than downloading a large model and running it locally. I have not tried it with different use cases, but I could see the performance difference, and this was on my laptop, which is by no means the fastest in the world. So you mentioned that these new models are going to democratize the use cases. Can we talk a little bit more about those use cases? What specific use cases are you seeing where small language models are being preferred over other alternatives?
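For listeners who want to reproduce that kind of quick local test, here is a minimal sketch using the Ollama Python client. It assumes Ollama is installed and running and that the Phi-3 model has already been pulled with "ollama pull phi3"; the prompt is illustrative.

    # Minimal local inference sketch with the Ollama Python client.
    # Assumes Ollama is running locally and `ollama pull phi3` has been done.
    import ollama

    response = ollama.chat(
        model="phi3",
        messages=[{"role": "user", "content": "In two sentences, what is a small language model?"}],
    )
    print(response["message"]["content"])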

Namee Oberst: So just to start, let's start with the most basic use case, things like finding information in your documents. You have an 80-page contract you need to look through; who wants to look through an 80-page contract? You just want to find one or two pieces of information. If you needed to look through it, it's probably on your laptop somewhere and you have access to it. You would literally just upload the file into the model, which is all running locally. So you could even run that air-gapped; you don't even need access to Wi-Fi, because the model is on your device. You could start to query it, just ask it questions. You could do summarizations, and then you could do SQL-style queries.

If you have a large, complex spreadsheet, it's never any fun to query those with SQL. So all these kinds of what I call micro tasks, which are a part of everybody's day-to-day working life, I think could all be automated. A voice recording of a meeting, you can have that transcribed and then literally just start to query it, summarize it, have it rewrite things for you, and now you no longer have to worry about any data privacy issues. So if you were to do that, take a contract, for example. Let's say you're in HR, you have an employment contract and you just want to look up some information: how many holidays did we promise this employee, as an example? Every company's policy is different, but I don't think you should be taking that agreement and uploading it to some cloud that your company has not authorized.
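A rough sketch of that on-device document question pattern is shown below. It loosely follows the style of llmware's published quickstart examples, but the model name, file paths, and method names here are assumptions and may differ from the current API; check the project's examples before relying on them.

    # Rough sketch of local document Q&A, loosely following llmware's quickstart pattern.
    # The model name and method signatures are assumptions; verify against the llmware examples.
    from llmware.prompts import Prompt

    prompter = Prompt().load_model("bling-phi-3-gguf")           # a small local RAG-tuned model
    prompter.add_source_document("/path/to/contracts",           # folder containing the file
                                 "employment_contract.pdf",      # the document to question
                                 query="paid holidays")
    responses = prompter.prompt_with_source("How many paid holidays does this contract provide?")
    for r in responses:
        print(r["llm_response"])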

But now, if that model is sitting on your laptop, it's literally just secure and it's not going anywhere. There are no data copies. That's a very basic use case. What we are seeing more of now is that with an agent workflow, you can easily create a very simple workflow to automate. One of the examples we have is: let's say you are a financial analyst and you have an article that you're looking at, and you want to produce a whole range of reports, like a background description of the company. You want the latest stock price, its historical stock price. You could even have it do an API call to Yahoo Finance, as an example, if you wanted to. You can have it look up information on Wikipedia, as an example, or through other APIs that your company allows.

You can have it just grab that information and put all of it into a report for you, and it gets you started. So it's really supposed to be a co-pilot for your everyday working life. And I think that is the promise of AI, capital A, capital I, but not this big behemoth use case that nobody has access to, that only IT does. You also can't have 10,000 employees bombarding IT with their individual use cases; that's just not going to fly. So we want to really bring it to the user, on their device.
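As a rough illustration of that kind of analyst micro-workflow, the sketch below pulls a recent closing price with the yfinance package and asks a locally served model for a short company background; the ticker, model name, and report format are illustrative assumptions, not part of any particular product.

    # Illustrative mini-workflow: fetch market data, ask a local model for background,
    # and assemble a small report. Ticker, model name, and prompt are assumptions.
    import ollama
    import yfinance as yf

    ticker = yf.Ticker("AAPL")                                    # hypothetical company of interest
    latest_close = ticker.history(period="5d")["Close"].iloc[-1]

    background = ollama.chat(
        model="phi3",
        messages=[{"role": "user", "content": "Give a three-sentence background on Apple Inc."}],
    )["message"]["content"]

    print(f"Company background:\n{background}\n\nLatest close: {latest_close:.2f} USD")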

Retrieval Augmented-Generation (RAG) [18:43]

Srini Penchikala: Definitely. Also, I see RAG as one of the interesting outcomes of this, right? You have a base model, which can be a large model or a small model, and then you basically augment the model with your company's private information, and then you can ask it questions that are specific to your business domain. So do you see small models and RAG as a better fit, kind of made for each other, because you can start small, enrich the model with your own business domain data, and then, like you said, run it on commodity hardware and make it available to everybody in the company?

Namee Oberst: Yes, absolutely. With RAG, because of the lost-in-the-middle problem, large models are not necessarily better at all for RAG. I think we were on a podcast earlier where we talked about that, and I believe even Mandy Gu had a personal experience with that, and many, many studies have shown it. So it's not as if large language models are designed for complex RAG. The key to successful RAG deployment is in the workflow, making sure that from document ingestion all the way to inferencing, that chain is handled accurately. And then you can combine that with, say, a specialized embedding model that really understands your domain in the middle of that, although something that complex would be for a much larger application.

But on device, you can really just upload the document, start querying it, and get very, very fast inference speed, especially with the new AI PCs that have the integrated GPU; the performance difference, even compared to what we were previously seeing on the Mac M3, is astounding. And who knows what'll happen when Apple comes out with their new chips? I haven't seen that, so I can't say. For now, even on the older Intel chip, the Meteor Lake, I can tell you that the difference between running the PyTorch version of a model versus OpenVINO on the GPU is about 115 seconds versus 15 seconds for a 21-question inference test on a 1.1 billion parameter model. So you can get sub-second response times if you're matching the right inferencing technique with the right hardware.

So there's so much going on in this space, but the net of that is really AI is coming to you, it's coming to you at your fingertips. So that's super exciting.

Srini Penchikala: Yes, I think that's the best way to explain it: bring the AI to the people who can get a lot of value out of it, instead of sending the data to the cloud, right? Also, quickly, you mentioned the previous podcast; that is the AI, ML and Data Engineering Trends report that we recorded back in August. I'll make sure we link to that as part of this podcast so our listeners can reference it as well and learn more. And then, to continue the use cases discussion, with your attorney background, how do you see these small models, especially on PCs, being used in applications like auditing or compliance?

Regulatory Compliance and Auditing [21:44]

So how can you make it more built in and enable those use cases to be more proactive? So compliance by design, regulations by design rather than by accident. How do you see that?

Namee Oberst: Maybe because of my background, we built features related to compliance, auditability, and AI explainability into our commercial application, not to mention guardrails of course, like PII guardrails, but I actually think these have to be in everything we do in terms of AI. I'll start with AI explainability, and this is where I think small language models shine. We are working with a use case at a large bank, and the part they found very interesting, that they really liked about doing this with small language models, is that when you're chaining a workflow, you're designing an automation with no prompts.

With small function-calling SLMs that are basically outputting a Python dictionary, you can, let's say, ask a question, and if the answer is yes or no, create decision trees based on the model inferences and the answers. In that case, you have visibility into every single step, because the model here, at this point, said yes or no. Let's say: is this a gold high-level customer, yes or no? You can look at the logic, and if there is any fault in the decision tree, you can see exactly where that fault is and course correct, right? You can see where the mistakes are being made because you are really crystallizing the rationale behind how you are designing your workflow, very much like coding or software engineering.
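A highly simplified sketch of that pattern might look like the following; the yes/no classifier, the customer record, and the routing actions are illustrative assumptions standing in for whatever function-calling SLM and business logic are actually used.

    # Simplified workflow that branches on a small model's dictionary-style output.
    # The classifier prompt, model name, and routing actions are illustrative assumptions.
    import json
    import ollama

    def run_classifier(question, context):
        # Ask a small local model a yes/no question and parse its JSON answer.
        prompt = (
            'Answer strictly as JSON, either {"answer": "yes"} or {"answer": "no"}.\n'
            f"{question}\n\nContext: {context}"
        )
        raw = ollama.chat(model="phi3",
                          messages=[{"role": "user", "content": prompt}])["message"]["content"]
        try:
            return json.loads(raw)
        except ValueError:
            return {"answer": "unknown"}   # a sketch; a real workflow needs stricter validation

    customer_record = "Tier: gold, tenure: 7 years, open complaints: 0"
    step_1 = run_classifier("Is this a gold high-level customer?", customer_record)

    # Every branch is explicit, so a wrong outcome can be traced to the exact inference behind it.
    next_action = "route_to_priority_queue" if step_1.get("answer") == "yes" else "route_to_standard_queue"
    print(step_1, "->", next_action)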

Whereas if you have a really large language model that you're using with a lot of prompt engineering, and this is not to knock prompt engineering, I'm just explaining the difference, you actually can't tell, if you have an outcome that you're not happy with, where in that process it fell apart. For instance, where did the model make the wrong inference that led to this outcome you're not happy with? You don't have visibility into that. So I think that's one of the big differences from an enterprise perspective: you can have visibility into every single step. The AI explainability is that we can expose which options the model was considering when it gave you that answer and which option it ended up choosing. You can see that at every single step of the process, and that is truly critical, especially when you're trying to deploy new processes.

Everyone needs to understand what the model is doing at every step and where it is making the right or wrong decisions, because there are very, very few instances where something is going to get it right 100% of the time right out of the box. So we are just trying to do this in a very systematic, observable way. I think that observability factor, that explainability factor, is really critical. And then, because you have so much data and so much information, every single interaction is captured. Every single inference is captured, every single decision is captured. So for auditability purposes, for compliance purposes, you have all the data you could possibly need for every kind of question your auditors or compliance officer might have.
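One way to picture that audit trail is a plain append-only log written at every model call; the record fields and log location below are illustrative assumptions, not a prescribed schema.

    # Illustrative append-only audit log: one JSON record per model inference.
    # Field names and the log location are assumptions, not a prescribed schema.
    import json
    import time
    from pathlib import Path

    AUDIT_LOG = Path("inference_audit.jsonl")

    def log_inference(model_name, prompt, options_considered, chosen):
        # One record per model call, kept for later audit or forensic review.
        record = {
            "timestamp": time.time(),
            "model": model_name,
            "prompt": prompt,
            "options_considered": options_considered,
            "chosen_answer": chosen,
        }
        with AUDIT_LOG.open("a") as f:
            f.write(json.dumps(record) + "\n")

    log_inference("phi3", "Is this a gold high-level customer?", ["yes", "no"], "yes")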

So I think for an enterprise, the unknown, I don't know how it got there, "Oh geez, what are we supposed to do," is the worst answer. And you want to be able to almost do a forensic report on where did it screw up and let's see why, and let's fix that part, but if you don't know where it screwed up, how are you going to go fix that part?

Srini Penchikala: Also, that visibility into every single step in the process. I mean, that's the auditor's dream, so they want to know all those under the hood details, right? So that kind of helps.

Namee Oberst: Absolutely.

Large and Small Language Models Combination [25:40]

Srini Penchikala: Before we jump to the other topics in this space, let me ask you this. Large models have been popular for a while, and the small models are getting more attention for specific use cases, right? Like you said, even though they're smaller in size, they are not necessarily poorer in terms of performance or accuracy. So are there any use cases, now or in the future, where large models and small models in combination can be the best choice? Not the OR, but the AND.

Namee Oberst: Yes, absolutely, I really do. For instance, we had a very interesting conversation with an extremely intelligent gentleman. He's an architect at one of the Fortune 500 companies and really brilliant, very leading edge, and he has really experimented with small language models as well. So we asked him, "From your perspective, where would you use a really large language model like OpenAI's?" And he said those models are really good at preserving the context of conversations. So for instance, when you're having a conversation with a customer, even if the chat goes on for several days, just as an example, the customer never expects to tell you the same information twice.

That's just a no-go. So those large language models are great for capturing those conversations and keeping that context over days, over long sessions. And I think that's totally right: because of the large context window, this is where it can preserve the context and keep having that conversation. So for these very important customer-facing chatbots that can go on potentially for days and be really prolonged, yes, I definitely could see that use case. But now, let's say you want to capture that conversation when it is done and you want to drive a lot of insights and analytics off of it.

Those insights and analytics don't have to happen in real time, so they can run in batch processes overnight on very inexpensive hardware. Basically, you can take that same chat and send it off to a new workflow that doesn't need all that capability. And any processes that run in the background, hourly or daily, and a lot of enterprises have processes that run weekly or monthly, all of those things, I think, can very soon be run with small language models, chaining a whole bunch of them together on CPUs.
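A rough sketch of that kind of overnight batch job is below; the directory layout, model name, and summarization prompt are illustrative assumptions.

    # Illustrative overnight batch job: summarize saved chat transcripts with a small
    # local model on CPU. Paths, model name, and prompt wording are assumptions.
    from pathlib import Path
    import ollama

    Path("insights").mkdir(exist_ok=True)

    for transcript in Path("chat_transcripts").glob("*.txt"):
        text = transcript.read_text()
        summary = ollama.chat(
            model="phi3",
            messages=[{"role": "user", "content": f"Summarize the key customer issues:\n\n{text}"}],
        )["message"]["content"]
        (Path("insights") / f"{transcript.stem}_summary.txt").write_text(summary)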

Srini Penchikala: Definitely, I think you explained it very well. Going back to popular use cases and patterns in machine learning, usually we do some edge, on-device analytics in real time, for example for finding defects in a manufacturing process, and then send the results to the cloud. And there can also be offline analytics to generate further insights. To use that analogy, small language models can be really powerful on the edge and on devices; they can do the local modeling and generate the insights, and that output can be sent to the cloud to train the large language models, which can become more intelligent because of that.

And the output of the LLMs can be sent back to the small language models. So I can see a feedback loop between the localized small language models and the offline large language models in the cloud, feeding each other and making each other better, right?

Namee Oberst: Absolutely. I think the future, especially for large enterprises, will be a combination, but what I think would be wasteful at this point is to use a really large language model for everything; that could be overkill and actually a tremendous waste of resources, but not everyone knows that yet. There are definitely people who will just unleash these behemoth models at everything, but I think many, many people are now coming to realize that that is absolute overkill and a waste of resources.

Srini Penchikala: That's a great point. For any organization, especially startups, which are always financially constrained, large language models may not be an option to start with, but they can use the small models, learn the process, learn the solutions and make sure they are the right ones, and then invest in a bigger solution. That can be a first step in the overall learning and adoption process for companies.

Namee Oberst: Yes, and beyond that, even enterprises are extremely cost-sensitive, at least the enterprises I speak with are extremely cost-sensitive. They may be more cost-sensitive than startups, to be honest with you.

Srini Penchikala: Because they can fail big if they're not right.

Namee Oberst: Yes, this is why enterprises are always looking for cost efficiency. And I think that's right: you obviously don't want to waste money, and you want great results for your shareholders. So I fully understand that. They really care about security, safety, and cost efficiency, and I totally understand that a solution has to deliver on every one of those fronts to be useful to an enterprise.

SLM Infrastructure and Tools [30:30]

Srini Penchikala: Right, during the exploration phase of adoption they can try these low-cost, better-performing, easy-to-run models and then grow into the large models. So quickly, Namee, I want to now talk about the practical side of this discussion. I know most of our listeners are senior technical professionals, and they're probably itching to try this out on their own laptops. What kind of infrastructure or tools are required to adopt, try, or explore small language models in applications? How can our listeners who are interested in trying out small language models on their own laptops get started? What do you see as the open source frameworks or other tools to get going with this?

Namee Oberst: I'm going to give you a nuanced answer now because it really depends on your laptop, and I wouldn't have said this up until two months ago, but I really have to emphasize this point sincerely: it depends on your laptop. If you are using a Mac, then by all means use the GGUF quantized version and a solution like Ollama. However, if you are working on an Intel-based machine, then you should really work with OpenVINO. It's by far the best; even if you don't have the GPU version, just the CPU version, you should really work with the OpenVINO library. We support a lot of that workflow in LLMware now, but you still have to download OpenVINO yourself to work with our library.

And let's say it's neither Intel nor a Mac; then I would really work with ONNX, which is the Microsoft version and is cross-platform. In my experimentation, ONNX is kind of the middle-of-the-road approach. For the same test on that same Dell machine, PyTorch ran at 115 seconds for 21 questions on a one billion parameter model. On that same Dell, GGUF ran at 113 seconds, so basically a negligible difference; don't even waste your time with GGUF on an Intel-based machine. But ONNX ran at 65 seconds, while OpenVINO ran at 15 seconds. So I see ONNX as a really good middle-of-the-road approach if you have a non-Mac, non-Intel machine.

So far as I know, GGUF is the fastest way to run inferencing on the Mac, and OpenVINO wins hands down on Intel, including CPU-only machines. I have a four-year-old Dell machine that I thought was basically destined for recycling, but using what we've developed right now, it's giving me the same performance as an M3 in terms of inference speed. It's crazy, but we didn't know that you had to match the right software to the hardware. So please go check out OpenVINO, please check out ONNX, and don't just stick to GGUF. I had somebody from Australia call me three days ago saying, "Why are these models running so slow on my machine?" And I said, "Let me guess, you don't have a Mac". And he's like, "Yes, you're right. I'm not using a Mac". I was like, "Okay, this is what you need to do". So this is my public service announcement to everybody. Please.
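As one concrete starting point for the Intel path, here is a minimal sketch that runs a small model through OpenVINO on CPU via the Hugging Face optimum-intel integration (installed with something like pip install optimum[openvino]); the model name is an assumption, and any similar small causal LM should work.

    # Minimal OpenVINO-on-CPU sketch via the Hugging Face optimum-intel integration.
    # The model name is an assumption; any similar small causal LM should work.
    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer, pipeline

    model_id = "microsoft/Phi-3-mini-4k-instruct"
    model = OVModelForCausalLM.from_pretrained(model_id, export=True)  # converts to OpenVINO format
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
    result = generator("Explain retrieval-augmented generation in one sentence.", max_new_tokens=64)
    print(result[0]["generated_text"])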

Small Language Models Enable AI-powered PCs [33:31]

Srini Penchikala: Yes, don't throw away your old laptops, right? They're still valuable. Thanks for that; it's easy to get going with this. The other main area I want to focus on in this podcast is how small language models are enabling AI-powered PCs. We've talked about that a couple of times already. Small language models can operate at the edge, as they say, and don't need to be connected to the cloud. How do you see this evolution of the models enabling devices like laptops, smartphones, IoT devices, even sensors in a manufacturing plant or components in an autonomous vehicle? The use cases are limitless. How do you see the emergence of small models accelerating the power of AI PCs and devices?

Namee Oberst: I think it's incredible, just true democratization, bringing the power of AI to your fingertips. I think in the earlier podcast I said it's my dream and my vision that AI should be as ubiquitous as regular software and as easy to use, maybe even easier to use. And I think that's happening, and we are definitely trying to make it happen. We've developed a product called Model HQ where you don't need to know C++ to download OpenVINO; you don't need to know anything. It's literally just an executable. You download it and then you can run any of these OpenVINO models, and it runs really, really blazing fast on your AI PC. And this is just a start, and this is just us. These are such early days for all the model development and the PC hardware development.

I'm sure there will be better and better hardware. So where do I see this going? I really think the power of AI in, let's say, three years will be so ubiquitous that we'll look back at these times and go, "Gosh, do you remember we were even talking about how these small language models are new?" At that point it won't be new. It'll just be software, right?

Srini Penchikala: Well, when you buy the toys, it'll say batteries not included. AI will be so ubiquitous that the applications have to say, okay, this application does not include AI, right?

Namee Oberst: That's right. I think it can power everything. Anyone who's played with gen AI knows the power of being able to run inference, and these models are now coming in smaller and smaller footprints with all these distillation and pruning techniques, getting better and better, and the hardware is getting better and better too. So even the definition of a small language model is changing, to the point that I can now say, "Oh, after we hang up, I'm going to try this weekend to see if I can run a 20 billion parameter model on the laptop". Unthinkable. Unthinkable even three or four months ago, for me anyway.

I'm sure there are brilliant people out there who've already done it, but for me it was unthinkable. The fact that I can even talk like that, and then the fact that you might not even need a 20 billion parameter model because these three billion parameter models are getting so stupendously good, it's really an embarrassment of riches today, let alone what it will be like three years from now.

Srini Penchikala: Yes, definitely. I see these small language models and AI-powered PCs as two sides of the same coin. They're so complementary. They're going to make each other innovate at a very fast pace.

Namee Oberst: That's right. And the price point of these AI PCs is really interesting to me. If you think about it, for the powerful capability they offer, they're really inexpensive, and we were able to turn one into an inference server just playing around with it. The GPU capability is so powerful, so it's really, really cool what can be done with it. I highly recommend everybody check it out. I think the Lunar Lake version is coming out now for consumers, so if you're at your local Best Buy or wherever they're available, just check them out. It's amazing.

Srini Penchikala: It's interesting, I was checking the Dell website earlier today just to see what kinds of desktops and laptops are available. You can get a gaming desktop with an NVIDIA RTX GPU for, I think, $1,000 to $2,000. So I guess it's very affordable, relatively speaking, right?

Namee Oberst: Yes, absolutely. Compared to having to reserve time in the cloud for an A10, A100, or H100, how much is that going to cost you, and how accessible is that anyway?

Best Practices [37:52]

Srini Penchikala: So yes, let's talk about some best practices. The language models are evolving so fast; they're getting more powerful and smaller in terms of parameter size, though not necessarily weaker in terms of performance. Do you have any advice on best practices for somebody who wants to use these, and what kind of cautions should they be aware of?

Namee Oberst: So first, honestly, just start by trying. If you're new to small language models, I still think the place to start is with the Microsoft models. My favorite is the Phi series; I think Microsoft did a tremendous job with those models, so I really recommend them, and I would start just with those. And if you have a Mac, again, you can actually run inference using LLMware. You don't even have to download anything beyond a pip install. We support that, so just start, and you'll be pleasantly surprised at how capable they are. I think you mentioned that you had a very similar experience; you're shocked at how good they are. The only caveat I would add concerns video generation, image generation, or those kinds of multi-modal use cases.

There, I definitely couldn't say that a small model is as good as a larger model. I still haven't really seen that, but I think it will come down to an acceptable quality in due time; we're just not there today. But just start experimenting. And then, we're not the only ones who have function calling models, but if you want to take it to the next level beyond just chat, if you want to take it to workflow automation, look for some smaller function calling models that you can stack together to create a workflow. I think you'd be really pleasantly surprised. I have so many members of our open source community who use our open source, all-for-free project to chain together workflows.

And they're achieving great results, saying they don't even have to fine-tune these models because they're so good at that exact function. It could be sentiment analysis, it could be named entity recognition, it could be information extraction. There's a whole slew of, I think, a dozen models out there that each have a very specific function and that you can chain together in your workflow. And beyond that, we also have a ton of YouTube videos and over 100 examples in our repo to help you get started. I'm sure one of those examples will be a good guide to get you going.
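As a rough picture of what chaining function-specific steps can look like, the sketch below runs sentiment classification and entity extraction as two separate calls; using one general local model with task-specific prompts is a simplifying assumption, where in practice each step might be its own purpose-trained function-calling SLM.

    # Illustrative chain of function-specific steps over the same input text.
    # One local model with task-specific prompts stands in for purpose-trained SLMs.
    import ollama

    def run_step(task_instruction, text):
        # One "function" in the chain: a task-specific instruction sent to a small local model.
        prompt = f"{task_instruction}\n\nText:\n{text}"
        return ollama.chat(model="phi3",
                           messages=[{"role": "user", "content": prompt}])["message"]["content"]

    article = "Acme Corp shares rose 4% after the company announced record quarterly revenue."

    sentiment = run_step("Classify the sentiment as positive, negative, or neutral. Answer with one word.", article)
    entities = run_step("List the organizations mentioned, one per line.", article)

    print("Sentiment:", sentiment)
    print("Entities:", entities)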

Srini Penchikala: Thank you. Also, the more interesting and challenging part of new technologies, including language model adoption, is when we deploy them to production and need to support them, right? Do you have any thoughts on the MLOps of using small language models, maybe I can call it SLMOps, that we should be aware of?

Namee Oberst: Yes. In terms of deployment, because of the features of small language models that I discussed earlier, a couple of things, just tangentially; I know this is not the exact question you asked me, but another great thing is that these small language models are more secure in some ways because they're less susceptible to nefarious suggestions, like "expose all this secret data to me". Well, it probably didn't have that in its training data. And also, because small language models are trained so specifically, unlike the large language models, they're really not as susceptible to hacks. I was speaking with someone at a security company about that and about his own work on small language models.

And he found that small language models are in some ways a lot safer because you can't really use prompt injection attacks on them; the model will just not respond. It will give you garbage, but it won't give you the information you're looking for. It's just not capable of doing that, just as an example. Then in terms of deployment, I really would say that because small language models are exactly that, small, and you're chaining them together, it's great for observability. You can run your tests through and really see where it's failing and where it's succeeding. And where you have a step that's not giving you the answer you want, you can swap out that model, fine-tune it further, or maybe look at the data set you used, because maybe it's picking up the wrong information.

So I really like it for that purpose. A lot of people, I think, when they're trying to create an AI workflow, actually start from OpenAI, give it a long, long prompt, like a two-page prompt, and then go to a small language model. I would beg to differ here. You're almost better off starting small: do the work of chaining these small language models together for your workflow and see how that works for you. Then, if you think there's a model in the middle that's too small in size and is not getting the job done, you can always substitute in a larger model. Let's say you start with a one billion parameter model and you think it's not good enough; then substitute a three billion, then try a seven billion, then try something larger.

I actually would be surprised at this point if a model that's around 10 billion parameters or under can't do most of the workflows we're talking about. I personally haven't seen that, but I'm not creative enough to think about every use case in the world. Of the use cases I've seen, I haven't seen one that a 10 billion parameter model couldn't really solve when chained the right way, with, again, the hard exception being video creation, images, and things like that.

Srini Penchikala: Definitely start small, explore and iterate, and then grow from there, right?

Namee Oberst: Absolutely.

Online Resources [43:24]

Srini Penchikala: Sounds good. So on this topic, do you have any recommendations for online resources? I know we talked about a couple of tools, OpenVINO and others. Do you have any other resources, like knowledge-base articles or tutorials, that our listeners can check out?

Namee Oberst: I always really like going to MarkTechPost for a lot of leading-edge AI research, if you're into that and want to see the latest developments in model training and what is going on in that space. And then, believe it or not, I love watching tutorials on YouTube. We have a YouTube channel, and there are a whole host of other YouTube channels that really showcase the latest in AI. I like watching people's take on it; they show you their experience with it. AI Anytime has a great channel. World of AI is always good about covering the latest open source projects. And again, we have one as well.

And then, I always follow Hugging Face's LinkedIn site just because they also really promote some great new models. Most of them are small, and so Hugging Face on LinkedIn is great. So those are kind of the resources that I tend to follow.

Srini Penchikala: And obviously, we have good information on our own InfoQ website as well, right?

Namee Oberst: Absolutely. Actually, not because I'm on it, but I have to say, when I was doing research on the Apple Foundation Models, I really loved the content that I saw on InfoQ describing how the AFMs, the Apple Foundation Models, were created. I thought it was a great article. So InfoQ, definitely.

Srini Penchikala: Going back to your point about AI becoming ubiquitous, I'm seeing that in InfoQ content as well. We have these different communities for architecture, DevOps, cloud, security, and obviously AI and ML, and we are seeing a lot of AI-based articles in pretty much all of these communities. So it's kind of pervasive, right?

Namee Oberst: The question is when is AI going to be just a thing, just a regular thing. No one is excited about, "Oh, I have my own website". No one cares anymore. Come on. So when are we going to reach the point where, yes, it's just understood that everything is just going to have the AI in it, just like you said.

Srini Penchikala: Yes, I think it's going to happen, but we have to look back and think about, back in, I don't know, whatever, 2026, '27, whatever, right? I'm just picking a year. But then we'll say, "Oh, that was when AI became a part of everything we do". So I think we won't realize when it happens, but maybe looking back after that probably we'll realize that.

Namee Oberst: Absolutely.

Wrap Up [45:47]

Srini Penchikala: So do you have any additional comments before we wrap up today's discussion? Anything else you would like to share?

Namee Oberst: Thank you so much for having me on. I really appreciate it, and thank you to the InfoQ community for listening. I really encourage you to keep experimenting and playing around, and please visit our open source project, LLMware. We are really pioneers in this space, we try to pack in an end-to-end solution, and it's free for anyone to try out, so please give it a whirl.

Srini Penchikala: Thank you very much, Namee, for joining this podcast. It's been great to discuss the recent innovation in the language models and Gen AI space, the small language models, which, like you mentioned, have the potential to commoditize and localize language model solutions so they can have a bigger impact on the software development community overall. And to our listeners, thank you for listening to this podcast. If you'd like to learn more about AI and ML topics, check out the AI, ML and Data Engineering community page on the InfoQ.com website. I encourage you to listen to the recent podcasts, especially the one Namee mentioned earlier, the AI and ML Trends Report for 2024. We recorded that podcast in August and published the report in September 2024.

You can learn about small language models, AI-powered PCs, coding assistants, and a lot of other trends; there are a lot of interesting topics there. We've been seeing a lot of activity on that podcast, so definitely check it out. Thank you, Namee. Thanks, everybody.

Namee Oberst: Thank you.
