Key Takeaways
- When choosing between an API-based vs. self-hosted model, consider that the API solution is simpler for early rapid iteration, but self-hosted may be better for long-term cost and privacy concerns.
- Before fine-tuning a model, try prompt engineering and retrieval augmented generation (RAG).
- Smaller open models may not match the performance of large closed models like GPT-4 in all scenarios, but they are often good enough for many tasks and are worth trying.
- Hallucination is a common risk of LLMs, but RAG using trustworthy sources is a good mitigation.
- When adopting LLMs, organizations should invest in the education and training of their employees, especially focusing on the capabilities and limitations of the models.
This article is part of the Practical Applications of Generative AI article series, where we present real-world solutions and hands-on practices from leading GenAI practitioners.
Introduction
Large Language Models (LLMs) are a general purpose AI solution that can handle a wide range of tasks: answering questions, summarizing long documents, even writing code. Many organizations would like to adopt this technology, but with the fast pace of innovation, it can be difficult to keep up with the different LLM options available, each with its own benefits and risks.
Most people are familiar with the big models available over the web via APIs, such as ChatGPT. Many are also familiar with the risks associated with these models, such as the potential for "hallucinations" and privacy concerns. They may also know about the need for "prompt engineering" to get the best results from the models. However, they may not be aware of the range of available models or best practices.
In our virtual panel, we'll discuss some issues people should think about when adopting LLMs and how they can make the best choice for their specific use case.
The panelists:
- Meryem Arik - Co-Founder @TitanML
- Numa Dhamani - Principal Machine Learning Engineer @KUNGFU.AI
- Maggie Engler - Engineer & Researcher on LLMs @Inflection
- Tingyi Li - Enterprise Solutions Architect @AWS
InfoQ: What are some guidelines for choosing an API-based model, such as ChatGPT, versus a "local" or self-hosted model?
Meryem Arik: There are a few main reasons why you should be working with self-hosted models rather than API-based models, and we covered them in a recent blog post. Firstly, control and data privacy. For a lot of enterprises, their LLM applications will be touching fairly business-sensitive data, and for them it may be important that they control the model that sees that data.
Secondly, customizability. When you self-host models you control all of the weights in the model. This means you can fine-tune or adapt the model as you wish. This can give better results even with smaller models.
Thirdly, cost and scalability. It is true that when experimenting, API-based models tend to be cheaper, as there is no need to set up the infrastructure for self-hosting. However, at scale, it is cheaper and more scalable to self-host with a highly efficient inference stack.
Now that open-source models like Llama 3 have come out, there is very little reason not to build up the capabilities to self-host. Llama 3 is on par with the leading API-based models but comes with the additional benefit of being able to deploy it privately and in your own environment. Even if not all of your applications use self-hosted models, every enterprise must build up the capability to self-host; otherwise, they are seriously missing out.
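For readers who want a concrete sense of what a minimal self-hosted setup can look like, the sketch below loads an open-weight model with the Hugging Face transformers library. The model ID, generation settings, and prompt are illustrative assumptions; a production deployment would typically sit behind a dedicated inference server with batching and monitoring.

```python
# Minimal sketch: self-hosting an open-weight model with Hugging Face transformers.
# The model ID and generation settings are illustrative assumptions; a production
# setup would usually use a dedicated inference server rather than a bare pipeline.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model ID; gated, requires access approval
    device_map="auto",                            # place weights on available GPU(s)
)

prompt = "Summarize the key trade-offs between API-based and self-hosted LLMs."
result = generator(prompt, max_new_tokens=200, do_sample=False)
print(result[0]["generated_text"])
```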
Numa Dhamani: When deciding between an API-based model and a local or self-hosted model, there are several key factors that can be considered. First is data privacy and security. Using an API-based model may expose sensitive information to the service provider, which could be a concern if your data includes proprietary or sensitive information. On the other hand, a local model allows you to retain control over your data within your own organization’s infrastructure, making it a better choice for industries regulated around data privacy, such as healthcare and finance.
Additionally, cost and resource allocation are also significant considerations. API-based models are generally more cost-effective initially regarding operational overhead, as you pay per use and avoid the expenses related to infrastructure setup and maintenance. They are also typically easier to integrate and use, and often ideal for organizations without the capability or desire to manage complex infrastructure for AI systems.
However, while local models require more resources initially for setup and ongoing maintenance, they may offer long-term savings and greater flexibility, which is particularly beneficial for organizations with extensive and consistent model usage. Hosting locally can also reduce latency, as data doesn’t need to travel over the internet, which can be crucial for time-sensitive applications. Finally, reliance on API models can introduce risks related to service availability and changes in terms or pricing, which can impact long-term planning and operational stability.
Maggie Engler: Most organizations will be concerned first and foremost with the cost to build and maintain. Assuming two models of roughly equal size and capabilities, an API-based model will require fewer engineering resources to set up, and load management and scaling will typically be handled by the API provider. Considering the cost of serving infrastructure, it may well be worth it to pay per API call, especially for exploratory projects and proof-of-concepts.
For long-term, high-volume use, a self-hosted model will likely be more cost-effective, as well as offering more control over performance and latency. Additionally, organizations with high data privacy and security standards might opt to use a self-hosted model to avoid data sharing with third-party providers.
Tingyi Li: Choosing between an API-based model and a local or self-hosted model depends on various factors, including specific use cases, technical expertise, resource availability, and budget. Normally, companies from highly regulated industries are forced into in-house hosting by legal requirements and regulations around data governance, privacy, and transparency.
If you require extensive customization or need to fine-tune the model for your specific domain or use case, a self-hosted model may be preferable. It gives you full control over the model architecture, training data, and parameters. As a trade-off, you are responsible for maintaining the infrastructure, updating the model, and addressing any issues that arise.
With API-based models, customization options may be limited to what the provider offers, but you benefit from automatic updates, bug fixes, and improvements without any additional effort on your part as they are managed by the provider.
Cost can also be an important factor to consider. API-based models typically operate on a subscription or usage-based pricing model, which can be cost-effective, especially for small to medium-sized applications. However, as usage scales up, the cost may increase significantly. Self-hosted models require upfront investment in infrastructure and ongoing maintenance costs but can be more cost-effective in the long run for high-traffic applications.
Also, for real-time or latency-sensitive applications, hosting the model locally may offer better performance since it eliminates network overhead. API-based models introduce additional latency due to network communication.
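To make the cost trade-off the panelists describe more concrete, here is a back-of-the-envelope break-even calculation. All prices and volumes are hypothetical placeholders, not quotes from any provider or hardware vendor.

```python
# Back-of-the-envelope break-even sketch for API vs. self-hosted serving.
# All figures are hypothetical placeholders, not real provider or hardware prices.
api_cost_per_1k_tokens = 0.002           # assumed blended $/1K tokens for an API model
self_hosted_monthly_fixed = 4000.0       # assumed GPU + ops cost per month
self_hosted_cost_per_1k_tokens = 0.0002  # assumed marginal serving cost

def monthly_cost_api(tokens_per_month: float) -> float:
    return tokens_per_month / 1000 * api_cost_per_1k_tokens

def monthly_cost_self_hosted(tokens_per_month: float) -> float:
    return self_hosted_monthly_fixed + tokens_per_month / 1000 * self_hosted_cost_per_1k_tokens

# Monthly token volume at which self-hosting starts to pay off.
break_even = self_hosted_monthly_fixed / (
    (api_cost_per_1k_tokens - self_hosted_cost_per_1k_tokens) / 1000
)
print(f"Break-even at roughly {break_even / 1e6:.0f}M tokens/month")
```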
InfoQ: In what scenarios would you recommend fine-tuning an LLM, vs using it "out of the box" with prompt engineering?
Arik: I would always recommend trying alternatives to fine-tuning and would not recommend fine-tuning until you have tried all the alternative techniques since they are much lower effort! These alternatives include retrieval augmented generation (RAG), controlled generation, and prompt tuning.
Dhamani: Fine-tuning an LLM versus prompt engineering depends largely on the task at hand. If you have unique requirements or need the model to perform with a high degree of specificity, then fine-tuning is likely the better option. For example, if your organization operates within a niche industry with specialized jargon, you will be able to tailor the model’s responses to these specific contexts with fine-tuning.
Meanwhile, prompt engineering is more suitable for general applications or tasks where the need for customization is less critical. It’s particularly useful when flexibility and adaptability across a broad range of tasks are more important than deep specialization in a specific task.
Engler: For a given application, if you're considering using an LLM, you should have extremely clear success criteria. You should define components of successful responses in a set of evaluation contexts that capture how users will interact with the model, and ideally be able to evaluate an arbitrary model in this fashion.
For many applications, prompt engineering might work without any additional fine-tuning; the right prompt could produce reliable raw responses, or, if there are specific output constraints, responses that can be post-processed to meet those constraints. Fine-tuning will always give you more fine-grained control over the model's responses, but the simpler method is worth trying first.
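As a sketch of what evaluating "an arbitrary model in this fashion" might look like, the snippet below runs a fixed set of evaluation prompts through any model exposed as a simple callable and scores the responses against explicit success criteria. The criteria and the model interface are assumptions chosen for illustration.

```python
# Sketch of a tiny evaluation harness: run fixed prompts through any model callable
# and score the responses against explicit success criteria. The criteria and the
# `model` interface (a function from prompt string to response string) are assumed.
from typing import Callable

EVAL_CASES = [
    {"prompt": "Summarize our refund policy in one sentence.",
     "must_include": ["refund"], "max_words": 40},
    {"prompt": "List three onboarding steps for new customers.",
     "must_include": ["1", "2", "3"], "max_words": 120},
]

def evaluate(model: Callable[[str], str]) -> float:
    passed = 0
    for case in EVAL_CASES:
        response = model(case["prompt"])
        ok = all(token.lower() in response.lower() for token in case["must_include"])
        ok = ok and len(response.split()) <= case["max_words"]
        passed += ok
    return passed / len(EVAL_CASES)

# Usage: compare any two models behind the same interface, e.g.
# evaluate(lambda p: call_api_model(p)) vs. evaluate(lambda p: call_self_hosted_model(p))
```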
Li: There isn’t a case where you HAVE to fine-tune, and there are many cases where you should avoid fine-tuning. The rule of thumb is to start out with the low-hanging fruit: using "out-of-the-box" models with prompt engineering and RAG. A quick proof-of-concept like this for your use cases using your private data will take less than half a day to spin up, and you can get a rough evaluation of the performance and cost in production.
Whether to fine-tune or not depends on the ROI you are looking for; that is, the cost of evaluating and fine-tuning models compared to the performance boost and business value it will bring to your businesses.
For example, if you have a GenAI-powered verticalized offering-as-a-service, like Perplexity.ai, or you want to power your business using GenAI as your core differentiator, then you might want to invest in fine-tuning. But with the emergence of open-source models such as Llama3-8B and the advancement of their reasoning capabilities, fine-tuning smaller models for niche use cases is becoming more feasible and accessible.
InfoQ: What are your thoughts on "democratization" of LLMs? Are the small, open-source/open-weight models good enough in some cases?
Arik: Yes! More than some: actually most. Now that Llama 3 has been released, with results as good as those of many API-based models, there is very little reason not to be working with open-source language models. Even if they are used only as a "backup", it is irresponsible not to invest in some ability to self-host language models; otherwise, you are putting a lot of control into the hands of third parties.
Dhamani: The democratization of LLMs certainly has the potential to spur innovation, lower entry barriers, and increase competition in the AI field. Small, open-source models can be sufficient, or even advantageous, in certain scenarios.
They’re particularly valuable for educational and research purposes, experimentation, and use in environments where computational resources are limited. Now, individuals and smaller organizations or research institutions can explore AI capabilities, work to mitigate their limitations, and prototype applications without the substantial financial investment required for larger models.
But of course, while small models can be quite effective for certain tasks, their performance doesn’t typically match that of larger, more advanced models in handling more complex or nuanced tasks. This trade-off often lies in the balance between resource constraints and performance requirements on a task-by-task basis.
Engler: Yes! Small, open-source models are getting increasingly powerful, and are already good enough for many use cases. There are also use cases that require a small model, like running an LLM on a mobile device.
I often speak to prompt engineers who don't realize that smaller models will work, and could save them a lot of money. I met a consultant recently who was using GPT-4 for a task that a model with less than one billion parameters could easily handle, and suggested to them that they use GPT-4 to label a few thousand examples, and then fine-tune a small model on those annotations to save on inference costs.
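A rough sketch of that label-then-distill workflow is shown below: a large model labels a batch of examples, and the resulting annotations become training data for a much smaller classifier. The prompt, label set, and file names are hypothetical, and any fine-tuning framework could consume the labeled output.

```python
# Sketch: use a large API model to label examples, then train a small model on those
# labels to cut inference costs. Prompt, label set, and file names are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
LABELS = ["billing", "technical_support", "other"]  # assumed task-specific labels

def label_example(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": f"Classify the ticket into one of: {', '.join(LABELS)}. "
                        "Reply with the label only."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()

with open("unlabeled_tickets.txt") as fin, open("labeled_tickets.jsonl", "w") as fout:
    for line in fin:
        text = line.strip()
        if text:
            fout.write(json.dumps({"text": text, "label": label_example(text)}) + "\n")

# The resulting JSONL file can then be used to fine-tune a sub-1B-parameter classifier
# (for example with Hugging Face Trainer) for cheap, fast inference.
```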
Li: Open-source/open-weight models have definitely brought transparency and accelerated open-source development and the proliferation of various derivatives. While small open-source LLMs have their advantages, they may not be suitable for all use cases. In general, we are still at the stage where we rarely see open-source models on par with closed models.
However, you really don’t need the state-of-the-art or the most performant models for all your use cases. The efficacy of open-source and small LLMs depends highly on the specific requirements, resources, and constraints of the application. For many scenarios, they can provide a viable and accessible alternative to large, proprietary models, especially when combined with careful customization and optimization.
InfoQ: What are some of the common risks with LLMs, and how can they be mitigated?
Arik: The most common risk people talk about is the hallucination problem. However, I don’t think of this as a problem so much as a feature that needs to be understood. If builders understand that hallucination is inevitable when building with LLMs, then they can work to design systems that mitigate it.
The most common and popular way of mitigating hallucinations is through RAG: this is essentially the ability for the model to "look up" information that informs its answer. What is fantastic about RAG is that it not only gives the model real-time access to the data but also adds a layer of auditability, where the user can look at what information the model used to reach its answer. For a guide on how to build RAG applications you can check out our blog.
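To show the shape of such a system, here is a bare-bones retrieval sketch: documents are embedded, the ones closest to the query are retrieved, and the retrieved sources are surfaced alongside the prompt so the user can audit what the model saw. The embedding model and the llm_generate helper are assumptions, not a specific recommendation.

```python
# Bare-bones RAG sketch: embed documents, retrieve the closest ones to the query,
# and pass them to the model along with the question so answers can be audited
# against their sources. The embedding model and `llm_generate` helper are assumed.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

documents = [
    "Refunds are available within 30 days of purchase.",
    "Support is open Monday to Friday, 9am to 5pm.",
    "Enterprise plans include a dedicated account manager.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    query_vector = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector          # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

question = "How long do customers have to request a refund?"
sources = retrieve(question)
prompt = "Answer using only these sources:\n" + "\n".join(sources) + f"\n\nQuestion: {question}"
# answer = llm_generate(prompt)  # hypothetical call to whichever LLM you deploy
print("Sources shown to the user for auditability:", sources)
```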
Dhamani: LLMs come with several risks and limitations that must be carefully navigated to effectively leverage their capabilities. It is well-documented that LLMs not only inherit but also amplify the biases present in their training data. This could lead to discriminatory practices and the perpetuation of historical injustices and inequalities, especially in sensitive applications like predictive policing, hiring, or credit scoring. Mitigating bias involves conducting thorough bias audits, using and documenting diverse datasets, and inclusive testing and validation processes.
Another concern is privacy, as LLMs may inadvertently memorize and leak sensitive information. The use of privacy-enhancing techniques, such as differential privacy, can be used during the training process for mitigation, and deploying robust data anonymization protocols can further safeguard user data.
LLMs are also prone to hallucinations (i.e., generating plausible but entirely fabricated information), where leveraging techniques like RAG can provide them with additional contextual data, which may help reduce the model’s hallucinations.
Finally, encouraging human-in-the-loop solutions, where decision-making is augmented by AI instead of replaced, can not only help mitigate hallucinations, bias, and other concerns, such as skill degradation in the workforce, but also ensure that the central focus remains on the human experience.
Engler: There are definitely misconceptions about what LLMs are actually doing when they generate a response. Essentially, they are predicting the most likely response according to the past data they've seen. This makes them prone to making up realistic-sounding but inaccurate information and to reinforcing biases.
Additionally, it means that unusual inputs can sometimes produce unexpected outputs, and we've already seen lots of LLM "jailbreaks" that circumvent some of the model's training, often towards the end of producing unsafe content. Discussing mitigations could fill a book, but broadly speaking these risks can be mitigated via retrieval from trustworthy data sources, fine-tuning for robustness, and input and output sanitization.
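As a deliberately simplistic illustration of input and output sanitization (real guardrails combine classifiers, policies, and human review rather than keyword matching alone), the sketch below screens prompts against a small blocklist and redacts obvious PII patterns from responses. The blocklist and regex are placeholders.

```python
# Simplistic input/output sanitization sketch. The blocklist and regex patterns are
# placeholders; production guardrails typically layer classifiers, policies, and
# human review on top of simple filters like these.
import re

BLOCKED_PHRASES = ["ignore previous instructions", "disregard your system prompt"]
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def sanitize_input(prompt: str) -> str:
    lowered = prompt.lower()
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        raise ValueError("Prompt rejected: possible jailbreak attempt")
    return prompt

def sanitize_output(response: str) -> str:
    # Redact email addresses as a stand-in for broader PII filtering.
    return EMAIL_PATTERN.sub("[REDACTED]", response)

safe_prompt = sanitize_input("Summarize this support ticket for me.")
print(sanitize_output("Contact alice@example.com for follow-up."))
```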
Li: There are lots of common risks with LLMs, such as scams based on misleading information, misuse of intellectual property, invasion of privacy, and bias, meaning that organizations need to think about building AI systems in a responsible and governed manner.
There are eight key dimensions: fairness, explainability, robustness of the AI system, control over AI behavior, privacy and security of data, enforcement of responsible AI practices, transparency of AI usage, and safety from misuse and harm. To enforce these eight dimensions, we can regard responsible AI as a shared responsibility between the companies who build the models and the companies who build the apps using those models.
InfoQ: What non-technical or organizational changes are needed for the successful adoption of LLMs?
Arik: One is the consolidation of resources. LLMs are difficult to deploy and need significant resources, which means there is a lot of value to be gained by consolidating resources between teams. Rather than having each team do its own deployments, teams should focus on building great applications and share the resources provided by central teams. This allows for much better utilization of resources and significantly faster development cycles.
Another is education. LLMs behave in ways that are very different from how people think that computers should behave: they are non-deterministic and sometimes make mistakes. This means that we need to interact with them with this in mind. There are great courses like Mindstone that we recommend offering to the non-technical members of staff so they can better understand what AI is and how they can utilize it in their work.
Dhamani: To successfully adopt LLMs within an organization, it's important to make comprehensive organizational changes that go beyond just technical deployment. First, there needs to be support from leadership, which not only involves championing their use across departments, but also workforce training, implementing ethics and governance frameworks, and fostering a culture of continuous learning and development.
Employees across all levels and departments should be educated on the fundamentals of LLMs, including how they work at a high level, their capabilities and limitations, and best practices for interacting with them.
Implementing robust ethics and governance frameworks also ensures that LLM adoption adheres to fairness, privacy, and transparency standards, safeguarding against potential ethical and legal pitfalls.
Engler: Within an organization, the successful adoption of LLMs requires training on the capabilities and risks of LLMs, including hallucinations and data privacy concerns. But it might mean fewer changes than people assume: LLMs are presently still best suited to supplement work rather than replace it. They can produce quick and high-quality first drafts, explain concepts, and automate more tedious tasks.
In a recent Stanford and MIT study, using LLMs made call center workers about 14% more productive overall, with the most noticeable effects among novice employees; experts were not much aided by the tools. Organizations should approach the adoption of LLMs pragmatically and iteratively, experimenting and continually evaluating the return on investment.
Li: Non-technical and organizational changes play a crucial role in maximizing the benefits and mitigating potential risks associated with LLMs. First of all, organizations need to foster a culture that embraces AI technologies like LLMs. This involves promoting awareness, understanding, and acceptance of AI among employees at all levels.
Companies who are committed to developing AI securely and responsibly can take a people-centric approach that prioritizes educating leaders, advancing the science, and integrating responsible AI across the end-to-end AI lifecycle. Comprehensive training programs should be implemented to upskill employees on how to effectively utilize LLMs in their workflows. This includes training on best practices, ethical considerations, and how to interpret and trust AI-generated content.
Clear governance structures and policies should be established around the use of LLMs, including guidelines for data handling, model deployment, privacy protection, and compliance with regulatory requirements. Eventually, organizations should aim at developing an ethical GenAI adoption framework that guides the responsible use of LLMs, which fits the organizational culture and practices.
Also, Foundation Model Operations (FMOps) require additional mechanisms integrated into traditional MLOps practices, such as human-in-the-loop review and model benchmarking and evaluation. Ongoing monitoring and evaluation are crucial to assess the performance, impact, and ROI of LLMs. Organizations should establish metrics to track key performance indicators, solicit feedback from users, and iterate on model improvements over time.
Conclusion
Our panel reached a consensus on several general principles. First, the large, closed, API-based LLMs are a good initial tool for experimenting. However, smaller open models are a viable solution in many scenarios, and self-hosting models can save on long-term costs and address privacy concerns.
Regardless of the model chosen, when trying to improve the model’s responses, prompt engineering and retrieval augmented generation (RAG) are often better choices than fine-tuning models. RAG is also a good mitigation for LLM risks such as hallucinations.
LLMs present a great opportunity for companies and organizations to improve the efficiency of their employees, especially when those employees are trained to understand the benefits and limits of the technology. However, companies should have clear success criteria and should track the ROI of their model use.
This article is part of the Practical Applications of Generative AI article series, where we present real-world solutions and hands-on practices from leading GenAI practitioners.