At its Google I/O 2024 developer conference, Google announced it is working to make support for on-device large language models a reality by bringing the smallest of its Gemini models, Gemini Nano, to Chrome.
Generative AI relies on large models that are roughly a thousand times bigger than the median web page, ranging from tens to hundreds of megabytes, say Chrome developers Kenji Baheux and Alexandra Klepper.
While this makes it rather hard to deploy and run AI models locally on-device, the benefits of doing so are manifold: better privacy, since sensitive data does not need to leave the user's device; reduced latency, improving the user experience; offline access to AI features and graceful fallback when remote models are unavailable; and the possibility of a hybrid approach, where some AI runs on-device as a preview or to reduce remote inference costs for frequent user flows.
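The hybrid approach described above can be sketched as a simple routing decision: use the browser's built-in model when one is exposed, otherwise fall back to a server-side endpoint. Note this is an illustrative sketch, not shipped API; the `globalThis.ai` entry point is an assumption based on the early preview program, and the remote fallback is left abstract.

```javascript
// Hypothetical sketch of the hybrid on-device/remote pattern.
// `globalThis.ai` is assumed to be the built-in AI entry point in
// supporting Chrome builds; it is absent everywhere else.
async function chooseInferenceBackend() {
  if (typeof globalThis.ai !== "undefined") {
    // Inference stays local: private, low-latency, works offline.
    return "on-device";
  }
  // Graceful fallback to remote, server-side inference.
  return "remote";
}
```

In a browser without built-in AI support (or in Node.js), the function resolves to `"remote"`, which is exactly the graceful-fallback behavior the Chrome team describes.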
To circumvent the model size and delivery problem, Chrome engineers are developing web platform APIs and browser features designed to integrate AI models, including large language models (LLMs), directly into the browser. This includes Gemini Nano, the most efficient version of the Gemini family of LLMs, designed to run locally on most modern desktop and laptop computers.
This approach brings additional benefits: the browser can distribute a model suited to the device at hand and keep it up to date automatically, and it can target the GPU or NPU, or fall back to the CPU, depending on the available hardware.
To make all of this possible, Chrome developers created a specific infrastructure to access foundation and expert models on-device. This infrastructure is currently used to power the Help me write experimental feature, which aims to help users start writing or refine existing texts using Gemini models.
You'll access built-in AI capabilities primarily through task APIs, such as a translation API or a summarization API. Task APIs are designed to run inference against the best model for the task at hand.
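A task API call might look like the following sketch. The exact surface was not final at announcement time: the `ai.summarizer` namespace and its `create`/`summarize` methods are illustrative assumptions, and the feature detection reflects the fact that these APIs only exist in supporting Chrome builds.

```javascript
// Hypothetical sketch of a summarization task API; names are assumptions,
// not the shipped Chrome API surface.
async function summarize(text) {
  // Feature-detect the built-in AI entry point before using it.
  if (typeof globalThis.ai === "undefined" || !globalThis.ai.summarizer) {
    return null; // caller should fall back to a server-side model
  }
  const summarizer = await globalThis.ai.summarizer.create();
  return summarizer.summarize(text);
}
```

Returning `null` when the capability is missing lets calling code implement the hybrid fallback rather than failing outright.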
According to Chrome developers, Gemini Nano is best for language-related use cases, such as summarization, rephrasing, or categorization, and the APIs will support fine-tuning it. Fine-tuning is a technique that "specializes" a given model for a specific task without requiring an entirely new model built for that task. Chrome's APIs will support Low-Rank Adaptation (LoRA), which adjusts the model's weights to improve performance on the target task. Another API that may come to Chrome is the Prompt API, which lets developers send an arbitrary task, expressed in natural language, to Gemini Nano.
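Based on the early preview program's documentation, using the Prompt API might look roughly like the sketch below. The `createTextSession`/`prompt` names reflect the preview-era surface and may well change before anything ships, so treat this as an assumption-laden illustration.

```javascript
// Sketch of the experimental Prompt API (preview-era names; subject to change).
async function askNano(question) {
  const ai = globalThis.ai;
  // The entry point only exists in Chrome builds with the built-in model.
  if (!ai || typeof ai.createTextSession !== "function") {
    return null; // built-in model unavailable on this browser/device
  }
  const session = await ai.createTextSession();
  // Send an arbitrary natural-language task to Gemini Nano.
  const answer = await session.prompt(question);
  session.destroy(); // free on-device resources when done
  return answer;
}
```

As with the task APIs, the session is created on demand and torn down afterwards, since the model holds non-trivial on-device resources.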
Developers must join an early preview program to experiment with these new Chrome features.