Meta MobileLLM Advances LLM Design for On-Device Use Cases

Meta researchers' goal with MobileLLM is ambitious: to show that, for smaller models, quality is not a direct product of how many billions of parameters they have but rather the result of carefully designing their architecture. To prove their point, they coupled deep-and-thin architectures with embedding sharing and grouped-query attention mechanisms to build four models of 125M, 350M, 600M, and 1B parameters that improve accuracy over prior state-of-the-art models.
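
Grouped-query attention reduces the size of the attention layers by letting several query heads share a single key/value head. The PyTorch sketch below illustrates the mechanism only; the dimensions and head counts are illustrative assumptions, not necessarily Meta's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Minimal grouped-query attention: groups of query heads share one KV head."""
    def __init__(self, dim=576, n_heads=9, n_kv_heads=3):  # illustrative sizes
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        # K/V projections are smaller than in standard multi-head attention:
        # only n_kv_heads heads are materialized, then shared across query heads.
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each KV head so every group of query heads attends to it.
        repeat = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```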

MobileLLM shifts away from the generally accepted "scaling law", attributed to Kaplan et al., which ties improved performance to an increased number of parameters.

A prevalent belief (Kaplan et al., 2020) in the field suggests that the performance of transformer models is primarily determined by the number of parameters, the size of the training dataset, and the number of training iterations. [...] Our experimental results, specifically for small models with limited model capacity, reveals that going deeper is more crucial than going wider for performance improvement.

Previously used for Meta TinyLlama, embedding sharing is a technique that reuses the same weights for the input and output embedding layers, which reduces the overall number of weights and makes the model smaller. As Meta researchers explain, this technique is less effective for larger models, where input and output embeddings account for only a minimal portion of total parameters (e.g., 3.7% in LLaMA-70B). For a 125M-parameter model, by contrast, the embedding layers account for over 20% of parameters.

On a 30-layer 125M-parameter model,

sharing the input and output embeddings reduces the number of parameters by 16M, approximately 11.8% of total parameters with a 0.2 points drop in average accuracy. The marginal accuracy drop can be readily restored by reallocating the saved parameters to add more layers.
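
In code, embedding sharing amounts to tying the output projection to the input embedding matrix so that a single weight matrix is used twice. The PyTorch sketch below illustrates the idea and the rough parameter arithmetic; the vocabulary size and hidden dimension are assumed for illustration and are not taken from the paper.

```python
import torch.nn as nn

vocab_size, dim = 32_000, 512   # illustrative values, not the exact MobileLLM config

class TiedLMHead(nn.Module):
    """Input embedding and output projection share one weight matrix."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight   # weight tying: one matrix, used twice

model = TiedLMHead(vocab_size, dim)
# Without tying, input and output embeddings each cost vocab_size * dim weights;
# tying saves one copy: 32,000 * 512 = 16,384,000 weights, roughly the order of
# the ~16M reduction quoted above for the 30-layer 125M model.
print(sum(p.numel() for p in model.parameters()))   # 16,384,000 (the single shared matrix)
```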

Another technique aimed at maximizing weight utilization is immediate block-wise weight sharing, in which weights are replicated between adjacent blocks. This reduces latency without significantly increasing model size and is especially relevant, the researchers say, in scenarios where memory movement is the main factor determining model latency.
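
The sketch below illustrates the weight-sharing idea with a placeholder block: adjacent layers point to the same module, so the weights are stored and loaded from memory once but executed twice. This is a conceptual illustration, not Meta's implementation.

```python
import torch.nn as nn

def build_layers(block_factory, n_unique_blocks):
    """Immediate block-wise weight sharing: each block is executed twice in a row,
    so adjacent layers reuse the same weights instead of storing a second copy."""
    layers = []
    for _ in range(n_unique_blocks):
        block = block_factory()
        layers.extend([block, block])   # the same module object appears twice
    return nn.ModuleList(layers)

# Example with a trivial stand-in block (a real model would use a transformer block).
layers = build_layers(lambda: nn.Linear(576, 576, bias=False), n_unique_blocks=15)
print(len(layers))                                 # 30 layers executed
print(len({id(p) for p in layers.parameters()}))   # 15 unique weight matrices stored
```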

Leveraging these and other techniques, MobileLLM aims to define a strong baseline approach for designing optimized smaller models. Meta researchers ran a series of experiments comparing MobileLLM with previous state-of-the-art sub-billion-parameter models on tasks including zero-shot common-sense reasoning, question answering, and reading comprehension. For example, in zero-shot reasoning,

MobileLLM-LS-125M achieves comparable or even higher results than most previous 350M models. In the 350M model size category, MobileLLM surpasses previous state-of-the-art models by more than 4 points with comparable or smaller model sizes.

Analogous results hold in question answering and reading comprehension tasks.

Meta researchers say there is a growing need for large language models on mobile devices as a way to reduce cloud costs and latency. They also highlight the increasing energy consumption and carbon-dioxide emissions of ever-larger LLMs and argue that downsizing them is necessary to make them more environmentally friendly. Shifting to on-device models, they say, may address these concerns while also improving performance by cutting down on latency.

MobileLLM is available on Hugging Face.
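
For readers who want to try the models, the checkpoints can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch; the repository id shown is an assumption based on how Meta publishes models on the Hub, so check the MobileLLM model card for the exact id and loading options.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id assumed for illustration; verify it on the MobileLLM model card.
model_id = "facebook/MobileLLM-125M"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("On-device language models are useful because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```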
