Meta recently announced Llama 3.2, the latest version of its open-source language model, which includes vision, voice, and open customizable models. It is the first multimodal version of the model, allowing users to work with visual data, for example by identifying objects in photos or editing images with natural language commands.
The new release includes vision models with 11 billion and 90 billion parameters, as well as lightweight text-only models with 1 billion and 3 billion parameters designed to run efficiently on edge and mobile devices. Llama 3.2 models support a context length of up to 128K tokens, and Meta describes the lightweight models as state-of-the-art in their class for tasks such as summarization, instruction following, and text rewriting.
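For a rough sense of why the 1B and 3B models suit edge hardware, the following back-of-the-envelope sketch estimates the memory needed for the weights alone. It assumes the nominal parameter counts from the model names and either 2 bytes per parameter (16-bit weights) or 0.5 bytes per parameter (4-bit quantization). The function name is illustrative, and activations and the KV cache are ignored, so real memory use will be higher.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights only, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Nominal sizes taken from the model names; actual parameter counts may differ.
NOMINAL_SIZES = {"1B": 1e9, "3B": 3e9, "11B": 11e9, "90B": 90e9}

for name, params in NOMINAL_SIZES.items():
    at_16_bit = weight_memory_gb(params, 2.0)  # assumption: 16-bit weights
    at_4_bit = weight_memory_gb(params, 0.5)   # assumption: 4-bit quantization
    print(f"{name}: ~{at_16_bit:.1f} GB at 16-bit, ~{at_4_bit:.1f} GB at 4-bit")
```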
Works great on documents, OCR, complex graphs.. I asked the 11B model what was funny about this image-it was able to pick the humour and even the paper details! - Sanyam Bhutani
This release is part of Meta’s ongoing commitment to openness, offering both pre-trained and instruction-tuned versions that developers can fine-tune for custom applications using tools like torchtune and torchchat. The models are available for immediate download on platforms like Hugging Face and Meta's own website, and they can be deployed across a broad ecosystem of partner platforms, including major cloud providers like AWS, Google Cloud, and Microsoft Azure.
The vision models, which are the first in the Llama series to support image reasoning, can handle complex tasks such as document-level understanding, image captioning, and visual grounding. The lightweight 1B and 3B models are particularly noteworthy for their ability to run on mobile devices, offering instant responses and enhanced privacy by processing data locally. These models are also capable of tool calling, making them ideal for personalized, on-device applications.
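To make the idea of tool calling concrete, here is a minimal sketch of the application side: the model emits a structured request, and the app parses it and runs a matching local function. The JSON shape (`name` plus `parameters`), the tool names, and the `dispatch_tool_call` helper are assumptions made for illustration, not Meta's reference implementation. No model is invoked; a hard-coded string stands in for model output.

```python
import json
from typing import Any, Callable, Dict


def get_battery_level() -> int:
    """Hypothetical on-device tool; returns a fixed value for illustration."""
    return 87


def set_timer(minutes: int) -> str:
    """Hypothetical on-device tool that pretends to start a timer."""
    return f"Timer set for {minutes} minutes"


# Registry mapping tool names the model may request to local functions.
TOOLS: Dict[str, Callable[..., Any]] = {
    "get_battery_level": get_battery_level,
    "set_timer": set_timer,
}


def dispatch_tool_call(model_output: str) -> Any:
    """Parse a JSON tool call and run the matching local function."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**call.get("parameters", {}))


# Stand-in for text a model might emit when it decides to call a tool.
example_output = '{"name": "set_timer", "parameters": {"minutes": 10}}'
print(dispatch_tool_call(example_output))
```

Because both the parsing and the tool execution happen locally, no user data needs to leave the device in a setup like this one.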
Meta today launched the Llama 3.2 family of models and I really like the new tiny 3b model. You can run it locally on your laptop, it's fast and pretty good - Guido Appenzeller
The training process for these models involved multiple stages, starting from pre-trained Llama 3.1 text models and incorporating image adapters and encoders. Post-training involved several rounds of alignment, including supervised fine-tuning and rejection sampling, to ensure the models were both helpful and safe. Meta also employed synthetic data generation to enhance the quality of fine-tuning data.
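Rejection sampling, in its generic form, can be sketched as follows: draw several candidate responses for a prompt, score each one, and keep the best only if it clears a quality bar. This is a simplified illustration of the general technique, not Meta's actual pipeline. The `generate` and `reward` callables are hypothetical stand-ins for a language model and a reward model.

```python
import random
from typing import Callable, Optional


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],
    reward: Callable[[str], float],
    num_candidates: int = 4,
    min_reward: float = 0.0,
) -> Optional[str]:
    """Return the highest-scoring candidate, or None if none clears min_reward."""
    candidates = [generate(prompt) for _ in range(num_candidates)]
    best = max(candidates, key=reward)
    return best if reward(best) >= min_reward else None


# Toy stand-ins (assumptions): a canned "model" and a length-based "reward".
CANNED = ["Sure.", "Here is a short summary.", "Here is a detailed, careful summary."]
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    return rng.choice(CANNED)


def toy_reward(response: str) -> float:
    return float(len(response))


kept = rejection_sample("Summarize the article.", toy_generate, toy_reward, min_reward=10.0)
print(kept)
```

In a fine-tuning setting, the kept responses would be added to the training data and the rejected prompts discarded or resampled.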
“It’s sort of like the Linux of AI, and we’re seeing closed-source labs react by trying to slash their prices to compete with Llama,” Mark Zuckerberg, the CEO of Meta, said. The new models will not be available in the EU for legal reasons.
Meta has introduced Llama Stack distributions to simplify the deployment of these models in various environments, from single-node setups to cloud and on-device applications. This includes a command line interface, client code in multiple languages, and Docker containers, providing a consistent and streamlined experience for developers. The stack supports both local and cloud-based implementations, allowing flexibility in choosing between running models locally or utilizing cloud services. Developers can install the stack via PyPI and configure it using a series of interactive commands, with support for both Conda environments and Docker images.
Safety remains a priority, with new updates to the family of safeguards, including Llama Guard 3 for vision capabilities and optimized versions for lightweight models. These safeguards are integrated into reference implementations and are available for the open-source community to use.
Developers interested in learning more about Llama 3.2 can find further details on GitHub, including model evaluations and model cards for the text and vision models.