IBM has introduced Granite 3.2, a new family of multimodal and reasoning models. Granite 3.2 features experimental chain-of-thought reasoning capabilities that significantly improve on its predecessor's performance, a new vision-language model (VLM) that outperforms larger models on several benchmarks, and smaller models for more efficient deployments.
IBM says its Granite 3.2 8B Instruct and Granite 3.2 2B Instruct significantly outperform their 3.1 predecessors thanks to enhanced reasoning capabilities. Instead of providing specialized reasoning models, as other companies currently do, IBM chose to include reasoning in its Instruct models as an option that can be toggled on and off depending on the task at hand.
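As an illustration of the idea, the toggle could look like the following minimal sketch. The request shape and the `thinking` flag name are assumptions for illustration only, not IBM's documented API.

```python
# Hypothetical sketch: one Instruct model, with chain-of-thought reasoning
# enabled per-request instead of via a separate reasoning model.
# The "thinking" flag name is an illustrative assumption.

def build_request(prompt: str, thinking: bool = False) -> dict:
    """Build a chat request, optionally toggling reasoning on."""
    req = {
        "model": "granite-3.2-8b-instruct",
        "messages": [{"role": "user", "content": prompt}],
    }
    if thinking:
        # Same model weights; reasoning is switched on only when the
        # task benefits from the extra (slower) chain of thought.
        req["thinking"] = True
    return req

fast = build_request("What is the capital of France?")
deep = build_request("Prove that sqrt(2) is irrational.", thinking=True)
```

The benefit of a toggle over a dedicated reasoning model is that cheap queries avoid paying the latency cost of chain-of-thought generation.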
One technique IBM uses in Granite 3.2 to build its reasoning capabilities is inference scaling, inspired by the idea of letting an LLM generate multiple answers and then picking the best one based on some reward model, only applied to the reasoning process rather than to final answers.
In the context of reasoning tasks, this idea of scoring multiple candidates to pick the best one can also be applied to the "chain of thought" that often precedes answer generation. In fact, you do not need to wait for the entire chain to complete before deciding whether the reasoning is sound.
IBM's approach advances the one popularized by DeepSeek, which uses a process reward model to let the LLM measure its own progress, by adding a search algorithm to explore the reasoning space. The process reward model helps the LLM detect and avoid wrong reasoning turns, while the search algorithm makes the exploration more flexible.
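The interplay between a process reward model and a search algorithm can be sketched as a beam search over partial chains of thought. The step generator and reward function below are toy stand-ins, not IBM's actual components.

```python
# Toy sketch of process-reward-guided search over reasoning steps.
# A real system would sample candidate steps from the LLM and score
# them with a learned process reward model (PRM).

def expand(chain):
    """Toy step generator: propose candidate next reasoning steps."""
    return [chain + [step] for step in ("step-a", "step-b", "step-c")]

def prm_score(chain):
    """Toy PRM: score a *partial* chain, not just a finished answer."""
    # Here "step-a" stands in for a correct reasoning turn.
    return sum(1.0 if step == "step-a" else 0.1 for step in chain)

def search(depth=3, beam_width=2):
    """Keep only the top-scoring partial chains at each step, pruning
    bad reasoning turns early instead of scoring finished answers."""
    beams = [[]]
    for _ in range(depth):
        candidates = [c for beam in beams for c in expand(beam)]
        beams = sorted(candidates, key=prm_score, reverse=True)[:beam_width]
    return beams[0]

best = search()
# The PRM steers the search toward high-reward reasoning turns.
```

The key point the sketch captures is that scoring happens per step, so a chain that takes a wrong turn is discarded before the model wastes tokens completing it.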
According to IBM, this inference-scaling approach boosts performance on the MATH500 and AIME2024 math-reasoning benchmarks and lets Granite 3.2 outperform much larger models, such as GPT-4o-0513 and Claude3.5-Sonnet-1022, on single-pass inference.
Granite 3.2 also includes a VLM particularly aimed at document understanding, named Granite Vision 3.2 2B. According to IBM, this lightweight model rivals larger models on enterprise benchmarks, such as DocVQA and ChartQA, but is not intended to be used as a replacement for text-only Granite models. It was trained using a specific dataset, DocFM, that IBM built on curated enterprise data, including general document images, charts, flowcharts, and diagrams.
Another component of the Granite family is Granite Guardian 3.2, a guardrail model able to detect risks in prompts and responses. Guardian 3.2 delivers performance similar to Guardian 3.1's at greater speed and with lower inference costs and memory usage, IBM says. It introduces a new feature, verbalized confidence, which assesses potential risk in a more nuanced way by attaching a confidence level to each verdict.
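In practice, a caller could use the confidence level to decide how to act on a verdict, for example routing low-confidence flags to human review instead of blocking outright. The response format below is a hypothetical stand-in; it does not reproduce Guardian 3.2's actual output schema.

```python
# Hypothetical sketch of consuming a "verbalized confidence" guardrail
# verdict. The "risk: ..., confidence: ..." format is an illustrative
# assumption, not Granite Guardian 3.2's real output schema.

def parse_guardian(output: str) -> dict:
    """Parse a toy 'risk: yes/no, confidence: high/low' response."""
    fields = dict(
        part.strip().split(": ") for part in output.lower().split(",")
    )
    return {
        "risky": fields["risk"] == "yes",
        "confidence": fields["confidence"],
    }

verdict = parse_guardian("risk: Yes, confidence: Low")
# A low-confidence "risky" verdict might be escalated to a human
# reviewer rather than blocked automatically.
```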
Guardian 3.2 comes in two variants, Guardian 3.2 5B (down from 8B in Granite 3.1) and Guardian 3.2 3B-A800M, with the added optimization of activating only 800 million of its three billion parameters at inference time.
As a final note on Granite 3.2, it is worth mentioning that it brings new time-series models (TTM) supporting weekly and daily forecasting in addition to the minutely-to-hourly resolutions already supported by their predecessors.
TTM-R2 models (including the new TTM-R2.1 variants) top all models for point forecasting accuracy as measured by mean absolute scaled error (MASE). TTM-R2 also ranks in the top 5 for probabilistic forecasting, as measured by continuous ranked probability score (CRPS).
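For reference, MASE compares a forecast's mean absolute error against that of a naive last-value forecast on the training series, so values below 1 beat the naive baseline. A minimal sketch with made-up numbers:

```python
# Minimal sketch of mean absolute scaled error (MASE).
# It scales the forecast's MAE by the in-sample MAE of a naive
# one-step (last-value) forecast; MASE < 1 beats the naive baseline.

def mase(y_true, y_pred, y_train):
    mae = sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)
    naive_mae = sum(
        abs(y_train[i] - y_train[i - 1]) for i in range(1, len(y_train))
    ) / (len(y_train) - 1)
    return mae / naive_mae

score = mase(y_true=[11, 12], y_pred=[10, 13], y_train=[1, 3, 5, 7, 9])
# Naive in-sample MAE is 2.0, forecast MAE is 1.0, so MASE = 0.5.
```

CRPS plays the analogous role for probabilistic forecasts, scoring the full predictive distribution rather than a single point estimate.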
In its announcement, IBM also points out that its TTM models are "tiny" in comparison to Google's TimesFM-2.0 (500M parameters) and Amazon's Chronos-Bolt-Base (205M parameters), which rank second and third by MASE.
While IBM's announcement struck some Reddit users as an impressive feat, others noted that the reported performance may reflect overfitting to a few benchmarks while ignoring others. Still, while it would be naive to expect such small models (8B and 2B parameters) to be preferable to larger models that perform much better overall or on complex tasks like coding, they may be a good fit for more specialized tasks.
Others speculated that IBM's offering specifically targets enterprises, where it is important to have legal guarantees in case things go wrong, as well as protection from potential IP issues with the datasets used for training.
All Granite models are licensed under the Apache 2.0 license and available on HuggingFace, watsonx.ai, Ollama, and LM Studio.