DeepThought-8B is a small "reasoning" model built on LLaMA-3.1 8B that can work through decision-making processes step by step, much as OpenAI's o1 does, but in a far smaller package.
Requiring a "mere" 16GB of VRAM, DeepThought-8B is particularly aimed at step-by-step problem-solving, coding and mathematical tasks, and instruction-following. Ruliad, the company behind it, says its reasoning capabilities rival larger models.
This release represents our first step toward making AI reasoning more transparent and controllable, while demonstrating that smaller, more efficient models can achieve sophisticated reasoning capabilities that rival models of much larger scales.
As Ruliad explains, DeepThought-8B can break down the process of finding the solution to a problem into a sequence of steps, each of a specific type. The first step in the process is problem understanding, followed by data gathering, analysis, calculation, verification, conclusion drawing, and implementation. The actual number of steps varies with the complexity of the given task. At the end of the process, DeepThought outputs a JSON document detailing all the steps, which makes it possible for users to understand and validate the reasoning.
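Ruliad has not published the exact output schema, but based on the step types described above, a reasoning trace might look something like the following sketch. The field names ("steps", "type", "thought") and the sample contents are illustrative assumptions, not a documented format:

```python
import json

# Hypothetical reasoning trace in the style the article describes.
# The step types come from Ruliad's description; the field names
# are illustrative, not Ruliad's published schema.
trace = {
    "steps": [
        {"type": "problem_understanding",
         "thought": "The user asks which is heavier: 2kg of feathers or 1kg of lead."},
        {"type": "data_gathering",
         "thought": "Only the two stated masses are needed: 2kg and 1kg."},
        {"type": "calculation",
         "thought": "2kg > 1kg, regardless of material."},
        {"type": "verification",
         "thought": "Density is irrelevant; mass alone determines weight here."},
        {"type": "conclusion_drawing",
         "thought": "2kg of feathers is heavier than 1kg of lead."},
    ]
}

print(json.dumps(trace, indent=2))
```

Because the full trace is emitted as JSON, a caller can inspect or audit each step programmatically rather than parsing free-form text.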
Ruliad emphasizes users' ability to customize the model's reasoning patterns without retraining, as demonstrated by the deepthought_inference tool included with the model.
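Ruliad has not documented the tool's interface here, so the following is only a hypothetical sketch of how reasoning patterns could be steered at the prompt level rather than through retraining; the function name and prompt wording are assumptions, and this is not the actual deepthought_inference API:

```python
# Hypothetical sketch: steering a model's reasoning pattern via the system
# prompt instead of retraining. NOT the deepthought_inference API.
DEFAULT_STEPS = [
    "problem_understanding", "data_gathering", "analysis",
    "calculation", "verification", "conclusion_drawing", "implementation",
]

def build_system_prompt(steps=DEFAULT_STEPS):
    """Assemble a system prompt asking for one JSON object per reasoning step."""
    numbered = "\n".join(f"{i}. {name}" for i, name in enumerate(steps, 1))
    return ("Solve the task by reasoning through the following step types, "
            "emitting one JSON object per step:\n" + numbered)

# A user could, for example, drop the final step for pure analysis tasks:
print(build_system_prompt(DEFAULT_STEPS[:-1]))
```

The point of the sketch is that the step sequence is plain data: reordering, removing, or adding step types changes the model's output structure without touching its weights.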
Ruliad has not disclosed benchmark scores, inviting users to test the model and share their findings with the community. However, the company has published a comparison of the model's performance with other major models.
Interestingly, while DeepThought-8B shows similar performance to LLaMA-3.1-8B-Instruct on coding and math tasks, it outperforms it on "reasoning" tasks. Ruliad's model also outperforms Qwen-2-72B, despite the latter being larger. On the other hand, GPT-4o, o1-mini, and Claude-3.5-Sonnet score better on all counts, including reasoning; this should come as no surprise, given their far larger scale.
Several Hacker News readers tried the model out to test its performance. While it failed at finding two primes whose sum is 123, and at counting the "r"s in "strawerberry" and similar misspelled variants of "strawberry", it correctly answered "which is heavier: 2kg of feathers or 1kg of lead?". This might sound trivial, but it appears to be a challenging question for small LLMs such as LLaMA-8B, Gemma-2-9B, and others.
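The primes question is a small trap: 123 is odd, so one of the two primes would have to be the even prime 2, leaving 121 = 11 * 11, which is composite. A few lines of Python (using the word and number from the examples above) make both checks concrete:

```python
def is_prime(n: int) -> bool:
    """Trial division, sufficient for small n."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n**0.5) + 1))

# No pair of primes sums to 123: for an odd total, one addend would
# have to be 2, and 121 = 11 * 11 is composite.
pairs = [(p, 123 - p) for p in range(2, 62)
         if is_prime(p) and is_prime(123 - p)]
print(pairs)                        # []

# Counting letters is trivial for code, yet trips up small LLMs.
print("strawberry".count("r"))      # 3
print("strawerberry".count("r"))    # 4
```

So the correct answer to the primes question is that no such pair exists, which is arguably a harder response for a model to commit to than producing a plausible-looking pair.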
Other Hacker News readers took issue with the idea that such models actually "reason", stressing that using beam search to select the best path to an answer is hardly "reasoning" at all. This stance is also backed by research showing that LLMs' problem-solving ability is quite limited, since they appear to rely on narrow procedures that do not transfer easily to problems differing significantly from those seen in training.
DeepThought-8B can be downloaded from Hugging Face or used on Ruliad's website.