UC Berkeley's Sky Computing Lab Introduces Model to Reduce AI Language Model Inference Costs

UC Berkeley's Sky Computing Lab has released Sky-T1-32B-Flash, an updated reasoning language model that addresses the common issue of AI overthinking. The model, developed through the NovaSky (Next-generation Open Vision and AI) initiative, "slashes inference costs on challenging questions by up to 57%" while maintaining accuracy across mathematics, coding, science, and general knowledge domains.

The research team identified overthinking as a significant challenge where reasoning models generate unnecessarily lengthy responses with redundant steps. By optimizing the model to produce more concise outputs, Sky-T1-32B-Flash delivers faster responses while preserving answer quality. The improvements enable more efficient implementation of advanced techniques like Best-of-N, Majority Vote, and Monte Carlo Tree Search within existing computational constraints.
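
As a rough illustration of why shorter outputs matter for those techniques, the sketch below shows a plain majority vote over sampled answers; the generate_answer function is a hypothetical stand-in for a call to the model, and the sample count is arbitrary. With more concise generations, more samples fit into the same token budget.

from collections import Counter

def majority_vote(question, generate_answer, n=8):
    # Sample n independent responses and return the most common final answer.
    # Shorter generations mean more samples fit in the same compute budget.
    answers = [generate_answer(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]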

The Sky Computing Lab team implemented a three-stage process to tackle the overthinking problem in AI language models while preserving accuracy. The approach expands upon established self-training methods with specific enhancements for complex reasoning tasks.

Figure: reduction in generated token lengths while maintaining performance

The first stage focused on data generation using Sky-T1-32B-Preview to create diverse responses for 12,000 questions from the PRM800K dataset. The team generated eight responses per question using a temperature setting of 1.0 to create variation in response lengths. They then created training pairs by selecting the shortest correct answer as a positive example and the longest correct answer as a negative example.
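
A minimal sketch of this pair construction, assuming each generated response carries its text and a correctness label (the field names are illustrative, not the team's actual schema):

def build_length_pair(responses):
    # Keep only responses whose final answer was graded correct,
    # ordered from shortest to longest.
    correct = sorted((r for r in responses if r["is_correct"]),
                     key=lambda r: len(r["text"]))
    if len(correct) < 2:
        return None  # need at least two correct answers to form a pair
    return {"chosen": correct[0]["text"],      # shortest correct response
            "rejected": correct[-1]["text"]}   # longest correct response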

Initial results showed promise in reducing output length while maintaining performance on several benchmarks, including MATH500, GPQA, and MMLU. However, the team observed decreased accuracy on complex tasks like LiveCodeBench Medium and Hard, along with advanced math problems in AIME24 and MATH500 Level 5. To address this underthinking issue, they added 1,000 new training pairs that contrasted incorrect short responses with longer correct ones, helping the model learn when deeper reasoning was necessary.
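
A complementary sketch of those extra pairs, under the same illustrative schema: a longer correct response is preferred over a short but incorrect one, signaling that extended reasoning is sometimes worth the tokens.

def build_underthinking_pair(responses):
    correct = [r for r in responses if r["is_correct"]]
    incorrect = [r for r in responses if not r["is_correct"]]
    if not correct or not incorrect:
        return None
    # Prefer the longest correct response over the shortest incorrect one.
    return {"chosen": max(correct, key=lambda r: len(r["text"]))["text"],
            "rejected": min(incorrect, key=lambda r: len(r["text"]))["text"]}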

In the second stage, they focused on response refinement using Llama3.3-70B to eliminate redundant solutions while preserving reasoning quality. This process targeted common patterns where models proposed multiple solutions with phrases like "Alternatively..." or "Let me reconsider..." that often didn't improve the final answer.

The team developed a "First Correct Solution plus One" (FCS+1) method that retained the initial correct solution and one additional solution to maintain the model's reasoning capabilities. This approach proved more effective than alternatives like First Correct Solution (FCS) or FCS with Reflection in reducing response length while maintaining accuracy. The researchers noted that coding responses required different handling since they rarely contained multiple complete solutions.
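
The sketch below illustrates the FCS+1 idea: split a long response into candidate sub-solutions at common restart phrases, then keep everything up to the first correct sub-solution plus one more. The separator phrases and the is_correct checker are assumptions for illustration, not the team's exact implementation.

import re

# Zero-width split points before phrases that typically introduce a redundant retry.
SEPARATORS = r"(?=Alternatively|Let me reconsider)"

def fcs_plus_one(response_text, is_correct):
    parts = re.split(SEPARATORS, response_text)
    for i, part in enumerate(parts):
        if is_correct(part):
            # Keep up to and including the first correct solution, plus one more.
            return "".join(parts[:i + 2])
    return response_text  # no correct sub-solution found; leave unchanged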

For the final stage, the team implemented SimPO (Simple Preference Optimization) for training, which integrated length normalization into its reward structure. This method offered advantages over DPO (Direct Preference Optimization) by eliminating the need for a reference model and reducing computational requirements.
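
A minimal sketch of the SimPO objective as published by its authors, which is what the length normalization refers to: the implicit reward is the average per-token log-probability of a response, so a concise correct answer is not penalized for its brevity. The beta and gamma values below are placeholders, not the values used for Sky-T1-32B-Flash.

import torch.nn.functional as F

def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=0.5):
    # Length-normalized implicit rewards: total log-prob divided by response length.
    r_chosen = beta * logp_chosen / len_chosen
    r_rejected = beta * logp_rejected / len_rejected
    # Margin loss over the reward gap; no reference model is needed, unlike DPO.
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()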

Sky-T1-32B-Flash demonstrates significant performance improvements in reducing output length while preserving accuracy. The model reduces sequence lengths by 37% and 57% on complex problems from AIME24 and LCB-Hard, respectively, while maintaining the accuracy levels of its predecessor, Sky-T1-32B-Preview.

The optimization resulted in consistent generation length reductions exceeding 30% across all benchmark tests, marking a substantial improvement in model efficiency without compromising solution quality.

Figure: Sky-T1-32B-Flash benchmark results

The Sky-T1-32B-Flash release has sparked discussions across social media platforms, highlighting its practical impact on AI model efficiency.

A user on X praised the research team's approach to addressing verbose AI responses:

Finally someone acknowledged the rambling problem! Better yet: You guys just proved you can cut down all the needless talk without losing performance.

A Reddit user reported integration results:

We merge this model with DeepSeek-R1-Distill-Qwen-32B and QwQ-32B-Preview. The resulted model FuseAI/FuseO1-DeepSeekR1-QwQ-SkyT1-Flash-32B-Preview achieves 58.2 on LiveCodeBench (2408-2502), which is better than deepseek-ai/DeepSeek-R1-Distill-Qwen-32B (56.1) and approaching DeepSeek R1 (62.8) and OpenAI O1 (63.4).

These early fusion experiments suggest potential pathways for further performance improvements through model combination strategies.

The UC Berkeley team has released the complete Sky-T1-32B-Flash development pipeline to support further research and innovation in AI model optimization. The open-source release includes code for data generation, response rewriting, preference optimization, and evaluation procedures. The researchers have also made available their dataset of 10,000 preference pairs and the model weights through HuggingFace, enabling the broader AI community to build upon and validate their approach to reducing model overthinking.
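
For readers who want to try the model, here is a minimal sketch of loading the published weights with the Hugging Face transformers library; the repository id below is assumed from the NovaSky naming scheme, so check the release page for the exact identifier and for the hardware needed to run a 32B model.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NovaSky-AI/Sky-T1-32B-Flash"  # assumed repo id; verify on HuggingFace
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "What is the sum of the first 100 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))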
