
Windsurf Introduces Arena Mode to Compare AI Models During Development


Windsurf has introduced Arena Mode inside its IDE, allowing developers to compare large language models side by side while working on real coding tasks. The feature is designed to let users evaluate models directly within their existing development context, rather than relying on public benchmarks or external evaluation websites.

Arena Mode runs two Cascade agents in parallel on the same prompt, with the underlying model identities hidden during the session. Developers interact with both agents using their normal workflow, including access to their codebase, tools, and context. After reviewing the outputs, users can select which response performed better, and those votes are used to calculate model rankings. The results feed into both a personal leaderboard based on an individual’s votes and a global leaderboard aggregated across the Windsurf user base.
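Windsurf has not published the algorithm it uses to turn votes into rankings. As a rough sketch only, the Python snippet below shows one common way pairwise preference votes can drive a leaderboard, using an Elo-style rating update; the model names, baseline rating, and K-factor are hypothetical and not taken from Windsurf's documentation.

```python
from collections import defaultdict

# Hypothetical illustration: aggregate "which agent did better" votes
# into a leaderboard using a simple Elo-style update.

K = 32                                   # hypothetical update step size
ratings = defaultdict(lambda: 1000.0)    # every model starts at a baseline score

def record_vote(winner: str, loser: str) -> None:
    """Update ratings after a developer picks `winner` over `loser`."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1.0 - expected_win)
    ratings[loser] -= K * (1.0 - expected_win)

# Hypothetical votes from anonymized head-to-head sessions
record_vote("model-a", "model-b")
record_vote("model-a", "model-c")
record_vote("model-c", "model-b")

leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
for model, score in leaderboard:
    print(f"{model}: {score:.1f}")
```

In a scheme like this, each vote moves the winner up and the loser down by an amount that reflects how unexpected the outcome was, an approach popularized by public model-comparison platforms.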

According to Windsurf, the approach is intended to address limitations of existing model comparison systems, such as testing without real project context, sensitivity to superficial output style, and the inability to reflect differences across tasks, languages, or workflows. Windsurf aims to capture evaluations that more closely resemble day-to-day development work, including debugging, feature development, and code understanding.

Arena Mode supports testing specific models or selecting from predefined battle groups, such as faster models versus higher-capability models. Developers can keep follow-up prompts synchronized between agents or branch conversations independently. Once a preferred output emerges, the session can be finalized and recorded for ranking.

Arena Mode is offered with free access to all battle groups for a limited period, after which results will be published and additional models added over time. Windsurf also plans to expand the system with more granular leaderboards by task type, programming language, and potentially team-level evaluations for larger organizations.

The announcement of Arena Mode has drawn a mix of praise and skepticism from the community. Users on X appreciate the real-world benchmarking approach but raise concerns about token usage and practicality.

DevRel Lead @nnennahacks shared:

Your codebase is the benchmark. Spicy!

Meanwhile, user @BigWum commented:

What a great way to burn through even more tokens.

Several other tools in the developer AI space are exploring related ideas, though with different levels of integration and focus. Public evaluation platforms such as Chatbot Arena (LMArena) allow users to compare model outputs side by side, but typically operate on short, context-free prompts outside of real development environments. Some IDE-integrated assistants, including GitHub Copilot and Cursor, support switching between models or running background evaluations, but do not currently center on explicit, user-driven head-to-head comparisons as part of the workflow. Other emerging coding agents emphasize multi-model routing or automatic model selection based on task type, rather than exposing direct comparisons to developers.

Alongside Arena Mode, Windsurf announced a new Plan Mode as part of its latest release. Plan Mode focuses on task planning before code generation, prompting users with clarifying questions and producing structured plans that can then be executed by Cascade agents. The feature is intended to help developers define context and constraints upfront before running code-related tasks.
