
Fine Tuning the Enterprise: Reinforcement Learning in Practice


Summary

The speakers discuss Agent RFT, OpenAI’s platform for fine-tuning reasoning models via real-time tool interactions and custom reward signals. They explain how reinforcement learning solves complex credit assignment challenges within the context window. They share enterprise success stories, showing how Agent RFT eliminates long-tail token loops and drives extreme efficiency.

Bio

Wenjie Zi is a member of Technical Staff @OpenAI and a community leader building TAPNET, previously @Grammarly, with 10+ years of industrial experience in AI applications. Will Hang is a member of Technical Staff @OpenAI and a former founder, previously @DeepMind, @Google Brain, @Snorkel AI.

About the conference

QCon AI is a practitioner-led event focused entirely on the engineering discipline required to scale these workloads safely. It provides direct access to the architectural playbooks and failure metrics that peer organizations use in production.

Transcript

Will Hang: I'm Will.

Wenjie Zi: I'm Wenjie.

Will Hang: We're on the fine-tuning team at OpenAI. We're excited to talk to you today about Agent RFT, the most powerful way to enhance the performance of your agents. Let's kick things off by talking about agents. You're probably joining us today because you're building an agent for your application or your business, and you'd like to improve its performance. First, we can start by talking about what an agent actually is. What we think makes an agent different from a regular model is its ability to interact with the outside world to complete a task. It doesn't have to go through you all the time or even talk to you, it just gets things done on its own. In order to get things done, this agent has to have access to tools. If you're building a coding agent, it's got to have access to a terminal, a code interpreter, or maybe even an entire codebase.

If you're building a customer service agent, maybe it needs access to a different set of tools, like internal software to look up customer records, billing systems to issue refunds, or the ability to escalate to a human. This agent needs a way to interact with your business context and the outside world to get things done through the use of tools. Another way that we think about agents is that all their interactions with the outside world go back into the context window. This means that after looking at what it sent into and got out of a tool, the agent will reason to itself, call another tool, and then potentially repeat the process. All these tool calls, what it decides to send into a tool, what it gets out of a tool, and the tokens that it uses for reasoning, they're all part of the same context window.

This is going to come up later. We care a lot about agents here at OpenAI, and we're building some of the best agents for specific use cases. For example, Codex has access to a wide range of tools to complete coding tasks end-to-end, like running tests, reading your code files, and making code changes. In the case of Codex, there are tools that we expose as shell commands. For example, shell commands that can be run through a terminal, like ls or cat, or maybe even higher-level planning tools, where it calls out to a specific planning model to get out a code plan, which it then executes on its own. Another example that you all might be familiar with is Deep Research, and this is now embedded within our agent and GPT-5 series models. Deep Research has access to tools like a browser, where it can look stuff up, or it can look up documents in your file system, if you attach your file system tools to Deep Research.

Improving Agent Performance

How do we make these agents better and improve their performance? In the field, we've seen different ways to improve the performance of your agents. As a frontline technique, you can use prompt optimization. You can optimize the agent prompt itself to achieve better results. With the prompting process, you can steer model behavior to align more with your preferences or the kinds of things that you're trying to optimize for in your business use case. Or, you can then simplify the task itself or add better guardrails around the task to improve the agent's chances of getting things right. You can also add more tools or subtract tools, or you can improve the descriptions for tools. There's a lot of essentially tool adjustments that you can make. Or you can make the tools themselves just better at giving the agent the information it needs or accomplishing what the agent intends to do, so you can also improve the quality of the tools themselves to improve overall task performance.

Let's say that you've tried all these approaches, and you're still trying to squeeze even more juice out of your task, and you want even better agent performance. That's where you might turn to fine-tuning. Fine-tuning is a way to train, potentially, an agent, but more generally, a model end-to-end on your tasks to achieve even better performance. Before we talk about how to fine-tune agents, which is what I'd love to really focus on today, we were asked to maybe give an overview of how we thought about fine-tuning right now, and also maybe the future of fine-tuning moving forward, because we know a lot has changed in the world of fine-tuning. Maybe we can start from the top. Let's first talk about the origins of a lot of fine-tuning techniques, which are rooted in supervised learning. Supervised fine-tuning, which is a way to fine-tune models using supervised learning, was a paradigm for fine-tuning in general for years, even including our own product in the API. We found that when most people think about fine-tuning, they think about supervised fine-tuning. Supervised fine-tuning leverages supervised learning to teach the model to be much more likely to output the tokens in your dataset given the input prompt.

Now let's go over a bit what supervised learning is. We think of supervised learning as essentially teaching the model to parrot things back to you given an input. In supervised learning, you adjust the weights of the model to increase the probability of emitting tokens that are in your, essentially, output dataset. By doing this, you can shift the behavior of the model at a fundamental level. Your hope here is that the model generalizes well to new inputs that it hasn't seen yet and learns to output the correct tokens. This is super effective for tasks like classification tasks or summarization or translation tasks, or tasks where you want the model to match a particular style, because the shift here that you would want in model behavior is quite predictable and consistent. This is what we mean. For example, let's say that you'd like to fine-tune a model to be better at summarizing financial documents like S-1s or 10-Ks.

You want the model to always include things like revenue or operating margins or other essential information from the financial document that you pass into the model through the input prompt. Through the supervised fine-tuning process, the model will learn that it should attend more to the input tokens that might contain relevant information to use when it's generating the output tokens. Potentially, even parrot back specific tokens it sees in the input data in order to capture relevant information. The model will learn this general pattern when it encounters an S-1 or 10-K for a company it's never seen before and act accordingly. In the slide, you see this heatmap of where the model is attending to, where it's paying the most attention. One of the patterns you might learn is to pay the most attention to tokens in the input that contain the specific information that you're trying to extract out to generate the summary that you're trying to train the model on.

Right here, we can see how through the supervised learning process, you're actually making the model up-weight the more likely token or that you're telling it, billion here is the token that's the most likely after 2.4, for example. You can push up the probabilities of the tokens that it ought to emit based on what you specified in your dataset. Supervised fine-tuning is an incredibly powerful technique to make models behave the way that you want them to. It's also possible to elicit similar pattern matching behavior from our models using other techniques like in-context learning, or simply achieve better performance using prompt optimization. Using these two techniques, you might use way more tokens than you want to in the input prompt, but it's still possible to elicit this type of improved performance from the model. Supervised fine-tuning isn't the only kid in town. It's not the only technique that you can use to significantly alter model behavior to achieve great results on less complicated tasks.

Furthermore, since supervised learning is all about trying to get the model to match what you want it to output, the supervised fine-tuning process isn't necessarily the best on tasks where you need the model to maybe act a lot more flexibly and handle a wide variety of novel inputs. It's also not the best at really challenging tasks where a complex step-by-step process is required to get from the input to the correct output rather than just simply following a pattern that it learns.
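
The up-weighting described above, where "billion" becomes the most likely token after "2.4", can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the four-token vocabulary, the starting logits, and the learning rate are made up, and a real fine-tuning stack updates model weights rather than the logits directly.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_step(logits, target_index, lr=1.0):
    """One gradient-descent step on the cross-entropy loss -log p[target].

    The gradient of that loss with respect to the logits is
    (probs - one_hot(target)), so the target logit goes up and the rest go down.
    """
    probs = softmax(logits)
    return [
        logit - lr * (p - (1.0 if i == target_index else 0.0))
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]

# Hypothetical next-token candidates after the prefix "2.4".
vocab = ["billion", "million", "percent", "shares"]
logits = [0.0, 0.0, 0.0, 0.0]  # the toy model starts with no preference
target = vocab.index("billion")

before = softmax(logits)[target]
after = softmax(sft_step(logits, target))[target]
```

After the step, `after` is larger than `before`: the labeled token has been pushed up at the expense of the alternatives, which is the "parroting" behavior supervised fine-tuning teaches.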

Reasoning Models

Then came the development of reasoning models, a class of model that's trained to reason using reinforcement learning or RL. Unlike our non-reasoning models, reasoning models don't just immediately spit out tokens, output tokens back at you when prompted. Reasoning models learn to stop and think to themselves using their context window as a scratch pad before finally deciding to emit a final answer. If you imagine non-reasoning models as just saying the first thing that comes to mind when asked, reasoning models actually take the time to think to themselves and try to solve the problem on their own before getting back to you. The technique used to train this class of model is reinforcement learning. We've built a reinforcement fine-tuning platform or RFT platform to enable you to fine-tune your own reasoning models on your domain-specific tasks to push frontier performance where you really care about. Before I tell you how all this works, I first want to skip to the punchline. Reasoning models are able to solve much harder problems and exhibit much more consistent agentic behavior than non-reasoning models. In the future, the agents that can do economically useful work and automate your business will almost certainly be powered by reasoning models. Definitely a thing to watch out for.

Here's where we give an intro into how RL works and how it comes into play in our platform. Reinforcement learning comes into play here because the model needs to learn how to use its context window in the most effective manner to achieve the correct answer. Unlike supervised learning, we're no longer telling the model, or forcing it, how to use its scratch pad or context window. We're no longer telling it to parrot back to us exactly how it ought to think to solve a given problem. We're actually going to let the model try to figure it out on its own. Then we're going to teach the model to learn from its own experience. To do this, we use something called a reward or a grade, which is basically a score that tells the model how well it did on a given problem. For example, if we want to train our model to do something like legal research, we might reward the model for the relevance or accuracy of its legal citations and of the conclusions that it generates during the discovery process.

In this example, I know that we're not using a legal example, just because I wanted to keep it simple. We're just having the model solve a simple math problem. As you can see, we're letting the model generate its own training data by telling it to solve a problem many times, and collecting the times where it did well and where it did poorly. Then we encourage the model to do more of what made it successful at solving a given problem by increasing the probability of outputting the tokens that led it to success. We're basically teaching the model how to think better instead of directly forcing it into thinking a certain way. In this way, RL is actually teaching the model how to act or how to behave. This exploration process is super important for RL.
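
The loop described above (attempt the problem many times, score each attempt, then raise the probability of what worked) can be sketched with a textbook REINFORCE-style update using a mean-reward baseline. This is a toy under stated assumptions, not the algorithm the RFT platform runs: the "policy" here is just one logit per candidate answer, and the function name, candidate answers, step count, and learning rate are all hypothetical.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(candidates, correct, steps=50, rollouts=8, lr=0.5, seed=0):
    """Toy policy-gradient training over a fixed set of candidate answers."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    for _ in range(steps):
        probs = softmax(logits)
        # Exploration: sample several attempts from the current policy.
        samples = rng.choices(range(len(candidates)), weights=probs, k=rollouts)
        # Grading: reward 1 for a correct answer, 0 otherwise.
        rewards = [1.0 if candidates[a] == correct else 0.0 for a in samples]
        baseline = sum(rewards) / len(rewards)
        for a, r in zip(samples, rewards):
            advantage = r - baseline  # better or worse than this batch's average
            for i in range(len(logits)):
                # Gradient of log p(a) with respect to logit i.
                grad_log_prob = (1.0 if i == a else 0.0) - probs[i]
                logits[i] += lr * advantage * grad_log_prob / rollouts
    return softmax(logits)

# Hypothetical problem: "What is 2 + 3?"
candidates = ["4", "5", "6"]
final_probs = train(candidates, correct="5")
```

Nothing in the loop tells the policy which answer to give. It only ever sees rewards for its own samples, yet the probability of "5" ends up above its starting value of one third.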

In supervised learning, the model learns this complex pattern matching between an input and the next token that it emits. We force this behavior through the supervised learning process. With reinforcement learning, however, the model can additionally learn how the relationships between the tokens it emits over an entire trajectory influence each other to lead to the best final outcomes. In supervised learning, a lot of it's just concerned with the direct next token that we're emitting. In RL, in reinforcement learning, we can train this model end-to-end to consider how all of its decisions so far influence each other. This essentially trains the model to be accountable for all the tokens it emits in a trajectory. More importantly, it trains the model to understand how the tokens it emitted earlier on are responsible for downstream outcomes that might affect the final reward. We call this process credit assignment. This credit assignment bit is really important.

I know it's a bit in the weeds here, but we're talking to you about it because getting the model to learn good credit assignment is key to better agentic performance. Since tool calls and reasoning tokens all happen in the same context window of the model, they're all basically tokens. They're all decisions that the model has to make on its way to solving your problem or completing your workflow. Credit assignment allows the model to learn which tool calls or which thoughts it had in its reasoning tokens led to better outcomes. With that insight, the model is more likely to emit tokens, whether they be tool calls or reasoning tokens that lead to better agentic performance on your business use case.

Agent Reinforcement Fine-Tuning

We launched the RFT platform. Again, that stands for Reinforcement Fine-Tuning platform, in May, allowing anyone to use reinforcement fine-tuning to improve the performance of o4-mini on their hardest tasks. We saw use cases range anywhere from tax and accounting to genetics to coding. What all these use cases had in common was that the model couldn't just simply apply an easily learned pattern given the input data to reach the desired output. These use cases required a pretty complex reasoning process in order to go from input to solving the task. The regular RFT platform doesn't quite allow you to fine-tune an agent because agents need to interact with the outside world. They have to do so during the exploration process because they need to also figure out which tool calls led to the desired performance for your specific use case. That's why we're talking to you today about Agent Reinforcement Fine-Tuning.

Agent RFT is the way that we've built to fine-tune agents. Agent RFT changes the weights of the model according to a learning signal that you specify to teach the model what good behavior and what bad behavior look like. During training, the agent will explore different ways of calling your tools to learn how it can do better and better as training progresses. Again, base RFT is already a functionality in the current fine-tuning API, but you can't use it to fine-tune agents. Agent RFT allows the model to call tools while it's exploring during the rollout process so it can learn from all possible ways of calling your tools. You can also specify an arbitrary reward signal to train the model on so it gets better in ways that matter to you. We're going to talk about that in a bit. To summarize the benefits of Agent RFT: again, it helps you improve the performance of reasoning models.

It improves the agent's ability to use tools to reach the best final answer. It's also quite sample efficient, which can be really important in domains where data is scarce. The most important part here is that you're building a grader, or a reward signal, so that the model can learn from its own experience rather than you having to create training data for the model. I will also talk about specific examples of the sample efficiency of this Agent RFT process during our customer spotlights, which Wenjie will share later. The Agent RFT process results in a model that can have lower latency and is better on your agentic workflows.

One of the challenges with making agents work with your business context is that your context might be different from how we at OpenAI train our models. If your tools look and behave the same way as Codex's shell command tools or Deep Research's search engine tools that we train our models on in-house, then you're in luck because your agent is going to work well out of the box. Your business context is most likely specific to you, which means that your agent might not be used to using your tools in a way that's ideal. For example, it might call a tool too many times, or it might call five different tools when calling one tool is better for what it's trying to do in a given context. Using this fine-tuning process, it's possible to train a model to say, use far fewer tool calls to achieve the same or sometimes even better performance on a given task compared to the base model.

One example of this is we can use the Agent RFT process to dramatically cut down the number of tool calls an agent uses during a rollout. One of the ways that we can do this is, essentially, you can set a cutoff and we can just cut the model off for a rollout if it uses too many tool calls. If we cut the model off, we apply a penalty against the model. What the model ends up learning is to stay within that tool call budget. Not only does it stay within that tool call budget, it can also preserve or exceed the original ML performance. It explores many different ways of calling tools, and it finds out that staying within the budget is good. It also finds out how to optimize within that budget. Agent RFT improves the ML performance of your agent.

It does this by training the model to reason better across tool outputs, so when it calls tools and it gets the tool output back, it can think over those tool output tokens that it gets to arrive at a better final answer. It can also learn to use tools better in the first place to know which tools to call, to know what sequence to call them in. All this is, again, learned organically by the model during training as it explores the search space to call your tools and then think about the outputs it gets from those tools to arrive at a better answer.
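
The tool call budget described above can be expressed as a small reward-shaping function. This is a minimal sketch, assuming the grader already has a task score and a count of tool calls for the rollout; the function name, the budget of 10, and the penalty value of -1.0 are illustrative assumptions, not platform defaults.

```python
def budgeted_reward(task_score, tool_calls_used, max_tool_calls=10, penalty=-1.0):
    """Return the task score, or a fixed penalty if the rollout exceeded its budget.

    All names and default values here are hypothetical.
    """
    if tool_calls_used > max_tool_calls:
        # The rollout was cut off: it gets the penalty regardless of its answer.
        return penalty
    return task_score

within_budget = budgeted_reward(task_score=0.8, tool_calls_used=4)
at_limit = budgeted_reward(task_score=0.8, tool_calls_used=10)
over_budget = budgeted_reward(task_score=0.8, tool_calls_used=11)
```

Because every rollout over the budget scores worse than any rollout inside it, exploration is pushed toward trajectories that stay within the limit, and the task score then separates the good ones from the bad ones.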

How Does Agent RFT Work?

In order to make all this work, we've introduced several major new additions to the RFT product. The first one is the ability for the model to call tools during training via calls to your endpoints. The second one is the ability for you to specify a grader in the form of an endpoint that we call. These two additions mark the first time at OpenAI that we've allowed our models to interact with the outside world during the training process. These models are interacting with your world while they're getting trained and while we apply backprop to the model. Now I want to spend a bit more time to dive even deeper into what exactly happens during the training process. For each agent rollout, we assign a unique identifier to all tool calls and final answers that come out of that rollout. When the agent calls your tools, we'll attach that unique ID to the tool call so that your system can recognize different tool calls as originating from the same rollout.

This can allow you to keep track of rollouts as they happen, which could be important for state management. An example of this is maybe you're keeping track of all the state for a given trajectory or given rollout, and you want to use that state when you're grading the model at the very end. Using that UUID, you can then trace that trajectory, and if you're trying to reward the model for using fewer tool calls or reward the model for using certain tools more than others, then you can figure that out in the state that you've kept track of. This allows for much more flexible grading. We hope that Agent RFT helps you teach agents to achieve frontier performance.
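
The state management described above might look like the following on your side of the endpoints. This is a hedged sketch: the talk doesn't spell out the request payloads, so the class, its method names, and the per-call cost are hypothetical, and a production version would sit behind real HTTP handlers with persistent storage.

```python
from collections import defaultdict

class RolloutTracker:
    """In-memory record of tool calls, keyed by the rollout's unique ID."""

    def __init__(self):
        self._calls = defaultdict(list)

    def record_tool_call(self, rollout_id, tool_name):
        """Called from a tool endpoint each time the agent invokes a tool."""
        self._calls[rollout_id].append(tool_name)

    def grade(self, rollout_id, answer_correct, per_call_cost=0.02):
        """Called from the grader endpoint once the rollout has a final answer.

        Rewards a correct answer and subtracts a small cost per tool call,
        so shorter successful trajectories score higher.
        """
        calls = self._calls.pop(rollout_id, [])  # free the state once graded
        base = 1.0 if answer_correct else 0.0
        return max(0.0, base - per_call_cost * len(calls))

tracker = RolloutTracker()
for tool in ["search", "cat", "cat"]:
    tracker.record_tool_call("rollout-a", tool)
tracker.record_tool_call("rollout-b", "search")

reward_a = tracker.grade("rollout-a", answer_correct=True)
reward_b = tracker.grade("rollout-b", answer_correct=True)
```

Both rollouts answered correctly, but `rollout-b` used fewer tool calls, so it receives the higher reward.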

Getting Successful with Agent RFT

We all want you to be really successful with Agent RFT. Here are four key principles to ensure your success. The first one that's really important is you want to choose a task that's really well-defined and well-constrained. You don't want there to be subjectivity or this notion of taste in deciding what good performance and less than good performance looks like. This makes it so that the agent doesn't really have to guess as to preferences here. There's like very clear and hopefully even verifiable rewards or grades that you can use to train the model. The second one is you really want to make sure that your eval datasets mirror your production traffic. What this means is that when you're choosing which model or which model checkpoint to deploy, you want to make sure that the criteria that you're using literally look like the inference traffic that you get from your customers or from your business use case.

If there's a divergence, then you might end up optimizing for the wrong thing. You really want to make sure that your training, eval, and inference data distributions all look really similar to each other. The next one is especially important. You want to make sure that there is a chance for the model to achieve better performance as it tries more. Remember, RL is all about the exploration process. You would hope that as the model explores more, it figures out how to do better on your given use case. If you're in a situation where the model rarely ever does well on your use case, then it's going to be really hard for the model to work up to achieving consistently good performance. You want to make sure that if you give the model more tries at your given use case, it has a shot at doing better.

That way it can bootstrap and learn from its own experience to do even better, more consistently. Last one is, and we see this happen all the time, unfortunately, rewards can get hacked quite easily, because our models are, I'd like to say, really smart. If you have some edge case in your reward signal or your grader, the model might find a way to exploit that. You really want to make sure that you plug up all the edge cases in your grader so that you're really rewarding the model for exactly what you care about and not some exploit that the model is finding to extract reward from your system.

When to Turn to Agent RFT

Here's also some recommendations on when we think folks like you all should turn to Agent RFT, since we know that the Agent RFT process can be heavy weight: you have to specify your tools, you have to specify a grader. First of all, we want to make sure that, obviously, you build these really high-quality training and eval datasets that also match your production traffic. After that, you want to make sure that you establish a solid baseline by evaluating your baseline models against the datasets that you've collected so that you know what performance to expect. After that, you want to try to optimize the prompts in your task as much as possible before you resort to adjusting the weights of the model. Because that way, once you optimize everything around the model, then it's time to optimize the model itself through fine-tuning. Once you fine-tune it, hopefully, that can allow you to really push the frontier for your task.

Task Setup - FinQA

Now I'm going to turn it over to Wenjie to talk about how some of our partners have really pushed that frontier, and to also share some of our baselines and examples to add more color as to how you can use Agent RFT.

Wenjie Zi: This is a great overview of what kind of fine-tuning and functionality we provide, specifically about Agent RFT and why it matters nowadays. Now let me walk you through how we set up Agent RFT for a specific task and see how it works in practice. The example I'm going to work through with you is FinQA, which is also one of the benchmarks we use internally to measure what our training process looks like. FinQA was first introduced back in 2021 and has since become a standard benchmark for people to evaluate how well a model does at numerical reasoning over financial reports. In its original form, the setup is pretty simple. When you are given a question, there's usually a particular financial report associated with that question being given, too. The model knows which file to look at. It just needs to do numerical reasoning over the table content and over the text content.

To make it more realistic and look more like the RAG systems that everybody is building nowadays, we made two modifications. One is that, instead of telling the model which file it should look at, we let the model figure out exactly which one is related to the question within the 2.8k documents that we define in our corpus. The second one is that we constrain the maximum number of tool calls the model can take to get a correct answer. We set it at 10 tool calls. One example of the input we send is illustrated here. The question we give is: what is the 5-year cumulative return for Intel in 2013? After a couple of iterations, the model is able to identify that page 13 of Intel's 2013 financial report has some content related to it.

Then the model will reason over that content to reach the final result, which is 114%. In order for you to set up the Agent RFT fine-tuning jobs, you can go through the OpenAI fine-tuning platform UI, which is the screenshot shown on the left, or you can directly do it by calling our API. There are a couple of important configurations that you will need to specify here, including what is the training method you're going to use. As Will mentioned, we provide supervised fine-tuning. We also provide DPO. In our case right now, you can choose the reinforcement fine-tuning. Second is to choose a base model you want to fine-tune on. We offer a variety of models from our own model families, including GPT-5. Third, you can provide the training and the validation data you created. Most importantly, you also will have to specify a list of tools the agent is going to use to get to the final result, and write your own graders to define how you want to grade the model response and produce the reward signal for the model to learn from.

To support this RAG task, we give the agent access to three different types of tools. The first tool is search, where the model is able to perform keyword search or embedding-based retrieval to find the relevant content from the documents. The second tool is list, where the model can list out what files are under a given prefix. The last one is cat, where the model is able to load an entire document and show its content. As you can see, the internal logic of all these tools is hidden behind the endpoints that we provide for the model to use. This basically means that if your team needs to implement arbitrarily complex tools, you just need to define your own endpoint, expose it, and add it to the tool list. Then the model is able to leverage these different tools during the learning procedure.
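
To give a feel for what sits behind such endpoints, here is a toy in-memory version of the three tools. It's a sketch under stated assumptions rather than the setup used in the talk: the class name, method names, file paths, and document text are all invented, the search is a naive keyword count instead of embedding retrieval, and a real deployment would wrap these methods in HTTP endpoints.

```python
class DocumentTools:
    """Hypothetical search / list / cat tools over an in-memory corpus."""

    def __init__(self, corpus):
        # corpus maps a file path to that file's text content
        self.corpus = dict(corpus)

    def search(self, query, top_k=3):
        """Keyword search: rank files by how many query terms they contain."""
        terms = query.lower().split()
        scored = []
        for path, text in self.corpus.items():
            lowered = text.lower()
            score = sum(1 for term in terms if term in lowered)
            if score:
                scored.append((score, path))
        # Highest score first; ties broken alphabetically by path.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [path for _, path in scored[:top_k]]

    def list_files(self, prefix=""):
        """List every file path that starts with the given prefix."""
        return sorted(path for path in self.corpus if path.startswith(prefix))

    def cat(self, path):
        """Return the full content of one file."""
        return self.corpus[path]

# Made-up corpus, for illustration only.
tools = DocumentTools({
    "acme/2013/page_12.txt": "Revenue grew during the year.",
    "acme/2013/page_13.txt": "Five-year cumulative return on common stock.",
    "globex/2013/page_01.txt": "Letter to shareholders about return policy.",
})

hits = tools.search("cumulative return")
listing = tools.list_files("acme/")
page = tools.cat(hits[0])
```

The agent only ever sees the tool names, their descriptions, and their outputs, so the keyword scoring could be swapped for embedding retrieval without changing anything on the model side.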

Once the agent produces the answer, we need a reliable way for it to evaluate its quality. In this setup, we are using a model-based grader. There is a prompt that we're telling the LLM to say what is good and what is bad. This LLM-as-a-judge approach has been increasingly commonly used by the industry because we have seen that this has a high correlation with many of the human labels on different type of tasks. In reality, it's also very easy to set up and very easy to iterate on by doing prompt engineering based on people's observation of what's falling short and what's performing well. Even though the answer specifically for FinQA is numerical-based, we still landed at using the model grader rather than just doing the number comparison, because we see that this helps us to avoid some false reward signals caused by superficial differences, such as different formattings, units, or small numerical variations.

Even for simple tasks, this model-based grader has been working well for us. However, there isn't one type of grader that will fit all the work that people are doing, especially if we're talking about domain-specific or agentic workflows. To handle this, we support a variety of different graders you can choose from, starting from the simplest string match grader to the model grader that we just walked through, powered by an LLM, or you can use Python code to write how you want to do the grading. Lastly, as Will mentioned, there is an endpoint grader we're now providing. The endpoint grader is very similar to how the tool calls are set up. If you have very sensitive data, very specialized tools, or complex environments that you want this endpoint grader to operate on, you can wrap up the logic and expose it as an endpoint, just like a tool call.

As long as it provides us a numerical value representing the reward signal for the model to learn from, we're fine here. On top of that, we also have something called multigraders. This is where you can combine different graders and define your own weighted sum. This lets you balance different optimization targets, whether it's on accuracies or it is efficiencies, into one single training objective for the model to optimize.
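
The weighted sum behind a multigrader can be sketched as follows. This is a generic illustration of the idea, not the platform's configuration format: the function name, the grader names, and the 0.8 / 0.2 weights are hypothetical.

```python
def weighted_reward(scores, weights):
    """Combine per-grader scores into one reward using a weighted sum.

    scores and weights are dicts keyed by (hypothetical) grader names.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must name the same graders")
    return sum(weights[name] * scores[name] for name in scores)

# Example: the answer was fully correct but used half of the efficiency budget.
reward = weighted_reward(
    scores={"accuracy": 1.0, "efficiency": 0.5},
    weights={"accuracy": 0.8, "efficiency": 0.2},
)
```

Here the reward works out to 0.8 * 1.0 + 0.2 * 0.5 = 0.9. Shifting weight toward efficiency would make the model trade a little accuracy for fewer tokens or tool calls, so the weights are where you encode what matters to your use case.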

Here, I want to show you a concrete run we had on the FinQA task. Our platform provides a list of different dashboards for you to monitor how the training is progressing. To start, we'll be focusing on this one, which is on the reward signal. In this run, the green curve on the bottom is the training reward, and the purple curve on the top is the validation reward. Over about 50 training steps, we see that the validation reward improved from the original 0.53 to 0.65, which is a relative improvement of roughly 23%. Next, interestingly, we also observed that as training progresses, the model isn't just getting better at being accurate, it's also learning how to get there more efficiently. The figure on the top shows how the model's reasoning token usage changes over time. You can see that it steadily decreases to the optimal point, which is step 50 for us.

On the bottom figure, we see how the model is using tools. The black line on the bottom is the parallel tool calls, and all the other colorful lines are for each individual tool. Over time, the model is learning to use fewer tool calls while getting better results. This is a great balance that we're trying to improve on. In production, this matters a lot because most teams care about both accuracy and efficiency rather than just pushing the performance numbers.

Success Stories

For the last part of the presentation, I want to share some of the success stories we had with our customers and using Agent RFT since launch. This is relatively recent, but even just within a couple months, we're able to see different teams applying Agent RFT through a wide range of agentic use cases and achieving meaningful improvement on their tasks. Of course, one of the most popular use cases is around coding. I'll start by sharing three cases we have on the coding topic and then move to the other set. A few months ago, we started our partnership with Cognition, the team behind Devin. A bit of context, Devin is an AI software engineer that can inspect the codebase, run shell scripts, and grab and read files, and decide which files for it to edit. This makes it very agentic because it's not only about directly writing the code, it's also about actually planning what needs to be done and also interacting with this environment.

To support this training, Cognition had to build a very robust rollout infrastructure. In this infrastructure, each agent operates in its own isolated environment, with a dedicated VM that maintains its own copy of the code, executes its tool calls, and is graded independently. This ensures that different trajectories don't interfere with each other, and it also reflects what real-world behavior looks like. With this setup, we trained a model on a dataset of real-world user queries paired with the files the user actually ended up editing. We used the F1 score over selected files as the reward to balance precision and recall. We saw two important takeaways from this Cognition use case. The first is that data quality and data volume actually matter. Reinforcement fine-tuning can be very sample efficient. With 10 or 100 examples, in our case, we were already able to observe some improvement.

On our side, this led to about a 5% improvement with just 100 examples. After that, we scaled the data 10x, to about 1,000 examples, to capture more diversity in the behaviors we want the model to learn. The improvement jumped from roughly 5 points to 10 points. While RFT and Agent RFT are sample efficient, we still recommend that our partners add more high-quality, representative data where possible. Often, that translates into better agent performance. The second observation is that RFT is particularly effective at teaching agents how to use tools in parallel. At the beginning of this training, we observed that the model usually took 8 to 10 sequential turns, alternating between reasoning and tool calls, to get to the final result. After Agent RFT training, the agent learned to launch more tool calls in parallel, reducing the overall interaction down to about four turns.
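The F1-over-selected-files reward mentioned for the Cognition run can be sketched as follows. This is a minimal illustration of the standard F1 formula applied to sets of file paths, not Cognition's actual grader; the function name, the example paths, and the choice to return 0.0 when either set is empty are all assumptions.

```python
from typing import Iterable


def file_selection_f1(predicted: Iterable[str], edited: Iterable[str]) -> float:
    """F1 between the files the agent selected and the files the user edited."""
    pred, gold = set(predicted), set(edited)
    # Assumption: an empty prediction or empty ground truth earns no reward.
    if not pred or not gold:
        return 0.0
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)  # how many selected files were right
    recall = true_positives / len(gold)     # how many edited files were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the agent picks 3 files, 2 of which the user edited.
reward = file_selection_f1(
    predicted=["src/auth.py", "src/db.py", "README.md"],
    edited=["src/auth.py", "src/db.py", "src/models.py", "tests/test_auth.py"],
)
print(round(reward, 3))
```

Using F1 rather than precision or recall alone discourages both degenerate strategies: selecting every file to maximize recall, or selecting a single safe file to maximize precision.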

Next, I want to share our journey with Cosine on Agent RFT. For a bit of context, Cosine builds autonomous AI coding agents designed to operate inside real enterprise codebases in a way that closely mirrors how human software engineers work. This example is particularly interesting for two reasons. One is the scale of the tool use. Unlike the FinQA setup we saw earlier, this agent is trained with more than 30 different types of tools, including general file read, search, browsing, memory, and so on. This creates a highly realistic, production-grade environment where the agent must make the right choice among many options. The second is how the team designed powerful multi-stage graders that unlocked the success of this task. In this design, the agent receives zero reward if its answer fails the correctness tests. This may be a little counterintuitive, because normally we try to avoid sparse rewards, which make it hard for the model to learn.

This restrictiveness actually forced the model to optimize the first-order goal, which is shipping working code, rather than other goals like making the code beautiful. To make this trainable, Cosine had to compensate by increasing the batch size and using more compute, so that each training batch would still contain some positive examples for the model to learn from. Once correctness is achieved, a secondary LLM-as-a-judge evaluates the developer experience by examining style and tone, and gives additional reward if the model tests its own results or inspects its outputs during generation. After training with this carefully designed set of tools and graders, Cosine's model delivered substantial gains, outperforming state-of-the-art models on a variety of major coding benchmarks, including the very well-known SWE-bench and DeepCodeBench. During the collaboration with Cosine, we also noticed how Agent RFT can help reduce the long-tail problem in tool calls.

Before the training, you can see from the red histogram that in some cases the agent takes more than 100 messages before finishing a task. These long runs are especially problematic in production because they hurt latency, and they also break trust with customers, who don't know when the agent will stop. This is usually a sign that the agent is looping or exploring in a very inefficient way. After Agent RFT, the distribution shifts significantly: the green histogram shows that the long-tail problem largely disappears and the agent converges to a much tighter cluster of shorter trajectories. In practice, this means the agent not only reaches the correct answer with fewer tool calls, but also behaves more predictably.
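The correctness-gated, multi-stage grader described for Cosine can be sketched roughly as below. This is only an illustration under stated assumptions: the field names, the weights, and the 0-to-1 judge scale are hypothetical, and the batch-size helper assumes independent rollouts with a fixed success probability.

```python
from dataclasses import dataclass


@dataclass
class RolloutResult:
    tests_passed: bool    # outcome of the correctness tests
    style_score: float    # LLM-as-a-judge score, assumed to lie in [0, 1]
    ran_own_tests: bool   # whether the agent checked its own work


def gated_reward(result: RolloutResult,
                 style_weight: float = 0.2,       # hypothetical weight
                 self_check_bonus: float = 0.1    # hypothetical bonus
                 ) -> float:
    """Zero reward unless correctness passes; secondary signals only add on top."""
    if not result.tests_passed:
        return 0.0
    style = min(max(result.style_score, 0.0), 1.0)  # clamp the judge score
    reward = 1.0 + style_weight * style
    if result.ran_own_tests:
        reward += self_check_bonus
    return reward


def prob_batch_has_positive(success_rate: float, batch_size: int) -> float:
    """Chance that at least one rollout in a batch earns non-zero reward."""
    return 1.0 - (1.0 - success_rate) ** batch_size


print(gated_reward(RolloutResult(tests_passed=False, style_score=1.0, ran_own_tests=True)))
print(prob_batch_has_positive(0.05, 8), prob_batch_has_positive(0.05, 64))
```

The helper shows why a larger batch helps with a sparse reward: if only a small fraction of rollouts pass the correctness gate, a small batch often contains no positive example at all, while a larger one almost always does.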

Lastly on coding, Mako is a really exciting use case. They're building agents that write high-performance GPU kernels. This is traditionally very hard for LLMs because, unlike the usual coding problems, there isn't much example code for kernel writing, especially if you're running on newly released hardware platforms. With Agent RFT, Mako trained GPT-5 to write fast kernels using only about 100 PyTorch prompts. For the Mako case, we spent most of our time crafting a good reward function with the team, because early in training we did observe reward hacking: the model started to bypass the tests by referencing PyTorch code, emitting no-op kernels, or just outputting identity kernels. The Mako team spent time inspecting the rollouts manually and found seven different types of hacks. Then they tightened the success conditions used by the LLM grader to catch all seven kinds of cheating.

They also added a static analysis tool that uses the syntax tree to verify that the generated kernels actually exist and are actually runnable. Once all these protections against reward hacking were in place, the team used a very smart approach: generating three rollouts at the same time and choosing the best of the three. With this setup, they were able to beat the state of the art by about 72%.
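The talk does not describe Mako's checker in detail, so the following is only an analogy: a small syntax-tree check written with Python's standard `ast` module, applied to Python source, followed by a best-of-n selection. The function names, the default kernel name, and the scoring callback are hypothetical, and the check catches only two of the hack patterns mentioned (no-op and identity bodies).

```python
import ast
from typing import Callable, Sequence


def defines_real_kernel(source: str, kernel_name: str = "kernel") -> bool:
    """True if `source` parses and defines a non-trivial function `kernel_name`."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # code that does not parse cannot be run
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == kernel_name:
            # Ignore docstrings and other bare constant expressions.
            body = [
                stmt for stmt in node.body
                if not (isinstance(stmt, ast.Expr)
                        and isinstance(stmt.value, ast.Constant))
            ]
            if not body or all(isinstance(stmt, ast.Pass) for stmt in body):
                return False  # no-op kernel
            arg_names = {arg.arg for arg in node.args.args}
            if (len(body) == 1
                    and isinstance(body[0], ast.Return)
                    and isinstance(body[0].value, ast.Name)
                    and body[0].value.id in arg_names):
                return False  # identity kernel: returns its input unchanged
            return True
    return False  # the expected function is missing entirely


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Keep only candidates that pass the static check, then pick the top scorer."""
    valid = [c for c in candidates if defines_real_kernel(c)]
    if not valid:
        raise ValueError("no candidate passed the static check")
    return max(valid, key=score)
```

In a real setup the `score` callback would presumably be a measured speedup from benchmarking each candidate; here it is left as a parameter so the selection logic stays independent of how performance is measured.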

Beyond coding, we've also partnered with customers on a variety of use cases across different domains. The first example is from Ambience. Their task is to map a patient diagnosis to one of 70,000 billing codes. Using Agent RFT, they fine-tuned the agent to use tools and reasoning to narrow down the correct code. As a result, they achieved a five-point improvement in F1 score while also reducing the average response time by 18%. The number may look a little low, but that's because a lot of the cases are actually quite ambiguous. Even human experts may not agree on a given example, so the ceiling here is not 100%. The next example comes from Genspark. It's also a very interesting one. The agent retrieves information from different data sources and uses tools to generate a full presentation.

After applying Agent RFT, they saw an 88% improvement on the extremely bad cases from previous runs. The slides produced by the agent also became more informative and more aesthetically consistent. This highlights how Agent RFT can improve not only quantitative metrics, but also qualitative, user-facing outcomes. Rogo is another pretty interesting example. They operate in the financial domain, building a financial reasoning agent for complex real-world analysis. This agent again combines information from multiple sources, and then reasons over the retrieved content to produce evidence-grounded answers, which means citations. We used Agent RFT with a domain-expert LLM-as-a-judge to shape both reasoning quality and citation behavior. The result was a 21% improvement in evidence-grounded attribution, along with materially lower hallucination rates. This also shows how Agent RFT can be used to align reasoning with expert expectations in high-trust domains.

Questions and Answers

Participant 1: One of the common challenges with fine-tuning a model is that the base model evolves. What happens then? When the frontier model evolves, is there a path to move onto it, or is it retraining?

Will Hang: We're building the platform to make it really easy to rebase your training on the new model. The nice thing about something like Agent RFT is that your tools or your graders hopefully stay stationary. That's the scaffolding. The only thing that really gets swapped out is the model. Right now, you just select it from a dropdown and you re-kick off the training process to get a new fine-tuned model out of a new base model. Absolutely, you have to do that swap sometimes, but it really is just a matter of swapping a new model into the training process.

Wenjie Zi: Because you already have the validation dataset, you actually know how well the new model performs against the baseline models and the previous fine-tuned model. That gives the team strong evidence on whether they should make the switch after they fine-tune the model, and they can be the ones who make the decision.

Participant 2: Generally speaking, how well does fine-tuning hedge against hallucinations?

Will Hang: Especially in the RFT case, you can absolutely reduce the amount of hallucinations. In fact, in some of our customer success stories, we did see many cases in which we reduced the hallucination rate just due to essentially rewarding the model for reduced hallucinations. I think a lot of the times it is a matter of once you have a better, more grounded reward signal that rewards the model for reduced hallucinations, you can actually see the model adhere to that behavior that you specify in your grader. It's a matter of your reward signal.

Participant 3: Has it happened with clients before, or is it common practice, that during training the tool you actually want to use in production is either not available or not safe to use, say because it takes actions? Do people create a mock set of tools, or make some modifications? What are some common practices to help build this in?

Will Hang: The question is like, if you have a tool that you can't really use during training? We've definitely seen people create synthetic tools or mocked tools. It's never going to be as good as like an actual live tool that you're using in production. Just know that it is possible to create these synthetic tools. I think it depends on what you're trying to optimize for as well. If you're trying to get the model to simply call tools better, like issue better formed inputs into a tool, then the mock tool approach is pretty useful. If you're trying to have the model reason better over the outputs of the tools, for example, if you needed to search over a knowledge base and then use the outputs from the knowledge base to reason better, to arrive at a better final answer, then mocking is really hard for something like that. It just depends on what you're trying to optimize for.
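A mocked tool of the kind described in this answer might look like the sketch below. Everything in it is hypothetical (the tool name, the argument names, the canned account data); it only illustrates the case where the goal is to teach the model to issue well-formed tool inputs without triggering real side effects.

```python
from typing import Any

# Hypothetical canned data standing in for a production system.
CANNED_ACCOUNTS = {
    "acct-001": {"balance": 120.0},
    "acct-002": {"balance": 35.5},
}


def mock_refund_tool(arguments: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for an unsafe production tool: validates inputs, changes nothing."""
    account_id = arguments.get("account_id")
    amount = arguments.get("amount")
    if not isinstance(account_id, str) or account_id not in CANNED_ACCOUNTS:
        return {"ok": False, "error": "unknown account_id"}
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
        return {"ok": False, "error": "amount must be a positive number"}
    # No real refund is issued; the response only mimics a success payload.
    return {"ok": True, "account_id": account_id, "refunded": float(amount)}


print(mock_refund_tool({"account_id": "acct-001", "amount": 20}))
print(mock_refund_tool({"account_id": "acct-001", "amount": "twenty"}))
```

Because the responses are canned, a mock like this gives useful feedback on input formatting but little signal for reasoning over tool outputs, which matches the limitation raised in the answer.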


Recorded at:

Jul 03, 2026
