
Leadership in AI-Assisted Engineering


Summary

Justin Reock discusses the reality of AI’s impact on engineering, moving past anecdotes to hard data from DORA and DX research. He explains the "GenAI Divide" - where 95% of pilots fail - and shares how leaders can use the SPACE and Core 4 frameworks to measure true ROI. He explains how to balance speed with quality, reduce developer fear, and apply agentic solutions across the entire SDLC.

Bio

Justin Reock is the Deputy CTO of DX (getdx.com), and is an engineer, speaker, writer, and software practice evangelist with over 20 years of experience working in various software roles. He is an outspoken thought leader who has delivered enterprise solutions, numerous keynotes, technical leadership, various publications, and community education on developer experience and productivity.

About the conference

QCon AI is a practitioner-led event focused entirely on the engineering discipline required to scale AI workloads safely. It provides direct access to the architectural playbooks and failure metrics that peer organizations use in production.

Introduction

Justin Reock: I am Justin Reock. I'm the Deputy CTO at DX. We are the engineering intelligence platform based on a lot of research that you're probably familiar with, from Microsoft and Google Productivity Lab, and the University of Victoria, British Columbia. If you've heard of DORA and SPACE and the DevEx framework, that's a lot of what we focused on, and we basically built a platform around it. The presentation that I have for you today is built around this larger playbook, which is a playbook for senior executives, based on research, interviews, trends that we've seen, and then research from other organizations that we partner with, like the DORA community, for instance. We put a much larger explanation of all that into this guide, and then we built this presentation around it. We're going to cover a lot of highlights from that research.

What is the Current Impact of AI?

What is the current impact of AI? What are we even seeing? Nobody seems to know. Google, on the one hand, is telling us that their engineers are about 10% more productive as a result of using AI. They are, of course, a very engineering productivity-focused organization in the first place. However, we also had this now infamous METR Study, that had some problems with it. It was a small sampling of engineers, but it was still a pretty well-executed study that showed a 19% decrease in overall productivity in this particular experiment. Again, problems with this study. There were some engineers who had never used Cursor before, which was the tool that they were using in this study. What I think is pretty interesting is that every engineer in this study felt like they were more productive, like their qualitative data actually showed, no, I think I actually am getting more done. The data bore out that that wasn't true. We need to manage perceptions and reality. We have to measure, and we have to really be diligent about how this technology is working for us.

Our friends at the DORA community produced this report a few months ago, where they found these modest but at least positively leaning impacts to overall productivity and the developer experience. A 25% increase in AI adoption equated to a 7.5% increase in overall documentation quality. That was the biggest impact. A 3.4% increase in code quality, so modest, but at least not heading in the opposite direction. A 3.1% increase in overall code review speed. A 1.3% increase in overall approval speed. When we look at this data on averages, we find modest but positively leaning impact. At DX, we can sample lots of data. We have samples from both qualitative and quantitative metrics that we pull from systems and that we pull from surveys and things like that. We thought, can we correlate this data? Will we see the same thing? What you're looking at here is the Change Confidence Developer Experience Index driver that we use. This is a qualitative measure, basically asking, how confident do I feel, when I push a change, that I'm not going to break stuff? This is calculated on what's called a top-box Likert score.

If you're familiar with survey engineering or at least have taken surveys, you've probably answered something that the possible answers for you are like always, very often, sometimes. The top box scoring means the percentage of people who answer always or very often. The two positive answers that you can have on that survey question. For change confidence, we found that moderate to heavy AI users, so this is either weekly or daily use of AI to help with engineering, led to a 2.6% average gain in change confidence. Similar to what DORA is seeing, positive leaning but modest impact. Code maintainability, another qualitative metric. How much cognitive load do I have to put into just understanding the code that I have in front of me? We found that the code became 2.2% more maintainable with AI users versus non-AI users.
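To make the top-box calculation concrete, here is a minimal Python sketch; the answer labels and the sample responses are illustrative assumptions, not figures from the talk.

```python
def top_box_score(responses):
    """Percentage of respondents choosing one of the two most favorable answers."""
    top_two = {"always", "very often"}  # the 'top box' of the Likert scale
    favorable = sum(1 for r in responses if r.strip().lower() in top_two)
    return 100.0 * favorable / len(responses)

# Hypothetical survey answers for the change confidence question.
answers = ["Always", "Sometimes", "Very often", "Rarely", "Very often", "Sometimes"]
print(f"Change confidence (top box): {top_box_score(answers):.1f}%")  # 50.0%
```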

The DORA metric for change failure rate. Who's familiar with DORA and the DORA metrics? This is the percentage of changes where, when we add something value-adding to the product, we break it and have to back it out. You want to be low on this one. We saw a 0.11% reduction overall here from using AI. It doesn't sound all that significant until you realize that the industry benchmark for change failure rate is like 4%. It is somewhat meaningful.

We have had lots of conversations with companies that are seeing things going in both directions, so the averages can't be telling the whole story. We thought, what would happen if we broke this data out per company? Here's what we saw. Each bar represents a single company and the impact that, in this case, again, change confidence has had on their individual developers' qualitative experience. Look at this. It's so noisy. It's all over the place. We have some companies seeing more than a 20% gain in change confidence, while others are seeing more than a 20% decrease. When we filter on the average, we're not seeing the full picture. We're not seeing the full story. We need to look at this impact on a per-company basis. Similar pattern for code maintainability. The same thing here. Some people getting messier and sloppier code. Others getting tighter, better code. Then here's change failure rate.

The top bar here, which you don't want to be on, you want to be on the bottom part of this graph, is a 2% increase in change failure rate. Two percentage points against the industry benchmark of 4% means shipping as many as 50% more defects than you were shipping before. A lot of variability, a lot of volatility. Just like always, the future is here, but it is not evenly distributed, which is absolutely true for the ROI that we see with AI. Some organizations are seeing positive impacts, but others are really struggling, and even seeing negative impacts. We know that top-down mandates are not working very well. They lead to a decrease overall in psychological safety, and people will game the system.

If the whole point is just 100% AI utilization, that really tells us nothing. We need to strive to understand impact as a multidimensional and longitudinal study of metrics that we get from organizations. We'll talk a bit about measurement. We found some patterns, maybe antipatterns, in organizations that were seeing poor results here. There was generally a lack of overall education and enablement. That seems very important. Not just providing good education, but also providing time to experiment and safe environments in which to do so, safe in both a systems sense and a psychological sense.

Often, organizations will just turn the tech on, and then expect users to just magically be proficient. Just like any other technology, this does require learning. There is a learning curve. We've seen a very distinct learning curve, in fact, where whether you're a junior engineer or a senior engineer, whether you're a large enterprise or a smaller, scrappy software startup, going from no adoption of AI to light adoption of AI almost universally results in a dip in both quality and productivity. You see both go down. Then as you move into more moderate and heavy adoption, things balance out, and then even end up higher than they were before. We do see gains, but we see a clear, distinct learning curve that we have to prepare for in the meantime.

This is from that same DORA report. This is a Bayesian posterior distribution of the impact of different AI adoption strategies. What does that mean? Effectively, the way to interpret this graph, these are qualitative signals. This dotted line down the middle is effectively our net-zero impact. We have a bunch of different possible strategies over here on the left. More yellow mass to the right means that we've seen a more net positive impact when implementing some of these strategies. A sharp peak, like we see, for instance, with mandatory training, shows confidence in the data. Clear AI policies is number one. This has the largest distribution of mass over to the right. Being very clear about what we can do with this technology. Giving time to learn, also very important. To break down some of these top ones: for clear AI policy, there's a 90% indication of a strong, positive benefit from putting good, clear policies in place.

Right after that one is giving time to learn. Alleviating displacement worries is at 80%. We're going to talk about why this is so important, that as leaders, we make sure that we're very clear and transparent, that we're looking at this technology now as a way to augment individual productivity, not replace engineers. This is a throughput story, not a cost-cutting story. There's lots of data to back that up. We want to encourage AI in workflows. For most organizations, generating code is not the bottleneck. There are other bottlenecks throughout the SDLC that we should be looking at and seeing if we can creatively apply AI to help with. Being very transparent about what we plan to do. Then, yes, offering mandatory training, but not just training where, ok, here's some material, go look at it, and if you don't look at it, you're going to get in trouble or whatever. Actually being able to tie that to time to experiment in sandboxes and really learn how this technology functions. That's hopefully a lot of what we're getting from this conference.

Distinct Strategies, as Leaders

As leaders, we have some pretty distinct strategies that we can employ. We should, of course, look at integrating across the SDLC. We're going to look at some really interesting use cases that we've uncovered from various companies who have had success deploying agents to deal with real issues in production. We need to unblock usage. Too often I hear, we'd love to use Cursor or some tool, but we're worried about data exfiltration. A lot of work has been done to make sure that it's safer and more compliant to use these tools. There's creative things we can still do, like self-hosting models. We tend to throw giant frontier foundational models at every problem when there's actually a lot of problems we can solve with smaller and even open models. We need to have open discussions about the metrics that we're gathering. We need to be very clear about why we're gathering the metrics in the first place, or else people will just game the system, good old Goodhart's Law.

Then we need to be clear about what we found and what we're going to improve as a result of those findings. We have to reduce the fear of AI. I will show you data. There's tons of data out there suggesting that this technology is not, and may not ever, be ready to fully replace human engineers. There's also plenty of compelling data that talks about how we can augment productivity of engineering organizations. We can get more value out of those engineering organizations.

Companies like Zapier, for instance, which I'll look at later, are actually hiring faster and at a greater volume than they ever have, because they know that they can get a higher return on investment out of a single engineer. I love that attitude. They've actually cranked up hiring as a result of this. Compliance and trust. We do need to be able to trust these outputs. We need to be able to not destroy our change confidence as a result of using this technology. We need to, as leaders, tie this to employee success. We have the opportunity here to help our employees gain new skill sets that are likely to last them for the rest of their career. Really good context engineering. Really good prompt engineering. Understanding how to build agents. These are definitely skills that are going to behoove us for years to come. This tech doesn't seem to be going anywhere.

Reducing the Fear of AI

We need to frame AI as a force multiplier for performance. Something that can augment and accelerate. We need to remind engineers that these tools are not there to replace jobs. They're there to transcend what we could do before. We've known that psychological safety is incredibly important for a long time. A decade ago now, Google launched Project Aristotle, which was a study that tried to figure out, what are the characteristics of highly performant organizations? What do high-performing organizations have? It was named Aristotle because he was famous for saying that the whole is greater than the sum of its parts. They had a hypothesis that the recipe for really successful teams was going to be some combination of hiring high-performing engineers, having very experienced managers, and having basically unlimited access to compute. All of which were easy for Google to obtain. They were completely wrong.

Overwhelmingly, psychological safety was the most important characteristic for high-performing teams. I think that that really applies now in this climate of a lot of uncertainty, especially amongst junior developers and people just entering the job market, and whether they're going to get replaced with this technology that they're helping to build and train. This is SWE-bench. Who looks at SWE-bench like every day, checking capabilities of models right now? You can't always trust the benchmarks, but this is a decent benchmark site that has a number of different specific benchmarks where certain models are looked at, open and closed models. They're ranked by how well they perform in these benchmarks.

The highest performer on the most rigorous benchmark is at about 44%. In other words, it can complete about 44% of the tasks that are given to it without a human in the loop. That doesn't mean without validation. It just means that it's doing the job. That does not an employee replacement strategy make. Those numbers are way too low. It's also very interesting that if you look longitudinally, the highest performer a year and a half ago was at 34%. We were surging at first in our abilities with this stuff. It's definitely starting to taper off. There's plenty of reasons for that. The underlying algorithms for these things really haven't changed since the 1950s. We've just been able to throw a lot more silicon, a lot more compute, and get really creative about things like agent-to-agent, RAG, and MCP, and all that stuff. The point is it may never be in a spot to be able to replace humans. It's certainly not there now.

Some of you may be familiar with this more recent study called the GenAI Divide, done by the MIT NANDA group. They found that 95% of AI pilots fail in the study that they looked at. There's a lot of investigation happening. General-purpose LLMs are the darker bar here, whereas embedded task-specific, in other words, agentic solutions are in the paler blue. Lots being investigated, lots being piloted, but only 5% of these have been successfully implemented at this point. That's a pretty high failure rate. It doesn't mean that we aren't getting better at this stuff. It just means that we are years off from really looking at this as some sort of cost-cutting engineering replacement strategy. Let's not forget, throughput is the most important component in a high-performing team. Anybody who's familiar with the Theory of Constraints, who read Eli Goldratt's "The Goal", or if you didn't read The Goal but maybe you read "The Phoenix Project", which is essentially The Goal retold, should be familiar with that concept that we shouldn't prioritize cost, we should prioritize throughput. Throughput is the job of the machine of business. We can improve our throughput by applying AI in intelligent ways to increase the individual productivity of engineers.

Transparency - Set Clear Intent and Expectations

We need to be really transparent about why we're putting AI strategies in place. We need to frame it, again, as a way to augment, not replace. We need to proactively address these fears. We shouldn't wait as leaders for regrettable attrition or for low productivity because people aren't at their most creative and innovative because they're a little scared. We need to proactively say, here's what we're doing. You're not going to lose your job. Please help us let you learn about this stuff. Let us augment the capabilities of this business. We need to be open about the metrics that we're collecting, why we're collecting them. That this is, again, not for stack ranking based on how great you are with AI, but really more trying to figure out, are these investments, these not insignificant investments that are being made in AI, actually moving the needle? Are they doing what they're supposed to do? That's why we should be collecting these metrics.

If we get into the business of incentivization, or weaponization, or top-down mandates, we end up running smack into something called Goodhart's Law. Who's familiar with Goodhart's Law? A few of you who have looked at productivity theory before. Goodhart's Law says that when a measure becomes a target, it ceases to be a good measure. In other words, we'll game, we'll play to the metric. If I need to hit 100% AI utilization in my company, then a lot of engineers are just going to update their README file on a Monday and call it, yes, I'm using AI, without actually moving the needle for productivity.

One of my favorite parables about Goodhart's Law is something called Cobra Effect. Anybody ever heard of the Cobra Effect? A few of you are like me, you're big developer experience productivity nerds. I come from the South. We say if you're studying developer experience, you can't swing a dead cat without coming across Goodhart's Law. Cobra Effect, so you had an emperor. That emperor had a problem with venomous cobras. Decides to come up with an incentivization program. You bring me 100 dead cobras, I'll give you some money. Here we are with a metric that is being incentivized. We need to bring 100 dead cobras and we'll get some money. What do you think people did? People got clever, started farming cobras. That's right. Started farming, slaughtering, and bringing to the emperor, getting their money. The emperor got wind of this gamification, shuts down the program and all those farmers release all those cobras. The problem gets a lot worse. We have to be very clear about why we're collecting these metrics, what we're going to do with them, and not try to focus on some single metric like 100% daily utilization across the company.

Metrics

How do we do that? What should we be looking at? There are a lot of different metrics we can look at. I know a lot of you are familiar with DORA metrics. Who's familiar with SPACE framework? SPACE built on the success of DORA by bringing in additional qualitative metrics, metrics like developer satisfaction, collaboration. What was important was that the authors of that study said that in order to really understand developer experience, we have to look at a constellation of metrics in tension with one another. We have to look at metrics across oppositional dimensions to get a real picture of what's happening with productivity. We can do that with AI.

Our two main throttles are going to be our speed measurements and then our quality and maintainability measurements. We want to make sure that we're balancing any gains that we're getting from higher speed with long-term maintainability and quality. We just want to make sure that we're moving faster without creating a bunch of slop. There are metrics that we can look at. These are oppositional metrics of speed and quality that we can use to determine this. We want to collect metrics in multiple ways. We want to do a mixed-method approach and gather both system metrics, quantitative or hard metrics, as well as qualitative metrics. We want to be able to gather that context. At DX, we like to use a lot of health metaphors, the qualitative metric or the context being, ok, I'm going to the doctor because I don't feel well. Doctor, I don't feel well. Then they take your pulse and they take your oxygen and they take your temperature, and they're like, everything's fine. Yes, but I don't feel well. Something's wrong. We have to have the context of these system metrics that we gather. We want to collect system data. We can get this from admin APIs and things like that, the telemetry metrics that are starting to come out from some of these enterprise solutions.

We also want to be able to conduct periodic surveys. Surveys are hard. We have worked very hard at DX to be able to get most of our customers to an average of 90% or higher participation rates in surveys that are really well engineered, that frame developer experience as much more of a systems problem than a people problem, which it is. That's how we're able to drive more honesty in the qualitative results. It's not enough just to have conducted a survey; you also need to pay attention to whether there are metrics in the survey where there was an incentive to game the metric, in which case you'll get some dishonest signals, in other words, useless signals. Was the participation rate high enough? Ninety percent or higher is pretty good, but most organizations have difficulty achieving that. They need to be well-engineered surveys. They need to guard against survey fatigue. They need to guard against Goodhart's Law and gamification. You should do experience sampling as well, gathering one or two small bits of data mid-workflow, and then looking at that data longitudinally.

Really then we have three classes of metrics that we can be looking at for measuring impact. We have our telemetry, again, the data coming out of Copilot API, or Cursor API, or Claude API. This data is really just showing us what's happening, like who's using the tech, maybe how many PRs are being augmented with AI, but it's really just showing us utilization. Again, that doesn't really say much. One hundred percent utilization across the company doesn't tell us anything in terms of real speed or real quality. This is where experience sampling comes in. This is a little bit better for quantifying ROI. This could be something as simple as adding one field to a PR form that says I use AI to work on this PR, or I enjoyed using AI to work on this PR. We gather a small sample of data that we can look at longitudinally.
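As a sketch of how that single PR-form field could be used longitudinally, here is a small pandas example; the column names and numbers are hypothetical, not data from the talk.

```python
import pandas as pd

# Hypothetical export of PRs, each carrying the experience-sampling flag
# ("I used AI to work on this PR") plus a delivery metric such as cycle time.
prs = pd.DataFrame({
    "merged_week":      ["2025-W01", "2025-W01", "2025-W02", "2025-W02", "2025-W03", "2025-W03"],
    "used_ai":          [True, False, True, False, True, False],
    "cycle_time_hours": [18.0, 30.0, 16.5, 28.0, 15.0, 29.5],
})

# Track the signal over time: median cycle time per week, split by the flag.
trend = (
    prs.groupby(["merged_week", "used_ai"])["cycle_time_hours"]
       .median()
       .unstack("used_ai")
)
print(trend)
```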

Then of course our self-reported data, our survey data, but we have to be careful here that we're guarding against survey fatigue, that we're not running these too often. We can only run them periodically as a result of that. Really, you should be running these on a cadence tied to however long it takes in your culture to be able to see improvement. If you're not using these metrics to try to improve things, there's no point in gathering the metrics in the first place. The audience for those metrics, and what they plan to do with those metrics in terms of improving the state of things for developers, that's a leading indicator of productivity. If you're just throwing stuff up on a dashboard for the sake of having a dashboard, there's no point. The period for running these surveys really should be the length of time that it tends to take in your culture for positive change or improvements to be enacted and noticed. That way you're always driving metrics up. That gets people excited to see the results. That gets people excited to take part.

We can look across these three different types of metrics. In other words, these AI metrics, our foundational developer experience and developer productivity metrics are still what matter the most. Despite all this hype and everything that's been happening over the last year, year and a half, this is still about improving developer experience and hopefully as a result of that, improving developer productivity. The problem is we as an industry didn't really know how to define those things well even before the AI boom hit. We weren't even really sure exactly what we should be looking at, how we should be defining things. There was a lot of misalignment, especially amongst leaders and individual contributors of what productivity even means.

Then we threw AI at the thing. We hadn't even finished that initial discussion. These AI metrics tell us what's happening, but these core metrics tell us whether these investments are working. Putting this stuff in place, is it actually improving quality or is it making it worse? Are we actually moving faster? Have we increased our throughput? Are we getting more features out to market? Can we respond to the needs of our customers more quickly? These foundational metrics are still what matter the most. These top companies are looking at a few different things, and we worked with a bunch of companies to shape the metric framework that I'm about to show after this. We looked for commonality across these companies, what was important to them in terms of the metrics that they were tracking.

Microsoft is looking at adoption. Sure, but you want to use those adoption metrics to correlate them to impact metrics, to create cohorts of users that you can then slice and say, weekly active users of Copilot are showing such and such increase in weighted PR throughput, or such and such increase in PR revert rate, or such and such decrease or increase in change failure rate. We want to correlate that utilization, and that's exactly what they do. They look at system velocity, developer satisfaction, change failure rate. They have a great metric that I could probably do a whole talk on, and there's a great white paper about it too, called a bad developer day. They gather telemetry from all across the platform and have tried to quantify what makes a bad day for a developer.

If you look at Microsoft Bad Developer Days, there's a whole white paper on it. It's really interesting, but they correlate that metric with the AI utilization metrics that they're gathering. Dropbox, similar stuff. We're looking at adoption and engagement. We're looking at developer sentiment. We're looking at velocity. We're looking at engineering hours saved, percentage of AI code that is generated. That number is somewhere around 23% by our most recent reckoning. Then, Booking.com. Daily active users, time saved per day, PR throughput, developer CSAT, change failure rate, starting to see some patterns here.
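To illustrate the cohort-slicing idea described above, here is a hedged pandas sketch; the data, column names, and cohort boundaries are made up for illustration and are not any company's actual figures.

```python
import pandas as pd

# Hypothetical per-developer rollup joining assistant telemetry with impact metrics.
devs = pd.DataFrame({
    "copilot_days_per_week":  [5, 0, 3, 1, 4, 0, 2, 5],
    "weighted_pr_throughput": [7.2, 5.1, 6.4, 5.3, 6.9, 4.8, 5.9, 7.0],
    "change_failure_rate":    [0.03, 0.05, 0.04, 0.05, 0.03, 0.06, 0.04, 0.03],
})

# Slice utilization into cohorts, then compare the impact metrics across them.
devs["cohort"] = pd.cut(
    devs["copilot_days_per_week"],
    bins=[-1, 0, 2, 7],
    labels=["none", "light", "moderate/heavy"],
)
print(
    devs.groupby("cohort", observed=True)[
        ["weighted_pr_throughput", "change_failure_rate"]
    ].mean()
)
```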

Where we found overlap, actually working with closer to 20 companies, we drop them into what we call our DX AI Measurement Framework. If you're familiar at all with the work that DX has done, our last big measurement framework that we introduced was called the Core 4. It is a distillation of DORA, SPACE, and the DevEx metric framework all into one framework called the Core 4. If you're doing Core 4, you're doing DORA, you're doing SPACE, and you're doing DevEx. We very much took that same sort of format and put together these oppositional metrics of utilization, impact, and cost. We prescribed some metrics here based on what we'd seen from other companies. You can use this as both a way to measure how AI is working within the organization and as a bit of a maturity curve.

Most organizations start over on the left with utilization, looking at things like daily active users, weekly active users, percentage of PRs that are AI-assisted, percentage of committed code that's AI-generated, and then overall tasks that have been assigned to agents. We're starting to see more and more creative uses of backend platforms like OpenTelemetry that people can use to push metrics around what agents are actually doing. Then we move from utilization to impact. These are the metrics that really matter. These tell us what's happening in the organization. These are things like AI-driven time savings. We're looking at some Core 4 metrics here, like PR throughput. We use a special PR throughput metric called TrueThroughput, which effectively looks at the overall complexity of the PR that's being put through.

Again, I can't stress enough that you never want to hyper-index or over-index on any one of these single metrics. PR throughput on its own doesn't tell you very much. I can update the README file 10 times a week, my PR throughput could look great. We need to be able to make sure that we understand the complexity and the value of those changes as part of that. Perceived rate of delivery. Code maintainability, there we are again. Change confidence. Change fail percentage. Then, yes, cost. Except I do like to point out that we're about 15 years after the last major hype cycle, and we're still figuring out how to calculate our cloud costs. I also hear that there's people burning through thousands of dollars' worth of tokens a day, so we probably do need to do something. This is, again, the DX AI Measurement Framework. There's more information if you want to look at survey templates. You don't have to use the platform to be able to use this metric set. Happy to talk about how you can calculate a lot of this on your own.

Compliance and Trust

What about compliance and trust? How are we going to be able to trust these outputs coming from this tech? There's actually a lot of levers to pull. I'm going to focus on a couple of the more technical ones. There's obvious stuff too, like testing has never been more important. It's another reason why we have to keep humans in the loop. Even with really cool adversarial validation loops and everything like that, we still need to make sure that our competent humans are the ones who are looking at these things before we actually put them into production. We do also have some interesting levers we can pull.

The first one is creating a feedback loop for your system prompts. Different platforms have different names for the same thing here: Copilot calls it a system prompt, Cursor calls this Cursor Rules, Claude calls this Agent Markdown. This is effectively just a set of rules that accompanies every prompt that gets fed in to the model and that can help guide the way that the model is supposed to act. This is where a lot of people put some guardrails in. If we want security guardrails, though, we need more than just a system prompt; that goes well beyond it, so let's be clear there. This is a good way of making sure that the model is semantically and stylistically accurate with the way that we write code in our organization.

The main takeaway here is to put together the feedback loop. Maintaining the system prompt, applying the system prompt, that's pretty straightforward. What's most important is making sure that there's someone or a group gatekeeping feedback when these models misbehave, when they do something wrong. Whether that's part of your QE team, whether that's part of your AI center of excellence, doesn't really matter. Whether the feedback is gathered through a Slack channel or source control or tickets, doesn't matter. You just need to establish this so that when the models misbehave, there's someone gatekeeping all that feedback, keeping the system prompt up to date and keeping it applied. Lots of good approaches for this. You can see on the right here a very simple example: the model's been using Spring Boot versions that are too old for the organization, Spring Boot 2.6 or earlier. It's sometimes providing code that has deprecated methods. It's providing code snippets that have syntax errors. We're using a little bit of meta-prompting and we're saying that your new rules are to make sure that you're always providing code snippets that target Spring Boot 3.0 or higher, that you're not sending me code that has syntax errors, that the snippets are syntactically valid.

Then I'm also giving it some meta-prompting about how I want it to behave in terms of its output: every time, I want it to provide relevant explanations. If it's uncertain, don't just try to please me like I know you like to do, but instead just give me some methods and approaches. Don't give me bogus code and don't include any references to internal proprietary APIs. Again, a simple system prompt here, but the big takeaway is setting up a feedback loop, making sure that someone's gatekeeping feedback when the model's misbehaved so that you can keep this up to date. It's going to go stale quickly. Rules change all the time. It's important that you have someone accountable for making sure that this stays up to date based on feedback that you're getting.
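A minimal sketch of the "apply the system prompt" half of that loop, assuming an OpenAI-compatible chat API and a version-controlled rules file owned by whoever gatekeeps the feedback; the file path and model name are placeholders, not anything prescribed in the talk.

```python
from pathlib import Path
from openai import OpenAI  # assumes the OpenAI Python SDK or a compatible client

# The rules file is maintained by the gatekeeping group (QE team, AI center of
# excellence, etc.) and updated whenever misbehavior feedback comes in.
SYSTEM_RULES = Path("ai-guardrails/system_prompt.md").read_text()

client = OpenAI()

def ask_with_rules(user_prompt: str) -> str:
    """Send a prompt with the organization's current system rules attached."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_RULES},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```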

Determinism and Non-Determinism

Understanding determinism and non-determinism. Especially with agents, you have a lot of opportunity to control what's called the temperature, which shapes the answer that's coming out of the model. Temperature equals heat equals entropy equals randomness. That's how they've connected the dots there. Effectively, when the model is deciding on its next output token, it doesn't just pick one token. It's got a distribution of candidate tokens, each with a probability associated with it.

Then a certain amount of randomness is applied to which token actually gets selected. That randomness is effectively the temperature. You have control over that. A lower temperature means that the model will probably produce more deterministic output. Whereas a higher temperature makes the model "more creative", meaning that it will explore different paths with the tokens that it's outputting. It also certainly makes it more error-prone. There are times where you may want more non-deterministic output from the agent or the model, especially in a brainstorming feedback loop or something like that, or content generation. Then when we're doing strict code generation, especially combined with a specific system prompt, we might want a lower temperature, and more determinism, here. You can see a very simple example of how this can change outputs. This was done in LM Studio, which is a great place to experiment with temperature and the way that it affects output from models. Similar stuff in Ollama or Docker Model Runner, or any of those platforms that let you run models locally and experiment with them.

In this case, I gave it a low temperature. Because of that selection logic, you want to pick a number between 0 and 1. Don't pick 0, don't pick 1, weird stuff will happen. Some of you have probably played with that already. A decimal somewhere between 0 and 1 is where you want to set your temperature setting. In this case, we set something relatively low, 0.0001. I asked it to create a JavaScript method to render a gradient of colors from blue to red. It didn't give me a JavaScript method, which is fine, but with this low temperature, it gave me the exact same solution in both cases, character for character. It gave me a little HTML block and started manipulating style to start building this.

The actual output obviously goes on a lot further than this slide, this is truncated. Look at what happens when I set a high temperature, so 0.9 in this case. I still only get one JavaScript method here on output B, so it didn't do exactly what I asked. Look at these two solutions. Technically, for what I wanted it to do, rendering the gradient from blue to red, these are both valid solutions. They're just wildly different approaches. On the left, I get an HTML block and I start manipulating style and CSS and all that stuff. On the right, it's straight up JavaScript, manipulating a canvas object. Both valid solutions, just much different. You can start seeing where a combination of a system prompt and a decent, appropriate temperature setting can help you get more deterministic or less deterministic output, but output that's closer stylistically and semantically to the use case that you're working on, by combining those two things together.
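To make the temperature mechanism concrete, here is a small numpy sketch of temperature-scaled sampling over a toy token distribution; it is illustrative only, not how any particular vendor implements decoding.

```python
import numpy as np

def sample_next_token(logits, temperature, rng=np.random.default_rng(0)):
    """Pick the next token index from raw logits.

    Lower temperature sharpens the distribution (more deterministic output);
    higher temperature flattens it (more varied, more error-prone output).
    """
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.5, 0.2, -1.0]              # toy scores for four candidate tokens
print(sample_next_token(logits, temperature=0.0001))  # almost always token 0
print(sample_next_token(logits, temperature=0.9))     # noticeably more spread
```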

Guardrails - Protect Quality and Trust

Again, guardrails. We want to make sure that we do have clear validation steps, absolutely requiring human review. We need to train not just on how to do good prompt engineering and context engineering and inference pipelines and building agents and all that wonderful stuff, but we also need to teach bias and security training. We need to spot hallucinations. We need to spot biased answers. Then, as I mentioned before, we need to create these feedback loops for making sure that we're able to update our system prompts, gathering feedback from how these models are actually behaving in the wild so we can control them and make them better.

Employee Success - DX GenAI Study Overview

All the data is showing us that engineers that learn to leverage this technology really well are going to outperform those who resist adoption. I'm all about psychological safety, but here's the deal, AI is not coming for your job, but somebody really good at AI might take your job. Using this as a way to increase our skill sets in this emerging technology, and as leaders, encouraging our employees to learn these skill sets and giving them the time and giving them the materials and giving them the safe experimentation beds is giving them an opportunity to learn new skills that will probably benefit them for the rest of their careers. We also remember back from the DORA graph at the beginning there that time to learn was positively associated. We want to provide both education and time to learn.

One of the things that DX did was we put together a prompt engineering guide effectively, but we did it our way. We conducted a study. We interviewed a bunch of S-level, so SVP or higher engineering leaders who had rolled out, at this point, coding assistants to thousands of engineers and were seeing positive impacts to KPIs. We just asked them what they were doing in terms of what best practices they were encouraging and things like that. Then we went directly to engineers. We found engineers that were saving at least an hour a week using this technology, and we asked them to just stack rank their top five most valuable use cases.

Then we put together a guide with coding examples and prompt examples and all of that, according to what we discovered. It's available for you here, a 65-page PDF. Again, it goes through what we discovered in this study in terms of high-value use cases and best practices, with coding examples and prompting examples. Proud to say that we've had feedback that this has become required reading in certain engineering teams, so hopefully this is one way that we can help drive that employee success and provide some good materials. Lots of great materials out there. I think what I like about this guide is that we didn't just arbitrarily pick some use cases and things. We actually derived it the way that we do everything, with studies and research and things like that.

I won't go through the whole study. I think it's really interesting that engineers overwhelmingly picked stack trace analysis as the most valuable use case. I've been writing code professionally since the late '90s. It was a lot of Java code. There were a lot of lines of stack trace that I remember going through line by line. This is just first-class behavior now in tools like Cursor in agent mode, for instance. If a build breaks, it's going to look at the stack trace. No question that this is a high-value use case. Refactoring existing code. Generating the mid-loop of a function or something. We have to start philosophically asking ourselves, what is toil now? I know that so many of us are like, I became an engineer because I like to write code. I enjoy writing code. Me too. When we're getting paid to write code, and it's like our job, and we have the technology to be able to write that code more effectively, I think we do bear some responsibility to do things as efficiently as we can.

If you're working on a passion project, an open-source project, by all means, write every line of code, if you want to, if you enjoy that. When we're writing code professionally, I think the story is a little bit different. That's the top 10 list that came directly from engineers. We've also found that good utilization of AI according to all these different policies that we've called out has had a significant impact on dev ramp-up. Engineers in general are becoming valuable much more quickly to the organization. We like this metric called time to 10th PR. Again, not in a gamified sense, how quickly can we spit out 10 PRs? It's really more about, how long does it take for a new engineer in a project or in an organization to reach the 10th accepted merged PR? It's an interesting metric for onboarding. We see that we've cut this in half when we're doing a good job with AI in the organization.

Unblock Usage

Unblocking usage. Start where engineers can apply this technology safely. Prioritize the most high impact workflows. Again, remove bottlenecks to experimentation. Think about self-hosted models. We have good infrastructure for this stuff now, like Bedrock and Fireworks AI. We can run smaller open models locally. I've got my little 5080 gaming rig. That thing takes gpt-oss-20b, runs it great. I can do all kinds of interesting things with it. We don't always need to practice on the frontier models, and a lot of times that's like fishing with dynamite. We want to champion a culture of innovation. We want to partner with compliance early, like on day one. We often assume that we're going to get shut down, like we're not allowed to do a thing that we want to do. If you just go talk to the compliance folks, sometimes it's fine. Or sometimes they already have a gateway or something like that that you can use to creatively move around what you thought was an impossible constraint. Think creatively around barriers. Use synthetic datasets. Anonymize data. Do what you can with better prompt engineering to sidestep some of these perceived constraints.
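As a sketch of that local, self-hosted option, here is what calling a small open model through an OpenAI-compatible endpoint might look like, assuming a local runtime such as Ollama or LM Studio is serving it; the URL, model name, and prompt are placeholders that depend on your setup.

```python
from openai import OpenAI

# Assumes a local runtime (e.g., Ollama on its default port) exposing an
# OpenAI-compatible API; no cloud account or real API key is involved.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="local-placeholder")

response = client.chat.completions.create(
    model="gpt-oss:20b",  # local open-weight model; exact name depends on your runtime
    messages=[
        {"role": "user", "content": "Explain the likely root cause of this stack trace: ..."},
    ],
)
print(response.choices[0].message.content)
```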

Integrate Across the SDLC

Finally, we want to integrate across the SDLC. As mentioned before, yes, we're saving some time with code generation, but even the most recent DORA report on the State of DevEx says that engineers in scaled organizations still only get maybe five to six hours a week to sit down and actually write code. There are so many other time sinks that they face. Code generation is not usually the bottleneck. An hour saved on something that isn't the bottleneck is worthless, according to Eli Goldratt and the Theory of Constraints, again. We did this study, this is 135,000 engineers, and we looked at the biggest time sinks for them. We compared that to overall annualized AI time savings. This pale bar in the middle is time savings through AI. Not insignificant, it's good. It's like 3.4 hours a week is what we found. Look at what eclipses this: interruption frequency, sources of context switching, meeting-heavy days. Or start compounding the other time sinks, like dev environment toil and build and test cycle time.

When we compare those things and start putting them together, these savings are absolutely being eclipsed by these other areas in the SDLC, these other parts of software development that have nothing to do with writing code. We need to be mindful of that. We need to get creative about that. We need to find the bottleneck and fix the bottleneck. Some companies are doing really well with this. Morgan Stanley has been very open. They have a Wall Street Journal article and a Business Insider article talking about an agent that they created called DevGen.AI. They have a ton of legacy code lying around: mainframe code, Natural, COBOL, Perl. I've written a lot of Perl, but apparently, it's legacy now. They have an agent that can look at a bunch of context, look at a bunch of code, and create effectively reverse-engineered specs that they can hand directly to developers. It's not a full end-to-end modernization solution, but it's getting rid of a lot of the reverse engineering part of the effort. They're saving almost 300,000 hours annually by eliminating that reverse engineering step.

Zapier, one of my favorite stories. They're already a very automation-heavy culture, but they opened up effectively a framework, a platform for being able to introduce agents. They can put together agents and have them out and running, after just a little bit of testing, within a couple of days. They have more bots than humans at Zapier. They've done stuff like reduce daily standups from five times a week down to two times a week, which is impressive. They've moved their onboarding time from 2 weeks down to 30 minutes. They've had some real successes. Like I mentioned before, they're hiring more than they ever have, because they know that they can make a single engineer more productive and valuable more quickly, and that they can get anywhere from 10% to 15% greater throughput out of that single engineer. This is the right attitude. We can augment. We can get more out of an individual engineer, so we should hire more to increase and unlock more throughput. Canva is using an agent for PRD generation. Project managers can use natural language to get epics and stories into Jira, or prototypes into Figma. That's great on its own, but it's also generating these PRDs in a language that's very friendly to developers, based on a lot of context in the organization, which removes another friction point: engineers being able to understand what it is that PMs want them to do, without a lot of back and forth over the feedback in that PRD.

Interesting use case there, too. Faire has automated lower-sophistication code reviews, like config changes and one-liners and stuff like that, but that still represents about 3,000 PRs, or code reviews, a week right now. They're just triggering off of GitHub Actions. Then they're putting the feedback directly in the PR comments, right where engineers are. You get instant feedback on your changes coming from the agent right there in the PR, which eliminates some of the steps needed in code review. It doesn't eliminate the full review, but a lot of that first-pass stuff gets handled by an agent.
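A minimal sketch of the last step in that kind of pipeline, posting an agent's first-pass findings into the PR conversation via the GitHub REST API; the repository details and token are placeholders, and this is not Faire's actual implementation.

```python
import requests

def post_first_pass_review(owner: str, repo: str, pr_number: int, body: str, token: str):
    """Post agent-generated review feedback as a comment on the pull request."""
    url = f"https://api.github.com/repos/{owner}/{repo}/issues/{pr_number}/comments"
    resp = requests.post(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        json={"body": body},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Hypothetical usage, e.g., from a GitHub Actions job after the agent runs:
# post_first_pass_review("acme", "storefront", 1234, "Config-only change: looks safe.", token)
```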

Spotify gave us the North Star for DevOps and the early beginnings of platform engineering with the Spotify model. Now they're calling this Spotify 2.0, which is effectively using agents across the SDLC to improve things even more. They've got a lot of great stuff that they're doing, but one thing is incident management, which is really helping SREs. For SREs, the first few critical minutes of an incident, gathering all the context data, runbook instructions, and things like that, can eat up valuable time. Now they just get this information in Slack instantly from an agent when an incident is detected, which looks at runbooks and context and things like that. It's helping them with 90% of the incidents at Spotify right now. Hopefully some good food for thought about how we can use agents throughout the SDLC.

Next Steps

Next steps, distribute the AI guide as a reference for integrating AI into development workflows. Determine a method for measuring. Please use oppositional metrics across multiple dimensions, and don't hyper-focus on a single metric or even a single dimension, for instance, utilization. Then track and measure AI adoption and iterate on best practices and use cases. Again, if you're not using the data to continuously improve developer experience, then you're not using it as best as you can. Here's that playbook again.

Questions and Answers

Participant 1: I was just wondering, since you were talking about psychological safety and also ramp-up time to become productive, if you had any wisdom that you found in your findings about early career software engineers who only have like zero or one years of experience and whether the science has anything particular about them, because I know it's a big issue now. They're not being hired as frequently. Wondering if you have a comment on that.

Justin Reock: What's the impact on new engineers, zero to one years of experience? Is there research on this? Yes, in fact, on our website, we have our most recent AI impact report where we look at a lot of the spread between junior engineers and senior engineers and how they're being augmented.

The onboarding is pretty much consistent between both junior and senior engineers in terms of the impact to better onboarding. The curve is really similar. Junior engineers are using the technology more, which is another reason why I think that we should really be treating junior devs as very valuable resources because you're not somebody like me, an OG developer who's been writing code since the '90s. I had to build new muscle to really get used to integrating this stuff and to become reflexive about it in my own workflows. There's still a lot of work that I have to do. There's a lot to unlearn, a lot of calcification that happens when we've been developing for a long time. Junior developers see higher utilization faster. However, over time, senior engineers are actually saving the most time. I do think that it's because just with that experience, you're going to be able to recognize problems with the code and things like that.

Overall, we see benefits to both, but I really think that, especially as a junior, you should be asking these questions. You deserve a good developer experience. You deserve an optimized developer experience. Again, developer experience is a systems problem more than it is a people problem. The engineering leaders that you're interviewing with are the ones that are responsible for giving a good system that makes you do your best work. In the interview process, you want to ask those questions. Like, what is the attitude towards AI adoption? Listen for keywords like augmentation and success and things like that, and listen for a good focus and investment on developer experience, because, ultimately, that's going to reduce your own friction, toil, cognitive fatigue, burnout, and then help you ramp up faster.


Recorded at:

May 08, 2026
