Transcript
Birgitta Böckeler: I was here at QCon, so about a year ago. I gave a talk with the title, From Autocomplete to Agents. I tried to just give a bit of the lay of the land of where we are in terms of AI coding, because as Blanca said, everybody's a bit overwhelmed, including me, even though I have time to look into this topic full time. I'm going to try and do something like that again today. A year ago, I was mainly talking about the new agentic modes that at the time everybody was starting to pay attention to. The term vibe coding was about two months old at the time, I think. MCP was all the rage. Claude Code was still in diapers. It was already kind of there, but not generally available yet. That's where we were about a year ago. I try to reflect on the high-level things that happened over the past year.
One of the things is that context engineering evolved. This is also a term that wasn't even around last year at QCon London. It started floating around, around June. Basically, the simplest definition of it is you want to curate the information that your model or your agent sees to get better results. That's a simple definition. It means different things, like when you're building an agent or using a coding agent, which is what I'll be talking about right now.
A year ago, I had a slide that was similar to this one that was talking about the state of context engineering at the time for coding agents, which was rules files. You just have like an AGENTS.md, a CLAUDE.md file in your workspace, and every time you start a session with an agent, the agent gets sent this file. You can put typical pitfalls, repeated errors in there. I had a thing, like my agent kept forgetting to activate a Python virtual environment every time it ran a Python process, so you would put stuff like that in there. MCP servers were also around at the time, as I said, which help an agent more dynamically get data.
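As an illustration, a minimal rules file along those lines might look like this. The file names (AGENTS.md, CLAUDE.md) are the real conventions, but the contents here, the virtual environment reminder and the commands, are just examples of the kind of thing you would put in:

```markdown
# CLAUDE.md / AGENTS.md

## Environment
- Always activate the Python virtual environment before running any
  Python process: `source .venv/bin/activate`

## Conventions
- Run the tests after every change; do not leave the session with
  failing tests.
- Keep commits small and focused, with descriptive messages.
```

Because this file is sent with every session, it should stay short; anything situational fills up the context window on every single request.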
Since then, there have been all of these things. Not just rules and MCP servers, also commands and skills and subagents and plugins and specs. There's been a lot of activity in that space. I just want to zoom in on one of those things, because it's been confusing to quite a lot of people, I think. I took a sabbatical at the end of last year, and when I came back, skills had come out, and I was very confused at the beginning about what exactly they are. I'll try to zoom in a little bit on that.
First of all, this was a new concept introduced by Anthropic, based on a lot of different things that had already been going on. What it basically is, one, it helps you modularize those rules that I was talking about. You don't just have to have one big file in your workspace that always gets sent to the agent, but you can have little subfolders that modularize the different things you want to tell the agent. It can be from like, here's how we usually build a React component, all the way to, here's how you get logs from our AWS test environment. These modules can then be loaded by an LLM just in time. That's another big thing. This kind of progressive, lazy loading of context. The agent, or the large language model, will just get a description of the skill. Here it's just like a get-log skill, and the description says, get logs from a test environment, for example, for debugging incidents.
Then when the large language model realizes, that seems to be what we're doing and there is more information on that, it's just going to load it, so it's not filling up your context window from the start. The next thing with skills is that they can include more files than just this Markdown file. You can have additional documentation in this folder, you can have scripts in this folder that you want the agent to execute. It's a folder, it's not just one Markdown file. You could do this before, but it's something that more people have caught on to now: you can just tell the agent in the Markdown file to use the CLI that is installed on your machine. This realization has had a lot of people shift many of their agentic coding use cases away from MCP and pay more attention to: what CLIs do we already have on our machine, what scripts can I write for my agent to use? It's a lot more straightforward than having yet another type of process running on your machine.
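As a sketch of what such a skill folder can look like, here is a hypothetical get-logs skill, living in something like `.claude/skills/get-logs/`. Only the name and description in the frontmatter are loaded into context up front; the body, and any scripts referenced from it, are loaded just in time. The folder path, the script name, and the steps are all illustrative:

```markdown
---
name: get-logs
description: Get logs from the AWS test environment, for example for debugging incidents.
---

# Getting logs from the test environment

1. Check that the AWS CLI is authenticated: `aws sts get-caller-identity`.
2. Run the helper script in this folder: `./fetch-logs.sh <service-name>`.
3. Summarize any errors you find before continuing with the task.
```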
Context engineering is a combination of the following things. One, it's, of course, as always, reusable instructions and conventions. It's, how do you write a React component, how do you bootstrap a new project? Also just coding conventions and stuff like that. That is combined with context interfaces. Things like the description of the skill, or things like the list of tools that an MCP server has, or the list of tools that are built into the coding agent. All things that the LLM can then take and say, in this situation, I want to call this tool, I want to load that skill, I want to call this MCP server tool. It's all about, how can you get this to be intelligently loaded just in time? As always, non-determinism is always involved with this. You never have a guarantee that the LLM will actually decide to load your skill.
As a human, then, a lot of my job becomes managing that context, and also managing and monitoring the context size. Because even though context windows are, of course, technically a lot bigger now than they were a year ago, or especially two years ago, still, when they get full, you can feel the effectiveness of the agent degrade. Also, of course, it starts costing a lot more money, because every time you go back and forth with the model, you send the full context window. Coding agents now have different features to help you monitor what's actually in your context, and it's quite interesting to see all of the things that take up space.
On the left here is a session that I started with Claude Code, and it was actually very shortly after I started the session, so I hadn't even typed a lot. Already the context window was 15% full, because there's a Claude Code system prompt, there's all of these context interfaces, like skills that I have, and stuff like that. That's also part of how you have to balance your skills and all of the things that you give to a coding agent. On the right, GitHub Copilot also has that feature. At the moment, the Claude Code team is leading the pack, and then everybody else copies what they're doing. That's a bit of a high-level summary.
Then, finally, a really powerful feature that I would also see as part of context engineering, which started becoming more popular last year as well, is subagents. A lot of the coding agents have that built in now. It's this idea that the main agent can spawn off subagents. The most common use case of that, that we don't even control, but that the agents often decide to do, is that when you start a session and the agent wants to research what's in your codebase, it often spawns a subagent to do that.
On the bottom right, there's a Claude Code screenshot that shows this Explore agent. Research often takes up a lot of tokens, because the agent has to read a lot of files and find the ones that are relevant to the task. The subagent does that work and just reports back the results to the main session, so the main session doesn't get all of the potential overload of stuff that it doesn't have to know. You can also use this yourself. A very common use case for subagents that we as users try to set up is a code review agent. A lot of people like to have a separate context window, one that doesn't know about all of the history in the session, do a code review, or maybe also have that use a different model, and so on. This feature has unlocked a lot of other stuff that I'll also be talking about.
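In Claude Code, for example, such a code review subagent can be defined as a Markdown file with frontmatter, conventionally under `.claude/agents/`. The fields and the prompt here are illustrative; check your agent's documentation for the exact format it supports:

```markdown
---
name: code-reviewer
description: Reviews the current changes for bugs, security issues, and convention violations. Use after a piece of implementation is complete.
tools: Read, Grep, Glob
model: opus
---

You are a strict code reviewer with no knowledge of the implementation
session. Review the current diff against our coding conventions and
report issues ordered by severity. Do not modify any files.
```

The separate context window is the point: the reviewer does not inherit the main session's history, and can even run on a different model.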
In the context engineering area, ask yourself, what coding conventions do you want to amplify? Because all of this is a way to amplify things with AI: good stuff and bad stuff. Also, you have to be careful. What workflows can you maybe build for modernization initiatives? Migration is a super good use case for generative AI, and now with some of these options in the coding agents, it becomes easier to build something like that. I just recently talked to a colleague who was working with a client who had thousands of CI/CD pipelines in an older tool that they wanted to migrate to GitHub Actions, and she was building a human supervised workflow with different subagents and skills and MCP servers and stuff like that. What tools should be available in your organization to make it easier for an agent to take certain actions or to get information? That can be CLIs, it can be MCP servers, maybe also language servers.
I'll get back to that a little bit later, but especially if you have more unusual languages, you can give the agent more information and tools that actually understand how the language works. Not only coding conventions, but also: what are practices you want to amplify? My favorite examples that I always mention are things like improving an architecture decision record, or how do you do threat modeling, stuff like that. You can have those as skills as well, to help people actually understand those practices. There are lots of open questions as well, so context engineering is a little bit engineering in quotes. How do you version and distribute these? There are first ideas, like the plug-in marketplaces, again introduced by Anthropic and the Claude Code team. It's still not quite mature, it's all evolving.
Also, a big question of course is, the context that you have, is it making things better or worse? This is all about evals, evaluation of the skills. Anthropic just released something to make it easier to do something like evals. The skills registry Tessl also released something recently, so that's in the early stages as well.
More Agent Autonomy, Less Human Supervision
With all the model improvements that of course have also been happening, and this more powerful context engineering, there's been this trend towards giving agents more autonomy and reducing human supervision. I guess that's what all the hype is always about: when can I just have AI build all of my code? Supervised means: me as a developer, I sit in front of the session, I still watch what it's doing, I steer, and there's a lot of back and forth. Versus unsupervised. This is a screenshot from the middle of last year when OpenAI's Codex first came out, where you basically have these cloud agents. You just send the agent off in the cloud, maybe for 20 minutes or so, to do something for you.
At this point, a lot of the big coding agent products have this ability for you to not only have agents work locally in the IDE, but also in the cloud, so top left here is Cursor, bottom right is Claude Code. I think Copilot has this as well. Of course, now you can also do this from your mobile phone, so people are starting to code on their commute to work, if they have one. You can use these cloud coding agent platforms. Another way to do this comes from the rise of CLI-based coding assistants; again, every one of the big products now also has a command line-based version. There's a Cursor CLI. There's a Copilot CLI. The first one that got a lot of attention, of course, again, is Claude Code. Because you can run them in headless mode, you can also put them into your existing pipeline systems. For example, there are GitHub Actions for Claude Code, and Copilot, and so on.
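As a sketch of what putting a headless agent into a pipeline can look like, here is a hypothetical GitHub Actions job using Claude Code's print mode. The package name, flags, and secret name reflect my understanding of the CLI at the time of writing, so treat them as assumptions, and prefer the official actions where they exist:

```yaml
name: agent-review
on: [pull_request]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run coding agent headlessly
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          npm install -g @anthropic-ai/claude-code
          # -p runs a single non-interactive "print mode" session;
          # restrict the tools so the agent can read but not change anything
          claude -p "Review the changes in this pull request for obvious bugs" \
            --allowedTools "Read,Grep,Glob"
```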
As part of this, a familiar beast rears its head, which is: we have to sandbox this somehow, and we have to give the agent a proper environment. We have to figure out how to give it all of the tools that it needs, all of the compilers, all of the dev tools, and so on. An out-of-memory error is just a representation of: we have to give these environments the right amount of resources. Internet access in our CI/CD pipelines and dev sandboxes is already a big question: how much internet access do we give the agent, or not? There are new concerns now with agents around prompt injection, when they load maybe untrusted content from the web. There are a few new questions, but some of it is also an existing challenge that we have. Not just in the cloud, there's also a trend to less supervision locally. This is a visual from Steve Yegge's Gas Town blog post, it says, The 8 Stages of Dev Evolution to AI, and stage 6 here is having three Claude Code instances in parallel on your machine, and stage 7 is 10 instances in parallel.
I tried the three instances, and it's a lot. I kept typing the wrong thing into the wrong session and stuff like that. It is something that in some teams out there is actually happening. That last picture there, the stage 8, is like a short intermezzo for the hype du jour, I want to call it, which is agent swarms. Gas Town is an example of that. There's this project, Claude Flow, that has been around for quite a while, I think it was recently renamed. There were these two experiments recently by Cursor and Anthropic as well that got a lot of attention.
By agent swarms, I basically mean you send out a lot of agents, like dozens or hundreds. You throw as many agents as you can at the wall, you see what sticks, and maybe new behaviors emerge. How do we coordinate those, and so on? Those experiments by Cursor and Anthropic made a lot of people nervous. Cursor basically had a bunch of agents run, I think the longest run was for a week, to build a browser, and Anthropic had them build a C compiler.
Then people were like, does this mean AI is already there, it can build these things? One thing to keep in mind is that both of those use cases were probably specifically picked; Cursor actually said they specifically picked the use case. Both browsers and C compilers are very well-specified problems, and the specification is all over the internet. Also, especially in the case of the C compiler, there's even a very elaborate test suite that the agent can use to get feedback on whether it works or not. That's often what we don't have when we build enterprise software. We don't have it to that level.
I don't think you necessarily have to go and try Gas Town in the context of what a lot of your work environments probably are. If you still want to dip your toes into this, a good way to do that is a feature Claude Code released, called agent teams. This is in preview, so you have to switch a preview flag, I think. Some people use agent teams and swarms as the same thing. I like to think of them conceptually as two different things. You either send out dozens or hundreds of agents, where AI even decides which ones to use, or you can do it in a smaller context. Here I was trying this with five agents, I think. The key is that there needs to be a lot of orchestration. The main agent decides what can be parallelized. They can also talk to each other and stuff like that. This is all still early days.
Back to the Present (Less Supervision)
Let's go a bit back to the present and the less supervision. Some things that you should be asking yourselves: where do you want to experiment with those cloud agents or with less supervised agents? Depending on your environment, you might not want to take large risks, but you can experiment with things like cleaning up feature toggles. Lots of people, again, use code review agents and stuff like that. Then, how do you gauge the appropriate level of supervision for a task? Also, how do you help people in the organization gauge the level of supervision that they should choose? I found myself doing a lot of micro, sometimes macro, like lots of little risk assessments when I use AI. Should I use it for this? Should I not use it? How much review should I apply? Risk assessment is always, in any situation, a combination of three things: probability, impact, and detectability.
In this context, for me, first, I think about the probability that AI might get something wrong. That probability, I assess by my knowledge of the context I've given it, by my knowledge of how well the tool works, by my experience of using it for similar things before. It's an intuition you have to build up over time using these tools. It's also about stuff like, how confident am I even in my requirements? How well can I even specify what I want? Impact if AI gets something wrong is all about the use case criticality. Is this a proof of concept or a spike? Or is this something that will get me out of bed at 2 a.m. on the weekend because it's a critical workflow and I'm on-call?
Then detectability is the detectability that AI got something wrong. This is all about me knowing my feedback loops. Then by assessing that, I make a decision about which type of workflow I use. Do I use a very super elaborate one with lots of planning in the beginning and stuff like that, or do I just start with a quick prompt? It also determines how much I review. Do I just fully vibe code and not look at anything, or do I look at every single line of code or something in between? Also, how long do I let it go off without supervision? Because the longer it goes off without supervision, the more I have to review afterwards and actually see what happened. If you look at these things, actually only the thing that I have highlighted in yellow here now, like knowing the context and knowing how good the AI is at this type of task, that's the new thing.
All of the other things, an experienced developer at least should already be good at. This is the skill that we have to hone. It's also a bit like: you have to be this tall to ride the roller coaster. You have to be this tall to reduce supervision. The probability that AI gets something wrong can increase with a bad codebase, because it will pick up on existing patterns. It can increase when you have a system where things are in lots of different places, because then the agent has a lower chance of actually getting all the information. Detectability: if you don't have feedback loops, if you don't have good test automation and stuff like that, then you have fewer ways to verify what happened yourself, but also fewer ways to give the agent the ability to verify it.
More Autonomy, Less Supervision: Beware of Security and Cost
There are more things to be cautious about, of course. I'll also talk a bit about security and cost, two things that have also changed over the past year, but maybe not in a direction that we like. First of all, security: almost every week now there's some report of something happening. It's usually all related to prompt injection, this idea that an agent gets content from an untrusted source that might give it instructions that you weren't aware of. One thing that can happen with that is unwanted command execution. All of the agents have these allow list features that you can configure to say: for this pattern or for these commands, you don't have to bother asking me whether you may execute them. For other ones, you maybe always want to say, "Yes, you're allowed to. Yes, you're allowed to". There are some weaknesses in the implementations of these. That, combined with other stuff still not being quite figured out and with AI being non-deterministic, is a risk.
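In Claude Code, for example, such an allow list lives in a checked-in settings file. The shape below matches its permission rules as I understand them, but the concrete patterns are illustrative. The weakness mentioned above sits exactly here: an overly broad pattern, combined with prompt injection, can pre-approve a command you never intended to allow.

```json
{
  "permissions": {
    "allow": [
      "Bash(npm run test:*)",
      "Bash(npm run lint)"
    ],
    "deny": [
      "Bash(curl:*)",
      "Read(./.env)"
    ]
  }
}
```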
Another security risk is the extraction of secrets. This is a big risk especially for open-source projects that are open for GitHub Issues by anybody, but then immediately trigger an agent without much supervision. In one case, an attacker actually managed, through prompt injection in a GitHub Issue, to extract the secrets that allowed them to push to the npm registry for that particular tool. Not all of this is relevant for what we're doing inside of an enterprise, for example, but it shows us a much-increased risk, again, in the dependencies that we're using and how careful we have to be about the ecosystem that we're pulling into our application. A really great model for thinking through this, Simon Willison wrote about it in June 2025, is the lethal trifecta. When you have an agent that has exposure to untrusted content, and access to private data, and can externally communicate, then you have a high risk of getting data problems, getting security problems with this agent.
This is even more relevant for business use cases of agents, because as soon as you integrate email, for example, with read and write permissions into an agent, you already have the lethal trifecta. It's not a technical problem, it's a conceptual problem. It will be interesting to see, on the business side of things, how all of those agent use cases that were promised to us will even be able to get around this. With less human supervision, for security, think about: are you making it easy to sandbox coding agents locally as well, not just in the cloud? For example with pre-existing things like dev containers, which I've been using a lot for this recently.
Also, there's a bunch of new products popping up that have interesting ideas about this. Yes, so just sandboxing your agents, even when you run them locally to reduce the supervision. What is the AI security literacy among the engineers in the organization? Do they even know what's happening under the hood? Don't use this YOLO mode where you don't even have to allow any of the commands, it just does what it wants, stuff like that.
Then, secondly, for costs, the honeymoon is definitely also over. In the beginning of '24, I heard a keynote where the speaker said, generating 100 lines of code only costs about 12 cents, compare that to a developer's salary. Leaving aside that lines of code is of course not a good measure of value. It's not even 12 cents anymore. This is from summer 2025, from one of those websites where people post their token usage. There was one person who was using, on average, $380 a day, which, if you extrapolate that to 20 workdays a month for 12 months, would be a developer's salary of $91,200, which is not a bad developer's salary in Germany.
Of course, this was summer, and this has exploded even more since. We've gone from $20 flat rates in the beginning, I think Copilot might even have been $10 in the very beginning, to more like $200 flat rates that are not really flat rates, because you get request limiting. Then you see people on Reddit saying, it's only the middle of the month and I'm out of tokens, what do I do? Because we can't work without them anymore. That's a whole other topic. Why is the cost ballpark for a change so far from 12 cents now? When that keynote speaker said that, in early '24, we were mainly doing autocomplete. Maybe asking in a chat for a few lines of code. Now we have the agent research the existing code, then make a plan, then we review and adjust the plan.
Then we start implementation. We have it run the tests and fix the tests, and check the lint errors and fix the lint errors, maybe check the browser if it's a visual UI feature. Fix that again. Have a code review agent running, react to that, have a summarization. It's all of this back and forth. It might even just be two lines of code afterwards.
Where Are We Now?
Where are we then after these last 12 months? Context engineering is becoming a very powerful lever now of amplification, both for good and bad. You can actually do a lot of stuff now to make it more probable that the agent does what you want. Again, model improvements, I haven't even explicitly mentioned here, that has definitely also happened, but I find that actually less interesting than all the other stuff that's happening around that.
As a result of these things, there are strong forces tempting us out of the loop. We have to think about where, in a given organization or a given use case, we can give in to that pull, and where it is treacherous. This feels really good and gives me quick results, but what will this mean in a year, maybe? There's this need for speed, everybody just wants to be faster and have more throughput, and look how many PRs we merged this week. With more autonomy and less supervision, there's not just the question of security and cost, but also what happens to the maintainability of the code. There was an interesting article recently by a team at OpenAI saying they had been working on a codebase for the last five months. It started as a greenfield codebase, and their rule for themselves was: we don't want to touch the code directly, we just want to interact with the agent and continuously improve all the setup around it, to make it easier for the agent to maintain this autonomously. It was a mix of things. It was a lot of that context engineering stuff, like skills. They also put in more deterministic checks, like custom linters and structural tests.
Then they still said they had entropy increasing and drift happening. They had what they called garbage collection: agents continuously running against the codebase and cleaning up over time. Architectural constraints enforced with more deterministic tools is something that has been popping up recently in a lot of stories from teams that use agents like this a lot. I also experimented a little bit with it. It's this idea of having structural tests as agent feedback. Think of stuff like ArchUnit or Spring Modulith. In my case I was working on a TypeScript codebase, so I used something called dependency-cruiser, which I had actually never heard of before.
The reason is that we've had these tools for quite a while, but I think a lot of people haven't used them, because we were still working on the code ourselves, and in a lot of cases of course we still are. We were like, "Yes, I know how to modularize. I don't need a tool to help me with these constraints". Now they're becoming a really interesting feedback tool for the agents. In this case, in my application, and again I worked on this together with AI, I defined the different layers that I wanted in the application and then set up a bunch of different rules. There was one, for example, that said that external SDKs may only be imported by files in the clients folder, which had lots of clients for other APIs. You don't want to do that in the domain folder.
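A rule like that can be expressed in a dependency-cruiser configuration roughly like this. The rule structure (forbidden, from, to, severity, comment) follows dependency-cruiser's format, but the folder layout and the package pattern are made up for this example:

```json
{
  "forbidden": [
    {
      "name": "only-clients-import-sdks",
      "severity": "error",
      "comment": "External SDKs may only be imported from src/clients. If you need one elsewhere, add or extend a client in src/clients and depend on that instead.",
      "from": { "pathNot": "^src/clients" },
      "to": { "dependencyTypes": ["npm"], "path": "aws-sdk|stripe" }
    }
  ]
}
```

Note how the comment field already carries guidance for whoever, or whatever, hits the rule.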
Then this was additional feedback for the agent. What's also interesting about these linters and structural tests is our ability to enhance and extend their messages now. It's kind of like a good type of prompt injection. You adjust the messages so that they also contain instructions or hints for the agent on how to react to them. Let's say you have a linting rule: every file may only have 500 lines of code. If you want to avoid the agent just packing multiple statements onto each line so that it stays within the 500 lines, you could put into that error message: this is a smell for a design problem, so you should consider refactoring. Just give it more context on what that message means to us.
How Can We Increase Our Trust in the Agents?
This is basically all about how we can ultimately increase our trust in the agents by building a harness like this. The team called this harness engineering. Ultimately, thinking a few years ahead, we're not aiming for perfection, where the agents build perfect code. We certainly aren't building perfect code ourselves. How do we get enough confidence and trust for our particular situation? Because we want to deliver our software safely and quickly, in a sustainable way. I just want to share a mental model of how I have recently been thinking about this. Again, I talked a lot about the context engineering, the skills and so on, all of those things that feed forward into the agent. We're anticipating what the agent might do wrong, and we're trying to give it all the tools and instructions to increase the probability that it does what we want. That would be things like giving it principles, coding conventions, maybe reference documentation, how-tos, and so on.
Then, after the agent does its first generation of code, very often it's not immediately perfect, so then we give it feedback. Static analysis. Maybe give it access to logs, have it start the application and check in the browser whether it actually works, and so on. That all lets the agent do some of the rote work of corrections before we even look at what it's doing. This can actually be a mix of CPU and GPU-based things. I took this framing from a company called Moderne, who recently started using it as a framing for the CPU-based tools that they use. We can have a code review agent, but that is still based on GPU inference.
Like, what if we enhance that with a lot more CPU-based, more deterministic things? We can actually have the same thing in the feedforward as well. All the things that I mentioned here before were GPU inference-based but if we give the agent access to CLIs, to maybe bootstrap scripts, to Codemods like OpenRewrite recipes and stuff like that, then, again, we can feed it the things that make it more probable that it does the right thing. That also comes back here to the language servers that I mentioned before. For example, you can now give an agent access to something like IntelliJ's refactoring capabilities. It can actually use the rename symbol functionality to do a refactoring instead of doing text diffs all over the place. Again, makes it more probable that it does the right thing in the first place.
Then that, all of those things together, I would call a harness. Some of those things are built into the coding agents themselves. Like the way they do code editing, the way they do code search, they can improve all of those things. It's all part of the feedforward harness. Then there's a bunch of things that we can control ourselves and make it specific to our situation. Then, as humans, we're the steerers of this. That's what that OpenAI team was trying to do. They were trying to continuously improve this harness around the agent. Of course, for that, we can also use AI. That's the new potential also of these structural tests that I mentioned before and linting.
Previously we wouldn't have built custom tooling for that, it was too much work. Now, for that experiment I was doing, I actually did all of that with AI, I didn't write it myself. That's a much lower risk factor than some of the production software we write. I also wonder if that's one of the new abstraction layers, actually. Will we maybe at some point in the future just have these topologies that cover 80% of what we do? We build data dashboards a lot that just collect data from other APIs and show it, or there's maybe a CRUD business service, or event processors, or just typical types of applications that we write. Then we have a definition of how they're supposed to be structured and what the tech stack is. Then maybe, instead of a service template, we have a harness template that we can instantiate to work on our codebase. Then we might not even care anymore if it's React or Vue.
One of our decision metrics might actually be, is there an existing harness for that, so I don't have to worry about that and I don't have to build that up initially? All the examples I just gave, everything I just talked about with this harness were all about maintainability and internal code quality. I still have lots of questions about verifying functionality. Of course, that's ultimately the key. We still want our software to work. We don't only want safety critical systems to work, we want all software to work. I think we need ideas how to harness different aspects of our applications. I just talked a lot about maintainability. Maybe there's also something like architecture fitness. How can we have a harness for operability, performance, and so on?
Then there's, like I said, behavior. Right now, what most people do is they feed forward a description of the functionality. Then the feedback part is mostly, the test suite is green, but the test suite was generated by AI. Then you do some manual testing, and that's it. That's the approach I see right now. Some people say they do more review of the tests and all of that, but I'm sometimes a little bit skeptical. That's not good enough. We have to come up with better ways to do this.
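That loop, feed forward a description, let the agent produce something, feed back whether the tests are green, can be sketched roughly like this. The function names are placeholders I made up, not any particular agent's API:

```python
# Rough sketch of the feedforward/feedback loop described above.
# run_agent and run_tests stand in for whatever agent and test runner
# you actually use; none of this is a specific tool's API.

def supervise(description, run_agent, run_tests, max_rounds=3):
    feedback = ""
    result = None
    for _ in range(max_rounds):
        # Feedforward: the task description plus any accumulated feedback.
        result = run_agent(description + "\n" + feedback)
        # Feedback: today this is usually just "is the test suite green?"
        ok, report = run_tests(result)
        if ok:
            return result
        feedback = f"Tests failed:\n{report}"
    return result  # still failing after max_rounds; a human has to step in
```

The weakness lives in `run_tests`: if the suite itself was generated by the same AI, a green result is a weak signal, which is why better feedback mechanisms are the open question.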
How does all of this change my trust level? Improved models have definitely increased my trust level somewhat. Much more sophisticated context engineering, like the progressive loading of context, more tool integrations, subagents, all of those also offer a lot of new ways to increase my trust level. There's new food for thought about how far we can push static analysis and things like that. In terms of the question marks, I still see models do stupid things all the time. There was just this post on Reddit that went viral, where the agent was saying, "Yes, you said no, but I thought you said no to me asking you for permission, so I just went ahead".
Something like that, like a teenager: yes, you said no, but I thought you meant something else. I still see that happen all the time. Then there's this cognitive overload in the loop. There are more and more anecdotes now of people talking about burnout, or just being overwhelmed, either by doing all of the review or by running multiple sessions at the same time. Also, when I hear about this being dialed up to 11, with a lot of AI autonomy, it's mostly done by quite experienced engineers, who have the capacity for more load because they have so much experience.
Then there's just this pressure for speed from above as well. If you put pressure on people to be faster with AI, they will cut corners and become sloppy. Interestingly, there was recently an article about Amazon reflecting on some outages that were supposedly related to AI-generated code, and one of their reactions is now to have more gateways where senior engineers have to review what's going into production. Which seems kind of like, weren't we supposed to be faster going into production, and now we're just putting in more gateways? That probably isn't the solution either. I've recently been asking myself more and more, with all this speed: how much speed do we actually need? What is the Goldilocks speed that is fast enough, but not too fast? Just imagine Homer falling off that fast treadmill. What are the risks of all of this speed? And how does the speed actually help the organization? Aren't there other things we should be looking into?
Conclusion
I talked a lot about reducing human supervision and increasing the ability to use AI as an automation tool, building a lot more for us, because that is definitely the through line of everything that's happening, and that's what a lot of people see in the future. AI is a Swiss Army knife with lots of potential use cases. Of course, plenty of them are useful with supervision, and they don't necessarily overload you; they're actually a good extension of what you otherwise would have to do manually, which would take a really long time. So many things have developed over the past year that also make the use of AI more effective and a better developer experience when you're using it as an enhancement of yourself.
Over the next 12 months, we're going to learn a bunch more. Again, there will be new good ideas, but we'll also discover more about all the worries and concerns that we have, about overload, about skills atrophying, and things like that. These were just some of the things I was posing to you as questions to reflect on. The least you can do is reflect on these questions; it's a good thought experiment. How ready would you be if you wanted to give AI more autonomy in your delivery? What's your automated safety net? What's your security stance? What's people's AI literacy at this point in time? Improving those things is always worth it. It's also worth it for the humans. AI can even help you improve that safety net already today, if you want to be prepared for the future.