AI Native Engineering

Summary

Ian Thomas shares a case study on embracing AI-native engineering within Meta’s Reality Labs. He explains the "Assess and Grow" framework, a maturity model designed to move teams from manual toil to AI-integrated innovation. He discusses real-world wins - including hitting 90% code coverage in record time - while addressing senior concerns like "code slop," review fatigue, and maintaining quality.

Bio

Ian Thomas is a Software Engineer at Meta currently leading teams building Horizon Worlds across web, mobile and VR platforms. Previously, he helped lead development of Workplace and Horizon Workrooms after spending nearly a decade building sports betting systems for Sky Bet and PokerStars.

About the conference

QCon AI is a practitioner-led event focused entirely on the engineering discipline required to scale these workloads safely. It provides direct access to the architectural playbooks and failure metrics that peer organizations use in production.

Transcript

Ian Thomas: I'm really excited to share a case study of how we've been embracing AI native engineering in my small part of Reality Labs, which is called Horizon Experiences. I want you to imagine that you aren't here and you're back in your chair, working away, thinking about what you're doing on a day-to-day and thinking about how much of that time you spend is focused on toil, updating tests, fixing things that are broken, reviewing mundane code changes, time that you could be spending otherwise focused on solving interesting problems. That's the core vision behind what we're trying to do within our little part of Meta. I think it applies whether you've got 10 engineers, 100, 1,000, 100,000. Hopefully this is something that will be interesting to you and be useful to you. Our vision is that we want to move people away from being builders to becoming explorers and innovators. Focusing on reducing the toil that happens on a day-to-day basis so that you can spend your time doing something meaningfully productive using your human creativity for its maximum effect.

The timeline for this may look something like what you see on the slides here now. I think we're broadly aligned with end of year 2025 right now where a significant amount of our time is spent on both administrative tasks and modernization as well as feature development. A much smaller portion of that time is spent on feature exploration. What I've been doing over the last seven months is building up a community around what we call AI4P, AI for productivity. I'm going to share with you some of the numbers and some of the things that we've been doing to try and make that leap effective from engineering as we know it, to AI native engineering.

In the seven months that we've been at this, we've driven organic growth of a community from nothing to over 400 members. We see significant tool adoption now and also significant time saving in some workflows, where "some workflows" is doing the heavy lifting in that sentence; it's not across the board. Multiple teams have been completing assessments that we've designed, working with our maturity model that I will be presenting to you later. We're starting to see these patterns apply across the board, not just in our teams but also across other parts of Meta.

I'm Ian. I'm a software engineer. I work out of the UK. I'm part of the Horizon Experiences team in Reality Labs. I work on Horizon Worlds, which is this VR, mobile, desktop platform for social engagement, interactivity, game building, you name it. It's an immersive 3D environment. I want to talk to you about the journey we've been on, taking you through how we founded it, built the structure, grew it, and then found ways that we could be successful and also some of the challenges that we faced along the way because it hasn't always been plain sailing. Then, hopefully at the end, I'll give you some actionable tips and playbooks that you can take back and use in your day-to-day work.

May 2025: We Asked a Question

We started this in May. What happened was a few of us got together, and we were talking about what we call our engineering excellence goals. This is a program that hopefully all companies will recognize. It has three main pillars. One is implementation quality, another is better engineering, which is how we make ourselves more productive, building tools and improving our working practices.

Then the third is production excellence, which is how we operate our software in production and make it so that our customers have delightful experiences. This was a great place for us to start because a few of us had been experimenting with AI tooling outside of work, using things like Claude Code, taking notice of what was going on in the industry. Tooling internally was also being developed quite rapidly, but we weren't seeing a great deal of adoption. We put two and two together and said, how could we maybe introduce this approach, this idea, so that we can also accelerate our engineering excellence goals? I'm sure everyone will recognize that focusing on improving your test coverage often falls behind in the priority list to building that next feature.

Act 1: The Beginning - How We Started Small

In the beginning, we decided we were deliberately starting small. We wanted this to be a safe space for people because we knew that there were going to be an awful lot of missteps and vulnerability needed. Fewer people means there's less exposure and it's easier to share your concerns and the things that you're not finding easy. We could also be much more focused because, at that time, I was in a team called 1P Spaces and this was our ring-fenced area. We had a very specific set of products that we were working on and so it allowed us to be very laser-focused on the technologies and techniques we were working with. It didn't have to be broadly applicable. Again, small means fast. We could iterate quickly, we could learn, do experiments and see how things worked out for us. Again, because we had a very good small group, we could have a high ratio of people that were really passionate about this, driving it as our champions, to the people who we were trying to encourage to adopt these tools and bring along with us.

That led to some organic growth, which was great. It was word of mouth. We weren't particularly proactive in sharing this and bringing people into the group. It was a thing that we were aiming to do organically. We launched the program, and as I said, we were using our engineering excellence program as our vehicle for this. The primary things we were focusing on impacting were test health and coverage, code quality, complexity levels, bits of documentation, and some of the other things that help people to onboard and learn codebases quickly. The foundation led to us generating fairly ad hoc brown bag sessions.

As people discovered new things, they would set up a lunchtime presentation just informally discussing with people. We made sure that we had this safe space, like I said, where people could ask questions and show where they had big gaps in their knowledge. We started to find that people were creating patterns or we spotted patterns in the things that were generating success, so we could start to collect them and bring them into our documentation and our process library. Because we'd focused on these engineering excellence goals, that allowed us to be quite clear that we were going to start with a couple of internal tools. They're listed here. I appreciate they mean probably nothing to anybody, but Devmate is the main supervised tool that we use within VS Code at Meta. It's like your programming partner. There is an unsupervised approach to working with it too, but at that time it was mainly your partner. RACER is a different initiative that actually came out of our monetization team where it was focused on reducing code complexity and improving test health and test quality. The RA stands for risk aware. It was actually something that was developed to have an idea of quality baked into it so that it wasn't going to just churn out slop, which it did at the time. Slop isn't something we can avoid universally, unfortunately. Then we tracked what worked well and what didn't.

What we found from this was that people who were trialing things ad hoc were having a really terrible time with it. It was very difficult for them to make progress and find meaningful value. These are the things that we spotted from that problem. People were trying to use the same tool for lots of different problems, and it led to low quality outcomes because the alignment was wrong. They also had big gaps in how they were applying it to their processes. They were hoping that they could just zero-shot prompt things, and it would solve all of their problems. That clearly was never going to be the case. From a leadership perspective, we had trouble talking about the ROI of the effort because people were spending more time on working with these tools and learning them, and they weren't necessarily being as productive as they were without using them.

Then, like I said, people were failing, and without the safe space and the vulnerability and everything that we were trying to do with the community, they weren't willing to share that. We didn't know that there was a problem. That was an issue that we had to overcome. At about this time, end of May, throughout June and July, I was talking to a few of the AI4P champions that we had and said, I think we could really benefit from having some maturity model here. Then we could apply that and work with this at a team level and understand how teams are working and which bits they're specifically struggling with. That's where we decided that we needed to build a bit of structure.

Act 2: Building Structure - The Assess and Grow Framework

In August, I launched this framework, based on DORA, which I've used for many years to help with team productivity and quality outcomes and things across the board with our DevEx. They've recently been doing a lot of work in the AI space as well. We use that to anchor in the research, and say, this is where we're thinking and where we're driving it from. Here is a maturity model with six dimensions that we think you can use in your teams to help identify where there are problems. The good thing about this is it's team-driven. Teams could take it away and they can work on it in their own space. They weren't necessarily having to feel watched or have a mandate from above. They could scale it. It was general. It didn't have any specific tools prescribed or any processes prescribed. It was more generic and hopefully useful across the board. The six dimensions were things like workflow integration, prompt skill and sharing, trust and quality. We had five levels per dimension.

As with all good things, it starts with zero-based indexing, which is one of the reasons why I get so confused in buildings in the U.S., because one should be the one up from ground. We start at SIT and go all the way up to LEAP. This is where you see people going from, I'm not using this, I'm maybe not even aware of this, right the way through to, I am defaulting to using this tool as part of my day-to-day and I have fundamentally changed how I'm working. If we dive into an example, we've got the workflow integration dimension where people might be aware that a tool exists but they barely use it, right the way through to occasional use in some workflows, up to, this is AI native. It's a seamless integration. One of the things I'd say while you're sitting here is just consider, using these levels, where your team sits today. Is there anything that you think you could meaningfully do to help move them from one level to the next?

The next slide's going to have all of the dimensions and all the levels on it. I'm not going to go through them in great detail. I'll try and also make this available offline so people can download it if needs be. One of the things that made this useful was that we packaged it as a retrospective style workshop. We offered this way of running it as an hour-long session where teams could come in and they could have the Figma board or whatever it was they wanted to use set up, and they could go through the different dimensions and figure out where they are. The rough plan for how these sessions went is on the screen. You can see that reading dimensions was there. The idea was that people would read ahead of time, but we all know that that never really happens. There is always a period of quiet reading at the start.

Then it's about getting through the dot voting quickly so people can put their ideas up and they can put where they feel they are. Then get to the real value point of it, which is the discussion afterwards, because it's the discussion within the team that generated a lot of the insights. It might be that everyone's clustered at one end. Maybe they've got a SIT in trust and quality and everyone's there. Or more interestingly perhaps, you've got some people that are at one end of the spectrum and others that are right at the other end with a big gap in the middle, and there's a really good discussion point there too. This is where the value was. It was the dialogue and not the score.
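A minimal sketch of how a facilitator might spot the two patterns described above (everyone clustered together, or two camps with a gap in the middle) from one dimension's dot votes. It assumes votes are recorded as integers 0 to 4, matching the five zero-indexed levels from SIT to LEAP. The function name, the pattern labels, and the "empty middle level" rule are illustrative assumptions, not part of the framework as presented. In keeping with "the dialogue and not the score", the output is meant to suggest a discussion prompt, not a target.

```python
from collections import Counter
from statistics import median

LEVELS = range(5)  # five levels per dimension, indexed 0..4 (0 = SIT, 4 = LEAP)


def summarize_votes(votes):
    """Summarize one dimension's dot votes (a list of ints in 0..4)."""
    if not votes:
        raise ValueError("no votes cast")
    if any(v not in LEVELS for v in votes):
        raise ValueError("votes must be levels 0..4")

    counts = Counter(votes)
    low, high = min(votes), max(votes)
    spread = high - low

    # Hypothetical rule of thumb: a wide range with at least one empty
    # level between the extremes suggests two camps worth discussing.
    has_gap = spread >= 2 and any(counts[lvl] == 0 for lvl in range(low + 1, high))

    if spread <= 1:
        pattern = "clustered"
    elif has_gap:
        pattern = "split"
    else:
        pattern = "spread"
    return {"median": median(votes), "spread": spread, "pattern": pattern}
```

For example, `summarize_votes([0, 0, 4, 4, 3])` reports a `"split"` pattern, which is the case where the team has the most to talk about.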

We really encouraged people to not focus on an absolute number that they were targeting. We were trying to desperately avoid Goodhart's Law with that. In our first wave, we had four teams that took it. That was around about 30 to 40 engineers involved. There were quite a lot of interesting discussions, and one of the things that really resonated with me from a, "I didn't realize that", was that some of our strongest engineers were also our biggest AI skeptics. Winning them over was going to be quite a challenge. Again, we found some common themes and that was where it became useful to have this overview of everything and get people publishing their ideas from their workshop so that we could spot these patterns and try and put some ideas in place to fix them.

Some of the bright spots. There was a high enthusiasm for the potential of AI. People knew that this thing was coming and it was going to be important. They also were finding that they had pockets where it was being used to a really high level and it was showing real value, so they had some clear use cases. I think partially because we'd focused it around our EngEx program, testing and refactoring featured highly on this list of things.

Some of the issues that we found, though, were mainly focused around things like trust and quality. People were seeing that there were issues with the slop coming out of the tools. There were hallucinations. They weren't finding that they were getting suitable solutions. Things weren't actually working when they were working with the code generation. Partially that was down to prompt skills, which were very variable. Also, it was about a lack of awareness of how to use the tools and what the limitations were. The high potential is often met with the pit of despair after they realize that they can't just zero-shot something and have it work.

From a leadership perspective, it was difficult to articulate the ROI, like I said. This is where it became important that we just had trust from our management to say, this is going to be important. We know that this is going to be something that the teams need to adopt. It's a paradigm shift in the way we work. We'll accept a bit of a slowdown here. Then, for me, one of the more eye-opening things was that it was actually very narrowly focused. People were only considering extremely narrow code-related use cases.

My personal belief is that productivity across the business is not limited to coding. There's a whole bunch of stuff that we need to consider. I put together a workshop and some slides to help people understand the AI native mindset. I included this view of what a fairly traditional SDLC might look like, and used it to highlight to them where we had tools available to support you at each stage of the journey. There's a couple of things that are across the board. Like the personal productivity tools, they've been getting better very quickly. A lot of our internal teams have integrated AI into the productivity tools, so things like your calendar assistant and inbox flow, helping you to understand what's happened in your messages.

My team's based in London, but we have teams based in the Bay Area. We have teams based in Seattle and Bellevue and all over the U.S. There's a lot of conversations that happen out of hours as well. Being able to catch up on what they're saying quickly and understand the narrative that's going on around the teams without having to trawl through millions of chat messages is actually a massive game changer for somebody that's interested in keeping on top of what's going on. A lot of these tools, they're very specific to Meta. There's also a few in here that you can see that are more generic.

One of the benefits that we have is that we have a good blend of access to third-party tools and internal things. Some of them have been around for a bit longer than you'd expect, including DRS, which is actually really super valuable. It's a tool that runs against any code change in our codebase and tells you the likelihood of this change leading to an incident. It's a way that we can use that to then gate on further AI code reviews down the line. If we know something's really high risk, we can make sure that it has thorough human review rather than allowing it through with an automated review.
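A minimal sketch of the gating idea described above: a per-change risk estimate decides whether a change must get thorough human review or can go through automated review. The `CodeChange` type, the 0.0 to 1.0 score scale, the route names, and the threshold value are all hypothetical; the talk does not describe how DRS exposes its score or what cutoff is used.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the talk does not say what threshold, if any, is used.
HIGH_RISK_THRESHOLD = 0.7


@dataclass(frozen=True)
class CodeChange:
    change_id: str
    # Assumed scale: estimated likelihood (0.0-1.0) that the change
    # leads to an incident.
    risk_score: float


def review_route(change: CodeChange, threshold: float = HIGH_RISK_THRESHOLD) -> str:
    """Route high-risk changes to human review, the rest to automated review."""
    if not 0.0 <= change.risk_score <= 1.0:
        raise ValueError("risk_score must be between 0.0 and 1.0")
    if change.risk_score >= threshold:
        return "human_review"
    return "automated_review"
```

The point of the design is that the risk check runs before any AI review, so the automated path is only ever available to changes already judged low risk.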

Then the last thing I just want to touch on here before we move on, right down at the bottom, there's a thing that says build your own tools using Confucius. Confucius is an orchestration platform that was built by the team that wanted to build RACER. This is one of the most powerful tools I think we've got access to because it allows us to build specific agents and agentic workflows that are domain specific and team specific. I've been using this an awful lot recently, and it's a very powerful way for us to leverage these tools, but also narrow the focus on what we're trying to do.

The summary of this is that self-assessment drives change within the teams. It's because of the ownership that the teams feel, because they're driving this themselves in a retrospective format. It's not like consultants coming in and telling you what to do or leadership telling you what to do. The team is seeing themselves clearly and they're choosing what to improve. That was great. Some teams were focusing on prompting. Some teams were focusing on how to improve trust and quality. They're all different priorities, but they're all valid. This was one of the key things that we found, and we've seen this assessment now being adopted more broadly across Meta and Reality Labs.

Act 3: Growth and Momentum - Community Explosion

At this point we were thinking, ok, we've got this framework, we've got this community we've been growing, it's going pretty well. How do we go on from here? What should we do next? Momentum was building. At this point, I think we're around about September, we've gone from a few people in our community to now over a hundred. We kept getting questions from people saying, how do I do this in my area, in my teams? I said, "There's nothing special here. There's just no magic sauce. Just copy what we're doing". Increasingly I felt actually maybe the value here would be if we built an overall community where everyone's wins are shared rather than having pockets and things being siloed. Could we adopt this and move it into a bigger forum without maybe sacrificing too much of the small-scale familiarity that made it successful in the first place?

In September, we decided that we would open this up to all teams. We took a vote among ourselves to say, should this be a whole Horizon-wide thing, because this is a pretty big org across Reality Labs? We said yes. One of the things that I wanted to consider when we were doing this was that we felt like we were at the point in "Crossing the Chasm", which, if you're familiar with it, is Geoffrey Moore's book around technology adoption, where we had saturated our early adopters and our innovators, and we now needed to move over to the early majority. I had a few different frameworks and tools that I used to help think through this when we were deciding what to do.

One of which is this great model from Emily Webber from her book, "Building Successful Communities of Practice", where communities ultimately need to get to the self-sustaining level to be super valuable. We were on the climb to maturing. What was necessary for us to help climb that peak? Equally, if you look at the chasm, it's quite a big jump from early adopters to early majority. Again, what was it that was going to make it so that we were successful in our activity in making this bigger? Meta's a fairly competitive environment, I'm sure you may be aware. There's a lot of focus on metrics and driving numbers.

If you can measure it, it generally is a thing that people are willing to get behind. My belief is that quite a few of the things that become important to community are fairly intangible, and they're hard to measure. They're some of the things that we really needed to invest in, especially as we got bigger, because it's that sense of familiarity and scale and safety that would allow people to get that sense of belonging and really help them to adopt and become part of the community and make it work. We had things like office hours, lean coffees, one-to-ones, and social activities. Again, hard to measure, but they're really important. They're the things that kept gluing people back together and bringing them back to the community to share their wins and ask the questions and show their vulnerability.

As I'm sure you're probably aware from what I've been saying, we opened the doors up to everybody and immediately we saw a big growth in the community. We had over 100 members join straight away. Actually, it was really interesting to see how different use cases came in, and they challenged our understanding again. We were continually learning. Some of the things we tried to do, like I said, to maintain that culture at scale were to have these lean coffee sessions. There was still a local element to it. Clearly when you've got a big community of people and they're all distributed across the globe, it's difficult to maintain a sense of being part of a small thing. Some of these activities helped to maintain that. Celebrating the public wins was a really key part of this because it meant people felt validated in sharing their wins and sharing their knowledge, but also exactly the same for the people who were sharing things that they were finding hard.

On top of this, we realized that the ad hoc training sessions that we were holding were ok, but they weren't necessarily providing a solid platform. There was quite a lot of effort involved in bringing everybody up to date with everything that had happened. Some of them had happened maybe several months before and tools had changed quite a lot. What we put together with the help of one of our engineering directors was a two-part structured learning program, which was two full days of training focused on building the mindset and high-level concepts on day one, and then going much deeper into specific tools and workflows on day two. This was rolled out across the whole of Horizon Experiences and Horizon OS, and with a lot of the other teams across Reality Labs as well. This has provided a foundational level of knowledge so that everyone can be at the same level. It's been a really important part in driving the growth of understanding and growth of the community too.

Around about this time as well, this is when people had taken a maturity assessment and they could go back and revisit that. We asked people if they were prepared to do it, could we go back and track it, see how you're progressing? Are you moving up the scale? Have you found anything new? That's where we found some interesting patterns and things that had changed over time, but also things that had stayed the same. We knew that there were bigger investments needed.

To recap the journey, we started really small. There were probably 10 people in our community by the end of May. By the end of November, we'd hit 400-plus. We had a maturity model and a self-assessment framework that we could roll out to help teams identify their own gaps. We had a full two-day training program that people could take so new joiners could get up to speed quickly and people across the company could get involved and understand the different tools that were available to them. Here's what that looks like in numbers, because like I say, we love measuring things. This is the community expansion. You can see the key points where things jump up based on providing the training or providing access to the community or sharing it more widely. That has also percolated through to our tooling adoption. At the point that we kicked it all off, we were hovering between 30% and 40% weekly active users, and we finished the year pretty strongly, heading over 80% now, with the exception, as you can see, of the North American contingent for Thanksgiving week.

Act 4: Success Stories - Real Wins, Real Impact

Some success stories. One of our engineers decided that RACER was going to be their test bed, and they wanted to see if they could improve test coverage, and not just in a small way. They had a significant amount of code that was uncovered that was challenging to cover because it was spread across many different files. It's not like, here's a list of all the files with zero coverage, go and fix them. They thought, what if I can teach RACER how to go and get the data to assess which files are the most important ones for us to add coverage to, and then generate the tasks to go and write the coverage for those files. It's a two-stage solution. One stage is fairly well-supervised. The other one is a totally unsupervised thing. For the solution, like I say, they went and found dashboards and data that they could use, taught the agent how to go and inspect that, and then also taught the agent how to create tasks for itself. It would then parallelize the fixes across multiple agent runs at the same time, resulting in huge numbers of diffs being generated. We thought this would probably take about 19 or 20 hours if you were an individual engineer trying to do the same work.
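A minimal sketch of the two-stage shape described above: first rank files by where added coverage matters most, then emit one independent task per file so separate agent runs can work in parallel. This is not how RACER is implemented. The `FileStats` fields, the `importance` weight (imagined here as something derived from dashboards), the scoring formula, and the task wording are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    coverage: float    # fraction of lines covered, 0.0-1.0
    importance: float  # hypothetical weight, e.g. derived from dashboard data


def rank_files(stats, top_n=3):
    """Stage 1: order files by how much important code is still uncovered."""
    ranked = sorted(
        stats,
        key=lambda s: (1.0 - s.coverage) * s.importance,
        reverse=True,
    )
    return ranked[:top_n]


def make_tasks(files):
    """Stage 2: one self-contained task per file, suitable for parallel runs."""
    return [
        f"Add unit tests for {f.path} (current coverage {f.coverage:.0%})"
        for f in files
    ]
```

A usage example: `make_tasks(rank_files(stats, top_n=2))` returns two task descriptions, highest priority first. Keeping each task scoped to one file is what lets the work fan out and keeps each resulting diff small enough to review.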

In the end, we landed it with about three hours of manual effort, mostly focused on actually reviewing the changes. We hit over 90% code coverage with this, and nearly 60 diffs were merged on the back of it. For three hours effort, that's a pretty big payback. From a more supervised approach, we also have quite a lot of legacy code knocking around. There are significant portions where we wanted to be able to migrate safely from legacy patterns to more modern approaches. This is where we use this approach of human is the architect, AI is the typewriter.

For this one, I think this was something to do with MonoBehaviours across Unity codebases. It's an area where it's important to do, it's quite mechanical, but it needs high precision. We couldn't afford for this to go wrong, especially when you consider VR experiences, because they're very performance sensitive. If you drop the frame rate in a VR experience, you can make people very sick very quickly. Ensuring that this works properly is a key thing. Again, the results here, the engineer in question said, I think this is going to take me a day. In the end, they landed it with the assistant's help in about four hours. This is more of a pair programming model, so it's a supervised approach. The code quality is maintained because you're constantly inspecting it and making sure the agent is doing the right thing. Then you have a second human review afterwards because the diff would be published by the original author.

As you can probably imagine, building something like Horizon Worlds, which is a 3D immersive user-generated content platform can be quite complicated and have a significant amount of state in the background. Getting domain-specific knowledge into tools that use things like Claude or Gemini relies heavily on context. One of the use cases that I haven't personally explored was MCP. We had an engineer who said, I think we could really benefit from having an MCP that understands how to go and query all the state behind our worlds and then can tell us how it's set up, what assets are in there, what scripts are being used, how people are using our APIs, what leaderboards are set up, what PVARs. They could go in and query this and they could bring that knowledge into the tooling that they were using to help write code. This is a huge win from a couple of different angles.

One is that significant amounts of our engineering work are done on on-demand instances, which aren't Windows machines. Horizon Worlds is a Windows platform. You need to have your Windows desktop, your Windows laptop available at all times. They're pretty high-powered machines, as you can imagine, because it's a 3D environment. If you're not in the office, if you don't have your desktop with you, you can't spin up the environment easily. You can't go and inspect this world state. Having an MCP that could go and do that for you and have it right there in your editor was super valuable. It also gives people the chance to go and learn the codebase and learn how different things are being used. As you're developing new features, you can go and say, how will this impact this world? The MCP can go and query the world state. It will tell you whether it's actually using that feature or not.

Then this one, again, it's an unsupervised approach, which was focused more on code quality. We have a fairly extensive codemod system at Meta that runs in the background. It's largely deterministic. It's very rule-based, as you can imagine. It complements our linting in many ways. It doesn't always have great fixes for some of the code quality patterns that we see people introducing, either deliberately because they need to get things shipped quickly, or more likely, they're patterns that got noticed over time and then got rules put against them that need fixing.

By combining Devmate and running it in an unsupervised fashion with our codemod platform, we can have these tools running constantly, checking the codebase, making sure that new files that are checked in, if they have code quality issues, can immediately have diffs raised that will then allow them to be fixed. This is the sort of thing that can run very constrained based on code projects that we set up, or it can run across the whole codebase, which, as you can imagine at Meta scale, is many millions of lines of code. It's a great way of helping to keep the standards high without you necessarily having to do too much of the manual work to go find all these call sites and fix them yourself. It's one of those key things of reducing the toil of your day-to-day.

Then, lastly, this is something I've been using extensively, multi-modal development, I've called it. Actually, I find this applies across the board. Going back to what I was saying about the SDLC being broad and having many different facets that you can impact with AI, I found that removing my bottleneck, which is the speed of my typing and the accuracy of my typing, can help me get things done faster. I use Wispr Flow extensively now and talk to my computer a lot, which I'm sure if you're in an open plan office may seem completely odd. I'm remote, so it's not too bad for me. It's a great way of getting ideas out of your head quickly.

Then you can also use other tools to refine the thinking that you've put down on paper. It reduces the barrier to writing things. It stops you getting into that write, edit, refine cycle, so you miss out on the opportunity, you just get the ideas out and then refine it later. It also means that if you're working with an agent, you can talk to it really quickly. I think it's something like 140-plus words per minute of spoken versus less than 90 if you're typing. It can be quite a big boost if you do a lot of document writing and TDDs.
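To put the speaking-versus-typing figures in perspective, here is a minimal back-of-the-envelope sketch. It takes the two rates quoted above at face value (140 words per minute spoken, 90 typed as an upper bound); the 2,000-word document length is a hypothetical chosen only for illustration.

```python
# Rates quoted in the talk; treated here as rough point estimates.
SPOKEN_WPM = 140  # "140-plus words per minute" when speaking
TYPED_WPM = 90    # "less than 90" when typing, so this is an upper bound


def minutes_to_draft(word_count: int, words_per_minute: float) -> float:
    """Return the minutes needed to produce word_count words at the given rate."""
    return word_count / words_per_minute


def speedup(spoken_wpm: float = SPOKEN_WPM, typed_wpm: float = TYPED_WPM) -> float:
    """Return how many times faster dictating is than typing."""
    return spoken_wpm / typed_wpm


if __name__ == "__main__":
    words = 2000  # hypothetical length of a design document
    print(f"typed:   {minutes_to_draft(words, TYPED_WPM):.1f} min")
    print(f"spoken:  {minutes_to_draft(words, SPOKEN_WPM):.1f} min")
    print(f"speedup: {speedup():.2f}x")
```

Under these assumptions, dictation is roughly one and a half times faster for getting a first draft down, before any time spent refining it.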

What are some of the patterns behind these wins? What made them successful? One of them was having a clear problem. It wasn't just a mandate to go and use AI more broadly. We had to find the problem to tool fit. We made sure that we had a clear idea of what the goal was we were trying to do. The human insight and the human oversight was really critical to making sure the quality stayed high. Where we could scale things out quickly, we did. Some of these tools, the results shared have led to other people having ideas of how they can use them.

On the back of the test coverage example, one of the engineers in our events team said, I wonder if I can make all of my tests for my team that aren't currently eligible to run on changes because of the time they take to run, I wonder if I can improve the performance of them. They went away, followed the same pattern. Out of 6,000 tests that they had, they managed to optimize nearly 2,000 of them without actually having to write any of the code themselves because they could set up the tool to go find the test, do the data insights, write the prompts, create the tasks, and then this would run in the background. I think they validated this by saying, of the tests that they then made diff eligible, they found 200 PRs that would have previously gone through without issue. They caught issues because these tests were actually now valid and able to run on those changes. That's a massive win in terms of shifting left and stopping problems getting to production. Again, we keep iterating, and we keep learning and sharing back and making sure that people are aware of what's going on across the board.

Act 5: Challenges and Lessons - What We Learned the Hard Way

Some of the challenges that we faced. It's not all plain sailing and fun and sunshine. There's what didn't work, and what the jury is still out on. What worked, obviously, starting small was great and having this self-assessment framework. The things that we're still trying to figure out include things like code review processes at scale, long-term quality, maintenance above and beyond, and the code modernizer approach.

Then really trying to understand how we measure true productivity gains. I've said a couple of times about our weekly active users, and this was a measure that we were following for quite a while. I'm very keen that we move away from that because I don't think that adoption is the right measure that we should have in place to say, actually, are we seeing value from this? Some of the agents are slightly less reliable than others. I've seen some interesting cases where the code generated has been quite eye-opening. I had one particular case where someone shared with me, they had a test that they were writing, and the agent got so stuck that it resorted to loading the code from the file system and asserting that certain statements were in the code that it had loaded. It's not foolproof. There are things that we can still improve there. Stuff that we have discounted, but only for now. We're not in a position where full autonomy is there yet, and we need to think about these vanity metrics in a bit more detail. There isn't a one-size-fits-all approach that works here.

The education that we gave was really good, but top-down mandates that people have had in place without that aren't going to work. Trust and quality has been a consistent theme throughout this. All the workshops came back with this feedback. When I talk to people, they have the same feedback. This is the same thing that we really have to focus on, and it leads to the second point that I'll get to. I think this is where we can make the biggest impact in terms of how we think about our workflows and how we make sure these tools don't have a negative sentiment with everybody. Lots of fear around, is my job just going to be reviewing sloppy diffs now? How do I know this is correct? Changes are larger, so that makes the review difficult. There's also a question about who's accountable.

If I was the author of some code and you reviewed it, if there was a problem, then it will fall back to me as the author. If it's an agent that authors it, where does that accountability lie? Does it put extra pressure on the code reviewer, or is it just something that becomes accepted? That's an area we're still looking to explore. People obviously have a fear of introducing bugs at scale, because the changes are more unfamiliar to you as well. Our response to this is people have spun up workstreams that look at how we can reduce slop.

Specifically, in our case, because we were focused so heavily on engineering excellence, how can we reduce the test slop that's being generated by some of these tools? We now have a tool that runs on all changes that have more than a certain percentage of AI-generated test code. It provides a really solid code review based on rules that have been set up by senior ICs across the organization. There are two wins to this. One is obviously the automation and this thing happening on code that is checked in. Also, there are prompt files that this tool uses that people can go and read to understand what a really solid test reviewer will do and how they can learn from them to say, what makes this a good review? What are the things I should be looking for when I'm doing a test review? This has been built into some of our other tools as well. Metamate is like a ChatGPT style tool that we have available to us that has access to all of our internal knowledge. You can ask this to run the command on diffs as well, so it can also run separately from your coding workflow.
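The gating step described here, reviewing only changes where AI-generated test code exceeds some share, can be sketched roughly as follows. This is an illustrative assumption, not the internal tool: the `FileChange` type, the `ai_generated_lines` attribution field, the test-file naming convention, and the 0.5 threshold are all hypothetical, since the talk does not give the actual percentage or how authorship is tracked.

```python
from dataclasses import dataclass


@dataclass
class FileChange:
    path: str
    added_lines: int
    ai_generated_lines: int  # hypothetical signal: added lines attributed to an agent


def is_test_file(path: str) -> bool:
    """Hypothetical naming convention for test files."""
    name = path.rsplit("/", 1)[-1]
    return name.startswith("test_") or name.endswith("_test.py")


def ai_test_fraction(changes: list[FileChange]) -> float:
    """Return the share of added test lines that were AI-generated."""
    test_changes = [c for c in changes if is_test_file(c.path)]
    total = sum(c.added_lines for c in test_changes)
    if total == 0:
        return 0.0  # no test code added, so nothing to review
    return sum(c.ai_generated_lines for c in test_changes) / total


def needs_test_review(changes: list[FileChange], threshold: float = 0.5) -> bool:
    """Decide whether the automated test review should run on this change."""
    return ai_test_fraction(changes) > threshold


if __name__ == "__main__":
    diff = [
        FileChange("src/app.py", added_lines=100, ai_generated_lines=100),
        FileChange("tests/test_app.py", added_lines=40, ai_generated_lines=30),
    ]
    print(needs_test_review(diff))
```

Only test files count towards the fraction in this sketch, so a change with heavily AI-written production code but hand-written tests would not trigger the test review.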

I hinted before that I think some of the biggest problems are in familiarity and code review. I think this is down to some of the changes we're seeing around people checking in really large diffs. The tendency for changes to be much bigger, have lots more lines of code, and generally be more boilerplate-y and full of stuff that you wouldn't necessarily write yourself means that your reviews are going to be harder. If you see someone checking in a 4,000-line diff, your review is going to be very different to something that's checking in 100 or 200 lines of code. Equally, the rate of change has increased too. We're seeing more diffs of bigger size. That means that your codebase that you knew on Monday is very different to the one that you look at on Friday. That lack of familiarity also impacts your ability to code review. If we're not careful about this, one of the big problems will be that we'll speed up one aspect of the workflow and slow down the other one so dramatically that we'll actually be invariant. It's not going to be any speedup at all. I mentioned about measuring the wrong thing.

Increasingly, what we're trying to do is shift our focus away from weekly active users and being superficial in our adoption metrics, to look like it's working well for us, and thinking more about, are we actually saving time on our tasks? Are we seeing quality improvements? Are we seeing features shipped consistently? One of the examples I've been using with directors across our teams is not necessarily when we think about 5x productivity.

If you've got 10 experiments that run in your team, and one of them generates a positive result, don't think I'm going to start running 50 experiments. Maybe we could use the insights that you generate with these tools to make a hypothesis stronger so 5 of those 10 are more active and more likely to result in something positive. You're still having the same quantity, but the quality is much higher. Measuring that would be an important thing. Equally, we want to be able to measure the satisfaction and confidence of our engineering teams to make sure that they're getting the value from these tools as well, not just being forced to adopt them.
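The experiment example can be made concrete with a small calculation. This is a sketch under the simplifying assumption that each experiment succeeds independently with a fixed hit rate; the function names are illustrative only.

```python
def expected_wins(experiments: int, hit_rate: float) -> float:
    """Expected number of positive results for a batch of experiments."""
    return experiments * hit_rate


def experiments_per_win(experiments: int, hit_rate: float) -> float:
    """How many experiments are run, on average, for each positive result."""
    return experiments / expected_wins(experiments, hit_rate)


if __name__ == "__main__":
    # Baseline from the talk: 10 experiments, 1 positive result.
    print("baseline wins:       ", expected_wins(10, 0.1))
    # "5x productivity" read as volume: 50 experiments at the same hit rate.
    print("5x volume wins:      ", expected_wins(50, 0.1))
    # Stronger hypotheses instead: still 10 experiments, but 5 succeed.
    print("better-quality wins: ", expected_wins(10, 0.5))
    print("experiments per win: ",
          experiments_per_win(50, 0.1), "vs", experiments_per_win(10, 0.5))
```

Both routes give about five expected wins, but the quality route gets there with a fifth of the experiments, which is the distinction between measuring activity and measuring outcomes.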

Some of these knowledge gaps, we're still plugging them. Clearly, the community is something I'm passionate about, and I think this is the key thing that we're going to do to solve this. Also, we have the formal program now around education, documentation, and mentorship. Finding champions within teams to help drive adoption and having those champions being close to people that they can go and ask questions from is super valuable. We're still seeing the people having issues with the prompting skills and understanding when AI is a good fit or it isn't, and when context needs to be added and when it shouldn't, and how much they should write into a spec or not. Making sure you have an expert nearby that can really give you that peer-to-peer mentorship is going to be increasingly valuable still. Again, the tools aren't perfect.

These are things that are evolving rapidly. The tools that we were using in May are very different in output and quality to what we're using today, and that's not going to change, I don't think. That is only going to increase in speed. We need to make sure that we have an understanding of what fits where, that we're constantly appraising new tools, and that they're being looked at from our perspective and against our workflows. This is the sort of thing that we can help with by engaging the early adopters in challenging the status quo: what are the things that we should be bringing in, and what should be discarded for now?

Act 6: Your Playbook - Actionable Takeaways

I said I'd leave you with some ideas to take away and hopefully these ideas will be useful to you on a day-to-day basis. From a leadership perspective, five of the strategies that work, making it safe. Let people have the appreciation that they can fail. They can experiment and try and it's not going to be judged in a bad way. They need to have that space to be able to be vulnerable. Celebrate the learnings, not just the wins. That's really critical because learning is about trying and failing as much as it is about winning something. We need to be top-down and bottom-up. This initiative was started by me and a few other engineers. It wasn't a leadership mandate, but we saw that this was the wind of change that was coming across the industry and it was important for us to start doing something here. We could start that and build it and get the momentum going, but then when we needed it, the leadership mandate came in to support us and give people the permission to say, this is something you should be trying and making sure you're working with. The quality bar obviously has to stay very high. I think Wes shared this, it amplifies your practices.

If you're doing things badly in the first place, you're only going to be doing things much worse. You can go wronger faster, I think is the phrase I've used a bit. Making sure you've got enhancements that are generating real value and quality is really important. From our perspective, because of our focus on engineering excellence, that's been around testing and testing practices, having an anti-slop mindset. Then, investing in education, clearly, we're all going to be constantly learning. Formal and informal programs are key here. Making sure you've got the mentorship and peer support available. If you can, try and find a way to show the ROI. It isn't always easy, but it is definitely better if you can find something that shows the value rather than just the adoption.

If you can link it back to existing goals like we did with our engineering excellence program, that way we can say, we had a really ambitious goal about coverage for all of our teams at the start of the half. It looked like it was an 80-20 goal, but by adopting AI, we've made it a 50-50. That's maybe a small win, but it's important and it's a nice way of us being able to quantify what we're doing because it was the thing that we originally set out to do, with or without AI.

Measuring and integrating and scaling this, some patterns to take away. On measurement, focus on the outcomes: time saved, quality, satisfaction, not activity. On the integration, focus on the workflows. It's great for bounded tasks, but you still have to have the human validation in there. It's not quite ready for full autonomy and complex work. Then, thirdly, the platform integration is quite key here. Quality of the integration matters more than the number of tools. If you can leverage MCP to bring in context specific to your own domains, then you can supercharge the tools available to you too. IDE integration is great, but I found actually some of our better platforms are the ones where we can build our own tools or we can use them in an unsupervised fashion and in a familiar way. My personal favorite is the voice typing, which you may or may not have got the impression about. They work pretty much no matter what your scale.

If you want to follow our model, find a couple of motivated people, go and find a team that are willing to do this with you. Run the workshop, identify the quick wins, and then go from there. Let the team take ownership of the process and find the things that they want to work on. Iterate and keep improving and share that back to the broader community. You don't really need permission to do this. You just need people. Go and find those champions and bring them with you. For us, our journey continues. We know that in some cases, AI native engineering is achievable today. We're seeing people doing this more and more as a default. Small starts lead to big changes. Structure accelerates, it doesn't hinder. We need to have that at times. The community is the thing that amplifies everything for me. If we get this right, quality and speed don't have to be tradeoffs either.

Some of the things that we're going to be trying to look at next year are widening the pool of people that are taking these assessments so that we can get a bigger view of all the different workflows because there's a very different set of engineering practices and processes involved, whether you're working on embedded systems or devices or web systems or mobile compared to what we're doing in VR and our area. It'd be interesting to see how those patterns evolve and the understanding changes as we get more feedback from people. Again, we keep working to address the challenges as tools evolve, because there will be new challenges that form at every corner as the tools become more mature.

I just want to leave you with one thought which is around the overall mindset and the vision that we set off with. That is that we're offloading the heavy lifting to AI to free up human talent to tackle the unique challenges and gain deeper insights and develop the groundbreaking ideas that actually move things forward. I'm a really passionate advocate for engineering excellence as you can probably tell, but having the best test suite isn't necessarily finding the next big thing in product. Being able to have that time to be an explorer or an innovator and making sure that you've got the backing of these tools to help reduce the toil in your day-to-day work is going to be absolutely critical to making the advances that we think we can make based on using this approach.

Questions and Answers

Participant 1: When you started seeing success, did you get pressure from upstairs to push it out faster and harder?

Ian Thomas: Did we get pressure to push out faster and harder? Yes, but I think it was a positive pressure because it was a recognition of the value that people were already seeing and creating, rather than a forced mandate to say, everyone else is doing it, so you should. Having the community, if we hadn't had that already in place, I think it would have been a different feeling for everybody. Because it was, go and look what's already happening here. This is great. We're also seeing it happening across Facebook. We're seeing it happening across Instagram. This thing isn't just us doing it. Each team's finding their own way of building up this way of working and finding success. Go for it. What's holding you back? Why aren't you going for it? I think that was the key thing. It was more like a supportive leadership mandate rather than pressure in a negative sense.

Participant 2: You mentioned we're shipping more code faster written by AI and you're concerned about offloading some slowness to later on in the process. How have you approached mitigating that when we're generating all this AI code and shipping things more quickly, but now we as human engineers are less familiar with the code?

Ian Thomas: I think this is still one of the areas that is ripe for getting something in place and exploring properly. One of the tools that I mentioned was Confucius and being able to build your own agent. For this specific purpose that you mentioned, in the last month or so, I've been building an agent that does a very detailed and specific code review as soon as a change is put up for review. I can't go into the details of what it's going to do, but it has a very specific reason to exist, and it can provide very early signal to a reviewer what might be needing greater scrutiny versus what you can maybe say, ok, this is acceptable to pass through right now.

The thing that I found value in there is saying, ok, you've still got a finite amount of time. How do you make sure that that is used on the most important changes for you to review rather than the ones that are going to be quite clear? We see that there are more general tools available. The Devmate platform that I mentioned has a lot of integrations now with our CI/CD pipeline so that people can see, when code's checked in, what the Devmate review might say as well. It's not tailored specifically enough to everyone's unique product. That's where I think the value can come in, that you can build your own tool to support your own preferences and your own quality bar and make sure that you're holding your own set of standards.

Participant 3: Could you dig into some details around the anti-slop initiative and maybe how you guys, either implementation-wise or best practices, are improving the agents? You gave the example of, I think the refactoring agent or the one that scans all code as it's going out and proactively identifies when things aren't up to standard. How are you guys defining the standards either at the codebase level or providing additional feedback to the agent afterwards?

Ian Thomas: There are a few different things at play here. That test slop agent originally came about through one of the other senior ICs saying, I'm not happy with this, I'm going to build a command line tool. They were then running that manually to run with, I think it was Sonnet at the time. They had a set of rules that they'd encoded into a prompt and they set that off on different test diffs that they'd seen that had been produced by some of the more unsupervised tooling that was available.

Then the teams that own the platform for code review picked up on this and said, can we maybe think about combining this, because we are also very concerned about this? They started to say, is there an opportunity to align thinking here? It came about at the same time that we started to introduce a concept for Devmate called rules, it's very similar to Claude skills or what have you. They exist at different layers of the codebase where the closer they are to the root, because we have a monorepo, the more generic and broadly applicable they are. You can start to combine reviews from the test slop agent that's very specific with rules that can then be updated based on findings of things that we're seeing in repeated patterns that then apply. When it runs, it can pick up all these different parts of the context that people have generated over time.
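The layering described here, with generic rules near the monorepo root and more specific ones deeper in the tree, resembles a root-to-leaf lookup. The following is a minimal sketch of that idea only; it is not Devmate's actual format or API. The `rules_by_dir` mapping, the directory names, and the rule texts are all hypothetical.

```python
from pathlib import PurePosixPath


def applicable_rules(file_path: str, rules_by_dir: dict[str, str]) -> list[str]:
    """Collect rules that apply to file_path, most generic (root) first.

    rules_by_dir maps a repo-relative directory ("." for the root) to the
    rule text defined at that level. This layout is an assumption.
    """
    directory = PurePosixPath(file_path).parent
    # parents runs from nearest to furthest ancestor; reverse so the root leads.
    chain = [directory, *directory.parents][::-1]
    return [rules_by_dir[str(d)] for d in chain if str(d) in rules_by_dir]


if __name__ == "__main__":
    rules = {
        ".": "General: keep diffs small and focused.",
        "apps/worlds": "Worlds: avoid blocking calls on the render thread.",
        "apps/worlds/events": "Events: tests must not depend on wall-clock time.",
        "apps/other": "Other: unrelated team conventions.",
    }
    for rule in applicable_rules("apps/worlds/events/test_schedule.py", rules):
        print(rule)
```

Ordering the result root-first means a reviewer agent that reads the rules in sequence sees the broad conventions before the team-specific ones, and rules from sibling directories are never picked up.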

Like I mentioned, one of the things that's really useful about this is that we got the opinion of some very experienced engineers in these prompts. Now we can hold them up as artifacts that other people can go and read and learn from, so that when they're actually writing the tasks that will generate this test coverage or will do some of this code modification, they can take that and bear it in mind when they're spec'ing out what they want it to do, so they can try and avoid some of the issues upfront. It's a combination of the human learning from what's been going on. Then these tools combining all the different capabilities they've developed over the months.

Participant 4: The 2024 State of DevOps report talks about some of the negatives to using AI in engineering, and one of them being large batch sizes. Engineers just releasing huge PRs, like you were talking about a couple thousand lines, which nobody wants to review that. It sucks to be in that place. How are you ameliorating those negatives? Are you introducing batch size limits? What are your thoughts on that?

Ian Thomas: We're not mandating anything specific like that. One of the things around Meta's engineering culture is that you're free to do whatever you want, and you're weighed at the end of the day by the outcomes that you generate. If you are checking in huge amounts of slop in large diffs and you're annoying everybody, that will come back in some feedback to you, either directly or through the performance management cycle. I think what we're seeing is that, culturally, people are finding ways to enforce the standards there. Again, it goes back to what I was saying about the agent that I was building. Sometimes you just have to accept that things will go through and they're maybe not that harmful. Sometimes they will go through and they are harmful, and they're the ones that you've really got to figure out how to cut out. It's trying to weigh up when it's necessary to dig deeper or when it's ok to say, it's not how I do it, it's not what I'm happy with, but it's ok because the risk is low.

 


 

Recorded at:

May 22, 2026
