Elevate Developer Experience with Generative AI Capabilities on AWS

Summary

Olalekan Elesin discusses how generative AI tools can improve productivity, streamline workflows, and foster a more efficient and effective development environment.

Bio

Olalekan Elesin is Engineering Director @HRS Group & AWS Machine Learning Hero with 10+ years experience developing and scaling data and AI-driven technology products.

About the conference

The InfoQ Dev Summit Munich software development conference focuses on the critical software challenges senior dev teams face today. Gain valuable real-world technical insights from 20+ senior software developers, connect with speakers and peers, and enjoy social events.

Transcript

Elesin: My name is Olalekan. I was the Director of Engineering and am now VP of Engineering. The title of this talk is Elevate Developer Experience with Generative AI Capabilities on AWS.

We'll go through an introduction to Amazon Bedrock, a code review assistant, agentic code generation, and code summarization, and I have some bonus material to share with you. Personally, I'm a fan of football. I'll show you an example where I was watching a game and, in 20 minutes or so, built a simple application.

Introduction to Amazon Bedrock

What is Amazon Bedrock? It's a fully managed service. As I said, I'm an AWS Machine Learning Hero, but this is not a marketing slide; I'll also give my own critique of this service and of the AWS services I have used to elevate developer experience at my workplace. Being fully managed, it comes with foundation models out of the box from leading companies such as AI21 Labs and Anthropic. Key features include Bedrock Studio and Knowledge Bases.

I'm not here to talk about Bedrock itself, but I want to give you an introduction to it and to how we can use it to elevate developer experience, which for me is very important. If you have access to an AWS console and look at Bedrock, this is what it looks like. You'll also see models from Meta, which is a very key contributor in the open-source community to the development of foundation models and large language models. This is where it gets interesting: everyday developer experience. For me, this is what I care about, not Bedrock.

Code Review Assistant with Amazon Bedrock

As an engineer, who loves code reviews? If you're an engineer and you love code reviews, you're from Mars; everybody else is from Earth. I still do code reviews, and every time I'm wondering, why do I have to do this? If you look at the coding flow, there are 19 hours of coding on average, and pickup time takes about 9 hours on average. Review takes five days, because someone says in the daily standup, I need you to review this. Don't worry, I'll pick it up. Who picks it up? Nobody. I do the same; nobody picks it up. Then the deployment time is on average 80 minutes. A lot of work gets stuck in the review phase.

As engineers, we are most excited when the value we create with code, the problems we solve with code, get into the hands of customers. For me, this is very important. How do we reduce this? One day, I was super pissed at code reviews, and I realized that the engineers on my teams were also struggling with them. I sat down and asked, how can I solve this? Then I came up with this architecture. Fortunately, it's quite simple: you have a third-party Git repository, think of GitHub or whichever one you use. You create a webhook with API Gateway, which goes to a Lambda function, which sends the information to Bedrock.

Once you create a pull request, it triggers the webhook, which does the review of the code and comments back on the pull request. This is something that can shorten that nine hours plus five days to minutes, if not seconds. I was super excited working on this, only to realize that people had already built it on GitHub and published it on the GitHub Marketplace.
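To make the flow concrete, here is a minimal sketch of what the Lambda function behind that webhook could look like, assuming a GitHub pull_request webhook proxied through API Gateway and a Claude model on Bedrock via the Converse API. The model ID, environment variables, and prompt are illustrative placeholders, not the exact implementation from the talk:

```python
import json
import os
import urllib.request

import boto3

bedrock = boto3.client("bedrock-runtime")

# Placeholder model ID; any Bedrock text model with a large context works.
MODEL_ID = os.environ.get("BEDROCK_MODEL_ID",
                          "anthropic.claude-3-5-sonnet-20240620-v1:0")
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]  # token with PR read/write access

def handler(event, context):
    # API Gateway proxies the GitHub "pull_request" webhook payload to us
    # (assumes the body is not base64-encoded in this configuration).
    payload = json.loads(event["body"])
    pr = payload["pull_request"]

    # Fetch the raw diff of the pull request via the GitHub API.
    diff_req = urllib.request.Request(
        pr["url"],
        headers={"Authorization": f"Bearer {GITHUB_TOKEN}",
                 "Accept": "application/vnd.github.v3.diff"},
    )
    diff = urllib.request.urlopen(diff_req).read().decode()

    # Ask the model for a review via the Converse API.
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{
            "role": "user",
            "content": [{"text": "Review this diff for bugs, style, and "
                                 "security issues. Be concise:\n\n" + diff}],
        }],
    )
    review = response["output"]["message"]["content"][0]["text"]

    # Post the review back as a PR comment (the issues API accepts PRs too).
    comment_req = urllib.request.Request(
        pr["issue_url"] + "/comments",
        data=json.dumps({"body": review}).encode(),
        headers={"Authorization": f"Bearer {GITHUB_TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(comment_req)
    return {"statusCode": 200, "body": "review posted"}
```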

Then again, this is something that can take you from nine hours plus five days down to minutes. It doesn't mean the reviews are perfect, but it gets you up to speed as quickly as possible. Now think of the webhook and how you can extend this use case within your organization. You can also swap out Amazon Bedrock: change it to a Mistral code assistant, change it to a Claude code assistant, change it to anything you want. It's simple, and it gets you up to speed as quickly as possible.

Agentic Code Generation with Amazon Q Developer

Who loves moving from Java 8 to Java 17, or Java 19, or 21? Who has ever embarked on that journey where we say, we're running on Java 8, we have CVE issues, now we need to go from 8 to 95? Who enjoys that? There was the Log4j issue, I think in late 2021, where lots of people had old Java versions that had to be migrated. This was really interesting for a lot of us. Who's excited to do code migrations? Here comes Amazon Q Developer. Amazon has said a lot about this; they've done 1000-plus migrations of their own. I can tell you from my own personal testing that this actually works.

What I did not like about Amazon Q was that it took us months to provision. The reason is that you have to connect it to AWS SSO, and most of us already have an SSO provider in our organizations. So instead of Amazon Q Developer, we used GitHub Copilot. I cannot share the exact stats, but I will share what I observed between the engineers on my teams who used Copilot and those who didn't.

Every morning on our instant messaging tool, I see at least four engineers sending pull requests saying, please review this. It turns out these engineers are using GitHub Copilot. When we ran a survey with them, they estimated between 15% and 30% efficiency gain from using Copilot. Then I thought of the colleagues in suits saying that in five years, engineers will go away. I don't think so. I think in five years, engineers will become better at the craft. We will be problem solvers.

One of the things that I noticed with the engineers that used Copilot or AI assistants in writing code was that they became better problem solvers than code implementers. It can be Amazon Q. It can be GitHub Copilot. It can be Cursor. What I'm trying to say is that, as engineers, you can start introducing this to your workflow, as long as it aligns with the security policy of your company.

Let me see if I can play this video. I'm actually a Manchester United fan. When I was preparing for this, there was a football game on a Saturday afternoon. I wanted to watch the match and, at the same time, prepare for this talk. I wrote a prompt saying, give me this particular project that performs this particular activity.

It generates the sequence of tasks as an LLM agent, and here I am for about five minutes just sitting and watching Manchester United. This is what we do as engineers. We tell the product manager and the engineering manager that, yes, I'm working remotely, just give me a bit of time. We're chilling, sipping some coffee, whatever you drink. It's October 1st, so you can get the idea.

Then you wait for the LLM to generate the code. It definitely generates code, but that doesn't mean it's accurate; you still have to look through it and guide it through the problem you're trying to solve. What do I do after? That's it. It generates my CloudFormation template and the code itself, and I spend maybe another 10, 15, 20 minutes editing it to make sure it aligns, and testing it. What would have taken me about three hours, I got done in 20 minutes while watching a football match. Again, we become better problem solvers as engineers. Engineers are not going to be replaced in five years; we're going to evolve and be better at the craft.

Code Explanation and Summarization

Code explanation and summarization. How many of us have people joining our teams regularly? Tell me if you actually enjoy explaining to them how the code works. Nobody does. Is there anybody who loves to do that? I don't enjoy it. When I joined HRS, I led the data platform team, and we built it from scratch together: the Kinesis data ingestion, the data lake on S3. About four years later, after I'd left the team, someone was asking me a question about it, and I thought, how do I even start explaining how this works? I got tired.

Then comes code summarization and explanation. What I'm showing you right now is an architectural blueprint that you can try. You have repositories on GitLab, Bitbucket, or GitHub; you can put them in an S3 bucket securely, making sure there is no public access. That's very important. Then index that in an Amazon Bedrock Knowledge Base backed by OpenSearch, either a provisioned OpenSearch instance or OpenSearch Serverless, and with a foundation model, new team members can ask it questions. Very easy.
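As an illustration, here is a minimal sketch of how a new team member (or a small internal tool) could query such a knowledge base once it has been created and synced from the S3 bucket; the knowledge base ID and model ARN are placeholders:

```python
import boto3

# Runtime client for querying an existing Bedrock Knowledge Base.
agent_runtime = boto3.client("bedrock-agent-runtime")

def ask_codebase(question: str) -> str:
    """Retrieve relevant code/doc chunks from the knowledge base and
    let a foundation model synthesize an answer."""
    response = agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": "KB123EXAMPLE",  # placeholder ID
                "modelArn": (  # placeholder model ARN
                    "arn:aws:bedrock:eu-central-1::foundation-model/"
                    "anthropic.claude-3-5-sonnet-20240620-v1:0"
                ),
            },
        },
    )
    return response["output"]["text"]

print(ask_codebase("How does the Kinesis ingestion write into the S3 data lake?"))
```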

Even within GitHub Copilot and Amazon Bedrock, you can also have that in the code repository itself. People can highlight code and ask it to explain. This way, you can easily generate documentation that stays as close to the code as possible. You can build interesting automation here.

For example, think about when you do a git push: you can attach a webhook to it to automatically update the documentation and publish it as HTML, where people can go and read it. Think of platform engineering teams publishing CDK templates, CloudFormation templates, or Terraform templates. This can really help onboard engineers into what you're building, and it's something you can get started with without writing much code.

Support Case Investigation

This is the bonus material. I work in a B2B environment. How many of us have had someone say that a customer complained about an issue in production, but our infrastructure monitoring never picked it up? What happens is that when a customer raises an issue, it sometimes takes a couple of days for us to find the underlying cause. One of the things we realized was that the information about the issue reported by the customer exists in different systems. With a complicated architecture and complicated logging systems, it can be difficult to find.

Then the customer service agent, the application support, and the product manager are all pissed and wondering, why can't you find it? Here is a simple architecture. Let's say you have a case management system. Someone logs in and creates a case saying, we have this issue affecting customer X and customer Y. You can trigger a webhook, similar to the Git integration earlier, which invokes an agent. What that agent does is go through your logging systems.

You can do that with the Python or Java SDK, for instance with boto3, saying: search CloudWatch with CloudWatch Logs Insights for this particular string. If you use Prometheus, you can query it via OpenTelemetry. If you have a database to which you have read-only access, you can also put that in place to do a read.
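Here is a minimal sketch of the log-search and synthesis part of such an agent, assuming boto3, CloudWatch Logs Insights, and a Claude model on Bedrock; the log group, time window, and model ID are placeholders:

```python
import time

import boto3

logs = boto3.client("logs")
bedrock = boto3.client("bedrock-runtime")

def search_logs(search_term: str, log_group: str, hours: int = 24) -> list:
    """Run a CloudWatch Logs Insights query for the reported issue."""
    now = int(time.time())
    query = logs.start_query(
        logGroupName=log_group,
        startTime=now - hours * 3600,
        endTime=now,
        queryString=(
            f"fields @timestamp, @message "
            f"| filter @message like /{search_term}/ "
            f"| sort @timestamp desc | limit 50"
        ),
    )
    # Insights queries are asynchronous: poll until they complete.
    while True:
        results = logs.get_query_results(queryId=query["queryId"])
        if results["status"] in ("Complete", "Failed", "Cancelled"):
            return results["results"]
        time.sleep(1)

def summarize_findings(case_text: str, log_lines: list) -> str:
    """Let the model synthesize the unstructured case plus log evidence."""
    prompt = (
        "A customer reported the following issue:\n" + case_text +
        "\n\nHere are matching log entries:\n" + str(log_lines) +
        "\n\nSummarize the likely root cause and next steps."
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```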

Then, once you find all that information, you use what LLMs are good at: synthesizing unstructured data. What inspired this was what I described earlier, but also something from late last year. To give you a bit of background, we have colleagues in our HR department who run surveys multiple times a year, asking colleagues, how do you feel about the onboarding process? How do you feel about this? They receive, give or take, 240 unstructured responses in an Excel file. Who can go through 240 responses without drinking five or six double espressos?

What I did was sit with the HR colleague. This is to say that beyond using LLMs to elevate developer experience, you can extend them to other parts of the organization. I sat with the HR colleague and said, let's look at all these files, put them in a large language model, and map the results to our onboarding process. What would have taken seven hours, we did in five minutes.

Again, developers are not going away. We're here to stay. Tell the suits that. What I'm saying here is: a support case triggers a webhook, API Gateway invokes a Lambda function, which queries all the places you keep your logs and sends the information back into the support case application. What used to take about three days, or nine hours, or however long, you can now do in five minutes.

Then nobody has to breathe down my neck asking, where's the answer to the question? This is very important. Like I said, you can think about multiple use cases and reshape this based on your organizational context and your current problems, but I can tell you that this will elevate your experience as an engineer, because now you focus on the actual problem solving. It's good to explain code to colleagues, but what we love is to create value with software.

Questions and Answers

Participant 1: Have you gathered any experience using GenAI for refactoring, like the upgrade you mentioned, that you have to roll out across 200 services, where the services are different enough that you cannot really script it? Have you ever had any use cases like that?

Elesin: Yes. One example was that we wanted to put a REST interface in front of an internal gRPC protocol. We put it into GenAI, and it messed up badly. What we did was think about the problem we were trying to solve and write it ourselves. We couldn't trust GenAI in that regard. In some cases it behaves really well; in this case, it was really terrible.

In fact, I could not believe the result myself, because I had to sit on a call with the engineer, saying, can you please show me what went through? I said, we have to do this on our own. It happens. It's not perfect, but what it does is give us a head start.

Participant 1: If you have to roll that out across a lot of repositories, you could probably make the change once, show it to the AI and tell it to do the same thing everywhere else.

Elesin: At least what I know is that in Amazon Q Developer, you can integrate your own code and train the underlying model with your own code, but due to our own security compliance, we haven't done that.

Participant 2: Can you give me an idea of what an AI code review looks like? You make a 2000-line code pull request, what does it say?

Elesin: One thing to note is that it doesn't have context of the entire application by itself. What we realized is that with the new models, which have 200,000-token context windows, which is about a book of maybe 500 to 600 pages, you can actually give it the entire repository, and then it's able to understand the context. What it does is say, this particular line behaves like this, this is the expected result, maybe adjust it in this way to simplify the runtime.

It comments that way, one item after the other: very human-readable and very understandable. Like I said earlier, I have a team using this integration, and every day it's about four to five pull requests whenever I open the instant messaging tool. On the other hand, we have teams that are not using it, so it was more like A/B testing across teams, and this is the result.

Participant 3: How much do you spend approximately on the tokens for [inaudible 00:21:13]? You might not need to measure it, because you're free on tokens, I assume. Was it more in the dollars or the tens of dollars? Because the repository can get arbitrarily large when it comes to the tokens you need to consume just to embed it.

Elesin: I didn't see it pop up in my AWS costs. I know that if I exceed the budgets, my FinOps colleagues will reach out to me. So far, no.

Participant 4: Could you share some rookie mistakes you made when starting to build on Amazon Bedrock? What would you recommend not to do if I'm starting out now?

Elesin: What not to do when starting out with Bedrock? Number one, check with your security and compliance teams. Number two, and this is my personal understanding: anything you consider highly proprietary company information or intellectual property, don't put it in there unless you have the compliance check from your security department. It's less about don'ts, though, and more about dos. What I would say to do is to try it out immediately once you have some validation. We're in business travel at HRS, and we're also heavy on sustainability, helping companies align their business travel strategy with building a sustainable planet.

One of the things we have is called Green Stay, a sustainability rating for the hotels that work with us. That information is very voluminous. One of the engineers on the team asked me, how can we simplify this? I said, you don't need to train an AI model; put all this information in a Word document, because it's public information.

We went to the Bedrock UI, put it in there, and started asking it questions. We now have a version of that chatbot in a development area where someone can log in and actually play with it. This engineer had no experience with machine learning or AI at all. So: check security. Get started immediately. Get to validation as quickly as possible. Dos, not so many don'ts.

Losio: You mentioned an example before: I have Java 8, I want to bring it to Java 17, or whatever it was. In that scenario, you usually have a project that's been running for many years, and people who are quite senior, or old, or whatever adjective you prefer. Think of that scenario. You want to convince the team to do the migration, because it's easy to say, yes, Amazon did it, they migrated 10,000 projects.

But if you go to the developer team that has been working with that code for the last 10 years, they're probably going to tell you, I'm not going to put my entire code base in and see 50 changes in every single class, all committed together tomorrow morning. How do you convince your team to follow you on your machine learning journey?

Elesin: How do you convince experienced engineers to use this tool to accelerate? It's a difficult one. In my area, the first part is that we've all come to the realization that this journey is almost impossible without some automation in place. The second is that I had to show it: we have a team that is using it, and colleagues see the efficiency gain with that team. That's one of the proof points. We have a team using GenAI in one way or another, and we're seeing accelerated value delivery with software.

The question then is, why not go this route? Right now, when I'm hiring engineers for the team, I actually expect them to tell me that they've used GenAI to solve a problem, and then we go into detail. To your point, it's change management, and it's difficult. For me, the communication is that, as an engineer, it's not about someone getting rid of you in five years. It's about you evolving to solve problems faster and get value into the hands of the customer quicker. It takes time, but once people see the benefit, they begin to understand. I also demonstrate how to use it myself. I think that's what I do.

Apart from developer experience, I also work with product managers. Who loves estimations? We have refinement, and now we have to estimate. You get into the meeting and the product owner or product manager says, I'm not sure, I've broken the work down into user stories, but let's start discussing. Who enjoys those kinds of meetings? I ran a session, just to explain that this also takes time, not only from an engineering perspective but also from a product ownership perspective. I had a session with our product owners and said, I know this is a problem, and I know you might not trust this software, but let me show you its benefits.

First, map out who the users are. Then take the problem you have, put it into the large language model, and say: generate user stories based on this problem, with acceptance criteria. Then you're better prepared to have discussions with the engineers in the refinement sessions. Now we have better refinement sessions because this enables better preparation.
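As a rough sketch of that prompt pattern, assuming the Bedrock Converse API; the user list, problem statement, and model ID are placeholders:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def draft_user_stories(problem: str, users: list[str]) -> str:
    """Generate draft user stories with acceptance criteria as a
    starting point for the refinement session."""
    prompt = (
        f"Users of the product: {', '.join(users)}.\n"
        f"Problem to solve: {problem}\n\n"
        "Generate user stories in the form 'As a <user>, I want <goal>, "
        "so that <benefit>', each with 3-5 acceptance criteria."
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```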

From that perspective, the development experience also took time. I had mentioned it to colleagues before, and it took them time to understand; now they understand it, and now it's accelerating. Change management is difficult, but it is a journey that we keep working on and improving over time.

Participant 5: You mentioned so many use cases for generative AI. One case you didn't mention is the generation of test cases. Is it that Bedrock can't handle it, or did you deliberately leave it out?

Elesin: I've actually done this myself. Because I'm in business travel, like I said, we had a use case with so much work for our QA engineer that nobody was taking care of it. Because I understood the context of the problem we wanted to solve, we simply took that and put it into the large language model, in this case I think it was Bedrock, one of the closed-source models on it, and it generated the test cases. Based on those test cases, we created tasks for other people so we could parallelize the work. It's possible, I've done it myself, and it does work.

Participant 5: If it does, then how can it generate test oracles for your program? How does it know the program is doing what it's supposed to do? It doesn't know. You feed in code, which could be wrong, but the test cases need to come from the requirements. For example, if the requirement is that A plus B equals C, and you have programmed A plus B equals D, how would Bedrock or any algorithm know what test it actually needs to write?

Elesin: Generative AI is generative: for it to generate something, you need to give it a prompt. In my case, I understood the context of what was expected. There were policies regarding the expected behavior of the system, which I had a clear understanding of. It came down to my clear understanding of the problem. I said, this is the expected outcome for the user, this is the expected outcome for the problem itself.

Based on this, generate test cases. Then we looked at the test cases; some of them made sense, and quite a number of them didn't make any sense. We ruled out the ones that didn't make sense. The ones that made sense, we simply took and said, let's create the work for the colleagues to do.
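A minimal sketch of that flow, assuming the Bedrock Converse API; the behavior description and model ID are placeholders, and filtering the output is exactly the human-in-the-loop step discussed next:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def generate_test_cases(expected_behavior: str) -> list[str]:
    """Draft test cases from a plain-language description of the
    expected system behavior; a human still filters the output."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder
        messages=[{
            "role": "user",
            "content": [{"text": (
                "Expected behavior of the system:\n" + expected_behavior +
                "\n\nGenerate a numbered list of concrete test cases, "
                "each with preconditions, steps, and the expected result."
            )}],
        }],
    )
    text = response["output"]["message"]["content"][0]["text"]
    # One candidate test case per line; a human reviews the list and
    # discards the ones that don't make sense.
    return [line for line in text.splitlines() if line.strip()]
```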

Participant 5: You need a human in the loop?

Elesin: Definitely, yes. As you saw, it starts with the engineers; there are always people involved. There's always a human in the loop. In this case, the human was me.

Participant 6: Maybe one question on the specific topic of embedding the repository. How did you handle the generally weird structure of repositories being deeply nested? Normally if you embed a document, you have one embedding for that one document. Did you build a hierarchy, or is that natively handled in Bedrock?

Elesin: I didn't care. I simply uploaded it into S3, but I excluded the irrelevant parts. What RAG, retrieval-augmented generation, does is convert all the text into vector representations and store them in OpenSearch. For me, it doesn't matter whether it's nested or not; it's simply a matter of making sure it goes into the vector database, in this case OpenSearch.
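For illustration, here is a minimal sketch of what that vectorization step looks like under the hood, assuming a Titan embedding model and the opensearch-py client; in practice a Bedrock Knowledge Base handles this for you, so the endpoint, index name, and mapping here are hypothetical:

```python
import json

import boto3
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime")
# Placeholder endpoint and credentials; a Knowledge Base manages this for you.
client = OpenSearch(hosts=[{"host": "my-domain.example.com", "port": 443}],
                    use_ssl=True)

def embed(text: str) -> list[float]:
    """Convert a text chunk into a vector with a Titan embedding model."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

def index_chunk(doc_id: str, text: str) -> None:
    """Store the chunk and its vector in an OpenSearch k-NN index."""
    client.index(index="code-chunks",  # assumes a k-NN mapping exists
                 id=doc_id,
                 body={"text": text, "vector": embed(text)})
```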

If you're planning something more complicated that might require hierarchical structuring, there are techniques like GraphRAG, which connects nodes and edges on top of retrieval-augmented generation, that you can also try. For me, I wanted to get it working as quickly as possible. That was the first phase. If it requires more fine-tuning based on the workload, then you can add the hierarchical structure later. At first, don't care about the hierarchical structure.

Participant 6: I was just wondering if it handles it internally.

Elesin: Yes, it did it quite well.

Participant 7: Did you measure that? Because [inaudible 00:33:26] with a similar setup, at least for Java code, not using Amazon but custom-building embeddings and searching through the Java code, it produced completely unsatisfactory results. If you think about Java code, there are many tokens that are not directly relevant to natural language, so there really should be techniques to overcome those limitations. Basically, it's interesting: what were the results? In the end, were these new team members satisfied?

Elesin: How do you measure the accuracy of responses from large language models? Because it's not a classical machine learning problem where you have an output you can predict against, it's difficult to say the accuracy is some number. This is another point where you need a human in the loop to give a thumbs up or a thumbs down and say, this was relevant, this was not relevant. In this case, it was about 60% relevant for the queries we issued. The remaining 40% wasn't relevant at all because it was missing the context.

We especially saw this in the pull request example: when we first tried that use case, we were only giving it the snippet of code that was written, not the entire context of the classes involved. So in about 60% of cases, there was a thumbs up that the responses were relevant.

Participant 8: You mentioned test scenario generation, which is nice; it helps during refinement sessions. Does it also help with scripting the tests, that is, automated tests? Some development environments generate unit tests very easily, but for unit tests or API-layer automated tests, it would be really useful to have a level of automation where developers can afterwards make changes and make it workable. Maybe a connected question on test automation coverage: if it's intelligent enough to detect the relevant areas in pull request reviews, maybe it can also be useful for coverage.

Elesin: Can it generate unit tests? Can it also increase unit test coverage?

For the team that uses GitHub Copilot, my expected outcome was for them to increase the unit test coverage of the projects in their purview; the target was 70%. It was almost impossible for newly joined colleagues to write unit tests and still ship value with code. They optimized by using GitHub Copilot to generate unit tests, which increased our test coverage from around 0 to about 50%. We're getting there, but we're not there yet. So it also increases test coverage. They improved efficiency by 15% to 30%, and unit test generation is now increasing the test coverage of the projects in their purview.

Losio: You mentioned at the beginning that you didn't use Amazon Q, I think you said, because of the SSO configuration. I was wondering if that was the only reason, or whether you actually got better results with Copilot as well. What's your feeling about it?

Elesin: That was the reason. Now we have it enabled. The main and only reason we didn't use it initially was access: it required AWS SSO, and we already had Azure SSO. Why maintain two things? That's a security exposure on its own. That was the reason, and now we've solved it.

The other side of GitHub Copilot is that we didn't get usage statistics, so we didn't know how many people were using it or what they were using it for. Now we're switching to Q. It gives us cost visibility, shows how many times engineers accept recommendations from Q Developer, and gives us the number of times recommendations are rejected. We've just started rolling that out. The main reason, again, was that we had to use AWS SSO, which at that point we didn't want to use.

Losio: I was quite curious about the integration of different services in that sense. Was there a deliberate choice to pick the best of each provider, or to integrate different providers, or was it more a matter of effort?

Elesin: It was more about the security. Now, to your point, we also want to do a comparison to be sure what to roll out at scale for the entire organization, because we want to double down on this as much as possible. It wasn't a comparison; it was the limitation at that point in time.

 


 

Recorded at:

Feb 26, 2025
