This publication is based on the talk I gave at DevOpsDays in Rome, Italy, called "Metrics Driven Development".
Nowadays, releases in the IT world are a matter of hours or even minutes. Everything is scaling up and down (vertically) and to the right and to the left (horizontally). Having a good monitoring system is therefore a must. In a wide range of IT organizations, applications are at the core of their business. However, monitoring is implemented by OPS teams alone, i.e. the teams who don't write the applications. Why is that so? How and why (if at all) does this need to change? How can we achieve better results? In this article I will share my thoughts and the experience gathered while working together with DEV teams, trying to make sense of metrics.
The ideas described in this article come from my work as an IT architect (somewhere between DEV and OPS) at Adform. Adform is a digital advertising company with its own DEV team (65 people) and OPS team (8 people). The OPS team is responsible for the release process and development environments as well as preparing and maintaining networking and servers in production. Meanwhile, the DEV team is responsible for maintaining the applications they write even after they are in production. These ideas are based on our actual experience helping developers dig into metrics and monitoring, a sphere previously considered solely the operations team's playground.
What Is Metrics Driven Development
You may have already heard about the widely known practice of TDD (test-driven development), the less known BDD (behaviour-driven development), or the least known ADD (asshole-driven development); for a nice list of development practice names check Scott Berkun's blog post. Metrics Driven Development (MDD), however, is hardly mentioned anywhere.
So what is MDD? I would define MDD as a practice where metrics are used to drive the entire application development process. In a company which uses MDD, everything from performance and usage patterns to revenue is measured. Moreover, every single decision taken by developers, operations or even business people is based on metrics. Metrics are used to monitor team performance, solve performance bottlenecks, estimate hardware needs or accomplish other purposes at any stage of the development life-cycle.
The key principles of Metrics Driven Development are:
- Assign metrics to metrics owners
- Create layered metrics and correlate trends
- Use metrics when taking decisions
Each metric should have an owner with the knowledge and means to implement and maintain it. Metrics owners take responsibility for the applications, services and even the servers they run on, ensuring they are properly monitored and that their metrics are collected. Metrics owners keep existing metrics up to date and create new metrics for each new application or piece of functionality.
It's also important to have structure in metrics. Metrics grouped or layered by certain criteria are much easier for everyone, from business people to developers, to understand. Furthermore, when metrics are layered it is easier to correlate them and spot trends, and problems are found and solved faster.
MDD brings visibility to the whole development process, so decisions are taken quickly and accurately and mistakes are spotted as they happen and fixed immediately. Furthermore, everything that can be measured can be optimized. In other words, MDD enables you to feel the pulse of applications and provides you with an opportunity for continuous improvement.
Finally, MDD stacks up well with other established practices such as TDD, BDD or Scrum, for example by using metrics to set the goals for a sprint. Metrics also uncover problems and usage patterns in production which are hard to notice during acceptance testing. One example is the Lance Armstrong bug: the code never fails a test, but evidence shows it's not behaving as expected. Another example is the business bug: a feature is created but never used. Evidence about these "bugs" can be collected using metrics. MDD is really another tool in your belt to achieve a focused and efficient development process, one that feeds live application usage back into development.
Who Creates Metrics?
Usually, when you start something from a blank sheet of paper you have an opportunity to rethink concepts, to bring some innovation to a process or even change it entirely. Something similar happened to us at Adform. After we had written down the requirements and drawn a concept of how we imagined our perfect monitoring solution, we found a huge mismatch. The issue was that we wanted to gather a lot more information about our applications and services, as they generate income and give us a competitive advantage. However, monitoring was done solely by the OPS team, which has a deep understanding of the infrastructure but limited knowledge of the applications' and services' internal workings.
Companies do not earn money because their servers run smoothly (even though this is very important) or because they have a 10 Gigabit Internet connection. Companies earn money because of the functionality their applications and services provide and because they run smoothly (and this is not the same as "servers run smoothly"). So involving the people who actually write the applications in the monitoring process is essential to find problems fast. In fact, developers can easily spot when an application behaves differently than expected, as they possess all the knowledge about the products they develop.
On top of that, there are more benefits when developers work on monitoring:
- DEV teams have the means to embed monitoring points in the applications during development
- DEV teams get quick feedback about their applications (performance, bugs, usage patterns) in the production environment
- DEV teams grow their infrastructure knowledge and start to foresee bottlenecks during development
OPS have been doing monitoring for ages - CPU, memory and I/O are in their blood - but for application and service metrics the owner is DEV. DEV holds all the needed knowledge about the applications and the skill set to make improvements. That's why OPS teams shouldn't be left alone to do monitoring; DEV should work together with OPS to create and maintain metrics. In the next section I will describe how our company decided to make this change and involve our developers in the sphere of monitoring and metrics.
Applying MDD Principles in the RTB Project
Helping DEV to master metrics is a process of understanding, learning, failing and finally succeeding. It is not a black-and-white situation where DEV don't use metrics at all and then all of a sudden start relying on metrics for every decision. It's a long and difficult journey in which multiple factors such as the company's culture, employees' attitudes, management's position and even working habits influence the final result. In order to get buy-in from management and development, it's important to show the value of metrics first. For us, that came incidentally through a project called Real Time Bidding (RTB).
RTB is a relatively new method of selling and buying online display advertising in real time, one ad at a time. To cut a long story short, there are structures called "Ad Exchanges" through which we receive a request asking how much our customers are willing to pay to show a banner to a specific user on a specific site. When we receive and process the request, we send a response back to the originating service. The operational requirements for this project included handling 40,000 queries per second (QPS) with no more than 100ms for a single transaction round-trip.
After we launched the application in production, the situation turned out to be a disaster, as we were able to handle only 5,000 QPS. Plus, almost 30 per cent of those transactions were failing because we were unable to fulfil the 100ms round-trip requirement. To make matters worse, there was no obvious reason why our performance was so poor. We couldn't figure out whether the problems were related to networking, server capacity or the application layer itself.
Finally, metrics helped us dig down into the real causes of the problem and turn the situation around. We ended up handling more than 70,000 QPS (about 14x more than initially) while keeping failed transactions below half a per cent (about 50x fewer than initially). However, in order to achieve these results we had to bring a lot more visibility and structure to the data, as illustrated in the next sections.
Layered Metrics
Since we had started a metrics project before the RTB project began, we were able to embed some metrics into the application. In addition, we had server and network metrics. However, it felt like an ocean of data in which it was difficult to distinguish the essential signals, understand trends and see why our performance was so far below the required thresholds. It was clear that collecting data alone doesn't bring much value.
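To give a sense of what embedding metrics into the application meant in practice, here is a minimal sketch assuming a generic metrics client with timing and increment calls; the names are illustrative and this is not our actual code:

```python
import time

SLA_MS = 100  # round-trip budget for a single RTB transaction

def handle_bid_request(request, bidder, metrics):
    """Process one bid request and record how long it took."""
    started = time.monotonic()
    response = bidder.make_bid(request)                  # the actual bidding logic
    elapsed_ms = (time.monotonic() - started) * 1000

    metrics.timing("rtb.bid.roundtrip_ms", elapsed_ms)   # latency distribution
    metrics.increment("rtb.bid.requests")                # raw throughput (QPS)
    if elapsed_ms > SLA_MS:
        metrics.increment("rtb.bid.sla_violations")      # transactions over 100 ms
    return response
```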
So we decided to visualize the data in three layers:
- Business metrics
- Application metrics
- Infrastructure metrics
The layered approach made metrics a lot more structured, accessible and understandable to anyone from developers to business people. The common entry point to check the state of the application became the Business dashboard. Anyone could access it to check compliance with the required SLA, usage trends or revenue. If needed, one could dig down into the Application dashboard to check the performance of the different application components, latencies between different server groups and data growth. Actually, some metrics in the Application dashboard were the same as in the Business dashboard, only in much greater detail. For example, in the Business dashboard we graphed overall application performance, while in the Application dashboard we had a graph showing how separate parts of the application were performing (see picture 1 & picture 2). Finally, we checked the Infrastructure dashboard for information about CPU, memory and I/O usage.
Picture 1. Application performance in the Business dashboard
Picture 2. Application performance in the Application dashboard
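One way to keep such a layered structure consistent is a naming convention that puts the layer first, so each dashboard can be assembled by prefix. The keys below are purely illustrative; our actual items lived in Zabbix and were named differently:

```python
# Business layer: what the product earns and whether it meets the SLA
BUSINESS = [
    "business.rtb.qps",
    "business.rtb.sla_compliance_pct",
    "business.rtb.revenue_per_minute",
]

# Application layer: the same story, broken down by component
APPLICATION = [
    "application.rtb.bid_price_service.latency_ms",
    "application.rtb.user_data_lookup.latency_ms",
    "application.rtb.response_serialization.latency_ms",
]

# Infrastructure layer: the servers everything runs on
INFRASTRUCTURE = [
    "infrastructure.rtb-node-01.cpu.load",
    "infrastructure.rtb-node-01.memory.used_pct",
    "infrastructure.rtb-node-01.disk.io_wait_pct",
]
```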
The next step to find out the root cause of the problem (and not just its consequences) was to correlate different metrics. We stacked graphs showing key metrics on top of each other, with business metrics at the top, application metrics in the middle and infrastructure metrics at the bottom. This approach worked in two ways. When we looked at these graphs top-down we could clearly see how a change in QPS impacted our application's performance and the servers' CPU (see Picture 3). Conversely, when we looked bottom-up we could see how spiking I/O impacted our SLA.
Picture 3. Example of correlated metrics (top-down: QPS, SLA, Bid price service performance, CPU load)
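We did the correlation visually by stacking graphs, but the same idea can be expressed numerically. A minimal sketch with made-up per-minute samples (statistics.correlation requires Python 3.10+):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical per-minute samples exported from the metrics store
qps      = [41000, 52000, 63000, 70000, 68000, 55000]
cpu_load = [0.38, 0.47, 0.61, 0.72, 0.69, 0.52]

# A coefficient close to 1.0 says what the stacked graphs show visually:
# CPU load follows the incoming query rate almost one-to-one.
print(f"QPS vs CPU correlation: {correlation(qps, cpu_load):.2f}")
```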
Layered and accessible metrics allowed developers to gain visibility into the business side of the project. They could actually see where and how much money we earned, minute by minute. When a new feature or bug fix was released into the production environment, developers could immediately see how it influenced earnings in the money chart in the Business dashboard. Conversely, business people could understand the technical side of the project and visualize the problems developers were facing as well as our load limitations. We actually went a step further and even embedded MDD into Scrum by setting sprint goals based on metrics. Ultimately, great results are achieved when all the people involved in the project have an understanding of all its parts.
Use Metrics When Taking Decisions
The most important thing in every activity is knowledge and understanding of its goal or the reason behind it. You can create different dashboards and collect loads of metrics, but they are useless unless they are used as input for decision making. I've seen several examples of teams creating multiple metrics without any understanding of what they mean or why they are necessary; they weren't fulfilling the principle of using metrics to guide decision making. Another bad example is when a team makes decisions (I'd rather say guesses) without knowing how their application really behaves. They have some metrics, but not enough to really get value from them.
The beauty of MDD is that it also keeps misunderstandings to a minimum. When a decision is taken based on metrics, there's hardly any room left for interpretation. Decisions become obvious, logical and simple to explain, and thus hard to refute. Decisions are made more quickly and accurately, and even the atmosphere in the team improves considerably. Moreover, this has a cascading effect that crosses team borders: communication between teams becomes less emotional and more data-driven. In other words, the blame game that sometimes arises between DEV and OPS or between multiple DEV teams is brought down to a minimum or even disappears completely.
Note that in some cases we might only suspect the real reason for a problem but can't prove it based on existing metrics. The solution is to create additional metrics which will support or refute the initial assumption.
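For example, if we suspected that a particular downstream lookup was behind the SLA misses, we would give just that call its own timer and let the new graph confirm or refute the hypothesis. A hedged sketch; the lookup and the metrics client are illustrative:

```python
import time

def lookup_user_data(user_id, store, metrics):
    """Time only the suspected hotspot so its graph can be compared with SLA misses."""
    started = time.monotonic()
    try:
        return store.get(user_id)                        # the call under suspicion
    finally:
        metrics.timing(
            "rtb.user_data_lookup.latency_ms",
            (time.monotonic() - started) * 1000,
        )
```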
Using metrics in decision making is a win-win situation for everyone. Besides its primary objective of providing information about various aspects of infrastructure and applications, MDD helps to improve relations within and between teams.
What We've Learned
It takes a long time to get MDD practices applied across the whole organization. This is not only a technical change but, more importantly, a cultural one. Everyone needs to shift their perceptions, attitudes and understanding of the development process. While pursuing our vision we have seen mixed results, so I'd like to share a couple of lessons learned that could be beneficial during your journey as well.
The first important lesson we've learned is that you should put all your effort into making the experience for DEV teams as smooth as possible and avoid playing the middle man. We initially tried a solution where OPS and DEV shared one metrics server. It didn't work well for us for a couple of major reasons. Firstly, DEV was still blocked by OPS: they needed OPS permission for every single change. Secondly, the OPS team didn't appreciate DEV teams making frequent changes. So we decided to dedicate one server to DEV from which they could reach all metrics on all servers. No release procedure and no special permissions were needed for DEV to make changes on that server. Actually, they could do almost everything on the server - even remove other servers from the monitoring. Although it might seem scary to give developers the freedom to decide, implement and take responsibility for their changes, it was one of the best decisions we ever made.
The second, very practical lesson we learned was that time spent documenting a 'monitoring vision' or 'requirements for a monitoring tool' was time wasted, as our vision and implementation changed several times during the process - and they will keep changing as we move on. Spending too much time choosing a metrics tool is also a waste. No tool will cover all your needs, so pick one you feel comfortable with and that fits your vision and your company. Our particular choice was Zabbix, an open source tool created in Latvia. Despite its limitations and complex navigation (we even called it the "click to death tool"), it enabled us to start swiftly. Finally, don't forget to prepare examples of how to collect and graph data for the most common use cases.
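As one such example, a simple way to push application-level numbers into Zabbix is a trapper item fed by the zabbix_sender utility. The sketch below is illustrative only; the server, host and key names are made up:

```python
import subprocess

def send_to_zabbix(host, key, value, server="zabbix.example.local"):
    """Push one custom value to a Zabbix trapper item via the zabbix_sender CLI."""
    subprocess.run(
        ["zabbix_sender", "-z", server, "-s", host, "-k", key, "-o", str(value)],
        check=True,
    )

# e.g. report how many bid requests this node handled in the last minute
send_to_zabbix("rtb-node-01", "rtb.bid.requests_per_minute", 68000)
```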
Make it fun and visible for everyone. Place TV screens in lobbies and work rooms showing essential metrics. Graph earnings from different projects (if possible). Give prizes for achievements such as reaching a certain number of consecutive successful releases (see picture 4). Measure and celebrate top achievements together, such as the highest number of visitors or transactions per day, week or month. This approach gets people to wonder, ask questions and get involved in the world of metrics. Typically you start by adding a few simple graphs; then colleagues see them in the lobby and immediately suggest great improvement ideas.
Picture 4. Reward for successful releases
The cultural change required by MDD is scary for all teams involved, and particularly for developers. They start observing problems with their applications which no one was aware of before. At Adform the mindset changed completely. Earlier, the attitude was to assume everything was working fine if no one was complaining. Now everyone understands this isn't a good way to measure an application's performance. Handling metrics is now part of developers' comfort zone. Ironically, while earlier DEV teams had no visibility on metrics and felt good about their applications, now they feel uncomfortable whenever metrics are unavailable.
Current Focus and Future Plans
As we expand MDD across the whole organization, new challenges keep coming up. When metrics are created by multiple teams with different needs, visions and perceptions of what metrics are, things can get messy. We need to make sure that unused metrics are regularly deleted and that new projects are covered by metrics in a similar fashion.
Currently we try to make sure all developers are on the same page by creating guidelines for them. Each DEV team has to be able to answer these three questions:
- How do you know that your application is working correctly? (A lack of open bugs is not enough; further evidence, e.g. simulations, must be provided.)
- How does your application performance evolve over time? (E.g. Does it get faster or slower compared to the last release? Does it keep performing well under high load?)
- How often is your application used? (E.g. How many users are generating reports at the same time? How many banners are published into the system per day? How many transactions do we receive?)
These questions help ensure that all applications are measured and reach the same metrics coverage level before entering production.
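One kind of instrumentation that helps answer the usage and performance questions is a decorator around feature entry points. A minimal sketch, assuming a generic metrics client with increment and timing calls; the names are illustrative, not our actual code:

```python
import functools
import time

def track_usage(metrics, feature_name):
    """Count calls and time a feature so usage and performance trends become visible."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                metrics.increment(f"usage.{feature_name}.calls")
                metrics.timing(
                    f"usage.{feature_name}.duration_ms",
                    (time.monotonic() - started) * 1000,
                )
        return wrapper
    return decorator

# e.g. every report generated in production now shows up on a graph:
# @track_usage(metrics, "report_generation")
# def generate_report(request): ...
```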
Ideas for future improvements we are currently playing with include a traffic light monitoring layer (for quicker feedback and eventually outsourcing the first-response level - see picture 5) and metrics-driven hardware capacity estimation.
Picture 5. Application Semaphore
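The traffic light (semaphore) layer is essentially a thin aggregation on top of the metrics we already collect. A minimal sketch with illustrative thresholds, not the ones we will actually use:

```python
def traffic_light(sla_compliance_pct, error_rate_pct):
    """Collapse detailed metrics into a green/yellow/red status for first response."""
    if sla_compliance_pct >= 99.5 and error_rate_pct < 0.5:
        return "green"    # all good, nothing to do
    if sla_compliance_pct >= 98.0 and error_rate_pct < 2.0:
        return "yellow"   # degraded, keep an eye on it
    return "red"          # escalate to the owning team

print(traffic_light(99.7, 0.2))  # -> green
```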
In conclusion, incorporating metrics into the development process has a lot of potential. Rolling out this practice across the organization is hard. However, the benefits are potentially huge: DEV and business people can visualize application and business performance, decisions can be made faster and more accurately based on actual data, and even communication within and across teams improves.
About the Author
Mantas Klasavičius is an infrastructure architect at Adform. He has over nine years of experience with a background in Windows, Linux and network administration. Currently he specializes in high-performance computing, continuous deployment and cloud services. His new passion is helping developers discover the beauty of monitoring.