“This web page is slow” is a common complaint about web sites, especially since web applications started replacing desktop applications. While the web brings some desirable characteristics such as global delivery, it also brings its share of challenges on the performance front.
Rationale behind collecting and using data
Your user gave you the URL of a “slow” web page. Great, now what? Where does the slowness come from? Is it actually slow to start with? Is it slow for all users? There are a lot of questions to answer here to fix the issue and make sure it doesn’t become slow again a week later.
You can find performance optimization material online, often about specific topics such as JIT, garbage collection, SQL query optimization, ORM pitfalls and so forth. While it is tempting to implement an optimization because it looks promising, one question pops up: how do you know this optimization will actually yield good results in your context?
Clearly, a piece of the puzzle is missing: a way to find performance issues on a continuous basis. That way, we know what is slow and have concrete measurements to back it up. With this knowledge in hand, it is then possible to determine precisely whether performance improvements are needed and to explain them to stakeholders.
Identifying performance issues precisely is a much more efficient way to respond to perceived slowness. The first problem is that it might not be a slowness issue to start with. In the case of timeouts (where the load balancer severs the connection after X seconds, for example), it is plainly impossible to distinguish a deadlock from a slow response time, as the result is the same: a timeout. Data is needed to find the real issue.
To illustrate the importance of identifying performance issues precisely, here are some possible points of slowness in a web application:
- Slow JavaScript
- Render-blocking asset loading
- Proxy getting in the way on the user side
- DNS issues
- ISP/network issues
- Switches and routers
- Load balancer
- Application code (including third party libraries)
- HTTP server (ASP.NET or IIS, for example)
- Third party services such as payment processors, map providers, etc.
- Subsystems such as SQL Server, Redis, Elasticsearch, RabbitMQ, etc.
And the list goes on, depending on the complexity and scale you deal with. How do you diagnose a performance issue when so many components are in play? One word: data. You want relevant, meaningful data on everything, data that can prove the guilt or innocence of each system involved in a slow request.
With data in hand, you can start at the top and cross components off the list as you go, similar to performing a search in a sorted tree. Each step down the tree brings you closer to the details and the actual issue:
- Client side, server side or somewhere in between?
- Slow JavaScript, rendering, blocking assets?
- Load balancer, web server, any subsystem or third party?
As we go down the tree, the problem becomes more and more precise. As such, the data needed to find the issue must match that level of precision. At this point, tools such as performance profilers or SQL query execution plans become necessary.
For an efficient use of your time, it is worth citing Amdahl’s law:
Regardless of the magnitude of an improvement, the theoretical speedup of a task is always limited by the part of the task that cannot benefit from the improvement.
For example, suppose we have a web request taking 100ms of server processing and 5 seconds for an SQL query. Even if you were to get the server processing down to 1 ms, the overall improvement is marginal in terms of response time, going from 5.1 to 5 seconds. The 5 seconds of SQL processing are where the potential gains are the highest.
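To make the arithmetic concrete, here is a minimal sketch using the numbers from the example above; it simply computes the overall response times and the resulting speedup:

```csharp
using System;

class AmdahlExample
{
    static void Main()
    {
        // Numbers from the example above, in seconds.
        const double serverBefore = 0.100;  // server processing
        const double serverAfter  = 0.001;  // after a heroic optimization
        const double sqlQuery     = 5.0;    // the dominant part, untouched

        double before = serverBefore + sqlQuery;  // 5.1 s
        double after  = serverAfter + sqlQuery;   // ~5.0 s

        Console.WriteLine($"Response time: {before:F3} s -> {after:F3} s");
        Console.WriteLine($"Overall speedup: {before / after:F3}x"); // ~1.02x
    }
}
```

A 100x improvement to the server code yields barely a 2% improvement overall, which is exactly why the measurement must come before the optimization.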
Infrastructure issues
A top-down approach, i.e. identifying an issue more and more precisely, works well when an issue is localized to a single page. How about issues that span multiple pages? What if, for example, various pages experience intermittent slow response times because a subsystem is not keeping up, or because of an antique network switch whose every reboot may be its last?
This is where a monitoring approach focused on the application shows its limitations. At this level, other metrics are needed to assess the health of every component in the system, at both the software and hardware level.
At the hardware level, the first machines that come to mind are the web and database servers. However, these are only the tip of the iceberg. All hardware components must be identified and monitored: servers, network switches, routers, load balancers, firewalls, SANs, etc.
All this may seem obvious to a system administrator, as hardware monitoring is a common practice. There is, however, a major caveat: all those hardware metrics are mostly useless from a performance point of view if they are separated from the application metrics. In other words, metrics are most useful when put into context.
For example, an average of 50% CPU usage on a database server may be completely normal in some cases and a ticking time bomb in others. At peak times, 50% CPU usage indicates there is still some room to accommodate even heavier traffic. If the same 50% happens frequently during idle periods, it suggests the application is unlikely to survive a sudden peak of incoming requests.
Bottom line: system-wide metrics such as CPU, memory and disk usage must be correlated with application metrics to assess system health. Being able to visualize application metrics, such as request throughput, alongside system metrics, such as CPU usage, gives a much more complete picture of a system’s health.
Application Performance Management (APM) tools
APM tools provide three basic operations: data collection, data storage and data visualization. An agent is usually in charge of collecting the data and sending it to a data store. Through a web interface, the data can then be visualized in dashboards centered on web requests.
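As a rough illustration of the agent side, here is a minimal sketch, not a real agent: it times a unit of work and ships the measurement to a collection endpoint. The endpoint URL and JSON payload are invented for the example; real agents batch measurements and capture far more context (host, request URL, outcome, etc.):

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class MiniAgent
{
    static readonly HttpClient Client = new HttpClient();

    // Times a unit of work and reports its duration to a data store.
    public static async Task<T> MeasureAsync<T>(string name, Func<T> work)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            return work();
        }
        finally
        {
            sw.Stop();
            // Hypothetical collection endpoint and payload format.
            var json = $"{{\"metric\":\"{name}\",\"ms\":{sw.ElapsedMilliseconds}}}";
            await Client.PostAsync("http://apm.example.local/metrics",
                new StringContent(json, Encoding.UTF8, "application/json"));
        }
    }
}
```

A call site wraps whatever it wants measured, e.g. `var result = await MiniAgent.MeasureAsync("checkout", () => ProcessCheckout());`.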
APM tools are useful to:
- Visualize the performance of a web application as a whole
- Visualize the performance of specific web requests
- Automatically send alerts when the web application performs poorly or has too many errors
- Verify how the application responds in periods of high traffic
Here is a non-exhaustive list of APM tools with out-of-the-box support for ASP.NET and IIS:
Infrastructure monitoring tools
To provide a complete picture, infrastructure monitoring tools collect metrics at the host level, at both the hardware and software levels.
Lightweight profilers
Lightweight profilers give high-level metrics on particular web requests. They provide immediate feedback to developers as they browse web pages. They can be used in all types of environments (development, QA, staging, production, etc.), making them well suited to quickly assessing the performance of a particular page.
The fundamental difference between lightweight profilers and their full-blown counterparts is that they are not attached to the process. This means you can use them without worrying much about the overhead they generate.
In a development context, lightweight profilers provide immediate feedback on the code you are currently writing. This is particularly useful for spotting issues like N+1 queries or slow response times, as the response time is always displayed in a corner of the page.
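To make the N+1 pattern concrete, here is a self-contained sketch with in-memory stand-ins for the database (the types and query counter are invented for illustration). A lightweight profiler surfaces this shape as a long list of nearly identical queries on a single page load:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal stand-ins for database entities.
record Customer(int Id, string Name);
record Order(int Id, int CustomerId);

class Nplus1Demo
{
    static readonly List<Customer> Customers =
        Enumerable.Range(1, 3).Select(i => new Customer(i, $"Customer {i}")).ToList();
    static readonly List<Order> Orders =
        Enumerable.Range(1, 3).Select(i => new Order(i, i)).ToList();

    static int queryCount;

    // Each call stands in for one round trip to the database.
    static List<Order> GetOrders() { queryCount++; return Orders; }
    static Customer GetCustomerById(int id) { queryCount++; return Customers.First(c => c.Id == id); }

    static void Main()
    {
        foreach (var order in GetOrders())                     // 1 query
        {
            var customer = GetCustomerById(order.CustomerId);  // +N queries, one per order
            Console.WriteLine($"Order {order.Id} -> {customer.Name}");
        }
        // 4 round trips for 3 orders; a join or eager loading would need 1.
        Console.WriteLine($"Queries issued: {queryCount}");
    }
}
```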
Performance counters to fill the holes
Performance counters in Windows provide metrics on different aspects of the system, at both the hardware and software level. Monitoring tools usually report some performance counters, such as CPU and memory usage. However, useful counters such as time spent in GC are often missing. The most practical way to get started is to use a basic list and iterate, adding relevant counters as needed.
It is possible to collect and visualize performance counters in real time using perfmon. Integration with APM tools is also possible in most cases, using custom metrics or plugins.
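Counters can also be read programmatically. Below is a minimal sketch using the System.Diagnostics.PerformanceCounter class; it is Windows-only, and on .NET Core it requires the System.Diagnostics.PerformanceCounter package. The category and counter names are the standard Windows/.NET ones:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class CounterSampler
{
    static void Main()
    {
        // System-wide CPU usage, the classic starting point.
        using var cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total");
        // Time spent in garbage collection across .NET processes,
        // one of the counters monitoring tools often omit.
        using var gc = new PerformanceCounter(".NET CLR Memory", "% Time in GC", "_Global_");

        // Rate counters need two samples; the first NextValue() returns 0.
        cpu.NextValue();
        gc.NextValue();

        for (int i = 0; i < 5; i++)
        {
            Thread.Sleep(1000);
            Console.WriteLine($"CPU: {cpu.NextValue():F1}%  GC: {gc.NextValue():F1}%");
        }
    }
}
```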
SQL tools
The persistence layer, namely SQL databases, is a frequent bottleneck due to its omnipresence in most applications. Specialized SQL monitoring tools provide metrics on resource utilization as well as database-specific metrics such as wait time or compilations per second, to name a few.
With the data provided, you can find several types of issues as well as possible performance improvements (a sketch for querying wait statistics directly follows this list):
- Excessive throughput on one or several queries
- Excessive CPU usage hinting at query issues or missing indexes
- High-throughput queries that could be cached
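SQL Server exposes wait statistics through the built-in sys.dm_os_wait_stats dynamic management view, which many monitoring tools build upon. Here is a minimal sketch, assuming the Microsoft.Data.SqlClient package and a locally reachable server (the connection string is an assumption to adjust):

```csharp
using System;
using Microsoft.Data.SqlClient;

class WaitStats
{
    static void Main()
    {
        // Hypothetical connection string; adjust for your environment.
        const string cs = "Server=localhost;Database=master;Integrated Security=true;TrustServerCertificate=true";

        using var conn = new SqlConnection(cs);
        conn.Open();

        // The top waits hint at where the server spends its time
        // (I/O, locks, CPU scheduling, ...).
        const string sql = @"
            SELECT TOP 5 wait_type, wait_time_ms, waiting_tasks_count
            FROM sys.dm_os_wait_stats
            ORDER BY wait_time_ms DESC";

        using var cmd = new SqlCommand(sql, conn);
        using var reader = cmd.ExecuteReader();
        while (reader.Read())
            Console.WriteLine($"{reader.GetString(0),-40} {reader.GetInt64(1),12} ms ({reader.GetInt64(2)} waits)");
    }
}
```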
SQL monitoring tools:
Other persistence systems
All subsystems need to be monitored to some degree. Simple data collection and visualization may suffice for low-throughput or non-critical systems. In other cases, more advanced, specialized monitoring is needed.
Code Profilers
When a particular page or piece of code has been identified as slow, a code profiler provides the most detailed view for pinpointing a performance issue. Profilers also give a precise view of external calls such as database queries and web requests.
Profilers:
Memory Profiler
Monitoring memory and garbage collection metrics is useful for detecting potential issues. While these metrics show the presence of an issue, they usually don’t say where it is. A memory profiler comes in handy when there is a need to dig into memory and garbage collection issues.
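Before reaching for a full memory profiler, the runtime’s own GC numbers can confirm whether collections are worth investigating. A minimal sketch using the built-in System.GC API:

```csharp
using System;

class GcSnapshot
{
    static void Main()
    {
        // Collection counts per generation: a fast-growing Gen 2 count is a
        // classic hint that a memory profiler is worth the effort.
        Console.WriteLine($"Gen 0: {GC.CollectionCount(0)}");
        Console.WriteLine($"Gen 1: {GC.CollectionCount(1)}");
        Console.WriteLine($"Gen 2: {GC.CollectionCount(2)}");
        // Approximate managed heap size, without forcing a collection.
        Console.WriteLine($"Heap: {GC.GetTotalMemory(forceFullCollection: false) / 1024} KB");
    }
}
```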
Profilers:
Client side profiler
Performance issues may also come from the front end. This is even more true with the emergence of single-page applications, where JavaScript is king. All major browsers embed tooling such as code profilers and memory profilers.
Tools showing the sequence of events and requests are handy to determine at a glance whether an issue comes from the front end or the back end.
Tools:
Page analyzers
Higher-level client-side tools are a handy starting point for performance troubleshooting. They can provide a high-level view of where response time issues come from, along with some recommendations. Google’s PageSpeed Insights is one free example of such tools.
The sheer number of factors and tools involved in the performance of a system may seem overwhelming. However, it can all be summarized in one word: data. Having a clear and precise view of a system at a given time makes it possible to reason about its performance. It also enables just-in-time learning, where performance metrics and charts guide you towards what’s impacting your system.
About the Author
Pierre-Luc Maheu is a software developer who has worked in VoIP, cloud hosting and e-commerce over the past five years. He currently works at Amilia, a SaaS platform for managing online registrations. His current interests are monitoring, performance/scaling and F#. In his free time, he likes to clear his mind doing indoor climbing, Animal Flow and Kendo.