In our previous article we covered how to diagnose common database performance hotspots in your Java code. In this article we focus on patterns that cause performance and scalability issues in distributed (micro)service-oriented architectures (SOA), such as transporting excessive quantities of data between services, making too many service calls due to bad service interface design, or exhausting thread and connection pools.
Here is an example from an application I recently helped analyze. This company had to move their old monolithic application to a service-oriented architecture in order to cope with rising demand for their popular web site. The hallmark feature of that app was its search function. The search logic, previously implemented in the front-end web rendering code, was moved to a new back-end Search REST API. When we did our architectural review we executed several different search queries, and the results were eye opening for many. Each search key resulted in a different number of calls to the new back-end Search REST API. It turned out that the internal Search REST API was called for every item in the search result, which is a classic N+1 Query Problem pattern. The following is a screenshot of the Dynatrace transaction flow for a single search that delivers 33 search results. It is easy to spot several bad service access patterns, which I will explain in more detail later in this article. I always look for key architectural metrics, which in this case tell me that this app is most likely never going to scale and perform as envisioned:
Looking at key incoming and outgoing service calls makes it rather easy to spot my service call problem patterns. This data also makes architectural and code reviews much easier!
I know: this doesn’t happen to you! ☺
I’ve presented this particular use case of the search feature numerous times at user groups and meetups around the world. I always get the “this doesn’t apply to us because we know how to do services right” look. Well, keep reading and you will discover that there are countless examples of similar infractions in live production code! Nobody does this intentionally; it simply happens, and here are three reasons I have observed:
#1: Not seeing or understanding the Bigger Picture!
Service teams are so focused on their own services, and invest so much energy in making them scale and perform, that they often forget the bigger picture: how is my service going to be used? Do we provide the right interface methods for what our service consumers need?
In the example above the Search REST API team provided a service GetProductDetails(int productId). What they should have provided is GetAllProductDetails(string searchQuery) or GetAllProductDetails(int[] productIds). I advise every service team to talk to its customers/consumers constantly; don't just listen to what they think they need, but also build service usage monitoring into your implementation and learn from your production environment how often each service is used and by whom!
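To make this concrete, here is a minimal sketch of what a coarse-grained endpoint next to the fine-grained one could look like, assuming a Spring MVC controller; the class name, endpoint paths and the ProductDetails type are hypothetical illustrations of the interface shape, not the actual Search REST API from the example:

```java
import java.util.List;
import java.util.stream.Collectors;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ProductDetailsController {

    // Placeholder payload type; the real service would expose its own DTO.
    public static class ProductDetails {
        public int id;
        public String name;
        ProductDetails(int id, String name) { this.id = id; this.name = name; }
    }

    // Fine-grained endpoint: consumers end up calling it once per search result (N+1).
    @GetMapping("/products/{id}")
    public ProductDetails getProductDetails(@PathVariable int id) {
        return loadDetails(id);
    }

    // Coarse-grained endpoint: one call returns details for all requested products.
    @GetMapping("/products")
    public List<ProductDetails> getAllProductDetails(@RequestParam List<Integer> ids) {
        return ids.stream().map(this::loadDetails).collect(Collectors.toList());
    }

    private ProductDetails loadDetails(int id) {
        // Stand-in for the real lookup (database, cache, downstream service).
        return new ProductDetails(id, "product-" + id);
    }
}
```

The fine-grained method can stay for consumers that genuinely need a single product; the bulk variant is what keeps the search use case to a single round trip.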
#2: Not understanding the internals of your frameworks
Most teams don't implement their own frameworks, but rely on existing popular frameworks, such as MVC and REST frameworks. This is a good thing, as we should not reinvent the wheel every time we build a new app. But a common syndrome occurs when a new project starts as a small prototype based on a sample downloaded from a public code repository like GitHub: the prototype evolves and morphs into a full application, and the team never invests the requisite retrospection to evaluate whether the chosen frameworks are the best for the job and are properly tuned. My tip: spend time understanding the internals of your selected frameworks and how to best configure and optimize them for throughput and performance. If you don't, be prepared to end up in the situation described above; I see it every day!
#3: Migrate vs Re-Architect
When you have to break up your monolithic app, don't simply assume you can “carve out” classes that provide a certain functionality and put them into a service. Doing so could cause local in-thread and in-process calls to become cross-service, cross-server, network or cloud calls, which can easily go unnoticed because invoking these services is as easy as calling a local method.
When you migrate from a monolith to microservices, make sure you first understand which APIs your services really need to provide. In most cases this means a re-architecture and a redefinition of interfaces instead of copying and pasting code from one monolith project into several service projects!
Available diagnostics tools & options
As shown in the example above, I always look for key architectural metrics such as call frequency, call count between servers, data transferred, and simply which services talk to each other. You can usually gather these metrics through the service frameworks you used to build your app, e.g., Spring or Netflix; most frameworks provide diagnostics and monitoring features. You can also rely on code profilers or pick any of the application performance monitoring (APM) tools available in a completely free or at least freemium/free-trial edition. My tool of choice is Dynatrace Application Monitoring & UEM, which is available to developers, architects and testers as a full free version through the Dynatrace Personal License. A key criterion for such a tool is its ability to show data across your service infrastructure and how all services interact. A profiler only sees a single JVM and is therefore too limited for this exercise!
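If your framework of choice does not expose such metrics out of the box, even a few lines of plain JDK code can get you started. Here is a minimal, framework-agnostic sketch (the class and parameter names are made up) that counts service calls per endpoint and caller and prints the totals periodically; in practice you would push these numbers to your monitoring or APM backend instead:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

/** Minimal usage counter: one LongAdder per endpoint + caller combination. */
public class ServiceUsageStats {

    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();
    private final ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();

    public ServiceUsageStats() {
        // Dump the counters once a minute; a real setup would export them to a monitoring backend.
        reporter.scheduleAtFixedRate(this::report, 1, 1, TimeUnit.MINUTES);
    }

    /** Call this at the start of every service method, e.g. record("/search", callerId). */
    public void record(String endpoint, String caller) {
        counters.computeIfAbsent(endpoint + "|" + caller, key -> new LongAdder()).increment();
    }

    private void report() {
        counters.forEach((key, count) -> System.out.println(key + " -> " + count.sum()));
    }
}
```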
Diagnose Bad Service Access Patterns
Now let’s get to the master list of service access patterns I like to watch out for. Make sure you check for these patterns in your own applications if you want to build a highly scalable and performant application using a service oriented architecture. We will first state the list and then explore some examples that leverage these patterns to locate and remediate specific performance issues:
- Excessive Service Calls: Too many service calls per single end-to-end use case. How many are too many? Of course this depends on your particular app and requirements, and on how you separated your services. But as a rule of thumb, five service calls should be a signal to start investigating.
- N+1 Service Call Pattern: Executing the same service more than once within an end-to-end use case. This is a good indicator that you may need to redefine your service endpoints and provide a more specific version of that service. In short: “give me the product details for a search query” instead of “give me the product details for product X and then product Y”.
- High Service Network Payload: Bandwidth is cheap, but only if the endpoints are in close proximity; once you move services to the cloud you will need to factor in higher latency, new bandwidth constraints and additional costs from your cloud provider for increased network traffic to and from the cloud instances. When I see more data transferred between internal service calls than what is sent back to the end user, I take a closer look at how the transported data might be optimized.
- Connection & Thread Pool Sizing: Services communicate via connections, which is why we have to properly size and watch both outgoing and incoming connection and thread pools. Once you understand which services communicate through which channels, you can do proper sizing based on load predictions.
- Excessive use of Asynchronous Threads: It is not easy to implement event-driven service-call patterns that make async calls and receive a notification when the work is complete. Watch out for frameworks that “fake” asynchronous behavior by spawning and blocking multiple background threads until all results of all service calls arrive.
- Architectural Violations: Do your services interact with other services as intended, or have access antipatterns unexpectedly crept into your architecture (for example, accessing a backend data store directly instead of going through the data access service API)?
- Lack of Caching: Moving work to services is great, but if this work is now performed less efficiently you may run into resource issues. A good example is making redundant database accesses instead of caching data across multiple service calls to avoid extra round trips to the database (see the sketch after this list).
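To illustrate the caching point, here is a minimal sketch of an in-process cache, assuming the data in question may be cached at all; the ProductDetails type and the database lookup are placeholders, and a production cache would of course also need eviction and expiry:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ProductDetailsCache {

    private final Map<Integer, ProductDetails> cache = new ConcurrentHashMap<>();

    /** Returns cached details if present; otherwise loads them once and caches the result. */
    public ProductDetails getDetails(int productId) {
        return cache.computeIfAbsent(productId, this::loadFromDatabase);
    }

    private ProductDetails loadFromDatabase(int productId) {
        // Placeholder for the real database or downstream-service lookup.
        return new ProductDetails(productId);
    }

    /** Placeholder payload type. */
    public static class ProductDetails {
        final int id;
        ProductDetails(int id) { this.id = id; }
    }
}
```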
As promised, let’s see these techniques in action:
Example #1: Excessive Service Calls and N+1 Query Pattern
This example is from a well-known job search site. For every search request executed by an end user, the front-end service queries a list of potential job titles that match the provided search key. For each individual job title returned, it then makes a call to the external “search” REST service. This process is readily optimized by providing a coarse-grained search REST call that accepts a list of job titles, significantly reducing the number of REST roundtrips:
A job search request to the front end service causes 38 REST Calls to an external service to retrieve details of individual job title results. This can be optimized by providing a better REST interface that delivers results for a list of job titles!
Just looking at the number of calls doesn't immediately tell us whether this is actually a bad design or an inefficient use of the REST interface between the front end and the back end. To get the full picture you want to look at the actual REST queries executed, organizing them by endpoint URL and query string. This view exposes the actual N+1 Query Problem: the same REST call is duplicated over and over, reusing the exact same query string:
Spot inefficient calls to a REST endpoint by looking at the number of invocations per endpoint + query string. If you see these patterns, think about providing better interfaces that can handle these queries with a single service call.
If you see your services being used as in the above example, where for every job search the same service is executed several times for different job titles, it makes sense to consider providing a REST interface that better supports the end-to-end use case. It might also be that your service already provides this interface, but the front-end developers (or the consumers of your service) were not aware of it. So by performing this type of usage analysis you are also educating your consumers to better leverage your services!
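To make the two access patterns concrete, here is a minimal client-side sketch assuming Java 11's java.net.http.HttpClient; the host name, paths and query parameters are made up and do not reflect the actual job search API:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

public class JobSearchClient {

    private final HttpClient client = HttpClient.newHttpClient();

    // N+1 pattern: one REST round trip per job title.
    public void fetchDetailsOneByOne(List<String> jobTitles) throws Exception {
        for (String title : jobTitles) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://jobsearch.example.com/getResult?title=" + encode(title)))
                    .build();
            client.send(request, HttpResponse.BodyHandlers.ofString());
        }
    }

    // Coarse-grained alternative: one round trip for the whole result list.
    public void fetchDetailsInBulk(List<String> jobTitles) throws Exception {
        String titles = jobTitles.stream().map(JobSearchClient::encode).collect(Collectors.joining(","));
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://jobsearch.example.com/getResults?titles=" + titles))
                .build();
        client.send(request, HttpResponse.BodyHandlers.ofString());
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```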
Tip: In Dynatrace you can use the Web Requests Dashlet to show you all calls done by a single end-to-end transaction. Make sure you put the dashlet into “Show -> All” and “Group By -> URL + Query” mode via the context menu.
Example #2: Excessive use of Asynchronous Threads
In the same job search example, all calls to the /getResult URL were executed by spawning a new background thread for each service call. A total of 35 threads were spawned by the main HTTP thread to execute these REST calls in parallel. In the end, the HTTP worker thread blocks until all of those threads have completed execution:
Analyze how many threads are involved when executing your REST calls. If you have an N+1 Service Call pattern you also consume N additional threads for every incoming request on your frontend!
The problem of the 35 threads is a direct consequence of the N+1 Service Call pattern. Solving that pattern would also solve the problem of hogging a new thread for every service call.
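If individual calls cannot be avoided entirely, you can at least avoid blocking one dedicated worker thread per call. Here is a minimal sketch, again assuming Java 11's HttpClient and hypothetical URLs, that sends the requests asynchronously and composes the results with CompletableFuture instead of spawning and joining 35 background threads:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class AsyncJobSearchClient {

    private final HttpClient client = HttpClient.newHttpClient();

    /** Sends all requests without blocking the calling worker thread per call. */
    public CompletableFuture<List<String>> fetchDetails(List<String> detailUrls) {
        List<CompletableFuture<String>> futures = detailUrls.stream()
                .map(url -> client.sendAsync(
                                HttpRequest.newBuilder(URI.create(url)).build(),
                                HttpResponse.BodyHandlers.ofString())
                        .thenApply(HttpResponse::body))
                .collect(Collectors.toList());

        // Completes when every call has finished; the caller is not blocked in the meantime.
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(done -> futures.stream()
                        .map(CompletableFuture::join) // already completed here, join() does not block
                        .collect(Collectors.toList()));
    }
}
```

The composition keeps the incoming HTTP worker thread free while the downstream calls are in flight, instead of parking it until 35 helper threads finish.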
Example #3: Thread Ratio and Thread Pools
Example #2 can also be analyzed by looking at the ratio between incoming requests/transactions and the total number of active threads involved in their execution. You can easily access these two metrics by tapping into the JMX metrics your JVM exposes. You can even extend this by breaking down the number of threads by thread group; this works well if your app names its threads, as in the example above, which is a good development practice. Also observe your CPU: if you experience slowness but don't see high CPU usage, your threads are most likely either waiting for I/O or waiting for each other:
A good practice is to correlate the number of incoming requests with the total number of active threads and CPU utilization. If I see a ratio of 26 to 1, I know that our application is tying up a lot of background threads per request. Watching the number of threads over a longer period of time also lets you see whether it hits a ceiling. In the case above, the maximum number of worker threads was 1300. If more requests come in, they simply can't be serviced due to thread starvation!
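Here is a minimal sketch of how you could collect these numbers yourself via the standard JMX management beans; the grouping by thread-name prefix is an assumption that only works if, as mentioned above, your application names its threads:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.stream.Collectors;

public class ThreadMetricsSnapshot {

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        System.out.println("Live threads: " + threads.getThreadCount());
        System.out.println("Peak threads: " + threads.getPeakThreadCount());
        System.out.println("System load:  " + os.getSystemLoadAverage());

        // Group live threads by name prefix (e.g. "http-worker-12" -> "http-worker").
        Map<String, Long> byPrefix = Thread.getAllStackTraces().keySet().stream()
                .collect(Collectors.groupingBy(
                        t -> t.getName().replaceAll("-?\\d+$", ""),
                        Collectors.counting()));
        byPrefix.forEach((prefix, count) -> System.out.println(prefix + ": " + count));
    }
}
```

Taking such a snapshot at a regular interval makes a ratio like the 26 to 1 from the dashboard above visible in your own tooling as well.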
Monitor Service Metrics throughout your Pipeline
In our previous article we talked about database metrics and how to integrate them into your Continuous Integration build. We can do the same for service metrics. If you have automated tests for your services, for example for the search or a news alert feature, you should automatically monitor these metrics for every single test execution. But you should not stop monitoring once your tests have executed in CI. Why? Because the software you just tested will be deployed into staging and, hopefully, production, where it is equally important to keep monitoring these same metrics. The “holy grail” for me is observing the same metrics once the service is deployed in production. Additionally, you want to monitor feature usage of your service, which tells you which new services/features are actually used by your consumers. Having that data allows you to decide which features to improve in order to increase adoption, and which features to remove because they are not as desired as you thought. This reduces the code base, code complexity and, ultimately, what the industry calls “technical debt”.
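As a sketch of what such an automated check could look like: a JUnit test that runs the search use case and fails the build when the measured number of backend service calls exceeds the agreed threshold. SearchTestDriver and MetricsCollector are hypothetical stand-ins for however your test environment triggers the use case and collects call counts (test harness hooks, your monitoring tool's API, etc.):

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

public class SearchArchitectureTest {

    private static final int MAX_SERVICE_CALLS = 5; // our rule-of-thumb threshold

    @Test
    void searchShouldNotFanOutIntoExcessiveServiceCalls() {
        // Hypothetical helpers: execute the search use case and read back the call count.
        SearchTestDriver.executeSearch("performance engineer");
        int serviceCalls = MetricsCollector.serviceCallCount("search");

        assertTrue(serviceCalls <= MAX_SERVICE_CALLS,
                "Search triggered " + serviceCalls + " backend service calls, expected at most "
                        + MAX_SERVICE_CALLS + " - check for an N+1 service call pattern");
    }
}
```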
Here is a quick walkthrough of how this could have looked in my introductory example, where the product search feature resulted in 33 service calls. I start the story back when the application was still the monolithic version of the software. We can capture metrics from our tests in CI for the news alert and the search feature, giving us metrics about how the code interacts with the database, how many service calls are made, how much data is transferred, and which features are used in production:
Build 17 shows that news alert and search run slowly in production. News alert has very low usage, which could be caused by a very slow response time discouraging users from using that feature. Search adoption is good but could be better.
By evaluating these metrics we can now understand how the code is currently running, and we can decide to optimize performance by breaking up the monolithic search feature into a service-oriented approach. Several builds down the line we have our new service-oriented implementation completed. Running the same tests, however, produces some unexpected metrics that force us to delay our production release; looking at the chart below, we see that the changes we made to search clearly violate a number of our architectural rules (the numbers are all taken from my initial example in the opening paragraph):
Build 25 is obviously a very bad build, as the migration to a microservice approach shows very poor metrics for our service call patterns. Let's NOT deploy that build but fix it!
The N+1 query problem pattern also resulted in many additional SQL queries and a lot more data transferred over the network. The fix was a cleaner and more efficient design and implementation of the new backend search service interface: instead of calling the backend service for each individual search result, a new “bulk” service was introduced that returns all details for the complete search. With that fix in place we finally have a build we can deploy. All numbers look good on the continuous integration server. Looking at the usage numbers in production after deployment shows us that search now indeed runs on multiple service containers, provides better performance, and shows better usage rates. The news alert feature, however, shows only a marginal increase; perhaps an indicator to remove this feature in an upcoming build, as it clearly provides little value:
Build 26 is a technically solid build and shows improved usage for search. Usage for news alert, however, is not really improving, allowing us to decide to remove it in Build 35!
In our database article we showed how Dynatrace can visualize architectural metrics in your test automation / continuous integration environment. Compare that to the following dashboard showing the key architectural metrics for our search service in production. The different shades of yellow, orange and green tell us, for instance, how many search requests execute between one and five internal REST calls (yellow) or five or more (bright yellow). The same is done for the number of SQL calls executed (shades of orange) and the payload sent per search (green). This kind of visualization helps identify a change in usage of a service (total sum on the Y-axis) over time, but also a change in behavior (when the portion of a certain color grows or shrinks).
Service Monitoring in Production: Understand Usage and internal behavior after code deployments
Get a handle on your services before it’s too late
If you do not yet do architectural reviews based on metrics, I suggest you start now. Look at the patterns I described and let us know if there are other patterns we should watch out for. But don't just do it manually during your coding phase. Automate the monitoring of these metrics from Dev(to)Ops and make better decisions about what to deploy and which services to promote or eliminate.
About the Author
Andreas Grabner (@grabnerandi) is a performance engineer who has been working in this field for the past fifteen years. Andreas helps organizations identify the real problems in their applications and then uses this knowledge to teach others how to avoid the problems by sharing engineering best practices.