Martin Thompson, co-founder of LMAX, keynoted on performance at QCon São Paulo 2016. Initially entitled “Top Performance Myths and Folklore”, Thompson renamed the presentation “Top 10 Performance Mistakes” because “we all make mistakes and it’s very easy to do that.”
This is a digest of the top 10 performance related mistakes he has seen in production, including advice on avoiding them.
10. Not Upgrading. Many complain about their systems that are not fast enough, looking for help in writing better algorithms and data structures, but actually “what they need to do is upgrade,” said Thompson. Upgrading the operating system, the JVM, the CLR, etc. The most common excuse for not upgrading is “There might be bugs in the new version.”
To avoid that one can do regular continuous integration and testing, which should be a fundamental part of the development process. Thompson exemplified that by mentioning real-life systems where developers tested against a new version of the database, and, after it passed all tests, they released it into production.
9. Duplicated Work. For this Thompson tells the story of a system that was very slow in serving web pages, and the developers thought it was an issue with the database, and they were trying to tune it up. But when he run a profiler on the system, he discovered a loop where the ORM was called 7,000 times, being the actual culprit for the slow page load. When that loop was fixed the system became properly responsive. The lesson is “Measure the system. If the system is a black box you can’t tell where the time is spent.”
8. Data Dependent Loads. Thompson presented the results of a benchmark measuring the time needed to perform one operation when attempting to sum up all longs in a 1GB array located in memory (RAM). The time depends on how the memory is accessed, and is presented in the following table:
The results of this benchmark show that not all memory operations are equal, and one needs to be careful how it deals with them. Thompson considers that it is important to know some basics regarding the performance of various data structures, noting that Java’s HashMap is over ten times slower than .NET Dictionary for structures larger than 2GB. He added that there are cases when .NET is much slower than Java.
7. Too Much Allocation. While allocation is almost free in many cases, claiming back that memory is not free because the garbage collector needs significant time when working on large sets of data. When lots of data is allocated, the cache is filled up and older data is discarded, making operations on that data to take 90ns/op instead of 7ns/op, which is more than an order of magnitude slower.
6. Going Parallel. While going parallel is very attractive for certain algorithms, there are some limitations and overhead associated with it. Thompson cited the paper Scalability! But at what COST? in which the authors compare parallel systems with single threaded ones by introducing COST (Configuration that Outperforms a Single Thread), defined as
The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads.
The authors have analyzed the measurements of various data-parallel systems and concluded that “many systems have either a surprisingly large COST, often hundreds of cores, or simply underperform one thread for all of their reported configurations.”
In this context Thompson remarked that there is a certain communication and synchronization overhead associated with parallel tasks, and some of the activity being intrinsically serial and not parallelizable. According to Amdahl Law, if 5% of a system’s activity needs to be serial, then the system’s speed improvement will be maximum 20 times no matter how many processors are used.
Thompson also mentioned the Universal Scalability Law introduced by Neil J. Gunther in 1993 (PDF) which shows that there are even more limitations in parallel shared-nothing systems, and the speed decreases after a certain number of processors used, depending on the level of concurrency, contention and coherency. (More explained on the page How to Quantify Scalability.) Speed vs. #processors according to the two laws mentioned is expressed in the following chart:
Thompson explained that the USL sees a decrease in performance due to the cost associated with communication between the components of a parallel system: “The more you throw at a system, the more communication pathways are, and that slows down the algorithm.”
The main advice when building parallel systems is to avoid shared mutable state, because “it is so hard to reason about it … and you end up with so many bugs,” added Thompson. The recommended approach is either use a shared-nothing architecture or use single writers that can access a certain piece of data.
His final tip on this performance issue was: If you want to speed up an algorithm find out ways to speed up the single threaded version of it before attempting a parallel solution, because the later is so much harder.
5. Not Understanding TCP. On this one Thompson remarked that many are considering a microservices architecture without a solid understanding of TCP. In certain cases it is possible to experience delayed ACK limiting the number of packets sent over the wire to 2-5 per second. This is due to a deadlock created by two algorithms introduced in TCP: Nagle and TCP Delayed Acknowledgement. There is a 200-500 ms timeout that interrupts the deadlock, but the communication between microservices is severally affected by it. The recommended solution is to use TCP_NODELAY which disables Nagle’s algorithm and multiple smaller packets can be sent one after the other. The difference is between 5 and 500 req/sec, according to Thompson.
4. Synchronous Communications. Synchronous communication between a client and a server incurs a time penalty which becomes problematic in a system where machines need fast communication. The solution is not buying more expensive and faster hardware but using asynchronous communication, said Thompson. In this case, a client can send multiple requests to the server without having to wait on the response between them. This approach requires a change in how the client deals with responses, but it is worthwhile.
3. Text Encoding. Many times developers choose to send data over the wire using a text encoding format such as JSON, XML or Base64 because “it is human readable.” But Thompson noted that no human reads it when two systems talk to each other. It may be easier to debug with a simple text editor, but there is a high CPU penalty related to converting binary data to text and back. The solution is using better tools that understand binary, Thompson mentioning Wireshark.
2. API Design. Some of the mistakes with the most negative impact on performance are related to APIs, according to Thompson. He presented the following method signature as an example of bad code:
public void startElement( String uri, String localName, String qName, Attributes atts) throws SAXException
explaining:
If you work with XML you generally do nothing with those values [the three strings and the collection of attributes]. You allocate a bunch of stuff, you waste it all, and it all gets thrown away. That’s what’s eating our battery life, that’s what causing us to burn up trees. We need to make this much simpler.
He suggested instead the following signature for a similar method:
public void characters( char[] ch, int start, int length) throws SAXException
For those complaints that the later is harder to use than the former, Thompson suggested using composability, wrapping the one in the other to give people a choice. If performance is not an issue, then the first (with String) can be used, otherwise the second is much better.
Another example Thompson provided was string splitting:
public String[] split(String regex)
The performance problems related to this method signature are:
- the regular expression needs to be compiled every time the method is called
- a dynamic structure needs to be instantiated to store the initially unknown number of tokens in the string being parsed
- the return structure is a fixed size array forcing the tokens to be collected in a temporary structure then copied into the array
- if the caller wants to perform some operations on those tokens, such as sorting, it has to copy them into another structure
A better solution would be using an Iterable, which avoids intermediate token copies to be created in memory:
public Iterable split(String regex)
Another option is allowing the caller to provide the collection where the tokens are to be stored. Also, the caller should be given the option to pass a Set if it wants a distinct list of tokens, or a TreeMap if it wants an ordered list:
public void split( String regex, Collection<String> dst)
1. Logging. For the #1 performance culprit Thompson listed the time spent for logging. He showed a graph depicting the average time spent for a logging operation when the number of threads increases:
The graph indicates a 100% sequential operation taking place. No matter how many threads are used for logging, the time needed increases linearly. Thompson said that this kind of graph can be drawn for most logging systems existing out there, and “Loggers are one of the biggest bottlenecks that can be found in a system.” The solution would be to use an asynchronous logger.
Also, data recorded by the logger should be structured data that is later read and processed by a tool, not just a bunch of strings. For recording repeated errors, he suggested recording the error the first time it takes place, then just increment a counter telling how many times the respective error took place. Saving the same error over and over again is of no use. For debugging live systems, Thompson suggested using a code weaver, such as Byte Buddy, because it avoids writing and running unnecessary logging code.
The slides of the session can be accessed online (PDF).
About the Author
Abel Avram has been involved in many InfoQ editorial activities since 2008, enjoying writing news reports on Mobile, HTML, .NET, Cloud Computing, EA and other topics. He is co-author of Domain-Driven Design Quickly.
In the past he worked for many years as software engineer and project/team leader on legacy systems, Java and .NET. He started his career as an assistant professor at the Computer and Automatics Faculty of the Technical University of Timisoara, Romania.
If you are interested in submitting a news story or an educational article please contact him at abel [at] infoq.com.