Thomas Dullien, distinguished software engineer at Elastic, shared at QCon London some lessons learned from analyzing the performance of large-scale compute systems.
The co-founder of Optimyze started the presentation by arguing that with the death of Moore's law, the shift from on-premise software to SaaS, and the widespread adoption of metered cloud computing, efficiency is now critical for businesses and directly impacts margins.
Discussing how much security and performance engineering have in common, Dullien shared a few performance challenges he experienced over the years in different roles and projects. First of all, developers should not forget that the software they use has usually not been designed for the hardware it runs on today:
Your language is designed for computers that are extinct!
Using Java as an example, Dullien pointed to some of the challenges:
For example, traversing large linked graph structures on the heap (garbage collection) or assuming that dereferencing a pointer does not come with a significant performance hit were entirely correct assumptions in 1991 but entirely wrong today, and you end up paying in many surprising ways.
According to Dullien, it is common to see 10-20% of all CPU cycles spent in garbage collection, with many Java developers becoming experts at tuning GCs and high-performance Java developers avoiding allocations altogether.
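To make the allocation-avoidance point concrete, here is a minimal sketch (the class and method names are illustrative, not from the talk) contrasting a hot loop that creates a boxed object per iteration with one that reuses a preallocated primitive array and therefore produces no garbage for the collector to chase:

```java
import java.util.ArrayList;
import java.util.List;

public class AllocationDemo {

    // Allocation-heavy variant: every iteration creates garbage that the
    // GC later has to traverse and reclaim.
    static long sumWithAllocations(int n) {
        List<Long> values = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            values.add((long) i * i); // autoboxing allocates a Long per element
        }
        long sum = 0;
        for (Long v : values) {
            sum += v;
        }
        return sum;
    }

    // Allocation-free variant: a preallocated primitive array is reused,
    // so the hot loop produces no garbage at all.
    static long sumWithoutAllocations(int n, long[] scratch) {
        for (int i = 0; i < n; i++) {
            scratch[i] = (long) i * i; // primitives live inline in the array
        }
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += scratch[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        long[] scratch = new long[n]; // allocated once, outside the hot path
        System.out.println(sumWithAllocations(n));
        System.out.println(sumWithoutAllocations(n, scratch));
    }
}
```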
Comparing spinning disks and NVMe SSDs, Dullien highlighted how early design choices in applications and databases impact performance today:
A surprising number of storage systems have fixed-size thread pools, mmap-backed storage and rely on large read-aheads, choices that make sense if you are on a spinning disk (...) Modern SSDs are performance beasts, and you need to think carefully about the best way to feed them.
For example, as a single thread can only originate around 3,000 IOPS, saturating a 170k IOPS drive requires roughly 56 threads constantly hitting page faults. Therefore, for blocking I/O, thread pools are often too small. Cloud services provide a different challenge:
Cloud-attached storage is an entirely different beast, with very few DBMS optimized to operate in the "high-latency, near limitless concurrency" paradigm.
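Returning to the thread-pool arithmetic above, the sketch below redoes the napkin math with the figures quoted in the talk; the pool construction is only there to show the sizing and is not a tuned configuration:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IoPoolSizing {
    public static void main(String[] args) {
        // Figures quoted in the talk: a single thread doing blocking I/O
        // originates roughly 3,000 IOPS; a modern NVMe drive offers ~170,000.
        int iopsPerThread = 3_000;
        int driveIops = 170_000;

        // Napkin math: threads needed to keep the drive busy with blocking I/O.
        int threadsNeeded = (int) Math.ceil((double) driveIops / iopsPerThread);
        System.out.println("Threads needed to saturate the drive: " + threadsNeeded); // ~57

        // A pool sized for the CPU count (a common default) would leave most
        // of the drive's throughput unused.
        int cpuSizedPool = Runtime.getRuntime().availableProcessors();
        System.out.println("Typical CPU-sized pool: " + cpuSizedPool);

        ExecutorService ioPool = Executors.newFixedThreadPool(threadsNeeded);
        ioPool.shutdown(); // pool shown only for the sizing; no work submitted
    }
}
```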
Dullien warned that a small set of common libraries (allocators, garbage collectors, compression, FFmpeg, ...) embedded in many applications globally consumes the most CPU: in almost every large organization, the CPU cost of a common library will eclipse the cost of the most heavyweight app. The organization chart matters too: vertically integrated organizations are better at identifying and fixing these libraries, reaping cascading benefits everywhere.
Dullien then moved on to benchmarking and warned of statistical nightmares:
High variance in measurements means it is harder to tell if your change improves things, but people do not fear variance enough.
Noisy neighbors on cloud instances, inconsistent results across multiple runs, and benchmarks that do not match production deployments are other common issues affecting benchmarking.
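As a sketch of why variance matters (the run counts and timings below are made-up assumptions, not data from the talk), the snippet compares the mean improvement between two sets of benchmark runs against their run-to-run noise; a roughly 2% win disappears when the noise is of the same order:

```java
import java.util.Arrays;

public class BenchmarkVariance {

    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    static double stdDev(double[] xs) {
        double m = mean(xs);
        double variance = Arrays.stream(xs).map(x -> (x - m) * (x - m)).sum() / (xs.length - 1);
        return Math.sqrt(variance);
    }

    public static void main(String[] args) {
        // Hypothetical wall-clock times (ms) for baseline and "optimized" builds.
        double[] baseline  = {102.0, 97.5, 105.3, 99.1, 103.8, 96.4, 104.9, 98.2};
        double[] optimized = {100.1, 95.9, 103.2, 97.8, 101.5, 94.7, 102.6, 96.3};

        double improvement = mean(baseline) - mean(optimized);        // ~1.9 ms
        double noise = Math.max(stdDev(baseline), stdDev(optimized)); // ~3.5 ms

        System.out.printf("Improvement: %.2f ms, run-to-run noise: %.2f ms%n",
                improvement, noise);
        // When noise >= improvement, more runs (or lower variance) are needed
        // before the change can honestly be called a win.
    }
}
```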
Dullien offered advice spanning development, business, mathematics, and hardware, with a few key points for practitioners working on performance:
- Know your napkin math (see the sketch after this list)
- Accept that tooling is nascent and disjoint
- Always measure; the perpetrator is often not the usual suspect
- There is plenty of low-hanging fruit; do not leave easy wins on the table
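As an example of napkin math, the sketch below turns the earlier "10-20% of all CPU cycles spent in garbage collection" figure into a yearly cloud bill; the fleet size and per-vCPU price are made-up assumptions, not numbers from the talk:

```java
public class NapkinMath {
    public static void main(String[] args) {
        // Assumptions (not from the talk): a fleet of 2,000 vCPUs billed at
        // $0.04 per vCPU-hour, running around the clock.
        int vcpus = 2_000;
        double dollarsPerVcpuHour = 0.04;
        double hoursPerYear = 24 * 365;

        // Figure from the talk: 10-20% of CPU cycles go to garbage collection.
        double gcFractionLow = 0.10;
        double gcFractionHigh = 0.20;

        double yearlyBill = vcpus * dollarsPerVcpuHour * hoursPerYear;
        System.out.printf("Yearly compute bill: $%,.0f%n", yearlyBill);
        System.out.printf("Spent on GC alone:  $%,.0f - $%,.0f%n",
                yearlyBill * gcFractionLow, yearlyBill * gcFractionHigh);
    }
}
```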
Dullien's talk concluded with some thoughts on the inadequacy of existing tooling and where things could and should improve, focusing on CO2 reduction, cost accounting, latency analysis, and cluster-wide "truly causal" profiling.