Key Takeaways
- Benchmarks help communities to codify their understanding of user behaviour
- All benchmarks are vulnerable to gaming and cheating
- Tracing and profiling can replace stale user behaviour models from benchmarks
- Common tools do not exist to share trace data across projects
- Benchmarks can live forever as part of performance regression testing
The Chromium project recently announced that they are dropping Octane, a traditional JavaScript benchmark, in favour of real-world performance measurements collected through tracing and profiling to drive performance improvements.
The reason they give is that JavaScript performance, as measured by traditional benchmarks, has reached a plateau, and that in the end developers will always find a way to game them.
But are tracing and profiling the future of performance engineering outside of the fast-moving JavaScript community? And do all benchmarks have a shelf-life?
Benchmarks
All good benchmarks simulate real-world workloads. Their built-in ability to measure metrics such as execution time, latency, throughput, and operations per second gives developers insight into how their software performs.
Fundamentally, the purpose of benchmarks is to allow comparisons of different software versions and configurations. Applying an identical workload rules out all other factors; only the code differences are compared.
Having a canned workload is invaluable for writing and testing software optimisations because it gives developers a sense of how their users will experience those changes. Benchmarks are arbiters. They judge whether changes are good or bad for performance, and ultimately, good or bad for the user. A 15% improvement to a benchmark’s results might translate to a 25ms decrease in web page load time.
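To make the idea concrete, here is a minimal sketch of such a harness: it runs an identical canned workload against two implementations and reports operations per second for each. The functions parse_v1 and parse_v2 are hypothetical stand-ins for the code under test, not taken from any real project.

```python
# Minimal sketch of a benchmark harness: run the same canned workload
# against two implementations and compare operations per second.
# parse_v1 and parse_v2 are hypothetical stand-ins for the code under test.
import time

def parse_v1(line):
    return line.split(",")

def parse_v2(line):
    return tuple(line.split(","))

WORKLOAD = ["alpha,beta,gamma,delta"] * 100_000  # identical input for both runs

def ops_per_second(func):
    start = time.perf_counter()
    for line in WORKLOAD:
        func(line)
    elapsed = time.perf_counter() - start
    return len(WORKLOAD) / elapsed

for candidate in (parse_v1, parse_v2):
    print(f"{candidate.__name__}: {ops_per_second(candidate):,.0f} ops/sec")
```

Because the workload is identical for both runs, any difference in the reported numbers can be attributed to the code itself.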
And this is how many performance improvements have been developed for popular projects: take a benchmark and optimise the traversed code paths until you see a noticeable speed-up. Some projects even write their own benchmarks if none are readily available.
Benchmarks that test a specific component in an artificial or synthetic way are known as micro-benchmarks. Micro-benchmarks are particularly valuable for understanding how software will scale in the future, or what the absolute maximum performance of an individual component is, even if it’s not possible to fully load that component today.
They are useful for guiding optimisation when it is too cumbersome to run a full benchmark. For example, you might need to improve the performance of a cache layer that has no public API and can only be exercised indirectly, or a developer might want to reproduce a hard-to-trigger performance issue.
Micro-benchmarks get a bad reputation for being difficult to write correctly, but there are plenty of examples of them being used successfully to achieve performance gains.
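As a hedged illustration of writing one carefully, the sketch below times a single hypothetical component (an in-memory cache layer, echoing the example above) with a warm-up phase and repeated runs, taking the minimum to reduce noise. The Cache class and iteration counts are assumptions for illustration only.

```python
# Sketch of a micro-benchmark for one component in isolation.
# Cache is a hypothetical in-memory cache layer; timeit.repeat handles
# repetition, and taking the minimum reduces the impact of system noise.
import timeit

class Cache:
    def __init__(self):
        self._store = {}
    def put(self, key, value):
        self._store[key] = value
    def get(self, key):
        return self._store.get(key)

cache = Cache()
for i in range(10_000):          # warm-up: populate the cache before measuring
    cache.put(i, str(i))

def lookup():
    for i in range(10_000):
        cache.get(i)

best = min(timeit.repeat(lookup, number=10, repeat=5))
print(f"best of 5 runs: {best / (10 * 10_000) * 1e9:.0f} ns per lookup")
```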
Benchmarks are not just useful for improvements. They can be used as the basis for regression testing, ensuring that performance stays consistent as code is changed. Given that performance isn’t a binary state like broken/working, it’s not always obvious when a regression has occurred. Methodically tracking performance regressions is extremely important for mature projects.
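One minimal way to track regressions, sketched below, is to compare each fresh benchmark result against a stored baseline and fail the build when a metric is slower than an agreed tolerance. The baseline file name, metric names and 5% threshold are illustrative assumptions, not any particular project's convention.

```python
# Sketch of performance regression tracking: compare a fresh benchmark
# result against a stored baseline and fail if it regresses by more than
# an agreed tolerance. File name and threshold are illustrative only.
import json
import sys

BASELINE_FILE = "perf_baseline.json"    # e.g. {"page_load_ms": 250.0}
TOLERANCE = 0.05                        # allow 5% run-to-run noise

def check_regression(results):
    with open(BASELINE_FILE) as f:
        baseline = json.load(f)
    failed = False
    for metric, value in results.items():
        reference = baseline.get(metric)
        if reference is None:
            continue                    # new metric, nothing to compare against yet
        if value > reference * (1 + TOLERANCE):
            print(f"REGRESSION: {metric} {value:.1f} vs baseline {reference:.1f}")
            failed = True
    return failed

if __name__ == "__main__":
    # In CI this dictionary would come from the benchmark run itself.
    current = {"page_load_ms": 265.0}
    sys.exit(1 if check_regression(current) else 0)
```

Run in CI on every change, a check like this turns "performance feels slower" into an explicit, reviewable failure.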
Perhaps most important of all, publishing a well-crafted benchmark can codify an entire community’s understanding of interesting workloads and user behaviour. Since the best optimisations figure out the common case and tune for that, benchmarks can guide all developers (especially new ones) towards improving the code that matters most.
However, as the Google Chromium team pointed out, there are several drawbacks to benchmarks.
If a benchmark no longer represents a workload that is relevant to your project, or worse, it never did, then you may have to rewrite code that was optimised on the assumption that the benchmark reflected your users. The original work was likely a major waste of development time.
Sometimes your best bet may be to write a new benchmark altogether rather than updating an existing one.
But even if your benchmark is a true representation of current user behaviour, the configuration may be so complex that many people are using it incorrectly. The risk of this increases the more complex the benchmark becomes. Parameters can be copied and pasted with little to no thought about whether the configuration makes sense for the software being tested.
Not everyone has the best of intentions when running benchmarks. Some will intentionally try to exploit every loophole to get the winning results. Some benchmarks have resorted to specifying the permitted flags to prevent the compiler from optimising the code too heavily. Heavy optimisations can allow compilers to eliminate or simplify generated code and defeat the purpose of the benchmark.
When this is done purely to benefit benchmark scores and not users, it is known as gaming the benchmark, or optimising for the benchmark, and the optimisations are known as “benchmark specials”. The Chrome V8 JavaScript engine contains a SunSpider benchmark special:
“V8 uses a rather simple trick: Since every SunSpider test is run in a new <iframe>, which corresponds to a new native context in V8 speak, we just detect rapid <iframe> creation and disposal (all SunSpider tests take less than 50ms each), and in that case perform a garbage collection between the disposal and creation, to ensure that we never trigger a GC while actually running a test.” – Benedikt Meurer
Tracing and Profiling
Historically, tracing and profiling required separate tools, but many projects now include profilers to help developers understand runtime behaviour. Not only do these profilers provide lots of details, they’re often so lightweight that they’re enabled in production.
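As a rough sketch of that record-then-inspect workflow, the example below uses Python’s built-in cProfile, a deterministic profiler that adds more overhead than the lightweight sampling profilers typically enabled in production, but the pattern is the same. The handle_request function is a hypothetical stand-in for a real code path.

```python
# Sketch of profiling a single code path to see where time is spent.
# cProfile is deterministic and adds overhead, so production systems
# usually prefer sampling profilers, but the workflow is identical:
# record, then inspect the hottest functions.
import cProfile
import pstats

def handle_request():
    total = 0
    for i in range(200_000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
handle_request()
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```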
Benchmarks are based on workload scenarios that are frozen in time. Updating them usually requires releasing a new version, and invariably some people will still be running the old version rather than the latest code. Rolling out updates is easy if you control the platform, but as decades of desktop software experience have taught us, getting end users to install patches can be a royal pain.
Gathering data from users with tracing and profiling can help here. It completely removes the need to model user behaviour because the data describes the behaviour. The data always provides an accurate picture of how users are using your product at the time of collection. When it becomes old and stale, you can simply collect new data.
With Continuous Integration/Deployment being ubiquitous, code is changing almost constantly. It’s a very real possibility that a user-workload model built from last week’s analysis is already out of date today.
It’s not just that trace data is more current; it also gives developers a more holistic view, because every detail of a transaction can be captured. This makes retrospective performance analysis possible and helps pinpoint which events caused performance to nosedive. Tracing can be a blessing for recording performance issues that occur infrequently.
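A minimal sketch of that idea is shown below: each stage of a transaction is recorded as a timestamped, named event, so slow cases can be examined after the fact. The JSON-lines event format and span names are illustrative assumptions rather than any particular tool’s schema.

```python
# Sketch of lightweight tracing: record a timestamped, named event for
# each stage of a transaction so slow requests can be analysed later.
# The JSON-lines format here is illustrative, not a standard schema.
import json
import time
from contextlib import contextmanager

TRACE_FILE = open("trace.jsonl", "a")

@contextmanager
def trace_span(name, **attrs):
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        event = {"name": name, "ts": time.time(), "duration_ms": duration_ms, **attrs}
        TRACE_FILE.write(json.dumps(event) + "\n")
        TRACE_FILE.flush()

# Example transaction: every stage is captured, so a retrospective query
# can show exactly which step made a slow request slow.
with trace_span("handle_request", user="anonymous"):
    with trace_span("load_data"):
        time.sleep(0.01)
    with trace_span("render"):
        time.sleep(0.005)
```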
Avoiding benchmarks essentially eliminates gaming; the only performance that matters is the real-world experience of users. It’s no longer possible for software teams to cheat because all optimisations that improve the user experience are fair game. You’re no longer playing for a benchmark result, you’re playing for user happiness.
Finally, tracing allows low-level details to be recorded even in business-sensitive situations, and optimisations to be designed without knowing, or caring, exactly how users are working.
But some of the things that make tracing desirable for developers have a negative impact on the larger community.
When trace data is bundled with the project that generated it, or worse, kept private inside a company, developers cannot build on the work of others, and every project must understand performance from scratch. With tracing, developers cannot improve their software without first capturing data, whereas a traditional benchmark, once designed and written, can be used indefinitely to tune and optimise any software project.
Furthermore, the origin of data is extremely important. Should all users be treated equally? Which workloads will still be important in six months? Writing benchmarks forces you to decide on these things up front.
Few software communities are ready to replace benchmarks with trace and profile data. Getting to that point would require years of work, and a fundamental shift in developer skills to capture, store and analyse such data. At the very least, tools for sharing data among separate projects would be needed.
Do all benchmarks have a shelf-life?
There are many benchmarks out there. Some have outlived their usefulness and have been retired, and occasionally new ones have replaced them. The Chromium team claim that the Octane benchmark has reached the point of diminishing returns, but are all benchmarks ultimately destined for obsolescence?
The Linux kernel community still runs operating system benchmarks that were developed in the 1990s and early 2000s. Because the POSIX API hasn’t undergone much change, not only do old benchmarks run correctly, they still provide a realistic application workload. This is despite the fact that new kernel releases average over 12,000 patches from more than 1,700 developers.
The SPEC CPU2006 benchmarks have been available since 2006 and are still in use. But benchmarks cannot survive without updates – perhaps to combat gaming, fix bugs or add new features – and SPEC CPU2006 has had a couple since its first release. And that does require the support of the benchmark authors and community – they have to want the benchmark to survive, and be willing to maintain it.
So it seems that not all benchmarks outlive their usefulness. As long as they provide a representative workload they will continue to be used, if not for new optimisations, then at least for performance regression testing.
Conclusion
Benchmarks codify an entire community’s understanding of typical user behaviour. Their standalone nature encourages use by any project for regression testing and performance improvements without the cost of analysing user behaviour from scratch.
It’s inspiring that the Chromium team want to continue setting new records for JavaScript performance. But retiring the Octane benchmark without a replacement will make it more difficult for other projects to optimise their JavaScript engines. Using tracing and profiling instead of benchmarks also raises questions about where the data comes from and the absence of shared tools.
Outdated benchmarks are a problem, but the answer should be “better benchmarks”, not “no benchmarks”. We need more education and encouragement in writing them, and to correct the stigma that they’re always difficult to write, misleading, or just plain broken.
It’s too early to tell if this is the start of a trend – the JavaScript community is known for its fast pace of development. But we should at least keep discussing the merits and drawbacks of benchmarks, and appreciate the performance they’ve helped us achieve until now.
Benchmarks may not be dead yet, but they need all the support we can give them.
About the Author
Matt Fleming is a Senior Performance Engineer at SUSE and a freelance writer. Twitter: @fleming_matt