During the second day of QCon San Francisco 2023, Yao Yue, a platform engineer, distributed systems aficionado, cache expert, and the founder of IOP Systems, presented on performance engineering. Her session was part of the "Platform Engineering Done Well" track.
In her session, Yue discussed how performance engineering is evolving in the modern era. For decades, hardware advancements kept many performance engineers on the sidelines, but now, at a pivotal moment, their skills are more crucial than ever.
The hardware engineering domain is dealing with its own complexities, from power and thermal caps to many other challenges. As the software landscape grows into a diverse and interconnected ecosystem, performance engineering is emerging as the unsung hero of the tech world. With a deep and complex software stack, there is no magic button for solving performance issues: performance is an intricate system property, and understanding it requires a meticulous counting exercise.
Yue dived into the various factors in performance engineering, where performance is interconnected with security, availability, reliability, and cost. Performance engineering seeks to optimize performance within certain bounds, recognizing that there is a limit to how fast a system can be even when all the right measures are in place. She highlighted the notion that high-performance services share common traits while underperforming services each have their own unique issues, and encouraged engineers to benchmark their systems against exemplary performers.
Yue continued the session by outlining a blueprint for performance engineering at scale, emphasizing the importance of combining modeling and counting methodologies and translating them into data engineering problems. She discussed the crucial role of collecting data using tools like Rezolus, a performance telemetry agent, and highlighted the significance of long-term metrics for signal aggregation.
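To make the idea of long-term metrics for signal aggregation more concrete, the following is a minimal sketch, not taken from the talk and not how Rezolus works internally. It assumes high-resolution samples arrive as `(timestamp_seconds, value)` pairs and rolls them up into fixed windows that keep a few summary statistics. The names `rollup` and `nearest_rank` and the 60-second window are hypothetical choices made for illustration.

```python
from collections import defaultdict


def nearest_rank(sorted_values, p):
    """Return the p-th percentile of a sorted list using the nearest-rank rule."""
    n = len(sorted_values)
    rank = (p * n + 99) // 100  # integer ceil(p * n / 100)
    return sorted_values[max(rank, 1) - 1]


def rollup(samples, window_s=60):
    """Group (timestamp_s, value) samples into fixed windows and summarize each.

    Hypothetical helper: the window size and the chosen statistics are
    assumptions, not something described in the session.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its window.
        buckets[ts - ts % window_s].append(value)

    summary = {}
    for start in sorted(buckets):
        values = sorted(buckets[start])
        summary[start] = {
            "count": len(values),
            "min": values[0],
            "max": values[-1],
            "p50": nearest_rank(values, 50),
            "p99": nearest_rank(values, 99),
        }
    return summary


if __name__ == "__main__":
    # Two minutes of synthetic per-second samples.
    samples = [(t, t + 1) for t in range(120)]
    for start, stats in rollup(samples).items():
        print(start, stats)
```

Keeping percentiles alongside the minimum and maximum, rather than only an average, is one way to preserve tail behavior when fine-grained samples are compacted for long-term retention.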
Yue touched upon the practical use of data through tools like EasyPerf (built internally), which made garbage collection (GC) wins accessible. Moreover, Yue emphasized the importance of curating data for higher-quality traces and the role of trace aggregation pipelines in creating indices. She depicted performance engineering as a holistic approach, where understanding the interactions between different services, metrics, and attributes is essential.
Various insights, tools, and models for analyzing system performance were introduced. These included Service Dependency Explorer (also built internally) and LatenSeer, a causal model of end-to-end latency distribution. Yue also addressed the challenges and questions that performance engineering teams face, from aligning with business values to making strategic decisions.
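As a rough illustration of what modeling an end-to-end latency distribution can involve, the sketch below composes per-service latency distributions along a dependency graph by random sampling. This is not LatenSeer's method, which the session summary does not detail. It assumes only two composition rules: calls made one after another add their latencies, and calls made concurrently cost as much as the slowest branch. All service names, distributions, and parameters are hypothetical.

```python
import math
import random


def constant(ms):
    """Sampler for a stage with a fixed latency in milliseconds."""
    return lambda rng: ms


def lognormal(median_ms, sigma):
    """Sampler for a stage with log-normally distributed latency."""
    return lambda rng: rng.lognormvariate(math.log(median_ms), sigma)


def serial(*stages):
    """Stages run one after another, so their latencies add up."""
    return lambda rng: sum(stage(rng) for stage in stages)


def parallel(*branches):
    """Branches run concurrently, so the slowest one dominates."""
    return lambda rng: max(branch(rng) for branch in branches)


def percentile(sampler, p, n=10_000, seed=0):
    """Estimate the p-th percentile of a sampler from n seeded random draws."""
    rng = random.Random(seed)
    draws = sorted(sampler(rng) for _ in range(n))
    rank = (p * n + 99) // 100  # integer ceil(p * n / 100)
    return draws[max(rank, 1) - 1]


if __name__ == "__main__":
    # Hypothetical request: a gateway, then a fan-out to two backends, then rendering.
    request = serial(
        lognormal(2.0, 0.3),        # gateway
        parallel(
            lognormal(5.0, 0.5),    # timeline backend
            lognormal(8.0, 0.8),    # ads backend
        ),
        lognormal(1.0, 0.2),        # rendering
    )
    print("p50 ms:", round(percentile(request, 50), 2))
    print("p99 ms:", round(percentile(request, 99), 2))
```

A model of this shape makes it possible to ask what-if questions, such as how the end-to-end tail would move if one backend became faster, by swapping in a different sampler for that stage and re-running the estimate.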
Lastly, Yue explained how the creation of Twitter's Performance Engineering team unfolded in phases characterized by innovation and adaptability. There was no established blueprint for building a team like that, and the team approached performance engineering with a philosopher's mindset, seeking profound understanding rather than predefined solutions.
- The initial phase consisted of just four individuals who ventured into performance telemetry and took on many diverse tasks that extended beyond traditional performance engineering. Amidst this pioneering phase, the team began crafting their plans and vision for the future.
- As the team expanded to eight members in the second phase, they delved into critical aspects of performance engineering, including tracing and long-term metrics. At the same time, they kept surveying emerging technologies and made strides with tools like the aforementioned Rezolus. This phase also marked the team's transition towards offering more comprehensive consultation services.
- In the third phase, with a growing team of ten or more individuals, primary datasets became usable, and inbound consultation requests put the team in high demand. They actively participated in critical projects and crisis response, broadening their impact. This phase was characterized by the development of additional products and a focus on branding.
- Finally, in the fourth and last phase, with a team of approximately ten members, the team coordinated multi-team efficiency projects, strategized and executed platform investments, and emphasized investing in ways to accelerate their own progress. Yet this phase ended abruptly in 2022 due to economic factors.
In the closing moments of the session, Yue emphasized that the technical and social aspects of the field are equally important, underscoring the need for a scalable performance methodology. She concluded by stressing the significance of building a cohesive and proficient performance engineering team and of being nice to each other!