In her presentation at QCon San Francisco, Ying Dai shared two critical software engineering migration stories, one focused on production monitoring and the other on production deployments with automated validations. Both migrations were driven by the goal of enhancing engineering efficiency, but each came with its own challenges and lessons.
The legacy telemetry system struggled to scale with increasing demands, and often failed to deliver timely and accurate information, Dai said. This led to extensive on-call hours and troubleshooting efforts, ultimately hindering engineering productivity, she added.
To build the new system, they thoroughly analyzed the existing one, pinpointing its shortcomings and the challenges their internal engineers faced, Dai mentioned. They explored available options and designed a new system with a strategic transition plan:
Our first objective was to develop a new system that would offer both high availability and reliability. Recognizing the critical importance of performance and accuracy, we implemented a rigorous testing methodology. This involved a dual-writing process, where data was simultaneously written to the existing legacy and new systems.
This strategy allowed them to thoroughly verify the integrity and functionality of the new system while ensuring uninterrupted service and data preservation during the transition, Dai said.
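As an illustration of the dual-writing idea, here is a minimal Python sketch. It is not the team's implementation: `InMemoryStore`, `DualWriter`, and `mismatches` are hypothetical names, and the in-memory dictionaries stand in for real telemetry backends. It assumes the legacy system stays the source of truth during the transition, so a failed write to the new system is counted rather than propagated.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a telemetry backend (legacy or new)."""
    points: dict = field(default_factory=dict)

    def write(self, metric: str, timestamp: int, value: float) -> None:
        self.points[(metric, timestamp)] = value

    def read(self, metric: str, timestamp: int):
        # Returns None when the data point was never written.
        return self.points.get((metric, timestamp))


class DualWriter:
    """Writes every data point to both systems and can compare them later."""

    def __init__(self, legacy, new):
        self.legacy = legacy
        self.new = new
        self.new_write_failures = 0

    def write(self, metric: str, timestamp: int, value: float) -> None:
        # Assumption: the legacy system remains the source of truth,
        # so errors from its write are allowed to propagate.
        self.legacy.write(metric, timestamp, value)
        try:
            self.new.write(metric, timestamp, value)
        except Exception:
            # A failure in the new system must not break existing service.
            self.new_write_failures += 1

    def mismatches(self, keys, tolerance: float = 1e-9):
        """Return the (metric, timestamp) keys whose values differ."""
        diffs = []
        for metric, timestamp in keys:
            old = self.legacy.read(metric, timestamp)
            new = self.new.read(metric, timestamp)
            if old is None or new is None:
                if old != new:  # present in only one of the two systems
                    diffs.append((metric, timestamp))
            elif abs(old - new) > tolerance:
                diffs.append((metric, timestamp))
        return diffs
```

Running `mismatches` over a sample of recently written keys gives a simple accuracy signal for the new system before any reads are switched over to it.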
Dai shared a second migration story about service deployment. The previous service deployment process, which relied entirely on manual steps, lacked necessary checks and validations, Dai said. This made it simple to roll out changes but also led to a higher frequency of incidents.
During the rollout, they encountered friction in the engineering experience, which prompted them to take action. Based on engineers’ feedback, they implemented improvements aimed at ensuring a smoother and more seamless transition, as Dai explained:
This experience underscored the importance of customer-centricity and iterative development in achieving successful technology implementations.
During their research, they saw potential for enhancing the automated canary analysis, Dai said. By performing this analysis directly in production on canary instances, they could offer immediate and tangible value to their engineers by improving the reliability of their rollout processes.
They designed the automated canary analysis rules to be universally applicable across all services, eliminating the need for engineers’ input. This "zero onboarding effort" approach provided ease of use and seamless integration, Dai said.
They also understood the importance of flexibility, and ensured that their system allowed engineers to customize validation rules according to their specific needs and preferences. This adaptability lets engineers tailor the analysis process to their own requirements, as Dai explained:
In essence, our customer-centric approach, coupled with our strategic focus on automated canary analysis and commitment to both simplicity and flexibility, has paved the way for substantial improvements in our engineering efficiency and overall user satisfaction.
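The combination of default rules that apply to every service and optional per-service customization could be sketched in Python as follows. This is only an illustration under stated assumptions: the `Rule` type, the metric names (`error_rate`, `p99_latency_ms`), the thresholds, and the relative-increase comparison against a baseline are all hypothetical, since the talk summary does not describe the actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str
    # Allowed relative increase of the canary over the baseline,
    # e.g. 0.10 tolerates a canary value up to 10% above the baseline.
    max_relative_increase: float


# Hypothetical defaults applied to every service ("zero onboarding effort").
DEFAULT_RULES = (
    Rule("error_rate", 0.10),
    Rule("p99_latency_ms", 0.20),
)


def rules_for(service: str, overrides: dict) -> list:
    """Start from the defaults, then apply a service's own rules on top."""
    merged = {rule.metric: rule for rule in DEFAULT_RULES}
    for rule in overrides.get(service, ()):
        merged[rule.metric] = rule  # replaces a default or adds a new metric
    return list(merged.values())


def analyze(service: str, baseline: dict, canary: dict, overrides=None) -> list:
    """Return the metrics on which the canary is worse than allowed."""
    failures = []
    for rule in rules_for(service, overrides or {}):
        if rule.metric not in baseline or rule.metric not in canary:
            continue  # no data for this metric, so the rule is skipped
        limit = baseline[rule.metric] * (1 + rule.max_relative_increase)
        if canary[rule.metric] > limit:
            failures.append(rule.metric)
    return failures
```

A service that does nothing is checked against `DEFAULT_RULES`, while a team with unusual traffic can pass an entry in `overrides` to loosen, tighten, or extend the checks. An empty result from `analyze` would let the rollout proceed.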
It’s essential to understand and actively address the needs of engineers, Dai said. This involves fostering open communication, encouraging their early participation in the change process, providing targeted training, and seeking continuous feedback, she explained:
By prioritizing their needs, we can minimize disruptions, cultivate a collaborative environment, and implement changes that empower our engineers and ultimately benefit the entire organization.
The benefits they got from the migrations were increased reliability, a decrease in the number of incidents, and improved overall availability. These positive outcomes validated the effectiveness of their approach in enhancing the system’s performance, Dai concluded.
InfoQ interviewed Ying Dai about their migrations.
InfoQ: How did the transition toward a new telemetry system go?
Ying Dai: The transition to a new system was a complex and challenging undertaking. It required a significant investment of time, resources, and effort from us (owners of both the new system and the old one) as well as our customers (internal engineers mostly) involved. We listened to our engineers, using their feedback for continuous improvements and effectively bridging any gaps in user experience between the old and new systems.
By carefully considering their input, we were able to identify and address any gaps in user experience that emerged between the old and new systems. This customer-centric approach allowed us to ensure a smooth and seamless transition, minimizing disruptions and maintaining a high level of user satisfaction.
InfoQ: How did you engage with the software engineers during the migrations?
Dai: To bolster our engineering efficiency, we initiated a comprehensive investigation into our engineers’ experiences. Through in-depth interviews with our internal engineers, we were able to pinpoint their primary pain points and challenges.
For example, during the production deployment migration project, we noticed that while some customers have integration tests and are open to developing more, these tests often yield limited value due to discrepancies between the testing and production environments. Such insights served as the foundation for identifying key areas where we could implement improvements and subsequently develop a strategic, actionable plan.