
Scaling Challenges: Productivity, Cost Efficiency, and Microservice Management

Key Takeaways

  • Adjusting team structures to align with clear business goals can significantly increase productivity and innovation.
  • Centralizing cost-saving initiatives leads to more efficient resource allocation and reduces the risks linked to decentralized efforts.
  • Continuous monitoring and adaptation are essential for managing increased traffic and maintaining system reliability.
  • Regular reviews of long-term traffic patterns help anticipate bottlenecks and preempt critical failures.
  • Proactive coordination across microservices is crucial for ensuring resilience and supporting evolving business needs.

This article examines the technical complexities and strategic adjustments undertaken by Trainline, a digital platform in the European rail industry. By looking at challenges such as managing peak transaction volumes and orchestrating a microservice architecture, it draws out the lessons learned from Trainline's journey through the dynamic landscape of digital transportation platforms. The article is a summary of my presentation at QCon London 2024.

Trainline, Europe's leading digital rail platform, is a one-stop destination for purchasing rail tickets and managing entire rail journeys. With a strong presence across the UK, Spain, France, Italy, and much of the rest of Europe, the company retails tickets and accompanies users throughout their rail travel experience: from providing platform information to assisting with journey changes and compensation for delays, Trainline aims to deliver a seamless journey.

With approximately £5 billion in net ticket sales last year, the company's impact is considerable, both financially and technically. In terms of technical metrics, the platform handles around 350 searches per second for journeys and origin-destination pairs, covering about 3.8 million unique routes monthly. This search volume underscores the complexity of the platform's search functionality, which caters to diverse travel needs across Europe. The company's tech and product organization comprises around 500 people, primarily focused on technical work. Notably, Trainline tracks every live train in Europe in real time, reflecting the scale of its data management and operational responsibilities.

Understanding Trainline's Technical Challenges

We operate in a dynamic environment where technical challenges are multifaceted, requiring sophisticated solutions to ensure seamless service delivery to our vast customer base. Among these challenges, perhaps the most daunting is the complex supply aggregation process, which involves combining data from over 270 API integrations.

This task is compounded by the lack of standardization in the rail industry, where each integration comes with its own set of specifications and limitations. To navigate this complexity, Trainline employs intricate journey planning algorithms that analyze diverse API access patterns, enabling the platform to aggregate supply effectively while ensuring smooth operations for users.
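To make the aggregation pattern concrete, the sketch below shows one common way to hide heterogeneous supplier APIs behind a single search interface: each integration gets its own adapter, and a fan-out layer merges the normalized results. This is a minimal illustration under assumed names, not Trainline's actual code.

```python
# Hedged sketch: normalizing heterogeneous supplier APIs behind one search
# interface. All class, field, and function names are hypothetical.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Journey:
    origin: str
    destination: str
    departs: datetime
    arrives: datetime
    fare_pence: int
    carrier: str


class CarrierAdapter(ABC):
    """One adapter per supplier integration, hiding that supplier's quirks
    (authentication, pagination, rate limits, data model)."""

    @abstractmethod
    def search(self, origin: str, destination: str, date: datetime) -> list[Journey]:
        ...


def aggregate_search(adapters: list[CarrierAdapter], origin: str,
                     destination: str, date: datetime) -> list[Journey]:
    """Fan out the query to every adapter and merge the normalized results."""
    journeys: list[Journey] = []
    for adapter in adapters:
        try:
            journeys.extend(adapter.search(origin, destination, date))
        except Exception:
            # A failing supplier should degrade results, not break the whole search.
            continue
    return sorted(journeys, key=lambda j: (j.departs, j.fare_pence))
```

The adapter boundary is where each integration's specifications and limitations are absorbed, so the journey planning layer only ever sees one normalized shape of supply data.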

Managing transactions over a finite inventory poses another layer of complexity for us. Unlike purely digital goods, where inventory is virtually unlimited, our platform must contend with the finite availability of seats on specific trains with unique fares. This necessitates robust and reliable systems capable of handling peak transaction volumes, particularly during periods of high demand. With over 1,300 transactions per minute during peak times, maintaining the reliability and speed of transaction processing is paramount to meeting customer demand and upholding service quality standards.
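One widely used way to keep a finite inventory consistent under concurrent purchases is optimistic concurrency: each inventory record carries a version, and a decrement only succeeds if the version has not changed since the record was read. The in-memory sketch below illustrates that idea under assumed names; it is not Trainline's booking system, and in production the compare-and-swap would be a conditional database update.

```python
# Hedged sketch of optimistic concurrency over a finite seat inventory.
# Record fields, keys, and fare codes are hypothetical.
from dataclasses import dataclass


@dataclass
class InventoryRecord:
    train_id: str
    fare_code: str
    seats_left: int
    version: int


class SoldOut(Exception):
    pass


class ConcurrentUpdate(Exception):
    pass


def reserve_seat(store: dict, key: tuple, expected_version: int) -> InventoryRecord:
    """Decrement seats_left only if the record is unchanged since it was read.
    In a real system this would be a conditional update, e.g.
    UPDATE ... WHERE version = :expected_version."""
    record = store[key]
    if record.version != expected_version:
        raise ConcurrentUpdate("inventory changed since it was read; re-read and retry")
    if record.seats_left <= 0:
        raise SoldOut(f"no seats left for {key}")
    record.seats_left -= 1
    record.version += 1
    return record


# Usage: read the record, then attempt the reservation with the version seen.
store = {("LDN-PAR-0830", "ADV"): InventoryRecord("LDN-PAR-0830", "ADV", 2, 7)}
record = store[("LDN-PAR-0830", "ADV")]
reserve_seat(store, ("LDN-PAR-0830", "ADV"), expected_version=record.version)
```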

Furthermore, meeting customer expectations for instant ticket fulfillment presents yet another significant challenge for our system. While digital products can be delivered instantaneously, rail tickets often require time to process and validate due to industry-level processes. Given that a considerable portion of tickets are purchased on the day of travel, Trainline's systems must seamlessly interact with industry processes to generate barcodes for scanning at station gates. This ensures timely ticket delivery and meets the high expectations of customers for a seamless travel experience.
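Because barcode issuance depends on industry-level processes outside our direct control, fulfillment is effectively asynchronous: the system waits for the ticket to be issued within a deadline and falls back if it is not. The sketch below illustrates only that waiting step; the function name, deadline, and fallback are illustrative assumptions, not Trainline's fulfillment pipeline.

```python
# Hedged sketch: wait for an industry process to issue a ticket barcode,
# with a deadline so day-of-travel purchases are not left hanging.
import time


class FulfilmentTimeout(Exception):
    pass


def wait_for_barcode(fetch_barcode, deadline_seconds: float = 30.0,
                     poll_interval: float = 1.0) -> str:
    """fetch_barcode() is a hypothetical callable returning the barcode string
    once issued, or None while issuance is still pending."""
    deadline = time.monotonic() + deadline_seconds
    while time.monotonic() < deadline:
        barcode = fetch_barcode()
        if barcode is not None:
            return barcode
        time.sleep(poll_interval)
    raise FulfilmentTimeout("ticket not issued before the deadline; trigger a fallback")
```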

These technical challenges underscore the intricacies of operating within the rail industry and serving the diverse needs of our customer base. From complex supply aggregation to managing transactions over a finite inventory and meeting customer expectations for instant ticket fulfillment, we navigate these challenges with innovative solutions and a robust technical infrastructure.

Complexities of Scaling

Our focus is on three critical lessons learned from scaling at Trainline: improving team productivity, optimizing platform cost efficiency, and managing growth and reliability. The first lesson examines how team and architectural changes affect productivity. The second explores strategies for achieving cost efficiency while scaling our platform. The third addresses handling increased traffic and enhancing reliability. These lessons are illustrated with real-world examples, demonstrating the challenges we faced and the solutions we implemented. By delving into them, we uncover strategies and best practices that emerged from our scaling journey.

Scaling Team Productivity

When I joined Trainline in July 2021, the structure of the engineering team followed a cluster model, where each team owned specific components of the technical stack. This included separate teams for Android, iOS, web, backend, supply, and order processing. At that time, we had approximately 350 engineers. While this model seemed intuitive, it led to inefficiencies and low productivity. Projects often required collaboration across multiple teams, resulting in delays and dependencies.

Key Strategies Considered

  • Transition to Platform and Verticals: Grouping engineers with diverse skills into teams aligned with clear business goals to improve alignment and productivity.
  • Defining Clear Ownership: Ensuring teams had clear ownership of technical components and were accountable for both maintaining and advancing their respective areas.
  • Refining Structure: Narrowing the platform's scope to core services shared by all teams and integrating other technical areas into verticals.

Recognizing the need for a change, we initiated a reorganization in January 2022, transitioning to a platform and vertical structure. Under this new model, engineers with diverse skills were grouped into teams aligned with clear business goals. This restructuring aimed to improve alignment and productivity within the organization. However, it also introduced challenges, including tensions between platform and vertical teams regarding priorities and technical debt management.

In late 2023, we further refined the structure by narrowing the platform's scope to core services shared by all teams. Additionally, we integrated other technical areas into verticals, now referred to as diagonals. This adjustment sought to retain the advantages of both previous models while minimizing friction. Throughout these changes, a key focus was to ensure teams had clear ownership of technical components and were accountable for both maintaining and advancing their respective areas. The goal was to strike a balance between alignment with business objectives, productivity, and the quality of technical work.

This iterative process of organizational restructuring underscores the importance of adaptability and continuous improvement in dynamic environments. By revisiting and refining the team structure, we aimed to optimize collaboration, enhance productivity, and foster innovation within the organization. Clear ownership and accountability were prioritized to empower teams to drive technical excellence while aligning with overarching business goals. The evolution from a cluster model to a platform and verticals, which we then renamed diagonals, reflects our commitment to agility and responsiveness in addressing evolving challenges and opportunities in the technology landscape.

Customer and business needs often transcend architectural boundaries, demanding flexible team structures and technical ownership. As business priorities evolve, we must adjust our team organization and technical strategies accordingly. Conway’s Law underscores the interplay between system design and organizational structure, highlighting the importance of aligning team organization with architectural goals. Reversing those influences, however, is hard: there is no perfect reverse Conway maneuver. It is therefore crucial to prioritize systems designed for ownership transfers and external contributions. By envisioning our technology as being maintained by entirely different teams, we can design for long-term sustainability. Consistency remains essential: enforcing uniformity in tools and processes enables effective management of technical debt while ensuring operational efficiency.

Achieving Cost Efficiency as Traffic Scales

Managing production costs at Trainline emerged as a critical priority amidst the company's rapid expansion. With our AWS bill amounting to a significant fraction of what we spend on software engineer compensation, reducing production costs by 10% annually became imperative. However, identifying inefficiencies within our intricate technical architecture posed formidable challenges. To tackle this, we ran extensive brainstorming sessions to explore cost-saving strategies.

Key Strategies Considered

  • Data Cleanup: Removing obsolete or redundant data to optimize storage and processing costs (see the sketch after this list).
  • Environment Consolidation: Merging various development and testing environments to reduce overhead and improve resource utilization.
  • Architectural Reviews: Conducting thorough assessments to pinpoint inefficiencies and opportunities for optimization.
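As one concrete example of the data-cleanup lever, the hedged sketch below applies an S3 lifecycle rule that expires objects after a retention period. The bucket name, prefix, and retention period are illustrative assumptions, not Trainline's actual configuration.

```python
# Hedged sketch: expire old objects via an S3 lifecycle rule, one concrete
# form of "data cleanup" for storage costs. Values are illustrative only.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-exports",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-stale-exports",
                "Filter": {"Prefix": "daily-exports/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},  # delete objects after 90 days
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```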

However, implementing these ideas ran into obstacles once we decentralized the problem. Delegating responsibility to individual teams for managing specific segments of the technology stack seemed logical in theory but proved challenging in practice. The 10% reduction target fell unevenly across different areas, leading to disparities where some teams faced significant engineering effort while others resorted to risky shortcuts. Consequently, issues such as service disruptions and outages surfaced, highlighting the need for centralized oversight in managing cost-saving initiatives.

Recognizing the limitations of decentralized approaches, we realized the importance of establishing a centralized task force to collaborate with teams and ensure that cost-saving measures align with broader system efficiency and reliability goals. This centralized approach emphasizes the significance of both decentralized expertise and centralized oversight in achieving sustainable cost reductions and maintaining system stability and efficiency.

By fostering collaboration between teams and keeping cost-saving work aligned with strategic objectives, organizations can effectively navigate the complexities of cost optimization. Ultimately, sustainable cost reductions and system stability are achieved through a holistic approach that integrates both decentralized and centralized perspectives. Through ongoing collaboration and centralized oversight, Trainline aims to enhance operational efficiency and manage production costs over the long term.

Just as understanding where 'flab' exists within a large technical system requires centralized thinking, predicting which cost-reducing efforts are worthwhile is difficult from inside a single team. It is therefore crucial not to blindly assign cost-saving goals to individual teams, as this may inadvertently incentivize them in the wrong direction and produce unintended side effects.

Strategies for Scaling Microservice Architectures

The journey of scaling Trainline's architecture offers pivotal lessons learned from confronting various challenges. One critical episode unfolded in October 2021, when a surge in traffic during the post-COVID recovery laid bare vulnerabilities in the system. Eighteen months' worth of code changes that had never been exercised under full traffic precipitated unforeseen issues, notably contention for database connections across the many microservices sharing them.
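A simple way to reason about this kind of contention is a capacity check: the sum of every replica's connection pool, across every service that talks to a given database, has to stay inside that database's connection limit with some headroom. The sketch below is a back-of-the-envelope version of that check; the service names, replica counts, pool sizes, and limits are illustrative assumptions, not Trainline's figures.

```python
# Hedged sketch: check that the fleet's combined connection pools stay under
# the database's connection limit. All numbers are illustrative.
DB_MAX_CONNECTIONS = 500  # what the database instance allows (illustrative)
HEADROOM = 0.8            # keep 20% spare for failover and admin sessions

# service name -> (replica count, pool size per replica); hypothetical values
services = {
    "search": (12, 10),
    "booking": (8, 15),
    "fulfilment": (6, 10),
}


def total_pool_demand(fleet: dict) -> int:
    """Worst-case number of connections the fleet can open simultaneously."""
    return sum(replicas * pool_size for replicas, pool_size in fleet.values())


demand = total_pool_demand(services)
budget = int(DB_MAX_CONNECTIONS * HEADROOM)
print(f"demand={demand}, budget={budget}, within_budget={demand <= budget}")
```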

A different but equally significant threat emerged in November, marked by a series of DDoS attacks coinciding with geopolitical tensions in Europe. While bolstering DDoS protections proved essential, internal vulnerabilities, such as inconsistent retry strategies, worsened the situation, culminating in self-inflicted overloads. These incidents underscore the multifaceted challenge of identifying and addressing systemic vulnerabilities within a distributed architecture. The absence of singular causes highlights the complexity of detecting and mitigating cumulative effects stemming from multiple contributing factors, necessitating a holistic approach to security and resilience in system design and management.
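A consistent retry policy is one of the cheapest defenses against this failure mode. The sketch below shows the shape such a policy often takes: a small retry budget, capped exponential backoff, and full jitter so retries from many callers do not synchronize into a fresh traffic spike. The parameters are illustrative assumptions, not the values Trainline uses.

```python
# Hedged sketch of a uniform retry policy: capped exponential backoff with
# full jitter and a small attempt budget, to avoid self-inflicted overloads.
import random
import time


def call_with_retries(operation, max_attempts: int = 3,
                      base_delay: float = 0.1, max_delay: float = 2.0):
    """Run operation(), retrying on failure without retrying in lockstep."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # budget exhausted; let the caller degrade gracefully
            # Full jitter spreads retries out instead of synchronizing them.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```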

Key Strategies Considered

  • Traffic Pattern Monitoring: Regularly reviewing long-term traffic patterns to anticipate potential bottlenecks and preempt critical failures.
  • Coordinated Microservice Efforts: Guiding teams on critical aspects such as retry strategies, scaling policies, and database connection management.
  • DDoS Protection: Strengthening defenses against DDoS attacks while addressing internal vulnerabilities such as inconsistent retry strategies.

The challenges encountered in scaling Trainline's architecture underscore the complexity of predicting and mitigating bottlenecks in large microservice-based systems. The inherent unpredictability of such systems emphasizes the importance of proactive measures. Regular reviews of long-term traffic patterns enable organizations to anticipate potential bottlenecks and preempt critical failures, shifting focus from day-to-day or release-by-release assessments to a more comprehensive approach.
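A long-term review can be as simple as tracking weekly peak request rates and estimating how many weeks of headroom remain before a known capacity limit. The sketch below illustrates that calculation with made-up numbers; the metric, the capacity figure, and the linear-growth assumption are all simplifications for illustration.

```python
# Hedged sketch: estimate weeks of headroom from weekly peak request rates,
# assuming roughly linear growth. All numbers are illustrative.
from statistics import mean


def weeks_until_limit(weekly_peaks_rps: list[float], capacity_rps: float):
    """Return an estimate of weeks remaining before the peak hits capacity,
    or None if traffic is flat or falling."""
    if len(weekly_peaks_rps) < 2:
        return None
    growth_per_week = mean(
        later - earlier
        for earlier, later in zip(weekly_peaks_rps, weekly_peaks_rps[1:])
    )
    if growth_per_week <= 0:
        return None
    return (capacity_rps - weekly_peaks_rps[-1]) / growth_per_week


# Example: weekly peaks climbing from 300 to 360 rps against a 500 rps limit.
print(weeks_until_limit([300, 315, 330, 345, 360], capacity_rps=500))
```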

Additionally, fostering coordinated efforts across microservices is essential in guiding teams on critical aspects such as retry strategies, scaling policies, and database connection management. By embracing these proactive measures, organizations can navigate the complexities of large-scale architectures with greater resilience and adaptability, fortifying their systems against unexpected challenges and supporting evolving business needs effectively.

Summary

In Trainline's journey, three key lessons stand out as critical for sustainable growth and efficiency. First, consistency is paramount, even when it means sacrificing immediate engineering preferences; it ensures the longevity and transferability of technology, which is crucial for businesses aiming at long-term success. Second, centralizing system cost-saving efforts proved more effective than delegating them, as it prevents teams from losing sight of the bigger picture and encourages more informed decision-making.

Finally, the importance of observing long-term traffic patterns and coordinating microservice fleets cannot be overstated. Regular reviews spanning extended periods help identify potential issues and prevent common pitfalls, ensuring system resilience amidst evolving demands. These lessons, drawn from Trainline's experiences, offer valuable insights for organizations navigating the complexities of scalable architectures.
