At the inaugural O’Reilly Software Architecture conference, Raffi Krikorian discussed strategies and tactics for technical leads and architects who are undertaking a system rewrite. Drawing on his experience as VP of Twitter Engineering, Krikorian presented a twelve-point plan for managing the re-architecting process, including defining “done”, instrumenting the existing system, and maintaining code quality.
Krikorian, engineering lead at Uber’s Advanced Technologies Center and former VP of Twitter engineering, began the talk by stating that the decisions and process behind rewriting or re-architecting a system are often plagued by a series of problems: people always underestimate the complexity, people never fully understand the customers, system requirements constantly change, and the work typically takes much longer than anyone can predict.
Krikorian suggested that a re-architecture typically occurs when an application is not servicing customer requirements, is not scaling, or is not matching features of similar products. The talk presented a series of system rewrite case studies, highlighted common pitfalls, and provided several strategies for undertaking a successful system re-architecture. A core theme throughout the presentation was to avoid a complete rewrite if at all possible; if this approach is chosen, however, the following twelve points should be considered before and during a system re-architecture.
- Hold the line [against the business]
- Define “done”
- Incrementalism
- Find the starting line
- Don’t ignore the data
- Manage tech debt better
- Stay away from vanity stuff [such as ‘hot’ new technical stacks]
- Prepare for mounting tensions
- Know the business
- Get ready for politics
- Keep an eye on code quality
- Get the team ready
Exploring each point in turn, Krikorian first noted that during an application rewrite the business product managers are going to become anxious, and that this anxiety must be managed by the technical lead. The business and technical stakeholders must work together to define what “done” means in relation to a successful re-architecture. Several existing specification techniques can be applied to define the requirements, but the existing code should not serve as the entire new system specification.
Krikorian also suggested that obtaining 100% feature parity between the new and existing systems will be difficult. During the time the existing system has been running in production, new functionality may have been added in order to address issues and corner cases, and these modifications may not have been documented appropriately.
Do you really know what the system does?
“Make it do what it already does” is harder than you think.
Most programmers don’t even know what questions to ask. Doubly true if they weren’t the original designers of the system.
Implementing features is difficult, and therefore a rewrite should be done incrementally. An agile, integrated and continuous delivery-based process is the most appropriate approach. The new system should be ready to be released at the end of every iteration, and regular demonstrations of the rewritten functionality should be provided to business stakeholders. Krikorian also warned against allowing feature creep in the rewrite:
Feature creep is incredibly tempting, especially if feature development is halting on the old system.
It will potentially kill you.
You try not to do this during regular development, so, why start now?
The underlying data within a software system typically changes very slowly in comparison to the application code. Krikorian cautioned that the use of fake or synthetic data during a rewrite will provide a false sense of security. Tests must be conducted with real data as soon as possible, potentially using techniques such as dark launching and traffic shadowing. A plan should also be created to determine how data will be reconciled between the existing and new applications. The existing system must be instrumented (both in terms of performance and data handling), and decisions made within the rewrite should be driven by this data.
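As an illustration of traffic shadowing, the following minimal Python sketch serves every request from the existing system while mirroring it to the rewritten code path and recording any differences. It is not taken from the talk: the `ShadowRunner` class and the two `normalise` functions are hypothetical names invented for this example, and a real deployment would most likely run the shadow call asynchronously so that it adds no latency for the caller.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class ShadowRunner:
    """Serve from the existing system and mirror each request to the rewrite.

    Hypothetical helper for illustration only.
    """
    legacy: Callable[[Any], Any]      # existing production code path
    candidate: Callable[[Any], Any]   # rewritten code path under evaluation
    mismatches: List[dict] = field(default_factory=list)
    total: int = 0

    def handle(self, request: Any) -> Any:
        result = self.legacy(request)  # the caller only ever sees this result
        self.total += 1
        try:
            shadow = self.candidate(request)
            if shadow != result:
                self.mismatches.append(
                    {"request": request, "legacy": result, "candidate": shadow}
                )
        except Exception as exc:
            # A failure in the rewrite must never affect production traffic.
            self.mismatches.append(
                {"request": request, "legacy": result, "error": repr(exc)}
            )
        return result

    def mismatch_rate(self) -> float:
        return len(self.mismatches) / self.total if self.total else 0.0


# Hypothetical old and new implementations of the same feature.
def legacy_normalise(name: str) -> str:
    return name.strip().lower()


def candidate_normalise(name: str) -> str:
    return name.lower()  # misses the undocumented whitespace handling


runner = ShadowRunner(legacy_normalise, candidate_normalise)
responses = [runner.handle(n) for n in ["Alice", " Bob ", "carol"]]
print(responses)
print(runner.mismatches)
print(runner.mismatch_rate())
```

The recorded mismatches point at behaviour in the existing system that the rewrite has not yet reproduced, and the mismatch rate is one example of a metric that could feed data-driven progress reporting.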
Managing technical debt is a core component of any system rewrite, as left unchecked it will increase software entropy. Technical debt is typically accrued through business pressures, lack of process, the absence of a well-defined architecture, and lack of engineering mentorship.
Krikorian suggested that a culture of design quality should be nurtured. Refactoring should be encouraged, as should continuous design and other code-quality practices. A regular portion of development time should be allocated to address technical debt. Unless properly tended, real world code becomes complicated over time. As developers typically spend more time reading than they do writing code, code reviews should be utilised to structure code for “readability”.
Krikorian also suggested that a rewrite should avoid “vanity stuff”, such as choosing a “hot” new language or technology stack. Although this can be tempting, both as a way to increase the development team’s short-term motivation and as a recruitment tool, a technical lead should always do what is beneficial for the long-term aspirations of the team.
There is strong potential during a rewrite to gather seriously unhappy customers. External customers get no new features, and internal customers are being held up by the rewrite.
There will be political battles within engineering, and frustration can emerge throughout the business because deadlines will almost certainly be missed.
Krikorian cautioned that “nobody ever takes into account this stuff” before a rewrite begins. These battles can be disruptive to the rewrite, and the technical lead must manage the resolution of issues that arise.
It is not uncommon for a development team to be divided into two for a rewrite - one team maintains the old system, and one team works on the re-architecting. Although this can be a valid approach, a technical lead should be ready to deal with mounting tensions between these two teams. Fixing bugs and firefighting is stressful - one team has to do this and the other doesn’t. Krikorian suggested that a technical lead has to maintain or develop supporters in the team maintaining the existing system. Progress should be regularly communicated between teams, and the work undertaken as part of the rewrite should be conducted in a transparent manner.
It is essential that, before a rewrite begins, all of the business stakeholders are identified by the technical lead. The non-technical motivations for a rewrite should also be identified and managed, for example increasing the velocity of feature delivery or reducing hardware costs. Krikorian suggested that the rewrite process and the corresponding progress reporting must be data-driven, which could include metrics on cost savings, feature delivery velocity, reliability, performance, and stability. Anecdotal evidence should be strongly avoided, and instead the results of experiments must be shown.
Krikorian concluded the talk by stating that the technical lead or architect in charge of a system rewrite is often in a precarious position. Engineering resources are being utilised, and new features are not being delivered to the business. The in-flight rewrite work should be kept small, and must be well integrated on a regular basis using techniques such as continuous integration and continuous delivery. Code quality within the rewritten system must be monitored closely.
Following Conway’s law, the desired architecture and team structure must also be aligned, as Krikorian notes:
The architecture dictates the team structure, which in turn dictates the architecture...
The slides for Raffi Krikorian’s “Re-architecting on the Fly” talk can be found on SlideShare. The inaugural O’Reilly Software Architecture Conference was held in Boston during March 2015, and further details, including a link to recordings of the talks, can be found on the conference website.