In this podcast Shane Hastie, Lead Editor for Culture & Methods, spoke to Mike Bryzek, CTO and founder of Flow.io, about his background in philanthropy, his current role at Flow.io, and how they have empowered developers and adopted testing in production to raise the quality of their software.
Key Takeaways
- The value of volunteerism and philanthropy for good through focusing on outcomes
- Testing in production helps make higher quality software
- The importance of instrumenting the software so we get feedback in real time
- Allowing the engineer to own the decision about when to release the software and be accountable for that decision results in better products because the engineers who build them carefully consider what they need to do to get the product truly ready for release
- Testing in production may seem scary now, but it’s just another tool in the toolbox of building quality software – it’s the only way we can guarantee that the software is actually working as expected in the environment that it is intended for
Show notes
- 0:30 Introductions
- 1:44 The value of volunteerism and philanthropy for good
- 1:51 The importance of focusing on outcomes not just on the activities
- 2:15 These outcomes are very hard to measure
- 2:55 The impact that Bill Gates has had through his philanthropic work
- 3:20 Testing in production helps make higher quality software
- 3:40 This approach makes the software better sooner, which frees up time to build better products
- 3:55 Finding the time to instrument the software so we get feedback in real time
- 4:05 Telling the story of how one outstanding engineer prepared for and deployed software with logging and real-time monitoring tools to immediately get feedback on the impact of a change
- 5:10 Some bugs take time to surface (e.g. a slow memory leak)
- 5:48 The process of delivering quality software is time intensive and requires great concentration
- 6:23 Pre-instrumenting the software so it is possible to show in retrospect exactly where a bug came from when one does appear
- 7:05 We know how to do this but we tend not to because it’s hard
- 7:25 The perception that the cost of delivering quality is high and the desire for more features rather than better products
- 7:50 Analysing the underlying cause of different types of bugs to find patterns and opportunities for improvement
- 8:04 At the small component/service/library level we have very good test automation tools and there is a correlation between test coverage and defect reduction
- 8:42 There are some classes of problem which are much harder to automate the testing of – e.g. user interface design
- 9:04 To get the program to the state where you can test the user interface requires many underlying services and datasets to be running and configured correctly, which takes time and makes it hard to test
- 9:28 An example of just how complex these “simple” environments can be
- 10:04 The UI automation tools are fragile and fail frequently because of dependencies in the larger ecosystem
- 10:35 Most of the things we’re doing in the UI are not dangerous per se, because most of what is being done is rendering
- 11:05 What if you could attach the new UI to the production backend stack and test with the data that you know is correct because it is live?
- 11:20 Developers are quite good at keeping production running, and quite poor at keeping other platforms running
- 11:38 Automating the UI test that uses the production backend means that if it fails, one of two very important things has happened, either of which should be addressed: 1) there is a bug in the UI code – fix it; 2) production is down – escalate it immediately!
- 12:04 We’re doing this test in production manually anyway – why not automate it and make it part of the deployment job?
- 12:18 The value of automated tests in monitoring the long-term health of the system
- 12:40 The Surprise KPI – the number of times a regression bug in production is identified and reported
- 13:25 The importance of a comprehensive and robust definition of done, not just “it’s released”
- 13:44 No matter how vigilant and careful we are, people make mistakes – build systems around the assumption that mistakes will happen and focus on identifying the failure and minimising the risks
- 14:05 Experiences getting better at instrumenting and testing our code
- 14:33 Any metric that you identify can be gamed
- 15:02 The goal is the lack of defects in production, today and in the future
- 15:07 Allowing the engineer to own the decision about when to release the software and be accountable for that decision results in better products because the engineers who build them carefully consider what they need to do to get the product truly ready for release
- 15:38 This is a profound mindset shift
- 15:54 An example of how this empowerment played out in the conversation about posting to a company tech-blog and how people supported each other
- 17:04 Shifting the locus of control from the company to the individual drives great behaviour
- 17:31 The best way to make your tests actually deliver defect-free software is to take away all the barriers
- 17:50 At Flow there are no test or staging environments – you build software and tests and deploy them to production, and you are the one who decides when to deploy
- 18:05 This results in more automation, instrumenting and logging to ensure that when it is released the developer has the confidence that it will do what it’s supposed to without causing regression, and that if something goes wrong it is immediately identified
- 18:34 This results in confidence with evidence of increasing quality and creates opportunities to do more, quicker
- 18:52 This also applies to other parts of the tech stack, for instance deploying framework updates
- 19:03 This results in more and more effective instrumentation and logging which make the product more robust
- 19:27 Removing obstacles between developers and production, having a culture of accountability and ownership results in better products
- 19:34 People tend to do the right things because the incentives are aligned
- 19:43 Removing QA environments has resulted in a time saving of 30-40% as that time would have been spent debugging the environments
- 19:50 This saved time goes directly into the quality practices around testing, logging and instrumenting
- 20:30 Tackling the “this won’t work in a regulated environment” challenge
- 21:00 Understand what the objective is behind the regulation (e.g. passenger safety in the airline industry) and ensure that this alignment extends down to the people in the teams
- 21:10 The story of working with auditors to understand their concerns and mitigate them through showing the controls (instrumentation, alerts, checks and logging) which were built into the system to identify and stop invalid transactions
- 23:28 When the whole organisation is aligned around the objectives and values then the practices which result support this alignment and help achieve the right results
- 24:24 Advice on how to move towards adopting this approach – think of it as a journey that will take time
- 24:33 Examine the current environment and honestly identify where the wastes are – where are we spending time doing non-productive work
- 24:55 Identify places where improving or stopping work will have a real benefit on the ROI
- 25:05 List the opportunities and ideas for improvement
- 25:12 Pick one and make the change, measure the results
- 25:40 An example of making the change at Gilt with iPhone payment that resulted in additional benefits beyond what the initial challenge was meant to overcome
- 28:10 Fix the deployment process as one of the first steps to improving quality – it must be fast and easy to deploy
- 28:52 Releasing frequently is the key to building quality software
- 29:15 Testing in production may seem scary now, but it’s just another tool in the toolbox of building quality software – it’s the only way we can guarantee that the software is actually working as expected in the environment that it is intended for
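The post-deploy check described around 11:38–12:04 can be sketched as a small smoke test run against production. This is an illustrative sketch, not Flow’s actual tooling: the endpoint URLs and function names below are hypothetical. The key idea is that a failure maps onto exactly one of the two outcomes from the conversation – a UI bug to fix, or a production outage to escalate.

```python
import urllib.error
import urllib.request

# Hypothetical production endpoints – substitute your own.
PROD_API = "https://api.example.com/healthcheck"
PROD_UI = "https://www.example.com/"


def status_of(url, timeout=5):
    """Return the HTTP status code for a URL, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def classify(api_status, ui_status):
    """Map the two checks onto the two failure modes from the talk."""
    if api_status != 200:
        return "escalate"  # production backend is down – escalate immediately
    if ui_status != 200:
        return "fix-ui"    # backend is healthy, UI failed – a bug in the UI code
    return "ok"            # deploy verified against live production


def smoke_test():
    """Run as the final step of a deployment job."""
    return classify(status_of(PROD_API), status_of(PROD_UI))
```

Wiring `smoke_test()` into the deployment pipeline turns the manual “click around after deploying” ritual into the automated, repeatable check the episode advocates.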
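The pre-instrumentation discussed at 3:55 and 6:23 amounts to emitting structured telemetry around every operation, so the impact of a change is visible in monitoring the moment it ships. A minimal sketch, with all names illustrative:

```python
import functools
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("deploy")


def instrumented(fn):
    """Emit one structured log line per call, so dashboards can show
    error rate and latency for a change in real time."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
            status = "ok"
            return result
        except Exception:
            status = "error"
            raise  # propagate after recording the failure
        finally:
            logger.info(json.dumps({
                "event": fn.__name__,
                "status": status,
                "ms": round((time.monotonic() - start) * 1000, 2),
            }))
    return wrapper


@instrumented
def apply_discount(price, percent):
    """Example business operation whose behaviour we want to observe."""
    if not 0 <= percent <= 100:
        raise ValueError("percent out of range")
    return price * (100 - percent) / 100
```

Because every call logs its name, status, and duration as JSON, a log aggregator can alert on a spike in `"status": "error"` within minutes of a deploy, and when a bug does appear the logs show in retrospect exactly where it came from.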
Mentioned: