Key Takeaways
- The structure of a value stream map defines its behavior
- Lead time and throughput are dynamic characteristics of a value stream map
- Dynamic characteristics are derivatives of the value stream map structure
- Every value stream has multiple paths and, therefore, several lead times, one for each path
- Feedback loops are an important element in defining the behavior of value streams
In the previous article, "Dynamic Value Stream Mapping to Help Increase Developer Productivity", we provided a high-level overview of dynamic value stream mapping. In this article, we continue our journey and explore experimentally how a feedback loop (using rework as an example) influences lead time and throughput.
When we run a value stream mapping workshop, we often treat lead time and throughput data as if they were static; in reality, these parameters are behavioral traits of a value stream. The value stream structure - its steps, their capacity and processing time, the feedback loops and their probabilities, etc. - defines the lead time(s), WIP (work in progress), and throughput dynamics of a value stream.
Let’s take a simple value stream with only one step, "Do work", which performs some work on an item. The arrival rate of items is one per minute, and the step's processing time for any item is constant and equal to 1 minute. Each work item sits in the input queue for 1 minute. For our simple experiment, we need just one feedback loop, "Redo", meaning that a work item sometimes requires reprocessing.
As you might assume, the redo path isn’t always taken: if the quality of the item is excellent, the item goes to the output queue. If an item's quality requires attention, it goes to the redo queue and passes through the "Do work" step again.
For our experiment, the probability that poor quality forces a repetition of "Do work" will be very important; we will refer to it as the probability of redo.
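To make the model concrete, here is a minimal sketch of such a value stream as a minute-by-minute simulation in plain Python (it is our own illustration, not the tool behind the linked model): one item arrives per minute, a single "Do work" step handles one item per minute, every item waits at least one minute in a queue, and a finished item goes back to the redo queue with a given probability. The function name `simulate` and the `policy` parameter are hypothetical, introduced only for this sketch.

```python
import random

def simulate(redo_probability, minutes=100, policy="input_first", seed=1):
    """Minute-by-minute sketch of a one-step value stream with a redo loop."""
    random.seed(seed)
    input_queue, redo_queue, lead_times = [0], [], []  # first item arrives at minute 0

    for now in range(1, minutes + 1):
        # "Do work" spends this minute on one item that is already waiting.
        if input_queue or redo_queue:
            if policy == "redo_first":
                queue = redo_queue if redo_queue else input_queue
            elif policy == "equal" and input_queue and redo_queue:
                queue = random.choice([input_queue, redo_queue])
            else:  # "input_first", or only one queue has items
                queue = input_queue if input_queue else redo_queue
            arrived_at = queue.pop(0)

            if random.random() < redo_probability:
                redo_queue.append(arrived_at)            # poor quality: another pass
            else:
                lead_times.append(now + 1 - arrived_at)  # item departs at the end of this minute

        input_queue.append(now)  # the next item arrives and waits in the input queue

    wip = len(input_queue) + len(redo_queue)
    throughput = len(lead_times) / (minutes + 1)  # departed items / arrived items
    return lead_times, wip, throughput
```

With a zero redo probability this sketch reproduces the structure described above: each item waits one minute, is processed for one minute, and departs with a lead time of two minutes.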
In real life, we see this type of process everywhere:
- In code review, when someone submits a code change and gets feedback to address identified issues (coding standard violations, implementation problems, etc.)
- In quality assurance, when the test team returns an application build for bug fixes
- In site reliability engineering, when a new release deployment requires manual intervention to address overlooked issues
Of course, our example is simple, and some might argue that it is too simple a case and doesn’t fully reflect the examples mentioned above, but we don’t need all the details at this moment. As the statistician George Box famously said, all models are wrong, but some are useful. This is exactly our case. Let’s learn from this model and see what we can apply to the world of software engineering management.
Picture 1. Structure of the experimental value stream map
By the way, you can explore the model yourself, play with the parameters, and add any details you need - here is a link.
Now, let’s investigate how the probability of "Redo" at one step impacts the total process lead time and throughput. Such an experiment has a very practical outcome for software engineering managers, since it can give us a clue about how to increase the value delivered without hiring new people. So let’s explore this abstract case and then discuss the practical implications for software management.
In our simulation, we will increase the probability of redo from 0% to 90% in 10% increments. We expect the results to differ depending on which item is processed first: one from the input queue or one from the redo queue - in other words, whether we prioritize fixing bugs over developing features.
To model these cases, we will run three sets of simulations of this value stream map for 100 minutes (a sketch of how such a sweep can be scripted appears after this list):
- When the input queue has a higher priority than the redo queue - the engineering team works on new stories instead of bugs.
- When the input queue has the same priority as the redo queue - the engineering team mixes new stories and bugs.
- When the input queue has a lower priority than the redo queue - the engineering team works on bugs before any stories.
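A sketch of how this sweep might be scripted, reusing the hypothetical `simulate` function from the earlier sketch; the printed numbers depend on the random seed and are only meant to show the shape of the experiment, not the data behind Picture 2.

```python
# Sweep the redo probability from 0% to 90% for the three priority policies.
for policy in ("input_first", "equal", "redo_first"):
    print(f"\npolicy = {policy}")
    for p in [i / 10 for i in range(10)]:                 # 0%, 10%, ..., 90%
        lead_times, wip, throughput = simulate(p, minutes=100, policy=policy)
        avg_lead = sum(lead_times) / len(lead_times) if lead_times else float("nan")
        print(f"  redo={p:.0%}  avg lead time={avg_lead:5.1f} min"
              f"  WIP={wip:3d}  throughput={throughput:.2f}")
```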
The output of our simulation is the following:
Picture 2. Three cases simulation results
Only in the case when the input queue has a higher priority (new features are more important) than redo (bugs) do we observe an invariant lead time of two minutes. The explanation of this phenomenon is simple: all items in the redo queue (bugs) are stuck at the bottom of the backlog and cannot get through the "Do work" step, since input queue items (new features) always have higher priority. Even though the process looks like it performs well, given the lead time of two minutes, this is an illusion, because WIP grows due to the deferred processing of redo items. Someday the postponed rework (bugs or technical debt) will have to be paid off, so new items (new features) will be delayed anyway; it is therefore reasonable to allocate a certain share of your team's capacity to addressing the issues found.
Picture 3. Work In Progress trend for a 50% probability of redo when the input queue has higher priority
WIP - the closer you look, the worse it gets. There are two familiar models that can inform us about the impact of Work In Progress: highway traffic and network traffic. In both cases, additional packets on a network or additional cars on a segment of road affect the performance of that part of the infrastructure because of its limited capacity. When the load is low, the impact is negligible. But as the load increases (more vehicles or data packets), the overall speed of the traffic decreases. This is a key reason why Kanban uses WIP limits as a flow mechanism: they reduce the "traffic" level. Using WIP as a threshold indicator can help us monitor "traffic", but only if it captures all of the work. We need to capture all work - planned, unplanned, administrative, and every other activity that competes for our mind and time. This soon becomes a game of priorities.
Getting back to our model, things are different in the other two cases, when the input queue (new features) does not have a higher priority than the redo queue (bugs): the lead time grows along with the probability of redo. The reason for this phenomenon is that items pass through the "Do work" step several times before they exit the process - in other words, before we can say that done work is really done. Another interesting observation is that throughput goes down as redo grows, which is expected, since we measure it as the ratio of departed items to all arrived items. If the lead time grows within a given period (in our case, we simulated for 100 minutes), then the number of items leaving the process goes down; therefore, throughput does so as well.
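A rough back-of-envelope check helps explain both trends. If we assume (our assumption, added on top of the article) that every pass through "Do work" is sent back independently with probability p, the number of passes an item needs follows a geometric distribution with mean 1/(1-p); a step that can perform one pass per minute can therefore finish at most (1-p) items per minute.

```python
# Back-of-envelope check (assumes independent passes, one pass per minute):
# expected passes per item = 1 / (1 - p), so finished items per minute <= 1 - p.
for p in (0.0, 0.3, 0.5, 0.7, 0.9):
    expected_passes = 1 / (1 - p)
    print(f"redo={p:.0%}: ~{expected_passes:.1f} passes per item, "
          f"at most {1 - p:.1f} items finished per minute")
```

At a 50% redo probability, for example, each item needs about two passes on average, which roughly halves the effective capacity of the step.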
The practical implication for software engineering management is to first address the feedback loops that generate a lot of bugs/issues, in order to get your capacity back. For example, if you have a fragile architecture or code of low maintainability that requires a lot of rework after every change, refactoring is clearly necessary to regain engineering productivity; otherwise, the engineering team's effective capacity will stay low.
The last observation is that lead time depends on the simulation duration: the longer you run the value stream, the more lead time variants you will get. Such behavior is a direct implication of the value stream structure, with its redo feedback loop and the probability split between the output queue and the redo queue. If you are an engineering manager who inherited legacy code with significant accumulated debt, it might be reasonable to consider incremental solution rewriting. Otherwise, the speed of delivery will remain very slow forever, not only during the modernization period.
The art of simplicity: greater complexity yields more variations, which increase the probability of results falling outside acceptable parameters. One way to minimize this is to make feedback loops as simple, fast, and direct as possible. Collected information should be treated as suspect at best until data quality can be assessed and qualified. This is why engineering practices such as TDD and BDD are extremely powerful: they give simple, short, and fast feedback on the quality of the change implemented.
Subsequent iterations with single-variable changes can then be made (ideally within a design of experiments); we recommend you build a model of your value streams to see how each change plays out. The challenge then may be "perfect being the enemy of good", with sufficiency being the goal. Advanced-level challenges will include issues such as cross-stream impacts and indirect impacts, not to mention unrealized relationships between activities and/or the data observed. Transparency of process and data, along with visibility, can be a strong tool in conveying the intent and reality of our focus on our value streams, building both credibility and usefulness as stakeholders engage.
To illustrate the fact that a value stream may have several flows and therefore several lead times, see this list of flow traces for departed work items with a 50% redo probability and equal priorities for the input and redo queues. As you can see, there is huge variability in how many times a particular item travels the redo feedback loop. When a software engineering manager faces a situation like this (a high feedback loop probability), it is reasonable to tackle these cases according to their frequency, given their adverse effect on throughput and team capacity. Investigate the origins of the feedback first: what prevents engineers from getting this feedback immediately? The next important question concerns repeated feedback: what prevents engineers from making the change right after receiving the feedback? We bet that the nature of the architecture, the environment configuration, or both inhibit engineers from experimenting locally and running this feedback loop locally instead of getting it from other teams or pipeline steps. In the highest-productivity environments, all pipeline steps and all types of testing can be performed by an engineer locally without affecting the shared code base.
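For readers who want to reproduce that variability without the full model, here is a small synthetic tally (our own illustration under the same independence assumption, not the article's trace data) of how many times 1,000 items would travel the redo loop at a 50% redo probability.

```python
import random
from collections import Counter

# Synthetic tally of redo-loop counts per item at a 50% redo probability.
# Each pass is assumed to fail independently, so the loop count is geometric.
random.seed(7)
loop_counts = Counter()
for _ in range(1000):
    loops = 0
    while random.random() < 0.5:   # the item is sent back for another pass
        loops += 1
    loop_counts[loops] += 1

for n in sorted(loop_counts):
    print(f"{n} redo loops: {loop_counts[n]} of 1000 items")
```

Roughly half the items never loop at all, while a long tail loops three, four, or more times - the same skewed pattern that shows up in the flow traces.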
In conclusion, we recommend that all value stream mapping practitioners in the Agile and DevOps fields look at the dynamic characteristics of their value streams, such as lead time, throughput, and WIP, through the lens of the value stream structure. The question of which structural elements of the value stream lead to the observed dynamics is crucial for improving an existing value stream or designing a new, more balanced one.
We hope you find these ideas and insights useful and will apply them in practice. Jack and Pavel will be happy to follow up and learn from you which of the lessons might be the most important for your organization.