The Latency Chart

Managing timely collaboration for work in progress

"In product development, our problem is virtually never motionless engineers. It is almost always motionless work product."

- "The Principles of Product Development Flow: Second Generation Lean Product Development", Donald Reinertsen.

In software development, this means that it is critically important to track the age of work, since motionless work product ages in place, leading to high cycle times and variability.

Even more critically, we need to know whether work is aging because it is motionless, or aging while still making progress.

With Continuous Measurement in Polaris, we have a detailed picture of age for work in progress, accurate at all times, up to the last progress event: currently defined as the last commit, or update to a card or pull request.

Latency, the elapsed time since the last progress event was recorded for a card, is one of two fundamental measurements we focus on in the Polaris Advisor program. The latency clock resets for a card with every progress event, so high latency cards are the ones that have not made progress in a while, and need some attention. The age of a card is the sum of its latencies over time, so we look at latency as a primitive measurement that age is dependent on.
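
To make the relationship between latency and age concrete, here is a minimal Python sketch. The function names and the representation of progress events as a plain list of timestamps are illustrative assumptions, not the Polaris data model.

```python
from datetime import datetime, timedelta

def latency(progress_events, now):
    """Elapsed time since the most recent progress event.

    Assumes the card has at least one recorded progress event.
    """
    return now - max(progress_events)

def age(started_at, progress_events, now):
    """Age as the sum of the latencies accumulated between progress events.

    Assumes every progress event happened at or after started_at.
    """
    points = [started_at] + sorted(progress_events) + [now]
    gaps = [later - earlier for earlier, later in zip(points, points[1:])]
    return sum(gaps, timedelta())

# Hypothetical card: started Jan 1, progress on Jan 2 and Jan 4, observed Jan 5.
started = datetime(2024, 1, 1)
events = [datetime(2024, 1, 2), datetime(2024, 1, 4)]
now = datetime(2024, 1, 5)

card_latency = latency(events, now)   # 1 day since the last progress event
card_age = age(started, events, now)  # gaps of 1 + 2 + 1 days = 4 days
```

Each progress event resets the latency clock, but the gaps between events keep adding to the age of the card.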

Other than high levels of WIP, the overwhelming cause of latency in software delivery is delay caused by human loops: hand-offs, context switching, code reviews and so on. These delays are hard to predict and model ahead of time, and they are the root cause of work aging in place even when WIP levels are low.

They also occur in different places, and at different times, for different cards.

Latency captures all these different types of delays in a generic fashion.

Latency and WIP

When WIP levels are high relative to the number of team members working, latency will spike for everything except the cards that are being worked on. So keeping WIP low will reduce latency. Conversely, if you run a low latency process, your WIP levels will automatically go down. This is a consequence of Little's Law from queueing theory.
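
Little's Law relates the long-run averages of a stable system: average WIP equals average throughput multiplied by average cycle time. The sketch below works through it with made-up numbers. The function name and the figures are illustrative assumptions only.

```python
def average_cycle_time(average_wip, throughput_per_day):
    """Little's Law rearranged: cycle time = WIP / throughput."""
    return average_wip / throughput_per_day

# Hypothetical team that finishes 2 cards per day on average.
crowded = average_cycle_time(average_wip=12, throughput_per_day=2)  # 6.0 days
focused = average_cycle_time(average_wip=6, throughput_per_day=2)   # 3.0 days
```

At the same throughput, halving WIP halves the average cycle time. Read the other way, cards that finish sooner leave fewer cards in progress at any moment.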

Our approach to managing WIP in the Polaris Advisor program is to use age limits in addition to WIP limits.

Polaris lets you set an age limit for all work that flows through the pipeline, stated as a value in days.

The Latency Chart

In order to make sure work stays within the age limit, we pair it with a latency limit, typically set to 10% of the age limit. For instance, with a 7 day age limit the latency limit is 0.7 days (about 17 hours), which in practice means each card needs to record progress at least daily. The Latency Chart tracks these two metrics, age and latency, for all cards in progress in real time.
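
As a small worked example of the 10% rule, the sketch below derives a latency limit from an age limit. The function name and the percent parameter are hypothetical and are not Polaris settings.

```python
from datetime import timedelta

def latency_limit(age_limit, percent=10):
    """Latency limit as a percentage of the age limit (10% is the typical setting)."""
    return age_limit * percent / 100

# A 7 day age limit gives a latency limit of 16 hours 48 minutes.
limit = latency_limit(timedelta(days=7))
```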

The horizontal axis shows the elapsed cycle time (the age) of each card in progress, and the vertical axis shows its current latency. The cycle time limit (the age limit described above) and the latency limit divide the chart into four quadrants, each representing a bucket of work that you can monitor and handle separately as a team.

  1. Lower left quadrant: This is the "Moving" green zone. All cards here are within the targets.
  2. Lower right quadrant: Cards in this zone are still making progress, but have exceeded the cycle time limit. Make sure this is what you expected for these cards, otherwise follow up right away. You may have work that is churning, which will affect your plan. This is the "Delayed" zone.
  3. Top left quadrant: Cards in this zone are within the cycle time limit but have not made progress within the latency limit. This is the "Slowing" zone. This is the first category of "motionless" work product, but these cards are still salvageable if you follow up on them.
  4. Top right quadrant: These are the cards that have exceeded the cycle time limit and have not progressed within the latency limit. This is the red "Stalled" zone. If you are running a tight execution process, this quadrant should be empty.
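
The four quadrants above can be sketched as a simple classification. The function name is hypothetical, and treating a card that sits exactly on a limit as still within it is an assumption, since the boundary behavior is not specified here.

```python
from datetime import timedelta

def classify(cycle_time, latency, cycle_time_limit, latency_limit):
    """Return the Latency Chart zone for one card in progress."""
    # Assumption: a value exactly equal to a limit counts as within the limit.
    within_cycle_time = cycle_time <= cycle_time_limit
    within_latency = latency <= latency_limit
    if within_cycle_time and within_latency:
        return "Moving"   # lower left
    if within_latency:
        return "Delayed"  # lower right: progressing, but over the cycle time limit
    if within_cycle_time:
        return "Slowing"  # top left: within cycle time, but no recent progress
    return "Stalled"      # top right: over both limits

# Hypothetical limits: 7 day cycle time limit, 1 day latency limit.
cycle_limit = timedelta(days=7)
lat_limit = timedelta(days=1)

zone = classify(timedelta(days=3), timedelta(days=2), cycle_limit, lat_limit)
```

Here a card that is 3 days old and has gone 2 days without a progress event lands in the "Slowing" zone.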

In our practice, we find that once we set an age limit for work items, a large fraction of the work starts out in the "Stalled" state - 60% and higher is not unusual.

This is a vivid demonstration of Reinertsen's claim above.

This is especially true if you are running processes like Scrum, where the default motion is to release all the work at the end of a sprint. Many cards get started in parallel early in the sprint, and many of them then accumulate latency.

Even when we set the cycle time limit to be the sprint length, we often end up with well over 60% of work in progress in the Stalled state.

Latency-based Pull Policies

Cycle time and latency limits let us set scheduling policies that are based on the actual cycle times of cards in progress, rather than placing static WIP limits by development phase.

Such a pull policy requires that a new piece of work cannot start unless all existing work is in a known "safe" zone. When a card is at risk of "escaping" into a non-safe zone, it becomes the responsibility of the whole team to make sure it can be safely taken over the finish line instead of starting new work. This means that if a team member is in a position to bring a card at risk back into the safe zone, that becomes their first priority instead of starting work on a new card. If no one is in a position to do this, it is time to "stop the line" and figure out what to do right away.
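
A pull policy of this kind can be sketched as a gate that is checked before new work starts. Which zones count as "safe" is a team decision. Treating only "Moving" as safe by default is an assumption made for this example, and the function and parameter names are hypothetical.

```python
def pull_decision(cards, safe_zones=frozenset({"Moving"})):
    """Decide whether new work may start.

    cards: mapping of card id -> current Latency Chart zone.
    Returns (may_start_new_work, ids_of_cards_outside_the_safe_zones).
    """
    at_risk = sorted(card_id for card_id, zone in cards.items()
                     if zone not in safe_zones)
    return (len(at_risk) == 0, at_risk)

# Hypothetical board: two cards need attention before anything new is pulled.
may_start, needs_attention = pull_decision(
    {"CARD-1": "Moving", "CARD-2": "Slowing", "CARD-3": "Stalled"}
)
```

When the gate is closed, the returned list tells the team which cards to swarm on first. If nobody is in a position to move them, that is the signal to stop the line.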

With latency-based pull policies communicated and enforced, the team as a whole takes responsibility for moving cards end to end through the pipeline, rather than having a large set of half-finished cards in progress at any time.

Note that these policies are process agnostic. You can apply them whether you are running a Scrum process or a typical flow process like Kanban.