The Latency Chart

Limit Wip without Wip Limits

Managing Delays

With Continuous Measurement in Polaris, we have a detailed picture of the state of play for work in progress, accurate at all times up to the last progress event. A progress event is currently defined as a commit, or an update to a card or pull request.

Latency, the elapsed time since the last progress event was recorded for a card, is one of two fundamental measurements we focus on in Ergonometrics. The latency clock resets for a card with every progress event, so high latency cards are the ones that have not made progress in a while, and need some attention.
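
As a minimal sketch of this definition (not Polaris's actual implementation), latency can be computed as the time elapsed since the most recent progress event. The function name `latency` and the sample timestamps are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def latency(progress_events, now):
    """Elapsed time since the most recent progress event for one card.

    `progress_events` is a list of timestamps (commits, card updates,
    pull request updates). This is an illustrative helper, not a Polaris API.
    """
    if not progress_events:
        raise ValueError("card has no progress events")
    return now - max(progress_events)


# Hypothetical card with two progress events.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
events = [
    datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc),   # a commit
    datetime(2024, 1, 9, 12, 0, tzinfo=timezone.utc),  # a pull request update
]
card_latency = latency(events, now)
print(card_latency)  # 1 day, 0:00:00
```

Recording a new progress event at `now` would bring the result back to zero, which is the clock reset described above.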

Latency captures delays in a generic fashion without having to predetermine the cause of the delay via a process model. The dominant cause of latency in software delivery is delay in human loops: hand-offs, context switching, code reviews, and so on. These are hard to predict and model ahead of time. They also occur in different places, and at different times, for different cards.

But latency measures the symptom in a uniform way, and Polaris gives you the tools to track down the root cause by digging into the implementation details of each card in progress. This is far more effective than trying to guess in advance where latency will arise by modeling queues and the like, which is a futile exercise.

Latency and Wip

When Wip levels are high relative to the number of team members working, latency will spike for everything except the cards that are being worked on. So keeping Wip low will reduce latency. Conversely, if you run a low latency process, your Wip will automatically have to go down. This is a consequence of some basic mathematical laws from queueing theory.

In the Ergonometrics framework, the approach to managing Wip is via Cycle Time limits instead of Wip Limits. Polaris lets you set a target cycle time for all work that flows through the pipeline, stated as a value in days and optionally a percentile value to allow for exceptions. For example, you can set a p90 target of 7 days for your team: this means that you are aiming for the 90th percentile of cycle time to be 7 days or less. The remaining 10% of cards allows for the occasional exception where the nature of the work necessitates going longer, but the overall emphasis is on breaking work up so that the overwhelming majority of cards can be completed end to end in 7 days or less.
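
To make the p90 example concrete, here is a small sketch that checks a set of completed cards against a percentile target. It assumes a nearest-rank percentile, which may differ from how Polaris computes percentiles; the function names and the sample cycle times are hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_target(cycle_times_days, target_days=7, pct=90):
    """True if the pct-th percentile of cycle time is within the target."""
    return percentile_nearest_rank(cycle_times_days, pct) <= target_days


# Hypothetical cycle times (in days) for ten completed cards.
completed = [2, 3, 3, 4, 5, 5, 6, 6, 7, 12]
print(percentile_nearest_rank(completed, 90))  # 7
print(meets_target(completed))                 # True
```

In this sample, the single 12 day card is the kind of exception the percentile allows for: nine of the ten cards finished in 7 days or less, so the target is still met.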

This moves the onus of planning onto making sure that cards are scoped correctly, and gives the team a single metric to focus on when delivering work. Trying to hit a fixed cycle time target will give you a lot more insight into how to scope work after a couple of weeks of doing it.

The Latency Chart

In order to make sure we meet our cycle time target, we pair it with a latency limit. This is typically set to 10% of the Cycle Time limit. For instance, a 7 day cycle time limit gives a latency limit of 0.7 days, which in practice means each card should record progress about once a day. The Latency Chart tracks these two metrics for all cards in progress in real time.

The horizontal axis shows the elapsed cycle time for each card in progress, and the vertical axis shows the current latency for each card. The cycle time and latency limits divide the chart into four quadrants, and each represents a bucket of work that you can monitor and handle separately as a team.

  1. Lower left quadrant: This is the green zone. All cards here are within the targets.
  2. Lower right quadrant: Cards in this zone are still making progress, but have exceeded the cycle time limit. Make sure this was what you expected for these cards, otherwise follow up right away. You may have work that is churning, which will affect your plan.
  3. Top left quadrant: Cards in this zone are within cycle time limits but have not made progress. Track these for follow ups.
  4. Top right quadrant: Cards in this zone have exceeded the cycle time limit and have also stopped making progress. If you are running a tight execution process this quadrant should be empty.
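
The four quadrants can be sketched as a simple classification over elapsed cycle time and current latency. This is an illustration only: the function names are hypothetical, the 10% default follows the typical setting mentioned above, and treating a card exactly on a limit as "within" it is an assumption.

```python
def latency_limit(cycle_time_limit_days, fraction=0.10):
    """Latency limit derived as a fraction of the cycle time limit."""
    return cycle_time_limit_days * fraction


def quadrant(cycle_time_days, latency_days,
             cycle_time_limit_days=7.0, latency_limit_days=None):
    """Return the Latency Chart quadrant for one card in progress."""
    if latency_limit_days is None:
        latency_limit_days = latency_limit(cycle_time_limit_days)
    # Assumption: a value exactly on a limit counts as within the limit.
    over_cycle_time = cycle_time_days > cycle_time_limit_days
    over_latency = latency_days > latency_limit_days
    if over_cycle_time and over_latency:
        return "top right"
    if over_latency:
        return "top left"
    if over_cycle_time:
        return "lower right"
    return "lower left"


print(quadrant(3, 0.2))  # lower left
print(quadrant(9, 0.5))  # lower right
print(quadrant(3, 2))    # top left
print(quadrant(9, 3))    # top right
```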

If a lot of your work ends up in the top right quadrant it indicates you are taking on more work than you can complete within your target cycle time.

Typically most teams start with a lot of work in the top right quadrant. This is especially true if you are running processes like Scrum, where the default motion is to release all the work at the end of a sprint, and many cards get started in parallel early in the sprint, causing many of them to accumulate latency.

Latency based Pull Policies

Cycle Time and Latency limits let us set scheduling policies that are based on the actual cycle times of cards in progress, rather than placing static Wip Limits by development phase.

Such a pull policy requires that a new piece of work cannot start unless all existing work is in a known "safe" zone. When a card is at risk of "escaping" into a non-safe zone, it becomes the responsibility of the whole team to make sure it can be safely taken over the finish line instead of starting new work. This means that if a team member is in a position to bring a card at risk back into the safe zone, that becomes their first priority instead of starting work on a new card. If there is no one in a position to do this, it is time to "stop the line" and figure out what to do right away.
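
One possible way to express such a policy in code is sketched below. It assumes the "safe" zone means a card is within both the cycle time limit and the latency limit, and it flags a card only once it has left that zone; a real policy might flag cards earlier, as they approach a limit. The `Card` type, the limits, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Card:
    name: str
    cycle_time_days: float
    latency_days: float


def is_safe(card, cycle_time_limit_days=7.0, latency_limit_days=0.7):
    """Assumption: 'safe' means within both the cycle time and latency limits."""
    return (card.cycle_time_days <= cycle_time_limit_days
            and card.latency_days <= latency_limit_days)


def next_action(cards_in_progress):
    """Decide whether new work may be pulled, or which cards need help first."""
    at_risk = [card.name for card in cards_in_progress if not is_safe(card)]
    if at_risk:
        return ("help cards outside the safe zone", at_risk)
    return ("pull new work", [])


board = [
    Card("A", cycle_time_days=2.0, latency_days=0.3),
    Card("B", cycle_time_days=5.0, latency_days=1.5),
]
print(next_action(board))  # ('help cards outside the safe zone', ['B'])
```

Under these assumptions, nobody pulls a new card while card B is stalled; once B records progress and its latency drops back under the limit, `next_action` returns `("pull new work", [])`.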

With latency based pull policies communicated and enforced, the team as a whole takes responsibility for moving cards end to end through the pipeline, rather than having a large set of half finished cards in progress at any time. Note that these policies are process agnostic. You can apply them whether you are running a Scrum process or a typical flow process like Kanban.

When you do this, Wip will go down, code reviews will get completed faster, cycle time will go down, and your customers as a whole will be happier. And you don't have to figure out what your Wip Limits are ahead of time, because they will change every day.

Try it and you'll see.