Meaningful Metrics

In my last blog post, Getting the most from metrics, I suggested using the acronym METRICS to describe what metrics could, and probably should, be in order to ensure that you get maximum value from them. This post looks specifically at ‘Meaningful’ metrics.

When producing a metric relating to an Agile delivery, it should have meaning. A change in a metric that has little or no meaning can lead to misinterpretation and provide a false indication of what has taken place, what is currently happening and/or where things are heading. This could lead to the wrong decisions being made.

A metric is given meaning by ensuring that it has context. Context may be provided by setting the scene for the use of the metric or by providing a scale against which the metric can be compared. Metrics should be relevant to the situation being monitored, with vanity metrics being avoided. It should be possible to identify actions from changes in metrics, and it may be necessary to use different metrics for different audiences.

I worked in one organisation where a senior manager liked to receive monthly details of the total number of tickets that had been completed (moved to ‘Done’) across the delivery teams. Although this total could be an indicator of the amount of work being undertaken/delivered, on its own it is a very subjective measure; it could provide a false indication of activity and may even be considered a vanity metric.

The total number of tickets included all user stories and defects that were completed. Understanding the breakdown of user stories to defects is an important indicator of progress – are we spending more time/effort building new features, or fixing existing ones? Furthermore, an increase in the number of larger user stories could result in a reduction in the total number of user stories completed. Alternatively, an increase in smaller user stories would have the opposite effect. This increase or decrease may not mean that any more or less value was being delivered, only that there had been a change in the size profile of the user stories. On its own, measuring only throughput (the total number of stories completed) provides a false measure of the progress of the team/department. On its own, such a metric is a vanity metric.
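As a rough sketch of how that breakdown might be calculated (the tickets and the `type` field below are hypothetical, purely for illustration), it only takes a few lines:

```python
from collections import Counter

# Hypothetical list of completed tickets for one month, each tagged with a type.
completed = [
    {"id": "T-101", "type": "story"},
    {"id": "T-102", "type": "defect"},
    {"id": "T-103", "type": "story"},
    {"id": "T-104", "type": "story"},
    {"id": "T-105", "type": "defect"},
]

counts = Counter(ticket["type"] for ticket in completed)
total = sum(counts.values())

print(f"Total tickets completed: {total}")
print(f"User stories: {counts['story']} ({counts['story'] / total:.0%})")
print(f"Defects:      {counts['defect']} ({counts['defect'] / total:.0%})")
```

Seen this way, ‘5 tickets completed’ becomes ‘3 new features and 2 fixes’, which is a far more meaningful statement.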

This is also a metric that could easily be gamed by the teams within the department. If they are being measured/monitored according to their throughput, user stories could start to become smaller. This may be a good thing as long as the user stories continue to deliver value. However, if they begin to edge towards a ‘task level’ and deliver no value unless they are released together, then this metric is being gamed.

One way for a metric like this to have some meaning would be to combine it with another metric. This may begin to give the metric context. For example, if you consider the total number of user stories completed alongside the total number of story points for those user stories, you can start to understand whether more or less work is being completed.

Let’s work through a simple example. To keep things simple, I will consider the output of one team on a monthly basis.

In Month 1 Team Alpha completes 25 user stories. At the moment this metric has no meaning as it has no context. Is 25 user stories good or bad? Is this better or worse than last month? How many did the team say that they would complete? Were the user stories big or small?

In Month 2 Team Alpha completes 30 user stories. This metric has a little more context as it can be compared to the previous month. At first sight it appears that there has been progress – the team completed more user stories. However, there is still a lack of detail that would enable any conclusions to be drawn.

The table below shows the number of user stories completed by Team Alpha over a six month period.

| Team Alpha   | Month 1 | Month 2 | Month 3 | Month 4 | Month 5 | Month 6 |
|--------------|---------|---------|---------|---------|---------|---------|
| User Stories | 25      | 30      | 35      | 20      | 35      | 40      |

At first sight, this table suggests that over time Team Alpha were completing more work, as the number of user stories completed generally increases month on month, with the exception of Month 4. But were they?

Let’s consider a different metric – the total number of story points completed (velocity) each month.

| Team Alpha   | Month 1 | Month 2 | Month 3 | Month 4 | Month 5 | Month 6 |
|--------------|---------|---------|---------|---------|---------|---------|
| Story Points | 75      | 80      | 80      | 100     | 70      | 80      |

Considering this metric on its own provides a different picture of progress over the six months. Now Month 4 appears to have been the most productive month, with the other five months having a similar output (70–80 story points).

In order to get more context and meaning, and to start to draw inferences from the metrics, it is necessary to combine these two metrics and create a new metric – average story size (the total story points divided by the number of user stories completed).

| Team Alpha      | Month 1  | Month 2    | Month 3    | Month 4  | Month 5  | Month 6  |
|-----------------|----------|------------|------------|----------|----------|----------|
| User Stories    | 25       | 30         | 35         | 20       | 35       | 40       |
| Story Points    | 75       | 80         | 80         | 100      | 70       | 80       |
| Avg. Story Size | 3 points | 2.7 points | 2.3 points | 5 points | 2 points | 2 points |

This table suggests that although Month 4 had the fewest user stories, the average size of those user stories was high, making it the most productive month. In contrast, Month 6, which had the highest number of user stories completed, had the joint lowest average story size.
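For completeness, here is a minimal sketch of how the combined metric could be derived from the two monthly series in the tables above (the figures are taken from the tables; rounding to one decimal place is my own choice):

```python
# Monthly figures for Team Alpha, taken from the tables above.
user_stories = [25, 30, 35, 20, 35, 40]
story_points = [75, 80, 80, 100, 70, 80]

for month, (stories, points) in enumerate(zip(user_stories, story_points), start=1):
    # Average story size = total story points / number of user stories completed.
    avg_size = points / stories
    print(f"Month {month}: {stories} stories, {points} points, "
          f"avg. story size {avg_size:.1f} points")
```

Running this reproduces the bottom row of the table, with Month 4 standing out at 5 points per story.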

The above example is a simplified one, and further analysis should be considered to understand the true productivity of the team. How many user stories/story points were the team forecasting that they would deliver? What percentage of the forecast did the team deliver? How many user stories/story points were being carried over between months? How many defects were recorded/resolved? Were there any other factors that affected the team’s ability to deliver, e.g. environment downtime, resource availability etc.?
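As one example of that further analysis, the ‘percentage of forecast delivered’ question could be answered with a similarly small sketch (the forecast figures below are hypothetical; the delivered figures are the story points from the example above):

```python
# Hypothetical monthly forecasts alongside the story points actually delivered.
forecast_points = [80, 80, 90, 90, 80, 80]
delivered_points = [75, 80, 80, 100, 70, 80]

for month, (forecast, delivered) in enumerate(zip(forecast_points, delivered_points), start=1):
    attainment = delivered / forecast  # proportion of the forecast actually delivered
    print(f"Month {month}: delivered {delivered} of {forecast} forecast points ({attainment:.0%})")
```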

However, possibly the biggest question that remains unanswered from the above example is whether or not the team delivered any value. Yes, they delivered 185 user stories and 485 story points, but were the functions/features that were developed released to users? If so, are the new functions/features being used? Are they generating revenue or increasing the revenue that was previously being generated? Have they improved the speed of the service? Have these user stories moved the team closer to the goal of the delivery? Only by answering the ‘value question’ will it be possible to look back and understand whether the team had been building the right thing.

Capturing additional data, manipulating it, turning it into information and analysing it allows inferences and conclusions to be drawn. In a criminal trial the prosecution will rarely rely on a single piece of evidence in order to obtain a conviction. They will gather all the evidence available from a variety of sources in order to corroborate their case and prove beyond reasonable doubt that the defendant committed the offence.

Metrics could be considered in a similar way. One metric is never enough. Relying on a single metric can be dangerous. It can lead to misinterpretation and provide a false indication as to what has taken place/what is happening. Wherever possible you should look to use additional metrics as evidence to provide meaning and context, although these metrics should be relevant to the situation being analysed.

Once a metric has been produced and analysed, and conclusions have been drawn, it is time to make recommendations, identify actions and implement change. Such changes will be made with a view to making improvements and will impact the metric being measured. As time passes the metric should continue to be monitored to see if the change had the desired result.

Any metric that could be considered a vanity metric, or could be gamed, should be avoided or used with extreme caution. It may also be necessary to consider using different metrics for different audiences. Metrics are first and foremost for use by ‘the team’ – they are not a management tool. They assist the team when planning for a Sprint/Release, but can also be used by the team to identify process improvements or monitor the impact of experiments.

I challenge you to review your metrics and make sure that they have ‘meaning’.