I've seen questions around issues calculating averages come up a few times in various forums and it came up again last week and I feel that there is some benefit in walking through the details of this issue. For many of you the following will be nothing new, but I'm hoping that this may serve as a reference that you can point to when you get requests for this sort of calculation.
The core issue here is really a fundamental mathematical one. Personally I see it surfacing most often in DAX and MDX as those are languages that I spend a lot of time with, but also because of their multi-dimensional natures you need to be able to write generic calculations that will work regardless of how the end users slice and dice the data.
The discussions invariably start with a statement like the following:
"I have a calculated measure that an average, but my totals are calculating incorrectly"
There are 2 different issues I see relating to this.
The first one is trying to use the AVG() function in MDX. Basically if you want an average calculation that works with all your different dimensions then avoid this function. The AVG function in MDX calculates the average over a fixed set. You may be able to use it in a static query, but to calculate an average in your MDX script simply create the two base measures - a sum and a count, then divide the sum by the count. This is not as much of an issue in DAX as the built-in AVERAGE, AVERAGEA and AVERAGEX generally work as expected.
The other sort of question that I see is related to how the totals are calculated and the question is usually something like the following:
"I have an average measure calculated by doing sum / count - which produces the correct average for each row, but the total is calculated as "sum of sums" / "sum of counts" and my user wants to see it as the average of all the averages."
And to put it bluntly this requirement is invalid. You should never "total" a series of averages by averaging them. The easiest way to explain why this is the case is to illustrate with some data. So let's have a look at a few scenarios.
The first problem you will see with the "average of averages" approach is that it gives too much weight to outlying amounts.
Category | Amount | Count | Average |
Bikes | 1,000 | 1 | 1,000 |
Helmets | 10,000 | 1,000 | 10 |
TOTAL | 11,000 | 1,001 | ??? |
Given the data above how should we calculate the total average? if we do the "average of averages" approach we have:
(1000 + 10) / 2 = 505
If we take the SUM(Amount) / SUM(Count) approach we get the following:
11000 / 1001 = 10.99
This is an extreme example to prove a point, but which do you think is correct? Should the 1 bike we sold for $1000 skew the average to $505 or should the fact that it was just one product out of 1001 mean that the average should only be $10.99?
Your business user might be happy seeing a higher average amount, but what if the situation was reversed and we had sold 1000 bikes and just one helmet? This would make the "average of averages" still equal 505 while recalculating the average at the total level would give us $999.01 - I know which calculation I think is giving a better indication of the total average sales.
It's possible that you may be thinking at this point that this is not so much of a big deal for you because you don't have that sort of variability in your data. However that is only the start of the issues. If you are still unsure about the evils of averaging averages then read on because it only gets worse.
To show the next nasty side effect we need to look at just a little bit more data. Take the following 4 records for example where we have data split between 2 cities and 2 product categories
City | Category | Amount | Count |
Melbourne | Bikes | 18 | 3 |
Melbourne | Helmets | 25 | 5 |
Seattle | Bikes | 21 | 3 |
Seattle | Helmets | 16 | 4 |
When we group the data by City we get the following results. The "Total" line is where the average is recalculated at the total level. Where as the "Avg of Averages" line is where I've take the average of the 2 City averages.
City | Amount | Count | Average |
Melbourne | 43 | 8 | 5.38 |
Seattle | 37 | 7 | 5.29 |
Avg of Averages | | | 5.34 |
Total | 80 | 15 | 5.33 |
Now lets have a look at what happens to the figures when we group the data by the product category. Notice that the Total line has remained unchanged, but the "Avg of Averages" is now different!
Category | Amount | Count | Average |
Bikes | 39 | 6 | 6.5 |
Helmets | 41 | 9 | 4.56 |
Avg of Averages | | | 5.53 |
Total | 80 | 15 | 5.33 |
This sort of behaviour - where the figures reported for total and sub-totals will vary depending on how the data is sliced and diced - will be the death of your BI project.
Trust - the most important "feature" of any BI project
I would argue that possibly the most important "feature" of any BI solution is trust. You can always add features and missing functionality, but it can be very difficult to win back the trust of your users once it's been lost. And nothing will erode the trust of your users than seeing inconsistent results.
It's not just straight Averages that are the issue
Anytime you are mixing calculations that do sums and divisions you need to be careful of the order of operations. Ratios, Percentages and moving averages are just a few of the examples of other calculation types for which you need to take care of the order in which add and divide things.