What Aggregates to Build?

This is probably the simplest part of data warehousing, yet far too many DBAs make this much more of an issue than is necessary. Here are two simple rules to go by:

  • Build whatever aggregates are required to make the end-users happy.

  • Aggregates should be 10?100 times smaller than the fact tables on which they're based.

The first rule seems so simple, yet it's often the one where technical people have the most problems. Data warehouses are not traditional, normalized database designs. So then why argue with a business user over the technical merits of an aggregate? If they ask for it and it will be smaller than the fact it's based on, then just do it. The need for an aggregate is a business issue, not a technical question.

Another way to look at this is the importance of data warehouse end-users. These people are generally executives and senior managers from the business side. These people make business decisions at strategic and tactical levels. They're also the people who budget for internal support organizations, like IS. They don't just pay your salary, but run the company such that there is even a need for such support. In other words, these people make decisions such that the business prospers and grows. If the business doesn't prosper, there won't be any need for DBAs and data warehouses.

So if a business user is organizationally worthy of an aggregate, then just build it, even if that's the only person who will ever use it. For example, at 7-Eleven, beer represents a significant portion of both sales revenue and profit. So when the "beer" executive asked for an aggregate on daily sales related just to beer, we built it. Yes, there was a brief discussion on our development team about how all of our existing aggregates were based on time (e.g., week, month, quarter, year) and thus were generally useful to all end-users. But no one wanted say no to the beer guy?and rightfully so, he was the #3 person in the company.

The second rule also seems simple: Aggregates should be much smaller than what they summarize. The problem here is to make sure you find out from the businesspeople the information regarding the candidate fact tables. There is a belief on the technical side that all facts can be summarized equally well across the time dimension, but that's not true. Some facts may not summarize equally well across each of the time dimension's levels. For example, at 7-Eleven, the order fact had a billion rows and the order week aggregate had 650 million rows. How could this happen? The technical team's assumption was that all facts summarized at least to week and month, so those were the minimally, initially built aggregates. But the businesspeople knew that stores only order once per week and hence there was no need for an order week aggregate; the technical people never asked. So, the lesson here is: Don't guesstimate probable sizes for aggregates; ask the businesspeople.

Another way to create small aggregates is to build them on non-time dimensions. A typical time dimension with several hierarchical levels may still possess only a few thousand or tens of thousands of rows. Thus, aggregates based on time are based on domains of relatively small ranges and groupings within those domains. So, aggregating by month obviously only compresses to 1/30 of the original. And, aggregating by quarter compresses to 1/90 the original. Now, suppose you wanted to aggregate along a non-time dimension, such as products. For example, a typical retail store might have 400,000 distinct products, which represent 2,000 categories. Thus, you would have much larger domains and groupings within those domains on which to base aggregates, and aggregating by product category would compress to 1/200 the original.

Finally, don't hesitate to combine aggregation techniques. Returning to the "beer" example, why not build a beer aggregate by week, month, and quarter?