Chapter 7. Implementing Aggregates

Aggregates are one of the key differentiators between successful data warehouses and otherwise mediocre attempts. As has been said several times throughout this book, successful data warehouses will experience much higher than originally expected end-user utilization, specifically more queries than planned. Why? Because if the data is easy to get at and the reports run quickly, the end-users will mine the data more than even they could have imagined. This will have another unexpected result: Your end-users will ask for more aggregate or summary tables as the data warehouse increases in size. Why? Because as the warehouse gets larger, they'll want you to maintain their fast report runtimes, and often, that will require creating aggregates. Of course, adding new tables will only serve to make the data warehouse grow even faster. So what? Disk space is cheap and happy end-users means job security.

What exactly is an aggregate? It is simply the rollup of an existing fact table along one of its dimensions, most often time. For example, if the base fact table is sales by day, then time-based aggregates might be sales by week, month, and quarter, with data volume reductions of approximately 1/7, 1/30, and 1/90, respectively. So, end-users doing trend analyses over somewhat long periods of time would benefit from querying smaller tables. Of course, as was said in Chapter 5, obtaining the correct explain plan is the most critical factor. But assuming the correct explain plan is being utilized, then querying a table that's 30 or 90 times smaller would only make a good thing much better. In real-world terms, a data warehouse fact table receiving 20 million rows each day and keeping 5 years of history online would contain 36 billion rows! However, the monthly aggregate would contain a more reasonable 1.2 billion rows, while the quarterly aggregate would contain a mere pittance of 400 million rows. With the right explain plan on just 400 million to 1.2 billion rows, end-user reports will run in seconds to minutes.

The DBA must be careful and manage the tradeoffs. With more aggregate tables come increased complexity, including disk space management, object management, partition management, and aggregate management. Of all these issues (and possibly others), the DBA must weight aggregate management the highest. Of course, there are the obvious aggregate issues, such as determining the calculation, creating the aggregate, updating the aggregate's values, and on occasion, reverifying the aggregate's data accuracy. But, it's the less obvious impact analysis that should be of more concern. For example, if you have a fact table that has six aggregates and you must change something regarding that fact table, it's possible that you'll invalidate one or more of the aggregates. And the problem may be much more than just breaking a calculation in a nightly summarization job. What if the fact change makes the contents of the existing aggregate invalid? You'd have to reassess and rebuild all the affected aggregates as well. Thus, aggregates are powerful weapons that must be researched and implemented with great insight.