Chapter 5. Tuning Ad-Hoc Queries

This is the single most important chapter in the entire book. Here I'll explain how to get lightning-fast queries, even though you have no idea what those queries may look like (i.e., truly ad-hoc reports). This is also going to be the hardest chapter for many DBAs to accept because the advice flies so contrary to popular belief. But rest assured, this is the only way to go.

A really good question to ask at this point is: Why am I covering ad-hoc query tuning before other major topics such as loading data? The answer is simple: If end-users cannot quickly view reports on the data, then the data warehouse is a bust. In fact, you can judge your data warehouse's success by simply asking the following: Do the end-users run many more ad-hoc reports than they had originally planned? If the answer is yes, then you've hit a home run. When ad-hoc reports run quickly, the users can perform much more detailed "what-if" analyses and drill much deeper into the massive amounts of data to find better business answers. So, more is better.

Another way to look at this is to think in terms of this nursery rhyme-type question: How much data would a data miner mine if a data miner could mine data? Again, the answer is really quite simple: as much as time permits. So, if ad-hoc reports run quickly, they'll naturally dig deeper. But note that we are talking about data warehouses with massive amounts of data, so "quickly" is really a very relative term. If you remember, in Chapter 1, we examined the criteria of what constitutes a data warehouse. The end-users are usually executives or very senior managers making monumental business decisions. As such, a report that answers a truly strategic question and runs in less than 30 minutes is most likely acceptable. Of course, if you follow these techniques, your ad-hoc reports will run in much less time than that.

Finally, in Chapter 4 we examined the concept of fact versus dimension tables and how to model them with data. I'd like to augment those definitions to include some query-specific terminology:

  • Dimension tables? Generally queried in business terms with high selectivity to find relatively few lookup value matches that are then used to query against the fact table.

  • Fact tables? Must be selectively queried since they often have hundreds of millions to billions of rows; even full table scans utilizing parallel query are too big for most systems.