10.2 Queries that Return Data from Too Many Rows

There are four main contexts in which queries cover too many rows, resulting in long-running queries even when per-row costs are as low as they can be:

Online queries that return hundreds or more rows: These queries are invariably excessive from the point of view of end users; such large result sets are inconvenient or impossible to use online. The most likely response of an end user is simply to repeat such a query with some further refinement to the conditions to try to reduce the results to a manageable size.
Batch reports that return tens of thousands or more rows: These queries are invariably excessive from the point of view of end users; such large result sets are inconvenient to read, even in a report. Certainly, a large, well-ordered report does not have to be read from cover to cover, but if the intent is just to provide a way to look up selected facts, why not use a well-designed online transaction for the purpose? A well-designed application on a relational database is a better way to look up selected facts than the output of a huge flat report.
Aggregations: Queries sometimes aggregate (i.e., summarize) large rowsets into results that are small enough for end users to digest. This can happen online or in batch.
Middleware: Batch processes sometimes function as middlewaresoftware that moves data around within a system or systems without sending that data to the end users. Since end users aren't part of the picture and computers have plenty of patience for large data volumes, these batch processes sometimes legitimately need to handle data volumes that are too large for human consumption.

Let's examine ways to fix each of these types of excessive reads, in turn.

10.2.1 Large Online Queries

In the long run, large online queries tend to take care of themselves. Every time end users deliberately or accidentally trigger a query with insufficiently selective filter criteria, they receive negative feedback in the form of a long wait and a returned result that is inconveniently large for online viewing. The returned result has too much chaff (unneeded data) mixed in with the wheat (the data the end users really want) for convenient use. Over time, end users learn which query criteria will likely deliver punishingly large result sets, and they learn to avoid those query criteria. End-user education can help this learning process along, but the problem is somewhat self-repairing even without formal effort.

Unfortunately, there are two key times in the short run when this sort of automated behavior modification doesn't prevent serious trouble:

During test usage

End users and developers often try out online forms in scenarios in which any value will suffice. In real use, an end user knows the name, or at least most of the name, before looking it up, and he knows to type at least a partial name to avoid having to scroll through a pick list of thousands of entries to find the right name. However, in a test scenario, the tester might be perfectly happy querying every name (such a query is called a blind query) and picking, for convenience, "Aaron, Abigail" from the top of the huge alphabetic list.

Blind queries such as this are much more common when end users are just surfing a test system, to test a preproduction application, than when they have production work to do. Blind queries like this can even find their way into hurriedly compiled, automated benchmark load tests, horribly distorting benchmark results. Unfortunately, it is exactly during such testing that you will likely be searching for performance problems, so these blind queries distort the problem picture and deliver a poor first impression of the application performance.

Early in end users' learning process

Novice end users do not yet know to avoid blind and insufficiently selective (i.e., semiblind) queries even in production use. When they execute such long-running queries, they form a poor, often costly first impression of the application performance. In early production use, and even later if end-user turnover is high, the system-wide load from these mistakes can be high, hurting everyone, even end users who avoid the mistakes.

Blind and semiblind queries in the most frequently used transactions are worth preventing in the application layer. When possible, the application should simply refuse to perform a potentially large query that does not have at least one known-to-be-selective search criteria specified at query time. This requires a little extra effort to discover in advance which selective search criteria will be available to end users in production use. These selective search criteria will map to a list of table columns that must drive efficient queries. However, you need to ask just these questions to make sure that you have indexed paths from those columns to the rest of the data; otherwise, you end up tuning queries, with otherwise harmful indexes, that will never come up in real application use. When end users query without reasonable search criteria, it is better to return immediate error messages that suggest more selective search criteria than to tie up their windows with long-running, useless queries.

Sometimes, you cannot guess in advance how selective a search will be until it runs; the selectivity depends on the application data. For example, the application might return far more matches to a last name of "Smith" than "Kmetec," but you would not want to hardcode a list of name frequencies into the application. In these cases, you need a way to see that a list is too long, without the expense of reading the whole list. The solution has several steps:

Determine the maximum-length list you want to return without error. For purposes of discussion, let's say the maximum length is 500.
In the call to the database, request one more row than that maximum length (501 rows for this discussion).
Arrange in advance that the query execution plan is robust, following nested loops to return rows without having to prehash, sort, or otherwise store whole large rowsets before returning the first 501 rows.
Request that the query result from the database be unsorted, so the database does not need to read all the rows before returning the first rows.
Sort the result as needed in the application layer if the result does not trigger an error by hitting the error threshold (501 rows for this discussion).
Cancel the query and return an error that suggests a more selective search if the rowcount hits the maximum length (501 rows for this discussion).

In my former performance-tuning position at TenFold Corporation, this technique proved so useful that we made it the automatic behavior of the EnterpriseTenFold application platform, with an adjustable maximum rowcount.

In summary, prevent large online queries with a three-pronged approach:

Train end users to specify narrow enough searches that they don't get more data than is useful or efficient to read.
Return error messages when end users attempt obviously unselective queries, especially blind queries based on large root detail tables.
Run potentially large queries in such a way that you get the first rows quickly, and return an error as soon as the number of returned rows is excessive.

10.2.2 Large Batch Reports

Slow online events are punishing enough that they don't go uncorrected. The affected end users either modify their behavior or complain loudly enough that the problem is corrected. Batch load can be more subtly dangerous in the end, since batch performance problems sometimes go unnoticed, creating enormous load and preventing adequate system throughput without being an obvious problem. When overall load is too high, all parts of an application slow down, but the villains, the batch processes that consume too much of the system resources, might be performing well enough to go unnoticed, especially when they are low-priority processes that no one is awaiting. Automatically prescheduled, periodic batch processes are especially dangerous: they might run much more frequently than needed, without anyone noticing. They might cease to be needed as often as they once were, or cease to be needed at all, but go right on tying up your system unnoticed.

Conceptually, there are several questions regarding a large batch report that are relevant to choosing a performance solution:

What is the reason for the report?
How is the report triggered?
Why is performance of the report a concern?
What sort of information does the reader extract from the report?

The answers to these questions all affect the best answer to the final question: how do you fix the report performance?

10.2.2.1 Reasons for large reports

Beginning with the assumption that no one person will ever read a huge application report from cover to cover, why should applications ever need huge report queries? Common reasons for huge report queries, and performance strategies for each reason, include:

A report has many readers, each of whom is interested in a different subset of the data. No one reads the report from cover to cover, but any given part of it might be interesting to at least one reader. The needs filled by such all-inclusive reports are often better served by multiple smaller reports. These can run in parallel, when needed, making it easier for the system to reach all the necessary data quickly enough. They also can each run just as often as their readers need, rather than read everyone's data as often as the most demanding readers require.
All details of a report are potentially interesting at the time the report is requested, but end users will read only a small part of the report, based on which questions happen to arise that the end users must answer. The need to answer such ad hoc questions as they arise is far better met by online application queries to the database. A flat report structure in a huge report can never offer a path to data as convenient as a well-built application. When you use reports for ad hoc data access in place of online applications, you are bypassing all the advantages of a relational database. Since the ad hoc online queries that replace the entire huge report will touch only a small subset of the data the report must read, the online solution requires far less logical and physical I/O.
Only a subset of the query data is ever used. Here, the solution is obvious and overwhelmingly beneficial: eliminate from the query and the report those rows that are never used. Where the report lists fewer rows than the query returns, add filters to match just the rows the report needs. If you trim the report itself, add filters to the queries that serve the trimmed report and tune the queries so they never touch the unused data. A special case in which only a subset of the data is required occurs when the end user needs only the summary information (the aggregations) in a report that includes both details and aggregations. In this case, eliminate the details from both the report and the database queries, and see Section 10.2.3 to consider further solutions.
A report is required only for legal reasons, not because anyone will ever read it. Such a justification for a report invites several questions. Is it still legally mandated, or did that requirement vanish and you're just producing the report out of habit? Is it really required as often as you are producing it? Rare huge reports are not likely to be a performance or throughput problem, so the essential point is to run them rarely if you cannot get rid of them. Does the law really require the data in the form of that report, or is it enough just to have access to the data in some form? Often, requirements for data retention do not specify the form of the retained data. The database itself, or its backups, might satisfy the requirements without an application report. If the report is required only for legal reasons, it can likely run during off hours, when load is less of an issue.

I once heard a thirdhand tale of a report produced for legal reasons that had been broken from the beginning; it filled many pages with a system dump every time it ran. Since no one actually used the report, no one noticed that it didn't work. However, each paper printout of the system dump was dutifully filed away, and the filer checked off a box in a form stating that the legal requirement to run and file the report was met. When the system was reengineered, the report was finally recognized as broken and was eliminated. The reengineering team concluded the report had never really been necessary after all. However, even when told that the report didn't work and never had, the person in charge of checking off the ran-and-filed box in the old form wanted the report put back in its original, broken form.

10.2.2.2 Ways reports are triggered

There are two basic ways reports get triggered:

Ad hoc requests

When a report is specifically, manually requested, chances are high that at least the requestor knows a genuine need for at least part of the report. It is also likely that the requestor cares how long she will have to wait for the report output, and a long report runtime might cost the business. When the requestor needs the report soon, it should get high priority, potentially even running parallel processes that consume much of the system's resources to meet the deadline. Otherwise, it is helpful to ensure that the queries avoid running in parallel, thus avoiding harm to higher-priority process performance. Furthermore, low-priority reports should automatically be relegated to run during low-load periods, which is often enough to eliminate any performance impact from these reports.

Automatic requests

Much of the batch load on most business systems is automatic, in the form of reports that were long ago scheduled to run automatically every day or week, or on some other periodic basis. These periodic batch processes are a particularly insidious load source, because they are so easy to forget and to ignore. Most addressees of reports receive far more material than they could ever read, even if you count only material produced by humans, setting aside monster automated reports with vast collections of uninteresting numbers. Somehow, most business people also feel vaguely embarrassed by their inability to read and digest the vast amount of information that comes their way. Therefore, rather than complain that they never read some monster of a report that they receive daily or weekly and argue that it is a waste of resources, they will keep meekly quiet about their embarrassing inability to accomplish the impossible. (I know I've done this; haven't you?) If you want to reduce the frequency of huge reports or eliminate them altogether, don't wait for the addressees to complain that the reports aren't needed!

One strategy to eliminate reports is to ask a leading question: "We suspect this report is no longer useful; what do you think?" Another strategy is to state, "We're going to eliminate this report, which we suspect is no longer useful, unless one of you lets us know you still need it. If you still need it, do you still need it on the same frequency, or can we produce it less often?" My favorite strategy is to eliminate everyone except yourself from the addressee list and just see if anyone complains. If they complain, send them your copy and keep the report, adding them back to the addressee list. If no one complains, drop the scheduled report after a safe interval. (Of course, I wouldn't suggest doing this without authority.) For pull reportsreports that the reader must navigate to, rather than receiving them by email or on paper (these are called push reports)you can get the same result by making the file inaccessible and waiting for complaints.

10.2.2.3 Reasons batch performance is a concern

If a long-running batch job does not create an overloaded system or otherwise cost the business money, don't worry about it. Otherwise, there are several reasons to address performance of the process, with different solutions for each:

End users await results: When a process requires an end user to request a report and the next important task the end user can do requires the output of that report, then report runtime is the main concern. Parallelizing the report process is one solution, allowing several processors to attack the problem at once. You can accomplish this in the database layer with parallel execution plans (which I have never found necessary in my own experience) or in the application layer, spawning multiple application processes that attack the problem at once and consolidate results in the end. However, much more often the solution lies in eliminating all or part of the report with the strategies described earlier under "Reasons for large reports." This is especially true because a bottleneck process like this rarely requires the reader to digest enormous amounts of data; chances are that only a few lines of the report are really required by the end user.
Report fails to fit a processing time-window: Many recurring processes take place in time-windows that are preassigned for the work, often in the middle of the night or over weekends. The rest of this chapter has strategies to reduce runtimes, making it easier to meet any given time-window. However, often the simplest strategy is to relax the time-window constraints. When you step back and examine the fundamental needs of the business, time-window constraints usually turn out to be more flexible than they appear. Usually, the window of a process exists within a larger window for a whole collection of processes. That larger window might be smaller than it needs to be. Even if it is not, you can almost always rearrange the processes within the collection to allow enough time for any given process, if the other processes are reasonably tuned.
Report overloads the system: The runtime of the report might not directly be a problem at all; no one needs the results right away. This is usually the easiest problem to solve: just make sure the report ties up only moderate amounts of scarce resources during its run. Avoid running parallel processes to hurry a low-priority report at the expense of higher-priority work. Almost always, you can find a way to push such low-priority processes to off hours, when system resources are not scarce and when the effect on the performance of other processes is negligible.

10.2.2.4 Report information types

Behind every report is (or should be) one or more business needs that the report addresses. If a report never influences the behavior of its addressees, it is a waste of resources. In a perfect world, a report would simply say, "Do the following: ...," and would be so well designed that you could follow that advice with perfect confidence. Since the recommended action would never take long to describe, long reports would be nonexistent. In the real world, the report helps some human to reach the same conclusion by following human reasoning about the information the report provides. However, it is still the case that, when the report is much longer than the description of the decision it supports, it is probably not distilling the data as well as it should. By discovering how to distill the data to a reasonable volume in the report queries, you not only make the queries inherently faster, you also produce a more usable report, helping to find the forest for the trees, so to speak.

Consider different ways a manager might use a long report, on the assumption that thousands of lines of numbers cannot each, individually, be relevant to business decision-making:

Only the totals, averages, and other aggregations are interesting, and the detail lines are useless. When this is the case, eliminate the detail lines and report only the aggregations. Refer to the strategies under the following section, "Aggregations of Many Details," to further tune the resulting queries.
The aggregations are what really matter, at least as approximations, but they're not even in the report. Instead, the poor manager must scan the details to do her own "eyeball" averages or sums, a miserable waste of human effort and an unreliable way to do arithmetic. If the addressees find themselves doing such eyeball arithmetic, it is a sure sign that the report is poorly designed and that it should report the aggregations directly.
Exceptions are what matter. The vast majority of report rows are useless, but the manager scans the report for special cases that call for action. In the manager's head are criteria for the conditions that call for action, or at least for closer consideration. The answer in this case is clear: define the exception criteria and report only the exceptions. When the exception criteria are fuzzy, at least figure out what defines a clear nonexception and filter those out. The result is almost certain to run much faster and to be a much more usable report.
The top (or bottom) n (for some nice, round n) are what really matter. This is really a special case of exceptions, but it is somewhat harder to define the exception criteria without first examining all the details from a sort. The key to handling this case is to realize that there is nothing magic about a nice, round n. It is probably just as good to produce a sorted list of records that meet a preset exception criteria. For example, you might choose to reward your top 10 salespersons. However, would you really want to reward the 10th-best if the 10th-best sold less than last year's average salesperson? On the flip side, would you want to bypass rewarding the 11th-best, who missed being 10th-best by $15? The point is that, compared to a top-n list, it is probably at least as useful to report a sorted list of exceptionsfor example, salespersons who exceeded half a million dollars in sales in the quarter. By defining good exception criteria, which can change as the business evolves, you save the database the work of finding every row to perform a complete sort when almost all of the data is unexceptional and has no real chance to reach the top of the sort. You also provide added information, such as how close the 11th-best was to the 10th-best and how many salespersons exceeded the threshold, compared to the last time you read the report.
A subset of the data is what matters. Discard the rest of the set.

10.2.2.5 Solutions

For any particular large report, at least one of the earlier sets of questions should lead to a solution to your performance problem. Sometimes, a combination of questions will lead to a multipart solution or will just reinforce the set of reasons for a single solution. In summary, these are the techniques that resolve performance problems:

Eliminate unused subsets of the reported data. Make sure these unused subsets are not only not reported, but are not even queried from the database. Special cases of this include:
- Eliminate details, in favor of aggregations only, and see Section 10.2.3, for how to report aggregations of many details quickly.
- Report exceptions only, eliminating nonexceptional rows.
- Replace top-n reporting with sorted lists of exceptions.
Replace large reports with several smaller reports that each cover just the needs of a subset of the addressees.
Eliminate large reports in favor of online functionality that helps the end users find the information they need as they decide they need it, instead of having them search a report for that information.
Eliminate reporting driven by data-retention requirements in favor of just maintaining access to the same data in the database or its backups.
Parallelize processing of high-priority reports that are needed quickly.
Serialize processing of low-priority reports, and push processing to off hours and lower frequency. Often, the correct frequency for low-priority reports is never.
Rearrange load during processing time-windows to relax time-window-driven constraints.

Admittedly, use of these techniques is more art than science, but take heart: I have never found a case in which no reasonable solution existed.

10.2.3 Aggregations of Many Details

It never makes sense to show an end user a million rows of data, either online or in a report. However, it makes perfect sense that an end user would want to know something about an aggregation of a million or more rows, such as "What was last quarter's revenue?," where that revenue summarizes a million or more order details. Unfortunately, it is no easier for the database to read a million rows for purposes of aggregation than for purposes of detail reporting. As a result, these large aggregations are often the ultimate in thorny performance problems: perfectly reasonable queries, from a functional perspective, that are inherently expensive to run, even with perfect tuning. These problems often call for a two-pronged attack:

Examine the query as if it were a query of many details, using the methods of the earlier sections, and apply techniques for large detail queries when possible. In particular, consider eliminating details that do not contribute to the aggregations. For example, when summarizing order details for revenue, you might find many order details, such as advertising inserts included for free, that have no cost to the customer. The aggregations are unaffected by these, so the developer might not have bothered to exclude them from the query. But if you exclude them explicitly, the query handles fewer rows without changing the result.
When necessary, preaggregate data in the database and report the summaries without touching the details. This is the most commonly justified form of redundant data stored in a database. It might be possible, for example, to deduce an account balance from the sum of all transactions to that account in its history. However, account balances are needed so commonly that it makes much more sense to store the running balance directly and make very sure that it is rigorously synchronized with the sum of all past transactions. Much of the complexity, error, and redundancy in applications originates from just such synchronization demands. Today, much of this need can be met, with careful design, through database triggers that automatically increment or decrement running sums and counts every time a detail record is inserted, updated, or deleted. With a trigger-based approach, the application frontend need not handle the synchronization needs at all, and many different sources of detail change can all propagate to a summary through a single trigger in the background. This guarantees, far better than the application code can, clean, rigorous synchronization of the redundant data.

10.2.4 Middleware Processes Handling Too Many Rows

When the destination for collected data is a system (either the same system, when creating redundant data, or another system, when propagating data) rather than a human, larger data volumes often make sense. Nevertheless, the techniques to reduce middleware data volumes look much like the techniques for reducing data volume in reports, substituting a machine addressee for a human one. These are the techniques most reminiscent of report-load-reduction techniques:

Eliminate unused subsets of the transferred data. Make sure these unused subsets are not only not transferred, but are not even queried from the database. Special cases of this include:
- Eliminate details, in favor of aggregations only, and see the previous section, Section 10.2.3, for how to collect aggregations of many details quickly.
- Transfer exceptions only, eliminating nonexceptional rows.
Parallelize processing of high-priority system interfaces that must move data quickly.
Serialize processing of low-priority system interfaces, and push processing to off hours and lower frequency. Often, the correct frequency for low-priority system interfaces is never.
Rearrange load during processing time-windows to relax time-window-driven constraints.

In addition to these familiar-looking techniques, there are several techniques specific to middleware:

Transfer only changed data, not data that hasn't changed since the last run. This is, by far, the most powerful technique to reduce middleware data volumes, since changes to data involve much lower volumes than fresh collections of all data. However, middleware commonly takes the slower path to all data, because it is harder, functionally, to keep data in sync through propagating changes to data rather than through complete data refreshes. Just as for preaggregation, which is after all just a special case of data transfer, the safest strategies for propagating only data changes depend on well-designed database triggers. With triggers, any data change, from any cause, will automatically fire the trigger and record the necessary changes whenever the source data is changed.
Eliminate the interface. When applications share a database instance, you can often eliminate interfaces, allowing applications to read each other's data instead of copying data from one application to another.
Move the dividing line between applications. For example, you might have one application that is responsible for Order Entry, and another for Order Fulfillment and Accounts Receivable, combined. The interface between Order Entry and Order Fulfillment would likely involve almost complete duplication of data. If you rearranged the systems to combine Order Entry and Order Fulfillment, you would find a much thinner interface, moving less data, to Accounts Receivable.
Make the interface faster. If you must move high data volumes between applications, at least arrange that those high volumes can move fast. The fastest interface simply moves data between tables within a database instance. The next-fastest interface moves data between instances on the same hardware. The next-fastest moves data across a local area network between instances on different machines. The slowest interface alternative transfers data across low-bandwidth, wide-area network links, potentially over intercontinental distances.