The types of queries issued by Decision Support Systems (DSS) differ from those issued against OLTP systems. Consider the following business queries:
Find the top 10 salespeople in each sales district last year.
Find all customers whose total orders last year exceeded 20% of the aggregate sales for their geographic region.
Identify the region that suffered the worst quarter-to-quarter sales decline last year.
Find the best and worst selling menu items by state for each quarter last year.
Queries such as these are staples of DSS, and are used by managers, analysts, marketing executives, etc., to spot trends, identify outliers, uncover business opportunities, and predict future business performance. DSS systems typically sit atop data warehouses, in which large quantities of scrubbed, aggregated data provide fertile grounds for researching and formulating business decisions.
Although all of the previous queries can be easily expressed in English, they have historically been difficult to formulate using SQL for the following reasons:
They may require different levels of aggregation of the same data.
They may involve intratable comparisons (comparing one or more rows in a table with other rows in the same table).
They may require an extra filtering step after the result set has been sorted (i.e., finding the top 10 and bottom 10 salespeople last month).
Although it is possible to generate the desired results using such SQL features as self joins, inline views, and user-defined functions, the resulting queries can be difficult to understand and might yield unacceptably long execution times. To illustrate the difficulty in formulating such queries, we will walk through the construction of this query: "Find all customers whose total orders in 2001 exceeded 20% of the aggregate sales for their geographic region."
For this and other examples in this chapter, we use a simple star schema consisting of a single fact table (called orders) containing aggregated sales information across the following dimensions: region, salesperson, customer, and month. There are two main facets to this query, each requiring a different level of aggregation of the same data:
Sum all sales per region last year.
Sum all sales per customer last year.
After these two intermediate result sets have been constructed, each customer's total can be compared to the total for their region to see if it exceeds 20%. The final result set will show the customer names along with their total sales, region name, and the percentage of their region's sales.
The query to aggregate sales by region looks as follows:
SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales FROM orders o WHERE o.year = 2001 GROUP BY o.region_id; REGION_ID TOT_SALES ---------- ---------- 5 6585641 6 6307766 7 6868495 8 6854731 9 6739374 10 6238901
The query to aggregate sales by customer would be:
SELECT o.cust_nbr cust_nbr, o.region_id region_id, SUM(o.tot_sales) tot_sales FROM orders o WHERE o.year = 2001 GROUP BY o.cust_nbr, o.region_id; CUST_NBR REGION_ID TOT_SALES ---------- ---------- ---------- 1 5 1151162 2 5 1224992 3 5 1161286 4 5 1878275 5 5 1169926 6 6 1788836 7 6 971585 8 6 1141638 9 6 1208959 10 6 1196748 11 7 1190421 12 7 1182275 13 7 1310434 14 7 1929774 15 7 1255591 16 8 1068467 17 8 1944281 18 8 1253840 19 8 1174421 20 8 1413722 21 9 1020541 22 9 1036146 23 9 1224992 24 9 1224992 25 9 2232703 26 10 1808949 27 10 1322747 28 10 986964 29 10 903383 30 10 1216858
By placing each of the two queries in an inline view and joining them on region_id, you can identify those customers whose total sales exceeds 20% of their region, as in:
SELECT cust_sales.cust_nbr cust_nbr, cust_sales.region_id region_id, cust_sales.tot_sales cust_sales, region_sales.tot_sales region_sales FROM (SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales FROM orders o WHERE o.year = 2001 GROUP BY o.region_id) region_sales INNER JOIN (SELECT o.cust_nbr cust_nbr, o.region_id region_id, SUM(o.tot_sales) tot_sales FROM orders o WHERE o.year = 2001 GROUP BY o.cust_nbr, o.region_id) cust_sales ON cust_sales.region_id = region_sales.region_id WHERE cust_sales.tot_sales > (region_sales.tot_sales * .2); CUST_NBR REGION_ID CUST_SALES REGION_SALES ---------- ---------- ---------- ------------ 4 5 1878275 6585641 6 6 1788836 6307766 14 7 1929774 6868495 17 8 1944281 6854731 20 8 1413722 6854731 25 9 2232703 6739374 26 10 1808949 6238901 27 10 1322747 6238901
The final step is to join the region and customer dimensions to include the customer and region names in the result set:
SELECT c.name cust_name, big_custs.cust_sales cust_sales, r.name region_name, 100 * ROUND(big_custs.cust_sales / big_custs.region_sales, 2) percent_of_region FROM (SELECT cust_sales.cust_nbr cust_nbr, cust_sales.region_id region_id, cust_sales.tot_sales cust_sales, region_sales.tot_sales region_sales FROM (SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales FROM orders o WHERE o.year = 2001 GROUP BY o.region_id) region_sales INNER JOIN (SELECT o.cust_nbr cust_nbr, o.region_id region_id, SUM(o.tot_sales) tot_sales FROM orders o WHERE o.year = 2001 GROUP BY o.cust_nbr, o.region_id) cust_sales ON cust_sales.region_id = region_sales.region_id WHERE cust_sales.tot_sales > (region_sales.tot_sales * .2)) big_custs INNER JOIN customer c ON big_custs.cust_nbr = c.cust_nbr INNER JOIN region r ON big_custs.region_id = r.region_id; CUST_NAME CUST_SALES REGION_NAME PERCENT_OF_REGION ---------------------- ---------- -------------------- ----------------- Flowtech Inc. 1878275 New England 29 Spartan Industries 1788836 Mid-Atlantic 28 Madden Industries 1929774 Southeast US 28 Evans Supply Corp. 1944281 Southwest US 28 Malden Labs 1413722 Southwest US 21 Worcester Technologies 2232703 Northwest US 33 Alpha Technologies 1808949 Central US 29 Phillips Labs 1322747 Central US 21
Using nothing more exotic than inline views, therefore, it is possible to construct a single query that generates the desired results. Such a solution, however, has the following shortcomings:
The query is fairly complex.
Two passes through the same rows of the orders table are required to generate the different aggregation levels needed by the query.
Let's see how we can both simplify the query and perform the same work in a single pass through the orders table using one of the new analytic functions. Rather than issuing two separate queries to aggregate sales per region and per customer, we will create a single query that aggregates sales over both region and customer, and then call an analytic function that performs a second level of aggregation to generate total sales per region:
SELECT o.region_id region_id, o.cust_nbr cust_nbr, SUM(o.tot_sales) tot_sales, SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales FROM orders o WHERE o.year = 2001 GROUP BY o.region_id, o.cust_nbr; REGION_ID CUST_NBR TOT_SALES REGION_SALES ---------- ---------- ---------- ------------ 5 1 1151162 6585641 5 2 1224992 6585641 5 3 1161286 6585641 5 4 1878275 6585641 5 5 1169926 6585641 6 6 1788836 6307766 6 7 971585 6307766 6 8 1141638 6307766 6 9 1208959 6307766 6 10 1196748 6307766 7 11 1190421 6868495 7 12 1182275 6868495 7 13 1310434 6868495 7 14 1929774 6868495 7 15 1255591 6868495 8 16 1068467 6854731 8 17 1944281 6854731 8 18 1253840 6854731 8 19 1174421 6854731 8 20 1413722 6854731 9 21 1020541 6739374 9 22 1036146 6739374 9 23 1224992 6739374 9 24 1224992 6739374 9 25 2232703 6739374 10 26 1808949 6238901 10 27 1322747 6238901 10 28 986964 6238901 10 29 903383 6238901 10 30 1216858 6238901
The analytic function can be found in line 3 of the previous query and the result has the alias region_sales. The aggregate function (SUM(o.tot_sales)) in line 2 generates the total sales per customer and region as directed by the GROUP BY clause, and the analytic function in line 3 aggregates these sums for each region, thereby computing the aggregate sales per region. The value for the region_sales column is identical for all customers within the same region and is equal to the sum of all customer sales within that region. We can then wrap the query in an inline view, filter out those customers with less than 20% of their region's total sales, and join the region and customer tables to generate the desired result set:
SELECT c.name cust_name, cust_sales.tot_sales cust_sales, r.name region_name, 100 * ROUND(cust_sales.tot_sales / cust_sales.region_sales, 2) percent_of_region FROM (SELECT o.region_id region_id, o.cust_nbr cust_nbr, SUM(o.tot_sales) tot_sales, SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales FROM orders o WHERE o.year = 2001 GROUP BY o.region_id, o.cust_nbr) cust_sales INNER JOIN region r ON cust_sales.region_id = r.region_id INNER JOIN customer c ON cust_sales.cust_nbr = c.cust_nbr WHERE cust_sales.tot_sales > (cust_sales.region_sales * .2); CUST_NAME CUST_SALES REGION_NAME PERCENT_OF_REGION ---------------------- ---------- -------------------- ----------------- Flowtech Inc. 1878275 New England 29 Spartan Industries 1788836 Mid-Atlantic 28 Madden Industries 1929774 Southeast US 28 Evans Supply Corp. 1944281 Southwest US 28 Malden Labs 1413722 Southwest US 21 Worcester Technologies 2232703 Northwest US 33 Alpha Technologies 1808949 Central US 29 Phillips Labs 1322747 Central US 21
Using an inline view saves us from having to join the region and customer tables to the orders table; otherwise, we would have to include columns from the region and customer tables in the GROUP BY clause.
Later in this chapter, under "Reporting Functions," we'll get into the details of how the SUM...OVER function works. For now, you can see that Oracle is performing an aggregation of an aggregation rather than revisiting the detail rows twice. Thus, the query runs faster and should also prove easier to understand and maintain once the syntax is familiar.
Unlike built-in functions such as DECODE, GREATEST, and SUBSTR, Oracle's suite of analytic functions can only be used in the SELECT and ORDER BY clauses of a query. This is because analytic functions are only executed after the FROM, WHERE, GROUP BY, and HAVING clauses have been evaluated. After the analytic functions have executed, the query's ORDER BY clause is evaluated to sort the final result set, and the ORDER BY clause is allowed to reference columns in the SELECT clause generated via analytic functions as well as specify analytic functions not found in the SELECT clause.
The remainder of this chapter introduces the Oracle8i Database and Oracle9i Database analytic functions, grouped by functionality.