14.1 Analytic SQL Overview

The types of queries issued by Decision Support Systems (DSS) differ from those issued against OLTP systems. Consider the following business queries:

Find the top 10 salespeople in each sales district last year.
Find all customers whose total orders last year exceeded 20% of the aggregate sales for their geographic region.
Identify the region that suffered the worst quarter-to-quarter sales decline last year.
Find the best and worst selling menu items by state for each quarter last year.

Queries such as these are staples of DSS, and are used by managers, analysts, marketing executives, etc., to spot trends, identify outliers, uncover business opportunities, and predict future business performance. DSS systems typically sit atop data warehouses, in which large quantities of scrubbed, aggregated data provide fertile grounds for researching and formulating business decisions.

Although all of the previous queries can be easily expressed in English, they have historically been difficult to formulate using SQL for the following reasons:

They may require different levels of aggregation of the same data.
They may involve intratable comparisons (comparing one or more rows in a table with other rows in the same table).
They may require an extra filtering step after the result set has been sorted (i.e., finding the top 10 and bottom 10 salespeople last month).

Although it is possible to generate the desired results using such SQL features as self joins, inline views, and user-defined functions, the resulting queries can be difficult to understand and might yield unacceptably long execution times. To illustrate the difficulty in formulating such queries, we will walk through the construction of this query: "Find all customers whose total orders in 2001 exceeded 20% of the aggregate sales for their geographic region."

For this and other examples in this chapter, we use a simple star schema consisting of a single fact table (called orders) containing aggregated sales information across the following dimensions: region, salesperson, customer, and month. There are two main facets to this query, each requiring a different level of aggregation of the same data:

Sum all sales per region last year.
Sum all sales per customer last year.

After these two intermediate result sets have been constructed, each customer's total can be compared to the total for their region to see if it exceeds 20%. The final result set will show the customer names along with their total sales, region name, and the percentage of their region's sales.

The query to aggregate sales by region looks as follows:

SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales

FROM orders o

WHERE o.year = 2001

GROUP BY o.region_id;



REGION_ID  TOT_SALES

---------- ----------

         5    6585641

         6    6307766

         7    6868495

         8    6854731

         9    6739374

        10    6238901

The query to aggregate sales by customer would be:

SELECT o.cust_nbr cust_nbr, o.region_id region_id,

  SUM(o.tot_sales) tot_sales

FROM orders o

WHERE o.year = 2001

GROUP BY o.cust_nbr, o.region_id;



  CUST_NBR  REGION_ID  TOT_SALES

---------- ---------- ----------

         1          5    1151162

         2          5    1224992

         3          5    1161286

         4          5    1878275

         5          5    1169926

         6          6    1788836

         7          6     971585

         8          6    1141638

         9          6    1208959

        10          6    1196748

        11          7    1190421

        12          7    1182275

        13          7    1310434

        14          7    1929774

        15          7    1255591

        16          8    1068467

        17          8    1944281

        18          8    1253840

        19          8    1174421

        20          8    1413722

        21          9    1020541

        22          9    1036146

        23          9    1224992

        24          9    1224992

        25          9    2232703

        26         10    1808949

        27         10    1322747

        28         10     986964

        29         10     903383

        30         10    1216858

By placing each of the two queries in an inline view and joining them on region_id, you can identify those customers whose total sales exceeds 20% of their region, as in:

SELECT cust_sales.cust_nbr cust_nbr, cust_sales.region_id region_id,

  cust_sales.tot_sales cust_sales, region_sales.tot_sales region_sales

FROM

 (SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales

  FROM orders o

  WHERE o.year = 2001

  GROUP BY o.region_id) region_sales INNER JOIN

 (SELECT o.cust_nbr cust_nbr, o.region_id region_id,

    SUM(o.tot_sales) tot_sales

  FROM orders o

  WHERE o.year = 2001

  GROUP BY o.cust_nbr, o.region_id) cust_sales

  ON cust_sales.region_id = region_sales.region_id

WHERE cust_sales.tot_sales > (region_sales.tot_sales * .2);



  CUST_NBR  REGION_ID CUST_SALES REGION_SALES

---------- ---------- ---------- ------------

         4          5    1878275      6585641

         6          6    1788836      6307766

        14          7    1929774      6868495

        17          8    1944281      6854731

        20          8    1413722      6854731

        25          9    2232703      6739374

        26         10    1808949      6238901

        27         10    1322747      6238901

The final step is to join the region and customer dimensions to include the customer and region names in the result set:

SELECT c.name cust_name,

  big_custs.cust_sales cust_sales, r.name region_name,

  100 * ROUND(big_custs.cust_sales / 

    big_custs.region_sales, 2)  percent_of_region

FROM 

 (SELECT cust_sales.cust_nbr cust_nbr, cust_sales.region_id region_id,

    cust_sales.tot_sales cust_sales, 

    region_sales.tot_sales region_sales

  FROM

   (SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales

    FROM orders o

    WHERE o.year = 2001

    GROUP BY o.region_id) region_sales INNER JOIN

   (SELECT o.cust_nbr cust_nbr, o.region_id region_id,

      SUM(o.tot_sales) tot_sales

    FROM orders o

    WHERE o.year = 2001

    GROUP BY o.cust_nbr, o.region_id) cust_sales

    ON cust_sales.region_id = region_sales.region_id

  WHERE cust_sales.tot_sales > (region_sales.tot_sales * .2)) big_custs INNER JOIN 

customer c

  ON big_custs.cust_nbr = c.cust_nbr

  INNER JOIN region r

  ON big_custs.region_id = r.region_id;



CUST_NAME              CUST_SALES REGION_NAME          PERCENT_OF_REGION

---------------------- ---------- -------------------- -----------------

Flowtech Inc.             1878275 New England                         29

Spartan Industries        1788836 Mid-Atlantic                        28

Madden Industries         1929774 Southeast US                        28

Evans Supply Corp.        1944281 Southwest US                        28

Malden Labs               1413722 Southwest US                        21

Worcester Technologies    2232703 Northwest US                        33

Alpha Technologies        1808949 Central US                          29

Phillips Labs             1322747 Central US                          21

Using nothing more exotic than inline views, therefore, it is possible to construct a single query that generates the desired results. Such a solution, however, has the following shortcomings:

The query is fairly complex.
Two passes through the same rows of the orders table are required to generate the different aggregation levels needed by the query.

Let's see how we can both simplify the query and perform the same work in a single pass through the orders table using one of the new analytic functions. Rather than issuing two separate queries to aggregate sales per region and per customer, we will create a single query that aggregates sales over both region and customer, and then call an analytic function that performs a second level of aggregation to generate total sales per region:

SELECT o.region_id region_id, o.cust_nbr cust_nbr,

 SUM(o.tot_sales) tot_sales,

 SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales

FROM orders o

WHERE o.year = 2001

GROUP BY o.region_id, o.cust_nbr;



REGION_ID   CUST_NBR  TOT_SALES REGION_SALES

---------- ---------- ---------- ------------

         5          1    1151162      6585641

         5          2    1224992      6585641

         5          3    1161286      6585641

         5          4    1878275      6585641

         5          5    1169926      6585641

         6          6    1788836      6307766

         6          7     971585      6307766

         6          8    1141638      6307766

         6          9    1208959      6307766

         6         10    1196748      6307766

         7         11    1190421      6868495

         7         12    1182275      6868495

         7         13    1310434      6868495

         7         14    1929774      6868495

         7         15    1255591      6868495

         8         16    1068467      6854731

         8         17    1944281      6854731

         8         18    1253840      6854731

         8         19    1174421      6854731

         8         20    1413722      6854731

         9         21    1020541      6739374

         9         22    1036146      6739374

         9         23    1224992      6739374

         9         24    1224992      6739374

         9         25    2232703      6739374

        10         26    1808949      6238901

        10         27    1322747      6238901

        10         28     986964      6238901

        10         29     903383      6238901

        10         30    1216858      6238901

The analytic function can be found in line 3 of the previous query and the result has the alias region_sales. The aggregate function (SUM(o.tot_sales)) in line 2 generates the total sales per customer and region as directed by the GROUP BY clause, and the analytic function in line 3 aggregates these sums for each region, thereby computing the aggregate sales per region. The value for the region_sales column is identical for all customers within the same region and is equal to the sum of all customer sales within that region. We can then wrap the query in an inline view, filter out those customers with less than 20% of their region's total sales, and join the region and customer tables to generate the desired result set:

SELECT c.name cust_name,

  cust_sales.tot_sales cust_sales, r.name region_name,

  100 * ROUND(cust_sales.tot_sales / 

    cust_sales.region_sales, 2)  percent_of_region

FROM 

 (SELECT o.region_id region_id, o.cust_nbr cust_nbr,

    SUM(o.tot_sales) tot_sales,

    SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales

  FROM orders o

  WHERE o.year = 2001

  GROUP BY o.region_id, o.cust_nbr) cust_sales INNER JOIN region r

  ON cust_sales.region_id = r.region_id

  INNER JOIN customer c

  ON cust_sales.cust_nbr = c.cust_nbr

WHERE cust_sales.tot_sales > (cust_sales.region_sales * .2);



CUST_NAME              CUST_SALES REGION_NAME          PERCENT_OF_REGION

---------------------- ---------- -------------------- -----------------

Flowtech Inc.             1878275 New England                         29

Spartan Industries        1788836 Mid-Atlantic                        28

Madden Industries         1929774 Southeast US                        28

Evans Supply Corp.        1944281 Southwest US                        28

Malden Labs               1413722 Southwest US                        21

Worcester Technologies    2232703 Northwest US                        33

Alpha Technologies        1808949 Central US                          29

Phillips Labs             1322747 Central US                          21

Using an inline view saves us from having to join the region and customer tables to the orders table; otherwise, we would have to include columns from the region and customer tables in the GROUP BY clause.

Later in this chapter, under "Reporting Functions," we'll get into the details of how the SUM...OVER function works. For now, you can see that Oracle is performing an aggregation of an aggregation rather than revisiting the detail rows twice. Thus, the query runs faster and should also prove easier to understand and maintain once the syntax is familiar.

Unlike built-in functions such as DECODE, GREATEST, and SUBSTR, Oracle's suite of analytic functions can only be used in the SELECT and ORDER BY clauses of a query. This is because analytic functions are only executed after the FROM, WHERE, GROUP BY, and HAVING clauses have been evaluated. After the analytic functions have executed, the query's ORDER BY clause is evaluated to sort the final result set, and the ORDER BY clause is allowed to reference columns in the SELECT clause generated via analytic functions as well as specify analytic functions not found in the SELECT clause.

The remainder of this chapter introduces the Oracle8i Database and Oracle9i Database analytic functions, grouped by functionality.