Determining the performance of a particular business entity compared to its peers is central to a wide variety of business decisions. Examples include:
Identifying assets with the highest utilization
Determining the worst selling products by region
Finding the best performing salespeople
Prior to the release of Oracle8i Database, you could use the ORDER BY clause to sort a result set on one or more columns, but any further processing to calculate rankings or percentiles had to be performed using a procedural language. Beginning with Oracle8i Database, however, you can take advantage of several new functions to either generate rankings for each row in a result set or to group rows into buckets for percentile calculations.
The RANK, DENSE_RANK, and ROW_NUMBER functions generate an integer value from 1 to N for each row, where N is less than or equal to the number of rows in the result set. The differences in the values returned by these functions revolves around how each one handles ties:
Returns a unique number for each row starting with 1. For rows that have duplicate values, numbers are arbitrarily assigned.
Assigns a unique number for each row starting with 1, except for rows that have duplicate values, in which case the same ranking is assigned.
Assigns a unique number for each row starting with 1, except for rows that have duplicate values, in which case the same ranking is assigned and a gap appears in the sequence for each duplicate ranking.
An example will best illustrate the differences. First, here is the query to generate the aggregate sales data by region and customer for the year 2001:
SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales FROM orders WHERE year = 2001 GROUP BY region_id, cust_nbr ORDER BY region_id, cust_nbr; REGION_ID CUST_NBR CUST_SALES    5 1 1151162 5 2 1224992 5 3 1161286 5 4 1878275 5 5 1169926 6 6 1788836 6 7 971585 6 8 1141638 6 9 1208959 6 10 1196748 7 11 1190421 7 12 1182275 7 13 1310434 7 14 1929774 7 15 1255591 8 16 1068467 8 17 1944281 8 18 1253840 8 19 1174421 8 20 1413722 9 21 1020541 9 22 1036146 9 23 1224992 9 24 1224992 9 25 2232703 10 26 1808949 10 27 1322747 10 28 986964 10 29 903383 10 30 1216858
Notice that three of the customers (2, 23, and 24) have the same value for total sales ($1,224,992). In the next query, three function calls are added to generate rankings for each customer across all regions, and the results are then ordered by the ROW_NUMBER function to make the difference in rankings easier to observe:
SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales, RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_rank, DENSE_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_dense_rank, ROW_NUMBER( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_number FROM orders WHERE year = 2001 GROUP BY region_id, cust_nbr ORDER BY sales_number; REGION_ID CUST_NBR CUST_SALES SALES_RANK SALES_DENSE_RANK SALES_NUMBER       9 25 2232703 1 1 1 8 17 1944281 2 2 2 7 14 1929774 3 3 3 5 4 1878275 4 4 4 10 26 1808949 5 5 5 6 6 1788836 6 6 6 8 20 1413722 7 7 7 10 27 1322747 8 8 8 7 13 1310434 9 9 9 7 15 1255591 10 10 10 8 18 1253840 11 11 11 5 2 1224992 12 12 12 9 23 1224992 12 12 13 9 24 1224992 12 12 14 10 30 1216858 15 13 15 6 9 1208959 16 14 16 6 10 1196748 17 15 17 7 11 1190421 18 16 18 7 12 1182275 19 17 19 8 19 1174421 20 18 20 5 5 1169926 21 19 21 5 3 1161286 22 20 22 5 1 1151162 23 21 23 6 8 1141638 24 22 24 8 16 1068467 25 23 25 9 22 1036146 26 24 26 9 21 1020541 27 25 27 10 28 986964 28 26 28 6 7 971585 29 27 29 10 29 903383 30 28 30
Don't be confused by the ORDER BY clause at the end of the query and the ORDER BY clauses within each function call; the functions use their ORDER BY clauses internally to sort their results for the purpose of applying a ranking. Thus, each of the three functions applies its ranking algorithm to the sum of each customer's sales in descending order. The final ORDER BY clause specifies the results of the ROW_NUMBER function as the sort key for the final result set, but we could have picked any of the six columns as our sort key.
Both the RANK and DENSE_RANK functions assign the rank of 12 to the three rows with total sales of $1,224,992, while the ROW_NUMBER function assigns the ranks 12, 13, and 14 to the same rows. The difference between the RANK and DENSE_RANK functions manifests itself in the ranking assigned to the nextlowest sales total; the RANK function leaves a gap in the ranking sequence and assigns a rank of 15 to customer number 30, while the DENSE_RANK function continues the sequence with a ranking of 13.
Deciding which of the three functions to use depends on the desired outcome. If you want to identify the top 13 customers from this result set, you would use:
If you want exactly 13 rows without regard to ties. In this case, one of the customers who might otherwise be included in the list will be excluded from the final set.
If you want at least 13 rows but don't want to include rows that would have been excluded had there been no ties. In this case, you would retrieve 14 rows.
If you want all customers with a ranking of 13 or less, including all duplicates. In this case, you would retrieve 15 rows.
While the previous query generates rankings across the entire result set, it is also possible to generate independent sets of rankings across multiple partitions of the result set. The following query generates rankings for customer sales within each region rather than across all regions. Note the addition of the PARTITION BY clause:
SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales, RANK( ) OVER (PARTITION BY region_id ORDER BY SUM(tot_sales) DESC) sales_rank, DENSE_RANK( ) OVER (PARTITION BY region_id ORDER BY SUM(tot_sales) DESC) sales_dense_rank, ROW_NUMBER( ) OVER (PARTITION BY region_id ORDER BY SUM(tot_sales) DESC) sales_number FROM orders WHERE year = 2001 GROUP BY region_id, cust_nbr ORDER BY region_id, sales_number; REGION_ID CUST_NBR CUST_SALES SALES_RANK SALES_DENSE_RANK SALES_NUMBER       5 4 1878275 1 1 1 5 2 1224992 2 2 2 5 5 1169926 3 3 3 5 3 1161286 4 4 4 5 1 1151162 5 5 5 6 6 1788836 1 1 1 6 9 1208959 2 2 2 6 10 1196748 3 3 3 6 8 1141638 4 4 4 6 7 971585 5 5 5 7 14 1929774 1 1 1 7 13 1310434 2 2 2 7 15 1255591 3 3 3 7 11 1190421 4 4 4 7 12 1182275 5 5 5 8 17 1944281 1 1 1 8 20 1413722 2 2 2 8 18 1253840 3 3 3 8 19 1174421 4 4 4 8 16 1068467 5 5 5 9 25 2232703 1 1 1 9 23 1224992 2 2 2 9 24 1224992 2 2 3 9 22 1036146 4 3 4 9 21 1020541 5 4 5 10 26 1808949 1 1 1 10 27 1322747 2 2 2 10 30 1216858 3 3 3 10 28 986964 4 4 4 10 29 903383 5 5 5
Each customer receives a ranking between one and five depending on their relation to other customers in the same region. Of the three customers with duplicate total sales, two of them are in region 9; as before, the RANK and DENSE_RANK functions generate identical rankings for both customers.

All ranking functions allow you to specify where in the ranking order NULL values should appear. This is accomplished by appending either NULLS FIRST or NULLS LAST after the ORDER BY clause of the function, as in:
SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales,
RANK( ) OVER (ORDER BY SUM(tot_sales) DESC NULLS LAST) sales_rank
FROM orders
WHERE year = 2001
GROUP BY region_id, cust_nbr;
If omitted, NULL values will either appear last in ascending rankings or first in descending rankings.
One of the most common uses of a ranked data set is to identify the top N or bottom N performers. Since you can't call analytic functions from the WHERE or HAVING clauses, you are forced to generate the rankings for all the rows and then use an outer query to filter out the unwanted rankings. For example, the following query uses an inline view to identify the top five salespersons for 2001:
SELECT s.name, sp.sp_sales total_sales FROM (SELECT salesperson_id, SUM(tot_sales) sp_sales, RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_rank FROM orders WHERE year = 2001 GROUP BY salesperson_id) sp INNER JOIN salesperson s ON sp.salesperson_id = s.salesperson_id WHERE sp.sales_rank <= 5 ORDER BY sp.sales_rank; NAME TOTAL_SALES   Jeff Blake 1927580 Sam Houseman 1814327 Mark Russell 1784596 John Boorman 1768813 Carl Isaacs 1761814 Tim McGowan 1761814
Although there is no function for returning only the top or bottom N from a ranked result set, Oracle provides functionality for identifying the first (top 1) or last (bottom 1) records in a ranked set. This is useful for queries such as the following: "Find the regions with the best and worst total sales last year." Unlike the top five salespeople example from the previous section, this query needs an additional piece of information?the size of the result set?to answer the question.
Oracle9i provides the ability to answer such queries efficiently using functions that rank the result set based on a specified ordering, identify the row with the top or bottom ranking, and report on any column available in the result set. These functions are composed of three parts:
An ORDER BY clause that specifies how to rank the result set.
The keywords FIRST and LAST to specify whether to use the top or bottomranked row.
An aggregate function (i.e., MIN, MAX, AVG, COUNT) used as a tiebreaker in case more than one row of the result set tie for the FIRST or LAST spot in the ranking.
The following query uses the MIN aggregate function to find the regions that rank FIRST and LAST by total sales:
SELECT MIN(region_id) KEEP (DENSE_RANK FIRST ORDER BY SUM(tot_sales) DESC) best_region, MIN(region_id) KEEP (DENSE_RANK LAST ORDER BY SUM(tot_sales) DESC) worst_region FROM orders WHERE year = 2001 GROUP BY region_id; BEST_REGION WORST_REGION   7 10
The use of the MIN function in the previous query is a bit confusing: it is used only if more than one region ties for either first or last place in the ranking. If there were a tie, the row with the minimum value for region_id would be chosen. To find out if a tie actually exists, you could call each function twice using MIN for the first and MAX for the second, and see if they return the same results:
SELECT MIN(region_id) KEEP (DENSE_RANK FIRST ORDER BY SUM(tot_sales) DESC) min_best_region, MAX(region_id) KEEP (DENSE_RANK FIRST ORDER BY SUM(tot_sales) DESC) max_best_region, MIN(region_id) KEEP (DENSE_RANK LAST ORDER BY SUM(tot_sales) DESC) min_worst_region, MAX(region_id) KEEP (DENSE_RANK LAST ORDER BY SUM(tot_sales) DESC) max_worst_region FROM orders WHERE year = 2001 GROUP BY region_id; MIN_BEST_REGION MAX_BEST_REGION MIN_WORST_REGION MAX_WORST_REGION     7 7 10 10
In this case, there are no ties for either first or last place. Depending on the type of data you are working with, using an aggregate function as a tiebreaker can can be somewhat arbitrary.
Another way rankings are commonly used is to generate buckets into which sets of rankings are grouped. For example, you may want to find those customers whose total sales ranked in the top 25%. The following query uses the NTILE function to group the customers into four buckets (or quartiles):
SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales, NTILE(4) OVER (ORDER BY SUM(tot_sales) DESC) sales_quartile FROM orders WHERE year = 2001 GROUP BY region_id, cust_nbr ORDER BY sales_quartile, cust_sales DESC; REGION_ID CUST_NBR CUST_SALES SALES_QUARTILE     9 25 2232703 1 8 17 1944281 1 7 14 1929774 1 5 4 1878275 1 10 26 1808949 1 6 6 1788836 1 8 20 1413722 1 10 27 1322747 1 7 13 1310434 2 7 15 1255591 2 8 18 1253840 2 5 2 1224992 2 9 23 1224992 2 9 24 1224992 2 10 30 1216858 2 6 9 1208959 2 6 10 1196748 3 7 11 1190421 3 7 12 1182275 3 8 19 1174421 3 5 5 1169926 3 5 3 1161286 3 5 1 1151162 3 6 8 1141638 4 8 16 1068467 4 9 22 1036146 4 9 21 1020541 4 10 28 986964 4 6 7 971585 4 10 29 903383 4
The sales_quartile column in this query specifies NTILE(4) to create four buckets. The NTILE function finds each row's place in the ranking, and then assigns each row to a bucket such that every bucket contains the same number of rows. If the number of rows is not evenly divisible by the number of buckets, then the extra rows are distributed so that the number of rows per bucket differs by one at most. In the previous example, there are four buckets allocated for 30 rows, with buckets one and two containing eight rows each, and buckets three and four containing seven rows each. This approach is referred to as equiheight buckets because each bucket contains (optimally) the same number of rows.
Just like in the top N query discussed earlier, you will need to wrap the query in an inline view if you want to filter on the NTILE result:
SELECT r.name region, c.name customer, cs.cust_sales FROM (SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales, NTILE(4) OVER (ORDER BY SUM(tot_sales) DESC) sales_quartile FROM orders WHERE year = 2001 GROUP BY region_id, cust_nbr) cs INNER JOIN customer c ON cs.cust_nbr = c.cust_nbr INNER JOIN region r ON cs.region_id = r.region_id WHERE cs.sales_quartile = 1 ORDER BY cs.cust_sales DESC; REGION CUSTOMER CUST_SALES    Northwest US Worcester Technologies 2232703 Southwest US Evans Supply Corp. 1944281 Southeast US Madden Industries 1929774 New England Flowtech Inc. 1878275 Central US Alpha Technologies 1808949 MidAtlantic Spartan Industries 1788836 Southwest US Malden Labs 1413722 Central US Phillips Labs 1322747
The outer query filters on sales_quartile = 1, which removes all rows not in the top 25% of sales, and then joins the region and customer dimensions to generate the final results.
Similar to the NTILE function, the WIDTH_BUCKET function groups rows of the result set into buckets. Unlike NTILE, however, the WIDTH_BUCKET function attempts to create equiwidth buckets, meaning that the range of values is evenly distributed across the buckets. If your data were distributed across a bell curve, therefore, you could expect the buckets representing the low and high ranges of the bell curve to contain few records, whereas the buckets representing the middle ranges would contain many records.
WIDTH_BUCKET can operate on numeric or date types, and takes the following four parameters:
The expression that generates the buckets
The value used as the start of the range for bucket #1
The value used as the end of the range for bucket #N
The number of buckets to create (N)
WIDTH_BUCKET uses the values of the second, third, and fourth parameters to generate N buckets containing comparable ranges. If the expression yields values that fall outside the range specified by the second and third parameters, the WIDTH_BUCKET function will generate two additional buckets, numbered 0 and N + 1, into which the outliers are placed. If you want to work with the entire result set, you need to make sure your values for the second and third parameters completely enclose the range of values in the result set. However, if you only wish to work with a subset of the data, you can specify values for the second and third parameters that enclose the desired range, and any rows falling outside the range will be placed into buckets 0 and N + 1.
Here's an example that uses the NTILE example from earlier to generate three buckets for the total sales per customer:
SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales, WIDTH_BUCKET(SUM(tot_sales), 1, 3000000, 3) sales_buckets FROM orders WHERE year = 2001 GROUP BY region_id, cust_nbr ORDER BY cust_sales; REGION_ID CUST_NBR CUST_SALES SALES_BUCKETS     10 29 903383 1 6 7 971585 1 10 28 986964 1 9 21 1020541 2 9 22 1036146 2 8 16 1068467 2 6 8 1141638 2 5 1 1151162 2 5 3 1161286 2 5 5 1169926 2 8 19 1174421 2 7 12 1182275 2 7 11 1190421 2 6 10 1196748 2 6 9 1208959 2 10 30 1216858 2 5 2 1224992 2 9 24 1224992 2 9 23 1224992 2 8 18 1253840 2 7 15 1255591 2 7 13 1310434 2 10 27 1322747 2 8 20 1413722 2 6 6 1788836 2 10 26 1808949 2 5 4 1878275 2 7 14 1929774 2 8 17 1944281 2 9 25 2232703 3
Based on these parameters, the WIDTH_BUCKET function generates three buckets; the first bucket starts at 1, and the third bucket has an upper range of 3,000,000. Since there are three buckets, the ranges for each bucket will be 1 to 1,000,000, 1,000,001 to 2,000,000, and 2,000,001 to 3,000,000. When the rows are placed in the appropriate bucket, there are three rows that fall into bucket #1, a single row that falls in bucket #3, and the remaining 26 rows that fall into the second bucket.
The values 1 and 3,000,000 were chosen to guarantee that all rows in the result set would be placed into one of the three buckets. If you want to generate buckets only for rows that have aggregate sales between $1,000,000 and $2,000,000, the WIDTH_BUCKET function will place the remaining rows in the 0^{th} and 4^{th} buckets:
SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales, WIDTH_BUCKET(SUM(tot_sales), 1000000, 2000000, 3) sales_buckets FROM orders WHERE year = 2001 GROUP BY region_id, cust_nbr ORDER BY cust_sales; REGION_ID CUST_NBR CUST_SALES SALES_BUCKETS     10 29 903383 0 6 7 971585 0 10 28 986964 0 9 21 1020541 1 9 22 1036146 1 8 16 1068467 1 6 8 1141638 1 5 1 1151162 1 5 3 1161286 1 5 5 1169926 1 8 19 1174421 1 7 12 1182275 1 7 11 1190421 1 6 10 1196748 1 6 9 1208959 1 10 30 1216858 1 5 2 1224992 1 9 24 1224992 1 9 23 1224992 1 8 18 1253840 1 7 15 1255591 1 7 13 1310434 1 10 27 1322747 1 8 20 1413722 2 6 6 1788836 3 10 26 1808949 3 5 4 1878275 3 7 14 1929774 3 8 17 1944281 3 9 25 2232703 4
Keep in mind that the WIDTH_BUCKET function does not remove rows from the result set that do not lie within the specified range; rather, they are placed into special buckets that your query can either utilize or ignore as needed.
The final two ranking functions, CUME_DIST and PERCENT_RANK, use the rank of a particular row to calculate additional information. The CUME_DIST function (short for Cumulative Distribution) calculates the ratio of the number of rows that have a lesser or equal ranking to the total number of rows in the partition. The PERCENT_RANK function calculates the ratio of a row's ranking to the number of rows in the partition using the formula:
(RRP  1) / (NRP  1)
where RRP is the "rank of row in partition," and NRP is the "number of rows in partition."
Both functions utilize DENSE_RANK for their rankings and can be specified to be in ascending or descending order. The following query demonstrates the use of these two functions (both specifying descending order) with the customer yearly sales query:
SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales, CUME_DIST( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_cume_dist, PERCENT_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_percent_rank FROM orders WHERE year = 2001 GROUP BY region_id, cust_nbr ORDER BY cust_sales DESC; REGION_ID CUST_NBR CUST_SALES SALES_CUME_DIST SALES_PERCENT_RANK      9 25 2232703 .033333333 0 8 17 1944281 .066666667 .034482759 7 14 1929774 .1 .068965517 5 4 1878275 .133333333 .103448276 10 26 1808949 .166666667 .137931034 6 6 1788836 .2 .172413793 8 20 1413722 .233333333 .206896552 10 27 1322747 .266666667 .24137931 7 13 1310434 .3 .275862069 7 15 1255591 .333333333 .310344828 8 18 1253840 .366666667 .344827586 5 2 1224992 .466666667 .379310345 9 23 1224992 .466666667 .379310345 9 24 1224992 .466666667 .379310345 10 30 1216858 .5 .482758621 6 9 1208959 .533333333 .517241379 6 10 1196748 .566666667 .551724138 7 11 1190421 .6 .586206897 7 12 1182275 .633333333 .620689655 8 19 1174421 .666666667 .655172414 5 5 1169926 .7 .689655172 5 3 1161286 .733333333 .724137931 5 1 1151162 .766666667 .75862069 6 8 1141638 .8 .793103448 8 16 1068467 .833333333 .827586207 9 22 1036146 .866666667 .862068966 9 21 1020541 .9 .896551724 10 28 986964 .933333333 .931034483 6 7 971585 .966666667 .965517241 10 29 903383 1 1
Let's walk through a couple of calculations for customer number 1 in the previous result set. With total sales of $1,151,162, customer number 1 ranks 23^{rd} in the set of 30 customers in descending order of sales. Since there are a total of 30 rows, the CUME_DIST is equal to 23/30, or .766666667. The PERCENT_RANK function yields (23  1) / (30  1) = .75862069. It should come as no surprise that each function returns identical values for the rows that have identical sales totals, since the calculations are based on rank, which is identical for all three rows.
For some types of analysis, determining what might have happened is more revealing than knowing what really happened. Oracle provides special versions of RANK, DENSE_RANK, CUME_DIST, and PERCENT_RANK that allow rankings and distributions to be calculated for hypothetical data, allowing the user to see what would have happened if a specific value (or set of values) was included in a data set.
To illustrate this concept, let's rank all customers by total sales for 2001, and then see where a hypothetical sales figure would fall in the ranking. Here is the query that generates the rankings and distributions:
SELECT cust_nbr, SUM(tot_sales) cust_sales, RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) rank, DENSE_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) dense_rank, CUME_DIST( ) OVER (ORDER BY SUM(tot_sales) DESC) cume_dist, PERCENT_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) percent_rank FROM orders WHERE year = 2001 GROUP BY cust_nbr ORDER BY rank; CUST_NBR CUST_SALES RANK DENSE_RANK CUME_DIST PERCENT_RANK       25 2232703 1 1 .033333333 0 17 1944281 2 2 .066666667 .034482759 14 1929774 3 3 .1 .068965517 4 1878275 4 4 .133333333 .103448276 26 1808949 5 5 .166666667 .137931034 6 1788836 6 6 .2 .172413793 20 1413722 7 7 .233333333 .206896552 27 1322747 8 8 .266666667 .24137931 13 1310434 9 9 .3 .275862069 15 1255591 10 10 .333333333 .310344828 18 1253840 11 11 .366666667 .344827586 2 1224992 12 12 .466666667 .379310345 23 1224992 12 12 .466666667 .379310345 24 1224992 12 12 .466666667 .379310345 30 1216858 15 13 .5 .482758621 9 1208959 16 14 .533333333 .517241379 10 1196748 17 15 .566666667 .551724138 11 1190421 18 16 .6 .586206897 12 1182275 19 17 .633333333 .620689655 19 1174421 20 18 .666666667 .655172414 5 1169926 21 19 .7 .689655172 3 1161286 22 20 .733333333 .724137931 1 1151162 23 21 .766666667 .75862069 8 1141638 24 22 .8 .793103448 16 1068467 25 23 .833333333 .827586207 22 1036146 26 24 .866666667 .862068966 21 1020541 27 25 .9 .896551724 28 986964 28 26 .933333333 .931034483 7 971585 29 27 .966666667 .965517241 29 903383 30 28 1 1
Now let's see where a customer with an even million dollars of sales would have ranked:
SELECT RANK(1000000) WITHIN GROUP (ORDER BY SUM(tot_sales) DESC) hyp_rank, DENSE_RANK(1000000) WITHIN GROUP (ORDER BY SUM(tot_sales) DESC) hyp_dense_rank, CUME_DIST(1000000) WITHIN GROUP (ORDER BY SUM(tot_sales) DESC) hyp_cume_dist, PERCENT_RANK(1000000) WITHIN GROUP (ORDER BY SUM(tot_sales) DESC) hyp_percent_rank FROM orders WHERE year = 2001 GROUP BY cust_nbr; HYP_RANK HYP_DENSE_RANK HYP_CUME_DIST HYP_PERCENT_RANK     28 26 .903225806 .9
The WITHIN GROUP clause has the effect of injecting a fictitious row into the result set before determining the rankings. One possible use of this functionality would be to see how actual sales compare to sales targets.