5.3 Correlated Subqueries

A subquery that references one or more columns from its containing SQL statement is called a correlated subquery. Unlike noncorrelated subqueries, which are executed exactly once prior to execution of the containing statement, a correlated subquery is executed once for each candidate row in the intermediate result set of the containing query. For example, consider the following query, which locates all parts supplied by Acme Industries that have been purchased 10 or more times since July 2001:

SELECT p.part_nbr, p.name

FROM supplier s INNER JOIN part p

ON s.supplier_id = p.supplier_id

WHERE s.name = 'Acme Industries' 

  AND 10 <= 

   (SELECT COUNT(*) 

    FROM cust_order co INNER JOIN line_item li

    ON li.order_nbr = co.order_nbr

    WHERE li.part_nbr = p.part_nbr 

      AND co.order_dt >= TO_DATE('01-JUL-2001','DD-MON-YYYY'));

The reference to p.part_nbr is what makes the subquery correlated; values for p.part_nbr must be supplied by the containing query before the subquery can execute. If there are 10,000 parts in the part table, but only 100 are supplied by Acme Industries, the subquery will be executed once for each of the 100 rows in the intermediate result set created by joining the part and supplier tables.

It is possible to ask for the subquery to be evaluated earlier in the execution plan using the PUSH_SUBQ hint; once again, we suggest you pick up a good book on Oracle tuning if you are interested in learning more about how Oracle actually executes subqueries.

Correlated subqueries are often used to test whether relationships exist without regard to cardinality. We might, for example, want to find all parts that have shipped at least once since January 2002. The EXISTS operator is used for these types of queries, as illustrated by the following query:

SELECT p.part_nbr, p.name, p.unit_cost

FROM part p

WHERE EXISTS 

 (SELECT 1 

  FROM line_item li INNER JOIN cust_order co

  ON li.order_nbr = co.order_nbr

  WHERE li.part_nbr = p.part_nbr 

    AND co.ship_dt >= TO_DATE('01-JAN-2002','DD-MON-YYYY'));

As long as the subquery returns one or more rows, the EXISTS condition is satisfied without regard for how many rows were actually returned by the subquery. Since the EXISTS operator returns TRUE or FALSE depending on the number of rows returned by the subquery, the actual columns returned by the subquery are irrelevant. The SELECT clause requires at least one column, however, so it is common practice to use either the literal "1" or the wildcard "*".

Conversely, you can test whether a relationship does not exist:

UPDATE customer c 

SET c.inactive_ind = 'Y', c.inactive_dt = TRUNC(SYSDATE)

WHERE c.inactive_dt IS NULL 

  AND NOT EXISTS (SELECT 1 FROM cust_order co

    WHERE co.cust_nbr = c.cust_nbr 

      AND co.order_dt > TRUNC(SYSDATE) -- 365);

This statement makes all customer records inactive for those customers who haven't placed an order in the past year. Such queries are commonly found in maintenance routines. For example, foreign key constraints might prevent child records from referring to a nonexistent parent, but it is possible to have parent records without children. If business rules prohibit this situation, you might run a utility each week that removes these records, as in:

DELETE FROM cust_order co

WHERE co.order_dt > TRUNC(SYSDATE) -- 7 

  AND co.cancelled_dt IS NULL

  AND NOT EXISTS 

   (SELECT 1 FROM line_item li 

    WHERE li.order_nbr = co.order_nbr);

A query that includes a correlated subquery using the EXISTS operator is referred to as a semi-join. A semi-join includes rows in table A for which corresponding data is found one or more times in table B. Thus, the size of the final result set is unaffected by the number of matches found in table B. Similar to the anti-join discussed earlier, the Oracle optimizer can employ multiple strategies for formulating execution plans for such queries, including a merge semi-join or a hash semi-join.

Although they are very often used together, the use of correlated subqueries does not require the EXISTS operator. If your database design includes denormalized columns, for example, you might run nightly routines to recalculate the denormalized data, as in:

UPDATE customer c 

SET (c.tot_orders, c.last_order_dt) = 

 (SELECT COUNT(*), MAX(co.order_dt) 

  FROM cust_order co

  WHERE co.cust_nbr = c.cust_nbr 

    AND co.cancelled_dt IS NULL);

Because a SET clause assigns values to columns in the table, the only operator allowed is =. The subquery returns exactly one row (thanks to the aggregation functions), so the results may be safely assigned to the target columns. Rather than recalculating the entire sum each day, a more efficient method might be to update only those customers who placed orders today:

UPDATE customer c SET (c.tot_orders, c.last_order_dt) = 

 (SELECT c.tot_orders + COUNT(*), MAX(co.order_dt) 

  FROM cust_order co

  WHERE co.cust_nbr = c.cust_nbr 

    AND co.cancelled_dt IS NULL

    AND co.order_dt >= TRUNC(SYSDATE))

WHERE c.cust_nbr IN 

 (SELECT co.cust_nbr 

  FROM cust_order co

  WHERE co.order_dt >= TRUNC(SYSDATE) 

    AND co.cancelled_dt IS NULL);

As the previous statement shows, data from the containing query can be used for other purposes in the correlated subquery than just join conditions in the WHERE clause. In this example, the SELECT clause of the correlated subquery adds today's sales totals to the previous value of tot_orders in the customer table to arrive at the new value.

Along with the WHERE clause of SELECT, UPDATE, and DELETE statements, and the SET clause of UPDATE statements, another potent use of correlated subqueries is in the SELECT clause, as illustrated by the following:

SELECT d.dept_id, d.name,

 (SELECT COUNT(*) FROM employee e 

  WHERE e.dept_id = d.dept_id) empl_cnt

FROM department d;



   DEPT_ID NAME                   EMPL_CNT

---------- -------------------- ----------

        10 ACCOUNTING                    3

        20 RESEARCH                      5

        30 SALES                         6

        40 OPERATIONS                    0

The empl_cnt column returned from this query is derived from a correlated subquery that returns the number of employees assigned to each department. Note that the OPERATIONS department has no assigned employees, so the subquery returns 0.

To appreciate the value of subqueries in the SELECT clause, let's compare the previous query to a more traditional method using GROUP BY:

SELECT d.dept_id, d.name, COUNT(e.emp_id) empl_cnt

FROM department d LEFT OUTER JOIN employee e

  ON d.dept_id = e.dept_id

GROUP BY d.dept_id, d.name;



   DEPT_ID NAME                   EMPL_CNT

---------- -------------------- ----------

        10 ACCOUNTING                    3

        20 RESEARCH                      5

        30 SALES                         6

        40 OPERATIONS                    0

To include every department in the result set, and not just those with assigned employees, you must perform an outer join from department to employee. The results are sorted by department ID and name, and the number of employees are counted within each department. In our opinion, the previous query employing the scalar correlated subquery is easier to understand. It does not need an outer join (or any join at all), and does not necessitate a sort operation, making it an attractive alternative to the GROUP BY version.