Developers, and even DBAs, occasionally need to compare the contents of two tables to determine whether the tables contain the same data. The need to do this is especially common in test environments, as developers may want to compare a set of data generated by a program under test with a set of "known good" data. Comparison of tables is also useful for automated testing purposes, when you have to compare actual results with a given set of expected results. SQL's set operations provide an interesting solution to this problem of comparing two tables.
The following query uses both MINUS and UNION ALL to compare two tables for equality. The query depends on each table having either a primary key or at least one unique index.
(SELECT * FROM customer_known_good MINUS SELECT * FROM customer_test) UNION ALL (SELECT * FROM customer_test MINUS SELECT * FROM customer_known_good);
You can look at this query as the union of two compound queries. The parentheses ensure that both MINUS operations take place first before the UNION ALL operation is performed. The result of the first MINUS query will be those rows in customer_known_good that are not also in customer_test. The result of the second MINUS query will be those rows in customer_test that are not also in customer_known_good. The UNION ALL operator simply combines these two result sets for convenience. If no rows are returned by this query, then we know that both tables have identical rows. Any rows returned by this query represent differences between the customer_test and customer_known_good tables.
If the possibility exists for one or both tables to contain duplicate rows, you must use a more general form of this query to test the two tables for equality. This more general form uses row counts to detect duplicates:
(SELECT c1.*,COUNT(*) FROM customer_known_good GROUP BY c1.cust_nbr, c1.name . . . MINUS SELECT c2.*, COUNT(*) FROM customer_test c2 GROUP BY c2.cust_nbr, c2.name . . . ) UNION ALL (SELECT c3.*,COUNT(*) FROM customer_test c3 GROUP BY c3.cust_nbr, c3.name . . . MINUS SELECT c4.*, COUNT(*) FROM customer_known_good c4 GROUP BY c4.cust_nbr, c4.name . . . )
This query is getting complex! The GROUP BY clause (see Chapter 4) for each SELECT must list all columns for the table being selected. Any duplicate rows will be grouped together, and the count will reflect the number of duplicates. If the number of duplicates is the same in both tables, the MINUS operations will cancel those rows out. If any rows are different, or if any occurrence counts are different, the resulting rows will be reported by the query.
Let's look at an example to illustrate how this query works. We'll start with the following tables and data:
DESC customer_known_good Name Null? Type ---------------------------- -------- ---------------- CUST_NBR NOT NULL NUMBER(5) NAME NOT NULL VARCHAR2(30) SELECT * FROM customer_known_good; CUST_NBR NAME ----------- ------------------------------ 1 Sony 1 Sony 2 Samsung 3 Panasonic 3 Panasonic 3 Panasonic 6 rows selected. DESC customer_test Name Null? Type ---------------------------- -------- ---------------- CUST_NBR NOT NULL NUMBER(5) NAME NOT NULL VARCHAR2(30) SELECT * FROM customer_test; CUST_NBR NAME ----------- ------------------------------ 1 Sony 1 Sony 2 Samsung 2 Samsung 3 Panasonic
As you can see the customer_known_good and customer_test tables have the same structure, but different data. Also notice that none of these tables has a primary or unique key; there are duplicate records in both. The following SQL will compare these two tables effectively:
(SELECT c1.*, COUNT(*) FROM customer_known_good c1 GROUP BY c1.cust_nbr, c1.name MINUS SELECT c2.*, COUNT(*) FROM customer_test c2 GROUP BY c2.cust_nbr, c2.name) UNION ALL (SELECT c3.*, COUNT(*) FROM customer_test c3 GROUP BY c3.cust_nbr, c3.name MINUS SELECT c4.*, COUNT(*) FROM customer_known_good c4 GROUP BY c4.cust_nbr, c4.name); CUST_NBR NAME COUNT(*) ----------- ------------------------------ ---------- 2 Samsung 1 3 Panasonic 3 2 Samsung 2 3 Panasonic 1
These results indicate that one table (customer_known_good) has one record for "Samsung," whereas the second table (customer_test) has two records for the same customer. Also, one table (customer_known_good) has three records for "Panasonic," whereas the second table (customer_test) has one record for the same customer. Both the tables have the same number of rows (two) for "Sony," and therefore "Sony" doesn't appear in the output.
|