In this section we'll be talking about inner joins, which only return rows with matching values from both joined tables excluding all other rows.
There are two different syntaxes for table joins in SQL, one introduced in 1992 by the SQL standards committee, and the "old" one, used before this date but which still persists. The new standard syntax began to take a foothold in the late 1990s with introduction of Microsoft SQL Server 7.0, IBM DB2 UDB 7.1, and eventually Oracle joining the club with its 9i version, which supports the standard "new" SQL syntax in addition to the old one.
SQL99 defines syntax that uses FROM clause of the SELECT statement to join tables:
... FROM <table1> [INNER | NATURAL | CROSS] JOIN <table2> [ON <condition>] | [USING <column_name>,...],... ...
The actual examples are given later in this chapter; for now let's clarify what all the optional clauses stand for.
With an inner join, you have these options:
Keyword INNER is optional; it could be used for clarity to distinguish between inner and outer joins.
Keyword NATURAL is used to specify a natural join between two tables, i.e., join them by column(s) with identical names. You cannot invoke either the ON or USING clause along with the NATURAL keyword. Out of all our "big three" RDBMS, it is available only in Oracle 9i. The natural join is discussed in more detail later in this chapter.
Keyword CROSS is used to produce a cross join, as discussed later in this chapter. The keyword is valid for Oracle 9i and MS SQL Server 2000 syntax but is not recognized by DB2 UDB 8.1.
The ON clause is to specify the join condition (equijoin or nonequijoin, explained later in the chapter); all our "big three" databases have it in their syntax.
When you are specifying an equijoin of columns that have the same name in both tables, a USING clause can indicate the column(s) to be used. You can use this clause only if the join columns in both tables have the same name. The USING clause could be used when the NATURAL join would not work properly, i.e., tables have more identically named columns than you would actually want to use in your join. For example, if hypothetical tables A and B have common fields CUST_ID, PROD_ID, and ORDER_ID, but you only want to join them by CUST_ID and PROD_ID, you could specify
... FROM A JOIN B USING (CUST_ID, PROD_ID) ...
The column names in USING clause should not be qualified by table names.
The USING clause is only supported by Oracle 9i out of our three vendors, but the same results could easily be achieved with ON clause:
... FROM A JOIN B ON A.CUST_ID = B.CUST_ID AND A.PROD_ID = B.PROD_ID ...
The other syntax, which is commonly used by virtually everyone (Oracle only recognized the "standard" syntax starting version 9i, and Microsoft SQL Server introduced it in version 7.0), performs joins in the WHERE clause:
... WHERE [<qualifier>.]<column_name> <join_condition> [<qualifier>.]<column_name> [AND [<qualifier>].<column_name> <join_condition> [<qualifier>.]<column_name>],... ...
The join_condition can be the "equal" sign ('=') for equijoin or "not equal," "greater than," "less than," "greater than or equal," or "less than or equal" signs (<>, >, <, >=, <=) for non-equijoin. The explanation of terms "equijoin" and non-equijoin" follows.
Equijoin, which is the most popular type of table join, uses exact matching between two columns. For example, in the ACME database the CUSTOMER table does not contain any information about customer phone numbers; the PHONE table has the field PHONE_CUSTID_FN, which is the foreign key to CUSTOMER table. To display information from both tables, use equijoin as shown in Figure 9-1.
The SQL99-compliant syntax to produce the output shown on the bottom of Figure 9-1 is given here:
SELECT cust_id_n, cust_name_s, phone_phonenum_s, phone_type_s FROM customer JOIN phone ON cust_id_n = phone_custid_fn
Natural join is a special case of equijoin performed by RDBMS on the following assumption: "Always join tables using a column (or multiple columns) if they share the same name(s)." In other words, you don't have to specify the columns explicitly in the ON subclause of the SELECT statement's FROM clause. The ON subclause is omitted completely.
The idea of natural join contradicts the quasi-"Hungarian" notation concept (see Appendix B). In the ACME database, every column is prefixed with its abbreviated table name, so every column name in the database is unique and as such the natural join cannot be used at all. There are different opinions on whether the primary and the foreign keys should share the same name. The possibility to perform natural joins seems to be the only advantage in this case (and only if you are using Oracle), so in our ACME database we decided to use the notation consistently.
The naming conventions for the ACME database are described in Appendix B.
The natural join can make the query writing process easier, but only assuming the database structure and the query itself are simple enough; otherwise undesirable joins and incorrect query output could result.
The old syntax joins tables in the WHERE clause of the SELECT statement. The syntax has its defenders and opponents. The main argument against it is that the WHERE clause should not mix joins and predicates — its only purpose should be setting the "vertical" limits on the produced resultset. The major point for it is that in queries that join dozens of tables the old syntax is more readable. Another big point is code legacy, especially for Oracle where the standard syntax was not available until version 9i. It is your choice which syntax to use (we would suggest the SQL99-compliant one for all new developments), but it is definitely not a bad idea to be familiar with both.
The following statement is the equivalent to one given in the previous section; the old syntax being the only difference:
SELECT c.cust_id_n, c.cust_name_s, p.phone_phonenum_s FROM customer c, phone p WHERE c.cust_id_n = p.phone_custid_fn
Sometimes you need to join tables based on criteria other than equality. While the most typical use of equijoin deals with primary/foreign key relationships, that is not usually the case with nonequijoins — such a join would usually populate the resulting set in a way that does not make much sense at all. For example, if you replace the = (equals) operator in the query from the last section with <> (not equal), the resulting query will return every possible combination of customers and phone numbers except the ones that actually represent the valid customer/phone listings (Figure 9-2); in other words, the result will be somehow similar to the Cartesian product (CROSS JOIN) discussed later in this chapter.
In some situations though, nonequijoins appear to be quite useful. For example, imagine ACME, INC. has the following set of business rules. There is no discount for orders under $10,000; a 2 percent discount is given to all orders with total amount between $10,000 and $14,999; the orders with a total amount between $15,000 and $19,999 receive a 3 percent discount; and, finally, the orders that are $20,000 or more are eligible for a 4 percent discount.
One of the ways to implement such business rules is to have a table that stores minimum and maximum order amount along with the corresponding discount percentage as shown in Figure 9-3.
It is always a good practice to store values that potentially could change (for example, the discount for orders over $20,000 might change from 4 to 3.5 percent) in a RDBMS table rather than to hardcode them in the application. Should business rules change, you only need to update the table value(s) — which is fast and painless in comparison with changing the application code, recompiling the application, and so on.
In the following example, the nonequijoin between view V_CUSTOMER_TOTALS (that displays the dollar amount totals per order calculated using the order line amounts) and table DISCOUNT is shown. The query calculates the appropriate discount for each dollar amount (TOTAL_DISCOUNT) by multiplying the order total amount by the appropriate percent discount found based on the rules described above.
SELECT ct.order_number, ct.total_price, ct.total_price * d.disc_pct / 100 AS total_discount FROM v_customer_totals ct JOIN discount d ON ct.total_price > d.disc_minamount_n AND ct.total_price < d.disc_maxamount_n ORDER_NUMBER TOTAL_PRICE TOTAL_DISCOUNT ------------------------------ ----------- -------------- 523774 6608 0 523778 6608 0 ... 523777 10010 200.2 523737 10010 200.2 523730 10915.8 218.316 ... 523781 15402.2 462.066 523741 15402.2 462.066 523775 15456.8 463.704 523735 15456.8 463.704 51 rows selected.
SELECT order_number, total_price, total_price * disc_pct / 100 AS total_discount FROM v_customer_totals ct, discount d WHERE ct.total_price > d.disc_minamount_n AND ct.total_price < d.disc_maxamount_n
Tables can be joined by more than one column. In this example we join the V_CUSTOMER_TOTALS column TOTAL_PRICE with both the DISC_MINAMOUNT_N and DISC_MAXAMOUNT_N columns of the DISCOUNT table to find the appropriate discount percentage for the specified range. Another typical example of multicolumn joins is when you have to join composite primary/foreign keys.
The idea of self-join is probably one of the most unintuitive SQL concepts. Even though it sounds very simple — a table is joined with itself rather than with another table — sometimes it causes lots of confusion.
One important thing to understand here is that despite the fact that you are joining the table with itself, you are still dealing with two instances of the same table, or with two identical tables rather than with just one table, so the self-join should be considered as a special case of the multitable join.
The understanding of the difference between a table and its instance is also very important. Two table instances could be thought of as the table data loaded twice into two different memory locations completely separate from each other.
Imagine if ACME, INC. decided to implement some kind of multilevel marketing. It starts selling products to resellers, who in their turn either sell the products to a lower-level reseller or directly to a customer. That means any of ACME's customers now could be a reseller (the one who buys and resells products) or a supplier (the one whom products are bought from), or both. The relationship has been implemented in the RESELLER table. ACME, INC. has NULL in the RESELLER_SUPPLIER_ID column because it's on the top of the tree and so it does not have a supplier. Everybody else has another entity's (from the RESELLER table) primary key in this field.
Figure 9-4 illustrates this concept as well as the idea of multiple table instances.
The following query uses self-join to retrieve the reseller id and name along with its supplier id and name (for each entity in RESELLER table) using SQL99-compliant syntax:
SELECT r.reseller_id_n AS res_id, r.reseller_name_s AS res_name, s.reseller_id_n AS sup_id, s.reseller_name_s AS sup_name FROM reseller r JOIN reseller s ON r.reseller_supplier_id = s.reseller_id_n RES_ID RES_NAME SUP_ID SUP_NAME ------ ---------------------------- ------ -------------------------- 2 MAGNETICS USA INC. 1 ACME, INC 3 MAGNETOMETRIC DEVICES INC. 1 ACME, INC 4 FAIR PARK GARDENS 2 MAGNETICS USA INC. 5 FAIR AND SONS AIR CONDTNG 2 MAGNETICS USA INC. 6 FABRITEK INC. 2 MAGNETICS USA INC. 7 WILE ELECTRONICS INC. 3 MAGNETOMETRIC DEVICES INC. 8 INTEREX USA 3 MAGNETOMETRIC DEVICES INC. 9 JUDCO MANUFACTURING INC. 4 FAIR PARK GARDENS 10 ELECTRO BASS INC. 5 FAIR AND SONS AIR CONDTNG (9 rows affected)
The information is to be interpreted this way: ACME, INC. sells directly to MAGNETICS USA INC. and MAGNETOMETRIC DEVICES INC. MAGNETOMETRIC DEVICES INC. buys products from ACME, INC. and sells them to WILE ELECTRONICS INC. and INTEREX USA, and so on.
The following statement is the old syntax equivalent of the same query:
SELECT r.reseller_id_n AS res_id, r.reseller_name_s AS res_name, s.reseller_id_n AS sup_id, s.reseller_name_s AS sup_name FROM reseller r, reseller s WHERE r.reseller_supplier_id = s.reseller_id_n
Cross join, or the Cartesian product of two tables, can be defined as another (rather virtual) table that consists of all possible pairs of rows from the two source tables. Returning to our customer phone example in the nonequijoin section of this chapter, the cross join of the CUSTOMER and PHONE tables returns results very similar to what we've got on Figure 8-2 except it would also return the valid customer/phone combinations, excluded from the previously mentioned nonequijoin.
This query will return all possible combinations of customer names and phone numbers by performing cross join of CUSTOMER and PHONE tables:
SELECT cust_name_s, phone_phonenum_s FROM customer CROSS JOIN phone CUST_NAME_S PHONE_PHONENUM_S -------------------------------------------------- -------------------- DLH INDUSTRIES (817) 555-5524 FAIR AND SONS AIR CONDTNG (817) 555-5524 KILBURN GLASS INDUSTRIES (817) 555-5524 BOSWELL DESIGNS CORP. (817) 555-5524 WILE ELECTROMATIC INC. (817) 555-5524 FABRITEK INC. (817) 555-5524 ... ... DLH INDUSTRIES (847) 555-2924 FAIR AND SONS AIR CONDTNG (847) 555-2924 KILBURN GLASS INDUSTRIES (847) 555-2924 BOSWELL DESIGNS CORP. (847) 555-2924 WILE ELECTROMATIC INC. (847) 555-2924 FABRITEK INC. (847) 555-2924 ... ... DLH INDUSTRIES (814) 555-0323 FAIR AND SONS AIR CONDTNG (814) 555-0323 KILBURN GLASS INDUSTRIES (814) 555-0323 BOSWELL DESIGNS CORP. (814) 555-0323 WILE ELECTROMATIC INC. (814) 555-0323 FABRITEK INC. (814) 555-0323 ... ... 3182 rows selected.
The old syntax for cross join simply omits the WHERE clause in a multitable join:
SELECT cust_name_s, phone_phonenum_s FROM customer, phone
Cross join could be a very costly operation, especially when it happens as the result of a human error and involves large tables with hundreds of thousands or even millions of rows. Since the resulting set of a Cartesian product is every possible combination of all rows in both tables, if the first table is, say, one hundred thousand rows and the second one is two hundred thousand rows, the query returns twenty billion rows (100,000 * 200,000 = 20,000,000,000) — quite enough to cause considerable database server slowdown that would affect all users that are currently logged on.
Thus, the Cartesian product should be handled with care, especially because it is not something that a typical SQL user uses a lot. Cross joins can be useful to generate test data or in educational purposes (for example, to demonstrate the idea that all equijoins and nonequijoins are simply a subset of a cross join that meets certain conditions), but one would hardly need, say, all possible combinations of customer names and phone numbers. In most real-life situations a Cartesian product is simply the result of a human error that would have to be fixed by the DBA (probably by killing the user's session; more on the sessions in Chapter 7).
In a relational database quite often you need to retrieve data from many tables simultaneously within a single query to get all necessary information. Thus, in real life a multitable query could easily mean a dozen-table query or even a hundred-table query. Probably 90 percent of the SQL programming art is the talent to properly join multiple tables based on the knowledge of the internal database objects' structure plus the ability to apply this knowledge.
The concept is simple. The result of joining two tables could be considered as a new virtual table, which, in its turn, could be joined with the next table in the query, producing yet another virtual table, and so on. For example, if you needed to know what customers ordered which products, your query would have to join four tables: CUSTOMER, ORDER_HEADER, ORDER_LINE, and PRODUCT:
SELECT DISTINCT customer.cust_name_s, product.prod_description_s FROM customer JOIN order_header ON customer.cust_id_n = order_header.ordhdr_custid_fn JOIN order_line ON order_header.ordhdr_id_n = order_line.ordline_ordhdrid_fn JOIN product ON product.prod_id_n = order_line.ordline_prodid_fn
You could envision this query as a sequence of joins. First, join CUSTOMER and ORDER_HEADER using the appropriate columns; then join the resulting set of rows with the ORDER_LINE table; and, finally, join your result with the PRODUCT table as illustrated in Figure 9-5.
Also note that even though you might not need any columns either from ORDER_HEADER or from ORDER_LINE tables, you have to use them in your query joins because there is no direct relationship between the CUSTOMER and PRODUCT tables.
One important rule to remember is that there should be at least (n – 1) joins in an n-table query, thus, at least two joins for a three-table query, at least three joins for query that involves four tables, and so on. The words "at least" are important — there could be more than (n – 1) joins in a nonequijoin query or in a query that joins tables using composite primary/foreign keys, but if your multitable query has less than (n – 1) joins, the result will be a Cartesian product.
The most common (and the most dangerous) type of a cross join happens in queries that involve more than two tables. In that situation you don't even have to omit the WHERE clause — it would be sufficient to simply skip a join between two tables out of a dozen. Also, the more tables are joined within a single query, the bigger the resulting set could be — the Cartesian product of five tables, with only one hundred rows each, produces ten billion rows.
The query in the previous example uses SQL99-compliant syntax. The query that follows also uses SQL99-compliant syntax to join the CUSTOMER, ORDER_HEADER, and STATUS tables and return customer name, order number, and order status:
SELECT cust_name_s, ordhdr_nbr_s, status_desc_s FROM customer JOIN order_header ON cust_id_n = ordhdr_custid_fn JOIN status ON status_id_n = ordhdr_statusid_fn CUST_NAME_S ORDHDR_NBR_S STATUS_DESC_S ----------------------- ------------ ------------- WILE BESS COMPANY 523731 COMPLETE WILE BESS COMPANY 523732 COMPLETE WILE ELECTRONICS INC. 523733 COMPLETE WILE ELECTRONICS INC. 523734 COMPLETE WILE ELECTRONICS INC. 523735 COMPLETE WILE BESS COMPANY 523736 COMPLETE ... ... ... WILE BESS COMPANY 523738 CANCELLED WILE BESS COMPANY 523742 CANCELLED WILE BESS COMPANY 523743 CANCELLED 51 rows selected.
Here is the old syntax equivalent of the previous query:
SELECT cust_name_s, ordhdr_nbr_s, status_desc_s FROM customer, order_header, status WHERE cust_id_n = ordhdr_custid_fn AND status_id_n = ordhdr_statusid_fn