You'll often want to output data that's based on relationships between two or more tables. For example, in the winestore database, you might want to know which customers have placed orders, which customers live in Australia, or how many bottles of wine Lucy Williams has bought. These are examples of join queries, queries that match rows between tables based (usually) on primary key values. In SQL, a join query matches rows from two or more tables based on a condition in a WHERE clause and outputs only those rows that meet the condition.
As part of the process of converting the winestore entity-relationship model to SQL statements, we've included the attributes required in any practical join condition. To understand which tables can be joined in the winestore database, and how the joins are processed, it's helpful to have a copy of the ER model at hand as you work your way through this section.
Suppose you want to find out the names of the wineries in the winestore database and, for each winery, the name of the region that it's located in. To do this, you examine the ER model and discover that the region and winery tables are related, and that they both contain attributes that you need in the answer to your query. Specifically, you need to retrieve the winery_name attribute from the winery table and the region_name attribute from the region table, and you need to join the two tables together to find the result.
Consider this query, which we might intuitively, but wrongly, use to find all the wineries in a region:
SELECT winery_name, region_name FROM winery, region;
This query produces (in part) the following results:
+-------------------------------+-------------+ | winery_name | region_name | +-------------------------------+-------------+ | Durham and Sons Premium Wines | Coonawarra | | Durham Brook Group | Coonawarra | | Durham Creek | Coonawarra | | Durham Estates | Coonawarra | | Durham Hill Vineyard | Coonawarra |
The impression here is that, for example, Durham Creek winery is located in the Coonawarra region. This might not be the case. Why? First, you can use the techniques covered so far in this chapter to check which region the Durham Creek winery is located in:
SELECT region_id FROM winery WHERE winery_name='Durham Creek';
The result is:
+-----------+ | region_id | +-----------+ | 9 | +-----------+ 1 row in set (0.01 sec)
Now, you can query the region table to find the name of the region using:
mysql> SELECT region_name FROM region WHERE region_id=9; +----------------+ | region_name | +----------------+ | Margaret River | +----------------+ 1 row in set (0.00 sec)
So, Durham Creek winery isn't in Coonawarra at all!
What happened in the first attempt at a join query? The technical answer is that you just evaluated a Cartesian product: you produced as output all the possible combinations of wineries and regions, most of which don't make any sense. These odd results can be seen if you add an ORDER BY clause to the original query:
SELECT winery_name, region_name FROM winery, region ORDER BY winery_name, region_name;
Recall that the ORDER BY clause sorts the results after the query has been evaluated and that it has no effect on which rows are returned from the query. Here is the first part of the output:
+---------------------------------+---------------------+ | winery_name | region_name | +---------------------------------+---------------------+ | Anderson and Sons Premium Wines | All | | Anderson and Sons Premium Wines | Barossa Valley | | Anderson and Sons Premium Wines | Coonawarra | | Anderson and Sons Premium Wines | Goulburn Valley | | Anderson and Sons Premium Wines | Lower Hunter Valley | | Anderson and Sons Premium Wines | Margaret River | | Anderson and Sons Premium Wines | Riverland | | Anderson and Sons Premium Wines | Rutherglen | | Anderson and Sons Premium Wines | Swan Valley | | Anderson and Sons Premium Wines | Upper Hunter Valley |
The query produces all possible combinations of the 10 region names and 300 wineries in the sample database! In fact, the number of rows output is the total number of rows in the first table multiplied by the total rows in the second table. In this case, the output is 10 x 300 = 3,000 rows.
A cartesian product isn't the join we want. Instead, we want to limit the results to only the sensible rows, where the winery is actually located in the region. To do this, you need to understand how the relationship between the region and winery tables is maintained. If you examine the ER model, you'll see that many wineries are located in a region.
In the database tables, the relationship between the winery and region tables is maintained using the primary key of the region table, the attribute region_id that's also an attribute in the winery table. To understand this, consider the first three rows from the winery table:
mysql> SELECT * FROM winery LIMIT 3; +-----------+--------------------------+-----------+ | winery_id | winery_name | region_id | +-----------+--------------------------+-----------+ | 1 | Hanshaw Estates Winery | 2 | | 2 | De Morton and Sons Wines | 5 | | 3 | Jones's Premium Wines | 3 | +-----------+--------------------------+-----------+ 3 rows in set (0.04 sec)
The first winery has a region_id of 2, the second a region_id of 5, and the third a region_id of 3. Consider now the first five rows of the region table:
mysql> SELECT * FROM region LIMIT 5; +-----------+---------------------+ | region_id | region_name | +-----------+---------------------+ | 1 | All | | 2 | Goulburn Valley | | 3 | Rutherglen | | 4 | Coonawarra | | 5 | Upper Hunter Valley | +-----------+---------------------+ 5 rows in set (0.04 sec)
If you match up each winery's region_id value with a region's region_id value, you can determine the relationship and answer the query. For example, you can now see that the first winery (Hanshaw Estates Winery) is located in region 2, the Goulburn Valley.
From a querying perspective, we want to output winery_name and region_name values where the region_id in the winery table matches the corresponding region_id in the region table. This is a natural join.
You can perform a natural join on the winery and region tables using:
SELECT winery_name, region_name FROM winery NATURAL JOIN region ORDER BY winery_name;
The query produces (in part) the following sensible results:
+---------------------------------+---------------------+ | winery_name | region_name | +---------------------------------+---------------------+ | Anderson and Sons Premium Wines | Coonawarra | | Anderson and Sons Wines | Coonawarra | | Anderson Brothers Group | Rutherglen | | Anderson Creek Group | Riverland | | Anderson Daze Group | Rutherglen | | Anderson Daze Vineyard | Margaret River | | Anderson Daze Wines | Barossa Valley | | Anderson Ridge Wines | Lower Hunter Valley |
A natural join query relies on the DBMS matching attributes with the same name across the two tables. In this example, MySQL discovers that there's a region_id attribute in the winery and region tables, and it only outputs combinations where the region_id in both tables is the same.
You can write a join query that explicitly specifies which attributes should be matched to produce the correct result. The following query uses a WHERE clause to produce identical results to our previous example:
SELECT winery_name, region_name FROM winery, region WHERE winery.region_id = region.region_id ORDER BY winery_name;
We recommend writing out your joins so that they include the join condition in the WHERE clause. This is safer and clearer than relying on the NATURAL JOIN operator to discover common attribute names across tables and allowing the DBMS to figure out how the join is done.
Several features are shown in this second example:
The FROM clause contains the two table names winery and region, and so retrieves rows from both tables.
Attributes in the WHERE clause are specified using both the table name and attribute name, separated by a period. This is useful because the same attribute name is often used in different tables, and the query can't figure out which table is meant unless you include it. When an attribute name occurs in only one table, you can omit the table name.
In this example, region_id in the region table and region_id in the winery table have to be specified unambiguously as region.region_id and winery.region_id. In contrast, winery_name and region_name don't need the table name because they occur only in the winery and region tables respectively.
The use of both the table and attribute name can also be used for clarity in queries, even if it isn't required. So, for example, you could write winery.winery_name in the example query. It can also be used in all parts of the query, not just the WHERE clause.
The WHERE clause includes a join clause that matches rows between the multiple tables. In this example, the output is reduced to those rows where wineries and regions have matching region_id attributes, resulting in a list of all wineries and which region they are located in. This is the key to joining two or more tables to produce sensible results.
A join can be used to find lots of useful information from the winestore database. Suppose we want to find the names of wineries and the wines they make. Again, after examining the ER model, you'll see that you need to join together the related wine and winery tables to get the required names. Here's the query you'd need to write to get the correct result:
SELECT winery_name, wine_name FROM winery, wine WHERE wine.winery_id = winery.winery_id;
This query joins the winery and wine tables by matching the winery_id attributes. The result is the names and wineries of the 1,048 wines stocked at the winestore.
You can extend this query to produce a list of wines made by a specific winery or group of wineries. For example, to find all wines made by wineries with a name beginning with Borg, use:
SELECT winery_name, wine_name FROM winery, wine WHERE wine.winery_id = winery.winery_id AND winery.winery_name LIKE 'Borg%';
The LIKE clause is discussed in detail in Chapter 15.
Here are two more example join queries:
To find the name of the region that the Ryan Ridge Winery is situated in:
SELECT region_name FROM region, winery WHERE winery.region_id=region.region_id AND winery_name='Ryan Ridge Winery';
To find which wineries make Tonnibrook wines:
SELECT winery_name FROM winery, wine WHERE wine.winery_id=winery.winery_id AND wine_name='Tonnibrook';
The next example uses the DISTINCT operator to find wines that cost less than $10:
SELECT DISTINCT wine.wine_id FROM wine, inventory WHERE wine.wine_id=inventory.wine_id AND cost<10;
Wines can have more than one inventory row, and the DISTINCT operator shows each wine_id once by removing any duplicates.
Here are two examples that use DISTINCT to show only one matching answer:
To find which countries the customers live in:
SELECT DISTINCT country FROM customer, countries WHERE customer.country_id = countries.country_id;
To find which customers have ordered wines:
SELECT DISTINCT surname,firstname FROM customer,orders WHERE customer.cust_id = orders.cust_id ORDER BY surname,firstname;
Queries can join more than two tables. Suppose you want to find the details of the wine purchases made by a customer, including the customer's details, the dates they made an order, and the quantity and price of the items purchased. You examine the ER model, and see that the customer table that contains the customer information is related to the orders table that contains the date, and the orders table is related to the items table that contains the quantities and prices. So, to get the information you need, you have to join all three tables together.
By examining the database structure or the CREATE TABLE statements, you can see that the cust_id attribute can be used to join together the customer and the orders table. Joining the orders table and items table is a little trickier: the primary key of the orders table isn't just the order_id, it's both the cust_id and the order_id. So, for example there are many rows with an order_id of 1, but what makes a row unique is the combination of the cust_id for a customer and the order_id. These two attributes together are used to join the orders and items tables.
Suppose now that we want run this query for customer #2. Here's the query you'd use:
SELECT * FROM customer, orders, items WHERE customer.cust_id = orders.cust_id AND orders.order_id = items.order_id AND orders.cust_id = items.cust_id AND customer.cust_id = 2;
The WHERE clause contains the join condition between the three tables, customer, orders, and items, and the rows selected are those in which the cust_id is the same for all three tables, the cust_id is 2, and the order_id is the same in the orders and items tables. The example illustrates how frequently the Boolean operators AND and OR are used.
If you remove the cust_id=2 clause, the query outputs all items from all orders by all customers. This is a large result set, but still a sensible one that is much smaller than the cartesian product!
Here are two more examples that join three tables:
To find which wines are made in the Margaret River region:
SELECT wine_id FROM wine, winery, region WHERE wine.winery_id=winery.winery_id AND winery.region_id=region.region_id AND region.region_name='Margaret River';
To find which region contains the winery that makes wine #28:
SELECT region_name FROM wine, winery, region WHERE wine.winery_id=winery.winery_id AND winery.region_id=region.region_id AND wine.wine_id=28;
Extending to four or more tables generalizes the approach further. To find the details of customers who have purchased wines from the Ryan Estates Group winery, use:
SELECT DISTINCT customer.cust_id, surname, firstname FROM customer, winery, wine, items WHERE customer.cust_id=items.cust_id AND items.wine_id=wine.wine_id AND wine.winery_id=winery.winery_id AND winery.winery_name='Ryan Estates Group' ORDER BY surname, firstname;
This query is the most complex so far and has four parts. The easiest way to understand a query is usually to start at the end of the WHERE clause and work toward the SELECT clause:
The WHERE clause restricts the winery rows to the Ryan Estates Group (which, in this case, only matches one winery).
The resultant winery row is joined with the wine table to find all wines made by the Ryan Estates Group.
The wines made by Ryan Estates Group are joined with the items that have been purchased by joining to the items table.
The purchased wines are then joined with the customer rows to find the purchasers. You can leave out the orders table, because the items table contains a cust_id for the join; if you need the order number or credit card number (or another orders attribute), the orders table needs to be included in the query.
The result is the details of customers who have purchased Ryan Estates Group wines. The DISTINCT clause is used to show each customer only once. ORDER BY sorts the customer rows into telephone directory order.
Designing a query like this is a step-by-step process. We began by testing a query to find the winery_id of wineries with the name Ryan Estates Group. Then, after testing the query and checking the result, we progressively added additional tables to the FROM clause and the join conditions. Finally, we added the ORDER BY clause.
The next example uses three tables. It queries the complex many-to-many relationship that exists between the wines and grape_variety tables via the wine_variety table. A wine can have one or more grape varieties and these are listed in a specific order (e.g., Cabernet, then Sauvignon). From the other perspective, a grape variety such as Cabernet can be in hundreds of different wines. The many-to-many relationship is managed by creating an intermediate table between grape_variety and wine called wine_variety. The id attribute value stored in that table represents the order in which the grape varieties should appear for the wine. You can find a longer discussion of how these tables were designed and how they're used in Appendix E.
Here is the example query that joins the three tables to find what grape varieties are in wine #1004:
SELECT variety FROM grape_variety, wine_variety, wine WHERE wine.wine_id=wine_variety.wine_id AND wine_variety.variety_id=grape_variety.variety_id AND wine.wine_id=1004 ORDER BY wine_variety.id;
The result of the query is:
+-----------+ | variety | +-----------+ | Cabernet | | Sauvignon | +-----------+ 2 rows in set (0.00 sec)
The join condition is the same as any three-table query. The only significant difference is the ORDER BY clause that presents the results in id order (the first listed variety was stored with ID=1, the second ID=2, and so on).