Subselects are a popular database feature that's available in MySQL 4.1. A subselect, or a subquery, is a query that occurs within the context of another query. You then use the subselect's results in the main query.
Many developers like subselects because they allow them to chain queries together to winnow results. It's often possible to rewrite a query to eliminate a subselect; however, this is not always straightforward or efficient. Additionally, sometimes, without a subselect, you'll be forced to make multiple requests.
The following sections contain a few examples that show how a subselect can solve problems. Many of them use the following programs table:
mysql> DESCRIBE programs; +------------+-----------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +------------+-----------------+------+-----+---------+----------------+ | id | int(5) unsigned | | PRI | NULL | auto_increment | | title | varchar(50) | | | | | | channel_id | int(5) unsigned | | | 0 | | +------------+-----------------+------+-----+---------+----------------+ 3 rows in set (0.00 sec) mysql> SELECT * FROM programs; +----+-----------------+------------+ | id | title | channel_id | +----+-----------------+------------+ | 1 | Oprah | 60 | | 2 | Sex and the City| 201 | | 3 | The Sopranos | 201 | | 4 | Frontline | 13 | +----+-----------------+------------+ 4 rows in set (0.00 sec)
A common database-related task is finding rows that match a set of criteria. Normally, these specifications are known ahead of time:
// Find the names and address of all people // who have an email address ending in "php.net". SELECT name, email FROM users WHERE email LIKE '%.php.net'; // Find the title of all TV programs that air on channel 13 SELECT title FROM programs, channels WHERE channel.name = 'HBO' AND program.channel_id = channel.id;
Sometimes, as in the first case, you only need to query a single table. Other times, as in the second example, you need to link two tables together based on a common key to gather the information you want.
However, these links aren't always sufficient. For example, you know there's a show called The Sopranos and you want to discover all the other shows that air on the same channel, but you don't know which channel that is. One solution is to use two queries:
// Get channel_id: $r = $db->query("SELECT channel_id FROM programs WHERE title = 'The Sopranos'"); $row = $r->fetch_assoc( ); // Use channel_id to find programs: $s = $db->query("SELECT title FROM programs WHERE channel_id = $row[channel_id]"); while ($program = $s->fetch_assoc( )) { print $program['title'] . "\n"; } Sex and the City The Sopranos
This is easy to understand, but it's a bit wasteful to make two queries for such a simple request. Another solution is to employ a self-join:
// All-in-one query using self-join: $s = $db->query("SELECT p2.title FROM programs AS p1, programs AS p2 WHERE p1.title = 'The Sopranos' AND p1.channel_id = p2.channel_id"); while ($program = $s->fetch_assoc( )) { print $program['title'] . "\n"; } Sex and the City The Sopranos
It can seem a little odd at first to JOIN a table against itself, but it can actually come in quite handy. In this example, you're finding the set of shows that have a title of The Sopranos and cross-referencing them against all the shows have the same channel_id field.
In order to eliminate the ambiguity of which table you're referring to, you're forced to assign table aliases. The table you're using to locate the channel The Sopranos is on is p1, and the table in which you're searching for matching shows is p2. That's why you return p2.title.
To return all the rows in p2, use p2.*. This syntax is similar to *, which returns all rows, but qualifies it with a table name.
Since you're joining on the same column, an equivalent query is:
// All-in-one query using self-join and USING( ) $s = $db->query("SELECT p2.title FROM programs AS p1 INNER JOIN programs AS p2 USING (channel_id) WHERE p1.title = 'The Sopranos'"); while ($program = $s->fetch_assoc( )) { print $program['title'] . "\n"; } Sex and the City The Sopranos
An INNER JOIN is identical to linking tables with a comma (,), and the USING( ) condition is shorthand for a list of columns to set as equal. Thus, USING(channel_id) is the same as p1.channel_id = p2.channel_id.
Still, that type of query can get hard to handle as the relationships become more complex. Many people find it easier to implement that request using a subselect:
// All-in-one query using subselect: $s = $db->query("SELECT title FROM programs WHERE channel_id = (SELECT channel_id FROM programs WHERE title = 'The Sopranos' LIMIT 1)"); while ($program = $s->fetch_assoc( )) { print $program['title'] . "\n"; } Sex and the City The Sopranos
MySQL starts off by parsing the SQL statement. When it reaches the second SELECT, it places what it already has on hold and executes the second query:
mysql> SELECT channel_id FROM programs WHERE title = 'The Sopranos' LIMIT 1; +------------+ | channel_id | +------------+ | 201 | +------------+ 1 row in set (0.00 sec)
That query returns a number (or NULL, if there are no matches). For the sake of example, assume it's 201. This then translates the original query into:
mysql> SELECT title FROM programs WHERE channel_id = 201; +-----------------+ | title | +-----------------+ | Sex and the City| | The Sopranos | +-----------------+ 2 rows in set (0.00 sec)
MySQL can now execute the query and return your titles.
Notice how the subselect specifically adds a LIMIT clause to restrict the number of rows to 1. Since the results are used in the context of channel_id =, returning more than one number leads to a parser error because it's like doing channel_id = 201, 202.
When you expect your query will return multiple rows, switch from an equals sign (=) to IN( ):
mysql> SELECT title FROM programs WHERE channel_id IN( SELECT channel_id FROM programs WHERE title = 'The Sopranos'); +-----------------+ | title | +-----------------+ | Sex and the City| | The Sopranos | +-----------------+ 2 rows in set (0.02 sec)
This allows you to find programs located in any of the channels returned by the subselect. Be careful, though?duplicate items are eliminated when you use IN( ). Sometimes this is good, but not always.
For instance, say you're implementing the backend of a TiVo or another DVR. You want to track all the shows that have been watched so you can suggest related programs the person might also be interested in viewing. This table, viewings, contains a list of program_ids. Each number is a key in the programs table from before and uniquely represents a program.
Here's one potential dataset:
mysql> SELECT * FROM viewings; +------------+ | program_id | +------------+ | 1 | | 2 | | 1 | | 3 | | 4 | +------------+ 5 rows in set (0.01 sec)
There are five shows, but program 1 was watched twice. Here's what happens if you look up the titles using IN( ):
mysql> SELECT title FROM programs WHERE id IN (SELECT program_id FROM viewings); +-----------------+ | title | +-----------------+ | Oprah | | Sex and the City| | The Sopranos | | Frontline | +-----------------+ 4 rows in set (0.01 sec)
There are only four records because the duplicate program ID of 1 isn't counted twice. To find out which show was so good as to merit a second viewing, you need to eliminate the subselect and return to a join:
mysql> SELECT title FROM programs INNER JOIN viewings ON viewings.program_id = programs.id; +-----------------+ | title | +-----------------+ | Oprah | | Frontline | | Oprah | | Sex and the City| | The Sopranos | +-----------------+ 5 rows in set (0.01 sec)
Ah! Now you can see that someone really enjoys Oprah, so you're more likely to suggest related shows.
The previous examples showed different ways to rewrite your query. While using a subselect often made the query easier to write and understand, it was never necessary. There was always an alternative syntax that used a JOIN to eliminate the subselect, yet retained the same results.
That's not always the case. There are lots of examples where you just can't solve the problem in a single query without using a subselect. This normally occurs when you're using an aggregate function, such as COUNT( ).
Returning to the last problem, you've already seen how repeated entries in a table can cause problems when using IN( ), but they can give you trouble in other ways, too. For instance, you want to find the titles of the shows you've watched more than The Sopranos. Once again, the viewings table looks like this:
mysql> SELECT * FROM viewings; +------------+ | program_id | +------------+ | 1 | | 2 | | 1 | | 3 | | 4 | +------------+ 5 rows in set (0.01 sec)
To find the total number of times you've watched a show, you need to do a join against the programs table and count the number of matches:
mysql> SELECT COUNT(*) FROM viewings LEFT JOIN programs ON(program_id = programs.id) WHERE programs.title = 'The Sopranos'; +---+ | c | +---+ | 1 | +---+ 1 row in set (0.00 sec)
Place this result as a subselect inside the master query:
mysql> SELECT title, COUNT(*) AS c FROM viewings LEFT JOIN programs ON (program_id = programs.id) GROUP BY title HAVING c > ( SELECT COUNT(*) FROM viewings LEFT JOIN programs ON (program_id = programs.id) WHERE title = 'The Sopranos'); +-------+---+ | title | c | +-------+---+ | Oprah | 2 | +-------+---+ 1 row in set (0.00 sec)
Once again, Oprah is the most popular show.
This question can't be rewritten to avoid a subselect, because the condition inside the subselect not only uses an aggregate function, such as COUNT( ) in this example, but also places restrictions upon the query using a WHERE clause. Since you only want the WHERE title = 'The Sopranos' condition to apply to the subselect's COUNT(*), it can't mix with the SELECT title, COUNT(*) in the rest of the query.
For more on aggregate functions and HAVING clauses, see the discussion in Section 4.6.