To conclude this chapter, we show you selected techniques for improving the performance of your databases, queries, and MySQL server. We focus on how to choose and design indexes, tips for querying and database design, how to tune the server parameters, and how to use MySQL's query cache.
As discussed in Chapter 5, each table should have a PRIMARY KEY definition as part of its CREATE TABLE statement. A primary key is an attribute (or set of attributes) that uniquely identifies a row in a table. Storing two rows with the same primary key isn't permitted and an attempt to INSERT duplicate primary keys produces an error (unless you use the IGNORE modifier).
The attribute values of the primary key are stored in an index to allow fast access to a row using the primary key values. The default index type for a MyISAM table type is fast for queries that find a specific row, a range of rows, for joins between tables, grouping data, ordering data, and finding minimum and maximum values. Indexes don't provide any speed improvement for retrieving all the rows in a table or for other query types.
As discussed briefly in Chapter 5, indexes are also useful for fast access to rows by values other than those in the primary key. For example, in the customer table, you might define an index by adding it using:
ALTER TABLE customer ADD INDEX namecity (surname,firstname,city);
After you define this index, some queries that select a particular customer through a WHERE clause automatically use it. Consider an example:
SELECT * FROM customer WHERE surname = 'Marzalla' AND firstname = 'Dimitria' AND city = 'St Albans';
This query can use the new index to quickly locate the row that matches the search criteria. Without the index, the server must scan all the rows in the customer table and compare each row to the WHERE clause. This might be quite slow and certainly requires significantly more disk activity than the index-based approach (assuming the table has more than a few rows).
A particular feature of database servers is that they develop a query evaluation strategy and optimize it without any interaction from the user or programmer. If an index is available, and it makes sense to use it in the context of a query, the server does this automatically. All you need to do is identify which queries are common, and make an index available for those common queries by adding the KEY clause to the CREATE TABLE statement or using ALTER TABLE on an existing table.
If you've created the namecity index, and you want to check that MySQL will use it for the previous query, you can do so with the EXPLAIN statement:
EXPLAIN SELECT * FROM customer WHERE surname = 'Marzalla' AND firstname = 'Dimitria' AND city = 'St Albans';
This reports that:
+----+-------------+----------+------+---------------+----------+---------+---- ---------------+------+--------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+----------+------+---------------+----------+---------+---- ---------------+------+--------+ | 1 | SIMPLE | customer | ref | namecity | namecity | 153 | const, const,const | 1 | Using where | +----+-------------+----------+------+---------------+----------+---------+---- ---------------+------+--------+ 1 row in set (0.06 sec)
You can see that the namecity index is listed as a possible choice in the possible_keys column and, as expected, it's the index that'll be used to evaluate the query as shown in the key column. The EXPLAIN statement is a useful diagnostic tool for understanding and optimizing complex queries but we don't discuss it in detail here; you can find out more about it in Section 5.2.1 of the MySQL manual.
Careful index design is important. The namecity index we have defined can also speed queries other than those that supply a complete surname, firstname, and city. For example, consider a query:
SELECT * FROM customer WHERE surname = 'LaTrobe' AND firstname = 'Anthony';
This query can also use the index namecity, because the index permits access to rows in sorted order first by surname, then firstname, and then city. With this sorting, all "LaTrobe, Anthony" index entries are clustered together in the index. Indeed, the index can also be used for the query:
SELECT * FROM customer WHERE surname LIKE 'Mar%';
Similarly, all surnames beginning with "Mar" are clustered together in the index. You can use EXPLAIN to check that the index is being used.
However, the index can't be used for a query such as:
SELECT * FROM customer WHERE firstname = 'Dimitria' AND city = 'St Albans';
The index can't be used because the leftmost attribute named in the index, surname, isn't part of the WHERE clause. In this case, all rows in the customer table must be scanned and the query is much slower (again assuming there are more than a few rows in the customer table, and assuming there is no other index that could be used).
|
There are other cases in which an index can't be used, such as when a query contains an OR that isn't on an indexed attribute:
SELECT * FROM customer WHERE surname = 'Marzalla' OR zipcode = "3001";
Again, the customer table must be completely scanned, because the second condition, zipcode="3001", requires all rows to be retrieved as there is no index available on the attribute zipcode. Also, the attributes that are combined together with OR must be the leftmost attributes in the index; otherwise the query requires a complete scan of the customer table. The following example requires a complete scan:
SELECT * FROM customer WHERE firstname = 'Dimitria' OR surname = 'Marzalla';
If all the attributes in the index are used in all the queries, to optimize index size, the leftmost attribute in the KEY clause should be the attribute with the highest number of duplicate entries.
Because indexes speed up queries, why wouldn't you create indexes on all the attributes you can possibly search on? The answer is that while indexes are fast for searching, they consume space and require updates each time rows are added or deleted, or key attributes are changed. So, if a database is largely static, additional indexes have low overheads, but if a database changes frequently, each additional index slows down the update process significantly. In either case, indexes consume additional space on disk and in memory, and unnecessary indexes should be avoided.
One way to reduce the size of an index and speed updates is to create an index on a prefix of an attribute. Our namecity index uses considerable space: for each row in the customer table, an index entry is up to 150 characters in length because it is created from the combined values of the surname, firstname, and city attributes.[2]
[2] This isn't the space actually required by an index entry, because the data is compressed for storage. However, even with compression, the fewer characters that are indexed, the more compact is the representation, the more space is saved, and (depending on the usability of the index) the faster searching and updates are.
To reduce space, you can define the index as:
ALTER TABLE customer ADD INDEX namecity (surname(10),firstname(3),city(2));
This uses only the first 10 characters of surname, 3 of firstname, and 2 of city to distinguish index entries. This is quite reasonable, because 10 characters from a surname distinguishes between most surnames, and the addition of a few characters from a first name and the prefix of their city should be sufficient to uniquely identify almost all customers. Having a smaller index with less information can also mean that queries are actually faster, because more index information can be retrieved from disk per second, more of the index can fit into spare memory, and disk retrieval speed is almost always the bottleneck in query performance.
The space saving is significant with a reduced index. A new index entry requires only 15 characters, a savings of up to 135 characters, so index insertions, deletions, and modifications are now likely to be much faster. Note that for TEXT and BLOB attribute types, a prefix must be taken when indexing, because indexing the entire attribute is impractical and isn't permitted by the MySQL server.
Careful index design is one technique that improves speed and reduces resource requirements. However, design of your database, tables, attributes, and queries is also important. As discussed previously, accessing a hard disk is slow and is usually the bottleneck in database server performance. Therefore, most techniques described in this section improve performance by minimizing disk space and disk use.
Reducing disk space requirements improves both disk seek and read performance. Disk read performance is improved because less data is required to be transferred, while seek performance is improved because the disk head has to move less on average when randomly accessing a smaller file than when accessing a larger file.
Here are some simple ways to improve database server performance:
Carefully choose attribute types and lengths. Where possible, use small variants such as SMALLINT or MEDIUMINT rather than the regular choice INT. When using fixed-length attributes, such as CHAR, specify a length that is as short as possible.
Use fixed-length attributes: try to avoid the types VARCHAR, BLOB, and TEXT. While fixed-length text attributes may waste space, scanning fixed-length rows is much faster than scanning variable-length rows (and, as discussed in Section 15.6. the MyISAM table type adjusts its structure for speed when variable-length types aren't used).
MySQL can't join tables using an index if the attributes are different types or (in some cases) have different lengths. You should use ALTER TABLE to change the attribute types and lengths so that they match. However, MySQL can join char and varchar attributes, as long as they have the same declared length. You can check what's happening with the EXPLAIN statement.
Create a statistics table if aggregate functions such as COUNT( ) or SUM( ) are frequently used in queries that contain WHERE clauses and are on large tables. A statistics table usually stores only one row that is manually updated with the aggregate values of another table.
For example, suppose you want to create a statistics table that tracks the number of rows in the customer table. You would create it with the following statement:
CREATE TABLE custCount (custCount int(5));
It doesn't need a primary key because there's only going to be one row of data in the table. You would then initialize the statistics table to the current count of customers using:
INSERT INTO custCount SELECT count(*) FROM customer;
From then on, you'd use the new table to check the count of customers. For example:
SELECT custCount from custCount;
If a row is deleted from the customer table, you need to update the statistics table:
UPDATE custCount SET custCount = custCount - 1;
Similarly, if a new row is added, you add one to the counter. For large tables, this technique is often faster than calculating aggregate functions with the slow built-in functions that require complete processing of all rows.
If you're inserting large numbers of rows, list the values in one (or few) insert statements as this is much faster to process. For example, convert:
INSERT INTO table (1, "cat"); INSERT INTO table (2, "dog");
into:
INSERT INTO table (1, "cat"), (2, "dog");
If large numbers of rows are deleted from a table, or a table containing variable-length attributes is frequently modified, disk space may be wasted. MySQL doesn't usually remove deleted or modified data; it only marks the location as being no longer in use. Wasted space can affect access speed.
To reorganize a table, use the OPTIMIZE TABLE command discussed in Section 15.3.3. It should be used periodically (perhaps once per month).
MySQL uses statistics about a table to make decisions about how to optimize each query. You can update these statistics by running:
ANALYZE TABLE customer;
You don't need to do this often.
Use the Heap table type discussed in "Table Types" for small tables that are searched only for exact matches using = or <=>.
Section 5.2.13 of the MySQL manual includes other excellent ideas for simple performance improvement.
Comprehensive database tuning is a complex topic that fills many books. We include in this section only a few practical ideas to help you to begin to improve the performance of a database system. You can refer to the books in Appendix G for more information and also read Section 5.5 of the MySQL manual.
MySQL includes is the mysqladmin tool for database administration. Details of the system setup can be found by running the following command in a Unix shell:
% /usr/local/mysql/bin/mysqladmin -uroot -ppassword variables
In Microsoft Windows, type the following into the Run dialog that's accessible through the Start menu:
"C:\Program Files\EasyPHP1-7\mysql\bin\mysqladmin.exe" -uroot -ppassword variables
Both commands assume you've followed our installation instructions in Appendix A through Appendix C.
This shows, in part, the following selected system parameters:
join_buffer current value: 131072 key_buffer current value: 8388600 net_buffer_length current value: 16384 record_buffer current value: 131072 sort_buffer current value: 2097144 table_cache current value: 64
Some of the important parameters are those that impact disk use. MySQL has several main-memory buffer parameters that control how much data is kept in memory for processing. These include:
The record_buffer for scanning all rows in a table
The sort_buffer for ORDER BY and GROUP BY operations
The key_buffer for storing indexes in main memory
The join_buffer for joins that don't use indexes
In general, the larger these buffers, the more data from disk is cached or stored in memory and the fewer disk accesses are required. However, if the sum of these parameters is near to exceeding the size of the memory installed in the server, the operating system will start to swap data between disk and memory, and the MySQL server will be slow. In any case, careful experimentation based on the application is likely to improve server performance.
Section 5.5.2 of the MySQL manual suggests parameter settings when starting the MySQL server. First, for machines with more than 256 MB of free memory, large tables in the database, and a moderate number of users, start your MySQL in Unix with:
% /usr/local/mysql/bin/mysqld_safe -O key_buffer=64M -O table_cache=256 \ -O sort_buffer=4M -O read_buffer_size=1M &
The following setting is appropriate for an application such as the online winestore, because many users are expected, the queries are largely index-based, and the database is small:
mysqld_safe -O key_buffer=512k -O sort_buffer=16k \ -O table_cache=32 -O read_buffer_size=8k -O net_buffer_length=1K &
There are two other parameters used in this example that we've not discussed. The table_cache parameter manages the maximum number of open tables per user connection, while the net_buffer parameter sets the minimum size of the network query buffer in which incoming queries are kept before they are executed.
The SHOW STATUS and SHOW VARIABLES commands that are described at the beginning of this chapter can be used to report on MySQL's use and behavior. SHOW VARIABLES does the same thing as the mysqladmin command at the beginning of this section. SHOW STATUS gives a brief point-in-time summary of the server status and can help find more about the number of user connections, queries, and table use. This is useful input into changing the startup parameters we've just described.
MySQL 4 features an optional query cache. When you activate it, the results of queries are stored in a memory buffer. If an identical query arrives later, the results are returned from the cache rather than the query being rerun. This is an excellent feature if your application runs many identical queries and your database doesn't change too often. Typically, this makes it an ideal tool for web database applications, and we recommend you use it for your applications; if you followed our Unix installation instructions in Appendix A through Appendix C, you've already enabled the query cache with default parameters.
Consider an example. In our online winestore, the following query is executed every time any user visits the homepage. The query finds the latest three wines that have been stocked at the winestore and have been reviewed by a wine writer:
SELECT wi.winery_name, w.year, w.wine_name, w.wine_id, w.description FROM wine w, winery wi, inventory i WHERE w.winery_id = wi.winery_id AND w.wine_id = i.wine_id AND w.description IS NOT NULL GROUP BY w.wine_id ORDER BY i.date_added DESC LIMIT 3;
The query isn't fast to run: it uses three tables, two join conditions, a conditional check for a description, a GROUP BY clause, an ORDER BY clause, and the LIMIT modifier. However, it's an ideal candidate for query caching: the homepage is popular and new wines, wine reviews, and inventory are infrequently added. From simple experiments with our online winestore from the first edition of this book, we've found that adding query caching makes visiting the homepage almost three times faster.
Query caching has several features:
It's configurable so that you can control which queries are cached when you write the queries (more later in this section).
It can be tuned: you can control how much memory is used in total, the maximum size of a result set in the cache, and the size of memory blocks that are allocated in the cache (more on this next).
It's sensitive to data or table structure changes: if they change, any query that uses them is automatically flushed from the cache.
However, there are some situations in which it isn't useful and you need to be careful:
If you rarely execute the same query, most queries are first checked against the cache and then executed anyway. This adds an overhead to most queries.
If your data changes frequently, the cached queries have to be frequently flushed from the cache, and this slows querying down. If any row changes in a table, a query using that table is flushed (even if it doesn't access the row).
To use the cache, a query must be identical byte for byte with a query that's in the cache. So, for example, SELECT * FROM customer and SELECT * FROM CUSTOMER are treated as different queries, and the latter won't find the former in the cache. Even additional whitespace renders two queries different.
Queries that use functions that should return different results don't use the cache. Most of these are obvious: system time functions, system date functions, locking functions, and so on. A complete list is in Section 6.9.1 of the MySQL manual.
Query caching is off by default when you install MySQL 4.1. To turn it on, you need to edit your MySQL configuration file that you installed when following the installation procedure in Appendix A through Appendix C. In a Unix environment, edit the file /etc/my.cnf, and in Microsoft Windows edit the my.ini file in C:\winnt for Windows 2000/2003/NT or C:\windows for Windows XP. Find the section that beings with the heading:
# The MySQL server [mysqld]
Add the following statements to the end of that section:
query_cache_size = 16M query_cache_type = 1 query_cache_min_res_unit = 4K query_cache_limit = 1M
You need to add the first parameter query_cache_size that defines how much memory the cache uses (we've decided on 16 MB): the default is 0, which disables caching. The remaining three statements are optional, but we've included them with their default settings anyway. The query_cache_type parameter defines whether caching is off (0), on (1), or only used when you ask for it (2); we discuss how to control which queries are cached in the next section. The third parameter query_cache_min_res_unit defines the minimum memory block size for a cached result set, and the default of 4 KB. works well. The last parameter query_cache_limit defines the maximum size of a result set that can be cached, and the 1 MB default is a sensible choice.
After you've made the changes, save the file, and restart your MySQL using the method described for your platform in Appendix A through Appendix C. Alternatively, reboot your machine.
You now have caching enabled. If you repeat a query such as SELECT * FROM customer twice or more, you can check that the cache is in action by using the SHOW STATUS command from the command interpreter. This outputs, in part, the following:
| Qcache_queries_in_cache | 1 | | Qcache_inserts | 1 | | Qcache_hits | 3 | | Qcache_lowmem_prunes | 0 | | Qcache_not_cached | 0 | | Qcache_free_memory | 16709128 | | Qcache_free_blocks | 1 | | Qcache_total_blocks | 4 |
In our example, you can see that our system has just started up: there's one query in the cache, only one query has ever been inserted, it's been re-run three times, and no queries that couldn't be cached have ever been run. There's also plenty of memory free.
After you've got caching turned on, you can control whether an individual query is cached or not. If you don't prevent caching, and the query_cache_type parameter is set to its default of 1, all queries are cached (with the exception of those queries that it doesn't make sense to cache, as discussed previously). If the query_cache_type parameter is set to 2, only those queries you ask to be cached will be cached.
Here's an example of how to explicitly cache a query:
SELECT SQL_CACHE * FROM customer;
Here's an example of how not to cache a query:
SELECT SQL_NO_CACHE * FROM customer;
Caching only works with SELECT queries. It doesn't make sense to cache dynamic DELETE, INSERT, or UPDATE queries (or their variants such as TRUNCATE).