Character-based data values are interpreted with respect to a given character set, which determines the allowable characters that can be used. Character sets also have a collating (sorting) order, which affects many types of operations on character values:
Comparisons: <, <=, =, <>, >=, and >.
Sorting: ORDER BY, MIN(), MAX().
Grouping: GROUP BY, DISTINCT.
The character set also affects other aspects of server operation, such as which characters can be used in database, table, and column names, because names normally are constructed from the alphanumeric characters in the server's default character set. (See the "MySQL Naming Rules" section in Chapter 3.)
The level of character set support available to you depends on your version of MySQL. Prior to MySQL 4.1, the server operates using a single character set at a time. As of MySQL 4.1, the server can support multiple character sets simultaneously, and character sets can be specified at the server, database, table, column, or string constant level. For example, if you want a table's columns to use latin1 by default, but also to include a Hebrew column and a Greek column, you can do that. You can also find out what character sets are available or convert data from one character set to another.
This section describes how to use the character sets that are supported by your server. To configure your server to support the character sets you want, see Chapter 11. That chapter also includes notes on what to do when upgrading to MySQL 4.1 so that you can use the new features with older tables.
Prior to MySQL 4.1, data values in MySQL have no explicit character set. Instead, string constants and column values are interpreted with respect to the server's character set. By default, this is the character set selected when the server was built (usually latin1), but the built-in value can be overridden at runtime with the --default-character-set option. This is very simple but quite limiting. For example, you cannot have a table that stores values using different character sets for different columns.
The single-character-set model also can lead to index-related problems if you change the server's character set after having already created tables and loaded character data into them. These problems occur due to the fact that index values are stored in sorted order, with the order for character columns being determined by the collating sequence of the character set that happens to be in force at the time the index entries are created. Some character sets have different collating sequences than others, so if you load a table while the server is using one character set and then reconfigure the server to use a different set, it's possible that the index entries will no longer be in the correct order with respect to the collating sequence of the new character set. Worse, if you add new rows to the table, the index that was initially created using the sort order of the original character set will be updated using the order of the new set. Consequently, index-based queries may not work correctly.
The solution to this problem is to rebuild the indexes for each existing table that has character-based indexes to use the collating order of the new character set. A table can be converted in various ways:
Dump the table with mysqldump, drop it, and reload it from the dump file. This operation causes the indexes to be rebuilt as the file is reloaded. It works for any table type.
Drop the indexes and add them again (for example, with ALTER TABLE, or with DROP INDEX and CREATE INDEX). This works for any table type but requires that you know the exact index definitions so that you can re-create them properly.
For MyISAM tables, you can rebuild indexes by running myisamchk with the --recover and --quick options, together with a --set-character-set option that specifies the character set to use. Equivalent alternatives are to use the mysqlcheck program with the --repair and --quick options or a REPAIR TABLE statement with the QUICK option. mysqlcheck and REPAIR TABLE are more convenient because the server does the work and it knows which character set to use. myisamchk must be run with the tables offline, and you have to specify the character set explicitly.
Despite the many methods available for reordering indexes if you change the server's character set, the fact that you need to do it at all is a bother. MySQL 4.1 eliminates the need.
Character set support has been revised considerably in MySQL 4.1 to provide the following features:
Support for using multiple character sets simultaneously
The ability to specify character sets at the server, database, table, column, and string constant level, not just at the server level:
An ALTER DATABASE statement for database character set assignment
CREATE TABLE and ALTER TABLE clauses for table- and column-level character set assignment
Functions and operators for converting individual values from one character set to another or for determining the character set of a value
A COLLATE operator for treating values in one character set as having the collating order of another character set
A SHOW CHARACTER SET statement to list all the character sets the server knows about
Automatic index reordering when character set changes occur
Unicode support, provided by the utf8 and ucs2 character sets
Many other new character sets
You cannot mix character sets within a string or use different character sets for different rows of a given column. However, by using a Unicode character set (which represents the encodings for many languages within a single character set), you may be able to implement multi-lingual support of the type you desire.
Character sets can be assigned at several levels, from the default used by the server down to the set used for individual strings:
The server's default character set is built in at compile time, and you can override it at server startup time by using a --default-character-set option.
To specify a default character set for a database, use the following statement:
ALTER DATABASE db_name DEFAULT CHARACTER SET charset;
charset is the name of a supported character set, or DEFAULT. A value of DEFAULT indicates that the database has no explicit character set; in this case, the server makes database-level character set decisions by referring to the server's default character set.
To specify a default character set for a table, use a CHARACTER SET table option at table creation time:
CREATE TABLE tbl_name (...) CHARACTER SET = charset;
charset is the name of a supported character set, or DEFAULT. A value of DEFAULT tells the server to make table-level character set decisions by referring to the database character set.
Columns in a table can be assigned a character set explicitly with a CHARACTER SET attribute. For example:
c CHAR(10) CHARACTER SET charset
In this case, the charset value must be the name of a supported character set; it cannot be DEFAULT. However, you can omit the CHARACTER SET attribute entirely, in which case the table-level character set is used. Column types for which a character set can be given are CHAR and VARCHAR (if declared without the BINARY attribute) and the TEXT types.
String constants can be converted to a given character set using the following notation, where charset is the name of a supported char acter set:
_charset str
The following examples produce strings in the latin1_de and utf8 character sets:
_latin1_de 'abc' _utf8 'def'
This notation works only for literal quoted strings, not for hexadecimal constants, string expressions, or column values. However, any string can be converted to a designated character set using the CONVERT() function:
SELECT CONVERT(str USING charset);
It's also possible to sort values from given character set using the collating sequence for a different set by using the COLLATE operator. For example, if c is a latin1 column but you want to order it using latin1_de sorting rules, do this:
SELECT c FROM t ORDER BY c COLLATE latin1_de;
Character set support in MySQL 4.1 and up includes statements for obtaining information at several levels:
At the server level, you can find out which character sets are available using the following query:
SHOW CHARACTER SET;
To determine what the server's default character set is, issue the following query:
SHOW VARIABLES LIKE 'character_set';
The database-level character set for a given database can be obtained as follows:
SHOW CREATE DATABASE db_name;
If the statement output doesn't indicate a character set, the database's character set has been never been set or has been set explicitly to DEFAULT.
A table's character set can be discovered two ways:
SHOW CREATE TABLE tbl_name; SHOW TABLE STATUS LIKE 'tbl_name';
Individual character set assignments for a table's columns are displayed by each of the following statements:
DESCRIBE tbl_name; SHOW COLUMNS FROM tbl_name; SHOW CREATE TABLE tbl_name;
To determine what character set is associated with a string, string expression, or column value, use the CHARSET() function:
SELECT CHARSET(str);
One of the reasons there are so many character sets is that different encodings have been developed for different languages. This presents several problems. For example, a given character that is common to several languages might be represented by different numeric values in different encodings. Also, different languages require different numbers of bytes to represent characters. The Latin-1 character set is small enough that every character fits in a single byte, but some languages, such as those used in Japan and China, contain so many characters that they require multiple bytes per character.
The goal of Unicode is to provide a unified character-encoding system within which all languages can be represented in a consistent manner. In MySQL, Unicode support is provided through two character sets:
UTF-8 is a variable-length format in which characters are represented using from one to four characters. (UTF is an abbreviation for UCS Transformation Format, where UCS is itself an abbreviation for Universal Character Set.) The utf8 character set in MySQL does not include any four-byte characters, although support for them may be added in the future.
The other Unicode character set in MySQL is UCS2. The ucs2 set represents each character using two bytes, most significant byte first. This character set does not represent characters that require more than two bytes.
If you upgrade a server to MySQL 4.1 or newer, older tables can still be used but will not be able to take full advantage of the improved character set support instituted in 4.1. To rectify this, you should convert them to 4.1 format. Instructions for doing so can be found in Chapter 11.