All source code is divided into a stream of tokens. The compiler tries to collect as many contiguous characters as it can to build a valid token. (This is sometimes called the "max munch" rule.) It stops when the next character it would read cannot possibly be part of the token it is reading.
A token can be an identifier, a reserved keyword, a literal, or an operator or punctuation symbol. Each kind of token is described later in this section.
Step 3 of the compilation process reads preprocessor tokens. These tokens are converted automatically to ordinary compiler tokens as part of the main compilation in Step 7. The differences between a preprocessor token and a compiler token are small:
The preprocessor and the compiler might use different encodings for character and string literals.
The compiler treats integer and floating-point literals differently; the preprocessor does not.
The preprocessor recognizes <header> as a single token (for #include directives); the compiler does not.
An identifier is a name that you define or that is defined in a library. An identifier begins with a nondigit character and is followed by any number of digits and nondigits. A nondigit character is a letter, an underscore, or one of a set of universal characters. The exact set of nondigit universal characters is defined in the C++ standard and in ISO/IEC PDTR 10176. Basically, this set contains the universal characters that represent letters. Most programmers restrict themselves to the characters a...z, A...Z, and underscore, but the standard permits letters in other languages.
Not all compilers support universal characters in identifiers.
Certain identifiers are reserved for use by the standard library:
Any identifier that contains two consecutive underscores (like__this) is reserved; that is, you cannot use such an identifier for macros, class members, global objects, or anything else.
Any identifier that starts with an underscore followed by a capital letter (A-Z) is reserved.
Any identifier that starts with an underscore is reserved in the global namespace. You can use such names in other contexts (i.e., class members and local names).
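The three rules above can be illustrated with a short sketch. All names here (demo, Counter, sum_to, and so on) are hypothetical and were chosen only to show which spellings are permitted; the reserved spellings are left in comments so that the code stays valid.

```cpp
#include <cassert>

namespace demo {

class Counter {
public:
    Counter() : _count(0) {}
    int next() { return ++_count; }
private:
    int _count;     // OK: underscore + lowercase letter, used as a class member
};

int sum_to(int n) {
    int _total = 0; // OK: leading underscore in a local name
    for (int i = 1; i <= n; ++i)
        _total += i;
    return _total;
}

// int __total;     // Reserved: contains two consecutive underscores
// int _Total;      // Reserved: underscore followed by a capital letter

}  // namespace demo

// int _total;      // Reserved: leading underscore in the global namespace

// Exercises the permitted names.
void check_reserved_name_demo() {
    demo::Counter counter;
    assert(counter.next() == 1);
    assert(counter.next() == 2);
    assert(demo::sum_to(4) == 10);
}
```

Many style guides avoid leading underscores altogether, so that nobody has to remember which contexts are safe.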
The C standard reserves some identifiers for future use. These identifiers fall into two categories: function names and macro names. Function names are reserved and should not be used as global function or object names; you should also avoid using them as "C" linkage names in any namespace. Note that the C standard reserves these names regardless of which headers you #include. The reserved function names are:
Names that start with is followed by a lowercase letter, such as isblank
Names that start with mem followed by a lowercase letter, such as memxyz
Names that start with str followed by a lowercase letter, such as strtof
Names that start with to followed by a lowercase letter, such as toxyz
Names that start with wcs followed by a lowercase letter, such as wcstof
Names of functions in <cmath> with f or l appended, such as cosf and sinl
Macro names are reserved in all contexts. Do not use any of the following reserved macro names:
Identifiers that start with E followed by a digit or an uppercase letter
Identifiers that start with LC_ followed by an uppercase letter
Identifiers that start with SIG or SIG_ followed by an uppercase letter
A keyword is an identifier that is reserved in all contexts for special use by the language. The following is a list of all the reserved keywords. (Note that some compilers do not implement all of the reserved keywords; these compilers allow you to use certain keywords as identifiers. See Section 1.5 later in this chapter for more information.)
and | continue | goto | public | try |
and_eq | default | if | register | typedef |
asm | delete | inline | reinterpret_cast | typeid |
auto | do | int | return | typename |
bitand | double | long | short | union |
bitor | dynamic_cast | mutable | signed | unsigned |
bool | else | namespace | sizeof | using |
break | enum | new | static | virtual |
case | explicit | not | static_cast | void |
catch | export | not_eq | struct | volatile |
char | extern | operator | switch | wchar_t |
class | false | or | template | while |
compl | float | or_eq | this | xor |
const | for | private | throw | xor_eq |
const_cast | friend | protected | true |
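Several entries in this list (and, or, not, bitand, compl, and so on) are keyword spellings of operators. The following sketch assumes a compiler that implements them as keywords; the helper names in_range and clear_bits are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: true when value lies in the closed range [low, high].
bool in_range(int value, int low, int high) {
    return value >= low and value <= high;  // 'and' means the same as &&
}

// Hypothetical helper: clears the bits of flags that are set in mask.
unsigned clear_bits(unsigned flags, unsigned mask) {
    return flags bitand compl mask;         // same as flags & ~mask
}

void check_alternative_keywords() {
    assert(in_range(5, 1, 10));
    assert(not in_range(11, 1, 10));        // 'not' means the same as !
    assert(clear_bits(0xFu, 0x5u) == 0xAu);
}
```

Because these spellings are keywords, none of them can be used as the name of a variable or function.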
A literal is an integer, floating-point, Boolean, character, or string constant.
An integer literal can be a decimal, octal, or hexadecimal constant. A prefix specifies the base or radix: 0x or 0X for hexadecimal, 0 for octal, and nothing for decimal. An integer literal can also have a suffix that combines U (for unsigned) and L (for long). The suffix letters can be uppercase or lowercase and can appear in either order. The suffix and prefix are interpreted as follows:
If the suffix is UL (or ul, LU, etc.), the literal's type is unsigned long.
If the suffix is L, the literal's type is long or unsigned long, whichever fits first. (That is, if the value fits in a long, the type is long; otherwise, the type is unsigned long. An error results if the value does not fit in an unsigned long.)
If the suffix is U, the type is unsigned or unsigned long, whichever fits first.
Without a suffix, a decimal integer has type int or long, whichever fits first.
An octal or hexadecimal literal has type int, unsigned, long, or unsigned long, whichever fits first.
Some compilers offer other suffixes as extensions to the standard. See Appendix A for examples.
Here are some examples of integer literals:
314      // Legal
314u     // Legal
314LU    // Legal
0xFeeL   // Legal
0ul      // Legal
078      // Illegal: 8 is not an octal digit
032UU    // Illegal: cannot repeat a suffix
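One way to observe the type that a suffix selects is to let overload resolution pick among functions that differ only in parameter type. This is a minimal sketch; the overload set int_type_name is hypothetical. It uses only small values, so the results do not depend on the sizes of int and long.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical overload set: the overload whose parameter type exactly
// matches the literal's type is chosen.
const char* int_type_name(int)           { return "int"; }
const char* int_type_name(unsigned)      { return "unsigned"; }
const char* int_type_name(long)          { return "long"; }
const char* int_type_name(unsigned long) { return "unsigned long"; }

void check_integer_literal_types() {
    assert(std::strcmp(int_type_name(314),    "int") == 0);
    assert(std::strcmp(int_type_name(314u),   "unsigned") == 0);
    assert(std::strcmp(int_type_name(314LU),  "unsigned long") == 0);
    assert(std::strcmp(int_type_name(0xFeeL), "long") == 0);
    assert(std::strcmp(int_type_name(0ul),    "unsigned long") == 0);

    // The prefix changes the radix, not the value being represented.
    assert(0x10 == 16);  // hexadecimal
    assert(010 == 8);    // octal
}
```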
A floating-point literal has an integer part, a decimal point, a fractional part, and an exponent part. You must include the decimal point, the exponent, or both. You must include the integer part, the fractional part, or both. The signed exponent is introduced by e or E. The literal's type is double unless there is a suffix: F for type float and L for long double. The suffix can be uppercase or lowercase.
Here are some examples of floating-point literals:
3.14159      // Legal
.314159F     // Legal
314159E-5L   // Legal
314.         // Legal
314E         // Illegal: incomplete exponent
314f         // Illegal: no decimal or exponent
.e24         // Illegal: missing integer or fraction
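The same overloading technique shows how the suffix selects the type of a floating-point literal. This is a sketch only, and float_type_name is a hypothetical overload set.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical overload set: one overload per floating-point type.
const char* float_type_name(float)       { return "float"; }
const char* float_type_name(double)      { return "double"; }
const char* float_type_name(long double) { return "long double"; }

void check_floating_literal_types() {
    assert(std::strcmp(float_type_name(3.14159),    "double") == 0);
    assert(std::strcmp(float_type_name(.314159F),   "float") == 0);
    assert(std::strcmp(float_type_name(314159E-5L), "long double") == 0);
    assert(std::strcmp(float_type_name(314.),       "double") == 0);

    // A decimal point alone or an exponent alone is enough.
    assert(314. == 314.0);
    assert(1e2 == 100.0);
}
```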
There are two Boolean literals, both keywords: true and false.
Character literals are enclosed in single quotes. If the literal begins with L (uppercase only), it is a wide character literal (e.g., L'x'). Otherwise, it is a narrow character literal (e.g., 'x'). Narrow characters are used more frequently than wide characters, so the "narrow" adjective is usually dropped.
The value of a narrow or wide character literal is the value of the character's encoding in the execution character set. If the literal contains more than one character, the literal value is implementation-defined. Note that a character might have different encodings in different locales. Consult your compiler's documentation to learn which encoding it uses for character literals.
A narrow character literal with a single character has type char. With more than one character, the type is int (e.g., 'abc'). The type of a wide character literal is always wchar_t.
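The types of single-character literals can be checked in the same way. This is a sketch with a hypothetical overload set, char_type_name. It leaves out multicharacter literals because their value is implementation-defined and many compilers warn about them.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical overload set used to observe the type of a character literal.
const char* char_type_name(char)    { return "char"; }
const char* char_type_name(wchar_t) { return "wchar_t"; }
const char* char_type_name(int)     { return "int"; }

void check_character_literal_types() {
    assert(std::strcmp(char_type_name('x'),  "char") == 0);
    assert(std::strcmp(char_type_name(L'x'), "wchar_t") == 0);

    // A narrow character literal occupies exactly one byte.
    assert(sizeof('x') == 1);
    assert(sizeof(L'x') == sizeof(wchar_t));
}
```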
A character literal can be a plain character (e.g., 'x'), an escape sequence (e.g., '\b'), or a universal character (e.g., '\u03C0'). Table 1-1 lists the possible escape sequences. Note that you must use an escape sequence for a backslash or single-quote character literal. Using an escape for a double quote or question mark is optional. Only the characters shown in Table 1-1 are allowed in an escape sequence. (Some compilers extend the standard and recognize other escape sequences.)
Escape sequence | Meaning |
---|---|
\\ | \ character |
\' | ' character |
\" | " character |
\? | ? character (used to avoid creating a trigraph, e.g., \?\?-) |
\a | Alert or bell |
\b | Backspace |
\f | Form feed |
\n | Newline |
\r | Carriage return |
\t | Horizontal tab |
\v | Vertical tab |
\ooo | Octal number of one to three digits |
\xhh... | Hexadecimal number of one or more digits |
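A few of these escape sequences are exercised in the sketch below. The numeric escapes depend on the execution character set, so those checks run only when the hypothetical helper ascii_execution_charset reports an ASCII-compatible encoding; that encoding is an assumption, not something the language guarantees.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: true when 'A' has its ASCII value.
bool ascii_execution_charset() { return 'A' == 65; }

void check_escape_sequences() {
    // Each escape sequence produces exactly one character.
    assert(std::strlen("\\") == 1);
    assert(std::strlen("a\tb\n") == 4);
    assert(std::strlen("\"quoted\"") == 8);
    assert(std::strlen("\?\?-") == 3);

    // Escaping a double quote or a question mark is optional
    // in a character literal.
    assert('\"' == '"');
    assert('\?' == '?');

    // Numeric escapes name a character by its encoded value (ASCII assumed).
    if (ascii_execution_charset()) {
        assert('\101' == 'A');   // octal 101 is decimal 65
        assert('\x41' == 'A');   // hexadecimal 41 is decimal 65
        assert('\011' == '\t');  // octal 11 is decimal 9, the ASCII tab
    }
}
```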
String literals are enclosed in double quotes. A string contains characters that are similar to character literals: plain characters, escape sequences, and universal characters. A string cannot cross a line boundary in the source file, but it can contain escaped line endings (backslash followed by newline).
A wide string literal is prefaced with L (always uppercase). In a wide string literal, a single universal character always maps to a single wide character. In a narrow string literal, the implementation determines whether a universal character maps to one or multiple characters (called a multibyte character). See Chapter 8 for more information on multibyte characters.
Two adjacent string literals (possibly separated by whitespace, including newlines) are concatenated at compile time into a single string. This is often a convenient way to break a long string across multiple lines. Do not try to combine a narrow string with a wide string in this way.
After concatenating adjacent strings, the null character ('\0' or L'\0') is automatically appended after the last character in the string literal.
Here are some examples of string literals. Note that the first three form identical strings.
"hello, reader"
"hello, \
reader"
"hello, " "rea" "der"
"Alert: \a; ASCII tab: \011; portable tab: \t"
"illegal: unterminated string
L"string with \"quotes\""
A string literal's type is an array of const char. For example, "string"'s type is const char[7]. Wide string literals are arrays of const wchar_t. All string literals have static lifetimes (see Chapter 2 for more information about lifetimes).
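The sketch below checks that the three spellings of "hello, reader" compare equal, and that the array type includes the terminating null character. The variable names are hypothetical. The continued line must start in the first column, because any leading whitespace would become part of the string.

```cpp
#include <cassert>
#include <cstring>

const char* const plain   = "hello, reader";
const char* const spliced = "hello, \
reader";
const char* const joined  = "hello, " "rea" "der";

void check_string_literals() {
    // All three spellings produce the same characters.
    assert(std::strcmp(plain, spliced) == 0);
    assert(std::strcmp(plain, joined) == 0);

    // "string" has six characters plus the appended null: const char[7].
    assert(sizeof("string") == 7);

    // The null is appended once, after concatenation: 13 characters + 1.
    assert(sizeof("hello, " "rea" "der") == 14);

    // A wide string literal is an array of const wchar_t.
    assert(sizeof(L"abc") == 4 * sizeof(wchar_t));
}
```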
As with an array of const anything, the compiler can automatically convert the array to a pointer to the array's first element. You can, for example, assign a string literal to a suitable pointer object:
const char* ptr;
ptr = "string";
As a special case, you can also convert a string literal to a non-const pointer. Attempting to modify the string results in undefined behavior. This conversion is deprecated, and well-written code does not rely on it.
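The array-to-pointer conversion and the static lifetime of string literals can be sketched as follows. The names array_length and greeting are hypothetical. The deprecated conversion to a non-const pointer is deliberately not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: binds to the literal as an array, before any
// conversion to a pointer, and reports the array bound N.
template <std::size_t N>
std::size_t array_length(const char (&)[N]) { return N; }

// Returning a pointer to a string literal is safe because the literal
// has static lifetime.
const char* greeting() { return "hello"; }

void check_string_pointer_conversion() {
    assert(array_length("string") == 7);  // const char[7]

    const char* ptr = "string";           // array converts to const char*
    assert(ptr[0] == 's');
    assert(ptr[6] == '\0');               // the appended null character

    assert(std::strcmp(greeting(), "hello") == 0);
}
```

Declaring the pointer as const char* lets the compiler reject any attempt to modify the literal through it.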
Nonalphabetic symbols are used as operators and as punctuation (e.g., statement terminators). Some symbols are made of multiple adjacent characters. The following are all the symbols used for operators and punctuation:
{ } [ ] # ##
( ) <: :> <% %>
%: %:%: ; : ... ,
+ - * / %
^ & | ? ::
. .* -> ->* ~ !
= == < > <= >=
!= << >> <<= >>= &&
-= += *= /= %= ||
&= |= ^= ++ --
You cannot insert whitespace between characters that make up a symbol, and C++ always collects as many characters as it can to form a symbol before trying to interpret the symbol. Thus, an expression such as x+++y is read as x ++ + y. A common error when first using templates is to omit a space between closing angle brackets in a nested template instantiation. The following is an example with that space:
std::list<std::vector<int> > list;   // Note the space between the two > characters.
The example is incorrect without the space character because the adjacent greater-than signs would be interpreted as a single right-shift operator, not as two separate closing angle brackets. Another, slightly less common, error is instantiating a template with a template argument that uses the global scope operator:
::std::list< ::std::list<int> > list;   // Note the space after the first < and the space between the two > characters.
Again, a space is needed, this time between the angle-bracket (<) and the scope operator (::), to prevent the compiler from seeing the first token as <: rather than <. The <: token is an alternative token, as described in Section 1.5 later in this chapter.
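The tokenization rule can be observed directly. In this sketch the function names are hypothetical. The template declarations keep the spaces described above, so they are valid under the rules in this section; compilers that implement C++11 or later also accept the unspaced forms, because those standards relaxed both rules.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// x+++y is tokenized as x ++ + y: add the old value of x to y,
// then increment x.
int add_then_increment(int& x, int y) {
    return x+++y;
}

// Nested template arguments, written with the separating spaces.
std::size_t nested_row_size() {
    std::list<std::vector<int> > rows;
    rows.push_back(std::vector<int>(3, 0));
    return rows.front().size();
}

std::size_t global_scope_size() {
    ::std::list< ::std::list<int> > lists;
    lists.push_back(::std::list<int>(2, 0));
    return lists.front().size();
}

void check_max_munch() {
    int x = 1;
    assert(add_then_increment(x, 2) == 3);  // 1 + 2, using the old value of x
    assert(x == 2);                         // x was incremented afterward
    assert(nested_row_size() == 3);
    assert(global_scope_size() == 2);
}
```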