1.2 Tokens

All source code is divided into a stream of tokens. The compiler tries to collect as many contiguous characters as it can to build a valid token. (This is sometimes called the "max munch" rule.) It stops when the next character it would read cannot possibly be part of the token it is reading.

A token can be an identifier, a reserved keyword, a literal, or an operator or punctuation symbol. Each kind of token is described later in this section.

Step 3 of the compilation process reads preprocessor tokens. These tokens are converted automatically to ordinary compiler tokens as part of the main compilation in Step 7. The differences between a preprocessor token and a compiler token are small:

  • The preprocessor and the compiler might use different encodings for character and string literals.

  • The compiler treats integer and floating-point literals differently; the preprocessor does not.

  • The preprocessor recognizes <header> as a single token (for #include directives); the compiler does not.

1.2.1 Identifiers

An identifier is a name that you define or that is defined in a library. An identifier begins with a nondigit character and is followed by any number of digits and nondigits. A nondigit character is a letter, an underscore, or one of a set of universal characters. The exact set of nondigit universal characters is defined in the C++ standard and in ISO/IEC PDTR 10176. Basically, this set contains the universal characters that represent letters. Most programmers restrict themselves to the characters a...z, A...Z, and underscore, but the standard permits letters in other languages.

Not all compilers support universal characters in identifiers.

Certain identifiers are reserved for use by the standard library:

  • Any identifier that contains two consecutive underscores (like__this) is reserved; that is, you cannot use such an identifier for macros, class members, global objects, or anything else.

  • Any identifier that starts with an underscore followed by a capital letter (A-Z) is reserved.

  • Any other identifier that starts with an underscore is reserved in the global namespace. You can use such names in other contexts (e.g., as class members and local names).

  • The C standard reserves some identifiers for future use. These identifiers fall into two categories: function names and macro names. Function names are reserved and should not be used as global function or object names; you should also avoid using them as "C" linkage names in any namespace. Note that the C standard reserves these names regardless of which headers you #include. The reserved function names are:

    • Names that start with is followed by a lowercase letter, such as isblank

    • Names that start with mem followed by a lowercase letter, such as memxyz

    • Names that start with str followed by a lowercase letter, such as strtof

    • Names that start with to followed by a lowercase letter, such as toxyz

    • Names that start with wcs followed by a lowercase letter, such as wcstof

    • Names of functions declared in <cmath> with f or l appended, such as cosf and sinl

  • Macro names are reserved in all contexts. Do not use any of the following reserved macro names:

    • Identifiers that start with E followed by a digit or an uppercase letter

    • Identifiers that start with LC_ followed by an uppercase letter

    • Identifiers that start with SIG or SIG_ followed by an uppercase letter
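The underscore rules above are mechanical enough to express in code. The following is a minimal sketch, not a complete checker: the function name is_reserved_name and its at_global_scope parameter are hypothetical, it assumes identifiers written with plain ASCII letters, and it ignores the names reserved by the C standard.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of any library): reports whether an
// identifier falls under the underscore rules listed above.
// at_global_scope selects whether the "leading underscore" rule applies.
bool is_reserved_name(const std::string& name, bool at_global_scope)
{
    // Rule 1: two consecutive underscores anywhere in the name.
    if (name.find("__") != std::string::npos)
        return true;

    if (!name.empty() && name[0] == '_') {
        // Rule 2: a leading underscore followed by a capital letter.
        if (name.size() > 1 &&
            std::isupper(static_cast<unsigned char>(name[1])))
            return true;
        // Rule 3: any other leading underscore, but only at global scope.
        if (at_global_scope)
            return true;
    }
    return false;
}
```

With this sketch, is_reserved_name("_name", true) reports a reserved name, but the same identifier used as a class member or local name (at_global_scope set to false) does not.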

1.2.2 Keywords

A keyword is an identifier that is reserved in all contexts for special use by the language. The following is a list of all the reserved keywords. (Note that some compilers do not implement all of the reserved keywords; these compilers allow you to use certain keywords as identifiers. See Section 1.5 later in this chapter for more information.)

and          continue       goto        public             try
and_eq       default        if          register           typedef
asm          delete         inline      reinterpret_cast   typeid
auto         do             int         return             typename
bitand       double         long        short              union
bitor        dynamic_cast   mutable     signed             unsigned
bool         else           namespace   sizeof             using
break        enum           new         static             virtual
case         explicit       not         static_cast        void
catch        export         not_eq      struct             volatile
char         extern         operator    switch             wchar_t
class        false          or          template           while
compl        float          or_eq       this               xor
const        for            private     throw              xor_eq
const_cast   friend         protected   true

1.2.3 Literals

A literal is an integer, floating-point, Boolean, character, or string constant.

1.2.3.1 Integer literals

An integer literal can be a decimal, octal, or hexadecimal constant. A prefix specifies the base or radix: 0x or 0X for hexadecimal, 0 for octal, and nothing for decimal. An integer literal can also have a suffix that is a combination of U and L, for unsigned and long, respectively. The suffix letters can be uppercase or lowercase and can appear in either order. The suffix and prefix are interpreted as follows:

  • If the suffix is UL (or ul, LU, etc.), the literal's type is unsigned long.

  • If the suffix is L, the literal's type is long or unsigned long, whichever fits first. (That is, if the value fits in a long, the type is long; otherwise, the type is unsigned long. An error results if the value does not fit in an unsigned long.)

  • If the suffix is U, the type is unsigned or unsigned long, whichever fits first.

  • Without a suffix, a decimal integer has type int or long, whichever fits first.

  • An octal or hexadecimal literal has type int, unsigned, long, or unsigned long, whichever fits first.

Some compilers offer other suffixes as extensions to the standard. See Appendix A for examples.

Here are some examples of integer literals:

314          // Legal

314u         // Legal

314LU        // Legal

0xFeeL       // Legal

0ul          // Legal

078          // Illegal: 8 is not an octal digit

032UU        // Illegal: cannot repeat a suffix
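One way to see which type the compiler assigns to a literal is to let overload resolution pick among functions that differ only in parameter type. The following is a small sketch; the function name int_type_of is hypothetical, and the literals are chosen to fit in an int so that the result does not depend on the platform's integer sizes.

```cpp
#include <string>

// Hypothetical helpers: each overload returns the name of its parameter
// type, so the overload chosen for a literal reveals the literal's type.
std::string int_type_of(int)           { return "int"; }
std::string int_type_of(unsigned)      { return "unsigned"; }
std::string int_type_of(long)          { return "long"; }
std::string int_type_of(unsigned long) { return "unsigned long"; }
```

Calling int_type_of(314) selects the int overload, int_type_of(314u) selects unsigned, int_type_of(0xFeeL) selects long, and both int_type_of(314LU) and int_type_of(0ul) select unsigned long. On a platform with a 32-bit int, a call such as int_type_of(0xFFFFFFFF) would be expected to select the unsigned overload, because an unsuffixed hexadecimal literal can take an unsigned type.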

1.2.3.2 Floating-point literals

A floating-point literal consists of an integer part, a decimal point, a fractional part, and an exponent part, some of which can be omitted. You must include the decimal point, the exponent, or both. You must include the integer part, the fractional part, or both. The exponent, which can carry an optional sign, is introduced by e or E. The literal's type is double unless there is a suffix: F for type float and L for long double. The suffix can be uppercase or lowercase.

Here are some examples of floating-point literals:

3.14159          // Legal

.314159F         // Legal

314159E-5L       // Legal

314.             // Legal

314E             // Illegal: incomplete exponent

314f             // Illegal: no decimal or exponent

.e24             // Illegal: missing integer or fraction
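The effect of the suffix on a floating-point literal's type can be observed with overloaded functions. This is a sketch only; the function name fp_type_of is hypothetical.

```cpp
#include <string>

// Hypothetical helpers: the overload chosen for a literal shows whether
// the compiler treats it as float, double, or long double.
std::string fp_type_of(float)       { return "float"; }
std::string fp_type_of(double)      { return "double"; }
std::string fp_type_of(long double) { return "long double"; }
```

Applied to the legal examples above, fp_type_of(3.14159) and fp_type_of(314.) select double, fp_type_of(.314159F) selects float, and fp_type_of(314159E-5L) selects long double.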

1.2.3.3 Boolean literals

There are two Boolean literals, both keywords: true and false.

1.2.3.4 Character literals

Character literals are enclosed in single quotes. If the literal begins with L (uppercase only), it is a wide character literal (e.g., L'x'). Otherwise, it is a narrow character literal (e.g., 'x'). Narrow characters are used more frequently than wide characters, so the "narrow" adjective is usually dropped.

The value of a narrow or wide character literal is the value of the character's encoding in the execution character set. If the literal contains more than one character, the literal value is implementation-defined. Note that a character might have different encodings in different locales. Consult your compiler's documentation to learn which encoding it uses for character literals.

A narrow character literal with a single character has type char. With more than one character, the type is int (e.g., 'abc'). The type of a wide character literal is always wchar_t.

In C, a character literal always has type int. C++ changed the type of character literals to support overloading, especially for I/O (e.g., cout << '\n' starts a new line and does not print the integer value of the newline character).
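The overloading argument can be illustrated with a short sketch. The function name char_type_of is hypothetical; the overload set mirrors the choice that an output operator must make between printing a character and printing a number.

```cpp
#include <string>

// Hypothetical helpers: overload resolution distinguishes a character
// literal from an integer because the literal has type char, not int.
std::string char_type_of(char)    { return "char"; }
std::string char_type_of(wchar_t) { return "wchar_t"; }
std::string char_type_of(int)     { return "int"; }
```

Here char_type_of('x') selects the char overload and char_type_of(L'x') selects wchar_t. An arithmetic expression such as 'x' + 0 is promoted to int, so char_type_of('x' + 0) selects the int overload. It also follows that sizeof('x') is 1 in C++.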

A character literal can be a plain character (e.g., 'x'), an escape sequence (e.g., '\b'), or a universal character (e.g., '\u03C0'). Table 1-1 lists the possible escape sequences. Note that you must use an escape sequence for a backslash or single-quote character literal. Using an escape for a double quote or question mark is optional. Only the characters shown in Table 1-1 are allowed in an escape sequence. (Some compilers extend the standard and recognize other escape sequences.)

Table 1-1. Character escape sequences

Escape sequence   Meaning
\\                \ character
\'                ' character
\"                " character
\?                ? character (used to avoid creating a trigraph, e.g., \?\?-)
\a                Alert or bell
\b                Backspace
\f                Form feed
\n                Newline
\r                Carriage return
\t                Horizontal tab
\v                Vertical tab
\ooo              Octal number of one to three digits
\xhh . . .        Hexadecimal number of one or more digits
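The numeric escapes in Table 1-1 name a character by its code, which the following sketch illustrates. The function names are hypothetical, and the second function assumes an ASCII-compatible execution character set.

```cpp
// Octal and hexadecimal escapes specify a character by numeric value,
// so these two literals denote the same value (65) in any character set.
bool octal_matches_hex()
{
    return '\101' == '\x41';
}

// Assumption: an ASCII-compatible execution character set, in which
// code 65 is 'A' and code 9 (octal 011) is the horizontal tab.
bool matches_ascii()
{
    return '\x41' == 'A' && '\011' == '\t';
}
```

The named escapes, such as '\t', are the portable choice because they do not depend on the numeric codes of a particular character set.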

1.2.3.5 String literals

String literals are enclosed in double quotes. A string contains characters that are similar to character literals: plain characters, escape sequences, and universal characters. A string cannot cross a line boundary in the source file, but it can contain escaped line endings (backslash followed by newline).

A wide string literal is prefaced with L (always uppercase). In a wide string literal, a single universal character always maps to a single wide character. In a narrow string literal, the implementation determines whether a universal character maps to one or multiple characters (called a multibyte character). See Chapter 8 for more information on multibyte characters.

Two adjacent string literals (possibly separated by whitespace, including newlines) are concatenated at compile time into a single string. This is often a convenient way to break a long string across multiple lines. Do not try to combine a narrow string with a wide string in this way.

After concatenating adjacent strings, the null character ('\0' or L'\0') is automatically appended after the last character in the string literal.

Here are some examples of string literals. Note that the first three form identical strings.

"hello, reader"

"hello, \
reader"

"hello, " "rea" "der"

"Alert: \a; ASCII tab: \011; portable tab: \t"

"illegal: unterminated string

L"string with \"quotes\""

A string literal's type is an array of const char. For example, "string"'s type is const char[7]. Wide string literals are arrays of const wchar_t. All string literals have static lifetimes (see Chapter 2 for more information about lifetimes).
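Because a string literal is an array, sizeof reports the number of elements including the terminating null character. The sketch below checks that and the concatenation rule; the function names are hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// sizeof applied to a narrow string literal yields the array size in
// bytes: six letters plus the terminating '\0'.
std::size_t narrow_array_size()
{
    return sizeof("string");
}

// For a wide literal, divide by the element size to count elements:
// two wide characters plus the terminating L'\0'.
std::size_t wide_array_length()
{
    return sizeof(L"ab") / sizeof(wchar_t);
}

// Adjacent literals are joined by the compiler into one string.
bool concatenation_matches()
{
    return std::strcmp("hello, " "rea" "der", "hello, reader") == 0;
}
```

Under these definitions, narrow_array_size() returns 7, wide_array_length() returns 3, and concatenation_matches() returns true.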

As with an array of const anything, the compiler can automatically convert the array to a pointer to the array's first element. You can, for example, assign a string literal to a suitable pointer object:

const char* ptr;

ptr = "string";

As a special case, you can also convert a string literal to a non-const pointer. Attempting to modify the string results in undefined behavior. This conversion is deprecated, and well-written code does not rely on it.

1.2.4 Symbols

Nonalphabetic symbols are used as operators and as punctuation (e.g., statement terminators). Some symbols are made of multiple adjacent characters. The following are all the symbols used for operators and punctuation:

{    (    %:      &&   ^    .     =    !=    -=   &=
}    )    %:%:    +    &    .*    ==   <<    +=   |=
[    <:   ;       -    |    ->    <    >>    *=   ^=
]    :>   :       *    ?    ->*   >    <<=   /=   ++
#    <%   ...     /    ||   ~     <=   >>=   %=   --
##   %>   ,       %    ::   !     >=

You cannot insert whitespace between characters that make up a symbol, and C++ always collects as many characters as it can to form a symbol before trying to interpret the symbol. Thus, an expression such as x+++y is read as x ++ + y. A common error when first using templates is to omit a space between closing angle brackets in a nested template instantiation. The following is an example with that space:


std::list<std::vector<int> > list;
//                        ^ Note the space here.

The example is incorrect without the space character because the adjacent greater than signs would be interpreted as a single right-shift operator, not as two separate closing angle brackets. Another, slightly less common, error is instantiating a template with a template argument that uses the global scope operators:


::std::list< ::std::list<int> > list;
//          ^ space here     ^ and here

Again, a space is needed, this time between the left angle bracket (<) and the scope operator (::), to prevent the compiler from seeing the first token as <: rather than <. The <: token is an alternative token, as described in Section 1.5 later in this chapter.
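The tokenization rules in this section can be tried out with a short sketch. The names sum_then_increment, nested_list, and make_nested are hypothetical and exist only for illustration.

```cpp
#include <list>
#include <vector>

// x+++y is tokenized as x ++ + y: the sum uses the old value of x,
// and x is incremented as a side effect.
int sum_then_increment(int& x, int y)
{
    return x+++y;
}

// The space after < keeps the compiler from reading <: as one token.
typedef ::std::list< ::std::list<int> > nested_list;

// Nested template arguments written with the separating space between
// the closing angle brackets.
std::list<std::vector<int> > make_nested()
{
    std::list<std::vector<int> > result;
    result.push_back(std::vector<int>(3, 0));  // one vector of three zeros
    return result;
}
```

If x is 1 before the call, sum_then_increment(x, 2) returns 3 and leaves x equal to 2, which is the behavior of (x++) + y, not x + (++y).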