1.1 Compilation Steps

A C++ source file undergoes many transformations on its way to becoming an executable program. The initial steps process all the #include and conditional preprocessing directives to produce what the standard calls a translation unit. Translation units are important because they are self-contained: once the directives have been processed, a translation unit has no dependencies on other files. Nonetheless, programmers usually speak of source files even when they mean translation units, so this book uses the phrase source file because it is familiar to most readers. The term "translation" encompasses both compilation and interpretation, although most C++ translators are compilers. This section discusses how a C++ implementation reads and compiles (translates) source files (translation units).

A C++ program can be made from many source files, and each file can be compiled separately. Conceptually, the compilation process has several steps (although a compiler can merge or otherwise modify steps if it can do so without affecting the observable results):

  1. Read physical characters from the source file and translate the characters to the source character set (described in Section 1.4 later in this chapter). The source "file" is not necessarily a physical file; an implementation might, for example, retrieve the source from a database. Trigraph sequences are reduced to their equivalent characters (see Section 1.6 later in this chapter). Each native end-of-line character or character sequence is replaced by a newline character.

  2. If a backslash character is followed immediately by a newline character, delete the backslash and the newline. The backslash/newline combination must not fall in the middle of a universal character (e.g., \u1234) and must not be at the end of a file. It can be used in a character or string literal, or to continue a preprocessor directive or one-line comment on multiple lines. A non-empty file must end with a newline.

  3. Partition the source into preprocessor tokens separated by whitespace and comments. A preprocessor token is slightly different from a compiler token (see the next section, Section 1.2). A preprocessor token can be a header name, identifier, number, character literal, string literal, symbol, or miscellaneous character. Each preprocessor token is the longest sequence of characters that can make up a legal token, regardless of what comes after the token.

  4. Perform preprocessing and expand macros. All #include files are processed in the manner described in steps 1-4. For more information about preprocessing, see Chapter 11.

  5. Convert character and string literals to the execution character set.

  6. Concatenate adjacent string literals. Narrow string literals are concatenated with narrow string literals, and wide string literals with wide string literals. Mixing narrow and wide string literals is best avoided: the behavior is undefined in the 1998 and 2003 standards, although C++11 and later treat the narrow literal as a wide one.

  7. Perform the main compilation.

  8. Combine compiled files. For each file, all required template instantiations (see Chapter 7) are identified, and the necessary template definitions are located and compiled.

  9. Resolve external references. The compiled files are linked to produce an executable image.
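
Several of the early steps can be observed directly in code. The following is a minimal sketch, not a definitive demonstration: it illustrates line splicing (step 2), longest-token partitioning (step 3), and string literal concatenation (step 6). All the names in it (spliced, joined, munch, same, SQUARE_SUM) are hypothetical and chosen only for illustration. It also assumes a compiler that supports C++14 constexpr and static_assert, which postdate the original standard, so that each claim can be checked at compile time.

```cpp
// Illustrative sketch: every name below is hypothetical, and C++14 is assumed
// so that the claims can be verified with static_assert at compile time.

// Helper: compares two null-terminated strings in a constant expression.
constexpr bool same(const char* a, const char* b) {
    return *a == *b && (*a == '\0' || same(a + 1, b + 1));
}

// Step 2: a backslash followed immediately by a newline is deleted, even in
// the middle of a string literal, so the literal below is "splice".
constexpr const char* spliced = "spl\
ice";

// Step 2 also allows a preprocessor directive to continue on the next line.
#define SQUARE_SUM(a, b) \
    (((a) + (b)) * ((a) + (b)))

// Step 3: each token is the longest legal sequence of characters, so
// x+++y is partitioned as x ++ + y, that is, (x++) + y, not x + (++y).
constexpr int munch(int x, int y) { return x+++y; }

// Step 6: adjacent narrow string literals become a single literal.
constexpr const char* joined = "trans" "lation " "unit";

static_assert(same(spliced, "splice"), "backslash/newline was deleted");
static_assert(SQUARE_SUM(1, 2) == 9, "directive continued on a second line");
static_assert(munch(1, 10) == 11, "x+++y means (x++) + y");
static_assert(same(joined, "translation unit"), "literals were concatenated");
```

The sketch defines no main function, so it can be compiled on its own as one translation unit. If the compiler accepts it, every static_assert held. Had the compiler partitioned x+++y as x + (++y), munch(1, 10) would yield 12 and the third assertion would fail.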