4.1 Lexical Structure

The lexical structure of a programming language is the set of basic rules that govern how you write programs in that language. It is the lowest-level syntax of the language and specifies such things as what variable names look like and what characters are used for comments. Each Python source file, like any other text file, is a sequence of characters. You can also usefully see it as a sequence of lines, tokens, or statements. These different syntactic views complement and reinforce each other. Python is very particular about program layout, especially with regard to lines and indentation, so you'll want to pay attention to this information if you are coming to Python from another language.

4.1.1 Lines and Indentation

A Python program is composed of a sequence of logical lines, each made up of one or more physical lines. Each physical line may end with a comment. A pound sign (#) that is not inside a string literal begins a comment. All characters after the # and up to the physical line end are part of the comment, and the Python interpreter ignores them. A line containing only whitespace, possibly with a comment, is called a blank line, and is ignored by the interpreter. In an interactive interpreter session, you must enter an empty physical line (without any whitespace or comment) to terminate a multiline statement.
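As a small illustration of the comment rules just described, the sketch below uses made-up names (count, label, result) that are not part of the text:

```python
count = 3          # a comment runs to the end of the physical line
label = "#1 item"  # the first # sits inside a string literal, so it is data

# This physical line holds only a comment, so it is a blank line too.

result = label * count   # the string still contains its # character
```

The interpreter skips the comment-only line and the empty lines around it, while the # inside the quotes stays part of the string value.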

In Python, the end of a physical line marks the end of most statements. Unlike in other languages, Python statements are not normally terminated with a delimiter, such as a semicolon (;). When a statement is too long to fit on a single physical line, you can join two adjacent physical lines into a logical line by ensuring that the first physical line has no comment and ends with a backslash (\). Python also joins adjacent physical lines into one logical line if an open parenthesis ((), bracket ([), or brace ({) has not yet been closed. Triple-quoted string literals can also span physical lines. Physical lines after the first one in a logical line are known as continuation lines. The indentation issues covered next do not apply to continuation lines, but only to the first physical line of each logical line.
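The three ways of spanning physical lines can be sketched as follows; the variable names are illustrative only:

```python
# A backslash at the very end of a physical line joins it to the next one.
total = 1 + 2 + \
        3 + 4

# An unclosed bracket joins lines without any backslash.
weekdays = ['Mon', 'Tue', 'Wed',
            'Thu', 'Fri']

# A triple-quoted string literal may span physical lines; the line break
# becomes part of the string value.
greeting = """Good
night"""
```

In each case the second physical line is a continuation line, so its leading whitespace is a matter of readability rather than block structure.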

Python uses indentation to express the block structure of a program. Unlike other languages, Python does not use braces or begin/end delimiters around blocks of statements: indentation is the only way to indicate such blocks. Each logical line in a Python program is indented by the whitespace on its left. A block is a contiguous sequence of logical lines, all indented by the same amount; the block is ended by a logical line with less indentation. All statements in a block must have the same indentation, as must all clauses in a compound statement. Standard Python style is to use four spaces per indentation level. The first statement in a source file must have no indentation (i.e., it must not begin with any whitespace). Additionally, statements typed at the interactive interpreter prompt >>> (covered in Chapter 3) must have no indentation.
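A minimal sketch of how indentation alone delimits blocks, using a hypothetical function named classify:

```python
def classify(n):
    # The function body is one block, indented four spaces.
    if n < 0:
        # The body of each clause is indented one more level.
        kind = 'negative'
    elif n == 0:
        kind = 'zero'
    else:
        kind = 'positive'
    # Less indentation ends the clause bodies; this line belongs
    # to the function body again.
    return kind
```

Note that the if, elif, and else clauses share one indentation, and every statement inside a given block lines up with its neighbors.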

A tab is logically replaced by up to 8 spaces, so that the next character after the tab falls into logical column 9, 17, 25, etc. Don't mix spaces and tabs for indentation, since different tools (e.g., editors, email systems, printers) treat tabs differently. The -t and -tt options to the Python interpreter (covered in Chapter 3) ensure against inconsistent tab and space usage in Python source code. You can configure any good editor to expand tabs to spaces so that all Python source code you write contains only spaces, not tabs. You then know that all tools, including Python itself, are going to be consistent in handling the crucial matter of indentation in your source files.
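The column rule for tabs can be tried out with the string method expandtabs, which applies the same tab-stop arithmetic when given a tab size of 8. This is only an illustration of the rule, not a description of how the interpreter is implemented, and the sample strings are invented:

```python
line = "\tx = 1"
expanded = line.expandtabs(8)
column = expanded.index("x") + 1          # 1-based column of x

# Two characters before the tab: the tab now stands for only 6 spaces,
# yet the next character lands in the same logical column.
mixed = "ab\ty = 2"
mixed_column = mixed.expandtabs(8).index("y") + 1
```

Both x and y end up in logical column 9, which is why a tab is said to stand for up to 8 spaces rather than exactly 8.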

4.1.2 Tokens

Python breaks each logical line into a sequence of elementary lexical components, called tokens. Each token corresponds to a substring of the logical line. The normal token types are identifiers, keywords, operators, delimiters, and literals, as covered in the following sections. Whitespace may be freely used between tokens to separate them. Some whitespace separation is needed between logically adjacent identifiers or keywords; otherwise, they would be parsed as a single, longer identifier. For example, printx is a single identifier; to write the keyword print followed by the identifier x, you need to insert some whitespace (e.g., print x).
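One way to watch the tokenizer at work is the standard library's tokenize module. The sketch below is written for a current Python 3 interpreter (an assumption that goes beyond the versions discussed here), and name_tokens is a hypothetical helper:

```python
import io
import tokenize

def name_tokens(source):
    """Return the name-like tokens (identifiers and keywords) in source."""
    readline = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.NAME]

joined = name_tokens("printx\n")       # no whitespace: one longer identifier
separated = name_tokens("print x\n")   # whitespace splits it into two tokens
```

The tokenize module reports identifiers and keywords alike as NAME tokens, so the helper shows only where token boundaries fall, not which names are reserved.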

4.1.2.1 Identifiers

An identifier is a name used to identify a variable, function, class, module, or other object. An identifier starts with a letter (A to Z or a to z) or underscore (_) followed by zero or more letters, underscores, and digits (0 to 9). Case is significant in Python: lowercase and uppercase letters are distinct. Punctuation characters such as @, $, and % are not allowed in identifiers.
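The identifier rule translates directly into a regular expression. The following is a sketch of the rule as stated above, with a hypothetical helper name; it does not check whether the text happens to be a keyword:

```python
import re

# A letter or underscore, then any number of letters, digits, or underscores.
IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*\Z')

def looks_like_identifier(text):
    """Return True if text has the shape of an identifier."""
    return IDENTIFIER.match(text) is not None

checks = [looks_like_identifier(s) for s in ('spam', '_x1', '2fast', 'a$b')]
```

Because case is significant, spam and Spam both pass the check but name two different identifiers.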

Normal Python style is to start class names with an uppercase letter and other identifiers with a lowercase letter. Starting an identifier with a single leading underscore indicates by convention that the identifier is meant to be private. Starting an identifier with two leading underscores indicates a strongly private identifier; if the identifier also ends with two trailing underscores, the identifier is a language-defined special name. The identifier _ (a single underscore) is special in interactive interpreter sessions: the interpreter binds _ to the result of the last expression statement evaluated interactively, if any.
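These naming conventions might look as follows in a small class. Account and its attributes are invented for illustration; the comment on the double-underscore attribute describes the name mangling Python applies to such names inside a class body:

```python
class Account:
    def __init__(self, owner):     # __init__ is a language-defined special name
        self.owner = owner         # ordinary identifier, lowercase
        self._balance = 0          # single underscore: private by convention
        self.__pin = '0000'        # double underscore: stored as _Account__pin

acct = Account('alice')
```

The single underscore is purely a convention and acct._balance remains reachable, whereas the double-underscore attribute cannot be reached as acct.__pin from outside the class.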

4.1.2.2 Keywords

Python has 28 keywords (29 in Python 2.3 and later), which are identifiers that Python reserves for special syntactic uses. Keywords are composed of lowercase letters only. You cannot use keywords as regular identifiers. Some keywords begin simple statements or clauses of compound statements, while other keywords are used as operators. All the keywords are covered in detail in this book, either later in this chapter or in Chapter 5, Chapter 6, or Chapter 7. The keywords in Python are:

and       del       for       is        raise
assert    elif      from      lambda    return
break     else      global    not       try
class     except    if        or        while
continue  exec      import    pass      yield[1]
def       finally   in        print

[1] Only in Python 2.3 and later (or Python 2.2 with from __future__ import generators).
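The standard library's keyword module lets you check whether a name is reserved. The sketch below sticks to keywords whose status is the same in every Python version; the full list reported by keyword.kwlist depends on the interpreter you run:

```python
import keyword

reserved = keyword.iskeyword('lambda')      # a keyword
capitalized = keyword.iskeyword('Lambda')   # case matters: not a keyword
ordinary = keyword.iskeyword('spam')        # a regular identifier

# Using a keyword as a regular identifier is rejected at compile time.
try:
    compile("while = 1\n", "<example>", "exec")
    rejected = False
except SyntaxError:
    rejected = True
```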

4.1.2.3 Operators

Python uses non-alphanumeric characters and character combinations as operators. Python recognizes the following operators, which are covered in detail later in this chapter:

+
-
*
/
%
**
//
<<
>>
&
|
^
~
<
<=
>
>=
<>
!=
==
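A few of these operators applied to small integers, as a quick sketch. The example leaves out / and <>, whose behavior or availability differs between Python versions, and all the names are illustrative:

```python
quotient = 7 // 2      # truncating division
remainder = 7 % 2      # remainder
power = 2 ** 10        # exponentiation
shifted = 1 << 4       # left shift
masked = 6 & 3         # bitwise AND
combined = 6 | 3       # bitwise OR
toggled = 6 ^ 3        # bitwise XOR
inverted = ~5          # bitwise inversion
different = 3 != 4     # inequality
same = 3 == 3.0        # equality compares values, not types
```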

4.1.2.4 Delimiters

Python uses the following symbols and symbol combinations as delimiters in expressions, lists, dictionaries, various aspects of statements, and strings, among other purposes:

(    )    [    ]    {    }
,    :    .    `    =    ;
+=   -=   *=   /=   //=  %=
&=   |=   ^=   >>=  <<=  **=

The period (.) can also appear in floating-point and imaginary literals. A sequence of three periods (...) has a special meaning in slices. The last two rows of the table list the augmented assignment operators, which serve lexically as delimiters but also perform an operation. I'll discuss the syntax for the various delimiters when I introduce the objects or statements with which they are used.
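To show what "also perform an operation" means for the augmented assignment operators, here is a short sketch with an invented variable; each statement rebinds total to the result of the corresponding operation:

```python
total = 10
total += 5     # 15
total *= 2     # 30
total //= 4    # 7
total **= 2    # 49
total %= 10    # 9
total <<= 2    # 36
```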

The following characters have special meanings as part of other tokens:

'
"
#
\

The characters @, $, and ?, all control characters except whitespace, and all characters with ISO codes above 126 (i.e., non-ASCII characters, such as accented letters), can never be part of the text of a Python program except in comments or string literals.

4.1.2.5 Literals

A literal is a data value that appears directly in a program. The following are all literals in Python:

42                       # Integer literal
3.14                     # Floating-point literal
1.0J                     # Imaginary literal
'hello'                  # String literal
"world"                  # Another string literal
"""Good
night"""                 # Triple-quoted string literal

Using literals and delimiters, you can create data values of other types:

[ 42, 3.14, 'hello' ]    # List 
( 100, 200, 300 )        # Tuple 
{ 'x':42, 'y':3.14 }     # Dictionary

The syntax for literals and other data values is covered in detail later in this chapter, when we discuss the various data types supported by Python.

4.1.3 Statements

You can consider a Python source file as a sequence of simple and compound statements. Unlike many other languages, Python has no declarations or other top-level syntax elements.

4.1.3.1 Simple statements

A simple statement is one that contains no other statements. A simple statement lies entirely within a logical line. As in other languages, you may place more than one simple statement on a single logical line, with a semicolon (;) as the separator. However, one statement per line is the usual Python style, as it makes programs more readable.
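Both layouts are shown here for comparison, with illustrative names:

```python
# Three simple statements on one logical line, separated by semicolons.
width = 3; height = 4; area = width * height

# The same statements in the usual style, one per line.
width = 3
height = 4
area = width * height
```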

Any expression can stand on its own as a simple statement; we'll discuss expressions in detail later in this chapter. The interactive interpreter shows the result of an expression statement entered at the prompt (>>>), and also binds the result to a variable named _. Apart from interactive sessions, expression statements are useful only to call functions (and other callables) that have side effects (e.g., that perform output or change global variables).
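A minimal sketch of expression statements in a source file, using a hypothetical function record whose side effect is appending to a module-level list:

```python
log = []

def record(message):
    log.append(message)    # side effect: changes the global list
    return len(log)

record('started')     # expression statement: the result is discarded
1 + 1                 # legal, but has no effect outside an interactive session
record('finished')
```

Only the calls leave a trace; the value of 1 + 1 is computed and thrown away.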

An assignment is a simple statement that assigns a value to a variable, as we'll discuss later in this chapter. Unlike in some other languages, an assignment in Python is a statement, and therefore can never be part of an expression.
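One way to confirm that an assignment statement cannot be nested inside an expression is to hand such code to the built-in compile function. The helper name compiles is an assumption made for this sketch:

```python
def compiles(source):
    """Return True if source is syntactically valid Python."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True

plain = compiles("x = 1\n")          # an ordinary assignment statement
nested = compiles("y = (x = 1)\n")   # an assignment used as an expression
```

The first snippet compiles; the second is rejected as a syntax error before any code runs.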

4.1.3.2 Compound statements

A compound statement contains other statements and controls their execution. A compound statement has one or more clauses, aligned at the same indentation. Each clause has a header that starts with a keyword and ends with a colon (:), followed by a body, which is a sequence of one or more statements. When the body contains multiple statements, also known as a block, these statements should be placed on separate logical lines after the header line and indented rightward from the header line. The block terminates when the indentation returns to that of the clause header (or further left from there). Alternatively, the body can be a single simple statement, following the : on the same logical line as the header. The body may also be several simple statements on the same line with semicolons between them, but as I've already indicated, this is not good Python style.
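The two body layouts can be sketched like this, with a hypothetical function describe:

```python
def describe(values):
    # The def clause: header ending in a colon, then an indented block.
    if not values:
        return 'empty'
    total = 0
    for v in values:
        total += v
    # Bodies made of a single simple statement may follow the colon
    # on the same logical line as the clause header.
    if total > 10: size = 'large'
    else: size = 'small'
    return size
```

The if and else clauses at the end are aligned with each other, just as they would be if their bodies were indented blocks.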


