1.1 Techniques and Patterns

1.1.1 Utilizing Higher-Order Functions in Text Processing

This first topic merits a warning. It jumps feet-first into higher-order functions (HOFs) at a fairly sophisticated level, and the material may be unfamiliar even to experienced Python programmers. Do not be too frightened by this first topic; you can understand the rest of the book without it. If the functional programming (FP) concepts in this topic seem unfamiliar to you, I recommend you jump ahead to Appendix A, especially its final section on FP concepts.

In text processing, one frequently acts upon a series of chunks of text that are, in a sense, homogeneous. Most often, these chunks are lines, delimited by newline characters, but sometimes other sorts of fields and blocks are relevant. Moreover, Python has standard functions and syntax for reading in lines from a file (sensitive to platform differences). Obviously, these chunks are not entirely homogeneous; they can contain varying data. But at the level we worry about during processing, each chunk contains a natural parcel of instruction or information.

As an example, consider an imperative-style code fragment that selects only those lines of text that match a criterion isCond():

selected = []                 # temp list to hold matches
fp = open(filename)
for line in fp.readlines():   # Py2.2 -> "for line in fp:"
    if isCond(line):          # (2.2 version reads lazily)
        selected.append(line)
del line                      # Cleanup transient variable

There is nothing wrong with these few lines (see xreadlines on efficiency issues). But it does take a few seconds to read through them. In my opinion, even this small block of lines does not parse as a single thought, even though its operation really is such. Also, the variable line is slightly superfluous: it retains a value as a side effect after the loop ends, and it could conceivably step on a previously defined value. In FP style, we could write the simpler:

selected = filter(isCond, open(filename).readlines())
# Py2.2 -> filter(isCond, open(filename))

In the concrete, a textual source that one frequently wants to process as a list of lines is a log file. All sorts of applications produce log files, most typically either ones that cause system changes that might need to be examined or long-running applications that perform actions intermittently. For example, the PythonLabs Windows installer for Python 2.2 produces a file called INSTALL.LOG that contains a list of actions taken during the install. Below is a highly abridged copy of this file from one of my computers:

INSTALL.LOG sample data file
Title: Python 2.2
Source: C:\DOWNLOAD\PYTHON-2.2.EXE | 02-23-2002 | 01:40:54 | 7074248
Made Dir: D:\Python22
File Copy: D:\Python22\UNWISE.EXE | 05-24-2001 | 12:59:30 | | ...
RegDB Key: Software\Microsoft\Windows\CurrentVersion\Uninstall\Py...
RegDB Val: Python 2.2
File Copy: D:\Python22\w9xpopen.exe | 12-21-2001 | 12:22:34 | | ...
Made Dir: D:\PYTHON22\DLLs
File Overwrite: C:\WINDOWS\SYSTEM\MSVCRT.DLL | | | | 295000 | 770c8856
RegDB Root: 2
RegDB Key: Software\Microsoft\Windows\CurrentVersion\App Paths\Py...
RegDB Val: D:\PYTHON22\Python.exe
Shell Link: C:\WINDOWS\Start Menu\Programs\Python 2.2\Uninstall Py...
Link Info: D:\Python22\UNWISE.EXE | D:\PYTHON22 |  | 0 | 1 | 0 |
Shell Link: C:\WINDOWS\Start Menu\Programs\Python 2.2\Python ...
Link Info: D:\Python22\python.exe | D:\PYTHON22 | D:\PYTHON22\...

You can see that each action recorded belongs to one of several types. A processing application would presumably handle each type of action differently (especially since each action has different data fields associated with it). It is easy enough to write Boolean functions that identify line types, for example:

def isFileCopy(line):
    return line[:10]=='File Copy:' # or line.startswith(...)
def isFileOverwrite(line):
    return line[:15]=='File Overwrite:'

The string method "".startswith() is less error-prone than an initial slice for recent Python versions, but these examples are compatible with Python 1.5. In a slightly more compact functional programming style, you can also write these like:

isRegDBRoot = lambda line: line[:11]=='RegDB Root:'
isRegDBKey = lambda line: line[:10]=='RegDB Key:'
isRegDBVal = lambda line: line[:10]=='RegDB Val:'

Selecting lines of a certain type is done exactly as above:

lines = open(r'd:\python22\install.log').readlines()
regroot_lines = filter(isRegDBRoot, lines)

But if you want to select upon multiple criteria, an FP style can initially become cumbersome. For example, suppose you are interested in all the "RegDB" lines; you could write a new custom function for this filter:

def isAnyRegDB(line):
    if   line[:11]=='RegDB Root:': return 1
    elif line[:10]=='RegDB Key:':  return 1
    elif line[:10]=='RegDB Val:':  return 1
    else:                          return 0
# For recent Pythons, line.startswith(...) is better

Programming a custom function for each combined condition can produce a glut of named functions. More importantly, each such custom function requires a modicum of work to write and has a nonzero chance of introducing a bug. For conditions that should be jointly satisfied, you can either write custom functions or nest several filters within each other. For example:

shortline = lambda line: len(line) < 25
short_regvals = filter(shortline, filter(isRegDBVal, lines))

In this example, we rely on previously defined functions for the filter. Any error in the filters will be in either shortline() or isRegDBVal(), but not independently in some third function isShortRegVal(). Such nested filters, however, are difficult to read, especially if more than two are involved.

Calls to map() are sometimes similarly nested if several operations are to be performed on the same string. For a fairly trivial example, suppose you wished to reverse, capitalize, and normalize whitespace in lines of text. Creating the support functions is straightforward, and they could be nested in map() calls:

from string import upper, join, split
def flip(s):
    a = list(s)
    a.reverse()
    return join(a,'')
normalize = lambda s: join(split(s),' ')
cap_flip_norms = map(upper, map(flip, map(normalize, lines)))

This type of map() or filter() nest is difficult to read, and should be avoided. Moreover, one can sometimes be drawn into nesting alternating map() and filter() calls, making matters still worse. For example, suppose you want to perform several operations on each of the lines that meet several criteria. To avoid this trap, many programmers fall back to a more verbose imperative coding style that simply wraps the lists in a few loops and creates some temporary variables for intermediate results.

Within a functional programming style, it is nonetheless possible to avoid the pitfall of excessive call nesting. The key to doing this is an intelligent selection of a few combinatorial higher-order functions. In general, a higher-order function is one that takes a function object as an argument or returns one as a result. First-order functions just take some data as arguments and produce a datum as an answer (perhaps a data-structure like a list or dictionary). In contrast, the "inputs" and "outputs" of a HOF are themselves function objects, ones generally intended to be called somewhere later in the program flow.

One example of a higher-order function is a function factory: a function (or class) that returns a function, or collection of functions, that are somehow "configured" at the time of their creation. The "Hello World" of function factories is an "adder" factory. Like "Hello World," an adder factory exists just to show what can be done; it doesn't really do anything useful by itself. Pretty much every explanation of function factories uses an example such as:

>>> def adder_factory(n):
...    return lambda m, n=n: m+n
...
>>> add10 = adder_factory(10)
>>> add10
<function <lambda> at 0x00FB0020>
>>> add10(4)
14
>>> add10(20)
30
>>> add5 = adder_factory(5)
>>> add5(4)
9

For text processing tasks, simple function factories are of less interest than are combinatorial HOFs. The idea of a combinatorial higher-order function is to take several (usually first-order) functions as arguments and return a new function that somehow synthesizes the operations of the argument functions. Below is a simple library of combinatorial higher-order functions that achieve surprisingly much in a small number of lines:

combinatorial.py
from operator import mul, add, truth
apply_each = lambda fns, args=[]: map(apply, fns, [args]*len(fns))
bools = lambda lst: map(truth, lst)
bool_each = lambda fns, args=[]: bools(apply_each(fns, args))
conjoin = lambda fns, args=[]: reduce(mul, bool_each(fns, args))
all = lambda fns: lambda arg, fns=fns: conjoin(fns, (arg,))
both = lambda f,g: all((f,g))
all3 = lambda f,g,h: all((f,g,h))
and_ = lambda f,g: lambda x, f=f, g=g: f(x) and g(x)
disjoin = lambda fns, args=[]: reduce(add, bool_each(fns, args))
some = lambda fns: lambda arg, fns=fns: disjoin(fns, (arg,))
either = lambda f,g: some((f,g))
anyof3 = lambda f,g,h: some((f,g,h))
compose = lambda f,g: lambda x, f=f, g=g: f(g(x))
compose3 = lambda f,g,h: lambda x, f=f, g=g, h=h: f(g(h(x)))
ident = lambda x: x

Even though the module is just over a dozen lines, many of these combinatorial functions are merely conveniences that wrap other, more general ones. Let us take a look at how we can use these HOFs to simplify some of the earlier examples. The same names are used for results, so look above for comparisons:

Some examples using higher-order functions
# Don't nest filters, just produce func that does both
short_regvals = filter(both(shortline, isRegDBVal), lines)

# Don't multiply ad hoc functions, just describe need
regroot_lines = \
    filter(some([isRegDBRoot, isRegDBKey, isRegDBVal]), lines)

# Don't nest transformations, make one combined transform
capFlipNorm = compose3(upper, flip, normalize)
cap_flip_norms = map(capFlipNorm, lines)

In the example, we bind the composed function capFlipNorm for readability. The corresponding map() line expresses just the single thought of applying a common operation to all the lines. But the binding also illustrates some of the flexibility of combinatorial functions. By condensing the several operations previously nested in several map() calls, we can save the combined operation for reuse elsewhere in the program.
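
To illustrate the reuse, the bound transform works on any single string, outside of any map() call; a minimal sketch (the sample input is invented):

>>> capFlipNorm("hello   world\n")
'DLROW OLLEH'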

As a rule of thumb, I recommend not using more than one filter() and one map() in any given line of code. If these "list application" functions need to nest more deeply than this, readability is preserved by saving results to intermediate names. Successive lines of such functional programming style calls themselves revert to a more imperative style, but a wonderful thing about Python is the degree to which it allows seamless combinations of different programming styles. For example:

intermed = filter(niceProperty, map(someTransform, lines))
final = map(otherTransform, intermed)

Any nesting of successive filter() or map() calls, however, can be reduced to single functions using the proper combinatorial HOFs. Therefore, the number of procedural steps needed is pretty much always quite small. However, the reduction in total lines-of-code is offset by the lines used for giving names to combinatorial functions. Overall, FP style code is usually about one-half the length of imperative style equivalents (fewer lines generally mean correspondingly fewer bugs).

A nice feature of combinatorial functions is that they can provide a complete Boolean algebra for functions that have not been called yet (the use of operator.add and operator.mul in combinatorial.py is more than accidental, in that sense). For example, with a collection of simple values, you might express a (complex) relation of multiple truth values as:

satisfied = (this or that) and (foo or bar)

In the case of text processing on chunks of text, these truth values are often the results of predicative functions applied to a chunk:

satisfied = (thisP(s) or thatP(s)) and (fooP(s) or barP(s))

In an expression like the above one, several predicative functions are applied to the same string (or other object), and a set of logical relations on the results is evaluated. But this expression is itself a logical predicate of the string. For naming clarity, and especially if you wish to evaluate the same predicate more than once, it is convenient to create an actual function expressing the predicate:

satisfiedP = both(either(thisP,thatP), either(fooP,barP))

Using a predicative function created with combinatorial techniques is the same as using any other function:

selected = filter(satisfiedP, lines)

1.1.2 Exercise: More on combinatorial functions

The module combinatorial.py presented above provides some of the most commonly useful combinatorial higher-order functions. But there is room for enhancement in the brief example. Creating a personal or organization library of useful HOFs is a way to improve the reusability of your current text processing libraries.

QUESTIONS

1:

Some of the functions defined in combinatorial.py are not, strictly speaking, combinatorial. In a precise sense, a combinatorial function should take one or several functions as arguments and return one or more function objects that "combine" the input arguments. Identify which functions are not "strictly" combinatorial, and determine exactly what type of thing each one does return.

2:

The functions both() and and_() do almost the same thing. But they differ in an important, albeit subtle, way. and_(), like the Python operator and, uses shortcutting in its evaluation. Consider these lines:

>>> f = lambda n: n**2 > 10
>>> g = lambda n: 100/n > 10
>>> and_(f,g)(5)
1
>>> both(f,g)(5)
1
>>> and_(f,g)(0)
0
>>> both(f,g)(0)
Traceback (most recent call last):
...

The shortcutting and_() can potentially allow the first function to act as a "guard" for the second one. The second function never gets called if the first function returns a false value on a given argument.

  1. Create a similarly shortcutting combinatorial or_() function for your library.

  2. Create general shortcutting functions shortcut_all() and shortcut_some() that behave similarly to the functions all() and some(), respectively.

  3. Describe some situations where nonshortcutting combinatorial functions like both(), all(), or anyof3() are more desirable than similar shortcutting functions.

3:

The function ident() would appear to be pointless, since it simply returns whatever value is passed to it. In truth, ident() is an almost indispensable function for a combinatorial collection. Explain the significance of ident().

Hint: Suppose you have a list of lines of text, where some of the lines may be empty strings. What filter can you apply to find all the lines that start with a #?

4:

The function not_() might make a nice addition to a combinatorial library. We could define this function as:

>>> not_ = lambda f: lambda x, f=f: not f(x)

Explore some situations where a not_() function would aid combinatoric programming.

5:

The function apply_each() is used in combinatorial.py to build some other functions. But the utility of apply_each() is more general than its supporting role might suggest. A trivial usage of apply_each() might look something like:

>>> apply_each(map(adder_factory, range(5)),(10,))
[10, 11, 12, 13, 14]

Explore some situations where apply_each() simplifies applying multiple operations to a chunk of text.

6:

Unlike the functions all() and some(), the functions compose() and compose3() take a fixed number of input functions as arguments. Create a generalized composition function that takes a list of input functions, of any length, as an argument.

7:

What other combinatorial higher-order functions that have not been discussed here are likely to prove useful in text processing? Consider other ways of combining first-order functions into useful operations, and add these to your library. What are good names for these enhanced HOFs?

1.1.3 Specializing Python Datatypes

Python comes with an excellent collection of standard datatypes; Appendix A discusses each built-in type. At the same time, an important principle of Python programming makes types less important than programmers coming from other languages tend to expect. According to Python's "principle of pervasive polymorphism" (my own coinage), it is more important what an object does than what it is. Another common way of putting the principle is: if it walks like a duck and quacks like a duck, treat it like a duck.

Broadly, the idea behind polymorphism is letting the same function or operator work on things of different types. In C++ or Java, for example, you might use signature-based method overloading to let an operation apply to several types of things (acting differently as needed). For example:

C++ signature-based polymorphism
#include <stdio.h>
class Print {
public:
  void print(int i)    { printf("int %d\n", i); } 
  void print(double d) { printf("double %f\n", d); }
  void print(float f)  { printf("float %f\n", f); }
};
int main() {
  Print *p = new Print();
  p->print(37);      /* --> "int 37" */
  p->print(37.0);    /* --> "double 37.000000" */
}

The most direct Python translation of signature-based overloading is a function that performs type checks on its argument(s). It is simple to write such functions:

Python "signature-based" polymorphism
def Print(x):
    from types import *
    if type(x) is FloatType:  print "float", x
    elif type(x) is IntType:  print "int", x
    elif type(x) is LongType: print "long", x

Writing signature-based functions, however, is extremely un-Pythonic. If you find yourself performing these sorts of explicit type checks, you have probably not understood the problem you want to solve correctly! What you should (usually) be interested in is not what type x is, but rather whether x can perform the action you need it to perform (regardless of what type of thing it is strictly).

PYTHONIC POLYMORPHISM

Probably the single most common case where pervasive polymorphism is useful is in identifying "file-like" objects. There are many objects that can do things that files can do, such as those created with urllib, cStringIO, zipfile, and by other means. Various objects can perform only subsets of what actual files can: some can read, others can write, still others can seek, and so on. But for many purposes, you have no need to exercise every "file-like" capability; it is good enough to make sure that a specified object has those capabilities you actually need.

Here is a typical example. I have a module that uses DOM to work with XML documents; I would like users to be able to specify an XML source in any of several ways: using the name of an XML file, passing a file-like object that contains XML, or indicating an already-built DOM object to work with (built with any of several XML libraries). Moreover, future users of my module may get their XML from novel places I have not even thought of (an RDBMS, over sockets, etc.). By looking at what a candidate object can do, I can just utilize whichever capabilities that object has:

Python capability-based polymorphism
def toDOM(xml_src=None):
    from xml.dom import minidom
    from types import StringType, UnicodeType
    if hasattr(xml_src, 'documentElement'):
        return xml_src    # it is already a DOM object
    elif hasattr(xml_src,'read'):
        # it is something that knows how to read data
        return minidom.parseString(xml_src.read())
    elif type(xml_src) in (StringType, UnicodeType):
        # it is a filename of an XML document
        xml = open(xml_src).read()
        return minidom.parseString(xml)
    else:
        raise ValueError, "Must be initialized with " +\
              "filename, file-like object, or DOM object"

Even simple-seeming numeric types have varying capabilities. As with other objects, you should not usually care about the internal representation of an object, but rather about what it can do. Of course, as one way to assure that an object has a capability, it is often appropriate to coerce it to a type using the built-in functions complex(), dict(), float(), int(), list(), long(), str(), tuple(), and unicode(). All of these functions make a good effort to transform anything that looks a little bit like the type of thing they name into a true instance of it. It is usually not necessary, however, actually to transform values to prescribed types; again we can just check capabilities.
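
For a quick illustration of how forgiving these coercion functions are:

>>> int("42"), float("42"), list("spam")
(42, 42.0, ['s', 'p', 'a', 'm'])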

For example, suppose that you want to remove the "least significant" portion of any number, perhaps because the values represent measurements of limited accuracy. For whole numbers (ints or longs) you might mask out some low-order bits; for fractional values you might round to a given precision. Rather than testing value types explicitly, you can look for numeric capabilities. One common way to test a capability in Python is to try to do something, and catch any exceptions that occur (then try something else). Below is a simple example:

Checking what numbers can do
def approx(x):                # int attributes require 2.2+
    if hasattr(x,'__and__'):  # supports bitwise-and
        return x & ~0x0FL
    try:                      # supports real/imag
        return (round(x.real,2)+round(x.imag,2)*1j)
    except AttributeError:
        return round(x,2)

ENHANCED OBJECTS

The reason that the principle of pervasive polymorphism matters is because Python makes it easy to create new objects that behave mostly, but not exactly, like basic datatypes. File-like objects were already mentioned as examples; you may or may not think of a file object as a datatype precisely. But even basic datatypes like numbers, strings, lists, and dictionaries can be easily specialized and/or emulated.

There are two details to pay attention to when emulating basic datatypes. The most important matter to understand is that the capabilities of an object, even those utilized with syntactic constructs, are generally implemented by its "magic" methods, each named with leading and trailing double underscores. Any object that has the right magic methods can act like a basic datatype in those contexts that use the supplied methods. At heart, a basic datatype is just an object with some well-optimized versions of the right collection of magic methods.

The second detail concerns exactly how you get at the magic methods, or rather, how best to make use of existing implementations. There is nothing stopping you from writing your own version of any basic datatype, except for the piddling details of doing so. However, there are quite a few such details, and the easiest way to get the functionality you want is to specialize an existing class. Under all non-ancient versions of Python, the standard library provides the pure-Python modules UserDict, UserList, and UserString as starting points for custom datatypes. You can inherit from an appropriate parent class and specialize (magic) methods as needed. No sample parents are provided for tuples, ints, floats, and the rest, however.
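
As a minimal sketch of this older style, a case-insensitive string might specialize just one magic method (the class name CIStr is my own invention, not a standard one):

>>> from UserString import UserString
>>> class CIStr(UserString):
...     # Compare the wrapped .data member case-insensitively
...     def __eq__(self, other):
...         return self.data.lower() == str(other).lower()
...
>>> CIStr('Spam') == 'sPAM'
1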

Under Python 2.2 and above, a better option is available. "New-style" Python classes let you inherit from the underlying C implementations of all the Python basic datatypes. Moreover, these parent classes have become the self-same callable objects that are used to coerce types and construct objects: int(), list(), unicode(), and so on. There is a great deal of arcana and subtle profundity that accompanies new-style classes, but you generally do not need to worry about these details. All you need to know is that a class that inherits from str is faster than one that inherits from UserString; likewise for list versus UserList and dict versus UserDict (assuming your scripts all run on a recent enough version of Python).

Custom datatypes, however, need not specialize full-fledged implementations. You are free to create classes that implement "just enough" of the interface of a basic datatype to be used for a given purpose. Of course, in practice, the reason you would create such custom datatypes is either because you want them to contain non-magic methods of their own or because you want them to implement the magic methods associated with multiple basic datatypes. For example, below is a custom datatype that can be passed to the prior approx() function, and that also provides a (slightly) useful custom method:

>>> class I:  # "Fuzzy" integer datatype
...     def __init__(self, i):  self.i = i
...     def __and__(self, i):   return self.i & i
...     def err_range(self):
...         lbound = approx(self.i)
...         return "Value: [%d, %d)" % (lbound, lbound+0x0F)
...
>>> i1, i2 = I(29), I(20)
>>> approx(i1), approx(i2)
(16L, 16L)
>>> i2.err_range()
'Value: [16, 31)'

Despite supporting an extra method and being able to get passed into the approx() function, I is not a very versatile datatype. If you try to add, or divide, or multiply using "fuzzy integers," you will raise a TypeError. Since there is no module called UserInt, under an older Python version you would need to implement every needed magic method yourself.

Using new-style classes in Python 2.2+, you could derive a "fuzzy integer" from the underlying int datatype. A partial implementation could look like:

>>> class I2(int):    # New-style fuzzy integer
...     def __add__(self, j):
...         vals = map(int, [approx(self), approx(j)])
...         k = int.__add__(*vals)
...         return I2(int.__add__(k, 0x0F))
...     def err_range(self):
...         lbound = approx(self)
...         return "Value: [%d, %d)" %(lbound,lbound+0x0F)
...
>>> i1, i2 = I2(29), I2(20)
>>> print "i1 =", i1.err_range(),": i2 =", i2.err_range()
i1 = Value: [16, 31) : i2 = Value: [16, 31)
>>> i3 = i1 + i2
>>> print i3, type(i3)
47 <class '__main__.I2'>

Since the new-style class int already supports bitwise-and, there is no need to implement it again. With new-style classes, you refer to data values directly with self, rather than as an attribute that holds the data (e.g., self.i in class I). As well, it is generally unsafe to use syntactic operators within magic methods that define their operation; for example, I utilize the .__add__() method of the parent int rather than the + operator in the I2.__add__() method.

In practice, you are less likely to want to create number-like datatypes than you are to emulate container types. But it is worth understanding just how and why even plain integers are a fuzzy concept in Python (the fuzziness of the concepts is of a different sort than the fuzziness of I2 integers, though). Even a function that operates on whole numbers need not operate on objects of IntType or LongType, just on an object that satisfies the desired protocols.

1.1.4 Base Classes for Datatypes

There are several magic methods that are often useful to define for any custom datatype. In fact, these methods are useful even for classes that do not really define datatypes (in some sense, every object is a datatype since it can contain attribute values, but not every object supports special syntax such as arithmetic operators and indexing). Not quite every magic method that you can define is documented in this book, but most are documented under the datatype to which each is most relevant. Moreover, each new version of Python has introduced a few additional magic methods; those covered either have been around for a few versions or are particularly important.

In documenting class methods of base classes, the same general conventions are used as for documenting module functions. The one special convention for these base class methods is the use of self as the first argument to all methods. Since the name self is purely arbitrary, this convention is less special than it might appear. For example, both of the following uses of self are equally legal:

>>> import string
>>> self = 'spam'
>>> object.__repr__(self)
'<str object at 0x12c0a0>'
>>> string.upper(self)
'SPAM'

However, there is usually little reason to use class methods in place of perfectly good built-in and module functions with the same purpose. Normally, these methods of datatype classes are used only in child classes that override the base classes, as in:

>>> class UpperObject(object):
...       def __repr__(self):
...           return object.__repr__(self).upper()
...
>>> uo = UpperObject()
>>> print uo
<__MAIN__.UPPEROBJECT OBJECT AT 0X1C2C6C>

object Ancestor class for new-style datatypes

Under Python 2.2+, object has become a base for new-style classes. Inheriting from object enables a custom class to use a few new capabilities, such as slots and properties. But usually if you are interested in creating a custom datatype, it is better to inherit from a child of object, such as list, float, or dict.
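
For instance, properties work only on descendants of object; below is a hypothetical sketch (the Temperature class is my own example):

>>> class Temperature(object):
...     def __init__(self, celsius=0.0):
...         self.celsius = float(celsius)
...     def getF(self):             # computed on each access
...         return self.celsius * 9 / 5 + 32
...     fahrenheit = property(getF)
...
>>> Temperature(100).fahrenheit
212.0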

METHODS
object.__eq__(self, other)

Return a Boolean comparison between self and other. Determines how a datatype responds to the == operator. The parent class object does not implement .__eq__() since by default object equality means the same thing as identity (the is operator). A child is free to implement this in order to affect comparisons.

object.__ne__(self, other)

Return a Boolean comparison between self and other. Determines how a datatype responds to the != and <> operators. The parent class object does not implement .__ne__() since by default object inequality means the same thing as nonidentity (the is not operator). Although it might seem that equality and inequality always return opposite values, the methods are not explicitly defined in terms of each other. You could force the relationship with:

>>> class EQ(object):
...     # Abstract parent class for equality classes
...     def __eq__(self, o): return not self <> o
...     def __ne__(self, o): return not self == o
...
>>> class Comparable(EQ):
...     # By defining inequality, get equality (or vice versa)
...     def __ne__(self, other):
...         return someComplexComparison(self, other)

object.__nonzero__(self)

Return a Boolean value for an object. Determines how a datatype responds to the Boolean comparisons or, and, and not, and to if and filter(None,...) tests. An object whose .__nonzero__() method returns a true value is itself treated as a true value.
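
For instance, a container-like class might count as false when it holds nothing; a minimal sketch:

>>> class Bag:
...     def __init__(self): self.items = []
...     def __nonzero__(self): return len(self.items) > 0
...
>>> if not Bag(): print "empty bag is false"
...
empty bag is false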

object.__len__(self)
len(object)

Return an integer representing the "length" of the object. For collection types, this is fairly straightforward: how many objects are in the collection? Custom types may change the behavior to some other meaningful value.

object.__repr__(self)
repr(object)
object.__str__(self)
str(object)

Return a string representation of the object self. Determines how a datatype responds to the repr() and str() built-in functions, to the print keyword, and to the back-tick operator.

Where feasible, it is desirable to have the .__repr__() method return a representation with sufficient information in it to reconstruct an identical object. The goal here is to fulfill the equality obj==eval(repr(obj)). In many cases, however, you cannot encode sufficient information in a string, and the repr() of an object is either identical to, or slightly more detailed than, the str() representation of the same object.
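
A minimal sketch of a class meeting that goal (Point is an invented example; the .__eq__() method is needed so the round-trip comparison tests values rather than identity):

>>> class Point:
...     def __init__(self, x, y): self.x, self.y = x, y
...     def __repr__(self): return 'Point(%r, %r)' % (self.x, self.y)
...     def __eq__(self, o): return (self.x, self.y) == (o.x, o.y)
...
>>> p = Point(3, 4)
>>> p == eval(repr(p))
1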

SEE ALSO: repr 96; operator 47;

file New-style base class for file objects

Under Python 2.2+, it is possible to create a custom file-like object by inheriting from the built-in class file. In older Python versions you may only create file-like objects by defining the methods that define an object as "file-like." However, even in recent versions of Python, inheritance from file buys you little; if the data contents come from somewhere other than a native filesystem, you will have to reimplement every method you wish to support.

Even more than for other object types, what makes an object file-like is a fuzzy concept. Depending on your purpose you may be happy with an object that can only read, or one that can only write. You may need to seek within the object, or you may be happy with a linear stream. In general, however, file-like objects are expected to read and write strings. Custom classes only need implement those methods that are meaningful to them and should only be used in contexts where their capabilities are sufficient.
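
For instance, a custom class might wrap only the reading capabilities and transform data as it passes through; a hypothetical sketch:

>>> from StringIO import StringIO
>>> class UpcaseFile:
...     # Read-only wrapper that uppercases whatever it reads
...     def __init__(self, fp): self.fp = fp
...     def read(self, size=-1):
...         return self.fp.read(size).upper()
...     def readline(self):
...         return self.fp.readline().upper()
...
>>> UpcaseFile(StringIO('spam and eggs\n')).read()
'SPAM AND EGGS\n'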

In documenting the methods of file-like objects, I adopt a slightly different convention than for other built-in types. Since actually inheriting from file is unusual, I use the capitalized name FILE to indicate a general file-like object. Instances of the actual file class are examples (and implement all the methods named), but other types of objects can be equally good FILE instances.

BUILT-IN FUNCTIONS
open(fname [,mode [,buffering]])
file(fname [,mode [,buffering]])

Return a file object that attaches to the filename fname. The optional argument mode describes the capabilities and access style of the object. An r mode is for reading; w for writing (truncating any existing content); a for appending (writing to the end). Each of these modes may also have the binary flag b for platforms like Windows that distinguish text and binary files. The flag + may be used to allow both reading and writing. The argument buffering may be 0 for none, 1 for line-oriented buffering, or a larger integer for approximately that many bytes of buffering.

>>> open('tmp','w').write('spam and eggs\n')
>>> print open('tmp','r').read(),
spam and eggs
>>> open('tmp','w').write('this and that\n')
>>> print open('tmp','r').read(),
this and that
>>> open('tmp','a').write('something else\n')
>>> print open('tmp','r').read(),
this and that
something else

METHODS AND ATTRIBUTES
FILE.close()

Close a file object. Reading and writing are disallowed after a file is closed.

FILE.closed

Attribute containing a Boolean value indicating whether the file has been closed.

FILE.fileno()

Return a file descriptor number for the file. File-like objects that do not attach to actual files should not implement this method.

FILE.flush()

Write any pending data to the underlying file. File-like objects that do not cache data can still implement this method as pass.

FILE.isatty()

Return a Boolean value indicating whether the file is a TTY-like device. The standard documentation says that file-like objects that do not attach to actual files should not implement this method, but implementing it to always return 0 is probably a better approach.

FILE.mode

Attribute containing the mode of the file, normally identical to the mode argument passed to the object's initializer.

FILE.name

The name of the file. For file-like objects without a filesystem name, some string identifying the object should be put into this attribute.

FILE.read([size=sys.maxint])

Return a string containing up to size bytes of content from the file. Stop the read if an EOF is encountered or upon another condition that makes sense for the object type. Move the file position forward immediately past the bytes read. A negative size argument is treated as the default value.

FILE.readline([size=sys.maxint])

Return a string containing one line from the file, including the trailing newline, if any. A maximum of size bytes are read. The file position is moved forward past the read. A negative size argument is treated as the default value.

FILE.readlines([size=sys.maxint])

Return a list of lines from the file, each line including its trailing newline. If the argument size is given, limit the read to approximately size bytes worth of lines. The file position is moved forward past the bytes read. A negative size argument is treated as the default value.

FILE.seek(offset [,whence=0])

Move the file position by offset bytes (positive or negative). The argument whence specifies where the initial file position is prior to the move: 0 for BOF; 1 for current position; 2 for EOF.

FILE.tell()

Return the current file position.
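
A short session illustrates positioning with seek() and tell() (using a scratch file):

>>> fp = open('tmp2','wb')
>>> fp.write('0123456789')
>>> fp.close()
>>> fp = open('tmp2','rb')
>>> fp.seek(4)        # absolute position from BOF
>>> fp.tell()
4
>>> fp.read(3)
'456'
>>> fp.seek(-2, 2)    # two bytes before EOF
>>> fp.read()
'89'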

FILE.truncate([size=0])

Truncate the file contents so that the file becomes at most size bytes long.

FILE.write(s)

Write the string s to the file, starting at the current file position. The file position is moved forward past the written bytes.

FILE.writelines(lines)

Write the lines in the sequence lines to the file. No newlines are added during the write. The file position is moved forward past the written bytes.

FILE.xreadlines()

Memory-efficient iterator over lines in a file. In Python 2.2+, you might implement this as a generator that returns one line per yield.
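
A sketch of such a generator (the name iter_lines is my own; Python 2.2 needs the generators future-import, while 2.3+ does not):

>>> from __future__ import generators
>>> def iter_lines(fp):
...     # Lazily yield one line at a time
...     while 1:
...         line = fp.readline()
...         if not line: break
...         yield line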

SEE ALSO: xreadlines 72;

int New-style base class for integer objects

long New-style base class for long integers

In Python, there are two standard datatypes for representing integers. Objects of type IntType have a fixed range that depends on the underlying platform, usually between plus and minus 2**31. Objects of type LongType are unbounded in size. In Python 2.2+, operations on integers that exceed the range of an int object result in automatic promotion to long objects. However, no operation on a long will demote the result back to an int object (even if the result is of small magnitude), with the exception of the int() function, of course.
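
For instance, on a platform with 32-bit ints:

>>> 2147483647 + 1    # exceeds the int range, promoted
2147483648L
>>> 1L + 1            # long operations stay long
2L
>>> int(42L)          # int() demotes a small long
42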

From a user point of view ints and longs provide exactly the same interface. The difference between them is only in underlying implementation, with ints typically being significantly faster to operate on (since they use raw CPU instructions fairly directly). Most of the magic methods integers have are shared by floating point numbers as well and are discussed below. For example, consult the discussion of float.__mul__() for information on the corresponding int.__mul__() method. The special capability that integers have over floating point numbers is their ability to perform bitwise operations.

Under Python 2.2+, you may create a custom datatype that inherits from int or long; under earlier versions, you would need to manually define all the magic methods you wished to utilize (generally a lot of work, and probably not worth it).

Each binary bit operation has a left-associative and a right-associative version. If you define both versions and perform an operation on two custom objects, the left-associative version is chosen. However, if you perform an operation with a basic int and a custom object, the custom right-associative method will be chosen over the basic operation. For example:

>>> class I(int):
...     def __xor__(self, other):
...         return "X0R"
...     def __rxor__(self, other):
...         return "RX0R"
...
>>> 0xFF ^ 0xFF
0
>>> 0xFF ^ I(0xFF)
'RXOR'
>>> I(0xFF) ^ 0xFF
'XOR'
>>> I(0xFF) ^ I(0xFF)
'XOR'

METHODS
int.__and__(self, other)
int.__rand__(self, other)

Return a bitwise-and between self and other. Determines how a datatype responds to the & operator.

int.__hex__(self)

Return a hex string representing self. Determines how a datatype responds to the built-in hex() function.

int.__invert__(self)

Return a bitwise inversion of self. Determines how a datatype responds to the ~ operator.

int.__lshift__(self, other)
int.__rlshift__(self, other)

Return the result of bit-shifting self to the left by other bits. The right-associative version shifts other by self bits. Determines how a datatype responds to the << operator.

int.__oct__(self)

Return an octal string representing self. Determines how a datatype responds to the built-in oct() function.

int.__or__(self, other)
int.__ror__(self, other)

Return a bitwise-or between self and other. Determines how a datatype responds to the | operator.

int.__rshift__(self, other)
int.__rrshift__(self, other)

Return the result of bit-shifting self to the right by other bits. The right-associative version shifts other by self bits. Determines how a datatype responds to the >> operator.

int.__xor__(self, other)
int.__rxor__(self, other)

Return a bitwise-xor between self and other. Determines how a datatype responds to the ^ operator.

SEE ALSO: float 19; int 421; long 422; sys.maxint 50; operator 47;

float New-style base class for floating point numbers

Python floating point numbers are mostly implemented using the underlying C floating point library of your platform; that is, to a greater or lesser degree based on the IEEE 754 standard. A complex number is just a Python object that wraps a pair of floats with a few extra operations on these pairs.

DIGRESSION

Although the details are far outside the scope of this book, a general warning is in order. Floating point math is harder than you think! If you think you understand just how complex IEEE 754 math is, you are not yet aware of all of the subtleties. By way of indication, Python luminary and erstwhile professor of numeric computing Alex Martelli commented in 2001 (on <comp.lang.python>):

Anybody who thinks he knows what he's doing when floating point is involved IS either naive, or Tim Peters (well, it COULD be W. Kahan I guess, but I don't think he writes here).

Fellow Python guru Tim Peters observed:

I find it's possible to be both (wink). But nothing about fp comes easily to anyone, and even Kahan works his butt off to come up with the amazing things that he does.

Peters illustrated further by way of Donald Knuth (The Art of Computer Programming, Third Edition, Addison-Wesley, 1997; ISBN: 0201896842, vol. 2, p. 229):

Many serious mathematicians have attempted to analyze a sequence of floating point operations rigorously, but found the task so formidable that they have tried to be content with plausibility arguments instead.

The trick about floating point numbers is that although they are extremely useful for representing real-life (fractional) quantities, operations on them do not obey the arithmetic rules we learned in middle school: associativity, transitivity, commutativity; moreover, many very ordinary-seeming numbers can be represented only approximately with floating point numbers. For example:

>>> 1./3
0.33333333333333331
>>> .3
0.29999999999999999
>>> 7 == 7./25 * 25
0
>>> 7 == 7./24 * 24
1

CAPABILITIES

In the hierarchy of Python numeric types, floating point numbers are higher up the scale than integers, and complex numbers higher than floats. That is, operations on mixed types get promoted upwards. However, the magic methods that make a datatype "float-like" are strictly a subset of those associated with integers. All of the magic methods listed below for floats apply equally to ints and longs (or integer-like custom datatypes). Complex numbers support a few additional methods.
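
For instance, mixed arithmetic promotes upward:

>>> 1 + 2.5           # int promoted to float
3.5
>>> 1.5 + (1+1j)      # float promoted to complex
(2.5+1j)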

Under Python 2.2+, you may create a custom datatype that inherits from float or complex; under earlier versions, you would need to manually define all the magic methods you wished to utilize (generally a lot of work, and probably not worth it).

Each binary operation has a left-associative and a right-associative version. If you define both versions and perform an operation on two custom objects, the left-associative version is chosen. However, if you perform an operation with a basic datatype and a custom object, the custom right-associative method will be chosen over the basic operation. See the example under int.

METHODS
float.__abs__(self)

Return the absolute value of self. Determines how a datatype responds to the built-in function abs().

float.__add__(self, other)
float.__radd__(self, other)

Return the sum of self and other. Determines how a datatype responds to the + operator.

float.__cmp__(self, other)

Return a value indicating the order of self and other. Determines how a datatype responds to the numeric comparison operators <, >, <=, >=, ==, <>, and !=. Also determines the behavior of the built-in cmp() function. Should return -1 for self<other, 0 for self==other, and 1 for self>other. If other comparison methods are defined, they take precedence over .__cmp__(): .__ge__(), .__gt__(), .__le__(), and .__lt__().

float.__div__(self, other)
float.__rdiv__(self, other)

Return the ratio of self and other. Determines how a datatype responds to the classic / operator. When true division is enabled (via from __future__ import division in Python 2.2+), the / operator is handled by .__truediv__() instead; the floor division operator // is always handled by .__floordiv__().

float.__divmod__(self, other)
float.__rdivmod__(self, other)

Return the pair (quotient, remainder). Determines how a datatype responds to the built-in divmod() function.

float.__floordiv__(self, other)
float.__rfloordiv__(self, other)

Return the number of whole times other goes into self. Determines how a datatype responds to the Python 2.2+ floor division operator //.

float.__mod__(self, other)
float.__rmod__(self, other)

Return the remainder from dividing self by other. Determines how a datatype responds to the % operator.

float.__mul__(self, other)
float.__rmul__(self, other)

Return the product of self and other. Determines how a datatype responds to the * operator.

float.__neg__(self)

Return the negative of self. Determines how a datatype responds to the unary - operator.

float.__pow__(self, other)
float.__rpow__(self, other)

Return self raised to the other power. Determines how a datatype responds to the ** operator and the built-in pow() function.

float.__sub__(self, other)
float.__rsub__(self, other)

Return the difference between self and other. Determines how a datatype responds to the binary - operator.

float.__truediv__(self, other)
float.__rtruediv__(self, other)

Return the ratio of self and other. Determines how a datatype responds to the Python 2.3+ true division operator /.

SEE ALSO: complex 22; int 18; float 422; operator 47;

complex New-style base class for complex numbers

Complex numbers implement all the above documented methods of floating point numbers, and a few additional ones.

Inequality operations on complex numbers are not supported in recent versions of Python, even though they were previously. In Python 2.1+, the methods complex.__ge__(), complex.__gt__(), complex.__le__(), and complex.__lt__() all raise TypeError rather than return Boolean values indicating the order. There is a certain logic to this change inasmuch as complex numbers do not have a "natural" ordering. But there is also significant breakage with this change; this is one of the few changes in Python, since version 1.4 when I started using it, that I feel was a real mistake. The important breakage comes when you want to sort a list of various things, some of which might be complex numbers:

>>> lst = ["string", 1.0, 1, 1L, ('t','u' , 'p')]
>>> lst.sort()
>>> 1st
[1.0, 1, 1L, 'string', ('t', 'u', 'p')]
>>> lst.append(1j)
>>> lst.sort()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: cannot compare complex numbers using <, <=, >, >=

It is true that there is no obvious correct ordering between a complex number and another number (complex or otherwise), but there is also no natural ordering between a string, a tuple, and a number. Nonetheless, it is frequently useful to sort a heterogeneous list in order to create a canonical (even if meaningless) order. In Python 2.2+, you can remedy this shortcoming of recent Python versions in the style below (under 2.1 you are largely out of luck):

>>> class C(complex):
...   def __lt__(self, o):
...     if hasattr(o, 'imag'):
...       return (self.real,self.imag) < (o.real,o.imag)
...     else:
...       return self.real < o
...   def __le__(self, o): return self < o or self==o
...   def __gt__(self, o): return not (self==o or self < o)
...   def __ge__(self, o): return self > o or self==o
...
>>> 1st = ["str", 1.0, 1, 1L, (1,2,3), C(1+1j), C(2-2j)]
>>> lst.sort()
>>> lst
[1.0, 1, 1L, (1+1j), (2-2j), 'str', (1, 2, 3)]

Of course, if you adopt this strategy, you have to create all of your complex values using the custom datatype C. And unfortunately, unless you override arithmetic operations also, a binary operation between a C object and another number reverts to a basic complex datatype. The reader can work out the details of this solution if she needs it.

METHODS
complex.conjugate(self)

Return the complex conjugate of self. A quick refresher here: if self is n+mj, its conjugate is n-mj.
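
For example:

>>> (3+4j).conjugate()
(3-4j)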

complex.imag

Imaginary component of a complex number.

complex.real

Real component of a complex number.

SEE ALSO: float 19; complex 422;

UserDict Custom wrapper around dictionary objects

dict New-style base class for dictionary objects

Dictionaries in Python provide a well-optimized mapping between immutable objects and other Python objects (see Glossary entry on "immutable"). You may create custom datatypes that respond to various dictionary operations. There are a few syntactic operations associated with dictionaries, all involving indexing with square brackets. But unlike with numeric datatypes, there are several regular methods that are reasonable to consider as part of the general interface for dictionary-like objects.

If you create a dictionary-like datatype by subclassing from UserDict.UserDict, all the special methods defined by the parent are proxies to the true dictionary stored in the object's .data member. If, under Python 2.2+, you subclass from dict itself, the object itself inherits dictionary behaviors. In either case, you may customize whichever methods you wish. Below is an example of the two styles for subclassing a dictionary-like datatype:

>>> from sys import stderr
>>> from UserDict import UserDict
>>> class LogDictOld(UserDict):
...    def __setitem__(self, key, val):
...       stderr.write("Set: "+str(key)+"->"+str(val)+"\n")
...       self.data[key] = val
...
>>> ldo = LogDictOld()
>>> ldo['this'] = 'that'
Set: this->that
>>> class LogDictNew(dict):
...    def __setitem__(self, key, val):
...       stderr.write("Set: "+str(key)+"->"+str(val)+"\n")
...       dict.__setitem__(self, key, val)
...
>>> ldn = LogDictNew()
>>> ldn['this'] = 'that'
Set: this->that

METHODS
dict.__cmp__(self, other)
UserDict.UserDict.__cmp__(self, other)

Return a value indicating the order of self and other. Determines how a datatype responds to the numeric comparison operators <, >, <=, >=, ==, <>, and !=. Also determines the behavior of the built-in cmp() function. Should return -1 for self<other, 0 for self==other, and 1 for self>other. If other comparison methods are defined, they take precedence over .__cmp__(): .__ge__(), .__gt__(), .__le__(), and .__lt__().

dict.__contains__(self, x)
UserDict.UserDict.__contains__(self, x)

Return a Boolean value indicating whether self "contains" the value x. By default, being contained in a dictionary means matching one of its keys, but you can change this behavior by overriding it (e.g., check whether x is one of the dictionary's values rather than one of its keys).
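
A hypothetical sketch of such an override, where containment tests values instead of keys:

>>> class ValueDict(dict):
...     # 'in' tests against values rather than keys
...     def __contains__(self, x):
...         return x in self.values()
...
>>> vd = ValueDict({'this':'that'})
>>> 'that' in vd
1
>>> 'this' in vd
0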