1.2 Standard Modules

There are a variety of tasks that many or most text processing applications will perform, but that are not themselves text processing tasks. For example, texts typically live inside files, so for a concrete application you might want to check whether files exist, whether you have access to them, and whether they have certain attributes; you might also want to read their contents. The text processing per se does not happen until the text makes it into a Python value, but getting the text into local memory is a necessary step.

Another task is making Python objects persistent so that final or intermediate processing results can be saved in computer-usable forms. Or again, Python applications often benefit from being able to call external processes and possibly work with the results of those calls.

Yet another class of modules helps you deal with Python internals in ways that go beyond what the inherent syntax does. I have made a judgment call in this book as to which such "Python internal" modules are sufficiently general and frequently used in text processing applications; a number of "internal" modules are given only one-line descriptions under the "Other Modules" topic.

1.2.1 Working with the Python Interpreter

Some of the modules in the standard library contain functionality that is nearly as important to Python as the basic syntax. Such modularity is an important strength of Python's design, but users of other languages may be surprised to find capabilities for reading command-line arguments, catching exceptions, copying objects, or the like in external modules.

copy Generic copying operations

Names in Python programs are merely bindings to underlying objects; many of these objects are mutable. This point is simple, but it winds up biting almost every beginning Python programmer, and even a few experienced Pythoners get caught, too. The problem is that binding another name (including a sequence position, dictionary entry, or attribute) to an object leaves you with two names bound to the same object. If you change the underlying object using one name, the other name also points to a changed object. Sometimes you want that, sometimes you do not.

One variant of the binding trap is a particularly frequent pitfall. Say you want a 2D table of values, initialized as zeros. Later on, you would like to be able to refer to a row/column position as, for example, table[2][3] (as in many programming languages). Here is what you would probably try first, along with its failure:

>>> row = [0]*4
>>> print row
[0, 0, 0, 0]
>>> table = [row]*4   # or 'table = [[0]*4]*4'
>>> for row in table: print row
[0, 0, 0, 0]
[0, 0, 0, 0]
[0, 0, 0, 0]
[0, 0, 0, 0]
>>> table[2][3] = 7
>>> for row in table: print row
[0, 0, 0, 7]
[0, 0, 0, 7]
[0, 0, 0, 7]
[0, 0, 0, 7]
>>> id(table[2]), id(table[3])
(6207968, 6207968)

The problem with the example is that table is a list of four positional bindings to the exact same list object. You cannot change just one row, since all four point to just one object. What you need instead is a copy of row to put in each row of table.

Python provides a number of ways to create copies of objects (and bind them to names). Such a copy is a "snapshot" of the state of the object that can be modified independently of changes to the original. A few ways to correct the table problem are:

>>> table1 = map(list, [(0,)*4]*4)
>>> id(table1[2]), id(table1[3])
(6361712, 6361808)
>>> table2 = [lst[:] for lst in [[0]*4]*4]
>>> id(table2[2]), id(table2[3])
(6356720, 6356800)
>>> from copy import copy
>>> row = [0]*4
>>> table3 = map(copy, [row]*4)
>>> id(table3[2]), id(table3[3])
(6498640, 6498720)

In general, slices always create new lists. In Python 2.2+, the constructors list() and dict() likewise construct new/copied lists/dicts (possibly using other sequence or association types as arguments).
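For instance, each of these idioms yields an object equal to, but distinct from, the original; this quick self-check runs on recent Python 2 as well as 3:

```python
lst = [1, 2, 3]
dct = {'spam': 1}

lst_slice = lst[:]       # a slice always builds a new list
lst_ctor = list(lst)     # list() constructor copy (Python 2.2+)
dct_ctor = dict(dct)     # dict() constructor copy (Python 2.2+)

# Equal to the originals, but distinct objects
distinct = (lst_slice is not lst and
            lst_ctor is not lst and
            dct_ctor is not dct)
```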

But the most general way to make a new copy of whatever object you might need is with the copy module. If you use the copy module, you do not need to worry about whether a given sequence is an actual list or merely list-like, which the list() coercion would force into a list.


copy.copy(obj)

Return a shallow copy of a Python object. Most (but not quite all) types of Python objects can be copied. A shallow copy binds its elements/members to the same objects as bound in the original, but the container object itself is distinct.

>>> import copy
>>> class C: pass
>>> o1 = C()
>>> o1.lst = [1,2,3]
>>> o1.str = "spam"
>>> o2 = copy.copy(o1)
>>> o1.lst.append(17)
>>> o2.lst
[1, 2, 3, 17]
>>> o1.str = 'eggs'
>>> o2.str
'spam'

copy.deepcopy(obj)

Return a deep copy of a Python object. Each element or member in an object is itself recursively copied. For nested containers, it is usually more desirable to perform a deep copy; otherwise you can run into problems like the 2D table example above.

>>> o1 = C()
>>> o1.lst = [1,2,3]
>>> o3 = copy.deepcopy(o1)
>>> o1.lst.append(17)
>>> o3.lst
[1, 2, 3]
>>> o1.lst
[1, 2, 3, 17]

exceptions Standard exception class hierarchy

Various actions in Python raise exceptions, and these exceptions can be caught using an except clause. Although strings can serve as exceptions for backwards-compatibility reasons, it is greatly preferable to use class-based exceptions.

When you catch an exception using an except clause, you also catch any descendant exceptions. By utilizing a hierarchy of standard and user-defined exception classes, you can tailor exception handling to meet your specific code requirements.

>>> class MyException(StandardError): pass
>>> try:
...     raise MyException
... except StandardError:
...     print "Caught parent"
... except MyException:
...     print "Caught specific class"
... except:
...     print "Caught generic leftover"
Caught parent

In general, if you need to raise exceptions manually, you should either use a built-in exception close to your situation, or inherit from that built-in exception. The outline in Figure 1.1 shows the exception classes defined in exceptions.
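As a sketch of that advice, a hypothetical utility might derive its own exception from a built-in one; the names BadRecordError and parse_record() are invented for illustration:

```python
class BadRecordError(ValueError):
    """Raised for input records this utility cannot parse."""

def parse_record(line):
    fields = line.split(':')
    if len(fields) != 2:
        raise BadRecordError("expected 'key:value', got %r" % line)
    return fields

# Callers may catch the specific class or its ValueError ancestor
try:
    parse_record('no-colon-here')
    caught = False
except ValueError:
    caught = True
```

Because BadRecordError descends from ValueError, existing handlers for the built-in class continue to work unchanged.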

Figure 1.1. Standard exceptions


getopt Parser for command line options

Utility applications, whether for text processing or otherwise, frequently accept a variety of command-line switches to configure their behavior. In principle, and frequently in practice, all that you need to do to process command-line options is read through the list sys.argv[1:] and handle each element of the option line. I have certainly written my own small "sys.argv parser" more than once; it is not hard if you do not expect too much.

The getopt module provides some automation and error handling for option parsing. It takes just a few lines of code to tell getopt what options it should handle, and which switch prefixes and parameter styles to use. However, getopt is not necessarily the final word in parsing command lines. Python 2.3 includes Greg Ward's optik module <http://optik.sourceforge.net/> renamed as optparse, and the Twisted Matrix library contains twisted.python.usage <http://www.twistedmatrix.com/documents/howto/options>. These modules, and other third-party tools, were written because of perceived limitations in getopt.

For most purposes, getopt is a perfectly good tool. Moreover, even if some enhanced module is included in later Python versions, either this enhancement will be backwards compatible or getopt will remain in the distribution to support existing scripts.

SEE ALSO: sys.argv 49;

getopt.getopt(args, options [,long_options])

The argument args is the actual list of options being parsed, most commonly sys.argv[1:]. The argument options and the optional argument long_options contain formats for acceptable options. If any options specified in args do not match any acceptable format, a getopt.GetoptError exception is raised. All options must begin with either a single dash for single-letter options or a double dash for long options (DOS-style leading slashes are not usable, unfortunately).

The return value of getopt.getopt() is a pair containing an option list and a list of additional arguments. The latter is typically a list of filenames the utility will operate on. The option list is a list of pairs of the form (option, value). Under recent versions of Python, you can convert an option list to a dictionary with dict(optlist), which is likely to be useful.

The options format string is a sequence of letters, each optionally followed by a colon. Any option letter followed by a colon takes a (mandatory) value after the option.

The format for long_options is a list of strings indicating the option names (excluding the leading dashes). If an option name ends with an equal sign, it requires a value after the option.

It is easiest to see getopt in action:

>>> import getopt
>>> opts='-al -b -c 2 --foo=bar --baz file1 file2'.split()
>>> optlist, args = getopt.getopt(opts,'a:bc:',['foo=','baz'])
>>> optlist
[('-a', 'l'), ('-b', ''), ('-c', '2'), ('--foo', 'bar'),
('--baz', '')]
>>> args
['file1', 'file2']
>>> nodash = lambda s: \
...          s.translate(''.join(map(chr,range(256))),'-')
>>> todict = lambda l: \
...          dict([(nodash(opt),val) for opt,val in l])
>>> optdict = todict(optlist)
>>> optdict
{'a': 'l', 'c': '2', 'b': '', 'baz': '', 'foo': 'bar'}

You can examine options given either by looping through optlist or by performing optdict.get(key, default) type tests as needed in your program flow.
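For instance, given an option dictionary like the one built above, program flow can branch on .get() defaults; the 'o' (output file) switch below is hypothetical and was not given on the command line:

```python
# Mirrors the optdict built in the session above
optdict = {'a': 'l', 'c': '2', 'b': '', 'baz': '', 'foo': 'bar'}

verbose = 'b' in optdict               # flag-style switch: present or not
count = int(optdict.get('c', '1'))     # value-carrying switch
output = optdict.get('o', 'out.txt')   # absent switch: default applies
```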

operator Standard operations as functions

All of the standard Python syntactic operators are available in functional form using the operator module. In most cases, it is more clear to use the actual operators, but in a few cases functions are useful. The most common usage for operator is in conjunction with functional programming constructs. For example:

>>> import operator
>>> lst = [1, 0, (), '', 'abc']
>>> map(operator.not_, lst)   # fp-style negated bool vals
[0, 1, 1, 1, 0]
>>> tmplst = []               # imperative style
>>> for item in lst:
...     tmplst.append(not item)
...
>>> tmplst
[0, 1, 1, 1, 0]
>>> del tmplst                # must clean up stray name

As well as being shorter, I find the FP style clearer. The source code below provides sample implementations of the functions in the operator module. The actual implementations are faster and written directly in C, but the samples illustrate what each function does.

### Comparison functions
lt = __lt__ = lambda a,b: a < b
le = __le__ = lambda a,b: a <= b
eq = __eq__ = lambda a,b: a == b
ne = __ne__ = lambda a,b: a != b
ge = __ge__ = lambda a,b: a >= b
gt = __gt__ = lambda a,b: a > b
### Boolean functions
not_ = __not__ = lambda o: not o
truth = lambda o: not not o
# Arithmetic functions
abs = __abs__ = abs   # same as built-in function
add = __add__ = lambda a,b: a + b
and_ = __and__ = lambda a,b: a & b  # bitwise, not boolean
div = __div__ = \
      lambda a,b: a/b  # depends on __future__.division
floordiv = __floordiv__ = lambda a,b: a//b # Only for 2.2+
inv = invert = __inv__ = __invert__ = lambda o: ~o
lshift = __lshift__ = lambda a,b: a << b
rshift = __rshift__ = lambda a,b: a >> b
mod = __mod__ = lambda a,b: a % b
mul = __mul__ = lambda a,b: a * b
neg = __neg__ = lambda o: -o
or_ = __or__ = lambda a,b: a | b    # bitwise, not boolean
pos = __pos__ = lambda o: +o # identity for numbers
sub = __sub__ = lambda a,b: a - b
truediv = __truediv__ = lambda a,b: 1.0*a/b # New in 2.2+
xor = __xor__ = lambda a,b: a ^ b
### Sequence functions (note overloaded syntactic operators)
concat = __concat__ = add
contains = __contains__ = lambda a,b: b in a
countOf = lambda seq,a: len([x for x in seq if x==a])
def delitem(seq,a): del seq[a]
__delitem__ = delitem
def delslice(seq,b,e): del seq[b:e]
__delslice__ = delslice
getitem = __getitem__ = lambda seq,i: seq[i]
getslice = __getslice__ = lambda seq,b,e: seq[b:e]
indexOf = lambda seq,o: seq.index(o)
repeat = __repeat__ = mul
def setitem(seq,i,v): seq[i] = v
__setitem__ = setitem
def setslice(seq,b,e,v): seq[b:e] = v
__setslice__ = setslice
### Functionality functions (not implemented here)
# The precise interfaces required to pass the below tests
#     are ill-defined, and might vary at limit-cases between
#     Python versions and custom data types.
import operator
isCallable = callable     # just use built-in 'callable()'
isMappingType = operator.isMappingType
isNumberType = operator.isNumberType
isSequenceType = operator.isSequenceType
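In the same functional spirit, these function forms combine naturally with reduce(); here is a small sketch (the import line is only needed on Pythons where reduce() is no longer a built-in):

```python
import operator
try:
    reduce                       # a built-in in Python 2
except NameError:
    from functools import reduce

nums = [1, 2, 3, 4]
total = reduce(operator.add, nums)     # ((1+2)+3)+4
product = reduce(operator.mul, nums)   # ((1*2)*3)*4
```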

sys Information about current Python interpreter

As with the Python "userland" objects you create within your applications, the Python interpreter itself is very open to introspection. Using the sys module, you can examine and modify many aspects of the Python runtime environment. However, as with much of the functionality in the os module, some of what sys provides is too esoteric to address in this book about text processing. Consult the Python Library Reference for information on those attributes and functions not covered here.

The module attributes sys.exc_type, sys.exc_value, and sys.exc_traceback have been deprecated in favor of the function sys.exc_info(). All of these, and also sys.last_type, sys.last_value, sys.last_traceback, and sys.tracebacklimit, let you poke into exceptions and stack frames to a finer degree than the basic try and except statements do. sys.exec_prefix and sys.executable provide information on installed paths for Python.

The functions sys.displayhook() and sys.excepthook() control where program output goes, and sys.__displayhook__ and sys.__excepthook__ retain their original values (e.g., STDOUT and STDERR). sys.exitfunc affects interpreter cleanup. The attributes sys.ps1 and sys.ps2 control prompts in the Python interactive shell.

Other attributes and methods simply provide more detail than you almost ever need for text processing applications. The attributes sys.dllhandle and sys.winver are Windows-specific; sys.setdlopenflags() and sys.getdlopenflags() are Unix-only. Attributes and methods like sys.builtin_module_names, sys._getframe(), sys.prefix, sys.getrecursionlimit(), sys.setprofile(), sys.settrace(), sys.setcheckinterval(), sys.setrecursionlimit(), sys.modules, and sys.warnoptions concern Python internals. Unicode behavior is affected by the sys.setdefaultencoding() function, but the default can be overridden by explicit encoding arguments anyway.


sys.argv

A list of command-line arguments passed to a Python script. The first item, argv[0], is the script name itself, so you are normally interested in argv[1:] when parsing arguments.

SEE ALSO: getopt 44; sys.stdin 51; sys.stdout 51;


sys.byteorder

The native byte order (endianness) of the current platform. Possible values are 'big' and 'little'. Available in Python 2.0+.


sys.copyright

A string with copyright information for the current Python interpreter.


sys.hexversion

The version number of the current Python interpreter encoded as a single integer. This number increases with every version, even nonproduction releases. This attribute is not very human-readable; sys.version or sys.version_info is generally easier to work with.

SEE ALSO: sys.version 51; sys.version_info 52;


sys.maxint

The largest positive integer supported by Python's regular integer type, on most platforms 2**31-1. The largest negative integer is -sys.maxint-1.


sys.maxunicode

The integer value of the largest supported code point for a Unicode character under the current configuration. Unicode characters are stored as UCS-2 or UCS-4.


sys.path

A list of the pathnames searched for modules. You may modify this path to control module loading.
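For example, to make modules in a project-specific directory importable (the directory name here is hypothetical):

```python
import sys

libdir = '/home/quilty/pylib'    # hypothetical project directory
if libdir not in sys.path:
    sys.path.append(libdir)      # consulted after the existing entries
```

Appending puts the directory at the lowest search priority; insert at the front of the list instead if it should shadow other locations.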


sys.platform

A string identifying the OS platform.

SEE ALSO: os.uname() 81;


sys.stderr
sys.__stderr__

File object for the standard error stream (STDERR). sys.__stderr__ retains the original value in case sys.stderr is rebound during program execution. Error messages and warnings from the Python interpreter are written to sys.stderr. The most typical use of sys.stderr is for application messages that indicate "abnormal" conditions. For example:

% cat cap_file.py
#!/usr/bin/env python
import sys, string
if len(sys.argv) < 2:
    sys.stderr.write("No filename specified\n")
    sys.exit()
fname = sys.argv[1]
try:
    input = open(fname).read()
    print string.upper(input),
except IOError:
    sys.stderr.write("Could not read '%s'\n" % fname)
% ./cap_file.py this > CAPS
% ./cap_file.py nosuchfile > CAPS
Could not read 'nosuchfile'
% ./cap_file.py > CAPS
No filename specified

SEE ALSO: sys.argv 49; sys.stdin 51; sys.stdout 51;


sys.stdin
sys.__stdin__

File object for the standard input stream (STDIN). sys.__stdin__ retains the original value in case sys.stdin is rebound during program execution. input() and raw_input() read from sys.stdin, but the most typical use of sys.stdin is for piped and redirected streams on the command line. For example:

% cat cap_stdin.py
#!/usr/bin/env python
import sys, string
input = sys.stdin.read()
print string.upper(input)
% echo "this and that" | ./cap_stdin.py
THIS AND THAT

SEE ALSO: sys.argv 49; sys.stderr 50; sys.stdout 51;


sys.stdout
sys.__stdout__

File object for the standard output stream (STDOUT). sys.__stdout__ retains the original value in case sys.stdout is rebound during program execution. The formatted output of the print statement goes to sys.stdout, and you may also use regular file methods, such as sys.stdout.write().

SEE ALSO: sys.argv 49; sys.stderr 50; sys.stdin 51;


sys.version

A string containing version information on the current Python interpreter. The form of the string is version (#build_num, build_date, build_time) [compiler]. For example:

>>> print sys.version
1.5.2 (#0 Apr 13 1999, 10:51:12) [MSC 32 bit (Intel)]


>>> print sys.version
2.2 (#1, Apr 17 2002, 16:11:12)
[GCC 2.95.2 19991024 (release)]

This version-independent way to find the major, minor, and micro version components should work for 1.5-2.3.x (at least):

>>> from string import split
>>> from sys import version
>>> ver_tup = map(int, split(split(version)[0],'.'))+[0]
>>> major, minor, point = ver_tup[:3]
>>> if (major, minor) >= (1, 6):
...     print "New Way"
... else:
...     print "Old Way"
New Way

sys.version_info

A 5-tuple containing five components of the version number of the current Python interpreter: (major, minor, micro, releaselevel, serial). releaselevel is a descriptive phrase; the others are integers.

>>> sys.version_info
(2, 2, 0, 'final', 0)

Unfortunately, this attribute was only added in Python 2.0, so checking it is not entirely useful for requiring a minimal version: on earlier interpreters the attribute simply does not exist.
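Where the interpreter is known to be 2.0 or later, though, tuple comparison makes version tests concise, since tuples compare element by element:

```python
import sys

# A single >= comparison handles major and minor at once
if sys.version_info[:2] >= (2, 0):
    style = "New Way"
else:
    style = "Old Way"
```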

SEE ALSO: sys.version 51;

sys.exit([code=0])

Exit Python with exit code code. Cleanup actions specified by finally clauses of try statements are honored, and it is possible to intercept the exit attempt by catching the SystemExit exception. You may specify a numeric exit code for those systems that codify them; you may also specify a string exit code, which is printed to STDERR (with the actual exit code set to 1).
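A sketch of intercepting an exit attempt (the "as" form of except requires Python 2.6+):

```python
import sys

cleanup_ran = False
try:
    try:
        sys.exit(2)            # raises SystemExit with code 2
    finally:
        cleanup_ran = True     # finally clauses are still honored
except SystemExit as exc:
    code = exc.code            # the value passed to sys.exit()
```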


sys.getdefaultencoding()

Return the name of the default Unicode string encoding in Python 2.0+.


sys.getrefcount(obj)

Return the number of references to the object obj. The value returned is one higher than you might expect, because it includes the (temporary) reference passed as the argument.

>>> x = y = "hi there"
>>> import sys
>>> sys.getrefcount(x)
>>> 1st = [x, x, x]
>>> sys.getrefcount(x)

SEE ALSO: os 74;

types Standard Python object types

Every object in Python has a type; you can find it by using the built-in function type(). Often Python functions use a sort of ad hoc overloading, which is implemented by checking features of objects passed as arguments. Programmers coming from languages like C or Java are sometimes surprised by this style, since they are accustomed to seeing multiple "type signatures" for each set of argument types the function can accept. But that is not the Python way.

Experienced Python programmers try not to rely on the precise types of objects, not even in an inheritance sense. This attitude is also sometimes surprising to programmers of other languages (especially statically typed). What is usually important to a Python program is what an object can do, not what it is. In fact, it has become much more complicated to describe what many objects are with the "type/class unification" in Python 2.2 and above (the details are outside the scope of this book).

For example, you might be inclined to write an overloaded function in the following manner:

Naive overloading of argument
import types, exceptions
def overloaded_get_text(o):
    if type(o) is types.FileType:
        text = o.read()
    elif type(o) is types.StringType:
        text = o
    elif type(o) in (types.IntType, types.FloatType,
                     types.LongType, types.ComplexType):
        text = repr(o)
    else:
        raise exceptions.TypeError
    return text

The problem with this rigidly typed code is that it is far more fragile than is necessary. Something need not be an actual FileType to read its text, it just needs to be sufficiently "file-like" (e.g., a urllib.urlopen() or cStringIO.StringIO() object is file-like enough for this purpose). Similarly, a new-style object that descends from types.StringType or a UserString.UserString() object is "string-like" enough to return as such, and similarly for other numeric types.

A better implementation of the function above is:

"Quacks like a duck" overloading of argument
def overloaded_get_text(o):
    if hasattr(o,'read'):
        return o.read()
    try:
        return ""+o
    except TypeError:
        pass
    try:
        return repr(0+o)
    except TypeError:
        raise

At times, nonetheless, it is useful to have symbolic names available to name specific object types. In many such cases, an empty or minimal version of the type of object may be used in conjunction with the type() function equally well; the choice is mostly stylistic:

>>> type('') == types.StringType
1
>>> type(0.0) == types.FloatType
1
>>> type(None) == types.NoneType
1
>>> type([]) == types.ListType
1

type(o)

Return the datatype of any object o. The return value of this function is itself an object of the type types.TypeType. TypeType objects implement .__str__() and .__repr__() methods to create readable descriptions of object types.

>>> print type(1)
<type 'int'>
>>> print type(type(1))
<type 'type'>
>>> type(1) is type(0)
1

types.BuiltinFunctionType
types.BuiltinMethodType

The type for built-in functions like abs(), len(), and dir(), and for functions in "standard" C extensions like sys and os. However, extensions like string and re are actually Python wrappers for C extensions, so their functions are of type types.FunctionType. A general Python programmer need not worry about these fussy details.


types.BufferType

The type for objects created by the built-in buffer() function.

types.ClassType

The type for user-defined classes.

>>> class C:
...     def foo(self): pass
>>> from operator import eq
>>> from types import *
>>> map(eq, [type(C), type(C()), type(C().foo)],
...         [ClassType, InstanceType, MethodType])
[1, 1, 1]

SEE ALSO: types.InstanceType 56; types.MethodType 56;


types.CodeType

The type for code objects such as returned by compile().


types.ComplexType

Same as type(0+0j).


types.DictType
types.DictionaryType

Same as type({}).


types.EllipsisType

The type for the built-in Ellipsis object.


types.FileType

The type for open file objects.

>>> from sys import stdout
>>> fp = open('tst','w')
>>> [type(stdout), type(fp)] == [types.FileType]*2
1

types.FloatType

Same as type(0.0).


types.FrameType

The type for frame objects such as tb.tb_frame, in which tb has the type types.TracebackType.


types.FunctionType
types.LambdaType

Same as type(lambda:0).


types.GeneratorType

The type for generator-iterator objects in Python 2.2+.

>>> from __future__ import generators
>>> def foo(): yield 0
>>> type(foo) == types.FunctionType
1
>>> type(foo()) == types.GeneratorType
1

SEE ALSO: types.FunctionType 56;


types.InstanceType

The type for instances of user-defined classes.

SEE ALSO: types.ClassType 55; types.MethodType 56;


types.IntType

Same as type(0).


types.ListType

Same as type([]).


types.LongType

Same as type(0L).

types.MethodType
types.UnboundMethodType

The type for methods of user-defined class instances.

SEE ALSO: types.ClassType 55; types.InstanceType 56;


types.ModuleType

The type for modules.

>>> import os, re, sys
>>> [type(os), type(re), type(sys)] == [types.ModuleType]*3
1

types.NoneType

Same as type(None).


types.StringType

Same as type("").


types.TracebackType

The type for traceback objects found in sys.exc_traceback.


types.TupleType

Same as type(()).


types.UnicodeType

Same as type(u"").


types.SliceType

The type for objects returned by slice().


types.StringTypes

Same as (types.StringType, types.UnicodeType).

SEE ALSO: types.StringType 57; types.UnicodeType 57;


types.TypeType

Same as type(type(obj)) (for any obj).


types.XRangeType

Same as type(xrange(1)).

1.2.2 Working with the Local Filesystem

dircache Read and cache directory listings

The dircache module is an enhanced version of the os.listdir() function. Unlike the os function, dircache keeps prior directory listings in memory to avoid the need for a new call to the filesystem. Since dircache is smart enough to check whether a directory has been touched since last caching, dircache is a complete replacement for os.listdir() (with possible minor speed gains).
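The caching strategy is simple to sketch: remember each listing alongside the directory's modification time, and reread only when that time changes. The function below is a simplified model of what dircache does, not its actual implementation:

```python
import os
import tempfile

_cache = {}   # path -> (mtime, listing)

def cached_listdir(path):
    # Reread the directory only if its modification time has changed
    mtime = os.stat(path).st_mtime
    cached = _cache.get(path)
    if cached is not None and cached[0] == mtime:
        return cached[1]           # directory untouched: reuse listing
    listing = sorted(os.listdir(path))
    _cache[path] = (mtime, listing)
    return listing

# Demonstration on a scratch directory
scratch = tempfile.mkdtemp()
open(os.path.join(scratch, 'a.txt'), 'w').close()
first = cached_listdir(scratch)
second = cached_listdir(scratch)   # served from the cache
```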


dircache.listdir(path)

Return a directory listing of the path path. Uses a list cached in memory where possible.


dircache.opendir(path)

Identical to dircache.listdir(). A legacy function to support old scripts.

dircache.annotate(path, lst)

Modify the list lst in place to indicate which items are directories, and which are plain files. The string path should indicate the path to reach the listed files.

>>> l = dircache.listdir('/tmp')
>>> l
['501', 'md10834.db']
>>> dircache.annotate('/tmp', l)
>>> l
['501/', 'md10834.db']

filecmp Compare files and directories

The filecmp module lets you check whether two files are identical, and whether two directories contain some identical files. You have several options in determining how thorough a comparison is performed.

filecmp.cmp(fname1, fname2 [,shallow=1 [,use_statcache=0]])

Compare the file named by the string fname1 with the file named by the string fname2. If the default true value of shallow is used, the comparison is based only on the mode, size, and modification time of the two files. If shallow is a false value, the files are compared byte by byte. Unless you are concerned that someone will deliberately falsify timestamps on files (as in a cryptography context), a shallow comparison is quite reliable. However, tar and untar can also change timestamps.

>>> import filecmp
>>> filecmp.cmp('dir1/file1', 'dir2/file1')
>>> filecmp.cmp('dir1/file2', 'dir2/file2', shallow=0)

The use_statcache argument is not relevant for Python 2.2+. In older Python versions, the statcache module provided (slightly) more efficient cached access to file stats, but its use is no longer needed.
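A quick demonstration of shallow=0 with two scratch files (the filenames live in a temporary directory):

```python
import filecmp
import os
import tempfile

d = tempfile.mkdtemp()
f1 = os.path.join(d, 'file1')
f2 = os.path.join(d, 'file2')
for fname in (f1, f2):
    fp = open(fname, 'w')
    fp.write('identical contents\n')
    fp.close()

same = filecmp.cmp(f1, f2, shallow=0)       # byte-by-byte: identical

fp = open(f2, 'a')
fp.write('extra line\n')
fp.close()
still_same = filecmp.cmp(f1, f2, shallow=0) # sizes now differ
```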

filecmp.cmpfiles(dirname1, dirname2, fnamelist [,shallow=1 [,use_statcache=0]])

Compare those filenames listed in fnamelist if they occur in both the directory dirname1 and the directory dirname2. filecmp.cmpfiles() returns a tuple of three lists (some of the lists may be empty): (matches, mismatches, errors). matches are identical files in both directories, mismatches are nonidentical files in both directories. errors will contain names if a file exists in neither, or in only one, of the two directories, or if either file cannot be read for any reason (permissions, disk problems, etc.).

>>> import filecmp, os
>>> filecmp.cmpfiles('dir1','dir2',['this','that','other'])
(['this'], ['that'], ['other'])
>>> print os.popen('ls -l dir1').read()
-rwxr-xr-x    1 quilty   staff     169 Sep 27 00:13 this
-rwxr-xr-x    1 quilty   staff     687 Sep 27 00:13 that
-rwxr-xr-x    1 quilty   staff     737 Sep 27 00:16 other
-rwxr-xr-x    1 quilty   staff     518 Sep 12 11:57 spam
>>> print os.popen('ls -l dir2').read()
-rwxr-xr-x    1 quilty   staff     169 Sep 27 00:13 this
-rwxr-xr-x    1 quilty   staff     692 Sep 27 00:32 that

The shallow and use_statcache arguments are the same as those to filecmp.cmp().

filecmp.dircmp(dirname1, dirname2 [,ignore=... [,hide=...]])

Create a directory comparison object. dirname1 and dirname2 are two directories to compare. The optional argument ignore is a sequence of pathnames to ignore and defaults to ["RCS","CVS","tags"]; hide is a sequence of pathnames to hide and defaults to [os.curdir,os.pardir] (i.e., [".",".."]).


The attributes of filecmp.dircmp are read-only. Do not attempt to modify them.


filecmp.dircmp.report()

Print a comparison report on the two directories.

>>> mycmp = filecmp.dircmp('dir1','dir2')
>>> mycmp.report()
diff dir1 dir2
Only in dir1 : ['other', 'spam']
Identical files : ['this']
Differing files : ['that']

filecmp.dircmp.report_partial_closure()

Print a comparison report on the two directories, including immediate subdirectories. The method name has nothing to do with the theoretical term "closure" from functional programming.


filecmp.dircmp.report_full_closure()

Print a comparison report on the two directories, recursively including all nested subdirectories.


filecmp.dircmp.left_list

Pathnames in the dirname1 directory, filtering out the hide and ignore lists.


filecmp.dircmp.right_list

Pathnames in the dirname2 directory, filtering out the hide and ignore lists.


filecmp.dircmp.common

Pathnames in both directories.


filecmp.dircmp.left_only

Pathnames in dirname1 but not dirname2.


filecmp.dircmp.right_only

Pathnames in dirname2 but not dirname1.


filecmp.dircmp.common_dirs

Subdirectories in both directories.


filecmp.dircmp.common_files

Filenames in both directories.


filecmp.dircmp.common_funny

Pathnames in both directories, but of different types.


filecmp.dircmp.same_files

Filenames of identical files in both directories.


filecmp.dircmp.diff_files

Filenames of nonidentical files whose name occurs in both directories.


filecmp.dircmp.funny_files

Filenames in both directories where something goes wrong during comparison.


filecmp.dircmp.subdirs

A dictionary mapping filecmp.dircmp.common_dirs strings to corresponding filecmp.dircmp objects; for example:

>>> usercmp = filecmp.dircmp('/Users/quilty','/Users/dqm')
>>> usercmp.subdirs['Public'].common
['Drop Box']

SEE ALSO: os.stat() 79; os.listdir() 76;

fileinput Read multiple files or STDIN

Many utilities, especially on Unix-like systems, operate line-by-line on one or more files and/or on redirected input. Flexibility in treating input sources in a homogeneous fashion is part of the "Unix philosophy." The fileinput module allows you to write a Python application that uses these common conventions with almost no special programming to adjust to input sources.

A common, minimal, but extremely useful Unix utility is cat, which simply writes its input to STDOUT (allowing redirection of STDOUT as needed). Below are a few simple examples of cat:

% cat a
% cat a b
% cat - b < a
% cat < b
% cat a < b
% echo "XXX" | cat a -

Notice that STDIN is read only if either "-" is given as an argument, or no arguments are given at all. We can implement a Python version of cat using the fileinput module as follows:

#!/usr/bin/env python
import fileinput
for line in fileinput.input():
        print line,

fileinput.input([files=sys.argv[1:] [,inplace=0 [,backup=".bak"]]])

Most commonly, this function will be used without any of its optional arguments, as in the introductory example of cat.py. However, behavior may be customized for special cases.

The argument files is a sequence of filenames to process. By default, it consists of all the arguments given on the command line. Commonly, however, you might want to treat some of these arguments as flags rather than filenames (e.g., if they start with - or /). Any list of filenames you like may be used as the files argument, whether or not it is built from sys.argv.

If you specify a true value for inplace, output will go into each file specified rather than to STDOUT. Input taken from STDIN, however, will still go to STDOUT. For in-place operation, a temporary backup file is created as the actual input source and is given the extension indicated by the backup argument. For example:

% cat a b
% cat modify.py
#!/usr/bin/env python
import fileinput, sys
for line in fileinput.input(sys.argv[1:], inplace=1):
        print "MODIFIED", line,
% echo "XXX" | ./modify.py a b -
% cat a b

fileinput.close()

Close the input sequence.


fileinput.nextfile()

Close the current file, and proceed to the next one. Any unread lines in the current file will not be counted towards the line total.

There are several functions in the fileinput module that provide information about the current input state. These tests can be used to process the current line in a context-dependent way.


fileinput.filelineno()

The number of lines read from the current file.


fileinput.filename()

The name of the file from which the last line was read. Before a line is read, the function returns None.


fileinput.isfirstline()

Same as fileinput.filelineno()==1.


fileinput.isstdin()

True if the last line read was from STDIN.


fileinput.lineno()

The number of lines read during the input loop, cumulative between files.
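Put together, these state functions let a loop label each line with its source; here is a sketch using two scratch files in a temporary directory:

```python
import fileinput
import os
import tempfile

# Two scratch files standing in for command-line arguments
d = tempfile.mkdtemp()
names = []
for name, text in [('a', 'one\ntwo\n'), ('b', 'three\n')]:
    path = os.path.join(d, name)
    fp = open(path, 'w')
    fp.write(text)
    fp.close()
    names.append(path)

labeled = []
for line in fileinput.input(names):
    # filename()/filelineno() describe the file the line came from;
    # lineno() counts cumulatively across all files
    labeled.append((os.path.basename(fileinput.filename()),
                    fileinput.filelineno(),
                    fileinput.lineno(),
                    line.strip()))
fileinput.close()
```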

fileinput.FileInput([files [,inplace=0 [,backup=".bak"]]])

The methods of fileinput.FileInput are the same as the module-level functions, plus an additional .readline() method that matches that of file objects. fileinput.FileInput objects also have a .__getitem__() method to support sequential access.

The arguments to initialize a fileinput.FileInput object are the same as those passed to the fileinput.input() function. The class exists primarily in order to allow subclassing. For normal usage, it is best to just use the fileinput functions.

SEE ALSO: multifile 285; xreadlines 72;

glob Filename globbing utility

The glob module provides a list of pathnames matching a glob-style pattern. The fnmatch module is used internally to determine whether a path matches.


glob.glob(pat)

Return a list of pathnames matching the pattern pat. Both directories and plain files are returned, so if you are only interested in one type of path, use os.path.isdir() or os.path.isfile(); other functions in os.path also support other filters.

Pathnames returned by glob.glob() contain as much absolute or relative path information as the pattern pat gives. For example:

>>> import glob, os.path
>>> glob.glob('/Users/quilty/Book/chap[3-4].txt')
['/Users/quilty/Book/chap3.txt', '/Users/quilty/Book/chap4.txt']
>>> glob.glob('chap[3-6].txt')
['chap3.txt', 'chap4.txt', 'chap5.txt', 'chap6.txt']
>>> filter(os.path.isdir, glob.glob('/Users/quilty/Book/[A-Z]*'))
['/Users/quilty/Book/SCRIPTS', '/Users/quilty/Book/XML']

SEE ALSO: fnmatch 232; os.path 65;

linecache Cache lines from files

The module linecache can be used to simulate relatively efficient random access to the lines in a file. Lines that are read are cached for later access.

linecache.getline(fname, linenum)

Read line linenum from the file named fname. If an error occurs reading the line, the function will catch the error and return an empty string. sys.path is also searched for the filename if it is not found in the current directory.

>>> import linecache
>>> linecache.getline('/etc/hosts', 15)
'   hermes  hermes.gnosis.lan\n'

linecache.clearcache()

Clear the cache of read lines.


linecache.checkcache()

Check whether files in the cache have been modified since they were cached, and discard the cache entries for any that have.
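The effect of the cache can be seen by modifying a file between reads: only after linecache.checkcache() does getline() reflect the change. A sketch with a hypothetical throwaway file:

```python
import linecache
import os
import tempfile

# A hypothetical file to read, then modify behind linecache's back.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")
with open(path, "w") as fp:
    fp.write("first draft\n")

before = linecache.getline(path, 1)    # reads the file and caches its lines

with open(path, "w") as fp:
    fp.write("second draft\n")         # change the file on disk

stale = linecache.getline(path, 1)     # cache still holds the old text
linecache.checkcache(path)             # drop entries whose file has changed
fresh = linecache.getline(path, 1)

print(before.strip(), stale.strip(), fresh.strip())
```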

os.path Common pathname manipulations

The os.path module provides a variety of functions to analyze and manipulate filesystem paths in a cross-platform fashion.


os.path.abspath(pathname)

Return an absolute path for a (relative) pathname.

>>> os.path.abspath('SCRIPTS/mk_book')

os.path.basename(pathname)

Same as os.path.split(pathname)[1].

os.path.commonprefix(pathlist)

Return the longest leading prefix shared by all elements of the sequence pathlist. Note that the comparison is character-by-character, so the result is not guaranteed to end on a path-component boundary.

>>> os.path.commonprefix(['/usr/X11R6/bin/twm',
...                       '/usr/sbin/bash',
...                       '/usr/local/bin/dada'])

os.path.dirname(pathname)

Same as os.path.split(pathname)[0].


os.path.exists(pathname)

Return true if pathname exists on the filesystem.


os.path.expanduser(pathname)

Expand pathnames that include the tilde character: ~. Under standard Unix shells, an initial tilde refers to the current user's home directory, and a tilde followed by a name refers to the named user's home directory. This function emulates that behavior on other platforms.

>>> os.path.expanduser('~dqm')
>>> os.path.expanduser('~/Book')

os.path.expandvars(pathname)

Expand pathname by replacing environment variables in a Unix shell style. Although this function lives in the os.path module, nothing restricts it to paths; you could equally use it for bash-like string interpolation in Python generally (not necessarily a good idea, but possible).

>>> os.path.expandvars('$HOME/Book')
>>> from os.path import expandvars as ev  # Python 2.0+
>>> if ev('$HOSTTYPE')=='macintosh' and ev('$OSTYPE')=='darwin':
...     print ev("The vendor is $VENDOR, the CPU is $MACHTYPE")
The vendor is apple, the CPU is powerpc

os.path.getatime(pathname)

Return the last access time of pathname (or raise os.error if checking is not possible).


os.path.getmtime(pathname)

Return the modification time of pathname (or raise os.error if checking is not possible).


os.path.getsize(pathname)

Return the size of pathname in bytes (or raise os.error if checking is not possible).


os.path.isabs(pathname)

Return true if pathname is an absolute path.


os.path.isdir(pathname)

Return true if pathname is a directory.


os.path.isfile(pathname)

Return true if pathname is a regular file. Symbolic links are followed, so a link to a regular file also returns true.


os.path.islink(pathname)

Return true if pathname is a symbolic link.


os.path.ismount(pathname)

Return true if pathname is a mount point (on POSIX systems).

os.path.join(path1 [,path2 [...]])

Join multiple path components intelligently.

>>> os.path.join('/Users/quilty/','Book','SCRIPTS/','mk_book')

os.path.normcase(pathname)

Convert pathname to canonical lowercase on case-insensitive filesystems; on Windows systems, also convert forward slashes to backslashes.


os.path.normpath(pathname)

Remove redundant path information.

>>> os.path.normpath('/usr/local/bin/../include/./slang.h')

os.path.realpath(pathname)

Return the "real" path to pathname after de-aliasing any symbolic links. New in Python 2.2+.

>>> os.path.realpath('/usr/bin/newaliases')

os.path.samefile(pathname1, pathname2)

Return true if pathname1 and pathname2 are the same file.

SEE ALSO: filecmp 58;

os.path.sameopenfile(fp1, fp2)

Return true if the file handles fp1 and fp2 refer to the same file. Not available on Windows.


os.path.split(pathname)

Return a tuple containing the path leading up to pathname and the final directory or filename component in isolation.

>>> os.path.split('/Users/quilty/Book/SCRIPTS')
('/Users/quilty/Book', 'SCRIPTS')

os.path.splitdrive(pathname)

Return a tuple containing the drive letter and the rest of the path. On systems that do not use a drive letter, the drive component is an empty string (as it also is on Windows-like systems when no drive is specified).
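For instance, on a POSIX system the drive component is always empty; the Windows result noted in the comment is illustrative:

```python
import os.path

# On POSIX there is no drive letter, so the first element is ''.
drive, rest = os.path.splitdrive('/Users/quilty/Book')
print((drive, rest))
# On Windows, os.path.splitdrive(r'C:\Book') would give ('C:', '\\Book').
```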

os.path.walk(pathname, visitfunc, arg)

For every directory recursively contained in pathname, call visitfunc(arg, dirname, pathnames), where pathnames is a list of the names in the directory dirname.

>>> def big_files(minsize, dirname, files):
...     for file in files:
...         fullname = os.path.join(dirname,file)
...         if os.path.isfile(fullname):
...             if os.path.getsize(fullname) >= minsize:
...                 print fullname
>>> os.path.walk('/usr/', big_files, 5e6)

shutil Copy files and directory trees

The functions in the shutil module make working with files a bit easier. There is nothing in this module that you could not do using basic file objects and os.path functions, but shutil often provides a more direct means and handles minor details for you. The functions in shutil match fairly closely the capabilities you would find in Unix filesystem utilities like cp and rm.

shutil.copy(src, dst)

Copy the file named src to the pathname dst. If dst is a directory, the created file is given the name os.path.join(dst, os.path.basename(src)).

SEE ALSO: os.path.join() 66; os.path.basename() 65;
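A quick sketch of the directory-destination case, using hypothetical temporary paths:

```python
import os
import shutil
import tempfile

srcdir = tempfile.mkdtemp()
dstdir = tempfile.mkdtemp()

src = os.path.join(srcdir, "report.txt")
with open(src, "w") as fp:
    fp.write("quarterly numbers\n")

# dst is a directory, so the copy keeps the source's basename.
shutil.copy(src, dstdir)
copied = os.path.join(dstdir, "report.txt")
print(open(copied).read())
```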

shutil.copy2(src, dst)

Same as shutil.copy() except that the access and modification times of dst are set to the values in src.

shutil.copyfile(src, dst)

Copy the file named src to the filename dst (overwriting dst if present). Basically, this has the same effect as open(dst,"wb").write(open(src,"rb").read()).

shutil.copyfileobj(fpsrc, fpdst [,buffer=-1])

Copy the file-like object fpsrc to the file-like object fpdst. If the optional argument buffer is given, only the specified number of bytes are read into memory at a time; this allows copying very large files.
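Because copyfileobj() works on any file-like objects, the chunked copying can be sketched with in-memory streams (the buffer size here is arbitrary):

```python
import io
import shutil

src = io.BytesIO(b"x" * 1000)   # any file-like source
dst = io.BytesIO()

# Copy in 64-byte chunks: at most 64 bytes are read into memory at a time.
shutil.copyfileobj(src, dst, 64)

print(len(dst.getvalue()))
```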

shutil.copymode(src, dst)

Copy the permission bits from the file named src to the filename dst.

shutil.copystat(src, dst)

Copy the permission and timestamp data from the file named src to the filename dst.

shutil.copytree(src, dst [,symlinks=0])

Copy the directory src to the destination dst recursively. If the optional argument symlinks is a true value, copy symbolic links as links rather than the default behavior of copying the content of the link target. Th