If your application performs other types of tasks besides text processing, a skim of this module list can suggest where to look for relevant functionality. As well, readers who find themselves maintaining code written by other developers may find that unfamiliar modules are imported by the existing code. If an imported module is not summarized in the list below, nor documented elsewhere, it is probably an in-house or third-party module. For standard library modules, the summaries here will at least give you a sense of the general purpose of a given module.
Access to built-in functions, exceptions, and other objects. Python does a great job of exposing its own internals, but "normal" developers do not need to worry about this.
In object-oriented programming (OOP) languages like Python, compound data and structured data are frequently represented at runtime as native objects. At times these objects belong to basic datatypes (lists, tuples, and dictionaries), but more often, once you reach a certain degree of complexity, hierarchies of instances containing attributes become more likely.
For simple objects, especially sequences, serialization and storage is rather straightforward. For example, lists can easily be represented in delimited or fixed-length strings. Lists-of-lists can be saved in line-oriented files, each line containing delimited fields, or in rows of RDBMS tables. But once the dimension of nested sequences goes past two, and even more so for heterogeneous data structures, traditional table-oriented storage is a less-obvious fit.
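To make the simple case concrete, here is a minimal sketch of storing a list-of-lists as line-oriented, delimited text. The helper names dump_rows and load_rows are hypothetical, and the sketch assumes every field is a string that contains neither the delimiter nor a newline.

```python
# Sketch: store a list-of-lists as line-oriented, tab-delimited text.
# Assumes all fields are strings with no embedded tab or newline.
rows = [["alice", "34", "NYC"],
        ["bob", "27", "Paris"]]

def dump_rows(rows, delim="\t"):
    """Serialize rows to one delimited line per inner list."""
    return "\n".join(delim.join(fields) for fields in rows)

def load_rows(text, delim="\t"):
    """Rebuild the list-of-lists from delimited lines."""
    return [line.split(delim) for line in text.split("\n")]

text = dump_rows(rows)
restored = load_rows(text)
```

A third level of nesting would already need a second delimiter or an escaping scheme, which is where this approach starts to strain.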
While it is possible to create "object/relational adaptors" that write OOP instances to flat tables, that usually requires custom programming. A number of more general solutions exist, both in the Python standard library and in third-party tools. There are actually two separate issues involved in storing Python objects. The first issue is how to convert them into strings in the first place; the second issue is how to create a general persistence mechanism for such serialized objects. At a minimal level, of course, it is simple enough to store (and retrieve) a serialization string the same way you would any other string: to a file, a database, and so on. The various *dbm modules create a "dictionary on disk," while the shelve module automatically utilizes cPickle serialization to write arbitrary objects as values (keys are still strings).
Several third-party modules support object serialization with special features. If you need an XML dialect for your object representation, the modules gnosis.xml.pickle and xmlrpclib are useful. The YAML format is both human-readable/editable and has support libraries for Python, Perl, Ruby, and Java; using these various libraries, you can exchange objects between these several programming languages.
SEE ALSO: gnosis.xml.pickle; yaml; xmlrpclib
DBM • Interfaces to dbm-style databases
A dbm-style database is a "dictionary on disk." Using a database of this sort allows you to store a set of key/val pairs to a file, or files, on the local filesystem, and to access and set them as if they were an in-memory dictionary. A dbm-style database, unlike a standard dictionary, always maps strings to strings. If you need to store other types of objects, you will need to convert them to strings (or use the shelve module as a wrapper).
Depending on your platform, and on which external libraries are installed, different dbm modules might be available. The performance characteristics of the various modules vary significantly. As well, some DBM modules support some special functionality. Most of the time, however, your best approach is to access the locally supported DBM module using the wrapper module anydbm. Calls to this module will select the best available DBM for the current environment without a programmer or user having to worry about the underlying support mechanism.
Functions and methods are documented using the nonspecific capitalized form DBM. In real usage, you would use the name of a specific module. Most of the time, you will get or set DBM values using standard named indexing; for example, db["key"]. A few methods characteristic of dictionaries are also supported, as well as a few methods special to DBM databases.
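As a rough illustration of the dictionary-style access just described, here is a minimal sketch. It assumes Python 3, where the anydbm wrapper was renamed to the dbm package and dbm.open() picks an available backend; the file name example_db is hypothetical. Note that under Python 3 keys and values come back as bytes rather than str.

```python
import dbm
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example_db")   # hypothetical file name

    db = dbm.open(path, "c")      # "c": create if missing, read/write access
    db["key"] = "value"           # str is encoded to bytes on the way in
    db[b"other"] = b"another value"
    db.close()                    # flush pending writes

    db = dbm.open(path, "r")      # reopen the existing database read-only
    value = db["key"]             # values always come back as bytes
    has_other = b"other" in db
    keys = sorted(db.keys())
    db.close()
```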
SEE ALSO: shelve; dict; UserDict
Open the filename fname for dbm access. The optional argument flag specifies how the database is accessed. A value of r is for read-only access (on an existing dbm file); w opens an already existing file for read/write access; c will create a database or use an existing one, with read/write access; the option n will always create a new database, erasing the one named in fname if it already existed. The optional mode argument specifies the Unix mode of the file(s) created.
Close the database and flush any pending writes.
Return the first key/val pair in the DBM. The order is arbitrary but stable. You may use the DBM.first() method, combined with repeated calls to DBM.next(), to process every item in the dictionary.
In Python 2.2+, you can implement an items() function to emulate the behavior of the .items() method of dictionaries for DBMs:
>>> from __future__ import generators
>>> def items(db):
...     try:
...         yield db.first()
...         while 1:
...             yield db.next()
...     except KeyError:
...         raise StopIteration
...
>>> for k,v in items(d):   # typical usage
...     print k,v
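For readers on current Python versions, the same generator pattern can be sketched as follows. Since PEP 479, raising StopIteration inside a generator becomes a RuntimeError, so the sketch ends the generator with a plain return instead. The FakeDBM class is a hypothetical stand-in that mimics the first()/next() cursor methods described here; it is not a real DBM module.

```python
class FakeDBM:
    """Hypothetical stand-in exposing first()/next() like the DBM objects above."""
    def __init__(self, pairs):
        self._pairs = list(pairs)
        self._pos = 0

    def first(self):
        if not self._pairs:
            raise KeyError("empty database")
        self._pos = 0
        return self._pairs[0]

    def next(self):
        self._pos += 1
        if self._pos >= len(self._pairs):
            raise KeyError("no more items")
        return self._pairs[self._pos]

def items(db):
    """Yield every key/val pair, stopping when the cursor runs off the end."""
    try:
        yield db.first()
        while True:
            yield db.next()
    except KeyError:
        return    # ends the generator cleanly (no StopIteration raised by hand)

pairs = list(items(FakeDBM([("a", "1"), ("b", "2"), ("c", "3")])))
```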
Return a true value if the DBM has the key key.
Return a list of string keys in the DBM.
Return the last key/val pair in the DBM. The order is arbitrary but stable. You may use the DBM.last() method, combined with repeated calls to DBM.previous(), to process every item in the dictionary in reverse order.
Return the next key/val pair in the DBM. A pointer to the current position is always maintained, so the methods DBM.next() and DBM.previous() can be used to access relative items.
Return the previous key/val pair in the DBM. A pointer to the current position is always maintained, so the methods DBM.next() and DBM.previous() can be used to access relative items.
Force any pending data to be written to disk.
SEE ALSO: FILE.flush()
Generic interface to underlying DBM support. Calls to this module use the functionality of the "best available" DBM module. If you open an existing database file, its type is guessed and used, assuming the current machine supports that style.
SEE ALSO: whichdb
Interface to the Berkeley DB library.
Interface to the BSD DB library.
Interface to the Unix (n)dbm library.
Interface to slow, but portable pure Python DBM.
Interface to the GNU DBM (GDBM) library.
Guess which db package to use to open a db file. This module contains the single function whichdb.whichdb(). If you open an existing DBM file with anydbm, this function is called automatically behind the scenes.
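A small sketch of the guessing step, under the assumption of Python 3, where whichdb.whichdb() became dbm.whichdb() and the pure-Python DBM is named dbm.dumb. The file names are hypothetical. The sketch creates a database with the pure-Python module and then asks which module can open it.

```python
import dbm
import dbm.dumb
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "guess_me")      # hypothetical file name

    db = dbm.dumb.open(path, "c")
    db["k"] = "v"
    db.close()                                   # the index file is written on close

    kind = dbm.whichdb(path)                     # name of the module that can open it
    missing = dbm.whichdb(os.path.join(tmpdir, "no_such_db"))   # None: nothing to inspect
```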
SEE ALSO: shelve
cPickle • Fast Python object serialization
pickle • Standard Python object serialization
The module cPickle is a comparatively fast C implementation of the pure Python pickle module. The streams produced and read by cPickle and pickle are interchangeable. The only time you should prefer pickle is in the uncommon case where you wish to subclass the pickling base class; cPickle is many times faster to use. The class pickle.Pickler is not documented here.
The cPickle and pickle modules support both a binary and an ASCII format. Neither is designed for human readability, but it is not hugely difficult to read an ASCII pickle. Nonetheless, if readability is a goal, yaml or gnosis.xml.pickle are better choices. Binary format produces smaller pickles that are faster to write or load.
It is possible to fine-tune the pickling behavior of objects by defining the methods .__getstate__(), .__setstate__(), and .__getinitargs__(). The particular black magic invocations involved in defining these methods, however, are not addressed in this book and are rarely necessary for "normal" objects (i.e., those that represent data structures).
Use of the cPickle or pickle module is quite simple:
>>> import cPickle
>>> from somewhere import my_complex_object
>>> s = cPickle.dumps(my_complex_object)
>>> new_obj = cPickle.loads(s)
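To illustrate the two formats, here is a minimal sketch that assumes Python 3, where the single pickle module uses the C implementation automatically and the bin flag has been replaced by a protocol number (0 is the ASCII-style format; 2 is one of the binary ones). The sample data is made up for the illustration.

```python
import pickle

data = {"name": "example",
        "values": list(range(50)),
        "nested": [(1, 2), (3, 4)]}

ascii_pickle = pickle.dumps(data, protocol=0)    # printable, text-style format
binary_pickle = pickle.dumps(data, protocol=2)   # compact binary format

# Either stream restores an equal object.
same_from_ascii = pickle.loads(ascii_pickle)
same_from_binary = pickle.loads(binary_pickle)
```

For this particular object the binary pickle is the shorter of the two, in line with the size claim above.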
Write a serialized form of the object o to the file-like object file. If the optional argument bin is given a true value, use binary format.
Return a serialized form of the object o as a string. If the optional argument bin is given a true value, use binary format.
Return an object that was serialized as the contents of the file-like object file.
Return an object that was serialized in the string s.
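The file-oriented pair can be sketched the same way. This assumes Python 3's pickle module and uses an in-memory io.BytesIO as the file-like object; a file opened in binary mode would work equally well. The record contents are hypothetical.

```python
import io
import pickle

record = {"title": "Text Processing", "tags": ["python", "text"], "count": 3}

buffer = io.BytesIO()          # any binary file-like object will do
pickle.dump(record, buffer)    # write the serialized form to the file-like object
buffer.seek(0)                 # rewind before reading it back
restored = pickle.load(buffer)
```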
SEE ALSO: gnosis.xml.pickle; yaml
Internal Python object serialization. For more general object serialization, use pickle, cPickle, or gnosis.xml.pickle, or the YAML tools at <http://yaml.org>; marshal is a limited-purpose serialization to the pseudo-compiled byte-code format used by Python .pyc files.
pprint • Pretty-print basic datatypes
The module pprint is similar to the built-in function repr() and the module repr. The purpose of pprint is to represent objects of basic datatypes in a more readable fashion, especially in cases where collection types nest inside each other. In simple cases pprint.pformat and repr() produce the same result; for more complex objects, pprint uses newlines and indentation to illustrate the structure of a collection. Where possible, the string representation produced by pprint functions can be used to re-create objects with the built-in eval().
I find the module pprint somewhat limited in that it does not produce a particularly helpful representation of objects of custom types, which might themselves represent compound data. Instance attributes are very frequently used in a manner similar to dictionary keys. For example:
>>> import pprint
>>> dct = {1.7:2.5, ('t','u','p'):['l','i','s','t']}
>>> dct2 = {'this':'that', 'num':38, 'dct':dct}
>>> class Container: pass
...
>>> inst = Container()
>>> inst.this, inst.num, inst.dct = 'that', 38, dct
>>> pprint.pprint(dct2)
{'dct': {('t', 'u', 'p'): ['l', 'i', 's', 't'], 1.7: 2.5},
 'num': 38,
 'this': 'that'}
>>> pprint.pprint(inst)
<__main__.Container instance at 0x415770>
In the example, dct2 and inst have the same structure, and either might plausibly be chosen in an application as a data container. But the latter pprint representation only tells us the barest information about what an object is, not what data it contains. The mini-module below enhances pretty-printing:
from pprint import pformat
import string, sys

def pformat2(o):
    # Instances get a header line plus one line per attribute
    if hasattr(o,'__dict__'):
        lines = []
        klass = o.__class__.__name__
        module = o.__module__
        desc = '<%s.%s instance at 0x%x>' % (module, klass, id(o))
        lines.append(desc)
        for k,v in o.__dict__.items():
            lines.append('instance.%s=%s' % (k, pformat(v)))
        return string.join(lines,'\n')
    else:
        # Everything else is handled by the standard pformat()
        return pformat(o)

def pprint2(o, stream=sys.stdout):
    stream.write(pformat2(o)+'\n')
Continuing the session above, we get a more useful report:
>>> import pprint2
>>> pprint2.pprint2(inst)
<__main__.Container instance at 0x415770>
instance.this='that'
instance.dct={('t', 'u', 'p'): ['l', 'i', 's', 't'], 1.7: 2.5}
instance.num=38
Return a true value if the equality below holds:
o == eval(pprint.pformat(o))
Return a true value if the object o contains recursive containers. Objects that contain themselves at any nested level cannot be restored with eval().
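The two predicates just described, pprint.isreadable() and pprint.isrecursive(), can be tried out with a short sketch. It is written for Python 3, where both functions still exist; the sample objects and the Container class are made up for the illustration.

```python
import pprint

plain = {"num": 38, "seq": [1, 2, ("t", "u", "p")]}

looped = [1, 2]
looped.append(looped)              # the list now contains itself

class Container:
    pass

readable_plain = pprint.isreadable(plain)                 # basic datatypes only
roundtrip_ok = (plain == eval(pprint.pformat(plain)))     # the equality being tested
recursive_looped = pprint.isrecursive(looped)
readable_looped = pprint.isreadable(looped)               # recursion defeats eval()
readable_instance = pprint.isreadable(Container())        # '<... object at 0x...>' is not evaluable
```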
Return a formatted string representation of the object o.
Print the formatted representation of the object o to the file-like object stream.
Return a pretty-printing object that will format using a width of width, will limit recursion to depth depth, and will indent each new level by indent spaces. The method pprint.PrettyPrinter.pprint() will write to the file-like object stream.
>>> pp = pprint.PrettyPrinter(width=30)
>>> pp.pprint(dct2)
{'dct': {1.7: 2.5,
         ('t', 'u', 'p'): ['l',
                           'i',
                           's',
                           't']},
 'num': 38,
 'this': 'that'}
The class pprint.PrettyPrinter has the same methods as the module level functions. The only difference is that the stream used for pprint.PrettyPrinter.pprint() is configured when an instance is initialized rather than passed as an optional argument.
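A brief sketch of configuring the stream at initialization, assuming Python 3 and an in-memory io.StringIO as the target; the data mirrors the dct2 example above.

```python
import io
import pprint

data = {"this": "that", "num": 38,
        "dct": {1.7: 2.5, ("t", "u", "p"): ["l", "i", "s", "t"]}}

out = io.StringIO()
pp = pprint.PrettyPrinter(indent=2, width=30, stream=out)
pp.pprint(data)                  # written to `out`, not to sys.stdout
text = out.getvalue()
same_text = pp.pformat(data)     # the same layout, returned as a string
```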
SEE ALSO: gnosis.xml.pickle; yaml
repr • Alternative object representation
The module repr contains code for customizing the string representation of objects. In its default behavior the function repr.repr() provides a length-limited string representation of objects. In the case of large collections, displaying the entire collection can be unwieldy, and unnecessary for merely distinguishing objects. For example:
>>> dct = dict([(n,str(n)) for n in range(6)])
>>> repr(dct)    # much worse for, e.g., 1000 item dict
"{0: '0', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}"
>>> from repr import repr
>>> repr(dct)
"{0: '0', 1: '1', 2: '2', 3: '3', ...}"
>>> `dct`
"{0: '0', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}"
The back-tick operator does not change behavior if the built-in repr() function is replaced.
You can change the behavior of repr.repr() by modifying attributes of the instance object repr.aRepr.
>>> dct = dict([(n,str(n)) for n in range(6)])
>>> repr(dct)
"{0: '0', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}"
>>> import repr
>>> repr.repr(dct)
"{0: '0', 1: '1', 2: '2', 3: '3', ...}"
>>> repr.aRepr.maxdict = 5
>>> repr.repr(dct)
"{0: '0', 1: '1', 2: '2', 3: '3', 4: '4', ...}"
In my opinion, the choice of the name for this module is unfortunate, since it is identical to that of the built-in function. You can avoid some of the collision by using the as form of importing, as in:
>>> import repr as _repr
>>> from repr import repr as newrepr
For fine-tuned control of object representation, you may subclass the class repr.Repr. Potentially, you could use substitutable repr() functions to change the behavior of application output, but if you anticipate such a need, it is better practice to give a name that indicates this; for example, overridable_repr().
Base for customized object representations. The instance repr.aRepr automatically exists in the module namespace, so this class is useful primarily as a parent class. To change an attribute, it is simplest just to set it in an instance.
Depth of recursive objects to follow.
Number of items in a collection of the indicated type to include in the representation. Sequences default to 6, dicts to 4.
Number of digits of a long integer to stringify. Default is 40.
Length of string representation (e.g., s[:N]). Default is 30.
"Catch-all" maximum length of other representations.
Behaves like built-in repr(), but potentially with a different string representation created.
Represent an object of the type TYPE, where the names used are the standard type names. The argument level indicates the level of recursion when this method is called (you might want to decide what to print based on how deep within the representation the object is). The Python Library Reference gives the example:
import repr, sys

class MyRepr(repr.Repr):
    def repr_file(self, obj, level):
        if obj.name in ['<stdin>', '<stdout>', '<stderr>']:
            return obj.name
        else:
            return `obj`

aRepr = MyRepr()
print aRepr.repr(sys.stdin)          # prints '<stdin>'
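The same subclassing approach can be sketched for current Python versions, under the assumption of Python 3, where this module is named reprlib. The Point class and the repr_Point method are hypothetical names chosen for the illustration; the method name follows the repr_TYPE convention described above.

```python
import reprlib   # Python 3 name for the repr module

class Point:
    """Hypothetical custom type used only for this illustration."""
    def __init__(self, x, y):
        self.x, self.y = x, y

class MyRepr(reprlib.Repr):
    def repr_Point(self, obj, level):
        # Dispatched to for objects whose type name is "Point"
        return "Point(%r, %r)" % (obj.x, obj.y)

short = MyRepr()
short.maxlist = 3                # show at most three list items

list_text = short.repr(list(range(10)))                  # '[0, 1, 2, ...]'
point_text = short.repr(Point(1, 2))                     # 'Point(1, 2)'
nested_text = short.repr([Point(0, 0), Point(1, 1)])     # items use repr_Point too
```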
shelve • General persistent dictionary
The module shelve builds on the capabilities of the DBM modules, but takes things a step forward. Unlike with the DBM modules, you may write arbitrary Python objects as values in a shelve database. The keys in shelve databases, however, must still be strings.
The methods of shelve databases are generally the same as those for their underlying DBMs. However, shelves do not have the .first(), .last(), .next(), or .previous() methods; nor do they have the .items() method that actual dictionaries do. Most of the time you will simply use name-indexed assignment and access. But from time to time, the available shelve.get(), shelve.keys(), shelve.sync(), shelve.has_key(), and shelve.close() methods are useful.
Usage of a shelve consists of a few simple steps like the ones below:
>>> import shelve
>>> sh = shelve.open('test_shelve')
>>> sh.keys()
['this']
>>> sh['new_key'] = {1:2, 3:4, ('t','u','p'):['l','i','s','t']}
>>> sh.keys()
['this', 'new_key']
>>> sh['new_key']
{1: 2, 3: 4, ('t', 'u', 'p'): ['l', 'i', 's', 't']}
>>> del sh['this']
>>> sh.keys()
['new_key']
>>> sh.close()
In the example, I opened an existing shelve, and the previously existing key/value pair was available. Deleting a key/value pair is the same as doing so from a standard dictionary. Opening a new shelve automatically creates the necessary file(s).
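For comparison, the same steps can be sketched as a self-contained script. This assumes Python 3, where a shelve can also be used as a context manager that closes itself; the temporary directory and the file name test_shelve are only there to keep the sketch from leaving files behind.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "test_shelve")    # hypothetical file name

    with shelve.open(path) as sh:                 # creates the file(s) if needed
        sh["this"] = "that"
        sh["new_key"] = {1: 2, 3: 4, ("t", "u", "p"): ["l", "i", "s", "t"]}

    with shelve.open(path) as sh:                 # reopen: earlier pairs persist
        del sh["this"]                            # same as deleting from a dict
        keys = sorted(sh.keys())
        value = sh["new_key"]
```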
Although shelve only allows strings to be used as keys, in a pinch it is not difficult to generate strings that characterize other types of immutable objects. For the same reasons that you do not generally want to use mutable objects as dictionary keys, it is also a bad idea to use mutable objects as shelve keys. Using the built-in hash() function is a good way to generate strings, but keep in mind that this technique does not strictly guarantee uniqueness, so it is possible (but unlikely) to accidentally overwrite entries using this hack:
>>> '%x' % hash((1,2,3,4,5))
'866123f4'
>>> '%x' % hash(3.1415)
'6aad0902'
>>> '%x' % hash(38)
'26'
>>> '%x' % hash('38')
'92bb58e3'
Integers, notice, are their own hash, and strings of digits are common. Therefore, if you adopted this approach, you would want to hash strings as well, before using them as keys. There is no real problem with doing so, merely an extra indirection step that you need to remember to use consistently:
>>> sh['%x' % hash('another_key')] = 'another value'
>>> sh.keys()
['new_key', '8f9ef0ca']
>>> sh['%x' % hash('another_key')]
'another value'
>>> sh['another_key']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/sw/lib/python2.2/shelve.py", line 70, in __getitem__
    f = StringIO(self.dict[key])
KeyError: another_key
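One caution for readers on Python 3: string hashes are salted per interpreter run by default, so hash('another_key') generally differs from one process to the next and cannot serve as a persistent key. The sketch below swaps in a different technique, a hashlib digest of repr(obj), to get the same kind of string key in a repeatable way. The helper name key_for is hypothetical, and a plain dictionary stands in for an open shelve.

```python
import hashlib

def key_for(obj):
    """Return a hex string key derived from repr(obj).

    Assumes repr(obj) is stable and unambiguous for the objects used as
    keys, which holds for tuples, numbers, and strings like those above.
    """
    return hashlib.sha1(repr(obj).encode("utf-8")).hexdigest()

store = {}                                    # stands in for an open shelve
store[key_for((1, 2, 3, 4, 5))] = "a tuple"
store[key_for("another_key")] = "another value"
store[key_for(38)] = "an int"
store[key_for("38")] = "a string of digits"
```

As with the hash() approach, the same indirection has to be applied consistently on every read and write.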
If you want to go beyond the capabilities of shelve in several ways, you might want to investigate the third-party library Zope Object Database (ZODB). ZODB allows arbitrary objects to be persistent, not only dictionary-like objects. Moreover, ZODB lets you store data in ways other than in local files, and also has adaptors for multiuser simultaneous access. Look for details at:
<http://www.zope.org/Wikis/ZODB/StandaloneZODB>
SEE ALSO: DBM; dict
The rest of the listed modules are comparatively unlikely to be needed in text processing applications. Some modules are specific to a particular platform; if so, this is indicated parenthetically. Recent distributions of Python have taken a "batteries included" approach: much more is included in a base Python distribution than is with other free programming languages (but other popular languages still have a range of existing libraries that can be downloaded separately).
Access to the Windows registry (Windows).
AppleEvents (Macintosh; replaced by Carbon.AE).
Conversion between Python variables and AppleEvent data containers (Macintosh).
AppleEvent objects (Macintosh).
Rudimentary decoder for AppleSingle format files (Macintosh).
Build MacOS applets (Macintosh).
Print calendars, much like the Unix cal utility. A variety of functions allow you to print or stringify calendars for various time frames. For example,
>>> print calendar.month(2002,11)
   November 2002
Mo Tu We Th Fr Sa Su
             1  2  3
 4  5  6  7  8  9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30
Interfaces to Carbon API (Macintosh).
CD-ROM access on SGI systems (IRIX).
Code Fragment Resource module (Macintosh).
Interface to the standard color selection dialog (Macintosh).
Interface to the Communications Tool Box (Macintosh).
Call C functions in shared objects (Unix).
Basic Macintosh dialogs (Macintosh).
Access to Unix fcntl() and ioctl() system functions (Unix).
AppleEvents interface to MacOS finder (Macintosh).
Functions and constants for working with the FORMS library (IRIX).
Functions and constants for working with the Font Manager library (IRIX).
Floating point exception control (Unix).
Structured development of MacOS applications (Macintosh).
The module gettext eases the development of multilingual applications. While actual translations must be performed manually, this module aids in identifying strings for translation and runtime substitutions of language-specific strings.
Information on Unix groups (Unix).
Control the language and regional settings for an application. The locale setting affects the behavior of several functions, such as time.strftime() and string.lower(). The locale module is also useful for creating strings such as numbers with grouped digits and currency strings for specific nations.
Macintosh implementation of os module functionality. It is generally better to use os directly and let it call mac where needed (Macintosh).
Filesystem services (Macintosh).
Access to MacOS Python interpreter (Macintosh).
Locate script resources (Macintosh).
Interface to Speech Manager (Macintosh).
Easy access to serial line connections (Macintosh).
Create CodeWarrior projects (Macintosh).
Miscellaneous Windows-specific functions provided in Microsoft's Visual C++ Runtime libraries (Windows).
Interface to Navigation Services (Macintosh).
Access to Sun's NIS Yellow Pages (Unix).
Manage pipes at a finer level than done by os.popen() and its relatives. Reliability varies between platforms (Unix).
Wrap PixMap objects (Macintosh).
Access to operating system functionality under Unix. The os module provides a more portable version of the same functionality and should be used instead (Unix).
Application preferences manager (Macintosh).
Pseudo terminal utilities (IRIX, Linux).
Access to Unix password database (Unix).
Preferences manager for Python (Macintosh).
Helper to create PYC resources for compiled applications (Macintosh).
Buffered, nonvisible STDOUT output (Macintosh).
Examine resource usage (Unix).
Interface to Unix syslog library (Unix).
POSIX tty control (Unix).
Widgets for the Mac (Macintosh).
Interface to the WorldScript-Aware Styled Text Engine (Macintosh).
Interface to audio hardware under Windows (Windows).
Implements (a subset of) Sun eXternal Data Representation (XDR). In concept, xdrlib is similar to the struct module, but the format is less widely used.
Read and write AIFC and AIFF audio files. The interface to aifc is the same as for the sunau and wave modules.
Audio functions for SGI (IRIX).
Manipulate raw audio data.
Read chunks of IFF audio data.
Convert between RGB color model and YIQ, HLS, and HSV color spaces.
Functions and constants for working with Silicon Graphics' Graphics Library (IRIX).
Manipulate image data stored as Python strings. For most operations on image files, the third-party Python Imaging Library (usually called "PIL"; see <http://www.pythonware.com/products/pil/>) is a versatile and powerful tool.
Support for imglib files (IRIX).
Read and write JPEG files on SGI (IRIX). The Python Imaging Library (<http://www.pythonware.com/products/pil/>) provides a cross-platform means of working with a large number of image formats and is preferable for most purposes.
Read and write SGI RGB files (IRIX).
Read and write Sun AU audio files. The interface to sunau is the same as for the aifc and wave modules.
Interface to Sun audio hardware (SunOS/Solaris).
Read QuickTime movies frame by frame (Macintosh).
Read and write WAV audio files. The interface to wave is the same as for the aifc and sunau modules.
Typed arrays of numeric values. More efficient than standard Python lists, where applicable.
Exit handlers. Same functionality as sys.exitfunc, but different interface.
HTTP server classes. BaseHTTPServer should usually be treated as an abstract class. The other modules provide sufficient customization for usage in the specific context indicated by their names. All may be customized for your application's needs.
Restricted object access. Used in conjunction with rexec.
List insertion maintaining sort order.
Mathematical functions over complex numbers.
Build line-oriented command interpreters.
Utilities to emulate Python's interactive interpreter.
Compile possibly incomplete Python source code.
Module/script to compile .py files to cached byte-code files.
Analyze Python source code and generate Python byte-codes.
Helper to provide extensibility for pickle/cPickle.
Full-screen terminal handling with the (n)curses library.
Cached directory listing. This module enhances the functionality of os.listdir().
Disassembler of Python byte-code into mnemonics.
Build and install Python modules and packages. distutils provides a standard mechanism for creating distribution packages of Python tools and libraries, and also for installing them on target machines. Although distutils is likely to be useful for text processing applications that are distributed to users, a discussion of the details of working with distutils is outside the scope of this book. Useful information can be found in the Python standard documentation, especially Greg Ward's Distributing Python Modules and Installing Python Modules.
Check the accuracy of __doc__ strings.
Standard errno system symbols.
General floating point formatting functions. Duplicates string interpolation functionality.
Control Python's (optional) cyclic garbage collection.
Utilities to collect a password without echoing to screen.
Access the internals of the import statement.
Get useful information from live Python objects for Python 2.1+.
Check whether string is a Python keyword.
Various trigonometric and algebraic functions and constants. These functions generally operate on floating point numbers; use cmath for calculations on complex numbers.
Work with mutual exclusion locks, typically for threaded applications.
Create special Python objects in customizable ways. For example, Python hackers can create a module object without using a file of the same name or create an instance while bypassing the normal .__init__() call. "Normal" techniques generally suffice for text processing applications.
A Python debugger.
Functions to spawn commands with pipes to STDIN, STDOUT, and optionally STDERR. In Python 2.0+, this functionality is copied to the os module in slightly improved form. Generally you should use the os module (unless you are running Python 1.5.2 or earlier).
Profile the performance characteristics of Python code. If speed becomes an issue in your application, your first step in solving the problem should be profiling the code. But details of using profile are outside the scope of this book. Moreover, it is usually a bad idea to assume speed is a problem until it is actually found to be so.
Print reports on profiled Python code.
Python class browser; useful for implementing code development environments for editing Python.
Extremely useful script and module for examining Python documentation. pydoc is included with Python 2.1+, but is compatible with earlier versions if downloaded. pydoc can provide help similar to Unix man pages, help in the interactive shell, and also a Web browser interface to documentation. This tool is worth using frequently while developing Python applications, but its details are outside the scope of this book.
"Compile" a .py file to a .pyc (or .pyo) file.
A multiproducer, multiconsumer queue, especially for threaded programming.
Interface to GNU readline (Unix).
Restricted execution facilities.
General event scheduler.
Handlers for asynchronous events.
Customizable startup module that can be modified to change the behavior of the local Python installation.
Maintain a cache of os.stat() information on files. Deprecated in Python 2.2+.
Constants for interpreting the results of os.statvfs() and os.fstatvfs().
Create multithreaded applications with Python. Although text processing applications, like other applications, might use a threaded approach, this topic is outside the scope of this book. Most, but not all, Python platforms support threaded applications.
Python interface to TCL/TK and higher-level widgets for TK. Supported on many platforms, but not on all Python installations.
Extract, format, and print information about Python stack traces. Useful for debugging applications.
Unit testing framework. Like a number of other documenting, testing, and debugging modules, unittest is a useful facility, and its usage is recommended for Python applications in general. But this module is not specific enough to text processing applications to be addressed in this book.
Python 2.1 added a set of warning messages for conditions a user should be aware of, but that fall below the threshold for raising exceptions. By default, such messages are printed to STDERR, but the warnings module can be used to modify the behavior of warning messages.
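As a hedged illustration of modifying that behavior, the sketch below assumes Python 3's warnings module and uses warnings.catch_warnings(), which was added later than the Python versions this book covers. The function old_api() is hypothetical. The first block records a warning instead of printing it; the second turns warnings into exceptions.

```python
import warnings

def old_api():
    """Hypothetical function that warns instead of raising."""
    warnings.warn("old_api() is obsolete", UserWarning)
    return 42

# Capture the warning rather than letting it print to STDERR.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")      # report every occurrence
    result = old_api()

# Another filter setting escalates warnings to exceptions.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        old_api()
        raised = False
    except UserWarning:
        raised = True
```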
Create references to objects that do not limit garbage collection. At first blush, weak references seem strange, and the strangeness does not really go away quickly. If you do not know why you would want to use these, do not worry about it; you do not need to.
Wichmann-Hill random number generator. Deprecated since Python 2.1, and not necessary to use directly before that; use the module random to create pseudorandom values.