2.1 Some Common Tasks

2.1.1 Problem: Quickly sorting lines on custom criteria

Sorting is one of the real meat-and-potatoes algorithms of text processing and, in fact, of most programming. Fortunately for Python developers, the native [].sort method is extraordinarily fast. Moreover, Python lists with almost any heterogeneous objects as elements can be sorted; Python cannot rely on the uniform arrays of a language like C. (An unfortunate exception to this general power was introduced in recent Python versions, where comparisons of complex numbers raise a TypeError, so [1+1j,2+2j].sort() dies for the same reason; Unicode strings in lists can cause similar problems.)

schwartzian_sort.py

# Timing test for "sort on fourth word"
# Specifically, two lines >= 4 words will be sorted
#   lexicographically on the 4th, 5th, etc. words.
#   Any line with fewer than four words will be sorted to
#   the end, and will occur in "natural" order.

import sys, string, time
wrerr = sys.stderr.write

# naive custom sort
def fourth_word(ln1,ln2):
    lst1 = string.split(ln1)
    lst2 = string.split(ln2)
    #-- Compare "long" lines
    if len(lst1) >= 4 and len(lst2) >= 4:
        return cmp(lst1[3:],lst2[3:])
    #-- Long lines before short lines
    elif len(lst1) >= 4 and len(lst2) < 4:
        return -1
    #-- Short lines after long lines
    elif len(lst1) < 4 and len(lst2) >= 4:
        return 1
    else:                   # Natural order
        return cmp(ln1,ln2)

# Don't count the read itself in the time
lines = open(sys.argv[1]).readlines()

# Time the custom comparison sort
start = time.time()
lines.sort(fourth_word)

end = time.time()
wrerr("Custom comparison func in %3.2f secs\n" % (end-start))
# open('tmp.custom','w').writelines(lines)

# Don't count the read itself in the time
lines = open(sys.argv[1]).readlines()

# Time the Schwartzian sort
start = time.time()
for n in range(len(lines)):       # Create the transform
    lst = string.split(lines[n])
    if len(lst) >= 4:             # Tuple w/ sort info first
        lines[n] = (lst[3:], lines[n])
    else:                         # Short lines to end
        lines[n] = (['\377'], lines[n])

lines.sort()                      # Native sort

for n in range(len(lines)):       # Restore original lines
    lines[n] = lines[n][1]

end = time.time()
wrerr("Schwartzian transform sort in %3.2f secs\n" % (end-start))
# open('tmp.schwartzian','w').writelines(lines)

Only one particular example is presented, but readers should be able to generalize this technique to any sort they need to perform frequently or on large files.
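For readers on later Python versions, the same decorate-sort-undecorate idea can be sketched with a key function, which computes the sort information once per line rather than once per comparison. The sketch below assumes Python 3, where list.sort() takes a key function instead of a comparison function. The helper name fourth_word_key and the sample lines are illustrative and are not part of the listing above.

```python
# Sketch only: assumes Python 3; fourth_word_key is a hypothetical helper name.

def fourth_word_key(line):
    """Build the sort key once per line (the "decorate" step)."""
    words = line.split()
    if len(words) >= 4:
        # Long lines first, ordered on the 4th, 5th, ... words
        return (0, words[3:], line)
    # Short lines last, in "natural" order
    return (1, [], line)

lines = [
    "too short\n",
    "one two three zebra\n",
    "a b c apple pie\n",
    "also short\n",
    "x y z apple\n",
]
lines.sort(key=fourth_word_key)   # native sort on the precomputed keys
print("".join(lines), end="")
```

The leading 0 or 1 in each key plays the role that the '\377' placeholder plays in schwartzian_sort.py: it pushes short lines after long ones without relying on character ordering.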

2.1.2 Problem: Reformatting paragraphs of text

While I mourn the decline of plaintext ASCII as a communication format, and its eclipse by unnecessarily complicated and large (and often proprietary) formats, there is still plenty of life left in text files full of prose. READMEs, HOWTOs, email, Usenet posts, and this book itself are written in plaintext (or at least something close enough to plaintext that generic processing techniques are valuable). Moreover, many formats like HTML and LaTeX are frequently enough hand-edited that their plaintext appearance is important.

One task that is extremely common when working with prose text files is reformatting paragraphs to conform to desired margins. Most of the time, this task gets done within text editors, which are indeed quite capable of performing the task. However, sometimes it would be nice to automate the formatting process. The task is simple enough that it is slightly surprising that Python had no standard module function to do this before Python 2.3, which adds the module textwrap (it performs more limited reformatting than the code below). There is the class formatter.DumbWriter, or the possibility of inheriting from and customizing formatter.AbstractWriter. These classes are discussed in Chapter 5; but frankly, the amount of customization and sophistication needed to use these classes and their many methods is way out of proportion for the task at hand.

Below is a simple solution that can be used either as a command-line tool (reading from STDIN and writing to STDOUT) or by import to a larger application.

reformat_para.py

# Simple paragraph reformatter.  Allows specification
# of left and right margins, and of justification style
# (using constants defined in module).

LEFT,RIGHT,CENTER = 'LEFT','RIGHT','CENTER'

def reformat_para(para='',left=0,right=72,just=LEFT):
    words = para.split()
    lines = []
    line  = ''
    word = 0
    end_words = 0
    while not end_words:
        if len(words[word]) > right-left: # Handle very long words
            line = words[word]
            word +=1
            if word >= len(words):
                end_words = 1
        else:                             # Compose line of words
            while len(line)+len(words[word]) <= right-left:
                line += words[word]+' '
                word += 1
                if word >= len(words):
                    end_words = 1
                    break
        lines.append(line)
        line = ''
    if just==CENTER:
        r, l = right, left
        return '\n'.join([' '*left+ln.center(r-l) for ln in lines])
    elif just==RIGHT:
        return '\n'.join([line.rjust(right) for line in lines])
    else: # left justify
        return '\n'.join([' '*left+line for line in lines])

if __name__=='__main__':
    import sys
    if len(sys.argv) <> 4:
        print "Please specify left_margin, right_marg, justification"
    else:
        left  = int(sys.argv[1])
        right = int(sys.argv[2])
        just  = sys.argv[3].upper()

        # Simplistic approach to finding initial paragraphs
        for p in sys.stdin.read().split('\n\n'):
            print reformat_para(p,left,right,just),'\n'

A number of enhancements are left to readers, if needed. You might want to allow hanging indents or indented first lines, for example. Or paragraphs meeting certain criteria might not be appropriate for wrapping (e.g., headers). A custom application might also determine the input paragraphs differently, either by a different parsing of an input file, or by generating paragraphs internally in some manner.
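As one illustration of the indentation enhancements mentioned above, the standard textwrap module can produce an indented first line or a hanging indent, although it only left-justifies. This is a minimal sketch assuming Python 3; the sample paragraph and the margin values are arbitrary choices for the example.

```python
# Sketch only: assumes Python 3 and the standard textwrap module.
import textwrap

para = ("One task that is extremely common when working with prose "
        "text files is reformatting paragraphs to conform to desired margins.")

# Right margin at column 40, left margin of 4 spaces, and a first
# line indented two further spaces.
wrapped = textwrap.fill(para, width=40,
                        initial_indent=" " * 6,
                        subsequent_indent=" " * 4)
print(wrapped)
```

Swapping the two indent values gives a hanging indent instead of an indented first line.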

2.1.3 Problem: Column statistics for delimited or flat-record files

Data feeds, DBMS dumps, log files, and flat-file databases all tend to contain ontologically similar records (one per line) with a collection of fields in each record. Usually such fields are separated either by a specified delimiter or by specific column positions where fields are to occur.

Parsing these structured text records is quite easy, and performing computations on fields is equally straightforward. But in working with a variety of such "structured text databases," it is easy to keep writing almost the same code over again for each variation in format and computation.

The example below provides a generic framework for every similar computation on a structured text database.

fields_stats.py

# Perform calculations on one or more of the
# fields in a structured text database.

import operator
from types import *
from xreadlines import xreadlines # req 2.1, but is much faster...
                                  # could use .readline() meth < 2.1
#-- Symbolic Constants
DELIMITED = 1
FLATFILE = 2

#-- Some sample "statistical" func (in functional programming style)
nillFunc = lambda lst: None
toFloat = lambda lst: map(float, lst)
avg_lst = lambda lst: reduce(operator.add, toFloat(lst))/len(lst)
sum_lst = lambda lst: reduce(operator.add, toFloat(lst))
max_lst = lambda lst: reduce(max, toFloat(lst))

class FieldStats:
    """Gather statistics about structured text database fields

    text_db may be either string (incl. Unicode) or file-like object
    style may be in (DELIMITED, FLATFILE)
    delimiter specifies the field separator in DELIMITED style text_db
    column_positions lists all field positions for FLATFILE style,
                     using one-based indexing (first column is 1).
              E.g.:  (1, 7, 40) would take fields one, two, three
                     from columns 1, 7, 40 respectively.
    field_funcs is a dictionary with column positions as keys,
                and functions on lists as values.
         E.g.:  {1:avg_lst, 4:sum_lst, 5:max_lst} would specify the
                average of column one, the sum of column 4, and the
                max of column 5.  All other cols--incl 2,3, >=6--
                are ignored.

    """
    def __init__(self,
                 text_db='',
                 style=DELIMITED,
                 delimiter=',',
                 column_positions=(1,),
                 field_funcs={} ):
        self.text_db = text_db
        self.style = style
        self.delimiter = delimiter
        self.column_positions = column_positions
        self.field_funcs = field_funcs

    def calc(self):
        """Calculate the column statistics
        """
        #-- 1st, create a list of lists for data (incl. unused flds)
        used_cols = self.field_funcs.keys()
        used_cols.sort()
        # one-based column naming: column[0] is always unused
        columns = []
        for n in range(1+used_cols[-1]):
            # hint: '[[]]*num' creates refs to same list
            columns.append([])

        #-- 2nd, fill lists used for calculated fields
        # might use a string directly for text_db
        if type(self.text_db) in (StringType,UnicodeType):
            for line in self.text_db.split('\n'):
                fields = self.splitter(line)
                for col in used_cols:
                    field = fields[col-1]   # zero-based index
                    columns[col].append(field)
        else:   # Something file-like for text_db
            for line in xreadlines(self.text_db):
                fields = self.splitter(line)
                for col in used_cols:
                    field = fields[col-1]   # zero-based index
                    columns[col].append(field)

        #-- 3rd, apply the field funcs to column lists
        results = [None] * (1+used_cols[-1])
        for col in used_cols:
            results[col] = \
                 apply(self.field_funcs[col],(columns[col],))

        #-- Finally, return the result list
        return results

    def splitter(self, line):
        """Split a line into fields according to curr inst specs"""
        if self.style == DELIMITED:
            return line.split(self.delimiter)
        elif self.style == FLATFILE:
            fields = []
            # Adjust offsets to Python zero-based indexing,
            # and also add final position after the line
            num_positions = len(self.column_positions)
            offsets = [(pos-1) for pos in self.column_positions]
            offsets.append(len(line))
            for pos in range(num_positions):
                start = offsets[pos]
                end = offsets[pos+1]
                fields.append(line[start:end])
            return fields
        else:
            raise ValueError, \
                  "Text database must be DELIMITED or FLATFILE"

#-- Test data
# First Name, Last Name, Salary, Years Seniority, Department
delim = '''
Kevin,Smith,50000,5,Media Relations
Tom,Woo,30000,7,Accounting
Sally,Jones,62000,10,Management
'''.strip()     # no leading/trailing newlines

# Comment     First     Last      Salary    Years  Dept
flat = '''
tech note     Kevin     Smith     50000     5      Media Relations
more filler   Tom       Woo       30000     7      Accounting
yet more...   Sally     Jones     62000     10     Management
'''.strip()     # no leading/trailing newlines

#-- Run self-test code
if __name__ == '__main__':
    getdelim = FieldStats(delim, field_funcs={3:avg_lst,4:max_lst})
    print 'Delimited Calculations:'
    results = getdelim.calc()
    print '  Average salary -', results[3]
    print '  Max years worked -', results[4]

    getflat = FieldStats(flat, field_funcs={3:avg_lst,4:max_lst},
                               style=FLATFILE,
                               column_positions=(15,25,35,45,52))
    print 'Flat Calculations:'
    results = getflat.calc()
    print '  Average salary -', results[3]
    print '  Max years worked -', results[4]

The example above includes some efficiency considerations that make it a good model for working with large data sets. In the first place, class FieldStats can (optionally) deal with a file-like object, rather than keeping the whole structured text database in memory. The generator xreadlines.xreadlines() is an extremely fast and efficient file reader, but it requires Python 2.1+; otherwise use FILE.readline() or FILE.readlines() (for either memory or speed efficiency, respectively). Moreover, only the data that is actually of interest is collected into lists, in order to save memory. However, rather than require multiple passes to collect statistics on multiple fields, as many field columns and summary functions as wanted can be used in one pass.

One possible improvement would be to allow multiple summary functions against the same field during a pass. But that is left as an exercise to the reader, if she desires to do it.
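One way to approach that improvement, sketched here under the assumption of Python 3 and a delimited input only, is to map each column to a list of functions rather than to a single function. The names column_stats and avg are hypothetical and do not appear in fields_stats.py; the sample records are the delimited test data from the listing.

```python
# Sketch only: assumes Python 3; column_stats and avg are hypothetical names.

def column_stats(lines, field_funcs, delimiter=","):
    """One pass over delimited lines; several summary functions per column.

    field_funcs maps one-based column numbers to lists of functions,
    each of which takes a list of floats.
    """
    # Collect only the columns that are actually summarized
    columns = {col: [] for col in field_funcs}
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        for col in columns:
            columns[col].append(float(fields[col - 1]))  # zero-based index
    # Apply every requested function to its column
    return {col: [func(columns[col]) for func in funcs]
            for col, funcs in field_funcs.items()}

def avg(lst):
    return sum(lst) / len(lst)

delim = """Kevin,Smith,50000,5,Media Relations
Tom,Woo,30000,7,Accounting
Sally,Jones,62000,10,Management"""

results = column_stats(delim.split("\n"), {3: [avg, max], 4: [max, sum]})
print("Salary avg/max:", results[3])
print("Years max/sum: ", results[4])
```

Because the argument is any iterable of lines, an open file object could be passed in place of the split string, which keeps the one-pass, memory-friendly character of the original class.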

2.1.4 Problem: Counting characters, words, lines, and paragraphs

There is a wonderful utility under Unix-like systems called wc. What it does is so basic, and so obvious, that it is hard to imagine working without it. wc simply counts the characters, words, and lines of files (or STDIN). A few command-line options control which results are displayed, but I rarely use them.

In writing this chapter, I found myself on a system without wc, and felt a remedy was in order. The example below is actually an "enhanced" wc since it also counts paragraphs (but it lacks the command-line switches). Unlike the external wc, it is easy to use the technique directly within Python and is available anywhere Python is. The main trick, inasmuch as there is one, is a compact use of the "".join() and "".split() methods (string.join() and string.split() could also be used, for example, to be compatible with Python 1.5.2 or below).

wc.py

# Report the chars, words, lines, paragraphs
# on STDIN or in wildcard filename patterns
import sys, glob
if len(sys.argv) > 1:
    c, w, l, p = 0, 0, 0, 0
    for pat in sys.argv[1:]:
        for file in glob.glob(pat):
            s = open(file).read()
            wc = len(s), len(s.split()), \
                 len(s.split('\n')), len(s.split('\n\n'))
            print '\t'.join(map(str, wc)),'\t'+file
            c, w, l, p = c+wc[0], w+wc[1], l+wc[2], p+wc[3]
    wc = (c,w,l,p)
    print '\t'.join(map(str, wc)), '\tTOTAL'
else:
    s = sys.stdin.read()
    wc = len(s), len(s.split()), len(s.split('\n')), \
         len(s.split('\n\n'))
    print '\t'.join(map(str, wc)), '\tSTDIN'

This small piece of functionality could be wrapped up in a function, but it is almost too compact to bother with doing so. Most of the work is in the interaction with the shell environment, with the counting basically taking only two lines.

The solution above is quite likely the "one obvious way to do it," and therefore Pythonic. On the other hand, a slightly more adventurous reader might consider this assignment (if only for fun):

>>> wc  = map(len,[s]+map(s.split,(None,'\n','\n\n')))

A real daredevil might be able to reduce the entire program to a single print statement.
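For use by import rather than from the shell, the counting step can be isolated in a small function. This is a minimal sketch assuming Python 3; the name text_counts is hypothetical, and the splits are the same ones used in wc.py.

```python
# Sketch only: assumes Python 3; text_counts is a hypothetical helper name.

def text_counts(s):
    """Return (chars, words, lines, paragraphs) using the same splits as wc.py."""
    return (len(s), len(s.split()),
            len(s.split("\n")), len(s.split("\n\n")))

sample = "a b\nc\n\nd"
print("\t".join(map(str, text_counts(sample))))
```

As in wc.py, the line count is the number of pieces produced by splitting on newlines, so text that ends with a newline reports one more line than it has newline characters.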

2.1.5 Problem: Transmitting binary data as ASCII

Many channels require that the information that travels over them is 7-bit ASCII. Any bytes with a high-order first bit of one will be handled unpredictably when transmitting data over protocols like Simple Mail Transfer Protocol (SMTP), Network News Transfer Protocol (NNTP), or HTTP (depending on content encoding), or even just when displaying them in many standard tools like editors. In order to encode 8-bit binary data as ASCII, a number of techniques have been invented over time.

An obvious, but obese, encoding technique is to translate each binary byte into its hexadecimal digits. UUencoding is an older standard that developed around the need to transmit binary files over the Usenet and on BBSs. Binhex is a similar technique from the MacOS world. In recent years, base64 (which is specified by RFC1521) has edged out the other styles of encoding. UUencoding, binhex, and base64 are all basically 4/3 encodings (that is, four ASCII bytes are used to represent three binary bytes), but they differ somewhat in line ending and header conventions (as well as in the encoding as such). Quoted printable is yet another format, but of variable encoding length. In quoted printable encoding, most plain ASCII bytes are left unchanged, but a few special characters and all high-bit bytes are escaped.
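The size trade-offs described above are easy to check directly. The following sketch assumes Python 3 and uses the standard binascii, base64, and quopri modules; the sample data is arbitrary.

```python
# Sketch only: assumes Python 3 and the standard binascii, base64, and quopri modules.
import base64
import binascii
import quopri

data = bytes(range(256)) * 3          # 768 arbitrary binary bytes

hexed = binascii.hexlify(data)        # two ASCII bytes per input byte
b64 = base64.b64encode(data)          # four ASCII bytes per three input bytes
text = b"plain text, caf\xe9"         # mostly ASCII, with one high-bit byte
qp = quopri.encodestring(text)        # length depends on how much needs escaping

print(len(data), len(hexed), len(b64))
print(len(text), len(qp))
```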

Python provides modules for all the encoding styles mentioned. The high-level wrappers uu, binhex, base64, and quopri all operate on input and output file-like objects, encoding the data therein. They also each have slightly different method names and arguments. binhex, for example, closes its output file after encoding, which makes it unusable in conjunction with a cStringIO file-like object. All of the high-level encoders utilize the services of the low-level C module binascii. binascii, in turn, implements the actual low-level block conversions, but assumes that it will be passed the right size blocks for a given encoding.

The standard library, therefore, does not contain quite the right intermediate-level functionality for when the goal is just encoding the binary data in arbitrary strings. It is easy to wrap that up, though:

encode_binary.py

# Provide encoders for arbitrary binary data
# in Python strings.  Handles block size issues
# transparently, and returns a string.
# Precompression of the input string can reduce
# or eliminate any size penalty for encoding.

import sys
import zlib
import binascii

UU = 45
BASE64 = 57
BINHEX = sys.maxint

def ASCIIencode(s='', type=BASE64, compress=1):
    """ASCII encode a binary string"""
    # First, decide the encoding style
    if type == BASE64:   encode = binascii.b2a_base64
    elif type == UU:     encode = binascii.b2a_uu
    elif type == BINHEX: encode = binascii.b2a_hqx
    else: raise ValueError, "Encoding must be in UU, BASE64, BINHEX"
    # Second, compress the source if specified
    if compress: s = zlib.compress(s)
    # Third, encode the string, block-by-block
    offset = 0
    blocks = []
    while 1:
        blocks.append(encode(s[offset:offset+type]))
        offset += type
        if offset > len(s):
            break
    # Fourth, return the concatenated blocks
    return ''.join(blocks)

def ASCIIdecode(s='', type=BASE64, compress=1):
    """Decode ASCII to a binary string"""
    # First, decide the encoding style
    if type == BASE64:   s = binascii.a2b_base64(s)
    elif type == BINHEX: s = binascii.a2b_hqx(s)
    elif type == UU:
        s = ''.join([binascii.a2b_uu(line) for line in s.split('\n')])
    # Second, decompress the source if specified
    if compress: s = zlib.decompress(s)
    # Third, return the decoded binary string
    return s

# Encode/decode STDIN for self-test
if __name__ == '__main__':
    decode, TYPE = 0, BASE64
    for arg in sys.argv:
        if   arg.lower()=='-d': decode = 1
        elif arg.upper()=='UU': TYPE=UU
        elif arg.upper()=='BINHEX': TYPE=BINHEX
        elif arg.upper()=='BASE64': TYPE=BASE64
    if decode:
        print ASCIIdecode(sys.stdin.read(),type=TYPE)
    else:
        print ASCIIencode(sys.stdin.read(),type=TYPE)

The example above does not attach any headers or delimit the encoded block (by design); for that, a wrapper like uu, mimify, or MimeWriter is a better choice, as is a custom wrapper around encode_binary.py.
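The block-by-block approach carries over to later Python versions, where binary data is held in bytes objects rather than strings. The sketch below assumes Python 3 and covers only the base64 case; encode_blocks and decode_blocks are hypothetical names, not functions from encode_binary.py.

```python
# Sketch only: assumes Python 3 (bytes in, bytes out); function names are hypothetical.
import binascii
import zlib

BLOCK = 57   # 57 input bytes become one 76-character base64 line

def encode_blocks(data, compress=True):
    """Base64-encode bytes block by block, optionally compressing first."""
    if compress:
        data = zlib.compress(data)
    return b"".join(binascii.b2a_base64(data[i:i + BLOCK])
                    for i in range(0, len(data), BLOCK))

def decode_blocks(text, compress=True):
    """Reverse encode_blocks()."""
    data = binascii.a2b_base64(text)
    if compress:
        data = zlib.decompress(data)
    return data

original = bytes(range(256)) * 4
encoded = encode_blocks(original)
print(len(original), len(encoded))
assert decode_blocks(encoded) == original
```

Stepping through the input with range() avoids encoding an empty trailing block when the input length is an exact multiple of the block size.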

2.1.6 Problem: Creating word or letter histograms

A histogram is an analysis of the relative occurrence frequency of each of a number of possible values. In terms of text processing, the occurrences in question are almost always either words or byte values. Creating histograms is quite simple using Python dictionaries, but the technique is not always immediately obvious to people thinking about it. The example below is fairly general, provides several utility functions associated with histograms, and can be used in a command-line operation mode.

histogram.py

# Create occurrence counts of words or characters
# A few utility functions for presenting results
# Avoids requirement of recent Python features

from string import split, maketrans, translate, punctuation, digits
import sys
from types import *
import types

def word_histogram(source):
    """Create histogram of normalized words (no punct or digits)"""
    hist = {}
    trans = maketrans('','')
    if type(source) in (StringType,UnicodeType):  # String-like src
        for word in split(source):
            word = translate(word, trans, punctuation+digits)
            if len(word) > 0:
                hist[word] = hist.get(word,0) + 1
    elif hasattr(source,'read'):                  # File-like src
        try:
            from xreadlines import xreadlines     # Check for module
            for line in xreadlines(source):
                for word in split(line):
                    word = translate(word, trans, punctuation+digits)
                    if len(word) > 0:
                        hist[word] = hist.get(word,0) + 1
        except ImportError:                       # Older Python ver
            line = source.readline()          # Slow but mem-friendly
            while line:
                for word in split(line):
                    word = translate(word, trans, punctuation+digits)
                    if len(word) > 0:
                        hist[word] = hist.get(word,0) + 1
                line = source.readline()
    else:
        raise TypeError, \
              "source must be a string-like or file-like object"
    return hist

def char_histogram(source, sizehint=1024*1024):
    hist = {}
    if type(source) in (StringType,UnicodeType):  # String-like src
        for char in source:
            hist[char] = hist.get(char,0) + 1
    elif hasattr(source,'read'):                  # File-like src
        chunk = source.read(sizehint)
        while chunk:
            for char in chunk:
                hist[char] = hist.get(char,0) + 1
            chunk = source.read(sizehint)
    else:
        raise TypeError, \
              "source must be a string-like or file-like object"
    return hist

def most_common(hist, num=1):
    pairs = []
    for pair in hist.items():
        pairs.append((pair[1],pair[0]))
    pairs.sort()
    pairs.reverse()
    return pairs[:num]

def first_things(hist, num=1):
    pairs = []
    things = hist.keys()
    things.sort()
    for thing in things:
        pairs.append((thing,hist[thing]))
    pairs.sort()
    return pairs[:num]

if __name__ == '__main__':
    if len(sys.argv) > 1:
        hist = word_histogram(open(sys.argv[1]))
    else:
        hist = word_histogram(sys.stdin)

    print "Ten most common words:"
    for pair in most_common(hist, 10):
        print '\t', pair[1], pair[0]

    print "First ten words alphabetically:"
    for pair in first_things(hist, 10):
        print '\t', pair[0], pair[1]

    # a more practical command-line version might use:
    # for pair in most_common(hist,len(hist)):
    #     print pair[1],'\t',pair[0]

Several of the design choices are somewhat arbitrary. Words have all their punctuation stripped to identify "real" words. But on the other hand, words are still case-sensitive, which may not be what is desired. The sorting functions first_things() and most_common() only return an initial sublist. Perhaps it would be better to return the whole list, and let the user slice the result. It is simple to customize around these sorts of issues, though.
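To show how little code such customization takes, here is a sketch of a case-insensitive variant that returns the whole ranking and lets the caller slice it. It assumes Python 3 and the standard collections.Counter class, neither of which histogram.py uses; the name word_histogram_ci is hypothetical.

```python
# Sketch only: assumes Python 3 and collections.Counter; word_histogram_ci is a hypothetical name.
from collections import Counter
from string import punctuation, digits

# Translation table that deletes punctuation and digits
STRIP = str.maketrans("", "", punctuation + digits)

def word_histogram_ci(text):
    """Case-insensitive word counts with punctuation and digits removed."""
    words = (w.translate(STRIP).lower() for w in text.split())
    return Counter(w for w in words if w)

hist = word_histogram_ci("The cat saw the other cat; the 2nd cat ran.")
print(hist.most_common()[:2])   # full ranking, sliced by the caller
print(sorted(hist)[:3])         # first words alphabetically
```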

2.1.7 Problem: Reading a file backwards by record, line, or paragraph

Reading a file line by line is a common task in Python, or in most any language. Files like server logs, configuration files, structured text databases, and others frequently arrange information into logical records, one per line. Very often, the job of a program is to perform some calculation on each record in turn.

Python provides a number of convenient methods on file-like objects for such line-by-line reading. FILE.readlines() reads a whole file at once and returns a list of lines. The technique is very fast, but requires the whole contents of the file be kept in memory. For very large files, this can be a problem. FILE.readline() is memory-friendly (it just reads a line at a time and can be called repeatedly until the EOF is reached), but it is also much slower. The best solution for recent Python versions is xreadlines.xreadlines() or FILE.xreadlines() in Python 2.1+. These techniques are memory-friendly, while still being fast and presenting a "virtual list" of lines (by way of Python's new generator/iterator interface).
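In Python 3, a file object is itself a lazy iterator over its lines, so the same memory-friendly loop needs no helper function. The following sketch assumes Python 3; io.StringIO stands in for a real file so that the example is self-contained.

```python
# Sketch only: assumes Python 3; io.StringIO stands in for a real file object.
import io

fp = io.StringIO("first record\nsecond record\nthird record\n")

count = 0
for line in fp:              # one line at a time, not the whole file
    count += 1
    print(count, line, end="")
```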

The above techniques work nicely for reading a file in its natural order, but what if you want to start at the end of a file and work backwards from there? This need is frequently encountered when you want to read log files that have records appended over time (and when you want to look at the most recent records first). It comes up in other situations also. There is a very easy technique if memory usage is not an issue:

>>> open('lines','w').write('\n'.join([str(n) for n in range(100)]))
>>> fp = open('lines')
>>> lines = fp.readlines()
>>> lines.reverse()
>>> for line in lines[1:5]:
...     # Processing suite here
...     print line,
...
98
97
96
95

For large input files, however, this technique is not feasible. It would be nice to have something analogous to xreadlines here. The example below provides a good starting point (the example works equally well for file-like objects).

read_backwards.py

# Read blocks of a file from end to beginning.
# Blocks may be defined by any delimiter, but the
#  constants LINE and PARA are useful ones.
# Works much like the file object method '.readline()':
#  repeated calls continue to get "next" part, and
#  function returns empty string once BOF is reached.

# Define constants
from os import linesep
LINE = linesep
PARA = linesep*2
READSIZE = 1000

# Global variables
buffer = ''

def read_backwards(fp, mode=LINE, sizehint=READSIZE, _init=[0]):
    """Read blocks of file backwards (return empty string when done)"""
    # Trick of mutable default argument to hold state between calls
    if not _init[0]:
        fp.seek(0,2)
        _init[0] = 1
    # Find a block (using global buffer)
    global buffer
    while 1:
        # first check for block in buffer
        delim = buffer.rfind(mode)
        if delim <> -1:     # block is in buffer, return it
            block = buffer[delim+len(mode):]
            buffer = buffer[:delim]
            return block+mode
        #-- BOF reached, return remainder (or empty string)
        elif fp.tell()==0:
            block = buffer
            buffer = ''
            return block
        else:           # Read some more data into the buffer
            readsize = min(fp.tell(),sizehint)
            fp.seek(-readsize,1)
            buffer = fp.read(readsize) + buffer
            fp.seek(-readsize,1)

#-- Self test of read_backwards()
if __name__ == '__main__':
    # Let's create a test file to read in backwards
    fp = open('lines','wb')
    fp.write(LINE.join(['--- %d ---'%n for n in range(15)]))
    # Now open for reading backwards
    fp = open('lines','rb')
    # Read the blocks in, one per call (block==line by default)
    block = read_backwards(fp)
    while block:
        print block,
        block = read_backwards(fp)

Notice that anything could serve as a block delimiter. The constants provided just happened to work for lines and block paragraphs (and block paragraphs only with the current OS's style of line breaks). But other delimiters could be used. It would not be immediately possible to read backwards word-by-word; a space delimiter would come close, but would not be quite right for other whitespace. However, reading a line (and maybe reversing its words) is generally good enough.

Another enhancement is possible with Python 2.2+. Using the new yield keyword, read_backwards() could be programmed as an iterator rather than as a multi-call function. The performance will not differ significantly, but the function might be expressed more clearly (and a "list-like" interface like FILE.readlines() makes the application's loop simpler).
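One possible shape for such an iterator is sketched below, assuming Python 3 and a file opened in binary mode. It is only one of several reasonable designs. The parameter name delim and the use of io.BytesIO as a stand-in file are choices made for this example. State that read_backwards.py keeps in a global buffer and a mutable default argument becomes ordinary local variables here.

```python
# Sketch only: assumes Python 3 and a binary-mode, seekable file object.
import io

def read_backwards(fp, delim=b"\n", sizehint=1000):
    """Yield delimiter-terminated blocks of a file, last block first."""
    fp.seek(0, io.SEEK_END)
    pos = fp.tell()              # start of the part already buffered
    buffer = b""
    while True:
        idx = buffer.rfind(delim)
        if idx != -1:            # a complete block is in the buffer
            yield buffer[idx + len(delim):] + delim
            buffer = buffer[:idx]
        elif pos == 0:           # BOF reached, yield any remainder
            if buffer:
                yield buffer
            return
        else:                    # read some more data into the buffer
            readsize = min(pos, sizehint)
            pos -= readsize
            fp.seek(pos)
            buffer = fp.read(readsize) + buffer

data = b"\n".join(b"--- %d ---" % n for n in range(15))
for block in read_backwards(io.BytesIO(data)):
    print(block.decode("ascii"), end="")
```

Because each call creates its own local state, two files can be read backwards at the same time without interfering with each other.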

QUESTIONS

1:	Write a generator-based version of `read_backwards()` that uses the `yield` keyword. Modify the self-test code to utilize the generator instead.
2:	Explore and explain some pitfalls with the use of a mutable default value as a function argument. Explain also how the style allows functions to encapsulate data and contrast with the encapsulation of class instances.