4.3 Parser Libraries for Python

4.3.1 Specialized Parsers in the Standard Library

Python comes standard with a number of modules that perform specialized parsing tasks. A variety of custom formats are in sufficiently widespread use that it is convenient to have standard library support for them. Aside from those listed in this chapter, Chapter 5 discusses the email and xml packages, and the modules mailbox, HTMLParser, and urlparse, each of which performs parsing of sorts. A number of additional modules listed in Chapter 1, which handle and process audio and image formats, in a broad sense could be considered parsing tools. However, these media formats are better considered as byte streams and structures than as token streams of the sort parsers handle (the distinction is fine, though).

The specialized tools discussed under this section are presented only in summary. Consult the Python Library Reference for detailed documentation of their various APIs and features. It is worth knowing what is available, but for space reasons, this book does not document usage specifics of these few modules.

ConfigParser

Parse and modify Windows-style configuration files.

>>> import ConfigParser
>>> config = ConfigParser.ConfigParser()
>>> config.read(['test.ini','nonesuch.ini'])
>>> config.sections()
['userlevel', 'colorscheme']
>>> config.get('userlevel','login')
'2'
>>> config.set('userlevel','login',5)
>>> config.write(sys.stdout)
[userlevel]
login = 5
title = 1

[colorscheme]
background = red
foreground = blue
difflib
.../Tools/scripts/ndiff.py

The module difflib, introduced in Python 2.1, contains a variety of functions and classes to help you determine the difference and similarity of pairs of sequences. The API of difflib is flexible enough to work with sequences of all kinds, but the typical usage is in comparing sequences of lines or sequences of characters.

Word similarity is useful for determining likely misspellings and typos and/or edit changes required between strings. The function difflib.get_close_matches() is a useful way to perform "fuzzy matching" of a string against patterns. The required similarity is configurable.

>>> users = ['j.smith', 't.smith', 'p.smyth', 'a.simpson']
>>> maxhits = 10
>>> login = 'a.smith'
>>> difflib.get_close_matches(login, users, maxhits)
['t.smith', 'j.smith', 'p.smyth']
>>> difflib.get_close_matches(login, users, maxhits, cutoff=.75)
['t.smith', 'j.smith']
>>> difflib.get_close_matches(login, users, maxhits, cutoff=.4)
['t.smith', 'j.smith', 'p.smyth', 'a.simpson']

Line matching is similar to the behavior of the Unix diff (or ndiff) and patch utilities. The latter utility is able to take a source and a difference, and produce the second compared line-list (file). The functions difflib.ndiff() and difflib.restore() implement these capabilities. Much of the time, however, the bundled ndiff.py tool performs the comparisons you are interested in (and the "patches" with an -r# option).

%. ./ndiff.py chap4.txt chap4.txt~ | grep   '^[+-]'
-:  chap4.txt
+:  chap4.txt~
+      against patterns.
-     against patterns.  The required similarity is configurable.
-
-     >>> users = ['j.smith', 't.smith', 'p.smyth', 'a.simpson']
-     >>> maxhits = 10
-     >>> login = 'a.smith'

There are a few more capabilities in the difflib module, and considerable customization is possible.

formatter

Transform an abstract sequence of formatting events into a sequence of callbacks to "writer" objects. Writer objects, in turn, produce concrete outputs based on these callbacks. Several parent formatter and writer classes are contained in the module.

In a way, formatter is an "anti-parser"?that is, while a parser transforms a series of tokens into program events, formatter transforms a series of program events into output tokens.

The purpose of the formatter module is to structure creation of streams such as word processor file formats. The module htmllib utilizes the formatter module. The particular API details provide calls related to features like fonts, margins, and so on.

For highly structured output of prose-oriented documents, the formatter module is useful, albeit requiring learning a fairly complicated API. At the minimal level, you may use the classes included to create simple tools. For example, the following utility is approximately equivalent to lynx -dump:

urldump.py
#!/usr/bin/env python
import sys
from urllib import urlopen
from htmllib import HTMLParser
from formatter import AbstractFormatter, DumbWriter
if len(sys.argv) > 1:
    fpin = urlopen(sys.argv[1])
    parser = HTMLParser(AbstractFormatter(DumbWriter()))
    parser.feed(fpin.read())
    print '-----------------------------------------------'
    print fpin.geturl()
    print fpin.info()
else:
    print "No specified URL"

SEE ALSO: htmllib 285; urllib 388;

htmllib

Parse and process HTML files, using the services of sgmllib. In contrast to the HTMLParser module, htmllib relies on the user constructing a suitable "formatter" object to accept callbacks from HTML events, usually utilizing the formatter module. A formatter, in turn, uses a "writer" (also usually based on the formatter module). In my opinion, there are enough layers of indirection in the htmllib API to make HTMLParser preferable for almost all tasks.

SEE ALSO: HTMLParser 384; formatter 284; sgmllib 285;

multifile

The class multifile.MultiFile allows you to treat a text file composed of multiple delimited parts as if it were several files, each with their own FILE methods: .read(), .readline(), .readlines(), .seek(), and .tell() methods. In iterator fashion, advancing to the next virtual file is performed with the method multifile.MultiFile.next().

SEE ALSO: fileinput 61; mailbox 372; email.Parser 363; string.split() 142; file 15;

parser
symbol
token
tokenize

Interface to Python's internal parser and tokenizer. Although parsing Python source code is arguably a text processing task, the complexities of parsing Python are too specialized for this book.

robotparser

Examine a robots.txt access control file. This file is used by Web servers to indicate the desired behavior of automatic indexers and Web crawlers?all the popular search engines honor these requests.

sgmllib

A partial parser for SGML. Standard Generalized Markup Language (SGML) is an enormously complex document standard; in its full generality, SGML cannot be considered a format, but rather a grammar for describing concrete formats. HTML is one particular SGML dialect, and XML is (almost) a simplified subset of SGML.

Although it might be nice to have a Python library that handled generic SGML, sgmllib is not such a thing. Instead, sgmllib implements just enough SGML parsing to support HTML parsing with htmllib. You might be able to coax parsing an XML library out of sgmllib, with some work, but Python's standard XML tools are far more refined for this purpose.

SEE ALSO: htmllib 285; xml.sax 405;

shlex

A lexical analyzer class for simple Unix shell-like syntaxes. This capability is primarily useful to implement small command language within Python applications.

tabnanny

This module is generally used as a command-line script rather than imported into other applications. The module/script tabnanny checks Python source code files for mixed use of tabs and spaces within the same block. Behind the scenes, the Python source is fully tokenized, but normal usage consists of something like:

% /sw/lib/python2.2/tabnanny.py SCRIPTS/
SCRIPTS/cmdline.py 165 '\treturn 1\r\n'
'SCRIPTS/HTMLParser_stack.py': Token Error: ('EOF in
                                multi-line string', (3, 7))
SCRIPTS/outputters.py 18 '\tself.writer=writer\r\n'
SCRIPTS/txt2bookU.py 148 '\ttry:\n'

The tool is single purpose, but that purpose addresses a common pitfall in Python programming.

SEE ALSO: tokenize 285;

4.3.2 Low-Level State Machine Parsing

mx.TextToolsFast Text Manipulation Tools

Marc-Andre Lemburg's mx.TextTools is a remarkable tool that is a bit difficult to grasp the gestalt of. mx.TextTools can be blazingly fast and extremely powerful. But at the same time, as difficult as it might be to "get" the mindset of mx.TextTools, it is still more difficult to get an application written with it working just right. Once it is working, an application that utilizes mx.TextTools can process a larger class of text structures than can regular expressions, while simultaneously operating much faster. But debugging an mx.TextTools "tag table" can make you wish you were merely debugging a cryptic regular expression!

In recent versions, mx.TextTools has come in a larger package with eGenix.com's several other "mx Extensions for Python." Most of the other subpackages add highly efficient C implementations of datatypes not found in a base Python system.

mx.TextTools stands somewhere between a state machine and a full-fledged parser. In fact, the module SimpleParse, discussed below, is an EBNF parser library that is built on top of mx.TextTools. As a state machine, mx.TextTools feels like a lower-level tool than the statemachine module presented in the prior section. And yet, mx.TextTools is simultaneously very close to a high-level parser. This is how Lemburg characterizes it in the documentation accompanying mx.TextTools:

mxTextTools is an extension package for Python that provides several useful functions and types that implement high-performance text manipulation and searching algorithms in addition to a very flexible and extendable state machine, the Tagging Engine, that allows scanning and processing text based on low-level byte-code "programs" written using Python tuples. It gives you access to the speed of C without the need to do any compile and link steps every time you change the parsing description.

Applications include parsing structured text, finding and extracting text (either exact or using translation tables) and recombining strings to form new text.

The Python standard library has a good set of text processing tools. The basic tools are powerful, flexible, and easy to work with. But Python's basic text processing is not particularly fast. Mind you, for most problems, Python by itself is as fast as you need. But for a certain class of problems, being able to choose mx.TextTools is invaluable.

The unusual structure of mx.TextTools applications warrants some discussion of concrete usage. After a few sample applications are presented, a listing of mx.TextTools constants, commands, modifiers, and functions is given.

BENCHMARKS

A familiar computer-industry paraphrase of Mark Twain (who repeats Benjamin Disraeli) dictates that there are "Lies, Damn Lies, and Benchmarks." I will not argue with that and certainly do not want readers to put too great an import on the timings suggested. Nonetheless, in exploring mx.TextTools, I wanted to get some sense of just how fast it is. So here is a rough idea.

The second example below presents part of a reworked version of the state machine-based Txt2Html application reproduced in Appendix D. The most time-consuming aspect of Txt2Html is the regular expression replacements performed in the function Typography() for smart ASCII inline markup of words and phrases.

In order to get a timeable test case, I concatenated 110 copies of an article I wrote to get a file a bit over 2MB, and about 41k lines and 300k words. My test processes an entire input as one text block, first using an mx.TextTools version of Typography(), then using the re version.

Processing time of the same test file went from about 34 seconds to about 12 seconds on one slowish Linux test machine (running Python 1.5.2). In other words, mx.TextTools gave me about a 3x speedup over what I get with the re module. This speedup is probably typical, but particular applications might gain significantly more or less from use of mx.TextTools. Moreover, 34 seconds is a long time in an interactive application, but is not very long at all for a batch process done once a day, or once a week.

Example: Buyer/Order Report Parsing

Recall (or refer to) the sample report presented in the previous section "An Introduction to State Machines." A report contained a mixture of header material, buyer orders, and comments. The state machine we used looked at each successive line of the file and decided based on context whether the new line indicated a new state should start. It would be possible to write almost the same algorithm utilizing mx.TextTools only to speed up the decisions, but that is not what we will do.

A more representative use of mx.TextTools is to produce a concrete parse tree of the interesting components of the report document. In principle, you should be able to create a "grammar" that describes every valid "buyer report" document, but in practice using a mixed procedural/grammar approach is much easier, and more maintainable?at least for the test report.

An mx.TextTools tag table is a miniature state machine that either matches or fails to match a portion of a string. Matching, in this context, means that a "success" end state is reached, while nonmatching means that a "failure" end state is reached. Falling off the end of the tag table is a success state. Each individual state in a tag table tries to match some smaller construct by reading from the "read-head" and moving the read-head correspondingly. On either success or failure, program flow jumps to an indicated target state (which might be a success or failure state for the tag table as a whole). Of course, the jump target for success is often different from the jump target for failure?but there are only these two possible choices for jump targets, unlike the statemachine module's indefinite number.

Notably, one of the types of states you can include in a tag table is another tag table. That one state can "externally" look like a simple match attempt, but internally it might involve complex subpatterns and machine flow in order to determine if the state is a match or nonmatch. Much as in an EBNF grammar, you can build nested constructs for recognition of complex patterns. States can also have special behavior, such as function callbacks?but in general, an mx.TextTools tag table state is simply a binary match/nonmatch switch.

Let us look at an mx.TextTools parsing application for "buyer reports" and then examine how it works:

buyer_report.py
from mx.TextTools import *

word_set = set(alphanumeric+white+'-')
quant_set = set(number+'kKmM')

item   = ( (None, AllInSet, newline_set, +1),                # 1
           (None, AllInSet, white_set, +1),                  # 2
           ('Prod', AllInSet, a2z_set, Fail),                # 3
           (None, AllInSet, white_set, Fail),                # 4
           ('Quant', AllInSet, quant_set, Fail),             # 5
           (None, WordEnd, '\n', -5) )                       # 6

buyers = ( ('Order', Table,                                  # 1
                  ( (None, WordEnd, '\n>> ', Fail),          # 1.1
                    ('Buyer', AllInSet, word_set, Fail),     # 1.2
                    ('Item', Table, item, MatchOk, +0) ),    # 1.3
                  Fail, +0), )

comments = ( ('Comment', Table,                              # 1
                  ( (None, Word, '\n*', Fail),               # 1.1
                    (None, WordEnd, '*\n', Fail),            # 1.2
                    (None, Skip, -1) ),                      # 1.3
                  +1, +2),
             (None, Skip, +1),                               # 2
             (None, EOF, Here, -2) )                         # 3

def unclaimed_ranges(tagtuple):
    starts = [0] + [tup[2] for tup in tagtuple [1] ]
    stops = [tup[1] for tup in tagtuple[1]] + [tagtuple[2]]
    return zip(starts, stops)

def report2data(s):
    comtuple = tag(s, comments)
    taglist = comtuple[1]
    for beg,end in unclaimed_ranges(comtuple):
        taglist.extend(tag(s, buyers, beg, end)[1])
    taglist.sort(cmp)
    return taglist

if __name__=='__main__':
    import sys, pprint
    pprint.pprint(report2data(sys.stdin.read()))

Several tag tables are defined in buyer_report: item, buyers, and comments. State machines such as those in each tag table are general matching engines that can be used to identify patterns; after working with mx.TextTools for a while, you might accumulate a library of useful tag tables. As mentioned above, states in tag tables can reference other tag tables, either by name or inline. For example, buyers contains an inline tag table, while this inline tag table utilizes the tag table named item.

Let us take a look, step by step, at what the buyers tag table does. In order to do anything, a tag table needs to be passed as an argument to the mx.TextTools.tag() function, along with a string to match against. That is done in the report2data() function in the example. But in general, buyers?or any tag table?contains a list of states, each containing branch offsets. In the example, all such states are numbered in comments. buyers in particular contains just one state, which contains a subtable with three states.

Tag table state in buyers
  1. Try to match the subtable. If the match succeeds, add the name Order to the taglist of matches. If the match fails, do not add anything. If the match succeeds, jump back into the one state (i.e., +0). In effect, buyers loops as long as it succeeds, advancing the read-head on each such match.

Subtable states in buyers
  1. Try to find the end of the "word" \n>> in the string. That is, look for two greater-than symbols at the beginning of a line. If successful, move the read-head just past the point that first matched. If this state match fails, jump to Fail?that is, the (sub)table as a whole fails to match. No jump target is given for a successful match, so the default jump of +1 is taken. Since None is the tag object, do not add anything to the taglist upon a state match.

  2. Try to find some word_set characters. This set of characters is defined in buyer_report; various other sets are defined in mx.TextTools itself. If the match succeeds, add the name Buyer to the taglist of matches. As many contiguous characters in the set as possible are matched. The match is considered a failure if there is not at least one such character. If this state match fails, jump to Fail, as in state (1).

  3. Try to match the item tag table. If the match succeeds, add the name Item to the taglist of matches. What gets added, moreover, includes anything added within the item tag table. If the match fails, jump to MatchOk?that is, the (sub)table as a whole matches. If the match succeeds, jump +0?that is, keep looking for another Item to add to the taglist.

What buyer_report actually does is to first identify any comments, then to scan what is left in between comments for buyer orders. This approach proved easier to understand. Moreover, the design of mx.TextTools allows us to do this with no real inefficiency. Tagging a string does not involve actually pulling out the slices that match patterns, but simply identifying numerically the offset ranges where they occur. This approach is much "cheaper" than performing repeated slices, or otherwise creating new strings.

The following is important to notice: As of version 2.1.0, the documentation of the mx.TextTools.tag() function that accompanies mx.TextTools does not match its behavior! If the optional third and fourth arguments are passed to tag() they must indicate the start and end offsets within a larger string to scan, not the starting offset and length. Hopefully, later versions will fix the discrepancy (either approach would be fine, but could cause some breakage in existing code).

What buyer_report produces is a data structure, not final output. This data structure looks something like:

buyer_report.py data structure
$ python ex_mx.py < recs.tmp
[('Order', 0,  638,
  [('Buyer', 547, 562, None),
   ('Item', 562, 583,
    [('Prod', 566, 573, None), ('Quant', 579, 582, None)]),
   ('Item', 583, 602,
     [('Prod', 585, 593, None), ('Quant', 597, 601, None)]),
   ('Item', 602, 621,
     [('Prod', 604, 611, None), ('Quant', 616, 620, None)]),
   ('Item', 621, 638,
     [('Prod', 623, 632, None), ('Quant', 635, 637, None)])]),
 ('Comment', 638, 763, []),
 ('Order', 763, 805,
  [('Buyer', 768, 776, None),
   ('Item', 776, 792,
    [('Prod', 778, 785, None), ('Quant', 788, 791, None)]),
   ('Item', 792, 805,
    [('Prod', 792, 800, None), ('Quant', 802, 804, None)])]),
 ('Order', 805, 893,
  [('Buyer', 809, 829, None),
   ('Item', 829, 852,
    [('Prod', 833, 840, None), ('Quant', 848, 851, None)]),
   ('Item', 852, 871,
    [('Prod', 855, 863, None), ('Quant', 869, 870, None)]),
   ('Item', 871, 893,
    [('Prod', 874, 879, None), ('Quant', 888, 892, None)])]),
 ('Comment', 893, 952, []),
 ('Comment', 952, 1025, []),
 ('Comment', 1026, 1049, []),
 ('Order', 1049, 1109,
  [('Buyer', 1054, 1069, None),
   ('Item',1069, 1109,
    [('Prod', 1070, 1077, None), ('Quant', 1083, 1086, None)])])]

While this is "just" a new data structure, it is quite easy to deal with compared to raw textual reports. For example, here is a brief function that will create well-formed XML out of any taglist. You could even arrange for it to be valid XML by designing tag tables to match DTDs (see Chapter 5 for details about XML, DTDs, etc.):

def taglist2xml(s, taglist, root):
    print '<%s>' % root
    for tt in taglist:
        if tt[3] :
            taglist2xml(s, tt[3], tt[0])
        else:
            print '<%s>%s</%s>' % (tt[0], s[tt[1]:tt[2]], tt[0])
    print '</%s>' % root
Example: Marking up smart ASCII

The "smart ASCII" format uses email-like conventions to lightly mark features like word emphasis, source code, and URL links. This format?with graphics/latex.gif as an intermediate format?was used to produce the book you hold (which was written using a variety of plaintext editors). By obeying just a few conventions (that are almost the same as you would use on Usenet or in email), a writer can write without much clutter, but still convert to production-ready markup.

The Txt2Html utility uses a block-level state machine, combined with a collection of inline-level regular expressions, to identify and modify markup patterns in smart ASCII texts. Even though Python's regular expression engine is moderately slow, converting a five-page article takes only a couple seconds. In practice, Txt2Html is more than adequate for my own 20 kilobyte documents. However, it is easy to imagine a not-so-different situation where you were converting multimegabyte documents and/or delivering such dynamically converted content on a high-volume Web site. In such a case, Python's string operations, and especially regular expressions, would simply be too slow.

mx.TextTools can do everything regular expressions can, plus some things regular expressions cannot. In particular, a taglist can contain recursive references to matched patterns, which regular expressions cannot. The utility mxTypography.py utilizes several mx.TextTools capabilities the prior example did not use. Rather than create a nested data structure, mxTypography.py utilizes a number of callback functions, each responding to a particular match event. As well, mxTypography.py adds some important debugging techniques. Something similar to these techniques is almost required for tag tables that are likely to be updated over time (or simply to aid the initial development). Overall, this looks like a robust application should.

mx.TextTools version of Typography()
from mx.TextTools import *
import string, sys

#-- List of all words with  markup, head position, loop count
ws, head_pos, loops = [], None, 0

#-- Define "emitter" callbacks for each output format
def emit_misc(tl,txt,l,r,s):
    ws.append(txt[l:r])
def emit_func(tl,txt,l,r,s):
    ws.append('<code>'+txt[l+1:r-1]+'</code>')
def emit_modl(tl,txt,l,r,s):
    ws.append('<em><code>'+txt[l+1:r-1]+'</code></em>')
def emit_emph(tl,txt,l,r,s):
    ws.append('<em>'+txt[l+1:r-1]+'</em>')
def emit_strg(tl,txt,l,r,s):
    ws.append('<strong>'+txt[l+1:r-1]+'</strong>')
def emit_titl(tl,txt,l,r,s):
    ws.append('<cite>'+txt[l+1:r-1]+'</cite>')
def jump_count(tl,txt,l,r,s):
    global head_pos, loops
    loops = loops+1
    if head_pos is None: head_pos = r
    elif head_pos == r:
        raise "InfiniteLoopError", \
              txt[l-20:l]+'{'+txt[l]+'}'+txt[l+1:r+15]
    else: head_pos = r

#-- What can appear inside, and what can be, markups?
punct_set = set("'!@#$%^&*()_-+=|\{}[]:;'<>,.?/"+'"')
markable = alphanumeric+whitespace+"'!@#$%^&()+= |\{}:;<>,.?/"+'"'
markable_func = set(markable+"*-_[]")
markable_modl = set(markable+"*-_'")
markable_emph = set(markable+"*_'[]")
markable_strg = set(markable+"-_'[]")
markable_titl = set(markable+"*-'[]")
markup_set    = set("-*'[]_")

#-- What can precede and follow markup phrases?
darkins = '(/"'
leadins = whitespace+darkins      # might add from "-*'[]_"
darkouts = '/.),:;?!"'
darkout_set = set(darkouts)
leadouts = whitespace+darkouts    # for non-conflicting markup
leadout_set = set(leadouts)

#-- What can appear inside plain words?
word_set = set(alphanumeric+'{}/@#$%^&-_+= |\><'+darkouts)
wordinit_set = set(alphanumeric+"$#+\<.&{"+darkins)

#-- Define the word patterns (global so as to do it only at import)
# Special markup
def markup_struct(lmark, rmark, callback, markables, x_post="-"):
    struct = \
      ( callback, Table+CallTag,
        ( (None, Is, lmark),                 # Starts with left marker
          (None, AllInSet, markables),       # Stuff marked
          (None, Is, rmark),                 # Ends with right marker
          (None, IsInSet, leadout_set,+2,+1),# EITHR: postfix w/ leadout
          (None, Skip, -1,+1, MatchOk),      # ..give back trailng ldout
          (None, IsIn, x_post, MatchFail),   # OR: special case postfix
          (None, Skip, -1,+1, MatchOk)       # ..give back trailing char
        )
      )
    return struct
funcs   = markup_struct("'", "'", emit_func, markable_func)
modules = markup_struct("[", "]", emit_modl, markable_modl)
emphs   = markup_struct("-", "-", emit_emph, markable_emph, x_post="")
strongs = markup_struct("*", "*", emit_strg, markable_strg)
titles  = markup_struct("_", "_", emit_titl, markable_titl)

# All the stuff not specially marked
plain_words = \
 ( ws, Table+AppendMatch,           # AppendMatch only -slightly
   ( (None, IsInSet,                # faster than emit_misc callback
        wordinit_set, MatchFail),   # Must start with word-initial
     (None, Is, "'",+1),            # May have apostrophe next
     (None, AllInSet, word_set,+1), # May have more word-internal
     (None, Is, "'", +2),           # May have trailing apostrophe
     (None, IsIn, "st",+1),         # May have [ts] after apostrophe
     (None, IsInSet,
        darkout_set,+1, MatchOk),   # Postfixed with dark lead-out
     (None, IsInSet,
        whitespace_set, MatchFail), # Give back trailing whitespace
     (None, Skip, -1)
   ) )
# Catch some special cases
bullet_point = \
 ( ws, Table+AppendMatch,
   ( (None, Word+CallTag, "* "),       # Asterisk bullet is a word
   ) )
horiz_rule = \
 ( None, Table,
   ( (None, Word, "-"*50),             # 50 dashes in a row
     (None, AllIn, "-"),               # More dashes
   ) )
into_mark = \
 ( ws, Table+AppendMatch,             # Special case where dark leadin
   ( (None, IsInSet, set(darkins)),   #   is followed by markup char
     (None, IsInSet, markup_set),
     (None, Skip, -1)                 # Give back the markup char
   ) )
stray_punct = \
 ( ws, Table+AppendMatch,              # Pickup any cases where multiple
   ( (None, IsInSet, punct_set),       # punctuation character occur
     (None, AllInSet, punct_set),      # alone (followed by whitespace)
     (None, IsInSet, whitespace_set),
     (None, Skip, -1)                  # Give back the whitespace
   ) )
leadout_eater = (ws, AllInSet+AppendMatch, leadout_set)

#-- Tag all the (possibly marked-up) words
tag_words = \
 ( bullet_point+(+1,),
   horiz_rule + (+1,),
   into_mark  + (+1,),
   stray_punct+ (+1,),
   emphs   + (+1,),
   funcs   + (+1,),
   strongs + (+1,),
   modules + (+1,),
   titles  + (+1,),
   into_mark+(+1,),
   plain_words +(+1,),             # Since file is mstly plain wrds, can
   leadout_eater+(+1,-1),          # shortcut by tight looping (w/ esc)
   (jump_count, Skip+CallTag, 0),  # Check for infinite loop
   (None, EOF, Here, -13)          # Check for EOF
 )
def Typography(txt):
    global ws
    ws = []    # clear the list before we proceed
    tag(txt, tag_words, 0, len(txt), ws)
    return string.join(ws, '')

if __name__ == '__main__':
    print Typography(open(sys.argv[1]).read())

mxTypographify.py reads through a string and determines if the next bit of text matches one of the markup patterns in tag_words. Or rather, it better match some pattern or the application just will not know what action to take for the next bit of text. Whenever a named subtable matches, a callback function is called, which leads to a properly annotated string being appended to the global list ws. In the end, all such appended strings are concatenated.

Several of the patterns given are mostly fallback conditions. For example, the stray_punct tag table detects the condition where the next bit of text is some punctuation symbols standing alone without abutting any words. In most cases, you don't want smart ASCII to contain such a pattern, but mxTypographify has to do something with them if they are encountered.

Making sure that every subsequence is matched by some subtable or another is tricky. Here are a few examples of matches and failures for the stray_punct subtable. Everything that does not match this subtable needs to match some other subtable instead:

-- spam      # matches "--"
& spam       # fails at "AllInSet" since '&' advanced head
#@$ %% spam  # matches "#@$"
**spam       # fails (whitespace isn't encountered before 's')

After each success, the read-head is at the space right before the next word "spam" or "%%". After a failure, the read-head remains where it started out (at the beginning of the line).

Like stray_punct, emphs, funcs, strongs, plain_words, et cetera contain tag tables. Each entry in tag_words has its appropriate callback functions (all "emitters" of various names, because they "emit" the match, along with surrounding markup if needed). Most lines each have a "+1" appended to their tuple; what this does is specify where to jump in case of a match failure. That is, even if these patterns fail to match, we continue on?with the read-head in the same position?to try matching against the other patterns.

After the basic word patterns each attempt a match, we get to the "leadout eater" line. For mxTypography.py, a "leadout" is the opposite of a "leadin." That is, the latter are things that might precede a word pattern, and the former are things that might follow a word pattern. The leadout_set includes whitespace characters, but it also includes things like a comma, period, and question mark, which might end a word. The "leadout eater" uses a callback function, too. As designed, it preserves exactly the whitespace the input has. However, it would be easy to normalize whitespace here by emitting something other than the actual match (e.g., a single space always).

The jump_count is extremely important; we will come back to it momentarily. For now, it is enough to say that we hope the line never does anything.

The EOF line is our flow control, in a way. The call made by this line is to None, which is to say that nothing is actually done with any match. The command EOF is the important thing (Here is just a filler value that occupies the tuple position). It succeeds if the read-head is past the end of the read buffer. On success, the whole tag table tag_words succeeds, and having succeeded, processing stops. EOF failure is more interesting. Assuming we haven't reached the end of our string, we jump -13 states (to bullet_point). From there, the whole process starts over, hopefully with the read-head advanced to the next word. By looping back to the start of the list of tuples, we continue eating successive word patterns until the read buffer is exhausted (calling callbacks along the way).

The tag() call simply launches processing of the tag table we pass to it (against the read buffer contained in txt). In our case, we do not care about the return value of tag() since everything is handled in callbacks. However, in cases where the tag table does not loop itself, the returned tuple can be used to determine if there is reason to call tag() again with a tail of the read buffer.

DEBUGGING A TAG TABLE

Describing it is easy, but I spent a large number of hours finding the exact collection of tag tables that would match every pattern I was interested in without mismatching any pattern as something it wasn't. While smart ASCII markup seems pretty simple, there are actually quite a few complications (e.g., markup characters being used in nonmarkup contexts, or markup characters and other punctuation appearing in various sequences). Any structured document format that is complicated enough to warrant using mx.TextTools instead of string is likely to have similar complications.

Without question, the worst thing that can go wrong in a looping state pattern like the one above is that none of the listed states match from the current read-head position. If that happens, your program winds up in a tight infinite loop (entirely inside the extension module, so you cannot get at it with Python code directly). I wound up forcing a manual kill of the process countless times during my first brush at mx.TextTools development.

Fortunately, there is a solution to the infinite loop problem. This is to use a callback like jump_count.

mxTypography.py infinite loop catcher
def jump_count(taglist,txt,l,r,subtag):
    global head_pos
    if head_pos is None: head_pos = r
    elif head_pos == r:
        raise "InfiniteLoopError", \
              txt[1-20:1]+'{'+txt[1]+'}'+txt[l+1:r+15]
    else: head_pos = r

The basic purpose of jump_count is simple: We want to catch the situation where our tag table has been run through multiple times without matching anything. The simplest way to do this is to check whether the last read-head position is the same as the current. If it is, more loops cannot get anywhere, since we have reached the exact same state twice, and the same thing is fated to happen forever. mxTypography.py simply raises an error to stop the program (and reports a little bit of buffer context to see what is going on).

It is also possible to move the read-head manually and try again from a different starting position. To manipulate the read head in this fashion, you could use the Call command in tag table items. But a better approach is to create a nonlooping tag table that is called repeatedly from a Python loop. This Python loop can look at a returned tuple and use adjusted offsets in the next call if no match occurred. Either way, since much more time is spent in Python this way than with the loop tag table approach, less speed would be gained from mx.TextTools.

Not as bad as an infinite loop, but still undesirable, is having patterns within a tag table match when they are not supposed to or not match when they are suppose to (but something else has to match, or we would have an infinite loop issue). Using callbacks everywhere makes examining this situation much easier. During development, I frequently create temporary changes to my emit_* callbacks to print or log when certain emitters get called. By looking at output from these temporary print statements, most times you can tell where the problem lies.

CONSTANTS

The mx.TextTools module contains constants for a number of frequently used collections of characters. Many of these character classes are the same as ones in the string module. Each of these constants also has a set version predefined; a set is an efficient representation of a character class that may be used in tag tables and other mx.TextTools functions. You may also obtain a character set from a (custom) character class using the mx.TextTools.set() function:

>>> from mx.TextTools import a2z, set
>>> varname_chars = a2z + '_'
>>> varname_set = set(varname_chars)
mx.TextTools.a2z
mx.TextTools.a2z_set

English lowercase letters ("abcdefghijklmnopqrstuvwxyz").

mx.TextTools.A2Z
mx.TextTools.A2Z_set

English uppercase letters ("ABCDEFGHIJKLMNOPQRSTUVWXYZ").

mx.TextTools.umlaute
mx.TextTools.umlaute_set

Extra German lowercase hi-bit characters.

mx.TextTools.Umlaute
mx.TextTools.Umlaute_set

Extra German uppercase hi-bit characters.

mx.TextTools.alpha
mx.TextTools.alpha_set

English letters (A2Z + a2z).

mx.TextTools.german_alpha
mx.TextTools.german_alpha_set

German letters (A2Z + a2z + umlaute + Umlaute).

mx.TextTools.number
mx.TextTools.number_set

The decimal numerals ("0123456789").

mx.TextTools.alphanumeric
mx.TextTools.alphanumeric_set

English numbers and letters (alpha + number).

mx.TextTools.white
mx.TextTools.white_set

Spaces and tabs (" \t\v"). This is more restricted than string.whitespace.

mx.TextTools.newline
mx.TextTools.newline_set

Line break characters for various platforms ("\n\r").

mx.TextTools.formfeed
mx.TextTools.formfeed_set

Formfeed character ("\f").

mx.TextTools.whitespace
mx.TextTools.whitespace_set

Same as string.whitespace (white+newline+formfeed).

mx.TextTools.any
mx.TextTools.any_set

All characters (0x00-0xFF).

SEE ALSO: string.digits 130; string.hexdigits 130; string.octdigits 130; string.lowercase 131; string.uppercase 131; string.letters 131; string.punctuation 131; string.whitespace 131; string.printable 132;

COMMANDS

Programming in mx.TextTools amounts mostly to correctly configuring tag tables. Utilizing a tag table requires just one call to the mx.TextTools.tag(), but inside a tag table is a kind of mini-language?something close to a specialized Assembly language, in many ways.

Each tuple within a tag table contains several elements, of the form:

(tagobj, command[+modifiers], argument
         [,jump_no_match=MatchFail [,jump_match=+l]])

The "tag object" may be None, a callable object, or a string. If tagobj is None, the indicated pattern may match, but nothing is added to a taglist data structure if so, nor is a callback invoked. If a callable object (usually a function) is given, it acts as a callback for a match. If a string is used, it is used to name a part of the taglist data structure returned by a call to mx.TextTools.tag().

A command indicates a type of pattern to match, and a modifier can change the behavior that occurs in case of such a match. Some commands succeed or fail unconditionally, but allow you to specify behaviors to take if they are reached. An argument is required, but the specific values that are allowed and how they are interpreted depends on the command used.

Two jump conditions may optionally be specified. If no values are given, jump_no_match defaults to MatchFail?that is, unless otherwise specified, failing to match a tuple in a tag table causes the tag table as a whole to fail. If a value is given, jump_no_match branches to a tuple the specified number of states forward or backward. For clarity, an explicit leading "+" is used in forward branches. Branches backward will begin with a minus sign. For example:

# Branch forward one state if next character -is not- an X
# ... branch backward three states if it is an X
tupX = (None, Is, 'X', +1, -3)
# assume all the tups are defined somewhere...
tagtable = (tupA, tupB, tupV, tupW, tupX, tupY, tupZ)

If no value is given for jump_match, branching is one state forward in the case of a match.

Version 2.1.0 of mx.TextTools adds named jump targets, which are often easier to read (and maintain) than numeric offsets. An example is given in the mx.TextTools documentation:

tag_table = ('start',
             ('lowercase',AllIn,a2z,+1,'skip'),
             ('upper',AllIn,A2Z,'skip'),
             'skip',
             (None,AllIn,white+newline,+1),
             (None,AllNotIn,alpha+white+newline,+1),
             (None,EOF,Here,'start') )

It is easy to see that if you were to add or remove a tuple, it is less error prone to retain a jump to, for example, skip than to change every necessary +2 to a +3 or the like.

UNCONDITIONAL COMMANDS
mx.TextTools.Fail
mx.TextTools.Jump

Nonmatch at this tuple. Used mostly for documentary purposes in a tag table, usually with the Here or To placeholder. The tag tables below are equivalent:

table1 = ( ('foo', Is, 'X', MatchFail, MatchOk), )
table2 = ( ('foo', Is, 'X', +1, +2),
           ('Not_X', Fail, Here) )

The Fail command may be preferred if several other states branch to the same failure, or if the condition needs to be documented explicitly.

Jump is equivalent to Fail, but it is often better self-documenting to use one rather than the other; for example:

tup1 = (None, Fail, Here, +3)
tup2 = (None, Jump, To, +3)
mx.TextTools.Skip
mx.TextTools.Move

Match at this tuple, and change the read-head position. Skip moves the read-head by a relative amount, Move to an absolute offset (within the slice the tag table is operating on). For example:

# read-head forward 20 chars, jump to next state
tup1 = (None, Skip, 20)
# read-head to position 10, and jump back 4 states
tup2 = (None, Move, 10, 0, -4)

Negative offsets are allowed, as in Python list indexing.

MATCHING PARTICULAR CHARACTERS
mx.TextTools.AllIn
mx.TextTools.AllInSet
mx.TextTools.AllInCharSet

Match all characters up to the first that is not included in argument. AllIn uses a character string while AllInSet uses a set as argument. For version 2.1.0, you may also use AllInCharSet to match CharSet objects. In general, the set or CharSet form will be faster and is preferable. The following are functionally the same:

tup1 = ('xyz', AllIn, 'XYZxyz')
tup2 = ('xyz', AllInSet, set('XYZxyz')
tup3 = ('xyz', AllInSet, CharSet('XYZxyz'))

At least one character must match for the tuple to match.

mx.TextTools.AIINotIn

Match all characters up to the first that is included in argument. As of version 2.1.0, mx.TextTools does not include an AllNotInSet command. However, the following tuples are functionally the same (the second usually faster):

from mx.TextTools import AllNotIn, AllInSet, invset
tup1 = ('xyz', AllNotIn, 'XYZxyz')
tup2 = ('xyz', AllInSet, invset('xyzXYZ'))

At least one character must match for the tuple to match.

mx.TextTools.ls

Match specified character. For example:

tup = ('X', Is, 'X')
mx.TextTools.IsNot

Match any one character except the specified character.

tup = ('X', IsNot, 'X')
mx.TextTools.IsIn
mx.TextToo1s.IsInSet
mx.TextTools.IsInCharSet

Match exactly one character if it is in argument. IsIn uses a character string while IsInSet use a set as argument. For version 2.1.0, you may also use IsInCharSet to match CharSet objects. In general, the set or CharSet form will be faster and is preferable. The following are functionally the same:

tup1 = ('xyz', IsIn, 'XYZxyz')
tup2 = ('xyz', IsInSet, set('XYZxyz')
tup3 = ('xyz', IsInSet, CharSet('XYZxyz')
mx.TextTools.IsNotIn

Match exactly one character if it is not in argument. As of version 2.1.0, mx.TextTools does not include an 'AllNotInSet command. However, the following tuples are functionally the same (the second usually faster):

from mx.TextTools import IsNotIn, IsInSet, invset
tup1 = ('xyz', IsNotIn, 'XYZxyz')
tup2 = ('xyz', IsInSet, invset('xyzXYZ'))
MATCHING SEQUENCES
mx.TextTools.Word

Match a word at the current read-head position. For example:

tup = ('spam', Word, 'spam')
mx.TextTools.WordStart
mx.TextTools.sWordStart
mx.TextTools.WordEnd
mx.TextTools.sWordEnd

Search for a word, and match up to the point of the match. Searches performed in this manner are extremely fast, and this is one of the most powerful elements of tag tables. The commands sWordStart and sWordEnd use "search objects" rather than plaintexts (and are significantly faster).

WordStart and sWordStart leave the read-head immediately prior to the matched word, if a match succeeds. WordEnd and sWordEnd leave the read-head immediately after the matched word. On failure, the read-head is not moved for any of these commands.

>>> from mx.TextTools import *
>>> s = 'spam and eggs taste good'
>>> tab1 = ( ('toeggs', WordStart, 'eggs'), )
>>> tag(s, tab1)
(1, [('toeggs', 0, 9, None)], 9)
>>> s[0:9]
'spam and '
>>> tab2 = ( ('pasteggs', sWordEnd, BMS('eggs')), )
>>> tag(s, tab2)
(1, [('pasteggs', 0, 13, None)], 13)
>>> s[0:13]
'spam and eggs'

SEE ALSO: mx.TextTools.BMS() 307; mx.TextTools.sFindWord 303;

mx.TextTools.sFindWord

Search for a word, and match only that word. Any characters leading up to the match are ignored. This command accepts a search object as an argument. In case of a match, the read-head is positioned immediately after the matched word.

>>> from mx.TextTools import *
>>> s = 'spam and eggs taste good'
>>> tab3 = ( ('justeggs', sFindWord, BMS('eggs')), )
>>> tag(s, tab3)
(1, [('justeggs', 9, 13, None)], 13)
>>> s[9:13]
'eggs'

SEE ALSO: mx.TextTools.sWordEnd 302;

mx.TextTools.EOF

Match if the read-head is past the end of the string slice. Normally used with placeholder argument Here, for example:

tup = (None, EOF, Here)
COMPOUND MATCHES
mx.TextTools.Table
mx.TextTools.SubTable

Match if the table given as argument matches at the current read-head position. The difference between the Table and the SubTable commands is in where matches get inserted. When the Table command is used, any matches in the indicated table are nested in the data structure associated with the tuple. When SubTable is used, matches are written into the current level taglist. For example:

>>> from mx.TextTools import *
>>> from pprint import pprint
>>> caps = ('Caps', AllIn, A2Z)
>>> lower = ('Lower', AllIn, a2z)
>>> words = ( ('Word', Table, (caps, lower)),
...           (None, AllIn, whitespace, MatchFail, -1) )
>>> from pprint import pprint
>>> pprint(tag(s, words))
(0,
 [('Word', 0, 4, [('Caps', 0, 1, None), ('Lower', 1, 4, None)]),
  ('Word', 5, 19, [('Caps', 5, 6, None), ('Lower', 6, 19, None)]),
  ('Word', 20, 29, [('Caps', 20, 24, None), ('Lower', 24, 29, None)]),
  ('Word', 30, 35, [('Caps', 30, 32, None), ('Lower', 32, 35, None)])
 ],
 35)
>>> flatwords = ( (None, SubTable, (caps, lower)),
...               (None, AllIn, whitespace, MatchFail, -1) )
>>> pprint (tag(s, flatwords))
(0,
 [('Caps', 0, 1, None),
  ('Lower', 1, 4, None),
  ('Caps', 5, 6, None),
  ('Lower', 6, 19, None),
  ('Caps', 20, 24, None),
  ('Lower', 24, 29, None),
  ('Caps', 30, 32, None),
  ('Lower', 32, 35, None)],
 35)

For either command, if a match occurs, the read-head is moved to immediately after the match.

The special constant ThisTable can be used instead of a tag table to call the current table recursively.

mx.TextTools.TableInList
mx.TextTools.SubTableInList

Similar to Table and SubTable except that the argument is a tuple of the form (list_of_tables, index). The advantage (and the danger) of this is that a list is mutable and may have tables added after the tuple defined?in particular, the containing tag table may be added to list_of_tables to allow recursion. Note, however, that the special value ThisTable can be used with the Table or SubTable commands and is usually more clear.

SEE ALSO: mx.TextTools.Table 304; mx.TextTools.SubTable 304;

mx.TextTools.Call

Match on any computable basis. Essentially, when the Call command is used, control over parsing/matching is turned over to Python rather than staying in the mx.TextTools engine. The function that is called must accept arguments s, pos, and end?where s is the underlying string, pos is the current read-head position, and end is ending of the slice being processed. The called function must return an integer for the new read-head position; if the return is different from pos, the match is a success.

As an example, suppose you want to match at a certain point only if the next N characters make up a dictionary word. Perhaps an efficient stemmed data structure is used to represent the dictionary word list. You might check dictionary membership with a tuple like:

tup = ('DictWord', Call, inDict)

Since the function inDict is written in Python, it will generally not operate as quickly as does an mx.TextTools pattern tuple.

mx.TextTools.CallArg

Same as Call, except CallArg allows passing additional arguments. For example, suppose the dictionary example given in the discussion of Call also allows you to specify language and maximum word length for a match:

tup = ('DictWord', Call, (inDict,['English',10]))

SEE ALSO: mx.TextTools.Call 305;

MODIFIERS
mx.TextTools.CallTag

Instead of appending (tagobj, l, r, subtags) to the taglist upon a successful match, call the function indicated as the tag object (which must be a function rather than None or a string). The function called must accept the arguments taglist, s, start, end, and subtags?where taglist is the present taglist, s is the underlying string, start and end are the slice indices of the match, and subtags is the nested taglist. The function called may, but need not, append to or modify taglist or subtags as part of its action. For example, a code parsing application might include:

>>> def todo_flag(taglist, s, start, end, subtags):
...     sys.stderr.write("Fix issue at offset %d\n" % start)
...
>>> tup = (todo_flag, Word+CallTag, 'XXX')
>>> tag('XXX more stuff', (tup,))
Fix issue at offset 0
(1, [], 3)
mx.TextTools.AppendMatch

Instead of appending (tagobj,start,end,subtags) to the taglist upon successful matching, append the match found as string. The produced taglist is "flattened" and cannot be used in the same manner as "normal" taglist data structures. The flat data structure is often useful for joining or for list processing styles.

>>> from mx.TextTools import *
>>> words = (('Word', AllIn+AppendMatch, alpha),
...          (None, AllIn, whitespace, MatchFail, -1))
>>> tag('this and that', words)
(0, ['this', 'and', 'that'], 13)
>>> join(tag('this and that', words)[1], '-')
'this-and-that'

SEE ALSO: string.split() 142;

mx.TextTools.AppendToTagobj

Instead of appending (tagobj,start,end,subtags) to the taglist upon successful matching, call the .append() method of the tag object. The tag object must be a list