22.2 The htmllib Module

The htmllib module supplies a class named HTMLParser that subclasses SGMLParser and defines start_tag, do_tag, and end_tag methods for tags defined in HTML 2.0. HTMLParser implements and overrides methods in terms of calls to methods of a formatter object, covered later in this chapter. You can subclass HTMLParser to add or override methods. In addition to the start_tag, do_tag, and end_tag methods, an instance h of HTMLParser supplies the following attributes and methods.



h.anchor_bgn(href,name,type)

Called for each <a> tag. href, name, and type are the string values of the tag's attributes with the same names. HTMLParser's implementation of anchor_bgn maintains a list of outgoing hyperlinks (i.e., the href arguments of calls to h.anchor_bgn) in an instance attribute named h.anchorlist.


h.anchor_end(  )

Called for each </a> end tag. HTMLParser's implementation of anchor_end emits to the formatter a footnote reference that is an index within h.anchorlist. In other words, by default, HTMLParser asks the formatter to format an <a>/</a> tag pair as the text inside the tag, followed by a footnote reference number that points to the URL in the <a> tag. Of course, it's up to the formatter to deal with this formatting request.


The h.anchorlist attribute contains the list of outgoing hyperlink URLs built by h.anchor_bgn.


The h.formatter attribute is the formatter object f associated with h, which you pass as the only argument when you instantiate HTMLParser(f).



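h.handle_image(source,alt,ismap,align,width,height)

Called for each <img> tag. Each argument is the string value of the corresponding attribute of the tag (source holds the value of the src attribute). HTMLParser's implementation of handle_image calls h.handle_data(alt).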



The h.nofill attribute is false when the parser is collapsing whitespace, the normal case. It is true when the parser must preserve whitespace, typically within a <pre> tag.


h.save_bgn(  )

Diverts data to an internal buffer instead of passing it to the formatter, until the next call to h.save_end( ). h has only one buffer, so you cannot nest save_bgn calls.


h.save_end(  )

Returns a string with all data in the internal buffer, and directs data back to the formatter from now on. If save_bgn state was not on, raises TypeError.
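The single-buffer protocol of save_bgn and save_end can be illustrated with a small stand-in class. This is only a toy model of the behavior described above, not htmllib's code: the class name DivertingSink and the attribute passed_on are hypothetical, and passed_on merely stands in for the data that a real parser would hand to its formatter.

```python
class DivertingSink(object):
    """Toy model (hypothetical, not htmllib code) of save_bgn/save_end."""

    def __init__(self):
        self.passed_on = []    # stands in for data sent on to the formatter
        self._buffer = None    # None means "not currently diverting"

    def save_bgn(self):
        # Start diverting. There is only one buffer, so in this toy a
        # second call simply discards whatever was collected so far.
        self._buffer = ''

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer += data           # diverted into the buffer
        else:
            self.passed_on.append(data)    # normal path

    def save_end(self):
        if self._buffer is None:
            raise TypeError('save_end called while not saving')
        data, self._buffer = self._buffer, None
        return data


sink = DivertingSink()
sink.handle_data('before ')
sink.save_bgn()
sink.handle_data('Page ')
sink.handle_data('title')
title = sink.save_end()    # the diverted text, as one string
sink.handle_data('after')
```

A typical use of the real methods follows the same shape: call save_bgn when a tag such as <title> starts, and call save_end when it ends to collect the text in between.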

22.2.1 The formatter Module

The formatter module defines formatter and writer classes. You instantiate a formatter by passing to the class a writer instance, and then you pass the formatter instance to class HTMLParser of module htmllib. You can define your own formatters and writers by subclassing formatter's classes and overriding methods appropriately, but I do not cover this advanced and rarely used possibility in this book. An application with special output requirements would typically define an appropriate writer, subclassing AbstractWriter and overriding all methods, and use class AbstractFormatter without needing to subclass it. Module formatter supplies the following classes.


class AbstractFormatter(writer)

The standard formatter implementation, suitable for most tasks.


class AbstractWriter(  )

A writer implementation that prints each of its method names when called, suitable for debugging purposes only.


class DumbWriter(file=sys.stdout,maxcol=72)

A writer implementation that emits text to file object file, with word wrapping to ensure that no text line is longer than maxcol characters.


class NullFormatter(writer=None)

A formatter implementation whose methods are do-nothing stubs. When writer is None, instantiates NullWriter. Suitable when you subclass HTMLParser to analyze an HTML document but don't want any output to happen.


class NullWriter(  )

A writer implementation whose methods are do-nothing stubs.

22.2.2 The htmlentitydefs Module

The htmlentitydefs module supplies just one attribute, a dictionary named entitydefs that maps each entity defined in HTML 2.0 to the corresponding string in the ISO-8859-1 (also known as Latin-1) encoding. Module htmllib uses module htmlentitydefs internally.
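As a minimal sketch of a direct lookup in entitydefs, the following snippet maps an entity name (without the leading & and trailing ;) to its replacement text. The helper name expand_entity is hypothetical, and the fallback import is an assumption added so the snippet also runs on Python 3, where the same dictionary lives in module html.entities.

```python
try:
    from htmlentitydefs import entitydefs    # Python 2, as in this chapter
except ImportError:
    from html.entities import entitydefs     # Python 3 location of the same dictionary


def expand_entity(name, default=None):
    """Return the replacement text for entity name, or default if unknown."""
    return entitydefs.get(name, default)


amp = expand_entity('amp')                    # a known entity
unknown = expand_entity('nosuchentity', '?')  # unknown names yield the default
```

Using the dictionary's get method rather than indexing avoids a KeyError for entity names that HTML does not define.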

22.2.3 Parsing HTML with htmllib

The following example uses htmllib to perform the same task as in the previous example for sgmllib, fetching a page from the Web with urllib, parsing it, and outputting the hyperlinks:

import htmllib, formatter, urllib, urlparse

p = htmllib.HTMLParser(formatter.NullFormatter(  ))
f = urllib.urlopen('http://www.python.org/index.html')
BUFSIZE = 8192
while True:
    # read the page in chunks and feed each chunk to the parser
    data = f.read(BUFSIZE)
    if not data: break
    p.feed(data)
p.close(  )

seen = {}
for url in p.anchorlist:
    if url in seen: continue
    seen[url] = True
    pieces = urlparse.urlparse(url)
    if pieces[0] == 'http':
        print urlparse.urlunparse(pieces)

The example exploits the anchorlist attribute of class htmllib.HTMLParser, and therefore does not need to perform any subclassing. htmllib.HTMLParser builds the anchorlist attribute as it parses the HTML page, so the code need only loop on the list and work with the list's items, each a relevant URL.
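The loop over the list does not depend on the parser at all, so it can be factored into a function that accepts any sequence of URL strings, such as p.anchorlist. The following is only a sketch: the function name unique_http_urls and the sample list are hypothetical, and the fallback import is an assumption added so the snippet also runs on Python 3. Unlike the example above, it returns each URL unchanged rather than rebuilding it with urlunparse.

```python
try:
    from urlparse import urlparse         # Python 2, as in this chapter
except ImportError:
    from urllib.parse import urlparse     # Python 3 location


def unique_http_urls(urls):
    """Return the http URLs in urls, in first-seen order, without duplicates."""
    seen = set()
    result = []
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        # item 0 of the parse result is the URL scheme
        if urlparse(url)[0] == 'http':
            result.append(url)
    return result


# hypothetical stand-in for a parser's anchorlist
sample = [
    'http://www.python.org/',
    'ftp://ftp.python.org/pub/',
    'http://www.python.org/',
    'mailto:someone@example.com',
    'http://www.python.org/doc/',
]
links = unique_http_urls(sample)
```

Returning a list instead of printing inside the loop lets the caller decide whether to print, sort, or further filter the links.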
