The htmllib module
supplies a class named HTMLParser that subclasses
SGMLParser and defines
start_tag,
do_tag, and
end_tag methods for
tags defined in HTML 2.0. HTMLParser implements
and overrides methods in terms of calls to methods of a formatter
object, covered later in this chapter. You can subclass
HTMLParser to add or override methods. In addition
to the start_tag,
do_tag, and
end_tag methods, an
instance h of
HTMLParser supplies the following attributes and
methods.
h.anchor_bgn(href,name,type)
Called for each
<a> tag. href,
name, and type
are the string values of the tag's attributes with
the same names. HTMLParser's
implementation of anchor_bgn maintains a list of
outgoing hyperlinks (i.e., href arguments
of method h.anchor_bgn)
in an instance attribute named
h.anchorlist.
h.anchor_end( )
Called for each </a> end tag.
HTMLParser's implementation of
anchor_end emits to the formatter a footnote
reference that is an index within
h.anchorlist. In other
words, by default, HTMLParser asks the formatter
to format an
<a>/</a> tag pair
as the text inside the tag, followed by a footnote reference number
that points to the URL in the <a> tag. Of
course, it's up to the formatter to deal with this
formatting request.
The
h.anchorlist attribute
contains the list of outgoing hyperlink URLs built by
h.anchor_bgn.
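The bookkeeping performed by anchor_bgn and anchor_end can be modeled in a few lines. The following is only a minimal sketch, not htmllib's source: class AnchorTracker, its output list (standing in for the formatter), and the "[n]" footnote format are assumptions made for illustration.

```python
class AnchorTracker(object):
    """Hypothetical stand-in that mimics the anchor bookkeeping described above."""

    def __init__(self):
        self.anchorlist = []     # every href seen so far, in document order
        self.output = []         # stands in for text passed on to the formatter
        self._in_anchor = False  # true between anchor_bgn and anchor_end

    def handle_data(self, data):
        self.output.append(data)

    def anchor_bgn(self, href, name, type):
        # Record the outgoing hyperlink; only anchors with an href get a footnote.
        self._in_anchor = bool(href)
        if href:
            self.anchorlist.append(href)

    def anchor_end(self):
        # Emit a footnote reference that identifies the URL just recorded
        # (assumed here to be a 1-based position within anchorlist).
        if self._in_anchor:
            self.handle_data('[%d]' % len(self.anchorlist))
            self._in_anchor = False


t = AnchorTracker()
t.anchor_bgn('http://www.python.org/', '', '')
t.handle_data('Python')
t.anchor_end()
result = ''.join(t.output)   # 'Python[1]'
```

The text inside the tag pair comes first, then the footnote reference. How that reference is finally rendered remains up to the formatter.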
The h.formatter
attribute is the formatter object f
associated with h, which you pass as the
only argument when you instantiate
HTMLParser(f).
h.handle_image(source,alt,ismap='',align='',width='',height='')
Called for each <img> tag. Each argument is
the string value of the tag's attribute of the same
name. HTMLParser's implementation
of handle_image calls
h.handle_data(alt).
The
h.nofill attribute is
false when the parser is collapsing whitespace, the normal case. It
is true when the parser must preserve whitespace, typically within a
<pre> tag.
h.save_bgn( )
Diverts data to an internal buffer instead of passing it to the
formatter, until the next call to
h.save_end( ).
h has only one buffer, so you cannot nest
save_bgn calls.
h.save_end( )
Returns a string with all data in the internal buffer, and directs
data back to the formatter from now on. If no call to
save_bgn is in effect, raises
TypeError.
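The diversion performed by save_bgn and save_end amounts to a single optional buffer. Here is a rough model of that behavior under stated assumptions: class SaveBuffer, its sent list (standing in for the formatter), and the explicit TypeError check are all hypothetical and are not taken from htmllib itself.

```python
class SaveBuffer(object):
    """Hypothetical model of the single save buffer described above."""

    def __init__(self):
        self.savedata = None   # None means data flows to the formatter
        self.sent = []         # stands in for data passed to the formatter

    def handle_data(self, data):
        if self.savedata is not None:
            self.savedata += data      # diverted to the internal buffer
        else:
            self.sent.append(data)     # normal path

    def save_bgn(self):
        # Only one buffer exists, so a nested call would discard earlier data.
        self.savedata = ''

    def save_end(self):
        if self.savedata is None:
            raise TypeError('save_end called while no save_bgn is in effect')
        data, self.savedata = self.savedata, None
        return data


b = SaveBuffer()
b.handle_data('before ')
b.save_bgn()
b.handle_data('Page ')
b.handle_data('title')
title = b.save_end()       # 'Page title'
b.handle_data('after')     # flows to the formatter stand-in again
```

A typical use of this pattern is collecting the text of an element such as the document title without letting it reach the output.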
22.2.1 The formatter Module
The formatter module
defines formatter and writer classes. You instantiate a formatter by
passing to the class a writer instance, and then you pass the
formatter instance to class HTMLParser of module
htmllib. You can define your own formatters and
writers by subclassing
formatter's classes and
overriding methods appropriately, but I do not cover this advanced
and rarely used possibility in this book. An application with special
output requirements would typically define an appropriate writer,
subclassing AbstractWriter and overriding all
methods, and use class AbstractFormatter without
needing to subclass it. Module formatter supplies
the following classes.
class AbstractFormatter(writer)
The standard formatter implementation, suitable for most tasks.
class AbstractWriter( )
A writer implementation that prints each of its method names when
called, suitable for debugging purposes
only.
class DumbWriter(file=sys.stdout,maxcol=72)
A writer implementation that emits text to file object
file, with word wrapping to ensure that no
text line is longer than maxcol
characters.
class NullFormatter(writer=None)
A formatter implementation whose methods are do-nothing stubs. When
writer is None,
instantiates NullWriter. Suitable when you
subclass HTMLParser to analyze an HTML document
but don't want any output to happen.
class NullWriter( )
A writer implementation whose methods are do-nothing stubs.
22.2.2 The htmlentitydefs Module
The htmlentitydefs
module supplies just one attribute, a dictionary named
entitydefs that maps each entity defined in HTML
2.0 to the corresponding string in the ISO-8859-1 (also known as
Latin-1) encoding. Module htmllib uses module
htmlentitydefs
internally.
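A quick look at the dictionary shows the kind of mapping involved. This sketch assumes only that the module may need to be imported under its Python 3 name, html.entities, where htmlentitydefs is not available. The helper expand is hypothetical and exists only for illustration.

```python
try:
    import htmlentitydefs                    # module name used in this chapter
except ImportError:
    import html.entities as htmlentitydefs   # the same module under Python 3


def expand(name):
    """Hypothetical helper: return the text for entity name, or the
    unexpanded reference when the entity is not defined."""
    return htmlentitydefs.entitydefs.get(name, '&%s;' % name)


amp = expand('amp')            # '&'
lt = expand('lt')              # '<'
unknown = expand('nosuchent')  # '&nosuchent;'
```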
22.2.3 Parsing HTML with htmllib
The
following example uses htmllib to perform the same
task as in the previous example for sgmllib,
fetching a page from the Web with urllib, parsing
it, and outputting the hyperlinks:
import htmllib, formatter, urllib, urlparse

p = htmllib.HTMLParser(formatter.NullFormatter( ))
f = urllib.urlopen('http://www.python.org/index.html')

BUFSIZE = 8192
while True:
    data = f.read(BUFSIZE)
    if not data: break
    p.feed(data)
p.close( )

seen = {}
for url in p.anchorlist:
    if url in seen: continue
    seen[url] = True
    pieces = urlparse.urlparse(url)
    if pieces[0] == 'http':
        print urlparse.urlunparse(pieces)
The example exploits the anchorlist attribute of
class htmllib.HTMLParser, and therefore does not
need to perform any subclassing.
htmllib.HTMLParser builds the
anchorlist attribute as it parses the HTML page,
so the code need only loop on the list and work with the
list's items, each a relevant URL.
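Module htmllib does not exist in Python 3, so readers on a newer interpreter may find a rough equivalent useful. The sketch below swaps in html.parser.HTMLParser, which keeps no anchorlist of its own; class LinkCollector and function unique_http_links are hypothetical names, and the sample markup replaces the network fetch so that the code runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, urlunparse


class LinkCollector(HTMLParser):
    """Hypothetical subclass that rebuilds an anchorlist-style attribute."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.anchorlist.append(href)


def unique_http_links(html_text):
    """Return each distinct http URL found in html_text, in document order."""
    p = LinkCollector()
    p.feed(html_text)
    p.close()
    seen = set()
    links = []
    for url in p.anchorlist:
        if url in seen:
            continue
        seen.add(url)
        pieces = urlparse(url)
        if pieces[0] == 'http':
            links.append(urlunparse(pieces))
    return links


sample = ('<a href="http://www.python.org/doc/">Docs</a> '
          '<a href="mailto:someone@example.com">Mail</a> '
          '<a href="http://www.python.org/doc/">Docs again</a>')
links = unique_http_links(sample)   # ['http://www.python.org/doc/']
```

Unlike the htmllib version, this one does need a subclass, because the list of links must be built by hand.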