5.2 World Wide Web Applications

5.2.1 Common Gateway Interface

cgi • Support for Common Gateway Interface scripts

The module cgi provides a number of helpful tools for creating CGI scripts. There are two elements to CGI, basically: (1) Reading query values. (2) Writing the results back to the requesting browser. The first of these elements is aided by the cgi module, the second is just a matter of formatting suitable text to return. The cgi module contains one class that is its primary interface; it also contains several utility functions that are not documented here because their use is uncommon (and not hard to replicate and customize for your specific needs). See the Python Library Reference for details on the utility functions.

A CGI PRIMER

A primer on the Common Gateway Interface is in order. A CGI script is just an application, written in any programming language, that runs on a Web server. The server software recognizes a request for a CGI application, sets up a suitable environment, then passes control to the CGI application. By default, this is done by spawning a new process space for the CGI application to run in, but technologies like FastCGI and mod_python perform some tricks to avoid extra process creation. These latter techniques speed performance but change little from the point of view of the CGI application creator.

A Python CGI script is called in exactly the same way any other URL is. The only difference between a CGI and a static URL is that the former is marked as executable by the Web server. Conventionally, such scripts are confined to a ./cgi-bin/ subdirectory (sometimes another directory name is used); Web servers generally allow you to configure where CGI scripts may live. When a CGI script runs, it is expected to output a Content-Type header to STDOUT, followed by a blank line, then finally some content of the appropriate type, most often an HTML document. That is really all there is to it.

CGI requests may utilize one of two methods: POST or GET. A POST request sends any associated query data to the STDIN of the CGI script (the Web server sets this up for the script). A GET request puts the query in an environment variable called QUERY_STRING. There is not a lot of difference between the two methods, but GET requests encode their query information in a Uniform Resource Identifier (URI) and may therefore be composed without HTML forms and saved/bookmarked. For example, the following is an HTTP GET query to a script example discussed below:

<http://gnosis.cx/cgi-bin/simple.cgi?this=that&spam=eggs+are+good>
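To make the encoding concrete, the query portion of that URL can be decoded by hand. The following is only a minimal sketch, not part of the scripts in this section; it assumes Python 3, where the relevant helpers live in urllib.parse.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://gnosis.cx/cgi-bin/simple.cgi?this=that&spam=eggs+are+good"

# Everything after the '?' is what a server would place in QUERY_STRING
query_string = urlsplit(url).query

# parse_qs() maps each field name to a list of values, because a
# field name may legitimately occur more than once in a query
fields = parse_qs(query_string)

print(query_string)
print(fields)
```

Note that the + signs in the query are decoded back to spaces.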

You do not actually need the cgi module to create CGI scripts. For example, let us look at the script simple.cgi mentioned above:

simple.cgi

#!/usr/bin/python
import os,sys
print "Content-Type: text/html"
print
print "<html><head><title>Environment test</title></head><body><pre>"
for k,v in os.environ.items():
    print k, "::",
    if len(v)<=40: print v
    else:          print v[:37]+"..."
print "&lt;STDIN&gt; ::", sys.stdin.read()
print "</pre></body></html>"

I happen to have composed the above sample query by hand, but you will often call a CGI script from another Web page. Here is one that does so:

http://gnosis.cx/simpleform.html

<html><head><title>Test simple.cgi</title></head><body>
<form action="cgi-bin/simple.cgi" method="GET" name="form">
<input type="hidden" name="this" value="that">
<input type="text" value="" name="spam" size="55" maxlength="256">
<input type="submit" value="GET">
</form>
<form action="cgi-bin/simple.cgi" method="POST" name="form">
<input type="hidden" name="this" value="that">
<input type="text" value="" name="spam" size="55" maxlength="256">
<input type="submit" value="POST">
</form>
</body></html>

It turns out that the script simple.cgi is moderately useful; it tells the requester exactly what it has to work with. For example, the query above (which could be generated exactly by the GET form on simpleform.html) returns a Web page that looks like the one below (edited):

DOCUMENT_ROOT :: /www/gnosis
HTTP_ACCEPT_ENCODING :: gzip, deflate, compress;q=0.9
CONTENT_TYPE :: application/x-www-form-urlencoded
SERVER_PORT :: 80
REMOTE_ADDR :: 151.203.xxx.xxx
SERVER_NAME :: www.gnosis.cx
HTTP_USER_AGENT :: Mozilla/5.0 (Macintosh; U; PPC Mac OS...
REQUEST_URI :: /cgi-bin/simple.cgi?this=that&spam=eg...
QUERY_STRING :: this=that&spam=eggs+are+good
SERVER_PROTOCOL :: HTTP/1.1
HTTP_HOST :: gnosis.cx
REQUEST_METHOD :: GET
SCRIPT_NAME :: /cgi-bin/simple.cgi
SCRIPT_FILENAME :: /www/gnosis/cgi-bin/simple.cgi
HTTP_REFERER :: http://gnosis.cx/simpleform.html
<STDIN> ::

A few environment variables have been omitted, and those available will differ between Web servers and setups. The most important variable is QUERY_STRING; you may perhaps want to make other decisions based on the requesting REMOTE_ADDR, HTTP_USER_AGENT, or HTTP_REFERER (yes, the variable name is spelled wrong). Notice that STDIN is empty in this case. However, using the POST form on the sample Web page will give a slightly different response (trimmed):

CONTENT_LENGTH :: 28
REQUEST_URI :: /cgi-bin/simple.cgi
QUERY_STRING ::
REQUEST_METHOD :: POST
<STDIN> :: this=that&spam=eggs+are+good

The CONTENT_LENGTH environment variable is new, QUERY_STRING has become empty, and STDIN contains the query. The rest of the omitted variables are the same.
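The difference between the two responses suggests how a script could locate its query without any help from the cgi module. The sketch below is an illustration under stated assumptions: it is written for Python 3, the function name read_raw_query is hypothetical, and the environment and input stream are passed in explicitly so that the logic can be exercised outside a Web server.

```python
import io

def read_raw_query(environ, stdin):
    """Return the undecoded query string for a GET or POST request."""
    method = environ.get("REQUEST_METHOD", "GET").upper()
    if method == "POST":
        # Read only as much as the server announced; for the plain
        # ASCII of a urlencoded query, characters and bytes coincide
        length = int(environ.get("CONTENT_LENGTH") or 0)
        return stdin.read(length)
    return environ.get("QUERY_STRING", "")

# Simulated environments modeled on the two responses shown above
get_env = {"REQUEST_METHOD": "GET",
           "QUERY_STRING": "this=that&spam=eggs+are+good"}
post_env = {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": "28",
            "QUERY_STRING": ""}

print(read_raw_query(get_env, io.StringIO("")))
print(read_raw_query(post_env, io.StringIO("this=that&spam=eggs+are+good")))
```

In a real script the arguments would be os.environ and sys.stdin.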

A CGI script need not utilize any query data and need not return an HTML page. For example, on some of my Web pages, I utilize a "Web bug": a 1x1 transparent gif file that reports back who "looks" at it. Web bugs have a less-honorable use by spammers who send HTML mail and want to verify receipt covertly; but in my case, I only want to check some additional information about visitors to a few of my own Web pages. A Web page might contain, at bottom:

<img src="http://gnosis.cx/cgi-bin/visitor.cgi">

The script itself is:

visitor.cgi

#!/usr/bin/python
import os
from sys import stdout
addr = os.environ.get("REMOTE_ADDR","Unknown IP Address")
agent = os.environ.get("HTTP_USER_AGENT","No Known Browser")
fp = open('visitor.log','a')
fp.write('%s\t%s\n' % (addr, agent))
fp.close()
stdout.write("Content-type: image/gif\n\n")
stdout.write('GIF89a\001\000\001\000\370\000\000\000\000\000')
stdout.write('\000\000\000!\371\004\001\000\000\000\000,\000')
stdout.write('\000\000\000\001\000\001\000\000\002\002D\001\000;')

CLASSES

The point where the cgi module becomes useful is in automating form processing. The class cgi.FieldStorage will determine the details of whether a POST or GET request was made, and decode the urlencoded query into a dictionary-like object. You could perform these checks manually, but cgi makes it much easier to do.

cgi.FieldStorage([fp=sys.stdin [,headers [,ob [,environ=os.environ [,keep_blank_values=0 [,strict_parsing=0]]]]]])

Construct a mapping object containing query information. You will almost always use the default arguments and construct a standard instance. A cgi.FieldStorage object allows you to use name indexing and also supports several custom methods. On initialization, the object will determine all relevant details of the current CGI invocation.

import cgi
query = cgi.FieldStorage()
eggs = query.getvalue('eggs','default_eggs')
numfields = len(query)
if query.has_key('spam'):
    spam = query['spam']
[...]

When you retrieve a cgi.FieldStorage value by named indexing, what you get is not a string, but either a storage object (an instance of cgi.FieldStorage or cgi.MiniFieldStorage) or a list of such objects. The string value of each object is in its .value attribute. Since HTML forms may contain multiple fields with the same name, multiple values might exist for a key, in which case a list of storage objects is returned. The safe way to read the actual strings in queries is to check whether a list is returned:

eggs = query['eggs']       # named indexing: storage object(s)
if type(eggs) is type([]): # several eggs
    for egg in eggs:
        print "<dt>Egg</dt>\n<dd>", egg.value, "</dd>"
else:
    print "<dt>Eggs</dt>\n<dd>", eggs.value, "</dd>"
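The same check can be wrapped in a small helper so that the rest of a script always deals with a list of strings. This is only a sketch written for Python 3: the Field tuple is a hypothetical stand-in for a storage object (only its .value attribute matters here), and field_strings is an illustrative name rather than anything provided by the cgi module.

```python
from collections import namedtuple

# Hypothetical stand-in for a storage object; only .value matters here
Field = namedtuple("Field", "name value")

def field_strings(item):
    """Return a list of strings for one field or a list of fields."""
    if isinstance(item, list):
        return [field.value for field in item]
    return [item.value]

single = Field("spam", "good eggs")
several = [Field("this", "that"), Field("this", "other")]

print(field_strings(single))
print(field_strings(several))
```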

For special circumstances you might wish to change the initialization of the instance by specifying an optional (named) argument. The argument fp specifies the input stream to read for POST requests. The argument headers contains a dictionary mapping HTTP headers to values, usually consisting of {"Content-Type":...}; the type is determined from the environment if no argument is given. The argument environ specifies where the environment mapping is found. If you specify a true value for keep_blank_values, a key will be included for a blank HTML form field, mapping to an empty string. If strict_parsing is specified, a ValueError will be raised if there are any flaws in the query string.

METHODS

The methods .keys(), .values(), and .has_key() work as with a standard dictionary object. The method .items(), however, is not supported.

cgi.FieldStorage.getfirst(key [,default=None])

Python 2.2+ has this method to return exactly one string corresponding to the key key. You cannot rely on which such string value will be returned if multiple submitting HTML form fields have the same name, but you are assured of this method returning a string, not a list.

cgi.FieldStorage.getlist(key [,default=None])

Python 2.2+ has this method to return a list of strings whether there are one or several matches on the key key. This allows you to loop over returned values without worrying about whether they are a list or a single string.

>>> spam = form.getlist('spam')
>>> for s in spam:
...     print s

cgi.FieldStorage.getvalue(key [,default=None])

Return a string or list of strings that are the value(s) corresponding to the key key. If the argument default is specified, return the specified value in case of key miss. In contrast to indexing by name, this method retrieves actual strings rather than storage objects with a .value attribute.

>>> import sys, cgi, os
>>> from cStringIO import StringIO
>>> sys.stdin = StringIO("this=that&this=other&spam=good+eggs")
>>> os.environ['REQUEST_METHOD'] = 'POST'
>>> form = cgi.FieldStorage()
>>> form.getvalue('this')
['that', 'other']
>>> form['this']
[MiniFieldStorage('this','that'),MiniFieldStorage('this','other')]

ATTRIBUTES

cgi.FieldStorage.file

If the object handled is an uploaded file, this attribute gives the file handle for the file. While you can read the entire file contents as a string from the cgi.FieldStorage.value attribute, you may want to read it line-by-line instead. To do this, use the .readline() or .readlines() method of the file object.

cgi.FieldStorage.filename

If the object handled is an uploaded file, this attribute contains the name of the file. An HTML form to upload a file looks something like:

<form action="upload.cgi" method="POST"
      enctype="multipart/form-data">
  Name: <input name="" type="file" size="50">
  <input type="submit" value="Upload">
</form>

Web browsers typically provide a point-and-click method to fill in a file-upload form.

cgi.FieldStorage.list

This attribute contains the list of mapping objects within a cgi.FieldStorage object. Typically, each object in the list is itself a cgi.MiniFieldStorage object (but this can be complicated if you upload files that themselves contain multiple parts).

>>> form.list
[MiniFieldStorage('this', 'that'),
MiniFieldStorage('this', 'other'),
MiniFieldStorage('spam', 'good eggs')]

SEE ALSO: cgi.FieldStorage.getvalue();

cgi.FieldStorage.value
cgi.MiniFieldStorage.value

The string value of a storage object.

SEE ALSO: urllib; cgitb; dict;

cgitb • Traceback manager for CGI scripts

Python 2.2 added a useful little module for debugging CGI applications. You can download it for earlier Python versions from <http://lfw.org/python/cgitb.py>. A basic difficulty with developing CGI scripts is that their normal output is sent to STDOUT, which is caught by the underlying Web server and forwarded to an invoking Web browser. However, when a traceback occurs due to a script error, that output is sent to STDERR (which is hard to get at in a CGI context). A more useful action is either to log errors to server storage or display them in the client browser.

Using the cgitb module to examine CGI script errors is almost embarrassingly simple. At the top of your CGI script, simply include the lines:

Traceback enabled CGI script

import cgitb
cgitb.enable()

If any exceptions are raised, a pretty-formatted report is produced (and possibly logged to a name starting with @).

METHODS

cgitb.enable([display=1 [,logdir=None [,context=5]]])

Turn on traceback reporting. The argument display controls whether an error report is sent to the browser; you might not want this to happen in a production environment, since users will have little idea what to make of such a report (and there may be security issues in letting them see it). If logdir is specified, tracebacks are logged into files in that directory. The argument context indicates how many lines of code are displayed surrounding the point where an error occurred.

For earlier versions of Python, you will have to do your own error catching. A simple approach is:

Debugging CGI script in Python

import sys
sys.stderr = sys.stdout
def main():
    import cgi
    # ...do the actual work of the CGI...
    # perhaps ending with:
    print template % script_dictionary
print "Content-type: text/html\n\n"
main()

This approach is not bad for quick debugging; errors go back to the browser. Unfortunately, though, the traceback (if one occurs) gets displayed as HTML, which means that you need to go to "View Source" in a browser to see the original line breaks in the traceback. With a few more lines, we can add a little extra sophistication.

Debugging/logging CGI script in Python

import sys, traceback
print "Content-type: text/html\n\n"
try:               # use explicit exception handling
    import my_cgi  # main CGI functionality in 'my_cgi.py'
    my_cgi.main()
except:
    import time
    errtime = '--- '+ time.ctime(time.time()) +' ---\n'
    errlog = open('cgi_errlog', 'a')
    errlog.write(errtime)
    traceback.print_exc(None, errlog)
    print "<html>\n<head>"
    print "<title>CGI Error Encountered!</title>\n</head>"
    print "<body><p>A problem was encountered running MyCGI</p>"
    print "<p>Please check the server error log for details</p>"
    print "</body></html>"

The second approach is quite generic as a wrapper for any real CGI functionality we might write. Just import a different CGI module as needed, and maybe make the error messages more detailed or friendlier.
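The wrapper idea can also be packaged as a function, which makes it easy to try out away from a Web server. What follows is a sketch written for Python 3, not the author's code: the name run_logged and its parameters are hypothetical, and the log and output streams are passed in explicitly (a real script would pass an open log file and sys.stdout).

```python
import io
import time
import traceback

def run_logged(main, log, out):
    """Call main(); on any exception, log a traceback and emit an error page."""
    out.write("Content-type: text/html\n\n")
    try:
        main()
        return True
    except Exception:
        # The full traceback goes to the log, never to the browser
        log.write("--- %s ---\n" % time.ctime())
        log.write(traceback.format_exc())
        out.write("<html><head><title>CGI Error Encountered!</title></head>\n")
        out.write("<body><p>Please check the server error log for details</p>")
        out.write("</body></html>\n")
        return False

def broken_main():
    # Stand-in for real CGI functionality that happens to fail
    raise ValueError("bad query")

log, out = io.StringIO(), io.StringIO()
ok = run_logged(broken_main, log, out)
print(ok)
print(log.getvalue())
```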

5.2.2 Parsing, Creating, and Manipulating HTML Documents

htmlentitydefs • HTML character entity references

The module htmlentitydefs provides a mapping between ISO-8859-1 characters and the symbolic names of corresponding HTML 2.0 entity references. Not all HTML named entities have equivalents in the ISO-8859-1 character set; in such cases, names are mapped to the HTML numeric references instead.

ATTRIBUTES

htmlentitydefs.entitydefs

A dictionary mapping symbolic names to character entities.

>>> import htmlentitydefs
>>> htmlentitydefs.entitydefs['omega']
'&#969;'
>>> htmlentitydefs.entitydefs['uuml']
'\xfc'

For some purposes, you might want a reverse dictionary to find the HTML entities for ISO-8859-1 characters.

>>> from htmlentitydefs import entitydefs
>>> iso8859_1 = dict([(v,k) for k,v in entitydefs.items()])
>>> iso8859_1['\xfc']
'uuml'
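A reverse mapping of this kind is what you need to escape text for inclusion in an HTML page. The sketch below assumes Python 3, where the module is named html.entities and already supplies a reverse dictionary keyed by code point (codepoint2name); the function name to_entities is hypothetical.

```python
from html.entities import codepoint2name

def to_entities(text):
    """Replace characters that have a named HTML entity with that entity."""
    pieces = []
    for char in text:
        name = codepoint2name.get(ord(char))
        # Characters with no named entity are passed through unchanged
        pieces.append("&%s;" % name if name else char)
    return "".join(pieces)

print(to_entities("M\xfcller & S\xf6hne"))
```

Because the mapping also covers markup characters such as & and <, the same function escapes them as a side effect.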

HTMLParser • Simple HTML and XHTML parser

The module HTMLParser is an event-based framework for processing HTML files. In contrast to htmllib, which is based on sgmllib, HTMLParser simply uses some regular expressions to identify the parts of an HTML document: starttag, text, endtag, comment, and so on. The different internal implementation, however, makes little difference to users of the modules.

I find the module HTMLParser much more straightforward to use than htmllib, and therefore HTMLParser is documented in detail in this book, while htmllib is not. While htmllib more or less requires the use of the ancillary module formatter to operate, there is no extra difficulty in letting HTMLParser make calls to a formatter object. You might want to do this, for example, if you have an existing formatter/writer for a complex document format.

Both HTMLParser and htmllib provide an interface that is very similar to that of SAX or expat XML parsers. That is, a document (HTML or XML) is processed purely as a sequence of events, with no data structure created to represent the document as a whole. For XML documents, another processing API is the Document Object Model (DOM), which treats the document as an in-memory hierarchical data structure.

In principle, you could use xml.sax or xml.dom to process HTML documents that conformed with XHTML, that is, tightened up HTML that is actually an XML application. The problem is that very little existing HTML is XHTML compliant. A syntactic issue is that HTML does not require closing tags in many cases, where XML/XHTML requires every tag to be closed. But implicit closing tags can be inferred from subsequent opening tags (e.g., with certain names). A popular tool like tidy does an excellent job of cleaning up HTML in this way. The more significant problem is semantic. A whole lot of actually existing HTML is quite lax about tag matching; Web browsers that successfully display the majority of Web pages are quite complex software projects.

For example, a snippet like that below is quite likely to occur in HTML you come across:

<p>The <a href="http://ietf.org">IETF admonishes:
   <i>Be lenient in what you <b>accept</i></a>.</b>

If you know even a little HTML, you know that the author of this snippet presumably wanted the whole quote in italics, the word accept in bold. But converting the snippet into a data structure such as a DOM object is difficult to generalize. Fortunately, HTMLParser is fairly lenient about what it will process; however, for sufficiently badly formed input (or any other problem), the module will raise the exception HTMLParser.HTMLParseError.

CLASSES

HTMLParser.HTMLParser()

The HTMLParser module contains the single class HTMLParser.HTMLParser. The class itself is fairly useless, since it does not actually do anything when it encounters any event. Utilizing HTMLParser.HTMLParser() is a matter of subclassing it and providing methods to handle the events you are interested in.

If it is important to keep track of the structural position of the current event within the document, you will need to maintain a data structure with this information. If you are certain that the document you are processing is well-formed XHTML, a stack suffices. For example:

HTMLParser_stack.py

#!/usr/bin/env python
import sys, HTMLParser
html = """<html><head><title>Advice</title></head><body>
<p>The <a href="http://ietf.org">IETF admonishes:
   <i>Be strict in what you <b>send</b>.</i></a></p>
</body></html>
"""
tagstack = []
class ShowStructure(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs): tagstack.append(tag)
    def handle_endtag(self, tag): tagstack.pop()
    def handle_data(self, data):
        if data.strip():
            for tag in tagstack: sys.stdout.write('/'+tag)
            sys.stdout.write(' >> %s\n' % data[:40].strip())
ShowStructure().feed(html)

Running this optimistic parser produces:

% ./HTMLParser_stack.py
/html/head/title >> Advice
/html/body/p >> The
/html/body/p/a >> IETF admonishes:
/html/body/p/a/i >> Be strict in what you
/html/body/p/a/i/b >> send
/html/body/p/a/i >> .

You could, of course, use this context information however you wished when processing a particular bit of content (or when you process the tags themselves).

A more pessimistic approach is to maintain a "fuzzy" tagstack. We can define a new object that will remove the most recent starttag corresponding to an endtag and will also prevent <p> and <blockquote> tags from nesting if no corresponding endtag is found. You could do more along this line for a production application, but a class like TagStack makes a good start:

class TagStack:
    def __init__(self, lst=None):
        # Avoid sharing one mutable default list among instances
        if lst is None: lst = []
        self.lst = lst
    def __getitem__(self, pos): return self.lst[pos]
    def append(self, tag):
        # Remove every paragraph-level tag if this is one
        if tag.lower() in ('p','blockquote'):
            self.lst = [t for t in self.lst
                          if t not in ('p','blockquote')]
        self.lst.append(tag)
    def pop(self, tag):
        # "Pop" by tag from nearest pos, not only last item
        self.lst.reverse()
        try:
            try:
                pos = self.lst.index(tag)
            except ValueError:
                raise HTMLParser.HTMLParseError, "Tag not on stack"
            del self.lst[pos]
        finally:
            # Restore the original order, even if the tag was missing
            self.lst.reverse()
tagstack = TagStack()

This more lenient stack structure suffices to parse badly formatted HTML like the example given in the module discussion, provided that the .handle_endtag() method passes the tag name along by calling tagstack.pop(tag).
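To see a fuzzy stack at work on the badly nested snippet from the module discussion, here is a self-contained sketch. It is an adaptation under stated assumptions rather than the code above: it is written for Python 3 (where the class lives in html.parser), the names FuzzyStack and ShowStructure are illustrative, and unlike TagStack it silently ignores an endtag that matches nothing instead of raising an exception.

```python
from html.parser import HTMLParser

class FuzzyStack:
    def __init__(self):
        self.tags = []
    def push(self, tag):
        # Paragraph-level tags do not nest: drop any that are still open
        if tag in ('p', 'blockquote'):
            self.tags = [t for t in self.tags
                         if t not in ('p', 'blockquote')]
        self.tags.append(tag)
    def pop(self, tag):
        # Remove the nearest matching starttag, wherever it sits
        for pos in range(len(self.tags) - 1, -1, -1):
            if self.tags[pos] == tag:
                del self.tags[pos]
                return
    def path(self):
        return ''.join('/' + t for t in self.tags)

class ShowStructure(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = FuzzyStack()
        self.lines = []
    def handle_starttag(self, tag, attrs):
        self.stack.push(tag)
    def handle_endtag(self, tag):
        self.stack.pop(tag)
    def handle_data(self, data):
        if data.strip():
            self.lines.append('%s >> %s' % (self.stack.path(),
                                            data.strip()[:40]))

html = """<p>The <a href="http://ietf.org">IETF admonishes:
   <i>Be lenient in what you <b>accept</i></a>.</b>
"""
parser = ShowStructure()
parser.feed(html)
parser.close()
for line in parser.lines:
    print(line)
```

The final period is reported under /p/b, since by then the i and a elements have been closed out of order while b is still open.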

METHODS AND ATTRIBUTES

HTMLParser.HTMLParser.close()

Force processing of all buffered data, treating any current data as if an EOF was encountered.

HTMLParser.HTMLParser.feed(data)

Send some additional HTML data to the parser instance from the string in the argument data. You may feed the instance with whatever size chunks of data you wish, and each will be processed, maintaining the previous state.
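The point about chunk sizes can be checked with a short experiment. This sketch assumes Python 3's html.parser; the class name EventLog is hypothetical. Only tag events are recorded, since text may be delivered to .handle_data() in differently sized pieces depending on where the chunks happen to break.

```python
from html.parser import HTMLParser

class EventLog(HTMLParser):
    def __init__(self):
        super().__init__()
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))
    def handle_endtag(self, tag):
        self.events.append(("end", tag))

# Feed the whole document at once...
whole = EventLog()
whole.feed('<p class="x"><b>Hi</b></p>')
whole.close()

# ...and again in chunks that break in the middle of tags
pieces = EventLog()
for chunk in ('<p cla', 'ss="x"><b', '>Hi</', 'b></p>'):
    pieces.feed(chunk)
pieces.close()

print(pieces.events == whole.events)
```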

HTMLParser.HTMLParser.getpos()

Return the current line number and offset. Generally called within a .handle_*() method to report or analyze the state of the processing of the HTML text.
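As an illustration, a handler might record where each starttag begins. The sketch assumes Python 3's html.parser, and TagPositions is a hypothetical name.

```python
from html.parser import HTMLParser

class TagPositions(HTMLParser):
    def __init__(self):
        super().__init__()
        self.positions = []
    def handle_starttag(self, tag, attrs):
        # getpos() returns (line, offset): lines count from 1,
        # offsets within a line count from 0
        self.positions.append((tag, self.getpos()))

doc = "<html>\n  <p>Advice</p>\n</html>\n"
finder = TagPositions()
finder.feed(doc)
finder.close()
print(finder.positions)
```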

HTMLParser.HTMLParser.handle_charref(name)

Method called when a character reference is encountered, such as &#971;. Character references may be interspersed with element text, much as with entity references. You can construct a Unicode character from a character reference, and you may want to pass the Unicode (or raw character reference) to HTMLParser.HTMLParser.handle_data().

class CharacterData(HTMLParser.HTMLParser):
    def handle_charref(self, name):
        import unicodedata
        char = unicodedata.name(unichr(int(name)))
        self.handle_data(char)
    [...other methods...]

HTMLParser.HTMLParser.handle_comment(data)

Method called when a comment is encountered. HTML comments begin with <!-- and end with -->. The argument data contains the contents of the comment.

HTMLParser.HTMLParser.handle_data(data)

Method called when content data is encountered. All the text between tags is contained in the argument data, but if character or entity references are interspersed with text, the respective handler methods will be called in an interspersed fashion.

HTMLParser.HTMLParser.handle_decl(data)

Method called when a declaration is encountered. HTML declarations begin with <! and end with >. The argument data contains the contents of the declaration. Syntactically, comments look like a type of declaration, but are handled by the HTMLParser.HTMLParser.handle_comment() method.

HTMLParser.HTMLParser.handle_endtag(tag)

Method called when an endtag is encountered. The argument tag contains the tag name (without brackets).

HTMLParser.HTMLParser.handle_entityref(name)

Method called when an entity reference is encountered, such as &amp;. When entity references occur in the middle of an element text, calls to this method are interspersed with calls to HTMLParser.HTMLParser.handle_data(). In many cases, you will want to call the latter method with decoded entities; for example:

class EntityData(HTMLParser.HTMLParser):
    def handle_entityref(self, name):
        import htmlentitydefs
        self.handle_data(htmlentitydefs.entitydefs[name])
    [...other methods...]
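Both kinds of references can be folded back into the text stream by one subclass. This is a sketch under assumptions, not the book's code: it is written for Python 3, where the parser must be created with convert_charrefs=False for these two handlers to be called at all, and the class name DecodedText is hypothetical.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser

class DecodedText(HTMLParser):
    def __init__(self):
        # convert_charrefs=False keeps the reference handlers in play
        super().__init__(convert_charrefs=False)
        self.pieces = []
    def handle_data(self, data):
        self.pieces.append(data)
    def handle_entityref(self, name):
        # Unknown entity names are passed through undecoded
        if name in name2codepoint:
            self.handle_data(chr(name2codepoint[name]))
        else:
            self.handle_data("&%s;" % name)
    def handle_charref(self, name):
        # Numeric references may be decimal (971) or hexadecimal (x3cb)
        if name[:1] in ("x", "X"):
            self.handle_data(chr(int(name[1:], 16)))
        else:
            self.handle_data(chr(int(name)))

parser = DecodedText()
parser.feed("caf&eacute; &#971; &#x3cb; &amp; more")
parser.close()
text = "".join(parser.pieces)
print(text)
```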

HTMLParser.HTMLParser.handle_pi(data)

Method called when a processing instruction (PI) is encountered. PIs begin with <? and end with ?>. They are less common in HTML than in XML, but are allowed. The argument data contains the contents of the PI.

HTMLParser.HTMLParser.handle_startendtag(tag, attrs)

Method called when an XHTML-style empty tag is encountered, such as:

<img src="foo.png" alt="foo"/>

The arguments tag and attrs are identical to those passed to HTMLParser.HTMLParser.handle_starttag().

HTMLParser.HTMLParser.handle_starttag(tag, attrs)

Method called when a starttag is encountered. The argument tag contains the tag name (without brackets), and the argument attrs contains the tag attributes as a list of pairs, such as [("href", "http://ietf.org")].
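A common use of the attrs list is to collect link targets. The following sketch assumes Python 3's html.parser; the class name LinkCollector is hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, in document order
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<p>The <a href="http://ietf.org">IETF admonishes:'
               ' <i>Be strict in what you <b>send</b>.</i></a></p>')
collector.close()
print(collector.links)
```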

HTMLParser.HTMLParser.lasttag

The last tag (start or end) that was encountered. Generally maintaining some sort of stack structure like those discussed is more useful. But this attribute is available automatically. You should treat it as read-only.

HTMLParser.HTMLParser.reset()

Restore the instance to its initial state, losing any unprocessed data (for example, content within unclosed tags).

5.2.3 Accessing Internet Resources

urllib • Open an arbitrary URL

The module urllib provides convenient, high-level access to resources on the Internet. While urllib lets you connect to a variety of protocols, to manage low-level details of connections (especially issues of complex authentication) you should use the module urllib2 instead. However, urllib does provide hooks for HTTP basic authentication.

The interface to urllib objects is file-like. You can substitute an object representing a URL connection for almost any function or class that expects to work with a read-only file. All of the World Wide Web, File Transfer Protocol (FTP) directories, and gopherspace can be treated, almost transparently, as if it were part of your local filesystem.

Although the module provides two classes that can be utilized or subclassed for more fine-tuned control, generally in practice the function urllib.urlopen() is the only interface you need to the urllib module.

FUNCTIONS

urllib.urlopen(url [,data])

Return a file-like object that connects to the Uniform Resource Locator (URL) resource named in url. This resource may be an HTTP, FTP, Gopher, or local file. The optional argument data can be specified to make a POST request to an HTTP URL. This data is a urlencoded string, which may be created by the urllib.urlencode() function. If data is not specified with an HTTP URL, the GET method is used.

Depending on the type of resource specified, a slightly different class is used to construct the instance, but each provides the methods: .read(), .readline(), .readlines(), .fileno(), .close(), .info(), and .geturl() (but not .xreadlines(), .seek(), or .tell()).

Most of the provided methods are shared by file objects, and each provides the same interface (arguments and return values) as actual file objects. The method .geturl() simply returns the URL that the object connects to, usually the same string as the url argument.

The method .info() returns a mimetools.Message object. While the mimetools module is not documented in detail in this book, this object is generally similar to an email.Message.Message object; specifically, it responds to both the built-in str() function and dictionary-like indexing:

>>> import urllib
>>> u = urllib.urlopen('urlopen.py')
>>> print repr(u.info())
<mimetools.Message instance at 0x62f800>
>>> print u.info()
Content-Type: text/x-python
Content-Length: 577
Last-modified: Fri, 10 Aug 2001 06:03:04 GMT

>>> u.info().keys()
['last-modified', 'content-length', 'content-type']
>>> u.info()['content-type']
'text/x-python'
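The file-like interface can be tried without any network access by pointing a URL at a local file. This sketch is written for Python 3, where the function is urllib.request.urlopen() and the data read is bytes rather than a string; the temporary file is created purely for the demonstration.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Create a small local file so that the sketch needs no network access
handle, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(handle, "wb") as fp:
    fp.write(b"spam and eggs\n")

# A file:// URL is opened exactly like any other URL
with urlopen(Path(path).as_uri()) as u:
    first_line = u.readline()             # read like an ordinary file
    length = u.info()["Content-Length"]   # headers allow indexing by name
os.remove(path)

print(first_line, length)
```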

SEE ALSO: urllib.urlretrieve(); urllib.urlencode();

urllib.urlretrieve(url [,fname [,reporthook [,data]]])

Save the resources named in the argument url to a local file. If the optional argument fname is specified, that filename will be used; otherwise, a unique temporary filename is generated. The optional argument data may contain a urlencoded string to pass to an HTTP POST request, as with urllib.urlopen().

The optional argument reporthook may be used to specify a callback function, typically to implement a progress meter for downloads. The function reporthook() will be called repeatedly with the arguments bl_transferred, bl_size, and file_size. Even remote files smaller than the block size will typically call reporthook() a few times, but for larger files, file_size will approximately equal bl_transferred*bl_size.
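A progress meter along these lines might look like the sketch below (Python 3 syntax). The helper name percent_done is hypothetical; the callback itself takes the three arguments described above. Two details deserve care: the product of blocks and block size can overshoot the true size on the last call, and the reported file size may be unknown (a nonpositive value).

```python
def percent_done(bl_transferred, bl_size, file_size):
    """Estimate download progress as a whole percentage, or None."""
    if file_size <= 0:
        return None                    # total size was not reported
    done = bl_transferred * bl_size    # may overshoot on the last block
    return min(100, done * 100 // file_size)

def reporthook(bl_transferred, bl_size, file_size):
    # Callback with the three-argument signature expected for reporthook
    pct = percent_done(bl_transferred, bl_size, file_size)
    if pct is not None:
        print("%3d%% retrieved" % pct)

# Simulate the calls made while fetching a 20000-byte resource
for blocks in range(4):
    reporthook(blocks, 8192, 20000)
```

The function reporthook can then be passed as the third argument when retrieving a URL.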

The return value of urllib.urlretrieve() is a pair (fname, info). The returned fname is the name of the created file, which is the same as the fname argument if it was specified. The info return value is a mimetools.Message object, like that returned by the .info() method of a urllib.urlopen() object.

SEE ALSO: urllib.urlopen(); urllib.urlencode();

urllib.quote(s [,safe="/"])

Return a string with special characters escaped. Exclude any characters in the string safe from being quoted.

>>> urllib.quote('/~username/special&odd!')
'/%7Eusername/special%26odd%21'

urllib.quote_plus(s [,safe=""])

Same as urllib.quote(), but encode spaces as + also.

urllib.unquote(s)

Return an unquoted string. Inverse operation of urllib.quote().

urllib.unquote_plus(s)

Return an unquoted string. Inverse operation of urllib.quote_plus().
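The four functions pair up as round trips. The sketch below assumes Python 3, where they are found in urllib.parse; the sample string is arbitrary.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

raw = "special&odd! eggs/are good"

path_style = quote(raw)        # '/' is left alone by default
form_style = quote_plus(raw)   # spaces become '+', '/' is escaped

print(path_style)
print(form_style)

# Each unquoting function undoes its quoting counterpart
print(unquote(path_style) == raw)
print(unquote_plus(form_style) == raw)
```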

urllib.urlencode(query)

Return a urlencoded query for an HTTP POST or GET request. The argument query may be either a dictionary-like object or a sequence of pairs. If pairs are used, their order is preserved in the generated query.

>>> query = urllib.urlencode([('hl','en'),
...                           ('q','Text Processing in Python')])
>>> print query
hl=en&q=Text+Processing+in+Python
>>> u = urllib.urlopen('http://google.com/search?'+query)

Notice, however, that at least as of the moment of this writing, Google will refuse to return results on this request because a Python shell is not a recognized browser (Google provides a SOAP interface that is more lenient, however). You could, but should not, create a custom urllib class that spoofed an accepted browser.

CLASSES

You can change the behavior of the basic urllib.urlopen() and urllib.urlretrieve() functions by substituting your own class into the module namespace. Generally this is the best way to use urllib classes:

import urllib
class MyOpener(urllib.FancyURLopener):
    pass
urllib._urlopener = MyOpener()
u = urllib.urlopen("http://some.url") # uses custom class

urllib.URLopener([proxies [,**x509]])

Base class for reading URLs. Generally you should subclass from the class urllib.FancyURLopener unless you need to implement a nonstandard protocol from scratch.

The argument proxies may be specified with a mapping if you need to connect to resources through a proxy. The keyword arguments may be used to configure HTTPS authentication; specifically, you should give named arguments key_file and cert_file in this case.

import urllib
proxies = {'http':'http://192.168.1.1','ftp':'ftp://192.168.255.1'}
urllib._urlopener = urllib.URLopener(proxies, key_file='mykey',
                                     cert_file='mycert')

urllib.FancyURLopener([proxies [,**x509]])

The optional initialization arguments are the same as for urllib.URLopener, unless you subclass further to use other arguments. This class knows how to handle 301 and 302 HTTP redirect codes, as well as 401 authentication requests. The class urllib.FancyURLopener is the one actually used by the urllib module, but you may subclass it to add custom capabilities.

METHODS AND ATTRIBUTES

urllib.FancyURLopener.get_user_passwd(host, realm)

Return the pair (user,passwd) to use for authentication. The default implementation calls the method .prompt_user_passwd() in turn. In a subclass you might want to either provide a GUI login interface or obtain authentication information from some other source, such as a database.

urllib.URLopener.open(url [,data])
urllib.FancyURLopener.open(url [,data])

Open the URL url, optionally using HTTP POST query data.

urllib.URLopener.open_unknown(url [,data])
urllib.FancyURLopener.open_unknown(url [,data])

If the scheme is not recognized, the .open() method passes the request to this method. You can implement error reporting or fallback behavior here.

urllib.FancyURLopener.prompt_user_passwd(host, realm)

Prompt for the authentication pair (user,passwd) at the terminal. You may override this to prompt within a GUI. If the authentication is not obtained interactively, but by other means, directly overriding .get_user_passwd() is more logical.

urllib.URLopener.retrieve(url [,fname [,reporthook [,data]]])
urllib.FancyURLopener.retrieve(url [,fname [,reporthook [,data]]])

Copy the URL url to the local file named fname. If the progress function reporthook is specified, it is called back as the transfer proceeds. The optional argument data supplies HTTP POST query data.
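The reporthook callback is easiest to understand by watching it run. The sketch below is only an illustration under several assumptions: it uses the module-level convenience function urlretrieve() as a stand-in for the .retrieve() method, it copies a throwaway local file through a file: URL so that no network is needed, and it records the three arguments that, to my understanding, the hook receives (a block count, a block size, and the total size). The names progress, source, and target are my own, and the try/except import is an assumption that lets the same lines run where these functions have moved to urllib.request.

```python
import os
import tempfile

try:                                   # module layout used in this chapter
    from urllib import urlretrieve, pathname2url
except ImportError:                    # assumed fallback: later Pythons
    from urllib.request import urlretrieve, pathname2url

progress = []                          # one record per callback

def reporthook(block_count, block_size, total_size):
    "Record each progress callback; a real hook might draw a progress bar"
    progress.append((block_count, block_size, total_size))

#-- create a small local file to act as the remote resource
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, 'source.txt')
target = os.path.join(workdir, 'copy.txt')
with open(source, 'wb') as fh:
    fh.write(b'spam and eggs\n' * 100)

#-- copy it by URL, reporting progress along the way
url = 'file:' + pathname2url(source)
fname, headers = urlretrieve(url, target, reporthook)
print(fname)
print(len(progress))
```

A hook that only appends to a list keeps the example testable; in an application the same function might update a status line or a GUI widget instead.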

urllib.URLopener.version
urllib.FancyURLopener.version

The User Agent string reported to a server is contained in this attribute. By default it is urllib/###, where ### stands for the urllib version number.

urlparse • Parse Uniform Resource Locators

The module urlparse supports just one fairly simple task, but one that is just complicated enough for quick implementations to get wrong. URLs describe a number of aspects of resources on the Internet: access protocol, network location, path, parameters, query, and fragment. Using urlparse, you can break out and combine these components to manipulate or generate URLs. The format of URLs is based on RFC-1738, RFC-1808, and RFC-2396.

Notice that the urlparse module does not parse the components of the network location, but merely returns them as a field. For example, the URL ftp://guest:gnosis@192.168.1.102:21//tmp/MAIL.MSG is a valid identifier on my local network (at least at the moment this is written). Tools like Mozilla and wget are happy to retrieve this file. Parsing this fairly complicated URL with urlparse gives us:

>>> import urlparse
>>> url = 'ftp://guest:gnosis@192.168.1.102:21//tmp/MAIL.MSG'
>>> urlparse.urlparse(url)
('ftp', 'guest:gnosis@192.168.1.102:21', '//tmp/MAIL.MSG',
'', '', '')

While this information is not incorrect, this network location itself contains multiple fields; all but the host are optional. The actual structure of a network location, using square bracket nesting to indicate optional components, is:

[user[:password]@]host[:port]

The following mini-module will let you further parse these fields:

location_parse.py

#!/usr/bin/env python
def location_parse(netloc):
    "Return tuple (user, passwd, host, port) for netloc"
    if '@' not in netloc:
        netloc = ':@' + netloc
    login, net = netloc.split('@')
    if ':' not in login:
        login += ':'
    user, passwd = login.split(':')
    if ':' not in net:
        net += ':'
    host, port = net.split(':')
    return (user, passwd, host, port)

#-- specify network location on command-line
if __name__=='__main__':
    import sys
    print location_parse(sys.argv[1])
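As far as I know, Python releases from 2.5 onward also attach the network location sub-fields to the parse result as attributes, which gives a second way to get at the same information. The sketch below assumes such a release; the function name location_fields is my own invention, and the try/except import is an assumption to cover Pythons where the module has been renamed urllib.parse. Notice that the conventions differ from location_parse(): missing fields come back as None rather than as empty strings, and the port is an integer.

```python
try:                                   # module name used in this chapter
    from urlparse import urlsplit
except ImportError:                    # assumed fallback: later Pythons
    from urllib.parse import urlsplit

def location_fields(url):
    "Return tuple (user, passwd, host, port) for a full URL"
    parts = urlsplit(url)
    return (parts.username, parts.password, parts.hostname, parts.port)

print(location_fields('ftp://guest:gnosis@192.168.1.102:21//tmp/MAIL.MSG'))
print(location_fields('http://gnosis.cx/path/file.html'))
```

Unlike location_parse(), this function takes the whole URL rather than just the netloc field.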

FUNCTIONS

urlparse.urlparse(url [,def_scheme="" [,fragments=1]])

Return a tuple consisting of six components of the URL url, (scheme, netloc, path, params, query, fragment). A URL is assumed to follow the pattern scheme://netloc/path;params?query#fragment. If a default scheme def_scheme is specified, that string will be returned in case no scheme is encoded in the URL itself. If fragments is set to a false value, any fragments will not be split from other fields.

>>> from urlparse import urlparse
>>> urlparse('gnosis.cx/path/sub/file.html#sect', 'http', 1)
('http', '', 'gnosis.cx/path/sub/file.html', '', '', 'sect')
>>> urlparse('gnosis.cx/path/sub/file.html#sect', 'http', 0)
('http', '', 'gnosis.cx/path/sub/file.html#sect', '', '', '')
>>> urlparse('http://gnosis.cx/path/file.cgi?key=val#sect',
...          'gopher', 1)
('http', 'gnosis.cx', '/path/file.cgi', '', 'key=val', 'sect')
>>> urlparse('http://gnosis.cx/path/file.cgi?key=val#sect',
...          'gopher', 0)
('http', 'gnosis.cx', '/path/file.cgi', '', 'key=val#sect', '')

urlparse.urlunparse(tup)

Construct a URL from a tuple containing the fields returned by urlparse.urlparse(). The returned URL has canonical form (redundancy eliminated), so urlparse.urlparse() and urlparse.urlunparse() are not precisely inverse operations; however, the composition urlunparse(urlparse(s)) should be idempotent.
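A small sketch may make both uses concrete: replacing one field of a parsed URL before reassembling it, and checking the idempotence claim on a few samples. The helper names canonical and with_query are hypothetical, chosen only for this illustration, and the try/except import is an assumption to cover Pythons where the module has been renamed urllib.parse.

```python
try:                                   # module name used in this chapter
    from urlparse import urlparse, urlunparse
except ImportError:                    # assumed fallback: later Pythons
    from urllib.parse import urlparse, urlunparse

def canonical(url):
    "Round-trip a URL through urlparse() and urlunparse()"
    return urlunparse(urlparse(url))

def with_query(url, query):
    "Return url with its query field replaced by the string query"
    fields = list(urlparse(url))       # copy the 6-tuple so it can be changed
    fields[4] = query                  # index 4 is the query component
    return urlunparse(fields)

url = 'http://gnosis.cx/path/file.cgi?key=val#sect'
print(with_query(url, 'key=other'))

#-- a second round trip should change nothing
samples = (url, 'http://gnosis.cx/path?', 'gnosis.cx/path/sub/file.html#sect')
for s in samples:
    once = canonical(s)
    print(once == canonical(once))
```

The second sample ends in a bare question mark, the sort of redundancy that a round trip may drop; that is why the check compares the first round trip with the second rather than with the original string.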

urlparse.urljoin(base, file)

Return a URL that has the same base path as base but has the file component file. For example:

>>> from urlparse import urljoin
>>> urljoin('http://somewhere.lan/path/file.html',
...                  'sub/other.html')
'http://somewhere.lan/path/sub/other.html'
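The second argument need not be a simple relative filename. The sketch below, which reuses the same illustrative base URL (the other filenames and the host elsewhere.lan are made up), shows how an absolute path, a parent-directory reference, and a complete URL are resolved. The try/except import is an assumption to cover Pythons where the module has been renamed urllib.parse.

```python
try:                                   # module name used in this chapter
    from urlparse import urljoin
except ImportError:                    # assumed fallback: later Pythons
    from urllib.parse import urljoin

base = 'http://somewhere.lan/path/file.html'

sibling = urljoin(base, 'other.html')          # replaces file.html only
rooted = urljoin(base, '/top.html')            # absolute path replaces /path/
parent = urljoin(base, '../up.html')           # climbs one directory
foreign = urljoin(base, 'http://elsewhere.lan/x.html')  # full URL wins

for joined in (sibling, rooted, parent, foreign):
    print(joined)
```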

In Python 2.2+ the functions urlparse.urlsplit() and urlparse.urlunsplit() are available. These differ from urlparse.urlparse() and urlparse.urlunparse() in returning a 5-tuple that does not split out params from path.
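The difference only shows up for a URL that actually carries parameters. In the sketch below, the sample URL with its ;type=a parameter is made up for illustration, and the try/except import is an assumption to cover Pythons where the module has been renamed urllib.parse.

```python
try:                                   # module name used in this chapter
    from urlparse import urlparse, urlsplit, urlunsplit
except ImportError:                    # assumed fallback: later Pythons
    from urllib.parse import urlparse, urlsplit, urlunsplit

url = 'http://gnosis.cx/path/file.cgi;type=a?key=val#sect'

six = tuple(urlparse(url))             # params broken out as field 3
five = tuple(urlsplit(url))            # params left attached to the path

print(six)
print(five)
print(urlunsplit(five) == url)
```

With urlparse() the string type=a lands in its own field; with urlsplit() it stays on the end of the path, so the two forms are easier to keep apart if each is always paired with its matching "un" function.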