21.2 MIME and Email Format Handling

Python supplies the email package to handle parsing, generation, and manipulation of MIME files such as email messages, network news posts, and so on. The Python standard library also contains other modules that handle some parts of these jobs. However, the new email package offers a more complete and systematic approach to these important tasks. I therefore suggest you use package email, not the older modules that partially overlap with parts of email's functionality. Package email has nothing to do with receiving or sending email; for such tasks, see modules poplib and smtplib, covered in Chapter 18. Instead, package email deals with how you handle messages after you receive them or before you send them.

21.2.1 Functions in Package email

Package email supplies two factory functions returning an instance m of class email.Message.Message. These functions rely on class email.Parser.Parser, but the factory functions are handier and simpler. Therefore, I do not cover module Parser further in this book.

message_from_string

message_from_string(s)

Builds m by parsing string s.

message_from_file

message_from_file(f)

Builds m by parsing the contents of file-like object f, which must be open for reading.
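The two factory functions can be sketched as follows. This is a minimal illustration, assuming a current Python 3 interpreter; the sample message text and the use of io.StringIO as the file-like object are assumptions made for the example.

```python
import email
import io

# Hypothetical sample message: headers, a blank line, then the body.
raw = "From: alice@example.com\nSubject: Hello\n\nBody text\n"

# Parse directly from a string.
m1 = email.message_from_string(raw)

# Parse from any file-like object open for reading (here, an in-memory one).
m2 = email.message_from_file(io.StringIO(raw))

print(m1["Subject"])      # header lookup on the parsed message
print(m2.get_payload())   # the body, as a string
```

Both calls should produce equivalent message objects, so the choice between them depends only on where the text comes from.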

21.2.2 The email.Message Module

The email.Message module supplies class Message. All parts of package email produce, modify, or use instances of class Message. An instance m of Message models a MIME message, including headers and a payload (data content). You can create m, initially empty, by calling class Message, which accepts no arguments. More often, you create m by parsing via functions message_from_string and message_from_file of module email, or by other indirect means such as the classes covered in "Creating Messages" later in this chapter. m's payload can be a string, a single other instance of Message, or a list of other Message instances for a multipart message.

You can set arbitrary headers on email messages you're building. Several Internet RFCs specify headers that you can use for a wide variety of purposes. The main applicable RFC is RFC 2822 (see http://www.faqs.org/rfcs/rfc2822.html).

An instance m of class Message holds headers as well as a payload. m is a mapping, with header names as keys and header value strings as values. The semantics of m as a mapping are rather different from those of a dictionary, to make m more convenient. m's keys are case-insensitive. m keeps headers in the order in which you add them, and methods keys, values, and items return headers in that order. m can have more than one header named key: m[key] returns an arbitrary one of them, and del m[key] deletes all of them. len(m) returns the total number of headers, counting duplicates, not just the number of distinct header names. If there is no header named key, m[key] returns None and does not raise KeyError (i.e., behaves like m.get(key)), and del m[key] is a no-operation.
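The mapping semantics described above can be seen in a short sketch. It assumes Python 3, where the module name is spelled email.message in lowercase; the header values are made up for illustration.

```python
from email.message import Message  # spelled email.Message in older releases

m = Message()
m["Received"] = "from host-a"
m["Received"] = "from host-b"      # assignment appends; it does not overwrite
m["Subject"] = "Test"

count = len(m)                     # counts duplicates
names = m.keys()                   # header names, in insertion order
one_value = m["received"]          # case-insensitive lookup
missing = m["X-No-Such-Header"]    # None rather than KeyError

del m["RECEIVED"]                  # removes every Received header
remaining = m.keys()
```

Because indexing returns only one of several same-named headers, code that cares about duplicates is better served by get_all.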

An instance m of Message supplies the following attributes and methods dealing with m's headers and payload.

add_header

m.add_header(_name,_value,**_params)

Like m[_name]=_value, but you can also supply header parameters as keyword arguments. For each keyword argument pname=pvalue, add_header changes underscores to dashes, then appends to the header's value a parameter of the form:

; pname="pvalue"

If pvalue is None, add_header appends only a parameter '; pname'.
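As an illustration of how keyword arguments turn into header parameters, here is a small sketch, assuming Python 3 (module email.message). The header name X-Example and the parameter some_flag are hypothetical names chosen only to show the underscore-to-dash conversion.

```python
from email.message import Message

m = Message()

# A keyword argument becomes a quoted parameter appended to the value.
m.add_header("Content-Disposition", "attachment", filename="report.pdf")
disposition = m["Content-Disposition"]

# A None value yields a bare parameter name; underscores become dashes.
m.add_header("X-Example", "value", some_flag=None)
flagged = m["X-Example"]
```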

add_payload

m.add_payload(payload)

Adds the payload to m's payload. If m's payload was None, m's payload is now payload. If m's payload was a list, appends payload to the list. If m's payload was a single item x, m's payload becomes the list [x,payload], but only if m's Content-Type header is missing or has a main type of multipart. Otherwise, when m has a single payload and a Content-Type whose main type is not multipart, m.add_payload(payload) raises a MultipartConversionError exception.

as_string

m.as_string(unixfrom=False)

Returns the entire message as a string. When unixfrom is true, also includes a first line, normally starting with 'From ', known as the envelope header of the message.

epilogue

Attribute m.epilogue can be None, or a string that becomes part of the message's string form after the last boundary line. Mail programs normally don't display this text. epilogue is a normal attribute of m: your program can access it when you're examining an m that is fully built by whatever means, and your program can bind it when you're building or modifying m in your program.

get_all

m.get_all(name,default=None)

Returns a list with all values of headers named name, in the order in which the headers were added to m. When m has no header named name, get_all returns default.

get_boundary

m.get_boundary(default=None)

Returns the string value of the boundary parameter of m's Content-Type header. When m has no Content-Type header, or the header has no boundary parameter, get_boundary returns default.

get_charsets

m.get_charsets(default=None)

Returns the list L of string values of parameter charset of m's Content-Type headers. When m is multipart, L has one item per part, otherwise L has length 1. For parts that have no Content-Type, no charset parameter, or a main type different from 'text', the corresponding item in L is default.

get_filename

m.get_filename(default=None)

Returns the string value of the filename parameter of m's Content-Disposition header. When m has no Content-Disposition, or the header has no filename parameter, get_filename returns default.

get_maintype

m.get_maintype(default=None)

Returns m's main content type, a string 'maintype' taken from header Content-Type converted to lowercase. When m has no header Content-Type, get_maintype returns default.

get_param

m.get_param(param,default=None,header='Content-Type')

Returns the string value of the parameter named param of m's header named header. Returns the empty string for a parameter specified just by name. When m has no header header, or the header has no parameter named param, get_param returns default.

get_params

m.get_params(default=None,header='Content-Type')

Returns the parameters of m's header named header, a list of pairs of strings giving each parameter's name and value. Uses the empty string as the value for parameters specified just by name. When m has no header header, get_params returns default.
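The following sketch shows get_param and get_params working on a Content-Type header. It assumes Python 3, and the header value is an invented example.

```python
from email.message import Message

m = Message()
m["Content-Type"] = 'text/plain; charset="utf-8"; format=flowed'

charset = m.get_param("charset")          # value of one named parameter
absent = m.get_param("no-such-param")     # falls back to default (None)
params = m.get_params()                   # list of (name, value) pairs

for name, value in params:
    print(name, repr(value))
```

Note that the first pair in the list describes the content type itself, with the empty string as its value, since it is specified just by name.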

get_payload

m.get_payload(i=None,decode=False)

Returns m's payload. When m.is_multipart( ) is False, i must be None, and m.get_payload( ) returns m's entire payload, a string or a Message instance. If decode is true, and the value of header Content-Transfer-Encoding is either 'quoted-printable' or 'base64', m.get_payload also decodes the payload. If decode is false, or header Content-Transfer-Encoding is missing or has other values, m.get_payload returns the payload unchanged.

When m.is_multipart( ) is True, decode must be false. When i is None, m.get_payload( ) returns m's payload as a list. Otherwise, m.get_payload(i) returns the ith item of the payload, and raises IndexError if i is out of range. (Passing an i other than None when m is not multipart raises TypeError.)
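A brief sketch of the decode argument on a non-multipart message follows. It assumes Python 3, where the decoded payload comes back as a bytes object rather than a text string; the sample message is made up.

```python
import email

# Hypothetical message whose body is the Base 64 form of "hello world".
raw = (
    "Content-Type: text/plain\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "aGVsbG8gd29ybGQ=\n"
)
m = email.message_from_string(raw)

encoded = m.get_payload()             # payload exactly as transmitted
decoded = m.get_payload(decode=True)  # transfer encoding undone
```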

get_subtype

m.get_subtype(default=None)

Returns m's content subtype, a string 'subtype' taken from header Content-Type converted to lowercase. When m has no header Content-Type, get_subtype returns default.

get_type

m.get_type(default=None)

Returns m's content type, a string 'maintype/subtype' taken from header Content-Type converted to lowercase. When m has no header Content-Type, get_type returns default.

get_unixfrom

m.get_unixfrom(  )

Returns the envelope header string for m, or None if the envelope header was never set.

is_multipart

m.is_multipart(  )

Returns True when m's payload is a list, otherwise False.

preamble

Attribute m.preamble can be None or a string that becomes part of the message's string form before the first boundary line. Only mail programs that don't support multipart messages display this text to the user, so you can use this attribute to alert the user that your message is multipart and that a different mail program is needed to view it. preamble is a normal attribute of m: your program can access it when you're examining an m that is fully built by whatever means, and your program can bind it when you're building or modifying m in your program.

set_boundary

m.set_boundary(boundary)

Sets the boundary parameter of m's Content-Type header to boundary. When m has no Content-Type header, raises HeaderParseError.

set_payload

m.set_payload(payload)

Sets m's payload to payload, which must be a string or list, as appropriate.

set_unixfrom

m.set_unixfrom(unixfrom)

Sets the envelope header string for m. unixfrom is the entire envelope header line, including the leading 'From ' but not including the trailing '\n'.

walk

m.walk(  )

Returns an iterator on all parts and subparts of m, to walk the tree of parts depth-first.
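To make the tree walk concrete, here is a sketch that parses a small two-part message and lists the content type of every node. It assumes Python 3, where get_content_type plays the role that the text assigns to get_type; the message text and the boundary string are invented.

```python
import email

# Hypothetical two-part message; the boundary string "XYZ" is arbitrary.
raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "This text is the preamble.\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "first part\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>second part</p>\n"
    "--XYZ--\n"
)
m = email.message_from_string(raw)

# walk() yields m itself first, then each subpart, depth-first.
types = [part.get_content_type() for part in m.walk()]
parts = m.get_payload()       # a list, because m is multipart
boundary = m.get_boundary()
```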

21.2.3 The email.Generator Module

The email.Generator module supplies class Generator, which you can use to generate the textual form of a message m. m.as_string and str(m) may be sufficient, but class Generator gives you slightly more flexibility. You instantiate Generator with a mandatory argument and two optional ones.

Generator

class Generator(outfp,mangle_from_=False,maxheaderlen=78)

outfp is a file or file-like object supplying method write. When mangle_from_ is true, an instance g of Generator prepends a '>' to any line in a message's payload that starts with 'From '. This helps make the message's textual form more safely parseable. g wraps each header line at semicolons, into physical lines of no more than maxheaderlen characters, for readability. To use g, just call it:

g(m, unixfrom=False)

This emits m in text form to outfp, like outfp.write(m.as_string(unixfrom)).
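The effect of mangle_from_ can be sketched as below. This example assumes Python 3, where the module is spelled email.generator and the emitting call is the method flatten rather than a direct call on the instance; the message content is made up.

```python
import io
from email.generator import Generator  # email.Generator in older releases
from email.message import Message

m = Message()
m["Subject"] = "Test"
m.set_payload("Hi\nFrom here on, this line looks like an envelope header\n")

out = io.StringIO()                      # any object with a write method
g = Generator(out, mangle_from_=True)
g.flatten(m, unixfrom=False)             # Python 3 spelling of the call
text = out.getvalue()

print(text)
```

The payload line that begins with 'From ' should appear in the output with a leading '>', while the headers are unaffected.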

21.2.4 Creating Messages

Package email supplies modules with names starting with 'MIME', each module supplying a subclass of Message named like the module. These classes make it easier to create Message instances of various MIME types. The MIME classes are as follows.

MIMEAudio

class MIMEAudio(_audiodata,_subtype=None,_encoder=None,**_params)

_audiodata is a byte string of audio data to pack in a message of MIME type 'audio/_subtype'. When _subtype is None, _audiodata must be parseable by standard Python module sndhdr to determine the subtype; otherwise MIMEAudio raises a TypeError. When _encoder is None, MIMEAudio encodes data as Base 64, which is generally optimal. Otherwise, _encoder must be callable with one parameter m, the message being constructed; _encoder must then call m.get_payload( ) to get the payload, encode the payload, put the encoded form back by calling m.set_payload, and set m['Content-Transfer-Encoding'] appropriately. MIMEAudio passes the _params dictionary of keyword argument names and values to m.add_header to construct m's Content-Type.

MIMEBase

class MIMEBase(_maintype,_subtype,**_params)

The base class of all MIME classes; directly subclasses Message. Instantiating:

m = MIMEBase(main,sub,**parms)

is equivalent to the longer and less convenient idiom:

m = Message(  )
m.add_header('Content-Type','%s/%s'%(main,sub),**parms)
m.add_header('Mime-Version','1.0')

MIMEImage

class MIMEImage(_imagedata,_subtype=None,_encoder=None,**_params)

Like MIMEAudio, but with maintype 'image' and using standard Python module imghdr to determine the subtype if needed.

MIMEMessage

class MIMEMessage(msg,_subtype='rfc822')

Packs msg, which must be an instance of Message (or a subclass), as the payload of a message of MIME type 'message/_subtype'.

MIMEText

class MIMEText(_text,_subtype='plain',_charset='us-ascii',_encoder=None)

Packs text string _text as the payload of a message of MIME type 'text/_subtype' with the given charset. When _encoder is None, MIMEText does not encode the text, which is generally optimal. Otherwise, _encoder must be callable with one parameter m, the message being constructed; _encoder must then call m.get_payload( ) to get the payload, encode the payload, put the encoded form back by calling m.set_payload, and set m['Content-Transfer-Encoding'] appropriately.
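A short sketch of building a text message with MIMEText follows. It assumes Python 3, where the class lives in module email.mime.text and the _encoder argument is no longer accepted, so the example passes only the text, subtype, and charset.

```python
from email.mime.text import MIMEText  # email.MIMEText in older releases

m = MIMEText("Hello, world\n", "plain", "us-ascii")
m["Subject"] = "Greeting"              # headers are set as on any Message

content_type = m.get_content_type()
charset = m.get_content_charset()
wire_form = m.as_string()              # complete textual form of the message

print(wire_form)
```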

21.2.5 The email.Encoders Module

The email.Encoders module supplies functions that take a message m as their only argument, encode m's payload, and set m's headers appropriately.

encode_base64

encode_base64(m)

Uses Base 64 encoding, optimal for arbitrary binary data.

encode_noop

encode_noop(m)

Does nothing to m's payload and headers.

encode_quopri

encode_quopri(m)

Uses Quoted Printable encoding, optimal for textual data that is not fully ASCII.

encode_7or8bit

encode_7or8bit(m)

Does nothing to m's payload, sets header Content-Transfer-Encoding to '8bit' if any byte of m's payload has the high bit set, or otherwise to '7bit'.
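The encoder functions can be tried on a hand-built message, as in this sketch. It assumes Python 3 (modules email.encoders and email.mime.base); the byte string stands in for arbitrary binary data.

```python
from email import encoders              # email.Encoders in older releases
from email.mime.base import MIMEBase

data = b"\x00\x01binary payload\xff"    # hypothetical binary content

m = MIMEBase("application", "octet-stream")
m.set_payload(data)
encoders.encode_base64(m)               # rewrites payload, sets the header

transfer_encoding = m["Content-Transfer-Encoding"]
round_trip = m.get_payload(decode=True) # undo the encoding again
```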

21.2.6 The email.Utils Module

The email.Utils module supplies miscellaneous functions useful for email processing.

decode

decode(s)

Decodes string s as per the rules in RFC 2047 and returns the resulting Unicode string.

dump_address_pair

dump_address_pair(pair)

pair is a pair of strings (name,email_address). dump_address_pair returns a string s with the address to insert in header fields such as To and Cc. When name is false (e.g., ''), dump_address_pair returns email_address.

encode

encode(s,charset='iso-8859-1',encoding='q')

Encodes string s (which must use the given charset) as per the rules in RFC 2047. encoding must be 'q' to specify Quoted Printable, or 'b' to specify Base 64.

formatdate

formatdate(timeval=None,localtime=False)

timeval is a number of seconds since the epoch. When timeval is None, formatdate uses the current time. When localtime is true, formatdate uses the local timezone; otherwise it uses UTC. formatdate returns a string with the given time instant formatted in the way specified by RFC 2822.

getaddresses

getaddresses(L)

Parses each item of L, a list of address strings as used in header fields such as To and Cc, and returns a list of pairs of strings (name,email_address). When getaddresses cannot parse an item of L as an address, getaddresses uses (None,None) as the corresponding item in the list it returns.

mktime_tz

mktime_tz(t)

t is a tuple with 10 items. The first 9 are in the same format used in module time, covered in Chapter 12; t[-1] is a time zone as an offset in seconds from UTC (with the opposite sign from time.timezone, as specified by RFC 2822). When t[-1] is None, mktime_tz uses the local time zone. mktime_tz returns a float with the number of seconds since the epoch, in UTC, corresponding to the time instant that t denotes.

parseaddr

parseaddr(s)

Parses string s, which contains an address as typically specified in header fields such as To and Cc, and returns a pair of strings (name,email_address). When parseaddr cannot parse s as an address, parseaddr returns (None,None).
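The address helpers can be sketched as follows, assuming Python 3 (module email.utils). For building an address string, this sketch uses formataddr, the name that current releases provide for the job the text calls dump_address_pair; all names and addresses are invented.

```python
from email.utils import formataddr, getaddresses, parseaddr

# Split one address into its (name, email_address) pair.
name, addr = parseaddr("Jane Doe <jane@example.com>")

# Parse a list of header values, each of which may hold several addresses.
pairs = getaddresses(["Ann <ann@example.com>, bob@example.com"])

# Go the other way: build a header-ready string from a pair.
with_name = formataddr(("Jane Doe", "jane@example.com"))
bare = formataddr(("", "jane@example.com"))   # empty name: address only
```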

parsedate

parsedate(s)

Parses string s as per the rules in RFC 2822 and returns a tuple t with 9 items, as used in module time covered in Chapter 12 (the items t[-3:] are not meaningful). parsedate also attempts to parse erroneous variations on RFC 2822 that widespread mailers use. When parsedate cannot parse s, parsedate returns None.

parsedate_tz

parsedate_tz(s)

Like parsedate, but returns a tuple t with 10 items, where t[-1] is s's time zone as an offset in seconds from UTC (with the opposite sign from time.timezone, as specified by RFC 2822), like in the argument that mktime_tz accepts. Items t[-4:-1] are not meaningful. When s has no time zone, t[-1] is None.
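The date functions fit together as a round trip, sketched here under the assumption of Python 3 (module email.utils). The date string is an arbitrary example.

```python
from email.utils import formatdate, mktime_tz, parsedate_tz

s = "Fri, 09 Nov 2001 01:08:47 -0800"

t = parsedate_tz(s)              # 10 items; t[-1] is the UTC offset in seconds
seconds = mktime_tz(t)           # seconds since the epoch, in UTC
utc_text = formatdate(seconds)   # RFC 2822 text, in UTC since localtime is false

print(t[-1])
print(utc_text)
```

The offset of -0800 shows up as a negative number of seconds in t[-1], and the formatted result is eight hours later on the clock because it is expressed in UTC.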

quote

quote(s)

Returns a copy of string s where each double quote (") becomes '\"' and each existing backslash is repeated.

unquote

unquote(s)

Returns a copy of string s where leading and trailing double quote characters (") and angle brackets (<>) are removed if they surround the rest of s.
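A quick sketch of quote and unquote, assuming Python 3 (module email.utils); the sample strings are arbitrary.

```python
from email.utils import quote, unquote

quoted = quote('say "hi" \\ bye')     # escapes double quotes and backslashes
plain = unquote('"hello"')            # strips surrounding double quotes
address = unquote("<jane@example.com>")  # strips surrounding angle brackets
untouched = unquote("no delimiters")  # nothing to strip, returned as-is
```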

21.2.7 The Message Classes of the rfc822 and mimetools Modules

The best way to handle email-like messages is with package email. However, other modules covered in Chapter 18 and Chapter 20 use instances of class rfc822.Message or its subclass mimetools.Message. This section covers the subset of these classes' functionality that you need to make effective use of the modules covered in Chapter 18 and Chapter 20.

An instance m of class Message is a mapping, with the headers' names as keys and the corresponding header value strings as values. Keys and values are strings, and keys are case-insensitive. m supports all mapping methods except clear, copy, popitem, and update. get and setdefault default to '', instead of None. Instance m also supplies convenience methods (e.g., to combine getting a header's value and parsing it as a date or an address). I suggest you use for such purposes the functions of module email.Utils, covered earlier in this chapter, and use m just as a mapping.

When m is an instance of mimetools.Message, m supplies additional methods.

getmaintype

m.getmaintype(  )

Returns m's main content type, taken from header Content-Type converted to lowercase. When m has no header Content-Type, getmaintype returns 'text'.

getparam

m.getparam(param)

Returns the string value of the parameter named param of m's header Content-Type.

getsubtype

m.getsubtype(  )

Returns m's content subtype, taken from header Content-Type converted to lowercase. When m has no header Content-Type, getsubtype returns 'plain'.

gettype

m.gettype(  )

Returns m's content type, taken from header Content-Type converted to lowercase. When m has no header Content-Type, gettype returns 'text/plain'.


