23.4 Changing and Generating XML

As with HTML and other kinds of structured text, the simplest way to output an XML document is often to prepare and write it using Python's normal string and file operations, covered in Chapter 9 and Chapter 10. Templating, covered in Chapter 22, is often a good alternative. Subclassing class XMLGenerator, covered earlier in this chapter, is a good way to generate an XML document that is like an input XML document except for a few changes.
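As a rough illustration of the string-based approach, the following sketch builds a tiny document with ordinary string operations. It assumes a current Python 3 interpreter. The element names and the helper item_xml are hypothetical, and the use of escape and quoteattr from xml.sax.saxutils is one reasonable choice for keeping the output well-formed, not the only one.

```python
from xml.sax.saxutils import escape, quoteattr

def item_xml(name, price):
    # Hypothetical helper: quoteattr supplies the surrounding quotes and
    # escapes the attribute value; escape handles &, <, and > in text.
    return '<item price=%s>%s</item>' % (quoteattr(price), escape(name))

lines = ['<?xml version="1.0"?>', '<order>']
lines.append(item_xml('Nuts & bolts', '4.50'))
lines.append('</order>')

# The finished document is just a string, ready for file.write
document = '\n'.join(lines)
```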

The xml.dom.minidom module offers yet another possibility, because its classes support methods to generate, insert, remove, and alter nodes in a DOM tree representing the document. You can create a DOM tree by parsing and then alter it, or you can create an empty DOM tree and populate it, and then output the resulting XML document with methods toxml, toprettyxml, or writexml of the Document instance. You can also output a subtree of the DOM tree by calling these methods on the Node that is the subtree's root.
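Both routes can be sketched in a few lines, assuming a current Python 3 interpreter. The tag names and text used here are made up for illustration.

```python
import xml.dom.minidom

# Route 1: create an empty DOM tree and populate it
doc = xml.dom.minidom.Document()
root = doc.createElement('greeting')
doc.appendChild(root)
root.appendChild(doc.createTextNode('Hello, world'))

whole = doc.toxml()      # whole document, including the XML declaration
subtree = root.toxml()   # only the subtree rooted at the element

# Route 2: create a DOM tree by parsing, then alter it
parsed = xml.dom.minidom.parseString('<greeting lang="en">Hi</greeting>')
parsed.documentElement.setAttribute('lang', 'en-US')
altered = parsed.documentElement.toxml()
```

Calling toxml on the Document yields the XML declaration followed by the root element, while calling it on the root element yields the element's source alone.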

23.4.1 Factory Methods of a Document Object

The Document class supplies factory methods to create new instances of subclasses of Node. The most frequently used factory methods of a Document instance d are as follows.

createComment

d.createComment(data)

Builds and returns an instance c of class Comment for a comment with text data.

createElement

d.createElement(tagname)

Builds and returns an instance e of class Element for an element with the given tag.

createTextNode

d.createTextNode(data)

Builds and returns an instance t of class Text for a text node with text data.
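A minimal sketch of the three factory methods working together, assuming a current Python 3 interpreter. The comment text, tag name, and title are illustrative. Note that a factory method only builds the node. Attaching the node to the tree is a separate step.

```python
import xml.dom.minidom

d = xml.dom.minidom.Document()

c = d.createComment(' generated file ')
e = d.createElement('title')
t = d.createTextNode('Python & XML')

# A freshly built node has no parent until it is attached
was_parentless = e.parentNode is None

d.appendChild(c)    # comment before the root element
d.appendChild(e)    # e becomes the document's root element
e.appendChild(t)    # the text becomes e's only child

result = d.toxml()
```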

23.4.2 Mutating Methods of an Element Object

An instance e of class Element supplies the following methods to remove and add attributes.

removeAttribute

e.removeAttribute(name)

Removes e's attribute with the given name.

setAttribute

e.setAttribute(name,value)

Changes e's attribute with the given name to have the given value, or adds to e a new attribute with the given name and value if e had no attribute named name.
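For instance, the following sketch (assuming a current Python 3 interpreter, with a made-up hyperlink element) changes one attribute, adds another, and removes a third:

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    '<a href="old.html" target="_blank">link</a>')
e = doc.documentElement

e.setAttribute('href', 'new.html')   # attribute exists: value is changed
e.setAttribute('title', 'A link')    # attribute is missing: it is added
e.removeAttribute('target')          # attribute is removed

result = e.toxml()
```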

23.4.3 Mutating Methods of a Node Object

An instance n of class Node supplies the following methods to remove, add, and replace children.

appendChild

n.appendChild(child)

Makes child the last child of n, whatever child's parent was (including n or None).

insertBefore

n.insertBefore(child,nextChild)

Makes child the child of n immediately before nextChild, whatever child's parent was (including n or None). nextChild must be a child of n.

removeChild

n.removeChild(child)

Makes child parentless and returns child. child must be a child of n.

replaceChild

n.replaceChild(child,oldChild)

Makes child the child of n in oldChild's place, whatever child's parent was (including n or None). oldChild must be a child of n. Returns oldChild.
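The four methods can be traced on a small list, as in this sketch. It assumes a current Python 3 interpreter. The helper make_li and the list items are hypothetical. The trailing comments show the order of the children after each call.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString('<ul><li>b</li><li>c</li></ul>')
ul = doc.documentElement
b, c = ul.childNodes

def make_li(text):
    # Hypothetical helper: builds a parentless <li> holding the given text
    li = doc.createElement('li')
    li.appendChild(doc.createTextNode(text))
    return li

a = make_li('a')
d = make_li('d')

ul.insertBefore(a, b)                    # a b c
ul.appendChild(d)                        # a b c d
ul.appendChild(a)                        # b c d a  (a moves, is not copied)
removed = ul.removeChild(c)              # b d a    (removed is c)
old = ul.replaceChild(make_li('x'), b)   # x d a    (old is b)

result = ul.toxml()
```

Appending a, which is already a child of ul, moves it to the end rather than duplicating it, since a node has at most one parent.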

23.4.4 Output Methods of a Node Object

An instance n of class Node supplies the following methods to output the subtree rooted at n.

toprettyxml

n.toprettyxml(indent='\t',newl='\n')

Returns a string, plain or Unicode, with the XML source for the subtree rooted at n, using indent to indent nested tags and newl to end lines.

toxml

n.toxml(  )

Like n.toprettyxml('',''), i.e., inserts no extraneous whitespace.

writexml

n.writexml(file)

Writes the XML source for the subtree rooted at n to file-like object file, open for writing. Note that file.write must accept Unicode strings (as covered in Section 9.6.1), unless all text in the XML source produced can be converted implicitly to plain strings using the current default encoding (normally 'ascii').
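The three output methods can be compared side by side, as in this sketch. It assumes a current Python 3 interpreter, where all strings are Unicode, so io.StringIO serves as a file-like object whose write method accepts them. The sample document is made up.

```python
import io
import xml.dom.minidom

doc = xml.dom.minidom.parseString('<a><b>text</b><c/></a>')
root = doc.documentElement

# No added whitespace
compact = root.toxml()

# Nested tags indented by two spaces, lines ended with '\n'
pretty = root.toprettyxml(indent='  ', newl='\n')

# Same source as toxml, but written to a file-like object
buf = io.StringIO()
root.writexml(buf)
written = buf.getvalue()
```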

23.4.5 Changing and Outputting XHTML with xml.dom.minidom

The following example uses xml.dom.minidom to analyze an XHTML page and output it to standard output with each hyperlink's destination URL shown, in three sets of parentheses, just before the hyperlink:

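import xml.dom.minidom, urllib, sys

# fetch and parse the page, which must be well-formed XML
f = urllib.urlopen('http://www.w3.org/MarkUp/')
doc = xml.dom.minidom.parse(f)

# insert each hyperlink's destination, in triple parentheses, just before it
# (the list is not named 'as', which is a reserved word from Python 2.6 on)
anchors = doc.getElementsByTagName('a')
for a in anchors:
    value = a.getAttribute('href')
    if value:
        newtext = doc.createTextNode(' (((%s)))'%value)
        a.parentNode.insertBefore(newtext,a)

# file-like wrapper that encodes Unicode output as UTF-8 for sys.stdout
class UnicodeStdoutWriter:
    def write(self, data):
        sys.stdout.write(data.encode('utf-8'))

doc.writexml(UnicodeStdoutWriter(  ))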

This example wraps sys.stdout in a little UnicodeStdoutWriter class in order to encode Unicode output. Further, it uses encoding 'utf-8' because that is the encoding that the XML standard specifies as the default, and up to Python 2.2.2 we have no way of asking object doc to explicitly request a different encoding. In Python 2.3, method writexml accepts an optional keyword argument named encoding that lets us control the encoding attribute in the XML declaration.
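To try the same transformation without network access, here is a sketch that works on a short in-memory document. It assumes a current Python 3 interpreter. The sample markup and the name XHTML are made up for illustration. It also passes the encoding keyword argument to writexml, which sets only the encoding attribute of the XML declaration. Encoding the resulting text is still up to the caller.

```python
import io
import xml.dom.minidom

# Hypothetical sample page: one real hyperlink, one anchor without href
XHTML = ('<html><body><p>See <a href="http://www.w3.org/">W3C</a> and '
         '<a name="top">an anchor without href</a>.</p></body></html>')

doc = xml.dom.minidom.parseString(XHTML)
for a in doc.getElementsByTagName('a'):
    value = a.getAttribute('href')
    if value:   # skip anchors that have no destination
        newtext = doc.createTextNode(' (((%s)))' % value)
        a.parentNode.insertBefore(newtext, a)

out = io.StringIO()
doc.writexml(out, encoding='utf-8')   # declaration now names utf-8
result = out.getvalue()

# Encode explicitly so the bytes match what the declaration promises
data = result.encode('utf-8')
```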


