Hack 39 Create a Text File from an XML Document


Use this stylesheet to extract only the text from any XML document.

Sometimes you just want to leave the XML behind and keep only the text found in a document. The stylesheet text.xsl, shown in Example 3-15, can do that for you. (There's an even easier way; see "Built-in Templates," later in this hack.) It can be applied to any XML document, including XHTML.

Example 3-15. text.xsl
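<xsl:stylesheet version="1.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- Write plain text rather than XML markup -->
<xsl:output method="text"/>

<!-- Start at the root node and process its element children -->
<xsl:template match="/">
 <xsl:apply-templates select="*"/>
</xsl:template>

</xsl:stylesheet>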

This stylesheet finds the root node and then selects all element children (*) for processing. To test it, apply the stylesheet to magnacarta.html, an XHTML version of the Magna Carta, the pact between King John and the barony in England that was first signed at Runnymede on June 15, 1215 (see http://www.cs.indiana.edu/statecraft/magna-carta.html):

xalan magnacarta.html text.xsl

A small portion of the output is shown in Example 3-16. The source document, as displayed in IE, is shown in Figure 3-18.

Example 3-16. A portion of the Magna Carta
Magna Carta

The Magna Carta
JOHN, by the grace of God King of England, Lord of Ireland,
Duke of Normandy and Aquitaine, and Count of Anjou, to his
archbishops, bishops, abbots, earls, barons, justices,
foresters, sheriffs, stewards, servants, and to all his
officials and loyal subjects, Greeting.

KNOW THAT BEFORE GOD, for the health of our soul and those of
our ancestors and heirs, to the honour of God, the exaltation
of the holy Church, and the better ordering of our kingdom, at
the advice of our reverend fathers Stephen, archbishop of
Canterbury, primate of all England, and cardinal of the holy
Roman Church, Henry archbishop of Dublin, William bishop of
London, Peter bishop of Winchester, Jocelin bishop of Bath and
Glastonbury, Hugh bishop of Lincoln, Walter Bishop of Worcester,
William bishop of Coventry, Benedict bishop of Rochester, Master
Pandulf subdeacon and member of the papal household, Brother
Aymeric master of the knighthood of the Temple in England,
William Marshal earl of Pembroke, William earl of Salisbury,
William earl of Warren, William earl of Arundel, Alan de
Galloway constable of Scotland, Warin Fitz Gerald, Peter Fitz
Herbert, Hubert de Burgh seneschal of Poitou, Hugh de Neville,
Matthew Fitz Herbert, Thomas Basset, Alan Basset, Philip Daubeny,
Robert de Roppeley, John Marshal, John Fitz Hugh, and other loyal
subjects:

Figure 3-18. The Magna Carta (magnacarta.html) in IE
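
A smaller case may make it easier to see what text.xsl keeps and what it drops. The document below is hypothetical: the filename memo.xml and its element and attribute names are invented for this illustration and do not come from the Magna Carta example.

```xml
<?xml version="1.0"?>
<!-- memo.xml: a hypothetical test document, not part of the hack's files -->
<memo date="1215-06-15">
  <to>Barons</to>
  <body>Meet at <place>Runnymede</place>.</body>
</memo>
```

Applying text.xsl to this document should produce only its character data (Barons and Meet at Runnymede.) along with the whitespace that separates the elements. The value of the date attribute and the comment do not appear, because only text nodes reached through child elements are copied to the output.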


3.10.1 Built-in Templates

You can also extract text from a document just by relying on XSLT's built-in templates. A stylesheet as simple as this single empty element:

<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"/>

will invoke the built-in templates because it contains no explicit template for any node that might be found in the source document. The built-in templates process all the children of the root and of every element, and copy text through for attribute and text nodes (they do nothing for comment, processing-instruction, or namespace nodes). The benefit of using text.xsl over the built-in templates alone is that text.xsl gives you a framework for exercising some control over the output (e.g., by adding templates). However, adding templates to text.xsl won't make any difference unless those templates match elements in the document more precisely (and therefore have higher priority) than the built-in template matching *. An empty stylesheet is the simplest one to start from if you want to add more precise templates.
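
As a rough sketch of what adding a more precise template might look like, the variant of text.xsl below drops the XHTML head element, and with it the title text, from the output. It rests on an assumption that is not confirmed by the hack: that the source document places its elements in the XHTML namespace, http://www.w3.org/1999/xhtml. The h prefix is an arbitrary choice made for this illustration; if the document declares no namespace, the pattern would simply be head. The templates matching * and text()|@* are written out only to show what the built-in templates already do implicitly, and removing them should not change the result.

```xml
<xsl:stylesheet version="1.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            xmlns:h="http://www.w3.org/1999/xhtml">
<xsl:output method="text"/>

<!-- Same as text.xsl: start at the root and process its element children -->
<xsl:template match="/">
 <xsl:apply-templates select="*"/>
</xsl:template>

<!-- Explicit equivalent of the built-in element template (illustration only):
     keep descending into child nodes -->
<xsl:template match="*">
 <xsl:apply-templates/>
</xsl:template>

<!-- Explicit equivalent of the built-in text and attribute template
     (illustration only): copy the string value through -->
<xsl:template match="text()|@*">
 <xsl:value-of select="."/>
</xsl:template>

<!-- The addition: an empty template, so nothing is output for head or its
     descendants. The pattern h:head is more specific than *, so it takes
     priority. Assumes the source document uses the XHTML namespace. -->
<xsl:template match="h:head"/>

</xsl:stylesheet>
```

Applied in the same way as text.xsl, this variant should yield the same text without the title line from the head, provided the namespace assumption holds for the source document.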