Use this stylesheet to extract only the text from any XML document.
Sometimes you just want to leave the XML behind and keep only the text found in a document. The stylesheet text.xsl, shown in Example 3-15, can do that for you. It can be applied to any XML document, including XHTML. (There's an even easier way; see "Built-in Templates," which follows.)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>
 <xsl:template match="/">
  <xsl:apply-templates select="*"/>
 </xsl:template>
</xsl:stylesheet>
This stylesheet finds the root node and then selects all element children (*) for processing. To test, apply this stylesheet to the XHTML document magnacarta.html, the pact between King John and the barony in England that was first signed at Runnymede on June 15, 1215 (see http://www.cs.indiana.edu/statecraft/magna-carta.html):
xalan magnacarta.html text.xsl
A small portion of the output is shown in Example 3-16. The result is shown in IE in Figure 3-18.
Magna Carta The Magna Carta JOHN, by the grace of God King of England, Lord of Ireland, Duke of Normandy and Aquitaine, and Count of Anjou, to his archbishops, bishops, abbots, earls, barons, justices, foresters, sheriffs, stewards, servants, and to all his officials and loyal subjects, Greeting. KNOW THAT BEFORE GOD, for the health of our soul and those of our ancestors and heirs, to the honour of God, the exaltation of the holy Church, and the better ordering of our kingdom, at the advice of our reverend fathers Stephen, archbishop of Canterbury, primate of all England, and cardinal of the holy Roman Church, Henry archbishop of Dublin, William bishop of London, Peter bishop of Winchester, Jocelin bishop of Bath and Glastonbury, Hugh bishop of Lincoln, Walter Bishop of Worcester, William bishop of Coventry, Benedict bishop of Rochester, Master Pandulf subdeacon and member of the papal household, Brother Aymeric master of the knighthood of the Temple in England, William Marshal earl of Pembroke, William earl of Salisbury, William earl of Warren, William earl of Arundel, Alan de Galloway constable of Scotland, Warin Fitz Gerald, Peter Fitz Herbert, Hubert de Burgh seneschal of Poitou, Hugh de Neville, Matthew Fitz Herbert, Thomas Basset, Alan Basset, Philip Daubeny, Robert de Roppeley, John Marshal, John Fitz Hugh, and other loyal subjects:
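Because text.xsl is an ordinary stylesheet, it can be extended with templates that shape the output. The following is a minimal sketch, not part of the original example: it assumes the source document has a title element, either in no namespace or in the XHTML namespace, and that you want its text left out of the result. The h prefix is introduced here purely for illustration.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>

  <!-- Same starting point as text.xsl: process the element children of the root node -->
  <xsl:template match="/">
    <xsl:apply-templates select="*"/>
  </xsl:template>

  <!-- Assumed addition: an empty template produces no output for title,
       so the text inside it is dropped. The union covers a title in no
       namespace as well as one in the XHTML namespace. -->
  <xsl:template match="title | h:title"/>
</xsl:stylesheet>
```

If this were saved under a hypothetical name such as text-notitle.xsl, it could be applied just as text.xsl is, with that filename in place of text.xsl on the command line. Every element other than title still falls through to the built-in templates, so the rest of the text is extracted as before.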
You can also extract text from a document just by relying on XSLT's built-in templates. A stylesheet as simple as this single line:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"/>
will invoke the built-in templates because it contains no explicit template for any node that might be found in the source document. The built-in templates process all the children of the root node and of every element, and copy the text of text nodes and attributes through to the result (the built-in templates do nothing for comment, processing-instruction, or namespace nodes). The benefit of using text.xsl over the built-in templates alone is that text.xsl gives you a framework for exercising some control over the output (e.g., by adding templates). However, adding templates to text.xsl won't make any difference unless those templates match nodes in the source document more precisely than the built-in template matching * does, and so take precedence over it. An empty stylesheet is the simplest one to start from if you want to add more precise templates.
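To make the built-in behavior concrete, the rules can be written out as explicit templates. The sketch below follows the built-in template rules as described in the XSLT 1.0 Recommendation (default mode only). The xsl:output element is an addition of this sketch, so that a processor does not emit an XML declaration ahead of the text; the empty stylesheet above does not include it.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Added for this sketch; not part of the built-in rules -->
  <xsl:output method="text"/>

  <!-- Root node and elements: keep descending into child nodes.
       Child nodes do not include attributes, so attribute values
       appear only if some template selects them explicitly. -->
  <xsl:template match="*|/">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Text nodes and attributes: copy the string value to the result -->
  <xsl:template match="text()|@*">
    <xsl:value-of select="."/>
  </xsl:template>

  <!-- Comments and processing instructions: produce nothing -->
  <xsl:template match="processing-instruction()|comment()"/>
</xsl:stylesheet>
```

Written out this way, the rules behave like any other templates in the stylesheet, whereas the true built-in rules rank below every explicit template. For plain text extraction the result should be the same, and the explicit form gives you a place to start changing individual rules.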