Hack 53 Transform XML Documents with grep and sed


Use grep and sed to transform XML instead of XSLT.

You can use a pair of good old Unix utilities, grep and sed, to transform XML. Both of these utilities allow you to search based on regular expressions, a powerful though sometimes complex language for searching sets of strings. This hack will provide some examples of how you can use regular expressions to transform XML documents.

This hack covers regular expressions only as far as the examples require; it is not a tutorial on them.


3.24.1 grep

grep is a Unix utility, but it also runs on other platforms. If you have a Linux distribution, such as Red Hat (http://www.redhat.com), or if you have Cygwin on Windows (http://www.cygwin.com), grep is already available to you at the shell. You can also get the GNU distribution of grep from http://www.gnu.org, but you'll have to compile the C code to get it to work. This hack uses Version 2.5 of grep.

Say, for example, you wanted to grab part of an XML document and create a new one; instead of using XSLT, you could use grep and regular expressions in some circumstances. If you are familiar with regular expressions, using grep to do such things may come easily to you. Take a look at time.xml:

<?xml version="1.0" encoding="UTF-8"?>

   

<!-- a time instant -->

<time timezone="PST">

 <hour>11</hour>

 <minute>59</minute>

 <second>59</second>

 <meridiem>p.m.</meridiem>

 <atomic signal="true"/>

</time>

If you just want to extract the XML declaration, the document element, and the hour element, try:

grep "<?\|<time\|hour\|<\/time" time.xml

The quotes are essential: they make the shell pass the entire regular expression to grep as a single argument. The expression finds lines in time.xml that match any of the following: the XML declaration, using <? (which would also match a processing instruction, if present); the time start tag, using <time; the hour element, using hour; and the time end tag, using <\/time (the backslash before the slash is not required). The vertical bar (|) means alternation; in GNU grep's basic regular expression syntax it must be written as \| to act as an operator rather than as a literal character. In other words, the regular expression will match <? or <time or hour or </time.

grep also supports extended regular expressions, either through the -E switch or through the egrep command, so the same search can be written like this:

grep -E "(<\?|<time|hour|</time)" time.xml

Or:

egrep "(<\?|<time|hour|</time)" time.xml

Note the parentheses, which group the alternatives (they are optional here). In extended syntax, the vertical bars and the slash don't need to be escaped, but the question mark does, because egrep otherwise interprets it as a repetition operator meaning zero or one.

If you run any of these commands, here is what you'll get:

<?xml version="1.0" encoding="UTF-8"?>

<time timezone="PST">

 <hour>11</hour>

</time>
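As a quick check on the claims above, the following sketch recreates time.xml in a scratch directory and compares the basic and extended forms. It assumes GNU grep, and it assumes the file has ten lines with a blank second line; the variable names are illustrative only. It also counts matches for an unescaped question mark under -E, where `<?` means "an optional <" and therefore matches every line.

```shell
# Recreate time.xml in a scratch directory (assumed layout: 10 lines, line 2 blank).
tmp=$(mktemp -d)
cat > "$tmp/time.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>

<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true"/>
</time>
EOF

# Basic syntax (GNU \| alternation) versus extended syntax (-E).
basic=$(grep '<?\|<time\|hour\|</time' "$tmp/time.xml")
extended=$(grep -E '<\?|<time|hour|</time' "$tmp/time.xml")

# With -E, an unescaped ? makes the "<" optional, so every line matches.
unescaped_count=$(grep -c -E '<?' "$tmp/time.xml")
escaped_count=$(grep -c -E '<\?' "$tmp/time.xml")

printf '%s\n' "$extended"
echo "unescaped: $unescaped_count line(s), escaped: $escaped_count line(s)"
rm -r "$tmp"
```

If the two forms behave as described, `basic` and `extended` hold the same four lines, and only the escaped pattern singles out the XML declaration.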

A slight variation is:

grep "<?\|time\|hour" time.xml

or:

grep -E "(<\?|time|hour)" time.xml

Either of which produces:

<?xml version="1.0" encoding="UTF-8"?>

<!-- a time instant -->

<time timezone="PST">

 <hour>11</hour>

</time>

Without < or <\/ in the regular expression, grep picks up both time tags, plus the comment that contains the word time.
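If you want both time tags but not the comment, one option is to anchor every alternative on the opening angle bracket. The sketch below is an illustration rather than part of the original hack; it assumes GNU grep and the same ten-line time.xml, and the pattern `</?time` (an optional slash after <) is my own choice.

```shell
# Recreate time.xml in a scratch directory (assumed layout: 10 lines, line 2 blank).
tmp=$(mktemp -d)
cat > "$tmp/time.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>

<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true"/>
</time>
EOF

# <\?      matches the XML declaration
# </?time  matches "<time" and "</time", but not the bare word "time" in the comment
# <hour    matches the hour start tag
pattern='<\?|</?time|<hour'
tags_only=$(grep -E "$pattern" "$tmp/time.xml")
match_count=$(grep -c -E "$pattern" "$tmp/time.xml")

printf '%s\n' "$tags_only"
echo "matched $match_count line(s)"
rm -r "$tmp"
```

Because each alternative starts with <, text that merely mentions an element name no longer matches.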

Another approach you can take is to invert the match; that is, to print whatever does not match the regular expression. If you wanted to remove the atomic and meridiem elements from time.xml, you could use either the -v switch or the --invert-match switch, which have identical meaning:

grep -v "meridiem\|atomic" time.xml

or:

grep -v -E "meridiem|atomic" time.xml

These commands would yield all but the meridiem and atomic elements:

<?xml version="1.0" encoding="UTF-8"?>

   

<!-- a time instant -->

<time timezone="PST">

 <hour>11</hour>

 <minute>59</minute>

 <second>59</second>

</time>
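Keep in mind that grep works line by line, so inverted matching removes lines, not elements. The sketch below uses a hypothetical file, wrapped.xml, in which the atomic element is split across two lines; it is meant only to show the limitation, and it assumes nothing beyond a standard grep.

```shell
tmp=$(mktemp -d)
# Hypothetical variant of time.xml in which the atomic element spans two lines.
cat > "$tmp/wrapped.xml" <<'EOF'
<time timezone="PST">
 <hour>11</hour>
 <atomic
   signal="true"/>
</time>
EOF

# -v drops only the line that contains "atomic"; the continuation line
# survives, so the output is no longer well-formed XML.
leftover=$(grep -v 'atomic' "$tmp/wrapped.xml")
printf '%s\n' "$leftover"
rm -r "$tmp"
```

The dangling `signal="true"/>` line in the output is the reason this technique suits only documents with one element per line.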

Yet another approach you can take is contextual matching. Several grep switches allow you to display lines near a match. Take the file sum.xml:

<?xml version="1.0" encoding="UTF-8"?>

<sums>

 <sum>

  <row1>19</row1>

  <row2>1411</row2>

  <row3>713</row3>

  <row4>1517</row4>

 </sum>

 <sum>

  <column1>312</column1>

  <column2>2263</column2>

  <column3>1085</column3>

 </sum>

</sums>

Suppose you wanted to grab the second (and last) sum element along with its content. You can do that with the context switch -C:

grep -C 2 2263 sum.xml

The -C switch followed by 2 and the expression 2263 means "grab the line that matches 2263 plus two lines above it and two lines below it." The result of this command would be:

 <sum>

  <column1>312</column1>

  <column2>2263</column2>

  <column3>1085</column3>

 </sum>
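The context switches are related: -C 2 should behave like -B 2 combined with -A 2. The following sketch checks that on a single-spaced copy of sum.xml; it assumes a grep that supports the -A, -B, and -C switches (GNU grep does), and the variable names are illustrative.

```shell
tmp=$(mktemp -d)
cat > "$tmp/sum.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<sums>
 <sum>
  <row1>19</row1>
  <row2>1411</row2>
  <row3>713</row3>
  <row4>1517</row4>
 </sum>
 <sum>
  <column1>312</column1>
  <column2>2263</column2>
  <column3>1085</column3>
 </sum>
</sums>
EOF

# Two lines of context on each side of the single line that contains 2263.
context=$(grep -C 2 2263 "$tmp/sum.xml")
# The same request spelled out with separate before/after switches.
explicit=$(grep -B 2 -A 2 2263 "$tmp/sum.xml")

printf '%s\n' "$context"
rm -r "$tmp"
```

This works only because 2263 occurs once and the element is exactly five lines long; a different layout would need different counts.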

The following example uses the -B (before) and -A (after) switches. Look at worksheet.xml:

<?xml version="1.0" encoding="UTF-8"?>

   

<worksheet>

 <column>

  <row>12</row>

  <row>199</row>

  <row>72</row>

  <row>29</row>

 </column>

 <column>

  <row>5</row>

  <row>783</row>

  <row>43</row>

  <row>1432</row>

 </column>

 <column>

  <row>2</row>

  <row>429</row>

  <row>598</row>

  <row>56</row>

 </column>

</worksheet>

Suppose you wanted to get only the first column element and its children. You could use this command:

grep -B 2 -A 3 199 worksheet.xml > column.xml

This means "get two lines before and three lines after 199." This command redirects the output to the file column.xml:

 <column>

  <row>12</row>

  <row>199</row>

  <row>72</row>

  <row>29</row>

 </column>
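A bare number such as 199 also matches any longer number that contains it. The sketch below uses a hypothetical, shortened worksheet in which a second column holds 1199, and shows one way to pin the match to the element content by including the surrounding > and < characters. The file contents and variable names are assumptions made for illustration.

```shell
tmp=$(mktemp -d)
# Hypothetical worksheet: the second column contains 1199, which also contains "199".
cat > "$tmp/worksheet.xml" <<'EOF'
<worksheet>
 <column>
  <row>12</row>
  <row>199</row>
  <row>72</row>
  <row>29</row>
 </column>
 <column>
  <row>5</row>
  <row>1199</row>
  <row>43</row>
  <row>1432</row>
 </column>
</worksheet>
EOF

# The bare pattern matches both the 199 row and the 1199 row.
loose_count=$(grep -c 199 "$tmp/worksheet.xml")

# Including the tag delimiters restricts the match to the exact value.
exact=$(grep -B 2 -A 3 '>199<' "$tmp/worksheet.xml")

echo "bare pattern matched $loose_count line(s)"
printf '%s\n' "$exact"
rm -r "$tmp"
```

With the tighter pattern, only the first column element and its four rows are printed.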

This has been just a sampling, a starting point, for the kinds of transformations you can apply to XML documents using grep. Now we will try a few tricks with sed.

3.24.2 sed

sed is a stream editor that applies editing commands, derived from those of the ed line editor, to a stream of input such as a file. grep can only select content that already exists in a document, but sed can also search for and replace content and perform other simple transformations. Like grep, sed is readily available on Unix, Linux, or Cygwin, or you can download it from http://www.gnu.org. This hack uses Version 4.0.8 of sed. For example, try this command:

sed '2,3d;6,9d' time.xml

The editor command says "delete lines 2, 3, and 6 through 9." You'll get this result:

<?xml version="1.0" encoding="UTF-8"?>

<time timezone="PST">

 <hour>11</hour>

</time>
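The same selection can be expressed the other way around: instead of deleting the unwanted lines, suppress default output with -n and print only the wanted lines with the p command. This is a sketch of an equivalent formulation, not a command from the original hack; it assumes the ten-line time.xml with a blank second line.

```shell
# Recreate time.xml in a scratch directory (assumed layout: 10 lines, line 2 blank).
tmp=$(mktemp -d)
cat > "$tmp/time.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>

<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true"/>
</time>
EOF

# Delete lines 2-3 and 6-9, printing everything else.
deleted=$(sed '2,3d;6,9d' "$tmp/time.xml")
# Print nothing by default (-n), then print lines 1, 4-5, and 10 explicitly.
printed=$(sed -n '1p;4,5p;10p' "$tmp/time.xml")

printf '%s\n' "$printed"
rm -r "$tmp"
```

Which form reads better usually depends on whether you are keeping or discarding the smaller part of the file.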

A simple search-and-replace script can change the name of an element. For example, the s (substitute) command in this line:

sed -e s/meridiem/am-pm/g time.xml

will replace the meridiem element name with am-pm:

<?xml version="1.0" encoding="UTF-8"?>

   

<!-- a time instant -->

<time timezone="PST">

 <hour>11</hour>

 <minute>59</minute>

 <second>59</second>

 <am-pm>p.m.</am-pm>

 <atomic signal="true"/>

</time>
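Note that s/meridiem/am-pm/g replaces the string wherever it occurs, including inside text content. The sketch below contrasts it with a stricter substitution that matches only start and end tags. The sample line is hypothetical, and the stricter pattern is an illustration that handles only tags without attributes.

```shell
# Hypothetical element whose text content also contains the word "meridiem".
sample='<meridiem>ante meridiem</meridiem>'

# Loose: every occurrence is replaced, including the one in the text.
loose=$(printf '%s\n' "$sample" | sed -e 's/meridiem/am-pm/g')

# Strict: match "<meridiem>" or "</meridiem>" only. \(/\{0,1\}\) captures an
# optional slash, and \1 puts it back in the replacement.
strict=$(printf '%s\n' "$sample" | sed -e 's|<\(/\{0,1\}\)meridiem>|<\1am-pm>|g')

echo "loose:  $loose"
echo "strict: $strict"
```

The strict form renames the element and leaves the text "ante meridiem" untouched.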

You can also store a series of sed commands in a script file, such as translate.sed:

3d

8d

s/timezone/Zeitzone/g

s/PST/CET/g

s/time/Zeit/g

s/hour/Uhr/g

s/minute/Minute/g

s/second/Sekundant/g

s/atomic/atomar/g

s/signal/Signal/g

s/true/treu/g

These commands translate the tag names, attribute names, and attribute values in time.xml into German, and also drop the comment (line 3) and the meridiem element (line 8). Use the -f switch to run the commands in the file:

sed -f translate.sed time.xml

This will produce the following output:

<?xml version="1.0" encoding="UTF-8"?>

   

<Zeit Zeitzone="CET">

 <Uhr>11</Uhr>

 <Minute>59</Minute>

 <Sekundant>59</Sekundant>

 <atomar Signal="treu"/>

</Zeit>
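Line-number addresses such as 3d and 8d depend on the exact layout of time.xml. As a sketch of an alternative that the original script does not use, the hypothetical script file drop.sed below selects the same lines by pattern instead. It assumes the comment and the meridiem element each sit on a single line, and it carries only one of the substitutions to keep the example short.

```shell
# Recreate time.xml in a scratch directory (assumed layout: 10 lines, line 2 blank).
tmp=$(mktemp -d)
cat > "$tmp/time.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>

<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true"/>
</time>
EOF

# drop.sed is a hypothetical script file that uses pattern addresses.
cat > "$tmp/drop.sed" <<'EOF'
# Delete any one-line comment, wherever it appears.
/<!--.*-->/d
# Delete the one-line meridiem element.
/<meridiem>/d
s/hour/Uhr/g
EOF

by_pattern=$(sed -f "$tmp/drop.sed" "$tmp/time.xml")
# Numeric addresses count input lines, so 8d still refers to the
# meridiem line even though line 3 has already been deleted.
by_number=$(sed -e '3d' -e '8d' -e 's/hour/Uhr/g' "$tmp/time.xml")

printf '%s\n' "$by_pattern"
rm -r "$tmp"
```

Pattern addresses keep working if lines are added or removed above the target, at the cost of matching any other line that happens to fit the pattern.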

The -i command-line option makes changes in place (i.e., the changes are written back to the file being edited). Suppose the file neue.xml starts out as a copy of time.xml. The following command performs the edits in place:

sed -f translate.sed -i.backup neue.xml

The suffix .backup is appended to the original file name to form the name of the backup file (neue.xml.backup). After the edits, neue.xml will look like this:

<?xml version="1.0" encoding="UTF-8"?>

   

<Zeit Zeitzone="CET">

 <Uhr>11</Uhr>

 <Minute>59</Minute>

 <Sekundant>59</Sekundant>

 <atomar Signal="treu"/>

</Zeit>

neue.xml.backup contains the original file before editing took place.
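To confirm that the backup really is the untouched original, you can compare it with the source file after an in-place edit. The sketch below assumes a sed that supports -i with an attached suffix, as GNU sed does; it uses a single substitution in place of the full translate.sed, and the variable names are illustrative.

```shell
# Recreate time.xml in a scratch directory (assumed layout: 10 lines, line 2 blank).
tmp=$(mktemp -d)
cat > "$tmp/time.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>

<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true"/>
</time>
EOF

# Start from an identical copy, then edit the copy in place with a backup.
cp "$tmp/time.xml" "$tmp/neue.xml"
sed -i.backup -e 's/hour/Uhr/g' "$tmp/neue.xml"

# The backup should be byte-for-byte identical to the original file.
if cmp -s "$tmp/time.xml" "$tmp/neue.xml.backup"; then
  backup_ok=yes
else
  backup_ok=no
fi

# The edited file should now carry the renamed element.
edited_line=$(grep 'Uhr' "$tmp/neue.xml")

echo "backup identical to original: $backup_ok"
printf '%s\n' "$edited_line"
rm -r "$tmp"
```

Keeping the backup suffix is a cheap safeguard, since an in-place edit with a faulty pattern otherwise overwrites the only copy.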

3.24.3 See Also

  • Regular Expression Library: http://regexlib.com/

  • sed & awk by Dale Dougherty and Arnold Robbins (O'Reilly).

  • Mastering Regular Expressions by Jeffrey E. F. Friedl (O'Reilly).