Hack 98 Process XML with C#

Even if you aren't a C# programmer, you can get up to speed on processing XML with C# in short order with this hack.

C# is an object-oriented programming language that comes as part of Microsoft's .NET framework (http://www.microsoft.com/net/), which was introduced in 2000. C# has taken a lot of lessons from C, C++, and Java, but I won't get into a comparison of these languages here. (For a good discussion of this, see Dare Obasanjo's "A Comparison of Microsoft's C# Programming Language to Sun Microsystems' Java Programming Language" at http://www.25hoursaday.com/CsharpVsJava.html.) Like any programming language, C# has its proponents and opponents. While I still use Java the most, where XML is concerned, I fall into the camp of C# proponents.

My objective here is to introduce some of C#'s programming facilities for XML, which are legion. C# offers oodles of APIs for you to scratch just about any XML itch you can find. This hack will exercise several programs that use a few of these APIs, enough to get you started writing your own C# programs for processing XML.

7.9.1 Getting C#

C# is compiled into an intermediate code, so you need to have .NET or Mono on your system to even get a compiled C# program to work. You can get C# in several ways. If you are working on Windows, you can download .NET from Microsoft's MSDN site (http://msdn.microsoft.com/netframework/technologyinfo/howtoget/) or through http://www.gotdotnet.com; or, if you are on Windows or Linux (Red Hat, Debian, and SUSE), you can download the Ximian C# compiler that is part of the open source Mono project (http://www.go-mono.com/). Borland also offers Borland C# Builder for the Microsoft .NET Framework (http://www.borland.com/csharpbuilder/).

If you opt for .NET, you have to download both the .NET Framework Version 1.1 Redistributable Package (about 24 MB) and the .NET Framework SDK Version 1.1 (about 108 MB). The programs used in this hack have been developed with .NET on Windows, and have not been tested with the Ximian compiler.

As for documentation, .NET comes with a large parcel of good HTML documentation. You can also find a reference manual for C# on the MSDN site (http://msdn.microsoft.com/library/en-us/csref/html/vcoriCProgrammersReference.asp). Mono also provides online documentation at http://www.go-mono.com:8080/.

7.9.2 Writing an XML Document with XmlTextWriter

The System.Xml namespace in .NET has a class called XmlTextWriter that provides properties and methods for writing XML documents. The C# program inst.cs takes input from the command line and writes an XML document either to standard output or to a file. It is shown in Example 7-24.

Both source (inst.cs) and binary (inst.exe) versions of this program are available in the file archive for the book. With the .NET C# compiler installed and in the path, you can recompile this program with this command:

csc inst.cs

To run the program, type:

inst

You will get this usage information:

Inst: generates a time instant in XML
Usage: inst hr min sec am|pm [file]

To generate XML with inst.exe, enter a line such as:

inst 10 43 56 am

which generates the output shown in Example 7-23.

Example 7-23. Output of inst.exe

<?xml version="1.0" encoding="IBM437"?>
<!-- a time instant -->
<time timezone="PST">
 <hour>10</hour>
 <minute>43</minute>
 <second>56</second>
 <meridiem>am</meridiem>
 <atomic signal="false" />
</time>

To write the output of the program to a file, use this syntax:

inst 10 43 56 am timeout.xml

Writing XML to file timeout.xml

The XML is written to the file timeout.xml, as reported by the program.

Example 7-24 shows inst.cs, the source code for inst.exe.

Example 7-24. inst.cs

using System;
using System.Text;
using System.Xml;

class Inst {

    // Output file flag
    static bool file = false;

    // Usage strings
    static string name = "Inst: generates a time instant in XML";
    static string usage = "\nUsage: inst hr min sec am|pm [file]";

    static void Main(String[] args) {

        // Test arguments (fewer than four, including none, prints usage)
        if (args.Length < 4) {
            Console.WriteLine(name + usage);
            Environment.Exit(1);
        } else if (args.Length == 5) {
            // Fifth argument = output to file
            file = true;
        } else if ((args[1] == "0") || (args[2] == "0")) {
            Console.WriteLine("Use 00 for min or sec; exit.");
            Environment.Exit(1);
        } else if (args.Length > 5) {
            Console.WriteLine("Too many arguments; exit.");
            Environment.Exit(1);
        }

        // Test argument values
        byte hr = System.Convert.ToByte(args[0]);
        byte min = System.Convert.ToByte(args[1]);
        byte sec = System.Convert.ToByte(args[2]);
        if (!((hr >= 1) && (hr <= 24))) {
            Console.WriteLine("Arg 1 must be 1-24; exit.");
            Environment.Exit(1);
        } else if (!((min >= 0) && (min <= 59))) {
            Console.WriteLine("Arg 2 must be 00-59; exit.");
            Environment.Exit(1);
        } else if (!((sec >= 0) && (sec <= 59))) {
            Console.WriteLine("Arg 3 must be 00-59; exit.");
            Environment.Exit(1);
        }

        switch(args[3]) {
            case "am":
                break;
            case "a.m.":
                break;
            case "pm":
                break;
            case "p.m.":
                break;
            default:
                Console.WriteLine("Arg 4 must be am|a.m.|pm|p.m.; exit.");
                Environment.Exit(1);
                break;
        }

        // Create the XmlTextWriter
        XmlTextWriter w;
        if (file) {
            // Output to file with US-ASCII encoding
            w = new XmlTextWriter(args[4], Encoding.ASCII);
            Console.WriteLine("Writing XML to file " + args[4]);
        } else {
            // Output to console, using the console's encoding (IBM437 here)
            w = new XmlTextWriter(Console.Out);
        }

         w.Formatting = Formatting.Indented;
         w.Indentation = 1;
         w.WriteStartDocument();
         w.WriteComment(" a time instant ");
         w.WriteStartElement("time");
         w.WriteAttributeString("timezone", "PST");
          w.WriteElementString("hour", args[0]);
          w.WriteElementString("minute", args[1]);
          w.WriteElementString("second", args[2]);
          w.WriteElementString("meridiem", args[3]);
          w.WriteStartElement("atomic");
           w.WriteAttributeString("signal", "false");
          w.WriteEndElement();
         w.WriteEndElement();
        w.WriteEndDocument();
        w.Flush();
        w.Close();

    }

}

On the first three lines, using directives import the namespaces System, System.Text, and System.Xml, making it possible to use the types in these namespaces without fully qualifying their names. On lines 7 through 29, the program handles its arguments. If, for example, there are no arguments (line 17), the program prints the name and usage strings (lines 11 and 12), and if there are five arguments, the fifth argument is taken to be a filename (line 20). Lines 31 through 59 perform various tests on the input strings, to make sure they are suitable for the application. For example, the hr argument must be in the range 1 through 24 (line 35).

Starting on line 62, the actual XML comes into play when a variable of type XmlTextWriter is declared; the class is instantiated in different ways, depending on whether a filename is provided as an argument to the program. If a filename is given, the XML is written to that file with US-ASCII encoding (line 65). If a filename is not given, the XML is just written to the console (line 69).

The Formatting property on line 72 turns on indentation, and the Indentation property on line 73 sets it to 1 space per level (the default is 2). The method WriteStartDocument() on line 74 begins the document and writes an XML declaration, and WriteComment() writes a comment (line 75). WriteStartElement(), seen on lines 76 and 82, writes the start tag of an element whose content is supplied by later calls; each call to this method should be paired with a call to WriteEndElement(), which writes the end tag or closes an empty element (lines 84 and 85). WriteAttributeString() produces attributes with values (lines 77 and 83). The WriteElementString() calls write complete elements whose text content comes from the command-line arguments (lines 78-81).

The WriteEndDocument() method on line 86 is not required, but it closes any open elements or attributes, so it is generally good practice to use it. Flush() flushes the buffer, and Close() closes the stream (lines 87 and 88).
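To see what WriteEndDocument() does on its own, here is a minimal sketch, not part of inst.cs, that writes the same kind of time instant into a string instead of the console. The class name InstSketch and the helper BuildInstant() are made up for illustration. The sketch deliberately leaves out the WriteEndElement() calls and relies on WriteEndDocument() to close the open atomic and time elements.

```csharp
using System;
using System.IO;
using System.Xml;

public class InstSketch {

    // Hypothetical helper: builds a time instant in memory and returns it
    public static string BuildInstant(string hr, string min, string sec, string meridiem) {
        StringWriter sw = new StringWriter();
        XmlTextWriter w = new XmlTextWriter(sw);
        w.Formatting = Formatting.Indented;
        w.Indentation = 1;
        w.WriteStartDocument();
        w.WriteStartElement("time");
        w.WriteAttributeString("timezone", "PST");
        w.WriteElementString("hour", hr);
        w.WriteElementString("minute", min);
        w.WriteElementString("second", sec);
        w.WriteElementString("meridiem", meridiem);
        w.WriteStartElement("atomic");
        w.WriteAttributeString("signal", "false");
        // No WriteEndElement() calls: WriteEndDocument() closes
        // the still-open atomic and time elements.
        w.WriteEndDocument();
        w.Close();
        return sw.ToString();
    }

    public static void Main() {
        Console.WriteLine(BuildInstant("10", "43", "56", "am"));
    }
}
```

Because the writer wraps a StringWriter rather than the console or a file, the encoding named in the XML declaration will differ from the one in Example 7-23.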

7.9.3 Reading XML

System.Xml provides several classes for reading documents, such as the XmlDocument class, which represents an XML document in DOM [Hack #96] . You use XmlDocument's Load method to read the actual document. Other options include XmlReader, XmlTextReader, and XmlValidatingReader. This example will demonstrate XmlTextReader, which is used on line 21 of read.cs (shown in Example 7-26). This program reads an XML document and then creates a generalized RELAX NG schema based on the input document.
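Although the rest of this section uses XmlTextReader, a short sketch of the XmlDocument route may help for comparison. The class DomSketch and the helper Describe() are hypothetical names, and the sample document is an assumption modeled on the output in Example 7-23. To stay self-contained, the sketch calls LoadXml() on a string; Load() would read from a file or stream instead.

```csharp
using System;
using System.Xml;

public class DomSketch {

    // Hypothetical helper: summarizes a time instant as "h:m:s meridiem (zone)"
    public static string Describe(string xml) {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);                      // Load(path) would read a file
        XmlElement root = doc.DocumentElement; // the time element
        return root["hour"].InnerText + ":" +
               root["minute"].InnerText + ":" +
               root["second"].InnerText + " " +
               root["meridiem"].InnerText +
               " (" + root.GetAttribute("timezone") + ")";
    }

    public static void Main() {
        // Assumed sample input, modeled on Example 7-23
        string xml = "<time timezone=\"PST\"><hour>10</hour>" +
                     "<minute>43</minute><second>56</second>" +
                     "<meridiem>am</meridiem><atomic signal=\"false\"/></time>";
        Console.WriteLine(Describe(xml));
    }
}
```

The trade-off is the usual one: XmlDocument holds the whole tree in memory and allows random access, while XmlTextReader moves forward through the document one node at a time.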

As with inst.cs, both source (read.cs) and binary (read.exe) code for this program are in the file archive for this book, but you can compile this program yourself with:

csc read.cs

Run the program by typing:

read

You will then get this usage information:

Read: read an XML document, create a RELAX NG schema
Usage: read file

Reading the document time.xml like this:

read time.xml

will yield the rudimentary RELAX NG schema shown in Example 7-25.

By the way, XmlTextReader expects well-formed XML as input. If you submit a malformed file like bad.xml, the program will throw an unhandled exception.
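If you would rather report the problem than let the program die, one option is to catch XmlException around the read loop. The following is a sketch under that assumption, not code from read.cs; the class WellFormedSketch and the helper Check() are invented names, and the input is passed as a string to keep the example self-contained.

```csharp
using System;
using System.IO;
using System.Xml;

public class WellFormedSketch {

    // Hypothetical helper: returns an empty string if the input is
    // well-formed, or the parser's error message if it is not
    public static string Check(string xml) {
        XmlTextReader r = new XmlTextReader(new StringReader(xml));
        try {
            while (r.Read()) {
                // Just pull every node; malformed input throws XmlException
            }
            return "";
        } catch (XmlException e) {
            return e.Message;
        } finally {
            r.Close();
        }
    }

    public static void Main() {
        string problem = Check("<time><hour>10</time>");
        if (problem.Length > 0)
            Console.WriteLine("Not well-formed: " + problem);
        else
            Console.WriteLine("Well-formed.");
    }
}
```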

Example 7-25. Output of read.exe when it processes time.xml

<grammar xmlns="http://relaxng.org/ns/structure/1.0">

<start>
 <ref name="body"/>
</start>

<define name="body">
<element name="time">
<attribute name="timezone"/>
<element name="hour">
<text/>
</element>
<element name="minute">
<text/>
</element>
<element name="second">
<text/>
</element>
<element name="meridiem">
<text/>
</element>
<element name="atomic">
<attribute name="signal"/>
</element>
</element>
</define>

</grammar>

To write the output of the program to a file, redirect the output:

read time.xml > timeout.rng

With the redirect, the schema is written to the file timeout.rng. Then you could validate time.xml against timeout.rng with Jing [Hack #72] :

java -jar jing.jar timeout.rng time.xml

Now that you know how to use it (the easy part), let's talk about the program itself, shown here in Example 7-26.

Example 7-26. read.cs

using System;
using System.Xml;

public class Read {

    // Usage strings
    static string name = "Read: read an XML document, create a RELAX NG schema";
    static string usage = "\nUsage: read file";

    public static void Main(String[] args) {

        // Test arguments
        if (args.Length == 0) {
            Console.WriteLine(name + usage);
            Environment.Exit(1);
        } else if (args.Length > 1) {
            Console.WriteLine("Too many arguments; exit.");
            Environment.Exit(1);
        }

        XmlTextReader r = new XmlTextReader(args[0]);

        Console.WriteLine("<grammar xmlns=\"http://relaxng.org/ns/structure/1.0\">\n");
        Console.WriteLine("<start>");
        Console.WriteLine(" <ref name=\"body\"/>");
        Console.WriteLine("</start>\n");
        Console.WriteLine("<define name=\"body\">");

        while (r.Read()) {
            if (r.MoveToContent() == XmlNodeType.Element) {
                Console.WriteLine("<element name=\"" + r.Name + "\">");
                if (r.IsEmptyElement && r.HasAttributes) {
                    for (int i = 0; i < r.AttributeCount; i++) {
                        r.MoveToAttribute(i);
                        Console.WriteLine("<attribute name=\"" + r.Name + "\"/>");
                    }
                    r.MoveToElement();
                    Console.WriteLine("</element>");
                } else if (r.HasAttributes) {
                    for (int i = 0; i < r.AttributeCount; i++) {
                        r.MoveToAttribute(i);
                        Console.WriteLine("<attribute name=\"" + r.Name + "\"/>");
                    }
                    r.MoveToElement();
                }
            }
            if (r.MoveToContent() == XmlNodeType.EndElement)
                Console.WriteLine("</element>");
            if (r.MoveToContent() == XmlNodeType.Text)
                Console.WriteLine("<text/>");
        }
        Console.WriteLine("</define>");
        Console.WriteLine("\n</grammar>");
    }

}

Earlier I said that the program read.cs creates a generalized RELAX NG schema. What I mean by generalized is that, without paying a lot of attention to details, it analyzes an XML document and places the resulting schema in a single named definition, body. The program has its limits: it has not been optimized, it has no exception handling, and it has not been tested extensively. It can't handle complex content models (for example, there is no support for content and values other than text), and it doesn't use the built-in XML writing facilities, opting to just write to the console with Console.WriteLine(). However, the program does achieve the important goal of demonstrating, in simple terms, what can be done when reading XML with the pull parser XmlTextReader (line 21).

Like inst.cs, read.cs uses the first part of the program to declare namespaces System and System.Xml (lines 1 and 2) and handle arguments from the command line (lines 6 through 19). Lines 23 through 27 write the tags for the beginning of a RELAX NG grammar, including a namespace declaration.

The real action begins with the while loop on line 29. The Read() method reads the next node from the stream, as long as there are nodes to read, using recursive descent. This is the pull: the program asks the reader for each node rather than waiting for callbacks, as it would with a push parser such as SAX. The MoveToContent() method (line 30) checks whether the current node is text (non-whitespace), an element, the end of an element, an entity reference, the end of an entity, or a CDATA section. If the node is not one of these, the reader skips ahead to the next node of interest or to the end of the file. It skips over processing instructions, document type declarations, comments, and whitespace.
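The following sketch isolates the Read() and MoveToContent() pairing so you can watch which nodes survive the skipping. It is an illustration rather than part of read.cs: the class PullSketch, the helper ContentNodes(), and the sample input are all assumptions. The input contains a comment and a processing instruction, neither of which should show up in the result.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public class PullSketch {

    // Hypothetical helper: lists the node types that MoveToContent()
    // stops on, separated by spaces
    public static string ContentNodes(string xml) {
        XmlTextReader r = new XmlTextReader(new StringReader(xml));
        StringBuilder sb = new StringBuilder();
        while (r.Read()) {
            XmlNodeType t = r.MoveToContent();
            if (t == XmlNodeType.None)
                break;                       // ran off the end of the input
            if (sb.Length > 0)
                sb.Append(' ');
            sb.Append(t.ToString());
        }
        r.Close();
        return sb.ToString();
    }

    public static void Main() {
        // Assumed sample: a comment, an element, a PI, and some text
        Console.WriteLine(ContentNodes("<!-- c --><a><?pi x?>hi</a>"));
    }
}
```

Note that MoveToContent() can itself advance the reader, so one pass through the loop may consume more than one node.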

Line 30 also checks if a node is an element (XmlNodeType.Element); if it is, line 31 writes a start tag for a RELAX NG element element. Line 32 then tests whether the element is empty with the IsEmptyElement property, and whether it has attributes with the HasAttributes property. If the element does have attributes, the program moves through the attributes in succession with MoveToAttribute() (lines 33 and 34), writes a RELAX NG attribute element for each (line 35), and then moves back to the element that owns the attributes with MoveToElement() and writes an element end tag (lines 37 and 38). The process is basically repeated for non-empty elements on lines 39 through 44, except that no end tag is written there.
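The attribute walk is easier to follow in isolation. Here is a small sketch, with invented names (AttrSketch, AttributeNames()) and an assumed sample element, that visits each attribute by index and then returns to the owning element, the same pattern read.cs uses.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public class AttrSketch {

    // Hypothetical helper: returns "element: name=value;name=value;"
    // for the first element in the input
    public static string AttributeNames(string xml) {
        XmlTextReader r = new XmlTextReader(new StringReader(xml));
        r.MoveToContent();                   // position on the first element
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < r.AttributeCount; i++) {
            r.MoveToAttribute(i);            // reader now sits on attribute i
            sb.Append(r.Name).Append('=').Append(r.Value).Append(';');
        }
        r.MoveToElement();                   // back to the owning element
        string result = r.Name + ": " + sb.ToString();
        r.Close();
        return result;
    }

    public static void Main() {
        // Assumed sample element with two attributes
        Console.WriteLine(AttributeNames("<atomic signal=\"false\" source=\"none\"/>"));
    }
}
```

While the reader is positioned on an attribute, Name and Value describe that attribute; after MoveToElement(), Name is the element's name again.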

Line 47 tests for the end of an element (XmlNodeType.EndElement) and if found, line 48 writes an element end tag. Lines 49 and 50 test if a node is a text node (XmlNodeType.Text) and if so write an empty RELAX NG text element. Finally, lines 52 and 53 close up the RELAX NG schema by writing the end tags for define and grammar.

That's it. As you can see, C# makes quick work of writing and reading XML documents. It's been worth my investment to get up to speed with C# and put it to use. (But I still like my Java and C, too.)