Hack 100 Create Well-Formed XML with Genx


If you prefer the C language, Genx provides a fast, efficient C library for generating well-formed and canonical XML. On top of that, it's well documented and a real pleasure to use.

Genx (http://www.tbray.org/ongoing/When/200x/2004/02/20/GenxStatus) is an easy-to-use C library for generating well-formed XML output. In addition to its output being well-formed, Genx writes all output in canonical form. It was created by Tim Bray with help from members of the xml-dev mailing list (http://xml.org/xml/xmldev.shtml) over the first few months of 2004. Some of the benefits of Genx include size, efficiency, speed, and the integrity of its output. Genx is well documented (http://www.tbray.org/ongoing/genx/docs/Guide.html) and it's fairly easy to figure out what's going on just by looking at the well-commented source code.

This hack shows you how to download, install, and compile Genx, then walks you through two example programs. The hack assumes that you are familiar with the C programming language, and that you have a C compiler and the make build utility available on your system. The example programs in this hack have been tested under Version beta5 of Genx.

7.11.1 Setting Up Genx

The first thing you have to do is download Genx. It comes only as a tarball. After you download it to the working directory for the book, you need to extract the files. At a shell or command prompt in the working directory on a machine that runs a Unix operating system, decompress the Genx tarball with:

gzip -d genx.tgz

Then extract the tar file genx.tar with:

tar xvf genx.tar

This creates a genx subdirectory where all the files from the archive will be extracted. (If you are on Windows without Cygwin, you can use a utility like WinZip to extract the GZIP archive.)

7.11.2 Compiling Genx

Genx comes with a Makefile for building the project. While in the genx subdirectory, just type make, and the process begins. The build will compile the needed files genx.c and charProps.c. genx.c includes the genx.h header file; charProps.c is where character properties are stored, and it is used to test for legal characters in XML.

The ar (archive) command is invoked to create an archive from the object files genx.o and charProps.o. The archive is called libgenx.a. The ranlib utility is also invoked to create an index for the archive. You will need to link against libgenx.a when you compile your own Genx programs. One other program, tgx.c, is also compiled and run. This program runs a number of tests on Genx and reports on what it finds so you know everything is working.
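To give a feel for the kind of legal-character test mentioned above, here is a minimal sketch based on the Char production of the XML 1.0 specification. The function name is_xml_char is hypothetical and is not part of Genx; the implementation in charProps.c may well work differently.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of Genx: returns true if the Unicode
 * code point cp is a legal character per the XML 1.0 Char production. */
bool is_xml_char(unsigned long cp)
{
    if (cp == 0x9 || cp == 0xA || cp == 0xD)
        return true;                        /* tab, line feed, carriage return */
    if (cp >= 0x20 && cp <= 0xD7FF)
        return true;                        /* up to just below the surrogates */
    if (cp >= 0xE000 && cp <= 0xFFFD)
        return true;                        /* excludes U+FFFE and U+FFFF */
    return cp >= 0x10000 && cp <= 0x10FFFF; /* supplementary planes */
}
```

A generator that refuses to write code points failing such a test can never emit a character that would make its output ill-formed.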

7.11.3 A First Example

Several test programs are provided in the Genx package and are stored under the docs subdirectory. I have written two sample programs that I'll highlight here. You can find these programs in the genx-examples subdirectory wherever the example file archive for this book was extracted. Change directories to genx-examples and type make again (the Genx examples have their own makefile). After you invoke make in genx-examples, the example programs will be built and ready to go.

Example 7-32 is a simple C program called tick.c that uses functions from the Genx library.

Example 7-32. tick.c
#include <stdio.h>
#include "../genx/genx.h"

int main()
{
  genxWriter w = genxNew(NULL, NULL, NULL);

  genxStartDocFile(w, stdout);
   genxStartElementLiteral(w, NULL, "time");
    genxAddAttributeLiteral(w, NULL, "timezone", "GMT");
    genxStartElementLiteral(w, NULL, "hour");
     genxAddText(w, "23");
    genxEndElement(w);
    genxStartElementLiteral(w, NULL, "minute");
     genxAddText(w, "14");
    genxEndElement(w);
    genxStartElementLiteral(w, NULL, "second");
     genxAddText(w, "52");
    genxEndElement(w);
   genxEndElement(w);
  genxEndDocument(w);

}

Line 2 of the program is an #include directive for the copy of the genx.h header file that is located in the genx directory alongside genx-examples, provided that Genx was installed as directed.

You can also place a copy of genx.h in the location for system include files (on my Cygwin system, for example, the location is c:/cygwin/usr/include). If a copy of genx.h is in the system include location, you can change the #include directive on line 2 to #include <genx.h>.


The first statement inside main() creates a writer for the output of the program. The variable w is of type genxWriter, and it is initialized by the genxNew function (see line 6). Looks like a Java constructor, doesn't it? genxWriter is a pointer to the struct genxWriter_rec, which stores all kinds of information about the document being built. The three arguments to the genxNew function are for memory allocation and deallocation. When all three arguments are set to NULL, we are instructing Genx to use its default memory handling (that is, with malloc() and free()).
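If you want something other than the default memory handling, the non-NULL arguments would be callbacks of your own. The following is a minimal sketch that counts allocations. It assumes the callback shapes declared in genx.h are an allocator taking a user-data pointer and a byte count, and a deallocator taking a user-data pointer and the block to release; check the header in your copy before relying on this. The names alloc_stats, counting_alloc, and counting_dealloc are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical bookkeeping structure, passed as the user-data argument. */
struct alloc_stats {
    long allocations;
    long frees;
};

/* Allocator callback; assumed shape: void *(*)(void *userData, int bytes). */
void *counting_alloc(void *user_data, int bytes)
{
    struct alloc_stats *stats = user_data;
    void *p;

    if (bytes < 0)
        return NULL;               /* reject nonsensical sizes */
    p = malloc((size_t) bytes);
    if (p != NULL)
        stats->allocations++;      /* count only successful allocations */
    return p;
}

/* Deallocator callback; assumed shape: void (*)(void *userData, void *data). */
void counting_dealloc(void *user_data, void *data)
{
    struct alloc_stats *stats = user_data;

    if (data != NULL) {
        free(data);
        stats->frees++;
    }
}
```

Under those assumptions, the writer would be created with genxNew(counting_alloc, counting_dealloc, &stats), where stats is a struct alloc_stats initialized to zero.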

Following this initialization of a writer is a series of function calls, each with a small job. Notice that the first or only argument to each of these functions is w, the writer structure. The call to genxStartDocFile() on line 8 starts the writing process. The second argument, stdout, indicates that the document will be written to standard output. (The document could otherwise be written to a file, as you will see in the next example.) At the end of the program (line 21) is a call to genxEndDocument(), which signals the end of the document and flushes it.

The program also contains four calls to genxStartElementLiteral() (lines 9, 11, 14, and 17), each of which is terminated by a call to genxEndElement() (lines 13, 16, 19, and 20). genxStartElementLiteral() has three arguments. The first is the writer structure (w) explained previously, the second is a namespace name or URI (NULL if none), and the third is the element name, such as time or hour.

If you give an element a namespace URI in the second argument, Genx writes the namespace URI on the element with an xmlns attribute and automatically creates a prefix, which is used on any child elements that have the same namespace declared.

The text content for a given element, if any, is created with genxAddText() (lines 12, 15, and 18), with the second argument containing the actual text, such as 23 or 14.

You can probably guess that genxAddAttributeLiteral() (line 10) writes an attribute on the element that is created immediately before it. It has four arguments. The first is the writer structure, and the second is a namespace URI, which is NULL if no namespace is used. The third argument is the attribute name and the fourth is the attribute value.

To run the program, just type tick at the prompt (it was compiled with make previously). The output of the program should look like this:

<time timezone="GMT"><hour>23</hour><minute>14</minute><second>52</second></time>

This output is an example of canonical XML. Two obvious signs are the absence of an XML declaration and the use of double quotes rather than single quotes around attribute values. Now let's look at a Genx example that is a little more complex.
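Canonical form also fixes how special characters in text content are written. The sketch below applies the replacements that the Canonical XML 1.0 specification prescribes for text nodes: & becomes &amp;, < becomes &lt;, > becomes &gt;, and a carriage return becomes &#xD;. The function escape_text_c14n is hypothetical and is not part of Genx; a library that emits canonical XML applies rules like these on your behalf, and the sketch only illustrates them.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Genx: copy in to out, escaping
 * characters as Canonical XML 1.0 requires for text nodes.
 * Returns 0 on success, -1 if out is too small. */
int escape_text_c14n(const char *in, char *out, size_t out_size)
{
    size_t used = 0;

    for (; *in != '\0'; in++) {
        char single[2];
        const char *rep;
        size_t len;

        switch (*in) {
        case '&':  rep = "&amp;"; break;
        case '<':  rep = "&lt;";  break;
        case '>':  rep = "&gt;";  break;
        case '\r': rep = "&#xD;"; break;
        default:                   /* ordinary character: copy unchanged */
            single[0] = *in;
            single[1] = '\0';
            rep = single;
            break;
        }
        len = strlen(rep);
        if (used + len + 1 > out_size)
            return -1;             /* no room for text plus terminator */
        memcpy(out + used, rep, len);
        used += len;
    }
    if (out_size == 0)
        return -1;
    out[used] = '\0';
    return 0;
}
```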

7.11.4 Declare Markup for Better Performance

In the next example we will explore a different approach for writing an XML document with Genx. The program tock.c declares elements, an attribute, and a namespace before it uses them, then writes elements and an attribute with different functions that are more efficient than their literal counterparts. It also writes its non-canonical output to a file. Example 7-33 shows the code for tock.c.

Example 7-33. tock.c
#include <stdio.h>
#include "../genx/genx.h"

int main()
{
  genxWriter w = genxNew(NULL, NULL, NULL);
  FILE *f = fopen("tock.xml", "w");
  genxElement time, hr, min, sec;
  genxAttribute tz;
  genxNamespace tm;
  genxStatus status;
  tm = genxDeclareNamespace(w, "http://www.wyeast.net/time", "tm", &status);
  time = genxDeclareElement(w, tm, "time", &status);
  tz = genxDeclareAttribute(w, NULL, "timezone", &status);
  hr = genxDeclareElement(w, tm, "hour", &status);
  min = genxDeclareElement(w, tm, "minute", &status);
  sec = genxDeclareElement(w, tm, "second", &status);

  genxAddText(w, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
  genxStartDocFile(w, f);
  genxPI(w, "xml-stylesheet", " href=\"tock.xsl\" type=\"text/xsl\" ");
  genxComment(w, " the current date ");
  genxAddText(w, "\n");
  genxStartElement(time);
   genxAddAttribute(tz, "GMT");
   genxAddText(w, "\n ");
   genxStartElement(hr);
    genxAddText(w, "23");
   genxEndElement(w);
   genxAddText(w, "\n ");
   genxStartElement(min);
    genxAddText(w, "14");
   genxEndElement(w);
   genxAddText(w, "\n ");
   genxStartElement(sec);
    genxAddText(w, "52");
   genxEndElement(w);
   genxAddText(w, "\n");
   genxEndElement(w);
  genxEndDocument(w);

}

Line 7 opens the output file by calling the fopen() function with the filename (tock.xml) where the output is to be written and the mode string "w", which opens the file for writing (it is unrelated to the writer variable w). fopen() returns a FILE pointer, f. Following that, four elements (time, hr, min, and sec) are declared to be of type genxElement (line 8). The attribute tz is declared to be of type genxAttribute (line 9), and the namespace tm is declared to be of type genxNamespace (line 10). status is of type genxStatus (line 11), an enum that reports the outcome of a call with values such as GENX_SUCCESS and GENX_BAD_NAME. The address of status (&status) is passed as the last argument of the functions on lines 12 through 17.
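Note that tock.c does not check whether fopen() succeeded. A minimal sketch of a more defensive open follows; the helper name open_output is hypothetical and is not part of Genx or of the example archive.

```c
#include <stdio.h>

/* Hypothetical helper: open path for writing and report a failure,
 * rather than handing a NULL stream to later calls. */
FILE *open_output(const char *path)
{
    FILE *f = fopen(path, "w");   /* "w" is the mode: create or truncate */

    if (f == NULL)
        perror(path);             /* prints the reason, e.g. a missing directory */
    return f;
}
```

In a program like tock.c, the returned pointer would be tested for NULL before it is passed to genxStartDocFile(), and fclose() would be called on it after genxEndDocument().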

After the initial declarations, all these variables are initialized with an appropriate function: genxDeclareNamespace() (line 12), genxDeclareElement() (lines 13, 15, 16, and 17), and genxDeclareAttribute() (line 14). The namespace variable tm is given a namespace name (http://www.wyeast.net/time) and a prefix (tm) with the genxDeclareNamespace() function.

The genxAddText() function inserts strings, namely an XML declaration, newline characters, and spaces, into the file output stream (lines 19, 23, 26, 30, 34, and 38). The addition of the XML declaration is what makes the output non-canonical.

The functions genxPI() (line 21) and genxComment() (line 22) write an XML stylesheet processing instruction and a comment, respectively. Then the functions genxStartElement() (lines 24, 27, 31, and 35) and genxAddAttribute() (line 25) begin writing the markup. These functions take a previously declared object rather than a literal name, which gives them better performance than their counterparts genxStartElementLiteral() and genxAddAttributeLiteral(). Other functions, such as genxAddText() (lines 28, 32, and 36) and genxEndElement() (lines 29, 33, 37, and 39), may be used with both variations of the element and attribute creation functions, or, in the case of genxAddText(), just for inserting whitespace between elements.

To run the program, type tock at a command or shell prompt. Genx will then create the file tock.xml, shown in Example 7-34.

Example 7-34. tock.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet  href="tock.xsl" type="text/xsl" ?>
<!-- the current date -->

<tm:time xmlns:tm="http://www.wyeast.net/time" timezone="GMT">
 <tm:hour>23</tm:hour>
 <tm:minute>14</tm:minute>
 <tm:second>52</tm:second>
</tm:time>

Just for fun, this non-canonical output can be transformed with the XSLT stylesheet tock.xsl and validated with the RELAX NG schema tock.rng. Both files are in the genx-examples subdirectory.

There are a number of other Genx functions that I have not touched on, such as the memory management functions genxGetAlloc() and genxSetAlloc(). My take is that Tim Bray is on the right track, and that if you use C and you need to generate XML output, you will no doubt find that Genx is an efficient tool.