8.6 Uploading XML Documents


In order to upload XML files, we need a program that can read them, understand the content model, discern content from syntax, and set the x and y coordinates of each node. Fortunately, the XML community has done almost all of the hard work for us already. The Simple API for XML (SAX) provides methods for sequential parsing of XML files, so all we need to do is write methods to handle the various event callbacks that the SAX parser invokes as it steps through the file. Some features of SAX are worth noting here. A SAX parser understands the XML content model and acts as an XML processor (i.e., it has some intelligence built into it): by default, a parser reporting only to the SAX ContentHandler will expand entity references, ignore comments, and treat CDATA sections in the same way as text. Since we want entity references to remain unexpanded, and we need to maintain the distinction between CDATA and text, we need to implement the SAX LexicalHandler as well; the LexicalHandler interface provides a method for handling comments, and methods that mark the beginning and end of CDATA sections and entities. Finally, if we want to handle DTD (Document Type Definition) content, we need to implement the DTDHandler interface, which provides methods for notation declarations and unparsed entity declarations.
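
To make the division of labor between the two handler interfaces concrete, here is a minimal sketch that separates text, CDATA, and comments. It is not part of the repository code: the class name LexicalSketch is hypothetical, and it assumes the parser bundled with the JDK (obtained through SAXParserFactory) rather than a standalone Xerces download. It extends DefaultHandler2, which supplies empty implementations of the LexicalHandler methods.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class: collects text, CDATA, and comments separately
class LexicalSketch extends DefaultHandler2 {
    final StringBuilder text = new StringBuilder();
    final StringBuilder cdata = new StringBuilder();
    final StringBuilder comments = new StringBuilder();
    private boolean inCData = false;

    // LexicalHandler callbacks bracket each CDATA section
    @Override public void startCDATA() { inCData = true; }
    @Override public void endCDATA() { inCData = false; }

    // ContentHandler callback: receives both text and CDATA content
    @Override public void characters(char[] ch, int start, int length) {
        (inCData ? cdata : text).append(ch, start, length);
    }

    // LexicalHandler callback: comments are only reported here
    @Override public void comment(char[] ch, int start, int length) {
        comments.append(ch, start, length);
    }

    static LexicalSketch run(String xml) {
        try {
            LexicalSketch handler = new LexicalSketch();
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // Without this property the parser never calls the lexical methods
            reader.setProperty(
                "http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        LexicalSketch h = run("<doc>plain<![CDATA[<raw>]]><!-- note --></doc>");
        System.out.println("text=[" + h.text + "] cdata=[" + h.cdata
            + "] comments=[" + h.comments + "]");
    }
}
```

If the lexical-handler property is not set, the same characters() callback receives the CDATA content with nothing to mark it as such, and the comment is dropped silently.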

8.6.1 The xmlrepSAX Class

First, let's create a class (xmlrepSAX.java) that will initialize a SAX2 parser and handle any exceptions that may arise (in this case, we are using the Xerces parser from http://xml.apache.org/, but the reader can change this by altering the value of DEFAULT_PARSER_NAME). Note that SAX2 parsers are namespace-aware. A SAX1 parser (or a SAX2 parser with this feature disabled) would treat an element such as <my-ns:my-element xmlns:my-ns="uri"> as being called "my-ns:my-element" (leaving us to strip out the namespace prefix and resolve it to a URI), whereas a SAX2 parser with this feature enabled will report "my-element" as the local name, provide the full URI reference of the namespace, and may also give us the prefix. Since we want to retain the namespace prefix (as this is more space-efficient than storing the full URI every time), we will enable both features with two Booleans, NAMESPACE_HANDLING and NS_PREFIX_HANDLING. The code for this class is shown in Listing 8.30.

Listing 8.30 xmlrepSAX Class
// Import core Java classes
import java.io.*;
import java.lang.*;
// Import SAX classes
import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.*;
// The xmlrepSAX class
public class xmlrepSAX extends DefaultHandler {
   // Useful parameters
   protected static final String DEFAULT_PARSER_NAME =
      "org.apache.xerces.parsers.SAXParser";
   protected static final String NAMESPACES_FEATURE_ID =
      "http://xml.org/sax/features/namespaces";
   protected static final String NS_PREFIX_HANDLING_PROPERTY_ID =
      "http://xml.org/sax/features/namespace-prefixes";
   protected static final String VALIDATION_FEATURE_ID =
      "http://xml.org/sax/features/validation";
   protected static final String LEXICAL_EVENT_HANDLING_PROPERTY_ID =
      "http://xml.org/sax/properties/lexical-handler";
   protected static final boolean NAMESPACE_HANDLING = true;
   protected static final boolean NS_PREFIX_HANDLING = true;
   // Variables
   String uri = null;
   FileReader r = null;
   XMLReader parser = null;
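
The effect of the two namespace features can be checked with a small stand-alone sketch. It is an illustration only: the class name NamespaceSketch and the namespace URI urn:example:my-ns are placeholders, and it uses the JDK's bundled parser via SAXParserFactory instead of the Xerces class named in DEFAULT_PARSER_NAME.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records what startElement() receives
class NamespaceSketch extends DefaultHandler {
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) {
        seen.add("uri=" + uri + " local=" + localName + " qName=" + qName);
    }

    static List<String> run(String xml) {
        try {
            NamespaceSketch handler = new NamespaceSketch();
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            // The same two feature IDs that xmlrepSAX defines as constants
            reader.setFeature("http://xml.org/sax/features/namespaces", true);
            reader.setFeature(
                "http://xml.org/sax/features/namespace-prefixes", true);
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.seen;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<my-ns:my-element xmlns:my-ns=\"urn:example:my-ns\"/>";
        for (String line : run(xml)) {
            System.out.println(line);
        }
    }
}
```

With both features on, the qualified name (prefix included) arrives alongside the URI and local name, which is what lets the upload code store the short prefix rather than the full URI.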

The parse() method instantiates a SAX parser, registers the various handlers, and invokes the parser, as shown in Listing 8.31.

Listing 8.31 parse Method
public void parse(DefaultHandler xmlHandler, xmlrepSAX errHandler, String uri) {
   // Create an XML reader
   try {parser =
          XMLReaderFactory.createXMLReader(DEFAULT_PARSER_NAME);
   } catch (Exception e) {
      genError("Error: Unable to instantiate parser ("
                + DEFAULT_PARSER_NAME + ")");
   }
  // Register the SAX content handler
  try {parser.setContentHandler(xmlHandler);
  } catch (NullPointerException e) {
      genError("Could not set the ContentHandler.");}
  // Register the SAX error handler
  try {parser.setErrorHandler(errHandler);
  } catch (NullPointerException e) {
      genError("Could not set the ErrorHandler.");}
  // Set the namespace handling behavior
  try {parser.setFeature(NAMESPACES_FEATURE_ID, NAMESPACE_HANDLING);
  } catch (SAXException e) {
      genError("Could not set namespace handling.");}
  // Set the namespace prefix handling behavior
  try {parser.setFeature(NS_PREFIX_HANDLING_PROPERTY_ID, NS_PREFIX_HANDLING);
  } catch (SAXException e) {
      genError("Could not set namespace prefix handling.");}
  // Register the SAX lexical event handler
  try {parser.setProperty(LEXICAL_EVENT_HANDLING_PROPERTY_ID, xmlHandler);
  } catch (SAXNotRecognizedException e) {
     System.out.println("Warning: lex property not recognized.");
  } catch (SAXNotSupportedException e) {
     System.out.println("Warning: lex property not supported.");}
  // Register the SAX DTD event handler
  try {parser.setDTDHandler(xmlHandler);
  } catch (NullPointerException e) {
     System.out.println("Warning: Could not set the DTD handler.");}
  // Open the file and parse it
  try {r = new FileReader(uri);
  } catch (FileNotFoundException e) {
     System.out.println("File not found: " + uri);
     genError(e.toString());}
  try {parser.parse(new InputSource(r));
  } catch (Exception e) {
     System.out.println("Error encountered while parsing " + uri);
     genError(e.toString());}
}

Finally, we need exception-handling methods, as shown in Listing 8.32.

Listing 8.32 SAX Exception-Handling Methods
   // * A generic error handler (just outputs the message and quits)
   public void genError(String msg) {
     System.out.println(msg);
     System.exit(1);
   }
   // -- SAX ErrorHandler methods
   // * Warnings
   public void warning(SAXParseException ex) {
     System.out.println("[Warning] " + ex.getMessage());
   }
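   // * SAX errors (error() overrides the ErrorHandler callback, which
   //   DefaultHandler would otherwise leave as a no-op)
   public void error(SAXParseException ex) {
     saxError(ex);
   }
   public void saxError(SAXParseException ex) {
     System.out.println("[Error] " + ex.getMessage());
     System.exit(1);
   }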
   // * Fatal errors
   public void fatalError(SAXParseException ex) throws SAXException {
     System.out.println("[Fatal Error] " + ex.getMessage());
     System.exit(1);
   }
}

A SAX parser will report an error if the supplied XML file is not well formed, so we can avoid overburdening the database by scanning each file for well-formedness before we attempt to upload it. The scanXML class will do this for us (see Listing 8.33); it instantiates a SAX parser and provides dummy content-handling methods.

Listing 8.33 scanXML Class
// Import core Java classes
import java.io.*;
// Import SAX classes
import org.xml.sax.Attributes;
import org.xml.sax.DTDHandler;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;
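// Our own class (xmlrepSAX) is in the same default package, so no
// import statement is needed (current Java compilers reject imports
// from the unnamed package)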
// The scanXML class
public class scanXML extends DefaultHandler
 implements LexicalHandler, DTDHandler {
   // Parse the document
   public void check(DefaultHandler handler, String uri) {
     // Variables
     xmlrepSAX saxParser = new xmlrepSAX();
     // Parse the file
     saxParser.parse(handler, saxParser, uri);
   }
   // -- ContentHandler methods
   public void startDocument() {/* Do nothing */}
   public void startElement(String namespaceURI, String localName,
      String rawName, Attributes attrs) {/* Do nothing */}
   public void endElement(String namespaceURI, String localName,
      String rawName) {/* Do nothing */}
   public void characters(char ch[], int start, int length)
      {/* Do nothing */}
   public void ignorableWhitespace(char ch[], int start, int length)
      {/* Do nothing */}
   public void processingInstruction(String target, String data)
      {/* Do nothing */}
   public void endDocument() {/* Do nothing */}
   // -- LexicalHandler methods
   public void startDTD(String name, String publicId, String systemId)
      {/* Do nothing */}
   public void endDTD() {/* Do nothing */}
   public void comment(char ch[], int start, int length)
      {/* Do nothing */}
   public void startCDATA() {/* Do nothing */}
   public void endCDATA() {/* Do nothing */}
   public void startEntity(String name) {/* Do nothing */}
   public void endEntity(String name) {/* Do nothing */}
}
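
For readers who want to try the well-formedness check in isolation, the following sketch returns a Boolean instead of exiting the JVM. It is an alternative under stated assumptions, not the scanXML class itself: the name WellFormedSketch is hypothetical, the document is supplied as a string, and the JDK's bundled parser is used.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: reports well-formedness as a boolean
class WellFormedSketch {
    static boolean isWellFormed(String xml) {
        try {
            // DefaultHandler ignores all content events; its fatalError()
            // rethrows the SAXParseException raised by malformed input
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)),
                       new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b></b></a>"));
        System.out.println(isWellFormed("<a><b></a></b>"));
    }
}
```

Returning a value rather than calling System.exit() makes the check easier to reuse when several files are processed in one run.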

8.6.2 Stored Procedures for Data Entry

In order to load an XML document into the repository, our application will need to execute many insert and update statements. We can of course build these SQL statements on the fly, but for performance reasons, it would be better to create some stored procedures first. When ad hoc SQL is executed, the RDBMS will parse the SQL and create a "query plan" prior to execution; with a stored procedure, the query plan is determined upon first execution and then stored in the procedure cache, which means that subsequent execution will be considerably faster. The script create_xmlrep_db.sql contains numerous stored procedures that we will use later; the names are abbreviations referring to the activity and the table affected (e.g., rep_i_an involves an insert ("i") into the attribute_name ("an") table, and rep is just a prefix I have used for all the core stored procedures in the repository).

The procedure in Listing 8.34 creates an entry in the doc table and returns the unique doc_id.

Listing 8.34 Create rep_i_d Procedure
CREATE PROCEDURE dbo.rep_i_d
  @source         t_source,
  @contributor_id t_user_id AS
BEGIN
 INSERT doc (source, date_loaded, contributor_id)
 VALUES (@source, GETDATE(), @contributor_id)
 -- Result = doc_id
 SELECT @@identity AS value
END
GO

The procedure in Listing 8.35 creates an entry in the node table and returns the unique node_id. Note that at this stage, the document ID, type, and x index are known, but the y index is not (so it is not specified here).

Listing 8.35 Create rep_i_n Procedure
CREATE PROCEDURE dbo.rep_i_n
  @x_index      t_xy_index,
  @node_type_id t_node_type_id,
  @doc_id       t_doc_id AS
BEGIN
 INSERT node (doc_id, x, node_type_id)
 VALUES (@doc_id, @x_index, @node_type_id)
 -- Result = node_id
 SELECT @@identity AS value
END
GO
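
As an illustration of how a Java client might call such a procedure with bound parameters, here is a hedged JDBC sketch. It is not the xmlrepDB class used later in this chapter: the class name StoredProcSketch and its two methods are hypothetical, and it assumes that the user-defined types t_xy_index, t_node_type_id, and t_doc_id map to Java int values.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical demo class: calls rep_i_n through JDBC with parameters
class StoredProcSketch {
    // Builds the JDBC escape syntax for a call, e.g. "{call p(?, ?)}"
    static String callSyntax(String procedure, int paramCount) {
        StringBuilder sb = new StringBuilder("{call ").append(procedure);
        if (paramCount > 0) {
            sb.append('(');
            for (int i = 0; i < paramCount; i++) {
                sb.append(i == 0 ? "?" : ", ?");
            }
            sb.append(')');
        }
        return sb.append('}').toString();
    }

    // Inserts a node and returns the node_id from the "value" column
    static int insertNode(Connection conn, int x, int nodeTypeId, int docId)
            throws SQLException {
        String sql = callSyntax("dbo.rep_i_n", 3);
        try (CallableStatement cs = conn.prepareCall(sql)) {
            // Parameter order follows the procedure definition:
            // @x_index, @node_type_id, @doc_id
            cs.setInt(1, x);
            cs.setInt(2, nodeTypeId);
            cs.setInt(3, docId);
            try (ResultSet rs = cs.executeQuery()) {
                if (!rs.next()) {
                    throw new SQLException("rep_i_n returned no row");
                }
                return rs.getInt("value");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(callSyntax("dbo.rep_i_n", 3));
    }
}
```

Binding values as parameters also avoids having to quote strings by hand when a value such as a document URI contains an apostrophe.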

The procedure in Listing 8.36 inserts a row in the element_name table for an element node; the node_id refers back to the entry in the node table. The other parameters are the namespace prefix and the local name of the element.

Listing 8.36 Create rep_i_en Procedure
CREATE PROCEDURE dbo.rep_i_en
  @node_id    t_node_id,
  @ns_prefix  t_ns_prefix,
  @local_name t_element_name AS
INSERT element_name (node_id, ns_prefix, local_name)
VALUES (@node_id, @ns_prefix, @local_name)
GO

The procedure in Listing 8.37 inserts a row in the attribute_name table. Each entry has a reference back to the element_name entry (through node_id), a unique attribute ID (for this node), a namespace prefix, and a local name.

Listing 8.37 Create rep_i_an Procedure
CREATE PROCEDURE dbo.rep_i_an
  @node_id      t_node_id,
  @attribute_id t_seq_no,
  @ns_prefix    t_ns_prefix,
  @local_name   t_attribute_name AS
INSERT attribute_name (node_id, attribute_id, ns_prefix, local_name)
VALUES (@node_id, @attribute_id, @ns_prefix, @local_name)
GO

The procedure in Listing 8.38 inserts a row in the attribute_value_leaf table; each entry has a reference back to the attribute_name table (through node_id and attribute_id). Long values will be split across leaves, each with a unique leaf_id (for the specified attribute).

Listing 8.38 Create rep_i_avl Procedure
CREATE PROCEDURE dbo.rep_i_avl
  @node_id      t_node_id,
  @attribute_id t_seq_no,
  @leaf_id      t_leaf_id,
  @leaf_text    t_leaf_text AS
INSERT attribute_value_leaf (node_id, attribute_id, leaf_id, leaf_text)
VALUES (@node_id, @attribute_id, @leaf_id, @leaf_text)
GO
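
The splitting of long values into leaves can be pictured with a short sketch. The repository's own splitting logic lives in the xmlrepDB class, which is not shown in this section, so the class name LeafSplitSketch, the method splitIntoLeaves(), and the maximum leaf length used here are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: splits a value into fixed-size leaves
class LeafSplitSketch {
    // Returns the leaves in order; leaf_id would be the 1-based position
    static List<String> splitIntoLeaves(String value, int maxLeafLength) {
        if (maxLeafLength < 1) {
            throw new IllegalArgumentException("maxLeafLength must be >= 1");
        }
        List<String> leaves = new ArrayList<>();
        for (int start = 0; start < value.length(); start += maxLeafLength) {
            int end = Math.min(value.length(), start + maxLeafLength);
            leaves.add(value.substring(start, end));
        }
        return leaves;
    }

    public static void main(String[] args) {
        List<String> leaves = splitIntoLeaves("abcdefg", 3);
        for (int i = 0; i < leaves.size(); i++) {
            System.out.println("leaf_id " + (i + 1) + ": " + leaves.get(i));
        }
    }
}
```

Concatenating the leaves in leaf_id order reconstructs the original value, which is why each leaf needs an ID that is unique within its attribute.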

The procedure in Listing 8.39 inserts a row in the cdata_leaf table. Again, long values will be split across leaves.

Listing 8.39 Create rep_i_cdl Procedure
CREATE PROCEDURE dbo.rep_i_cdl
  @node_id   t_node_id,
  @leaf_id   t_leaf_id,
  @leaf_text t_leaf_text AS
INSERT cdata_leaf (node_id, leaf_id, leaf_text)
VALUES (@node_id, @leaf_id, @leaf_text)
GO

The procedure in Listing 8.40 inserts a row representing a leaf in the comment_leaf table.

Listing 8.40 Create rep_i_cl Procedure
CREATE PROCEDURE dbo.rep_i_cl
  @node_id   t_node_id,
  @leaf_id   t_leaf_id,
  @leaf_text t_leaf_text AS
INSERT comment_leaf (node_id, leaf_id, leaf_text)
VALUES (@node_id, @leaf_id, @leaf_text)
GO

The procedure in Listing 8.41 inserts a row in the entity_reference table.

Listing 8.41 Create rep_i_er Procedure
CREATE PROCEDURE dbo.rep_i_er
  @node_id     t_node_id,
  @entity_name t_entity_ref AS
INSERT entity_reference (node_id, entity_name)
VALUES (@node_id, @entity_name)
GO

The procedure in Listing 8.42 inserts a row in the pi_data_leaf table.

Listing 8.42 Create rep_i_pidl Procedure
CREATE PROCEDURE dbo.rep_i_pidl
  @node_id   t_node_id,
  @leaf_id   t_leaf_id,
  @leaf_text t_leaf_text AS
INSERT pi_data_leaf (node_id, leaf_id, leaf_text)
VALUES (@node_id, @leaf_id, @leaf_text)
GO

The procedure in Listing 8.43 inserts a row in the pi_target_leaf table.

Listing 8.43 Create rep_i_pitl Procedure
CREATE PROCEDURE dbo.rep_i_pitl
  @node_id   t_node_id,
  @leaf_id   t_leaf_id,
  @leaf_text t_leaf_text AS
INSERT pi_target_leaf (node_id, leaf_id, leaf_text)
VALUES (@node_id, @leaf_id, @leaf_text)
GO

The procedure in Listing 8.44 inserts a row in the text_leaf table.

Listing 8.44 Create rep_i_tl Procedure
CREATE PROCEDURE dbo.rep_i_tl
  @node_id   t_node_id,
  @ignorable BIT,
  @leaf_id   t_leaf_id,
  @leaf_text t_leaf_text AS
INSERT text_leaf (node_id, leaf_id, leaf_text, ignorable)
VALUES (@node_id, @leaf_id, @leaf_text, @ignorable)
GO

The procedure in Listing 8.45 returns the last value of leaf_id from the cdata_leaf table for the specified node.

Listing 8.45 Create rep_s_cdlid Procedure
CREATE PROCEDURE dbo.rep_s_cdlid
  @node_id t_node_id AS
SELECT ISNULL(MAX(leaf_id), 0) AS value
  FROM cdata_leaf
 WHERE node_id = @node_id
GO

The procedure in Listing 8.46 returns the last value of leaf_id from the text_leaf table for the specified node.

Listing 8.46 Create rep_s_tlid Procedure
CREATE PROCEDURE dbo.rep_s_tlid
  @node_id t_node_id AS
SELECT ISNULL(MAX(leaf_id), 0) AS value
  FROM text_leaf
 WHERE node_id = @node_id
GO

The procedure in Listing 8.47 returns the value of node_id for the last node in the specified document that has not yet had a y coordinate set.

Listing 8.47 Create rep_s_n_last Procedure
CREATE PROCEDURE dbo.rep_s_n_last
  @doc_id t_doc_id AS
SELECT node_id AS value
  FROM node
 WHERE doc_id = @doc_id
   AND x = (SELECT MAX(x)
              FROM node
             WHERE doc_id = @doc_id
               AND y IS NULL)
GO

The procedure in Listing 8.48 sets the value of y for the specified node.

Listing 8.48 Create rep_u_n_y Procedure
CREATE PROCEDURE dbo.rep_u_n_y
  @node_id t_node_id,
  @y_index t_xy_index AS
 UPDATE node
    SET y = @y_index
  WHERE node_id = @node_id
GO

The procedure in Listing 8.49 nulls the value of y for the specified node (we will discuss the reason for this requirement later).

Listing 8.49 Create rep_u_n_y_null Procedure
CREATE PROCEDURE dbo.rep_u_n_y_null
  @node_id t_node_id AS
UPDATE node
   SET y = NULL
 WHERE node_id = @node_id
GO

Finally, we need to grant execute permission on these stored procedures to the repository user, as shown in Listing 8.50.

Listing 8.50 Set Permissions for Stored Procedures
GRANT EXECUTE ON dbo.rep_i_an TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_avl TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_cdl TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_cl TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_d TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_en TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_er TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_n TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_pidl TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_pitl TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_tl TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_s_cdlid TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_s_n_last TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_s_tlid TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_u_n_y TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_u_n_y_null TO xmlrep_user;
GO

8.6.3 The uploadXML Class

Now we need a class (uploadXML) that will

  1. Instantiate scanXML and parse the file with it to ensure well-formedness.

  2. Start a database transaction.

  3. Register itself as the content handler for the SAX parser and call its parse() method.

  4. Commit the transaction if the process completed satisfactorily, or roll it back if something went wrong, to ensure that the document-loading process succeeds or fails as a single unit.
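
The commit-or-roll-back pattern in steps 2 and 4 can be sketched with plain JDBC. This is only an outline under assumptions: the class TransactionSketch and its Work callback are hypothetical names, and how the repository's own xmlrepDB class manages its connection is not shown in this section.

```java
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical demo class: runs a unit of work inside one transaction
class TransactionSketch {
    // Hypothetical callback holding the statements to run
    interface Work {
        void run(Connection conn) throws Exception;
    }

    // Returns true if the work was committed, false if it was rolled back
    static boolean inTransaction(Connection conn, Work work) {
        try {
            // Switching off auto-commit starts an explicit transaction
            conn.setAutoCommit(false);
            try {
                work.run(conn);
                conn.commit();
                return true;
            } catch (Exception e) {
                // Any failure undoes every statement issued by the work
                conn.rollback();
                return false;
            }
        } catch (SQLException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In the upload scenario, the work passed in would be the SAX parse itself, so a parse error halfway through a document leaves no partial rows behind.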

The class will need all the same content-handling methods that scanXML provided, but this time we will do something with the content; we will call our stored procedures to insert the content into the database.

The class will also need to determine the Nested Sets coordinates of each node. The x coordinates are easy: we just start with x = 1 and increment x every time we encounter a new node. When we encounter a leaf node, its y coordinate will be x + 1 (we will need to increase x afterwards to y + 1 before we continue, ready for the next node). The y coordinates of non-leaf nodes are a little trickier. When we create the document node or encounter element nodes, we do not yet know what the node's y coordinate will be, so we will have to leave it as a null value until it can be determined. Fortunately, because elements must nest correctly (i.e., <a><b></b></a> is well formed, but <a><b></a></b> is not), every time we encounter a closing tag (causing a SAX endElement event), we can be sure that this closing tag corresponds to the most recent opening tag that has not yet been closed; that is, we just need to query the node table for the last node with a null value of y, set it to x + 1, and then set x = y + 1, and so on, until all nodes have been closed (the document node will be the last). The code for the uploadXML class is shown in Listing 8.51.

Listing 8.51 uploadXML Class
// Import core Java classes
import java.io.*;
import java.lang.*;
import java.util.Date;
// Import SAX classes
import org.xml.sax.Attributes;
import org.xml.sax.DTDHandler;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;
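// Our own classes (scanXML, xmlrepDB, xmlrepSAX) are in the same
// default package, so no import statements are needed (current Java
// compilers reject imports from the unnamed package)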
// The uploadXML class
public class uploadXML extends DefaultHandler
   implements LexicalHandler, DTDHandler {
   // Parameter
   boolean keepIgnorableWS = true;
   // Variables
   boolean isCData = false;
   boolean isEntity = false;
   boolean verboseMode = false;
   int docID;
   int lastNodeID = -1;
   int lastNodeType = -1;
   int nodeID;
   int x = 0;
   String uri;
   xmlrepDB xmlrep = new xmlrepDB();
   xmlrepSAX saxParser = new xmlrepSAX();
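
The Nested Sets numbering described before Listing 8.51 can be tried without a database. The sketch below keeps the counter arithmetic but swaps the "last node with a null y" query for an in-memory stack; that substitution, the class name NestedSetsSketch, and the closing of the document node in endDocument() are assumptions made for illustration. Each characters() call is treated as its own leaf, so the continuation handling discussed later is left out.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: assigns x and y coordinates in memory
class NestedSetsSketch extends DefaultHandler {
    private final List<String> labels = new ArrayList<>();
    private final List<int[]> coords = new ArrayList<>();   // {x, y}
    private final Deque<Integer> open = new ArrayDeque<>(); // nodes without y
    private int x = 0;

    private void openNode(String label) {
        x++;
        labels.add(label);
        coords.add(new int[] {x, 0});
        open.push(coords.size() - 1);
    }

    // Closes the most recently opened node that has no y yet
    private void closeNode() {
        x++;
        coords.get(open.pop())[1] = x;
    }

    @Override public void startDocument() { openNode("#document"); }
    @Override public void endDocument() { closeNode(); }
    @Override public void startElement(String uri, String localName,
            String qName, Attributes attrs) { openNode(qName); }
    @Override public void endElement(String uri, String localName,
            String qName) { closeNode(); }
    @Override public void characters(char[] ch, int start, int length) {
        // A leaf is opened and closed at once, so y = x + 1
        openNode("#text");
        closeNode();
    }

    // Returns one "label x y" string per node, in document order
    static List<String> number(String xml) {
        try {
            NestedSetsSketch h = new NestedSetsSketch();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), h);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < h.labels.size(); i++) {
                int[] c = h.coords.get(i);
                out.add(h.labels.get(i) + " " + c[0] + " " + c[1]);
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String line : number("<a><b/><c/></a>")) {
            System.out.println(line);
        }
    }
}
```

The stack works for the same reason the database query does: because elements nest correctly, the node being closed is always the most recently opened one that is still open.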

The main() method instantiates the class, receives and checks the command-line arguments, and invokes the instance's uploadFile() method. This is shown in Listing 8.52.

Listing 8.52 Main Method for uploadXML
public static void main(String args[]) {
  // Create an instance of this class
  uploadXML handler = new uploadXML();
  // Check we received a URI
  if (args.length == 0) {
    System.out.println("Usage:  java uploadXML uri (verbose)");
    System.out.println(" where");
    System.out.println("  uri     = URI of your XML document,");
    System.out.println("  verbose = 't' to switch on verbose messaging.");
    System.exit(1);
  }
  // Verbose messaging?
  if (args.length >= 2) {
     if (args[1].equalsIgnoreCase("t")) {handler.verboseMode = true;}}
  // Upload the XML file into the repository
  handler.uploadFile(handler, args[0]);
}

The uploadFile() method invokes a well-formedness check (scanXML) and, assuming all is well, connects to the repository, invokes the SAX parser (registering this object as the content handler), and, assuming successful completion, outputs a success message that includes the document ID that was assigned. The code for uploadFile() is shown in Listing 8.53.

Listing 8.53 uploadFile Method
public void uploadFile(DefaultHandler handler, String uri) {
   // Variables
   String successStr;
   this.uri = uri;
   // Parse the XML file to ensure it is well-formed
   scanXML sx = new scanXML();
   sx.check(sx, uri);
   // Start the upload process...
   try {
      // Connect to the repository
      xmlrep.connect();
      // Parse the file
      saxParser.parse(handler, saxParser, uri);
      // Close the connection
      xmlrep.disconnect();
      // Output the document ID
      successStr = "Document uploaded into the repository "
         + "with doc ID = " + docID + " (" + ((x + 1)/2) + " nodes).";
      System.out.println(successStr);
   } catch (java.lang.Exception ex) {xmlrep.javaEx(ex);}
}

Next are the ContentHandler methods. The SAX parser invokes the startDocument() method (see Listing 8.54) when the document is opened; no arguments are passed (we know the URI of the document in any case), but it gives us the opportunity to insert a row into the doc table (using the rep_i_d procedure), grab the doc_id that is returned, and insert the document node into the node table (using the rep_i_n procedure).

Listing 8.54 startDocument Method
// * Start document
public void startDocument() {
   if (verboseMode) {System.out.println("* startDocument");}
   // Insert into doc table
   docID = xmlrep.intExecSQL("rep_i_d '" + uri + "', 'xmlrep_user';");
   // Increment counter and set node type = DOCUMENT [9]
   x++;
   int nodeType = 9;
   // Create a document node in the node table
   nodeID = xmlrep.intExecSQL("rep_i_n " + x + ", " + nodeType
      + ", "+ docID + ";");
   // Remember node type & ID
   lastNodeType = nodeType;
   lastNodeID = nodeID;
}

The parser invokes the startElement() method shown in Listing 8.55 when an open tag (for example, <myNS:message>) is encountered. We ignore the namespaceURI and localName arguments; instead, we insert a node of type ELEMENT (using the rep_i_n procedure) and then slice the namespace prefix (if any) from the qualified element name, inserting the results into the element_name table (using rep_i_en). If there are any attributes, we loop over them, inserting names and namespace prefixes into attribute_name (using rep_i_an) and values (split over leaves, if necessary) into attribute_value_leaf (using rep_i_avl). If element names, attribute names, or namespace prefixes exceed the maximum lengths, we report an error and abort the process. In common with most of the other node-creation methods, we also need to keep a record of the type and ID of the last node encountered; these are stored in lastNodeType and lastNodeID (respectively).

Listing 8.55 startElement Method
// * Start element
public void startElement(String namespaceURI, String localName,
   String qName, Attributes attrs) {
   if (verboseMode) {System.out.println("* startElement ["
      + x + "]: qName = '" + qName + "'");}
   // Variables
   int lastColon;
   String name = qName;
   String namespace;
   // Increment counter and set node type = ELEMENT [1]
   x++;
   int nodeType = 1;
   // Insert to node table
   nodeID = xmlrep.intExecSQL("rep_i_n " + x + ", " + nodeType
      + ", "+ docID + ";");
   // Check that a valid nodeID was returned, if not report error and exit
   if (nodeID < 1) {saxParser.genError(
      "[Error] Node was not successfully inserted: " + qName);}
   // Does the element's qualified name include a namespace?
   lastColon = name.lastIndexOf(":");
   if (lastColon > 0) {
      // Parse the qName to retrieve the namespace prefix and
      // the local element name
      namespace = "'" + name.substring(0, lastColon) + "'";
      name = name.substring(lastColon + 1, name.length());
      if (namespace.length() > xmlrep.nsPrefixLength) {
         saxParser.genError("[Error] : " + namespace
            + " exceeds the maximum length.");
      }
   } else {
      namespace = "null";
   }
   // Check element name does not exceed maximum length
   if (name.length() > xmlrep.elementNameLength) {
      saxParser.genError("[Error] : " + name
         + " exceeds the maximum length.");
   }
   // Insert to element_name table
   xmlrep.insertValue(false, "rep_i_en " + nodeID + ", "
      + namespace, name);
   // Loop over the element's attributes
   if (attrs != null) {
      int len = attrs.getLength();
      for (int i = 1; i <= len; i++) {
         // Attribute name (including namespace)
         name = attrs.getQName(i - 1);
         // Index of last colon
         lastColon = name.lastIndexOf(":");
         // Does the attribute name include a namespace?
         if (lastColon > 0) {
            namespace = "'" + name.substring(0, lastColon) + "'";
            name = name.substring(lastColon + 1, name.length());
            if (namespace.length() > xmlrep.nsPrefixLength) {
               saxParser.genError("[Error] : '" + namespace
                  + "' exceeds the maximum length.");
            }
         } else {
            namespace = "null";
         }
         // Insert to attribute_name table
         xmlrep.insertValue(false, "rep_i_an " + nodeID + ", "
            + i + ", " + namespace, name);
         // Insert to attribute_value_leaf table
         xmlrep.insertValue(true, "rep_i_avl " + nodeID + ", "
            + i, attrs.getValue(i - 1));
      }
   }
   // Remember node type & ID
   lastNodeType = nodeType;
   lastNodeID = nodeID;
}
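
The prefix-slicing step that startElement() applies to both element and attribute names can be isolated in a small helper. The sketch below is illustrative only: the class name QNameSketch and the method splitQName() are hypothetical, and it returns a Java null for a missing prefix rather than the SQL literal "null" that the listing builds.

```java
// Hypothetical demo class: splits a qualified name at the last colon
class QNameSketch {
    // Returns {prefix, localName}; prefix is null when there is no colon
    static String[] splitQName(String qName) {
        int colon = qName.lastIndexOf(':');
        if (colon > 0) {
            return new String[] {
                qName.substring(0, colon), qName.substring(colon + 1)
            };
        }
        return new String[] { null, qName };
    }

    public static void main(String[] args) {
        String[] parts = splitQName("myNS:message");
        System.out.println("prefix=" + parts[0] + " local=" + parts[1]);
    }
}
```

Keeping the prefix unquoted until the SQL is built means that any length check compares the prefix itself, without counting the surrounding quote characters.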

The processingInstruction() method shown in Listing 8.56 is called when a processing instruction (for example, <?target-application data-string?>) is encountered; string values for the target and data are supplied. Since we have allowed these to be of any length, we call the insertValue() method of the xmlrep object, which will split long strings into leaves and insert them into the appropriate tables.

Listing 8.56 processingInstruction Method
// * Processing instruction
public void processingInstruction(String target, String data) {
   if (verboseMode) {System.out.println("* processingInstruction ["
      + x + "] = '" + target + " " + data + "'");}
   // Increment counter and set node type = PROCESSING_INSTRUCTION_NODE [7]
   x++;
   int nodeType = 7;
   // Insert to node table
   nodeID = xmlrep.intExecSQL("rep_i_n " + x + ", " + nodeType
      + ", " + docID + ";");
   // Insert to pi_target_leaf table
   xmlrep.insertValue(true, "rep_i_pitl " + nodeID, target);
   // Insert to pi_data_leaf table
   xmlrep.insertValue(true, "rep_i_pidl " + nodeID, data);
   // Update node table: set y value for this node
   // (since PI nodes do not have children)
   xmlrep.voidExecSQL("rep_u_n_y " + nodeID + ", " + (x + 1) + ";");
   x++;
   // Remember node type & ID
   lastNodeType = nodeType;
   lastNodeID = nodeID;
}

The characters() method shown in Listing 8.57 is called in a number of situations: namely, when text, CDATA, or entity references are encountered. Unlike many applications that use a SAX parser, our application needs to differentiate between these types of content. Furthermore, in the case of entities, the SAX parser expands the entity reference for us. For example, suppose we have defined a &copyright; entity (to save us having to repeat a lengthy string every time we want to include standard copyright details in our documents). Every time the SAX parser encounters the entity reference &copyright; in our document, the characters() method will be called with the full copyright string. We need to override these behaviors; fortunately, help is at hand, thanks to SAX's LexicalHandler methods, which we will discuss shortly. For now, note that nothing happens if this method is called while the value of the Boolean isEntity is true. Otherwise, the relevant slice of the character array is converted to a string and passed to the handleText() method.

Listing 8.57 characters Method
// * Characters
public void characters(char ch[], int start, int length) {
   // If the isEntity boolean is true, ignore this call (it is
   // a parsed entity)
   if (isEntity) {return;}
   // Treat as normal and flag as NOT ignorable
   handleText(new String(ch, start, length), 0);
}
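
The way an entity flag suppresses expanded text can be seen in a stand-alone sketch. It is not the uploadXML code: the class name EntitySketch is hypothetical, it uses a depth counter instead of the Boolean isEntity so that nested entities are also covered, and it assumes the JDK's bundled parser, which reports general entities through the LexicalHandler startEntity() and endEntity() callbacks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class: keeps entity references unexpanded
class EntitySketch extends DefaultHandler2 {
    final List<String> references = new ArrayList<>();
    final StringBuilder text = new StringBuilder();
    private int entityDepth = 0;

    // Parameter entities start with '%'; "[dtd]" is the external subset
    private static boolean isGeneral(String name) {
        return !name.startsWith("%") && !name.equals("[dtd]");
    }

    @Override public void startEntity(String name) {
        if (!isGeneral(name)) { return; }
        if (entityDepth == 0) { references.add(name); }
        entityDepth++;
    }

    @Override public void endEntity(String name) {
        if (isGeneral(name)) { entityDepth--; }
    }

    @Override public void characters(char[] ch, int start, int length) {
        // Text delivered inside an entity is the expansion: skip it
        if (entityDepth > 0) { return; }
        text.append(ch, start, length);
    }

    static EntitySketch run(String xml) {
        try {
            EntitySketch handler = new EntitySketch();
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setProperty(
                "http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        EntitySketch h = run("<!DOCTYPE doc [<!ENTITY copyright "
            + "\"(c) Example Corp\">]><doc>See: &copyright;</doc>");
        System.out.println("text=[" + h.text + "] refs=" + h.references);
    }
}
```

The name passed to startEntity() is what would be stored as the entity reference, while the expansion that follows is discarded.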

The ignorableWhitespace() method shown in Listing 8.58 is called in place of characters() when the character data consists only of whitespace (spaces, tabs, and/or line feeds) and the SAX parser is able to determine that it is truly ignorable. This latter condition can be satisfied only if the supplied XML file has a reference to a schema (such as a DTD), which will dictate where text content may appear in the element hierarchy. If whitespace appears anywhere else (for example, if the file author used indentation and line feeds to make the XML document more human readable), then the SAX parser will call ignorableWhitespace() rather than characters(). Keeping ignorable whitespace is useful for readability and testing; however, if we prefer not to, we can change the value of keepIgnorableWS (in the variable definitions at the start of the class) to false.

Listing 8.58 ignorableWhitespace Method
// * Ignorable whitespace
public void ignorableWhitespace(char ch[], int start, int length) {
   if (verboseMode) {System.out.println(
      "* ignorableWhitespace [" + x + "] = (" + start + ", "
      + length + ")");}
   // If user has specified that ignorable whitespace should be
   // kept, call the handling method (but flag as 'ignorable' for
   // info). Otherwise, do nothing.
   if (keepIgnorableWS) {handleText(new String(ch, start, length), 1);}
}

The handleText() method in Listing 8.59 was created to avoid writing the same code twice in the characters() and ignorableWhitespace() methods (both of which invoke handleText(), although the latter does so only if we have specified that we want to keep ignorable whitespace). This method also handles a side effect of carriage return/newline characters on some operating systems; for XML files without schema definitions on certain combinations of operating system and parser, SAX will invoke the characters() method each time a carriage return/newline is encountered. This will have the side effect of splitting up text and CDATA sections into multiple nodes. To counter this situation, we make the assumption that, in the event that "this" node is of the same type (text or CDATA) as the "last" node, we will treat both as different leaves of the same node. We need to run rep_s_tlid (text) or rep_s_cdlid (CDATA) in order to determine the final leaf_id for the last node, before we execute rep_i_tl (text) or rep_i_cdl (CDATA) to add the new leaves.

Listing 8.59 handleText Method
// * Handle character data
public void handleText(String characterData, int isIgnorable) {
   if (verboseMode) {System.out.println(
      "* handleText [" + x + "] = '" + characterData + "'");}
   // A non-validating parser may split up CDATA and TEXT sections...
   boolean isContinuation = false;
   int nodeType;
   int offset = 0;
   String abbrvTblNm;
   String extraArg = "";
   // Is the data CDATA or TEXT?
   if (isCData) {
      // Node type = CDATA_SECTION_NODE [4]
      if (lastNodeType == 4) {isContinuation = true;}
      nodeType = 4;
      abbrvTblNm = "cd";
   } else {
      // Node type = TEXT_NODE [3]
      if (lastNodeType == 3) {isContinuation = true;}
      nodeType = 3;
      abbrvTblNm = "t";
      extraArg = ", " + isIgnorable;
   }
   if (!isContinuation) {
      // Increment counter
      x++;
      // Insert to node table
      nodeID = xmlrep.intExecSQL("rep_i_n " + x + ", " + nodeType
         + ", "+ docID + ";");
   } else {
      // Treat as a continuation: decrement counter
      x--;
      // Get the last leaf_id
      offset = xmlrep.intExecSQL("rep_s_" + abbrvTblNm + "lid "
         + lastNodeID + ";");
      // Null the y value of this node (since we are continuing it)
      xmlrep.voidExecSQL("rep_u_n_y_null " + nodeID + ";");
   }
   // Insert to appropriate table
   xmlrep.insertValue(true, "rep_i_" + abbrvTblNm + "l " + nodeID
      + extraArg, characterData, offset);
   // Update node table: set y value for this node
   xmlrep.voidExecSQL("rep_u_n_y " + nodeID + ", " + (x + 1) + ";");
   x++;
   // Remember node type & ID
   lastNodeType = nodeType;
   lastNodeID = nodeID;
}
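The counter arithmetic of the continuation case is easier to follow in isolation. The following is a minimal sketch that replaces the repository with an in-memory list; the class TextMergeSketch and its Node type are hypothetical stand-ins for the node and leaf tables, and only the handling of x and y is meant to mirror Listing 8.59.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for the node and leaf tables.
class TextMergeSketch {
    static class Node {
        final int type;                      // 3 = text, 4 = CDATA
        final int x;
        int y;
        final List<String> leaves = new ArrayList<>();
        Node(int type, int x) { this.type = type; this.x = x; }
    }

    final List<Node> nodes = new ArrayList<>();
    int x = 0;                               // running coordinate counter
    Node last = null;                        // most recently written node

    void handleText(String data, boolean isCData) {
        int type = isCData ? 4 : 3;
        Node node;
        if (last != null && last.type == type) {
            // Continuation: step back so y is recomputed from the same x
            x--;
            node = last;
        } else {
            // New node: take the next x coordinate
            x++;
            node = new Node(type, x);
            nodes.add(node);
        }
        node.leaves.add(data);               // add a leaf to the node
        node.y = x + 1;                      // close the node immediately
        x++;
        last = node;
    }

    public static void main(String[] args) {
        TextMergeSketch s = new TextMergeSketch();
        s.handleText("line one\n", false);
        s.handleText("line two", false);     // merged into the first node
        s.handleText("raw", true);           // starts a new CDATA node
        for (Node n : s.nodes) {
            System.out.println("type=" + n.type + " x=" + n.x
                + " y=" + n.y + " leaves=" + n.leaves);
        }
    }
}
```

In the listing, lastNodeType is also updated by elements, comments, and entity references, so only directly adjacent chunks of the same type are merged; the sketch tracks text and CDATA only.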

The endElement() method (see Listing 8.60) is invoked by the SAX parser when a closing tag is encountered (e.g., </myNS:message>). The element that this tag is closing will be the last node with a null value of y.

Listing 8.60 endElement Method
// * End element
public void endElement(String namespaceURI, String localName,
   String qName) {
   if (verboseMode) {System.out.println(
      "* endElement [" + x + "] = '" + qName + "'");}
   // Increment counter
   x++;
   // Find the node that this "end element" corresponds to
   nodeID = xmlrep.intExecSQL("rep_s_n_last " + docID + ";");
   // Check we found an element with null value for y
   if (nodeID < 1) {
      saxParser.genError("No element has null y.");
   }
   // Update node table: set y value for this node
   xmlrep.voidExecSQL("rep_u_n_y " + nodeID + ", " + x + ";");
   // Remember node type & ID
   lastNodeType = 1; // = ELEMENT [1]
   lastNodeID = nodeID;
}
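Looking up "the last node with a null y" behaves like popping a stack: the element closed by an end tag is always the most recently opened one that is still waiting for its y value. The following is a minimal sketch of that idea, assuming that startElement() increments x and inserts the node with a null y; the class NestedCoordSketch and its members are hypothetical and keep the nodes in memory rather than in the repository.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical in-memory model of x/y assignment for nested elements.
class NestedCoordSketch {
    static class Node {
        final String name;
        final int x;
        Integer y;                           // null until the end tag is seen
        Node(String name, int x) { this.name = name; this.x = x; }
    }

    final List<Node> nodes = new ArrayList<>();
    final Deque<Node> open = new ArrayDeque<>();   // nodes whose y is null
    int x = 0;

    void startElement(String name) {
        x++;
        Node n = new Node(name, x);
        nodes.add(n);
        open.push(n);
    }

    void endElement() {
        x++;
        if (open.isEmpty()) {
            // Corresponds to the "No element has null y." check
            throw new IllegalStateException("No element has null y.");
        }
        open.pop().y = x;                    // close the innermost open node
    }

    public static void main(String[] args) {
        NestedCoordSketch s = new NestedCoordSketch();
        s.startElement("outer");
        s.startElement("inner");
        s.endElement();
        s.endElement();
        for (Node n : s.nodes) {
            System.out.println(n.name + " x=" + n.x + " y=" + n.y);
        }
    }
}
```

The inner element's coordinates end up strictly inside those of the outer element, which is the property the repository relies on when it extracts fragments.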

The startDTD() and endDTD() methods shown in Listing 8.61 are invoked by the SAX parser at the beginning and end of a document type definition, respectively (and regardless of whether the DTD is stored inside the file or is external to it). In between, various methods will be invoked as the parser encounters notation declarations and entity definitions (and we can capture these if we register our class as the DTDHandler for the SAX parser). Aside from printing a message (if we are in "verbose" mode), we have chosen not to do anything with DTD events.

Listing 8.61 startDTD and endDTD Methods
// * Start DTD
public void startDTD(String name, String publicId, String systemId) {
   if (verboseMode) {System.out.println("* startDTD (name = '"
      + name + "', publicId = '" + publicId + "', systemId = '"
      + systemId + "')");}
   // Do nothing
}
// * End DTD
public void endDTD() {
   if (verboseMode) {System.out.println("* endDTD");}
   // Do nothing
}

The endDocument() method shown in Listing 8.62 is invoked by the SAX parser when the end of the file is reached; all we need to do when this stage is reached is to set the y value for the document node.

Listing 8.62 endDocument Method
// * End document
public void endDocument() {
   if (verboseMode) {System.out.println("* endDocument");}
   // Find the remaining node with null y (the document node)
   nodeID = xmlrep.intExecSQL("rep_s_n_last " + docID + ";");
   // Check we found a node with null value for y
   if (nodeID < 1) {
      saxParser.genError("Root node does not have null y.");
   }
   // Update node table: set y value for this node
   xmlrep.voidExecSQL("rep_u_n_y " + nodeID + ", " + (x + 1) + ";");
}

The SAX parser invokes the methods shown in Listing 8.63 when lexical events are encountered. As with text and CDATA sections, the comment() method allows a comment to be split across leaves. The startCDATA() and endCDATA() methods toggle the isCData variable, and startEntity() and endEntity() toggle isEntity, so the character-handling methods (which the SAX parser calls immediately afterwards) know which type of character data they are dealing with.

Listing 8.63 LexicalHandler Methods
// -- LexicalHandler methods
public void comment(char[] ch, int start, int length) {
   if (verboseMode) {System.out.println("* comment [" + x + "]");}
   // Increment counter and set node type = COMMENT_NODE [8]
   x++;
   int nodeType = 8;
   // Insert to node table
   nodeID = xmlrep.intExecSQL("rep_i_n " + x + ", " + nodeType
      + ", "+ docID + ";");
   // Insert to comment_leaf table
   xmlrep.insertValue(true, "rep_i_cl " + nodeID,
      new String(ch, start, length));
   // Update node table: set y value for this node
   xmlrep.voidExecSQL("rep_u_n_y " + nodeID + ", " + (x + 1) + ";");
   x++;
   // Remember node type & ID
   lastNodeType = nodeType;
   lastNodeID = nodeID;
}
public void startCDATA() {
   if (verboseMode) {System.out.println("* startCDATA [" + x + "]");}
   // Next call to characters() will be treated as CDATA
   isCData = true;
}
public void endCDATA() {
   if (verboseMode) {System.out.println("* endCDATA [" + x + "]");}
   // Next call to characters() will be treated as text
   isCData = false;
}
public void startEntity(String name) {
   if (verboseMode) {System.out.println("* startEntity [" + x + "]");}
   // Set the isEntity boolean to true, so characters() calls arising
   // due to entity parsing get ignored
   isEntity = true;
   // Increment counter and set node type = ENTITY_REFERENCE_NODE [5]
   x++;
   int nodeType = 5;
   // Insert to node table
   nodeID = xmlrep.intExecSQL("rep_i_n " + x + ", " + nodeType
      + ", "+ docID + ";");
   // Check entity reference name does not exceed maximum length
   if (name.length() > xmlrep.entityRefLength) {
       saxParser.genError("[Error] : " + name
          + " exceeds the maximum length.");
   }
   // Insert to entity_reference table
   xmlrep.insertValue(false, "rep_i_er " + nodeID, name);
   // Update node table: set y value for this node
   xmlrep.voidExecSQL("rep_u_n_y " + nodeID + ", " + (x + 1) + ";");
   x++;
   // Remember node type & ID
   lastNodeType = nodeType;
   lastNodeID = nodeID;
}
public void endEntity(String name) {
   if (verboseMode) {System.out.println("* endEntity [" + x + "]");}
   // Set the isEntity boolean to false
   isEntity = false;
}
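To see the flag-toggling pattern at work outside the repository, the following minimal sketch registers a lexical handler through the standard SAX2 property http://xml.org/sax/properties/lexical-handler and records what it receives. It uses the parser bundled with the JDK rather than the Xerces setup from xmlrepSAX; the class LexicalSketch, its events list, and the SAMPLE document are hypothetical, and entity handling is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler: records comments, text, and CDATA as strings.
class LexicalSketch extends DefaultHandler2 {
    static final String SAMPLE =
        "<a>hi<!--note--><![CDATA[x < y]]>bye</a>";

    final List<String> events = new ArrayList<>();
    private boolean isCData = false;

    @Override public void startCDATA() { isCData = true; }
    @Override public void endCDATA() { isCData = false; }

    @Override public void comment(char[] ch, int start, int length) {
        events.add("COMMENT:" + new String(ch, start, length));
    }

    @Override public void characters(char[] ch, int start, int length) {
        String kind = isCData ? "CDATA:" : "TEXT:";
        String data = new String(ch, start, length);
        int last = events.size() - 1;
        if (last >= 0 && events.get(last).startsWith(kind)) {
            // Same type as the previous event: treat as a continuation
            events.set(last, events.get(last) + data);
        } else {
            events.add(kind + data);
        }
    }

    static List<String> parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            LexicalSketch handler = new LexicalSketch();
            reader.setContentHandler(handler);
            // Without this property, comments and CDATA markers are not reported
            reader.setProperty(
                "http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String event : parse(SAMPLE)) {
            System.out.println(event);
        }
    }
}
```

Merging adjacent chunks of the same kind in characters() is the same assumption that handleText() makes, so the recorded events do not depend on how the parser happens to split the character data.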

The SAX parser invokes the methods shown in Listing 8.64 when DTD events are encountered. Currently, we do not do anything in these situations, but we could add tables and procedures to deal with the information.

Listing 8.64 DTDHandler Methods
   // -- DTDHandler methods
   public void notationDecl(String name, String publicId,
      String systemId) {
      if (verboseMode) {System.out.println("* notationDecl");}
      // Do nothing
   }
   public void unparsedEntityDecl(String name, String publicId,
      String systemId, String notationName) {
      if (verboseMode) {System.out.println("* unparsedEntityDecl");}
      // Do nothing
   }
}
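If we did want to capture this information, the first step would be to record what the two callbacks receive. The following is a minimal sketch, using the parser bundled with the JDK; the class DtdSketch, its declarations list, and the SAMPLE document (including the placeholder URL) are hypothetical, and no repository tables or procedures are involved.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects notation and unparsed entity declarations.
class DtdSketch extends DefaultHandler {
    static final String SAMPLE =
        "<!DOCTYPE doc [\n"
        + "  <!NOTATION gif PUBLIC \"image/gif\">\n"
        + "  <!ENTITY logo SYSTEM \"http://example.com/logo.gif\" NDATA gif>\n"
        + "  <!ELEMENT doc EMPTY>\n"
        + "]>\n"
        + "<doc/>";

    final List<String> declarations = new ArrayList<>();

    @Override
    public void notationDecl(String name, String publicId,
            String systemId) {
        declarations.add("notation " + name);
    }

    @Override
    public void unparsedEntityDecl(String name, String publicId,
            String systemId, String notationName) {
        declarations.add("entity " + name + " -> " + notationName);
    }

    static List<String> parse(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance()
                .newSAXParser().getXMLReader();
            DtdSketch handler = new DtdSketch();
            reader.setContentHandler(handler);
            // Register the same object for DTD events
            reader.setDTDHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.declarations;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String declaration : parse(SAMPLE)) {
            System.out.println(declaration);
        }
    }
}
```

In the repository, each recorded declaration would presumably become a row in a new table keyed by the document ID.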

Compile the uploadXML class ("javac uploadXML.java"), and we're ready to test it with a sample XML file. On successful loading, the Java program returns a message indicating the document ID that was assigned (along with a count of the nodes that were created):

C:\xmlrep>java uploadXML wml-example.xml
Document uploaded into the repository with doc ID = 1 (25 nodes).

8.6.4 The extractXML Class

Running rep_serialise_nodes for a nontrivial document produces lots of output that is not particularly readable, so let's create a class (extractXML.java) that will call the procedure and knit the output back into recognizable XML. There is only one method, main(), which takes command-line parameters for the document ID (plus additional optional parameters) and executes rep_serialise_nodes. The code for extractXML is shown in Listing 8.65.

Listing 8.65 extractXML Class
// Import core Java classes (java.lang is imported implicitly)
import java.sql.ResultSet;
import java.sql.SQLException;
// xmlrepDB is in the same (default) package, so no import is needed
// The extractXML class
class extractXML {
   public static void main(String[] args) {
      // Variables
      int docID;
      int startX = 1;
      int incMetadata = 0;
      xmlrepDB xmlrep = new xmlrepDB();
      String xmlDoc = "";
      String xmlBit;
      String sqlCmd;
      // Check we received a docID
      if (args.length == 0) {
        System.out.println(
           "Usage:  java extractXML docID (startX) (incMetadata)");
        System.out.println(
           "   docID = ID of the document in the repository");
        System.out.println(
           "   startX = integer x index of the node with which to start");
        System.out.println(
           "   incMetadata = 't' for attributes showing x & y coordinates");
        System.exit(1);
      }
      // Document ID (parseInt is static, so no Integer instance is needed)
      docID = Integer.parseInt(args[0]);
      // Starting x_index specified?
      if (args.length >= 2) {startX = Integer.parseInt(args[1]);}
      // Include metadata?
      if (args.length >= 3) {
         if (args[2].equalsIgnoreCase("t")) {incMetadata = 1;}}
      // Start the output
      System.out.println("<?xml version=\"1.0\"?>");
      // Handle errors
      try {
         // Connect to the repository
         xmlrep.connect();
         // Build the SQL command
         sqlCmd = "rep_serialise_nodes " + docID + ", " + startX
            + ", " + incMetadata + ";";
         try {
            // Execute the SQL
            ResultSet rs = xmlrep.stmt.executeQuery(sqlCmd);
            // Loop through the records
            while (rs.next()) {
               // Read the fields & handle NULLs / empty strings
               xmlBit = rs.getString("parsed_text");
               if (!rs.wasNull()) {xmlDoc = xmlDoc + xmlBit;}
            }
         } catch (java.sql.SQLException ex) {xmlrep.sqlEx (ex, sqlCmd);}
         // Output to screen
         System.out.println(xmlDoc);
         // Close the connection
         xmlrep.disconnect();
      } catch (java.lang.Exception ex) {
          // Print exception information as an XML comment
          System.out.println ("<!--");
          ex.printStackTrace ();
          System.out.println ("-->");
      }
   }
}
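For large documents, appending to a String inside the loop copies the accumulated text on every iteration. The following minimal sketch shows the same knitting step with a StringBuilder; the class KnitSketch and its knit() method are hypothetical, and a null entry stands in for a row where parsed_text is NULL.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: joins serialised fragments in row order.
class KnitSketch {
    static String knit(List<String> fragments) {
        StringBuilder doc = new StringBuilder();
        for (String fragment : fragments) {
            // Skip NULL rows, as the wasNull() check does in the listing
            if (fragment != null) {
                doc.append(fragment);
            }
        }
        return doc.toString();
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("<p>", null, "Hello", "</p>");
        System.out.println(knit(rows));
    }
}
```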

Compile the extractXML class ("javac extractXML.java"), and we can test it on the sample file we uploaded to the repository:

C:\xmlrep>java extractXML 1

If everything is working properly, the output should look exactly like the file that went in (you can pipe the output to a file, if you want to check). We can also run the process again for a fragment of the document, as shown in Listing 8.66.

Listing 8.66 Result for extractXML with Two Parameters
C:\xmlrep>java extractXML 1 37
<?xml version="1.0"?>
<card id="cSecond" title="Second card">
  <p align="center">
   Content of the second card.
  </p>
 </card>

In this case, we asked for the fragment of document 1 starting with the node with x = 37 (the second of the two card elements). If we were to repeat the process with the optional incMetadata parameter set to "t" (i.e., "java extractXML 1 37 t"), we would get a similar result but with additional attributes showing the values of x, y, and node_id for each element. Listing 8.67 shows the XML.

Listing 8.67 Result for extractXML with Three Parameters
C:\xmlrep>java extractXML 1 37 t
<?xml version="1.0"?>
<card xmlns:repository="http://www.rgedwards.com/"
      repository:x="37" repository:y="46" repository:nodeID="52"
      id="cSecond" title="Second card">
  <p repository:x="40" repository:y="43" repository:nodeID="54"
     align="center">
   Content of the second card.
  </p>
 </card>
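The coordinates in Listing 8.67 show why a fragment is cheap to select: the p element (x = 40, y = 43) lies entirely inside the card element (x = 37, y = 46). The following is a minimal sketch of that containment test, assuming a fragment consists of every node whose x and y fall within those of the starting node; the class FragmentSketch and its methods are hypothetical and say nothing about how rep_serialise_nodes is actually written.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of fragment selection by coordinates.
class FragmentSketch {
    static class Node {
        final String name;
        final int x;
        final int y;
        Node(String name, int x, int y) {
            this.name = name;
            this.x = x;
            this.y = y;
        }
    }

    // The two elements and coordinates shown in Listing 8.67
    static List<Node> sample() {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("card", 37, 46));
        nodes.add(new Node("p", 40, 43));
        return nodes;
    }

    // Names of all nodes contained in the node that starts at startX
    static List<String> fragment(List<Node> nodes, int startX) {
        Node start = null;
        for (Node n : nodes) {
            if (n.x == startX) { start = n; }
        }
        List<String> names = new ArrayList<>();
        if (start == null) { return names; }   // no node starts at startX
        for (Node n : nodes) {
            if (n.x >= start.x && n.y <= start.y) {
                names.add(n.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(fragment(sample(), 37));
        System.out.println(fragment(sample(), 40));
    }
}
```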
