In order to upload XML files, we need a program that can read them, understand the content model, discern content from syntax, and set the x and y coordinates of each node. Fortunately, the XML community has already done almost all of the hard work for us. The Simple API for XML (SAX) provides methods for sequential parsing of XML files, so all we need to do is write methods to handle the various events that the SAX parser raises as it steps through the file. Some features of SAX are worth noting here. A SAX parser understands the XML content model and acts as an XML processor (i.e., it has some intelligence built into it): by default, a parser that drives only the SAX ContentHandler will expand entity references, ignore comments, and treat CDATA sections in the same way as text. Since we want entity references to remain unexpanded, and we need to maintain the distinction between CDATA and text, we need to implement the SAX LexicalHandler interface as well; it provides a method for handling comments, methods that mark the beginning and end of CDATA sections, and methods that mark the beginning and end of each entity expansion. Finally, if we want to handle DTD (Document Type Definition) content, we need to implement the DTDHandler interface, which provides methods that report notation declarations and unparsed entity declarations.
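The difference between the default ContentHandler view and the LexicalHandler view can be seen in a small experiment. The following is a minimal sketch and not part of the repository code: it assumes the JAXP parser bundled with the JDK (rather than a standalone Xerces distribution), and the class name LexicalSketch and its fields are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class (an assumption for illustration only)
class LexicalSketch extends DefaultHandler implements LexicalHandler {
    final List<String> comments = new ArrayList<>();
    final StringBuilder cdata = new StringBuilder();
    final StringBuilder text = new StringBuilder();
    private boolean inCData = false;

    // ContentHandler: text and CDATA content both arrive here
    @Override
    public void characters(char[] ch, int start, int length) {
        (inCData ? cdata : text).append(ch, start, length);
    }

    // LexicalHandler: only called if registered through the lexical-handler property
    @Override public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }
    @Override public void startCDATA() { inCData = true; }
    @Override public void endCDATA() { inCData = false; }
    @Override public void startDTD(String name, String publicId, String systemId) { }
    @Override public void endDTD() { }
    @Override public void startEntity(String name) { }
    @Override public void endEntity(String name) { }

    // Parses the XML string, optionally registering the handler for lexical events
    static LexicalSketch parse(String xml, boolean registerLexicalHandler) {
        LexicalSketch handler = new LexicalSketch();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            if (registerLexicalHandler) {
                reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            }
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }

    public static void main(String[] args) {
        String xml = "<d><!--note--><![CDATA[x<y]]>z</d>";
        LexicalSketch plain = parse(xml, false);
        LexicalSketch lexical = parse(xml, true);
        System.out.println("ContentHandler only: text=" + plain.text
            + " comments=" + plain.comments);
        System.out.println("With LexicalHandler: text=" + lexical.text
            + " cdata=" + lexical.cdata + " comments=" + lexical.comments);
    }
}
```

Without the lexical handler, the comment is never reported and the CDATA content is indistinguishable from ordinary text; with it, the handler can keep the three kinds of content apart.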
First, let's create a class (xmlrepSAX.java) that will initialize a SAX2 parser and handle any exceptions that may arise (in this case, we are using the Xerces parser from http://xml.apache.org/, but the reader can change this by altering the value of DEFAULT_PARSER_NAME). Note that SAX2 parsers are namespace-aware. A SAX1 parser (or a SAX2 parser with this feature disabled) would treat an element such as <my-ns:my-element xmlns:my-ns="uri"> as being called "my-ns:my-element" (leaving us to strip out the namespace prefix and resolve it to a URI), whereas a SAX2 parser with this feature enabled will report "my-element" as the local name, provide the full URI reference of the namespace, and may also give us the prefix. Since we want to retain the namespace prefix (as this is more space-efficient than storing the full URI every time), we will enable both features with two Booleans, NAMESPACE_HANDLING and NS_PREFIX_HANDLING. The code for this class is shown in Listing 8.30.
// Import core Java classes
import java.io.*;
import java.lang.*;

// Import SAX classes
import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.*;

// The xmlrepSAX class
public class xmlrepSAX extends DefaultHandler {

  // Useful parameters
  protected static final String DEFAULT_PARSER_NAME =
    "org.apache.xerces.parsers.SAXParser";
  protected static final String NAMESPACES_FEATURE_ID =
    "http://xml.org/sax/features/namespaces";
  protected static final String NS_PREFIX_HANDLING_PROPERTY_ID =
    "http://xml.org/sax/features/namespace-prefixes";
  protected static final String VALIDATION_FEATURE_ID =
    "http://xml.org/sax/features/validation";
  protected static final String LEXICAL_EVENT_HANDLING_PROPERTY_ID =
    "http://xml.org/sax/properties/lexical-handler";
  protected static final boolean NAMESPACE_HANDLING = true;
  protected static final boolean NS_PREFIX_HANDLING = true;

  // Variables
  String uri = null;
  FileReader r = null;
  XMLReader parser = null;
The parse() method instantiates a SAX parser, registers the various handlers, and invokes the parser, as shown in Listing 8.31.
  public void parse(DefaultHandler xmlHandler, xmlrepSAX errHandler, String uri) {

    // Create an XML reader
    try { parser = XMLReaderFactory.createXMLReader(DEFAULT_PARSER_NAME); }
    catch (Exception e) {
      genError("Error: Unable to instantiate parser (" + DEFAULT_PARSER_NAME + ")");
    }

    // Register the SAX content handler
    try { parser.setContentHandler(xmlHandler); }
    catch (NullPointerException e) { genError("Could not set the ContentHandler."); }

    // Register the SAX error handler
    try { parser.setErrorHandler(errHandler); }
    catch (NullPointerException e) { genError("Could not set the ErrorHandler."); }

    // Set the namespace handling behavior
    try { parser.setFeature(NAMESPACES_FEATURE_ID, NAMESPACE_HANDLING); }
    catch (SAXException e) { genError("Could not set namespace handling."); }

    // Set the namespace prefix handling behavior
    try { parser.setFeature(NS_PREFIX_HANDLING_PROPERTY_ID, NS_PREFIX_HANDLING); }
    catch (SAXException e) { genError("Could not set namespace prefix handling."); }

    // Register the SAX lexical event handler
    try { parser.setProperty(LEXICAL_EVENT_HANDLING_PROPERTY_ID, xmlHandler); }
    catch (SAXNotRecognizedException e) {
      System.out.println("Warning: lex property not recognized.");
    }
    catch (SAXNotSupportedException e) {
      System.out.println("Warning: lex property not supported.");
    }

    // Register the SAX DTD event handler
    try { parser.setDTDHandler(xmlHandler); }
    catch (NullPointerException e) {
      System.out.println("Warning: Could not set the DTD handler.");
    }

    // Open the file and parse it
    try { r = new FileReader(uri); }
    catch (FileNotFoundException e) {
      System.out.println("File not found: " + uri);
      genError(e.toString());
    }
    try { parser.parse(new InputSource(r)); }
    catch (Exception e) {
      System.out.println("Error encountered while parsing " + uri);
      genError(e.toString());
    }
  }
Finally, we need exception-handling methods, as shown in Listing 8.32.
  // * A generic error handler (just outputs the message and quits)
  public void genError(String msg) {
    System.out.println(msg);
    System.exit(1);
  }

  // -- SAX ErrorHandler methods

  // * Warnings
  public void warning(SAXParseException ex) {
    System.out.println("[Warning] " + ex.getMessage());
  }

  // * SAX errors (the method must be named error() for the parser to call it)
  public void error(SAXParseException ex) {
    System.out.println("[Error] " + ex.getMessage());
    System.exit(1);
  }

  // * Fatal errors
  public void fatalError(SAXParseException ex) throws SAXException {
    System.out.println("[Fatal Error] " + ex.getMessage());
    System.exit(1);
  }
}
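To see what the two namespace features set in parse() change, the names reported for a single element can be compared with namespace handling switched off and on. This is a hedged sketch under stated assumptions: it uses the JDK's bundled JAXP parser rather than the Xerces class named in DEFAULT_PARSER_NAME, and NamespaceSketch, rootNames(), and the urn:example:ns namespace are hypothetical names chosen for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class (an assumption for illustration only)
class NamespaceSketch {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String NS_PREFIXES = "http://xml.org/sax/features/namespace-prefixes";

    // Returns {namespaceURI, localName, qName} as reported for the root element
    static String[] rootNames(String xml, boolean namespaceHandling) {
        final String[] names = new String[3];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if (names[2] == null) {   // record the first (root) element only
                    names[0] = uri;
                    names[1] = localName;
                    names[2] = qName;
                }
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(NAMESPACES, namespaceHandling);
            reader.setFeature(NS_PREFIXES, true);   // keep prefixed (qualified) names
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<my-ns:my-element xmlns:my-ns='urn:example:ns'/>";
        for (boolean on : new boolean[] {false, true}) {
            String[] n = rootNames(xml, on);
            System.out.println("namespaces=" + on + ": uri='" + n[0]
                + "' localName='" + n[1] + "' qName='" + n[2] + "'");
        }
    }
}
```

In both runs the qualified name carries the prefix, which is why the upload code can rely on the qualified name alone when it stores the prefix and local name.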
A SAX parser will report an error if the supplied XML file is not well formed, so we can avoid overburdening the database by scanning each file for well-formedness before we attempt to upload it. The scanXML class will do this for us (see Listing 8.33); it instantiates a SAX parser and provides dummy content-handling methods.
// Import core Java classes
import java.io.*;

// Import SAX classes
import org.xml.sax.Attributes;
import org.xml.sax.DTDHandler;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

// Our xmlrepSAX class is in the same (default) package, so it needs no
// import; current compilers reject "import xmlrepSAX;"

// The scanXML class
public class scanXML extends DefaultHandler
    implements LexicalHandler, DTDHandler {

  // Parse the document
  public void check(DefaultHandler handler, String uri) {
    // Variables
    xmlrepSAX saxParser = new xmlrepSAX();
    // Parse the file
    saxParser.parse(handler, saxParser, uri);
  }

  // -- ContentHandler methods
  public void startDocument() {/* Do nothing */}
  public void startElement(String namespaceURI, String localName,
      String rawName, Attributes attrs) {/* Do nothing */}
  public void endElement(String namespaceURI, String localName,
      String rawName) {/* Do nothing */}
  public void characters(char ch[], int start, int length) {/* Do nothing */}
  public void ignorableWhitespace(char ch[], int start, int length)
      {/* Do nothing */}
  public void processingInstruction(String target, String data)
      {/* Do nothing */}
  public void endDocument() {/* Do nothing */}

  // -- LexicalHandler methods
  public void startDTD(String name, String publicId, String systemId)
      {/* Do nothing */}
  public void endDTD() {/* Do nothing */}
  public void comment(char ch[], int start, int length) {/* Do nothing */}
  public void startCDATA() {/* Do nothing */}
  public void endCDATA() {/* Do nothing */}
  public void startEntity(String name) {/* Do nothing */}
  public void endEntity(String name) {/* Do nothing */}
}
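The same idea can be tried in isolation without the repository classes. The sketch below is an assumption-laden simplification: it uses the JDK's bundled JAXP parser, reads from a string instead of a file, and returns a boolean rather than exiting; WellFormedSketch and isWellFormed() are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class (an assumption for illustration only)
class WellFormedSketch {

    // Returns true if the parser reaches the end of the input without a fatal error
    static boolean isWellFormed(String xml) {
        try {
            // DefaultHandler ignores all content events and rethrows fatal errors
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;   // the document is not well formed
        } catch (Exception e) {
            throw new IllegalStateException(e);   // I/O or configuration problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b></b></a>"));
        System.out.println(isWellFormed("<a><b></a></b>"));
    }
}
```

Because no content is stored, a scan of this kind costs only parsing time, which is the reason for running it before any database work begins.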
In order to load an XML document into the repository, our application will need to execute many insert and update statements. We can of course build these SQL statements on the fly, but for performance reasons, it would be better to create some stored procedures first. When ad hoc SQL is executed, the RDBMS will parse the SQL and create a "query plan" prior to execution; with a stored procedure, the query plan is determined upon first execution and then stored in the procedure cache, so subsequent executions can skip the parse-and-plan step. The script create_xmlrep_db.sql contains numerous stored procedures that we will use later; the names are abbreviations referring to the activity and the table affected (e.g., rep_i_an involves an insert ("i") into the attribute_name ("an") table, and rep is just a prefix I have used for all the core stored procedures in the repository).
The procedure in Listing 8.34 creates an entry in the doc table and returns the unique doc_id.
CREATE PROCEDURE dbo.rep_i_d
  @source t_source,
  @contributor_id t_user_id
AS
BEGIN
  INSERT doc (source, date_loaded, contributor_id)
  VALUES (@source, GETDATE(), @contributor_id)
  -- Result = doc_id
  SELECT @@identity AS value
END
GO
The procedure in Listing 8.35 creates an entry in the node table and returns the unique node_id. Note that at this stage, the document ID, type, and x index are known, but the y index is not (so it is not specified here).
CREATE PROCEDURE dbo.rep_i_n
  @x_index t_xy_index,
  @node_type_id t_node_type_id,
  @doc_id t_doc_id
AS
BEGIN
  INSERT node (doc_id, x, node_type_id)
  VALUES (@doc_id, @x_index, @node_type_id)
  -- Result = node_id
  SELECT @@identity AS value
END
GO
The procedure in Listing 8.36 inserts a row in the element_name table for an element node; the node_id refers back to the entry in the node table. The other parameters are the namespace prefix and the local name of the element.
CREATE PROCEDURE dbo.rep_i_en
  @node_id t_node_id,
  @ns_prefix t_ns_prefix,
  @local_name t_element_name
AS
  INSERT element_name (node_id, ns_prefix, local_name)
  VALUES (@node_id, @ns_prefix, @local_name)
GO
The procedure in Listing 8.37 inserts a row in the attribute_name table. Each entry has a reference back to the element_name entry (through node_id), a unique attribute ID (for this node), a namespace prefix, and a local name.
CREATE PROCEDURE dbo.rep_i_an
  @node_id t_node_id,
  @attribute_id t_seq_no,
  @ns_prefix t_ns_prefix,
  @local_name t_attribute_name
AS
  INSERT attribute_name (node_id, attribute_id, ns_prefix, local_name)
  VALUES (@node_id, @attribute_id, @ns_prefix, @local_name)
GO
The procedure in Listing 8.38 inserts a row in the attribute_value_leaf table; each entry has a reference back to the attribute_name table (through node_id and attribute_id). Long values will be split across leaves, each with a unique leaf_id (for the specified attribute).
CREATE PROCEDURE dbo.rep_i_avl
  @node_id t_node_id,
  @attribute_id t_seq_no,
  @leaf_id t_leaf_id,
  @leaf_text t_leaf_text
AS
  INSERT attribute_value_leaf (node_id, attribute_id, leaf_id, leaf_text)
  VALUES (@node_id, @attribute_id, @leaf_id, @leaf_text)
GO
The procedure in Listing 8.39 inserts a row in the cdata_leaf table. Again, long values will be split across leaves.
CREATE PROCEDURE dbo.rep_i_cdl
  @node_id t_node_id,
  @leaf_id t_leaf_id,
  @leaf_text t_leaf_text
AS
  INSERT cdata_leaf (node_id, leaf_id, leaf_text)
  VALUES (@node_id, @leaf_id, @leaf_text)
GO
The procedure in Listing 8.40 inserts a row representing a leaf in the comment_leaf table.
CREATE PROCEDURE dbo.rep_i_cl
  @node_id t_node_id,
  @leaf_id t_leaf_id,
  @leaf_text t_leaf_text
AS
  INSERT comment_leaf (node_id, leaf_id, leaf_text)
  VALUES (@node_id, @leaf_id, @leaf_text)
GO
The procedure in Listing 8.41 inserts a row in the entity_reference table.
CREATE PROCEDURE dbo.rep_i_er
  @node_id t_node_id,
  @entity_name t_entity_ref
AS
  INSERT entity_reference (node_id, entity_name)
  VALUES (@node_id, @entity_name)
GO
The procedure in Listing 8.42 inserts a row in the pi_data_leaf table.
CREATE PROCEDURE dbo.rep_i_pidl
  @node_id t_node_id,
  @leaf_id t_leaf_id,
  @leaf_text t_leaf_text
AS
  INSERT pi_data_leaf (node_id, leaf_id, leaf_text)
  VALUES (@node_id, @leaf_id, @leaf_text)
GO
The procedure in Listing 8.43 inserts a row in the pi_target_leaf table.
CREATE PROCEDURE dbo.rep_i_pitl
  @node_id t_node_id,
  @leaf_id t_leaf_id,
  @leaf_text t_leaf_text
AS
  INSERT pi_target_leaf (node_id, leaf_id, leaf_text)
  VALUES (@node_id, @leaf_id, @leaf_text)
GO
The procedure in Listing 8.44 inserts a row in the text_leaf table.
CREATE PROCEDURE dbo.rep_i_tl
  @node_id t_node_id,
  @ignorable BIT,
  @leaf_id t_leaf_id,
  @leaf_text t_leaf_text
AS
  INSERT text_leaf (node_id, leaf_id, leaf_text, ignorable)
  VALUES (@node_id, @leaf_id, @leaf_text, @ignorable)
GO
The procedure in Listing 8.45 returns the last value of leaf_id from the cdata_leaf table for the specified node.
CREATE PROCEDURE dbo.rep_s_cdlid
  @node_id t_node_id
AS
  SELECT ISNULL(MAX(leaf_id), 0) AS value
  FROM cdata_leaf
  WHERE node_id = @node_id
GO
The procedure in Listing 8.46 returns the last value of leaf_id from the text_leaf table for the specified node.
CREATE PROCEDURE dbo.rep_s_tlid
  @node_id t_node_id
AS
  SELECT ISNULL(MAX(leaf_id), 0) AS value
  FROM text_leaf
  WHERE node_id = @node_id
GO
The procedure in Listing 8.47 returns the value of node_id for the last node in the specified document that has not yet had a y coordinate set.
CREATE PROCEDURE dbo.rep_s_n_last
  @doc_id t_doc_id
AS
  SELECT node_id AS value
  FROM node
  WHERE doc_id = @doc_id
    AND x = (SELECT MAX(x)
             FROM node
             WHERE doc_id = @doc_id
               AND y IS NULL)
GO
The procedure in Listing 8.48 sets the value of y for the specified node.
CREATE PROCEDURE dbo.rep_u_n_y
  @node_id t_node_id,
  @y_index t_xy_index
AS
  UPDATE node
  SET y = @y_index
  WHERE node_id = @node_id
GO
The procedure in Listing 8.49 nulls the value of y for the specified node (we will discuss the reason for this requirement later).
CREATE PROCEDURE dbo.rep_u_n_y_null
  @node_id t_node_id
AS
  UPDATE node
  SET y = NULL
  WHERE node_id = @node_id
GO
Finally, we need to set permissions for the stored procedures, as shown in Listing 8.50.
GRANT EXECUTE ON dbo.rep_i_an TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_avl TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_cdl TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_cl TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_d TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_en TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_er TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_n TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_pidl TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_pitl TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_i_tl TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_s_cdlid TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_s_n_last TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_s_tlid TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_u_n_y TO xmlrep_user;
GRANT EXECUTE ON dbo.rep_u_n_y_null TO xmlrep_user;
GO
Now we need a class (uploadXML) that will:
1. Instantiate scanXML and parse the file with it to ensure well-formedness.
2. Start a database transaction.
3. Register itself as the content handler for the SAX parser and call its parse() method.
4. Commit the transaction (if the process completed satisfactorily) or roll back the transaction (if something went wrong), to ensure that the document-loading process succeeds or fails as a single unit.
The class will need all the same content-handling methods that scanXML provided, but this time we will do something with the content; we will call our stored procedures to insert the content into the database.
The class will also need to determine the Nested Sets coordinates of each node. The x coordinates are easy: we keep a running counter, give the document node x = 1, and increment the counter every time we encounter a new node. When we encounter a leaf node, its y coordinate is simply x + 1, so the counter advances past y and the next node receives y + 1. The y coordinates of non-leaf nodes are a little trickier. When we create the document node or encounter an element node, we do not yet know what the node's y coordinate will be, so we have to leave it as a null value until it can be determined. Fortunately, because elements must nest correctly (i.e., <a><b></b></a> is well formed, but <a><b></a></b> is not), every time we encounter a closing tag (causing a SAX endElement event), we can be sure that it corresponds to the most recent opening tag that has not yet been closed. We therefore just need to query the node table for the last node with a null value of y and set its y to the next value of the counter, and so on, until all nodes have been closed (the document node will be the last). The code for the uploadXML class is shown in Listing 8.51.
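Before turning to the full class, the coordinate bookkeeping can be tried in memory. The following is a minimal sketch under stated assumptions, not the repository's implementation: it handles only elements and text, uses the JDK's bundled JAXP parser, and replaces the "last node with a null y" database query with a stack of unclosed nodes. NestedSetsSketch, Node, and coordinates() are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class (an assumption for illustration only)
class NestedSetsSketch extends DefaultHandler {
    static final class Node {
        final String label;
        final int x;
        int y = -1;   // stands in for the NULL y of an unclosed node
        Node(String label, int x) { this.label = label; this.x = x; }
        @Override public String toString() { return label + "(" + x + "," + y + ")"; }
    }

    private final List<Node> nodes = new ArrayList<>();
    private final Deque<Node> unclosed = new ArrayDeque<>();   // replaces rep_s_n_last
    private int x = 0;
    private boolean lastWasText = false;

    // Creates a node at the next value of the counter
    private Node newNode(String label) {
        Node n = new Node(label, ++x);
        nodes.add(n);
        return n;
    }

    @Override public void startDocument() { unclosed.push(newNode("#document")); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        unclosed.push(newNode(qName));   // y is not yet known
        lastWasText = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (lastWasText) { return; }     // continuation of the same text node
        newNode("#text").y = ++x;        // leaf: y = x + 1
        lastWasText = true;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        unclosed.pop().y = ++x;          // closes the most recently opened element
        lastWasText = false;
    }

    @Override public void endDocument() { unclosed.pop().y = x + 1; }

    // Parses the XML string and returns one "label(x,y)" entry per node
    static List<String> coordinates(String xml) {
        NestedSetsSketch handler = new NestedSetsSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        List<String> out = new ArrayList<>();
        for (Node n : handler.nodes) { out.add(n.toString()); }
        return out;
    }

    public static void main(String[] args) {
        // prints [#document(1,8), a(2,7), b(3,4), #text(5,6)]
        System.out.println(coordinates("<a><b/>t</a>"));
    }
}
```

Every descendant's (x, y) pair lies strictly inside its ancestor's pair, which is the property the Nested Sets model relies on; the document node's y is always twice the number of nodes.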
// Import core Java classes
import java.io.*;
import java.lang.*;
import java.util.Date;

// Import SAX classes
import org.xml.sax.Attributes;
import org.xml.sax.DTDHandler;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

// Our scanXML, xmlrepDB, and xmlrepSAX classes are in the same (default)
// package, so they need no import; current compilers reject "import scanXML;"

// The uploadXML class
public class uploadXML extends DefaultHandler
    implements LexicalHandler, DTDHandler {

  // Parameter
  boolean keepIgnorableWS = true;

  // Variables
  boolean isCData = false;
  boolean isEntity = false;
  boolean verboseMode = false;
  int docID;
  int lastNodeID = -1;
  int lastNodeType = -1;
  int nodeID;
  int x = 0;
  String uri;
  xmlrepDB xmlrep = new xmlrepDB();
  xmlrepSAX saxParser = new xmlrepSAX();
The main() method instantiates the class, receives and checks the command-line arguments, and invokes the instance's uploadFile() method. This is shown in Listing 8.52.
  public static void main(String args[]) {

    // Create an instance of this class
    uploadXML handler = new uploadXML();

    // Check we received a URI
    if (args.length == 0) {
      System.out.println("Usage: java uploadXML uri (verbose)");
      System.out.println(" where");
      System.out.println(" uri = URI of your XML document,");
      System.out.println(" verbose = 't' to switch on verbose messaging.");
      System.exit(1);
    }

    // Verbose messaging?
    if (args.length >= 2) {
      if (args[1].equalsIgnoreCase("t")) {handler.verboseMode = true;}
    }

    // Upload the XML file into the repository
    handler.uploadFile(handler, args[0]);
  }
The uploadFile() method invokes a well-formedness check (scanXML) and, assuming all is well, connects to the repository, invokes the SAX parser (registering this object as the content handler), and, on successful completion, outputs a success message that includes the document ID that was assigned. The code for uploadFile() is shown in Listing 8.53.
  public void uploadFile(DefaultHandler handler, String uri) {

    // Variables
    String successStr;
    this.uri = uri;

    // Parse the XML file to ensure it is well-formed
    scanXML sx = new scanXML();
    sx.check(sx, uri);

    // Start the upload process...
    try {
      // Connect to the repository
      xmlrep.connect();
      // Parse the file
      saxParser.parse(handler, saxParser, uri);
      // Close the connection
      xmlrep.disconnect();
      // Output the document ID
      successStr = "Document uploaded into the repository "
        + "with doc ID = " + docID + " (" + ((x + 1)/2) + " nodes).";
      System.out.println(successStr);
    }
    catch (java.lang.Exception ex) {xmlrep.javaEx(ex);}
  }
Next are the ContentHandler methods. The SAX parser invokes the startDocument() method (see Listing 8.54) when the document is opened; no arguments are passed (we know the URI of the document in any case), but it gives us the opportunity to insert a row into the doc table (using the rep_i_d procedure), grab the doc_id that is returned, and insert the document node into the node table (using the rep_i_n procedure).
  // * Start document
  public void startDocument() {
    if (verboseMode) {System.out.println("* startDocument");}

    // Insert into doc table
    docID = xmlrep.intExecSQL("rep_i_d '" + uri + "', 'xmlrep_user';");

    // Increment counter and set node type = DOCUMENT [9]
    x++;
    int nodeType = 9;

    // Create a document node in the node table
    nodeID = xmlrep.intExecSQL("rep_i_n " + x + ", " + nodeType + ", "
      + docID + ";");

    // Remember node type & ID
    lastNodeType = nodeType;
    lastNodeID = nodeID;
  }
The parser invokes the startElement() method shown in Listing 8.55 when an opening tag (for example, <myNS:message>) is encountered. We ignore the namespaceURI and localName arguments; instead, we insert a node of type ELEMENT (using the rep_i_n procedure) and then slice the namespace prefix (if any) from the qualified element name, inserting the results into the element_name table (using rep_i_en). If there are any attributes, we loop over them, inserting names and namespace prefixes into attribute_name (using rep_i_an) and values (split over leaves, if necessary) into attribute_value_leaf (using rep_i_avl). If element names, attribute names, or namespace prefixes exceed the maximum lengths, we report an error and abort the process. In common with most of the other node-creation methods, we also need to keep a record of the type and ID of the last node encountered; these are stored in lastNodeType and lastNodeID, respectively.
  // * Start element
  public void startElement(String namespaceURI, String localName,
      String qName, Attributes attrs) {
    if (verboseMode) {System.out.println("* startElement [" + x
      + "]: qName = '" + qName + "'");}

    // Variables
    int lastColon;
    String name = qName;
    String namespace;

    // Increment counter and set node type = ELEMENT [1]
    x++;
    int nodeType = 1;

    // Insert to node table
    nodeID = xmlrep.intExecSQL("rep_i_n " + x + ", " + nodeType + ", "
      + docID + ";");

    // Check that a valid nodeID was returned, if not report error and exit
    if (nodeID < 1) {saxParser.genError(
      "[Error] Node was not successfully inserted: " + qName);}

    // Does the element's qualified name include a namespace?
    lastColon = name.lastIndexOf(":");
    if (lastColon > 0) {
      // Parse the qName to retrieve the namespace prefix and
      // the local element name
      namespace = "'" + name.substring(0, lastColon) + "'";
      name = name.substring(lastColon + 1, name.length());
      if (namespace.length() > xmlrep.nsPrefixLength) {
        saxParser.genError("[Error] : " + namespace
          + " exceeds the maximum length.");
      }
    }
    else {
      namespace = "null";
    }

    // Check element name does not exceed maximum length
    if (name.length() > xmlrep.elementNameLength) {
      saxParser.genError("[Error] : " + name + " exceeds the maximum length.");
    }

    // Insert to element_name table
    xmlrep.insertValue(false, "rep_i_en " + nodeID + ", " + namespace, name);

    // Loop over the element's attributes
    if (attrs != null) {
      int len = attrs.getLength();
      for (int i = 1; i <= len; i++) {
        // Attribute name (including namespace)
        name = attrs.getQName(i - 1);
        // Index of last colon
        lastColon = name.lastIndexOf(":");
        // Does the attribute name include a namespace?
        if (lastColon > 0) {
          namespace = "'" + name.substring(0, lastColon) + "'";
          name = name.substring(lastColon + 1, name.length());
          if (namespace.length() > xmlrep.nsPrefixLength) {
            saxParser.genError("[Error] : '" + namespace
              + "' exceeds the maximum length.");
          }
        }
        else {
          namespace = "null";
        }
        // Insert to attribute_name table
        xmlrep.insertValue(false, "rep_i_an " + nodeID + ", " + i + ", "
          + namespace, name);
        // Insert to attribute_value_leaf table
        xmlrep.insertValue(true, "rep_i_avl " + nodeID + ", " + i,
          attrs.getValue(i - 1));
      }
    }

    // Remember node type & ID
    lastNodeType = nodeType;
    lastNodeID = nodeID;
  }
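The prefix-slicing step used for both element and attribute names can be isolated into a small helper. This is only an illustrative sketch: QNameSketch and split() are hypothetical names, and a Java null stands in for the SQL "null" string that the listing passes to the stored procedure.

```java
// Hypothetical helper (an assumption for illustration only)
class QNameSketch {

    // Splits a qualified name into {prefix, localName}; prefix is null if absent
    static String[] split(String qName) {
        int lastColon = qName.lastIndexOf(':');
        if (lastColon > 0) {
            return new String[] {
                qName.substring(0, lastColon),     // namespace prefix
                qName.substring(lastColon + 1)     // local name
            };
        }
        return new String[] { null, qName };       // no prefix present
    }

    public static void main(String[] args) {
        String[] parts = split("myNS:message");
        System.out.println("prefix=" + parts[0] + ", local=" + parts[1]);
        parts = split("message");
        System.out.println("prefix=" + parts[0] + ", local=" + parts[1]);
    }
}
```

Working from the unquoted prefix also makes length checks straightforward, since the quotation marks added for the SQL string are not counted.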
The processingInstruction() method shown in Listing 8.56 is called when a processing instruction (for example, <?target-application data-string?>) is encountered; string values for the target and data are supplied. Since we have allowed these to be of any length, we call the insertValue() method of the xmlrep object, which splits long strings into leaves and inserts them into the appropriate tables.
  // * Processing instruction
  public void processingInstruction(String target, String data) {
    if (verboseMode) {System.out.println("* processingInstruction [" + x
      + "] = '" + target + " " + data + "'");}

    // Increment counter and set node type = PROCESSING_INSTRUCTION_NODE [7]
    x++;
    int nodeType = 7;

    // Insert to node table
    nodeID = xmlrep.intExecSQL("rep_i_n " + x + ", " + nodeType + ", "
      + docID + ";");

    // Insert to pi_target_leaf table
    xmlrep.insertValue(true, "rep_i_pitl " + nodeID, target);

    // Insert to pi_data_leaf table
    xmlrep.insertValue(true, "rep_i_pidl " + nodeID, data);

    // Update node table: set y value for this node
    // (since PI nodes do not have children)
    xmlrep.voidExecSQL("rep_u_n_y " + nodeID + ", " + (x + 1) + ";");
    x++;

    // Remember node type & ID
    lastNodeType = nodeType;
    lastNodeID = nodeID;
  }
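The insertValue() method belongs to the xmlrepDB class, which is not shown here, so the following is only a guess at the leaf-splitting step it performs. LeafSketch, splitIntoLeaves(), and the leaf length of 3 used in the example are all assumptions; the real maximum leaf length is determined by the t_leaf_text type in the database.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (an assumption for illustration only)
class LeafSketch {

    // Cuts a value into consecutive chunks of at most leafLength characters
    static List<String> splitIntoLeaves(String value, int leafLength) {
        if (leafLength < 1) {
            throw new IllegalArgumentException("leafLength must be positive");
        }
        List<String> leaves = new ArrayList<>();
        for (int i = 0; i < value.length(); i += leafLength) {
            int end = Math.min(value.length(), i + leafLength);
            leaves.add(value.substring(i, end));
        }
        return leaves;
    }

    public static void main(String[] args) {
        // The leaf at list index k would be stored with leaf_id = offset + k + 1,
        // where offset is the last leaf_id already present for the node
        System.out.println(splitIntoLeaves("abcdefg", 3));   // prints [abc, def, g]
    }
}
```

Concatenating the leaves in leaf_id order reproduces the original value, which is what allows arbitrarily long strings to be stored in fixed-width columns.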
The characters() method shown in Listing 8.57 is called in a number of situations, namely, when text, CDATA, or entity references are encountered. Unlike many applications that use a SAX parser, our application needs to differentiate between these types of content. Furthermore, in the case of entities, the SAX parser expands the entity reference for us. For example, suppose we have defined a &copyright; entity (to save us having to repeat a lengthy string every time we want to include standard copyright details in our documents). Every time the SAX parser encounters the entity reference &copyright; in our document, the characters() method will be called with the full copyright string. We need to override these behaviors; fortunately, help is at hand, thanks to SAX's LexicalHandler methods, which we will discuss shortly. For now, note that nothing happens if this method is called while the value of the Boolean isEntity is true. Otherwise, the character array is converted to a string and passed to the handleText() method.
  // * Characters
  public void characters(char ch[], int start, int length) {
    // If the isEntity boolean is true, ignore this call (it is
    // a parsed entity)
    if (isEntity) {return;}
    // Treat as normal and flag as NOT ignorable
    handleText(new String(ch, start, length), 0);
  }
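The effect of the isEntity flag can be demonstrated without the database. The sketch below makes several assumptions: it uses the JDK's bundled JAXP parser, declares the entity in an internal DTD subset, and skips the special "[dtd]" and parameter-entity names that a parser may also report; EntitySketch and its fields are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class (an assumption for illustration only)
class EntitySketch extends DefaultHandler implements LexicalHandler {
    final StringBuilder text = new StringBuilder();   // text outside entities
    final List<String> refs = new ArrayList<>();      // entity reference names
    private boolean isEntity = false;

    @Override
    public void characters(char[] ch, int start, int length) {
        if (isEntity) { return; }   // skip the expanded replacement text
        text.append(ch, start, length);
    }

    // True for "[dtd]" and parameter entities, which are not content references
    private static boolean isSpecial(String name) {
        return name.startsWith("[") || name.startsWith("%");
    }

    @Override public void startEntity(String name) {
        if (isSpecial(name)) { return; }
        isEntity = true;
        refs.add(name);             // keep the reference itself unexpanded
    }
    @Override public void endEntity(String name) {
        if (!isSpecial(name)) { isEntity = false; }
    }
    @Override public void comment(char[] ch, int start, int length) { }
    @Override public void startCDATA() { }
    @Override public void endCDATA() { }
    @Override public void startDTD(String name, String publicId, String systemId) { }
    @Override public void endDTD() { }

    // Parses the XML string with this class as content and lexical handler
    static EntitySketch parse(String xml) {
        EntitySketch handler = new EntitySketch();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE d [<!ENTITY copyright \"(c) Example Ltd\">]>"
            + "<d>See &copyright;.</d>";
        EntitySketch result = parse(xml);
        System.out.println("text=" + result.text + " refs=" + result.refs);
    }
}
```

The replacement text never reaches the collected text, while the entity name is recorded, which mirrors how the upload class stores a reference node instead of the expanded string.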
The ignorableWhitespace() method shown in Listing 8.58 is called in place of characters() when the character data consists only of whitespace (spaces, tabs, and/or line feeds) and the SAX parser is able to determine that it is truly ignorable. This latter condition can be satisfied only if the supplied XML file has a reference to a schema (such as a DTD), which will dictate where text content may appear in the element hierarchy. If whitespace appears anywhere else (for example, if the file author used indentation and line feeds to make the XML document more human readable), then the SAX parser will call ignorableWhitespace() rather than characters(). Keeping ignorable whitespace is useful for readability and testing; however, if we prefer not to, we can change the value of keepIgnorableWS (in the variable definitions at the start of the class) to false.
  // * Ignorable whitespace
  public void ignorableWhitespace(char ch[], int start, int length) {
    if (verboseMode) {System.out.println(
      "* ignorableWhitespace [" + x + "] = (" + start + ", " + length + ")");}
    // If user has specified that ignorable whitespace should be
    // kept, call the handling method (but flag as 'ignorable' for
    // info). Otherwise, do nothing.
    if (keepIgnorableWS) {handleText(new String(ch, start, length), 1);}
  }
The handleText() method in Listing 8.59 was created to avoid writing the same code twice in the characters() and ignorableWhitespace() methods (both of which invoke handleText(), although the latter does so only if we have specified that we want to keep ignorable whitespace). This method also handles a side effect of carriage return/newline characters on some operating systems; for XML files without schema definitions on certain combinations of operating system and parser, SAX will invoke the characters() method each time a carriage return/newline is encountered. This has the side effect of splitting up text and CDATA sections into multiple nodes. To counter this situation, we make the assumption that, if "this" node is of the same type (text or CDATA) as the "last" node, both are different leaves of the same node. We need to run rep_s_tlid (text) or rep_s_cdlid (CDATA) in order to determine the final leaf_id for the last node, before we execute rep_i_tl (text) or rep_i_cdl (CDATA) to add the new leaves.
  // * Handle character data
  public void handleText(String characterData, int isIgnorable) {
    if (verboseMode) {System.out.println(
      "* handleText [" + x + "] = '" + characterData + "'");}

    // A non-validating parser may split up CDATA and TEXT sections...
    boolean isContinuation = false;
    int nodeType;
    int offset = 0;
    String abbrvTblNm;
    String extraArg = "";

    // Is the data CDATA or TEXT?
    if (isCData) {
      // Node type = CDATA_SECTION_NODE [4]
      if (lastNodeType == 4) {isContinuation = true;}
      nodeType = 4;
      abbrvTblNm = "cd";
    }
    else {
      // Node type = TEXT_NODE [3]
      if (lastNodeType == 3) {isContinuation = true;}
      nodeType = 3;
      abbrvTblNm = "t";
      extraArg = ", " + isIgnorable;
    }

    if (!isContinuation) {
      // Increment counter
      x++;
      // Insert to node table
      nodeID = xmlrep.intExecSQL("rep_i_n " + x + ", " + nodeType + ", "
        + docID + ";");
    }
    else {
      // Treat as a continuation: decrement counter
      x--;
      // Get the last leaf_id
      offset = xmlrep.intExecSQL("rep_s_" + abbrvTblNm + "lid "
        + lastNodeID + ";");
      // Null the y value of this node (since we are continuing it)
      xmlrep.voidExecSQL("rep_u_n_y_null " + nodeID + ";");
    }

    // Insert to appropriate table
    xmlrep.insertValue(true, "rep_i_" + abbrvTblNm + "l " + nodeID
      + extraArg, characterData, offset);

    // Update node table: set y value for this node
    xmlrep.voidExecSQL("rep_u_n_y " + nodeID + ", " + (x + 1) + ";");
    x++;

    // Remember node type & ID
    lastNodeType = nodeType;
    lastNodeID = nodeID;
  }
The endElement() method (see Listing 8.60) is invoked by the SAX parser when a closing tag is encountered (e.g., </myNS:message>). The element that this tag is closing will be the last node with a null value of y.
  // * End element
  public void endElement(String namespaceURI, String localName,
      String qName) {
    if (verboseMode) {System.out.println(
      "* endElement [" + x + "] = '" + qName + "'");}

    // Increment counter
    x++;

    // Find the node that this "end element" corresponds to
    nodeID = xmlrep.intExecSQL("rep_s_n_last " + docID + ";");

    // Check we found an element with null value for y
    if (nodeID < 1) {
      saxParser.genError("No element has null y.");
    }

    // Update node table: set y value for this node
    xmlrep.voidExecSQL("rep_u_n_y " + nodeID + ", " + x + ";");

    // Remember node type & ID
    lastNodeType = 1; // = ELEMENT [1]
    lastNodeID = nodeID;
  }
The startDTD() and endDTD() methods shown in Listing 8.61 are invoked by the SAX parser at the beginning and end of a document type definition, respectively (and regardless of whether the DTD is stored inside the file or is external to it). In between, various methods will be invoked as the parser encounters notation declarations and entity definitions (and we can capture these if we register our class as the DTDHandler for the SAX parser). Aside from printing a message (if we are in "verbose" mode), we have chosen not to do anything with DTD events.
  // * Start DTD
  public void startDTD(String name, String publicId, String systemId) {
    if (verboseMode) {System.out.println("* startDTD (name = '" + name
      + "', publicId = '" + publicId + "', systemId = '" + systemId + "')");}
    // Do nothing
  }

  // * End DTD
  public void endDTD() {
    if (verboseMode) {System.out.println("* endDTD");}
    // Do nothing
  }
The endDocument() method shown in Listing 8.62 is invoked by the SAX parser when the end of the file is reached; all we need to do when this stage is reached is to set the y value for the document node.
  // * End document
  public void endDocument() {
    if (verboseMode) {System.out.println("* endDocument");}

    // Find the remaining node with null y (the document node)
    nodeID = xmlrep.intExecSQL("rep_s_n_last " + docID + ";");

    // Check we found a node with null value for y
    if (nodeID < 1) {
      saxParser.genError("Root node does not have null y.");
    }

    // Update node table: set y value for this node
    xmlrep.voidExecSQL("rep_u_n_y " + nodeID + ", " + (x + 1) + ";");
  }
The SAX parser invokes the methods shown in Listing 8.63 when lexical events are encountered. As with text and CDATA sections, the comment() method allows comments to be split across leaves. The startCDATA(), endCDATA(), startEntity(), and endEntity() methods toggle the values of the isCData and isEntity variables, so the character-handling methods (which will be called immediately afterwards by the SAX parser) know which type of character data they are dealing with.
// -- LexicalHandler methods
public void comment(char[] ch, int start, int length) {
    if (verboseMode) {System.out.println("* comment [" + x + "]");}
    // Increment counter and set node type = COMMENT_NODE [8]
    x++;
    int nodeType = 8;
    // Insert to node table
    nodeID = xmlrep.intExecSQL("rep_i_n " + x + ", " + nodeType + ", " + docID + ";");
    // Insert to comment_leaf table
    xmlrep.insertValue(true, "rep_i_cl " + nodeID, new String(ch, start, length));
    // Update node table: set y value for this node
    xmlrep.voidExecSQL("rep_u_n_y " + nodeID + ", " + (x + 1) + ";");
    x++;
    // Remember node type & ID
    lastNodeType = nodeType;
    lastNodeID = nodeID;
}

public void startCDATA() {
    if (verboseMode) {System.out.println("* startCDATA [" + x + "]");}
    // Next call to characters() will be treated as CDATA
    isCData = true;
}

public void endCDATA() {
    if (verboseMode) {System.out.println("* endCDATA [" + x + "]");}
    // Next call to characters() will be treated as text
    isCData = false;
}

public void startEntity(String name) {
    if (verboseMode) {System.out.println("* startEntity [" + x + "]");}
    // Set the isEntity boolean to true, so characters() calls arising
    // due to entity parsing get ignored
    isEntity = true;
    // Increment counter and set node type = ENTITY_REFERENCE_NODE [5]
    x++;
    int nodeType = 5;
    // Insert to node table
    nodeID = xmlrep.intExecSQL("rep_i_n " + x + ", " + nodeType + ", " + docID + ";");
    // Check entity reference name does not exceed maximum length
    if (name.length() > xmlrep.entityRefLength) {
        saxParser.genError("[Error] : " + name + " exceeds the maximum length.");
    }
    // Insert to entity_reference table
    xmlrep.insertValue(false, "rep_i_er " + nodeID, name);
    // Update node table: set y value for this node
    xmlrep.voidExecSQL("rep_u_n_y " + nodeID + ", " + (x + 1) + ";");
    x++;
    // Remember node type & ID
    lastNodeType = nodeType;
    lastNodeID = nodeID;
}

public void endEntity(String name) {
    if (verboseMode) {System.out.println("* endEntity [" + x + "]");}
    // Set the isEntity boolean to false
    isEntity = false;
}
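The flag-toggling pattern is easier to see in isolation. The following sketch registers a lexical handler through the standard SAX2 property "http://xml.org/sax/properties/lexical-handler" and records what the parser reports, using the parser bundled with the JDK rather than a separately installed Xerces. The class name LexicalSketch and its events list are hypothetical and exist only for illustration; nothing is written to a database.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler that records lexical events instead of storing them
class LexicalSketch extends DefaultHandler2 {
    final List<String> events = new ArrayList<>();
    private boolean isCData = false;
    private boolean isEntity = false;

    @Override public void comment(char[] ch, int start, int length) {
        events.add("comment:" + new String(ch, start, length));
    }

    // Characters reported between these two calls belong to a CDATA section
    @Override public void startCDATA() { isCData = true; }
    @Override public void endCDATA() { isCData = false; }

    // Record the reference by name and ignore its expanded replacement text
    @Override public void startEntity(String name) {
        isEntity = true;
        events.add("entity:" + name);
    }
    @Override public void endEntity(String name) { isEntity = false; }

    @Override public void characters(char[] ch, int start, int length) {
        if (isEntity) { return; }
        events.add((isCData ? "cdata:" : "text:") + new String(ch, start, length));
    }

    // Parse a string with this class acting as ContentHandler and LexicalHandler
    static List<String> trace(String xml) {
        try {
            LexicalSketch handler = new LexicalSketch();
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setProperty(
                "http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.events;
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE r [<!ENTITY e \"expanded\">]>"
                   + "<r><!--c--><![CDATA[x]]>y&e;</r>";
        System.out.println(trace(xml));
    }
}
```

Because characters() returns early while isEntity is set, the replacement text of the entity never reaches the event list; only the reference name is kept, which mirrors the goal of leaving entity references unparsed.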
The SAX parser invokes the methods shown in Listing 8.64 when DTD events are encountered. Currently, we do not do anything in these situations, but we could add tables and procedures to deal with the information.
// -- DTDHandler methods
public void notationDecl(String name, String publicId, String systemId) {
    if (verboseMode) {System.out.println("* notationDecl");}
    // Do nothing
}

public void unparsedEntityDecl(String name, String publicId, String systemId,
                               String notationName) {
    if (verboseMode) {System.out.println("* unparsedEntityDecl");}
    // Do nothing
}
}
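If we did want to store this information, the first step would be to see what the parser hands us. The sketch below registers a DTDHandler with the JDK's bundled SAX parser and records the names it receives; the class DtdSketch and the sample declarations are hypothetical, and system identifiers are left out of the recorded events because a parser may resolve them against a base URI.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.DTDHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical handler that records DTD events instead of ignoring them
class DtdSketch implements DTDHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void notationDecl(String name, String publicId, String systemId) {
        events.add("notation:" + name);
    }

    @Override
    public void unparsedEntityDecl(String name, String publicId,
                                   String systemId, String notationName) {
        events.add("unparsed:" + name + "->" + notationName);
    }

    // Parse a string with this class registered as the DTDHandler
    static List<String> trace(String xml) {
        try {
            DtdSketch handler = new DtdSketch();
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setDTDHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.events;
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE r ["
                   + "<!NOTATION gif SYSTEM \"image/gif\">"
                   + "<!ENTITY pic SYSTEM \"pic.gif\" NDATA gif>"
                   + "]><r/>";
        System.out.println(trace(xml));
    }
}
```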
Compile this class ("javac uploadXML.java"), and we're ready to test it with a sample XML file. On successful loading, the Java program returns a message indicating the document ID that was assigned (along with a count of the nodes that were created):
C:\xmlrep>java uploadXML wml-example.xml
Document uploaded into the repository with doc ID = 1 (25 nodes).
Running rep_serialise_nodes for a nontrivial document produces lots of output that is not particularly readable, so let's create a class (extractXML.java) that will call the procedure and knit the output back into recognizable XML. There is only one method, main(), which takes command-line parameters for the document ID (plus additional optional parameters) and executes rep_serialise_nodes. The code for extractXML is shown in Listing 8.65.
// Import core Java classes
import java.sql.ResultSet;
import java.sql.SQLException;

// xmlrepDB is in the same (default) package, so it needs no import

// The extractXML class
class extractXML {
    public static void main(String[] args) {
        // Variables
        int docID;
        int startX = 1;
        int incMetadata = 0;
        xmlrepDB xmlrep = new xmlrepDB();
        StringBuilder xmlDoc = new StringBuilder();
        String xmlBit;
        String sqlCmd;

        // Check we received a docID
        if (args.length == 0) {
            System.out.println(
                "Usage: java extractXML docID (startX) (incMetadata)");
            System.out.println(
                " docID = ID of the document in the repository");
            System.out.println(
                " startX = integer x index of the node with which to start");
            System.out.println(
                " incMetadata = 't' for attributes showing x & y coordinates");
            System.exit(1);
        }

        // Document ID
        docID = Integer.parseInt(args[0]);
        // Starting x_index specified?
        if (args.length >= 2) {startX = Integer.parseInt(args[1]);}
        // Include metadata?
        if (args.length >= 3) {
            if (args[2].equalsIgnoreCase("t")) {incMetadata = 1;}
        }

        // Start the output
        System.out.println("<?xml version=\"1.0\"?>");

        // Handle errors
        try {
            // Connect to the repository
            xmlrep.connect();
            // Build the SQL command
            sqlCmd = "rep_serialise_nodes " + docID + ", " + startX + ", "
                + incMetadata + ";";
            try {
                // Execute the SQL
                ResultSet rs = xmlrep.stmt.executeQuery(sqlCmd);
                // Loop through the records
                while (rs.next()) {
                    // Read the fields & handle NULLs / empty strings
                    xmlBit = rs.getString("parsed_text");
                    if (!rs.wasNull()) {xmlDoc.append(xmlBit);}
                }
            }
            catch (SQLException ex) {xmlrep.sqlEx(ex, sqlCmd);}
            // Output to screen
            System.out.println(xmlDoc);
            // Close the connection
            xmlrep.disconnect();
        }
        catch (Exception ex) {
            // Print exception information as an XML comment
            // (the stack trace goes to standard output so that it
            // actually lands between the comment delimiters)
            System.out.println("<!--");
            ex.printStackTrace(System.out);
            System.out.println("-->");
        }
    }
}
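The command-line handling in main() can also be pulled out into a small helper, which makes the defaults explicit and turns a non-numeric docID or startX into a readable error rather than an uncaught NumberFormatException. This is only a sketch under that assumption: the class ExtractArgs and its methods are hypothetical and are not part of the repository code.

```java
// Hypothetical helper that captures extractXML's command-line parameters
class ExtractArgs {
    final int docID;
    final int startX;
    final int incMetadata;

    private ExtractArgs(int docID, int startX, int incMetadata) {
        this.docID = docID;
        this.startX = startX;
        this.incMetadata = incMetadata;
    }

    // docID is required; startX defaults to 1; incMetadata is 1 only for "t"
    static ExtractArgs parse(String[] args) {
        if (args.length == 0) {
            throw new IllegalArgumentException("docID is required");
        }
        try {
            int docID = Integer.parseInt(args[0]);
            int startX = (args.length >= 2) ? Integer.parseInt(args[1]) : 1;
            int incMetadata =
                (args.length >= 3 && args[2].equalsIgnoreCase("t")) ? 1 : 0;
            return new ExtractArgs(docID, startX, incMetadata);
        } catch (NumberFormatException ex) {
            throw new IllegalArgumentException(
                "docID and startX must be integers", ex);
        }
    }

    // Build the same procedure call that extractXML sends to the repository
    String toSql() {
        return "rep_serialise_nodes " + docID + ", " + startX + ", "
            + incMetadata + ";";
    }

    public static void main(String[] args) {
        System.out.println(parse(new String[] {"1", "37", "t"}).toSql());
    }
}
```

Since every value reaching the SQL string has already been parsed as an integer, concatenating the command in this way does not pass arbitrary user text to the database.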
Compile this class ("javac extractXML.java"), and we can test it on the sample file we uploaded to the repository:
C:\xmlrep>java extractXML 1
If everything is working properly, the output should look exactly like the file that went in (you can pipe the output to a file, if you want to check). We can also run the process again for a fragment of the document, as shown in Listing 8.66.
C:\xmlrep>java extractXML 1 37
<?xml version="1.0"?>
<card id="cSecond" title="Second card">
  <p align="center">
    Content of the second card.
  </p>
</card>
In this case, we asked for the fragment of document 1 starting with the node with x = 37 (the second of the two card elements). If we were to repeat the process with the optional incMetadata parameter set to "t" (i.e., "java extractXML 1 37 t"), we would get a similar result but with additional attributes showing the values of x, y, and node_id for each element. Listing 8.67 shows the XML.
C:\xmlrep>java extractXML 1 37 t
<?xml version="1.0"?>
<card xmlns:repository="http://www.rgedwards.com/"
      repository:x="37" repository:y="46" repository:nodeID="52"
      id="cSecond" title="Second card">
  <p repository:x="40" repository:y="43" repository:nodeID="54"
     align="center">
    Content of the second card.
  </p>
</card>
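The x and y values in this output show why a fragment can be selected from a single starting node: the p element's interval (40, 43) lies strictly inside the card element's interval (37, 46). The sketch below expresses that containment test. The class NodeSpan is hypothetical, and we are assuming here that descendants always nest strictly inside their ancestors, which is what the numbering scheme suggests; the stored procedure itself may select nodes differently.

```java
// Hypothetical value class holding one node's x and y coordinates
class NodeSpan {
    final int x;
    final int y;

    NodeSpan(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // True when the other node's interval lies strictly inside this one
    boolean contains(NodeSpan other) {
        return x < other.x && other.y < y;
    }

    public static void main(String[] args) {
        NodeSpan card = new NodeSpan(37, 46); // values from the listing above
        NodeSpan p = new NodeSpan(40, 43);
        System.out.println("card contains p: " + card.contains(p));
        System.out.println("p contains card: " + p.contains(card));
    }
}
```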