12.4 Design Choices


While designing OAI, we made a number of decisions that impacted the overall architecture of the system. To comply with the requirement of having to support multiple heterogeneous clients, we chose to use XML as the data transport for exchanging data between the client and the OAI system. This design choice introduced a number of issues on how to deal with the data requests to OAI that arrive in the form of XML documents and how to return data also in the form of XML documents in an efficient way. In this section, we discuss in detail each of the choices we made including the available options in each case, the advantages and disadvantages of each option, and the reason for our final selection.

12.4.1 Using XML in OAI

One of the limitations presented by the JEDMICS legacy system is its inability to select and return different types of data to the client in a single call without returning complex/heavy client objects or the proprietary internal scheme. For example, if the client needs all the information about an engineering drawing, the system requires several client calls: one call to retrieve the drawing metadata, another call to retrieve the images for the drawing, another call to retrieve the part number associations, and another call to retrieve the Weapon System Code associations. A proprietary scheme was developed internally for retrieving this information, but it is not easily read or parsed. In addition, external clients do not know how to process the proprietary scheme. OAI uses XML documents for input and output to external client interfaces, which resolves the problem. It allows for a nonproprietary method for clients to specify what data to retrieve and to return, in an easily understandable format that is platform independent. Currently, there is no XML industry standard for defining engineering documents. Therefore, OAI developed its own DTDs for defining engineering documents and their components. External clients can parse this data with any XML parser. OAI also has a requirement to support CORBA clients. Since XML documents are returned as a String element, any CORBA client can easily access the data. The following are several advantages offered by XML, which make it an excellent choice for OAI's external interfaces:

  • Portability: XML is a good match for Java. It complements Java's code portability with data portability and reinforces Java's ease of use. XML's portability comes from its plain-text format, which requires no separate formatting instructions. Since the OAI system will have multiple clients connecting to it, portability is the key design issue. Data needs to be readable and usable by all clients, regardless of their platforms and applications. Text data are both portable and easy to use, readable by both humans and text-editing software.

  • Extensibility: XML tags are extensible, allowing us to define and use our own XML tags to describe data content. The OAI system needs to be able to use the same data format for different functionality. That is why XML becomes useful: We can extend and customize the tags to fit the different needs of each function. We can also easily combine results of several queries into one XML document.

  • Control over presentation: Separation between the management of content and its presentation allows developers to reuse and/or reformat data in different ways. The internal OAI processes do not have to worry about the different presentation requirements; they only need to know how to manage the data content. When the process is done, then presentation can be customized according to the needs of the requesting client.

  • Interoperability: XML allows heterogeneous applications to access a single representation of the data. The OAI system is the core JEDMICS application interface; it processes a request once and returns the same data to heterogeneous clients and applications. Because XML describes the structure of the data (and not its format), the same data can be used by different applications, and interoperability is preserved.

12.4.2 Conversion of XML Input into Objects

While XML is an ideal choice for external client interfaces, it is not efficient for internal application processing. During the design and implementation of OAI, we evaluated many methods for processing the XML documents passed in from the client. We prototyped Simple API for XML (SAX) and Document Object Model (DOM) parsers, JDOM, generic hash-table and tree Java objects for holding the converted XML document, and JAXB (Java Architecture for XML Binding), which generates Java classes based on the XML document's DTD.

SAX processes XML data like a text stream, which is fairly fast. However, the structure is not stored in memory; hence it is hard to retrieve specific elements of a collection without processing the whole document. In addition, SAX does not allow adding, removing, or changing elements in the document (since it provides a read-only model to the XML document), it does not provide a method to output XML, and it only provides structural validation (field validation must be done programmatically). For the aforementioned reasons, SAX does not provide the processing power that OAI needs, and parsing using SAX proves to be slower than parsing XML documents into JAXB-generated Java classes.
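
The streaming model described above can be illustrated with a minimal sketch that uses the SAX parser bundled with the JDK. The element name drawingNumber and the class name are assumptions made for illustration; they are not taken from the OAI DTDs. Note that the handler must track its own state, because SAX keeps no document structure in memory.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: element names are hypothetical, not the OAI DTD.
class SaxDrawingNumbers {

    /** Streams through the document once and collects the text of every <drawingNumber>. */
    static List<String> collect(String xml) {
        final List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <drawingNumber>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("drawingNumber".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("drawingNumber".equals(qName)) {
                    found.add(text.toString().trim());
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<query><drawingNumber>A100</drawingNumber>"
                   + "<drawingNumber>B200</drawingNumber></query>";
        System.out.println(collect(xml));
    }
}
```

Even this small task requires the entire document to be read, and the handler cannot modify or re-emit the XML it sees.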

The DOM translates an XML document into an in-memory tree structure. DOM provides powerful document-processing capabilities, including support for adding, deleting, and changing elements and for outputting XML documents. However, DOM incurs additional overhead to support these capabilities, and it provides only structural validation (field validation must be done programmatically). The document is also interpreted at run time, which is slower than working with compiled code. OAI uses only a limited subset of the DOM capability but incurs the overhead of supporting all the unused capability, and DOM proves to be slower than parsing XML documents into JAXB-generated Java classes.
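
For comparison, here is a minimal sketch of the in-memory model using the DOM parser and identity transformer bundled with the JDK. The element names and the addRevision helper are hypothetical and serve only to show that DOM allows a document to be changed and written back out, which SAX does not.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Illustrative only: element names are hypothetical, not the OAI DTD.
class DomEditExample {

    /** Parses the document into a tree, appends a <revision> child to the root, and serializes it. */
    static String addRevision(String xml, String revision) {
        try {
            // The whole document is loaded into memory as a tree of nodes.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Modify the tree in place.
            Element rev = doc.createElement("revision");
            rev.setTextContent(revision);
            doc.getDocumentElement().appendChild(rev);

            // Write the modified tree back out as XML text.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(addRevision("<drawing><number>A100</number></drawing>", "C"));
    }
}
```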

JDOM is a Java API for manipulating XML documents. JDOM provides capabilities similar to the DOM API but without as much overhead. Our experience shows that JDOM is still slower than parsing XML documents into JAXB-generated Java classes.

JAXB (which we will describe in more detail later in this chapter) takes an XML DTD and generates Java classes representing the XML document structure. The generated classes include code to validate the data as well as the structure. Because the generated code is compiled, it is much faster than SAX, DOM, or JDOM. The generated classes also make it obvious what is expected in the document, whereas with SAX, DOM, or JDOM, developers must refer to the DTD or XML schema to determine what the XML document contains. JAXB-generated classes are much easier to work with for development and maintenance. Whenever the DTD changes, re-running JAXB regenerates the Java classes from the changed DTD. In addition to methods for marshaling and unmarshaling XML, the generated classes include accessors and mutators for each element of the class. To add application-specific validation, we extend the generated classes with derived classes that provide the validation code.
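
The extension approach mentioned above might look like the following sketch. DrawingQuery is a hand-written stand-in for the kind of class a binding compiler generates; it is not actual JAXB output, and the field names and the two-letter rule for the Weapon System Code are assumptions made purely for illustration.

```java
// Hand-written stand-in for a generated binding class (NOT real JAXB output).
class DrawingQuery {
    private String drawingNumber;
    private String weaponSystemCode;

    public String getDrawingNumber() { return drawingNumber; }
    public void setDrawingNumber(String value) { this.drawingNumber = value; }

    public String getWeaponSystemCode() { return weaponSystemCode; }
    public void setWeaponSystemCode(String value) { this.weaponSystemCode = value; }

    /** Structural check: the required element must be present. */
    public void validate() {
        if (drawingNumber == null) {
            throw new IllegalStateException("drawingNumber is required");
        }
    }
}

// Derived class that adds application-specific content validation.
class CheckedDrawingQuery extends DrawingQuery {
    @Override
    public void validate() {
        super.validate(); // keep the structural check
        String code = getWeaponSystemCode();
        // Hypothetical rule: when present, the code is exactly two uppercase letters.
        if (code != null && !code.matches("[A-Z]{2}")) {
            throw new IllegalStateException("weaponSystemCode has an invalid value: " + code);
        }
    }
}
```

Keeping the content rules in a subclass means the base class can be regenerated whenever the DTD changes without losing the hand-written validation.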

12.4.3 Conversion of Database Data into XML

One of the major functions of the OAI is to support retrieval of image data stored in the JEDMICS repository as well as image metadata stored in a relational database. Therefore, the ability to extract the data and output it in XML format is crucial, and the whole conversion process must be efficient. For these reasons, and because the database system used to store the image metadata is Oracle, we decided to use the Oracle XSU (XML SQL Utility). The utility automatically transforms relational data into XML (and vice versa) without any need for extra coding on the OAI server. It allows for the extraction of data from an object-relational or pure relational format into XML, as well as insertion, update, or deletion of column/attribute values within a table or a view using input extracted from an XML document. In addition, Oracle XSU enables the OAI system to generate an output DTD and XML Schema, which will become critical as the system evolves. The main advantage of XML Schema over DTD is its support for a broad set of predefined data types as well as user-defined data types. The disadvantage of XML Schema at this point is that it does not have sufficient tool support, but this will change over time as more and more systems incorporate it in their architecture.

12.4.4 Conversion of Image Data into XML

The Image Query Service receives requests that specify the attributes that uniquely identify one or more images within the OAI system and returns these images after retrieving them from the appropriate image server and enforcing access control. Because only certain characters are valid within an XML document, and because of the encoding and decoding processes at the sending and receiving sides, we cannot directly embed the image's binary data within the XML document that forms the response to the caller. To resolve this issue, we evaluated three different encoding schemes before making our selection. The first choice was to encode each byte in the image with its two-character hexadecimal representation. This scheme is easy to implement but results in an XML document that is twice the size of the original binary image. The next choice was to use base64 encoding, which represents each 3-byte sequence as four 6-bit blocks that are each encoded as a single character from a 64-character set. This scheme is fairly easy to implement, and various versions exist in the public domain. The base64 encoding scheme results in a document that is about 1.34 times the size of the original image (a ratio of 4/3, plus padding and any line breaks). The last scheme we evaluated was the use of Huffman codes. Huffman coding uses the statistical properties of the document to encode it using variable-length codes, so the size of the resulting document depends on the statistical properties of the original binary document. For the particular implementation that we evaluated, the size of the resulting document would range between 1.0 and 1.75 times the size of the original document depending on the distribution of the byte values within the image.

We finally decided to use the base64 encoding scheme because of its ease of implementation and its lack of dependence on the nature of the data encoded. Another reason we did not select the Huffman codes approach is that most of the images stored in our system use a proprietary compression scheme, which would result in encoded document sizes towards the higher end of the range specified earlier.
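
The size trade-off between the hexadecimal and base64 schemes can be checked with a short sketch. It uses java.util.Base64 from current JDKs, which is an assumption of this example and not necessarily the library used in OAI; the class and method names are likewise illustrative.

```java
import java.util.Base64;
import java.util.Random;

// Illustrative comparison of the two text encodings for binary image data.
class ImageEncodingSizes {

    /** Encodes each byte as two lowercase hexadecimal characters. */
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    /** Encodes each 3-byte group as four characters from a 64-character alphabet. */
    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    /** Recovers the original bytes on the receiving side. */
    static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] image = new byte[3000];           // stand-in for real image bytes
        new Random(42).nextBytes(image);
        System.out.println("original bytes: " + image.length);
        System.out.println("hex chars:      " + toHex(image).length());    // 2 chars per byte
        System.out.println("base64 chars:   " + toBase64(image).length()); // 4 chars per 3 bytes
    }
}
```

For a 3000-byte input the hexadecimal form is 6000 characters and the base64 form is 4000 characters, regardless of the byte values, which matches the data-independence argument above.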

12.4.5 Database Access

OAI supports several different user communities. A large portion of the user community retrieves document data and document collection data but does not update the data. A smaller subset of the users manipulates potentially large collections of documents for procurement and bid-set purposes. Another subset of the users is responsible for quality assurance of the engineering data, which requires updates to a limited set of data. We chose to use JDBC statements for constructing ad hoc queries against the Oracle database for retrieving metadata, Oracle stored procedures for manipulating OAI collections, and entity beans for limited updates to the Oracle database.

OAI uses JDBC statements for retrieving metadata for several reasons. Oracle provides the capability to return query results as XML documents, which eliminated the need for OAI to programmatically convert the Oracle query result sets into XML documents. The use of stored procedures or entity beans would have required additional custom code to convert the data into an XML document. The clients of OAI use XML documents to customize, in a user-friendly manner, the queries that are used for retrieving metadata from the system. JDBC-executed statements provide a flexible means to dynamically generate ad hoc queries based on the client-supplied search criteria.
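
As a rough sketch of how ad hoc queries can be generated from client-supplied criteria, the following builds a parameterized SQL statement from a map of search fields. The table name, column names, and class name are hypothetical, and the sketch leaves out the conversion of results to XML that Oracle performs for OAI.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: table and column names are hypothetical.
class AdHocQuery {
    // Only known columns may appear in the generated SQL text.
    private static final Set<String> SEARCHABLE =
            new HashSet<>(Arrays.asList("drawing_number", "revision", "part_number"));

    final String sql;
    final List<String> params;

    private AdHocQuery(String sql, List<String> params) {
        this.sql = sql;
        this.params = params;
    }

    /** Builds the WHERE clause dynamically from whatever criteria the client supplied. */
    static AdHocQuery fromCriteria(Map<String, String> criteria) {
        StringBuilder sql = new StringBuilder("SELECT * FROM drawing_metadata");
        List<String> params = new ArrayList<>();
        for (Map.Entry<String, String> e : criteria.entrySet()) {
            if (!SEARCHABLE.contains(e.getKey())) {
                throw new IllegalArgumentException("Unknown search field: " + e.getKey());
            }
            sql.append(params.isEmpty() ? " WHERE " : " AND ");
            sql.append(e.getKey()).append(" = ?"); // values are bound, never concatenated
            params.add(e.getValue());
        }
        return new AdHocQuery(sql.toString(), params);
    }

    /** Binds the collected values to a JDBC statement on the given connection. */
    PreparedStatement prepare(Connection conn) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(sql);
        for (int i = 0; i < params.size(); i++) {
            ps.setString(i + 1, params.get(i));
        }
        return ps;
    }
}
```

Binding values as parameters rather than concatenating them keeps the flexibility of dynamic queries while avoiding SQL injection through client input.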

OAI uses stored procedures for manipulating collections of data. OAI provides many capabilities that require accessing large collections of data, such as adding the contents of one collection to another collection or moving full or partial engineering drawings and all their cross-reference associations to another drawing. The user is interested only in whether the action completed successfully and does not need to see every item that was affected. Stored procedures allow us to manipulate these collections within Oracle, which is much more efficient than pulling all the data from the database to the server, processing each row individually, and inserting each row individually back into the database. In many cases, the collections can be manipulated with one SQL statement inside the stored procedure. An additional advantage of stored procedures is that they are compiled, so we save the parsing time required to process JDBC queries. Using either JDBC statements or entity beans would have been very costly because we would have had to process each row in the collection individually.

OAI uses entity beans for updating engineering drawings and collection metadata. The main advantage of container-managed entity beans over JDBC statements or stored procedures is reduced development time. Developing an entity bean involves defining the bean interface and then mapping the bean's member fields to database fields in a table. The container is then responsible for generating the code that implements the queries for retrieving, storing, updating, and deleting data from the database. In theory, container-managed beans are easy to develop and are database independent. This convenience comes at the cost of performance and fine control over the execution of the queries against the database. The EJB specification defines the life cycle of an entity bean so as to guarantee that the data mapped into the object is always synchronized with the corresponding data in the database. The container's enforcement of this life cycle for each entity bean introduces considerable performance overhead. As a result, we chose to use entity beans only where we needed their convenience and development efficiency, and only for those parts of the system where requests against them will be a small fraction of the overall workload. At a later stage, if we determine that those entity beans are forming a bottleneck, we will need to replace them with either stored procedures or JDBC statements.

When we decided to use XML as the data transport for external data, we considered the option of storing XML directly in the database, as opposed to parsing the XML documents first and storing the data only. In making this decision we investigated the option of using a native XML database. Using a native XML database is the natural choice for storing XML data since there is a direct mapping between the original XML document and its physical representation within the database. Another feature of native XML databases, referred to as "round-tripping," is important to us since we often need to return responses to a caller in the form of XML documents, which were previously submitted to our system in the form of an XML document. Finally, the use of XPath or XQL for generating queries against a native XML database would be a direct fit with our requirement of having to allow clients to generate queries against the image management system using XML documents to specify the queries.

Despite those positive features of storing XML directly in the database, we chose not to use a native XML database for a number of reasons. The primary reason is that, at least for the early releases of the OAI system, we need to support the relational database that is currently used by the legacy applications. Using another data store at this point would therefore require every operation that modifies data to apply the changes in a transactional manner to two different databases. The second reason is the reported performance problems of native XML databases for the kind of access we need. We need to query the database using various different attributes and at different levels of the image metadata hierarchy. XML databases tend to perform very well against queries that fit the document hierarchy that was used to store the documents. However, they do not do well for more ad hoc queries, unless indexes are used extensively (which hinders the performance of update requests). We believe that using Oracle XSU to XML-enable our native Oracle databases is currently a good solution for JEDMICS. With future developments of Oracle and other XML technologies, more options may be available.

12.4.6 Validation

Querying the JEDMICS repository is a major function of OAI. Despite its many advantages, free-format XML makes it hard to guarantee that clients will always provide valid input required for processing. The OAI system, therefore, has to make sure that the client input is validated before it is processed to prevent unnecessary access to the database. This input validation is done in two steps:

  1. Structure validation

    The OAI system needs to know that the client has provided the fields required for processing. This is done automatically by specifying the XML input's schema (in this case we use DTDs). When the data are parsed against the appropriate DTD, input that does not match (missing or unexpected fields, or fields appearing in the wrong order) raises an exception, which is thrown to indicate to the client that the input is invalid. The fields needed for processing depend on the query called by the client, and therefore different queries have different requirements.

  2. Content validation

    The OAI system also validates field content. Certain fields can only have certain values (and this is sometimes based on the query called by a client). Due to the limitations of the current DTD specification, this validation has to be done in the code. Once the input passes the structure validation, the system calls the validation bean, which does the content-based validation. Invalid content also causes an exception to be thrown to the client before any database processing is undertaken.

This two-step validation helps the system avoid unnecessary load. The Query Service is the most heavily accessed service provided by OAI. Therefore, it is crucial to reject invalid data before executing queries to reduce the load on the database. It is not possible, however, for the validation to fully protect the system from bad data, simply because of the extent of the image data stored in the system.
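
The two steps can be sketched with the validating DOM parser bundled with the JDK. This is a simplified stand-in rather than the OAI implementation: the DTD, the element names, the set of allowed image formats, and the Result type are all assumptions made for illustration, and the server-side DTD is prepended to the client input, which is assumed to carry no XML declaration or DOCTYPE of its own.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: the DTD and the allowed values are hypothetical.
class TwoStepValidation {
    enum Result { OK, INVALID_STRUCTURE, INVALID_CONTENT }

    // Step 1 rules: required elements and their order.
    private static final String DOCTYPE =
            "<!DOCTYPE query [\n"
          + "  <!ELEMENT query (drawingNumber, imageFormat)>\n"
          + "  <!ELEMENT drawingNumber (#PCDATA)>\n"
          + "  <!ELEMENT imageFormat (#PCDATA)>\n"
          + "]>\n";

    // Step 2 rules: values a DTD cannot express for #PCDATA content.
    private static final Set<String> FORMATS = new HashSet<>(Arrays.asList("TIFF", "PDF"));

    static Result check(String queryXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // validate against the DTD while parsing
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // treat validity errors as failures instead of logging them
                }
            });
            doc = builder.parse(new InputSource(new StringReader(DOCTYPE + queryXml)));
        } catch (Exception e) {
            return Result.INVALID_STRUCTURE;
        }
        // Structure is known to be valid here, so the element is guaranteed to exist.
        String format = doc.getElementsByTagName("imageFormat").item(0).getTextContent().trim();
        return FORMATS.contains(format) ? Result.OK : Result.INVALID_CONTENT;
    }
}
```

Both checks run before any database access, so a request that fails either step never reaches the query layer.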

The OAI system integrates an early implementation of JAXB, provided by Sun Microsystems, to validate input fields using the appropriate DTD and for representing the XML data within the system using an object format. Based on the DTD provided, the JAXB compiler generates Java classes that provide a two-way conversion mechanism between the XML document and the Java objects. We decided to incorporate JAXB because of its tight integration between Java technology and XML, as well as its guarantee for valid data. With JAXB, we define input syntax in the DTD for field validation and then extend the classes to include some or all content-based validation rules. Representing data with Java objects also has the advantage of easier access to data from within the objects.

XML Schema and DTD are two different forms of schema used to model a whole class of XML documents. DTD has been around as long as XML has, whereas XML Schema has only lately gained popularity. XML Schema's popularity is due to a number of limitations of DTD: the non-XML syntax it is written in, its limited datatyping, and its complex and fragile extension mechanism based on string substitution. XML Schema tries to overcome these limitations by being more expressive than DTD. This expressiveness lets developers exchange XML data in a more robust way without having to rely heavily on separate validation tools and processes.

Many tools do not yet support XML Schema since it is fairly new. One such tool, regrettably, is the JAXB implementation, which currently still relies on a DTD to create Java classes. When support for XML Schema in JAXB becomes a reality, the system will be equipped with a pattern-matching validation scheme and content-based validation rules, and thus a more robust validation capability. Another advantage of using XML Schema in the system is the ability to define occurrence constraints as well as simple and complex types. The JAXB compiler is then used to generate Java objects, which will make sure that fields defined in the XML Schema appear in the input as expected. Despite their advantages, however, XML Schema and JAXB alone may not perform all the validations we need. In some cases, we still need to do manual content-based validation in the code. We can do that easily by extending the Java classes created by the JAXB compiler to validate user input based on the field and other criteria.


