6.2 Storing XML as CLOB

In this section we discuss how to store XML as a CLOB.

6.2.1 Using CLOB and the OracleText Cartridge

The simplest approach for handling XML documents is to use relational tables with one or more CLOB columns, supplemented by structured information that classifies and characterizes the documents. Using relational concepts, the XML documents can thus be organized in the sense of containers as provided by native XML databases. However, storing and retrieving documents is only possible as a whole. In fact, updates of XML repositories are still a fundamental research issue for which Oracle has no solution as yet.

In Oracle there is principally no difference between CLOBs and strings except that a CLOB can hold up to 4GB. Both CLOBs and strings offer the same functionality, such as comparisons and substring search. This allows for only few search capabilities unless you're using additional structured information. Such querying is in general too simple for advanced XML applications.

The OracleText cartridge helps to remedy this deficit: OracleText is a server-based implementation for free-text search in any kind of document. Due to the relevance of XML, the cartridge offers special support for XML documents: New query forms WITHIN, INPATH, and HASPATH with XPath expressions can be used for retrieving documents. However, to benefit from these features, creating text indexes and defining section groups must be performed up-front, as we will explain later.

Let us assume that Listing 6.1 is an XML document that describes a customer placing orders.

Listing 6.1 Customer Example

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE Kunde SYSTEM "kunde.dtd">
<Customer>
  <CNo> 10 </CNo>
  <Name>
    <Firstname> Lucky </Firstname>
    <Lastname> Luke </Lastname>
  </Name>
  <Address>
    <Zip> 12345 </Zip>
    <City> Bull </City>
    <Street> Cows Xing </Street>
    <Houseno> 8 </Houseno>
  </Address>
  <Phone> 012/3456 </Phone>
  <Phone> 023/4567 </Phone>
  <Order ONo="4711">
    <Entry> 01.01.99 </Entry>
    <Positions>
      <Position PNo="1">
        <Price Currency="US$"> 1.11 </Price>
        <Amount> 111 </Amount>
        <Part> Screw </Part>
      </Position>
      <Position PNo="2">
        <Price Currency="Euro"> 2.22 </Price>
        <Amount> 22 </Amount>
        <Part> Nut </Part>
      </Position>
    </Positions>
  </Order>
</Customer>

6.2.2 Search Predicates in OracleText

The essential concept of OracleText is a CONTAINS function, which can be used in SQL directly to specify search patterns. Let us assume that customer documents are stored in a CLOB column txt in a table DocTab (id INTEGER, txt CLOB). For example, Retrieve those tuples in DocTab that contain "Lucky" in the column txt.The SQL code for this is shown in Listing 6.2.

Listing 6.2 Simple `contains` Query

SELECT id, txt, SCORE(1)
FROM DocTab
WHERE CONTAINS (txt, 'Lucky', 1) > 0

Rather than yielding a not/found result, the function CONTAINS assesses the occurrences of a term. It returns a score value in the range of 0...100 determining the relevance of the search pattern: 0 means "not found in the document," while 100 represents a perfect match. The value is computed by a complex formula based on the number of occurrences. Usually, CONTAINS is used in a comparison CONTAINS ( . . . )>0 to ask for any occurrences. The third parameter of CONTAINS is an integer such as 1, which is a placeholder for the relevance. Its value can be obtained by applying the SCORE function to the placeholder: SCORE(1) yields the relevance of the term "Lucky" as part of the result.

OracleText provides several operators to affect the search string, for instance:

Logical operators such as & (AND), | (OR), and ~ (NOT) are possible: Taking "Lucky & Luke' as a search term, both terms "Lucky' and "Luke' must occur in the document. The overall relevance is the minimum of both individual relevances.
NEAR(term₁, . . . ,term_n,m) requires the list of terms to occur in a sentence of a given length m.
Special functions vary the search strings and expand queries: stem ($) to search for terms with the same linguistic root, soundex (!) to include words that have similar sounds, fuzzy search (?) for words that are spelled similarly to the specified term (e.g., because of misspellings), as well as the well-known SQL wildcards ( _ ) and (%).
Thresholds like "term.n" can be defined to eliminate texts that score below a threshold number n.
A thesaurus is available with corresponding operators such as Preferred Term, Related Term, Broader Term, Narrower Term, and Synonym.

Listing 6.3 demonstrates the power:

Listing 6.3 More Complex `contains` Query

SELECT id, SCORE(1)
FROM DocTab
WHERE CONTAINS (txt, '!Smith | $sing', 1) > 0

This query takes also into account documents that contain "Smythe" (due to "!") just as "sang" and "sung" (due to "$").

Further important concepts are

Stoplists: Identify the words in a language that are not to be indexed. For instance, it is not sensible to index articles and pronouns like "the" and "my".
Filtering: Allows indexing documents in binary formats such as Word documents or PDF. Oracle stores them in their native format and uses filters to build temporary plain text versions for indexing.
Lexers: Can be installed to define case sensitivity and the handling of special characters. For example, the lexer can be told to remove characters like ".", "!", and "?" from a token before indexing, because their purpose is only to indicate the end of a sentence.
Datastore: A datastore preference specifies how the text is stored. Besides keeping documents in a text column, it is possible to store documents in several columns, in a nested table, in several rows, or in a file. Files can be handled directly or via a URL. The txt column then contains a URL pointing to the document instead of keeping the text itself.

This so far is the general functionality of OracleText, which can certainly be used for XML documents, too, but does not exhaust the full potential of XML.

6.2.3 XML-Specific Functionality

In order to satisfy the need for queries that are more related to XML, CONTAINS can be combined with advanced text operators:

A WITHIN operator restricts the occurrence of a term to a text section instead of searching in the whole document: CONTAINS(txt, 'term WITHIN SENTENCE',1) > 0. Predefined sections are SENTENCE and PARAGRAPH to look for terms in a sentence or paragraph, respectively. Further sections are XML elements and their attribute zones. Corresponding queries are 'term WITHIN tag' and 'term WITHIN attr@tag'. The first one demands the term to occur between a pair <tag> . . . </tag>, while the second one searches for the term in the attribute value <tag attr=" . . . ">.
Two additional operators HASPATH and INPATH allow for XPath queries. HASPATH exists in two different forms, the first one 'HASPATH(xpath)' asking for the existence of a certain xpath without any search term. A second conditional form 'HASPATH(xpath="value")' checks the value of an element specified by an xpath. Only equality and inequality are allowed in those conditions.
The INPATH operator has the form term INPATH(xpath) and restricts the search to an arbitrary XPath fragment: The document must possess a fragment qualified by xpath, and the term must occur within.

All the other operators such as "&" and "|" can still be used and combined with these XML operators. The form WITHIN allows for terms like 'Screw WITHIN Part' and is quite general as any occurrence of the <Part> tag is taken into account: The form is equivalent to 'Screw INPATH(//Part)', asking for the tag at any level. Nesting WITHIN operators, e.g. 'Screw WITHIN Part WITHIN Order', is also possible, but offers not the flexibility of XPath expressions. In contrast, INPATH uses an XPath expression to determine the element more precisely, for instance, beginning from the root and using conditions: 'Screw INPATH(/Customer [ CNo="10"] /Order/Part)'.

To demonstrate the differences between INPATH and HASPATH, let us consider another example:

'Luke INPATH(/Name/Lastname)' finds <Name> . . . <Lastname> xyzLukeXyz </Lastname> . . . </Name>. The term 'Luke' must be embedded in <Name> and <Lastname> elements.
'HASPATH(/Name/Lastname="Luke")' is only satisfied if the inner element matches the given value exactly, i.e., without any leading or following characters: <Name> . . . <Lastname>Luke</Lastname> . . . </Name>. A value xyzLuckyXyz is not accepted any longer.
'Luke INPATH(/Name/Lastname[ Lastname="Luke"])' has the same effect.

Oracle supports the full XPath syntax. Hence, more complex queries than those presented so far are possible. The following example shows an attribute-sensitive search:

'Screw INPATH(//Order[ @ONo="4711"])'

The term "Screw" must occur at any level in a customer's Order element having an ONo attribute with value 4711.

Note OracleText supports an XML-like retrieval of text documents, but the result is always a CLOB. Hence to extract information from a resulting document, an XML parser must be used. The object type XMLType, which will be discussed shortly, offers additional functionality to extract parts from a document. Furthermore, in spite of querying with Xpath, SQL is still used to formulate the body of the query.

6.2.4 Prerequisites

In order to use the CONTAINS function, some prerequisites are necessary, making the handling complicated. Generally, CONTAINS requires a text index for the column that contains text documents. It is just then that indexing takes place?that is, that terms are correctly found in documents. Furthermore, the WITHIN operator needs a text section, called section group, which can be defined and added to an index. The most comfortable way for XML is to use a PATH_SECTION_GROUP, because it takes into account all the tags and attributes by default and supports all the operators WITHIN, HASPATH, and INPATH. The definition of a PATH_SECTION_GROUP section group myGroup is as follows:

CTX_DDL.create_section_group('myGroup', 'PATH_SECTION_GROUP')

Afterwards, the text index on the txt column can be created by using a special index type CTXSYS.context and by setting the section group myGroup, as shown in Listing 6.4.

Listing 6.4 CTXSYS.context Index

CREATE INDEX myIdx ON DocTab(txt) INDEXTYPE IS CTXSYS.context
PARAMETERS ('SECTION GROUP myGroup')

Without a section group, no CONTAINS queries are possible.

The PARAMETERS clause is also important for setting other preferences like datastore, filter, wordlist, and lexers. For example, the type of source of the text can be specified in PARAMETERS analogously by creating a datastore preference:

PARAMETERS ('section group myGroup datastore myStorage . . . '). Again, the preference myStorage and others have to be defined beforehand, as shown in Listing 6.5.

Listing 6.5 myStorage Preferences

CTX_DDL.create_preference('myStorage', 'FILE_DATASTORE')
CTX_DDL.set_attribute('myStorage', '/home/text:/local1/home/myTexts')

Using the property FILE_DATASTORE, the values of the text attributes are interpreted as filenames. Searching the files is done according to the paths /home/text and /local1/home/myTexts that are defined by means of set_attribute.

As mentioned previously, a PATH_SECTION_GROUP is the most comfortable section group as it supports all the CONTAINS queries without any further action. All the tags and all their attributes are automatically indexed.

Another section group is AUTO_SECTION_GROUP. It behaves similar to PATH_SECTION_GROUP and places indexes on all tags and attributes. But tags and attributes can be used only in WITHIN queries; INPATH and HASPATH are not allowed.

On the other extreme, XML_SECTION_GROUP requires a definition of all the tags to be indexed. Every tag or attribute to be used must be specified explicitly by calling add_. . ._section procedures in the manner shown in Listing 6.6.

Listing 6.6 add_ . . . _section Procedures

CTX_DDL.create_section_group('myXmlGroup', 'XML_SECTION_GROUP')
CTX_DDL.add_attribute_section('myXmlGroup', 'mySection1', 'attr@tag')
CTX_DDL.add_zone_section('myXmlGroup', 'mySection2', 'docType(tag)')

mySection1 is the name of an attribute section that indexes the attribute attr within tag. mySection2 represents a zone section?that is, the search area has elements <tag> . . . </tag> for a certain document type docType. The docType is optional. Afterwards, WITHIN mySection1/2 can be used.

Further groups that can be used for plain or HTML text are: