2.1 The Search for Knowledge

Occasionally, I like to write articles about non-Internet-related topics, such as marine biology or astronomy. One of my more popular articles is on Architeuthis Dux, the giant squid. The article is currently located at http://burningbird.net/articles/monsters1.htm.

According to the web profile statistics for this article, it receives a lot of visitors based on searches performed in Google, a popular search engine. When I go to the Google site, though, and search for the article using the term giant squid, I get a surprising number of links back. The article was listed on page 13 of the search results (with 10 links to a page). Ahead of it were several links about a production company, the Jules Verne novel 20,000 Leagues Under the Sea, something to do with a comic book character called the Giant Squid, and various other assorted and sundry references, such as a recipe for cooking giant squid steaks (as an aside, giant squids are ammonia based and inedible).

For the most part, each link does reference the giant squid as a marine animal; however, the context doesn't match my current area of interest: finding an article that explores the giant squid's roots in mythology.

I can refine my search, specifying separate keywords such as giant, squid, and mythology to make my article appear on page 6 of the list of links, along with links to a Mexican seafood seller offering giant squid meat slabs and a listing of books that discuss a monster called the Giant Squid that oozes green slime.

The reason we get so many links back when searching for specific resources is that most search engines match on keywords, rather than searching for a resource within the context of a specific interest. The search engines' data is gathered by automated agents (robots, also called web spiders) that traverse the Web via in-page links, pulling keywords from either HTML meta tags or directly from the page contents themselves.
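To make the limitation concrete, here is a minimal sketch of keyword matching. It is not how any real search engine works internally; the page URLs and page text are invented for illustration, and the names build_index and search are my own.

```python
from collections import defaultdict

# Hypothetical pages: these URLs and snippets are invented for illustration.
PAGES = {
    "http://example.com/squid-mythology": "the giant squid in mythology and legend",
    "http://example.com/squid-steaks": "recipe for giant squid steaks",
    "http://example.com/squid-comic": "the Giant Squid comic book character",
}

def build_index(pages):
    """Map each lowercase keyword to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, *keywords):
    """Return the URLs that contain every keyword, sorted for stable output."""
    sets = [index.get(word.lower(), set()) for word in keywords]
    return sorted(set.intersection(*sets)) if sets else []

index = build_index(PAGES)
results = search(index, "giant", "squid")
refined = search(index, "giant", "squid", "mythology")
```

Searching on giant and squid returns all three pages, because the index knows only which words appear, not why. Adding mythology narrows the list, but only because that word happens to occur in the page text.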

A better approach for classifying resources such as the giant squid article would be to somehow attach information about the context of the resource. For instance, the article is part of a series comparing two legendary creatures: the giant squid and the Loch Ness Monster. It explores what makes a creature legendary, as well as current and past efforts to find living representatives of either creature. All of this information forms a description of the resource, a picture that's richer and more complex than a one-dimensional keyword-based categorization.

What's missing in today's keyword-based classification of web resources is the ability to record statements about a resource. Statements such as:

The article's title is "Architeuthis Dux."
The article's author is Shelley Powers.
The article is part of a series.
A related article is ...
The article is about the giant squid and its place in the legends.

General keyword scanning doesn't return this type of specific information, at least not in a way that lets a machine easily find and process these statements without heroic computations.
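As a rough sketch of what machine-readable statements could look like, the statements listed above can be written as three-part records: the resource, a property, and a value. The property names below (title, author, isPartOfSeries, subject) are hypothetical labels of my own choosing, not terms from any real vocabulary, and the related-article statement is left out because the source does not name the related article.

```python
ARTICLE = "http://burningbird.net/articles/monsters1.htm"

# Each statement is (resource, property, value).
# Property names are hypothetical labels, chosen only for this sketch.
statements = [
    (ARTICLE, "title", "Architeuthis Dux"),
    (ARTICLE, "author", "Shelley Powers"),
    (ARTICLE, "isPartOfSeries", True),
    (ARTICLE, "subject", "the giant squid and its place in the legends"),
]

def values(statements, resource, prop):
    """Return every value recorded for one property of one resource."""
    return [v for (s, p, v) in statements if s == resource and p == prop]

author = values(statements, ARTICLE, "author")
```

With the statements in this form, finding the author is a simple lookup rather than a guess based on words scattered through the page.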

RDF provides a mechanism for recording statements about resources so that machines can easily interpret the statements. Not only that, but RDF is based on a domain-neutral model that allows one set of statements to be merged with another set of statements, even though the information contained in each set of statements may differ dramatically.
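The merging idea can be illustrated with a small sketch. This is not an RDF library; it assumes statements are plain (resource, property, value) tuples, and both the property names and the two source sets are hypothetical.

```python
ARTICLE = "http://burningbird.net/articles/monsters1.htm"

# Two independently recorded sets of statements about the same resource.
# Property names are hypothetical labels, chosen only for this sketch.
publishing_statements = {
    (ARTICLE, "title", "Architeuthis Dux"),
    (ARTICLE, "author", "Shelley Powers"),
}
topic_statements = {
    (ARTICLE, "author", "Shelley Powers"),
    (ARTICLE, "subject", "giant squid"),
    (ARTICLE, "subject", "mythology"),
}

# Because every statement has the same shape, merging is a set union;
# the statement that appears in both sets is kept only once.
merged = publishing_statements | topic_statements
```

Neither set needed to know about the other's properties for the merge to work, which is the point of a domain-neutral model.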

One application's interest in the resource might focus on finding new articles posted on the Web and providing an encapsulated view of the articles for news aggregators. Another application's interest might be on the article's long-term relevancy and the author of the article, while a third application may focus specifically on the topics covered in the article, and so on. Rather than generating one XML file in a specific XML vocabulary for all of these different applications' needs, one RDF file can contain all of this information, and each application can pick and choose what it needs. Better yet, new applications will find that everything they need is already being provided, as the information we record about each resource gets richer and more comprehensive.
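A sketch of the pick-and-choose idea, again using plain (resource, property, value) tuples rather than a real RDF file: one shared description is recorded once, and each application filters for the properties it cares about. The property names and the three application views are hypothetical.

```python
ARTICLE = "http://burningbird.net/articles/monsters1.htm"

# One shared description; property names are hypothetical labels.
description = [
    (ARTICLE, "title", "Architeuthis Dux"),
    (ARTICLE, "author", "Shelley Powers"),
    (ARTICLE, "subject", "giant squid"),
    (ARTICLE, "subject", "mythology"),
]

def pick(statements, wanted):
    """Keep only the statements whose property the application cares about."""
    return [s for s in statements if s[1] in wanted]

aggregator_view = pick(description, {"title"})   # news aggregator
archive_view = pick(description, {"author"})     # long-term relevancy
topic_view = pick(description, {"subject"})      # topic-focused application
```

Statements an application does not recognize are simply skipped, so adding richer information later does not disturb the applications already reading the description.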

And the basis of all this richness is a simple little thing called the RDF triple.

I use the word context in this chapter and throughout the book. However, the folks involved with RDF, including Tim Berners-Lee, director of the W3C, are hesitant about using the term context in association with RDF. The main reason is there's a lot of confusion about what context actually means. Does it mean the world of all possible conditions at any one point? Does it mean a specific area of interest?

To prevent confusion when I use context in the book, I use the term to refer to a certain aspect of a subject at a given time. For instance, when I look for references for a subject, I'm searching for information related to one specific aspect of the subject (such as the giant squid's relevance to mythology), but only for that specific instance in time. The next time I search for information related to the giant squid, I might be searching for information based on a different aspect of giant squids, such as cooking giant squid steaks.