In the main, this book deals with the most common XML content-syndication standard: RSS. As with other Internet standards, it helps to know some of its history before diving into the technicalities.
While it is only three years old, RSS is a somewhat troubled set of standards. Its upbringing has seen standards switch, regroup, and finally split apart entirely under the pressures of parental guidance. To fully understand this wayward child, and to get the most out of it, it is necessary to understand the motivations behind it and how it evolved into what it is today.
The deepest, darkest origins of the current versions of RSS began in 1995 with the work of Ramanathan V. Guha. Known to most simply by his surname, Guha developed a system called the Meta Content Framework (MCF). Rooted in the work of knowledge-representation systems such as CycL, KRL, and KIF, MCF's aim was to describe objects, their attributes, and the relationships between them.
MCF was an experimental research project funded by Apple, so it was pleasing for management that a great application came out of it: ProjectX, later renamed HotSauce. By late 1996, a few hundred sites were creating MCF files that described themselves, and HotSauce allowed users to browse around these MCF representations in 3D.
It was popular, but experimental, and when Steve Jobs' return to Apple's management in 1997 heralded the end of much of Apple's research activity, Guha left for Netscape.
There, he met with Tim Bray, one of the original XML pioneers, and started moving MCF over to an XML-based format. (XML itself was new at that time.) This project later became the Resource Description Framework (RDF). RDF is, as the World Wide Web Consortium (W3C) RDF Primer says, "a general-purpose language for representing information in the World Wide Web." It is specifically designed for the representation of metadata (see Chapter 5) and the relationships between things. In its fullest form, it is the basis for the concept known as the Semantic Web, the W3C's vision of a web of information that computers can understand.
This was in 1997, remember. XML, as a standard way to create data formats, was still in its infancy, and much of the Internet's attention was taken up by the increasingly frantic war between Microsoft and Netscape.
Microsoft had not ignored the HotSauce experience. With others, principally a company called Pointcast, they further developed MCF for the description of web sites and created the Channel Definition Format (CDF).
CDF is XML-based and can describe content ratings, scheduling, logos, and metadata about a site. It was introduced in Microsoft's Internet Explorer 4.0 and later into the Windows desktop itself, where it provided the backbone for what was then called Active Desktop.
By 1999, MCF was well and truly steeped in XML and becoming RDF, and the Microsoft/Netscape bickering was about to start again. Both companies were due to launch new versions of their browsers, and Netscape was being circled for a possible takeover by AOL.
So, Netscape's move was to launch a portal service, called the "My Netscape Network," and with it RSS.
Short for RDF Site Summary, RSS allowed the portal to display headlines and URLs from other sites, all within the same page. A user could then personalize their My Netscape page to contain the headlines from any site that interested them and had an RSS file available. It was, basically, a web page-based version of everything HotSauce and CDF had become. It was a great success.
My Netscape benefited from this in many ways: they suddenly had a massive amount of content given to them for free. Of course, they had no control over it or any real way of making money from it directly, but the additional usefulness of their site to the user made people stick around longer. In the heat of the dot-com boom, allowing people to put their own content on a Netscape page, alongside advertising sold by Netscape, was a very good idea: the portal could both save money on content and make more on ad sales. The user also benefited. She had her favorite sites summarized on one page, a one-stop shop for a day's browsing, which many users found extremely useful. The RSS provider didn't lose out either, gaining from both additional traffic and wider exposure.
These abilities, aided by the relative simplicity of the RSS 0.9 standard itself, proved so useful that RSS didn't stay unique to Netscape for long. Converting RSS to HTML is simple, as you will see later in this book, and other RSS-based sites rapidly appeared across the Web. Sites such as slashdot.org incorporated RSS feeds as replacements for their own homegrown headline formats, and developers of all the major scripting languages devised simple ways to read and display RSS feeds. A small revolution was underway.
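As a rough illustration of how little work the conversion from RSS to HTML involves, here is a minimal Python sketch. The sample feed, its example.com URLs, and the function name rss_to_html are all invented for this illustration; the sketch assumes a simple, non-RDF feed in the style of RSS 0.91 and is not taken from any particular portal's implementation.

```python
import xml.etree.ElementTree as ET
from html import escape

# A tiny hand-written RSS 0.91-style feed, used only for illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Headlines from an imaginary site</description>
    <item>
      <title>First headline</title>
      <link>http://www.example.com/1</link>
    </item>
    <item>
      <title>Second &amp; last headline</title>
      <link>http://www.example.com/2</link>
    </item>
  </channel>
</rss>"""


def rss_to_html(feed_xml):
    """Return an HTML fragment listing each item's title as a link."""
    channel = ET.fromstring(feed_xml).find("channel")
    lines = ["<h2>%s</h2>" % escape(channel.findtext("title", default="")), "<ul>"]
    for item in channel.findall("item"):
        # Escape text taken from the feed before placing it in HTML.
        title = escape(item.findtext("title", default=""))
        link = escape(item.findtext("link", default=""), quote=True)
        lines.append('  <li><a href="%s">%s</a></li>' % (link, title))
    lines.append("</ul>")
    return "\n".join(lines)


print(rss_to_html(SAMPLE_FEED))
```

The result is a heading followed by a bulleted list of linked headlines, which is essentially what a portal page needs from each feed.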
Technically, however, RSS had been a compromise.
The first draft of the RSS format, as designed by Dan Libby, was a fully RDF-based data model, and people inside Netscape felt it was too complicated for end users at that time. The resultant compromise, named RSS 0.9, was not truly useful RDF, nor was it as simple as it could be.
Some felt that using RDF improperly was worse than not using it at all, so when RSS 0.91 arrived, the RDF nature of the format had been dropped. As Dan Libby explained to the rss-dev email list (http://groups.yahoo.com/group/rss-dev/message/239):
At the time, the primary users of RSS (Dave Winer the most vocal among them) were asking why it needed to be so complex and why it didn't have support for various features, e.g. update frequencies. We really had no good answer, given that we weren't using RDF for any useful purpose. Further, because RDF can be expressed in XML in multiple ways, I was uncomfortable publishing a DTD for RSS 0.9, since the DTD would claim that technically valid RDF/RSS data conforming to the RDF graph model was not valid RSS. Anyway, it didn't feel "clean". The compromise was to produce RSS 0.91, which could be validated with any validating XML parser, and which incorporated much of Userland's vocabulary, thus removing most (I think) of Dave's major objections. I felt slightly bad about this, but given actual usage at the time, I felt it better suited the needs of its users: simplicity, correctness, and a larger vocabulary, without RDF baggage.
RSS 0.91, which incorporated some features from Userland Software's ScriptingNews format, was completely RDF-free. So, as would become a habit whenever a new version of RSS was released, the meaning of the RSS acronym was changed. In the RSS 0.91 Specification, Dave Winer explained:
There is no consensus on what RSS stands for, so it's not an acronym, it's a name. Later versions of this spec may say it's an acronym, and hopefully this won't break too many applications.
A great deal of research into RDF continued, however. Indeed, Netscape's RSS development team was always keen to use it. Their original specification (the one that was watered down to produce RSS 0.9) was published at the insistence of Dan Libby, and, although it has long since gone from the Netscape servers, you can find it in the Internet Archive.
Netscape, however, was never to release any new versions: the RSS team was disbanded as the My Netscape Network was closed. So, when work began on a new version of RSS, it was left to the development community in general to sort out. They quickly broke into two camps.
The first camp, led by O'Reilly's Rael Dornfest, wanted to introduce some form of extensibility to the standard. The ability to add new features, perhaps through modularization, necessitated such complexities as XML namespaces and the reintroduction of RDF, as envisioned by the Netscape team.
However, the second camp, led by Dave Winer, the CEO of Userland Software and keeper of the RSS 0.91 standard, feared that this would add a level of complexity unwelcome among users and wanted to keep RSS as simple as possible.
In December 2000, after a great deal of heated discussion, RSS 1.0 was released. It embraced the use of modules, XML namespaces, and a return to a full RDF data model. Two weeks later, Dave Winer released RSS 0.92 as a rebuttal of the RDF alternative. The standard thus forked.
And that was how it remained for two years, with two standards: RSS 0.92 as the simple, entry-level specification, and RSS 1.0 as the more complex, but ultimately more feature-packed specification. And, of course, some people still used RSS 0.91.
For the users of RSS feeds, this fork was not a major worry, because the two standards remained compatible. Even parsers specifically built to parse only RSS, rather than XML in general, can usually read simple examples of either version with equal ease, although the RDF implications go straight over the head of all but specifically designed RDF parsers.
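To make that compatibility concrete, here is a hedged Python sketch of a deliberately naive headline reader. It ignores XML namespaces and the RDF graph entirely and simply looks for item elements, which is enough to pull titles and links from simple examples of either strand. Both sample feeds, the example.com URLs, and the function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the simple, non-RDF style (RSS 0.91/0.92).
RSS_091 = """<rss version="0.91">
  <channel>
    <title>Example</title>
    <link>http://www.example.com/</link>
    <description>Simple feed</description>
    <item>
      <title>Plain headline</title>
      <link>http://www.example.com/a</link>
    </item>
  </channel>
</rss>"""

# Hypothetical sample in the RDF-based RSS 1.0 style.
RSS_10 = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.example.com/">
    <title>Example</title>
    <link>http://www.example.com/</link>
    <description>RDF feed</description>
    <items>
      <rdf:Seq><rdf:li rdf:resource="http://www.example.com/1"/></rdf:Seq>
    </items>
  </channel>
  <item rdf:about="http://www.example.com/1">
    <title>RDF headline</title>
    <link>http://www.example.com/1</link>
  </item>
</rdf:RDF>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def headlines(feed_xml):
    """Return (title, link) pairs for every item element, in document order."""
    results = []
    for element in ET.fromstring(feed_xml).iter():
        if local_name(element.tag) != "item":
            continue
        fields = {local_name(child.tag): (child.text or "").strip()
                  for child in element}
        results.append((fields.get("title", ""), fields.get("link", "")))
    return results


for feed in (RSS_091, RSS_10):
    print(headlines(feed))
```

Everything RDF-specific in the second feed, such as the rdf:about attributes and the rdf:Seq list, is silently discarded; a reader like this sees only headlines and links.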
All this, however, was changing.
In the late summer of 2002, the RSS community forked again, perhaps irretrievably. Ironically enough, the fork came from an effort to merge the 0.9x and 1.0 strands from the previous fork and create an RSS 2.0 that would satisfy both camps.
Once again, the argument quickly settled into two sides. On one side, Dave Winer and a few others continued to believe in the importance of simplicity above all else, and regarded RDF as a technology that had yet to show any value within RSS. Winer also, for his own reasons, did not want the discussion over RSS 2.0 to take place on the traditional email lists. Rather, he wanted people to express their points of view in their weblogs, to which he would link his own at http://www.scripting.com.
On the other side, the members of the rss-dev mailing list, the place where RSS 1.0 had been born and nurtured to maturity, still wanted to include RDF with the specification (albeit in various simplified forms) and hold the discussion on a publicly archived, centralized mailing list that would not be subject to Winer's filtering.
In many ways, both of these things happened. After a great deal of acrimony, Userland released a specification that they called RSS 2.0 and declared RSS frozen. That this was done without acknowledging, much less taking into account, the increasing concerns, both technical and social, of the rss-dev and RDF communities at large caused much unhappiness.
So, after RSS 2.0's release on September 16, 2002, the rss-dev list started discussions on a possible name change to their own new, RSS 1.0-based specification. This would go hand-in-hand with a complete retooling of the specification, based on a totally open discussion and a rethink of the use of RDF. At the time of this writing, neither a new name nor the syntax of the new format has been decided, and the reader is advised to look to the Web for further news.
As it stands, therefore, the version-numbering system of RSS is misleading. Taken chronologically, 0.9 was based on RDF, 0.91 was not, 1.0 was, 0.92 was not, and now 2.0 is not, but whatever 1.0 becomes probably will be. It should be noted that there is an RSS 3.0, proposed by Aaron Swartz as part of a long rss-dev in-joke. (The joke culminated with a proposal to have RSS 4.0 expressed entirely through the medium of interpretive dance.) Search engine results finding these specifications are therefore wrong, though dryly funny.
For feed publishers, the two strands each have advantages and disadvantages. The first is perhaps simpler to use, whereas the second can, if both the publisher and the user have the wherewithal to allow it, provide a much richer set of information. We will go into each of the feed standards in Chapter 3.
As with most Internet standards, the two versions of RSS are continually being examined for revision. For the purposes of this book, these upgrades need not necessarily concern us: specification upgrades are always designed to be backward-compatible, and RSS feeds designed with the specifications in this book should work for the foreseeable future. Given that they are in XML, converting them to a new standard will also be simple, and methods for doing so will undoubtedly be provided should such a situation arise.
There are differences in the extensibility approaches of the two RSS forks. Since the September 2002 release of RSS 2.0, the core specifications of both RSS 1.0 and RSS 2.0 provide module-based extensibility. Anyone can add new elements or features. We will explore the differences in Chapter 3.
RSS was, and is, by no means limited to the Web. Portal sites, at least in the broadband-free, nonwireless end of the twentieth century, always suffered from one thing: having to be online. There was no way for a My Netscape user, for example, to browse his headlines at his leisure without racking up expensive connectivity fees. In many countries, Internet access through a dial-up connection is still paid for by the minute. Also, the portal sites quite sensibly limited the number of feeds to which you could subscribe: a user with more than a few interests quickly found his headline habit could not be satisfied if he was limited to being online.
So, in early 1999, desktop-based headline viewers, such as Carmen's Headline Viewer, came into play. Users can download hundreds of RSS feeds at a time and browse them at their leisure. Quicker and offering more variety than RSS portals, these readers are becoming increasingly sophisticated. Chapter 10 provides more details.
Then, with the growing popularity of RSS feeds (over 4,000 in the first year), there inevitably came a need for directories.
It's all very well being able to convert RSS feeds to a form readable within a web site, but where do you find these feeds in the first place? By the turn of the century, thousands of sites offered RSS feeds, but due to a lack of either a standardized address or an automatic system of resource discovery, users were dependent on finding feeds through a site's advertising.
Registries were, and still are, one good answer. These sites list the details of thousands of feeds, tested and categorized for ease of use, and they are growing in size and sophistication. At the time of this writing, Syndic8.com, for example, will soon break the 10,000-feed mark. We'll discuss registries more fully in Chapter 10.
Aggregators, on the other hand, add an additional layer of usability to RSS feeds. By grouping feeds together and allowing a filtering of headlines, they allow the creation of a kind of meta-feed. For example, O'Reilly's Meerkat service (http://www.oreillynet.com/meerkat) allows you to create an RSS feed of all the stories on a certain subject that have appeared, over a set time, within any of the other feeds it monitors. Note that some people use the word aggregator to indicate a desktop reader client. This book does not.
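The meta-feed idea can be sketched in a few lines of Python. This is only an assumption-laden illustration of the general technique, not a description of how Meerkat or any other service works: the two sample feeds, the example.com URLs, and the function build_meta_feed are invented here, and the filter is a plain case-insensitive keyword match on item titles.

```python
import xml.etree.ElementTree as ET

# Two hypothetical source feeds in the simple, non-RDF style.
FEED_A = """<rss version="0.91"><channel>
  <title>Site A</title><link>http://a.example.com/</link>
  <description>First source</description>
  <item><title>XML namespaces explained</title>
        <link>http://a.example.com/1</link></item>
  <item><title>Cooking with gas</title>
        <link>http://a.example.com/2</link></item>
</channel></rss>"""

FEED_B = """<rss version="0.91"><channel>
  <title>Site B</title><link>http://b.example.com/</link>
  <description>Second source</description>
  <item><title>Why xml matters</title>
        <link>http://b.example.com/1</link></item>
</channel></rss>"""


def build_meta_feed(feeds, keyword, title="Meta-feed"):
    """Merge items whose title mentions keyword into one new feed."""
    rss = ET.Element("rss", version="0.91")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    # Placeholder address for the generated feed (an assumption).
    ET.SubElement(channel, "link").text = "http://www.example.com/meta"
    ET.SubElement(channel, "description").text = "Items matching '%s'" % keyword
    for feed_xml in feeds:
        for item in ET.fromstring(feed_xml).iter("item"):
            if keyword.lower() in item.findtext("title", default="").lower():
                channel.append(item)
    return ET.tostring(rss, encoding="unicode")


print(build_meta_feed([FEED_A, FEED_B], "xml"))
```

Because the output is itself a feed, it can be passed to any reader or portal that already understands the source feeds.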
Search engines are also starting to realize the usefulness of RSS feeds. Sites such as The Snewp (http://www.snewp.com) limit their indexing efforts solely to RSS feeds. RSS's concentrating nature gives the index a far greater signal-to-noise ratio than if it had to trawl every page in the site. Combined with Publish and Subscribe (see Chapter 12), it promises to allow extremely up-to-date search engine results. These results can, of course, be given in RSS itself.
RSS's major success within the My Netscape Network has been replicated internally in many corporations. Indeed, many companies make their living acting as aggregators solely for the corporate market. By combining search engines and aggregated feeds into intranets, employees are able to track news sites for mentions of their company or related industries. Combined with knowledge management techniques, RSS feeds can be a major part of a corporation's internal information flow.
RSS is not limited to web pages. Far from it: its format is specifically designed to be a halfway house to any other human-readable or machine-processable format. Because of this flexibility, RSS feeds have been popping up on many services. For example, instant messaging services are perfectly suited to delivering headlines to users, as is the Short Message Service (SMS) for GSM-based mobile phones. By acting as the data-carrying glue between the content providers, third-party service providers, and the end user, RSS can provide a very simple way of creating thousands of services extremely rapidly. If it can receive text, you can use RSS somewhere along the line. We'll discuss this in Chapter 9.