3.3 Accessing Community Features

Amazon provides access to all of its community features through its web site. As more sites integrate closely with Amazon, though, demand is growing to tap into the community through code.

3.3.1 Accessing Through Web Services

The Web Services API (see Chapter 6) offers some access. When accessing an individual product's information through the API, you can find the following community data:

  • The three latest reviews

  • ASINs of five related items

  • Three lists that contain the item

This is valuable information, and developers are building tools that work with it in many creative ways. But compared with the volume of information available on Amazon's site, the community data in the API is only a small window into the larger community. That leaves one route for integration-minded developers: screen scraping.

3.3.2 Accessing Through Screen Scraping

The term screen scraping refers to requesting a web page programmatically with a script and picking through the resulting HTML for the interesting data. Finding the data usually involves writing regular expressions, a pattern-matching syntax that can become complicated quickly. For example, here's a regular expression that extracts a list of books from a purchase circle page [Hack #44]:

<td.*?<b><a.*?-/(.*?)/.*?>(.*?)</a></b>.*?by (.*?)<br>.*?</td>

You can see some HTML there. The expression depends on where the data sits within the HTML and on the fact that the data appears in regular patterns. The three parenthesized groups capture, in order, the path segment that follows "-/" in the link, the link text, and the text between "by " and the next <br> tag. Unfortunately, screen scraping is brittle: if Amazon changes its design even slightly, this particular regular expression may stop matching. The expression would then have to be changed by hand to find the right pattern of data in the new HTML.
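
To make the pattern concrete, here is a minimal Python sketch that applies the expression above to a small HTML fragment. The fragment, the book titles, the ID values, and the names extract_books, SAMPLE_HTML, and REDESIGNED_HTML are all invented for illustration; they are not taken from a real purchase circle page, whose markup may differ. The second fragment swaps <b> for <strong> to show how a small design change leaves the pattern with nothing to match.

```python
import re

# The purchase circle expression quoted above. re.DOTALL lets "." cross
# line breaks, an assumption made here because cells may span lines.
BOOK_PATTERN = re.compile(
    r'<td.*?<b><a.*?-/(.*?)/.*?>(.*?)</a></b>.*?by (.*?)<br>.*?</td>',
    re.DOTALL,
)

# Hypothetical markup shaped the way the expression expects.
SAMPLE_HTML = """
<tr><td valign="top"><b><a href="/exec/obidos/tg/detail/-/0000000001/ref=pc_1">Example Book One</a></b><br>by A. Author<br>Paperback</td></tr>
<tr><td valign="top"><b><a href="/exec/obidos/tg/detail/-/0000000002/ref=pc_2">Example Book Two</a></b><br>by B. Writer<br>Paperback</td></tr>
"""

# The same data after an imagined redesign: <b> has become <strong>.
REDESIGNED_HTML = SAMPLE_HTML.replace("<b>", "<strong>").replace("</b>", "</strong>")


def extract_books(html):
    """Return a list of (id, title, author) tuples found in html."""
    return BOOK_PATTERN.findall(html)


if __name__ == "__main__":
    for book_id, title, author in extract_books(SAMPLE_HTML):
        print(book_id, title, author)
    # The redesigned fragment yields no matches at all.
    print(len(extract_books(REDESIGNED_HTML)))
```

Each tuple holds the three captured groups in the order they appear in the expression, so a script can unpack them directly.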

There are several screen-scraping examples in this chapter, and the general methods of accessing a page and parsing its contents will continue to work. But keep in mind that the regular expressions provided could become obsolete at any time as Amazon changes the pages accessed by the scripts.
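
One way to soften that risk is to keep the page request separate from the pattern, and to treat an empty result as a sign that the pattern needs attention. The Python sketch below is one possible arrangement and is not the approach used by the hacks in this chapter. The function names fetch_page and scrape_books are hypothetical, the latin-1 decoding is an assumption about the page encoding, and the caller must supply the page URL.

```python
import re
from urllib.request import urlopen

# The purchase circle expression from this section, kept in one place so
# it can be updated by hand when the page design changes.
BOOK_PATTERN = re.compile(
    r'<td.*?<b><a.*?-/(.*?)/.*?>(.*?)</a></b>.*?by (.*?)<br>.*?</td>',
    re.DOTALL,
)


def fetch_page(url):
    """Request a page and return its HTML as text."""
    with urlopen(url) as response:
        # Assumption: latin-1 is used because it never fails to decode.
        return response.read().decode("latin-1")


def scrape_books(url, fetch=fetch_page):
    """Fetch url and return (id, title, author) tuples.

    Raises ValueError when nothing matches, which usually means the
    page layout has changed and the pattern is out of date.
    """
    html = fetch(url)
    books = BOOK_PATTERN.findall(html)
    if not books:
        raise ValueError(
            "no books matched at %s; the page layout may have changed" % url
        )
    return books
```

Passing the fetch function as a parameter lets the parsing step be checked against a saved copy of a page without making a network request.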