Geocoding

A fundamental service required by many mobile location service applications is the ability to look up an address on a map. This is known as geocoding. For example, if your location service application requires a route calculation between two locations, it is necessary to first geocode the start and end points of the route to make sure that they are valid addresses and that they can be found in the map database. For mobile location services it is necessary to have an address level geocoder (one that can geocode to a specific street address). There are many products that geocode to either a postal code or street level of accuracy, but neither provides sufficient accuracy for mobile location services.

There are many challenges in geocoding, most caused by the complexity of address schemes and the process of trying to clean up and interpret address input by a user. One challenge, especially in mobile location services, is that address schemes differ significantly by geographic region. In North America, it is common for street addressing to be sequential, with odd numbers on one side of the street and even numbers on the other. In Europe, it is common for street addressing to increase up one side of the street and decrease down the other side. This could mean that a building with a street address of 15 or 16 might face a building with a street address of 435 or 436. Instead of a single address number as is common in the United States, it isn't uncommon for buildings in Europe to actually have an address range, such as 46-50 Coombe Road.

Whereas it is common in English (United States and United Kingdom) to write the address number before the street name, in many other European languages it is intuitive to put the address number after the street name and the postal code before the city, such as Lenbachplatz 3, 80333 München. If you have traveled to Japan, you might recall yet another addressing scheme, where the first building built in a particular region is numbered 1, the second 2, and so forth. Differences such as these require highly complex rule systems to analyze the address and good map data to make geocoding effective.

Technical Definition of Geocoding

A more technical definition of geocoding is the process of associating an address with geographic features. The geographic features are often represented by a line, such as a street center line database. Typically, each segment of the street center line has attributes such as high and low address range (or left and right address range), street name(s), the city, postal code, and many others. A simple example shows street addressing in two sample street center line segments (Figure 4.12).

Figure 4.12. Street Address Range Example.
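
The attributes of a street center line segment can be pictured as a small record. The following is a minimal sketch, not the schema of any particular map database: the class name, field names, and sample values are hypothetical, and it assumes the North American convention of odd numbers on one side of the street and even numbers on the other.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StreetSegment:
    """One street center line segment (hypothetical schema)."""
    name: str
    city: str
    postal_code: str
    left_low: int    # lowest address number on the left side
    left_high: int   # highest address number on the left side
    right_low: int   # lowest address number on the right side
    right_high: int  # highest address number on the right side

    def side_of(self, number: int) -> Optional[str]:
        """Return 'left' or 'right' if the number is in that side's range."""
        sides = (("left", self.left_low, self.left_high),
                 ("right", self.right_low, self.right_high))
        for side, low, high in sides:
            # Assumption: each side holds numbers of a single parity.
            if low <= number <= high and number % 2 == low % 2:
                return side
        return None


# Made-up sample segment: odd numbers on the left, even on the right.
segment = StreetSegment("MAIN STREET", "SPRINGFIELD", "00000",
                        101, 199, 100, 198)
print(segment.side_of(150))  # right
print(segment.side_of(151))  # left
print(segment.side_of(250))  # None
```

A real segment record would carry many more attributes, as noted above, and regions with other numbering schemes would need a different range test.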


How Does Geocoding Work?

Geocoding is generally done in four steps. First, the address is input to the geocoding system. Second, the address is analyzed, parsed, and placed into a standard format. Third, a soundex search is done for the city and street name, and an address range search is done for any matches found in the soundex search. Finally, a scoring system is used to rank the possible matches. If a single match is found, its geographic coordinate (e.g., projected latitude and longitude) is returned. If multiple matches are found, they are returned ranked by the scoring system so the user can select the best match. If no matches are found, the geocoding system logs an error and returns an error message.

Address Input

Geocoding systems need a way to receive the address from the user. This might include a Java constructor, an XML document over HTTP, or a proprietary protocol. The most advanced systems allow a near free-form address input, such as one text input for address number and street name and another text input for city, region, and postal code. The major advantage of this system is that the mobile location services application developer does not have to do the complex parsing and error correction that is best left to the geocoder, and the user inputs the address the same intuitive way he or she might address a letter. Less sophisticated systems require a separate text input for each element. The more discrete inputs required, the less sophisticated the geocoder.

If the application developer is required to do the validation and error checking, the input will typically be checked only to make sure that it conforms to the data types that the geocoding engine is expecting. More sophisticated error checking is too time consuming to implement, and the result is a less than satisfactory mobile location service application.

Address Standardization

Once the geocoding engine has received the address, it attempts to parse it and standardize it. If the geocoder is region specific (e.g., only designed to work with U.S. map data) the standardization is simpler. Address interpretation might appear as follows:

Address list before parsing and standardization:

  1. 1000 Main Street, Suite 100

  2. 555 California Avenue

  3. 500 South 300 West

  4. 1121 3rd Street

  5. 501 Avenue G

  6. 15 Jefferson Apt# 1

Address list after parsing and standardization:

Address1  Name1       Suffix  Direction  Address2  Name2
00001000  MAIN        STREET             0000100
00000555  CALIFORNIA  AVENUE
00000500  300                 WEST
00001121  3RD         STREET
00000501  G           AVENUE
00000015  JEFFERSON                      0000001
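
The parsing step for simple U.S.-style addresses can be sketched as follows. This is an illustration under stated assumptions, not the rule system of a production geocoder: the function name, the word lists, and the unit pattern are hypothetical, the field widths (8 digits for Address1, 7 for Address2) are copied from the sample table, and a leading directional in a coordinate address is dropped because the sample table has no column that keeps it.

```python
import re

# Hypothetical, deliberately tiny vocabularies for this sketch.
DIRECTIONS = {"NORTH", "SOUTH", "EAST", "WEST", "N", "S", "E", "W"}
SUFFIXES = {"STREET", "AVENUE", "ROAD", "BOULEVARD", "DRIVE"}
# Trailing secondary unit such as ", Suite 100" or " Apt# 1".
UNIT = re.compile(r"[,\s]+(?:SUITE|STE|APT|UNIT)\b\s*#?\s*(\d+)\s*$")
FIELDS = ["Address1", "Name1", "Suffix", "Direction", "Address2", "Name2"]


def standardize(raw: str) -> dict:
    """Parse one street address line into the six fields of the table."""
    text = raw.upper().strip()
    record = dict.fromkeys(FIELDS, "")

    # Split off the secondary unit, if any, and zero-pad its number.
    unit = UNIT.search(text)
    if unit:
        record["Address2"] = unit.group(1).zfill(7)
        text = text[:unit.start()]

    tokens = text.replace(",", " ").split()
    if not tokens or not tokens[0].isdigit():
        raise ValueError(f"no leading address number in {raw!r}")
    record["Address1"] = tokens.pop(0).zfill(8)

    # A trailing directional becomes the Direction field.
    if len(tokens) > 1 and tokens[-1] in DIRECTIONS:
        record["Direction"] = tokens.pop()
        # Coordinate addresses ("500 South 300 West") also start with a
        # directional; the sample table does not keep it, so drop it.
        if len(tokens) > 1 and tokens[0] in DIRECTIONS:
            tokens.pop(0)

    # The street type may follow the name ("Main Street")
    # or precede it ("Avenue G").
    if len(tokens) > 1 and tokens[-1] in SUFFIXES:
        record["Suffix"] = tokens.pop()
    elif len(tokens) > 1 and tokens[0] in SUFFIXES:
        record["Suffix"] = tokens.pop(0)

    record["Name1"] = " ".join(tokens)
    return record


for line in ["1000 Main Street, Suite 100", "555 California Avenue",
             "500 South 300 West", "1121 3rd Street",
             "501 Avenue G", "15 Jefferson Apt# 1"]:
    print(standardize(line))
```

Even this small sketch needs special cases for coordinate addresses and for street types that precede the name, which hints at how large a real rule set becomes.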

 

Perform Soundex and Address Range Search

Once the address has been standardized, the geocoding engine attempts to find a match in the map database. If an exact match is not available, the system might do a soundex search to find street names that are similar so the user may choose the best match.

Soundex is a phonetic coding technique, patented in 1918 and later adopted by the U.S. government to index surnames in U.S. census records. A soundex index is based on the way a word sounds rather than the way a word is spelled. Each entry in the index is a combination of one letter and three digits. The letter is the first letter of the original word. The three digits encode the consonants that follow it, with zeros added if fewer than three digits result.

Soundex Coding Guide

  1. B, F, P, V

  2. C, G, J, K, Q, S, X, Z

  3. D, T

  4. L

  5. M, N

  6. R

The letters A, E, I, O, U, H, W, and Y are ignored. Double letters are treated as a single letter. Adjacent letters that have the same soundex value are treated as a single letter. Words with a prefix are coded both with and without the prefix. More details and rules are available from the U.S. National Archives and Records Administration at http://www.nara.gov/genealogy/soundex/soundex.html.

Soundex Encoding Examples

SMITH       S-530 (S, 5 for the M, 3 for the T, 0 added)
SMYTH       S-530 (S, 5 for the M, 3 for the T, 0 added)
WASHINGTON  W-252 (W, 2 for the S, 5 for the N, 2 for the G)
JACKSON     J-250 (J, 2 for the C, K ignored, S ignored, 5 for the N, 0 added)
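
The coding rules can be expressed in a few lines of code. The sketch below follows the commonly documented American Soundex rules, in which H and W do not separate two letters with the same code but a vowel does. The function name is an assumption, and the prefix rule (coding a name both with and without its prefix) is left out.

```python
# Build the letter-to-digit table from the coding guide.
CODES = {}
GROUPS = ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"]
for digit, letters in enumerate(GROUPS, start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(word: str) -> str:
    """Return a code such as 'S-530' for a single word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # Track the previous code so adjacent duplicates are coded once.
    previous = CODES.get(first, "")
    for letter in letters[1:]:
        code = CODES.get(letter, "")  # "" for ignored letters
        if code and code != previous:
            digits.append(code)
        # H and W are skipped entirely; vowels reset the previous code.
        if letter not in "HW":
            previous = code
    # Keep three digits, padding with zeros when there are fewer.
    return first + "-" + "".join(digits)[:3].ljust(3, "0")


for name in ("SMITH", "SMYTH", "WASHINGTON", "JACKSON"):
    print(name, soundex(name))
# SMITH S-530
# SMYTH S-530
# WASHINGTON W-252
# JACKSON J-250
```

Because only the first letter and three digits survive, many unrelated street names share a code, so soundex results are best treated as candidates to be filtered and scored rather than as final matches.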

If the soundex search does not provide any matches, the user would be given an error and asked to enter another address. If the geocoder is able to find one or more street name matches, it performs an address range search to make sure that the address requested is valid for the street name. If the address range is valid, the geographic position is assigned in the appropriate format (e.g., projected latitude and longitude).
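
The search flow described here (try the street name, fall back to a phonetic match, then check the address range) might look like the sketch below. All names are hypothetical and the coordinates are made up. To keep the example short, a small table of precomputed soundex codes stands in for a real soundex function, and each segment carries a single position rather than full geometry.

```python
from typing import Callable, Dict, List, NamedTuple


class Segment(NamedTuple):
    street: str
    low: int     # lowest address number on the segment
    high: int    # highest address number on the segment
    lat: float   # illustrative position only
    lon: float


def find_candidates(street: str, number: int, segments: List[Segment],
                    phonetic_key: Callable[[str], str]) -> List[Segment]:
    """Return segments that match the street and contain the number."""
    # Prefer exact street-name matches.
    named = [s for s in segments if s.street == street]
    if not named:
        # Fall back to names that share the same phonetic code.
        wanted = phonetic_key(street)
        named = [s for s in segments if phonetic_key(s.street) == wanted]
    # Address range search: keep segments whose range holds the number.
    return [s for s in named if s.low <= number <= s.high]


# Precomputed codes stand in for a soundex function (assumption).
CODES: Dict[str, str] = {"MAIN": "M-500", "MANE": "M-500", "MAPLE": "M-140"}
SEGMENTS = [
    Segment("MAIN", 100, 198, 40.0010, -111.0020),
    Segment("MAIN", 200, 298, 40.0020, -111.0020),
    Segment("MAPLE", 100, 198, 40.0050, -111.0100),
]

matches = find_candidates("MANE", 250, SEGMENTS, CODES.__getitem__)
print([(m.street, m.low, m.high) for m in matches])
# [('MAIN', 200, 298)]
print(find_candidates("MAPLE", 500, SEGMENTS, CODES.__getitem__))
# []
```

An empty result corresponds to the error case above: the street name was found, but the requested number is not valid for it.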

Apply Scoring Rules

Now that the geocoding engine has a set of potential results, each result is scored according to certain criteria, which might include the following:

  • Whether the street name was an exact match

  • Whether the street type matched (Avenue or Street)

  • Whether the direction matched, if the street had a directional attribute (e.g., north or southwest)

  • Whether the city, zone, or postal code matches

For example, the scoring system might run from 1 to 100, with 100 being a perfect match. Every match candidate would start at 100, and points would be subtracted for failure of various tests. Items such as street name not found in a soundex search might subtract 10 points, and a postal code that does not match might subtract 50 points. Once the matches are ranked in the scoring system, business logic can determine whether the geocoding engine will return one or multiple proposed matches.
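
A subtractive scoring scheme of this kind can be sketched as follows. The 10-point and 50-point penalties come from the example above, with the 10-point penalty interpreted here as a street name that is not an exact match. The two 5-point penalties, the field names, and the sample records are assumptions made for illustration.

```python
# Points subtracted when a field fails to match.
PENALTIES = {
    "street_name": 10,   # from the text (interpreted as inexact name)
    "street_type": 5,    # assumed value
    "direction": 5,      # assumed value
    "postal_code": 50,   # from the text
}


def score(query: dict, candidate: dict) -> int:
    """Start at 100 and subtract a penalty for each failed test."""
    total = 100
    for field, penalty in PENALTIES.items():
        # Direction is tested only if the street has a directional.
        if field == "direction" and not candidate.get("direction"):
            continue
        if query.get(field, "") != candidate.get(field, ""):
            total -= penalty
    return total


def rank(query: dict, candidates: list) -> list:
    """Order candidates from best to worst score."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)


# Made-up query and candidates.
query = {"street_name": "MAIN", "street_type": "STREET",
         "direction": "", "postal_code": "11111"}
candidates = [
    {"street_name": "MAIN", "street_type": "STREET",
     "direction": "", "postal_code": "22222"},
    {"street_name": "MAINE", "street_type": "AVENUE",
     "direction": "", "postal_code": "11111"},
    {"street_name": "MAIN", "street_type": "STREET",
     "direction": "", "postal_code": "11111"},
]

for c in rank(query, candidates):
    print(score(query, c), c["street_name"], c["street_type"])
# 100 MAIN STREET
# 85 MAINE AVENUE
# 50 MAIN STREET
```

Business logic could then return only the top candidate when its score clears a threshold, or the whole ranked list when several candidates score closely.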

What Makes Geocoding So Difficult?

Address Cleanup

Address cleanup is one of the greatest challenges in providing a high-quality geocoder. Typical problems include the following:

  • Numeric street names

10 1st Street

  • Addresses with more than one directional

123 W Main Street East

  • Alphanumeric addresses

100A Mission Street

  • Fractional addresses

45½ Bee Street

  • Coordinate addresses (Utah)

520 East 400 South

  • Addresses with dashes (Hawaii and Queens, NY)

101-123 Kaanapali Road

  • Street names with numeric components

1234 10 Mile Road

  • Street names that are directionals

South Street

  • Street names that are suffixes (Brooklyn, NY)

Avenue G

  • Spelled out address numbers

Two Second Street

Differing Address Standards

As previously mentioned, address standards vary drastically from region to region. To be effective, geocoders must be locally adapted, tested, and tuned. Language has a significant impact on how addresses will be input, and in many regions it is necessary to support many different ways to enter addresses. A geocoder in Germany must know that München and Munich are the same place and understand an address input with the address number before or after the street name, and a postal code either before or after the city name.

Soundex Mismatches

Soundex is not a perfect technology, and there are many proprietary enhancements that could increase its effectiveness in matching an address. Bad matches add processing time and could present the user with unintuitive choices. This is particularly true given that soundex was developed for analyzing surnames, and has been adapted to work with street names.

Static Map Database and Dynamic Communities

Map database releases are typically done two to four times per year, but new roads and buildings are constantly being constructed. Applications that have a central map database and thin clients have the advantage of being more up to date than systems that require map databases on CD, such as the onboard navigation systems common in 2001 and newer cars. Users become frustrated when they are directed to places that don't exist. It is unlikely users will buy and install new CDs four times per year, or that a CD-based navigation system could be released four times per year. Offboard navigation systems that use the mobile network to process spatial requests on a remote server are a better option.

Rural Delivery and Post Office Boxes

Rural delivery and post office boxes present another series of complications for geocoding. Such addresses identify a mail delivery point rather than a building, so the geographic position found might not be useful to an application that requires the physical location.

Site Address and Billing Address

Ambiguity is possible when a site has both a physical address and a billing address. Certain applications might need the physical address, whereas others require the billing address. A method to distinguish the two is necessary.

Why Is Geocoding Important to Mobile Location Services?

Significant functionality in mobile location services applications depends on being able to accurately pinpoint and direct users to very specific locations. Users save time by relying on the intelligent business logic and the large knowledge bases built into mobile location services systems. However, users have very little patience with systems that direct them to the wrong place. A user might not mind (or know) if a route that has been calculated is the absolute fastest, but the user surely knows when he or she is directed to the wrong place or a place that does not exist. For location services applications to be successful, it is crucial that map data be current and that a high-quality geocoding product is used (and properly integrated if necessary). Equally important is reverse geocoding, the process of taking a geographic position (e.g., projected latitude and longitude calculated by a positioning system) and transforming it into the nearest road segment in the map database. Success in these basic location service functions is the cornerstone for success in developing higher level applications such as real-time traffic.
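
Reverse geocoding to the nearest road segment can be illustrated with a small geometric sketch. It assumes projected (planar) coordinates, so straight-line distance is meaningful, and it scans every segment, whereas a real system would use a spatial index. The function names and sample segments are hypothetical.

```python
import math


def distance_to_segment(p, a, b):
    """Distance from point p to the line segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        # Degenerate segment: both end points coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the line, then clamp to the segment's end points.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def nearest_segment(point, segments):
    """Return the (name, start, end) tuple closest to the point."""
    return min(segments,
               key=lambda s: distance_to_segment(point, s[1], s[2]))


# Made-up segments in projected units (e.g., meters).
SEGMENTS = [
    ("MAIN STREET", (0, 0), (100, 0)),
    ("FIRST AVENUE", (0, 50), (0, 150)),
]

print(nearest_segment((40, 10), SEGMENTS)[0])   # MAIN STREET
print(nearest_segment((5, 120), SEGMENTS)[0])   # FIRST AVENUE
```

The nearest segment is not always the road the user is actually on, for example near intersections or parallel roads, so positioning error matters as much here as map quality.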