The spy master database isn't really all that complicated, but the badSpy database shows you a number of ways even a simple database can go wrong. This is a pretty important database, because it will be used to save the free world, so it deserves a little more thought. Fortunately, data developers have come up with a number of ways to think about the structure of data. It is usually best to back away from the computer and think carefully about how data is used before you write a single line of code.
Data developers have come up with a list of rules for creating well-behaved databases:
Break your data into multiple tables.
No field can have a list of entries.
Do not duplicate data.
Make each table describe only one entity.
Create a single primary key field for each table.
A database that follows all these rules will avoid most of the problems evident in the badSpy database. Fortunately, there are some well-known procedures for improving a database so it can follow all these rules.
Data programmers try to prevent the problems evident in the badSpy database through a process called data normalization. The basic concept of normalization is to break down a database into a series of tables. If each of these tables is designed correctly, the database will be less likely to have the sorts of problems described so far in this chapter. Entire books have been written about data normalization, but the process breaks down into three major steps, called normal forms.
The goal of the first normal form (sometimes abbreviated 1NF) is to eliminate repetition in the database. The primary culprit in the badSpy database is the specialty field. One solution would be to have two different tables, one for agents, and another for specialties.
TRICK? |
Data designers seem to play a one-string banjo. The solution to almost every data design problem is to create another table. As you'll see, there still is quite an art form to what should be in that new table. |
The two tables would look somewhat like Tables 9.1 and 9.2.
agentID |
name |
Assignment |
Description |
Location |
---|---|---|---|---|
1 |
Rahab |
Raging Dandelion |
Plant Crabgrass |
Sudan |
2 |
Gold Elbow |
Dancing Elephant |
Infiltrate suspicious zoo |
London |
3 |
Falcon |
Dancing Elephant |
Infiltrate suspicious circus |
London |
specialtyID |
name |
---|---|
1 |
electronics |
2 |
counterintelligence |
3 |
sabotage |
Note that I did not include all data in these example tables, just enough to give you a sense of how these tables would be organized. Also, there isn't really a good way to connect these tables back together yet, but you'll learn that later in this chapter.
Once all your tables are in the first normal form, the next step is to deal with all the potential redundancy issues. These mainly occur because data is entered more than one time. To fix this, you need to (you guessed it) build new tables. The agent table could be further improved by moving all data about operations to another table. Figure 9.3 shows a special diagram called an Entity Relationship diagram, which illustrates the relationships between these tables:
An Entity Relationship diagram (ER diagram) is used to diagram the relationships between data elements. In this situation, I thought carefully about the data in the spy database. As I thought about the data, three distinct entities emerged. By separating the operation data from the agent data, I have removed redundancy, because the user will only enter operational data one time. This will eliminate several of the problems in the original database. It will also fix the situation where an operation's data was lost because a spy turned out to be a double agent. (I'm still bitter about that defection.)
The third normal form concentrates on the elements associated with each entity. In order for a table to be in the third normal form, that table must have a single primary key, and every field in the table must relate only to that key. For example, the description field is a description of the operation, not the agent, so it belongs in the operation table. In the third phase of normalization, you look through each piece of data in your table and ensure that it directly relates to the table it is placed in. If not, you need to either move it to a more appropriate table or build a new table for it.
You might notice that my database fell into third normal form automatically when I put it in second normal form. This is not unusual for very small databases, but rare with the large complex databases used to describe real-world enterprises. Even if your database seems to be in the third normal form already, go through each field to see if it relates directly to its table.
The easiest way to normalize your databases is with a stylized view of them. ER diagrams are commonly used as a data design tool. Take another look at the ER diagram for the spy database in Figure 9.4.
This diagram illustrates the three entities in the spy database (at least up to now) and the relationships between them. Each entity is enclosed in a rectangle, and the lines between the entities represent the relationships between the entities. Take a careful look at the relationship lines. They have crow's-feet on them to indicate some special characteristics of the relationship. There are essentially three kinds of relationships (at least in this simplistic overview of data modeling).
One-to-one relationships happen when each instance of entity A has exactly one instance of entity B. A one-to-one entity is described as a simple line between two entities with no special symbols on either end.
TRICK? |
One-to-one relationships are actually rare, because usually if the two entities are that closely related, they can be combined into one table without any penalty. There are no one-to-one relationships in the spy ER diagram shown in Figure 9.4. |
One-to-many (and many-to-one) relationships happen when one entity can contain more than one instance of the other. For example, each operation can have many spies, but (for the sake of this example) each agent can only be assigned to one mission at a time. Thus the agent-to-operation relationship is considered a many-to-one relationship, because a spy can have only one operation, but one operation can relate to many agents. In this very simplistic version of ER notation, I'm using crow'sfeet to indicate the many side of the relationship.
TRICK? |
There are actually several different kinds of one-to-many relationships, each with a different use and symbol. For this overview, we'll treat them all the same and use the generic crow's-feet symbol, but once you start writing more involved databases, you'll want to investigate data diagramming more closely by looking into books on data normalization and software engineering. Likewise, data normalization is a far more involved topic than the brief discussion in this introductory book. At some point you'll want to study the topic more carefully. |
The final type of relationship shown in the spy ER diagram is a many-to-many relationship. This type of relationship occurs when each entity can have many instances of the other. Agents and skills have this type of relationship, because one agent can have any number of skills, and each skill can be used by any number of agents. A many-to-many relationship is usually shown by crow's-feet on each end of the connecting line.
It's important to generate an ER diagram of your data including the relationship types, because there are different strategies for creating each type of relationship. You'll see these strategies emerge as I build the SQL for the improved spy database.
Professional programmers often use expensive software tools to help build data diagrams, but you don't need anything more than paper and pencil to draw ER figures. I do my best data design with a partner drawing on a white board. I like to talk through designs out loud and look at them in a large format. Once I've got a sense of the design, I usually use a vector-based drawing program to produce a more formal version of the diagram. This type of drawing tool is useful because it allows you to connect elements together, already has the crow's-feet lines available, and allows you to move elements around without disrupting the lines between them. Dia is an excellent open-source program for drawing all kinds of diagrams. I used it to produce all the ER figures in this chapter. A copy of Dia is on the CD that accompanies this book.