A good database design is fundamental to the success of any application. Logical database design for relational databases follows a set of rules called rules of normalization. As a result of normalization, you create a data model that is usually, but not necessarily, translated into a physical data model. A logical database design does not depend on the relational database you intend to use. The same data model can be applied to Oracle, Sybase, SQL Server, or any other relational database. On the other hand, a physical data model makes extensive use of the features of the underlying database engine to yield optimal performance for the application. Physical models are much less portable.
If portability is a big concern to you, consider using a third-party data modeling tool, such as ERwin or ERStudio. These tools have features that make it easier to migrate your logical data models to physical data models for different database platforms. Of course, this will just get you started; to get the best performance out of your design, you will need to tweak the physical design for the platform you have chosen.
Any database designer must address two fundamental issues:
Designing the database in a simple, understandable way that is maintainable and makes sense to its developers and users
Designing the database such that fetching and saving data with the fastest response time, resulting in high performance
Normalization is a technique used by relational databases to organize data across many tables so that related data is kept together based on certain guidelines. Normalization results in controlled redundancy of data; therefore, it provides a good balance between disk space usage and performance. Normalization helps people understand the relationships between data, and enforces rules that ensure the data is meaningful.
Normalization rules exist, among other reasons, to make it easier for people to understand the relationships between the data. But a perfectly normalized database sometimes won't perform well under certain circumstances, and it may be difficult to understand. There are good reasons to deviate from a perfectly normalized database.
Five normalization forms exist. If you follow the rules for the first rule of normalization, your database can be described as "in first normal form." This is represented by the symbol 1NF for first normal form, 2NF for second normal form, and so on.
Each rule of normalization depends on the previous rule for successful implementation, so to be in second normal form (2NF), your database must also follow the rules for first normal form.
A typical relational database used in a business environment falls somewhere between second and third normal form. It is rare to progress past the third normal form because fourth and fifth normal form are more academic than practical in real-world environments.
Following is a brief description of the first three rules of normalization.
The first rule of normalization requires removing repeating data values and specifies that no two rows can be identical in a table. This means that each table must have a logical primary key that uniquely identifies a row in the table.
Consider a table that has four columns, PublisherName, Title1, Title2, and Title3 for storing up to three titles for each publisher. This table is not in first normal form due to the repeating Title columns. The main problem with this design is it limits the number of titles associated with a publisher to three.
Removing the repeating columns so there is just a PublisherName column and a single Title column puts the table in first normal form. A separate data row is stored in the table for each title published by each publisher. The combination of PublisherName and Title becomes the primary key that uniquely identifies each row and prevents duplicates.
A table is considered to be in second normal form if it conforms to the first normal form and all nonkey attributes of the table are fully dependent on the entire primary key. If the primary key consists of multiple columns, then nonkey columns should depend on the entire key and not just on a part of the key. A table with a single column as the primary key is automatically in second normal form.
Assume you need to add the Publisher address to the database. Adding it to the table with the PublisherName and Title column would violate second normal form. The primary key consists of both PublisherName and Title, but the PublisherAddress attribute is an attribute of the Publisher only. It does not depend on the entire primary key.
To put the database in second normal form requires adding an additional table for storing publisher information. One table consists of the PublisherName column and PublisherAddress. The second table contains the PublisherName and Title columns. To retrieve the PublisherName, Title, and PublisherAddress information in a single result would require a join between the two tables on the PublisherName column.
A table is considered to be in third normal form if it already conforms to the first two normal forms and none of the nonkey columns are dependent on any other nonkey columns. All such attributes should be removed from the table.
Following is an example that comes up often during database architecture. Suppose that an employee table has four columns: EmployeeID (the primary key), salary, bonus, and total_salary, where total_salary = salary + bonus. Existence of the total_salary column in the table violates the third normal form because a nonkey column (total_salary) is dependent on two other nonkey columns (salary and bonus). Therefore, to conform to the third rule of normalization, you must remove the total_salary column from the employee table.
The following are the major advantages of normalization:
Because information is logically kept together, normalization provides a better overall understanding of the system.
Because of controlled redundancy of data, normalization can result in fast table scans and searches (less physical data has to be processed).
Because tables are smaller with normalization, index creation and data sorts are much faster.
With less redundant data, it is easier to maintain referential integrity for the system.
Normalization results in narrower tables. Because you can store more rows per page, more rows can be read and cached for each I/O performed on the table. This results in better I/O performance.
One result of normalization is that data is stored in multiple tables. To retrieve or modify information, you usually have to establish joins across multiple tables. Joins are expensive from an I/O standpoint. Multitable joins can have an adverse impact on the performance of the system. The following sections discuss some of the denormalization techniques that you can use to improve the performance of the system.
A wise, old database architect I knew passed on a wise, old adage to me: "Normalize 'til it hurts, denormalize 'til it works." To put this into use, try to put your database in third normal form initially. After this is done, when you're ready to implement the physical structure, drop back from third normal form where excessive table joins are hurting performance. One of the common mistakes I've seen is when developers make too many assumptions and over- denormalize the database design before even a single line of code has been written to even begin to assess the database performance.