Access Database Design and Programming, 2nd Edition

 

Buy this book

Chapter 4: Database Design Principles

Steven Roman
O'Reilly & Associates, Inc.

Reproduced from Access Database Design & Programming, by Steven Roman, by permission of O'Reilly & Associates, Inc. ISBN 1-56592-626-9. Copyright 1999, 1997, O'Reilly & Associates. All rights reserved. For further information, please contact nuts@oreilly.com, or call 1-800-998-9938, or visit their Web site at http://www.oreilly.com.

Buy this book

Contents

Redundancy
Normal Forms
First Normal Form
Functional Dependencies
Second Normal Form
Third Normal Form
Boyce-Codd Normal Form
Normalization

In Chapter 1, Introduction, we tried to present a convincing case for why most databases should be modeled as relational databases, rather than single-table flat databases. We tried to make it clear why we split the single LIBRARY_FLAT table into four separate tables: AUTHORS, BOOKS, PUBLISHERS, and BOOK/AUTHOR.

However, for large real-life databases, it is not always clear how to split the data into multiple tables. As we mentioned in Chapter 1, the goal is to do this in such a way as to minimize redundancy, without losing any information.

The problem of effective database design is a complex one. Most people consider it an art rather than a science. This means that intuition plays a major role in good design. Nonetheless, there is a considerable theory of database design, and it can be quite complicated. Our goal in this chapter is to touch upon the general ideas, without becoming involved in the details. Hopefully, this discussion will provide a helpful guide to the intuition needed for database design.

Redundancy

As we saw in Chapter 1, redundant data tends to inflate the size of a database, which can be a very serious problem for medium to large databases. Moreover, redundancy can lead to several types of anomalies, as discussed earlier. To understand the problems that can arise from redundancy, we need to take a closer look at what redundancy means.

Let us begin by observing that the attributes of a table scheme can be classified into three groups:

  • Attributes used strictly for identification purposes
  • Attributes used strictly for informational purposes
  • Attributes used for both identification and informational purposes

For example, consider the table scheme:

{PubID,PubName,PubPhone,YearFounded}

In this scheme, PubID is used strictly for identification purposes. It carries no informational content. On the other hand, YearFounded is strictly for informational purposes in this context. It gives the year that the publishing company was founded, but is not required for identification purposes.

Consider also the table scheme:

{Title,PubID,AuID,PageCount,CopyrightDate}

In this case, if we assume that there is only one book of a given title published by a given publisher and written by a given author, then {Title,PubID,AuID} is a key. Hence, each of these attributes is used (at least in part) for identification. However, Title is also an informational attribute.

We should hasten to add that these classifications are somewhat subjective, and depend upon the assumptions made about the entity class. Nevertheless, this classification does provide a useful intuitive framework.

We can at least pin down the strictly informational attributes a bit more precisely by making the following observation. The sign that an attribute is being used (at least in part) for identification purposes is that it is part of some key. Thus, an attribute that is not part of any key is being used, in that table scheme, strictly for informational purposes. Let us call such an attribute a strictly informational attribute.

Now consider the table shown in Table 4-1. In this case, both Title and PubName are strictly informational, since {ISBN} is the only key, and neither Title nor PubName is part of that key. However, the values of Title are not redundant (the fact that they are the same does not mean that they are not both required), whereas the values of PubName are redundant.

Table 4-1. A Table with Two Informational Attributes

ISBN Title PubID PubName
1-1111-1111-1 C++ 1 Big House
0-91-335678-7 Faerie Queene 1 Big House
1-011-22222-0 C++ 2 ABC Press

The reason that Title is not redundant is that there is no way to eliminate any of these titles. Each book entity must have its title listed somewhere in the database--one title per ISBN. Thus, the two titles C++ must both appear somewhere in the database.

On the other hand, PubName is redundant, as can easily be seen from the fact that the same PubName is listed twice without adding any new information to the database. To look at this another way, consider the table with two cells blank in Table 4-2. Can you fill in the title field for the last row? Not unless you call the publisher to get the title for that ISBN. In other words, some information is missing. On the other hand, you can fill in the blank PubName field.

Table 4-2. A Table with Blank Cells to Illustrate Attribute Dependency

ISBN Title PubID PubName
1-1111-1111-1 Macbeth 1 Big House
2-2222-2222-2 Hamlet 1  
5-555-55555-5   2 ABC Press

The issue here is quite simple. The Title attribute depends only upon the ISBN attribute and {ISBN} is a key. In other words, Title depends only upon a key. However, PubName depends completely upon PubID, which is not a key for this table scheme. (Of course, PubName also depends on the key {ISBN}, but that is not relevant.)

Thus, we have seen a case where redundancy results from the fact that one attribute depends upon another attribute that is not a key. Armed with this observation, we can move ahead.

Normal Forms

Those who make a study of database design have identified a number of special forms, or properties, or constraints that a table scheme may possess, in order to achieve certain desired goals, such as minimizing redundancy. These forms are called normal forms. There are six commonly recognized normal forms, with the inspired names:

  • First normal form (or 1NF)
  • Second normal form (or 2NF)
  • Third normal form (or 3NF)
  • Boyce-Codd normal form (or BCNF)
  • Fourth normal form (or 4NF)
  • Fifth normal form (or 5NF)

We will consider the first four of these normal forms, but only informally. Each of these normal forms is stronger than its predecessors. Thus, for instance, a table scheme that is in third normal form is also in second normal form. While it is generally desirable for the table schemes in a database to have a high degree of normalization, as we will see in this chapter, the situation is not as simple as it may seem.

For instance, requiring that all table schemes be in BCNF may, in some cases, cause some loss of information about the various relationships between the table schemes. In general, it is possible to manipulate the data to achieve third normal form for all table schemes, but this may turn out to be far more work than it is worth.

The plain fact is that forcing all table schemes to be in a particular normal form may require some compromises. Each individual situation (database) must be examined on its own merit. It is impossible to make general rules that apply in all situations.

The process of changing a database design to produce table schemes in normal form is called normalization.

First Normal Form

First normal form is very simple. A table scheme is said to be in first normal form**if the attribute values are indivisible. To illustrate, we considered in Chapter 1 the question of including all the authors of a book in a single attribute, called Authors. Here is an example entity:

ISBN = 0-55-123456-9
Title = Main Street
Authors = Jones, H. and Smith, K.
Publisher = Small House

Since the table scheme in this case allows more than one author name for the Authors attribute, the scheme is not in first normal form. Indeed, one of the obvious problems with the Authors attribute is that it is impossible to sort the data by individual author name. It is also more difficult to, for instance, prepare a mailing label for each author, and so on.

Attributes that allow only indivisible values are said to be scalar attributes or atomic attributes. By contrast, an attribute whose values can be, for example, a list of items (such as a list of authors) is said to be a structured attribute. Thus, a table scheme is in first normal form if all of its attributes are atomic. Good database design almost always requires that all attributes be atomic, so that the table scheme is in first normal form.

In general, making the adjustments necessary to ensure first normal form is not hard, and it is a good general rule that table schemes should be put in first normal form. However, as with the other normal forms (and even more so the higher up we go) each situation must be considered on its own merits. For instance, a single field might be designed to hold a street address, such as 1333 Bessemer Street. Whether the house number and the street name should be separated into distinct attributes is a matter of context. Put another way, whether or not a street address is atomic depends upon the context. If there is reason to manipulate the street numbers apart from the street names, then they should certainly constitute their own attribute. Otherwise, perhaps not.

Functional Dependencies

Before we can discuss the other normal forms, we need to discuss the concept of functional dependency, which is used to define these normal forms. This concept is quite simple, and we have actually been using it for some time now. As an example, we have remarked that, for the Publishers table scheme, the PubName attribute depends completely on the PubID attribute. (More properly, we should say that the value of the PubName attribute depends completely on the value of the PubID attribute, but the above shorthand is convenient.) Thus, we can say that the functional dependency from PubID to PubName, written:

PubID  -->  PubName

holds for the Publishers table scheme. This can be read "PubID determines PubName" or "PubName depends on PubID."

More generally, suppose that {A1,...,Ak} are attributes of a table scheme and that {B1,...,Bn} are also attributes of the same table scheme. We do not require that the Bs be different from the As. Then the attributes B1,...,Bndepend on the attributes A1,...,Ak, written:

{A1,. . .,Ak}  -->  {B1,. . .,Bn}

if the values of A1,...,Ak completely determine the values of B1,...,Bn. Our main interest is when there is only one attribute on the right:

{A1,. . .,Ak}  -->  {B}

For instance, it is probably safe to say that:

{PubName,PubPhone}  -->  {PubID}

which is just another way of saying that there is only one publisher with a given name and phone number (including area code).

It is very important to understand that a functional dependency means that the attributes on the left completely determine the attributes on the right for now and for all time to come, no matter what additional data may be added to the database. Thus, just as the concept of a key relates to entity classes (table schemes) rather than individual entity sets (tables), so does functional dependency. Every table scheme has its set of associated functional dependencies, which are based on the meaning of the attributes.

Recall that a superkey is a set of attributes that uniquely determines an entity. Put another way, a superkey is a set of attributes upon which all other attributes of the table scheme are functionally dependent.

Some functional dependencies are obvious. For instance, an attribute functionally depends upon itself. Also, any set of attributes functionally determines any subset of these attributes, as in:

{A,B,C}  -->  {A,B}

This just says that if we know the values of A, B, and C, then we know the value of A and B! Such functional dependencies are not at all interesting, and are called trivial dependencies. All other dependencies are called nontrivial.

Second Normal Form

Intuitively, a table scheme T is in second normal form, or 2NF, if all of the strictly informational attributes (attributes that do not belong to any key) are attributes of the entities in the table scheme, and not of some other class of entities. In other words, the informational attributes provide information specifically about the entities in this entity class and not about some other entities.

Let us illustrate with an example.

Consider a simplified table scheme designed to store house addresses. One possibility is:

{City,Street,HouseNumber,HouseColor,CityPopulation}

The CityPopulation attribute is out of place here, because it is an attribute of cities, not house addresses. More specifically, CityPopulation is strictly an informational attribute (not for identification of houses) but it gives information about cities, not house addresses. Thus, this table scheme is not in second normal form.

We can be a little bit more formal about the meaning of second normal form as follows. Referring to the previous example, we have the dependency:

{City}  -->  {CityPopulation}

where CityPopulation does not belong to any key, and where City is a proper subset of a key, namely, the key {City, Street, HouseNumber}. (By proper subset, we mean a subset that is not the whole set.)

A table scheme is in 2NF if it is not possible to have a dependency of the form:

{A1,. . .,Ak}  -->  {B}

where B does not belong to any key (is strictly informational) and {A1,...,Ak} is a proper**subset of some key, and thus does not identify the entities of this entity class, but rather identifies the entities of some other entity class.

Let us consider another example of a table scheme that is not in second normal form.

Consider the following table scheme, and assume for the purposes of illustration that, while there may be many books with the same title, no two of them have the same publisher and author:

{Title,PubID,AuID,Price,AuAddress}

Thus, {Title, PubID, AuID} is the only key. Now, AuAddress does not belong to any key, but it depends upon {AuID}, which is a proper subset of the key, in symbols:

{AuID}  -->  {AuAddress}

Hence, this table scheme is not in second normal form. In fact, AuAddress is not a piece of information about the entities modeled in the table scheme (i.e., books), but rather about authors. Of course, we could remove the AuAddress attribute to bring the table scheme into second normal form. (If each publisher charged a single price for all of its books, then Price would also cause a violation of second normal form, but this is not the case, of course.)

Third Normal Form

Second normal form is good, but we can do better. We have seen that if a table scheme is in second normal form, then no strictly informational attribute depends on a proper subset of a key. However, there is another undesirable possibility. Let us illustrate with an example.

Consider the following table scheme and assume, for the purposes of illustration, that no two books with the same title have the same publisher:

{Title,PubID,PageCount,Price}

The only key for this table scheme is {Title,PubID}. Both PageCount and Price are informational attributes only.

Now, let us assume that each publisher decides the price of its books based solely on the page count. First, we observe that this table is in second normal form. To see this, consider the proper subsets of the key. These are:

{Title} and {PubID}

But none of the dependencies:

{Title}  -->  {PageCount}
{Title}  -->  {Price}
{PubID}  -->  {PageCount}
{PubID}  -->  {Price}

hold for this table scheme. After all, knowing the title does not determine the book, since there may be many books of the same title, published by different publishers. Hence, the table is in second normal form.

It is also not correct to say that:

{PageCount}  -->  {Price}

holds, because different publishers may use different price schemes, based on page count. In other words, one publisher may price books over 1000 pages at one price, whereas another may price books over 1000 pages at a different price. However, it is true that:

{PubID,PageCount}  -->  {Price}

holds. In other words, here we have an informational attribute (Price) that depends not on a proper subset of a key, but on a proper subset of a key (PubID) together with another informational attribute (PageCount).

This is bad, since it may produce redundancy. For instance, consider Table 4-3. Note that the price attribute is redundant. After all, we could fill in the Price value for the third row if it were blank, because we know that PubID 2 charges $34.95 for 500-page books.

Table 4-3. Redundant Data in a Table

Title PubID PageCount Price
Moby Dick 1 500 29.95
Giant 2 500 34.95
Moby Dick 2 500 34.95

We can summarize the problem with the dependency:

{PubID,PageCount}  -->  {Price}

by saying that the attribute Price depends upon a set of attributes:

{PubID,PageCount}

that is not a key, not a superkey, and not a proper subset of a key. It is a mix containing one attribute from the key {Title,PubID} and one attribute that is not in any key.

With this example in mind, we can now define third normal form. A table scheme is in third normal form, or 3NF, if it is not possible to have a dependency of the form:

{A1,. . .,Ak}  -->  {B}

where B does not belong to any key (is strictly informational) and {A1,...,Ak} is not a superkey. In other words, third normal form does not permit any strictly informational attribute to depend upon anything other than a superkey. Of course, superkeys determine all attributes, including strictly informational attributes, and so all attributes depend on any superkey. The point is that, with third normal form, strictly informational attributes depend only on superkeys.

Boyce-Codd Normal Form

It is possible to find table schemes that are in third normal form, but still have redundancy. Here is an example.

Consider the table scheme {City,StreetName,ZipCode}, with dependencies:

{City,StreetName}  -->  {ZipCode}

and:

{ZipCode}  -->  {City}

(Although in real life, a zip code may be shared by two different cities, we will assume otherwise for the purposes of illustration.) This table scheme is in third normal form. To see this, observe that the keys are {City,StreetName} and {ZipCode,StreetName}. Hence, no attribute is strictly informational and there is nothing to violate third normal form.

On the other hand, consider Table 4-4. We can fill in the blank city name because {ZipCode}-->{City}.

Table 4-4. A Table with Dependencies

City StreetName ZipCode
Los Angeles Hollywood Blvd 95000
  Vine St 95000

The problem here is with the dependency:

{ZipCode}-->{City}

which does not violate third normal form because, as we have mentioned, {City} is not strictly informational.

The previous example gives us the idea to strengthen the condition in the definition of third normal form, by dropping the requirement that B be strictly informational. Thus, we can define our last, and strongest, normal form. A table scheme is in Boyce-Codd normal form, or BCNF, if it is not possible to have a dependency of the form:

{A1,. . .,Ak}  -->  {B}

where {A1,...,Ak} is not a superkey. In other words, BCNF form does not permit any attribute to depend upon anything other than a superkey.

As mentioned earlier, all attributes must depend on any superkey, by the very definition of superkey. Thus, BCNF is the strongest possible restriction of this type--it says that an attribute is not allowed to depend on anything other than a superkey.

Normalization

As we mentioned earlier, the process of changing a database design to produce table schemes in normal form is called normalization.

As a very simple example, the table scheme:

{ISBN,Title,Authors}

is not even in first normal form, because the Authors attribute might contain more than one author and is therefore not atomic. By trading in this table scheme for the two schemes:

{ISBN,Title,AuID} and {AuID,AuName}

we have normalized the database into first normal form.

Here is another example involving the higher normal forms.

Recall from an earlier example that the table scheme {City,StreetName,ZipCode}, with dependencies:

{City,StreetName}  -->  {ZipCode}

and:

{ZipCode}  -->  {City}

is in third normal form. However, Table 4-5 shows that there is still some redundancy in the table scheme. The table scheme is not in BCNF. In fact, this was the example we used to motivate our definition of BCNF. (The example violates BCNF.)

Table 4-5. A Table with Redundant Data

City StreetName ZipCode
Los Angeles Hollywood Blvd 95000
  Vine St 95000

However, we can split this table scheme into two schemes:

{ZipCode,City}

and:

{ZipCode,StreetName}

In this case, Table 4-5 gets split into two tables, Tables 4-6 and 4-7, and the redundancy is gone!

Table 4-6. First Table Derived from Table 4-5 to Eliminate Redundancy

ZipCode City
95000 Los Angeles

Table 4-7. Second Table Derived from Table 4-5 to Eliminate Redundancy

ZipCode StreetName
95000 Hollywood Blvd
95000 Vine St

Generally speaking, the design of a database may begin with an E/R diagram. This diagram can be implemented according to the principles that we discussed in Chapter 3, Implementing Entity-Relationship Models: Relational Databases. The result may very well be a perfectly satisfactory database design. However, if some of the table schemes have redundancies, it may be desirable to split them into smaller table schemes that satisfy a higher normal form, as in the previous example.

Decomposition

Although the decomposition of a table scheme into smaller (hopefully normalized) table schemes is desirable from an efficiency point of view, in order to reduce redundancy and avoid various anomalies, it does carry with it some risk, which primarily comes in two forms:

  • The possible loss of information
  • The possible loss of dependencies

The following example illustrates the first problem—loss of information.

Consider the table scheme:

{AuID,AuName,PubID}

The only dependency in this table scheme is:

{AuID}  -->  {AuName}

We could decompose this table scheme into the two schemes:

{AuID,AuName} and {AuName,PubID}

Now consider Table 4-8, which has two different authors with the same name. The decomposition gives the two tables shown in Tables 4-9 and 4-10.

Table 4-8. A Table with Two Identical Author Names

AuID AuName PubID
A1 John Smith P1
A2 John Smith P2

Table 4-9. Partial Decomposition of Table 4-8

AuID AuName
A1 John Smith
A2 John Smith

Table 4-10. Partial Decomposition of Table 4-8

AuName PubID
John Smith P1
John Smith P2

Unfortunately, if we were to ask Microsoft Access to show us the data for all authors named John Smith, we would get the table shown in Table 4-11, which is not the table we started with! Information has been lost, in the sense that we no longer know that both John Smiths together have published only two books, each author with a different publisher. (It may look as though we have more information, since the table is bigger, but in reality we have lost information.)

Table 4-11. An Incorrect Reconstruction of Table 4-8

AuID AuName PubID
A1 John Smith P1
A1 John Smith P2
A2 John Smith P1
A2 John Smith P2

The second problem we mentioned in connection with the decomposition of a table scheme is that of loss of dependencies. The issue is this: During the life of the database, we will be making changes (updates, insertions, and deletions) to the separate tables in the decomposition. Of course, we must be careful to preserve the functional dependencies that are inherited from the original table scheme. However, this does not necessarily guarantee that all of the original dependencies will be preserved!

Here is a simple example to illustrate the problem. Consider the table scheme:

{ISBN,PageCount,Price}

with dependencies:

{ISBN}  -->  {PageCount}
{PageCount}  -->  {Price}

Consider the decomposition into the table schemes:

{ISBN,PageCount} and {ISBN,Price}

Note that the key {ISBN} is in both schemes in the decomposition.

Unfortunately, the decomposition has caused us to lose the dependency
{PageCount}-->{Price}, in the sense that these two attributes are not in the same table scheme of the decomposition. To illustrate, consider Table 4-12, which has two different books with the same page count and price. The decomposition of this table into two tables is shown in Tables Table 4-13 and Table 4-14.

Table 4-12. Table Example to Show Further Decomposition

ISBN PageCount Price
0-111-11111-1 500 $39.95
0-111-22222-2 500 $39.95

Table 4-13. Partial Decomposition of Table 4-12

ISBN PageCount
0-111-11111-1 500
0-111-22222-2 500

Table 4-14. Partial Decomposition of Table 4-12

ISBN Price
0-111-11111-1 $39.95
0-111-22222-2 $39.95

Now here is the problem. Looking at the second table, we have no indication that the original scheme required that PageCount determines Price. Hence, we might change the price of the second book to $12.50, as we've done in Table 4-15.

Table 4-15. Decomposition Example Changing Price

ISBN Price
0-111-11111-1 $39.95
0-111-22222-2 $12.50

But putting the tables back together for a look at all of the data gives us Table 4-16, which reveals a violation of the requirement that PageCount determines Price. In fact, somebody at the publishing company is going to be very unhappy that the company is now selling a 500-page book at below cost!

Table 4-16. Looking at Data by Combining Tables Table 4-12 Through Table 4-15

ISBN PageCount Price
0-111-11111-1 500 $39.95
0-111-22222-2 500 $12.50

By contrast, consider the decomposition of the original table scheme into:

{ISBN,PubPhone} and {PubPhone,PubName}

Here, no dependency is lost, so we can update each separate table without fear.

The previous two examples illustrate the pitfalls in decomposing a table scheme into smaller schemes. If a decomposition does not cause any information to be lost, it is called a lossless decomposition. A decomposition that does not cause any dependencies to be lost is called a dependency-preserving decomposition.

Now it is possible to show that any table scheme can be decomposed, in a lossless way, into a collection of smaller schemes that are in the very nice BCNF form. However, we cannot guarantee that the decomposition will preserve dependencies. On the other hand, any table scheme can be decomposed, in a lossless way that also preserves dependencies, into a collection of smaller schemes that are in the almost-as-nice third normal form.

However, before getting too excited, we must hasten to add that the algorithms that we give do not always produce desirable results. They can, in fact, create decompositions that are less intuitive than we might do just using our intuition. Nevertheless, they can be relied upon to produce the required decomposition, if we can't do it ourselves.

We should conclude by saying that there is no law that says that a database is always more useful or efficient if the tables have a high degree of normalization. These issues are more subjective than objective and must be dealt with, as a design issue, on an ad hoc basis. In fact, it appears that the best procedure for good database design is to mix eight parts intuition and experience with two parts theory. Hopefully, our discussion of normalization has given you a general feeling for the issues involved, and will provide a good jumping-off place if you decide to study these somewhat complicated issues in greater depth. (See Appendix E, Suggestions for Further Reading, for some books for further study.)