Third normal form
Third normal form is a database schema design approach for relational databases which uses normalizing principles to reduce the duplication of data, avoid data anomalies, ensure referential integrity, and simplify data management. It was defined in 1971 by Edgar F. Codd, an English computer scientist who invented the relational model for database management.
A database relation is said to meet third normal form standards if all the attributes are functionally dependent on solely the primary key. Codd defined this as a relation in second normal form where all non-prime attributes depend only on the candidate keys and do not have a transitive dependency on another key.
A hypothetical example of a failure to meet third normal form would be a hospital database having a table of patients which included a column for the telephone number of their doctor. The phone number is dependent on the doctor, rather than the patient, thus would be better stored in a table of doctors. The negative outcome of such a design is that a doctor's number will be duplicated in the database if they have multiple patients, thus increasing both the chance of input error and the cost and risk of updating that number should it change.
Codd later realized that 3NF did not eliminate all undesirable data anomalies and developed a stronger version to address this in 1974, known as Boyce–Codd normal form.
Definition of third normal form
The third normal form is a normal form used in database normalization. 3NF was originally defined by E. F. Codd in 1971.Codd's definition states that a table is in 3NF if and only if both of the following conditions hold:
- The relation R is in second normal form.
- Every non-prime attribute of R is non-transitively dependent on every key of R.
A 3NF definition that is equivalent to Codd's, but expressed differently, was given by Carlo Zaniolo in 1982. This definition states that a table is in 3NF if and only if for each of its functional dependencies X → A, at least one of the following conditions holds:
- X contains A,
- X is a superkey,
- every element of A \ X, the set difference between A and X, is a prime attribute.
"Nothing but the key"
An approximation of Codd's definition of 3NF, paralleling the traditional pledge to give true evidence in a court of law, was given by Bill Kent: " non-key must provide a fact about the key, the whole key, and nothing but the key". A common variation supplements this definition with the oath: "so help me Codd".Requiring existence of "the key" ensures that the table is in 1NF; requiring that non-key attributes be dependent on "the whole key" ensures 2NF; further requiring that non-key attributes be dependent on "nothing but the key" ensures 3NF. While this phrase is a useful mnemonic, the fact that it only mentions a single key means it defines some necessary but not sufficient conditions to satisfy the 2nd and 3rd normal forms. Both 2NF and 3NF are concerned equally with all candidate keys of a table and not just any one key.
Chris Date refers to Kent's summary as "an intuitively attractive characterization" of 3NF and notes that with slight adaptation it may serve as a definition of the slightly stronger Boyce–Codd normal form: "Each attribute must represent a fact about the key, the whole key, and nothing but the key." The 3NF version of the definition is weaker than Date's BCNF variation, as the former is concerned only with ensuring that non-key attributes are dependent on keys. Prime attributes must not be functionally dependent at all; they each represent a fact about the key in the sense of providing part or all of the key itself.
An example of a 2NF table that fails to meet the requirements of 3NF is:
Tournament | Year | Winner | Winner's date of birth |
Indiana Invitational | 1998 | Al Fredrickson | 21 July 1975 |
Cleveland Open | 1999 | Bob Albertson | 28 September 1968 |
Des Moines Masters | 1999 | Al Fredrickson | 21 July 1975 |
Indiana Invitational | 1999 | Chip Masterson | 14 March 1977 |
Because each row in the table needs to tell us who won a particular Tournament in a particular Year, the composite key is a minimal set of attributes guaranteed to uniquely identify a row. That is, is a candidate key for the table.
The breach of 3NF occurs because the non-prime attribute Winner's date of birth is transitively dependent on the candidate key through the non-prime attribute Winner. The fact that Winner's date of birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records.
In order to express the same facts without violating 3NF, it is necessary to split the table into two:
Update anomalies cannot occur in these tables, because unlike before, Winner is now a candidate key in the second table, thus allowing only one value for Date of birth for each Winner. ComputationA relation can always be decomposed in third normal form, that is, the relation R is rewritten to projections R1,..., Rn whose join is equal to the original relation. Further, this decomposition does not lose any functional dependency, in the sense that every functional dependency on R can be derived from the functional dependencies that hold on the projections R1,..., Rn. What is more, such a decomposition can be computed in polynomial time.Derivation of Zaniolo conditionsThe definition of 3NF offered by Carlo Zaniolo in 1982, and given above, is proven in the following way: Let X → A be a nontrivial FD and let A be a non-key attribute. Also let Y be a key of R. Then Y → X.Normalization beyond 3NFMost 3NF tables are free of update, insertion, and deletion anomalies. Certain types of 3NF tables, rarely met with in practice, are affected by such anomalies; these are tables which either fall short of Boyce–Codd normal form or, if they meet BCNF, fall short of the higher normal forms 4NF or 5NF.Considerations for use in reporting environmentsWhile 3NF was ideal for machine processing, the segmented nature of the data model can be difficult to consume by a human user. Analytics via query, reporting, and dashboards were often facilitated by a different type of data model that provided pre-calculated analysis such as trend lines, period-to-date calculations, cumulative calculations, basic statistics and previous period comparisons e.g. dimensional modeling and beyond dimensional modeling, flattening of stars via Hadoop and data science. |