Normalization

Anyone who has worked with a relational database is probably familiar with normalization, at least up to the 3rd normal form. I wanted to get a better understanding of the forms, especially the ones higher than 3rd. The following are the definitions of each level, as taken from ‘An Introduction To Database Systems’ by CJ Date (I’m using the 7th edition).

Terms and Concepts:
First, let’s define some of the terms used in the definitions.

Attribute – A property or characteristic being described – A column in a table.
Tuple – A collection of attributes – A row in a table.
Relation – A set of tuples.
Relvar – Relation variable, as opposed to a relation. When a table is created, it is a variable whose values are relations. Date’s writings are the only ones that make this distinction, that I’ve seen. In implementation, a relvar equates to a table, and a relation to the data that the table contains.
Candidate key – A table could have multiple attributes or group of attributes that could uniquely determine each record. One candidate key will be chosen as the primary key.
Functional Dependency – Where an attribute or group of attributes determine the value of another attribute. If we know a ZIP code, then we know the state, so state is functionally dependent on ZIP Code. In this case, ZIP Code is the determinant.
Trivial Dependency – Where an attribute belongs to both sides of the dependency. Non-trivial means that an attribute won’t have a dependency with itself.
Left-Irreducible Dependency – No attribute can be removed from the determinant. Left refers to the left side of the dependency equation. This is also sometimes called Full Functional Dependency, where

Normal Forms:
Text in quotes are the form definitions.

First Normal Form:
“A relvar is in 1NF if and only if, in every legal value of that relvar, every tuple contains exactly one value for each attribute”
There are no repeating groups, and each tuple(row) has the same attributes defined.

Second Normal Form:
“A relvar is in 2NF if and only if it is in 1NF and every nonkey attribute is irreducibly dependent on the primary key”
Each attribute not part of the primary key is determined by the entire primary key, and not determined by a subset of the key. Obviously this only comes into play with a table with a composite primary key (Made up of multiple columns).

Third Normal Form:
“A relvar is in 3NF if and only if it is in 2NF and every non-key attribute is nontransitively dependent on the primary key”
No non-key attribute is dependent on anything but the primary key. If A determines B and B determines C, then C is transitively dependent on A.

Boyce/Codd Normal Form:
“A relvar is in BCNF if and only if every nontrivial, left-irreducible functional dependency has a candidate key as its determinant”
Every determinant is a candidate key.
Generally, any table in 3NF will also be in BCNF. BCNF comes into play where there are multiple candidate keys made up of multiple columns, where two of the keys share a column.

Fourth Normal Form:
“Relvar R is in 4NF if and only if, whenever there exist subsets A and B of the attributes of R such that the nontrivial multi-valued dependency A=>=>B is satisfied, then all attributes of R are also functionally dependent on A”

Multi-Valued Dependence:
“Let R be a relvar, and let A,B and C be subsets of the attributes of R. Then we say that B is multi-dependent on A if and only if, in every possible legal value of R, the set of B values matching a given A value, C value pair depends only on the A value and is independent of the C value.”

4NF will ensure that there isn’t more than one multi-valued dependency in a table.
The examples I’ve seen are with tables with associative tables with three attributes, where the combination of all three make the primary key. To get into 4NF there needs to be multiple associative tables where each table represents one many to many relationship.

4NF Example:
Say we have a table BandMembers, with attributes Band, MemberName and Instrument (We’ll include vocals in Instrument). All three attributes make up the primary key. Paul is in the Beatles, and he plays guitar and sings, so we would need two tuples to represent this:
Beatles, Paul, Guitar
Beatles, Paul, Vocals

We’re assuming that Paul will sing and play guitar in any group that he belongs to.

If we add Paul to the Band Wings, we’ll also need to add two records:
Wings, Paul, Guitar
Wings, Paul, Vocals

So even though we’re just recording one fact (That Paul has joined Wings) we have to add two records to record this fact.

So to get into 4NF, we would want to split this into two tables, one to link a Band to a member, and one to link the band member to the instrument they play:
Beatles, Paul
Wings, Paul

Paul, Guitar
Paul, Vocals

Fifth Normal Form (Projection-Join Normal Form):
“A relvar R is in 5NF if and only if every nontrivial join dependency that holds for R is implied by the candidate keys of R”

Join Dependency:
“Let R be a relvar, and let A, B, …, Z be subsets of the attributes of R. Then we say that R satisfies the JD * {A, B, …, Z } if and only if every possible legal value of R is equal to the join of its projections on A, B, …, Z.”
This sounds complicated, but what this is saying is that if you split the attributes of R into separate relations (projection), that’ll you’ll be able to join those relations back into the the original relation without losing any data. It seems to be sort of a catch-all to make sure that we’ve broken down any relation into the smallest possible pieces.