Data Normalisation

Understanding Data Normalisation

  • Data Normalisation is a strategic method for organising data in a database to reduce redundancy and improve data integrity.
  • Normalisation involves decomposing a table into two or more tables and linking them using relationships.
  • The main aim of normalisation is to minimise data duplication, eliminate data anomalies, protect data integrity, improve data search performance, and increase scalability.
  • Data normalisation is carried out in stages known as normal forms (NF). Each NF comes with a set of rules or constraints.

First Normal Form (1NF)

  • In the First Normal Form, each column is unique and not duplicated. This means each field contains only atomic (indivisible) values.
  • Each entry or record has a unique identifier known as a primary key.

Second Normal Form (2NF)

  • The Second Normal Form takes a table that is in 1NF and eliminates partal dependencies. A partial dependency occurs if a non-prime attributes (an attribute that’s not part of the primary key) depends on part of the primary key.
  • A table is in 2NF if it’s in 1NF and all its non-prime attributes are fully functionally dependent on the primary key.

Third Normal Form (3NF)

  • The Third Normal Form eliminates transitive dependencies. This is when a non-prime attribute depends on another non-prime attribute for its value.
  • A table is in 3NF if it’s in 2NF and all its attributes are directly dependent on the primary key.

Boyce-Codd Normal Form (BCNF)

  • The Boyce-Codd Normal Form is a high level of normalisation. It addresses a flaw in 3NF where a non-prime attribute determines a prime attribute (this is known as a dependency anomaly).
  • A table is in BCNF if for every non-trivial functional dependency X → Y, X is a superkey.

Denormalisation

  • Denormalisation is the process of combining tables that have been decomposed during normalisation. It’s used to improve performance by reducing the number of joins needed.
  • However, denormalisation can lead to data inconsistency and may increase the physical size of the database due to data redundancy.