This page documents how I designed a database schema normalized to the 3rd Normal Form (3NF) to facilitate efficient data integration from multiple sources and ensure high data quality.
Dataset Overview
This project uses the US Candy Distributor dataset by Maven Analytics which contains both sales and geospatial data across multiple tables and a total of 39 fields.
Denormalized dataset
A key issue with the current database schema is the presence of many-to-many relationships. For example, each StateName appears several times in USZips since a single state houses several cities that each have several postal codes. At the same time, StateName is repeated in Sales each time an order is made from that location. Similar redundancies occur with Division in the Sales and Products tables.
This structure leads to inefficient storage and data management that can affect query performance. As such, normalization will be performed before proceeding to querying.
Note that this is not the proper way to illustrate the connecting lines. We are limited to using 1:1 since the diagram creation tool only allows 1:1, N:1 and 1:N.
Database Normalization
Database normalization optimizes database structure for efficient storage, retrieval and analysis. Each step, referred to as a normal form (NF), addresses specific issues concerning table structure and relationships, minimizing redundancy and in turn improving query performance and data integrity.
Each section below will present the rules for each stage taken from this
, documentation of changes performed, and revised ERD.
First Normal Form
Second Normal Form
Changes
Following the first rule, Sales was split into new tables. Aside from RowID, other columns that uniquely identified each record were OrderID, ProductID and CustomerID. The following table shows the non-key columns fully dependent on each candidate key.
Region, Country / Region, State / Province, City, PostalCode
There are no rows in this table
The following outlines some notes about how the tables were created / revised.
2NF Table
Notes
Orders
Carried over all non-key attributes from the earlier table. A problem arose since the existing Order ID carries duplicates brought about by its format US-2021-128055-CHO-TRI-54000 (ex. if the same customer from the same country buys the same product in the same year). As such Order ID will be renamed as OrderCode and RowID as OrderID.
Products
ProductName from Sales is added to existing Products attributes. Division and Factory columns are retained but now contain IDs
Customers
Carried over all non-key attributes from the earlier table
Divisions
Renamed from Targets and now includes a DivisionID. While the existing names (Chocolate, Sugar, Other) can suffice, numerical identifiers are 1) more reliable at guaranteeing uniqueness, 2) suitable for improving query performance through indexing, and 3) offer flexibility in changing division names without affecting relationships with other tables.
Factories
Same explanation as Divisions in creating a FactoryID
USZips
Retained for now linking to PostalCode in Customers
There are no rows in this table
Third Normal Form
Rules
Avoid transitive dependencies. Let’s say you have a Table T which has 3 columns. A is a primary key, B is a non-key attribute functionally dependent on A, and C is a non-key attribute functionally dependent on B, then C is transitively dependent on A through B.
Changes
The transitive relationships from the 2NF diagram are as follows.
In the Customers table, Region , State/Province and City (C) are dependent on PostalCode (B) which is determined by CustomerID (A).
In the USZips table, PostalCode (A) indirectly determines StateName (C) through StateID (B).
Both tables also carry the same information so there is a need to reduce redundancies.
Some notes on location. The original dataset carried data for Canada and the US filtered through the Country/Region attribute which was removed in 1NF. However, only US zip codes were provided so I decided not to include Canadian orders in this project, reducing the original 10,194 rows to 9994.
3NF Tables
Notes
Orders, Products, Divisions, Factories
Retained same structure except PostalID is removed from Orders
Customers
Reduced to CustomerID and PostalCode
USZips
StateName column removed while others were retained but renamed (ex. zip→ PostalCode, lat → Latitude)
USStates
Contain StateID and StateName from USZips, Region manually mapped from Customers
There are no rows in this table
3NF is generally sufficient for more database designs. Given the final schema, I declared primary and foreign keys in the next section along with other data quality procedures.
Data Quality Checks
This stage ensures the accuracy, consistency, efficiency, and reliability of our data. Aside from the processes demonstrated below, checks can also be made for non-primary columns that require uniqueness (e.g. email addresses registered to a site), or have particular conditions (e.g. age must be > 13).
Primary and Foreign Key Creation
The following code is used to apply the primary and foreign key assignments in SQL Server. These identifiers are key to speeding up data retrieval using indexing as databases automatically create indexes based on the primary key.
Meanwhile, declaring foreign key relationships ensures referential integrity when inserting, updating, and deleting records (e.g. preventing invalid relationships where an entry is added to the Orders table with a non-existent CustomerID as this would mean there is no information on the client’s delivery location).
In the example below, on update cascade ensures that if a primary key in the parent table is updated, changes are cascaded onto child tables.
Duplicate Row Check
Duplicate entries can cause inconsistencies in statistical analysis, and slow down queries. The following shows sample code used for the Orders dataset. No rows were returned, signifying unique rows.
Null Primary Key Check
Since primary keys are unique identifiers for every row, they cannot contain null values. The process below with PostalCode was repeated with CustomerID, ProductID and StateID while DivisionID and FactoryID were manually inspected having < 5 items. Similar to the check for duplicates, no rows were outputted.
Data Type Validation
Data formatted as Number in Excel typically appears as floats in SQL Server. As such, there is a need to adjust database variables which cannot have decimal values. Seen below are information about USZips executed through a built-in stored procedure called sp_help.
Database information for USZips
First 7 rows of USZips
As seen in the image on the right, PostalCode must be composed of 5 digits and Population must be whole number figures, although this column would be removed as it was not used in the analysis. The code below performs the data type conversion.
Similar float → int conversions must be carried out for other tables.