Gwendolyn Ang - BI Portfolio

Explore

Database Design, Querying & Analysis

Part 1: Database Design

This page documents how I designed a database schema normalized to the 3rd Normal Form (3NF) to facilitate efficient data integration from multiple sources and ensure high data quality.

Dataset Overview

This project uses the US Candy Distributor dataset by Maven Analytics which contains both sales and geospatial data across multiple tables and a total of 39 fields.

⁠

Denormalized dataset

Denormalized dataset

⁠

A key issue with the current database schema is the presence of many-to-many relationships. For example, each StateName appears several times in USZips since a single state houses several cities that each have several postal codes. At the same time, StateName is repeated in Sales each time an order is made from that location. Similar redundancies occur with Division in the Sales and Products tables.

This structure leads to inefficient storage and data management that can affect query performance. As such, normalization will be performed before proceeding to querying.

Note that this is not the proper way to illustrate the connecting lines. We are limited to using 1:1 since the diagram creation tool only allows 1:1, N:1 and 1:N.

Database Normalization

Database normalization optimizes database structure for efficient storage, retrieval and analysis. Each step, referred to as a normal form (NF), addresses specific issues concerning table structure and relationships, minimizing redundancy and in turn improving query performance and data integrity.

Each section below will present the rules for each stage taken from this

TechTFQ Youtube video⁠

, documentation of changes performed, and revised ERD.

First Normal Form

⁠

Second Normal Form

⁠

Changes

Following the first rule, Sales was split into new tables. Aside from RowID, other columns that uniquely identified each record were OrderID, ProductID and CustomerID. The following table shows the non-key columns fully dependent on each candidate key.

Candidate Key

Non-Key Attributes from Sales Table

OrderID

OrderDate, ProductID, CustomerID, Sales, Units, GrossProfit, Cost

ProductID

ProductName

CustomerID

Region, Country / Region, State / Province, City, PostalCode

There are no rows in this table

⁠

The following outlines some notes about how the tables were created / revised.

2NF Table

Notes

Orders

Carried over all non-key attributes from the earlier table. A problem arose since the existing Order ID carries duplicates brought about by its format US-2021-128055-CHO-TRI-54000 (ex. if the same customer from the same country buys the same product in the same year). As such Order ID will be renamed as OrderCode and RowID as OrderID.

Products

ProductName from Sales is added to existing Products attributes. Division and Factory columns are retained but now contain IDs

Customers

Carried over all non-key attributes from the earlier table

Divisions

Renamed from Targets and now includes a DivisionID. While the existing names (Chocolate, Sugar, Other) can suffice, numerical identifiers are 1) more reliable at guaranteeing uniqueness, 2) suitable for improving query performance through indexing, and 3) offer flexibility in changing division names without affecting relationships with other tables.

Factories

Same explanation as Divisions in creating a FactoryID

USZips

Retained for now linking to PostalCode in Customers

There are no rows in this table

⁠

Third Normal Form

⁠

Rules

Avoid transitive dependencies. Let’s say you have a Table T which has 3 columns. A is a primary key, B is a non-key attribute functionally dependent on A, and C is a non-key attribute functionally dependent on B, then C is transitively dependent on A through B.

Changes

The transitive relationships from the 2NF diagram are as follows.

In the Customers table, Region , State/Province and City (C) are dependent on PostalCode (B) which is determined by CustomerID (A).

In the USZips table, PostalCode (A) indirectly determines StateName (C) through StateID (B).

Both tables also carry the same information so there is a need to reduce redundancies.

Some notes on location. The original dataset carried data for Canada and the US filtered through the Country/Region attribute which was removed in 1NF. However, only US zip codes were provided so I decided not to include Canadian orders in this project, reducing the original 10,194 rows to 9994.

3NF Tables

Notes

Orders, Products, Divisions, Factories

Retained same structure except PostalID is removed from Orders

Customers

Reduced to CustomerID and PostalCode

USZips

StateName column removed while others were retained but renamed (ex. zip→ PostalCode, lat → Latitude)

USStates

Contain StateID and StateName from USZips, Region manually mapped from Customers

There are no rows in this table

⁠

3NF is generally sufficient for more database designs. Given the final schema, I declared primary and foreign keys in the next section along with other data quality procedures.

Data Quality Checks

This stage ensures the accuracy, consistency, efficiency, and reliability of our data. Aside from the processes demonstrated below, checks can also be made for non-primary columns that require uniqueness (e.g. email addresses registered to a site), or have particular conditions (e.g. age must be > 13).

Primary and Foreign Key Creation

The following code is used to apply the primary and foreign key assignments in SQL Server. These identifiers are key to speeding up data retrieval using indexing as databases automatically create indexes based on the primary key.

Meanwhile, declaring foreign key relationships ensures referential integrity when inserting, updating, and deleting records (e.g. preventing invalid relationships where an entry is added to the Orders table with a non-existent CustomerID as this would mean there is no information on the client’s delivery location).

In the example below, on update cascade ensures that if a primary key in the parent table is updated, changes are cascaded onto child tables.

Duplicate Row Check

Duplicate entries can cause inconsistencies in statistical analysis, and slow down queries. The following shows sample code used for the Orders dataset. No rows were returned, signifying unique rows.

Null Primary Key Check

Since primary keys are unique identifiers for every row, they cannot contain null values. The process below with PostalCode was repeated with CustomerID, ProductID and StateID while DivisionID and FactoryID were manually inspected having < 5 items. Similar to the check for duplicates, no rows were outputted.

Data Type Validation

Data formatted as Number in Excel typically appears as floats in SQL Server. As such, there is a need to adjust database variables which cannot have decimal values. Seen below are information about USZips executed through a built-in stored procedure called sp_help.

⁠

Database information for USZips

Database information for USZips

⁠

First 7 rows of USZips

First 7 rows of USZips

⁠

As seen in the image on the right, PostalCode must be composed of 5 digits and Population must be whole number figures, although this column would be removed as it was not used in the analysis. The code below performs the data type conversion.

Similar float → int conversions must be carried out for other tables.

Table

Columns

USZips

PostalCode, Population

Customers

CustomerID, PostalCode

Orders

OrderID, CustomerID, Units

Divisions

DivisionID

Factories

FactoryID

Products

DivisionID, FactoryID

There are no rows in this table

⁠

Dataset Overview

Database Normalization

Primary and Foreign Key Creation

Duplicate Row Check

Null Primary Key Check

Data Type Validation

Want to print your doc?
This is not the way.

Try clicking the ⋯ next to your doc name or using a keyboard shortcut (

CtrlP

) instead.