Data Management Foundations - D426

Unit 5: Data Storage

Last edited 255 days ago by Makiel [Muh-Keel].

5.1 Storage Media

Computer media vary on four important dimensions:
Speed. Speed is measured as access time and transfer rate.
Access time is the time required to access the first byte in a read or write operation.
Transfer rate is the speed at which data is read or written, following initial access.
Cost. Cost typically ranges from pennies to dollars per gigabyte of memory, depending on the media type.
Capacity. In principle, any media type can store any amount of data. In practice, capacity is limited by cost.
Expensive media have limited capacity compared to inexpensive media.
Volatility. Volatile memory is memory that is lost when disconnected from power.
Non-volatile memory is retained without power.
Three types of media are most important for database management:
Main memory, also called random-access memory (RAM), is the primary memory used when computer programs execute.
Main memory is fast, expensive, limited in capacity, and volatile.
Flash memory, also called solid-state drive (SSD), is less expensive and higher capacity than main memory.
Writes to flash memory are much slower than reads, and both are much slower than main memory writes and reads.
Magnetic disk, also called hard-disk drive (HDD), is used to store large amounts of data.
Magnetic disk is slower, less expensive, and higher capacity than flash memory.

Sectors, pages, and blocks.

Sectors

Magnetic disks group data in sectors, normally:
512 bytes per sector
4 kilobytes with newer disk formats
Storing one megabyte requires 1,000,000 bytes.
1,000,000 / 512 ≈ 1,954 sectors (roughly 2,000) for one megabyte of data, since a partial sector still occupies a whole sector.
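As a quick check of the arithmetic (sector size and data size as above), a partial sector still consumes a whole sector, so the count rounds up:

```python
import math

SECTOR_SIZE = 512        # bytes per sector, traditional disk format
data_bytes = 1_000_000   # one megabyte

# A partial sector still occupies a whole sector, so round up.
sectors = math.ceil(data_bytes / SECTOR_SIZE)
print(sectors)  # 1954 (roughly 2,000 sectors)
```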

Pages

Flash memory groups data in pages:
Usually between 2 kilobytes and 16 kilobytes per page.

Blocks

Databases and file systems use a uniform size, called a block, when transferring data between main memory and storage media.
Block size is independent of storage media.
Storage media are managed by controllers, which convert between blocks and sectors or pages.
When any part of a block is accessed, the entire block must be transferred to main memory.
Ex. If a query needs only 2 KB of data but the uniform block size is 8 KB, the entire 8 KB block is still transferred to main memory.
Database systems typically support block sizes ranging from 2 kilobytes to 64 kilobytes.
Smaller block sizes are usually better for transactional applications, which access a few rows per query.
Larger block sizes are usually better for analytic applications, which access many rows per query.
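A minimal sketch of the whole-block transfer rule, using the 8 KB block size from the example above (the function name is just for illustration):

```python
import math

BLOCK_SIZE = 8 * 1024  # uniform block size: 8 KB, as in the example above

def bytes_transferred(request_bytes: int) -> int:
    """Whole blocks move to main memory even for a partial read."""
    return math.ceil(request_bytes / BLOCK_SIZE) * BLOCK_SIZE

print(bytes_transferred(2 * 1024))  # a 2 KB request still transfers 8192 bytes
```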

Row-Oriented Structure

Accessing data from storage media (like hard drives or flash memory) is a relatively slow process compared to accessing data in main memory (RAM).
Because of this, databases aim to reduce the number of blocks that need to be transferred during operations.
Let's break down how databases optimize this using block-based data storage and how row-oriented storage fits into this model.
Most relational databases are optimized for transactional applications, which often read and write individual rows.
In row-oriented storage, the database stores data row by row, and each row is usually fully contained within a single block.
Example: A row from a "Customers" table (containing CustomerID, Name, Email, etc.) would be stored in one contiguous block.
This design ensures that when the database reads a row, only one block needs to be accessed, minimizing the amount of data read from the storage media.
Row-oriented storage aligns well with the needs of transactional applications, as it minimizes block transfers by ensuring entire rows are stored in a single block.
This design maximizes the speed and efficiency of row-based operations.

Using Links in Storage

Sometimes a table contains a very large column, such as 1 megabyte documents or 10 megabyte images.
For tables with large columns, each row usually contains a link to the large column, which is stored in a different area.
The large column might be stored in files managed by the operating system or in a special storage area managed by the database.
This approach keeps row size small and improves performance of queries that do not access the large column.
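A hypothetical sketch of the link approach: the row holds only a small link, and the large object is read from a separate storage area only when a query needs it (all names and the path below are made up):

```python
# Hypothetical sketch: the row stores a small link to the large column,
# which lives outside the row's block (names and path are made up).
row = {
    "CustomerID": 123,
    "Name": "Ada",
    "PhotoLink": "/blobs/customer_123.jpg",  # ~10 MB image stored elsewhere
}

def read_photo(row: dict) -> bytes:
    """Fetch the large object only when a query actually needs it."""
    with open(row["PhotoLink"], "rb") as f:
        return f.read()

# Queries that skip the photo read only the small row, never the blob.
print(row["Name"])
```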

Why Row-Oriented Storage is Efficient for Transactional Workloads:

Minimizes Block Transfers for Individual Row Access:
Since each row fits within a single block, fetching a row typically involves reading just one block from storage, even if the row contains many columns.
This reduces the time spent waiting for data to be transferred from slow storage media to main memory.
Efficient for Row-Based Queries:
Queries that access a single row (e.g., SELECT * FROM Customers WHERE CustomerID = 123) can retrieve the necessary data with minimal disk I/O.
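A toy sketch of why one row costs one block read, assuming fixed-size 100-byte rows packed whole into 8 KB blocks (the sizes are assumptions for illustration, not from the course):

```python
# Assumed layout: fixed 100-byte rows packed whole into 8 KB blocks.
BLOCK_SIZE = 8192
ROW_SIZE = 100
ROWS_PER_BLOCK = BLOCK_SIZE // ROW_SIZE  # 81 whole rows per block

def block_for_row(row_number: int) -> int:
    """Rows never span blocks, so any single row is one block read."""
    return row_number // ROWS_PER_BLOCK

print(block_for_row(123))  # row 123 lives in block 1
```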

Limitations of Row-Oriented Storage:

Less efficient for analytical queries:
If an analytical query needs to access all values of a single column (e.g., sum of Amount), it would still need to scan every row and block.
This is where column-oriented storage is more efficient, but it's suited for analytical databases, not transactional systems.
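A rough comparison of blocks read when summing a single column, under assumed sizes (100-byte rows, 8-byte column values, 8 KB blocks):

```python
import math

# Assumed sizes: 100-byte rows, 8-byte column values, 8 KB blocks.
BLOCK_SIZE = 8192
ROW_SIZE = 100
VALUE_SIZE = 8
ROWS = 100_000

rows_per_block = BLOCK_SIZE // ROW_SIZE       # 81 whole rows per block
values_per_block = BLOCK_SIZE // VALUE_SIZE   # 1024 column values per block

row_oriented = math.ceil(ROWS / rows_per_block)        # must touch every row's block
column_oriented = math.ceil(ROWS / values_per_block)   # reads only the one column
print(row_oriented, column_oriented)  # 1235 vs 98 blocks
```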
Example: given a block size and a fixed row size, the goal is to determine how many bytes are unused in each block when rows cannot span blocks.
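A small worked version of the unused-bytes calculation, with assumed sizes (4 KB blocks, fixed 100-byte rows):

```python
# Assumed sizes: 4 KB blocks, fixed 100-byte rows; rows cannot span blocks.
BLOCK_SIZE = 4096
ROW_SIZE = 100

rows_per_block = BLOCK_SIZE // ROW_SIZE          # 40 whole rows fit per block
unused = BLOCK_SIZE - rows_per_block * ROW_SIZE  # 96 bytes left unused
print(rows_per_block, unused)
```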


Column-Oriented Storage

Some newer relational databases are optimized for analytic applications rather than transactional applications.
Analytic applications often read just a few columns from many rows. In this case, column-oriented storage is optimal.
In column-oriented storage, also called columnar storage, each block stores values for a single column only.
Column-oriented storage is a poor design for most transactional applications.
Usually 'column-oriented' and 'columnar' mean a technique for organizing data on storage media.
Sometimes these terms mean a type of NoSQL database, commonly called a 'wide column' database.
Column-oriented storage benefits analytic applications in several ways:
Faster data access. More column values are transferred per block, reducing time to access storage media.
Better data compression. Databases often apply data compression algorithms when storing data.
Data compression is usually more effective when all values have the same data type.
As a result, more values are stored per block, which reduces storage and access time.
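A simple run-length encoding sketch showing why same-typed, often-repetitive column values compress well (the state values below are hypothetical):

```python
from itertools import groupby

# Hypothetical same-typed column values; repetition makes RLE effective.
column = ["WA", "WA", "WA", "OR", "OR", "CA", "CA", "CA", "CA"]

# Collapse each run of equal consecutive values into (value, count).
rle = [(value, len(list(run))) for value, run in groupby(column)]
print(rle)  # [('WA', 3), ('OR', 2), ('CA', 4)]
```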

5.3 Single-Level Indexes

A single-level index stores column values and associates each column value with a pointer to the block containing the corresponding row.
Pointers may also indicate the exact row position within a block for more precise access.
Purpose:
It improves data retrieval by reducing the number of blocks a query needs to access.
Instead of scanning the entire table, the database uses the index to quickly locate relevant rows.
Structure:
A single-level index contains two key components:
Column values (used for searching or filtering).
Pointers:
Each pointer identifies the block where the corresponding row is stored.
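A minimal sketch of a single-level index as a mapping from column value to block number (the CustomerIDs and block layout are hypothetical):

```python
# Minimal sketch: single-level index mapping column value -> block number.
# CustomerIDs and block numbers are hypothetical.
index = {
    101: 0,  # CustomerID 101's row is stored in block 0
    123: 1,
    250: 3,
}

def block_for(customer_id: int) -> int:
    """One index lookup points straight at the block holding the row."""
    return index[customer_id]

print(block_for(123))  # read only block 1 instead of scanning the table
```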