Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud designed for large-scale data analytics. It supports SQL and integrates with existing Business Intelligence (BI) tools, making it ideal for running complex analytic queries against massive datasets. It uses Amazon EC2 instances, so you must choose an instance family/type.
Amazon Redshift Serverless lets you access and analyze data without the usual configurations of a provisioned data warehouse. Resources are automatically provisioned and data warehouse capacity is intelligently scaled to deliver fast performance for even the most demanding and unpredictable workloads. You don't incur charges when the data warehouse is idle, so you only pay for what you use. Regardless of the size of the dataset, you can load data and start querying right away in the Amazon Redshift query editor v2 or in your favorite business intelligence (BI) tool. Enjoy the best price performance and familiar SQL features in an easy-to-use, zero administration environment.
Key Features of Amazon RedShift
General Characteristics
OLAP Database: Designed for Online Analytical Processing (OLAP), suited for complex queries and large-scale data analytics.
SQL-Based: Compatible with PostgreSQL and supports standard SQL.
Cost-Effective: More economical compared to traditional on-premises data warehouse platforms.
Integration: Compatible with JDBC/ODBC drivers and integrates with most BI tools out-of-the-box.
Performance
Columnar Data Storage: Stores data sequentially in columns rather than rows, optimizing for read-heavy operations.
Advantages: Fewer I/Os, better performance, and enhanced compression.
Massively Parallel Processing (MPP): Distributes data and queries across all nodes for faster execution.
RedShift Spectrum: Enables querying data directly from Amazon S3 without requiring loading or ETL processes.
Advanced Compression: Automatically selects the best compression scheme for each column, improving performance and reducing storage needs.
10x Faster: Typically 10 times faster than traditional SQL databases.
Availability and Durability
Replication and Backups:
Replicates data within the cluster and to Amazon S3 for continuous backups.
Keeps three copies of data: original, replica, and backup on S3.
Node Failures: Automatically provisions and adds replacement nodes, but the cluster will be temporarily unavailable.
High Availability:
Supports multi-node clusters with data replication and node recovery.
Does not support Multi-AZ deployments, but snapshots can be restored into another AZ.
Scaling:
Requires a brief period of unavailability during scaling operations.
Data is moved in parallel during scaling from old to new compute nodes.
Security
Encryption:
SSL encryption for data in transit.
AES 256 encryption for data at rest.
Network Isolation: Uses VPC for network isolation.
Audit Logging: Integrated with AWS CloudTrail.
Key Management: Supports AWS Key Management Service (KMS) and Hardware Security Module (HSM) for managing encryption keys.
Charging Model
Compute Nodes: Charged per hour for compute nodes (leader node is not charged).
Backup Storage: Charges apply for storage on S3.
Data Transfer: No charge for data transfer between RedShift and S3 within the same region.
Cluster Configuration
Single Node: Up to 160GB, not suitable for data replication.
Multi-Node: Comprises a leader node and multiple compute nodes (up to 128).
Leader Node: Manages client connections, query optimization, and execution coordination.
Compute Nodes: Store data and execute queries, offering local columnar storage and parallel execution.
Use Cases of Amazon RedShift
Enterprise Data Warehouse
Centralizes data from multiple sources (e.g., CRM, advertising, customer support) into a unified schema for comprehensive reporting and analytics.
Business Intelligence and Analytics
Ideal for fast query execution against large datasets.
Often used with BI tools like Tableau for high-performance querying and data visualization.
Embedded Analytics and Data Monetization
Enables organizations to monetize data by providing customers with analytics as a service.
RedShift's data sharing and aggregation capabilities facilitate secure and selective data access.
Database Migration and Change Data Capture
AWS Database Migration Service (DMS) can replicate changes from operational data stores into RedShift.
Useful for migrating from legacy data warehouses or enhancing flexibility in data analysis.
Exam Tips:
RedShift is a columnar data warehouse DB that is ideal for running long complex queries. RedShift can also improve performance for repeat queries by caching the result and returning the cached result when queries are re-run. Dashboard, visualization, and business intelligence (BI) tools that execute repeat queries see a significant boost in performance due to result caching.