Analytical DBs

Amazon EMR (Elastic MapReduce)

Amazon EMR (Elastic MapReduce) is a managed big data platform for processing vast amounts of data with open-source frameworks such as Apache Hadoop, Apache Spark, Apache HBase, Apache Flink, and Presto. EMR simplifies the setup, operation, and scaling of big data environments.

Key Features of Amazon EMR

General Characteristics

Hosted Hadoop Framework: Utilizes Apache Hadoop for large-scale data processing.
Cost-Effective: Offers petabyte-scale analysis at lower cost than traditional on-premises solutions, with pay-as-you-go pricing and support for Spot Instances.
Flexible Deployment: Supports running workloads on Amazon EC2 instances, Amazon Elastic Kubernetes Service (EKS) clusters, or on-premises using AWS Outposts.
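
The sketch below shows one way to launch an EMR cluster on EC2 from Python with boto3. It is a minimal example under stated assumptions, not a production setup: the cluster name, region, and log bucket are placeholders, and it assumes the default EMR service roles (EMR_DefaultRole and EMR_EC2_DefaultRole) already exist in the account.

```python
import boto3

# Minimal sketch: launch an EMR cluster on EC2 instances via boto3.
# Region, names, and the S3 log location are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-analytics-cluster",          # placeholder cluster name
    ReleaseLabel="emr-6.15.0",                 # example EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,   # keep the cluster up after steps finish
        "TerminationProtected": False,
    },
    LogUri="s3://example-bucket/emr-logs/",    # placeholder S3 location for logs
    JobFlowRole="EMR_EC2_DefaultRole",         # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",             # default EMR service role
)
print("Cluster ID:", response["JobFlowId"])
```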

Performance and Scalability

High Performance: The EMR runtime for Apache Spark can run workloads over 3x faster than standard, open-source Apache Spark, according to AWS benchmarks.
Integration with AWS Services: Integrates with Amazon S3 for durable storage (via EMRFS) and with other AWS services such as the AWS Glue Data Catalog and Amazon CloudWatch.
Scalability: EMR managed scaling automatically resizes the number of EC2 instances in a cluster to meet workload demand (see the sketch after this list).
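
As a concrete example of scaling, the sketch below attaches an EMR managed scaling policy to an existing cluster with boto3. The cluster ID and the capacity limits are placeholder values.

```python
import boto3

# Minimal sketch: attach a managed scaling policy so the cluster grows
# and shrinks with workload demand. ClusterId is a placeholder returned
# by an earlier run_job_flow call.
emr = boto3.client("emr", region_name="us-east-1")

emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",               # placeholder cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,         # never shrink below 2 instances
            "MaximumCapacityUnits": 10,        # never grow beyond 10 instances
        }
    },
)
```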

Supported Technologies

Apache Spark: For in-memory data processing and optimized query execution (see the sketch after this list).
Apache HBase: For NoSQL database operations.
Presto: For fast SQL-based analytic queries on large datasets.
Apache Flink: For stream and batch processing.
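
To illustrate the Spark item above, here is a minimal PySpark sketch of in-memory processing and SQL-style query execution of the kind EMR runs. The S3 path, table, and column names are illustrative placeholders.

```python
from pyspark.sql import SparkSession

# Minimal sketch: cache a dataset in memory and run a SQL query over it.
spark = SparkSession.builder.appName("example-spark-query").getOrCreate()

trips = spark.read.parquet("s3://example-bucket/curated/trips/")  # placeholder dataset
trips.createOrReplaceTempView("trips")
trips.cache()                                  # keep the dataset in memory for reuse

top_routes = spark.sql("""
    SELECT origin, destination, COUNT(*) AS trip_count
    FROM trips
    GROUP BY origin, destination
    ORDER BY trip_count DESC
    LIMIT 10
""")
top_routes.show()
spark.stop()
```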

Cluster and Step Management

Clusters: Collections of EC2 instances provisioned to run big data processing tasks.
Steps: Programmatic tasks or instructions to be executed on the data, such as running a Spark job or an HDFS operation (see the step-submission sketch after this list).
Single AZ Deployment: Nodes for a cluster are launched within the same Availability Zone to minimize latency.
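
The sketch below submits a step to a running cluster with boto3, using EMR's built-in command-runner.jar to invoke spark-submit. The cluster ID and the script location on S3 are placeholders.

```python
import boto3

# Minimal sketch: add a Spark job as a step on an existing EMR cluster.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",               # placeholder cluster ID
    Steps=[
        {
            "Name": "example-spark-step",
            "ActionOnFailure": "CONTINUE",     # keep the cluster alive if the step fails
            "HadoopJarStep": {
                "Jar": "command-runner.jar",   # EMR's built-in command runner
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://example-bucket/scripts/etl_job.py",  # placeholder script
                ],
            },
        }
    ],
)
print("Step IDs:", response["StepIds"])
```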

Access and Management

AWS Management Console: Provides a graphical interface for managing EMR clusters.
Command Line Tools and SDKs: Allow scripting and programmatic access to EMR (see the boto3 sketch after this list).
EMR API: Offers comprehensive programmatic control over EMR operations.
SSH Access: Direct access to the underlying operating system on cluster nodes for troubleshooting and custom configurations.
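
As an example of programmatic access through the SDK, the sketch below uses boto3 to list active clusters and print each cluster's primary node DNS name (the address used for SSH access). The region is a placeholder.

```python
import boto3

# Minimal sketch: enumerate running or waiting clusters and show the
# public DNS name of each cluster's primary node.
emr = boto3.client("emr", region_name="us-east-1")

clusters = emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])
for summary in clusters["Clusters"]:
    detail = emr.describe_cluster(ClusterId=summary["Id"])["Cluster"]
    print(summary["Id"], summary["Name"], detail.get("MasterPublicDnsName"))
```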

Use Cases

Log Analysis: Processing and analyzing log files from web servers to identify usage patterns and detect anomalies.
Financial Analysis: Running complex queries on large financial datasets to identify trends, risks, and opportunities.
ETL (Extract, Transform, Load) Activities: Extracting data from various sources, transforming it into a suitable format, and loading it into a data warehouse or data lake for further analysis.
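
A minimal PySpark sketch of the log-analysis and ETL pattern described above: extract raw web server logs from S3, transform them (derive a date column and keep server-error responses), and load the result as date-partitioned Parquet. All paths and column names are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

# Minimal sketch: extract raw logs, transform, and load partitioned Parquet.
spark = SparkSession.builder.appName("example-log-etl").getOrCreate()

logs = spark.read.json("s3://example-bucket/raw/web-logs/")       # placeholder input

errors = (
    logs.withColumn("event_date", F.to_date("timestamp"))         # derive partition column
        .filter(F.col("status_code") >= 500)                      # keep server-side errors only
)

errors.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-bucket/curated/error-logs/"                     # placeholder output
)
spark.stop()
```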
