Analytical DBs

Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. With a few actions in the AWS Management Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds.
Amazon Athena also makes it easy to interactively run data analytics using Apache Spark without having to plan for, configure, or manage resources. When you run Apache Spark applications on Athena, you submit Spark code for processing and receive the results directly. Use the simplified notebook experience in the Amazon Athena console to develop Apache Spark applications using Python or Athena notebook APIs.
Athena SQL and Apache Spark on Amazon Athena are serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run. Athena scales automatically—running queries in parallel—so results are fast, even with large datasets and complex queries.
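As a minimal sketch of that workflow (the bucket, table, and column names are hypothetical), you can define a table over files already sitting in S3 and query it immediately:

-- Hypothetical table over CSV files already stored in S3
CREATE EXTERNAL TABLE IF NOT EXISTS sales (
  order_id STRING,
  amount   DOUBLE,
  country  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/sales/';

-- Standard SQL against the data in place; no loading step
SELECT country, SUM(amount) AS revenue
FROM sales
GROUP BY country
ORDER BY revenue DESC
LIMIT 10;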

Features

Serverless and Cost-effective: Managed infrastructure by AWS; pay only for the queries executed.
Ease of Use: Simply point to data in S3, define schema, and start querying with standard SQL.
SQL Support and Presto Engine:
Utilizes Presto with full standard SQL support.
Compatible with various data formats: CSV, JSON, ORC, Parquet, Avro (see the JSON sketch after this list).
Fast Performance:
Optimized for performance with Amazon S3.
Executes queries in parallel for quick results, even on large datasets.
Managed Data Catalog:
Stores information and schemas about databases and tables.
Integrates with AWS Glue Data Catalog for unified metadata repository.
Integration and Connectivity:
Supports integration with BI tools and SQL development applications.
Utilizes JDBC and ODBC drivers for connectivity.
Direct integration with Amazon QuickSight for data visualization.
Security and Compliance:
Integrated with IAM for access control.
Supports encryption for data at rest and in transit.
Federated Data Sources:
Connectors available for enterprise data sources like DynamoDB, Redshift, and others.
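As a sketch of the format flexibility, querying JSON only requires choosing the appropriate SerDe when defining the table. The table and bucket names below are hypothetical; org.openx.data.jsonserde.JsonSerDe is one of the JSON SerDes Athena documents:

-- Hypothetical table over newline-delimited JSON objects in S3
CREATE EXTERNAL TABLE IF NOT EXISTS app_events (
  event_id   STRING,
  event_type STRING,
  created_at STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-bucket/events/';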

Limitations

Optimization and Throttling:
Limited query tuning options compared to a dedicated data warehouse.
As a shared, multi-tenant service, queries may be throttled or queued under heavy concurrency.
Data Manipulation:
No support for data manipulation statements such as inserting, updating, or deleting rows.
Partitioning and Indexing:
No native support for indexing; partitioning data in S3 is the main way to improve query performance (see the sketch after this list).
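A hedged sketch of working around the lack of indexes with partitions, assuming a hypothetical logs table partitioned by a dt column:

-- Register a new partition explicitly...
ALTER TABLE logs ADD IF NOT EXISTS
PARTITION (dt = '2024-01-01')
LOCATION 's3://example-bucket/logs/dt=2024-01-01/';

-- ...or discover Hive-style partitions already laid out in S3
MSCK REPAIR TABLE logs;

-- Filtering on the partition column prunes the scan to a single partition
SELECT * FROM logs WHERE dt = '2024-01-01' LIMIT 100;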

Use Cases

Ad-hoc Querying: Ideal for running ad-hoc queries on data stored in S3 without server setup or management.
Data Analysis:
Analyze unstructured, semi-structured, and structured data in various formats.
Run ad-hoc ANSI SQL queries without first aggregating or loading the data into Athena.
Data Visualization: Integration with Amazon QuickSight for data visualization and report generation.
Cost and Usage Analysis: Analyze Cost and Usage Reports stored in S3 to gain insights into AWS billing.
Troubleshooting: Quick query execution for troubleshooting performance issues, e.g., analyzing web logs (see the sketch after this list).
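For instance, a quick troubleshooting query against a hypothetical web_logs table (with a numeric status column and a request_time timestamp) might surface server-error spikes by minute:

SELECT date_trunc('minute', request_time) AS minute,
       COUNT(*) AS server_errors
FROM web_logs
WHERE status BETWEEN 500 AND 599
GROUP BY 1
ORDER BY 1;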

Amazon Athena provides a simple and efficient solution for querying and analyzing data stored in Amazon S3, catering to a wide range of use cases, from ad-hoc querying to cost analysis and troubleshooting.

Best Practices

Top tips for performance with Athena (a combined sketch follows the list):
Partition your data – Partitioning divides the table into parts and keeps related data together based on column values such as date, country, or region. Athena supports Hive-style partitioning.
Bucket your data – Another way to subdivide your data is to bucket it within a single partition, grouping rows by a high-cardinality column.
Use compression – Compressing data reduces the bytes scanned; AWS recommends using either Apache Parquet or Apache ORC, which compress data by default.
Optimize file sizes – Queries run more efficiently when reading data can be parallelized and when blocks of data can be read sequentially; avoid large numbers of small files.
Optimize columnar data store generation – Apache Parquet and Apache ORC are popular columnar data stores; tuning their block and stripe sizes when generating files pays off at query time.
Optimize ORDER BY – The ORDER BY clause sorts the results of a query on a single worker; pair it with a LIMIT clause where possible to keep that sort cheap.
Optimize GROUP BY – The GROUP BY operator distributes rows based on the GROUP BY columns to worker nodes, which hold the GROUP BY values in memory; list the columns from highest to lowest cardinality.
Use approximate functions – For exploring large datasets, approx_distinct(column) gives the count of distinct values far more cheaply than COUNT(DISTINCT column), at the cost of a small (roughly 2.3%) standard error.
Only include the columns that you need – Limit the final SELECT statement to only the columns you need instead of selecting all columns; this reduces the data scanned, especially with columnar formats.
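A combined sketch of several of these tips, reusing the hypothetical sales table from earlier; the CTAS properties follow Athena's documented syntax, though exact options vary by engine version:

-- Convert raw CSV into compressed, partitioned, bucketed Parquet via CTAS
CREATE TABLE sales_parquet
WITH (
  format            = 'PARQUET',
  external_location = 's3://example-bucket/sales_parquet/',
  partitioned_by    = ARRAY['country'],  -- partition columns must come last in SELECT
  bucketed_by       = ARRAY['order_id'],
  bucket_count      = 16
) AS
SELECT order_id, amount, country
FROM sales;

-- Prune partitions, project only needed columns, approximate the
-- distinct count, and cap the single-worker sort with LIMIT
SELECT country, approx_distinct(order_id) AS approx_orders
FROM sales_parquet
WHERE country IN ('US', 'DE')
GROUP BY country
ORDER BY approx_orders DESC
LIMIT 10;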


When should I use Athena?

Query services like Amazon Athena, data warehouses like Amazon Redshift, and sophisticated data processing frameworks like Amazon EMR all address different needs and use cases. The following guidance can help you choose one or more services based on your requirements.

Amazon Athena

Use Cases:
Analyzing unstructured, semi-structured, and structured data stored in Amazon S3.
Running ad-hoc queries using ANSI SQL without the need to load data into Athena.
Data Formats Supported:
CSV, JSON, Apache Parquet, Apache ORC.
Key Features:
Integrates with Amazon QuickSight for data visualization.
Compatible with AWS Glue Data Catalog for centralized metadata management.
Allows interactive queries directly against data in Amazon S3.
No need to manage infrastructure or clusters.
Best For:
Running quick, interactive ad-hoc SQL queries against data in Amazon S3.
Users who do not want to manage servers.

Amazon EMR

Use Cases:
Running highly distributed processing frameworks like Hadoop, Spark, and Presto.
Performing a variety of scale-out data processing tasks including machine learning, graph analytics, data transformation, and streaming data.
Key Features:
Flexible and cost-effective for large-scale data processing.
Supports running custom applications and code.
Provides full control over cluster configuration and software installation.
Best For:
Users who need to process and analyze extremely large datasets with custom code.
Running the latest big data processing frameworks such as Spark, Hadoop, Presto, or HBase.

Amazon Redshift

Use Cases:
Pulling together data from various sources into a common format and storing it long-term.
Building sophisticated business reports from historical data.
Key Features:
Optimized for running complex queries that join large numbers of large database tables.
Well suited to highly structured data queries with lots of joins across very large tables.
Best For:
Running queries against highly structured data.
Building business intelligence reports from aggregated data sources.

Integration and Compatibility

Amazon Athena and EMR:
Athena can query data processed by EMR.
Both support similar data formats.
Athena's data catalog is compatible with Hive metastore, allowing seamless querying of data processed by EMR without affecting ongoing jobs.

Using Athena SQL

You can use Athena SQL to query your data in place in Amazon S3, or use a variety of connectors to query other data sources.
You can also:
Connect to business intelligence tools and other applications using Athena's JDBC and ODBC drivers.
Query Apache Iceberg tables, including time travel queries, and Apache Hudi datasets.
Query AWS service logs and geospatial data.
Query using machine learning models from Amazon SageMaker.
Query using your own user-defined functions.
Speed up query processing of highly-partitioned tables and automate partition management by using partition projection (see the sketch after this list).
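A hedged sketch of partition projection (the property names follow the Athena docs; the table, bucket, and date range are hypothetical), plus an Iceberg time travel query:

-- Enable partition projection on a date-partitioned table: Athena computes
-- the partition list from these properties instead of reading the metastore
ALTER TABLE logs SET TBLPROPERTIES (
  'projection.enabled'          = 'true',
  'projection.dt.type'          = 'date',
  'projection.dt.range'         = '2023/01/01,NOW',
  'projection.dt.format'        = 'yyyy/MM/dd',
  'projection.dt.interval'      = '1',
  'projection.dt.interval.unit' = 'DAYS',
  'storage.location.template'   = 's3://example-bucket/logs/${dt}/'
);

-- Iceberg time travel: read the table as of a past point in time
SELECT * FROM iceberg_table FOR TIMESTAMP AS OF TIMESTAMP '2023-06-01 00:00:00 UTC';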

Using Apache Spark in Amazon Athena

Amazon Athena makes it easy to interactively run data analytics and exploration using Apache Spark without the need to plan for, configure, or manage resources. Running Apache Spark applications on Athena means submitting Spark code for processing and receiving the results directly without the need for additional configuration. You can use the simplified notebook experience in the Amazon Athena console to develop Apache Spark applications using Python or Athena notebook APIs. Apache Spark on Amazon Athena is serverless and provides automatic, on-demand scaling that delivers instant-on compute to meet changing data volumes and processing requirements.
Amazon Athena offers the following features:
Console usage – Submit your Spark applications from the Amazon Athena console.
Scripting – Quickly and interactively build and debug Apache Spark applications in Python.
Dynamic scaling – Amazon Athena automatically determines the compute and memory resources needed to run a job and continuously scales those resources accordingly up to the maximums that you specify. This dynamic scaling reduces cost without affecting speed.
Notebook experience – Use the Athena notebook editor to create, edit, and run computations using a familiar interface. Athena notebooks are compatible with Jupyter notebooks and contain a list of cells that are executed in order as calculations. Cell content can include code, text, Markdown, mathematics, plots and rich media.
For additional information, see the AWS Big Data Blog.

Using Amazon Athena Federated Query

If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines that extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.
Athena uses data source connectors that run on AWS Lambda to run federated queries. A data source connector is a piece of code that can translate between your target data source and Athena. You can think of a connector as an extension of Athena's query engine. Prebuilt Athena data source connectors exist for data sources like Amazon CloudWatch Logs, Amazon DynamoDB, Amazon DocumentDB, and Amazon RDS, and JDBC-compliant relational data sources such as MySQL and PostgreSQL under the Apache 2.0 license. You can also use the Athena Query Federation SDK to write custom connectors. To choose, configure, and deploy a data source connector to your account, you can use the Athena and Lambda consoles or the AWS Serverless Application Repository. After you deploy data source connectors, each connector is associated with a catalog that you can specify in SQL queries. You can combine SQL statements from multiple catalogs and span multiple data sources with a single query.
When a query is submitted against a data source, Athena invokes the corresponding connector to identify parts of the tables that need to be read, manages parallelism, and pushes down filter predicates. Based on the user submitting the query, connectors can provide or restrict access to specific data elements. Connectors use Apache Arrow as the format for returning data requested in a query, which enables connectors to be implemented in languages such as C, C++, Java, Python, and Rust. Since connectors are processed in Lambda, they can be used to access data from any data source on the cloud or on-premises that is accessible from Lambda.
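As a sketch (assuming a DynamoDB connector has been deployed and registered as a data source named ddb; all table and column names are hypothetical), a single query can join S3-backed data with DynamoDB data:

SELECT o.order_id, o.amount, c.segment
FROM awsdatacatalog.sales_db.orders AS o  -- S3-backed table in the Glue Data Catalog
JOIN ddb.default.customers AS c           -- resolved by the Lambda connector
  ON o.customer_id = c.customer_id
WHERE o.amount > 100;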
