CLF-C02Chapter 113 of 130Objective 3.5

Amazon Athena

This chapter covers Amazon Athena, a serverless interactive query service that lets you analyze data directly in Amazon S3 using standard SQL. For the CLF-C02 exam, this objective falls under Domain 3: Cloud Technology Services, which weighs about 34% of the exam. Understanding Athena is critical because it represents a key pattern in AWS: serverless analytics that separates compute from storage. You'll learn how Athena works, when to use it versus other analytics services like Amazon Redshift or Amazon EMR, and what exam traps to avoid.

25 min read
Intermediate
Updated May 31, 2026

The Library Card Catalog for Your Data Lake

Imagine a massive library with millions of books stored in a giant warehouse (your data lake on Amazon S3). The books are in different languages, formats, and scattered across shelves. You want to find specific information—say, all books about 'cloud computing' published after 2020—but you don't have a librarian to search for you. Amazon Athena is like a super-fast card catalog system that lets you ask questions in plain English (SQL) and instantly get answers without moving any books. You don't need to organize the books first; the card catalog knows where each book is and how to read it. You pay only for each search you perform, based on the number of books you scan. If you ask a vague question that scans the entire warehouse, you pay more. But if you narrow your search (e.g., only books in the 'technology' section), you pay less. Athena works the same way: you write SQL queries against data in S3, and it automatically figures out how to read the data (even if it's in CSV, JSON, Parquet, etc.) and returns results in seconds. You don't need to set up servers or manage infrastructure—just point Athena at your S3 data and start querying.

How It Actually Works

What is Amazon Athena?

Amazon Athena is a serverless, interactive query service that enables you to analyze data directly in Amazon S3 using standard SQL. It is built on the open-source Presto engine and Apache Hive for defining schemas. Athena requires no infrastructure to manage—you simply point at your data in S3, define a schema (or use a schema-on-read approach), and start querying. You pay only for the data scanned per query, with no upfront costs or cluster provisioning.

The Problem Athena Solves

Before Athena, analyzing data stored in S3 typically required setting up a cluster of servers (e.g., Amazon EMR or a traditional data warehouse like Redshift). This involved provisioning, configuring, and managing compute resources, even if you only needed to run occasional ad-hoc queries. Athena eliminates this overhead by providing a serverless query engine that automatically scales based on the query's complexity. It also integrates with AWS Glue Data Catalog to store table metadata, making it easy to discover and query datasets.

How Athena Works

Athena uses a multi-step process to execute queries:

1.

Query Submission: You submit a SQL query via the AWS Management Console, AWS CLI, or SDK. The query can reference tables defined in the AWS Glue Data Catalog or a custom Hive metastore.

2.

Query Planning: Athena's query engine analyzes the SQL and the table metadata to determine which S3 objects to scan. It uses partition pruning—if your data is partitioned by column (e.g., year, month, day), Athena only scans the relevant partitions, reducing cost and improving performance.

3.

Data Scanning: Athena reads the data from S3. It supports various formats: CSV, JSON, Parquet, ORC, Avro, and others. Columnar formats like Parquet and ORC are recommended because they allow Athena to read only the columns needed, minimizing scanned data.

4.

Query Execution: The Presto-based engine executes the query across a fleet of ephemeral compute resources. These resources are managed by AWS and are completely invisible to the user.

5.

Result Delivery: Results are returned as a table in the console, or can be saved to a specified S3 bucket for further analysis.

Key Features

Serverless: No servers to provision or manage. Athena automatically scales resources based on query demand.

Pay-per-query: You are charged $5.00 per TB of data scanned (as of 2025). Queries that scan less data cost less. You can also use compression, partitioning, and columnar formats to reduce scanned data.

Schema-on-Read: You define the schema when you query, not when you store the data. This allows you to store data in raw formats and interpret it later.

Built-in Integrations: Athena integrates with AWS Glue for catalog management, Amazon QuickSight for visualization, and can be connected to JDBC/ODBC drivers for use with BI tools.

Federated Query: Athena can query data from other sources like Amazon DynamoDB, Amazon RDS, or on-premises databases using Lambda-based connectors.

Pricing Model

Athena's pricing is straightforward: - Data Scanned: $5.00 per TB of data scanned. Queries that scan less than 10 MB are free. - DDL Queries: Queries like CREATE TABLE, DROP TABLE, etc., are free. - Failed Queries: You are still charged for data scanned even if the query fails, because the data was already read. - No additional costs: No charges for underlying compute or storage (you pay separately for S3 storage and Glue Data Catalog if used).

Comparison to On-Premises and Alternatives

On-premises solutions for querying data in a data lake often require dedicated Hadoop or Spark clusters, which are expensive to maintain and require skilled administrators. Athena offers a fully managed alternative with no upfront commitment. Compared to Amazon Redshift, Athena is better for ad-hoc, infrequent queries on raw data, whereas Redshift is optimized for high-performance, repeated queries on structured, aggregated data. Amazon EMR is more flexible for custom big data processing (e.g., using Spark, Hive, or HBase) but requires cluster management. Athena is the simplest for SQL-based analysis on S3 data.

When to Use Athena vs Alternatives

Use Athena when: You have data in S3, need to run ad-hoc SQL queries, don't want to manage infrastructure, and your query patterns are unpredictable. Ideal for log analysis, ad-hoc business intelligence, and querying data lakes.

Use Redshift when: You need high-performance, concurrent queries on structured data with complex joins and aggregations. Redshift is a data warehouse, not a data lake query engine.

Use EMR when: You need to run custom big data frameworks (Spark, Hive, Presto) and require more control over compute resources.

Use Athena with Glue: For ETL (Extract, Transform, Load) jobs, you might use AWS Glue to transform data and then query it with Athena.

Limits and Defaults

Query Timeout: Queries time out after 30 minutes by default (can be increased to 60 minutes).

Result Size: Each query result set is limited to 10 MB for console display. Larger results can be saved to S3.

Concurrent Queries: Default limit is 20 concurrent queries per account (can be increased via Service Quotas).

Data Format Support: CSV, TSV, JSON, Parquet, ORC, Avro, and others.

Partitioning: Supports up to 1,000,000 partitions per table.

DML Queries: Supports SELECT, UNLOAD, and CTAS (CREATE TABLE AS SELECT) but not INSERT, UPDATE, or DELETE (Athena is read-only for existing data).

Walk-Through

1

Store Data in Amazon S3

First, you must have your data stored in Amazon S3. This can be any format Athena supports, such as CSV, JSON, Parquet, or ORC. For best performance and cost, organize your data with a partition structure (e.g., s3://my-bucket/data/year=2023/month=01/day=01/). You do not need to index the data; Athena will scan it based on your query. Ensure your S3 bucket is in the same AWS Region as where you will use Athena to minimize latency and data transfer costs.

2

Create a Table in AWS Glue Data Catalog

You need to define a schema for your data so Athena knows the column names, data types, and location. This is done by creating a table in the AWS Glue Data Catalog (or using a Hive metastore). You can use the Athena console's 'Create Table' wizard, which can automatically detect schema from data in S3 (CSV, JSON, etc.). Alternatively, you can write a CREATE TABLE statement in Athena. The table definition includes the S3 location, format, and partition columns. Once created, the table is available for querying.

3

Write and Execute a SQL Query

In the Athena console, you write a standard SQL query (e.g., SELECT * FROM my_table WHERE year = 2023). You can also use the AWS CLI or SDK. When you run the query, Athena parses the SQL, uses the Glue Catalog to find the table metadata, and then scans the relevant S3 objects. It uses partition pruning if your table is partitioned. The query engine runs on ephemeral compute resources that AWS manages. You can monitor query progress and see estimated data scanned before execution.

4

Review Results and Optimize

After execution, the results appear in the console (limited to 10 MB). You can also save results to a specified S3 bucket. To optimize costs and performance, review the 'Data Scanned' metric. Use compression (e.g., gzip), convert data to columnar formats (Parquet, ORC), and partition by frequently filtered columns. You can also use the CTAS statement to create a new table with optimized format. Athena provides detailed query statistics, including execution time and bytes scanned, which help you refine your approach.

5

Automate and Integrate with Other Services

Athena can be integrated into automated workflows using AWS SDKs, CLI, or Step Functions. For example, you can schedule queries with Amazon EventBridge to run periodically. You can also connect Athena to Amazon QuickSight for dashboards, or use the JDBC/ODBC driver to connect BI tools like Tableau. For federated queries, you can configure Lambda connectors to query data in DynamoDB, RDS, or on-premises databases, extending Athena's reach beyond S3.

What This Looks Like on the Job

Scenario 1: Ad-Hoc Log Analysis for a SaaS Company

A SaaS company stores its application logs in Amazon S3 as compressed JSON files, partitioned by date. When a customer reports an error, the support team needs to quickly search logs for specific error codes. Using Athena, they run a query like: SELECT * FROM logs WHERE error_code = '500' AND date = '2025-03-01'. Without Athena, they would need to download gigabytes of logs or set up a full ELK stack. Athena returns results in seconds, and the cost is minimal because they only scan the relevant partition. The team also uses Athena to generate weekly reports on error trends. A common mistake is not partitioning by date, causing full-table scans that cost more and slow down queries.

Scenario 2: Data Lake Analytics for a Retail Company

A retail company ingests sales data from multiple sources into an S3 data lake. The data is in raw CSV format. Analysts need to run ad-hoc queries to understand sales performance by region and product. They create a Glue table on the raw data and use Athena to run SQL queries. Over time, they convert the data to Parquet format and partition by region and month, reducing query costs by 70%. They also use Athena's CTAS to create summary tables for common queries. The business problem solved is enabling self-service analytics without provisioning a data warehouse. Misconfiguration occurs when users forget to compress data or use inefficient formats, leading to high costs per query.

Scenario 3: Federated Query for a Multi-Source Dashboard

A financial services company needs to combine data from an S3 data lake (historical trades) with live data from a DynamoDB table (current positions) and an RDS database (customer profiles). They use Athena's federated query feature with Lambda connectors to query all sources in a single SQL statement. This allows them to build a real-time dashboard in QuickSight without moving data. The cost is based on the data scanned across all sources. A pitfall is that federated queries can be slower than native S3 queries because of network latency to the source databases. The exam tests that Athena can query multiple sources, not just S3.

How CLF-C02 Actually Tests This

What CLF-C02 Tests on This Objective

For Domain 3 (Cloud Technology Services), Objective 3.5, the exam expects you to understand the purpose and basic characteristics of Amazon Athena. Specifically, you should know:

Athena is serverless and requires no infrastructure management.

It queries data directly from Amazon S3 using standard SQL.

Pricing is based on the amount of data scanned per query (pay-per-query).

It supports various data formats, with Parquet and ORC being optimal.

It integrates with AWS Glue Data Catalog for schema management.

It is part of the serverless analytics family, distinct from Redshift (data warehouse) and EMR (big data processing).

Common Wrong Answers and Why Candidates Choose Them

1.

"Athena requires a running cluster of EC2 instances." Wrong because Athena is serverless. Candidates confuse it with Amazon EMR or Redshift, which do require clusters. The exam tests that serverless means no provisioning.

2.

"Athena can be used to insert, update, or delete data in S3." Wrong because Athena is primarily a read-only query engine. It supports CTAS and UNLOAD for writing results, but not DML like INSERT/UPDATE/DELETE. Candidates may assume SQL implies full DML support.

3.

"Athena pricing is based on the number of queries executed." Wrong because pricing is based on data scanned, not query count. Candidates often think 'per query' means a flat fee per query, but the cost varies with data volume.

4.

"Athena can only query data stored in Amazon S3." While S3 is the primary source, Athena also supports federated queries to other databases via Lambda connectors. The exam may present a scenario where Athena queries DynamoDB or RDS, and candidates incorrectly think it's impossible.

Specific Terms and Values

Data scanned pricing: $5.00 per TB

Supported formats: CSV, JSON, Parquet, ORC, Avro

Integration: AWS Glue Data Catalog, QuickSight

Serverless: No servers to manage

Based on: Presto engine

Tricky Distinctions

Athena vs. Redshift Spectrum: Both can query S3 data. Redshift Spectrum is an extension of Redshift that allows querying S3 from a Redshift cluster. Athena is standalone and serverless. The exam may ask which service is best for ad-hoc queries without a data warehouse.

Athena vs. EMR: EMR requires cluster management and supports multiple processing frameworks (Spark, Hive). Athena is simpler and SQL-only.

Decision Rule

If the question describes a need to run SQL queries on data in S3 without managing infrastructure, the answer is Athena. If it mentions high-performance, repeated queries on structured data, think Redshift. If it mentions custom big data processing, think EMR.

Key Takeaways

Amazon Athena is a serverless interactive query service that analyzes data in S3 using standard SQL.

Pricing is $5.00 per TB of data scanned; use columnar formats like Parquet to reduce costs.

Athena uses schema-on-read – define table schemas at query time via AWS Glue Data Catalog.

Athena supports federated queries to other data sources (DynamoDB, RDS) using Lambda connectors.

Athena is ideal for ad-hoc queries on data lakes; not a replacement for data warehouses like Redshift.

Partition your data in S3 and use compression to minimize data scanned and improve performance.

Athena is based on the Presto engine and supports DML only for SELECT, CTAS, and UNLOAD (no INSERT/UPDATE/DELETE).

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Amazon Athena

Serverless – no infrastructure to manage

Pay-per-query based on data scanned ($5/TB)

Queries data directly from S3 (data lake)

Schema-on-read; no data loading required

Best for ad-hoc, infrequent queries

Amazon Redshift

Requires a cluster of nodes to provision

Pay per hour for cluster compute + storage

Stores data in managed storage (data warehouse)

Schema-on-write; data must be loaded and optimized

Best for high-performance, concurrent, repeated queries

Watch Out for These

Mistake

Athena requires you to define a schema before storing data (schema-on-write).

Correct

Athena uses schema-on-read. You can store raw data in S3 and define the schema at query time using the Glue Data Catalog. The schema does not affect how data is stored.

Mistake

Athena can only query data in the same AWS Region as the Athena service.

Correct

Athena can query data in any region, but best practice is to keep data in the same region to avoid data transfer costs and latency. Cross-region queries are possible but not optimal.

Mistake

Athena is a data warehouse like Amazon Redshift.

Correct

Athena is a serverless query service for data lakes, not a data warehouse. It does not store data or support indexes. Redshift is a fully managed data warehouse with columnar storage and performance optimizations.

Mistake

Athena charges per query executed, regardless of data scanned.

Correct

Athena charges based on the amount of data scanned per query, at $5.00 per TB. Queries that scan less data cost less. DDL queries (CREATE TABLE, etc.) are free.

Mistake

Athena can only handle structured data like CSV.

Correct

Athena supports semi-structured data like JSON and columnar formats like Parquet, ORC, and Avro. It can also handle custom formats via SerDe libraries.

Frequently Asked Questions

Is Amazon Athena free to use?

No, Athena is not free. You pay $5.00 per TB of data scanned for each query. However, DDL statements like CREATE TABLE are free. Queries that scan less than 10 MB are also free. Additionally, you pay for S3 storage and Glue Data Catalog if used. The exam tests that pricing is based on data scanned.

Can Athena update or delete data in S3?

No, Athena is primarily a read-only query engine. It cannot perform INSERT, UPDATE, or DELETE operations on data in S3. It supports SELECT, CREATE TABLE AS SELECT (CTAS), and UNLOAD (which writes query results to S3). To modify data, you would use AWS Glue ETL or other processing services. The exam often tests this limitation.

How does Athena differ from Amazon Redshift Spectrum?

Both can query data in S3 using SQL. However, Redshift Spectrum is an extension of Amazon Redshift that requires a Redshift cluster to be running. Athena is a standalone serverless service with no cluster needed. Use Athena for ad-hoc queries without a data warehouse; use Redshift Spectrum if you already have a Redshift cluster and want to query S3 data alongside warehouse data.

What is the best file format to use with Athena?

Columnar formats like Parquet and ORC are best because they reduce the amount of data scanned by reading only the columns needed. They also compress better. For example, a query on a 1 GB CSV file might scan the entire file, while the same data in Parquet might scan only 100 MB, reducing cost by 90%. The exam emphasizes using columnar formats for cost optimization.

Can Athena query data from multiple S3 buckets?

Yes, you can create tables that point to different S3 locations, and then join them in a single query. Athena can query any S3 bucket you have access to, as long as the table definitions in the Glue Data Catalog reference those locations. There is no limit on the number of buckets per query.

Does Athena support ACID transactions?

No, Athena does not support ACID transactions. It is designed for analytical queries on static or append-only data. For transactional workloads on S3, consider using Apache Iceberg tables via Amazon EMR or Athena's Iceberg support (which provides ACID compliance for certain operations). However, for CLF-C02, know that Athena is not transactional.

How can I reduce Athena query costs?

Use columnar formats (Parquet, ORC), compress data (gzip, snappy), partition tables by frequently filtered columns (e.g., date), and only select the columns you need. Also, consider using the UNLOAD command to write aggregated results to S3 for repeated use. The exam tests these optimization techniques.

Terms Worth Knowing

Ready to put this to the test?

You've just covered Amazon Athena — now see how well it sticks with free CLF-C02 practice questions. Full explanations included, no account needed.

Done with this chapter?