DVA-C02Chapter 86 of 101Objective 1.6

Amazon Athena for Serverless SQL Queries

This chapter covers Amazon Athena, a serverless interactive query service that enables you to analyze data directly in Amazon S3 using standard SQL. For the DVA-C02 exam, understanding Athena is critical because it appears in questions about serverless analytics, data lake architectures, and cost-optimized querying of S3 data. Approximately 5-8% of exam questions touch on Athena, often comparing it to services like Amazon Redshift Spectrum, AWS Glue, or Amazon QuickSight. You will need to know Athena's integration with the Glue Data Catalog, its support for various data formats, partitioning strategies, and how it charges based on data scanned.

25 min read
Intermediate
Updated May 31, 2026

Library Card Catalog for S3 Data

Amazon Athena is like a library with a massive, disorganized warehouse of books (your data in S3). Instead of hiring a librarian to manually sort and read every book each time you ask a question, Athena uses a card catalog system (AWS Glue Data Catalog). You write a query like 'find all books by author X published after 2000' on a slip of paper (SQL query). The catalog tells Athena exactly which shelves and boxes contain relevant books, so it only opens those boxes and scans those pages. Each book is read by a team of temporary workers (distributed query engine) who work in parallel, each reading different pages and jotting down answers on sticky notes. The workers then hand their notes to a foreman (coordinator) who combines them into a single answer. Crucially, the workers never move the books or write in them—they just read. They are paid per page read (data scanned), so you want the catalog to be precise to avoid paying for scanning the entire warehouse. If the catalog is wrong (e.g., says a box contains fiction when it's actually non-fiction), the workers might miss books or scan irrelevant ones, wasting time and money. Athena's power is that it can query any book in any format (CSV, JSON, Parquet) as long as the catalog describes how to interpret it, and it never alters the original books.

How It Actually Works

What is Amazon Athena and Why It Exists

Amazon Athena is a serverless, interactive query service that allows you to analyze data stored in Amazon S3 using standard SQL. It was launched in 2016 to address the need for ad-hoc querying of data lakes without the operational overhead of managing a traditional data warehouse. Athena is built on the open-source PrestoDB engine, with optimizations for AWS. It is ideal for one-time queries, exploratory analysis, and running SQL on data that is already in S3 without having to extract, transform, and load (ETL) it into a separate database.

How Athena Works Internally

When you submit a query via the Athena console, API, or JDBC/ODBC driver, the following occurs: 1. Query Parsing and Planning: Athena parses your SQL statement and creates a logical query plan. It consults the Glue Data Catalog (or an external Hive metastore) to determine the schema, location, and format of the data. The catalog stores table definitions that map to S3 prefixes or files. 2. Query Optimization: The query engine optimizes the plan using partition pruning, predicate pushdown, and columnar format optimizations. For example, if your query has a WHERE clause on a partitioned column, Athena will only scan the relevant partitions. 3. Distributed Execution: Athena uses a distributed execution model. A coordinator node splits the work into tasks and assigns them to worker nodes. Each worker reads data from S3 in parallel, applying filters and aggregations. Workers process data in memory, using techniques like vectorized reading for columnar formats. 4. Result Aggregation: Workers send intermediate results back to the coordinator, which combines them into the final result set. The result is stored in an S3 bucket (specified as the query result location) and also returned to the client.

Key Components, Values, Defaults, and Timers

Data Sources: Athena can query data in S3, as well as other sources via Athena Federated Query (using Lambda connectors). The primary source is S3.

Data Formats: Supported formats include CSV, TSV, JSON, Parquet, ORC, Avro, and custom formats via SerDe (Serializer/Deserializer) libraries. For optimal performance and cost, use columnar formats like Parquet or ORC.

Partitioning: Tables can be partitioned by a column (e.g., year, month, day). Partitioning reduces data scanned and improves query speed. Partitions are defined in the Glue Data Catalog.

Glue Data Catalog: By default, Athena uses the AWS Glue Data Catalog as its metadata store. You can also use an external Hive metastore.

Query Result Location: You must specify an S3 bucket and prefix where Athena writes query results. Default: aws-athena-query-results-<account-id>-<region>.

Encryption: Results can be encrypted at rest using SSE-S3, SSE-KMS, or SSE-C. Data in S3 is encrypted independently.

Workgroups: Isolate queries, control costs, and manage permissions. You can set per-workgroup limits on data scanned.

AWS Lambda Integration: For federated queries, you can create Lambda connectors to query data from DynamoDB, RDS, etc.

Pricing: Charged based on the amount of data scanned per query, in terabytes. As of 2025, $5.00 per TB scanned (in US East). Query result storage in S3 incurs standard S3 charges.

Limits: Default concurrent query limit is 20 per account (can be increased). Maximum query execution time is 30 minutes. Maximum result set size is 10 GB per query.

Configuration and Verification Commands

To create a table in Athena using the Glue Data Catalog, you can use the AWS CLI or the console. Example CLI command to create a table:

aws glue create-table --database-name mydb --table-input '{"Name":"sales","StorageDescriptor":{"Columns":[{"Name":"product","Type":"string"},{"Name":"price","Type":"double"}],"Location":"s3://mybucket/sales/","InputFormat":"org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat","OutputFormat":"org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat","SerdeInfo":{"SerializationLibrary":"org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"}},"PartitionKeys":[{"Name":"year","Type":"string"}]}'

After creating the table, you can load partitions manually:

MSCK REPAIR TABLE sales;

Or use ALTER TABLE ADD PARTITION. For querying, you can use Athena's console or the API:

athena.start-query-execution --query-string "SELECT product, SUM(price) FROM sales WHERE year='2024' GROUP BY product" --result-configuration OutputLocation=s3://my-query-results/

To check query status:

athena.get-query-execution --query-execution-id <id>

How Athena Interacts with Related Technologies

AWS Glue: Glue provides the Data Catalog and can also run ETL jobs to convert data into columnar formats. Athena queries the catalog for schema information.

Amazon S3: Athena reads data directly from S3. S3 event notifications can trigger Lambda functions to update partitions or run ETL.

AWS Lambda: Used in Athena Federated Query to connect to other data sources. Also used to automate partition management (e.g., Lambda function to add partitions when new data arrives).

Amazon QuickSight: Athena can be used as a data source for QuickSight dashboards, enabling interactive visualization of S3 data.

Amazon Redshift Spectrum: Both query data in S3. Redshift Spectrum uses Redshift clusters for compute, while Athena is serverless. Athena is simpler for ad-hoc queries; Redshift Spectrum is better for integration with Redshift data warehouse.

AWS Lake Formation: Provides fine-grained access control for Athena queries on tables and columns. You can grant or revoke permissions at the table/column level.

Performance Optimization Techniques

Use Columnar Formats: Parquet and ORC reduce data scanned by storing data by column. Only relevant columns are read.

Partition Tables: Partition by frequently filtered columns (e.g., date, region). Ensure partitions are in a hierarchical structure like year=2024/month=01/day=15/.

Compress Data: Use snappy or gzip compression to reduce storage and I/O.

Use Bucketing: For large tables, bucketing on a column can improve join performance. Athena supports bucketing for tables created in Glue.

Optimize File Size: Aim for files around 128 MB to 1 GB after compression. Too many small files hurt performance.

Use ORDER BY and LIMIT: When exploring data, use LIMIT to reduce scanned data (though Athena still scans all data unless partition pruning applies).

CTAS Queries: Use CREATE TABLE AS SELECT (CTAS) to transform data into a more efficient format and store it in S3 for future queries.

Walk-Through

1

Define Table Schema in Glue

Before querying data in S3, you must define a table schema in the AWS Glue Data Catalog (or an external Hive metastore). This step involves specifying the table name, database, column names, data types, and the S3 location of the data. You also need to specify the SerDe (serializer/deserializer) for the data format (e.g., Parquet, CSV). For partitioned tables, you define partition keys (e.g., year, month). This metadata is stored in the Glue catalog and is used by Athena to understand how to read the data. You can create tables manually via the Athena console, AWS Glue console, or using AWS CLI/API. Athena also supports creating tables from existing data using the 'CREATE EXTERNAL TABLE' DDL.

2

Write and Submit SQL Query

You write a standard SQL query using Athena's query editor, JDBC/ODBC driver, or the StartQueryExecution API. The query can include SELECT, JOIN, GROUP BY, HAVING, ORDER BY, and common SQL functions. Athena supports ANSI SQL with some extensions for big data (e.g., UNNEST for arrays). You must specify an S3 output location for query results. Once submitted, Athena generates a unique query execution ID and begins processing. The query is parsed, validated, and optimized. You can monitor progress via the console or API. Queries are limited to 30 minutes execution time and can scan up to 10 GB of result data.

3

Query Planning and Optimization

Athena's query engine (based on Presto) analyzes the SQL statement and consults the Glue Data Catalog to get table metadata. It performs several optimizations: partition pruning (skipping irrelevant partitions based on WHERE clauses), predicate pushdown (filtering at the storage layer for columnar formats), and column pruning (reading only required columns). For federated queries, it may push down filters to the source database. The engine also decides on join strategies (broadcast vs. distributed) based on table sizes. This planning phase is crucial for minimizing data scanned and query latency. The optimizer uses statistics from Glue (like row counts) if available, but these are not automatically maintained.

4

Distributed Data Scanning

Athena distributes the query execution across multiple worker nodes. Each worker is assigned a slice of data to scan from S3. Workers read data in parallel using HTTP range requests to S3. For columnar formats like Parquet, they read only the relevant column chunks. For row-oriented formats like CSV, they read entire rows but apply filters in memory. Workers apply any filtering, aggregation, and transformation required by the query. They communicate intermediate results back to the coordinator. The number of workers scales automatically based on the amount of data scanned and the complexity of the query. This parallelism allows Athena to query terabytes of data in seconds.

5

Result Aggregation and Delivery

The coordinator node receives partial results from all workers and performs final aggregation (e.g., summing counts, ordering results). The final result set is written to the specified S3 output location as a CSV or other format (you can choose Parquet or JSON via CTAS). The coordinator also sends the result back to the client (e.g., console displays the first 1000 rows). The query execution ID remains associated with the result. The S3 output location can be encrypted. Athena also stores metadata about the query (execution time, bytes scanned) which you can retrieve via the API for cost tracking.

What This Looks Like on the Job

Enterprise Scenario 1: Ad-hoc Analytics for a Data Lake

A large e-commerce company stores all its clickstream data in S3 as JSON files, partitioned by date. The data team needs to answer ad-hoc business questions like 'What is the conversion rate for users from mobile devices in the last 7 days?' Using Athena, they can run SQL queries directly on the S3 data without setting up a data warehouse. They create an external table in Glue pointing to the S3 location, with partitions on date. Queries filter on the date partition to scan only the last 7 days of data. The team uses Athena's JDBC driver to connect from Tableau for visualization. They set up a workgroup with a data limit of 1 TB per query to control costs. Performance is good for queries scanning up to a few hundred GB. For larger scans, they convert data to Parquet using AWS Glue ETL jobs, reducing scanned data by 70% and improving query speed by 3x.

Enterprise Scenario 2: Log Analysis with Federated Query

A financial services company uses Athena to analyze security logs stored in S3 and combine them with real-time data from DynamoDB. They use Athena Federated Query with a Lambda connector for DynamoDB. A security analyst writes a query that joins the S3 log table with a DynamoDB table containing user profiles, to identify suspicious activity. The Lambda connector translates the SQL predicate into DynamoDB queries, minimizing data transfer. The company must ensure that the Lambda function has appropriate IAM permissions to read DynamoDB and that it runs within the 30-minute query timeout. They also enable encryption on the S3 query results using SSE-KMS.

Common Misconfigurations and Issues

Missing Partitions: If partitions are added to S3 without updating the Glue catalog, Athena will not see them. Regular MSCK REPAIR TABLE or automated partition discovery via Glue crawler is needed.

Too Many Small Files: A common performance pitfall is having millions of tiny files (e.g., each a few KB). Athena's distributed engine spends more time on file overhead than actual scanning. Best practice is to coalesce files into larger ones (128-512 MB) using ETL or compaction.

Inefficient File Formats: Using CSV for large datasets leads to high cost and slow queries because Athena must scan entire rows even if only a few columns are needed. Converting to Parquet can reduce cost by 90%.

Incorrect SerDe Configuration: Using the wrong SerDe for a file format (e.g., using LazySimpleSerDe for JSON) causes query failures. Always specify the correct SerDe: org.apache.hive.hcatalog.data.JsonSerDe for JSON, org.apache.hadoop.hive.serde2.OpenCSVSerde for CSV with headers.

How DVA-C02 Actually Tests This

What DVA-C02 Tests on Amazon Athena

The DVA-C02 exam focuses on Athena's integration with other AWS services, its serverless nature, and cost optimization. Key objective codes: Domain 1 (Development), Objective 1.6 (Develop serverless solutions). Specific topics include:

Athena's use of the Glue Data Catalog for schema management.

Partitioning strategies to reduce costs.

Supported data formats (especially Parquet and ORC for columnar storage).

Integration with QuickSight for visualization.

Federated Query capability (Lambda connectors).

Comparison with Redshift Spectrum and Athena's advantage for ad-hoc queries.

Common Wrong Answers and Why Candidates Choose Them

1.

Athena requires data to be loaded into a database first: Wrong. Athena queries data directly from S3; no loading or ETL is required. Candidates confuse Athena with Redshift or RDS.

2.

Athena is best for transactional workloads with frequent updates: Wrong. Athena is designed for analytical queries on static data. It does not support INSERT/UPDATE/DELETE; only SELECT and CTAS. Candidates may think SQL implies full CRUD.

3.

Athena automatically compresses data to reduce costs: Wrong. Athena does not compress data; it reads data as stored. Cost reduction comes from using compressed columnar formats like Parquet, which must be created separately.

4.

Athena can query data in any AWS service without additional setup: Wrong. While Athena Federated Query can access other sources, it requires Lambda connectors and additional configuration. Direct querying is only for S3.

Specific Numbers and Terms That Appear on the Exam

$5.00 per TB scanned (pricing in US East).

30 minutes maximum query execution time.

10 GB maximum result set size per query.

20 concurrent queries default limit.

Parquet and ORC are the recommended formats.

MSCK REPAIR TABLE command to load partitions.

Workgroups to control costs and permissions.

Edge Cases and Exceptions

Athena does not support DML statements (INSERT, UPDATE, DELETE). Only SELECT and CTAS. If a question asks about modifying data, the answer is not Athena.

Athena can query data encrypted with SSE-S3, SSE-KMS, or SSE-C as long as the IAM role has appropriate permissions. For SSE-C, you must provide the encryption key in the query configuration (not commonly tested).

Athena Federated Query is limited to sources supported by Lambda connectors; not all AWS services are available out-of-the-box.

Costs can be unpredictable if queries are not optimized. The exam may present a scenario where a query scans too much data, and the correct answer is to partition the table or use columnar formats.

How to Eliminate Wrong Answers Using the Underlying Mechanism

If a question mentions 'real-time' or 'sub-second response', Athena is likely not the answer because it has overhead from scanning S3.

If a question involves joining data from S3 with a relational database, consider Athena Federated Query, not Athena alone.

If a question emphasizes cost savings, look for answers that mention partitioning, columnar formats, or compressed files.

If a question requires updating data in place, eliminate Athena immediately.

Key Takeaways

Amazon Athena is serverless and charges based on data scanned ($5.00 per TB in US East).

Athena queries data directly from S3 using standard SQL; no data loading required.

Use columnar formats like Parquet or ORC to reduce scanned data and improve performance.

Partition tables by frequently filtered columns (e.g., date) and use partition pruning.

Schema must be defined in the Glue Data Catalog; use MSCK REPAIR TABLE to load partitions.

Athena does not support DML (INSERT, UPDATE, DELETE); only SELECT and CTAS.

Federated Query allows querying other AWS services via Lambda connectors.

Query results are stored in an S3 bucket you specify; can be encrypted.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Amazon Athena

Serverless – no cluster to manage.

Priced per TB of data scanned.

Ideal for ad-hoc queries on S3 data.

Supports federated queries via Lambda connectors.

Best for infrequent or unpredictable query patterns.

Amazon Redshift Spectrum

Requires an Amazon Redshift cluster (provisioned or serverless).

Priced per cluster hour plus data scanned from S3.

Tightly integrated with Redshift for complex analytics.

Can join S3 data with Redshift tables efficiently.

Better for frequent, complex queries that benefit from Redshift's optimization.

Watch Out for These

Mistake

Athena stores data in its own managed storage.

Correct

Athena is serverless and does not store data. It reads data directly from S3. The only storage Athena uses is the S3 bucket for query results.

Mistake

Athena supports full SQL including INSERT, UPDATE, DELETE.

Correct

Athena only supports SELECT queries and CREATE TABLE AS SELECT (CTAS). It does not support DML statements like INSERT, UPDATE, or DELETE.

Mistake

Athena automatically creates table schemas from S3 data.

Correct

Athena requires a schema to be defined in the Glue Data Catalog (or external metastore). It does not automatically infer schema unless you use a Glue crawler.

Mistake

Using LIMIT in a query reduces the amount of data scanned.

Correct

LIMIT does not reduce data scanned; Athena still reads all rows to compute the result. To reduce scanned data, use WHERE clauses on partitioned columns or filter early.

Mistake

Athena can query data in RDS or DynamoDB without any additional configuration.

Correct

Athena can only query S3 natively. To query other sources, you must use Athena Federated Query with a Lambda connector, which requires setup and permissions.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the difference between Athena and Redshift Spectrum?

Athena is serverless and charges per TB scanned, making it ideal for ad-hoc queries. Redshift Spectrum requires a Redshift cluster and charges per cluster hour plus data scanned. Spectrum is better for frequent queries integrated with a Redshift data warehouse. For DVA-C02, remember that Athena is simpler and more cost-effective for occasional queries, while Spectrum is for heavy analytics within a Redshift environment.

How can I reduce Athena query costs?

To reduce costs: (1) Use columnar formats like Parquet or ORC. (2) Partition tables and filter on partition columns. (3) Compress data (e.g., snappy). (4) Use CTAS to convert data into efficient formats. (5) Set workgroup limits on data scanned. (6) Avoid SELECT * – only select needed columns. The exam expects you to know that partitioning and columnar formats are the most effective.

Does Athena support ACID transactions?

No. Athena does not support ACID transactions. It is designed for read-only analytical queries on static data. For transactional workloads, use Amazon RDS, DynamoDB, or Redshift.

Can Athena query data in a different AWS account?

Yes, you can query S3 buckets in another account if the bucket policy grants cross-account access to your Athena principal (IAM role/user). The Glue Data Catalog can also be shared across accounts using AWS Resource Access Manager (RAM).

How do I handle partitions in Athena?

Partitions must be defined in the Glue Data Catalog. You can add partitions manually using ALTER TABLE ADD PARTITION, or use MSCK REPAIR TABLE to scan S3 for partition directories. For automatic partition discovery, use a Glue crawler. The exam may ask you to choose between manual partition addition and MSCK REPAIR.

What is Athena Federated Query?

Athena Federated Query allows you to run SQL queries across data stored in S3 and other relational, non-relational, object, or custom data sources. It uses AWS Lambda connectors to interact with the source (e.g., DynamoDB, RDS). This is useful for joining data from multiple sources without moving data.

Can I use Athena with AWS Lake Formation?

Yes. Athena integrates with AWS Lake Formation to provide fine-grained access control. You can grant permissions at the database, table, or column level. Lake Formation also manages the Glue Data Catalog and can enforce row-level security (though row-level is not directly supported by Athena; it's enforced via views).

Terms Worth Knowing

Ready to put this to the test?

You've just covered Amazon Athena for Serverless SQL Queries — now see how well it sticks with free DVA-C02 practice questions. Full explanations included, no account needed.

Done with this chapter?