SAA-C03Chapter 151 of 189Objective 3.1

Amazon Athena Query Optimization

Amazon Athena is a serverless, interactive query service that enables you to analyze data directly in Amazon S3 using standard SQL. For the SAA-C03 exam, understanding how to optimize Athena queries is critical because poorly written queries can lead to high costs and slow performance. Approximately 5-10% of exam questions touch on Athena optimization, often in the context of cost control, performance tuning, or integration with other AWS services like Glue and S3. This chapter provides a deep dive into the mechanisms, configurations, and best practices for optimizing Athena queries, ensuring you can answer exam questions accurately and design efficient data analytics solutions.

25 min read
Intermediate
Updated May 31, 2026

Athena Query Optimization: Library Research

Imagine you need to find the average page count of all books in a massive library. The library has a card catalog (AWS Glue Data Catalog) that tells you which shelves (partitions) hold which books. If you search every book individually (full scan), it takes hours. Instead, you can optimize by: (1) Only looking at shelves that might have relevant books (partition pruning) — you skip the fiction section when you only need non-fiction. (2) Organizing the books on each shelf in a specific order (file format like Parquet) so you can quickly count pages without reading every book's cover. (3) Creating a separate index card that lists the page count for each book (columnar storage) — you only read the page count column, not the entire book. (4) Compressing the books into thinner volumes (compression like Snappy) so you can carry more per trip. (5) Using a faster librarian (Athena engine version 3) that can read multiple books at once (parallelism). (6) Caching the results of common searches (result reuse) so you don't re-read the same books. Each optimization reduces the time and effort to get your answer.

How It Actually Works

What is Amazon Athena and Why Optimize?

Amazon Athena is a serverless interactive query service that allows you to analyze data stored in Amazon S3 using standard SQL. It is built on Presto, an open-source distributed SQL query engine. Athena is ideal for ad-hoc queries, one-time analyses, and running SQL on data that is already in S3 without the need for ETL or loading into a database. However, because Athena charges per query based on the amount of data scanned, inefficient queries can result in high costs and slow performance. Optimization is essential to reduce data scanned, improve query speed, and control costs.

How Athena Executes Queries Internally

When you submit a query, Athena parses the SQL, creates a query plan, and distributes tasks to worker nodes. The key steps are:

1.

Query Parsing and Planning: Athena validates the SQL syntax and creates a logical plan. It uses the AWS Glue Data Catalog (or an external Hive metastore) to retrieve table metadata (schema, partitions, location).

2.

Data Source Selection: Based on the table's location in S3, Athena identifies the relevant objects (files/partitions). It applies partition pruning if the WHERE clause includes partition columns.

3.

Task Distribution: The query coordinator splits the work into smaller tasks and assigns them to worker nodes. Each worker reads a portion of the data from S3.

4.

Data Reading: Workers read data from S3 in parallel. The format of the files (CSV, JSON, Parquet, ORC) affects how much data is read. Columnar formats (Parquet, ORC) allow reading only the required columns.

5.

Processing and Aggregation: Workers perform filter, join, aggregation, and other operations on the data they read.

6.

Result Collection: The coordinator collects partial results from workers, performs final aggregation, and returns the result to the client.

Key Performance Factors

Data Scanned: The primary cost driver. Athena charges $5.00 per TB of data scanned (in US East). Reducing data scanned directly reduces cost and often improves speed.

File Format: Columnar formats like Parquet and ORC are optimized for analytics. They store data by column rather than by row, allowing Athena to read only the columns referenced in the query. Additionally, they support compression and encoding schemes that reduce storage and I/O.

Partitioning: Partitioning divides data into subdirectories based on partition keys (e.g., year, month, day). When a query filters on a partition column, Athena can skip entire directories, drastically reducing the amount of data read.

File Size and Number: Small files (e.g., < 128 MB) cause overhead in S3 listing and task scheduling. Athena works best with files of a few hundred MB to 1 GB. Too many small files (the "small files problem") degrades performance.

Compression: Compressing files reduces storage size and I/O, but Athena must decompress them. Splittable compression formats like Snappy or LZO allow parallel processing of compressed files. Gzip is not splittable, so a large gzip file will be read by a single worker, limiting parallelism.

Data Layout: Sorting data within files by frequently filtered columns can improve compression and predicate pushdown. Bucketing (hash partitioning) can also improve join performance.

Query Complexity: Joins, subqueries, and aggregations over large datasets require more memory and compute. Optimizing join order, using predicate pushdown, and avoiding SELECT * are best practices.

Configuration and Best Practices

#### 1. Use Columnar File Formats Convert your data to Parquet or ORC. Parquet is widely supported and recommended for Athena. Example conversion using Spark:

CREATE TABLE my_parquet_table
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://bucket/parquet/'
) AS
SELECT * FROM my_csv_table;

#### 2. Partition Your Data Choose partition keys based on common query filters. For time-series data, partition by year, month, day, hour. Example:

CREATE EXTERNAL TABLE logs (
  col1 string,
  col2 int
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS PARQUET
LOCATION 's3://bucket/logs/';

After adding partitions (MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION), queries like SELECT * FROM logs WHERE year='2023' AND month='12' will only scan the relevant directories.

#### 3. Optimize File Size Aim for files between 128 MB and 1 GB. Use ETL jobs (e.g., AWS Glue, Spark) to coalesce small files. In Spark: df.coalesce(n) or df.repartition(n).

#### 4. Use Compression Use Snappy or LZO for Parquet/ORC. Avoid Gzip for large files unless you can split them. Example in Spark:

df.write.option("compression", "snappy").parquet("s3://bucket/output/")

#### 5. Avoid SELECT * Only select the columns you need. This reduces the amount of data scanned from columnar storage.

#### 6. Use ORDER BY and LIMIT Carefully ORDER BY requires a single worker to sort the entire result set, which can be slow for large outputs. Use LIMIT to reduce the result size. If you need sorted results, consider using approximate aggregation functions.

#### 7. Use Approximate Functions For counts and distinct counts on large datasets, use APPROX_DISTINCT_COUNT and APPROX_PERCENTILE which are much faster and consume less data.

#### 8. Enable Result Reuse Athena can cache query results for a configurable duration (default 60 minutes). If the underlying data hasn't changed, reuse the cached result to avoid rescanning. Set workgroup configuration: resultReuseConfiguration.maxResultReuseAgeInMinutes.

#### 9. Use Athena Engine Version 3 Engine version 3 (based on Presto 0.267) includes performance improvements, better cost-based optimization, and support for new SQL features. Always use the latest engine version unless compatibility issues exist.

#### 10. Monitor with CloudWatch Metrics Enable CloudWatch metrics for your workgroup to track QueryBytesScanned, QueryExecutionTime, and QueryQueueTime. Set alarms for high data scanned.

Interacting with Related Technologies

AWS Glue Data Catalog: Athena uses Glue as its default metastore. Glue Crawlers can automatically discover schemas and partitions from S3.

Amazon S3: Athena reads data directly from S3. S3 features like S3 Select (not used by Athena) and S3 Inventory can help manage data.

AWS Glue ETL: Use Glue jobs to convert data to columnar formats, compact files, and add partitions.

AWS Lambda: Trigger Lambda functions to run Athena queries and process results.

Amazon QuickSight: Visualize Athena query results.

Common Pitfalls

Not Partitioning: Full table scans on large datasets are expensive and slow.

Too Many Partitions: Over-partitioning (e.g., by hour for low-volume data) creates many small directories and files, hurting performance. Aim for 100-500 partitions per table.

Using CSV or JSON: These row-based formats force Athena to read entire rows, even if only a few columns are needed. Conversion to Parquet can reduce scanned data by 50-90%.

Ignoring Compression: Uncompressed data increases I/O and cost.

Running Queries on Raw Data Without Schema Optimization: Define appropriate column types (e.g., use INT instead of STRING for numeric values) to improve filter efficiency.

Walk-Through

1

Analyze Query Patterns

Before optimizing, understand the typical queries run against your data. Identify which columns are frequently used in WHERE clauses, JOINs, and SELECT lists. This analysis determines the optimal partition keys, sort order, and file format. For example, if most queries filter by date and customer_id, partition by date and sort by customer_id within each partition. Use Athena query history (via CloudTrail or console) to gather this information.

2

Choose File Format and Compression

Convert data to a columnar format (Parquet or ORC) with splittable compression (Snappy or LZO). Parquet with Snappy is the most common choice. This step reduces data scanned by allowing column pruning and efficient compression. Use an ETL tool like AWS Glue, Spark, or Athena's CTAS (CREATE TABLE AS SELECT) to perform the conversion. For example: `CREATE TABLE new_parquet WITH (format='PARQUET', parquet_compression='SNAPPY', external_location='s3://bucket/parquet/') AS SELECT * FROM old_csv;`

3

Design Partitioning Strategy

Select partition keys based on common filter columns. For time-series data, use year/month/day. Ensure partition keys have low cardinality (few distinct values) to avoid too many partitions. Use ALTER TABLE ADD PARTITION or MSCK REPAIR TABLE to load partitions. Avoid over-partitioning: if each partition contains less than 128 MB of data, consider coarser granularity (e.g., monthly instead of daily). Monitor the number of partitions using SHOW PARTITIONS.

4

Optimize File Size and Count

Consolidate small files into larger ones (128 MB - 1 GB). Use ETL jobs to repartition the data. For example, in Spark: `df.write.option("maxRecordsPerFile", 1000000).parquet("s3://bucket/optimized/")`. This reduces the overhead of listing many files and scheduling many tasks. Athena's performance degrades significantly when the number of files exceeds tens of thousands.

5

Tune Query Syntax and Structure

Rewrite queries to minimize data scanned. Use SELECT with specific columns, not *. Push filters into subqueries to enable predicate pushdown. Avoid SELECT DISTINCT on large columns; use GROUP BY instead. For joins, ensure the larger table is on the left side of the join (though Athena's optimizer may handle this). Use APPROX_DISTINCT_COUNT for approximate counts. Use LIMIT with ORDER BY only if you need the top N results.

6

Configure Workgroup and Result Reuse

Set up Athena workgroups to enforce query limits (e.g., max bytes scanned) and enable result reuse. Result reuse caches query results for a configurable duration (default 60 minutes). If the same query is run within that window, Athena returns cached results without scanning data. This is useful for dashboards and repeated analyses. However, be cautious with real-time data – set a short TTL or disable reuse for volatile data.

What This Looks Like on the Job

Scenario 1: Ad-Hoc Analytics for a SaaS Company

A SaaS company stores billions of user events in S3 as JSON files partitioned by date. Analysts frequently query recent events (last 7 days) to compute daily active users (DAU). Initially, queries scanned 500 GB per run, costing $2.50 each and taking 5 minutes. After converting to Parquet with Snappy compression and partitioning by date, the same query on the last 7 days scanned only 10 GB (cost $0.05) and completed in 15 seconds. The company also enabled result reuse with a 30-minute TTL, so repeated dashboard refreshes cost nothing. The key was converting raw JSON to columnar format and using date partitioning.

Scenario 2: Log Analysis for a Large E-Commerce Platform

An e-commerce platform stores web server logs in S3 (CSV format, 2 TB per day). They run complex queries joining logs with customer data to analyze purchase funnels. Queries often timed out after 30 minutes. The solution involved: (1) Converting logs to ORC format with ZSTD compression (better compression than Snappy for text). (2) Partitioning by year/month/day/hour. (3) Sorting logs within each partition by session_id to improve join performance. (4) Using AWS Glue ETL to compact small files (millions of small CSV files) into 500 MB ORC files. After optimization, typical queries scanned 80% less data and completed in under 5 minutes. The platform also set up CloudWatch alarms to notify when any query scans more than 1 TB.

Scenario 3: Financial Reporting with Strict SLAs

A financial institution requires daily reports that aggregate transaction data. Queries must complete within 10 minutes. Data is stored in S3 as unpartitioned CSV files. After partitioning by trade_date and converting to Parquet, queries still took 8 minutes. Further optimization included: (1) Using bucketing on account_id (10 buckets) to enable more efficient joins with account dimension table. (2) Applying ORDER BY within each bucket so that queries filtering on account_id could use predicate pushdown more effectively. (3) Using the Athena engine version 3. The final query ran in 2 minutes, well under the SLA. The lesson: partitioning alone is not enough; bucketing and sorting can yield additional gains.

How SAA-C03 Actually Tests This

SAA-C03 Exam Focus on Athena Query Optimization

The SAA-C03 exam tests your ability to design cost-effective and performant data analytics solutions. Questions on Athena optimization typically fall under Domain 3 (High Performance) and Objective 3.1 (Selecting appropriate compute, storage, and database services). Key areas include:

Reducing Data Scanned: The exam emphasizes that data scanned directly correlates to cost. You must know that columnar formats (Parquet, ORC) and partitioning reduce data scanned. Questions may present a scenario with high costs and ask for the most cost-effective solution – the answer is almost always "convert to Parquet and partition appropriately."

Partitioning vs. Bucketing: The exam tests when to use partitioning vs. bucketing. Partitioning is for filtering on low-cardinality columns (e.g., date, region). Bucketing is for high-cardinality columns (e.g., user_id) to improve joins. A common wrong answer is to partition by high-cardinality columns, which creates too many partitions and hurts performance.

File Format and Compression: Questions often compare CSV, JSON, Parquet, and ORC. Parquet and ORC are preferred for analytics. The exam expects you to know that Parquet supports predicate pushdown and column pruning. Compression formats: Snappy (splittable, fast), Gzip (not splittable, slower but higher compression). A trick: if a question mentions a single large file (e.g., 10 GB gzip), the answer is to convert to Snappy or split the file into smaller pieces.

Result Reuse and Workgroups: The exam may ask how to reduce costs for repeated queries. Result reuse (caching) is the answer. But you must know that it is configured at the workgroup level and has a configurable TTL. The default is 60 minutes. A common wrong answer is to use ElastiCache – not correct for Athena caching.

Athena vs. Redshift Spectrum: The exam sometimes contrasts Athena (serverless, no cluster to manage) with Redshift Spectrum (requires Redshift cluster, better for complex joins with Redshift tables). Athena is for ad-hoc, serverless queries; Redshift Spectrum is for integrating with Redshift data warehouse.

Edge Cases:

Queries that scan no data (e.g., SELECT COUNT(*) on a partitioned table with no matching partitions) – still incurs a small cost for metadata operations.

MSCK REPAIR TABLE vs. ALTER TABLE ADD PARTITION – MSCK is convenient but slower for many partitions; ALTER is faster for adding specific partitions.

Athena does not support indexes; partition pruning and columnar storage are the only ways to avoid full scans.

Common Wrong Answers

1.

"Add an index" – Athena does not support indexes. The exam tests that you know this.

2.

"Use Amazon Redshift instead" – While Redshift can be faster for complex workloads, the question may specifically ask for a serverless solution, making Athena correct.

3.

"Increase the number of workers" – Athena is serverless; you cannot directly control parallelism. The exam expects you to know that optimization comes from data layout, not provisioning.

4.

**"Use SELECT * and filter in application"** – This is the opposite of optimization. The exam tests that filtering at the database level (Athena) is more efficient.

Key Takeaways

Athena charges $5.00 per TB of data scanned; reducing data scanned is the primary cost optimization.

Always use columnar file formats (Parquet or ORC) with splittable compression (Snappy or LZO) for analytics workloads.

Partition tables by low-cardinality columns (e.g., date, region) to enable partition pruning.

Avoid SELECT *; only select the columns you need to leverage columnar format benefits.

Consolidate small files into larger ones (128 MB - 1 GB) to reduce S3 listing overhead and improve parallelism.

Enable result reuse in workgroups to cache query results for repeated queries; default TTL is 60 minutes.

Use Athena engine version 3 for better performance and cost-based optimization.

Monitor query metrics via CloudWatch (QueryBytesScanned, QueryExecutionTime) to identify expensive queries.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Parquet

Columnar storage format, widely supported across AWS services (Athena, Glue, EMR).

Supports predicate pushdown and column pruning.

Compression codecs: Snappy, Gzip, LZO, ZSTD, Brotli.

Better for complex nested data structures (e.g., arrays, structs).

Default for many Spark workloads; excellent performance with Athena.

ORC

Columnar storage format, optimized for Hive and Presto.

Supports predicate pushdown and column pruning, plus built-in indexes (min/max, bloom filters).

Compression codecs: Snappy, ZLIB, LZO, ZSTD.

Can have better compression ratios than Parquet for some data types.

Slightly less mature ecosystem support compared to Parquet.

Watch Out for These

Mistake

Athena automatically optimizes queries without any configuration.

Correct

Athena does not automatically optimize data layout. You must partition, use columnar formats, and compress data to achieve good performance. Without these, Athena will scan the entire dataset, leading to high costs and slow queries.

Mistake

Using Gzip compression is always best because it saves the most storage.

Correct

Gzip is not splittable, meaning a single large gzip file is read by one worker, limiting parallelism. For large datasets, use splittable compression like Snappy or LZO. Gzip is acceptable only for small files (< 1 GB) or when data is already in gzip and you cannot change it.

Mistake

Partitioning by every possible filter column is optimal.

Correct

Partitioning by high-cardinality columns (e.g., user_id) creates too many partitions (potentially millions), which degrades performance due to metadata overhead. Only partition by columns with low cardinality (e.g., year, month, region). For high-cardinality columns, use bucketing.

Mistake

Athena supports indexes like traditional databases.

Correct

Athena does not support indexes. It relies on partition pruning and columnar storage to minimize data scanned. There is no CREATE INDEX command.

Mistake

Result reuse always reduces costs for any query.

Correct

Result reuse only helps if the same query is run multiple times within the TTL window. For one-off queries or queries on frequently changing data, result reuse may return stale data. Also, if the underlying data changes, you must invalidate the cache or wait for TTL expiry.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

How do I reduce Athena query costs?

The most effective way to reduce Athena costs is to minimize the amount of data scanned per query. This is achieved by: (1) converting data to columnar formats like Parquet or ORC, (2) partitioning tables by frequently filtered columns (e.g., date), (3) using compression (Snappy or LZO), (4) selecting only necessary columns instead of SELECT *, and (5) enabling result reuse for repeated queries. For example, a query that scans 1 TB costs $5.00; after optimization, the same query might scan 100 GB ($0.50) or less.

What is the difference between partitioning and bucketing in Athena?

Partitioning divides data into directories based on partition keys (e.g., year=2023/month=12/day=25). It is best for low-cardinality columns (few distinct values). Bucketing (or clustering) divides data into a fixed number of buckets within a partition based on a hash of a column (e.g., customer_id). It is best for high-cardinality columns and improves join performance. Athena uses partitioning for partition pruning and bucketing for more efficient data skipping within partitions.

Can Athena query data in Redshift or RDS?

No, Athena can only query data stored in Amazon S3 (or other data sources via federated queries using Lambda connectors). To query data in Redshift, use Redshift Spectrum. To query data in RDS, use federated queries with a Lambda function that connects to the database. The exam tests that Athena is for S3 data, not for querying other databases directly.

Why is my Athena query slow even after partitioning?

Common reasons: (1) Too many small files – consolidate them into larger files (128 MB - 1 GB). (2) Using row-based formats (CSV, JSON) – convert to columnar. (3) Over-partitioning – each partition should contain at least 128 MB of data. (4) Complex joins or aggregations – optimize query structure. (5) Not using engine version 3 – upgrade if possible. (6) Data not sorted by filter columns – consider bucketing or sorting within files.

What is result reuse in Athena and how do I enable it?

Result reuse caches the results of a query for a specified duration (default 60 minutes). If the same query is submitted within that time, Athena returns the cached result without scanning any data, reducing cost and latency. It is enabled at the workgroup level via the console or API (ResultReuseConfiguration). You can set the maximum age of cached results in minutes. To disable, set the age to 0. Note that if the underlying S3 data changes, the cache may become stale.

What is the best file format for Athena?

Parquet is the most commonly recommended file format for Athena due to its columnar storage, efficient compression, and wide support across the AWS ecosystem (Athena, Glue, EMR, Redshift Spectrum). ORC is also a good choice and may offer better compression for some data types. Both support predicate pushdown and column pruning. Avoid CSV and JSON for analytics queries.

How do I handle small files in Athena?

Small files (e.g., < 128 MB) degrade Athena performance because of the overhead in listing and scheduling. Use an ETL job (AWS Glue, Spark, or Athena CTAS) to coalesce small files into larger ones. For example, in Spark: `df.coalesce(10).write.parquet("s3://bucket/output/")` creates 10 files. Aim for files between 128 MB and 1 GB. Also consider partitioning to group data into larger directories.

Terms Worth Knowing

Ready to put this to the test?

You've just covered Amazon Athena Query Optimization — now see how well it sticks with free SAA-C03 practice questions. Full explanations included, no account needed.

Done with this chapter?