This chapter covers S3 Select and Glacier Select, two features that allow you to retrieve only a subset of data from objects stored in Amazon S3 or S3 Glacier, using SQL-like queries. For the SAA-C03 exam, these features appear in questions about optimizing data retrieval performance and reducing data transfer costs. Approximately 2-4% of exam questions touch on S3 Select or Glacier Select, often in the context of data filtering, analytics pipelines, or cost optimization. Understanding how these services work under the hood and when to use them versus other options (like Athena or S3 Batch Operations) is critical for answering scenario-based questions correctly.
Imagine a massive library with millions of books. You need a specific fact from one book, but you don't want to check out the entire book and read it page by page. Instead, you have a research assistant (S3 Select) who can go directly to the shelf, open the book, scan only the pages you specify (e.g., chapters with certain keywords), and photocopy only the relevant paragraphs. The assistant never brings the whole book to your desk; they only return the specific data you requested. This saves time and reduces the amount of paper (network traffic) you handle. Glacier Select works similarly but for books stored in a deep archive warehouse (Glacier) where retrieving a book takes longer (minutes to hours) because the warehouse robot must first locate and retrieve the book from a deep storage vault before the assistant can scan it. In both cases, the key benefit is that you avoid transferring the entire object—only the filtered subset of data you actually need.
What Are S3 Select and Glacier Select?
Amazon S3 Select and Glacier Select are features that enable you to retrieve a subset of data from an object stored in S3 or S3 Glacier, using standard SQL expressions. Instead of downloading the entire object and then filtering it client-side, you offload the filtering to the S3 service. This reduces the amount of data transferred over the network and the CPU cycles needed on the client, leading to faster performance and lower costs (since you pay only for the data scanned and returned, not the entire object).
S3 Select works with objects in S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, S3 One Zone-IA, and S3 Glacier Instant Retrieval. Objects in S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive must be restored before S3 Select can read them. Glacier Select works specifically with objects archived in those two classes: it initiates a retrieval at the tier you choose (expedited, standard, or bulk), then performs the query on the restored data.
How It Works Internally
When you issue an S3 Select request, the S3 service reads the object from disk, parses it according to the input serialization format (CSV, JSON, or Parquet), applies the SQL expression record by record, and returns only the matching rows or columns. The output is serialized according to the output serialization format (CSV or JSON). The processing happens within the S3 fleet, close to where the data is stored, minimizing data movement.
For Glacier Select, the process is similar but involves an additional step: the object must be restored from the archive tier to a temporary staging location (usually within 1-5 minutes for expedited retrieval, 3-5 hours for standard, or 5-12 hours for bulk). Once restored, the query runs exactly like S3 Select. The restored data is temporary and may be deleted after the query completes.
Key Components, Values, Defaults, and Timers
InputSerialization: Required. Specifies the format of the input data. Options: CSV, JSON (Lines or Document), or Parquet. For CSV, you can specify FileHeaderInfo (NONE, USE, IGNORE), RecordDelimiter, FieldDelimiter, QuoteCharacter, QuoteEscapeCharacter, and Comments. For JSON, you can specify Type (LINES or DOCUMENT). CompressionType (NONE, GZIP, BZIP2) is set alongside the format and applies to CSV and JSON input.
OutputSerialization: Required. Specifies the format of the output. Options: CSV or JSON, each with its own delimiter settings (and, for CSV, quoting settings).
Expression: Required. A valid SQL SELECT statement. Supports standard SQL syntax including column selection, WHERE, LIMIT, aggregate functions (COUNT, SUM, AVG, MIN, MAX), and basic string functions (SUBSTRING, TRIM, UPPER, LOWER). Does NOT support JOINs, subqueries, or GROUP BY.
ExpressionType: Required. Currently only 'SQL' is supported.
RequestProgress: Optional. Enables periodic progress notifications (1 per 4 MB of data scanned).
MaxRetries: Not configurable. The service automatically retries on transient errors.
Data returned: The response is streamed as a series of event messages (Records, Stats, Progress, End). Each Records event contains up to 1 MB of payload. The total response can be up to 100 GB.
Cost: You pay per GB of data scanned (input size plus output size). For S3 Select, $0.002 per GB scanned (varies by region). For Glacier Select, you pay the retrieval fee for the object plus the same per-GB scanned fee. There is no charge for the output data transferred out to the same region (but standard data transfer charges apply if you cross regions or go to internet).
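The request fields listed above map onto the keyword arguments of the boto3 S3 client's select_object_content call. The following is a minimal sketch, not a definitive implementation: build_select_request is a hypothetical helper name, and the bucket, key, and query are the placeholder values used elsewhere in this chapter. The helper only builds the argument dictionary, so it runs without AWS credentials.

```python
def build_select_request(bucket, key, expression, has_header=True):
    """Return keyword arguments for the S3 client's select_object_content()."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Expression": expression,
        "ExpressionType": "SQL",  # the only supported value
        "InputSerialization": {
            "CSV": {
                "FileHeaderInfo": "USE" if has_header else "NONE",
                "RecordDelimiter": "\n",
                "FieldDelimiter": ",",
            },
            # Compression is declared next to the format, not inside it.
            "CompressionType": "NONE",
        },
        "OutputSerialization": {"CSV": {}},
        "RequestProgress": {"Enabled": False},
    }


# Placeholder bucket, key, and query (assumptions for illustration).
request = build_select_request(
    "my-bucket",
    "my-file.csv",
    "SELECT s._1, s._2 FROM s3object s WHERE s._3 > 100",
    has_header=False,
)

# With credentials configured, the call itself would look like:
#   import boto3
#   response = boto3.client("s3").select_object_content(**request)
```

Keeping the argument construction separate from the network call makes it easy to unit test the serialization settings, which is where most misconfigurations occur.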
Configuration and Verification
You can use S3 Select via the AWS SDK, CLI, or S3 console. The CLI command is:
aws s3api select-object-content \
--bucket my-bucket \
--key my-file.csv \
--expression "SELECT s._1, s._2 FROM s3object s WHERE s._3 > 100" \
--expression-type SQL \
--input-serialization '{"CSV": {"FileHeaderInfo": "NONE", "RecordDelimiter": "\n", "FieldDelimiter": ","}}' \
--output-serialization '{"CSV": {}}' \
output.csv
For Glacier Select, you first restore the object (if not already restored) using aws s3api restore-object, then use the same select-object-content command. Alternatively, you can use the Glacier Select API directly via InitiateJob with a SelectParameters field.
To verify, you can check the BytesScanned and BytesReturned in the response's Stats event. These metrics are also logged in AWS CloudTrail.
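As a rough illustration of that verification step, the sketch below walks an event stream, joins the Records payloads, and picks out the Stats details. The event dictionaries are hand-written samples shaped like the ones boto3 yields from response["Payload"]; collect_select_results is a hypothetical helper name.

```python
def collect_select_results(event_stream):
    """Join all Records payloads and return (data_bytes, stats_details)."""
    chunks = []
    stats = {}
    for event in event_stream:
        if "Records" in event:
            # A single record may be split across two Records events,
            # so payloads are joined before any line-level parsing.
            chunks.append(event["Records"]["Payload"])
        elif "Stats" in event:
            stats = event["Stats"]["Details"]
    return b"".join(chunks), stats


# Hand-written sample events (assumed values, for illustration only).
sample_events = [
    {"Records": {"Payload": b"alice,42\nbo"}},
    {"Records": {"Payload": b"b,17\n"}},
    {"Stats": {"Details": {"BytesScanned": 1000,
                           "BytesProcessed": 1000,
                           "BytesReturned": 16}}},
    {"End": {}},
]

data, stats = collect_select_results(sample_events)
print(data.decode("utf-8"), stats["BytesScanned"], stats["BytesReturned"])
```

Comparing BytesReturned with BytesScanned gives a quick read on how selective a query actually was.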
Interaction with Related Technologies
Amazon Athena: Athena also uses SQL to query data in S3, but it is a serverless interactive query service that can query multiple objects and supports complex joins, aggregations, and partitioning. S3 Select is for single-object queries only. Use S3 Select when you need to filter a single large object before processing it in a downstream application (e.g., Lambda). Use Athena when you need to run ad-hoc queries across many objects.
Amazon S3 Batch Operations: Batch Operations performs actions (such as tagging or copying) on many objects, but it has no built-in S3 Select operation. To run S3 Select across a list of objects, have Batch Operations invoke a Lambda function that calls S3 Select on each object.
AWS Lambda: You can trigger a Lambda function on S3 events and use S3 Select within the function to process only relevant data from a large object. This is a common pattern for real-time data processing.
Amazon EMR / Spark: When reading from S3, Spark on Amazon EMR can push filters down to S3 Select so that rows are filtered on the S3 side, reducing I/O. On EMR this is done by reading through the S3 Select data source formats (s3selectCSV and s3selectJSON) instead of the default CSV or JSON readers.
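The Lambda pattern described above can be sketched as follows. This is an assumption-laden outline rather than a production handler: the query, the header setting, and the return shape are illustrative, and the s3_client parameter is a hypothetical addition that lets the function be exercised offline with a stand-in client.

```python
from urllib.parse import unquote_plus


def object_from_event(event):
    """Extract (bucket, key) from the first record of an S3 event notification."""
    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    key = unquote_plus(s3_info["object"]["key"])  # keys arrive URL-encoded
    return bucket, key


def handler(event, context, s3_client=None):
    """Lambda entry point; s3_client is injectable for offline testing."""
    if s3_client is None:
        import boto3  # imported lazily so the module loads without the SDK
        s3_client = boto3.client("s3")
    bucket, key = object_from_event(event)
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        Expression="SELECT * FROM s3object s LIMIT 10",  # illustrative query
        ExpressionType="SQL",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"JSON": {}},
    )
    # Only Records events carry data; Stats, Progress, and End are skipped here.
    payload = b"".join(
        e["Records"]["Payload"] for e in response["Payload"] if "Records" in e
    )
    return {"bucket": bucket, "key": key, "bytes": len(payload)}
```

Because only the filtered payload crosses the network, the function's memory and duration scale with the result size rather than the object size.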
Limitations and Edge Cases
Maximum object size: S3 Select can process objects up to 100 GB uncompressed. For larger objects, consider splitting them.
Compressed objects: If the object is GZIP or BZIP2 compressed, S3 Select can decompress and query it. However, the decompression is done on the fly, and the entire object must still be decompressed (though only the matching rows are returned). This can be more efficient than downloading and decompressing locally if the selectivity is high.
Nested JSON: For JSON Document type, you can query nested fields using dot notation (e.g., s.details.name). But the SQL support is limited; support for arrays is restricted and there are no JSON functions.
Parquet: This columnar format allows S3 Select to read only the columns referenced in the query, reducing I/O further. This is a significant performance advantage over row-based formats.
Glacier Select: Requires the object to be restored. If you use the Glacier Select API directly, the restoration is part of the job and you don't need to explicitly restore first. However, the job will not complete until the restoration is done (which can take minutes to hours). The restored data is temporary and may be deleted after the job completes.
Performance Considerations
Selectivity: The more selective the query (i.e., the fewer rows returned), the greater the benefit. If you return most of the object, it may be faster to just download the entire object and filter locally.
Network latency: S3 Select reduces the amount of data transferred, so it is particularly beneficial over high-latency or low-bandwidth connections.
Parallel requests: You can issue multiple S3 Select requests on the same object concurrently. However, each request will still read the entire object (unless using Parquet, where only the needed columns are read). For large objects, consider splitting them into smaller chunks and querying in parallel.
Cost trade-off: S3 Select costs per GB scanned. If your query scans a large portion of the object but returns a small subset, it may be cheaper than downloading the entire object (which incurs data transfer costs). But if you are already in the same region, data transfer is free, so the cost comparison depends on the scan vs. download size.
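The trade-off above comes down to simple arithmetic. The sketch below uses the $0.002 per GB scan price quoted in this chapter; the per-GB price for returned data and the internet transfer price are assumed placeholder values, not quoted rates, and request charges are ignored. Check current regional pricing before relying on any of these numbers.

```python
def select_cost(scanned_gb, returned_gb, scan_price=0.002, return_price=0.0007):
    """Estimated S3 Select charge in USD.

    scan_price comes from this chapter; return_price is an assumed
    placeholder for the per-GB charge on returned data.
    """
    return scanned_gb * scan_price + returned_gb * return_price


def download_cost(object_gb, transfer_price=0.09):
    """Estimated transfer charge in USD for downloading the whole object.

    transfer_price is an assumed internet egress rate; pass 0.0 for a
    client in the same region, where transfer is free.
    """
    return object_gb * transfer_price


# A 50 GB object where the query returns 1% of the data.
object_gb = 50
returned_gb = object_gb * 0.01

print("S3 Select:", select_cost(object_gb, returned_gb))
print("Download over the internet:", download_cost(object_gb))
print("Download in the same region:", download_cost(object_gb, transfer_price=0.0))
```

Under these assumptions S3 Select wins clearly when the client is outside the region, while for a same-region client the scan fee is an added cost that has to be justified by faster processing instead.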
Security and Permissions
To use S3 Select, the IAM principal needs s3:GetObject permission on the object. Additionally, if the object is encrypted with SSE-KMS, the principal needs kms:Decrypt on the KMS key. For SSE-C, the request must include the encryption key. There is no additional permission required beyond GetObject. For Glacier Select, you also need s3:RestoreObject (if using the two-step process) or glacier:InitiateJob (if using the Glacier API).
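A minimal sketch of an identity policy matching the permissions above, built as a Python dictionary so it can be printed as JSON. The bucket name, prefix, and KMS key ARN are hypothetical placeholders, and select_policy is an illustrative helper name.

```python
import json


def select_policy(bucket, prefix="", kms_key_arn=None):
    """Build an IAM policy document allowing S3 Select on objects under a prefix."""
    statements = [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],  # S3 Select needs no separate action
        "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
    }]
    if kms_key_arn:
        # Only needed when the objects are encrypted with SSE-KMS.
        statements.append({
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": kms_key_arn,
        })
    return {"Version": "2012-10-17", "Statement": statements}


# Placeholder names and ARN (assumptions for illustration).
policy = select_policy(
    "my-bucket",
    "logs/",
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",
)
print(json.dumps(policy, indent=2))
```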
Send S3 Select Request
The client sends a POST request to the S3 API endpoint for the object, specifying the SQL expression, input serialization, and output serialization. The request is authenticated via AWS Signature V4. The S3 service validates the request parameters and checks the bucket and object permissions. If the object is encrypted with SSE-KMS, the service validates the KMS key permissions. The request must include the Content-MD5 header for integrity check. The service responds with a 200 OK and begins streaming the result.
Parse Input Data
S3 reads the object from disk (or from the underlying storage tier) and begins parsing it according to the InputSerialization format. For CSV, it splits rows by the record delimiter and columns by the field delimiter. For JSON Lines, it parses each line as a separate JSON object. For Parquet, it reads the column metadata and only loads the columns referenced in the query. If a CSV or JSON object is compressed (GZIP/BZIP2), it decompresses on the fly. The parsing is done in streaming fashion, so the entire object does not need to be loaded into memory.
Apply SQL Expression
For each row or record parsed, S3 evaluates the SQL WHERE clause (if present) and checks if the row matches the condition. It also projects only the columns specified in the SELECT clause. The SQL engine supports basic operators (=, <, >, LIKE, IN, BETWEEN) and functions (CAST, SUBSTRING, TRIM, UPPER, LOWER). It does not support JOINs, subqueries, or GROUP BY. Aggregate functions such as COUNT and SUM can still be used over the whole object, in which case the query returns a single row. The evaluation is done row-by-row, and matching rows are buffered for output.
Serialize Output Data
Matching rows are serialized into the OutputSerialization format. For CSV, rows are written with the specified delimiter and quoting rules. For JSON, each row is written as a separate JSON object followed by the record delimiter. The serialized data is split into chunks of up to 1 MB, each sent as a Records event in the streaming response. The service also sends periodic Progress events (if requested) and a final Stats event with BytesScanned, BytesProcessed, and BytesReturned.
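The parse, filter, project, and serialize steps described so far can be imitated locally to build intuition. The sketch below is a toy model written with Python's csv module, not the service's implementation: local_select is a hypothetical name, the sample data is invented, and the result is buffered in memory for simplicity where the real service streams.

```python
import csv
import io


def local_select(csv_text, columns, predicate, chunk_size=1024 * 1024):
    """Toy model of S3 Select over CSV text with a header row.

    Keeps rows where predicate(row) is true, projects the named columns,
    and yields the serialized CSV output in chunks of at most chunk_size bytes.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.DictReader(io.StringIO(csv_text)):  # parse input
        if predicate(row):                             # WHERE clause
            writer.writerow([row[c] for c in columns]) # SELECT projection
    data = out.getvalue().encode("utf-8")              # serialize output
    for start in range(0, len(data), chunk_size):      # Records-style chunks
        yield data[start:start + chunk_size]


# Invented sample data; values are strings until cast, as with CSV input.
source = "id,status,path\n1,200,/a\n2,503,/b\n3,500,/c\n"
chunks = list(
    local_select(source, ["id", "path"],
                 lambda r: int(r["status"]) >= 500,
                 chunk_size=8)
)
print(chunks)
```

The tiny chunk_size shows that a chunk boundary can fall in the middle of a row, which is why clients should join payloads before splitting on record delimiters.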
Stream Response to Client
The response is streamed using HTTP chunked transfer encoding. The client receives a series of event messages: Records (containing the actual data), Progress (optional, containing bytes scanned and processed so far), Stats (final summary), and End (marks end of stream). The client must parse these events using the appropriate SDK or manually. The total response can be up to 100 GB. The client can abort the request at any time by closing the connection. The service will stop processing and discard any partial results.
Scenario 1: Real-Time Log Filtering in a Serverless Pipeline
A company ingests millions of web server logs per day into S3 as gzipped CSV files, each about 500 MB. They need to extract only 5xx error entries for real-time alerting. Previously, they would download the entire file to a Lambda function (max 15 min execution) and filter it, but the download took too long and often timed out. They switched to using S3 Select within the Lambda function. The Lambda is triggered by S3 PutObject events, calls S3 Select with SELECT * FROM s3object WHERE status >= 500, and receives only the matching rows (typically <1% of the data). The function completes in under 30 seconds, and the cost is lower since they pay only for the scanned data (the entire file is scanned but only a fraction is returned). They configured InputSerialization with CompressionType: GZIP and FileHeaderInfo: USE. The output is serialized as JSON for easy integration with CloudWatch Logs. A common pitfall is forgetting to set the correct record delimiter for CSV files that have quoted fields containing newlines; they use `RecordDelimiter: '\n' and QuoteCharacter: '"'`.
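A sketch of how the request in this scenario might be assembled and its JSON output parsed. Several details are assumptions: the explicit CAST on the status column (CSV fields are read as text), the AllowQuotedRecordDelimiter flag for quoted fields that contain newlines, and the helper names error_log_request and parse_json_lines.

```python
import json

# The CAST is an assumption: CSV columns are text unless converted.
ERROR_QUERY = "SELECT * FROM s3object s WHERE CAST(s.status AS INT) >= 500"


def error_log_request(bucket, key):
    """Keyword arguments for select_object_content on a gzipped CSV log."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Expression": ERROR_QUERY,
        "ExpressionType": "SQL",
        "InputSerialization": {
            "CompressionType": "GZIP",
            "CSV": {
                "FileHeaderInfo": "USE",
                "RecordDelimiter": "\n",
                "QuoteCharacter": '"',
                # Assumed setting for quoted fields containing newlines.
                "AllowQuotedRecordDelimiter": True,
            },
        },
        "OutputSerialization": {"JSON": {"RecordDelimiter": "\n"}},
    }


def parse_json_lines(payload):
    """Turn joined Records payload bytes into a list of dictionaries."""
    return [json.loads(line)
            for line in payload.decode("utf-8").splitlines() if line]


# Placeholder bucket and key; sample payload is invented.
request = error_log_request("my-log-bucket", "2024/01/01/access.csv.gz")
rows = parse_json_lines(
    b'{"status":"503","path":"/b"}\n{"status":"500","path":"/c"}\n'
)
print(rows)
```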
Scenario 2: Data Science Workload with Parquet Files
A data science team stores large Parquet files (10-50 GB each) containing sensor data. They frequently need to extract a subset of columns for a specific time range. Instead of downloading the entire file and using Pandas to filter, they use S3 Select with Parquet input and CSV or JSON output. This reads only the relevant columns (e.g., timestamp, sensor_id, value) and returns only matching rows. The team uses the AWS SDK for Python (boto3) to issue the query and processes the streamed response in chunks. They found that S3 Select reduces data transfer by 90% and speeds up their ETL pipeline by 5x. They also use S3 Select predicate pushdown in Apache Spark by setting spark.sql.sources.pushDownPredicate and using the S3A connector. A misconfiguration they encountered: if the Parquet file has nested schemas, S3 Select cannot query nested fields directly; they had to flatten the schema first.
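Processing the streamed response in chunks, as the team does here, requires care because a record can be split across two Records events. The generator below is a minimal sketch of one way to handle that, assuming newline-delimited output; iter_lines is a hypothetical helper and the events are hand-written samples shaped like boto3's.

```python
def iter_lines(event_stream):
    """Yield complete output lines as they become available.

    Bytes after the last newline are held back until the next
    Records event completes them.
    """
    buffer = b""
    for event in event_stream:
        if "Records" not in event:
            continue  # skip Stats, Progress, and End events
        buffer += event["Records"]["Payload"]
        *complete, buffer = buffer.split(b"\n")
        for line in complete:
            yield line.decode("utf-8")
    if buffer:  # final record without a trailing newline
        yield buffer.decode("utf-8")


# Invented sample events: the second row is split across two payloads.
events = [
    {"Records": {"Payload": b"t1,a,1.5\nt2,"}},
    {"Records": {"Payload": b"b,2.5\n"}},
    {"End": {}},
]
lines = list(iter_lines(events))
print(lines)
```

Memory use stays proportional to the longest record plus one payload, rather than to the full result.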
Scenario 3: Archival Data Retrieval with Glacier Select
A financial institution archives transaction records older than 7 years in S3 Glacier Deep Archive. Occasionally, auditors request data for a specific account number. Instead of restoring the entire 1 GB archive object (which costs $0.03 per GB and takes 12 hours), they use Glacier Select. They initiate a Glacier Select job with an SQL expression filtering by account ID. The job first restores the object (bulk retrieval, ~12 hours), then runs the query, and returns only the matching records. The cost is lower because they pay only for the restored data (the entire object is restored anyway, but they avoid transferring it out). However, they learned that if the object is less than 100 MB, it's cheaper to just restore and download the whole object because the minimum billable restore size is 1 MB and the per-GB scan fee still applies. They also set up S3 Lifecycle policies to expire the temporary restored copies after 1 day to avoid storage charges.
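For the two-step S3 path, a select-type restore is described by the RestoreRequest structure passed to restore_object. The sketch below only builds that structure. The account_id column, the output bucket and prefix, the helper name, and the "FROM object" table name are assumptions for illustration, and the feature's availability should be confirmed against current AWS documentation.

```python
def glacier_select_restore_request(account_id, output_bucket, output_prefix,
                                   tier="Bulk"):
    """Build a RestoreRequest dict for a SELECT-type restore.

    account_id is validated because it is interpolated into the SQL text.
    """
    if not account_id.isalnum():
        raise ValueError("account_id must be alphanumeric")
    return {
        "Type": "SELECT",
        "Tier": tier,  # Expedited, Standard, or Bulk
        "SelectParameters": {
            "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
            "ExpressionType": "SQL",
            # Table and column names are assumptions for this sketch.
            "Expression": (
                "SELECT * FROM object s "
                f"WHERE s.account_id = '{account_id}'"
            ),
            "OutputSerialization": {"CSV": {}},
        },
        # Results are written to S3 rather than streamed to the caller.
        "OutputLocation": {
            "S3": {"BucketName": output_bucket, "Prefix": output_prefix}
        },
    }


# Placeholder values (assumptions for illustration).
restore_request = glacier_select_restore_request(
    "12345", "audit-results", "acct-12345/"
)

# With credentials configured, the call would look like:
#   import boto3
#   boto3.client("s3").restore_object(
#       Bucket="archive-bucket", Key="transactions-2015.csv",
#       RestoreRequest=restore_request)
```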
Exactly What SAA-C03 Tests
SAA-C03 objective 3.4 (High Performance) includes S3 Select and Glacier Select as methods to optimize data retrieval. The exam tests your ability to choose the right service for a given scenario. Key points:
When to use S3 Select: When you need to filter a single large object (up to 100 GB) and return a small subset of data. The object must be in a supported format (CSV, JSON, or Parquet).
When to use Glacier Select: When the object is archived in Glacier Flexible Retrieval or Deep Archive, and you need to query it without restoring the entire object to S3.
Cost implications: S3 Select charges per GB scanned; Glacier Select charges retrieval fees plus scan fees. You pay for the data scanned, not the entire object size. However, if you return most of the data, it may be cheaper to just download the whole object.
Performance: S3 Select is faster than downloading and filtering client-side for selective queries. For Parquet, only referenced columns are read, reducing I/O.
Common Wrong Answers and Why They Are Wrong
"Use S3 Select to query multiple objects" – Wrong because S3 Select operates on a single object. For multiple objects, use Athena or Redshift Spectrum.
"S3 Select supports all SQL features including JOINs" – Wrong. S3 Select does not support JOINs, subqueries, or GROUP BY. Aggregate functions such as COUNT and SUM work only across the whole object.
"Glacier Select can query objects without any restoration delay" – Wrong. Glacier Select requires the object to be restored first; the query runs after restoration completes. The restoration time depends on the retrieval tier (expedited: 1-5 min, standard: 3-5 hours, bulk: 5-12 hours).
"S3 Select is free to use" – Wrong. You pay per GB scanned. It can reduce costs compared to downloading the entire object, but it is not free.
Specific Numbers and Terms That Appear on the Exam
Maximum object size: 100 GB for S3 Select.
Supported formats: CSV, JSON, Parquet.
Compression: GZIP and BZIP2 supported.
Glacier retrieval tiers: Expedited (1-5 min), Standard (3-5 hours), Bulk (5-12 hours).
Cost: $0.002 per GB scanned for S3 Select (varies by region).
Response streaming: Up to 100 GB total response.
Permissions: Requires s3:GetObject; for SSE-KMS, also kms:Decrypt.
Edge Cases and Exceptions
Empty results: If no rows match, the response still contains a Stats event with BytesScanned and BytesReturned=0.
Large number of columns: For CSV, if the number of columns varies between rows, S3 Select treats missing columns as empty strings. For JSON, if a field is missing in a record, it returns NULL.
SSE-C: You must provide the encryption key in the request; otherwise, you get a 400 error.
Glacier Select with SSE-KMS: The KMS key must be in the same region as the bucket; cross-region keys are not supported.
How to Eliminate Wrong Answers
If the scenario involves querying multiple objects, eliminate S3 Select and Glacier Select immediately. Look for Athena, Redshift Spectrum, or Glue.
If the scenario involves real-time querying of archived data, eliminate Glacier Select because of the restoration delay. Use S3 Intelligent-Tiering or S3 Glacier Instant Retrieval instead.
If the scenario describes filtering a single large CSV file before further processing, S3 Select is the correct choice.
If the cost analysis shows that the query returns most of the data, S3 Select may not be cost-effective; downloading the entire object might be cheaper.
S3 Select allows filtering of a single S3 object using SQL, returning only the matching data.
Supported input formats: CSV, JSON, Parquet; GZIP and BZIP2 compression is supported for CSV and JSON.
Maximum object size for S3 Select is 100 GB.
Glacier Select works with objects in S3 Glacier Flexible Retrieval and Deep Archive; requires restoration first.
S3 Select charges per GB scanned ($0.002/GB typical); Glacier Select adds retrieval fees.
S3 Select does not support JOINs, subqueries, or GROUP BY; aggregate functions apply to the whole object only.
For the columnar Parquet format, only referenced columns are read, improving performance.
S3 Select is ideal when the query returns a small subset of data; otherwise, download may be cheaper/faster.
These come up on the exam all the time. Here's how to tell them apart.
S3 Select
Operates on a single object at a time
Returns streamed results directly to client
No schema required; format defined per request
Best for filtering large objects before further processing
Cost: per GB scanned, no infrastructure
Amazon Athena
Queries multiple objects across a bucket or prefix
Returns results as a SQL query result set
Requires a schema defined in Glue catalog or inferred
Best for ad-hoc interactive queries across datasets
Cost: per query based on data scanned, no infrastructure
Mistake
S3 Select can query any object in S3, including image or video files.
Correct
S3 Select only supports structured data formats: CSV, JSON, and Parquet. It cannot query binary or unstructured data. Attempting to use it on an unsupported format returns a 400 error.
Mistake
Glacier Select does not require any additional cost beyond the query itself.
Correct
Glacier Select incurs both the standard retrieval fees (based on the retrieval tier) and the per-GB scanned fee. The retrieval fee is charged even if you only query a portion of the object.
Mistake
S3 Select is always faster than downloading the entire object and filtering locally.
Correct
S3 Select is faster only when the query is highly selective (returns a small fraction of data). If you return most of the object, the overhead of parsing and SQL evaluation can make it slower than a direct download.
Mistake
You can use S3 Select to update or delete data in S3 objects.
Correct
S3 Select is a read-only operation. It only retrieves data; it cannot modify the object. To update or delete, you must use standard S3 PUT/DELETE operations or S3 Batch Operations.
Mistake
S3 Select supports all SQL functions including JOINs and subqueries.
Correct
S3 Select supports a limited subset of SQL. It does not support JOINs, subqueries, or GROUP BY; aggregate functions apply to the whole object only. It supports basic predicates, string functions, and arithmetic.
Yes, but with limitations. For JSON Document type, you can use dot notation to access nested fields (e.g., SELECT s.details.name FROM s3object s). However, array support is restricted and S3 Select has no JSON functions like JSON_EXTRACT. For JSON Lines, each line is a separate record, and you can query top-level fields directly. If you need complex JSON querying, consider Athena.
Yes, but you must provide the encryption key in the request headers (x-amz-server-side-encryption-customer-algorithm and x-amz-server-side-encryption-customer-key). If the key is not provided, the request fails with a 400 error. For SSE-S3 and SSE-KMS, no additional headers are needed beyond standard permissions.
S3 Select queries a single object and returns a streamed result; Athena queries multiple objects (tables) and returns a result set. S3 Select is ideal for filtering a large object before processing in an application; Athena is for ad-hoc analytics across datasets. S3 Select has limited SQL (no JOINs); Athena supports full ANSI SQL. Athena requires a schema (Glue catalog) and incurs per-query costs; S3 Select costs per GB scanned.
Glacier Select initiates an expedited, standard, or bulk retrieval of the entire object (depending on the job tier). The query runs only after the restoration completes. The restored data is temporary and may be deleted after the job. You are charged for the retrieval (based on tier) plus the per-GB scan fee. To avoid long waits, use expedited retrieval (1-5 minutes) for urgent queries, but it costs more.
No, S3 Select supports objects up to 100 GB uncompressed. For larger objects, you must split them into smaller parts (e.g., using S3 Batch Operations) or use Athena which can handle larger datasets by scanning multiple objects.
S3 Select returns a 400 error with a message indicating the syntax error. The query is not executed. You must correct the expression and resend the request. Common errors include missing quotes, invalid column names, or unsupported functions.
Yes, you can use AWS CloudTrail to log S3 Select API calls (SelectObjectContent). AWS Cost Explorer shows costs attributed to S3 Select under the S3 service (line item 'SelectObjectContent'). You can also enable S3 server access logs, but they will show the request type as 'REST.SELECT.CONTENT'.