This chapter covers Redshift Spectrum, a feature that allows Amazon Redshift to query data directly from Amazon S3 without loading it into Redshift tables. Understanding Spectrum is critical for the SAA-C03 exam, especially for questions involving cost optimization, data lake integration, and querying large datasets in S3. Approximately 5-10% of exam questions touch on Redshift Spectrum or related S3 query services, often contrasting it with Athena, EMR, or standard Redshift queries.
Jump to a section
Imagine you are a researcher (Redshift) working in a large library (your data warehouse). You have a personal bookshelf (Redshift local storage) with frequently used books (hot data). However, the library also has a vast archive of books in a separate storage room (S3) that you can access on demand. To use those archived books, you don't bring them all to your desk—you send a request to a librarian (Spectrum) who retrieves only the relevant pages (data) from the storage room. The librarian uses a catalog (AWS Glue Data Catalog) to find which boxes contain the books you need. You can query both your personal bookshelf and the archived books in a single query, and the librarian handles the heavy lifting of scanning the archive. Crucially, the librarian never moves the books into your personal shelf—they stay in the storage room. This saves space on your desk (compute and storage separation) and allows you to access a virtually unlimited collection without reorganizing your personal shelf.
What is Redshift Spectrum and Why It Exists
Redshift Spectrum is a feature of Amazon Redshift that enables you to run SQL queries directly against exabytes of data stored in Amazon S3. It extends the analytical power of Redshift beyond the data stored locally in Redshift tables, allowing you to query data in S3 without loading it first. This is particularly useful for data lake architectures where raw data resides in S3, and you want to combine it with structured data in Redshift for analytics.
The primary motivation for Spectrum is to separate compute from storage. In traditional Redshift, data is stored in Redshift managed storage (local or attached SSD). You pay for both compute (nodes) and storage (provisioned as part of the cluster). With Spectrum, you can keep large amounts of cold or archival data in low-cost S3 (standard or Glacier) and only pay for the compute time when you query it. This reduces storage costs and allows infinite scalability for data volume.
How Redshift Spectrum Works Internally
When you issue a SELECT query that references external tables (defined in the Redshift external schema), the Redshift leader node parses the query and determines which parts access local tables and which access external tables. For the external portions, it pushes down the query to the Spectrum layer, which is a fleet of managed compute resources (separate from the cluster) that run Presto-like engines. These Spectrum workers read the data from S3, apply filters and projections (predicate pushdown), and return only the necessary rows and columns to the Redshift cluster. The Redshift cluster then combines these results with local table data as needed.
Key internal details:
Spectrum uses the AWS Glue Data Catalog (or an external Hive metastore) to store table metadata, including schema, partition information, file format, and location in S3.
Data in S3 must be in a structured format: Parquet, ORC, Avro, JSON, Ion, or CSV. Parquet and ORC are recommended for columnar storage and better compression.
Spectrum supports partitioning: you can partition data by folder structure in S3 (e.g., s3://bucket/year=2023/month=01/). When querying, Spectrum uses partition pruning to skip irrelevant partitions, drastically reducing data scanned and cost.
Spectrum can read compressed data (gzip, snappy, lzop, bzip2) and encrypted data (SSE-S3, SSE-KMS, SSE-C).
Spectrum does not support writing data back to S3; it is read-only.
Key Components, Values, Defaults, and Timers
External Schema: Created using CREATE EXTERNAL SCHEMA command. It references a database in the Glue Data Catalog or an external Hive metastore. Example:
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG DATABASE 'spectrum_db' IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;External Table: Created using CREATE EXTERNAL TABLE within an external schema. You define columns, data format, and location in S3. Example:
CREATE EXTERNAL TABLE spectrum_schema.sales (
sale_id INT,
product_name VARCHAR(100),
sale_date DATE,
amount DECIMAL(10,2)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS PARQUET
LOCATION 's3://mybucket/sales/';IAM Role: Redshift must have an IAM role with permissions to access S3 and Glue Data Catalog. The role is attached to the cluster and specified in the external schema definition. Minimum permissions: s3:GetObject, s3:ListBucket, glue:GetDatabase, glue:GetTable, glue:GetPartitions.
Data Formats Supported: Parquet, ORC, Avro, JSON (each line a JSON object), Ion, CSV. Parquet is best for performance and compression.
Compression: gzip, snappy, lzop, bzip2. Snappy is common for Parquet.
Partitioning: Up to 10,000 partitions per table recommended; Spectrum can handle millions, but performance degrades. Use partition pruning by filtering on partition columns.
Concurrency: Spectrum queries run independently of Redshift cluster compute; they consume Spectrum capacity, which is automatically scaled by AWS. However, the Redshift cluster's leader node resources and network bandwidth can be bottlenecks.
Pricing: You pay for the number of bytes scanned from S3 (in terabytes, at $5 per TB as of 2025) and for the Redshift cluster compute time. No additional infrastructure cost.
Spectrum Query Limits: Maximum result set size per query is 1 TB (scanned data). There is no limit on data stored in S3.
Configuration and Verification Commands
To verify Spectrum is working: 1. Check that the external schema is created:
SELECT * FROM pg_catalog.svv_external_schemas;List external tables:
SELECT * FROM pg_catalog.svv_external_tables;Run a simple query to test:
SELECT COUNT(*) FROM spectrum_schema.sales;Monitor Spectrum usage via Amazon CloudWatch metrics: Redshift namespace, SpectrumScanSize (bytes scanned), SpectrumQueryTime (milliseconds).
Interaction with Related Technologies
AWS Glue: Spectrum uses Glue Data Catalog as the default metastore. Glue Crawlers can automatically discover schemas and partitions of data in S3 and populate the catalog.
Amazon Athena: Athena is a serverless query service that also queries S3 using Presto. Spectrum is similar but integrated with Redshift. Exam questions often compare Athena vs. Spectrum: Spectrum is chosen when you need to join S3 data with Redshift local tables; Athena is standalone and does not require a Redshift cluster.
Amazon EMR: EMR can run Spark or Hive on S3 data. Spectrum is simpler (SQL only) but less flexible. Use Spectrum for SQL analytics; use EMR for complex transformations.
Redshift Local Tables: Spectrum queries can join with local Redshift tables in a single SQL statement. This is a key differentiator from Athena.
Best Practices
Use columnar formats (Parquet) and partition data by common filter columns (e.g., date, region).
Compress data with snappy for balanced performance.
Use the SVL_S3QUERY_SUMMARY view to analyze Spectrum query performance and data scanned.
Set spectrum_enable_predicate_pushdown to true (default) to push down WHERE clauses to Spectrum.
Limit the number of partitions scanned by using partition pruning.
Monitor SpectrumScanSize to control costs; use SELECT only needed columns, avoid SELECT *.
Limitations
Spectrum does not support DML (INSERT, UPDATE, DELETE) on external tables.
Spectrum does not support indexes or column encoding on external data.
Spectrum queries may have higher latency than local Redshift queries due to S3 read overhead.
Spectrum cannot query data in S3 Glacier or Glacier Deep Archive (must be in Standard, Standard-IA, or One Zone-IA).
Spectrum does not support user-defined functions (UDFs).
Maximum number of external tables in a cluster: 1,000.
Define External Schema and Tables
First, you create an external schema in Redshift that maps to a database in the AWS Glue Data Catalog. This schema acts as a namespace for external tables. Then, you define external tables using CREATE EXTERNAL TABLE, specifying columns, data format, compression, and S3 location. Redshift stores only metadata; no data is moved. You must also attach an IAM role to the Redshift cluster that has permissions to read S3 and access Glue.
Submit a Query with External Table References
When a user runs a SQL query that references one or more external tables (e.g., SELECT * FROM spectrum_schema.sales WHERE sale_date = '2024-01-01'), the Redshift leader node parses the query. It identifies which parts involve local tables and which involve external tables. For external tables, it constructs a subquery plan to be executed by the Spectrum layer. The leader node also pushes down any filters (predicates) to Spectrum to minimize data scanned.
Spectrum Layer Reads from S3
The Redshift cluster sends the subquery to the Spectrum compute layer, which is a fleet of managed instances running a distributed SQL engine (based on Presto). Spectrum workers read the necessary data files from S3 using parallel I/O. They leverage partition pruning: if the query filters on partition columns, Spectrum skips entire folders in S3. They also apply column pruning, reading only the columns referenced in the query. Data is decompressed and deserialized on the fly.
Spectrum Returns Filtered Results to Redshift
Spectrum processes the data as much as possible (filtering, aggregation if pushed down) and returns only the required rows and columns to the Redshift cluster. The amount of data returned is typically much smaller than the raw data scanned. The Redshift leader node then combines these results with any local table data (if a join is involved) and performs final processing (e.g., sorting, grouping). The final result set is sent to the client.
Cleanup and Billing
After the query completes, the Spectrum workers are terminated. No persistent compute is left running. You are billed based on the amount of data scanned from S3 (in TB) at the current Spectrum rate. Additionally, you pay for the Redshift cluster compute time used during the query. CloudWatch metrics track bytes scanned and query duration for cost monitoring.
Enterprise Scenario 1: E-commerce Analytics with Historical Data
A large e-commerce company stores daily sales transactions in Redshift for the last 90 days for fast reporting. Older data (over 90 days) is moved to S3 in Parquet format partitioned by year/month. Using Redshift Spectrum, analysts can run quarterly reports that query both recent data in Redshift and historical data in S3 in a single SQL query. For example: SELECT product, SUM(amount) FROM sales_local UNION ALL SELECT product, SUM(amount) FROM spectrum_schema.sales_historical WHERE year >= 2023 GROUP BY product. This avoids loading terabytes of old data into Redshift, saving storage costs. The company uses Glue Crawlers to automatically update the partition metadata daily. Misconfiguration often occurs when the IAM role lacks s3:ListBucket permissions, causing queries to fail with permission errors.
Scenario 2: Data Lake Integration for Financial Services
A financial institution maintains a data lake in S3 with trade data in JSON format. They want to combine this with customer profiles stored in Redshift for risk analysis. They create external tables over the JSON data, but initially performance is poor due to full scans. They convert the JSON to Parquet and partition by trade_date. After optimization, queries that used to scan 500 GB now scan only 10 GB, reducing costs by 98%. A common pitfall is forgetting to set serialization.encoding for special characters in JSON, leading to query failures.
Scenario 3: Media Analytics with Mixed Workloads
A media company uses Redshift for real-time dashboards on ad impressions (local tables). They also have log files in S3 (CSV) for ad clicks. They use Spectrum to join these datasets. However, they notice that Spectrum queries are slow because they use gzip compression on CSV files, which cannot be split for parallel reads. They switch to snappy-compressed Parquet, improving query speed by 10x. They also implement partition pruning by date and campaign_id. The lesson: choosing the right file format and compression is critical for Spectrum performance.
What SAA-C03 Tests
Redshift Spectrum appears in questions related to cost optimization, data lake integration, and querying S3 without loading. Specific objectives: Domain 3 (High Performance) and Domain 1 (Design Secure Architectures) occasionally. The exam focuses on:
When to use Spectrum vs. Athena vs. EMR vs. Redshift native.
How to set up external tables and IAM roles.
Understanding that Spectrum queries data in place (no loading).
Partitioning and file format impact on cost/performance.
Common Wrong Answers
'Redshift Spectrum requires data to be loaded into Redshift first.' This is false; Spectrum queries S3 directly. Candidates confuse it with standard Redshift COPY command.
'Spectrum can write results back to S3.' False; Spectrum is read-only. Use CTAS (CREATE TABLE AS) with external table if needed, but that writes to Redshift, not S3.
'Spectrum supports all S3 storage classes including Glacier.' False; Glacier and Glacier Deep Archive are not supported because data must be retrievable in milliseconds.
'You need to provision compute nodes for Spectrum.' False; Spectrum is serverless and scales automatically.
Specific Numbers and Terms
Spectrum pricing: $5 per TB scanned (as of 2025).
Supported formats: Parquet, ORC, Avro, JSON, Ion, CSV.
Compression: gzip, snappy, lzop, bzip2.
Partition limit: 10,000 recommended per table.
IAM role requirement: must have s3:GetObject, s3:ListBucket, glue:Get*.
External schema command: CREATE EXTERNAL SCHEMA.
Data catalog: AWS Glue or Hive metastore.
Edge Cases and Exceptions
Spectrum does not support SELECT INTO or INSERT INTO external tables.
Spectrum cannot query data in Requester Pays buckets.
If you use SSE-KMS encrypted S3 objects, the IAM role must have kms:Decrypt and kms:GenerateDataKey permissions.
Spectrum does not support DELTA or HUDI formats (use Athena or EMR for those).
How to Eliminate Wrong Answers
If a question mentions 'querying data in S3 without moving it' and the answer includes 'load into Redshift,' eliminate it.
If a question says 'write aggregated results to S3,' Spectrum cannot do that; consider Athena or EMR.
If a question requires joining S3 data with Redshift tables, Spectrum is the best fit (not Athena, which is standalone).
If cost is a concern and data is rarely queried, Spectrum is cheaper than storing in Redshift.
If data is in a non-supported format (e.g., Avro without schema), Glue Crawler can help, but Spectrum still needs a compatible format.
Redshift Spectrum allows querying data directly from S3 without loading into Redshift, enabling cost-effective data lake analytics.
External tables must be defined using CREATE EXTERNAL TABLE with a schema from AWS Glue Data Catalog or Hive metastore.
Spectrum supports columnar formats (Parquet, ORC) for better performance; use snappy compression.
Partition pruning reduces data scanned; partition by frequently filtered columns like date.
IAM role must have s3:GetObject, s3:ListBucket, and Glue permissions.
Spectrum is read-only; use UNLOAD to write from Redshift to S3.
Pricing is $5 per TB of data scanned; no additional infrastructure cost.
Spectrum does not support Glacier storage classes or writing to S3.
Common exam scenario: join S3 data with Redshift local tables using Spectrum.
Monitor Spectrum usage with CloudWatch metrics: SpectrumScanSize, SpectrumQueryTime.
These come up on the exam all the time. Here's how to tell them apart.
Redshift Spectrum
Requires a running Redshift cluster (cost for compute).
Can join S3 data with Redshift local tables in one query.
Uses Redshift SQL dialect with some limitations.
Supports data formats: Parquet, ORC, Avro, JSON, Ion, CSV.
Pricing: $5/TB scanned plus Redshift cluster cost.
Amazon Athena
Serverless; no cluster needed.
Cannot join with Redshift tables directly (use federated query or Redshift Spectrum).
Uses Presto SQL dialect.
Supports more formats: Parquet, ORC, Avro, JSON, Ion, CSV, TSV, etc.
Pricing: $5/TB scanned (no additional compute cost).
Mistake
Redshift Spectrum loads data into Redshift temporarily during query execution.
Correct
Spectrum does not load data into Redshift. It reads data directly from S3 and returns only the result set to the cluster. The data remains in S3.
Mistake
Spectrum can write query results back to S3.
Correct
Spectrum is read-only. To write results to S3, you must use UNLOAD (which writes from Redshift to S3) or a separate ETL tool.
Mistake
Spectrum supports all S3 storage classes including Glacier.
Correct
Spectrum only supports S3 Standard, Standard-IA, and One Zone-IA. Glacier and Glacier Deep Archive require restoration first, which Spectrum cannot do.
Mistake
You need to provision separate compute resources for Spectrum.
Correct
Spectrum uses a serverless compute layer managed by AWS. You do not provision or manage any instances. You only pay for data scanned.
Mistake
Spectrum queries are always faster than querying Redshift local tables.
Correct
Spectrum queries are typically slower than local Redshift queries due to network latency and S3 read overhead. Spectrum is best for infrequent queries on large datasets.
Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.
No, Redshift Spectrum cannot query data stored in S3 Glacier or Glacier Deep Archive. It requires data to be in S3 Standard, Standard-IA, or One Zone-IA storage classes. If you need to query archived data, you must first restore it to a supported class using S3 Glacier Restore.
First, create an external schema using CREATE EXTERNAL SCHEMA that references a database in the Glue Data Catalog. Then, use CREATE EXTERNAL TABLE to define the table structure, specifying the data format (e.g., PARQUET), location in S3, and optionally partition columns. Example: CREATE EXTERNAL TABLE ext_schema.sales (id INT, product VARCHAR) STORED AS PARQUET LOCATION 's3://bucket/sales/';
Redshift Spectrum is an extension of Redshift that allows querying S3 data and joining it with local Redshift tables. Athena is a standalone serverless query service. Use Spectrum when you have an existing Redshift cluster and need to combine S3 data with Redshift data. Use Athena when you don't have a Redshift cluster or need to query only S3 data.
No, Redshift Spectrum is read-only. It cannot INSERT, UPDATE, DELETE, or write results to S3. To write data from Redshift to S3, use the UNLOAD command, which exports data to S3 in text or Parquet format.
Redshift Spectrum supports Parquet, ORC, Avro, JSON (line-delimited), Ion, and CSV. For best performance, use columnar formats like Parquet or ORC with snappy compression.
You pay $5 per terabyte (TB) of data scanned by Spectrum. Additionally, you pay for the Redshift cluster compute time used during the query. There is no upfront cost or minimum fee. Data scanned is measured in bytes after compression and predicate pushdown.
Yes, Spectrum supports SSE-S3, SSE-KMS, and SSE-C encryption. For SSE-KMS, the IAM role attached to Redshift must have kms:Decrypt and kms:GenerateDataKey permissions. For SSE-C, you must provide the encryption key in the external table definition using the 'sse_customer_key' option.
You've just covered Redshift Spectrum for S3 Queries — now see how well it sticks with free SAA-C03 practice questions. Full explanations included, no account needed.
Done with this chapter?