SAA-C03 | Chapter 152 of 189 | Objective 3.1

AWS Glue for ETL

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. This chapter covers Glue's core components—crawlers, Data Catalog, ETL jobs, triggers, and workflows—and how they integrate with other AWS services like S3, Redshift, and Athena. For the SAA-C03 exam, Glue appears in approximately 5-8% of questions, often in scenarios involving data lake creation, schema discovery, and serverless ETL. Mastering Glue's architecture, job types, and pricing models is essential for designing cost-effective and scalable data processing solutions.

25 min read
Intermediate
Updated May 31, 2026

AWS Glue as a Factory Assembly Line

Imagine a factory that receives raw materials (data) from various suppliers (data sources) in different forms—some in crates (S3), some in barrels (RDS), some on pallets (Redshift). The factory has a central warehouse (Data Catalog) where it stores blueprints (metadata) describing each material type. The factory floor has automated assembly lines (Glue ETL jobs) that pick up raw materials, process them (transform), and package them into finished products (structured datasets) for shipping to different stores (data targets). The assembly lines are written in a special language (PySpark or Scala) and run on a pool of workers (Apache Spark) that can be scaled up or down. A foreman (Glue scheduler) can trigger assembly lines on a schedule or when new materials arrive (event-driven). The factory also has a quality control station (Glue DataBrew) to clean and normalize data visually, and a logistics coordinator (AWS Lake Formation) to manage permissions for accessing the finished products. If a new type of raw material arrives, the factory can automatically create a new blueprint using a crawler (Glue Crawler) that scans the material and updates the catalog. This entire system is serverless—the factory floor, workers, and foreman are provisioned and managed by AWS, so the factory owner only pays for the time the assembly lines run.

How It Actually Works

What is AWS Glue?

AWS Glue is a serverless ETL service that AWS announced in 2016 and made generally available in 2017. It is designed to move and transform data between data stores. Glue removes the need to provision and manage Apache Spark clusters; instead, you define ETL jobs in Python or Scala, and Glue provisions the required resources automatically. Glue is deeply integrated with the AWS data ecosystem, including Amazon S3, Amazon Redshift, Amazon RDS, Amazon DynamoDB, and Amazon Athena.

Core Components

#### 1. AWS Glue Data Catalog

The Data Catalog is a central metadata repository. It stores table definitions, schema information, and partition metadata for data sources. Each table entry includes a location (e.g., an S3 path), a format (e.g., Parquet, CSV), and a schema. The catalog is Apache Hive Metastore compatible, allowing Athena, Redshift Spectrum, and EMR to use it directly. The catalog is Region-specific but can be shared across accounts through Glue resource policies or Lake Formation grants, which rely on AWS Resource Access Manager (RAM).
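To make the idea of a table entry concrete, here is a minimal sketch of the metadata a catalog table carries. The dict is shaped like the `TableInput` structure that boto3's `glue.create_table()` accepts; the table name, S3 path, and columns are hypothetical, and nothing here calls AWS.

```python
# Hypothetical catalog entry shaped like the TableInput structure accepted by
# boto3's glue.create_table(). Names and paths are illustrative only.
def build_table_input(name, location, columns, partition_keys):
    """Return a TableInput-style dict for a Parquet table stored on S3."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        # Partition columns are kept separate from regular columns.
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Location": location,
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


orders_table = build_table_input(
    name="orders",
    location="s3://example-bucket/curated/orders/",
    columns=[("order_id", "bigint"), ("amount", "double")],
    partition_keys=[("year", "string"), ("month", "string")],
)
print(orders_table["StorageDescriptor"]["Location"])
```

The location, format classes, and column list are exactly the pieces Athena or Redshift Spectrum need in order to read the underlying files without a separate schema definition.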

#### 2. Crawlers

Crawlers are agents that scan data sources, infer schemas, and populate the Data Catalog. A crawler connects to a data store, classifies data (e.g., JSON, Parquet), and creates or updates table metadata. You can schedule crawlers (e.g., hourly) or run them on demand. Crawlers support S3, JDBC (RDS, Redshift, etc.), and DynamoDB. For S3, crawlers can handle partitioned data (e.g., year=2024/month=01/) and automatically detect partitions. Crawlers also support schema evolution: if new columns appear, they can update the catalog (configurable).
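The partition layout mentioned above follows the Hive `key=value` folder convention. The sketch below only illustrates that naming convention in plain Python; it is not the crawler's actual detection logic, and the object key is a made-up example.

```python
def parse_partitions(s3_key):
    """Extract Hive-style key=value partition segments from an S3 object key."""
    partitions = {}
    # Every path segment except the last one (the file name) may be a partition.
    for segment in s3_key.split("/")[:-1]:
        if "=" in segment:
            key, _, value = segment.partition("=")
            partitions[key] = value
    return partitions


print(parse_partitions("sales/year=2024/month=01/part-0000.parquet"))
# {'year': '2024', 'month': '01'}
```

Each distinct combination of partition values becomes one partition of the catalog table, which lets query engines skip folders that a filter rules out.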

#### 3. ETL Jobs

Glue ETL jobs perform the actual data transformation. Jobs are written in Python (with PySpark) or Scala (with Apache Spark). Glue provides a set of built-in transformations (e.g., ApplyMapping, Filter, Join, DropFields) and also supports custom Spark code. Jobs run on Apache Spark 2.4 or 3.x (depending on Glue version). You specify the number of Data Processing Units (DPUs); each DPU provides 4 vCPUs and 16 GB of memory. The default is 10 DPUs, but you can set from 2 to 100 DPUs (or more with a quota increase). Jobs can be triggered on a schedule, on demand, or by an event (e.g., new S3 file via EventBridge).
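The DPU figures translate directly into cluster capacity. As a quick sanity check, this small sketch multiplies them out; the function name and the lower-bound check are illustrative choices based on the numbers quoted above.

```python
VCPUS_PER_DPU = 4
MEMORY_GB_PER_DPU = 16
MIN_DPUS = 2  # lower bound quoted in the text for Spark ETL jobs


def job_capacity(dpus):
    """Return the total vCPUs and memory that a given DPU count provides."""
    if dpus < MIN_DPUS:
        raise ValueError(f"Spark ETL jobs need at least {MIN_DPUS} DPUs")
    return {"vcpus": dpus * VCPUS_PER_DPU, "memory_gb": dpus * MEMORY_GB_PER_DPU}


print(job_capacity(10))  # {'vcpus': 40, 'memory_gb': 160}
```

So the default of 10 DPUs corresponds to 40 vCPUs and 160 GB of memory across the driver and workers.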

#### 4. Triggers

Triggers start ETL jobs based on a schedule (cron) or an event (e.g., job completion). There are three types: Schedule (time-based), On-demand (manual), and Conditional (starts after one or more jobs succeed, fail, or complete). Conditional triggers enable building complex workflows.
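A conditional trigger is easiest to understand as a predicate plus a list of actions. The sketch below builds a request dict in the shape that boto3's `glue.create_trigger()` accepts; the trigger and job names are hypothetical, and the code only constructs the payload without contacting AWS.

```python
def build_conditional_trigger(name, upstream_job, downstream_job):
    """Build a create_trigger-style request: run downstream_job after upstream_job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            "Logical": "AND",
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ],
        },
        "Actions": [{"JobName": downstream_job}],
    }


request = build_conditional_trigger(
    "after-transform", "transform_to_parquet", "load_redshift"
)
print(request["Predicate"]["Conditions"][0]["State"])  # SUCCEEDED
```

With a boto3 Glue client in hand, the dict could be passed as `glue.create_trigger(**request)`. Changing `State` to `FAILED` would give the failure branch of a pipeline instead.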

#### 5. Workflows

Workflows are directed acyclic graphs (DAGs) of jobs, crawlers, and triggers. They orchestrate multi-step ETL pipelines. You can visualize the workflow in the Glue console and monitor execution.
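To see what "DAG" means in practice, the sketch below models a small pipeline as a dependency graph and derives a valid execution order with Python's standard `graphlib` module (Python 3.9+). The node names are hypothetical, and this is a plain-Python illustration of the ordering idea, not how Glue schedules workflow nodes internally.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes that must finish before it can start.
pipeline = {
    "crawl_raw": set(),
    "transform_to_parquet": {"crawl_raw"},
    "load_redshift": {"transform_to_parquet"},
    "crawl_curated": {"transform_to_parquet"},
}

# static_order() raises CycleError if the graph is not acyclic.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

The first two entries are always `crawl_raw` and `transform_to_parquet`; the last two steps have no dependency on each other, so a workflow could run them in parallel.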

#### 6. Glue DataBrew

DataBrew is a visual data preparation tool that cleans and normalizes data without writing code. It generates recipes (series of transformations) that can be exported as Glue ETL jobs. DataBrew is useful for ad-hoc data cleaning.

#### 7. Glue Studio

Glue Studio is a visual interface for creating, running, and monitoring ETL jobs. It allows you to design jobs using a drag-and-drop canvas, visually inspect data, and view job metrics.

How Glue Works Internally

When you create and run an ETL job, Glue performs the following steps:

1. Job Definition: You provide the script location (S3), a temporary directory, the DPU count, and job parameters. Glue stores this metadata.
2. Job Execution: When triggered, Glue provisions an Apache Spark environment on your behalf. The environment is managed by Glue, and its instances do not appear in your EC2 console. It consists of a driver and workers. The driver runs the script's main function and distributes tasks to the workers.
3. Data Processing: The script reads data from sources (e.g., S3 using the Data Catalog), applies transformations, and writes to targets (e.g., S3, Redshift). Glue uses Spark's in-memory processing for performance.
4. Cleanup: After the job completes (or fails), Glue releases the resources. You are billed only for the time the job runs, metered per second with a 1-minute minimum.

Key Defaults and Limits

DPU: Default 10, range 2-100 (default max is 100, can be increased via support ticket).

Job timeout: Default 2880 minutes (48 hours), max 7 days.

Maximum concurrent runs per job: Default 1; you can raise it with the job's maximum concurrency setting, up to the service quota.

Script size: Max 1 MB (zip files can be larger).

Data Catalog table limit: 100,000 per account per region (soft limit, can be increased).

Crawler schedule: Minimum interval 5 minutes.

Temporary directory: Must be in S3, used for intermediate data and logs.

Data Format Support

Glue supports reading and writing:

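Structured and semi-structured formats: CSV, JSON, Parquet, ORC, Avro, XML

Log data: Parsed with grok patterns; built-in classifiers cover common log formats.

Custom classifiers (grok, JSON, XML, or CSV) can be written for data the built-in classifiers do not recognize.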

Security and Encryption

Glue integrates with AWS KMS for encryption at rest (S3, Data Catalog) and uses SSL/TLS for encryption in transit. Each job runs under an IAM role, which must have permissions to read the sources and write to the targets. Glue also supports VPC endpoints for private connectivity.
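For the IAM role, two pieces are needed: a trust policy that lets the Glue service assume the role, and permission statements for the data the job touches. The sketch below shows both as Python dicts; the bucket names are hypothetical, and a real role would usually also need CloudWatch Logs and Glue permissions that are omitted here.

```python
import json

# Trust policy: allows the Glue service to assume the job's role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Permission policy: read from a source bucket, write to a target bucket.
# Bucket names are placeholders for illustration.
data_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-raw-bucket/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::example-curated-bucket/*",
        },
    ],
}

print(json.dumps(trust_policy, indent=2))
```

Keeping read and write permissions scoped to specific buckets is what makes an `AccessDenied` failure easy to diagnose: the missing statement usually maps to one source or target.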

Integration with Other Services

Amazon Athena: Can query Data Catalog tables directly. Glue ETL can transform data into columnar formats (Parquet) to optimize Athena queries.

Amazon Redshift Spectrum: Uses Data Catalog to query external tables in S3.

AWS Lake Formation: Provides fine-grained access control on Data Catalog tables. Glue can be used to load data into a Lake Formation-governed data lake.

Amazon S3: Primary data lake storage. Glue jobs and workflows can be started in response to S3 events, typically routed through EventBridge.

Amazon CloudWatch: Logs and metrics for job runs.

AWS Step Functions: Can orchestrate Glue jobs along with other services.

Pricing Model

Glue pricing is based on:

- ETL jobs: Charged per DPU-hour, metered per second with a 1-minute minimum.
- Crawlers: Charged per DPU-hour for the time the crawler runs.
- Data Catalog: Storage of metadata at $1 per 100,000 objects per month beyond the first 1 million objects, which are free.
- DataBrew: Charged per session and per recipe job run.
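A rough cost estimate for a single job run follows from the DPU-hour model. The sketch below assumes per-second metering with a 1-minute minimum and uses an illustrative rate of $0.44 per DPU-hour; the rate is an assumption for the example, so check the current price for your Region.

```python
def estimate_job_cost(dpus, runtime_seconds, price_per_dpu_hour=0.44, minimum_seconds=60):
    """Estimate the cost of one job run under DPU-hour pricing.

    price_per_dpu_hour is an illustrative assumption, not a quoted price.
    """
    billable_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * (billable_seconds / 3600) * price_per_dpu_hour


# 10 DPUs for 15 minutes: 10 * 0.25 h * 0.44 = 1.10
print(round(estimate_job_cost(10, 15 * 60), 2))  # 1.1

# A 30-second run is billed as the 60-second minimum.
print(round(estimate_job_cost(10, 30), 4))  # 0.0733
```

Because cost scales with both DPU count and runtime, doubling the DPUs only saves money if it cuts the runtime by more than half.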

Best Practices

Use columnar formats (Parquet, ORC) for better performance and cost.

Partition data appropriately (e.g., by date) and configure crawlers to detect partitions.

Use job bookmarks to process only new data incrementally.

Set appropriate DPU count – start with default and monitor CloudWatch metrics (e.g., memory, CPU).

Enable job metrics and logs for troubleshooting.

Use Glue Workflows for multi-step orchestration instead of custom schedulers.

Common Pitfalls

Incorrect IAM permissions: Job fails with AccessDenied. Ensure the IAM role has permissions for sources, targets, and CloudWatch logs.

Schema mismatch: If source schema changes unexpectedly, job may fail. Use crawlers to update catalog before job runs.

Out-of-memory: If DPU count is too low for large datasets, job may fail with memory errors. Increase DPU or optimize transformations.

Network connectivity: If job needs to access RDS in a VPC, ensure the job has a VPC connection (Glue connections).

Walk-Through

1. Create a Data Catalog

First, you must populate the Glue Data Catalog with table definitions that describe your source data. You can either manually define tables using the console or AWS CLI, or use a Glue Crawler to automatically infer schemas from your data stores. A crawler connects to a source (e.g., an S3 bucket with CSV files), classifies the data format, and creates table metadata in the catalog. The crawler can also detect partitions if your data is organized in a partitioned structure (e.g., `s3://bucket/year=2024/month=01/`). You can schedule crawlers to run periodically to capture schema changes.
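As an illustration of this step, the sketch below builds a crawler definition in the shape that boto3's `glue.create_crawler()` accepts. The crawler name, role ARN, database, bucket path, and schedule are all hypothetical, and the function only builds the request so it can be inspected without AWS access.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build a create_crawler-style request for a single S3 target."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Update table definitions when the schema changes; only log deletions.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }
    if schedule:
        request["Schedule"] = schedule
    return request


crawler_request = build_crawler_request(
    name="raw-sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="raw",
    s3_path="s3://example-bucket/sales/",
    schedule="cron(0 2 * * ? *)",  # every day at 02:00 UTC
)
print(crawler_request["Targets"])
```

With a boto3 Glue client, the dict could be passed as `glue.create_crawler(**crawler_request)`, and `glue.start_crawler(Name="raw-sales-crawler")` would run it on demand.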

2. Define an ETL Job

Once the Data Catalog contains the source table definitions, you create an ETL job. In the Glue console, you can use Glue Studio to visually design the job or write a script in Python or Scala. The job script creates a `GlueContext`, reads from the Data Catalog (e.g., with `create_dynamic_frame.from_catalog`), applies transformations using Glue's DynamicFrame API or native Spark, and writes to a target data store. You specify the source and target locations, the number of DPUs (default 10), and a temporary directory in S3 for intermediate data. The script is stored in S3.
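The job settings described here map onto a job definition. The sketch below builds one in the shape that boto3's `glue.create_job()` accepts, assuming Glue 4.0 and `G.1X` workers (one DPU each); the job name, role ARN, and S3 paths are hypothetical, and the code does not call AWS.

```python
def build_job_request(name, role_arn, script_s3_path, temp_dir, workers=10):
    """Build a create_job-style request for a PySpark ETL job."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "DefaultArguments": {
            "--TempDir": temp_dir,
            "--job-bookmark-option": "job-bookmark-enable",
        },
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": workers,
        "Timeout": 2880,  # minutes (48 hours)
    }


job_request = build_job_request(
    name="nightly-sales-etl",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_s3_path="s3://example-bucket/scripts/sales_etl.py",
    temp_dir="s3://example-bucket/tmp/",
)
print(job_request["Command"]["ScriptLocation"])
```

With a boto3 Glue client, this could be submitted as `glue.create_job(**job_request)`. Keeping the definition in code makes the DPU count, timeout, and bookmark setting reviewable alongside the script.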

3. Set Up Triggers and Workflows

To automate job execution, you create triggers. Triggers can be time-based (cron schedule), on-demand, or conditional (e.g., start Job B after Job A succeeds). For complex pipelines, you can create a Glue Workflow that defines a DAG of jobs, crawlers, and triggers. Workflows provide visual representation and centralized monitoring. You can start a workflow manually or on a schedule. Conditional triggers within a workflow allow you to branch based on job success or failure.

4. Run the Job

When a trigger fires or you manually start the job, Glue provisions an Apache Spark environment on your behalf and runs the ETL script on it. Glue handles all provisioning and sizes the environment according to the DPU count. The job reads data from the source, transforms it, and writes to the target. During execution, Glue emits logs and metrics to CloudWatch. You can monitor progress in the Glue console or via CloudWatch dashboards. If the job uses job bookmarks, it processes only data that has arrived since the last run.
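Outside the console, a run is usually monitored by polling its state. The sketch below shows one way to do that against any object exposing boto3's `get_job_run` call; the `FakeGlue` class, job name, and run ID are stand-ins so the example runs without AWS access.

```python
import time

# States after which a job run will not change any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def wait_for_job_run(glue, job_name, run_id, poll_seconds=30, sleep=time.sleep):
    """Poll get_job_run until the run reaches a terminal state, then return it."""
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)


class FakeGlue:
    """Stand-in for a boto3 Glue client that replays a fixed list of states."""

    def __init__(self, states):
        self._states = iter(states)

    def get_job_run(self, JobName, RunId):
        return {"JobRun": {"JobRunState": next(self._states)}}


fake = FakeGlue(["STARTING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_job_run(fake, "nightly-sales-etl", "jr_example", sleep=lambda s: None)
print(final_state)  # SUCCEEDED
```

With a real client, `glue.start_job_run(JobName=...)` returns the `JobRunId` to poll. For production pipelines, EventBridge job state change events or Step Functions avoid the polling loop entirely.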

5. Monitor and Optimize

After the job completes, review CloudWatch logs for errors and performance metrics. Key metrics include DPU usage, memory utilization, and job duration. If the job fails, check the logs and the IAM role permissions. To optimize, consider increasing DPU count for larger datasets, using columnar formats, and partitioning data. You can also enable job metrics to see data read/write rates. For incremental processing, enable job bookmarks. Adjust the job timeout if needed (default 48 hours). Finally, review cost reports to ensure DPU usage aligns with budget.

What This Looks Like on the Job

Scenario 1: Building a Data Lake for Analytics

A large e-commerce company wants to centralize data from multiple sources (transactional RDS, clickstream logs in S3, and DynamoDB) into a data lake on S3 for analytics. They use Glue Crawlers to scan each source and populate the Data Catalog. They then create a Glue workflow that runs nightly: first, a crawler updates the catalog; next, an ETL job joins and transforms the data into Parquet format, partitioned by date; finally, a second job loads aggregated data into Redshift for reporting. The company sets the job to use 20 DPUs to handle 5 TB of data per run. They enable job bookmarks to process only daily increments. This serverless approach eliminates cluster management and scales automatically. A common mistake is not setting the correct IAM role, causing access failures when reading from DynamoDB or writing to Redshift.

Scenario 2: Near-Real-Time Clickstream Processing

A media company ingests clickstream data via Kinesis Data Firehose into S3. They want to transform the raw JSON into columnar format and enrich it with user metadata from RDS. They use Glue ETL jobs triggered by S3 events via EventBridge. The job reads the latest batch, joins with the RDS table (using a Glue connection with JDBC), and writes Parquet files to a curated zone. They use Glue Studio to visually design the transformation. The job runs with 5 DPUs and completes within 5 minutes for each 1 GB batch. The company monitors job duration and cost; they learned that using too many DPUs for small data wastes money, so they optimized to the minimum needed.

Scenario 3: Schema Discovery for Ad-hoc Analysis

A financial services firm has thousands of CSV files in S3 with inconsistent schemas (e.g., missing columns, different data types). They use Glue Crawlers with custom classifiers to handle variations. The crawlers update the catalog daily. Analysts then use Athena to query the data directly via the Data Catalog. When schemas change, crawlers automatically update the catalog, but they set the crawler to 'merge' new columns rather than overwrite, preserving existing metadata. A pitfall is that crawlers can take a long time if there are many small files; they learned to combine files before crawling. They also use Glue DataBrew for quick visual cleaning before running formal ETL jobs.
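The difference between merging and overwriting a schema can be shown with a simplified model. The sketch below treats a schema as a column-to-type mapping and keeps existing definitions while adding newly discovered columns; it is an illustration of the idea in plain Python, not the crawler's actual implementation, and the column names are made up.

```python
def merge_new_columns(existing, discovered):
    """Keep existing column definitions and append columns seen for the first time."""
    merged = dict(existing)
    for column, column_type in discovered.items():
        # setdefault leaves an existing column's type untouched.
        merged.setdefault(column, column_type)
    return merged


catalog_schema = {"account_id": "bigint", "amount": "double"}
crawled_schema = {"account_id": "string", "amount": "double", "currency": "string"}

print(merge_new_columns(catalog_schema, crawled_schema))
# {'account_id': 'bigint', 'amount': 'double', 'currency': 'string'}
```

An overwrite would have changed `account_id` to `string` because one file contained a non-numeric value, which is the kind of silent change that breaks downstream Athena queries.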

How SAA-C03 Actually Tests This

The SAA-C03 exam tests AWS Glue primarily in the context of designing data processing solutions (Objective 3.1). You must understand when to use Glue vs. other services like Amazon EMR or AWS Data Pipeline. Key exam topics include:

- Glue Data Catalog: Know it is a central metadata repository, Hive-compatible, and used by Athena, Redshift Spectrum, and EMR.
- Crawlers: Understand that they infer schema, support S3, JDBC, and DynamoDB, and can be scheduled or event-driven. Remember that crawlers can detect partitions and handle schema evolution (update or merge).
- ETL Jobs: Know that jobs run on Apache Spark, are written in Python or Scala, and use DPUs for capacity. Default DPU is 10; range 2-100. Jobs can be triggered by schedule, event, or conditional triggers.
- Job Bookmarks: Used for incremental processing. They track processed data so that later runs process only new files. Glue persists bookmark state per job.
- Glue Workflows: Orchestrate multiple jobs and crawlers as a DAG.
- Pricing: Charged per DPU-hour for jobs and crawlers. The first 1 million objects stored in the Data Catalog are free.
- Common Wrong Answers:
  1. Choosing Amazon EMR for simple ETL when Glue is serverless and cheaper for intermittent workloads. Candidates pick EMR because it's familiar, but Glue eliminates cluster management.
  2. Thinking Glue is an interactive SQL query service. It is not; it runs Spark scripts (which can include Spark SQL). For ad hoc SQL, use Athena or Redshift.
  3. Assuming Glue is the default answer for real-time streaming. Glue is primarily a batch service; for streaming scenarios, the expected answer is usually Kinesis Data Analytics or Spark Streaming on EMR.
  4. Forgetting that Glue jobs require a temporary directory in S3; questions may test this prerequisite.
- Edge Cases: Glue job timeout default is 48 hours; a job that runs longer is terminated. Concurrent runs per job default to 1 and can be raised up to the service quota. Jobs that use a Glue connection run inside your VPC, so they need an S3 VPC endpoint or a NAT gateway to reach S3.
- Numbers to Memorize: Default DPU = 10, max DPU per job = 100, default timeout = 2880 minutes, minimum crawler schedule = 5 minutes, free tier for catalog = 1 million objects per month.

To eliminate wrong answers, focus on whether the scenario requires serverless, metadata management, or integration with Athena. If the question mentions 'schema discovery' or 'data catalog', the answer likely involves Glue Crawlers. If it mentions 'serverless ETL', Glue is the best fit.

Key Takeaways

AWS Glue is a serverless ETL service that runs Apache Spark jobs.

Glue Data Catalog is a central metadata repository used by Athena, Redshift Spectrum, and EMR.

Crawlers automatically infer schemas from data stores and populate the Data Catalog.

Glue ETL jobs are written in Python or Scala and run on a Spark cluster managed by Glue.

Default DPU count is 10; range is 2-100 per job.

Job bookmarks enable incremental processing of new data only.

Glue Workflows orchestrate multiple jobs and crawlers as a DAG.

Glue pricing: per DPU-hour for jobs and crawlers; first 1 million catalog objects free per month.

Glue is primarily a batch ETL service (its streaming ETL jobs run Spark Structured Streaming in micro-batches); for real-time streaming, use Kinesis Data Analytics or EMR.

Glue integrates with KMS for encryption and IAM for access control.

Glue temporary directory in S3 is required for job execution.

Glue connections allow access to resources in VPCs (e.g., RDS).

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

AWS Glue

Fully serverless – no cluster management.

Priced per DPU-hour, metered per second with a 1-minute minimum.

Best for intermittent or small-to-medium ETL jobs.

Integrated with Glue Data Catalog and Athena.

Supports Python (PySpark) and Scala scripts only.

Amazon EMR

Requires cluster provisioning and management.

Priced per EC2 instance-hour (can use Spot/Reserved).

Best for large-scale, long-running or complex Spark/Hadoop jobs.

Can use Hive Metastore or Glue Data Catalog.

Supports multiple languages and frameworks (Spark, Hive, Presto, etc.).

Watch Out for These

Mistake

AWS Glue can run SQL queries directly.

Correct

Glue is not an interactive SQL query service. It runs Apache Spark jobs written in Python or Scala. For ad hoc SQL queries and SQL-based transformations, use Amazon Athena or Amazon Redshift. Within a Glue script, however, you can still use Spark SQL.

Mistake

Glue ETL jobs are real-time streaming.

Correct

Standard Glue ETL jobs are batch-oriented. Glue also offers streaming ETL jobs built on Spark Structured Streaming, but these process data in micro-batches. For low-latency stream processing, use Amazon Kinesis Data Analytics or Amazon EMR with Spark Streaming. A Glue job triggered by S3 events is still batch processing.

Mistake

Glue Data Catalog is only for Glue.

Correct

The Data Catalog is a central metadata repository that integrates with Athena, Redshift Spectrum, EMR, and Lake Formation. It is Hive Metastore compatible, so any Hive-compatible tool can use it.

Mistake

Glue jobs automatically handle schema changes without configuration.

Correct

Glue jobs do not automatically adapt to schema changes unless you enable schema evolution in the crawler or use DynamicFrame's `resolveChoice` method. Crawlers can be configured to update the catalog, but jobs may fail if the schema mismatches the script.

Mistake

Glue is cheaper than Amazon EMR for all workloads.

Correct

Glue is cost-effective for intermittent or small-to-medium ETL jobs due to its serverless nature. However, for large, continuous workloads, EMR with reserved instances can be cheaper. Glue pricing per DPU-hour is higher than equivalent EMR cost for long-running jobs.


Frequently Asked Questions

What is the difference between a Glue Crawler and an ETL job?

A Glue Crawler scans data sources to infer schema and populate the Data Catalog with metadata. It does not transform data. An ETL job reads data, applies transformations, and writes to a target. Crawlers are typically run before ETL jobs to ensure the catalog is up-to-date.

How do I process only new files with Glue?

Use job bookmarks. When enabled, Glue tracks which files have been processed in previous runs. On subsequent runs, only new or modified files are read. Glue persists bookmark state for each job, so bookmarks are specific to a job. You must ensure your source supports bookmarks (e.g., S3 with file-level tracking).
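The incremental behavior can be pictured with a simplified model: remember the newest timestamp processed so far, and on the next run pick only objects that are newer. The sketch below is a plain-Python illustration of that idea under stated assumptions (integer timestamps, hypothetical file names); Glue's real bookmark state tracks more than a single timestamp.

```python
def select_new_objects(objects, bookmark):
    """Return keys modified after the bookmark, plus the updated bookmark.

    objects: list of (key, last_modified) tuples
    bookmark: last_modified of the newest object already processed, or None
    """
    new = [(key, ts) for key, ts in objects if bookmark is None or ts > bookmark]
    # If nothing is new, the bookmark stays where it was.
    new_bookmark = max((ts for _, ts in new), default=bookmark)
    return [key for key, _ in new], new_bookmark


# First run: no bookmark yet, so every object is processed.
listing = [("day1.json", 100), ("day2.json", 200)]
keys, bookmark = select_new_objects(listing, None)
print(keys, bookmark)  # ['day1.json', 'day2.json'] 200

# Second run: only the object that arrived after the bookmark is processed.
listing.append(("day3.json", 300))
keys, bookmark = select_new_objects(listing, bookmark)
print(keys, bookmark)  # ['day3.json'] 300
```

This also shows why resetting a bookmark causes a full reprocess: with no stored state, every object looks new.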

Can I use Glue to transform data in Amazon Redshift?

Yes, Glue can read from and write to Redshift using JDBC connections. You can use the `glue_context.write_dynamic_frame.from_jdbc_conf()` method. However, for large-scale data loading, consider using Redshift Spectrum or COPY commands. Glue is best for moderate transformations before loading.

What is the maximum runtime for a Glue ETL job?

The default timeout is 2880 minutes (48 hours). You can configure a custom timeout up to 7 days (10080 minutes). If a job runs longer than the timeout, it will be terminated.

How do I secure data in AWS Glue?

Use IAM roles to control access to sources and targets. Enable encryption at rest using KMS for S3 targets and Data Catalog. Use SSL/TLS for connections. For VPC resources, use Glue connections with security groups. Also, enable CloudWatch logs for auditing.

Is AWS Glue available in all regions?

Glue is available in most AWS commercial Regions, but not all. Check the AWS Regional Services List; some China and GovCloud Regions, for example, did not support it at first. Always verify availability in your target Region.

Can I use custom libraries in Glue jobs?

Yes, you can include Python libraries by packaging them in a .zip file and uploading to S3. Specify the path in the job's 'Python library path' parameter. For Scala, you can include JAR files similarly.
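In a job definition, these library paths are passed as special job parameters. The sketch below assembles them using the `--extra-py-files`, `--extra-jars`, and `--additional-python-modules` parameter names; the S3 paths and module pin are hypothetical, and the helper function itself is only an illustration.

```python
def library_arguments(py_zip_paths=(), jar_paths=(), pip_modules=()):
    """Build job arguments that attach extra Python and JAR dependencies."""
    args = {}
    if py_zip_paths:
        # Comma-separated S3 paths to .zip or .py files.
        args["--extra-py-files"] = ",".join(py_zip_paths)
    if jar_paths:
        # Comma-separated S3 paths to JAR files for Scala or Java code.
        args["--extra-jars"] = ",".join(jar_paths)
    if pip_modules:
        # Modules installed with pip when the job starts.
        args["--additional-python-modules"] = ",".join(pip_modules)
    return args


print(
    library_arguments(
        py_zip_paths=["s3://example-bucket/libs/helpers.zip"],
        pip_modules=["pyarrow==14.0.1"],
    )
)
```

The resulting dict could be merged into a job's `DefaultArguments` so every run picks up the same dependencies.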
