This chapter covers AWS Glue, a fully managed extract, transform, and load (ETL) service that simplifies data preparation and loading for analytics. For the DVA-C02 exam, Glue matters mainly in questions about data processing, storage, and integration with other AWS services. You will learn how Glue automates ETL pipeline creation, its core components (Data Catalog, Crawlers, Jobs, and Triggers), and how to integrate it with services like S3, Redshift, and Athena. Mastery of Glue enables you to design serverless data transformation workflows, a key skill for the Developer Associate certification.
Imagine a large factory that receives raw materials (data) from multiple suppliers (data sources) in various forms—loose parts, boxes, barrels. The factory has a team of skilled workers (ETL scripts) who can sort, clean, assemble, and package the materials into finished products (analytical datasets). However, hiring and managing these workers for every production run is expensive and slow. AWS Glue is like a self-service factory floor where you define the recipe (ETL logic) in a visual or code-based builder, and the factory automatically provisions the workers (Apache Spark executors), sets up the assembly line (cluster), and runs the job on a schedule or on demand. The factory also maintains a catalog (AWS Glue Data Catalog) of all raw material bins (tables) and finished product shelves, so you don’t have to track where everything is. When a new batch arrives, the factory can automatically discover its structure (crawlers) and update the catalog. You pay only for the time the assembly line runs, and you never manage the workers directly. This is exactly how AWS Glue works: you define jobs, crawlers, and a central metadata repository, and Glue handles the underlying Spark infrastructure, scaling, and resource management.
What is AWS Glue?
AWS Glue is a fully managed ETL (extract, transform, load) service that makes it easy to prepare and load data for analytics. You can create and run ETL jobs using a serverless Apache Spark environment, with no clusters to provision or manage. Glue automatically handles scaling, resource allocation, and error handling.
Why AWS Glue Exists
Traditional ETL requires significant infrastructure management: provisioning Spark or Hadoop clusters, tuning performance, handling failures, and managing dependencies. AWS Glue abstracts this complexity, allowing developers to focus on the transformation logic. It provides a unified metadata repository (the Data Catalog) that makes data discoverable across multiple AWS analytics services.
How AWS Glue Works Internally
Data Catalog: Glue stores metadata (table definitions, schema, location) in a central catalog. This catalog is Apache Hive Metastore compatible and is used by Athena, Redshift Spectrum, and EMR. When a crawler runs, it populates the catalog with table definitions.
Crawlers: A crawler connects to a data source (S3, JDBC, DynamoDB), infers the schema, and writes table metadata to the Data Catalog. You can schedule crawlers or run them on demand. The crawler uses classifiers to determine the format (CSV, JSON, Parquet, Avro, etc.) and data types.
Jobs: A Glue job holds the ETL logic. You write code in Python (PySpark) or Scala, or use the visual editor to generate it. Jobs run in a serverless Apache Spark environment. Capacity is measured in Data Processing Units (DPUs); each DPU provides 4 vCPU and 16 GB of memory. Spark jobs need at least 2 DPUs, and on Glue 2.0 and later billing is per second with a 1-minute minimum.
Triggers: Triggers start jobs on a schedule (cron), on demand, or when a condition is met, such as job A succeeding before job B starts. Event-driven starts (e.g., new data in S3) are routed through Amazon EventBridge.
Job Bookmarks: Glue can track processed data using job bookmarks to avoid reprocessing. This is useful for incremental loads.
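The bookmark idea can be illustrated without Glue at all. The sketch below is a plain-Python simulation, not Glue's internal implementation: it keeps a hypothetical bookmark timestamp and returns only the objects modified after it, the way an incremental load skips files it has already seen. The function name, file keys, and timestamps are invented for illustration.

```python
from datetime import datetime, timezone


def select_new_files(files, bookmark=None):
    """Simulate one incremental run.

    files: list of (key, last_modified) tuples.
    bookmark: newest last_modified already processed, or None on the first run.
    Returns (keys_to_process, updated_bookmark).
    """
    new = [(key, modified) for key, modified in files
           if bookmark is None or modified > bookmark]
    if new:
        # Advance the bookmark to the newest file handled in this run.
        bookmark = max(modified for _, modified in new)
    return [key for key, _ in new], bookmark


# Hypothetical S3 listing; keys and dates are made up.
listing = [
    ("logs/a.json", datetime(2025, 1, 1, tzinfo=timezone.utc)),
    ("logs/b.json", datetime(2025, 1, 2, tzinfo=timezone.utc)),
]
first_run, mark = select_new_files(listing)
listing.append(("logs/c.json", datetime(2025, 1, 3, tzinfo=timezone.utc)))
second_run, mark = select_new_files(listing, mark)
third_run, mark = select_new_files(listing, mark)
```

The first run picks up both files, the second only the newly arrived one, and the third nothing, which is the behavior an incremental load relies on.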
Key Components, Values, Defaults, and Timers
Worker Type: G.1X (1 DPU: 4 vCPU, 16 GB memory) and G.2X (2 DPUs: 8 vCPU, 32 GB memory) are the common choices for Spark jobs; larger types such as G.4X and G.8X also exist. Standard is a legacy worker type from early Glue versions, not an alias for G.1X.
Number of Workers: Minimum 2 for Spark jobs. Maximum 299 for G.1X and 149 for G.2X.
Job Timeout: Default 2,880 minutes (48 hours). You can set a timeout up to 7 days (10,080 minutes).
Retry Policy: You can set a maximum number of retries (0-10). Retries occur after a delay (exponential backoff starting at 1 minute).
DPU-Hour Cost: $0.44 per DPU-hour (as of 2025; the rate varies by region). A G.1X worker is 1 DPU, so it costs $0.44 per worker-hour; a G.2X worker is 2 DPUs, so it costs $0.88 per worker-hour.
Crawler: Can schedule from every 5 minutes to monthly. Default schedule is None.
Data Catalog: Supports up to 100,000 tables per account (soft limit, can be increased).
Job Bookmark: Enable or disable per job. When enabled, Glue persists state from each run and uses it to skip data that has already been processed.
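The DPU figures above turn into a cost estimate with a little arithmetic. The following is a minimal sketch that assumes the $0.44 per DPU-hour rate and the 1-minute billing minimum quoted in this chapter; the function and constant names are hypothetical, and actual pricing depends on region and Glue version.

```python
# Rate quoted in this chapter; treat it as an assumption, not a price list.
PRICE_PER_DPU_HOUR = 0.44
DPUS_PER_WORKER = {"G.1X": 1, "G.2X": 2}


def estimate_job_cost(worker_type, workers, runtime_seconds,
                      price=PRICE_PER_DPU_HOUR):
    """Estimate the cost in USD of one job run."""
    billed_seconds = max(runtime_seconds, 60)  # 1-minute minimum
    dpu_hours = DPUS_PER_WORKER[worker_type] * workers * billed_seconds / 3600
    return round(dpu_hours * price, 4)


one_hour_g1x = estimate_job_cost("G.1X", 10, 3600)  # 10 DPUs for 1 hour
short_g2x = estimate_job_cost("G.2X", 5, 30)        # billed as 60 seconds
```

Ten G.1X workers running for a full hour come to 10 DPU-hours, while a 30-second run is still billed as one minute.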
Configuration and Verification Commands
Glue can be managed via AWS Management Console, AWS CLI, or SDK. Key CLI commands:
# List jobs
aws glue list-jobs
# Start a job run
aws glue start-job-run --job-name my-job
# Get job run details
aws glue get-job-run --job-name my-job --run-id <run-id>
# Create a crawler
aws glue create-crawler --name my-crawler --role GlueServiceRole --database-name my-database --targets '{"S3Targets":[{"Path":"s3://my-bucket/data/"}]}'
# Start a crawler
aws glue start-crawler --name my-crawler
How AWS Glue Interacts with Related Technologies
Amazon S3: Common data source and target. Glue can read from S3 in various formats and write back. Best practice: use columnar formats like Parquet or ORC for better performance.
Amazon Redshift: Glue can load data into Redshift over a JDBC connection. Redshift Spectrum, in turn, can use the Glue Data Catalog to query data directly in S3.
Amazon Athena: Uses the Glue Data Catalog as its metastore. When Glue crawlers update the catalog, new tables become immediately queryable in Athena.
AWS Lake Formation: Provides fine-grained access control to Data Catalog tables. Glue jobs can work with Lake Formation to enforce permissions.
Amazon EMR: Can use the same Data Catalog. Glue jobs are an alternative to EMR for ETL.
AWS Lambda: Can be used to trigger Glue jobs via S3 events or custom logic.
Amazon DynamoDB: A crawler can catalog a DynamoDB table, and a Glue job can read the table and write its data to S3.
Performance Considerations
Partitioning: Use partition pruning by organizing data in S3 by date, region, etc. Glue can push down filters to reduce data scanned.
File Sizes: Avoid too many small files. Aim for 128 MB to 1 GB per file to minimize overhead.
DPU Allocation: Increase DPUs for larger datasets, but note that parallelism is limited by the number of partitions. Too many DPUs can cause shuffle overhead.
Job Bookmarks: Enable for incremental processing to avoid reprocessing entire datasets.
Data Format: Use columnar formats (Parquet, ORC) for better compression and faster queries.
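The file-size guidance above can be turned into a target number of output files. This is a rough sketch with a hypothetical helper name; it assumes the 128 MB lower bound mentioned above as the default target and ignores compression ratios, which change the on-disk size.

```python
import math

MIB = 1024 ** 2


def target_file_count(total_bytes, target_file_bytes=128 * MIB):
    """Number of output files needed so each is roughly target_file_bytes."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


ten_gib = target_file_count(10 * 1024 ** 3)  # 10 GiB at 128 MiB per file
tiny = target_file_count(1)                   # never fewer than one file
```

In a Spark job, a number like this is what you would typically pass to `repartition` or `coalesce` before writing.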
Error Handling and Monitoring
CloudWatch Logs: Glue jobs send logs to CloudWatch. You can view them in the console or via CLI.
CloudWatch Metrics: Glue publishes metrics like glue.driver.aggregate.numCompletedStages, glue.driver.aggregate.numFailedTasks.
Job Timeout: If a job runs longer than the timeout, it fails. Set appropriate timeout based on data volume.
Retry: Configure retries for transient failures. Use exponential backoff.
Alarms: Set CloudWatch alarms on job failures or high DPU usage.
Security
IAM Roles: Glue jobs assume an IAM role to access data sources and targets. The role must have permissions to S3, JDBC, etc.
Encryption: Data that jobs write to S3 can be encrypted with SSE-S3 or SSE-KMS through a Glue security configuration, and the Data Catalog can be encrypted with AWS KMS keys. Glue jobs can also use SSL for JDBC connections.
Network Isolation: Glue jobs run in a VPC if you specify a subnet and security group. Use VPC endpoints for S3 to avoid NAT costs.
Data Catalog Permissions: Use Lake Formation or IAM policies to control access to tables and databases.
Cost Optimization
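Use Flex execution: Glue does not expose Spot Instances, but the Flex execution class runs jobs that are not time-sensitive on spare capacity at a reduced DPU-hour rate. Enable it with --execution-class FLEX when you create or start a job, and expect start times and runtimes to vary.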
Right-size DPUs: Start with default and monitor metrics. Over-provisioning wastes money, under-provisioning causes slow jobs.
Job Bookmarks: Avoid reprocessing data.
Schedule Jobs: Run jobs only when needed. Use triggers based on events or cron.
Data Compression: Use compressed formats (gzip, snappy) to reduce data scanned and storage costs.
Create a Crawler to Catalog Data
First, define a crawler in the Glue console or CLI. Specify the data source (e.g., an S3 bucket path like s3://my-bucket/logs/). Choose an IAM role with permissions to read the source and write to the Data Catalog. Select a database where table metadata will be stored. Optionally, configure output settings (e.g., add a prefix to table names). The crawler runs and inspects the data, using classifiers to infer schema (e.g., CSV, JSON, Parquet). It then creates or updates table definitions in the Data Catalog. You can schedule the crawler to run periodically (e.g., every hour) to pick up new data or schema changes. After the first run, you can view the table in the Glue console and query it with Athena immediately.
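The same crawler definition can be expressed as the keyword arguments that boto3's `create_crawler` accepts. The sketch below only builds the request dictionary so that it runs without AWS credentials; the helper name, bucket, role, database, and prefix are hypothetical, and the actual API call is left as a comment.

```python
def build_crawler_request(name, role, database, s3_path,
                          schedule=None, table_prefix=None):
    """Assemble keyword arguments for the Glue CreateCrawler API."""
    request = {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule:
        request["Schedule"] = schedule        # Glue cron syntax
    if table_prefix:
        request["TablePrefix"] = table_prefix
    return request


crawler_request = build_crawler_request(
    name="logs-crawler",                # hypothetical names throughout
    role="GlueServiceRole",
    database="my-database",
    s3_path="s3://my-bucket/logs/",
    schedule="cron(0 * * * ? *)",       # top of every hour
    table_prefix="raw_",
)
# With boto3 installed and credentials configured, this would be sent as:
# boto3.client("glue").create_crawler(**crawler_request)
```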
Define an ETL Job in the Visual Editor
In the Glue console, create a new job using the visual editor. Select a source (e.g., the table from the Data Catalog), then add transformations like 'Filter', 'Join', 'Drop Fields', or 'Aggregate'. The visual editor generates PySpark code automatically. You can also write custom code in the script editor. Set the job properties: worker type (e.g., G.1X or G.2X), number of workers, and timeout. Optionally, enable job bookmarks for incremental processing. Save the job and test it with a small dataset. The visual editor is great for simple transformations, but for complex logic, you may need to write code directly.
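The job properties described above map onto the parameters of the Glue CreateJob API. As a hedged sketch, the helper below builds that request as a dictionary without calling AWS; the helper name, script location, role, and Glue version are assumptions chosen for illustration.

```python
def build_job_request(name, role, script_location, worker_type="G.1X",
                      workers=5, timeout_minutes=120, bookmarks=True):
    """Assemble keyword arguments for the Glue CreateJob API."""
    bookmark_option = "job-bookmark-enable" if bookmarks else "job-bookmark-disable"
    return {
        "Name": name,
        "Role": role,
        "Command": {
            "Name": "glueetl",                 # Spark ETL job
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",                  # assumed version
        "WorkerType": worker_type,
        "NumberOfWorkers": workers,
        "Timeout": timeout_minutes,            # minutes
        "DefaultArguments": {"--job-bookmark-option": bookmark_option},
    }


job_request = build_job_request(
    name="logs-to-parquet",                    # hypothetical names
    role="GlueServiceRole",
    script_location="s3://my-bucket/scripts/logs_to_parquet.py",
)
# With boto3 available: boto3.client("glue").create_job(**job_request)
```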
Set Up a Trigger to Run the Job
Create a trigger in Glue to start the job automatically. You can choose a schedule (cron expression, e.g., every day at 2 AM) or an event-based trigger (e.g., when a new file arrives in S3). For event-based triggers, you send S3 events to Amazon EventBridge and add a rule that targets a Glue workflow whose first trigger is of type EVENT. Alternatively, an S3 event notification or an EventBridge rule (formerly CloudWatch Events) can invoke a Lambda function that starts the job. Once the trigger is created and enabled, the job runs according to the trigger. You can chain jobs with a conditional trigger that starts job B after job A completes successfully.
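The scheduled and chained triggers described here correspond to the SCHEDULED and CONDITIONAL trigger types of the Glue CreateTrigger API. The sketch below builds both requests as plain dictionaries; the helper names, job names, and trigger names are hypothetical, and nothing is sent to AWS.

```python
def scheduled_trigger(name, job_name, cron_fields):
    """Trigger that starts job_name on a Glue cron schedule (UTC)."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_fields})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def chained_trigger(name, upstream_job, downstream_job):
    """Trigger that starts downstream_job once upstream_job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": upstream_job,
                "State": "SUCCEEDED",
            }]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


nightly = scheduled_trigger("nightly-2am", "job-a", "0 2 * * ? *")
chain = chained_trigger("a-then-b", "job-a", "job-b")
# With boto3 available: boto3.client("glue").create_trigger(**nightly)
```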
Monitor the Job Run and Review Logs
After the job runs, go to the 'Job runs' tab in Glue console. You can see status (Running, Succeeded, Failed), start time, duration, and DPU usage. Click on a run to see detailed metrics: number of DPUs used, bytes read/written, and stages. For debugging, click 'View logs' to open CloudWatch Logs. The logs include driver logs, executor logs, and stderr. Look for error messages or stack traces. You can also set up CloudWatch alarms to notify on job failures. If the job fails, check the logs for common issues like schema mismatches, permission errors, or resource exhaustion.
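Outside the console, the same status check is usually a polling loop. The sketch below is one possible shape for it: the state lookup is passed in as a function so the loop can run without AWS, and the set of states treated as terminal is an assumption based on the job run states the Glue API reports. The function name and defaults are hypothetical.

```python
import time

# States treated as final in this sketch.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def wait_for_job_run(get_state, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll get_state() until the run reaches a terminal state."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("job run did not reach a terminal state")


# Simulated run so the loop can be exercised offline.
fake_states = iter(["STARTING", "RUNNING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_job_run(lambda: next(fake_states), sleep=lambda s: None)

# With boto3, get_state could be:
# lambda: glue.get_job_run(JobName="my-job", RunId=run_id)["JobRun"]["JobRunState"]
```

Injecting `get_state` and `sleep` keeps the loop testable; in production you would more likely react to job state change events than poll.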
Optimize Job Performance and Cost
Review the metrics from the job run. If the job is slow, consider increasing the number of workers or using a larger worker type (e.g., G.2X). However, too many workers can cause shuffle overhead. Use partitioning: ensure your source data is partitioned (e.g., by date) and enable partition pruning in your job. Use columnar formats like Parquet and enable compression (snappy). Enable job bookmarks to avoid reprocessing old data. Consider the Flex execution class to reduce cost for jobs that are not time-sensitive. Monitor the DPU-hour cost and set a budget. Finally, use the Spark UI and job run insights in Glue to locate bottlenecks.
Scenario 1: Log Analytics Pipeline
A SaaS company ingests terabytes of application logs daily into S3. They need to transform raw JSON logs into Parquet files partitioned by date and hour for analysis with Athena. They set up a Glue crawler to catalog the raw logs, then a Glue job that reads the JSON, parses fields, converts timestamps, and writes Parquet to a separate S3 bucket with partitioning. The job runs every hour via a scheduled trigger. They use job bookmarks to process only new files. In production, they use 10 workers of type G.1X, costing ~$4.40 per hour. They monitor job duration and set a timeout of 60 minutes. A common mistake is not enabling job bookmarks, causing reprocessing of all data and higher costs. They also discovered that small files (under 1 MB) degrade performance, so they set up a separate job to combine small files before the main ETL.
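Partitioning by date and hour comes down to a predictable S3 key layout. As an illustration, the helper below builds a Hive-style partition prefix from a timestamp; the `year=/month=/day=/hour=` naming is a common convention assumed here, and the function name and bucket are hypothetical.

```python
from datetime import datetime, timezone


def partition_prefix(base, ts):
    """Build a Hive-style date/hour partition prefix under base."""
    return (f"{base.rstrip('/')}/year={ts:%Y}/month={ts:%m}"
            f"/day={ts:%d}/hour={ts:%H}/")


event_time = datetime(2025, 1, 2, 3, 45, tzinfo=timezone.utc)
prefix = partition_prefix("s3://example-bucket/parquet-logs/", event_time)
# prefix == "s3://example-bucket/parquet-logs/year=2025/month=01/day=02/hour=03/"
```

Zero-padded values keep the prefixes sortable, and the `key=value` form lets crawlers and Athena recognize the partition columns.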
Scenario 2: Data Warehouse Loading
A financial services company loads daily transaction data from an on-premises Oracle database to Amazon Redshift for reporting. They use a Glue job with a JDBC connection to Oracle as source and Redshift as target. The job reads the last 24 hours of transactions using a SQL query, transforms the data (e.g., masking credit card numbers), and writes to Redshift. They use a crawler to catalog the Oracle tables (via JDBC) and the Redshift tables (via JDBC). The job runs at 3 AM daily. They set the job timeout to 2 hours. In production, they use 5 workers of type G.1X. A challenge is handling schema changes in the source; they run a crawler before the job to update the catalog. They also use CloudWatch alarms to alert if the job fails. A common issue is network latency between Glue and the on-premises database, mitigated by using a VPC and Direct Connect.
Scenario 3: Real-Time ETL with Streaming
A media company processes clickstream data from millions of users in near real-time. They use AWS Glue Streaming ETL jobs (based on Spark Structured Streaming) to read from Amazon Kinesis Data Streams or Kafka, transform the data (e.g., enrich with user profiles from DynamoDB), and write to S3 in Parquet format. They use a Glue job with streaming enabled and set a checkpoint location in S3 for fault tolerance. They use 10 workers of type G.1X for low latency (under 1 minute). They monitor the lag between Kinesis and Glue using CloudWatch. A common mistake is not setting the checkpoint location correctly, causing data loss on restarts. They also use Glue's built-in transformation for parsing JSON. The job runs continuously, and they use a trigger to restart it automatically if it fails.
DVA-C02 Exam Focus on AWS Glue
The DVA-C02 exam tests your understanding of AWS Glue as a managed ETL service, especially its integration with other analytics services. Glue questions fit most naturally under Domain 1 (Development with AWS Services) and may be framed as 'Data Processing' or 'Integration with AWS Services' scenarios.
Common Wrong Answers and Why
'Glue runs on EC2 instances you manage': Many candidates think Glue is like EMR where you manage clusters. In reality, Glue is serverless; AWS manages the Spark cluster. The wrong answer often mentions 'provisioning EC2 instances'.
'Glue can only process data in S3': Glue supports multiple sources: S3, JDBC (RDS, Redshift, on-prem), DynamoDB, and streaming (Kinesis, Kafka). The exam may present a scenario with a relational database and ask for the best service. Candidates might incorrectly choose Glue only for S3.
'Glue Data Catalog is a separate service you must install': The Data Catalog is a native part of Glue, not a separate installation. Some candidates confuse it with Apache Hive Metastore that requires manual setup.
'Glue jobs must be written in Scala': While Scala is supported, Python (PySpark) is more common and fully supported. The exam does not require a specific language; both are acceptable.
Specific Numbers and Terms
DPU: Data Processing Unit (4 vCPU, 16 GB). Minimum 2 DPUs for Spark jobs.
Worker Types: G.1X (1 DPU) and G.2X (2 DPUs). Standard is a legacy type.
Timeout: Default 2,880 minutes (48 hours), max 10,080 minutes (7 days).
Job Bookmarks: Enable for incremental processing.
Crawler Schedule: Minimum 5 minutes.
Data Catalog: Hive Metastore compatible.
Edge Cases and Exceptions
Streaming Jobs: Glue supports Spark Structured Streaming for near-real-time processing. You must specify a checkpoint location in S3. The exam may ask about checkpointing for fault tolerance.
Job Timeout: If a job runs longer than the timeout, it fails. Set timeout appropriately.
Multiple Crawlers: Two crawlers can write to the same table; the last one overwrites. Use partition indexes for large tables.
Glue with VPC: If your data source is in a VPC (e.g., RDS), you must configure a VPC connection for the Glue job, including subnets and security groups. The job then runs in that VPC, which may require a NAT gateway for internet access.
How to Eliminate Wrong Answers
If a question mentions 'fully managed ETL without cluster management', choose Glue over EMR.
If the question involves 'metadata catalog for Athena', Glue Data Catalog is the answer.
If the scenario requires 'incremental processing', look for 'job bookmarks' in the answer.
For streaming data, Glue Streaming ETL is appropriate; Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) targets real-time analytics on streams rather than Spark-based ETL.
AWS Glue is a fully managed, serverless ETL service that uses Apache Spark under the hood.
The Glue Data Catalog is a central metadata repository compatible with Athena, Redshift Spectrum, and EMR.
Crawlers automatically discover schema from data sources and populate the Data Catalog.
Glue jobs support Python (PySpark) and Scala; the visual editor generates Python code.
Job bookmarks enable incremental processing to avoid reprocessing old data.
Worker types: G.1X (1 DPU) and G.2X (2 DPUs); Standard is a legacy type. Spark jobs need at least 2 workers.
Glue jobs can be triggered on a schedule (cron) or by events (e.g., S3 events routed through EventBridge).
Glue supports streaming ETL from Kinesis and Kafka using Spark Structured Streaming.
Cost is based on DPU-hours; the Flex execution class lowers the rate for jobs that are not time-sensitive.
Common exam trap: Glue is serverless, not provisioned EC2. Data Catalog is built-in, not a separate service.
These come up on the exam all the time. Here's how to tell them apart.
AWS Glue
Fully managed, serverless – no cluster management.
Pay per DPU-hour (minimum 1 minute).
Best for ETL, data catalog, and ad-hoc queries with Athena.
ETL jobs run on Spark (PySpark/Scala); Python shell and Ray job types cover lighter or Python-native workloads.
Automatic scaling within job, but not across multiple jobs.
Amazon EMR
You manage EC2 clusters (or use EMR Serverless).
Pay for the underlying EC2 instances plus an EMR charge (billed per second, 1-minute minimum).
Best for complex big data processing (e.g., machine learning, large-scale transformations).
Supports multiple engines: Spark, Hive, HBase, Presto, etc.
Fine-grained control over cluster configuration and scaling.
AWS Glue Data Catalog
Fully managed – no installation or maintenance.
Integrated with Athena, Redshift Spectrum, EMR, and Glue jobs.
Automatic schema discovery via crawlers.
Scales automatically to handle many tables.
Free tier covers the first million stored objects and the first million requests per month; usage beyond that is billed.
Apache Hive Metastore (self-managed)
You install and manage on EC2 or EMR.
Requires configuration to integrate with other services.
Manual schema definition or custom scripts.
Need to manage scaling and high availability.
Cost of EC2 instances plus maintenance overhead.
Mistake
AWS Glue only works with data in Amazon S3.
Correct
Glue supports multiple data sources: S3, JDBC (RDS, Redshift, Oracle, MySQL, etc.), DynamoDB, and streaming sources like Kinesis and Kafka. The Data Catalog can store metadata from any of these.
Mistake
Glue jobs require you to manually provision and manage Spark clusters.
Correct
Glue is serverless; AWS automatically provisions and manages the Apache Spark cluster based on the number of workers you specify. You never see or manage EC2 instances.
Mistake
The Glue Data Catalog is a separate service that must be installed and configured.
Correct
The Data Catalog is a built-in, fully managed metadata repository within AWS Glue. It is Hive Metastore compatible and requires no installation. You create databases and tables via crawlers or manually.
Mistake
Glue jobs can only be written in Scala.
Correct
Glue supports both Python (PySpark) and Scala. The visual editor generates Python code. Python is more commonly used due to its simplicity.
Mistake
You must use the Glue visual editor to create ETL jobs.
Correct
While the visual editor is available, you can also write custom scripts in Python or Scala using the script editor. The visual editor is optional and best for simple transformations.
Yes, Glue can connect to relational databases via JDBC. You create a connection in Glue with the JDBC URL, username, password, and VPC/subnet/security group if the database is in a VPC. Then, you can use a crawler to catalog tables or a job to read/write data. Glue supports MySQL, PostgreSQL, Oracle, SQL Server, and others. Ensure the IAM role has permissions to access the database and that the security group allows inbound traffic from Glue.
Enable job bookmarks in the job configuration. When bookmarks are enabled, Glue persists state from each run and uses it to skip data that has already been processed. For S3 sources, it uses the last modified timestamp of files. For JDBC sources, it uses one or more columns (e.g., a timestamp or auto-increment column) as bookmark keys to track processed rows; you can name the key (e.g., 'last_updated') in the job script. The state is kept by the Glue service itself, not in the Data Catalog, and persists across job runs. In the script, each source needs a transformation_ctx and the job must call job.commit() for the bookmark to advance.
Glue ETL runs on a batch basis – it processes a finite dataset and then stops. Glue Streaming ETL runs continuously, reading from a streaming source like Amazon Kinesis Data Streams or Apache Kafka. It uses Spark Structured Streaming to process data in micro-batches. Streaming jobs require a checkpoint location in S3 for fault tolerance. They are ideal for near-real-time transformations (latency in seconds to minutes). The same transformations (filter, map, join) can be used, but you must consider state management and watermarking.
Yes, Glue can load data into Redshift using a JDBC connection. You create a connection to Redshift, and in your job script, you use the `glue_context.write_dynamic_frame.from_jdbc_conf` method or the `Redshift` writer. Alternatively, you can write to S3 and then use Redshift Spectrum or COPY command. Glue also supports reading from Redshift. Ensure the IAM role has permissions to access Redshift and that the security group allows traffic.
Glue crawlers can update the Data Catalog when schema changes are detected. Through the crawler's schema change policy you choose whether it updates the table definition in the catalog or only logs the change, and whether tables for deleted objects are removed, deprecated, or left alone. In jobs, you can use the `resolveChoice` transformation to handle schema conflicts (e.g., when a column type changes). For dynamic schemas, DynamicFrames infer the schema at runtime, so the job does not need a fixed schema up front.
Use columnar formats like Parquet with Snappy compression. Partition data by frequently filtered columns (e.g., date, region) and enable partition pruning. Avoid too many small files; use `coalesce` or `repartition` to control output file size. Right-size the number of workers: start with 5-10 and monitor. Use the Flex execution class for cost savings on jobs that are not time-sensitive. Enable job bookmarks for incremental loads. For large joins, use `broadcast` for small tables. Monitor CloudWatch metrics for shuffle spills and skew.
Yes, you can schedule a Glue job with a cron expression that runs every 5 minutes (e.g., `cron(0/5 * * * ? *)`). Five minutes is also the floor: Glue does not support schedules for jobs or crawlers that fire more often than that. Consider the job duration and cost; if the job takes longer than 5 minutes, it may overlap with the next run. For lower latency, use event-based triggers (e.g., an S3 event routed through EventBridge).
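A small helper can generate the every-N-minutes expression and refuse intervals below the 5-minute schedule floor mentioned above. This is a sketch with a hypothetical function name; it assumes Glue's six-field cron layout (minutes, hours, day of month, month, day of week, year).

```python
def every_n_minutes(n):
    """Return a Glue cron schedule that fires every n minutes within each hour."""
    if not 5 <= n <= 59:
        raise ValueError("interval must be between 5 and 59 minutes")
    # Fields: minutes hours day-of-month month day-of-week year
    return f"cron(0/{n} * * * ? *)"


five = every_n_minutes(5)        # "cron(0/5 * * * ? *)"
quarter = every_n_minutes(15)    # "cron(0/15 * * * ? *)"
```

Intervals that do not divide 60 evenly restart at minute 0 of each hour, so the last gap in the hour is shorter than the rest.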