This chapter covers AWS Glue, a fully managed extract, transform, and load (ETL) service that prepares and transforms data for analytics. For the CLF-C02 exam, this objective falls under Domain 3: Cloud Technology and Services, which represents 34% of the scored content. Understanding Glue is essential because it is the primary ETL service in AWS and is frequently tested alongside services like Amazon Athena, Amazon Redshift, and AWS Lake Formation. You will learn what Glue does, how it works, its key components, and how it compares to alternatives, along with exam-specific traps and best practices.
Imagine you run a large restaurant chain that receives daily ingredient shipments from dozens of farms and suppliers. Each shipment arrives in different containers: some are crates of vegetables, others are boxes of meat, and some are sealed barrels of sauces. To prepare meals, your kitchen needs all ingredients in a consistent format—chopped, measured, and labeled—so that any chef can quickly assemble a dish. But manually unpacking, cleaning, cutting, and portioning each shipment takes hours and is prone to errors. An automated food processor system can receive these varied inputs, identify each item using barcodes and weight sensors, then chop, slice, and package them into standardized portions ready for the line cooks. It also keeps a log of every transformation, so you can trace any ingredient back to its source. AWS Glue works exactly like this automated kitchen appliance: it takes raw data from multiple sources (databases, data lakes, streams), discovers its structure using a data catalog (like barcode scanning), transforms it into a clean, queryable format (chopping and portioning), and loads it into a destination (like a data warehouse) for analysis. The system is serverless—you don't need to provision or manage any infrastructure—and you only pay for the resources consumed during the transformation jobs, just as you'd pay for the electricity used by the food processor only when it's running.
What is AWS Glue and What Problem Does It Solve?
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning, and application development. In the modern data landscape, organizations collect vast amounts of raw data from various sources—relational databases, NoSQL databases, streaming platforms like Amazon Kinesis, and data lakes on Amazon S3. This data is often in different formats (CSV, JSON, Parquet, Avro, etc.) and schemas, making it difficult to query or analyze directly. Traditionally, ETL processes required provisioning and managing servers, writing custom transformation code, and handling data cataloging manually. Glue automates much of this heavy lifting by providing a serverless environment that runs Spark-based ETL jobs, a central metadata repository called the AWS Glue Data Catalog, and a built-in schema discovery tool called the Glue Crawler. The exam tests your understanding of Glue as a fully managed, serverless ETL service that reduces the operational overhead of data preparation.
How AWS Glue Works: The Mechanism
AWS Glue operates through four core components: the Data Catalog, Crawlers, ETL Jobs, and Triggers. The Data Catalog is a persistent metadata store that contains table definitions, schema information, and location pointers for data sources and targets. Each table entry includes the schema (column names, data types), partition structure, and the underlying data location (e.g., an S3 path). Crawlers automatically scan data sources, infer schemas, and populate the Data Catalog. They connect to source systems (S3, JDBC databases, DynamoDB, etc.) and use classifiers to determine the format and schema. You can schedule crawlers to run periodically to keep the catalog up-to-date. ETL jobs are Apache Spark applications that you author using the AWS Glue console, API, or an interactive notebook. Glue generates Scala or Python code (PySpark) that you can customize. Jobs read from sources defined in the Data Catalog, apply transformations like filtering, joining, aggregating, or converting file formats, and then write to targets (S3, Redshift, RDS, etc.). Triggers can start jobs based on a schedule, on-demand, or as part of a job dependency chain. Behind the scenes, Glue provisions a temporary Apache Spark cluster, runs the job, and then terminates the cluster. You pay only for the duration of the job (per second, with a 1-minute minimum). The exam expects you to know these components and their roles.
Key Tiers, Configurations, and Pricing Models
Glue offers two main types of ETL jobs: Spark jobs (standard) and Python shell jobs (lightweight). Spark jobs are used for heavy data transformations and can scale across multiple nodes. Python shell jobs run simple scripts without Spark overhead. Glue also supports streaming ETL jobs that process data in near real-time from Amazon Kinesis or Apache Kafka. For the Data Catalog, the first million stored objects (tables, partitions, and so on) and the first million requests each month are free; beyond that, you pay about $1 per 100,000 objects stored per month and about $1 per million requests. Crawlers are billed separately by DPU-hour. ETL job pricing is based on the number of Data Processing Units (DPUs) used per hour. A DPU is a relative measure of processing power; one DPU provides 4 vCPU and 16 GB of memory. Both Spark jobs and Python shell jobs cost about $0.44 per DPU-hour, billed per second; the difference is capacity, since a Python shell job runs on 1 DPU or a fraction of one (0.0625 DPU) while a Spark job uses multiple DPUs. Glue also offers a Flex execution option for jobs that can tolerate delayed starts and preemption, at a lower rate (about $0.29 per DPU-hour, a discount of roughly a third). Rates vary by Region, so treat these figures as approximate. The exam may ask about DPU pricing or the difference between Spark and Python shell jobs.
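To make the DPU-hour arithmetic concrete, here is a minimal Python sketch. It assumes the $0.44 per DPU-hour rate quoted in this chapter and the per-second billing with a 1-minute minimum described earlier; the function name `glue_job_cost` is illustrative, and actual rates depend on Region and job type.

```python
from decimal import Decimal, ROUND_HALF_UP

# Rate quoted in this chapter; real rates vary by Region and change over time.
DPU_HOUR_RATE_USD = Decimal("0.44")
MIN_BILLED_SECONDS = 60  # 1-minute minimum, billed per second after that


def glue_job_cost(dpus, runtime_seconds, rate=DPU_HOUR_RATE_USD):
    """Estimate the cost of one job run: DPUs x billed hours x rate."""
    billed_seconds = max(int(runtime_seconds), MIN_BILLED_SECONDS)
    dpu_hours = Decimal(str(dpus)) * Decimal(billed_seconds) / Decimal(3600)
    return (dpu_hours * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    print(glue_job_cost(10, 2 * 3600))  # 10 DPUs for 2 hours
    print(glue_job_cost(2, 20))         # short run, billed as 60 seconds
```

The same function covers the scenario estimates later in the chapter: multiply DPUs by hours by the rate, and remember that very short runs are still billed for the minimum duration.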
Comparison to On-Premises or Competing Approaches
Before Glue, organizations would typically set up Apache Spark or Hadoop clusters on-premises or on EC2, install and configure ETL tools like Apache NiFi or Talend, and manually manage metadata using Hive Metastore or custom databases. This approach required dedicated teams to maintain infrastructure, scale clusters, and handle failures. Glue eliminates this by being serverless—no servers to provision, patch, or scale. It also integrates deeply with other AWS services like Amazon S3, Redshift, Athena, and Lake Formation. Compared to AWS Data Pipeline (which is also an ETL service but more code-based and less automated), Glue offers built-in schema discovery and a managed Spark environment. Compared to Amazon EMR (which provides full control over Spark and Hadoop clusters), Glue is simpler and more automated but offers less customization. The exam often asks you to choose Glue for serverless ETL with automated schema discovery, while EMR is for custom big data processing requiring fine-grained control.
When to Use Glue vs Alternatives
Use AWS Glue when you need a fully managed, serverless ETL service that automatically discovers schemas, handles data cataloging, and integrates with the AWS analytics ecosystem. It is ideal for one-time data migration, periodic batch processing, and building data lakes. Do not use Glue if you need real-time sub-second processing (use Amazon Managed Service for Apache Flink, formerly Amazon Kinesis Data Analytics, or Apache Flink on EMR), if you require custom Spark configurations or specific libraries not available in Glue (use EMR), or if you want a simple SQL-based transformation without writing Spark code (use Athena or Redshift Spectrum). The exam will present scenarios where you need to choose between Glue, Athena, EMR, and Data Pipeline. The key differentiator is the need for automated schema discovery and serverless Spark-based ETL; that combination points to Glue.
Set Up a Data Catalog
First, you set up the AWS Glue Data Catalog in your AWS account. This is a central repository of metadata that describes your data sources. Each account has one Data Catalog per Region, so there is nothing to provision; you create databases inside it to group related tables. The catalog stores table definitions, schemas, and partition information. It is integrated with Amazon Athena and Amazon Redshift Spectrum, allowing these services to query data directly using the catalog. To populate the catalog, you either manually define tables or run a crawler. The Data Catalog is a regional resource, and you can share it across accounts using resource policies or AWS Lake Formation, which relies on AWS Resource Access Manager. On the exam, remember that the Data Catalog is a managed, Apache Hive Metastore-compatible repository that other services such as Athena and Redshift Spectrum can use.
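As a rough illustration of what a catalog entry holds, the sketch below flattens one table definition into its schema, partition keys, and data location. The input dictionary is shaped like the `Table` element returned by the Glue `get_table` API; the database, table, and bucket names are hypothetical, and `summarize_catalog_table` is an illustrative helper, not part of any AWS library.

```python
def summarize_catalog_table(table):
    """Pull out the schema, partition keys, and data location of one table."""
    storage = table.get("StorageDescriptor", {})
    return {
        "name": f"{table['DatabaseName']}.{table['Name']}",
        "location": storage.get("Location"),
        "columns": {col["Name"]: col["Type"] for col in storage.get("Columns", [])},
        "partition_keys": [key["Name"] for key in table.get("PartitionKeys", [])],
    }


# Hypothetical table entry; database, table, and bucket names are made up.
SAMPLE_TABLE = {
    "Name": "sales",
    "DatabaseName": "retail_raw",
    "StorageDescriptor": {
        "Location": "s3://example-bucket/sales/",
        "Columns": [
            {"Name": "order_id", "Type": "bigint"},
            {"Name": "amount", "Type": "double"},
        ],
    },
    "PartitionKeys": [{"Name": "sale_date", "Type": "string"}],
}

if __name__ == "__main__":
    print(summarize_catalog_table(SAMPLE_TABLE))
```

Note that the catalog stores only metadata: the rows themselves stay in the location the entry points to.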
Create and Run a Crawler
A crawler connects to a data source (e.g., an S3 bucket with CSV files or a JDBC connection to an RDS database) and scans the data to infer its schema. You define a crawler by specifying the data source, an IAM role for permissions, and a target database in the Data Catalog. The crawler uses built-in classifiers or custom classifiers to determine the format (e.g., JSON, Parquet, Avro) and extracts column names, data types, and partition keys. After the crawl, the crawler creates or updates table definitions in the Data Catalog. You can schedule crawlers to run periodically (e.g., hourly, daily) to keep the catalog in sync with changes in the source data. On the exam, know that crawlers populate the Data Catalog automatically and can handle schema evolution by adding new columns or partitions.
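The crawler definition described above can be sketched as a request for the Glue `create_crawler` API. This is a minimal example under stated assumptions: the crawler name, role ARN, database, bucket path, and schedule are placeholders, and the Glue client (for example, one created with `boto3.client("glue")`) is passed in by the caller so the snippet itself has no AWS dependency.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Assemble keyword arguments for the Glue create_crawler API call."""
    request = {
        "Name": name,
        "Role": role_arn,                # IAM role the crawler assumes
        "DatabaseName": database,        # target database in the Data Catalog
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # On schema changes, update the table in place; only log deletions.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }
    if schedule:
        request["Schedule"] = schedule   # e.g. "cron(0 2 * * ? *)" for 02:00 UTC daily
    return request


def create_and_start_crawler(glue_client, request):
    """Register the crawler, then kick off its first run."""
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])
```

Passing the client in keeps the request-building logic easy to inspect and test without credentials; in a real account the role must grant read access to the source and write access to the catalog.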
Write an ETL Job Script
Glue provides a script editor and interactive notebooks (Zeppelin or Jupyter) for writing ETL logic. You can either use the auto-generated script (created by Glue when you point to a source and target) or write custom PySpark or Scala code. The script typically reads from dynamic frames (a Glue abstraction over Spark DataFrames that handles schema evolution and errors), applies transformations like mapping, filtering, joining, and then writes to the target. Glue also provides built-in transforms like DropFields, RenameField, and ResolveChoice. You can also use the Glue ETL library for common operations. The script runs on a serverless Apache Spark environment. You configure the number of DPUs, worker type (Standard, G.1X, G.2X), and job timeout. For simple scripts, you can use Python shell jobs (no Spark) with a maximum of 1 DPU. The exam may test your ability to identify the correct script type: Spark for large data, Python shell for lightweight tasks.
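The choice between a Spark job and a Python shell job shows up directly in the job definition. The sketch below builds keyword arguments for the Glue `create_job` API under a few assumptions: the Glue version ("4.0"), Python versions, default worker count, and timeout are illustrative values, and the job name, role, and script path are supplied by the caller.

```python
def build_job_request(name, role_arn, script_s3_path, lightweight=False,
                      workers=10, worker_type="G.1X", timeout_minutes=120):
    """Assemble keyword arguments for the Glue create_job API call."""
    request = {"Name": name, "Role": role_arn, "Timeout": timeout_minutes}
    if lightweight:
        # Python shell job: no Spark, capped at 1 DPU.
        request["Command"] = {
            "Name": "pythonshell",
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3.9",  # assumed version
        }
        request["MaxCapacity"] = 1.0
    else:
        # Spark job: capacity is set through worker type and worker count.
        request["Command"] = {
            "Name": "glueetl",
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        }
        request["GlueVersion"] = "4.0"  # assumed version
        request["WorkerType"] = worker_type
        request["NumberOfWorkers"] = workers
    return request
```

A caller would pass the result to a Glue client as `glue_client.create_job(**request)`. Keeping the two branches separate reflects that a job sets either a maximum capacity or a worker type and count, not both.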
Configure Job Parameters and Triggers
Before running the job, you set parameters such as the job name, IAM role, script location (S3 path), temporary directory, and advanced options like job bookmarks and retry policy. Job bookmarks help Glue track processed data so that subsequent runs only process new or changed data—this is crucial for incremental processing. You also define triggers: on-demand, scheduled (using cron expressions), or job completion events (to chain jobs). For example, you can trigger a job after a crawler finishes or after another job succeeds. Glue also supports workflows that combine multiple jobs and triggers into a directed acyclic graph (DAG). On the exam, remember that job bookmarks enable incremental ETL, which is a key feature for cost optimization and efficiency.
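The idea behind job bookmarks can be modeled in a few lines of plain Python. This is a conceptual sketch, not Glue's implementation: it assumes each source object carries a last-modified timestamp and that the bookmark is simply the newest timestamp already processed. In a real job, bookmarks are switched on with the job argument `--job-bookmark-option` set to `job-bookmark-enable`.

```python
from datetime import datetime


def run_incremental(objects, bookmark=None):
    """Return keys of objects newer than the bookmark, plus the updated bookmark."""
    pending = [o for o in objects if bookmark is None or o["last_modified"] > bookmark]
    new_bookmark = max((o["last_modified"] for o in pending), default=bookmark)
    return [o["key"] for o in pending], new_bookmark


if __name__ == "__main__":
    # Hypothetical source files; names and dates are made up.
    files = [
        {"key": "sales/2024-01-01.csv", "last_modified": datetime(2024, 1, 1)},
        {"key": "sales/2024-01-02.csv", "last_modified": datetime(2024, 1, 2)},
    ]
    processed, mark = run_incremental(files)        # first run: everything
    files.append({"key": "sales/2024-01-03.csv", "last_modified": datetime(2024, 1, 3)})
    processed, mark = run_incremental(files, mark)  # second run: only the new file
    print(processed)
```

Without the bookmark, every run would return the full list, which is the reprocessing cost the text warns about.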
Monitor and Debug the Job
After the job runs, you monitor its progress in the AWS Glue console, CloudWatch Logs, and CloudWatch Metrics. Glue writes driver and executor logs to CloudWatch Logs. You can view job run metrics like DPU usage, elapsed time, and bytes read/written. If the job fails, you can access the logs to debug errors. Glue also provides job run insights and recommendations for optimization. To be notified when a job fails or times out, you can route job state-change events through Amazon EventBridge (formerly CloudWatch Events). The exam may ask about monitoring ETL jobs using CloudWatch or about interpreting job run metrics. A common trap is thinking that Glue jobs are interactive. They are not: they run as batch processes, though you can use Glue interactive sessions for development.
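Beyond the console, job run history can be inspected programmatically. The sketch below summarizes a list of run records shaped like the `JobRuns` entries returned by the Glue `get_job_runs` API, using the `Id`, `JobRunState`, `ExecutionTime` (seconds), and `ErrorMessage` fields. The sample records are hypothetical, and `summarize_job_runs` is an illustrative helper.

```python
FAILURE_STATES = {"FAILED", "TIMEOUT", "ERROR"}


def summarize_job_runs(job_runs):
    """Count runs by state, total the execution time, and collect failures."""
    summary = {"by_state": {}, "total_seconds": 0, "failures": []}
    for run in job_runs:
        state = run["JobRunState"]
        summary["by_state"][state] = summary["by_state"].get(state, 0) + 1
        summary["total_seconds"] += run.get("ExecutionTime", 0)
        if state in FAILURE_STATES:
            summary["failures"].append((run["Id"], run.get("ErrorMessage", "")))
    return summary


# Hypothetical runs; the IDs and the error message are made up.
SAMPLE_RUNS = [
    {"Id": "jr_001", "JobRunState": "SUCCEEDED", "ExecutionTime": 7200},
    {"Id": "jr_002", "JobRunState": "FAILED", "ExecutionTime": 300,
     "ErrorMessage": "AccessDenied on target bucket"},
]

if __name__ == "__main__":
    print(summarize_job_runs(SAMPLE_RUNS))
```

A summary like this is a starting point for debugging; the detailed driver and executor output still lives in CloudWatch Logs.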
Scenario 1: Building a Data Lake for a Retail Company
A large retail chain wants to consolidate sales data from thousands of stores, each using different point-of-sale (POS) systems that export data in various formats (CSV, JSON, XML) to an S3 bucket. The data must be cleaned, transformed, and loaded into a central data lake for analytics. The team uses AWS Glue crawlers to automatically discover the schema of incoming files and populate the Data Catalog. They then create a Glue ETL job that runs daily to transform the raw data into a Parquet format partitioned by store and date. The job also removes duplicates, standardizes currency codes, and enriches the data with product information from an RDS database. The transformed data lands in an S3 bucket that is queried by Amazon Athena and visualized in Amazon QuickSight. Cost is a concern: the job uses 10 DPUs and runs for 2 hours daily, costing approximately $0.44 * 10 * 2 = $8.80 per day, plus Data Catalog storage fees. If the team misconfigures the job by not enabling job bookmarks, the job reprocesses all historical data every day, increasing costs and time. They also learn to set up a CloudWatch alarm to alert if the job fails. This scenario highlights Glue's ability to handle diverse data formats and automate schema discovery, reducing manual effort.
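Partitioning by store and date usually means writing files under Hive-style `key=value` prefixes, which lets query engines skip partitions that a query does not touch. The sketch below shows the layout in plain Python; the bucket name and store IDs are hypothetical, and the helper names are illustrative.

```python
from datetime import date


def partition_prefix(base, store_id, sale_date):
    """Build a Hive-style partition prefix such as .../store=0042/date=2024-01-15/."""
    return f"{base.rstrip('/')}/store={store_id}/date={sale_date.isoformat()}/"


def prefixes_for_store(prefixes, store_id):
    """Keep only the prefixes a query filtered on one store would need to scan."""
    marker = f"/store={store_id}/"
    return [p for p in prefixes if marker in p]


if __name__ == "__main__":
    base = "s3://example-lake/sales"  # hypothetical bucket
    layout = [
        partition_prefix(base, "0042", date(2024, 1, 15)),
        partition_prefix(base, "0042", date(2024, 1, 16)),
        partition_prefix(base, "0107", date(2024, 1, 15)),
    ]
    print(prefixes_for_store(layout, "0042"))
```

With this layout, a query restricted to one store and one day reads a single prefix instead of the whole dataset, which matters for services that charge by data scanned.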
Scenario 2: Streaming ETL for a Financial Services Firm
A fintech company ingests real-time stock trade data from Amazon Kinesis Data Streams. They need to aggregate trades per minute, detect anomalies, and store results in Amazon Redshift for reporting. They use AWS Glue streaming ETL jobs, which consume data from Kinesis, apply transformations using Spark Structured Streaming, and write to Redshift. The streaming job runs continuously, processing data as it arrives. The team uses the Glue Data Catalog to define the schema of the streaming source (Kinesis stream) and the target (Redshift table). They configure checkpointing to S3 to enable fault tolerance. The job is set to use 5 DPUs, costing $0.44 per DPU-hour continuously (approximately $0.44 * 5 * 24 = $52.80 per day). A common mistake is not setting up proper checkpointing, which leads to data reprocessing on restart. The exam may test that Glue supports streaming ETL, but it is a less common objective; still, you should know it exists.
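The per-minute aggregation in this scenario is a tumbling-window computation. The following plain-Python sketch shows the logic only; it is not Spark Structured Streaming code, and the trade fields (`symbol`, `timestamp`, `quantity`) are assumed for illustration. A real Glue streaming job would express the same grouping with Spark's windowing functions.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_per_minute(trades):
    """Group trades into one-minute tumbling windows per symbol."""
    windows = defaultdict(lambda: {"count": 0, "volume": 0})
    for trade in trades:
        # Truncate the timestamp to the start of its minute.
        minute = trade["timestamp"].replace(second=0, microsecond=0)
        window = windows[(trade["symbol"], minute)]
        window["count"] += 1
        window["volume"] += trade["quantity"]
    return dict(windows)


if __name__ == "__main__":
    # Hypothetical trades; symbols and quantities are made up.
    sample = [
        {"symbol": "ABC", "timestamp": datetime(2024, 1, 2, 9, 30, 5), "quantity": 100},
        {"symbol": "ABC", "timestamp": datetime(2024, 1, 2, 9, 30, 40), "quantity": 50},
        {"symbol": "ABC", "timestamp": datetime(2024, 1, 2, 9, 31, 1), "quantity": 10},
    ]
    for key, stats in aggregate_per_minute(sample).items():
        print(key, stats)
```

Each trade falls into exactly one window, which is what distinguishes tumbling windows from sliding ones.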
Scenario 3: Data Migration from On-Premises to AWS
A healthcare organization needs to migrate historical patient records from an on-premises Oracle database to Amazon S3 in Parquet format for analytics. They use a Glue ETL job with a JDBC connection to the Oracle database. The job reads the entire table, converts the data to Parquet, and writes to S3 partitioned by year and month. Because the data is sensitive, they encrypt the S3 bucket with AWS KMS and use an IAM role with least privilege. The job runs once and uses 20 DPUs for 3 hours, costing $0.44 * 20 * 3 = $26.40. The team must ensure the JDBC connection is set up correctly with proper network access to the on-premises database (for example, over AWS Site-to-Site VPN or AWS Direct Connect). A misconfiguration occurs when the security group attached to the Glue connection does not allow outbound traffic to the database, causing the job to hang. This scenario demonstrates Glue's capability for one-time data migration and the importance of networking and IAM permissions.
What CLF-C02 Tests on AWS Glue
The CLF-C02 exam covers AWS Glue under Domain 3: Cloud Technology and Services, specifically Task Statement 3.7: 'Identify AWS artificial intelligence and machine learning (AI/ML) services and analytics services.' You must know that Glue is a fully managed ETL service, not a data warehouse or query service. The exam will ask you to differentiate Glue from Amazon Athena (serverless query service), Amazon Redshift (data warehouse), and Amazon EMR (big data platform). Expect questions like: 'Which service automatically discovers schema and transforms data for analytics?' The answer is AWS Glue. Also know that Glue uses Apache Spark under the hood and is serverless.
Common Wrong Answers and Why Candidates Choose Them
Choosing Amazon Athena instead of Glue for ETL: Athena is a query service, not an ETL service. Candidates confuse it because Athena can also read from S3 and return results. However, Athena is built for querying data in place, not for running managed ETL pipelines. If the scenario mentions 'transform' or 'ETL,' the answer is Glue.
Choosing AWS Data Pipeline: Data Pipeline is an older ETL service that is not serverless and requires managing EC2 instances. Candidates pick it because it also does ETL. But the exam emphasizes 'serverless' and 'fully managed'—Glue is the correct choice when those terms appear.
Choosing Amazon Redshift: Redshift is a data warehouse, not an ETL tool. Candidates mistakenly think Redshift is used for transformation because it can load data. However, Redshift's COPY command loads data but does not perform complex transformations. Glue is used before loading into Redshift.
Specific AWS Service Names, Values, and Terms
AWS Glue Data Catalog: The metadata repository that stores table definitions. It is integrated with Athena and Redshift Spectrum.
Crawler: The component that scans data sources and populates the Data Catalog.
ETL Job: The Spark application that transforms data.
DynamicFrame: Glue's abstraction over Spark DataFrame, with schema evolution support.
DPU (Data Processing Unit): Unit of processing capacity (4 vCPU, 16 GB memory). Standard Spark jobs cost $0.44 per DPU-hour.
Job bookmark: Feature to track processed data for incremental processing.
Python shell job: Lightweight job for simple scripts, not using Spark.
Tricky Distinctions
Glue vs. Athena: Athena queries data in place; Glue transforms and moves data. If the question says 'ETL' or 'transform,' choose Glue. If it says 'query directly from S3,' choose Athena.
Glue vs. EMR: EMR provides full control over Spark/Hadoop clusters; Glue is serverless and simpler. If the scenario requires custom Spark configurations, choose EMR. If it emphasizes 'automated' and 'serverless,' choose Glue.
Glue vs. Lake Formation: Lake Formation builds and manages data lakes, including permissions and cataloging. It uses Glue under the hood for ETL. On the exam, Lake Formation is for data lake governance, not ETL.
Decision Rule for Multiple-Choice Questions
When you see a question about moving or transforming data between sources, apply this elimination strategy: 1. Does the scenario call for a serverless service? If yes, eliminate Data Pipeline and EMR. 2. Does the scenario require transforming data? If yes, eliminate Athena and Redshift. 3. Does the scenario require automatic schema discovery? If yes, eliminate all but Glue. If the question mentions 'schema discovery' or 'crawler,' the answer is almost certainly AWS Glue.
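The elimination strategy can be written down as a small lookup. The trait table below mirrors this chapter's simplified exam heuristics rather than a complete feature matrix of each service, and the function name `shortlist` is illustrative.

```python
# Simplified traits that follow the chapter's elimination rule, not full feature lists.
SERVICE_TRAITS = {
    "AWS Glue":          {"serverless": True,  "transforms": True,  "schema_discovery": True},
    "Amazon Athena":     {"serverless": True,  "transforms": False, "schema_discovery": False},
    "Amazon Redshift":   {"serverless": True,  "transforms": False, "schema_discovery": False},
    "Amazon EMR":        {"serverless": False, "transforms": True,  "schema_discovery": False},
    "AWS Data Pipeline": {"serverless": False, "transforms": True,  "schema_discovery": False},
}


def shortlist(serverless=False, transforms=False, schema_discovery=False):
    """Return the services that satisfy every requirement the scenario states."""
    required = {
        "serverless": serverless,
        "transforms": transforms,
        "schema_discovery": schema_discovery,
    }
    return sorted(
        name
        for name, traits in SERVICE_TRAITS.items()
        if all(traits[key] for key, needed in required.items() if needed)
    )


if __name__ == "__main__":
    print(shortlist(serverless=True))                   # step 1 only
    print(shortlist(serverless=True, transforms=True))  # steps 1 and 2
```

Each requirement removes the services that lack that trait, so stacking the three questions narrows the options the same way the numbered steps do.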
AWS Glue is a fully managed, serverless ETL service that runs Apache Spark jobs.
The AWS Glue Data Catalog is a central metadata repository compatible with Athena and Redshift Spectrum.
Crawlers automatically discover schemas and populate the Data Catalog from various sources.
Glue ETL jobs can be Spark jobs (standard) or Python shell jobs (lightweight, no Spark).
Job bookmarks enable incremental processing to avoid reprocessing all data.
Glue supports streaming ETL from Amazon Kinesis and Apache Kafka (less common on exam).
Pricing: Data Catalog storage (first million objects free, then about $1 per 100,000 objects per month) and ETL jobs (about $0.44 per DPU-hour).
Common exam trap: confusing Glue with Athena (query) or Redshift (data warehouse).
Glue is often used to build data lakes by transforming raw data into optimized formats like Parquet.
For CLF-C02, remember Glue as the answer when the question mentions 'ETL,' 'transform,' or 'schema discovery.'
These come up on the exam all the time. Here's how to tell them apart.
AWS Glue
Purpose: ETL (extract, transform, load) – transforms and moves data.
Serverless, fully managed Spark environment.
Output: transformed data stored in a target (S3, Redshift, etc.).
Key feature: schema discovery via Crawlers and Data Catalog.
Pricing: per DPU-hour for jobs; Data Catalog storage fees.
Amazon Athena
Purpose: Serverless interactive query service – queries data in place.
Not an ETL service; it reads data in place and returns results.
Output: query results (can be saved to S3 but not transformed).
Key feature: directly queries data in S3 using standard SQL.
Pricing: per TB of data scanned; no compute provisioning.
AWS Glue
Fully managed and serverless – no cluster management.
Automated schema discovery with Crawlers.
Limited customization – uses Glue's Spark environment.
Best for simple to moderate ETL with integrated AWS services.
Cost: per DPU-hour; no long-term cluster costs.
Amazon EMR
Provides full control over cluster configuration (EC2 instances, software).
No built-in schema discovery; you define everything manually.
Highly customizable – you can install any libraries or frameworks.
Best for complex big data processing requiring fine-grained tuning.
Cost: per EC2 instance hour; you pay for running clusters even when idle.
Mistake
AWS Glue is a data warehouse like Amazon Redshift.
Correct
Glue is an ETL service that prepares data; it does not store or query data itself. Redshift is a data warehouse that stores and queries data. Glue often feeds data into Redshift.
Mistake
Glue requires you to write complex Spark code from scratch.
Correct
Glue generates boilerplate code automatically based on your source and target selections. You can customize it, but you can also use the visual editor or notebook interface.
Mistake
Glue can only process batch data, not streaming data.
Correct
Glue supports streaming ETL jobs that consume data from Amazon Kinesis and Apache Kafka. However, batch processing is more common on the exam.
Mistake
Glue crawlers only work with data in Amazon S3.
Correct
Crawlers can connect to many sources: S3, JDBC databases (RDS, Redshift, on-premises), DynamoDB, and more, including MongoDB and Amazon DocumentDB.
Mistake
AWS Glue is free to use.
Correct
Glue has costs: Data Catalog storage and requests, ETL job DPU hours, and crawler DPU hours. The Data Catalog includes a free allowance (the first million objects stored and the first million requests each month), but ETL jobs and crawlers are not free.
What is the difference between AWS Glue and Amazon Athena?
AWS Glue is an ETL (extract, transform, load) service that transforms and moves data between sources. It runs Apache Spark jobs to clean, enrich, and convert data formats. Amazon Athena is a serverless query service that allows you to run SQL queries directly on data stored in Amazon S3 without transforming it. Athena does not perform ETL; it only reads and returns results. Use Glue when you need to transform data before analysis; use Athena when you want to query data in place. On the exam, if the scenario mentions 'transform' or 'ETL,' choose Glue.
Does AWS Glue support incremental data processing?
Yes, AWS Glue supports incremental processing through a feature called job bookmarks. When enabled, Glue tracks the data that has already been processed in previous job runs. On subsequent runs, it only processes new or changed data since the last successful run. This is crucial for cost efficiency and performance, especially when dealing with large datasets that grow over time. Job bookmarks work with Amazon S3 and JDBC sources. On the exam, remember that job bookmarks enable incremental ETL, a key differentiator from naive full-load approaches.
What is a Glue Crawler?
A Glue Crawler is a component that connects to a data source (e.g., S3 bucket, JDBC database, DynamoDB table), scans the data, infers its schema (column names, data types, partitions), and then populates the AWS Glue Data Catalog with table definitions. Crawlers can be scheduled to run periodically to keep the catalog updated as new data arrives. They use classifiers to determine the format (CSV, JSON, Parquet, etc.). Crawlers are essential for automating metadata management. On the exam, if you see 'automatically discover schema,' think of Glue Crawler.
How is AWS Glue priced?
AWS Glue pricing has two main components: the Data Catalog and compute for jobs and crawlers. For the Data Catalog, the first million objects stored (tables, partitions, etc.) and the first million requests each month are free; above that, storage costs about $1 per 100,000 objects per month and requests cost about $1 per million. ETL jobs are billed by DPU-hour, per second: both Spark jobs and Python shell jobs cost about $0.44 per DPU-hour (a DPU provides 4 vCPU and 16 GB memory), but a Python shell job uses at most 1 DPU, so it is much cheaper to run. Crawlers also incur DPU-hour costs at about $0.44 per DPU-hour. There is no upfront cost. On the exam, know that DPU is the billing unit for Glue jobs.
Does AWS Glue support streaming data?
Yes, AWS Glue supports streaming ETL jobs that can consume data from Amazon Kinesis Data Streams and Apache Kafka (including Amazon MSK). These jobs use Spark Structured Streaming to process data in near real-time. However, streaming ETL is a more advanced feature and is less commonly tested on the CLF-C02 exam. The exam focuses more on batch ETL. If a scenario mentions 'real-time' or 'streaming,' Glue can be an option, but services like Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) or EMR with Flink might be more appropriate for sub-second latency.
What is the AWS Glue Data Catalog?
The AWS Glue Data Catalog is a fully managed metadata repository that stores table definitions, schema information, and location pointers for data sources and targets. It is compatible with Apache Hive Metastore, meaning services like Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR can query the catalog to discover data. The catalog is regional and can be shared across accounts using AWS Resource Access Manager. You populate it manually or via Glue Crawlers. On the exam, remember that the Data Catalog is a central component that enables schema-on-read for Athena and Redshift Spectrum.
What is the difference between a Glue Spark job and a Python shell job?
A Glue Spark job runs on a managed Apache Spark cluster and is suitable for processing large datasets (hundreds of GBs to TBs). It supports multiple DPUs and can scale horizontally. A Python shell job runs a simple Python script without Spark, using either 1 DPU or a fraction of one (0.0625 DPU), and is ideal for lightweight tasks like running SQL queries, sending notifications, or small data manipulations. Python shell jobs run under a configurable job timeout and cannot use Spark libraries. On the exam, choose Spark job for big data ETL, Python shell for simple scripting.