This chapter covers ETL and ELT pipelines on Google Cloud, two fundamental data integration patterns for moving and transforming data. As part of GCDL Domain 3 (Data Analytics AI), Objective 3.1, this topic accounts for approximately 10-15% of exam questions. You will learn the architectural differences, when to use each pattern, and the specific Google Cloud services that implement them — including Cloud Data Fusion, Dataproc, Dataflow, and BigQuery. Mastery of these concepts is essential for designing data pipelines that balance latency, cost, and complexity.
Jump to a section
Imagine a busy restaurant kitchen preparing ingredients for dinner service. In the ETL (Extract, Transform, Load) approach, the prep chef takes raw vegetables from the storage room (Extract), washes, chops, and marinates them at the prep station (Transform), and then places the ready-to-cook ingredients into labeled containers in the walk-in cooler (Load). The line cooks can grab the prepped items directly and start cooking without any additional processing. In contrast, the ELT (Extract, Load, Transform) approach has the prep chef simply haul the entire crates of unwashed, whole vegetables into the walk-in cooler (Load) and stack them there. Later, when a specific dish is ordered, the line cook pulls out the needed vegetables and quickly washes, chops, and seasons them on the fly (Transform) just before cooking. ETL is like prepping everything upfront — it takes longer initially but speeds up cooking during the rush. ELT is like storing everything raw — it gets ingredients into the cooler faster but requires more work during service. The choice depends on the kitchen's layout (data warehouse type), the speed of the prep station (processing power), and how predictable the orders are (query patterns).
ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are two patterns for moving data from source systems to a target data warehouse or data lake. The core difference lies in the order of operations: ETL transforms data before loading, while ELT loads raw data first and transforms it later within the target system. This seemingly simple distinction has profound implications for architecture, cost, latency, and scalability.
In traditional on-premises data warehousing, ETL was the dominant pattern because compute resources were expensive and limited. The transformation step cleaned, enriched, and aggregated data before loading it into the warehouse, minimizing storage consumption and ensuring that only high-quality data entered the warehouse. However, this approach required significant upfront design and could not easily accommodate new data sources or schema changes.
With the advent of cloud data warehouses like BigQuery, which separate compute and storage, ELT has become increasingly popular. BigQuery's massive parallel processing (MPP) engine can transform petabytes of data in seconds, making it efficient to load raw data and transform it on demand. ELT allows for greater flexibility — raw data is always available for reprocessing or ad-hoc analysis — and reduces the time to load data into the warehouse.
How ETL Works on Google Cloud
A typical ETL pipeline on Google Cloud uses Cloud Storage as a landing zone, Cloud Dataflow or Dataproc for transformation, and BigQuery or Cloud SQL as the target. The steps are:
Extract: Data is pulled from source systems (e.g., transactional databases, SaaS APIs, log files) and written to Cloud Storage in a raw format (CSV, JSON, Avro, Parquet).
Transform: A Dataflow pipeline (Apache Beam) or a Dataproc cluster (Spark) reads the raw data, applies transformations — filtering, cleansing, joining, aggregating — and writes the transformed data back to Cloud Storage or directly to the target.
Load: The transformed data is loaded into BigQuery tables or another database.
Key services: - Cloud Dataflow: A fully managed, serverless service for stream and batch processing based on Apache Beam. It handles auto-scaling, checkpointing, and exactly-once processing semantics. - Cloud Dataproc: A managed Spark and Hadoop service. It is ideal for existing Spark jobs or when you need fine-grained control over cluster configuration. Dataproc clusters can be created on-demand and autoscaled using preemptible VMs to reduce cost. - Cloud Data Fusion: A fully managed, code-free data integration service that provides a visual interface to build ETL/ELT pipelines. It is based on the open-source CDAP project and supports 150+ connectors.
How ELT Works on Google Cloud
ELT flips the order: load first, transform later. The typical pipeline is:
Extract: Data is extracted from sources and loaded into Cloud Storage.
Load: Data is loaded into BigQuery as raw data in staging tables or external tables. BigQuery supports loading data in Avro, Parquet, ORC, CSV, JSON, and other formats.
Transform: SQL queries (or Dataform) transform the raw data into curated tables, views, or materialized views within BigQuery. Transformations can be scheduled or triggered on demand.
Key services: - BigQuery: A serverless, highly scalable data warehouse. It uses a columnar storage format (Capacitor) and a distributed query engine (Dremel). BigQuery charges for storage and for the number of bytes processed by queries. Using clustered and partitioned tables reduces query costs. - Cloud Dataform: A service for managing SQL-based data transformation pipelines. It allows you to define dependencies, version control transformations, and schedule execution. - BigQuery Data Transfer Service: Automates loading data from SaaS applications (Google Ads, Google Analytics, etc.) into BigQuery.
When to Use ETL vs ELT
Choose ETL when: - The target system has limited compute capacity (e.g., traditional RDBMS). - You need to enforce strict data quality rules before data enters the warehouse. - The transformations are complex and require custom code (e.g., machine learning inference). - You are dealing with sensitive data that must be masked or anonymized before storage.
Choose ELT when: - The target is a scalable cloud data warehouse like BigQuery or Snowflake. - You need to load data quickly for near-real-time availability. - You want to preserve raw data for future reprocessing or unknown use cases. - The transformations are SQL-based and can leverage the warehouse's processing power.
Key Components and Defaults
Cloud Storage: Default bucket locations are multi-regional (e.g., US, EU) but can be regional for lower latency. Object versioning is disabled by default. Lifecycle policies can automatically delete or archive objects.
Cloud Dataflow: Default worker machine type is n1-standard-2. The default disk size per worker is 250 GB (HDD) or 50 GB (SSD). The service uses a streaming engine for unbounded data and a batch engine for bounded data. The default worker region is determined by the pipeline's region.
Cloud Dataproc: Default master machine type is n1-standard-4; worker type is n1-standard-4. The default number of workers is 2. Clusters use a staging bucket in Cloud Storage. Preemptible workers can be used for cost savings (up to 80% discount) but may be terminated at any time.
BigQuery: Default table expiration is never. Partitioning is by ingestion time (_PARTITIONTIME) or by a DATE/TIMESTAMP column. Clustering can be on up to 4 columns. The default query job priority is INTERACTIVE (immediate execution); BATCH priority is available for lower cost.
Configuration and Verification
To create a Dataflow pipeline from the command line:
gcloud dataflow jobs run my-pipeline \
--gcs-location gs://my-bucket/templates/my-template \
--parameters input=gs://my-bucket/input,output=gs://my-bucket/output \
--region us-central1To submit a Spark job to Dataproc:
gcloud dataproc jobs submit spark \
--cluster my-cluster \
--region us-central1 \
--jar gs://my-bucket/my-job.jar \
-- my-arg1 my-arg2To load data into BigQuery:
bq load --source_format=PARQUET mydataset.mytable gs://my-bucket/data/*.parquetTo transform data in BigQuery using SQL:
CREATE OR REPLACE TABLE mydataset.curated AS
SELECT
user_id,
TIMESTAMP_MICROS(event_timestamp) AS event_time,
event_type,
LOWER(page_title) AS page_title
FROM mydataset.raw_events
WHERE event_type IS NOT NULL;Interaction with Related Technologies
ETL and ELT pipelines often integrate with: - Cloud Pub/Sub: For streaming ingestion. Dataflow can subscribe to a Pub/Sub topic and process messages in near-real-time. - Cloud Functions: For lightweight, event-driven transformations. For example, a Cloud Function can be triggered when a new file lands in Cloud Storage to perform a quick validation or enrichment. - AI Platform: For ML-based transformations. Dataflow can call AI Platform Prediction to score records during transformation. - Data Catalog: For metadata management. Pipelines can automatically register datasets in Data Catalog to enable discovery and lineage tracking.
Performance Considerations
Dataflow: Use streaming engine for low-latency (sub-second) processing. Batch pipelines are optimized for high throughput. Use autoscaling (default: 2-100 workers) to handle variable load.
Dataproc: Use preemptible workers for cost-effective batch jobs. Enable autoscaling with a cool-down period of 120 seconds. Use local SSDs for shuffle-intensive workloads.
BigQuery: Partition tables by date to reduce query costs. Cluster on frequently filtered columns. Use materialized views for pre-aggregated results. Avoid SELECT * in production queries.
Common Pitfalls
Data Skew: In Dataflow, if a single key has too many records, it can cause a hot key. Use Combine.perKey with custom logic or rekeying.
Schema Evolution: ELT pipelines are more forgiving because raw data is preserved. In ETL, schema changes may require pipeline redeployment.
Cost Overruns: ELT queries can be expensive if not optimized. Use query controls (e.g., max bytes billed) to prevent runaway costs.
Data Quality: ETL can enforce quality at load time; ELT relies on downstream transformations. Implement data quality checks in both patterns.
Extract data from source
The pipeline begins by connecting to the source system — a relational database, SaaS API, log files, or streaming events. For batch extraction, tools like Cloud Data Fusion or custom scripts use JDBC/ODBC connectors to pull data. For streaming, Pub/Sub or Kafka captures events. Data is typically written to Cloud Storage in a serialized format (Parquet, Avro, CSV). Extraction frequency can be hourly, daily, or continuous. The extract step must handle incremental loads (using timestamps or change data capture) to avoid reprocessing full datasets.
Land raw data in Cloud Storage
Cloud Storage acts as the durable, scalable landing zone. Raw files are stored in a bucket, often partitioned by date or source. Object lifecycle policies can automatically delete raw data after a retention period. The bucket should be in the same region as the processing service to minimize egress costs. For ELT, this is the permanent raw data store; for ETL, it is a temporary staging area. Access is controlled via IAM and signed URLs for external sources.
Transform data (ETL only)
In ETL, transformation occurs before loading. A Dataflow pipeline reads from Cloud Storage, applies transformations using Apache Beam transforms (ParDo, GroupByKey, Combine), and writes the result back to Cloud Storage or directly to BigQuery. Transformations include filtering nulls, joining multiple datasets, aggregating metrics, and encoding sensitive fields. Dataflow provides exactly-once processing semantics via checkpointing. The pipeline can be parameterized for different environments (dev, prod).
Load data into target
For ETL, the transformed data is loaded into BigQuery tables using the bq load command or Dataflow's BigQuery sink. For ELT, raw data is loaded directly into BigQuery staging tables. BigQuery supports automatic schema detection for Avro, Parquet, and ORC files, but explicit schema definitions are recommended for production. Loading can be done via the BigQuery Storage Write API for high-throughput streaming ingestion. The load step must handle deduplication and error records.
Transform data in warehouse (ELT only)
After raw data is loaded into BigQuery, SQL-based transformations create curated tables. Dataform manages these transformations as a DAG of SQL scripts. Transformations include cleaning, joining, aggregating, and creating materialized views. BigQuery's query engine executes these transformations on-demand or on a schedule. Partitioned and clustered tables optimize performance. The raw data remains available for reprocessing. This step can be triggered by the BigQuery Data Transfer Service or scheduled queries.
Enterprise Scenario 1: Retail Analytics with ETL
A large retailer ingests daily sales data from thousands of stores into a central data warehouse for reporting. They use Cloud Data Fusion to build an ETL pipeline: extracts from on-premises Oracle databases via JDBC, transforms by cleaning invalid store IDs, joining with product catalogs, and aggregating sales by hour. The transformed data is loaded into BigQuery partitioned tables. The pipeline runs nightly, completing within 4 hours for 10 TB of data. Key considerations: Data Fusion uses a Cloud Dataproc cluster under the hood; they configured 10 preemptible workers to reduce costs. A common issue was schema drift — when new columns were added to source tables, the pipeline failed. They mitigated this by using a flexible schema in Data Fusion that allows new columns to pass through. The pipeline is monitored via Cloud Monitoring alerts on Dataflow job failures and BigQuery load job errors.
Enterprise Scenario 2: IoT Data with ELT
A manufacturing company collects sensor data from thousands of IoT devices every second. They use Pub/Sub to ingest streaming data, which is then loaded into BigQuery via the Storage Write API in near-real-time (latency under 10 seconds). Raw data is stored in a partitioned table by timestamp. Every hour, a Dataform pipeline transforms the raw data into aggregated tables: average temperature per machine per hour, anomaly scores, and rolling 24-hour summaries. The ELT approach allows data scientists to query raw sensor data for ML model training without waiting for ETL jobs. Cost management is critical: they use BigQuery's max_bytes_billed setting to prevent runaway queries. A misconfiguration once caused a $10,000 query; they now enforce a 1 TB limit per user. They also use clustered tables on machine_id and timestamp to reduce query costs by 60%.
Enterprise Scenario 3: Financial Services Compliance with ETL
A bank needs to process transaction data for regulatory reporting. They use Cloud Dataproc with Spark to run complex transformations: joining transaction data with customer master data, applying business rules for suspicious activity detection, and masking PII. The transformed output is loaded into BigQuery and also exported to encrypted Cloud Storage for archival. ETL is chosen because regulations require that PII is masked before it enters the warehouse. The pipeline runs in batch mode every 15 minutes to meet near-real-time reporting SLAs. They use Dataproc's autoscaling with preemptible workers to handle peak loads. A common failure point is shuffle spill — when intermediate data exceeds memory, Spark writes to disk, slowing the job. They tuned the shuffle partitions and increased executor memory to 8 GB to avoid this.
GCDL Exam Focus: Objective 3.1
The exam tests your ability to distinguish between ETL and ELT and match the correct pattern to a given scenario. Key objective codes: 3.1 (Data Integration), 3.2 (Data Processing). You will not be asked to write code but to choose the right Google Cloud service and pattern.
Common Wrong Answers
Choosing Cloud Data Fusion for ELT — Many candidates think Data Fusion is for ELT because it has a visual interface. Reality: Data Fusion is primarily an ETL tool (code-free data integration) that transforms data before loading. For ELT, use BigQuery + Dataform.
Selecting Dataproc for simple SQL transformations — Dataproc is ideal for Spark/ML workloads, but for SQL-based transformations in BigQuery, Dataform is more appropriate. The exam often contrasts Dataproc (Spark) vs Dataform (SQL).
Thinking ETL is always better for data quality — While ETL can enforce quality at load time, ELT can also enforce quality via SQL constraints and scheduled validation queries. The exam expects you to know that ELT allows raw data preservation, which is beneficial for reprocessing.
Specific Numbers and Terms
BigQuery: default query priority is INTERACTIVE; BATCH priority uses idle slots. Max bytes billed per query can be set via --maximum_bytes_billed.
Dataflow: streaming engine default is off; enable for sub-second latency. Autoscaling range: 1-1000 workers.
Dataproc: preemptible VMs offer up to 80% discount but no SLA. Default idle timeout for clusters is 90 minutes (can be changed).
Cloud Data Fusion: pipelines run on Dataproc clusters. The service offers 150+ connectors.
Edge Cases
Schema on read vs schema on write: ELT uses schema on read (raw data loaded, schema applied at query time). ETL uses schema on write (schema enforced during transformation). The exam may test which pattern is more flexible.
Streaming vs batch: Dataflow supports both; Pub/Sub is for streaming ingestion. The exam may ask which service to use for real-time transformations.
Cost optimization: Use preemptible VMs for Dataproc, batch queries in BigQuery, and lifecycle policies in Cloud Storage. The exam loves cost-related questions.
How to Eliminate Wrong Answers
If the scenario mentions "raw data preserved for future analysis" → ELT.
If the scenario mentions "complex transformations in Java/Python" → Dataflow or Dataproc.
If the scenario mentions "no coding, drag-and-drop" → Cloud Data Fusion.
If the scenario mentions "SQL transformations on loaded data" → BigQuery with Dataform.
If the scenario mentions "near-real-time streaming" → Dataflow with Pub/Sub.
Memorize the services: ETL uses Dataflow, Dataproc, or Data Fusion; ELT uses BigQuery + Dataform. The exam will not ask about obscure services like Cloud Composer (Airflow) unless it's about orchestration.
ETL transforms data before loading; ELT transforms after loading. Choose based on target compute and flexibility needs.
On Google Cloud, ETL uses Cloud Dataflow, Dataproc, or Cloud Data Fusion; ELT uses BigQuery with Dataform.
BigQuery's default query priority is INTERACTIVE; use BATCH for lower-cost, lower-priority queries.
Cloud Data Fusion is a code-free ETL tool that runs on Dataproc clusters.
Dataflow provides exactly-once processing semantics via checkpointing and supports both batch and streaming.
Dataproc preemptible VMs offer up to 80% discount but can be terminated at any time.
ELT preserves raw data, enabling schema-on-read and future reprocessing.
ETL enforces data quality at load time; ELT relies on downstream transformations.
Partitioning and clustering in BigQuery reduce query costs and improve performance.
The BigQuery Storage Write API enables high-throughput streaming ingestion.
Cloud Dataform manages SQL-based ELT transformations as a DAG.
Lifecycle policies in Cloud Storage automatically delete raw data after a retention period.
These come up on the exam all the time. Here's how to tell them apart.
ETL (Extract, Transform, Load)
Transforms data before loading into the target.
Requires upfront schema design and transformation logic.
Better for targets with limited compute (e.g., traditional RDBMS).
Raw data is not preserved; only transformed data is stored.
Slower initial load, but faster queries on curated data.
ELT (Extract, Load, Transform)
Loads raw data first, transforms later in the warehouse.
More flexible; raw data available for reprocessing.
Leverages warehouse compute (e.g., BigQuery's MPP engine).
Raw data is preserved for ad-hoc analysis and ML.
Faster initial load, but queries may be slower without optimization.
Mistake
ETL is always slower than ELT.
Correct
ETL can be slower upfront because transformation happens before loading, but it can speed up downstream queries since data is pre-aggregated and cleaned. ELT loads faster but may result in slower queries if raw data is not optimized (e.g., no partitioning/clustering). The actual performance depends on the workload.
Mistake
Cloud Data Fusion is an ELT tool.
Correct
Cloud Data Fusion is an ETL tool that transforms data before loading. It uses a visual interface to build pipelines that run on Dataproc. For ELT, you would use BigQuery and Dataform to load raw data and transform it in place.
Mistake
ELT does not require any transformation logic.
Correct
ELT still requires transformation logic, but it is executed after loading, typically using SQL in the data warehouse. The transformations are defined in Dataform or scheduled queries. Raw data is preserved, but transformations are essential for analysis.
Mistake
BigQuery can only be used as a target, not a transformation engine.
Correct
BigQuery is a powerful transformation engine. With SQL, you can create views, materialized views, and scheduled queries to transform data. BigQuery's Dremel engine can process petabytes of data efficiently, making it ideal for ELT.
Mistake
You need to use Cloud Dataflow for all streaming pipelines.
Correct
While Dataflow is the primary streaming service, you can also use Pub/Sub + Cloud Functions for simple transformations, or Pub/Sub + BigQuery's Storage Write API for direct ingestion. Dataflow is best for complex stream processing (windowing, joins, aggregations).
Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.
ETL (Extract, Transform, Load) transforms data before loading into the target, using services like Cloud Dataflow, Dataproc, or Cloud Data Fusion. ELT (Extract, Load, Transform) loads raw data first into BigQuery, then transforms it using SQL via Dataform or scheduled queries. ETL is better for enforcing data quality upfront; ELT is more flexible and preserves raw data.
For ETL, use Cloud Dataflow (for complex stream/batch processing with Apache Beam), Cloud Dataproc (for Spark/Hadoop workloads), or Cloud Data Fusion (for code-free visual pipeline design). The choice depends on your transformation complexity and coding preference.
Yes, you can load raw data into BigQuery and then run SQL transformations to create curated tables. This is actually ELT. However, if you transform before loading (ETL), you would use Dataflow or Dataproc. BigQuery can be both a target and a transformation engine.
Cloud Data Fusion is a fully managed, code-free data integration service for building ETL/ELT pipelines. It provides a visual interface with 150+ connectors. Use it when you want to avoid writing code and need to connect to various sources quickly. It runs on Dataproc clusters.
Use partitioned and clustered tables to reduce bytes scanned. Set maximum bytes billed per query to avoid runaway costs. Use BATCH priority for non-urgent queries. Schedule transformations during off-peak hours. Use materialized views for pre-aggregated results.
Cloud Dataform is a service for managing SQL-based ELT transformations. It allows you to define dependencies between SQL scripts, version control them, and schedule execution. It is the primary tool for implementing ELT on BigQuery.
Yes, Cloud Dataflow supports streaming (unbounded) data. You can ingest from Pub/Sub, transform in real-time, and load into BigQuery or other sinks. This is common for ETL with near-real-time requirements.
You've just covered ETL and ELT Pipelines on Google Cloud — now see how well it sticks with free GCDL practice questions. Full explanations included, no account needed.
Done with this chapter?