GCDLChapter 48 of 101Objective 3.1

ETL and ELT Pipelines on Google Cloud

How do you choose between ETL and ELT data integration patterns on Google Cloud to move and transform data? As part of GCDL Domain 3 (Data Analytics AI), Objective 3.1, this topic accounts for approximately 10-15% of exam questions. You will learn the architectural differences, when to use each pattern, and the specific Google Cloud services that implement them — including Cloud Data Fusion, Dataproc, Dataflow, and BigQuery. Mastery of these concepts is essential for designing data pipelines that balance latency, cost, and complexity.

25 min read

Intermediate

Updated Jul 20, 2026

Reviewed by Johnson Ajibi· Senior Network & Security Engineer · MSc IT Security

Jump to a section

Explain it to me simply Where people get tripped up Test what I know Look up key terms

The Kitchen Prep Station: ETL vs ELT

A bustling restaurant kitchen, ingredients being prepped for dinner service. In the ETL (Extract, Transform, Load) approach, the prep chef takes raw vegetables from the storage room (Extract), washes, chops, and marinates them at the prep station (Transform), and then places the ready-to-cook ingredients into labeled containers in the walk-in cooler (Load). The line cooks can grab the prepped items directly and start cooking without any additional processing. In contrast, the ELT (Extract, Load, Transform) approach has the prep chef simply haul the entire crates of unwashed, whole vegetables into the walk-in cooler (Load) and stack them there. Later, when a specific dish is ordered, the line cook pulls out the needed vegetables and quickly washes, chops, and seasons them on the fly (Transform) just before cooking. ETL is like prepping everything upfront — it takes longer initially but speeds up cooking during the rush. ELT is like storing everything raw — it gets ingredients into the cooler faster but requires more work during service. The choice depends on the kitchen's layout (data warehouse type), the speed of the prep station (processing power), and how predictable the orders are (query patterns).

How It Actually Works

What are ETL and ELT Pipelines?

ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are two patterns for moving data from source systems to a target data warehouse or data lake. The core difference lies in the order of operations: ETL transforms data before loading, while ELT loads raw data first and transforms it later within the target system. This seemingly simple distinction has profound implications for architecture, cost, latency, and scalability.

In traditional on-premises data warehousing, ETL was the dominant pattern because compute resources were expensive and limited. The transformation step cleaned, enriched, and aggregated data before loading it into the warehouse, minimizing storage consumption and ensuring that only high-quality data entered the warehouse. However, this approach required significant upfront design and could not easily accommodate new data sources or schema changes.

With the advent of cloud data warehouses like BigQuery, which separate compute and storage, ELT has become increasingly popular. BigQuery's massive parallel processing (MPP) engine can transform petabytes of data in seconds, making it efficient to load raw data and transform it on demand. ELT allows for greater flexibility — raw data is always available for reprocessing or ad-hoc analysis — and reduces the time to load data into the warehouse.

How ETL Works on Google Cloud

A typical ETL pipeline on Google Cloud uses Cloud Storage as a landing zone, Cloud Dataflow or Dataproc for transformation, and BigQuery or Cloud SQL as the target. The steps are:

Extract: Data is pulled from source systems (e.g., transactional databases, SaaS APIs, log files) and written to Cloud Storage in a raw format (CSV, JSON, Avro, Parquet).

Transform: A Dataflow pipeline (Apache Beam) or a Dataproc cluster (Spark) reads the raw data, applies transformations — filtering, cleansing, joining, aggregating — and writes the transformed data back to Cloud Storage or directly to the target.

Load: The transformed data is loaded into BigQuery tables or another database.

Key services: - Cloud Dataflow: A fully managed, serverless service for stream and batch processing based on Apache Beam. It handles auto-scaling, checkpointing, and exactly-once processing semantics. - Cloud Dataproc: A managed Spark and Hadoop service. It is ideal for existing Spark jobs or when you need fine-grained control over cluster configuration. Dataproc clusters can be created on-demand and autoscaled using preemptible VMs to reduce cost. - Cloud Data Fusion: A fully managed, code-free data integration service that provides a visual interface to build ETL/ELT pipelines. It is based on the open-source CDAP project and supports 150+ connectors.

How ELT Works on Google Cloud

ELT flips the order: load first, transform later. The typical pipeline is:

Extract: Data is extracted from sources and loaded into Cloud Storage.

Load: Data is loaded into BigQuery as raw data in staging tables or external tables. BigQuery supports loading data in Avro, Parquet, ORC, CSV, JSON, and other formats.

Transform: SQL queries (or Dataform) transform the raw data into curated tables, views, or materialized views within BigQuery. Transformations can be scheduled or triggered on demand.

Key services: - BigQuery: A serverless, highly scalable data warehouse. It uses a columnar storage format (Capacitor) and a distributed query engine (Dremel). BigQuery charges for storage and for the number of bytes processed by queries. Using clustered and partitioned tables reduces query costs. - Cloud Dataform: A service for managing SQL-based data transformation pipelines. It allows you to define dependencies, version control transformations, and schedule execution. - BigQuery Data Transfer Service: Automates loading data from SaaS applications (Google Ads, Google Analytics, etc.) into BigQuery.

When to Use ETL vs ELT

Choose ETL when: - The target system has limited compute capacity (e.g., traditional RDBMS). - You need to enforce strict data quality rules before data enters the warehouse. - The transformations are complex and require custom code (e.g., machine learning inference). - You are dealing with sensitive data that must be masked or anonymized before storage.

Choose ELT when: - The target is a scalable cloud data warehouse like BigQuery or Snowflake. - You need to load data quickly for near-real-time availability. - You want to preserve raw data for future reprocessing or unknown use cases. - The transformations are SQL-based and can leverage the warehouse's processing power.

Key Components and Defaults

Cloud Storage: Default bucket locations are multi-regional (e.g., US, EU) but can be regional for lower latency. Object versioning is disabled by default. Lifecycle policies can automatically delete or archive objects.

Cloud Dataflow: Default worker machine type is n1-standard-2. The default disk size per worker is 250 GB (HDD) or 50 GB (SSD). The service uses a streaming engine for unbounded data and a batch engine for bounded data. The default worker region is determined by the pipeline's region.

Cloud Dataproc: Default master machine type is n1-standard-4; worker type is n1-standard-4. The default number of workers is 2. Clusters use a staging bucket in Cloud Storage. Preemptible workers can be used for cost savings (up to 80% discount) but may be terminated at any time.

BigQuery: Default table expiration is never. Partitioning is by ingestion time (_PARTITIONTIME) or by a DATE/TIMESTAMP column. Clustering can be on up to 4 columns. The default query job priority is INTERACTIVE (immediate execution); BATCH priority is available for lower cost.

Configuration and Verification

To create a Dataflow pipeline from the command line:

gcloud dataflow jobs run my-pipeline \
  --gcs-location gs://my-bucket/templates/my-template \
  --parameters input=gs://my-bucket/input,output=gs://my-bucket/output \
  --region us-central1

To submit a Spark job to Dataproc:

gcloud dataproc jobs submit spark \
  --cluster my-cluster \
  --region us-central1 \
  --jar gs://my-bucket/my-job.jar \
  -- my-arg1 my-arg2

To load data into BigQuery:

bq load --source_format=PARQUET mydataset.mytable gs://my-bucket/data/*.parquet

To transform data in BigQuery using SQL:

CREATE OR REPLACE TABLE mydataset.curated AS
SELECT 
  user_id,
  TIMESTAMP_MICROS(event_timestamp) AS event_time,
  event_type,
  LOWER(page_title) AS page_title
FROM mydataset.raw_events
WHERE event_type IS NOT NULL;

Interaction with Related Technologies

ETL and ELT pipelines often integrate with: - Cloud Pub/Sub: For streaming ingestion. Dataflow can subscribe to a Pub/Sub topic and process messages in near-real-time. - Cloud Functions: For lightweight, event-driven transformations. For example, a Cloud Function can be triggered when a new file lands in Cloud Storage to perform a quick validation or enrichment. - AI Platform: For ML-based transformations. Dataflow can call AI Platform Prediction to score records during transformation. - Data Catalog: For metadata management. Pipelines can automatically register datasets in Data Catalog to enable discovery and lineage tracking.

Performance Considerations

Dataflow: Use streaming engine for low-latency (sub-second) processing. Batch pipelines are optimized for high throughput. Use autoscaling (default: 2-100 workers) to handle variable load.

Dataproc: Use preemptible workers for cost-effective batch jobs. Enable autoscaling with a cool-down period of 120 seconds. Use local SSDs for shuffle-intensive workloads.

BigQuery: Partition tables by date to reduce query costs. Cluster on frequently filtered columns. Use materialized views for pre-aggregated results. Avoid SELECT * in production queries.

Common Pitfalls

Data Skew: In Dataflow, if a single key has too many records, it can cause a hot key. Use Combine.perKey with custom logic or rekeying.

Schema Evolution: ELT pipelines are more forgiving because raw data is preserved. In ETL, schema changes may require pipeline redeployment.

Cost Overruns: ELT queries can be expensive if not optimized. Use query controls (e.g., max bytes billed) to prevent runaway costs.

Data Quality: ETL can enforce quality at load time; ELT relies on downstream transformations. Implement data quality checks in both patterns.

Walk-Through

Extract data from source

The pipeline begins by connecting to the source system — a relational database, SaaS API, log files, or streaming events. For batch extraction, tools like Cloud Data Fusion or custom scripts use JDBC/ODBC connectors to pull data. For streaming, Pub/Sub or Kafka captures events. Data is typically written to Cloud Storage in a serialized format (Parquet, Avro, CSV). Extraction frequency can be hourly, daily, or continuous. The extract step must handle incremental loads (using timestamps or change data capture) to avoid reprocessing full datasets.

Land raw data in Cloud Storage

Cloud Storage acts as the durable, scalable landing zone. Raw files are stored in a bucket, often partitioned by date or source. Object lifecycle policies can automatically delete raw data after a retention period. The bucket should be in the same region as the processing service to minimize egress costs. For ELT, this is the permanent raw data store; for ETL, it is a temporary staging area. Access is controlled via IAM and signed URLs for external sources.

Transform data (ETL only)

In ETL, transformation occurs before loading. A Dataflow pipeline reads from Cloud Storage, applies transformations using Apache Beam transforms (ParDo, GroupByKey, Combine), and writes the result back to Cloud Storage or directly to BigQuery. Transformations include filtering nulls, joining multiple datasets, aggregating metrics, and encoding sensitive fields. Dataflow provides exactly-once processing semantics via checkpointing. The pipeline can be parameterized for different environments (dev, prod).

Load data into target

For ETL, the transformed data is loaded into BigQuery tables using the bq load command or Dataflow's BigQuery sink. For ELT, raw data is loaded directly into BigQuery staging tables. BigQuery supports automatic schema detection for Avro, Parquet, and ORC files, but explicit schema definitions are recommended for production. Loading can be done via the BigQuery Storage Write API for high-throughput streaming ingestion. The load step must handle deduplication and error records.

Transform data in warehouse (ELT only)

After raw data is loaded into BigQuery, SQL-based transformations create curated tables. Dataform manages these transformations as a DAG of SQL scripts. Transformations include cleaning, joining, aggregating, and creating materialized views. BigQuery's query engine executes these transformations on-demand or on a schedule. Partitioned and clustered tables optimize performance. The raw data remains available for reprocessing. This step can be triggered by the BigQuery Data Transfer Service or scheduled queries.

What This Looks Like on the Job

Enterprise Scenario 1: Retail Analytics with ETL

A large retailer ingests daily sales data from thousands of stores into a central data warehouse for reporting. They use Cloud Data Fusion to build an ETL pipeline: extracts from on-premises Oracle databases via JDBC, transforms by cleaning invalid store IDs, joining with product catalogs, and aggregating sales by hour. The transformed data is loaded into BigQuery partitioned tables. The pipeline runs nightly, completing within 4 hours for 10 TB of data. Key considerations: Data Fusion uses a Cloud Dataproc cluster under the hood; they configured 10 preemptible workers to reduce costs. A common issue was schema drift — when new columns were added to source tables, the pipeline failed. They mitigated this by using a flexible schema in Data Fusion that allows new columns to pass through. The pipeline is monitored via Cloud Monitoring alerts on Dataflow job failures and BigQuery load job errors.

Enterprise Scenario 2: IoT Data with ELT

A manufacturing company collects sensor data from thousands of IoT devices every second. They use Pub/Sub to ingest streaming data, which is then loaded into BigQuery via the Storage Write API in near-real-time (latency under 10 seconds). Raw data is stored in a partitioned table by timestamp. Every hour, a Dataform pipeline transforms the raw data into aggregated tables: average temperature per machine per hour, anomaly scores, and rolling 24-hour summaries. The ELT approach allows data scientists to query raw sensor data for ML model training without waiting for ETL jobs. Cost management is critical: they use BigQuery's max_bytes_billed setting to prevent runaway queries. A misconfiguration once caused a $10,000 query; they now enforce a 1 TB limit per user. They also use clustered tables on machine_id and timestamp to reduce query costs by 60%.

Enterprise Scenario 3: Financial Services Compliance with ETL

A bank needs to process transaction data for regulatory reporting. They use Cloud Dataproc with Spark to run complex transformations: joining transaction data with customer master data, applying business rules for suspicious activity detection, and masking PII. The transformed output is loaded into BigQuery and also exported to encrypted Cloud Storage for archival. ETL is chosen because regulations require that PII is masked before it enters the warehouse. The pipeline runs in batch mode every 15 minutes to meet near-real-time reporting SLAs. They use Dataproc's autoscaling with preemptible workers to handle peak loads. A common failure point is shuffle spill — when intermediate data exceeds memory, Spark writes to disk, slowing the job. They tuned the shuffle partitions and increased executor memory to 8 GB to avoid this.

How GCDL Actually Tests This

GCDL Exam Focus: Objective 3.1

The exam tests your ability to distinguish between ETL and ELT and match the correct pattern to a given scenario. Key objective codes: 3.1 (Data Integration), 3.2 (Data Processing). You will not be asked to write code but to choose the right Google Cloud service and pattern.

Common Wrong Answers

Choosing Cloud Data Fusion for ELT — Many candidates think Data Fusion is for ELT because it has a visual interface. Reality: Data Fusion is primarily an ETL tool (code-free data integration) that transforms data before loading. For ELT, use BigQuery + Dataform.

Selecting Dataproc for simple SQL transformations — Dataproc is ideal for Spark/ML workloads, but for SQL-based transformations in BigQuery, Dataform is more appropriate. The exam often contrasts Dataproc (Spark) vs Dataform (SQL).

Thinking ETL is always better for data quality — While ETL can enforce quality at load time, ELT can also enforce quality via SQL constraints and scheduled validation queries. The exam expects you to know that ELT allows raw data preservation, which is beneficial for reprocessing.

Specific Numbers and Terms

BigQuery: default query priority is INTERACTIVE; BATCH priority uses idle slots. Max bytes billed per query can be set via --maximum_bytes_billed.

Dataflow: streaming engine default is off; enable for sub-second latency. Autoscaling range: 1-1000 workers.

Dataproc: preemptible VMs offer up to 80% discount but no SLA. Default idle timeout for clusters is 90 minutes (can be changed).

Cloud Data Fusion: pipelines run on Dataproc clusters. The service offers 150+ connectors.

Edge Cases

Schema on read vs schema on write: ELT uses schema on read (raw data loaded, schema applied at query time). ETL uses schema on write (schema enforced during transformation). The exam may test which pattern is more flexible.

Streaming vs batch: Dataflow supports both; Pub/Sub is for streaming ingestion. The exam may ask which service to use for real-time transformations.

Cost optimization: Use preemptible VMs for Dataproc, batch queries in BigQuery, and lifecycle policies in Cloud Storage. The exam loves cost-related questions.

How to Eliminate Wrong Answers

If the scenario mentions "raw data preserved for future analysis" → ELT.

If the scenario mentions "complex transformations in Java/Python" → Dataflow or Dataproc.

If the scenario mentions "no coding, drag-and-drop" → Cloud Data Fusion.

If the scenario mentions "SQL transformations on loaded data" → BigQuery with Dataform.

If the scenario mentions "near-real-time streaming" → Dataflow with Pub/Sub.

Memorize the services: ETL uses Dataflow, Dataproc, or Data Fusion; ELT uses BigQuery + Dataform. The exam will not ask about obscure services like Cloud Composer (Airflow) unless it's about orchestration.

Key Takeaways

ETL transforms data before loading; ELT transforms after loading. Choose based on target compute and flexibility needs.

On Google Cloud, ETL uses Cloud Dataflow, Dataproc, or Cloud Data Fusion; ELT uses BigQuery with Dataform.

BigQuery's default query priority is INTERACTIVE; use BATCH for lower-cost, lower-priority queries.

Cloud Data Fusion is a code-free ETL tool that runs on Dataproc clusters.

Dataflow provides exactly-once processing semantics via checkpointing and supports both batch and streaming.

Dataproc preemptible VMs offer up to 80% discount but can be terminated at any time.

ELT preserves raw data, enabling schema-on-read and future reprocessing.

ETL enforces data quality at load time; ELT relies on downstream transformations.

Partitioning and clustering in BigQuery reduce query costs and improve performance.

The BigQuery Storage Write API enables high-throughput streaming ingestion.

Cloud Dataform manages SQL-based ELT transformations as a DAG.

Lifecycle policies in Cloud Storage automatically delete raw data after a retention period.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

ETL (Extract, Transform, Load)

Transforms data before loading into the target.

Requires upfront schema design and transformation logic.

Better for targets with limited compute (e.g., traditional RDBMS).

Raw data is not preserved; only transformed data is stored.

Slower initial load, but faster queries on curated data.

ELT (Extract, Load, Transform)

Loads raw data first, transforms later in the warehouse.

More flexible; raw data available for reprocessing.

Leverages warehouse compute (e.g., BigQuery's MPP engine).

Raw data is preserved for ad-hoc analysis and ML.

Faster initial load, but queries may be slower without optimization.

Watch Out for These

Mistake

ETL is always slower than ELT.

Correct

ETL can be slower upfront because transformation happens before loading, but it can speed up downstream queries since data is pre-aggregated and cleaned. ELT loads faster but may result in slower queries if raw data is not optimized (e.g., no partitioning/clustering). The actual performance depends on the workload.

Mistake

Cloud Data Fusion is an ELT tool.

Correct

Cloud Data Fusion is an ETL tool that transforms data before loading. It uses a visual interface to build pipelines that run on Dataproc. For ELT, you would use BigQuery and Dataform to load raw data and transform it in place.

Mistake

ELT does not require any transformation logic.

Correct

ELT still requires transformation logic, but it is executed after loading, typically using SQL in the data warehouse. The transformations are defined in Dataform or scheduled queries. Raw data is preserved, but transformations are essential for analysis.

Mistake

BigQuery can only be used as a target, not a transformation engine.

Correct

BigQuery is a powerful transformation engine. With SQL, you can create views, materialized views, and scheduled queries to transform data. BigQuery's Dremel engine can process petabytes of data efficiently, making it ideal for ELT.

Mistake

You need to use Cloud Dataflow for all streaming pipelines.

Correct

While Dataflow is the primary streaming service, you can also use Pub/Sub + Cloud Functions for simple transformations, or Pub/Sub + BigQuery's Storage Write API for direct ingestion. Dataflow is best for complex stream processing (windowing, joins, aggregations).

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the difference between ETL and ELT on Google Cloud?

ETL (Extract, Transform, Load) transforms data before loading into the target, using services like Cloud Dataflow, Dataproc, or Cloud Data Fusion. ELT (Extract, Load, Transform) loads raw data first into BigQuery, then transforms it using SQL via Dataform or scheduled queries. ETL is better for enforcing data quality upfront; ELT is more flexible and preserves raw data.

Which Google Cloud service should I use for ETL?

For ETL, use Cloud Dataflow (for complex stream/batch processing with Apache Beam), Cloud Dataproc (for Spark/Hadoop workloads), or Cloud Data Fusion (for code-free visual pipeline design). The choice depends on your transformation complexity and coding preference.

Can I use BigQuery for transformations in an ETL pipeline?

Yes, you can load raw data into BigQuery and then run SQL transformations to create curated tables. This is actually ELT. However, if you transform before loading (ETL), you would use Dataflow or Dataproc. BigQuery can be both a target and a transformation engine.

What is Cloud Data Fusion and when should I use it?

Cloud Data Fusion is a fully managed, code-free data integration service for building ETL/ELT pipelines. It provides a visual interface with 150+ connectors. Use it when you want to avoid writing code and need to connect to various sources quickly. It runs on Dataproc clusters.

How do I optimize costs in ELT pipelines on BigQuery?

Use partitioned and clustered tables to reduce bytes scanned. Set maximum bytes billed per query to avoid runaway costs. Use BATCH priority for non-urgent queries. Schedule transformations during off-peak hours. Use materialized views for pre-aggregated results.

What is Dataform and how does it relate to ELT?

Cloud Dataform is a service for managing SQL-based ELT transformations. It allows you to define dependencies between SQL scripts, version control them, and schedule execution. It is the primary tool for implementing ELT on BigQuery.

Can I use streaming in ETL pipelines?

Yes, Cloud Dataflow supports streaming (unbounded) data. You can ingest from Pub/Sub, transform in real-time, and load into BigQuery or other sinks. This is common for ETL with near-real-time requirements.

Terms Worth Knowing

BigQuery Data lake Data warehouse ELT ETL

Ready to put this to the test?

You've just covered ETL and ELT Pipelines on Google Cloud — now see how well it sticks with free GCDL practice questions. Full explanations included, no account needed.

Try GCDL practice questions Back to all chapters

Done with this chapter?

Data Types and Analytics Workloads

BigQuery: Serverless Analytics Overview

See the full GCDL study guide