DP-900Chapter 16 of 101Objective 1.1

ETL vs ELT Data Pipelines

This chapter covers the two primary data pipeline architectures: ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform). You will learn how each approach processes data from source to destination, the key differences in transformation timing, and when to use each. For the DP-900 exam, this topic appears in roughly 5-10% of questions under Core Data Concepts (Objective 1.1). Mastery of ETL vs. ELT is essential for choosing the right data integration strategy in Azure solutions like Azure Data Factory, Azure Synapse Analytics, and Azure Databricks.

25 min read
Intermediate
Updated May 31, 2026

The Kitchen Prep vs. Cook-and-Serve Model

Imagine a high-end restaurant kitchen. In the ETL approach, the kitchen has a separate prep area where raw ingredients (data) are washed, chopped, marinated, and portioned according to strict recipes (transformations) before they ever reach the cooking line. The head chef (data pipeline) transforms raw vegetables into mise en place, then stores them in labeled containers (staging area) in the walk-in cooler. Only after all prep is complete does the cooking line (target system) use the prepped ingredients to assemble dishes. This ensures the cooking line never sees a dirty potato or a whole fish — everything is clean and ready. However, if a new dish requires a different cut of vegetable, the entire prep process must be redesigned. In the ELT model, the kitchen receives raw ingredients directly at the cooking station. The chef (target system) has the power to chop, season, and cook on the fly. A sous chef might throw in a whole onion, then decide to dice it mid-cook using a powerful food processor (distributed compute). This is faster for initial delivery — the raw ingredient arrives at the station immediately — but the cooking area must be large and flexible enough to handle raw materials. The downside: if the chef makes a mistake, the raw ingredient is already in the pot and hard to undo. ETL pre-transforms in a controlled environment; ELT trusts the destination system to handle transformation at scale.

How It Actually Works

What Are ETL and ELT?

ETL and ELT are data pipeline architectures that move data from source systems to target destinations (typically data warehouses or data lakes). The letters stand for Extract, Transform, and Load. The order of these operations defines the architecture:

ETL: Extract -> Transform -> Load

ELT: Extract -> Load -> Transform

In ETL, data is transformed in an intermediate staging area before being loaded into the target. In ELT, raw data is loaded directly into the target, and transformations are performed inside the target system.

Why Two Different Approaches?

The choice between ETL and ELT depends on factors like data volume, latency requirements, target system capabilities, and compliance needs. Historically, ETL was the standard because data warehouses had limited compute power — transforming data before loading reduced the load on the warehouse. Modern cloud data warehouses (e.g., Azure Synapse, Snowflake) and data lakes (e.g., Azure Data Lake Storage) have massive scalability, making ELT increasingly popular.

How ETL Works Internally

Extract: Data is read from source systems (databases, APIs, files). This step can use full or incremental extraction. For example, using SQL queries like SELECT * FROM Orders WHERE OrderDate > @LastRun.

Transform: Data is cleaned, filtered, aggregated, joined, and enriched in a staging environment. Transformations include:

Data type conversions (e.g., string to date)

Deduplication

Handling NULLs

Business logic application (e.g., calculating total price = quantity * unit price)

Data quality checks (e.g., rejecting rows with invalid ZIP codes)

Load: The transformed data is written to the target system. In a data warehouse, this typically involves INSERT or UPDATE operations. For dimension tables, you might use slowly changing dimension (SCD) patterns.

Key Characteristics:

Staging area is required (could be a separate database, file system, or in-memory)

Transformation logic is defined and executed outside the target

Lower storage cost in target because only transformed data is stored

Higher upfront processing time before data is available in the target

Easier to enforce data quality before loading

How ELT Works Internally

Extract: Same as ETL — data is read from sources.

Load: Raw data is loaded directly into the target system, often into staging tables or a raw zone in a data lake. For example, in Azure Synapse, you might use COPY INTO to load data from Azure Blob Storage into a staging table.

Transform: Transformations are executed inside the target system using its compute engine (e.g., SQL queries, Spark jobs). The transformed results are often written to new tables or views.

Key Characteristics:

No separate staging environment needed

Target system must have sufficient compute and storage to handle raw data and transformations

Data is available faster after extraction (lower latency to load)

Full raw data is retained, enabling reprocessing and ad-hoc analysis

Transformation logic can be changed without re-extracting data

Higher storage cost in target (raw data + transformed data)

Key Components and Defaults

Azure Data Factory (ADF): A cloud-based ETL/ELT service. It uses pipelines containing activities. A typical ADF pipeline for ETL: 1. Copy activity (Extract) from source to staging 2. Data Flow activity (Transform) — runs on Azure Databricks or Spark 3. Copy activity (Load) to target

For ELT, you can use a Copy activity to load raw data into a staging table in Azure Synapse, then a stored procedure activity to run transformations.

Azure Synapse Analytics: Supports ELT natively with T-SQL. The COPY INTO command loads data from external sources into staging tables. Then you run CTAS (CREATE TABLE AS SELECT) or INSERT INTO SELECT to transform and load final tables.

PolyBase: In Azure Synapse, PolyBase enables loading data from Azure Blob Storage or Data Lake Store using external tables. It supports ELT patterns.

Azure Databricks: Often used for complex transformations in ELT pipelines. Data is loaded into Delta Lake tables (raw), then transformed using Spark SQL or DataFrame operations.

Default Values:

ADF pipeline timeout: default 12 hours, max 7 days

ADF copy activity concurrency: default 1, max 1 (for on-premises sources)

Azure Synapse DWU (Data Warehouse Units): minimum 100, maximum 30,000

PolyBase row size limit: 1 MB (rows larger than 1 MB cause errors)

Configuration and Verification Commands

Azure Data Factory:

Create pipeline: az datafactory pipeline create --factory-name ...

Trigger pipeline: az datafactory pipeline create-run --factory-name ...

Monitor pipeline runs: az datafactory pipeline-run query-by-factory --factory-name ...

Azure Synapse:

Load data with COPY INTO:

COPY INTO dbo.Orders
FROM 'https://mystorage.blob.core.windows.net/data/orders.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIRSTROW = 2
)

Transform with CTAS:

CREATE TABLE dbo.OrderSummary
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT CustomerID, SUM(TotalAmount) AS TotalSpent
FROM dbo.Orders
GROUP BY CustomerID;

Azure Databricks:

Load raw data:

df = spark.read.format("csv").option("header", "true").load("path/to/raw")
df.write.format("delta").save("path/to/raw_table")

Transform:

df_transformed = spark.sql("SELECT CustomerID, SUM(Amount) FROM raw_table GROUP BY CustomerID")
df_transformed.write.format("delta").save("path/to/summary")

Interaction with Related Technologies

Azure Data Lake Storage (ADLS): Often used as a staging area for both ETL and ELT. In ELT, ADLS serves as the raw data store.

Azure Blob Storage: Similar to ADLS but less feature-rich. Common for small-scale ETL.

Azure SQL Database: Can be a source or target. For ELT, you can load raw data into staging tables and use T-SQL to transform.

Azure HDInsight: Can be used for transformations in ETL pipelines (e.g., Hive, Spark).

Power BI: Consumes transformed data from data warehouses. ETL ensures clean data; ELT may require additional views.

Performance Considerations

ETL: Transformation can be scaled independently using Azure Databricks or ADF Data Flows. Staging storage costs are separate.

ELT: Transformation is limited by target system resources. In Azure Synapse, you can scale DWUs to improve performance. However, large raw datasets may require significant storage.

Data Volume: For very large datasets (>1 TB), ELT can be more efficient because it avoids moving data twice (extract -> staging -> target).

Latency: ETL has higher latency because data must be transformed before loading. ELT can load raw data quickly, but transformation may still take time.

Security and Compliance

ETL: Sensitive data can be masked or filtered during transformation, so only clean data reaches the target. This is important for GDPR, HIPAA.

ELT: Raw data containing PII may be loaded into the target, requiring additional access controls and encryption. Azure Synapse supports column-level security and dynamic data masking.

Common Pitfalls

Assuming ELT always faster: ELT loads data faster, but transformation time may be longer if target system is underpowered.

Ignoring data quality: ELT can lead to "garbage in, garbage out" if raw data is not validated.

Underestimating storage costs: ELT duplicates data (raw + transformed), increasing storage costs.

Not considering schema evolution: ELT handles schema changes well because raw data is preserved; ETL may break if source schema changes.

Walk-Through

1

1. Extract from Source

Data is read from source systems such as SQL databases, NoSQL stores, files (CSV, JSON, Parquet), or APIs. The extraction can be full (entire dataset) or incremental (only changes since last run). For example, a SQL query with a watermark column like `WHERE LastModified > @LastWatermark`. In Azure Data Factory, this is a Copy activity with a source dataset. The connector handles authentication (SQL auth, managed identity, service principal). Extraction performance depends on source throttling and network bandwidth. For on-premises sources, you must install a Self-Hosted Integration Runtime.

2

2. Stage Data (ETL only)

In ETL, extracted data is written to a staging area — typically a temporary table in a database, files in Azure Blob Storage, or an in-memory cache. The staging area is isolated from the target system to avoid impacting production workloads. Transformations are applied here. For example, using ADF Data Flows, data is processed in a Spark cluster. The staging area should have sufficient storage for intermediate results. In ELT, this step is skipped; raw data goes directly to the target.

3

3. Transform Data

Transformations clean, enrich, and restructure data. Common operations: filtering rows, converting data types, joining tables, aggregating, pivoting, and applying business rules. In ETL, transformations run in the staging environment using tools like ADF Data Flows, Azure Databricks, or HDInsight. In ELT, transformations are executed inside the target system using SQL or Spark. For example, in Azure Synapse, you run CTAS statements. Transformations can be batch (scheduled) or streaming (near real-time). Complexity ranges from simple column renaming to multi-step pipelines with error handling.

4

4. Load into Target

Transformed data is written to the target system. In ETL, this is the final load after transformation. In ELT, this is the initial load of raw data. The target can be a data warehouse (Azure Synapse), data lake (ADLS), or database (Azure SQL). Loading strategies include: full refresh (truncate and load), incremental upsert (merge), or append-only. For large volumes, use bulk insert techniques like PolyBase or COPY INTO. In Azure Synapse, you can use hash distribution to optimize query performance. After loading, you may need to update statistics and rebuild indexes.

5

5. Post-Load Validation

After loading, verify data integrity and completeness. This includes row counts, checksums, data quality checks (e.g., no NULLs in required columns), and reconciliation with source totals. In ADF, you can use a validation activity or a stored procedure. In ELT, you might query the raw and transformed tables to ensure transformations are correct. If errors are found, you may need to reprocess. Logging and alerting are critical — Azure Monitor can track pipeline failures. Regularly audit data lineage to ensure compliance.

What This Looks Like on the Job

Enterprise Scenario 1: Retail Analytics with ETL

A large retail chain uses ETL to consolidate sales data from thousands of stores into a central data warehouse for reporting. Each store sends daily CSV files to an Azure Blob Storage container. An ADF pipeline runs nightly: it extracts the CSV files, transforms them by cleaning currency formats, converting time zones, and aggregating sales by product and store, then loads the aggregated data into Azure Synapse. The staging environment is a set of Azure Databricks notebooks that run on a 20-node cluster. The transformation includes joining with product master data from an on-premises SQL Server (via Self-Hosted IR). The pipeline processes about 500 GB of raw data daily. A common problem is schema drift — if a store adds a new column, the transformation breaks. The solution is to use ADF's schema drift handling in Data Flows, which allows mapping columns dynamically. Misconfiguration of the staging cluster (e.g., too few nodes) causes timeouts and partial loads. The team sets up alerts for pipeline failures and uses data quality checks to reject malformed files.

Enterprise Scenario 2: IoT Data with ELT

A manufacturing company collects sensor data from factory equipment into Azure Event Hubs. Data is streamed into Azure Data Lake Storage Gen2 in raw JSON format. An ELT pipeline runs hourly: first, raw data is loaded into external tables in Azure Synapse using PolyBase. Then, T-SQL stored procedures transform the data — parsing JSON, calculating averages, and joining with equipment metadata. The target is a set of tables for dashboards in Power BI. The advantage of ELT is that raw data is always available for ad-hoc analysis (e.g., investigating a specific sensor failure). The challenge is storage costs: raw data grows at 2 TB per month, and transformed tables add another 500 GB. The team uses partitioning and file format optimization (Parquet with snappy compression) to reduce costs. Performance issues arise when too many concurrent transformation queries run — they scale Synapse DWUs dynamically using Azure Automation. A misconfiguration (e.g., forgetting to partition large tables) leads to full table scans and slow dashboards.

Enterprise Scenario 3: Hybrid with Azure Data Factory

A financial services firm uses a hybrid approach: ETL for sensitive data (customer PII) and ELT for non-sensitive operational data. ADF orchestrates both. For PII data, transformations (masking, encryption) happen in a secure staging environment using Azure Databricks with managed identity access to Azure Key Vault. For operational logs, raw data is loaded into a separate zone in ADLS and transformed later using Azure Synapse. The pipeline runs every 15 minutes for near-real-time analytics. The main challenge is managing two different pipelines with different error handling and monitoring. The team uses ADF's metadata-driven approach to parameterize pipelines. A common mistake is using the same staging area for both ETL and ELT, causing data leakage — they keep strict separation using different containers and RBAC roles.

How DP-900 Actually Tests This

What DP-900 Tests on ETL vs. ELT

DP-900 objective 1.1 expects you to describe core data concepts, including batch and streaming data processing, and the differences between ETL and ELT. Specifically, you must:

Identify the order of operations in ETL vs. ELT

Recognize scenarios where each is appropriate

Understand the role of staging in ETL

Know that ELT leverages the target system's compute

Common Wrong Answers and Why Candidates Choose Them

1. "ETL is always faster than ELT because it transforms before loading." Wrong. ETL has higher initial latency because transformation must complete before loading. ELT loads raw data faster, but total time may be similar. Candidates confuse 'faster to load' with 'faster overall.'

2. "ELT does not require any transformation." Wrong. ELT still transforms data — it just does it after loading. Candidates think 'load' means no transformation, but transformation is a separate step.

3. "ETL is used only for on-premises data." Wrong. ETL is used in cloud too (e.g., ADF). Candidates associate ETL with legacy systems.

4. "In ELT, data is transformed before extraction." Wrong. That would be 'extract after transform' — not a standard pattern. Candidates confuse the order.

Specific Numbers and Terms

ETL: Extract, Transform, Load — order matters.

ELT: Extract, Load, Transform — 'L' before 'T'.

Staging area: Required in ETL, optional in ELT (raw zone serves as staging).

Azure Synapse: Uses COPY INTO for loading (ELT).

Azure Data Factory: Supports both patterns via pipelines.

Data lake: Often used with ELT (raw data stored).

Data warehouse: Traditionally used with ETL, but modern warehouses support ELT.

Edge Cases and Exceptions

Streaming data: Both ETL and ELT can be used, but streaming typically uses ELT-like patterns (load raw event, transform later).

Small data: For small datasets (<10 GB), ETL may be simpler; ELT overhead may not be justified.

Compliance: If raw data cannot be stored in target due to regulations, ETL is mandatory.

Schema-on-read: ELT fits schema-on-read paradigms (raw data stored, schema applied at query time).

How to Eliminate Wrong Answers

If the question mentions 'transformation before loading' or 'staging area', it's ETL.

If the question says 'raw data loaded first' or 'uses target system compute', it's ELT.

Look for keywords: 'cleanse', 'validate', 'enrich' often appear with ETL (since transformation is separate).

'Scalable compute' or 'data lake' hint at ELT.

Remember: ETL is like preparing ingredients before cooking; ELT is like cooking and then seasoning.

Key Takeaways

ETL order: Extract -> Transform -> Load; ELT order: Extract -> Load -> Transform.

ETL uses a staging area; ELT loads raw data directly into the target.

Azure Data Factory supports both ETL and ELT patterns.

ELT leverages target system compute (e.g., Azure Synapse, Databricks).

ETL is preferred when data quality and compliance require transformation before loading.

ELT is preferred for large volumes of raw data that need flexible schema-on-read.

Hybrid pipelines can combine ETL and ELT for different data sets.

COPY INTO is a common ELT command in Azure Synapse.

PolyBase enables loading from Azure Storage into Synapse for ELT.

Incremental extraction is key for both patterns to reduce data movement.

Staging in ETL can be Azure Blob Storage, ADLS, or a temporary SQL database.

Data Flow in ADF is a visual ETL tool that runs on Spark.

ELT retains raw data for reprocessing and ad-hoc analysis.

ETL reduces target storage but increases transformation complexity.

DP-900 expects you to identify the correct pipeline order and use case.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

ETL

Transform before load

Requires staging area

Lower storage cost in target (no raw data)

Higher latency to data availability

Easier to enforce data quality before loading

Better for compliance (PII masking before load)

ELT

Load before transform

No separate staging needed (raw zone in target)

Higher storage cost (raw + transformed)

Lower latency to load raw data

Flexible transformation (schema-on-read)

Requires target system with strong compute

Watch Out for These

Mistake

ETL is obsolete; everyone uses ELT now.

Correct

Both are widely used. ETL remains important for compliance, data quality, and scenarios where the target system lacks compute power. Azure Data Factory supports both equally.

Mistake

ELT does not require any transformation logic.

Correct

ELT still transforms data, but the transformation happens inside the target system using SQL, Spark, or other compute. The difference is timing, not existence.

Mistake

In ETL, data is loaded into the target system before transformation.

Correct

That describes ELT. In ETL, transformation occurs in a staging area, and only after transformation is data loaded into the target.

Mistake

ELT is always faster than ETL.

Correct

ELT loads raw data faster, but total pipeline time (extract + load + transform) can be longer if the target system is slow. ETL may be faster when transformation is simple and staging is efficient.

Mistake

ETL and ELT are mutually exclusive; you must choose one for the entire pipeline.

Correct

Hybrid approaches exist. For example, you can use ETL for sensitive data and ELT for non-sensitive data within the same organization, orchestrated by ADF.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the main difference between ETL and ELT?

The main difference is the order of transformation relative to loading. In ETL, data is transformed in a staging area before being loaded into the target. In ELT, data is loaded raw into the target first, and transformations are performed inside the target system. This affects latency, storage, and compute requirements. For DP-900, remember the order: E-T-L vs. E-L-T.

When should I use ETL instead of ELT?

Use ETL when you need to enforce data quality before data reaches the target, when compliance requires masking or filtering PII before storage, or when the target system lacks compute power for transformations. ETL is also beneficial when you want to reduce storage costs in the target by only storing transformed data.

Does Azure Data Factory support both ETL and ELT?

Yes. Azure Data Factory supports ETL using Data Flows (transformations in Spark) or external compute like Azure Databricks. It supports ELT by using Copy Activity to load raw data into a target like Azure Synapse, then using a Stored Procedure Activity to run transformations. ADF is a flexible orchestration tool.

What is a staging area in ETL?

A staging area is an intermediate storage location where extracted data is temporarily held before transformation. It can be a database table, a file in Azure Blob Storage, or an in-memory cache. The staging area isolates transformation workloads from source and target systems, and allows for error handling and data validation.

How does ELT handle large data volumes?

ELT scales by leveraging the target system's distributed compute. For example, Azure Synapse can scale out to hundreds of compute nodes, and Azure Databricks can spin up clusters. Raw data is loaded once, and transformations can be run on demand. However, storage costs increase because raw data is retained.

Can I combine ETL and ELT in the same pipeline?

Yes. This is called a hybrid approach. For instance, you might use ETL for sensitive data that needs masking before loading, and ELT for non-sensitive logs that can be transformed later. Azure Data Factory can orchestrate both patterns in separate pipelines or even within the same pipeline using conditional logic.

What is the role of PolyBase in ELT?

PolyBase is a technology in Azure Synapse that allows querying external data sources (like Azure Blob Storage) using T-SQL. It enables the 'Load' step in ELT by loading data from external files into staging tables. Then you can transform using CTAS or INSERT INTO SELECT. PolyBase is a key component for ELT in Synapse.

Terms Worth Knowing

Ready to put this to the test?

You've just covered ETL vs ELT Data Pipelines — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.

Done with this chapter?