DP-900Chapter 76 of 101Objective 3.1

Data Lakehouse vs Data Warehouse vs Data Lake

When planning data storage and analytics in Azure, you need to understand the three primary architectures: Data Lake, Data Warehouse, and Data Lakehouse. Understanding the differences is critical for the DP-900 exam, as questions on this topic appear in approximately 15-20% of the exam, primarily within objective 3.1 (Describe analytics workloads). You will learn the core characteristics, use cases, and how Azure services like Azure Synapse Analytics, Azure Data Lake Storage, and Azure Databricks implement these architectures. By the end, you will be able to distinguish between them and choose the right one for a given scenario.

25 min read

Intermediate

Updated Jul 21, 2026

Reviewed by Johnson Ajibi· Senior Network & Security Engineer · MSc IT Security

Jump to a section

Explain it to me simply Where people get tripped up Test what I know Look up key terms

The Kitchen: Recipe Box, Pantry, and Prep Station

Picture a busy restaurant kitchen. A Data Warehouse is like a chef's recipe box: each recipe is a predefined, structured dish (e.g., 'Margherita Pizza') with exact ingredients and steps. The chef follows the recipe to produce consistent, high-quality results. The recipes are optimized for known questions (e.g., 'What is the most popular pizza?'). A Data Lake is like a massive pantry and refrigerator: it holds all raw ingredients—vegetables, meats, spices, sauces—in their original form. The chef can grab any ingredient to create a new dish, but it takes time to prepare and combine them. There is no predefined structure; the chef decides how to use each ingredient. A Data Lakehouse is like a modern prep station that combines both: it has a pantry with all raw ingredients AND a set of standardized, partially prepared items (like pre-chopped vegetables or pre-marinated meat) that can be used directly in recipes. The station also has a catalog listing every ingredient and its current state. The chef can quickly access raw materials for experimentation or use the prepped items for fast, consistent dishes. The lakehouse provides structure (like the prepped items) on top of raw storage (the pantry), enabling both ad-hoc exploration and reliable reporting. In the kitchen, the chef (data engineer or analyst) uses the same space for both creative cooking (data science) and menu execution (business intelligence).

How It Actually Works

1. What They Are and Why They Exist

Data architecture has evolved to meet different analytic needs. Data Warehouses emerged in the 1980s to support business intelligence (BI) and reporting. They store structured, processed data optimized for query performance. Data Lakes arose in the 2010s to handle big data, storing raw data in its native format (structured, semi-structured, or unstructured) for data science and exploratory analytics. Data Lakehouses combine the flexibility of data lakes with the reliability and performance of data warehouses, enabling both BI and machine learning (ML) on a single platform.

2. How They Work Internally

Data Warehouse: Data is ingested through ETL (Extract, Transform, Load) processes. Extraction pulls data from sources, transformation cleans and structures it (e.g., applying schemas, aggregations), and loading writes it into a relational database optimized for queries. Storage uses columnar formats (e.g., columnstore indexes in Azure Synapse dedicated SQL pool) for compression and fast scan rates. Queries are executed using SQL engines that leverage indexing, partitioning, and materialized views. Typical latency: hours (batch processing).

Data Lake: Data is ingested in its raw form (e.g., CSV, JSON, Parquet, images) to cheap, scalable object storage like Azure Data Lake Storage Gen2 (ADLS Gen2). A schema-on-read approach is used: the schema is applied at query time, not at write time. Processing engines like Apache Spark (via Azure Databricks or Synapse Spark pools) read the data, apply transformations, and produce results. Data lakes support batch, streaming, and interactive analytics. Storage is hierarchical (ADLS Gen2) with POSIX-like access control lists (ACLs).

Data Lakehouse: Combines data lake storage (ADLS Gen2) with a metadata layer (e.g., Delta Lake, Apache Iceberg) that provides ACID transactions, schema enforcement, and versioning. The storage layer remains cheap and scalable, while the metadata layer enables warehouse-like features: time travel (querying previous versions), upserts (merge operations), and data quality constraints. Compute engines (Spark, SQL) access the same data concurrently. Delta Lake, used in Azure Databricks and Synapse, is the most common implementation.

3. Key Components, Values, Defaults, and Timers

Azure Data Lake Storage Gen2: Combines blob storage with a hierarchical namespace. Default redundancy: locally redundant storage (LRS) or geo-redundant (GRS). Access tiers: hot, cool, archive. Default blob type: block blob. Maximum storage account size: 500 TiB (with large file shares enabled, up to 100 PiB).

Azure Synapse Analytics Dedicated SQL Pool: Formerly SQL Data Warehouse. Distribution types: hash, round-robin, replicated. Table types: heap, clustered columnstore, clustered index. Default distribution: round-robin for staging tables; hash for fact tables. Performance level: DW100c to DW30000c. Scaling up/down takes 1-2 minutes. Pause/resume: immediate.

Delta Lake: Default protocol version: 1 (supports time travel, schema enforcement). Log retention: 30 days (configurable). Checkpoint interval: every 10 commits. Vacuum threshold: 7 days (default retention for deleted files).

Azure Databricks: Cluster types: interactive (for ad-hoc), job (for automated runs). Default runtime: latest LTS (e.g., 13.3 LTS). Auto-scaling: enabled by default. Worker types: general-purpose, memory-optimized, GPU. Default Spark configuration: 2.4 GHz cores, 8 GB RAM per core.

PolyBase: Used in Synapse to query external data in ADLS Gen2. Default format: Parquet. External file format options: CSV, Parquet, ORC, JSON, Avro. Rejected row location: specified by user.

4. Configuration and Verification Commands

Create an external table in Synapse using PolyBase:

-- Create external data source
CREATE EXTERNAL DATA SOURCE MyDataSource
WITH (
    LOCATION = 'abfss://container@storageaccount.dfs.core.windows.net',
    CREDENTIAL = MyCredential
);

-- Create external file format
CREATE EXTERNAL FILE FORMAT MyFileFormat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

-- Create external table
CREATE EXTERNAL TABLE dbo.SalesExternal (
    SaleID INT,
    ProductID INT,
    SaleDate DATE,
    Amount DECIMAL(10,2)
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = MyDataSource,
    FILE_FORMAT = MyFileFormat
);

Verify Delta Lake table history:

-- In Databricks or Synapse Spark pool
DESCRIBE HISTORY sales_table;
-- Shows operations: write, delete, merge, etc.

Check ADLS Gen2 hierarchical namespace:

az storage account show --name mystorageaccount --resource-group myrg --query "isHnsEnabled"

5. Interaction with Related Technologies

Azure Data Factory: Orchestrates ETL/ELT pipelines for data lakes and warehouses. Supports 90+ connectors. Common pattern: copy data from on-prem SQL Server to ADLS Gen2 (data lake), then load into Synapse dedicated SQL pool (warehouse).

Power BI: Directly queries Synapse dedicated SQL pool via T-SQL, or uses DirectQuery on data lake files via Synapse serverless SQL pool. For data lakehouse, Power BI can read Delta tables through the Delta connector.

Azure Machine Learning: Uses data lake for raw training data; data lakehouse provides feature stores (via Delta tables) for reusable ML features.

Azure Purview: Catalogs and governs data across lake, warehouse, and lakehouse. Scans ADLS Gen2 for schema, lineage, and classification.

6. Performance and Optimization

Data Warehouse: Use clustered columnstore indexes for fact tables (compression ratio 5x-10x). Partition large tables by date (e.g., monthly). Use materialized views for pre-aggregated results. Distribute fact tables on join keys to minimize data movement.

Data Lake: Use Parquet format for analytical queries (columnar, compressed). Partition data by date or region (e.g., /year=2024/month=01/). Use Azure Data Lake Storage tiering: hot for active data, cool for infrequent, archive for long-term.

Data Lakehouse: Optimize Delta tables with Z-ordering (similar to indexing) on frequently filtered columns. Use OPTIMIZE command to compact small files (target file size: 256 MB-1 GB). Use VACUUM to remove old versions beyond retention period.

7. Common Exam Traps

Trap 1: Assuming data lakes only store unstructured data. Reality: They store structured, semi-structured, and unstructured data.

Trap 2: Believing data warehouses support schema-on-read. Reality: They use schema-on-write (ETL).

Trap 3: Thinking data lakehouse is just a data lake with a different name. Reality: It adds ACID transactions, schema enforcement, and indexing.

Trap 4: Confusing Azure Synapse Analytics dedicated SQL pool with serverless SQL pool. Dedicated is provisioned (warehouse); serverless is on-demand (lake querying).

Trap 5: Forgetting that data lakehouse uses the same storage as data lake (ADLS Gen2) but adds a transactional metadata layer.

Walk-Through

Ingest raw data into data lake

Data is extracted from sources (e.g., IoT devices, databases, logs) and copied to Azure Data Lake Storage Gen2 using Azure Data Factory, Event Hubs, or direct upload. Files are stored in their native format (CSV, JSON, Parquet, etc.) without transformation. The hierarchical namespace is organized by folders (e.g., /raw/2024/01/). Access is controlled via RBAC and ACLs. At this stage, no schema is enforced; the data is 'schema-on-read' ready.

Transform data for data warehouse

Using Azure Data Factory or Synapse Pipelines, raw data from the data lake is transformed via ETL/ELT. Transformations include data cleansing (removing nulls, correcting formats), joining multiple sources, aggregating (e.g., daily sales totals), and converting to a star schema (fact and dimension tables). The transformed data is then loaded into the data warehouse (Synapse dedicated SQL pool) using PolyBase or COPY INTO command. This step enforces schema-on-write.

Create Delta Lake table on data lake

In Azure Databricks or Synapse Spark, raw or transformed data is written as a Delta Lake table. The write operation uses Delta Lake's ACID transaction log to record metadata. The table is stored as Parquet files in ADLS Gen2, with a _delta_log folder containing JSON transaction entries. Schema enforcement ensures that all writes conform to the defined schema. Time travel is enabled by keeping previous versions in the log.

Query data lakehouse with SQL

Using Azure Synapse Serverless SQL pool or Databricks SQL, users run ANSI SQL queries directly on Delta tables. The query engine reads the Delta transaction log to determine the latest version, then scans the underlying Parquet files using columnar pruning and predicate pushdown. For example: SELECT * FROM sales WHERE year = 2024. The engine returns results without needing to move data into a separate warehouse.

Run BI report on data warehouse

Power BI connects to the Synapse dedicated SQL pool using T-SQL. The query is executed by the SQL pool's massively parallel processing (MPP) engine, which distributes the query across 60 distributions (for DW1000c and above). Results are aggregated and returned to Power BI for visualization. The warehouse's columnstore indexes and materialized views ensure sub-second response for common reports.

Perform ML on data lakehouse

Data scientists use Azure Databricks to read Delta tables directly. They can use Spark MLlib or scikit-learn to train models on the data. The lakehouse allows feature engineering using SQL or Python, and features can be stored back as Delta tables for reuse. Model training benefits from the lakehouse's ability to handle large volumes of raw data and intermediate transformations without copying data.

What This Looks Like on the Job

Scenario 1: Retail Analytics with Data Lakehouse

A large retailer uses Azure Databricks with Delta Lake on ADLS Gen2 to unify their data. Previously, they had a data warehouse for sales reports and a data lake for clickstream logs. Analysts had to use separate tools. With a lakehouse, they ingest point-of-sale transactions (structured) and web clickstream (semi-structured) into Delta tables. They run BI reports using Databricks SQL and train recommendation models using Spark ML. Issues arise when Delta table optimization is neglected: small file accumulation (from streaming) degrades query performance. They mitigate by running OPTIMIZE and VACUUM nightly. They set Delta log retention to 7 days to control storage costs.

Scenario 2: Financial Reporting with Data Warehouse

A bank uses Azure Synapse dedicated SQL pool for regulatory reporting. Data from core banking systems is extracted nightly, transformed in Azure Data Factory (cleaning, aggregating, applying business rules), and loaded into the warehouse. The warehouse uses hash-distributed fact tables on account ID to minimize data movement. Reports are generated via Power BI with DirectQuery. Common misconfiguration: using round-robin distribution for large fact tables, causing excessive data shuffling. The fix: switch to hash distribution on the join key. Performance considerations: scaling up the SQL pool during batch loads (e.g., from DW1000c to DW2000c) and scaling down after.

Scenario 3: IoT Telemetry with Data Lake

An energy company collects sensor data from wind turbines (JSON, 10 TB/day) into ADLS Gen2. They use Azure Data Lake as a landing zone for raw telemetry. Data scientists access the raw files with Azure Databricks to build anomaly detection models. The lake's schema-on-read allows them to experiment with different schemas without reprocessing. Problems arise when many small files are created (e.g., 10,000 files per hour), overwhelming the NameNode in HDFS-like namespace. They implement a file compaction job using Spark to coalesce files into 256 MB chunks. They also use Azure Data Lake Storage tiering: data older than 30 days moves to cool tier, older than 1 year to archive.

How DP-900 Actually Tests This

DP-900 Objective Focus

This topic maps to objective 3.1: Describe analytics workloads. The exam tests your ability to differentiate between data lake, data warehouse, and data lakehouse. Expect 4-6 questions. Key sub-objectives:

Identify appropriate use cases for each.

Recognize the characteristics: schema-on-read vs. schema-on-write, ACID transactions, storage format.

Know which Azure service implements which: Azure Synapse Analytics (dedicated SQL pool) = data warehouse; Azure Data Lake Storage = data lake; Azure Databricks with Delta Lake = data lakehouse.

Common Wrong Answers

Wrong: Data lakes require schema-on-write. Many candidates confuse ETL with schema-on-write. Reality: Data lakes use schema-on-read; schemas are applied at query time.

Wrong: Data warehouses store unstructured data. Data warehouses store structured, processed data. Unstructured data is stored in data lakes.

Wrong: Data lakehouse is the same as data lake. The key difference is ACID transactions and schema enforcement in lakehouse.

Wrong: Azure Synapse Analytics is only a data warehouse. Synapse is a unified analytics platform that includes both dedicated SQL pool (warehouse) and serverless SQL pool (for lake queries) and Spark pools.

Specific Numbers and Terms on Exam

ETL vs ELT: ETL (Extract-Transform-Load) for warehouse; ELT (Extract-Load-Transform) for data lake.

Parquet: Default format for analytics in Azure (columnar, compressed).

Delta Lake: Open-source storage layer that adds ACID to data lakes.

ADLS Gen2: Hierarchical namespace support; POSIX ACLs.

PolyBase: Technology to query external data in Synapse.

Time travel: Delta Lake feature to query previous versions.

Edge Cases and Exceptions

Hybrid scenarios: A data lake can feed a data warehouse (e.g., using PolyBase). This is common but the exam may ask you to identify the primary architecture.

Serverless SQL pool: Queries data lake files directly without a provisioned warehouse. This is a lake querying pattern, not a warehouse.

Delta Lake on Synapse: Synapse Spark pools can create Delta tables, making Synapse a lakehouse platform.

How to Eliminate Wrong Answers

If a question asks for a solution that requires ACID transactions and BI reporting on raw data, eliminate data lake (no ACID) and data warehouse (requires ETL). The correct answer is data lakehouse. If the question emphasizes low-cost storage for raw data with schema flexibility, eliminate warehouse. If the question focuses on predefined reports with fast query performance, eliminate data lake.

Key Takeaways

Data warehouses use schema-on-write; data lakes use schema-on-read.

Data lakehouse adds ACID transactions and schema enforcement to data lakes via Delta Lake.

Azure Synapse Analytics dedicated SQL pool is a data warehouse; Azure Data Lake Storage is a data lake; Azure Databricks with Delta Lake is a data lakehouse.

ETL is used for data warehouses; ELT is used for data lakes.

Parquet is the recommended file format for analytical queries in Azure due to columnar storage and compression.

Delta Lake enables time travel, allowing queries to previous versions of data.

PolyBase allows Synapse dedicated SQL pool to query external data in ADLS Gen2.

Data lakehouse supports both BI (via SQL) and ML (via Spark) on the same data.

Common exam trap: confusing serverless SQL pool (lake querying) with dedicated SQL pool (warehouse).

Data lakehouse reduces data duplication by providing a single copy of data for multiple workloads.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Data Warehouse

Schema-on-write: data must be transformed before loading

Stores structured, processed data only

Optimized for BI and reporting with fast query performance

Uses ETL pipelines; data is cleansed and aggregated

Expensive storage; high performance compute

Data Lake

Schema-on-read: schema applied at query time

Stores raw data in any format (structured, semi-structured, unstructured)

Optimized for data science and exploratory analytics

Uses ELT pipelines; data loaded raw, transformed later

Cheap storage; scalable compute (e.g., Spark)

Data Lakehouse

Supports both structured and raw data

ACID transactions on data lake storage

Single platform for BI and ML

Schema enforcement optional (Delta Lake enforces)

Lower cost than warehouse for storage

Data Warehouse

Only structured, processed data

ACID transactions natively

Primarily for BI and reporting

Schema enforced at write time

Higher storage cost due to proprietary formats

Data Lakehouse

Adds ACID transactions and versioning

Schema enforcement via Delta Lake

Supports upserts and deletes (merge operations)

Optimized for both BI and ML

Requires a metadata layer (e.g., Delta Lake)

Data Lake

No ACID transactions; eventual consistency

No schema enforcement (schema-on-read)

Append-only; updates and deletes are difficult

Primarily for data science and exploration

Simple storage without metadata layer

Watch Out for These

Mistake

Data lakes only store unstructured data like images and videos.

Correct

Data lakes store structured (e.g., CSV, Parquet), semi-structured (JSON, XML), and unstructured data. They are schema-agnostic at ingestion.

Mistake

Data warehouses support schema-on-read like data lakes.

Correct

Data warehouses enforce schema-on-write: data must conform to a predefined schema before loading. Queries use that fixed schema.

Mistake

Data lakehouse is just a marketing term for a data lake.

Correct

Data lakehouse adds a transactional metadata layer (e.g., Delta Lake) that provides ACID compliance, schema enforcement, and versioning, which a raw data lake lacks.

Mistake

Azure Synapse Analytics is only a data warehouse service.

Correct

Synapse is a unified analytics platform that includes dedicated SQL pool (warehouse), serverless SQL pool (lake querying), and Apache Spark pools (data engineering and data science).

Mistake

You cannot run BI tools like Power BI directly on a data lake.

Correct

Using Azure Synapse serverless SQL pool or Databricks SQL, you can run T-SQL queries on data lake files and connect Power BI to those endpoints.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the difference between a data lake and a data warehouse?

A data lake stores raw data in its native format (schema-on-read), while a data warehouse stores processed, structured data (schema-on-write). Data lakes are cheaper and used for data science; data warehouses are optimized for BI and reporting with fast query performance. In Azure, ADLS Gen2 is a data lake, and Synapse dedicated SQL pool is a data warehouse.

What is a data lakehouse and how does it relate to Delta Lake?

A data lakehouse combines the flexibility of a data lake with the reliability of a data warehouse. Delta Lake is an open-source storage layer that provides ACID transactions, schema enforcement, and versioning on top of data lake storage (e.g., ADLS Gen2). Azure Databricks and Synapse Spark pools use Delta Lake to implement the lakehouse architecture.

Can I use Power BI with a data lake?

Yes, you can use Power BI to query data in a data lake by using Azure Synapse serverless SQL pool or Azure Databricks SQL. These services provide SQL endpoints that read data directly from ADLS Gen2. Power BI can connect via DirectQuery or import mode.

What is the role of Azure Synapse Analytics in data lakehouse?

Azure Synapse Analytics is a unified analytics platform that supports data warehouse (dedicated SQL pool), data lake querying (serverless SQL pool), and data engineering/ML (Spark pools). With Delta Lake support, Synapse can function as a data lakehouse, allowing you to run BI and ML on the same data without moving it.

What is the default file format for analytical queries in Azure?

Parquet is the default and recommended file format for analytical workloads in Azure. It is columnar, compressed, and optimized for read-heavy queries. Azure Synapse, Databricks, and Data Factory all have native support for Parquet.

What is time travel in Delta Lake?

Time travel is a feature of Delta Lake that allows you to query previous versions of a table. It is enabled by the transaction log, which records every change. You can use syntax like `SELECT * FROM table VERSION AS OF 123` or `TIMESTAMP AS OF '2024-01-01'`. The default retention for old versions is 7 days.

What is the difference between ETL and ELT?

ETL (Extract, Transform, Load) transforms data before loading into the target (typically a data warehouse). ELT (Extract, Load, Transform) loads raw data first (into a data lake) and transforms it later at query time. Data warehouses use ETL; data lakes use ELT.

Terms Worth Knowing

Data Non-relational database Relational database

Ready to put this to the test?

You've just covered Data Lakehouse vs Data Warehouse vs Data Lake — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.

Try DP-900 practice questions Back to all chapters

Done with this chapter?

Lambda and Kappa Data Architectures

Azure Data Lake Storage Gen2 Hierarchical Namespace

See the full DP-900 study guide