This chapter covers the three primary data storage and analytics architectures in Azure: Data Lake, Data Warehouse, and Data Lakehouse. Understanding the differences is critical for the DP-900 exam, as questions on this topic appear in approximately 15-20% of the exam, primarily within objective 3.1 (Describe analytics workloads). You will learn the core characteristics, use cases, and how Azure services like Azure Synapse Analytics, Azure Data Lake Storage, and Azure Databricks implement these architectures. By the end, you will be able to distinguish between them and choose the right one for a given scenario.
Jump to a section
Imagine a restaurant kitchen. A Data Warehouse is like a chef's recipe box: each recipe is a predefined, structured dish (e.g., 'Margherita Pizza') with exact ingredients and steps. The chef follows the recipe to produce consistent, high-quality results. The recipes are optimized for known questions (e.g., 'What is the most popular pizza?'). A Data Lake is like a massive pantry and refrigerator: it holds all raw ingredients—vegetables, meats, spices, sauces—in their original form. The chef can grab any ingredient to create a new dish, but it takes time to prepare and combine them. There is no predefined structure; the chef decides how to use each ingredient. A Data Lakehouse is like a modern prep station that combines both: it has a pantry with all raw ingredients AND a set of standardized, partially prepared items (like pre-chopped vegetables or pre-marinated meat) that can be used directly in recipes. The station also has a catalog listing every ingredient and its current state. The chef can quickly access raw materials for experimentation or use the prepped items for fast, consistent dishes. The lakehouse provides structure (like the prepped items) on top of raw storage (the pantry), enabling both ad-hoc exploration and reliable reporting. In the kitchen, the chef (data engineer or analyst) uses the same space for both creative cooking (data science) and menu execution (business intelligence).
1. What They Are and Why They Exist
Data architecture has evolved to meet different analytic needs. Data Warehouses emerged in the 1980s to support business intelligence (BI) and reporting. They store structured, processed data optimized for query performance. Data Lakes arose in the 2010s to handle big data, storing raw data in its native format (structured, semi-structured, or unstructured) for data science and exploratory analytics. Data Lakehouses combine the flexibility of data lakes with the reliability and performance of data warehouses, enabling both BI and machine learning (ML) on a single platform.
2. How They Work Internally
Data Warehouse: Data is ingested through ETL (Extract, Transform, Load) processes. Extraction pulls data from sources, transformation cleans and structures it (e.g., applying schemas, aggregations), and loading writes it into a relational database optimized for queries. Storage uses columnar formats (e.g., columnstore indexes in Azure Synapse dedicated SQL pool) for compression and fast scan rates. Queries are executed using SQL engines that leverage indexing, partitioning, and materialized views. Typical latency: hours (batch processing).
Data Lake: Data is ingested in its raw form (e.g., CSV, JSON, Parquet, images) to cheap, scalable object storage like Azure Data Lake Storage Gen2 (ADLS Gen2). A schema-on-read approach is used: the schema is applied at query time, not at write time. Processing engines like Apache Spark (via Azure Databricks or Synapse Spark pools) read the data, apply transformations, and produce results. Data lakes support batch, streaming, and interactive analytics. Storage is hierarchical (ADLS Gen2) with POSIX-like access control lists (ACLs).
Data Lakehouse: Combines data lake storage (ADLS Gen2) with a metadata layer (e.g., Delta Lake, Apache Iceberg) that provides ACID transactions, schema enforcement, and versioning. The storage layer remains cheap and scalable, while the metadata layer enables warehouse-like features: time travel (querying previous versions), upserts (merge operations), and data quality constraints. Compute engines (Spark, SQL) access the same data concurrently. Delta Lake, used in Azure Databricks and Synapse, is the most common implementation.
3. Key Components, Values, Defaults, and Timers
Azure Data Lake Storage Gen2: Combines blob storage with a hierarchical namespace. Default redundancy: locally redundant storage (LRS) or geo-redundant (GRS). Access tiers: hot, cool, archive. Default blob type: block blob. Maximum storage account size: 500 TiB (with large file shares enabled, up to 100 PiB).
Azure Synapse Analytics Dedicated SQL Pool: Formerly SQL Data Warehouse. Distribution types: hash, round-robin, replicated. Table types: heap, clustered columnstore, clustered index. Default distribution: round-robin for staging tables; hash for fact tables. Performance level: DW100c to DW30000c. Scaling up/down takes 1-2 minutes. Pause/resume: immediate.
Delta Lake: Default protocol version: 1 (supports time travel, schema enforcement). Log retention: 30 days (configurable). Checkpoint interval: every 10 commits. Vacuum threshold: 7 days (default retention for deleted files).
Azure Databricks: Cluster types: interactive (for ad-hoc), job (for automated runs). Default runtime: latest LTS (e.g., 13.3 LTS). Auto-scaling: enabled by default. Worker types: general-purpose, memory-optimized, GPU. Default Spark configuration: 2.4 GHz cores, 8 GB RAM per core.
PolyBase: Used in Synapse to query external data in ADLS Gen2. Default format: Parquet. External file format options: CSV, Parquet, ORC, JSON, Avro. Rejected row location: specified by user.
4. Configuration and Verification Commands
Create an external table in Synapse using PolyBase:
-- Create external data source
CREATE EXTERNAL DATA SOURCE MyDataSource
WITH (
LOCATION = 'abfss://container@storageaccount.dfs.core.windows.net',
CREDENTIAL = MyCredential
);
-- Create external file format
CREATE EXTERNAL FILE FORMAT MyFileFormat
WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
-- Create external table
CREATE EXTERNAL TABLE dbo.SalesExternal (
SaleID INT,
ProductID INT,
SaleDate DATE,
Amount DECIMAL(10,2)
)
WITH (
LOCATION = '/sales/',
DATA_SOURCE = MyDataSource,
FILE_FORMAT = MyFileFormat
);Verify Delta Lake table history:
-- In Databricks or Synapse Spark pool
DESCRIBE HISTORY sales_table;
-- Shows operations: write, delete, merge, etc.Check ADLS Gen2 hierarchical namespace:
az storage account show --name mystorageaccount --resource-group myrg --query "isHnsEnabled"5. Interaction with Related Technologies
Azure Data Factory: Orchestrates ETL/ELT pipelines for data lakes and warehouses. Supports 90+ connectors. Common pattern: copy data from on-prem SQL Server to ADLS Gen2 (data lake), then load into Synapse dedicated SQL pool (warehouse).
Power BI: Directly queries Synapse dedicated SQL pool via T-SQL, or uses DirectQuery on data lake files via Synapse serverless SQL pool. For data lakehouse, Power BI can read Delta tables through the Delta connector.
Azure Machine Learning: Uses data lake for raw training data; data lakehouse provides feature stores (via Delta tables) for reusable ML features.
Azure Purview: Catalogs and governs data across lake, warehouse, and lakehouse. Scans ADLS Gen2 for schema, lineage, and classification.
6. Performance and Optimization
Data Warehouse: Use clustered columnstore indexes for fact tables (compression ratio 5x-10x). Partition large tables by date (e.g., monthly). Use materialized views for pre-aggregated results. Distribute fact tables on join keys to minimize data movement.
Data Lake: Use Parquet format for analytical queries (columnar, compressed). Partition data by date or region (e.g., /year=2024/month=01/). Use Azure Data Lake Storage tiering: hot for active data, cool for infrequent, archive for long-term.
Data Lakehouse: Optimize Delta tables with Z-ordering (similar to indexing) on frequently filtered columns. Use OPTIMIZE command to compact small files (target file size: 256 MB-1 GB). Use VACUUM to remove old versions beyond retention period.
7. Common Exam Traps
Trap 1: Assuming data lakes only store unstructured data. Reality: They store structured, semi-structured, and unstructured data.
Trap 2: Believing data warehouses support schema-on-read. Reality: They use schema-on-write (ETL).
Trap 3: Thinking data lakehouse is just a data lake with a different name. Reality: It adds ACID transactions, schema enforcement, and indexing.
Trap 4: Confusing Azure Synapse Analytics dedicated SQL pool with serverless SQL pool. Dedicated is provisioned (warehouse); serverless is on-demand (lake querying).
Trap 5: Forgetting that data lakehouse uses the same storage as data lake (ADLS Gen2) but adds a transactional metadata layer.
Ingest raw data into data lake
Data is extracted from sources (e.g., IoT devices, databases, logs) and copied to Azure Data Lake Storage Gen2 using Azure Data Factory, Event Hubs, or direct upload. Files are stored in their native format (CSV, JSON, Parquet, etc.) without transformation. The hierarchical namespace is organized by folders (e.g., /raw/2024/01/). Access is controlled via RBAC and ACLs. At this stage, no schema is enforced; the data is 'schema-on-read' ready.
Transform data for data warehouse
Using Azure Data Factory or Synapse Pipelines, raw data from the data lake is transformed via ETL/ELT. Transformations include data cleansing (removing nulls, correcting formats), joining multiple sources, aggregating (e.g., daily sales totals), and converting to a star schema (fact and dimension tables). The transformed data is then loaded into the data warehouse (Synapse dedicated SQL pool) using PolyBase or COPY INTO command. This step enforces schema-on-write.
Create Delta Lake table on data lake
In Azure Databricks or Synapse Spark, raw or transformed data is written as a Delta Lake table. The write operation uses Delta Lake's ACID transaction log to record metadata. The table is stored as Parquet files in ADLS Gen2, with a _delta_log folder containing JSON transaction entries. Schema enforcement ensures that all writes conform to the defined schema. Time travel is enabled by keeping previous versions in the log.
Query data lakehouse with SQL
Using Azure Synapse Serverless SQL pool or Databricks SQL, users run ANSI SQL queries directly on Delta tables. The query engine reads the Delta transaction log to determine the latest version, then scans the underlying Parquet files using columnar pruning and predicate pushdown. For example: SELECT * FROM sales WHERE year = 2024. The engine returns results without needing to move data into a separate warehouse.
Run BI report on data warehouse
Power BI connects to the Synapse dedicated SQL pool using T-SQL. The query is executed by the SQL pool's massively parallel processing (MPP) engine, which distributes the query across 60 distributions (for DW1000c and above). Results are aggregated and returned to Power BI for visualization. The warehouse's columnstore indexes and materialized views ensure sub-second response for common reports.
Perform ML on data lakehouse
Data scientists use Azure Databricks to read Delta tables directly. They can use Spark MLlib or scikit-learn to train models on the data. The lakehouse allows feature engineering using SQL or Python, and features can be stored back as Delta tables for reuse. Model training benefits from the lakehouse's ability to handle large volumes of raw data and intermediate transformations without copying data.
Scenario 1: Retail Analytics with Data Lakehouse
A large retailer uses Azure Databricks with Delta Lake on ADLS Gen2 to unify their data. Previously, they had a data warehouse for sales reports and a data lake for clickstream logs. Analysts had to use separate tools. With a lakehouse, they ingest point-of-sale transactions (structured) and web clickstream (semi-structured) into Delta tables. They run BI reports using Databricks SQL and train recommendation models using Spark ML. Issues arise when Delta table optimization is neglected: small file accumulation (from streaming) degrades query performance. They mitigate by running OPTIMIZE and VACUUM nightly. They set Delta log retention to 7 days to control storage costs.
Scenario 2: Financial Reporting with Data Warehouse
A bank uses Azure Synapse dedicated SQL pool for regulatory reporting. Data from core banking systems is extracted nightly, transformed in Azure Data Factory (cleaning, aggregating, applying business rules), and loaded into the warehouse. The warehouse uses hash-distributed fact tables on account ID to minimize data movement. Reports are generated via Power BI with DirectQuery. Common misconfiguration: using round-robin distribution for large fact tables, causing excessive data shuffling. The fix: switch to hash distribution on the join key. Performance considerations: scaling up the SQL pool during batch loads (e.g., from DW1000c to DW2000c) and scaling down after.
Scenario 3: IoT Telemetry with Data Lake
An energy company collects sensor data from wind turbines (JSON, 10 TB/day) into ADLS Gen2. They use Azure Data Lake as a landing zone for raw telemetry. Data scientists access the raw files with Azure Databricks to build anomaly detection models. The lake's schema-on-read allows them to experiment with different schemas without reprocessing. Problems arise when many small files are created (e.g., 10,000 files per hour), overwhelming the NameNode in HDFS-like namespace. They implement a file compaction job using Spark to coalesce files into 256 MB chunks. They also use Azure Data Lake Storage tiering: data older than 30 days moves to cool tier, older than 1 year to archive.
DP-900 Objective Focus
This topic maps to objective 3.1: Describe analytics workloads. The exam tests your ability to differentiate between data lake, data warehouse, and data lakehouse. Expect 4-6 questions. Key sub-objectives:
Identify appropriate use cases for each.
Recognize the characteristics: schema-on-read vs. schema-on-write, ACID transactions, storage format.
Know which Azure service implements which: Azure Synapse Analytics (dedicated SQL pool) = data warehouse; Azure Data Lake Storage = data lake; Azure Databricks with Delta Lake = data lakehouse.
Common Wrong Answers
Wrong: Data lakes require schema-on-write. Many candidates confuse ETL with schema-on-write. Reality: Data lakes use schema-on-read; schemas are applied at query time.
Wrong: Data warehouses store unstructured data. Data warehouses store structured, processed data. Unstructured data is stored in data lakes.
Wrong: Data lakehouse is the same as data lake. The key difference is ACID transactions and schema enforcement in lakehouse.
Wrong: Azure Synapse Analytics is only a data warehouse. Synapse is a unified analytics platform that includes both dedicated SQL pool (warehouse) and serverless SQL pool (for lake queries) and Spark pools.
Specific Numbers and Terms on Exam
ETL vs ELT: ETL (Extract-Transform-Load) for warehouse; ELT (Extract-Load-Transform) for data lake.
Parquet: Default format for analytics in Azure (columnar, compressed).
Delta Lake: Open-source storage layer that adds ACID to data lakes.
ADLS Gen2: Hierarchical namespace support; POSIX ACLs.
PolyBase: Technology to query external data in Synapse.
Time travel: Delta Lake feature to query previous versions.
Edge Cases and Exceptions
Hybrid scenarios: A data lake can feed a data warehouse (e.g., using PolyBase). This is common but the exam may ask you to identify the primary architecture.
Serverless SQL pool: Queries data lake files directly without a provisioned warehouse. This is a lake querying pattern, not a warehouse.
Delta Lake on Synapse: Synapse Spark pools can create Delta tables, making Synapse a lakehouse platform.
How to Eliminate Wrong Answers
If a question asks for a solution that requires ACID transactions and BI reporting on raw data, eliminate data lake (no ACID) and data warehouse (requires ETL). The correct answer is data lakehouse. If the question emphasizes low-cost storage for raw data with schema flexibility, eliminate warehouse. If the question focuses on predefined reports with fast query performance, eliminate data lake.
Data warehouses use schema-on-write; data lakes use schema-on-read.
Data lakehouse adds ACID transactions and schema enforcement to data lakes via Delta Lake.
Azure Synapse Analytics dedicated SQL pool is a data warehouse; Azure Data Lake Storage is a data lake; Azure Databricks with Delta Lake is a data lakehouse.
ETL is used for data warehouses; ELT is used for data lakes.
Parquet is the recommended file format for analytical queries in Azure due to columnar storage and compression.
Delta Lake enables time travel, allowing queries to previous versions of data.
PolyBase allows Synapse dedicated SQL pool to query external data in ADLS Gen2.
Data lakehouse supports both BI (via SQL) and ML (via Spark) on the same data.
Common exam trap: confusing serverless SQL pool (lake querying) with dedicated SQL pool (warehouse).
Data lakehouse reduces data duplication by providing a single copy of data for multiple workloads.
These come up on the exam all the time. Here's how to tell them apart.
Data Warehouse
Schema-on-write: data must be transformed before loading
Stores structured, processed data only
Optimized for BI and reporting with fast query performance
Uses ETL pipelines; data is cleansed and aggregated
Expensive storage; high performance compute
Data Lake
Schema-on-read: schema applied at query time
Stores raw data in any format (structured, semi-structured, unstructured)
Optimized for data science and exploratory analytics
Uses ELT pipelines; data loaded raw, transformed later
Cheap storage; scalable compute (e.g., Spark)
Data Lakehouse
Supports both structured and raw data
ACID transactions on data lake storage
Single platform for BI and ML
Schema enforcement optional (Delta Lake enforces)
Lower cost than warehouse for storage
Data Warehouse
Only structured, processed data
ACID transactions natively
Primarily for BI and reporting
Schema enforced at write time
Higher storage cost due to proprietary formats
Data Lakehouse
Adds ACID transactions and versioning
Schema enforcement via Delta Lake
Supports upserts and deletes (merge operations)
Optimized for both BI and ML
Requires a metadata layer (e.g., Delta Lake)
Data Lake
No ACID transactions; eventual consistency
No schema enforcement (schema-on-read)
Append-only; updates and deletes are difficult
Primarily for data science and exploration
Simple storage without metadata layer
Mistake
Data lakes only store unstructured data like images and videos.
Correct
Data lakes store structured (e.g., CSV, Parquet), semi-structured (JSON, XML), and unstructured data. They are schema-agnostic at ingestion.
Mistake
Data warehouses support schema-on-read like data lakes.
Correct
Data warehouses enforce schema-on-write: data must conform to a predefined schema before loading. Queries use that fixed schema.
Mistake
Data lakehouse is just a marketing term for a data lake.
Correct
Data lakehouse adds a transactional metadata layer (e.g., Delta Lake) that provides ACID compliance, schema enforcement, and versioning, which a raw data lake lacks.
Mistake
Azure Synapse Analytics is only a data warehouse service.
Correct
Synapse is a unified analytics platform that includes dedicated SQL pool (warehouse), serverless SQL pool (lake querying), and Apache Spark pools (data engineering and data science).
Mistake
You cannot run BI tools like Power BI directly on a data lake.
Correct
Using Azure Synapse serverless SQL pool or Databricks SQL, you can run T-SQL queries on data lake files and connect Power BI to those endpoints.
Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.
A data lake stores raw data in its native format (schema-on-read), while a data warehouse stores processed, structured data (schema-on-write). Data lakes are cheaper and used for data science; data warehouses are optimized for BI and reporting with fast query performance. In Azure, ADLS Gen2 is a data lake, and Synapse dedicated SQL pool is a data warehouse.
A data lakehouse combines the flexibility of a data lake with the reliability of a data warehouse. Delta Lake is an open-source storage layer that provides ACID transactions, schema enforcement, and versioning on top of data lake storage (e.g., ADLS Gen2). Azure Databricks and Synapse Spark pools use Delta Lake to implement the lakehouse architecture.
Yes, you can use Power BI to query data in a data lake by using Azure Synapse serverless SQL pool or Azure Databricks SQL. These services provide SQL endpoints that read data directly from ADLS Gen2. Power BI can connect via DirectQuery or import mode.
Azure Synapse Analytics is a unified analytics platform that supports data warehouse (dedicated SQL pool), data lake querying (serverless SQL pool), and data engineering/ML (Spark pools). With Delta Lake support, Synapse can function as a data lakehouse, allowing you to run BI and ML on the same data without moving it.
Parquet is the default and recommended file format for analytical workloads in Azure. It is columnar, compressed, and optimized for read-heavy queries. Azure Synapse, Databricks, and Data Factory all have native support for Parquet.
Time travel is a feature of Delta Lake that allows you to query previous versions of a table. It is enabled by the transaction log, which records every change. You can use syntax like `SELECT * FROM table VERSION AS OF 123` or `TIMESTAMP AS OF '2024-01-01'`. The default retention for old versions is 7 days.
ETL (Extract, Transform, Load) transforms data before loading into the target (typically a data warehouse). ELT (Extract, Load, Transform) loads raw data first (into a data lake) and transforms it later at query time. Data warehouses use ETL; data lakes use ELT.
You've just covered Data Lakehouse vs Data Warehouse vs Data Lake — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.
Done with this chapter?