This chapter covers the three primary data storage and analytics architectures on Azure: Data Lake, Data Warehouse, and Lakehouse. Understanding their differences, use cases, and trade-offs is critical for the DP-900 exam, where roughly 15–20% of questions touch on data storage concepts. You will learn the internal mechanisms, key components, and how to choose the right architecture for a given scenario. By the end, you will be able to distinguish each architecture’s strengths and weaknesses and answer exam questions confidently.
Jump to a section
Imagine you have three different facilities for managing information: a Library (Data Warehouse), an Archive (Data Lake), and a Hybrid Reading Room (Lakehouse). The Library is highly organized: books are cataloged, shelved by genre and author, and only curated, cleaned content is stored. You can quickly find exactly what you need, but adding new material requires careful processing. The Archive, by contrast, accepts everything—raw documents, photos, videos, unprocessed data—in any format, stored in boxes on shelves. It’s cheap and flexible, but finding specific information is slow because you must search through boxes and read raw documents. The Hybrid Reading Room combines the best of both: it has a catalog (like the library) but also stores raw materials in a back area (like the archive). When you request a book, the staff can either fetch a pre-processed clean copy from the catalog or, if needed, go to the raw storage, quickly process it on the fly (using a small processing station), and present it in a structured format. This hybrid approach allows both fast queries on clean data and flexible exploration of raw data, all within a single facility. This mirrors how a Lakehouse uses a single storage layer (like an archive) but adds a metadata layer and query engine (like the catalog and processing station) to enable both warehouse-style performance and lake-style flexibility.
What Is a Data Lake?
A Data Lake is a centralized repository that stores all your data, both structured and unstructured, at any scale. It is built on a flat architecture using object storage (like Azure Data Lake Storage Gen2) where data is stored in its native, raw format. The key principle is “store first, schema later” – data is ingested without transformation, and schemas are applied at read time (schema-on-read). This makes it ideal for big data analytics, machine learning, and exploratory analysis.
How a Data Lake Works Internally
Data is ingested from various sources (IoT devices, logs, databases, etc.) and stored as files in a hierarchical namespace (folders and subfolders). Azure Data Lake Storage Gen2 combines blob storage with a hierarchical namespace, enabling POSIX-like access control. Data is distributed across multiple storage nodes and replicated for durability (LRS, ZRS, or GRS). When a query is executed, a compute engine (like Azure Synapse Serverless or Databricks) reads the files and applies the schema on the fly. This means the same data can be used by different engines with different schemas.
Key Components and Defaults
Azure Data Lake Storage Gen2 (ADLS Gen2): The primary storage service. It offers a hierarchical namespace, POSIX ACLs, and integration with Azure Active Directory. Default replication is LRS (locally redundant storage) unless changed.
Azure Blob Storage: The underlying object storage. ADLS Gen2 is built on Blob Storage with the hierarchical namespace feature enabled.
Hierarchical Namespace: Organizes files into folders, enabling efficient directory operations. Without it, Blob Storage is a flat namespace (no folders).
Access Control Lists (ACLs): POSIX-style permissions (read, write, execute) for users and groups. Default: Azure RBAC for management plane, ACLs for data plane.
Azure Synapse Serverless SQL Pool: Can query data directly from the lake using T-SQL.
Azure Databricks: Provides Spark-based analytics on the lake.
Configuration and Verification
To create a Data Lake Storage account:
az storage account create --name mydatalake --resource-group myrg --location eastus --sku Standard_LRS --kind StorageV2 --hierarchical-namespace trueTo verify hierarchical namespace:
az storage account show --name mydatalake --resource-group myrg --query isHnsEnabledWhat Is a Data Warehouse?
A Data Warehouse is a relational database optimized for analytics and reporting. It stores structured, cleaned, and transformed data (schema-on-write). Data is extracted from source systems, transformed (cleaned, aggregated), and loaded into the warehouse (ETL process). The warehouse uses a star or snowflake schema with fact and dimension tables. It supports high-performance SQL queries for business intelligence (BI) tools.
How a Data Warehouse Works Internally
Data flows through an ETL/ELT pipeline. In Azure, Azure Synapse Analytics (formerly SQL DW) is the primary data warehouse. It uses a distributed architecture with control nodes and compute nodes. Data is distributed across 60 distributions (round-robin, hash, or replicated). Queries are compiled and executed in parallel across nodes. Columnstore indexes (default) store data column-wise for compression and faster analytical queries. The control node coordinates query execution, and compute nodes process data. The storage layer uses Azure Blob Storage, but the metadata and schema are managed by Synapse.
Key Components and Defaults
Azure Synapse Analytics: The cloud-based data warehouse. It has two components: Dedicated SQL Pool (provisioned) and Serverless SQL Pool (on-demand).
Dedicated SQL Pool: Provisioned compute and storage. Default: 100 DWU (Data Warehouse Units) minimum, up to 30,000 DWU. You can pause and resume compute.
Distribution: Data is split into 60 distributions. Default distribution method is round-robin if no distribution key is specified.
Columnstore Index: Default for clustered tables. Provides up to 10x compression and 10x query performance over rowstore.
PolyBase: Technology to query external data (like in a data lake) from within the warehouse.
Configuration and Verification
Create a Synapse workspace:
az synapse workspace create --name myworkspace --resource-group myrg --location eastus --storage-account mydatalake --file-system myfilesystem --sql-admin-login-user admin --sql-admin-login-password P@ssw0rd123Pause a dedicated SQL pool:
az synapse sql pool pause --name mypool --workspace myworkspace --resource-group myrgWhat Is a Lakehouse?
A Lakehouse is a modern architecture that combines the flexibility of a data lake with the reliability and performance of a data warehouse. It stores raw data in a data lake (like ADLS Gen2) but adds a metadata layer (like Delta Lake) that provides ACID transactions, schema enforcement, and indexing. This enables both BI workloads (like a warehouse) and machine learning workloads (like a lake) on the same data without moving it.
How a Lakehouse Works Internally
Data is stored in open formats like Parquet, Delta, or ORC on ADLS Gen2. Delta Lake is an open-source storage layer that adds a transaction log (delta log) to track changes. When data is written, the Delta Lake engine creates a new version of the data (snapshot) and records the transaction in the log. This enables ACID properties: atomicity (all or nothing), consistency (schema validation), isolation (concurrent readers see consistent snapshot), and durability (data persisted). Queries read the latest snapshot or a specific version (time travel). Compute engines like Azure Databricks, Synapse Serverless, or Apache Spark read the Delta tables directly.
Key Components and Defaults
Delta Lake: Open-source storage layer. Default: Spark-based, but now supported by Synapse Serverless.
Delta Log: A directory _delta_log containing JSON files for each transaction. Default: each transaction creates a new JSON file.
Parquet Format: Default file format for Delta Lake. Provides columnar storage and compression.
Z-Order Indexing: Optional multi-dimensional clustering to speed up queries on multiple columns.
Time Travel: Ability to query previous versions using VERSION AS OF or TIMESTAMP AS OF.
Configuration Example
Create a Delta table in Databricks:
spark.sql("CREATE TABLE events (id INT, name STRING) USING DELTA LOCATION '/mnt/datalake/events/'")Optimize and vacuum:
spark.sql("OPTIMIZE events")
spark.sql("VACUUM events RETAIN 168 HOURS")How They Interact
In a modern data architecture, you might use all three: a Data Lake for raw ingestion, a Data Warehouse for curated BI, and a Lakehouse to unify both. Azure Synapse Analytics supports both dedicated SQL pools (warehouse) and serverless SQL pools (lake querying). You can create external tables in Synapse that point to data in the lake, effectively bridging the two worlds. The Lakehouse eliminates the need to copy data from lake to warehouse, reducing latency and cost.
Exam-Relevant Details
Data Lake: Best for raw data, schema-on-read, big data, machine learning. Not suitable for BI requiring low latency because queries are slower.
Data Warehouse: Best for structured, cleaned data, BI, reporting. High performance but costly to store raw data.
Lakehouse: Best when you need both flexibility and performance. Uses Delta Lake for ACID on lake. Emerging as preferred architecture for modern analytics.
On the exam, be able to identify which architecture fits a given scenario. For example, if a company needs to store petabytes of raw IoT data and run occasional exploratory queries, choose Data Lake. If they need sub-second dashboards for sales reports, choose Data Warehouse. If they need both without data duplication, choose Lakehouse.
Ingest Raw Data into Lake
Data from sources (IoT, logs, databases) is streamed or batched into Azure Data Lake Storage Gen2. The data is stored as-is in its native format (CSV, JSON, Parquet, etc.) in a hierarchical folder structure. No schema is applied at this stage. Ingestion tools include Azure Data Factory, Event Hubs, or IoT Hub. The data is replicated (default LRS) and distributed across storage nodes. At the storage level, the hierarchical namespace allows efficient directory operations. The ingestion process typically partitions data by date (e.g., /raw/year=2025/month=04/day=15/) to optimize future queries.
Apply Schema-on-Read for Lake Queries
When a user or application queries the lake, a compute engine like Azure Synapse Serverless SQL or Databricks reads the files and applies a schema at query time. For example, a CSV file may be read with a schema that defines column names and types. This schema is not stored permanently; it is defined in the query (e.g., using OPENROWSET in Synapse). This flexibility allows different users to interpret the same data differently, but it also means query performance depends on file formats and partitioning. Parquet and Delta formats are preferred for better performance.
ETL Process in Data Warehouse
In a data warehouse, data is extracted from sources, transformed (cleaned, aggregated, joined), and loaded into the warehouse (ETL). Azure Data Factory or Synapse Pipelines orchestrate this. The transformed data is stored in fact and dimension tables using a star schema. During loading, data is distributed across 60 distributions in a dedicated SQL pool. The default distribution method is round-robin, but hash distribution on a key (e.g., customer_id) is often used to collocate related data. Columnstore indexes compress and index the data for fast analytical queries.
Query Warehouse for BI
Business intelligence tools like Power BI connect to the warehouse using T-SQL. Queries are submitted to the control node, which compiles and distributes the query plan to compute nodes. Each compute node processes its portion of data in parallel, then results are aggregated and returned. Columnstore indexes allow scanning only relevant columns, reducing I/O. The warehouse is optimized for aggregations and joins on large tables. Typical response times are sub-second to a few seconds for well-tuned queries. The warehouse can be scaled up (more DWU) for better performance.
Delta Lake Transactions in Lakehouse
When data is written to a Delta Lake table, a transaction log records the operation. For example, an INSERT creates a new Parquet file and a new JSON entry in the _delta_log directory. The log ensures atomicity: if the write fails, the log is not committed. Readers see the latest snapshot (snapshot isolation). Schema enforcement: if the data doesn’t match the table schema, the write is rejected. Time travel: you can query a previous version by specifying a version number or timestamp. The OPTIMIZE command compacts small files into larger ones (default target size 256 MB). The VACUUM command cleans up old files older than a retention period (default 7 days).
Unified Query with Lakehouse
In a lakehouse, you can run both BI-style queries (using Delta tables) and data science workloads (using Spark) on the same data. For example, Azure Databricks can write a Delta table, and Azure Synapse Serverless SQL can query that same Delta table using T-SQL. This eliminates the need to copy data between a lake and a warehouse. The lakehouse provides a single copy of data with ACID guarantees, reducing storage costs and data staleness. The exam tests that a lakehouse is a combination of data lake flexibility and data warehouse reliability.
Scenario 1: Retail Analytics with Data Lake
A large retail chain ingests petabytes of clickstream data from its e-commerce platform. They use Azure Data Lake Storage Gen2 to store raw logs as JSON files partitioned by date. Data scientists use Azure Databricks to run exploratory analysis and build recommendation models. The key challenge is the sheer volume: over 10 TB of new data daily. They use partition pruning (by date) and Delta Lake format to speed up queries. Misconfiguration: initially they used CSV format, which caused slow queries and high storage costs. Switching to Parquet with Z-order indexing on user_id reduced query time by 80%. They also learned to set appropriate retention policies for VACUUM to avoid excessive storage costs.
Scenario 2: Financial Reporting with Data Warehouse
A bank needs to generate daily risk reports and regulatory filings. They use Azure Synapse Analytics (dedicated SQL pool) with a star schema. Data from core banking systems is ETL’d nightly using Azure Data Factory. The warehouse has 10 fact tables (transactions) and 20 dimension tables (customer, account, date). They use hash distribution on account_id to collocate transactions per account. The report queries run in under 5 seconds. Common mistake: not using columnstore indexes on all large tables, leading to slow scans. They also learned to pause the SQL pool during non-business hours to save costs. The exam tests that a data warehouse is ideal for structured, high-performance BI.
Scenario 3: Hybrid Architecture with Lakehouse
A healthcare organization wants to combine patient records (structured) with medical images (unstructured) for AI research. They adopt a lakehouse architecture using Azure Databricks and Delta Lake on ADLS Gen2. The structured data (patient demographics, lab results) is stored as Delta tables, while images are stored as raw files. Researchers can run SQL queries on the structured data and then use Spark to process images, all on the same platform. The challenge was ensuring HIPAA compliance: they used Azure Private Link and customer-managed keys for encryption. They also set up Delta Lake change data feed to track updates for auditing. The lakehouse eliminated the need to maintain separate lake and warehouse, reducing data duplication and ETL complexity.
What DP-900 Tests
Objective 1.4: Describe the differences between a data lake, data warehouse, and lakehouse. The exam focuses on: - Use cases: When to use each architecture. - Schema approach: Schema-on-read (lake) vs schema-on-write (warehouse) vs both (lakehouse). - Data types: Structured (warehouse) vs unstructured (lake) vs combined (lakehouse). - Performance: Warehouse is optimized for fast queries; lake is slower but flexible; lakehouse offers both. - ACID properties: Only warehouse and lakehouse (via Delta Lake) provide ACID; raw lake does not. - Azure services: ADLS Gen2 (lake), Azure Synapse Analytics (warehouse), Azure Databricks + Delta Lake (lakehouse).
Common Wrong Answers
Choosing a data lake for BI reporting: Candidates think “lake stores everything” but forget that BI needs low latency and ACID, which raw lake lacks.
Choosing a data warehouse for machine learning: Warehouse is structured, but ML often needs raw, unstructured data. Lake is better.
Thinking lakehouse is just a data lake with a different name: Lakehouse adds ACID and schema enforcement via Delta Lake; it’s not just a lake.
Assuming a data lake always uses Parquet: Lakes can store any format; Parquet is common but not required. The exam tests that lakes store data in native format.
Numbers and Terms
60 distributions in Azure Synapse dedicated SQL pool.
Default DWU: 100, max 30,000.
Delta Lake default vacuum retention: 7 days (168 hours).
Columnstore index is default for Synapse tables.
Hierarchical namespace must be enabled for ADLS Gen2.
Edge Cases
Data Lake with Delta Lake: Some candidates confuse that a lake with Delta Lake is actually a lakehouse. The exam distinguishes: a raw lake without Delta is a data lake; with Delta it’s a lakehouse.
Serverless SQL in Synapse: Can query lake data like a warehouse, but it’s still schema-on-read unless using Delta tables. The exam may present a scenario where serverless SQL is used on a lake – that’s still a lake, not a warehouse.
PolyBase: Can query lake data from a warehouse, but the warehouse remains the primary store. This is a hybrid pattern, not a lakehouse.
How to Eliminate Wrong Answers
If the scenario mentions “raw data,” “schema-on-read,” or “big data,” eliminate warehouse and lakehouse (unless Delta is mentioned).
If the scenario mentions “low-latency dashboards,” “ACID transactions,” or “BI reporting,” eliminate raw lake.
If the scenario mentions both “flexibility for data science” and “reliable BI,” choose lakehouse.
Look for keywords: “Delta Lake,” “ACID on data lake,” “single copy” -> lakehouse.
Data Lake: stores raw data, schema-on-read, best for big data and ML.
Data Warehouse: stores structured/transformed data, schema-on-write, best for BI and reporting.
Lakehouse: combines lake flexibility with warehouse reliability using Delta Lake for ACID.
Azure services: ADLS Gen2 (lake), Azure Synapse Analytics (warehouse), Azure Databricks (lakehouse).
Delta Lake provides ACID, time travel, and schema enforcement on data lakes.
On the exam, identify architecture by keywords: raw/unstructured -> lake; structured/BI -> warehouse; both -> lakehouse.
Data warehouses in Azure Synapse use 60 distributions and columnstore indexes by default.
These come up on the exam all the time. Here's how to tell them apart.
Data Lake
Stores raw, unstructured, and structured data in native format
Schema-on-read: schema applied at query time
Optimized for big data, ML, and exploratory analytics
Lower storage cost (cheap object storage)
No ACID transactions (unless using Delta Lake, then it's a lakehouse)
Data Warehouse
Stores structured, cleaned, transformed data
Schema-on-write: schema enforced before loading
Optimized for BI, reporting, and low-latency queries
Higher storage cost (provisioned compute + storage)
Full ACID transactions
Data Lake
No ACID transactions
No schema enforcement
No indexing (performance relies on file format and partitioning)
Cannot support concurrent BI and ML workloads efficiently
Example: Azure Data Lake Storage Gen2 (raw)
Lakehouse
ACID transactions via Delta Lake
Schema enforcement
Indexing (Z-order, bloom filters) for faster queries
Supports both BI (SQL) and ML (Spark) on same data
Example: Azure Databricks + Delta Lake on ADLS Gen2
Data Warehouse
Only structured data
Requires ETL to load data
High performance for BI queries
Scales compute and storage separately (but compute is provisioned)
More expensive per TB of storage
Lakehouse
Stores structured, semi-structured, and unstructured data
No ETL needed; data stays in open formats
Performance can approach warehouse with Delta and indexing
Compute can be serverless (e.g., Synapse Serverless)
Lower storage cost (object storage)
Mistake
A data lake is just a large data warehouse.
Correct
A data lake stores raw, unstructured data with schema-on-read, while a data warehouse stores structured, transformed data with schema-on-write. They serve different purposes.
Mistake
Data lakes do not support ACID transactions.
Correct
Raw data lakes do not, but a lakehouse with Delta Lake does. Delta Lake adds ACID transactions, schema enforcement, and time travel to the lake.
Mistake
You must choose between a data lake and a data warehouse.
Correct
Many organizations use both in a modern data architecture. The lakehouse aims to unify them, but you can also use tools like PolyBase to query lake data from a warehouse.
Mistake
Data warehouses can store unstructured data like images.
Correct
Data warehouses are designed for structured, relational data. Storing unstructured data in a warehouse is inefficient; a data lake is better.
Mistake
Lakehouse and data lake are the same thing.
Correct
A lakehouse adds a metadata and transaction layer (e.g., Delta Lake) on top of a data lake, providing ACID, schema enforcement, and BI performance. A raw lake lacks these.
Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.
A data lake stores raw data in its native format (schema-on-read), while a data warehouse stores structured, transformed data (schema-on-write). Data lakes are cheaper and flexible for big data/ML; data warehouses are optimized for fast BI queries and enforce ACID. In Azure, ADLS Gen2 is a lake; Azure Synapse Analytics is a warehouse.
A lakehouse is a data lake with a metadata and transaction layer like Delta Lake that provides ACID transactions, schema enforcement, and indexing. Unlike a raw data lake, a lakehouse supports both BI and ML workloads on the same data without copying. In Azure, Databricks with Delta Lake on ADLS Gen2 is a lakehouse.
Use a data lake (ADLS Gen2) when you need to store large amounts of raw data (IoT logs, social media, images) and run exploratory analytics or machine learning. Use a data warehouse (Azure Synapse) for structured business data that requires fast, reliable reporting and BI dashboards. If you need both, consider a lakehouse.
Yes, Azure Synapse Analytics has a serverless SQL pool that can query data directly from a data lake (ADLS Gen2) using T-SQL. You can create external tables that point to files in the lake. This allows you to combine warehouse and lake capabilities, but the serverless pool is still schema-on-read unless you use Delta Lake.
Delta Lake is an open-source storage layer that adds ACID transactions, scalable metadata handling, and unified streaming/batch processing to data lakes. It is the foundation of a lakehouse. It enables features like time travel (querying historical versions), schema enforcement, and data compaction. In Azure, it's used with Databricks and Synapse.
Technically you can store unstructured data as binary or string columns, but it is inefficient and not recommended. Data warehouses are optimized for structured, relational data. For unstructured data, use a data lake. A lakehouse can store both in the same system.
The default file format is Parquet, a columnar storage format that provides high compression and fast query performance. Delta Lake also stores a transaction log in JSON format in a `_delta_log` directory.
You've just covered Data Lake vs Data Warehouse vs Lakehouse — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.
Done with this chapter?