DP-900Chapter 63 of 101Objective 3.1

Delta Lake Format in Azure

This chapter covers Delta Lake, an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads on Azure. Delta Lake is a critical component for modern data lakes on Azure Databricks and Azure Synapse Analytics, and DP-900 tests your understanding of its core features, benefits, and use cases. Expect about 5-10% of exam questions to touch on Delta Lake, typically in the context of describing its capabilities compared to traditional file formats like Parquet or CSV.

25 min read
Intermediate
Updated May 31, 2026

Delta Lake as a Version-Controlled Warehouse

Imagine a massive warehouse where workers constantly add, update, and remove boxes. Without a system, chaos ensues: someone tries to read inventory while another worker is moving boxes, leading to incomplete or incorrect counts. Delta Lake is like a warehouse manager who implements a strict transaction log. Every action—adding a box, updating its contents, or removing it—is first recorded in a ledger (the transaction log). The manager then atomically applies the change: either the entire action succeeds or nothing changes. If two workers try to update the same box simultaneously, the manager uses optimistic concurrency control: the first worker commits, and the second worker must retry based on the latest state. Additionally, the manager periodically takes snapshots of the entire warehouse (checkpoints) for fast access. For time travel, the manager can reconstruct the warehouse as it was at any past moment by replaying the ledger from a checkpoint. This ensures that readers always see a consistent view, even while writes are happening, because readers are directed to the appropriate snapshot. The warehouse itself is stored in a cloud storage (like Azure Data Lake Storage), but the manager's ledger and snapshots are the key to reliability and performance.

How It Actually Works

What is Delta Lake and Why Does It Exist?

Delta Lake is an open-source storage layer that sits on top of existing data lake storage (like Azure Data Lake Storage Gen2) and provides ACID (Atomicity, Consistency, Isolation, Durability) transactions, scalable metadata handling, and unified streaming and batch data processing. It was created to solve the problem of "dirty data" in data lakes, where concurrent writes could lead to partial writes, inconsistent reads, and data corruption. Traditional file formats like Parquet or ORC do not provide transaction guarantees; if a job fails mid-write, the data can be left in an inconsistent state. Delta Lake uses a transaction log to track every change, ensuring that reads always see a consistent snapshot.

How Delta Lake Works Internally

Delta Lake stores data as a directory of Parquet files (by default) and a transaction log (stored in a _delta_log subdirectory). The transaction log is a series of JSON files (e.g., 00000000000000000001.json, 00000000000000000002.json) that record every atomic change: add file, remove file, update metadata, etc. Each JSON file contains a list of actions that were committed atomically. The log is the source of truth for the current state of the table.

When a write operation (INSERT, UPDATE, DELETE, MERGE) occurs, Delta Lake performs the following steps:

1.

Read the latest snapshot: The reader or writer reads the latest checkpoint and replays the transaction log to build the current table state.

2.

Write new data files: The operation writes new Parquet files to the storage (e.g., part-00000-xxx.snappy.parquet).

3.

Attempt to commit: The operation writes a new JSON file to the _delta_log directory containing the actions (add new files, remove old files). This commit is atomic using the underlying file system's rename operation (which is atomic in most cloud storage).

4.

Concurrency control: Delta Lake uses optimistic concurrency control. If two writes attempt to commit the same version, one succeeds, and the other retries by reading the latest state and reattempting.

Key default values and parameters: - Checkpoint interval: Every 10 commits, Delta Lake writes a checkpoint (a Parquet file summarizing the state) to speed up future reads. This is configurable via spark.databricks.delta.checkpointInterval. - Retention duration for old log files: By default, Delta Lake retains the last 30 days of transaction log files and the last 7 days of old data files for time travel. Configurable via delta.logRetentionDuration and delta.deletedFileRetentionDuration. - Data file size: Delta Lake aims for files around 256 MB to 1 GB. The delta.targetFileSize config can adjust this.

Key Features: ACID Transactions, Time Travel, Schema Enforcement, and Unified Batch and Streaming

ACID Transactions: - Atomicity: Each commit is all-or-nothing. If the commit fails, the table remains unchanged. - Consistency: The transaction log ensures that all reads see a consistent view. - Isolation: Concurrent writes are isolated; readers see either the state before or after a commit, never a partial write. - Durability: Once a commit is written, it is persisted in cloud storage.

Time Travel (Data Versioning): Delta Lake allows you to query previous versions of a table by specifying a version number or timestamp. This is possible because the transaction log retains the history. For example:

SELECT * FROM my_table VERSION AS OF 5;
-- or
SELECT * FROM my_table TIMESTAMP AS OF '2023-01-01';

Retention of old data is controlled by delta.deletedFileRetentionDuration (default 7 days) and delta.logRetentionDuration (default 30 days). After these periods, old files are vacuumed (permanently deleted) by the VACUUM command.

Schema Enforcement and Evolution: Delta Lake automatically validates that data being written matches the table's schema. If there is a mismatch, the write fails. However, you can enable schema evolution by setting mergeSchema to true in the write options. For example:

df.write.option("mergeSchema", "true").mode("append").saveAsTable("my_table")

Schema evolution can add new columns but cannot drop or rename columns without explicit ALTER TABLE commands.

Unified Batch and Streaming: Delta Lake supports both batch and streaming writes. You can use Structured Streaming to write data into a Delta table, and the table can be queried in real-time. Delta Lake provides exactly-once semantics for streaming writes using the transaction log.

Configuration and Verification Commands

To create a Delta table:

df.write.format("delta").save("/mnt/delta/my_table")

To read a Delta table:

df = spark.read.format("delta").load("/mnt/delta/my_table")

To view the history of a table:

DESCRIBE HISTORY my_table;

This returns the version, timestamp, operation, and other details.

To vacuum old files:

VACUUM my_table RETAIN 168 HOURS;

Default retention is 7 days (168 hours). Vacuuming removes files older than the retention period.

Interaction with Azure Services

Delta Lake is natively supported in Azure Databricks and Azure Synapse Analytics (serverless SQL pools and Spark pools). On Azure, Delta tables are typically stored in Azure Data Lake Storage Gen2 (ADLS Gen2). The hierarchical namespace of ADLS Gen2 improves performance for Delta Lake operations like rename and list.

Delta Lake also integrates with Azure Machine Learning for feature stores, and with Power BI via DirectQuery or import (though DirectQuery may have limitations).

Performance Considerations

Optimize: The OPTIMIZE command compacts small files into larger ones (default target size 256 MB).

Z-order indexing: OPTIMIZE my_table ZORDER BY (column) improves data skipping for queries with filters on that column.

Auto-optimize: On Databricks, auto-optimize can be enabled to automatically optimize write operations.

Trap Patterns on the Exam

Confusing Delta Lake with Parquet: Delta Lake is a storage layer that uses Parquet as the underlying file format, but it adds transaction log, schema enforcement, etc. Parquet alone does not have ACID.

Thinking Delta Lake is a database: Delta Lake is not a database; it is a storage layer for data lakes. It does not support full SQL DML (e.g., UPDATE, DELETE) without Spark or SQL commands.

Assuming time travel is unlimited: Time travel is limited by retention settings; old data is eventually vacuumed.

Believing Delta Lake requires Azure Databricks: Delta Lake is open-source and can be used with any Spark cluster, including Azure Synapse, HDInsight, or even local Spark.

Walk-Through

1

Initialize Delta Table

When you first write data using `format("delta")`, Delta Lake creates a directory structure. It writes the initial Parquet files and creates a `_delta_log` subdirectory with a single JSON file (e.g., `00000000000000000000.json`). This initial commit contains an 'add' action for each file written. The transaction log records the schema, partitioning, and other metadata. At this point, the table is ready for ACID operations.

2

Append Data to Table

When appending new data, Spark reads the current version of the table (the latest snapshot). The new data is written as new Parquet files. A new commit attempt is made: a JSON file (e.g., `00000000000000000001.json`) is created with 'add' actions for the new files. If another writer commits first, the commit fails and retries. The append operation is atomic: either all new files are added or none.

3

Update Existing Data

An UPDATE operation in Delta Lake is implemented as a delete followed by an insert. The transaction log records 'remove' actions for the files containing the old rows and 'add' actions for new files with the updated rows. This is not an in-place update; new files are written. The commit ensures that readers see either the old version or the new version, never a mix.

4

Delete Data from Table

A DELETE operation marks files as removed by writing 'remove' actions to the transaction log. The actual data files are not immediately deleted; they are soft-deleted and retained for time travel. The `VACUUM` command later hard-deletes files older than the retention period. This allows point-in-time queries to still see deleted data if needed.

5

Time Travel Query

When querying a previous version, Delta Lake reads the checkpoint closest to that version and replays the transaction log up to the desired version. It then reads only the Parquet files that were active at that version. This is efficient because the transaction log is compact and checkpoints summarize the state. The query returns a consistent snapshot of the table as of that version.

What This Looks Like on the Job

Enterprise Scenario 1: Real-Time Fraud Detection on Azure Databricks

A financial services company processes millions of transactions per day. They use Azure Event Hubs to ingest streaming data into a Delta table stored on ADLS Gen2. Delta Lake provides exactly-once semantics, ensuring no duplicates. The fraud detection team runs batch queries and streaming queries on the same Delta table, benefiting from unified batch and streaming. They use OPTIMIZE and ZORDER BY (account_id) to speed up queries that filter by account. The retention of 30 days allows them to replay historical transactions for model training. Misconfiguration: If the retention is set too low, they lose ability to time travel for audits. If auto-optimize is not enabled, small files accumulate, degrading read performance.

Enterprise Scenario 2: Data Lakehouse for BI on Azure Synapse

A retail company uses Azure Synapse Analytics to build a data lakehouse. They store sales data in Delta tables in ADLS Gen2. Power BI users query these tables directly via Synapse serverless SQL pools. Delta Lake's schema enforcement prevents data quality issues from breaking reports. The company uses MERGE operations to upsert daily sales updates. The transaction log allows them to roll back if an incorrect update is applied. Scale: They manage over 10 TB of data with millions of files. They run OPTIMIZE weekly to keep file sizes optimal. Common pitfall: Setting delta.deletedFileRetentionDuration too low (e.g., 1 day) and then needing to restore data from a version older than 1 day; they lose the ability to time travel.

Enterprise Scenario 3: Regulatory Compliance in Healthcare

A healthcare provider stores patient records in Delta tables on Azure. Regulations require that all changes be auditable for 7 years. They set delta.logRetentionDuration to 2555 days (7 years) and delta.deletedFileRetentionDuration to 2555 days. They use DESCRIBE HISTORY to track who changed what. They also enable Delta Lake's change data feed to capture row-level changes for downstream systems. Misconfiguration: If they forget to adjust retention defaults, old logs are vacuumed, violating compliance. They also must ensure that the underlying storage (ADLS Gen2) has versioning enabled for extra redundancy.

How DP-900 Actually Tests This

DP-900 Objective Coverage

This topic falls under Domain 3: Analytics, Objective 3.1: Describe the core features of Delta Lake. The exam specifically tests your ability to:

Identify that Delta Lake provides ACID transactions, schema enforcement, and time travel.

Understand that Delta Lake is an open-source storage layer, not a database or a file format.

Recognize that Delta Lake uses a transaction log stored in _delta_log directory.

Know the default retention periods: 7 days for data files, 30 days for log files.

Understand that VACUUM permanently deletes old files.

Common Wrong Answers and Why Candidates Choose Them

1.

"Delta Lake is a file format like Parquet." – This is wrong because Delta Lake is a storage layer that uses Parquet as the underlying format. The exam expects you to know that Delta Lake adds ACID, time travel, etc., on top of Parquet.

2.

"Delta Lake supports full SQL DML without Spark." – Wrong. While you can run SQL commands like UPDATE and DELETE on Delta tables, these are executed by the Spark engine or Synapse SQL engine. Delta Lake itself is not a query engine.

3.

"Time travel allows querying any version indefinitely." – Wrong. Time travel is limited by retention settings. Old data is removed after the retention period.

4.

"Delta Lake requires Azure Databricks." – Wrong. Delta Lake is open-source and can be used with any Spark environment, including Azure Synapse, HDInsight, and even local Spark.

Specific Numbers and Terms on the Exam

Default checkpoint interval: every 10 commits.

Default data file retention for time travel: 7 days (delta.deletedFileRetentionDuration).

Default log retention: 30 days (delta.logRetentionDuration).

Command to remove old files: VACUUM.

Command to view history: DESCRIBE HISTORY.

Schema evolution option: mergeSchema set to true.

Underlying file format: Parquet.

Edge Cases and Exceptions

If you set delta.logRetentionDuration less than delta.deletedFileRetentionDuration, you might lose the ability to time travel because the log is gone before the data is vacuumed. The exam may test that the log must be retained longer than data to support time travel.

When using VACUUM, the retention period is in hours by default (e.g., RETAIN 168 HOURS). The exam might ask for the equivalent in days.

Schema evolution can only add columns, not drop or rename them without explicit ALTER TABLE.

How to Eliminate Wrong Answers

If an option says "Delta Lake provides ACID transactions," it is likely correct.

If an option says "Delta Lake stores data in CSV format," it is wrong (it uses Parquet).

If an option says "Delta Lake can be used without Spark," it is misleading; while you can read Delta tables with other engines, the transactional guarantees require Spark or compatible engine.

If an option mentions "in-place updates," it is wrong because Delta Lake uses copy-on-write.

Key Takeaways

Delta Lake is an open-source storage layer that adds ACID transactions to data lakes.

Delta Lake uses a transaction log stored in a `_delta_log` directory to track all changes.

The default retention period for data files is 7 days; for transaction logs, it is 30 days.

The `VACUUM` command permanently deletes files older than the retention period.

Time travel allows querying previous versions using `VERSION AS OF` or `TIMESTAMP AS OF`.

Schema enforcement prevents writes with mismatched schemas; schema evolution can be enabled with `mergeSchema=true`.

Delta Lake supports both batch and streaming writes with exactly-once semantics.

Underlying file format is Parquet; Delta Lake is not a file format itself.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Delta Lake

Provides ACID transactions via transaction log

Supports time travel (data versioning)

Schema enforcement and evolution

Unified batch and streaming

Uses Parquet as underlying file format

Parquet (without Delta Lake)

No ACID transactions; concurrent writes can corrupt data

No time travel; only current version available

No schema enforcement; schema-on-read only

No native streaming support

Is a file format, not a storage layer

Watch Out for These

Mistake

Delta Lake is a file format like Parquet or ORC.

Correct

Delta Lake is a storage layer that uses Parquet as the underlying file format. It adds a transaction log, ACID transactions, schema enforcement, and time travel on top of Parquet.

Mistake

Delta Lake only works with Azure Databricks.

Correct

Delta Lake is open-source and can be used with any Apache Spark environment, including Azure Synapse, HDInsight, and local Spark clusters.

Mistake

Time travel in Delta Lake allows you to query any version indefinitely.

Correct

Time travel is limited by retention settings. By default, old data files are retained for 7 days and transaction logs for 30 days. After that, VACUUM permanently deletes them.

Mistake

Delta Lake supports in-place updates to data files.

Correct

Delta Lake uses a copy-on-write approach: updates write new files and mark old ones as removed in the transaction log. It does not modify existing files in place.

Mistake

Delta Lake is a database management system (DBMS).

Correct

Delta Lake is not a DBMS; it is a storage layer for data lakes. It does not manage connections, users, or full SQL DML without a compute engine like Spark.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the difference between Delta Lake and Parquet?

Parquet is a columnar file format optimized for big data. Delta Lake is a storage layer that uses Parquet as the underlying format but adds a transaction log for ACID transactions, time travel, schema enforcement, and unified batch/streaming. Parquet alone does not provide any of these features.

How long does Delta Lake keep old data for time travel?

By default, Delta Lake retains old data files for 7 days (`delta.deletedFileRetentionDuration`) and transaction logs for 30 days (`delta.logRetentionDuration`). You can configure these settings. After the retention period, running `VACUUM` permanently deletes the old files.

Can I use Delta Lake without Apache Spark?

Delta Lake is designed to work with Apache Spark. However, other engines like Presto, Trino, and Azure Synapse serverless SQL can read Delta tables, but full ACID transactional writes typically require Spark or a compatible engine.

What is the purpose of the `_delta_log` directory?

The `_delta_log` directory contains the transaction log as a series of JSON files. Each file represents an atomic commit that records actions like adding or removing files. This log is the source of truth for the table's state and enables ACID transactions, time travel, and concurrency control.

How does Delta Lake handle concurrent writes?

Delta Lake uses optimistic concurrency control. When two writes attempt to commit at the same version, one succeeds and the other retries by reading the latest state and reattempting the commit. This ensures that no data is lost or corrupted.

What is the `OPTIMIZE` command in Delta Lake?

The `OPTIMIZE` command compacts small Parquet files into larger ones (default target size 256 MB) to improve read performance. It can also be used with `ZORDER BY` to cluster data on specified columns for better data skipping.

Is Delta Lake available in Azure Synapse Analytics?

Yes, Azure Synapse Analytics supports Delta Lake in both serverless SQL pools and Spark pools. You can query Delta tables directly using T-SQL or Spark.

Terms Worth Knowing

Ready to put this to the test?

You've just covered Delta Lake Format in Azure — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.

Done with this chapter?