DP-900 | Chapter 68 of 101 | Objective 3.3

Databricks Delta Live Tables

This chapter covers Databricks Delta Live Tables (DLT), a declarative framework for building reliable, automated data pipelines on the Databricks Lakehouse Platform. For the DP-900 exam, this topic falls under Domain 3: Analytics, Objective 3.3: 'Describe modern data warehousing and real-time analytics with Azure Synapse Analytics and Databricks.' Approximately 10-15% of exam questions touch on Databricks concepts, with DLT being a key differentiator. You will need to understand DLT's purpose, how it differs from traditional ETL, its core components (pipelines, datasets, expectations), and how it enforces data quality. This chapter provides the depth needed to confidently answer any DLT-related exam question.

25 min read
Intermediate
Updated May 31, 2026

The Automated Bakery Pipeline

Imagine a high-volume bakery that produces three types of pastries: croissants, bagels, and danishes. Each pastry requires a specific sequence of steps: mixing dough, proofing, shaping, baking, cooling, and packaging. The bakery owner used to have a team of bakers manually move trays between stations, check quality by eye, and decide when to start the next batch. This was error-prone and slow. To scale up, they install a fully automated conveyor system. Now, sensors detect when a tray of shaped croissants is ready for the oven. The conveyor automatically moves it to the preheated oven, which starts baking at exactly 375°F for 18 minutes. A timer tracks each batch. When done, the tray moves to a cooling rack; if the internal temperature sensor reads above 90°F, the tray loops back for more cooling. Meanwhile, the system continuously monitors the entire process: if the oven temperature drops, it adjusts the gas valve; if the dough mixer jams, it alerts maintenance. The bakery owner can see a live dashboard showing how many croissants are in each stage, how long each batch took, and whether any step failed. This is exactly how Databricks Delta Live Tables (DLT) works: you define the recipe (pipeline) for data transformations, and the system automatically manages the flow, handles errors, tracks data quality, and provides monitoring. Just as the conveyor system ensures consistent pastry quality at scale, DLT ensures reliable data pipelines from raw ingestion to curated datasets.

How It Actually Works

What is Databricks Delta Live Tables (DLT)?

Databricks Delta Live Tables (DLT) is a declarative framework for building and managing production-grade data pipelines on the Databricks Lakehouse. With DLT, you define what transformations to apply to your data, and DLT automatically handles task orchestration, cluster management, error handling, monitoring, and data quality enforcement. It is built on top of Delta Lake, Apache Spark, and the Databricks Runtime.

Why DLT Exists

Traditional ETL pipelines require significant manual effort: you write scripts for each transformation, manage dependencies between tasks, configure clusters, handle failures, and monitor job health. As data volumes grow, this becomes brittle and hard to maintain. DLT abstracts away these complexities, allowing data engineers to focus on the logic of data transformation rather than infrastructure. It provides:
- Declarative pipeline definition: You write Python or SQL code that declares what the output datasets should look like, not how to compute them step by step.
- Automatic dependency resolution: DLT analyzes your code to determine the order of operations and builds a directed acyclic graph (DAG) of tasks.
- Incremental processing: For streaming tables, DLT processes only data that has arrived since the last update, using Structured Streaming over Delta tables. It can also apply change data capture (CDC) feeds.
- Built-in data quality: Using 'expectations', you define constraints as SQL Boolean expressions evaluated for each row (e.g., id IS NOT NULL, amount > 0). Depending on the policy you choose, DLT keeps failing records and reports them, drops them, or fails the update.
- Monitoring and alerting: DLT provides a rich UI and APIs for pipeline health, data quality metrics, and lineage.

How DLT Works Internally

DLT pipelines are defined in notebooks or Python files using the dlt module (for Python) or SQL syntax. When you create a pipeline, Databricks analyzes your code to build a DAG of transformations. Each transformation produces a dataset (a Delta table or view). The pipeline can be triggered on a schedule or continuously.
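
To make the DAG idea concrete, here is a minimal plain-Python sketch of dependency ordering. It is not DLT's internal planner: the dataset names and the `dependencies` mapping are hypothetical, and the standard-library `graphlib` module stands in for whatever DLT does internally.

```python
from graphlib import TopologicalSorter

# Hypothetical datasets: each key maps to the datasets it reads from.
dependencies = {
    "raw_orders": set(),
    "cleaned_orders": {"raw_orders"},
    "daily_sales": {"cleaned_orders"},
    "customer_summary": {"cleaned_orders"},
}

def execution_order(deps):
    """Return one valid order in which every dataset follows its inputs."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
print(order)
```

The point of the sketch is that you only declare which dataset reads from which; the run order falls out of the graph, so you never write it by hand.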

Pipeline Definition

A DLT pipeline consists of:
- Source: Raw data from cloud storage (e.g., Azure Data Lake Storage Gen2, Blob Storage) or streaming sources (e.g., Event Hubs, Kafka).
- Transformations: Python or SQL functions decorated with @dlt.view or @dlt.table that define how to derive new datasets.
- Target: The output Delta tables that are stored in the Databricks File System (DBFS) or external locations.
- Expectations: Data quality rules applied to datasets.

Execution Model

DLT pipelines run in one of two modes: triggered or continuous. In triggered mode, each update processes the data available when it starts and then stops; updates are started manually or on a schedule (e.g., every hour). In continuous mode, the pipeline runs indefinitely, processing data as it arrives with low latency (seconds to minutes).

Under the hood, DLT uses Structured Streaming for incremental processing. Each table is backed by a Delta table, and DLT tracks which data has already been processed using streaming checkpoints alongside the Delta Lake transaction log. When new data arrives, DLT reads only the new files (or streaming records), applies the transformations, and commits the changes to the target tables atomically.
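
The "only new data" behaviour can be pictured with a small plain-Python sketch. It is a simplification under stated assumptions: the `checkpoint` set is a hypothetical stand-in for the state DLT keeps for you, and the file names are invented for illustration.

```python
def run_update(available_files, processed):
    """Process only files not seen in earlier updates; return the new batch."""
    new_files = sorted(set(available_files) - processed)
    processed.update(new_files)  # persist progress, like a checkpoint would
    return new_files

checkpoint = set()  # hypothetical stand-in for streaming checkpoint state

first = run_update(["orders_001.json", "orders_002.json"], checkpoint)
second = run_update(
    ["orders_001.json", "orders_002.json", "orders_003.json"], checkpoint
)
print(first)
print(second)
```

The second update sees three files in storage but picks up only the one it has not processed before.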

Data Quality with Expectations

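Expectations are constraints on a dataset, written as SQL Boolean expressions that are evaluated for each record. Each expectation takes one of three actions when a record fails:
- `expect`: Failing records are still written to the target; the violation is counted in the pipeline's data quality metrics and the pipeline continues.
- `expect_or_drop`: Failing records are dropped before the write, and the number dropped is recorded in the metrics.
- `expect_or_fail`: If any record fails, the update fails immediately.

DLT does not create a quarantine table for you. To keep rejected records for review, define a separate dataset that selects the rows violating the rule.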

Example in Python:

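import dlt

# Expectations are stacked as decorators on the function that defines the dataset.
@dlt.expect("valid_id", "id IS NOT NULL")                # keep failing rows, log the violation
@dlt.expect_or_drop("positive_revenue", "revenue > 0")   # drop failing rows
@dlt.table
def valid_sales():
  # The function must return a DataFrame; the query itself is elided here.
  return ...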
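
If you want to reason about the three actions without a Databricks workspace, the following plain-Python sketch mimics them on a list of dictionaries. It is an illustration only: `apply_expectation`, `ExpectationFailed`, and the sample rows are hypothetical names, and the action strings "warn", "drop", and "fail" are labels chosen here for `expect`, `expect_or_drop`, and `expect_or_fail`.

```python
class ExpectationFailed(Exception):
    """Raised when a fail-type expectation is violated."""

def apply_expectation(rows, predicate, action="warn"):
    """Return (rows_written, failed_count) for one expectation.

    action: "warn" (expect), "drop" (expect_or_drop), "fail" (expect_or_fail).
    """
    failed = [r for r in rows if not predicate(r)]
    if action == "fail" and failed:
        raise ExpectationFailed(f"{len(failed)} record(s) violated the rule")
    if action == "drop":
        written = [r for r in rows if predicate(r)]
    else:
        written = list(rows)  # warn: invalid rows are kept, only counted
    return written, len(failed)

def positive_revenue(row):
    return row["revenue"] > 0

# Hypothetical sample rows for illustration.
sales = [{"id": 1, "revenue": 10.0}, {"id": 2, "revenue": -5.0}]

kept, kept_failures = apply_expectation(sales, positive_revenue, "warn")
dropped, drop_failures = apply_expectation(sales, positive_revenue, "drop")
print(len(kept), kept_failures)
print(len(dropped), drop_failures)
```

Note that the failure count is the same in both cases; what differs is whether the failing row reaches the output.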

Monitoring

DLT provides a built-in monitoring dashboard that shows:

Pipeline status (running, succeeded, failed)

Data quality metrics (number of records that passed, failed, or were dropped for each expectation)

Lineage graph of datasets

Execution time and throughput

Key Components, Values, and Defaults

Pipeline mode: triggered (default) or continuous. Continuous mode is best for low-latency streaming.

Cluster configuration: DLT automatically provisions clusters. You can specify node type, number of workers, and autoscaling. Default: autoscaling enabled, min 1, max 10 workers.

Delta table properties: DLT uses delta.autoOptimize.optimizeWrite = true by default for improved write performance.

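Expectations: You can define multiple expectations per dataset. DLT does not create a quarantine table automatically; to retain rejected rows, define a separate dataset (for example, one named <dataset>_quarantine) that selects the records failing the rule.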

Retry policy: DLT automatically retries failed tasks up to 3 times by default.

Configuration and Verification Commands

To create a DLT pipeline via the Databricks CLI:

databricks pipelines create --json '{
  "name": "my_pipeline",
  "clusters": [{"num_workers": 4}],
  "libraries": [{"notebook": {"path": "/Users/user@example.com/my_pipeline_notebook"}}],
  "continuous": false
}'
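
If you generate pipeline definitions from code, the same payload can be assembled and serialized in Python before handing it to the CLI. This is a minimal sketch that only reuses the fields shown in the command above; `build_pipeline_settings` is a hypothetical helper, not part of any Databricks SDK.

```python
import json

def build_pipeline_settings(name, notebook_path, num_workers=4, continuous=False):
    """Build the settings payload used in the CLI example above."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return {
        "name": name,
        "clusters": [{"num_workers": num_workers}],
        "libraries": [{"notebook": {"path": notebook_path}}],
        "continuous": continuous,  # False means triggered mode
    }

settings = build_pipeline_settings(
    "my_pipeline", "/Users/user@example.com/my_pipeline_notebook"
)
payload = json.dumps(settings, indent=2)
print(payload)
```

Building the dictionary first makes it easy to validate values such as the worker count before anything is sent to the workspace.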

To start a pipeline update (current CLI versions take the pipeline ID as a positional argument; the legacy CLI used --pipeline-id <id>):

databricks pipelines start-update <id>

To view pipeline status:

databricks pipelines get <id>

In the Databricks UI, you can also view pipeline details under the 'Delta Live Tables' tab.

How DLT Interacts with Related Technologies

Delta Lake: DLT stores all output as Delta tables, providing ACID transactions, schema enforcement, and time travel.

Azure Data Lake Storage Gen2: Common source and target for DLT pipelines. DLT reads from and writes to ADLS Gen2 using the abfss:// path.

Event Hubs / Kafka: DLT can ingest streaming data from Azure Event Hubs or Kafka using Structured Streaming.

Azure Synapse Analytics: You can read from or write to Synapse using Spark connectors, but DLT is typically used within the Databricks environment.

Power BI: DLT output tables can be queried by Power BI via a Databricks SQL warehouse (formerly called a SQL endpoint).

Example DLT Pipeline in Python

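import dlt
from pyspark.sql.functions import col, to_date, sum as sum_

# Bronze: ingest raw JSON files incrementally with Auto Loader.
# `spark` is provided by the DLT runtime; the path is a placeholder.
@dlt.table
def raw_orders():
  return (
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load("/path/to/raw/orders")
  )

# Silver: select the needed columns and apply data quality rules.
@dlt.expect("valid_order_id", "order_id IS NOT NULL")   # keep row, log violation
@dlt.expect_or_drop("positive_amount", "amount > 0")    # drop failing rows
@dlt.table
def cleaned_orders():
  return (
    dlt.read_stream("raw_orders")
      .select(
        col("order_id"),
        col("customer_id"),
        col("amount"),
        col("timestamp")
      )
  )

# Gold: aggregate per customer and day.
# order_date is derived from timestamp because cleaned_orders has no such column.
# A batch read (dlt.read) is used so the aggregate is recomputed from the full table.
@dlt.table
def daily_sales():
  return (
    dlt.read("cleaned_orders")
      .withColumn("order_date", to_date(col("timestamp")))
      .groupBy("customer_id", "order_date")
      .agg(sum_("amount").alias("total_amount"))  # alias avoids the name "sum(amount)"
  )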

This pipeline automatically handles schema inference, incremental processing, and data quality checks.

Walk-Through

1. Define Pipeline Source

You start by identifying the raw data source, which could be batch files in Azure Data Lake Storage Gen2 (e.g., JSON, Parquet, CSV) or a streaming source like Event Hubs. In DLT, you define a table or view that reads from this source using Spark read APIs or the `cloudFiles` auto-loader for incremental file ingestion. The source definition uses `@dlt.table` or `@dlt.view` decorator in Python, or `CREATE STREAMING TABLE` in SQL. DLT automatically tracks the offset or file position so that subsequent runs only process new data.

2. Apply Transformations

Next, you define one or more transformation steps using Python or SQL. Each transformation is a function or query that takes one or more input datasets (from previous steps) and produces a new dataset. DLT analyzes the dependencies and builds a DAG. For example, you might filter out invalid records, join with a reference table, or aggregate metrics. Transformations can include complex business logic, window functions, and user-defined functions (UDFs). DLT optimizes the execution plan for incremental processing, meaning it only computes changes rather than reprocessing the entire dataset.

3. Define Data Quality Expectations

At any point in the pipeline, you can attach data quality rules called 'expectations'. These are SQL expressions that evaluate each row. You specify what action to take on failure: `expect` (keep the record and log the violation), `expect_or_drop` (drop the record from the output), or `expect_or_fail` (fail the update). Expectations are applied during the write operation. DLT logs metrics on how many records passed, failed, or were dropped, which you can view in the monitoring dashboard. Used with the drop or fail actions, expectations keep invalid data from reaching downstream consumers.

4. Configure Pipeline Settings

Before running, you configure pipeline settings such as the compute cluster (node type, number of workers, autoscaling), pipeline mode (triggered or continuous), and storage location for output tables. You can also set up alerts for failures or data quality threshold violations. DLT allows you to set a retry policy (default 3 retries) and a maximum failure threshold. For continuous pipelines, you can set a latency target (e.g., 5 minutes) that determines how often the pipeline checks for new data.

5. Execute and Monitor Pipeline

Once configured, you start the pipeline. DLT provisions a cluster, executes the DAG, and writes the output Delta tables. During execution, you can monitor progress in the DLT UI: see the DAG, check the number of records processed per dataset, view data quality metrics, and inspect any failures. DLT provides real-time logging and lineage. If a step fails, DLT can automatically retry or stop based on your settings. After completion, the output tables are available for querying via Databricks SQL, Power BI, or other tools.
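
The retry behaviour mentioned in the steps above can be sketched in plain Python. This is an illustration of the general retry pattern, not DLT's implementation: `run_with_retries` and `flaky_task` are hypothetical, and the limit of 3 retries is simply the default quoted in this chapter.

```python
def run_with_retries(task, max_retries=3):
    """Run task(); on failure, retry up to max_retries additional times.

    Returns (result, attempts_used). Re-raises once retries are exhausted.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except RuntimeError:
            if attempts > max_retries:
                raise

calls = {"count": 0}

def flaky_task():
    """Hypothetical step that fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result, attempts = run_with_retries(flaky_task)
print(result, attempts)
```

A step that keeps failing is attempted four times in total under this sketch (one initial attempt plus three retries) before the error is surfaced.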

What This Looks Like on the Job

Enterprise Scenario 1: Real-Time Fraud Detection

A financial services company processes millions of credit card transactions per day. They need to detect fraudulent transactions in near real-time and feed the results into a dashboard for analysts. They use DLT with continuous processing to ingest streaming data from Azure Event Hubs. The pipeline:

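Reads raw transaction events (JSON) from Event Hubs with Structured Streaming. (The cloudFiles Auto Loader applies only when events are first landed as files in cloud storage.)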

Applies data quality expectations: transaction_id IS NOT NULL, amount > 0, timestamp IS NOT NULL. Records that fail are routed to a separate quarantine table, defined with the inverse of these rules, for review.

Joins with a customer profile Delta table to enrich transaction data.

Applies a machine learning model (via a UDF) to score each transaction for fraud probability.

Writes high-scoring transactions to a Delta table for real-time alerting, and all transactions to a historical table for batch analysis.

In production, this pipeline processes 50,000 events per second with sub-5-minute latency. The DLT monitoring dashboard shows throughput, data quality pass rates, and pipeline health. Common misconfigurations: leaving ingestion rate limits unset (for Auto Loader, cloudFiles.maxFilesPerTrigger or cloudFiles.maxBytesPerTrigger) can cause memory issues during streaming ingestion; and when Auto Loader is used outside DLT, forgetting to set cloudFiles.schemaLocation prevents it from tracking schema changes (inside a DLT pipeline this location is managed for you).
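
To see why a per-trigger limit protects memory, consider this plain-Python sketch of how a backlog is split into bounded micro-batches. It is a simplified model, not Auto Loader's actual scheduling: `plan_micro_batches` and the file names are hypothetical.

```python
def plan_micro_batches(pending_files, max_files_per_trigger):
    """Split a backlog into micro-batches of at most max_files_per_trigger files."""
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be at least 1")
    return [
        pending_files[i:i + max_files_per_trigger]
        for i in range(0, len(pending_files), max_files_per_trigger)
    ]

# Hypothetical backlog of ten transaction files.
backlog = [f"txn_{i:04d}.json" for i in range(10)]
batches = plan_micro_batches(backlog, max_files_per_trigger=4)
print([len(batch) for batch in batches])
```

Without a cap, the whole backlog would land in a single batch, which is the situation that tends to exhaust executor memory after an outage or a first run.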

Enterprise Scenario 2: Data Lake ETL for Business Intelligence

A retail company uses a medallion architecture (bronze, silver, gold) on Azure Data Lake Storage Gen2. They use DLT to automate the transformation of raw sales logs into curated analytical tables. The bronze layer ingests raw data from multiple sources (POS systems, web logs, CRM exports). DLT pipelines run in triggered mode every hour. In the silver layer, data is cleansed, deduplicated, and enriched with product and customer dimensions. Expectations check referential integrity: because an expectation is a row-level expression and cannot contain a subquery against another table, the sales data is left-joined to products and the expectation asserts that the matched product key IS NOT NULL. In the gold layer, aggregate tables are built for Power BI reports.
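
One way to express a referential check as a row-level rule is to join first and test the result, sketched here in plain Python. The function names, the `product_found` flag, and the sample rows are hypothetical; in a real pipeline the same idea would be a left join followed by an expectation on the joined column.

```python
def flag_known_products(sales_rows, product_ids):
    """Mimic a left join: tag each sale with whether its product_id exists."""
    known = set(product_ids)
    return [
        {**row, "product_found": row["product_id"] in known}
        for row in sales_rows
    ]

def split_valid_and_quarantine(flagged_rows):
    """Route matched rows to the silver output and the rest to quarantine."""
    valid = [r for r in flagged_rows if r["product_found"]]
    quarantine = [r for r in flagged_rows if not r["product_found"]]
    return valid, quarantine

# Hypothetical sample data for illustration only.
products = ["P1", "P2"]
sales = [
    {"sale_id": 1, "product_id": "P1"},
    {"sale_id": 2, "product_id": "P9"},  # no matching product
]

valid, quarantine = split_valid_and_quarantine(
    flag_known_products(sales, products)
)
print(len(valid), len(quarantine))
```

Splitting on the flag also gives you a quarantine set for free, which is useful when rejected rows need to be reviewed rather than silently discarded.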

Key benefits: DLT automatically handles schema evolution (e.g., new columns added to source) via Delta Lake's schema enforcement and evolution. The team can add new transformations without rewriting the entire pipeline. They use @dlt.expect_or_fail on critical tables to prevent data quality issues from propagating. A common mistake: setting pipelines.autoOptimize.managed = false can lead to many small files and poor query performance. They use OPTIMIZE and ZORDER on gold tables periodically.

Performance Considerations

Cluster sizing: For large pipelines, use autoscaling clusters with a minimum of 2-4 workers and a maximum based on expected load. DLT's default cluster may be insufficient for high-volume streaming.

Checkpointing: DLT uses Structured Streaming checkpoints, which it manages in the pipeline's storage location, to track processed data. Ensure that storage has sufficient IOPS (use premium storage if needed).

Data skew: If a transformation produces a skewed join, consider using salting or broadcasting small tables.

Continuous mode: For sub-minute latency, use continuous mode. Be aware that continuous pipelines run indefinitely and consume cluster resources even when idle. Use autoscaling to minimize costs.
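
The salting technique mentioned under data skew can be illustrated with a short plain-Python sketch. It shows only the key-spreading half of the idea, under assumed values: `salted_key`, the hot key `customer_42`, and the choice of 8 salts are hypothetical, and in a real join the smaller table must be replicated once per salt so the keys still match.

```python
from collections import Counter

def salted_key(key, row_index, num_salts):
    """Spread one hot key over num_salts sub-keys using a deterministic salt."""
    return f"{key}_{row_index % num_salts}"

# Hypothetical skew: 1,000 rows that all share a single join key.
hot_rows = ["customer_42"] * 1000

salted = [salted_key(key, i, num_salts=8) for i, key in enumerate(hot_rows)]
partition_sizes = Counter(salted)
print(len(partition_sizes), max(partition_sizes.values()))
```

Instead of one task handling all 1,000 rows, the work is divided across eight sub-keys of equal size.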

What Goes Wrong in Production

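Schema drift: If the source schema changes unexpectedly, the stream's behaviour depends on cloudFiles.schemaEvolutionMode, which accepts addNewColumns, rescue, failOnNewColumns, or none. With failOnNewColumns the stream fails until the schema is updated; with none, new columns are ignored. Best practice: use rescue mode to capture unexpected columns in the rescued data column without changing the table schema.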

OutOfMemory errors: Caused by insufficient executor memory for large shuffles. Increase spark.sql.shuffle.partitions or use larger node types.

Data quality threshold breaches: If expect_or_fail is used on a table with high failure rates, the pipeline may never complete. Use expect or expect_or_drop for non-critical data.

Stale checkpoints: If the pipeline is stopped and restarted after a long time, checkpoints may become stale. Run a full refresh of the pipeline to clear state and recompute the tables if needed.
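
To picture what rescue mode does with schema drift, here is a plain-Python sketch of the idea. It is a simplified model rather than Auto Loader's implementation: `apply_rescue_mode` and the sample record are hypothetical, while `_rescued_data` is the column name Auto Loader uses for rescued values.

```python
import json

def apply_rescue_mode(record, expected_columns):
    """Keep expected columns; move anything unexpected into _rescued_data."""
    row = {column: record.get(column) for column in expected_columns}
    extras = {k: v for k, v in record.items() if k not in expected_columns}
    row["_rescued_data"] = json.dumps(extras, sort_keys=True) if extras else None
    return row

expected = ["order_id", "amount"]

# Hypothetical incoming record with a column the schema does not know about.
incoming = {"order_id": 7, "amount": 19.5, "coupon_code": "SPRING"}
rescued = apply_rescue_mode(incoming, expected)
print(rescued)
```

The table schema stays stable, and the unexpected value is preserved for later inspection instead of being lost or failing the stream.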

How DP-900 Actually Tests This

DP-900 Exam Focus on Databricks Delta Live Tables

The DP-900 exam tests your understanding of DLT as a tool for building automated data pipelines within the Azure Databricks ecosystem. The objective is 3.3: 'Describe modern data warehousing and real-time analytics with Azure Synapse Analytics and Databricks.' Specifically, you should know:

What DLT is (declarative pipeline framework)

How it differs from traditional ETL (automatic orchestration, incremental processing, data quality)

The concept of expectations and their three types (expect, expect_or_drop, expect_or_fail)

That DLT outputs are stored as Delta tables

That DLT can process both batch and streaming data

Common Wrong Answers and Why

1. 'DLT is a data storage service' – Wrong. DLT is a pipeline management framework. Data is stored in Delta Lake tables, not in DLT itself.

2. 'DLT requires you to write custom orchestration code' – Wrong. DLT is declarative; you define what to do, not how to orchestrate.

3. 'Expectations only work on streaming data' – Wrong. Expectations work on both batch and streaming pipelines.

4. 'DLT can only process batch data' – Wrong. DLT supports both batch (triggered) and streaming (continuous) modes.

5. 'DLT replaces Delta Lake' – Wrong. DLT builds on Delta Lake; it does not replace it.

Specific Numbers and Terms on the Exam

Three expectation types: expect, expect_or_drop, expect_or_fail – memorize the exact behavior of each.

Pipeline modes: triggered (batch, scheduled) and continuous (streaming, low-latency).

Output format: Always Delta tables.

Incremental processing: DLT automatically processes only new/changed data using Delta Lake transaction logs.

Autoscaling: DLT clusters support autoscaling (default min 1, max 10 workers).

Edge Cases and Exceptions

What happens if an expectation fails with expect_or_fail? The pipeline stops and fails.

Can DLT handle schema evolution? Yes, via Delta Lake's schema evolution or cloudFiles autoloader's rescue mode.

Can you use DLT without Databricks? No, DLT is a proprietary Databricks feature.

Does DLT support streaming from Event Hubs? Yes, via Structured Streaming.

How to Eliminate Wrong Answers

If an answer mentions 'manual orchestration' or 'custom scheduler', it's wrong.

If an answer says DLT stores data in its own format, it's wrong (it uses Delta Lake).

If an answer claims DLT only works with batch, it's wrong (continuous mode exists).

If an answer describes DLT as a 'data warehouse', it's misleading; DLT is a pipeline tool that populates data warehouses.

Key Takeaways

DLT is a declarative framework for building automated data pipelines on Databricks.

Define pipelines using Python @dlt.table decorators or SQL CREATE STREAMING TABLE.

Three expectation types: expect (keep record, log violation), expect_or_drop (drop record), expect_or_fail (fail pipeline).

DLT supports both triggered (batch) and continuous (streaming) pipeline modes.

All output from DLT is stored as Delta Lake tables, enabling ACID transactions and time travel.

DLT automatically handles dependency resolution, cluster management, and error retries (default 3 retries).

DLT uses Structured Streaming for incremental processing, tracking progress via streaming checkpoints and the Delta Lake transaction log.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Delta Live Tables (DLT)

Declarative pipeline definition – you define what, not how.

Built-in incremental processing using Delta Lake transaction logs.

Native data quality expectations (expect, expect_or_drop, expect_or_fail).

Tightly integrated with Databricks and Spark.

Supports continuous streaming with sub-minute latency.

Azure Data Factory (ADF)

Imperative pipeline definition – you define each activity and dependency explicitly.

Incremental processing requires custom watermark logic or change tracking.

Data quality must be implemented via custom activities or mapping flows.

Works with many Azure services (Blob, SQL, Synapse, etc.)

Primarily batch-oriented; streaming support is limited to Event Hubs and IoT Hub.

Watch Out for These

Mistake

Delta Live Tables is a new type of database or storage system.

Correct

DLT is a pipeline orchestration framework, not a storage system. The output is stored as Delta Lake tables in Azure Data Lake Storage or DBFS.

Mistake

You must write complex orchestration code to use DLT.

Correct

DLT is declarative: you only define the transformation logic and data quality rules. DLT automatically handles dependency resolution, cluster management, and error handling.

Mistake

Expectations in DLT are only for streaming data.

Correct

Expectations work on both batch and streaming pipelines. They are evaluated during the write operation regardless of the data source.

Mistake

DLT can only process data in batch mode.

Correct

DLT supports both triggered (batch) and continuous (streaming) modes. Continuous mode processes data with low latency.

Mistake

DLT replaces Delta Lake.

Correct

DLT is built on top of Delta Lake. It uses Delta Lake for storage, ACID transactions, and time travel. DLT does not replace Delta Lake.


Frequently Asked Questions

What is the difference between triggered and continuous modes in Delta Live Tables?

Triggered mode runs the pipeline on a schedule (e.g., every hour) and processes all new data since the last run. It is suitable for batch workloads. Continuous mode runs the pipeline indefinitely, processing data as it arrives with low latency (seconds to minutes). Continuous mode uses Structured Streaming and is ideal for real-time analytics. The choice depends on latency requirements and cost considerations.

How do expectations work in Delta Live Tables?

Expectations are data quality rules defined as SQL expressions. They are applied to each row during the write operation. There are three types: `expect` keeps failing records in the output and logs the violation in the data quality metrics; `expect_or_drop` drops failing records from the output and records how many were dropped; `expect_or_fail` stops the pipeline if any record fails. DLT does not create a quarantine table automatically; to keep rejected records, define a separate dataset that selects the rows failing the rule. Used with the drop or fail actions, expectations help ensure only clean data reaches downstream consumers.

Can I use Delta Live Tables with Azure Event Hubs?

Yes, DLT can ingest streaming data from Azure Event Hubs using Structured Streaming. You define a streaming source in your DLT pipeline using `spark.readStream.format("eventhubs")` or the `cloudFiles` autoloader for file-based sources. DLT's continuous mode is ideal for Event Hubs because it provides low-latency processing.

Does Delta Live Tables support schema evolution?

Yes, DLT supports schema evolution through Delta Lake's schema enforcement and evolution capabilities. You can configure the `cloudFiles` autoloader with `cloudFiles.schemaEvolutionMode` set to `rescue` to capture unexpected columns. DLT can also handle new columns added to source data if schema evolution is enabled on the target Delta table.

How do I monitor a Delta Live Tables pipeline?

DLT provides a built-in monitoring dashboard accessible from the Databricks UI under the 'Delta Live Tables' tab. You can view pipeline status (running, succeeded, failed), data quality metrics (records passed, failed, dropped), a lineage graph of datasets, execution time, and throughput. You can also set up alerts for failures or data quality threshold breaches.

Can I use Delta Live Tables without Databricks?

No, DLT is a proprietary feature of the Databricks Lakehouse Platform. It requires a Databricks workspace and cluster. The pipeline definitions run on Databricks-managed Spark clusters and output to Delta Lake tables stored in cloud storage (e.g., Azure Data Lake Storage).

What is the default retry policy in Delta Live Tables?

By default, DLT automatically retries failed tasks up to 3 times. You can configure the retry policy when creating or editing the pipeline. If a task fails after all retries, the pipeline stops and the failure is reported in the monitoring dashboard.
