DP-900 | Chapter 12 of 101 | Objective 3.3

Azure Databricks

This chapter covers Azure Databricks, a unified analytics platform optimized for Apache Spark. For the DP-900 exam, understanding Azure Databricks is essential as it appears in the 'Analytics' domain, specifically under objective 3.3 'Describe modern data warehousing and real-time analytics'. Expect approximately 5-10% of exam questions to touch on Azure Databricks, focusing on its role in big data processing, machine learning, and collaboration. This chapter explains what Azure Databricks is, how it works, its core components, and how it integrates with other Azure data services.

25 min read
Intermediate
Updated May 31, 2026

Azure Databricks as a Shared Notebook Factory

Imagine a large factory where teams of engineers collaborate on designing complex machines. Each engineer has their own notebook for sketches and calculations. However, these notebooks are scattered, and when one engineer finishes a calculation, others cannot easily use it. Azure Databricks is like a central, cloud-based factory floor where every engineer gets a standardized notebook that can be shared instantly. The factory has a powerful engine room (Apache Spark) that can process massive amounts of raw materials (data) using parallel assembly lines (clusters). Engineers write instructions (code) in their notebooks, which are then sent to the engine room. The engine room automatically divides the work among many workers (nodes) and assembles the final product (results). The factory manager (Databricks control plane) allocates resources, scales workers up or down based on demand, and keeps the notebooks on file, while the raw materials and finished products live in a central warehouse (Azure Data Lake Storage). If an engineer needs more power, the manager adds more workers without disrupting ongoing work. The factory also has a secure visitor entrance (Azure Active Directory, now named Microsoft Entra ID) that controls who can enter and what they can access. Unlike a traditional factory where each engineer works on their own machine, this factory allows everyone to see and build upon each other's work in real time, with version history and collaboration features. The key is that the heavy lifting is done by the factory's infrastructure, not the individual engineer's laptop.

How It Actually Works

What is Azure Databricks?

Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform. It is a first-party Azure service jointly developed by Microsoft and Databricks. It provides an interactive workspace that enables data engineers, data scientists, and business analysts to collaborate on building data pipelines, performing exploratory analytics, and developing machine learning models. The platform is fully managed, meaning Azure handles cluster provisioning, scaling, and security, while users focus on their data tasks.

Azure Databricks is built on top of Apache Spark, a unified analytics engine for large-scale data processing. Spark provides in-memory computing capabilities, which makes it significantly faster than traditional disk-based engines like MapReduce. Azure Databricks extends Spark with a web-based notebook interface, automated cluster management, and tight integration with Azure services such as Azure Data Lake Storage, Azure Blob Storage, Azure Synapse Analytics, and Azure Machine Learning.

The exam expects you to know that Azure Databricks is used for:

Batch and stream data processing

Data exploration and visualization

Machine learning model training and deployment

Collaborative data science and engineering

How Azure Databricks Works Internally

At a high level, Azure Databricks operates with two planes: the control plane and the data plane.

Control Plane: Managed by Databricks, this plane includes the web application, notebook management, cluster manager, job scheduler, and security services. It is hosted in a Databricks-owned Azure subscription. The control plane does not access customer data directly.

Data Plane: This is where data processing occurs. The data plane resides in the customer's own Azure subscription. It contains compute resources such as virtual machines (VMs) that form clusters, and storage resources like Azure Data Lake Storage or Blob Storage. The data plane also includes the Databricks Runtime, which is a set of core components that run on the cluster nodes.

When a user creates a cluster, the control plane provisions VMs in the customer's subscription. These VMs are configured with the Databricks Runtime, which includes Apache Spark, additional libraries, and optimizations. The user then attaches notebooks to the cluster and runs code in Python, Scala, SQL, or R. Java code is not written in notebook cells; it is packaged as a JAR and run as a job. The code is executed by Spark on the cluster nodes. Results are returned to the notebook and can be visualized inline.

Key Components and Defaults

Workspace: The root object that organizes all assets (notebooks, libraries, clusters, jobs, and dashboards). Each workspace is tied to an Azure Databricks service instance.

Clusters: Compute resources that run Spark jobs. Clusters can be interactive or job clusters.

- Interactive clusters: Persistent clusters that multiple users can attach to for ad-hoc analysis. Default timeout: 120 minutes of inactivity (configurable).

- Job clusters: Ephemeral clusters created for a specific job and terminated when the job completes. They are more cost-effective for automated tasks.

Cluster Modes:

- Standard: For single-user workloads.

- High Concurrency: For multi-user workloads with auto-scaling and shared access.

- Single Node: For lightweight development and testing (single VM).

Databricks Runtime: Includes Apache Spark, Delta Lake, MLflow, and other optimizations. The default runtime version is the latest stable release (e.g., Databricks Runtime 12.2 LTS). The runtime is updated regularly.

Notebooks: Web-based interfaces for code execution. They support multiple languages in the same notebook using magic commands (e.g., %python, %sql, %scala). Notebooks support version history, collaboration, and scheduled execution.

Jobs: Automated workflows that run notebooks or JAR files on a schedule or trigger. They can output results to a variety of destinations.

Delta Lake: An open-source storage layer that brings ACID transactions to Apache Spark. It is integrated into Databricks Runtime by default. Delta Lake enables time travel (data versioning), schema enforcement, and unified batch/streaming.

MLflow: An open-source platform for the machine learning lifecycle, integrated into Databricks. It tracks experiments, manages models, and deploys them.

Configuration and Verification Commands

Creating a cluster in Azure Databricks can be done via the UI, CLI, or API. Key configuration parameters:

Cluster name

Databricks Runtime version

Worker type (VM SKU, e.g., Standard_DS3_v2)

Minimum and maximum workers (auto-scaling range)

Spark config (optional, e.g., spark.sql.shuffle.partitions)

Tags (for cost tracking)

Example CLI command to create a cluster (using Databricks CLI):

databricks clusters create --json '{
  "cluster_name": "my-cluster",
  "spark_version": "12.2.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "autoscale": {
    "min_workers": 1,
    "max_workers": 5
  }
}'

To verify cluster status:

databricks clusters list

After attaching a notebook to the cluster, run a simple query to confirm that Spark responds:

# In a notebook cell
spark.sql("SELECT current_timestamp()").show()
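
The configuration parameters listed above can also be assembled programmatically before being passed to the CLI or API. The following is a minimal sketch, not an official client: `build_cluster_spec` is a hypothetical helper, the validation rule is an assumption of this sketch, and it uses an autoscale range only, on the assumption that a fixed `num_workers` and an `autoscale` range are alternatives rather than settings to combine.

```python
import json


def build_cluster_spec(name, runtime, node_type, min_workers, max_workers,
                       autotermination_minutes=120):
    """Return a cluster spec dict that uses an autoscale range.

    Hypothetical helper for illustration; the field names mirror the JSON
    passed to `databricks clusters create`.
    """
    # Sketch-level sanity check (an assumption, not an API rule quoted from docs).
    if not 1 <= min_workers <= max_workers:
        raise ValueError("expected 1 <= min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": runtime,
        "node_type_id": node_type,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Inactivity timeout in minutes; 120 matches the default cited in this chapter.
        "autotermination_minutes": autotermination_minutes,
    }


spec = build_cluster_spec("my-cluster", "12.2.x-scala2.12", "Standard_DS3_v2", 1, 5)
print(json.dumps(spec, indent=2))
```

The printed JSON can be saved to a file or pasted into the `--json` argument shown earlier.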

Integration with Related Azure Technologies

Azure Data Lake Storage (ADLS) Gen2: Primary storage for Databricks. It provides hierarchical namespace and POSIX-like access control. Databricks mounts ADLS or accesses it via service principal authentication.

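Azure Blob Storage: Alternative storage. Without the hierarchical namespace of ADLS Gen2, directory-level operations are slower, so it is generally less performant for analytics.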

Azure Synapse Analytics: Databricks can read/write data to Synapse using the Synapse connector.

Azure Key Vault: Stores secrets (e.g., storage account keys) used by Databricks.

Azure Active Directory (AAD, now named Microsoft Entra ID): Authentication and authorization for workspace access and data access (via service principals).

Azure Event Hubs: Source for streaming data into Databricks.

Azure Machine Learning: Databricks can deploy models to Azure ML endpoints.

Security and Networking

Azure Databricks supports network isolation via Azure Virtual Network (VNet) injection. This allows the data plane to be deployed inside a customer-managed VNet, enabling network security groups (NSGs), firewalls, and private endpoints. The control plane communicates with the data plane over a secure, encrypted tunnel. For compliance, Azure Databricks meets various certifications including SOC 2, ISO 27001, and HIPAA.

Pricing Model

Azure Databricks charges based on DBU (Databricks Unit) consumption. A DBU is a normalized unit of processing capacity, billed per hour of use. The DBU cost depends on the VM SKU, the workload type (all-purpose or jobs compute), and the Databricks pricing tier (Standard or Premium). Additionally, customers pay for the underlying Azure infrastructure (VMs, storage, networking) in their own subscription. For example, a cluster with 2 DS3 v2 workers might be billed at roughly $0.55 per DBU-hour plus roughly $0.28/hour for the VMs; these figures are illustrative and vary by tier and region. Job clusters are generally cheaper because jobs compute is billed at a lower DBU rate and the cluster terminates as soon as the job completes.
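
The two-part bill (DBU charge plus VM charge) can be made concrete with a little arithmetic. The sketch below reuses the illustrative $0.55 and $0.28 rates from the example above. Everything else is an assumption made for the calculation: that each node consumes 0.75 DBU per hour, that the $0.28 rate applies per VM, and that the cluster has one driver in addition to its two workers. The result is not a quote from the Azure price list.

```python
def estimate_hourly_cost(nodes, dbu_per_node_hour, price_per_dbu, vm_price_per_hour):
    """Return (dbu_charge, vm_charge, total) in dollars per hour.

    All rates are caller-supplied assumptions; check the Azure pricing page
    for real values.
    """
    dbu_charge = nodes * dbu_per_node_hour * price_per_dbu  # paid for the Databricks service
    vm_charge = nodes * vm_price_per_hour                   # paid for the Azure VMs
    return dbu_charge, vm_charge, dbu_charge + vm_charge


# Assumed cluster: 1 driver + 2 workers = 3 nodes.
dbu_charge, vm_charge, total = estimate_hourly_cost(
    nodes=3,
    dbu_per_node_hour=0.75,   # assumed DBU consumption per node
    price_per_dbu=0.55,       # illustrative rate from the text
    vm_price_per_hour=0.28,   # illustrative rate from the text, assumed per VM
)
print(f"DBU charge: ${dbu_charge:.2f}/h, VM charge: ${vm_charge:.2f}/h, total: ${total:.2f}/h")
```

Separating the two charges makes it easier to see that shrinking the cluster reduces both, while switching tier or workload type changes only the DBU part.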

Walk-Through

1

Provision Azure Databricks Workspace

In the Azure portal, navigate to 'Create a resource' and search for 'Azure Databricks'. Fill in the required fields: Subscription, Resource group, Workspace name, Region, and Pricing tier (Trial, Standard, Premium). Premium tier is recommended for production as it includes role-based access control (RBAC), VNet injection, and audit logs. After creation, the workspace URL (e.g., https://adb-xxxxxx.xx.azuredatabricks.net) is provided. The workspace web application is served by the control plane; data plane compute is provisioned in your subscription when you first launch a cluster.

2

Configure Storage and Permissions

Create an Azure Data Lake Storage Gen2 account and a container for data. To allow Databricks to access the storage, you need to set up a service principal in Azure AD and grant it 'Storage Blob Data Contributor' role on the storage account. Then, in Databricks, create a secret scope backed by Azure Key Vault to store the service principal's credentials. Mount the storage using the service principal via the Databricks filesystem (dbutils.fs.mount). Alternatively, use direct access with Spark configs (spark.hadoop.fs.azure.account.auth.type). This step ensures that Databricks clusters can read/write data securely.
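
The mount step can be sketched as follows. This is a hedged example rather than a complete setup: `adls_oauth_configs` and `abfss_uri` are hypothetical helpers, the storage account, container, and mount point names are placeholders, and the actual `dbutils.fs.mount` call is left as a comment because `dbutils` exists only inside a Databricks notebook.

```python
def adls_oauth_configs(client_id, client_secret, tenant_id):
    """Build the Spark/Hadoop settings for service principal (OAuth) access to ADLS Gen2."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def abfss_uri(container, account):
    """Return the abfss:// URI for a container in an ADLS Gen2 account."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/"


# Placeholder values; in a notebook the secret would come from the Key Vault-backed
# scope, e.g. dbutils.secrets.get(scope="<scope-name>", key="<key-name>").
configs = adls_oauth_configs("<application-id>", "<client-secret>", "<directory-id>")
source = abfss_uri("data", "examplestorageacct")

# Inside a Databricks notebook, the mount itself would then be:
# dbutils.fs.mount(source=source, mount_point="/mnt/data", extra_configs=configs)
print(source)
```

Reading the secret from a secret scope at run time, rather than hard-coding it, keeps the credential out of the notebook's version history.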

3

Create and Configure a Cluster

In the Databricks workspace, go to Compute > Create Cluster. Choose a cluster name, select the Databricks Runtime version (e.g., 12.2 LTS), and choose a node type (VM size). For development, a small cluster with 2 workers of Standard_DS3_v2 is sufficient. Enable autoscaling to handle variable workloads. Set the inactivity timeout (default 120 minutes) to avoid unnecessary costs. Optionally, add Spark configs like spark.sql.shuffle.partitions (default 200) for performance tuning. Click 'Create Cluster'. The cluster will start provisioning VMs in your subscription; this may take 2-5 minutes.

4

Develop and Run Notebooks

In the workspace, create a new notebook (File > New > Notebook). Attach it to the cluster. Write code in cells using Python, SQL, Scala, or R. For example, load data from ADLS: df = spark.read.format("parquet").load("dbfs:/mnt/data/events"). Then perform transformations: df_filtered = df.filter(df.event_type == "click"). Show results: display(df_filtered). Use %sql magic to run SQL directly on temporary views. The notebook supports inline visualization (bar charts, histograms). You can also schedule the notebook as a job later.

5

Schedule and Monitor Jobs

To productionize a notebook, create a job: Workflows > Create Job. Give it a name, select the notebook, and choose a job cluster (new or existing). Set the schedule (e.g., daily at 3 AM) or trigger (e.g., file arrival). Configure alerts for success/failure. The job will create an ephemeral cluster, run the notebook, and terminate the cluster. Monitor job runs in the Workflows UI to see logs, duration, and output. You can also use the Databricks CLI or API to manage jobs programmatically. This step ensures that data pipelines run reliably without manual intervention.
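
For programmatic management, a job definition is a JSON document. The sketch below builds one for the "daily at 3 AM" schedule described above, using field names from the Jobs API 2.1 as best understood here. The helper `build_job_spec`, the notebook path, the task key, and the e-mail address are all hypothetical placeholders, and the cluster settings simply reuse the values from earlier in this chapter.

```python
import json


def build_job_spec(name, notebook_path, cron, timezone="UTC", alert_email=None):
    """Return a job definition that runs one notebook on a new (ephemeral) job cluster."""
    spec = {
        "name": name,
        "tasks": [{
            "task_key": "run_notebook",  # hypothetical task name
            "notebook_task": {"notebook_path": notebook_path},
            # A new_cluster block means the cluster is created for the run
            # and terminated afterwards.
            "new_cluster": {
                "spark_version": "12.2.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
        }],
        # Quartz cron fields: second minute hour day-of-month month day-of-week.
        "schedule": {"quartz_cron_expression": cron, "timezone_id": timezone},
    }
    if alert_email:
        spec["email_notifications"] = {"on_failure": [alert_email]}
    return spec


job = build_job_spec(
    name="daily-etl",
    notebook_path="/Shared/etl/clean_events",  # placeholder path
    cron="0 0 3 * * ?",                        # every day at 03:00
    alert_email="oncall@example.com",          # placeholder address
)
print(json.dumps(job, indent=2))
```

Keeping the definition in code means the same job can be recreated in another workspace without clicking through the UI.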

What This Looks Like on the Job

Enterprise Scenario 1: ETL Pipeline for Retail Analytics

A large retail company ingests terabytes of clickstream data from their e-commerce platform daily. They use Azure Event Hubs to capture real-time events and land them in Azure Blob Storage as JSON files. The data engineering team uses Azure Databricks to build an ETL pipeline. They create a Databricks notebook that reads the raw JSON, validates schema, applies business rules (e.g., sessionization), and writes the cleaned data as Delta tables in ADLS Gen2. The pipeline runs as a scheduled job every hour. Delta Lake provides ACID transactions, so if the job fails mid-way, the data remains consistent. The team uses autoscaling clusters to handle peak loads during holiday sales (e.g., 50 workers). They also implement data quality checks using Delta Lake's built-in constraints. Common issues: misconfigured autoscaling causing OOM errors, or forgetting to optimize Delta tables leading to slow queries.

Enterprise Scenario 2: Collaborative Data Science for Fraud Detection

A financial services firm uses Azure Databricks for machine learning. Data scientists have access to a shared workspace with high-concurrency clusters. They collaborate on notebooks that explore historical transaction data stored in ADLS. Using MLflow, they track experiments with different models (XGBoost, Random Forest). They train models on large datasets using GPU-enabled clusters (NC-series VMs). The best model is registered in the MLflow Model Registry and deployed as a real-time inference endpoint using Azure Machine Learning. The firm uses RBAC to restrict access to sensitive data: only data scientists with 'Contributor' role on the workspace can access the data. Problems arise when multiple users run expensive queries simultaneously, causing resource contention; they mitigate this by using cluster pools and limiting max workers per cluster.

Enterprise Scenario 3: Real-Time Streaming Analytics for IoT

A manufacturing company processes sensor data from thousands of IoT devices. They use Azure IoT Hub to collect telemetry and route it to Event Hubs. In Azure Databricks, they run a streaming job using Structured Streaming. The job reads from Event Hubs, aggregates metrics (average temperature, vibration) over 5-minute windows, and writes results to Azure Synapse Analytics for dashboards. They also store raw data in Delta Lake for historical analysis. The streaming job runs on a dedicated cluster with auto-scaling based on backlog. A common misconfiguration is setting the checkpoint location incorrectly, causing data loss on restart. They also need to handle schema evolution as new sensors are added; Delta Lake's mergeSchema option helps. The team monitors the streaming job via Spark UI and Databricks metrics (e.g., input rows per second).
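
To make "aggregates metrics over 5-minute windows" concrete, here is a plain-Python sketch of tumbling-window averaging. It is deliberately not Structured Streaming code: it only illustrates how each reading is assigned to a fixed 5-minute bucket and averaged per device. The sensor names and readings are made up for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1)


def window_start(ts):
    """Floor a timestamp to the start of its 5-minute window."""
    seconds = int((ts - EPOCH).total_seconds())
    floored = seconds - seconds % int(WINDOW.total_seconds())
    return EPOCH + timedelta(seconds=floored)


def average_by_window(readings):
    """Average temperature per (window start, device).

    readings: iterable of (timestamp, device_id, temperature) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for ts, device, temperature in readings:
        key = (window_start(ts), device)
        sums[key][0] += temperature
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Made-up sample readings.
readings = [
    (datetime(2024, 1, 1, 12, 1), "sensor-a", 20.0),
    (datetime(2024, 1, 1, 12, 4), "sensor-a", 22.0),
    (datetime(2024, 1, 1, 12, 6), "sensor-a", 30.0),
]
averages = average_by_window(readings)
for (start, device), value in sorted(averages.items()):
    print(start.isoformat(), device, value)
```

The first two readings fall into the 12:00 window and the third into the 12:05 window. A streaming engine does the same grouping continuously and additionally has to decide how long to wait for late-arriving events.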

How DP-900 Actually Tests This

What DP-900 Tests on Azure Databricks

The DP-900 exam focuses on the conceptual understanding of Azure Databricks rather than deep technical details. The relevant objective is 3.3: 'Describe modern data warehousing and real-time analytics'. Specifically, you should know:

Azure Databricks is a unified analytics platform for big data and machine learning.

It is based on Apache Spark and provides an interactive notebook environment.

It can process both batch and streaming data.

It integrates with Azure Data Lake Storage, Azure Synapse Analytics, and Azure Machine Learning.

It supports collaborative work among data engineers, data scientists, and analysts.

The concept of DBUs (Databricks Units) for pricing.

The difference between interactive clusters and job clusters.

Common Wrong Answers and Why Candidates Choose Them

1.

Wrong answer: 'Azure Databricks is a data warehouse.' Many candidates confuse it with Azure Synapse Analytics. Reality: Azure Databricks is primarily a data processing and analytics platform, not a data warehouse. It can be used for data preparation and machine learning, but Synapse is the dedicated data warehouse solution.

2.

Wrong answer: 'Azure Databricks only supports Python.' Some think it is limited to Python because of its popularity in data science. Reality: Notebooks support Python, Scala, SQL, and R, and Java code can be run as JAR jobs.

3.

Wrong answer: 'Azure Databricks stores data in the workspace.' Candidates might assume data is stored within the service. Reality: Data is stored in external storage (e.g., ADLS, Blob) and accessed by clusters. The workspace only stores metadata, notebooks, and cluster configs.

4.

Wrong answer: 'Azure Databricks clusters are always on.' Some think clusters run continuously. Reality: Interactive clusters have a timeout (default 120 min) and job clusters are ephemeral. They can be terminated to save costs.

Specific Numbers and Terms on the Exam

DBU (Databricks Unit) is the billing unit.

The default cluster inactivity timeout is 120 minutes.

Databricks Runtime includes Apache Spark, Delta Lake, and MLflow.

The two planes: control plane (Databricks-managed) and data plane (customer-managed).

Edge Cases and Exceptions

The exam may test that Azure Databricks can be used for real-time analytics via Structured Streaming. It may also ask about the integration with Azure Machine Learning for model deployment. Another edge case: Azure Databricks can be used to process data from Azure Event Hubs or IoT Hub. Remember that Azure Databricks is not a data lake itself; it processes data stored in a data lake.

How to Eliminate Wrong Answers

If a question asks about a 'unified analytics platform for big data processing and machine learning', look for Azure Databricks. If it mentions 'data warehouse', think Synapse. If it mentions 'serverless SQL query', think Azure Synapse Serverless. If it mentions 'notebooks for collaboration', it is likely Databricks. Also, remember that Databricks is Spark-based, so any option that says 'MapReduce' or 'Hive' is likely wrong.

Key Takeaways

Azure Databricks is a unified analytics platform based on Apache Spark.

It supports multiple languages: Python, Scala, SQL, and R in notebooks, plus Java through JAR jobs.

The platform has two planes: control plane (Databricks-managed) and data plane (customer-managed).

Data is stored externally in Azure Data Lake Storage or Blob Storage, not in Databricks.

Clusters can be interactive (with a default inactivity timeout of 120 minutes) or job clusters (ephemeral).

Billing is based on DBU (Databricks Unit) consumption plus infrastructure costs.

Delta Lake provides ACID transactions, schema enforcement, and time travel.

Azure Databricks integrates with Azure Event Hubs, IoT Hub, Synapse, and Machine Learning.

The Premium tier offers VNet injection, RBAC, and audit logs.

For the DP-900 exam, know that Azure Databricks is for big data processing and machine learning, not a data warehouse.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Azure Databricks

Based on Apache Spark; optimized for big data processing and ML

Provides notebook collaboration for data scientists and engineers

Priced by DBU (Databricks Unit) + underlying VM costs

Supports batch and streaming analytics

Integrated with MLflow for machine learning lifecycle

Azure Synapse Analytics

Based on SQL engine (dedicated SQL pool) and Spark (Synapse Spark pools)

Provides SQL-based analytics and data warehousing

Priced by DWU (Data Warehouse Unit) for dedicated pool or serverless per TB

Primarily batch analytics with real-time via Spark streaming

Integrated with Power BI and Azure Machine Learning

Watch Out for These

Mistake

Azure Databricks is just a managed Spark cluster.

Correct

While it includes managed Spark clusters, Azure Databricks is a full analytics platform with notebooks, collaboration, job scheduling, MLflow, Delta Lake, and tight Azure integration. It is more than just Spark.

Mistake

Data is stored inside Azure Databricks.

Correct

Azure Databricks does not store customer data. Data resides in external storage like Azure Data Lake Storage or Blob Storage. The workspace only stores metadata, notebooks, and cluster configurations.

Mistake

Azure Databricks only supports batch processing.

Correct

Azure Databricks supports both batch and real-time streaming via Spark Structured Streaming. It can process data from Event Hubs, IoT Hub, and Kafka.

Mistake

You must use Scala to get good performance.

Correct

While Scala is Spark's native language, Python (PySpark) offers similar performance for most workloads. Azure Databricks supports multiple languages, and performance depends more on cluster configuration and data optimization.

Mistake

Azure Databricks clusters are always running and expensive.

Correct

Interactive clusters have configurable auto-termination (default 120 min inactivity). Job clusters are ephemeral and terminate automatically after job completion. You only pay for what you use.


Frequently Asked Questions

What is the difference between Azure Databricks and Azure Synapse Analytics?

Azure Databricks is a unified analytics platform for big data processing and machine learning, based on Apache Spark. It provides collaborative notebooks and is ideal for data engineering and data science. Azure Synapse Analytics is a data warehouse solution that combines SQL analytics with Spark. Synapse is optimized for large-scale data warehousing and T-SQL queries, while Databricks is more flexible for complex transformations and ML. For DP-900, remember: Databricks = big data/ML, Synapse = data warehousing.

How does Azure Databricks pricing work?

Azure Databricks charges per DBU (Databricks Unit) consumed. A DBU is a normalized unit of processing capacity, billed per hour of use. The cost depends on the VM SKU used, the workload type, and the tier (Standard or Premium). Additionally, you pay for the underlying Azure VMs, storage, and networking in your subscription. For example, a Standard_DS3_v2 cluster might cost about $0.55 per DBU-hour plus VM costs; the exact rate varies by tier and region. Job clusters are cheaper because jobs compute is billed at a lower DBU rate and the cluster terminates when the job completes.

Can Azure Databricks process real-time streaming data?

Yes, Azure Databricks supports real-time streaming via Spark Structured Streaming. You can read from Azure Event Hubs, IoT Hub, or Kafka, process data in micro-batches, and write results to sinks like Delta Lake or Azure Synapse. The exam may test that Databricks is capable of both batch and streaming analytics.

What is Delta Lake in Azure Databricks?

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark. It is built into Databricks Runtime by default. It provides features like time travel (data versioning), schema enforcement, and unified batch/streaming. Delta Lake ensures data reliability and consistency, which is crucial for data lakes.

How does Azure Databricks handle security?

Azure Databricks integrates with Azure Active Directory for authentication and RBAC. The Premium tier allows VNet injection for network isolation. Data encryption at rest and in transit is supported via Azure Storage Service Encryption and TLS. Secrets are stored in Azure Key Vault. The control plane and data plane are separated, with the data plane in the customer's subscription.

What is a DBU?

DBU stands for Databricks Unit. It is a normalized unit of processing capacity, billed per hour of use. The number of DBUs consumed depends on the VM SKU and the cluster configuration: larger VMs and more nodes consume more DBUs per hour. DBU pricing varies by tier.

Can I use Azure Databricks without a cluster running?

Yes, you can create and edit notebooks without a running cluster. However, to execute code, you must attach a cluster. You can also schedule jobs that automatically start a cluster, run the code, and terminate the cluster. This allows you to work on code without incurring compute costs.
