
Azure Databricks Architecture

This chapter covers Azure Databricks architecture, a critical topic for the AZ-305 exam under Data Storage objective 2.4. You will learn how Apache Spark runs on Azure, the role of the driver and worker nodes, cluster configurations, and integration with Azure data services. Expect approximately 5-10% of exam questions to touch on Databricks, often in scenarios involving big data processing, ETL pipelines, or machine learning workloads. Mastering this architecture will help you design scalable, cost-effective data solutions on Azure.

Updated May 31, 2026

Azure Databricks: A Distributed Data Factory

Imagine a large automobile assembly line. You have raw materials (data) arriving at a warehouse (Azure Blob Storage or Data Lake). The factory floor (Azure Databricks cluster) consists of multiple workstations (nodes) each with specialized tools (Spark executors). A foreman (driver node) receives the blueprints (notebook commands) and breaks down the assembly tasks into smaller jobs (Spark jobs). The foreman assigns each job to a workstation and coordinates the work. Each workstation can fetch parts (data) from the warehouse as needed, process them (transformations), and pass partially assembled components to the next station (shuffling data across the network). The foreman monitors progress (driver logs) and if a workstation breaks down (node failure), the foreman reassigns its tasks to another workstation (fault tolerance via lineage). Finally, the finished cars (processed data) are sent to the shipping dock (output sink like Azure Synapse or Power BI). Importantly, the foreman does not do the physical assembly—that's the workstations' job. This separation of control (driver) and execution (workers) is fundamental to how Spark operates.

How It Actually Works

What is Azure Databricks and Why It Exists

Azure Databricks is a fully managed, cloud-based big data analytics platform based on Apache Spark. It provides an interactive workspace that enables data engineers, data scientists, and analysts to collaborate on building data pipelines, performing exploratory data analysis, and training machine learning models. The platform abstracts away the complexity of managing Spark clusters, allowing users to focus on writing code in Python, Scala, SQL, or R.

Azure Databricks exists to solve the challenge of processing massive datasets that exceed the capacity of a single machine. By distributing data and computation across a cluster of virtual machines (VMs), it can process terabytes or petabytes of data in parallel. The exam expects you to understand the underlying architecture because designing cost-effective and performant data solutions requires knowing how clusters are provisioned, how data is partitioned, and how costs are incurred.

How It Works Internally: The Spark Execution Model

At the heart of Azure Databricks is Apache Spark, which follows a driver/worker architecture (historically described as master-slave): one driver node coordinates multiple worker nodes. When you run a notebook command that triggers an action, the driver converts it into a logical plan, optimizes it into a physical plan, and executes it as one or more jobs, each split into stages made up of tasks.

Driver Node: The driver is the central coordinator. It runs the user's main program, creates the SparkSession (and its underlying SparkContext), schedules tasks, and manages the overall execution. The driver holds the execution plan and metadata about the data, and it requests resources from the cluster manager, which Azure Databricks operates on your behalf.

Worker Nodes: Worker nodes execute the tasks assigned by the driver. Each worker hosts an executor, a JVM process that caches data and runs computations. Executors store data in memory or on disk and report back to the driver. Azure Databricks runs one executor per worker node, so the two terms are often used interchangeably; the number of tasks an executor can run in parallel depends on the VM's core count.

Cluster Manager: Azure Databricks provides its own cluster manager, so you do not deploy YARN or a Spark standalone master yourself. Based on the cluster configuration (cluster mode, VM sizes, and a fixed or autoscaling worker count), it provisions the driver and worker VMs and allocates resources to them.
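The split between coordination and execution can be illustrated with a toy model in plain Python. This is only a sketch: it uses a thread pool as a stand-in for worker nodes, and the names `run_job` and `count_even` are hypothetical, not Spark APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(partitions, task, num_workers=3):
    """Toy 'driver': schedule one task per partition, then gather the results."""
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        # map() returns results in partition order
        return list(workers.map(task, partitions))

def count_even(partition):
    """Toy 'task': runs on a worker against a single partition."""
    return sum(1 for x in partition if x % 2 == 0)

# Split the values 0..11 into 4 partitions of 3 elements each
data = list(range(12))
partitions = [data[i:i + 3] for i in range(0, len(data), 3)]

partial_counts = run_job(partitions, count_even)
total = sum(partial_counts)  # the final aggregation happens on the 'driver'
print(partial_counts, total)
```

The driver never touches individual rows here; it only hands out partitions and combines the small per-partition results, which mirrors why a Spark driver can be much smaller than the data it coordinates.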

Key Components, Values, Defaults, and Timers

- Cluster Types:
  - Standard Cluster: Intended for a single user. Supports Python, Scala, SQL, and R, and can use either a fixed number of workers or autoscaling. It provides no isolation between users, so it is a poor fit for shared workloads.
  - High Concurrency Cluster: A shared cluster for multiple concurrent users. Users share the cluster's driver, and each user's code runs in a separate process for isolation. Supports SQL, Python, and R, but not Scala. This is the recommended cluster type for multi-user production workloads.
  - Single Node Cluster: A driver-only cluster with no workers; Spark runs locally on the driver. Used for lightweight tasks or notebooks that do not require distributed processing.

- Default Timeouts and Scaling:
  - Auto-termination: Clusters can be set to auto-terminate after a period of inactivity (default 120 minutes). This saves costs. The timer resets each time a command is run.
  - Autoscaling: When autoscaling is enabled, Databricks adds and removes workers between the minimum and maximum you configure, based on load. This is a Databricks feature and is separate from open-source Spark's dynamic allocation (spark.dynamicAllocation.enabled), which is disabled by default in Spark.

- Instance Types:
  - Memory Optimized: For data caching and machine learning. Examples: Standard_E8ds_v5 (8 vCPUs, 64 GB RAM).
  - Compute Optimized: For CPU-intensive tasks. Examples: Standard_F8s_v2 (8 vCPUs, 16 GB RAM).
  - General Purpose: Balanced. Examples: Standard_D8ds_v5.

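DBU (Databricks Unit): The billing unit for Databricks compute. A DBU is a normalized unit of processing capability, billed per second of use. How many DBUs a node consumes per hour depends on its VM size (for example, the Azure pricing page lists Standard_DS3_v2 at 0.75 DBU per hour), and the price per DBU depends on the workload type and pricing tier. DBU charges are billed in addition to the cost of the underlying VMs.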
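To see how auto-termination and DBU billing interact, the arithmetic can be sketched as follows. All prices and the DBU-per-node figure below are hypothetical placeholders chosen for illustration; real values come from the Azure pricing page for your region, VM size, and tier.

```python
def billed_hours(active_hours, idle_hours, autotermination_minutes):
    """Hours billed: active time plus idle time until auto-termination fires.

    autotermination_minutes=None models a cluster with auto-termination disabled.
    """
    if autotermination_minutes is None:
        return active_hours + idle_hours
    return active_hours + min(idle_hours, autotermination_minutes / 60)

def cluster_cost(hours, nodes, dbu_per_node_hour, dbu_price, vm_price_per_hour):
    """Total cost = DBU charge + VM charge, summed over all nodes."""
    dbu_charge = hours * nodes * dbu_per_node_hour * dbu_price
    vm_charge = hours * nodes * vm_price_per_hour
    return dbu_charge + vm_charge

# Hypothetical inputs: 1 driver + 5 workers, 2 busy hours, then 10 idle hours
NODES, DBU_PER_NODE_HOUR = 6, 0.75
DBU_PRICE, VM_PRICE = 0.40, 0.50  # illustrative USD figures, not real quotes

with_timer = cluster_cost(billed_hours(2, 10, 120), NODES,
                          DBU_PER_NODE_HOUR, DBU_PRICE, VM_PRICE)
without_timer = cluster_cost(billed_hours(2, 10, None), NODES,
                             DBU_PER_NODE_HOUR, DBU_PRICE, VM_PRICE)
print(round(with_timer, 2), round(without_timer, 2))
```

Under these assumed numbers, the cluster with the 120-minute timer is billed for 4 hours instead of 12, and both the DBU and VM portions of the bill shrink proportionally.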

Configuration and Verification Commands

You create the workspace with the Azure Portal or the Azure CLI, and you configure clusters inside the workspace with the workspace UI, the Databricks CLI, or the REST API. Here are key commands:

Azure CLI to create a workspace:

az databricks workspace create --resource-group myRG --name myWorkspace --location eastus --sku standard

Databricks CLI to create a cluster:

databricks clusters create --json '{
  "cluster_name": "my-cluster",
  "spark_version": "12.2.x-scala2.12",
  "node_type_id": "Standard_D8ds_v5",
  "num_workers": 5,
  "autotermination_minutes": 120
}'

Verification:
- Check cluster status: databricks clusters list
- View Spark UI: Navigate to the Databricks workspace, click on the cluster, then click "Spark UI". This shows stages, tasks, and executor statistics.
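A cluster definition like the one passed to the Databricks CLI above could also be generated programmatically, which makes it easier to switch from a fixed worker count to autoscaling. The sketch below assumes the Clusters API fields `autoscale`, `autotermination_minutes`, and `spark_conf`; the helper name `build_cluster_spec` and the example values are hypothetical.

```python
import json

def build_cluster_spec(name, node_type, min_workers, max_workers,
                       autotermination_minutes=120, shuffle_partitions=None):
    """Return a cluster definition that uses autoscaling instead of num_workers."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    spec = {
        "cluster_name": name,
        "spark_version": "12.2.x-scala2.12",
        "node_type_id": node_type,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": autotermination_minutes,
    }
    if shuffle_partitions is not None:
        # Spark configuration values are passed as strings
        spec["spark_conf"] = {"spark.sql.shuffle.partitions": str(shuffle_partitions)}
    return spec

spec = build_cluster_spec("etl-cluster", "Standard_D8ds_v5", 2, 8,
                          autotermination_minutes=30, shuffle_partitions=64)
print(json.dumps(spec, indent=2))
```

The printed JSON has the same shape as the `--json` argument in the CLI example, with `num_workers` replaced by an `autoscale` range.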

Integration with Related Technologies

Azure Databricks integrates deeply with other Azure services:

Azure Data Lake Storage Gen2 (ADLS Gen2): Primary storage for raw and processed data. Use the abfss:// URI scheme (the ABFS driver) to connect. Databricks can mount ADLS Gen2 using a service principal or access key.

Azure Blob Storage: Older storage, but still supported. Use the wasbs:// URI scheme.

Azure Synapse Analytics: For loading processed data into a data warehouse. Use the Azure Synapse connector (data source format com.databricks.spark.sqldw), which stages data in Azure Storage during transfers.

Azure Key Vault: For storing secrets like access keys and connection strings. Use secret scopes in Databricks.

Azure Event Hubs: For streaming data ingestion. Use the Spark Event Hubs connector.

Power BI: Direct connectivity via JDBC/ODBC for visualization.
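As an illustration of the Azure Synapse integration listed above, the following sketch assembles the options the Synapse connector expects. The option names (`url`, `tempDir`, `forwardSparkAzureStorageCredentials`, `dbTable`) and the format name `com.databricks.spark.sqldw` belong to the connector; the helper function, server, storage account, and table names are hypothetical placeholders.

```python
def synapse_connector_options(jdbc_url, temp_dir, table):
    """Build the option map for the Azure Synapse connector.

    The connector stages data in Azure Storage, so tempDir must be a storage URI.
    """
    if not temp_dir.startswith(("abfss://", "wasbs://")):
        raise ValueError("tempDir must point at Azure Storage (abfss:// or wasbs://)")
    return {
        "url": jdbc_url,
        "tempDir": temp_dir,
        "forwardSparkAzureStorageCredentials": "true",
        "dbTable": table,
    }

options = synapse_connector_options(
    jdbc_url="jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>",
    temp_dir="abfss://staging@<storage-account>.dfs.core.windows.net/tmp",
    table="dbo.page_views",
)
print(options)

# On a Databricks cluster, where `df` is an existing DataFrame:
# df.write.format("com.databricks.spark.sqldw").options(**options).mode("append").save()
```

Only the dictionary construction runs outside Databricks; the commented write call needs a live cluster and real credentials.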

Data Partitioning and Shuffling

Spark splits data into partitions. For file-based sources such as Parquet, the maximum partition size when reading is controlled by spark.sql.files.maxPartitionBytes (default 128 MB). Partitions are distributed across executors. When a transformation needs to combine rows from multiple partitions (a wide transformation such as a join or groupBy), Spark performs a shuffle, which moves data across the network. Shuffles are expensive and should be minimized. You can control the number of post-shuffle partitions by setting spark.sql.shuffle.partitions (default 200).
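The two ideas in this section, size-based input partitions and key-based shuffling, can be approximated in plain Python. This is a simplification: Spark's real split calculation also considers file open cost and available parallelism, and its hash partitioner differs from Python's `hash`. The function names are hypothetical.

```python
import math
from collections import defaultdict

MAX_PARTITION_BYTES = 128 * 1024 * 1024  # mirrors the 128 MB default

def estimate_input_partitions(file_size_bytes):
    """Rough number of read partitions for a single splittable file."""
    return max(1, math.ceil(file_size_bytes / MAX_PARTITION_BYTES))

def shuffle_by_key(records, num_shuffle_partitions):
    """Route each (key, value) record to a shuffle partition by hashing the key."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % num_shuffle_partitions].append((key, value))
    return dict(buckets)

# A 10 GiB file read in 128 MiB chunks
print(estimate_input_partitions(10 * 1024**3))

# Page views keyed by integer product ID; equal keys must meet in one partition
views = [(101, 1), (102, 1), (101, 1), (103, 1), (102, 1)]
shuffled = shuffle_by_key(views, num_shuffle_partitions=2)
print(shuffled)
```

After the shuffle, every record for a given product ID sits in the same bucket, which is what allows a groupBy or join to be computed locally per partition. The cost is that records had to move to get there.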

Fault Tolerance

Spark achieves fault tolerance through lineage. Each RDD (Resilient Distributed Dataset) tracks the transformations used to build it. If a partition is lost, Spark can recompute it from the original source using the lineage graph. This avoids data replication. However, recomputation can be slow; you can persist intermediate results using cache() or checkpoint().
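Lineage-based recovery can be demonstrated with a toy class that records how each dataset was derived instead of copying the data. `ToyRDD` is a hypothetical teaching aid, not Spark's RDD implementation; it handles only narrow, per-partition transformations.

```python
class ToyRDD:
    """Minimal stand-in for an RDD: remembers its parent and the function applied."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source      # list of partitions, set only on the root dataset
        self.parent = parent      # upstream ToyRDD (the lineage link)
        self.fn = fn              # per-partition transformation
        self.persisted = False
        self.cache = {}           # partition index -> materialized data

    def map(self, fn):
        return ToyRDD(parent=self, fn=lambda part: [fn(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def persist(self):
        self.persisted = True
        return self

    def compute(self, i):
        """Return partition i, recomputing it from lineage when it is not cached."""
        if self.persisted and i in self.cache:
            return self.cache[i]
        if self.parent is None:
            data = list(self.source[i])
        else:
            data = self.fn(self.parent.compute(i))
        if self.persisted:
            self.cache[i] = data
        return data

root = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])
derived = root.map(lambda x: x * 10).filter(lambda x: x > 20).persist()

first = derived.compute(0)
derived.cache.pop(0)             # simulate losing the cached partition
recovered = derived.compute(0)   # rebuilt by replaying map and filter on the source
print(first, recovered)
```

Only the lost partition is replayed; partition 1 is never touched during recovery, which is the property that makes lineage cheaper than replicating every intermediate result.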

Security and Networking

Azure Databricks can be deployed in your own virtual network (VNet injection) for network isolation. This allows you to use Network Security Groups (NSGs) and service endpoints. You can also use Private Link for private connectivity. Authentication can be via Microsoft Entra ID (formerly Azure Active Directory, AAD) or personal access tokens. For data encryption, Databricks uses at-rest encryption (Azure Storage encryption) and in-transit encryption (TLS).

Walk-Through

1. Provision Azure Databricks Workspace

First, create an Azure Databricks workspace in the Azure Portal. This is a top-level resource that hosts all clusters, notebooks, and jobs. You choose a pricing tier: Standard, Premium (adds features such as role-based access control), or Trial (a time-limited evaluation). The workspace is deployed into a resource group and a region. Behind the scenes, Azure creates a managed resource group containing a virtual network, network security groups, and a storage account that backs the DBFS root (including cluster logs). The workspace is then accessible via a unique URL.

2. Create and Configure a Cluster

Inside the workspace, you create a cluster. You choose the cluster mode (Standard, High Concurrency, or Single Node), the Databricks runtime version (e.g., 12.2 LTS), and the VM instance types for driver and workers. You set the number of workers (fixed or auto-scaling). You can also configure auto-termination (default 120 minutes) and advanced Spark configurations like `spark.sql.shuffle.partitions`. The cluster manager provisions the VMs and installs the Spark runtime. Once the cluster is running, you can attach notebooks.

3. Mount Storage and Ingest Data

To read data, you typically mount Azure Data Lake Storage Gen2 or Blob Storage. Mounting creates a symbolic link in the Databricks file system, so you can access storage via a path like `/mnt/mydatalake`. The mount uses a service principal or access key stored in a secret scope. Alternatively, you can directly use the `abfss://` URL in Spark read commands. Data is read from storage into DataFrames, which are distributed across the cluster's executors in partitions.
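The service-principal mount described in this step could be prepared as shown below. The Hadoop configuration keys are the standard ABFS OAuth settings; the helper name, container, account, scope, and key names are hypothetical, and the `dbutils` calls are left as comments because they only exist inside a Databricks notebook.

```python
def adls_oauth_mount_args(container, account, tenant_id, client_id,
                          client_secret, mount_point):
    """Assemble keyword arguments for dbutils.fs.mount using a service principal."""
    source = f"abfss://{container}@{account}.dfs.core.windows.net/"
    extra_configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
    return {"source": source, "mount_point": mount_point,
            "extra_configs": extra_configs}

args = adls_oauth_mount_args(
    container="raw", account="mydatalake", tenant_id="<tenant-id>",
    client_id="<app-id>", client_secret="<read-from-secret-scope>",
    mount_point="/mnt/mydatalake",
)
print(args["source"])

# Inside a Databricks notebook, the secret would come from a secret scope:
# secret = dbutils.secrets.get(scope="my-scope", key="sp-secret")
# dbutils.fs.mount(**adls_oauth_mount_args("raw", "mydatalake", tenant, app_id,
#                                          secret, "/mnt/mydatalake"))
```

Keeping the secret lookup in `dbutils.secrets.get` rather than in the notebook text means the credential never appears in source control or cell output.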

4. Execute Notebook Commands and Transform Data

You write code in a notebook cell. When you run the cell, the driver parses the code, creates a logical plan, and then an optimized physical plan. The driver breaks the plan into stages, each containing tasks that operate on partitions. Tasks are sent to executors on worker nodes. Executors read data from storage, perform transformations (filter, map, join), and write intermediate results to memory or disk. The driver monitors progress via the Spark UI. If a task fails, the driver retries it on another executor.
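The retry behavior at the end of this step can be sketched as a small scheduler loop. It is a toy model: the limit of 4 attempts mirrors Spark's `spark.task.maxFailures` default, while the round-robin executor choice and the function names are simplifying assumptions rather than Spark's actual scheduling policy.

```python
MAX_TASK_FAILURES = 4  # mirrors the spark.task.maxFailures default

def run_with_retries(task, partition, executors, max_failures=MAX_TASK_FAILURES):
    """Try a task on successive executors until it succeeds or the limit is hit."""
    attempts = []
    for attempt in range(max_failures):
        executor = executors[attempt % len(executors)]
        attempts.append(executor)
        try:
            return task(partition, executor), attempts
        except RuntimeError:
            continue  # the 'driver' reschedules the task elsewhere
    raise RuntimeError(f"task failed {max_failures} times; job aborted")

def flaky_sum(partition, executor):
    """Toy task that always fails on one unhealthy executor."""
    if executor == "executor-1":
        raise RuntimeError("executor lost")
    return sum(partition)

result, attempts = run_with_retries(
    flaky_sum, [1, 2, 3], ["executor-1", "executor-2", "executor-3"])
print(result, attempts)
```

Because the task is a pure function of its input partition, rerunning it elsewhere produces the same result, which is what makes this kind of retry safe.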

5. Write Results to Sink and Terminate Cluster

After processing, you write the transformed data to a sink such as ADLS Gen2, Azure Synapse, or a SQL database. The write operation is also distributed: each executor writes its partition(s) to the sink. Once the job is complete, you can either let the cluster auto-terminate after the inactivity timeout or manually terminate it. Auto-termination saves costs. You can also schedule jobs using Azure Databricks Jobs, which automatically create and terminate clusters per run.
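A scheduled job that creates its own cluster for each run might be defined roughly as follows. The field names follow the Jobs API 2.1 request shape (`tasks`, `new_cluster`, `schedule`, `quartz_cron_expression`); the helper name, job name, notebook path, and sizes are hypothetical examples.

```python
import json

def nightly_job_spec(job_name, notebook_path, node_type, num_workers, hour_utc=2):
    """Build a job definition whose cluster exists only for the duration of a run."""
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be between 0 and 23")
    return {
        "name": job_name,
        "tasks": [{
            "task_key": "nightly_etl",
            "notebook_task": {"notebook_path": notebook_path},
            # new_cluster (rather than existing_cluster_id) means the cluster
            # is created for the run and terminated when the run finishes
            "new_cluster": {
                "spark_version": "12.2.x-scala2.12",
                "node_type_id": node_type,
                "num_workers": num_workers,
            },
        }],
        "schedule": {
            # Quartz cron fields: second minute hour day-of-month month day-of-week
            "quartz_cron_expression": f"0 0 {hour_utc} * * ?",
            "timezone_id": "UTC",
        },
    }

job = nightly_job_spec("retail-etl", "/Repos/etl/nightly", "Standard_D8ds_v5", 4)
print(json.dumps(job, indent=2))
```

With this shape there is no idle period to pay for, so the auto-termination timeout does not come into play for the job's cluster.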

What This Looks Like on the Job

Scenario 1: ETL Pipeline for Retail Analytics

A large retail company ingests clickstream data from millions of users into Azure Event Hubs. They use Azure Databricks to run a streaming ETL pipeline. The pipeline reads from Event Hubs using Structured Streaming, performs real-time aggregations (e.g., counts of page views per product), and writes the results to a Delta Lake table on ADLS Gen2. The Delta table is then used by Power BI for real-time dashboards. The cluster is configured with auto-scaling (2 to 20 workers) to handle traffic spikes. A common issue is backpressure: if the stream rate exceeds the cluster's processing capacity, the Event Hubs consumer lag increases. To mitigate, they increase the number of partitions in Event Hubs and the Spark shuffle partitions. Misconfiguration of checkpointing can cause data loss on restart; they set the checkpoint location to a dedicated directory in ADLS.
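The backpressure problem in this scenario comes down to simple rate arithmetic, sketched below. The event rates are made-up numbers for illustration, and `consumer_lag` is a hypothetical helper rather than an Event Hubs or Spark API.

```python
def consumer_lag(ingest_per_sec, processed_per_sec, seconds, initial_lag=0):
    """Events still waiting in Event Hubs after `seconds`; never below zero."""
    return max(0, initial_lag + (ingest_per_sec - processed_per_sec) * seconds)

# During a 10-minute spike the stream outpaces the cluster by 10,000 events/s
lag_after_spike = consumer_lag(50_000, 40_000, seconds=600)

# After scaling out, the cluster processes 60,000 events/s and drains the backlog
lag_after_scale_out = consumer_lag(50_000, 60_000, seconds=600,
                                   initial_lag=lag_after_spike)
print(lag_after_spike, lag_after_scale_out)
```

Lag only shrinks while processing capacity exceeds the ingest rate, which is why the team raises parallelism on both sides (Event Hubs partitions and Spark partitions) instead of only one.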

Scenario 2: Machine Learning Model Training

A financial services company trains fraud detection models on historical transaction data stored in ADLS Gen2. Data scientists use Azure Databricks notebooks with AutoML to train models. They use a High Concurrency cluster with GPU-enabled VMs (e.g., Standard_NC6s_v3) for deep learning. The cluster uses Spot VMs for cost savings on non-critical training jobs. A common pitfall: Spot VMs can be evicted, causing job failure. To mitigate this, they configure the cluster to use a mix of Spot and On-Demand VMs, keeping the driver on an On-Demand VM, and they rely on Spark's automatic task retry to rerun tasks lost to evictions. (Speculative execution is a different feature: it relaunches slow tasks, not failed ones.) They also use MLflow to track experiments and model versions. Performance considerations: they optimize data loading by using Delta Lake and partitioning by date.
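The Spot and On-Demand mix from this scenario maps onto the `azure_attributes` block of a cluster definition. The sketch below assumes the Clusters API fields `first_on_demand`, `availability`, and `spot_bid_max_price`; the helper name and the base cluster values are hypothetical.

```python
def with_spot_workers(cluster_spec, on_demand_nodes=1):
    """Return a copy of a cluster spec that runs most nodes on Spot VMs.

    The first `on_demand_nodes` nodes (the driver comes first) stay On-Demand.
    """
    if on_demand_nodes < 1:
        raise ValueError("keep at least the driver on an On-Demand VM")
    spec = dict(cluster_spec)
    spec["azure_attributes"] = {
        "first_on_demand": on_demand_nodes,
        # Fall back to On-Demand VMs when Spot capacity is unavailable
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        # -1 caps the Spot price at the On-Demand price
        "spot_bid_max_price": -1,
    }
    return spec

base = {
    "cluster_name": "fraud-training",
    "spark_version": "12.2.x-scala2.12",
    "node_type_id": "Standard_NC6s_v3",
    "num_workers": 4,
}
spot_spec = with_spot_workers(base)
print(spot_spec["azure_attributes"])
```

Keeping the driver On-Demand matters because losing the driver ends the whole job, whereas losing a worker only forces its tasks to be rerun.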

Scenario 3: Data Lakehouse with Unity Catalog

A healthcare organization builds a data lakehouse using Azure Databricks and Unity Catalog. They have multiple teams (data engineers, data scientists, analysts) who need access to curated datasets. They deploy a Premium-tier workspace with Unity Catalog enabled. They define a three-level namespace: catalog.schema.table. They use RBAC to grant permissions at the catalog or schema level. They also implement data lineage tracking. A common misconfiguration: forgetting to set the metastore and catalog permissions correctly, leading to 'table not found' errors. They also set up data quality checks using Delta Live Tables (DLT) to ensure data reliability. The cluster is configured with a high concurrency mode to allow multiple users to query simultaneously without interference.
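The permission misconfiguration mentioned in this scenario usually means a grant is missing at one of the three namespace levels. As a sketch, the helper below generates the Unity Catalog SQL statements a group would need to read one table; the function name and the catalog, schema, table, and group names are hypothetical.

```python
def grants_for_table_read(full_table_name, principal):
    """Return the GRANT statements needed to SELECT from catalog.schema.table."""
    parts = full_table_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected a three-level name: catalog.schema.table")
    catalog, schema, _table = parts
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {full_table_name} TO `{principal}`",
    ]

statements = grants_for_table_read("clinical.curated.admissions", "analysts")
for statement in statements:
    print(statement + ";")
```

Granting SELECT on the table alone is not enough: without the catalog-level and schema-level usage privileges, the principal cannot resolve the table name at all.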

How AZ-305 Actually Tests This

What AZ-305 Tests on This Topic (Objective 2.4)

The exam focuses on your ability to recommend and design data storage solutions, including Azure Databricks. Key areas:

Understanding when to use Databricks vs. Azure Synapse Analytics vs. HDInsight.

Cluster configuration: Standard vs. High Concurrency vs. Single Node.

Data integration: mounting storage, using secret scopes.

Security: VNet injection, Private Link, AAD authentication.

Cost optimization: auto-termination, Spot VMs, DBU pricing.

Most Common Wrong Answers and Why Candidates Choose Them

1. Choosing HDInsight over Databricks for a collaborative data science workload: Candidates often pick HDInsight because they think it's more customizable. However, the exam emphasizes that Databricks provides a collaborative notebook environment and is managed, which is better for data science teams.

2. Selecting a Standard cluster for a multi-user production environment: Standard clusters are for single users. The correct answer is High Concurrency cluster, which supports multiple users with isolation.

3. Thinking Databricks requires a separate storage account for each cluster: Databricks can use a single ADLS Gen2 account for all clusters. The mistake is assuming storage is tied to the cluster.

4. Believing that Databricks cannot use Spot VMs: Spot VMs are supported and recommended for cost savings on fault-tolerant workloads.

Specific Numbers, Values, and Terms That Appear Verbatim

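DBU: The billing unit for compute. DBUs consumed per VM-hour depend on the VM size (Standard_DS3_v2 is listed at 0.75 DBU per hour) and are charged on top of the VM cost.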

Auto-termination default: 120 minutes.

Spark shuffle partitions default: 200.

Delta Lake: Required for ACID transactions and time travel.

Unity Catalog: Centralized governance for data assets.

Edge Cases and Exceptions

Single Node Cluster: Only for development; cannot run distributed jobs.

High Concurrency Cluster: Supports SQL, Python, and R, but not Scala; Standard clusters support all four languages.

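VNet Injection: Required to deploy clusters into your own VNet, which Private Link and custom NSG or routing rules depend on; without it, clusters run in a Databricks-managed VNet that you cannot customize.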

Secret Scopes: If a mount reads its credentials from a secret scope, the scope and its secrets must exist before you run the mount command; otherwise, the mount fails.

How to Eliminate Wrong Answers Using the Underlying Mechanism

If a question asks for a cost-effective solution for a batch ETL job that runs nightly, eliminate any answer that keeps an always-on, all-purpose cluster running between runs. Instead, pick a scheduled job that creates its own cluster for each run and terminates it afterward, ideally with Spot VMs for the workers. If a question involves real-time streaming, eliminate any answer that suggests batch processing. The key is to map the workload requirements (collaboration, cost, performance) to the correct cluster type and configuration.

Key Takeaways

Azure Databricks is a managed Apache Spark platform optimized for Azure.

The driver node coordinates execution; worker nodes run tasks in parallel.

Use High Concurrency clusters for production multi-user workloads.

Auto-termination default is 120 minutes; set it to save costs.

Mount ADLS Gen2 for persistent storage using service principal or access key.

Delta Lake provides ACID transactions and time travel.

Unity Catalog enables centralized data governance.

Spot VMs reduce costs but can be evicted; use for fault-tolerant jobs.

DBU is the billing unit; DBUs per VM-hour depend on VM size (Standard_DS3_v2 is listed at 0.75 DBU per hour).

VNet injection is required to run clusters in your own VNet, including when you use Private Link.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Azure Databricks

Built on Apache Spark, optimized for big data processing and machine learning.

Collaborative notebook environment for data scientists and engineers.

Supports multiple languages: Python, Scala, SQL, R.

Uses DBU pricing, can leverage Spot VMs.

Best for ETL, data exploration, and ML training.

Azure Synapse Analytics

Unified analytics platform combining data warehousing and big data.

Provides dedicated SQL pools for relational data warehousing.

Primarily uses T-SQL for data querying.

Pricing based on provisioned DWU or serverless compute.

Best for large-scale data warehousing and BI workloads.

Watch Out for These

Mistake

Azure Databricks is just a managed version of Apache Spark.

Correct

While it is based on Spark, Databricks adds optimizations like Delta Lake, Unity Catalog, and collaborative notebooks. It also provides auto-scaling and auto-termination that are not native to Spark.

Mistake

You must use the Databricks file system (DBFS) for all data storage.

Correct

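DBFS is a distributed file system abstraction available on every cluster. Its root is backed by a storage account in the workspace's managed resource group, so it persists after clusters terminate, but it is not recommended for production data. For production data, use external storage like ADLS Gen2 or Blob Storage.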

Mistake

Standard clusters are suitable for production multi-user workloads.

Correct

Standard clusters are designed for single-user development. For production with multiple users, use High Concurrency clusters which provide per-user isolation.

Mistake

Databricks cannot use Spot VMs because they are unreliable.

Correct

Databricks supports Spot VMs. They are ideal for fault-tolerant workloads like batch ETL where tasks can be retried. This reduces costs significantly.

Mistake

You need to manually manage Spark configuration for performance.

Correct

Databricks automatically tunes many Spark settings (e.g., shuffle partitions) based on the workload. However, you can override them if needed.


Frequently Asked Questions

What is the difference between Standard and High Concurrency clusters in Azure Databricks?

Standard clusters are designed for a single user. They provide a dedicated driver and workers, and you can run any language (Python, Scala, SQL, R). High Concurrency clusters are for shared, multi-user workloads. They isolate each user's code in a separate process and support SQL, Python, and R, but not Scala. Both modes can be configured to autoscale and auto-terminate.

How do I connect Azure Databricks to Azure Data Lake Storage Gen2?

You can mount ADLS Gen2 using a service principal or an access key. First, create a secret scope in Databricks to store the credentials. Then use the `dbutils.fs.mount` command with the `abfss://` driver. For example: `dbutils.fs.mount(source = "abfss://container@storage.dfs.core.windows.net/", mount_point = "/mnt/mydata", extra_configs = {"fs.azure.account.auth.type": "OAuth", ...})`. Alternatively, you can directly use the `abfss://` path in Spark read commands without mounting.

What is Delta Lake and why should I use it?

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark. It provides features like schema enforcement, time travel (data versioning), and unified batch/streaming. In Azure Databricks, Delta Lake is the default format for tables. It ensures data integrity and enables efficient upserts and deletes. Use it for any production data pipeline.

How does auto-termination work in Azure Databricks?

Auto-termination is a cluster configuration that automatically terminates the cluster after a specified period of inactivity (default 120 minutes). The inactivity timer resets each time a command is run on the cluster. This feature helps reduce costs by stopping unused clusters. You can set a custom duration (e.g., 30 minutes) or disable it entirely.

Can I use Azure Databricks with Azure Synapse Analytics?

Yes. You can read from and write to Azure Synapse using the Azure Synapse connector (data source format `com.databricks.spark.sqldw`). For example, you can read a table from Synapse into a DataFrame, transform it in Databricks, and write it back. This is common for ETL pipelines where Databricks handles complex transformations and Synapse serves as the data warehouse.

What is Unity Catalog in Azure Databricks?

Unity Catalog is a centralized governance solution for data assets in Databricks. It provides a three-level namespace (catalog.schema.table), fine-grained access control (RBAC), data lineage, and audit logging. It allows you to manage permissions across workspaces and users. Unity Catalog is available only in the Premium tier.

How do I secure Azure Databricks?

You can secure Databricks by deploying it in your own virtual network (VNet injection) to use NSGs and service endpoints. Use Azure Private Link for private connectivity. Authenticate users via Microsoft Entra ID (formerly Azure Active Directory) or personal access tokens. Encrypt data at rest (Azure Storage encryption) and in transit (TLS). Use secret scopes to manage credentials.
