DP-900 | Chapter 67 of 101 | Objective 3.3

Databricks Notebooks and Clusters

This chapter covers Azure Databricks notebooks and clusters, two core components of the Databricks analytics platform. For the DP-900 exam, you need to understand how notebooks provide an interactive environment for data exploration and how clusters provide the distributed compute resources. This topic area appears in approximately 10–15% of exam questions, primarily under objective 3.3 (Describe analytics workloads). We'll explain the architecture, lifecycle, and configuration options for both notebooks and clusters, including the different cluster modes and how they interact with Azure data services.

Updated May 31, 2026

Databricks: Restaurant Kitchen with Prep Stations

Imagine a high-end restaurant kitchen that serves complex dishes. The kitchen itself is a Databricks workspace. The chefs are data engineers and data scientists. The recipes are notebooks: step-by-step instructions to prepare a dish (process data). The cooking stations (stoves, ovens, prep tables) are clusters: the compute resources that actually execute the recipes. Each station has specific tools (memory, CPU cores, GPUs) and can be configured for different types of cooking (e.g., a wok station for stir-fry = a cluster optimized for streaming; a slow-cooker station for braising = a cluster optimized for batch ETL).

The head chef (driver node) coordinates the line cooks (worker nodes) and decides who does what. When a recipe calls for chopping vegetables, the head chef assigns that task to a prep cook. The prep cook uses a specific cutting board and knife (a worker node with its local storage). The head chef also keeps track of all the tasks and combines the final plated dish. If the restaurant gets busy, the manager can bring in more cooks (auto-scaling the cluster) or set up a dedicated station for a special event (job cluster).

Notebooks are not executed on the chef's personal notepad; they are handed to the cooking station. The station reads the recipe, fetches ingredients from the pantry (data lake), and produces the dish. If the station crashes (node failure), the head chef can restart the recipe on another station because the ingredients are still in the pantry. This analogy captures the separation of code (notebook) and compute (cluster), the role of the driver, and the ephemeral nature of clusters.

How It Actually Works

What are Databricks Notebooks and Clusters?

Azure Databricks is a unified analytics platform built on Apache Spark. It provides two primary abstractions: notebooks and clusters. Notebooks are web-based interfaces that contain executable code, visualizations, and narrative text. They support multiple languages: Python, Scala, SQL, and R. Clusters are sets of virtual machines that execute the code in notebooks. The compute is separated from the code: notebooks are persisted in the workspace, while clusters are ephemeral resources that can be started, stopped, and scaled independently.

How Notebooks Work

A notebook is composed of cells. Each cell can be a code cell or a text cell (Markdown). When you run a cell, the code is sent to the attached cluster's driver node. The driver interprets the code, creates a Spark job, and distributes tasks to worker nodes. Results are returned to the notebook for display. Notebooks support interactive development: you can run cells in any order, modify previous cells, and re-run them. The notebook state (variables, data frames) is maintained in the cluster's memory across cells within the same session. That state lives only on the cluster: if you detach the notebook or attach it to a different cluster, the state is lost.
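The way state carries across cells can be mimicked outside Databricks with a toy model. This is only a sketch and not how Databricks is implemented: `NotebookSession` and its methods are hypothetical names, and plain Python `exec` stands in for the driver's interpreter.

```python
class NotebookSession:
    """Toy stand-in for the per-notebook execution context on a driver."""

    def __init__(self):
        self.namespace = {}  # variables survive here between cell runs

    def run_cell(self, source):
        # Every cell executes against the same namespace, in whatever order it is run.
        exec(source, self.namespace)

    def detach(self):
        # Detaching discards the in-memory state.
        self.namespace = {}


session = NotebookSession()
session.run_cell("rows = [3, 1, 2]")
session.run_cell("total = sum(rows)")   # sees `rows` from the earlier cell
print(session.namespace["total"])       # 6

session.detach()
print("total" in session.namespace)     # False
```

The second cell only works because the first one already ran in the same session, which is why re-running cells out of order after a detach often fails with undefined names.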

Key notebook features tested on DP-900:
- Languages: Python, Scala, SQL, R. Notebooks can mix languages using magic commands (%python, %scala, %sql, %r).
- Visualizations: Built-in charting (bar, line, scatter, etc.) and integration with libraries like matplotlib and ggplot.
- Collaboration: Multiple users can edit a notebook simultaneously (real-time co-authoring).
- Version history: Databricks automatically saves notebook versions; you can view and restore previous versions.
- Scheduling: Notebooks can be scheduled as jobs using Databricks Jobs.
- Export/Import: Notebooks can be exported as HTML, DBC (Databricks archive), or source code (Python, Scala, etc.).
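To make the magic-command idea concrete, the sketch below shows how a cell's leading magic could be mapped to a language. `split_magic` is a hypothetical helper written for illustration; it is not part of any Databricks API, and the real notebook front end handles more cases than this.

```python
MAGICS = {"%python": "python", "%scala": "scala", "%sql": "sql", "%r": "r"}


def split_magic(cell, default_language="python"):
    """Return (language, code) for a cell, honouring a leading magic command."""
    first, _, rest = cell.lstrip().partition("\n")
    parts = first.split(None, 1)
    if parts and parts[0] in MAGICS:
        inline = parts[1] if len(parts) > 1 else ""
        code = "\n".join(p for p in (inline, rest) if p)
        return MAGICS[parts[0]], code
    # No magic on the first line: the cell uses the notebook's default language.
    return default_language, cell


print(split_magic("%sql\nSELECT COUNT(*) FROM sales"))
# ('sql', 'SELECT COUNT(*) FROM sales')
print(split_magic("df = spark.table('sales')"))
# ('python', "df = spark.table('sales')")
```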

How Clusters Work

A cluster consists of a driver node and one or more worker nodes. The driver node is the master that coordinates work: it converts notebook code into Spark tasks, schedules them on workers, and aggregates results. Worker nodes execute the tasks and store intermediate data in memory or local disk. Clusters can be created interactively (interactive clusters) or automatically for jobs (job clusters).
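The driver/worker split can be sketched in plain Python. This is an analogy under stated assumptions, not Spark itself: threads stand in for worker nodes, and `driver_run` and `worker_task` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_task(partition):
    """Work done on one partition by a worker: here, a partial sum."""
    return sum(partition)


def driver_run(data, num_workers):
    """Driver role: split data into partitions, schedule tasks, combine results."""
    partitions = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(worker_task, partitions))
    return sum(partial_sums)


print(driver_run(list(range(1, 101)), num_workers=4))  # 5050
```

Each worker sees only its own partition, and only the driver sees the combined result, which mirrors the aggregation step described above.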

Cluster modes:
- Standard cluster: General-purpose, suitable for interactive workloads. Can be used by multiple users.
- High-Concurrency cluster: Built for many users sharing one cluster. Cluster resources are shared across users, and each user's code runs in a separate process for isolation.
- Single Node cluster: Only a driver node, no workers. Used for lightweight tasks or development with small data.

Cluster types:
- Interactive (or "all-purpose") cluster: Created manually and used for ad-hoc analysis. Runs until it is terminated manually or by auto-termination. Multiple users can attach notebooks.
- Job cluster: Automatically created when a job runs and terminated after job completion. Single-tenant (only the job uses it). Cost-effective for scheduled or automated workloads.

Cluster lifecycle:
1. Creation: User specifies cluster configuration (Spark version, node type, number of workers, autoscaling, etc.).
2. Starting: Cluster transitions from Pending to Running state. Driver and workers are provisioned.
3. Running: Cluster is available for notebook attachment or job execution.
4. Auto-scaling (if enabled): Cluster adjusts number of workers between min and max based on workload.
5. Termination: Cluster is stopped. All memory and local storage are released. Configuration is preserved for later restart.
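The lifecycle steps above can be read as a small state machine. The sketch below is a simplified model covering only three states; `advance` and the `TRANSITIONS` table are illustrative names, and the real service has more states and transitions.

```python
from enum import Enum


class ClusterState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    TERMINATED = "TERMINATED"


# Allowed moves in this simplified model of the lifecycle.
TRANSITIONS = {
    ClusterState.PENDING: {ClusterState.RUNNING, ClusterState.TERMINATED},
    ClusterState.RUNNING: {ClusterState.TERMINATED},
    ClusterState.TERMINATED: {ClusterState.PENDING},  # restart from saved config
}


def advance(current, target):
    """Return the new state, or raise if the move is not allowed in this model."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


state = ClusterState.PENDING
state = advance(state, ClusterState.RUNNING)
state = advance(state, ClusterState.TERMINATED)
print(state.value)  # TERMINATED
```

Note that a terminated cluster goes back through Pending before it runs again, because its VMs must be provisioned afresh.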

Key configuration parameters:
- Databricks Runtime Version: Includes Apache Spark version and additional optimizations. E.g., "Runtime 12.2 LTS" includes Spark 3.3.2.
- Node Type: Defines VM SKU (e.g., Standard_DS3_v2) with specific vCPUs, memory, and local SSD.
- Min and Max Workers: For autoscaling. If min == max, cluster is fixed size.
- Auto-termination: Idle timeout in minutes (default 120 minutes). Cluster auto-terminates if no activity.
- Spark Config: Custom Spark properties (e.g., spark.sql.shuffle.partitions).
- Environment Variables: Set for the cluster.
- Init Scripts: Shell scripts run on each node during startup.
- Access Mode: Single user, Shared (multiple users, isolated from each other), or No isolation shared (multiple users, no isolation).
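As a rough illustration of how these parameters fit together, the sketch below assembles a cluster definition as a dictionary. `build_cluster_spec` is a hypothetical helper. The `autoscale` and `spark_conf` field names follow the Clusters API naming as best understood here and should be checked against the API reference before use.

```python
import json


def build_cluster_spec(name, runtime, node_type, min_workers, max_workers,
                       idle_minutes=120, spark_conf=None):
    """Assemble a cluster definition as a plain dict."""
    spec = {
        "cluster_name": name,
        "spark_version": runtime,
        "node_type_id": node_type,
        "autotermination_minutes": idle_minutes,
    }
    if min_workers == max_workers:
        spec["num_workers"] = min_workers  # fixed-size cluster
    else:
        spec["autoscale"] = {"min_workers": min_workers, "max_workers": max_workers}
    if spark_conf:
        spec["spark_conf"] = dict(spark_conf)
    return spec


spec = build_cluster_spec(
    "etl-cluster", "12.2.x-scala2.12", "Standard_DS3_v2",
    min_workers=2, max_workers=8,
    spark_conf={"spark.sql.shuffle.partitions": "200"},
)
print(json.dumps(spec, indent=2))
```

Passing equal minimum and maximum values yields a `num_workers` entry instead of an `autoscale` block, matching the "min == max means fixed size" rule above.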

How Notebooks and Clusters Interact

A notebook must be attached to a cluster to execute code. Attachment is a logical connection: the notebook sends code to the cluster's driver via the Databricks web app API. The driver compiles the code, creates Spark jobs, and returns results. Multiple notebooks can attach to the same cluster; they share the cluster's resources. However, they do not share session state (variables, or ordinary temp views, which are scoped to one session) unless data is explicitly shared, for example through global temp views or tables.

When you detach a notebook from a cluster, the cluster continues running but the notebook loses its state. Reattaching to the same cluster does not restore state.
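A toy model may help keep the sharing rules straight. Everything here is hypothetical (the `Cluster` class, its attributes, and the notebook names); it only mirrors the behaviour described above and says nothing about Databricks internals.

```python
class Cluster:
    """Toy cluster: per-notebook scopes plus one registry shared by all notebooks."""

    def __init__(self):
        self.scopes = {}        # notebook name -> private variables
        self.global_views = {}  # visible to every attached notebook

    def attach(self, notebook):
        self.scopes[notebook] = {}

    def detach(self, notebook):
        # The cluster keeps running; only this notebook's state is dropped.
        self.scopes.pop(notebook, None)


cluster = Cluster()
cluster.attach("nb_a")
cluster.attach("nb_b")

cluster.scopes["nb_a"]["threshold"] = 0.9           # private to nb_a
cluster.global_views["fraud_scores"] = [0.2, 0.95]  # shared on purpose

print("threshold" in cluster.scopes["nb_b"])  # False
print(cluster.global_views["fraud_scores"])   # [0.2, 0.95]

cluster.detach("nb_a")
cluster.attach("nb_a")                         # reattach to the same cluster
print(cluster.scopes["nb_a"])                  # {}
```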

Cluster Policies

Cluster policies are rules that constrain cluster creation. They allow administrators to control cost and security. For example, a policy can limit node types, enforce autoscaling, or restrict Spark configurations. Policies are applied at the workspace level and can be assigned to users or groups.
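The idea of a policy constraining a cluster definition can be sketched as a small validator. The rule shapes (`allowlist`, `range` with `maxValue`) are loosely modelled on policy definitions, but this is a simplified illustration: `violations` and `get_path` are hypothetical helpers, not the actual policy engine.

```python
def get_path(spec, dotted):
    """Look up a dotted attribute path such as 'autoscale.max_workers'."""
    value = spec
    for key in dotted.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def violations(spec, policy):
    """Return a list of human-readable policy violations for a cluster spec."""
    problems = []
    for path, rule in policy.items():
        value = get_path(spec, path)
        if rule["type"] == "allowlist" and value not in rule["values"]:
            problems.append(f"{path}={value!r} is not in the allowed list")
        elif rule["type"] == "range":
            if value is None or value > rule["maxValue"]:
                problems.append(f"{path}={value!r} is missing or above {rule['maxValue']}")
    return problems


policy = {
    "node_type_id": {"type": "allowlist", "values": ["Standard_DS3_v2"]},
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    "autotermination_minutes": {"type": "range", "maxValue": 60},
}
spec = {
    "node_type_id": "Standard_DS4_v2",
    "autoscale": {"min_workers": 2, "max_workers": 20},
    "autotermination_minutes": 30,
}
for problem in violations(spec, policy):
    print(problem)
```

In this example the node type and the worker ceiling are rejected, while the 30-minute idle timeout passes, which is the kind of cost control the section describes.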

Integration with Azure Services

Databricks clusters can access Azure data sources:
- Azure Blob Storage / ADLS Gen2: Mount points or direct access using service principals.
- Azure SQL Database: JDBC connection from notebooks.
- Azure Cosmos DB: Spark connector.
- Azure Event Hubs: Streaming data ingestion.
- Azure Synapse Analytics: Read/write using Synapse connector.

For DP-900, you should know that Databricks can read data from Azure Storage, Azure SQL Database, and Azure Synapse Analytics, and that clusters can be configured with managed identities for secure access.
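For orientation, the addresses used to reach two of these sources have a predictable shape. The helpers below only build the strings; the account, container, server, and database names are made-up placeholders, and authentication is deliberately left out.

```python
def abfss_uri(container, account, path):
    """URI form used for direct ADLS Gen2 access."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def sql_jdbc_url(server, database):
    """JDBC URL for an Azure SQL Database logical server."""
    return f"jdbc:sqlserver://{server}.database.windows.net:1433;database={database}"


print(abfss_uri("raw", "contosolake", "/sales/2024/"))
# abfss://raw@contosolake.dfs.core.windows.net/sales/2024/
print(sql_jdbc_url("contoso-sql", "salesdb"))
# jdbc:sqlserver://contoso-sql.database.windows.net:1433;database=salesdb
```

In a notebook, strings like these would typically be handed to Spark's reader (for example `spark.read.parquet(...)` or `spark.read.format("jdbc")`), with credentials supplied separately.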

Key Exam-Relevant Details

Notebooks are not compute resources; they are code containers. Clusters are compute.

Job clusters are ephemeral: created for a job and terminated after.

Interactive clusters persist until terminated or auto-terminated.

Auto-termination default is 120 minutes.

High-concurrency clusters are optimized for multiple users.

Single node clusters have no workers.

Databricks Runtime includes Spark and is versioned.

Cluster policies control configuration compliance.

Notebooks support multiple languages via magic commands.

Notebooks can be scheduled as jobs.

Version history is automatic.

Example: Creating a Cluster via CLI

To create a cluster using the Databricks CLI:

databricks clusters create --json '{
  "cluster_name": "my-cluster",
  "spark_version": "12.2.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 5,
  "autotermination_minutes": 60
}'

To list clusters:

databricks clusters list

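Attaching a notebook is done from the notebook toolbar, not from code. To run another notebook from inside a notebook (the child runs on the cluster the calling notebook is attached to, so no cluster ID is passed):

# dbutils is available only inside a Databricks notebook.
notebook_path = "/Users/user@example.com/MyNotebook"
timeout_seconds = 3600
# Returns whatever the child notebook passes to dbutils.notebook.exit().
result = dbutils.notebook.run(notebook_path, timeout_seconds, {"param": "value"})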

Summary of Cluster States

PENDING: Cluster is being provisioned.

RUNNING: Cluster is ready.

TERMINATED: Cluster is stopped.

ERROR: Cluster creation failed.

UNKNOWN: State cannot be determined.

Important Defaults and Limits

Maximum workers per cluster: 1000 (default limit, can be increased via support ticket).

Maximum clusters per workspace: 100 (default).

Auto-termination default: 120 minutes.

Notebook cell execution timeout: 24 hours (default).

Max notebook size: 10 MB.

How Notebooks and Clusters Fit into Analytics Workloads

Databricks is used for data engineering (ETL), data science (ML), and data analysis (SQL). Notebooks provide the interactive interface for these tasks. Clusters provide the compute. On DP-900, you may be asked to identify which component is used for which purpose: notebooks for code development and visualization; clusters for execution. You may also need to choose the appropriate cluster type for a given scenario (e.g., job cluster for scheduled ETL, interactive cluster for ad-hoc analysis).

Walk-Through

1. Create a Cluster

Navigate to the Compute tab in the Databricks workspace and click 'Create Cluster'. Provide a name, select the Databricks Runtime version (e.g., 12.2 LTS), choose a node type (VM size) for driver and workers, set the number of workers (or enable autoscaling with min and max), configure auto-termination idle timeout (default 120 minutes), and optionally add Spark configurations or init scripts. Click 'Create Cluster'. The cluster enters Pending state while Azure provisions the VMs. Once all nodes are ready, the cluster transitions to Running state.

2. Attach Notebook to Cluster

Open a notebook in the Databricks workspace. In the notebook toolbar, click the 'Detached' dropdown and select the running cluster you created. The notebook is now logically connected to the cluster. The cluster's driver node will execute code from this notebook. You can attach multiple notebooks to the same cluster; they share the cluster's resources but have separate variable scopes. If you detach, the notebook loses its in-memory state.

3. Run a Notebook Cell

Click inside a code cell and press Shift+Enter or click the 'Run Cell' button. The notebook sends the cell content to the attached cluster's driver. The driver compiles the code (e.g., Python, SQL) and creates a Spark execution plan. It then distributes tasks to worker nodes. Workers process data in parallel and return results. The driver aggregates results and sends them back to the notebook for display. Output (e.g., a DataFrame preview or a plot) appears below the cell. The cluster's Spark UI can be accessed to monitor job progress.

4. Auto-Scale Cluster (If Enabled)

If autoscaling is configured (min and max workers differ), the cluster monitors resource utilization. When there are pending tasks and workers are at capacity, the cluster requests additional worker nodes from Azure. Conversely, if workers are underutilized for a period (default 10 minutes), excess workers are decommissioned. This happens dynamically without interrupting running jobs. The driver adjusts the number of executors. Autoscaling helps optimize cost and performance.
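The scaling decision can be approximated with a few lines of arithmetic. This is a deliberately simplified model: `target_workers` and `slots_per_worker` are hypothetical names, and the real autoscaler also takes timing and utilization history into account.

```python
def target_workers(pending_tasks, slots_per_worker, min_workers, max_workers):
    """Pick a worker count for the next interval in this simplified model."""
    # Workers needed so every pending task has a slot (ceiling division).
    needed = -(-pending_tasks // slots_per_worker)
    # Never go below the configured minimum or above the configured maximum.
    return max(min_workers, min(max_workers, needed))


print(target_workers(90, slots_per_worker=4, min_workers=2, max_workers=10))  # 10
print(target_workers(12, slots_per_worker=4, min_workers=2, max_workers=10))  # 3
print(target_workers(0, slots_per_worker=4, min_workers=2, max_workers=10))   # 2
```

The first call shows why a low maximum caps throughput during spikes, and the last shows that an idle cluster still pays for its minimum worker count.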

5. Terminate Cluster

From the Compute tab, select the cluster and click 'Terminate'. Alternatively, if auto-termination is set, the cluster will automatically terminate after the idle timeout (e.g., 120 minutes of no activity). Termination releases all VMs and associated resources. The cluster configuration is saved, so you can restart it later. Job clusters terminate automatically after the job completes. Terminated clusters incur no compute charges but may incur storage costs for cluster logs.
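The idle-timeout rule is easy to express as a comparison of timestamps. The function below is an illustrative sketch, not Databricks code; treating a timeout of 0 as "disabled" is an assumption based on how the cluster setting is commonly described.

```python
from datetime import datetime, timedelta


def should_auto_terminate(last_activity, now, idle_minutes=120):
    """True once the cluster has been idle for at least the configured timeout."""
    if idle_minutes == 0:  # treated here as "auto-termination disabled"
        return False
    return now - last_activity >= timedelta(minutes=idle_minutes)


last_command = datetime(2024, 5, 1, 9, 0)
print(should_auto_terminate(last_command, datetime(2024, 5, 1, 10, 30)))  # False
print(should_auto_terminate(last_command, datetime(2024, 5, 1, 11, 0)))   # True
```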

What This Looks Like on the Job

Enterprise Scenario 1: Data Engineering ETL Pipeline

A large retail company ingests daily sales data from thousands of stores into Azure Data Lake Storage Gen2. They use Databricks to transform raw data into curated tables. They define a job cluster of Standard_DS4_v2 nodes (8 vCPUs, 28 GB RAM each) running Databricks Runtime 10.4 LTS, with autoscaling between 10 and 50 workers. No idle timeout is needed, because a job cluster terminates on its own when the job finishes. The ETL notebook reads Parquet files from ADLS, performs aggregations, and writes to Delta tables. The job is scheduled to run nightly via Databricks Jobs. If the job fails due to a transient Azure issue, the job retries up to three times. The cluster is terminated after the job completes, saving costs. Common misconfiguration: setting the autoscaling minimum too low can cause slow performance during data spikes; setting the maximum higher than the workload needs can increase cost.

Enterprise Scenario 2: Collaborative Data Science

A financial services firm has a team of data scientists building machine learning models to detect fraud. They use an interactive cluster in high-concurrency mode so that multiple data scientists can work simultaneously. The cluster uses 10 workers of type Standard_DS3_v2 (4 vCPUs, 14 GB RAM) and is configured to auto-terminate after 4 hours of inactivity. Each data scientist attaches their own notebook to the same cluster. They share data via global temp views. The cluster is left running during the workday and terminated manually at night. A common issue: one user's heavy workload (e.g., training a deep learning model) can consume most of the cluster's resources, slowing everyone else's work. To mitigate this, they run heavy workloads on separate clusters, optionally drawing nodes from a cluster pool to shorten start-up time.

Enterprise Scenario 3: SQL Analytics for Business Users

A healthcare organization uses Databricks SQL (SQL warehouses) to provide business analysts with a SQL interface. They create a SQL warehouse (compute optimized for SQL workloads) at a small size (2X-Small) for development and a larger size (Medium) for production. Analysts run SQL queries from the Databricks SQL editor or from Power BI over a JDBC/ODBC connection. Under concurrent load, the warehouse scales out by adding clusters between its configured minimum and maximum. Misconfiguration: setting the warehouse size too small can cause slow queries or timeouts; setting it too large wastes money. Auto Stop shuts the warehouse down after a configured idle period, so it is not billed overnight; the first query after a stop has to wait for the warehouse to start again.

How DP-900 Actually Tests This

What DP-900 Tests on This Topic

DP-900 objective 3.3 (Describe analytics workloads) includes understanding of Azure Databricks notebooks and clusters. Specifically, you should be able to:

Identify that notebooks are used for interactive data exploration and code development.

Identify that clusters provide the compute resources for executing notebook code.

Differentiate between interactive clusters and job clusters.

Recognize that clusters can be scaled up/down or auto-scaled.

Know that Databricks supports multiple languages (Python, SQL, Scala, R).

Understand that notebooks can be scheduled as jobs.

Common Wrong Answers and Why Candidates Choose Them

1. Wrong: "Notebooks are compute resources." Candidates confuse notebooks with clusters because both are created in the workspace. Reality: Notebooks are code and documentation; clusters are VMs that run the code.

2. Wrong: "Job clusters are always running." Candidates think job clusters are persistent like interactive clusters. Reality: Job clusters are created per job and terminated after completion.

3. Wrong: "Clusters can be shared across workspaces." Candidates may assume cross-workspace sharing. Reality: Clusters are scoped to a single workspace (within that workspace, they can be shared with other users through cluster permissions).

4. Wrong: "Notebooks store data persistently." Candidates think data in notebooks is saved. Reality: Notebooks only store code and output; data is stored in external storage (e.g., ADLS) or cluster memory (ephemeral).

Specific Numbers and Terms That Appear on the Exam

Auto-termination default: 120 minutes.

Runtime versions: e.g., "Runtime 10.4 LTS".

Node types: e.g., Standard_DS3_v2.

Cluster states: Pending, Running, Terminated.

Magic commands: %python, %sql, %scala, %r.

Edge Cases and Exceptions

A notebook can be attached to only one cluster at a time.

A cluster can have multiple notebooks attached.

Single node clusters have no workers; they are suitable for lightweight tasks.

High-concurrency clusters share one driver and one set of cluster resources among users, while isolating each user's code in its own process.

Job clusters cannot be attached to notebooks interactively; they are used only by jobs.

How to Eliminate Wrong Answers

If the question asks about "interactive analysis," the answer likely involves notebooks and interactive clusters.

If the question mentions "scheduled ETL," look for job clusters and job scheduling.

If the question asks about "cost savings," consider job clusters or auto-termination.

If the question asks about "multi-user collaboration," think of high-concurrency clusters or shared interactive clusters.

Eliminate any answer that conflates notebooks with compute or clusters with storage.

Key Takeaways

Notebooks are code containers; clusters are compute resources.

Interactive clusters are persistent; job clusters are ephemeral.

Auto-termination default is 120 minutes.

Clusters can be fixed-size or auto-scaled.

Databricks supports Python, SQL, Scala, and R in notebooks.

Notebooks can be scheduled as jobs.

Cluster policies enforce configuration rules.

Single node clusters have no workers.

High-concurrency clusters support multiple users with isolation.

Clusters can access Azure data sources like ADLS, SQL DB, and Synapse.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Interactive Cluster

Created manually and persists until terminated.

Used for ad-hoc analysis and development.

Multiple users can attach notebooks.

Cost incurred continuously while running.

Can be auto-scaled and auto-terminated.

Job Cluster

Created automatically by a job and terminated after job completion.

Used for scheduled or automated workloads.

Single-tenant (only the job uses it).

Cost incurred only during job execution.

Configuration is defined in the job settings.
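The cost difference between the two comes down to how long the nodes exist. The numbers below are a hypothetical workload chosen for illustration, and the sketch ignores any difference in per-hour pricing between the two cluster types.

```python
def compute_hours(nodes, hours_running):
    """Node-hours consumed; multiply by an hourly rate to get cost."""
    return nodes * hours_running


# Hypothetical workload: a 1.5-hour nightly ETL on 1 driver + 10 workers.
nodes = 11
job_cluster = compute_hours(nodes, hours_running=1.5)        # exists only for the run
all_purpose = compute_hours(nodes, hours_running=1.5 + 2.0)  # plus 120 idle minutes

print(job_cluster)  # 16.5
print(all_purpose)  # 38.5
```

Running the same work on an interactive cluster that then idles until the 120-minute auto-termination more than doubles the node-hours in this example.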

Watch Out for These

Mistake

Notebooks are compute resources that execute code.

Correct

Notebooks are code containers (like scripts with UI). They require a cluster to execute code. The cluster provides the compute (VMs).

Mistake

Job clusters are always running and can be used for interactive analysis.

Correct

Job clusters are ephemeral: they start when a job runs and terminate after completion. They cannot be attached to notebooks for interactive use.

Mistake

Auto-termination is disabled by default.

Correct

Auto-termination is enabled by default with a timeout of 120 minutes. You can disable it or change the timeout.

Mistake

All clusters in Databricks are single-user.

Correct

Interactive clusters can be used by multiple users. High-concurrency clusters are specifically designed for multi-user workloads.

Mistake

Notebooks store data permanently in the cluster.

Correct

Notebooks store code and output. Data is stored in external storage (e.g., Azure Data Lake) or in cluster memory (ephemeral). When the cluster terminates, in-memory data is lost.


Frequently Asked Questions

What is the difference between a notebook and a cluster in Azure Databricks?

A notebook is a web-based interface to write and run code, visualize results, and add documentation. A cluster is a set of virtual machines that execute the code. Notebooks are attached to clusters to run code. Think of notebooks as recipes and clusters as the kitchen equipment.

Can I attach a notebook to a job cluster?

No. Job clusters are created automatically when a job runs and are not available for interactive notebook attachment. You can only attach notebooks to interactive (all-purpose) clusters.

What is the default auto-termination time for an interactive cluster?

The default is 120 minutes (2 hours). You can change this when creating the cluster or modify it later. Setting a shorter timeout saves costs.

How do I share a notebook with others in Databricks?

You can share notebooks by granting permissions (read, edit, run) to users or groups. Notebooks reside in the workspace; collaborators can open them if they have access. Real-time co-authoring is supported.

What is a high-concurrency cluster?

A high-concurrency cluster is a type of interactive cluster optimized for multiple users. Users share the driver and the cluster's resources, and each user's code runs in a separate process to provide isolation. It is suitable for collaborative data science.

Can I run SQL in a Databricks notebook?

Yes. Use the %sql magic command at the beginning of a cell to switch to SQL. You can also use Python or Scala to run SQL via Spark SQL APIs.

What happens to my data when a cluster terminates?

Data stored in cluster memory (e.g., DataFrames, temp views) is lost. Data stored in external storage (e.g., Azure Data Lake, Delta tables) persists. Always save results to persistent storage.
