DP-900 · Chapter 27 of 101 · Objective 3.1

Azure HDInsight

Azure HDInsight is a managed, full-spectrum, open-source analytics service for enterprises. This chapter covers what HDInsight is, its core components (Hadoop, Spark, Hive, HBase, Storm, Kafka, and ML Services), how it processes data at scale, and its integration with Azure storage and monitoring. On the DP-900 exam, HDInsight appears in roughly 5–10% of questions under objective 3.1 (Describe analytics workloads, including HDInsight, Azure Databricks, and data warehousing). You must understand the scenarios for each component and the difference between HDInsight and other analytics services.

25 min read
Intermediate
Updated May 31, 2026

HDInsight as a Prefab Data Factory

Imagine a company that needs to process raw materials (data) into finished products (insights). Instead of building a custom factory from scratch for each project, they order a prefab factory from a catalog. Each prefab factory comes with specialized machinery (Hadoop, Spark, Hive, HBase, etc.) pre-installed and pre-configured. The factory is assembled on a rented plot of land (Azure virtual network) with a specific number of workers (nodes) and a foreman (head node). The foreman assigns tasks to workers using a blueprint (script action) that can customize the machinery. The factory can be scaled up by adding more workers or scaled down during off-hours. When the project is done, the entire factory is dismantled (deleted) to stop paying rent. Crucially, the factory is not a single machine but a coordinated cluster of workers that communicate via a high-speed conveyor belt (Azure Storage or Data Lake Storage). The foreman keeps track of all jobs, and if a worker breaks down, the foreman reassigns its tasks to another worker. This prefab approach avoids the months of design and construction needed for a custom factory, allowing the company to start production in minutes.

How It Actually Works

What is Azure HDInsight?

Azure HDInsight is a cloud distribution of Apache Hadoop. It provides a managed cluster environment for running popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, HBase, and R Server. The service is designed to handle massive amounts of data (petabytes) by distributing processing across many nodes. HDInsight integrates with Azure Blob Storage and Azure Data Lake Storage Gen2, allowing you to decouple compute from storage. This means you can delete the cluster without losing data.

Why HDInsight Exists

On-premises Hadoop clusters are complex to set up, tune, and maintain. HDInsight abstracts away the operational overhead: provisioning, configuration, patching, monitoring, and scaling are handled by Azure. You pay only for the compute nodes (and associated storage). HDInsight supports multiple cluster types, each optimized for specific workloads. The DP-900 exam expects you to know which cluster type to use for batch processing, interactive querying, real-time streaming, NoSQL, and machine learning.

How HDInsight Works Internally

An HDInsight cluster consists of a set of virtual machines (nodes) deployed in an Azure Virtual Network. The nodes are grouped into roles:

Head nodes (2 for most cluster types): manage the cluster, run master services such as the HDFS NameNode and the YARN ResourceManager, and schedule jobs.

Worker nodes: perform the actual data processing. In Hadoop, they run DataNode and NodeManager. In Spark, they run executors.

Zookeeper nodes (3): provide coordination and leader election for distributed systems (e.g., HBase, Kafka, Storm).

Edge nodes (optional): a gateway node for client access, typically used for submitting jobs without exposing the head nodes.

When you submit a job (e.g., a Hive query or Spark application), the head node splits the work into tasks and assigns them to worker nodes. Each worker node processes a portion of the data stored in Azure Storage (Blob or ADLS Gen2). Intermediate results are shuffled across the network, and the final result is aggregated on the head node. HDInsight uses YARN (Yet Another Resource Negotiator) for resource management and job scheduling.
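The split, shuffle, and aggregate flow described above can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not HDInsight code: the function names are hypothetical, the "workers" are simulated in one process, and the word-count job is an assumed example.

```python
from collections import defaultdict


def map_phase(partition):
    """Runs on one (simulated) worker: emit (word, 1) for each word in its slice."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(mapped_outputs):
    """Group intermediate pairs by key, as the network shuffle would."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate the values collected for each key into the final result."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(lines, num_workers):
    # Split the input into one partition per simulated worker node.
    partitions = [lines[i::num_workers] for i in range(num_workers)]
    mapped = [map_phase(p) for p in partitions]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    data = ["spark hive kafka", "hive hbase", "spark hive"]
    print(run_job(data, num_workers=2))
```

On a real cluster YARN decides where each task runs, and the partitions are read from Azure Storage rather than from an in-memory list.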

Key Components and Default Values

Cluster types: Hadoop, Spark, Hive (with LLAP), Kafka, Storm, HBase, and ML Services. Each has a specific set of pre-installed components.

Node sizes: Head nodes: D13 v2 (8 vCPU, 56 GB RAM) is default; worker nodes: D12 v2 (4 vCPU, 28 GB RAM) default. Can be changed during provisioning.

Scaling: Manual scaling only (no autoscaling for most types; some preview autoscale exists). Scale-out increases worker nodes; scale-in reduces them. A cluster can have 1–1000 worker nodes (default limit 50, can be increased via support request).

Storage: Default storage is Azure Blob Storage or ADLS Gen2. Must be configured at cluster creation. Additional linked storage can be added later.

Script actions: Custom scripts to install additional software (e.g., Anaconda, custom libraries) during or after cluster creation. They run on all nodes or specific node types.

Networking: Clusters are deployed into a virtual network. Public endpoint is enabled by default; can be disabled for private clusters.

Monitoring: Azure Monitor logs and Ambari web UI (accessible via public endpoint or VPN).

Configuration and Verification Commands

To create an HDInsight cluster using Azure CLI:

az hdinsight create \
    --resource-group myResourceGroup \
    --name myHDICluster \
    --location eastus \
    --type hadoop \
    --cluster-tier standard \
    --http-user admin \
    --http-password 'Password123!' \
    --ssh-user sshuser \
    --ssh-password 'Password123!' \
    --workernode-count 3 \
    --storage-account mystorageaccount \
    --storage-container mycontainer \
    --storage-account-key '<key>'

To list clusters:

az hdinsight list --resource-group myResourceGroup

To scale a cluster:

az hdinsight resize --resource-group myResourceGroup --name myHDICluster --workernode-count 5

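To submit a Hive query over REST, use the WebHCat (Templeton) endpoint. The Ambari REST API manages the cluster and its services; it does not run Hive queries.

curl -s -u 'admin:Password123!' -d user.name=admin --data-urlencode execute="SELECT * FROM hivesampletable LIMIT 10;" -d statusdir="/example/rest" https://myHDICluster.azurehdinsight.net/templeton/v1/hive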
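A REST submission like the one above can also be prepared from Python. The sketch below only builds the URL and form body and sends nothing. It assumes the WebHCat (Templeton) endpoint at /templeton/v1/hive with the form fields user.name, execute, and statusdir. The function name and the status directory are hypothetical.

```python
from urllib.parse import urlencode


def build_hive_request(cluster_name, user, query, status_dir="/example/rest"):
    """Return (url, form_body) for a WebHCat Hive submission; nothing is sent."""
    url = f"https://{cluster_name}.azurehdinsight.net/templeton/v1/hive"
    body = urlencode({
        "user.name": user,        # account the job runs as
        "execute": query,         # HiveQL statement to run
        "statusdir": status_dir,  # where job status and output are written
    })
    return url, body


if __name__ == "__main__":
    url, body = build_hive_request(
        "myHDICluster", "admin", "SELECT * FROM hivesampletable LIMIT 10"
    )
    print(url)
    print(body)
```

Keeping request construction separate from sending makes the encoding easy to inspect before any credentials are involved.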

Interaction with Related Technologies

HDInsight integrates deeply with the Azure ecosystem:

Azure Storage (Blob/ADLS Gen2): Primary storage. Compute is separate; you can delete the cluster and keep data.

Azure Data Factory: Orchestrates data movement into HDInsight for processing.

Azure Monitor: Collects cluster metrics and logs.

Azure Active Directory: Domain-joined clusters for Kerberos authentication.

Power BI: Can connect to Hive via ODBC for visualization.

Azure Event Hubs: Often used with Kafka or Storm for real-time data ingestion.

Cluster Types and Use Cases (Exam Focus)

Hadoop: Batch processing, ETL, data transformation. Uses MapReduce and YARN.

Spark: In-memory processing, faster than Hadoop for iterative algorithms. Supports Scala, Python, Java, R, SQL.

Hive (LLAP): Interactive SQL queries on large datasets. LLAP (Low Latency Analytical Processing) provides caching for faster queries.

Kafka: Real-time data streaming and publish-subscribe messaging.

Storm: Real-time stream processing (low latency, at-least-once semantics).

HBase: NoSQL database for random read/write access on large tables. Built on the Hadoop stack; on HDInsight its data is persisted in Azure Storage rather than cluster-local HDFS.

ML Services (R Server): Distributed machine learning using R or Python.
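The list above is effectively a lookup table from workload to cluster type, which is how exam scenarios tend to use it. A minimal sketch follows; the workload labels and the function name are hypothetical shorthand, not Azure terminology.

```python
# Workload keyword -> HDInsight cluster type, taken from the list above.
CLUSTER_FOR_WORKLOAD = {
    "batch": "Hadoop",
    "in-memory": "Spark",
    "interactive-sql": "Hive LLAP",
    "ingestion": "Kafka",
    "stream-processing": "Storm",
    "nosql": "HBase",
    "machine-learning": "ML Services",
}


def pick_cluster_type(workload):
    """Return the cluster type for a workload label, or raise ValueError."""
    try:
        return CLUSTER_FOR_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"unknown workload: {workload!r}") from None


if __name__ == "__main__":
    for label in ("interactive-sql", "stream-processing", "nosql"):
        print(label, "->", pick_cluster_type(label))
```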

Performance Considerations

Data locality: Since storage is separate (Azure Storage), network bandwidth between compute and storage is critical. Use accelerated networking and large VM sizes.

Shuffle: Intermediate data transfer across nodes can be a bottleneck. Tune YARN memory settings.

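Scaling: Scale-out improves throughput roughly linearly for many workloads. Scale-in reduces cost but can fail running or pending jobs, and lose anything held only on a removed node's local disks, if it is not managed. Data in Azure Storage is unaffected.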

Security

Network isolation: Deploy in a VNet with NSGs.

Encryption: Azure Storage encryption at rest; TLS for data in transit.

Authentication: Local admin account or Azure AD DS for Kerberos.

Authorization: Apache Ranger for fine-grained access control (with ESP clusters).

Cost Management

You pay per node per hour (prorated). Head nodes and Zookeeper nodes are also billed.

To save costs, delete the cluster when not in use. Data persists in storage.

Use low-priority VMs for worker nodes (interruptible, up to 80% discount).
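The billing rules above reduce to simple arithmetic. The sketch below assumes two head nodes and three Zookeeper nodes, as described earlier in the chapter. All hourly rates are made-up placeholders, not Azure prices, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterCost:
    worker_nodes: int
    worker_rate: float      # hypothetical price per worker node-hour
    head_rate: float        # hypothetical price per head node-hour
    zookeeper_rate: float   # hypothetical price per Zookeeper node-hour
    head_nodes: int = 2
    zookeeper_nodes: int = 3

    def hourly(self):
        """Every node role is billed, not just the workers."""
        return (self.worker_nodes * self.worker_rate
                + self.head_nodes * self.head_rate
                + self.zookeeper_nodes * self.zookeeper_rate)

    def monthly(self, hours_per_day, days=30):
        """Cost when the cluster exists only hours_per_day hours each day."""
        return self.hourly() * hours_per_day * days


if __name__ == "__main__":
    cost = ClusterCost(worker_nodes=10, worker_rate=0.5,
                       head_rate=1.0, zookeeper_rate=0.25)
    print("always on:", cost.monthly(hours_per_day=24))
    print("8 h nightly, then deleted:", cost.monthly(hours_per_day=8))
```

With these placeholder rates, running the cluster 8 hours a day costs a third of keeping it up around the clock, which is the point of deleting it between jobs.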

Common Exam Traps

1. HDInsight vs Azure Databricks: HDInsight is for open-source Hadoop ecosystem; Databricks is a managed Spark platform with optimized performance. DP-900 expects you to choose HDInsight for multi-framework workloads (e.g., Hive + Spark + Kafka).

2. Storage decoupling: Many candidates think HDInsight uses HDFS by default. Actually, it uses Azure Storage as the default filesystem (WASB or ABFS). HDFS is not used unless explicitly configured.

3. Cluster types: The exam may present a scenario requiring real-time stream processing and ask which cluster type to use. The answer is Storm (for low latency) or Kafka (for high throughput buffering). Spark Streaming is also possible but has higher latency.

4. Scaling: HDInsight supports manual scaling only (except preview autoscale). Candidates often assume autoscaling is available for all cluster types.
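The WASB and ABFS schemes from the storage decoupling trap appear as URI prefixes in job paths, which makes the "not HDFS" point concrete. The helpers below are a small sketch. They assume the usual container@account.blob.core.windows.net and filesystem@account.dfs.core.windows.net layouts and the TLS variants wasbs and abfss. The function names and sample paths are hypothetical.

```python
def wasb_uri(container, account, path="", secure=True):
    """Build a Blob Storage URI as used by the WASB driver."""
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def abfs_uri(filesystem, account, path="", secure=True):
    """Build an ADLS Gen2 URI as used by the ABFS driver."""
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{filesystem}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


if __name__ == "__main__":
    print(wasb_uri("mycontainer", "mystorageaccount", "example/data/input.csv"))
    print(abfs_uri("myfilesystem", "mystorageaccount", "/example/data/input.csv"))
```

A path starting with hdfs:// would instead point at cluster-local disks that disappear when the cluster is deleted.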

Step-by-Step: How a Hive Query Runs on HDInsight

1. Query Submission: User submits a HiveQL query via Ambari, Beeline, or ODBC. The query is sent to the HiveServer2 on the head node.

2. Compilation: HiveServer2 parses the query, generates an abstract syntax tree (AST), and then a logical plan. It then optimizes the plan (e.g., predicate pushdown) and converts it into a physical plan of MapReduce or Tez tasks.

3. Execution Planning: The physical plan is submitted to YARN (or Tez) as a DAG (Directed Acyclic Graph) of stages. YARN allocates containers on worker nodes.

4. Data Access: Each task reads data from Azure Storage using the WASB or ABFS connector. Data is not stored locally on nodes unless cached (LLAP).

5. Task Execution: Map tasks process data in parallel. Intermediate results are shuffled (sorted and partitioned) across the network to reduce tasks.

6. Aggregation: Reduce tasks aggregate results and write final output back to Azure Storage or Hive warehouse directory.

7. Result Retrieval: The final result is fetched by HiveServer2 and returned to the client.
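The DAG in the execution-planning step can be pictured as a dependency graph in which a stage may start only once every stage it depends on has finished. The sketch below uses Python's standard graphlib module to order a made-up five-stage plan. The stage names are hypothetical and do not come from a real Hive plan.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages that must finish before it can start.
QUERY_DAG = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "fetch_result": {"aggregate"},
}


def execution_order(dag):
    """Return one valid order in which the stages could be scheduled."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for position, stage in enumerate(execution_order(QUERY_DAG), start=1):
        print(position, stage)
```

The two scan stages have no dependency on each other, which is why a scheduler is free to run them in parallel on different worker nodes.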

Real-World Section

Scenario 1: Large-Scale ETL for a Retail Company A global retailer ingests terabytes of clickstream data daily into Azure Blob Storage. They use HDInsight Hadoop clusters to run nightly ETL jobs that clean, transform, and aggregate the data into Parquet files. The cluster is scaled to 50 worker nodes during the night and deleted during the day. The data is then loaded into Azure Synapse Analytics for reporting. Misconfiguration: If the storage account is in a different region, latency becomes unacceptable. They solved it by colocating the cluster and storage in the same region.

Scenario 2: Real-Time Fraud Detection A financial services company uses HDInsight Kafka to ingest millions of transactions per second. Storm clusters consume from Kafka topics and apply fraud detection models (trained using ML Services). The output is written to HBase for low-latency lookups. They use script actions to install custom Python libraries. Performance: They needed to tune Kafka producer batch sizes and Storm parallelism to avoid backpressure. Common failure: If the Storm topology crashes, unacknowledged messages are replayed after recovery, so some transactions are processed more than once. At-least-once semantics therefore require idempotent writes downstream.
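At-least-once delivery means a message can be replayed after a failure, so the sink may see the same transaction twice. A common remedy is to key each write by transaction ID so a replay overwrites rather than duplicates. The class below is a minimal in-memory sketch of that idea. The dictionary stands in for an HBase table, and all names are hypothetical.

```python
class IdempotentSink:
    """Upsert-by-key sink: replaying a message leaves the table unchanged."""

    def __init__(self):
        self._rows = {}  # stands in for an HBase table keyed by transaction ID

    def write(self, txn_id, verdict):
        """Store the verdict; return True only the first time txn_id is seen."""
        first_time = txn_id not in self._rows
        self._rows[txn_id] = verdict
        return first_time

    def __len__(self):
        return len(self._rows)


if __name__ == "__main__":
    sink = IdempotentSink()
    # "t1" arrives twice, as it would after a topology restart and replay.
    for txn_id, verdict in [("t1", "ok"), ("t2", "fraud"), ("t1", "ok")]:
        print(txn_id, "new" if sink.write(txn_id, verdict) else "replayed")
    print("rows stored:", len(sink))
```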

Scenario 3: Interactive Analytics with Hive LLAP A healthcare organization runs ad-hoc SQL queries on 10-year patient records stored in ADLS Gen2. They deployed a Hive LLAP cluster with 20 worker nodes and LLAP caching enabled. Queries that previously took minutes now return in seconds. They use Azure Active Directory Domain Services for Kerberos authentication. Cost: They keep the cluster running 24/7 because LLAP caches data in memory; stopping the cluster would invalidate the cache. They optimized by using reserved instances for a 3-year term.

What Goes Wrong When Misconfigured

Wrong cluster type: Using Hadoop for interactive queries (too slow).

Insufficient head node size: Head node becomes bottleneck for job scheduling.

Storage account key rotation: If the storage key changes and HDInsight doesn't have the new key, the cluster fails. Must update via Ambari.

Network restrictions: If the VNet blocks outbound internet, cluster provisioning fails because Azure needs to download component packages.

Exam Focus Section

Objective 3.1: Describe analytics workloads – HDInsight is specifically called out. The exam tests your ability to choose the appropriate cluster type for a given workload.

Common Wrong Answers:

1. Azure Databricks for multi-framework workloads: Databricks is optimized for Spark; if the scenario requires Hive, HBase, or Storm, HDInsight is correct.

2. HDFS as default storage: HDInsight uses Azure Storage as default; HDFS is an option but not default.

3. Autoscaling is available for all cluster types: Only manual scaling is generally available; autoscaling is in preview for limited types.

4. HDInsight is serverless: It is not; you provision VMs and pay per node.

Numbers and Terms That Appear on the Exam:

Default head node: D13 v2 (8 vCPU, 56 GB RAM)

Default worker node: D12 v2 (4 vCPU, 28 GB RAM)

Maximum worker nodes: 1000 (default limit 50)

Cluster types: Hadoop, Spark, Hive LLAP, Kafka, Storm, HBase, ML Services

Storage: Azure Blob Storage (WASB) or Azure Data Lake Storage Gen2 (ABFS)

Script actions: Customize cluster

Ambari: Monitoring and management UI

Edge Cases:

Cluster in a VNet without public IP: Must access via VPN or ExpressRoute. Ambari is not publicly accessible.

Low-priority VMs: Can be evicted; suitable for batch jobs that can tolerate interruptions.

Kafka with HDInsight: Not for stream processing; use Storm or Spark Streaming for that. Kafka is for ingestion.

How to Eliminate Wrong Answers:

If the scenario mentions multiple frameworks (e.g., Hive + Spark + Kafka), HDInsight is the only Azure service that supports all.

If the scenario emphasizes open-source and customization, HDInsight wins over Databricks.

If the scenario requires low-latency NoSQL (random read/write), choose HBase.

If the scenario is about real-time stream processing with millisecond latency, choose Storm (not Kafka).

Misconceptions

1. Myth: HDInsight stores data in HDFS by default. Reality: Default storage is Azure Blob Storage or Azure Data Lake Storage Gen2. HDFS is an option but not default. The cluster can use HDFS for temporary data, but persistent storage is in Azure Storage.

2. Myth: HDInsight clusters support autoscaling out of the box. Reality: Only manual scaling is generally available. Autoscaling is in preview for selected cluster types (e.g., Spark) and has limitations.

3. Myth: HDInsight is serverless. Reality: HDInsight provisions VMs; you pay per node per hour. It is not serverless like Azure Synapse serverless SQL.

4. Myth: All cluster types can be created in any region. Reality: Some cluster types (e.g., ML Services) are not available in all regions. Check regional availability.

5. Myth: You cannot use custom libraries with HDInsight. Reality: You can use script actions to install custom software (e.g., Python packages, R libraries) on all nodes.

Comparisons

HDInsight vs Azure Databricks

HDInsight: Supports multiple open-source frameworks (Hadoop, Spark, Hive, HBase, Kafka, Storm). More customizable. Requires manual scaling. Cost: Pay per node.

Azure Databricks: Optimized for Apache Spark. Managed workspace with collaborative notebooks. Autoscaling and serverless options. Cost: Pay per DBU (Databricks Unit).

HDInsight vs Azure Synapse Analytics

HDInsight: For big data processing using open-source tools. Supports batch, interactive, streaming, and NoSQL. Storage decoupled.

Azure Synapse: Unified analytics platform combining data warehousing (dedicated SQL pool) and big data analytics (Spark). Supports T-SQL queries on data lake. Serverless option available.

HDInsight HBase vs Azure Cosmos DB

HDInsight HBase: NoSQL database on Hadoop. Supports large tables, random read/write. Integration with Hive and Spark. Manual scaling.

Azure Cosmos DB: Globally distributed, multi-model NoSQL database. Automatic scaling, guaranteed SLAs for latency and throughput. More expensive.

Key Takeaways

HDInsight is a managed open-source analytics service supporting Hadoop, Spark, Hive, Kafka, Storm, HBase, and ML Services.

Default storage is Azure Blob Storage or ADLS Gen2; compute and storage are decoupled.

Cluster types are optimized for specific workloads: batch (Hadoop), interactive SQL (Hive LLAP), streaming (Storm/Kafka), NoSQL (HBase), ML (ML Services).

Scaling is manual (except preview autoscale). You can scale up to 1000 worker nodes.

Script actions allow customizations during or after cluster creation.

Monitoring is done via Ambari web UI and Azure Monitor.

HDInsight is not serverless; you pay per node per hour.

Exam tip: For multi-framework scenarios, choose HDInsight over Databricks.

FAQ

1. Q: Can I use HDInsight without a VNet? A: Yes, by default HDInsight creates a public endpoint and does not require a VNet. However, for production, deploying in a VNet is recommended for network isolation.

2. Q: How do I access the Ambari UI? A: Ambari is accessible at https://<clustername>.azurehdinsight.net. You need the HTTP username and password set during cluster creation.

3. Q: Can I change the storage account after cluster creation? A: You cannot change the primary storage account. You can add additional linked storage accounts.

4. Q: Is HDInsight HIPAA compliant? A: Yes, HDInsight is HIPAA eligible when deployed with appropriate configurations (e.g., encryption, VNet).

5. Q: What is the difference between Hive and Hive LLAP? A: Hive LLAP (Low Latency Analytical Processing) keeps long-running daemons that cache data in memory for faster interactive queries. Standard Hive runs each query as a batch job on Tez (or MapReduce) containers and has higher latency.

6. Q: Can I run Python scripts on HDInsight? A: Yes, you can submit Python scripts to Spark clusters using spark-submit. You can also install custom Python packages via script actions.

7. Q: How do I delete an HDInsight cluster? A: Use the Azure portal, PowerShell, or CLI. Example: az hdinsight delete --resource-group myRG --name myCluster. Deleting the cluster does not delete the storage account or data.

Quiz

1. Question: Which HDInsight cluster type should you use for interactive SQL queries on large datasets with low latency?

A) Hadoop
B) Hive LLAP
C) Spark
D) HBase

Answer: B) Hive LLAP. Hive LLAP uses caching to provide low-latency interactive queries. Hadoop uses MapReduce which is slower. Spark is for in-memory processing, not specifically optimized for SQL. HBase is a NoSQL database.

2. Question: You need to process real-time streaming data with millisecond latency. Which HDInsight cluster type is most appropriate?

A) Kafka
B) Storm
C) Spark Streaming
D) HBase

Answer: B) Storm. Storm provides low-latency stream processing with at-least-once semantics. Kafka is for message ingestion, not processing. Spark Streaming has higher latency (seconds). HBase is a database.

3. Question: True or False: HDInsight stores data in HDFS by default.

A) True
B) False

Answer: B) False. HDInsight uses Azure Blob Storage or Azure Data Lake Storage Gen2 as default storage. HDFS is an optional configuration.

4. Question: You need to run a batch ETL job that can tolerate interruptions and you want to minimize cost. Which VM option should you use for worker nodes?

A) Standard VMs
B) Low-priority VMs
C) Reserved instances
D) Spot VMs (Azure equivalent)

Answer: B) Low-priority VMs. Low-priority VMs are cheaper but can be evicted. They are suitable for batch jobs that can be restarted. Reserved instances are for long-term commitments.

5. Question: Which tool is used to monitor HDInsight cluster health and jobs?

A) Azure Monitor
B) Ambari
C) Both A and B
D) Grafana

Answer: C) Both A and B. Azure Monitor collects metrics and logs; Ambari provides a web UI for cluster management and job monitoring.

Walk-Through

1. Provision the HDInsight Cluster

In the Azure portal, select 'Create a resource' and search for 'HDInsight'. Choose the cluster type (e.g., Hadoop, Spark, Hive), specify the cluster name, resource group, location, and cluster size (number of worker nodes). Configure the admin credentials for HTTP and SSH access. Under 'Storage', select an existing Azure Storage account or create a new one, and specify a container. Optionally, configure a virtual network, script actions, and advanced settings like metastore. Review and create. Provisioning takes 10–20 minutes.

2. Configure Storage and Networking

HDInsight uses Azure Blob Storage or ADLS Gen2 as the default filesystem. During provisioning, you specify the storage account and a container. The cluster uses the WASB (Windows Azure Storage Blob) driver for Blob Storage or ABFS (Azure Blob File System) for ADLS Gen2. For networking, you can deploy the cluster into an existing VNet to control inbound/outbound traffic. If you choose a VNet, ensure that the VNet has outbound internet access for Azure to download components.

3. Submit a Job to the Cluster

After the cluster is running, you can submit jobs via Ambari, SSH, or Azure tools. For example, to run a Hive query, connect to the cluster's Ambari UI at https://<clustername>.azurehdinsight.net, log in with the HTTP admin credentials, and use the Hive view to write and execute queries. Alternatively, SSH into the head node and use the 'beeline' command-line tool to submit Hive queries. For Spark, use 'spark-submit' with a Python or Scala script.

4. Monitor Job Progress in Ambari

Ambari provides a dashboard showing cluster health, resource utilization, and job status. For YARN jobs, you can see the list of applications, their progress, and logs. For Spark, the Spark History Server shows completed and running applications. Ambari also allows you to view alerts and configuration changes. You can access Ambari via the public endpoint or through a VPN if the cluster is in a VNet without public IP.

5. Scale the Cluster Up or Down

To scale the cluster, use the Azure portal: select the cluster, then under 'Cluster size', change the number of worker nodes. Scaling can also be done via Azure CLI: `az hdinsight resize --resource-group myRG --name myCluster --workernode-count 10`. Scaling takes 5–15 minutes. During scale-in, decommissioned nodes are removed gracefully. Note that scaling does not affect data in storage.

6. Delete the Cluster to Stop Incurring Costs

When the cluster is no longer needed, delete it to stop compute charges. Data in Azure Storage persists. To delete, use the Azure portal (select cluster and click 'Delete'), or CLI: `az hdinsight delete --resource-group myRG --name myCluster`. You can also use Azure PowerShell. Deleting the cluster does not delete the associated storage account or VNet.
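For a scheduled scale-and-delete routine, the CLI calls from the last two steps can be assembled as argument lists and reviewed before anything runs. The sketch below only builds and prints the commands. It assumes the `--workernode-count` flag on `az hdinsight resize` and the `--yes` flag on `az hdinsight delete`, so confirm both with `--help` for the installed CLI version. The function names are hypothetical.

```python
import shlex


def resize_command(resource_group, cluster, workers):
    """Argument list for scaling the worker node count (flag name assumed)."""
    return ["az", "hdinsight", "resize",
            "--resource-group", resource_group,
            "--name", cluster,
            "--workernode-count", str(workers)]


def delete_command(resource_group, cluster):
    """Argument list for deleting the cluster; data in storage is not touched."""
    return ["az", "hdinsight", "delete",
            "--resource-group", resource_group,
            "--name", cluster,
            "--yes"]  # assumed flag that skips the interactive confirmation


if __name__ == "__main__":
    # Print for review; pass a list to subprocess.run() to actually execute it.
    print(shlex.join(resize_command("myRG", "myCluster", 10)))
    print(shlex.join(delete_command("myRG", "myCluster")))
```

Argument lists avoid shell quoting problems with unusual resource names, and printing first gives a dry run for free.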


Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

HDInsight

Supports multiple open-source frameworks (Hadoop, Spark, Hive, HBase, Kafka, Storm)

More customizable with script actions

Requires manual scaling (autoscale preview)

Cost: Pay per node per hour

Storage decoupled from compute

Azure Databricks

Optimized for Apache Spark only

Managed workspace with collaborative notebooks

Autoscaling and serverless options available

Cost: Pay per DBU (Databricks Unit)

Storage also decoupled from compute (data lives in cloud storage such as ADLS Gen2)



