DP-900 | Chapter 28 of 101 | Objective 3.1

Azure Data Lake Storage Gen2

This chapter covers Azure Data Lake Storage Gen2 (ADLS Gen2), a cloud storage solution that combines the scalability of Azure Blob Storage with the hierarchical namespace of a file system. For the DP-900 exam, this topic appears in approximately 10-15% of questions under Objective 3.1 (Analytics). Understanding ADLS Gen2 is critical because it serves as the foundation for many Azure analytics services like Azure Synapse Analytics, Azure Databricks, and HDInsight. The exam expects you to know its key features, how it differs from Blob Storage, its security model, and its integration with other services.

25 min read
Intermediate
Updated May 31, 2026

Warehouse with a Smart Elevator System

Imagine a massive warehouse with two floors: the ground floor is for data ingestion (hot storage) and the upper floor is for long-term storage (cold archive). The warehouse has a traditional freight elevator that moves pallets of boxes between floors, but it's slow and can only carry one pallet at a time. Now, the warehouse installs a smart elevator system. This system has a high-speed elevator that can carry multiple pallets simultaneously, a barcode scanner that reads every box's label, and a digital inventory database that tracks the exact location of every box on both floors. The smart elevator can move a pallet from the ground floor to the upper floor in seconds, and the barcode scanner updates the inventory database instantly. Additionally, the system can retrieve any box from either floor directly without needing to bring the entire pallet down. This smart elevator system is like Azure Data Lake Storage Gen2: it combines the high-speed, multi-modal access of Blob Storage (the smart elevator) with the hierarchical namespace of a file system (the inventory database). The ground floor is the hot tier, the upper floor is the cool or archive tier, and the smart elevator represents the ability to move data between tiers automatically or on demand. The barcode scanner and inventory database ensure that data is always accessible via its full path (like a file system) even though it's stored in a flat blob structure underneath.

How It Actually Works

What is Azure Data Lake Storage Gen2?

Azure Data Lake Storage Gen2 (ADLS Gen2) is a cloud storage service built on top of Azure Blob Storage. It provides a hierarchical namespace that allows you to organize data into directories and subdirectories, much like a traditional file system. This enables you to use file system semantics (e.g., rename, move, delete directories) while leveraging the massive scalability and cost-effectiveness of Blob Storage. ADLS Gen2 is designed for big data analytics workloads, where data is often organized in a folder hierarchy and accessed by multiple compute engines simultaneously.

Why Does It Exist?

Before ADLS Gen2, Azure had two separate storage offerings for analytics: Azure Data Lake Storage Gen1 (a dedicated storage service) and Azure Blob Storage (a general-purpose object store). Gen1 offered a hierarchical namespace and high performance for analytics but was limited in scalability and integration with other Azure services. Blob Storage was highly scalable and cost-effective but lacked a hierarchical namespace, making it difficult to organize data for analytics. ADLS Gen2 was created to merge the best of both: the hierarchical namespace and performance optimizations from Gen1 with the scalability, durability, and cost tiers of Blob Storage.

How It Works Internally

ADLS Gen2 is essentially Blob Storage with a hierarchical namespace enabled. With the namespace enabled, a directory is a real object that the service tracks in its metadata, not just a shared prefix in blob names. The data itself is still stored as blobs, with the same durability, tiering, and pricing model as Blob Storage. Because the service knows about directories, it can treat operations such as renaming or deleting a directory as a single metadata operation rather than a change to every blob beneath it. The same data remains reachable through both object storage semantics (via Blob APIs) and file system semantics (via the ABFS driver).

ABFS Driver

The Azure Blob File System (ABFS) driver is a Hadoop-compatible file system driver that enables compute engines like Azure Databricks and HDInsight to access ADLS Gen2 using the abfs:// URI scheme. The driver translates file system operations (e.g., ls, mkdir, rename) into storage REST API calls. For example, an ls on a directory becomes a listing request for the paths under that directory. The driver also optimizes performance by using multiple concurrent connections and caching.

Key Components, Values, and Defaults

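Hierarchical Namespace: The account-level feature that organizes blobs into directories and subdirectories. It is usually enabled when the storage account is created. An existing general-purpose v2 account can also be upgraded later, but once the namespace is enabled it cannot be disabled.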

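Storage Account: The top-level resource that holds your containers. Hierarchical namespace is available on standard general-purpose v2 (StorageV2) accounts and on premium block blob accounts. The account's SKU determines the performance tier (Standard or Premium) and the replication options.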

Containers: Similar to directories in a file system, but they are the top-level namespace within a storage account. Each container can have its own access policies.

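Directories and Files: Directories are logical groupings; files are the actual blobs. File size limits follow the block blob limits of the underlying Blob Storage service (about 190.7 TiB per blob at the time of writing), so check the current scalability targets before relying on a specific figure.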

Access Control: ADLS Gen2 supports both POSIX-like ACLs (Access Control Lists) and Azure RBAC (Role-Based Access Control). POSIX ACLs can be set at the directory or file level, with read (r), write (w), and execute (x) permissions for the owner, the owning group, and others. The default umask is 027, which gives new directories 750 (owner: rwx, group: r-x, others: none) and new files 640 (owner: rw-, group: r--, others: none).

Data Tiers: Hot, Cool, Cold, and Archive tiers. Data can be moved between tiers automatically using lifecycle management policies or manually.

Encryption: All data is encrypted at rest using Azure Storage Service Encryption (SSE). Customer-managed keys are supported.

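Soft Delete: Container soft delete is turned on by default (retention period: 7 days) when an account is created in the Azure portal; accounts created with other tools may need it enabled explicitly. Soft delete can also be configured for blobs.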

Configuration and Verification Commands

To create an ADLS Gen2 storage account using Azure CLI:

az storage account create \
    --name mystorageaccount \
    --resource-group myResourceGroup \
    --location eastus \
    --sku Standard_LRS \
    --kind StorageV2 \
    --enable-hierarchical-namespace true

To verify hierarchical namespace is enabled:

az storage account show \
    --name mystorageaccount \
    --resource-group myResourceGroup \
    --query isHnsEnabled

To create a directory:

az storage fs directory create \
    --name mydir \
    --file-system mycontainer \
    --account-name mystorageaccount

To list files and directories:

az storage fs file list \
    --file-system mycontainer \
    --path mydir \
    --account-name mystorageaccount

Integration with Related Technologies

ADLS Gen2 integrates seamlessly with:

Azure Synapse Analytics: Can query data directly in ADLS Gen2 using serverless SQL pools or dedicated SQL pools.

Azure Databricks: Uses the ABFS driver to read/write data with Spark.

Azure HDInsight: Hadoop clusters can access ADLS Gen2 as the default file system.

Azure Data Factory: Can copy data to/from ADLS Gen2.

Power BI: Can connect to data stored in ADLS Gen2 using DirectQuery or import.

Performance Considerations

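ADLS Gen2 provides high throughput for analytics workloads. Microsoft publishes scalability targets per storage account rather than per file: the default request-rate target for a standard account has been 20,000 requests per second, and premium accounts are designed for lower latency and higher request rates. Treat specific figures as a snapshot and check the current limits. The hierarchical namespace helps most with directory operations. Listing a directory does not require scanning every blob that shares a prefix, and renaming or deleting a directory is a single atomic metadata operation, whereas in a flat namespace the same change means copying and deleting every blob under the prefix. Transactions against a hierarchical-namespace account can be priced differently from plain Blob Storage, so check the cost of the operations your workload performs most.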

Security Model

ADLS Gen2 supports two layers of security:

1. Azure RBAC: Assign roles like Storage Blob Data Contributor, Storage Blob Data Reader, or Storage Blob Data Owner at the storage account or container scope (or higher, such as the resource group). Role assignments cannot target an individual directory or file.

2. POSIX ACLs: Fine-grained permissions on directories and files. ACLs can be access ACLs (applied to the object itself) or default ACLs (templates set on a directory and copied to new children).

The two are evaluated in order. Role assignments are checked first; if a role already authorizes the operation, access is granted and ACLs are not consulted. ACLs are evaluated only when no role assignment authorizes the request. In practice, roles grant broad access, and ACLs grant narrower, path-specific access to identities that hold no data role.

Limitations

Cannot disable hierarchical namespace once it has been enabled.

Maximum number of files/directories: unlimited, but performance may degrade if the number of blobs in a single container exceeds millions.

Atomic directory rename and delete are available only with hierarchical namespace enabled; in a flat namespace, a "rename" is a copy and delete of each blob individually.

No support for symbolic links or hard links.

Must use a general-purpose v2 (StorageV2) or premium block blob account; legacy BlobStorage and general-purpose v1 accounts are not supported.

Walk-Through

1. Enable Hierarchical Namespace

When creating a storage account with the Azure CLI, set the `--enable-hierarchical-namespace` flag to `true`. An existing general-purpose v2 account can also be upgraded later, but once the namespace is enabled it cannot be turned off. Enabling it changes the storage account's behavior: the service tracks directories as objects, so it supports file system semantics such as renaming, moving, and deleting directories. Without it, the account behaves like a standard Blob Storage account with a flat namespace.

2. Create a Container (File System)

A container in ADLS Gen2 is equivalent to a top-level directory. You create it using Azure CLI or portal. The container name must be unique within the storage account. Containers can have access policies (private, blob, or container). In ADLS Gen2, containers are often called 'file systems' in the context of the ABFS driver.

3. Create Directories and Upload Files

Directories are created using the `az storage fs directory create` command. Files are uploaded using `az storage fs file upload`. The CLI sends these operations to the storage account as REST API calls. Creating a directory adds a directory object to the namespace, and uploading a file creates a blob whose name is the file's full path.

4. Set Access Permissions

Permissions can be set using RBAC roles or POSIX ACLs. For POSIX ACLs, use `az storage fs access set` command. The ACLs are stored as metadata on the blob. Default ACLs are inherited by new children. When a user accesses a file, the system evaluates both RBAC and ACLs. The ABFS driver caches ACLs for performance.

5. Access Data via ABFS Driver

Applications use the `abfs://` URI scheme to access data. For example, `abfs://mycontainer@mystorageaccount.dfs.core.windows.net/mydir/myfile.csv`. The ABFS driver resolves the path by making REST calls to the Azure Data Lake Storage endpoint (`.dfs.core.windows.net`). The driver handles authentication via OAuth or shared key.
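To make the structure of that URI concrete, here is a minimal Python sketch that splits an ABFS URI into its container, account, and path. The function and class names are illustrative, not part of any Azure SDK, and the sketch assumes the `container@account.dfs.core.windows.net` layout shown above. It also accepts `abfss://`, the TLS variant of the scheme.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class AbfsLocation(NamedTuple):
    container: str  # the file system (container) name before the "@"
    account: str    # the storage account name, first label of the host
    path: str       # the path inside the container, starting with "/"


def parse_abfs_uri(uri: str) -> AbfsLocation:
    """Split an abfs:// or abfss:// URI into container, account, and path."""
    parts = urlsplit(uri)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URI: {uri!r}")
    if not parts.username or not parts.hostname:
        raise ValueError(f"expected container@account host in: {uri!r}")
    # The host looks like "<account>.dfs.core.windows.net".
    account = parts.hostname.split(".", 1)[0]
    return AbfsLocation(parts.username, account, parts.path)


loc = parse_abfs_uri(
    "abfs://mycontainer@mystorageaccount.dfs.core.windows.net/mydir/myfile.csv"
)
print(loc.container, loc.account, loc.path)
```

A parser like this is only useful for logging or validation; real reads and writes still go through the ABFS driver or an Azure SDK.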

What This Looks Like on the Job

Enterprise Scenario 1: Log Analytics Pipeline

A large e-commerce company ingests terabytes of server logs daily. They use Azure Data Lake Storage Gen2 as the landing zone for raw logs. The logs are organized by date and hour in a directory structure like /logs/raw/2023/10/15/14/. A Databricks job reads the logs from the last hour, processes them, and writes the aggregated results to a different directory /logs/aggregated/. The hierarchical namespace makes it easy to manage the directory structure and apply lifecycle policies to move raw logs older than 30 days to the Cool tier and older than 90 days to the Archive tier. In production, they found that enabling hierarchical namespace reduced the time to list files for a given hour by 50% compared to a flat namespace with prefix listing. However, they encountered a performance issue when renaming a directory containing millions of small log files: the rename operation took several minutes because each file's metadata had to be updated. They mitigated this by using a separate container for each day and renaming containers instead of directories.
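The directory layout and tiering thresholds in this scenario can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and in a real deployment the tiering is performed by a lifecycle management policy in the storage account, not by application code.

```python
from datetime import datetime


def raw_log_dir(ts: datetime) -> str:
    """Build the hourly landing directory, e.g. /logs/raw/2023/10/15/14/."""
    return f"/logs/raw/{ts:%Y/%m/%d/%H}/"


def tier_for_age(age_days: int) -> str:
    """Mirror the scenario's policy: Cool after 30 days, Archive after 90."""
    if age_days > 90:
        return "Archive"
    if age_days > 30:
        return "Cool"
    return "Hot"


print(raw_log_dir(datetime(2023, 10, 15, 14, 5)))
print(tier_for_age(45))
```

Zero-padded date parts keep the directories in chronological order when listed, which is why this layout is common for time-partitioned data.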

Enterprise Scenario 2: Data Lake for a Healthcare Provider

A healthcare provider uses ADLS Gen2 to store patient data, including medical images (DICOM files) and clinical notes. They need fine-grained access control to comply with HIPAA. They use RBAC to grant broad access to data scientists and ACLs to restrict access to specific patients' directories. For example, a data scientist in the cardiology department has read access to the /patients/cardiology/ directory but not to /patients/neurology/. They also use Azure Private Endpoints to ensure data never traverses the public internet. A common misconfiguration they discovered: when setting ACLs, they forgot to set the execute permission on parent directories, causing access denied errors for users trying to list files. The fix was to ensure all parent directories had at least execute permission for the group.
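The misconfiguration described above follows from how POSIX-style traversal works: reading a file requires execute (x) on every directory along the path and read (r) on the file itself. The following Python sketch models only that ACL rule. The `acls` mapping and the function name are hypothetical, and the model deliberately ignores role assignments, group membership, and masks.

```python
from pathlib import PurePosixPath
from typing import Mapping


def can_read_file(acls: Mapping[str, str], file_path: str) -> bool:
    """Return True if the caller can read file_path under the given ACLs.

    acls maps an absolute path to the caller's effective permissions on it,
    written as an "rwx"-style string such as "r-x" or "--x".
    """
    path = PurePosixPath(file_path)
    # Every ancestor directory, including the root, must grant execute.
    for ancestor in path.parents:
        if "x" not in acls.get(str(ancestor), "---"):
            return False
    # The file itself must grant read.
    return "r" in acls.get(str(path), "---")


working = {
    "/": "--x",
    "/patients": "--x",
    "/patients/cardiology": "r-x",
    "/patients/cardiology/scan.dcm": "r--",
}
# Same ACLs, but execute is missing on a parent directory.
broken = dict(working, **{"/patients": "r--"})

print(can_read_file(working, "/patients/cardiology/scan.dcm"))
print(can_read_file(broken, "/patients/cardiology/scan.dcm"))
```

In the broken case the file's own ACL is correct, yet access still fails, which is why this error is easy to miss.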

Scenario 3: Real-Time IoT Data Ingestion

A manufacturing company uses Azure IoT Hub to stream sensor data from factory equipment. The data is written directly to ADLS Gen2 using the Event Hubs Capture feature, which writes Avro files to a container. The data is organized by device ID and timestamp. The hierarchical namespace allows them to query data for a specific device and time range efficiently using Azure Synapse serverless SQL. They configured a lifecycle management policy to move data older than 7 days to the Cool tier and delete data older than 1 year. One issue they faced: blobs in the Archive tier are offline and must be rehydrated before they can be read, which can take hours, and they accidentally archived data that was still being queried by a real-time dashboard. They resolved this by setting a rule to not archive data that was accessed in the last 30 days.
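As a rough illustration of the policy in this scenario (Cool after 7 days, delete after 1 year), the Python sketch below builds a lifecycle management policy document and prints it as JSON. The field names follow the lifecycle policy schema as commonly documented, but treat them as an assumption and verify against the current Azure documentation. The rule name and the `iot-capture/` prefix are hypothetical.

```python
import json


def lifecycle_policy(prefix: str, cool_after_days: int, delete_after_days: int) -> dict:
    """Build a one-rule lifecycle policy for block blobs under a prefix."""
    return {
        "rules": [
            {
                "enabled": True,
                "name": "tierAndExpireIot",  # hypothetical rule name
                "type": "Lifecycle",
                "definition": {
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        # Prefix is "<container>/<path>"; value is hypothetical.
                        "prefixMatch": [prefix],
                    },
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {
                                "daysAfterModificationGreaterThan": cool_after_days
                            },
                            "delete": {
                                "daysAfterModificationGreaterThan": delete_after_days
                            },
                        }
                    },
                },
            }
        ]
    }


policy = lifecycle_policy("iot-capture/", cool_after_days=7, delete_after_days=365)
print(json.dumps(policy, indent=2))
```

The printed JSON could be saved to a file and supplied to the storage account's management policy through the portal or CLI.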

How DP-900 Actually Tests This

The DP-900 exam tests your understanding of ADLS Gen2 primarily under Objective 3.1: Describe analytics workloads and the data platform for analytics. Specific sub-objectives include: describing the capabilities of Azure Data Lake Storage Gen2, comparing it to Azure Blob Storage, and identifying when to use it. Expect 2-4 questions on ADLS Gen2.

Common Wrong Answers

1. "ADLS Gen2 is the same as Azure Data Lake Storage Gen1." This is false. Gen1 was a separate service with a different architecture (based on WebHDFS), and Microsoft retired it on February 29, 2024. Gen2 is built on Blob Storage and offers better integration and scalability. The exam may present a scenario where you must choose between Gen1 and Gen2; always choose Gen2 for new projects.

2. "ADLS Gen2 requires a premium storage account." False. Hierarchical namespace can be enabled on standard general-purpose v2 accounts as well as on premium block blob accounts. Premium accounts provide lower latency and higher request rates but are more expensive. The exam might ask about performance requirements; remember that Standard accounts suffice for most analytics workloads.

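3. "Hierarchical namespace can only be enabled at account creation." This was true of early releases and still appears in older material, but Azure now supports upgrading an existing general-purpose v2 account to enable it. The upgrade cannot be reversed, and some Blob Storage features are not compatible with it, so the account should be validated first. Creating a new ADLS Gen2 account and migrating the data is still a valid approach. The exam may test this with a question about moving from Blob Storage to ADLS Gen2; the firm rule to remember is that the namespace cannot be disabled once enabled.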

4. "ADLS Gen2 supports both POSIX ACLs and RBAC, and a user needs both to get access." False. Azure role assignments are evaluated first; if a role such as Storage Blob Data Reader already authorizes the operation, ACLs are not checked. ACLs decide the outcome only when no role grants the operation. The exam might ask: "A user has the Storage Blob Data Reader role on a container but no ACL permissions on a directory. Can they read files?" The answer is yes, because the role assignment already grants read access.

Specific Numbers and Terms

ABFS driver: The Hadoop-compatible file system driver.

DFS endpoint: The endpoint for ADLS Gen2 operations (.dfs.core.windows.net).

Hierarchical namespace: The feature that enables directory-like organization.

POSIX ACLs: rwx permissions for owner, group, others.

Default ACLs: Inherited by new child objects.

StorageV2: The required account kind.

Lifecycle management: Policies to move data between tiers.

Edge Cases

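Directory rename: With hierarchical namespace enabled, renaming or deleting a directory is a single atomic metadata operation regardless of how many files it contains. In a flat namespace, the equivalent change means copying and deleting every blob under the prefix, which is not atomic and may take time.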

Soft delete: Container soft delete is turned on by default with a 7-day retention period when a storage account is created in the Azure portal. This can cause confusion when a container appears to be deleted but can still be restored.

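Access control evaluation order: RBAC is evaluated first. If a role assignment authorizes the operation, access is granted and ACLs are not checked. If no role authorizes it, ACLs are checked, and access is denied only if they also fail to grant it.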
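Microsoft's documentation describes the evaluation order as role assignments first, with ACLs consulted only when no role authorizes the operation. A minimal Python sketch of that order follows. The role-to-operation mapping is a deliberate simplification invented for this example (real role definitions list many granular actions), and the function name is hypothetical.

```python
from typing import Iterable, Tuple

# Simplified, hypothetical mapping from built-in role names to the data
# operations they authorize; real role definitions are richer than this.
ROLE_OPERATIONS = {
    "Storage Blob Data Reader": {"read"},
    "Storage Blob Data Contributor": {"read", "write", "delete"},
    "Storage Blob Data Owner": {"read", "write", "delete", "set_acl"},
}


def authorize(operation: str, roles: Iterable[str], acl_allows: bool) -> Tuple[bool, str]:
    """Return (allowed, decided_by) for one request.

    roles      -- role names assigned to the caller at a scope covering the data
    acl_allows -- whether the ACLs on the path grant the operation
    """
    # Step 1: role assignments. A match grants access without reading ACLs.
    for role in roles:
        if operation in ROLE_OPERATIONS.get(role, set()):
            return True, "role"
    # Step 2: no role authorized the operation, so the ACL result decides.
    return acl_allows, "acl"


print(authorize("read", ["Storage Blob Data Reader"], acl_allows=False))
print(authorize("write", ["Storage Blob Data Reader"], acl_allows=True))
print(authorize("read", [], acl_allows=False))
```

The first call shows why ACLs cannot be used to restrict someone who already holds a data role: the ACL result is never read.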

How to Eliminate Wrong Answers

If a question mentions "hierarchical namespace" or "directory structure," the answer is likely ADLS Gen2.

If a question mentions "cost-effective" and "analytics," consider ADLS Gen2 with appropriate tiering.

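If a question mentions "existing Blob Storage account" and "need to enable hierarchy," look for an answer that upgrades the account or migrates the data to a hierarchical-namespace account, and reject any answer that turns the namespace off again afterward.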

If a question mentions "POSIX permissions" or "fine-grained access," ADLS Gen2 is the correct choice.

Key Takeaways

ADLS Gen2 is Azure Blob Storage with a hierarchical namespace enabled, typically at account creation.

The ABFS driver (abfs://) is used to access ADLS Gen2 from Hadoop-compatible compute engines.

Both RBAC and POSIX ACLs are used for access control; role assignments are evaluated first, and ACLs apply only when no role authorizes the operation.

Lifecycle management policies can automatically tier data between Hot, Cool, Cold, and Archive tiers.

Scalability targets are published per storage account (for example, a default of 20,000 requests per second for a standard account); Premium accounts offer lower latency and higher request rates.

Hierarchical namespace cannot be disabled once enabled; an existing general-purpose v2 account can be upgraded, but the change is one-way.

Soft delete for containers is turned on by default with a 7-day retention period when an account is created in the Azure portal.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Azure Blob Storage (flat namespace)

Flat namespace; no directories, only prefixes.

No file system semantics (rename, move, delete directories).

Access control via RBAC only; no POSIX ACLs.

List operations use prefix matching; slower for large directories.

Suitable for general-purpose object storage, backups, and media files.

Azure Data Lake Storage Gen2 (hierarchical namespace)

Hierarchical namespace with directories and subdirectories.

Supports file system semantics like rename, move, and delete directories.

Access control via RBAC and POSIX ACLs.

List operations are optimized with directory metadata; faster for deep hierarchies.

Designed for big data analytics workloads like data lakes and ETL pipelines.
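One way to picture the difference between the two namespaces is to count the work a directory rename requires. The toy Python model below is an illustration only and says nothing about how Azure stores data internally. `FlatStore` keeps full blob names in a set, so a "rename" must touch every matching name. `HierarchicalStore` keeps directories as nodes in a tree, so a rename moves one node. Both class names are hypothetical.

```python
class FlatStore:
    """Blobs identified only by their full name; directories are just prefixes."""

    def __init__(self, names):
        self.blobs = set(names)

    def rename_prefix(self, old: str, new: str) -> int:
        """Rename every blob under a prefix; returns the number of blobs touched."""
        moved = [name for name in self.blobs if name.startswith(old)]
        for name in moved:
            self.blobs.remove(name)
            self.blobs.add(new + name[len(old):])
        return len(moved)


class HierarchicalStore:
    """Directories are real nodes (nested dicts); files are entries set to None."""

    def __init__(self):
        self.root = {}

    def add_file(self, path: str) -> None:
        *dirs, filename = path.strip("/").split("/")
        node = self.root
        for part in dirs:
            node = node.setdefault(part, {})
        node[filename] = None

    def rename_dir(self, parent: str, old: str, new: str) -> int:
        """Rename one directory under parent; returns the number of nodes touched."""
        node = self.root
        for part in parent.strip("/").split("/"):
            node = node[part]
        node[new] = node.pop(old)  # children move with the node
        return 1


paths = [f"logs/2023/{i}.log" for i in range(1000)]

flat = FlatStore(paths)
flat_ops = flat.rename_prefix("logs/2023/", "logs/archive/")

tree = HierarchicalStore()
for p in paths:
    tree.add_file(p)
tree_ops = tree.rename_dir("logs", "2023", "archive")

print(flat_ops, tree_ops)  # 1000 1
```

The gap grows with the number of files, which is why directory-heavy analytics jobs benefit from the hierarchical namespace.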

Watch Out for These

Mistake

ADLS Gen2 is a completely new storage service separate from Blob Storage.

Correct

ADLS Gen2 is built on top of Azure Blob Storage. It is essentially a Blob Storage account with the hierarchical namespace feature enabled. The underlying storage is the same; only the metadata layer differs.

Mistake

Hierarchical namespace can only be chosen at account creation, and it can be switched off later if it is not needed.

Correct

Hierarchical namespace is usually enabled when the account is created, but an existing general-purpose v2 account can also be upgraded. The change is one-way: once enabled, the namespace cannot be disabled. Creating a new ADLS Gen2 account and migrating the data remains an alternative.

Mistake

ADLS Gen2 only supports POSIX ACLs, not Azure RBAC.

Correct

ADLS Gen2 supports both RBAC and POSIX ACLs. RBAC controls access at the storage account or container level, while ACLs provide fine-grained permissions within the hierarchy.

Mistake

The ABFS driver is only used by Azure Databricks.

Correct

The ABFS driver is a Hadoop-compatible file system driver that can be used by any Hadoop-compatible compute engine, including Azure HDInsight, Azure Synapse Analytics, and custom Hadoop clusters.

Mistake

ADLS Gen2 does not support lifecycle management policies.

Correct

ADLS Gen2 fully supports lifecycle management policies, including tiering data to Cool, Cold, or Archive tiers and deleting data after a specified period.


Frequently Asked Questions

What is the difference between Azure Data Lake Storage Gen1 and Gen2?

Azure Data Lake Storage Gen1 was a dedicated storage service with its own API; Microsoft retired it on February 29, 2024. Gen2 is built on top of Azure Blob Storage, providing better integration with other Azure services, lower cost, and more scalability. Gen2 also supports multiple data tiers and lifecycle management, whereas Gen1 did not. For all new projects, you should use ADLS Gen2.

Can I enable hierarchical namespace on an existing Blob Storage account?

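Yes, with caveats. Hierarchical namespace is usually enabled when a storage account is created, but Azure also supports upgrading an existing general-purpose v2 account. The upgrade is one-way, and some Blob Storage features are not compatible with the hierarchical namespace, so validate the account first. Alternatively, you can create a new ADLS Gen2 account and migrate your data using Azure Data Factory, AzCopy, or other tools.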

What is the ABFS driver and how does it work?

The ABFS (Azure Blob File System) driver is a Hadoop-compatible file system driver that allows compute engines like Azure Databricks and HDInsight to access ADLS Gen2 using file system semantics. It translates operations like read, write, and list into Blob Storage REST API calls. The driver uses the `abfs://` URI scheme and communicates with the DFS endpoint (`.dfs.core.windows.net`).

How do POSIX ACLs work in ADLS Gen2?

POSIX ACLs in ADLS Gen2 allow you to set read (r), write (w), and execute (x) permissions for the owner, the owning group, and others at the directory or file level. You can also set default ACLs on a directory, which are copied to new child objects. When a user accesses a file, role assignments are evaluated first; ACLs are evaluated only if no role already authorizes the operation.
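ACLs are commonly written in a short form such as `user::rwx,group::r-x,other::---`. As a small illustration, the Python sketch below converts the three base entries of such a string into the familiar octal notation. The function name is hypothetical, and the sketch ignores named users, named groups, the mask, and default entries.

```python
BITS = {"r": 4, "w": 2, "x": 1, "-": 0}


def acl_to_octal(acl: str) -> str:
    """Convert the owner, owning-group, and other entries to octal, e.g. '750'."""
    perms = {}
    for entry in acl.split(","):
        parts = entry.strip().split(":")
        # Skip default ACL entries such as "default:user::rwx".
        if len(parts) != 3:
            continue
        scope, qualifier, rwx = parts
        # An empty qualifier marks a base entry; "user:<id>:..." is a named entry.
        if qualifier == "" and scope in ("user", "group", "other"):
            perms[scope] = sum(BITS[ch] for ch in rwx)
    # Raises KeyError if one of the three base entries is missing.
    return "".join(str(perms[scope]) for scope in ("user", "group", "other"))


print(acl_to_octal("user::rwx,group::r-x,other::---"))  # 750
```

Reading an ACL this way makes it easier to compare against the 750 and 640 values that appear in POSIX-style permission discussions.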

What are the performance considerations for ADLS Gen2?

ADLS Gen2 offers high throughput for analytics workloads. Scalability targets are published per storage account (a standard account has had a default target of 20,000 requests per second), while Premium accounts offer lower latency and higher request rates. The hierarchical namespace improves list performance for deep directory structures and makes renaming or deleting a directory a single atomic metadata operation, whereas a flat namespace must copy and delete each blob under the prefix. Transactions against a hierarchical-namespace account can be priced differently from plain Blob Storage, so check the cost of the operations your workload performs most.

Can I use ADLS Gen2 with Azure Synapse Analytics?

Yes. Azure Synapse Analytics can query data stored in ADLS Gen2 using serverless SQL pools or dedicated SQL pools. You can create external tables that reference data in ADLS Gen2, and Synapse can read Parquet, CSV, JSON, and other formats. This integration makes ADLS Gen2 a core component of a modern data warehouse architecture.

What is the default soft delete retention period for ADLS Gen2 containers?

For accounts created in the Azure portal, soft delete for containers is turned on by default with a retention period of 7 days. This means that if you delete a container, it is soft-deleted and can be recovered within 7 days. You can configure the retention period up to 365 days.
