AZ-900 | Chapter 91 of 127 | Objective 2.4

Azure Data Lake Storage

This chapter covers Azure Data Lake Storage (ADLS), a scalable and secure data lake solution for big data analytics. It falls under the Azure architecture and services domain (Objective 2.4), so understanding ADLS is essential for the AZ-900 exam, which tests your ability to describe core Azure data services. That domain carries roughly 35-40% of the exam weight, and questions on storage services such as ADLS appear frequently. By the end of this chapter, you will understand what ADLS is, how it works, its key features, and how it compares to other Azure storage options.

25 min read
Intermediate
Updated May 31, 2026

Azure Data Lake: A Cargo Warehouse for Analytics

Imagine you run a global shipping company that receives thousands of different cargo containers daily from ships, trucks, and planes. Each container holds raw materials, finished goods, or documents in various forms: boxes, barrels, pallets, and loose items. Your old on-premises warehouse required you to unload every container, sort items onto predefined shelves, catalog each piece, and discard anything that didn't fit your classification system. This process was slow and expensive, and you often lost valuable items because they didn't match a shelf.

Azure Data Lake Storage is like building a massive, automated cargo warehouse where you can store every container exactly as it arrives, with no sorting and no repackaging. The warehouse has effectively unlimited space and can handle containers of any size, from a small envelope to a 100-ton crate. When you need to analyze your cargo, you don't move the containers; instead, you send specialized robots (analytics engines like Azure Synapse or Databricks) that walk through the aisles, examine each container, and extract only the data you need. You pay only for the space you use, and the warehouse grows or shrinks with your needs. This approach solves the problem of data silos and rigid schemas: you keep all raw data in its native format, enabling future analysis that you haven't even thought of yet.

How It Actually Works

What is Azure Data Lake Storage and What Business Problem Does It Solve?

Azure Data Lake Storage (ADLS) is a cloud-based, enterprise-wide repository for storing massive amounts of structured, semi-structured, and unstructured data. It is built on top of Azure Blob Storage but adds a hierarchical namespace, POSIX-like access control, and optimized performance for analytics workloads. The core business problem it solves is storing and analyzing large volumes of raw data without the constraints of traditional databases or data warehouses.

In a typical enterprise, data comes from many sources: IoT devices, application logs, social media feeds, transaction records, and more. Traditional databases and data warehouses require you to define a schema (a rigid structure of tables, columns, and data types) before you can store the data. This approach, known as schema-on-write, is slow and inflexible: if you later discover that you need a different schema, you must reprocess all the data. ADLS supports schema-on-read instead: data is stored in its native format (e.g., CSV, JSON, Parquet, Avro) and the schema is applied only when the data is read for analysis. This allows you to store all raw data quickly and cheaply, and then use analytics engines like Azure Synapse Analytics, Azure Databricks, or HDInsight to process it on demand.
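To make schema-on-read concrete, here is a minimal Python sketch. It does not call any Azure service; the sample records, field names, and the read_with_schema helper are all hypothetical and only illustrate the idea that raw data stays untouched while different schemas are applied at read time.

```python
import json

# Raw events stored exactly as they arrived (one JSON document per line).
# Note that the records do not share an identical set of fields.
RAW_LINES = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/cart", "ms": "340", "device": "mobile"}',
    '{"user": "u3", "page": "/home"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: pick the requested fields and cast them.

    `schema` maps field name -> (cast function, default value). Fields that
    are absent from a record fall back to the default; extra fields in the
    record are simply ignored.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            row[field] = cast(record[field]) if field in record else default
        rows.append(row)
    return rows


# Two different "views" over the same untouched raw data.
latency_view = read_with_schema(RAW_LINES, {"page": (str, ""), "ms": (int, 0)})
device_view = read_with_schema(
    RAW_LINES, {"user": (str, ""), "device": (str, "unknown")}
)
```

With schema-on-write, the third record (no ms field) or the second (ms stored as a string) would typically have to be rejected or repaired at load time. Here both are kept as-is, and each reader decides how to interpret them.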

How Azure Data Lake Storage Works – Step by Step Mechanism

ADLS combines the massive scalability and low cost of Azure Blob Storage with a hierarchical file system that mimics a traditional file server. Here is how it works at a technical level:

1. Hierarchical Namespace: Unlike flat blob storage where all files are stored in a single flat namespace (like a giant folder with no subfolders), ADLS organizes data into a hierarchy of directories and subdirectories. This makes it familiar to users accustomed to file systems like NTFS or HDFS. The hierarchical namespace enables efficient rename and move operations (O(1) instead of O(n)) because only the metadata changes, not the actual data.

2. Storage and Access: Data is stored as blobs (binary large objects) in containers, which live inside a storage account. A single block blob can currently be up to about 190.7 TiB in size. You can access data via REST APIs, SDKs (e.g., .NET, Java, Python), or tools like Azure Storage Explorer. ADLS Gen2 (the current generation) exposes both the Blob Storage APIs and the Data Lake Storage (DFS) APIs on the same account, so existing Blob tools keep working alongside file-system-aware analytics clients.

3. POSIX Permissions: ADLS supports POSIX-style access control lists (ACLs) and role-based access control (RBAC). You can set permissions at the directory or file level for users and groups, allowing fine-grained security. This is critical for multi-tenant environments where different teams need different levels of access.

4. Optimized for Analytics: ADLS is designed for high-throughput analytics. It supports parallel uploads and downloads, and it integrates natively with Azure analytics services. For example, Azure Synapse can directly query data in ADLS using serverless SQL pools or dedicated SQL pools. Azure Databricks can read data using Spark’s DataFrame API.

5. Data Redundancy and Durability: ADLS inherits Blob Storage’s durability and redundancy options. Locally redundant storage (LRS) keeps three copies of your data within a single datacenter in the region. You can choose zone-redundant storage (ZRS) or geo-redundant storage (GRS) for higher durability. Azure Storage is designed to provide at least 99.999999999% (11 nines) durability of objects over a given year with LRS, and more with ZRS or GRS.
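The O(1) versus O(n) rename claim in step 1 can be illustrated with a toy model. The sketch below is not how Azure implements storage internally; it simply models a flat namespace as a dictionary keyed by full blob name and a hierarchical namespace as nested dictionaries, then counts how many entries each rename has to touch. All names are made up for illustration.

```python
def rename_flat(blobs, old_prefix, new_prefix):
    """Flat namespace: a 'directory' is just a shared name prefix, so a
    rename must rewrite every blob whose name starts with that prefix."""
    renamed = {}
    touched = 0
    for name, data in blobs.items():
        if name.startswith(old_prefix):
            renamed[new_prefix + name[len(old_prefix):]] = data
            touched += 1
        else:
            renamed[name] = data
    return renamed, touched


def rename_hierarchical(tree, old_name, new_name):
    """Hierarchical namespace: a directory is a real node, so a rename
    re-points one entry in the parent and leaves the children alone."""
    tree[new_name] = tree.pop(old_name)
    return tree, 1


files = [f"part-{i:04d}.json" for i in range(1000)]

# Same 1,001 files in both models.
flat = {f"raw/{f}": b"" for f in files}
flat["other/readme.txt"] = b""
tree = {"raw": {f: b"" for f in files}, "other": {"readme.txt": b""}}

flat, flat_ops = rename_flat(flat, "raw/", "landing/")
tree, tree_ops = rename_hierarchical(tree, "raw", "landing")
```

In the flat model the work grows with the number of files under the prefix, while the hierarchical model touches a single entry no matter how many files the directory holds. That is the intuition behind the faster directory operations analytics engines rely on when they commit job output by renaming a temporary directory.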

Key Components, Tiers, and Pricing Models

Storage Account: The top-level resource that holds all your data. You create a storage account in an Azure region and choose the performance tier (Standard or Premium). Standard suits general-purpose big data workloads; Premium suits low-latency workloads.

Containers: Within a storage account, you create containers (like top-level folders) to organize data. In ADLS, containers support hierarchical directories.

Directories and Files: Inside containers, you create directories and subdirectories. Files are stored as blobs.

Access Tiers: ADLS supports three access tiers: Hot (frequent access), Cool (infrequent access, lower storage cost), and Archive (rare access, lowest storage cost but retrieval fees). You can set a default tier at the storage account level or override it at the file level; Archive can only be set on individual files, not as the account default.

Pricing: You pay for storage (per GB per month) based on the access tier, plus data transfer and operations (read/write/list). There is no upfront cost; you pay as you go.
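The effect of access tiers on the storage part of the bill is simple arithmetic. The sketch below uses HYPOTHETICAL rates (whole dollars per TB per month) chosen only to keep the numbers easy to follow; they are not Azure prices, and operations and data transfer charges are deliberately left out.

```python
# HYPOTHETICAL rates in USD per TB per month. These are illustration values,
# not Azure prices; check the Azure pricing page for real numbers.
RATE_PER_TB_MONTH = {"hot": 20, "cool": 10, "archive": 2}


def monthly_storage_cost(tb_by_tier, rates=RATE_PER_TB_MONTH):
    """Sum the storage charge across tiers.

    Ignores operation (read/write/list) and data transfer charges, which
    are billed separately.
    """
    return sum(tb * rates[tier] for tier, tb in tb_by_tier.items())


# The same 10 TB of data, stored two different ways.
all_hot = monthly_storage_cost({"hot": 10})
tiered = monthly_storage_cost({"hot": 1, "cool": 3, "archive": 6})
```

Under these made-up rates, keeping everything in Hot costs 200 per month while the tiered layout costs 62. The real trade-off also depends on how often the cooler data is read, because cooler tiers charge more for access and retrieval.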

Comparison to On-Premises Equivalent

On-premises, a traditional data lake might be built on Hadoop Distributed File System (HDFS) running on a cluster of servers. HDFS requires significant hardware investment, ongoing maintenance, and capacity planning. Scaling up means buying more servers and manually rebalancing data. ADLS eliminates these burdens: you provision storage on demand, pay only for what you use, and Azure handles all hardware failures, replication, and scaling. Additionally, ADLS integrates with cloud-native analytics services that are difficult to match on-premises.

Azure Portal and CLI Touchpoints

Azure Portal: Navigate to Storage accounts, create a new storage account, and enable the hierarchical namespace (this is what makes it ADLS Gen2). You can then create containers, upload files, and manage permissions via the portal.

Azure CLI: Use az storage account create with --enable-hierarchical-namespace true. For example: az storage account create --name mydatalake --resource-group myrg --location eastus --sku Standard_GRS --enable-hierarchical-namespace true.

Azure PowerShell: Use New-AzStorageAccount with -EnableHierarchicalNamespace $true.
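If you script account creation, it can help to assemble and validate the command before running it. The following Python sketch only builds the argument list for the az storage account create command shown above; it does not execute anything. The build_create_command helper is hypothetical, and the --kind StorageV2 flag is added here as an assumption to make the account type explicit.

```python
import re
import shlex


def build_create_command(name, resource_group, location, sku="Standard_LRS"):
    """Return the `az storage account create` command as a list of arguments."""
    # Storage account names must be 3-24 characters long and may contain
    # only lowercase letters and digits.
    if not re.fullmatch(r"[a-z0-9]{3,24}", name):
        raise ValueError(f"invalid storage account name: {name!r}")
    return [
        "az", "storage", "account", "create",
        "--name", name,
        "--resource-group", resource_group,
        "--location", location,
        "--sku", sku,
        "--kind", "StorageV2",
        # This flag is what turns a Blob Storage account into ADLS Gen2.
        "--enable-hierarchical-namespace", "true",
    ]


cmd = build_create_command("mydatalake", "myrg", "eastus", sku="Standard_GRS")
print(shlex.join(cmd))
```

Returning a list rather than a single string means the command can be passed to subprocess.run without shell quoting problems, and the name check catches the most common creation failure (uppercase letters or hyphens in the account name) before any call is made.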

Concrete Business Scenario

Consider a retail company that collects clickstream data from its e-commerce website. The data arrives in hourly JSON logs, each about 1 GB. The company wants to analyze customer behavior to improve recommendations. With ADLS, they can store all raw JSON logs in a container called clickstream, organized by year/month/day/hour directories. Then, using Azure Synapse, they can run serverless SQL queries to extract insights without moving data. As the business grows, they can store petabytes of data without changing the architecture.
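The year/month/day/hour layout from this scenario can be sketched in a few lines of Python. The account name mydatalake is borrowed from the CLI example earlier in the chapter, and the file name events.json is a hypothetical placeholder. The abfss:// form is the URI scheme Spark-based engines commonly use to address ADLS Gen2 paths.

```python
from datetime import datetime, timezone


def hourly_directory(ts):
    """Directory for one hour of logs, laid out as year/month/day/hour."""
    return ts.strftime("%Y/%m/%d/%H")


def abfss_uri(account, container, path):
    """Build an abfss:// URI for a path inside an ADLS Gen2 container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"


# A log batch that arrived at 14:05 UTC on 19 March 2025.
ts = datetime(2025, 3, 19, 14, 5, tzinfo=timezone.utc)
directory = hourly_directory(ts)
uri = abfss_uri("mydatalake", "clickstream", f"{directory}/events.json")
```

Zero-padded, most-significant-first path segments keep directories sorted chronologically and let query engines skip whole directories when a query filters on a date range.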

Walk-Through

1

Integrate with Other Azure Services

ADLS integrates deeply with the Azure ecosystem. For example, Azure Data Factory can copy data from on-premises SQL Server to ADLS. Azure Stream Analytics can write its output to ADLS. Azure Machine Learning can read training data directly from ADLS. For compliance, encryption at rest (Azure Storage Service Encryption) is enabled by default, and you can use Azure Policy to require encryption in transit (HTTPS). You can also use Azure Private Link to access ADLS over a private IP in your virtual network. This integration is why ADLS is the preferred data lake for Azure analytics.

What This Looks Like on the Job

Scenario 1: E-commerce Clickstream Analytics

A global e-commerce company wants to analyze user clickstream data to improve recommendations and detect fraud. They receive billions of events per day from their website and mobile apps. Each event is a JSON object with fields like user ID, page URL, timestamp, and device type. On-premises, they used a Hadoop cluster with HDFS, but scaling was expensive and slow, so they migrated to ADLS Gen2. They created a storage account with hierarchical namespace enabled and set up a container 'clickstream' with directories by date (e.g., 'raw/2025/03/19'). They used Azure Event Hubs to ingest streaming data and Azure Data Factory to batch load historical data. The data is stored in its raw JSON format. They then used Azure Synapse serverless SQL to query the data on demand. For example, they can run a query to find the top 10 most visited pages in the last hour without moving data. They set directory-level ACLs: data engineers have write access to raw, and data scientists have read access to raw and write access to processed. They implemented lifecycle management to move data older than 90 days to the Cool tier, reducing storage costs by 50%. Common mistake: forgetting to enable hierarchical namespace at creation time. An existing account can be upgraded later, but the upgrade is one-way and the account must pass a compatibility check first, so it is simpler to enable it up front. Also, they initially used Blob Storage APIs, which worked but lacked directory performance; switching to ADLS APIs improved rename operations by 100x.
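The 90-day rule in this scenario maps onto an Azure Storage lifecycle management policy, which is expressed as JSON. The Python sketch below builds such a policy document; it only constructs and prints the JSON and does not apply it to any account. The rule name and the clickstream/raw prefix are assumptions taken from the scenario, and the cool_after_days_rule helper is hypothetical.

```python
import json


def cool_after_days_rule(name, prefix, days):
    """Build one lifecycle rule that moves block blobs under `prefix`
    to the Cool tier once they have gone `days` days without modification."""
    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            # prefixMatch starts with the container name.
            "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
            "actions": {
                "baseBlob": {
                    "tierToCool": {"daysAfterModificationGreaterThan": days}
                }
            },
        },
    }


policy = {"rules": [cool_after_days_rule("coolRawAfter90Days", "clickstream/raw", 90)]}
print(json.dumps(policy, indent=2))
```

The printed document could then be supplied to the portal's lifecycle management code view or to automation tooling. Treat the exact shape as something to verify against the current lifecycle policy reference before relying on it.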

Scenario 2: Healthcare Genomics Research

A research institute stores genomic sequencing data, which averages 100 GB per patient. They need to store petabytes of data and allow researchers to analyze it using custom tools. They chose ADLS because it handles very large files (block blobs currently support up to about 190.7 TiB each) and offers POSIX permissions for fine-grained access. They created a storage account in the West US region to be close to their compute. They created directories for each research project, with subdirectories for raw sequences, processed alignments, and results. They used Azure Databricks with Spark to run genomic analysis pipelines, reading and writing directly to ADLS. They set ACLs to allow only specific researchers to access certain projects. They enabled Microsoft Defender for Storage (formerly Azure Defender for Storage) to detect unusual access patterns. They also used Azure Private Link to ensure data never traverses the public internet. The challenge they faced was the cost of data egress when moving data to other regions; they mitigated by using Azure ExpressRoute. They also learned that enabling soft delete is crucial to recover from accidental deletions; they set a 30-day retention period.

Scenario 3: Media and Entertainment Content Repository

A media company stores raw video footage, which can be hundreds of terabytes per production. They need a centralized repository for editors, animators, and quality assurance teams. They used ADLS with hierarchical namespace to organize by production name, scene, and take. They set up Azure NetApp Files for high-performance editing, but used ADLS for cold storage of archived footage. They configured lifecycle policies to move footage to Archive tier after 6 months. They used Azure Content Delivery Network (CDN) to deliver final videos to viewers. They encountered a problem: their editors needed low-latency access to hot data, but ADLS Standard tier had higher latency than Premium. They solved this by using Premium tier for active projects and moving completed ones to Standard. They also learned that ADLS does not support file locking natively; they used Azure Files for collaborative editing.

How AZ-900 Actually Tests This

Exactly What AZ-900 Tests on This Objective

AZ-900 objective 2.4 (Describe Azure data storage services) includes Azure Data Lake Storage. The exam expects you to know:

ADLS is built on Azure Blob Storage but adds a hierarchical namespace and POSIX-like access control.

ADLS is designed for big data analytics and can store structured, semi-structured, and unstructured data.

The primary use case is storing raw data for later analysis (schema-on-read).

ADLS integrates with Azure Synapse, Databricks, HDInsight, and Data Factory.

ADLS Gen2 is the current version; Gen1 is legacy.

The hierarchical namespace enables fast directory operations.

Common Wrong Answers and Why Candidates Choose Them

1. 'ADLS is a relational database.' Wrong. ADLS is a data lake, not a database. Candidates confuse it with Azure SQL Database because both store data. Reality: ADLS stores files, not tables with rows and columns.

2. 'ADLS requires data to be in a specific format like CSV or Parquet.' Wrong. ADLS can store any file format. Candidates think because analytics tools often use Parquet, it's required. Reality: ADLS is schema-on-read; you can store JSON, XML, images, videos, etc.

3. 'ADLS is the same as Azure Blob Storage.' Wrong. ADLS is built on Blob Storage but adds the hierarchical namespace and ACLs. Candidates see 'built on Blob Storage' and think they are identical. Reality: ADLS adds features, and some Blob Storage features are limited or unsupported on accounts with hierarchical namespace enabled.

4. 'ADLS is only for large enterprises with petabytes of data.' Wrong. ADLS scales to any size, but it's also suitable for small datasets. Candidates assume 'data lake' implies massive scale. Reality: You can use ADLS for any analytics workload, even with gigabytes.

Specific Terms and Values That Appear Verbatim on the Exam

'Hierarchical namespace' – the key differentiator.

'POSIX-like access control' – for fine-grained permissions.

'Schema-on-read' – the design principle.

'99.999999999% durability' – 11 nines.

'Gen2' – the current generation.

'Azure Synapse Analytics' – primary integration.
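To get a feel for what 11 nines means in practice, the figure can be read as an annual per-object loss probability. The short Python sketch below works through that arithmetic with exact fractions; the object count of 10 million is a hypothetical example, and the interpretation (independent per-object loss over one year) is a simplifying assumption.

```python
from fractions import Fraction

# 99.999999999% durability, written as an exact fraction (11 nines).
durability = Fraction(99_999_999_999, 100_000_000_000)
annual_loss_probability = 1 - durability

objects_stored = 10_000_000  # hypothetical number of stored objects

# Expected number of objects lost per year, and the average number of
# years you would wait to lose a single object.
expected_losses_per_year = objects_stored * annual_loss_probability
years_per_expected_loss = 1 / expected_losses_per_year
```

Under these assumptions, storing 10 million objects gives an expected loss of one ten-thousandth of an object per year, or on average one object every 10,000 years.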

Edge Cases and Tricky Distinctions

ADLS vs. Azure Data Lake Analytics (ADLA): ADLA is a legacy analytics service (now deprecated). ADLS is storage. The exam may ask which service is used for storage. Answer: ADLS.

ADLS vs. Azure Data Factory (ADF): ADF is an orchestration tool to move data, not storage. Candidates mix them.

Enabling hierarchical namespace: Best done at creation time. An existing account can be upgraded, but hierarchical namespace cannot be disabled once it is on.

Access tiers: ADLS supports Hot, Cool, Archive. But the Archive tier is for rarely accessed data; retrieval can take hours.

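Security: Azure RBAC and ACLs coexist. Role assignments are evaluated first: if an RBAC role already grants the requested access, ACLs are not checked; otherwise the ACL decides. ACLs are not inherited automatically. A directory's default ACL is copied to new child items when they are created, and later changes to the parent do not propagate to existing children.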
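The way Azure RBAC and ACLs combine can be sketched as a small decision function. This is a deliberately simplified model for study purposes, not the service's actual implementation: it ignores the ACL mask, the superuser case, and parent-directory traversal, and every name in it (is_allowed, the sample users and groups) is hypothetical.

```python
def is_allowed(user, groups, wanted, rbac_grants, acl):
    """Return True if `user` may perform `wanted` ('r', 'w' or 'x').

    rbac_grants: set of (user, permission) pairs granted by role assignments.
    acl: {"owner": str, "owner_perms": str, "users": {name: perms},
          "groups": {name: perms}, "other": str}
    """
    # 1. Role assignments are evaluated first; a grant here is sufficient.
    if (user, wanted) in rbac_grants:
        return True
    # 2. Otherwise fall through to the ACL on the file or directory,
    #    checking the most specific entry that applies to this user.
    if user == acl["owner"]:
        return wanted in acl["owner_perms"]
    if user in acl["users"]:
        return wanted in acl["users"][user]
    matching = [perms for group, perms in acl["groups"].items() if group in groups]
    if matching:
        return any(wanted in perms for perms in matching)
    return wanted in acl["other"]


acl = {
    "owner": "alice",
    "owner_perms": "rwx",
    "users": {"bob": "r-x"},
    "groups": {"scientists": "r--"},
    "other": "---",
}

bob_write_acl_only = is_allowed("bob", set(), "w", set(), acl)
bob_write_with_role = is_allowed("bob", set(), "w", {("bob", "w")}, acl)
carol_read_via_group = is_allowed("carol", {"scientists"}, "r", set(), acl)
dave_read = is_allowed("dave", set(), "r", set(), acl)
```

The bob example is the exam-relevant case: his ACL entry denies write, yet a role assignment that grants write lets the request through because the ACL is never consulted. That is why broad data-plane roles should be assigned sparingly when ACLs are meant to do the fine-grained work.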

Memory Trick or Decision Tree

For exam questions asking which service to use for storing raw big data for analytics, use the mnemonic D.L.A.K.E.:

D: Data Lake (ADLS)

L: Large files (block blobs up to about 190.7 TiB)

A: Analytics (Synapse, Databricks)

K: Keep raw format (schema-on-read)

E: Enterprise (POSIX permissions)

If the question mentions 'hierarchical namespace' or 'POSIX', the answer is ADLS. If it mentions 'relational queries' or 'transactions', it's Azure SQL Database. If it mentions 'NoSQL key-value', it's Cosmos DB.
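The keyword rules above amount to a tiny lookup, sketched here in Python as a study aid. The RULES table simply restates the three cues from the paragraph; the pick_service helper and the sample questions are hypothetical, and real exam questions will of course need more judgment than substring matching.

```python
# Each rule pairs the cue phrases from the text with the service they point to.
RULES = [
    (("hierarchical namespace", "posix"), "Azure Data Lake Storage"),
    (("relational queries", "transactions"), "Azure SQL Database"),
    (("nosql key-value",), "Azure Cosmos DB"),
]


def pick_service(question):
    """Return the service suggested by the first matching cue, or None."""
    text = question.lower()
    for keywords, service in RULES:
        if any(keyword in text for keyword in keywords):
            return service
    return None


answer_1 = pick_service("Which service offers a hierarchical namespace for analytics?")
answer_2 = pick_service("Which service supports ACID transactions?")
answer_3 = pick_service("Which service is a NoSQL key-value store?")
answer_4 = pick_service("Which service hosts virtual machines?")
```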

Key Takeaways

Azure Data Lake Storage (ADLS) Gen2 is built on Azure Blob Storage with a hierarchical namespace and POSIX ACLs.

ADLS is designed for big data analytics using schema-on-read: store raw data in any format, apply schema when reading.

Enable hierarchical namespace when you create the storage account; an existing account can be upgraded later, but the setting cannot be turned off once enabled.

ADLS supports very large files (block blobs up to about 190.7 TiB) and is designed for at least 99.999999999% (11 nines) durability.

Access tiers: Hot (frequent), Cool (infrequent), Archive (rare). Lifecycle management can automate tier transitions.

ADLS integrates with Azure Synapse Analytics, Azure Databricks, HDInsight, and Azure Data Factory.

Security: Use Azure RBAC for broad access control and POSIX ACLs for fine-grained permissions at directory/file level.

Pricing: Pay for storage (per GB/month based on tier), data transfer, and operations (read/write/list).

Common exam wrong answer: confusing ADLS with Azure SQL Database or Azure Blob Storage (without hierarchical namespace).

ADLS Gen2 is the current recommended version; Gen1 is legacy and was retired in February 2024.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Azure Data Lake Storage (ADLS)

Has hierarchical namespace for directories

Supports POSIX ACLs for fine-grained permissions

Optimized for big data analytics (schema-on-read)

Integrates natively with Azure Synapse, Databricks, HDInsight

Renames and moves directories in constant time (metadata only)

Azure Blob Storage

Flat namespace (no true directories; uses prefix naming)

No POSIX ACLs; access is controlled with Azure RBAC, shared keys, or shared access signatures

General-purpose object storage, not analytics-optimized

Integrates with services but less optimized for analytics

Renames and moves blobs require copying data (O(n) time)

Azure Data Lake Storage (ADLS)

Stores files (structured, semi-structured, unstructured)

Schema-on-read: no predefined schema

Massive scalability (petabytes+)

Cheap storage cost (pennies per GB)

No transaction support; not for OLTP

Azure SQL Database

Stores relational data in tables with rows and columns

Schema-on-write: schema must be defined before inserting

Scalable but typically for terabytes, not petabytes

Higher storage cost (dollars per GB)

Full ACID transactions; ideal for OLTP

Watch Out for These

Mistake

Azure Data Lake Storage is a separate service from Azure Storage.

Correct

ADLS Gen2 is a set of capabilities built on Azure Blob Storage. You enable it by turning on hierarchical namespace for a storage account. It is not a standalone service.

Mistake

ADLS can only store data in Parquet or ORC format.

Correct

ADLS can store any file format, including CSV, JSON, XML, images, videos, and binary files. The format is irrelevant to storage; analytics tools may prefer columnar formats, but ADLS does not enforce any.

Mistake

ADLS is only for big data and not for small datasets.

Correct

ADLS scales to any size. While it is optimized for large-scale analytics, it can store gigabytes of data. There is no minimum size requirement.

Mistake

Hierarchical namespace is a setting you can switch on and off as needed.

Correct

Hierarchical namespace is best enabled at account creation time. An existing storage account can be upgraded to it, but the change is one-way: once enabled, hierarchical namespace cannot be disabled. To go back, you would need to migrate data to a new account.

Mistake

ADLS Gen2 is the same as Azure Data Lake Storage Gen1.

Correct

ADLS Gen2 is built on Blob Storage and is the current recommended version. Gen1 was a separate service with its own storage backend and was retired in February 2024. Gen2 offers better integration and performance.

Frequently Asked Questions

What is the difference between Azure Data Lake Storage and Azure Blob Storage?

Azure Data Lake Storage (ADLS) Gen2 is built on top of Azure Blob Storage but adds a hierarchical namespace and POSIX-like access control lists. The hierarchical namespace allows you to organize data into directories and subdirectories, enabling fast rename and move operations. Blob Storage has a flat namespace, so 'directories' are simulated using prefixes in blob names. ADLS is optimized for big data analytics, while Blob Storage is general-purpose object storage. For AZ-900, remember that ADLS is the data lake solution for analytics, and Blob Storage is for general object storage.

Can I convert an existing Azure Blob Storage account to ADLS?

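Yes, with caveats. Hierarchical namespace is normally enabled when the storage account is created, but Azure also supports upgrading an existing general-purpose v2 account to ADLS Gen2 capabilities. The upgrade is one-way (hierarchical namespace cannot be disabled afterwards), and the account is validated first because some Blob Storage features are not compatible with it. If an upgrade is not an option, create a new storage account with hierarchical namespace enabled and migrate your data using tools like Azure Data Factory or AzCopy. This is a common exam trap: older material treats conversion as impossible, so always check exactly what the question says about enabling hierarchical namespace on an existing account.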

What are the access tiers available in Azure Data Lake Storage?

ADLS supports three access tiers: Hot (optimized for frequent access, highest storage cost but lowest access cost), Cool (lower storage cost but higher access and retrieval costs, for data accessed infrequently), and Archive (lowest storage cost, but retrieval can take hours and incurs significant fees). You can set a default tier at the storage account level or override it at the file level; Archive can only be set on individual files. Use lifecycle management policies to automatically move data between tiers based on age. For AZ-900, know that early-deletion charges apply: Cool has a 30-day minimum storage period and Archive has a 180-day minimum.

How does security work in ADLS?

Security in ADLS uses a combination of Azure RBAC and POSIX ACLs. RBAC provides broad access control at the subscription, resource group, storage account, or container scope (e.g., Storage Blob Data Contributor). POSIX ACLs allow fine-grained permissions on individual directories and files, including read, write, and execute for specific users or groups. New items pick up the default ACL of their parent directory at creation time; changing a parent's ACL afterwards does not update existing children. For exam purposes, remember that RBAC is evaluated first, so a data-plane role such as Storage Blob Data Owner grants access regardless of ACLs. Also, encryption at rest is enabled by default using Azure Storage Service Encryption.

What analytics services can query ADLS directly?

ADLS integrates natively with several Azure analytics services: Azure Synapse Analytics (serverless SQL and dedicated SQL pools), Azure Databricks (using Spark), Azure HDInsight (Hadoop, Spark, Hive), and Azure Data Factory (for orchestration). These services can read data directly from ADLS without moving it. For example, Synapse serverless SQL can query CSV, Parquet, and JSON files using the OPENROWSET function. This is a key exam point: ADLS is designed for analytics, and these integrations are why it is chosen.

Is ADLS suitable for storing small amounts of data?

Yes, ADLS can store any amount of data, from a few megabytes to petabytes. There is no minimum size requirement. However, ADLS is optimized for large-scale analytics, so if you have only a few gigabytes and need simple object storage, Azure Blob Storage might be simpler and cheaper. For AZ-900, understand that ADLS is a data lake solution, but it can be used for any file storage that benefits from hierarchical organization and POSIX permissions.

What is the difference between ADLS Gen1 and Gen2?

ADLS Gen1 was a separate legacy service with its own storage backend, originally paired with Azure Data Lake Analytics. ADLS Gen2 is built on Azure Blob Storage, offering better integration, lower cost, and compatibility with Blob Storage tools. Gen2 is the current recommended version; Gen1 was retired in February 2024. For the exam, know that Gen2 is the one to use, and it provides hierarchical namespace on top of Blob Storage. Gen1 is not covered in depth on AZ-900.
