DP-900 | Chapter 55 of 101 | Objective 1.1

Structured vs Semi-Structured vs Unstructured Data

This chapter covers the fundamental classification of data into structured, semi-structured, and unstructured types, a core concept tested in DP-900 objective 1.1. Understanding these categories is essential because Azure services are designed to handle specific data types, and exam questions frequently ask you to match data formats to appropriate storage or processing solutions. Data classification sits within the core data concepts domain, which Microsoft's skills outline weights at roughly 25-30% of the exam, so mastering this topic directly helps your score.

25 min read
Intermediate
Updated May 31, 2026

The Library Storage Analogy for Data Types

Imagine a library that must store and retrieve three kinds of material. First, consider a set of encyclopedias. Each volume has the same layout: chapters, sections, and numbered pages. Every entry follows a strict schema of author, title, date, and subject. This is structured data: rigid, predefined fields, stored in tables (like shelves with labeled slots). You can query by any field and get exact results.

Now imagine a box of letters and postcards. Each has a sender, recipient, date, and body, but some include photos, some have return addresses, and some are missing dates. There is a basic template, but individual fields may or may not be present. This is semi-structured data: it carries tags or markers (like JSON keys or XML elements) that describe the content, but the schema is not enforced.

Finally, think of a pile of random items: a photograph, a handwritten note, a ticket stub, a USB drive with a video. There is no consistent format, no schema, and no tags. To find anything, you must inspect each item. This is unstructured data.

In the library, the encyclopedias go in a relational database, the letters in a NoSQL document store, and the pile in blob storage. The library's catalog (metadata) helps locate items, but retrieval differs: exact SQL queries for structured data, flexible queries for semi-structured data, and full-text or content-based retrieval for unstructured data. On the DP-900 exam, you must know which Azure service handles each type and how the types differ in schema enforcement, queryability, and storage.

How It Actually Works

What Are Data Types and Why Do They Matter?

Data classification is the foundation of data management in Azure. Every Azure data service — from Azure SQL Database to Azure Blob Storage to Cosmos DB — is optimized for one or more data types. The DP-900 exam expects you to identify the correct service based on the data's structure. The three categories are defined by the degree of schema enforcement:

Structured data: conforms to a rigid schema (tabular format with rows and columns). Each column has a defined data type (e.g., INT, VARCHAR, DATE). Examples: SQL tables, Excel spreadsheets, CSV files with consistent columns.

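Semi-structured data: has some organizational properties (tags, markers) but no fixed schema. Fields can vary between records. Examples: JSON, XML, YAML. Parquet files can hold nested, semi-structured records, but they embed a schema, so they sit closer to structured data.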

Unstructured data: has no predefined structure. Examples: text files, images, videos, audio files, binary data.
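The three definitions above can be made concrete with a small sketch. The sample records and field names below are invented for illustration; the point is only to show what "same columns everywhere", "keys that vary", and "opaque bytes" look like in practice.

```python
import csv
import io
import json

# Structured: every row has the same columns in the same order.
csv_text = "id,name,signup_date\n1,Ada,2024-01-15\n2,Grace,2024-02-03\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys describe the values, but records may differ.
docs = [
    json.loads('{"id": 1, "name": "Ada", "tags": ["vip"]}'),
    json.loads('{"id": 2, "name": "Grace", "address": {"city": "Oslo"}}'),
]

# Unstructured: opaque bytes; nothing inside names any fields.
photo = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])  # made-up JPEG-like prefix

# Do all records share one set of field names?
same_columns = len({tuple(r.keys()) for r in rows}) == 1
same_keys = len({tuple(sorted(d.keys())) for d in docs}) == 1

print("CSV rows share columns:", same_columns)
print("JSON docs share keys:", same_keys)
print("Photo is just", len(photo), "bytes")
```

The CSV rows all expose the same columns, the JSON documents do not share a key set, and the photo offers no fields to compare at all.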

How It Works Internally — The Mechanism

Structured data relies on a schema that is enforced at write time. In relational databases, the schema defines tables, columns, data types, and constraints (primary keys, foreign keys, NOT NULL). When you insert a row, the database engine validates every value against the schema. If a required (NOT NULL) column is missing, or a value has the wrong type or violates a constraint, the insert fails. This ensures data integrity but reduces flexibility. Queries use SQL (Structured Query Language) to filter, join, and aggregate. Storage is often row-based (e.g., SQL Server) or columnar (e.g., Azure Synapse).
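Write-time enforcement is easy to observe locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for a relational engine (an assumption for illustration; Azure SQL Database uses T-SQL and stricter typing). The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT    NOT NULL,
        -- SQLite is loosely typed, so the CHECK makes the type rule explicit
        age  INTEGER NOT NULL CHECK (typeof(age) = 'integer')
    )
    """
)

def try_insert(values):
    """Attempt an insert and report whether the engine accepted it."""
    try:
        conn.execute(
            "INSERT INTO customers (id, name, age) VALUES (?, ?, ?)", values
        )
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

results = [
    try_insert((1, "Ada", 36)),       # matches the schema
    try_insert((2, None, 41)),        # missing required name
    try_insert((3, "Grace", "old")),  # wrong type for age
]
row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(results, row_count)
```

Only the first row is stored; the engine refuses the other two before they reach the table.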

Semi-structured data uses self-describing formats. For example, a JSON document contains key-value pairs. The keys act as metadata, but there is no schema enforced by the storage system. Each document can have different keys. Azure services like Cosmos DB store JSON documents natively. Queries are performed using SQL-like syntax (e.g., SELECT * FROM c WHERE c.age > 30) or via API calls. The storage engine indexes keys to allow fast lookups.
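As a rough local analogue of that query, the sketch below filters plain JSON documents in Python. It is not the Cosmos DB SDK; the documents and the `older_than` helper are hypothetical. It mirrors one behavior worth remembering: a document that lacks the property simply does not match.

```python
import json

raw = [
    '{"id": "1", "name": "Ada", "age": 36}',
    '{"id": "2", "name": "Grace", "age": 28, "email": "grace@example.com"}',
    '{"id": "3", "name": "Linus"}',  # no age key at all
]
docs = [json.loads(line) for line in raw]

def older_than(items, threshold):
    """Rough equivalent of: SELECT * FROM c WHERE c.age > threshold."""
    return [
        d for d in items
        if isinstance(d.get("age"), (int, float)) and d["age"] > threshold
    ]

matches = older_than(docs, 30)

# The keys themselves are the only "schema"; collect every key seen.
all_keys = sorted({key for d in docs for key in d})

print([d["name"] for d in matches])
print(all_keys)
```

No insert was rejected even though the three documents have three different shapes, which is the practical meaning of "no enforced schema".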

Unstructured data is stored as binary large objects (blobs) in Azure Blob Storage or Azure Data Lake Storage. There is no schema and no indexing of content unless you add a service such as Azure AI Search (formerly Azure Cognitive Search). To retrieve meaningful information, you must use full-text search tools or machine learning models. Blob Storage is flat: blobs live in containers and are addressed by a path-like name, while Data Lake Storage Gen2 adds a true hierarchical namespace. Metadata (e.g., file name, size, tags) is kept as properties alongside the blob or in a separate store.
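A minimal in-memory sketch of this split between content and metadata follows. The `container` and `metadata` dictionaries and the `upload` helper are hypothetical stand-ins, not the Azure Storage SDK.

```python
import hashlib

container = {}  # blob name -> raw bytes (the unstructured content)
metadata = {}   # blob name -> descriptive fields kept beside the content

def upload(name, data, **tags):
    """Store the bytes untouched and record metadata about them."""
    container[name] = data
    metadata[name] = {
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        **tags,
    }

upload("photos/2024/cat.jpg", b"\xff\xd8\xff\xe0" + b"fake-jpeg-bytes",
       content_type="image/jpeg")
upload("notes/meeting.txt", b"Quarterly review moved to Friday",
       content_type="text/plain")

# Metadata lookups are cheap and exact.
by_metadata = [n for n, m in metadata.items()
               if m["content_type"] == "image/jpeg"]

# Content lookups mean scanning every blob's bytes.
by_content = [n for n, data in container.items() if b"Friday" in data]

print(by_metadata, by_content)
```

The metadata query never opens a blob, while the content query has to read every byte, which is why a search index or ML model is usually layered on top.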

Key Components, Values, Defaults, and Timers

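Structured: Common Azure services: Azure SQL Database, Azure SQL Managed Instance, SQL Server on Azure VM, Azure Database for PostgreSQL/MySQL (Azure Database for MariaDB has been retired). Default isolation level in SQL Server: READ COMMITTED (Azure SQL Database enables read committed snapshot by default). Maximum size of a single Azure SQL database: 4 TB in the General Purpose and Business Critical tiers; the Hyperscale tier supports far larger databases, on the order of 100 TB.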

Semi-structured: Azure Cosmos DB (default consistency: Session), Azure Table Storage (key-value store), Azure Cache for Redis (key-value with data structures). Cosmos DB RU/s (Request Units per second) is the throughput measure. Default indexing policy indexes all paths.

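Unstructured: Azure Blob Storage (hot, cool, archive tiers), Azure Data Lake Storage Gen2 (hierarchical namespace), Azure Files (SMB shares). Default blob type: Block blob. Maximum block blob size: about 190.7 TiB with current service versions (50,000 blocks of up to 4,000 MiB each); the often-quoted 4.75 TiB figure comes from the older 100 MiB block limit.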

Configuration and Verification Commands

Using the Azure CLI, you can provision a store for each data type:

Structured: Create an Azure SQL database:

# Assumes the logical server already exists; server names must be lowercase
az sql db create --resource-group myRG --server myserver --name myDB --service-objective S0

Semi-structured: Create a Cosmos DB account:

# Account names must be globally unique and lowercase
az cosmosdb create --name mycosmosdb --resource-group myRG --kind GlobalDocumentDB

Unstructured: Create a storage account and upload a blob:

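az storage account create --name mystorageaccount --resource-group myRG --location eastus --sku Standard_LRS
# The container must exist before the upload; --auth-mode login needs a blob data role on the account
az storage container create --name mycontainer --account-name mystorageaccount --auth-mode login
az storage blob upload --container-name mycontainer --file myfile.txt --name myfile.txt --account-name mystorageaccount --auth-mode login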

How Data Types Interact with Related Technologies

Data classification influences choice of processing tools:

Structured data: Use Azure Data Factory for ETL, Azure Synapse Analytics for big data analytics, Power BI for visualization.

Semi-structured data: Use Azure Stream Analytics for real-time processing and Azure Functions for serverless event handling. Azure Data Lake Analytics (U-SQL) appears in older material but has been retired; Azure Synapse Analytics now covers that role.

Unstructured data: Use Azure Batch for parallel processing, Azure Machine Learning for image/text analysis, Azure AI Search (formerly Azure Cognitive Search) for indexing.

The exam often tests which service is best for a given scenario. For example, if you have JSON logs from IoT devices, Cosmos DB is appropriate. If you have millions of customer records with fixed fields, Azure SQL Database is better.

Walk-Through

1. Identify the Data Source

Begin by examining the raw data. Determine if it has a consistent schema (structured), a flexible schema (semi-structured), or no schema (unstructured). For structured data, look for tables with fixed columns and data types. For semi-structured, look for tags like JSON keys or XML elements. For unstructured, look for binary files (images, videos) or free text without metadata. This step is crucial because the wrong classification leads to selecting an inappropriate Azure service.
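This first step can be sketched as a rough triage function. It is a heuristic for illustration only: the `classify` helper is hypothetical, it checks just UTF-8 decoding, JSON parsing, and consistent comma-separated columns, and it does not attempt XML or other formats.

```python
import csv
import io
import json

def classify(payload: bytes) -> str:
    """Guess the data category of a raw payload."""
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        return "unstructured"  # binary content such as images or video

    try:
        parsed = json.loads(text)
        if isinstance(parsed, (dict, list)):
            return "semi-structured"  # self-describing keys
    except json.JSONDecodeError:
        pass

    # Delimited text counts as structured only if every row has the
    # same number of columns (and there is more than one column).
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    widths = {len(r) for r in rows}
    if len(rows) >= 2 and len(widths) == 1 and widths.pop() >= 2:
        return "structured"

    return "unstructured"  # free text with no recognizable layout

samples = {
    "orders.csv": b"id,name\n1,Ada\n2,Grace\n",
    "event.json": b'{"id": 1, "tags": ["a"]}',
    "photo.jpg": b"\xff\xd8\xff\xe0\x00\x10",
    "note.txt": b"Meeting moved to Friday.",
}
labels = {name: classify(data) for name, data in samples.items()}
print(labels)
```

Note that the file extension plays no part in the decision; only the shape of the content does, which matches how the exam frames classification.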

2. Choose the Appropriate Azure Storage

Based on classification: structured → relational database (Azure SQL, PostgreSQL). Semi-structured → NoSQL (Cosmos DB, Table Storage). Unstructured → Blob Storage or Data Lake Storage. Consider access patterns: if you need complex joins and ACID transactions, choose relational. If you need low-latency reads with flexible schema, choose NoSQL. If you need cheap storage for large files, choose Blob Storage.
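The mapping in this step is small enough to write down as a lookup table. The sketch below simply restates the paragraph; the `RECOMMENDATION` table and `recommend` function are hypothetical names, and real designs weigh cost, scale, and existing skills as well.

```python
# data type -> (store family, example services, access pattern it suits)
RECOMMENDATION = {
    "structured": (
        "relational database",
        ["Azure SQL Database", "Azure Database for PostgreSQL"],
        "complex joins and ACID transactions",
    ),
    "semi-structured": (
        "NoSQL store",
        ["Azure Cosmos DB", "Azure Table Storage"],
        "low-latency reads with a flexible schema",
    ),
    "unstructured": (
        "object storage",
        ["Azure Blob Storage", "Azure Data Lake Storage"],
        "cheap storage for large files",
    ),
}

def recommend(data_type: str) -> str:
    """Return a one-line recommendation for a classified data type."""
    key = data_type.strip().lower()
    if key not in RECOMMENDATION:
        raise ValueError(f"unknown data type: {data_type!r}")
    family, services, reason = RECOMMENDATION[key]
    return f"{family}: {', '.join(services)} ({reason})"

for kind in RECOMMENDATION:
    print(kind, "->", recommend(kind))
```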

3. Define Schema or Metadata Strategy

For structured data, design the schema with proper data types, constraints, and indexes. For semi-structured, decide on indexing policy (Cosmos DB indexes all paths by default) and partition key. For unstructured, plan metadata storage (e.g., Azure Table for file metadata) and folder hierarchy (e.g., /year/month/day/). This step ensures efficient querying and retrieval.
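For the unstructured case, a date-based hierarchy can be sketched as follows. The `blob_path` helper, the source name, and the metadata fields are illustrative assumptions rather than an Azure convention.

```python
from datetime import datetime, timezone

def blob_path(source: str, captured_at: datetime, filename: str) -> str:
    """Build a /year/month/day/ style blob name for a file."""
    return f"{source}/{captured_at:%Y/%m/%d}/{filename}"

captured = datetime(2024, 3, 9, 14, 30, tzinfo=timezone.utc)
path = blob_path("camera-07", captured, "frame-0001.jpg")

# A metadata row that could live in a separate table, keyed by the path.
metadata_row = {
    "path": path,
    "source": "camera-07",
    "captured_at": captured.isoformat(),
    "content_type": "image/jpeg",
}

print(path)
print(metadata_row["captured_at"])
```

Zero-padded date segments keep names sorted chronologically, so listing by prefix such as `camera-07/2024/03/` returns one month of files without touching any metadata store.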

4. Ingest Data into Azure

Use appropriate tools: Azure Data Factory for structured and semi-structured data, AzCopy or the Storage SDK for unstructured data. For real-time ingestion, use Event Hubs or IoT Hub. For batch loads, run pipelines on a schedule or start them from a Blob Storage event trigger. Ensure the data format matches service expectations: structured data in CSV or Parquet for Synapse, semi-structured data in JSON for Cosmos DB, and unstructured data as raw binaries.

5. Query and Analyze

Structured data: use SQL queries with JOIN and GROUP BY. Semi-structured: use the Cosmos DB API for NoSQL (formerly the SQL API) or the API for MongoDB. Unstructured: use Azure AI Search (formerly Azure Cognitive Search) for full-text search, or Azure Machine Learning for content analysis. Monitor performance: for structured data, check query execution plans; for semi-structured, monitor RU consumption; for unstructured, check blob transfer rates.

What This Looks Like on the Job

Enterprise Scenario 1: E-Commerce Product Catalog

An online retailer stores product information: product ID, name, description, price, category, and images. The product details (name, price) are structured and stored in Azure SQL Database. Product descriptions are semi-structured (JSON with optional fields like color, size, material). Images are unstructured blobs in Azure Blob Storage. The problem: the catalog must support complex queries (e.g., "find all red shoes under $50") and fast image delivery. The solution: use Azure SQL for the structured core, Cosmos DB for the flexible attributes, and Blob Storage with CDN for images. Misconfiguration example: storing images in SQL as BLOBs leads to poor performance and high costs. Proper design: store image URLs in SQL, and images in Blob Storage.

Enterprise Scenario 2: IoT Sensor Data

A manufacturing plant collects sensor readings (temperature, pressure, vibration) every second. The data is semi-structured: each reading has a timestamp, sensor ID, and a JSON payload with varying fields. The problem: a high ingestion rate (1 million writes/second) and a need for real-time dashboards. The solution: use Azure Event Hubs to ingest, Azure Stream Analytics to process, and Cosmos DB to store. Cosmos DB's flexible schema handles varying payloads. Misconfiguration: using Azure SQL would hit write throughput limits and incur high costs. Proper design: choose Cosmos DB with provisioned throughput sized to the write rate and a partition key on sensor ID. Every write costs several RUs, so a figure such as 100,000 RU/s covers tens of thousands of small writes per second, not a million; at the stated rate you would either provision millions of RU/s or aggregate readings in Stream Analytics before storing them.
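The relationship between write rate and provisioned throughput in this scenario is simple arithmetic. The sketch below assumes a cost of 5 RU per small write, which is a placeholder: the real charge depends on item size and indexing policy and should be measured for your own documents.

```python
import math

ASSUMED_RU_PER_WRITE = 5.0  # placeholder; measure the real charge per item

def required_ru_per_second(writes_per_second: float,
                           ru_per_write: float = ASSUMED_RU_PER_WRITE) -> int:
    """Throughput needed to absorb a given write rate."""
    return math.ceil(writes_per_second * ru_per_write)

def sustainable_writes(provisioned_ru: float,
                       ru_per_write: float = ASSUMED_RU_PER_WRITE) -> int:
    """Write rate a given provisioned throughput can absorb."""
    return math.floor(provisioned_ru / ru_per_write)

needed = required_ru_per_second(1_000_000)
supported = sustainable_writes(100_000)

print(f"1,000,000 writes/s needs about {needed:,} RU/s")
print(f"100,000 RU/s supports about {supported:,} writes/s")
```

Under that assumption, the ingestion rate and the throughput figure quoted in the scenario differ by a factor of fifty, so one of them has to change, for example by batching or aggregating readings upstream.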

Enterprise Scenario 3: Media and Entertainment

A video streaming service stores raw video files (unstructured) in Azure Blob Storage. Metadata (title, genre, release date, ratings) is structured and stored in Azure SQL. User comments and reviews are semi-structured JSON stored in Cosmos DB. The problem: the service must deliver videos worldwide with low latency and support search across metadata and comments. The solution: use Azure CDN for video delivery and Azure AI Search (formerly Azure Cognitive Search) to index metadata and comments. Misconfiguration: storing all data in Blob Storage without indexing makes search impractical. Proper design: separate storage per data type and use appropriate indexing services.

How DP-900 Actually Tests This

What DP-900 Tests on This Topic

Objective 1.1: Describe core data concepts. Specifically, the exam expects you to:

Differentiate between structured, semi-structured, and unstructured data.

Identify examples of each type.

Recommend appropriate Azure storage services for each type.

Understand that the same data can be represented in different structures (e.g., a CSV file is structured, but its contents could be semi-structured if columns vary).

Top 3 Wrong Answers and Why Candidates Choose Them

1. "JSON is structured data." Candidates see keys and values and think it's like a table. Reality: JSON is semi-structured because keys can vary between documents; there is no enforced schema. The exam loves to present JSON as structured to trick you.

2. "A text file with commas is unstructured." Candidates assume any text file is unstructured. Reality: if the text file has consistent columns (CSV), it is structured. The format (text) does not determine structure; the presence of a schema does.

3. "XML is unstructured." Candidates confuse XML's verbosity with lack of structure. Reality: XML has tags that define a hierarchical structure, making it semi-structured. The exam may show an XML snippet and ask for its type.

Specific Numbers and Terms to Memorize

Azure Blob Storage tiers: Hot (frequent access), Cool (infrequent), Archive (rare access).

Cosmos DB default consistency: Session.

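Maximum block blob size: about 190.7 TiB with current service versions (50,000 blocks of up to 4,000 MiB each). Older material quotes 4.75 TiB, which reflects the earlier 100 MiB block limit.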

Azure SQL Database max size: 4 TB in the General Purpose and Business Critical tiers; Hyperscale supports far larger databases.

Parquet is a columnar storage format for structured/semi-structured data (commonly used in Azure Synapse).

Edge Cases and Exceptions

Mixed data: A single file can contain multiple types. For example, a JSON file with a schema that is always the same is effectively structured, but it is still stored as semi-structured by Azure services. The exam expects you to classify based on the format, not the content.

Schema-on-read: Some services (like Azure Synapse) allow you to read semi-structured data with a schema defined at query time. This is still semi-structured storage.
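Schema-on-read can be illustrated without any Azure service. In the sketch below, raw JSON lines are stored untouched and a schema is applied only when they are read; the `SCHEMA` mapping, the field names, and the `read_with_schema` helper are hypothetical.

```python
import json

# Stored as-is: types and fields vary from line to line.
raw_lines = [
    '{"sensor": "t-1", "temp": "21.5", "ts": "2024-05-01T10:00:00Z"}',
    '{"sensor": "t-2", "temp": 19, "humidity": 40}',
    '{"sensor": "t-3"}',
]

# The schema lives with the query, not with the storage.
SCHEMA = {"sensor": str, "temp": float}

def read_with_schema(lines, schema):
    """Project each raw document onto the schema, casting as we go."""
    for line in lines:
        doc = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = doc.get(field)
            row[field] = cast(value) if value is not None else None
        yield row

rows = list(read_with_schema(raw_lines, SCHEMA))
for row in rows:
    print(row)
```

A different query could apply a different schema to the same stored lines, which is why the storage itself still counts as semi-structured.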

Unstructured with metadata: Blob Storage allows tags and properties, which are structured metadata. The blob itself is unstructured.

How to Eliminate Wrong Answers

If the question mentions "fixed schema," "tabular," or "rows and columns," it's structured.

If it mentions "self-describing," "tags," or "flexible schema," it's semi-structured.

If it mentions "binary," "no schema," or "raw files," it's unstructured.

For service selection: if ACID transactions or complex joins are needed, choose relational (structured). If high throughput and flexible schema are needed, choose NoSQL (semi-structured). If cheap storage for large files, choose Blob (unstructured).
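As a study aid, the keyword cues above can be turned into a tiny matcher. This is a mnemonic sketch, not a way to answer real questions: the `CUES` table copies the phrases listed in this section, and plain substring matching will misfire on wording it has not seen.

```python
CUES = {
    "structured": ("fixed schema", "tabular", "rows and columns"),
    "semi-structured": ("self-describing", "tags", "flexible schema"),
    "unstructured": ("binary", "no schema", "raw files"),
}

def classify_question(text: str) -> str:
    """Pick the category whose cue phrases appear most often."""
    lowered = text.lower()
    hits = {
        label: sum(cue in lowered for cue in cues)
        for label, cues in CUES.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "unknown"

questions = [
    "Data is stored in rows and columns with a fixed schema.",
    "Documents are self-describing and use a flexible schema.",
    "Raw files such as video are stored as binary objects.",
    "Which region is closest to your users?",
]
answers = [classify_question(q) for q in questions]
print(answers)
```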

Key Takeaways

Structured data has a fixed schema; semi-structured has tags but no fixed schema; unstructured has no schema.

Azure SQL Database is for structured data; Cosmos DB is for semi-structured; Blob Storage is for unstructured.

JSON and XML are semi-structured; CSV with consistent columns is structured; Parquet carries an embedded schema and holds structured or nested (semi-structured) data.

The same data can be represented in different structures (e.g., a customer record in JSON vs. SQL table).

On the exam, if the data has a fixed, enforced schema, it's structured; if it has self-describing tags, it's semi-structured; if it's raw binary, it's unstructured.

ACID transactions with multi-table joins point to a relational store for structured data; high-velocity IoT data typically lands in a semi-structured store such as Cosmos DB.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Structured Data

Rigid schema enforced at write time

Stored in relational databases (e.g., Azure SQL)

Queried using SQL with JOINs

Best for ACID transactions

Examples: customer table, sales orders

Semi-Structured Data

Flexible schema; fields can vary

Stored in NoSQL databases (e.g., Cosmos DB)

Queried using SQL-like or API calls

Best for high throughput and low latency

Examples: JSON documents, XML files

Watch Out for These

Mistake

CSV files are always semi-structured because they are text files.

Correct

CSV files with consistent columns are structured data. The format has a fixed schema (columns) and is tabular. Only if the CSV has varying columns per row would it be semi-structured, but standard CSV is structured.

Mistake

JSON is unstructured because it's just text.

Correct

JSON is semi-structured because it uses key-value pairs that act as metadata. It has a logical structure (objects, arrays) but no enforced schema. The exam classifies JSON as semi-structured.

Mistake

Storing a CSV file in Azure Blob Storage makes Blob Storage a structured data store.

Correct

Blob Storage can hold any file, including CSV, but it treats every object as an opaque blob and knows nothing about the columns inside. To query the contents, another service must read and parse the file. For transactional workloads on structured data, use a relational database.

Mistake

Relational databases can only store structured data.

Correct

Relational databases are built for structured data, but they can also hold semi-structured data in JSON columns (for example, the JSON functions in SQL Server and Azure SQL Database). Their primary purpose is still structured data. The exam focuses on the data type, not on every capability of the database.

Mistake

Parquet files are unstructured.

Correct

Parquet is a columnar storage format that stores structured or semi-structured data. It has a schema (defined by the data) and is optimized for analytics. It is not unstructured.


Frequently Asked Questions

Is CSV structured or semi-structured?

CSV with consistent columns is structured data because it has a fixed schema (column names and order). However, if the CSV has varying columns per row (e.g., some rows have extra fields), it becomes semi-structured. For the DP-900 exam, assume standard CSV is structured.

Can Azure Blob Storage store semi-structured data?

Yes, you can store JSON or XML files in Blob Storage, but Blob Storage treats them as unstructured blobs. To query them as semi-structured data, use a service that applies schema-on-read, such as Azure Synapse Analytics (older material also mentions Azure Data Lake Analytics, which has been retired). The exam expects you to know that Blob Storage is primarily for unstructured data.

What is the difference between structured and semi-structured data in terms of querying?

Structured data uses SQL with predefined tables and joins. Semi-structured data uses SQL-like queries with flexible paths (e.g., `SELECT c.name FROM c` in Cosmos DB). Because Cosmos DB indexes all paths by default, queries on any property can use the index; the trade-off is a higher RU charge on writes, not slower reads.

Is a PDF file structured or unstructured?

A PDF is unstructured because it is a binary file with no internal schema for querying. Even if it contains text, the file itself is not organized in a way that allows SQL-like queries without special parsing.

What Azure service should I use for semi-structured data with high write throughput?

Azure Cosmos DB is designed for semi-structured data with high write throughput. It supports multiple APIs (NoSQL, formerly called the SQL API, plus MongoDB, Cassandra, Gremlin, and Table) and provides low-latency reads and writes at scale. Azure Table Storage is another option for key-value semi-structured data, but with lower throughput.

Can structured data be stored in Cosmos DB?

Yes, Cosmos DB can store structured data if the documents always have the same schema. However, Cosmos DB does not enforce a schema, so it is still considered semi-structured storage. For strict schema enforcement, use Azure SQL.

What is the maximum size of a blob in Azure Blob Storage?

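For block blobs, the maximum size with current service versions is approximately 190.7 TiB (up to 4,000 MiB per block, up to 50,000 blocks). The 4.75 TiB figure found in older material comes from the earlier limit of 100 MiB per block. Page blobs can be up to 8 TiB. Append blobs have a maximum of about 195 GiB. The exam may test these numbers.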
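These limits all come from multiplying a maximum block size by the 50,000-block cap, so they are easy to re-derive. The block sizes below are the commonly documented ones (100 MiB for older service versions, 4,000 MiB for current versions, 4 MiB for append blobs); treat them as assumptions and confirm them against the current Azure Storage scalability targets.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4
MAX_BLOCKS = 50_000  # block cap per blob

def max_blob_bytes(block_size_mib: int) -> int:
    """Largest blob that fits in MAX_BLOCKS blocks of the given size."""
    return block_size_mib * MIB * MAX_BLOCKS

older_block_blob_tib = max_blob_bytes(100) / TIB     # older service versions
current_block_blob_tib = max_blob_bytes(4000) / TIB  # current service versions
append_blob_gib = max_blob_bytes(4) / GIB            # append blobs

print(f"Block blob, 100 MiB blocks:   {older_block_blob_tib:.2f} TiB")
print(f"Block blob, 4,000 MiB blocks: {current_block_blob_tib:.1f} TiB")
print(f"Append blob, 4 MiB blocks:    {append_blob_gib:.1f} GiB")
```

Remembering the formula is more reliable than memorizing a single number, since the per-block limit has changed between service versions.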
