DP-900 | Chapter 51 of 101 | Objective 1.3

Document Databases and JSON Storage

This chapter covers document databases and JSON storage, a core non-relational data model tested on the DP-900 exam. Document databases are a major category of NoSQL database in modern cloud applications, and non-relational data stores account for roughly 15-20% of exam questions, with document databases the most commonly tested type. You will learn the internal structure of documents, how they differ from relational tables, and how Azure services like Cosmos DB implement this model.

25 min read
Intermediate
Updated May 31, 2026

Document Database as a Self-Contained File Folder

Imagine a filing cabinet where each folder represents a document in a document database. Each folder is self-contained: it has a label (the document ID) and inside, it contains all the information about a single entity—say, a customer. The folder might include a name, address, order history, and preferences, all stored as a single, nested set of papers. Unlike a relational database where you'd have separate folders for customers, orders, and preferences linked by cross-references (foreign keys), here everything about that customer is in one folder. You can open the folder and read or update any part without touching other folders. This makes it easy to store complex, hierarchical data like a customer with multiple addresses and a list of past orders. However, if you want to find all customers who ordered a specific product, you'd have to open every folder and check—there's no central index on order items. Document databases solve this by allowing secondary indexes on fields within the documents, but the principle remains: each document is a self-contained unit, and relationships are embedded rather than referenced.

How It Actually Works

What is a Document Database?

A document database is a type of NoSQL database that stores data in the form of documents, typically using JSON (JavaScript Object Notation) or a similar format like BSON (Binary JSON). Each document is a self-contained data structure that contains all the information about a single entity. Unlike relational databases where data is normalized across multiple tables and joined via foreign keys, document databases denormalize data by embedding related information within a single document. This design aligns with how application developers often think about objects in code—an object in an object-oriented language maps directly to a document.
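The object-to-document mapping described above can be sketched in a few lines. This is a minimal illustration, assuming hypothetical Customer and Address classes that are not part of any Azure SDK; it only shows that a nested object graph serializes into one self-contained JSON document.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Address:
    street: str
    city: str


@dataclass
class Customer:
    id: str
    name: str
    address: Address
    orders: list = field(default_factory=list)


def to_document(customer: Customer) -> str:
    """Serialize the whole object graph into one JSON document."""
    return json.dumps(asdict(customer))


customer = Customer(
    id="1",
    name="John Doe",
    address=Address(street="123 Main St", city="Seattle"),
    orders=[{"product": "Laptop", "price": 999.99}],
)
document = to_document(customer)
```

The nested address and the orders list travel with the customer, so a single read returns the full entity.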

Why Document Databases Exist

Document databases emerged to address limitations of relational databases in handling semi-structured data, flexible schemas, and high-volume read/write workloads. In a relational database, every row in a table must conform to a fixed schema; adding a new column requires an ALTER TABLE operation, which can be disruptive. Document databases are schema-agnostic: each document can have a different set of fields. This is ideal for scenarios where the data structure evolves over time, such as product catalogs with varying attributes, or user profiles where different users may have different preferences. Additionally, document databases avoid the performance overhead of joins by embedding related data, making read operations faster for many access patterns.
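A small sketch of what schema-agnostic means in practice. The product fields below (size, warrantyMonths, bluetooth) are made-up examples, and the list stands in for a real collection.

```python
# Two products in the same collection with different fields.
catalog = [
    {"id": "p1", "type": "clothing", "name": "T-shirt", "size": "M", "color": "blue"},
    {"id": "p2", "type": "electronics", "name": "Headphones", "warrantyMonths": 24},
]


def describe(product: dict) -> str:
    """Build a label from whichever optional fields the document has."""
    parts = [product["name"]]
    if "size" in product:
        parts.append(f"size {product['size']}")
    if "warrantyMonths" in product:
        parts.append(f"{product['warrantyMonths']}-month warranty")
    return ", ".join(parts)


# Adding a new field to one document needs no schema migration.
catalog[1]["bluetooth"] = True
labels = [describe(p) for p in catalog]
```

The trade-off is that the application, not the database, has to cope with fields that may be absent.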

How Document Databases Work Internally

A document database stores documents in a collection (analogous to a table in relational databases). Each document has a unique identifier (the document ID) and, in most engines, a version marker used for optimistic concurrency control (Cosmos DB exposes it as the _etag property). The database engine indexes the document ID by default, allowing efficient retrieval by ID. Secondary indexes can be created on other fields within the documents, and the engine uses them at query time to locate matching documents without scanning the whole collection. The query language is typically a JSON-based query syntax (as in MongoDB) or a SQL-like language (e.g., the SQL API in Cosmos DB, now named the API for NoSQL). On disk, the encoding depends on the engine: MongoDB stores documents as BSON, while other engines store JSON text or their own binary format. The storage engine may use a log-structured merge-tree (LSM-tree) or B-tree for indexing, depending on the database.
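The mechanics in this section (a primary index on the ID, optional secondary indexes, and a version check on writes) can be modeled with a toy in-memory class. ToyCollection is a hypothetical teaching aid, not how any real engine is implemented; real engines use B-trees or LSM-trees rather than Python dictionaries.

```python
from collections import defaultdict


class ToyCollection:
    """In-memory model: primary index on id, optional secondary indexes."""

    def __init__(self, indexed_fields=()):
        self._by_id = {}  # primary index: id -> (version, document)
        self._secondary = {f: defaultdict(set) for f in indexed_fields}

    def upsert(self, doc, expected_version=None):
        current = self._by_id.get(doc["id"])
        current_version = current[0] if current else 0
        # Optimistic concurrency: reject the write if someone else updated first.
        if expected_version is not None and expected_version != current_version:
            raise ValueError("version conflict")
        if current:
            self._unindex(current[1])
        self._by_id[doc["id"]] = (current_version + 1, dict(doc))
        for field_name, index in self._secondary.items():
            if field_name in doc:
                index[doc[field_name]].add(doc["id"])
        return current_version + 1

    def _unindex(self, doc):
        # Remove the old document's entries from every secondary index.
        for field_name, index in self._secondary.items():
            if field_name in doc:
                index[doc[field_name]].discard(doc["id"])

    def find(self, field_name, value):
        """Use the secondary index instead of scanning every document."""
        ids = self._secondary[field_name].get(value, set())
        return [self._by_id[i][1] for i in sorted(ids)]


coll = ToyCollection(indexed_fields=["city"])
v1 = coll.upsert({"id": "1", "name": "John Doe", "city": "Seattle"})
v2 = coll.upsert({"id": "1", "name": "John Doe", "city": "Portland"}, expected_version=v1)
```

A second writer still holding version v1 would now get a conflict instead of silently overwriting the Portland update.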

Key Components and Defaults

Document: A JSON object with key-value pairs. Values can be strings, numbers, booleans, null, arrays, or nested objects. Maximum document size varies by engine: in Azure Cosmos DB the limit is 2 MB per document, while MongoDB caps a document at 16 MB.

Collection: A container for documents. In Cosmos DB, a collection is a container that can be provisioned with throughput (RU/s). Each collection has a partition key that determines how data is distributed across physical partitions.

Partition Key: A field within the document that the database uses to distribute documents across partitions. Choosing a good partition key is critical for performance; it should have high cardinality and even distribution.

Indexing: By default, all fields in a Cosmos DB document are automatically indexed. You can customize indexing policies to include or exclude specific paths.

Consistency Levels: Cosmos DB offers five consistency levels (strong, bounded staleness, session, consistent prefix, eventual) that affect read performance and data freshness.
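To make the partition key's role more concrete, here is a rough sketch of hash-based placement. The SHA-256 hash and the modulo step are stand-ins chosen for illustration; the hash function and range assignment Cosmos DB actually uses are internal to the service.

```python
import hashlib


def physical_partition(partition_key_value: str, partition_count: int) -> int:
    """Map a partition key value to a partition number with a stable hash."""
    digest = hashlib.sha256(partition_key_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


# 1000 documents spread over 50 distinct customerId values (made-up data).
documents = [{"id": str(i), "customerId": f"cust-{i % 50}"} for i in range(1000)]
placement = {}
for doc in documents:
    p = physical_partition(doc["customerId"], partition_count=4)
    placement.setdefault(p, []).append(doc["id"])
```

Every document with the same customerId lands in the same partition, which is why a key with few distinct values concentrates data and traffic.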

Configuration and Verification Commands

In Azure Cosmos DB, you can create a document database and collection using the Azure CLI:

az cosmosdb create --name mycosmosdb --resource-group myrg --kind GlobalDocumentDB
az cosmosdb sql database create --account-name mycosmosdb --name mydatabase --resource-group myrg
az cosmosdb sql container create --account-name mycosmosdb --database-name mydatabase --name mycontainer --partition-key-path "/id" --resource-group myrg

A sample JSON document to insert into the container (for example, through the portal's Data Explorer or an SDK):

{
  "id": "1",
  "name": "John Doe",
  "address": {
    "street": "123 Main St",
    "city": "Seattle"
  },
  "orders": [
    {"product": "Laptop", "price": 999.99}
  ]
}

Query using SQL-like syntax:

SELECT * FROM c WHERE c.name = "John Doe"
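For readers who want to see what that query does without an Azure account, the sketch below evaluates the same filter over the sample document in plain Python. The list named container is a stand-in for a real container, and no Cosmos DB SDK is involved.

```python
import json

raw = """
{
  "id": "1",
  "name": "John Doe",
  "address": {"street": "123 Main St", "city": "Seattle"},
  "orders": [{"product": "Laptop", "price": 999.99}]
}
"""

doc = json.loads(raw)
container = [doc]  # stand-in for the container's documents

# Equivalent of: SELECT * FROM c WHERE c.name = "John Doe"
matches = [c for c in container if c.get("name") == "John Doe"]

# Nested fields are reached by walking the structure, like c.address.city.
city = matches[0]["address"]["city"]

# Serialized size in bytes, to compare against the 2 MB document limit.
size_bytes = len(json.dumps(doc).encode("utf-8"))
```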

How Document Databases Interact with Related Technologies

Document databases often integrate with search engines (e.g., Azure AI Search, formerly Azure Cognitive Search) to provide full-text search capabilities. They can also serve as an output for stream processing jobs (e.g., Azure Stream Analytics) or as a source or sink in data ingestion pipelines (e.g., Azure Data Factory). In a microservices architecture, each service may own its own document database, ensuring loose coupling. Document databases can also be replicated across regions for low-latency reads and disaster recovery.

Performance Considerations

Throughput: In Cosmos DB, you provision Request Units per second (RU/s). Each operation consumes RU based on document size, index size, and complexity. A point read (by ID) costs ~1 RU for a 1KB document. A query with index scan costs more.

Partitioning: Ensure partition key cardinality is high (e.g., user ID) to avoid hot partitions. A single logical partition has a maximum storage limit of 20 GB in Cosmos DB.

Indexing: Exclude unnecessary fields from indexing to reduce RU consumption and storage. For example, if you never query by a large binary field, exclude it.
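The throughput point can be turned into back-of-the-envelope arithmetic. Only the figure of about 1 RU for a 1 KB point read comes from this chapter; the write and query costs below are placeholder assumptions, and real charges should be read from the request charge the service reports for each operation.

```python
def max_ops_per_second(provisioned_rus: float, ru_per_operation: float) -> float:
    """How many identical operations fit in one second of provisioned throughput."""
    if ru_per_operation <= 0:
        raise ValueError("ru_per_operation must be positive")
    return provisioned_rus / ru_per_operation


def required_rus(workload: dict) -> float:
    """Sum RU/s for a workload given as {name: (ops_per_second, ru_per_op)}."""
    return sum(ops * cost for ops, cost in workload.values())


point_reads = max_ops_per_second(provisioned_rus=400, ru_per_operation=1)

# The write and query costs are placeholder assumptions, not measured values.
workload = {
    "point_read_1kb": (200, 1.0),
    "write": (20, 6.0),
    "query": (5, 10.0),
}
needed = required_rus(workload)
```

If the required figure exceeds what is provisioned, requests are throttled until throughput is raised or the workload is made cheaper.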

Common Use Cases

Content management systems: Store articles with varying metadata.

User profiles: Each user can have different attributes (e.g., preferences, settings).

IoT telemetry: Each device sends JSON data with different sensor readings.

E-commerce catalogs: Products have different attributes (e.g., clothing has size, color; electronics have warranty).

Walk-Through

1. Define the Data Model

Identify the entities and their relationships. In a document database, you typically embed related data (e.g., order items inside an order document) rather than normalizing. This step involves deciding which fields to include and how to nest them. For example, a customer document might contain an array of addresses and an array of orders. You also choose the partition key based on the most common query pattern.
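As a rough illustration of the embedding decision, the sketch below folds relational-style rows into one customer document. The table contents and the embed helper are hypothetical; the point is only that child rows become nested arrays and the join column disappears.

```python
customers = [{"customerId": 123, "name": "John Doe"}]
addresses = [
    {"customerId": 123, "kind": "home", "city": "Seattle"},
    {"customerId": 123, "kind": "work", "city": "Redmond"},
]
orders = [{"customerId": 123, "orderId": "o-1", "product": "Laptop", "price": 999.99}]


def embed(customer, addresses, orders):
    """Fold the child rows that reference a customer into one document."""
    cid = customer["customerId"]

    def children(rows):
        # Drop the join column: the nesting itself now expresses the relationship.
        return [{k: v for k, v in r.items() if k != "customerId"}
                for r in rows if r["customerId"] == cid]

    return {
        "id": str(cid),
        "customerId": cid,  # also used as the partition key value
        "name": customer["name"],
        "addresses": children(addresses),
        "orders": children(orders),
    }


customer_doc = embed(customers[0], addresses, orders)
```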

2. Create the Database and Collection

Using the Azure portal, CLI, or SDK, create a Cosmos DB account, then create a database and a collection (container). During collection creation, you specify the partition key path (e.g., "/customerId"). You also provision throughput (RU/s) at either the database or the container level. For example, 400 RU/s is the minimum for a container with manually provisioned throughput.

3. Insert Documents

Insert JSON documents into the collection. Each document needs an id field that is unique within its logical partition (depending on the client, you can let the system generate one). The partition key value is extracted from the document at the configured path. For example, a document with customerId = 123 is hashed on that value and routed to the physical partition that owns the matching hash range. Cosmos DB accepts and returns the document as JSON; BSON is the storage format used by MongoDB.
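A minimal sketch of what happens to a document on its way in: an id is filled in if missing and the partition key value is read from the configured path. The prepare_for_insert helper is hypothetical and handles only top-level paths such as "/customerId"; it does not call any Cosmos DB API.

```python
import uuid


def prepare_for_insert(doc: dict, partition_key_path: str) -> tuple:
    """Return (document with an id, partition key value) for a top-level path."""
    field_name = partition_key_path.lstrip("/")
    if field_name not in doc:
        raise KeyError(f"document is missing partition key field {field_name!r}")
    prepared = dict(doc)
    # Generate an id client-side when the caller did not supply one.
    prepared.setdefault("id", str(uuid.uuid4()))
    return prepared, prepared[field_name]


doc, pk_value = prepare_for_insert(
    {"customerId": 123, "name": "John Doe"}, partition_key_path="/customerId"
)
```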

4. Create Indexes

By default, Cosmos DB indexes all fields. You can customize the indexing policy to include or exclude paths and to add spatial or composite indexes, the latter for queries that filter or sort on several fields. Included paths get a range index by default (the older hash index kind is no longer used). For example, to keep a range index on the "price" field, include the path "/price/?" in the policy.
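An indexing policy is itself a JSON document. The sketch below builds one as a Python dictionary, following the policy shape as commonly documented (includedPaths, excludedPaths, compositeIndexes); the rawPayload, category, and price field names are assumptions for illustration, so check the exact property names against the current Cosmos DB documentation before using it.

```python
import json

indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    # Index everything by default...
    "includedPaths": [{"path": "/*"}],
    # ...except a large field that is never used in filters (hypothetical name).
    "excludedPaths": [{"path": "/rawPayload/*"}],
    # Composite index for queries that sort on category, then price.
    "compositeIndexes": [
        [
            {"path": "/category", "order": "ascending"},
            {"path": "/price", "order": "descending"},
        ]
    ],
}

policy_json = json.dumps(indexing_policy, indent=2)
```

Excluding a path trades query flexibility on that field for cheaper writes and less index storage.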

5. Query Documents

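Use the SQL API to query documents. Queries can filter and project, and JOIN works within a single document (for example, across the elements of an array), not across documents. A query that includes the partition key in its WHERE clause is served from one partition; a query that omits it fans out across all partitions. Example: SELECT c.name, c.address.city FROM c WHERE c.orders[0].product = 'Laptop' checks only the first element of each orders array. Under the default indexing policy the path is indexed, so the query does not need a full scan, but without a partition key filter it still fans out to every partition.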
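One subtlety in the example query is that orders[0] inspects only the first array element. The plain-Python sketch below contrasts that with matching a product at any position, which in Cosmos DB would typically be written with an intra-document JOIN over c.orders. The sample customers are made up.

```python
customers = [
    {"id": "1", "name": "John Doe",
     "orders": [{"product": "Laptop"}, {"product": "Mouse"}]},
    {"id": "2", "name": "Jane Roe",
     "orders": [{"product": "Mouse"}, {"product": "Laptop"}]},
    {"id": "3", "name": "Sam Poe", "orders": []},
]


def first_order_is(docs, product):
    """Mirrors WHERE c.orders[0].product = <product>: first element only."""
    return [d["name"] for d in docs
            if d.get("orders") and d["orders"][0].get("product") == product]


def any_order_is(docs, product):
    """Matches the product at any position in the orders array."""
    return [d["name"] for d in docs
            if any(o.get("product") == product for o in d.get("orders", []))]


first_only = first_order_is(customers, "Laptop")
anywhere = any_order_is(customers, "Laptop")
```

Jane Roe bought a laptop too, but only the second function finds her because her laptop is not the first order in the array.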

What This Looks Like on the Job

Scenario 1: E-commerce Product Catalog

A large online retailer stores product information in Azure Cosmos DB using the document model. Each product document contains fields like name, description, price, category, and a list of variants (size, color, stock). The partition key is the category ID, ensuring that all products in the same category are co-located. Queries like "get all electronics products under $500" can be served efficiently using a range index on price. The catalog is updated frequently as prices change, and the schema evolves (e.g., adding a new field for warranty). The document model allows these changes without downtime. Performance is critical: the system handles 10,000 reads per second with 5 ms latency. Misconfiguration: if the partition key is too granular for the query pattern (e.g., product ID), category queries can no longer be served from one partition and must fan out across all of them, which raises RU consumption.

Scenario 2: User Profile Store for a Social Media App

A social media platform stores user profiles in MongoDB (or Cosmos DB with the MongoDB API). Each user document contains basic info (name, email), preferences, a friend list (array of user IDs), and recent posts (array of embedded post objects). The partition key is the user ID, so all data for a user is in one partition. Reads are by user ID (point reads), which are fast and cheap. Writing a new post appends to the posts array. However, if a user has thousands of posts, the document size may approach the 16 MB limit, and performance degrades well before that because every update rewrites a large document. The solution is to store posts in a separate collection, with the user ID kept on each post as a reference (the database does not enforce it as a foreign key), rather than embedding them. This trade-off is a common design decision.
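The embed-versus-reference trade-off in Scenario 2 can be sketched as follows. The split_posts and load_profile_with_posts helpers, the id scheme, and the sample user are all hypothetical; lists stand in for the two collections.

```python
import json

MAX_DOC_BYTES = 16 * 1024 * 1024  # the 16 MB limit mentioned in the scenario


def doc_size(doc: dict) -> int:
    """Serialized size of a document in bytes."""
    return len(json.dumps(doc).encode("utf-8"))


def split_posts(user_doc: dict):
    """Move embedded posts into their own documents that reference the user."""
    profile = {k: v for k, v in user_doc.items() if k != "posts"}
    posts = [
        {"id": f"{user_doc['id']}-post-{i}", "userId": user_doc["id"], **post}
        for i, post in enumerate(user_doc.get("posts", []))
    ]
    return profile, posts


def load_profile_with_posts(user_id, profiles, posts):
    """Application-level join: two lookups instead of one embedded read."""
    profile = next(p for p in profiles if p["id"] == user_id)
    return {**profile, "posts": [p for p in posts if p["userId"] == user_id]}


user = {"id": "u1", "name": "Ada", "posts": [{"text": "hello"}, {"text": "world"}]}
profile, post_docs = split_posts(user)
rejoined = load_profile_with_posts("u1", [profile], post_docs)
```

The profile document stays small no matter how many posts accumulate, at the cost of a second lookup whenever posts are needed.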

Scenario 3: IoT Sensor Data Ingestion

A manufacturing company ingests sensor readings from thousands of devices into Cosmos DB. Each reading is a JSON document with device ID, timestamp, temperature, humidity, and other metrics. The partition key is device ID, and the collection is configured with a time-to-live (TTL) of 30 days to automatically expire old data. Queries are typically per device (e.g., all readings from device X in the last hour). The document model handles the varying sensor schemas (some devices send extra fields). Misconfiguration: using a partition key that results in a hot partition (e.g., region ID where one region has 80% of devices) causes throttling. The solution is to use a synthetic partition key, such as a bucket number derived by hashing the device ID.
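The hot-partition problem in Scenario 3 can be measured with a simple skew check. This sketch uses made-up readings and a SHA-256 bucket as the synthetic key; the bucket count of 10 is an arbitrary assumption, not a recommendation.

```python
import hashlib
from collections import Counter


def skew(readings, key_fn):
    """Share of all readings that land in the busiest partition key value."""
    counts = Counter(key_fn(r) for r in readings)
    return max(counts.values()) / len(readings)


def synthetic_key(reading, buckets=10):
    """Bucket number derived from a stable hash of the device ID."""
    digest = hashlib.sha256(reading["deviceId"].encode("utf-8")).digest()
    return f"bucket-{digest[0] % buckets}"


# 80% of devices sit in one region, as in the misconfiguration above.
readings = [
    {"deviceId": f"dev-{i}", "region": "west" if i < 800 else "east"}
    for i in range(1000)
]
by_region = skew(readings, lambda r: r["region"])
by_synthetic = skew(readings, synthetic_key)
```

Partitioning by region sends 80% of the traffic to one key, while the hashed buckets spread it far more evenly.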

How DP-900 Actually Tests This

The DP-900 exam tests document databases primarily under objective 1.3 'Describe core data concepts' and also under 3.2 'Describe non-relational data offerings on Azure'. You should know:

The definition of a document database: stores data in JSON documents, each document is self-contained, schema-agnostic.

How it differs from relational databases: no joins, no fixed schema, denormalized data.

Azure Cosmos DB is the primary Azure document database service. Know its APIs: SQL (Core), which Microsoft now calls the API for NoSQL, plus MongoDB, Cassandra, Gremlin, and Table. The SQL API uses a SQL-like query language on JSON documents.

Key concepts: partition key, RU (Request Units), consistency levels, indexing policies.

Common wrong answers:

1. "Document databases enforce a fixed schema" – FALSE. They are schema-agnostic. Candidates confuse them with relational databases.

2. "Documents must be normalized" – FALSE. Document databases encourage denormalization (embedding). Candidates assume normalization is always good.

3. "Document databases do not support indexing" – FALSE. They support secondary indexes. Candidates assume only relational databases have indexes.

4. "JSON is the only format for documents" – PARTIALLY TRUE for the Cosmos DB SQL API, but MongoDB uses BSON. The exam may ask about formats.

Numbers/values to memorize:

Cosmos DB default document size limit: 2 MB (for SQL API). MongoDB: 16 MB.

Minimum RU/s for a container: 400.

Maximum storage per logical partition: 20 GB.

Consistency levels: 5 (strong, bounded staleness, session, consistent prefix, eventual).

Edge cases the exam loves:

If you don't specify a partition key in a query, Cosmos DB must fan out across all partitions, increasing RU cost.

You can change the indexing policy after creation, but it may take time to rebuild indexes.

In Cosmos DB, transactions are scoped to a single logical partition (not across partitions).

How to eliminate wrong answers: Look for keywords like 'schema', 'normalize', 'join', 'ACID'. If the answer says document databases require a fixed schema, it's wrong. If it says Cosmos DB supports multi-document ACID transactions across partitions, it's wrong: transactional batch and stored procedures are scoped to a single logical partition.

Key Takeaways

A document database stores data as JSON (or BSON) documents, each with a unique ID and flexible schema.

Azure Cosmos DB is the primary Azure document database service; it offers multiple APIs including SQL (Core), MongoDB, Cassandra, Gremlin, and Table.

The partition key is critical for performance; choose a field with high cardinality and even distribution to avoid hot partitions.

Cosmos DB automatically indexes all fields by default; you can customize the indexing policy to exclude fields and reduce RU consumption.

Maximum document size in Cosmos DB SQL API is 2 MB; in MongoDB it is 16 MB.

Cosmos DB supports ACID transactions only within a single logical partition (not across partitions).

A query without the partition key in its filter becomes a cross-partition query, which consumes more RUs and has higher latency.

Consistency levels in Cosmos DB: strong, bounded staleness, session, consistent prefix, eventual (listed from strongest to weakest).

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Document Database (e.g., Cosmos DB SQL API)

Schema-agnostic; each document can have different fields.

Data is denormalized; related data is embedded in a single document.

No native JOIN across documents (the Cosmos DB JOIN keyword works only within a single document); relationships are handled via embedding or application-level joins.

Scalable horizontally via partitioning; partition key determines distribution.

Query language is SQL-like but operates on JSON documents; supports nested field access.

Relational Database (e.g., Azure SQL Database)

Fixed schema; every row in a table must have the same columns.

Data is normalized; related data is stored in separate tables linked by foreign keys.

Native JOIN support using SQL JOIN clauses.

Scales vertically (larger servers) or via read replicas; horizontal sharding is complex.

Standard SQL; works with tabular data; requires JOINs to combine related data.

Watch Out for These

Mistake

Document databases are just key-value stores with extra fields.

Correct

Document databases allow querying on any field using secondary indexes, not just the key. They support complex queries, aggregations, and nested data structures, unlike simple key-value stores.

Mistake

JSON documents must all have the same fields.

Correct

Document databases are schema-agnostic. Each document can have a different set of fields. This is a key advantage over relational tables.

Mistake

Document databases cannot handle relationships.

Correct

They can handle relationships via embedding (denormalization) or references (manual joins in application code). However, they lack native join operations.

Mistake

Document databases are always slower than relational databases.

Correct

For many read-heavy workloads with denormalized data, document databases can be faster because they avoid joins and can serve a complete entity from a single read.

Mistake

Cosmos DB only supports the SQL API.

Correct

Cosmos DB supports multiple APIs: SQL (Core), MongoDB, Cassandra, Gremlin (graph), and Table. You can choose the API that matches your application.

Frequently Asked Questions

What is a document database in Azure?

A document database in Azure is a NoSQL database that stores data as JSON documents. Azure Cosmos DB is the primary service offering a document database model via its SQL (Core) API. Each document is a self-contained unit containing all data for an entity, and the database allows querying using a SQL-like syntax.

How does a document database differ from a relational database?

Document databases are schema-agnostic, denormalized, and do not support native joins. Relational databases have fixed schemas, normalize data, and use joins to combine tables. Document databases are better for flexible schemas and hierarchical data; relational databases are better for complex relationships and strict data integrity.

What is a partition key in Cosmos DB?

A partition key is a field in the document that Cosmos DB uses to distribute data across physical partitions. It determines the logical partition where the document is stored. Choosing a good partition key is crucial for performance: it should have high cardinality (many distinct values) and distribute requests evenly.

What are Request Units (RUs) in Cosmos DB?

Request Units (RUs) are a measure of throughput in Cosmos DB. Each database operation (read, write, query) consumes a certain number of RUs based on factors like document size and index usage. You provision RU/s to guarantee performance. For example, a 1KB point read costs ~1 RU.

Can I query documents without knowing the partition key?

Yes, but it results in a cross-partition query, which fans out to all partitions and consumes more RUs. It is more efficient to include the partition key in the query filter to limit the query to a single partition.

What is the default indexing policy in Cosmos DB?

By default, Cosmos DB indexes all fields in all documents. You can customize the indexing policy to include or exclude specific paths and to add spatial or composite indexes alongside the default range indexes. Excluding unnecessary fields reduces RU consumption and storage.

What consistency levels are available in Cosmos DB?

Cosmos DB offers five consistency levels: strong (linearizability), bounded staleness (reads lag behind writes by at most K versions or a time interval), session (read-your-own-writes guarantees within a client session), consistent prefix (reads never see out-of-order writes), and eventual (no ordering guarantee; replicas converge over time).
