This chapter covers Amazon DocumentDB, a fully managed document database service that is MongoDB-compatible. Understanding DocumentDB is important for the CLF-C02 exam because it represents a key NoSQL database option within AWS, and questions often test your ability to differentiate between relational and non-relational databases, as well as when to use DocumentDB versus other AWS database services. This objective falls under Domain 3: Cloud Technology and Services, which makes up approximately 34% of the scored exam content. By the end of this chapter, you will know what DocumentDB is, how it works, its use cases, and how to answer exam questions about it correctly.
Imagine you run a large library with millions of books, but each book has a different structure: some have chapters, some have sections, some have appendices, and some have illustrations. Storing these in a traditional spreadsheet (like a relational database) would force every book into a rigid table with fixed columns, which is inefficient and complex. Instead, you decide to use a filing cabinet where each book is stored as its own file folder. Inside each folder, you can have any number of labeled documents (like JSON documents) with different fields. This is what Amazon DocumentDB does: it stores each record as a flexible JSON-like document, allowing different structures without predefined schemas. Moreover, DocumentDB automatically replicates each folder across multiple cabinets (Availability Zones) for durability, and it lets you create indexes on the fields you search most often, just like tabbed dividers in a filing cabinet. Behind the scenes, DocumentDB uses a distributed storage layer that scales independently from compute, so you can add more cabinets (storage) or more librarians (compute) without reorganizing the entire library. The key mechanism is that DocumentDB is MongoDB-compatible, meaning it works with the same query language and drivers as MongoDB, so applications built for MongoDB can switch to DocumentDB with minimal changes while gaining the benefits of an AWS managed service, such as automatic backups, patching, and replication.
What is Amazon DocumentDB and the Problem It Solves
Amazon DocumentDB is a fully managed, MongoDB-compatible document database service designed to store, query, and index JSON-like documents. Traditional relational databases (like Amazon RDS for MySQL or PostgreSQL) require a fixed schema: you must define tables, columns, and data types upfront. This works well for structured data with consistent relationships, but it becomes cumbersome when dealing with semi-structured or unstructured data, such as user profiles, product catalogs, content management systems, or IoT sensor readings where each record may have different fields. For example, a product catalog might have some products with a 'color' field, others with 'size', and others with 'dimensions'. In a relational database, you would need to create separate tables or use nullable columns, leading to complex queries and performance issues. DocumentDB solves this by storing each record as a document (similar to a JSON object) within collections, without requiring a predefined schema. Each document can have its own set of fields, and you can add fields on the fly without altering the collection.
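To make the schema flexibility described above concrete, here is a minimal sketch that uses plain Python dictionaries as stand-ins for documents. The `products` list and the `with_field` helper are hypothetical names chosen for illustration; nothing here talks to a real cluster.

```python
# Three product documents in one hypothetical "products" collection.
# Each is a plain dict here; in DocumentDB each would be a JSON-like document.
products = [
    {"_id": 1, "name": "T-shirt", "color": "blue", "size": "M"},
    {"_id": 2, "name": "Desk", "dimensions": {"w_cm": 120, "d_cm": 60}},
    {"_id": 3, "name": "Mug", "color": "white"},
]


def with_field(docs, field):
    """Return the documents that contain `field`, whatever else they hold."""
    return [d for d in docs if field in d]


# A new field can appear on one document without touching the others.
products[2]["dishwasher_safe"] = True

colored = [d["name"] for d in with_field(products, "color")]
print(colored)  # ['T-shirt', 'Mug']
```

Note that no table definition changes when `dishwasher_safe` is added; only the one document that needs the field carries it.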
How Amazon DocumentDB Works – The Mechanism
Amazon DocumentDB is built on a distributed, fault-tolerant, self-healing storage system that replicates data across three Availability Zones (AZs) within an AWS Region. The service consists of two main components: compute (database instances) and storage (a distributed storage volume). Compute instances run the DocumentDB engine, which is compatible with the MongoDB 3.6, 4.0, and 5.0 APIs. The engine accepts MongoDB wire protocol connections, meaning any application that uses MongoDB drivers (e.g., Node.js, Python, Java) can connect to DocumentDB with minimal code changes. Storage is automatically replicated six times across three AZs, providing high durability and availability. Storage scales automatically from 10 GB to 64 TB, and there is no need to provision storage in advance. Each compute instance plays one of two roles: primary (read/write) or replica (read-only). You can have up to 15 replica instances to scale read traffic. Failover is automatic: if the primary instance fails, a replica is promoted to primary, typically within 30 seconds. Backups are continuous, and point-in-time recovery (PITR) is supported within a retention window (1-35 days). DocumentDB also supports encryption at rest and in transit, and integrates with AWS Identity and Access Management (IAM) for authentication.
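The primary/replica roles and automatic failover can be pictured with a toy model. This is only a sketch: the `Cluster` class and instance names are hypothetical, and promoting the first replica in the list is a simplification, not a description of how the service actually selects a replica.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model: one writer plus read-only replicas sharing one storage volume."""

    primary: str
    replicas: list = field(default_factory=list)
    MAX_REPLICAS = 15  # limit stated in the text (class constant, not a field)

    def add_replica(self, name):
        if len(self.replicas) >= self.MAX_REPLICAS:
            raise ValueError("a cluster supports at most 15 replicas")
        self.replicas.append(name)

    def fail_over(self):
        """Promote the first replica; the failed primary leaves the cluster."""
        if not self.replicas:
            raise RuntimeError("no replica available to promote")
        self.primary = self.replicas.pop(0)
        return self.primary


cluster = Cluster(primary="instance-a")
cluster.add_replica("instance-b")
cluster.add_replica("instance-c")
new_primary = cluster.fail_over()
print(new_primary, cluster.replicas)  # instance-b ['instance-c']
```

The model also shows why a single-instance cluster cannot fail over: with an empty replica list there is nothing to promote.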
Key Features and Configurations
MongoDB Compatibility: DocumentDB supports MongoDB operations such as CRUD (create, read, update, delete), aggregation pipelines, indexes (including compound, unique, TTL, and text indexes), and change streams. However, note that DocumentDB does not support all MongoDB features – notably, it does not support MongoDB's $graphLookup, $merge, or $out stages in aggregation pipelines. The exam may test this: DocumentDB is "MongoDB-compatible" not "MongoDB-identical."
Storage: Allocated in 10 GB increments, scaling automatically. Storage is billed per GB-month. In the standard storage configuration, I/O requests are billed separately; the I/O-Optimized configuration removes the per-request I/O charge in exchange for higher instance and storage rates.
Compute Instances: Available in burstable (T3) and memory-optimized (R5, R6g) instance types. T3 instances are suitable for development or low-traffic workloads; R5/R6g are for production. Pricing is per hour based on instance size.
Backups: Automated backups are enabled by default with a 1-day retention, configurable up to 35 days. You can also take manual snapshots that persist until deleted.
Security: Supports VPC isolation, security groups, encryption at rest using AWS KMS, and encryption in transit using TLS. IAM database authentication is available for user authentication.
Monitoring: Integrated with Amazon CloudWatch for metrics like CPUUtilization, DatabaseConnections, ReadIOPS, WriteIOPS, and FreeableMemory. You can set alarms and automate scaling.
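The compatibility caveat in the list above lends itself to a quick pre-migration check. The sketch below scans an aggregation pipeline for the stages this chapter names as unsupported. The `UNSUPPORTED_STAGES` set is taken from the text and should be treated as an assumption to verify against the current compatibility documentation; `unsupported_stages` is a hypothetical helper.

```python
# Stages the text lists as unsupported; treat this set as an assumption
# and confirm it against the current DocumentDB compatibility tables.
UNSUPPORTED_STAGES = {"$graphLookup", "$merge", "$out"}


def unsupported_stages(pipeline):
    """Return the unsupported stage names used in an aggregation pipeline."""
    found = []
    for stage in pipeline:
        # Each pipeline stage is a single-key dict such as {"$match": {...}}.
        for name in stage:
            if name in UNSUPPORTED_STAGES and name not in found:
                found.append(name)
    return found


pipeline = [
    {"$match": {"category": "tools"}},
    {"$group": {"_id": "$category", "total": {"$sum": "$price"}}},
    {"$out": "category_totals"},
]
print(unsupported_stages(pipeline))  # ['$out']
```

An empty result does not prove an application will run unchanged; it only rules out the specific stages listed.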
Comparison to On-Premises MongoDB
Running MongoDB on-premises requires you to manage hardware, software installation, patching, backups, replication, and scaling. You need to set up replica sets or sharded clusters manually. With DocumentDB, AWS handles all of that: automatic replication across three AZs, automatic failover, automated backups, and patching. Additionally, DocumentDB offers storage that scales independently from compute, meaning storage grows without downtime and you can scale compute up or down with minimal disruption. On-premises MongoDB also requires you to provision for peak capacity, whereas DocumentDB allows you to start small and scale as needed. However, DocumentDB is not a drop-in replacement for all MongoDB features: if your application relies on features like $graphLookup, $merge, or certain MongoDB Atlas-specific features, you may need to stick with MongoDB Atlas or self-managed MongoDB. The exam may present scenarios where you need to choose between DocumentDB and Amazon DynamoDB (another NoSQL database); see the comparisons section for details.
When to Use DocumentDB vs Alternatives
Use Amazon DocumentDB when:
Your application is built on MongoDB and you want a managed service without rewriting code.
You need flexible schemas for semi-structured data (e.g., content management, catalogs, user profiles).
You require high availability and durability with automatic replication across AZs.
You want to offload database administration tasks like backups, patching, and failover.
Do NOT use DocumentDB when:
Your workload is purely key-value with simple queries – DynamoDB may be cheaper and faster.
You need complex relational queries (joins, transactions) – use Amazon RDS or Aurora.
You need full MongoDB feature parity, especially advanced aggregation stages – consider MongoDB Atlas on AWS.
You have very large data volumes (beyond 64 TB per cluster) – you may need sharding with self-managed MongoDB.
Pricing Model
Amazon DocumentDB pricing consists of:
Compute: per hour based on instance type (e.g., db.r5.large, db.r6g.large). You pay for running instances.
Storage: per GB-month for data stored, plus per GB-month for backup storage beyond the free backup storage (equal to the size of the cluster).
I/O: in the standard storage configuration, read and write I/O requests are billed per million requests. The I/O-Optimized configuration removes the per-request charge in exchange for higher instance and storage rates, which can be cost-effective for I/O-intensive workloads.
Data transfer: standard AWS data transfer charges apply for data transferred out to the internet or across Regions.
Free tier: DocumentDB is not included in the AWS Free Tier.
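A rough estimate can be assembled from these pricing dimensions. In the sketch below every rate is a made-up placeholder, not an AWS price, and `monthly_cost` is a hypothetical helper. I/O is an optional term that defaults to zero, because whether it is billed per request depends on the storage configuration.

```python
HOURS_PER_MONTH = 730  # common approximation for a month of instance-hours


def monthly_cost(instances, hourly_rate, storage_gb, storage_rate,
                 backup_gb=0.0, backup_rate=0.0,
                 io_millions=0.0, io_rate=0.0):
    """Estimate a monthly bill in dollars from the pricing dimensions."""
    compute = instances * hourly_rate * HOURS_PER_MONTH
    storage = storage_gb * storage_rate
    # Backup storage up to the size of the cluster is free; bill only the excess.
    billable_backup = max(0.0, backup_gb - storage_gb)
    backup = billable_backup * backup_rate
    io = io_millions * io_rate
    return round(compute + storage + backup + io, 2)


# Placeholder rates for illustration only; look up real prices per Region.
estimate = monthly_cost(
    instances=3, hourly_rate=0.25,
    storage_gb=500, storage_rate=0.10,
    backup_gb=600, backup_rate=0.02,
)
print(estimate)
```

With these placeholder numbers compute dominates the total, which is why right-sizing instances usually matters more than trimming storage.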
Limits and Defaults
Maximum storage per cluster: 64 TB.
Maximum number of instances per cluster: 16 (1 primary + up to 15 replicas).
Maximum connections: depends on instance size (e.g., db.r5.large: 1,600 connections).
Backup retention: 1-35 days.
Minimum storage increment: 10 GB.
Supported MongoDB versions: 3.6, 4.0, 5.0 (check current documentation for latest).
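The limits above can be turned into a simple pre-flight check for a planned cluster. This is a sketch that hard-codes the values quoted in this chapter; `limit_violations` is a hypothetical helper, and the limits themselves should be confirmed against current AWS documentation.

```python
# Limits as quoted in this chapter; verify against current AWS quotas.
MAX_STORAGE_TB = 64
MAX_INSTANCES = 16  # 1 primary + up to 15 replicas
MIN_RETENTION_DAYS, MAX_RETENTION_DAYS = 1, 35


def limit_violations(instances, retention_days, expected_storage_tb):
    """Return human-readable problems with a planned cluster, if any."""
    problems = []
    if not 1 <= instances <= MAX_INSTANCES:
        problems.append(f"instances must be 1-{MAX_INSTANCES}, got {instances}")
    if not MIN_RETENTION_DAYS <= retention_days <= MAX_RETENTION_DAYS:
        problems.append(
            f"backup retention must be {MIN_RETENTION_DAYS}-"
            f"{MAX_RETENTION_DAYS} days, got {retention_days}"
        )
    if expected_storage_tb > MAX_STORAGE_TB:
        problems.append(
            f"expected storage {expected_storage_tb} TB exceeds {MAX_STORAGE_TB} TB"
        )
    return problems


plan_issues = limit_violations(instances=17, retention_days=40, expected_storage_tb=10)
for issue in plan_issues:
    print(issue)
```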
Create a DocumentDB Cluster
To create a DocumentDB cluster, you use the AWS Management Console, CLI, or CloudFormation. In the console, navigate to Amazon DocumentDB and click 'Create cluster'. You specify a cluster identifier, a master username and password, the number of instances, and an instance class (e.g., db.r5.large). You also select the VPC and a subnet group; DocumentDB requires a subnet group with at least two subnets in different Availability Zones for high availability. Behind the scenes, AWS provisions the instances you request: one primary and, if you ask for more than one instance, replicas that can be placed in different AZs. The cluster is created with a cluster endpoint (for read/write) and a reader endpoint (for load-balanced reads). The storage is automatically allocated starting at 10 GB and scales as you add data. You can also enable encryption and set backup retention. The entire process takes a few minutes. Important default: automated backups are enabled with 1-day retention. You can modify retention later.
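The same steps can be scripted. The sketch below builds request parameters for the boto3 `docdb` client's `create_db_cluster` and `create_db_instance` calls. The identifiers, subnet group, and security group ID are hypothetical placeholders, and `create_cluster` is never called here because it needs boto3 installed and valid AWS credentials.

```python
def cluster_request(cluster_id, username, password, subnet_group,
                    security_group_ids, retention_days=1):
    """Parameters for create_db_cluster (the cluster and its storage volume)."""
    return {
        "DBClusterIdentifier": cluster_id,
        "Engine": "docdb",
        "MasterUsername": username,
        "MasterUserPassword": password,
        "DBSubnetGroupName": subnet_group,
        "VpcSecurityGroupIds": list(security_group_ids),
        "BackupRetentionPeriod": retention_days,
        "StorageEncrypted": True,
    }


def instance_request(cluster_id, index, instance_class="db.r5.large"):
    """Parameters for create_db_instance (one compute instance in the cluster)."""
    return {
        "DBInstanceIdentifier": f"{cluster_id}-{index}",
        "DBInstanceClass": instance_class,
        "Engine": "docdb",
        "DBClusterIdentifier": cluster_id,
    }


def create_cluster(cluster_req, instance_reqs):
    """Send the requests with boto3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders above run without it

    client = boto3.client("docdb")
    client.create_db_cluster(**cluster_req)
    for req in instance_reqs:
        client.create_db_instance(**req)


# Hypothetical names; in practice, load the password from a secrets store.
req = cluster_request("demo-docdb", "admin_user", "change-me",
                      "demo-subnet-group", ["sg-0123456789abcdef0"])
instances = [instance_request("demo-docdb", i) for i in range(2)]
print(req["Engine"], [i["DBInstanceIdentifier"] for i in instances])
```

Separating the cluster request from the instance requests mirrors the split between storage and compute: the cluster owns the volume, and each instance is added to it.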
Connect to the Cluster
After the cluster is available, you connect using a MongoDB-compatible driver or a tool like the `mongo` shell. You need the cluster endpoint (e.g., `docdb-2024-12-01-12-34-56.cluster-xxxxxxxxxxxx.us-east-1.docdb.amazonaws.com:27017`). You must connect from within the same VPC, or reach the VPC through a VPN, Direct Connect, or an SSH tunnel via a bastion host, because DocumentDB clusters do not have public endpoints. To connect, you use the master username and password, or set up IAM authentication. Behind the scenes, DocumentDB presents a MongoDB wire protocol interface, so the driver sees it as a standard MongoDB replica set. The connection is encrypted using TLS by default. Important: you must download the Amazon RDS CA certificate bundle to enable TLS. With the legacy `mongo` shell, the command would be: `mongo --ssl --host <cluster-endpoint>:27017 --sslCAFile rds-combined-ca-bundle.pem --username <user> --password <password>`. The newer `mongosh` shell uses `--tls` and `--tlsCAFile` in place of the `--ssl` flags. This step matters for exam scenarios: DocumentDB is reachable only through its VPC and cannot be given a public endpoint.
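Drivers usually take a connection URI rather than shell flags. The sketch below assembles one with the standard library. The option set (`replicaSet=rs0`, `readPreference=secondaryPreferred`, `retryWrites=false`) reflects what AWS connection-string examples commonly show and should be treated as an assumption to check against your cluster's console page. The host and credentials are placeholders, and `docdb_uri` is a hypothetical helper.

```python
from urllib.parse import quote_plus, urlencode


def docdb_uri(host, username, password,
              ca_file="rds-combined-ca-bundle.pem", port=27017):
    """Build a MongoDB connection URI for a DocumentDB cluster endpoint."""
    options = urlencode({
        "tls": "true",                           # TLS is on by default
        "tlsCAFile": ca_file,                    # CA bundle downloaded beforehand
        "replicaSet": "rs0",                     # assumed replica set name
        "readPreference": "secondaryPreferred",  # send reads to replicas
        "retryWrites": "false",                  # assumed driver setting
    })
    # Percent-encode credentials so characters like '@' or '/' are safe.
    user = quote_plus(username)
    secret = quote_plus(password)
    return f"mongodb://{user}:{secret}@{host}:{port}/?{options}"


uri = docdb_uri("example.cluster-xxxxxxxxxxxx.us-east-1.docdb.amazonaws.com",
                "admin_user", "p@ss/word")
print(uri)
```

A driver such as pymongo would accept this string directly, provided the client runs somewhere with network access to the cluster's VPC.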
Create a Database and Collection
Once connected, you can create a database and collection using MongoDB commands. For example: `use mydb` creates or switches to a database called `mydb`. Then `db.createCollection("products")` creates a collection named `products`. Unlike relational databases, you don't need to define a schema – you can immediately insert documents with different fields. Behind the scenes, DocumentDB allocates storage for the collection within the cluster's shared storage volume. The collection is indexed on the default `_id` field automatically. You can create additional indexes using `db.products.createIndex({field: 1})`. DocumentDB supports various index types: single field, compound, unique, TTL (time-to-live), and text indexes. Important: creating indexes on large collections can impact performance, but DocumentDB builds indexes in the background. The exam may test that DocumentDB automatically indexes the `_id` field.
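Why an index helps can be shown without a database. The sketch below models an index as a dictionary from field value to document positions and compares how many documents each approach has to examine. It is an illustration of the idea only, not of how DocumentDB implements indexes; all names are hypothetical.

```python
from collections import defaultdict

CATEGORIES = ["tools", "toys", "books", "garden"]
docs = [{"_id": i, "category": CATEGORIES[i % 4]} for i in range(1000)]


def build_index(documents, field):
    """Map each value of `field` to the positions of documents holding it."""
    index = defaultdict(list)
    for pos, doc in enumerate(documents):
        if field in doc:
            index[doc[field]].append(pos)
    return index


def find_with_scan(documents, field, value):
    """Collection scan: every document is examined."""
    hits = [d for d in documents if d.get(field) == value]
    return hits, len(documents)


def find_with_index(documents, index, value):
    """Index lookup: only matching positions are visited."""
    positions = index.get(value, [])
    return [documents[p] for p in positions], len(positions)


category_index = build_index(docs, "category")
scan_hits, scan_examined = find_with_scan(docs, "category", "tools")
index_hits, index_examined = find_with_index(docs, category_index, "tools")
print(scan_examined, index_examined)  # 1000 250
```

Both paths return the same documents; the difference is the work done to find them, which is what grows painful on large collections.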
Insert and Query Documents
Inserting documents is straightforward. For example: `db.products.insertOne({name: "Widget", price: 19.99, category: "tools", inStock: true})`. You can also insert multiple documents with `insertMany()`. Each document is assigned a unique `_id` (ObjectId) if not provided. To query, you use `find()`: `db.products.find({category: "tools"})`. You can use operators like `$gt`, `$lt`, `$regex`, etc. DocumentDB supports the MongoDB query language, including aggregation pipelines (but not all stages). Behind the scenes, DocumentDB uses indexes to speed up queries. If no index exists for the query filter, a collection scan occurs, which is slow on large collections. The exam may ask about indexing strategies. Important: DocumentDB has no SQL-style joins. The `$lookup` aggregation stage provides a left outer join between collections, but document models usually embed related data or store references and resolve them in the application.
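The filter semantics of `find()` can be mimicked in a few lines. The sketch below implements equality plus the `$gt`, `$lt`, and `$regex` operators mentioned above over in-memory dictionaries. It is a teaching model of a small subset of the query language, not the engine's behavior in every edge case; `matches` and `find` are hypothetical helpers.

```python
import re


def matches(doc, query):
    """Return True if `doc` satisfies a small subset of the query language."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict):  # operator form, e.g. {"$gt": 10}
            for op, arg in cond.items():
                if op == "$gt":
                    ok = value is not None and value > arg
                elif op == "$lt":
                    ok = value is not None and value < arg
                elif op == "$regex":
                    ok = isinstance(value, str) and re.search(arg, value) is not None
                else:
                    raise ValueError(f"operator {op} is not modelled here")
                if not ok:
                    return False
        elif value != cond:  # plain equality
            return False
    return True


def find(docs, query):
    """Filter documents the way a collection scan would."""
    return [d for d in docs if matches(d, query)]


products = [
    {"name": "Widget", "price": 19.99, "category": "tools", "inStock": True},
    {"name": "Hammer", "price": 9.5, "category": "tools"},
    {"name": "Puzzle", "price": 14.0, "category": "toys"},
]
pricey_tools = find(products, {"category": "tools", "price": {"$gt": 10}})
print([p["name"] for p in pricey_tools])  # ['Widget']
```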
Monitor and Scale
Monitoring is done via Amazon CloudWatch. You can view metrics like CPUUtilization, DatabaseConnections, ReadIOPS, WriteIOPS, FreeableMemory, and others. You can set CloudWatch alarms to notify you of high utilization. To scale read capacity, you add replica instances. In the console, you can modify the cluster to add more replicas (up to 15 total). Scaling compute up (e.g., from db.r5.large to db.r5.xlarge) requires a modification that may cause brief downtime (typically a few minutes) as the instance is replaced. Storage scales automatically and without downtime, so you don't need to provision it. Behind the scenes, when you add a replica, DocumentDB creates a new compute instance that attaches to the same storage volume. Because replicas read from the shared volume, no data has to be copied to the new instance, although reads from replicas are eventually consistent and can lag the primary slightly. For scaling compute, DocumentDB performs a DNS update to point to the new instance. The exam may test that you can add replicas to improve read performance, but you cannot write to replicas.
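A CloudWatch alarm on one of these metrics can be described as a parameter dictionary for the boto3 CloudWatch client's `put_metric_alarm` call. The sketch below only builds the request. The instance identifier, threshold, and evaluation settings are illustrative assumptions, and `cpu_alarm_request` is a hypothetical helper.

```python
def cpu_alarm_request(instance_id, threshold_percent=80.0, sns_topic_arn=None):
    """Parameters for put_metric_alarm on a DocumentDB instance's CPU."""
    request = {
        "AlarmName": f"{instance_id}-high-cpu",
        "Namespace": "AWS/DocDB",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint
        "EvaluationPeriods": 3,   # three consecutive breaches trigger the alarm
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }
    if sns_topic_arn:
        # Optional notification target, e.g. an SNS topic you already own.
        request["AlarmActions"] = [sns_topic_arn]
    return request


alarm = cpu_alarm_request("demo-docdb-0")
print(alarm["AlarmName"], alarm["Threshold"])
# To create it for real (needs boto3 and credentials):
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

An alarm like this notifies you; adding a replica or resizing an instance in response is still a separate action.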
Scenario 1: Content Management System (CMS) for a Media Company
A media company runs a CMS that stores articles, videos, and images. Each article has a different structure: some have multiple authors, tags, comments, and custom metadata. Using a relational database would require a complex schema with many tables and joins, slowing down development. The company migrates to DocumentDB, storing each article as a JSON document. This allows the development team to add new fields (e.g., 'videoUrl', 'sponsored') without schema changes. The CMS uses MongoDB drivers, so migration requires minimal code changes. DocumentDB's automatic replication across three AZs ensures high availability: if one AZ fails, the CMS continues to serve read traffic from replicas. Cost-wise, the company pays for compute instances (db.r5.large for production, db.t3.medium for staging) and storage (about 500 GB). They save on administrative overhead because AWS handles backups and patching. A common mistake: they initially used a single instance without replicas, causing read performance issues during traffic spikes. After adding two read replicas, read latency dropped significantly. The exam may present a scenario where a CMS needs a flexible schema – DocumentDB is the correct answer.
Scenario 2: Product Catalog for an E-commerce Platform
An e-commerce startup needs a product catalog where each product can have varying attributes: electronics have 'warranty', clothing has 'size' and 'color', books have 'ISBN' and 'author'. They choose DocumentDB for its schema flexibility. They store each product as a document with an 'attributes' subdocument. They create indexes on frequently queried fields like 'category', 'price', and 'rating'. The catalog is read-heavy, so they add three read replicas to distribute read traffic. DocumentDB's continuous backups allow point-in-time recovery in case of accidental deletions. A potential pitfall: they initially did not create indexes on 'price' and 'rating', causing slow queries during flash sales. After adding indexes, query performance improved 10x. Another pitfall: they first ran a single large instance instead of multiple smaller replicas, which was more expensive and less resilient. The exam may test the need for indexes on query fields.
Scenario 3: IoT Sensor Data Storage
An industrial IoT company collects sensor readings from thousands of devices. Each reading includes a device ID, timestamp, and various measurements (temperature, humidity, pressure). Some sensors report additional fields like 'vibration' or 'voltage'. DocumentDB's flexible schema allows storing each reading as a document without predefined fields. They use TTL indexes to automatically delete data older than 30 days, reducing storage costs. They also use change streams to trigger real-time analytics. A common misconfiguration: they did not set up proper indexes on device ID and timestamp, causing slow queries. After adding a compound index on (deviceId, timestamp), queries became efficient. The exam may ask about TTL indexes and change streams as features of DocumentDB.
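The TTL and compound-index ideas in this scenario can be sketched in memory. Below, `is_expired` applies the 30-day rule a TTL index would enforce, and sorting by `(deviceId, timestamp)` mirrors the ordering a compound index on those fields provides. The sample readings and helper names are hypothetical, and the real service expires documents in the background rather than at query time.

```python
from datetime import datetime, timedelta, timezone

TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days, as in the scenario


def is_expired(reading, now, ttl_seconds=TTL_SECONDS):
    """True once the reading's timestamp plus the TTL has passed."""
    return reading["timestamp"] + timedelta(seconds=ttl_seconds) <= now


def latest_per_device(readings):
    """Pick each device's newest reading using (deviceId, timestamp) order."""
    ordered = sorted(readings, key=lambda r: (r["deviceId"], r["timestamp"]))
    latest = {}
    for reading in ordered:
        latest[reading["deviceId"]] = reading  # later entries overwrite earlier
    return latest


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
readings = [
    {"deviceId": "pump-1", "timestamp": datetime(2024, 6, 29, tzinfo=timezone.utc),
     "temperature": 71.2},
    {"deviceId": "pump-1", "timestamp": datetime(2024, 5, 1, tzinfo=timezone.utc),
     "temperature": 69.8},
    {"deviceId": "pump-2", "timestamp": datetime(2024, 6, 15, tzinfo=timezone.utc),
     "temperature": 64.0, "vibration": 0.03},  # extra field on one sensor only
]
live = [r for r in readings if not is_expired(r, now)]
newest = latest_per_device(live)
print(len(live), sorted(newest))  # 2 ['pump-1', 'pump-2']
```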
What CLF-C02 Tests on Amazon DocumentDB
The CLF-C02 exam covers Amazon DocumentDB under Domain 3: Cloud Technology and Services, specifically Task Statement 3.4, which covers identifying AWS database services. The exam will not ask you to write queries or perform deep technical operations. Instead, it tests your ability to distinguish DocumentDB from other databases based on characteristics like schema flexibility, MongoDB compatibility, and use cases. You should know:
DocumentDB is a document database (NoSQL) that stores JSON-like documents.
It is MongoDB-compatible (uses MongoDB 3.6/4.0/5.0 API).
It is fully managed with automatic replication across three AZs, automatic backups, and patching.
Use cases: content management, catalogs, user profiles, and any application that needs flexible schemas.
Not suitable for: relational data (use RDS/Aurora), simple key-value (use DynamoDB), or graph data (use Neptune).
Common Wrong Answers and Why Candidates Choose Them
Choosing Amazon DynamoDB for a MongoDB migration scenario: Candidates often pick DynamoDB because it is also a NoSQL database. However, DynamoDB is a key-value and document database that is not MongoDB-compatible. If the question states the application uses MongoDB drivers, DocumentDB is the correct answer because it is wire-protocol compatible with MongoDB, enabling minimal code changes.
Selecting Amazon RDS for MySQL when the data is semi-structured: Candidates may think RDS can handle JSON via JSON data type, but RDS requires a fixed schema and is not optimized for document storage. DocumentDB is designed for flexible schemas.
Thinking DocumentDB is a relational database: Some candidates confuse the term "document" with a document in a relational sense. DocumentDB is NoSQL, not relational.
Assuming DocumentDB supports all MongoDB features: The exam may include a distractor that says DocumentDB fully supports MongoDB aggregation stages like $graphLookup. The correct answer is that DocumentDB is MongoDB-compatible but does not support all features; it lacks $graphLookup, $merge, and $out.
Specific Terms and Values to Remember
MongoDB compatibility: versions 3.6, 4.0, 5.0.
Storage: automatically scales from 10 GB to 64 TB.
Replicas: up to 15 read replicas.
Backup retention: 1-35 days.
Encryption: at rest (KMS) and in transit (TLS).
Endpoints: cluster endpoint (read/write), reader endpoint (load-balanced reads).
Not publicly accessible by default: must be in a VPC.
Tricky Distinctions
DocumentDB vs DynamoDB: Both are NoSQL, but DocumentDB is document-oriented and MongoDB-compatible, while DynamoDB is key-value and document with a different API. DynamoDB has single-digit millisecond latency at any scale, while DocumentDB is more suited for complex queries and aggregations.
DocumentDB vs Amazon RDS for MySQL with JSON: RDS can store JSON in a column but lacks native document database features such as the MongoDB query language, drivers, and aggregation pipeline.
DocumentDB vs Amazon Aurora: Aurora is a relational database (MySQL/PostgreSQL-compatible), not a document database.
Decision Rule for Exam Questions
When you see a question about storing semi-structured data and the options include DocumentDB, DynamoDB, RDS, and Neptune, ask yourself:
Is the application already using MongoDB? → DocumentDB.
Does the data require complex queries and aggregations on nested fields? → DocumentDB.
Is the use case simple key-value lookups with low latency? → DynamoDB.
Does the data have strict relationships and require joins? → RDS or Aurora.
Is the data graph-like (relationships between nodes)? → Neptune.
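The decision rule above can be written as a small function, checked in the same order as the questions. This is a study aid under the chapter's own simplifications, not an architecture tool; `choose_database` and its parameter names are hypothetical.

```python
def choose_database(uses_mongodb=False, nested_aggregations=False,
                    key_value_low_latency=False, needs_joins=False,
                    graph_like=False):
    """Apply the chapter's decision rule, first matching question wins."""
    if uses_mongodb or nested_aggregations:
        return "Amazon DocumentDB"
    if key_value_low_latency:
        return "Amazon DynamoDB"
    if needs_joins:
        return "Amazon RDS or Amazon Aurora"
    if graph_like:
        return "Amazon Neptune"
    return "undetermined"  # the scenario needs more detail


print(choose_database(uses_mongodb=True))           # Amazon DocumentDB
print(choose_database(key_value_low_latency=True))  # Amazon DynamoDB
print(choose_database(graph_like=True))             # Amazon Neptune
```

Because the MongoDB question is checked first, an existing MongoDB application maps to DocumentDB even when other traits are also present, which matches how exam scenarios are usually framed.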
Amazon DocumentDB is a fully managed, MongoDB-compatible document database service.
It stores data as JSON-like documents, allowing flexible schemas without predefined tables.
DocumentDB automatically replicates data across three Availability Zones and supports up to 15 read replicas.
Storage scales automatically from 10 GB to 64 TB; I/O is billed per request unless you choose the I/O-Optimized configuration.
DocumentDB is not identical to MongoDB; it lacks some aggregation stages like $graphLookup and $merge.
Use DocumentDB for content management, catalogs, user profiles, and any MongoDB-based application.
Do not use DocumentDB for relational data (use RDS/Aurora) or simple key-value workloads (use DynamoDB).
Backup retention is configurable from 1 to 35 days, and point-in-time recovery is supported.
DocumentDB is not publicly accessible by default; it must be deployed in a VPC.
These come up on the exam all the time. Here's how to tell them apart.
Amazon DocumentDB
Document database – stores JSON-like documents
MongoDB-compatible (uses MongoDB API)
Supports complex queries, aggregations, and indexes
Automatic replication across 3 AZs
Best for applications that need a flexible schema with complex querying
Amazon DynamoDB
Key-value and document database
Proprietary API (not MongoDB-compatible)
Single-digit millisecond latency at any scale
Data replicated across 3 AZs automatically
Best for high-traffic, low-latency key-value lookups and simple queries
Mistake
Amazon DocumentDB is a relational database.
Correct
DocumentDB is a NoSQL document database. It stores data as JSON-like documents, not in tables with rows and columns. It does not support SQL queries or joins; instead, it uses MongoDB's query language.
Mistake
DocumentDB is identical to MongoDB Atlas.
Correct
DocumentDB is MongoDB-compatible but not identical. It supports a subset of MongoDB features (e.g., it lacks $graphLookup, $merge, $out). It is a managed service on AWS, whereas MongoDB Atlas is a multi-cloud managed MongoDB service.
Mistake
DocumentDB can be accessed from the internet by default.
Correct
DocumentDB clusters are created in a VPC and do not have public endpoints. To connect from outside the VPC, you need to use a bastion host (for example, with an SSH tunnel), a VPN, or AWS Direct Connect.
Mistake
You need to provision storage for DocumentDB in advance.
Correct
DocumentDB storage automatically scales from 10 GB up to 64 TB as you add data. You do not need to provision storage upfront; you only pay for what you use.
Mistake
DocumentDB supports all MongoDB aggregation pipeline stages.
Correct
DocumentDB supports many aggregation stages but not all. For example, $graphLookup, $merge, and $out are not supported. The exam may test this limitation.
No, Amazon DocumentDB is a NoSQL document database. It stores data as JSON-like documents, not in tables with rows and columns. It is designed for semi-structured data and does not support SQL or joins. Instead, it uses MongoDB's query language. On the exam, if you see a question about storing data with varying attributes, DocumentDB is a strong candidate, but only if the application uses MongoDB or needs a document model.
Yes, DocumentDB is MongoDB-compatible, meaning it supports the MongoDB wire protocol and many MongoDB features. You can connect your existing MongoDB application to DocumentDB with minimal code changes. However, DocumentDB does not support all MongoDB features (e.g., $graphLookup). For the exam, remember that DocumentDB is a managed alternative to self-hosted MongoDB, and it is ideal for migrating MongoDB workloads to AWS without rewriting the application.
DocumentDB pricing includes compute (per hour based on instance type), storage (per GB-month), I/O (per million requests in the standard configuration), and backup storage (per GB-month beyond free backup storage). The I/O-Optimized configuration removes the per-request I/O charge in exchange for higher instance and storage rates. You pay for what you use, and storage scales automatically. Note that DocumentDB is not part of the AWS Free Tier. On the exam, you may need to compare costs at a high level: remember that compute, storage, I/O, and backups are the billing dimensions.
Both are NoSQL databases, but DocumentDB is a document database with MongoDB compatibility, supporting complex queries and aggregations. DynamoDB is a key-value and document database with a proprietary API, offering single-digit millisecond latency at any scale. DynamoDB is better for simple key-value lookups and high-traffic applications, while DocumentDB is better for applications that need a flexible schema and complex querying, especially if they already use MongoDB.
Yes, DocumentDB supports automatic failover. A cluster has one primary instance and can have up to 15 read replicas. If the primary fails, a replica is automatically promoted to primary, typically within 30 seconds. The cluster endpoint automatically points to the new primary. This is a key exam point: DocumentDB provides high availability with automatic failover across AZs, provided the cluster has at least one replica.
DocumentDB clusters are created in a VPC and do not have public endpoints. To connect from outside the VPC, you need to set up a bastion host (for example, with an SSH tunnel), a VPN connection, or AWS Direct Connect. The exam may test that DocumentDB is reachable only through its VPC.
DocumentDB automatically takes continuous backups and stores them in Amazon S3. You can configure the backup retention period from 1 to 35 days. You can also take manual snapshots that persist until you delete them. Point-in-time recovery (PITR) is supported within the retention window. The exam may ask about backup retention limits and PITR capability.
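Whether a given moment is recoverable follows directly from the retention window. The sketch below checks a target time against that window. It ignores the short delay before the most recent changes become restorable, and `can_restore_to` is a hypothetical helper rather than an AWS API.

```python
from datetime import datetime, timedelta, timezone


def can_restore_to(target, now, retention_days):
    """True if `target` falls inside the point-in-time recovery window."""
    if not 1 <= retention_days <= 35:
        raise ValueError("retention must be between 1 and 35 days")
    earliest = now - timedelta(days=retention_days)
    return earliest <= target <= now


now = datetime(2024, 6, 30, 12, 0, tzinfo=timezone.utc)
five_days_ago = datetime(2024, 6, 25, 12, 0, tzinfo=timezone.utc)
ten_days_ago = datetime(2024, 6, 20, 12, 0, tzinfo=timezone.utc)

print(can_restore_to(five_days_ago, now, retention_days=7))  # True
print(can_restore_to(ten_days_ago, now, retention_days=7))   # False
```

Manual snapshots sit outside this calculation: they persist until deleted, so they are the way to keep a restore point older than 35 days.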