
Column-Family Databases (Cassandra API)

This chapter covers column-family databases, specifically the Cassandra API in Azure Cosmos DB, a key non-relational data storage option. For the DP-900 exam, understanding column-family databases is part of objective 1.3, which covers core data concepts including NoSQL data stores. Approximately 10-15% of exam questions touch on NoSQL data stores, with column-family databases being one of the four main types you must differentiate. This chapter will explain what column-family databases are, how they work internally, their key components, and how they compare to other data models, ensuring you can answer exam questions confidently.

Updated May 31, 2026

Library Card Catalog with Multiple Filing Systems

Imagine a library that stores books in multiple rooms, each with its own card catalog. Each card catalog is organized by one specific key, such as author name or book title, but not both. A patron must know the exact key for the catalog they are using. In the author catalog, for example, they look up the author's name and find a card with the book's title and location (room and shelf). If they want to search by a different key, such as genre, they need a different catalog built for that key. The library is designed for fast lookups by each catalog's key, so it cannot efficiently answer 'find all books by an author' from a catalog organized by title. When the library adds a new room (node), it moves a share of the cards and books there instead of reorganizing everything. This mirrors column-family databases like Cassandra, where data is stored in column families (tables) and each row is identified by a primary key whose first part is the partition key. The row's columns can vary, and queries are efficient only when they use the partition key. The system is distributed, fault-tolerant, and scales horizontally by adding nodes, but it sacrifices complex query capabilities for performance and availability.

How It Actually Works

What is a Column-Family Database?

A column-family database is a type of NoSQL database that stores each row under a row key and groups that row's columns into column families. Unlike relational databases, where each row in a table has the same columns, column-family databases allow each row to have a different set of columns. This makes them well suited to sparse data, large-scale distributed systems, and high write throughput. The most prominent example is Apache Cassandra, and in Azure, the Cassandra API in Azure Cosmos DB provides a managed Cassandra-compatible service.

How It Works Internally

In a column-family database, data is stored in a distributed hash table across multiple nodes. Each node is responsible for a range of data based on the partition key. The partition key is used to determine which node stores the data. Cassandra uses a consistent hashing algorithm to distribute data evenly. When a write occurs, the data is first written to a commit log (for durability) and then to an in-memory structure called a memtable. Periodically, the memtable is flushed to disk as an SSTable (Sorted String Table). SSTables are immutable; updates create new SSTables, and old ones are compacted later.
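The mapping from partition key to node can be sketched with a toy hash ring. This is a minimal illustration, not Cassandra's implementation: the `Ring` class, the node names, and the choice of MD5 with eight virtual nodes per node are assumptions made for the example (Cassandra's default partitioner uses a Murmur3 hash).

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Hash a key to a position on the ring (MD5 is a stand-in here)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=8):
        # Each node owns several positions on the ring.
        self._ring = sorted(
            (token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [t for t, _ in self._ring]

    def owner(self, partition_key: str) -> str:
        # The first ring position at or after the key's token owns the key;
        # wrap around to the start if the token is past the last position.
        idx = bisect.bisect_left(self._tokens, token(partition_key)) % len(self._ring)
        return self._ring[idx][1]


ring = Ring(["node-a", "node-b", "node-c"])
owners = {key: ring.owner(key) for key in ("user-1", "user-2", "user-3", "user-4")}
```

The useful property of this layout is that adding a node only moves the keys that the new node takes over; every other key keeps its owner.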

Key Components

Partition Key: Determines the distribution of data across nodes. It is the first part of the primary key.

Clustering Key: Determines the sort order of data within a partition. It is optional but critical for range queries.

Column Family: Equivalent to a table in relational databases. Contains a set of columns that are grouped together.

Row Key: The primary key of a row, consisting of partition key and optional clustering key.

Column: A key-value pair within a row. Each column can have a different name and value.

Cassandra Query Language (CQL): A SQL-like language used to interact with Cassandra. Syntax is similar but with limitations (e.g., no JOINs).

Default Values and Timers

Consistency Level: Determines how many replicas must respond for a read or write to be considered successful. The client chooses it per request; cqlsh defaults to ONE and many drivers default to LOCAL_ONE. Other levels include QUORUM (a majority of replicas), LOCAL_QUORUM, and ALL.

Replication Factor: Number of copies of each piece of data across nodes. It is set per keyspace; 3 is the typical production value.

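Write Consistency: Set by the client per request (cqlsh defaults to ONE). ONE favors availability and low latency; QUORUM on both reads and writes gives strong consistency while tolerating a failed replica; ALL requires every replica to respond.

Read Consistency: Also set by the client (cqlsh defaults to ONE) and can be overridden per query.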

Tombstone: A marker for deleted data. Tombstones have a configurable time-to-live (default 10 days via gc_grace_seconds).

Compaction: Process of merging SSTables. Default compaction strategy is SizeTieredCompactionStrategy (STCS).
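The relationship between replication factor and consistency level comes down to simple arithmetic, sketched below for Apache Cassandra. The function names are hypothetical; the rule they encode is that a quorum is a majority of replicas, and a read is guaranteed to overlap the latest write when the replicas read plus the replicas written exceed the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that make up a majority."""
    return replication_factor // 2 + 1


def read_overlaps_latest_write(read_replicas: int, write_replicas: int,
                               replication_factor: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)  # majority of 3 replicas

quorum_both = read_overlaps_latest_write(q, q, rf)   # QUORUM reads + QUORUM writes
one_both = read_overlaps_latest_write(1, 1, rf)      # ONE reads + ONE writes
```

With a replication factor of 3, QUORUM on both sides overlaps (2 + 2 > 3), while ONE on both sides does not (1 + 1 is not greater than 3), which is why ONE/ONE can return stale data.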

Configuration and Verification Commands

In Azure Cosmos DB Cassandra API, you configure the account through the Azure portal or CLI. For example, to create a Cassandra keyspace and table:

CREATE KEYSPACE mykeyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3};

CREATE TABLE mykeyspace.users (
    user_id UUID PRIMARY KEY,
    name TEXT,
    email TEXT
);

To verify data:

SELECT * FROM mykeyspace.users WHERE user_id = ?;

Interaction with Related Technologies

Column-family databases are often used alongside other Azure data services. For example, you might use Azure Data Factory to ingest data into Cosmos DB Cassandra API, or use Azure Stream Analytics to process real-time data and write to Cassandra. They are also commonly used with Apache Spark for analytics via the Spark Cassandra connector. The Cassandra API is compatible with existing Cassandra drivers, so applications written for Apache Cassandra can be migrated with minimal changes.

Performance Considerations

Write Throughput: Cassandra is optimized for high write throughput. Write operations are append-only and are written to the commit log and memtable before being flushed.

Read Performance: Reads are fast when they use the partition key. Queries that omit the partition key force a scan across partitions and need ALLOW FILTERING, which is inefficient and discouraged.

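Data Distribution: Choose a high-cardinality partition key that spreads writes evenly to avoid hot spots. A common mistake is partitioning by a coarse time value (such as the current date or hour), which sends all current writes to a single partition and therefore to the same few nodes.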

Secondary Indexes: Limited support. Secondary indexes are local to each node and are not as efficient as primary key queries.

Exam-Relevant Details

The DP-900 exam expects you to know that column-family databases are best for write-heavy workloads, time-series data, IoT data, and recommendation engines.

You must understand that the Cassandra API in Cosmos DB supports CQL and is wire-protocol compatible with Apache Cassandra.

The exam may test that column-family databases are schema-optional — each row can have different columns.

A common trap: thinking column-family databases store data in columns like a columnar database (e.g., Parquet). They do not; they store data by row key but group columns into families.

Another trap: confusing column-family databases with wide-column stores. They are essentially the same thing.

Summary of Key Values

Typical replication factor: 3 (set per keyspace)

Consistency level: chosen per request; QUORUM means a majority of replicas (cqlsh defaults to ONE)

Default compaction: SizeTieredCompactionStrategy

Tombstone grace period: 10 days (gc_grace_seconds = 864000)

Cassandra API is part of Azure Cosmos DB, which provides a fully managed, globally distributed database service.

Walk-Through

Step 1: Define Keyspace and Replication

First, you create a keyspace (like a database) and define the replication strategy and factor. In Apache Cassandra, the usual production strategy is 'NetworkTopologyStrategy' with a replication factor set per data center. For example, `CREATE KEYSPACE mykeyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3};`. This determines how many copies of data are kept across nodes for fault tolerance. The exam may test that replication factor is set at keyspace level in Cassandra; in Azure Cosmos DB, replication itself is configured at the account level.

Step 2: Create Table with Primary Key

Next, you define a table (column family) with columns and a primary key. The primary key consists of a partition key and optional clustering keys. Example: `CREATE TABLE mykeyspace.users (user_id UUID, name TEXT, email TEXT, PRIMARY KEY (user_id));`. The partition key determines data distribution; clustering keys determine sort order within a partition. The exam expects you to know that queries must include the partition key for efficiency.
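How the two parts of the primary key divide the work can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the `Table` class and its methods are hypothetical, the partition key selects a bucket of rows, and the clustering key keeps rows sorted inside that bucket so that a range query becomes a slice.

```python
import bisect
from collections import defaultdict


class Table:
    """Toy model: partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)  # list of (clustering_key, row)

    def insert(self, partition_key, clustering_key, row):
        rows = self._partitions[partition_key]
        # (clustering_key,) sorts just before (clustering_key, row),
        # so this finds the slot for the key without comparing rows.
        idx = bisect.bisect_left(rows, (clustering_key,))
        if idx < len(rows) and rows[idx][0] == clustering_key:
            rows[idx] = (clustering_key, row)  # same primary key: overwrite (upsert)
        else:
            rows.insert(idx, (clustering_key, row))

    def range_query(self, partition_key, start, end):
        """Rows in one partition with start <= clustering key < end."""
        rows = self._partitions.get(partition_key, [])
        lo = bisect.bisect_left(rows, (start,))
        hi = bisect.bisect_left(rows, (end,))
        return [row for _, row in rows[lo:hi]]


table = Table()
table.insert("device-1", 30, {"temp": 22.0})
table.insert("device-1", 10, {"temp": 20.5})
table.insert("device-1", 20, {"temp": 21.0})
recent = table.range_query("device-1", 10, 30)
```

Because the query names the partition first, only one bucket is touched; a query without the partition key would have to visit every bucket.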

Step 3: Insert Data with Write Consistency

Insert rows using CQL INSERT statements. For example, `INSERT INTO mykeyspace.users (user_id, name, email) VALUES (uuid(), 'Alice', 'alice@example.com');`. Current CQL has no per-statement consistency clause; the consistency level is set by the client, for example with the `CONSISTENCY QUORUM` command in cqlsh or a per-statement setting in the driver. The write is first appended to the commit log (durability) and then applied to the memtable. Once the memtable is full, it is flushed to an SSTable on disk. The exam may ask about the write path: commit log -> memtable -> SSTable.
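The write path described above can be modeled in a few lines. This is a simplified sketch, not Cassandra's storage engine: the `Node` class, the tiny `memtable_limit`, and the in-memory stand-ins for the commit log and SSTables are assumptions made for illustration (a real node also recycles commit log segments after a flush).

```python
class Node:
    """Toy write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only log, written first for durability
        self.memtable = {}     # in-memory table holding the latest value per key
        self.sstables = []     # immutable sorted snapshots, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record the write durably
        self.memtable[key] = value            # 2. apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush when the memtable is full

    def flush(self):
        if self.memtable:
            # An SSTable is written once, sorted by key, and never modified.
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        if key in self.memtable:              # newest data lives in memory
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # then newest SSTable first
            for k, v in sstable:
                if k == key:
                    return v
        return None


node = Node(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)   # memtable is full, so it is flushed to an SSTable
node.write("a", 3)   # the update lands in a fresh memtable, not in the old SSTable
```

Note that the update to "a" never touches the flushed SSTable; the newer value simply shadows the older one until compaction merges them.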

Step 4: Query Data Using Partition Key

To retrieve data, you must use the partition key in the WHERE clause. Example: `SELECT * FROM mykeyspace.users WHERE user_id = ?;`. Queries without partition key require 'ALLOW FILTERING', which is inefficient and should be avoided in production. The read path involves checking the memtable and SSTables. Bloom filters help quickly determine if data exists in an SSTable. The exam tests that partition key is mandatory for efficient queries.
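The role of a Bloom filter on the read path can be shown with a minimal version. The sketch below is illustrative only: the `BloomFilter` class, its 256-bit size, and the SHA-256 based hashing are assumptions, not Cassandra's actual filter. What it demonstrates is the key guarantee: a "no" answer is always correct, so an SSTable can be skipped without reading it, while a "yes" answer may occasionally be a false positive.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, size_bits=256, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: str):
        # Derive several bit positions per key by salting the hash input.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


sstable_filter = BloomFilter()
for stored_key in ("user-1", "user-2", "user-3"):
    sstable_filter.add(stored_key)

empty_filter = BloomFilter()
```

A read for "user-2" would consult `sstable_filter.might_contain("user-2")` first and open the SSTable only if the answer is True.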

Step 5: Manage Compaction and Tombstones

Over time, SSTables accumulate. Compaction merges multiple SSTables into one, removing deleted data (tombstones) and optimizing storage. The default compaction strategy is SizeTieredCompactionStrategy, which merges SSTables of similar size. Tombstones are markers for deleted data and have a grace period (default 10 days) before being removed during compaction. The exam may ask about tombstone management and compaction strategies.
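Compaction and tombstone expiry can be sketched as a merge that keeps the newest write per key and drops tombstones older than the grace period. This is a simplified model under stated assumptions: the `compact` function, the `TOMBSTONE` sentinel, and the dict-based SSTables are hypothetical, and real compaction applies extra safety checks before purging a tombstone.

```python
TOMBSTONE = object()          # sentinel marking a deleted value
GC_GRACE_SECONDS = 864_000    # 10 days, the default gc_grace_seconds


def compact(sstables, now):
    """Merge SSTables (dicts of key -> (write_time, value)) into one.

    The newest write per key wins; tombstones past the grace period are dropped.
    """
    merged = {}
    for sstable in sstables:
        for key, (write_time, value) in sstable.items():
            if key not in merged or write_time >= merged[key][0]:
                merged[key] = (write_time, value)
    return {
        key: (write_time, value)
        for key, (write_time, value) in merged.items()
        if not (value is TOMBSTONE and now - write_time > GC_GRACE_SECONDS)
    }


older = {"a": (0, "v1"), "b": (0, "v1")}
newer = {"a": (100, TOMBSTONE), "b": (50, "v2")}

within_grace = compact([older, newer], now=200)
after_grace = compact([older, newer], now=100 + GC_GRACE_SECONDS + 1)
```

Within the grace period the tombstone for "a" survives the merge so that replicas that missed the delete can still learn about it; once the period has passed, both the tombstone and the data it shadowed are gone.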

What This Looks Like on the Job

Enterprise Scenario 1: IoT Time-Series Data

A manufacturing company uses sensors to collect temperature, pressure, and vibration data from machines every second. They need to store this data for real-time monitoring and historical analysis. They choose Azure Cosmos DB Cassandra API because it handles high write throughput (millions of writes per second) and can scale horizontally. They create a keyspace with replication factor 3 across two Azure regions for disaster recovery. The table schema uses a partition key of device_id and a clustering key of timestamp to allow range queries for a specific device. In production, they configure the write consistency to LOCAL_QUORUM for low latency and read consistency to ONE for fast reads. They monitor compaction to ensure SSTables don't grow unbounded. A common mistake is partitioning by a time value such as the current date, which sends every device's current writes to one partition (a hot spot); partitioning by device_id spreads the load. With a poorly chosen key, nodes become overloaded and writes time out.

Enterprise Scenario 2: User Profile Store for a Social Media App

A social media platform stores user profiles with varying attributes (e.g., some users have a phone number, some have a bio, etc.). They use Cassandra API because it allows flexible schema. Each user has a unique user_id (partition key). They store profile data in a single column family with columns like name, email, phone, bio, etc. For high availability, they use QUORUM consistency for reads and writes. They also create a secondary index on email for login lookups, but they are aware that secondary indexes perform poorly at scale. They mitigate by using a separate table mapping email to user_id. The system handles millions of users with low latency. A pitfall is forgetting that secondary indexes are local and not efficient for large datasets; the exam tests this limitation.

Enterprise Scenario 3: Recommendation Engine

An e-commerce site stores user-item interactions (clicks, purchases) for recommendation algorithms. They use Cassandra API because of its high write throughput and ability to store sparse data (users interact with few items). The table uses user_id as partition key and item_id as clustering key. They run periodic Spark jobs using the Spark Cassandra connector to generate recommendations. They set the read consistency to QUORUM to ensure accuracy. A common issue is data skew: if some users have millions of interactions, their partition becomes large and slow. They use a composite partition key (user_id + bucket_id) to distribute data evenly.
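The composite partition key in this scenario can be sketched as follows. The bucket count of 16, the function names, and the use of CRC32 to pick a bucket are assumptions for illustration; the point is that one heavy user's interactions are spread over several partitions, at the cost of reading every bucket to see the full history.

```python
import zlib

NUM_BUCKETS = 16  # hypothetical bucket count, tuned to the expected skew


def bucket_for(item_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(item_id.encode("utf-8")) % num_buckets


def partition_key(user_id: str, item_id: str) -> tuple:
    """Composite partition key: (user_id, bucket_id)."""
    return (user_id, bucket_for(item_id))


def partitions_to_read(user_id: str, num_buckets: int = NUM_BUCKETS) -> list:
    """Every partition that may hold this user's interactions."""
    return [(user_id, bucket) for bucket in range(num_buckets)]


key = partition_key("user-42", "item-1001")
all_keys = partitions_to_read("user-42")
```

Writes stay cheap because each interaction goes to exactly one partition; reads for a whole user fan out to `NUM_BUCKETS` partitions, so the bucket count is a trade-off between write balance and read cost.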

How DP-900 Actually Tests This

Exam Focus for DP-900 on Column-Family Databases

Objective Code: 1.3 Describe core data concepts, including NoSQL data stores. Specifically, you must be able to describe column-family databases and when to use them.

Common Wrong Answers:

1. 'Column-family databases store data in columns like a columnar database.' This is false. Columnar databases store data by column for analytics (e.g., Azure Synapse). Column-family databases store data by row key, but columns are grouped into families. The exam tests this distinction.

2. 'Column-family databases support ACID transactions.' No, they are eventually consistent (BASE). They support atomicity at the row level but not multi-row transactions. The exam may present a scenario requiring ACID and you must rule out Cassandra.

3. 'Column-family databases are the same as key-value stores.' While similar, column-family databases allow multiple columns per row and clustering keys. Key-value stores have a simple key and value (opaque blob). The exam expects you to differentiate.

Specific Numbers and Terms:

- Typical replication factor: 3 (set per keyspace)
- Consistency levels: ONE, QUORUM (majority), LOCAL_QUORUM, ALL (cqlsh defaults to ONE)
- Tombstone grace period: 864000 seconds (10 days)
- Cassandra Query Language (CQL)
- Partition key, clustering key, primary key
- SSTable, memtable, commit log
- SizeTieredCompactionStrategy (STCS)

Edge Cases:

- The exam may ask about ALLOW FILTERING: it is allowed but inefficient and not recommended for production.
- Secondary indexes are local; they do not scale like primary key queries.
- The Cassandra API in Cosmos DB supports version 4 of the CQL wire protocol, so existing Cassandra drivers can connect to it.

How to Eliminate Wrong Answers:

- If the question mentions high write throughput, time-series, or IoT, think column-family.
- If the question mentions joins, ACID, or complex queries, eliminate column-family.
- If the question mentions flexible schema and sparse data, column-family is a good fit.
- Remember: Cassandra is optimized for writes, not for reads with complex filters.

Key Takeaways

Column-family databases store data in rows identified by a primary key (partition key + optional clustering key).

They are optimized for write-heavy, time-series, IoT, and recommendation workloads.

The Cassandra API in Azure Cosmos DB is compatible with Apache Cassandra CQL and wire protocol.

A replication factor of 3 is typical; consistency is tunable per request, with QUORUM meaning a majority of replicas.

Queries must include the partition key for efficiency; otherwise ALLOW FILTERING is required.

Secondary indexes are local and not efficient for high-cardinality data.

Tombstones mark deleted data and are removed during compaction after a grace period (default 10 days).

Cassandra does not support JOINs, ACID transactions across rows, or complex aggregations.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Column-Family Database (Cassandra)

Schema-optional; each row can have different columns.

Scalable horizontally; data distributed across nodes.

Optimized for high write throughput and simple queries by partition key.

Eventual consistency (BASE) by default.

No JOINs; data denormalized for query patterns.

Relational Database (SQL)

Fixed schema; all rows in a table have the same columns.

Vertically scalable (scale up) or with sharding (complex).

Optimized for complex queries with JOINs, aggregations.

Strong consistency (ACID) by default.

Data normalized to reduce redundancy.

Watch Out for These

Mistake

Column-family databases are columnar databases like Parquet or ORC.

Correct

Column-family databases store data by row, not by column. The term 'column-family' refers to grouping columns together, but the physical storage is row-oriented. Columnar databases store each column separately for analytics compression and performance.

Mistake

Cassandra supports ACID transactions across multiple rows.

Correct

Cassandra provides atomicity at the row level (within a single partition) but does not support multi-row transactions or joins. It is a BASE (Basically Available, Soft state, Eventually consistent) system.

Mistake

You can query Cassandra efficiently without the partition key.

Correct

Without the partition key, Cassandra must scan all nodes and partitions, which is extremely inefficient. The 'ALLOW FILTERING' clause exists but is not recommended for production queries.

Mistake

Secondary indexes in Cassandra are as efficient as primary key indexes.

Correct

Secondary indexes are local to each node and require querying all nodes. They are not suitable for high-cardinality columns or large datasets. They are best for low-cardinality columns like status flags.

Mistake

The Cassandra API in Azure Cosmos DB is simply Apache Cassandra running on Azure VMs.

Correct

The Cassandra API is a wire-protocol-compatible layer on top of Cosmos DB's global distribution and multi-model capabilities. It is not the same as running Apache Cassandra on VMs, but it supports CQL and existing Cassandra drivers.


Frequently Asked Questions

What is the difference between a partition key and a clustering key in Cassandra?

The partition key determines which node stores the data and is used for data distribution. It is the first part of the primary key. The clustering key determines the sort order of rows within a partition. Together they form the primary key. For example, in PRIMARY KEY ((user_id), timestamp), user_id is the partition key and timestamp is the clustering key. Queries must include the partition key; clustering key can be used for range queries within a partition.

Can I use Azure Cosmos DB Cassandra API for real-time analytics?

Yes, but with limitations. Cassandra is optimized for high write throughput and simple point queries. For complex analytics, you would typically export data to Azure Synapse or use Spark via the Spark Cassandra connector. The Cassandra API can handle real-time ingestion, but analytics queries should be offloaded to a separate system.

How does the Cassandra API handle global distribution in Azure Cosmos DB?

Azure Cosmos DB Cassandra API leverages Cosmos DB's global distribution. You can replicate data to multiple Azure regions with automatic failover. Cassandra consistency levels requested by the client are mapped onto Cosmos DB's own consistency levels (Strong, Bounded Staleness, Session, Consistent Prefix, Eventual), so they do not behave exactly as they do in Apache Cassandra. You configure replication at the Cosmos DB account level, not per keyspace.

What is a tombstone in Cassandra and why does it matter?

A tombstone is a marker indicating that a column or row has been deleted. Tombstones are kept for a configurable grace period (default 864000 seconds, 10 days) to allow time for all replicas to receive the deletion. During compaction, tombstones are removed. Too many tombstones can degrade read performance because the database must skip them. This is a common exam topic.

Is the Cassandra API in Azure Cosmos DB free?

No, it is a paid service. You pay for provisioned throughput (RU/s) and storage. Cosmos DB offers a free tier, limited to one account per subscription, that covers the first 1000 RU/s and 25 GB of storage; usage beyond that incurs costs. The exam may ask about cost considerations.

What is the default consistency level in Cassandra?

Apache Cassandra has no single server-wide default; the client sets the consistency level. cqlsh defaults to ONE, and many drivers default to LOCAL_ONE. QUORUM, which requires a majority of replicas to respond, is a common choice because it balances consistency and availability. You can override the level per query.

Can I use SQL to query Cassandra?

Cassandra uses CQL (Cassandra Query Language), which is similar to SQL but with limitations. You cannot use JOINs or subqueries, and aggregation is limited (in Apache Cassandra, for example, GROUP BY works only on primary key columns). The exam expects you to know that CQL is SQL-like but not full SQL.
