DP-900 | Chapter 52 of 101 | Objective 1.3

Graph Databases: Gremlin API and Relationships

This chapter covers graph databases in Azure Cosmos DB using the Gremlin API, a key topic for DP-900 exam objective 1.3 (Core Data Concepts). Graph databases excel at modeling and querying highly connected data, such as social networks, recommendation engines, and fraud detection. Expect exam questions on when to use a graph database, on its structure (vertices, edges, properties), and on the role of the Gremlin query language. Understanding graph databases is essential for choosing the right data store for relationship-rich scenarios.

25 min read
Intermediate
Updated May 31, 2026

Graph Database as a Social Network Map

Imagine a social network where each person is a node and each friendship is an edge connecting two nodes. A graph database queried through the Gremlin API works like a dynamic, queryable map of that network. Each person (vertex) has properties such as name, age, and location. Each friendship (edge) has a direction (who follows whom) and properties such as a 'since' date.

Now suppose you want to find Alice's friends who live in Seattle and like hiking. In a relational database, you would join the person table to a friendship table, back to the person table, and then to a likes table, and every extra hop adds another JOIN. In a graph database, you start at Alice's vertex, traverse the 'friend' edges to reach neighbor vertices, keep only those whose location property is 'Seattle', and then keep only those that have a 'likes' edge to a hiking activity vertex. Gremlin expresses this traversal step by step (assuming activity vertices carry a 'name' property): g.V().has('person','name','Alice').out('friend').has('location','Seattle').where(out('likes').has('activity','name','hiking')). The where() step keeps the result set on the friends; ending the query with .out('likes').has(...) instead would return the hiking vertex rather than the people. To reach friends of friends, add a second out('friend') hop before the filters.

This is like following a chain of connections on the map: from Alice, follow friend lines, look only at the people in Seattle, then check which of them have a likes line to hiking. A native graph engine keeps each vertex's edges next to the vertex (an adjacency list), so it does not scan all people to find Alice's neighbors. It's like having a map with direct links between connected places; you don't need to look at the entire city to find a route. The Gremlin API provides a fluent, stepwise language to walk the graph, which keeps relationship-heavy queries fast and readable.

How It Actually Works

What is a Graph Database?

A graph database is a NoSQL database that uses graph structures with vertices (nodes) and edges (relationships) to represent and store data. Unlike relational databases that use foreign keys and JOINs, graph databases store relationships as first-class entities. This makes queries about relationships, such as 'find all friends of friends who bought a product', efficient: in engines that store each vertex's edges alongside the vertex, the cost of a hop depends on how many edges that vertex has, not on the total size of the graph. This is often summarized as O(1) per edge hop, in contrast to JOINs whose cost tends to grow with table size.

Why Graph Databases Exist

Traditional relational databases struggle with highly connected data. For example, to find a friend-of-friend in a social network, a relational database must join a large 'friendship' table to itself once per hop, which can be slow at scale. Graph databases address this by storing adjacency lists: each vertex maintains a list of its incident edges. In a native graph engine, traversing from one vertex to another then means following a stored reference rather than scanning a table for matching rows.
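The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration with made-up sample data (the names `FRIEND_EDGES`, `friends_of_friends_scan`, and `friends_of_friends_adjacency` are hypothetical), not how any particular database engine is implemented. The scan version rereads the whole edge list on every hop, much like a self-join, while the adjacency version reads only the neighbors of the vertices it is currently standing on.

```python
from collections import defaultdict

# Hypothetical sample data: directed 'friend' edges as (source, target) pairs.
FRIEND_EDGES = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "dave"),
    ("carol", "dave"),
    ("carol", "erin"),
    ("erin", "alice"),
]


def build_adjacency(edges):
    """Group outgoing neighbors by source vertex (an adjacency list)."""
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)
    return adjacency


def friends_of_friends_scan(edges, start):
    """Join-style approach: every hop scans the entire edge list."""
    first_hop = {t for s, t in edges if s == start}
    second_hop = {t for s, t in edges if s in first_hop}
    return second_hop - first_hop - {start}


def friends_of_friends_adjacency(adjacency, start):
    """Graph-style approach: every hop reads only the current vertex's neighbors."""
    first_hop = set(adjacency.get(start, []))
    second_hop = set()
    for friend in first_hop:
        second_hop.update(adjacency.get(friend, []))
    return second_hop - first_hop - {start}


adjacency = build_adjacency(FRIEND_EDGES)
print(sorted(friends_of_friends_adjacency(adjacency, "alice")))  # ['dave', 'erin']
```

Both functions return the same answer; what changes is how much data each hop has to touch as the edge list grows.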

How Gremlin API Works

Gremlin is the graph traversal language of Apache TinkerPop, an open-source graph computing framework that also defines the traversal machine that executes it. Azure Cosmos DB implements the Gremlin API, allowing you to create, query, and manage graph data. Gremlin queries are composed of a sequence of steps; each step either transforms the current set of traversers (e.g., filter, map) or produces side effects (e.g., storing intermediate results under a name). The query is executed by the traversal machine, which may optimize the traversal before running it.

Key Components

#### Vertices

- Represent entities (e.g., person, product, location).
- Each vertex has a unique identifier (ID) and can have properties (key-value pairs).
- In Cosmos DB, vertices are stored as documents with a special graph schema.

#### Edges

- Represent relationships between vertices (e.g., 'friend', 'purchased', 'lives_in').
- Edges have a direction: from source vertex to target vertex. They can also have properties (e.g., 'since' date for friendship).
- Edges are stored as separate documents with references to source and target vertex IDs.

#### Properties

- Key-value pairs attached to vertices or edges.
- Properties can be indexed for efficient filtering.
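To make the three components concrete, here is a minimal Python sketch of the data model. The class and field names (`Vertex`, `Edge`, `source_id`, `target_id`) are illustrative assumptions and do not mirror the storage format Cosmos DB uses internally.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Vertex:
    """An entity: unique id, a label naming its type, and free-form properties."""
    id: str
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    """A directed relationship from source_id to target_id, with its own properties."""
    id: str
    label: str
    source_id: str  # vertex the edge leaves
    target_id: str  # vertex the edge arrives at
    properties: Dict[str, Any] = field(default_factory=dict)


alice = Vertex("1", "person", {"name": "Alice", "age": 30})
bob = Vertex("2", "person", {"name": "Bob", "age": 35})
friendship = Edge("e1", "friend", alice.id, bob.id, {"since": 2020})

print(friendship.source_id, "->", friendship.target_id, friendship.properties)
```

Note that the relationship carries its own data (`since`) without needing a separate junction table, which is what "first-class" relationships means in practice.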

Gremlin Query Language Basics

Gremlin queries start with a traversal source g. Common steps:

- V(): get all vertices or filter by ID.
- E(): get all edges.
- out(label): traverse outgoing edges of given label.
- in(label): traverse incoming edges.
- both(label): traverse both directions.
- has(key, value): filter by property.
- values(key): get property values.
- limit(n): limit results.
- count(): count results.
- order().by(key): sort results.

Example: Find all friends of Alice who are over 30:

g.V().has('person', 'name', 'Alice').out('friend').has('age', gt(30)).values('name')
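The stepwise style of that query can be imitated in plain Python to show what each step does to the current set of traversers. This is a toy model under stated assumptions: `MiniTraversal`, `VERTICES`, and `EDGES` are hypothetical names, the data is invented, and the class evaluates eagerly over in-memory lists rather than behaving like a real Gremlin engine. The method is spelled `in_` because `in` is a reserved word in Python.

```python
# Hypothetical in-memory graph: vertex id -> properties, plus labeled directed edges.
VERTICES = {
    "1": {"label": "person", "name": "Alice", "age": 30},
    "2": {"label": "person", "name": "Bob", "age": 35},
    "3": {"label": "person", "name": "Carol", "age": 28},
    "4": {"label": "person", "name": "Dave", "age": 41},
}
EDGES = [
    ("1", "friend", "2"),
    ("1", "friend", "3"),
    ("1", "friend", "4"),
    ("2", "friend", "4"),
]


def gt(threshold):
    """Return a predicate in the spirit of Gremlin's gt()."""
    return lambda value: value > threshold


class MiniTraversal:
    """A toy, eager imitation of a Gremlin traversal; not a real Gremlin client."""

    def __init__(self, vertex_ids):
        self.vertex_ids = list(vertex_ids)

    @classmethod
    def V(cls):
        """Start with every vertex in the graph."""
        return cls(VERTICES.keys())

    def has(self, *args):
        """Filter: has(key, value) or has(label, key, value)."""
        if len(args) == 3:
            label, key, expected = args
        else:
            label = None
            key, expected = args
        test = expected if callable(expected) else (lambda v: v == expected)
        kept = []
        for vid in self.vertex_ids:
            props = VERTICES[vid]
            if label is not None and props["label"] != label:
                continue
            if key in props and test(props[key]):
                kept.append(vid)
        return MiniTraversal(kept)

    def out(self, label):
        """Move each traverser along outgoing edges with the given label."""
        moved = [t for vid in self.vertex_ids
                 for (s, l, t) in EDGES if s == vid and l == label]
        return MiniTraversal(moved)

    def in_(self, label):
        """Move each traverser along incoming edges with the given label."""
        moved = [s for vid in self.vertex_ids
                 for (s, l, t) in EDGES if t == vid and l == label]
        return MiniTraversal(moved)

    def values(self, key):
        """End the traversal by returning one property per remaining vertex."""
        return [VERTICES[vid][key] for vid in self.vertex_ids if key in VERTICES[vid]]


names = (
    MiniTraversal.V()
    .has("person", "name", "Alice")
    .out("friend")
    .has("age", gt(30))
    .values("name")
)
print(names)  # ['Bob', 'Dave']
```

Each call returns a new traversal holding the vertices that survived the step, which is why the steps read left to right like a pipeline.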

Azure Cosmos DB Gremlin API Specifics

Cosmos DB stores graph data in a partitioned container. A partition key is required, and every vertex must carry the partition key property. Edges are stored in the same partition as their source vertex.

Edges are stored as documents with _isEdge flag.

Queries are charged as Request Units (RUs) based on the number of traversed elements.

The Gremlin API supports most TinkerPop features, but some advanced graph algorithms (e.g., graph computer) are not available.

Indexing

Cosmos DB automatically indexes all properties. You can customize indexing policy for better performance. For graph queries, indexing on filtered properties (e.g., name, age) speeds up has() steps.

Consistency and Partitioning

Graph traversals are scoped to a single partition if the partition key is specified in the query. Cross-partition queries are possible but incur higher RU costs. Choose a partition key that evenly distributes data and aligns with common traversal patterns (e.g., user ID for social graph).
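A small simulation can show why supplying the partition key matters. This is a sketch under loud assumptions: the partition count, the SHA-256 based `partition_for` function, and the sample vertices are all invented for illustration, and Cosmos DB's real hashing and routing are internal to the service.

```python
import hashlib

PARTITION_COUNT = 4  # hypothetical number of partitions


def partition_for(partition_key_value):
    """Map a partition key value to a partition with a stable hash (illustrative only)."""
    digest = hashlib.sha256(partition_key_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % PARTITION_COUNT


def build_partitions(vertices):
    """Place each vertex in the partition chosen by its user_id."""
    partitions = {index: [] for index in range(PARTITION_COUNT)}
    for vertex in vertices:
        partitions[partition_for(vertex["user_id"])].append(vertex)
    return partitions


def find_by_name(partitions, name, user_id=None):
    """Return (matches, partitions_visited) for a lookup by name."""
    if user_id is not None:
        targets = [partition_for(user_id)]  # routed to exactly one partition
    else:
        targets = list(partitions)          # fan out to every partition
    matches = [
        vertex
        for index in targets
        for vertex in partitions[index]
        if vertex["name"] == name and (user_id is None or vertex["user_id"] == user_id)
    ]
    return matches, len(targets)


people = [
    {"user_id": "123", "name": "Alice"},
    {"user_id": "456", "name": "Bob"},
    {"user_id": "789", "name": "Carol"},
]
partitions = build_partitions(people)

print(find_by_name(partitions, "Alice", user_id="123")[1])  # 1 partition visited
print(find_by_name(partitions, "Alice")[1])                 # 4 partitions visited
```

Both lookups find the same vertex; the one without the partition key simply has to ask every partition, which is where the extra RU cost and latency come from.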

Gremlin API vs. Other Graph Models

Neo4j: Uses Cypher query language. Gremlin is vendor-agnostic (Apache TinkerPop).

Amazon Neptune: Supports Gremlin, openCypher, and SPARQL.

JanusGraph: Open-source, uses Gremlin.

When to Choose Graph Database

Highly connected data with many relationships (e.g., social networks, recommendation engines, fraud detection).

Queries that traverse relationships multiple levels deep (e.g., 'find all products bought by friends of friends').

Schema flexibility: vertices and edges can have different properties.

When Not to Choose Graph Database

Simple CRUD operations with few relationships.

Heavy aggregation queries (e.g., sum, average over many records) – better suited for relational or document databases.

Need for complex JOINs across unrelated entities.

Gremlin API in DP-900 Exam

The DP-900 exam expects you to:

Identify use cases for graph databases (social networks, fraud detection, recommendation engines).

Recognize that Gremlin API is used for graph queries in Azure Cosmos DB.

Understand basic graph concepts: vertices, edges, properties.

Differentiate graph databases from other NoSQL types (key-value, document, column-family).

Common Exam Traps

Trap: Gremlin is a query language for relational databases. Reality: Gremlin is for graph databases.

Trap: Graph databases are best for all scenarios. Reality: They excel at relationship-heavy queries but are inefficient for simple CRUD or aggregation.

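Trap: Edges can only be traversed in the direction they were created. Reality: Edges are directed, but you can follow them backwards with in() or in either direction with both().

Trap: Graph databases don't support transactions. Reality: Many graph databases offer ACID transactions. Azure Cosmos DB's Gremlin API is more limited: it does not support multi-operation TinkerPop transactions (tx()), and Cosmos DB's transactional features are scoped to a single partition.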

Summary of Key Values

RU charge: there is no fixed per-query charge. Cost varies with the number of vertices and edges read, the filters applied, and the size of the result.

Partition key is required.

Maximum document size: 2 MB (vertices and edges are documents).

Consistency levels: same as Cosmos DB (Strong, Bounded Staleness, Session, Consistent Prefix, Eventual).

Walk-Through

1. Create a Graph Database

In the Azure portal, create a new Azure Cosmos DB account and select the Gremlin API. Choose a resource group and region, and configure global distribution if needed. Then create a graph database and a graph (container) with a partition key. For example, set the partition key to '/user_id' for a social graph. Manual throughput starts at a minimum of 400 RU/s, or you can choose autoscale. This step provisions the backend storage and indexing.

2. Add Vertices and Edges

Use Gremlin queries to add vertices. Example: `g.addV('person').property('id', '1').property('user_id', '1').property('name', 'Alice').property('age', 30)`. Then add edges: `g.V('1').addE('friend').to(g.V('2')).property('since', 2020)`. Each vertex and edge is stored as a document in Cosmos DB. The edge document contains references to source and target vertex IDs. Include the partition key property on every vertex (here 'user_id', matching the '/user_id' partition key path); an edge is stored in the partition of its source vertex.
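When many vertices and edges have to be loaded, the query text is usually generated by code. The following Python sketch builds addV and addE query strings of the shape used in this step. The helper names (`gremlin_literal`, `add_vertex_query`, `add_edge_query`) are hypothetical, the quoting rules are a simplified assumption, and nothing here talks to a database. In real applications, parameter bindings offered by a Gremlin driver are generally a safer choice than string concatenation.

```python
def gremlin_literal(value):
    """Render a Python value as a Gremlin literal (str, int, float, bool only)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        # Escape backslashes first, then single quotes.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"
    raise TypeError(f"unsupported property type: {type(value).__name__}")


def add_vertex_query(label, vertex_id, partition_key, partition_value, **properties):
    """Build an addV query that always includes the partition key property."""
    steps = [
        f"g.addV({gremlin_literal(label)})",
        f".property('id', {gremlin_literal(vertex_id)})",
        f".property({gremlin_literal(partition_key)}, {gremlin_literal(partition_value)})",
    ]
    for key, value in properties.items():
        steps.append(f".property({gremlin_literal(key)}, {gremlin_literal(value)})")
    return "".join(steps)


def add_edge_query(label, source_id, target_id, **properties):
    """Build an addE query from a source vertex id to a target vertex id."""
    query = (
        f"g.V({gremlin_literal(source_id)})"
        f".addE({gremlin_literal(label)})"
        f".to(g.V({gremlin_literal(target_id)}))"
    )
    for key, value in properties.items():
        query += f".property({gremlin_literal(key)}, {gremlin_literal(value)})"
    return query


print(add_vertex_query("person", "1", "user_id", "1", name="Alice", age=30))
print(add_edge_query("friend", "1", "2", since=2020))
```

Making the partition key a required argument of `add_vertex_query` is a deliberate design choice: it turns a forgotten partition key into an immediate error in the calling code instead of a rejected write later.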

3. Query with Traversal Steps

Write Gremlin queries to traverse the graph. For example, to find friends of Alice: `g.V().has('person','name','Alice').out('friend').values('name')`. The query engine starts at the Alice vertex, follows outgoing 'friend' edges to neighbor vertices, and retrieves their names. Each step is executed sequentially. The traversal can be optimized by indexing on the filtered property (name). The query consumes RUs based on the number of vertices and edges traversed.

4. Filter and Transform Results

Use `has()` to filter by properties during traversal. Example: find friends older than 25: `g.V().has('person','name','Alice').out('friend').has('age', gt(25))`. Use `values()` to retrieve specific properties. Use `project()` to shape output. Use `order().by()` to sort. These steps are executed in memory after traversal. The Gremlin engine may push down filters to the storage layer if indexes exist.

5. Handle Cross-Partition Queries

If the partition key is not specified, queries may span multiple partitions. For example, `g.V().has('name','Alice')` without partition key will fan out to all partitions. This is slower and more expensive. To avoid this, include the partition key in the query: `g.V().has('user_id','123').has('name','Alice')`. For edges, the partition key is inherited from the source vertex. Design your partition key to match common traversal starting points.

What This Looks Like on the Job

Enterprise Scenario 1: Social Network Friend Recommendations

A social media company uses Cosmos DB Gremlin API to store user profiles and friendships. Each user is a vertex with properties like user_id, name, interests. Edges represent 'friend' relationships with a 'since' date. The recommendation engine queries for friends-of-friends (FoF) who share interests. The query g.V(userId).out('friend').out('friend').dedup().has('interests', interest) returns potential friends. Production scale: 100 million users, average 150 friends each. The partition key is '/user_id', so locating the starting user and reading that user's outgoing edges is a fast single-partition operation; the second hop can still reach friends stored in other partitions. Misconfiguration: if the partition key is set to '/country', a lookup for a specific user fans out across partitions, increasing RU cost and latency.
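The friends-of-friends logic can be sketched in Python as follows. The data and the `recommend` function are hypothetical, and the sketch goes slightly beyond the Gremlin query quoted in this scenario: it also drops the user and the user's existing friends from the results, which a production recommendation query would normally do as well.

```python
# Hypothetical sample data: outgoing 'friend' edges and interests per user.
FRIENDS = {
    "u1": ["u2", "u3"],
    "u2": ["u1", "u4", "u5"],
    "u3": ["u1", "u4"],
    "u4": [],
    "u5": [],
}
INTERESTS = {
    "u1": {"hiking"},
    "u2": {"chess"},
    "u3": {"hiking"},
    "u4": {"hiking", "chess"},
    "u5": {"cooking"},
}


def recommend(user_id, interest):
    """Return friends-of-friends who share `interest`, each listed once."""
    direct = FRIENDS.get(user_id, [])
    seen = set()       # plays the role of dedup()
    result = []
    for friend in direct:                          # first out('friend') hop
        for candidate in FRIENDS.get(friend, []):  # second out('friend') hop
            if candidate == user_id or candidate in direct or candidate in seen:
                continue
            seen.add(candidate)
            if interest in INTERESTS.get(candidate, set()):  # has('interests', ...)
                result.append(candidate)
    return result


print(recommend("u1", "hiking"))  # ['u4']
```

User u4 is reachable through both u2 and u3 but appears only once, which is the effect dedup() has in the Gremlin version.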

Enterprise Scenario 2: Fraud Detection in Banking

A bank models accounts and transactions as a graph. Vertices: accounts. Edges: 'transfer' (one per transaction) with amount and timestamp properties. Fraud detection queries look for circular transfer patterns, a common money-laundering signal. A Gremlin query such as g.V().has('account','id', suspectId).as('start').out('transfer').out('transfer').where(eq('start')) detects cycles of length 2: money that leaves the suspect account and returns to it after two transfers. (A chain of out('transfer').in('transfer') steps would instead find accounts that pay the same recipient, which is not a cycle.) At scale (10 million accounts, 500 million transactions), graph traversals tend to be faster than the equivalent chain of SQL self-JOINs. Performance consideration: index 'amount' and 'timestamp' for filtering. Common issue: not using the partition key leads to cross-partition scans, causing timeouts.
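Cycle detection is easier to reason about with a small worked example. The Python sketch below searches for transfer paths that leave an account and come back to it within a hop limit. The account names, the `TRANSFERS` data, and the `find_cycles` function are invented for illustration; this is a plain depth-first search, not the algorithm any particular graph engine uses.

```python
# Hypothetical sample data: outgoing 'transfer' edges per account.
TRANSFERS = {
    "acct-A": ["acct-B", "acct-C"],
    "acct-B": ["acct-A"],
    "acct-C": ["acct-D"],
    "acct-D": ["acct-A"],
}


def find_cycles(start, max_hops):
    """Return every transfer path that leaves `start` and returns within max_hops."""
    cycles = []
    stack = [[start]]  # each entry is the path walked so far
    while stack:
        path = stack.pop()
        hops_used = len(path) - 1
        if hops_used >= max_hops:
            continue  # no hops left to extend this path
        for next_account in TRANSFERS.get(path[-1], []):
            if next_account == start:
                cycles.append(path + [next_account])  # money came back
            elif next_account not in path:
                stack.append(path + [next_account])   # keep exploring
    return cycles


print(find_cycles("acct-A", 2))  # [['acct-A', 'acct-B', 'acct-A']]
```

With `max_hops=2` only the A to B to A loop is reported; raising the limit to 3 also surfaces the longer A to C to D to A loop, at the cost of exploring more paths.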

Enterprise Scenario 3: Supply Chain Management

A logistics company tracks parts and assemblies. Vertices: parts, assemblies, suppliers, factories. Edges: 'supplied_by', 'used_in', 'shipped_to'. Query to find all assemblies affected by a supplier delay: g.V().has('supplier','name','Acme').in('supplied_by').out('used_in'). It starts at the supplier, follows incoming 'supplied_by' edges to the parts that supplier provides, then follows 'used_in' edges to the assemblies that contain those parts. Scale: 1 million parts, 200,000 assemblies. Partition key: '/part_id'. Misconfiguration: forgetting to include the partition key in queries results in a full graph scan. Also, if edge properties are not indexed, filtering by date range becomes slow.

How DP-900 Actually Tests This

DP-900 Objective 1.3: Core Data Concepts – Graph Databases

The DP-900 exam tests your ability to identify the appropriate data store for a given scenario. For graph databases, you must recognize that they are designed for highly connected data where relationships are key. The exam will present scenarios like social networks, recommendation engines, fraud detection, and supply chain management. You must know that the Gremlin API is used in Azure Cosmos DB for graph queries.

Common Wrong Answers and Why Candidates Choose Them

1. Wrong: Use a relational database for a social network. Candidates think relational databases are universal. Reality: Social networks require deep relationship queries (friends-of-friends) that are slow with JOINs. Graph databases are optimized for this.

2. Wrong: Graph databases are best for all types of data. Candidates overgeneralize. Reality: Graph databases are inefficient for simple CRUD or aggregation-heavy workloads.

3. Wrong: Gremlin API is a SQL-like language. Candidates confuse Gremlin with SQL. Reality: Gremlin is a graph traversal language with steps like out(), in(), has().

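4. Wrong: Graph databases don't support ACID transactions. Candidates think NoSQL means no transactions. Reality: Many graph databases are ACID-compliant, and Cosmos DB offers transactional guarantees within a single partition, although its Gremlin API does not support multi-operation TinkerPop transactions (tx()).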

Specific Terms and Values on the Exam

Vertices (nodes) and edges (relationships) with properties.

Gremlin API as the query language for graph databases in Azure.

Partition key required for Cosmos DB graph containers.

Use cases: social networks, fraud detection, recommendation engines.

Not suitable for: simple CRUD, aggregation, bulk updates.

Edge Cases and Exceptions

Edge case: Graph database for hierarchical data (e.g., organizational chart). While possible, a document database with nested arrays may be simpler. The exam expects you to recognize that graphs are for many-to-many relationships, not strict hierarchies.

Exception: Cosmos DB's transactional guarantees apply only within a single partition, and the Gremlin API does not support multi-operation TinkerPop transactions (tx()). Cross-partition transactions are not supported.

Exception: Graph databases can store properties on edges, which relational databases cannot without junction tables.

How to Eliminate Wrong Answers

If the scenario involves 'friends of friends', 'recommendations based on connections', or 'fraud rings', it's a graph database.

If the scenario involves 'storing user profiles with simple lookups', choose a document or key-value store.

If the scenario involves 'aggregating sales by region', choose a relational or column-family store.

Remember: Gremlin is for graphs, SQL is for relational, MongoDB is for documents.

Key Takeaways

Graph databases use vertices (nodes) and edges (relationships) with properties to model highly connected data.

Azure Cosmos DB Gremlin API is a graph database service that uses the Apache TinkerPop Gremlin query language.

Gremlin queries use steps like V(), E(), out(), in(), has(), values() to traverse and filter the graph.

Graph databases are ideal for social networks, fraud detection, recommendation engines, and supply chain management.

Graph databases are not suitable for simple CRUD, heavy aggregation, or bulk update workloads.

In Cosmos DB, a partition key is required for graph containers; queries should include the partition key to avoid cross-partition scans.

Edges can have properties and are directed, but can be traversed in both directions using in() or both().

The DP-900 exam expects you to identify graph database use cases and differentiate them from other NoSQL types.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Graph Database (Gremlin API)

Uses vertices and edges to store data and relationships as first-class entities.

Optimized for traversing relationships (e.g., friends-of-friends), with a per-hop cost that does not grow with the total size of the graph.

Query language: Gremlin (stepwise traversal).

Schema-flexible: vertices can have different properties.

Best for highly connected data: social networks, fraud detection, recommendation engines.

Relational Database (SQL)

Uses tables with rows and columns; relationships via foreign keys and JOINs.

Optimized for complex queries with aggregations, JOINs, and ACID transactions.

Query language: SQL (declarative).

Requires predefined schema with data types and constraints.

Best for structured data with clear relationships and need for consistency: banking, ERP, CRM.

Watch Out for These

Mistake

Graph databases are just a fancy way to store relationships; they are slower than relational databases.

Correct

Graph databases are faster for relationship-heavy queries because they store adjacency lists, so each hop reads only the edges of the current vertex (often summarized as O(1) per edge hop), while relational databases need a JOIN per hop, and JOIN cost tends to grow with table size.

Mistake

Gremlin API is the same as SQL but with different syntax.

Correct

Gremlin is a graph traversal language that uses a sequence of steps (e.g., out(), has(), values()) to walk the graph, not declarative SELECT-FROM-WHERE. It is fundamentally different from SQL.

Mistake

All graph databases use the same query language.

Correct

Different graph databases use different languages: Neo4j uses Cypher, Amazon Neptune supports Gremlin, openCypher, and SPARQL, and Azure Cosmos DB uses Gremlin. Gremlin is an Apache TinkerPop standard.

Mistake

Edges in a graph database can only have one direction and cannot be traversed backwards.

Correct

Edges are directed, but Gremlin provides `in()` and `both()` steps to traverse incoming or both directions. You can also query edges without direction using `E()` and filter by label.

Mistake

Graph databases do not require a schema; they are completely schema-less.

Correct

While graph databases are schema-flexible (each vertex/edge can have different properties), Cosmos DB requires a partition key and indexing policy. In practice, you often enforce some schema at the application level.


Frequently Asked Questions

What is the Gremlin API in Azure Cosmos DB?

The Gremlin API is an implementation of the Apache TinkerPop graph traversal language for Azure Cosmos DB. It allows you to create, query, and manage graph data (vertices, edges, properties) using Gremlin queries. It is used for scenarios like social networks and fraud detection where relationships are key.

How do I create a graph database in Azure Cosmos DB?

In the Azure portal, create a new Cosmos DB account and select the Gremlin API. Then create a database and a graph (container). You must specify a partition key (e.g., /user_id). You can set throughput manually or use autoscale. After creation, use the Data Explorer or SDK to run Gremlin queries.

What are vertices and edges in a graph database?

Vertices represent entities (e.g., person, product) and edges represent relationships between them (e.g., 'friend', 'purchased'). Both can have properties (key-value pairs). Edges are directed from source to target vertex.

When should I use a graph database instead of a relational database?

Use a graph database when your data is highly connected and queries involve traversing relationships multiple levels deep (e.g., 'find friends of friends who like X'). Relational databases are better for structured data with complex aggregations and strict consistency requirements.

Can I use SQL to query a graph database?

No, graph databases use specialized query languages like Gremlin (for Cosmos DB) or Cypher (for Neo4j). SQL is not designed for graph traversals. However, some graph databases offer SQL-like interfaces, but they are not standard.

What is the difference between a graph database and a document database?

A document database stores data as JSON documents (e.g., MongoDB, Cosmos DB SQL API) and is good for semi-structured data with nested objects. A graph database stores data as vertices and edges, optimized for relationships. Document databases can store relationships via references, but following them requires a separate lookup per reference from the application, whereas a graph database follows edges directly as part of a single traversal query.

Does Azure Cosmos DB Gremlin API support transactions?

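Only in a limited sense. Cosmos DB's transactional guarantees apply within a single partition, and cross-partition transactions are not supported. The Gremlin API does not support TinkerPop's `tx()` transaction step, so do not rely on multi-statement Gremlin transactions (for example, adding a vertex and an edge as one atomic unit). The usual guidance is to choose an appropriate consistency level and use optimistic concurrency to resolve conflicting writes.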
