GraphRAG with Cosmos DB for AI Agents
Vector-only RAG fails on multi-hop queries. Learn how to build a hybrid graph + vector retrieval system on Azure Cosmos DB Gremlin API — with practical Gremlin patterns for AI agents.
Your RAG pipeline retrieves the right chunks and your LLM still hallucinates the connections between them. Sound familiar?
Standard Retrieval-Augmented Generation embeds documents into vectors, finds the top-k most similar chunks, and hopes the LLM can piece together the answer. For simple factual lookups this works fine. But the moment a question requires chaining through multiple entities — connecting Alice to her team, the team to the feature they built, the feature to the Azure services it depends on — vector similarity collapses.
Benchmarks confirm this. In the Lettria evaluation, GraphRAG answered 81.67% of queries correctly versus 57.50% for vector-only RAG. The FalkorDB/Diffbot benchmark found that vector RAG scored 0% accuracy when queries involved more than five entities. Not low — zero.
The fix is not a bigger model. It is a knowledge graph.
Why Graphs Fix Multi-Hop Failure
A graph database stores entities as vertices and their connections as edges. Relationships become first-class citizens in your data layer, persisted and queryable — not patterns the LLM has to infer from unstructured text.
Consider this question: “Which Azure services does the feature designed by Alice’s team depend on?”
That requires four hops: Alice → manages → Team → designed → Feature → depends_on → Azure Services. In Gremlin, it is a single traversal:
```gremlin
g.V().has('name', 'Alice')
  .out('manages')
  .out('designed')
  .out('depends_on')
  .hasLabel('azure_service')
  .values('name')
```
No embeddings, no similarity scores, no hallucinated links. The database walks the explicit path and returns the answer.
| Dimension | Vector Only | Graph + Vector |
|---|---|---|
| Multi-hop reasoning | Poor — requires chaining | Native traversal |
| Hallucination risk | Higher — LLM infers links | Lower — relationships are stored |
| Explainability | Similarity score only | Traversal path = citation |
| Setup complexity | Low | Medium |
| Unstructured text | Excellent | Needs entity extraction first |
The key insight: you don’t have to choose. The production pattern combines both.
Modeling a Knowledge Graph for RAG on Cosmos DB
Azure Cosmos DB’s Gremlin API uses the Apache TinkerPop property graph model. Every vertex has an id, a label (entity type), a partition key, and arbitrary key-value properties. Edges are directed, labeled, and can carry properties too.
Here is a practical schema for a document RAG knowledge graph:
Vertex labels:
| Label | Purpose | Key Properties |
|---|---|---|
| document | Source file or article | title, source, created_at |
| chunk | Text segment | content, chunk_index, token_count |
| entity | Named entity (person, service, concept) | name, type, description |
| category | Classification bucket | name |
Edge labels:
| Edge | Direction | Purpose |
|---|---|---|
| contains | document → chunk | Document contains text chunk |
| mentions | chunk → entity | Chunk references an entity |
| related_to | entity → entity | Semantic relationship |
| depends_on | entity → entity | Dependency link |
| belongs_to | entity → category | Classification |
| next_chunk | chunk → chunk | Sequential ordering |
Creating the graph
```gremlin
// Create a document vertex
g.addV('document')
  .property('id', 'doc-001')
  .property('partition', 'knowledge')
  .property('title', 'Q3 Infrastructure Risk Assessment')
  .property('source', 'internal-wiki')
  .property('created_at', '2026-01-15')

// Create a chunk vertex
g.addV('chunk')
  .property('id', 'chunk-001-01')
  .property('partition', 'knowledge')
  .property('content', 'The primary risk is network latency between the checkout service and the payment gateway...')
  .property('chunk_index', 1)
  .property('token_count', 245)

// Link document to chunk
g.V(['knowledge', 'doc-001'])
  .addE('contains')
  .to(g.V(['knowledge', 'chunk-001-01']))
  .property('order', 1)

// Create an entity and link it
g.addV('entity')
  .property('id', 'entity-checkout-svc')
  .property('partition', 'knowledge')
  .property('name', 'checkout-service')
  .property('type', 'service')

g.V(['knowledge', 'chunk-001-01'])
  .addE('mentions')
  .to(g.V(['knowledge', 'entity-checkout-svc']))
```
Partition key strategy
Cosmos DB requires a partition key on every vertex. For RAG knowledge graphs, partition by domain or topic (e.g., property('partition', 'knowledge')) so traversals stay within a single partition.
Critical rule: out() traversals stay within the source partition (cheap). in() traversals fan out across all partitions (expensive). Design your edge direction so the primary retrieval path uses out().
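To make the rule harder to violate in application code, it can help to centralize lookups behind a small helper. A minimal sketch, assuming the schema above with `'knowledge'` as the partition key value (the helper name and constant are illustrative, not part of any SDK):

```python
# Assumed partition key value from the schema above; not an SDK constant.
KNOWLEDGE_PARTITION = "knowledge"

def vertex_lookup(vertex_id: str, partition: str = KNOWLEDGE_PARTITION) -> str:
    """Build a partition-scoped lookup in the ['partitionKey', 'id'] form.

    Using this everywhere prevents accidental g.V('<id>') calls,
    which fan out across every partition.
    """
    return f"g.V(['{partition}', '{vertex_id}'])"
```

Every query in this article that starts from a known vertex uses this `['partition', 'id']` form for exactly that reason.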
Where do embeddings go?
Cosmos DB Gremlin stores primitive properties — not 1536-float vectors. The recommended production pattern is split storage:
- Cosmos DB Gremlin: Knowledge graph (vertices, edges, relationships, metadata)
- Cosmos DB NoSQL (with DiskANN vector index) or Azure AI Search: Same chunk IDs with embedding vectors
Both stores share the same id keys. Vector search returns chunk IDs; graph traversal enriches them with relationship context.
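A sketch of where the two stores meet, assuming vector hits arrive as dicts carrying the shared `id` and graph-derived facts are keyed by the same chunk ids (both the function and the record shapes are hypothetical):

```python
def enrich_with_graph(vector_hits, graph_facts):
    """Join vector-store hits with graph context via shared chunk ids.

    vector_hits: list of {"id": ..., "content": ..., "score": ...} dicts
                 returned by the vector store
    graph_facts: dict mapping chunk id -> list of relationship facts
                 gathered by Gremlin traversal
    """
    return [
        {**hit, "facts": graph_facts.get(hit["id"], [])}
        for hit in vector_hits
    ]
```

Chunks with no graph neighborhood simply get an empty `facts` list, so the vector channel still works alone.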
The Hybrid Architecture: Dual-Channel Retrieval
Here is the production pattern emerging across teams building GraphRAG on Azure:
```
User Query
  │
  ├──→ [Embed Query] ──────→ Vector Store (Cosmos NoSQL / AI Search)
  │                              │
  │                          Top-K chunks by similarity
  │                              │
  ├──→ [Extract Entities] ──→ Graph Store (Cosmos DB Gremlin)
  │                              │
  │                          1-3 hop neighborhood context
  │                              │
  └─────────────────────────→ [Merge via RRF / Reranker]
                                 │
                          [Assemble Prompt]
                                 │
                          graph triples as mini-facts
                          + retrieved text chunks
                                 │
                          [LLM generates answer]
```
Step by step:
- Embed the query with the same model used during indexing
- Channel 1 — Vector: Run approximate nearest neighbor search to retrieve the top-k most similar chunks
- Channel 2 — Graph: Extract entities from the query (NER), map them to graph vertices, traverse outward 1-3 hops to gather neighborhood context and relationships
- Merge: Use Reciprocal Rank Fusion or a reranker to combine both result sets
- Assemble the prompt: Convert graph triples into mini-facts ("checkout-service depends_on payment-gateway") and append them alongside the retrieved text chunks
- Generate: The LLM now has both semantic context and structural relationships
Seven Gremlin Query Patterns for RAG
These are the patterns you will use most when building a RAG pipeline on Cosmos DB Gremlin.
1. Entity neighborhood expansion
The foundational retrieval query. Given an entity matched by NER or vector search, expand outward to gather context.
```gremlin
g.V(['knowledge', 'entity-checkout-svc'])
  .repeat(bothE().otherV().simplePath())
  .times(2)
  .emit()
  .path()
  .by(valueMap('name', 'type', 'content'))
  .limit(50)
```
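Before the returned paths reach a prompt they have to be flattened into text. A sketch, assuming the driver hands back each `path()` entry as a list of `valueMap()` dicts (note that `valueMap()` wraps every property value in a list, hence the `[0]`; the helper name is illustrative):

```python
def path_to_chain(path_vertices, sep=" -> "):
    """Render one Gremlin path() of valueMap dicts as a readable hop chain."""
    hops = []
    for vm in path_vertices:
        name = vm.get("name", ["?"])[0]   # valueMap wraps values in lists
        vtype = vm.get("type", [""])[0]
        hops.append(f"{name} ({vtype})" if vtype else name)
    return sep.join(hops)
```

One such line per path gives the LLM a compact, citable view of the neighborhood.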
2. Document chunks with provenance
Retrieve chunks that mention an entity, along with their source document metadata.
```gremlin
g.V().hasLabel('entity').has('name', 'Zero Trust')
  .in('mentions').hasLabel('chunk')
  .as('chunk')
  .in('contains').hasLabel('document')
  .as('doc')
  .select('chunk', 'doc')
  .by(valueMap('content', 'chunk_index'))
  .by(valueMap('title', 'source', 'created_at'))
  .limit(10)
```
3. Multi-hop compliance traversal
Trace which systems implement which controls for a regulation.
```gremlin
g.V().hasLabel('regulation').has('name', 'GDPR Article 32')
  .out('requires')
  .out('satisfied_by')
  .out('implemented_in')
  .valueMap('name', 'owner', 'last_audit')
  .dedup()
```
4. Transitive dependency chain
Find all services that a given service transitively depends on, up to 4 hops deep.
```gremlin
g.V().hasLabel('service').has('name', 'checkout-api')
  .repeat(out('depends_on').simplePath())
  .until(outE('depends_on').count().is(0).or().loops().is(gt(4)))
  .emit()
  .path()
  .by('name')
```
5. Co-occurrence context (“related documents”)
Find other documents that share entities with a retrieved document — a graph-powered recommendation.
```gremlin
g.V(['knowledge', 'doc-001']).as('startDoc')
  .out('contains')
  .out('mentions')
  .in('mentions')
  .in('contains')
  .where(neq('startDoc'))
  .dedup()
  .valueMap('title', 'source')
  .limit(5)
```
6. Subgraph extraction for prompt assembly
Pull subject-predicate-object triples to assemble as mini-facts in the LLM prompt.
```gremlin
g.V().has('name', within('Alice', 'Project-Alpha', 'TeamBeta'))
  .as('subject')
  .outE().as('predicate')
  .inV().as('object')
  .select('subject', 'predicate', 'object')
  .by('name')
  .by(label)
  .by('name')
  .dedup()
  .limit(30)
```
This returns triples you can format as natural-language facts:
```
Alice manages TeamBeta
TeamBeta designed Project-Alpha
Project-Alpha depends_on AzureServiceBus
```
Each triple is a verified fact the LLM can cite — no hallucination needed.
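Turning the query's rows into those fact lines is a one-line formatting step. A sketch, assuming the driver returns each `select('subject', 'predicate', 'object')` row as a plain dict (the helper name is illustrative):

```python
def triples_to_facts(rows):
    """Format subject-predicate-object rows as one mini-fact per line."""
    return "\n".join(
        f"{row['subject']} {row['predicate']} {row['object']}" for row in rows
    )
```

The resulting block drops straight into the prompt's "verified facts" section.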
7. Temporal reasoning (latest facts)
When facts change over time, retrieve the most recent version.
```gremlin
g.V().has('name', 'Product-XYZ')
  .outE('has_status')
  .order().by('updated_at', desc)
  .limit(1)
  .inV()
  .valueMap('status', 'reason', 'updated_at')
```
Connecting to Agent Frameworks
LangChain with Cosmos DB Gremlin
LangChain provides GremlinGraph and GremlinQAChain — the LLM generates Gremlin queries from natural language questions and executes them directly.
```python
from langchain_community.graphs import GremlinGraph
from langchain_community.chains.graph_qa.gremlin import GremlinQAChain

graph = GremlinGraph(
    url="wss://myaccount.gremlin.cosmos.azure.com:443/",
    username="/dbs/graphdb/colls/knowledge",
    password="your-cosmos-key",
)

chain = GremlinQAChain.from_llm(llm=llm, graph=graph, verbose=True)
result = chain.invoke("Which documents mention the checkout service?")
```
LlamaIndex Knowledge Graph Retriever
LlamaIndex’s KnowledgeGraphRAGRetriever traverses the graph with configurable depth.
```python
from llama_index.core.retrievers import KnowledgeGraphRAGRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

retriever = KnowledgeGraphRAGRetriever(
    storage_context=storage_context,
    retriever_mode="keyword",
    include_text=True,
    graph_traversal_depth=2,
    max_knowledge_sequence=30,
)

query_engine = RetrieverQueryEngine.from_args(retriever=retriever)
response = query_engine.query("What Azure services does checkout depend on?")
```
Microsoft Agent Framework (Semantic Kernel + AutoGen)
Define a Gremlin query tool that any agent can call for structured knowledge retrieval.
```python
async def query_knowledge_graph(question: str) -> str:
    """Retrieve structured knowledge via graph traversal."""
    # 1. Extract entities from the question
    entities = extract_entities(question)

    # 2. Build Gremlin traversal
    gremlin = f"""
    g.V().has('name', within({entities}))
      .repeat(out().simplePath()).times(2).emit()
      .path().by(valueMap('name', 'type', 'content'))
      .limit(20)
    """

    # 3. Execute and format as text for LLM context
    results = await gremlin_client.submit(gremlin)
    return format_subgraph_as_text(results)
```
Register this as a tool in your agent, and it can decide when to search the knowledge graph versus running a vector lookup — choosing the best retrieval channel per question.
Microsoft GraphRAG: The Big Picture
Microsoft’s open-source GraphRAG framework takes a different approach: instead of a manually modeled knowledge graph, it uses an LLM to automatically extract entities and relationships from raw text, then clusters them hierarchically using the Leiden community detection algorithm.
It supports two search modes:
- Global Search: Synthesizes across community summaries to answer corpus-wide questions (“What are the main risks across all projects?”)
- Local Search: Expands outward from matched entities for specific lookups
The extracted entity/relationship data maps directly to Cosmos DB Gremlin vertices and edges. Teams building production GraphRAG on Azure can use the Microsoft GraphRAG extraction pipeline for ingestion and Cosmos DB Gremlin for persistent, queryable storage.
LazyGraphRAG (announced in late 2024) reduces indexing cost to 0.1% of full GraphRAG — making knowledge graph construction viable for large corpora. Enterprise adoption is accelerating, with Workday and ServiceNow integrating GraphRAG into their platforms.
Cosmos DB Gotchas for RAG Workloads
Before you put this in production, know these Cosmos DB Gremlin specifics:
- Always include the partition key in vertex lookups: `g.V(['knowledge', 'entity-id'])` instead of `g.V('entity-id')`. Without it, every lookup fans out across all partitions.
- Design edge direction for `out()` traversal: Your most frequent retrieval path should follow `out()` edges. `in()` traversals cross partition boundaries and cost significantly more RUs.
- Always use `simplePath()`: In highly connected knowledge graphs, traversals without `simplePath()` will loop indefinitely through cycles.
- No lambdas, no `match()` step: Cosmos DB Gremlin doesn’t support lambda closures or declarative `match()`. Use `as()`/`select()`/`where()` patterns instead.
- Measure with `executionProfile()`: Append `.executionProfile()` to any query to see its RU cost and execution plan before deploying to production.
When to Use Graph RAG vs. Vector-Only
Vector-only RAG is enough when:
- Questions are similarity lookups (“find content like this”)
- Your data has no meaningful entity relationships
- Speed and simplicity are the priority
Add a knowledge graph when:
- Questions require connecting multiple entities across hops
- You need audit trails explaining how answers were reached
- Your data is domain-structured (org charts, regulatory frameworks, service dependencies)
- Your agent needs persistent factual memory across sessions
- Hallucinated relationships are dangerous (medical, legal, financial)
The good news: you don’t have to start from scratch. If you already have a Cosmos DB Gremlin graph, you already have a knowledge graph. The next step is connecting it to your RAG pipeline.
Next Steps
- Download GremlinStudio — Explore your Cosmos DB knowledge graph visually, run Gremlin queries with autocomplete, and debug traversals step by step
- 20 Gremlin Query Examples for Cosmos DB — Copy-paste patterns for common graph operations
- Getting Started with GremlinStudio — Connect to your Cosmos DB account in under 2 minutes