
GraphRAG with Cosmos DB for AI Agents

Vector-only RAG fails on multi-hop queries. Learn how to build a hybrid graph + vector retrieval system on Azure Cosmos DB Gremlin API — with practical Gremlin patterns for AI agents.

GremlinStudio Team

Your RAG pipeline retrieves the right chunks and your LLM still hallucinates the connections between them. Sound familiar?

Standard Retrieval-Augmented Generation embeds documents into vectors, finds the top-k most similar chunks, and hopes the LLM can piece together the answer. For simple factual lookups this works fine. But the moment a question requires chaining through multiple entities — connecting Alice to her team, the team to the feature they built, the feature to the Azure services it depends on — vector similarity collapses.

Benchmarks confirm this. In the Lettria evaluation, GraphRAG answered 81.67% of queries correctly versus 57.50% for vector-only RAG. The FalkorDB/Diffbot benchmark found that vector RAG scored literally 0% accuracy when queries involved more than five entities. Not low — zero.

The fix is not a bigger model. It is a knowledge graph.

Why Graphs Fix Multi-Hop Failure

A graph database stores entities as vertices and their connections as edges. Relationships become first-class citizens in your data layer, persisted and queryable — not patterns the LLM has to infer from unstructured text.

Consider this question: “Which Azure services does the feature designed by Alice’s team depend on?”

That requires three hops: Alice → manages → Team → designed → Feature → depends_on → Azure Services. In Gremlin, it is a single traversal:

g.V().has('name', 'Alice')
  .out('manages')
  .out('designed')
  .out('depends_on')
  .hasLabel('azure_service')
  .values('name')

No embeddings, no similarity scores, no hallucinated links. The database walks the explicit path and returns the answer.

| Dimension | Vector Only | Graph + Vector |
|---|---|---|
| Multi-hop reasoning | Poor (requires chaining) | Native traversal |
| Hallucination risk | Higher (LLM infers links) | Lower (relationships are stored) |
| Explainability | Similarity score only | Traversal path = citation |
| Setup complexity | Low | Medium |
| Unstructured text | Excellent | Needs entity extraction first |

The key insight: you don’t have to choose. The production pattern combines both.

Modeling a Knowledge Graph for RAG on Cosmos DB

Azure Cosmos DB’s Gremlin API uses the Apache TinkerPop property graph model. Every vertex has an id, a label (entity type), a partition key, and arbitrary key-value properties. Edges are directed, labeled, and can carry properties too.
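Concretely, the property graph model can be pictured as plain records. A minimal sketch with illustrative field names (not Cosmos DB's actual wire format):

```python
# A property-graph vertex and edge as plain records (field names illustrative):
vertex = {
    "id": "entity-checkout-svc",
    "label": "entity",               # entity type
    "partition": "knowledge",        # Cosmos DB partition key
    "properties": {"name": "checkout-service", "type": "service"},
}

edge = {
    "label": "mentions",
    "out_v": "chunk-001-01",         # source vertex; edges are directed
    "in_v": "entity-checkout-svc",   # target vertex
    "properties": {},                # edges can carry properties too
}
```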

Here is a practical schema for a document RAG knowledge graph:

Vertex labels:

| Label | Purpose | Key Properties |
|---|---|---|
| document | Source file or article | title, source, created_at |
| chunk | Text segment | content, chunk_index, token_count |
| entity | Named entity (person, service, concept) | name, type, description |
| category | Classification bucket | name |

Edge labels:

| Edge | Direction | Purpose |
|---|---|---|
| contains | document → chunk | Document contains text chunk |
| mentions | chunk → entity | Chunk references an entity |
| related_to | entity → entity | Semantic relationship |
| depends_on | entity → entity | Dependency link |
| belongs_to | entity → category | Classification |
| next_chunk | chunk → chunk | Sequential ordering |

Creating the graph

// Create a document vertex
g.addV('document')
  .property('id', 'doc-001')
  .property('partition', 'knowledge')
  .property('title', 'Q3 Infrastructure Risk Assessment')
  .property('source', 'internal-wiki')
  .property('created_at', '2026-01-15')

// Create a chunk vertex
g.addV('chunk')
  .property('id', 'chunk-001-01')
  .property('partition', 'knowledge')
  .property('content', 'The primary risk is network latency between the checkout service and the payment gateway...')
  .property('chunk_index', 1)
  .property('token_count', 245)

// Link document to chunk
g.V(['knowledge', 'doc-001'])
  .addE('contains')
  .to(g.V(['knowledge', 'chunk-001-01']))
  .property('order', 1)

// Create an entity and link it
g.addV('entity')
  .property('id', 'entity-checkout-svc')
  .property('partition', 'knowledge')
  .property('name', 'checkout-service')
  .property('type', 'service')

g.V(['knowledge', 'chunk-001-01'])
  .addE('mentions')
  .to(g.V(['knowledge', 'entity-checkout-svc']))

Partition key strategy

Cosmos DB requires a partition key on every vertex. For RAG knowledge graphs, partition by domain or topic (e.g., property('partition', 'knowledge')) so traversals stay within a single partition.

Critical rule: out() traversals stay within the source partition (cheap). in() traversals fan out across all partitions (expensive). Design your edge direction so the primary retrieval path uses out().

Where do embeddings go?

Cosmos DB Gremlin stores primitive properties — not 1536-float vectors. The recommended production pattern is split storage:

  • Cosmos DB Gremlin: Knowledge graph (vertices, edges, relationships, metadata)
  • Cosmos DB NoSQL (with DiskANN vector index) or Azure AI Search: Same chunk IDs with embedding vectors

Both stores share the same id keys. Vector search returns chunk IDs; graph traversal enriches them with relationship context.
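In code, that join is a dictionary lookup on the shared IDs. A minimal sketch (function and variable names are illustrative, not a library API):

```python
def enrich_with_graph(vector_hits, graph_context):
    """Join vector search results to graph facts via shared chunk IDs.

    vector_hits: list of (chunk_id, similarity_score) from the vector store
    graph_context: dict of chunk_id -> relationship facts from Gremlin
    """
    return [
        {"chunk_id": cid, "score": score, "facts": graph_context.get(cid, [])}
        for cid, score in vector_hits
    ]

# Example join on a shared ID:
hits = [("chunk-001-01", 0.91)]
facts = {"chunk-001-01": ["checkout-service depends_on payment-gateway"]}
enriched = enrich_with_graph(hits, facts)
```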

The Hybrid Architecture: Dual-Channel Retrieval

Here is the production pattern emerging across teams building GraphRAG on Azure:

User Query

    ├──→ [Embed Query] ──→ Vector Store (Cosmos NoSQL / AI Search)
    │                           │
    │                      Top-K chunks by similarity
    │                           │
    ├──→ [Extract Entities] ──→ Graph Store (Cosmos DB Gremlin)
    │                           │
    │                      1-3 hop neighborhood context
    │                           │
    └──────────────────────→ [Merge via RRF / Reranker]

                           [Assemble Prompt]

                           graph triples as mini-facts
                           + retrieved text chunks

                           [LLM generates answer]

Step by step:

  1. Embed the query with the same model used during indexing
  2. Channel 1 — Vector: Run approximate nearest neighbor search to retrieve the top-k most similar chunks
  3. Channel 2 — Graph: Extract entities from the query (NER), map them to graph vertices, traverse outward 1-3 hops to gather neighborhood context and relationships
  4. Merge: Use Reciprocal Rank Fusion or a reranker to combine both result sets
  5. Assemble the prompt: Convert graph triples into mini-facts ("checkout-service depends_on payment-gateway") and append alongside text chunks
  6. Generate: The LLM now has both semantic context and structural relationships
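Step 4, the merge, is commonly Reciprocal Rank Fusion. A minimal sketch (k=60 is the conventional constant; names here are illustrative):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked ID lists from the vector and graph channels.

    Each item's score is the sum of 1 / (k + rank) over every list it
    appears in, so items surfaced by both channels rise to the top.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# chunk-b appears in both channels, so it outranks each channel's top hit:
vector_hits = ["chunk-a", "chunk-b", "chunk-c"]
graph_hits = ["chunk-b", "chunk-d"]
fused = reciprocal_rank_fusion([vector_hits, graph_hits])
```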

Seven Gremlin Query Patterns for RAG

These are the patterns you will use most when building a RAG pipeline on Cosmos DB Gremlin.

1. Entity neighborhood expansion

The foundational retrieval query. Given an entity matched by NER or vector search, expand outward to gather context.

g.V(['knowledge', 'entity-checkout-svc'])
  .repeat(bothE().otherV().simplePath())
  .times(2)
  .emit()
  .path()
  .by(valueMap('name', 'type', 'content'))
  .limit(50)

2. Document chunks with provenance

Retrieve chunks that mention an entity, along with their source document metadata.

g.V().hasLabel('entity').has('name', 'Zero Trust')
  .in('mentions').hasLabel('chunk')
  .as('chunk')
  .in('contains').hasLabel('document')
  .as('doc')
  .select('chunk', 'doc')
  .by(valueMap('content', 'chunk_index'))
  .by(valueMap('title', 'source', 'created_at'))
  .limit(10)

3. Multi-hop compliance traversal

Trace which systems implement which controls for a regulation.

g.V().hasLabel('regulation').has('name', 'GDPR Article 32')
  .out('requires')
  .out('satisfied_by')
  .out('implemented_in')
  .valueMap('name', 'owner', 'last_audit')
  .dedup()

4. Transitive dependency chain

Find all services that a given service transitively depends on, up to 4 hops deep.

g.V().hasLabel('service').has('name', 'checkout-api')
  .repeat(out('depends_on').simplePath())
  .until(outE('depends_on').count().is(0).or().loops().is(gt(4)))
  .emit()
  .path()
  .by('name')

5. Related document discovery

Find other documents that share entities with a retrieved document — a graph-powered recommendation.

g.V(['knowledge', 'doc-001']).as('startDoc')
  .out('contains')
  .out('mentions')
  .in('mentions')
  .in('contains')
  .where(neq('startDoc'))
  .dedup()
  .valueMap('title', 'source')
  .limit(5)

6. Subgraph extraction for prompt assembly

Pull subject-predicate-object triples to assemble as mini-facts in the LLM prompt.

g.V().has('name', within('Alice', 'Project-Alpha', 'TeamBeta'))
  .as('subject')
  .outE().as('predicate')
  .inV().as('object')
  .select('subject', 'predicate', 'object')
  .by('name')
  .by(label)
  .by('name')
  .dedup()
  .limit(30)

This returns triples you can format as natural-language facts:

Alice manages TeamBeta
TeamBeta designed Project-Alpha
Project-Alpha depends_on AzureServiceBus

Each triple is a verified fact the LLM can cite — no hallucination needed.
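Turning selected triples into those mini-fact lines is straightforward; a sketch assuming the triples arrive as (subject, predicate, object) tuples:

```python
def triples_to_facts(triples):
    """Render (subject, predicate, object) triples as one mini-fact per line."""
    return "\n".join(f"{s} {p} {o}" for s, p, o in triples)

facts = triples_to_facts([
    ("Alice", "manages", "TeamBeta"),
    ("TeamBeta", "designed", "Project-Alpha"),
])
# "Alice manages TeamBeta\nTeamBeta designed Project-Alpha"
```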

7. Temporal reasoning (latest facts)

When facts change over time, retrieve the most recent version.

g.V().has('name', 'Product-XYZ')
  .outE('has_status')
  .order().by('updated_at', desc)
  .limit(1)
  .inV()
  .valueMap('status', 'reason', 'updated_at')

Connecting to Agent Frameworks

LangChain with Cosmos DB Gremlin

LangChain provides GremlinGraph and GremlinQAChain — the LLM generates Gremlin queries from natural language questions and executes them directly.

from langchain_community.graphs import GremlinGraph
from langchain_community.chains.graph_qa.gremlin import GremlinQAChain

graph = GremlinGraph(
    url="wss://myaccount.gremlin.cosmos.azure.com:443/",
    username="/dbs/graphdb/colls/knowledge",
    password="your-cosmos-key"
)

chain = GremlinQAChain.from_llm(llm=llm, graph=graph, verbose=True)
result = chain.invoke("Which documents mention the checkout service?")

LlamaIndex Knowledge Graph Retriever

LlamaIndex’s KnowledgeGraphRAGRetriever traverses the graph with configurable depth.

from llama_index.core.retrievers import KnowledgeGraphRAGRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

retriever = KnowledgeGraphRAGRetriever(
    storage_context=storage_context,
    retriever_mode="keyword",
    include_text=True,
    graph_traversal_depth=2,
    max_knowledge_sequence=30,
)

query_engine = RetrieverQueryEngine.from_args(retriever=retriever)
response = query_engine.query("What Azure services does checkout depend on?")

Microsoft Agent Framework (Semantic Kernel + AutoGen)

Define a Gremlin query tool that any agent can call for structured knowledge retrieval.

async def query_knowledge_graph(question: str) -> str:
    """Retrieve structured knowledge via graph traversal."""
    # 1. Extract entities from the question (NER or LLM-based)
    entities = extract_entities(question)

    # 2. Build the Gremlin traversal; quote each entity name explicitly
    #    (interpolating a Python list directly would emit brackets and
    #    unescaped quotes the server may reject)
    name_list = ", ".join(f"'{e}'" for e in entities)
    gremlin = f"""
        g.V().has('name', within({name_list}))
         .repeat(out().simplePath()).times(2).emit()
         .path().by(valueMap('name', 'type', 'content'))
         .limit(20)
    """

    # 3. Execute and format the subgraph as text for the LLM context
    results = await gremlin_client.submit(gremlin)
    return format_subgraph_as_text(results)

Register this as a tool in your agent, and it can decide when to search the knowledge graph versus running a vector lookup — choosing the best retrieval channel per question.

Microsoft GraphRAG: The Big Picture

Microsoft’s open-source GraphRAG framework takes a different approach: instead of a manually modeled knowledge graph, it uses an LLM to automatically extract entities and relationships from raw text, then clusters them hierarchically using the Leiden community detection algorithm.

It supports two search modes:

  • Global Search: Synthesizes across community summaries to answer corpus-wide questions (“What are the main risks across all projects?”)
  • Local Search: Expands outward from matched entities for specific lookups

The extracted entity/relationship data maps directly to Cosmos DB Gremlin vertices and edges. Teams building production GraphRAG on Azure can use the Microsoft GraphRAG extraction pipeline for ingestion and Cosmos DB Gremlin for persistent, queryable storage.
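One way to bridge the two is to render each extracted entity as an addV statement for string submission. A sketch under assumptions: the record shape and helper name are hypothetical, and production code should parameterize or sanitize values rather than splice strings.

```python
def entity_to_addv(entity: dict, partition: str = "knowledge") -> str:
    """Build a Gremlin addV statement from an extracted entity record."""
    def esc(value: str) -> str:
        # Escape single quotes so values can sit inside a string-built query
        return str(value).replace("'", "\\'")

    return (
        "g.addV('entity')"
        f".property('id', '{esc(entity['id'])}')"
        f".property('partition', '{partition}')"
        f".property('name', '{esc(entity['name'])}')"
        f".property('type', '{esc(entity['type'])}')"
    )

stmt = entity_to_addv({"id": "e1", "name": "checkout-service", "type": "service"})
```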

LazyGraphRAG, a lighter-weight variant, reduces indexing cost to 0.1% of full GraphRAG, making knowledge graph construction viable for large corpora. Enterprise adoption is accelerating, with Workday and ServiceNow integrating GraphRAG into their platforms.

Cosmos DB Gotchas for RAG Workloads

Before you put this in production, know these Cosmos DB Gremlin specifics:

  • Always include the partition key in vertex lookups: g.V(['knowledge', 'entity-id']) instead of g.V('entity-id'). Without it, every lookup fans out across all partitions.
  • Design edge direction for out() traversal: Your most frequent retrieval path should follow out() edges. in() traversals cross partition boundaries and cost significantly more RUs.
  • Always use simplePath(): In highly connected knowledge graphs, traversals without simplePath() will loop indefinitely through cycles.
  • No lambdas, no match() step: Cosmos DB Gremlin doesn’t support lambda closures or declarative match(). Use as()/select()/where() patterns instead.
  • Measure with executionProfile(): Append .executionProfile() to any query to see its RU cost and execution plan before deploying to production.

When to Use Graph RAG vs. Vector-Only

Vector-only RAG is enough when:

  • Questions are similarity lookups (“find content like this”)
  • Your data has no meaningful entity relationships
  • Speed and simplicity are the priority

Add a knowledge graph when:

  • Questions require connecting multiple entities across hops
  • You need audit trails explaining how answers were reached
  • Your data is domain-structured (org charts, regulatory frameworks, service dependencies)
  • Your agent needs persistent factual memory across sessions
  • Hallucinated relationships are dangerous (medical, legal, financial)

The good news: you don’t have to start from scratch. If you already have a Cosmos DB Gremlin graph, you already have a knowledge graph. The next step is connecting it to your RAG pipeline.

Next Steps