The most advanced RAG system you can run
on a single GPU.
A real agentic query: tool selection, CRAG evaluation, ColBERT reranking, token streaming, and claim verification — all visible in real time.
Click Run Demo to see an agentic query in action
Each technique solves a specific retrieval failure. Together they form the most comprehensive RAG pipeline assembled for a single-GPU system.
LangGraph ReAct loop — the LLM autonomously decides which retrieval tools to invoke, iterating until it has sufficient evidence to answer.
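A minimal skeleton of that decide-act-observe cycle (hand-rolled here for illustration; the real system uses LangGraph, and `llm_step` stands in for the model's tool-selection call):

```python
# ReAct-style loop: the LLM either picks a tool or answers; each tool
# result is appended to the evidence the next decision sees.
def react_loop(question, llm_step, tools, max_steps=5):
    evidence = []
    for _ in range(max_steps):
        action = llm_step(question, evidence)  # decision made by the LLM
        if action["type"] == "answer":
            return action["text"]
        result = tools[action["tool"]](action["input"])
        evidence.append(result)
    return "insufficient evidence"
```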
Anthropic's contextual retrieval: an LLM prepends chunk-specific context to each chunk before embedding. 49% fewer retrieval failures, 67% fewer with reranking.
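The idea fits in a few lines; this sketch assumes hypothetical `llm` and `embed` callables rather than Forge V5's actual API:

```python
# Contextual retrieval: ask an LLM for a one-sentence situating context,
# prepend it, and embed the combined text instead of the raw chunk.
def contextualize_chunk(doc_title: str, chunk: str, llm) -> str:
    prompt = (
        f"Document: {doc_title}\n\nChunk: {chunk}\n\n"
        "Give a one-sentence context situating this chunk in the document."
    )
    context = llm(prompt)
    return f"{context}\n\n{chunk}"

# The contextualized text, not the raw chunk, is what gets embedded:
# vector = embed(contextualize_chunk(title, chunk, llm))
```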
Cross-encoder evaluates every retrieved document as CORRECT, AMBIGUOUS, or INCORRECT before it reaches generation. Re-retrieves on failure.
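The gating logic looks roughly like this; the scorer is assumed to return a relevance score in [0, 1], and the thresholds are illustrative, not the system's actual values:

```python
# CRAG-style grading: bucket each retrieved doc, drop INCORRECT ones,
# and signal re-retrieval when nothing graded CORRECT.
def crag_grade(score: float, hi: float = 0.7, lo: float = 0.3) -> str:
    if score >= hi:
        return "CORRECT"
    if score <= lo:
        return "INCORRECT"
    return "AMBIGUOUS"

def filter_docs(docs, scorer, query):
    graded = [(d, crag_grade(scorer(query, d))) for d in docs]
    kept = [d for d, g in graded if g != "INCORRECT"]
    needs_retry = not any(g == "CORRECT" for _, g in graded)
    return kept, needs_retry  # retry triggers re-retrieval upstream
```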
Token-level MaxSim scoring catches specific facts that dense vectors miss.
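MaxSim itself is compact — for each query token, take its best-matching document token, then sum those maxima (assuming rows are L2-normalized so dot products are cosine similarities):

```python
import numpy as np

# ColBERT-style late interaction: token-level similarity matrix,
# row-wise max (best doc token per query token), then sum.
def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    sims = query_vecs @ doc_vecs.T        # (n_query, n_doc) similarities
    return float(sims.max(axis=1).sum())  # best match per query token
```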
Atomic factual claims indexed as standalone searchable units for precision retrieval.
L0 doc summaries → L1 sections → L2 semantic chunks → L3 propositions.
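A toy version of that four-level tree (names are this sketch's, not the system's internals):

```python
from dataclasses import dataclass, field

# Each node records its hierarchy level and links both ways,
# so retrieval can climb from a proposition back to its document.
@dataclass
class Node:
    level: int  # 0=doc summary, 1=section, 2=semantic chunk, 3=proposition
    text: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)

def add_child(parent: Node, level: int, text: str) -> Node:
    child = Node(level, text, parent)
    parent.children.append(child)
    return child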
One model, three vector types: dense + sparse + ColBERT multi-vector.
Entity relationships for structural queries via graph traversal.
Post-generation claim-by-claim audit against source documents.
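The audit loop reduces to: split the answer into claims, then check each against the sources. Here `entails` stands in for an NLI-style judge that returns True when a source supports a claim:

```python
# Post-generation verification: a claim passes if any source entails it.
def verify_answer(claims, sources, entails):
    report = {}
    for claim in claims:
        report[claim] = any(entails(src, claim) for src in sources)
    return report  # claim -> supported?
```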
Multi-dimensional reliability scoring across retrieval, CRAG, and verification.
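One way such a composite score can be formed is a weighted combination of the per-stage signals; the dimensions and weights below are illustrative only:

```python
# Collapse per-stage signals (each in [0, 1]) into one reliability score.
def reliability(retrieval: float, crag: float, verification: float,
                weights=(0.3, 0.3, 0.4)) -> float:
    scores = (retrieval, crag, verification)
    return sum(w * s for w, s in zip(weights, scores))
```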
Complex questions split into targeted sub-queries for parallel retrieval.
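A sketch of the fan-out, assuming a hypothetical LLM-backed `decompose` and a per-sub-query `retrieve` callable:

```python
from concurrent.futures import ThreadPoolExecutor

# Decompose the question, retrieve each sub-query in parallel,
# then merge results in sub-query order without duplicates.
def retrieve_decomposed(question, decompose, retrieve, max_workers=4):
    sub_queries = decompose(question)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(retrieve, sub_queries))
    seen, merged = set(), []
    for docs in results:
        for d in docs:
            if d not in seen:
                seen.add(d)
                merged.append(d)
    return merged
```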
Generate a hypothetical ideal answer, embed it, search for real matches.
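That is HyDE in three steps; `llm` and `embed` below are assumed callables, not Forge V5's API, and document rows are assumed normalized:

```python
import numpy as np

# HyDE: rank real documents against the embedding of a hypothetical
# answer instead of the (often under-specified) query itself.
def hyde_search(query, llm, embed, doc_vecs, top_k=3):
    hypothetical = llm(f"Write a short passage answering: {query}")
    q_vec = embed(hypothetical)
    sims = doc_vecs @ q_vec            # cosine if rows are normalized
    return np.argsort(-sims)[:top_k]   # indices of best-matching docs
```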
Follow cross-references iteratively across document boundaries.
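The hop loop can be sketched as a bounded breadth-first expansion; `links` here is an illustrative mapping from a document id to the ids it references:

```python
# Follow cross-references hop by hop until nothing new turns up
# or the hop budget runs out.
def follow_refs(seed_ids, links, max_hops=3):
    seen = set(seed_ids)
    frontier = list(seed_ids)
    for _ in range(max_hops):
        nxt = [t for d in frontier for t in links.get(d, []) if t not in seen]
        if not nxt:
            break
        seen.update(nxt)
        frontier = nxt
    return seen
```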
Match a chunk, return the parent section for full context fidelity.
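This small-to-big pattern is a lookup on top of chunk search; the mapping shown is illustrative:

```python
# Match fine-grained chunks, then swap each hit for its parent section
# so generation sees full surrounding context.
def parent_retrieve(query, search_chunks, chunk_to_parent, sections):
    hits = search_chunks(query)                      # matched chunk ids
    parent_ids = {chunk_to_parent[c] for c in hits}  # dedupe shared parents
    return [sections[p] for p in sorted(parent_ids)]
```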
14 techniques. One GPU. No open-source system combines all of these.
Built by hollowed_eyes · Forge V5 is a portfolio project demonstrating world-class RAG engineering.