System Overview
Forge V5 is a desktop-native agentic RAG system built with four major components: a Tauri + React frontend, a FastAPI + LangGraph backend, Qdrant for vector storage, and Redis for caching and graph adjacency. Everything runs on a single machine with a 16GB VRAM GPU.
Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────┐
│ USER MACHINE │
│ │
│ ┌───────────────────────────┐ ┌─────────────────────────────┐ │
│ │ TAURI DESKTOP APP │ │ FASTAPI BACKEND │ │
│ │ │ HTTP/ │ (Python 3.11+) │ │
│ │ ┌─────────────────────┐ │ SSE │ │ │
│ │ │ React + TypeScript │ │◄────────│ ┌───────────────────────┐ │ │
│ │ │ │ │────────▶│ │ API Router │ │ │
│ │ │ - Chat UI │ │ │ │ /api/query │ │ │
│ │ │ - Document Manager │ │ │ │ /api/documents │ │ │
│ │ │ - Settings Panel │ │ │ │ /api/ingest │ │ │
│ │ │ - Agent Visualizer │ │ │ │ /api/settings │ │ │
│ │ └─────────────────────┘ │ │ └───────┬───────────────┘ │ │
│ │ │ │ │ │ │
│ │ ┌─────────────────────┐ │ │ ┌───────▼───────────────┐ │ │
│ │ │ Rust Core (Tauri) │ │ │ │ QUERY ENGINE │ │ │
│ │ │ - Window Management│ │ │ │ │ │ │
│ │ │ - File System │ │ │ │ ┌─────────────────┐ │ │ │
│ │ │ - Native Dialogs │ │ │ │ │ LangGraph Agent │ │ │ │
│ │ └─────────────────────┘ │ │ │ │ (ForgeAgent) │ │ │ │
│ └───────────────────────────┘ │ │ └────────┬────────┘ │ │ │
│ │ │ │ │ │ │
│ │ │ ┌────────▼────────┐ │ │ │
│ │ │ │ Tool Registry │ │ │ │
│ │ │ │ - search │ │ │ │
│ │ │ │ - propositions │ │ │ │
│ │ │ │ - graph │ │ │ │
│ │ │ │ - rerank │ │ │ │
│ │ │ │ - hyde │ │ │ │
│ │ │ │ - decompose │ │ │ │
│ │ │ │ - generate │ │ │ │
│ │ │ └─────────────────┘ │ │ │
│ │ └───────────────────────┘ │ │
│ │ │ │
│ │ ┌───────────────────────┐ │ │
│ │ │ INGESTION ENGINE │ │ │
│ │ │ - Parser │ │ │
│ │ │ - HierarchyBuilder │ │ │
│ │ │ - ContextualEnricher │ │ │
│ │ │ - PropositionExtract │ │ │
│ │ │ - GraphExtractor │ │ │
│ │ │ - BGEm3Embedder │ │ │
│ │ └───────┬───────────────┘ │ │
│ │ │ │ │
│ │ ┌───────▼───────┐ ┌─────┐ │ │
│ │ │ CRAGEvaluator │ │ LLM │ │ │
│ │ │ (cross-encoder│ │ │ │ │
│ │ │ on CPU) │ │llama│ │ │
│ │ └───────────────┘ │.cpp │ │ │
│ │ │ │ │ │
│ └─────────────────────│─────┘ │ │
│ │ │ │
│ ┌─────────────────────┐ ┌─────────────────────┐ ┌──────▼──────┐ │
│ │ QDRANT │ │ REDIS │ │ GPU │ │
│ │ Vector Database │ │ Cache + Graph │ │ 10-14GB │ │
│ │ │ │ │ │ VRAM │ │
│ │ - Dense vectors │ │ - Query cache │ │ │ │
│ │ - Sparse vectors │ │ - Graph adjacency │ │ LLM only │ │
│ │ - ColBERT vectors │ │ - Session state │ │ Everything │ │
│ │ - Payload metadata │ │ │ │ else on CPU│ │
│ │ │ │ Port: 6379 │ │ │ │
│ │ Port: 6333 │ └─────────────────────┘ └─────────────┘ │
│ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Component Responsibilities
Frontend: Tauri + React
| Component | File Path | Responsibility |
|---|---|---|
| Chat UI | frontend/src/components/Chat.tsx | Query input, streaming response display, source cards |
| Document Manager | frontend/src/components/Documents.tsx | Upload, list, delete documents; ingestion progress |
| Agent Visualizer | frontend/src/components/AgentView.tsx | Real-time view of agent tool calls, CRAG results, iterations |
| Settings Panel | frontend/src/components/Settings.tsx | Model selection, technique toggles, config editing |
| SSE Client | frontend/src/lib/sse.ts | Parses SSE events from backend streaming endpoint |
| Tauri Commands | src-tauri/src/main.rs | File dialogs, system info, window management |
Backend: FastAPI + LangGraph
| Component | File Path | Responsibility |
|---|---|---|
| API Router | forge/api/router.py | REST endpoints for query, documents, settings |
| Streaming | forge/api/streaming.py | SSE event formatting and delivery |
| ForgeAgent | forge/retrieval/agent.py | LangGraph StateGraph orchestrating all tools |
| Hybrid Search | forge/retrieval/search.py | BGE-M3 dense + sparse search with RRF fusion |
| ColBERT Reranker | forge/retrieval/rerank.py | Token-level MaxSim reranking |
| CRAG Evaluator | forge/retrieval/crag.py | Cross-encoder quality gate |
| Graph Traversal | forge/retrieval/graph.py | Redis-backed knowledge graph queries |
| HyDE | forge/retrieval/hyde.py | Hypothetical document embedding search |
| Generator | forge/generation/generator.py | LLM answer generation with context |
| Verifier | forge/verification/verifier.py | Post-generation claim verification |
| Ingestion Pipeline | forge/ingestion/pipeline.py | Orchestrates parse → chunk → embed → store |
| Document Parser | forge/ingestion/parser.py | PDF, DOCX, TXT extraction |
| Hierarchy Builder | forge/ingestion/hierarchy.py | L0-L3 level construction |
| Contextual Enricher | forge/ingestion/contextual.py | LLM context prefix generation |
| Proposition Extractor | forge/ingestion/propositions.py | Atomic claim extraction |
| Graph Extractor | forge/ingestion/graph_extractor.py | Entity + relationship extraction |
| BGE-M3 Embedder | forge/ingestion/embedder.py | Tri-modal vector generation |
| LLM Interface | forge/llm/llama_cpp.py | llama.cpp Python bindings |
| Config Manager | forge/config.py | config.yml parsing + env var overrides |
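The Hybrid Search component fuses the dense and sparse result lists with Reciprocal Rank Fusion (RRF). A minimal, self-contained sketch of RRF (the actual implementation in forge/retrieval/search.py may differ in details such as the `k` constant and ID types):

```python
# Reciprocal Rank Fusion: merge several ranked lists of document IDs.
# Each list contributes 1 / (k + rank + 1) per document; k=60 is the
# value commonly used in the RRF literature.

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked ID lists into one, best-first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # BGE-M3 dense ranking (example IDs)
sparse = ["doc_b", "doc_d", "doc_a"]  # BGE-M3 sparse ranking (example IDs)
fused = rrf_fuse([dense, sparse])
# doc_b wins: it ranks high in both lists, while doc_a's #1 dense slot
# is diluted by its #3 sparse slot.
```

RRF needs only ranks, not raw scores, which is why it works well for fusing dense cosine scores with sparse lexical scores that live on different scales.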
Data Layer
| Service | Role | Port | Data |
|---|---|---|---|
| Qdrant | Vector database | 6333 | Dense, sparse, ColBERT vectors + payload metadata |
| Redis | Cache + graph | 6379 | Query cache, graph adjacency lists, session state |
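Redis stores the knowledge graph as adjacency lists, one set per entity. The key names below are illustrative, not the actual schema in forge/retrieval/graph.py, and a plain dict stands in for a `redis.Redis` client (which would use `sadd`/`smembers`) so the sketch is self-contained:

```python
# Adjacency-list layout sketch: one outgoing and one incoming set per
# entity, with "relation:target" members. A dict of sets stands in for
# Redis here; the key/member format is an assumed example.

store: dict[str, set[str]] = {}

def add_edge(src: str, rel: str, dst: str) -> None:
    """Record a directed, labeled edge in both directions for fast lookup."""
    store.setdefault(f"graph:out:{src}", set()).add(f"{rel}:{dst}")
    store.setdefault(f"graph:in:{dst}", set()).add(f"{rel}:{src}")

def neighbors(entity: str) -> set[str]:
    """Targets reachable in one hop, relation labels stripped."""
    return {m.split(":", 1)[1] for m in store.get(f"graph:out:{entity}", set())}

add_edge("Qdrant", "stores", "vectors")
add_edge("Qdrant", "runs_on", "CPU")
```

Keeping both `graph:out:*` and `graph:in:*` sets doubles the write cost but makes one-hop traversal in either direction a single O(1) key lookup, which is what gives sub-millisecond graph queries.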
VRAM Budget
The entire system is designed to fit in 16GB of VRAM:
┌──────────────────────────────────────────────────┐
│ 16GB VRAM GPU │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ LLM (llama.cpp, Q4_K_M) 10-14 GB │ │
│ │ ════════════════════════════════════════════ │ │
│ │ │ │
│ └──────────────────────────────────────────────┘ │
│ ┌──────────────────┐ │
│ │ CUDA overhead │ ~500MB │
│ └──────────────────┘ │
│ ┌───────────┐ │
│ │ KV Cache │ 1-2 GB (depends on context size) │
│ └───────────┘ │
│ │
│ Everything else runs on CPU + system RAM: │
│ - BGE-M3 embedding: CPU (~2GB RAM) │
│ - Cross-encoder (CRAG): CPU (~500MB RAM) │
│ - Qdrant: CPU + RAM (~2-8GB RAM for vectors) │
│ - Redis: CPU + RAM (~100MB-1GB RAM) │
└──────────────────────────────────────────────────┘
Why only LLM on GPU?
The LLM is the most compute-intensive component — GPU offloading reduces generation from ~30 tokens/sec (CPU) to ~80+ tokens/sec (GPU). BGE-M3 at ~50ms per query on CPU is already fast enough. ColBERT reranking is matrix math that’s efficient on CPU for the small batch sizes involved. This design maximizes the quality of the most expensive operation (generation) while keeping everything else snappy.
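The 1-2 GB KV-cache figure in the budget above can be sanity-checked with the standard formula (2 tensors, keys and values, per layer). The model dimensions below are example values for an ~8B grouped-query-attention model, not Forge's actual model:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size: 2 (K and V) x layers x KV heads x head dim x context,
    at bytes_per_elem per element (2 for fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Illustrative dimensions for an ~8B GQA model at 8K context:
gib = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192) / 2**30
# -> 1.0 GiB, consistent with the 1-2 GB range in the VRAM budget
```

Doubling the context window doubles this figure linearly, which is why the budget expresses the KV cache as a range rather than a fixed number.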
Technology Choices
| Decision | Choice | Why |
|---|---|---|
| Desktop framework | Tauri (Rust + React) | ~10MB binary, native file system access, no Electron overhead |
| Backend framework | FastAPI | Async Python, great SSE support, automatic OpenAPI docs |
| Agent framework | LangGraph | Explicit state machine, tool-calling, streaming support |
| LLM runtime | llama.cpp | Best GPU utilization for GGUF models, C++ performance |
| Embedding model | BGE-M3 | Single model for 3 vector types, MIT license, excellent quality |
| Vector database | Qdrant | Multi-vector support, sparse vectors, rich filtering, fast |
| Cache + graph | Redis | Sub-ms reads, pub/sub for events, lightweight |
| CRAG model | ms-marco-MiniLM-L-12-v2 | Fast cross-encoder, well-validated on retrieval tasks |
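LangGraph's explicit state machine drives the agent's plan → execute → evaluate loop. A dependency-free sketch of that control flow, with a stubbed planner, stubbed tools, and hard-coded CRAG scores (the real ForgeAgent selects tools via the LLM and scores evidence with the cross-encoder inside a LangGraph StateGraph):

```python
# Minimal sketch of the agentic loop: pick a tool, run it, gate the
# result with a CRAG-style relevance score, repeat until evidence is
# sufficient or the iteration budget is spent. All values are stubs.

CRAG_THRESHOLD = 0.7   # example gate; the real threshold is configurable
MAX_ITERATIONS = 3

def plan(state: dict) -> str:
    """PLAN: the real agent asks the LLM; here, rerank after searching."""
    return "rerank" if "search" in state["trace"] else "search"

def execute(tool: str, query: str) -> tuple[str, float]:
    """EXECUTE: stub tools return (chunk, crag_score)."""
    return {"search": ("raw chunk", 0.5), "rerank": ("best chunk", 0.9)}[tool]

def run_agent(query: str) -> dict:
    state = {"evidence": [], "trace": []}
    for _ in range(MAX_ITERATIONS):
        tool = plan(state)                    # PLAN
        chunk, score = execute(tool, query)   # EXECUTE
        state["trace"].append(tool)
        if score >= CRAG_THRESHOLD:           # EVALUATE: CRAG gate
            state["evidence"].append(chunk)
            break                             # evidence sufficient
    return state

state = run_agent("example query")
# trace: ["search", "rerank"]; only the reranked chunk passes the gate
```

The CRAG gate is what makes the loop corrective: a low-scoring retrieval is discarded and the planner gets another turn, instead of low-quality context flowing straight into generation.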
Data Flow Summary
Query Request Lifecycle
1. User types query in Tauri app
2. React sends POST to /api/query/stream
3. FastAPI creates SSE response
4. ForgeAgent initializes LangGraph state
5. Agent loop:
a. PLAN → select tool (via LLM)
b. EXECUTE → run tool (search/rerank/graph/etc.)
c. EVALUATE → CRAG gate + evidence check
d. Repeat until evidence sufficient or max iterations
6. GENERATE → LLM synthesizes answer, tokens streamed via SSE
7. VERIFY → claim-by-claim audit
8. SSE done event with metadata
9. React renders response with sources and confidence
Document Upload Lifecycle
1. User selects file in Tauri native dialog
2. Tauri reads file, sends to /api/documents/upload
3. Backend queues ingestion job
4. Pipeline runs:
a. Parse → raw text
b. Hierarchy → L0 summary, L1 sections, L2 chunks
c. Contextual enrichment → context prefix per L2 chunk
d. Proposition extraction → L3 atomic claims
e. Graph extraction → entities + relationships
f. BGE-M3 embedding → dense + sparse + ColBERT per point
g. Qdrant upsert → all vectors + payload
h. Redis store → graph adjacency lists
5. Status streamed to frontend via polling
Next Steps
- Ingestion Pipeline — Deep dive into document processing
- Query Pipeline — Full agentic and direct mode flows
- Streaming Protocol — SSE event types and handling