
System Overview

Forge V5 is a desktop-native agentic RAG system built with four major components: a Tauri + React frontend, a FastAPI + LangGraph backend, Qdrant for vector storage, and Redis for caching and graph adjacency. Everything runs on a single machine with a 16GB VRAM GPU.

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                              USER MACHINE                               │
│                                                                         │
│  ┌───────────────────────────┐          ┌─────────────────────────────┐  │
│  │     TAURI DESKTOP APP     │          │     FASTAPI BACKEND         │  │
│  │                           │  HTTP/   │     (Python 3.11+)          │  │
│  │  ┌─────────────────────┐  │  SSE     │                             │  │
│  │  │   React + TypeScript │  │◄────────│  ┌───────────────────────┐  │  │
│  │  │                     │  │────────▶│  │  API Router            │  │  │
│  │  │  - Chat UI          │  │         │  │  /api/query            │  │  │
│  │  │  - Document Manager │  │         │  │  /api/documents        │  │  │
│  │  │  - Settings Panel   │  │         │  │  /api/ingest           │  │  │
│  │  │  - Agent Visualizer │  │         │  │  /api/settings         │  │  │
│  │  └─────────────────────┘  │         │  └───────┬───────────────┘  │  │
│  │                           │         │          │                   │  │
│  │  ┌─────────────────────┐  │         │  ┌───────▼───────────────┐  │  │
│  │  │   Rust Core (Tauri) │  │         │  │  QUERY ENGINE          │  │  │
│  │  │  - Window Management│  │         │  │                       │  │  │
│  │  │  - File System      │  │         │  │  ┌─────────────────┐  │  │  │
│  │  │  - Native Dialogs   │  │         │  │  │  LangGraph Agent │  │  │  │
│  │  └─────────────────────┘  │         │  │  │  (ForgeAgent)    │  │  │  │
│  └───────────────────────────┘         │  │  └────────┬────────┘  │  │  │
│                                        │  │           │           │  │  │
│                                        │  │  ┌────────▼────────┐  │  │  │
│                                        │  │  │  Tool Registry   │  │  │  │
│                                        │  │  │  - search        │  │  │  │
│                                        │  │  │  - propositions  │  │  │  │
│                                        │  │  │  - graph         │  │  │  │
│                                        │  │  │  - rerank        │  │  │  │
│                                        │  │  │  - hyde          │  │  │  │
│                                        │  │  │  - decompose     │  │  │  │
│                                        │  │  │  - generate      │  │  │  │
│                                        │  │  └─────────────────┘  │  │  │
│                                        │  └───────────────────────┘  │  │
│                                        │                             │  │
│                                        │  ┌───────────────────────┐  │  │
│                                        │  │  INGESTION ENGINE      │  │  │
│                                        │  │  - Parser             │  │  │
│                                        │  │  - HierarchyBuilder   │  │  │
│                                        │  │  - ContextualEnricher │  │  │
│                                        │  │  - PropositionExtract │  │  │
│                                        │  │  - GraphExtractor     │  │  │
│                                        │  │  - BGEm3Embedder      │  │  │
│                                        │  └───────┬───────────────┘  │  │
│                                        │          │                   │  │
│                                        │  ┌───────▼───────┐  ┌─────┐  │  │
│                                        │  │  CRAGEvaluator │  │ LLM │  │  │
│                                        │  │  (cross-encoder│  │     │  │  │
│                                        │  │   on CPU)      │  │llama│  │  │
│                                        │  └───────────────┘  │.cpp │  │  │
│                                        │                     │     │  │  │
│                                        └─────────────────────│─────┘  │  │
│                                                              │        │  │
│  ┌─────────────────────┐    ┌─────────────────────┐   ┌──────▼──────┐  │
│  │      QDRANT          │    │      REDIS           │   │    GPU      │  │
│  │  Vector Database     │    │  Cache + Graph       │   │  10-14GB   │  │
│  │                     │    │                     │   │  VRAM       │  │
│  │  - Dense vectors    │    │  - Query cache      │   │             │  │
│  │  - Sparse vectors   │    │  - Graph adjacency  │   │  LLM only   │  │
│  │  - ColBERT vectors  │    │  - Session state    │   │  Everything │  │
│  │  - Payload metadata │    │                     │   │  else on CPU│  │
│  │                     │    │  Port: 6379         │   │             │  │
│  │  Port: 6333         │    └─────────────────────┘   └─────────────┘  │
│  └─────────────────────┘                                               │
└─────────────────────────────────────────────────────────────────────────┘

Component Responsibilities

Frontend: Tauri + React

| Component | File Path | Responsibility |
| --- | --- | --- |
| Chat UI | frontend/src/components/Chat.tsx | Query input, streaming response display, source cards |
| Document Manager | frontend/src/components/Documents.tsx | Upload, list, delete documents; ingestion progress |
| Agent Visualizer | frontend/src/components/AgentView.tsx | Real-time view of agent tool calls, CRAG results, iterations |
| Settings Panel | frontend/src/components/Settings.tsx | Model selection, technique toggles, config editing |
| SSE Client | frontend/src/lib/sse.ts | Parses SSE events from backend streaming endpoint |
| Tauri Commands | src-tauri/src/main.rs | File dialogs, system info, window management |

Backend: FastAPI + LangGraph

| Component | File Path | Responsibility |
| --- | --- | --- |
| API Router | forge/api/router.py | REST endpoints for query, documents, settings |
| Streaming | forge/api/streaming.py | SSE event formatting and delivery |
| ForgeAgent | forge/retrieval/agent.py | LangGraph StateGraph orchestrating all tools |
| Hybrid Search | forge/retrieval/search.py | BGE-M3 dense + sparse search with RRF fusion |
| ColBERT Reranker | forge/retrieval/rerank.py | Token-level MaxSim reranking |
| CRAG Evaluator | forge/retrieval/crag.py | Cross-encoder quality gate |
| Graph Traversal | forge/retrieval/graph.py | Redis-backed knowledge graph queries |
| HyDE | forge/retrieval/hyde.py | Hypothetical document embedding search |
| Generator | forge/generation/generator.py | LLM answer generation with context |
| Verifier | forge/verification/verifier.py | Post-generation claim verification |
| Ingestion Pipeline | forge/ingestion/pipeline.py | Orchestrates parse → chunk → embed → store |
| Document Parser | forge/ingestion/parser.py | PDF, DOCX, TXT extraction |
| Hierarchy Builder | forge/ingestion/hierarchy.py | L0-L3 level construction |
| Contextual Enricher | forge/ingestion/contextual.py | LLM context prefix generation |
| Proposition Extractor | forge/ingestion/propositions.py | Atomic claim extraction |
| Graph Extractor | forge/ingestion/graph_extractor.py | Entity + relationship extraction |
| BGE-M3 Embedder | forge/ingestion/embedder.py | Tri-modal vector generation |
| LLM Interface | forge/llm/llama_cpp.py | llama.cpp Python bindings |
| Config Manager | forge/config.py | config.yml parsing + env var overrides |
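Hybrid Search fuses the BGE-M3 dense and sparse result lists with Reciprocal Rank Fusion. The sketch below shows the core of RRF under the standard formulation (score = Σ 1/(k + rank)); the function name and k=60 default are illustrative, not taken from forge/retrieval/search.py.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over every ranking
    that contains it; higher total means a better fused rank.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["a", "b", "c"]    # IDs ranked by dense similarity
sparse = ["b", "c", "d"]   # IDs ranked by sparse (lexical) match
fused = rrf_fuse([dense, sparse])  # "b" wins: high in both lists
```

RRF needs only ranks, not scores, which is why it works across dense and sparse retrievers whose raw scores aren't comparable.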

Data Layer

| Service | Role | Port | Data |
| --- | --- | --- | --- |
| Qdrant | Vector database | 6333 | Dense, sparse, ColBERT vectors + payload metadata |
| Redis | Cache + graph | 6379 | Query cache, graph adjacency lists, session state |
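The graph adjacency lists are a simple key-per-entity layout. The sketch below illustrates one plausible shape using a plain dict in place of Redis; the `graph:adj:{entity}` key scheme and `rel:dst` member encoding are assumptions for illustration, not the actual Forge key names (with redis-py the same shape maps to SADD / SMEMBERS on set keys).

```python
# Dict stands in for Redis; keys and member encoding are illustrative.
store: dict[str, set[str]] = {}

def add_edge(store: dict, src: str, rel: str, dst: str) -> None:
    # One set per entity holding "relation:target" members.
    store.setdefault(f"graph:adj:{src}", set()).add(f"{rel}:{dst}")

def neighbors(store: dict, entity: str) -> set[str]:
    # O(1) key lookup, which is why adjacency lives in Redis.
    return store.get(f"graph:adj:{entity}", set())

add_edge(store, "Qdrant", "stores", "vectors")
add_edge(store, "Qdrant", "queried_by", "ForgeAgent")
```

Graph Traversal can then expand a query's entities hop by hop with one read per entity, no graph database required.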

VRAM Budget

The entire system is designed to fit in 16GB of VRAM:

┌──────────────────────────────────────────────────┐
│                 16GB VRAM GPU                      │
│                                                    │
│  ┌──────────────────────────────────────────────┐ │
│  │  LLM (llama.cpp, Q4_K_M)     10-14 GB       │ │
│  │  ════════════════════════════════════════════ │ │
│  │                                              │ │
│  └──────────────────────────────────────────────┘ │
│  ┌──────────────────┐                             │
│  │  CUDA overhead   │  ~500MB                     │
│  └──────────────────┘                             │
│  ┌───────────┐                                    │
│  │  KV Cache │  1-2 GB (depends on context size)  │
│  └───────────┘                                    │
│                                                    │
│  Everything else runs on CPU + system RAM:         │
│  - BGE-M3 embedding: CPU (~2GB RAM)               │
│  - Cross-encoder (CRAG): CPU (~500MB RAM)         │
│  - Qdrant: CPU + RAM (~2-8GB RAM for vectors)     │
│  - Redis: CPU + RAM (~100MB-1GB RAM)              │
└──────────────────────────────────────────────────┘
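The 1-2 GB KV-cache line can be sanity-checked with the usual sizing formula (2 tensors, K and V, per layer per position). The model shape below (32 layers, 8 KV heads of dim 128, fp16) is a hypothetical example, not a statement about the model Forge ships with.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    # 2 = one K and one V tensor cached per layer, per head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical 32-layer model with grouped-query attention, fp16, 8k context:
gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
```

Doubling the context window doubles this figure, which is why the diagram notes that KV-cache size depends on context size.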
Why only LLM on GPU?

The LLM is the most compute-intensive component — GPU offloading reduces generation from ~30 tokens/sec (CPU) to ~80+ tokens/sec (GPU). BGE-M3 at ~50ms per query on CPU is already fast enough. ColBERT reranking is matrix math that’s efficient on CPU for the small batch sizes involved. This design maximizes the quality of the most expensive operation (generation) while keeping everything else snappy.
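In llama-cpp-python terms, this split comes down to one parameter. The snippet below is a config sketch, assuming the llama-cpp-python bindings and a placeholder model path:

```python
from llama_cpp import Llama

# Illustrative config; the model path is a placeholder.
llm = Llama(
    model_path="models/your-model.Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=-1,  # offload all layers to VRAM (0 = pure CPU)
    n_ctx=8192,       # context window; drives KV-cache VRAM usage
)
```

Setting `n_gpu_layers=-1` puts every transformer layer on the GPU; embedding and reranking models simply aren't loaded through this path, so they never compete for VRAM.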

Technology Choices

| Decision | Choice | Why |
| --- | --- | --- |
| Desktop framework | Tauri (Rust + React) | ~10MB binary, native file system access, no Electron overhead |
| Backend framework | FastAPI | Async Python, great SSE support, automatic OpenAPI docs |
| Agent framework | LangGraph | Explicit state machine, tool-calling, streaming support |
| LLM runtime | llama.cpp | Best GPU utilization for GGUF models, C++ performance |
| Embedding model | BGE-M3 | Single model for 3 vector types, MIT license, excellent quality |
| Vector database | Qdrant | Multi-vector support, sparse vectors, rich filtering, fast |
| Cache + graph | Redis | Sub-ms reads, pub/sub for events, lightweight |
| CRAG model | ms-marco-MiniLM-L-12-v2 | Fast cross-encoder, well-validated on retrieval tasks |

Data Flow Summary

Query Request Lifecycle

1. User types query in Tauri app
2. React sends POST to /api/query/stream
3. FastAPI creates SSE response
4. ForgeAgent initializes LangGraph state
5. Agent loop:
   a. PLAN → select tool (via LLM)
   b. EXECUTE → run tool (search/rerank/graph/etc.)
   c. EVALUATE → CRAG gate + evidence check
   d. Repeat until evidence sufficient or max iterations
6. GENERATE → LLM synthesizes answer, tokens streamed via SSE
7. VERIFY → claim-by-claim audit
8. SSE done event with metadata
9. React renders response with sources and confidence
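The agent loop in steps 4-5 can be sketched in plain Python (LangGraph wires the same states into a StateGraph; the function names and sufficiency check here are illustrative):

```python
# PLAN -> EXECUTE -> EVALUATE loop, simplified; real tool selection is
# done by the LLM and the evaluate step is the CRAG quality gate.
def run_agent(query, plan, execute, evaluate, max_iters=4):
    evidence = []
    for _ in range(max_iters):
        tool = plan(query, evidence)           # PLAN: pick next tool
        evidence.extend(execute(tool, query))  # EXECUTE: run the tool
        if evaluate(evidence):                 # EVALUATE: enough evidence?
            break
    return evidence

# Toy stand-ins: always search, one passage per call,
# "sufficient" once two passages are held.
evidence = run_agent(
    "q",
    plan=lambda q, ev: "search",
    execute=lambda tool, q: [f"passage-{tool}"],
    evaluate=lambda ev: len(ev) >= 2,
)
```

The `max_iters` cap is what "max iterations" in step 5d refers to: the loop always terminates, then control passes to GENERATE with whatever evidence was gathered.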

Document Upload Lifecycle

1. User selects file in Tauri native dialog
2. Tauri reads file, sends to /api/documents/upload
3. Backend queues ingestion job
4. Pipeline runs:
   a. Parse → raw text
   b. Hierarchy → L0 summary, L1 sections, L2 chunks
   c. Contextual enrichment → context prefix per L2 chunk
   d. Proposition extraction → L3 atomic claims
   e. Graph extraction → entities + relationships
   f. BGE-M3 embedding → dense + sparse + ColBERT per point
   g. Qdrant upsert → all vectors + payload
   h. Redis store → graph adjacency lists
5. Frontend tracks ingestion status by polling
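The pipeline in step 4 is a straight-line composition of stages. The sketch below shows that shape with stub stages; the real stages live in forge/ingestion/ and these bodies are placeholders, not the actual parsing or embedding code (enrichment, propositions, and graph extraction are elided for brevity).

```python
# Stub stages standing in for forge/ingestion/ components.
def parse(raw: str) -> str:
    return raw.strip()                      # Parser: raw bytes -> text

def build_hierarchy(text: str) -> dict:
    # HierarchyBuilder: L0 doc summary, L2 chunks (toy sentence split).
    return {"L0": text[:20], "L2": text.split(". ")}

def embed(chunks: list[str]) -> list[tuple]:
    return [(c, [0.0]) for c in chunks]     # Embedder: dense-vector stub

def ingest(raw: str) -> list[tuple]:
    text = parse(raw)
    tree = build_hierarchy(text)
    points = embed(tree["L2"])
    return points  # would be upserted into Qdrant with payload metadata

points = ingest("  First sentence. Second sentence.  ")
```

Because each stage only consumes the previous stage's output, stages can be toggled (e.g. propositions off) without touching the rest of the pipeline.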

Next Steps