System Overview
Forge V5 is a desktop-native agentic RAG system built with four major components: a Tauri + React frontend, a FastAPI + LangGraph backend, Qdrant for vector storage, and Redis for caching and graph adjacency. Everything runs on a single machine with a 16GB VRAM GPU.
Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────┐
│ USER MACHINE │
│ │
│ ┌───────────────────────────┐ ┌─────────────────────────────┐ │
│ │ TAURI DESKTOP APP │ │ FASTAPI BACKEND │ │
│ │ │ HTTP/ │ (Python 3.11+) │ │
│ │ ┌─────────────────────┐ │ SSE │ │ │
│ │ │ React + TypeScript │ │◄────────│ ┌───────────────────────┐ │ │
│ │ │ │ │────────▶│ │ API Router │ │ │
│ │ │ - Chat UI │ │ │ │ /api/query │ │ │
│ │ │ - Document Manager │ │ │ │ /api/documents │ │ │
│ │ │ - Settings Panel │ │ │ │ /api/ingest │ │ │
│ │ │ - Agent Visualizer │ │ │ │ /api/settings │ │ │
│ │ └─────────────────────┘ │ │ └───────┬───────────────┘ │ │
│ │ │ │ │ │ │
│ │ ┌─────────────────────┐ │ │ ┌───────▼───────────────┐ │ │
│ │ │ Rust Core (Tauri) │ │ │ │ QUERY ENGINE │ │ │
│ │ │ - Window Management│ │ │ │ │ │ │
│ │ │ - File System │ │ │ │ ┌─────────────────┐ │ │ │
│ │ │ - Native Dialogs │ │ │ │ │ LangGraph Agent │ │ │ │
│ │ └─────────────────────┘ │ │ │ │ (ForgeAgent) │ │ │ │
│ └───────────────────────────┘ │ │ └────────┬────────┘ │ │ │
│ │ │ │ │ │ │
│ │ │ ┌────────▼────────┐ │ │ │
│ │ │ │ Tool Registry │ │ │ │
│ │ │ │ - search │ │ │ │
│ │ │ │ - propositions │ │ │ │
│ │ │ │ - graph │ │ │ │
│ │ │ │ - rerank │ │ │ │
│ │ │ │ - hyde │ │ │ │
│ │ │ │ - decompose │ │ │ │
│ │ │ │ - generate │ │ │ │
│ │ │ └─────────────────┘ │ │ │
│ │ └───────────────────────┘ │ │
│ │ │ │
│ │ ┌───────────────────────┐ │ │
│ │ │ INGESTION ENGINE │ │ │
│ │ │ - Parser │ │ │
│ │ │ - HierarchyBuilder │ │ │
│ │ │ - ContextualEnricher │ │ │
│ │ │ - PropositionExtract │ │ │
│ │ │ - GraphExtractor │ │ │
│ │ │ - BGEm3Embedder │ │ │
│ │ └───────┬───────────────┘ │ │
│ │ │ │ │
│ │ ┌───────▼───────┐ ┌─────┐ │ │
│ │ │ CRAGEvaluator │ │ LLM │ │ │
│ │ │ (cross-encoder│ │ │ │ │
│ │ │ on CPU) │ │llama│ │ │
│ │ └───────────────┘ │.cpp │ │ │
│ │ │ │ │ │
│ └─────────────────────│─────┘ │ │
│ │ │ │
│ ┌─────────────────────┐ ┌─────────────────────┐ ┌──────▼──────┐ │
│ │ QDRANT │ │ REDIS │ │ GPU │ │
│ │ Vector Database │ │ Cache + Graph │ │ 10-14GB │ │
│ │ │ │ │ │ VRAM │ │
│ │ - Dense vectors │ │ - Query cache │ │ │ │
│ │ - Sparse vectors │ │ - Graph adjacency │ │ LLM only │ │
│ │ - ColBERT vectors │ │ - Session state │ │ Everything │ │
│ │ - Payload metadata │ │ │ │ else on CPU│ │
│ │ │ │ Port: 6379 │ │ │ │
│ │ Port: 6333 │ └─────────────────────┘ └─────────────┘ │
│ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Component Responsibilities
Frontend: Tauri + React
| Component | File Path | Responsibility |
|---|---|---|
| Chat UI | frontend/src/components/Chat.tsx | Query input, streaming response display, source cards |
| Document Manager | frontend/src/components/Documents.tsx | Upload, list, delete documents; ingestion progress |
| Agent Visualizer | frontend/src/components/AgentView.tsx | Real-time view of agent tool calls, CRAG results, iterations |
| Settings Panel | frontend/src/components/Settings.tsx | Model selection, technique toggles, config editing |
| SSE Client | frontend/src/lib/sse.ts | Parses SSE events from backend streaming endpoint |
| Tauri Commands | src-tauri/src/main.rs | File dialogs, system info, window management |
Backend: FastAPI + LangGraph
| Component | File Path | Responsibility |
|---|---|---|
| API Router | forge/api/router.py | REST endpoints for query, documents, settings |
| Streaming | forge/api/streaming.py | SSE event formatting and delivery |
| ForgeAgent | forge/retrieval/agent.py | LangGraph StateGraph orchestrating all tools |
| Hybrid Search | forge/retrieval/search.py | BGE-M3 dense + sparse search with RRF fusion |
| ColBERT Reranker | forge/retrieval/rerank.py | Token-level MaxSim reranking |
| CRAG Evaluator | forge/retrieval/crag.py | Cross-encoder quality gate |
| Graph Traversal | forge/retrieval/graph.py | Redis-backed knowledge graph queries |
| HyDE | forge/retrieval/hyde.py | Hypothetical document embedding search |
| Generator | forge/generation/generator.py | LLM answer generation with context |
| Verifier | forge/verification/verifier.py | Post-generation claim verification |
| Ingestion Pipeline | forge/ingestion/pipeline.py | Orchestrates parse → chunk → embed → store |
| Document Parser | forge/ingestion/parser.py | PDF, DOCX, TXT extraction |
| Hierarchy Builder | forge/ingestion/hierarchy.py | L0-L3 level construction |
| Contextual Enricher | forge/ingestion/contextual.py | LLM context prefix generation |
| Proposition Extractor | forge/ingestion/propositions.py | Atomic claim extraction |
| Graph Extractor | forge/ingestion/graph_extractor.py | Entity + relationship extraction |
| BGE-M3 Embedder | forge/ingestion/embedder.py | Tri-modal vector generation |
| LLM Interface | forge/llm/llama_cpp.py | llama.cpp Python bindings |
| Config Manager | forge/config.py | config.yml parsing + env var overrides |
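The Hybrid Search component fuses the dense and sparse result lists with Reciprocal Rank Fusion (RRF). A minimal, self-contained sketch of RRF (the actual implementation in forge/retrieval/search.py may differ in details such as the `k` constant and ID types):

```python
# Reciprocal Rank Fusion: merge several ranked lists of document IDs.
# Each list contributes 1 / (k + rank + 1) per document; k=60 is the
# value commonly used in the RRF literature.

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked ID lists into one, best-first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # BGE-M3 dense ranking (example IDs)
sparse = ["doc_b", "doc_d", "doc_a"]  # BGE-M3 sparse ranking (example IDs)
fused = rrf_fuse([dense, sparse])
# doc_b wins: it ranks high in both lists, while doc_a's #1 dense slot
# is diluted by its #3 sparse slot.
```

RRF needs only ranks, not raw scores, which is why it works well for fusing dense cosine scores with sparse lexical scores that live on different scales.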
Data Layer
| Service | Role | Port | Data |
|---|---|---|---|
| Qdrant | Vector database | 6333 | Dense, sparse, ColBERT vectors + payload metadata |
| Redis | Cache + graph | 6379 | Query cache, graph adjacency lists, session state |
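Redis stores the knowledge graph as adjacency lists, one set per entity. The key names below are illustrative, not the actual schema in forge/retrieval/graph.py, and a plain dict stands in for a `redis.Redis` client (which would use `sadd`/`smembers`) so the sketch is self-contained:

```python
# Adjacency-list layout sketch: one outgoing and one incoming set per
# entity, with "relation:target" members. A dict of sets stands in for
# Redis here; the key/member format is an assumed example.

store: dict[str, set[str]] = {}

def add_edge(src: str, rel: str, dst: str) -> None:
    """Record a directed, labeled edge in both directions for fast lookup."""
    store.setdefault(f"graph:out:{src}", set()).add(f"{rel}:{dst}")
    store.setdefault(f"graph:in:{dst}", set()).add(f"{rel}:{src}")

def neighbors(entity: str) -> set[str]:
    """Targets reachable in one hop, relation labels stripped."""
    return {m.split(":", 1)[1] for m in store.get(f"graph:out:{entity}", set())}

add_edge("Qdrant", "stores", "vectors")
add_edge("Qdrant", "runs_on", "CPU")
```

Keeping both `graph:out:*` and `graph:in:*` sets doubles the write cost but makes one-hop traversal in either direction a single O(1) key lookup, which is what gives sub-millisecond graph queries.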
VRAM Budget
The entire system is designed to fit in 16GB of VRAM:
┌──────────────────────────────────────────────────┐
│ 16GB VRAM GPU │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ LLM (llama.cpp, Q4_K_M) 10-14 GB │ │
│ │ ════════════════════════════════════════════ │ │
│ │ │ │
│ └──────────────────────────────────────────────┘ │
│ ┌──────────────────┐ │
│ │ CUDA overhead │ ~500MB │
│ └──────────────────┘ │
│ ┌───────────┐ │
│ │ KV Cache │ 1-2 GB (depends on context size) │
│ └───────────┘ │
│ │
│ Everything else runs on CPU + system RAM: │
│ - BGE-M3 embedding: CPU (~2GB RAM) │
│ - Cross-encoder (CRAG): CPU (~500MB RAM) │
│ - Qdrant: CPU + RAM (~2-8GB RAM for vectors) │
│ - Redis: CPU + RAM (~100MB-1GB RAM) │
└──────────────────────────────────────────────────┘
Why only LLM on GPU?
The LLM is the most compute-intensive component — GPU offloading reduces generation from ~30 tokens/sec (CPU) to ~80+ tokens/sec (GPU). BGE-M3 at ~50ms per query on CPU is already fast enough. ColBERT reranking is matrix math that’s efficient on CPU for the small batch sizes involved. This design maximizes the quality of the most expensive operation (generation) while keeping everything else snappy.
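The 1-2 GB KV-cache figure in the budget above can be sanity-checked with the standard formula (2 tensors, keys and values, per layer). The model dimensions below are example values for an ~8B grouped-query-attention model, not Forge's actual model:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size: 2 (K and V) x layers x KV heads x head dim x context,
    at bytes_per_elem per element (2 for fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Illustrative dimensions for an ~8B GQA model at 8K context:
gib = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192) / 2**30
# -> 1.0 GiB, consistent with the 1-2 GB range in the VRAM budget
```

Doubling the context window doubles this figure linearly, which is why the budget expresses the KV cache as a range rather than a fixed number.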
Technology Choices
| Decision | Choice | Why |
|---|---|---|
| Desktop framework | Tauri (Rust + React) | ~10MB binary, native file system access, no Electron overhead |
| Backend framework | FastAPI | Async Python, great SSE support, automatic OpenAPI docs |
| Agent framework | LangGraph | Explicit state machine, tool-calling, streaming support |
| LLM runtime | llama.cpp | Best GPU utilization for GGUF models, C++ performance |
| Embedding model | BGE-M3 | Single model for 3 vector types, MIT license, excellent quality |
| Vector database | Qdrant | Multi-vector support, sparse vectors, rich filtering, fast |
| Cache + graph | Redis | Sub-ms reads, pub/sub for events, lightweight |
| CRAG model | ms-marco-MiniLM-L-12-v2 | Fast cross-encoder, well-validated on retrieval tasks |
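LangGraph's explicit state machine drives the agent's plan → execute → evaluate loop. A dependency-free sketch of that control flow, with a stubbed planner, stubbed tools, and hard-coded CRAG scores (the real ForgeAgent selects tools via the LLM and scores evidence with the cross-encoder inside a LangGraph StateGraph):

```python
# Minimal sketch of the agentic loop: pick a tool, run it, gate the
# result with a CRAG-style relevance score, repeat until evidence is
# sufficient or the iteration budget is spent. All values are stubs.

CRAG_THRESHOLD = 0.7   # example gate; the real threshold is configurable
MAX_ITERATIONS = 3

def plan(state: dict) -> str:
    """PLAN: the real agent asks the LLM; here, rerank after searching."""
    return "rerank" if "search" in state["trace"] else "search"

def execute(tool: str, query: str) -> tuple[str, float]:
    """EXECUTE: stub tools return (chunk, crag_score)."""
    return {"search": ("raw chunk", 0.5), "rerank": ("best chunk", 0.9)}[tool]

def run_agent(query: str) -> dict:
    state = {"evidence": [], "trace": []}
    for _ in range(MAX_ITERATIONS):
        tool = plan(state)                    # PLAN
        chunk, score = execute(tool, query)   # EXECUTE
        state["trace"].append(tool)
        if score >= CRAG_THRESHOLD:           # EVALUATE: CRAG gate
            state["evidence"].append(chunk)
            break                             # evidence sufficient
    return state

state = run_agent("example query")
# trace: ["search", "rerank"]; only the reranked chunk passes the gate
```

The CRAG gate is what makes the loop corrective: a low-scoring retrieval is discarded and the planner gets another turn, instead of low-quality context flowing straight into generation.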
Data Flow Summary
Query Request Lifecycle
1. User types query in Tauri app
2. React sends POST to /api/query/stream
3. FastAPI creates SSE response
4. ForgeAgent initializes LangGraph state
5. Agent loop:
a. PLAN → select tool (via LLM)
b. EXECUTE → run tool (search/rerank/graph/etc.)
c. EVALUATE → CRAG gate + evidence check
d. Repeat until evidence sufficient or max iterations
6. GENERATE → LLM synthesizes answer, tokens streamed via SSE
7. VERIFY → claim-by-claim audit
8. SSE done event with metadata
9. React renders response with sources and confidence
Document Upload Lifecycle
1. User selects file in Tauri native dialog
2. Tauri reads file, sends to /api/documents/upload
3. Backend queues ingestion job
4. Pipeline runs:
a. Parse → raw text
b. Hierarchy → L0 summary, L1 sections, L2 chunks
c. Contextual enrichment → context prefix per L2 chunk
d. Proposition extraction → L3 atomic claims
e. Graph extraction → entities + relationships
f. BGE-M3 embedding → dense + sparse + ColBERT per point
g. Qdrant upsert → all vectors + payload
h. Redis store → graph adjacency lists
5. Status streamed to frontend via polling
Next Steps
- Ingestion Pipeline — Deep dive into document processing
- Query Pipeline — Full agentic and direct mode flows
- Streaming Protocol — SSE event types and handling