# About Forge
## Why “Forge”?
A forge is where raw metal becomes precision tools. You start with shapeless ore — heat it, hammer it, fold it, quench it — and what comes out is sharp, strong, and purpose-built.
That’s what Forge does with documents.
Raw PDFs, messy DOCXes, walls of unstructured text — Forge takes them through a pipeline of 14 techniques that parse, chunk, enrich, decompose, embed, graph-link, and index every useful fact. What comes out the other side is a precision instrument: an indexed knowledge base where an autonomous agent can find, verify, and synthesize answers to questions you didn’t even know the document could answer.
The 14 techniques are the tools of the forge:
- Contextual Retrieval is the furnace — it heats raw chunks until their true meaning is visible
- Hierarchical Indexing is the layering — document, section, chunk, proposition, each level for a different purpose
- BGE-M3 is the alloy — dense, sparse, and ColBERT vectors fused into one
- CRAG is the quality inspector — nothing substandard leaves the forge
- The Agent is the blacksmith — deciding which tool to reach for next
The metaphor works because building a great RAG system is, genuinely, a craft. There are dozens of techniques in the literature, each with its own strengths and failure modes. The hard part isn’t implementing any one of them — it’s making 14 of them compose into a coherent pipeline that runs on hardware you actually own.
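To make “compose” concrete, here is a toy sketch of two stages — parsing/chunking and a Contextual Retrieval-style prefix — chained into a mini pipeline. The function names and the paragraph-split heuristic are illustrative, not Forge’s actual code:

```python
# Illustrative sketch of two pipeline stages composed together.
# Function names and chunking logic are hypothetical, not Forge's API.

def parse_and_chunk(doc: str) -> list[dict]:
    """Split a raw document into paragraph-level chunks."""
    return [{"text": p.strip()} for p in doc.split("\n\n") if p.strip()]

def contextualize(chunks: list[dict], doc: str) -> list[dict]:
    """Contextual Retrieval: prefix each chunk with document-level
    context so its meaning survives being embedded in isolation."""
    title = doc.strip().splitlines()[0]
    return [{**c, "text": f"[{title}] {c['text']}"} for c in chunks]

def ingest(doc: str) -> list[dict]:
    return contextualize(parse_and_chunk(doc), doc)

doc = "Forge Overview\n\nForge indexes documents.\n\nIt runs on one GPU."
for chunk in ingest(doc):
    print(chunk["text"])
```

The real pipeline has twelve more stages between those two, but the shape is the same: each stage takes chunks in, returns enriched chunks out.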
## What This Proves
To my knowledge, no other open-source RAG system combines all 14 of these techniques:
- Agentic RAG (LangGraph ReAct loop)
- CRAG Quality Gate (cross-encoder classification)
- Multi-Hop Reasoning (query decomposition + iterative retrieval)
- Contextual Retrieval (Anthropic’s context prefix technique)
- Proposition Indexing (Dense-X atomic claims)
- Hierarchical 4-Level Indexing (RAPTOR-inspired L0-L3)
- BGE-M3 Tri-Modal Vectors (dense + sparse + ColBERT)
- ColBERT Late Interaction Reranking (MaxSim scoring)
- Knowledge Graph (LLM extraction + Redis adjacency)
- Self-Verification (claim-by-claim audit)
- Confidence Scoring (weighted retrieval signals)
- Query Decomposition (complex → atomic sub-queries)
- HyDE (hypothetical document embeddings)
- Parent Expansion (retrieve child, return parent for context)
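Of these, ColBERT’s MaxSim scoring is compact enough to sketch directly: each query token vector takes its best similarity over all document token vectors, and those per-token maxima are summed. A minimal NumPy sketch with toy 2-dimensional unit vectors (not BGE-M3’s real embeddings):

```python
import numpy as np

def maxsim(q_vecs: np.ndarray, d_vecs: np.ndarray) -> float:
    """ColBERT late-interaction score: for each query token vector,
    take its max similarity over the document's token vectors, then sum."""
    sim = q_vecs @ d_vecs.T          # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy example: 2 query tokens, 3 document tokens, dimension 2.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
print(maxsim(q, d))  # each query token finds a perfect match -> 2.0
```

This is why late interaction reranks well: token-level matching is preserved, unlike a single pooled vector per document.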
Most systems implement 3-5. Some research prototypes demonstrate 6-8. Forge is the only system I’m aware of that integrates all 14 into a single, working pipeline that fits on a 16GB GPU.
That’s the portfolio statement: theory implemented as a system, not just a paper or a demo.
## The Constraint That Drives Everything
16GB VRAM. That’s the budget. An RTX 4080, a consumer card you can buy for under $1,000.
This constraint is the most important design decision in Forge. It forces real trade-offs:
- The LLM gets 10-14GB of VRAM. Everything else runs on CPU.
- BGE-M3 embeddings are computed on CPU at ~50ms per query — fast enough.
- The cross-encoder for CRAG runs on CPU — 200ms for 10 documents.
- Qdrant and Redis use system RAM, not VRAM.
The result is a system that runs on hardware real people own, not a $50K A100 cluster. If you have a gaming PC with an RTX 4080 or better, you can run the full 14-technique pipeline at production quality.
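The arithmetic behind that budget can be sketched in a few lines. The 12 GB figure below is an assumed midpoint of the 10–14 GB range stated above, not a measured number:

```python
# Back-of-envelope VRAM budget using the figures quoted above.
# llm_gb = 12.0 is an assumed midpoint of the stated 10-14 GB range.
TOTAL_VRAM_GB = 16.0
llm_gb = 12.0                        # LLM weights + KV cache on the GPU
headroom_gb = TOTAL_VRAM_GB - llm_gb

# Everything else is deliberately kept off the GPU:
cpu_side = [
    "BGE-M3 embeddings (~50 ms per query, CPU)",
    "CRAG cross-encoder (~200 ms for 10 documents, CPU)",
    "Qdrant (vectors in system RAM)",
    "Redis (cache + graph in system RAM)",
]

print(f"GPU headroom: {headroom_gb:.1f} GB for CUDA buffers and spikes")
```

Keeping four components off the GPU is what leaves the LLM enough room to run fully offloaded.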
## Tech Stack
| Layer | Technology | Why |
|---|---|---|
| Desktop App | Tauri (Rust + React) | ~10MB binary, native feel, no Electron bloat |
| Frontend | React + TypeScript + Tailwind | Type-safe, fast iteration, responsive |
| Backend | FastAPI (Python 3.11+) | Async, SSE streaming, automatic API docs |
| Agent | LangGraph | Explicit state machines, tool calling, streaming |
| LLM Runtime | llama.cpp | Best GGUF performance, C++ speed, full GPU offload |
| Embeddings | BGE-M3 (FlagEmbedding) | One model, three vector types, MIT license |
| Vector DB | Qdrant | Multi-vector support, sparse vectors, rich filtering |
| Cache + Graph | Redis | Sub-ms reads, lightweight, ubiquitous |
| CRAG Model | ms-marco-MiniLM-L-12-v2 | Fast cross-encoder, well-validated |
| Docs | Next.js + Nextra + Tailwind | The site you’re reading right now |
## Who Built This
Built by hollowed_eyes as a portfolio project demonstrating deep expertise in:
- Retrieval-Augmented Generation (RAG) systems
- LLM application architecture
- Vector search and embedding models
- Agent-based reasoning systems
- GPU-optimized inference
- Full-stack desktop application development (Tauri/React/FastAPI)
This is not a wrapper around an API. Every component — the agent, the CRAG gate, the hierarchical indexer, the graph extractor, the streaming protocol — is implemented from the ground up, integrated into a coherent system, and documented like you’d expect from a production codebase.
## Want to see it in action?
Clone the repo, upload a document, and ask a hard question. The agent will show you what 14 techniques working together looks like.