# About Forge
## Why “Forge”?
A forge is where raw metal becomes precision tools. You start with shapeless ore — heat it, hammer it, fold it, quench it — and what comes out is sharp, strong, and purpose-built.
That’s what Forge does with documents.
Raw PDFs, messy DOCXes, walls of unstructured text — Forge takes them through a pipeline of 14 techniques that parse, chunk, enrich, decompose, embed, graph-link, and index every useful fact. What comes out the other side is a precision instrument: an indexed knowledge base where an autonomous agent can find, verify, and synthesize answers to questions you didn’t even know the document could answer.
The 14 techniques are the tools of the forge:
- Contextual Retrieval is the furnace — it heats raw chunks until their true meaning is visible
- Hierarchical Indexing is the layering — document, section, chunk, proposition, each level for a different purpose
- BGE-M3 is the alloy — dense, sparse, and ColBERT vectors fused into one
- CRAG is the quality inspector — nothing substandard leaves the forge
- The Agent is the blacksmith — deciding which tool to reach for next
The metaphor works because building a great RAG system is, genuinely, a craft. There are dozens of techniques in the literature, each with its own strengths and failure modes. The hard part isn’t implementing any one of them — it’s making 14 of them compose into a coherent pipeline that runs on hardware you actually own.
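To make “compose” concrete, here is a toy sketch of two stages — parsing/chunking and a Contextual Retrieval-style prefix — chained into a mini pipeline. The function names and the paragraph-split heuristic are illustrative, not Forge’s actual code:

```python
# Illustrative sketch of two pipeline stages composed together.
# Function names and chunking logic are hypothetical, not Forge's API.

def parse_and_chunk(doc: str) -> list[dict]:
    """Split a raw document into paragraph-level chunks."""
    return [{"text": p.strip()} for p in doc.split("\n\n") if p.strip()]

def contextualize(chunks: list[dict], doc: str) -> list[dict]:
    """Contextual Retrieval: prefix each chunk with document-level
    context so its meaning survives being embedded in isolation."""
    title = doc.strip().splitlines()[0]
    return [{**c, "text": f"[{title}] {c['text']}"} for c in chunks]

def ingest(doc: str) -> list[dict]:
    return contextualize(parse_and_chunk(doc), doc)

doc = "Forge Overview\n\nForge indexes documents.\n\nIt runs on one GPU."
for chunk in ingest(doc):
    print(chunk["text"])
```

The real pipeline has twelve more stages between those two, but the shape is the same: each stage takes chunks in, returns enriched chunks out.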
## What This Proves
To my knowledge, no other open-source RAG system combines all 14 of these techniques:
- Agentic RAG (LangGraph ReAct loop)
- CRAG Quality Gate (cross-encoder classification)
- Multi-Hop Reasoning (query decomposition + iterative retrieval)
- Contextual Retrieval (Anthropic’s context prefix technique)
- Proposition Indexing (Dense-X atomic claims)
- Hierarchical 4-Level Indexing (RAPTOR-inspired L0-L3)
- BGE-M3 Tri-Modal Vectors (dense + sparse + ColBERT)
- ColBERT Late Interaction Reranking (MaxSim scoring)
- Knowledge Graph (LLM extraction + Redis adjacency)
- Self-Verification (claim-by-claim audit)
- Confidence Scoring (weighted retrieval signals)
- Query Decomposition (complex → atomic sub-queries)
- HyDE (hypothetical document embeddings)
- Parent Expansion (retrieve child, return parent for context)
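Of these, ColBERT’s MaxSim scoring is compact enough to sketch directly: each query token vector takes its best similarity over all document token vectors, and those per-token maxima are summed. A minimal NumPy sketch with toy 2-dimensional unit vectors (not BGE-M3’s real embeddings):

```python
import numpy as np

def maxsim(q_vecs: np.ndarray, d_vecs: np.ndarray) -> float:
    """ColBERT late-interaction score: for each query token vector,
    take its max similarity over the document's token vectors, then sum."""
    sim = q_vecs @ d_vecs.T          # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy example: 2 query tokens, 3 document tokens, dimension 2.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
print(maxsim(q, d))  # each query token finds a perfect match -> 2.0
```

This is why late interaction reranks well: token-level matching is preserved, unlike a single pooled vector per document.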
Most systems implement 3-5. Some research prototypes demonstrate 6-8. Forge is the only system I’m aware of that integrates all 14 into a single, working pipeline that fits on a 16GB GPU.
That’s the portfolio statement: theory implemented as a system, not just a paper or a demo.
## The Constraint That Drives Everything
16GB VRAM. That’s the budget. An RTX 4080, a consumer card you can buy for under $1,000.
This constraint is the most important design decision in Forge. It forces real trade-offs:
- The LLM gets 10-14GB of VRAM. Everything else runs on CPU.
- BGE-M3 embeddings are computed on CPU at ~50ms per query — fast enough.
- The cross-encoder for CRAG runs on CPU — 200ms for 10 documents.
- Qdrant and Redis use system RAM, not VRAM.
The result is a system that runs on hardware real people own, not a $50K A100 cluster. If you have a gaming PC with an RTX 4080 or better, you can run the full 14-technique pipeline at production quality.
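The arithmetic behind that budget can be sketched in a few lines. The 12 GB figure below is an assumed midpoint of the 10–14 GB range stated above, not a measured number:

```python
# Back-of-envelope VRAM budget using the figures quoted above.
# llm_gb = 12.0 is an assumed midpoint of the stated 10-14 GB range.
TOTAL_VRAM_GB = 16.0
llm_gb = 12.0                        # LLM weights + KV cache on the GPU
headroom_gb = TOTAL_VRAM_GB - llm_gb

# Everything else is deliberately kept off the GPU:
cpu_side = [
    "BGE-M3 embeddings (~50 ms per query, CPU)",
    "CRAG cross-encoder (~200 ms for 10 documents, CPU)",
    "Qdrant (vectors in system RAM)",
    "Redis (cache + graph in system RAM)",
]

print(f"GPU headroom: {headroom_gb:.1f} GB for CUDA buffers and spikes")
```

Keeping four components off the GPU is what leaves the LLM enough room to run fully offloaded.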
## Tech Stack
| Layer | Technology | Why |
|---|---|---|
| Desktop App | Tauri (Rust + React) | ~10MB binary, native feel, no Electron bloat |
| Frontend | React + TypeScript + Tailwind | Type-safe, fast iteration, responsive |
| Backend | FastAPI (Python 3.11+) | Async, SSE streaming, automatic API docs |
| Agent | LangGraph | Explicit state machines, tool calling, streaming |
| LLM Runtime | llama.cpp | Best GGUF performance, C++ speed, full GPU offload |
| Embeddings | BGE-M3 (FlagEmbedding) | One model, three vector types, MIT license |
| Vector DB | Qdrant | Multi-vector support, sparse vectors, rich filtering |
| Cache + Graph | Redis | Sub-ms reads, lightweight, ubiquitous |
| CRAG Model | ms-marco-MiniLM-L-12-v2 | Fast cross-encoder, well-validated |
| Docs | Next.js + Nextra + Tailwind | The site you’re reading right now |
## Who Built This
Built by hollowed_eyes as a portfolio project demonstrating deep expertise in:
- Retrieval-Augmented Generation (RAG) systems
- LLM application architecture
- Vector search and embedding models
- Agent-based reasoning systems
- GPU-optimized inference
- Full-stack desktop application development (Tauri/React/FastAPI)
This is not a wrapper around an API. Every component — the agent, the CRAG gate, the hierarchical indexer, the graph extractor, the streaming protocol — is implemented from the ground up, integrated into a coherent system, and documented like you’d expect from a production codebase.
## Want to see it in action?
Clone the repo, upload a document, and ask a hard question. The agent will show you what 14 techniques working together looks like.