About Forge

Why “Forge”?

A forge is where raw metal becomes precision tools. You start with shapeless ore — heat it, hammer it, fold it, quench it — and what comes out is sharp, strong, and purpose-built.

That’s what Forge does with documents.

Raw PDFs, messy DOCXes, walls of unstructured text — Forge takes them through a pipeline of 14 techniques that parse, chunk, enrich, decompose, embed, graph-link, and index every useful fact. What comes out the other side is a precision instrument: an indexed knowledge base where an autonomous agent can find, verify, and synthesize answers to questions you didn’t even know the document could answer.

The 14 techniques are the tools of the forge:

  • Contextual Retrieval is the furnace — it heats raw chunks until their true meaning is visible
  • Hierarchical Indexing is the layering — document, section, chunk, proposition, each level for a different purpose
  • BGE-M3 is the alloy — dense, sparse, and ColBERT vectors fused into one
  • CRAG is the quality inspector — nothing substandard leaves the forge
  • The Agent is the blacksmith — deciding which tool to reach for next

The metaphor works because building a great RAG system is, genuinely, a craft. There are dozens of techniques in the literature, each with its own strengths and failure modes. The hard part isn’t implementing any one of them — it’s making 14 of them compose into a coherent pipeline that runs on hardware you actually own.

What This Proves

To my knowledge, no open-source RAG system combines all 14 of these techniques:

  1. Agentic RAG (LangGraph ReAct loop)
  2. CRAG Quality Gate (cross-encoder classification)
  3. Multi-Hop Reasoning (query decomposition + iterative retrieval)
  4. Contextual Retrieval (Anthropic’s context prefix technique)
  5. Proposition Indexing (Dense-X atomic claims)
  6. Hierarchical 4-Level Indexing (RAPTOR-inspired L0-L3)
  7. BGE-M3 Tri-Modal Vectors (dense + sparse + ColBERT)
  8. ColBERT Late Interaction Reranking (MaxSim scoring)
  9. Knowledge Graph (LLM extraction + Redis adjacency)
  10. Self-Verification (claim-by-claim audit)
  11. Confidence Scoring (weighted retrieval signals)
  12. Query Decomposition (complex → atomic sub-queries)
  13. HyDE (hypothetical document embeddings)
  14. Parent Expansion (retrieve child, return parent for context)
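To give a flavor of one of these, technique 8 (ColBERT late interaction) is simple to state in code: for each query token, take the similarity of its best-matching document token, then sum across query tokens. A minimal sketch, assuming per-token embeddings are already computed and normalized — the function names are illustrative, not Forge’s actual API:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT MaxSim: for each query token, keep the similarity of its
    best-matching document token, then sum over query tokens."""
    # (n_query_tokens, n_doc_tokens) token-level similarity matrix
    sims = query_vecs @ doc_vecs.T
    return float(sims.max(axis=1).sum())

def rerank(query_vecs: np.ndarray, docs: list) -> list:
    """Order candidates by MaxSim score, best first.
    `docs` is a list of (doc_id, token_embedding_matrix) pairs."""
    return sorted(docs, key=lambda d: maxsim_score(query_vecs, d[1]), reverse=True)
```

Because scoring happens token-by-token rather than on a single pooled vector, MaxSim can reward a document that matches every part of a multi-part query, which is exactly why it’s used as the reranking stage rather than first-pass retrieval.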

Most systems implement 3-5. Some research prototypes demonstrate 6-8. Forge is the only system I’m aware of that integrates all 14 into a single, working pipeline that fits on a 16GB GPU.

That’s the portfolio statement: theory implemented as a system, not just a paper or a demo.
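To make “theory implemented as a system” concrete, here is a hedged sketch of the smallest technique on the list, parent expansion (14): match against small chunks for retrieval precision, but hand the LLM the larger parent section for context. All names here are illustrative, not Forge’s actual code:

```python
def expand_to_parents(child_hits: list, child_to_parent: dict, parent_store: dict) -> list:
    """Retrieve at child (small-chunk) granularity, return parent sections.

    child_hits:      ranked child chunk ids from the vector search
    child_to_parent: child id -> parent section id
    parent_store:    parent id -> full section text
    """
    seen, parents = set(), []
    for child_id in child_hits:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:  # dedupe siblings that share a parent
            seen.add(parent_id)
            parents.append(parent_store[parent_id])
    return parents
```

The dedupe step matters: when several top-ranked chunks come from the same section, the agent should see that section once, not three times.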

The Constraint That Drives Everything

16GB VRAM. That’s the budget. An RTX 4080, a consumer card you can buy for under $1,000.

This constraint is the most important design decision in Forge. It forces real trade-offs:

  • The LLM gets 10-14GB of VRAM. Everything else runs on CPU.
  • BGE-M3 embeddings are computed on CPU at ~50ms per query — fast enough.
  • The cross-encoder for CRAG runs on CPU — 200ms for 10 documents.
  • Qdrant and Redis use system RAM, not VRAM.
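The placement above amounts to a budget table. A toy sanity check — the numbers are the upper bounds from the bullets above, purely illustrative, not a real memory profiler:

```python
# Device placement and worst-case VRAM footprints (GB), per the bullets above.
VRAM_BUDGET_GB = 16
placement = {
    "llm":          ("gpu", 14),  # 10-14 GB; take the upper bound
    "bge_m3":       ("cpu", 0),   # embeddings on CPU (~50 ms/query)
    "crag_encoder": ("cpu", 0),   # cross-encoder on CPU (~200 ms / 10 docs)
    "qdrant_redis": ("cpu", 0),   # system RAM, not VRAM
}
gpu_total = sum(gb for device, gb in placement.values() if device == "gpu")
assert gpu_total <= VRAM_BUDGET_GB, "pipeline no longer fits on a 16GB card"
```

Trivial as it looks, this is the shape of the whole design: one component gets the GPU, everything else is engineered to be fast enough on CPU.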

The result is a system that runs on hardware real people own, not a $50K A100 cluster. If you have a gaming PC with an RTX 4080 or better, you can run the full 14-technique pipeline at production quality.

Tech Stack

| Layer | Technology | Why |
| --- | --- | --- |
| Desktop App | Tauri (Rust + React) | ~10MB binary, native feel, no Electron bloat |
| Frontend | React + TypeScript + Tailwind | Type-safe, fast iteration, responsive |
| Backend | FastAPI (Python 3.11+) | Async, SSE streaming, automatic API docs |
| Agent | LangGraph | Explicit state machines, tool calling, streaming |
| LLM Runtime | llama.cpp | Best GGUF performance, C++ speed, full GPU offload |
| Embeddings | BGE-M3 (FlagEmbedding) | One model, three vector types, MIT license |
| Vector DB | Qdrant | Multi-vector support, sparse vectors, rich filtering |
| Cache + Graph | Redis | Sub-ms reads, lightweight, ubiquitous |
| CRAG Model | ms-marco-MiniLM-L-12-v2 | Fast cross-encoder, well-validated |
| Docs | Next.js + Nextra + Tailwind | The site you’re reading right now |

Who Built This

Built by hollowed_eyes as a portfolio project demonstrating deep expertise in:

  • Retrieval-Augmented Generation (RAG) systems
  • LLM application architecture
  • Vector search and embedding models
  • Agent-based reasoning systems
  • GPU-optimized inference
  • Full-stack desktop application development (Tauri/React/FastAPI)

This is not a wrapper around an API. Every component — the agent, the CRAG gate, the hierarchical indexer, the graph extractor, the streaming protocol — is implemented from the ground up, integrated into a coherent system, and documented like you’d expect from a production codebase.


Want to see it in action?

Clone the repo, upload a document, and ask a hard question. The agent will show you what 14 techniques working together looks like.