Quick Start
Get Forge running, upload a document, and ask your first question — all in under 5 minutes.
Prerequisites
Before you begin, make sure you have:
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA GPU with 16GB VRAM (RTX 4080, A4000) | RTX 4090 / A6000 (24GB) |
| RAM | 16GB | 32GB+ |
| Docker | Docker Desktop 4.x with GPU support | Latest stable |
| NVIDIA Driver | 535+ with CUDA 12.x | 545+ |
| Disk | 20GB free | 50GB+ (model storage) |
Forge can run in CPU-only mode with FORGE_CPU_ONLY=true, but expect 10-30x slower inference.
Useful for development, not for production queries.
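If you want CPU-only mode under Docker, the flag can be supplied through a Compose override file. This is a sketch: the forge-api service name matches the container list in Step 1, but whether the backend reads FORGE_CPU_ONLY from the container environment this way is an assumption based on the note above.

```yaml
# docker-compose.override.yml — sketch of enabling CPU-only mode.
# Assumes the backend service is named forge-api and honors FORGE_CPU_ONLY.
services:
  forge-api:
    environment:
      - FORGE_CPU_ONLY=true
```

Docker Compose merges a docker-compose.override.yml placed next to the main compose file automatically, so no extra flags are needed.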
Step 1: Clone and Launch
```bash
# Clone the repository
git clone https://github.com/zhadyz/tactical-rag-system.git
cd tactical-rag-system

# Start all services (FastAPI backend, Qdrant, Redis)
docker compose up -d
```

Docker Compose will start three containers:

- forge-api — FastAPI backend on port 8000
- forge-qdrant — Qdrant vector database on port 6333
- forge-redis — Redis cache + graph adjacency on port 6379
On first launch, the backend will automatically download required models:
- BGE-M3 (~2.4GB) — tri-modal embedding model
- LLM GGUF (~8-12GB) — your configured language model for llama.cpp
Model downloads happen once. Subsequent starts take under 30 seconds.
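If you are scripting the setup, the health check can be done in code as well as with curl. A minimal Python sketch — check_health is a hypothetical helper, and the field names mirror the example /api/health response shown below, not a guaranteed schema:

```python
def check_health(payload: dict) -> bool:
    """Return True if a /api/health response reports a fully healthy stack.

    Hypothetical helper: field names mirror the example response in this
    guide. Every service must report "connected" or "loaded".
    """
    services_ok = all(
        state in ("connected", "loaded")
        for state in payload.get("services", {}).values()
    )
    return payload.get("status") == "healthy" and services_ok


example = {
    "status": "healthy",
    "services": {"qdrant": "connected", "redis": "connected",
                 "llm": "loaded", "bge_m3": "loaded"},
}
print(check_health(example))  # True
```

A real script would fetch the JSON from http://localhost:8000/api/health instead of hard-coding it.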
Verify everything is running:
```bash
# Health check
curl http://localhost:8000/api/health
```

Expected response:

```json
{
  "status": "healthy",
  "version": "5.0.0",
  "services": {
    "qdrant": "connected",
    "redis": "connected",
    "llm": "loaded",
    "bge_m3": "loaded"
  },
  "gpu": {
    "available": true,
    "name": "NVIDIA GeForce RTX 4080",
    "vram_total_gb": 16.0,
    "vram_used_gb": 11.2
  }
}
```

Step 2: Upload a Document
```bash
# Upload a PDF, DOCX, or TXT file
curl -X POST http://localhost:8000/api/documents/upload \
  -F "file=@my-document.pdf"
```

Response:

```json
{
  "document_id": "doc_a1b2c3d4",
  "filename": "my-document.pdf",
  "status": "queued",
  "pages": 42
}
```

The document enters the ingestion pipeline automatically. Check progress:
```bash
curl http://localhost:8000/api/ingest/status
```

```json
{
  "active": [{
    "document_id": "doc_a1b2c3d4",
    "stage": "contextual_enrichment",
    "progress": 0.65,
    "chunks_processed": 128,
    "chunks_total": 197
  }],
  "completed": [],
  "failed": []
}
```

The full pipeline: Parse → Semantic Chunk → Hierarchical Levels (L0-L3) → Contextual Enrichment → Proposition Extraction → Knowledge Graph Extraction → BGE-M3 Embedding → Qdrant Storage. A 40-page PDF takes roughly 2-5 minutes depending on GPU speed.
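If you are scripting ingestion, the status payload is easy to summarize while you poll. A minimal Python sketch — ingest_progress is a hypothetical helper, the field names mirror the example response above, and a real script would fetch the JSON from /api/ingest/status instead of hard-coding it:

```python
def ingest_progress(status: dict) -> float:
    """Chunk-level progress (0.0-1.0) across all active ingestion jobs.

    Hypothetical helper: field names mirror the /api/ingest/status
    example above. Returns 1.0 when nothing is actively ingesting.
    """
    active = status.get("active", [])
    processed = sum(job["chunks_processed"] for job in active)
    total = sum(job["chunks_total"] for job in active)
    return processed / total if total else 1.0


example = {
    "active": [{"document_id": "doc_a1b2c3d4",
                "stage": "contextual_enrichment",
                "chunks_processed": 128, "chunks_total": 197}],
    "completed": [],
    "failed": [],
}
print(f"{ingest_progress(example):.0%}")  # 65%
```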
Step 3: Ask Your First Question
Direct Mode (pipeline RAG)
Fast, single-pass retrieval and generation:
```bash
curl -X POST http://localhost:8000/api/query/stream \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the main findings of the study?", "mode": "direct"}'
```

Agentic Mode (recommended)
The agent autonomously decides which tools to use, iterates, and verifies:
```bash
curl -N http://localhost:8000/api/query/stream \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the main findings of the study?", "mode": "agentic"}'
```

Expected SSE Stream Output
The stream delivers real-time events as the agent works:
```
event: query_analysis
data: {"type":"query_analysis","query":"What are the main findings of the study?","complexity":"moderate","decomposed":false}

event: retrieval_start
data: {"type":"retrieval_start","tool":"semantic_search","query":"main findings of the study"}

event: retrieval_result
data: {"type":"retrieval_result","tool":"semantic_search","chunks_found":8,"top_score":0.847}

event: crag_evaluation
data: {"type":"crag_evaluation","correct":5,"ambiguous":2,"incorrect":1,"action":"proceed"}

event: rerank_result
data: {"type":"rerank_result","method":"colbert","input_count":7,"output_count":5}

event: generation_start
data: {"type":"generation_start","context_chunks":5,"total_tokens":3420}

event: token
data: {"type":"token","content":"The study identifies three primary findings"}

event: token
data: {"type":"token","content":": (1) the correlation between..."}

event: verification
data: {"type":"verification","claims_checked":4,"claims_supported":4,"claims_unsupported":0,"confidence":0.94}

event: done
data: {"type":"done","total_time_ms":7240,"tokens_generated":312,"sources":5}
```

That’s it. You have a fully operational agentic RAG system with 14 techniques running on your local GPU. Upload more documents, try complex multi-hop questions, and watch the agent reason through them in real time.
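If you consume the stream from code rather than curl, the token events can be reassembled into the final answer. A minimal Python sketch — it parses SSE text held in a string and assumes the event shapes shown above; a real client would read lines from the live HTTP response instead:

```python
import json


def collect_answer(sse_text: str) -> str:
    """Concatenate the `content` of all token events in an SSE stream.

    Sketch only: assumes the event shapes shown in this guide; a
    production client would iterate over the streaming HTTP body.
    """
    parts = []
    for line in sse_text.splitlines():
        if not line.startswith("data:"):
            continue  # skip "event:" lines and blank separators
        event = json.loads(line[len("data:"):].strip())
        if event.get("type") == "token":
            parts.append(event["content"])
    return "".join(parts)


stream = """\
event: token
data: {"type":"token","content":"The study identifies three primary findings"}

event: token
data: {"type":"token","content":": (1) the correlation between..."}

event: done
data: {"type":"done","total_time_ms":7240}
"""
print(collect_answer(stream))
# The study identifies three primary findings: (1) the correlation between...
```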
What’s Next?
- Installation Guide — Local development setup without Docker, model configuration
- Configuration — Tune every parameter in config.yml
- Techniques Overview — Understand all 14 techniques and how they compose
- API Reference — Full endpoint documentation with schemas