# Installation
Forge V5 can run via Docker (recommended) or as a local development setup. This guide covers both.
## System Requirements

### Hardware
| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| GPU | 16GB VRAM | 24GB VRAM | RTX 4080/4090, A4000/A6000, or equivalent |
| RAM | 16GB | 32GB+ | Qdrant and Redis run in memory |
| CPU | 8 cores | 16+ cores | BGE-M3 embedding runs on CPU |
| Storage | 20GB | 50GB+ | Models + vector indices + document cache |
| OS | Linux (Ubuntu 22.04+) | Linux | Windows via WSL2 also supported |
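Before installing, you can check the table's VRAM minimum against your own hardware by parsing `nvidia-smi` output. A small sketch; the query flags are standard `nvidia-smi` options, but the helper functions below are illustrative and not part of Forge:

```python
import subprocess

def gpu_memory_mib(smi_output: str) -> list[int]:
    """Parse lines like '16384 MiB' from the output of
    `nvidia-smi --query-gpu=memory.total --format=csv,noheader`."""
    return [int(line.split()[0])
            for line in smi_output.strip().splitlines() if line.strip()]

def meets_minimum(smi_output: str, minimum_mib: int = 16 * 1024) -> bool:
    """True if any detected GPU meets the 16GB minimum from the table above."""
    return any(mem >= minimum_mib for mem in gpu_memory_mib(smi_output))

# On a machine with the NVIDIA driver installed:
# out = subprocess.check_output(
#     ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader"],
#     text=True)
# print(meets_minimum(out))
```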
### VRAM Budget
Forge is designed to fit in 16GB. Here’s how VRAM is allocated:
| Component | VRAM | Device |
|---|---|---|
| LLM (llama.cpp, Q4_K_M quantization) | 10-14GB | GPU |
| BGE-M3 embedding model | ~0 | CPU |
| ColBERT vectors | ~0 | CPU (via BGE-M3) |
| CRAG cross-encoder | ~0 | CPU |
| Qdrant vector search | ~0 | CPU + RAM |
| Redis cache + graph | ~0 | CPU + RAM |
BGE-M3 runs at ~50ms per query on CPU, which is fast enough for real-time use. Keeping it off the GPU reserves all VRAM for the LLM, where it matters most for generation quality.
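To check the ~50ms figure on your own CPU, a generic timing harness is enough. This is a sketch; the commented-out usage assumes the FlagEmbedding package and a downloaded BGE-M3 model, neither of which the harness itself requires:

```python
import statistics
import time

def median_latency_ms(fn, *args, runs: int = 20, warmup: int = 3) -> float:
    """Median wall-clock latency of fn(*args) in milliseconds."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

# Example (assumes FlagEmbedding is installed and the model is on disk):
# from FlagEmbedding import BGEM3FlagModel
# model = BGEM3FlagModel("models/bge-m3")
# print(median_latency_ms(model.encode, ["what is the vram budget?"]))
```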
### Software
- Docker 24.x+ with NVIDIA Container Toolkit (for Docker install)
- NVIDIA Driver 535+ with CUDA 12.x
- Python 3.11+ (for local development)
- Node.js 18+ and npm 9+ (for Tauri frontend development)
- Rust 1.75+ (for Tauri desktop builds)
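One way to sanity-check these minimums is to parse each tool's version string. An illustrative helper, not part of Forge; the sample strings in the comments are typical tool outputs:

```python
import re

def version_tuple(output: str) -> tuple[int, ...]:
    """Extract the first X.Y[.Z] version number from tool output,
    e.g. 'Docker version 24.0.7, build afdd53b' -> (24, 0, 7)."""
    match = re.search(r"(\d+(?:\.\d+)+)", output)
    if match is None:
        raise ValueError(f"no version found in {output!r}")
    return tuple(int(part) for part in match.group(1).split("."))

# Compare against the minimums listed above, e.g.:
# version_tuple("Python 3.11.9") >= (3, 11)
```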
## Option A: Docker Installation (Recommended)

### 1. Install NVIDIA Container Toolkit
If you haven’t already configured Docker for GPU access:
```bash
# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify GPU access in Docker:

```bash
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```

### 2. Clone and Launch
```bash
git clone https://github.com/zhadyz/tactical-rag-system.git
cd tactical-rag-system

# Copy the example environment file
cp .env.example .env

# Edit .env with your preferred LLM model path (see Model Downloads below)
# Then launch:
docker compose up -d
```

### 3. Verify
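The verification in this step can also be scripted. A small Python poller that waits for the API to come up; only the `/api/health` URL comes from this guide, and the `fetch` hook exists purely so the helper can be exercised without a running stack:

```python
import time
import urllib.request

def wait_for_health(url: str = "http://localhost:8000/api/health",
                    timeout: float = 120.0,
                    interval: float = 2.0,
                    fetch=None) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout` seconds pass."""
    if fetch is None:
        def fetch(target):
            with urllib.request.urlopen(target, timeout=5) as resp:
                return resp.status
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            pass  # containers may still be starting
        time.sleep(interval)
    raise TimeoutError(f"{url} not healthy after {timeout:.0f}s")
```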
```bash
# Check all containers are running
docker compose ps

# Expected output:
# forge-api      running   0.0.0.0:8000->8000/tcp
# forge-qdrant   running   0.0.0.0:6333->6333/tcp
# forge-redis    running   0.0.0.0:6379->6379/tcp

# Health check
curl http://localhost:8000/api/health
```

## Option B: Local Development Setup
For contributors or those who want to run services individually.
### 1. Backend (Python)

```bash
cd tactical-rag-system

# Create virtual environment
python3.11 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install llama-cpp-python with CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```

### 2. Vector Database (Qdrant)
```bash
# Run Qdrant via Docker (even for local dev, this is easiest)
docker run -d --name forge-qdrant \
  -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/qdrant_data:/qdrant/storage \
  qdrant/qdrant:latest
```

### 3. Cache (Redis)
```bash
docker run -d --name forge-redis \
  -p 6379:6379 \
  redis:7-alpine
```

### 4. Start the Backend
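With Qdrant and Redis running, you can confirm both are reachable before starting the API. A stdlib-only sketch; the ports come from the docker commands above, and the helper itself is illustrative rather than part of Forge:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# for name, port in [("Qdrant", 6333), ("Redis", 6379)]:
#     print(name, "up" if port_open("localhost", port) else "DOWN")
```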
```bash
# From project root with .venv activated
cp .env.example .env
# Edit .env as needed
python -m uvicorn forge.main:app --host 0.0.0.0 --port 8000 --reload
```

### 5. Frontend (Tauri + React)
```bash
cd frontend

# Install Node dependencies
npm install

# Development mode (opens Tauri window)
npm run tauri dev

# Or just the web UI:
npm run dev
```

## Model Downloads
Forge requires two models. The backend auto-downloads them on first launch, but you can pre-download for air-gapped environments.
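For air-gapped machines, a pre-flight check that the expected files are already on disk can save a failed first launch. An illustrative helper; the paths mirror the example download locations used later in this guide:

```python
from pathlib import Path

# Paths as used in this guide's download examples and .env settings.
EXPECTED = {
    "BGE_M3_MODEL_PATH": "models/bge-m3",
    "LLM_MODEL_PATH": "models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
}

def missing_models(paths: dict) -> list:
    """Return the names of settings whose path does not exist on disk."""
    return [name for name, path in paths.items() if not Path(path).exists()]

# print(missing_models(EXPECTED))  # an empty list means both models are in place
```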
### BGE-M3 (Embedding Model)
Downloaded automatically via the FlagEmbedding library. Manual download:
```bash
# Using huggingface-cli
huggingface-cli download BAAI/bge-m3 --local-dir models/bge-m3

# Or with git
git lfs install
git clone https://huggingface.co/BAAI/bge-m3 models/bge-m3
```

Set in your `.env`:

```bash
BGE_M3_MODEL_PATH=models/bge-m3
```

### LLM (GGUF Format for llama.cpp)
Forge uses llama.cpp for GPU-accelerated LLM inference. Download a GGUF quantized model:
```bash
# Recommended: Mistral 7B Q4_K_M (~4.4GB, fits easily in 16GB VRAM)
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --local-dir models/

# Alternative: Llama 3.1 8B Q4_K_M (~4.9GB)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir models/
```

Set in your `.env`:

```bash
LLM_MODEL_PATH=models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
LLM_GPU_LAYERS=-1
LLM_CONTEXT_SIZE=8192
```

For 16GB VRAM, use a Q4_K_M quantization of a 7-8B parameter model. For 24GB VRAM, you can run 13B models or step up to Q5_K_M / Q6_K quantization for better quality. `LLM_GPU_LAYERS=-1` offloads all layers to the GPU.
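The file sizes quoted here follow from simple arithmetic: parameter count times bits per weight, divided by 8. A sketch; the bits-per-weight figures are approximate averages for llama.cpp quantization types, not exact values:

```python
# Approximate average bits per weight for common llama.cpp quantizations.
QUANT_BITS = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q6_K": 6.59, "Q8_0": 8.50}

def gguf_size_gb(params_billion: float, quant: str) -> float:
    """Rough GGUF file size in GB: params * bits-per-weight / 8."""
    return params_billion * QUANT_BITS[quant] / 8

print(f"7B Q4_K_M ~ {gguf_size_gb(7, 'Q4_K_M'):.1f} GB")
print(f"8B Q4_K_M ~ {gguf_size_gb(8, 'Q4_K_M'):.1f} GB")
```

Note this estimates only the weights file; at runtime the KV cache (which grows with `LLM_CONTEXT_SIZE`) and compute buffers also occupy VRAM, which is why the VRAM budget table allows 10-14GB for the LLM.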
## Environment Variables
Create a .env file in the project root. Here are all available variables:
```bash
# === Core ===
FORGE_ENV=production        # production | development
FORGE_PORT=8000             # API server port
FORGE_HOST=0.0.0.0          # API server host
FORGE_LOG_LEVEL=info        # debug | info | warning | error

# === LLM (llama.cpp) ===
LLM_MODEL_PATH=models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
LLM_GPU_LAYERS=-1           # -1 = all layers on GPU
LLM_CONTEXT_SIZE=8192       # Context window size
LLM_MAX_TOKENS=2048         # Max generation tokens
LLM_TEMPERATURE=0.1         # Generation temperature
LLM_THREADS=8               # CPU threads for non-GPU ops

# === BGE-M3 ===
BGE_M3_MODEL_PATH=models/bge-m3
BGE_M3_MAX_LENGTH=8192      # Max token length for embeddings
BGE_M3_BATCH_SIZE=32        # Embedding batch size

# === Qdrant ===
QDRANT_HOST=localhost
QDRANT_PORT=6333
QDRANT_COLLECTION=forge_documents

# === Redis ===
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_DB=0

# === Features ===
FORGE_MODE=agentic          # agentic | direct
FORGE_CRAG_ENABLED=true
FORGE_PROPOSITIONS_ENABLED=true
FORGE_GRAPH_ENABLED=true
FORGE_VERIFICATION_ENABLED=true
FORGE_CACHE_ENABLED=true
FORGE_CACHE_TTL=3600        # Cache TTL in seconds

# === CPU-Only Mode ===
FORGE_CPU_ONLY=false        # Set true for no-GPU environments
```

## Verifying Your Installation
Run the built-in integration test suite to confirm everything works:
```bash
# From project root
python -m pytest tests/integration/ -v
```

Expected output:

```
tests/integration/test_health.py::test_health_endpoint PASSED
tests/integration/test_embedding.py::test_bge_m3_dense PASSED
tests/integration/test_embedding.py::test_bge_m3_sparse PASSED
tests/integration/test_embedding.py::test_bge_m3_colbert PASSED
tests/integration/test_ingestion.py::test_document_upload PASSED
tests/integration/test_ingestion.py::test_full_pipeline PASSED
tests/integration/test_query.py::test_direct_query PASSED
tests/integration/test_query.py::test_agentic_query PASSED
tests/integration/test_query.py::test_streaming PASSED
tests/integration/test_query.py::test_crag_evaluation PASSED

10 passed in 45.23s
```

## Next Steps
- Configuration — Fine-tune every parameter
- Quick Start — Upload your first document and query
- Architecture — Understand how the pieces fit together