# Installation
Forge V5 can run via Docker (recommended) or as a local development setup. This guide covers both.
## System Requirements

### Hardware
| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| GPU | 16GB VRAM | 24GB VRAM | RTX 4080/4090, A4000/A6000, or equivalent |
| RAM | 16GB | 32GB+ | Qdrant and Redis run in memory |
| CPU | 8 cores | 16+ cores | BGE-M3 embedding runs on CPU |
| Storage | 20GB | 50GB+ | Models + vector indices + document cache |
| OS | Linux (Ubuntu 22.04+) | Linux | Windows via WSL2 also supported |
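Before installing, you can check the table's VRAM minimum against your own hardware by parsing `nvidia-smi` output. A small sketch; the query flags are standard `nvidia-smi` options, but the helper functions below are illustrative and not part of Forge:

```python
import subprocess

def gpu_memory_mib(smi_output: str) -> list[int]:
    """Parse lines like '16384 MiB' from the output of
    `nvidia-smi --query-gpu=memory.total --format=csv,noheader`."""
    return [int(line.split()[0])
            for line in smi_output.strip().splitlines() if line.strip()]

def meets_minimum(smi_output: str, minimum_mib: int = 16 * 1024) -> bool:
    """True if any detected GPU meets the 16GB minimum from the table above."""
    return any(mem >= minimum_mib for mem in gpu_memory_mib(smi_output))

# On a machine with the NVIDIA driver installed:
# out = subprocess.check_output(
#     ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader"],
#     text=True)
# print(meets_minimum(out))
```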
### VRAM Budget
Forge is designed to fit in 16GB. Here’s how VRAM is allocated:
| Component | VRAM | Device |
|---|---|---|
| LLM (llama.cpp, Q4_K_M quantization) | 10-14GB | GPU |
| BGE-M3 embedding model | ~0 | CPU |
| ColBERT vectors | ~0 | CPU (via BGE-M3) |
| CRAG cross-encoder | ~0 | CPU |
| Qdrant vector search | ~0 | CPU + RAM |
| Redis cache + graph | ~0 | CPU + RAM |
BGE-M3 runs at ~50ms per query on CPU, which is fast enough for real-time use. Keeping it off the GPU reserves all VRAM for the LLM, where it matters most for generation quality.
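To check the ~50ms figure on your own CPU, a generic timing harness is enough. This is a sketch; the commented-out usage assumes the FlagEmbedding package and a downloaded BGE-M3 model, neither of which the harness itself requires:

```python
import statistics
import time

def median_latency_ms(fn, *args, runs: int = 20, warmup: int = 3) -> float:
    """Median wall-clock latency of fn(*args) in milliseconds."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

# Example (assumes FlagEmbedding is installed and the model is on disk):
# from FlagEmbedding import BGEM3FlagModel
# model = BGEM3FlagModel("models/bge-m3")
# print(median_latency_ms(model.encode, ["what is the vram budget?"]))
```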
### Software
- Docker 24.x+ with NVIDIA Container Toolkit (for Docker install)
- NVIDIA Driver 535+ with CUDA 12.x
- Python 3.11+ (for local development)
- Node.js 18+ and npm 9+ (for Tauri frontend development)
- Rust 1.75+ (for Tauri desktop builds)
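One way to sanity-check these minimums is to parse each tool's version string. An illustrative helper, not part of Forge; the sample strings in the comments are typical tool outputs:

```python
import re

def version_tuple(output: str) -> tuple[int, ...]:
    """Extract the first X.Y[.Z] version number from tool output,
    e.g. 'Docker version 24.0.7, build afdd53b' -> (24, 0, 7)."""
    match = re.search(r"(\d+(?:\.\d+)+)", output)
    if match is None:
        raise ValueError(f"no version found in {output!r}")
    return tuple(int(part) for part in match.group(1).split("."))

# Compare against the minimums listed above, e.g.:
# version_tuple("Python 3.11.9") >= (3, 11)
```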
## Option A: Docker Installation (Recommended)

### 1. Install NVIDIA Container Toolkit
If you haven’t already configured Docker for GPU access:
```bash
# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify GPU access in Docker:

```bash
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```

### 2. Clone and Launch
```bash
git clone https://github.com/zhadyz/tactical-rag-system.git
cd tactical-rag-system

# Copy the example environment file
cp .env.example .env

# Edit .env with your preferred LLM model path (see Model Downloads below)
# Then launch:
docker compose up -d
```

### 3. Verify
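The verification in this step can also be scripted. A small Python poller that waits for the API to come up; only the `/api/health` URL comes from this guide, and the `fetch` hook exists purely so the helper can be exercised without a running stack:

```python
import time
import urllib.request

def wait_for_health(url: str = "http://localhost:8000/api/health",
                    timeout: float = 120.0,
                    interval: float = 2.0,
                    fetch=None) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout` seconds pass."""
    if fetch is None:
        def fetch(target):
            with urllib.request.urlopen(target, timeout=5) as resp:
                return resp.status
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            pass  # containers may still be starting
        time.sleep(interval)
    raise TimeoutError(f"{url} not healthy after {timeout:.0f}s")
```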
```bash
# Check all containers are running
docker compose ps

# Expected output:
# forge-api      running   0.0.0.0:8000->8000/tcp
# forge-qdrant   running   0.0.0.0:6333->6333/tcp
# forge-redis    running   0.0.0.0:6379->6379/tcp

# Health check
curl http://localhost:8000/api/health
```

## Option B: Local Development Setup
For contributors or those who want to run services individually.
### 1. Backend (Python)

```bash
cd tactical-rag-system

# Create virtual environment
python3.11 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install llama-cpp-python with CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```

### 2. Vector Database (Qdrant)
```bash
# Run Qdrant via Docker (even for local dev, this is easiest)
docker run -d --name forge-qdrant \
  -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/qdrant_data:/qdrant/storage \
  qdrant/qdrant:latest
```

### 3. Cache (Redis)
```bash
docker run -d --name forge-redis \
  -p 6379:6379 \
  redis:7-alpine
```

### 4. Start the Backend
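With Qdrant and Redis running, you can confirm both are reachable before starting the API. A stdlib-only sketch; the ports come from the docker commands above, and the helper itself is illustrative rather than part of Forge:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# for name, port in [("Qdrant", 6333), ("Redis", 6379)]:
#     print(name, "up" if port_open("localhost", port) else "DOWN")
```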
```bash
# From project root with .venv activated
cp .env.example .env
# Edit .env as needed
python -m uvicorn forge.main:app --host 0.0.0.0 --port 8000 --reload
```

### 5. Frontend (Tauri + React)
```bash
cd frontend

# Install Node dependencies
npm install

# Development mode (opens Tauri window)
npm run tauri dev

# Or just the web UI:
npm run dev
```

## Model Downloads
Forge requires two models. The backend auto-downloads them on first launch, but you can pre-download for air-gapped environments.
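For air-gapped machines, a pre-flight check that the expected files are already on disk can save a failed first launch. An illustrative helper; the paths mirror the example download locations used later in this guide:

```python
from pathlib import Path

# Paths as used in this guide's download examples and .env settings.
EXPECTED = {
    "BGE_M3_MODEL_PATH": "models/bge-m3",
    "LLM_MODEL_PATH": "models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
}

def missing_models(paths: dict) -> list:
    """Return the names of settings whose path does not exist on disk."""
    return [name for name, path in paths.items() if not Path(path).exists()]

# print(missing_models(EXPECTED))  # an empty list means both models are in place
```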
### BGE-M3 (Embedding Model)
Downloaded automatically via the FlagEmbedding library. Manual download:
```bash
# Using huggingface-cli
huggingface-cli download BAAI/bge-m3 --local-dir models/bge-m3

# Or with git
git lfs install
git clone https://huggingface.co/BAAI/bge-m3 models/bge-m3
```

Set in your `.env`:

```bash
BGE_M3_MODEL_PATH=models/bge-m3
```

### LLM (GGUF Format for llama.cpp)
Forge uses llama.cpp for GPU-accelerated LLM inference. Download a GGUF quantized model:
```bash
# Recommended: Mistral 7B Q4_K_M (~4.4GB, fits easily in 16GB VRAM)
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --local-dir models/

# Alternative: Llama 3.1 8B Q4_K_M (~4.9GB)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir models/
```

Set in your `.env`:

```bash
LLM_MODEL_PATH=models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
LLM_GPU_LAYERS=-1
LLM_CONTEXT_SIZE=8192
```

For 16GB VRAM, use a Q4_K_M quantization of a 7-8B parameter model. For 24GB VRAM, you can run 13B models or step up to Q5_K_M / Q6_K quantization for better quality. `LLM_GPU_LAYERS=-1` offloads all layers to the GPU.
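The file sizes quoted here follow from simple arithmetic: parameter count times bits per weight, divided by 8. A sketch; the bits-per-weight figures are approximate averages for llama.cpp quantization types, not exact values:

```python
# Approximate average bits per weight for common llama.cpp quantizations.
QUANT_BITS = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q6_K": 6.59, "Q8_0": 8.50}

def gguf_size_gb(params_billion: float, quant: str) -> float:
    """Rough GGUF file size in GB: params * bits-per-weight / 8."""
    return params_billion * QUANT_BITS[quant] / 8

print(f"7B Q4_K_M ~ {gguf_size_gb(7, 'Q4_K_M'):.1f} GB")
print(f"8B Q4_K_M ~ {gguf_size_gb(8, 'Q4_K_M'):.1f} GB")
```

Note this estimates only the weights file; at runtime the KV cache (which grows with `LLM_CONTEXT_SIZE`) and compute buffers also occupy VRAM, which is why the VRAM budget table allows 10-14GB for the LLM.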
## Environment Variables
Create a .env file in the project root. Here are all available variables:
```bash
# === Core ===
FORGE_ENV=production        # production | development
FORGE_PORT=8000             # API server port
FORGE_HOST=0.0.0.0          # API server host
FORGE_LOG_LEVEL=info        # debug | info | warning | error

# === LLM (llama.cpp) ===
LLM_MODEL_PATH=models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
LLM_GPU_LAYERS=-1           # -1 = all layers on GPU
LLM_CONTEXT_SIZE=8192       # Context window size
LLM_MAX_TOKENS=2048         # Max generation tokens
LLM_TEMPERATURE=0.1         # Generation temperature
LLM_THREADS=8               # CPU threads for non-GPU ops

# === BGE-M3 ===
BGE_M3_MODEL_PATH=models/bge-m3
BGE_M3_MAX_LENGTH=8192      # Max token length for embeddings
BGE_M3_BATCH_SIZE=32        # Embedding batch size

# === Qdrant ===
QDRANT_HOST=localhost
QDRANT_PORT=6333
QDRANT_COLLECTION=forge_documents

# === Redis ===
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_DB=0

# === Features ===
FORGE_MODE=agentic          # agentic | direct
FORGE_CRAG_ENABLED=true
FORGE_PROPOSITIONS_ENABLED=true
FORGE_GRAPH_ENABLED=true
FORGE_VERIFICATION_ENABLED=true
FORGE_CACHE_ENABLED=true
FORGE_CACHE_TTL=3600        # Cache TTL in seconds

# === CPU-Only Mode ===
FORGE_CPU_ONLY=false        # Set true for no-GPU environments
```

## Verifying Your Installation
Run the built-in integration test suite to confirm everything works:
```bash
# From project root
python -m pytest tests/integration/ -v
```

Expected output:

```
tests/integration/test_health.py::test_health_endpoint PASSED
tests/integration/test_embedding.py::test_bge_m3_dense PASSED
tests/integration/test_embedding.py::test_bge_m3_sparse PASSED
tests/integration/test_embedding.py::test_bge_m3_colbert PASSED
tests/integration/test_ingestion.py::test_document_upload PASSED
tests/integration/test_ingestion.py::test_full_pipeline PASSED
tests/integration/test_query.py::test_direct_query PASSED
tests/integration/test_query.py::test_agentic_query PASSED
tests/integration/test_query.py::test_streaming PASSED
tests/integration/test_query.py::test_crag_evaluation PASSED

10 passed in 45.23s
```

## Next Steps
- Configuration — Fine-tune every parameter
- Quick Start — Upload your first document and query
- Architecture — Understand how the pieces fit together