# Configuration

Forge V5 is configured primarily through `config.yml` in the project root. Every setting can also be overridden via environment variables.

## `config.yml` Structure
```yaml
# config.yml — Forge V5 Configuration
# ================================================

# Server
server:
  host: "0.0.0.0"
  port: 8000
  workers: 1
  log_level: "info"        # debug | info | warning | error

# Mode Selection
query:
  default_mode: "agentic"  # agentic | direct
  max_iterations: 8        # Max agent loop iterations (agentic mode)
  timeout_seconds: 30      # Query timeout
  stream: true             # Enable SSE streaming by default

# LLM (llama.cpp)
llm:
  model_path: "models/mistral-7b-instruct-v0.2.Q4_K_M.gguf"
  gpu_layers: -1           # -1 = all layers on GPU
  context_size: 8192
  max_tokens: 2048
  temperature: 0.1
  top_p: 0.95
  repeat_penalty: 1.1
  threads: 8               # CPU threads for non-GPU operations
  seed: -1                 # -1 = random

# BGE-M3 Embedding
bge_m3:
  model_path: "models/bge-m3"
  max_length: 8192
  batch_size: 32
  use_fp16: false          # CPU doesn't benefit from fp16
  return_dense: true
  return_sparse: true
  return_colbert: true

# ColBERT Reranking
colbert:
  enabled: true
  top_k: 20                # How many candidates to rerank
  final_k: 5               # How many to keep after reranking
  score_threshold: 0.35    # Minimum MaxSim score

# CRAG (Corrective RAG)
crag:
  enabled: true
  model: "cross-encoder/ms-marco-MiniLM-L-12-v2"
  threshold_correct: 0.7   # Score >= this → CORRECT
  threshold_ambiguous: 0.4 # Score >= this → AMBIGUOUS (below → INCORRECT)
  max_retries: 2           # Re-retrieval attempts for low-quality results
  expand_ambiguous: true   # Fetch parent chunks for ambiguous docs

# Agent (LangGraph)
agent:
  max_iterations: 8
  tools:
    - semantic_search
    - proposition_search
    - graph_traverse
    - rerank_colbert
    - decompose_query
    - hyde_search
    - generate_answer
  reflection_enabled: true # Agent reflects on retrieved evidence quality
  early_stop: true         # Stop iterating when evidence is sufficient

# Propositions (Dense-X)
propositions:
  enabled: true
  min_propositions: 1      # Minimum per chunk
  max_propositions: 10     # Maximum per chunk
  extraction_prompt: "default" # Uses forge/prompts/proposition_extraction.txt

# Hierarchical Indexing
hierarchy:
  levels:
    L0:                    # Document summaries
      enabled: true
      max_summary_length: 500
    L1:                    # Section summaries
      enabled: true
      chunking: "heading"  # Split by headings
      max_summary_length: 300
    L2:                    # Semantic chunks
      enabled: true
      chunk_size: 512
      chunk_overlap: 50
      method: "semantic"   # semantic | fixed | sentence
    L3:                    # Propositions
      enabled: true        # Controlled by propositions.enabled

# Knowledge Graph
graph:
  enabled: true
  extraction_model: "llm"  # Uses the configured LLM
  entity_types:
    - PERSON
    - ORGANIZATION
    - CONCEPT
    - DOCUMENT
    - TECHNOLOGY
    - LOCATION
    - DATE
  relationship_types:
    - AUTHORED_BY
    - REFERENCES
    - PART_OF
    - RELATED_TO
    - LOCATED_IN
    - OCCURRED_ON
  max_entities_per_chunk: 15
  max_relationships_per_chunk: 20
  storage: "hybrid"        # Qdrant payload + Redis adjacency list

# Contextual Retrieval
contextual_retrieval:
  enabled: true
  context_prompt: "default" # Uses forge/prompts/contextual_prefix.txt
  max_context_length: 200  # Max tokens for context prefix

# Self-Verification
verification:
  enabled: true
  max_claims: 10           # Max claims to verify per answer
  confidence_threshold: 0.7 # Below this, flag as uncertain

# HyDE (Hypothetical Document Embeddings)
hyde:
  enabled: true
  num_hypothetical: 1      # Number of hypothetical documents to generate

# Cache
cache:
  enabled: true
  backend: "redis"
  ttl: 3600                # TTL in seconds
  max_entries: 10000
  similarity_threshold: 0.95 # Semantic cache matching threshold

# Qdrant
qdrant:
  host: "localhost"
  port: 6333
  collection: "forge_documents"
  vector_config:
    dense:
      size: 1024           # BGE-M3 dense dimension
      distance: "Cosine"
    sparse:
      index:
        on_disk: false
    colbert:
      size: 1024           # BGE-M3 ColBERT token dimension
      distance: "Cosine"
      multivector:
        comparator: "max_sim"

# Redis
redis:
  host: "localhost"
  port: 6379
  db: 0
  password: null
```

## Section-by-Section Breakdown
### Query Mode Selection
```yaml
query:
  default_mode: "agentic"
```

| Mode | Behavior | Latency | Best For |
|---|---|---|---|
| `direct` | Single-pass: embed → search → rerank → generate | 2–5 s | Simple factual questions |
| `agentic` | LangGraph ReAct loop with autonomous tool selection | 5–15 s | Complex, multi-hop, or ambiguous questions |
You can override the mode per request via the API:

```bash
curl -X POST http://localhost:8000/api/query/stream \
  -H "Content-Type: application/json" \
  -d '{"query": "...", "mode": "direct"}'
```

### CRAG Thresholds
The CRAG quality gate is one of the most impactful settings. The cross-encoder scores each retrieved document against the query:
- Score ≥ 0.7 → CORRECT → used directly
- 0.4 ≤ Score < 0.7 → AMBIGUOUS → parent chunk expanded, re-scored
- Score < 0.4 → INCORRECT → discarded

Tuning tips:

- Raise `threshold_correct` (e.g., 0.8) for higher precision at the cost of more re-retrievals
- Lower `threshold_ambiguous` (e.g., 0.3) to be more lenient with borderline documents
- Set `max_retries: 0` to disable re-retrieval (faster, but may produce lower quality)
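The threshold logic above can be sketched in a few lines of Python. This is a minimal illustration of the gate, not Forge's actual implementation; the function name is hypothetical, but the defaults mirror the config values shown earlier:

```python
# Hypothetical sketch of the CRAG quality gate. Scores come from the
# cross-encoder configured under crag.model in config.yml.

def crag_label(score: float,
               threshold_correct: float = 0.7,
               threshold_ambiguous: float = 0.4) -> str:
    """Map a cross-encoder relevance score to a CRAG quality label."""
    if score >= threshold_correct:
        return "CORRECT"    # used directly
    if score >= threshold_ambiguous:
        return "AMBIGUOUS"  # parent chunk expanded, then re-scored
    return "INCORRECT"      # discarded

print(crag_label(0.82))  # CORRECT
print(crag_label(0.55))  # AMBIGUOUS
print(crag_label(0.21))  # INCORRECT
```

Note the boundaries are inclusive on the lower side: a score of exactly 0.7 counts as CORRECT and exactly 0.4 as AMBIGUOUS, matching the comments in `config.yml`.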
### Agent Tools
The agentic mode uses these tools in its LangGraph state graph:
| Tool | Function | File |
|---|---|---|
| `semantic_search` | BGE-M3 dense + sparse vector search in Qdrant | `forge/retrieval/search.py` |
| `proposition_search` | Search the L3 proposition index specifically | `forge/retrieval/search.py` |
| `graph_traverse` | Walk the knowledge graph via Redis adjacency | `forge/retrieval/graph.py` |
| `rerank_colbert` | ColBERT MaxSim reranking of candidate chunks | `forge/retrieval/rerank.py` |
| `decompose_query` | Split a complex query into sub-queries | `forge/retrieval/agent.py` |
| `hyde_search` | Generate a hypothetical answer, embed it, search | `forge/retrieval/hyde.py` |
| `generate_answer` | Final answer generation with context | `forge/generation/generator.py` |
Remove any tool from the `agent.tools` list to prevent the agent from using it. This is useful for debugging, or when a particular technique isn't relevant to your data.
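The effect of the `agent.tools` allow-list can be pictured as a simple filter over a tool registry. A hypothetical sketch — the tool names come from the config above, but the registry and the stub functions are invented for illustration:

```python
# Illustrative only: gate which tool implementations reach the agent
# based on the agent.tools list from config.yml.

ALL_TOOLS = {
    "semantic_search": lambda q: f"dense+sparse results for {q!r}",
    "hyde_search": lambda q: f"HyDE results for {q!r}",
    "generate_answer": lambda q: f"answer for {q!r}",
}

def enabled_tools(config_tools: list[str]) -> dict:
    """Keep only the tools named in agent.tools, preserving config order."""
    return {name: ALL_TOOLS[name] for name in config_tools if name in ALL_TOOLS}

tools = enabled_tools(["semantic_search", "generate_answer"])
print(sorted(tools))  # ['generate_answer', 'semantic_search']
```

Unknown names are silently skipped here; a real loader would more likely raise an error on a typo in the config.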
### Hierarchical Levels
The 4-level hierarchy gives the agent different granularity options:
| Level | Granularity | Example question |
|---|---|---|
| L0 | Document summary | "What is this paper about?" |
| L1 | Section summary | "What does Section 3 discuss?" |
| L2 | Semantic chunk | "What specific method was used?" |
| L3 | Proposition | "What exact value was reported?" |

Each level has its own embedding in Qdrant and can be searched independently. The `hierarchy.L2.method` field controls how raw text is split into chunks:

- `semantic` — Splits at natural topic boundaries using embedding similarity (default, best quality)
- `fixed` — Fixed-size windows with overlap (fastest, simplest)
- `sentence` — Splits at sentence boundaries, grouping into ~512-token chunks
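As an illustration of the simplest strategy, a `fixed` splitter with overlap might look like the sketch below. It counts whitespace-separated words rather than real tokens, and the function name is invented; Forge's actual chunker may differ:

```python
def fixed_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size word windows that overlap by `overlap` words."""
    words = text.split()
    step = chunk_size - overlap          # advance by size minus overlap
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + chunk_size]) for i in starts]

print(fixed_chunks("a b c d e f g h", chunk_size=4, overlap=2))
# ['a b c d', 'c d e f', 'e f g h']
```

With the defaults (`chunk_size: 512`, `chunk_overlap: 50`), each window repeats the last 50 units of the previous one, so no fact is stranded on a chunk boundary.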
### Qdrant Vector Configuration
Forge stores three named vectors per point in a single Qdrant collection:
```yaml
vector_config:
  dense:
    size: 1024             # BGE-M3 dense embedding
    distance: "Cosine"
  sparse:                  # BGE-M3 sparse (lexical) embedding
    index:
      on_disk: false
  colbert:
    size: 1024             # BGE-M3 ColBERT per-token embeddings
    distance: "Cosine"
    multivector:
      comparator: "max_sim"
```

This single-collection, multi-vector architecture means a query can combine:

- Dense search for semantic similarity
- Sparse search for keyword matching
- ColBERT reranking for token-level precision

All on the same data, with no separate indices to keep in sync.
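The `max_sim` comparator named in the `colbert` block can be sketched in plain Python: score each query token against its best-matching document token, then sum those maxima. Toy 2-d vectors are used here for readability; the real BGE-M3 token vectors are 1024-d, and this is an illustration of the scoring rule, not Forge's reranker code:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def max_sim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """ColBERT late interaction: for each query token, take its best
    document-token match, then sum over all query tokens."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]       # two query-token embeddings
doc   = [[1.0, 0.0], [0.5, 0.5]]       # two document-token embeddings
print(round(max_sim(query, doc), 3))   # 1.707
```

Candidates whose MaxSim falls below `colbert.score_threshold` are dropped, and only the `final_k` best survivors reach the generator.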
## Environment Variable Overrides
Every config key can be overridden via environment variables using the pattern `FORGE_<SECTION>_<KEY>`:
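A hypothetical sketch of how such an override pass could work. Because both section and key names may contain underscores (`contextual_retrieval`, `threshold_correct`), the sketch matches the longest known section prefix; type coercion and the actual section list are simplified, and none of these names are Forge's real internals:

```python
import os

# Subset of config.yml sections, for illustration only.
SECTIONS = ["llm", "crag", "query", "cache", "contextual_retrieval"]

def apply_env_overrides(config: dict, environ=os.environ) -> dict:
    """Overlay FORGE_<SECTION>_<KEY> variables onto a nested config dict."""
    for name, value in environ.items():
        if not name.startswith("FORGE_"):
            continue
        rest = name[len("FORGE_"):].lower()
        # Try the longest section names first, so contextual_retrieval
        # wins over any shorter accidental prefix match.
        for section in sorted(SECTIONS, key=len, reverse=True):
            if rest.startswith(section + "_"):
                key = rest[len(section) + 1:]
                config.setdefault(section, {})[key] = value
                break
    return config

cfg = apply_env_overrides({"llm": {"temperature": 0.1}},
                          {"FORGE_LLM_TEMPERATURE": "0.2", "FORGE_CACHE_TTL": "7200"})
print(cfg)  # {'llm': {'temperature': '0.2'}, 'cache': {'ttl': '7200'}}
```

Values arrive as strings from the environment; a real loader would also coerce them to the types declared in `config.yml`. The shell examples below show the same name mappings.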
```bash
# config.yml: llm.temperature → FORGE_LLM_TEMPERATURE
export FORGE_LLM_TEMPERATURE=0.2

# config.yml: crag.threshold_correct → FORGE_CRAG_THRESHOLD_CORRECT
export FORGE_CRAG_THRESHOLD_CORRECT=0.8

# config.yml: query.default_mode → FORGE_QUERY_DEFAULT_MODE
export FORGE_QUERY_DEFAULT_MODE=direct

# config.yml: cache.ttl → FORGE_CACHE_TTL
export FORGE_CACHE_TTL=7200
```

Environment variables take precedence over `config.yml`. This is useful for Docker deployments, where you can pass env vars via docker compose:
```yaml
# docker-compose.yml
services:
  forge-api:
    environment:
      - FORGE_LLM_TEMPERATURE=0.2
      - FORGE_QUERY_DEFAULT_MODE=agentic
      - FORGE_CACHE_TTL=7200
```

## Common Configurations
### Maximum Quality (24GB VRAM)
```yaml
llm:
  model_path: "models/llama-3.1-8b-instruct.Q6_K.gguf"
  context_size: 16384
  temperature: 0.1
colbert:
  top_k: 40
  final_k: 8
crag:
  threshold_correct: 0.75
agent:
  max_iterations: 12
  reflection_enabled: true
```

### Fastest Responses (16GB VRAM)
```yaml
llm:
  model_path: "models/mistral-7b-instruct.Q4_K_M.gguf"
  max_tokens: 1024
query:
  default_mode: "direct"
colbert:
  enabled: false
verification:
  enabled: false
cache:
  ttl: 86400
```

### Minimal VRAM (12GB)
```yaml
llm:
  model_path: "models/phi-3-mini-4k-instruct.Q4_K_M.gguf"
  gpu_layers: 28           # Partial offload
  context_size: 4096
propositions:
  enabled: false
graph:
  enabled: false
```

## Next Steps
- Quick Start — Try your first query
- Techniques Overview — Understand what each technique does
- Architecture Overview — See how config maps to system behavior