# Configuration

Forge V5 is configured primarily through `config.yml` in the project root. Every setting can also be overridden via environment variables.

## `config.yml` Structure
```yaml
# config.yml — Forge V5 Configuration
# ================================================

# Server
server:
  host: "0.0.0.0"
  port: 8000
  workers: 1
  log_level: "info"        # debug | info | warning | error

# Mode Selection
query:
  default_mode: "agentic"  # agentic | direct
  max_iterations: 8        # Max agent loop iterations (agentic mode)
  timeout_seconds: 30      # Query timeout
  stream: true             # Enable SSE streaming by default

# LLM (llama.cpp)
llm:
  model_path: "models/mistral-7b-instruct-v0.2.Q4_K_M.gguf"
  gpu_layers: -1           # -1 = all layers on GPU
  context_size: 8192
  max_tokens: 2048
  temperature: 0.1
  top_p: 0.95
  repeat_penalty: 1.1
  threads: 8               # CPU threads for non-GPU operations
  seed: -1                 # -1 = random

# BGE-M3 Embedding
bge_m3:
  model_path: "models/bge-m3"
  max_length: 8192
  batch_size: 32
  use_fp16: false          # CPU doesn't benefit from fp16
  return_dense: true
  return_sparse: true
  return_colbert: true

# ColBERT Reranking
colbert:
  enabled: true
  top_k: 20                # How many candidates to rerank
  final_k: 5               # How many to keep after reranking
  score_threshold: 0.35    # Minimum MaxSim score

# CRAG (Corrective RAG)
crag:
  enabled: true
  model: "cross-encoder/ms-marco-MiniLM-L-12-v2"
  threshold_correct: 0.7   # Score >= this → CORRECT
  threshold_ambiguous: 0.4 # Score >= this → AMBIGUOUS (below → INCORRECT)
  max_retries: 2           # Re-retrieval attempts for low-quality results
  expand_ambiguous: true   # Fetch parent chunks for ambiguous docs

# Agent (LangGraph)
agent:
  max_iterations: 8
  tools:
    - semantic_search
    - proposition_search
    - graph_traverse
    - rerank_colbert
    - decompose_query
    - hyde_search
    - generate_answer
  reflection_enabled: true # Agent reflects on retrieved evidence quality
  early_stop: true         # Stop iterating when evidence is sufficient

# Propositions (Dense-X)
propositions:
  enabled: true
  min_propositions: 1      # Minimum per chunk
  max_propositions: 10     # Maximum per chunk
  extraction_prompt: "default" # Uses forge/prompts/proposition_extraction.txt

# Hierarchical Indexing
hierarchy:
  levels:
    L0:                    # Document summaries
      enabled: true
      max_summary_length: 500
    L1:                    # Section summaries
      enabled: true
      chunking: "heading"  # Split by headings
      max_summary_length: 300
    L2:                    # Semantic chunks
      enabled: true
      chunk_size: 512
      chunk_overlap: 50
      method: "semantic"   # semantic | fixed | sentence
    L3:                    # Propositions
      enabled: true        # Controlled by propositions.enabled

# Knowledge Graph
graph:
  enabled: true
  extraction_model: "llm"  # Uses the configured LLM
  entity_types:
    - PERSON
    - ORGANIZATION
    - CONCEPT
    - DOCUMENT
    - TECHNOLOGY
    - LOCATION
    - DATE
  relationship_types:
    - AUTHORED_BY
    - REFERENCES
    - PART_OF
    - RELATED_TO
    - LOCATED_IN
    - OCCURRED_ON
  max_entities_per_chunk: 15
  max_relationships_per_chunk: 20
  storage: "hybrid"        # Qdrant payload + Redis adjacency list

# Contextual Retrieval
contextual_retrieval:
  enabled: true
  context_prompt: "default" # Uses forge/prompts/contextual_prefix.txt
  max_context_length: 200  # Max tokens for context prefix

# Self-Verification
verification:
  enabled: true
  max_claims: 10           # Max claims to verify per answer
  confidence_threshold: 0.7 # Below this, flag as uncertain

# HyDE (Hypothetical Document Embeddings)
hyde:
  enabled: true
  num_hypothetical: 1      # Number of hypothetical documents to generate

# Cache
cache:
  enabled: true
  backend: "redis"
  ttl: 3600                # TTL in seconds
  max_entries: 10000
  similarity_threshold: 0.95 # Semantic cache matching threshold

# Qdrant
qdrant:
  host: "localhost"
  port: 6333
  collection: "forge_documents"
  vector_config:
    dense:
      size: 1024           # BGE-M3 dense dimension
      distance: "Cosine"
    sparse:
      index:
        on_disk: false
    colbert:
      size: 1024           # BGE-M3 ColBERT token dimension
      distance: "Cosine"
      multivector:
        comparator: "max_sim"

# Redis
redis:
  host: "localhost"
  port: 6379
  db: 0
  password: null
```

## Section-by-Section Breakdown
### Query Mode Selection
```yaml
query:
  default_mode: "agentic"
```

| Mode | Behavior | Latency | Best For |
|---|---|---|---|
| `direct` | Single-pass: embed → search → rerank → generate | 2–5 s | Simple factual questions |
| `agentic` | LangGraph ReAct loop with autonomous tool selection | 5–15 s | Complex, multi-hop, or ambiguous questions |
You can override the mode per request via the API:

```bash
curl -X POST http://localhost:8000/api/query/stream \
  -H "Content-Type: application/json" \
  -d '{"query": "...", "mode": "direct"}'
```

### CRAG Thresholds
The CRAG quality gate is one of the most impactful settings. The cross-encoder scores each retrieved document against the query:
- Score ≥ 0.7 → CORRECT → used directly
- 0.4 ≤ Score < 0.7 → AMBIGUOUS → parent chunk expanded, re-scored
- Score < 0.4 → INCORRECT → discarded

Tuning tips:

- Raise `threshold_correct` (e.g., 0.8) for higher precision at the cost of more re-retrievals
- Lower `threshold_ambiguous` (e.g., 0.3) to be more lenient with borderline documents
- Set `max_retries: 0` to disable re-retrieval (faster, but may produce lower quality)
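The threshold logic above can be sketched in a few lines of Python. This is a minimal illustration of the gate, not Forge's actual implementation; the function name is hypothetical, but the defaults mirror the config values shown earlier:

```python
# Hypothetical sketch of the CRAG quality gate. Scores come from the
# cross-encoder configured under crag.model in config.yml.

def crag_label(score: float,
               threshold_correct: float = 0.7,
               threshold_ambiguous: float = 0.4) -> str:
    """Map a cross-encoder relevance score to a CRAG quality label."""
    if score >= threshold_correct:
        return "CORRECT"    # used directly
    if score >= threshold_ambiguous:
        return "AMBIGUOUS"  # parent chunk expanded, then re-scored
    return "INCORRECT"      # discarded

print(crag_label(0.82))  # CORRECT
print(crag_label(0.55))  # AMBIGUOUS
print(crag_label(0.21))  # INCORRECT
```

Note the boundaries are inclusive on the lower side: a score of exactly 0.7 counts as CORRECT and exactly 0.4 as AMBIGUOUS, matching the comments in `config.yml`.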
### Agent Tools
The agentic mode uses these tools in its LangGraph state graph:
| Tool | Function | File |
|---|---|---|
| `semantic_search` | BGE-M3 dense + sparse vector search in Qdrant | `forge/retrieval/search.py` |
| `proposition_search` | Search the L3 proposition index specifically | `forge/retrieval/search.py` |
| `graph_traverse` | Walk the knowledge graph via Redis adjacency | `forge/retrieval/graph.py` |
| `rerank_colbert` | ColBERT MaxSim reranking of candidate chunks | `forge/retrieval/rerank.py` |
| `decompose_query` | Split a complex query into sub-queries | `forge/retrieval/agent.py` |
| `hyde_search` | Generate a hypothetical answer, embed it, search | `forge/retrieval/hyde.py` |
| `generate_answer` | Final answer generation with context | `forge/generation/generator.py` |
Remove any tool from the `agent.tools` list to prevent the agent from using it. This is useful for debugging, or when a particular technique isn't relevant to your data.
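The effect of the `agent.tools` allow-list can be pictured as a simple filter over a tool registry. A hypothetical sketch — the tool names come from the config above, but the registry and the stub functions are invented for illustration:

```python
# Illustrative only: gate which tool implementations reach the agent
# based on the agent.tools list from config.yml.

ALL_TOOLS = {
    "semantic_search": lambda q: f"dense+sparse results for {q!r}",
    "hyde_search": lambda q: f"HyDE results for {q!r}",
    "generate_answer": lambda q: f"answer for {q!r}",
}

def enabled_tools(config_tools: list[str]) -> dict:
    """Keep only the tools named in agent.tools, preserving config order."""
    return {name: ALL_TOOLS[name] for name in config_tools if name in ALL_TOOLS}

tools = enabled_tools(["semantic_search", "generate_answer"])
print(sorted(tools))  # ['generate_answer', 'semantic_search']
```

Unknown names are silently skipped here; a real loader would more likely raise an error on a typo in the config.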
### Hierarchical Levels
The 4-level hierarchy gives the agent different granularity options:
| Level | Granularity | Example question |
|---|---|---|
| L0 | Document summary | "What is this paper about?" |
| L1 | Section summary | "What does Section 3 discuss?" |
| L2 | Semantic chunk | "What specific method was used?" |
| L3 | Proposition | "What exact value was reported?" |

Each level has its own embedding in Qdrant and can be searched independently. The `hierarchy.L2.method` field controls how raw text is split into chunks:

- `semantic` — Splits at natural topic boundaries using embedding similarity (default, best quality)
- `fixed` — Fixed-size windows with overlap (fastest, simplest)
- `sentence` — Splits at sentence boundaries, grouping into ~512-token chunks
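As an illustration of the simplest strategy, a `fixed` splitter with overlap might look like the sketch below. It counts whitespace-separated words rather than real tokens, and the function name is invented; Forge's actual chunker may differ:

```python
def fixed_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size word windows that overlap by `overlap` words."""
    words = text.split()
    step = chunk_size - overlap          # advance by size minus overlap
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + chunk_size]) for i in starts]

print(fixed_chunks("a b c d e f g h", chunk_size=4, overlap=2))
# ['a b c d', 'c d e f', 'e f g h']
```

With the defaults (`chunk_size: 512`, `chunk_overlap: 50`), each window repeats the last 50 units of the previous one, so no fact is stranded on a chunk boundary.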
### Qdrant Vector Configuration
Forge stores three named vectors per point in a single Qdrant collection:
```yaml
vector_config:
  dense:
    size: 1024             # BGE-M3 dense embedding
    distance: "Cosine"
  sparse:                  # BGE-M3 sparse (lexical) embedding
    index:
      on_disk: false
  colbert:
    size: 1024             # BGE-M3 ColBERT per-token embeddings
    distance: "Cosine"
    multivector:
      comparator: "max_sim"
```

This single-collection, multi-vector architecture means a query can combine:

- Dense search for semantic similarity
- Sparse search for keyword matching
- ColBERT reranking for token-level precision

All on the same data, with no separate indices to keep in sync.
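The `max_sim` comparator named in the `colbert` block can be sketched in plain Python: score each query token against its best-matching document token, then sum those maxima. Toy 2-d vectors are used here for readability; the real BGE-M3 token vectors are 1024-d, and this is an illustration of the scoring rule, not Forge's reranker code:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def max_sim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """ColBERT late interaction: for each query token, take its best
    document-token match, then sum over all query tokens."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]       # two query-token embeddings
doc   = [[1.0, 0.0], [0.5, 0.5]]       # two document-token embeddings
print(round(max_sim(query, doc), 3))   # 1.707
```

Candidates whose MaxSim falls below `colbert.score_threshold` are dropped, and only the `final_k` best survivors reach the generator.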
## Environment Variable Overrides
Every config key can be overridden via environment variables using the pattern `FORGE_<SECTION>_<KEY>`:
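A hypothetical sketch of how such an override pass could work. Because both section and key names may contain underscores (`contextual_retrieval`, `threshold_correct`), the sketch matches the longest known section prefix; type coercion and the actual section list are simplified, and none of these names are Forge's real internals:

```python
import os

# Subset of config.yml sections, for illustration only.
SECTIONS = ["llm", "crag", "query", "cache", "contextual_retrieval"]

def apply_env_overrides(config: dict, environ=os.environ) -> dict:
    """Overlay FORGE_<SECTION>_<KEY> variables onto a nested config dict."""
    for name, value in environ.items():
        if not name.startswith("FORGE_"):
            continue
        rest = name[len("FORGE_"):].lower()
        # Try the longest section names first, so contextual_retrieval
        # wins over any shorter accidental prefix match.
        for section in sorted(SECTIONS, key=len, reverse=True):
            if rest.startswith(section + "_"):
                key = rest[len(section) + 1:]
                config.setdefault(section, {})[key] = value
                break
    return config

cfg = apply_env_overrides({"llm": {"temperature": 0.1}},
                          {"FORGE_LLM_TEMPERATURE": "0.2", "FORGE_CACHE_TTL": "7200"})
print(cfg)  # {'llm': {'temperature': '0.2'}, 'cache': {'ttl': '7200'}}
```

Values arrive as strings from the environment; a real loader would also coerce them to the types declared in `config.yml`. The shell examples below show the same name mappings.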
```bash
# config.yml: llm.temperature → FORGE_LLM_TEMPERATURE
export FORGE_LLM_TEMPERATURE=0.2

# config.yml: crag.threshold_correct → FORGE_CRAG_THRESHOLD_CORRECT
export FORGE_CRAG_THRESHOLD_CORRECT=0.8

# config.yml: query.default_mode → FORGE_QUERY_DEFAULT_MODE
export FORGE_QUERY_DEFAULT_MODE=direct

# config.yml: cache.ttl → FORGE_CACHE_TTL
export FORGE_CACHE_TTL=7200
```

Environment variables take precedence over `config.yml`. This is useful for Docker deployments, where you can pass env vars via docker compose:
```yaml
# docker-compose.yml
services:
  forge-api:
    environment:
      - FORGE_LLM_TEMPERATURE=0.2
      - FORGE_QUERY_DEFAULT_MODE=agentic
      - FORGE_CACHE_TTL=7200
```

## Common Configurations
### Maximum Quality (24GB VRAM)
```yaml
llm:
  model_path: "models/llama-3.1-8b-instruct.Q6_K.gguf"
  context_size: 16384
  temperature: 0.1
colbert:
  top_k: 40
  final_k: 8
crag:
  threshold_correct: 0.75
agent:
  max_iterations: 12
  reflection_enabled: true
```

### Fastest Responses (16GB VRAM)
```yaml
llm:
  model_path: "models/mistral-7b-instruct.Q4_K_M.gguf"
  max_tokens: 1024
query:
  default_mode: "direct"
colbert:
  enabled: false
verification:
  enabled: false
cache:
  ttl: 86400
```

### Minimal VRAM (12GB)
```yaml
llm:
  model_path: "models/phi-3-mini-4k-instruct.Q4_K_M.gguf"
  gpu_layers: 28           # Partial offload
  context_size: 4096
propositions:
  enabled: false
graph:
  enabled: false
```

## Next Steps
- Quick Start — Try your first query
- Techniques Overview — Understand what each technique does
- Architecture Overview — See how config maps to system behavior