# API Reference
Forge V5 exposes a REST API via FastAPI on port 8000. All endpoints accept and return JSON unless otherwise noted. The API is documented automatically via OpenAPI at http://localhost:8000/docs.
## Base URL

```
http://localhost:8000
```

## Endpoints Overview
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/query | Direct (non-streaming) query |
| POST | /api/query/stream | Streaming query via SSE |
| POST | /api/documents/upload | Upload a document |
| GET | /api/documents | List all documents |
| DELETE | /api/documents/{id} | Delete a document |
| POST | /api/ingest | Trigger manual ingestion |
| GET | /api/ingest/status | Check ingestion status |
| GET | /api/settings | Get current settings |
| PUT | /api/settings | Update settings |
| GET | /api/models | List available models |
| POST | /api/models/load | Load/switch LLM model |
| GET | /api/health | Health check |
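For orientation, these endpoints can be driven from Python with only the standard library. The sketch below targets the non-streaming query endpoint; the helper names `build_query_payload` and `run_query` are illustrative, not part of Forge:

```python
import json
from urllib import request as urlrequest


def build_query_payload(query, mode="agentic", top_k=5, document_ids=None):
    """Assemble a /api/query request body; optional filters are omitted when unused."""
    payload = {"query": query, "mode": mode, "top_k": top_k}
    if document_ids:
        payload["filters"] = {"document_ids": document_ids}
    return payload


def run_query(base_url="http://localhost:8000", **fields):
    """POST the payload to /api/query and return the decoded JSON response."""
    body = json.dumps(build_query_payload(**fields)).encode()
    req = urlrequest.Request(
        f"{base_url}/api/query",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urlrequest.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires a running Forge server with documents indexed.
    result = run_query(query="What are the key findings?")
    print(result["answer"])
```

The request and response schemas for each endpoint follow below.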
## Query
### POST /api/query
Run a query and return the complete response (non-streaming).
Request:
```json
{
  "query": "What are the key findings of the study?",
  "mode": "agentic",
  "top_k": 5,
  "filters": {
    "document_ids": ["doc_abc123"],
    "levels": ["L2", "L3"]
  }
}
```

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| query | string | Yes | — | The question to answer |
| mode | string | No | "agentic" | "agentic" or "direct" |
| top_k | number | No | 5 | Number of source chunks to use |
| filters.document_ids | string[] | No | all | Restrict to specific documents |
| filters.levels | string[] | No | all | Restrict to hierarchy levels |
Response (200):
```json
{
  "answer": "The study identifies three key findings: (1) the correlation between...",
  "sources": [
    {
      "chunk_id": "c_1a2b3c",
      "document_id": "doc_abc123",
      "text": "Our analysis reveals a significant correlation...",
      "level": "L2",
      "score": 0.87,
      "page_numbers": [12],
      "heading": "4.1 Results"
    }
  ],
  "confidence": 0.92,
  "metadata": {
    "mode": "agentic",
    "iterations": 4,
    "tools_used": ["semantic_search", "rerank_colbert", "generate_answer"],
    "total_time_ms": 7240,
    "tokens_generated": 312,
    "cached": false
  },
  "verification": {
    "claims_checked": 5,
    "claims_supported": 5,
    "claims_unsupported": 0,
    "confidence": 0.92
  }
}
```

curl example:
```bash
curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the key findings?",
    "mode": "agentic"
  }'
```

### POST /api/query/stream
Run a query with Server-Sent Events streaming. Same request schema as /api/query.
Request:
```json
{
  "query": "What are the key findings of the study?",
  "mode": "agentic"
}
```

Response: text/event-stream with SSE events.
See Streaming Protocol for the complete event schema.
curl example:
```bash
curl -N http://localhost:8000/api/query/stream \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the key findings?",
    "mode": "agentic"
  }'
```

The -N flag disables curl’s output buffering so SSE events appear in real time.
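Events on the stream use standard SSE framing: one or more `data:` lines terminated by a blank line. As an illustrative sketch, the frames can be decoded like this (the event payload fields are placeholders; the real event schema is defined in the Streaming Protocol document):

```python
import json


def parse_sse_lines(lines):
    """Collect `data:` payloads from an SSE stream into decoded JSON events.

    A blank line terminates each event, per SSE framing rules; multi-line
    `data:` fields are joined with newlines before decoding.
    """
    events, buf = [], []
    for line in lines:
        if line.startswith("data:"):
            buf.append(line[len("data:"):].strip())
        elif line == "" and buf:
            events.append(json.loads("\n".join(buf)))
            buf = []
    return events
```

Against a live server, the same logic would be fed line by line from the open HTTP response rather than from a list.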
## Documents

### POST /api/documents/upload
Upload a document for ingestion. Accepts multipart/form-data.
Request:
```bash
curl -X POST http://localhost:8000/api/documents/upload \
  -F "file=@research-paper.pdf" \
  -F "metadata={\"tags\": [\"research\", \"2024\"]}"
```

| Field | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | PDF, DOCX, or TXT file |
| metadata | JSON string | No | Optional tags and metadata |
Supported formats: `.pdf`, `.docx`, `.doc`, `.txt`
Maximum file size: 100MB (configurable)
Response (201):
```json
{
  "document_id": "doc_a1b2c3d4",
  "filename": "research-paper.pdf",
  "file_size_bytes": 2457600,
  "pages": 42,
  "status": "queued",
  "created_at": "2024-12-15T10:30:00Z"
}
```

Ingestion begins automatically after upload. Check progress with `GET /api/ingest/status`.
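The same upload can be done from Python. The stdlib-only sketch below assembles the multipart body by hand; in practice a client library such as `requests` handles this for you, and the helpers `encode_multipart` and `upload` are illustrative, not part of Forge:

```python
import json
import uuid
from urllib import request as urlrequest


def encode_multipart(filename, file_bytes, metadata=None):
    """Build a multipart/form-data body with `file` and optional `metadata` parts."""
    boundary = uuid.uuid4().hex
    parts = [
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'.encode()
        + file_bytes + b"\r\n"
    ]
    if metadata is not None:
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="metadata"\r\n\r\n'
            f'{json.dumps(metadata)}\r\n'.encode()
        )
    parts.append(f"--{boundary}--\r\n".encode())
    return boundary, b"".join(parts)


def upload(path, base_url="http://localhost:8000", metadata=None):
    """POST a local file to /api/documents/upload and return the decoded response."""
    with open(path, "rb") as f:
        boundary, body = encode_multipart(path.rsplit("/", 1)[-1], f.read(), metadata)
    req = urlrequest.Request(
        f"{base_url}/api/documents/upload",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    with urlrequest.urlopen(req) as resp:
        return json.load(resp)
```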
### GET /api/documents
List all indexed documents.
Response (200):
```json
{
  "documents": [
    {
      "document_id": "doc_a1b2c3d4",
      "filename": "research-paper.pdf",
      "file_size_bytes": 2457600,
      "pages": 42,
      "status": "indexed",
      "chunks": 197,
      "propositions": 834,
      "entities": 156,
      "created_at": "2024-12-15T10:30:00Z",
      "indexed_at": "2024-12-15T10:35:42Z"
    },
    {
      "document_id": "doc_e5f6g7h8",
      "filename": "policy-manual.docx",
      "file_size_bytes": 1024000,
      "pages": 28,
      "status": "indexing",
      "progress": 0.72,
      "created_at": "2024-12-15T11:00:00Z"
    }
  ],
  "total": 2
}
```

curl example:

```bash
curl http://localhost:8000/api/documents
```

### DELETE /api/documents/{id}
Delete a document and all its indexed data (vectors, graph, cache).
Response (200):
```json
{
  "document_id": "doc_a1b2c3d4",
  "deleted": true,
  "points_removed": 2421,
  "graph_edges_removed": 312
}
```

curl example:

```bash
curl -X DELETE http://localhost:8000/api/documents/doc_a1b2c3d4
```

## Ingestion
### POST /api/ingest
Trigger manual re-ingestion of a document (e.g., after config changes).
Request:
```json
{
  "document_id": "doc_a1b2c3d4",
  "force": true,
  "stages": ["contextual", "propositions", "graph", "embed"]
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| document_id | string | Yes | Document to re-ingest |
| force | boolean | No | Re-ingest even if already indexed |
| stages | string[] | No | Specific stages to re-run (default: all) |
Response (202):
```json
{
  "document_id": "doc_a1b2c3d4",
  "status": "queued",
  "stages": ["contextual", "propositions", "graph", "embed"]
}
```

### GET /api/ingest/status
Check the status of all ingestion jobs.
Response (200):
```json
{
  "active": [
    {
      "document_id": "doc_a1b2c3d4",
      "stage": "contextual_enrichment",
      "progress": 0.65,
      "chunks_processed": 128,
      "chunks_total": 197,
      "started_at": "2024-12-15T10:30:00Z",
      "estimated_remaining_seconds": 120
    }
  ],
  "completed": [
    {
      "document_id": "doc_e5f6g7h8",
      "completed_at": "2024-12-15T10:28:00Z",
      "duration_seconds": 342,
      "points_created": 1856
    }
  ],
  "failed": []
}
```

curl example:

```bash
curl http://localhost:8000/api/ingest/status
```

## Settings
### GET /api/settings
Get the current configuration.
Response (200):
```json
{
  "query": {
    "default_mode": "agentic",
    "max_iterations": 8,
    "timeout_seconds": 30
  },
  "llm": {
    "model_path": "models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    "context_size": 8192,
    "temperature": 0.1,
    "gpu_layers": -1
  },
  "crag": {
    "enabled": true,
    "threshold_correct": 0.7,
    "threshold_ambiguous": 0.4
  },
  "colbert": {
    "enabled": true,
    "top_k": 20,
    "final_k": 5
  },
  "propositions": { "enabled": true },
  "graph": { "enabled": true },
  "verification": { "enabled": true },
  "cache": { "enabled": true, "ttl": 3600 }
}
```

### PUT /api/settings
Update configuration at runtime. Only included fields are updated; omitted fields retain their current values.
Request:
```json
{
  "crag": {
    "threshold_correct": 0.8
  },
  "query": {
    "default_mode": "direct"
  }
}
```

Response (200):
```json
{
  "updated": true,
  "changes": {
    "crag.threshold_correct": { "old": 0.7, "new": 0.8 },
    "query.default_mode": { "old": "agentic", "new": "direct" }
  }
}
```

Changing `llm.model_path`, `llm.gpu_layers`, or `llm.context_size` requires an LLM reload. Use `POST /api/models/load` to apply LLM changes without restarting the server.
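Conceptually, a partial update behaves like a recursive merge of the patch into the current settings tree. An illustrative sketch of that semantics (the server's actual merge logic may differ):

```python
def merge_settings(current, patch):
    """Recursively overlay `patch` onto `current`; omitted keys keep their values."""
    merged = dict(current)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = value
    return merged
```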
curl example:
```bash
curl -X PUT http://localhost:8000/api/settings \
  -H "Content-Type: application/json" \
  -d '{"crag": {"threshold_correct": 0.8}}'
```

## Models
### GET /api/models
List available models and the currently loaded model.
Response (200):
```json
{
  "current": {
    "name": "mistral-7b-instruct-v0.2.Q4_K_M",
    "path": "models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    "parameters": "7B",
    "quantization": "Q4_K_M",
    "vram_usage_gb": 4.4,
    "context_size": 8192
  },
  "available": [
    {
      "name": "mistral-7b-instruct-v0.2.Q4_K_M",
      "path": "models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
      "size_gb": 4.4,
      "quantization": "Q4_K_M"
    },
    {
      "name": "llama-3.1-8b-instruct.Q4_K_M",
      "path": "models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
      "size_gb": 4.9,
      "quantization": "Q4_K_M"
    }
  ]
}
```

### POST /api/models/load
Load or switch to a different LLM model. This unloads the current model and loads the new one.
Request:
```json
{
  "model_path": "models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
  "gpu_layers": -1,
  "context_size": 8192
}
```

Response (200):
```json
{
  "loaded": true,
  "model": "Meta-Llama-3.1-8B-Instruct-Q4_K_M",
  "vram_usage_gb": 4.9,
  "load_time_seconds": 3.2
}
```

The server is unavailable for queries during model loading (typically 2-5 seconds). The endpoint returns after loading is complete.
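Because the server briefly refuses queries while a model loads, client code may want to poll `GET /api/health` until the LLM reports `loaded`. An illustrative sketch (the helpers `llm_ready` and `wait_until_ready` are not part of Forge):

```python
import json
import time
from urllib import request as urlrequest


def llm_ready(health):
    """True when a /api/health payload reports the LLM as loaded."""
    return health.get("services", {}).get("llm") == "loaded"


def wait_until_ready(base_url="http://localhost:8000", timeout=30.0, interval=1.0):
    """Poll /api/health until the LLM is loaded, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urlrequest.urlopen(f"{base_url}/api/health") as resp:
                if llm_ready(json.load(resp)):
                    return True
        except OSError:
            pass  # server briefly unreachable while the model reloads
        time.sleep(interval)
    return False
```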
curl example:
```bash
curl -X POST http://localhost:8000/api/models/load \
  -H "Content-Type: application/json" \
  -d '{"model_path": "models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"}'
```

## Health
### GET /api/health
Health check endpoint. Returns the status of all services.
Response (200):
```json
{
  "status": "healthy",
  "version": "5.0.0",
  "uptime_seconds": 3456,
  "services": {
    "qdrant": "connected",
    "redis": "connected",
    "llm": "loaded",
    "bge_m3": "loaded"
  },
  "gpu": {
    "available": true,
    "name": "NVIDIA GeForce RTX 4080",
    "vram_total_gb": 16.0,
    "vram_used_gb": 11.2,
    "cuda_version": "12.2"
  },
  "stats": {
    "documents_indexed": 5,
    "total_points": 12450,
    "queries_served": 142,
    "cache_hit_rate": 0.23
  }
}
```

curl example:

```bash
curl http://localhost:8000/api/health
```

## Error Responses
All endpoints return errors in a consistent format:
```json
{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Query text is required",
    "details": { "field": "query", "constraint": "non-empty string" }
  }
}
```

### Error Codes
| Code | HTTP Status | Description |
|---|---|---|
| VALIDATION_ERROR | 400 | Invalid request parameters |
| DOCUMENT_NOT_FOUND | 404 | Document ID doesn’t exist |
| UNSUPPORTED_FORMAT | 400 | File type not supported |
| FILE_TOO_LARGE | 413 | File exceeds size limit |
| LLM_ERROR | 500 | LLM inference failure |
| VECTOR_DB_ERROR | 500 | Qdrant connection/query failure |
| INGESTION_ERROR | 500 | Pipeline failure during ingestion |
| TIMEOUT | 504 | Query exceeded timeout |
| NO_DOCUMENTS | 400 | No documents indexed yet |
| MODEL_NOT_FOUND | 404 | Specified model file doesn’t exist |
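Since every endpoint uses the same envelope, client code can translate it into a typed exception in one place. An illustrative sketch (`ForgeAPIError` and `raise_for_error` are not part of Forge):

```python
class ForgeAPIError(Exception):
    """Raised when the API returns the documented error envelope."""

    def __init__(self, code, message, details=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.details = details or {}


def raise_for_error(payload):
    """Raise ForgeAPIError for an error envelope; pass other payloads through."""
    if isinstance(payload, dict) and "error" in payload:
        err = payload["error"]
        raise ForgeAPIError(
            err.get("code", "UNKNOWN"),
            err.get("message", ""),
            err.get("details"),
        )
    return payload
```

Calling `raise_for_error` on every decoded response lets the rest of the client branch on `e.code` (e.g. retry on `TIMEOUT`, surface `VALIDATION_ERROR` to the user).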
## Rate Limits
There are no rate limits by default. Forge is designed for single-user desktop use. If deploying as a shared service, configure rate limiting via a reverse proxy (nginx, Caddy).
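For illustration, an nginx fragment along these lines would cap each client IP at roughly 5 requests per second in front of Forge (the zone name, rate, and burst values here are placeholder choices, not recommendations):

```nginx
# Define a shared zone keyed by client IP: ~5 req/s, 10 MB of state.
limit_req_zone $binary_remote_addr zone=forge:10m rate=5r/s;

server {
    listen 80;

    location / {
        # Allow short bursts of up to 10 queued requests without delay.
        limit_req zone=forge burst=10 nodelay;
        proxy_pass http://127.0.0.1:8000;
    }
}
```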
## OpenAPI Documentation
FastAPI automatically generates interactive API documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- OpenAPI JSON: http://localhost:8000/openapi.json