
BGE-M3 Tri-Modal Vectors

BGE-M3 (BAAI General Embedding - Multi-Functionality, Multi-Linguality, Multi-Granularity) is the embedding backbone of Forge. A single model produces three types of vectors from one forward pass, replacing what traditionally required separate embedding models, BM25 indices, and reranking models.

Three Vector Types, One Model

| Vector Type | Dimension | What It Captures | Traditional Equivalent |
|---|---|---|---|
| Dense | 1024 | Semantic meaning | Sentence-BERT, E5, etc. |
| Sparse | Variable (lexical) | Exact keyword matching | BM25, SPLADE |
| ColBERT | N × 1024 (per-token) | Token-level interactions | Separate ColBERT model |
from FlagEmbedding import BGEM3FlagModel
 
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=False)
 
output = model.encode(
    ["What was the Q3 2024 revenue?"],
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=True,
)
 
dense_vector = output["dense_vecs"][0]       # Shape: (1024,)
sparse_vector = output["lexical_weights"][0]  # Dict: {token_id: weight}
colbert_vectors = output["colbert_vecs"][0]   # Shape: (N, 1024)

Why This Matters

Before BGE-M3 (Typical RAG Pipeline)

Query → Dense Encoder (GPU, ~10ms)  → Dense Search
     → BM25 Index (CPU)             → Keyword Search
     → ColBERT Model (GPU, ~50ms)   → Reranking

= 3 models, 3 indices, 3x the complexity

With BGE-M3 (Forge Pipeline)

Query → BGE-M3 (CPU, ~50ms) → Dense Search
                             → Sparse Search
                             → ColBERT Reranking

= 1 model, 1 collection, 1 forward pass

One model. One Qdrant collection. Three search modalities.

CPU inference by design

BGE-M3 runs on CPU at ~50ms per query — fast enough for real-time use. Keeping it off the GPU reserves all 16GB of VRAM for the LLM, which is where GPU memory has the greatest impact on generation quality and speed.

Qdrant Named Vectors

All three vector types are stored as named vectors in a single Qdrant collection:

# forge/ingestion/embedder.py
class BGEm3Embedder:
    """Embeds text using BGE-M3 and stores in Qdrant."""
 
    def __init__(self, config: BGEm3Config):
        self.model = BGEM3FlagModel(
            config.model_path,
            use_fp16=config.use_fp16,
        )
 
    async def embed_and_store(self, chunk: DocumentChunk) -> str:
        """Embed a chunk and upsert into Qdrant."""
        output = self.model.encode(
            [chunk.enriched_text],  # Context-enriched text
            return_dense=True,
            return_sparse=True,
            return_colbert_vecs=True,
            max_length=self.config.max_length,
        )
 
        point = PointStruct(
            id=chunk.id,
            vector={
                "dense": output["dense_vecs"][0].tolist(),
                "sparse": SparseVector(
                    # lexical_weights keys are token ids (often strings);
                    # Qdrant expects integer indices
                    indices=[int(i) for i in output["lexical_weights"][0]],
                    values=list(output["lexical_weights"][0].values()),
                ),
                "colbert": output["colbert_vecs"][0].tolist(),
            },
            payload={
                "text": chunk.enriched_text,
                "original_text": chunk.original_text,
                "level": chunk.level,
                "document_id": chunk.document_id,
                "parent_id": chunk.parent_id,
                # ... additional metadata
            },
        )
 
        await self.qdrant.upsert(
            collection_name="forge_documents",
            points=[point],
        )
        return chunk.id

Qdrant Collection Schema

# forge/storage/qdrant_setup.py
async def create_collection():
    await qdrant.create_collection(
        collection_name="forge_documents",
        vectors_config={
            "dense": VectorParams(
                size=1024,
                distance=Distance.COSINE,
            ),
            "colbert": VectorParams(
                size=1024,
                distance=Distance.COSINE,
                multivector_config=MultiVectorConfig(
                    comparator=MultiVectorComparator.MAX_SIM,
                ),
            ),
        },
        sparse_vectors_config={
            "sparse": SparseVectorParams(
                index=SparseIndexParams(on_disk=False),
            ),
        },
    )

Search Strategies

Dense Search (Semantic)

Best for: broad topical queries, “find me something about X”

results = await qdrant.search(
    collection_name="forge_documents",
    query_vector=NamedVector(name="dense", vector=dense_vec),
    limit=20,
)

Sparse Search (Keyword)

Best for: exact terms, names, acronyms, specific jargon

results = await qdrant.search(
    collection_name="forge_documents",
    query_vector=NamedSparseVector(
        name="sparse",
        vector=SparseVector(
            indices=sparse_indices,
            values=sparse_values,
        ),
    ),
    limit=20,
)

Hybrid Search (Dense + Sparse)

Forge combines both for first-stage retrieval using Reciprocal Rank Fusion:

# forge/retrieval/search.py
class HybridSearcher:
    async def search(
        self,
        query: str,
        top_k: int = 20,
    ) -> list[ScoredChunk]:
        """Combined dense + sparse search with RRF fusion."""
        # Encode query once, get both vector types
        output = self.bge_m3.encode(
            [query],
            return_dense=True,
            return_sparse=True,
        )
        dense_vec = output["dense_vecs"][0].tolist()
        lexical = output["lexical_weights"][0]
        sparse_vec = SparseVector(
            indices=[int(i) for i in lexical],  # token-id keys arrive as strings
            values=list(lexical.values()),
        )
 
        # Parallel search over both named vectors
        dense_results, sparse_results = await asyncio.gather(
            self.qdrant.search(
                collection_name="forge_documents",
                query_vector=NamedVector(name="dense", vector=dense_vec),
                limit=top_k,
            ),
            self.qdrant.search(
                collection_name="forge_documents",
                query_vector=NamedSparseVector(name="sparse", vector=sparse_vec),
                limit=top_k,
            ),
        )
 
        # Reciprocal Rank Fusion
        return self._rrf_fusion(dense_results, sparse_results, k=60)
 
    def _rrf_fusion(self, *result_lists, k=60):
        """Combine multiple result lists using RRF."""
        scores = defaultdict(float)
        for results in result_lists:
            for rank, result in enumerate(results):
                scores[result.id] += 1.0 / (k + rank + 1)
 
        combined = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        return [self._get_chunk(chunk_id) for chunk_id, _ in combined]
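To make the fusion concrete, here is a standalone sketch of the same RRF scoring on two toy ranked id lists (plain Python, independent of the Forge classes; `dense_ids` and `sparse_ids` are made-up results):

```python
from collections import defaultdict

def rrf(result_lists, k=60):
    """score(doc) = sum over lists of 1 / (k + rank + 1), rank starting at 0."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)
    # Highest combined score first
    return [doc_id for doc_id, _ in sorted(scores.items(), key=lambda x: x[1], reverse=True)]

dense_ids = ["a", "b", "c"]    # ranked output of dense search
sparse_ids = ["a", "c", "d"]   # ranked output of sparse search
print(rrf([dense_ids, sparse_ids]))  # ['a', 'c', 'b', 'd']
```

Note that "c" (ranked third and second) outranks "b" (ranked second in only one list): RRF rewards agreement between retrievers without ever comparing their raw scores, which is what makes it robust to the scale mismatch between dense similarities and sparse weights.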

ColBERT Reranking

Applied to the top candidates after hybrid search + CRAG filtering. See ColBERT Reranking for details.
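For intuition, the MaxSim operator that Qdrant applies to the `colbert` multivector (the same scoring ColBERT reranking relies on) reduces to a few lines. This toy sketch uses small Python lists in place of the real (N, 1024) arrays:

```python
def maxsim(query_vecs, doc_vecs):
    """For each query token vector, take its best-matching document token
    (by dot product), then sum those maxima over all query tokens."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Two query tokens, two document tokens (2-d toy embeddings)
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.5, 0.5]]
print(maxsim(query, doc))  # 1.0 (first token) + 0.5 (second token) = 1.5
```

Because every query token is matched independently, MaxSim captures fine-grained term interactions that a single pooled dense vector cannot.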

Performance

All benchmarks on AMD Ryzen 9 7950X (16 cores, CPU inference):

| Operation | Latency | Throughput |
|---|---|---|
| Encode 1 query | ~50ms | 20 queries/sec |
| Encode 1 document chunk (512 tokens) | ~80ms | 12 chunks/sec |
| Encode batch of 32 chunks | ~1.2s | ~27 chunks/sec |
| Qdrant dense search (100K points) | ~5ms | 200 queries/sec |
| Qdrant sparse search (100K points) | ~8ms | 125 queries/sec |
| Qdrant hybrid search (100K points) | ~12ms | 80 queries/sec |

The BGE-M3 encoding is the bottleneck during ingestion (80ms per chunk). During query time, it’s a single 50ms call — negligible.
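The throughput column follows directly from the latencies; a quick arithmetic check:

```python
# Derive the throughput column from the measured latencies above
query_qps = 1 / 0.050   # 50 ms per query encode
batch_cps = 32 / 1.2    # batch of 32 chunks in ~1.2 s
print(round(query_qps), round(batch_cps))  # 20 27
```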

Configuration

bge_m3:
  model_path: "models/bge-m3"
  max_length: 8192          # Max token length (BGE-M3 supports up to 8192)
  batch_size: 32            # Batch size for ingestion embedding
  use_fp16: false           # CPU doesn't benefit from fp16
  return_dense: true
  return_sparse: true
  return_colbert: true

Disabling Vector Types

You can disable sparse or ColBERT vectors to reduce storage:

bge_m3:
  return_dense: true
  return_sparse: false     # Saves ~20% storage, lose keyword matching
  return_colbert: false    # Saves ~70% storage, lose token-level reranking
Don't disable dense vectors

Dense vectors are the primary search mechanism. Disabling them breaks the entire pipeline. Sparse and ColBERT are additive improvements that can be individually toggled.
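One way to honor these toggles at ingestion time is to build the named-vector dict conditionally. A hypothetical helper (not Forge code; the `cfg` keys mirror the YAML flags above, and `fake` stands in for a real BGE-M3 `encode()` result):

```python
def build_named_vectors(output, cfg):
    """Assemble the Qdrant named-vector dict, skipping disabled modalities.
    `output` is a BGE-M3 encode() result for a single text."""
    vectors = {"dense": list(output["dense_vecs"][0])}  # always required
    if cfg.get("return_sparse", True):
        lexical = output["lexical_weights"][0]
        vectors["sparse"] = {
            "indices": [int(i) for i in lexical],  # token-id keys may be strings
            "values": list(lexical.values()),
        }
    if cfg.get("return_colbert", True):
        vectors["colbert"] = [list(v) for v in output["colbert_vecs"][0]]
    return vectors

fake = {
    "dense_vecs": [[0.1, 0.2]],
    "lexical_weights": [{"5": 0.7}],
    "colbert_vecs": [[[0.1, 0.2]]],
}
print(build_named_vectors(fake, {"return_colbert": False}).keys())
# dict_keys(['dense', 'sparse'])
```

The dense entry is unconditional for the reason stated above: the rest of the pipeline assumes it exists.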

Model Details

| Property | Value |
|---|---|
| Model | BAAI/bge-m3 |
| Parameters | 568M |
| Dense dimension | 1024 |
| Max tokens | 8,192 |
| Languages | 100+ (multilingual) |
| License | MIT |
| Model size | ~2.4GB |

References

  • Chen et al., “BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation” (2024)
  • BAAI/bge-m3 on Hugging Face
  • FlagEmbedding library
  • Forge implementation: forge/ingestion/embedder.py, forge/retrieval/search.py