A 15-stage RAG pipeline that transforms chaotic data into precision retrieval for autonomous AI agents.
Memory-Spark is a high-performance, multi-stage retrieval engine designed specifically for the next generation of autonomous AI agents. While traditional search systems focus on simple keyword matching, Memory-Spark acts as a "long-term cognitive memory" layer, allowing agents to navigate massive, unstructured datasets with a researcher's precision at machine speed.
The system solves the "context saturation" problem, where AI models are overwhelmed by irrelevant data or lose critical details in long-form conversations. For developers building agentic workflows, Memory-Spark provides a robust infrastructure for managing state, history, and external knowledge.
Traditional Retrieval-Augmented Generation (RAG) often suffers from "semantic noise": retrieving documents that are mathematically similar but contextually irrelevant. Most existing solutions rely on a single-shot vector search that fails to account for the nuance of human intent or the complex relationships within specialized domains.
Memory-Spark exists because "good enough" retrieval isn't sufficient for production-grade AI agents. The system treats memory as a multi-stage pipeline, applying rigorous filtering, reranking, and validation at every step. This ensures that the agent receives only the most high-signal information, significantly reducing hallucination rates and improving overall coherence.
At the heart of Memory-Spark is a sophisticated 15-stage retrieval pipeline that transforms a simple query into a refined set of actionable insights. The process begins with query expansion and intent classification, followed by parallel retrieval across isolated memory pools stored in LanceDB. Multiple layers of semantic filtering, cross-encoder reranking, and recursive refinement ensure every piece of data is evaluated against strict relevance gates.
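To make the flow concrete, here is a minimal sketch of how such a staged pipeline can be composed. The stage names mirror the description above, but the types and the runPipeline() helper are illustrative, not Memory-Spark's actual interfaces.

```typescript
// Illustrative only: the real pipeline has 15 stages; these types are
// simplified stand-ins for the shared state each stage refines.
interface PipelineState {
  query: string;
  expandedQueries?: string[];
  intent?: string;
  candidates?: { id: string; text: string; score: number }[];
}

type Stage = (state: PipelineState) => Promise<PipelineState>;

// Compose stages left to right; each stage receives and returns the state.
const runPipeline =
  (stages: Stage[]) =>
  (query: string): Promise<PipelineState> =>
    stages.reduce(
      (acc, stage) => acc.then(stage),
      Promise.resolve<PipelineState>({ query })
    );

// Usage (these stage functions are hypothetical placeholders):
// const retrieve = runPipeline([expandQuery, classifyIntent, retrieveFromPools,
//                               semanticFilter, rerank, refine]);
```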
Key innovations include Dynamic Gating (adjusts retrieval depth based on query complexity), Reciprocal Rank Fusion (RRF) (blends vector and keyword searches), LanceDB IVF_PQ indexing (fast approximate nearest neighbor search), and Pool Isolation (prevents cross-contamination between knowledge bases via metadata columns).
Click on any stage to learn more about its role in the retrieval process
The key to 50% latency reduction without accuracy loss
The Dynamic Gate computes sigma = max(score) - min(score) over the top-5 vector candidates (the same spread computed by computeRerankerGate()). It then compares sigma against two thresholds: high = 0.08 and low = 0.02. In hard mode, reranking is skipped when the vector results are already confident (sigma > 0.08) or too tightly clustered for the reranker to add reliable signal (sigma < 0.02); reranking only runs in the middle band.
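In code, the gate reduces to a spread check over the top-5 scores. Only computeRerankerGate() and the 0.08/0.02 thresholds come from the description above; the shouldRerank() wrapper and the always-rerank fallback outside hard mode are assumptions for illustration.

```typescript
// Spread of the top-5 vector scores, as described above.
function computeRerankerGate(scores: number[]): number {
  const top5 = [...scores].sort((a, b) => b - a).slice(0, 5);
  return Math.max(...top5) - Math.min(...top5); // sigma
}

// Documented thresholds.
const HIGH = 0.08;
const LOW = 0.02;

// Hypothetical wrapper: in hard mode, rerank only in the middle band.
function shouldRerank(scores: number[], hardMode: boolean): boolean {
  if (!hardMode) return true; // assumption: always rerank outside hard mode
  const sigma = computeRerankerGate(scores);
  if (sigma > HIGH) return false; // vector results already confident
  if (sigma < LOW) return false;  // scores too tied to add reliable signal
  return true;
}
```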
All pools stored in LanceDB with IVF_PQ indexing; logical isolation via metadata columns
Cross-encoder precision for final ranking
Memory-Spark first uses a fast bi-encoder to retrieve candidate chunks, then applies a cross-encoder (Nemotron-1B-Rerank) to score each (query, document) pair directly. This second stage is slower but far more accurate for final ordering.
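A minimal sketch of the retrieve-then-rerank flow: candidates from the bi-encoder stage are re-scored pairwise and re-sorted. The rerankPair() helper and its local endpoint are hypothetical stand-ins for however Nemotron-1B-Rerank is actually served.

```typescript
interface Candidate {
  id: string;
  text: string;
  vectorScore: number;
}

// Hypothetical stand-in for the cross-encoder: scores one (query, document)
// pair via a locally hosted reranker. The URL and payload shape are assumptions.
async function rerankPair(query: string, document: string): Promise<number> {
  const res = await fetch("http://localhost:8001/rerank", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, document }),
  });
  const { score } = await res.json();
  return score;
}

// Stage 2: re-score every bi-encoder candidate pairwise, then sort by the
// cross-encoder score for the final ordering.
async function rerank(query: string, candidates: Candidate[]): Promise<Candidate[]> {
  const scored = await Promise.all(
    candidates.map(async (c) => ({
      candidate: c,
      score: await rerankPair(query, c.text),
    }))
  );
  return scored.sort((a, b) => b.score - a.score).map((s) => s.candidate);
}
```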
The reranker often produces compressed scores in the 0.83-1.00 band, so query normalization and spread-aware logic are critical. Normalizing shorthand into explicit questions (declarative -> interrogative) sharpens pairwise relevance.
Bi-encoder is fast for recall (single embedding pass). Cross-encoder reads query and document together, so it costs more per candidate but resolves semantic ties better.
Reranker outputs are tightly clustered near 1.0. Query normalization helps by making intent explicit: "capital of france" -> "What is the capital of France?"
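As a rough illustration, a naive rule-based normalizer might look like the sketch below; the real system may well use an LLM rewrite or a richer rule set, so treat this purely as a way to make the declarative-to-interrogative idea concrete.

```typescript
// Naive heuristic sketch: make shorthand queries explicit questions so the
// cross-encoder sees a well-formed (question, document) pair. A production
// system would more likely use an LLM rewrite (assumption).
function normalizeQuery(raw: string): string {
  const q = raw.trim();
  if (q.endsWith("?")) return q;
  if (/^(what|who|when|where|why|how|which)\b/i.test(q)) return `${q}?`;
  // e.g. "capital of france" -> "What is the capital of france?"
  return `What is the ${q.replace(/^the\s+/i, "")}?`;
}
```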
If top candidate scores are too tight (very small spread), reranking can be skipped to save latency because extra computation adds little signal.
Text -> 4096-dim vectors via Nemotron-8B
Memory-Spark uses llama-embed-nemotron-8b to transform text into 4096-dimensional vectors stored in LanceDB. This instruction-tuned model requires different prefixes for queries vs documents to achieve optimal retrieval accuracy.
Query embeddings use the prefix "Instruct: {task}\nQuery: {text}", while document embeddings use the raw text with no prefix. This asymmetric encoding aligns queries and documents in complementary subspaces.
Queries get instruction prefixes; documents don't. This separation creates query-document alignment in the embedding space, improving retrieval accuracy by ~15%.
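A sketch of the asymmetric formatting, assuming the model is served behind an OpenAI-compatible /v1/embeddings endpoint; the endpoint URL, task string, and helper names are assumptions, not Memory-Spark's actual API.

```typescript
// Queries get the instruction prefix; documents are embedded as raw text.
const TASK = "Given a query, retrieve relevant passages"; // example task string

function formatForEmbedding(text: string, kind: "query" | "document"): string {
  return kind === "query" ? `Instruct: ${TASK}\nQuery: ${text}` : text;
}

// Hypothetical self-hosted, OpenAI-compatible embeddings endpoint serving
// llama-embed-nemotron-8b; the URL and response shape are assumptions.
async function embed(text: string, kind: "query" | "document"): Promise<number[]> {
  const res = await fetch("http://localhost:8000/v1/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama-embed-nemotron-8b",
      input: formatForEmbedding(text, kind),
    }),
  });
  const body = await res.json();
  return body.data[0].embedding; // 4096-dim vector
}
```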
Vectors are stored in LanceDB with IVF_PQ indexing (P=10 partitions, M=64 sub-vectors). This enables fast approximate nearest neighbor search while keeping memory usage low.
Each memory pool (agent_memory, agent_mistakes, shared_knowledge, etc.) is logically isolated via metadata columns in LanceDB, preventing cross-contamination.
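A sketch of both ideas, assuming the @lancedb/lancedb Node SDK; the table and column names are assumptions and index option names may differ across SDK versions, but the IVF_PQ parameters match the values above.

```typescript
import * as lancedb from "@lancedb/lancedb";

async function main() {
  const db = await lancedb.connect("./data/memory-spark");
  const table = await db.openTable("memories"); // table name is an assumption

  // IVF_PQ with the documented parameters: 10 partitions, 64 sub-vectors.
  await table.createIndex("vector", {
    config: lancedb.Index.ivfPq({ numPartitions: 10, numSubVectors: 64 }),
  });

  // Pool isolation: restrict the ANN search to one pool via its metadata
  // column, so e.g. agent_mistakes never bleeds into agent_memory results.
  const queryVector = new Array(4096).fill(0); // stand-in for a real embedding
  const hits = await table
    .search(queryVector)
    .where("pool = 'agent_memory'")
    .limit(20)
    .toArray();
  console.log(hits.length);
}

main().catch(console.error);
```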
RRF, MMR diversity, and temporal decay formulas
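The three formulas, sketched as plain functions; k = 60 for RRF, lambda = 0.7 for MMR, and the decay rate are common defaults, not Memory-Spark's documented settings.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over rankings i of 1 / (k + rank_i(d)).
function rrfFuse(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return scores;
}

// Maximal Marginal Relevance: greedily pick
// argmax lambda * sim(query, d) - (1 - lambda) * max sim(d, selected).
function mmrSelect(
  relevance: Map<string, number>,            // sim(query, d)
  pairSim: (a: string, b: string) => number, // sim(d_i, d_j)
  n: number,
  lambda = 0.7
): string[] {
  const remaining = new Set(relevance.keys());
  const selected: string[] = [];
  while (selected.length < n && remaining.size > 0) {
    let best: string | null = null;
    let bestScore = -Infinity;
    for (const id of remaining) {
      const penalty = selected.length
        ? Math.max(...selected.map((s) => pairSim(id, s)))
        : 0;
      const score = lambda * (relevance.get(id) ?? 0) - (1 - lambda) * penalty;
      if (score > bestScore) {
        bestScore = score;
        best = id;
      }
    }
    if (best === null) break;
    selected.push(best);
    remaining.delete(best);
  }
  return selected;
}

// Temporal decay: downweight older memories with exp(-lambda * ageDays).
const temporalDecay = (score: number, ageDays: number, lambda = 0.01): number =>
  score * Math.exp(-lambda * ageDays);
```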
Performance on the BEIR benchmark suite
Benchmarks were run in an isolated Docker harness with the Nemotron-8B embedding model.
Results for all 36 configurations are available in the evaluation results.
Self-hosted on NVIDIA DGX Spark (zero cloud API calls)