v0.4.0 Production Ready

Memory-Spark

A 15-stage RAG pipeline that transforms chaotic data into precision retrieval for autonomous AI agents.

• 78.9% NDCG@10
• 626ms P50 latency
• 78% gate skip rate
• +33% vs 2021 SOTA

What Is It?

Memory-Spark is a high-performance, multi-stage retrieval engine designed specifically for the next generation of autonomous AI agents. While traditional search systems focus on simple keyword matching, Memory-Spark acts as a "long-term cognitive memory" layer, allowing agents to navigate massive, unstructured datasets with the precision and speed of a human researcher.

The system solves the "context saturation" problem, where AI models are overwhelmed by irrelevant data or lose critical details in long-form conversations. For developers building agentic workflows, Memory-Spark provides a robust infrastructure for managing state, history, and external knowledge.

Why Does It Exist?

Traditional Retrieval-Augmented Generation (RAG) often suffers from "semantic noise"--retrieving documents that are mathematically similar but contextually irrelevant. Most existing solutions rely on a single-shot vector search that fails to account for the nuance of human intent or the complex relationships within specialized domains.

Memory-Spark exists because "good enough" retrieval isn't sufficient for production-grade AI agents. The system treats memory as a multi-stage pipeline, applying rigorous filtering, reranking, and validation at every step. This ensures that the agent receives only the highest-signal information, significantly reducing hallucination rates and improving overall coherence.

How Does It Work?

At the heart of Memory-Spark is a sophisticated 15-stage retrieval pipeline that transforms a simple query into a refined set of actionable insights. The process begins with query expansion and intent classification, followed by parallel retrieval across isolated memory pools stored in LanceDB. Multiple layers of semantic filtering, cross-encoder reranking, and recursive refinement ensure every piece of data is evaluated against strict relevance gates.

Key innovations include:

• Dynamic Gating -- adjusts retrieval depth based on query complexity
• Reciprocal Rank Fusion (RRF) -- blends vector and keyword searches
• LanceDB IVF_PQ indexing -- fast approximate nearest neighbor search
• Pool Isolation -- prevents cross-contamination between knowledge bases via metadata columns
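
As a rough mental model, the stages read as an ordered sequence. Below is a minimal sketch; the stage names are paraphrased from this overview rather than taken from Memory-Spark's source, and several of the 15 actual stages are collapsed or omitted.

// Illustrative ordering only: names paraphrased from the overview above;
// several of the 15 actual stages are collapsed or omitted.
const PIPELINE_SKETCH = [
  "query_expansion",         // incl. HyDE and multi-query expansion
  "intent_classification",
  "parallel_pool_retrieval", // vector + BM25 across isolated LanceDB pools
  "reciprocal_rank_fusion",  // merge vector and keyword result lists
  "semantic_filtering",
  "reranker_gate",           // dynamic gate: should the cross-encoder run?
  "cross_encoder_rerank",
  "recursive_refinement",
  "relevance_gating",        // strict gates before results reach the agent
] as const;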

The 15-Stage Retrieval Pipeline


Dynamic Reranker Gate

The key to 50% latency reduction without accuracy loss

The Dynamic Gate computes the score spread sigma = max(score) - min(score) over the top-5 vector candidates (the same quantity computeRerankerGate() evaluates), then compares it against two thresholds: high = 0.08 and low = 0.02. In hard mode, reranking is skipped when the vector results are already confident (sigma > 0.08) or so tightly tied that a reranker would add no reliable signal (sigma < 0.02); the cross-encoder runs only in the middle band.

• sigma > 0.08 -- High confidence spread: the top result is clearly better, and vector search has already found the answer. Skip the reranker.
• sigma < 0.02 -- Tied results: scores are too close for a reranker to add meaningful signal. Skip the reranker.
• 0.02 <= sigma <= 0.08 -- Ambiguous zone: results are neither clearly confident nor tied. Run the reranker; this is where the cross-encoder helps.
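
A minimal sketch of the gate logic, using the thresholds above. computeRerankerGate() is named in the text, but the signature, argument shape, and constant names here are assumptions.

// Sketch only: thresholds come from the text above; the signature and
// constant names are illustrative assumptions.
const GATE_HIGH = 0.08; // above this, vector search is already confident
const GATE_LOW = 0.02;  // below this, scores are too tied for reranking to help

function computeRerankerGate(topScores: number[]): boolean {
  // topScores: similarity scores of the top-5 vector candidates
  const top5 = topScores.slice(0, 5);
  const sigma = Math.max(...top5) - Math.min(...top5); // score spread
  return sigma >= GATE_LOW && sigma <= GATE_HIGH;      // run only in the middle band
}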

Pool Architecture + LanceDB

All pools stored in LanceDB with IVF_PQ indexing; logical isolation via metadata columns

[MEM] Agent Memory

Weight: 1.0x
  • General working memory
  • Pool-isolated per agent
  • Session context storage

[ERR] Agent Mistakes

Weight: 1.6x (BOOSTED)
  • Past errors and corrections
  • Highest priority retrieval
  • Learn from failures

[SHR] Shared Mistakes

Weight: 1.6x (BOOSTED)
  • Cross-agent learned lessons
  • Organizational knowledge
  • Prevent repeat failures

[KB] Shared Knowledge

Weight: 0.8x
  • Reference docs and manuals
  • Background context
  • External documentation

[RULE] Shared Rules

Weight: 1.0x
  • Governing policies
  • Relevance-gated injection
  • Constraint enforcement

[FTS] FTS Index

BM25 Full-Text
  • Tantivy BM25 search
  • Keyword matching
  • Hybrid with vector
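
In practice the pool weights reduce to a multiplier on each hit's retrieval score. A minimal sketch: the pool names agent_memory, agent_mistakes, and shared_knowledge appear in the Pool Isolation note below, while shared_mistakes and shared_rules are assumed to follow the same convention, and the Hit shape is illustrative.

// Pool weights as listed above. The Hit shape and the shared_mistakes /
// shared_rules identifiers are illustrative assumptions.
const POOL_WEIGHTS: Record<string, number> = {
  agent_memory: 1.0,
  agent_mistakes: 1.6,   // boosted: learn from past failures
  shared_mistakes: 1.6,  // boosted: cross-agent lessons
  shared_knowledge: 0.8, // background reference material
  shared_rules: 1.0,
};

interface Hit { pool: string; score: number; }

function applyPoolWeight(hit: Hit): number {
  return hit.score * (POOL_WEIGHTS[hit.pool] ?? 1.0);
}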

Reranker Deep-Dive

Cross-encoder precision for final ranking

Memory-Spark first uses a fast bi-encoder to retrieve candidate chunks, then applies a cross-encoder (Nemotron-1B-Rerank) to score each (query, document) pair directly. This second stage is slower but far more accurate for final ordering.

The reranker often produces compressed scores in the 0.83-1.00 band, so query normalization and spread-aware logic are critical. Normalizing shorthand into explicit questions (declarative -> interrogative) sharpens pairwise relevance.

Cross-Encoder vs Bi-Encoder

Bi-encoder is fast for recall (single embedding pass). Cross-encoder reads query and document together, so it costs more per candidate but resolves semantic ties better.
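
For intuition, here is the shape of the cross-encoder calling pattern: every (query, document) pair is scored jointly in one request. Only the reranker's port (:18096) comes from the Spark Backend Services list below; the endpoint path and payload fields are assumptions, not the actual service API.

// Endpoint path and payload shape are assumptions; only the port is taken
// from the services list at the bottom of this page.
async function crossEncoderScores(query: string, docs: string[]): Promise<number[]> {
  const res = await fetch("http://localhost:18096/rerank", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, documents: docs }),
  });
  const { scores } = (await res.json()) as { scores: number[] };
  return scores; // one joint relevance score per (query, document) pair
}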

Score Compression + Normalization

Reranker outputs are tightly clustered near 1.0. Query normalization helps by making intent explicit: "capital of france" -> "What is the capital of France?"

Spread Guard

If top candidate scores are too tight (very small spread), reranking can be skipped to save latency because extra computation adds little signal.

Embedder Deep-Dive

Text -> 4096-dim vectors via Nemotron-8B

Memory-Spark uses llama-embed-nemotron-8b to transform text into 4096-dimensional vectors stored in LanceDB. This instruction-tuned model requires different prefixes for queries vs documents to achieve optimal retrieval accuracy.

Query embeddings use the prefix: "Instruct: {task}\nQuery: {text}" while document embeddings are raw. This asymmetric encoding aligns queries and documents in complementary subspaces.
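
A minimal sketch of the asymmetric formatting, using the prefix template quoted above; the function names and the example task string are illustrative assumptions.

// Queries get the instruction prefix; documents are embedded raw.
function formatQuery(text: string, task: string): string {
  return `Instruct: ${task}\nQuery: ${text}`;
}

function formatDocument(text: string): string {
  return text; // no prefix for documents
}

// e.g. formatQuery("capital of france", "Given a question, retrieve passages that answer it")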

Asymmetric Embedding

Queries get instruction prefixes; documents don't. This separation creates query-document alignment in the embedding space, improving retrieval accuracy by ~15%.

LanceDB Storage

Vectors are stored in LanceDB with IVF_PQ indexing (P=10 partitions, M=64 sub-vectors). This enables fast approximate nearest neighbor search while keeping memory usage low.

Pool Isolation

Each memory pool (agent_memory, agent_mistakes, shared_knowledge, etc.) is logically isolated via metadata columns in LanceDB, preventing cross-contamination.

Score Fusion

RRF, MMR diversity, and temporal decay formulas

Reciprocal Rank Fusion (RRF)

RRF(d) = SUM over result lists r of 1 / (k + rank_r(d))
Scale-invariant merging of vector and BM25 results. Smoothing constant k=60.
Documents appearing high in both lists get an "agreement bonus."
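
A direct sketch of the formula with k = 60 as stated; the input shape (one ranked list of document IDs per retriever) is an assumption.

// Reciprocal Rank Fusion with smoothing constant k = 60.
function rrfFuse(rankings: string[][], k = 60): Map<string, number> {
  const fused = new Map<string, number>();
  for (const list of rankings) {
    list.forEach((docId, i) => {
      const rank = i + 1; // 1-based rank within this list
      fused.set(docId, (fused.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  return fused; // docs ranked high in several lists accumulate the agreement bonus
}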

MMR Diversity (lambda=0.9)

MMR(d) = lambda * sim(q, d) - (1 - lambda) * max over already-selected d_i of sim(d, d_i)
Balances relevance vs. redundancy. High lambda (0.9) favors relevance.
Prevents returning 10 nearly-identical chunks.
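
A direct sketch of greedy MMR selection with lambda = 0.9 as stated; cosine similarity and the Candidate shape are assumptions.

interface Candidate {
  id: string;
  simToQuery: number; // sim(q, d) from the retrieval stage
  vec: number[];      // document embedding, for sim(d, d_i)
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function mmrSelect(candidates: Candidate[], n: number, lambda = 0.9): Candidate[] {
  const selected: Candidate[] = [];
  const pool = [...candidates];
  while (selected.length < n && pool.length > 0) {
    let best = 0, bestScore = -Infinity;
    pool.forEach((c, i) => {
      // Redundancy: max similarity to anything already selected.
      const redundancy = selected.length > 0
        ? Math.max(...selected.map((s) => cosine(c.vec, s.vec)))
        : 0;
      const score = lambda * c.simToQuery - (1 - lambda) * redundancy;
      if (score > bestScore) { bestScore = score; best = i; }
    });
    selected.push(pool.splice(best, 1)[0]);
  }
  return selected;
}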

Temporal Decay

score * (0.8 + 0.2 * e^(-0.03 * age))
Exponential decay with 0.8 floor. Recent memories get boosted.
0 days = 1.00x, 7 days = 0.96x, 30 days = 0.88x, 365 days = 0.80x
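
The decay curve is small enough to verify inline; a direct transcription of the formula above.

// Temporal decay: exponential with a 0.8 floor, age measured in days.
function temporalDecay(score: number, ageDays: number): number {
  return score * (0.8 + 0.2 * Math.exp(-0.03 * ageDays));
}
// temporalDecay(1, 0)   -> 1.00
// temporalDecay(1, 7)   -> ~0.96
// temporalDecay(1, 30)  -> ~0.88
// temporalDecay(1, 365) -> ~0.80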

Benchmark Results

Performance on the BEIR benchmark suite

• SciFact: 78.9% NDCG@10 (+16.5% vs Contriever)
• FiQA: 52.8% NDCG@10 (+68.0% vs Contriever)
• NFCorpus: 44.4% NDCG@10 (+35.5% vs Contriever)

Benchmarks were run on an isolated Docker harness with the Nemotron-8B embedding model.
Full results for all 36 configurations are available in the evaluation results.

Spark Backend Services

Self-hosted on NVIDIA DGX Spark (zero cloud API calls)

[EMB] Embedding :18091

  • llama-embed-nemotron-8b
  • 4096 dimensions
  • Instruction-aware

[RRK] Reranker :18096

  • llama-nemotron-rerank-1b-v2
  • Cross-encoder
  • Query normalization

[LLM] LLM :18080

  • Nemotron-Super-120B
  • HyDE generation
  • Multi-query expansion

[OCR] GLM-OCR :18080

  • zai-org/GLM-OCR (0.9B)
  • PDF parsing
  • vLLM served

[NER] NER :18112

  • Named Entity Recognition
  • Entity extraction
  • Metadata enrichment

[CLS] Zero-Shot :18113

  • bart-large-mnli
  • Zero-shot classification
  • Category inference
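
The endpoints above can be captured in one client-side registry. A minimal sketch using only the ports and model names listed; the constant name and field layout are assumptions.

// Ports and models as listed above; names and layout are illustrative.
const SPARK_SERVICES = {
  embedding: { port: 18091, model: "llama-embed-nemotron-8b" },     // 4096-dim, instruction-aware
  reranker:  { port: 18096, model: "llama-nemotron-rerank-1b-v2" }, // cross-encoder
  llm:       { port: 18080, model: "Nemotron-Super-120B" },         // HyDE + multi-query expansion
  ocr:       { port: 18080, model: "zai-org/GLM-OCR" },             // listed on the same port as the LLM
  ner:       { port: 18112 },                                       // entity extraction + enrichment
  zeroShot:  { port: 18113, model: "bart-large-mnli" },             // zero-shot classification
} as const;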