A 15-stage RAG pipeline that transforms chaotic data into precision retrieval for autonomous AI agents.
Memory-Spark is a high-performance, multi-stage retrieval engine designed specifically for the next generation of autonomous AI agents. While traditional search systems focus on simple keyword matching, Memory-Spark acts as a "long-term cognitive memory" layer, allowing agents to navigate massive, unstructured datasets with a researcher's precision at machine speed.
The system solves the "context saturation" problem, where AI models are overwhelmed by irrelevant data or lose critical details in long-form conversations. For developers building agentic workflows, Memory-Spark provides a robust infrastructure for managing state, history, and external knowledge.
Traditional Retrieval-Augmented Generation (RAG) often suffers from "semantic noise": retrieving documents that are mathematically similar but contextually irrelevant. Most existing solutions rely on a single-shot vector search that fails to account for the nuance of human intent or the complex relationships within specialized domains.
Memory-Spark exists because "good enough" retrieval isn't sufficient for production-grade AI agents. The system treats memory as a multi-stage pipeline, applying rigorous filtering, reranking, and validation at every step. This ensures that the agent receives only the most high-signal information, significantly reducing hallucination rates and improving overall coherence.
At the heart of Memory-Spark is a sophisticated 15-stage retrieval pipeline that transforms a simple query into a refined set of actionable insights. The process begins with query expansion and intent classification, followed by parallel retrieval across isolated memory pools stored in LanceDB. Multiple layers of semantic filtering, cross-encoder reranking, and recursive refinement ensure every piece of data is evaluated against strict relevance gates.
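To make the flow concrete, here is a minimal sketch of how such a staged pipeline can be composed. The stage names mirror the description above, but the types and the runPipeline() helper are illustrative, not Memory-Spark's actual interfaces.

```typescript
// Illustrative only: the real pipeline has 15 stages; these types are
// simplified stand-ins for the shared state each stage refines.
interface PipelineState {
  query: string;
  expandedQueries?: string[];
  intent?: string;
  candidates?: { id: string; text: string; score: number }[];
}

type Stage = (state: PipelineState) => Promise<PipelineState>;

// Compose stages left to right; each stage receives and returns the state.
const runPipeline =
  (stages: Stage[]) =>
  (query: string): Promise<PipelineState> =>
    stages.reduce(
      (acc, stage) => acc.then(stage),
      Promise.resolve<PipelineState>({ query })
    );

// Usage (these stage functions are hypothetical placeholders):
// const retrieve = runPipeline([expandQuery, classifyIntent, retrieveFromPools,
//                               semanticFilter, rerank, refine]);
```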
Key innovations include Dynamic Gating (adjusts retrieval depth based on query complexity), Reciprocal Rank Fusion (RRF) (blends vector and keyword searches), LanceDB IVF_PQ indexing (fast approximate nearest neighbor search), and Pool Isolation (prevents cross-contamination between knowledge bases via metadata columns).
Click on any stage to learn more about its role in the retrieval process
The key to 50% latency reduction without accuracy loss
The Dynamic Gate computes sigma = max(score) - min(score) over the top-5 vector candidates (the same spread computed by computeRerankerGate()). It then compares sigma against two thresholds: high = 0.08 and low = 0.02. In hard mode, reranking is skipped when the vector results are already confident (sigma > 0.08) or too tightly clustered for the reranker to add reliable signal (sigma < 0.02); reranking only runs in the middle band.
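In code, the gate reduces to a spread check over the top-5 scores. Only computeRerankerGate() and the 0.08/0.02 thresholds come from the description above; the shouldRerank() wrapper and the always-rerank fallback outside hard mode are assumptions for illustration.

```typescript
// Spread of the top-5 vector scores, as described above.
function computeRerankerGate(scores: number[]): number {
  const top5 = [...scores].sort((a, b) => b - a).slice(0, 5);
  return Math.max(...top5) - Math.min(...top5); // sigma
}

// Documented thresholds.
const HIGH = 0.08;
const LOW = 0.02;

// Hypothetical wrapper: in hard mode, rerank only in the middle band.
function shouldRerank(scores: number[], hardMode: boolean): boolean {
  if (!hardMode) return true; // assumption: always rerank outside hard mode
  const sigma = computeRerankerGate(scores);
  if (sigma > HIGH) return false; // vector results already confident
  if (sigma < LOW) return false;  // scores too tied to add reliable signal
  return true;
}
```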
All pools stored in LanceDB with IVF_PQ indexing; logical isolation via metadata columns
Cross-encoder precision for final ranking
Memory-Spark first uses a fast bi-encoder to retrieve candidate chunks, then applies a cross-encoder (Nemotron-1B-Rerank) to score each (query, document) pair directly. This second stage is slower but far more accurate for final ordering.
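A minimal sketch of the retrieve-then-rerank flow: candidates from the bi-encoder stage are re-scored pairwise and re-sorted. The rerankPair() helper and its local endpoint are hypothetical stand-ins for however Nemotron-1B-Rerank is actually served.

```typescript
interface Candidate {
  id: string;
  text: string;
  vectorScore: number;
}

// Hypothetical stand-in for the cross-encoder: scores one (query, document)
// pair via a locally hosted reranker. The URL and payload shape are assumptions.
async function rerankPair(query: string, document: string): Promise<number> {
  const res = await fetch("http://localhost:8001/rerank", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, document }),
  });
  const { score } = await res.json();
  return score;
}

// Stage 2: re-score every bi-encoder candidate pairwise, then sort by the
// cross-encoder score for the final ordering.
async function rerank(query: string, candidates: Candidate[]): Promise<Candidate[]> {
  const scored = await Promise.all(
    candidates.map(async (c) => ({
      candidate: c,
      score: await rerankPair(query, c.text),
    }))
  );
  return scored.sort((a, b) => b.score - a.score).map((s) => s.candidate);
}
```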
The reranker often produces compressed scores in the 0.83-1.00 band, so query normalization and spread-aware logic are critical. Normalizing shorthand into explicit questions (declarative -> interrogative) sharpens pairwise relevance.
Bi-encoder is fast for recall (single embedding pass). Cross-encoder reads query and document together, so it costs more per candidate but resolves semantic ties better.
Reranker outputs are tightly clustered near 1.0. Query normalization helps by making intent explicit: "capital of france" -> "What is the capital of France?"
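As a rough illustration, a naive rule-based normalizer might look like the sketch below; the real system may well use an LLM rewrite or a richer rule set, so treat this purely as a way to make the declarative-to-interrogative idea concrete.

```typescript
// Naive heuristic sketch: make shorthand queries explicit questions so the
// cross-encoder sees a well-formed (question, document) pair. A production
// system would more likely use an LLM rewrite (assumption).
function normalizeQuery(raw: string): string {
  const q = raw.trim();
  if (q.endsWith("?")) return q;
  if (/^(what|who|when|where|why|how|which)\b/i.test(q)) return `${q}?`;
  // e.g. "capital of france" -> "What is the capital of france?"
  return `What is the ${q.replace(/^the\s+/i, "")}?`;
}
```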
If top candidate scores are too tight (very small spread), reranking can be skipped to save latency because extra computation adds little signal.
Text -> 4096-dim vectors via Nemotron-8B
Memory-Spark uses llama-embed-nemotron-8b to transform text into 4096-dimensional vectors stored in LanceDB. This instruction-tuned model requires different prefixes for queries vs documents to achieve optimal retrieval accuracy.
Query embeddings use the prefix "Instruct: {task}\nQuery: {text}", while document embeddings use the raw text with no prefix. This asymmetric encoding aligns queries and documents in complementary subspaces.
Queries get instruction prefixes; documents don't. This separation creates query-document alignment in the embedding space, improving retrieval accuracy by ~15%.
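A sketch of the asymmetric formatting, assuming the model is served behind an OpenAI-compatible /v1/embeddings endpoint; the endpoint URL, task string, and helper names are assumptions, not Memory-Spark's actual API.

```typescript
// Queries get the instruction prefix; documents are embedded as raw text.
const TASK = "Given a query, retrieve relevant passages"; // example task string

function formatForEmbedding(text: string, kind: "query" | "document"): string {
  return kind === "query" ? `Instruct: ${TASK}\nQuery: ${text}` : text;
}

// Hypothetical self-hosted, OpenAI-compatible embeddings endpoint serving
// llama-embed-nemotron-8b; the URL and response shape are assumptions.
async function embed(text: string, kind: "query" | "document"): Promise<number[]> {
  const res = await fetch("http://localhost:8000/v1/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama-embed-nemotron-8b",
      input: formatForEmbedding(text, kind),
    }),
  });
  const body = await res.json();
  return body.data[0].embedding; // 4096-dim vector
}
```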
Vectors are stored in LanceDB with IVF_PQ indexing (P=10 partitions, M=64 sub-vectors). This enables fast approximate nearest neighbor search while keeping memory usage low.
Each memory pool (agent_memory, agent_mistakes, shared_knowledge, etc.) is logically isolated via metadata columns in LanceDB, preventing cross-contamination.
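A sketch of both ideas, assuming the @lancedb/lancedb Node SDK; the table and column names are assumptions and index option names may differ across SDK versions, but the IVF_PQ parameters match the values above.

```typescript
import * as lancedb from "@lancedb/lancedb";

async function main() {
  const db = await lancedb.connect("./data/memory-spark");
  const table = await db.openTable("memories"); // table name is an assumption

  // IVF_PQ with the documented parameters: 10 partitions, 64 sub-vectors.
  await table.createIndex("vector", {
    config: lancedb.Index.ivfPq({ numPartitions: 10, numSubVectors: 64 }),
  });

  // Pool isolation: restrict the ANN search to one pool via its metadata
  // column, so e.g. agent_mistakes never bleeds into agent_memory results.
  const queryVector = new Array(4096).fill(0); // stand-in for a real embedding
  const hits = await table
    .search(queryVector)
    .where("pool = 'agent_memory'")
    .limit(20)
    .toArray();
  console.log(hits.length);
}

main().catch(console.error);
```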
RRF, MMR diversity, and temporal decay formulas
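The three formulas, sketched as plain functions; k = 60 for RRF, lambda = 0.7 for MMR, and the decay rate are common defaults, not Memory-Spark's documented settings.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over rankings i of 1 / (k + rank_i(d)).
function rrfFuse(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return scores;
}

// Maximal Marginal Relevance: greedily pick
// argmax lambda * sim(query, d) - (1 - lambda) * max sim(d, selected).
function mmrSelect(
  relevance: Map<string, number>,            // sim(query, d)
  pairSim: (a: string, b: string) => number, // sim(d_i, d_j)
  n: number,
  lambda = 0.7
): string[] {
  const remaining = new Set(relevance.keys());
  const selected: string[] = [];
  while (selected.length < n && remaining.size > 0) {
    let best: string | null = null;
    let bestScore = -Infinity;
    for (const id of remaining) {
      const penalty = selected.length
        ? Math.max(...selected.map((s) => pairSim(id, s)))
        : 0;
      const score = lambda * (relevance.get(id) ?? 0) - (1 - lambda) * penalty;
      if (score > bestScore) {
        bestScore = score;
        best = id;
      }
    }
    if (best === null) break;
    selected.push(best);
    remaining.delete(best);
  }
  return selected;
}

// Temporal decay: downweight older memories with exp(-lambda * ageDays).
const temporalDecay = (score: number, ageDays: number, lambda = 0.01): number =>
  score * Math.exp(-lambda * ageDays);
```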
Performance on the BEIR benchmark suite
Benchmarks were run in an isolated Docker harness with the Nemotron-8B embedding model.
Results for all 36 configurations are available in the evaluation results.
Self-hosted on NVIDIA DGX Spark (zero cloud API calls)