Engineering · 6 min read

Snipara Benchmark: 601K Tokens to 4K of Usable Context

A clean benchmark of Snipara on its own indexed project context, why we rejected a misleading hallucination result, and what the numbers actually support.

Alex Lopez

Founder, Snipara

Published 2026-05-10

Topics: benchmark, context engineering, tokens, ai agents, snipara

We benchmarked Snipara on its own indexed project context and rejected the first result before publishing it. The discarded number looked impressive, but it measured a context window overflow, not factual accuracy. This is the cleaner run.

Headline

Across the deterministic 27-case suite, Snipara reduced the indexed corpus from 601,207 raw tokens to about 1,253 selected tokens/query. In the hosted OpenAI run, Snipara returned about 4,036 tokens/query for 12 medium and hard tasks, compared with a 32K-token raw window baseline.
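
The headline percentages follow directly from those token counts. A minimal sketch, using only the figures reported above:

```python
# Reproducing the headline reduction figures from the article's numbers.
corpus_tokens = 601_207          # full indexed corpus
deterministic_selected = 1_253   # avg selected tokens/query, 27-case suite
hosted_selected = 4_036          # avg tokens/query, hosted OpenAI run
baseline_window = 32_000         # raw-window baseline

deterministic_reduction = 1 - deterministic_selected / corpus_tokens
hosted_reduction = 1 - hosted_selected / baseline_window

print(f"deterministic: {deterministic_reduction:.1%}")    # → 99.8%
print(f"hosted vs 32K baseline: {hosted_reduction:.2%}")  # → 87.39%
```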

The mistake we did not publish

The first hallucination run compared Snipara against a "without Snipara" baseline that attempted to send the full corpus to the model. That corpus was about 601K tokens. The OpenAI model used for the run had a 128K context window.

The result was predictable: the baseline produced context-length errors. Scoring those errors as normal answers yielded a 0% factual accuracy figure for the baseline, a number that measures an overflow failure rather than the model's accuracy, so we excluded it.
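
The fix is simple to state in code: a case should only count toward accuracy if its prompt can actually fit the model window. A minimal sketch of that guard, with the token counts from this post (the reserved-output figure is an illustrative assumption, not an API fact):

```python
# Exclude cases that cannot fit the model window from accuracy scoring,
# instead of recording the resulting API error as a wrong answer.
MODEL_CONTEXT_WINDOW = 128_000  # gpt-4o-mini, per the run described above

def scoreable(prompt_tokens: int, reserved_output: int = 1_000) -> bool:
    """A case only counts toward accuracy if the prompt actually fits."""
    return prompt_tokens + reserved_output <= MODEL_CONTEXT_WINDOW

results = []
for case_tokens in (601_207, 32_000, 4_036):  # full dump, raw slice, retrieved
    if scoreable(case_tokens):
        results.append("scored")
    else:
        results.append("excluded: context overflow")

print(results)  # → ['excluded: context overflow', 'scored', 'scored']
```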

The clean setup

The cleaner benchmark keeps the baseline inside the model's real context window. The raw baseline receives the first 32K tokens of the indexed corpus with no retrieval. Snipara uses hosted context retrieval with an 8K token budget.

Setting              | Value
Model                | OpenAI gpt-4o-mini
Model context window | 128,000 tokens
Full indexed corpus  | 601,207 tokens
Baseline             | First 32,000 tokens, no retrieval
Snipara              | Hosted retrieval, 8,000-token budget
Test set             | 12 medium and hard project-context tasks
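
The two strategies in the setup differ only in what fills the budget. A toy sketch, assuming pre-chunked context with a hypothetical per-chunk `relevance` score (the real Snipara retriever is hosted and not shown here):

```python
# The raw baseline takes chunks in corpus order; retrieval takes the most
# relevant chunks first. Both stop when the token budget is spent.

def truncate_baseline(chunks, budget=32_000):
    """Raw baseline: consume chunks in corpus order until the budget is spent."""
    out, used = [], 0
    for chunk in chunks:
        if used + chunk["tokens"] > budget:
            break
        out.append(chunk)
        used += chunk["tokens"]
    return out

def budgeted_retrieval(chunks, budget=8_000):
    """Retrieval: rank by relevance first, then apply the same budget logic."""
    ranked = sorted(chunks, key=lambda c: c["relevance"], reverse=True)
    return truncate_baseline(ranked, budget)

corpus = [
    {"id": "changelog",  "tokens": 20_000, "relevance": 0.1},
    {"id": "api-spec",   "tokens": 6_000,  "relevance": 0.9},
    {"id": "old-notes",  "tokens": 15_000, "relevance": 0.2},
    {"id": "design-doc", "tokens": 1_500,  "relevance": 0.8},
]

print([c["id"] for c in budgeted_retrieval(corpus)])  # → ['api-spec', 'design-doc']
```

Under the same budget logic, the ordering is the whole difference: the baseline spends most of its 32K on whatever happens to come first in the corpus.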

Results

  • Deterministic reduction: 99.8% (27-case suite, local relevance simulation)
  • Hosted average context: 4,036 tokens/query across 12 tasks
  • Quality lift: +0.48 answer quality vs the 32K raw baseline

Metric              | Snipara | 32K Raw Baseline | Delta
Mean context tokens | 4,036   | 32,000           | 87.39% less
Answer quality      | 5.85/10 | 5.38/10          | +0.48
Correctness         | 5.04/10 | 4.57/10          | +0.47
Faithfulness        | 2.92/10 | 2.23/10          | +0.68

Long-context follow-up

We then reran a smaller two-case benchmark with natural assistant prompts. This matters because real assistants are usually not told to refuse whenever context is missing. If the selected context is weak, they may infer, improvise, or invent.

The long-context test used OpenAI's GPT-4.1 mini and GPT-5 mini. OpenAI documents GPT-4.1 mini with a 1,047,576-token context window and GPT-5 mini with a 400,000-token context window and 128,000 max output tokens. In our Chat Completions run, GPT-5 mini rejected a 392K-token input with a configured input limit of about 272K, so we reran the baseline at 264K raw input tokens.
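
Sizing the rerun baseline is mechanical once the cap is known. A minimal sketch using this run's numbers; the 8K headroom for the system prompt, question, and tokenizer variance is an illustrative assumption, not an API constant:

```python
# Pick a raw-baseline size that respects a configured input cap.
INPUT_CAP = 272_000  # approximate configured input limit hit in our run
HEADROOM = 8_000     # assumed allowance for system prompt, question, variance

def safe_baseline_tokens(corpus_tokens: int) -> int:
    """Largest raw slice that fits under the cap, never more than the corpus."""
    return min(corpus_tokens, INPUT_CAP - HEADROOM)

print(safe_baseline_tokens(601_207))  # → 264000
```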

Run          | Baseline                  | Snipara Context    | Quality Delta      | Hallucination Result
GPT-4.1 mini | Full 601,207-token corpus | 2,954 tokens/query | 6.00/10 vs 6.00/10 | Mixed: full dump scored lower hallucination
GPT-5 mini   | First 264,000 raw tokens  | 2,954 tokens/query | +0.28 overall      | Mixed: raw slice scored lower hallucination

The useful conclusion is not that retrieval automatically reduces hallucinations in every prompt. It is that dumping or slicing raw context is an expensive relevance strategy. In these long-context runs, Snipara used about 99% less input context while preserving comparable answer quality.

What the hallucination metric showed

The clean hallucination runs were not a strong marketing headline. Depending on the model and baseline, raw context sometimes scored a lower measured hallucination rate, especially when the raw slice included enough evidence or the model answered conservatively. Snipara produced compact and useful answers, but the current claim verifier marked many claims as unverifiable because it uses conservative keyword matching against a large reference corpus.
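
To see why conservative keyword matching penalizes compact answers, here is a toy version of such a verifier. It is a simplified illustration, not Snipara's actual verifier: a claim counts as verified only if every content word appears in a single reference passage, so ordinary paraphrase is marked unverifiable.

```python
# Toy conservative keyword claim verifier. Exact-word matching means a
# correct paraphrase ("token budget equals") fails against "budget is ... tokens".
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and"}

def content_words(text: str) -> set[str]:
    return {w for w in text.lower().split() if w not in STOPWORDS}

def verified(claim: str, passages: list[str]) -> bool:
    """A claim is verified only if all its content words appear in one passage."""
    words = content_words(claim)
    return any(words <= content_words(p) for p in passages)

passages = ["the retrieval budget is 8000 tokens in the hosted run"]
print(verified("the budget is 8000 tokens", passages))     # → True
print(verified("the token budget equals 8000", passages))  # → False (paraphrase)
```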

That means the defensible claim is not "Snipara eliminates hallucinations." The defensible claim is narrower: Snipara gives the model a much smaller, source-selected context package that fits the window and improves measured answer quality in this run.

Why this matters

For agent workflows, the practical failure mode is not only hallucination. It is wasting context budget on irrelevant project history, then asking the model to infer what matters. Snipara shifts that work into retrieval: select the relevant project memory first, then let the model reason over a compact context package.
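
The retrieval-first shape of that workflow fits in a few lines. A sketch with hypothetical `retrieve` and `complete` stand-ins for a hosted retriever and an LLM call:

```python
# Retrieval-first agent loop: select relevant memory first, then let the
# model reason over the compact context package.
def answer(question: str, retrieve, complete, budget: int = 8_000) -> str:
    context = retrieve(question, budget=budget)   # select relevant memory first
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return complete(prompt)                       # reason over the compact package

# Toy stand-ins showing the shape of the calls:
fake_retrieve = lambda q, budget: "[selected project memory]"
fake_complete = lambda p: "answer grounded in " + p.split("\n")[1]
print(answer("What is the token budget?", fake_retrieve, fake_complete))
```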

Takeaway

Bigger context windows help, but they do not decide what is relevant. On this run, Snipara made the context small enough to fit comfortably, kept the answer useful, and avoided relying on a misleading full-dump comparison.
