Snipara Benchmark: 601K Tokens to 4K of Usable Context

We benchmarked Snipara on its own indexed project context and rejected the first result before publishing it. The discarded number looked impressive, but it measured a context window overflow, not factual accuracy. This is the cleaner run.

Headline

Across the deterministic 27-case suite, Snipara reduced the indexed corpus from 601,207 raw tokens to about 1,253 selected tokens/query. In the hosted OpenAI run, Snipara returned about 4,036 tokens/query across 12 medium and hard tasks, compared with a raw baseline that received the first 32,000 tokens of the corpus.

The mistake we did not publish

The first hallucination run compared Snipara against a "without Snipara" baseline that attempted to send the full corpus to the model. That corpus was about 601K tokens. The OpenAI model used for the run had a 128K context window.

The result was predictable: the baseline produced context-length errors. Scoring those errors as normal answers created a misleading 0% factual accuracy number for the baseline. That number measures failed requests, not answer accuracy, so we excluded it.
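One way to keep this kind of failure out of an accuracy score is to count request errors separately from judged answers. The following is a minimal sketch, not the benchmark's actual harness: `RunResult`, `summarize`, and the `"context_length_exceeded"` label are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunResult:
    case_id: str
    answer: Optional[str]    # None when the request failed
    error: Optional[str]     # hypothetical label, e.g. "context_length_exceeded"
    correct: Optional[bool]  # judged only when an answer exists


def summarize(results: List[RunResult]) -> dict:
    """Report accuracy over answered cases only; count failures separately."""
    answered = [r for r in results if r.error is None]
    failed = [r for r in results if r.error is not None]
    if answered:
        accuracy = sum(1 for r in answered if r.correct) / len(answered)
    else:
        accuracy = None  # no answers to judge, so accuracy is undefined, not 0%
    return {"answered": len(answered), "failed": len(failed), "accuracy": accuracy}
```

With this shape, a baseline whose every request overflows the window reports `accuracy: None` and a failure count, rather than a 0% score that reads like a quality result.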

The clean setup

The cleaner benchmark keeps the baseline inside the model's real context window. The raw baseline receives the first 32K tokens of the indexed corpus with no retrieval. Snipara uses hosted context retrieval with an 8K token budget.

Setting	Value
Model	OpenAI gpt-4o-mini
Model context window	128,000 tokens
Full indexed corpus	601,207 tokens
Baseline	First 32,000 tokens, no retrieval
Snipara	Hosted retrieval, 8,000 token budget
Test set	12 medium and hard project-context tasks
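The baseline construction can be sketched as a plain prefix slice plus a window check. This is an illustration under assumptions: the corpus is taken to be already tokenized into a list, and the `reserve` headroom for instructions and output is a hypothetical parameter, not a value from the run.

```python
from typing import List, Sequence

MODEL_CONTEXT_WINDOW = 128_000  # gpt-4o-mini window from the table above
BASELINE_TOKENS = 32_000        # raw baseline size from the table above
SNIPARA_BUDGET = 8_000          # retrieval budget from the table above


def build_raw_baseline(corpus_tokens: Sequence[int], limit: int = BASELINE_TOKENS) -> List[int]:
    """Take the first `limit` tokens of the corpus, with no retrieval or ranking."""
    return list(corpus_tokens[:limit])


def fits_window(n_input_tokens: int, window: int = MODEL_CONTEXT_WINDOW,
                reserve: int = 4_000) -> bool:
    """Check that the input plus a reserve (hypothetical headroom) fits the window."""
    return n_input_tokens + reserve <= window


# Stand-in for the tokenized corpus: only its length matters for this check.
corpus = list(range(601_207))
baseline = build_raw_baseline(corpus)

print(len(baseline), fits_window(len(baseline)))  # baseline size and fit
print(len(corpus), fits_window(len(corpus)))      # full corpus size and fit
print(SNIPARA_BUDGET, fits_window(SNIPARA_BUDGET))
```

The full corpus fails the check and both the 32K slice and the 8K budget pass it, which is the property the clean setup relies on.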

Results

Metric	Value	Scope
Deterministic reduction	99.8%	27-case suite, local relevance simulation
Hosted average context	4,036 tokens/query	Across 12 tasks
Quality lift	+0.48	Answer quality vs 32K raw baseline

Metric	Snipara	32K Raw Baseline	Delta
Mean context tokens	4,036	32,000	87.39% less
Answer quality	5.85/10	5.38/10	+0.48
Correctness	5.04/10	4.57/10	+0.47
Faithfulness	2.92/10	2.23/10	+0.68
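The context figures can be rechecked from the published numbers. The sketch below recomputes the percentage reductions and the score deltas from the rounded table values. The helper name `percent_less` is ours. Deltas recomputed this way can differ by 0.01 from the published ones, which would be consistent with the published deltas being taken from unrounded means; that is an assumption, since the source does not say.

```python
def percent_less(baseline: float, value: float) -> float:
    """Percentage by which `value` is smaller than `baseline`."""
    return 100.0 * (baseline - value) / baseline


# Hosted run: 4,036 tokens/query vs the 32,000-token raw baseline.
hosted = percent_less(32_000, 4_036)
print(round(hosted, 2))  # 87.39

# Deterministic suite: about 1,253 selected tokens/query vs 601,207 corpus tokens.
deterministic = percent_less(601_207, 1_253)
print(round(deterministic, 1))  # 99.8

# Score deltas recomputed from the rounded table values (Snipara, baseline).
scores = {
    "Answer quality": (5.85, 5.38),
    "Correctness": (5.04, 4.57),
    "Faithfulness": (2.92, 2.23),
}
deltas = {name: round(snipara - raw, 2) for name, (snipara, raw) in scores.items()}
print(deltas)
```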

Long-context follow-up

We then reran a smaller two-case benchmark with natural assistant prompts. This matters because real assistants are usually not told to refuse whenever context is missing. If the selected context is weak, they may infer, improvise, or invent.

The long-context test used OpenAI's GPT-4.1 mini and GPT-5 mini. OpenAI documents GPT-4.1 mini with a 1,047,576-token context window, and GPT-5 mini with a 400,000-token context window and 128,000 max output tokens. In our Chat Completions run, GPT-5 mini rejected a 392K-token input because it exceeded a configured input limit of about 272K tokens, so we reran the baseline at 264K raw input tokens.

Run	Baseline	Snipara Context	Quality Delta	Hallucination Result
GPT-4.1 mini	Full 601,207-token corpus	2,954 tokens/query	6.00/10 vs 6.00/10	Mixed: full dump scored lower hallucination
GPT-5 mini	First 264,000 raw tokens	2,954 tokens/query	+0.28 overall	Mixed: raw slice scored lower hallucination

The useful conclusion is not that retrieval automatically reduces hallucinations in every prompt. It is that dumping or slicing raw context is an expensive relevance strategy. In these long-context runs, Snipara used about 99% less input context while preserving comparable answer quality.
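The "about 99% less" figure follows from the table: both long-context runs used 2,954 Snipara tokens/query against a 601,207-token and a 264,000-token baseline. A small sketch of that comparison, with `compare` as a hypothetical helper name:

```python
SNIPARA_TOKENS = 2_954  # Snipara context per query in both long-context runs

baselines = {
    "GPT-4.1 mini, full corpus": 601_207,
    "GPT-5 mini, first 264K tokens": 264_000,
}


def compare(baseline_tokens: int, selected_tokens: int = SNIPARA_TOKENS):
    """Return (percent less input, baseline size as a multiple of the selected input)."""
    percent_less = 100.0 * (baseline_tokens - selected_tokens) / baseline_tokens
    ratio = baseline_tokens / selected_tokens
    return percent_less, ratio


for name, tokens in baselines.items():
    pct, ratio = compare(tokens)
    print(f"{name}: {pct:.2f}% less input, baseline is {ratio:.1f}x the Snipara input")
```

The reduction is slightly above 99% against the full corpus and slightly below it against the 264K slice, so "about 99%" covers both runs.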

What the hallucination metric showed

The clean hallucination runs were not a strong marketing headline. Depending on the model and baseline, raw context sometimes scored a lower measured hallucination rate, especially when the raw slice included enough evidence or the model answered conservatively. Snipara produced compact and useful answers, but the current claim verifier marked many claims as unverifiable because it uses conservative keyword matching against a large reference corpus.
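To see why conservative keyword matching can mark a correct claim as unverifiable, consider the toy verifier below. It is an illustration only, not the verifier used in these runs: the stopword list, the 0.8 threshold, and the example passage are all assumptions made for the sketch.

```python
import re
from typing import Iterable, Set

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "for", "on", "with"}


def keywords(text: str) -> Set[str]:
    """Lowercased word tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9_]+", text.lower()) if w not in STOPWORDS}


def verify_claim(claim: str, reference_passages: Iterable[str], threshold: float = 0.8) -> str:
    """Mark a claim supported only if most of its keywords appear in one passage."""
    claim_kw = keywords(claim)
    if not claim_kw:
        return "unverifiable"
    best = max(
        (len(claim_kw & keywords(p)) / len(claim_kw) for p in reference_passages),
        default=0.0,
    )
    return "supported" if best >= threshold else "unverifiable"


reference = ["The cache expires entries after 30 minutes."]  # made-up passage

verbatim = verify_claim("The cache expires entries after 30 minutes", reference)
paraphrase = verify_claim("Cached items are dropped once half an hour has passed", reference)
print(verbatim, paraphrase)
```

The paraphrased claim says the same thing as the reference but shares no keywords with it, so a matcher like this reports it as unverifiable. A compact, reworded answer is penalized more than one that copies wording from a raw slice.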

That means the defensible claim is not "Snipara eliminates hallucinations." The defensible claim is narrower: Snipara gives the model a much smaller, source-selected context package that fits the window, and it improved measured answer quality in the hosted gpt-4o-mini run.

Why this matters

For agent workflows, the practical failure mode is not only hallucination. It is wasting context budget on irrelevant project history, then asking the model to infer what matters. Snipara shifts that work into retrieval: select the relevant project memory first, then let the model reason over a compact context package.
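The idea of selecting first and reasoning second can be sketched as a greedy fill of a token budget. This is not Snipara's retrieval algorithm, which the article does not describe. `Chunk`, `select_context`, and the relevance scores are hypothetical, and how the scores are produced is out of scope here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    source: str
    tokens: int
    score: float  # relevance to the query, assumed to be precomputed


def select_context(chunks: List[Chunk], budget_tokens: int = 8_000) -> Tuple[List[Chunk], int]:
    """Greedily keep the highest-scoring chunks that still fit the token budget."""
    selected: List[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if used + chunk.tokens <= budget_tokens:
            selected.append(chunk)
            used += chunk.tokens
    return selected, used


# Made-up candidates for illustration.
candidates = [
    Chunk("design-notes", 5_000, 0.9),
    Chunk("old-changelog", 4_000, 0.8),
    Chunk("api-summary", 2_500, 0.7),
    Chunk("meeting-log", 1_000, 0.1),
]
picked, used = select_context(candidates)
print([c.source for c in picked], used)
```

The model then receives only the selected chunks, so the relevance decision is made before the prompt is built instead of being left to the model inside a large window.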

Takeaway

Bigger context windows help, but they do not decide what is relevant. On this run, Snipara made the context small enough to fit comfortably, kept the answer useful, and avoided relying on a misleading full-dump comparison.
