Snipara Benchmark: 601K Tokens to 4K of Usable Context
A clean benchmark of Snipara on its own indexed project context, why we rejected a misleading hallucination result, and what the numbers actually support.
Alex Lopez
Founder, Snipara
- Readable in 6 minutes
- Published 2026-05-10
- 5 context themes covered
We benchmarked Snipara on its own indexed project context and rejected the first result before publishing it. The discarded number looked impressive, but it measured a context window overflow, not factual accuracy. This is the cleaner run.
Headline
Across the deterministic 27-case suite, Snipara reduced the indexed corpus from 601,207 raw tokens to about 1,253 selected tokens/query. In the hosted OpenAI run, Snipara selected about 4,036 context tokens/query across the 12 medium and hard tasks, compared with a 32K-token raw-window baseline.
The mistake we did not publish
The first hallucination run compared Snipara against a "without Snipara" baseline that attempted to send the full corpus to the model. That corpus was about 601K tokens. The OpenAI model used for the run had a 128K context window.
The result was predictable: the baseline produced context-length errors. Scoring those errors as normal answers created a misleading 0% factual accuracy number for the baseline. That number would have made a striking headline, but it measured a window overflow rather than factual accuracy, so we excluded it.
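For context, here is a minimal sketch of how a scoring harness can keep window overflows out of the accuracy numbers, assuming the openai Python SDK; run_case and the prompts are our own illustration, not the benchmark's actual code.

```python
from openai import OpenAI, BadRequestError

client = OpenAI()

def run_case(question: str, context: str, model: str = "gpt-4o-mini"):
    """Return (answer, failure_kind). Window overflows are recorded as
    infrastructure failures instead of being scored as 0%-accurate answers."""
    try:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Answer from the provided project context."},
                {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
            ],
        )
        return resp.choices[0].message.content, None
    except BadRequestError as exc:
        # The API reports oversized requests with this error code; treating it as a
        # separate failure class keeps it out of the factual-accuracy scoring.
        if getattr(exc, "code", None) == "context_length_exceeded":
            return None, "context_overflow"
        raise
```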
The clean setup
The cleaner benchmark keeps the baseline inside the model's real context window. The raw baseline receives the first 32K tokens of the indexed corpus with no retrieval. Snipara uses hosted context retrieval with an 8K token budget.
| Setting | Value |
|---|---|
| Model | OpenAI gpt-4o-mini |
| Model context window | 128,000 tokens |
| Full indexed corpus | 601,207 tokens |
| Baseline | First 32,000 tokens, no retrieval |
| Snipara | Hosted retrieval, 8,000 token budget |
| Test set | 12 medium and hard project-context tasks |
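A minimal sketch of how the 32K raw baseline can be built, assuming tiktoken with the o200k_base encoding used by the gpt-4o family; truncate_to_tokens and the corpus path are our own names, not part of Snipara.

```python
import tiktoken

def truncate_to_tokens(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens tokens of the corpus (the raw baseline)."""
    enc = tiktoken.get_encoding("o200k_base")  # encoding used by the gpt-4o family
    return enc.decode(enc.encode(text)[:max_tokens])

corpus = open("indexed_corpus.txt", encoding="utf-8").read()  # ~601K tokens in this run
baseline_context = truncate_to_tokens(corpus, 32_000)         # no retrieval, first 32K only
```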
Results
| Metric | Snipara | 32K Raw Baseline | Delta |
|---|---|---|---|
| Mean context tokens | 4,036 | 32,000 | 87.39% less |
| Answer quality | 5.85/10 | 5.38/10 | +0.48 |
| Correctness | 5.04/10 | 4.57/10 | +0.47 |
| Faithfulness | 2.92/10 | 2.23/10 | +0.68 |
Long-context follow-up
We then reran a smaller two-case benchmark with natural assistant prompts. This matters because real assistants are usually not told to refuse whenever context is missing. If the selected context is weak, they may infer, improvise, or invent.
The long-context test used OpenAI's GPT-4.1 mini and GPT-5 mini. OpenAI documents GPT-4.1 mini with a 1,047,576-token context window and GPT-5 mini with a 400,000-token context window and 128,000 max output tokens. In our Chat Completions run, GPT-5 mini rejected a 392K-token input because its effective input limit is about 272K tokens (the 400K window minus the 128K reserved for output), so we reran the baseline at 264K raw input tokens.
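That arithmetic is worth spelling out, since the 264K figure sits just under the cap; a sketch, with the 8K headroom as our own illustrative assumption rather than a documented limit:

```python
# Documented GPT-5 mini figures from the paragraph above.
CONTEXT_WINDOW = 400_000
MAX_OUTPUT_TOKENS = 128_000

# Largest input the model will accept: window minus the reserved output budget.
input_cap = CONTEXT_WINDOW - MAX_OUTPUT_TOKENS   # 272,000 tokens

# Illustrative headroom for instructions and the question itself (our assumption);
# this is how a 264K raw baseline can be derived from the 272K input cap.
PROMPT_HEADROOM = 8_000
baseline_tokens = input_cap - PROMPT_HEADROOM    # 264,000 tokens
```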
| Run | Baseline | Snipara Context | Quality Delta | Hallucination Result |
|---|---|---|---|---|
| GPT-4.1 mini | Full 601,207-token corpus | 2,954 tokens/query | 0.00 (6.00/10 both) | Mixed: full dump scored a lower hallucination rate |
| GPT-5 mini | First 264,000 raw tokens | 2,954 tokens/query | +0.28 overall | Mixed: raw slice scored a lower hallucination rate |
The useful conclusion is not that retrieval automatically reduces hallucinations in every prompt. It is that dumping or slicing raw context is an expensive relevance strategy. In these long-context runs, Snipara used about 99% less input context while preserving comparable answer quality.
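The "about 99%" figure is simply the ratio of the selected context to each baseline; a quick check:

```python
snipara_tokens = 2_954
for label, baseline in [("full 601K corpus", 601_207), ("264K raw slice", 264_000)]:
    print(f"{label}: {100 * (1 - snipara_tokens / baseline):.1f}% less input context")
# full 601K corpus: 99.5% less input context
# 264K raw slice: 98.9% less input context
```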
What the hallucination metric showed
The clean hallucination runs were not a strong marketing headline. Depending on the model and baseline, raw context sometimes scored a lower measured hallucination rate, especially when the raw slice included enough evidence or the model answered conservatively. Snipara produced compact and useful answers, but the current claim verifier marked many claims as unverifiable because it uses conservative keyword matching against a large reference corpus.
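To make that limitation concrete, here is a minimal sketch of what conservative keyword matching can look like; this is our illustration of the failure mode, not Snipara's actual claim verifier, and the 0.6 threshold is an assumption.

```python
import re

def keyword_verify(claim: str, reference_corpus: str, min_overlap: float = 0.6) -> str:
    """Mark a claim 'supported' only if most of its content words appear verbatim
    in the reference corpus. Correct paraphrases fail this check, which is how
    compact answers end up scored as unverifiable."""
    claim_words = {w for w in re.findall(r"[a-z0-9]+", claim.lower()) if len(w) > 3}
    if not claim_words:
        return "unverifiable"
    corpus_words = set(re.findall(r"[a-z0-9]+", reference_corpus.lower()))
    overlap = len(claim_words & corpus_words) / len(claim_words)
    return "supported" if overlap >= min_overlap else "unverifiable"
```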
That means the defensible claim is not "Snipara eliminates hallucinations." The defensible claim is narrower: Snipara gives the model a much smaller, source-selected context package that fits the window and improves measured answer quality in this run.
Why this matters
For agent workflows, the practical failure mode is not only hallucination. It is wasting context budget on irrelevant project history, then asking the model to infer what matters. Snipara shifts that work into retrieval: select the relevant project memory first, then let the model reason over a compact context package.
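A sketch of that ordering; the snipara_client object, its retrieve method, the token_budget parameter, and the package.text attribute are illustrative assumptions, not the documented hosted API.

```python
from openai import OpenAI

openai_client = OpenAI()

def answer_with_project_memory(question: str, snipara_client, token_budget: int = 8_000) -> str:
    # Step 1: select the relevant project memory first, under an explicit budget
    # (hypothetical client call; check the hosted API docs for the real signature).
    package = snipara_client.retrieve(query=question, token_budget=token_budget)

    # Step 2: let the model reason over the compact context package.
    resp = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided project context."},
            {"role": "user", "content": f"{package.text}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```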
Takeaway
Bigger context windows help, but they do not decide what is relevant. On this run, Snipara made the context small enough to fit comfortably, kept the answer useful, and avoided relying on a misleading full-dump comparison.