Engineering·6 min read

Snipara Benchmark: Project Memory Beats Raw Context Dumps

A clean benchmark showing why Snipara is more than token reduction: source-aware project memory selects the context agents need before generation starts.


Alex Lopez

Founder, Snipara

Published 2026-05-10

Topics: benchmark, project memory, context engineering, ai agents, snipara

Snipara is not a token reducer. It is a project memory and context compiler for AI agents. This benchmark measures one visible side effect of that job: when Snipara selects the right source context before generation, the prompt becomes much smaller, easier to verify, and less dependent on raw context dumping.

What this benchmark proves

Across the deterministic 27-case suite, Snipara compiled its indexed project memory from 601,207 raw tokens to about 1,253 selected tokens/query. In the latest hosted OpenAI GPT-4.1 run, Snipara returned about 2,981 tokens/query for 12 medium and hard tasks, compared with a 32K-token raw window baseline. Answer quality improved from 6.25/10 to 8.42/10, and the claim-grounding audit measured 4.35% claims unsupported by the supplied Snipara context versus 21.15% for the raw baseline. A GPT-5.5 follow-up landed in the same quality band with Snipara: 7.91/10 across 11 completed cases.

The product question

Raw tokens are a weak interface for serious agent work. Agents need durable project memory, source-aware retrieval, reviewed team context, code impact context, and a consistent way to carry that knowledge across users, models, and sessions. This article isolates one part of that system: does Snipara choose a useful context package, or does it merely make prompts smaller?

The answer from this run is narrower and more useful than a generic cost-saving claim. Snipara selected a compact context package that improved measured answer quality and reduced unsupported factual claims, while avoiding the failure mode of dumping arbitrary project history into the model window.

The mistake we did not publish

The first hallucination run compared Snipara against a "without Snipara" baseline that attempted to send the full corpus to the model. That corpus was about 601K tokens. The OpenAI model used for the run had a 128K context window.

The result was predictable: the baseline produced context-length errors. Scoring those errors as normal answers created a misleading 0% factual accuracy number for the baseline. That number reflects failed requests rather than answer accuracy, so we excluded it.
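One way to avoid this scoring trap is to keep failed requests out of the accuracy denominator and report them separately. The sketch below is illustrative only: the `RunResult` record and its `correct` and `error` fields are assumed names, not part of the Snipara benchmark harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    """One benchmark case. Field names are illustrative assumptions."""
    case_id: str
    correct: Optional[bool] = None  # None when no answer was produced
    error: Optional[str] = None     # e.g. "context_length_exceeded"


def summarize(results: list[RunResult]) -> dict:
    """Report accuracy over answered cases only and count errors separately."""
    answered = [r for r in results if r.error is None and r.correct is not None]
    errored = [r for r in results if r.error is not None]
    accuracy = (
        sum(r.correct for r in answered) / len(answered) if answered else None
    )
    return {"answered": len(answered), "errored": len(errored), "accuracy": accuracy}


# A baseline where every request overflowed the context window.
overflowed = [
    RunResult("case-1", error="context_length_exceeded"),
    RunResult("case-2", error="context_length_exceeded"),
]
summary = summarize(overflowed)
```

With every request failing, `summary["accuracy"]` is `None` rather than 0, which makes it visible that the baseline was never actually scored.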

The clean setup

The cleaner benchmark keeps the baseline inside the model's real context window. The raw baseline receives the first 32K tokens of the indexed corpus with no retrieval or memory selection. Snipara uses hosted context retrieval with an 8K-token budget and selects context for each task.

| Setting | Value |
| --- | --- |
| Model | OpenAI gpt-4.1 |
| Model context window | Long-context model; benchmark baseline capped at 32,000 raw tokens |
| Full indexed corpus | 601,207 tokens |
| Baseline | First 32,000 tokens, no retrieval |
| Snipara | Hosted retrieval, 8,000 token budget |
| Test set | 12 medium and hard project-context tasks |
| Claim-grounding audit | Claims audited against the context each answer path received; not a standalone answer-quality score |

Results

  • Project Memory Selected: 99.8% less raw context in the 27-case suite
  • Hosted Context Package: 2,981 tokens/query across 12 tasks
  • Quality Lift: +2.17 answer quality vs 32K raw baseline

| Metric | Snipara | 32K Raw Baseline | Delta |
| --- | --- | --- | --- |
| Mean context tokens | 2,981 | 32,000 | 90.69% less |
| Answer quality | 8.42/10 | 6.25/10 | +2.17 |
| Unsupported-claim rate | 4.35% | 21.15% | -16.80 pts |
| Supported-claim share | 95.65% | 78.85% | +16.80 pts |
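
The headline figures can be recomputed from the numbers reported in this article. This is a small arithmetic sketch and not part of the benchmark harness; `percent_reduction` is a helper name introduced here. Because the article gives the token means as rounded values ("about 2,981"), the recomputed reduction may differ from the table in the last digit.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (1 - after / before) * 100


# Figures as reported in the article (token means are rounded there).
suite_reduction = percent_reduction(601_207, 1_253)   # 27-case suite
hosted_reduction = percent_reduction(32_000, 2_981)   # hosted GPT-4.1 run
quality_delta = 8.42 - 6.25
unsupported_delta = 4.35 - 21.15

print(f"27-case suite: {suite_reduction:.1f}% less raw context")
print(f"hosted run:    {hosted_reduction:.1f}% fewer context tokens")
print(f"quality delta: {quality_delta:+.2f}")
print(f"unsupported-claim delta: {unsupported_delta:+.2f} pts")
```

At one decimal place this gives 99.8% for the suite and 90.7% for the hosted run, in line with the reported figures.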

GPT-5.5 follow-up

We also ran a GPT-5.5 follow-up. The first attempt produced several empty answers because, with GPT-5 models, Chat Completions requests count internal reasoning tokens against the completion budget. We raised that budget and retried the empty answers.
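
A retry of this kind can be sketched as a loop that enlarges the completion budget whenever an answer comes back empty. Everything here is an assumption for illustration: `generate` is a hypothetical callable standing in for the model client, and the starting budget, growth factor, and attempt limit are not the values used in the benchmark.

```python
from typing import Callable


def answer_with_retry(
    generate: Callable[[str, int], str],
    prompt: str,
    budget: int = 1024,
    growth: int = 2,
    max_attempts: int = 3,
) -> tuple[str, int]:
    """Call `generate(prompt, budget)`, enlarging the budget after empty answers.

    Returns the final answer (possibly empty) and the budget last tried.
    """
    answer = ""
    for attempt in range(max_attempts):
        answer = generate(prompt, budget)
        if answer.strip():
            break
        if attempt < max_attempts - 1:
            budget *= growth  # leave room for reasoning plus visible output
    return answer, budget


def fake_model(prompt: str, budget: int) -> str:
    """Stand-in model: reasoning uses 1,500 tokens before any text appears."""
    return "final answer" if budget > 1500 else ""


answer, used_budget = answer_with_retry(fake_model, "Summarize the auth flow")
```

In this toy run the first call at 1,024 tokens returns an empty string and the second call at 2,048 tokens succeeds. A case that stays empty after the last attempt is returned as empty, so it can be reported as incomplete instead of being scored.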

Across the 11 completed GPT-5.5 cases, Snipara scored 7.91/10 versus 5.18/10 for the raw baseline, while using about 3,004 tokens/query instead of 32,000. The missing final case does not change the direction of the result; the complete claim-grounding audit below remains the GPT-4.1 audit.

The interesting part is the model comparison. With Snipara selecting the project context, GPT-4.1 scored 8.42/10 and GPT-5.5 scored 7.91/10 on the same benchmark family. In other words, the retrieval layer made a cheaper or older model competitive with a newer one for this project-context task set. Bigger models still matter, but the context compiler moved a large part of the outcome before generation started.

| Run | Cases | Snipara Context | Quality Delta | Claim-Grounding Audit |
| --- | --- | --- | --- | --- |
| GPT-4.1 | 12/12 | 2,981 tokens/query | +2.17 overall | 4.35% vs 21.15% |
| GPT-5.5 | 11-case follow-up | 3,004 tokens/query | +2.73 overall | GPT-4.1 audit used as the complete grounding run |

The useful conclusion is not that retrieval magically eliminates hallucinations in every prompt. It is that dumping or slicing raw context is an expensive relevance strategy. In the GPT-4.1 run, the smaller Snipara context also produced fewer unsupported claims. Snipara behaves like a memory compiler: it decides which project sources should reach the model, then leaves the model to reason over a smaller and more focused evidence set.

The benchmark should still be read with the right boundary: it is a project-context benchmark, not a universal hallucination benchmark. The audit counts whether factual claims are supported by the context each answer path received; omissions and wrong framing are not counted as hallucinations.

What the claim-grounding metric showed

In the GPT-4.1 audit, Snipara produced 3 unsupported claims out of 69 audited claims. The raw baseline produced 11 unsupported claims out of 52. Neither path produced contradicted claims in this run. The remaining Snipara issues clustered around authentication detail and shared-context budget wording, which are useful follow-up cases for retrieval and answer prompting.
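
The reported rates follow directly from those claim counts. The sketch below assumes each audited claim carries one of three labels (supported, unsupported, contradicted); the label names and the `grounding_rates` helper are illustrative, not the audit's actual schema.

```python
from collections import Counter

LABELS = ("supported", "unsupported", "contradicted")


def grounding_rates(labels: list[str]) -> dict[str, float]:
    """Share of audited claims per label, as percentages."""
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    if total == 0:
        raise ValueError("no claims to audit")
    return {label: 100 * counts[label] / total for label in LABELS}


# Counts reported for the GPT-4.1 audit: 3 of 69 and 11 of 52 unsupported.
snipara = grounding_rates(["unsupported"] * 3 + ["supported"] * 66)
baseline = grounding_rates(["unsupported"] * 11 + ["supported"] * 41)

print(f"Snipara unsupported:  {snipara['unsupported']:.2f}%")
print(f"Baseline unsupported: {baseline['unsupported']:.2f}%")
```

Rounded to two decimals this yields 4.35% and 21.15%. Note that the two paths were audited over different claim totals (69 versus 52), so the rates are more comparable than the raw counts.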

There is an important distinction here: the claim-grounding audit measures faithfulness to the supplied context, not overall answer quality. One shared-context case scored low on answer quality even though it only had one unsupported claim, because the retrieved context led the model toward the wrong budget-allocation framing. That is still a retrieval-quality issue, just not the same failure mode as inventing unsupported details.

The defensible claim is still not "Snipara eliminates hallucinations." The defensible claim is narrower and stronger: in this GPT-4.1 run, Snipara gave the model a source-selected project memory package that fit the window, improved answer quality, and reduced claims unsupported by the supplied context versus the raw-window baseline.

The next proof point

The stronger test is a corpus that no tested model can ingest as a full dump. In that setup, the comparison is no longer Snipara versus "all evidence in the prompt." It is Snipara versus first-window truncation, chunk stuffing, and LLM compaction. Those baselines can lose facts before generation starts.

We ran an internal overflow validation for that case. It measures evidence recall before answer generation. That evidence-recall gate is important: if the context package does not contain the source facts, no answer-quality score can rescue the run.

| Overflow Validation | Mean Context | Evidence Recall |
| --- | --- | --- |
| Hosted Snipara | 3,685 tokens | 85.6% |
| First 264K raw tokens | 264,000 tokens | 0.0% |
| Deterministic compaction baseline | 8,000 tokens | 41.6% |

This validation used a 1.52M-token overflow corpus with the real source documentation placed after the raw baseline window. It is not yet a full answer-quality run, but it confirms the failure mode that matters most for oversized corpora: before generation starts, raw truncation can lose every source claim.
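
An evidence-recall gate of this kind can be sketched as the fraction of required source facts present in the context package before any answer is generated. The version below makes simplifying assumptions that are not from the internal validation: facts are matched as case-insensitive substrings, whitespace splitting stands in for a real tokenizer, and the corpus, facts, and function names are invented for illustration.

```python
def evidence_recall(context: str, required_facts: list[str]) -> float:
    """Fraction of required facts that appear verbatim in the context package."""
    if not required_facts:
        raise ValueError("need at least one required fact")
    text = context.lower()
    found = sum(1 for fact in required_facts if fact.lower() in text)
    return found / len(required_facts)


def first_window(corpus: str, max_tokens: int) -> str:
    """Raw truncation baseline: keep only the first `max_tokens` tokens.

    Whitespace splitting stands in for a real tokenizer here.
    """
    return " ".join(corpus.split()[:max_tokens])


facts = ["api keys rotate every 90 days", "webhooks retry three times"]
filler = "unrelated changelog entry " * 50  # 150 filler tokens
corpus = filler + "API keys rotate every 90 days. Webhooks retry three times."

truncated = first_window(corpus, max_tokens=100)  # source facts sit past the window
selected = "API keys rotate every 90 days."       # a hand-picked context package

print(evidence_recall(truncated, facts))
print(evidence_recall(selected, facts))
```

Because the facts sit after the 100-token window, the truncated context scores 0.0, while the much smaller hand-picked package scores 0.5. That mirrors the shape of the overflow result: context size says nothing about whether the evidence made it into the prompt.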

Why this matters

For agent workflows, the practical failure mode is not only hallucination. It is wasting context budget on irrelevant project history, then asking the model to infer what matters. Snipara shifts that work into project memory: select the relevant source truth first, keep it reusable across sessions, and let the model reason over context that was chosen for the task.

That is why this benchmark matters, but also why it should not define the whole product. Token reduction is a measurable symptom. The product is the memory layer that decides what deserves to be in front of the model: docs, code graph evidence, team rules, prior decisions, and reviewed context that survives tool and model changes.

Takeaway

Bigger context windows help, but they do not decide what is relevant. Snipara is useful because it turns project memory into task-specific source context. The smaller prompt is the outcome, not the strategy.
