Correct answers aren't enough.
Every other eval asks whether the output is good. Snipara asks whether the agent understood the project — its decisions, history and blast radius — before it acted. And we publish what we measure, with deterministic harnesses you can re-run.
Five things no other eval scores
Accuracy, completion and tool usage are table stakes. These dimensions test whether the agent acted in line with the real state of the project.
Context Preservation Score
deterministic today
Does the agent keep the project's existing decisions instead of silently overwriting them?
Golden cases assert that hidden agent state never overrides product docs and that stale chunks are surfaced as caveats, not current truth.
Decision Consistency Score
deterministic today
Does the agent respect the project's canonical decisions and source authority?
Golden cases assert that product docs beat internal ADRs and agent notes, so a reviewed decision wins over an agent's guess.
Impact Awareness Score
deterministic today
Did the agent see everything the change actually touches?
The code graph resolves impact chains (callers, imports, config facts, routes and tests) and passes 8/8 deterministic structural cases.
Verification Coverage Score
deterministic today
Did the agent run the checks the change requires, and notice the missing ones?
snipara-companion verify builds a deterministic verification plan from code impact and flags missing checks the agent skipped.
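The idea of deriving a verification plan from code impact can be sketched in a few lines. This is a minimal illustration, not the snipara-companion implementation: the `REQUIRED_CHECKS` mapping, the artifact kinds and the check names are all hypothetical.

```python
# Hypothetical mapping from impacted artifact kinds to required checks.
# None of these names come from snipara-companion; they are placeholders.
REQUIRED_CHECKS = {
    "route": {"integration-tests"},
    "config": {"config-validation"},
    "module": {"unit-tests", "typecheck"},
}


def build_verification_plan(impacted_kinds):
    """Return the checks required by the impacted artifact kinds, sorted for determinism."""
    required = set()
    for kind in impacted_kinds:
        required |= REQUIRED_CHECKS.get(kind, set())
    return sorted(required)


def missing_checks(plan, checks_run):
    """Return the planned checks that were not run, in plan order."""
    ran = set(checks_run)
    return [check for check in plan if check not in ran]


plan = build_verification_plan(["module", "route"])
skipped = missing_checks(plan, ["unit-tests"])
print(plan)     # ['integration-tests', 'typecheck', 'unit-tests']
print(skipped)  # ['integration-tests', 'typecheck']
```

Sorting the plan means the same impact set always produces the same output, which is the property a deterministic harness can assert on.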
Continuity Score
deterministic today
Does the agent resume interrupted work at the right point, across sessions and tools?
A deterministic golden continuity harness (6/6 cases) asserts resume scope, decision carry-over, supersession flagging, stale-work exclusion and safest-next-action gating against the real resume-bundle builder.
Two agents pass the test. Only one understood the project.
Agent A
Snipara eval: wins
Code compiles · tests pass · 100/100
Respected the canonical decision: invited workspace users keep personal API keys and the team seat covers billing.
Agent B
Snipara eval: fails
Code compiles · tests pass · 100/100
Violated workspace billing policy — re-charged invited users per seat.
Illustrative — this is the dimension we score (decision consistency), not a benchmark result. A traditional eval rates both agents identically; only a project-aware eval separates them.
The harnesses backing those scores
The dimensions above are not slideware — each is exercised by a harness you can re-run. Here is what runs green today.
Golden Context Harness
deterministic
Does the right source win, and are stale or internal notes correctly demoted?
Controlled fixtures assert that product docs beat internal ADRs, stale chunks become caveats (not current truth), exact tool references beat agent notes, and structural queries route to the code graph.
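A golden case of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not the harness itself: the `Chunk` type, the `AUTHORITY` ranks and the `resolve` function are hypothetical, and the relative order of internal ADRs versus agent notes is assumed (the text only states that product docs beat both).

```python
from dataclasses import dataclass

# Hypothetical authority ranks. Product docs outranking ADRs and agent notes
# follows the text; ADRs outranking agent notes is an assumption.
AUTHORITY = {"product_doc": 3, "internal_adr": 2, "agent_note": 1}


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    stale: bool = False


def resolve(chunks):
    """Pick the highest-authority fresh chunk; return stale chunks as caveats."""
    fresh = [c for c in chunks if not c.stale]
    caveats = [c for c in chunks if c.stale]
    winner = max(fresh, key=lambda c: AUTHORITY[c.source], default=None)
    return winner, caveats


chunks = [
    Chunk("Invited users are billed per seat.", "agent_note"),
    Chunk("The team seat covers billing for invited users.", "product_doc"),
    Chunk("Billing is per workspace.", "internal_adr", stale=True),
]
winner, caveats = resolve(chunks)

# Golden-style assertions: the product doc wins, the stale chunk is a caveat.
assert winner.source == "product_doc"
assert [c.source for c in caveats] == ["internal_adr"]
```

Because the fixture is fixed and the resolution uses no model call, the assertions give the same result on every run.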
Continuity (Golden Harness)
deterministic
Given an interrupted task, does the resume bundle preserve scope, decisions, supersession and the safest next action?
Feeds known interrupted-work fixtures through the real graphContinuitySummary v2 builder and asserts resume-point scope, decision carry-over, supersession flagging (non-active decisions become do-not-follow), stale-work exclusion, safest-next-action gating, and determinism.
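The supersession and stale-work rules can be illustrated with a toy resume bundle. This sketch does not reproduce the graphContinuitySummary v2 builder; the field names (`status`, `stale`) and the bundle shape are hypothetical.

```python
def build_resume_bundle(decisions, work_items):
    """Toy resume bundle: split decisions by status and drop stale work.

    Hypothetical shapes: decisions carry 'id' and 'status',
    work items carry 'id' and 'stale'.
    """
    return {
        # Only active decisions are carried over as guidance.
        "follow": [d["id"] for d in decisions if d["status"] == "active"],
        # Non-active (e.g. superseded) decisions are flagged do-not-follow.
        "do_not_follow": [d["id"] for d in decisions if d["status"] != "active"],
        # Stale work is excluded from the resume scope.
        "open_work": [w["id"] for w in work_items if not w["stale"]],
    }


decisions = [
    {"id": "D1", "status": "superseded"},
    {"id": "D2", "status": "active"},
]
work_items = [
    {"id": "W1", "stale": True},
    {"id": "W2", "stale": False},
]

bundle = build_resume_bundle(decisions, work_items)
# Determinism check: the same fixture yields the same bundle.
assert bundle == build_resume_bundle(decisions, work_items)
print(bundle)
```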
Code Graph
deterministic
Are callers, imports, impact and symbol cards resolved correctly?
Deterministic structural queries over the persisted code graph. Live index: 1,285 code documents, 0 stale rows, symbol cards resolve with real risk scoring, owners, tests, docs and MCP-tool links.
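One common way to answer an impact query is a breadth-first walk over reverse dependency edges. The sketch below shows that general technique on a made-up graph; it is not Snipara's code-graph implementation, and the file names are placeholders.

```python
from collections import deque


def impact_set(reverse_deps, changed):
    """Return everything that transitively depends on `changed`, sorted.

    reverse_deps maps a node to the nodes that import or call it.
    """
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in reverse_deps.get(node, ()):
            # Skip nodes already visited and the changed node itself (cycles).
            if dependent not in seen and dependent != changed:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen)


# Hypothetical graph: billing.py is used by api.py and its test; api.py by routes.py.
reverse_deps = {
    "billing.py": ["api.py", "test_billing.py"],
    "api.py": ["routes.py"],
}
impacted = impact_set(reverse_deps, "billing.py")
print(impacted)  # ['api.py', 'routes.py', 'test_billing.py']
```

Returning a sorted list keeps the result stable across runs, so a structural case can compare it against an expected list directly.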
Token Efficiency
deterministic
How much input-token cost does focused context remove versus a broad-document baseline?
Snipara returns the relevant sections within a hard token budget instead of dumping whole documents. The compression holds while preserving enough context to answer the benchmark tasks.
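A hard token budget can be enforced with a simple greedy selection, sketched below. The selection strategy, the section names, the relevance scores and the token counts are all illustrative assumptions, not Snipara's ranking logic or a benchmark result.

```python
def select_sections(sections, budget):
    """Greedily pick the most relevant sections that fit within the token budget.

    sections: list of (name, relevance, tokens) tuples.
    """
    chosen, used = [], 0
    for name, _relevance, tokens in sorted(sections, key=lambda s: s[1], reverse=True):
        if used + tokens <= budget:
            chosen.append(name)
            used += tokens
    return chosen, used


def reduction(baseline_tokens, focused_tokens):
    """Fraction of input tokens removed relative to the broad-document baseline."""
    return 1 - focused_tokens / baseline_tokens


# Made-up numbers for illustration only.
sections = [
    ("billing-policy", 0.9, 400),
    ("changelog", 0.2, 1200),
    ("seat-model", 0.7, 500),
    ("faq", 0.4, 300),
]
baseline = sum(tokens for _, _, tokens in sections)  # whole documents: 2400 tokens
chosen, used = select_sections(sections, budget=1000)
print(chosen, used)               # ['billing-policy', 'seat-model'] 900
print(reduction(baseline, used))  # 0.625
```

The budget is never exceeded: a section that does not fit is skipped rather than truncated.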
Hallucination v3
model-graded
How often does grounded context still produce a direct contradiction?
Graded v3 scoring: only direct contradictions of source context count as hallucinations, not omissions or hedges. Lower is better.
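The counting rule can be made concrete as follows. This is a sketch assuming the judge emits one label per graded claim; the label names are hypothetical and the real v3 grader may aggregate differently.

```python
def hallucination_rate(labels):
    """Share of graded claims labelled as direct contradictions.

    Omissions and hedges are deliberately not counted, matching the rule above.
    """
    if not labels:
        return 0.0
    contradictions = sum(1 for label in labels if label == "contradiction")
    return contradictions / len(labels)


# Hypothetical judge output for four claims.
labels = ["supported", "omission", "hedge", "contradiction"]
rate = hallucination_rate(labels)
print(rate)  # 0.25
```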
Answer-Pack Grounding
model-graded
Does source-grounded answer-pack framing suppress unsupported claims?
With answer-pack grounding, forbidden/unsupported facts dropped from 4 to 0 and the quality-aware score rose sharply versus an ungrounded baseline. This is the strongest, most reproducible gain in the suite.
Answer Quality
model-graded
End-to-end answer score, with Snipara context versus without.
A modest but consistent uplift. We report it honestly: the retrieval and token-efficiency gains are larger than the raw answer-quality delta, which is bounded by the underlying model.
Two ways to check the work
On your own project
The commands below run against your project on hosted Snipara: no repo clone, nothing to build. Project Intelligence signals come straight from the tools your agent already uses.
# companion 1.3+ · hosted MCP
$ snipara-companion status
$ snipara-companion verify --file-path <file>
$ snipara-companion brief
# every snipara_context_query response
↳ returns retrieval_diagnostics:
confidence · quality scores · stale & weak-source flags
Code-impact and verification-plan signals are on Pro and above; context diagnostics are available on every plan.
How we run the suite
The deterministic harnesses live in the Snipara repository and run in CI and on self-hosted deployments. They need no model API and produce the same numbers every run.
# Snipara repo · apps/mcp-server
$ python -m benchmarks.golden_context_harness
$ python -m benchmarks.code_graph_benchmark --runs 3
# apps/web
$ pnpm test golden-continuity
Open for inspection — the harnesses and fixtures are the contract, not a screenshot.
What these numbers do — and don't — mean
- Deterministic first. We lead with harnesses that give the same result every run — correctness of source authority, stale handling, routing and code-graph resolution — because those are the claims we can stand behind without caveats.
- Retrieval precision is corpus-dependent. Precision@K and Recall@K vary with the source corpus and chunk freshness, and our retrieval-ranking harness was hardened on 2026-05-29. We tune it continuously and do not headline a single precision number, because it would not generalize to your repository.
- Model-graded results carry a date and a judge. Answer-quality and hallucination scores depend on the judge model and the run; we report the run date and treat them as directional, not absolute.
- Raw artifacts stay out of the headline. Per-run JSON and dashboards live in benchmarks/reports/; a result only becomes a published claim once it is summarized and dated.
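For readers unfamiliar with the retrieval metrics named above, the standard definitions of Precision@K and Recall@K are sketched here. The document IDs are made up, and this is not the retrieval-ranking harness itself.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


# Made-up ranking: "e" is relevant but never retrieved.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(recall_at_k(ranked, relevant, 4))     # 2 of 3 relevant items found
```

Both values change as soon as the corpus or the relevance labels change, which is why a single headline number would not transfer to another repository.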
See the workflow these evals protect.