Project Intelligence Evals

Correct answers aren't enough.

Every other eval asks whether the output is good. Snipara asks whether the agent understood the project — its decisions, history and blast radius — before it acted. And we publish what we measure, with deterministic harnesses you can re-run.

  • 7/7 golden context cases pass (deterministic · 100% score)
  • 6/6 continuity contract cases pass (deterministic)
  • 8/8 code-graph cases pass (deterministic)
  • 4.1× context compression (75.6% fewer input tokens)
  • 1.2% true hallucination rate (graded v3 · contradictions only)

What we measure

Five things no other eval scores

Accuracy, completion and tool usage are table stakes. These dimensions test whether the agent acted in line with the real state of the project.

Context Preservation Score

deterministic today

Does the agent keep the project's existing decisions instead of silently overwriting them?

Golden cases assert that hidden agent state never overrides product docs and that stale chunks are surfaced as caveats, not current truth.

Decision Consistency Score

deterministic today

Does the agent respect the project's canonical decisions and source authority?

Golden cases assert that product docs beat internal ADRs and agent notes, so a reviewed decision wins over an agent's guess.

Impact Awareness Score

deterministic today

Did the agent see everything the change actually touches?

The code graph resolves impact chains — callers, imports, config facts, routes and tests — with 8/8 deterministic structural cases.

Verification Coverage Score

deterministic today

Did the agent run the checks the change requires — and notice the missing ones?

snipara-companion verify builds a deterministic verification plan from code impact and flags missing checks the agent skipped.
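As a rough illustration of the idea, the sketch below derives a plan from a list of impacted files and reports which required checks have not been run. It is not snipara-companion's implementation: the function name, the `checks_for` mapping and the example check commands are all hypothetical. Sorting the required checks is what keeps the output identical from run to run.

```python
def verification_plan(
    impacted: list[str],
    checks_for: dict[str, list[str]],
    already_run: set[str],
) -> dict[str, list[str]]:
    """Build a deterministic plan: every check required by the impacted
    files, plus the subset that has not been run yet.

    All names here are hypothetical; this is a sketch of the idea only.
    """
    # A set removes duplicates; sorting makes the order stable across runs.
    required = sorted(
        {check for path in impacted for check in checks_for.get(path, [])}
    )
    missing = [check for check in required if check not in already_run]
    return {"required": required, "missing": missing}


if __name__ == "__main__":
    plan = verification_plan(
        impacted=["a.py", "b.py"],
        checks_for={
            "a.py": ["pytest tests/test_a.py"],
            "b.py": ["pytest tests/test_b.py", "mypy b.py"],
        },
        already_run={"pytest tests/test_a.py"},
    )
    print(plan["missing"])
```

An empty `missing` list would mean the agent ran everything the change requires.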

Continuity Score

deterministic today

Does the agent resume interrupted work at the right point, across sessions and tools?

A deterministic golden continuity harness (6/6 cases) asserts resume scope, decision carry-over, supersession flagging, stale-work exclusion and safest-next-action gating against the real resume-bundle builder.

Why it matters

Two agents pass the test. Only one understood the project.

Agent A

Snipara eval: wins

traditional eval

Code compiles · tests pass · 100/100

project intelligence eval

Respected the canonical decision: invited workspace users keep personal API keys and the team seat covers billing.

Agent B

Snipara eval: fails

traditional eval

Code compiles · tests pass · 100/100

project intelligence eval

Violated workspace billing policy — re-charged invited users per seat.

Illustrative — this is the dimension we score (decision consistency), not a benchmark result. A traditional eval rates both agents identically; only a project-aware eval separates them.

The evidence today

The harnesses backing those scores

The dimensions above are not slideware — each is exercised by a harness you can re-run. Here is what runs green today.

Golden Context Harness

deterministic

Does the right source win, and are stale or internal notes correctly demoted?

7/7 cases · 100% avg score

Controlled fixtures assert that product docs beat internal ADRs, stale chunks become caveats (not current truth), exact tool references beat agent notes, and structural queries route to the code graph.

benchmarks/golden_context_harness.py · last run 2026-05-30
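To make the source-authority rule concrete, here is a minimal sketch of how such a resolution could look. It is an assumption-laden illustration, not the harness code: the `Chunk` type, the `resolve` function and the numeric ranks are hypothetical, and the relative order of internal ADRs and agent notes is assumed (the page only states that product docs beat both).

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical authority ranks, higher wins. Only "product docs beat
# internal ADRs and agent notes" comes from the page; the rest is assumed.
AUTHORITY = {"product_doc": 3, "internal_adr": 2, "agent_note": 1}


@dataclass(frozen=True)
class Chunk:
    source_type: str  # one of the AUTHORITY keys
    text: str
    stale: bool = False


def resolve(chunks: list[Chunk]) -> tuple[Chunk | None, list[Chunk]]:
    """Return (winning chunk, caveats).

    Stale chunks never win; they are returned as caveats instead.
    Among fresh chunks, the highest-authority source wins.
    """
    fresh = [c for c in chunks if not c.stale]
    caveats = [c for c in chunks if c.stale]
    winner = max(fresh, key=lambda c: AUTHORITY[c.source_type], default=None)
    return winner, caveats


if __name__ == "__main__":
    winner, caveats = resolve(
        [
            Chunk("agent_note", "agent guess"),
            Chunk("product_doc", "reviewed decision"),
            Chunk("internal_adr", "old ADR", stale=True),
        ]
    )
    print(winner.text, [c.text for c in caveats])
```

A golden case in this style would assert both halves: the reviewed decision wins, and the stale chunk appears only in the caveats.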

Continuity (Golden Harness)

deterministic

Given an interrupted task, does the resume bundle preserve scope, decisions, supersession and the safest next action?

6/6 contract cases pass

Feeds known interrupted-work fixtures through the real graphContinuitySummary v2 builder and asserts resume-point scope, decision carry-over, supersession flagging (non-active decisions become do-not-follow), stale-work exclusion, safest-next-action gating, and determinism.

lib/services/__tests__/golden-continuity.test.ts · last run 2026-06-01
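The supersession rule can be sketched in a few lines. This is an illustration in Python, not the graphContinuitySummary v2 builder (which lives in the TypeScript codebase): the `Decision` type, the status vocabulary and the output keys are hypothetical. It shows only the one property the harness description names, that any non-active decision ends up in a do-not-follow list, plus a trivial determinism check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    id: str
    status: str  # hypothetical vocabulary, e.g. "active" or "superseded"


def partition_decisions(decisions: list[Decision]) -> dict[str, list[str]]:
    """Split decisions into those to carry over and those to flag.

    Anything that is not "active" is treated as do-not-follow.
    Input order is preserved, so the same input gives the same output.
    """
    follow = [d.id for d in decisions if d.status == "active"]
    do_not_follow = [d.id for d in decisions if d.status != "active"]
    return {"follow": follow, "do_not_follow": do_not_follow}


if __name__ == "__main__":
    fixture = [
        Decision("d1", "active"),
        Decision("d2", "superseded"),
        Decision("d3", "active"),
    ]
    first = partition_decisions(fixture)
    # Determinism: a second run over the same fixture must match exactly.
    assert partition_decisions(fixture) == first
    print(first)
```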

Code Graph

deterministic

Are callers, imports, impact and symbol cards resolved correctly?

8/8 deterministic cases · 100% indexed coverage

Deterministic structural queries over the persisted code graph. Live index: 1,285 code documents, 0 stale rows, symbol cards resolve with real risk scoring, owners, tests, docs and MCP-tool links.

benchmarks/code_graph_benchmark.py · last run 2026-05-29
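Resolving an impact chain amounts to walking the graph's reverse edges from the changed node. The sketch below shows that traversal as a plain breadth-first search. It assumes the graph is available as a dictionary of reverse edges; the function name and the file paths are hypothetical, and the persisted code graph itself is not shown.

```python
from collections import deque


def impact_set(reverse_edges: dict[str, list[str]], changed: str) -> list[str]:
    """Return every node that transitively depends on `changed`.

    `reverse_edges[x]` lists the nodes that call, import or test `x`.
    Results come back in breadth-first order, so direct dependents
    appear before indirect ones.
    """
    seen = {changed}
    order: list[str] = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in reverse_edges.get(node, []):
            if dependent not in seen:  # also guards against cycles
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


if __name__ == "__main__":
    # Hypothetical graph: who depends on each file.
    edges = {
        "billing/seats.py": ["api/invite.py", "tests/test_seats.py"],
        "api/invite.py": ["routes/workspace.py", "tests/test_invite.py"],
    }
    print(impact_set(edges, "billing/seats.py"))
```

A structural test case in this style fixes the graph and asserts the exact list, which is why such cases can be deterministic.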

Token Efficiency

deterministic

How much input-token cost does focused context remove versus a broad-document baseline?

4.1× compression · 75.6% reduction

Snipara returns the relevant sections within a hard token budget instead of dumping whole documents. The compression holds while preserving enough context to answer the benchmark tasks.

benchmarks/metrics/token_efficiency.py · last run 2026-05-29
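The two headline figures are the same measurement expressed two ways: a compression ratio r corresponds to a token reduction of 1 - 1/r. The short sketch below checks that the published 4.1× and 75.6% agree with each other. The function names are illustrative and are not taken from the benchmark module.

```python
def reduction_from_ratio(ratio: float) -> float:
    """Fraction of input tokens removed for a given compression ratio."""
    return 1.0 - 1.0 / ratio


def ratio_from_reduction(reduction: float) -> float:
    """Compression ratio implied by a given fractional token reduction."""
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # 1 - 1/4.1 is about 0.7561, which rounds to the published 75.6%.
    print(round(reduction_from_ratio(4.1) * 100, 1))
    # 1 / (1 - 0.756) is about 4.098, which rounds to the published 4.1x.
    print(round(ratio_from_reduction(0.756), 1))
```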

Hallucination v3

model-graded

How often does grounded context still produce a direct contradiction?

1.2% true contradiction rate (~60% lower than baseline)

Graded v3 scoring: only direct contradictions of source context count as hallucinations, not omissions or hedges. Lower is better.

benchmarks/metrics/hallucination.py · last run 2026-05-29
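The scoring rule is easy to state in code: only answers the judge labels as direct contradictions count toward the rate, while omissions and hedges do not. The sketch below assumes a hypothetical label vocabulary and function name; the actual judge prompt and labels used by the benchmark are not shown on this page, and the example data is made up.

```python
from collections import Counter


def true_hallucination_rate(labels: list[str]) -> float:
    """Share of graded answers labelled as a direct contradiction.

    Labels such as "omission" or "hedge" are deliberately not counted.
    The label names are hypothetical.
    """
    if not labels:
        return 0.0
    return Counter(labels)["contradiction"] / len(labels)


if __name__ == "__main__":
    # Made-up grading run: 10 answers, one direct contradiction.
    graded = ["supported"] * 7 + ["omission", "hedge", "contradiction"]
    print(true_hallucination_rate(graded))
```

Counting omissions and hedges as well would raise the rate, which is why the grading version matters when comparing numbers across runs.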

Answer-Pack Grounding

model-graded

Does source-grounded answer-pack framing suppress unsupported claims?

Forbidden-fact hits 4 → 0 · quality-aware 78.75 vs 26.25

With answer-pack grounding, forbidden/unsupported facts dropped from 4 to 0 and the quality-aware score rose sharply versus an ungrounded baseline. This is the strongest, most reproducible gain in the suite.

benchmarks/quality_score_answer_pack_benchmark.py · last run 2026-05-10

Answer Quality

model-graded

End-to-end answer score, with Snipara context versus without.

+0.8 points (6.8 vs 6.0)

A modest but consistent uplift. We report it honestly: the retrieval and token-efficiency gains are larger than the raw answer-quality delta, which is bounded by the underlying model.

benchmarks/metrics/answer_quality.py · last run 2026-05-29

Verify it

Two ways to check the work

On your own project

These run against your project on hosted Snipara — no repo clone, nothing to build. Project Intelligence signals come straight from the tools your agent already uses.

# companion 1.3+ · hosted MCP

$ snipara-companion status

$ snipara-companion verify --file-path <file>

$ snipara-companion brief

# every snipara_context_query response

↳ returns retrieval_diagnostics:

confidence · quality scores · stale & weak-source flags

Code-impact and verification-plan signals are on Pro and above; context diagnostics are available on every plan.

How we run the suite

The deterministic harnesses live in the Snipara repository and run in CI and on self-hosted deployments. They call no model API and produce the same numbers on every run.

# Snipara repo · apps/mcp-server

$ python -m benchmarks.golden_context_harness

$ python -m benchmarks.code_graph_benchmark --runs 3

# apps/web

$ pnpm test golden-continuity

Open for inspection — the harnesses and fixtures are the contract, not a screenshot.

Methodology & honesty

What these numbers do — and don't — mean

  • Deterministic first. We lead with harnesses that give the same result every run — correctness of source authority, stale handling, routing and code-graph resolution — because those are the claims we can stand behind without caveats.
  • Retrieval precision is corpus-dependent. Precision@K and Recall@K vary with the source corpus and chunk freshness, and our retrieval-ranking harness was hardened on 2026-05-29. We tune it continuously and do not headline a single precision number, because it would not generalize to your repository.
  • Model-graded results carry a date and a judge. Answer-quality and hallucination scores depend on the judge model and the run; we report the run date and treat them as directional, not absolute.
  • Raw artifacts stay out of the headline. Per-run JSON and dashboards live in benchmarks/reports/; a result only becomes a published claim once it is summarized and dated.
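For readers who want to measure retrieval on their own corpus, the standard definitions of Precision@K and Recall@K are sketched below. This is the textbook formulation, not the retrieval-ranking harness itself; the function names and the example document IDs are illustrative.

```python
def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]      # retrieval output, best first
    relevant = {"a", "c", "e"}         # ground truth for this query
    print(precision_at_k(ranked, relevant, 2))  # 1 hit in 2 results
    print(recall_at_k(ranked, relevant, 4))     # 2 of 3 relevant found
```

Both values depend on which documents are labelled relevant, which is why a single number does not transfer from one corpus to another.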

See the workflow these evals protect.