Benchmarking Project Intelligence for AI Coding Agents

We built this benchmark to answer a narrow question: when an AI coding agent works on a real repository, how much of its performance is limited by missing project continuity rather than by the model alone?

The test holds the tasks, repository, and scoring constant and varies only how the agent starts. The cold agent receives only the repository and task. The Snipara-assisted agent receives the same repository and task plus hosted Snipara retrieval from a dedicated benchmark project. The goal is not to rank models globally. The goal is to measure whether project intelligence changes execution on continuity-heavy repository work.

The short readout

On the local-model suite, Snipara moved GPT-OSS 20B from 0/60 cold passes to 60/60. On the Codex CLI suite, aggregate passes across the three tested models moved from 25/180 cold to 179/180 with Snipara. These are project-continuity results, not claims that a smaller model becomes globally equivalent to a premium model.

What the benchmark measures

The benchmark runs ten paired repetitions of each of six realistic coding scenarios, which gives 60 paired runs per model. Each run is scored by deterministic checks that combine hidden tests, static validity, maintainability, scenario review checks, and project-continuity checks.
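
As a rough illustration of where the /60 denominators in the tables come from, the run matrix can be sketched as follows. The scenario names are placeholders, since this post does not name the six scenarios, and the structure is an assumption for illustration rather than the harness itself.

```python
from itertools import product

# Hypothetical placeholders: the post does not name the six scenarios.
SCENARIOS = [f"scenario_{i}" for i in range(1, 7)]
REPETITIONS = range(1, 11)        # ten paired repetitions per scenario
CONDITIONS = ("cold", "snipara")  # the two starts being compared


def paired_runs():
    """Return one (scenario, repetition) tuple per paired run."""
    return list(product(SCENARIOS, REPETITIONS))


pairs = paired_runs()
runs_per_condition = len(pairs)                       # denominator of each pass column
total_runs_per_model = len(pairs) * len(CONDITIONS)   # cold run plus Snipara run per pair

print(runs_per_condition, total_runs_per_model)  # prints: 60 120
```

Each pair is executed once per condition, so a model contributes 60 cold runs and 60 Snipara runs.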

It sits next to Snipara's deterministic continuity harness. That lower-level harness verifies the resume contract itself: scope, decision carry-over, supersession, stale-work exclusion, safe next action, and deterministic output. The model benchmark then asks whether those continuity signals change real coding-agent execution.

Score	Meaning
Passes	How many scenario runs completed the expected code behavior and constraints.
Code quality	A 0-100 deterministic score for hidden tests, static validity, maintainability, and scenario-specific review criteria.
Continuity score	A 0-100 deterministic score for project understanding, decision selection, handoff use, and minimal-change discipline.

The two conditions

The cold baseline is intentionally strict. It receives the repository and the task description, but no project history, previous decisions, continuity artifacts, or Snipara retrieval. It can still inspect files and use normal tooling. It is not sabotaged; it is simply starting the way many coding-agent sessions start today.

The Snipara condition receives hosted retrieval first. That retrieval can surface prior decisions, workflow state, relevant files, and constraints that are otherwise hidden across docs, memory, and previous work. The model still has to write correct code and pass the same scoring criteria.
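
A minimal sketch of the difference between the two starts, under stated assumptions: `AgentInput`, `build_input`, and the `retrieve` callable are hypothetical names invented for this illustration and are not Snipara's API or the benchmark harness. The only point is that the repository and task are identical, and the assisted condition adds retrieved context before the agent begins.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AgentInput:
    """What the agent sees at the start of a run (hypothetical shape)."""
    repo_path: str
    task: str
    continuity_context: list = field(default_factory=list)


def build_input(
    repo_path: str,
    task: str,
    retrieve: Optional[Callable[[str], list]] = None,
) -> AgentInput:
    """Build the starting input; `retrieve` stands in for hosted retrieval."""
    # Cold start: no retrieval callable, so no continuity context is attached.
    context = retrieve(task) if retrieve is not None else []
    return AgentInput(repo_path=repo_path, task=task, continuity_context=context)


# Hypothetical retriever returning prior decisions relevant to the task.
def fake_retrieve(task: str) -> list:
    return [f"prior decision relevant to: {task}"]


cold = build_input("repo/", "fix the export bug")
assisted = build_input("repo/", "fix the export bug", retrieve=fake_retrieve)
```

In both cases the agent is then scored against the same criteria; only `continuity_context` differs.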

Local model results

These runs use local models through the local benchmark runtime. The public comparison is cold baseline versus hosted Snipara retrieval. An oracle continuity pack exists as a ceiling control, but it is not presented as the normal user workflow.

Model	Cold passes	Snipara passes	Code quality (Snipara, 0-100)	Continuity (Snipara, 0-100)
GPT-OSS 20B	0/60	60/60	97.9	97.5
Qwen3-Coder 30B	0/60	57/60	96.3	95.4
Devstral Small 2 24B	0/60	53/60	93.8	94.8

Codex CLI results

We then ran the same paired protocol through Codex CLI models. We report this as a separate block from the local-model results because the runtime, model serving layer, and token reporting differ. Codex CLI did not expose provider token counts in this harness, so token usage is reported as not available.

Model	Cold passes	Snipara passes	Code quality (Snipara, 0-100)	Continuity (Snipara, 0-100)
GPT-5.3 Codex Spark	23/60	59/60	98.1	98.1
GPT-5.4	2/60	60/60	99.8	99.8
GPT-5.5	0/60	60/60	99.3	98.3
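
The aggregate Codex CLI figures quoted in the short readout can be reproduced from the table rows above. This is a small arithmetic check, not part of the benchmark harness; the dictionary layout is an assumption made for the example.

```python
# Pass counts copied from the Codex CLI table: (cold passes, Snipara passes, runs).
CODEX_CLI = {
    "GPT-5.3 Codex Spark": (23, 59, 60),
    "GPT-5.4": (2, 60, 60),
    "GPT-5.5": (0, 60, 60),
}


def aggregate(results):
    """Sum cold passes, Snipara passes, and run counts across all models."""
    cold = sum(c for c, _, _ in results.values())
    snipara = sum(s for _, s, _ in results.values())
    runs = sum(n for _, _, n in results.values())
    return cold, snipara, runs


cold, snipara, runs = aggregate(CODEX_CLI)
print(f"cold {cold}/{runs}, snipara {snipara}/{runs}")
# prints: cold 25/180, snipara 179/180
```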

Why some results look strange

GPT-5.3 Codex Spark had a stronger cold baseline (23/60) than GPT-5.4 (2/60) and GPT-5.5 (0/60) on this specific suite. In the Snipara condition, GPT-5.4 and GPT-5.5 both passed 60/60, but GPT-5.4 scored slightly higher on code quality (99.8 versus 99.3) and continuity (99.8 versus 98.3). We report those observations because hiding them would make the benchmark less credible.

The right interpretation is limited: this protocol is sensitive to runtime behavior, serialization, task framing, and continuity use. It does not prove that Spark is globally better than newer models, or that GPT-5.4 is globally better than GPT-5.5. The product signal is that hosted retrieval improved every tested model under the same paired protocol.

What this says about local intelligence

On this benchmark, GPT-OSS 20B with Snipara is close to GPT-5.5 with Snipara: 60/60 passes versus 60/60, code quality 97.9 versus 99.3, and continuity 97.5 versus 98.3. That is the strongest practical finding.
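
To make "close" concrete, the gaps between the two Snipara-condition rows can be computed directly from the published scores. The data structure and the `gap` helper are assumptions for this illustration only.

```python
# Snipara-condition scores copied from the two results tables.
SNIPARA_SCORES = {
    "GPT-OSS 20B": {"passes": 60, "code_quality": 97.9, "continuity": 97.5},
    "GPT-5.5": {"passes": 60, "code_quality": 99.3, "continuity": 98.3},
}


def gap(metric, local="GPT-OSS 20B", premium="GPT-5.5"):
    """Premium score minus local score, rounded to one decimal place."""
    diff = SNIPARA_SCORES[premium][metric] - SNIPARA_SCORES[local][metric]
    return round(diff, 1)


for metric in ("passes", "code_quality", "continuity"):
    print(metric, gap(metric))
```

On these numbers the pass gap is 0, the code quality gap is 1.4 points, and the continuity gap is 0.8 points, all on this suite only.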

The clean claim is not that a local 20B model becomes GPT-5.5. The clean claim is that when the bottleneck is missing project context, a smaller local model can become competitive on real repository tasks once the project intelligence layer provides the right continuity package.

What we do not claim

We do not claim these scores rank models globally.
We do not claim Snipara improves raw reasoning ability.
We do not claim every code task is continuity-bound.
We do not claim the oracle continuity pack is a normal user workflow.
We have not yet included Anthropic results in this public table.

What comes next

The next useful step is to add Claude and other premium-model runs under the same paired protocol, then publish the scenario prompts and scoring criteria in a public benchmark package. Until then, the website keeps the claim bounded: these numbers are evidence that Snipara improves project-continuity execution, not proof of universal model superiority.
