Controlled replay, not a synthetic benchmark

Proof that the first edit starts from the right project memory.

We replay real Snipara engineering work twice: once from a cold repository start, and once with Snipara's Project Intelligence brief, impact chain, workflow state, and verification plan supplied before the first edit.

Review the protocol

Project Intelligence docs

replay frame

# same task, same repo base

$ run baseline_agent

measure: searches, files, missed impact

$ run snipara_assisted_agent

measure: same signals, same answer key

publish raw artifacts, not just summary claims

The purpose is narrow: show whether shared Project Intelligence changes the agent's starting point before implementation begins.

Protocol

A controlled replay with an answer key

The final implementation of a real Snipara task becomes the answer key. The replay measures how much work an agent does before it reaches that surface.

Same base

Both runs start from the commit immediately before work on the selected Snipara task began.

Same brief

The task request, model family, shell access, and time budget stay fixed.

Different start

The baseline agent starts cold. The assisted agent receives Snipara Project Intelligence first: decisions, impact, workflow state, and verification evidence.

Same answer key

Both runs are scored against the final merged workflow evidence, not against a marketing claim.
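The four protocol rules above can be read as a configuration in which exactly one field varies. The following is a minimal sketch, not the actual harness: the `ReplayRun` class, its field names, and the model and time-budget values are hypothetical placeholders, and only the base commit and task title come from this page.

```python
from dataclasses import dataclass, fields, replace
from typing import List, Optional


@dataclass(frozen=True)
class ReplayRun:
    base_commit: str
    task_brief: str
    model_family: str             # placeholder value below
    shell_access: bool
    time_budget_minutes: int      # hypothetical unit and value
    start_context: Optional[str]  # None means a cold start


baseline = ReplayRun(
    base_commit="499d63a3",
    task_brief="Complete Safe Parallel Coding MVP5/MVP6",
    model_family="example-model",
    shell_access=True,
    time_budget_minutes=60,
    start_context=None,
)

# The assisted run copies every fixed field and changes only the start context.
assisted = replace(baseline, start_context="project-intelligence-package")


def differing_fields(a: ReplayRun, b: ReplayRun) -> List[str]:
    """Names of the fields whose values differ between two runs."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


print(differing_fields(baseline, assisted))  # ['start_context']
```

A check like `differing_fields` is one way to assert, before a replay starts, that the comparison has not drifted on anything except the starting context.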

The task

Safe Parallel Coding is wide enough to expose the real failure mode.

A narrow UI copy change would be too easy. This replay uses work that touched code, docs, package surfaces, agent tools, and verification gates.

Task

Complete Safe Parallel Coding MVP5/MVP6

Why this task

It crosses repository code, hosted MCP contracts, package surfaces, docs, checks, and deploy notes.

Base commit

499d63a3, before the MVP5/MVP6 consolidation started

Final commit

9785471a, after the public Safe Parallel Coding surface shipped

Answer key

14 commits, 76 changed files, 5,475 insertions, 270 deletions, and 31 test/support files.
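Counts of this kind can be derived from `git diff --numstat <base> <final>`, which prints one tab-separated line per changed file (added lines, deleted lines, path). The sketch below assumes that output format and uses a small made-up sample rather than the real diff; `summarize_numstat` is a hypothetical helper, not part of any Snipara tool.

```python
from typing import Dict


def summarize_numstat(numstat_text: str) -> Dict[str, int]:
    """Total files, insertions, and deletions from `git diff --numstat` output."""
    files = insertions = deletions = 0
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        added, removed, _path = parts
        files += 1
        # Binary files report "-" for both counts, so they add no line totals.
        if added != "-":
            insertions += int(added)
        if removed != "-":
            deletions += int(removed)
    return {"files": files, "insertions": insertions, "deletions": deletions}


# Made-up sample output, for illustration only.
sample = "12\t3\tsrc/a.py\n-\t-\tassets/logo.png\n40\t0\tdocs/guide.md\n"
print(summarize_numstat(sample))
# {'files': 3, 'insertions': 52, 'deletions': 3}
```

Running the same function over the output of `git diff --numstat 499d63a3 9785471a` should reproduce the file, insertion, and deletion totals quoted in the answer key, assuming the published figures were computed the same way.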

Claim under test

Shared Project Intelligence reduces rediscovery and improves planning before the agent writes code.

Replay results

The difference was visible before implementation

The cold run could find relevant files, but the signal was buried in broad repository search results. The Snipara-assisted run started from the project-owned surfaces and verification gates.

76

files in the answer key

Git diff from 499d63a3 to 9785471a

1,722

unique local search hits

16 cold-start searches against the base commit

5 / 76

actual files opened cold

Top-five-files-per-search replay rule

7 / 7

surface categories surfaced

Project Intelligence brief, workflow plan, and impact context

| Metric | Without Snipara | With Snipara | Readout |
| --- | --- | --- | --- |
| Searches before plan | 16 local searches | 3 Snipara artifacts | Project Intelligence, managed workflow plan, and code impact replaced broad repo rediscovery. |
| Files surfaced before plan | 47 opened, 5 were final changed files | 5 anchor files plus phase-level surfaces | The baseline found signal, but with much more noise and fewer implementation anchors. |
| Surface coverage | 3 of 7 categories in first opened files | 7 of 7 categories in start context | Scored against web API, data/service, dashboard UX, CLI, hosted MCP mirrors, docs, and release/deploy config. |
| Verification quality | Likely local tests after file discovery | Risk, impacted routes, config facts, release/deploy checks | The code-impact artifact classified the guard surface as critical risk with routes and config evidence. |
| Time to reliable plan | Not claimed | Not claimed | This replay did not preserve comparable raw model timestamps, so the page does not invent a duration. |
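The cold-run file figures can be restated as two ratios: how much of what the agent opened mattered, and how much of what mattered the agent opened. The sketch below derives both from the counts reported above (47 opened, 5 in the answer key, 76 answer-key files). The percentages are simple arithmetic on those counts, not separately published results, and `discovery_ratios` is a hypothetical helper.

```python
from typing import Tuple


def discovery_ratios(opened: int, hits: int, answer_key_size: int) -> Tuple[float, float]:
    """Return (share of opened files that were in the answer key,
    share of the answer key that was opened)."""
    return hits / opened, hits / answer_key_size


precision, recall = discovery_ratios(opened=47, hits=5, answer_key_size=76)
print(f"{precision:.1%} of opened files were in the answer key")  # 10.6%
print(f"{recall:.1%} of the answer key was opened")               # 6.6%
```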

| Answer-key surface | Files | Cold | Snipara |
| --- | --- | --- | --- |
| Web API routes | 10 files | Partially | Yes |
| Web data and safety services | 3 files | | Yes |
| Dashboard UX | 2 files | | Yes |
| Companion CLI | 9 files | | Yes |
| Hosted MCP and Python mirrors | 34 files | | Yes |
| Docs and AI-readable discovery | 10 files | | Yes |
| Release, deploy, and config | 5 files | | Yes |
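Surface coverage of this kind can be scored by mapping each file path to one of the seven categories and counting how many categories appear. The sketch below shows one possible approach. Every path prefix in it is a hypothetical assumption, since the real repository layout is not shown on this page, and `categories_covered` is an illustrative helper, not the scoring code used for the replay.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical prefixes, one entry per answer-key surface category.
CATEGORY_PREFIXES: Dict[str, Tuple[str, ...]] = {
    "web_api": ("apps/web/app/api/",),
    "data_service": ("apps/web/lib/",),
    "dashboard_ux": ("apps/web/components/",),
    "cli": ("packages/companion/",),
    "hosted_mcp_mirrors": ("apps/mcp/", "packages/python/"),
    "docs": ("docs/",),
    "release_deploy_config": (".github/", "deploy/"),
}


def categories_covered(paths: Iterable[str]) -> Set[str]:
    """Categories that have at least one path under one of their prefixes."""
    covered = set()
    for path in paths:
        for category, prefixes in CATEGORY_PREFIXES.items():
            if path.startswith(prefixes):  # str.startswith accepts a tuple
                covered.add(category)
    return covered


# Made-up example paths.
opened = ["apps/web/app/api/guard/route.ts", "docs/safe-parallel-coding.md", "README.md"]
covered = categories_covered(opened)
print(f"{len(covered)} of {len(CATEGORY_PREFIXES)} categories")  # 2 of 7 categories
```

Applying the same function to the files opened cold and to the files named in the assisted start package gives a like-for-like coverage comparison, provided the prefix map is fixed before either run is scored.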

Local model benchmark

The same proof shape now runs against local coding models

The replay above measures discovery and planning on a real Snipara task. The local-model benchmark measures the next layer: whether retrieved Project Intelligence changes code quality and task success under realistic handoffs. Together, the two benchmarks measure both project understanding and code execution.

Most credible cold signal

23/60

GPT-5.3 Codex Spark solved the most continuity-dependent scenarios cold. That imperfect baseline is the useful point: strong models can solve some tasks without Snipara, but still miss project-history-dependent work.

What 60/60 means

continuity suite

These are not generic coding tasks. They are scenarios where the correct answer depends on prior decisions, handoffs, active work, and project-specific constraints.

Negative control

not published yet

The page should include a control class where cold agents already succeed and Snipara should show little or no lift. Until that run exists, this page does not claim broad coding uplift.

| Model | Cold continuity-dependent scenarios passed | Snipara continuity-dependent scenarios passed | Code quality | Continuity score |
| --- | --- | --- | --- | --- |
| GPT-OSS 20B | 0/60 | 60/60 | 97.9 | 97.5 |
| Qwen3-Coder 30B | 0/60 | 57/60 | 96.3 | 95.4 |
| Devstral Small 2 24B | 0/60 | 53/60 | 93.8 | 94.8 |

What was measured

Ten paired repetitions, six coding scenarios, three conditions: cold baseline, hosted Snipara retrieval, and an oracle continuity pack. Scoring combines hidden tests, static validity, maintainability, scenario review checks, and local project-intelligence checks.

Cold baseline means repository plus task only: no project history, previous decisions, or continuity artifacts.

Code quality is a deterministic 0-100 score from hidden tests, static validity, maintainability, and scenario review checks.

Continuity score is a deterministic 0-100 score for project understanding, decision selection, handoff use, and minimal-change discipline.

Readout: in continuity-dependent scenarios, Snipara-hosted retrieval moved all three local models from 0/60 baseline passes to 60/60, 57/60, and 53/60 respectively.
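The denominator of 60 follows from ten paired repetitions of six scenarios per model and condition. The sketch below shows that counting under stated assumptions: the result grid is made up (scenario 0 always fails) purely to exercise the function, and it does not describe which runs actually passed or failed in the benchmark.

```python
from itertools import product
from typing import Dict, Tuple

REPETITIONS = 10
SCENARIOS = 6


def count_passes(results: Dict[Tuple[int, int], bool]) -> Tuple[int, int]:
    """Return (passed runs, total runs) for one model under one condition."""
    return sum(results.values()), len(results)


# Made-up outcome grid keyed by (repetition, scenario); not real benchmark data.
grid = product(range(REPETITIONS), range(SCENARIOS))
results = {cell: cell[1] != 0 for cell in grid}

passed, total = count_passes(results)
print(f"{passed}/{total}")  # 50/60
```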

Read the benchmark protocol

Limits kept in the claim

The benchmark covers six controlled coding scenarios, not every software task.

The public comparison is baseline versus hosted Snipara retrieval.

The continuity-local condition (the oracle continuity pack) is a ceiling control, not a normal user workflow.

The result supports project-continuity claims, not broad claims that a model is globally smarter.

Codex cloud benchmark

The same proof shape also runs through Codex CLI

This table uses the same continuity-heavy coding scenarios as the local-model benchmark, but swaps the runtime from LM Studio to Codex CLI. It is reported separately so local-model claims and Codex cloud claims stay clean.

| Model | Cold continuity-dependent scenarios passed | Snipara continuity-dependent scenarios passed | Code quality | Continuity score |
| --- | --- | --- | --- | --- |
| GPT-5.3 Codex Spark | 23/60 | 59/60 | 98.1 | |
| GPT-5.4 | 2/60 | 60/60 | 99.8 | |
| GPT-5.5 | 0/60 | 60/60 | 99.3 | 98.3 |

Readout

Across three Codex CLI models and ten paired repetitions, Snipara-hosted retrieval moved aggregate continuity-dependent passes from 25/180 to 179/180. The quality and continuity scores are the same deterministic 0-100 scoring dimensions used in the local-model table.
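The aggregate figures in this readout are sums over the three per-model rows. As a quick check, the following sketch recomputes them from the pass counts reported in the table; the variable names are illustrative.

```python
RUNS_PER_MODEL = 60  # ten paired repetitions of six scenarios

# (cold passes, Snipara passes) per model, as reported in the Codex CLI table.
codex_rows = {
    "GPT-5.3 Codex Spark": (23, 59),
    "GPT-5.4": (2, 60),
    "GPT-5.5": (0, 60),
}

cold_total = sum(cold for cold, _ in codex_rows.values())
snipara_total = sum(assisted for _, assisted in codex_rows.values())
run_total = RUNS_PER_MODEL * len(codex_rows)

print(f"cold: {cold_total}/{run_total}")        # cold: 25/180
print(f"snipara: {snipara_total}/{run_total}")  # snipara: 179/180
```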

GPT-5.3 Codex Spark had the strongest cold baseline on this suite. GPT-5.4 scored slightly above GPT-5.5 in the Snipara condition. Those are benchmark observations, not a global model-ranking claim.

Read the benchmark protocol

Limits kept in the claim

Codex cloud runs use Codex CLI, while the local-model benchmark uses LM Studio.

Token usage is n/a because the Codex CLI result surface used by the harness did not expose provider token counts.

GPT-5.3 Codex Spark's stronger cold baseline is reported as measured, not interpreted as a global model ranking.

Claude benchmark

The same proof shape also runs through Claude

This table uses the same continuity-heavy coding scenarios as the local-model and Codex CLI benchmarks, but reports Anthropic Claude runs separately so provider and runtime claims stay clean.

| Model | Cold continuity-dependent scenarios passed | Snipara continuity-dependent scenarios passed | Code quality | Continuity score |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4.6 | 0/60 | 60/60 | 98.4 | 98.4 |
| Claude Opus 4.8 | 7/60 | 60/60 | 98.3 | 98.3 |

Readout

Across Claude Sonnet and Claude Opus, Snipara-hosted retrieval moved aggregate continuity-dependent passes from 7/120 to 120/120. Code quality reached 98.4 for Sonnet and 98.3 for Opus; continuity reached 98.4 and 98.3 respectively.

Opus had a stronger cold baseline than Sonnet on this suite, but the main signal is consistent: both premium Claude models reached full pass coverage with hosted project continuity.

Read the benchmark protocol

Limits kept in the claim

Claude runs use the Anthropic hosted runtime and are reported separately from local and Codex CLI runs.

Claude Opus solved one cold scenario repeatedly, so its baseline is 7/60 rather than 0/60.

The result supports continuity-heavy repository execution claims, not a global Sonnet-versus-Opus ranking.

The same deterministic 0-100 code quality and continuity scoring dimensions are used.

Two runs

Only the starting context changes

The baseline is not sabotaged. It can use normal developer tooling. The assisted run receives Snipara context first, then still has to verify it against the repository.

Run A: cold agent

cold start

Input

Original engineering brief only

Allowed context

Git, local files, shell search, test runner

Agent must discover

Architecture, active decisions, changed surfaces, package mirrors, docs, checks

Measured replay

16 searches returned 1,722 unique files; opening the first five results per search (47 unique files) surfaced 5 of the 76 final changed files.
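The top-five-files-per-search rule can be expressed as a small deterministic procedure: take the first five results of each search, deduplicate, and intersect with the answer key. The sketch below is one possible reading of that rule, with made-up file names; `replay_opened_files` and `answer_key_hits` are hypothetical helpers, not the published replay code.

```python
from typing import Iterable, List, Set


def replay_opened_files(search_results: Iterable[List[str]], per_search: int = 5) -> List[str]:
    """Unique files 'opened' when only the first `per_search` hits of each search are read."""
    opened: List[str] = []
    seen: Set[str] = set()
    for hits in search_results:
        for path in hits[:per_search]:
            if path not in seen:
                seen.add(path)
                opened.append(path)
    return opened


def answer_key_hits(opened: Iterable[str], answer_key: Set[str]) -> List[str]:
    """Opened files that also appear in the answer key."""
    return [path for path in opened if path in answer_key]


# Made-up search results: "f" ranks sixth in the first search, so it is never opened.
searches = [["a", "b", "c", "d", "e", "f"], ["b", "g"]]
opened = replay_opened_files(searches)
print(opened)                               # ['a', 'b', 'c', 'd', 'e', 'g']
print(answer_key_hits(opened, {"f", "g"}))  # ['g']
```

Because results overlap across searches, the number of unique files opened can be well below five times the number of searches.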

Run B: Snipara-assisted agent

with Snipara

Input

Same brief plus Snipara start-work context

Preloaded context

A Project Intelligence brief, the managed workflow plan, prior phase handoff context, and a code-impact response.

Before local reads

Snipara named the guard route, data query, safety service, companion command, docs guide, four phases, and release/deploy gates.

Measured replay

All seven final surface categories were present in the assisted start package before the first edit.

Engineering Lead Plan

This is how shared reality becomes bounded work

The Lead Plan is the proof boundary between Project Brain and execution. It lists recommended workers and evidence, but it stays fail-closed: no worker is launched, approved, or completed without explicit handoff and receipts.

lead-plan snapshot

Advisory contract, not autonomous execution

Lead posture

lead_watch

Routing mode

prepare explicit handoff

Workers spawned

0

Fallback

main_agent

Approval boundary

human approval before delegated execution

Glossary

lead_watch

Companion is watching and packaging the work, not launching a worker.

main_agent fallback

If no safe worker contract is approved, the current agent keeps the task.

Workers spawned: 0

The plan proves routing readiness only; execution remains behind approval and receipts.
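The fail-closed posture described in this glossary can be sketched as a routing function whose default is always the current agent. The example below is a minimal illustration under assumptions: `WorkerContract`, its fields, and `route` are hypothetical names, not the Companion API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerContract:
    role: str
    required_proof: str
    handoff_explicit: bool = False  # an explicit handoff has been prepared
    human_approved: bool = False    # a human approved delegated execution


def route(contract: Optional[WorkerContract]) -> str:
    """Return who keeps the task. Anything short of an explicit, approved
    contract falls back to the main agent (fail closed)."""
    if contract is not None and contract.handoff_explicit and contract.human_approved:
        return contract.role
    return "main_agent"


docs = WorkerContract(
    role="documentation worker",
    required_proof="String probes, docs route render, and discovery tests pass.",
)
print(route(None))  # main_agent
print(route(docs))  # main_agent (recommended, but not approved)

docs.handoff_explicit = True
docs.human_approved = True
print(route(docs))  # documentation worker
```

The design choice worth noting is that the permissive branch has to be earned: a missing contract, a missing handoff, or a missing approval all produce the same safe outcome.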

documentation worker

Update public docs and AI-readable discovery surfaces.

Required proof: String probes, docs route render, and discovery tests pass.

review worker

Check that multi-agent copy stays fail-closed.

Required proof: No autonomous worker launch, approval, or completion is claimed.

human approver

Hold release until proof and neighboring hardening are reviewed together.

Required proof: No production deploy until the combined release gate is explicit.

Proof gates

Scope gate

Only public proof, docs, metadata, and AI discovery files change.

Verification gate

Run focused discovery tests, web type-check, lint, and production build.

Release gate

Push and deploy only after the coordinated release decision.

Receipts and outcomes

Handoff receipt

Worker scope, touched files, caveats, and next action are captured instead of left in chat.

Proof receipt

Checks run, failures, skipped gates, and evidence links travel with the work package.

Outcome link

Accepted or rejected work updates the Project Brain so the next plan starts from evidence.

Candidate Project Brain updates

Persist the approved positioning decision: Companion is the engineering lead over the Project Brain.

Mark proof gates verified only after focused checks and release review pass.

Record worker outcome reliability once receipts and accepted outcomes exist.

Candidate updates are not durable truth until a handoff, proof receipt, and outcome link exist.
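The durability rule above amounts to a three-part conjunction: a candidate update counts only when a handoff receipt, a proof receipt, and an outcome link all exist. A minimal sketch, assuming a hypothetical `CandidateUpdate` record with made-up receipt identifiers:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CandidateUpdate:
    statement: str
    handoff_receipt: Optional[str] = None
    proof_receipt: Optional[str] = None
    outcome_link: Optional[str] = None

    def is_durable(self) -> bool:
        """True only when all three pieces of evidence are present."""
        return all((self.handoff_receipt, self.proof_receipt, self.outcome_link))


update = CandidateUpdate("Mark proof gates verified")
print(update.is_durable())  # False

# Placeholder identifiers, for illustration only.
update.handoff_receipt = "handoff-001"
update.proof_receipt = "proof-001"
print(update.is_durable())  # False, the outcome link is still missing

update.outcome_link = "outcome-001"
print(update.is_durable())  # True
```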

Measurement

The scorecard is observable

Each signal comes from traces or the final answer key. If a number cannot be measured from artifacts, it does not belong on the proof page.

| Signal | Source | Scoring rule |
| --- | --- | --- |
| Research commands | Shell transcript | Count repo searches, Git inspection, and local file discovery before the first plan. |
| Files opened | Tool trace | Count unique files read before the plan names the implementation surface. |
| Impacted surfaces found | Plan plus final diff | Compare named routes, services, packages, docs, and tests against the answer key. |
| Tests proposed | Agent plan | Compare proposed checks with the final verification list and missing gates. |
| Time to reliable plan | Trace timestamps | Measure time until the plan includes impact, risk, verification, and next action. |
| Pre-commit errors caught | Commands and corrections | Count issues found before commit, including wrong assumptions, failed checks, and stale context. |
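Three of these signals (research commands, files opened, and time to plan) can be computed from an ordered event trace. The sketch below is a simplification under stated assumptions: the event schema and the `kind` labels are hypothetical, a single `plan` event stands in for the plan cutoff used by all three rules, and the timestamps are made up. It is not the harness used for the replay, which did not preserve comparable timestamps.

```python
from datetime import datetime
from typing import Dict, List, Optional, Union

RESEARCH_KINDS = {"search", "git", "list"}  # hypothetical event labels


def score_trace(events: List[Dict[str, str]]) -> Dict[str, Optional[Union[int, float]]]:
    """Count research commands and unique files read before the first plan event."""
    research = 0
    files = set()
    for event in events:
        if event["kind"] == "plan":
            start = datetime.fromisoformat(events[0]["at"])
            end = datetime.fromisoformat(event["at"])
            return {
                "research_commands": research,
                "files_opened": len(files),
                "seconds_to_plan": (end - start).total_seconds(),
            }
        if event["kind"] in RESEARCH_KINDS:
            research += 1
        elif event["kind"] == "read":
            files.add(event["path"])
    # No plan was produced, so no duration is reported.
    return {"research_commands": research, "files_opened": len(files), "seconds_to_plan": None}


# Made-up trace; events after the plan are ignored.
trace = [
    {"kind": "search", "at": "2025-01-01T10:00:00"},
    {"kind": "git", "at": "2025-01-01T10:00:30"},
    {"kind": "read", "at": "2025-01-01T10:01:00", "path": "src/a.py"},
    {"kind": "read", "at": "2025-01-01T10:02:00", "path": "src/a.py"},
    {"kind": "plan", "at": "2025-01-01T10:05:00"},
    {"kind": "search", "at": "2025-01-01T10:06:00"},
]
print(score_trace(trace))
# {'research_commands': 2, 'files_opened': 1, 'seconds_to_plan': 300.0}
```

Returning `None` when no plan event exists mirrors the page's rule that a number which cannot be measured from artifacts is not reported.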

Answer key

The final workflow defines what the agent needed to find.

We score both agents against the surfaces that actually mattered in the completed Snipara work. That keeps the comparison grounded in repository evidence.

expected surfaces

Derived from final workflow evidence

Web API routes for sessions, leases, guard checks, and GitHub checks

Data queries and collaboration safety services

Dashboard live collaboration UX

Companion CLI commands, hooks, and local-stack behavior

Hosted MCP tool contracts and Python mirrors

Safe Parallel Coding public docs and AI-readable discovery surfaces

Package, CI, release, deploy, and config surfaces

Artifacts

The proof is the replay package

The final page should let a technical buyer inspect the method, the context that Snipara supplied, and the raw evidence behind the comparison.

Start Work Brief

The exact continuity context supplied before the assisted agent opens files.

Impact Chain

The routes, services, packages, docs, tests, and workflow surfaces Snipara expects to matter.

Verification Plan

The checks and missing gates the agent should use to prove the change is safe.

PR Answer Pack

The review-facing evidence bundle, produced once the repository changes are visible.

Raw Agent Trace

Searches, file reads, commands, plan revisions, failed checks, and corrections.

Limits

What the replay proves, what it does not prove, and which claims remain unmeasured.

Interpretation

This does not prove that Snipara writes better code.

It tests a narrower and more defensible claim: Snipara gives an AI coding agent a better project starting point before it writes code. The model still reasons, edits, runs tests, and owns the final implementation quality.

This is not a synthetic benchmark and not a broad scientific claim.

It measures discovery and planning evidence, not final code quality by itself.

No time-to-plan number is claimed because comparable raw model timestamps were not preserved.

The assisted run still has to verify Snipara context against Git, local files, and tests.

Measure rediscovery cost

Compare time to reliable plan

Publish raw trace timestamps

Keep unsupported claims out
