Controlled replay, not a synthetic benchmark

What changes when an AI coding agent starts with organizational memory?

We replay real Snipara engineering work twice: once from a cold repository start, and once with Snipara's Start Work Brief, impact chain, memory, and verification plan supplied before the first edit.

replay frame

# same task, same repo base

$ run baseline_agent

measure: searches, files, missed impact

$ run snipara_assisted_agent

measure: same signals, same answer key

publish raw artifacts, not just summary claims

The purpose is narrow: show whether organizational memory and operational continuity change the agent's starting point before implementation begins.

Protocol

A controlled replay with an answer key

The final implementation of a real Snipara task becomes the answer key. The replay measures how much discovery work an agent does before its plan reaches the surfaces in that answer key.

01

Same base

Both runs start from the commit immediately before work on the selected Snipara task began.

02

Same brief

The task request, model family, shell access, and time budget stay fixed.

03

Different start

The baseline agent starts cold. The assisted agent receives Snipara organizational memory and operational continuity first.

04

Same answer key

Both runs are scored against the final merged workflow evidence, not against a marketing claim.
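The four protocol rules above can be sketched as a small configuration check. This is a minimal illustration, not Snipara's harness: the `ReplayRun` type, its field names, and the model and time-budget values are all assumptions made for the example. Only the run names, the base commit, and the task title come from this page.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ReplayRun:
    """One replay run. Every field name here is a hypothetical choice."""
    name: str
    base_commit: str
    brief: str
    model_family: str
    shell_access: bool
    time_budget_minutes: int
    preloaded_context: Tuple[str, ...] = ()


def controlled_fields_match(a: ReplayRun, b: ReplayRun) -> bool:
    """True when two runs differ only in name and preloaded context."""
    def neutral(run: ReplayRun) -> ReplayRun:
        return replace(run, name="", preloaded_context=())
    return neutral(a) == neutral(b)


baseline = ReplayRun(
    name="baseline_agent",
    base_commit="499d63a3",
    brief="Complete Safe Parallel Coding MVP5/MVP6",
    model_family="example-model",   # placeholder, not stated on this page
    shell_access=True,
    time_budget_minutes=60,         # placeholder, not stated on this page
)

# The assisted run copies every controlled field and adds start context.
assisted = replace(
    baseline,
    name="snipara_assisted_agent",
    preloaded_context=(
        "start_work_brief",
        "impact_chain",
        "memory",
        "verification_plan",
    ),
)

print(controlled_fields_match(baseline, assisted))
```

Deriving the assisted run from the baseline with `replace` makes it hard to drift a controlled variable by accident, and the check fails as soon as the base commit, brief, model family, or budget differ.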

The task

Safe Parallel Coding is wide enough to expose the real failure mode.

A narrow UI copy change would be too easy. This replay uses work that touched code, docs, package surfaces, agent tools, and verification gates.

Task

Complete Safe Parallel Coding MVP5/MVP6

Why this task

It crosses repository code, hosted MCP contracts, package surfaces, docs, checks, and deploy notes.

Base commit

499d63a3, before the MVP5/MVP6 consolidation started

Final commit

9785471a, after the public Safe Parallel Coding surface shipped

Answer key

14 commits, 76 changed files, 5,475 insertions, 270 deletions, and 31 test/support files.

Claim under test

Organizational memory reduces rediscovery before the agent writes code.
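Answer-key totals of this kind can be recomputed from Git. The sketch below assumes a local clone and uses `git diff --numstat` and `git rev-list --count`. The function names are illustrative, and this is not necessarily how the figures on this page were produced.

```python
import subprocess
from typing import Dict


def parse_numstat(numstat_text: str) -> Dict[str, int]:
    """Sum files, insertions, and deletions from `git diff --numstat` output."""
    files = insertions = deletions = 0
    for line in numstat_text.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        added, removed, _path = parts
        files += 1
        # Binary files report "-" for both counts, so they add zero lines.
        insertions += int(added) if added.isdigit() else 0
        deletions += int(removed) if removed.isdigit() else 0
    return {"files": files, "insertions": insertions, "deletions": deletions}


def answer_key_stats(repo_dir: str, base: str, final: str) -> Dict[str, int]:
    """Collect diff totals and commit count between two commits."""
    diff = subprocess.run(
        ["git", "-C", repo_dir, "diff", "--numstat", base, final],
        check=True, capture_output=True, text=True,
    ).stdout
    commits = subprocess.run(
        ["git", "-C", repo_dir, "rev-list", "--count", f"{base}..{final}"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    stats = parse_numstat(diff)
    stats["commits"] = int(commits)
    return stats
```

Against a clone of the repository, `answer_key_stats(".", "499d63a3", "9785471a")` would be the call to compare with the published totals.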

Replay results

The difference was visible before implementation

The cold run could find relevant files, but the signal was buried in broad repository search results. The Snipara-assisted run started from the project-owned surfaces and verification gates.

76

files in the answer key

Git diff from 499d63a3 to 9785471a

1,722

unique local search hits

16 cold-start searches against the base commit

5 / 76

answer-key files opened by the cold run

Top-five-files-per-search replay rule

7 / 7

surface categories surfaced

Start Work memory, workflow plan, and impact context

Searches before plan

Cold run: 16 local searches

Assisted run: 3 Snipara artifacts

Recall/start-work memory, managed workflow plan, and code impact replaced broad repo rediscovery.

Files surfaced before plan

Cold run: 47 opened, 5 were final changed files

Assisted run: 5 anchor files plus phase-level surfaces

The baseline found signal, but with much more noise and fewer implementation anchors.

Surface coverage

Cold run: 3 of 7 categories in first opened files

Assisted run: 7 of 7 categories in start context

Scored against web API, data/service, dashboard UX, CLI, hosted MCP mirrors, docs, and release/deploy config.

Verification quality

Cold run: Likely local tests after file discovery

Assisted run: Risk, impacted routes, config facts, release/deploy checks

The code-impact artifact classified the guard surface as critical risk with routes and config evidence.

Time to reliable plan

Cold run: Not claimed

Assisted run: Not claimed

This replay did not preserve comparable raw model timestamps, so the page does not invent a duration.

Surface category | Files in answer key | Found by cold run | In assisted start context
Web API routes | 10 files | Partially | Yes
Web data and safety services | 3 files | No | Yes
Dashboard UX | 2 files | No | Yes
Companion CLI | 9 files | No | Yes
Hosted MCP and Python mirrors | 34 files | Yes | Yes
Docs and AI-readable discovery | 10 files | Yes | Yes
Release, deploy, and config | 5 files | No | Yes
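One way to score surface coverage for the seven categories above is to map file paths to categories and count how many categories have at least one match. The category names come from this page. The path prefixes are hypothetical placeholders, since the repository layout is not shown here.

```python
from typing import Dict, Iterable, Tuple

# Category names are from the page; every path prefix is an assumption.
SURFACE_PREFIXES: Dict[str, Tuple[str, ...]] = {
    "Web API routes": ("apps/web/api/",),
    "Web data and safety services": ("apps/web/lib/",),
    "Dashboard UX": ("apps/web/dashboard/",),
    "Companion CLI": ("packages/cli/",),
    "Hosted MCP and Python mirrors": ("apps/mcp/", "packages/python/"),
    "Docs and AI-readable discovery": ("docs/",),
    "Release, deploy, and config": (".github/", "deploy/"),
}


def surface_coverage(paths: Iterable[str]) -> Dict[str, bool]:
    """Mark each category as covered if any path falls under its prefixes."""
    covered = {name: False for name in SURFACE_PREFIXES}
    for path in paths:
        for name, prefixes in SURFACE_PREFIXES.items():
            if path.startswith(prefixes):
                covered[name] = True
    return covered


def coverage_score(paths: Iterable[str]) -> Tuple[int, int]:
    """Return (categories covered, total categories)."""
    covered = surface_coverage(paths)
    return sum(covered.values()), len(covered)


print(coverage_score(["apps/web/api/guard.ts", "docs/guide.md", "README.md"]))
```

This sketch treats any single match as full coverage, so it cannot express the "Partially" value shown for Web API routes. A per-category ratio of files found to files expected would be needed for that.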
Two runs

Only the starting context changes

The baseline is not sabotaged. It can use normal developer tooling. The assisted run receives Snipara context first, then still has to verify it against the repository.

Run A: cold agent

cold start
Input

Original engineering brief only

Allowed context

Git, local files, shell search, test runner

Agent must discover

Architecture, active decisions, changed surfaces, package mirrors, docs, checks

Measured replay

16 searches returned 1,722 unique files; opening the first five results per search surfaced 5 of the 76 final changed files.
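The top-five-files-per-search rule behind these numbers can be expressed as a short function. This is a sketch under assumptions: each search is represented as a ranked list of file paths, and the function and key names are illustrative rather than taken from the replay tooling.

```python
from typing import Dict, Iterable, List, Set


def replay_opened_files(search_results: Iterable[List[str]],
                        per_search: int = 5) -> Set[str]:
    """Open only the first `per_search` ranked files of each search."""
    opened: Set[str] = set()
    for ranked in search_results:
        opened.update(ranked[:per_search])
    return opened


def cold_start_summary(search_results: Iterable[List[str]],
                       answer_key: Iterable[str],
                       per_search: int = 5) -> Dict[str, int]:
    """Summarize a cold run against the set of final changed files."""
    results = [list(ranked) for ranked in search_results]
    key = set(answer_key)
    unique_hits = set().union(*results)
    opened = replay_opened_files(results, per_search)
    return {
        "searches": len(results),
        "unique_hits": len(unique_hits),
        "opened": len(opened),
        "answer_key_opened": len(opened & key),
        "answer_key_total": len(key),
    }


# Tiny synthetic example, not replay data.
searches = [["a", "b", "c", "d", "e", "f"], ["a", "g"]]
print(cold_start_summary(searches, answer_key={"a", "f", "z"}))
```

In the synthetic example, file `f` is a hit but ranks sixth in its search, so the rule never opens it. That is the same effect the replay reports at scale: many hits, few answer-key files actually opened.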

Run B: Snipara-assisted agent

with Snipara
Input

Same brief plus Snipara start-work context

Preloaded context

A Start Work memory, the managed workflow plan, prior phase handoff context, and a code-impact response.

Before local reads

Snipara named the guard route, data query, safety service, companion command, docs guide, four phases, and release/deploy gates.

Measured replay

All seven final surface categories were present in the assisted start package before the first edit.

Measurement

The scorecard is observable

Each signal comes from traces or the final answer key. If a number cannot be measured from artifacts, it does not belong on the proof page.

Research commands
Shell transcript

Count repo searches, Git inspection, and local file discovery before the first plan.

Files opened
Tool trace

Count unique files read before the plan names the implementation surface.

Impacted surfaces found
Plan plus final diff

Compare named routes, services, packages, docs, and tests against the answer key.

Tests proposed
Agent plan

Compare proposed checks with the final verification list and missing gates.

Time to reliable plan
Trace timestamps

Measure time until the plan includes impact, risk, verification, and next action.

Pre-commit errors caught
Commands and corrections

Count issues found before commit, including wrong assumptions, failed checks, and stale context.
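Several of these signals can be counted directly from a trace. The sketch below assumes a hypothetical trace format: a list of event dictionaries with a `kind` field (`search`, `read`, or `plan`), an optional `path`, and an optional timestamp `t` in seconds. None of these names come from Snipara's actual trace schema.

```python
from typing import Any, Dict, List


def pre_plan_scorecard(events: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count research activity that happened before the first plan event."""
    plan_index = next(
        (i for i, e in enumerate(events) if e["kind"] == "plan"),
        len(events),  # no plan yet: score the whole trace
    )
    before = events[:plan_index]
    searches = sum(1 for e in before if e["kind"] == "search")
    files = {e["path"] for e in before if e["kind"] == "read"}

    # Report a duration only when both timestamps were actually recorded.
    time_to_plan = None
    if events and plan_index < len(events):
        start = events[0].get("t")
        end = events[plan_index].get("t")
        if start is not None and end is not None:
            time_to_plan = end - start

    return {
        "research_commands": searches,
        "files_opened": len(files),
        "time_to_plan": time_to_plan,
    }


trace = [
    {"kind": "search", "t": 0.0},
    {"kind": "read", "path": "a.py", "t": 1.0},
    {"kind": "read", "path": "a.py", "t": 2.0},
    {"kind": "plan", "t": 5.0},
    {"kind": "read", "path": "b.py", "t": 6.0},
]
print(pre_plan_scorecard(trace))
```

Returning `None` for `time_to_plan` when timestamps are missing mirrors the rule stated above: a number that cannot be measured from artifacts is left out instead of estimated.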

Answer key

The final workflow defines what the agent needed to find.

We score both agents against the surfaces that actually mattered in the completed Snipara work. That keeps the comparison grounded in repository evidence.

expected surfaces

Derived from final workflow evidence

Web API routes for sessions, leases, guard checks, and GitHub checks
Data queries and collaboration safety services
Dashboard live collaboration UX
Companion CLI commands, hooks, and local-stack behavior
Hosted MCP tool contracts and Python mirrors
Safe Parallel Coding public docs and AI-readable discovery surfaces
Package, CI, release, deploy, and config surfaces
Artifacts

The proof is the replay package

The final page should let a technical buyer inspect the method, the context that Snipara supplied, and the raw evidence behind the comparison.

Start Work Brief

The exact continuity context supplied before the assisted agent opens files.

Impact Chain

The routes, services, packages, docs, tests, and workflow surfaces Snipara expects to matter.

Verification Plan

The checks and missing gates the agent should use to prove the change is safe.

PR Answer Pack

The review-facing evidence bundle, produced once the repository changes are visible.

Raw Agent Trace

Searches, file reads, commands, plan revisions, failed checks, and corrections.

Limits

What the replay proves, what it does not prove, and which claims remain unmeasured.

Interpretation

This does not prove that Snipara writes better code.

It tests a narrower and more defensible claim: Snipara gives an AI coding agent a better project starting point before it writes code. The model still reasons, edits, runs tests, and owns the final implementation quality.

This is not a synthetic benchmark and not a broad scientific claim.
It measures discovery and planning evidence, not final code quality by itself.
No time-to-plan number is claimed because comparable raw model timestamps were not preserved.
The assisted run still has to verify Snipara context against Git, local files, and tests.
Measure rediscovery cost
Compare time to reliable plan
Publish raw trace timestamps
Keep unsupported claims out