§ /benchmarks — 10 tasks · 3 conditions · apr 2026

Ten tasks.
Three conditions.

Real production episodes from cloud-infrastructure engineering — incident response, schema migration, architecture review. Same evaluator model, same ground-truth labels across all three conditions.

tasks: 10 · conditions: A, B, C · seeds: in /bench

§ 01 — RESULTS

Three conditions,
head to head.

C (Borg-compiled) achieves 10/10 task success versus 8/10 for top-10 vector RAG (B) and 0/10 for no memory (A).

Condition A

No memory

Baseline — every session starts from zero.

Task success: 0/10

Retrieval precision: 6.0%

Stale-fact rate: 0.0%

Condition B

Top-10 vector RAG

Standard similarity search over conversation chunks.

Task success: 8/10

Retrieval precision: 81.0%

Stale-fact rate: 11.5%

Condition C

Borg · compiled

Full pipeline: classify, retrieve, rank on 4 dimensions, compile.

Task success: 10/10

Retrieval precision: 91.3%

Stale-fact rate: 2.5%

§ 02 — METHODOLOGY

How each condition
actually runs.

All three conditions answer the same 10 tasks with the same model, scored by the same grader, against the same ground-truth labels.

Condition A

No memory

The model answers each task with only the task prompt. No prior-session context. Baseline.

Condition B

Top-10 vector RAG

Conversation logs are chunked and embedded with the same embedding model as condition C. The 10 most similar chunks are then prepended verbatim to each task prompt.
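
As a rough illustration of condition B, the retrieval step can be sketched as a cosine-similarity ranking followed by prompt assembly. This is a minimal sketch under stated assumptions, not the benchmark's actual code: `cosine`, `build_prompt`, and the `(text, embedding)` chunk layout are hypothetical names, and the embeddings are assumed to be precomputed.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_prompt(task_prompt, query_vec, chunks, k=10):
    """Prepend the k most similar chunks, verbatim, to the task prompt.

    chunks: list of (text, embedding) pairs (hypothetical layout).
    """
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    top_texts = [text for text, _ in ranked[:k]]
    return "\n\n".join(top_texts + [task_prompt])
```

Nothing in this sketch filters for freshness or relevance beyond similarity, which is the property the stale-fact and irrelevant-rate metrics are meant to expose.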

Condition C

Borg · compiled

Full pipeline: classify intent, retrieve across facts / episodes / graph, rank on 4 dimensions, compile a token-budgeted package. The compiled package, not the raw matches, is what gets injected into the prompt.
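
The final two stages of condition C, ranking and token-budgeted compilation, might look something like the sketch below. Everything here is an assumption for illustration: the page does not name the four ranking dimensions, so they appear as an anonymous 4-tuple with equal weights, and `Candidate` and `compile_package` are hypothetical names. The packing strategy shown is a simple greedy fill and may differ from what Borg actually does.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    dims: tuple   # four ranking scores in [0, 1]; the real dimensions are not named in the source
    tokens: int   # token cost of including this item

def compile_package(candidates, budget, weights=(0.25, 0.25, 0.25, 0.25)):
    """Greedily pack the highest-scoring candidates that fit the token budget."""
    def score(c):
        # Weighted sum over the four dimensions (equal weights are an assumption).
        return sum(w * d for w, d in zip(weights, c.dims))

    package, used = [], 0
    for c in sorted(candidates, key=score, reverse=True):
        if used + c.tokens <= budget:
            package.append(c)
            used += c.tokens
    return package, used
```

A greedy fill skips an item that does not fit and keeps trying lower-ranked ones, so the budget is never exceeded even when a high-scoring item is large.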

Metrics we measure

Metric	What it means
Task success	Binary — does the final answer cover every element the grader labeled as required?
Retrieval precision	Of the items injected into context, what share were actually used in the final answer?
Stale-fact rate	Fraction of facts in compiled context that have been superseded or deprecated at the time of the query.
Irrelevant rate	Share of injected context that the grader marks as off-topic for the task.
Knowledge coverage	Fraction of ground-truth facts that appeared in the compiled context package.
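
To make the definitions above concrete, here is a minimal per-task scoring sketch. It assumes the grader supplies each label as a collection of item IDs; `score_task` and all of its parameter names are hypothetical, and treating every injected item as a "fact" for the stale-fact rate is a simplification.

```python
def ratio(numer, denom):
    """Safe division: an empty denominator yields 0.0."""
    return numer / denom if denom else 0.0

def score_task(injected, used, stale, off_topic, ground_truth, required, answered):
    """Compute the five metrics for one task from sets of item IDs."""
    injected = set(injected)
    ground_truth = set(ground_truth)
    return {
        # Binary: every required element appears in the final answer.
        "task_success": set(required) <= set(answered),
        # Share of injected items actually used in the final answer.
        "retrieval_precision": ratio(len(injected & set(used)), len(injected)),
        # Share of injected items that were superseded or deprecated at query time.
        "stale_fact_rate": ratio(len(injected & set(stale)), len(injected)),
        # Share of injected items the grader marked off-topic.
        "irrelevant_rate": ratio(len(injected & set(off_topic)), len(injected)),
        # Share of ground-truth facts present in the injected context.
        "knowledge_coverage": ratio(len(injected & ground_truth), len(ground_truth)),
    }
```

Note that the first three rates share a denominator (injected items) while knowledge coverage is normalized by the ground-truth set, so a small, precise package can score high on precision and low on coverage.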

§ 03 — TASKS

The ten prompts,
and what memory they exercise.

Drawn from real engineering workloads, generalized for public benchmarks. Full seeds and prompts live in bench/tasks.json in the repo.

#	Task	Type	Expected memory
1	Diagnose API gateway 502 errors	debug	Prior auth failures, gateway error patterns, retry behavior
2	Event bus remediation architecture	architecture	Event bus topology, remediation decisions, integration patterns
3	Event-sync scoping decision	compliance	Scoping episodes, approval decisions, change boundaries
4	Agent query service purpose and architecture	writing	Agent query design, service purpose, architectural context
5	Trace message bus unauthorized errors	debug	Auth error episodes, token acquisition patterns, message bus facts
6	MCP gateway authentication architecture	architecture	Auth system entities, token flow facts, integration boundaries
7	Event bus token acquisition change rationale	debug	Token acquisition episodes, change rationale, prior decisions
8	Platform app CI security audit fix	compliance	CI audit episodes, security findings, remediation decisions
9	MCP gateway debugging pattern	debug	Debug procedure patterns, error resolution history, replay steps
10	Bulk event bus remediation architecture	architecture	Bulk remediation design, event bus entities, deployment facts

§ 04 — HONEST CAVEATS

What this bench
does not prove.

Every benchmark has edges. Here are ours, stated plainly so you can weigh them.

Self-reported results from a single evaluator model on a single domain (cloud-infrastructure / platform engineering). Not externally reproduced.
10 tasks is a small N. Confidence intervals are wide; treat absolute numbers as directional, not definitive.
Same LLM family used for task answering and grading — risks circularity. An independent judge model would strengthen the claim.
Full per-task inputs, outputs, and grader reasoning live in bench/results/report.md on GitHub for audit.
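
To put a number on the small-N caveat, one option is a Wilson score interval on the task-success counts. The sketch below is our illustration, not part of the published bench; `wilson_interval` is a hypothetical helper and `z=1.96` assumes a 95% interval.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

for label, wins in [("A", 0), ("B", 8), ("C", 10)]:
    low, high = wilson_interval(wins, 10)
    print(f"Condition {label}: {wins}/10, 95% interval {low:.2f} to {high:.2f}")
```

With 10 tasks, 8/10 yields an interval of roughly 0.49 to 0.94 and 10/10 roughly 0.72 to 1.00. The two overlap substantially, which is why the B versus C gap should be read as directional.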
