§ /benchmarks — 10 tasks · 3 conditions · apr 2026

Ten tasks.
Three conditions.

Real production episodes from cloud-infrastructure engineering — incident response, schema migration, architecture review. Same evaluator model, same ground-truth labels across all three conditions.

tasks · 10
conditions · A · B · C
seeds · in /bench
§ 01 — RESULTS

Three conditions,
head to head.

C (Borg-compiled) achieves 10/10 task success versus 8/10 for top-10 vector RAG (B) and 0/10 for no memory (A).

Condition A

No memory

Baseline — every session starts from zero.

Task success: 0/10
Retrieval precision: 6.0%
Stale-fact rate: 0.0%
Condition B

Top-10 vector RAG

Standard similarity search over conversation chunks.

Task success: 8/10
Retrieval precision: 81.0%
Stale-fact rate: 11.5%
Condition C

Borg · compiled

Full pipeline: classify, retrieve, rank on 4 dims, compile.

Task success: 10/10
Retrieval precision: 91.3%
Stale-fact rate: 2.5%
§ 02 — METHODOLOGY

How each condition
actually runs.

All three conditions answer the same 10 tasks with the same model, scored by the same grader, against the same ground-truth labels.

Condition A

No memory

The model answers each task with only the task prompt. No prior-session context. Baseline.

Condition B

Top-10 vector RAG

Conversation logs are chunked, embedded with the same model as C, and the top-10 similar chunks are prepended verbatim to every task prompt.
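
A minimal sketch of what Condition B describes: rank pre-embedded chunks by similarity to the query and prepend the top 10 verbatim. The similarity measure (cosine), the function names, and the chunk representation are assumptions for illustration; the page does not say which embedding model or similarity metric the bench uses, and the embedding step itself is left out.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],  # (chunk text, chunk embedding)
    k: int = 10,
) -> List[str]:
    """Return the text of the k chunks most similar to the query."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(task_prompt: str, retrieved: List[str]) -> str:
    """Prepend retrieved chunks verbatim; fall back to the bare prompt."""
    if not retrieved:
        return task_prompt
    return "\n\n".join(retrieved) + "\n\n" + task_prompt
```

Nothing in this sketch checks whether a chunk is still current, so a superseded fact that is similar to the query is ranked like any other chunk.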

Condition C

Borg · compiled

Full pipeline: classify intent, retrieve across facts / episodes / graph, rank on 4 dimensions, compile a token-budgeted package. The compiled package, not the raw matches, is what gets injected into the prompt.
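
The last two steps of that pipeline, ranking and token-budgeted compilation, might look roughly like the sketch below. It is illustrative only and not Borg's implementation. The page does not name the four ranking dimensions, so `scores` holds four unnamed placeholder values. The equal-weight mean, the whitespace token estimate, and the names `Candidate`, `rank`, `estimate_tokens`, and `compile_package` are all assumptions. Classification and retrieval are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    text: str
    # Four per-item scores in [0, 1]. The dimensions are not named in the
    # source, so these are unnamed placeholders (hypothetical).
    scores: Tuple[float, float, float, float]


def rank(cands: List[Candidate]) -> List[Candidate]:
    """Order candidates by the mean of their four scores (assumed weighting)."""
    return sorted(cands, key=lambda c: sum(c.scores) / 4, reverse=True)


def estimate_tokens(text: str) -> int:
    """Crude whitespace word count; a real tokenizer would give other numbers."""
    return len(text.split())


def compile_package(cands: List[Candidate], token_budget: int) -> str:
    """Greedily pack the highest-ranked items that still fit the budget."""
    picked: List[str] = []
    used = 0
    for cand in rank(cands):
        cost = estimate_tokens(cand.text)
        if used + cost > token_budget:
            continue  # too large for the remaining budget; try smaller items
        picked.append(cand.text)
        used += cost
    return "\n".join(picked)
```

Unlike a fixed top-10 cut, a budgeted packer of this kind bounds the size of the injected context rather than the number of items.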

Metrics we measure

Metric | What it means
Task success | Binary: does the final answer cover every element the grader labeled as required?
Retrieval precision | Of the items injected into context, what share were actually used in the final answer?
Stale-fact rate | Fraction of facts in compiled context that have been superseded or deprecated at the time of the query.
Irrelevant rate | Share of injected context that the grader marks as off-topic for the task.
Knowledge coverage | Fraction of ground-truth facts that appeared in the compiled context package.
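
Read as set arithmetic over item identifiers, the five definitions above could be computed as in the sketch below. The function names, the use of string IDs, and the convention of returning 0.0 for an empty denominator are assumptions, not the bench's actual scoring code.

```python
from typing import Set


def _share(part: int, whole: int) -> float:
    """Ratio with an assumed convention: an empty denominator yields 0.0."""
    return part / whole if whole else 0.0


def task_success(required: Set[str], covered: Set[str]) -> bool:
    """True only if the answer covers every required element."""
    return required <= covered


def retrieval_precision(injected: Set[str], used: Set[str]) -> float:
    """Share of injected items that the final answer actually used."""
    return _share(len(injected & used), len(injected))


def stale_fact_rate(injected: Set[str], superseded: Set[str]) -> float:
    """Share of injected facts already superseded or deprecated."""
    return _share(len(injected & superseded), len(injected))


def irrelevant_rate(injected: Set[str], off_topic: Set[str]) -> float:
    """Share of injected items the grader marked off-topic."""
    return _share(len(injected & off_topic), len(injected))


def knowledge_coverage(ground_truth: Set[str], injected: Set[str]) -> float:
    """Share of ground-truth facts present in the injected context."""
    return _share(len(ground_truth & injected), len(ground_truth))
```

Precision, stale-fact rate, and irrelevant rate share the injected set as denominator, while coverage is measured against the ground-truth set, so a small context package can score high on precision and low on coverage.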
§ 03 — TASKS

The ten prompts,
and what memory they exercise.

Drawn from real engineering workloads, generalized for public benchmarks. Full seeds and prompts live in bench/tasks.json in the repo.

# | Task | Type | Expected memory
1 | Diagnose API gateway 502 errors | debug | Prior auth failures, gateway error patterns, retry behavior
2 | Event bus remediation architecture | architecture | Event bus topology, remediation decisions, integration patterns
3 | Event-sync scoping decision | compliance | Scoping episodes, approval decisions, change boundaries
4 | Agent query service purpose and architecture | writing | Agent query design, service purpose, architectural context
5 | Trace message bus unauthorized errors | debug | Auth error episodes, token acquisition patterns, message bus facts
6 | MCP gateway authentication architecture | architecture | Auth system entities, token flow facts, integration boundaries
7 | Event bus token acquisition change rationale | debug | Token acquisition episodes, change rationale, prior decisions
8 | Platform app CI security audit fix | compliance | CI audit episodes, security findings, remediation decisions
9 | MCP gateway debugging pattern | debug | Debug procedure patterns, error resolution history, replay steps
10 | Bulk event bus remediation architecture | architecture | Bulk remediation design, event bus entities, deployment facts
§ 04 — HONEST CAVEATS

What this bench
does not prove.

Every benchmark has edges. Here are ours, stated plainly so you can weigh them.

  • Self-reported results from a single evaluator model on a single domain (cloud-infrastructure / platform engineering). Not externally reproduced.
  • 10 tasks is a small N. Confidence intervals are wide; treat absolute numbers as directional, not definitive.
  • Same LLM family used for task answering and grading — risks circularity. An independent judge model would strengthen the claim.
  • Full per-task inputs, outputs, and grader reasoning live in bench/results/report.md on GitHub for audit.
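
To put a number on the small-N caveat, one option is a Wilson score interval on each task-success rate. This is a standard textbook interval chosen here for illustration; the page does not say how, or whether, the bench computes intervals, and the function name is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    for label, wins in (("A", 0), ("B", 8), ("C", 10)):
        low, high = wilson_interval(wins, 10)
        print(f"Condition {label}: {wins}/10, interval {low:.2f} to {high:.2f}")
```

Under these assumptions, 8/10 gives roughly 0.49 to 0.94 and 10/10 gives roughly 0.72 to 1.00. The B and C intervals overlap substantially, which is why the absolute gap between them should be read as directional.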