Ten tasks.
Three conditions.
Real production episodes from cloud-infrastructure engineering — incident response, schema migration, architecture review. Same evaluator model, same ground-truth labels across all three conditions.
Three conditions, head to head.
Condition C (Borg-compiled) achieves 10/10 task success, versus 8/10 for top-10 vector RAG (B) and 0/10 for no memory (A).
No memory
Baseline — every session starts from zero.
Top-10 vector RAG
Standard similarity search over conversation chunks.
Borg · compiled
Full pipeline: classify, retrieve, rank on 4 dims, compile.
How each condition actually runs.
All three conditions answer the same 10 tasks with the same model, scored by the same grader, against the same ground-truth labels.
No memory
The model answers each task with only the task prompt. No prior-session context. Baseline.
Top-10 vector RAG
Conversation logs are chunked and embedded with the same embedding model used in condition C, and the 10 most similar chunks are prepended verbatim to every task prompt.
Borg · compiled
Full pipeline: classify intent, retrieve across facts / episodes / graph, rank on 4 dimensions, compile a token-budgeted package. What gets injected is the compiled package, not the raw retrieval matches.
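The difference in how context reaches the prompt in conditions B and C can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the function names, the whitespace-based token estimate, and the single precomputed `score` per item are all assumptions. The source does not name the four ranking dimensions, so the score is treated here as an opaque number.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_verbatim(
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    k: int = 10,
) -> List[str]:
    """Condition B style: return the k most similar chunk texts, unchanged."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def pack_within_budget(
    scored_items: List[Tuple[str, float]], token_budget: int
) -> List[str]:
    """Condition C style (hypothetical): take items in descending score order,
    skipping any item that would push the package over the token budget."""
    picked: List[str] = []
    used = 0
    for text, _score in sorted(scored_items, key=lambda s: s[1], reverse=True):
        cost = len(text.split())  # crude token estimate; an assumption
        if used + cost <= token_budget:
            picked.append(text)
            used += cost
    return picked


def build_prompt(task_prompt: str, context_items: List[str]) -> str:
    """Prepend context to the task prompt; condition A passes an empty list."""
    if not context_items:
        return task_prompt
    return "\n\n".join(context_items) + "\n\n" + task_prompt
```

In this sketch condition A calls `build_prompt` with no context, condition B passes the output of `top_k_verbatim`, and condition C passes a budgeted package. A greedy pass is only one way to respect a budget; the real compile step may work differently.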
Metrics we measure
| Metric | What it means |
|---|---|
| Task success | Binary — does the final answer cover every element the grader labeled as required? |
| Retrieval precision | Of the items injected into context, what share were actually used in the final answer? |
| Stale-fact rate | Fraction of facts in compiled context that have been superseded or deprecated at the time of the query. |
| Irrelevant rate | Share of injected context that the grader marks as off-topic for the task. |
| Knowledge coverage | Fraction of ground-truth facts that appeared in the compiled context package. |
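As a rough illustration of how the five metrics in the table relate to each other, the sketch below computes them for one task from sets of item identifiers. The `TaskRecord` fields and the `score_task` function are hypothetical names, not the benchmark's schema, and the sketch assumes that every injected item counts as a fact and that grader judgments are already available as sets.

```python
from dataclasses import dataclass
from typing import Dict, Set, Union


@dataclass
class TaskRecord:
    """Per-task grader output, expressed as sets of identifiers (assumed shape)."""
    required_elements: Set[str]   # elements the grader labeled as required
    covered_elements: Set[str]    # elements the final answer actually covers
    injected_items: Set[str]      # items placed into the context
    used_items: Set[str]          # injected items the answer drew on
    stale_items: Set[str]         # items superseded or deprecated at query time
    irrelevant_items: Set[str]    # items the grader marked off-topic
    ground_truth_facts: Set[str]  # facts the task is expected to need


def _share(part: Set[str], whole: Set[str]) -> float:
    """Fraction of `whole` that is in `part`; 0.0 when `whole` is empty."""
    return len(part & whole) / len(whole) if whole else 0.0


def score_task(rec: TaskRecord) -> Dict[str, Union[bool, float]]:
    injected = rec.injected_items
    return {
        # Binary: every required element must be covered.
        "task_success": rec.required_elements <= rec.covered_elements,
        "retrieval_precision": _share(rec.used_items, injected),
        "stale_fact_rate": _share(rec.stale_items, injected),
        "irrelevant_rate": _share(rec.irrelevant_items, injected),
        # Denominator is the ground truth, not the injected context.
        "knowledge_coverage": _share(injected, rec.ground_truth_facts),
    }
```

Note that the first four ratios share the injected context as their denominator, while knowledge coverage is measured against the ground-truth facts, so a condition with no injected context (A) scores 0.0 on all of them under this convention.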
The ten prompts, and what memory they exercise.
Drawn from real engineering workloads, generalized for public benchmarks. Full seeds and prompts live in bench/tasks.json in the repo.
| # | Task | Type | Expected memory |
|---|---|---|---|
| 1 | Diagnose API gateway 502 errors | debug | Prior auth failures, gateway error patterns, retry behavior |
| 2 | Event bus remediation architecture | architecture | Event bus topology, remediation decisions, integration patterns |
| 3 | Event-sync scoping decision | compliance | Scoping episodes, approval decisions, change boundaries |
| 4 | Agent query service purpose and architecture | writing | Agent query design, service purpose, architectural context |
| 5 | Trace message bus unauthorized errors | debug | Auth error episodes, token acquisition patterns, message bus facts |
| 6 | MCP gateway authentication architecture | architecture | Auth system entities, token flow facts, integration boundaries |
| 7 | Event bus token acquisition change rationale | debug | Token acquisition episodes, change rationale, prior decisions |
| 8 | Platform app CI security audit fix | compliance | CI audit episodes, security findings, remediation decisions |
| 9 | MCP gateway debugging pattern | debug | Debug procedure patterns, error resolution history, replay steps |
| 10 | Bulk event bus remediation architecture | architecture | Bulk remediation design, event bus entities, deployment facts |
What this bench does not prove.
Every benchmark has edges. Here are ours, stated plainly so you can weigh them.
- Self-reported results from a single evaluator model on a single domain (cloud-infrastructure / platform engineering). Not externally reproduced.
- 10 tasks is a small N. Confidence intervals are wide; treat absolute numbers as directional, not definitive.
- The same LLM family is used for task answering and grading, which risks circularity. An independent judge model would strengthen the claim.
- Full per-task inputs, outputs, and grader reasoning live in bench/results/report.md on GitHub for audit.
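To make the small-N caveat concrete, here is a minimal sketch that puts a 95% Wilson score interval around each reported success count. The choice of the Wilson interval is an assumption made for illustration; the source does not say which interval, if any, was computed.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Success counts as reported above: A = 0/10, B = 8/10, C = 10/10.
for label, k in [("A no memory", 0), ("B top-10 vector RAG", 8), ("C Borg compiled", 10)]:
    low, high = wilson_interval(k, 10)
    print(f"{label}: {k}/10, 95% interval {low:.2f} to {high:.2f}")
```

Under this choice of interval, 8/10 spans roughly 0.49 to 0.94 and 10/10 spans roughly 0.72 to 1.00, so the B and C intervals overlap, while both sit clearly above the interval for A.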