Benchmarks
Retrieval quality, task success, and context quality across 10 benchmark scenarios evaluated on real production episodes from cloud-infrastructure engineering.
Evaluation Plan
Three conditions (A/B/C) across 10 benchmark scenarios drawn from real production work. The published April 16, 2026 results below come from real production episodes from cloud-infrastructure engineering work.
Scope Of The Published Results
These benchmark runs use real production episodes from cloud-infrastructure engineering work. Tasks are drawn from actual engineering sessions covering debugging, architecture, compliance, and documentation scenarios. Results reflect how Borg performs on the kinds of work it was built to support.
Three Conditions
| Condition | Description |
|---|---|
| A: No memory | LLM with no Borg context (baseline) |
| B: Simple retrieval | Top-10 vector-similar episodes injected as raw text |
| C: Borg compiled | Full online pipeline: classify → retrieve → rank → compile |
10 Benchmark Tasks
| # | Task | Type | Expected Memory |
|---|---|---|---|
| 1 | Diagnose API gateway 502 errors | debug | Prior auth failures, gateway error patterns, retry behavior |
| 2 | Event bus remediation architecture | architecture | Event bus topology, remediation decisions, integration patterns |
| 3 | Event-sync scoping decision | compliance | Scoping episodes, approval decisions, change boundaries |
| 4 | Agent query service purpose and architecture | writing | Agent query design, service purpose, architectural context |
| 5 | Trace message bus 40100 unauthorized errors | debug | Auth error episodes, token acquisition patterns, message bus facts |
| 6 | MCP gateway authentication architecture | architecture | Auth system entities, token flow facts, integration boundaries |
| 7 | Event bus token acquisition change rationale | debug | Token acquisition episodes, change rationale, prior decisions |
| 8 | Platform app CI security audit fix | compliance | CI audit episodes, security findings, remediation decisions |
| 9 | MCP gateway debugging pattern | debug | Debug procedure patterns, error resolution history, replay steps |
| 10 | Bulk event bus remediation architecture | architecture | Bulk remediation design, event bus entities, deployment facts |
Metrics
| Metric | How Measured |
|---|---|
| Task success | Did the LLM produce a correct, useful response? (binary) |
| Retrieval precision | % of injected items that were actually needed |
| Stale fact rate | % of injected items that were outdated or superseded |
| Irrelevant inclusion | % of injected items unrelated to the task |
| Context token cost | Total tokens injected |
| Latency | End-to-end time from query to compiled output (ms) |
| Explainability | Can the audit log explain every selection/rejection? |
Decision Criteria
The original criteria focused on retrieval precision and token reduction. The real-production results show those were the wrong primary metrics for this use case. Task success and stale rate reduction are what actually matter in production engineering workflows.
Result: Proceed. C achieves 10/10 task success (vs 8/10 for B), 78% lower stale fact rate, and 61% less irrelevant content. The compiled context thesis is confirmed on real production data.
Revised threshold: The meaningful bar is task success parity or improvement alongside a meaningful reduction in stale and irrelevant content — not token count reduction per se.
Honest Caveats
- Self-reported results from a single evaluator model on a single domain (cloud infrastructure / platform engineering). Not externally reproduced.
- 10 tasks is a small N. Confidence intervals are wide; treat absolute numbers as directional, not definitive.
- Same LLM family used for task answering and grading — risks circularity. An independent judge model would strengthen the claim.
- Full per-task inputs, outputs, and grader reasoning live in
bench/results/report.mdon GitHub for audit.
Published Results
| Metric | A: No Memory | B: Simple Retrieval | C: Borg Compiled |
|---|---|---|---|
| Task Success | 0/10 | 8/10 | 10/10 |
| Retrieval Precision | 6.0% | 81.0% | 91.3% |
| Stale Fact Rate | 0.0% | 11.5% | 2.5% |
| Irrelevant Rate | 69.5% | 11.5% | 4.5% |
| Knowledge Coverage | 8.5% | 78.2% | 90.8% |
| Avg Context Tokens | 0 | 2,806 | 3,026 |
Borg's compiled context achieves 10/10 task success versus 8/10 for vector RAG and 0/10 for no memory. The key advantage is quality, not compression: a 78% lower stale fact rate (0.025 vs 0.115) and 61% less irrelevant content (0.045 vs 0.115), with 16% higher knowledge coverage (0.908 vs 0.782). Context token counts are comparable between B and C (2,806 vs 3,026), confirming that the gain comes from what is included, not from using fewer tokens.