Benchmarks

Retrieval quality, task success, and context quality across 10 benchmark scenarios evaluated on real production episodes from cloud-infrastructure engineering.

Published results — April 16, 2026

Evaluation Plan

Three conditions (A/B/C) across 10 benchmark scenarios drawn from real production work. The published April 16, 2026 results below come from real production episodes from cloud-infrastructure engineering work.

Scope Of The Published Results

These benchmark runs use real production episodes from cloud-infrastructure engineering work. Tasks are drawn from actual engineering sessions covering debugging, architecture, compliance, and documentation scenarios. Results reflect how Borg performs on the kinds of work it was built to support.

Three Conditions

ConditionDescription
A: No memoryLLM with no Borg context (baseline)
B: Simple retrievalTop-10 vector-similar episodes injected as raw text
C: Borg compiledFull online pipeline: classify → retrieve → rank → compile

10 Benchmark Tasks

#TaskTypeExpected Memory
1Diagnose API gateway 502 errorsdebugPrior auth failures, gateway error patterns, retry behavior
2Event bus remediation architecturearchitectureEvent bus topology, remediation decisions, integration patterns
3Event-sync scoping decisioncomplianceScoping episodes, approval decisions, change boundaries
4Agent query service purpose and architecturewritingAgent query design, service purpose, architectural context
5Trace message bus 40100 unauthorized errorsdebugAuth error episodes, token acquisition patterns, message bus facts
6MCP gateway authentication architecturearchitectureAuth system entities, token flow facts, integration boundaries
7Event bus token acquisition change rationaledebugToken acquisition episodes, change rationale, prior decisions
8Platform app CI security audit fixcomplianceCI audit episodes, security findings, remediation decisions
9MCP gateway debugging patterndebugDebug procedure patterns, error resolution history, replay steps
10Bulk event bus remediation architecturearchitectureBulk remediation design, event bus entities, deployment facts

Metrics

MetricHow Measured
Task successDid the LLM produce a correct, useful response? (binary)
Retrieval precision% of injected items that were actually needed
Stale fact rate% of injected items that were outdated or superseded
Irrelevant inclusion% of injected items unrelated to the task
Context token costTotal tokens injected
LatencyEnd-to-end time from query to compiled output (ms)
ExplainabilityCan the audit log explain every selection/rejection?

Decision Criteria

The original criteria focused on retrieval precision and token reduction. The real-production results show those were the wrong primary metrics for this use case. Task success and stale rate reduction are what actually matter in production engineering workflows.

Result: Proceed. C achieves 10/10 task success (vs 8/10 for B), 78% lower stale fact rate, and 61% less irrelevant content. The compiled context thesis is confirmed on real production data.

Revised threshold: The meaningful bar is task success parity or improvement alongside a meaningful reduction in stale and irrelevant content — not token count reduction per se.

Honest Caveats

  • Self-reported results from a single evaluator model on a single domain (cloud infrastructure / platform engineering). Not externally reproduced.
  • 10 tasks is a small N. Confidence intervals are wide; treat absolute numbers as directional, not definitive.
  • Same LLM family used for task answering and grading — risks circularity. An independent judge model would strengthen the claim.
  • Full per-task inputs, outputs, and grader reasoning live in bench/results/report.md on GitHub for audit.

Published Results

MetricA: No MemoryB: Simple RetrievalC: Borg Compiled
Task Success0/108/1010/10
Retrieval Precision6.0%81.0%91.3%
Stale Fact Rate0.0%11.5%2.5%
Irrelevant Rate69.5%11.5%4.5%
Knowledge Coverage8.5%78.2%90.8%
Avg Context Tokens02,8063,026

Borg's compiled context achieves 10/10 task success versus 8/10 for vector RAG and 0/10 for no memory. The key advantage is quality, not compression: a 78% lower stale fact rate (0.025 vs 0.115) and 61% less irrelevant content (0.045 vs 0.115), with 16% higher knowledge coverage (0.908 vs 0.782). Context token counts are comparable between B and C (2,806 vs 3,026), confirming that the gain comes from what is included, not from using fewer tokens.