BNT saved me ~$80 in one Claude Code session, at a cost of $0.87

TL;DR
1 session. 4 hours. ~40 shared experiences surfaced across 5 coding tasks on the same product.
BigNumberTheory (BNT) cost $0.87 in extra LLM tokens for this session. It saved ~60–85 minutes of wall-clock time, worth roughly $60–$105 in engineer attention and avoided Claude Code spend. Roughly a 100× return.

What BNT is. A Claude Code plugin that captures what each agent learns in a session and surfaces relevant past experiences to agents in future sessions — so the same lesson doesn't get paid for twice. Each surfaced experience comes with a short attribution footer in the agent's output pointing back to the original session. This post is one real session, measured — for anyone building with AI coding agents trying to figure out whether a shared-experience layer actually pays off.

At a glance

BNT's role, task by task:
  1. Diagnostic: Decisive. 3 saves, including one that went into the diagnostic script itself.
  2. Dead field cleanup: Zero. Nobody had lived this yet.
  3. Observability feature: One high-value save (endpoint reuse).
  4. Modal redesign: Three confirmatory patterns, directionally useful.
  5. LLM upgrade: Zero. Decision rested on live data, not prior experience.

The session

Last week I opened a coding session on one of our own products with what looked like a simple question: "Why is version 9 of our learning loop stuck since 4/13?" Our landing-page timeline showed our self-improving prompt hadn't advanced in nearly a week. Something was obviously wrong.
What started as a quick diagnostic turned into 5 distinct tasks on the same product: (1) diagnose the stall, (2) clean up a dead field that turned up while looking at the data, (3) ship a new observability view, (4) redesign the detail modal that feeds it, (5) upgrade the evaluation LLM. One PR at the end — 25 files, +1036 / −89, 3 commits. BNT was running alongside Claude Code the whole time, pulling in relevant past-agent experiences as the work progressed.

Task 1 — Diagnostic

Reported bug turned out not to be a bug. BNT made 3 saves, including one that went into the diagnostic script itself.
First move: a diagnostic script against the production datastore. It said the unified counter was at 2, the active prompt was v7, and we had 47 of 50 evaluations toward the next promotion — with zero in the last 24 hours but 163 over the past week.
Two things jumped out.
Counter drift. The unified counter said 2 while the actual highest directive was 7. First instinct: corrupted state, write a migration. BNT surfaced Shared counter initialization with floor check pattern — a previous agent had hit this exact drift, traced it to the PR that introduced the counter without seeding it, and confirmed the floor check Math.max(counter, highestExisting) self-heals on next read. Reread the code. Floor check was right there. No migration needed.
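For anyone who hasn't seen the pattern, here is a minimal sketch of what that floor check amounts to (the names are assumptions, not the product's actual code):

```typescript
// Minimal sketch of the floor-check pattern (assumed names, not the real code). A counter
// introduced without seeding can lag behind reality; clamping against the highest existing
// directive on read self-heals the drift, so no migration is needed.
interface LoopState {
  unifiedCounter: number;   // stored value, possibly stale (here: 2)
  highestDirective: number; // ground truth derived from existing records (here: 7)
}

function readCounter(state: LoopState): number {
  // Never report a counter below what actually exists.
  return Math.max(state.unifiedCounter, state.highestDirective);
}
```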
Waiting on traffic. 47/50, zero in 24h, healthy 7-day. BNT surfaced Learning loop evaluation traffic dependency pattern — the exact fingerprint. The loop has no periodic heartbeat; it only advances when new evaluations come in, and it can go dormant right before the finish line. Three more evals and it would promote itself. Nothing to fix.
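A sketch of why a healthy loop can still look stuck, assuming an event-driven promotion check (all names here are made up):

```typescript
// Sketch of the traffic dependency (all names are assumptions). Promotion is only checked
// when a new evaluation arrives; nothing re-checks on a timer, so the loop can sit at 47/50
// indefinitely while traffic is zero, then promote itself once a few more evaluations land.
const PROMOTION_THRESHOLD = 50;

async function onEvaluationRecorded(
  evaluationsTowardPromotion: number,
  promote: () => Promise<void>,
): Promise<void> {
  if (evaluationsTowardPromotion >= PROMOTION_THRESHOLD) {
    await promote(); // the only code path that can advance the loop
  }
}
```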
One more save during the diagnostic: BNT surfaced Compare-and-swap failures can orphan learning loop state, describing a failure mode where the evolve-lock claim succeeds, the evolution fails, and the rollback also fails, leaving the bookkeeping pinned forward so the system silently demands 50 brand-new evaluations before retrying. The diagnostic didn't check for that. I added ~20 lines for orphan detection. Ran clean — but if it hadn't, it would have been the first thing to investigate, and without the prior experience I wouldn't have known to look.
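The orphan check itself is small. A hedged sketch of the kind of condition those ~20 lines test for (all names are assumptions):

```typescript
// Hedged sketch of the orphan condition the diagnostic now tests for (assumed names). The
// failure mode: the evolve lock is claimed, the evolution fails, the rollback fails too, and
// the bookkeeping stays pinned forward while the loop silently waits for 50 new evaluations.
interface LoopBookkeeping {
  evolveLockClaimedAt: Date | null;    // CAS-style claim on the evolution step
  activePromptVersion: number;         // version actually serving traffic
  pendingPromptVersion: number | null; // version the bookkeeping thinks is in flight
}

function detectOrphanedState(bk: LoopBookkeeping, now = new Date()): string | null {
  const LOCK_STALE_MS = 60 * 60 * 1000; // anything older than an hour is suspicious
  if (
    bk.evolveLockClaimedAt !== null &&
    now.getTime() - bk.evolveLockClaimedAt.getTime() > LOCK_STALE_MS &&
    bk.pendingPromptVersion !== null &&
    bk.pendingPromptVersion > bk.activePromptVersion
  ) {
    return `orphaned evolve lock: pending v${bk.pendingPromptVersion} was never promoted or rolled back`;
  }
  return null; // clean, which is what the real diagnostic found
}
```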
Artifact from this phase: diagnose-learning-loop.ts, now available to the next agent who walks into "why is the loop stuck."

Task 2 — Dead field cleanup

Silent degradation nobody had lived yet. 19 files of cleanup. The mode BNT can't help with.
While dumping 100 recent evaluations to sanity-check the evaluator, one line jumped out: every single record had experienceType: missing. 100/100. (Note: experienceType here is a field in our own product's data model — unrelated to BNT's shared experiences.)
The v3 extraction agent had quietly stopped emitting the type field months ago — the schema moved to a freeform body, and the prompt explicitly told the model not to force knowledge into predetermined categories. But the downstream consumers — platform aggregate counters, meta-prompt evidence, UI filters — were all still trying to read it. Silently degrading for weeks.
Not something BNT surfaced, because nothing like it had been written down before. Just a thing that falls out when you actually look at the data. 19 files of cleanup to remove the orphaned type machinery across models, services, controllers, frontend adapters, and tests. That's most of the PR's line count.
Worth saying plainly: this is the mode BNT can't help with. Nobody had lived this yet.

Task 3 — Observability feature

A new-endpoint build collapsed to three lines after BNT flagged an existing API. ~30–45 minutes saved.
Feature request: show recent evaluations on the public landing page so visitors can see what the loop is actually learning from. Instinct: new public API endpoint. New route, new auth, new tests, new cache.
BNT surfaced Existing API data reuse for new observability features. Before adding a new endpoint, check the existing public API. I checked — getPublicSummary() was already returning recent activity with the exact fields needed. The feature collapsed to a three-line frontend change and one backend tweak (limit 15 → 50). Saved roughly 30–45 minutes and a round of review.
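Roughly the shape of the collapsed feature, sketched with assumed names for everything except getPublicSummary:

```typescript
// A sketch of the reuse (every name except getPublicSummary is an assumption). Instead of
// scaffolding a new public endpoint, the landing page reads the recent activity the existing
// summary call already returns; the only backend change is the limit bump from 15 to 50.
interface PublicSummary {
  recentActivity: { id: string; createdAt: string; score: number }[];
}

declare function getPublicSummary(): Promise<PublicSummary>; // existing public API

export async function loadRecentEvaluations(): Promise<PublicSummary["recentActivity"]> {
  const summary = await getPublicSummary();
  return summary.recentActivity; // already carries the fields the new view needs
}
```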

Task 4 — Modal redesign

Three confirmatory BNT patterns — directionally right, not decisive. Normal product judgment carried the rest.
Follow-on: when a user clicks one of those evaluations, the detail view should treat the evaluation as the subject and the experience as a reference — the opposite of the existing modal hierarchy. Three small BNT patterns landed in this phase — Metric UI separates count display from distribution details, Modal content updates without breaking existing functionality, Reusing existing modal components for graph node interactions — all directionally right, all confirmatory rather than revelatory. They validated keeping a single shared DetailModal with a branch keyed on the view prop, instead of forking a new component.
The design itself took several iterations — modal → inline list → back to preview-plus-modal, the layout flipped twice. None of that was BNT-assisted; just normal product judgment rounds with the user.

Task 5 — LLM upgrade

Live-data decision, no prior pattern. BNT can't help where no agent has worked the same tradeoff yet.
Separate thread: the evaluation LLM's review field was truncating at 1,500 characters, which was binding on real samples. Bumped MAX_REVIEW_LENGTH 1500 → 2500, max_tokens 4096 → 6144, and swapped the default model from Haiku 4.5 to Sonnet 4.6. No BNT experience applied here — the decision rested on reading the actual eval dump and the meta-prompt consumer's context budget. BNT can't help where no prior agent has worked through the same tradeoff.

The misses

The P1 regression. I reviewed the PR. Said it was good. It wasn't. A separate automated reviewer (Codex, not BNT) caught it: the new evaluation-hero modal was gated on experience.evaluation being truthy — but experience here is an object in our product's schema (not a BNT concept), and a different flow populates that same field for consumed-experience cards in unrelated parts of the app. My change would have silently broken the existing layout in those flows. Fix was a one-liner once seen: pass an explicit view="evaluation" prop, default to "experience", only opt in from the two real entry points. None of the ~40 surfaced experiences flagged it. No agent had lived it yet.
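A sketch of the fix once seen; aside from the view prop and its two values, the names are assumptions:

```typescript
// Sketch of the fix (names beyond the view prop are assumptions). The modal no longer infers
// its layout from experience.evaluation being truthy; it defaults to the pre-existing
// "experience" layout, and only the two real evaluation entry points opt in explicitly.
type DetailView = "experience" | "evaluation";

interface DetailModalProps {
  experience: { evaluation?: unknown }; // other flows populate this too, so it can't drive layout
  view?: DetailView;                    // explicit opt-in with a safe default
}

function resolveView({ view = "experience" }: DetailModalProps): DetailView {
  return view;
}
```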
tsc --noEmit vs tsc -b. BNT surfaced TypeScript compilation validation for cleanup PRs and reminded me to run the typechecker. I did. It passed. What I didn't run was tsc -b (project-references build mode) — which is what Vercel's CI actually uses, and which caught a regression my --noEmit missed. The experience was right in direction, wrong in specificity. One wasted CI round.
Both misses point to the same thing: BNT surfaces what other agents have already learned at whatever level of specificity they wrote it down. It does not invent guidance, and it does not sharpen guidance that was vague when first captured. That's the right tradeoff — an experience network that hallucinates specificity would be worse than no network — but it means the precision of the input determines the value of the output.

The hit rate

Of the ~40 experiences surfaced across the five tasks:
  • 6–8 directly changed a decision or prevented a mistake
  • ~10 confirmatory — right path, would have saved time if earlier
  • ~15 off-topic or too generic to act on
  • ~8 re-injections of experiences already acknowledged
~20% actively decision-changing. ~25% confirmatory. ~55% noise.
The 6–8 that landed were specific: floor check pattern, traffic dependency pattern, CAS claim orphan, Firestore evaluation storage locations (shipped with jq recipes), CI formatting failure diagnosis, existing API data reuse. Each one was causal, named the exact symptom, and usually pointed at a specific file or failure mode. That's the shape of a useful experience.
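Sketched as a record, that shape might look something like this (illustrative only, not BNT's actual schema):

```typescript
// Illustrative only: the fields the high-value experiences tended to carry. Not BNT's schema.
interface UsefulExperience {
  symptom: string;  // the exact observable fingerprint ("counter says 2, highest directive is 7")
  cause: string;    // a causal story, not a generic principle
  pointer?: string; // a specific file, PR, or failure mode to go look at
  action: string;   // what the next agent should do, or skip doing
}
```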
The noise was generic design principles ("separate count from distribution") and migration-flavored experiences loading during read-only investigation. One pattern about running git diff --stat got re-injected ~8 times across the session; I ran it once.

The math

Cost of having BNT in the loop for this session:
  • ~48K input tokens across 40 surfaced experiences (preamble + experience content + previously-loaded summaries added to the agent's context)
  • ~$0.87 in direct LLM spend at current model pricing
  • ~40 attribution footers appended to agent output (short citation lines pointing back to the originating session)
Benefit:
  • ~60–85 min of wall-clock session time saved. Biggest wins: avoiding a useless counter migration (~25–40 min) and reusing an existing API instead of scaffolding a new endpoint (~30–45 min).
  • The engineer isn't engaged for 100% of that wall-clock time. Modern Claude Code workflow assumes delegation — write the prompt, context-switch to other work while Claude runs, come back to review. A realistic active-engagement ratio during a debugging session is maybe 50–60%.
  • Engineer attention saved: ~30–50 min at $100/hr → ~$50–$85.
  • Claude Code tokens saved: ~$10–$20 (would have gone to wrong-path work — the migration, the new endpoint — now never written).
  • Total avoided: ~$60–$105. Cost of BNT: $0.87. Return: 70–120×.
And a second-order effect worth noting: the delegation model means the engineer's recovered attention goes to parallel work. Whatever else got done during those freed-up minutes is extra upside the dollar math above doesn't count.
Also worth noting but harder to price:
  • ~14 turns avoided out of ~80 total — roughly a 15% reduction in round-trips.
  • 4 definitive answers ("here's the pattern, here's why") that would otherwise have been "I think it's X, let me go check."
  • Two deliverables that become experiences themselves — diagnose-learning-loop.ts and dump-evaluations.ts, both now available to the next agent who walks into "why is the loop stuck."
Confidence caveats: token counts and PR stats are exact. Time-saved, turn-saved, and counterfactual-Claude-Code-cost numbers are judgment calls — easy to over- or under-credit. The direction is clear; the exact coefficient is not.
The open question isn't whether this is worth it. It's whether trimming the noise rate from ~55% to ~25% via better matching would roughly double the return at the same cost. It probably would. Cost per miss is ~$0.03. Cost per hit is ~$15–$20 of combined engineer + Claude Code savings. Asymmetry favors improving precision, not trimming volume.

The shape of the payoff

The by-task split at the top of this post tells the story: BNT pays off most on work adjacent to something another agent has already been through, and pays off zero on genuinely new terrain. A session that's purely new work still costs the token overhead without buying much. A session with any reused pattern surface — debugging, cleanup, API design — gets a disproportionate return from the hits.

Why this is the bet

Every time an AI agent solves a non-trivial problem today, the experience dies with the session. Close the window, forget. The next agent walks into the same problem and starts from zero. Multiply by the agent-hours the world is about to burn.
Making individual agents smarter is the obvious move. Letting agents share experiences across sessions is the higher-leverage one. One agent figures out why a shared counter drifts and writes down the experience. Every agent after reads it before sending anyone down the wrong path.
Nearly a week into a stuck loop — that turned out not to be stuck — with a floor check nobody remembered and a traffic dependency nobody had written down, I didn't have to rediscover either. Someone else already had. That's the product.
If you're building with AI coding agents and want to stop paying for the same lesson twice, come see what we're building at bignumbertheory.com.