DEV BY SAL
Provision · AI Infrastructure / Agent Platform · 2025

Real-time agent observability for Provision's runtime

Provision is a managed cloud runtime for AI agents: sandboxed browsers, agent inboxes, and channel integrations, all behind a single API. We built the observability layer that turns "the agent failed somewhere" into a replayable trace with cost, tool calls, and browser actions, cutting mean time to debug (MTTD) by roughly 45%.

MTTD reduction: ~45%
Spans/sec ingested: ~120K
Replay fidelity: 100%
Shipped in: 7 weeks
provision / agent · planner-v8 · run abe1f3
● Running · 14 / 22 steps · 8.2s

Tokens: 12,481 (input + output)
Cost: $0.084 (planning $0.05 · tools $0.03)
Tool calls: 9 (7 browser · 2 inbox)
p95 step: 420ms (rolling 100 steps)
Replans: 1 (step 11)

Span timeline

filter: all · errors · tools

#     Call                                 Dur      Cost
#14   browser.fill(form, "jrivera@...")    180ms    $0.00
#13   browser.click("Submit")              220ms    $0.00
#12   llm.respond()                        1.8s     $0.022
#11   llm.plan(retry, attempt=2)           1.2s     $0.012
#10   tool.search(docs, q=...)             320ms    $0.003
#09   inbox.read(latest)                   480ms    $0.00
#08   llm.observe()                        720ms    $0.008
#07   browser.click("New lead")            190ms    $0.00
#06   browser.navigate('/leads')           640ms    $0.00
#05   llm.plan(initial)                    1.4s     $0.018
#04   inbox.search("new RFP")              280ms    $0.00

Live viewport

step 14 · sandboxed Chrome
crm.acme.com/leads/new
Company: Acme Holdings
Contact: Janelle Rivera
Email: jrivera@...
Cancel · Submit
Replay: 00:08.2 / 00:12.4
llm · tool · browser · inbox

Cost ledger

↓ 18% vs avg
gpt-4o · planner: $0.052 (62%)
haiku · observer: $0.022 (26%)
tools · embeddings: $0.008 (9%)
tools · search: $0.002 (3%)

The brief

Provision's product is the place agents run: managed compute, a real Chrome, an actual inbox, channel integrations. What customers actually buy is the ability to ship agents to production without a six-month detour through infrastructure. That works until something goes wrong. At that point customers were debugging blind, with logs that lacked the relevant context and no way to replay what happened.

The bet: a first-class observability and replay layer would convert Provision's runtime from "agents work, mostly" into "agents work, and when they don't, you know exactly why."

What we built

An OpenTelemetry-shaped tracing layer that captures every step of an agent run as a span — LLM call, tool call, browser action, inbox read, plan revision — with timing, token cost, and tool arguments attached. Spans stream into a purpose-built time-series store sized for agent workloads (steps are bursty; tokens are expensive; cardinality is high).
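To make the span model concrete, here is a minimal sketch of what one recorded step might look like. It is not Provision's actual schema: the `Span` and `RunRecorder` names, every field name, and the token count are assumptions made for illustration, and the recorder keeps spans in memory rather than streaming them to a store.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Span:
    """One step of an agent run (field names are illustrative assumptions)."""
    run_id: str
    step: int
    kind: str                      # "llm", "tool", "browser", or "inbox"
    name: str                      # e.g. "browser.click"
    args: Dict[str, Any] = field(default_factory=dict)
    start_ns: int = 0
    end_ns: int = 0
    tokens: int = 0                # input + output tokens; 0 for non-LLM steps
    cost_usd: float = 0.0

    @property
    def duration_ms(self) -> float:
        return (self.end_ns - self.start_ns) / 1_000_000


class RunRecorder:
    """Collects the spans of a single run in memory, in step order."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.spans: List[Span] = []

    @contextmanager
    def step(self, kind: str, name: str, **args: Any) -> Iterator[Span]:
        span = Span(self.run_id, len(self.spans) + 1, kind, name, dict(args))
        span.start_ns = time.perf_counter_ns()
        try:
            yield span             # the caller fills in tokens and cost_usd
        finally:
            # Record the span even if the step raised, so failures are traced.
            span.end_ns = time.perf_counter_ns()
            self.spans.append(span)


recorder = RunRecorder("abe1f3")
with recorder.step("browser", "browser.click", target="Submit"):
    pass                           # the real browser action would run here
with recorder.step("llm", "llm.plan", mode="initial") as span:
    span.tokens, span.cost_usd = 1200, 0.018   # illustrative values
```

Recording the span in a `finally` block is the design choice that matters here: a step that throws still leaves a timed span behind, which is the case a debugging tool most needs to capture.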

The replay layer is the most-loved feature. Every browser action and every inbox interaction is captured in enough detail that a customer can replay the run frame by frame: watch the Chrome viewport, follow the LLM's reasoning step by step, and see exactly which tool argument was wrong. The replay is byte-identical to the original; we built a custom WebSocket protocol to ferry the captured frames into the customer's browser without rehydrating the entire run from cold storage.
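One way to avoid loading a whole run just to scrub to a point in it is to keep a small index of frame offsets and fetch only the frames in the requested window. The sketch below shows that idea only; it is not the custom WebSocket protocol described above, and the `Frame` and `ReplayIndex` names, their fields, and the sample offsets are hypothetical.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Frame:
    offset_ms: int      # time since the start of the run
    step: int           # agent step the frame belongs to
    payload: bytes      # captured viewport or inbox data, stored verbatim


class ReplayIndex:
    """Sorted frame index supporting seek and windowed reads."""

    def __init__(self, frames: List[Frame]) -> None:
        self._frames = sorted(frames, key=lambda f: f.offset_ms)
        self._offsets = [f.offset_ms for f in self._frames]

    def frame_at(self, offset_ms: int) -> Optional[Frame]:
        """Return the latest frame captured at or before offset_ms."""
        i = bisect.bisect_right(self._offsets, offset_ms) - 1
        return self._frames[i] if i >= 0 else None

    def window(self, start_ms: int, end_ms: int) -> List[Frame]:
        """Return frames with start_ms <= offset_ms <= end_ms, in order."""
        lo = bisect.bisect_left(self._offsets, start_ms)
        hi = bisect.bisect_right(self._offsets, end_ms)
        return self._frames[lo:hi]


index = ReplayIndex([
    Frame(0, 1, b"frame-a"),
    Frame(5000, 9, b"frame-b"),
    Frame(8200, 14, b"frame-c"),
])
current = index.frame_at(8200)         # scrubber at 00:08.2
nearby = index.window(5000, 9000)      # frames to send for this window
```

Because payloads are stored and returned as raw bytes, nothing in the seek path re-encodes a frame, which is compatible with the byte-identical replay the text describes.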

Per-step cost surfacing closed a question every Provision customer had been asking: which step was expensive? The answer is now a hover.
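Answering "which step was expensive?" is a small aggregation once every span carries its own cost. The sketch below uses the step numbers, call names, and costs from the span timeline shown earlier; the tuple layout and the function names are assumptions for illustration, and `Decimal` is used so that summed dollar amounts stay exact.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, List, Tuple

# (step, call, cost in USD), copied from the span timeline example.
SPANS: List[Tuple[int, str, str]] = [
    (4, "inbox.search", "0.00"),
    (5, "llm.plan", "0.018"),
    (6, "browser.navigate", "0.00"),
    (7, "browser.click", "0.00"),
    (8, "llm.observe", "0.008"),
    (9, "inbox.read", "0.00"),
    (10, "tool.search", "0.003"),
    (11, "llm.plan", "0.012"),
    (12, "llm.respond", "0.022"),
    (13, "browser.click", "0.00"),
    (14, "browser.fill", "0.00"),
]


def most_expensive(spans: List[Tuple[int, str, str]]) -> Tuple[int, str, str]:
    """Return the single span with the highest cost."""
    return max(spans, key=lambda s: Decimal(s[2]))


def cost_by_kind(spans: List[Tuple[int, str, str]]) -> Dict[str, Decimal]:
    """Sum cost per span kind, taking the kind from the call prefix."""
    totals: Dict[str, Decimal] = defaultdict(Decimal)
    for _step, call, cost in spans:
        totals[call.split(".")[0]] += Decimal(cost)
    return dict(totals)


top = most_expensive(SPANS)
totals = cost_by_kind(SPANS)
```

For the steps listed, `top` is step 12 (`llm.respond`), and the LLM calls account for $0.060 of the $0.063 shown, with browser and inbox actions costing nothing.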

Outcome

Mean-time-to-debug on production agent failures dropped ~45%, with the biggest wins on multi-step failures where the customer previously had to reconstruct the sequence by hand. The system ingests ~120,000 spans per second at peak. Replay fidelity is 100% — every captured run can be played back exactly.

Time from selecting a design partner to GA was seven weeks. The observability layer is now a primary selling point in Provision's enterprise demos.

"We were debugging blind for the first six months. Now we know exactly where the agent decided wrong, what it cost, and what it should have done instead."