Engineering teams running agentic systems in production tend to discover the same gaps in the same order. First they notice that the dashboards look healthy. Then they realize that a healthy dashboard and a correctly behaving agent are not the same thing, and that somewhere in that apparently healthy system an agent has been quietly producing outputs outside any acceptable range of correctness for hours before anyone noticed. The monitoring did not fail to fire. It was watching for the wrong thing.

Service level objectives (SLOs) were built on a contract that most engineering teams have never had a reason to question. A request came in, the service processed it, and reliability was measured against whether the outcome matched the expectation within defined thresholds. That contract worked because the systems it governed were deterministic. The same input produced the same output, and deviation from that expectation was unambiguously a failure.

“Agentic systems have broken that contract without replacing it with anything most platform teams are ready to operate against,” says Shahid Ali Khan of TestMu AI (formerly LambdaTest). Identical inputs can produce different, plausibly correct outputs. Quiet failures complete successfully while doing something entirely unintended. The metrics most teams are watching (p99 latency, error rate and availability) tell you whether something broke but not whether the agent deviated from expected behavior in ways that matter to the business.

The frameworks, Khan points out, have not caught up, and the teams discovering that gap are doing so mostly through incidents rather than through deliberate architectural decisions made before those incidents occurred.

Why Traditional SLOs Fail Agentic Systems

Ronak Desai, CEO and founder of Ciroos and formerly SVP and GM at Cisco, identifies the structural flaw in existing SLO frameworks, one that practitioners who have not yet run agentic systems in production may not have encountered directly. “Traditional SLOs are dependent on deterministic contracts,” Desai argues. “A request either meets a latency threshold, or it doesn’t. Agentic systems violate that assumption because identical inputs can produce different, plausibly correct outputs or quiet failures that never appear as errors.” The implication is not that reliability is unmeasurable in agentic systems. It is that the measurement model has to change fundamentally rather than incrementally.

The first practical consequence of that reframe is accepting that existing metrics need to be supplemented rather than replaced. Infrastructure metrics still matter and should not be abandoned. What they cannot do is tell the whole story when part of the system is probabilistic by design, and that is where the observational gap opens up and where most teams are currently flying blind without realizing it.

Khan, principal engineer – DevOps at TestMu AI, an AI-native software testing platform, arrives at the same diagnosis from the operational side. When his team started integrating agentic components into their test orchestration platform, the existing SLO metrics proved necessary but insufficient. “These metrics tell you if something broke,” Khan explains, “not why or how badly the agent deviated from expected behavior.” A service can be technically up, passing every infrastructure check the team has instrumented for, while simultaneously producing outputs that fall outside any acceptable range of correctness. The gap between those two states is precisely what traditional SLO frameworks have no mechanism to surface.

Closing that gap requires building a second layer of measurement on top of the first rather than trying to stretch existing metrics to cover behavior they were never designed to capture. That architectural insight, that behavioral correctness requires its own measurement layer rather than a refinement of infrastructure metrics, is the principle that connects every practitioner who has worked through this problem in production.

Arun Anbumani, principal cloud infrastructure engineer at Oracle, adds a dimension to the non-determinism problem that the agent-focused framing alone cannot capture. Hardware and firmware variability at the infrastructure layer, including queue pressure on accelerators, firmware resets and resource contention on GPUs and SmartNICs, introduces behavioral variance before any AI component is even involved. “Reliability is no longer just about whether an API call succeeds,” Anbumani observes. “It is about whether the entire platform stack is behaving normally.”

In environments running heterogeneous infrastructure alongside agentic workloads, the non-determinism problem compounds across layers simultaneously, and platform teams that treat it as purely a model problem will keep finding infrastructure-layer failures that their agent-focused monitoring was never positioned to catch.

The actionable consequence here is that SLO redesign for agentic systems has to start at the infrastructure layer rather than at the model layer. Teams that jump straight to behavioral SLOs without instrumenting the hardware and firmware signals feeding the models will build measurement systems with a blind spot in their foundation. Getting the platform-level indicators right is the prerequisite for everything that comes above it.

The Emerging Layered SLO Model

The response that has emerged from practitioners who have worked through this problem is a layered SLO model that retains classical reliability metrics at the infrastructure layer while adding new measurement primitives for inference reliability and behavioral correctness above it. Khan and Ihor Zakutynskyi, chief technology officer at FORMA by Universe Group, arrived at this architecture independently, which strongly suggests emerging practitioner consensus rather than individual preference.

Khan’s implementation adds a behavioral correctness layer above the classical reliability metrics his team retained. “We track semantic similarity scores between generated tests and a baseline corpus, confidence thresholds from underlying models and outcome variance over rolling windows,” he explains. “If variance exceeds a threshold even when the service is technically up, that is an SLO breach.”

The behavioral layer is not a replacement for infrastructure monitoring. It is the layer that makes infrastructure monitoring meaningful for services where the output is probabilistic rather than deterministic, and it gives teams a mechanism for detecting degradation before it surfaces as a customer-facing incident, because behavioral drift tends to precede visible failure rather than coincide with it.

The practical recommendation that follows is to define behavioral thresholds empirically from production data before setting them as SLO targets. Starting with rolling window variance monitoring and adjusting the threshold based on what the team observes over several weeks gives a more defensible baseline than picking a number from theory and hoping it holds in production.
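As a rough illustration of rolling-window variance monitoring, the sketch below keeps the last N behavioral scores (for example, semantic similarity values) and flags a breach when their variance exceeds a threshold. The class name, window size and threshold are assumptions chosen for illustration, not details of Khan's implementation. In practice the threshold would come from weeks of observed production data.

```python
from collections import deque
from statistics import pvariance


class RollingVarianceMonitor:
    """Tracks variance of a behavioral score over the last `window` observations.

    Hypothetical helper: the name and parameters are illustrative only.
    """

    def __init__(self, window: int, max_variance: float) -> None:
        if window < 2:
            raise ValueError("window must hold at least two observations")
        self.max_variance = max_variance
        self._scores = deque(maxlen=window)

    def observe(self, score: float) -> bool:
        """Record a score; return True if the window is full and in breach."""
        self._scores.append(score)
        if len(self._scores) < self._scores.maxlen:
            return False  # not enough data yet to judge variance
        # Population variance of the scores currently in the window.
        return pvariance(self._scores) > self.max_variance
```

A monitor like this can report a breach while every infrastructure check is green, which is the case the behavioral layer exists to catch.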

Zakutynskyi’s three-layer model maps closely to Khan’s approach and adds a precision that practitioners building these systems for the first time will find useful. The infrastructure reliability layer retains familiar metrics governed by conventional error budgets. The inference reliability layer tracks model-specific signals, including token-level latency, response time variance and fallback frequency. Finally, the behavioral SLO layer enforces bounded consistency, validating that outputs meet structured contracts and measuring semantic drift against embedding-similarity baselines.

“A service might be 99.9% available at layer one yet still violate its behavioral consistency SLO if outputs wander beyond the defined variance envelope,” Zakutynskyi notes. The reframe from uptime to what he calls trust over time, covering correctness, safety and consistency, is the architectural shift that makes SLOs meaningful for agentic systems rather than simply applicable to them.
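A behavioral check of this kind might combine a structural contract with an embedding-similarity floor, as in the minimal sketch below. It assumes embeddings are computed elsewhere by whatever model the team already uses. The field names in `REQUIRED_FIELDS` and the 0.8 similarity floor are hypothetical placeholders, not values from Zakutynskyi's system.

```python
import math

# Hypothetical output contract: field name -> required Python type.
REQUIRED_FIELDS = {"test_name": str, "steps": list}


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("zero-length embedding")
    return dot / norm


def violates_contract(output: dict) -> bool:
    """True if any required field is missing or has the wrong type."""
    return any(
        not isinstance(output.get(name), expected)
        for name, expected in REQUIRED_FIELDS.items()
    )


def within_envelope(output, embedding, baseline_embedding, min_similarity=0.8):
    """An output is acceptable only if it is well-formed AND close to baseline."""
    if violates_contract(output):
        return False
    return cosine_similarity(embedding, baseline_embedding) >= min_similarity
```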

Tracking each layer independently rather than aggregating them is what makes incident routing possible. Collapsing all three into a single composite score obscures the causal structure of failures and makes it impossible to route incidents to the right team with the right context. Infrastructure failures route to platform engineers, inference failures route to the ML team, and behavioral correctness failures route to whoever owns the agent’s goal specification and output contracts. The separation is not administrative tidiness but the operational prerequisite for a functioning response process.
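The routing idea can be sketched in a few lines: each SLO result carries its layer, and breaches are grouped by owning team instead of being folded into one score. The owner names, SLO names and targets below are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    INFRASTRUCTURE = "infrastructure"
    INFERENCE = "inference"
    BEHAVIORAL = "behavioral"


# Hypothetical owner names; substitute real on-call rotations.
OWNERS = {
    Layer.INFRASTRUCTURE: "platform-engineering",
    Layer.INFERENCE: "ml-team",
    Layer.BEHAVIORAL: "agent-owners",
}


@dataclass(frozen=True)
class SloResult:
    layer: Layer
    name: str
    observed: float
    target: float
    higher_is_better: bool = True

    @property
    def breached(self) -> bool:
        if self.higher_is_better:
            return self.observed < self.target
        return self.observed > self.target


def route_breaches(results):
    """Group breached SLO names by owning team, keeping layers separate."""
    routed = {}
    for result in results:
        if result.breached:
            routed.setdefault(OWNERS[result.layer], []).append(result.name)
    return routed
```

With an availability of 0.9995 against a 0.999 target and a semantic similarity of 0.71 against a 0.80 target, only the behavioral owner is paged, even though the service is comfortably "up."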

The Four-Dimensional Framework for Agentic Reliability

Desai offers the most actionable framework for practitioners trying to operationalize agentic SLOs and it organizes around four dimensions that together form a reliability model built for non-deterministic systems rather than adapted from deterministic ones.

The first dimension is latency measured by task class: time per logical conclusion rather than time to first byte, since time to first byte becomes meaningless when a task spans dozens of model calls.

The second dimension is step count as a reliability signal, where excessive steps indicate that an agent is confused or operating inefficiently. “It should not take an agent 20 steps to solve a two-step problem,” Desai notes, and that observation is more immediately relatable to engineering leaders than any abstract discussion of non-determinism because it translates directly into a metric that any team can instrument without building new infrastructure.

The third dimension is probabilistic correctness, replacing binary correctness tests with pass@k-style distributional accuracy targets so that SLOs capture correctness across repeated runs rather than a single snapshot.

The fourth is confidence and intent alignment, treating low-confidence actions in high-stakes contexts as first-class reliability events that automatically trigger human review or rollback rather than silently proceeding.
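For the third dimension, a pass@k-style target can be computed with the estimator commonly used in code-generation evaluation: given n repeated runs of a task of which c were judged correct, the probability that at least one of k sampled runs is correct is 1 - C(n-c, k) / C(n, k). The sketch below applies that formula. The function names and the idea of comparing against a fixed target are illustrative assumptions, and how a run is judged "correct" is left to the team's own validation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled runs is correct).

    n: total runs evaluated, c: runs judged correct, k: sample size.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k runs failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def meets_target(n: int, c: int, k: int, target: float) -> bool:
    """True if the estimated pass@k reaches the SLO target."""
    return pass_at_k(n, c, k) >= target
```

With 3 correct runs out of 10, pass@1 is 0.3 while pass@5 is roughly 0.92, which shows why the choice of k has to be part of the SLO definition and not an afterthought.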

The implementation sequence that this framework implies is worth making explicit. Start with step count monitoring because it requires the least new infrastructure and surfaces the most obvious failure mode. Add task-class latency next because it gives the time-per-logical-conclusion signal without requiring behavioral analysis.

Layer in pass@k correctness evaluation once there is a corpus of validated outputs to measure against, and build confidence and intent alignment alerting last because it requires the most domain-specific calibration to avoid generating false positives that erode trust in the alerting system itself.
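The first step in that sequence, step count monitoring, needs little more than a per-task-class budget and a counter. The sketch below reports the fraction of runs in each task class that exceeded their budget. The task class names and budget values are hypothetical and would need calibrating against real traces.

```python
from dataclasses import dataclass

# Hypothetical per-task-class step budgets; calibrate from production traces.
STEP_BUDGETS = {"lookup": 3, "multi_tool_plan": 12}


@dataclass(frozen=True)
class TaskRun:
    task_class: str
    steps: int


def step_budget_violations(runs, budgets=STEP_BUDGETS):
    """Return, per task class, the fraction of runs that exceeded the budget."""
    totals, over = {}, {}
    for run in runs:
        budget = budgets.get(run.task_class)
        if budget is None:
            continue  # no budget defined for this class yet; skip it
        totals[run.task_class] = totals.get(run.task_class, 0) + 1
        if run.steps > budget:
            over[run.task_class] = over.get(run.task_class, 0) + 1
    return {cls: over.get(cls, 0) / count for cls, count in totals.items()}
```

A violation rate that climbs for one task class while latency and error rate stay flat is the kind of signal Desai's 20-step example describes.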

The Observability Primitives You Have to Build

The measurement model is only as good as the observability infrastructure feeding it, and the observability infrastructure most platform teams have is not built to surface the signals that agentic reliability requires.

Khan’s team has moved to probabilistic tracing as the foundation. Standard distributed tracing assumes that a request follows a predictable path. With agents, the path itself is a decision. “We emit trace annotations for decision points,” Khan explains, “which tool was invoked, what reasoning chain was followed, what confidence score triggered the choice.” This gives his team the ability to correlate reliability failures with specific behavioral patterns rather than just infrastructure faults.
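A decision-point annotation can be as simple as a structured record emitted alongside the trace. The sketch below uses only the Python standard library and writes one JSON line per decision. The field names are assumptions modeled on Khan's description (tool invoked, reasoning followed, confidence score), not his team's actual schema, and a real deployment would more likely attach these fields to spans in an existing tracing system.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class DecisionPoint:
    """One agent decision, recorded at the moment it is made (hypothetical schema)."""
    trace_id: str
    step: int
    tool: str               # which tool the agent chose to invoke
    reasoning_summary: str  # short description of the reasoning chain
    confidence: float       # score that triggered the choice
    timestamp: float = field(default_factory=time.time)


def to_json_line(decision: DecisionPoint) -> str:
    """Serialize a decision point as a single JSON line with stable key order."""
    return json.dumps(asdict(decision), sort_keys=True)


def emit(decision: DecisionPoint, sink) -> None:
    """Write the record to any object with a write() method, such as a log file."""
    sink.write(to_json_line(decision) + "\n")
```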

Behavioral drift dashboards that compare current agent output distributions against baselines from previous weeks have proven more predictive of incidents than traditional alerting, triggering human review before customers notice degradation.
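The article's sources do not say which drift statistic these dashboards use. As one illustrative option, the sketch below compares categorical outcome labels from a baseline week against the current window using total variation distance, which is 0 for identical distributions and 1 for completely disjoint ones. The label values and any alert threshold are assumptions.

```python
from collections import Counter


def distribution(labels):
    """Convert a sequence of labels into a label -> probability mapping."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot build a distribution from no observations")
    return {label: count / total for label, count in counts.items()}


def total_variation(baseline_labels, current_labels) -> float:
    """Total variation distance between two empirical label distributions."""
    p = distribution(baseline_labels)
    q = distribution(current_labels)
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)
```

A baseline of 80% "pass" and 20% "retry" compared with a current window of 50% "pass," 30% "retry" and 20% "abort" yields a distance of about 0.3, a shift that no availability metric would register.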

The decision point is the unit of observation in an agentic system, not the request. Tracing infrastructure that cannot record what the agent decided and what that decision was based upon cannot support the kind of incident attribution that agentic reliability requires. Building that tracing capability before the system goes to production rather than retrofitting it after the first incident is the highest-leverage investment a platform team can make when onboarding agentic workloads.

Zakutynskyi’s observability approach logs every request with its exact prompt, model version and policy tags alongside intermediate reasoning traces and tool calls. Logging at that granularity separates transport-level failures from AI decision errors, which is what makes schema validation and trace-based replays possible and allows his team to pinpoint whether a bad result came from network noise or from the agent’s logic.

Anbumani adds platform-level indicators to the observability picture, covering accelerator health, reset frequency, retry amplification and resource pressure alongside standard request metrics, making the full platform stack visible rather than just the application layer.

Observability for agentic systems has to be designed into the data model from the start. Prompt logging, model version tagging and policy annotations are not add-ons that can be layered onto an existing observability stack. They require the system to emit structured data at the point of decision, which means that the observability requirements have to inform the agent architecture rather than follow from it.

Error Budgets in an Agentic World

The error budget question is where existing frameworks show their deepest limitations and where most platform teams are still operating without a clear answer. “The error budget model we inherited assumes failures are attributable and bounded,” Khan observes. “When an agentic component is in the causal chain of an incident alongside human-authored services, attribution becomes genuinely difficult. Did the agent fail because the model drifted, because the input it received was outside its training distribution or because a human-authored service upstream gave it bad context to reason from? Each of those failure modes has a different owner and a different fix, but the incident looks identical from the outside.”

At TestMu AI, the team has moved toward separate budget categories for infrastructure failures, inference failures and behavioral correctness failures, tracking each independently rather than aggregating them into a single error budget that obscures the causal structure of what is actually going wrong.

Error budget reviews have to change in format as well as content. A review that aggregates all failure types into a single burn rate number cannot support the conversation that agentic reliability requires, which is a conversation about which category of failure is consuming the budget, why and which team owns the response. Separating the budgets is what makes that conversation possible, and building the attribution tooling that populates each budget accurately is the prerequisite for the separation to be meaningful rather than cosmetic.
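A minimal sketch of separated budgets follows, assuming each event has already been attributed to one of the three categories. That attribution step is the hard part and is not shown. The class name and the example targets in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Event-based error budget for one failure category."""
    category: str
    slo_target: float  # e.g. 0.99 means 1% of events may fail
    total_events: int = 0
    bad_events: int = 0

    def record(self, ok: bool) -> None:
        self.total_events += 1
        if not ok:
            self.bad_events += 1

    @property
    def consumed(self) -> float:
        """Fraction of allowed failures used so far (1.0 means exhausted)."""
        allowed = (1.0 - self.slo_target) * self.total_events
        if allowed == 0:
            return float("inf") if self.bad_events else 0.0
        return self.bad_events / allowed


def burn_report(budgets):
    """Return (category, consumed) pairs, highest consumption first."""
    return sorted(
        ((b.category, b.consumed) for b in budgets),
        key=lambda pair: pair[1],
        reverse=True,
    )
```

Reviewing the output of `burn_report` for separate infrastructure, inference and behavioral budgets puts the "which category, and whose problem" question at the top of the review, where a single aggregate burn rate would bury it.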

The behavioral correctness budget in particular has no industry standard to calibrate against, and defining acceptable thresholds empirically from production data is currently the only option available to most teams. That is an uncomfortable position for organizations used to having external benchmarks to validate their reliability targets against, and it is the clearest signal of how nascent the field still is in developing the shared infrastructure that mature reliability engineering requires.

What the Field Still Needs

The gap that Khan identifies from his team’s experience of building these systems is the same gap that every contributor to this piece identifies from their respective vantage points. “We still lack good industry standards for what acceptable non-determinism looks like,” Khan observes. “We have defined these thresholds empirically, but there is no equivalent of 99.9% availability for agent correctness. That is an open problem the industry needs to solve.”

The layered SLO model, the four-dimensional framework, the probabilistic tracing and the behavioral drift dashboards are all real and deployable responses to a real problem. But they are responses built in the absence of the shared standards that would make them interoperable, auditable and comparable across organizations.

The convergence that Khan and Zakutynskyi have reached independently on the layered SLO model is the kind of empirical evidence that standards bodies and platform engineering communities should be building from. Translating that convergence into shared specifications is unglamorous work that does not generate benchmark numbers that make AI adoption announcements compelling — but it is the work that determines whether the next generation of agentic infrastructure is built on a foundation that holds, and that work is available to any team willing to contribute what they have learned to the conversation rather than keeping it as internal knowledge.
