Platform engineering standardized delivery, but most platforms still respond after something breaks. The next shift is toward autonomous infrastructure that predicts issues, reallocates resources and resolves incidents before they surface. 

This evolution reframes platform teams from builders of self-service environments to operators of continuously learning systems. 

Yasmin Rajabi, COO at CloudBolt, explains that automated means the system executes a plan in response to a defined trigger. 

For example, an engineer might configure a rule that says, “if CPU exceeds 80%, scale out.”  
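A rule like that can be sketched in a few lines. This is a minimal, hypothetical illustration of "automated" in Rajabi's sense: a fixed trigger paired with a fixed response, with no learning involved.

```python
THRESHOLD = 80.0  # percent CPU, as in the example rule

def automated_decision(cpu_percent: float) -> str:
    """Execute a predefined plan when a defined trigger fires."""
    if cpu_percent > THRESHOLD:
        return "scale_out"
    return "no_action"
```

The system never asks whether 80% is still the right line to draw; that judgment stays with the engineer who wrote the rule.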

“Autonomous means the system watches for triggers, makes a decision based on learned behavior, and acts without a human getting involved,” she says. “This difference can be seen in Kubernetes resource management as well.” 

For example, an automated system applies VPA recommendations on a schedule. An autonomous system, by contrast, learns a workload’s scaling profile as it changes every Monday morning, adjusts requests ahead of the spike, and rolls back if application health degrades, all without anyone pushing a button. 
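The autonomous side of that contrast can be sketched as a small control loop. Everything here is hypothetical (the class name, the time-slot keys, the 20% headroom factor); it only illustrates the three behaviors described above: learning a workload's profile, adjusting ahead of a known spike, and rolling back on degraded health.

```python
from statistics import mean

class AutonomousRightsizer:
    """Sketch of an autonomous loop: learn typical demand per time slot,
    raise requests ahead of a learned spike, revert if health degrades."""

    def __init__(self):
        self.history = {}           # time slot -> observed CPU samples
        self.previous_request = None

    def observe(self, slot: str, cpu: float):
        """Record demand seen in a slot (e.g. 'monday_09')."""
        self.history.setdefault(slot, []).append(cpu)

    def recommend_request(self, upcoming_slot: str, current_request: float) -> float:
        """Adjust the resource request before the spike, based on learned behavior."""
        samples = self.history.get(upcoming_slot)
        if not samples:
            return current_request                     # nothing learned yet
        self.previous_request = current_request
        return max(current_request, mean(samples) * 1.2)  # 20% headroom

    def rollback_if_degraded(self, healthy: bool, current_request: float) -> float:
        """Roll back the change automatically if application health degrades."""
        if not healthy and self.previous_request is not None:
            return self.previous_request
        return current_request
```

The key difference from the scheduled-VPA case is that the trigger, the magnitude of the change, and the rollback all come from observed behavior rather than a human-authored schedule.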

“We’re still early in that journey,” Rajabi adds. “Most ‘autonomous’ platforms today are really automated platforms with better data.” 

Jim Mercer, IDC program vice president, software development, DevOps and DevSecOps, explains that the transition from automated to autonomous represents a shift from executing instructions to achieving outcomes. 

“Automated pipelines that follow code or scripts remain rigid,” he says. “An autonomous pipeline uses intent-based intelligence to navigate complex environments with no human intervention.” 

Mercer agrees that the reality developers are working toward is semi-autonomy, where a human still plays a role in approvals. 

“The appetite for full autonomy isn’t there in most organizations,” Mercer says.  

Implementing Predictive Capabilities  

As organizations begin building predictive capabilities into their infrastructure, certain data signals matter most. Jean-Philippe Leblanc, senior vice president of engineering, CircleCI, suggests starting with historical failure patterns. 

“These include pipeline duration drift, flaky test rates, resource saturation curves,” he says. “The signals that matter are the ones that appear minutes before failure, not after. That gap is where most teams are still stuck.” 
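One of the signals Leblanc names, pipeline duration drift, lends itself to a simple statistical check. The sketch below is a hypothetical illustration, not any vendor's implementation: it flags a pipeline whose recent run durations have drifted well outside the historical baseline, the kind of deviation that appears before an outright failure.

```python
from statistics import mean, stdev

def duration_drift(recent: list[float], baseline: list[float], z: float = 3.0) -> bool:
    """Flag drift when recent mean duration exceeds the baseline mean
    by more than z standard deviations of the baseline."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(recent) != mu
    return (mean(recent) - mu) / sigma > z
```

In practice a team would feed this from CI telemetry and alert on the drift itself, closing the "minutes before failure" gap Leblanc describes.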

Donnie Page, solutions engineer with Itential, notes that customers are moving away from static thresholds, where 80% defines the maximum regardless of conditions, toward condition-aware systems that use machine learning or AI to learn workload patterns based on seasonality. 

“A predictive system understands the difference between a morning login storm and a DDoS attack,” he says. 
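That distinction can be sketched with a seasonality-aware check. This is a hypothetical illustration: instead of comparing load against a fixed ceiling, it compares against what is normal for that hour of day, so the same load reads as routine at 9 a.m. and anomalous at 3 a.m.

```python
from statistics import mean, stdev

def is_anomalous(load: float, hour: int,
                 history: dict[int, list[float]], z: float = 3.0) -> bool:
    """Judge load against the learned baseline for this hour,
    not against a static threshold."""
    samples = history.get(hour, [])
    if len(samples) < 2:
        return False                     # not enough data to judge
    mu, sigma = mean(samples), stdev(samples)
    if sigma == 0:
        return load != mu
    return abs(load - mu) / sigma > z
```

A fixed 80% rule would treat both cases identically; the seasonal baseline is what separates the expected login storm from the attack.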

Changes for Platform Teams 

Mercer says that as systems transition from human-directed automation to autonomous decision-making, the day-to-day reality for platform engineering teams shifts, much like developers adopting AI. 

“The role shifts from being the first responder or scriptwriter to being more of a policy architect,” he says.  

From Leblanc’s perspective, the job shifts from building automation to setting policy. 

“You stop writing runbooks and start defining boundaries,” he says. “The team becomes a governance layer.” 

Rajabi points out the accountability model changes, too, where explainability and predictability become requirements. 

“When a human makes a resource change and it causes an issue, the solution is clear: roll back the change,” she says. “When the system makes a change that degrades performance, you need to understand why the model made that decision, what data it used, and whether the policy constraints were correct.” 

Human Oversight Critical   

Rajabi says humans need oversight in three areas: Policy definition, exception handling, and blast radius decisions. 

“The system can rightsize a workload, but a human should define what guardrails are appropriate for the environment,” she says. 

While the system can detect drift and reconcile it, a human should decide whether a specific exception is warranted, for example, a workload that is intentionally over-provisioned. 

“Lastly, as autonomous changes start to expand their scope, there should be safety mechanisms for gradual rollout,” she says.  

From her perspective, the pattern that works is progressive autonomy: Start with the system recommending, graduate to the system acting on low-risk workloads, and as you gain trust through demonstrated reliability, start to hand over high-risk workloads. 
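The progressive-autonomy pattern reduces to a small policy function. The sketch below is hypothetical (the risk labels, trust score, and thresholds are illustrative, not from any product): autonomy is granted in stages, gated by demonstrated reliability.

```python
def autonomy_mode(workload_risk: str, trust_score: float) -> str:
    """Progressive autonomy: recommend first, act on low-risk workloads
    once baseline trust exists, hand over high-risk workloads only
    after sustained demonstrated reliability."""
    if trust_score < 0.5:
        return "recommend_only"          # stage 1: system only recommends
    if workload_risk == "low":
        return "act_autonomously"        # stage 2: act on low-risk workloads
    if trust_score >= 0.9:
        return "act_autonomously"        # stage 3: high-risk, earned trust
    return "recommend_only"
```

The trust score here stands in for whatever reliability evidence an organization tracks, such as the rate of autonomous changes that held without rollback.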

Mercer explains that while the system can now handle scaling and remediation, human oversight remains the essential safeguard for areas involving ambiguity, ethics, and high-level strategy. 

“For example, autonomous systems rely on patterns, so when a black swan event occurs, which is an unprecedented scenario with no historical data, an AI autonomous system could fail without human intervention,” he says.  

Cultural Change Challenges  

Leblanc says that in moving from reactive operations to autonomous systems, the challenges are mostly cultural.  

“The real friction is identity,” he says. “People’s sense of value is tied to being the ones who fix things. Autonomous systems threaten that directly.” 

Page explains that culturally, it’s hard to always trust that a bleeding-edge technology won’t delete production systems without being watched. 

“Engineers must see the work and understand why a choice was made before they begin to feel comfortable,” he says.  

Mercer notes the shift toward autonomous systems is less about upgrading tools and more about fundamentally re-engineering the relationship between humans and infrastructure.  

“Culturally, moving to autonomy requires engineers to relinquish the keyboard-level control that has defined their careers and is not always an easy transition,” he says.
