
As organizations scale, platform engineering becomes essential to creating a more efficient, consistent, and scalable process for software development and deployment.
Gartner has predicted that 80% of large software engineering organizations will establish platform engineering teams by 2026 – up from 45% in 2022.
“Platform teams must rethink how infrastructure is designed, deployed and delivered, treating it as a governed, reusable product rather than a collection of ad-hoc systems,” says Kevin Cochrane, CMO of Vultr.
By codifying infrastructure into composable, policy-driven artifacts, platform engineering transforms it from a bottleneck into a force multiplier, enabling AI development to scale safely, consistently and with minimal friction.
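As a rough illustration of what “policy-driven artifacts” can look like in practice, the sketch below validates a declarative infrastructure definition against a few organizational policies before it ever reaches a provisioning pipeline. Every name, field, and rule here is a hypothetical assumption, not drawn from any particular vendor’s tooling.

```python
# Minimal policy-as-code sketch (illustrative only): validate a declarative
# infrastructure artifact against simple organizational policies before deploy.

ARTIFACT = {
    "name": "ml-training-cluster",
    "region": "ewr",                 # hypothetical region code
    "gpu_nodes": 4,
    "public_ingress": False,
    "tags": {"owner": "platform-team", "cost-center": "ai-research"},
}

POLICIES = [
    ("no public ingress", lambda a: not a["public_ingress"]),
    ("owner tag required", lambda a: "owner" in a.get("tags", {})),
    ("gpu quota <= 8 nodes", lambda a: a.get("gpu_nodes", 0) <= 8),
]

def validate(artifact: dict) -> list[str]:
    """Return the names of any policies the artifact violates."""
    return [name for name, check in POLICIES if not check(artifact)]

if __name__ == "__main__":
    violations = validate(ARTIFACT)
    if violations:
        raise SystemExit(f"Blocked by policy: {violations}")
    print("Artifact passes all policies; safe to hand to the provisioning pipeline.")
```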
Jorge Martins, senior director of engineering at Lokalise, says platform engineering isn’t just about giving developers tools; it’s about ensuring those tools are always available, observable, and continuously improving.
“One fundamental shift is from viewing the platform as a project to treating it as a product, with internal developers viewed as the organization’s customers and their experience as the primary measure of success,” he explains.
This “Platform as a Product” mindset changes everything. Instead of being ticket-takers, platform engineers become product managers obsessed with their customers’ workflows.
Reliability and observability aren’t features that are bolted on but are core to the user experience.
“You define explicit SLOs for your platform services, you gather feedback through developer satisfaction surveys, and you build ‘paved paths’ that make the secure, reliable choice the easiest choice for every engineer,” Martins says.
Practices such as chaos testing and incident retrospectives help teams identify weaknesses and build resilience into their platforms.
“They are the two sides of the learning coin: one proactive, one reactive,” Martins explains. “They are non-negotiable practices for any serious engineering organization.”
He says chaos engineering isn’t about breaking things randomly; it’s the disciplined, scientific practice of injecting precise failures to verify your assumptions about how a system behaves under stress.
“The goal is not to find who to blame, but to uncover the systemic weaknesses—in tooling, process, or communication—that allowed the incident to happen,” Martins says.
Together, they create a powerful, continuous loop of building, breaking and learning that forges true resilience.
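In code, a chaos experiment can be as small as a wrapper that injects failures at a controlled rate plus a test that verifies the stated hypothesis. The sketch below is purely illustrative, with hypothetical function names; it checks the assumption that a failing dependency produces a degraded response for the user rather than an error.

```python
import random

# Hypothetical downstream call; names are illustrative, not a real library's API.
def fetch_profile(user_id: str) -> dict:
    return {"id": user_id, "plan": "pro"}

def flaky(fn, failure_rate: float):
    """Chaos wrapper: inject failures into a dependency at a controlled rate."""
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise ConnectionError("injected failure")
        return fn(*args, **kwargs)
    return wrapped

def get_profile(fetch, user_id: str) -> dict:
    """Assumption under test: on dependency failure we degrade to a default profile."""
    try:
        return fetch(user_id)
    except ConnectionError:
        return {"id": user_id, "plan": "unknown"}   # graceful degradation

if __name__ == "__main__":
    random.seed(7)
    chaotic_fetch = flaky(fetch_profile, failure_rate=0.5)
    results = [get_profile(chaotic_fetch, "u42") for _ in range(1_000)]
    # Hypothesis: every request still gets a usable response, never an exception.
    assert all(r["id"] == "u42" for r in results)
    degraded = sum(r["plan"] == "unknown" for r in results)
    print(f"All 1000 requests served; {degraded} degraded gracefully under injected failures.")
```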
AIOps-Driven Monitoring
Cochrane says there are many ways AIOps-driven monitoring can be used to predict and prevent issues.
For example, capabilities such as anomaly detection and predictive alerting establish baselines for signals such as memory, latency, error rates, and traffic, then detect deviations early, before thresholds are breached.
Meanwhile, root cause inference can correlate events such as logs and traces and map them to probable causes.
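A minimal version of that baselining idea might look like the following sketch, which keeps a rolling window of latency samples and flags deviations from the recent baseline well before a hard alert threshold would fire. The window size, z-score threshold, and metric values are illustrative assumptions.

```python
import statistics
from collections import deque

class BaselineDetector:
    """Rolling-baseline anomaly detector: flags points that drift far from recent history."""

    def __init__(self, window: int = 60, z_threshold: float = 3.0):
        self.history = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, value: float) -> bool:
        """Return True if the new sample deviates anomalously from the baseline."""
        anomalous = False
        if len(self.history) >= 10:                     # need some history first
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9
            anomalous = abs(value - mean) / stdev > self.z_threshold
        self.history.append(value)
        return anomalous

if __name__ == "__main__":
    detector = BaselineDetector()
    normal_latency_ms = [100 + (i % 7) for i in range(60)]   # steady baseline around 100 ms
    drifting = [130, 150, 180, 240]                          # creeping regression
    for sample in normal_latency_ms + drifting:
        if detector.observe(sample):
            print(f"Predictive alert: latency {sample} ms deviates from baseline")
```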
“The platform engineering model, which is characterized by modular, standardized, observable stacks, proves to be a strong enabler for AIOps,” Cochrane says.
He adds that when a stack is built with good telemetry, well-defined artifacts, and predictable behavior, it’s far easier to ingest signals and make forecasts.
Culture of Reliability
Derek Ashmore, AI enablement principal at Asperitas, says platform teams stop being “reactive support” when leadership treats the platform as a product with users, not a set of internal tools.
That means giving platform engineers ownership of reliability outcomes, clear SLAs, and the space to prioritize automation and design work over constant ticket firefighting.
“Culturally, teams must embrace blameless retrospectives, continuous learning, and shared accountability between developers and platform engineers,” he says. “Reliability isn’t owned by one group; it’s a cross-functional goal.”
When developers trust the platform and the platform team sees itself as an enabler of business velocity, reliability naturally becomes strategic.
Resilience and Failing Gracefully
Martins points out that a resilient platform isn’t one that never fails; it’s one that “fails gracefully,” designed to absorb and recover from failures quickly, with minimal or zero user impact.
“Architecturally, this means it’s built on principles like fault isolation, graceful degradation, and loose coupling,” he says.
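One common way those principles show up in code is a circuit breaker: after repeated failures, the caller stops hitting the unhealthy dependency (fault isolation) and serves a degraded response instead (graceful degradation). The sketch below is a simplified illustration, not any particular library’s API.

```python
import time

class CircuitBreaker:
    """Tiny circuit breaker: isolate a failing dependency and degrade gracefully."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While the circuit is open, skip the dependency entirely (fault isolation).
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()
            self.opened_at = None          # half-open: allow a trial request
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:                  # broad catch is fine for a sketch
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()              # graceful degradation, not an outage

# Hypothetical usage: a search page that falls back to cached results.
breaker = CircuitBreaker()

def live_search():
    raise ConnectionError("search backend down")

def cached_results():
    return ["cached-result-1", "cached-result-2"]

for _ in range(5):
    print(breaker.call(live_search, cached_results))
```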
Measurement moves from reactive metrics like Mean Time To Resolution (MTTR) to proactive ones like Service Level Objectives (SLOs) and error budgets.
“These tell you the actual reliability experienced by your users and give you a data-driven way to manage risk,” he says.
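The arithmetic behind an error budget fits in a few lines. This sketch, using illustrative numbers, turns a 99.9% availability SLO over a 30-day window into a downtime budget and a simple release policy.

```python
# Error budget arithmetic for an availability SLO (illustrative numbers).

SLO_TARGET = 0.999            # 99.9% availability objective
WINDOW_DAYS = 30
window_minutes = WINDOW_DAYS * 24 * 60                      # 43,200 minutes

error_budget_minutes = window_minutes * (1 - SLO_TARGET)    # about 43.2 minutes

observed_downtime_minutes = 12.0                            # from monitoring, hypothetical
budget_burned = observed_downtime_minutes / error_budget_minutes

print(f"Error budget: {error_budget_minutes:.1f} min per {WINDOW_DAYS} days")
print(f"Budget burned so far: {budget_burned:.0%}")

# A common policy: if most of the budget is burned, slow down risky releases.
if budget_burned > 0.75:
    print("Freeze risky deploys; spend engineering time on reliability.")
else:
    print("Budget healthy; keep shipping.")
```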
Martins adds that the ultimate, qualitative measure is your team’s deployment confidence: Do your engineers deploy code on a Friday afternoon without fear? Is failure treated as a learning opportunity, not a catastrophe?
“When your team stops holding its breath on every release, you know you’ve moved beyond firefighting and are building a truly resilient culture,” he says.