Why Platform Engineering is Really a Toil Problem, Not a Tooling Problem

Platform engineering keeps getting framed as a tooling problem, with most of the conversation centered on which IDP to pick or how to design golden paths. The framing is wrong: platform engineering is a toil problem, and tooling is one of the levers, not the goal.

Toil is the SRE term for manual, repetitive, automatable work that produces no enduring value and scales linearly with infrastructure size.

The Toil You’re Not Counting

Manual provisioning scales fine when 20 Kubernetes clusters are supporting a handful of product teams; a platform team of three or four engineers can field 20 to 30 weekly requests without burning out. The model breaks somewhere between 100 and 200 clusters. A team supporting hundreds of clusters across dozens of product teams ends up fielding roughly a thousand operational interventions a month (provisioning, configuration changes, troubleshooting, capacity adjustments, occasional emergencies), consuming multiple full-time engineers who should be building infrastructure instead of running the line.

The same engineers who were hired to design and improve systems end up spending most of their week operating them. That gap between what they’re paid to do and what they actually do is the toil tax, and while it doesn’t show up cleanly on dashboards, it shows up in headcount budgets and quarterly planning.

What Replaces the Toil: Self-Service Architecture

The pattern that emerges in successful platform implementations has five components.

Templated infrastructure-as-code. Teams don’t author Crossplane compositions or Terraform modules from scratch. They use pre-built templates: a Kubernetes cluster template with standard configuration, security policies, and monitoring built in; a database template with encryption, backups, and monitoring configured; a storage template with lifecycle policies and access controls; a networking template following approved VPC and security patterns. Teams customize parameters like cluster size, database type, and region but don’t touch low-level configuration. The constraint eliminates roughly 80% of provisioning errors before they happen.
GitOps deployment pipeline. Infrastructure changes flow through Git. A team opens a PR containing the infrastructure request, built from templates, and automated validation checks it against security policies, cost limits, and quota availability. The pipeline posts the provisioning plan as a comment on the PR showing the exact changes; an approved PR triggers provisioning, and the results post back with success confirmation, resource URLs, and connection info. Teams provision at their own pace, with no waiting on the infrastructure team.
Policy-as-code guardrails. Self-service doesn’t mean uncontrolled access; policies enforce standards. Resource limits cap node count per team for clusters, storage per instance for databases, and storage per team per region for object storage. Security policies require encryption at rest, prohibit public database access, mandate TLS on all ingress, and restrict deployments to approved container registries. Cost controls prevent teams from exceeding budget limits and automatically alert at 80% utilization. Policies get enforced automatically in the provisioning pipeline, and non-compliant requests get rejected with clear error messages explaining what’s wrong and how to fix it.
Self-service portal. Not everyone wants to write YAML. A web portal handles common operations: cluster provisioning via a form-based wizard, cluster scaling via a slider for node count, database creation via dropdowns for instance types, cost views showing current spend and projected monthly cost, and troubleshooting guides for common issues. The portal generates IaC behind the scenes and submits the PR automatically; teams approve in the UI and automation handles deployment.
Observable infrastructure. Self-service requires visibility. Teams need to see what infrastructure they own, current resource utilization, cost attribution, health status, and recent changes; dashboards expose all of it, so teams debug their own infrastructure without escalating to the platform team.
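
To make the guardrail component concrete, here is a minimal sketch of the kind of check a provisioning pipeline might run before a request is approved. It is not tied to any particular policy engine; the request fields, the node limit, and the registry prefix are all hypothetical values chosen for illustration. Only the rules themselves (node caps, encryption at rest, no public database access, TLS on ingress, approved registries, an alert at 80% of budget) come from the description above.

```python
from dataclasses import dataclass

# Hypothetical limits and registry prefix, chosen only for illustration.
MAX_NODES_PER_TEAM = 50
APPROVED_REGISTRIES = ("registry.internal.example/",)
BUDGET_ALERT_THRESHOLD = 0.80  # alert at 80% budget utilization


@dataclass
class ClusterRequest:
    """A hypothetical, simplified infrastructure request."""
    team: str
    node_count: int
    encryption_at_rest: bool
    public_database_access: bool
    ingress_tls: bool
    image: str


def validate(request: ClusterRequest, nodes_in_use: int) -> list[str]:
    """Return human-readable violations; an empty list means compliant."""
    violations = []
    if nodes_in_use + request.node_count > MAX_NODES_PER_TEAM:
        violations.append(
            f"Node limit exceeded: team {request.team} would use "
            f"{nodes_in_use + request.node_count} of {MAX_NODES_PER_TEAM} nodes. "
            "Reduce node_count or request a quota increase."
        )
    if not request.encryption_at_rest:
        violations.append("Encryption at rest is required. Set encryption_at_rest to true.")
    if request.public_database_access:
        violations.append("Public database access is prohibited. Use private networking.")
    if not request.ingress_tls:
        violations.append("TLS is required on all ingress. Set ingress_tls to true.")
    # str.startswith accepts a tuple of allowed prefixes.
    if not request.image.startswith(APPROVED_REGISTRIES):
        violations.append(
            f"Image {request.image} is not from an approved registry. "
            f"Use one of: {', '.join(APPROVED_REGISTRIES)}"
        )
    return violations


def budget_alert(spend: float, budget: float) -> bool:
    """True once spend reaches the alert threshold of the team budget."""
    return budget > 0 and spend / budget >= BUDGET_ALERT_THRESHOLD
```

Each violation message says what is wrong and how to fix it, which is what keeps a rejected request from turning into a ticket for the platform team.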

What Gets Eliminated

When the platform is built well, operational toil drops significantly. The remaining work falls into three categories: edge cases like unusual configurations and policy exceptions, complex troubleshooting for infrastructure bugs and cloud provider issues, and architecture consultation for new designs and best-practice reviews. That’s the slice that genuinely needs human judgment; everything else is handled by the platform.

The freed capacity is the real win. Engineers redirect from triage to actual engineering work: better templates, deeper automation, cost optimization, reliability investments, and new capabilities. The capacity compounds because each new investment further reduces the next quarter’s operational toil.

Provisioning time drops from days (wait for the request, engineer schedules, manual provisioning, validation) to hours (PR approval, automated deployment), and error rates collapse because automated provisioning with pre-tested templates eliminates the vast majority of configuration errors that plague manual provisioning.

The Hard Parts

Template maintenance is itself ongoing toil: new Kubernetes versions require template updates, security policies change, cloud provider APIs evolve, and best practices improve. Expect roughly one engineer at 20% time on template maintenance, which is still much better than multiple engineers on operational triage.

Policy balance is genuinely hard: too restrictive means teams can’t get work done, too permissive means chaos. The right approach is iterative: start restrictive and loosen based on requests, grandfather existing infrastructure while enforcing new standards on new resources, and provide an exception process for legitimate edge cases.
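
The grandfathering and exception rules can be sketched as a small decision function. This is an illustration under stated assumptions, not a prescribed implementation: the cutoff date, the exception list, and the resource names are hypothetical.

```python
from datetime import date

# Hypothetical cutoff: resources created before this date are grandfathered.
ENFORCEMENT_START = date(2024, 1, 1)

# Hypothetical exceptions granted through the exception process.
APPROVED_EXCEPTIONS = {"legacy-reporting-db"}


def decide(resource_name: str, created_on: date, compliant: bool) -> str:
    """Decide how the pipeline treats a resource under the current policy."""
    if compliant:
        return "allow"
    if resource_name in APPROVED_EXCEPTIONS:
        return "allow (exception)"
    if created_on < ENFORCEMENT_START:
        # Existing infrastructure keeps running but is flagged for follow-up.
        return "warn (grandfathered)"
    # New resources must meet the standard.
    return "reject"
```

Loosening a policy then means editing the rule or adding an exception in one place, rather than negotiating each request by hand.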

User education matters more than most teams expect because self-service requires teams to understand infrastructure. Documentation, office hours, training, and a community support channel all contribute, and the initial investment pays off because teams become self-sufficient and stop escalating the same problems repeatedly.

Lessons from Production Self-Service Platforms

Templates beat documentation as a starting point. “Read the docs and roll your own” doesn’t work; templates that teams can customize work because people learn by example.

The easy thing has to be the right thing. If compliant infrastructure is harder to provision than non-compliant infrastructure, teams will take shortcuts, so templates need to be easy and policies need to be enforced automatically.

Progressive self-service beats a big-bang launch. The right rollout starts with one team, validates workflows, iterates, then expands. A 6-month pilot with a small set of teams before general availability is typical.

The right metric is toil, not tickets. Tickets are a downstream symptom; the leading metric is engineering hours spent on repetitive work versus engineering hours spent on improvement work. If teams still escalate certain operations, that’s where the self-service UX needs improvement.
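
As a rough illustration of tracking this metric, the sketch below computes the share of hours spent on toil and ranks the operations that consume the most of it. The time-entry format (operation, kind, hours) and the operation names are assumptions made for this example; real data would come from whatever time tracking or ticket tagging a team already has.

```python
from collections import defaultdict


def toil_ratio(entries):
    """Fraction of logged hours spent on toil.

    entries: iterable of (operation, kind, hours) tuples, where kind is
    either "toil" or "improvement" (a hypothetical tagging scheme).
    """
    entries = list(entries)
    total = sum(hours for _, _, hours in entries)
    toil = sum(hours for _, kind, hours in entries if kind == "toil")
    return toil / total if total else 0.0


def top_toil_sources(entries, n=3):
    """Operations with the most toil hours: candidates for better self-service."""
    hours_by_operation = defaultdict(float)
    for operation, kind, hours in entries:
        if kind == "toil":
            hours_by_operation[operation] += hours
    ranked = sorted(hours_by_operation.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical week of time entries.
week = [
    ("cluster-scaling", "toil", 12.0),
    ("db-provisioning", "toil", 8.0),
    ("template-work", "improvement", 20.0),
]
print(toil_ratio(week))        # 0.5
print(top_toil_sources(week))  # [('cluster-scaling', 12.0), ('db-provisioning', 8.0)]
```

The ranked list is the useful part: the operations at the top are where teams are still escalating, so they are where the self-service UX needs work first.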

Eliminated work deserves celebration. Self-service isn’t glamorous, but when a team provisions infrastructure in 2 hours that previously took 5 days, that win deserves attention, and sharing wins reinforces the cultural shift from operator to platform-builder.

When This is Worth Building

If the infrastructure team spends over 40% of its time on operational triage, self-service pays for itself. Engineering investment is roughly 6 to 12 engineer-months to build infrastructure templates, the GitOps pipeline, policy enforcement, the self-service portal, and the supporting documentation and training. Toil reduction typically frees multiple engineers from triage, paying back the investment in a few months.
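
The payback arithmetic can be sketched in a few lines. The 40% threshold and the 6 to 12 engineer-month investment come from the figures above; the number of engineers freed (three in the example) and the hour counts are assumptions for illustration, and the model deliberately ignores ramp-up time and ongoing maintenance.

```python
TRIAGE_THRESHOLD = 0.40  # share of team time on triage above which self-service pays off


def worth_building(triage_hours: float, total_hours: float) -> bool:
    """True if the team's triage share exceeds the 40% threshold."""
    return total_hours > 0 and triage_hours / total_hours > TRIAGE_THRESHOLD


def payback_months(investment_engineer_months: float, engineers_freed: float) -> float:
    """Months until freed capacity equals the build investment.

    Simplifying assumption: each freed engineer returns one engineer-month
    of capacity per calendar month, starting as soon as the platform ships.
    """
    if engineers_freed <= 0:
        raise ValueError("engineers_freed must be positive")
    return investment_engineer_months / engineers_freed


# Hypothetical team: 290 of 640 monthly hours on triage, 3 engineers freed.
print(worth_building(290, 640))  # True
print(payback_months(6, 3))      # 2.0
print(payback_months(12, 3))     # 4.0
```

Under these assumptions the investment is recovered in two to four months, which is consistent with "a few months"; freeing fewer engineers stretches the payback proportionally.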

Self-service infrastructure isn’t about removing infrastructure teams from the loop; it’s about removing the loop itself. When the team’s job stops being execution and starts being design, that’s the toil problem solved, not the tooling problem.
