Platform engineers were promised a new era of automation, but the reality has been far more complex. Instead of streamlined systems, they face a tangled mix of models, GPUs, and “urgent” AI roadmaps. They must keep these AI stacks running by gluing experimental data science to production systems that were never designed for real-time learning. To make matters worse, they must also operationalize workloads that evolve, drift, and consume resources in ways no CI/CD pipeline was ever built to handle.

The consequences are visible everywhere: brittle integrations, runaway costs, and a widening disconnect between data science and delivery. Pipelines choke, infrastructure groans, and collaboration breaks down just when it’s needed most. These are not mere growing pains; they are roadblocks. Growing pains fade as systems mature, but these challenges expose a deeper mismatch between how software was built to run and what AI now demands.

Based on findings from Platform Engineering’s annual survey, The State of AI in Platform Engineering, this article examines where teams are getting stuck and what it will take to build truly AI-native operations.

Talent Deficits and Team Dynamics

One of the biggest challenges in the AI-native journey isn’t technological; it’s human. Fifty-seven percent of organizations cite skill gaps as their top barrier to AI adoption, with over half also stating that they lack the in-house expertise to effectively integrate and manage AI systems. And these gaps extend beyond data science, reaching into DevOps, observability, and security. To build a well-rounded skill set, organizations should embrace a blend of formal training (e.g., certifications and specialized courses on AI), experiential learning (e.g., sandboxes such as cloud development environments, or CDEs, that allow for safe experimentation), and community learning (e.g., conferences, hackathons).

Compounding these challenges, collaboration between platform and data science teams remains fragmented. Thirty-one percent of enterprises report limited interaction, and 16% report none at all. In practice, this means model code and production infrastructure often evolve in isolation, creating fragile systems and operational risk. Bridging these silos is essential. Platform teams need to integrate model development tightly with deployment and monitoring, applying the same DevOps principles—ownership, automation, observability—that drove modern software reliability. Managing models as continuously evolving software components, rather than one-off experiments, builds resilience and trust in AI operations.
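
To make that concrete, here is a minimal sketch (in Python) of what treating a model as a continuously evolving, owned component can look like: promotion is gated on an automated evaluation, and every step emits a metric. The helpers passed in (evaluate_model, emit_metric, deploy_to_staging) and the 0.90 accuracy bar are illustrative assumptions, not prescriptions from the survey.

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class ModelRelease:
    name: str
    version: str
    owner: str                      # team accountable for the model in production
    training_data_snapshot: str     # provenance: which data this version was trained on


def promote(release: ModelRelease,
            evaluate_model: Callable[[ModelRelease], dict],
            emit_metric: Callable[[str, float], None],
            deploy_to_staging: Callable[[ModelRelease], None]) -> bool:
    """Gate promotion the way a service release is gated: checks, metrics, audit trail."""
    metrics = evaluate_model(release)                        # offline evaluation on a holdout set
    emit_metric(f"{release.name}.eval.accuracy", metrics["accuracy"])

    if metrics["accuracy"] < 0.90:                           # assumed quality bar for this sketch
        emit_metric(f"{release.name}.promotion.blocked", 1)
        return False

    deploy_to_staging(release)                               # canary/staging first, never straight to prod
    emit_metric(f"{release.name}.promotion.approved", 1)
    print(f"{release.name}:{release.version} promoted by {release.owner} "
          f"at {datetime.now(timezone.utc).isoformat()}")
    return True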

Integration and Pipeline Bottlenecks

Even mature platform teams find existing architectures and pipelines ill-suited to AI demands. Integration ranks as a top concern, with 51% reporting difficulty embedding AI into existing systems. Legacy systems, monolithic architectures, and technical debt make connecting models to live applications a complex and error-prone process. Emerging protocols such as the Model Context Protocol (MCP), the Agent Communication Protocol (ACP), and Agent2Agent (A2A) are beginning to address these issues by standardizing LLM inputs, enabling inter-agent coordination, and facilitating cross-platform collaboration. But adoption remains fragmented, forcing platform teams to support multiple emerging standards at once.

Next comes delivery. Forty-one percent of teams haven’t yet adapted their CI/CD pipelines to handle model retraining, versioning, and inference deployments. Traditional DevOps flows were never designed for workloads that learn, drift, and redeploy continuously.
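
As an illustration, the sketch below shows the kind of stage such a pipeline might add: check a drift score, retrain, bump the model version, and roll the new version out as a canary. The train.py and deploy.py commands, the registry path, and the drift threshold are hypothetical placeholders for whatever tooling a team actually uses.

import subprocess


def run_model_delivery_stage(model_name: str, current_version: int,
                             drift_score: float, drift_threshold: float = 0.2) -> int:
    """Pipeline stage: retrain and redeploy only when the serving model has drifted."""
    if drift_score <= drift_threshold:
        print(f"{model_name} v{current_version}: no significant drift, skipping retrain")
        return current_version

    new_version = current_version + 1
    # Retraining and deployment are plain CLI calls so this stage can live inside
    # an ordinary CI/CD job; swap in your own commands and registry layout.
    subprocess.run(["python", "train.py", "--model", model_name,
                    "--out", f"registry/{model_name}/v{new_version}"], check=True)
    subprocess.run(["python", "deploy.py", "--model", model_name,
                    "--version", str(new_version), "--strategy", "canary"], check=True)
    print(f"{model_name}: promoted v{new_version} after drift score {drift_score:.2f}")
    return new_version

Keeping retraining and deployment as discrete, versioned steps means the rollback, review, and audit practices teams already rely on for services carry over to models.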

Together, these challenges point to a deeper issue: most integration and delivery patterns were never designed for dynamic, data-driven systems. To overcome that friction, standardization (e.g., developing AI infrastructure templates and blueprints) is emerging as a top priority. Platform teams are turning to lightweight, composable infrastructure that bridges the gap between model development and production, standardizing GPU access, automating environment setup, and enabling inference wherever data resides.

Infrastructure Inequity

Infrastructure remains the weakest link in many organizations’ AI strategies. For most, AI workloads run on a scattered patchwork of environments with inconsistent maturity—from Kubernetes GPU extensions to manual provisioning or opaque scripts that can’t scale effectively. This uneven readiness complicates cost management, performance optimization, and governance across environments.

Given this patchwork, Infrastructure-as-Code (IaC) has emerged as a pivotal practice, bringing the same automation and governance discipline that transformed cloud-native operations to the AI stack. By codifying inference environments with autoscaling, monitoring, and governance baked in, IaC enables platform teams to unify previously siloed layers of models, data, and compute.
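
Here is a minimal sketch of the idea, assuming a Kubernetes target: an inference environment defined once in code and rendered to manifests, with GPU limits, autoscaling, and a monitoring annotation included by default. The names, labels, and thresholds are illustrative and not tied to any specific IaC tool.

import yaml  # PyYAML


def inference_environment(model: str, image: str, gpu_per_pod: int = 1,
                          min_replicas: int = 1, max_replicas: int = 8) -> str:
    """Render a Deployment plus HorizontalPodAutoscaler for one inference service."""
    deployment = {
        "apiVersion": "apps/v1", "kind": "Deployment",
        "metadata": {"name": f"{model}-inference", "labels": {"app": model}},
        "spec": {
            "selector": {"matchLabels": {"app": model}},
            "template": {
                "metadata": {
                    "labels": {"app": model},
                    # scrape annotation so monitoring is part of the template, not an afterthought
                    "annotations": {"prometheus.io/scrape": "true"},
                },
                "spec": {"containers": [{
                    "name": "server", "image": image,
                    "resources": {"limits": {"nvidia.com/gpu": gpu_per_pod}},
                }]},
            },
        },
    }
    autoscaler = {
        "apiVersion": "autoscaling/v2", "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{model}-inference"},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment",
                               "name": f"{model}-inference"},
            "minReplicas": min_replicas, "maxReplicas": max_replicas,
            "metrics": [{"type": "Resource", "resource": {
                "name": "cpu", "target": {"type": "Utilization", "averageUtilization": 70}}}],
        },
    }
    return yaml.safe_dump_all([deployment, autoscaler], sort_keys=False)


# Example: render the manifests for a hypothetical fraud-scoring model.
print(inference_environment("fraud-scorer", "registry.example.com/fraud-scorer:1.4"))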

Further complexity arises as 85% of organizations shift inference toward the edge, moving computation closer to where data is generated. This improves latency and supports data-residency requirements, but it also introduces a new layer of complexity. Once inference leaves the centralized cloud, platform teams must deploy, monitor, and update models across dozens or even hundreds of distributed locations. Few platforms are equipped to orchestrate and govern workloads reliably at that scale.
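
One common mitigation is a staged rollout, sketched below: push a new model version to a small wave of sites, check their health, and only then expand to the next wave. The push_model and site_healthy callables are hypothetical stand-ins for a real fleet-management API.

from typing import Callable, Sequence


def staged_rollout(sites: Sequence[str], model_version: str,
                   push_model: Callable[[str, str], None],
                   site_healthy: Callable[[str], bool],
                   wave_size: int = 10) -> list[str]:
    """Roll out wave by wave; halt and return the unhealthy sites if a wave degrades."""
    failed: list[str] = []
    for start in range(0, len(sites), wave_size):
        wave = sites[start:start + wave_size]
        for site in wave:
            push_model(site, model_version)          # e.g., sync the artifact to the site's runtime
        failed = [s for s in wave if not site_healthy(s)]
        if failed:
            print(f"Halting rollout of {model_version}: unhealthy sites {failed}")
            break
    return failed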

To address this, platform teams are pursuing several complementary strategies:

  • Silicon diversity – pairing CPUs, GPUs, and specialized accelerators to optimize performance and cost.
  • Serverless inference – abstracting hardware management to enable elastic scaling of AI workloads.
  • Real-time data integration – using streaming and retrieval-augmented generation (RAG) to feed models contextual data without retraining (see the sketch after this list).
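
As a simple illustration of the third strategy, the sketch below retrieves relevant context at request time and prepends it to the prompt, so the model answers from fresh data without retraining. The keyword-overlap scoring is a deliberate simplification; a production setup would use embeddings and a vector store.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query and return the top k."""
    terms = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(terms & set(d.lower().split())),
                    reverse=True)
    return scored[:k]


def build_prompt(query: str, documents: list[str]) -> str:
    """Assemble a prompt that carries retrieved context alongside the user question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return f"Use the context below to answer.\nContext:\n{context}\n\nQuestion: {query}"


# Example with illustrative operational documents.
docs = [
    "GPU utilization in the eu-west cluster averaged 78% last week.",
    "The fraud-scorer model was retrained on Monday's transaction data.",
    "Edge sites in retail stores run inference on CPU-only hardware.",
]
print(build_prompt("How utilized are the eu-west GPUs?", docs))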

Together, these approaches point toward an emerging paradigm of composability, where compute, data, and models are modular and deployable anywhere—cloud, on-prem, or edge.

From Experimental to Operational AI

The roadblocks described above (human hurdles, integration and pipeline bottlenecks, and infrastructure inequity) aren’t just obstacles to adoption; they define whether AI remains experimental or becomes operational. But moving beyond experimentation requires organizations to achieve true AI-native maturity, which is not measured by the number of models deployed, but by how reliably those models deliver value. Getting there means weaving AI into the operational disciplines that define great engineering: observability, automation, and scalability.

This begins with breaking down silos between data science and engineering, automating retraining and deployment pipelines, and building infrastructure that adapts as fast as the models it supports. When teams treat AI integration as an engineering discipline rather than an initiative, they build systems that not only perform under pressure but also improve with every release. This operational mindset unlocks AI’s true potential as an enterprise-grade capability.
