
From GPU-ready instances to composable pipelines and advanced governance, platform engineers are clear about the capabilities they need for an AI-first future. But to truly support AI-native workloads, platform engineering teams need to rethink their foundations.
It’s not just about adding GPUs — it’s about enabling GPU-aware orchestration and composable infrastructure. That means Kubernetes clusters optimized for GPU scheduling and modular pipelines that can snap together like Lego blocks, scaling seamlessly across hybrid and multi-cloud environments.
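In Kubernetes terms, that GPU-aware orchestration starts with workloads declaring their GPU needs explicitly so the scheduler can place them on the right nodes. A minimal sketch of the idea, assuming the NVIDIA device plugin exposes nvidia.com/gpu as a schedulable resource and that GPU nodes carry an illustrative (not standard) gpu=true label and matching taint:

```python
# Sketch of a GPU-aware Kubernetes pod spec, expressed in Python.
# Assumes the NVIDIA device plugin is installed; the label/taint names
# below are illustrative placeholders chosen by the cluster operator.
import yaml

def gpu_training_pod(name: str, image: str, gpus: int) -> dict:
    """Build a pod manifest that will only land on GPU nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": {"gpu": "true"},   # steer the pod onto GPU nodes
            "tolerations": [{                  # tolerate the GPU-node taint
                "key": "gpu", "operator": "Equal",
                "value": "true", "effect": "NoSchedule",
            }],
            "containers": [{
                "name": "trainer",
                "image": image,
                "resources": {
                    # GPUs are requested via limits and cannot be overcommitted.
                    "limits": {"nvidia.com/gpu": str(gpus)},
                },
            }],
        },
    }

if __name__ == "__main__":
    print(yaml.safe_dump(gpu_training_pod("llm-finetune", "example.com/trainer:latest", 4)))
```

The same manifest shape works whether the cluster sits on-prem or in a cloud, which is what makes the "Lego block" composability across environments possible.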
“This is where strategy meets pragmatism,” says Neeraj Abhyankar, vice president, data and AI at R Systems.
He explains that most organizations are blending spot and reserved GPU instances, but the real advantage comes from workload profiling – knowing when to scale up for training and when to optimize inference.
“We’ve built elastic GPU pools and cost-aware orchestration that dynamically shift workloads between on-prem and cloud environments, ensuring our customers maximize performance without overspending,” Abhyankar explains.
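Abhyankar doesn't detail R Systems' implementation, but the core of a cost-aware placement decision can be sketched in a few lines; the pool names, prices and constraints below are hypothetical placeholders, not the company's actual policy:

```python
# Minimal sketch of cost-aware GPU placement across elastic pools.
# Pools, prices and thresholds are invented for illustration only.
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    kind: str            # "training" or "inference"
    gpus_needed: int
    interruptible: bool  # can the job survive spot preemption (checkpointing)?
    latency_sensitive: bool

@dataclass
class GpuPool:
    name: str
    free_gpus: int
    hourly_cost_per_gpu: float
    preemptible: bool

def place(workload: WorkloadProfile, pools: list[GpuPool]):
    """Pick the cheapest pool that satisfies the workload's constraints."""
    candidates = [
        p for p in pools
        if p.free_gpus >= workload.gpus_needed
        and (workload.interruptible or not p.preemptible)       # spot only for checkpointed jobs
        and not (workload.latency_sensitive and p.preemptible)  # keep latency-critical inference off spot
    ]
    return min(candidates, key=lambda p: p.hourly_cost_per_gpu, default=None)

pools = [
    GpuPool("on-prem-a100", free_gpus=8, hourly_cost_per_gpu=1.10, preemptible=False),
    GpuPool("cloud-spot-a100", free_gpus=32, hourly_cost_per_gpu=1.40, preemptible=True),
    GpuPool("cloud-reserved-a100", free_gpus=16, hourly_cost_per_gpu=3.20, preemptible=False),
]
job = WorkloadProfile("training", gpus_needed=16, interruptible=True, latency_sensitive=False)
print(place(job, pools).name)  # falls through to the cheapest pool with capacity
```

The workload profile, not the price list, is doing the real work here: a checkpointed training run can chase cheap spot capacity, while latency-sensitive inference stays on reserved or on-prem hardware.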
James Harmison, senior principal technical marketing manager, Red Hat AI, says many organizations want to extend their existing GitOps approaches to AI.
“There’s the obvious MLOps and now GenAI Ops or LLMOps angle,” he says. “But framed correctly, much of the shift pertains to expensive GPU resources – how to properly share them and prioritize access – whether in the cloud or on-prem.”
He notes the central role of data is also a major difference: many organizations have already been forced to adopt a data management strategy, and the dominance of AI workloads only magnifies that need.
“This gets amplified as neo-clouds become a part of the operational picture,” Harmison says.
He cautions that there are too many interfaces and separate ways to bring the necessary compute horsepower for those AI workloads into the toolkit of the platform engineering team.
“Wrangling that complexity with a common set of tools and clear observability into these different environments is a key challenge we’re seeing customers face,” he says.
That GenAI Ops motion requires a consistent set of abstractions over this kind of compute, which is new to many teams.
To design systems that maintain agility while supporting the increasing resource intensity of generative AI workloads, Harmison recommends multi-cluster management and GitOps-based platform operations, which support policy enforcement around self-service for the teams that need access to those resources.
“It establishes a clear boundary between how teams consume those resources with their intensive AI workloads and how they’re provided access to them,” he explains.
He adds that a lot of people want to tightly control all tooling, for example, by having data scientists open a ticket to have a Jupyter notebook created.
However, that doesn’t always result in the best agility for a team, and often the reason those systems were put in place was to control and monitor consumption and utilization.
“If we can use policy to meet those same goals, and observe compliance with that policy, we can enable teams to move quicker without losing that central visibility,” he says.
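In practice, that often means replacing the ticket queue with a declarative policy the platform enforces automatically. A toy illustration of the idea follows; the policy shape and team names are invented, and a real cluster would enforce this through an admission controller such as Kyverno or Gatekeeper driven by policies stored in Git rather than application code:

```python
# Toy policy check for self-service GPU/notebook requests.
# The policy format and teams are invented for illustration only.
from dataclasses import dataclass

# Policy as it might be declared in a Git-managed config: per-team GPU ceilings.
POLICY = {
    "data-science": {"max_gpus_per_workspace": 2, "allowed_gpu_types": {"t4", "a10"}},
    "ml-platform":  {"max_gpus_per_workspace": 8, "allowed_gpu_types": {"a10", "a100"}},
}

@dataclass
class WorkspaceRequest:
    team: str
    gpus: int
    gpu_type: str

def admit(req: WorkspaceRequest) -> tuple[bool, str]:
    """Return (allowed, reason); the reason doubles as an audit-log line."""
    rules = POLICY.get(req.team)
    if rules is None:
        return False, f"deny: team {req.team} has no self-service policy"
    if req.gpu_type not in rules["allowed_gpu_types"]:
        return False, f"deny: {req.gpu_type} not permitted for {req.team}"
    if req.gpus > rules["max_gpus_per_workspace"]:
        return False, f"deny: {req.gpus} GPUs exceeds cap of {rules['max_gpus_per_workspace']}"
    return True, f"allow: {req.team} granted {req.gpus}x {req.gpu_type}"

allowed, reason = admit(WorkspaceRequest("data-science", gpus=1, gpu_type="t4"))
print(allowed, reason)  # True allow: data-science granted 1x t4
```

The team gets its notebook immediately; the platform team gets the same guardrails and the same audit trail the ticket used to provide.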
Harmison explains that new patterns like GPU-as-a-Service and Models-as-a-Service offer emerging ways for platform engineering to retain control while still giving developers and line-of-business leaders the self-service they need to implement AI-enhanced applications.
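Under a Models-as-a-Service pattern, a developer never touches a GPU directly; they call a shared, platform-hosted endpoint. A rough sketch of what that consumption looks like, where the URL, model name and token variable are placeholders and the payload follows the OpenAI-compatible chat-completions shape that many serving stacks (such as vLLM) expose:

```python
# Sketch of consuming a platform-hosted Models-as-a-Service endpoint.
# URL, model name and token variable are hypothetical placeholders.
import os
import requests

MAAS_URL = "https://models.internal.example.com/v1/chat/completions"  # hypothetical
TOKEN = os.environ.get("MAAS_TOKEN", "")  # credential issued by the platform team

def ask(prompt: str, model: str = "example-8b-instruct") -> str:
    resp = requests.post(
        MAAS_URL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(ask("Summarize this incident report in two sentences: ..."))
```

The GPUs, quotas and model lifecycle stay behind the endpoint, which is precisely the boundary between providing resources and consuming them that Harmison describes.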
“A lot of it is simply learning the new technologies and approaches,” he says. “We saw similar retraining efforts when Linux Certified Engineers had to upskill to Kubernetes.”
As engineers move toward building AI-first platforms, Abhyankar says he sees several challenges.
“On the technical side, there is GPU scarcity and the complexity of distributed training,” he says. “Operationally, managing multi-cloud governance and observability at scale is challenging.”
He points out the AI-native future is no longer a distant vision – it’s here.
“Success will come to teams that combine technical rigor with cultural adaptability,” Abhyankar says. “It’s not just about building platforms, it’s about building ecosystems that make AI practical, scalable and responsible.”
