AI workloads are pulling training, batch jobs, and real-time inference onto Kubernetes, turning clusters into shared control planes for some of the most expensive infrastructure in the enterprise.
However, Kubernetes alone isn’t enough. Platform engineering teams are increasingly building abstraction layers that let data scientists and ML engineers consume GPUs, machine learning pipelines and inference services without having to navigate the underlying infrastructure complexity, and without turning clusters into chaos.
These layers—often delivered through internal developer platforms or self-service portals—aim to balance flexibility for engineering teams with the guardrails required for safe, scalable AI operations.
According to Pavlo Baron, co-founder and CEO of Platform Engineering Labs, the goal is to shield developers from the low-level details of infrastructure while still enabling platform engineers to maintain full control over those systems.
“Tools that platform engineers are using need to be able to abstract,” Baron says.
Traditional infrastructure-as-code approaches often expose the same granular configuration details to everyone using the system, which can create unnecessary complexity for developers who simply need access to compute resources.
“In classic IaC, you cannot abstract—everybody ends up seeing and accessing the same low-level detail,” he says.
Instead, platform teams are increasingly creating higher-level abstractions that simplify how developers request infrastructure resources.
Rather than specifying detailed configuration parameters—such as memory allocation or storage capacity—developers interact with simplified options that map to preconfigured environments.
“I am talking about a ‘t-shirt size’ as an abstraction for a database, not its size in bytes,” Baron says.
Under this model, platform engineers maintain detailed control over infrastructure configurations while developers interact with standardized service tiers that match common application needs. Baron says configuration languages designed to support layered abstractions are becoming an important part of this approach.
“Configuration languages like Pkl allow you to layer details for the user and for the use cases,” he says. “That is how you win.”
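The t-shirt-size idea can be sketched in a few lines: developers request a named tier, and the platform team privately owns the mapping from tier to concrete settings. A minimal, hypothetical illustration (the tier names and resource values are invented, not from any real platform):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseConfig:
    """Concrete settings owned by the platform team, hidden from developers."""
    cpu: str        # Kubernetes-style CPU quantity
    memory: str     # Kubernetes-style memory quantity
    storage_gb: int
    replicas: int


# Platform engineers maintain this mapping; developers only see the keys.
TIERS = {
    "small":  DatabaseConfig(cpu="500m", memory="1Gi",  storage_gb=20,  replicas=1),
    "medium": DatabaseConfig(cpu="2",    memory="8Gi",  storage_gb=100, replicas=2),
    "large":  DatabaseConfig(cpu="8",    memory="32Gi", storage_gb=500, replicas=3),
}


def request_database(size: str) -> DatabaseConfig:
    """Resolve a developer-facing t-shirt size to a full configuration."""
    if size not in TIERS:
        raise ValueError(f"unknown size {size!r}; choose from {sorted(TIERS)}")
    return TIERS[size]
```

The point of the indirection is that the platform team can retune what "medium" means (bigger memory, more replicas) without every consuming team changing a single request.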
Risks of Direct Cluster Access
Allowing data scientists and machine learning engineers to interact directly with Kubernetes clusters can introduce significant operational and security risks if proper platform abstractions are not in place. The complexity of Kubernetes often creates unintended outcomes when users without deep infrastructure expertise attempt to manage workloads themselves.
“The most straightforward risk is simply that Kubernetes is complex enough that it’s easy for a data scientist or ML engineer to end up with behaviors they don’t want if they try to go it alone,” says Flynn, technical evangelist at Buoyant.
For example, memory limits set too low can cause training jobs or inference workloads to be OOM-killed and restarted in a loop.
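The failure mode above comes from how Kubernetes treats resource limits: if a container's memory usage exceeds its limit, the kubelet terminates it (reason `OOMKilled`), and the restart policy brings it back to fail again. A sketch of the relevant portion of a pod spec, built as a plain dict with hypothetical names and values:

```python
def training_pod_spec(job_name: str, memory_limit: str) -> dict:
    """Build a minimal pod spec; the limit, not the request, decides OOM kills."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": job_name},
        "spec": {
            # With a too-low limit, OnFailure produces a kill/restart loop.
            "restartPolicy": "OnFailure",
            "containers": [{
                "name": "trainer",
                "image": "example.com/trainer:latest",  # hypothetical image
                "resources": {
                    # The scheduler places the pod based on the request...
                    "requests": {"memory": "4Gi"},
                    # ...but if the process peaks above this limit, the
                    # kubelet terminates it with reason OOMKilled.
                    "limits": {"memory": memory_limit},
                },
            }],
        },
    }
```

A platform abstraction can prevent this class of error outright by deriving limits from profiled workload tiers instead of letting each user pick raw byte counts.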
Beyond operational disruptions, there are also security and compliance concerns. Flynn notes that engineers unfamiliar with Kubernetes security models could inadvertently violate regulatory requirements by misconfiguring data isolation controls or deploying workloads with excessive permissions.
“They could deploy an agent with far more permission than it should have, giving it the ability to shut down or break other, unrelated applications,” he says.
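The standard guardrail here is Kubernetes RBAC: bind the agent's service account to a namespaced Role with only the verbs it needs, rather than a broad cluster-level binding. A minimal sketch, generating the two objects as dicts (all names are hypothetical):

```python
def least_privilege_role(namespace: str, agent: str) -> tuple[dict, dict]:
    """Return a Role limited to reading pods in one namespace, plus its binding."""
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": f"{agent}-reader", "namespace": namespace},
        # Read-only verbs: the agent cannot delete or shut down other workloads.
        "rules": [{
            "apiGroups": [""],
            "resources": ["pods"],
            "verbs": ["get", "list", "watch"],
        }],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{agent}-reader-binding", "namespace": namespace},
        "subjects": [{
            "kind": "ServiceAccount",
            "name": agent,
            "namespace": namespace,
        }],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": f"{agent}-reader",
        },
    }
    return role, binding
```

Because a Role is scoped to a single namespace, even a misbehaving agent bound this way cannot touch unrelated applications elsewhere in the cluster.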
AI Infrastructure Challenges
These issues highlight broader structural challenges within many organizations’ AI infrastructure environments.
Data scientists, platform engineers and application developers frequently operate in separate silos, each focused on different aspects of the stack. As AI systems grow more complex, those divisions can create gaps in governance and operational oversight.
At the same time, organizations are beginning to grapple with a new category of identity management challenges tied to autonomous agents and AI services.
Flynn says platforms are only beginning to address the distinction between the identity of a human operator and the identity of the automated agents acting on their behalf.
Looking ahead, Kubernetes-based AI platforms will likely evolve to address those challenges as organizations scale deployments across hybrid and multi-cluster environments.
While Kubernetes itself will continue improving support for large-scale training workloads and batch processing, Flynn expects the most significant advances to occur in identity and authorization frameworks for AI agents.
“As agents become more prevalent, platforms will need to offer more in terms of managing which agents are allowed to do what,” he says.
That shift will require stronger mechanisms for monitoring how agents interact with infrastructure and process data payloads—an area Flynn expects to become a central element of AI security strategies.
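One way to picture the distinction Flynn describes is an authorization check that carries both identities: the agent's own, and the human operator on whose behalf it acts, with an action permitted only when both are allowed. This is a hypothetical sketch of the idea, not any real platform's API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentIdentity:
    """An agent's identity plus the human operator it acts for."""
    agent: str
    on_behalf_of: str


# Hypothetical policy table: (principal kind, name) -> permitted actions.
POLICY = {
    ("agent", "batch-scheduler"): {"jobs:create", "jobs:read"},
    ("user", "alice"): {"jobs:create", "jobs:read", "jobs:delete"},
}


def authorize(identity: AgentIdentity, action: str) -> bool:
    """Allow an action only if BOTH the agent and its operator may perform it."""
    agent_allowed = action in POLICY.get(("agent", identity.agent), set())
    user_allowed = action in POLICY.get(("user", identity.on_behalf_of), set())
    return agent_allowed and user_allowed
```

Under this model, an agent can never exceed its own grant even when operated by a privileged user, and a user's revocation immediately constrains every agent acting for them.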
