We roll out the new thing, push it into production, and worry about recovery later.

Virtualization followed that path. Containers did too. Microservices stretched that pattern even further before anyone cleaned it up. Now AI is on the same track, except more systems depend on it from day one.

The work getting done around AI infrastructure right now is real. Teams are building pipelines, wiring models into applications, standing up vector databases, and wrapping it all in something that looks like a clean platform experience. It works. It scales. It demos well.

Ask a different question and things start to wobble.

If this breaks, how do you put it back together in a way you can trust?

At SUSECON, that question was sitting just under the surface. Not in the keynotes. In the side conversations. People like Matt Slotten from Veeam weren’t focused on uptime. They were focused on whether anyone could actually recover an AI system and stand behind the result.

That lands differently than a standard backup conversation.

Most platform teams are treating AI workloads like another application tier. You define a golden path, standardize the components, and make it easy for developers to consume. On paper, that makes sense. In practice, the data underneath these systems doesn’t behave like anything we’ve had to protect before.

Vector databases are not static stores. They are constantly changing and sensitive to small shifts in input. Training datasets are large, distributed, and often loosely versioned. Model artifacts don’t live on their own. They come out of pipelines that depend on very specific data states and configurations.

You can capture pieces of this. Snapshots, backups, replication. None of that guarantees you can bring the system back in a meaningful way.

You can save it. That doesn’t mean you can trust it.
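One small step toward making "saved" at least verifiable is a recovery manifest: at backup time, record a content hash for each artifact the pipeline depends on (datasets, model weights, configs), then re-check those hashes after a restore. A minimal sketch in Python — file layout and function names are hypothetical, and this only checks bytes, not behavior:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large artifacts never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(artifacts: list[Path], out: Path) -> None:
    """At backup time: record one digest per artifact."""
    out.write_text(json.dumps({str(p): sha256_of(p) for p in artifacts}, indent=2))

def verify_manifest(manifest: Path) -> list[str]:
    """After restore: return every path whose contents no longer match the manifest."""
    recorded = json.loads(manifest.read_text())
    return [p for p, digest in recorded.items()
            if not Path(p).exists() or sha256_of(Path(p)) != digest]
```

A clean `verify_manifest` result proves the bytes came back intact. It says nothing about whether the system behaves the same, which is exactly the gap the rest of this piece is about.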

That’s where the gaps show.

In a traditional application, recovery is straightforward to define. The system comes back up, the data is consistent, and the behavior matches expectations. With AI, that definition falls apart. You can restore the data and still get different outputs. You can bring the pipeline back and miss subtle corruption that was already present. The system is running, but the answers have shifted.

That is a failure, even if everything looks healthy on the surface.

What matters is not whether the system restores. It’s whether it behaves the same way after it does. That requires more than backup. It requires validation at the level of outcomes, not just infrastructure.

Most platforms don’t have that.
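What outcome-level validation could look like, sketched minimally: capture a small set of golden queries with known-good answers while the system is healthy, then replay them after any restore and refuse to call the recovery done until the pass rate clears a threshold. Every name and threshold below is a hypothetical illustration, not a product feature:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenQuery:
    prompt: str
    baseline: str  # answer captured while the system was known-good

def validate_restore(queries: list[GoldenQuery],
                     ask: Callable[[str], str],
                     min_pass_rate: float = 0.95) -> bool:
    """Replay golden queries against the restored system.

    The restore counts as successful only if behavior matches the recorded
    baseline often enough. Exact string match keeps the sketch simple; a real
    system would score semantic similarity instead.
    """
    passed = sum(1 for q in queries if ask(q.prompt) == q.baseline)
    return passed / len(queries) >= min_pass_rate
```

Exact match is deliberately naive: with non-deterministic models you would compare embeddings or use a judge model. The contract is the point — the system is not recovered until its answers check out, not merely until its pods are green.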

Now layer in the threat model we’re operating under. Ransomware is moving up the stack. It is targeting pipelines and datasets, not just storage. At the same time, model poisoning is no longer theoretical. If the data that shapes your model can be altered, your system can be compromised without ever going offline.

Getting back online is no longer the goal. Knowing whether you should trust what’s running is.

That changes what a “golden path” is supposed to deliver.

If your platform makes it easy to deploy AI workloads but has no opinion on how those workloads are protected and recovered, it is leaving a hole in the system by design. Not an edge case. A core gap.

Let’s be clear about ownership.

This doesn’t sit with a single application team. It doesn’t get solved by telling developers to be more careful with their data. It lands with the platform. The same group that defined the abstractions, chose the components, and made it easy to spin this up in the first place.

They’re also the ones who will get called when it breaks. Middle of the night, something is off, results don’t line up, and nobody can explain why. A few hours later, that turns into a restore attempt. Then an explanation. Then a root cause.

Same team, every time.

Every platform wave ends up here. Resilience stops being something you add later and becomes part of what you ship. Not a feature. Not an integration. A requirement.

AI is not an exception.

If your platform cannot recover its data in a way you can stand behind, it isn’t finished. It may be fast. It may be widely used. It may even look clean from the outside.

It is still incomplete.

Back it up like you mean it.
