When you’re managing infrastructure for a platform serving millions of users, Terraform becomes both your best friend and your biggest source of production incidents. I’ve spent the last several years managing multi-cloud deployments across AWS and GCP at Salesforce, and the patterns that work at a small scale often fail spectacularly when you’re coordinating 50+ engineers touching the same infrastructure code.
The first mistake most teams make is treating Terraform state files as an afterthought. At a small scale, a single state file works fine. However, once you cross a few hundred resources, plan operations start timing out and lock contention becomes a daily problem.
Splitting state files along service boundaries and environments helps. Each microservice owns its state and uses remote state data sources to share outputs with other services. This pattern has limitations: cross-service dependencies create circular reference problems that require careful coordination during deployments.
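To make that concrete, here is a minimal sketch of the consuming side, assuming an S3 backend and a network workspace that exports a vpc_id output (bucket, key, and output names are illustrative):

```hcl
# A service workspace reads networking outputs from the network workspace's
# state instead of owning those resources itself.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"               # illustrative bucket
    key    = "network/production/terraform.tfstate"  # illustrative key
    region = "us-east-1"
  }
}

resource "aws_security_group" "payments" {
  name   = "payments-service"
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
}
```

The trade-off is visible here: the payments workspace cannot apply cleanly until the network workspace has published its outputs, which is exactly the kind of ordering that needs coordination during deployments.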
The harder lesson comes from state file corruption. When 50 engineers are applying changes, someone will eventually interrupt a state write operation. Mandatory state backups before every apply, along with automated tooling to detect and restore corrupted state, become essential. Losing a state file for a production service at 2 a.m. teaches respect for state management.
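A minimal sketch of the backend setup that makes those restores possible, assuming an S3 bucket with object versioning enabled and a DynamoDB table for locking (all names are illustrative):

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"   # bucket has versioning enabled
    key            = "payments/production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"           # prevents concurrent state writes
    encrypt        = true
  }
}
```

With versioning on the bucket, every state write is retained as an object version, so recovering from a corrupted write means rolling back to the previous version rather than reconstructing state by hand.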
Most Terraform modules you find online are designed for demo purposes. They don’t survive contact with real production requirements.
The wrapper module pattern seems appealing at first. Each cloud provider gets its own module, with a higher-level module selecting between them. This creates a maintenance nightmare. Changes ripple across three layers of abstraction, and debugging becomes archaeological work.
The pattern that works: Flat modules with composition at the workspace level. Each module handles one concern. Networking modules don’t create compute resources. Compute modules don’t manage their own security groups. This approach aligns with Terraform’s recommended module structure but requires discipline to maintain boundaries.
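A sketch of what that composition looks like at the workspace level, with module paths and output names that are purely illustrative:

```hcl
# Each module owns one concern; the workspace wires them together.
module "network" {
  source = "./modules/network"

  cidr_block  = "10.20.0.0/16"
  environment = "production"
}

module "service" {
  source = "./modules/service"

  environment = "production"
  subnet_ids  = module.network.private_subnet_ids
}

# Security groups live in their own module rather than inside compute.
module "service_firewall" {
  source = "./modules/security-group"

  vpc_id       = module.network.vpc_id
  service_name = "payments"
}
```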
Version pinning becomes critical. Pinning every module dependency to exact versions and using automated testing to validate upgrades prevents surprises. A ‘helpful’ module update that changes default behavior can break 20 services in production.
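In practice that means every module reference carries an exact version, along the lines of this sketch with an illustrative registry path:

```hcl
module "service" {
  source  = "app.terraform.io/example-org/service/aws"  # illustrative private registry path
  version = "3.4.1"                                      # exact pin; bumps go through automated tests first

  environment = "production"
}
```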
Giving 50 engineers direct access to production Terraform configurations guarantees eventual disaster. The lessons here came from actual disasters.
A platform team model works better. This team owns core infrastructure modules and provides them as a service. Application teams consume these modules but can’t modify the underlying implementation. This creates friction initially. Engineers want full control. However, after the third incident where someone accidentally destroyed a production database cluster, the value becomes clear.
Pull requests for infrastructure changes require reviews from both the application team and the platform team. The platform team validates that changes follow patterns that won’t create cascading failures, while application teams verify that the change matches their requirements. Infrastructure code benefits from the same peer review processes used for application development.
Automated policy checking using Open Policy Agent integrated with Terraform prevents common mistakes: undersized instances for production workloads, missing backup configurations, and non-compliant tagging. Automation catches issues that humans miss during reviews.
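The policies themselves are written in OPA’s Rego language, but some of the same rules can also be encoded directly in the modules using Terraform’s native variable validation, which acts as a first line of defense before the policy engine runs. A hedged sketch with illustrative variable names and allowed values:

```hcl
variable "instance_type" {
  type    = string
  default = "m5.large"

  validation {
    condition     = contains(["m5.large", "m5.xlarge", "m5.2xlarge"], var.instance_type)
    error_message = "Production workloads must use an approved instance size."
  }
}

variable "tags" {
  type = map(string)

  validation {
    condition     = alltrue([for k in ["team", "cost-center", "backup-policy"] : contains(keys(var.tags), k)])
    error_message = "Tags must include team, cost-center, and backup-policy."
  }
}
```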
Running workloads simultaneously across AWS and GCP sounds appealing until you try to implement it. Cloud providers design their services to lock you in, and Terraform doesn’t magically solve that problem.
Running different services on different clouds works better than trying to mirror everything. AWS can handle legacy workloads and services that depend on AWS-specific features. GCP runs newer services where you can leverage their Kubernetes engine and data analytics tools.
The infrastructure code mostly stays separate. Shared modules handle truly generic resources such as monitoring dashboards and alerting rules. Provider-specific modules live in separate repositories. This means duplicating some logic, but it prevents the abstraction layers from becoming unmaintainable.
Cross-cloud networking requires actual planning. A VPN mesh between cloud providers and consistent IP addressing schemes provide the foundation. Service discovery works through a combination of DNS and service mesh tooling. Terraform manages the networking configuration but doesn’t try to abstract away the underlying complexity.
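The part Terraform does handle well is the addressing plan itself. A simplified sketch, assuming non-overlapping /16 ranges per cloud (CIDRs and names are illustrative):

```hcl
locals {
  aws_cidr = "10.16.0.0/16"   # AWS gets this block
  gcp_cidr = "10.17.0.0/16"   # GCP gets this one, so the VPN mesh can route without NAT
}

resource "aws_vpc" "main" {
  cidr_block = local.aws_cidr
}

resource "google_compute_network" "main" {
  name                    = "shared-platform"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "main" {
  name          = "shared-platform-us"
  region        = "us-central1"
  network       = google_compute_network.main.id
  ip_cidr_range = local.gcp_cidr
}
```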
Testing Terraform configurations at scale means accepting that perfect validation is impossible. The focus should be on catching the failures that may cause production issues.
Unit tests validate module logic using Terraform’s built-in testing framework. These tests run in CI for every pull request and catch syntax errors and basic logic mistakes. They don’t validate whether your infrastructure actually works.
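A minimal plan-only test as a sketch, assuming the module under test declares a cidr_block variable and an aws_vpc.main resource (the built-in framework ships with Terraform 1.6 and later):

```hcl
# tests/network.tftest.hcl — run in CI with `terraform test`
variables {
  cidr_block = "10.0.0.0/16"
}

run "vpc_uses_requested_cidr" {
  command = plan

  assert {
    condition     = aws_vpc.main.cidr_block == var.cidr_block
    error_message = "The VPC CIDR should match the cidr_block variable."
  }
}
```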
Integration tests deploy to isolated environments and run actual validation checks. Does the load balancer route traffic? Can services communicate across network boundaries? These tests cost money to run, so selectivity about what gets tested this way matters.
For production changes, targeted applies with explicit resource targeting limit risk. Changing a single security group rule doesn’t require planning and applying the entire workspace. This limits the blast radius but requires engineers to understand resource dependencies.
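The mechanics are just the -target flag, shown here with an illustrative resource address:

```shell
# Plan and apply only the one rule rather than the whole workspace.
terraform plan  -target='aws_security_group_rule.allow_https'
terraform apply -target='aws_security_group_rule.allow_https'
```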
Terraform excels at managing relatively static infrastructure but struggles with highly dynamic resources. It shouldn’t manage Kubernetes resources inside clusters. Helm charts and Kubernetes operators handle that layer better. Terraform manages the clusters themselves but not the workloads running in them.
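The boundary looks roughly like this sketch, where Terraform stops at the cluster itself (names and sizes are illustrative):

```hcl
# Terraform creates and sizes the cluster; Helm charts and operators own
# everything deployed inside it.
resource "google_container_cluster" "services" {
  name     = "services-primary"
  location = "us-central1"

  initial_node_count = 3
}
```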
Serverless functions and their associated event triggers change frequently during development. Managing these through Terraform adds friction without benefit. Cloud-native deployment tools handle rapid iteration better.
Configuration management for installed software belongs in other tools. Terraform can provision a VM, but Ansible or cloud-init should configure the software running on it. Trying to do both in Terraform creates slow, fragile configurations.
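The handoff point looks something like this hypothetical sketch, where Terraform stops at user_data and cloud-init takes over:

```hcl
resource "aws_instance" "worker" {
  ami           = "ami-0123456789abcdef0"   # illustrative AMI ID
  instance_type = "m5.large"

  # Terraform hands the instance a cloud-init file; package installs and
  # service configuration happen there, not in Terraform.
  user_data = file("${path.module}/cloud-init/worker.yaml")
}
```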
Each tool has its strengths: Terraform provides solid infrastructure provisioning, while other tools handle different layers more effectively.
Managing infrastructure at this scale means accepting tradeoffs. Perfect abstraction is impossible. Some duplication is acceptable if it prevents tighter coupling. Team structure matters as much as technical architecture. These lessons came from production incidents and late-night debugging sessions.
Software keeps evolving, and every company faces the challenge of upgrading to take advantage of the latest features. With Terraform, upgrading Terraform itself is easy, but upgrading the AWS and Google provider versions requires a careful strategy, especially across major versions, since those releases often change the state file format or resource schemas in ways that older versions cannot interpret correctly.
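A common starting point is constraining providers to a major version so upgrades are deliberate rather than accidental; the version numbers here are illustrative:

```hcl
terraform {
  required_version = ">= 1.6.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"   # stays within the 5.x line; major bumps are planned and tested
    }
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}
```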
The longer you have been using Terraform at scale, the harder it becomes to upgrade or uplift the infrastructure code. Imagine moving an AWS RDS instance that holds production customer data to the latest AWS provider version. Often the upgrade is straightforward and leaves the actual RDS instance untouched, but sometimes the instance has to be recreated, depending on what the new provider version introduces.
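One guardrail worth having during those upgrades is a prevent_destroy lifecycle rule on stateful resources, so a provider change that would force replacement fails the plan instead of recreating the database. A hedged sketch with illustrative settings:

```hcl
resource "aws_db_instance" "primary" {
  identifier     = "payments-primary"
  engine         = "postgres"
  engine_version = "15"
  instance_class = "db.r6g.large"

  allocated_storage           = 100
  username                    = "app"
  manage_master_user_password = true

  lifecycle {
    # Any plan that would destroy (or replace) this instance errors out.
    prevent_destroy = true
  }
}
```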
