Managing the Figment validator infrastructure meant operating Kubernetes on 1,300+ physical servers spread across 13 cloud and bare-metal providers. At that scale, Infrastructure-as-Code moves from a best practice to an operational necessity — there’s no other way to maintain consistency, audit changes, and recover from failures without spending weeks on manual configuration.
That experience also revealed failure modes in IaC practices that don’t appear at smaller scales. The lessons from running IaC at serious scale are different from the tutorials, and they’re worth knowing before you hit them in production.
Module Design Determines How Well You Scale
Terraform modules are the unit of reusability. How you design modules determines whether IaC scales gracefully or becomes a maintenance burden.
The anti-pattern that kills scaling: monolithic modules. A single module that provisions a complete environment — network, compute, databases, DNS, monitoring — seems efficient early on. By the time you need to provision a second environment that’s slightly different, the module is too opinionated to reuse. You fork it, now you maintain two versions, and they diverge over time.
The pattern that scales: composable, single-purpose modules. A vpc module that only manages VPC configuration. A kubernetes-cluster module that only manages a cluster. An rds-instance module that only manages a database. These are assembled into environment configurations that combine the modules.
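As a sketch, a hypothetical environment configuration assembling single-purpose modules like those — paths, names, and inputs are illustrative, not Figment’s actual layout:

```hcl
# environments/prod/main.tf — hypothetical composition of single-purpose modules
module "vpc" {
  source     = "../../modules/vpc"
  cidr_block = "10.20.0.0/16"
}

module "cluster" {
  source     = "../../modules/kubernetes-cluster"
  vpc_id     = module.vpc.vpc_id # module outputs wire the pieces together
  subnet_ids = module.vpc.private_subnet_ids
}

module "database" {
  source         = "../../modules/rds-instance"
  vpc_id         = module.vpc.vpc_id
  instance_class = "db.r6g.large"
}
```

Each module owns one concern; the environment file is just wiring, which keeps reviews small and diffs readable.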
At Figment, we had separate modules for validator node configuration, network peering, monitoring infrastructure, and provider-specific resources (because AWS and bare-metal providers have fundamentally different resource models). The modules were stable; the environment configurations changed regularly. This separation meant infrastructure changes could be reviewed in isolation from module changes.
One important discipline: pin module versions. source = "git::https://github.com/org/modules//kubernetes-cluster?ref=v2.3.1" rather than ?ref=main. When modules are used across multiple environments, an unpinned reference means a change to main affects all environments simultaneously at the next plan/apply. That’s not GitOps; that’s infrastructure drift waiting to happen.
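In module-block form, with a hypothetical cluster_name input, the difference looks like:

```hcl
# Pinned: this environment stays on v2.3.1 until the ref is bumped in a reviewed change.
module "cluster" {
  source       = "git::https://github.com/org/modules//kubernetes-cluster?ref=v2.3.1"
  cluster_name = "validators-eu1" # hypothetical input
}

# Unpinned (avoid): the next plan/apply silently picks up whatever landed on main.
# module "cluster" {
#   source = "git::https://github.com/org/modules//kubernetes-cluster?ref=main"
# }
```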
State Management at Scale
Terraform state is where IaC implementations fail spectacularly as they grow. The problems scale with complexity:
State file size. A Terraform state file for a large environment can grow to megabytes. Operations that touch the whole state (plan, apply) take progressively longer. The solution is state splitting: instead of one state file per environment, break environments into smaller state boundaries. Separate state for networking, for compute, for databases. Use terraform_remote_state data sources to reference outputs across state files.
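A sketch of the cross-state reference, with illustrative bucket and key names — the compute state reads outputs published by the networking state:

```hcl
# In the compute configuration: read outputs from the networking state file.
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "org-terraform-state"
    key    = "prod/network/terraform.tfstate"
    region = "us-east-1"
  }
}

module "cluster" {
  source     = "../../modules/kubernetes-cluster"
  vpc_id     = data.terraform_remote_state.network.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}
```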
State locking conflicts. Remote state (S3 + DynamoDB on AWS) provides locking that prevents concurrent applies. With one team running IaC, this rarely causes issues. With multiple teams, you’ll hit locking conflicts. The solution: narrower state boundaries (so different teams aren’t touching the same state) and automation that retries on lock failures rather than failing immediately.
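A minimal S3 backend with locking might look like this (bucket, table, and key names are illustrative):

```hcl
terraform {
  backend "s3" {
    bucket         = "org-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true               # server-side encryption at rest
    dynamodb_table = "terraform-locks"  # DynamoDB table with partition key "LockID"
  }
}
```

For the retry behavior, automation can pass -lock-timeout=5m to plan/apply so runs wait for a lock to clear instead of failing immediately.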
State drift. Resources modified outside of Terraform — manual changes in the AWS console, auto-scaling events, cloud provider-initiated changes — create state drift. terraform refresh (deprecated on modern Terraform in favor of terraform plan -refresh-only) updates the state to match reality; terraform plan then shows drift as proposed changes. The operational discipline: treat the plan output after a refresh as signal. Unexplained changes mean something changed outside of IaC. Investigate before applying.
The import workflow (terraform import) is how you bring existing resources under IaC management. It requires generating matching Terraform configuration for the resource — the resource exists in cloud, you write the config, you import the state, then Terraform manages it going forward. Tedious but necessary for brownfield environments.
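On Terraform 1.5+, the import can also be declared in configuration rather than run imperatively; resource and bucket names here are illustrative:

```hcl
# The configuration the imported resource must match going forward.
resource "aws_s3_bucket" "audit_logs" {
  bucket = "org-audit-logs"
}

# Config-driven import (Terraform 1.5+). Older versions use the CLI instead:
#   terraform import aws_s3_bucket.audit_logs org-audit-logs
import {
  to = aws_s3_bucket.audit_logs
  id = "org-audit-logs"
}
```

The next plan shows the import alongside any diffs between the real resource and your config, which makes brownfield adoption reviewable like any other change.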
Ansible and Terraform: Different Jobs
The IaC tooling question that causes confusion: when to use Terraform vs. Ansible. They address different problems and work better together than in competition.
Terraform manages infrastructure state: create this VPC, this instance, this security group. It’s declarative — you describe what should exist, Terraform figures out how to make it exist. It’s excellent at resource lifecycle management and understands dependencies between cloud resources.
Ansible manages configuration state: this server should have these packages installed, this config file should have these contents, this service should be running. It’s procedural in execution (runs tasks sequentially) but describes desired state (idempotent tasks). It’s excellent at post-provisioning configuration that Terraform’s provisioner blocks aren’t suited for.
The handoff: Terraform provisions the infrastructure, outputs the IP addresses and configuration details, Ansible picks those up (via dynamic inventory or explicit variable passing) and configures the systems. This is the pattern that scales. Using Terraform remote-exec provisioners for everything creates tight coupling between provisioning and configuration that makes both harder to manage.
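The Terraform side of that handoff is just outputs; the resource names, count, and attributes here are illustrative:

```hcl
resource "aws_instance" "validator" {
  count         = 3
  ami           = "ami-0abc1234def567890" # illustrative AMI
  instance_type = "m6i.large"
}

# Ansible consumes these via `terraform output -json`, a dynamic
# inventory source, or explicit -e variable passing.
output "validator_ips" {
  description = "Addresses for the Ansible inventory"
  value       = aws_instance.validator[*].public_ip
}
```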
At scale, Ansible inventory becomes its own engineering problem. Dynamic inventory that queries cloud provider APIs (using the aws_ec2 plugin or similar) is more reliable than static inventory files for environments that change frequently. A static inventory for 1,300 servers would demand constant manual maintenance and introduce a whole class of drift risk.
Testing IaC: Not Optional at Scale
Testing IaC is under-practiced. Teams often rely on “run it in staging first” as the testing strategy, which catches some problems but misses the class of bugs that only appear when the test environment diverges from production.
Approaches that work:
Terratest (Go-based) provisions real infrastructure, runs assertions against it, and tears it down. Slow (real infrastructure takes real time to provision) but comprehensive. Worth using for complex modules where the behavior under different inputs matters.
terraform validate catches syntax and type errors without requiring provider credentials. Run this in CI on every PR for free — it takes seconds and catches a class of errors that waste time.
terraform plan in CI against staging state is the minimum effective test. If the plan against staging state produces no unexpected changes, that’s a strong signal the module changes are safe for production. More importantly, it catches breaking changes before they reach production.
Pre-commit hooks with tflint (Terraform linter) and checkov (security policy scanning) catch configuration problems before they get to CI. These run locally in seconds, not minutes. Enforce them at the repository level.
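As an illustration, a minimal .tflint.hcl — the plugin version is illustrative, and this configuration also enforces the module-pinning discipline from earlier:

```hcl
# .tflint.hcl
plugin "aws" {
  enabled = true
  version = "0.31.0" # illustrative; pin to a current release
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

# Warns when a module source points at a mutable ref like main.
rule "terraform_module_pinned_source" {
  enabled = true
}
```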
Handling Secrets in IaC
Secrets in Terraform state are one of the most common security problems in IaC implementations. Terraform stores state in JSON, including the sensitive values it manages (database passwords, API keys). By default, state in S3 is readable by anyone with bucket access.
Mitigations that are non-negotiable:
- State backend encryption enabled (S3 server-side encryption with KMS)
- S3 bucket policy restricting access to specific IAM roles
- CloudTrail logging for all state access
- Secrets not passed as Terraform variables — use AWS Secrets Manager, Vault, or similar, and reference secrets by ARN/path in Terraform rather than passing the value through
The clean pattern: Terraform creates the AWS Secrets Manager secret and stores a placeholder value. The application or Ansible reads the secret from Secrets Manager at runtime. The actual secret value is never in Terraform state.
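A sketch of that pattern, with an illustrative secret name:

```hcl
# Terraform owns the secret's existence, not its value.
resource "aws_secretsmanager_secret" "db_password" {
  name = "prod/validator/db-password"
}

resource "aws_secretsmanager_secret_version" "initial" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = "PLACEHOLDER" # only this placeholder ever lands in state

  lifecycle {
    ignore_changes = [secret_string] # the real value, set out-of-band, stays out of state
  }
}
```

The ignore_changes guard means rotating the real value in Secrets Manager never shows up as drift and never gets written back into state.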
The Drift Audit
At scale, drift accumulates. Cloud providers make changes (security group rule updates from service changes), auto-scaling events create and destroy instances, operations teams make emergency manual changes during incidents and forget to codify them. A monthly drift audit — running terraform plan across all environments and reviewing the output — surfaces accumulated drift before it becomes a problem.
The audit output is also useful for a different reason: it shows you what has changed in your infrastructure in ways that aren’t captured in Git history. State drift is often the first signal that someone has been making manual changes outside the IaC workflow. That’s a training and culture problem worth addressing before it accumulates.
Our DevOps and automation practice builds IaC foundations for environments at all scales. The patterns are similar; the operational discipline scales with the environment. Related: managing IaC for Kubernetes infrastructure at scale adds additional layers of complexity — cluster configuration, operator management, and GitOps reconciliation — that benefit from the same modular, tested approach.