
Component Retry

Configure per-component retry behavior so transient failures — provider download 502s, S3 backend timeouts, registry rate limits — recover automatically without manual re-runs.

Terraform commands often fail with transient infrastructure errors that have nothing to do with the Terraform code. The most common is a 502 Bad Gateway during terraform init:

Error: Failed to install provider … 502 Bad Gateway returned from
https://github.com/opentofu/terraform-provider-local/releases/...

Configure a retry block on the component, and Atmos will retry each subprocess invocation (init, workspace, plan/apply) that fails with captured output matching one of the regex patterns you list as recoverable.

Usage

components:
  terraform:
    vpc:
      retry:
        max_attempts: 5
        backoff_strategy: exponential
        initial_delay: 2s
        max_delay: 30s
        conditions:
          - /Bad Gateway/
          - /5\d\d /
          - /connection reset/
          - /TLS handshake timeout/
          - /could not query provider registry/

How it works

When retry.conditions is configured, each call Atmos makes to the terraform/tofu binary is wrapped in an independent retry loop:

  1. The subprocess runs as usual; its stdout and stderr stream to the user and are captured into an in-memory buffer.
  2. When the subprocess exits with a non-zero status, the captured output is matched against every regex in conditions.
  3. If at least one pattern matches, Atmos waits for the configured backoff and retries.
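  4. If no pattern matches, the error is returned immediately. Real failures (configuration errors, schema errors, permission denials) and intentional non-zero exits such as exit code 2 from terraform plan -detailed-exitcode are therefore not retried, provided your patterns do not match their output.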

Because every subprocess invocation has its own loop, retries spent on terraform init do not consume the apply retry budget.

Arguments

conditions
A list of regex patterns matched against the captured stdout/stderr. A failure is retried only if its output matches at least one pattern. Patterns may be wrapped in /.../ for readability. Without conditions, no retry is attempted: this safety default ensures that a typo in the retry configuration can never cause real failures to be retried silently.
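A minimal Go sketch of how conditions might be interpreted, assuming the optional /.../ wrapper is simply stripped before the rest is compiled as a Go (RE2) regular expression. compileConditions and shouldRetry are hypothetical helpers, not Atmos functions. The demo reuses patterns from the Usage example and the 502 message quoted at the top of this page.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compileConditions is a hypothetical helper: it strips an optional /.../
// wrapper from each pattern and compiles the remainder as a Go regexp.
func compileConditions(patterns []string) ([]*regexp.Regexp, error) {
	compiled := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		if len(p) >= 2 && strings.HasPrefix(p, "/") && strings.HasSuffix(p, "/") {
			p = p[1 : len(p)-1]
		}
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("invalid retry condition %q: %w", p, err)
		}
		compiled = append(compiled, re)
	}
	return compiled, nil
}

// shouldRetry reports whether a failed invocation's captured output matches
// at least one condition. With no conditions it never retries.
func shouldRetry(output string, conditions []*regexp.Regexp) bool {
	for _, re := range conditions {
		if re.MatchString(output) {
			return true
		}
	}
	return false
}

func main() {
	conds, err := compileConditions([]string{`/Bad Gateway/`, `/5\d\d /`, `/connection reset/`})
	if err != nil {
		panic(err)
	}
	transient := "Error: Failed to install provider … 502 Bad Gateway returned from"
	configErr := "Error: Unsupported argument"
	fmt.Println(shouldRetry(transient, conds)) // true
	fmt.Println(shouldRetry(configErr, conds)) // false
	fmt.Println(shouldRetry(transient, nil))   // false: no conditions, no retry
}
```

Compiling every pattern up front means an invalid regex surfaces as a configuration error instead of being discovered only when a failure occurs.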

max_attempts
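Maximum number of attempts, including the first. If you omit it inside a retry block, attempts are unlimited, so consider setting max_elapsed_time to bound the total time. When no retry block is configured at all, the effective value is 1 (no retry).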
backoff_strategy
One of constant, linear, or exponential. Defaults to exponential when initial_delay is set.
initial_delay
Delay before the second attempt as a Go duration string (e.g. "2s", "500ms").
max_delay
Upper bound on any single delay; caps exponential growth (e.g. "30s").
max_elapsed_time
Total time budget across all attempts (e.g. "5m"). Once the budget is exhausted, the most recent error is returned regardless of max_attempts.
multiplier
Growth multiplier for exponential backoff. Defaults to 2.0.
random_jitter
Fractional jitter applied to each delay (0–1). Useful to avoid thundering-herd retries from CI fleets.

Inheritance

The retry block participates in component inheritance. Abstract components can define a default retry policy that concrete components inherit and override:

components:
  terraform:
    base/network:
      metadata:
        type: abstract
      retry:
        max_attempts: 3
        initial_delay: 2s
        backoff_strategy: exponential
        conditions:
          - /Bad Gateway/
          - /5\d\d /

    vpc:
      metadata:
        component: base/network
      # vpc inherits the base retry block.

    transit-gateway:
      metadata:
        component: base/network
      retry:
        # Override only the fields we care about. `conditions` is appended to,
        # `max_attempts` is replaced.
        max_attempts: 5
        conditions:
          - /TLS handshake timeout/
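The override rule stated in the transit-gateway comments (scalar fields replaced, conditions appended) can be sketched in Go. retryBlock, mergeRetry, and intPtr are hypothetical; the struct covers only two fields, and the sketch models just the behavior those comments describe, not Atmos's general deep-merge logic.

```go
package main

import "fmt"

// retryBlock is a hypothetical, simplified view of a component's retry block.
// A pointer field distinguishes "not set" from a zero value.
type retryBlock struct {
	MaxAttempts *int
	Conditions  []string
}

// mergeRetry applies the rule from the example's comments: a scalar set on
// the child replaces the base value, and child conditions are appended to the
// base list. A fresh slice is built so the base block is never mutated.
func mergeRetry(base, child retryBlock) retryBlock {
	merged := retryBlock{MaxAttempts: base.MaxAttempts}
	if child.MaxAttempts != nil {
		merged.MaxAttempts = child.MaxAttempts
	}
	merged.Conditions = append(append([]string{}, base.Conditions...), child.Conditions...)
	return merged
}

func intPtr(v int) *int { return &v }

func main() {
	base := retryBlock{MaxAttempts: intPtr(3), Conditions: []string{`/Bad Gateway/`, `/5\d\d /`}}

	// vpc sets no retry fields, so it inherits the base block unchanged.
	vpc := mergeRetry(base, retryBlock{})
	// transit-gateway replaces max_attempts and appends one condition.
	tgw := mergeRetry(base, retryBlock{MaxAttempts: intPtr(5), Conditions: []string{`/TLS handshake timeout/`}})

	fmt.Println(*vpc.MaxAttempts, vpc.Conditions)
	fmt.Println(*tgw.MaxAttempts, tgw.Conditions)
}
```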

When NOT to use retry

  • Real Terraform failures. A plan that exits non-zero because of a configuration error, or a validate that catches a typo, should fail loudly. Keep conditions narrowly scoped to transient infrastructure patterns.
  • Stateful mutations with non-idempotent side effects. If a third-party API is invoked from a local-exec provisioner and is not safe to retry, do not configure retry conditions that could match its error output.
  • Hooks. Atmos hooks have their own configuration and are not affected by component retry.
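To see why conditions should stay narrow, this Go sketch checks two sample outputs against a narrow and a broad pattern set. The sample strings and the retried helper are illustrative assumptions; the broad pattern Error matches every Terraform diagnostic, so it would retry the configuration error as well.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical sample outputs: one transient infrastructure failure and one
// genuine configuration error that must not be retried.
var (
	transient = "Error: Failed to install provider: 502 Bad Gateway"
	configErr = "Error: Unsupported argument"
)

// retried reports whether any pattern in conds matches the output.
func retried(output string, conds []string) bool {
	for _, c := range conds {
		if regexp.MustCompile(c).MatchString(output) {
			return true
		}
	}
	return false
}

func main() {
	narrow := []string{`Bad Gateway`, `TLS handshake timeout`}
	broad := []string{`Error`} // too broad: matches every Terraform error

	fmt.Println(retried(transient, narrow), retried(configErr, narrow)) // true false
	fmt.Println(retried(transient, broad), retried(configErr, broad))   // true true
}
```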

See also