Parallel and Matrix Steps for Atmos Workflows

· 5 min read
Erik Osterman
Founder @ Cloud Posse

Atmos workflows can now run independent work concurrently with first-class parallel and matrix control steps. Add dependency-aware fan-out, readable grouped or live-prefixed output, and explicit failure behavior directly to your workflow YAML.

The Problem

Workflows are where teams encode the operational knowledge that should not live in someone's shell history: run the checks, build the thing, deploy the dependencies, then summarize what happened.

Until now, those steps were sequential. That was easy to reason about, but it meant a workflow with four independent checks took the sum of all four runtimes. The usual workaround was to drop into shell scripts, background jobs, wait, temp files, and hand-rolled log prefixes. That works until it doesn't:

  • Output from concurrent commands interleaves into unreadable logs.
  • Failure behavior is implicit and different in every script.
  • Dependency relationships are hidden in shell control flow.
  • Local workflows and CI matrices drift apart.

Infrastructure automation should not force you to choose between "simple but slow" and "fast but fragile."

What's New

Atmos now supports two new workflow control step types:

  • parallel runs sibling steps concurrently.
  • matrix expands literal axes and schedules the generated child steps.

Both support:

  • needs dependencies between sibling steps.
  • max_concurrency to bound parallelism.
  • Failure modes: wait_all, fail_fast, and best_effort.
  • Output modes: grouped, prefixed, and none.
  • Parent-owned summaries with success, failed, skipped, and canceled counts.

This is built into the workflow engine, so the orchestration rules are visible in the workflow file instead of buried in shell glue.

Parallel Checks

Run independent checks together, then run a dependent summary step only after both prerequisites succeed:

stacks/workflows/checks.yaml
workflows:
  checks:
    steps:
      - name: checks
        type: parallel
        max_concurrency: 4
        fail:
          mode: wait_all
        output:
          mode: grouped
          order: completion
          show_summary: true
          prefix: "{{ .step.name }}"
        steps:
          - name: lint
            type: shell
            command: make lint

          - name: test
            type: shell
            command: make test

          - name: summarize
            type: shell
            needs: [lint, test]
            command: ./scripts/summary.sh

The workflow is still declarative: summarize says what it needs, not how to poll for it. Atmos schedules everything else.

Matrix Fan-Out

Use matrix when the same step should run across combinations:

stacks/workflows/test-matrix.yaml
workflows:
  test-matrix:
    steps:
      - name: test-matrix
        type: matrix
        max_concurrency: 3
        output:
          mode: grouped
          order: definition
        matrix:
          os: [linux, darwin]
          go: ["1.22", "1.23"]
        steps:
          - name: test
            type: shell
            command: make test OS={{ .matrix.os }} GO_VERSION={{ .matrix.go }}

That gives you CI-style fan-out without requiring the workflow to become a GitHub Actions-only construct. The same workflow can run locally, in CI, or inside a larger operational runbook.
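To make the expansion concrete: two values for os and two for go give 2 × 2 = 4 combinations. Assuming each combination becomes one child step, the matrix above behaves roughly like the hand-written parallel step sketched here. The workflow name and child step names are hypothetical, since this post does not say how Atmos names or orders the generated children.

```yaml
# Hand-expanded sketch of the test-matrix step (illustrative only).
# Names such as test-linux-go1.22 are made up for this example.
workflows:
  test-matrix-expanded:
    steps:
      - name: test-matrix-expanded
        type: parallel
        max_concurrency: 3
        output:
          mode: grouped
          order: definition
        steps:
          - name: test-linux-go1.22
            type: shell
            command: make test OS=linux GO_VERSION=1.22
          - name: test-linux-go1.23
            type: shell
            command: make test OS=linux GO_VERSION=1.23
          - name: test-darwin-go1.22
            type: shell
            command: make test OS=darwin GO_VERSION=1.22
          - name: test-darwin-go1.23
            type: shell
            command: make test OS=darwin GO_VERSION=1.23
```

With max_concurrency: 3, at most three of the four children run at once and the fourth waits for a free slot. The matrix form keeps that behavior while avoiding the copy-pasted commands.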

Output That Stays Readable

Concurrent output is only useful if humans can read it. The control step owns child output rendering:

  • grouped captures child stdout/stderr and prints labeled blocks.
  • prefixed streams live output with complete-line prefixes.
  • none suppresses terminal output while still capturing metadata.

For live logs:

output:
  mode: prefixed
  prefix: "{{ .step.name }}"

Example output:

[lint] checking formatting
[test] running unit tests
[lint] passed
[test] passed
[checks] summary: 2 succeeded, 0 failed, 0 skipped, 0 canceled

The summary uses the same Atmos UI formatter as other command output, so success, warning, and failure states are immediately visible.

Explicit Failure Semantics

Parallel work needs a clear answer to "what happens when one branch fails?"

fail:
  mode: wait_all   # wait_all | fail_fast | best_effort
  max_failures: 2  # 0 means unlimited

  • wait_all lets independent ready/running branches continue, skips dependents of failed children, and fails the parent after schedulable work settles.
  • fail_fast cancels pending and running siblings once the failure threshold is reached.
  • best_effort records failures and skips dependents, but lets the parent succeed unless the control step itself is invalid.

That makes failure behavior reviewable. Operators can choose fast feedback for checks, complete collection for reports, or best-effort fan-out where partial success is still useful.
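As a sketch of the "fast feedback for checks" choice, the checks step from earlier could switch to fail_fast with live prefixed output. It uses only fields shown in this post, and it assumes that max_failures: 1 puts the threshold at the first failure, which the post implies but does not spell out.

```yaml
# Illustrative variant of the checks workflow: stop early on the first failure.
workflows:
  checks:
    steps:
      - name: checks
        type: parallel
        max_concurrency: 4
        fail:
          mode: fail_fast
          max_failures: 1  # assumption: threshold is reached on the first failure
        output:
          mode: prefixed
          prefix: "{{ .step.name }}"
        steps:
          - name: lint
            type: shell
            command: make lint
          - name: test
            type: shell
            command: make test
```

Under that assumption, a failing lint would cancel test if it is still pending or running. Swapping the mode to best_effort would instead record the failure and let the parent step succeed.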

Guardrails for v1

The first version intentionally allows only non-interactive child steps inside concurrent groups:

  • shell
  • atmos
  • sleep

Interactive prompts, terminal-owning renderers, file editors, pagers, spinners, environment-mutating steps, and exec are kept outside concurrent groups for now. That boundary is deliberate: concurrent workflows should not start by letting multiple children fight over the same terminal.

You can still use rich UI steps before or after a parallel or matrix control step to frame the workflow, show tables, render markdown, or summarize the result.
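A minimal sketch of that framing pattern might look like the following. Plain shell steps stand in for the richer UI steps, because this post does not name those step types. The workflow name, step names, and echo text are hypothetical.

```yaml
# Sketch: sequential steps framing a concurrent group (illustrative only).
workflows:
  framed-checks:
    steps:
      # Runs before the fan-out; stand-in for a UI step that frames the workflow.
      - name: announce
        type: shell
        command: echo "Starting checks"

      # Only non-interactive children live inside the concurrent group.
      - name: checks
        type: parallel
        max_concurrency: 2
        fail:
          mode: wait_all
        output:
          mode: grouped
        steps:
          - name: lint
            type: shell
            command: make lint
          - name: test
            type: shell
            command: make test

      # Runs after the group; stand-in for a UI step that summarizes the result.
      - name: report
        type: shell
        command: ./scripts/summary.sh
```

Keeping anything that needs the terminal at the top level, outside the concurrent group, stays within the v1 guardrails.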

Why This Matters

Parallel and matrix workflow steps make Atmos workflows feel like real orchestration instead of a sequential macro runner.

  • Local runbooks get faster without becoming bash concurrency puzzles.
  • CI and local automation can share the same workflow definition.
  • Dependency relationships are visible as needs, not hidden in scripts.
  • Output remains readable by default.
  • Failure behavior is part of the contract.
  • Matrix fan-out is available anywhere Atmos runs, not only inside a CI provider.

This is especially useful for validation workflows, multi-component smoke tests, cross-platform checks, reporting jobs, and any operational task where several independent branches can run safely at the same time.

Try It

This PR includes a runnable example:

cd examples/parallel-steps

atmos workflow checks -f parallel
atmos workflow prefixed -f parallel
atmos workflow matrix -f parallel

Start with validation and reporting workflows. They usually have the safest fan-out shape: independent checks, obvious dependencies, and low risk if one branch fails.

For the full field reference, see the parallel and matrix step type documentation.

Get Involved

Try the new control steps on real workflows and tell us where the v1 guardrails feel too strict or exactly right. We're especially interested in feedback on output modes, failure semantics, and which additional non-interactive step types should be allowed inside concurrent groups next.