What’s the deal with AWS CloudFormation StackSets?

You might have heard of AWS CloudFormation, but are you familiar with AWS CloudFormation StackSets? It’s a powerful infrastructure provisioning service in AWS that simplifies multi-account, multi-region deployments and can be very useful to have in your toolbox 🔧 While powerful, StackSets comes with some limitations and surprising behaviors that aren’t well-documented. Strap in to learn more!

This article serves as a summary of the most important things I’ve learned about StackSets to date. I’d love to hear your thoughts if you spot any errors, disagree with my observations, or have additional insights to share. If you’re already quite familiar with StackSets, you can probably skip to the Gotchas and Guidelines sections to see if my experiences match yours.

Introduction
Concepts
Example
Gotchas
Guidelines
Conclusion

Introduction

At its simplest, CloudFormation is the native infrastructure as code (IaC) engine for provisioning and managing infrastructure resources in AWS. It revolves around organizing your resources into an isolated container called a stack. You write a CloudFormation template in JSON or YAML, hand it over to CloudFormation to create a stack, and it tries its best to make reality reflect your template.

StackSets, on the other hand, is a service that solves a related but different problem: managing the deployment of such stacks across tens, hundreds and even thousands of combinations of AWS accounts and regions. You give StackSets a CloudFormation template in JSON or YAML together with a set of target AWS accounts and regions and a desired level of parallelism, and StackSets takes care of the rest! It creates and manages the stacks, works out which ones need updating when the template changes, and so on. While you wouldn’t typically use StackSets for deploying application workloads, it shines when you need to manage standardized infrastructure across many AWS accounts. This is a very powerful concept that allows you to centrally set up governance, security or various foundational resources such as billing alarms, IAM roles, DNS zones, OpenID Connect Providers (e.g., for GitHub Actions), IaC bootstrapping resources (e.g., for AWS CDK, Terraform or Pulumi), etc. The sky (the cloud?) is the limit ☁️

You could achieve the same capability in various other ways: using Terraform with multiple AWS providers, through open-source solutions like org-formation, by building a custom pipeline using GitHub Actions, CodePipeline et al., or through custom scripts (which risk growing into a complex amalgamation that becomes scary to use after a while!). With that said, once you want to deploy infrastructure to a double-digit number of account-region combinations or more, you should probably consider StackSets. It’s managed, well integrated with AWS, and when it works as expected, it’s quite nice!

While I’m a big fan of StackSets, as I’ll come back to, it’s not all glitter and glam. But first, let’s start with some important concepts.

Concepts

You use StackSets by creating a stack set that is associated with a CloudFormation template and a set of target accounts and regions. Each account-region pair is considered a stack instance. While a stack instance typically will correspond to an actual CloudFormation stack in that account-region pair, this isn’t always the case - a stack instance might exist without any real CloudFormation stack behind it (e.g., deployment is queued or has failed).

If your CloudFormation template contains parameters, you can pass in values for these through what’s called stack set parameters. These function as default values of sorts for all of the stack instances. What’s interesting here is that you can also pass in stack instance parameter overrides that apply to, for example, a specific account or group of accounts. This can be very useful when you need to use different values in accounts that represent different environments (e.g., development and production), or if you’re simply provisioning resources that need to be configured with account-specific parameters (e.g., a DNS zone named <account-name>.example.com).
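
As a rough sketch of how this can look when the stack set itself is defined in CloudFormation, the fragment below sets a stack set-level default and overrides it for one group of accounts. The ZoneName parameter and the account IDs are hypothetical placeholders, and targeting explicit accounts like this assumes the self-managed model described below.

```yaml
# Fragment of an AWS::CloudFormation::StackSet resource (Properties level).
# ZoneName and the account IDs are hypothetical placeholders.
Parameters:
  - ParameterKey: ZoneName
    ParameterValue: default.example.com        # default for every stack instance
StackInstancesGroup:
  - DeploymentTargets:
      Accounts:
        - "111111111111"
    Regions:
      - eu-west-1
    # No ParameterOverrides here: this account uses the stack set default.
  - DeploymentTargets:
      Accounts:
        - "222222222222"
    Regions:
      - eu-west-1
    ParameterOverrides:
      - ParameterKey: ZoneName
        ParameterValue: production.example.com # applies to this group only
```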

Figure: A diagram showing a stack set deploying to multiple AWS accounts

Besides this, StackSets comes in two different flavors: service-managed and self-managed. The main difference between these boils down to whether you are using it together with AWS Organizations or not. Self-managed requires you to set up an IAM execution role in each of the accounts where you want StackSets to provision stacks, while service-managed integrates with your AWS Organizations organization, automatically sets up the necessary IAM roles for you, and supports automatically adding or removing stacks when accounts are added to or removed from an organization or organizational unit (OU). With service-managed you target OUs, while with self-managed you target specific accounts.
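
For comparison with the service-managed example later in the article, here is a minimal sketch of a self-managed stack set. It assumes the administration and execution roles already exist under the default names AWS suggests. The account IDs, region and template URL are placeholders.

```yaml
# Sketch of a self-managed stack set; account IDs and the template URL are placeholders.
Resources:
  SelfManagedStackSet:
    Type: AWS::CloudFormation::StackSet
    Properties:
      StackSetName: example-self-managed
      PermissionModel: SELF_MANAGED
      # Role in the administrator account that StackSets assumes first.
      AdministrationRoleARN: !Sub "arn:${AWS::Partition}:iam::${AWS::AccountId}:role/AWSCloudFormationStackSetAdministrationRole"
      # Role that must already exist in every target account.
      ExecutionRoleName: AWSCloudFormationStackSetExecutionRole
      StackInstancesGroup:
        - DeploymentTargets:
            Accounts:              # self-managed targets explicit accounts
              - "111111111111"
              - "222222222222"
          Regions:
            - eu-west-1
      TemplateURL: https://example-bucket.s3.amazonaws.com/template.yml
```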

Specifically for the service-managed model, there are two additional features that are worth noting:

Automatic deployment: When enabled, StackSets will automatically deploy stacks to new accounts as they’re added to the targeted OUs, and remove stacks when accounts leave those OUs. This sounds incredibly convenient—imagine having all your governance and security baselines automatically applied to every new account! However, as we’ll explore later, this convenience comes with some surprising behaviors.
Account level targeting: This feature provides more granular control over which accounts are targeted by your stack set. You can use an INTERSECTION filter type to deploy only to specific accounts within an OU, DIFFERENCE to exclude certain accounts, or UNION to deploy to an OU and one or more accounts residing in different OUs. This flexibility is powerful, but as we’ll see, it introduces some unexpected behaviors.
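
To illustrate account level targeting, the fragment below is a sketch of a deployment target that covers an OU but excludes one account. The OU ID and account ID are made-up placeholders.

```yaml
# Fragment of StackInstancesGroup for a service-managed stack set.
# The OU ID and account ID are placeholders.
StackInstancesGroup:
  - DeploymentTargets:
      OrganizationalUnitIds:
        - ou-abcd-11111111
      AccountFilterType: DIFFERENCE   # everything in the OU except the listed accounts
      Accounts:
        - "333333333333"
    Regions:
      - eu-west-1
```

Swapping DIFFERENCE for INTERSECTION would instead limit the deployment to the listed accounts within that OU.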

The service-managed model might sound like a no-brainer, but there are some surprising behaviors associated with it, as we’ll explore in the Gotchas section.

Example

To make things a bit more concrete, I’ve cooked up a fairly vanilla example below on how to set up StackSets using CloudFormation. This template creates a service-managed stack set that deploys a read-only IAM role to all accounts in a specific OU, allowing a central account to assume this role for cross-account management. In other words, it’s a CloudFormation template that can be used to create a CloudFormation stack that creates a CloudFormation stack set which in turn deploys CloudFormation stacks. It’s CloudFormation all the way down, baby. Yeah, it’s all a bit meta.

AWSTemplateFormatVersion: 2010-09-09
Description: "Sets up a stack set that sets up a stack containing a read-only role in all accounts in an organizational unit (OU)"

Parameters:
  TargetOrganizationalUnitId:
    Type: String
    Description: "The StackSet will create CloudFormation stacks in all accounts in the organizational unit (OU)"
  RoleName:
    Type: String
    Default: example-cfn-stackset-role
    Description: "The name of the IAM role that will be created in the accounts in the target organizational unit (OU)"
  TrustedAccountId:
    Type: String
    Description: "The AWS account that is allowed to assume the IAM role that will be created in the accounts in the target organizational unit (OU)"
  CallAs:
    Type: String
    Default: SELF
    AllowedValues:
      - SELF
      - DELEGATED_ADMIN
    Description: "Specifies whether to run the stack set operations as a delegated administrator (DELEGATED_ADMIN) or from the management account (SELF)"

Resources:
  StackSet:
    Type: AWS::CloudFormation::StackSet
    Properties:
      StackSetName: !Ref AWS::StackName
      PermissionModel: SERVICE_MANAGED
      CallAs: !Ref CallAs
      AutoDeployment:
        Enabled: true
        RetainStacksOnAccountRemoval: false
      Parameters:
        - ParameterKey: RoleName
          ParameterValue: !Ref RoleName
        - ParameterKey: TrustedAccountId
          ParameterValue: !Ref TrustedAccountId
      StackInstancesGroup:
        - DeploymentTargets:
            OrganizationalUnitIds:
              - !Ref TargetOrganizationalUnitId
          Regions:
            - !Ref AWS::Region
      Capabilities:
        - CAPABILITY_NAMED_IAM
      TemplateBody: |
        AWSTemplateFormatVersion: 2010-09-09

        Parameters:
          TrustedAccountId:
            Type: String
          RoleName:
            Type: String

        Resources:
          Role:
            Type: AWS::IAM::Role
            Properties:
              RoleName: !Ref RoleName
              AssumeRolePolicyDocument:
                Version: 2012-10-17
                Statement:
                  - Effect: Allow
                    Principal:
                      AWS: !Sub "arn:aws:iam::${TrustedAccountId}:root"
                    Action: sts:AssumeRole
              ManagedPolicyArns:
                - arn:aws:iam::aws:policy/ReadOnlyAccess

If you want to try out the example (ideally in a sandbox environment), store the template in template.yml and use the following CLI command (replacing <ou-id> and <account-id>):

aws cloudformation create-stack \
  --stack-name example-stack-set \
  --parameters "ParameterKey=TargetOrganizationalUnitId,ParameterValue=<ou-id>" \
               "ParameterKey=TrustedAccountId,ParameterValue=<account-id>" \
  --template-body file://template.yml

Gotchas

So we’ve talked about some of the good stuff. Let’s now take a look at some of the … not so good stuff. The Gotchas™. Some of this can be found in the official documentation, some is glaring in its absence. This list is not exhaustive either, but I’ve included my most important findings. A lot of this is based on experiments, and I haven’t tested all permutations of configuration values, lifecycles, … I’ll also add that most of these findings are related to using the service-managed permission model with automatic deployment enabled.

⚠️ The official documentation can be confusing and lacking

AWS’s documentation for StackSets omits many of the edge cases and surprising behaviors you’ll encounter in practice. Some features are only briefly mentioned, while interactions between features (like automatic deployment and account level targeting) aren’t explained in depth. This makes it hard to predict how StackSets will behave in complex scenarios without experimenting yourself. I found myself frequently running into undocumented behaviors that contradicted my expectations.

⚠️ Limited support for parameter overrides for service-managed stack sets

For a given service-managed stack set, you can only target a specific OU once. This has the implication that you can only override parameters on a per-OU basis. If you want to use different parameters for accounts within the same OU, you’re kind of out of luck.
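
In practice this means overrides end up looking something like the sketch below, with one group per OU. The Environment parameter and the OU IDs are hypothetical.

```yaml
# Fragment of a service-managed AWS::CloudFormation::StackSet (Properties level).
# Environment is a hypothetical template parameter; OU IDs are placeholders.
StackInstancesGroup:
  - DeploymentTargets:
      OrganizationalUnitIds:
        - ou-abcd-11111111      # development OU
    Regions:
      - eu-west-1
    ParameterOverrides:
      - ParameterKey: Environment
        ParameterValue: development
  - DeploymentTargets:
      OrganizationalUnitIds:
        - ou-abcd-22222222      # production OU
    Regions:
      - eu-west-1
    ParameterOverrides:
      - ParameterKey: Environment
        ParameterValue: production
```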

⚠️ Issues with automatic deployment and account level targeting for service-managed stack sets

When using the service-managed model, using automatic deployment and account level targeting together can have some surprising behavior.

You might be tempted to target a specific OU and use an INTERSECTION filter to only target a specific account in that OU. That works well on initial deploy, and everything seems fine and dandy. But if a new account is added to the OU, a new stack will be auto-deployed to it! This is extremely unintuitive. The same goes for the other filter types: the filter is only respected when you’re doing a stack set update, not on automatic deployment.

When using account level targeting I’ve also had issues when moving accounts between OUs, resulting in states where the IaC no longer matches reality. Even after removing all stack instances from my code and applying the changes, some auto-deployed stack instances still persist in AWS. The only solutions are either moving the accounts out of the targeted OU or manually deleting the stack instances.

⚠️ Parameter overrides don’t apply to auto-deployed stacks for service-managed stack sets

If you use parameter overrides with your service-managed stack set, those only apply to stacks that are deployed through a stack set update. If an account is moved into an OU that StackSets has as a target, a stack instance will be created, but it will NOT be using any overrides. This can lead to inconsistent configurations across accounts that should be identical.

⚠️ Parallelism not respected when using different parameter override values

While StackSets offers controls for how many stack instances to operate on in parallel, there’s a limitation that isn’t well-documented: operations on stack instances with different parameter override values run sequentially, not in parallel. If you’re deploying to tens or hundreds of accounts with unique parameter values, your deployments can take much longer than you might expect.

This limitation seems to only take effect on operations that set the actual parameter override values, or when deleting stack instances that have different parameter override values. Other operations such as updating the CloudFormation template seem to respect the configured parallelism.
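
For reference, the parallelism controls discussed here live under OperationPreferences when the stack set is defined in CloudFormation. The fragment below is only a sketch with illustrative numbers; it shows where the knobs are, not a way around the sequential behavior described above.

```yaml
# Fragment of an AWS::CloudFormation::StackSet (Properties level).
# The numbers are illustrative, not recommendations.
OperationPreferences:
  RegionConcurrencyType: PARALLEL   # deploy to regions concurrently rather than one at a time
  MaxConcurrentPercentage: 50       # at most half of the accounts per region at once
  FailureTolerancePercentage: 10    # stop in a region once more than 10% of accounts fail
```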

Guidelines

Based on my findings and experience, here are some guidelines that hopefully can help you (and me) use StackSets effectively without falling into pits or shooting our feet off:

Lean towards self-managed stack sets: while the self-managed model can be a bit more work because you need to manually maintain a list of target accounts for your stack set and trigger an update to apply any changes, it has less surprising behavior and is more flexible. If you need to deploy across AWS organizations, want to have granular control over which accounts are deployed to, or you think you’ll need account-specific parameter values in your stacks, you probably want to go with the self-managed model!
Understand the potential pitfalls of using service-managed stack sets: This is especially important with automatic deployment enabled. You should understand the distinction between stack instances deployed through a stack set update versus those that have been auto-deployed (e.g., when a new account has been added to an OU). They behave differently and have different lifecycle rules, especially when coupled with account level targeting. One of the main benefits of the service-managed model is automatic deployment, so if you’re not using that, you might want to go with the self-managed model.
Consider a hybrid approach: you can use a service-managed stack set to automatically set up an IAM execution role in all accounts, then use that role for all your self-managed stack sets. This appears to be the approach used by AWS Control Tower, as an example.
Have a testing environment for critical stack sets: if your stack set will deploy fairly critical resources, strongly consider having a separate version of the stack set for testing purposes (e.g., in a staging environment) that you can test new changes against.
Maintain your stack sets using Infrastructure as Code (IaC): create and update your stack set using IaC, and set up a deployment pipeline for it. In my experience the StackSets support in CloudFormation (and the AWS CDK) seems more mature than in the Terraform AWS provider (and the Terraform AWS Cloud Control provider).
Learn the basics using self-managed stack sets: if you’re new to StackSets, consider starting with a self-managed stack set in a sandbox account. This approach doesn’t require management account access or delegated administrator privileges, allows for isolated testing, and gives you hands-on experience.
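
The hybrid approach mentioned above could be sketched roughly as follows: a service-managed stack set deploys a template like this one to every account, and your self-managed stack sets then use the resulting role as their execution role. The AdministratorAccountId parameter is a name chosen for illustration, and AdministratorAccess is used here for simplicity; you would likely want to scope the permissions down.

```yaml
# Template to be deployed by a service-managed stack set into each member account.
# AdministratorAccountId is a hypothetical parameter name.
AWSTemplateFormatVersion: "2010-09-09"
Description: "Execution role that self-managed stack sets can assume"

Parameters:
  AdministratorAccountId:
    Type: String
    Description: "Account that hosts the self-managed stack sets"

Resources:
  ExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      # Default execution role name that self-managed stack sets look for.
      RoleName: AWSCloudFormationStackSetExecutionRole
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub "arn:${AWS::Partition}:iam::${AdministratorAccountId}:root"
            Action: sts:AssumeRole
      ManagedPolicyArns:
        # Broad on purpose for the sketch; narrow this in real use.
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/AdministratorAccess"
```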

Conclusion

Despite its idiosyncrasies and limitations, AWS CloudFormation StackSets remains a powerful (and native!) tool for multi-account, multi-region deployments in AWS. Understanding its behavior will hopefully save you some headaches down the road. While AWS could certainly improve both the documentation and some of the service’s quirky behaviors, StackSets is still a valuable tool worth considering for managing infrastructure at scale.
