Deciding between large accounts or micro accounts for distributed operations at AWS
When you’re starting your journey at AWS, you must define your AWS account strategy. There are many possible variations for how to organize the AWS accounts – by workload, team, specialization, business domain, functional domain, and many others. A common question from customers is: should I deploy multiple workloads into a single AWS account, or have one workload per AWS account? We would like to present a decision flow to help you answer this question and avoid having to migrate your workload to a different account later.
A multi-account strategy is one of the most important pillars for a strong foundation at AWS. When adopting multiple accounts, you can create compartments for cost, service limits, blast radius, and resources. You can find a list of benefits and additional information in the whitepaper Organizing Your AWS Environment Using Multiple Accounts, which describes the best practices for the multi-account approach and how it can be implemented using AWS Organizations. Furthermore, the concept of a foundation inside of AWS is explored in this AWS Prescriptive Guidance, where the term Landing Zone is explained as the associated architecture where customers can host workloads by following best practices for multi-accounts, network, security, and monitoring. Customers can create landing zones by establishing their multi-account strategy using Organizations, and can quickly set up a multi-account environment that follows best practices with AWS Control Tower.
Even after understanding the concepts associated with a multi-account strategy and the landing zone, many customers still have questions, such as:
- I have a new workload, do I have to create a new account, or use an existing one?
- How could I decide between having multiple workloads in one account, or having one workload per account?
- What is the right operating model for my multi-account structure?
In this post, we share a decision flow that helps you decide when to use large accounts (with multiple workloads) and when to use micro-accounts (with a single workload).
Understanding your operating model
An operating model defines the roles, teams, processes, and responsibilities involved in operations. AWS publishes documentation about Cloud Operating Models, where you can find detailed guidance. The first step is to define the operating models for the AWS Cloud. Each operating model has its own strengths and requirements, and it’s important to design a flexible landing zone. The two most common operating models are:
- Centralized Operating Model: in this model, the application’s infrastructure is provisioned and managed by the same team. This means that this team is in charge of all of the infrastructure resources, including but not limited to, computing, networking, storage, and security. Since just one team will touch the infrastructure, there’s no need to segregate access to different resources that belong to different applications. You can create Role Based Access if you have specialized teams (DBAs, Middleware, Windows, Linux) within the centralized team, similar to a traditional on-premises operation. If you embrace Infrastructure-as-Code (IaC), then a multidisciplinary and centralized team can also handle everything and deliver deployments through pipelines, thereby performing a kind of centralized DevOps model. In both cases, the application team is totally separated from the infrastructure team. The developer handles the application code, and depends on the centralized infrastructure team to build the infrastructure and pipelines. During an incident, the centralized infrastructure team is the first point of contact, acting as the main level for the whole environment. Application teams tend to be engaged during troubleshooting, but everything is led and conducted by the centralized team. This operating model can also be called the Fully Separated Operating model.
- Decentralized Operating Model: This model is best suited for a multidisciplinary team that builds and runs their own application environment, and manages products that belong to their portfolio. It can be a kind of delegation to a Business Unit or to a dedicated team. In this case, the team or Business Unit is responsible for operations, but just for their application’s resources. Because of this independence, resource segregation is highly recommended to avoid people stepping on each other’s resources. By applying this operating model, the decentralized team becomes specialized in one or more specific products. During an incident, this team is responsible for conducting all of the ongoing tasks, not depending on a centralized infrastructure team. This operating model can also be called the Separated Application Engineering and Operations (AEO) and Infrastructure Engineering and Operations (IEO) with Centralized Governance.
These operating models influence the kind of account that will host your workload, because they determine the level of segregation that your resources need from an operations perspective. But this isn’t the only decision criterion. Other technical concepts should be evaluated, such as the blast radius and how you handle your service limits. You can think of blast radius as the maximum impact that might be sustained in the event of a system failure. When an account supports multiple workloads, you have to measure the impact that a compromise of that account would have on all of them. Another point is to understand AWS service limits (quotas). These limits apply per account and per Region, and each service has its own limits. Knowing them is essential to understanding whether workloads located in the same account will compete for the same AWS service quotas.
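To make the competition for service limits concrete, the following is a minimal sketch that sums the expected demand of each workload in a shared account and flags any quota that the combined demand would exceed. The function name, quota names, workload names, and numbers are all hypothetical and chosen for illustration only; they aren't read from any AWS API.

```python
from collections import defaultdict


def find_quota_conflicts(quotas, workload_demand):
    """Return {quota_name: (total_demand, limit)} for each exceeded quota.

    quotas: per-account, per-Region limits, keyed by a quota name.
    workload_demand: {workload_name: {quota_name: expected_usage}}.
    """
    totals = defaultdict(int)
    for demand in workload_demand.values():
        for quota_name, amount in demand.items():
            totals[quota_name] += amount
    return {
        name: (total, quotas[name])
        for name, total in totals.items()
        if name in quotas and total > quotas[name]
    }


# Illustrative limits and demand for one account in one Region (assumed values).
quotas = {"lambda_concurrent_executions": 1000, "vpcs": 5}
workload_demand = {
    "orders": {"lambda_concurrent_executions": 700, "vpcs": 1},
    "billing": {"lambda_concurrent_executions": 400, "vpcs": 1},
}

conflicts = find_quota_conflicts(quotas, workload_demand)
print(conflicts)
```

In this example, the two workloads fit within the VPC limit but together exceed the concurrency limit, which is the kind of signal that suggests a separate account or a quota increase request.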
Now, let’s change the perspective from operations to application. AWS services and customer applications can assume roles at AWS. This is how an application interacts with an AWS service. For example, if you have an AWS Lambda function that must update an Amazon DynamoDB table, then you can allow the Lambda function’s execution role to perform this action on the DynamoDB table. Because of this, you must analyze the level of segregation that your application needs when sharing an account with other workloads. You must restrict Lambda functions, containers, and Amazon Elastic Compute Cloud (Amazon EC2) instances so that they can access only their associated resources. To provide this level of segregation, you need an Identity and Access Management (IAM) team that is in charge of managing IAM roles and policies.
The IAM team can take advantage of Attribute Based Access Control (ABAC), resource-level permissions, and authorization based on tags to segregate and isolate resources. This ensures that each application accesses only its own resources. Each AWS service has its own authorization conditions, so it’s important to review the features that each service supports by looking at AWS services that work with IAM. In micro-accounts, application access segregation is already in place, because the account itself is the isolation boundary.
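As an illustration of authorization based on tags, the following sketch builds an IAM policy document in which a principal can start or stop only the Amazon EC2 instances whose tag value matches the principal’s own tag. The helper name build_abac_policy and the tag key application are assumptions made for this example. The condition keys aws:ResourceTag and aws:PrincipalTag are global IAM condition keys, but you should confirm that each service and action you use supports tag-based authorization.

```python
import json


def build_abac_policy(tag_key):
    """Build a policy that allows actions only when resource and principal tags match."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowOnlyOnOwnApplicationResources",
                "Effect": "Allow",
                "Action": ["ec2:StartInstances", "ec2:StopInstances"],
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    "StringEquals": {
                        # The resource tag must equal the tag on the calling principal.
                        "aws:ResourceTag/" + tag_key: "${aws:PrincipalTag/" + tag_key + "}"
                    }
                },
            }
        ],
    }


# "application" is a hypothetical tag key used to separate workloads.
policy = build_abac_policy("application")
print(json.dumps(policy, indent=2))
```

Because the policy compares tags instead of listing resource ARNs, the same document can be attached to the roles of several applications in a large account, provided that tagging is enforced consistently.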
Before moving to the decision flow, consider the following: as described in the Multi-Account Best practices, it’s recommended to separate your workloads by software development lifecycle, which means that Development, Staging, and Production should have their own accounts. We can go further and determine that one account can act as a “sidecar”, supporting resources or pipelines that enable you to build, validate, promote, and release changes to your workloads across the accounts of all environments. That is the reason why we start the decision flow assuming that the minimal structure related to accounts is already defined. Therefore, you already have Organizational Units (OUs) for each environment (Dev, Stag, and Prod). Keep in mind that other OUs could be required to support accounts related to infrastructure services, security, logs, and other shared services that support multiple workloads. We omit those OUs because this post is focused on the decision flow to create or reuse existing accounts, but you can find detailed information in this AWS Prescriptive Guidance.
Let’s pretend that you are a Product Owner, have a team working with you to develop a new product, and don’t know if you need new accounts. Again, assuming that your organization already has OUs defined, let’s go through the decision flow, using the application as a starting point. The decisions related to the application, taken by the Product Owner, should be influenced by operating models, reduced blast radius, and competition on service limits.
- Operating Model: The first consideration to be studied is between centralized and decentralized operations. This depends on what the company or the Product Owner has in terms of operations. The main goal of this decision flow is to define who is the owner of the account or environment to avoid having gray boundaries (when a problem or event related to a resource doesn’t have a point of contact). The accounts, acting as containers for resources, must have an owner who can decide in terms of access and limits, thus being the main point of contact for operation, including security issues, compliance rules, tags, costs, and workload availability. As a best practice, the IAM team should be decoupled from the workload management. This means that regardless of which operating model you select, access management will belong to a centralized security team, as you can see in the following description for Centralized and Decentralized models:
a. Centralized Operations:
i. Application team is responsible for the application code.
ii. Platform and Infrastructure Team is responsible for compute, storage, network, and troubleshooting tasks, defending the overall SLA. Here, you have the account owner.
iii. IAM Team is in charge of Access Management.
b. Decentralized Operations:
i. Application and Platform are managed by the same team, which builds and runs the workload. Here, you have the account owner.
ii. IAM Team is in charge of Access Management.
- Reduced Blast Radius: Depending on how critical the workload is, you should consider how to reduce the blast radius even when you select a centralized operating model. If a reduced blast radius is needed, then new accounts are necessary. This step can help you avoid “huge accounts” with failures that can impact multiple workloads.
- Competition on service limits: Workloads that are sharing the same account also share the same limits within the Region. You must carefully analyze this kind of competition to identify if another account is necessary.
Applying the decision flow
Now, let’s move to the following decision flow using the Application as the starting point, playing the role of the Product Owner:
As a first exercise, let’s assume that you, as a Product Owner, don’t have a team with enough knowledge of how to operate at AWS, or your company embraces a centralized operating model for all of the workloads. Therefore, the answer to the first question in the flow is to go with a Centralized Operating model, where your development team will just build the application and consume pipelines and infrastructure provided by a centralized IT operations team. However, another point should be analyzed: What happens if the account is compromised due to other applications, and the impact reaches your application? If you don’t want to face this risk, or the business impact is too high, then you need a reduced blast radius, meaning that new accounts should be created for your workload and managed by the centralized team. If you don’t need a reduced blast radius and there’s no competition for limits in the same account, then you can take advantage of an existing account that is also owned by the centralized team. It may seem obvious, but it’s important to emphasize that each account should have an owner who is the point of contact, avoiding gray boundaries. The centralized team will work with you to decide the best place to host your workload, based on the requirements and the decision flows illustrated in the figure above. As mentioned before, Attribute Based Access Control (ABAC) can also be used to allow developers to launch resources in a shared account.
The second scenario is when your team can build and operate a specific product. In this scenario, there’s one team responsible for both application code and infrastructure. This team will own the operation, being responsible for code and infrastructure deployments and troubleshooting incidents. To enable this, it will be necessary to delegate account access to the team, so that they own all of the account’s resources. There are benefits to using this model, since the application’s backlog will be self-contained in the team, avoiding external dependencies on other teams. On the other hand, the team is responsible for responding to incidents and owning the application’s SLA. Following this path, it will be necessary to create dedicated accounts that support just one product.
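The two scenarios above can be summarized as a small function. This is a sketch of the decision flow as described in the text, not an official AWS tool; the Workload class, its field names, and the returned labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    decentralized_operations: bool      # the team builds and runs the product
    needs_reduced_blast_radius: bool    # impact from other workloads is unacceptable
    competes_for_service_limits: bool   # would contend for quotas in a shared account


def place_workload(workload):
    """Return (account_choice, account_owner) following the decision flow."""
    if workload.decentralized_operations:
        # The product team owns the operation, so it gets dedicated accounts.
        return ("new accounts", "product team")
    if workload.needs_reduced_blast_radius or workload.competes_for_service_limits:
        # Still centralized operations, but isolated from other workloads.
        return ("new accounts", "centralized team")
    # Centralized operations, no isolation or quota concerns: reuse a large account.
    return ("existing accounts", "centralized team")


reporting = Workload("reporting", False, False, False)
payments = Workload("payments", False, True, False)
catalog = Workload("catalog", True, False, False)

for w in (reporting, payments, catalog):
    print(w.name, place_workload(w))
```

Every branch returns an owner together with the account choice, which reflects the goal of avoiding gray boundaries: no workload is placed without a defined point of contact.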
It’s important to mention that micro-accounts shouldn’t be tied to teams. Instead, bind them to products. The reason is that teams sometimes change during the lifecycle of a product. To illustrate why micro-accounts shouldn’t be bound to teams, let’s suppose that Team A is building and running their own Product A using micro-accounts, and after some time, they also become in charge of Product B and deploy it to the same set of accounts as Product A. Then, Product B grows quickly and suddenly needs a dedicated team, Team B, to manage it. Since Product A and Product B are in the same accounts, they share permissions and access the same resources. Because of this, it will be necessary to migrate Product B to Team B’s accounts. If Product B had instead been created in its own set of accounts (even with Team A managing it initially), then the handover would be simpler: grant Team B access to Product B’s accounts and revoke Team A’s access.
By comparison, in the previous scenario with a Centralized Operating Model, the same change is just a matter of giving the new team access to the code repositories and development environments. It wouldn’t require any handover tasks or a migration of Product B to another operations team or set of accounts, because everything is already centralized, and the permissions necessary to operate are all set.
Depending on the decision path that you selected in the decision flow, you’re directed to host your workload in an existing set of AWS accounts, or to create a new set of accounts. As a result, we can say that Centralized and Decentralized operating models aren’t completely tied to large or micro accounts. They can use both options, depending on each workload’s requirements. This means that you can have one centralized operations team that manages workloads in large and micro accounts simultaneously. You can also have a decentralized operations team or business unit supporting their workload in micro-accounts.
You can use Operating Model, Blast Radius, Service Limits, and Costs to decide which kind of account group a new workload needs. On top of this, we can share some lessons learned from the field:
- Utilize AWS as a platform and scale out your accounts when needed, using micro-accounts to reduce blast radius and delegating accounts to specialized application teams that can build and run their own workload in a self-contained compartment. Micro-accounts are perfect for Decentralized Operations, but they require a good investment in automating account provisioning, such as VPC setup, NAT, IAM controls like IAM Permissions Boundaries, and pipelines for distributing guardrails at scale. They also require managing service limits, such as the number of accounts governed by AWS Control Tower and other services. To avoid micro-account proliferation and unnecessary account creation, consider large accounts with Centralized Operations as a balance.
- Centralized Operating Models should be used by workloads that don’t require reduced blast radius or resource segregation between teams. Additionally, development teams that don’t have enough skills to operate their application should take advantage of centralized operations. Using a centralized team to operate the environment creates a dependency that may slow down projects, since the application team’s backlog isn’t the same as the centralized operations team’s. AWS Service Catalog, customized self-service portals, and pipelines can help mitigate this kind of dependency and even facilitate tag management for cost distribution within the large account. Avoid creating “huge” accounts, because they increase the blast radius.
- For large accounts, an Access Management team can use resource-level permissions and authorization based on tags to segregate resources between applications. To avoid becoming a bottleneck, the Access Management team can adopt IAM Permissions Boundaries, pipelines, or automations to delegate the creation of IAM roles and policies.
- Principle of least privilege (PoLP) is mandatory! Micro accounts reduce the impact of not having PoLP, but it’s still mandatory. For large and micro accounts, you can use IAM API operations to periodically export the last accessed services and report unused permissions back to the application team, so that they can fine-tune their roles’ policies. Create a validation cycle where the application teams acknowledge current permissions and update them as needed.
- Landing Zones must be ready to support different kinds of accounts and operating models. Establish a flexible cloud foundation that considers both the Centralized and Decentralized operating models, as well as large and micro accounts.
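The periodic export of last accessed services mentioned in the least privilege lesson above could be sketched as follows. The function takes an IAM client, such as one created with boto3.client("iam"), and relies on the IAM operations GenerateServiceLastAccessedDetails and GetServiceLastAccessedDetails. The function name, the polling defaults, and the example role ARN are assumptions for illustration.

```python
import time


def services_without_recorded_access(iam_client, role_arn, poll_seconds=1.0, max_polls=30):
    """List service namespaces a role is allowed to use but has no recorded access to."""
    job_id = iam_client.generate_service_last_accessed_details(Arn=role_arn)["JobId"]

    # The report is generated asynchronously, so poll until the job finishes.
    for _ in range(max_polls):
        page = iam_client.get_service_last_accessed_details(JobId=job_id)
        if page["JobStatus"] == "COMPLETED":
            break
        if page["JobStatus"] == "FAILED":
            raise RuntimeError("last accessed report failed for " + role_arn)
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("last accessed report did not complete in time")

    unused = []
    while True:
        for service in page["ServicesLastAccessed"]:
            # LastAuthenticated is absent when no access was recorded.
            if "LastAuthenticated" not in service:
                unused.append(service["ServiceNamespace"])
        if not page.get("IsTruncated"):
            break
        page = iam_client.get_service_last_accessed_details(
            JobId=job_id, Marker=page["Marker"]
        )
    return sorted(unused)


# Example usage (requires boto3 and AWS credentials; the ARN is a placeholder):
#   import boto3
#   iam = boto3.client("iam")
#   print(services_without_recorded_access(iam, "arn:aws:iam::111122223333:role/example"))
```

The resulting list is a starting point for the conversation with the application team rather than an automatic removal list, since some permissions are used rarely (for example, during recovery procedures).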
In this post, we demonstrated a decision flow that guides you on when to create new AWS accounts and when to use existing ones. Customers can use this decision flow as a mechanism to balance account creation against workload placement into existing accounts, and to distinguish between Centralized and Decentralized operating models.