AWS for Industries
Automated networking with shared VPCs at Swisscom
How do enterprises efficiently manage networking across hundreds of cloud accounts while maintaining security, reducing costs, and minimizing operational overhead? As organizations such as Swisscom adopt Amazon Web Services (AWS), they face the complex challenge of implementing scalable, automated networking solutions that break away from traditional high-touch, manual approaches.
When Swisscom, Switzerland’s leading telecom provider, began their AWS cloud journey, they sought to revolutionize networking through a fully automated, secure, centrally governed implementation. Using AWS Shared VPCs and strategic automation allowed Swisscom to create a networking model that not only reduces IPv4 waste but also enables cost-effective scaling across hundreds of accounts while dramatically improving operational efficiency.
This post describes Swisscom’s requirements and their innovative journey to implement a large-scale, automated networking solution on AWS.
Swisscom’s requirements
When adopting cloud networking, you quickly realize that one size doesn’t fit all. For Swisscom, defining clear requirements was crucial to creating a successful networking strategy. The following key considerations shaped their approach:
- Multi-tenant but centrally governed: the network constructs should automatically extend to provisioned accounts, allowing reuse of critical centrally deployed and governed resources.
- Highly automated: the solution should be fully automated, not needing manual steps or the intervention of engineering or operational teams.
- Self service: the application teams can self-provision and deprovision networking components as needed for their workloads.
- Secure: the networking components align to segmentation and zoning architecture agreed by security teams.
- Reducing IPv4 waste: IPv4 addresses are finite resources, thus the amount of Swisscom routable IPs used in AWS must be kept as small as possible.
- Cost effective: this should scale to hundreds of accounts while being as cost efficient as possible.
To achieve these requirements, especially the cost efficiency and central governance, Swisscom decided to use Shared VPCs to minimize the number of VPCs and associated resources (such as NAT Gateways, AWS Transit Gateway Attachments, and VPC Endpoints). Shared VPCs allow a centrally deployed VPC to share its subnets with other accounts using AWS Resource Access Manager (AWS RAM). As a result, the VPC is centrally managed, but application teams can deploy resources into it from their own accounts. Furthermore, using Shared VPCs allows Swisscom to control the usage of internally routable IPv4 addresses, limiting the waste of these finite resources.
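As a rough illustration of the sharing mechanism, the following sketch builds the arguments that could be passed to the AWS RAM `create_resource_share` call (for example through boto3) to share subnets with a consumer account. The account IDs, Region, subnet ID, share name, and helper functions are hypothetical and are not taken from Swisscom's implementation.

```python
from typing import Dict, List


def subnet_arn(region: str, owner_account: str, subnet_id: str) -> str:
    """Build the ARN of a subnet owned by the central platform account."""
    return f"arn:aws:ec2:{region}:{owner_account}:subnet/{subnet_id}"


def build_share_request(
    name: str,
    region: str,
    owner_account: str,
    subnet_ids: List[str],
    consumer_account: str,
) -> Dict:
    """Assemble keyword arguments for the RAM create_resource_share call."""
    return {
        "name": name,
        "resourceArns": [subnet_arn(region, owner_account, s) for s in subnet_ids],
        "principals": [consumer_account],
        # Keep the share inside the AWS Organization.
        "allowExternalPrincipals": False,
    }


# Hypothetical values for illustration only.
request = build_share_request(
    name="team-1-dev-subnets",
    region="eu-central-1",
    owner_account="111111111111",
    subnet_ids=["subnet-0abc1234"],
    consumer_account="222222222222",
)
# With credentials for the platform account, this could be sent as:
#   boto3.client("ram").create_resource_share(**request)
```

Keeping the request construction separate from the API call makes the sharing rules easy to review and test without AWS credentials.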
VPC architecture
A key principle in the decision making was Swisscom’s belief that using cloud-native, AWS managed services was essential to achieving an automated, scalable solution that minimized their operational burden. The resulting architecture used AWS managed services such as AWS Service Catalog, AWS Direct Connect (for on-premises connectivity), and Transit Gateway (for transitive routing between on-premises and VPCs).
Swisscom hosts the Shared VPC centrally, in a “Platform VPC account” managed by the Cloud Platform Engineering team.
The Shared VPC architecture is made up of three types of subnets:
- Public, externally routable subnets for resources that are reachable to/from the internet, such as public Application Load Balancers. Outbound internet connectivity is achieved through an Internet Gateway (IGW) and NAT Gateway to limit usage of public IPv4 addresses.
- Private, internally routable subnets using Swisscom assigned internal routable IPv4 address space. This makes sure that it can be reached over Swisscom’s dedicated Direct Connect connection from an on-premises datacenter.
- Private, non-routable subnets using Swisscom assigned IPv4 address space that is locally significant to the VPC. This is used to host resources that don’t need ingress connectivity from on-premises or the internet, such as Kubernetes Pods and backend databases that need significant IPv4 usage. Resources in this space can access services beyond the VPC through private or public NAT gateways.
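To make the difference between the routable and non-routable address spaces concrete, the following sketch carves one subnet per Availability Zone out of each space using Python's standard `ipaddress` module. All CIDR ranges, prefix lengths, and the number of Availability Zones are assumptions chosen for illustration; the post does not state Swisscom's actual ranges.

```python
import ipaddress
from typing import List

# Hypothetical ranges: a small routable block and a large VPC-local block.
ROUTABLE_CIDR = "10.20.0.0/24"
NON_ROUTABLE_CIDR = "100.64.0.0/16"
AWS_RESERVED_PER_SUBNET = 5  # AWS reserves five addresses in every subnet


def carve(cidr: str, new_prefix: int, count: int) -> List[ipaddress.IPv4Network]:
    """Return the first `count` subnets of size `new_prefix` inside `cidr`."""
    subnets = list(ipaddress.ip_network(cidr).subnets(new_prefix=new_prefix))
    if count > len(subnets):
        raise ValueError(f"{cidr} cannot hold {count} /{new_prefix} subnets")
    return subnets[:count]


def usable(subnet: ipaddress.IPv4Network) -> int:
    """Addresses left for workloads after the AWS reserved addresses."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


# One subnet per Availability Zone (three zones assumed).
routable = carve(ROUTABLE_CIDR, new_prefix=26, count=3)
non_routable = carve(NON_ROUTABLE_CIDR, new_prefix=18, count=3)

for subnet in routable + non_routable:
    print(subnet, usable(subnet))
```

In this sketch the scarce routable space is handed out in small blocks, while address-hungry workloads such as Pods draw from the much larger VPC-local range.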
Figure 1 – Swisscom Shared VPC architecture
A standardized Shared VPC in the Swisscom Organization looks like the following:
1. Platform VPC Account: The VPCs are deployed by the platform engineering team in a dedicated “Platform VPC” account per environment, such as development, staging, and production. Each VPC consists of multiple subnets:
a) Public routable subnets for internet connectivity. Public routable subnets are shared among multiple accounts.
b) Private routable subnet for intra-VPC and on-premises connectivity. Private routable subnets are shared among multiple accounts.
c) Multiple private non-routable subnets dynamically created based on application team requirements. Private non-routable subnets are shared to only one account, and dedicated to specific applications.
d) One non-routable endpoint subnet, where data plane service endpoints with low latency requirements are deployed. Endpoint subnets are never shared to any account.
e) Transit Gateway attachment subnet for attaching the VPC to the transit gateway that resides in the centralized Platform VPC account. Transit gateway attachment subnets are never shared.
2. Team 1 Account: Application teams can request sharing the different subnets to their account. Private routable subnets are always shared. Furthermore, Team 1 requested the public subnet and a dedicated, non-routable subnet.
3. Team 2 Account: The Team 2 workload doesn’t need public exposure, as they operate an internal web application only. Therefore, they didn’t request a public subnet.
4. Team 3 Account: Like Team 2, Team 3 has no public exposure. They host only an internal-facing Amazon Redshift database, and the endpoint must be reachable from on-premises.
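The sharing rules in the list above can be summarized in a small sketch. The subnet type labels, the `TeamRequest` fields, and the function name are hypothetical; only the rules themselves (private routable always shared, public and dedicated non-routable on request, endpoint and Transit Gateway attachment subnets never shared) come from the description above.

```python
from dataclasses import dataclass
from typing import List

# Subnet types that must never be shared with application accounts.
NEVER_SHARED = {"endpoint", "tgw-attachment"}


@dataclass
class TeamRequest:
    account: str
    wants_public: bool = False
    wants_dedicated_non_routable: bool = False


def subnet_types_for(request: TeamRequest) -> List[str]:
    """Return the subnet types to share with the requesting account."""
    shared = ["private-routable"]  # always shared
    if request.wants_public:
        shared.append("public")
    if request.wants_dedicated_non_routable:
        shared.append("private-non-routable")
    return shared


team1 = TeamRequest("team-1", wants_public=True, wants_dedicated_non_routable=True)
team2 = TeamRequest("team-2")  # internal web application only

print(subnet_types_for(team1))
print(subnet_types_for(team2))
```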
The use of Shared VPCs drove the requirement for an automated solution, which allowed internal application teams to request access to the Shared VPCs being managed centrally. When making this request, the automation creates and shares subnets to their account for consumption while implementing the Swisscom public cloud zoning model.
In the following section we focus on the details of the automated Shared VPC approach.
Automation walkthrough
Automation has a significant role in allowing Shared VPCs to be consumed by application teams. It alleviates many of the operational concerns that arise from Shared VPCs, such as how these are requested and provisioned, managing the capacity of subnets shared among multiple teams, and how to scale beyond a single VPC.
Swisscom built an automated solution to handle these considerations. In the following steps we describe the workflow:
Figure 2 – Components for automating Shared VPC requests
The high-level flow is shown in the following steps:
1. Establish shared VPCs: As mentioned previously, the networking is established by the Platform Engineering team using infrastructure as code (IaC) in a central “Platform VPC” Account. We show two VPCs (one for Dev and one for Prod Workloads), but there may be many more. These are attached to a transit gateway with different route tables for each environment, which enforces secure zoning of traffic. Within the VPCs we have private subnets in green (routable to on-premises) and public subnets in blue (routable to the internet). These subnets are available to be shared with other accounts.
2. Automated capacity management: In a multi-tenant Shared VPC setup, you must track the usage of key VPC metrics to measure their readiness for new applications. A capacity tracking application is used to measure and score the VPC health. This application uses Amazon CloudWatch events to periodically trigger AWS Lambda functions, which store capacity data in Amazon DynamoDB.
3. Provisioning requests with Service Catalog: The application team uses a Service Catalog product to request VPC subnets, using a Lambda custom resource to send requests to AWS Step Functions in the “Platform VPC” account for handling the request.
4. VPC scheduler: Step Functions validates the VPC that should be used by checking the environment (in this case Dev) and the VPC health score. The net result is the selection of a VPC with appropriate capacity.
5. Subnet sharing: The subnets are shared with the application account. The account admins can’t change the VPC setup, but they consume it by deploying workloads into the shared subnets from their own account.
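Step 3 of the flow above can be sketched as follows: a Lambda-backed custom resource receives a CloudFormation custom resource event and turns it into the input for a Step Functions execution in the platform account. The `RequestType` and `ResourceProperties` fields are part of the CloudFormation custom resource event; the property names (`AccountId`, `Environment`, `Size`) and the payload layout are assumptions made for this example.

```python
import json

VALID_ACTIONS = {"Create", "Update", "Delete"}


def build_execution_input(event: dict) -> str:
    """Translate a custom resource event into a Step Functions input string."""
    action = event["RequestType"]
    if action not in VALID_ACTIONS:
        raise ValueError(f"Unsupported request type: {action}")
    props = event.get("ResourceProperties", {})
    payload = {
        "action": action,
        "requester_account": props["AccountId"],  # hypothetical property
        "environment": props["Environment"],      # hypothetical property
        "size": props["Size"],                    # hypothetical property
    }
    return json.dumps(payload, sort_keys=True)


# Hypothetical event, trimmed to the fields used in this sketch.
example_event = {
    "RequestType": "Create",
    "ResourceProperties": {
        "AccountId": "222222222222",
        "Environment": "dev",
        "Size": "M",
    },
}
execution_input = build_execution_input(example_event)
# After assuming a role in the platform account, this could be sent as:
#   boto3.client("stepfunctions").start_execution(
#       stateMachineArn=STATE_MACHINE_ARN, input=execution_input)
print(execution_input)
```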
In the next section we walk through some of the key features of this solution.
Automated capacity management
Shared VPCs involve multiple tenants sharing the same address space, thus managing the capacity of each VPC is crucial. Swisscom identified this requirement and built an automated capacity management solution that regularly monitors the provisioned Shared VPCs and their readiness to host new applications using a scoring system. Furthermore, the scoring system should make sure that there is room for existing applications to grow in the existing VPCs, and if not then inform the Platform team through CloudWatch alarms.
Hundreds of accounts are expected to request a VPC, thus it can be assumed that eventually the capacity of a single VPC is fully consumed. Swisscom needed a way to scale the number of VPCs without manually assigning a VPC to an account. Therefore, VPCs are “pooled” together into logical groups. These groups are used to identify the application environment (such as dev, pre-production, production) and the functional group (such as general workload, Telco, Streaming and Analytics) using tags.
When a new VPC is created, it is added to an inventory of active VPCs within DynamoDB using the following process:
- The platform engineering team deploys a new VPC with tags for the pool (a grouping of VPCs) and function (alignment to a Swisscom division or use case).
- For new VPCs a CloudWatch event triggers a Lambda function to collect VPC details. For existing VPCs this is regularly validated using a scheduled CloudWatch event.
- The Lambda function collects key metrics, such as subnet usage and tags, and calculates a health score.
- Details are stored in a DynamoDB table to be referenced when VPC requests are received.
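The collection steps above can be sketched as a function that turns subnet usage into a health score and an inventory record. The post does not describe how Swisscom calculates the score, so the formula here (percentage of free IP addresses across the VPC's subnets) and the record layout are purely illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SubnetUsage:
    total_ips: int
    available_ips: int


def health_score(subnets: List[SubnetUsage]) -> int:
    """Hypothetical score: percentage of free addresses, from 0 to 100."""
    total = sum(s.total_ips for s in subnets)
    if total == 0:
        return 0
    free = sum(s.available_ips for s in subnets)
    return 100 * free // total


def inventory_item(vpc_id: str, tags: Dict[str, str], subnets: List[SubnetUsage]) -> Dict:
    """Build the record that would be stored in the inventory table."""
    return {
        "vpc_id": vpc_id,
        "pool": tags["pool"],
        "function": tags["function"],
        "health_score": health_score(subnets),
    }


# Hypothetical VPC with two subnets, half of the addresses still free.
item = inventory_item(
    "vpc-0example",
    {"pool": "dev", "function": "general"},
    [SubnetUsage(250, 100), SubnetUsage(250, 150)],
)
# The item could then be written with a DynamoDB put_item call.
print(item)
```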
Figure 3 – Collecting and storing Shared VPC capacity data
VPC provisioning requests
The platform engineering team centrally creates and shares well-architected Service Catalog products with application accounts, allowing application owners to self-provision VPCs. Using Service Catalog makes sure that application account users don’t need permissions to manage networking in the account. Using constraints makes sure that the Service Catalog product has the permissions it needs to provision the required resources. As an input to the product, some parameters are taken from pre-provisioned AWS Systems Manager parameters that define data aligned to account characteristics (for example, the environment being Dev or the line of business). Enforcing the use of Systems Manager parameters provides guardrails that make sure the account is assigned the right VPC, which is appropriately zoned and secured for that use case.
The Service Catalog product allows the selection of t-shirt sizes and other parameters to make sure that the requested capacity is assigned.
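One way to picture t-shirt sizing is as a mapping from a size name to a subnet prefix length, followed by a search for the first free block of that size. The size names, prefix lengths, pool CIDR, and function name below are hypothetical and only illustrate the idea.

```python
import ipaddress
from typing import List, Optional

# Hypothetical mapping of t-shirt sizes to subnet prefix lengths.
TSHIRT_PREFIX = {"S": 26, "M": 24, "L": 22}


def next_free_subnet(
    pool_cidr: str, allocated: List[str], size: str
) -> Optional[ipaddress.IPv4Network]:
    """Return the first block of the requested size that overlaps nothing allocated."""
    if size not in TSHIRT_PREFIX:
        raise ValueError(f"Unknown t-shirt size: {size}")
    taken = [ipaddress.ip_network(cidr) for cidr in allocated]
    pool = ipaddress.ip_network(pool_cidr)
    for candidate in pool.subnets(new_prefix=TSHIRT_PREFIX[size]):
        if not any(candidate.overlaps(existing) for existing in taken):
            return candidate
    return None  # the pool has no room left for this size


# Hypothetical non-routable pool with one /24 already handed out.
small = next_free_subnet("100.64.0.0/16", ["100.64.0.0/24"], "S")
medium = next_free_subnet("100.64.0.0/16", ["100.64.0.0/24"], "M")
print(small, medium)
```

Returning `None` when the pool is exhausted gives the caller a clear signal to try another VPC in the same pool.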
Figure 4 – Sample Service Catalog VPC product
VPC scheduler
The VPC scheduler is responsible for receiving and processing the requests from the AWS Service Catalog products, providing reliable cross-account automation for assigning the Shared VPC networking resources to the requestor. This was an area of trial and error for Swisscom. We initially used Lambda custom resources as a mechanism to assume a role in the platform account and trigger a Service Catalog product, which provisioned and shared the VPC subnets with the initiating spoke account.
Although initially this setup was functional, it increased operational load on the Swisscom Platform Team due to the following limitations:
1. Having multiple Service Catalog products needlessly increased the complexity.
2. Service Catalog doesn’t natively support retry mechanisms or error management.
As a result, the central scheduler was re-architected into Step Functions, as shown in the following figure. This was more customizable and more effective in handling retries and errors, and it integrated directly with the AWS CloudFormation service whenever possible.
Figure 5 – Step Functions workflow to assign VPC and share subnets
1. Input validation: Verifies the incoming request event (Create, Update, or Delete) from the spoke Service Catalog product.
2. VPC selection: Queries the DynamoDB table to find a suitable VPC based on requested t-shirt size, function, capacity, and tags.
3. Subnet creation: Creates new subnets and shares them with AWS RAM using CloudFormation.
4. Systems Manager parameters: Retrieves the outputs of the CloudFormation template, such as Subnet-IDs, and returns them to the requesting account.
5. Error handling: Implements retry logic for transient failures and provides clear error messages for permanent failures.
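Two of the steps above, VPC selection and error handling, can be sketched in plain Python. The selection policy (highest health score within the matching pool and function), the inventory record layout, and the `TransientError` type are assumptions for illustration; in Step Functions itself, retries would typically be declared with the `Retry` field of a state rather than coded by hand.

```python
import time
from typing import Callable, Dict, List, Optional


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def select_vpc(
    inventory: List[Dict], pool: str, function: str, min_score: int
) -> Optional[Dict]:
    """Pick the healthiest VPC that matches the requested pool and function."""
    candidates = [
        vpc for vpc in inventory
        if vpc["pool"] == pool
        and vpc["function"] == function
        and vpc["health_score"] >= min_score
    ]
    return max(candidates, key=lambda vpc: vpc["health_score"], default=None)


def with_retry(step: Callable[[], Dict], attempts: int = 3, base_delay: float = 0.0) -> Dict:
    """Run a step, retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == attempts:
                raise  # permanent failure after the last attempt
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


# Hypothetical inventory records, as they might be read from DynamoDB.
inventory = [
    {"vpc_id": "vpc-a", "pool": "dev", "function": "general", "health_score": 20},
    {"vpc_id": "vpc-b", "pool": "dev", "function": "general", "health_score": 70},
    {"vpc_id": "vpc-c", "pool": "prod", "function": "general", "health_score": 90},
]
chosen = with_retry(lambda: select_vpc(inventory, "dev", "general", min_score=30))
print(chosen["vpc_id"])
```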
Subnet sharing
Upon completion of the VPC Scheduler, the application team receives a success message for the provisioned Service Catalog product. They can update and delete their Shared VPC if necessary. These actions would retrigger the cross-account automation. The application team can now provision services into the newly assigned VPC, inheriting the zoning, connectivity, and IP addressing, but they can’t change it.
Lessons learned
One of the most significant advantages of the shared VPC approach is the simplicity it brings to consumers. Using a shared VPC relieves consumers of the platform from the burden of dealing with complex network configurations and time-consuming operations. Instead, they can focus their efforts on core business functions, allowing for increased productivity and efficiency. This simplicity not only enhances overall user experience but also reduces the learning curve for new team members. Although consumers benefit from simplicity, providers must navigate the complexities of building and managing the shared infrastructure and the associated automation, and they must ensure its availability, security, and seamless integration with various consumer teams.
At Swisscom, we learned that our internal platform consumers appreciate this solution but also expect it to work flawlessly at all times. After all, a self-developed solution must not lower the service level compared to the battle-proven services that AWS offers. However, after investing in and building up the domain expertise required for this shared VPC concept, we quickly reaped the fruits of our labor. Today, Swisscom’s organization on AWS hosts more than 800 accounts. For this footprint, we calculated that an account-dedicated VPC design would be approximately 30 times more expensive than the implemented shared VPC design. Thanks to the scalable architecture, cost savings are likely to be even higher as the number of accounts within the organization grows.
Conclusion
Swisscom’s cloud journey started on the premise of building standardization, optimizing costs, and reducing complexity or cognitive load through automation. The choice to use Shared VPCs necessitated detailed considerations for the application onboarding journey, shifting the heavy lifting from users to the platform providers, where automation can be used to ease the operational burden.
Although the shared VPC approach introduces additional complexity for platform providers, the centralization of resources and network management ultimately results in a reduction in overall costs. Using shared infrastructure allowed platform providers to optimize resource usage, monitor performance on a broader scale, and minimize redundant network components. Although the initial setup and ongoing maintenance may require additional investments, the long-term cost benefits and simplified experience for consumer teams make it worthwhile.