Networking & Content Delivery

How FactSet handles networking for 1000+ AWS accounts

This blog post was co-written by FactSet's Cloud Infrastructure team (Gaurav Jain, Nathan Goodman, Geoff Wang, Daniel Cordes, and Sunu Joseph) and AWS Solutions Architects Amit Borulkar and Tarik Makota.

In FactSet's own words: "FactSet creates flexible, open data and software solutions for tens of thousands of investment professionals around the world. These solutions provide instant access to financial data and analytics that investors use to make crucial decisions. At FactSet, we are always working to make our product more valuable to our customers."

Introduction

At FactSet, we have thousands of AWS accounts, and we needed a solid network foundation so that engineering teams could move their workloads quickly and securely into AWS. We set out with the goal of providing FactSet engineering teams with a secure, fast, and reliable hybrid connectivity layer. This networking layer is what allows all of our applications in AWS to reach on-premises applications, and vice versa. We also needed to deploy this architecture across the 12 Regions we currently use. In this blog, we describe how we manage networking for 1000+ AWS accounts at scale using what we refer to as our microAccount architecture. In this architecture, each AWS account is allocated to a single project and is owned by a single engineering team.

Before we dive in, it’s important that you understand our design principles:

  1. We use AWS managed network services as much as possible.  Our goal is to reduce operational overhead by minimizing the use of third-party tools running on Amazon Elastic Compute Cloud (Amazon EC2) instances.
  2. For governance, we use AWS APIs, infrastructure-as-code, and release pipelines to deploy services and enforce a desired state on a regular interval.
  3. We deploy the best available solution at any given time, and we routinely re-evaluate and optimize as offerings mature.  This principle has meant a few more planned maintenance windows, but it has resulted in a more reliable, robust, and feature-rich network.
  4. We build network foundations that allow for developer autonomy and flexibility, while still protecting the network.

Overview of solution

AWS services, including AWS Transit Gateway, Amazon Route 53 Resolver rules, shared Amazon Virtual Private Cloud (VPC) subnets, and AWS Resource Access Manager (RAM), are the foundation of our solution.  Early in our cloud journey, we tested the idea of building a transit VPC architecture using solutions from third-party network vendors.  That approach led to a complex setup that required deploying virtual routers on Amazon EC2 instances and configuring an overlay network using IPsec tunnels.  We also faced performance issues with this design, as IPsec tunnels have an effective 1.5-Gbps throughput limit that we often reached when communicating between VPCs.  In the end, we decided this approach wasn't as scalable as we required.  Using a transit VPC felt like managing and deploying a traditional network stack: same processes, same vendors, same tooling, and (mostly) the same limitations.

AWS Transit Gateway as our backbone

AWS Transit Gateway solved our problem while aligning with our design principles.  We didn't want any network connectivity between Development and Production environments, and we wanted to centralize internet ingress/egress traffic in dedicated VPCs.  These VPCs are where our Network and Security teams deploy and manage the appropriate network-layer security controls.

AWS Direct Connect Gateway aggregates multiple physical Direct Connect circuits and Virtual Interfaces (VIFs) from FactSet's many colocation facilities, into and across AWS Regions.  AWS Transit Gateway makes policy-based (route-domain) inter-VPC communication possible.

In our most basic usage of Transit Gateway, we create simple VPC connectivity meshes. For more involved topologies, Transit Gateway facilitates advanced north-south traffic engineering rules for better security controls and performance.  By provisioning a Transit VIF on our Direct Connect gateways and attaching it to a Transit Gateway, we bridge our on-premises connectivity with the 'Network-as-Code' capabilities of AWS Transit Gateway, which lets us deploy VPCs with the ease and control we require.  Transit Gateway gives us a single place to land our AWS Direct Connect circuits in each Region, and it has freed us from running third-party routers on Amazon EC2 instances. It also allows us to create separate routing "domains" that make it easy to isolate environments from one another.

FactSet core network connectivity design
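
To make that workflow concrete, here is a minimal boto3 sketch of attaching a shared VPC to a Transit Gateway and placing it into a routing domain. The IDs are placeholders, and this is only an illustration of the API calls involved, not our production automation (which runs through our infrastructure-as-code pipelines).

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Attach a VPC to the Transit Gateway (all IDs below are placeholders).
attachment = ec2.create_transit_gateway_vpc_attachment(
    TransitGatewayId="tgw-0123456789abcdef0",
    VpcId="vpc-0123456789abcdef0",
    SubnetIds=["subnet-0123456789abcdef0", "subnet-0123456789abcdef1"],
)["TransitGatewayVpcAttachment"]

# In practice, wait until the attachment state is "available" before
# associating it with a route table.
attachment_id = attachment["TransitGatewayAttachmentId"]

# Associate the attachment with the route table for its routing "domain"
# (for example, a Dev-only route table) and propagate its routes there.
ec2.associate_transit_gateway_route_table(
    TransitGatewayRouteTableId="tgw-rtb-0123456789abcdef0",
    TransitGatewayAttachmentId=attachment_id,
)
ec2.enable_transit_gateway_route_table_propagation(
    TransitGatewayRouteTableId="tgw-rtb-0123456789abcdef0",
    TransitGatewayAttachmentId=attachment_id,
)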

Shared subnet VPCs – sharing is caring

When we started with the microAccount approach, we didn't know how many IP addresses each AWS account would need, or which AWS services would be deployed in each account.  Fully automating the dynamic allocation of IPs per account based on usage was more than we were willing to take on, and we didn't want to over-allocate usable IPs, as reclaiming them would be difficult later.  We therefore chose to use shared VPCs.  Amazon VPC sharing permits multiple AWS accounts to create their application resources in shared, centrally managed VPCs. This allows us to section off our cloud network into larger CIDR blocks per business unit (such as Content, Wealth, or Analytics) and environment (such as Dev, UAT, Prod, or Shared Services) in each Region.  Each microAccount that we provision has access to the subnets designated for the corresponding business unit and environment in each Region.  Using shared VPCs allowed us to onboard new accounts quickly without deploying dedicated VPCs and subnets.  We ended up creating a single VPC per business unit per environment in each Region.

FactSet Shared VPC Design

With this in place, we turned to AWS Resource Access Manager (RAM). RAM makes it possible to easily and securely share AWS resources with any AWS account within our AWS Organization. We start by creating a single RAM resource share per VPC.  This resource share includes all the subnets within that VPC.  We then add the AWS Organizations OU ID as the principal of each of these resource shares.  The result is that each new AWS account created within that OU automatically has access to the subnets within that VPC. An example of this is shown in the diagram that follows (figure 3).

FactSet shared resources
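
As an illustration of that setup, the following boto3 sketch creates one RAM resource share for all of the subnets in a VPC and adds an AWS Organizations OU as the principal. The VPC ID and OU ARN are placeholders, not our actual values.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ram = boto3.client("ram", region_name="us-east-1")

# Placeholders: the shared VPC and the OU whose accounts should see it.
VPC_ID = "vpc-0123456789abcdef0"
OU_ARN = "arn:aws:organizations::111122223333:ou/o-exampleorgid/ou-examplerootid-exampleouid"

# Collect the ARNs of every subnet in the VPC so the whole VPC is shared.
subnets = ec2.describe_subnets(
    Filters=[{"Name": "vpc-id", "Values": [VPC_ID]}]
)["Subnets"]
subnet_arns = [s["SubnetArn"] for s in subnets]

# One RAM resource share per VPC, with the OU as the principal, so every
# account created under that OU automatically has access to these subnets.
ram.create_resource_share(
    name=f"{VPC_ID}-subnet-share",
    resourceArns=subnet_arns,
    principals=[OU_ARN],
    allowExternalPrincipals=False,
)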

DNS Resolution

DNS resolution is a critical part of connecting services running in the cloud with those on-premises.  Our vision was to host all cloud-specific DNS domains as Private Hosted Zones in Route 53, while domains specific to on-premises workloads are created within our on-premises DNS infrastructure.  All resources deployed within a VPC are configured to use the Route 53 Resolver for all DNS queries.  Route 53 Resolver rules allow us to forward queries for any of our on-premises domains to our on-premises DNS infrastructure using outbound forwarders.  We also set up our on-premises DNS infrastructure to forward any queries for domains hosted in a Route 53 Private Hosted Zone to an inbound Route 53 Resolver endpoint.  By doing this, we achieved full hybrid private DNS using only AWS-managed services, with no additional EC2 instances required.  Here is a high-level view of how DNS resolution works in our case.

FactSet DNS Resolution design
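
For illustration, here is a minimal boto3 sketch of the forwarding piece: a Route 53 Resolver rule that sends queries for an on-premises domain to on-premises DNS servers through an existing outbound endpoint, followed by an association of that rule with a VPC. The domain name, target IPs, and endpoint ID are placeholders.

import uuid
import boto3

resolver = boto3.client("route53resolver", region_name="us-east-1")

# Forward queries for an on-premises domain (placeholder) to on-premises
# DNS servers through an existing outbound resolver endpoint.
rule = resolver.create_resolver_rule(
    CreatorRequestId=str(uuid.uuid4()),
    Name="onprem-corp-example-com",
    RuleType="FORWARD",
    DomainName="corp.example.com",
    TargetIps=[{"Ip": "10.1.2.3", "Port": 53}, {"Ip": "10.1.2.4", "Port": 53}],
    ResolverEndpointId="rslvr-out-0123456789abcdef0",
)["ResolverRule"]

# Associate the rule with a shared VPC so resources in that VPC resolve the
# on-premises domain through the forwarder.
resolver.associate_resolver_rule(
    ResolverRuleId=rule["Id"],
    VPCId="vpc-0123456789abcdef0",
)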

Controlling networking actions

Only the Cloud Infrastructure team at FactSet handles the creation and maintenance of networks.  We use IAM and Service Control Policies (SCPs) to restrict the available networking actions in each AWS account.  This approach allows a high degree of flexibility for development teams while we maintain tight control over the network.  We use SCPs to restrict microAccount owners from performing network admin-related actions while keeping the IAM policies as simple as possible.  For this reason, we created a single SCP at the root level of our microAccount OU structure. Here is an example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyVPCActions",
      "Effect": "Deny",
      "Action": [
        "ec2:AcceptTransitGatewayVpcAttachment",
        "ec2:AcceptVpc*",
        "ec2:AdvertiseByoipCidr",
        "ec2:AssociateDhcpOptions",
        "ec2:AssociateRouteTable",
        "ec2:AssociateSubnetCidrBlock",
        "ec2:AssociateTransitGatewayRouteTable",
        "ec2:AssociateVpcCidrBlock",
        "ec2:AttachVpnGateway",
        "ec2:CreateCustomerGateway",
        "ec2:CreateDefault*",
        "ec2:CreateDhcpOptions",
        "ec2:CreateRoute*",
        "ec2:CreateSubnet",
        "ec2:CreateTransit*",
        "ec2:CreateVp*",
        "ec2:DeleteCustomerGateway",
        "ec2:DeleteDhcpOptions",
        "ec2:DeleteNetworkAcl*",
        "ec2:DeleteRoute*",
        "ec2:DeleteSubnet",
        "ec2:DeleteTransitGateway*",
        "ec2:DeleteVp*",
        "ec2:DetachClassicLinkVpc",
        "ec2:DetachVpnGateway",
        "ec2:DisableTransitGateway*",
        "ec2:DisableVgwRoutePropagation",
        "ec2:DisableVpcClassicLink*",
        "ec2:DisassociateRouteTable",
        "ec2:DisassociateSubnetCidrBlock",
        "ec2:DisassociateTransitGateway*",
        "ec2:DisassociateVpcCidrBlock",
        "ec2:EnableTransitGateway*",
        "ec2:EnableVgwRoutePropagation",
        "ec2:EnableVpcClassicLink*",
        "ec2:ExportTransitGatewayRoutes",
        "ec2:ModifySubnetAttribute",
        "ec2:ModifyTransitGateway*",
        "ec2:ModifyVpc*",
        "ec2:ProvisionByoipCidr",
        "ec2:RejectTransitGateway*",
        "ec2:RejectVpc*",
        "ec2:ReplaceRoute*",
        "ec2:WithdrawByoipCidr",
        "ec2:AttachInternetGateway",
        "ec2:DetachInternetGateway",
        "ec2:CreateInternetGateway",
        "ec2:DeleteInternetGateway",
        "ec2:CreateEgressOnlyInternetGateway",
        "ec2:DeleteEgressOnlyInternetGateway",
        "ec2:CreateNatGateway",
        "ec2:DeleteNatGateway",
        "ec2:CreateNetworkAcl*",
        "route53:CreateHostedZone",
        "route53:DeleteHostedZone",
        "route53:AssociateVPCWithHostedZone",
        "route53:DissassociateVPCFromHostedZone",
        "route53:UpdateHostedZoneComment"
      ],
      "Resource": "*",
      "Condition": {
        "ForAllValues:StringNotLike": {
          "aws:PrincipalArn": [
            "arn:aws:iam::*:role/network-team-iam-role"
          ]
        }
      }
    }
  ]
}

This SCP prevents microAccount owners from setting up their own network paths to bypass our network-layer security controls.  It reduced our support burden by ensuring that newly created accounts have the appropriate connectivity in place.  It also gives us the flexibility to let the Network Team provision VPCs directly into a microAccount within our Organization, which is helpful for edge cases that require a dedicated VPC.

Things to watch for

There are a few things that you must keep in mind when deploying shared subnets:

VPC service quotas: These don't change just because a VPC is being shared. AWS RAM also has its own service quotas. When we ran into service limits, we worked closely with the AWS team to resolve them.  A few of the not-so-obvious limits are:

  • AWS RAM limits on the number of accounts that can share a single subnet.
  • The number of hyperplane Elastic Network Interfaces (ENIs) in a single VPC. Each Network Load Balancer, PrivateLink endpoint, or VPC-attached Lambda function (see the "What's changing" explanation in AWS's blog post on improved VPC networking for Lambda) can create its own hyperplane ENI.  By default, there is a limit of 250 of these per VPC (a counting sketch follows this list).
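
The following boto3 sketch counts the ENI types that we assume are backed by hyperplane (Network Load Balancers, PrivateLink endpoints, and VPC-attached Lambda functions) in a given VPC. The VPC ID is a placeholder, and the type list may need adjusting for other hyperplane-backed services.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# ENI types we assume count toward the per-VPC hyperplane ENI quota.
HYPERPLANE_TYPES = ["network_load_balancer", "vpc_endpoint", "lambda"]

def hyperplane_eni_count(vpc_id: str) -> int:
    """Count hyperplane-style ENIs in one VPC."""
    paginator = ec2.get_paginator("describe_network_interfaces")
    count = 0
    for page in paginator.paginate(
        Filters=[
            {"Name": "vpc-id", "Values": [vpc_id]},
            {"Name": "interface-type", "Values": HYPERPLANE_TYPES},
        ]
    ):
        count += len(page["NetworkInterfaces"])
    return count

print(hyperplane_eni_count("vpc-0123456789abcdef0"))  # placeholder VPC ID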

Noisy neighbor: One consequence of our shared VPC approach is that any given microAccount can exhaust the usable IPs in a shared subnet.  This affects the ability of any other microAccount sharing that subnet to launch new instances in it.  We developed an approach that ensures enough IP capacity in our shared subnets:

  • Set up monitoring of each subnet and its usable capacity, and track which accounts are using the most IP space in each subnet (see the sketch after this list).
  • If any workload requires a high number of IP addresses, we provision a dedicated subnet and share it using RAM, with only that team's microAccount as the principal of the share. These dedicated VPCs and subnets are still created and maintained like any other VPC and subnet.  We are working on a self-service request capability for this use case that we will make available in our user-facing Service Catalog.
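
As a sketch of the monitoring step above, the following boto3 snippet publishes the free-IP count of every subnet in the account as a custom Amazon CloudWatch metric. The namespace is an arbitrary example, not our actual configuration; an alarm on this metric would flag subnets that are running low.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Publish each subnet's free-IP count so we can alarm on low capacity.
for subnet in ec2.describe_subnets()["Subnets"]:
    cloudwatch.put_metric_data(
        Namespace="SharedVpcCapacity",  # example namespace
        MetricData=[{
            "MetricName": "AvailableIpAddressCount",
            "Dimensions": [{"Name": "SubnetId", "Value": subnet["SubnetId"]}],
            "Value": subnet["AvailableIpAddressCount"],
            "Unit": "Count",
        }],
    )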

Sharing resources: Using your AWS Organizations OUs to share resources has many benefits. However, we encountered use cases where it becomes a burden.  For example, if a product or team shifts into another business unit, we must move its microAccount into the new OU to track costs, which causes undesirable changes to the network.  As a result, we stopped sharing subnets with OUs.  Instead, we use a unique resource share per microAccount and rely on our automation to share the appropriate subnets with each microAccount.  This allows us to move accounts around our AWS Organization with no impact on the underlying network.
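
The per-microAccount pattern uses the same RAM call shown earlier, but with the account ID as the principal instead of an OU ARN, for example (all IDs are placeholders):

import boto3

ram = boto3.client("ram", region_name="us-east-1")

# One resource share per microAccount: the principal is the account itself,
# so the account can move between OUs without any change to the network.
ram.create_resource_share(
    name="microaccount-444455556666-subnets",
    resourceArns=[
        "arn:aws:ec2:us-east-1:111122223333:subnet/subnet-0123456789abcdef0",
    ],
    principals=["444455556666"],  # the microAccount's account ID
    allowExternalPrincipals=False,
)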

Conclusion

In this blog, we outlined how we manage networking for thousands of AWS accounts so that FactSet engineering teams can move their workloads quickly and securely into AWS.  In our microAccount architecture, we strive to use AWS managed network services as much as possible, which reduces operational overhead and minimizes the use of third-party tools. A key aspect of using AWS networking services with thousands of accounts is infrastructure-as-code, which we use to deploy services and enforce a desired state on a regular interval. We continuously re-evaluate and optimize services and offerings as they mature.  The microAccount architecture has allowed us to build network foundations that give developers autonomy and flexibility while maintaining security and governance.

 

Daniel Cordes

Daniel Cordes is a Principal Software Architect with FactSet, responsible for managing cloud automation efforts for internal developers. He holds master’s degrees from Columbia University and the City University of New York.

Gaurav Jain

Gaurav Jain is Director of Cloud Platform at FactSet, responsible for implementing cloud adoption and migration strategy. He holds a MSE in Computer Engineering from University of Michigan, and a MBA from NYU Stern.

Nathan Goodman

Nathan Goodman is a Principal Systems Architect on FactSet’s Public Cloud enablement team, responsible for building highly scalable and resilient systems both on-premises and in the cloud.

Geoff Wang

Geoff is a Principal Systems Engineer in FactSet’s Cloud Team, focused on helping developers on FactSet’s journey to the cloud. He holds a BSE in Electrical Engineering from the University of Michigan.

Sunu Joseph

Sunu Joseph is an Associate Director in FactSet’s Cloud Team focused on improving developer workflows during cloud adoption and migration. He holds an MS in Computer Science from the University of Edinburgh.

 

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.