AWS Network Optimization Tips
When thinking about architecture, it’s very common to come across scenarios where there is no right or wrong answer – the best answer is “it depends”. You must carefully consider the tradeoffs between cost, performance, reliability, and operational efficiency before coming to a decision.
A little planning ahead of time can help you avoid numerous networking headaches down the road. For example, as your network evolves, you don’t want to be dealing with overlapping IP address ranges or insufficient available IP addresses in a network – just to call out a few challenges. These aren’t insurmountable challenges, but the network engineering effort required to remediate them later can be time consuming, complicated, and expensive.
In this post, we cover some of the most commonly used network optimization tips that you should consider to design a resilient, secure, and scalable network in AWS.
Plan your VPCs and network segments
Here are a few common questions that you should ask when planning a secure and scalable network on AWS.
How many VPCs do I need?
There’s no right or wrong answer to the number of VPCs that you will use. The general rule of thumb is that you need enough VPCs to achieve the workload separation that is appropriate for you, but not so many that they become difficult to administer. We often see customers have dedicated VPCs for each workload. Additionally, it’s common to have an inspection/appliance VPC for the deployment of AWS Network Firewall or partner firewall appliances with Gateway Load Balancer. If you’re implementing centralized ingress and egress, then you’ll also have additional VPCs like a Central Ingress VPC and a Central Egress VPC.
We also recommend you consider the benefits of AWS multi-account strategy instead of deploying all your workloads inside a single account and VPC.
What should be the size of each VPC?
There is no specific recommended size for VPCs in AWS. Consider your current needs and factor in your future plans. Avoid using /16 IP address ranges as a default for all VPCs. Instead, create VPCs with CIDR blocks that are based on current needs and expected growth. Keep in mind that you can add CIDR blocks to the VPC later if required.
Balance the information you have at VPC creation time against your growth plans. For example, for container-based workloads, you can start with VPCs using a /16 or /20 CIDR, while for a small shared services deployment, you can start with a VPC using a /24 CIDR. Network address management is particularly important as you scale your environment. You can use Amazon VPC IP Address Manager (IPAM) to organize, assign, monitor, and audit IP addresses, and to better inform your decisions to add more CIDRs to a VPC or create additional VPCs.
In summary, there are a few things to consider when deciding the size of your VPCs: the number of resources you plan to deploy in a VPC, the number of accounts that share the VPC (if you’re using VPC sharing), and the IP addresses consumed from the VPC CIDR(s). The CIDR block or blocks associated with your VPC give you only one dimension that determines how big your VPC can be.
Along with the VPC CIDR, you should also consider the Network Address Usage (NAU) units assigned to a VPC and how many of these will be consumed by the resources in your VPC. You can enable NAU metrics from the VPC settings and then monitor them and set up alerts in CloudWatch.
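To make the sizing and overlap concerns above concrete, here is a minimal sketch that uses Python's standard `ipaddress` module. The CIDR values and function names are hypothetical and chosen only for illustration. The sketch relies on the fact that AWS reserves five IP addresses in every subnet, and it is not a substitute for a tool such as IPAM.

```python
import ipaddress
from itertools import combinations

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(subnet_cidr: str) -> int:
    """Return how many IPv4 addresses can be assigned to resources in a subnet."""
    network = ipaddress.ip_network(subnet_cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def find_overlaps(cidrs):
    """Return every pair of CIDR blocks in the list that overlap each other."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        if a.overlaps(b)
    ]


# Hypothetical CIDR plan for three VPCs (illustrative values only).
planned_vpc_cidrs = ["10.0.0.0/16", "10.1.0.0/20", "10.0.128.0/24"]

print(usable_addresses("10.1.0.0/24"))   # 256 addresses minus 5 reserved
print(find_overlaps(planned_vpc_cidrs))  # the /24 sits inside the /16
```

Running a check like this before creating or peering VPCs is one inexpensive way to catch overlapping ranges while they are still easy to change.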
How many subnets should I have per VPC?
The short answer is it really depends on your security and risk requirements. You’ll need at least one subnet per VPC. However, if you must deploy your application across multiple AZs (recommended for mission-critical production workloads for high availability), then you should create at least one subnet in each Availability Zone (AZ). When you create a subnet, you assign it to a specific AZ which can’t be changed, so make sure to carefully select different AZs when creating your subnets.
You may need to create multiple subnets in line with your organization’s network segmentation policies. It’s common to have subnets for hosting components like load balancers and web servers which must be accessible from the internet, and another set of subnets for components like application/database servers which shouldn’t be accessible directly from the internet. Some customers choose to have more granular workload placement into subnets based on their security and risk requirements. For example, you may choose to place web servers, application servers, and database servers in different subnets (e.g., web-server-subnet, application-server-subnet, database-server-subnet) with appropriate network security controls (Security Groups and NACLs) limiting communication between them. Additionally, you may want to create a Management or DMZ subnet for deploying bastion hosts and other network security appliances.
If you’re working with AWS Transit Gateway (TGW), we also recommend creating dedicated /28 subnets in all of the AZs that you plan to use for your workload deployment. Check Transit Gateway best practices for more information.
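As an illustration of the subnet planning described above, the following sketch carves a VPC CIDR into one workload subnet per AZ plus a dedicated /28 subnet per AZ for Transit Gateway attachments. The function name, the AZ names, the /24 workload prefix, and the layout (the /28 subnets come from the next unused workload-sized block) are all assumptions made for this example. A real plan would also account for additional tiers such as web, application, and database subnets.

```python
import ipaddress


def plan_subnets(vpc_cidr, azs, workload_prefix=24):
    """Carve one workload subnet and one /28 Transit Gateway subnet per AZ.

    Assumes workload_prefix is 28 or shorter and that the VPC is large
    enough to hold len(azs) + 1 blocks of the workload size.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    workload_blocks = vpc.subnets(new_prefix=workload_prefix)

    plan = {}
    for az in azs:
        plan[az] = {"workload": str(next(workload_blocks))}

    # Reserve the next unused block and split it into /28s for the
    # Transit Gateway attachment subnets, one per AZ.
    tgw_blocks = next(workload_blocks).subnets(new_prefix=28)
    for az in azs:
        plan[az]["tgw"] = str(next(tgw_blocks))
    return plan


# Hypothetical VPC CIDR and AZ names, for illustration only.
for az, subnets in plan_subnets(
    "10.0.0.0/20", ["us-east-1a", "us-east-1b", "us-east-1c"]
).items():
    print(az, subnets)
```

Keeping the Transit Gateway subnets small and separate from workload subnets leaves the larger blocks free for resources that actually consume addresses.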
Should I share a VPC or create a new VPC for the workload?
AWS recommends that different teams operate in separate, dedicated AWS accounts. However, that doesn’t mean teams must also create separate VPCs for their workloads. VPC Sharing lets customers achieve account-level separation of workloads while benefiting from the centralized management of a single VPC. This reduces the total number of VPCs to manage and interconnect. VPC Sharing can also help optimize IP address usage by sharing existing address space. Participating accounts continue to fully control their workloads, account access, and security groups. VPC Sharing lets you share the subnets of a VPC with other AWS accounts within the same AWS Organization.
We recommend that you refer to the VPC sharing: key considerations and best practices blogpost for limitations and important architecture considerations when using VPC Sharing.
Cost-aware design by selecting appropriate networking components
At AWS, we like to provide customers with choices, so there is often more than one way to achieve the desired outcome. Cost Optimization is one of the pillars of AWS Well-Architected Framework, so we always recommend customers consider the cost of their design choices.
Here are some of the common network cost optimization tips:
- Optimize hybrid connectivity cost: If you’re migrating a significant amount of data (e.g., a few terabytes or petabytes) from an on-premises environment to AWS, then there are a range of online and offline data transfer options available. Select the option that best meets your compliance, performance, cost, and project time-frames. For online transfers, you should look at all of the network components in the data path (options are via Transit Gateway, directly into Amazon Simple Storage Service (Amazon S3) over a public Virtual Interface (VIF), via VPC using VPC endpoints). Transit Gateway and VPC Interface Endpoints have associated data processing costs. Check the cost optimization pillar of Hybrid Networking Lens for different architecture options and related cost components.
- Use VPC endpoints for connecting to AWS services: NAT Gateway is often used to access public resources outside of your VPC. You can reduce NAT Gateway charges by using Gateway Endpoints for connecting to AWS services like Amazon S3 and Amazon DynamoDB from within your VPC. This is particularly useful if you have workloads that transfer a lot of data between VPCs and those AWS services. Gateway Endpoints are provided at no cost, but they are only supported for Amazon S3 and DynamoDB. If most of the traffic going through the NAT Gateway is targeted to AWS services that support Interface VPC endpoints, then create and use Interface VPC endpoints for connecting to these services. See pricing details for Interface VPC endpoints to determine the potential cost savings.
- Centralize NAT Gateways: It is a commonly used security best practice to centralize all egress connectivity by setting up an egress VPC where NAT gateways are deployed. As an added benefit, it can also provide some potential cost savings. If you’re implementing an egress VPC along with an inspection VPC (as described in this post), then consider combining egress and inspection components into the same VPC to avoid internet-bound traffic from workload VPCs entering the Transit Gateway twice (before and after the inspection), thereby saving you on Transit Gateway data processing costs.
- Workload placement: If you have an instance that’s sending/receiving a large amount of network traffic via NAT Gateway, then you should ask whether or not workload placement is correct. Could that workload be placed in an alternate subnet with a different set of security controls around it and use a public IP address? Note that Internet Gateway has no data processing charge, and the ingress routing capability now lets you inspect and filter traffic originating from the internet if that’s required. If needed, you can also block inbound access to an instance from the internet or a specific range of IP addresses by using Security Groups and NACLs. Check how to reduce NAT Gateway data transfer cost for more details. This may not be a preferred option, so consult your security and compliance team before implementing it.
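To compare the NAT Gateway and Interface VPC endpoint options from the tips above, a rough monthly estimate can be sketched as follows. Every rate in this example is a placeholder, not actual AWS pricing, and the function names are hypothetical. The sketch assumes both components are billed per hour plus per GB processed, with interface endpoints billed per endpoint in each AZ, and it ignores any separate data transfer charges.

```python
HOURS_PER_MONTH = 730  # common approximation for a month of hourly billing


def monthly_nat_gateway_cost(gb_processed, hourly_rate, per_gb_rate, gateways=1):
    """Hourly charge for each NAT Gateway plus a per-GB data processing charge."""
    return gateways * hourly_rate * HOURS_PER_MONTH + gb_processed * per_gb_rate


def monthly_interface_endpoint_cost(
    gb_processed, hourly_rate, per_gb_rate, endpoints=1, azs=1
):
    """Hourly charge per endpoint per AZ plus a per-GB data processing charge."""
    return endpoints * azs * hourly_rate * HOURS_PER_MONTH + gb_processed * per_gb_rate


# Placeholder rates (NOT real prices); look up current pricing for your Region.
nat = monthly_nat_gateway_cost(1000, hourly_rate=0.05, per_gb_rate=0.05)
endpoint = monthly_interface_endpoint_cost(
    1000, hourly_rate=0.01, per_gb_rate=0.01, azs=2
)
print(f"NAT Gateway: {nat:.2f}  Interface endpoint: {endpoint:.2f}")
```

If the NAT Gateway must stay in place for other internet-bound traffic, only its per-GB portion is saved by moving AWS service traffic to an endpoint, so compare that portion against the full endpoint cost.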
Design your workload to use multiple AZs for high availability
AZs are independent, and thus workload availability is increased when the workload is architected to use multiple AZs. Deploying related workload components in the same AZ will help to make sure that network traffic remains local, thereby avoiding inter-AZ data transfer costs but with the disadvantage that an event affecting an AZ may impact all of the resources in that AZ.
Workload placement is particularly important when you have services such as NAT Gateways, Application Load Balancer (ALB), Network Firewall endpoints, Gateway Load Balancer endpoints, and Transit Gateway Attachments in the critical data path. To make sure that workloads are designed for high availability, you can:
- Deploy NAT Gateway, ALB, and interface endpoints in multiple AZs (as shown in the following figure).
- Create Transit Gateway Subnets with a /28 range in every AZ and use them for Transit Gateway attachments. Refer to Transit Gateway design best practices for more details.
- Deploy firewall appliances in multiple AZs and use Transit Gateway appliance mode if you’re using centralized inspection.
Plan the bandwidth for hybrid connectivity
If you’re using AWS Direct Connect (DX) to connect your on-premises environment with AWS, then make sure that you have enough capacity to cater for unpredictable traffic spikes and organic growth in traffic over time.
Direct Connect dedicated connections come in fixed sizes of 1Gbps, 10Gbps, and 100Gbps in select locations. Direct Connect partners offer further bandwidth granularity and smaller sizes, which could optimize your connectivity cost. For example, you can start at a 50Mbps hosted connection, as compared to the minimum 1Gbps dedicated connection.
Changing the bandwidth of a Direct Connect connection requires careful planning and time because physical network changes may need to be made by your networking partner. Furthermore, it isn’t possible to directly change the bandwidth of an existing hosted or dedicated Direct Connect connection. To change the size of the connection, a new connection must first be ordered, the traffic must be cut over to use the new connection, and then the old connection can be removed. With careful planning, it’s possible to perform a cut-over with minimal-to-no disruption.
Optimizing data transfer out (DTO) costs
- Consider using Direct Connect if you’re transferring large amounts of data from AWS to your on-premises environment. Direct Connect has a per-hour port charge, as well as a lower DTO cost as compared to sending via Internet Gateway. For example, at the time of writing this post, the cost to transfer data over the internet from the AWS Sydney Region (ap-southeast-2) is $0.114 per GB (assuming less than 10TB is transferred), as compared to transferring data over a Direct Connect connection, which costs $0.0420 per GB. Check the Direct Connect pricing for more details.
- If you’re transferring website content, media, and other assets to the internet, you may consider serving this data via Amazon CloudFront. CloudFront provides 1 TB of data transfer out to the Internet at no cost, and it offers tiered pricing for the transfer of larger volumes of data. CloudFront also provides added protection for your web applications against DDoS attacks and improves the performance of your workloads by serving the content to your customers from the nearest CloudFront edge location. Data Transfer from AWS origins, such as Amazon S3 and Amazon Elastic Compute Cloud (Amazon EC2) to CloudFront is free.
- Traffic that crosses a Regional boundary will incur a data transfer charge. Consider reviewing cross-Region and cross-AZ data transfer charges. These costs are highly dependent on the architecture of your workloads and customer use cases. However, in some scenarios you can weigh availability and redundancy against cost optimization.
Moreover, read this post which provides a good overview of data transfer costs for some of the common architecture patterns.
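The Direct Connect comparison above can be turned into a simple break-even estimate. The per-GB rates below are the figures quoted in this post for the Sydney Region at the time of writing. The port-hour rate is a placeholder, not actual pricing, and the function name is hypothetical. The sketch also ignores any partner or cross-connect fees.

```python
INTERNET_DTO_PER_GB = 0.114        # from this post: Sydney, under 10 TB
DIRECT_CONNECT_DTO_PER_GB = 0.042  # from this post: Sydney over Direct Connect
HOURS_PER_MONTH = 730              # common approximation for hourly billing


def break_even_gb(
    port_hourly_rate,
    internet_per_gb=INTERNET_DTO_PER_GB,
    dx_per_gb=DIRECT_CONNECT_DTO_PER_GB,
    hours=HOURS_PER_MONTH,
):
    """Monthly GB at which per-GB savings cover the Direct Connect port charge."""
    saving_per_gb = internet_per_gb - dx_per_gb
    if saving_per_gb <= 0:
        raise ValueError("Direct Connect must be cheaper per GB to break even")
    return port_hourly_rate * hours / saving_per_gb


# Placeholder port-hour rate (NOT a real price); check Direct Connect pricing.
print(f"Break-even: {break_even_gb(port_hourly_rate=0.30):.0f} GB per month")
```

Below the break-even volume, the port charge outweighs the per-GB saving. Above it, Direct Connect is the cheaper path for data transfer out, under these assumptions.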
Avoid all single points of failure
We strongly recommend that you consider redundant connectivity between your on-premises environment and AWS cloud. The Direct Connect resiliency recommendation provides prescriptive guidance and architectural patterns for implementing resiliency models that are appropriate for a workload. It’s recommended that you have a minimum of two Direct Connect links terminating in different Direct Connect locations.
If you’re using AWS Site-to-Site VPN, make sure that you configure both tunnels for redundancy. This makes sure that when one tunnel becomes unavailable (for example, because of maintenance), traffic is automatically routed via the other tunnel.
It’s possible to use Site-to-Site VPN as a low-cost backup option to Direct Connect. However, note that Site-to-Site VPN only supports throughput up to 1.25 Gbps per tunnel and doesn’t support ECMP when multiple Site-to-Site VPNs are terminated at the same Virtual Private Gateway. Therefore, if you must use Site-to-Site VPN as a backup for a Direct Connect connection with speeds greater than 1 Gbps, then you should use multiple Site-to-Site VPNs terminating on Transit Gateway and use ECMP to achieve throughput up to 50 Gbps.
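The arithmetic behind sizing a VPN backup can be sketched as follows. It uses the figures from this section (1.25 Gbps per tunnel, up to 50 Gbps with ECMP on Transit Gateway) and the two tunnels that each Site-to-Site VPN connection provides. The function name is hypothetical, and the result is an idealized lower bound: it assumes both tunnels of every connection carry traffic and that load spreads evenly across tunnels, while a single flow remains limited to one tunnel's throughput.

```python
import math

TUNNEL_GBPS = 1.25            # maximum throughput per tunnel (from this post)
TUNNELS_PER_CONNECTION = 2    # each Site-to-Site VPN connection has two tunnels
ECMP_MAX_GBPS = 50            # aggregate ceiling quoted in this post


def vpn_connections_needed(target_gbps):
    """Minimum number of VPN connections on Transit Gateway for a target rate."""
    if target_gbps <= 0:
        raise ValueError("target_gbps must be positive")
    if target_gbps > ECMP_MAX_GBPS:
        raise ValueError("target exceeds the 50 Gbps ECMP ceiling")
    tunnels = math.ceil(target_gbps / TUNNEL_GBPS)
    return math.ceil(tunnels / TUNNELS_PER_CONNECTION)


# Backing up a 10 Gbps Direct Connect connection needs 8 tunnels.
print(vpn_connections_needed(10))
```

In practice you may want extra connections beyond this minimum so that the loss of one tunnel doesn't drop the backup path below the target rate.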
The following security considerations are recommended:
- Consider using AWS Firewall Manager for the auditing and centralized control of firewall rules (Security Groups, Network Firewall, and AWS WAF) to consistently apply a common set of firewalls across all of the workloads as per your organization’s policies.
- Encrypt all of the traffic in transit:
- All traffic leaving AWS physical premises (between data centers and AZs and between Regions) is automatically encrypted. When connecting your on-premises network to AWS, check if your Direct Connect location supports MACsec layer 2 encryption for traffic over the Direct Connect link. Alternatively, you can set up Private Site-to-Site VPN over Direct Connect to encrypt traffic between on-premises and AWS.
- Transit Gateway Connect allows for SD-WAN connectivity to AWS. The connections between the SD-WAN appliances can provide point-to-point encryption.
- Use TLS/SSL when accessing AWS services from on-premises or the Internet.
- Use TLS/SSL when designing your applications to provide for the end-to-end encryption of sensitive data.
- To reduce attack surface, centralize the number of entry and exit points to/from the Internet where possible.
- Leverage AWS PrivateLink to keep the network connectivity private and over the AWS backbone.
- Utilize Traffic Mirroring to send a copy of network traffic to a network monitoring appliance. You can also turn on VPC Flow Logs to collect packet header information into Amazon CloudWatch for troubleshooting or investigating security issues.
- Enable Amazon GuardDuty for intelligent network threat detection.
Monitor network metrics as you would for other aspects of your workload
Monitor your Direct Connect network connectivity and Site-to-Site VPN and set up relevant CloudWatch metrics and alarms to be notified when set thresholds are crossed so that you’ll know when to upgrade instances or network links. This will make sure that you aren’t caught by surprise when limits are breached.
Log IP traffic using VPC Flow Logs or Transit Gateway Flow Logs when troubleshooting network connectivity issues. If you select CloudWatch as the target, then you can use CloudWatch Logs Insights queries with VPC Flow Log. This makes it easier for you to filter through log events and quickly spot the issues.
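As a small illustration of the kind of filtering described above, the following sketch parses VPC Flow Log records in the default format and counts rejected traffic by source address. The sample records and function names are made up for this example. It does not call CloudWatch, and it assumes the logs use the default space-separated fields rather than a custom format.

```python
from collections import Counter

# Field order of the default VPC Flow Logs record format.
DEFAULT_FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_record(line):
    """Split one default-format flow log record into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(DEFAULT_FIELDS):
        raise ValueError(f"expected {len(DEFAULT_FIELDS)} fields, got {len(values)}")
    return dict(zip(DEFAULT_FIELDS, values))


def top_rejected_sources(lines, limit=3):
    """Return the source addresses with the most REJECT records."""
    records = (parse_record(line) for line in lines)
    counts = Counter(r["srcaddr"] for r in records if r["action"] == "REJECT")
    return counts.most_common(limit)


# Fabricated sample records using documentation-only IP ranges.
sample = [
    "2 123456789012 eni-0a1b2c3d 203.0.113.12 10.0.1.5 49152 22 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 44321 443 6 10 8400 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 203.0.113.12 10.0.1.5 49153 3389 6 1 60 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 198.51.100.7 10.0.1.5 50000 23 6 1 60 1700000000 1700000060 REJECT OK",
]
print(top_rejected_sources(sample))
```

The same question (which sources are being rejected most often) can be asked directly with a CloudWatch Logs Insights query when CloudWatch is the flow log destination.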
Network performance is dependent on numerous factors, including the right instance type, workload placement strategy, and connectivity type. To optimize performance:
- Select the appropriate instance size, because the available network bandwidth of each instance type and size is different. Check the Amazon EC2 network bandwidth page for information on the available bandwidth.
- Use cluster placement groups for applications that need low network latency or high network throughput.
- Use Nitro-based instances because they deliver enhanced networking performance and features.
AWS offers the most comprehensive set of networking services when compared with any other cloud platform. The range of AWS networking services gives you the flexibility and choice regarding how you architect your networks. Whether you’re setting up a brand-new network, or already running workloads in AWS, use the best practices and considerations listed above to design and build secure, resilient, scalable, and globally expanding networks on AWS.