How Repsol manages and monitors their AWS network with dashboards, alarms and automation
Large enterprises often deploy workloads on Amazon Web Services (AWS) using multiple accounts. This helps isolate workloads, simplifies permission management, and eases cost allocation. However, a multi-account environment can make the network topology more complex, and it calls for additional monitoring and automation.
At Repsol, a global multi-energy company present throughout the entire value chain, we put the customer at the heart of everything we do to meet their energy needs.
When we initially migrated a part of our production workloads to AWS, we followed a hub and spoke transit VPC pattern that provided us with centralized management of the network traffic and easier provisioning of new accounts.
The launch of AWS Transit Gateway, which connects accounts, Amazon Virtual Private Clouds (VPCs), and on-premises networks through a central hub, gave us an opportunity to migrate our transit VPC to a secure, scalable, fully managed service. We also took the migration as a chance to improve the monitoring and automation mechanisms of our AWS network infrastructure. For this task, we used Amazon CloudWatch, a monitoring and observability service for AWS and on-premises resources and applications, and Amazon Simple Notification Service (Amazon SNS), a fully managed service for pub/sub messaging, SMS, email, and mobile push notifications.
In this post, we explain how we designed and implemented our network architecture, featuring a fully automated monitoring and alarm system that provides real-time visibility of the network status of all our AWS accounts from a single pane of glass.
Back in 2018, we built a multi-account landing zone based on AWS Organizations that grew to over 35 accounts for different businesses and environments with over 100 business products deployed on top of this foundation. Each of these accounts contains at least one VPC.
Our production environment must guarantee highly available and performant communications with the rest of Repsol’s hybrid and multi-cloud environments. We covered this requirement with the deployment of Virtual Interfaces to two different AWS Direct Connect locations, following the architecture recommended on the Maximum Resiliency for Critical Workloads web page.
This variety of accounts, VPCs, interfaces, and environments came with the need for an efficient and automated networking solution where traffic rules are centrally managed and metrics are supervised from a single pane of glass.
We initially fulfilled this need by deploying a hub and spoke transit VPC solution that supported our first steps in AWS. However, we soon ran into challenges in scaling and operating the solution, including the lack of mechanisms to automate the deployment of new accounts.
Networking architecture evolution
After studying the challenges and the potential solutions, we evolved our transit VPC deployment to a fully managed networking solution with AWS Transit Gateway. This service provides network connectivity between all our VPCs and Direct Connect locations across our AWS accounts and environments.
As Figure 1 depicts, the Transit Gateway route table configurations define the allowed paths and restrictions for east-west traffic between VPCs and environments, and for north-south traffic with the internet and Repsol’s premises.
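To make the role of the route tables concrete, here is a minimal sketch of how a Transit Gateway route table decides where a packet goes. It is an illustration, not the production configuration: the CIDR ranges and attachment names are hypothetical, and the lookup simply models the longest-prefix match that Transit Gateway applies, including blackhole routes and the case where no route matches.

```python
import ipaddress
from typing import Dict, Optional

# Hypothetical route table associated with a spoke VPC attachment.
# Keys are destination CIDRs; values are the target attachment,
# or None for a blackhole route. All values are illustrative assumptions.
SPOKE_ROUTE_TABLE: Dict[str, Optional[str]] = {
    "0.0.0.0/0": "egress-vpc-attachment",     # north-south: internet via the NAT VPC
    "10.0.0.0/8": "direct-connect-gateway",   # north-south: on-premises networks
    "10.20.0.0/16": "shared-services-vpc",    # east-west: allowed VPC
    "10.30.0.0/16": None,                     # east-west: blocked with a blackhole route
}


def resolve(route_table: Dict[str, Optional[str]], destination_ip: str) -> Optional[str]:
    """Return the target for destination_ip using longest-prefix match.

    Returns the attachment name, the string "blackhole" when the most
    specific route is a blackhole, or None when no route matches.
    """
    address = ipaddress.ip_address(destination_ip)
    matches = []
    for cidr, target in route_table.items():
        network = ipaddress.ip_network(cidr)
        if address in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None  # no route: the packet is dropped
    _, target = max(matches, key=lambda match: match[0])
    return "blackhole" if target is None else target


if __name__ == "__main__":
    for ip in ("10.20.1.5", "10.30.4.4", "10.99.0.1", "8.8.8.8"):
        print(ip, "->", resolve(SPOKE_ROUTE_TABLE, ip))
```

Because the most specific route wins, a narrow blackhole entry can block traffic between two environments even though a broader route covers the same address range.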
For inbound communications, each VPC has three public subnets where we only expose SaaS and PaaS endpoints. We routed outbound internet traffic from all VPCs through the Transit Gateway to a dedicated VPC with a NAT Gateway fleet.
We implemented the full solution using Terraform, our Infrastructure as Code platform of choice. Now we can add new VPCs to our environment within minutes, with no complex setup or maintenance tasks required. A single API call integrates a new VPC with our global networking environment and monitoring system in an easy, secure, and automated fashion.
When we use our account vending machine to provision a new account:
- It creates a new VPC
- It adds the required Transit Gateway attachments
- It configures the Transit Gateway route tables
- It adds the relevant metrics to the dashboards programmatically
The last step leads us to the second part of our networking solution: the monitoring and alarm system.
Monitoring and alarm system
We monitor our virtual devices and traffic flows using a series of Amazon CloudWatch metrics and we publish those metrics in CloudWatch dashboards. CloudWatch’s integration with infrastructure as code tools provides a flexible and dynamic monitoring framework.
We created four CloudWatch dashboards: the main dashboard for Transit Gateway metrics and three environment dashboards for development, preproduction, and production Transit Gateway attachment statistics. We chose a set of Transit Gateway, NAT Gateway and Direct Connect metrics that accurately represent the status of our network.
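As an illustration of how an environment dashboard can be generated programmatically, the following sketch builds a CloudWatch dashboard body with one widget per Transit Gateway attachment. It is not the production implementation: the function names, account names, resource IDs, Region, and widget layout are assumptions made for the example. The `AWS/TransitGateway` namespace and the `BytesIn` and `BytesOut` metrics are real; the use of both the `TransitGateway` and `TransitGatewayAttachment` dimensions for attachment-level metrics should be checked against the current CloudWatch documentation.

```python
import json
from typing import Dict


def attachment_widget(tgw_id: str, attachment_id: str, title: str,
                      region: str, y: int) -> dict:
    """Build one metric widget showing traffic for a single attachment."""
    metrics = [
        ["AWS/TransitGateway", name,
         "TransitGateway", tgw_id,
         "TransitGatewayAttachment", attachment_id]
        for name in ("BytesIn", "BytesOut")
    ]
    return {
        "type": "metric",
        "x": 0, "y": y, "width": 12, "height": 6,
        "properties": {
            "metrics": metrics,
            "period": 300,      # seconds; illustrative choice
            "stat": "Sum",
            "region": region,
            "title": title,
        },
    }


def environment_dashboard_body(tgw_id: str, attachments: Dict[str, str],
                               region: str = "eu-west-1") -> str:
    """Return the dashboard body (JSON string) for one environment.

    attachments maps a hypothetical account name to its attachment ID.
    Widgets are stacked vertically in account-name order.
    """
    widgets = [
        attachment_widget(tgw_id, attachment_id, f"{account} attachment",
                          region, y=6 * index)
        for index, (account, attachment_id) in enumerate(sorted(attachments.items()))
    ]
    return json.dumps({"widgets": widgets})


if __name__ == "__main__":
    body = environment_dashboard_body(
        "tgw-0123", {"acct-b": "tgw-attach-2", "acct-a": "tgw-attach-1"})
    print(body)
```

A string like this can be supplied as the dashboard body to the CloudWatch `PutDashboard` API or to the `dashboard_body` argument of Terraform's `aws_cloudwatch_dashboard` resource, so adding an account only means adding one entry to the input mapping.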
We configured alarms that let us know when critical metrics reach pre-configured thresholds, and associated them with an Amazon SNS topic that delivers emails to Repsol’s 24×7 IT operations team.
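The sketch below shows what such alarm definitions might look like. It only builds the parameter dictionaries accepted by the CloudWatch `PutMetricAlarm` API and does not call AWS. The thresholds, periods, alarm names, resource IDs, and the SNS topic ARN are hypothetical, and the choice of which metrics to alarm on is an assumption for the example rather than a description of the actual configuration.

```python
from typing import Dict

# Hypothetical SNS topic that emails the operations team.
OPS_TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:network-ops"


def alarm_params(name: str, namespace: str, metric: str,
                 dimensions: Dict[str, str], threshold: float,
                 comparison: str, topic_arn: str,
                 statistic: str = "Sum", period: int = 300) -> dict:
    """Build keyword arguments for CloudWatch PutMetricAlarm."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": key, "Value": value}
                       for key, value in dimensions.items()],
        "Statistic": statistic,
        "Period": period,              # seconds per evaluation window
        "EvaluationPeriods": 1,        # alarm after one breaching window
        "Threshold": threshold,
        "ComparisonOperator": comparison,
        "AlarmActions": [topic_arn],   # notify the SNS topic on ALARM
    }


# NAT gateway could not allocate a source port at least once in the window.
nat_port_alarm = alarm_params(
    "nat-port-allocation-errors", "AWS/NATGateway", "ErrorPortAllocation",
    {"NatGatewayId": "nat-0abc"}, threshold=0,
    comparison="GreaterThanThreshold", topic_arn=OPS_TOPIC_ARN)

# Direct Connect connection reported down (0) at any point in the window.
dx_state_alarm = alarm_params(
    "dx-connection-down", "AWS/DX", "ConnectionState",
    {"ConnectionId": "dxcon-0abc"}, threshold=1,
    comparison="LessThanThreshold", topic_arn=OPS_TOPIC_ARN,
    statistic="Minimum", period=60)

if __name__ == "__main__":
    print(nat_port_alarm)
    print(dx_state_alarm)
```

With boto3, each dictionary could be passed as `cloudwatch.put_metric_alarm(**params)`; the same fields map closely onto Terraform's `aws_cloudwatch_metric_alarm` resource.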
The following table describes each of the metrics used in the solution. Metrics in bold have an associated CloudWatch alarm:
|Resource|Metric|Description|
|---|---|---|
|Direct Connect virtual interface||Bitrate (in bits per second) and packet rate (in packets per second) for inbound and outbound data to the AWS side of the virtual interface. The number reported is the aggregate (average) over the specified time period.|
|Direct Connect connection|ConnectionState|The state of the connection (1 indicates up and 0 indicates down).|
|Transit Gateway||Number of bytes and packets received and sent by the transit gateway.|
|Transit Gateway||Number of bytes and packets dropped because they matched a blackhole route.|
|Transit Gateway attachment|BytesIn|Number of bytes and packets received by the transit gateway from the attachment.|
|Transit Gateway attachment||Number of bytes and packets dropped because they matched a blackhole route on the Transit Gateway attachment.|
|Transit Gateway attachment||Number of bytes and packets dropped because they did not match a route on the Transit Gateway attachment.|
|NAT gateway||Number of packets received by the NAT gateway from clients in the VPC and number of packets sent out through the NAT gateway to the destination.|
|NAT gateway|ActiveConnectionCount|Total number of concurrent active TCP connections through the NAT gateway.|
|NAT gateway|ErrorPortAllocation|Number of times the NAT gateway could not allocate a source port. A value greater than zero indicates that too many concurrent connections are open through the NAT gateway.|
|NAT gateway|PacketsDropCount|The number of packets dropped by the NAT gateway. A value greater than zero might indicate an ongoing transient issue with the NAT gateway.|
Amazon CloudWatch collects these metrics and displays them on the operational dashboards described earlier. The following screenshots show our production network monitoring dashboards (Figures 2 and 3).
These are the main KPIs we have achieved since our new networking automation and monitoring platform went into production, compared to our previous transit VPC solution:
- Reduced the time to deploy a new account in our organization from days to less than 15 minutes
- Increased the number of variables monitored by the networking solution to more than 200
- Reduced the response time on network incidents from hours to less than 5 minutes
Successfully deploying critical workloads on AWS requires solid foundations, and a well-planned network layout is a core piece of them. AWS Transit Gateway eases the task of deploying and managing the network topology. Its strong integration with other AWS services, such as Amazon CloudWatch, makes it possible to build native monitoring and alarm solutions through infrastructure as code, with tools such as AWS CloudFormation or, in our case, Terraform.
By developing an automated solution for network provisioning and supervision, we at Repsol were able to speed up account bootstrapping and provide our operations team with a comprehensive and efficient network monitoring and alarm system.