How FactSet unified network monitoring in AWS and On-Premises
This post is by Sreekanth Sarma Vanam (Director of Network Engineering) and Saurabh Gadi (Principal Systems Engineer) of FactSet, and AWS Solutions Architects Mony Kiem and Amit Borulkar.
In FactSet’s own words, “FactSet creates flexible, open data and software solutions for tens of thousands of investment professionals around the world. These solutions provide instant access to financial data and analytics that investors use to make crucial decisions. At FactSet, we are always working to make our product more valuable to our customers.”
The rapid expansion of FactSet’s hybrid-cloud infrastructure, consisting of AWS, US-based data centers, and global Points of Presence (PoPs), necessitated an efficient and comprehensive network monitoring solution. Network Engineering at FactSet is responsible for building connectivity across all these locations and achieving optimal latency between them. This post explores how FactSet tackled this challenge by implementing a distributed infrastructure-agnostic monitoring system, providing valuable insights into network performance across their diverse premises.
FactSet needed to ensure optimal network latency to support the highly transactional nature of its applications. Across AWS, FactSet uses a shared VPC topology, in which different AWS accounts share the same VPCs within an AWS Organizational Unit (OU). We show this in the following diagram (figure 1).
Network Engineering provisions VPCs and builds connectivity between VPCs and various other FactSet premises. There is also a distinct network segmentation across Development, UAT, Production, and Shared Services environments. The same environment segmentation translates into AWS Transit Gateway Routing domains, with each domain having its own unique routing and propagation policy.
FactSet opted for a combination of native AWS network monitoring capabilities and Telegraf, an open source server agent. They deployed Telegraf agents on Amazon Elastic Compute Cloud (Amazon EC2) instances across AWS Regions and Availability Zones (AZs), and in data centers, to collect network health metrics through ICMP, HTTP, and DNS probes. By deploying agents in all operating infrastructure locations, FactSet can monitor network health within each AWS Region and AZ. These probes measure response times and packet loss, which helps FactSet verify the robustness of its network infrastructure.
Along with Telegraf, FactSet uses native AWS network monitoring provided by Amazon CloudWatch and the various capabilities of Network Manager for visibility into AWS’s global infrastructure. This solution provides consistent and unified monitoring that can be deployed to all locations while also providing intra/inter-VPC visibility on AWS.
The Telegraf server agent monitors network health by initiating and responding to probes. These are some common probes/flows that are used across all environments:
- Inter-AZ – Telegraf initiates ICMP probes to measure the response time and packet loss between EC2 instances running in different AZs.
- Inter-VPC – Telegraf initiates ICMP probes to measure the response time and packet loss to instances in the Shared Services environment of the same AWS Region. These flows traverse Transit Gateway and provide an indication of network health across the Transit Gateway.
- Inter-Region – Telegraf initiates ICMP probes to measure the response time and packet loss to instances in other AWS Regions. These flows traverse Transit Gateway inter-Region peering connections and indicate the network health between Regions.
- Hybrid-Cloud – Telegraf initiates ICMP, HTTP, and DNS probes for various targets outside of the Region. These flows traverse through AWS Direct Connect Gateways, Transit Gateways, and Direct Connect Virtual Interfaces (VIFs), recording the health of the overall network.
- Internet – Instances initiate a variety of ICMP, HTTP, and DNS probes to assess reachability to the internet from each AWS Region. These probes traverse FactSet-managed firewalls, NAT Gateways, and Internet Gateways (IGWs) to verify internet connectivity and performance.
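Each of the flows above ultimately reduces to two numbers per target: response time and packet loss. The following is a minimal Python sketch of that reduction, not FactSet's implementation. The `ProbeSummary` type and `summarize_probes` function are hypothetical names introduced for illustration, and a lost probe is assumed to be represented as `None`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ProbeSummary:
    sent: int
    received: int
    loss_percent: float
    avg_rtt_ms: Optional[float]  # None when every probe in the window was lost


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> ProbeSummary:
    """Summarize one window of probe results; None marks a probe with no reply."""
    sent = len(rtts_ms)
    replies = [rtt for rtt in rtts_ms if rtt is not None]
    received = len(replies)
    # Guard against an empty window so we never divide by zero.
    loss_percent = 100.0 * (sent - received) / sent if sent else 0.0
    avg_rtt_ms = sum(replies) / received if received else None
    return ProbeSummary(sent, received, loss_percent, avg_rtt_ms)


if __name__ == "__main__":
    # Four probes to one target, one of which received no reply.
    print(summarize_probes([1.0, None, 3.0, 2.0]))
```

Keeping loss and latency as separate values matters because a window with total loss has no meaningful average response time.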
Solution technology stack and operations
FactSet relies on the TICK (Telegraf, InfluxDB, Chronograf, and Kapacitor) stack for its network monitoring solution. This TICK stack provides a robust and scalable monitoring architecture. Telegraf, as the primary data collector, comes with built-in plugins that simplify configuring and capturing various metrics.
- Input Plugins are used to capture metrics defined by the plugin
- Output Plugins are used to write metric data to various collectors/destinations
These are the plugins that are used by the current Telegraf installations:
- Ping Input Plugin – This input plugin is used to ping various destinations specified in the telegraf.conf files and report back round-trip time (RTT) and packet loss. FactSet has set this to poll every second.
- HTTP Response Input Plugin – This input plugin is used to probe various HTTP/HTTPS endpoints, validate reachability, and report back the response time, expecting a 200 HTTP response code.
- DNS Query Input Plugin – This input plugin is used to query various names configured in telegraf.conf files and report back the state of the query (success/failure) and response time.
- HTTP Output Plugin – This output plugin is used to export the metric data obtained from the preceding input plugins to monster-data-api (internal API to collect Telegraf metrics).
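To make the plugin list concrete, here is a hedged Python sketch that renders a telegraf.conf fragment wiring these four plugins together. The option names (`urls`, `interval`, `count`, `servers`, `domains`, `url`, `data_format`) follow the Telegraf plugin documentation as we understand it and should be checked against the Telegraf version in use. The `render_telegraf_conf` function, all addresses, and the output URL are hypothetical placeholders, and the `influx` data format is an assumption because the post does not describe the payload format of the internal API.

```python
import json
from typing import Sequence


def toml_list(values: Sequence[str]) -> str:
    """Render strings as a TOML array (JSON string syntax is valid TOML here)."""
    return json.dumps(list(values))


def render_telegraf_conf(ping_targets: Sequence[str], http_urls: Sequence[str],
                         dns_names: Sequence[str], dns_servers: Sequence[str],
                         output_url: str) -> str:
    """Build a telegraf.conf fragment with three probe inputs and one HTTP output."""
    lines = [
        "[[inputs.ping]]",
        f"  urls = {toml_list(ping_targets)}",
        '  interval = "1s"  # poll every second, as described above',
        "  count = 1",
        "",
        "[[inputs.http_response]]",
        f"  urls = {toml_list(http_urls)}",
        "",
        "[[inputs.dns_query]]",
        f"  servers = {toml_list(dns_servers)}",
        f"  domains = {toml_list(dns_names)}",
        "",
        "[[outputs.http]]",
        f"  url = {json.dumps(output_url)}  # placeholder for the internal API",
        '  data_format = "influx"  # assumption: payload format is not stated',
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_telegraf_conf(
        ping_targets=["10.0.0.10"],
        http_urls=["https://example.com"],
        dns_names=["example.com"],
        dns_servers=["10.0.0.2"],
        output_url="https://metrics.example.internal/ingest",
    ))
```

Generating the file from target lists, rather than editing it by hand, is one way to keep per-Region configurations consistent.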
Upon ingestion through the API, metric data is stored in InfluxDB, a time-series database. The data is then queried, visualized, and analyzed using Grafana, an open source observability tool that allows us to build dashboards, run analytics, and configure alerting. Alerts and notifications of network issues are integrated with an internal notification system for timely communication, and they are resolved through documented standard operating procedures (SOPs) for swift incident response.
FactSet employs a continuous integration/continuous deployment (CI/CD) process to keep Telegraf configurations consistent across AWS Regions and to reduce the operational burden of managing them. This is shown in the following diagram (figure 2). Telegraf configuration files are managed in a Git repository, updates are built into a golden AMI, and changes are deployed to EC2 instances using webhooks and AWS CodeBuild. This streamlined approach enables efficient configuration management across AWS Regions and environments.
The following custom alerts are defined in the rules engine:
- 100% ICMP loss for a target sustained for one minute
- Greater than 5% sustained ICMP loss for 10 minutes in the last 30 minutes
- Greater than 50% sustained deviation over baseline latency for 5 minutes in the last 10 minutes
- HTTP response code other than 200 for any target
- Nonzero DNS response code for any DNS lookup against the configured target
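The rules above can be sketched in Python as follows. This is an illustrative interpretation, not FactSet's rules engine. It assumes one aggregated sample per minute per target, and it reads "for N minutes in the last M minutes" as "at least N breaching minutes within the most recent M". The function and alert names are hypothetical.

```python
from typing import Callable, List, Sequence


def breaching_minutes(values: Sequence[float], window: int,
                      breached: Callable[[float], bool]) -> int:
    """Count per-minute samples in the most recent `window` that breach a rule."""
    return sum(1 for v in values[-window:] if breached(v))


def evaluate_alerts(loss_pct: Sequence[float], latency_ms: Sequence[float],
                    baseline_ms: float, http_status: int,
                    dns_rcode: int) -> List[str]:
    """Return the names of the alert rules that fire for one target."""
    alerts: List[str] = []
    # 100% ICMP loss sustained for the most recent minute.
    if loss_pct and loss_pct[-1] >= 100.0:
        alerts.append("icmp_total_loss")
    # More than 5% loss in at least 10 of the last 30 minutes.
    if breaching_minutes(loss_pct, 30, lambda p: p > 5.0) >= 10:
        alerts.append("icmp_sustained_loss")
    # More than 50% over baseline latency in at least 5 of the last 10 minutes.
    if breaching_minutes(latency_ms, 10, lambda ms: ms > 1.5 * baseline_ms) >= 5:
        alerts.append("latency_deviation")
    if http_status != 200:
        alerts.append("http_non_200")
    if dns_rcode != 0:
        alerts.append("dns_nonzero_rcode")
    return alerts


if __name__ == "__main__":
    loss = [0.0] * 20 + [10.0] * 10      # 10 lossy minutes out of the last 30
    latency = [10.0] * 5 + [16.0] * 5    # 5 slow minutes against a 10 ms baseline
    print(evaluate_alerts(loss, latency, baseline_ms=10.0,
                          http_status=200, dns_rcode=0))
    # prints ['icmp_sustained_loss', 'latency_deviation']
```

Counting breaching minutes within a window, rather than alerting on a single bad sample, is what suppresses one-off spikes.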
This solution has enabled FactSet to detect network degradation incidents promptly. The customized alerting engine, designed to identify anomalies sustained over a specified period, helps eliminate false-positive alerts and keeps the focus on critical network issues. The solution has accelerated troubleshooting and resolution, benefiting both FactSet and AWS engineering teams and enabling future migrations.
FactSet’s unified network monitoring solution, combining AWS’s native monitoring capabilities and Telegraf’s flexible architecture, has provided valuable insights into the performance and health of the network infrastructure. The solution’s scalability, comprehensive visibility, and effective alerting mechanisms have improved incident response and helped identify performance gaps. Although the current solution meets FactSet’s requirements, significant effort was required to set up the automation and deployments that ease the operational burden of running software on EC2 instances. FactSet expects future adoption of AWS services, such as Network Performance Monitor and AWS Network Manager – Infrastructure performance, to reduce operational overhead, deploy resilient global applications, and simplify the adoption of new AWS Regions.
The content and opinions in this post include those of the third-party author and AWS is not responsible for the content or accuracy of this post.