How Thomson Reuters used Amazon CloudWatch to improve availability and operational efficiency of Directory Services

Thomson Reuters Corporation (TR) is a Canadian multinational media company that provides critical online and print information, know-how, decision-making tools, software, and services for the legal industry. TR’s Tax and Accounting business serves law firms, tax and accounting firms, global trade organizations, educational institutions, and more.

Thomson Reuters operates in more than 100 countries and has over 38,000 employees who use Directory Service to authenticate and securely sign on to company systems with Single Sign-On (SSO). The availability of Directory Service is critical to our organization because it can impact hundreds of applications that we use every day to serve our customers. We decided to migrate this service to AWS to take advantage of the AWS Global Infrastructure and to reduce network latency during authentication for applications hosted in AWS. We wanted to build a solution that is highly available, provides low-latency network responses to our users, is secure, gives complete operational visibility with automated notifications during an event, and is easy to maintain with low operational overhead.

In this post, we’ll discuss how Thomson Reuters uses Amazon CloudWatch to achieve high availability, with CloudWatch alarms automatically failing over traffic to an alternate AWS Region. We’ll also cover how we use CloudWatch to enhance operations, quickly obtain operational insights from Amazon CloudWatch Logs and CloudWatch metrics, simplify troubleshooting, and automate notifications. Finally, we’ll present how other AWS services supported our core objectives of security, disaster recovery, and scalability.

Overview of solution

In this solution, Directory Service is hosted in multiple AWS Regions to provide low-latency responses to our users across various geolocations. To simplify the diagram, only one Region is shown in Figure 1. Directory Service must not be accessible from the internet, so Amazon Route 53 private hosted zones are used to respond to DNS queries from within Amazon VPC. If Directory Service becomes unhealthy in a particular AWS Region, traffic must fail over automatically to another AWS Region.

Figure 1. How Thomson Reuters uses CloudWatch for Directory Service.

Route 53 routes authentication requests to the AWS Region where Directory Service responds with the lowest latency. A CloudWatch alarm is invoked when Directory Service is unhealthy, and Route 53 is integrated with CloudWatch alarms so that traffic is routed only to healthy Directory Service endpoints.

Thomson Reuters has utilized CloudWatch to support various capabilities in the solution.

High Availability

Route 53 is a highly available Domain Name System (DNS) service that supports Active/Active failover across multiple Regions. Route 53 uses health checks to determine which Region to route Directory Service authentication requests to. However, Route 53 endpoint health checks can monitor only resources with public IP addresses, and Directory Service is not exposed to the internet, so it has no public IPs attached. To work around this, we use the status of a CloudWatch alarm as the health check for Route 53. Amazon EventBridge invokes an AWS Lambda function every few seconds, and the Lambda function checks the health of Directory Service and publishes a value indicating its status as a CloudWatch metric. When the environment is unhealthy, the metric crosses our failure threshold and the CloudWatch alarm enters the ALARM state. Route 53 uses the state of the CloudWatch alarm to determine whether to fail over traffic to Directory Service in another AWS Region.
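
The flow described above (health probe, custom metric, alarm, Route 53 health check) can be sketched roughly as follows. This is a minimal illustration rather than our production code: the namespace, metric name, alarm name, environment variables, TCP probe, and the default port 636 are all hypothetical placeholders chosen for the example. The CloudWatch client is passed in as an argument so the logic can be exercised without AWS access.

```python
import socket

# All names below are hypothetical placeholders, not values from this post.
NAMESPACE = "Custom/DirectoryService"
METRIC_NAME = "HealthStatus"
ALARM_NAME = "directory-service-unhealthy"


def check_tcp(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def publish_health(cloudwatch, healthy, region):
    """Publish 1.0 (healthy) or 0.0 (unhealthy) as a custom CloudWatch metric."""
    cloudwatch.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=[{
            "MetricName": METRIC_NAME,
            "Dimensions": [{"Name": "Region", "Value": region}],
            "Value": 1.0 if healthy else 0.0,
            "Unit": "None",
        }],
    )


def alarm_kwargs(region):
    """Arguments for cloudwatch.put_metric_alarm: ALARM when health drops below 1."""
    return {
        "AlarmName": ALARM_NAME,
        "Namespace": NAMESPACE,
        "MetricName": METRIC_NAME,
        "Dimensions": [{"Name": "Region", "Value": region}],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # a silent checker counts as unhealthy
    }


def health_check_config(region):
    """HealthCheckConfig for route53.create_health_check, tied to the alarm."""
    return {
        "Type": "CLOUDWATCH_METRIC",
        "AlarmIdentifier": {"Region": region, "Name": ALARM_NAME},
        "InsufficientDataHealthStatus": "Unhealthy",
    }


def lambda_handler(event, context):
    """Entry point assumed to be invoked on a schedule by EventBridge."""
    import os
    import boto3  # provided by the Lambda Python runtime

    region = os.environ["AWS_REGION"]
    host = os.environ["DIRECTORY_HOST"]  # hypothetical variable
    port = int(os.environ.get("DIRECTORY_PORT", "636"))  # hypothetical variable
    healthy = check_tcp(host, port)
    publish_health(boto3.client("cloudwatch"), healthy, region)
    return {"healthy": healthy}
```

Treating missing data as breaching, and insufficient data as unhealthy, is one possible design choice: if the checker itself stops reporting, the Region is taken out of rotation rather than assumed to be healthy.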

We also take advantage of health checks during our deployment process to automatically fail over traffic while performing maintenance or publishing changes. The Directory Service health check endpoint is set to unhealthy before changes are applied, so that all requests destined for that AWS Region are routed to Directory Service in a different AWS Region. This allows maintenance to be performed or changes to be pushed seamlessly without any impact on users. This contrasts with the on-premises setup, where deployments required coordination with multiple teams.
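
One way to picture this maintenance pattern is a flag that overrides the reported health while a change is in progress. The sketch below is illustrative only: `reported_health`, `maintenance_window`, and the two callbacks are hypothetical names, and how the flag is stored and how long to wait for traffic to drain are left to the caller.

```python
from contextlib import contextmanager


def reported_health(probe_ok, in_maintenance):
    """Value to publish: 0.0 while in maintenance, regardless of the probe result."""
    return 1.0 if (probe_ok and not in_maintenance) else 0.0


@contextmanager
def maintenance_window(set_maintenance, wait_for_drain):
    """Mark the Region unhealthy, wait for traffic to move away, then run the change.

    set_maintenance(bool) and wait_for_drain() are caller-supplied callbacks
    (hypothetical). If the change raises, the flag is left set so the Region
    stays out of rotation until someone investigates.
    """
    set_maintenance(True)
    wait_for_drain()
    yield
    set_maintenance(False)
```

A deployment step would then run inside `with maintenance_window(set_flag, wait):`, and the Region returns to service only after the block completes without an error.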

Monitoring and Observability

CloudWatch Logs enabled us to centralize logs from multiple sources. We configured Route 53 DNS query logs and Network Load Balancer (NLB) logs to publish directly to a CloudWatch Logs log group, and streamed application logs from containers using Fluent Bit to these log groups. The data published from these sources include every request and response from Directory Service, detailed traces for failures, and additional details such as latency, bytes transferred, and source IP addresses. Using CloudWatch Logs, we are able to correlate events across multiple sources and easily troubleshoot issues.
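
As an illustration of correlating events across sources, a CloudWatch Logs Insights query can be run against several log groups at once. The sketch below assumes hypothetical log group names and a simple text pattern; the real field names depend on each log format. The Logs client is passed in so that the polling logic can be tested without AWS access.

```python
import time


def build_query_kwargs(log_groups, pattern, start_epoch, end_epoch, limit=50):
    """Arguments for logs.start_query: match a pattern across several log groups."""
    query = (
        "fields @timestamp, @log, @message "
        f"| filter @message like /{pattern}/ "
        "| sort @timestamp asc "
        f"| limit {limit}"
    )
    return {
        "logGroupNames": list(log_groups),
        "startTime": int(start_epoch),
        "endTime": int(end_epoch),
        "queryString": query,
    }


def run_query(logs, kwargs, poll_seconds=1.0, max_polls=30):
    """Start a Logs Insights query and poll until it reaches a final status."""
    query_id = logs.start_query(**kwargs)["queryId"]
    resp = {"status": "Unknown", "results": []}
    for _ in range(max_polls):
        resp = logs.get_query_results(queryId=query_id)
        if resp["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            break
        time.sleep(poll_seconds)
    # Each result row is a list of {"field": ..., "value": ...} cells.
    rows = [{c["field"]: c["value"] for c in row} for row in resp.get("results", [])]
    return resp["status"], rows
```

With boto3, `logs` would be `boto3.client("logs")`, and the `@log` field in each row indicates which log group the event came from.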

CloudWatch metrics provide data about the performance of Directory Service and of AWS services such as Network Load Balancer and Amazon Route 53. The Lambda function that verifies the Directory Service status publishes custom health metrics. We created custom CloudWatch dashboards to provide a unified operational view of the components that make up the solution. The historical data from CloudWatch Logs and metrics provides operational insights into the environment.

By using CloudWatch, we don’t have to maintain the monitoring infrastructure. It automatically scales to meet our needs, and we pay only for what we use.

Operational Performance

Immediate notification to our operations team is paramount during any incident. To achieve this, CloudWatch metrics data is used to invoke alarms based on predefined thresholds. Examples of such metrics are network response latency and Directory Service health. Our CloudWatch alarms publish messages to Amazon Simple Notification Service (Amazon SNS), which sends notifications to the operations team. Because these metrics reflect the customer experience with the solution, notifications based on them give us an improved incident response time and enable us to take appropriate measures during events to maintain the availability of the solution.
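
A threshold alarm that notifies an SNS topic can be sketched as below. The metric names, namespace, thresholds, and topic ARN are hypothetical examples, not our actual configuration. The dictionaries match the keyword arguments of the boto3 `put_metric_alarm` call.

```python
def notification_alarm(name, metric_name, threshold, comparison, topic_arn,
                       namespace="Custom/DirectoryService", statistic="Average"):
    """Arguments for cloudwatch.put_metric_alarm with SNS notifications attached."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": statistic,
        "Period": 60,              # evaluate one-minute data points
        "EvaluationPeriods": 3,    # require three consecutive breaches
        "Threshold": float(threshold),
        "ComparisonOperator": comparison,
        "AlarmActions": [topic_arn],  # notify when the alarm fires
        "OKActions": [topic_arn],     # notify again on recovery
    }


def create_alarms(cloudwatch, alarms):
    """Create or update each alarm; returns the alarm names that were sent."""
    for alarm in alarms:
        cloudwatch.put_metric_alarm(**alarm)
    return [alarm["AlarmName"] for alarm in alarms]


# Hypothetical topic and alarms, for illustration only.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:directory-ops"
EXAMPLE_ALARMS = [
    notification_alarm("directory-latency-high", "ResponseLatencyMs", 500,
                       "GreaterThanThreshold", TOPIC_ARN),
    notification_alarm("directory-health-low", "HealthStatus", 1,
                       "LessThanThreshold", TOPIC_ARN, statistic="Minimum"),
]
```

Requiring several consecutive breaching periods is one way to trade a slightly slower notification for fewer false alerts; the right values depend on the workload.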

In this solution, we’ve leveraged various AWS services. Security is vital for us because our users’ data is held within Directory Service. We use AWS Identity and Access Management (IAM) to enforce the least privilege model, AWS Secrets Manager to protect database credentials, and AWS Key Management Service (AWS KMS) to encrypt data at rest. Amazon Elastic Kubernetes Service (Amazon EKS) automatically scales up Directory Service when there is a surge in traffic and provides self-healing when a pod becomes unavailable. AWS CloudFormation for Infrastructure as Code (IaC) brought automation to the provisioning process, and AWS CodePipeline and AWS CodeBuild automate our CI/CD pipelines to deliver changes across multiple Regions consistently and efficiently.

Conclusion

In this post, we explained how CloudWatch enables Thomson Reuters to build a highly available environment, derive actionable insights from CloudWatch Logs and metrics, achieve operational visibility, and automate incident response notifications. We improved the performance and availability of Single Sign-On authentication for more than 38,000 end users using AWS services. As a result, we accomplished all our goals with CloudWatch and related AWS services while meeting security compliance and mandated technology governance requirements. The AWS Global Infrastructure has allowed us to deploy our services in multiple AWS Regions around the world.

AWS Cloud Operations & Migrations Blog