Using Squid Proxy Instances for Web Service Access in Amazon VPC: Another Example with AWS CodeDeploy and Amazon CloudWatch

This article further develops the highly available and scalable Squid-based proxy solution introduced in our previous article (https://aws.amazon.com/articles/5995712515781075). We provide the guidelines and resources needed to automate Squid configuration deployments and to aggregate and centralize Squid application metrics and access logs.


Submitted By: Nicolas Malaval
AWS Products Used: AWS CodeDeploy, Amazon CloudWatch
Created On: December 30, 2015


Proxy servers usually act as a relay between internal resources (servers, workstations, etc.) and the Internet, and filter, accelerate and log network activity leaving the private network. Proxy servers (also called forward proxy servers) must not be confused with reverse proxy servers, which are used to control and sometimes load-balance network activity entering the private network.

In our previous article (Using Squid Proxy Instances for Web Service Access in Amazon VPC: An Example), we introduced a highly available and scalable Squid proxy solution, based notably on Elastic Load Balancing and Auto Scaling.

In this article, we will further develop this solution by adding the following features: the automation of Squid configuration deployments with AWS CodeDeploy, the aggregation of Squid access logs in Amazon CloudWatch Logs, and the collection of Squid application metrics as custom metrics in Amazon CloudWatch.

Topics:

  • Brief insight into our previous article
  • Introduction of new features
  • Step 1: Deploy and test a Squid proxy farm
  • Step 2: Update the Squid configuration and restrict access to AWS endpoints only
  • Step 3 (optional): Delete your test environment

Brief insight into our previous article

In Amazon Virtual Private Cloud (VPC), some resources can be prevented from direct access to the Internet by routing their traffic through a Network Address Translation (NAT) instance.

This solution raises two important concerns. First, the NAT instance may become a choke point: a subnet can route through only one NAT instance, and the maximum network throughput is bounded by the throughput of the largest available instance type. Second, the routing policy applies to an entire subnet, so if the NAT instance fails, no instance in the subnet it serves can reach the Internet.

Our previous article proposed a farm of Squid proxy instances as an alternative. This proxy farm is fronted by an Amazon Elastic Load Balancer which distributes TCP requests across multiple Squid proxy instances running in separate Availability Zones for high availability. An Auto Scaling group scales the proxy farm in or out, based on Amazon CloudWatch alarms that track the current inbound network traffic. For each instance, the Squid configuration is applied at first boot, as defined in the EC2 UserData of the Auto Scaling launch configuration.

Introduction of new features

Automate Squid configuration deployments and updates with AWS CodeDeploy

AWS CodeDeploy centrally controls and automates code deployments across Amazon EC2 instances to avoid downtime and reduce the burden of updating applications.

In our particular case, CodeDeploy centrally manages the Squid configuration (e.g., squid.conf), deploys the configuration files to Amazon EC2 instances across the farm and reloads Squid to apply the new configuration.

A new CodeDeploy Application is created for each proxy farm, with a Deployment Group targeting the Auto Scaling group and the Deployment Configuration "CodeDeployDefault.OneAtATime" to deploy to one instance at a time.

The CodeDeploy agent is installed on each EC2 proxy instance at first boot, with the following commands in the EC2 UserData:

yum install -y ruby
yum install -y aws-cli 
cd /home/ec2-user
aws s3 cp s3://bucket-name/latest/install . --region region-name
chmod +x ./install
./install auto

Each revision contains an AppSpec file which should be structured as below.

version: 0.0
os: linux
files:
  - source: squid
    destination: /etc/squid
permissions:
  - object: /etc/squid
    pattern: "**"
    owner: "squid"
    group: "squid"
    mode: 644
    type:
      - directory
hooks:
  BeforeInstall:
    - location: scripts/remove-squid-conf.sh
      timeout: 10
      runas: root
  ApplicationStart:
    - location: scripts/reload-squid.sh
      timeout: 30
      runas: root

"BeforeInstall" removes the previous Squid configuration files before CodeDeploy deploys the new ones to the proxy instance. Indeed, CodeDeploy can only override files if they were created by CodeDeploy itself. In our case, it is Squid installer which creates the default Squid configuration files.

Then, at "ApplicationStart", Squid is reloaded using the native "reconfigure" operation, which does not interrupt the service.
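
For reference, here is a minimal sketch of what the two hook scripts referenced in the AppSpec file could look like. The actual scripts are included in the revision archive; the paths shown are assumptions.

#!/bin/bash
# scripts/remove-squid-conf.sh (sketch): delete the configuration files created
# by the Squid installer, so that CodeDeploy is allowed to deploy its own copies.
rm -f /etc/squid/squid.conf*

#!/bin/bash
# scripts/reload-squid.sh (sketch): apply the new configuration without
# interrupting the service, using Squid's native "reconfigure" operation.
squid -k reconfigure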

Aggregate Squid access and debug logs with Amazon CloudWatch Logs

Retrieving Squid access and debug logs may be required for auditing or troubleshooting purposes. In our case, this would require logging on to each EC2 proxy instance to access the log files. Moreover, if an instance fails, it is automatically replaced and its log files are lost.

To avoid this, we use Amazon CloudWatch Logs to consolidate logs from all proxy instances in almost real-time. The CloudWatch Logs agent is installed on each EC2 proxy instance at first boot, with the following command:

yum install -y awslogs

The agent automatically sends log data to CloudWatch Logs. Two log streams are configured: one for Squid access logs (access.log) and one for Squid debug and error logs (cache.log). To do so, the file /etc/awslogs/awslogs.conf should contain:

[general]
state_file = /var/lib/awslogs/agent-state

[access.log]
log_group_name = log-group-name
log_stream_name = access - {instance_id}
file = /var/log/squid/access.log*

[cache.log]
log_group_name = log-group-name
log_stream_name = cache - {instance_id}
file = /var/log/squid/cache.log*

Note that the file attribute ends with a star (*), as the Squid log rotation mechanism creates multiple files.
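
The awslogs agent runs as a service. Assuming Amazon Linux, it can be started and enabled at boot with commands along these lines:

service awslogs start      # start the CloudWatch Logs agent
chkconfig awslogs on       # make sure it starts again after a reboot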

Centralize Squid application metrics as custom metrics in Amazon CloudWatch

In order to provide a consolidated view of Squid application metrics, each EC2 proxy instance is configured to publish custom metrics to Amazon CloudWatch. We monitor the following metrics every 5 minutes: the total number of client requests, the number of client requests served from the cache (hits), the total traffic sent to clients, and the disk cache size.

We use the native program squidclient to query the Squid Cache Manager and retrieve Squid application metrics. The following script is executed every 5 minutes and puts the metrics into the CloudWatch namespace "SquidProxy":

totalrequests=`squidclient -p squidport mgr:5min | grep "client_http.requests" | cut -d " " -f3 | cut -d "/" -f1`
hitrequests=`squidclient -p squidport mgr:5min | grep "client_http.hits" | cut -d " " -f3 | cut -d "/" -f1`
totalkbytes=`squidclient -p squidport mgr:5min | grep "client_http.kbytes_out" | cut -d " " -f3 | cut -d "/" -f1`
cachesize=`du -s /var/spool/squid | sed 's/^\([0-9]*\).*/\1/'`
aws cloudwatch put-metric-data --region "region" --namespace "SquidProxy" --metric-name "TotalRequestsPerSecond" --unit "Count/Second" --dimensions "StackName=common-name,InstanceId=instance-id" --value "$totalrequests" --timestamp "`date -u "+%Y-%m-%dT%H:%M:%SZ"`"
aws cloudwatch put-metric-data --region "region" --namespace "SquidProxy" --metric-name "HitRequestsPerSecond" --unit "Count/Second" --dimensions "StackName=common-name,InstanceId=instance-id" --value "$hitrequests" --timestamp "`date -u "+%Y-%m-%dT%H:%M:%SZ"`"
aws cloudwatch put-metric-data --region "region" --namespace "SquidProxy" --metric-name "TotalKbytesPerSecond" --unit "Kilobytes/Second" --dimensions "StackName=common-name,InstanceId=instance-id" --value "$totalkbytes" --timestamp "`date -u "+%Y-%m-%dT%H:%M:%SZ"`"
aws cloudwatch put-metric-data --region "region" --namespace "SquidProxy" --metric-name "DiskCacheSize" --unit "Kilobytes" --dimensions "StackName=common-name,InstanceId=instance-id" --value "$cachesize" --timestamp "`date -u "+%Y-%m-%dT%H:%M:%SZ"`"

We will use the total Squid traffic sent to clients (metric name "TotalKbytesPerSecond") to scale out the Squid proxy farm when the average traffic per instance exceeds 5000 KB/s and to scale in when it falls below 2000 KB/s.

Note that:

  • The EC2 metric "NetworkOut" would provide almost the same result, as Squid is the only application to generate traffic. However, your intention was to leverage custom metrics for CloudWatch Alarms.
  • Your application may need other alarm metrics or values tuned to your usage pattern and the selected instance type.

Retain visibility over client IP address with Amazon Elastic Load Balancing and Proxy Protocol

Client requests to the Squid proxy farm go through the Elastic Load Balancer. The client TCP connection is terminated at the ELB and a new TCP connection is established between the ELB and one of the EC2 proxy instances. Therefore, the source IP address logged in Squid access logs is the IP address of the ELB, not the initial client IP address.

To retain visibility of the client IP address at Squid level, we use Proxy Protocol, supported by Amazon ELB. With Proxy Protocol, the ELB adds information about the client in a request "header". When the EC2 proxy instance receives the TCP packet, it can retrieve the client IP address from the "header" and process the request as usual.
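
The CloudFormation template used in Step 1 is expected to handle this configuration for you. For reference, enabling Proxy Protocol manually on the ELB with the AWS CLI looks roughly like the following (the load balancer name and backend port are placeholders):

# Create a Proxy Protocol policy and attach it to the backend port used by Squid
aws elb create-load-balancer-policy --load-balancer-name squid-proxy-elb \
  --policy-name EnableProxyProtocol --policy-type-name ProxyProtocolPolicyType \
  --policy-attributes AttributeName=ProxyProtocol,AttributeValue=true
aws elb set-load-balancer-policies-for-backend-server --load-balancer-name squid-proxy-elb \
  --instance-port 3128 --policy-names EnableProxyProtocol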

Clients actually send HTTP requests to the proxy using an absolute URI (e.g., GET http://www.amazon.com/ HTTP/1.1). An ELB HTTP listener rewrites requests into their most common form (GET / HTTP/1.1 with a Host: www.amazon.com header), which Squid cannot process as a proxy request. Therefore, it is currently not possible to use ELB HTTP listeners and the "X-Forwarded-For" HTTP header to retrieve the client IP address.

Important note: Squid added support for Proxy Protocol in version 3.5. Official repositories (Amazon Linux, CentOS, RHEL) currently include and support earlier versions. Depending on your feature and support requirements, the next sections provide instructions and materials for both scenarios.

For version 3.5, we will use binary packages referenced at https://wiki.squid-cache.org/SquidFaq/BinaryPackages, for CentOS 6.

For earlier versions, note that the source IP address in Squid access logs will refer to one of the ELB network interfaces. You may want to enable ELB access logs so that you can correlate Squid access logs with ELB access logs and retrieve the IP address of the initial client.
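
For reference, a Squid 3.5 configuration accepts the Proxy Protocol header through directives along the following lines. This is only a sketch; the port and trusted source range are assumptions, and the actual configuration files are provided with the materials of this article.

# Hypothetical squid.conf excerpt for Squid 3.5 behind an ELB with Proxy Protocol
acl elb_clients src 10.0.0.0/8                # VPC range the ELB connects from (assumption)
http_port 3128 require-proxy-header           # expect a Proxy Protocol header on this port
proxy_protocol_access allow elb_clients       # only trust the header from these sources
proxy_protocol_access deny all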

Step 1: Deploy and test a Squid proxy farm

For the purpose of this article, we will use an AWS CloudFormation template to deploy and configure the resources needed for a Squid proxy farm.


Figure 1. Squid proxy farm architecture

You will be asked to prepare and provide the following prerequisites:

  • An Amazon VPC configured with an Internet Gateway, with at least two subnets in different Availability Zones. The default VPC should meet these needs.
  • An Amazon S3 bucket to store Squid binaries and configuration files. This bucket must reside in the same region as the VPC.
  • A key pair, so that you can log on to the EC2 Squid proxy instances and analyze their content

The template creates and configures the following:

  • Two Security Groups, attached respectively to the ELB and to the EC2 proxy instances
  • An IAM instance profile that will be attached to the EC2 proxy instances
  • An internal ELB, reachable only from within the VPC
  • An Auto Scaling launch configuration that will install the required services (Squid, CloudWatch Logs agent, etc.) and apply the initial Squid configuration (not from CodeDeploy)
  • An Auto Scaling group that will launch instances from the launch configuration and automatically associate them with the ELB
  • Two Auto Scaling policies to increase or decrease the number of instances
  • Two CloudWatch alarms that will trigger the Auto Scaling policies, depending on the total Squid traffic sent to clients
  • An IAM role to be used by CodeDeploy for deployments
  • A CodeDeploy Deployment Group that points to the Auto Scaling group
  • A CodeDeploy Application

Deployment instructions

Start by downloading the archive squidproxyfarm-setup.zip to your computer. It contains a bootstrap script and an RPM package for Squid 3.5.11.

Important: if you intend to use Squid 3.5, it is highly recommended to replace the Squid RPM package with the most recent version available. To do so, extract the ZIP archive; you should obtain a folder named "squidproxyfarm-setup". Go to the Squid binary repository for CentOS 6 (when this article was written, the Squid website redirected to https://www1.ngtech.co.il/repo/centos/6/x86_64/) and download the latest RPM into the "squidproxyfarm-setup" folder. Remove the Squid 3.5.11 package. Then create a ZIP archive "squidproxyfarm-setup.zip" containing the "squidproxyfarm-setup" folder, as illustrated in Figure 2.


Figure 2. Creation of a ZIP archive with the latest Squid version

Upload the ZIP archive to a dedicated folder in your S3 bucket.

Then, to provision the Squid proxy farm, you can use this squidproxyfarm.template in the AWS CloudFormation console in the AWS Management Console or click on the Launch Stack button below. Note that this template can be displayed in the AWS CloudFormation Designer.


Figure 3: AWS CloudFormation Designer

You are required to enter the following parameters, as illustrated in Figure 4:

  • StackName: Enter a unique common name for the Squid proxy farm
  • InstanceType: Instance type to use for launching EC2 proxy instances
  • KeyPair: Select a key pair so that you can log on to the EC2 proxy instances
  • ProxyClientsSG: The Squid proxy farm will be allowed to receive ingress traffic only from instances attached to this Security Group
  • S3Location: S3 folder of the Squid proxy farm setup archive in the form of bucket-name/prefix/ [or bucket-name/ in the case of the root folder]
  • SquidPort: Leave the default value for the purpose of this tutorial
  • SquidVersion: Choose between installing Squid 3.5 to support Proxy Protocol, or a lower version from Amazon Linux repository
  • Subnet 1 & 2: Subnets in which to run the EC2 proxy instances
  • VPC: VPC in which to run the resources


Figure 4: AWS CloudFormation stack parameters

Click Next. When prompted for tags, you may add custom tags, then click Next.

Acknowledge that the template might cause AWS CloudFormation to create IAM resources and click Create. Once the stack status is CREATE_COMPLETE, you can retrieve information about the Squid proxy farm ELB in the "Outputs" tab, as illustrated in Figure 5.


Figure 5: AWS CloudFormation stack outputs

You may then browse the resources created in the EC2 or CloudWatch console in the AWS Management Console. For instance, Figure 6 illustrates the Squid access logs from CloudWatch Logs.


Figure 6: Amazon CloudWatch Logs console

Note that the CloudFormation template, as provided in the article, sets the desired capacity of the Auto Scaling group to one instance. For critical environments, you may want a desired capacity of at least two, to avoid downtime if one instance fails.
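
If you prefer the AWS CLI, the desired capacity can be adjusted with a command such as the following (the Auto Scaling group name is a placeholder, to be replaced by the one created by the stack):

# Raise the minimum size and desired capacity of the proxy farm to two instances
aws autoscaling update-auto-scaling-group --auto-scaling-group-name squid-proxy-asg \
  --min-size 2 --desired-capacity 2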

Test instructions

Your Squid proxy farm is ready to use! As a basic test, you can run the following commands on an EC2 Linux instance that is allowed to send traffic to the ELB. The -I option displays the HTTP response headers only. Note that you should obtain an HTTP 200 (OK) response for both requests.

curl -I --proxy SquidDNSName:SquidPort https://www.google.com
curl -I --proxy SquidDNSName:SquidPort https://calculator.s3.amazonaws.com/index.html

You can conduct further scaling, recovery and load testing of your Squid proxy farm, using methods and scenarios analogous to those described in our previous article.

Step 2: Update the Squid configuration and restrict access to AWS endpoints only

In Step 1, you have deployed a Squid proxy farm with the default configuration. You most probably want to customize the Squid configuration to meet your needs and constraints.

The first option would be to log on to each Squid proxy instance and update the configuration files. This is difficult to handle at large scale and problematic when the Auto Scaling group scales out or in, or replaces a failed instance.

The second option would be to customize the configuration at first boot, via bootstrapping or UserData scripts. However, updating this custom configuration raises a major concern: existing proxy instances must be terminated and replaced, which must be carefully coordinated to avoid service interruption.

We use CodeDeploy to address these concerns. In this step, we will update the file squid.conf on the running proxy farm to restrict access to AWS endpoints only. You can refer to our article "Enforcing a Squid Access Policy for Amazon S3 and Yum" (https://aws.amazon.com/articles/6884321864843201) for further details on how to implement restrictions in Squid.
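
The actual configuration is delivered by the revision archive used below. Purely as an illustration, the kind of Squid access rules that restrict clients to AWS endpoints might look like this (the domain and network range are assumptions):

# Hypothetical squid.conf excerpt: only allow clients from the VPC to reach AWS endpoints
acl localnet src 10.0.0.0/8
acl aws_endpoints dstdomain .amazonaws.com
http_access allow localnet aws_endpoints
http_access deny all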

Deployment instructions

Download the archive squid-revision.zip and upload it into the same S3 location as the ZIP archive from Step 1.

This archive follows the structure described earlier. However, for the purpose of this article, the folder "squid" contains two files: one for Squid 3.5 with Proxy Protocol enabled (squid.conf.withpp) and another for earlier versions of Squid (squid.conf.woutpp). An additional script (rename-squid-conf.sh), executed at "AfterInstall", renames the appropriate file to squid.conf.

Open the CodeDeploy console in the AWS Management Console. Click on the Application created as part of the AWS CloudFormation stack.

Expand the Deployment Group, also created as part of the AWS CloudFormation stack, and click on Deploy New Revision (see Figure 7).


Figure 7: AWS CodeDeploy application details

Enter the S3 location of the ZIP archive and optionally a description for the deployment. Click on Deploy Now. The deployment should complete within a couple of minutes, as illustrated in Figure 8.


Figure 8: AWS CodeDeploy Deployment results
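
Alternatively, the same revision can be deployed from the AWS CLI, along these lines (the application and deployment group names are placeholders for the ones created by the CloudFormation stack):

# Deploy the revision stored in S3 to the deployment group, one instance at a time
aws deploy create-deployment --application-name SquidProxyFarm \
  --deployment-group-name SquidProxyFarm-dg \
  --deployment-config-name CodeDeployDefault.OneAtATime \
  --s3-location bucket=bucket-name,key=prefix/squid-revision.zip,bundleType=zip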

Test instructions

Similarly, you can run the following commands on an EC2 Linux instance to test the new Squid configuration.

curl -I --proxy SquidDNSName:SquidPort https://www.google.com
curl -I --proxy SquidDNSName:SquidPort https://calculator.s3.amazonaws.com/index.html

Recall that you obtained an HTTP 200 response for both requests earlier. You should now obtain an HTTP 403 (Forbidden) response to the first request, as illustrated in Figure 9, which demonstrates that the new configuration is active.


Figure 9: Proxy responses to client requests

You can also configure and test the AWS CLI to use the Squid proxy farm. Below are the instructions for Linux. Please refer to https://docs.aws.amazon.com/cli/latest/userguide/cli-http-proxy.html for further information.

export HTTP_PROXY=SquidURL
export HTTPS_PROXY=SquidURL
export NO_PROXY=169.254.169.254
aws s3api list-buckets


This generates an entry in the Squid access log, as illustrated in Figure 10.


Figure 10: Configure and test the Squid proxy farm with AWS CLI
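
If you prefer to check from the command line instead of the CloudWatch console, you can query the log group with a command along these lines (the log group name is a placeholder, to be replaced by the one from the stack outputs):

# Look for the list-buckets call in the aggregated Squid access logs
aws logs filter-log-events --log-group-name log-group-name \
  --filter-pattern '"s3.amazonaws.com"'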

Let us now test how the Squid proxy farm behaves when scaling out. Open the EC2 console in the AWS Management Console and click on Auto Scaling Groups in the left bar. Select the Auto Scaling group and click on Edit. Enter "2" as the desired number of instances, as illustrated in Figure 11, and click Save.


Figure 11: Update to the Auto Scaling Group desired capacity

Click on Instances in the left bar. You should see a new instance being created (see Figure 12). Wait for the instance creation to complete (Status Checks = "2/2 checks passed").


Figure 12: Amazon EC2 Instances

Open the CodeDeploy console in the AWS Management Console and click on Deployments, as illustrated in Figure 13.


Figure 13: Link to AWS CodeDeploy Deployments

You should see a new completed Deployment in the list. It demonstrates that the Application Revision is automatically deployed to new EC2 proxy instances when the Auto Scaling group scales out.
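
You can also confirm this from the AWS CLI, for example (the application and deployment group names are placeholders):

# List the successful deployments for the deployment group; the most recent one should be
# the automatic deployment triggered by the scale-out
aws deploy list-deployments --application-name SquidProxyFarm \
  --deployment-group-name SquidProxyFarm-dg --include-only-statuses Succeeded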

Step 3 (optional): Delete your test environment

If you want to clean up the resources you just created, you can use the AWS CloudFormation console. Select the check box in the left column of the stack, and then click Delete Stack.