How Discover accelerates data ingestion using AWS PrivateLink for Amazon S3

Discover Financial Services (NYSE: DFS) is a digital banking and payment services company with one of the most recognizable brands in US financial services. Since its inception in 1986, Discover has become one of the largest card issuers in the United States.

We are proud members of the platform team at Discover, where we are responsible for managing the company’s cloud data platforms. Over the last few years of Discover’s cloud journey, most of our big data and analytical workloads were migrated from on premises to AWS. We have moved petabytes of data to Amazon Simple Storage Service (Amazon S3), primarily from system of record (SOR) databases and data received from vendors. We maintain an ongoing incremental movement of data to Amazon S3. Part of our responsibility on the platform team is to enable infrastructure and platform services so that business users, data operations teams, and data scientists can perform ETL (extract, transform, and load) data preparation, build and run machine learning models, and produce reports.

Security is paramount at Discover. We strictly enforce secure private connections to Amazon S3 using gateway VPC endpoints, together with S3 bucket policies that only allow connections through those endpoints. Previously, Amazon S3 supported only gateway virtual private cloud (VPC) endpoints. This works well for workloads already operating on AWS, but for workloads operating on premises, a set of proxy EC2 instances needed to be set up to allow for connectivity between the data center and Amazon S3 using gateway endpoints. The limitation is that any connection to S3 through a gateway endpoint must originate from within the VPC, which removed the option of connecting to S3 directly from on premises despite having AWS Direct Connect.

In this post, we talk about how we overcame data ingestion challenges from applications deployed on premises by loading the data using AWS PrivateLink for Amazon S3. PrivateLink for S3 allows us to provision interface endpoints in our virtual private cloud that are directly accessible from applications on premises over VPN and AWS Direct Connect, or from a different AWS Region over VPC peering. With PrivateLink support for S3, we were able to gain the performance and cost benefits of eliminating cumbersome proxy usage, with savings exceeding $10,000 per month. We are also able to optimize our network configuration, with the flexibility of using the different endpoint types for workloads running in the cloud and on premises.

Opportunities for improvement with our original solution

Initially, to overcome these challenges, the Discover data platform team used HTTP (Squid) proxies, so that on-premises services went through a Squid proxy to establish a private connection to Amazon S3 for data ingestion.

This required changes to on-premises applications so that they directed requests to the proxy servers, which then forwarded them to Amazon S3 through the VPC endpoint. While this allowed us to establish a private connection from on premises to the gateway endpoint via Direct Connect and the Squid proxies, it added operational complexity and bandwidth constraints on the path to S3.

Latency was added at times by the extra hop from on premises to the Squid proxy during S3 data loads.
This pattern did not support HTTPS, even though the connection went through a dedicated, secure AWS Direct Connect connection. Not being able to use TLS was a concern, because Discover’s preferred way of connecting to cloud services is through TLS.
Operating and maintenance overhead for:
- Squid proxy scalability and support across multiple AWS accounts
- Patching and updating Amazon EC2 instances
- Monitoring and alerting
Approximate cost incurred in Discover cloud data platform accounts:
- Total number of Amazon EC2 instances across different environments: 20
- Total number of Elastic Load Balancing load balancers: 6
- Approximate incremental data load per day: 10+ TB
- Approximate cost: 10k USD per month
- Cost increases with increased volume of data ingestion
Number of IPs consumed:
- Squid proxy setup across subnets for redundancy
- 30–40 IPs per subnet

How Discover addressed these opportunities

Discover adopted AWS PrivateLink for Amazon S3 to streamline connections to Amazon S3 from on-premises applications by eliminating the Squid proxies. On-premises applications now reach Amazon S3 privately over the secure connections provided by AWS Direct Connect or AWS VPN. This allowed us to remove the proxy servers in our Amazon Virtual Private Cloud (VPC), which relayed traffic to the gateway endpoints for Amazon S3. While that solution worked, the proxy servers constrained performance, added points of failure, and increased operational complexity.

With AWS PrivateLink for Amazon S3, you can access Amazon S3 directly as a private endpoint within a secure virtual network using an interface VPC endpoint in your VPC. This extends the functionality of existing gateway endpoints by enabling access to Amazon S3 using private IP addresses. API requests and HTTPS requests to Amazon S3 from on-premises applications are directed through interface endpoints, which connect to Amazon S3 securely and privately through PrivateLink. Workloads operating on premises and connected by either AWS Direct Connect or AWS Site-to-Site VPN can therefore connect to Amazon S3 without proxy servers.

This architecture is applied across multiple Discover cloud data platform AWS accounts.

Interface endpoints simplify the network architecture when connecting to Amazon S3 from on-premises applications by removing the need to configure firewall rules or an internet gateway. You can also gain additional visibility into network traffic with the ability to capture and monitor flow logs in the VPC. Additionally, you can set security groups and access control policies on interface endpoints.

An overview of the solution is depicted in the following architecture diagram. Customer data synchronization services run from the on-premises virtual machines and connect to Amazon S3 via AWS Direct Connect, utilizing PrivateLink for Amazon S3. In-VPC applications connect to Amazon S3 using Gateway VPC endpoints.

Implementation

Create an interface VPC endpoint for Amazon S3 (AWS PrivateLink for Amazon S3). The AWS documentation describes how.

Update the bucket policies to include the S3 VPC endpoint (VPCE) ID, which enables connections to Amazon S3 using AWS PrivateLink for Amazon S3. Example Amazon S3 bucket policy statement:

    {
        "Sid": "DenyUnlessVPCEndpoint",
        "Effect": "Deny",
        "Principal": "*",
        "Action": [
            "s3:PutObject",
            "s3:GetObject"
        ],
        "Resource": "Your Bucket ARN/*",
        "Condition": {
            "StringNotEquals": {
                "aws:SourceVpce": [
                    "Your S3 INTERFACE VPCE ID"
                ]
            }
        }
    }
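
As a rough illustration of how a statement like the one above fits into a complete policy document, the following Python sketch wraps it in the standard `Version`/`Statement` envelope. The bucket name, the endpoint IDs, and the helper name `build_vpce_only_policy` are hypothetical placeholders, not Discover's values. Because `s3:PutObject` and `s3:GetObject` are object-level actions, the sketch targets the objects in the bucket (the `/*` suffix).

```python
import json


def build_vpce_only_policy(bucket_name, vpce_ids):
    """Return a bucket policy (as a dict) that denies object reads and
    writes unless the request arrives through one of the given VPC endpoints."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnlessVPCEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:PutObject", "s3:GetObject"],
                # Object-level actions apply to object ARNs, hence the /* suffix.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {
                    # The deny applies when the request's endpoint ID matches
                    # none of the listed IDs.
                    "StringNotEquals": {"aws:SourceVpce": list(vpce_ids)}
                },
            }
        ],
    }


# Placeholder values for illustration only.
policy = build_vpce_only_policy(
    "example-ingest-bucket",
    ["vpce-0123456789abcdef0", "vpce-0fedcba9876543210"],
)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

If in-VPC workloads reach the same bucket through a gateway endpoint, that endpoint's ID would presumably need to appear in the list alongside the interface endpoint's ID, otherwise the deny statement would block them.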

Refer to the documentation for the AWS Command Line Interface (AWS CLI) or SDK options for accessing buckets and S3 access points from S3 interface endpoints.
To learn how to access S3 buckets from on-premises networks in a guided tutorial, you can read this blog post from AWS Networking that shows you how to use AWS Direct Connect or VPN over private connectivity using AWS PrivateLink for Amazon S3.
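
To make the SDK option more concrete, here is a minimal Python sketch of pointing a client at an interface endpoint. It assumes the endpoint DNS name format shown in the AWS documentation (a wildcard name ending in `vpce.amazonaws.com`, with the `bucket` label used for bucket operations) and the boto3 SDK. The function names and the DNS name are hypothetical, and this is not Discover's actual ingestion code.

```python
def bucket_endpoint_url(vpce_dns_name):
    """Turn an S3 interface endpoint DNS name into the URL used for bucket operations.

    Endpoint DNS names are listed with a leading wildcard, for example
    "*.vpce-1a2b3c4d-5e6f.s3.us-east-1.vpce.amazonaws.com"; bucket requests
    replace the wildcard with the literal label "bucket".
    """
    host = vpce_dns_name.strip()
    if host.startswith("*."):
        host = host[2:]
    return f"https://bucket.{host}"


def make_s3_client(vpce_dns_name, region):
    """Create a boto3 S3 client that sends requests to the interface endpoint.

    boto3 is imported lazily so the URL helper works without it installed.
    """
    import boto3  # third-party dependency: pip install boto3

    return boto3.client(
        "s3",
        region_name=region,
        endpoint_url=bucket_endpoint_url(vpce_dns_name),
    )


# Placeholder DNS name in the format used by the AWS documentation.
url = bucket_endpoint_url("*.vpce-1a2b3c4d-5e6f.s3.us-east-1.vpce.amazonaws.com")
print(url)
```

An on-premises job could then call, for example, `make_s3_client(dns_name, "us-east-1").upload_file("local.csv", "example-ingest-bucket", "incoming/local.csv")`, with the request going over HTTPS to the endpoint's private IP addresses rather than through a proxy.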

Key takeaways

Discover utilizes both the gateway and interface VPC endpoints to cater to different use cases.
- Any Amazon S3 connection requests originating from the VPC will use the gateway endpoints.
- Connections established outside the VPC will use the interface VPC endpoints.
TLS-enabled (HTTPS) access to Amazon S3 from the on-premises data center.
Discover observed significant performance improvements:
- Reduced transfer time by approximately 12–15 minutes for a 50 GB file transfer.
Eliminating the Squid proxy infrastructure produced cost savings:
- Saved approximately 10k USD per month
Gained the ability to apply interface VPC endpoint policies for account-specific Amazon S3 bucket allow listing.
Guaranteed SLAs, with the option to select multiple Availability Zones for redundancy when creating the S3 interface VPC endpoints.
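
The bucket allow listing mentioned above could look roughly like the following Python sketch, which builds a VPC endpoint policy that permits requests only to named buckets. The bucket names, the chosen actions, and the helper name `build_endpoint_policy` are illustrative assumptions. The source does not show Discover's actual endpoint policy.

```python
import json


def build_endpoint_policy(allowed_buckets):
    """Return a VPC endpoint policy (as a dict) that only allows S3 requests
    targeting the listed buckets and their objects."""
    resources = []
    for name in allowed_buckets:
        resources.append(f"arn:aws:s3:::{name}")    # the bucket itself
        resources.append(f"arn:aws:s3:::{name}/*")  # objects in the bucket
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowListedBucketsOnly",
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": resources,
            }
        ],
    }


# Placeholder bucket names for illustration only.
endpoint_policy = build_endpoint_policy(
    ["example-ingest-bucket", "example-curated-bucket"]
)
print(json.dumps(endpoint_policy, indent=2))
```

Requests through the endpoint to any bucket not in the list would not match the allow statement and would therefore be refused at the endpoint, independently of the bucket's own policy.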

Conclusion

In this post, we explained why and how Discover used AWS PrivateLink for Amazon S3 to establish private connectivity between resources in our AWS account and on-premises workloads. This architecture ensures that data loading and data unloading from VPC or on-premises infrastructure uses the AWS internal network and does not take place over the public internet. After you set up these resources, you can further extend your use of VPC endpoints using gateway endpoints for Amazon S3 for workloads operating in your VPC. You can also extend your use of AWS PrivateLink with other services such as Amazon QuickSight or Amazon SageMaker. To use AWS PrivateLink for Amazon S3 with resources inside of your VPC, you can configure your SDK client to use the VPC endpoint URL that’s provided when the endpoint is created.

If you’re interested in learning more about using AWS PrivateLink for Amazon S3 with resources in your VPC, examples are provided in the S3 documentation for the AWS CLI and SDKs.

Thanks for reading about Discover’s cost and performance optimization journey with AWS!

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.