Integrating Malware Scanning into Your Data Ingestion Pipeline with Antivirus for Amazon S3
By Gokhul Srinivasan, Sr. Partner Solutions Architect, ISV Startups – AWS
By Ed Casmer, CTO – Cloud Storage Security
Amazon Simple Storage Service (Amazon S3) has become the storage platform of choice for many organizations’ data ingestion pipelines.
Per the AWS Shared Responsibility Model, Amazon Web Services (AWS) provides configuration and access controls for your storage and data lake environments. Meanwhile, the security of and access to the buckets, as well as the files themselves, are your responsibility regardless of whether the files were created by your organization or ingested from a third party.
You can enhance bucket-level security by making Amazon S3 buckets private through access control lists (ACLs) and bucket policies, and implement strict AWS Identity and Access Management (IAM) policies to limit data access.
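As one illustration of bucket-level hardening, the sketch below turns on S3 Block Public Access for a bucket through a boto3 S3 client. This is only one of several possible controls and does not replace ACL, bucket policy, or IAM reviews. The function name and the bucket name in the usage note are hypothetical.

```python
# The four settings that the S3 Block Public Access API defines.
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def block_public_access(s3, bucket: str) -> None:
    """Enable all four Block Public Access settings on one bucket.

    `s3` is expected to be a boto3 S3 client, for example boto3.client("s3").
    """
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration=PUBLIC_ACCESS_BLOCK,
    )
```

A call such as `block_public_access(boto3.client("s3"), "example-landing-zone")` would apply the settings to a bucket with that (made-up) name.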
When it comes to the data itself, applying encryption controls and scanning for malware are best practices. In fact, Amazon S3 has automatically applied server-side encryption to all new objects since January 2023. While malware detection is not provided natively in AWS, there is a way to easily scan your workloads using Antivirus for Amazon S3 by Cloud Storage Security (CSS).
Antivirus for Amazon S3 is a self-hosted malware scanning solution, installed in the customer’s AWS account so data doesn’t leave that account. Whether your organization ingests data from AWS Data Exchange, uses AWS Transfer Family, or implements its own custom process, Antivirus for Amazon S3 helps ensure data is free of viruses, ransomware, and trojan horses.
In this post, we discuss how to easily integrate malware scanning into your data ingestion pipeline using Antivirus for Amazon S3 by CSS.
Cloud Storage Security is an AWS Security Competency Partner that helps prevent the spread of malware and locates sensitive data for applications and data lakes that use AWS managed services. Antivirus for Amazon S3 is available in AWS Marketplace.
When it comes to data pipelines, malware scanning is best applied early in the process. Most organizations implement Antivirus for Amazon S3 in a landing zone layer that uses a separate S3 bucket to preprocess the data before it’s moved to the raw, stage, and analytics layers of their data lake. This way, they can ensure only clean files are used in their applications, especially when working with sensitive datasets that require masking.
CSS offers multiple scan models to secure your data ingestion pipeline. The event and retro scan models are the most widely used and are the ones discussed in this post. These scan models run on containers in Amazon ECS on AWS Fargate.
An AWS Lambda function is subscribed to and triggered by clean scan results; clean files are copied or moved from the landing zone bucket to the raw layer bucket. Infected or problem files can be quarantined, and notifications can be integrated into your SecOps process for incident response.
Optionally, a Lambda function can delete the file from the landing zone bucket once it has been copied to the desired bucket.
Event-driven scanning scans new or modified objects in near real-time when they are written to an S3 bucket. It can be implemented quickly into your data ingestion pipeline without impacting your other internal workflows.
While you have the flexibility to create and leverage a variety of document flows, most customers use a two-bucket system flow (Figure 1) using a landing zone bucket and raw layer bucket. Using this approach, customers integrate event-driven scanning with their landing zone layer S3 bucket to scan all objects before they’re moved to the raw layer S3 bucket.
In the two-bucket system flow, an event-based scan is triggered when files are written to the landing zone layer bucket from any source. A CSS scanning agent scans and tags the files as clean, infected, or problem.
Once the scan completes, real-time scan result notifications are sent to a provided Amazon Simple Notification Service (Amazon SNS) notification topic, and are also listed on the “Findings” report in the CSS console.
Figure 1 – Data ingestion pipeline: event-driven scanning.
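Because the scanning agent tags each object with its verdict, downstream code can check an object before using it. The following is a minimal sketch of that check with a boto3 S3 client. The tag key `scan-result` is an assumption made for illustration; confirm the actual key and value spelling in your own deployment.

```python
from typing import Optional

# Assumption: the verdict is stored under this object tag key.
VERDICT_TAG_KEY = "scan-result"


def verdict_from_tags(tag_set: list) -> Optional[str]:
    """Return the lower-cased verdict from an S3 TagSet, or None if the tag is absent."""
    for tag in tag_set:
        if tag["Key"] == VERDICT_TAG_KEY:
            return tag["Value"].lower()
    return None


def get_verdict(s3, bucket: str, key: str) -> Optional[str]:
    """Fetch an object's tags with a boto3 S3 client and extract the verdict."""
    response = s3.get_object_tagging(Bucket=bucket, Key=key)
    return verdict_from_tags(response["TagSet"])
```

A result of `None` means the tag was not found, which a caller would typically treat as "not yet scanned" rather than as clean.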
CSS has simplified the process to integrate this scanning model and set up the two-bucket flow by providing everything needed to implement this approach. All you need to do is follow the step-by-step instructions and set up the Copy Lambda function.
The steps for setting up the two-bucket system are as follows:
- Create the landing zone S3 bucket or utilize an existing one.
- Turn on bucket protection (enable scanning) for the landing zone bucket within the Antivirus for Amazon S3 management console.
- Create a Python (latest version) Lambda function with this sample Copy Lambda code.
- Make adjustments to Lambda settings.
- Subscribe Lambda to the SNS notifications topic.
- Add IAM permissions to Lambda.
- Modify the topic subscription to filter for clean objects.
- Test clean and “not clean” files to ensure behavior is as expected.
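The sample Copy Lambda code referenced in the steps above is the version to deploy. The sketch below only illustrates the general shape of such a function so the steps are easier to follow. It assumes the SNS message body is JSON with `bucketName` and `key` fields, and that the destination bucket is passed in a `DEST_BUCKET` environment variable; both are hypothetical names chosen for this example, so inspect a real notification before relying on them.

```python
import json
import os

# Hypothetical configuration: the raw layer bucket, and whether to delete
# the source object from the landing zone after a successful copy.
DEST_BUCKET = os.environ.get("DEST_BUCKET", "example-raw-layer")
DELETE_SOURCE = os.environ.get("DELETE_SOURCE", "false").lower() == "true"


def parse_record(record: dict) -> tuple:
    """Extract (bucket, key) from one SNS record delivered to Lambda.

    Assumption: the message body is JSON with "bucketName" and "key" fields.
    """
    message = json.loads(record["Sns"]["Message"])
    return message["bucketName"], message["key"]


def copy_clean_objects(event: dict, s3, dest_bucket: str, delete_source: bool) -> list:
    """Copy every object named in the event to dest_bucket; return the keys copied."""
    copied = []
    for record in event.get("Records", []):
        bucket, key = parse_record(record)
        s3.copy_object(
            Bucket=dest_bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        if delete_source:
            s3.delete_object(Bucket=bucket, Key=key)
        copied.append(key)
    return copied


def lambda_handler(event, context):
    import boto3  # provided by the Lambda Python runtime

    return copy_clean_objects(event, boto3.client("s3"), DEST_BUCKET, DELETE_SOURCE)
```

The function does not check the verdict itself; it relies on the subscription filter so that only clean results reach it. Note that `copy_object` handles objects up to 5 GB in a single call, so larger files would need a multipart copy such as boto3's managed `copy` method. The execution role needs read access on the landing zone bucket and write access on the destination, plus delete access on the landing zone if `DELETE_SOURCE` is enabled.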
Unless you’re just starting a full migration to a cloud data lake, you likely already have pre-existing objects stored in AWS. With CSS’ retro model, existing data is scanned to ensure it is safe to use.
Pre-existing objects are crawled via a temporary Fargate run task that adds entries to a temporary retro queue in Amazon Simple Queue Service (Amazon SQS) uniquely made for the job. Once crawling has completed for a job, a new set of Fargate run tasks is spun up to perform the scan. Entries identifying the objects to scan are pulled from the queue, and each object is retrieved and scanned.
Once the object is scanned, the process follows the same flow as the two-bucket system. Real-time scan result notifications are sent to a provided SNS notification topic and are listed on the “Findings” report in the CSS console.
Figure 2 – Data ingestion pipeline: retro scanning.
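To make the crawl-then-scan pattern concrete, here is a rough sketch of the two phases using boto3 S3 and SQS clients. It is not CSS's implementation: the message layout (`bucket` and `key` fields) and the `scan` callback are hypothetical, and a real worker would also handle retries and visibility timeouts.

```python
import json


def crawl_bucket(s3, sqs, bucket: str, queue_url: str) -> int:
    """Phase 1: enqueue one entry per existing object; return the number enqueued."""
    count = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            body = json.dumps({"bucket": bucket, "key": obj["Key"]})
            sqs.send_message(QueueUrl=queue_url, MessageBody=body)
            count += 1
    return count


def drain_queue(sqs, queue_url: str, scan) -> int:
    """Phase 2: pull entries until the queue is empty, calling scan(bucket, key) on each."""
    scanned = 0
    while True:
        response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
        messages = response.get("Messages", [])
        if not messages:
            return scanned
        for message in messages:
            entry = json.loads(message["Body"])
            scan(entry["bucket"], entry["key"])
            # Delete only after a successful scan so failed entries can reappear.
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
            scanned += 1
```

Separating the crawl from the scan means several workers can run `drain_queue` against the same queue in parallel, which is what lets the scan phase scale independently of how long the listing takes.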
This baseline scan can run on-demand or on a scheduled basis of your choosing. Many customers perform a scheduled scan of all pre-existing objects at least once per quarter to check for infections against the latest virus definitions.
Figure 3 – Bucket protection dashboard in CSS console.
Currently, Antivirus for Amazon S3 offers ClamAV and Sophos scanning engines. ClamAV is a well-known and widely used open-source solution. The Sophos engine, accessible through CSS’ original equipment manufacturer (OEM) partnership with Sophos, provides fast and powerful scanning, including support for extra-large files (up to 5 TB in size, the maximum allowed by S3).
Figure 4 – Comparison of ClamAV and Sophos scanning engines.
Multiple engines can be used at the same time to improve detection rates. When deciding which engine to use, it’s important to consider cost (including any additional fees for a premium engine), file size and volume, and performance. In large environments, the additional scan cost could be offset by performance gains and the need to run less infrastructure.
Throughput and Agent Scaling
File size and the number of files you expect to scan impact how you scale your deployment to fit your needs. Calculating throughput helps determine the appropriate scanning engine option and agent scaling configuration to keep up with your inflow of files.
Throughput is calculated based on file size and engine used (Figure 5). For example, with a single scanning agent, ClamAV scans ~6,500 1 MB files per hour whereas Sophos scans ~20,000 1 MB files per hour.
Figure 5 – Throughput comparison: ClamAV vs. Sophos.
By adding more scanning agents, your job completes faster. Agent scaling functionality is configured within the CSS console by setting a scaling threshold for the number of items that should come in before another agent is spun up.
Additionally, you’ll need to set the minimum and maximum number of agents that should run. Many customers who ingest high volumes of data in sustained bursts choose to scale horizontally with additional lightweight AWS Fargate tasks (containers) to scan the data within their required time frame.
Once you assess the data in your environment, you can apply those figures to the throughput table above to establish a baseline for the amount of work the scanning agents can achieve. From there, you can determine the number of agents needed.
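As a rough worked example of that calculation, the sketch below uses the single-agent figures quoted earlier for 1 MB files (about 6,500 per hour for ClamAV and about 20,000 per hour for Sophos). It assumes throughput scales linearly with the number of agents, which is a simplification; other file sizes would need their own figures from the throughput table.

```python
import math

# Approximate single-agent throughput for 1 MB files, as quoted in this post.
FILES_PER_HOUR_PER_AGENT = {"clamav": 6_500, "sophos": 20_000}


def agents_needed(files_per_hour: int, engine: str) -> int:
    """Estimate how many agents keep pace with a steady inflow of 1 MB files."""
    per_agent = FILES_PER_HOUR_PER_AGENT[engine]
    return max(1, math.ceil(files_per_hour / per_agent))


def hours_to_scan(total_files: int, engine: str, agents: int) -> float:
    """Estimate how long a one-off scan takes, assuming linear scaling across agents."""
    return total_files / (FILES_PER_HOUR_PER_AGENT[engine] * agents)


print(agents_needed(50_000, "clamav"))         # 8
print(agents_needed(50_000, "sophos"))         # 3
print(hours_to_scan(1_000_000, "sophos", 5))   # 10.0
```

Under these assumptions, an inflow of 50,000 1 MB files per hour would call for roughly eight ClamAV agents or three Sophos agents, and a one-time scan of one million such files would take about ten hours on five Sophos agents.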
The Sizing Discussion section in the CSS help docs contains examples and provides guidelines to determine your scaling threshold and number of agents.
Figure 6 – Agent scaling configuration within CSS console.
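To build intuition for how a scaling threshold interacts with the minimum and maximum agent counts, the sketch below models a simple proportional rule: one agent per `threshold` queued items, clamped to the configured range. This rule is an assumption made for illustration, not the product's actual scaling algorithm, and the function name is hypothetical.

```python
import math


def desired_agents(queue_depth: int, threshold: int, min_agents: int, max_agents: int) -> int:
    """Return an agent count proportional to the backlog, clamped to [min_agents, max_agents]."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if min_agents > max_agents:
        raise ValueError("min_agents must not exceed max_agents")
    wanted = math.ceil(queue_depth / threshold)
    return max(min_agents, min(max_agents, wanted))


print(desired_agents(0, 1_000, 1, 10))       # 1
print(desired_agents(4_500, 1_000, 1, 10))   # 5
print(desired_agents(50_000, 1_000, 1, 10))  # 10
```

A lower threshold reacts to bursts sooner at the cost of running more tasks, while the maximum caps spend during a very large backlog.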
Extra-Large File Scanning
Antivirus for Amazon S3 can scan files as large as 5 TB using the event-driven or retro scan models with the Sophos engine.
When a scanning agent picks up a file that’s too large to scan (too large based on the disk size assigned under the Agent Settings or API Agent Settings) and extra-large file scanning is enabled, an extra-large file scan job is automatically spun up in a temporary Amazon Elastic Compute Cloud (Amazon EC2) instance.
Once the job completes, the verdict is provided and the EC2 instance is terminated. Since extra-large file scanning happens in a separate EC2 instance, the event-based agents that scan the other files in the queue remain unaffected.
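The routing decision described above can be sketched as a size check against the agent's disk. The function and return values below are hypothetical and chosen for illustration; in particular, the `"not-scanned"` outcome is a placeholder, since this post does not describe what happens to an oversized file when extra-large file scanning is disabled.

```python
GIB = 1024 ** 3


def object_size(s3, bucket: str, key: str) -> int:
    """Return an object's size in bytes using a boto3 S3 client."""
    return s3.head_object(Bucket=bucket, Key=key)["ContentLength"]


def choose_scan_path(object_bytes: int, agent_disk_gib: int, extra_large_enabled: bool) -> str:
    """Decide where an object would be scanned, based on the agent's disk size."""
    if object_bytes <= agent_disk_gib * GIB:
        return "agent"            # fits on the regular scanning agent
    if extra_large_enabled:
        return "extra-large-job"  # handed off to a temporary EC2 instance
    return "not-scanned"          # placeholder outcome, see the note above


print(choose_scan_path(10 * GIB, 20, True))   # agent
print(choose_scan_path(50 * GIB, 20, True))   # extra-large-job
```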
Antivirus for Amazon S3 Deployment
Antivirus for Amazon S3 is available in AWS Marketplace with a 30-day free trial to deploy and test out the application’s functionality. To get started, take the following steps:
Step 1: Subscribe via AWS Marketplace
Click on “Continue to Subscribe,” and after completing the simple click-through process within your AWS account, you can deploy the solution.
Step 2: Deploy Using an AWS CloudFormation Template
Deployment takes minutes and is accomplished using an AWS CloudFormation template that installs all of the necessary AWS infrastructure and software components, as well as all required IAM permissions and roles.
Review steps to set up the CloudFormation template in the How to Deploy section of the CSS help docs. There are many configuration aspects to specify, but they are optional and depend on your deployment needs.
For more information, you can review the Deployment Details section of the CSS help docs.
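If you prefer to script the deployment rather than use the console, a CloudFormation stack can also be created with boto3. The sketch below is generic: the stack name, template URL, and parameter names are placeholders that you would take from the CSS help docs, and the `CAPABILITY_NAMED_IAM` acknowledgement is an assumption based on the template creating IAM roles (your template may require different capabilities).

```python
def deploy_stack(cfn, stack_name: str, template_url: str, parameters: dict) -> str:
    """Create a CloudFormation stack, wait for it to finish, and return its stack ID.

    `cfn` is expected to be a boto3 CloudFormation client,
    for example boto3.client("cloudformation").
    """
    response = cfn.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=[
            {"ParameterKey": name, "ParameterValue": value}
            for name, value in parameters.items()
        ],
        # Assumption: the template creates IAM roles, so this must be acknowledged.
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    # Block until CloudFormation reports that stack creation has completed.
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return response["StackId"]
```

Passing an empty `parameters` dictionary accepts the template defaults, which matches the note above that most configuration values are optional.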
Step 3: Launch and Enable Bucket Protection
Once deployment is complete, you will receive an email invite with login credentials to access your console.
Figure 7 – Sample email with login credentials.
Upon accessing the console, connect additional AWS accounts (if any), review the discovered buckets, activate bucket protection, and enable either or both of the scan models introduced in this post. From there, you can review results and manage problem files in the “Findings” section on the main menu in the console.
Figure 8 – Scan results presented in the Findings report in the CSS console.
By integrating Antivirus for Amazon S3 by Cloud Storage Security (CSS) into the landing zone layer of your data ingestion pipeline, you can ensure the data is safe for processing, transformation, and use before it moves to the raw, stage, and analytics layers of your data lake.
Cloud Storage Security – AWS Partner Spotlight
Cloud Storage Security (CSS) is an AWS Partner that prevents the spread of malware, locates sensitive data, and assesses storage environments for applications and data lakes that use AWS storage services. Their solutions are used worldwide by agencies and enterprises of all sizes because they fit into any workflow and data never leaves the customer’s account.