Running a Serverless Malware Scanner at Scale Using AWS Lambda and AWS Step Functions
By Michael Clifford, Sr. Consultant – Contino
By Mark Faiers, AWS Practice Lead – Contino
By Marwan Tarek, Manager, Solution Architecture – AWS
Contino recognizes the importance of security, but also that security teams can be seen as gatekeepers who slow down the process of building software and systems.
In this post, we’ll introduce a solution Contino built for a client to automate security scanning on software packages. This allowed developers to continue building without added friction, and gave security teams confidence that vulnerabilities were discovered early and didn’t make their way to production.
We’ll cover the solution, the context Contino was working with and the approach it took, and the technical challenges faced. After reading this post, you’ll understand how to develop a similar solution for your organization and some pitfalls to avoid.
This solution was developed to replace an existing tool for scanning software packages and datasets. That tool had several limitations.
First, it used an Amazon Elastic Compute Cloud (Amazon EC2) instance for the compute processing. This made the instance a single point of failure, and downtime was required for operating system (OS) patching and maintenance.
It was also a relatively expensive solution. The EC2 instance was running 24/7 whether the package scanning process was being used or not.
Debugging the EC2 virus scanning process was also difficult. Though the logs for the service were forwarded to Amazon CloudWatch, they were mixed with the logs for the EC2 instance itself, which made issues hard to diagnose. The infrastructure around the EC2 instance also had to be maintained continuously, along with the required health checks.
Contino’s client is very security-minded, so the solution had to run in an air-gapped environment with no public internet access. This meant that any new virus definitions either had to be ported through to the EC2 instance or installed during the EC2 build stages before deployment. Virus scanners would often refuse to run scans unless their virus definitions were kept up to date weekly.
The previous solution also raised several security concerns. The first was that anyone with the correct details could log in to the EC2 instance and change the executable code, or disable the service by accident.
The second issue was that most virus scanner applications scan only the first 4GB of a file, so a virus could be hiding in the fifth GB of a large file. This limit applies regardless of file type, and whether the file is compressed or not.
Although the original solution was event-based, it did not work at scale. There were often long backlogs of files and packages to process, and any outage caused a backlog that would take several hours to clear.
Figure 1 – Original virus scanning solution.
New items to be processed are placed in the dirty Amazon Simple Storage Service (Amazon S3) bucket, and then S3-created events are sent to an Amazon Simple Queue Service (Amazon SQS) queue which starts the virus scanning process.
If the file passes the virus scan, it is moved to the clean S3 bucket. If the file fails the virus scan, it’s moved to the Quarantine S3 bucket and a notification is triggered via AWS Lambda.
All logs are sent to Amazon CloudWatch and Splunk to track any errors or virus positives.
Contino’s solution is based on an event-driven architecture combined with the serverless compute functionality provided by AWS. It uses AWS Lambda for compute and processing, and AWS Step Functions to orchestrate the individual components.
The primary driver behind the architectural change was to increase the reliability of the service. To help bring about the improvements in reliability and scalability, Contino followed several principles.
The solution was broken down into smaller individual components. This meant Contino could develop and test a part of the solution in isolation. Backing this development work with test-driven development (TDD) meant Contino could have confidence in each component and easily redevelop in the future if required.
To gain confidence in the solution’s viability, Contino defined a minimum viable product (MVP) where it could process a file from end to end.
The architecture team decided to use container images in Lambda. One benefit is that they allow large containers to be deployed for the functions that must cope with the size of the files to be processed.
In addition, by using containers Contino could include dependencies at build time, which is critical when working within an air-gapped environment. It also provided the ability to deploy separate containers for the different virus scanners that needed to be run.
Figure 2 – New serverless event-driven architecture.
Due to the sequential, process-oriented nature of the virus scanning solution, an orchestration service was an obvious choice to model and manage it.
AWS Step Functions fulfils the role of orchestration service and also has native integration with many AWS services, including Lambda in this case. AWS Step Functions also helps to manage state, which is an important consideration in this solution.
During the initial development work, Contino decided that Amazon Elastic File System (Amazon EFS) should hold the target file during the scanning process. This decision came from testing against Amazon S3: the findings showed that processing was faster when the target file was accessed from Amazon EFS.
The file analyzer function is triggered by an S3 event and runs against the target file. It inspects the file to determine its size and type. A decision stage within Step Functions determines the next steps based on the information obtained.
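As a rough illustration of what a file analyzer might report, the following Python sketch reads a file's size and guesses a coarse type from its leading bytes. The function name `analyze_file`, the returned field names, and the short signature table are assumptions made for this example; the post does not describe how Contino's analyzer identifies file types.

```python
import os

# Magic-byte signatures for a few compressed formats.
# Illustrative only: a real analyzer would recognize many more types.
COMPRESSED_SIGNATURES = {
    b"\x1f\x8b": "gzip",
    b"PK\x03\x04": "zip",
    b"BZh": "bzip2",
}


def analyze_file(path):
    """Return the size and a coarse type for the file at `path`.

    Hypothetical helper: the field names below are not from the original post.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        header = handle.read(8)  # enough bytes for the signatures above

    file_type = "other"
    for signature, name in COMPRESSED_SIGNATURES.items():
        if header.startswith(signature):
            file_type = name
            break

    return {
        "path": path,
        "size_bytes": size,
        "file_type": file_type,
        "is_compressed": file_type != "other",
        "is_empty": size == 0,
    }
```

A dictionary like this could be returned from the Lambda function so that the decision stage in Step Functions can branch on its fields.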
If the file type is not supported, the file is encrypted, uploaded to a quarantine bucket, and a Slack notification is sent to inform a channel of the quarantined file.
If the file type is supported, then Step Functions decides based on additional factors, which are:
- Is the file size larger than 4GB?
- Is the file compressed?
- Is the file empty?
- Is the file larger than 4GB but unable to be split?
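The questions above can be sketched as a single routing function. This is a minimal Python model of the decision, not the Step Functions definition Contino used. The step names, the input field names, the order in which the checks are applied, and the reading of 4GB as 4 × 1024³ bytes are all assumptions.

```python
FOUR_GB = 4 * 1024 ** 3  # assumption: "4GB" is treated as 4 GiB here


def choose_next_step(info):
    """Pick the next step for a file from the analyzer's findings.

    `info` is a hypothetical dict with the keys: supported, is_empty,
    is_compressed, size_bytes, can_split.
    """
    if not info["supported"]:
        return "quarantine"        # unsupported types are never scanned
    if info["is_empty"]:
        return "upload_to_clean"   # nothing to scan
    if info["is_compressed"]:
        return "decompress"        # scanners need the extracted contents
    if info["size_bytes"] > FOUR_GB and info["can_split"]:
        return "split"             # bring each part under the scanner limit
    return "scan"                  # scannable, or too large but unsplittable
```

For example, a supported, uncompressed 5 GiB file that can be split is routed to `"split"`, while the same file with `can_split` set to `False` goes straight to `"scan"`.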
If the file is over 4GB in size, a Lambda function runs to split the file. Each split file’s metadata is updated with the Parent ID for future processing. The split files are then removed from Amazon EFS and uploaded back to the dirty S3 bucket to start the process again. If the split files are still over 4GB, they are split again until the artefacts are small enough to be scanned effectively.
If the file is compressed, the virus scanners cannot scan it effectively. In this case, the decompress function runs to extract the compressed files. The extracted files follow a similar process to the large files: their metadata is updated with a Parent ID, and they are uploaded to the dirty bucket to be processed.
If the file is scannable, or is too large and cannot be split, it is passed to the two scanning functions to be processed. Once they complete, Step Functions moves to the next decision stage, where it reads the pass/fail results of the scans. If both pass, the file is passed to the “Upload to Clean bucket” function, which uploads it to S3. If the file is empty, it passes straight through and is uploaded to the clean bucket.
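To make the splitting step concrete, here is a minimal Python sketch that cuts a file into parts of at most `max_part_bytes` and tags each part with a shared Parent ID. It is a plain byte-offset split, which is an assumption: the post does not describe how Contino's split function chooses its boundaries. The function name, part naming scheme, and metadata keys are also hypothetical.

```python
import os
import uuid


def split_file(path, out_dir, max_part_bytes, parent_id=None,
               buffer_bytes=1024 * 1024):
    """Split `path` into parts no larger than `max_part_bytes`.

    Returns one metadata dict per part, each carrying the same parent_id
    so the parts can be traced back to the original file.
    """
    parent_id = parent_id or str(uuid.uuid4())
    parts = []
    index = 0
    with open(path, "rb") as source:
        while True:
            part_name = f"{os.path.basename(path)}.part{index:04d}"
            part_path = os.path.join(out_dir, part_name)
            written = 0
            with open(part_path, "wb") as target:
                # Copy in small buffers so memory use stays bounded.
                while written < max_part_bytes:
                    chunk = source.read(min(buffer_bytes, max_part_bytes - written))
                    if not chunk:
                        break
                    target.write(chunk)
                    written += len(chunk)
            if written == 0:
                os.remove(part_path)  # source exhausted; drop the empty part
                break
            parts.append({
                "path": part_path,
                "parent_id": parent_id,
                "part_index": index,
                "size_bytes": written,
            })
            index += 1
    return parts
```

Copying through a small buffer rather than reading a whole part at once keeps memory use well below the part size, which matters when the parts are close to 4GB.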
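A decompress step along these lines could look like the following Python sketch, which extracts a zip archive and records a Parent ID for each extracted file. Only the zip format is handled here, and the function name, metadata keys, and path-safety check are assumptions for illustration rather than details from the original solution.

```python
import os
import shutil
import uuid
import zipfile


def extract_archive(archive_path, out_dir, parent_id=None):
    """Extract a zip archive into `out_dir` and tag each file with parent_id."""
    parent_id = parent_id or str(uuid.uuid4())
    root = os.path.realpath(out_dir)
    children = []
    with zipfile.ZipFile(archive_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue
            target = os.path.realpath(os.path.join(root, member.filename))
            # Refuse entries that would land outside out_dir (e.g. "../x").
            if os.path.commonpath([root, target]) != root:
                raise ValueError(f"unsafe path in archive: {member.filename}")
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # Stream each member to disk instead of loading it into memory.
            with archive.open(member) as source, open(target, "wb") as dest:
                shutil.copyfileobj(source, dest)
            children.append({"path": target, "parent_id": parent_id})
    return children
```

Each returned entry could then be uploaded to the dirty bucket with the Parent ID attached as object metadata, so that nested archives are unpacked one level per pass.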
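The scan-and-route stage can be sketched as an Amazon States Language definition, written here as a Python dictionary. This is an illustration under several assumptions: the state names, the placeholder Lambda ARNs, the `passed` field returned by each scanner, running the two scanners in a Parallel state, and sending failed files to a quarantine step are all hypothetical and not taken from Contino's definition.

```python
import json


def lambda_arn(name):
    # Placeholder ARN; REGION and ACCOUNT_ID must be replaced with real values.
    return f"arn:aws:lambda:REGION:ACCOUNT_ID:function:{name}"


def scanner_branch(state_name, function_name):
    # One Parallel branch that invokes a single scanner function.
    return {
        "StartAt": state_name,
        "States": {
            state_name: {
                "Type": "Task",
                "Resource": lambda_arn(function_name),
                "End": True,
            }
        },
    }


SCAN_AND_ROUTE = {
    "StartAt": "RunScanners",
    "States": {
        "RunScanners": {
            "Type": "Parallel",
            "Branches": [
                scanner_branch("ScannerA", "scanner-a"),
                scanner_branch("ScannerB", "scanner-b"),
            ],
            # Branch outputs are collected into an array under $.scans.
            "ResultPath": "$.scans",
            "Next": "CheckResults",
        },
        "CheckResults": {
            "Type": "Choice",
            "Choices": [
                {
                    # Both scanners must report a pass.
                    "And": [
                        {"Variable": "$.scans[0].passed", "BooleanEquals": True},
                        {"Variable": "$.scans[1].passed", "BooleanEquals": True},
                    ],
                    "Next": "UploadToClean",
                }
            ],
            "Default": "Quarantine",
        },
        "UploadToClean": {
            "Type": "Task",
            "Resource": lambda_arn("upload-to-clean"),
            "End": True,
        },
        "Quarantine": {
            "Type": "Task",
            "Resource": lambda_arn("quarantine"),
            "End": True,
        },
    },
}

definition_json = json.dumps(SCAN_AND_ROUTE, indent=2)
```

Keeping the pass/fail decision in a Choice state, rather than inside a Lambda function, means the routing is visible in the state machine itself and the functions only need to report their own result.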
Figure 3 – Scanning process diagram.
Due to the nature of the environment, dependencies had to be included at build time, rather than runtime. While this may have added development time in the initial phases, it reduced the overall scan time because the process didn’t need to retrieve any dependencies at runtime.
Moving to a pay-as-you-use approach reduced costs, because there was no longer unused infrastructure in place accumulating cost.
By moving to a more modular approach alongside a TDD methodology, Contino could quickly iterate on a single component in isolation and be confident that any future development work could be integrated without breaking the solution.
AWS Step Functions provides the ability to offload the orchestration and the business logic away from the Lambda functions. This means the functions can be smaller, which helps to reduce their overall runtime and cost.
In the end, Contino’s security scanning solution alleviated many of the pain points of the client’s previous solution. By using a completely serverless architecture, the end result was more scalable and also cheaper, both in terms of AWS resources and the cost to maintain.
Contino found the solution was more maintainable because of its modular nature and the fact that AWS Step Functions is somewhat self-documenting, as you can easily get a visual representation of the flow in the AWS Management Console.
If you’d like to find out more about the solution, or how Contino can help your organization build cloud solutions, visit contino.io/contact-us.
Contino – AWS Partner Spotlight
Contino is an AWS Premier Tier Services Partner and global enterprise DevOps and cloud transformation consultancy that works with many security-conscious organizations.