AWS Storage Blog
Analytical processing of millions of cell images using Amazon EFS and Amazon S3
Analytical workloads such as batch processing, high performance computing, or machine learning inference often have high IOPS and low latency requirements, yet operate only at irregular intervals on subsets of large datasets. Typically, data is manually copied between storage tiers in preparation for processing, which can be cumbersome and error-prone. IT teams therefore want to store or archive data cost-effectively while keeping it easily accessible for fast processing on file systems. They want an automated way to temporarily “push” object data to a transient file system, that is, a file system on which data is stored only for the duration of processing, and to propagate data updates back to the permanent object store once processing completes.
This push mechanism can be implemented with a serverless workflow that manages copy operations between Amazon Simple Storage Service (Amazon S3) as the permanent data store and Amazon Elastic File System (Amazon EFS) as the file system for the duration of data processing.
In this blog, we discuss key considerations for building such a transient file system and provide a sample implementation. We demonstrate the solution with a reference implementation of a Cell Painting workflow used to process terabytes of cell microscopy images. Transient file systems combine the scale and price advantage of object storage with the higher IOPS and lower latency of a shared file system, creating a cost-efficient and easy-to-maintain data solution for your batch analytics use case.
Accelerating drug discovery and development with cell imaging at scale
We adopted Amazon EFS for a Cell Painting workflow with the research and development team of a global healthcare organization. Cell Painting is an image-based profiling technique used to generate morphological profiles for chemical and genetic perturbations of cells (for more details, consult this article). The Cell Painting process involves highlighting multiple organelles within a cell using fluorescent markers (hence the term “painting”) and collecting hundreds of cell images for each perturbation.
In our case, 1.6 million images were generated per week using high-throughput automated microscopy (3 TB of image data per week). Every set of new plates coming from the laboratory was first uploaded to Amazon S3, then transferred to a shared file system on Amazon EFS for analytic processing. For each image, a series of steps was performed using CellProfiler on Amazon Elastic Container Service (Amazon ECS). After processing, these images were copied back to Amazon S3 as the permanent data store. With Amazon EFS, we were able to extract and analyze approximately 250 million feature data points per week, with thousands of IOPS and low latency, while making use of Amazon S3 for queries and archiving (see Figure 1).
Figure 1: Overview of Cell Painting on AWS
Design pattern
The common requirements for a transient file system are as follows:
- Consistently copy all data from specified Amazon S3 prefixes, including all sub-prefixes, to a file system on Amazon EFS, with error handling, because an incomplete dataset would distort the processing outcome.
- Clean up the file system once processing and the copy back to Amazon S3 have completed.
The orchestration of data transitions between Amazon S3 and Amazon EFS can be offloaded to a state machine implemented with AWS Step Functions:
- AWS Step Functions offers a serverless workflow that triggers AWS Lambda functions for copy actions as well as cleanup or rollback, with built-in runtime error handling. Note that embedding the AWS CLI into Lambda would be subject to the default limit of 10 maximum concurrent threads, which could constrain copy performance.
- AWS Step Functions offers native support for over 200 AWS services, making it easy to integrate with the particular application running on AWS compute or container services such as Amazon EC2, AWS Batch, Amazon Elastic Container Service (Amazon ECS), or AWS Lambda.
We use Lambda functions to crawl a user-provided Amazon S3 bucket or prefix and create a temporary index of all objects using the ListObjectsV2 API of Amazon S3.
- The ListObjectsV2 API is suitable for programmatically creating on-demand indices of large numbers of object keys, whereas Amazon S3 Inventory provides a scheduled alternative (daily or weekly).
- Working from this index makes copy operations idempotent and enables parallel processing and graceful failure handling.
- It also allows programmatic verification by comparing the target index to the source index and reporting on completion, as shown in the sketch after this list.
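As an illustration, the following is a minimal sketch of the indexing and verification steps in Python with boto3 (the verify_copy helper name is our illustrative assumption; only the paginated ListObjectsV2 call and the set comparison reflect the mechanism described above):

```python
import boto3

s3 = boto3.client("s3")


def get_s3_keys(bucket, prefix=""):
    """Build an on-demand index of all object keys under bucket/prefix."""
    keys = set()
    # The paginator transparently follows continuation tokens past the
    # 1,000-key limit of a single ListObjectsV2 response.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    return keys


def verify_copy(source_keys, target_keys):
    """Compare the target index to the source index; return keys still missing."""
    return source_keys - target_keys
```

Because the index is a plain set of keys, a retried copy only needs to transfer the keys reported as missing, which is what makes the operation idempotent and safe to run in parallel.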
The general design pattern is depicted in the following diagram:
Figure 2: Serverless workflow for transient file system
Implementation considerations
The definition of our Step Functions state machine is as follows:
- For any specified Amazon S3 bucket or prefix and target Amazon EFS directory, trigger copy, reverse copy, and cleanup activities
- Because copy operations are idempotent, we incorporate a parallel state and a Lambda retrier with MaxAttempts (default: 3; in our case: 5; maximum: 99999999), the maximum number of attempts before retries cease and normal error handling resumes.
- Once the actual processing completes, Step Functions carries out the EFS-to-S3 copy operation and cleans up the EFS directory, or fails in case of a copy error (Figure 3).
Figure 3: AWS Step Functions definition for transient file system
The Step Functions definition example code is as follows:
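A minimal sketch in Amazon States Language, consistent with the description above (state names, Lambda function ARNs, and payload fields are illustrative assumptions; the parallel copy branches are omitted for brevity):

```json
{
  "Comment": "Transient file system: copy S3 to EFS, process, copy back, clean up",
  "StartAt": "CopyS3toEFS",
  "States": {
    "CopyS3toEFS": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:file-transfer",
      "Parameters": {
        "bucket.$": "$.bucket",
        "prefix.$": "$.prefix",
        "efs_dir.$": "$.efs_dir",
        "transfer_mode": 0
      },
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 10,
          "BackoffRate": 2,
          "MaxAttempts": 5
        }
      ],
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "CopyFailed" }],
      "ResultPath": null,
      "Next": "ProcessData"
    },
    "ProcessData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:run-processing",
      "ResultPath": null,
      "Next": "CopyEFStoS3"
    },
    "CopyEFStoS3": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:file-transfer",
      "Parameters": {
        "bucket.$": "$.bucket",
        "prefix.$": "$.prefix",
        "efs_dir.$": "$.efs_dir",
        "transfer_mode": 1
      },
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 10,
          "BackoffRate": 2,
          "MaxAttempts": 5
        }
      ],
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "CopyFailed" }],
      "ResultPath": null,
      "Next": "CleanUpEFS"
    },
    "CleanUpEFS": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:efs-cleanup",
      "End": true
    },
    "CopyFailed": {
      "Type": "Fail",
      "Cause": "Copy between Amazon S3 and Amazon EFS failed"
    }
  }
}
```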
We define a Lambda function using boto3 to manage copy operations between Amazon S3 and Amazon EFS (see the sample code after the following list):
- To create an inventory of the specified S3 bucket or prefix, we define the get_s3_keys function to recursively retrieve all object keys. If no prefix is specified, the entire bucket is scanned. We use the ListObjectsV2 boto3 call to return the objects in the designated bucket.
- We define the get_efs_keys function to retrieve all keys of Amazon EFS files and return them as a set of strings.
- The file_transfer function is responsible for copy operations based on the index. Copy is supported in both directions with the transfer_mode argument (0 = Amazon S3 to EFS; 1 = Amazon EFS to S3). The function catches errors on specific files (from Amazon S3 or EFS) without stopping processing.
- Pool connections are kept equal to the thread count so that the S3 client can handle the concurrent requests. The function creates the EFS directory (if it does not exist) and appends a download job to the list of threads (when copying S3 objects to the EFS directory) or an upload job for the reverse copy of EFS files to Amazon S3.
- Because Lambda is serverless and abstracts CPU resource control, the function is designed with multi-threading rather than CPU-bound multi-processing.
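A minimal sketch of such a function, assuming an event payload with bucket, prefix, efs_dir, and transfer_mode fields (the handler wiring, the private helpers, and the thread-count constant are illustrative; only get_s3_keys, get_efs_keys, file_transfer, and transfer_mode come from the description above):

```python
import concurrent.futures
import os

import boto3
from botocore.config import Config

# Keep the connection pool equal to the thread count so the S3 client
# can serve all concurrent requests (illustrative value).
MAX_THREADS = 10
s3 = boto3.client("s3", config=Config(max_pool_connections=MAX_THREADS))


def get_s3_keys(bucket, prefix=""):
    """Index all object keys under bucket/prefix (entire bucket if no prefix)."""
    paginator = s3.get_paginator("list_objects_v2")
    return {
        obj["Key"]
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
        for obj in page.get("Contents", [])
    }


def get_efs_keys(efs_dir):
    """Return all EFS file paths relative to efs_dir, mirroring S3 keys."""
    return {
        os.path.relpath(os.path.join(root, name), efs_dir)
        for root, _, files in os.walk(efs_dir)
        for name in files
    }


def _download(bucket, key, efs_dir):
    path = os.path.join(efs_dir, key)
    os.makedirs(os.path.dirname(path), exist_ok=True)  # create EFS dirs if missing
    s3.download_file(bucket, key, path)


def _upload(bucket, key, efs_dir):
    s3.upload_file(os.path.join(efs_dir, key), bucket, key)


def file_transfer(bucket, efs_dir, keys, transfer_mode=0):
    """Copy files based on the index; 0 = S3 to EFS, 1 = EFS to S3.
    Errors on individual files are collected without stopping the run."""
    job = _download if transfer_mode == 0 else _upload
    failed = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
        futures = {pool.submit(job, bucket, key, efs_dir): key for key in keys}
        for future in concurrent.futures.as_completed(futures):
            try:
                future.result()
            except Exception as exc:  # record the per-file error, keep going
                failed.append((futures[future], str(exc)))
    return failed


def lambda_handler(event, context):
    bucket = event["bucket"]
    efs_dir = event["efs_dir"]
    mode = event.get("transfer_mode", 0)
    keys = (
        get_s3_keys(bucket, event.get("prefix", ""))
        if mode == 0
        else get_efs_keys(efs_dir)
    )
    failed = file_transfer(bucket, efs_dir, keys, mode)
    if failed:
        raise RuntimeError(f"{len(failed)} file(s) failed to copy")
    return {"copied": len(keys)}
```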
After the data is copied back to Amazon S3 from Amazon EFS, Step Functions triggers a separate AWS Lambda function to clean up EFS (see the code example after the following list).
- File deletion is performed using the shutil module.
- If the EFS directory is multi-level, the function resolves the path to its root to ensure that all subdirectories are removed.
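A minimal sketch of such a cleanup function, assuming the same illustrative event payload as above:

```python
import os
import shutil


def lambda_handler(event, context):
    """Remove the transient EFS directory tree after the copy back to Amazon S3."""
    efs_dir = event["efs_dir"]  # illustrative event field
    # Normalize the (possibly multi-level) path, then remove the whole
    # tree; shutil.rmtree deletes all subdirectories beneath it as well.
    root = os.path.normpath(efs_dir)
    shutil.rmtree(root)
    return {"removed": root}
```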
Several limitations of the solution should be acknowledged:
- The S3 index is generated once at workflow initiation, which assumes no changes to the S3 bucket once copy operations start. S3 objects added after the initial index is created will not be processed.
- Index comparison is based on file names only, validating that each file has been copied between Amazon S3 and Amazon EFS. Future adaptations may compare file sizes to detect parallel changes on the source (see the sketch after this list).
- Memory allocation for the Lambda functions must be sized according to the size of the S3 objects being transferred.
- Error handling in AWS Step Functions can be extended to cover runtime errors beyond Lambda timeouts.
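Such a size-based check for the adaptation mentioned above might look like the following (a sketch; detect_size_mismatch is an illustrative helper, not part of the solution):

```python
import os

import boto3

s3 = boto3.client("s3")


def detect_size_mismatch(bucket, key, efs_dir):
    """Flag files whose S3 and EFS sizes differ, hinting at a parallel change."""
    s3_size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    efs_size = os.path.getsize(os.path.join(efs_dir, key))
    return s3_size != efs_size
```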
Cleaning up
Remember to delete example resources if you no longer need them, to avoid incurring future costs. This includes the Amazon S3 bucket, Amazon EFS directory, AWS Lambda functions, and AWS Step Functions workflow.
Conclusion
In this blog post, we demonstrated how you can run analytic batch processing on millions of files faster using a combination of Amazon S3 and Amazon EFS. We shared sample code that uses AWS Step Functions and Lambda to orchestrate the state transitions.
The solution is particularly suitable for batch processing tasks on large datasets where a file system with high IOPS and low latency is only temporarily needed. It scales to large datasets and keeps Amazon S3 as your permanent data store, so you only incur file system costs during data processing. Being serverless, the solution is cost-efficient and easy to maintain.
Thank you for reading this blog post! We look forward to reading your feedback and questions in the comments section.