AWS Storage Blog

Optimizing data transfers for high throughput life science instruments using AWS DataSync

Healthcare and life sciences (HCLS) customers are generating more data than ever as they integrate the use of omics data with applications in drug discovery, clinical development, molecular diagnostics, and population health. The rate and volume of data that HCLS laboratories generate are a reflection of their lab instrumentation and day-to-day lab operations. Efficiently moving output data from laboratory instrumentation to the cloud for analysis enables HCLS organizations to accelerate their research and improve patient outcomes. As the volume of omics data generated grows with each new generation of high throughput instruments, customers encounter challenges streamlining data transfers between their laboratories and the cloud. Customers want simple solutions that transfer data to the cloud as the data is generated while ensuring security, reliability, and data integrity.

A variety of data transfer options, such as FTP, Amazon data transfer services, and partner solutions, are available to move data from your laboratory to AWS. AWS DataSync is an AWS managed service that helps HCLS customers tackle high-throughput data transfer projects. DataSync provides a secure, online service that automates and accelerates data movement, thereby enabling customers to move their file and object data between on-premises storage, edge locations, other public clouds, and AWS Storage services. Managing DataSync at scale for high volume omics workflows requires a deeper understanding of DataSync deployment models.

In this post, we explore evaluation criteria to identify opportunities to streamline data transfers from your lab to the cloud. Understanding your lab’s specific needs in terms of lab operations, instrumentation, data frequency and volume, networking environment, and compliance requirements is crucial for optimizing your data transfers for high-throughput labs. Additionally, we dive into the best practices for the push and pull architectures for moving operational data based on real-world examples. Once you start successfully transferring data from the lab to the cloud for analysis, you have moved down the path to accelerate scientific insights.

Evaluating your laboratory and its IT infrastructure

HCLS organizations are using cutting-edge, high throughput lab techniques to observe cellular interactions with the goal of discovering new therapies and improving human health. The latest microscopy techniques used to understand brain function generate up to 300 TB per sample in eight hours, and genomics technologies profiling 100K patients for cancer generate up to 20 PB of raw data. The massive volume of data created by these cutting-edge technologies makes it challenging to move the data out of the lab to cloud computing resources quickly enough to avoid disrupting the pace of scientific innovation. An HCLS customer may wait up to 48 hours to generate and transfer data to the cloud depending on the size and volume of data, network connectivity, and complexity of lab automation. Understanding your laboratory is crucial to accelerating data transfers.

The first step is thoroughly evaluating your laboratory and the data patterns for each type of lab instrument. Evaluation criteria fall into three categories: the day-to-day lab operations, the instrument duty cycle, and the IT infrastructure dedicated to the lab.

1. Lab operations: Understanding how the lab runs on a typical day and on peak, high throughput days is important. To estimate the volume of data generated (a simple estimation sketch follows this checklist), consider:

a. How many instruments of each type are there?

b. How many instruments are running at once?

c. How many runs are completed per day per instrument?

2. Instrument duty cycle: Each instrument has its own duty cycle. Understanding the cycle allows you to build a control plane that optimizes and automates data transfers to AWS for a fleet of instruments. To understand how to scale your data transfers, evaluate your lab instrumentation for:

a. How many samples per run and how much data does each run generate?

b. What is the length of the run and timing of the output?

c. What is the layout of the data? Is it a single, large file or multiple smaller files?

d. Does the instrument software write a file or provide a status that signals the start or completion of a run? Can these signals be monitored as events to trigger actions? Can third-party software be installed on the instrument workstation?

e. Is any local pre-processing of the run data needed before transferring to the cloud (e.g., de-multiplexing, motion correction)?

f. How will the data be processed and analyzed after it reaches the cloud?

3. IT infrastructure: Laboratories depend on on-premises IT infrastructure to operate. To understand its role in data transfers, evaluate the local compute, storage, and networking:

a. What is the local infrastructure? Does the on-prem infrastructure remain in place? Is it needed for preprocessing?

b. Where is the data stored after it is generated? How much storage is needed to cache active data if the network connection to AWS is unavailable?

c. Is the local network shared or dedicated to the lab? What is the network topology of the building and lab?

d. Are there any network security devices that might increase latency?

e. What is the available network bandwidth for data transfers?

f. What is the connection to AWS? Is it over the internet or is there a direct connection (e.g., AWS VPN, AWS Direct Connect)?
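Answers to the questions above translate directly into a daily data volume estimate. The following is a minimal estimation sketch of that arithmetic; the instrument types and numbers are hypothetical placeholders, not values from any specific lab.

```python
# A minimal estimation sketch for the checklist above. All values are
# hypothetical placeholders; substitute the answers from your own lab.

instrument_fleet = {
    # instrument type: (count, runs per day per instrument, TB per run)
    "sequencer": (5, 1.0, 2.4),
    "microscope": (2, 0.5, 10.0),
}

daily_tb = sum(count * runs * tb for count, runs, tb in instrument_fleet.values())
print(f"Estimated lab output: {daily_tb:.1f} TB/day")
```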

Using a control plane to optimize instrument data transfers with DataSync

Moving large quantities of data to AWS is not a trivial task. It takes time even with optimal networking and infrastructure. Customers have found different ways to optimize transfers unique to their laboratory environment. Some customers use AWS Partners to help accelerate the implementation of a data transfer solution. Others deploy the Amazon S3 File Gateway because their lab data patterns are low volume and align with the capabilities of the S3 File Gateway. Finally, high throughput laboratories build custom solutions with DataSync as the core component for data transfers.

DataSync provides security, data integrity, transfer mechanics, and logging that are critical for tracking, validating, and auditing data transfers between different storage technologies.

Data can be transferred between on-premises multi-protocol network attached storage and multiple AWS storage options, such as Amazon S3 and Amazon FSx for Lustre. DataSync uses compression to move the data to AWS as fast as possible. Although DataSync transfer tasks cannot directly trigger off instrument events, tasks can be launched by a laboratory control plane built with AWS Lambda and other services.
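As an illustration of one control-plane step, the following is a minimal sketch of a Lambda function that starts a pre-created DataSync task when a run-completion event arrives, for example through an Amazon EventBridge rule. The task ARN, environment variable, and event shape are assumptions for illustration, not part of any published solution.

```python
# Minimal sketch of one control-plane step: a Lambda function that starts
# a pre-created DataSync task when a run-completion event arrives (for
# example, via an EventBridge rule). The task ARN, environment variable,
# and event shape are illustrative assumptions.

import os
import boto3

datasync = boto3.client("datasync")

def handler(event, context):
    task_arn = os.environ["DATASYNC_TASK_ARN"]      # set on the Lambda function
    run_folder = event["detail"]["run_folder"]      # e.g. "/runs/20240601_A01234"

    # Scope the transfer to the completed run with an include filter.
    response = datasync.start_task_execution(
        TaskArn=task_arn,
        Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": f"{run_folder}*"}],
    )
    return {"taskExecutionArn": response["TaskExecutionArn"]}
```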

For many HCLS customers, building the laboratory control plane for DataSync is a critical component for automating lab data transfers. Automated DataSync tasks follow one of two patterns. The “push” pattern uses DataSync agents running on premises to push data into AWS (Figure 1). The “pull” pattern uses DataSync agents running on Amazon Elastic Compute Cloud (Amazon EC2), minimizing the need for on-premises data transfer infrastructure (Figure 2). Both the push and pull patterns have been successfully deployed by HCLS customers to accelerate laboratory data transfers. AWS Prescriptive Guidance and solution architectures are available for genomics data (push) and for connected lab solutions. Both use DataSync for performant data transfers so that scientists can focus on high value science, instead of data movement.

The push pattern is the recommended deployment method to take full advantage of DataSync inline compression and minimize latency (Figure 1). The DataSync agent is a virtual machine deployed on premises, close to the data source, to minimize the effects of latency on NFS/SMB traffic. This is especially important where the laboratory instrumentation is sensitive to the latency of the source storage system. The agent provides optimized and resilient logic to push data over commodity internet, VPN, or Direct Connect to your destination storage in AWS.

Figure 1: Recommended push pattern deployment methods for DataSync in an HCLS laboratory

While the push pattern is the recommended deployment for most lab environments, the pull pattern (Figure 2) is also a supported deployment. Use the pull pattern when growing on-premises infrastructure is not an option or a dynamic fleet of on-demand DataSync agents is needed to match peak laboratory output. The pull pattern uses DataSync agents running on EC2 instances. To use the pull pattern, Direct Connect is necessary to ensure a low-latency connection to AWS. Essentially, Direct Connect extends your on-premises NFS/SMB storage to DataSync agents on EC2 instances that pull data across to AWS. Agents are activated with private endpoints, and data is transferred through the DataSync service, which then writes data directly to the destination, such as Amazon S3.
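To make the private-endpoint activation concrete, here is a minimal sketch, using the AWS SDK for Python (Boto3), of registering an EC2-hosted agent against a DataSync VPC endpoint. All IDs and ARNs are placeholders, and the activation key must first be obtained from the agent instance as described in the DataSync documentation.

```python
# Minimal sketch of registering a DataSync agent that runs on EC2 and is
# activated with a VPC (private) endpoint, as used in the pull pattern.
# All IDs/ARNs below are placeholders; the activation key is obtained
# from the agent instance after it boots (see the DataSync docs).

import boto3

datasync = boto3.client("datasync", region_name="us-east-1")

agent = datasync.create_agent(
    ActivationKey="AAAAA-BBBBB-CCCCC-DDDDD-EEEEE",           # placeholder
    AgentName="lab-pull-agent-1",
    VpcEndpointId="vpce-0123456789abcdef0",                  # DataSync VPC endpoint
    SubnetArns=["arn:aws:ec2:us-east-1:111122223333:subnet/subnet-0abc1234"],
    SecurityGroupArns=["arn:aws:ec2:us-east-1:111122223333:security-group/sg-0abc1234"],
)
print(agent["AgentArn"])
```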

Figure 2: Pull pattern deployment methods for DataSync in an HCLS laboratory

Validating small and big labs with pull transfers

The number of instruments, rate of data generation, file types, and networking determine the number and configuration of DataSync agents required for optimal data transfers. As there are many factors to consider, we profiled the pull pattern to help you determine if it is a fit for your laboratory.

We considered two types of customers when validating the pull pattern. One runs a small genomics sequencing lab with one instrument, and another runs a large lab with a fleet of five sequencing instruments.

Both customers use the Illumina Novaseq 6000 for their next-generation sequencing applications. Each instrument generates 2.4 TB per run. A run requires 48 hours to complete and generates ~500K BCL files. The instrument software creates a flow cell directory on the local network attached storage to signal the start of a run and issues a semaphore to mark its completion. Our customers perform primary analysis converting BCL output to Fastq files (BCL2Fastq) on premises. Transfers begin when the Fastq files are generated. Both laboratories use a dedicated 10 Gbps Direct Connect link to AWS.
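One way to act on those instrument signals is a small watcher on the lab side that detects the completion semaphore and notifies the control plane. The following is a minimal sketch of that idea; the mount path, marker file name, and event source/detail-type are illustrative assumptions rather than instrument or solution specifics.

```python
# Minimal sketch of an on-premises watcher that detects a run-completion
# semaphore on the NAS and publishes an event for the control plane
# (for example, the Lambda sketch earlier). The mount path, marker file
# name, and event source/detail-type are illustrative assumptions.

import json
import time
from pathlib import Path
import boto3

NAS_ROOT = Path("/mnt/lab-nas/runs")     # NFS/SMB mount of the instrument NAS
MARKER = "CopyComplete.txt"              # assumed run-completion semaphore
events = boto3.client("events")
seen = set()

while True:
    for run_dir in NAS_ROOT.iterdir():
        if run_dir.is_dir() and (run_dir / MARKER).exists() and run_dir.name not in seen:
            events.put_events(Entries=[{
                "Source": "lab.instruments",
                "DetailType": "RunComplete",
                "Detail": json.dumps({"run_folder": f"/runs/{run_dir.name}"}),
            }])
            seen.add(run_dir.name)
    time.sleep(300)                      # poll every 5 minutes
```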

Using the best-practice architectures for scaling out DataSync transfers, we validated the pull pattern for big and small transfers using instrument data stored on network attached storage (NAS) hosted in a data center with a 10 Gbps Direct Connect link to AWS. A 3.6 TB Fastq dataset approximated a single Illumina Novaseq 6000 run, representing the smaller laboratory. An 18 TB Fastq dataset represented the output for the large customer with five Illumina Novaseq 6000 sequencers. We evaluated one to four agents, using one task per agent to pull data from the NAS to an S3 bucket. The datasets were evenly split among the DataSync agents using DataSync filters.
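As a rough sketch of how such a split can be expressed, the following creates one NFS source location per agent against the same NAS export and one task per location, each scoped with a DataSync include filter. The ARNs, hostname, and folder names are placeholders, and this is only one of several ways to partition a dataset across agents.

```python
# Minimal sketch of splitting one dataset across several DataSync agents:
# one NFS source location per agent (same NAS export), one task per
# location, each scoped to a subset of run folders with an include
# filter. All ARNs, hostnames, and folder names are placeholders.

import boto3

datasync = boto3.client("datasync")

AGENT_ARNS = [
    "arn:aws:datasync:us-east-1:111122223333:agent/agent-aaaa",
    "arn:aws:datasync:us-east-1:111122223333:agent/agent-bbbb",
]
DEST_ARN = "arn:aws:datasync:us-east-1:111122223333:location/loc-s3dest"
SPLITS = ["/run01|/run02", "/run03|/run04"]   # one filter expression per agent

for i, (agent_arn, include) in enumerate(zip(AGENT_ARNS, SPLITS), start=1):
    src = datasync.create_location_nfs(
        ServerHostname="nas.lab.example.com",
        Subdirectory="/export/fastq",
        OnPremConfig={"AgentArns": [agent_arn]},
    )
    task = datasync.create_task(
        SourceLocationArn=src["LocationArn"],
        DestinationLocationArn=DEST_ARN,
        Name=f"pull-split-{i}",
        Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": include}],
    )
    datasync.start_task_execution(TaskArn=task["TaskArn"])
```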

To understand the behavior of DataSync agents deployed on EC2 instances, new agents were deployed for each test iteration to ensure the availability of network burst credits (see point 2 in the following list). DataSync task output, Amazon EC2 metrics, and Direct Connect CloudWatch metrics were collected for analysis (Figure 3).
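For reference, the following is a minimal sketch of pulling two of those metric streams from CloudWatch with Boto3: the agent instance's EC2 network throughput and the Direct Connect egress rate. The instance and connection IDs are placeholders.

```python
# Minimal sketch of collecting metrics used in the analysis: EC2 network
# throughput for an agent instance and Direct Connect egress. The
# instance and connection IDs are placeholders.

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

def series(namespace, metric, dimensions):
    data = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    return sorted(data["Datapoints"], key=lambda d: d["Timestamp"])

# NetworkOut is reported in bytes per period for the agent instance.
agent_net = series("AWS/EC2", "NetworkOut",
                   [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}])

# ConnectionBpsEgress is reported in bits per second for the DX connection.
dx_egress = series("AWS/DX", "ConnectionBpsEgress",
                   [{"Name": "ConnectionId", "Value": "dxcon-fexample1"}])

for point in dx_egress:
    print(point["Timestamp"], f'{point["Average"] / 1e9:.2f} Gbps')
```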

Two key points to remember when using a pull pattern:

  1. Single-flow traffic: Amazon EC2 network bandwidth is limited to 5 Gbps when not in the same cluster placement group.
  2. Network burst performance: EC2 instances with an “up to” network bandwidth use a burst credit model to achieve that bandwidth and have an underlying baseline bandwidth. When credits are exhausted, the EC2 instance reverts to using underlying baseline network bandwidth.

Figure 3: Simulated HCLS lab architecture for the DataSync pull pattern for two test datasets

Small dataset: 3.6 TB

We began testing with one task per agent. In Figure 4, the single task transfer hit the 5 Gbps flow limit, using the available network burst credits. We know this is the flow limit because the m5.2xlarge EC2 instances are rated up to 10 Gbps. A single task and agent never consumed the maximum available bandwidth. At one hour, the burst credits are consumed and the transfer maintains 2.5 Gbps until complete. Using two agents, each configured with one task, is more efficient with the available Direct Connect bandwidth and maintains use of the network burst credits throughout the transfer. With this dataset, when you move to two tasks, each configured with one agent, you do not run out of burst credits on either EC2 instance. Adding a third and fourth task, with their associated agents, does not dramatically accelerate transfers because the additional burst credits are not used.

Figure 4: Small dataset data transfer measurements

Large dataset: 18 TB

Similar to the small dataset, we started with a single task configured with one agent. The transfer hit the 5 Gbps flow limit, consuming all the network burst credits (Figure 5). Once they are consumed, the transfer returns to the 2.5 Gbps baseline. When the number of tasks/agents is increased to two, the transfer consumes the additional bandwidth available with Direct Connect. Like the single task/agent scenario, once the network burst credits are consumed, the transfer returns to its baseline throughput. Using three tasks/agents, no individual task reaches the 5 Gbps flow limit. Network burst credits last longer and the transfer starts to maximize the available 10 Gbps Direct Connect to AWS. Even with three tasks and agents, the burst credits are consumed before the transfer completes, and throughput returns to the baseline of 2.5 Gbps. With four tasks and agents, the network burst credits are used only in opportunistic spikes. Each of the four agents maintains approximately 2.5 Gbps for the duration. Moreover, the transfer uses the entire 10 Gbps Direct Connect link for the full task duration. A similar pattern was observed with much larger datasets. DataSync continues transferring data and optimizing for the 10 Gbps Direct Connect, thereby maximizing the transfer.

Figure 5: Large dataset data transfer measurements

Other factors

The performance achieved in the preceding validation is just one scenario and not a performance guarantee. A data transfer solution must satisfy business requirements such as sample turnaround time, security, and compliance.

Consider turnaround time (TAT). For example, a large clinical lab operating 30 instruments and generating over 50 TB of genomic data per day may require twelve 10 Gbps Direct Connect links spread across virtual private cloud subnets to meet a required 24-hour TAT for delivering results for a batch of diagnostic tests. A smaller research lab with two to three instruments generating 5 TB weekly, on the other hand, may work fine with a site-to-site VPN tunnel over its existing 1 Gbps internet connection. For more information and architectures for DataSync transfers, refer to the DataSync documentation. Refer to these examples to learn how to configure Direct Connect routing for DataSync architectures.
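A rough back-of-envelope calculation helps frame these sizing conversations. The sketch below converts a daily output volume and a TAT window into a minimum sustained bandwidth; the utilization factor is a hypothetical assumption, and real deployments typically provision well above this floor to cover peak days, reruns, redundancy, and other traffic sharing the links.

```python
# Back-of-envelope sizing sketch: the sustained bandwidth needed to move
# a day's output within a turnaround window. The utilization factor is a
# hypothetical assumption for protocol and scheduling overhead.

daily_output_tb = 50        # e.g., the large clinical lab in the example above
tat_hours = 24              # turnaround window for delivering results
usable_fraction = 0.7       # assume ~70% of link bandwidth is usable

required_gbps = daily_output_tb * 8 * 1000 / (tat_hours * 3600) / usable_fraction
print(f"Minimum sustained bandwidth: {required_gbps:.1f} Gbps")
# -> roughly 6.6 Gbps of usable capacity as a floor, before adding
#    headroom for peak days, reruns, redundancy, and shared traffic.
```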

DataSync securely transfers data between self-managed storage systems and AWS storage services, and between AWS storage services. How your data is encrypted in transit depends on the locations involved in the transfer. Details on data protection, data integrity, resilience, and infrastructure security as they relate to DataSync can be found in the documentation.

For auditing and logging, DataSync integrates with Amazon CloudWatch, AWS CloudTrail, and Amazon EventBridge, but you can also monitor task executions manually through the console, AWS CLI, or SDKs.
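For example, here is a minimal sketch of polling a task execution's status and transferred bytes with Boto3; the execution ARN is a placeholder that you would normally capture from the start_task_execution response.

```python
# Minimal sketch of manually polling a running DataSync task execution.
# The execution ARN is a placeholder; in practice you would capture it
# from the start_task_execution response.

import time
import boto3

datasync = boto3.client("datasync")
execution_arn = (
    "arn:aws:datasync:us-east-1:111122223333:"
    "task/task-0123456789abcdef0/execution/exec-0123456789abcdef0"
)

while True:
    execution = datasync.describe_task_execution(TaskExecutionArn=execution_arn)
    status = execution["Status"]
    gb = execution.get("BytesTransferred", 0) / 1e9
    print(f"{status}: {gb:.1f} GB transferred")
    if status in ("SUCCESS", "ERROR"):
        break
    time.sleep(60)
```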

Conclusion

This post explored key considerations and best practices for moving operational data based on real-world examples. Now that you have reviewed how to assess your lab environment and data movement options, taking the time to understand the next steps and the ultimate business goals for the data can help you optimize data transfers from the lab to the cloud.

In HCLS, data powers innovation. Transferring data from the lab to the cloud for analysis is a critical step to generating insights that improve patient outcomes. HCLS customers will need to respond with ‘lab-aware’ data transfer solutions that keep pace with the growing volume of omics data. Using the right data mover tailored to your needs enables seamless ingestion of omics data generated in the lab with analysis and data management environments on AWS.

You can get started on your data transfer projects powered by DataSync today. Read about how Resilience connected more than 100 laboratory instruments to the cloud with DataSync, combining the scale-out methodologies described in this post, in this case study or in their re:Invent 2024 talk. To learn more, visit AWS DataSync getting started.

Related resources

Building Digitally Connected Labs with AWS

Guidance for a Laboratory Data Mesh on AWS

How to move and store your genomics sequencing data with AWS DataSync

E. Sasha Paegle

With over 25 years of experience in the healthcare and life sciences industry, E. Sasha Paegle has held informatics and product management roles at companies like Dell, Genentech, Merck, and the Scripps Research Institute. Currently, he is an HCLS Specialist focused on cloud solutions for imaging and genomics. Sasha holds an MS in Biochemical Engineering and an MBA, and is a co-inventor on three patents.

Ariella Sasson

Ariella Sasson, Ph.D., is a Worldwide HCLS Specialist Solutions Architect Leader in Data & AI/ML, focusing on helping healthcare and life science organizations transform themselves into data-driven businesses, including optimizing data movement and designing data strategies that let them benefit from AI/ML and HPC. Ariella has over 20 years of experience in high-throughput clinical genomics, oncology, and pharma R&D. She is passionate about using technology and big data to accelerate HCLS research, genomics, and personalized medicine.

Bob Holmes

Bob Holmes is a Senior Storage Solutions Architect who gained over 15 years of HCLS experience at Memorial Sloan Kettering Cancer Center before coming to AWS. Now, Bob is on the Storage Workload Validation team, validating real-world workloads and pushing AWS storage services to their limits.