AWS for Industries

Store omics data cost effectively at any scale with AWS HealthOmics

AWS HealthOmics is a managed service designed to help healthcare and life sciences organizations manage genomic and transcriptomic datasets cost-effectively. By automating metadata management, compression, and storage class tiering, HealthOmics frees up researchers to focus on analyzing omics data instead of spending time managing it.

Storage efficiency, cost optimization and ease of access to omics data are frequent challenges we hear from customers in the healthcare and life sciences industry. The term “omics” collectively refers to various disciplines in biology such as genomics, transcriptomics, and proteomics. The study of omics data is rapidly developing and growing through applications in drug discovery, clinical development, early disease detection, precision medicine, and infectious disease testing and tracing. The cost of sequencing a human genome declined from $100M in 2000 to $200 in 2023, enabling researchers to generate more omics data than ever before with the goal of better understanding how the genome functions and affects human health and disease.

While omics data is transforming how disease is treated, its volume and scale can make it complicated and costly to manage. HealthOmics is a managed service purpose-built to store, query, analyze, and generate insights from genomic, transcriptomic, and other biological data at scale. It has three components (storage, workflows, and analytics) that help drive different areas of omics research. This blog post focuses on HealthOmics Storage and its benefits for managing omics data at scale.

AWS HealthOmics Storage

HealthOmics Storage has two types of data store: the reference store and the sequence store. HealthOmics reference stores hold reference genomes for the organisms you study. HealthOmics sequence stores hold raw sequencing data, such as the output generated by DNA sequencing devices.

Genomic data uses standardized file formats such as FASTA, FASTQ, BAM, uBAM, and CRAM. HealthOmics reference and sequence stores are purpose built to handle these formats. HealthOmics Storage preserves metadata such as file linkages, alignments, base counts, indexes, protocol, custom tags, and more. This metadata helps in data provenance, the process of tracking the origin and history of data throughout its lifecycle. A HealthOmics sequence store uses “read set” resources to store and manage data. A read set is a grouping of sequencing files (e.g., FASTQ paired reads), with metadata like subject and sample IDs. This grouping ensures the data stays well organized with a consistent metadata structure.

HealthOmics reference stores have no associated cost. HealthOmics sequence store costs are based on gigabases instead of gigabytes. A gigabase is a unit of measurement used in the field of molecular biology to represent the number of base pairs in a DNA molecule. It is equivalent to a billion base pairs, or 10^9 base pairs. In contrast, a gigabyte is a unit of measurement used in the field of computer science and is equal to 10^9 bytes. DNA sequencing devices often generate gigabases of output per run. By measuring costs in gigabases, HealthOmics Storage helps you associate costs with your scientific experiments without worrying about file representations or compression ratios.
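To make the unit concrete, here is a minimal Python sketch that counts the bases in FASTQ text and converts the total to gigabases. It assumes the standard four-line FASTQ record layout with no line-wrapped sequences. The function names are illustrative, and the way HealthOmics itself tallies bases for billing may differ from this simple count.

```python
import gzip
import io


def count_bases(fastq_lines):
    """Sum the lengths of the sequence lines in four-line FASTQ records."""
    total = 0
    for index, line in enumerate(fastq_lines):
        # In each record, line 0 is the header, line 1 the sequence,
        # line 2 the '+' separator, and line 3 the quality string.
        if index % 4 == 1:
            total += len(line.strip())
    return total


def count_bases_in_gzip(path):
    """Count bases in a gzip-compressed FASTQ file (illustrative helper)."""
    with gzip.open(path, "rt") as handle:
        return count_bases(handle)


def to_gigabases(base_count):
    """Convert a base count to gigabases (1 gigabase = 10^9 bases)."""
    return base_count / 1e9


# Two tiny made-up reads of 8 bases each, used only to exercise the functions.
sample = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nTTGGCCAA\n+\nIIIIIIII\n"
)
total_bases = count_bases(sample)
print(total_bases, to_gigabases(total_bases))
```

Because the count depends only on the sequence lines, it is the same whether the file is gzip-compressed or not, which is why a per-gigabase price is independent of compression ratio.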

Automated tiering and compression in HealthOmics sequence stores help you optimize costs with minimal effort while maintaining quick and easy access to your data. HealthOmics sequence stores have two storage classes: active and archive. The active storage class is for frequently accessed data. You can access active read sets instantly with transfer latencies in milliseconds. The archive storage class is for infrequently used data. Archived read sets are stored at lower cost than active read sets, and need to be activated before they can be accessed. You can activate an archived read set at no cost via the HealthOmics StartReadSetActivationJob API or the AWS Console.
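As a rough illustration of the activation step, the sketch below wraps the boto3 `omics` client call `start_read_set_activation_job`, which corresponds to the activation API mentioned above. The helper name is hypothetical, and the store and read set IDs in the usage comment are placeholders. The client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
def activate_read_sets(client, sequence_store_id, read_set_ids):
    """Start an activation job for archived read sets and return the job ID.

    `client` is expected to behave like boto3.client("omics").
    """
    response = client.start_read_set_activation_job(
        sequenceStoreId=sequence_store_id,
        sources=[{"readSetId": read_set_id} for read_set_id in read_set_ids],
    )
    return response["id"]


# Example usage (requires boto3 and AWS credentials; IDs are placeholders):
#   import boto3
#   omics = boto3.client("omics")
#   job_id = activate_read_sets(omics, "1234567890", ["0987654321"])
```

Activation runs as an asynchronous job, so a real script would typically check the job status before reading the data.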

HealthOmics Storage automates transitioning read sets from active to archive storage class based on access patterns, without the need for user intervention. If a read set has not been accessed for 30 consecutive days, it is automatically processed, compressed, and transitioned to the archive storage class (see Figure 1).

Figure 1. AWS HealthOmics automated data transition across storage classes. HealthOmics sequence store transitions data from active to archive storage after 30 days of inactivity.

These HealthOmics Storage capabilities let you focus on analyzing and interpreting your omics data without worrying about storage costs or file management. HealthOmics Storage is well suited to researchers and organizations that want to manage omics data storage in the cloud efficiently and cost-effectively.

AWS HealthOmics Storage cost savings

Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance. Today, many healthcare and life sciences customers configure storage tiering and build their own data compression and metadata management solutions to optimize the storage of their omics data on Amazon S3. HealthOmics Storage was purpose-built to include tiering, data compression, and metadata management for omics data, and to pass the additional savings back to customers.

Many customers storing omics data on AWS use Amazon S3 Intelligent-Tiering, which delivers storage cost savings by automatically moving data to the most cost-effective access tier when access patterns change. For a low monthly object monitoring and automation charge, Amazon S3 Intelligent-Tiering monitors access patterns and automatically moves objects to the Infrequent Access tier when they have not been accessed for 30 consecutive days. After 90 days of no access, the objects are moved to the Archive Instant Access tier. If the objects are accessed later, Amazon S3 Intelligent-Tiering moves the objects back to the Frequent Access tier (see Figure 2).

Figure 2. Amazon S3 Intelligent-Tiering automated data transition across storage tiers. If data is inactive, Amazon S3 Intelligent-Tiering transitions it from the Frequent Access tier to the Infrequent Access tier after 30 days, and from the Infrequent Access tier to the Archive Instant Access tier after a further 60 days (90 days of inactivity in total).

Automated tiering provided by HealthOmics Storage is similar to Amazon S3 Intelligent-Tiering, but optimized for common omics data types and access patterns to improve cost-effectiveness.
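The two tiering rules described above can be put side by side in a short sketch. It encodes only the thresholds stated in this post (30 days for HealthOmics; 30 and 90 days for Amazon S3 Intelligent-Tiering). The function names and tier labels are illustrative, and the exact moment at which each service performs a transition is an assumption.

```python
from datetime import date


def days_idle(last_access, today):
    """Number of whole days since the data was last accessed."""
    return (today - last_access).days


def healthomics_storage_class(idle_days):
    """Active until 30 consecutive days without access, then archive."""
    return "active" if idle_days < 30 else "archive"


def s3_intelligent_tiering_tier(idle_days):
    """Frequent Access, then Infrequent Access at 30 days, then
    Archive Instant Access at 90 days without access."""
    if idle_days < 30:
        return "frequent_access"
    if idle_days < 90:
        return "infrequent_access"
    return "archive_instant_access"


# Illustrative dates only.
last_access = date(2024, 1, 1)
for today in (date(2024, 1, 15), date(2024, 2, 15), date(2024, 5, 1)):
    idle = days_idle(last_access, today)
    print(idle, healthomics_storage_class(idle), s3_intelligent_tiering_tier(idle))
```

The main structural difference is that the sequence store has a single transition, while Amazon S3 Intelligent-Tiering passes through an intermediate tier before reaching its archive tier.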

Consider two benchmarking FASTQ files available through the Registry of Open Data on AWS (RODA): s3://genomics-benchmark-datasets/google-brain/fastq/novaseq/wgs_pcr_free/50x/HG007.novaseq.pcr-free.50x.R1.fastq.gz and s3://genomics-benchmark-datasets/google-brain/fastq/novaseq/wgs_pcr_free/50x/HG007.novaseq.pcr-free.50x.R2.fastq.gz.

These two files have a combined size of 99.9 gigabytes (207.3 gigabases). After importing these files as a read set, you can see the total bases via the HealthOmics sequence store console (Figure 3):

Figure 3. Read set details from AWS HealthOmics sequence store on AWS Console. These details highlight the number of bases associated with the read set.
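The import step mentioned above could be scripted roughly as follows, using the boto3 `omics` client call `start_read_set_import_job`. This is a sketch under assumptions: the helper names are hypothetical, the subject and sample IDs and the read set name are placeholders, and the IAM role passed as `role_arn` is assumed to have read access to the source bucket. Depending on your setup, additional fields (for example a reference ARN) may be needed.

```python
R1_URI = (
    "s3://genomics-benchmark-datasets/google-brain/fastq/novaseq/"
    "wgs_pcr_free/50x/HG007.novaseq.pcr-free.50x.R1.fastq.gz"
)
R2_URI = (
    "s3://genomics-benchmark-datasets/google-brain/fastq/novaseq/"
    "wgs_pcr_free/50x/HG007.novaseq.pcr-free.50x.R2.fastq.gz"
)


def build_paired_fastq_source(r1_uri, r2_uri, subject_id, sample_id, name):
    """Describe one paired-end FASTQ read set for an import job."""
    return {
        "sourceFiles": {"source1": r1_uri, "source2": r2_uri},
        "sourceFileType": "FASTQ",
        "subjectId": subject_id,
        "sampleId": sample_id,
        "name": name,
    }


def start_import(client, sequence_store_id, role_arn, sources):
    """Start a read set import job and return its job ID.

    `client` is expected to behave like boto3.client("omics").
    """
    response = client.start_read_set_import_job(
        sequenceStoreId=sequence_store_id,
        roleArn=role_arn,
        sources=sources,
    )
    return response["id"]


# Placeholder subject, sample, and name values for illustration.
source = build_paired_fastq_source(
    R1_URI, R2_URI, "HG007", "HG007-novaseq-50x", "HG007-wgs"
)
```

Grouping both FASTQ files under one source entry is what makes them a single read set with shared subject and sample metadata.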

In one scenario, assume the files are stored in the AWS US East (Northern Virginia) region, accessed at least once a month for one year, and then rarely accessed afterwards.

In the first year, with Amazon S3 Intelligent-Tiering the files are accessed every month and remain in the Frequent Access tier, costing $27.57. With a HealthOmics sequence store, the files remain in active storage, costing $14.35. For the first year, HealthOmics offers 48% savings relative to Amazon S3 Intelligent-Tiering.

In the second year, with Amazon S3 Intelligent-Tiering the files are stored in the Frequent Access tier for 30 days, in the Infrequent Access tier for 60 days, and in the Archive Instant Access tier for the rest of the year, resulting in a cost of $8.39. With a HealthOmics sequence store, the files are stored in active storage for 30 days, and in archive storage for the rest of the year, resulting in a cost of $3.82. For the second year, HealthOmics offers 54% savings relative to Amazon S3 Intelligent-Tiering.

Over two years, your total cost of storage in Amazon S3 Intelligent-Tiering would be $35.96 while HealthOmics would be $18.17, a total savings of 49%.

Table 1. AWS HealthOmics Storage cost savings for scenario 1.

  • In Year 1, HealthOmics offers 48% savings compared to Amazon S3 Intelligent-Tiering
  • In Year 2, HealthOmics offers 54% savings compared to Amazon S3 Intelligent-Tiering
  • Over 2 years, HealthOmics offers 49% total savings compared to Amazon S3 Intelligent-Tiering
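The scenario 1 figures can be reproduced with a small cost model that multiplies data size by the months spent in each tier and a per-unit monthly rate. The rates below are assumptions chosen because they reproduce the dollar amounts quoted in this post. They are not quoted from a pricing page, so check current AWS pricing before relying on them.

```python
SIZE_GB = 99.9        # combined file size in gigabytes (from the post)
SIZE_GBASES = 207.3   # combined read set size in gigabases (from the post)

# Assumed monthly rates (USD per gigabyte) that reproduce the post's S3 figures.
S3_RATES = {"frequent": 0.023, "infrequent": 0.0125, "archive_instant": 0.004}
# Assumed monthly rates (USD per gigabase) that reproduce the HealthOmics figures.
HEALTHOMICS_RATES = {"active": 0.005769, "archive": 0.00115}


def storage_cost(size, months_by_tier, rates):
    """Cost of holding `size` units for the given number of months per tier."""
    return size * sum(months * rates[tier] for tier, months in months_by_tier.items())


def savings_percent(baseline, alternative):
    """Percentage saved by `alternative` relative to `baseline`."""
    return 100 * (baseline - alternative) / baseline


s3_year1 = storage_cost(SIZE_GB, {"frequent": 12}, S3_RATES)
ho_year1 = storage_cost(SIZE_GBASES, {"active": 12}, HEALTHOMICS_RATES)
s3_year2 = storage_cost(
    SIZE_GB, {"frequent": 1, "infrequent": 2, "archive_instant": 9}, S3_RATES
)
ho_year2 = storage_cost(SIZE_GBASES, {"active": 1, "archive": 11}, HEALTHOMICS_RATES)

for label, s3, ho in (
    ("Year 1", s3_year1, ho_year1),
    ("Year 2", s3_year2, ho_year2),
    ("Two years", s3_year1 + s3_year2, ho_year1 + ho_year2),
):
    print(f"{label}: S3 ${s3:.2f}, HealthOmics ${ho:.2f}, "
          f"savings {savings_percent(s3, ho):.0f}%")
```

The model treats each 30-day period as one billing month, which is a simplification of how storage is actually metered.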

In a second scenario, assume the files are stored in the AWS US East (Northern Virginia) region and are accessed three times in a year (Apr-1, Jul-1, and Oct-1).

With Amazon S3 Intelligent-Tiering, the files spend a total of 3 months in the Frequent Access tier (the 30 days following each access), 6 months in the Infrequent Access tier, and 3 months in the Archive Instant Access tier, resulting in a cost of $15.58/year. With a HealthOmics sequence store, the files are in active storage for only 3 months (Apr, Jul, Oct) and in archive storage otherwise, resulting in a cost of $5.73/year. Here, HealthOmics offers a 63% cost savings relative to Amazon S3 Intelligent-Tiering.

Table: storage service cost comparison for scenario 2.
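The tier-month counts in scenario 2 follow from the access dates, and a month-by-month simulation makes that explicit. The sketch below assumes the data has already been idle for at least three months on January 1 and treats each calendar month as one 30-day billing period. The per-unit rates are assumptions chosen to reproduce the figures in this post, not quoted pricing.

```python
SIZE_GB = 99.9        # gigabytes (from the post)
SIZE_GBASES = 207.3   # gigabases (from the post)

# Assumed monthly rates (USD per unit) that reproduce the post's figures.
S3_RATES = {"frequent": 0.023, "infrequent": 0.0125, "archive_instant": 0.004}
HEALTHOMICS_RATES = {"active": 0.005769, "archive": 0.00115}


def s3_tier(idle_months):
    """S3 Intelligent-Tiering tier after a number of idle months."""
    if idle_months < 1:
        return "frequent"
    if idle_months < 3:
        return "infrequent"
    return "archive_instant"


def healthomics_class(idle_months):
    """HealthOmics storage class after a number of idle months."""
    return "active" if idle_months < 1 else "archive"


def months_by_tier(access_months, tier_for, initial_idle_months=3):
    """Count months spent in each tier over one year.

    `access_months` holds the month numbers (1-12) in which the data is
    accessed on the first day of the month.
    """
    counts = {}
    idle = initial_idle_months
    for month in range(1, 13):
        if month in access_months:
            idle = 0  # an access resets the inactivity clock
        tier = tier_for(idle)
        counts[tier] = counts.get(tier, 0) + 1
        idle += 1
    return counts


def storage_cost(size, counts, rates):
    """Cost of holding `size` units for the given months per tier."""
    return size * sum(months * rates[tier] for tier, months in counts.items())


ACCESS_MONTHS = {4, 7, 10}  # Apr-1, Jul-1, Oct-1
s3_counts = months_by_tier(ACCESS_MONTHS, s3_tier)
ho_counts = months_by_tier(ACCESS_MONTHS, healthomics_class)
s3_cost = storage_cost(SIZE_GB, s3_counts, S3_RATES)
ho_cost = storage_cost(SIZE_GBASES, ho_counts, HEALTHOMICS_RATES)
savings = 100 * (s3_cost - ho_cost) / s3_cost
print(s3_counts, ho_counts)
print(f"S3 ${s3_cost:.2f}, HealthOmics ${ho_cost:.2f}, savings {savings:.0f}%")
```

Changing `ACCESS_MONTHS` or `initial_idle_months` is a quick way to test how sensitive the comparison is to your own access pattern.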

Additional Considerations

Access patterns are unique to each customer’s use case and may not match the example scenarios shown here. Customers storing data on Amazon S3 can use Amazon S3 Storage Lens to gain visibility into usage and activity. Customers with known access patterns can use Amazon S3 Lifecycle to transition data from Amazon S3 Standard to S3 Glacier Instant Retrieval. For use cases where the data is not accessed for many years, such as long-term disaster recovery, consider Amazon S3 Glacier Deep Archive. See Best practices for archiving large datasets with AWS for more key considerations.

Conclusion

This blog post demonstrated the cost-effectiveness of HealthOmics Storage for storing large amounts of omics data. HealthOmics Storage allows you to store any volume of genomics, transcriptomics, and other omics data on AWS. It provides secure and durable storage optimized for varying access patterns of omics data, removing the undifferentiated heavy lifting of managing data to enable researchers to focus on analysis and innovation.

References

The FASTQ files used for the example scenarios in this blog are from Google Brain Genomics Sequencing Dataset for Benchmarking and Development.

Additional Resources

To learn more about storing, processing, querying, and migrating omics data with AWS HealthOmics check out:

If you have any comments or questions, do not hesitate to leave them in the comments section.

Sunil Aladhi

Sunil Aladhi is a Senior Technical Account Manager and part of the Global Healthcare and Life Sciences industry division at AWS. He leads a global team to help Life Sciences customers operate their workloads optimally on AWS. Sunil has advised AWS customers across a diverse set of industries to design and operate a broad variety of workloads using AWS Services. Apart from work, he loves spending time with his family and traveling.

Arun KM

Arun KM is a Senior Technical Account Manager and part of the Global Healthcare and Life Sciences industry division at AWS. He works with leading Life Sciences and medical technology companies worldwide to help them adopt AWS services. Outside of work, he cherishes time with family, unwinding with spy thrillers, and reading.

Subrat Das

Subrat Das is a Senior Solutions Architect and part of the Global Healthcare and Life Sciences industry division at AWS. He is passionate about modernizing and architecting complex customer workloads. When he’s not working on technology solutions, he enjoys long hikes and traveling around the world.