AWS for Industries
AWS HealthOmics Announces New Capabilities for Seamless Integration of Omics Data Stores and Analysis Tools
From population sequencing, to drug discovery, to diagnostics research, omics data is driving a deeper understanding and personalization in how healthcare is delivered. While the value of multi-omics data is understood, our healthcare and life sciences (HCLS) customers want better tools to get started, build applications, and scale up analyses that will help lower costs and accelerate insights.
We launched AWS HealthOmics at re:Invent 2022 to help HCLS organizations store, query, and analyze genomic, transcriptomic, proteomic, and other omics data at scale. We’ve already seen customers and partners adopt HealthOmics to store omics data and run bioinformatics analysis pipelines at production scale while spending less time managing infrastructure.
In many of these use cases, customers want omics-focused storage that integrates easily with their current ecosystem. Large-scale storage of genomic, transcriptomic, and other omics data can quickly become the dominant cost, forcing decisions about whether to discard data, reduce the number of analyses, or use compression formats that limit the data’s accessibility. Choosing among these options requires personnel with a background in both biology and computation, which slows the pace of research and often leads to budget overruns.
To simplify these decisions, we are excited to announce that objects stored in the HealthOmics sequence store now have Amazon S3 URIs, meaning they can be read using S3 API compatible tools. Sequence stores already benefit customers through domain-specific metadata, cost savings driven by compression and tiering, and scalability. With this new capability, customers only have to configure IAM permissions to integrate sequence stores with the analysis tools they already use. Because they do not have to rebuild tools, users can adopt HealthOmics data stores without interrupting their scientific work. We are thrilled to see these new capabilities already being used by customers, including population sequencing leader Genomics England, and partners like Basepair, Quilt, Clovertex, and MemVerge.
Figure 1. AWS HealthOmics leverages Amazon S3 access points to allow Amazon/AWS services and community bioinformatics tools to read the objects in an active read set directly using S3 APIs.
Integrate AWS HealthOmics data stores faster with S3 URIs
HealthOmics sequence stores enable customers to store FASTQ, uBAM, BAM, and CRAM files cost-effectively at scale. Sequence stores keep data organized by grouping files and domain-directed metadata into a “read set” object. Automated tiering and compression of read sets help customers optimize costs with minimal effort while maintaining quick and easy access to the data. Previously, customers could access active read sets only through HealthOmics APIs or through bulk export to Amazon Simple Storage Service (Amazon S3). This extra step required customers to adjust either their analysis workflows or their data sharing workflows, which slowed the integration of HealthOmics data stores into their scientific work.
The new HealthOmics data store S3 URI capability allows customers to directly list and read active read sets using S3 APIs through the newly added S3 URI path. This feature leverages S3 access points to generate S3 URIs. Sequence stores are organized hierarchically: there is a prefix for the sequence store, and under it a prefix for each read set that holds the read set’s objects. Because of this structure, customers can navigate the files using S3 List APIs and then, once they have identified a file, retrieve it using S3 Get APIs. Additionally, this structure, along with resource tags for the subject and sample ID, allows customers to create fine-grained IAM access policies that match the restrictions they need.
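The list-then-get pattern described above can be sketched in Python as follows. This is a minimal illustration, not an official HealthOmics sample: the helper names (`split_s3_uri`, `list_read_set_keys`, `read_object_bytes`) are hypothetical, and `s3_client` is assumed to be a boto3 S3 client (or any object exposing the same `list_objects_v2` and `get_object` calls). The real S3 URI and prefix for a read set should be copied from the read set details, as in Figure 2.

```python
def split_s3_uri(uri):
    """Split 's3://bucket-or-alias/key' into (bucket, key)."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"S3 URI has no bucket: {uri!r}")
    return bucket, key


def list_read_set_keys(s3_client, read_set_prefix_uri):
    """Return every object key under a read set prefix, following pagination."""
    bucket, prefix = split_s3_uri(read_set_prefix_uri)
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    keys = []
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        # More results remain; ask for the next page.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


def read_object_bytes(s3_client, object_uri):
    """Fetch one object (for example a FASTQ file) and return its bytes."""
    bucket, key = split_s3_uri(object_uri)
    return s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
```

With boto3 installed and IAM permissions in place, usage might look like `s3 = boto3.client("s3")` followed by `list_read_set_keys(s3, read_set_prefix_uri)`, where `read_set_prefix_uri` is the S3 prefix shown for the read set. Reading a whole object into memory is only reasonable for small files; large BAM or CRAM files are better streamed or read by tools with native S3 support.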
Figure 2. AWS HealthOmics read sets show the files that are available along with the S3 URI for each file. Additionally, they show the S3 prefix for the read set.
Many tools in the bioinformatics field are built to support Amazon S3 APIs. Analysis tools (e.g. the Integrative Genomics Viewer, IGV), workflow languages and engines (e.g. CWL, WDL, Nextflow, Snakemake), and common libraries (e.g. Samtools, HTSlib) all have built-in support for integrating with S3. With the new data store S3 URI capability, you can use these tools with the objects in HealthOmics data stores. The only configuration needed is granting the appropriate IAM permissions.
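As a rough illustration of what granting those IAM permissions could involve, the sketch below builds a read-only identity policy document in Python. It assumes the general S3 access point ARN conventions (`<access-point-arn>` for listing, `<access-point-arn>/object/<key>` for reads) and the standard `s3:prefix` condition key. The function name, the example ARN, and the example prefix are all hypothetical, and the exact resources and any additional permissions (for example, for encryption keys or the store's own access policy) should be taken from the HealthOmics documentation rather than from this sketch.

```python
import json


def build_read_only_policy(access_point_arn, key_prefix=""):
    """Build an identity policy allowing list and read under key_prefix.

    access_point_arn and key_prefix are placeholders supplied by the caller;
    an empty key_prefix covers every object reachable via the access point.
    """
    list_statement = {
        "Sid": "ListReadSetObjects",
        "Effect": "Allow",
        "Action": "s3:ListBucket",
        "Resource": access_point_arn,
    }
    if key_prefix:
        # Restrict listing to one sequence store or read set prefix.
        list_statement["Condition"] = {
            "StringLike": {"s3:prefix": f"{key_prefix}*"}
        }
    get_statement = {
        "Sid": "ReadReadSetObjects",
        "Effect": "Allow",
        "Action": "s3:GetObject",
        "Resource": f"{access_point_arn}/object/{key_prefix}*",
    }
    return {"Version": "2012-10-17", "Statement": [list_statement, get_statement]}


if __name__ == "__main__":
    # Hypothetical ARN and prefix, for illustration only.
    example_arn = "arn:aws:s3:us-east-1:111122223333:accesspoint/example-access-point"
    example_prefix = "example-store-prefix/readSet/example-read-set/"
    print(json.dumps(build_read_only_policy(example_arn, example_prefix), indent=2))
```

Scoping the policy to a read set prefix mirrors the hierarchical layout of the sequence store, so access can be limited to one read set without naming each file.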
AWS HealthOmics data store customers
“Genomics England is a global leader in enabling genomic medicine and research. Building on the 100,000 Genomes Project, we support the NHS’s world-first national whole genome sequencing service and run the growing National Genomic Research Library. Genomics England is utilizing AWS HealthOmics to drive cloud storage cost savings as it scales its sequencing efforts. We are exploring using the sequence store S3 URI feature paired with Mountpoint for S3, to deliver a performant solution where our users can use the HealthOmics stored data smoothly with their large-scale analysis without any changes.” Pete Sinden, Chief Information Officer, Genomics England.
“Basepair’s platform delivers bioinformatics easier, faster, and cheaper with a point & click interface to bring non-technical users to the bioinformatics tools, all while running in the customer’s account. Working with AWS, we announced at Bio-IT 2024 the integration of AWS HealthOmics with our platform along with expanded workflow and storage capabilities. When it came time to integrate the storage with our suite of visualization tools, the S3 URI feature made it seamless to integrate the tools with less than a day of work needed.” Amit Sinha, PhD, Founder & CEO, Basepair.
“Clovertex is a science-first IT services organization specializing in architecting, building, automating, and managing scientific applications on AWS. We are excited about the new AWS HealthOmics S3 URI feature that will benefit our genomics customers. This feature allows for easier data analysis by eliminating data transfer steps and works well with other scientific tools. In using the feature, we were able to integrate HealthOmics storage with different workflows and analytical tools without any tool changes within a few minutes. Because of this ease of integration, scientists save time and can focus on getting insights from their data, ultimately leading to better patient outcomes while also benefiting from security, scalability and cost benefits provided by AWS HealthOmics.” Deven Atnoor, PhD, VP Scientific Strategy, Clovertex.
“Quilt offers an open-source data platform aimed at integrating disparate data silos through versioned, immutable data packages. Given our focus on aiding life science companies in accelerating their drug discovery pipeline, we have long been excited about the potential for HealthOmics to streamline workflows and reduce storage costs. With the new S3 URI features, our customers can now integrate sequence stores alongside traditional S3 objects into beautiful, trustworthy data products for use in their own Amazon data lakes. We are excited about extending our commercial Quilt Catalog to make browsing, visualizing, and interacting with HealthOmics data as easy as it is with existing S3 objects and a wide range of scientific data formats.” Ernest Prabhakar, PhD, Director of Product, Quilt Data, Inc.
Conclusion
With the announcement of data store S3 URIs, healthcare and life sciences organizations can integrate the purpose-built storage in HealthOmics with the tools they already use, helping them scale their research and accelerate scientific discovery.
This release follows additional HealthOmics launches in Q1 2024 that simplify workflow development: run logs now include details about run utilization, tasks, and errors, and logs can be delivered to an Amazon S3 bucket.
Learn more by visiting AWS HealthOmics.
To get started with data stores visit the HealthOmics console.
To see how to integrate data stores with commonly used tools, visit HealthOmics Docs.