This Guidance demonstrates how to import omics sequence data from Amazon Simple Storage Service (Amazon S3) into AWS HealthOmics Storage. HealthOmics Storage helps you store and share genomics data efficiently, so you can reduce costs as your volume of genomics data grows. Because HealthOmics integrates with other AWS services, this Guidance can also help you store genomics data securely, protect patient privacy, and automate workflows that streamline data processing and analysis.
Please note: [Disclaimer]
Architecture Diagram
Step 1
If you have already followed the directions in How to move and store your genomics sequencing data with AWS DataSync, you will have a pre-existing Amazon Simple Storage Service (Amazon S3) bucket.
If you do not have an Amazon S3 bucket, you can create one using either the AWS Management Console or AWS Command Line Interface (AWS CLI).
Step 2
The Amazon S3 Object Created Event invokes an AWS Lambda function to create a record in the Amazon DynamoDB table.
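As an illustration of this step, the following is a minimal sketch of a Lambda handler that turns an S3 Object Created notification into DynamoDB items. It assumes the bucket sends S3 event notifications directly to Lambda (an Amazon EventBridge delivery would have a different event shape). The item attributes (`s3_uri`, `file_name`, `event_time`) and the `TABLE_NAME` environment variable are hypothetical and may not match the code repository.

```python
import os
from urllib.parse import unquote_plus


def items_from_s3_event(event: dict) -> list[dict]:
    """Build one item per ObjectCreated record in an S3 event notification."""
    items = []
    for record in event.get("Records", []):
        # Skip anything that is not an object-creation event.
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        items.append(
            {
                "s3_uri": f"s3://{bucket}/{key}",      # hypothetical attribute
                "file_name": key.rsplit("/", 1)[-1],   # hypothetical attribute
                "event_time": record.get("eventTime", ""),
            }
        )
    return items


def handler(event, context, table=None):
    """Lambda entry point. `table` can be injected for local testing."""
    if table is None:
        import boto3  # imported lazily so the parsing logic runs without boto3

        # TABLE_NAME is an assumed environment variable.
        table = boto3.resource("dynamodb").Table(os.environ["TABLE_NAME"])
    items = items_from_s3_event(event)
    for item in items:
        table.put_item(Item=item)
    return len(items)
```

Keeping the parsing in a pure function makes it possible to unit test the event handling without AWS credentials.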
Step 3
Creating a record in the Auto Load Omics table writes a corresponding record to the table's DynamoDB stream.
Step 4
The DynamoDB stream event invokes Lambda, which starts the sequence import workflow.
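A sketch of how a stream-triggered function might start the workflow is shown below. It assumes the stream view type includes the new image, that only `INSERT` records should trigger an import, and that the item holds only string attributes. The Step Functions client is passed in so that the logic can be exercised without AWS access. The function names are illustrative and not taken from the code repository.

```python
import json


def new_items_from_stream(event: dict) -> list[dict]:
    """Flatten the string attributes of each newly inserted item."""
    items = []
    for record in event.get("Records", []):
        # Only newly created records should start an import (assumption).
        if record.get("eventName") != "INSERT":
            continue
        image = record["dynamodb"].get("NewImage", {})
        # Stream images use DynamoDB attribute-value format, e.g. {"S": "text"}.
        items.append(
            {name: value["S"] for name, value in image.items() if "S" in value}
        )
    return items


def start_imports(event: dict, sfn_client, state_machine_arn: str) -> list[str]:
    """Start one Step Functions execution per inserted item."""
    execution_arns = []
    for item in new_items_from_stream(event):
        response = sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            input=json.dumps(item),
        )
        execution_arns.append(response["executionArn"])
    return execution_arns
```

In a deployed function, `sfn_client` would typically be `boto3.client("stepfunctions")`, and the state machine ARN would come from configuration.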
Step 5
An AWS Step Functions workflow, which combines multiple Lambda functions with native Step Functions tasks, is initiated to import the data. The detailed workflow is located in the code repository.
Step 6
The original sequence is loaded into AWS HealthOmics Storage.
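To make the import itself more concrete, the sketch below assembles the request for the HealthOmics `StartReadSetImportJob` API as a plain dictionary. It is an illustration under assumptions, not the workflow from the code repository: the helper name and the validation rules are hypothetical, and only the three formats named in this Guidance are accepted.

```python
SUPPORTED_FILE_TYPES = {"FASTQ", "BAM", "CRAM"}


def build_read_set_import_request(
    sequence_store_id: str,
    role_arn: str,
    reference_arn: str,
    s3_uris: list[str],
    file_type: str,
    subject_id: str,
    sample_id: str,
) -> dict:
    """Return keyword arguments for a read set import job with one source."""
    if file_type not in SUPPORTED_FILE_TYPES:
        raise ValueError(f"unsupported file type: {file_type}")
    if len(s3_uris) not in (1, 2):
        raise ValueError("expected one or two source files")
    if len(s3_uris) == 2 and file_type != "FASTQ":
        # Assumption: a second file is only meaningful for paired FASTQ reads.
        raise ValueError("a second source file is only used for paired FASTQ")

    source_files = {"source1": s3_uris[0]}
    if len(s3_uris) == 2:
        source_files["source2"] = s3_uris[1]

    return {
        "sequenceStoreId": sequence_store_id,
        "roleArn": role_arn,
        "sources": [
            {
                "sourceFiles": source_files,
                "sourceFileType": file_type,
                "subjectId": subject_id,
                "sampleId": sample_id,
                "referenceArn": reference_arn,
            }
        ],
    }
```

With boto3 available, the result could be passed as `boto3.client("omics").start_read_set_import_job(**request)`.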
A. Custom Resource
The sequence import requires a reference genome in the HealthOmics Reference store.
This Guidance uses an additional AWS Cloud Development Kit (AWS CDK) construct that creates a reference and stores the Amazon Resource Name (ARN) for that reference as a parameter in AWS Systems Manager Parameter Store.
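A function in the import workflow could then look up the reference ARN at run time. The sketch below assumes a hypothetical parameter name and takes the Systems Manager client as an argument. The actual parameter name is defined by the CDK construct in the code repository.

```python
def get_reference_arn(ssm_client, parameter_name: str) -> str:
    """Read the reference ARN that the CDK construct stored in Parameter Store."""
    response = ssm_client.get_parameter(Name=parameter_name)
    value = response["Parameter"]["Value"]
    # Guard against a misconfigured parameter before starting an import.
    if not value.startswith("arn:"):
        raise ValueError(f"{parameter_name} does not contain an ARN: {value!r}")
    return value
```

For example, with boto3: `get_reference_arn(boto3.client("ssm"), "/omics/reference/arn")`, where the parameter name is a placeholder.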
B. Custom Metric
The success or failure of each HealthOmics import job is recorded as a custom metric in Amazon CloudWatch. This allows detailed monitoring of import statistics.
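One way to record such a metric is with the CloudWatch `PutMetricData` API. The sketch below builds the request for a single import outcome. The namespace, metric names, and dimension are hypothetical and are likely to differ from those used in the code repository.

```python
def build_import_metric(
    succeeded: bool,
    sequence_store_id: str,
    namespace: str = "AutoLoadOmics",  # hypothetical namespace
) -> dict:
    """Return keyword arguments for cloudwatch.put_metric_data."""
    # Hypothetical metric names: one counter per outcome.
    metric_name = "ImportJobSucceeded" if succeeded else "ImportJobFailed"
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": "SequenceStoreId", "Value": sequence_store_id}
                ],
                "Value": 1,
                "Unit": "Count",
            }
        ],
    }
```

With boto3 available, the metric could be published with `boto3.client("cloudwatch").put_metric_data(**build_import_metric(True, store_id))`. Summing each counter over time then gives success and failure totals per sequence store.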
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
This Guidance is implemented using AWS CDK, with the business logic, infrastructure, and configuration all defined as code. This allows changes and integrations to be made, reviewed, and tracked as code within a version control system.
Security
Amazon S3 is protected by the AWS secure global network infrastructure. Security and compliance are a shared responsibility between AWS and the customer. This shared model helps relieve the customer's operational burden because AWS operates, manages, and controls the components from the host operating system and virtualization layer down to the physical security of the facilities in which the service operates.
Amazon S3 secures data from unauthorized access with encryption features and access management tools. HealthOmics encrypts sensitive customer data at rest by default, using a service-owned AWS Key Management Service (AWS KMS) key. Customer-managed KMS keys are also supported. For more on data protection with HealthOmics, see Data protection in AWS HealthOmics.
Reliability
This Guidance is built on AWS serverless and managed services, so AWS is responsible for the efficient operation of those services, and the application can scale with demand. This helps the workload perform its intended function correctly and consistently when it is expected to. It also allows customers to operate and test the workload throughout its lifecycle.
Performance Efficiency
The backbone of this Guidance is a set of AWS serverless and managed services that minimize operational overhead, such as server management. HealthOmics Storage is purpose-built for omics sequence data, allowing customers to store, discover, and share raw sequence data efficiently, securely, and at low cost.
Cost Optimization
This Guidance includes the functionality to move data into HealthOmics Storage. HealthOmics provides a cost-effective, omics-aware storage option for reference and sequence data that can reduce the Total Cost of Ownership (TCO) for storing raw sequence data. Such data can include BAM, CRAM, and FASTQ files.
HealthOmics automatically moves data to a less expensive storage class if the data is not regularly accessed (such as data that has not been accessed for more than 30 days). This is similar to the Amazon S3 Intelligent-Tiering storage class, which automates storage cost savings by moving data when access patterns change.
This Guidance is built with the AWS serverless service, Lambda, for event-driven computing. Step Functions is used for orchestration, sequencing the data import workflow. AWS serverless services and products allow applications to scale quickly with demand, while ensuring that only the minimum resources are required.
Sustainability
When building cloud workloads, the practice of sustainability is knowing the impacts of the services used and applying design principles to reduce those impacts. In the case of this Guidance, because it relies extensively on serverless and managed services, the services scale to continually match the load, but with just the minimum resources needed, reducing the risk of over-provisioning resources.
Implementation Resources
A detailed guide is provided for you to experiment with and use within your AWS account. It walks through each stage of building the Guidance, including deployment, usage, and cleanup.
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Related Content
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.