AWS for Industries

Secure Your Genomic Workflows and Data with AWS HealthOmics

AWS HealthOmics is a HIPAA-eligible AWS service that provides fully managed compute to help customers process genomic, proteomic, and other varieties of -omic data securely and at scale in AWS. It supports common workflow definition languages, including WDL, Nextflow, and CWL, and secure access to data stored in Amazon S3 and AWS HealthOmics Storage. In this blog post, we cover specific AWS HealthOmics features and integrations with other AWS services that help improve and maintain workflow and data security.

For Healthcare and Life Sciences organizations, security and data privacy are paramount. At AWS, “Security is Job Zero”. We never lose sight of this guiding principle or the fact that our Cloud services and data centers process petabytes of critical customer data every day. Our deep foundations in security, painstaking attention to detail, and security-first services and tools help our customers build solutions that provide end-to-end security and ensure that sensitive data like personal healthcare information is always accessed and used appropriately.

Built for Security

AWS HealthOmics is built from the ground up to support and enable security, and it guides users towards security best practices. Here are a number of ways HealthOmics helps protect your workflows and data:

Compute isolation: HealthOmics provides dedicated compute resources for each workflow run, which consists of a workflow definition and collection of containerized workflow tasks. Cloud infrastructure and resources are never shared or reused between workflow runs. Like containerized tasks in AWS Fargate, HealthOmics maintains a security isolation boundary for each workflow task. A HealthOmics workflow task does not share the underlying kernel, CPU, memory, or network interface with any other workflow task. Compute instances that tasks run on use the latest version of Amazon Linux and are continuously monitored for security vulnerabilities.

Storage isolation: HealthOmics provides a dedicated and high-throughput filesystem for each workflow run. This filesystem is shared by the workflow’s tasks and can be used to pass intermediate workflow data through a pipeline or series of tasks that incrementally process the data. After a workflow run completes, the filesystem is securely destroyed.

Network isolation: HealthOmics runs workflows in an isolated network with no connectivity to the public Internet or other AWS Regions. This ensures that workflow data is tightly controlled by you. The service provides outbound network access to select AWS services accessed through VPC endpoints. These include Amazon Elastic Container Registry (Amazon ECR), Amazon S3, AWS Key Management Service (AWS KMS), Amazon CloudWatch Logs, and the HealthOmics service itself. To provide additional protection, HealthOmics encloses each workflow task in a security group that narrowly restricts inbound and outbound network traffic.

Customer-managed permissions: HealthOmics only uses the permissions and resources you explicitly grant access to. The service is tightly integrated with AWS Identity and Access Management (IAM) and runs workflow tasks with credentials and permissions you provide as an IAM role for each workflow run. The HealthOmics AWS console can generate an IAM role for you with a recommended and minimal set of permissions. IAM role permissions typically include access to data stored in Amazon S3 or HealthOmics Storage, AWS KMS keys used to encrypt your data at rest, and permissions to allow HealthOmics to send workflow logs to Amazon CloudWatch. An IAM role trust policy ensures that these credentials and permissions can only be used by HealthOmics. You can audit the use of your IAM role credentials with AWS CloudTrail, and you can revoke credentials or individual permissions at any time in the IAM console.

End-to-end encryption: HealthOmics encrypts all data and metadata at rest in the service. This includes intermediate workflow data stored in the filesystem maintained for each workflow run, data stored in HealthOmics Storage, and any workflow input or output data stored in encrypted S3 buckets. HealthOmics encrypts requests and data in transit through the internal AWS network, and it authenticates requests to other AWS services using the AWS Signature Version 4 protocol.

Data provenance: A recent U.S. National Institutes of Health report stated the importance of data provenance for public genomics datasets. HealthOmics records what was run, when, and by whom. It automatically records the IAM user or role that initiated each workflow run and timestamps (creation, start, and stop times) for the run and each workflow task. Workflow parameters are an optional but powerful tool for maintaining data provenance. If a container image or S3 object is included as a parameter to the workflow run, HealthOmics records the SHA-256 digest (container image) or S3 ETag value (S3 object) of each resource. This information is stored as an immutable record inside HealthOmics and can be used to audit or reconstruct the exact set of containers, data inputs, and tools that were used for the workflow run. You can retrieve digests and other data provenance details by calling the GetRun or GetRunTask APIs (or via the AWS CLI: “aws omics get-run --id <runId>” or “aws omics get-run-task --id <runId> --task-id <taskId>”).
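As a rough sketch of how these provenance details might be collected programmatically, the Python below calls GetRun through a boto3 “omics” client and keeps a handful of fields. The response field names used here (startedBy, creationTime, startTime, stopTime, digest, parameters) are assumptions about the GetRun response shape and should be checked against the current API reference. The helper names summarize_run and fetch_run_provenance are hypothetical.

```python
from typing import Any, Dict

# Field names are assumptions about the GetRun response; verify before relying on them.
PROVENANCE_FIELDS = (
    "id", "startedBy", "creationTime", "startTime", "stopTime",
    "digest", "parameters",
)


def summarize_run(run: Dict[str, Any]) -> Dict[str, Any]:
    """Pick provenance-related fields out of a GetRun-style response dict.

    Fields that are absent come back as None instead of raising.
    """
    return {name: run.get(name) for name in PROVENANCE_FIELDS}


def fetch_run_provenance(run_id: str) -> Dict[str, Any]:
    """Call GetRun for run_id and summarize it (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so summarize_run works without boto3 installed

    client = boto3.client("omics")
    return summarize_run(client.get_run(id=run_id))
```

A summary like this could be stored next to the run outputs so that an auditor can see who started a run and when without querying the service again.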

Figure 1 – HealthOmics Benefits: HealthOmics provides isolated environments for compute, storage, and networking. It guides users towards security with best practices permissions, end-to-end encryption, monitoring, and auditing.

Shared Responsibility Model

Security is a shared responsibility between AWS and our customers. AWS is responsible for protecting the infrastructure that supports AWS services (“Security of the Cloud”) and customers are responsible for security configuration and management of individual AWS services and the integration into their IT environment (“Security in the Cloud”). Here are some things you can do to help HealthOmics secure your workflows and data:

Apply least-privilege permissions: When configuring the IAM role you use to start HealthOmics workflow runs, grant the minimum set of permissions necessary to support the run. We recommend starting from the minimal IAM role generated by the HealthOmics AWS console, and then adding narrowly-scoped permissions for any additional access your workflow run requires. HealthOmics should be the only service included in the IAM role trust policy. You can use an IAM policy condition to further restrict access. You must configure your ECR repositories to grant access to HealthOmics. An S3 bucket policy and KMS key policy are not needed to access data stored in your own AWS account. However, when your use case requires cross-account data access, we recommend limiting S3 and KMS policies to the minimum set of IAM roles and users that need access.
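To make the trust-policy recommendation concrete, here is a minimal sketch that builds a trust policy document naming HealthOmics as the only trusted service and adds conditions to narrow it further. The service principal omics.amazonaws.com and the global condition keys aws:SourceAccount and aws:SourceArn are standard IAM concepts, but the run ARN pattern shown is an assumption, and build_trust_policy is a hypothetical helper. Compare the result with the role generated by the HealthOmics console before using it.

```python
import json
from typing import Any, Dict


def build_trust_policy(account_id: str, region: str) -> Dict[str, Any]:
    """Return an IAM trust policy that only HealthOmics can use to assume the role."""
    statement = {
        "Effect": "Allow",
        # HealthOmics is the only trusted service principal.
        "Principal": {"Service": "omics.amazonaws.com"},
        "Action": "sts:AssumeRole",
        # Conditions narrow the trust to runs in one account and Region.
        # The ARN pattern below is an assumption; confirm it in the documentation.
        "Condition": {
            "StringEquals": {"aws:SourceAccount": account_id},
            "ArnLike": {
                "aws:SourceArn": f"arn:aws:omics:{region}:{account_id}:run/*"
            },
        },
    }
    return {"Version": "2012-10-17", "Statement": [statement]}


if __name__ == "__main__":
    # 111122223333 is a placeholder account ID.
    print(json.dumps(build_trust_policy("111122223333", "us-east-1"), indent=2))
```

The conditions limit which account and resources can cause HealthOmics to assume the role, which is one way to apply the IAM policy condition mentioned above.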

Secure your container images: Regularly scan your container images for security vulnerabilities. Continuous scanning will allow you to quickly detect known security vulnerabilities and then patch, update, or replace your container images to remove vulnerabilities. ECR “Enhanced scanning”, based on Amazon Inspector, provides low-cost and continuous scanning of container images. We also recommend referencing container images from your HealthOmics workflows with a versioned tag, rather than a “latest” tag that can change over time. This will allow you to quickly identify the specific container image and tools that were used to complete a workflow run, diagnose failures, or audit data provenance. For additional protection, consider enabling ECR’s immutable tag feature.
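The versioned-tag recommendation can be checked automatically before a run is started. The sketch below is one possible check, not a HealthOmics feature: is_pinned_image is a hypothetical helper that accepts an image reference only if it carries an explicit tag other than “latest” or a SHA-256 digest.

```python
def is_pinned_image(image_uri: str) -> bool:
    """Return True if image_uri has a digest or an explicit tag other than 'latest'."""
    if "@sha256:" in image_uri:
        return True  # digest references always identify one exact image
    # Look for the tag only in the last path segment, so that a registry port
    # such as 'localhost:5000/repo' is not mistaken for a tag.
    last_segment = image_uri.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag given, which typically resolves to 'latest'
    tag = last_segment.rsplit(":", 1)[1]
    return tag != "" and tag != "latest"
```

A pre-submission script could reject any workflow parameter whose image reference fails this check, so that floating tags are caught before a run starts.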

Encrypt your data at-rest: Encryption helps customers meet their security and compliance requirements, like HIPAA. Always encrypt your workflow data inputs and output data at-rest in S3 and HealthOmics Storage. Both services encrypt your data by default with AWS-managed keys, and they support a second encryption option using customer-managed keys. AWS-managed keys are easy to use and automatically encrypt and decrypt data when it is accessed. Customer-managed keys give you full control of encryption and access to your data. For complete details on the available encryption options, refer to S3 and HealthOmics documentation.
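For the customer-managed key option on S3, the following sketch builds a default-encryption configuration and applies it with the boto3 S3 client's put_bucket_encryption call. The bucket name and key ARN are supplied by the caller, and kms_default_encryption and apply_default_encryption are hypothetical helper names. Encryption for HealthOmics Storage is configured separately and is not covered here.

```python
from typing import Any, Dict


def kms_default_encryption(kms_key_arn: str) -> Dict[str, Any]:
    """Build an S3 default-encryption configuration that uses a customer-managed KMS key."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                },
                # Bucket keys reduce the number of requests S3 makes to KMS.
                "BucketKeyEnabled": True,
            }
        ]
    }


def apply_default_encryption(bucket: str, kms_key_arn: str) -> None:
    """Apply the configuration to a bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("s3").put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration=kms_default_encryption(kms_key_arn),
    )
```

If you use a customer-managed key this way, the IAM role for the workflow run also needs permission to use that key, in line with the least-privilege guidance above.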

Use workflow parameters: Design your workflows to include workflow parameters and use a parameter for each container image and data input referenced by a workflow. When parameterized, HealthOmics confirms that it can access each referenced resource with the IAM role you have provided before starting the workflow run, which can help you find mistakes early. The service will record a SHA-256 digest for each parameter referencing a container image and an S3 ETag value for each S3 object. (Data stored in HealthOmics Storage is immutable, and a HealthOmics ReadSet URI serves as a unique identifier.) Parameters make it easy to reuse your workflows with new datasets and containerized software versions, and they are a powerful tool for maintaining data provenance.
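A small sketch of what a parameterized run request might look like follows. The function assembles keyword arguments for the boto3 “omics” client's start_run call. The parameter names container_image and input_reads are hypothetical and must match whatever your workflow definition declares, and build_start_run_request is a hypothetical helper.

```python
from typing import Any, Dict


def build_start_run_request(
    workflow_id: str,
    role_arn: str,
    output_uri: str,
    image_uri: str,
    input_uri: str,
) -> Dict[str, Any]:
    """Assemble StartRun arguments with every image and data input as a parameter."""
    return {
        "workflowId": workflow_id,
        "roleArn": role_arn,
        "outputUri": output_uri,
        "parameters": {
            # Hypothetical parameter names; use the ones your workflow defines.
            "container_image": image_uri,
            "input_reads": input_uri,
        },
    }
```

With credentials configured, the request could be submitted with boto3.client("omics").start_run(**request). Other arguments may be required depending on your setup, so check the StartRun reference before relying on this shape.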

Enable workflow logging: Workflow run logging is enabled by default (the “ALL” setting), and we recommend using this default for every workflow run. When logging is enabled, HealthOmics will send CloudWatch logs to your AWS account for each workflow run and task. These logs capture the runtime operation of your workflow runs and will help you diagnose workflow issues or failures. You can set a retention policy on your CloudWatch log group to allow CloudWatch to automatically delete older logs and limit your log storage costs. Alternatively, you can retain logs indefinitely and maintain a historical record of every completed workflow run. These logs become data that you can audit for security purposes or to meet regulatory requirements.
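Setting a retention policy can be scripted. CloudWatch Logs only accepts certain retention periods, so the sketch below rounds a requested minimum up to the next accepted value before calling put_retention_policy. The list of accepted values reflects our reading of the PutRetentionPolicy documentation and should be verified. The log group name is passed in by the caller, and both helper names are hypothetical.

```python
# Retention periods (in days) accepted by CloudWatch Logs, as we understand them.
ALLOWED_RETENTION_DAYS = (
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400,
    545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
)


def pick_retention_days(minimum_days: int) -> int:
    """Return the smallest accepted retention period that is >= minimum_days."""
    for days in ALLOWED_RETENTION_DAYS:
        if days >= minimum_days:
            return days
    raise ValueError(f"no accepted retention period covers {minimum_days} days")


def set_log_retention(log_group_name: str, minimum_days: int) -> int:
    """Apply a retention policy to a log group (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so pick_retention_days works without boto3

    days = pick_retention_days(minimum_days)
    boto3.client("logs").put_retention_policy(
        logGroupName=log_group_name, retentionInDays=days
    )
    return days
```

Rounding up rather than down means logs are never deleted earlier than the minimum your compliance requirements call for.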

Conclusion

Using AWS, you can maintain the highest level of security for your AWS HealthOmics workflows and data. HealthOmics maintains the “Security of the Cloud” by running your workflows in isolated compute and networking environments, encrypting data at-rest and in-transit through the AWS network, and recording key data provenance details. You can help maintain your “Security in the Cloud” by following the recommendations and security best practices outlined in this blog post: apply least-privilege permissions, maintain the security of your container images, always encrypt your data at rest, and enable HealthOmics features that allow you to monitor and audit your workflow runs.

To learn more about AWS HealthOmics, contact an AWS Representative or AWS Partner and get started today.

Andy Henroid

Andy Henroid is a software architect leading development of AWS HealthOmics. He has over 25 years of software industry experience, spanning Cloud security, distributed computing, performance optimization and scaling, and CPU micro-architecture. An early contributor to the Linux kernel, he is passionate about products built from open-source software.