[SEO Subhead]
This Guidance shows how you can run the Galaxy software on AWS, so you benefit from the ease of use of the Galaxy platform while purpose-built AWS services handle the undifferentiated heavy lifting, without compromising your security or data integrity. Galaxy is an open-source web application where you can run data-intensive jobs for biomedical research through a graphical web interface. Using AWS native services for data storage and compute, this Guidance shows how you can optimize the end-to-end Galaxy platform when uploading, managing, and analyzing large datasets.
Please note: [Disclaimer]
Architecture Diagram
[Architecture diagram description]
Step 1
Galaxy users access the Galaxy Web application through the public endpoint of the Application Load Balancer.
Step 2
Galaxy stores user metadata and history in a PostgreSQL database hosted on Amazon Aurora Serverless, which Galaxy users access through the Galaxy Web server. The credentials for this access are stored in AWS Secrets Manager.
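As a hedged illustration, the sketch below shows how an application component might turn a Secrets Manager secret payload into a PostgreSQL connection string. It assumes the secret follows the common RDS-style JSON layout (keys such as `username`, `password`, `host`, `port`, `dbname`); the field names and example values are assumptions, not taken from this Guidance's deployment.

```python
import json

def secret_to_dsn(secret_string: str) -> str:
    """Build a PostgreSQL connection string from a Secrets Manager
    secret payload, assuming the common RDS-style JSON layout."""
    s = json.loads(secret_string)
    return (
        f"postgresql://{s['username']}:{s['password']}"
        f"@{s['host']}:{s['port']}/{s['dbname']}"
    )

# Example payload shaped like a GetSecretValue SecretString (values are
# placeholders for illustration only):
example = json.dumps({
    "username": "galaxy",
    "password": "example-only",
    "host": "aurora-cluster.cluster-abc.us-east-1.rds.amazonaws.com",
    "port": 5432,
    "dbname": "galaxy",
})
print(secret_to_dsn(example))
```

In a real deployment, the payload would come from a `GetSecretValue` call (or a secrets-sync pod, as in Step 8) rather than a hardcoded string.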
Step 3
The Galaxy Data volume stores user data, including input data for processing and processed data. Amazon Elastic File System (Amazon EFS) provides the storage capacity.
Step 4
Galaxy uses a message queue for communication between internal processes. This Guidance hosts the message queue on Amazon MQ, a managed message broker service, configured here with the RabbitMQ engine. Credentials for the broker are stored in AWS Secrets Manager.
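A client connecting to the broker assembles an AMQP URL from the credentials stored in Secrets Manager. The sketch below is a minimal illustration, assuming the broker exposes AMQP over TLS on port 5671 (as Amazon MQ for RabbitMQ does); the hostname and credentials shown in the usage note are hypothetical.

```python
from urllib.parse import quote

def broker_amqp_url(user: str, password: str, host: str,
                    port: int = 5671, vhost: str = "/") -> str:
    """Assemble an AMQPS connection URL for a RabbitMQ broker.
    Credentials are percent-encoded, and the default vhost "/"
    becomes %2F, as RabbitMQ URL conventions require."""
    return (
        f"amqps://{quote(user)}:{quote(password)}"
        f"@{host}:{port}/{quote(vhost, safe='')}"
    )
```

For example, `broker_amqp_url("galaxy", "p@ss", "b-123.mq.us-east-1.amazonaws.com")` yields a URL with the `@` in the password encoded as `%40` and the vhost as `%2F`.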
Step 5
Users can review, schedule, and manage bioinformatics jobs in the Galaxy Jobs pod through the Galaxy Web server.
Step 6
AWS Backup takes regular backups of Amazon EFS file systems and PostgreSQL databases.
Step 7
The monitoring and log collection of both the Galaxy components and the AWS infrastructure are centralized in Amazon CloudWatch.
Step 8
Amazon Elastic Kubernetes Service (Amazon EKS) provides the control plane, manages both the networking and the nodes for the Kubernetes pods, and horizontally scales by adding or removing nodes. Additional pods are deployed through Amazon EKS to synchronize secrets with Secrets Manager and to publish logs to CloudWatch.
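To make the scaling behavior concrete: pod-level scaling in Kubernetes follows the Horizontal Pod Autoscaler rule, and when pending pods cannot be placed, node-level autoscaling adds capacity. The sketch below reproduces the standard HPA formula with min/max clamping; the bounds shown are illustrative defaults, not values prescribed by this Guidance.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Kubernetes Horizontal Pod Autoscaler scaling rule:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to the configured min/max bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))

# With 3 replicas at 90% CPU against a 60% target, the HPA scales to 5:
print(desired_replicas(3, 90, 60))
```

When the resulting pods exceed available node capacity, the cluster autoscaler adds nodes within the node group's own min/max limits, which is how horizontal scaling at the node level happens in practice.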
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
-
Operational Excellence
This Guidance uses services that give you full visibility into your workloads through monitoring and logging, while also providing you with reliable, stable, and dependable applications. For example, with CloudWatch, you gain observability through metrics, personalized dashboards, and logs, in addition to alerts defined on metrics throughout this Guidance, so you can monitor the health of your workloads and minimize the impact of incidents. Also, Amazon EKS clusters can identify unhealthy containers and replace them automatically with new containers, so that your workloads remain available and can respond to incidents and events.
-
Security
By default, all incoming connections to Galaxy originate from the public Internet and are directed to the Galaxy server through a publicly accessible Application Load Balancer. Alternatively, this Guidance can be configured to use an internal Application Load Balancer in a private subnet, where traffic is routed through a virtual private network (VPN) connection or through AWS Direct Connect. In both cases, compute resources are deployed within private subnets and are not directly accessible from the public Internet. Galaxy handles application-level authentication and authorization through its own user management or through Active Directory Federation Services (AD FS).
-
Reliability
To implement a reliable application-level architecture, the individual components of this Guidance are deployed as loosely coupled Kubernetes pods. Also, the message broker is the fully managed service Amazon MQ, which, in the default configuration, includes a standby server. Finally, the shared file system is provided through Amazon EFS and is highly available, as is the database provided through Aurora Serverless.
-
Performance Efficiency
Amazon EKS is an AWS native service, and this Guidance focuses on cost-efficient ways to deploy and configure it with selected resources so that you can achieve a reliable Kubernetes application with high availability and low operational costs. The Amazon EKS architecture spans multiple Availability Zones for high availability. While some traffic will flow between subnets deployed in different Availability Zones, the added latency should not significantly impact performance.
Amazon EFS is designed to provide serverless, fully elastic file storage that allows you to share file data without the need to provision or manage storage capacity and performance. It provides a Portable Operating System Interface (POSIX) file system with the necessary performance for bioinformatic workloads.
-
Cost Optimization
A significant factor in data transfer costs within Amazon EKS clusters is calls to Kubernetes services from external clients going through Application Load Balancers. These costs arise when a call to a service results in communication between pods running in different Availability Zones.
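As a rough back-of-the-envelope sketch, cross-AZ traffic is typically billed in each direction, so a single gigabyte crossing an Availability Zone boundary is charged twice. The rate below is an assumption for illustration; check current pricing for your Region.

```python
def cross_az_transfer_cost(gb_transferred: float,
                           rate_per_gb_per_direction: float = 0.01) -> float:
    """Estimate cross-AZ data transfer cost. Traffic between
    Availability Zones is commonly billed both in and out, hence
    the factor of 2. The default rate is an assumed example."""
    return round(gb_transferred * rate_per_gb_per_direction * 2, 2)

# Under the assumed rate, 500 GB crossing AZs costs about $10:
print(cross_az_transfer_cost(500))
```

Keeping chatty pods co-located in one Availability Zone (for example, via topology-aware routing) reduces this cost component, at the price of less cross-AZ redundancy for that traffic path.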
Resources are managed efficiently through the highly configurable autoscaling parameters: the minimum, maximum, and desired number of compute nodes, along with their corresponding Amazon Elastic Compute Cloud (Amazon EC2) instance settings.
Finally, serverless architectures have a pay-per-value pricing model and scale based on demand. This includes the Aurora Serverless database and Amazon EFS. We recommend you tag AWS resources that belong to a project programmatically, and then create custom reports in AWS Cost Explorer using the tags to visualize and monitor costs.
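To show what tag-based cost reporting looks like in practice, the sketch below builds the filter structure that Cost Explorer's `GetCostAndUsage` API accepts for a cost-allocation tag. The tag key and value are hypothetical examples; you would substitute the tags you applied to your own resources.

```python
def cost_explorer_tag_filter(tag_key: str, tag_value: str) -> dict:
    """Build a Cost Explorer filter Expression that matches costs
    attributed to a single cost-allocation tag value."""
    return {
        "Tags": {
            "Key": tag_key,
            "Values": [tag_value],
            "MatchOptions": ["EQUALS"],
        }
    }

# A filter for resources tagged Project=galaxy (example tag only):
print(cost_explorer_tag_filter("Project", "galaxy"))
```

This dictionary would be passed as the `Filter` parameter of a `GetCostAndUsage` call; note that tags must first be activated as cost-allocation tags in the Billing console before they appear in Cost Explorer data.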
-
Sustainability
By choosing right-sized instances, you use only the resources you need, thereby reducing unnecessary emissions. Also, by using services with dynamic scaling, you minimize the environmental impact of the backend services and ensure that compute resources scale with your workload needs. Additionally, the use of fully managed services, such as Amazon EFS, minimizes the required resources.
Implementation Resources
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Related Content
[Title]
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.