AWS HPC Blog

BioContainers are now available in Amazon ECR Public Gallery

This post was contributed by Björn Grüning, Scientific Researcher at Universität Freiburg, and Angel Pizarro, Principal Developer Advocate for AWS Batch.

Today we’re excited to announce that all 9,000+ bioinformatics containers in the BioContainers community are available in Amazon Elastic Container Registry (ECR) Public Gallery, a managed AWS container image registry service that is secure, scalable, and reliable.

BioContainers is a community-driven project that provides the infrastructure and guidelines to create, manage, and distribute bioinformatics tools and applications in a variety of containerized formats, including Docker and Singularity. BioContainers images have long been available on Docker Hub and Quay.io. With today’s announcement they are also available in ECR Public Gallery, which is fast to access from pipelines running on services like AWS Batch, in AWS ParallelCluster, or natively on Amazon EC2.

As with many other container registries, you don’t need an AWS account to search for, or access, container images. ECR Public Gallery comes with a generous free tier. When you pull images from ECR Public Gallery anonymously across the internet, you get 500 GB of free downloads each month. If you authenticate your pulls with an AWS account, that free download allowance increases to 5 TB per month.
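To pull as an authenticated user, you log Docker in to the public registry first. The sketch below wraps the usual two commands in a small function. The function name `ecr_public_login` is ours, chosen for illustration, and the sketch assumes a recent AWS CLI, Docker, and configured AWS credentials on the machine.

```shell
#!/bin/sh
# Hypothetical wrapper function (the name is illustrative). The two commands
# inside are the AWS CLI and Docker commands for authenticating to ECR Public.
# Assumes the AWS CLI and Docker are installed and AWS credentials are configured.
ecr_public_login() {
  # ECR Public authentication tokens are requested from the us-east-1 Region,
  # regardless of where the pull itself happens.
  aws ecr-public get-login-password --region us-east-1 |
    docker login --username AWS --password-stdin public.ecr.aws
}

# Usage (uncomment to run): authenticate once, then pull as usual.
# ecr_public_login
# docker pull public.ecr.aws/biocontainers/blast:2.2.31
```

The function is only defined here, not called, so nothing contacts AWS until you invoke it yourself.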

Importantly, your workloads running on AWS get unlimited data bandwidth from any Region when pulling from ECR Public Gallery. This means that your bioinformatics workflows won’t be slowed down by container pulls from registries that impose rate limits.

Using BioContainers from ECR Public

ECR Public uses Amazon CloudFront to cache images across the globe, putting that data close to you and your cloud workloads. It’s simple to use: just add the global prefix public.ecr.aws/ to container IDs in your scripts and workflows. For instance, instead of pulling the BLAST container from Docker Hub like this:

docker pull biocontainers/blast:2.2.31

You can just use this command to pull from ECR Public Gallery:

docker pull public.ecr.aws/biocontainers/blast:2.2.31
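If your scripts reference many BioContainers images, you could apply the prefix in one place rather than editing each ID by hand. The following is a minimal sketch, not part of Docker or the AWS CLI: `to_ecr_public` is a hypothetical helper name, and it assumes your image IDs use the Docker Hub style `biocontainers/<tool>:<tag>` form shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Docker or the AWS CLI): rewrite a
# Docker Hub style BioContainers image ID to its ECR Public equivalent.
to_ecr_public() {
  case "$1" in
    # Already points at ECR Public: leave unchanged.
    public.ecr.aws/*) printf '%s\n' "$1" ;;
    # Docker Hub style BioContainers ID: add the global prefix.
    biocontainers/*) printf 'public.ecr.aws/%s\n' "$1" ;;
    # Anything else is passed through untouched.
    *) printf '%s\n' "$1" ;;
  esac
}

# Print the rewritten ID for the BLAST image used in this post.
to_ecr_public "biocontainers/blast:2.2.31"
```

In a real script you would hand the result straight to Docker, for example `docker pull "$(to_ecr_public biocontainers/blast:2.2.31)"`.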

Some workflow frameworks let you parameterize a task so that the registry it pulls its container from depends on the context where you are running the workflow. For example, in Nextflow, you can edit a configuration file without editing the workflow definition itself. In that file, illustrated in the snippet below, you can override the default container ID for a task only when the workflow is running on AWS.

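// a context (profile) section in a nextflow.config file
profiles {
  // selected only when the workflow is run with: -profile aws
  aws {
    process {
      // override the container for the samtools_sort task
      withName: samtools_sort {
        container = 'public.ecr.aws/biocontainers/samtools:v1.9-4-deb_cv1'
      }
    }
  }
}
// run tasks in Docker containers in every context
docker {
  enabled = true
}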

This is especially handy if you develop your Nextflow pipelines on a local machine, but run the production workload using AWS Batch or the Amazon Genomics CLI.

Contributing to BioContainers

Upstream from BioContainers is the BioConda project, which lets you install thousands of software packages related to biomedical research using the Conda package manager. Through some neat automation, all BioConda recipes are automatically built as BioContainers images. In cases where a Conda recipe isn’t quite enough to build a working container, the BioContainers team writes custom Dockerfiles to finish the job.

If you have a tool you’d like to add to BioContainers, we recommend you write a Conda package, following the BioConda contribution guide. Then, submit it as a pull request (PR) to the BioConda GitHub repository. Shortly after your PR is approved and merged, your tool will be available to users of BioConda and BioContainers alike.

Conclusion

This new development benefits the entire life sciences community by broadening the distribution channel for bioinformatics containers, while improving the efficiency of containerized bioinformatics workloads on AWS. Try pulling some images and integrating the ECR registry into your workflow, then let us know how BioContainers on ECR Public Gallery works for you.

 

Angel Pizarro

Angel is a Principal Developer Advocate for HPC and scientific computing. His background is in bioinformatics application development and building system architectures for scalable computing in genomics and other high throughput life science domains.