AWS for Industries

Examine genomic variation across populations with AWS

Understanding genomic variation across populations is critical for advancing precision medicine and evolutionary research, but analyzing large datasets can be complex and resource-intensive. Amazon Web Services (AWS) provides scalable tools to simplify this process, enabling efficient and insightful analysis across diverse genetic data.

Introduction

The 1000 Genomes Project is a landmark international research initiative aimed at cataloging human genetic variation across diverse populations worldwide. Launched in 2008, it represents one of the most comprehensive efforts to map human genetic diversity, providing a critical resource for genetic research and personalized medicine.

In its final phase, the project analyzed the genomes of 2,504 individuals from 26 populations, detecting over 88 million genetic variants, including single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variations. This extensive dataset is openly available for analysis, providing variant call format (VCF) files, metadata (for example, population panels), and raw sequencing data hosted on public FTP servers and cloud platforms. Researchers use this repository to study population genetics, disease associations, and evolutionary biology, making it a cornerstone of modern genomics.

We’ll showcase the use of AWS HealthOmics, Amazon Athena, and Amazon SageMaker AI Studio notebooks (Studio notebooks) to analyze genomic variant files from the 1000 Genomes Project. We’ll examine genomic proximity across the project populations and how it correlates with each population’s geographical origin, while addressing technical challenges that healthcare and life sciences organizations commonly face.

Challenges

Despite the public availability of such valuable datasets, healthcare and life sciences organizations face significant challenges when analyzing large-scale genomic data, namely:

  • Limited Scalability: Traditional infrastructure (such as local servers and storage area networks) struggles to process the massive datasets generated by bioinformatics, like FASTQ or VCF files, leading to bottlenecks.
  • High Costs: On-premises or smaller cloud solutions often require significant upfront investments in hardware and ongoing maintenance, which can be cost-prohibitive.
  • Integration Complexity: Combining bioinformatics tools and custom pipelines across disparate environments can be cumbersome and error-prone.
  • Compliance and Security Risks: Ensuring HIPAA, GDPR, or other data sovereignty compliance without robust cloud-native tools can be resource-intensive and risky.
  • Collaboration Barriers: On-premises setups often lack seamless data sharing and collaboration capabilities, slowing down multidisciplinary research efforts.

Healthcare and life sciences organizations address these bioinformatic challenges by using services like HealthOmics, Studio notebooks, Athena and Amazon Simple Storage Service (Amazon S3). These services provide AWS customers with:

  • Scalability: HealthOmics processes massive datasets efficiently, while SageMaker handles complex machine learning workflows for bioinformatics at scale.
  • Cost-Effectiveness: HealthOmics and Amazon S3 offer durable, low-cost storage for large genomic files, with tiered pricing options for optimal cost management.
  • Integration: Seamless integration of tools like Nextflow and the Genome Analysis Toolkit (GATK) within HealthOmics enables streamlined pipeline execution.
  • Compliance: HealthOmics provides robust security and supports compliance with HIPAA and other regulatory standards.
  • Collaboration: Centralized storage and services like Studio notebooks streamline data sharing and foster collaboration among global research teams.

Solution Overview

The following architecture diagram describes the AWS services and components used to implement the complete data analysis.

A flow diagram illustrating the bioinformatics workflow using AWS services to process and analyze VCF files. Initially, VCF files are downloaded from the Amazon S3 bucket in the Registry of Open Data and chunked using a Studio notebook. Chunked files are then imported into HealthOmics storage. Imported data are accessed by Athena through AWS Lake Formation and filtered through an Athena query. The resulting dataset is saved back to Amazon S3 and loaded into Studio notebooks for further analysis and visualization.

Figure 1 – Bioinformatics workflow using AWS services to process and analyze VCF files.

It is composed of the following sections:

1. Data Ingestion

a. Preparation of VCF files for import to HealthOmics storage
b. Importing to HealthOmics storage

2. Filtering of ingested data through an Athena query, with the results stored to Amazon S3

3. Loading of data from Amazon S3 to Studio notebooks

4. Dimensionality reduction with principal component analysis (PCA), population clustering and correlation with geo information, using SageMaker

Ingestion to HealthOmics

The 1000 Genomes Project data can be found either on the project’s FTP site or on AWS Data Exchange, publicly available under the AWS Open Data Sponsorship Program. The underlying S3 bucket that hosts the data is based in the us-east-1 (N. Virginia) Region and is accessed through the Amazon S3 URI: s3://1000genomes/.

Due to its iterative nature, the 1000 Genomes Project has produced multiple releases, each aimed at creating a comprehensive catalog of human genetic variation across diverse populations. Each release represents an improvement or addition to the dataset, reflecting advancements in sequencing technology, analytical methods, and expanded participant diversity.

We selected the latest and final release (the 20130502 release), which can be found under the release folder of the S3 bucket. The release contains a VCF file for every human chromosome, along with its corresponding index file. For our analysis we selected chromosome 22, which is the smallest human chromosome in terms of base pairs; its reduced size speeds up our analysis.
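For reference, the release contents can be explored directly from a Studio notebook using boto3. The following is a minimal sketch; the release prefix is an assumption based on the release name described above, and the public bucket is accessed anonymously.

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous access to the public 1000 Genomes bucket (no credentials required)
s3_client = boto3.client('s3', region_name='us-east-1',
                         config=Config(signature_version=UNSIGNED))

# Assumed prefix for the 20130502 release; adjust if the bucket layout differs
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='1000genomes', Prefix='release/20130502/')

# Print the chromosome 22 VCF and its index file
for page in pages:
    for obj in page.get('Contents', []):
        if 'chr22' in obj['Key']:
            print(obj['Key'], obj['Size'])
Python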

It is important to note that VCF files in the 1000 Genomes Project are multi-sample files, meaning that every VCF contains the full list of samples in the project. As such, the VCF files of the final release contain all 2,504 samples, with every sample tracked in a separate column. HealthOmics storage supports importing at most 100 samples per file, which mandates that we split the VCF files into smaller chunks so ingestion can take place. Splitting of the VCF files is implemented through a Studio notebook and bcftools from Bioconda, as sketched below.
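The following is a minimal sketch of the splitting step, assuming bcftools is already installed in the notebook environment (for example, through Bioconda) and the chromosome 22 VCF has been downloaded locally; the file names are hypothetical.

import subprocess

# Hypothetical local file name for the downloaded chromosome 22 VCF
input_vcf = 'ALL.chr22.phase3.vcf.gz'
chunk_size = 100  # HealthOmics variant import supports up to 100 samples per file

# Read the full sample list from the multi-sample VCF
samples = subprocess.run(
    ['bcftools', 'query', '-l', input_vcf],
    capture_output=True, text=True, check=True,
).stdout.split()

# Write each chunk of up to 100 samples to its own compressed VCF
for i in range(0, len(samples), chunk_size):
    with open('samples_chunk.txt', 'w') as f:
        f.write('\n'.join(samples[i:i + chunk_size]))
    output_vcf = f'chr22_chunk_{i // chunk_size:02d}.vcf.gz'
    subprocess.run(
        ['bcftools', 'view', '-S', 'samples_chunk.txt', '-Oz', '-o', output_vcf, input_vcf],
        check=True,
    )
Python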

After splitting the chromosome 22 VCF file into chunks of up to 100 samples, you import them to HealthOmics storage. Importing can take place either through a Studio notebook or through the AWS Management Console, by manually selecting the Amazon S3 URIs of the resulting chunk files. Information on how to import VCF files through a Studio notebook and Python can be found in the Amazon-omics-tutorials GitHub repository of the AWS Samples account.
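As an alternative to the console, the import can be started programmatically with the boto3 HealthOmics client. The following is a minimal sketch; the variant store name, IAM role ARN, and S3 URIs are placeholders that you would replace with your own values.

import boto3

# Placeholder values: replace with your variant store name, the IAM role that
# grants HealthOmics read access to your bucket, and the chunked VCF URIs
variant_store_name = 'your-variant-store'
import_role_arn = 'arn:aws:iam::111122223333:role/your-omics-import-role'
chunk_uris = [
    's3://your-bucket/chunks/chr22_chunk_00.vcf.gz',
    's3://your-bucket/chunks/chr22_chunk_01.vcf.gz',
]

omics_client = boto3.client('omics')

# Start a single variant import job covering all chunked files
response = omics_client.start_variant_import_job(
    destinationName=variant_store_name,
    roleArn=import_role_arn,
    items=[{'source': uri} for uri in chunk_uris],
)
print('Started variant import job:', response['jobId'])
Python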

Data Filtering with Athena

You need to properly configure HealthOmics with AWS Lake Formation so that the variant data becomes available to other AWS services, including Athena. Detailed instructions on how to implement this are described in Configuring Lake Formation to use HealthOmics.

With AWS Lake Formation properly configured, the variant data schema becomes available in an AWS Glue Data Catalog and thus to Athena. Once you select the data source and database in Athena, you can see the data schema in the left-side navigation pane.

A screenshot from the Athena data analytics interface, where a query is executed against the schema of HealthOmics storage. The user interface is divided into three main sections: tables at the left sidebar, query editor in the central section and query results in the bottom section.

Figure 2 – Athena query editor with HealthOmics storage configured as data source

The schema is structured in the form of a single table with basic and complex data types that allow us to run queries, filter our dataset, or execute preprocessing logic. The data schema is presented here:

  • importjobid: string
  • contigname: string
  • start: bigint
  • end: bigint
  • names: array<string>
  • referenceallele: string
  • alternatealleles: array<string>
  • qual: double
  • filters: array<string>
  • splitfrommultiallelic: boolean
  • attributes: map<string, string>
  • phased: boolean
  • calls: array<int>
  • genotypelikelihoods: array<double>
  • phredlikelihoods: array<int>
  • alleledepths: array<int>
  • conditionalquality: int
  • spl: array<int>
  • depth: int
  • ps: int
  • sampleid: string
  • information: map<string, string>
  • annotations: struct

For our analysis we need to calculate the level of variation each sample exhibits in relation to the reference genome. To do that, we count the number of alleles in each sample that differ from the reference allele. We execute this calculation by reading the calls column of our dataset and counting the number of non-zero items in it.

Additionally, we filter out any samples that are not included in the 1000 Genomes panel file (included in the same folder as the VCF files), using the sampleid column. We only have population origin information for the samples listed in the panel, so we restrict the analysis to samples with population-related information. We also exclude sample-variant combinations that do not differ in any allele from the reference. Finally, we group by sampleid and names to guarantee uniqueness for each sample and variant, since some sample-variant combinations have secondary rows in the VCF files.

We execute the following query in the Athena query editor.

select 
    one_thousand_genome_project.sampleid as sample
,   one_thousand_genome_project.names as variant
,   max(cardinality(filter(one_thousand_genome_project.calls, allele->allele != 0))) as counter
from 
    one_thousand_genome_project
where
    one_thousand_genome_project.calls != array[0, 0]
and one_thousand_genome_project.sampleid in ('HG00096','HG00101', ... ,'NA21133','NA21135')
group by 
    one_thousand_genome_project.sampleid
,   one_thousand_genome_project.names
order by 
    one_thousand_genome_project.sampleid
,   one_thousand_genome_project.names
SQL

Results of this query are produced after a five-minute execution time and are stored on Amazon S3, in the form of a CSV file.
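For repeatability, the same query can also be submitted from a Studio notebook with the boto3 Athena client. The following is a minimal sketch; the database name, output location, and query file name are assumptions to adapt to your setup.

import time
import boto3

# Assumed names: the Lake Formation database exposing the variant store, an S3
# location for Athena results, and a local file holding the query shown above
database_name = 'your_omics_database'
output_location = 's3://your-omics-1000-genome-project/your-AmazonAthenaQueryResultsFolder/'
query_string = open('allele_counters.sql').read()

athena_client = boto3.client('athena')

# Submit the query and poll until it completes
execution = athena_client.start_query_execution(
    QueryString=query_string,
    QueryExecutionContext={'Database': database_name},
    ResultConfiguration={'OutputLocation': output_location},
)
query_id = execution['QueryExecutionId']

status = 'RUNNING'
while status in ('QUEUED', 'RUNNING'):
    time.sleep(10)
    state = athena_client.get_query_execution(QueryExecutionId=query_id)
    status = state['QueryExecution']['Status']['State']

print('Query', query_id, 'finished with status', status)
Python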

Population Analysis with SageMaker Studio

Principal Component Analysis
Having our filtered dataset on Amazon S3 in CSV form enables us to execute our population analysis in a Studio notebook.

Our first step is to download the filtered dataset and load it into a pandas data frame. We execute the following code in a Studio notebook.

import boto3
import pandas

# Download the Athena query results (chromosome 22 allele counters)
genomes_bucket_name = 'your-omics-1000-genome-project'
allele_counters_key = 'your-AmazonAthenaQueryResultsFolder/allele_counters.csv'

# Create the S3 client
s3_client = boto3.client('s3')

# Download files
s3_client.download_file(genomes_bucket_name, allele_counters_key, 'allele_counters.csv')

# Load into pandas
df = pandas.read_csv('allele_counters.csv')
Python

This dataset has the three columns defined in the Athena query.

Tabular output of the Athena filtering query results, with three columns: sample, variant, and counter.

Figure 3 – Dataset schema after Athena filtering

We want to reshape this matrix so that it has samples as rows and variants as columns. This adjustment is needed because it will allow us to reduce dimensionality later on with PCA across the variants dimension. We use the pandas pivot method and additionally replace null values with zeros. Null values appear in the dataset because the filtering applied in the Athena query removed sample-variant cells with no alternate alleles; the pivot operation restores those cells, which we then fill with zeros.

# Pivot to sample as index and variants as columns
samples_df = df.pivot(index='sample', columns='variant', values='counter')
samples_df = samples_df.fillna(0)
Python

The format of the new dataset now has samples as rows and variants as columns.

Figure 4 – Dataset schema after pivot operation

This format allows us to apply PCA across the variants and reduce them to only two dimensions, enabling quicker plotting and visualization of any clustering among the samples.

We execute the following code in a Studio notebook.

from sklearn.decomposition import PCA
import plotly.express as px

# Apply PCA (Principal Component Analysis)
pca = PCA(n_components=2)
pca_result = pca.fit_transform(samples_df)
pca_df = pandas.DataFrame(pca_result)
pca_df.columns = ['PC1', 'PC2']

# PCA plot
clr = "rgb(0, 0, 112)"
fig = px.scatter(pca_df, x='PC1', y='PC2', title="PCA", color_discrete_sequence=[clr])
fig.update_layout(xaxis=dict(range=[-100, 150], showgrid=False, zeroline=False))
fig.update_layout(yaxis=dict(range=[-100, 100], showgrid=False, zeroline=False))
fig.update_layout(height=800, width=1200)
fig.show()
Python

Scatter plot visualization representing the PCA result. The plot reveals four clear cluster formations.

Figure 5 – PCA Plot

Visualization of the two-dimensional PCA components reveals significant clustering among the samples. Initially, it seems that there are three or four main clusters, with some of the samples residing across and in between the clusters.

Correlation with Population Origin
To examine the correlation of these clusters with population origin, we add the population origin information to our visualization through coloring. The first step in gaining access to population information is to read the panel file that resides in the same folder as the VCF files. Use the following code to read the panel file.

# Read panel information file
panel_df = pandas.read_csv('panel.csv', sep='\t')
panel_df.head()
Python

Table preview of sample metadata with population, super population, and gender information.

Figure 6 – Sample metadata

We can now merge the pop and super_pop columns of the panel data frame with the PCA data frame created in the PCA analysis step. We can do this directly because the samples appear in the same order in both data frames. Use the following code to merge the columns:

# Merge to PCA pop, super pop columns
plot_df = pandas.concat([pca_df, panel_df[['pop', 'super_pop']]], axis=1)
plot_df.head()
Python

With the super population origin information now available in our data frame, we repeat the plotting and add appropriate coloring:

# PCA with population info plot
ptitle = "PCA with Super Population Info"
fig = px.scatter(plot_df, x='PC1', y='PC2', title= ptitle, color='super_pop')
fig.update_layout(xaxis=dict(range=[-100, 150], showgrid=False, zeroline=False))
fig.update_layout(yaxis=dict(range=[-100, 100], showgrid=False, zeroline=False))
fig.update_layout(height=800, width=1200)
fig.show() 
Python

Scatter plot visualization of the PCA result, along with super population information. Plot reveals that clustering formations are strongly correlated with super population origins.

Figure 7 – PCA Plot with super population info

The resulting plot is fascinating and revealing in terms of the correlation between genomic variation and super population origin. Super population origin seems to dictate cluster formation, with every origin forming its own cluster. An exception to this is the Americas super population, which does not form a cluster of its own but rather spans across and between all other clusters, indicated by the green dots.

An additional enhancement to this plot is to use the pop field from the panel data frame instead of the super_pop field we have used so far. This would help visualize clustering, if any, at the population level of origin, as sketched below.
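A minimal sketch of that variation, reusing the merged data frame from the previous step and changing only the color argument:

# Same plot as before, colored by the finer-grained pop field instead of super_pop
ptitle = "PCA with Population Info"
fig = px.scatter(plot_df, x='PC1', y='PC2', title=ptitle, color='pop')
fig.update_layout(xaxis=dict(range=[-100, 150], showgrid=False, zeroline=False))
fig.update_layout(yaxis=dict(range=[-100, 100], showgrid=False, zeroline=False))
fig.update_layout(height=800, width=1200)
fig.show()
Python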

Conclusion

We demonstrated how AWS HealthOmics, Amazon Athena, and Amazon SageMaker AI Studio notebooks can be effectively combined to analyze genomic variation across populations using the 1000 Genomes Project dataset. Through this analysis, we successfully:

1. Ingested and processed VCF files from the 1000 Genomes Project using HealthOmics storage

2. Leveraged the querying capabilities of Amazon Athena to filter and analyze genomic variants

3. Applied principal component analysis (PCA) using Amazon SageMaker AI Studio notebooks to reduce dimensionality and visualize population clusters

4. Revealed correlations between genomic variations and population origins, with distinct clustering patterns for different super populations

The results showed that super population origins strongly influence genetic clustering patterns, with most super populations forming distinct clusters. Notably, the Americas super population showed unique characteristics by spanning across and between other population clusters, reflecting its diverse genetic heritage.

Consider how AWS services can be adapted to your specific research needs, whether in population genetics, disease research, or personalized medicine. Start your genomic analysis journey today by visiting the AWS for Healthcare & Life Sciences home page.

Contact an AWS Representative to learn how we can help accelerate your business.


Konstantinos Tzouvanas

Konstantinos Tzouvanas is a senior enterprise architect on AWS, specializing in data science and AI/ML. He has extensive experience in optimizing real-time decision-making in High-Frequency Trading (HFT) and applying machine learning to genomics research. Known for leveraging generative AI and advanced analytics, he delivers practical, impactful solutions across industries.