AWS Startups Blog

A Data Lake as Code, Featuring ChEMBL and OpenTargets

Supporting code and deployment guide on GitHub

A data lake is not a single technology. Anyone who tells you otherwise is either selling you something or wrong, sometimes both. At AWS we offer a collection of tightly integrated services like AWS Glue, the AWS Glue Data Catalog, Amazon S3, AWS Lake Formation, and Amazon Athena that make it possible for a company to build a data lake tailored to its own unique needs.

You already have access to the materials (your data) and power tools (AWS). Now how do you build the (lake) house? I propose that a data lake is just another complex and heterogeneous infrastructure problem. As in other architectural disciplines, time has shown us that this kind of complexity and structure is best managed by following an infrastructure-as-code philosophy.

This post illustrates how you might build a data lake-as-code using the AWS Cloud Development Kit (CDK). We will outline the strategy, core data lake services used, associated costs, and how you can tie it all together with code.

The example uses Open Targets and ChEMBL as stand-ins for your own relational or flat-file data sources. If you don’t have a chemistry background, think of Open Targets as simply structured files in S3 and ChEMBL as a Postgres database. That said, here is a little more about these two data sources.

ChEMBL (Postgres database): ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity, and genomic data from 57 datasets to aid the translation of genomic information into effective new drugs. It is supported by the European Molecular Biology Laboratory and includes 2 million distinct compounds and 16 million bioactivity measurements.

Open Targets (CSV and JSON files): Open Targets is a large-scale public-private partnership that uses human genetics and genomics data for systematic drug target identification and prioritization. It integrates public-domain data from 20 data sources and identifies targets based on chemical and bioassay databases, GWAS, and functional genomics. Its partners include GSK, EMBL-EBI, Sanger, Biogen, Takeda, BMS, and Sanofi. It includes 27 thousand targets associated with 13 thousand diseases.

The Strategy:

So let’s zoom way out and look at a high-level block diagram of what we are trying to achieve.

Source Data – You likely already have a bunch of data you want to be able to query across. These may be open or proprietary data sets in formats like CSV, JSON, or Parquet, Postgres or MySQL databases, streaming data, and so on.

Data Lake – Your source data needs to get ‘enrolled’ into the data lake, where metadata (like schema and data types) is tracked, extract, transform, and load (ETL) work is performed, and long-term, cost-efficient storage is provided.

Data Access Layer – This is the engine that lets your analytical layer run queries on your data lake. The engine you use will depend on the data, access patterns, and cost requirements (hot/warm/cold) of your analytical tier.

Analytics – This is where your notebooks, BI tools like Tableau or Amazon QuickSight, or your machine learning toolchain live. These tools connect to the data access layer over common SQL/JDBC/ODBC endpoints. It’s also worth mentioning that data produced in the analytical environment often becomes new source data of its own to be enrolled in the data lake.

A little background on AWS data lake primitives:

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

The AWS Glue Data Catalog contains references to data that is used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue. To create a data warehouse or data lake, you must catalog this data. The AWS Glue Data Catalog is an index to the location, schema, and runtime metrics of your data.

AWS Glue Crawlers are used to populate the AWS Glue Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog.

An AWS Glue Job is the business logic that performs the extract, transform, and load (ETL) work in AWS Glue. When you start a job, AWS Glue runs a script that extracts data from sources, transforms the data, and loads it into targets.

AWS Glue Workflows are used to create and visualize complex extract, transform, and load (ETL) activities involving multiple crawlers, jobs, and triggers. Each workflow manages the execution and monitoring of all its components. As a workflow runs each component, it records execution progress and status, providing you with an overview of the larger task and the details of each step. The AWS Glue console provides a visual representation of a workflow as a graph.
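To make those primitives a little more concrete, here is a minimal CDK sketch of a crawler, an ETL job, and a conditional trigger grouped into one workflow. This is not the construct used later in this post: the role ARN, bucket paths, script location, and resource names are placeholders, and the ON_DEMAND trigger that would kick off the source crawler is omitted for brevity.

import * as cdk from '@aws-cdk/core';
import * as glue from '@aws-cdk/aws-glue';

export class MiniEnrollmentStack extends cdk.Stack {
    constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);

        // Placeholder IAM role that Glue will assume for crawling and ETL.
        const glueRoleArn = 'arn:aws:iam::123456789012:role/example-glue-role';

        // Data Catalog database the source crawler will populate.
        new glue.CfnDatabase(this, 'SourceDatabase', {
            catalogId: this.account,
            databaseInput: { name: 'example_src' }
        });

        // Workflow that groups the crawler, trigger, and job into one run.
        const workflow = new glue.CfnWorkflow(this, 'EnrollmentWorkflow', {
            name: 'example-enrollment'
        });

        // Crawler that infers table schemas from files under an S3 prefix.
        const sourceCrawler = new glue.CfnCrawler(this, 'SourceCrawler', {
            name: 'example-source-crawler',
            role: glueRoleArn,
            databaseName: 'example_src',
            targets: { s3Targets: [{ path: 's3://example-source-bucket/example-prefix/' }] }
        });

        // ETL job that transforms the crawled source data into the data lake.
        const etlJob = new glue.CfnJob(this, 'EtlJob', {
            name: 'example-etl-job',
            role: glueRoleArn,
            command: {
                name: 'glueetl',
                pythonVersion: '3',
                scriptLocation: 's3://example-scripts-bucket/glue.example.py'
            }
        });

        // Conditional trigger: start the ETL job once the source crawler succeeds.
        new glue.CfnTrigger(this, 'EtlTrigger', {
            type: 'CONDITIONAL',
            workflowName: workflow.name,
            actions: [{ jobName: etlJob.name }],
            predicate: {
                conditions: [{
                    logicalOperator: 'EQUALS',
                    crawlerName: sourceCrawler.name,
                    crawlState: 'SUCCEEDED'
                }]
            }
        });
    }
}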

So what does the end result look like?

You can freely write SQL against your enrolled data sets. You can even join tables from data sources that were enrolled from entirely different file formats. You can run those SQL queries in an exploratory fashion directly with Amazon Athena, in Jupyter/Zeppelin notebooks, or with a BI tool like Amazon QuickSight or Tableau, all without requiring a server.

In the ChEMBL and Open Targets example, those data sets are typically accessed through each data provider’s distinct API. Implementing an API client to consume data from each data source isn’t well suited to exploration or ad hoc analysis across multiple data sources. Enrolling the data sets into the data lake lets us standardize the location and access language (SQL), making consumption and analysis much more agile.
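As a quick illustration, once ChEMBL is enrolled, the same SQL you would type into the Athena console can also be issued programmatically. The sketch below uses the AWS SDK for JavaScript; the Glue database name ('chembl_25') and the results bucket are assumptions, so substitute whatever the data lake crawler created in your account.

import * as AWS from 'aws-sdk';

const athena = new AWS.Athena();

// Run a query against an enrolled ChEMBL table and print the raw result rows.
// 'chembl_25' and the results bucket are placeholders; substitute your own.
async function queryChembl(): Promise<void> {
    const start = await athena.startQueryExecution({
        QueryString: 'SELECT molregno, standard_inchi_key FROM chembl_25_public_compound_structures LIMIT 10',
        QueryExecutionContext: { Database: 'chembl_25' },
        ResultConfiguration: { OutputLocation: 's3://example-athena-results-bucket/' }
    }).promise();

    const queryExecutionId = start.QueryExecutionId!;

    // Poll until the query finishes.
    while (true) {
        const execution = await athena.getQueryExecution({ QueryExecutionId: queryExecutionId }).promise();
        const state = execution.QueryExecution!.Status!.State;
        if (state === 'SUCCEEDED') { break; }
        if (state === 'FAILED' || state === 'CANCELLED') { throw new Error(`Query ${state}`); }
        await new Promise(resolve => setTimeout(resolve, 1000));
    }

    const results = await athena.getQueryResults({ QueryExecutionId: queryExecutionId }).promise();
    console.log(JSON.stringify(results.ResultSet!.Rows, null, 2));
}

queryChembl().catch(console.error);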

ChEMBL via Athena example:

OpenTargets in a Notebook:


Data Lake Service Diagram

Thinking back to our block diagram from earlier, let’s fill in ‘Source Data’, ‘Data Lake’, ‘Data Access Layer’, and ‘Analytics’.

What’s going on here?

Source Data – We have a little helper EC2 instance that downloads Open Targets and ChEMBL and imports them into S3 and an RDS Postgres instance, respectively. Think of this as simulating a bucket and database you may already have.

Data Lake – Each data set has its own ‘Enrollment Workflow’ that consists of a source crawler, a Glue job, a data lake crawler, and some workflow components. These workflows are responsible for ‘enrolling’ the source data into the metastore, performing any ETL, and then ‘enrolling’ the newly transformed data into the metastore.

Data Access Layer – Athena serves as the only query engine we need in this situation. In contrast to other data access layer tools like EMR or Redshift, Athena is serverless. There is no ongoing cost associated with Athena; you pay only for the data scanned by the queries you run. Athena returns results in low single-digit seconds against our example data sets, so there is no need for something more complicated or expensive.

Analytics – We are using a managed JupyterLab notebook from SageMaker in conjunction with the PyAthena library. PyAthena lets you write SQL against Athena and get back conventional NumPy/pandas data frames for more sophisticated analysis, plotting, or machine learning.

How are we tying all of this together?

In short, with the AWS Cloud Development Kit. The AWS Cloud Development Kit (AWS CDK) is an open source software development framework to model and provision your cloud application resources using familiar programming languages.

The AWS CDK uses CloudFormation under the covers to provision resources, but it allows you to express your infrastructure as code in ways that were not possible in the classic CloudFormation markup language. Because the CDK is built on general-purpose programming languages, it allows for things like inheritance, type safety, modern language features, native code reuse, and other object-oriented techniques. You are able to be much more expressive in how you bolt together AWS primitives into your own constructs. That’s exactly what we did here.

We start with a base ‘DataLakeEnrollment’ class that has been extended to generic S3 and RDS data set enrollment classes.
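In simplified form, that hierarchy looks roughly like the sketch below. This is an abridged illustration, not the repo’s actual code; the real classes define props interfaces and many more members.

import * as cdk from '@aws-cdk/core';

// Base construct: owns the shared enrollment plumbing (Glue database, crawlers,
// job, workflow) and the permission-granting helpers.
export abstract class DataLakeEnrollment extends cdk.Construct {
    // ...common enrollment resources and grant helpers...
}

// S3 flavor: points the source crawler at one or more S3 prefixes.
export class S3dataSetEnrollment extends DataLakeEnrollment {
    // ...S3-specific crawler targets...
}

// RDS flavor: adds a JDBC connection and JDBC crawler targets for Postgres.
export class RDSPostgresDataSetEnrollment extends DataLakeEnrollment {
    // ...Postgres-specific connection and targets...
}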

We can reuse these generic S3 and RDS enrollment constructs for just about anything. But here is what it looks like for Open Targets on S3.

        // Enroll the Open Targets 19.11 release from the source bucket into the data lake.
        const openTargets1911 = new S3dataSetEnrollment(this, 'openTargets-1911-enrollment', {
            DataSetName: "opentargets_1911",
            sourceBucket: props.sourceBucket,
            // S3 prefixes holding the 19.11 release files that the source crawler will crawl.
            sourceBucketDataPrefixes: [
                `${props.sourceBucketDataPrefix}19.11_association_data/`,
                `${props.sourceBucketDataPrefix}19.11_disease_list/`,
                `${props.sourceBucketDataPrefix}19.11_evidence_data/`,
                `${props.sourceBucketDataPrefix}19.11_target_list/`
            ],
            dataLakeBucket: props.dataLakeBucket,
            // The Glue ETL script and the arguments passed to it at run time.
            GlueScriptPath: "scripts/glue.s3import.opentargets.py",
            GlueScriptArguments: {
                "--job-language": "python",
                "--job-bookmark-option": "job-bookmark-disable",
                "--enable-metrics": "",
                "--DL_BUCKET": props.dataLakeBucket.bucketName,
                "--DL_PREFIX": "/opentargets_1911/",
                "--GLUE_SRC_DATABASE": "opentargets_1911_src"
            }
        });

Under the covers, the S3dataSetEnrollment construct will set up all of the necessary AWS Glue primitives described above. This includes the data catalog database, source crawler, AWS Glue job, data lake crawler, and the AWS Glue workflow. This works similarly for ChEMBL on RDS using the RDSPostgresDataSetEnrollment construct:

        // Enroll the ChEMBL 25 Postgres database into the data lake.
        // dataSetName is defined earlier in the stack.
        const chembl25 = new RDSPostgresDataSetEnrollment(this, 'chembl-25-enrollment', {
            databaseSecret: props.databaseSecret,
            database: props.database,
            accessSecurityGroup: props.accessSecurityGroup,
            dataLakeBucket: props.dataLakeBucket,
            DataSetName: dataSetName,
            // JDBC include path used by the source crawler.
            JdbcTargetIncludePaths: ["chembl_25/%"],
            GlueScriptPath: "scripts/glue.s3import.chembl25.py",
            GlueScriptArguments: {
                "--job-language": "python",
                "--job-bookmark-option": "job-bookmark-disable",
                "--enable-metrics": "",
                "--DL_BUCKET": props.dataLakeBucket.bucketName,
                "--DL_PREFIX": "/"+dataSetName+"/",
                "--DL_REGION": cdk.Stack.of(this).region,
                "--GLUE_SRC_DATABASE": "chembl_25_src"
            }
        });

The base DataSetEnrollment construct class diagram shows some of the underlying CfnCrawler, CfnTrigger, CfnWorkflow, CfnConnection, and Database resources that are getting plumbed together for you.

You can visualize these workflows in Glue itself by opening either workflow in the Glue console. Here you can see the OpenTargets workflow currently running the Glue job converting the JSON and CSV files into Parquet format.

Permissions

There are two methods of security you can apply to your data lake. The default method, which is likely what you are using at the moment, is essentially “open” Lake Formation permissions and “fine-grained” IAM policies. The grantIamRead() method of the DataSet grants a “fine-grained” IAM policy that gives users read access to just the tables in the data set you perform the grant on.

For example, in the bin/aws.ts file you can see an example of granting that “fine-grained” IAM read permission. Pretty easy! Here we are passing the role from the notebook, but you can import an existing IAM user, role, or group using the CDK.

chemblStack.grantIamRead(analyticsStack.NotebookRole);
openTargetsStack.grantIamRead(analyticsStack.NotebookRole);
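Likewise, if the principal already exists in your account, you can import it and grant it the same read access. Here is a hedged example with a placeholder role ARN:

const existingAnalystRole = iam.Role.fromRoleArn(coreDataLake, 'existingAnalystRole',
    'arn:aws:iam::123456789012:role/example-analyst-role'); // placeholder ARN
chemblStack.grantIamRead(existingAnalystRole);
openTargetsStack.grantIamRead(existingAnalystRole);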

The other method of security gives you more control, specifically the ability to control permissions at the database, table, and column level. This requires “fine-grained” Lake Formation permissions and “coarse” IAM permissions. The grantDatabasePermissions(), grantTablePermissions(), and grantTableWithColumnPermissions() methods set up both the fine-grained Lake Formation and coarse IAM permissions for you.

Again, another example in the bin/aws.ts file:

const exampleUser = iam.User.fromUserName(coreDataLake, 'exampleGrantee', 'paulUnderwood');
var exampleTableWithColumnsGrant: DataLakeEnrollment.TableWithColumnPermissionGrant = {
    table: "chembl_25_public_compound_structures",
    // Note that we are NOT including 'canonical_smiles'. That effectively prevents this user from querying that column.
    columns: ['molregno', 'molfile', 'standard_inchi', 'standard_inchi_key'],
    DatabasePermissions: [],
    GrantableDatabasePermissions: [],
    TableColumnPermissions: [DataLakeEnrollment.TablePermission.Select],
    GrantableTableColumnPermissions: []
};
chemblStack.grantTableWithColumnPermissions(exampleUser, exampleTableWithColumnsGrant);

To illustrate the relationship between the fine-grained and coarse permissions, think of it as two doors. An IAM principal needs to have permission to walk through both doors to query the data lake. The DataLakeEnrollment construct handles granting both the fine and coarse permissions for you.

The GrantableDatabasePermissions, GrantableTablePermissions, and GrantableTableColumnPermissions properties give the supplied IAM principal permission to grant permissions to others. If you have a data set steward, or someone who should have the authority to grant permissions to others, you can “grant the permission to grant” using those properties.
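Continuing the bin/aws.ts example, here is a hedged sketch of what that might look like for a hypothetical steward (the user name is made up): the steward gets Select on the same columns plus the ability to grant Select to other principals.

const stewardUser = iam.User.fromUserName(coreDataLake, 'stewardGrantee', 'exampleDataSteward');
var stewardGrant: DataLakeEnrollment.TableWithColumnPermissionGrant = {
    table: "chembl_25_public_compound_structures",
    columns: ['molregno', 'molfile', 'standard_inchi', 'standard_inchi_key'],
    DatabasePermissions: [],
    GrantableDatabasePermissions: [],
    TableColumnPermissions: [DataLakeEnrollment.TablePermission.Select],
    // The steward can pass Select on these columns along to other principals.
    GrantableTableColumnPermissions: [DataLakeEnrollment.TablePermission.Select]
};
chemblStack.grantTableWithColumnPermissions(stewardUser, stewardGrant);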

If you decide that you want the additional flexibility of Lake Formation permissions, you need to perform two manual actions before Lake Formation permissions will begin protecting your resources. Until you perform these two steps, you are only protecting your resources with the coarse IAM permissions, and the Lake Formation permissions won’t apply.

The process of upgrading to the Lake Formation permissions model generally takes four steps.

1)    Determine your users’ existing IAM permissions for each database and table.

2)    Replicate these permissions in Lake Formation.

3)    For each Amazon S3 location that contains data:

a.     Revoke the Super permission from the IAMAllowedPrincipals group on each Data Catalog resource that references that location.

b.    Register the location with Lake Formation.

4)    Clean up existing fine-grained access control IAM policies.
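As one illustration of step 3b, registering an S3 location can itself be expressed in CDK. This is a minimal sketch, assuming it runs inside a construct (such as the DataLakeStack) that has a reference to the data lake bucket:

import * as lakeformation from '@aws-cdk/aws-lakeformation';

// Register the data lake bucket as a Lake Formation managed location.
// useServiceLinkedRole lets Lake Formation access the location with its
// service-linked role instead of a role you manage yourself.
new lakeformation.CfnResource(this, 'DataLakeBucketLakeFormationResource', {
    resourceArn: props.dataLakeBucket.bucketArn,
    useServiceLinkedRole: true
});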

Some of these items are handled automatically by the CDK application. Some of the account-level initial steps need to be done manually. You can find more detailed instructions on upgrading to Lake Formation permissions in the code repo and Lake Formation documentation.

Costs

To enroll Open Targets and ChEMBL, we are looking at about $2.00 in AWS Glue charges. This cost only applies when you rerun the AWS Glue workflow, which you may not need to do if the data does not change very often.

The total size of the Open Targets and ChEMBL data sets in compressed Parquet format is ~7 GB. At S3 Standard pricing (about $0.023 per GB-month), that costs roughly $0.16 per month in storage. This is the only cost that you would carry every month regardless of how many queries you run.

The Athena costs below are based on the queries run in the example notebook, which processes about 0.003 TB if you run it from end to end. At Athena’s $5 per TB scanned, that works out to roughly $0.015 in Athena costs to fully run that notebook.

The SageMaker notebook and EBS costs are similar to conventional EC2 charges. In the breakdown below, we assume the notebook is up for 4 hours with 100 GB of storage attached to it before it is stopped. That notebook plus EBS for 4 hours totals $0.378.

Where to go from here?

You should be able to reuse, extend, or improve the S3dataSetEnrollment or RDSPostgresDataSetEnrollment classes to enroll your own data sets. That’s the beauty of the CDK. Feel free to comment on this post or the corresponding GitHub repo, roadmap, and issues page, or make pull requests with new data set constructs.

Some of the other enrollment types we are planning to include cover sources of your own, like a commercial compound registry such as ChemAxon, Amazon Kinesis, Amazon Aurora, MySQL, or even data coming directly off lab instruments. We plan on adding support for VCF file types in the near future. You can keep track of future roadmap plans on our GitHub page.

Right now, the simplest class in the whole project is the DataLakeStack. Its only resources are an S3 bucket, some Lake Formation resources, and some account-level configuration. In follow-up blog posts, we will improve the DataLakeStack class to include an Elasticsearch option for improved indexing and discoverability.

We are also planning to improve the example Jupyter notebook included in the supporting deployment guide to demonstrate how you can take data straight from your data lake into Amazon SageMaker for machine learning.

Attributions:

“ChEMBLdb” by European Bioinformatics Institute, European Molecular Biology Laboratory is licensed under CC BY-SA 3.0

“Open Targets Platform: new developments and updates two years on” by Denise Carvalho-Silva, Andrea Pierleoni, Miguel Pignatelli, ChuangKee Ong, Luca Fumis, Nikiforos Karamanis, Miguel Carmona, Adam Faulconbridge, Andrew Hercules, Elaine McAuley, Alfredo Miranda, Gareth Peat, Michaela Spitzer, Jeffrey Barrett, David G Hulcoop, Eliseo Papa, Gautier Koscielny, Ian Dunham, Open Targets

Special thanks to Sean Murphy (AWS, Sr. Startup HCLS SA) and Aaron Arvey (Third Rock Ventures, Director of Machine Learning) for their contributions.