AWS Big Data Blog

Building fast ETL using SingleStore and AWS Glue

Disparate data systems have become the norm in many companies. The reasons vary: different teams select the data systems best suited to their primary functions, responsibility for choosing these systems may be decentralized across departments, a merged company may still run the separate systems of the formerly independent companies, and so on. Data integration combines data from these disparate sources so users throughout the organization can fully leverage the value in the data and gain meaningful insights. AWS Glue is a fully managed, serverless data integration service that makes it easy to extract, transform, and load (ETL) data from various sources for analytics and data processing with Apache Spark ETL jobs. The AWS Glue Spark runtime supports connectivity to popular data sources such as Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, Amazon Redshift, and Apache Kafka.

We recognize that disparate data systems exist because they best fit your application needs. AWS Glue custom connectors in AWS Glue Studio extend AWS Glue support to data sources beyond the native connection types. You can now discover and subscribe to AWS Glue ETL connectors on AWS Marketplace for the data sources that best fit your needs. AWS Partner SingleStore provides a relational SQL database that can handle both transactional and analytical workloads in a single system. New applications that combine transactional and analytical (HTAP, hybrid transaction/analytical processing) requirements can take advantage of SingleStore DB. SingleStore offers a connector for AWS Glue, built on the Apache Spark DataSource API, through AWS Marketplace. The fully managed, scale-out Apache Spark environment that AWS Glue provides for ETL jobs is a good match for SingleStore's distributed SQL design.

This post shows how you can use an AWS Glue custom connector from AWS Marketplace, based on the Apache Spark DataSource API, in AWS Glue Studio to create ETL jobs in minutes using an easy-to-use graphical interface.

The following architecture diagram shows SingleStore connecting with AWS Glue for an ETL job.

Now you can easily subscribe to the SingleStore connector on AWS Marketplace and create a connection to your SingleStore cluster. VPC networking and integration with AWS Secrets Manager for authentication credentials are supported for the connection.

Walkthrough overview

In this post, we demonstrate how to connect a SingleStore cluster in an AWS Glue ETL job as the source, transform the data, and store it back on a SingleStore cluster and in Apache Parquet format on Amazon S3. We use the TPC-H Benchmark dataset that is available as a sample dataset in SingleStore.

To successfully create the ETL job using a custom ETL connector from AWS Marketplace, you complete the following steps:

  1. Store authentication credentials in Secrets Manager.
  2. Create an AWS Identity and Access Management (IAM) role for the AWS Glue ETL job.
  3. Configure the SingleStore connector and connection.
  4. Create an ETL job using the SingleStore connection in AWS Glue Studio.

Storing authentication credentials in Secrets Manager

AWS Glue provides integration with Secrets Manager to securely store connection authentication credentials. Follow these steps to create these credentials:

  1. On the Secrets Manager console, choose Store a new secret.
  2. For Select a secret type, select Other type of secrets.
  3. For Secret key/value, set one row for each of the following parameters:
    1. ddlEndpoint
    2. database
    3. user
    4. password
  4. Choose Next.

  1. For Secret name, enter aws-glue-singlestore-connection-info.
  2. Choose Next.
  3. Keep the Disable automatic rotation check box selected.
  4. Choose Next.
  5. Choose Store.
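If you prefer to script this step, you can create the same secret with the AWS SDK for Python (Boto3). The following is a minimal sketch; the endpoint, database, user, and password values are placeholders for your own SingleStore cluster details.

import json
import boto3

# Create the secret that the AWS Glue connection will reference.
# The keys must match what the SingleStore connector expects:
# ddlEndpoint, database, user, and password.
secretsmanager = boto3.client("secretsmanager")

secretsmanager.create_secret(
    Name="aws-glue-singlestore-connection-info",
    SecretString=json.dumps(
        {
            "ddlEndpoint": "svc-example.singlestore.com:3306",  # placeholder endpoint
            "database": "tpch",                                  # placeholder database
            "user": "admin",                                     # placeholder user
            "password": "YOUR_PASSWORD",                         # placeholder password
        }
    ),
)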

Creating an IAM role for the AWS Glue ETL job

In this section, you create a role with an attached policy to allow read-only access to credentials that are stored in Secrets Manager for the AWS Glue ETL job.

  1. On the IAM console, choose Policies.
  2. Choose Create policy.
  3. On the JSON tab, enter the following JSON snippet, providing your Region and account ID:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret"
                ],
                "Resource": "arn:aws:secretsmanager:<REGION>:<ACCOUNT_ID>:secret:aws-glue-*"
            }
        ]
    }
  4. Choose Review Policy.
  5. Give your policy a name, for example, GlueAccessSecretValue.
  6. In the navigation pane, choose Roles.
  7. Choose Create role.
  8. For Select type of trusted entity, choose AWS service.
  9. Choose Glue.

  1. Choose Next: Permissions.
  2. Search for the AWS managed policies AWSGlueServiceRole and AmazonEC2ContainerRegistryReadOnly, and select them.
  3. Search for the GlueAccessSecretValue policy created earlier, and select it.
  4. For Role name, enter a name, for example, GlueCustomETLConnectionRole.
  5. Confirm the three policies are selected.

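If you want to create the same role programmatically, the following Boto3 sketch shows the equivalent calls. It assumes the GlueAccessSecretValue policy already exists in your account; replace the <ACCOUNT_ID> placeholder with your own account ID.

import json
import boto3

iam = boto3.client("iam")

# Trust policy that allows the AWS Glue service to assume the role
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="GlueCustomETLConnectionRole",
    AssumeRolePolicyDocument=json.dumps(assume_role_policy),
)

# Attach the two AWS managed policies and the GlueAccessSecretValue policy created earlier
for policy_arn in [
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
    "arn:aws:iam::<ACCOUNT_ID>:policy/GlueAccessSecretValue",  # placeholder account ID
]:
    iam.attach_role_policy(RoleName="GlueCustomETLConnectionRole", PolicyArn=policy_arn)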

Configuring your SingleStore connector and connection

To connect to SingleStore, complete the following steps:

  1. On the AWS Glue console, choose AWS Glue Studio.
  2. Choose Connectors.
  3. Choose Go to AWS Marketplace.

  1. Subscribe to the SingleStore connector for AWS Glue from AWS Marketplace.
  2. Activate the connector from AWS Glue Studio.
  3. In the navigation pane, under Connectors, choose Create connection.
  4. For Name, enter a name, such as SingleStore_connection.
  5. For AWS Secret, choose the secret aws-glue-singlestore-connection-info created earlier.

  1. Choose Create connection.

Creating an ETL job using the SingleStore connection in AWS Glue Studio

After you define the SingleStore connection, you can start authoring the job using this connection.

  1. On the AWS Glue Studio console, choose Connectors.
  2. Select your connector and choose Create job.

An untitled job is created with the connection as the source node.

  1. On the Job details page, for Name, enter SingleStore_tpch_transform_job.
  2. For Description, enter Glue job to transform tpch data from SingleStore DB.
  3. For IAM Role, choose GlueCustomETLConnectionRole.
  4. Keep the other properties at their default.

  1. On the Visual page, on the Data source properties – Connector tab, expand Connection options.
  2. For Key, enter dbtable.
  3. For Value, enter lineitem.

Because AWS Glue Studio uses information stored in the connection to access the data source, instead of retrieving metadata from a Data Catalog table, you must provide the schema metadata for the data source yourself. Use the schema editor to update the source schema. For instructions on how to use the schema editor, see Editing the schema in a custom transform node.
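Behind the visual editor, AWS Glue Studio generates a job script. A data source node configured this way corresponds roughly to the following call, assuming glueContext is the GlueContext in the generated script; the connection name and table match the values used in this walkthrough.

# Read the lineitem table from SingleStore through the marketplace connector
lineitem_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "dbtable": "lineitem",
        "connectionName": "SingleStore_connection",
    },
    transformation_ctx="SingleStore_source",
)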

  1. Choose the + icon.
  2. For Node type, choose DropFields.
  3. On the Transform tab, select the fields to drop.

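In the generated script, a DropFields node maps to the DropFields transform. The following sketch continues from the read example above and drops two illustrative lineitem columns; the fields you drop depend on your use case.

from awsglue.transforms import DropFields

# Drop columns that are not needed downstream (field names are illustrative)
trimmed_dyf = DropFields.apply(
    frame=lineitem_dyf,
    paths=["l_comment", "l_shipinstruct"],
    transformation_ctx="DropFields_node",
)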

  1. Choose the + icon.
  2. For Node type, choose Custom Transform.
  3. On the Transform tab, add your code to the custom script.

For this post, we calculate two additional columns, disc_price and price. Then we use glueContext.write_dynamic_frame to write the updated data back to SingleStore using the SingleStore_connection connection we created. See the following code:

# DynamicFrame and DynamicFrameCollection are already imported in the script AWS Glue Studio generates
from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    from pyspark.sql.functions import col

    # Convert the incoming DynamicFrame to a Spark DataFrame
    df = dfc.select(list(dfc.keys())[0]).toDF()

    # Calculate the two additional columns
    df = df.withColumn("disc_price", (col("l_extendedprice") * (1 - col("l_discount"))).cast("decimal(10,2)"))
    df = df.withColumn("price", (col("disc_price") * (1 + col("l_tax"))).cast("decimal(10,2)"))

    # Convert back to a DynamicFrame for the Glue writer
    dyf = DynamicFrame.fromDF(df, glueContext, "updated_lineitem")

    # Write the updated data back to SingleStore through the marketplace connector
    glueContext.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="marketplace.spark",
        connection_options={"dbtable": "updated_lineitem", "connectionName": "SingleStore_connection"},
    )

    return DynamicFrameCollection({"CustomTransform0": dyf}, glueContext)

  1. On the Output schema tab, add the additional columns price and disc_price created in the custom script.

  1. Keep the default for node SelectFromCollection.
  2. Choose the + icon.
  3. For Node type, choose Data Target – S3.
  4. On the Data target properties – S3 tab, for Format, choose Parquet.
  5. For S3 Target Location, enter s3://aws-glue-assets-{Your Account ID as a 12-digit number}-{AWS region}/output/.

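In script form, the SelectFromCollection and S3 target nodes correspond roughly to the following sketch. Here transformed_collection stands for the DynamicFrameCollection returned by MyTransform, and the placeholders in the S3 path should be replaced as in the target location above.

from awsglue.transforms import SelectFromCollection

# Pick the single DynamicFrame out of the collection returned by the custom transform
selected_dyf = SelectFromCollection.apply(
    dfc=transformed_collection,
    key=list(transformed_collection.keys())[0],
    transformation_ctx="SelectFromCollection_node",
)

# Write the result to Amazon S3 in Apache Parquet format
glueContext.write_dynamic_frame.from_options(
    frame=selected_dyf,
    connection_type="s3",
    format="parquet",
    connection_options={"path": "s3://aws-glue-assets-<ACCOUNT_ID>-<REGION>/output/"},
    transformation_ctx="S3_target",
)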

  1. Choose Save.
  2. Choose Run.

In the following screenshot, a new table updated_lineitem is created with the two additional columns disc_price and price.

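After the job finishes, you can also quickly check that the Parquet output on Amazon S3 contains the derived columns, for example from a notebook or Spark shell with an active SparkSession named spark. This is a minimal sketch; adjust the path to match your bucket.

# Read back the Parquet files written by the job and inspect the derived columns
output_df = spark.read.parquet("s3://aws-glue-assets-<ACCOUNT_ID>-<REGION>/output/")
output_df.select("l_extendedprice", "l_discount", "l_tax", "disc_price", "price").show(5)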

Conclusion

In this post, you learned how to subscribe to the SingleStore connector for AWS Glue on AWS Marketplace, activate the connector from AWS Glue Studio, and create an ETL job in AWS Glue Studio that uses the SingleStore connector as both source and target with a custom transform. You can use AWS Glue Studio to speed up ETL job creation, use connectors from AWS Marketplace or bring your own custom connectors, and enable different personas to transform data without any prior coding experience.


About the Author

Saurabh Shanbhag is a Partner Solutions Architect at AWS with over 12 years of experience working with data integration and analytics products. He focuses on enabling partners to build and enhance joint solutions on AWS.