AWS Big Data Blog

Seamlessly Integrate Data on Google BigQuery and ClickHouse Cloud with AWS Glue

Migrating from Google Cloud’s BigQuery to ClickHouse Cloud on AWS allows businesses to leverage the speed and efficiency of ClickHouse for real-time analytics while benefiting from AWS’s scalable and secure environment. This article provides a comprehensive guide to executing a direct data migration using AWS Glue ETL, highlighting the advantages and best practices for a seamless transition.

AWS Glue ETL enables organizations to discover, prepare, and integrate data at scale without the burden of managing infrastructure. With its built-in connectivity, Glue can seamlessly read data from Google Cloud’s BigQuery and write it to ClickHouse Cloud on AWS, removing the need for custom connectors or complex integration scripts. Beyond connectivity, Glue also provides advanced capabilities such as a visual ETL authoring interface, automated job scheduling, and serverless scaling, allowing teams to design, monitor, and manage their pipelines more efficiently. Together, these features simplify data integration, reduce latency, and deliver significant cost savings, enabling faster and more reliable migrations.

Prerequisites

Before using AWS Glue to integrate data into ClickHouse Cloud, you must first set up the ClickHouse environment on AWS. This includes creating and configuring your ClickHouse Cloud service on AWS, making sure network access and security groups are properly defined, and verifying that the cluster endpoint is accessible (a quick connectivity check is sketched after the setup steps below). Once the ClickHouse environment is ready, you can use the ClickHouse connector for AWS Glue to seamlessly write data into ClickHouse Cloud from sources such as Google Cloud BigQuery. Follow the next section to complete the setup.

  1. Set up ClickHouse Cloud on AWS
    1. Follow the ClickHouse official documentation to set up the environment (remember to allow remote access in the config file if you are using ClickHouse OSS)
      https://clickhouse.com/docs/get-started/quick-start
  2. Subscribe to the ClickHouse Glue Marketplace connector
    1. Open Glue Connectors and choose Go to AWS Marketplace
    2. On the list of AWS Glue Marketplace connectors, enter ClickHouse in the search bar. Then choose ClickHouse Connector for AWS Glue
    3. Choose View purchase options at the top right of the view
    4. Review the Terms and Conditions and choose Accept Terms
    5. Choose Continue to Configuration once it’s enabled
    6. In the Follow the vendor’s instructions part of the connector instructions, choose the connector enabling link in step 3
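
Before moving on, it helps to confirm that the ClickHouse Cloud endpoint is reachable from your network. The following is a minimal sketch (not part of the official setup steps) that sends a SELECT 1 query to the ClickHouse HTTPS interface on port 8443; the hostname, user, and password shown are placeholders for your own service.

  # Minimal connectivity check against the ClickHouse Cloud HTTPS interface (port 8443).
  # The hostname, user, and password below are placeholders -- replace them with your own values.
  import requests

  CLICKHOUSE_HOST = "YOUR_CLICKHOUSE_CONNECTION.us-east-1.aws.clickhouse.cloud"
  CLICKHOUSE_USER = "default"
  CLICKHOUSE_PASSWORD = "YOUR_PASSWORD"

  response = requests.get(
      f"https://{CLICKHOUSE_HOST}:8443/",
      params={"query": "SELECT 1"},
      auth=(CLICKHOUSE_USER, CLICKHOUSE_PASSWORD),
      timeout=10,
  )
  print(response.status_code, response.text.strip())  # expect 200 and "1"

A 200 status code and a body of "1" confirm that the endpoint and credentials are working.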

Configure AWS Glue ETL Job for ClickHouse Integration

AWS Glue enables direct migration by connecting to ClickHouse Cloud on AWS through its connectors, allowing for seamless ETL operations. Within the Glue console, users can configure jobs that read data from Google BigQuery and write it directly to ClickHouse Cloud. The AWS Glue Data Catalog can be used to catalog source data for efficient processing, while Glue’s PySpark support allows for complex data transformations, including data type conversions, to maintain compatibility with ClickHouse’s schema.
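
For example, the following is a minimal sketch of a type conversion step in PySpark. The sample rows and column names (event_time, revenue) are hypothetical and stand in for data read from BigQuery; the casts map them onto types that align with ClickHouse DateTime and Decimal columns.

  # Minimal sketch of a PySpark type conversion step before writing to ClickHouse.
  # The sample rows and column names are hypothetical.
  from pyspark.sql import SparkSession
  from pyspark.sql.functions import col, to_timestamp

  spark = SparkSession.builder.getOrCreate()

  # Stand-in for a DataFrame read from BigQuery.
  bq_df = spark.createDataFrame(
      [("2024-01-01 10:30:00", "19.99")],
      ["event_time", "revenue"],
  )

  # Cast string columns to types that map cleanly onto ClickHouse DateTime and Decimal.
  converted_df = (
      bq_df
      .withColumn("event_time", to_timestamp(col("event_time"), "yyyy-MM-dd HH:mm:ss"))
      .withColumn("revenue", col("revenue").cast("decimal(18,2)"))
  )
  converted_df.printSchema()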

  1. Open AWS Glue in the AWS Management Console
    1. Navigate to Data Catalog and Connections
    2. Create a new connection
  2. Configure BigQuery Connection in Glue
    1. Prepare a Google Cloud BigQuery Environment
    2. Create a Google Cloud service account key (JSON format) and store it in AWS Secrets Manager; you can find the details in the BigQuery connections documentation
    3. An example of the JSON content is as follows:
      {
        "type": "service_account",
        "project_id": "h*********g0",
        "private_key_id": "cc***************81",
        "private_key": "-----BEGIN PRIVATE KEY-----\nMI***zEc=\n-----END PRIVATE KEY-----\n",
        "client_email": "clickhouse-sa@h*********g0.iam.gserviceaccount.com",
        "client_id": "1*********8",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/clickhouse-sa%40h*********g0.iam.gserviceaccount.com",
        "universe_domain": "googleapis.com"
      }
      • type: service_account.
      • project_id: The ID of the GCP project.
      • private_key_id: A unique ID for the private key within the file.
      • private_key: The actual private key.
      • client_email: The email address of the service account.
      • client_id: A unique client ID associated with the service account.
      • auth_uri, token_uri, auth_provider_x509_cert_url, client_x509_cert_url: URLs for authentication and token exchange with Google’s identity and access management systems.
      • universe_domain: The domain name of GCP, googleapis.com
    4. Create Google BigQuery Connection in AWS Glue
    5. Grant the IAM role associated with your AWS Glue job permissions for S3, Secrets Manager, and AWS Glue, as well as the AmazonEC2ContainerRegistryReadOnly managed policy for accessing connectors purchased from AWS Marketplace (reference doc)
  3. Create ClickHouse connection in AWS Glue
    1. Enter clickhouse-connection as its connection name
    2. Choose Create connection and activate connector
  4. Create a Glue job
    1. On the Connectors view, select clickhouse-connection and choose Create job
    2. Enter bq_to_clickhouse as its job name and configure gc_connector_role as its IAM Role
    3. Add the BigQuery connection and clickhouse-connection to the job’s Connections property
    4. Choose the Script tab and Edit script. Then choose Confirm on the Edit script popup view.
    5. Copy and paste the following code into the script editor. The ClickHouse connection settings can be referenced from the ClickHouse official documentation
    6. The source code is as follows:
      import sys
      from pyspark.sql import SparkSession
      from awsglue.context import GlueContext
      from awsglue.job import Job
      from awsglue.utils import getResolvedOptions
      
      args = getResolvedOptions(sys.argv, ['JOB_NAME'])
      spark = SparkSession.builder.getOrCreate()
      glueContext = GlueContext(spark.sparkContext)
      job = Job(glueContext)
      job.init(args['JOB_NAME'], args)
      
      connection_options = {
          "connectionName": "Bigquery connection",
          "parentProject": "YOUR_GCP_PROJECT_ID",
          "query": "SELECT * FROM `YOUR_GCP_PROJECT_ID.bq_test_dataset.bq_test_table`",
          "viewsEnabled": "true",
          "materializationDataset": "bq_test_dataset"
      }
      jdbc_url = " jdbc:clickhouse://YOUR_CLICKHOUSE_CONNECTION.us-east-1.aws.clickhouse.cloud:8443/clickhouse_database?ssl=true "
      username = "default"
      password = "YOUR_PASSWORD"
      query = "select * from clickhouse_database.clickhouse_test_table"
      # Test the connections and run the migration
      try:
          # Read from BigQuery with Glue Connection
          print("Reading data from BigQuery...")
          GoogleBigQuery_node1742453400261 = glueContext.create_dynamic_frame.from_options(
              connection_type="bigquery",
              connection_options=connection_options,
              transformation_ctx="GoogleBigQuery_node1742453400261"
          )
          # Convert to DataFrame
          bq_df = GoogleBigQuery_node1742453400261.toDF()
          print("Show data from BigQuery:")
          bq_df.show()
          
          # Write BigQuery Data to ClickHouse with JDBC
          bq_df.write \
          .format("jdbc") \
          .option("driver", 'com.clickhouse.jdbc.ClickHouseDriver') \
          .option("url", jdbc_url) \
          .option("user", username) \
          .option("password", password) \
          .option("dbtable", "clickhouse_test_table") \
          .mode("append") \
          .save()
          
          print("Write BigQuery Data to ClickHouse successfully")
          
          # Read from ClickHouse with JDBC to verify the write
          read_df = (spark.read.format("jdbc")
          .option("driver", 'com.clickhouse.jdbc.ClickHouseDriver')
          .option("url", jdbc_url)
          .option("user", username)
          .option("password", password)
          .option("query", query)
          .option("ssl", "true")
          .load())
          
          print("Show Data from ClickHouse:")
          read_df.show()
          
      except Exception as e:
          print(f"ClickHouse connection test failed: {str(e)}")
          raise e
      finally:
          job.commit()
    7. Choose Save and then Run at the top right of the view
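
The script above hard-codes the ClickHouse password for simplicity. In practice, you may prefer to store it in AWS Secrets Manager (the Glue job’s IAM role already needs Secrets Manager access for the BigQuery connection) and retrieve it at runtime. The following is a minimal sketch; the secret name clickhouse/credentials and its JSON layout are assumptions for illustration.

  import json
  import boto3

  # Retrieve the ClickHouse password from AWS Secrets Manager instead of hard-coding it.
  # The secret name and JSON key below are assumptions -- adjust them to your setup.
  secrets_client = boto3.client("secretsmanager", region_name="us-east-1")
  secret_value = secrets_client.get_secret_value(SecretId="clickhouse/credentials")
  password = json.loads(secret_value["SecretString"])["password"]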

Testing and Validation

Testing is crucial to verify data accuracy and performance in the new environment. After the migration completes, run data integrity checks to confirm record counts and data quality in ClickHouse Cloud. Schema validation is essential, as each data field must align correctly with ClickHouse’s format. Running performance benchmarks, such as sample queries, will help verify that ClickHouse’s setup delivers the desired speed and efficiency gains.

  1. Compare the schema and data in the source BigQuery table with the destination ClickHouse table

  2. Review the AWS Glue job output logs
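
As a simple integrity check, you can also compare row counts between the source and the destination from within the same Glue job. The snippet below is a minimal sketch that reuses the glueContext, connection_options, jdbc_url, username, and password variables from the job script above, together with the sample table names used in this post.

  # Row-count comparison between the BigQuery source and the ClickHouse destination.
  # Reuses glueContext, connection_options, jdbc_url, username, and password
  # from the job script shown earlier in this post.
  bq_count = (
      glueContext.create_dynamic_frame.from_options(
          connection_type="bigquery",
          connection_options=connection_options,
      ).toDF().count()
  )

  ch_count = (
      spark.read.format("jdbc")
      .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
      .option("url", jdbc_url)
      .option("user", username)
      .option("password", password)
      .option("query", "SELECT count() AS cnt FROM clickhouse_database.clickhouse_test_table")
      .load()
      .collect()[0]["cnt"]
  )

  print(f"BigQuery rows: {bq_count}, ClickHouse rows: {ch_count}, match: {bq_count == ch_count}")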

Clean Up

After completing the migration, it’s important to clean up unused resources—such as BigQuery for sample data import and database resources in ClickHouse Cloud—to avoid unnecessary costs. Regarding IAM permissions, adhering to the principle of least privilege is advisable. This involves granting users and roles only the permissions necessary for their tasks and removing unnecessary permissions when they are no longer required. This approach enhances security by minimizing potential threat surfaces. Additionally, reviewing AWS Glue job costs and configurations can help identify optimization opportunities for future migrations. Monitoring overall costs and analyzing usage can reveal areas where code or configuration improvements may lead to cost savings.
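
If the Glue job and connections were created only for this sample migration, they can also be removed programmatically. The following is a minimal sketch using boto3, assuming the resource names used in this post (bq_to_clickhouse and clickhouse-connection); adjust or omit the calls for anything you want to keep, and remove the BigQuery connection the same way if you no longer need it.

  import boto3

  # Delete the sample Glue resources created in this post.
  # The job and connection names below match the examples used earlier -- adjust them to your setup.
  glue_client = boto3.client("glue", region_name="us-east-1")
  glue_client.delete_job(JobName="bq_to_clickhouse")
  glue_client.delete_connection(ConnectionName="clickhouse-connection")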

Conclusion

AWS Glue ETL offers a robust and user-friendly solution for migrating data from BigQuery to ClickHouse Cloud on AWS. By utilizing Glue’s serverless architecture, organizations can perform data migrations that are efficient, secure, and cost-effective. The direct integration with ClickHouse streamlines data transfer, supporting high performance and flexibility. This migration approach is particularly well-suited for companies looking to enhance their real-time analytics capabilities on AWS.


About the Authors

Ray Wang

Ray is a Senior Solutions Architect at AWS. With 12+ years of experience in the IT industry, Ray is dedicated to building modern solutions on the cloud, especially in NoSQL, big data, machine learning, and generative AI. As a hungry go-getter, he has earned all 12 AWS certifications to make his technical expertise not only deep but also wide. He loves to read and watch sci-fi movies in his spare time.

Robert Chung

Robert is a Solutions Architect at AWS with expertise across Infrastructure, Data, AI, and Modernization technologies. He has supported numerous financial services customers in driving cloud-native transformation, advancing data analytics, and accelerating mainframe modernization. His experience also extends to modern AI-DLC practices, enabling enterprises to innovate faster. With this background, Robert is well-equipped to address complex enterprise challenges and deliver impactful solutions.

Tomohiro Tanaka

Tomohiro is a Senior Cloud Support Engineer at Amazon Web Services (AWS). He’s passionate about helping customers use Apache Iceberg for their data lakes on AWS. In his free time, he enjoys a coffee break with his colleagues and making coffee at home.

Stanley Chukwuemeke

Stanley is a Senior Partner Solutions Architect at AWS. He works with AWS technology partners to grow their business by creating joint go-to-market solutions using AWS data, analytics, and AI services. He has worked with data for most of his career and is passionate about database modernization and cloud adoption strategies that help drive enterprise modernization initiatives across industries.