Introducing the HubSpot connector for AWS Glue

Most companies have adopted a diverse set of software as a service (SaaS) platforms to support various applications. The rapid adoption has enabled them to quickly streamline operations, enhance collaboration, and gain more accessible, scalable solutions for managing their critical data and workflows.

More companies have realized there is an opportunity to integrate, enhance, and present this SaaS data to improve internal operations and gain valuable insights on their data. Using AWS Glue, a serverless data integration service, companies can streamline this process, integrating data from internal and external sources into a centralized AWS data lake. From there, they can perform meaningful analytics, gain valuable insights, and optionally push enriched data back to external SaaS platforms.

This post introduces the new HubSpot managed connector for AWS Glue, and demonstrates how you can integrate HubSpot data into your existing data lake on AWS. By consolidating HubSpot data with data from your AWS accounts and from other SaaS services, you can enhance, analyze, and optionally write the data back to HubSpot, creating a seamless and integrated data experience.

Solution overview

In this example, we use AWS Glue to extract, transform, and load (ETL) data from your HubSpot account into a transactional data lake on Amazon Simple Storage Service (Amazon S3), using Apache Iceberg format. We register the schema in the AWS Glue Data Catalog to make your data discoverable. Subsequently, we use Amazon Athena to validate that the HubSpot data has been successfully loaded to Amazon S3. The following diagram illustrates the solution architecture.

The following are key components and steps in the integration:

Configure your HubSpot account and app to enable access to your HubSpot data.
Prepare for data movement by securely storing your HubSpot OAuth credentials in AWS Secrets Manager, creating an S3 bucket to store your ingested data, and creating an AWS Identity and Access Management (IAM) role for AWS Glue.
Create an AWS Glue job to extract and load data from HubSpot to Amazon S3. AWS Glue establishes a secure connection to HubSpot using OAuth for authorization and TLS for data encryption in transit. AWS Glue also supports the ability to apply complex data transformations, enabling efficient data integration and preparation to meet your needs.
Schema and other metadata will be registered in the AWS Glue Data Catalog, a centralized metadata repository for all your data assets. This helps simplify schema management, and also makes the data discoverable by other services.
Run the AWS Glue job to extract data from HubSpot and write it to Amazon S3 using Iceberg format. Apache Iceberg is an open source, high-performance open table format designed for large-scale analytics, providing transactional consistency and seamless schema evolution. Although we use Iceberg in this example, AWS Glue offers robust support for various data formats, including other transactional formats such as Apache Hudi and Delta Lake.
The data loaded to Amazon S3 will be organized into partitioned folders to optimize for query performance and management. Amazon S3 will also store the AWS Glue scripts, logs, and other temporary data required during the ETL process.
Finally, Amazon Athena will be used to query the data loaded from HubSpot to Amazon S3, validating that all changes in the source system have been captured successfully.
Optionally, you can schedule the AWS Glue job to regularly synchronize HubSpot data to Amazon S3 and analyze data updates over time.

Set up your HubSpot account

This example requires you to create a HubSpot public app for AWS Glue in a HubSpot Developer account, and connect it to an associated HubSpot account. A HubSpot public app is a type of integration that can be installed in your HubSpot accounts or listed in the HubSpot Marketplace. In this example, you create a HubSpot app for the AWS Glue integration, and install it in a new test account. Although HubSpot calls it a public app, it will not be listed in their Marketplace and will only have access to your test account.

If you don’t already have one, sign up for a free HubSpot developer account.
Log in to your HubSpot developer account, where you’ll see options to create apps and test accounts.
Choose Create a test account and follow the instructions.

HubSpot test accounts have Enterprise versions of the HubSpot Marketing, Sales, and Service Hubs along with sample data, so you can test most HubSpot tools, create CRM data, and access it through APIs with AWS Glue. For more information about creating a test account, refer to Create a developer test account.

Create a HubSpot app

Complete the following steps to create a HubSpot app:

Switch back to your HubSpot developer account, and choose Create an app.
Fill in the App Info section with the name AWS Glue and a brief description.
Choose the Auth tab.
For Redirect URLs, enter the redirect URL for AWS Glue in the form: https://<region>.console.aws.amazon.com/gluestudio/oauth.

Be sure to replace <region> with your AWS Glue operating AWS Region. For instance, the code for the US East (N. Virginia) Region is us-east-1, so the AWS Glue redirect URL is https://us-east-1.console.aws.amazon.com/gluestudio/oauth.

In the Scopes section, choose Add new scope and select the following permissions:
- automation
- content
- crm.lists.read
- crm.lists.write
- crm.objects.companies.read
- crm.objects.companies.write
- crm.objects.contacts.read
- crm.objects.contacts.write
- crm.objects.custom.read
- crm.objects.custom.write
- crm.objects.deals.read
- crm.objects.deals.write
- crm.objects.owners.read
- crm.schemas.custom.read
- e-commerce
- forms
- oauth
- sales-email-read
- tickets
Review the Scopes and Redirect URL settings, then choose Create app.
Navigate back to your app Auth tab.
Take note of the values for Client ID, Client secret, and Install URL (OAuth). You will need these later to connect your AWS Glue instance.
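
The redirect URL pattern described earlier in this section can be generated per Region instead of typed by hand. The following is a minimal sketch; the helper name glue_redirect_url and the Region-code pattern are our own assumptions for illustration and are not part of AWS Glue or HubSpot.

```python
import re

# Simplified pattern for Region codes such as us-east-1 or ap-southeast-2.
# This is an assumption for illustration, not an official validation rule.
_REGION_RE = re.compile(r"^[a-z]{2}(-[a-z]+)+-\d$")


def glue_redirect_url(region: str) -> str:
    """Return the AWS Glue OAuth redirect URL for the given Region code."""
    if not _REGION_RE.match(region):
        raise ValueError(f"unexpected Region code: {region!r}")
    return f"https://{region}.console.aws.amazon.com/gluestudio/oauth"


print(glue_redirect_url("us-east-1"))
```

Rejecting a value such as the literal <region> placeholder helps catch a redirect URL that was copied without being edited.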

Select or create an Amazon S3 bucket where your HubSpot data will reside

Select an existing Amazon S3 bucket in your account, or create a new bucket to store your HubSpot data, as well as scripts, logs, and so on. For this example, the bucket name will follow the format aws-glue-hubspot-<account>-<region>, where <account> is the AWS account number and <region> is the operating Region. The bucket will be configured with all defaults: public access disabled, versioning disabled, and server-side encryption with Amazon S3 managed keys (SSE-S3).

If you use AWSGlueServiceRole in your IAM role as shown in this example, it will provide access to S3 buckets with names starting with aws-glue-.
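
If you prefer to create the bucket programmatically, the naming convention above can be sketched as follows. This is an illustrative sketch, not part of the original walkthrough: the function names are hypothetical, and s3_client is assumed to be a boto3 S3 client (for example, boto3.client("s3", region_name=region)) that you create yourself.

```python
def hubspot_bucket_name(account_id: str, region: str) -> str:
    """Build a bucket name in the form aws-glue-hubspot-<account>-<region>."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return f"aws-glue-hubspot-{account_id}-{region}"


def create_bucket_kwargs(bucket: str, region: str) -> dict:
    """Arguments for S3 create_bucket; us-east-1 takes no LocationConstraint."""
    kwargs = {"Bucket": bucket}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_hubspot_bucket(s3_client, account_id: str, region: str) -> str:
    """Create the bucket with default settings and return its name."""
    name = hubspot_bucket_name(account_id, region)
    s3_client.create_bucket(**create_bucket_kwargs(name, region))
    return name
```

Keeping the aws-glue- prefix matters here because the AWSGlueServiceRole managed policy grants access based on that prefix.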

Create an IAM role for AWS Glue

Create an IAM role with permissions for the AWS Glue job. AWS Glue will assume this role when calling other services on your behalf.

On the IAM console, choose Roles in the navigation pane.
Choose Create role.
For Trusted entity type, choose AWS service.
For Use case, choose Glue.
Add the following AWS managed policies to the role:
1. AWSGlueServiceRole for accessing related services such as Amazon S3, Amazon Elastic Compute Cloud, Amazon CloudWatch, and IAM. This policy enables access to S3 buckets with names starting with aws-glue-.
2. SecretsManagerReadWrite for read/write access to AWS Secrets Manager.
Give the role a name, for instance AWSGlueServiceRole_blog.

For more information, see Getting started with AWS Glue and Create an IAM role for AWS Glue.
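
The console steps above can also be scripted. The following sketch shows one way to do it, assuming iam_client is a boto3 IAM client that you supply; the function name create_glue_role is hypothetical. The trust policy lets the AWS Glue service assume the role, and the two managed policies are the ones listed above.

```python
import json

# Trust policy that allows the AWS Glue service to assume the role.
GLUE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# AWS managed policies used in this post.
MANAGED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/SecretsManagerReadWrite",
]


def create_glue_role(iam_client, role_name: str = "AWSGlueServiceRole_blog") -> str:
    """Create the role and attach the managed policies; returns the role name."""
    iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(GLUE_TRUST_POLICY),
    )
    for arn in MANAGED_POLICY_ARNS:
        iam_client.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    return role_name
```

For production use, you would likely replace SecretsManagerReadWrite with a narrower policy scoped to the single HubSpot secret.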

Create an AWS Secrets Manager secret

AWS Secrets Manager is used to securely store your HubSpot OAuth credentials. Complete the following steps to create a secret:

On the AWS Secrets Manager console, choose Secrets in the navigation pane.
Choose Store a new secret.
For Secret type, select Other type of secret.
Under Key/value pairs, enter the HubSpot client secret with the key USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET.
Choose Next.

Enter the secret name, such as HubSpot-Blog, a description, and continue.
Leave the secret rotation as default, and choose Next.
Review the secret configuration, and choose Store.
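
The same secret can be created from code. The sketch below assumes sm_client is a boto3 Secrets Manager client that you supply; the function names and the description text are hypothetical. The key name is the one required in the steps above.

```python
import json

# Key name expected for the HubSpot OAuth client secret.
SECRET_KEY = "USER_MANAGED_CLIENT_APPLICATION_CLIENT_SECRET"


def hubspot_secret_string(client_secret: str) -> str:
    """Serialize the client secret as the key/value JSON stored in the secret."""
    if not client_secret:
        raise ValueError("client secret must not be empty")
    return json.dumps({SECRET_KEY: client_secret})


def store_hubspot_secret(sm_client, client_secret: str, name: str = "HubSpot-Blog") -> str:
    """Create the secret and return its ARN."""
    response = sm_client.create_secret(
        Name=name,
        Description="HubSpot OAuth client secret for AWS Glue",
        SecretString=hubspot_secret_string(client_secret),
    )
    return response["ARN"]
```

Read the client secret from a prompt or an environment variable rather than hard-coding it in a script.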

Create an AWS Glue connection

Complete the following steps to create an AWS Glue connection to your HubSpot account:

On the AWS Glue console, choose Data connections in the navigation pane.
Choose Create connection.
For Data sources, search for and select HubSpot.
Choose Next.

On the Configure connection page, fill in the required information:
1. For IAM service role, choose the service role created previously. In this example, we use the role AWSGlueServiceRole_blog.
2. For Authentication URL, leave as default.
3. For User Managed Client Application ClientId, enter the OAuth client ID from HubSpot.
4. For AWS Secret, choose the OAuth client secret name configured previously in AWS Secrets Manager.
5. Choose Next.

Choose Test Connection to validate the connection to HubSpot.
This will bring up a new HubSpot connection window. Be sure to select your HubSpot test account (not your developer account) to test the connection.
If this is your first connection attempt, you will be redirected to another page where you are asked to confirm the access level granted to AWS Glue. Choose Connect App.

If successful, the HubSpot window will close and your AWS connection window will say Connection test successful.

Under Set properties, for Name, enter a name (for example, HubSpot_Connection_blog).
Choose Next.
Under Review and create, review your settings and then create the connection.

Create a database in AWS Glue Data Catalog

Complete the following steps to create a database in AWS Glue Data Catalog to organize your HubSpot data:

On the AWS Glue console, choose Databases in the navigation pane.
Create a new database.
Enter a name (for example, hubspot).
You can leave the location field blank.
Choose Create database.
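
As an alternative to the console, the database can be created with a single API call. This is a minimal sketch, assuming glue_client is a boto3 AWS Glue client that you supply; the function name and description text are hypothetical.

```python
def create_hubspot_database(glue_client, name: str = "hubspot") -> str:
    """Create a Data Catalog database with no location set; returns its name."""
    glue_client.create_database(
        DatabaseInput={
            "Name": name,
            "Description": "HubSpot data ingested by AWS Glue",
        }
    )
    return name
```

The call raises an error if a database with the same name already exists, so you may want to handle that case when rerunning setup scripts.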

Create an AWS Glue ETL job

Now that you have an AWS Glue data connection to your HubSpot account, you can create an AWS Glue ETL job to ingest HubSpot data into your AWS data lake. AWS Glue provides both visual and code-based interfaces to simplify data integration, depending on your expertise. In this example, we use the Script interface to ingest HubSpot data into the Amazon S3 location. Complete the following steps:

On the AWS Glue console, choose ETL jobs in the navigation pane.
Choose the Script editor.
Choose Spark as the engine, and upload the following script.

The AWS Glue Spark job reads the HubSpot data and merges it into the S3 bucket in Iceberg format.
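
A job of this kind could be sketched as follows. This is an illustrative sketch under stated assumptions, not the exact script used in this post: the connection option names (connectionName, ENTITY_NAME, API_VERSION), the merge key id, the temporary view name, and the S3 path layout are assumptions you should check against the connector documentation and your own data. The job parameter names match the ones configured later in this section.

```python
import sys

SOURCE_VIEW = "hubspot_updates"  # hypothetical temporary view name


def hubspot_read_options(connection_name: str, entity: str) -> dict:
    """Connection options for reading one HubSpot entity (assumed option names)."""
    return {"connectionName": connection_name, "ENTITY_NAME": entity, "API_VERSION": "v3"}


def create_table_sql(table: str, location: str) -> str:
    """First run: create the Iceberg table at an explicit S3 location."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} USING iceberg "
        f"LOCATION '{location}' AS SELECT * FROM {SOURCE_VIEW}"
    )


def merge_sql(table: str, key: str = "id") -> str:
    """Later runs: upsert the latest HubSpot rows by key (assumed to be id)."""
    return (
        f"MERGE INTO {table} t USING {SOURCE_VIEW} s ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *"
    )


def run_job() -> None:
    # These modules are only available inside the AWS Glue runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(
        sys.argv,
        ["JOB_NAME", "db_name", "table_name", "s3_bucket_name", "connection_name"],
    )
    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the HubSpot entity through the AWS Glue connection.
    frame = glue_context.create_dynamic_frame.from_options(
        connection_type="hubspot",
        connection_options=hubspot_read_options(args["connection_name"], args["table_name"]),
    )
    frame.toDF().createOrReplaceTempView(SOURCE_VIEW)

    table = f"glue_catalog.{args['db_name']}.{args['table_name']}"
    location = f"s3://{args['s3_bucket_name']}/{args['db_name']}/{args['table_name']}"
    spark.sql(create_table_sql(table, location))
    spark.sql(merge_sql(table))
    job.commit()


# AWS Glue passes --JOB_NAME on the command line; skip the run elsewhere.
if any(arg.startswith("--JOB_NAME") for arg in sys.argv):
    run_job()
```

The merge in this sketch inserts and updates rows; it does not delete rows that have disappeared from HubSpot, so handling removals would need additional logic.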

On the Job details tab, provide the following information:
For Name, enter a name, such as HubSpot_to_S3_blog.
For Description, enter a meaningful description of the job.
For IAM Role, choose the IAM role you created previously (for this post, AWSGlueServiceRole_blog).

Expand Advanced properties.
Under Connections, enter your HubSpot connection from the previous section (for this post, HubSpot_Connection_blog).

Under Job parameters, enter the following parameters:

- For --conf, enter spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.warehouse=file:///tmp/spark-warehouse
- For --datalake-formats, enter iceberg
- For --db_name, enter the AWS Glue database to store your data lake (for this post, hubspot)
- For --table_name, enter the HubSpot table to be ingested (for this post, company)
- For --s3_bucket_name, enter where the ingested Iceberg table is stored, in this case aws-glue-hubspot-<account>-<region>
- For --connection_name, enter the AWS Glue connection name created, in this case HubSpot_Connection_blog
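
The --conf value is easy to mistype because several Spark settings are chained into one parameter. The following sketch assembles the same parameters in code; the function name job_parameters is hypothetical, and the resulting dictionary could be supplied as the job's default arguments if you create the job through the AWS Glue API instead of the console.

```python
# Spark settings for the Iceberg catalog, exactly as listed above.
ICEBERG_CONF = [
    "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue_catalog.warehouse=file:///tmp/spark-warehouse",
]


def job_parameters(
    bucket: str,
    db_name: str = "hubspot",
    table_name: str = "company",
    connection_name: str = "HubSpot_Connection_blog",
) -> dict:
    """Return the job parameters used in this post as a key/value mapping."""
    return {
        # The first setting is the value itself; the rest are chained with --conf.
        "--conf": " --conf ".join(ICEBERG_CONF),
        "--datalake-formats": "iceberg",
        "--db_name": db_name,
        "--table_name": table_name,
        "--s3_bucket_name": bucket,
        "--connection_name": connection_name,
    }


for key, value in job_parameters("aws-glue-hubspot-111122223333-us-east-1").items():
    print(key, "=", value)
```

The account number in the usage line is a placeholder; substitute your own bucket name.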

Choose Save to save the job, then choose Run.

Depending on the amount of data in your HubSpot account, the job can take a few minutes to complete. After a successful job run, you can choose Run details to see the job specifications and logs.

Use Athena to query data

Athena is an interactive and serverless query service that makes it straightforward to analyze data directly in Amazon S3 using standard SQL. In this example, we query the results of the HubSpot data ingested into Amazon S3.

On the Athena console, choose Query editor.
For Database, choose hubspot, and you should see your company table.
Select entries from the hubspot.company table to view the data captured from HubSpot.

You can try various queries on the HubSpot data, such as:

-- get sample of dataset
SELECT * FROM "hubspot"."company" limit 10;

-- get companies with revenue
SELECT * FROM "hubspot"."company" A
WHERE A.annualrevenue IS NOT NULL;

-- get number of companies with revenue
SELECT COUNT(*) AS companies_count FROM "hubspot"."company" A
WHERE A.annualrevenue IS NOT NULL;
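
The queries above can also be run from code, for example as an automated check after each job run. The following is a minimal sketch, assuming athena_client is a boto3 Athena client that you supply and output_location is an S3 path you choose for query results; the function name run_athena_query is hypothetical.

```python
import time


def run_athena_query(athena_client, sql, database, output_location, poll_seconds=1.0):
    """Run a query, wait for it to finish, and return the first page of rows."""
    query_id = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    rows = athena_client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # Each row is a list of cells; missing values come back as None.
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]
```

For SELECT statements, the first returned row contains the column names, so the data starts at the second row. Larger result sets would need pagination, which this sketch omits.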

Over time, your HubSpot data may change. You can rerun your ETL job periodically, and the Iceberg table in your data lake will capture the changes. To verify, add, remove, or change companies in your HubSpot account, and then rerun the ETL job. Your data lake should match your latest HubSpot data. You can schedule the ETL job to run as often as you need.

Extending the HubSpot connector with AWS services

The HubSpot connector for AWS Glue provides a powerful foundation for building comprehensive data pipelines and analytics workflows. By integrating HubSpot data into your AWS environment, you can use additional services like Amazon Redshift, Amazon QuickSight, and Amazon SageMaker to further process, transform, and analyze the data. This allows you to construct sophisticated, end-to-end data architectures that unlock the full value of your HubSpot data, without the need to manage complex infrastructure. The seamless integration between these AWS services makes it straightforward to build scalable analytics pipelines tailored to your specific requirements.

Considerations

You can set up AWS Glue job triggers to run the ETL jobs on a schedule, so that the data is regularly synchronized between HubSpot and Amazon S3. You can also integrate the ETL jobs with other AWS services, including AWS Step Functions, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), AWS Lambda, Amazon EventBridge, and Amazon Bedrock, to create a more advanced data processing pipeline.
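
As an example of a scheduled trigger, the following sketch creates one that starts the job once a day. It assumes glue_client is a boto3 AWS Glue client that you supply; the function name, the trigger name suffix, and the 02:00 UTC schedule are hypothetical choices for illustration.

```python
def create_daily_trigger(
    glue_client,
    job_name: str = "HubSpot_to_S3_blog",
    schedule: str = "cron(0 2 * * ? *)",  # every day at 02:00 UTC
) -> str:
    """Create and start a scheduled trigger for the job; returns the trigger name."""
    trigger_name = f"{job_name}_daily"
    glue_client.create_trigger(
        Name=trigger_name,
        Type="SCHEDULED",
        Schedule=schedule,
        Actions=[{"JobName": job_name}],
        StartOnCreation=True,
    )
    return trigger_name
```

Choose a schedule that reflects how fresh the HubSpot data needs to be, because each run incurs AWS Glue charges and HubSpot API calls.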

By default, the HubSpot connector doesn’t import deleted records. However, you can set the IMPORT_DELETED_RECORDS option to true to import all records, including the deleted ones.
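
In a job script, this option would typically be passed alongside the other connection options when reading from HubSpot. The sketch below is an assumption-laden illustration: only IMPORT_DELETED_RECORDS comes from the text above, while the other option names, the string form of the value ("true"), and the helper name are assumptions to verify against the connector documentation.

```python
def hubspot_options(connection_name: str, entity: str, import_deleted: bool = False) -> dict:
    """Build connection options, optionally including deleted records."""
    options = {
        "connectionName": connection_name,  # assumed option name
        "ENTITY_NAME": entity,              # assumed option name
        "API_VERSION": "v3",                # assumed API version
    }
    if import_deleted:
        # Passed as a string here; the expected value type is an assumption.
        options["IMPORT_DELETED_RECORDS"] = "true"
    return options


print(hubspot_options("HubSpot_Connection_blog", "company", import_deleted=True))
```

Leaving the key out entirely keeps the connector's default behavior of skipping deleted records.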

Clean up

To avoid incurring charges, clean up the resources used in this post from your AWS account, including the AWS Glue jobs, HubSpot connection, AWS Secrets Manager secret, IAM role, and Amazon S3 bucket.
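
If you created the resources with the names used in this post, the cleanup can be sketched in code as follows. The function names are hypothetical; glue_client, sm_client, and iam_client are assumed to be boto3 clients, and s3_resource is assumed to be a boto3 S3 resource (boto3.resource("s3")), all supplied by you. These calls are destructive, so review the names before running anything like this.

```python
def cleanup_glue(glue_client, job_name, connection_name, database):
    """Delete the ETL job, the HubSpot connection, and the catalog database."""
    glue_client.delete_job(JobName=job_name)
    glue_client.delete_connection(ConnectionName=connection_name)
    glue_client.delete_database(Name=database)


def cleanup_secret(sm_client, name):
    """Delete the secret immediately, skipping the recovery window."""
    sm_client.delete_secret(SecretId=name, ForceDeleteWithoutRecovery=True)


def cleanup_role(iam_client, role_name):
    """Detach managed policies, then delete the role."""
    attached = iam_client.list_attached_role_policies(RoleName=role_name)
    for policy in attached["AttachedPolicies"]:
        iam_client.detach_role_policy(RoleName=role_name, PolicyArn=policy["PolicyArn"])
    iam_client.delete_role(RoleName=role_name)


def cleanup_bucket(s3_resource, bucket_name):
    """Empty the bucket (versioning disabled in this post), then delete it."""
    bucket = s3_resource.Bucket(bucket_name)
    bucket.objects.all().delete()
    bucket.delete()
```

A typical sequence would call cleanup_glue with HubSpot_to_S3_blog, HubSpot_Connection_blog, and hubspot, then cleanup_secret with HubSpot-Blog, cleanup_role with AWSGlueServiceRole_blog, and finally cleanup_bucket with your bucket name.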

Conclusion

With the introduction of the AWS Glue connector for HubSpot, integrating HubSpot data with information from other data sources has become more streamlined than ever. This feature enables you to set up ongoing data integration from HubSpot to AWS, providing a unified view of data from across platforms and enabling more comprehensive analytics. The serverless nature of AWS Glue means there is no infrastructure management required, and you only pay for the resources consumed. By following the steps outlined in this post, you can make sure that up-to-date data from HubSpot is captured in your data lake, allowing teams to make faster data-driven decisions and uncover complex insights from across data sources.

To learn more about the AWS Glue connector for HubSpot, refer to Connecting to HubSpot in AWS Glue. This guide walks through the entire process, from setting up the connection to running the data transfer flow. For more information on AWS Glue, visit AWS Glue.

About the Authors

Eric Bomarsi is a Senior Solutions Architect in the ISV group at AWS, where he focuses on building scalable solutions for large customers. As a member of the AWS analytics community, he helps customers get strategic insights from their data. Outside of work, he enjoys playing ice hockey and traveling with his family.

Annie Nelson is a Senior Solutions Architect at AWS. She is a data enthusiast who enjoys problem solving and tackling complex architectural challenges with customers.

Kartikay Khator is a Solutions Architect within Global Life Sciences at AWS, where he dedicates his efforts to developing innovative and scalable solutions that cater to the evolving needs of customers. His expertise lies in harnessing the capabilities of AWS analytics services. Extending beyond his professional pursuits, he finds joy and fulfillment in the world of running and hiking. Having already completed multiple marathons, he is currently preparing for his next marathon challenge.

Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect, Amazon MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest Amazon MWAA and AWS Glue features and news!

AWS Big Data Blog