Ingest Salesforce data into Amazon S3 using the CData JDBC custom connector with AWS Glue

Organizations that successfully generate business value from their data will outperform their peers. Many AWS customers require a data storage and analytics solution that combines the prospect information stored in Salesforce, a popular and widely used customer relationship management (CRM) platform, with other structured and unstructured data in their data lake to innovate and build customer experiences faster.

To ingest data from Salesforce into a data lake in Amazon Simple Storage Service (Amazon S3), data engineers frequently write custom code to extract, transfer, and load (ETL) the data into their data lake or rely on open-source connector software. Both options are fragile, costly, and add operational complexity.

AWS Glue custom connectors simplify the development and deployment of bi-directional data transfer between applications and data stores. With AWS Glue custom connectors, you can search and configure an appropriate connector from the AWS Marketplace, activate it in AWS Glue, and connect to the data source using AWS Glue Studio or AWS Glue jobs to extract and process data.

In this post, we outline the process of connecting to Salesforce using the CData AWS Glue connector for Salesforce, use AWS Glue Studio to extract data from Salesforce, and query it using Amazon Athena.

Prerequisites

For this walkthrough, you should have the following:

An AWS account.
A Salesforce account. For trying out the steps in this post, you can also sign up for a developer account, which is pre-populated with sample records in many of the Salesforce objects.

Subscribe to AWS Glue connector

To subscribe to the AWS Glue connector, complete the following steps:

On the AWS Glue console, choose AWS Glue Studio.
In the navigation pane, choose Marketplace and search for Salesforce.

In the navigation name, choose Marketplace and search for Salesforce.

You can also search and subscribe to connectors from the AWS Marketplace.

Choose CData AWS Glue Connector for Salesforce.

In the navigation name, choose Marketplace and search for Salesforce.

Review the pricing and other relevant information.
Choose Continue to Subscribe.
Choose Accept Terms.

Choose Accept Terms.

After you subscribe to the connector, the next steps are to configure it.

Retain the default selections on Delivery Method and Software Version to use the latest connector software version.
Choose Continue to Launch.
Choose Usage instructions.

Choose Usage instructions.

A pop-up appears with a hyperlink to activate the connector with AWS Glue Studio.

Choose this link to start configuring the connection to your Salesforce account in AWS Glue Studio.

A pop-up appears with a hyperlink to activate the connector with AWS Glue Studio.

Create a connection in AWS Glue Studio

To create your connection in AWS Glue Studio, complete the following steps:

For Name, enter a unique name for your Salesforce connection.
For Connection credential type, choose username_password.

You can also use SSO authentication to connect through supported identity providers.

For ${Username} and ${Password}, enter the username and password.
For ${SecurityToken}, enter your security token.

If you don’t have a security token, you can log in to your Salesforce account and follow the instructions to reset your security token to get a new token.

For additional configuration, you can specify it in Additional options. A complete list of supported connection properties are available in CData Salesforce JDBC options.

If your Salesforce application enforces IP restrictions or uses a private connection, configure Network options to provide VPC information.

Choose Create connection and activate connector to finish creating the connection.

Choose Create connection and activate connector to finish creating the connection.

You should now be able to view the Salesforce connector you have subscribed to, and its associated connection.

Create an IAM role for AWS Glue

The next step is to create an AWS Identity and Access Management (IAM) role with the necessary permissions for the AWS Glue job. The name of the role must start with the string AWSGlueServiceRole for AWS Glue Studio to use it correctly. You need to grant your IAM role permissions that AWS Glue can assume when calling other services on your behalf. For more information, see Create an IAM Role for AWS Glue.

Attach the following AWS managed policies to the role:

AmazonEC2ContainerRegistryReadOnly for accessing connectors purchased from AWS Marketplace
AWSGlueServiceRole for accessing related services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon S3, and Amazon CloudWatch Logs

Create and attach a policy with write access to the S3 bucket.

For more information, see Job-related permissions.

Author a job in AWS Glue Studio

You’re now ready to define the job using this connection. For this post, you extract data from the Salesforce Account table into a Parquet file in an Amazon S3 location.

On the Connectors page, select the Salesforce connection and choose Create job.

On the Connectors page, select the Salesforce connection and choose Create job.

On the Data source properties – Connector tab, for Table name, enter Account.

You can also specify a query to extract specific attributes from a Salesforce object. For a complete list of connection properties, see Connection options for type custom.jdbc or marketplace.jdbc.

You can also specify a query to extract specific attributes from a Salesforce object.

On the Visual tab, choose the + icon to create a new S3 node for the destination.
On the Node properties tab, For Node type, choose S3 as the Data target.

On the Node properties tab, For Node type, choose S3 as the Data target.

For Node Parents, select the CData node.
On the Data target properties – S3 tab, for Format, choose Parquet.
For S3 Target Location, enter an S3 path.
For Data Catalog update options, select Create a table in the Data Catalog and on subsequent runs, keep existing schema and add new partitions.
For Database, choose a database.
For Table name, enter a name.

Instead of requiring a crawler to update the Data Catalog when the data lands, this data catalog update option makes it easy to keep your tables up to date. For more information, see Configuring data target nodes.

Instead of requiring a crawler to update the Data Catalog when the data lands, this data catalog update option makes it easy to keep your tables up to date.

On the Script tab, review the code generated by AWS Glue Studio.
Save and run the job.
When the job is complete, review the results in the target S3 location.

When the job is complete, review the results in the target S3 location.

Query data with Athena

AWS Glue Studio generates a new table with the name you specified.

You can now query this table with Athena.

Conclusion

In this post, we subscribed to the CData AWS Glue connector for Salesforce to connect to Salesforce and extract the Account object. You can use this approach to easily integrate your software as a service (SaaS) applications and data sources with your data lake and data stores on AWS using AWS Glue Studio.

To learn more, visit the AWS Glue User Guide. Access the AWS Glue Studio to get started.

About the Authors

Dylan Qu is an AWS solutions architect responsible for providing architectural guidance across the full AWS stack with a focus on Data Analytics, AI/ML and DevOps.

Ranjit Rajan is a Data Lab Solutions Architect with AWS. Ranjit works with AWS customers to help them design and build data and analytics applications in the cloud.