AWS Big Data Blog
Accelerate data integration with Salesforce and AWS using AWS Glue
The rapid adoption of software as a service (SaaS) solutions has led to data silos across various platforms, presenting challenges in consolidating insights from diverse sources. Effective data analytics relies on seamlessly integrating data from disparate systems through identifying, gathering, cleansing, and combining relevant data into a unified format. AWS Glue, a serverless data integration service, has simplified this process by offering scalable, efficient, and cost-effective solutions for integrating data from various sources. With AWS Glue, you can streamline data integration, reduce data silos and complexities, and gain agility in managing data pipelines, ultimately unlocking the true potential of your data assets for analytics, data-driven decision-making, and innovation.
This post explores the new Salesforce connector for AWS Glue and demonstrates how to build a modern extract, transform, and load (ETL) pipeline with AWS Glue ETL scripts.
Introducing the Salesforce connector for AWS Glue
To meet the demands of diverse data integration use cases, AWS Glue now supports SaaS connectivity for Salesforce. This enables you to quickly preview and transfer your customer relationship management (CRM) data, fetch the schema dynamically on request, and query the data. With the AWS Glue Salesforce connector, you can ingest and transform your CRM data and load it into any AWS Glue supported destination, including Amazon Simple Storage Service (Amazon S3) in open table formats such as Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake, as well as data warehouses such as Amazon Redshift and Snowflake. Reverse ETL use cases are also supported, allowing you to write data back to Salesforce.
The following are key benefits of the Salesforce connector for AWS Glue:
- You can use native AWS Glue capabilities
- It is tested with AWS Glue and is production ready for data integration workloads
- It runs on top of AWS Glue and Apache Spark in a distributed fashion for efficient data processing
Solution overview
For our use case, we want to perform a full load of the Salesforce account object into a data lake on Amazon S3 and then capture incremental changes. The solution also allows you to update certain fields of the account object in the data lake and push those changes back to Salesforce. To achieve this, you create two ETL jobs using AWS Glue with the Salesforce connector, and create a transactional data lake on Amazon S3 using Apache Iceberg.
In the first job, you configure AWS Glue to ingest the account object from Salesforce and save it into a transactional data lake on Amazon S3 in Apache Iceberg format. Then you update the account object data that is extracted from the first job in the transactional data lake in Amazon S3. Lastly, you run the second job to send that change back to Salesforce.
Prerequisites
Complete the following prerequisite steps:
- Create an S3 bucket to store the results.
- Sign up for a Salesforce account, if you don’t already have one.
- Create an AWS Identity and Access Management (IAM) role for the AWS Glue ETL job to use. The role must grant access to all resources used by the job, including Amazon S3 and AWS Secrets Manager. For this post, we name the role AWSGlueServiceRole-SalesforceConnectorJob. Attach the required AWS managed policies along with an inline policy that grants access to your S3 bucket and your Secrets Manager secret.
- Create the AWS Glue connection for Salesforce:
  - The Salesforce connector supports two OAuth2 grant types: JWT_BEARER and AUTHORIZATION_CODE. For this post, we use the AUTHORIZATION_CODE grant type.
  - On the Secrets Manager console, create a new secret. Add two keys, ACCESS_TOKEN and REFRESH_TOKEN, and keep their values blank. (You can also create this secret programmatically; see the sketch after this list.)
  - Configure the Salesforce connection in AWS Glue. Use AWSGlueServiceRole-SalesforceConnectorJob while creating the Salesforce connection. For this post, we name the connection Salesforce_Connection.
  - In the Authorization section, choose Authorization Code and the secret you created in the previous step.
  - Provide your Salesforce credentials when prompted. The ACCESS_TOKEN and REFRESH_TOKEN keys will be populated after you enter your Salesforce credentials.
- Create an AWS Glue database. For this post, we name it glue_etl_salesforce_db.
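If you prefer to script the Secrets Manager step, the following is a minimal sketch using boto3. The secret name is an assumption; reference whichever name you choose when you create the AWS Glue connection.

```python
import json
import boto3

# Minimal sketch: create the placeholder secret that the Salesforce connection
# populates during authorization. The secret name below is hypothetical.
secretsmanager = boto3.client("secretsmanager")

secretsmanager.create_secret(
    Name="glue/salesforce/connection-tokens",  # hypothetical name; use your own
    SecretString=json.dumps(
        {
            "ACCESS_TOKEN": "",   # left blank; filled in after you authorize with Salesforce
            "REFRESH_TOKEN": "",  # left blank; filled in after you authorize with Salesforce
        }
    ),
)
```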
Create an ETL job to ingest the account object from Salesforce
Complete the following steps to create a new ETL job in AWS Glue Studio to transfer data from Salesforce to Amazon S3:
- On the AWS Glue console, create a new job (with the Script editor option). For this post, we name the job Salesforce_to_S3_Account_Ingestion.
- On the Script tab, enter the Salesforce_to_S3_Account_Ingestion script.

Make sure that the name you used to create the Salesforce connection is passed as the connectionName parameter value in the script, as shown in the following code example:
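The complete Salesforce_to_S3_Account_Ingestion script isn’t reproduced here. The following is a condensed sketch of its structure under the assumptions used in this post (connection name Salesforce_Connection, the job parameters listed in the next steps, and API version v60.0); the CREATE TABLE and MERGE statements are simplified stand-ins for the upsert logic described next.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Job parameters supplied on the Job details tab (see the steps that follow).
args = getResolvedOptions(sys.argv, ["JOB_NAME", "db_name", "s3_bucket_name", "table_name"])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the Account object through the Salesforce connection created earlier.
# connectionName must match the AWS Glue connection name (Salesforce_Connection).
sf_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="salesforce",
    connection_options={
        "connectionName": "Salesforce_Connection",
        "ENTITY_NAME": "Account",   # Salesforce object to ingest
        "API_VERSION": "v60.0",     # assumption: adjust to your org's API version
    },
)
sf_df = sf_dyf.toDF()

table = f"glue_catalog.{args['db_name']}.{args['table_name']}"
sf_df.createOrReplaceTempView("account_src")

# Simplified stand-in for the upsert logic: create the Iceberg table on the first
# run; otherwise MERGE the latest Salesforce snapshot into the existing table on Id.
existing = (
    spark.sql(f"SHOW TABLES IN glue_catalog.{args['db_name']}")
    .filter(f"tableName = '{args['table_name']}'")
    .count()
    > 0
)

if not existing:
    spark.sql(f"""
        CREATE TABLE {table}
        USING iceberg
        LOCATION 's3://{args['s3_bucket_name']}/{args['table_name']}/'
        AS SELECT * FROM account_src
    """)
else:
    spark.sql(f"""
        MERGE INTO {table} AS target
        USING account_src AS source
        ON target.Id = source.Id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

job.commit()
```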
The script fetches records from the Salesforce account object. Then it checks if the account table exists in the transactional data lake. If the table doesn’t exist, it creates a new table and inserts the records. If the table exists, it performs an upsert operation.
- On the Job details tab, for IAM role, choose AWSGlueServiceRole-SalesforceConnectorJob.
- Under Advanced properties, for Additional network connection, choose the Salesforce connection.
- Set up the job parameters:
  - --conf: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.warehouse=file:///tmp/spark-warehouse
  - --datalake-formats: iceberg
  - --db_name: glue_etl_salesforce_db
  - --s3_bucket_name: your S3 bucket name
  - --table_name: account
- Save the job and run it.
Depending on the size of the data in your account object in Salesforce, the job will take a few minutes to complete. After a successful job run, a new table called account is created and populated with Salesforce account information.
- You can use Amazon Athena to query the data:
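The query isn’t included here; any SELECT against the account table in the glue_etl_salesforce_db database works. The following is a minimal sketch using boto3 (the column names and the query results location are illustrative assumptions); you can also run the same statement directly in the Athena query editor.

```python
import boto3

athena = boto3.client("athena")

# Preview the ingested Account records. The output location below is a
# placeholder for your Athena query results bucket.
athena.start_query_execution(
    QueryString="SELECT id, name, upsellopportunity__c FROM account LIMIT 10",
    QueryExecutionContext={"Database": "glue_etl_salesforce_db"},
    ResultConfiguration={"OutputLocation": "s3://your-s3-bucket/athena-results/"},
)
```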
Validate transactional capabilities
You can validate the transactional capabilities supported by Apache Iceberg. For testing, try three operations: insert, update, and delete:
- Create a new account object in Salesforce, rerun the AWS Glue job, then run the query in Athena to validate the new account is created.
- Delete an account in Salesforce, rerun the AWS Glue job, and validate the deletion using Athena.
- Update an account in Salesforce, rerun the AWS Glue job, and validate the update operation using Athena.
Create an ETL job to send updates back to Salesforce
AWS Glue also allows you to write data back to Salesforce. Complete the following steps to create an ETL job in AWS Glue to get updates from the transactional data lake and write them to Salesforce. In this scenario, you update an account record and push it back to Salesforce.
- On the AWS Glue console, create a new job (with the Script editor option). For this post, we name the job S3_to_Salesforce_Account_Writeback.
- On the Script tab, enter the S3_to_Salesforce_Account_Writeback script.

Make sure that the name you used to create the Salesforce connection is passed as the connectionName parameter value in the script:
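As with the ingestion job, the complete script isn’t shown. The following condensed sketch assumes the names used in this post and pushes only the Id and the changed field back to the Account object; the write-specific option names (such as the operation type) should be verified against the Salesforce connection options documentation.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "db_name", "table_name"])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the updated records from the Iceberg table. Selecting only the record Id
# and the changed field keeps the write-back payload minimal.
updates_df = spark.sql(f"""
    SELECT Id, UpsellOpportunity__c
    FROM glue_catalog.{args['db_name']}.{args['table_name']}
""")
updates_dyf = DynamicFrame.fromDF(updates_df, glueContext, "account_updates")

# Write the changes back to the Account object through the same connection.
# connectionName must match the AWS Glue connection name (Salesforce_Connection).
# WRITE_OPERATION is an assumption based on the Salesforce connection options
# reference; verify the option name and allowed values in the documentation.
glueContext.write_dynamic_frame.from_options(
    frame=updates_dyf,
    connection_type="salesforce",
    connection_options={
        "connectionName": "Salesforce_Connection",
        "ENTITY_NAME": "Account",
        "API_VERSION": "v60.0",       # assumption: adjust to your org's API version
        "WRITE_OPERATION": "UPDATE",  # assumption: update existing records by Id
    },
)

job.commit()
```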
- On the Job details tab, for IAM role, choose AWSGlueServiceRole-SalesforceConnectorJob.
- Configure the job parameters:
  - --conf: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.warehouse=file:///tmp/spark-warehouse
  - --datalake-formats: iceberg
  - --db_name: glue_etl_salesforce_db
  - --table_name: account
- Run the update query in Athena to change the value of UpsellOpportunity__c for a Salesforce account to “Yes” (an example query follows this list).
- Run the S3_to_Salesforce_Account_Writeback AWS Glue job.

Depending on the size of the data in your account object in Salesforce, the job will take a few minutes to complete.

- Validate the object in Salesforce. The value of UpsellOpportunity__c should change.
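For reference, here is a sketch of the update query mentioned in the first step of this list. It assumes the database and table names used in this post; the account targeted in the WHERE clause and the query results location are illustrative. Athena supports UPDATE statements on Apache Iceberg tables.

```python
import boto3

athena = boto3.client("athena")

# Flip UpsellOpportunity__c to "Yes" for a single account. The account name in
# the WHERE clause is hypothetical; target whichever record you want to test with.
athena.start_query_execution(
    QueryString="""
        UPDATE account
        SET upsellopportunity__c = 'Yes'
        WHERE name = 'Example Account'
    """,
    QueryExecutionContext={"Database": "glue_etl_salesforce_db"},
    ResultConfiguration={"OutputLocation": "s3://your-s3-bucket/athena-results/"},
)
```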
You have now successfully validated the Salesforce connector.
Considerations
You can set up AWS Glue job triggers to run the ETL jobs on a schedule, so that the data is regularly synchronized between Salesforce and Amazon S3. You can also integrate the ETL jobs with other AWS services, such as AWS Step Functions, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), AWS Lambda, or Amazon EventBridge, to create a more advanced data processing pipeline.
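For example, a schedule-based trigger that reruns the ingestion job every hour could look like the following sketch; the trigger name and cron expression are assumptions.

```python
import boto3

glue = boto3.client("glue")

# Rerun the ingestion job every hour so the data lake keeps tracking Salesforce
# changes. The trigger name and schedule are examples; adjust them as needed.
glue.create_trigger(
    Name="salesforce-account-hourly-sync",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    Actions=[{"JobName": "Salesforce_to_S3_Account_Ingestion"}],
    StartOnCreation=True,
)
```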
By default, the Salesforce connector doesn’t import deleted records from Salesforce objects. However, you can set the IMPORT_DELETED_RECORDS option to “true” to import all records, including the deleted ones, as shown in the following example. Refer to Salesforce connection options for the full list of supported options.
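For example, the read in the ingestion sketch could use connection options like the following (the API version is an assumption; see the connection options reference for the complete list):

```python
# Connection options for the Salesforce read, extended to include deleted records.
# Pass this dict as connection_options to create_dynamic_frame.from_options.
connection_options = {
    "connectionName": "Salesforce_Connection",
    "ENTITY_NAME": "Account",
    "API_VERSION": "v60.0",            # assumption: adjust to your org's API version
    "IMPORT_DELETED_RECORDS": "true",  # default behavior skips deleted records
}
```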
Clean up
To avoid incurring charges, clean up the resources used in this post from your AWS account, including the AWS Glue jobs, Salesforce connection, Secrets Manager secret, IAM role, and S3 bucket.
Conclusion
The AWS Glue connector for Salesforce simplifies the analytics pipeline, reduces time to insights, and facilitates data-driven decision-making. It empowers organizations to streamline data integration and analytics. The serverless nature of AWS Glue means there is no infrastructure management, and you pay only for the resources consumed while your jobs are running. As organizations increasingly rely on data for decision-making, this Salesforce connector provides an efficient, cost-effective, and agile solution to swiftly meet data analytics needs.
To learn more about the AWS Glue connector for Salesforce, refer to Connecting to Salesforce in AWS Glue Studio. The user guide walks through the entire process, from setting up the connection to running the data transfer flow. For more information about AWS Glue, visit AWS Glue.
About the authors
Ramakant Joshi is an AWS Solutions Architect, specializing in the analytics and serverless domain. He has a background in software development and hybrid architectures, and is passionate about helping customers modernize their cloud architecture.
Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect, Amazon MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest Amazon MWAA and AWS Glue features and news!
Debaprasun Chakraborty is an AWS Solutions Architect, specializing in the analytics domain. He has around 20 years of software development and architecture experience. He is passionate about helping customers in cloud adoption, migration and strategy.