Integration & Automation
Data analysis with SnapLogic
In May 2019, we released an AWS Quick Start that helps you build a data lake environment on the Amazon Web Services (AWS) Cloud with SnapLogic. The Quick Start deploys SnapLogic components and AWS services like Amazon Simple Storage Service (Amazon S3) and Amazon Redshift. This blog post is a hands-on companion that you can use to test out an optional sample application included with the Quick Start.
The sample provides a pre-built SnapLogic pipeline that shows the architecture and stages of end-to-end data flow. I’ll show you how to use the pipeline to transform your input data, load it into an Amazon Redshift data warehouse, query the data in Amazon Redshift using a Secure Shell (SSH) tunnel, and then visualize it with Amazon QuickSight, so that you can analyze the data.
You’ll test the following pipeline, which orchestrates data flow between endpoints. You can use the pipeline building blocks (or snaps) to integrate data from applications and derive insights.
Prerequisites
- Deploy the Quick Start into a new virtual private cloud (VPC) or an existing VPC, by following the steps in the Data lake with SnapLogic Quick Start deployment guide.
Important A valid SnapLogic account and a launched Quick Start stack are prerequisites for following the rest of the blog post. If you don’t have an existing SnapLogic account, you can request a 30-day free trial license by filling out the registration form on the SnapLogic website. For details, see the Data lake with SnapLogic Quick Start deployment guide.
- Clone the GitHub repository for this Quick Start to your computer:
git clone https://github.com/aws-quickstart/quickstart-datalake-snaplogic.git
Test the pipeline
- Log in to the SnapLogic console.
- Go to the Dashboard section, and verify that you have an active SnapLogic Groundplex node. The Groundplex server is the underlying component that processes data from required sources.
- Navigate to the Designer section, and choose Import a Pipeline. In the File menu pop-up, go to the samples directory in the cloned Quick Start repo, and choose the pipeline file AWS_Datalake_Sample_pipeline.slp.
- Leave all the default values, and then choose Save.
- In the pipeline, choose the File Reader snap and, in the File section, select the database icon to open a Select File dialog box. Choose Upload, navigate to the samples directory, and upload Sample employee records.csv. Select this file, and then choose Save.
At this point, your pipeline will have a few snaps in red, which indicates errors: specifically, the S3 File Writer, Redshift – Insert, and Redshift – Insert1 (output1) snaps. You’ll fix these errors in the next steps.
Configure the Amazon Redshift account
- Choose the Redshift – Insert1 (output1) snap, go to the Account tab, and then choose Edit Account. In the Endpoint field, provide the value of the RedshiftEndpoint CloudFormation stack output (for one way to look up this output with the AWS CLI, see the sketch after these steps). For the Database name, Username, and Password fields, provide the values of the RedshiftDbName, RedshiftUsername, and RedshiftPassword parameters, respectively, that you used to create the stack. Choose Validate, verify that the validation is successful, choose Apply, and then close the dialog box.
- Choose the Redshift – Insert snap that appears to the far right of the pipeline, go to Account, choose the Amazon Redshift account, and then choose Save.
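If you no longer have the CloudFormation console open, one way to look up the RedshiftEndpoint output is with the AWS CLI. The following is a minimal sketch; <your_stack_name> is a placeholder for the name of your Quick Start stack.

# Look up the RedshiftEndpoint output of your Quick Start stack
aws cloudformation describe-stacks \
  --stack-name <your_stack_name> \
  --query "Stacks[0].Outputs[?OutputKey=='RedshiftEndpoint'].OutputValue" \
  --output text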
At this point, the snaps in your pipeline are in a validated status, except for the S3 File Writer snap, which you’ll fix in the next steps.
Important You need a valid Amazon S3 bucket for testing. If you don’t already have one, create a new S3 bucket by using the AWS Management Console or the AWS Command Line Interface (AWS CLI) before you proceed with configuring the Amazon S3 account.
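For example, here’s a minimal AWS CLI sketch for creating and verifying a bucket. The bucket name my-test-bucket and the Region us-east-1 are placeholders; bucket names must be globally unique, so substitute your own.

# Create a test bucket in your preferred Region
aws s3 mb s3://my-test-bucket --region us-east-1
# Confirm that the bucket exists and is accessible
aws s3 ls s3://my-test-bucket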
Configure the Amazon S3 account
- Choose the S3 File Writer snap, and then go to the Account section.
- Choose Add Account. In Create Account, add a label, and supply an access key and secret key for an AWS Identity and Access Management (IAM) user who has necessary permissions on the S3 bucket that you plan to use. To verify that the connection is successful, choose Validate. Choose Apply, and close the dialog box.
- In the Settings section, update the Filename field by replacing awsexamplebucket with your bucket name; for example, my-test-bucket. Choose Save, and then close the dialog box. Verify that the validation on the snap was successful.
- Choose the S3 File Reader snap, go to the Account section and, in the drop-down menu, choose the S3 account that you created earlier. Choose Save, and then close the dialog box.
At this point, all of the snaps in your pipeline should be in a validated state.
- From the menu, choose Execute Pipeline. After it’s finished running, choose Check Pipeline Statistics to view the pipeline execution statistics.
Now that the pipeline execution is finished, the transformed data is persisted to Amazon Redshift for querying, visualizing, and analyzing.
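If you’d like to confirm the intermediate output of the S3 File Writer snap, you can list the contents of the bucket that you configured earlier. This is only a quick sanity check; my-test-bucket is the example bucket name from the previous section.

# List the objects that the pipeline wrote to your bucket
aws s3 ls s3://my-test-bucket --recursive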
Query and visualize the data
Here are a couple of ways you can query and visualize the data for analysis.
Method 1: Querying data in Amazon Redshift using an SSH tunnel
The Amazon Redshift cluster is located in a private subnet, so you can’t query it directly from your computer. Instead, you can set up an SSH tunnel through the bastion host to forward a local port to the cluster.
- Allow a connection to Amazon Redshift from the bastion host. In the Amazon Elastic Compute Cloud (Amazon EC2) console, go to Security Groups, and make a note of the bastion security group ID. Choose the Redshift security group and, in the Inbound section, choose Edit. In the Edit inbound rules pop-up menu, choose Redshift as the type, enter the bastion security group ID as the source, and then choose Save.
- In a terminal on your computer, set up an SSH tunnel by using the following command, and supply the values for your key file, Redshift host, and bastion host.
ssh -i <path/to/your_key.pem> -NL 5439:<your_redshift_host>:5439 ec2-user@<your_bastion_host> -v
- In your preferred database client, verify that you can access the database by using the URL jdbc:redshift://127.0.0.1:5439/snaplogicdb.
- After the connection is successful, try running a few queries.
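For example, if you have the PostgreSQL psql client installed, you can connect through the local end of the tunnel and run a quick test query. This sketch assumes the snaplogicdb database from the JDBC URL above, the Redshift user name that you supplied when you created the stack, and the aws_quick_start table that the pipeline loads; adjust the names to match your deployment.

# Connect through the SSH tunnel and count the rows that the pipeline loaded
psql "host=127.0.0.1 port=5439 dbname=snaplogicdb user=<your_redshift_username>" \
  -c "SELECT COUNT(*) FROM aws_quick_start;"

psql prompts you for the password that you set in the RedshiftPassword parameter.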
Method 2: Visualizing the data in Amazon Redshift using Amazon QuickSight
The Amazon Redshift cluster is in a private subnet. For Amazon QuickSight to reach the VPC, you need to set up a VPC connection.
Important VPC connections are available only in Amazon QuickSight Enterprise Edition.
- Ensure that the security group allows access to Amazon Redshift from Amazon QuickSight; you can read more about this in the Amazon QuickSight documentation. Add a new rule to the CloudplexServerSecurityGroup, which is already set up to allow access to Amazon Redshift from the Groundplex server Amazon EC2 instances in the private subnets. Choose the Inbound section, and then choose Edit. In the Edit inbound rules pop-up, add a rule with All TCP as the type and the VPC CIDR block (for example, 10.0.0.0/16) as the source. (A CLI sketch of this rule appears after these steps.)
- In the Amazon QuickSight console, choose Manage QuickSight, choose Manage VPC Connections, and then choose Add VPC Connection. Fill in the fields appropriately:
- VPC connection name: for example, SnaplogicTest
- VPC ID: Choose the SnapLogic VPC.
- Subnet ID: Pick a private subnet in the VPC.
- Security group ID: Provide the ID of the CloudplexServerSecurityGroup that you modified earlier.
- Choose Create, and then wait about 2 minutes for the connection to be established.
- Go to the Amazon QuickSight home page, choose Manage Data, and then choose New data set. On the next screen, choose Redshift Auto-discovered. Provide the requested details, and then choose Validate connection. After successful validation, choose Create data source.
- On the next screen, choose the aws_quick_start table.
- On the Finish data set creation step, choose Import to SPICE, and then choose Visualize.
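For reference, step 1 in this list can also be expressed with the AWS CLI. The following is a sketch only; <cloudplex_security_group_id> is a placeholder for the ID of the CloudplexServerSecurityGroup, and 10.0.0.0/16 assumes the default Quick Start VPC CIDR block.

# Allow all TCP traffic from the VPC CIDR block so that Amazon QuickSight can reach Amazon Redshift
aws ec2 authorize-security-group-ingress \
  --group-id <cloudplex_security_group_id> \
  --protocol tcp \
  --port 0-65535 \
  --cidr 10.0.0.0/16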
Analyze the data
You can now analyze the data in Amazon QuickSight to gain insights. I’ve provided a couple of charts to get you started.
Conclusion
In this blog post, I’ve shown you an end-to-end process of using a SnapLogic pipeline for transforming your input data, loading it into an Amazon Redshift data warehouse, and then visualizing the data with Amazon QuickSight, so that you can analyze it. I hope you enjoyed reading this post and working through the content. If you have any feedback on the steps in this post, please add your comments.