AWS Big Data Blog
Doing data preparation using on-premises PostgreSQL databases with AWS Glue DataBrew
Today, with AWS Glue DataBrew, data analysts and data scientists can easily access and visually explore any amount of data across their organization directly from their Amazon Simple Storage Service (Amazon S3) data lake, Amazon Redshift data warehouse, and Amazon Aurora and Amazon Relational Database Service (Amazon RDS) databases. Customers can choose from over 250 built-in functions to combine, pivot, and transpose the data without writing code.
Now, with added support for JDBC-accessible databases, DataBrew also supports additional data stores, including PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. In this blog post, we use DataBrew to clean data from an on-premises database and store the cleaned data in an Amazon S3 data lake.
Solution Overview
I will configure an existing subnet in an Amazon VPC for use with DataBrew, then configure DataBrew to securely connect to an existing on-premises database and run a data preparation job.
Components
- You should have an AWS account with a Virtual Private Cloud (Amazon VPC). DataBrew will connect to your database from this VPC.
- You should have a subnet within your Amazon VPC. In this blog, this subnet will be configured for use with DataBrew.
- You should have an on-premises database with data to be cleaned with DataBrew.
- I assume you have a VPN connection between your Amazon VPC and on-premises network to enable secure connections between them. I’ve implemented a VPN tunnel using AWS Site-to-Site VPN. You may choose to Simulate Site-to-Site VPN Customer Gateways Using strongSwan.
- This guide will walk through creation of a DataBrew dataset, project, and job.
- DataBrew requires access to Amazon S3 and AWS Glue. This guide will walk through creating VPC endpoints to enable private connections between your VPC and these AWS services for DataBrew to use.
- To establish network connectivity, DataBrew will provision an Amazon VPC elastic network interface in the VPC you specify. This blog will cover securing this network interface with a security group.
Prerequisites
To complete this blog, you should have the following prerequisites:
- An AWS account
- A Virtual Private Cloud (Amazon VPC) and subnet
- An on-premises network and database
- A VPN connection between your Amazon VPC and on-premises network
Additionally, you need to have enabled access to your on-premises network from the subnet in the Amazon VPC. If you haven’t enabled it already, you can Simulate Site-to-Site VPN Customer Gateways Using strongSwan, or you can enable access by completing the AWS Site-to-Site VPN getting started guide.
If you are unsure whether you have enabled access from your VPC subnet to your on-premises database, you can test access by running the AWS Systems Manager automation AWSSupport-SetupIPMonitoringFromVPC. From the User Guide, choose Run this Automation. In the Systems Manager console, under Input Parameters, enter the Amazon VPC subnet ID for SubnetId and the IP address of your on-premises host for TargetIPs, then choose Execute. Once the automation completes, locate the Outputs section and open the URL linked under createCloudWatchDashboard.Output. From that dashboard, confirm in the Ping log group that pings are successful. If they are not, you will need to investigate the network path. A useful resource for this is How do I troubleshoot instance connection timeout errors in Amazon VPC.
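If you prefer to start that check from the command line, a minimal AWS CLI sketch might look like the following; the subnet ID and target IP are placeholders for your own values.

```
# Run the connectivity-test automation against your VPC subnet and on-premises host
aws ssm start-automation-execution \
    --document-name "AWSSupport-SetupIPMonitoringFromVPC" \
    --parameters "SubnetId=subnet-0123456789abcdef0,TargetIPs=10.196.0.20"

# Check the status of the automation execution
aws ssm describe-automation-executions \
    --filters "Key=DocumentNamePrefix,Values=AWSSupport-SetupIPMonitoringFromVPC"
```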
Step 1: Configure the Amazon VPC
Ensure your Amazon VPC has DNS support and DNS hostnames enabled. You can verify this by selecting your VPC in the Amazon VPC console and checking the details for DNS hostnames and DNS resolution. If they are disabled, you can enable them by choosing Actions and then the corresponding Edit option.
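If you would rather enable these attributes from the AWS CLI, a minimal sketch looks like this; the VPC ID is a placeholder.

```
# Enable DNS resolution and DNS hostnames for the VPC (two separate calls are required)
aws ec2 modify-vpc-attribute --vpc-id vpc-0123456789abcdef0 --enable-dns-support "{\"Value\":true}"
aws ec2 modify-vpc-attribute --vpc-id vpc-0123456789abcdef0 --enable-dns-hostnames "{\"Value\":true}"
```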
On-premises or hybrid DNS is also supported, but requires additional setup. See Other considerations at the end of this post for more.
Step 2: Configure the Amazon VPC Subnet
Your subnet must have access to the Amazon S3 and AWS Glue services. I will add VPC endpoints for Amazon S3 and AWS Glue to keep my traffic within the AWS network; an equivalent AWS CLI sketch follows the steps below.
- To add the VPC endpoint for Amazon S3, open the Amazon VPC console at https://console.aws.amazon.com/vpc/.
- In the navigation pane, choose Endpoints, Create Endpoint.
- Filter by “s3”.
- Choose the service where the Type column indicates Gateway.
- Select the route tables associated with the subnet to be used with DataBrew.
- Choose Create endpoint.
- To add the VPC endpoint for AWS Glue, again choose Create Endpoint.
- Filter by “glue”.
- Choose the service where the Type column indicates Interface.
- Select the subnet to be used with DataBrew, and choose the security group to associate with the endpoint.
- Choose Create endpoint.
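If you prefer the AWS CLI, the following sketch creates the same two endpoints. The VPC, route table, subnet, and security group IDs are placeholders, and the service names assume the us-east-1 Region; substitute your own Region.

```
# Gateway endpoint for Amazon S3, attached to the subnet's route table
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --vpc-endpoint-type Gateway \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-0123456789abcdef0

# Interface endpoint for AWS Glue in the subnet used with DataBrew
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --vpc-endpoint-type Interface \
    --service-name com.amazonaws.us-east-1.glue \
    --subnet-ids subnet-0123456789abcdef0 \
    --security-group-ids sg-0123456789abcdef0 \
    --private-dns-enabled
```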
Step 3: Configure the network ACL
By default, network ACLs allow all inbound and outbound traffic. If you have customized your network ACL, ensure that outbound traffic to, and inbound return traffic from, your on-premises network, Amazon S3, and AWS Glue is allowed.
- From the Amazon VPC console, choose Subnets.
- Choose the subnet you are using with DataBrew.
- From the Details tab, choose the Network ACL link.
- Validate your inbound and outbound rules, and update them to allow the required traffic if needed. The screenshot below shows the default rules I am using.
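You can also review the ACL associated with the subnet from the AWS CLI; a minimal sketch, with a placeholder subnet ID:

```
# List the network ACL entries for the ACL associated with the DataBrew subnet
aws ec2 describe-network-acls \
    --filters "Name=association.subnet-id,Values=subnet-0123456789abcdef0" \
    --query "NetworkAcls[].Entries"
```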
Step 4: Configure the VPC security group
To provide connectivity to your VPC, DataBrew creates an elastic network interface (ENI) in the VPC subnet you specify. DataBrew attaches the security group you specify to the ENI to limit network access. This security group must have a self-referential rule that allows all inbound TCP traffic from itself, which restricts inbound access to sources within the same security group. I will be using the default security group, which has the following configuration.
Your security group must allow outbound traffic to itself, Amazon S3, AWS Glue, and your on-premises network. I’ll be using the default security group, which allows all outbound traffic.
Optionally, you may wish to explicitly restrict outbound traffic to only your on-premises network, Amazon S3, and AWS Glue. To do so, remove the default allow-all outbound rule. Ensure your security group has a self-referential rule to allow all outbound TCP traffic to itself. Allow traffic to your on-premises network by specifying the CIDR block associated with your network; in my case, it is 10.196.0.0/16. Allow traffic to Amazon S3 with the AWS-managed S3 prefix list, which includes the set of CIDR blocks for Amazon S3. Allow traffic to the AWS Glue VPC endpoint by associating the same security group with the AWS Glue VPC endpoint created above from the Amazon VPC console.
An example of what these scoped-down outbound rules may look like:
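As a rough AWS CLI equivalent of those scoped-down rules, the following sketch could be used; the security group ID, S3 prefix list ID, and CIDR block are placeholders, so substitute the values from your own account.

```
SG_ID=sg-0123456789abcdef0

# Remove the default allow-all outbound rule
aws ec2 revoke-security-group-egress --group-id $SG_ID \
    --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'

# Allow all outbound TCP traffic to the security group itself (self-referential rule)
aws ec2 authorize-security-group-egress --group-id $SG_ID \
    --ip-permissions "[{\"IpProtocol\":\"tcp\",\"FromPort\":0,\"ToPort\":65535,\"UserIdGroupPairs\":[{\"GroupId\":\"$SG_ID\"}]}]"

# Allow outbound traffic to Amazon S3 via the AWS-managed prefix list
aws ec2 authorize-security-group-egress --group-id $SG_ID \
    --ip-permissions '[{"IpProtocol":"-1","PrefixListIds":[{"PrefixListId":"pl-12345678"}]}]'

# Allow outbound traffic to the on-premises network CIDR block
aws ec2 authorize-security-group-egress --group-id $SG_ID \
    --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"10.196.0.0/16"}]}]'
```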
Ensure your on-premises network security rules allow traffic from your Amazon VPC subnet’s CIDR block.
Step 5: Create database credentials
Following best practices, I will create a database user with scoped-down permissions for use with DataBrew.
- Connect to your database. In my case, with psql -h 10.196.0.20
- Create a user, which I’ll call postgresql, with read-only access to the table that will be used with DataBrew. My table is called demo in database postgres. I’ll do this by executing a few queries, sketched below.
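A minimal sketch of those queries, assuming the demo table is in the public schema of the postgres database; replace the password and adjust names for your environment.

```
-- Create the user DataBrew will connect as
CREATE USER postgresql WITH PASSWORD 'choose-a-strong-password';

-- Allow the user to connect to the database and use the schema
GRANT CONNECT ON DATABASE postgres TO postgresql;
GRANT USAGE ON SCHEMA public TO postgresql;

-- Grant read-only access to the single table used with DataBrew
GRANT SELECT ON public.demo TO postgresql;
```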
Step 6: Create DataBrew project
- From the AWS DataBrew console, choose Create project.
- Enter a Project name.
- Under Select a dataset choose New dataset.
- Enter a Dataset name.
- Under Connect to new dataset choose JDBC.
- Choose Add JDBC connection.
- Enter a Connection name; I use my-connection.
- Choose Enter JDBC details.
- Choose the Database type; in my case, PostgreSQL.
- For Host/endpoint, enter your host’s private IP address.
- Enter your Database name, Database user, and Database password.
- Choose your VPC, and the Subnet and Security Group you configured above.
- Review “Additional configurations”, where you can optionally configure the following:
- If you are using a recent database version, such as MySQL 8, you may need to provide a custom JDBC driver. For more information, see the Developer Guide.
- DataBrew can be set to fail the connection to your database if it is unable to connect over SSL. Additionally, DataBrew provides default certificates for establishing SSL connections. If you obtained a certificate from a third-party issuer, or if the default certificates provided by DataBrew do not meet your requirements, you can provide your own. DataBrew handles only X.509 certificates, and the certificate must be DER encoded and supplied in base64-encoded PEM format.
- Choose Create connection at the bottom of the modal.
- Choose the newly created connection by clicking on its name.
- Enter the name of the table within your database you want to bring into DataBrew.
- Under the Permissions header, choose Create new IAM role from the dropdown and enter a role suffix.
- Choose Create project, which opens the project view. After one to two minutes, you will be able to work with your data. If the connection fails, see How do I troubleshoot instance connection timeout errors in Amazon VPC.
- Start by applying some simple transforms. I’m dropping some columns that are not needed in my data lake. To do so, from the action bar, I choose COLUMN, then Delete.
- This opens the sidebar, where I choose the column to delete and choose Apply.
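For reference, a published DataBrew recipe can be downloaded as JSON. Based on the DataBrew recipe actions reference, a delete-column step like the one above would look roughly like the following; the column name is a placeholder, and you should verify the exact operation and parameter names against the current documentation.

```
[
  {
    "Action": {
      "Operation": "DELETE",
      "Parameters": {
        "sourceColumns": "[\"unneeded_column\"]"
      }
    }
  }
]
```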
Step 7: Create DataBrew job
Once I’ve added a few transforms to my project’s recipe, I will run a job to execute the recipe against my full dataset and store the result in my Amazon S3 bucket.
- Choose Create job from the top of the project grid view.
- On the job page, provide a Job name and S3 output location.
- Under the header Permissions, choose Create new IAM role. This creates a new scoped-down IAM role with the permissions required to execute your job.
- Finally, choose Create and run job. Once the job completes, you can view the output in Amazon S3.
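Jobs can also be started and monitored outside the console. A minimal AWS CLI sketch, where the job name is a placeholder for whatever you entered above:

```
# Start the DataBrew job and then check the status of its runs
aws databrew start-job-run --name my-databrew-job
aws databrew list-job-runs --name my-databrew-job
```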
Cleanup
From the DataBrew console, delete your Job, Project, Recipe, and Dataset.
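If you prefer the AWS CLI, a rough equivalent of that cleanup might be the following; the resource names and recipe version are placeholders for the ones you created.

```
aws databrew delete-job --name my-databrew-job
aws databrew delete-project --name my-databrew-project
aws databrew delete-recipe-version --name my-databrew-recipe --recipe-version "1.0"
aws databrew delete-dataset --name my-databrew-dataset
```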
If you executed the Systems Manager automation to test access, open the Systems Manager console and choose CloudWatch Dashboard. Select the dashboard created by the automation, choose Terminate test, and then choose Execute.
Other considerations
AWS Glue DataBrew’s networking requirements are similar to those of AWS Glue ETL jobs. The following sections summarize some of those advanced networking considerations. For more details on AWS Glue ETL, see How to access and analyze on-premises data stores using AWS Glue by Rajeev Meharwal.
DNS
If you are using the Amazon VPC-provided DNS, ensure you have enabled DnsHostnames and DnsSupport for your VPC. For more information, see DNS support in your VPC.
If you have configured a custom DNS server with your Amazon VPC, you must implement forward and reverse lookups for Amazon EC2 private DNS hostnames. For more information, see Amazon DNS server. Alternatively, set up hybrid DNS resolution to resolve both on-premises DNS servers and the VPC-provided DNS. For implementation details, see the following AWS Security Blog posts:
- How to Set Up DNS Resolution Between On-Premises Networks and AWS by Using Unbound
- How to Set Up DNS Resolution Between On-Premises Networks and AWS Using AWS Directory Service and Microsoft Active Directory
Joining or unioning multiple databases
If you are joining a database dataset into your project, the database must be accessible from the project dataset’s subnet.
For example, suppose you have completed the setup above using Private Subnet 1, and you have another Amazon RDS database in Private Subnet 2 in the same VPC, as shown below. You will need a local route in the route table associated with Subnet 1. You will also need to ensure the security group attached to your Amazon RDS database allows inbound traffic from your DataBrew security group.
If your Amazon RDS database is in a different Amazon VPC from the one you are using with DataBrew, you will need to set up VPC peering.
About the Authors
John Espenhahn is a Software Engineer working on AWS Glue DataBrew service. He has also worked on Amazon Kendra user experience as a part of Database, Analytics & AI AWS consoles. He is passionate about technology and building in the analytics space.
Nitya Sheth is a Software Engineer working on AWS Glue DataBrew service. He has also worked on AWS Synthetics service as well as on user experience implementations for Database, Analytics & AI AWS consoles. In his free time, he divides his time between exploring new hiking places and new books.