AWS Big Data Blog

Use AWS Glue to streamline SFTP data processing

In today’s data-driven world, seamless integration and transformation of data across diverse sources into actionable insights is paramount. AWS Glue is a serverless data integration service that helps analytics users to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. With AWS Glue, you can discover and connect to hundreds of diverse data sources and manage your data in a centralized data catalog. It enables you to visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes.

In this blog post, we explore how to use the SFTP Connector for AWS Glue from the AWS Marketplace to efficiently process data from Secure File Transfer Protocol (SFTP) servers into Amazon Simple Storage Service (Amazon S3), further empowering your data analytics and insights.

Introducing the SFTP connector for AWS Glue

The SFTP connector for AWS Glue simplifies the process of connecting AWS Glue jobs to extract data from SFTP storage and to load data into SFTP storage. This connector provides comprehensive access to SFTP storage, facilitating cloud ETL processes for operational reporting, backup and disaster recovery, data governance, and more.

Solution overview

In this example, you use AWS Glue Studio to connect to an SFTP server, then enrich that data and upload it to Amazon S3. The SFTP connector is used to manage the connection to the SFTP server. You will load the event data from the SFTP site, join it to the venue data stored on Amazon S3, apply transformations, and store the data in Amazon S3. The event and venue files are from the TICKIT dataset.

The TICKIT dataset tracks sales activity for the fictional TICKIT website, where users buy and sell tickets online for sporting events, shows, and concerts. In this dataset, analysts can identify ticket movement over time, success rates for sellers, and best-selling events, venues, and seasons.

For this example, you use AWS Glue Studio to develop a visual ETL pipeline. This pipeline will read data from an SFTP server, perform transformations, and then load the transformed data into Amazon S3. The following diagram illustrates this architecture.

solution overview

By the end of this post, your visual ETL job will resemble the following screenshot.

final solution

Prerequisites

For this solution, you need the following:

  • Subscribe to the SFTP Connector for AWS Glue in the AWS Marketplace.
  • Access to an SFTP server with permissions to upload and download data.
    • If the SFTP server is hosted on Amazon Elastic Compute Cloud (Amazon EC2), we recommend that the network communication between the SFTP server and the AWS Glue job happens within the virtual private cloud (VPC) as pictured in the preceding architecture diagram. Running your Glue job within a VPC and security group will be discussed further in the steps to create the AWS Glue job.
    • If the SFTP server is hosted within your on-premises network, we recommend that the network communication between the SFTP server and the Glue job happens through VPN or AWS Direct Connect.
  • Access to an S3 bucket or the permissions to create an S3 bucket. We recommend that you connect to that bucket using a gateway endpoint. This will allow you to connect to your S3 bucket directly from your VPC. If you need to create an S3 bucket to store the results, complete the following steps:
    1. On the Amazon S3 console, choose Buckets in the navigation pane.
    2. Choose Create bucket.
    3. For Name, enter a globally unique name for your bucket; for example, tickit-use1-<accountnumber>.
    4. Choose Create bucket.
    5. For this demonstration, create a folder with the name tickit in your S3 bucket.
    6. Create the gateway endpoint. (A scripted sketch of the bucket, folder, and endpoint setup follows this prerequisites list.)
  • Create an AWS Identity and Access Management (IAM) role for the AWS Glue ETL job. You must specify an IAM role for the job to use. The role must grant access to all resources used by the job, including Amazon S3 (for any sources, targets, scripts, and temporary directories) and AWS Secrets Manager. For instructions, see Configure an IAM role for your ETL job.
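If you prefer to script the bucket, folder, and gateway endpoint setup from the S3 prerequisite above, the following is a minimal boto3 sketch. The Region, bucket name, VPC ID, and route table ID are placeholder assumptions; replace them with your own values.

import boto3

region = "us-east-1"                      # replace with your Region
bucket_name = "tickit-use1-123456789012"  # replace with a globally unique name
vpc_id = "vpc-0123456789abcdef0"          # VPC that hosts the SFTP server and Glue job
route_table_id = "rtb-0123456789abcdef0"  # route table used by the Glue job subnets

s3 = boto3.client("s3", region_name=region)
ec2 = boto3.client("ec2", region_name=region)

# Create the results bucket (us-east-1 needs no LocationConstraint)
s3.create_bucket(Bucket=bucket_name)

# Create the tickit/ folder used in this demonstration
s3.put_object(Bucket=bucket_name, Key="tickit/")

# Create the S3 gateway endpoint so the Glue job can reach the bucket from within the VPC
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId=vpc_id,
    ServiceName=f"com.amazonaws.{region}.s3",
    RouteTableIds=[route_table_id],
)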

Load dataset to SFTP site

Load the allevents_pipe.txt file and venue_pipe.txt file from the TICKIT dataset to your SFTP server.
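One way to stage these files is with an SFTP client library. The following is a minimal sketch using paramiko (an assumption, not part of the original walkthrough); the host, credentials, and remote /files path are placeholders to replace with your own values.

import paramiko

host = "sftp.example.com"   # replace with your SFTP server address
username = "sftpuser"       # replace with your SFTP username
password = "your-password"  # or authenticate with a private key instead

# Open an SFTP session to the server
transport = paramiko.Transport((host, 22))
transport.connect(username=username, password=password)
sftp = paramiko.SFTPClient.from_transport(transport)

# Upload the TICKIT event and venue files to the /files directory
for local_file in ["allevents_pipe.txt", "venue_pipe.txt"]:
    sftp.put(local_file, f"/files/{local_file}")

sftp.close()
transport.close()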

Store SFTP server sign-in credentials

An AWS Glue connection is a Data Catalog object that stores connection information, such as URI strings and the location of credentials stored in a Secrets Manager secret.

To store the SFTP server username and password in Secrets Manager, complete the following steps (a scripted alternative using the AWS SDK follows these steps):

  1. On the Secrets Manager console, choose Secrets in the navigation pane.
  2. Choose Store a new secret.
  3. Select Other type of secret.
  4. Enter host as Secret key and your SFTP server’s IP address as the Secret value, then choose Add row.
  5. Enter username as Secret key and your SFTP username as Secret value, then choose Add row.
  6. Enter password as Secret key and your SFTP password as Secret value, then choose Add row.
  7. Enter keyS3Uri as Secret key and the Amazon S3 location of your SFTP secret key file as Secret value.

Note: The Secret value is the full S3 path where the SFTP server key file is stored, for example, s3://sftp-bucket-johndoe123/id_rsa.

  8. For Secret name, enter a descriptive name, then choose Next.
  9. Choose Next to move to the review step, then choose Store.

secret value
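If you prefer to create the secret with the AWS SDK instead of the console, the following boto3 sketch stores the same key-value pairs. The secret name and all values are placeholders; include keyS3Uri only if your server uses key-based authentication.

import json
import boto3

secretsmanager = boto3.client("secretsmanager")

# Key-value pairs matching the console steps above (all values are placeholders)
secret_value = {
    "host": "203.0.113.10",
    "username": "sftpuser",
    "password": "your-password",
    "keyS3Uri": "s3://sftp-bucket-johndoe123/id_rsa",  # only if you use a key file
}

secretsmanager.create_secret(
    Name="sftp-connector-credentials",  # a descriptive secret name
    SecretString=json.dumps(secret_value),
)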

Create a connection to the SFTP server in AWS Glue

Complete the following steps to create your connection to the SFTP server.

  1. On the AWS Glue console, under Data Catalog in the navigation pane, choose Connections.

creating sftp connection from marketplace

  2. Select the SFTP connector for AWS Glue 4.0. Then choose Create connection.

using sftp connector

  3. Enter a name for the connection and then, under Connection access, choose the Secrets Manager secret you created for your SFTP server credentials.

finishing sftp connection

Create a connection to the VPC in AWS Glue

A data connection is used to establish network connectivity between the VPC and the AWS Glue job. To create the VPC connection, complete the following steps.

  1. On the AWS Glue console, choose Data connections in the navigation pane.
  2. In the Connections panel, choose Create connection.

creating connection for VPC

  3. Select Network.

choosing network option

  4. Select the VPC, Subnet, and Security group where your SFTP server resides, then choose Next.

choosing vpc, subnet, sg for connection

  5. Name the connection SFTP VPC Connect, then choose Create connection. (A scripted equivalent of this connection is sketched after these steps.)
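The same Network connection can also be created with the AWS SDK. The following boto3 sketch mirrors the console steps; the subnet, security group, and Availability Zone values are placeholders.

import boto3

glue = boto3.client("glue")

# Create a Network-type connection that gives the Glue job access to the VPC
glue.create_connection(
    ConnectionInput={
        "Name": "SFTP VPC Connect",
        "ConnectionType": "NETWORK",
        "ConnectionProperties": {},
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",           # subnet of the SFTP server
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],  # security group of the SFTP server
            "AvailabilityZone": "us-east-1a",                 # Availability Zone of that subnet
        },
    }
)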

Deploy the solution

Now that the prerequisites are complete, set up the AWS Glue Studio job for this solution. You will create a Glue Studio job, add the events and venue data from the SFTP server, apply data transformations, and load the transformed data to Amazon S3.

Create your AWS Glue Studio job:

  1. On the AWS Glue console, under ETL Jobs in the navigation pane, choose Visual ETL.
  2. Select Visual ETL in the central pane.
  3. Choose the pencil icon to enter a name for your job.
  4. Choose the Job details tab.

choosing job details

  5. Scroll down to Advanced properties and expand the section.
  6. Scroll to Connections and select SFTP VPC Connect.

choosing sftp vpc connection

  7. Choose Visual to go back to the workflow editor page.

Add the events data from the SFTP server as your first data set:

  1. Choose Add nodes and select SFTP Connector for AWS Glue 4.0 on the Sources tab.
  2. Enter the following Data source properties:
    1. Connection: Select the connection to the SFTP server that you created in Create a connection to the SFTP server in AWS Glue.
    2. Enter the following key-value pairs (a script-mode equivalent is sketched after the screenshot):

Key          Value
header       false
path         /files (the path to the event file on your SFTP server)
fileFormat   csv
delimiter    |

glue studio job configuration
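If you switch the job to script mode, the SFTP source node corresponds roughly to a create_dynamic_frame.from_options call. In the sketch below, the connection_type value and the connectionName option follow the generic pattern AWS Glue uses for Marketplace connectors and are assumptions; the remaining options mirror the key-value pairs above.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the event file from the SFTP server through the Marketplace connector
# (connection_type and connectionName are assumptions based on the generic
# Marketplace connector pattern; the remaining options match the console table)
events_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "connectionName": "your-sftp-connection",  # the SFTP connection created earlier
        "header": "false",
        "path": "/files",        # path to the event file on your SFTP server
        "fileFormat": "csv",
        "delimiter": "|",
    },
    transformation_ctx="events_source",
)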

Rename the columns of the Event dataset:

  1. Choose Add nodes and choose Change Schema on the Transforms tab.
  2. Enter the following transform properties:
    1. For Name, enter Rename Event data.
    2. For Node parents, select SFTP Connector for AWS Glue 4.0.
    3. In the Change Schema section, map the source keys to the target keys:
      1. col0: eventid
      2. col1: e_venueid
      3. col2: catid
      4. col3: dateid
      5. col4: eventname
      6. col5: starttime

transforming event data

Add the venue_pipe.txt file from the SFTP site:

  1. Choose Add nodes and choose SFTP Connector for AWS Glue 4.0 on the Sources tab.
  2. Enter the following Data source properties:
    1. Connection: Select the connection to the SFTP server that you created in Create a connection to the SFTP server in AWS Glue.
    2. Enter the following key-value pairs:

Key          Value
header       false
path         /files (the path to the venue file on your SFTP server)
fileFormat   csv
delimiter    |

Rename the columns of the venue dataset:

  1. Choose Add nodes and choose Change Schema on the Transforms tab.
  2. Enter the following transform properties:
    1. For Name, enter Rename Venue data.
    2. For Node parents, select Venue.
    3. In the Change Schema section, map the source keys to the target keys:
      1. col0: venueid
      2. col1: venuename
      3. col2: venuecity
      4. col3: venuestate
      5. col4: venueseats

transforming venue data

Join the venue and event datasets:

  1. Choose Add nodes and choose Join on the Transforms tab.
  2. Enter the following transform properties:
    1. For Name, enter Join.
    2. For Node parents, select Rename Venue data and Rename Event data.
    3. For Join type, select Inner join.
    4. For Join conditions, select venueid for Rename Venue data and e_venueid for Rename Event data.

transform join venue and event

Drop the duplicate field (a PySpark sketch of the rename, join, and drop transforms follows this section):

  1. Choose Add nodes and choose Drop Fields on the Transforms tab.
  2. Enter the following transform properties:
    1. For Name, enter Drop Fields.
    2. For Node parents, select Join.
    3. In the DropFields section, select e_venueid.

drop field transform
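Under the hood, these three transforms amount to standard Spark DataFrame operations. The following sketch shows roughly equivalent rename, join, and drop logic, assuming events_dyf and venue_dyf are the DynamicFrames read from the two SFTP source nodes and glue_context is the GlueContext from the earlier source sketch.

from awsglue.dynamicframe import DynamicFrame

# Rename the positional columns of each dataset
events_df = events_dyf.toDF().toDF(
    "eventid", "e_venueid", "catid", "dateid", "eventname", "starttime"
)
venue_df = venue_dyf.toDF().toDF(
    "venueid", "venuename", "venuecity", "venuestate", "venueseats"
)

# Inner join venue to event on the venue ID, then drop the duplicate join key
joined_df = venue_df.join(
    events_df, venue_df.venueid == events_df.e_venueid, "inner"
).drop("e_venueid")

# Convert back to a DynamicFrame for the S3 target node
joined_dyf = DynamicFrame.fromDF(joined_df, glue_context, "joined_dyf")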

Load the data into your S3 bucket (the equivalent script-mode write is sketched after these steps):

  1. Choose Add nodes and choose Amazon S3 on the Targets tab.
  2. Enter the following data target properties:
    1. For Node parents, select Drop Fields.
    2. For Format, select CSV.
    3. For Compression Type, select None.
    4. For S3 Target Location, choose your S3 bucket and enter your desired file name followed by a slash (/).

loading data to s3 target
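In script mode, the Amazon S3 target node corresponds to a write_dynamic_frame call like the following sketch, which continues the earlier examples; the bucket path is a placeholder.

# Write the joined dataset to the S3 target as uncompressed CSV
glue_context.write_dynamic_frame.from_options(
    frame=joined_dyf,
    connection_type="s3",
    connection_options={"path": "s3://tickit-use1-123456789012/tickit/"},
    format="csv",
)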

You can now save and run your AWS Glue visual ETL job. Run the job and then go to the Runs tab to monitor its progress. After the job completes, the Run status changes to Succeeded and the data will be in the target S3 bucket.

completed job
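You can also start and monitor the job with the AWS SDK instead of the console. The following boto3 sketch polls the run until it reaches a terminal state; the job name is a placeholder.

import time
import boto3

glue = boto3.client("glue")
job_name = "sftp-tickit-etl"  # replace with your Glue job name

# Start the job and poll until it finishes
run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
while True:
    state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)

print(f"Job run {run_id} finished with status {state}")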

Clean up

To avoid incurring additional charges from the resources created as part of this post, delete the following items from your AWS account:

  • Delete the Secrets Manager secret created for the SFTP connector credentials.
  • Delete the SFTP connector.
  • Unsubscribe from the SFTP Connector in AWS Marketplace.
  • Delete the data loaded to the Amazon S3 bucket and the bucket.
  • Delete the AWS Glue visual ETL job.

Conclusion

In this blog post, we demonstrated how to use the SFTP connector for AWS Glue to streamline the processing of data from SFTP servers into Amazon S3. This integration plays a pivotal role in enhancing your data analytics capabilities by offering an efficient and straightforward method to bring together disparate data sources. Whether your goal is to analyze SFTP server data for actionable insights, bolster your reporting mechanisms, or enrich your business intelligence tools, this connector ensures a more streamlined and cost-effective approach to achieving your data objectives.

You can use the SFTP connector for AWS Glue to simplify connecting AWS Glue jobs to remote SFTP storage for extracting or loading data, while performing data cleansing and transformations in memory as part of your ETL pipelines, as this post demonstrated. Alternatively, AWS Transfer Family provides fully managed, AWS-native SFTP connectors to reliably copy large volumes of files between remote SFTP sources and Amazon S3. You have the option to design a solution that uses a Transfer Family SFTP connector to copy files between remote SFTP servers and their Amazon S3 locations without modification, and then use AWS Glue ETL for cleansing and transformation of the file data.

For further details on the SFTP connector, see the SFTP Connector for Glue documentation.


About the Authors

Sean Bjurstrom is a Technical Account Manager in ISV accounts at Amazon Web Services, where he specializes in Analytics technologies and draws on his background in consulting to support customers on their analytics and cloud journeys. Sean is passionate about helping businesses harness the power of data to drive innovation and growth. Outside of work, he enjoys running and has participated in several marathons.

Seun Akinyosoye is a Sr. Technical Account Manager supporting public sector customers at Amazon Web Services. Seun has a background in analytics and data engineering, which he uses to help customers achieve their outcomes and goals. Outside of work, Seun enjoys spending time with his family, reading, traveling, and supporting his favorite sports teams.

Vinod Jayendra is an Enterprise Support Lead in ISV accounts at Amazon Web Services, where he helps customers solve their architectural, operational, and cost optimization challenges. With a particular focus on serverless technologies, he draws from his extensive background in application development to deliver top-tier solutions. Beyond work, he finds joy in quality family time, embarking on biking adventures, and coaching youth sports teams.

Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect, MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest MWAA and AWS Glue features and news!

Chris Scull is a Solutions Architect focused on orchestration tools and modern cloud technologies. With two years of experience at AWS, Chris has developed an interest in Amazon Managed Workflows for Apache Airflow, which allows for efficient data processing and workflow management. He is also passionate about exploring the capabilities of generative AI with Amazon Bedrock, a service for building generative AI applications on AWS.

Shengjie Luo is a big data architect on the Amazon Cloud Technology professional services team. He is responsible for solutions consulting, architecture, and delivery of AWS-based data warehouses and data lakes, and is skilled in serverless computing, data migration, cloud data integration, data warehouse planning, and data service architecture design and implementation.

Qiushuang Feng is a Solutions Architect at AWS, responsible for Enterprise customers’ technical architecture design, consulting, and design optimization on AWS Cloud services. Before joining AWS, Qiushuang worked in IT companies such as IBM and Oracle, and accumulated rich practical experience in development and analytics.