Amazon Supply Chain and Logistics

Accelerating the transformation of supply chain data from legacy systems into AWS Supply Chain

Customers across all industries continue to leverage autonomous and disparate systems such as enterprise resource planning (ERP), order management systems (OMS), and enterprise data warehouse (EDW) to store supply chain and operational data. While these systems perform their basic function well, supply chain data remains locked in these disparate data source systems.

The underlying data models and definitions for these systems can vary, making it difficult to standardize and transform data for consumption into business applications or reporting. The existence of non-standardized data in these systems increases the time spent by data administrators and business analysts in transforming or standardizing data for their business needs such as reporting or planning.

This situation also impedes organizations from taking advantage of machine learning (ML) powered functions and operations that are optimized with unified and standardized data. Our customers have shared how siloed data and manual integration disrupt operations because of complex and time-consuming extract, transform, and load (ETL) functions.

In this blog post, we show you a simple approach to using AWS services to transform non-standardized data and simplify ETL operations. This post also explains the process to set up automated pipelines that feed data into AWS Supply Chain.

AWS Supply Chain is a cloud-based supply chain management application that works with your existing ERP and supply chain management systems. AWS Supply Chain contextualizes your data in a real-time visual map using a set of interactive visual interfaces. It highlights current inventory selection and quantity, as well as the health of inventory at each location (for example, inventory that is at risk for stock out). AWS Supply Chain also enables ML-powered actionable insights to generate more accurate vendor lead times and to reduce risk.

Solution overview

The data transformation solution uses Amazon Simple Storage Service (Amazon S3), an AWS Glue crawler, AWS Glue DataBrew, and AWS Supply Chain. The following reference diagram depicts the integration of these services.

Figure: Data mapping reference architecture

  1. On-premises data sources – This block represents legacy data sources that reside on the customer’s premises.
  2. Amazon S3 raw data sink – These Amazon S3 buckets receive raw (unaltered) data from the source systems. For this setup, you upload comma-separated values (.csv) files.
  3. AWS Glue crawler – AWS Glue crawlers crawl files in Amazon S3 buckets and infer the underlying schema of each file. This cuts down the time normally spent on manual schema mapping. AWS Glue crawlers also have built-in options to configure how frequently crawl jobs run. The crawler output is a set of schema and table definitions for the data sources.
  4. AWS Glue DataBrew – This visual ETL tool provides:
    1. a view of the table schema,
    2. a view of data lineage to understand how changes flow through, and
    3. mapping and transformation features to manage how the source fields are modified.
  5. Amazon S3 transformed data – These Amazon S3 storage locations house the modified and standardized data, which can connect directly to business applications or any reporting tool.
  6. AWS Supply Chain – This purpose-built end-user business application takes advantage of ML algorithms to generate insights that assist in running resilient supply chains. (A minimal sketch of orchestrating this flow programmatically follows this list.)
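
The walk-through below uses the AWS Management Console, but the same flow can be scripted. The following is a minimal sketch, using the AWS SDK for Python (boto3), of one way to drive the pieces in the diagram after they have been created; the bucket, crawler, and job names are hypothetical placeholders, not resources that exist by default.

```python
# Minimal orchestration sketch (boto3). Resource names are hypothetical
# placeholders; create the resources as described in the walk-through below.
import time
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")
databrew = boto3.client("databrew")

RAW_BUCKET = "sc-raw-data-bucket"        # hypothetical raw data sink (block 2)
CRAWLER_NAME = "product-crawler"         # hypothetical AWS Glue crawler (block 3)
DATABREW_JOB = "outbound-orderline-job"  # hypothetical DataBrew recipe job (block 4)


def run_pipeline(local_file: str, key: str) -> None:
    # 1. Land the raw .csv extract in the raw data bucket.
    s3.upload_file(local_file, RAW_BUCKET, key)

    # 2. Crawl the bucket so the Glue Data Catalog reflects the latest schema.
    glue.start_crawler(Name=CRAWLER_NAME)
    while glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"] != "READY":
        time.sleep(30)

    # 3. Run the DataBrew recipe job that writes standardized .csv output
    #    to the transformed bucket consumed by AWS Supply Chain.
    databrew.start_job_run(Name=DATABREW_JOB)


if __name__ == "__main__":
    run_pipeline("outbound_orderline.csv", "raw/outbound_orderline.csv")
```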

Prerequisites

This blog post assumes that you understand the use of the services mentioned above and that you have the following prerequisites:

  1. An AWS account with the listed services enabled.
  2. The ability to create and modify AWS Identity and Access Management (IAM) roles. This is required to create policies and permissions (particularly services related to AWS Glue).
  3. Access to source systems and the ability to extract .csv files from these systems.
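
For reference, the following is a minimal boto3 sketch of creating a service role that an AWS Glue crawler can assume. The role name, bucket name, and inline policy are illustrative assumptions; adjust them to your own naming and least-privilege standards, and create a similar role with Amazon S3 read/write access for DataBrew jobs.

```python
import json
import boto3

iam = boto3.client("iam")

ROLE_NAME = "GlueSupplyChainCrawlerRole"  # hypothetical role name
RAW_BUCKET = "sc-raw-data-bucket"         # hypothetical raw data bucket

# Allow the AWS Glue service to assume the role.
iam.create_role(
    RoleName=ROLE_NAME,
    AssumeRolePolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }),
)

# Attach the AWS managed policy for AWS Glue crawlers and jobs.
iam.attach_role_policy(
    RoleName=ROLE_NAME,
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)

# Grant read access to the raw data bucket (inline policy).
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="RawBucketReadAccess",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{RAW_BUCKET}",
                f"arn:aws:s3:::{RAW_BUCKET}/*",
            ],
        }],
    }),
)
```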

Solution scope

For the purposes of this blog post, we make the following assumptions:

  1. We exclude the process of extracting .csv files from the source systems and uploading them into Amazon S3 buckets.
  2. Source files are uploaded manually to the Amazon S3 buckets. However, you can use any process to load them into the Amazon S3 buckets.
  3. Files received from the source systems are assumed to be in comma-separated values (.csv) format.
  4. The crawl workflow is set to run on demand. However, this post also explains how to configure the workflow to run on a different schedule.
  5. If there are changes to the schema, the current solution can be extended to notify users of schema changes and any associated impacts to the data flow. More information on this design extension is available in the AWS Big Data Blog post Identify source schema changes using AWS Glue.

Solution walk-through

Sign in to the AWS Management Console as an administrator. Use the Search box to search for Amazon S3.

Amazon S3 buckets

  1. To create an Amazon S3 bucket, select Create bucket and follow the on-screen procedure. For this setup, leave the default options unaltered so that the bucket contents are scanned for changes each time a file is added.
  2. To ingest data, upload the raw file extracted from your source system into the Amazon S3 bucket. The sketch after this list shows how these two steps can be scripted.
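
If you prefer to script the bucket creation and upload, the following is a minimal boto3 sketch; the bucket name, Region, local file name, and object key are hypothetical.

```python
import boto3

REGION = "us-west-2"               # assumption: choose your own Region
RAW_BUCKET = "sc-raw-data-bucket"  # hypothetical raw data sink

s3 = boto3.client("s3", region_name=REGION)

# Step 1: create the raw data bucket with default settings.
# (Omit CreateBucketConfiguration if you use us-east-1.)
s3.create_bucket(
    Bucket=RAW_BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},
)

# Step 2: upload a raw .csv extract from the source system (ERP, OMS, or EDW).
s3.upload_file(
    Filename="outbound_orderline.csv",
    Bucket=RAW_BUCKET,
    Key="raw/outbound_orderline/outbound_orderline.csv",
)
```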

AWS Glue crawler

  1. Use the search box at the top of the console to navigate to AWS Glue.
  2. In the left navigation bar, select Databases, then select Create database.
  3. Use the search box at the top of the screen and search for Crawlers to navigate to the crawler feature.
  4. On the next screen, in the upper right corner, select Create crawler.
  5. In the Name and Description fields, enter values based on the entity you want to add. In this example, the crawler is created for the product entity. Then select Next.
  6. On the next screen, you are prompted to add a data source. This is where you reference the Amazon S3 location.
  7. You have the option of selecting a schedule for when to run this crawler job. On demand is selected for this walk-through; you can change the setting based on your requirements. Then select Next.
  8. On the final screen, review the completed selections and select Create crawler.
  9. Select Databases in the left navigation bar, then select Tables to see a list of the created tables. The sketch after this list shows the equivalent database and crawler setup with the AWS SDK.
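
The same database and crawler can be defined with boto3, as sketched below; the database, crawler, role, and S3 path are hypothetical placeholders that mirror this walk-through.

```python
import boto3

glue = boto3.client("glue")

DATABASE = "supply_chain_raw"                     # hypothetical Glue database
CRAWLER = "product-crawler"                       # hypothetical crawler name
ROLE_ARN = "arn:aws:iam::111122223333:role/GlueSupplyChainCrawlerRole"  # placeholder
RAW_PATH = "s3://sc-raw-data-bucket/raw/product/" # hypothetical raw data prefix

# Create the database that will hold the inferred table definitions.
glue.create_database(DatabaseInput={"Name": DATABASE})

# Create an on-demand crawler for the product entity. Pass a Schedule
# (cron expression) instead if you want a fixed frequency, for example
# Schedule="cron(0 2 * * ? *)".
glue.create_crawler(
    Name=CRAWLER,
    Role=ROLE_ARN,
    DatabaseName=DATABASE,
    Targets={"S3Targets": [{"Path": RAW_PATH}]},
)

# Run the crawler once to infer the schema and create the table.
glue.start_crawler(Name=CRAWLER)
```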

AWS Glue DataBrew

  1. Search for AWS Glue DataBrew in the search box.
  2. In the left navigation pane, select Datasets. In this step, you create a new dataset connection for each table that you created using the AWS Glue crawler. Repeat these steps for all the tables you previously defined.
  3. Select Create project on the AWS Glue DataBrew screen to start a new data project.
  4. On the Create project screen, enter the project name and select one of the previously created datasets to complete this step. As an example, the Outbound OrderLine dataset is used to create the project.
  5. GitHub repository: Any schema conversion steps, such as column name changes, can be done using recipe files. You can refer to the JSON recipe file in this GitHub location. For example, the Outbound OrderLine table involves only column name changes. You can download these recipes to your local machine for reference or use.
  6. Navigate back to the DataBrew window and select Recipes. To upload the recipes (either those downloaded from GitHub or your own version), select Upload recipe in the upper right corner. The sketch after this list shows how the dataset and recipe can also be created with the AWS SDK.
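
As a rough scripted equivalent of the console steps above, the following boto3 sketch creates a DataBrew dataset from a Glue Data Catalog table and registers a rename-only recipe similar in spirit to the Outbound OrderLine example. All names and column mappings are illustrative assumptions, not the actual contents of the GitHub recipe file.

```python
import boto3

databrew = boto3.client("databrew")

# Dataset backed by the Glue Data Catalog table created by the crawler.
databrew.create_dataset(
    Name="outbound-orderline-dataset",          # hypothetical dataset name
    Input={
        "DataCatalogInputDefinition": {
            "DatabaseName": "supply_chain_raw", # database from the crawler step
            "TableName": "outbound_orderline",  # table inferred by the crawler
        }
    },
)

# Rename-only recipe; source/target column names are illustrative.
databrew.create_recipe(
    Name="outbound-orderline-recipe",
    Steps=[
        {
            "Action": {
                "Operation": "RENAME",
                "Parameters": {"sourceColumn": "ORDER_NO", "targetColumn": "order_id"},
            }
        },
        {
            "Action": {
                "Operation": "RENAME",
                "Parameters": {"sourceColumn": "ITEM_NO", "targetColumn": "product_id"},
            }
        },
    ],
)

# Publish the recipe so a recipe job can reference it.
databrew.publish_recipe(Name="outbound-orderline-recipe")
```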

AWS Glue DataBrew – Visual Transformation [OPTIONAL]

AWS Glue DataBrew allows you to visually change or make modifications to incoming data. This includes changing formats, column names, and column manipulations.

  1. To create a new project, navigate to Projects and select New project. In this example, we use InventoryPositions.
  2. The recipe options in the top right corner allow you to add conversion steps. This example renames columns of the existing data.
  3. One example of a field conversion is changing a standard date to UTC format. To make this change, select Add step.
  4. Once the recipes are uploaded, navigate to Jobs to create a job that automates the conversion. Select Create job.
  5. On the Job details screen, select the recipe job type. Next, under dataset, associate the dataset for which you want to define the new job. AWS Glue DataBrew offers flexibility in the output format for the extracted data (in this example, comma-separated values is the preferred format).
  6. Associated schedules [OPTIONAL]: As an additional step, you can set up a scheduling cadence for the job.
  7. Once all the required fields, including permissions, are filled out, select Create job. After the job is created, you can wait for the schedule to run it or run it manually.
  8. Upon successful execution of the job, the transformed .csv file is available in the Amazon S3 location. The sketch after this list shows how the job and schedule can also be created with the AWS SDK.
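
The recipe job and optional schedule can also be created with boto3, as in the sketch below; the job name, role ARN, output bucket, and cron expression are hypothetical placeholders.

```python
import boto3

databrew = boto3.client("databrew")

JOB_NAME = "outbound-orderline-job"                          # hypothetical job name
ROLE_ARN = "arn:aws:iam::111122223333:role/DataBrewJobRole"  # placeholder role ARN
TRANSFORMED_BUCKET = "sc-transformed-data-bucket"            # hypothetical output bucket

# Recipe job: reads the dataset, applies the published recipe, and writes
# the transformed output back to Amazon S3 as .csv.
databrew.create_recipe_job(
    Name=JOB_NAME,
    DatasetName="outbound-orderline-dataset",
    RecipeReference={
        "Name": "outbound-orderline-recipe",
        "RecipeVersion": "LATEST_PUBLISHED",
    },
    RoleArn=ROLE_ARN,
    Outputs=[
        {
            "Location": {
                "Bucket": TRANSFORMED_BUCKET,
                "Key": "transformed/outbound_orderline/",
            },
            "Format": "CSV",
        }
    ],
)

# Optional: run the job every day at 06:00 UTC instead of on demand.
databrew.create_schedule(
    Name="outbound-orderline-daily",
    JobNames=[JOB_NAME],
    CronExpression="cron(0 6 * * ? *)",
)

# Or trigger a run manually.
databrew.start_job_run(Name=JOB_NAME)
```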

Conclusion

The existence of non-standardized data in legacy and disjointed systems such as ERP and warehouse management systems (WMS) increases the time and effort required to standardize data for business needs such as reporting or planning. Integration between supply chain management systems can be difficult, complex, and time-consuming, and can force organizations into a single vendor or expensive integration services. This walk-through covered a data transformation approach that uses AWS services to automate the normalized data load process.

This approach will help you unlock valuable supply chain data so you can make decisions faster and improve efficiency. This blog also briefly covered AWS Supply Chain, which increases supply chain visibility and enables improved supply chain resiliency. Visit AWS Supply Chain to learn more and get started. For a self-paced technical overview, visit AWS Workshops.

Vikram Balasubramanian

Vikram Balasubramanian is a Senior Solutions Architect for Supply Chain. In his role, Vikram works closely with supply chain executives to understand their goals and problem areas and align them with solution best practices. He has over 17 years of experience working with several Fortune 500 companies across different industry verticals in the supply chain space. Vikram holds an MS in Industrial Engineering from Purdue University. Vikram is based out of the North Dallas area.