AWS for Industries

Improving manufacturing field-distribution planning with a supply chain data lake

The Challenge

Manufacturers that distribute products and manage inventory at many geographically dispersed sites struggle to predict replenishment timing and quantity accurately. For example, chemical manufacturers bound to vendor-managed inventory agreements make truck rolls to thousands of remote tank sites either when the tank is already empty (too late) or when the tank level is high (not needed). Similarly, automotive spare parts manufacturers’ delays in meeting dealer orders result in disappointed consumers. Whether delivering too much, too little, or at the wrong time, suboptimal replenishment causes problems for everyone.

The crux of the challenge is the lack of timely, accurate, comprehensive, and standardized data available in one place for replenishment planning. Manufacturers’ enterprise resource planning (ERP) systems rarely contain the level of detail needed to model the smaller consumption sites in the field, which is critical for predicting optimal replenishment timing. As a result, field teams try to manage site inventory and replenishment modeling data in spreadsheets or bespoke systems operating in silos, leading to fragmentation of master data (such as field locations, replenishment lead times, and inventory stocking policies) and transaction data (such as inventory levels and sales). This results in suboptimal replenishment plans and costly manual effort for data capture, management, and prediction.

A Comprehensive Approach

We address these challenges in a series of four blogs, each of which addresses one key element of the solution. In this first blog, we show how to aggregate data from disparate sources into a normalized supply chain data lake. In the second blog, you can learn how to build a supply chain digital twin to model the physical product flow and hydrate it from the information-rich supply chain data lake. In the third blog, you’ll learn how to build a replenishment-planning application on top of the hydrated twin. Lastly, we will show how to use Internet of Things (IoT) technologies, like LoRaWAN, to hydrate the data lake in an automated, cost-effective fashion with frequent, granular, accurate data from widespread locations where data collection is challenging.

Figure 1 shows how these four elements—

  1. the supply chain data lake;
  2. the supply chain digital twin;
  3. apps, like Replenishment Planning; and
  4. IoT data capture

—fit together to solve the replenishment challenge.

Figure 2 shows the details of the supply chain data lake architecture that provides timely, accurate, comprehensive, and standardized data available in one place and how it integrates with the other three elements: the supply chain digital twin, apps, and IoT data capture.

Figure 2. Supply chain data lake architecture

Supply Chain Data Lake Challenges

The supply chain data lake needs to collect, rationalize, cleanse, and centralize data from a myriad of sources that capture distribution inventory levels, sales, eligible products, and target inventory levels required for replenishment forecasting. This poses two immediate challenges:

  • Cost-effectively interfacing to a large number of data sources from internal as well as customer and partner systems, and
  • Rationalizing a large number of different data formats and semantics for the same underlying data.

Once this is accomplished, a third challenge is

  • Securing and managing access to the data lake for dozens or even hundreds of internal and external users, each with potentially different access requirements for data at the row, column, and cell levels.

Finally, a fourth challenge is

  • Maintaining the quality and integrity of the supply chain data lake as it is used for an expanding set of use cases from replenishment planning to route optimization or even new product planning as apps are added to the digital twin.

Architecture to Meet the Challenges

Let’s look at how the Amazon Web Services (AWS)-based architecture above addresses each of these four challenges.

Ingesting and Rationalizing Data from Many Sources

For the first challenge, interfacing to a large number of data sources, the key is to provide a simple, well-documented, standardized API to which source systems can publish. Analyzing requirements and designing and implementing separate system integrations for each data source would be prohibitively expensive, even for internal sources, let alone for the large number of external sources (distributors, retailers, logistics providers, seaports, rail yards, truck depots, and more) typically involved in supply chain buy, sell, and move activities. Amazon API Gateway, a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at virtually any scale, helps to implement and evolve a standardized supply chain data lake ingestion API, backed by one or more functions implemented on AWS Lambda, a serverless, event-driven compute service. The AWS Lambda functions perform initial data validation and cleansing before storing data in the raw bucket of the supply chain data lake on Amazon Simple Storage Service (Amazon S3), an object storage service.
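
As a rough illustration of the validate-then-store step described above, here is a minimal sketch of an ingestion function behind an API Gateway proxy integration. The field names, the bucket name example-scdl-raw, and the date-partitioned key layout are assumptions made for this example, not part of the architecture in Figure 2.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical required fields for one inventory-level reading.
REQUIRED_FIELDS = {
    "site_id": str,
    "product_id": str,
    "quantity": (int, float),
    "observed_at": str,
}


def validate_reading(payload):
    """Return a list of validation errors; an empty list means the payload is valid."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif isinstance(payload[field], bool) or not isinstance(payload[field], expected):
            errors.append(f"wrong type for field: {field}")
    if not errors and payload["quantity"] < 0:
        errors.append("quantity must be non-negative")
    return errors


def raw_object_key(payload, received_at, record_id):
    """Build a date-partitioned object key for the raw bucket (assumed layout)."""
    return (
        f"inventory_levels/year={received_at:%Y}/month={received_at:%m}/"
        f"day={received_at:%d}/{payload['site_id']}-{record_id}.json"
    )


def lambda_handler(event, context, s3_client=None, bucket="example-scdl-raw"):
    """Validate one JSON reading from an API Gateway proxy event and store it in S3."""
    try:
        payload = json.loads(event.get("body") or "")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"errors": ["body is not valid JSON"]})}
    if not isinstance(payload, dict):
        return {"statusCode": 400, "body": json.dumps({"errors": ["body must be a JSON object"]})}

    errors = validate_reading(payload)
    if errors:
        return {"statusCode": 400, "body": json.dumps({"errors": errors})}

    if s3_client is None:
        import boto3  # imported lazily so the validation logic runs without AWS access

        s3_client = boto3.client("s3")

    key = raw_object_key(payload, datetime.now(timezone.utc), uuid.uuid4().hex)
    s3_client.put_object(Bucket=bucket, Key=key, Body=json.dumps(payload).encode("utf-8"))
    return {"statusCode": 202, "body": json.dumps({"key": key})}
```

Rejecting malformed readings at the API boundary keeps obviously bad records out of the raw bucket, while deeper cleansing is left to the curation step.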

Though a common API can meet the needs for obtaining data from a majority of sources, there may be some sources that merit a tighter integration or a source for which the systems are not easy to modify to publish to the API. For these, one or more AWS Lambda functions can pull data from the interfaces that those systems expose. Keeping custom integrations to a minimum, of course, will result in the most extensible and cost-effective way to hydrate the supply chain data lake.

For external participants that report the order status of inventory shipments by email, Amazon Simple Email Service (Amazon SES), a cost-effective, flexible, and scalable email service, can automate data ingestion. Amazon Textract, a machine learning (ML) service that automatically extracts text, handwriting, and data from image documents, can process email attachments to extract text. Amazon Comprehend, a natural-language processing (NLP) service, can then identify and extract transactional attributes from the text. Similarly, an AWS Lambda function can extract data from other types of email attachments, such as Microsoft Excel files. The output can then invoke the common APIs to publish to the supply chain data lake’s Amazon S3 raw bucket.
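
The attachment-processing step above could look roughly like the following sketch, which passes an image attachment to Amazon Textract and then runs Amazon Comprehend entity detection on the extracted text. The function names and the 0.8 confidence cutoff are illustrative assumptions. The two clients are passed in (for example, boto3.client("textract") and boto3.client("comprehend")) so that the parsing logic can be exercised without AWS access.

```python
def textract_lines(textract_response):
    """Collect the text of LINE blocks from a Textract DetectDocumentText response."""
    return [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def confident_entities(comprehend_response, min_score=0.8):
    """Group Comprehend entities by type, keeping those scored at or above min_score."""
    grouped = {}
    for entity in comprehend_response.get("Entities", []):
        if entity["Score"] >= min_score:
            grouped.setdefault(entity["Type"], []).append(entity["Text"])
    return grouped


def extract_attributes(image_bytes, textract_client, comprehend_client, min_score=0.8):
    """Run an image attachment through Textract, then Comprehend entity detection."""
    document = textract_client.detect_document_text(Document={"Bytes": image_bytes})
    text = "\n".join(textract_lines(document))
    if not text:
        return {}  # nothing readable in the attachment
    entities = comprehend_client.detect_entities(Text=text, LanguageCode="en")
    return confident_entities(entities, min_score)
```

Entity types such as DATE and QUANTITY are a starting point only. Mapping them to order numbers or shipment statuses would likely need additional rules or a custom model.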

For ERP and other enterprise systems, Amazon AppFlow, a fully managed integration service, accelerates integration with sources like SAP and Salesforce, providing connectors with filtering, validation, transformation, and mapping. It helps establish secure, private data flows from those important inputs into replenishment forecasting and planning.

Once a scalable architecture is in place for ingesting data, we can address the second challenge: rationalizing a large number of different data formats and semantics. AWS Glue DataBrew, a visual data preparation tool, is a valuable resource for this task because it facilitates data cleansing and normalization without writing code. Not only can it standardize data from multiple ERP systems as well as customer and partner data feeds, but it can also enrich and rationalize data from new sources, such as IoT-based automated supply chain sensing. It is a central resource that can validate, cleanse, enrich, and transform data from a myriad of different sources and store the results in an Amazon S3 curated bucket.
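
AWS Glue DataBrew handles this step without code, but a small code sketch can make the rationalization idea concrete. The example below maps two hypothetical feeds, which use different field names and units for the same tank-level data, onto one canonical record. The feed names, field names, and the choice of liters as the canonical unit are all assumptions for illustration.

```python
LITERS_PER_US_GALLON = 3.785411784

# Hypothetical per-source mappings from source field names to canonical names.
FIELD_MAPS = {
    "distributor_feed": {"site": "site_id", "sku": "product_id", "qty_gal": "quantity"},
    "partner_feed": {"location_id": "site_id", "material": "product_id", "level_liters": "quantity"},
}

# Factor that converts each source's quantity into liters.
UNIT_FACTORS = {"distributor_feed": LITERS_PER_US_GALLON, "partner_feed": 1.0}


def normalize(record, source):
    """Map a source-specific record onto the canonical schema, with quantity in liters."""
    mapping = FIELD_MAPS[source]
    canonical = {target: record[src] for src, target in mapping.items()}
    # Standardize identifiers so the same site or product matches across feeds.
    canonical["site_id"] = str(canonical["site_id"]).strip().upper()
    canonical["product_id"] = str(canonical["product_id"]).strip().upper()
    quantity = float(canonical.pop("quantity"))
    canonical["quantity_liters"] = round(quantity * UNIT_FACTORS[source], 3)
    canonical["source"] = source
    return canonical
```

For example, normalize({"site": " t-100 ", "sku": "n2", "qty_gal": 100}, "distributor_feed") and a partner_feed record for the same tank both yield site_id "T-100" and a quantity expressed in liters, so downstream planning logic does not need to know which feed a reading came from.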

Governing Access

Once you have a rich set of granular data from the required sources centralized and standardized in the supply chain data lake Amazon S3 curated bucket, the next challenge is to govern access so that only the required actors have access to only what they need to fulfill business objectives. This can be a significant challenge given the need to restrict some entities’ access to specific rows, columns, or cells and the continually evolving landscape of actors, such as customers and suppliers. The growth of the supply chain, and the way it changes to adapt to new business demands and opportunities, can add to this challenge.

AWS Lake Formation, a service that makes it easy to set up a secure data lake, complements AWS Glue, a serverless data integration service, to provide a powerful, intuitive, and streamlined way of establishing governance for the supply chain data lake, with industry best practices and standards compliance built in. It offers an extensive set of features, and it should be set up as part of the initial supply chain data lake implementation, with at least the following minimal configuration for sound initial governance:

  1. Roles and user groups—Roles should be defined with the desired access control policies and assigned to user groups so that users can be added and removed as organizations evolve.
  2. Row-, column-, and cell-level permissions—AWS Lake Formation provides data filters to restrict access at a granular level to protect sensitive data, like personally identifiable information (PII), which simplifies and streamlines access management.
  3. Encryption and key management—Encryption of data at rest and in transit to and from the data lake needs to be set up, along with the policies for what data needs to be stored encrypted and when encryption in transit needs to be enforced. AWS Lake Formation simplifies these aspects of security administration.
  4. Deduplication—AWS Lake Formation offers an ML-powered transform called FindMatches that can help eliminate duplicate data records. FindMatches asks you to label duplicate records and learns, based on these samples, the criteria used to deduplicate data at scale. This also helps with the fourth challenge of maintaining the quality and integrity of the data lake.
  5. Partitioning—Partitioning influences both cost and performance of a data lake, so optimizing partitioning based on the nature of the data and access patterns is important.
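
To make item 2 in the list above more concrete, here is a minimal sketch that builds an AWS Lake Formation data cells filter restricting both rows and columns, then submits it with the boto3 create_data_cells_filter call. The database, table, filter, and column names are hypothetical, and the request shape should be checked against the current boto3 documentation before use.

```python
def build_data_cells_filter(catalog_id, database, table, name, row_expression, excluded_columns):
    """Build the TableData structure for a Lake Formation data cells filter."""
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": name,
        # Only rows matching this expression are visible through the filter.
        "RowFilter": {"FilterExpression": row_expression},
        # All columns except the excluded ones are visible.
        "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
    }


def create_filter(table_data, client=None):
    """Create the filter in Lake Formation; a client can be injected for testing."""
    if client is None:
        import boto3  # imported lazily so the builder runs without AWS access

        client = boto3.client("lakeformation")
    return client.create_data_cells_filter(TableData=table_data)


# Hypothetical example: one distributor sees only its own rows, without contact details.
distributor_filter = build_data_cells_filter(
    catalog_id="111122223333",
    database="scdl_curated",
    table="inventory_levels",
    name="distributor_a_only",
    row_expression="distributor_id = 'A'",
    excluded_columns=["customer_contact"],
)
```

A filter like this takes effect only once a principal is granted SELECT on it, which keeps the row and column rules separate from the decision about who receives them.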

AWS Lake Formation is integrated with AWS CloudTrail, which monitors and records account activity across your AWS infrastructure, for audit logging, and you may want to set up some initial monitoring and reporting based on audit logs as well. You could also use an AWS Lake Formation ingestion blueprint for a jump start on bringing data into your supply chain data lake from MySQL, PostgreSQL, Oracle, and SQL Server databases as well as popular log file formats.

Maintaining the Data Lake

Having secured the data lake, we have one remaining challenge: keeping the data lake from becoming a data swamp as new sources and users are added to facilitate new use cases, such as route optimization or product planning or to serve new customers or work with new partners.

The keys to maintaining the health and clarity of the data lake as it grows and evolves are maintaining governance and following operational best practices through automated monitoring, alerting, and remediation. We’ve already seen how AWS Lake Formation helps with governance. Maintaining a clear and extensible set of roles and user groups makes adaptation of access policies easier. In addition, AWS Lake Formation tag-based access control (LF-TBAC) can help with rapidly growing data lakes by making it easier to incorporate new domains and user populations into a data lake without having to modify or create new policies: just add the same tags to the new users and to the objects that they should be able to access.
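
In API terms, tag-based access control amounts to attaching LF-tags to resources and granting principals permissions on a tag expression. The sketch below builds the request arguments for the boto3 Lake Formation calls add_lf_tags_to_resource and grant_permissions, and adds a small local matcher to show the matching rule. The tag keys, role ARN, and table names are hypothetical, the tags are assumed to have been defined already (for example, with create_lf_tag), and the matcher is a simplification rather than Lake Formation’s actual evaluation logic.

```python
def tag_table_request(database, table, tags):
    """Build kwargs for lakeformation.add_lf_tags_to_resource on one table."""
    return {
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "LFTags": [{"TagKey": key, "TagValues": [value]} for key, value in sorted(tags.items())],
    }


def grant_by_tags_request(principal_arn, tags, permissions=("SELECT",)):
    """Build kwargs for lakeformation.grant_permissions on an LF-tag expression."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [
                    {"TagKey": key, "TagValues": sorted(values)}
                    for key, values in sorted(tags.items())
                ],
            }
        },
        "Permissions": list(permissions),
    }


def expression_matches(resource_tags, expression):
    """True when each key in the expression is on the resource with an allowed value."""
    return all(resource_tags.get(item["TagKey"]) in item["TagValues"] for item in expression)


# Hypothetical grant: planners may query any table tagged domain=replenishment.
planner_grant = grant_by_tags_request(
    "arn:aws:iam::111122223333:role/ReplenishmentPlanner",
    {"domain": ["replenishment"]},
)
```

With this arrangement, a new table tagged domain=replenishment becomes visible to the planner role without any new grant, which is the property that helps as the data lake grows.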

Finally, Amazon CloudWatch alarms can help maintain data lake health by monitoring key performance metrics and providing notifications or taking action automatically when housekeeping or adjustments are needed. AWS Config, a service that helps you to assess, audit, and evaluate the configurations of your AWS resources, can also help monitor and maintain data lake integrity by overseeing policy compliance and adherence to operational best practices. A starter AWS Config template for data lakes can be customized to monitor, report on, and remediate conditions that can lead to the deterioration of a data lake over time.
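
As one possible starting point for such monitoring, the sketch below defines an Amazon CloudWatch alarm on the built-in Errors metric of an ingestion AWS Lambda function and notifies an Amazon SNS topic. The function name, topic ARN, threshold, and five-minute period are assumptions to adapt. The arguments follow the boto3 put_metric_alarm call.

```python
def ingestion_error_alarm(function_name, sns_topic_arn, threshold=1):
    """Build put_metric_alarm kwargs that fire when an ingestion function reports errors."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 300,  # evaluate over five-minute windows
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no invocations is not treated as a failure
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(alarm_kwargs, client=None):
    """Create or update the alarm; a client can be injected for testing."""
    if client is None:
        import boto3  # imported lazily so the builder runs without AWS access

        client = boto3.client("cloudwatch")
    return client.put_metric_alarm(**alarm_kwargs)


# Hypothetical names for the ingestion function and the notification topic.
alarm = ingestion_error_alarm(
    "scdl-ingest-inventory",
    "arn:aws:sns:us-east-1:111122223333:scdl-alerts",
)
```

Similar alarms could watch other signals of data lake health, such as a feed that stops delivering data, by publishing a custom metric from the ingestion function.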


A well-architected and governed data lake provides immediate value to organizations struggling with distribution planning. With data silos broken down and data centralized, cleansed, enriched, and rationalized, organizations can obtain insights consistently and efficiently with AWS analytics services like Amazon Athena, an interactive query service; Amazon EMR, a cloud big data platform; and Amazon OpenSearch Service, which makes it easy for you to perform interactive log analytics, near-real-time application monitoring, website search, and more. You can also obtain insights with AWS AI and ML services, like Amazon SageMaker, which can be used to build, train, and deploy ML models; Amazon Forecast, a time-series forecasting service; and Amazon Comprehend, all operating on the data in the Amazon S3 curated bucket. In the next blog, we’ll discuss how the curated data in the data lake can be used to create a single digital representation of the product flow in a supply chain digital twin to solve distribution planning problems.

If you are interested in learning more about the solution, reach out to Rajesh Mani or Shaun Kirby, or contact your Account Team.

Shaun Kirby

Shaun Kirby is a Principal Customer Delivery Architect in AWS ProServe, specializing in the Internet of Things (IoT) and Robotics. He helps customers excel with cloud technologies, diving deep into their business challenges and opportunities to pioneer game-changing solutions across industries.

Rajesh Mani

Rajesh Mani is Head of Supply Chain Solutions at AWS, focused on building supply chain solutions and bringing them to market. He has 25+ years of experience helping customers transform their supply chains and designing solutions in the areas of supply chain planning and execution, IoT, and customer experience.