AWS for Industries

How Sigmoid Uses DataWig From Amazon Science for Missing Value Imputation to Make CPG Datasets Ready for Machine Learning

When training a machine learning (ML) model, the quality of the model is directly proportional to the quality of the data. However, consumer packaged goods (CPG) datasets often contain many missing values, which degrade the quality of training and prediction in the long run.

If your models are already operationalized on Amazon SageMaker, which is used to build, train, and deploy ML models for virtually any use case, you can use Amazon SageMaker Data Wrangler, which simplifies data preparation and feature engineering and lets you complete each step of the data preparation workflow from a single visual interface. But if you maintain an on-premises ML environment, if you run your ML training and models on Amazon Elastic Compute Cloud (Amazon EC2), which provides resizable compute capacity for virtually any workload, or if you are not ready to migrate to Amazon Web Services (AWS), you will need a scientific way to impute the missing values.

There are several methods that can be used to fill missing values, but in this blog, together with Sigmoid, an AWS Partner, we will show you how to use DataWig for missing data imputation and why it is efficient for ML data preprocessing.

DataWig is an ML model developed by the Amazon Science team, primarily used for missing value imputation. The model is based on deep learning, trained with Apache MXNet, and packaged as a Python library. DataWig runs as a backend when you train your ML algorithms and helps you generate the predicted missing values.

In this blog, we will look into some of the important components of the library and how it can be used for imputing missing values in a dataset.

Important components of the DataWig library

To understand how the DataWig library works, let’s first go through some of the important components and understand what they do.

  • ColumnEncoders
    • The ColumnEncoders convert the raw data of a column into an encoded numerical representation.
    • There are four ColumnEncoders provided in the DataWig library:
      • SequentialEncoder: provides encoding for text data (characters or words)
      • BowEncoder: provides bag-of-words encoding for text data (hashing vectorizer or term frequency–inverse document frequency based on the algorithm used)
      • CategoricalEncoder: provides one-hot encoding for categorical columns
      • NumericalEncoder: provides encoding for numerical columns
  • Column featurizers
    • Column featurizers are used to feed encoded data from ColumnEncoders into the imputer model’s computational graph for training and prediction.
    • There are four column featurizers present in the DataWig library, each paired with a matching encoder (see the sketch after this list):
      • LSTMFeaturizer: is used with SequentialEncoder and maps the sequence of input into vectors using long short-term memory (LSTM)
      • BowFeaturizer: is used with bag-of-words-encoded columns
      • EmbeddingFeaturizer: maps encoded categorical columns into vector representation (word embeddings)
      • NumericalFeaturizer: is used with numerical-encoded columns and extracts features using fully connected layers
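To make the encoder-featurizer pairings concrete, here is a minimal sketch, assuming DataWig is installed (pip install datawig) and using placeholder column names that are not from the original example:

from datawig.column_encoders import (
    SequentialEncoder, BowEncoder, CategoricalEncoder, NumericalEncoder
)
from datawig.mxnet_input_symbols import (
    LSTMFeaturizer, BowFeaturizer, EmbeddingFeaturizer, NumericalFeaturizer
)

# Typical encoder-featurizer pairings; 'text_col', 'cat_col', and
# 'num_col' are placeholder column names for illustration only.
pairings = [
    (SequentialEncoder('text_col'), LSTMFeaturizer('text_col')),      # character/word sequences
    (BowEncoder('text_col'), BowFeaturizer('text_col')),              # bag-of-words text
    (CategoricalEncoder('cat_col'), EmbeddingFeaturizer('cat_col')),  # categories -> embeddings
    (NumericalEncoder('num_col'), NumericalFeaturizer('num_col')),    # numerical features
]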
  • SimpleImputer
    • Using SimpleImputer is one of the simplest ways to train a missing value imputation model. It takes only three parameters:
      • input_columns: the list of feature columns
      • output_column: the name of the target column to be imputed
      • output_path: the path where the trained model will be stored
    • For example, suppose we have a dataset with three columns: a, b, and c. Based on a and b, we want to fill the missing values of column c. In this case, SimpleImputer works as follows:

from datawig import SimpleImputer

imputer = SimpleImputer(
    input_columns=['a', 'b'],
    output_column='c',
    output_path='imputer_model'
)

# Fit an imputer model on the training data
imputer.fit(train_df=df_train)

# Predict the missing values of column c in the test data
predictions = imputer.predict(df_test)
    • While using SimpleImputer, you don’t need to worry about encoding and featurizing different input columns because the library automatically detects the data type for each column and uses the encoders and featurizers accordingly.
    • This gives you less control over the training process, but in general, it yields good results.
    • After passing the above parameters, you have two options:
      • imputer.fit: trains the model
      • imputer.fit_hpo: trains and tunes the model (it has a built-in dictionary of candidate values to choose from, and you can pass hyperparameters as a custom dictionary to tune the model to your project requirements; a hedged example follows this list)
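As an illustration, a custom hyperparameter dictionary might be passed to fit_hpo as shown below; the exact dictionary keys depend on the DataWig version you have installed, so treat the names here as assumptions to verify:

# Illustrative hyperparameter grid; the 'global' key and the parameter
# names ('learning_rate', 'num_epochs') follow the DataWig documentation
# but may differ across versions -- verify against your installed release.
hps = {
    'global': {
        'learning_rate': [1e-3, 1e-4],
        'num_epochs': [30, 50],
    }
}

# fit_hpo searches over the candidate values above and keeps the best model
imputer.fit_hpo(train_df=df_train, hps=hps)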
  • Imputer
    • Imputer gives you more control over the training process, which is one of the primary reasons for using Imputer over SimpleImputer.
    • Imputer takes four parameters as inputs:
      • data_featurizers: a list of featurizers associated with the different feature columns
      • label_encoders: a list of encoders for the target column
      • data_encoders: a list of encoders associated with the different feature columns
      • output_path: the path where the trained model will be stored
    • For example, suppose we have a dataset with three columns: a, b, and c. Based on a and b, we want to fill the missing values of column c. In this case, Imputer works as follows:

from datawig import Imputer
from datawig.column_encoders import BowEncoder, CategoricalEncoder
from datawig.mxnet_input_symbols import BowFeaturizer

# Encode the feature columns (a, b) and the target column (c)
data_encoder_cols = [BowEncoder('a'), BowEncoder('b')]
label_encoder_cols = [CategoricalEncoder('c')]
data_featurizer_cols = [BowFeaturizer('a'), BowFeaturizer('b')]

imputer = Imputer(
    data_featurizers=data_featurizer_cols,
    label_encoders=label_encoder_cols,
    data_encoders=data_encoder_cols,
    output_path='imputer_model'
)

# Fit the imputer model on the training data, then predict column c
imputer.fit(train_df=df_train)
predictions = imputer.predict(df_test)
    • After defining the Imputer with the above parameters, we can simply call the “fit” function to begin the training.
    • Imputer has several advantages over SimpleImputer:
      • More customization is possible for the training purpose.
      • You can tune the parameters while encoding the feature and target columns to balance training time against model accuracy, as in the sketch below.
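For instance, here is a hedged sketch of tuning encoder and featurizer parameters; the parameter names (max_tokens, num_epochs, learning_rate) follow the DataWig documentation but should be verified against your installed version:

from datawig import Imputer
from datawig.column_encoders import BowEncoder, CategoricalEncoder
from datawig.mxnet_input_symbols import BowFeaturizer

# Smaller vocabularies (max_tokens) train faster; larger ones can be
# more accurate -- this is the training-time/accuracy trade-off above.
data_encoder_cols = [BowEncoder('a', max_tokens=2**15),
                     BowEncoder('b', max_tokens=2**15)]
label_encoder_cols = [CategoricalEncoder('c')]
data_featurizer_cols = [BowFeaturizer('a', max_tokens=2**15),
                        BowFeaturizer('b', max_tokens=2**15)]

imputer = Imputer(
    data_featurizers=data_featurizer_cols,
    label_encoders=label_encoder_cols,
    data_encoders=data_encoder_cols,
    output_path='imputer_model'
)

# fit() also exposes training knobs such as epochs and learning rate
imputer.fit(train_df=df_train, num_epochs=50, learning_rate=1e-4)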

How DataWig helped in Sigmoid’s project with a customer

  • Overview of the project:
    • We had a dataset with 50 columns, and we had to impute the missing values in 25 columns out of those 50.
    • Out of the 25 columns, 13 were numerical and 12 were categorical.
    • Below is a summary of the dataset:

[Table: summary of the dataset]

  • Our approach:
    • For each of the 25 target columns, we performed feature selection and then ran DataWig using the Imputer. We chose this approach because the Imputer can be run over all the target columns in a loop, which kept the process straightforward (see the sketch after this item).
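Below is a simplified, hypothetical sketch of that loop, using SimpleImputer for brevity (the project itself used the Imputer with explicit encoders and featurizers); selected_features, the column names, and the DataFrames are placeholders, not the actual project code:

from datawig import SimpleImputer

# Hypothetical mapping from each target column to its selected feature
# columns; the real project derived this from a feature-selection step.
selected_features = {
    'target_1': ['feat_a', 'feat_b'],
    'target_2': ['feat_c', 'feat_d'],
    # ... one entry per target column (25 in the project)
}

for target_col, feature_cols in selected_features.items():
    imputer = SimpleImputer(
        input_columns=feature_cols,
        output_column=target_col,
        output_path=f'imputer_model_{target_col}'
    )
    imputer.fit(train_df=df_train)
    # predict() returns the DataFrame with an added '<column>_imputed' column
    df_test = imputer.predict(df_test)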
    • After the base model result was available, we continued to tune the model.
    • Below are the final results on the target columns.
      • Numerical Columns:

[Table: final results on the numerical target columns]

        • The target metric defined by the project owner for the numerical columns was a ratio of root-mean-square error (RMSE) to standard deviation of ≤0.5, and the acceptable score was a ratio of ≤0.8.
        • In the above table, the computed metrics are listed for all the numerical columns against their respective training, validation, and testing datasets.
        • We achieved an RMSE / standard deviation ratio of <0.5 for the numerical columns (a sketch of the metric follows below).
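For reference, here is a minimal sketch of how that metric can be computed; this is our own illustration, not the project's evaluation code:

import numpy as np

def rmse_over_std(y_true, y_pred):
    """RMSE of the predictions divided by the standard deviation of the
    true values; ratios <= 0.5 met the project's target threshold."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.std(y_true)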
      • Categorical Columns:

[Table: final results on the categorical target columns]

        • For the categorical columns, we used accuracy as the key metric. The ideal target was an accuracy of ≥95%, and the acceptable target was an accuracy of ≥85%.
        • We could achieve >90% accuracy in predicting categorical data.

Conclusion

It is great if you already have your ML operations (MLOps) pipelines on AWS using Amazon SageMaker. But if not, DataWig from the Amazon Science team is a strong choice as an imputation tool, for both simple imputation problems and complex, scalable ones, with results comparable to or even better than other standard practices. If you would like more information about how Sigmoid and AWS help customers in the CPG industry, leave a comment on this blog. To request a demo or to ask any other questions, visit Sigmoid or contact your AWS account team today.

AWS Partner spotlight

Sigmoid delivers actionable intelligence for CPG enterprises. Sigmoid’s CPG analytics solution portfolio is specifically designed to equip CPG decision-makers with targeted consumer insights to drive growth. Sigmoid’s expertise in CPG analytics helps companies build robust data infrastructures that simplify every step of managing big data in the CPG industry. By solving complex analytics use cases, brands can engage effectively with consumers, forecast demand accurately, optimize inventory levels, and take action based on near-real-time sales data across the ecommerce and retail partner community.

Danny Yin

Danny (Yen-Lin) Yin is the Global Technical Lead for AWS Partners in the CPG industry. He joined AWS in 2018 with 18 years of experience in ecommerce application development and operations. Danny helps CPG companies enhance the consumer digital user experience and gain operational efficiency across different lines of business. Danny is also responsible for solutions architecture and technical guidance for CPG technology and consulting partners on AWS. Before he joined AWS, Danny was Director of Digital Engineering at Toys”R”Us, where he successfully migrated the world’s largest toy webstore from an outsourced application to an in-house hybrid cloud application on AWS.

Anurag Srivastava

Anurag Srivastava is a data scientist at Sigmoid. Anurag works on data modeling processes along with predictive model design to gain insights on business data. Currently, Anurag’s work focuses on demand forecasting in supply chains for CPG customers.