Simplifying MLOps and improving model accuracy with Trifacta and Amazon SageMaker (Part 1)
This is the first blog post of a two-part series. Part 1 covers data preparation for machine learning (ML) using Trifacta, available in AWS Marketplace. Part 2, Simplifying machine learning operations with Trifacta and Amazon SageMaker, covers training the model using Amazon SageMaker Autopilot and operationalizing the workflow.
In the past decade, advancements in machine learning and cloud computing have unlocked ML-driven solutions to previously unsolvable problems in every industry. Bringing machine learning out of labs and experiments into production requires integrating ML and dev ops. Being iterative by nature, ML requires a different approach than traditional coding to operationalize it. Amazon SageMaker provides an integrated pipeline of model building, deployment, security, monitoring, and improvement. Customers are now choosing to integrate ML into business operations, sometimes referred to as MLOps.
The data preparation challenge
MLOps involves manual intervention. It requires time from ML engineers for data collection and data wrangling. Those two tasks collectively are called data preparation, and data preparation accounts for about 80% of the work that data scientists do.
ML models are only as good as their training data, so improved quality of training data creates more accurate models. As a model developer and maintainer, you must guard against data entropy in the form of too many missing values, malformed values, and outliers.
Trifacta, available in AWS Marketplace, enables users with business domain knowledge to prepare data themselves. By rapidly iterating through data preparation yourself without involving IT or writing code, you can improve time to value. ML is moving out of the hands of data scientists and into those of business users, and integration with Trifacta enables business users to create high-quality models for many use cases.
In this blog post, we describe how to integrate Trifacta with Amazon SageMaker to simplify ML predictions as self-service for business users with minimal ML background. It builds on the best practices for data wrangling for machine learning in the cloud.
Data wrangling is the process of cleaning, structuring, and enriching raw data into a desired format for better decision-making in less time. Typically, data wrangling involves six iterative steps: discovering, structuring (nested, tabular, deep nested), cleaning, enriching, validating, and publishing.
Preparing data for machine learning involves an additional structuring step for feature engineering. This includes activities such as the following:
- Feature scaling
- Feature derivation such as binning and combining categorical features
- Providing data insights such as frequency distribution
- Data imputation (null management, etc.), and
- Feature vectorization.
First, we use Trifacta to address data quality issues, perform data enrichment, and apply feature engineering. This is a critical step to increase the efficiency and accuracy of AutoML. We then pass the prepared data to SageMaker Autopilot, which automates the tasks of algorithm selection, hyperparameter optimization, distributed training, and cluster size selection. Autopilot generates Python code that you can further fine tune as needed.
Dataset and scenario
This solution covers how to use the Bank Marketing data set to train a model that predicts whether a customer will enroll for a term deposit at a bank. The prediction is based on information about the customer and past marketing campaigns. Achieving high prediction accuracy enables the bank to maximize the return on investment in marketing.
Rather than assuming the availability of clean data, we start with messy, fragmented data that is more representative of what you find in the real world. The raw data consists of the following files:
- core.csv – Core fields containing customer information and socioeconomic indicators
- campaign.csv – Details of marketing campaigns targeted at customers
- education_lookup.csv – Lookup table used to convert education categories into an ordinal feature
- Trifacta. You can sign up for a free Trifacta trial. When you subscribe, you will receive an email with instructions to set up your Trifacta workspace.
- Access to Autopilot. You can use the free tier for Amazon SageMaker.
- Familiarity with JSON.
This is the end-to-end process of identifying and resolving issues by using a Trifacta recipe and an Amazon SageMaker notebook to build the model using Autopilot. A recipe is a list of transformations applied to a dataset. Instructions for SageMaker Autopilot will be covered in part 2 of this blog series.
Step 1: Discover data types
- Log into Trifacta using the credentials that were emailed to you when you subscribed.
- To get familiar with the user interface, view the interactive tutorial.
- To upload your raw data files into Trifacta, go to the flows panel on the left sidebar of the Trifacta console, and choose Create Flow. To select files from your hard drive, select Add Datasets. The data is automatically stored in an Amazon S3 bucket. Trifacta lets you preview datasets and discover data types present in the data before importing them. For more information, watch this video about loading raw data.
Step 2: Structure and enrich the data
Raw datasets are often not in the right shape for machine learning. Data may be fragmented across multiple files, and need to be joined together to be useful.
- The data for model training is split across two files, core.csv and campaign.csv.
- Add a new recipe attached to core.csv. In the Flow view, click on the dataset icon and choose Add New Recipe.
- From the top toolbar, choose Join and select campaign.csv as the dataset that you’re joining to. Trifacta automatically detects that customer_id is the best candidate for the join key.
- On the right panel under join type, choose to perform an inner join and select all columns except customer_id. You don’t want to feed that to the model training process.
Applying the join brings you back to the grid view, which shows the socioeconomic field as a nested JSON object. You can expand the contents to see that it is made up of five distinct socioeconomic indicators. Trifacta has automatically detected this field to be of JSON type, and when you select the column heading, you get a suggestion to unnest the keys into their own columns. You can preview this and add it to your recipe. For more information, watch this video about joining and unnesting data.
Step 3: Clean up the data
Trifacta dynamically infers the data types of each field. With this dataset, you have three tasks to clean up the data.
3.1 Change commas to decimal points
The consumer price index column labeled cons.price.idx shows that it was detected as a string column rather than a decimal column. This is due to commas being used instead of decimal points, which is a common occurrence in data from Europe. To fix this, do the following:
- In the data grid under the cons.price.idx column, select the comma.
- To replace commas with periods, from the list of suggestions, choose Replace. Edit this step and in the Replace With box, enter a period (.).
- The correct data type of decimal is now detected. For more information, watch this video about fixing the consumer price index column.
3.2 Address missing values
- In the data grid, note the gray portion of the data quality bar indicating that the marital and default columns have missing values.
- To get several suggestions, elect the gray portion of the data quality bar.
- Choose Replace. To replace the missing values, in the formula editor, enter a value of unknown.
- You can apply the same transformation to multiple columns at once. Choose the marital and default columns and apply this to your recipe. For more information, watch this video about replacing the missing values.
3.3 Deduplicate data
- To check for duplicate entries, view the value histogram for the job field in the data grid. Hovering over the histogram reveals some duplicate values.
- These duplications are caused by misspellings and differing punctuations. For example, one user might have entered blue-collar while another entry reads Blue Collar. This is a common occurrence in user-entered, free-form text. Trifacta has a standardization feature that enables you to identify groups of similar strings and specify which groups to combine into a single value. The groupings can be based on string similarity as well as phonetics. For more information, watch this video about standardizing free-form text.
Congratulations! You have addressed the basic data quality issues for these datasets.
Step 4: Structure the data for machine learning
Your next step is to perform feature engineering. Feature engineering involves the application of business domain knowledge to capture human expertise and encode it into features. In short, you instruct the software to structure information based on human intuition.
4.1 Feature engineer education data
In the data grid, look at the education column. It contains a textual description of the customer’s educational attainment. However, these descriptions have no implicit order. Like most humans, you would naturally order this data from least to most education. To do that, do the following:
- To perform a lookup against education_lookup.csv, in the column drop-down menu for education, choose Lookup.
- This csv file contains a lookup table that stores the ordinal key associated with each education description, ranging from 0 for illiterate to 6 for university degree. For more information, watch this video about feature engineering for the ordinal key.
4.2 Feature engineer age data
In the data grid, look at the age column. Age is a numeric feature, but various age groups might share common characteristics that can enhance the predictive power of the model. For example, senior citizens aged 65 and over might be eligible for additional discounts and tax incentives that make them a better candidate for a term deposit offer.
To create age groups for your dataset, you use the Bin column function. To do this, in the recipe panel on the right, add a new step and choose the Bin Column transformation. In the options, enter custom bins as follows:
- 17 and below [Minors]
- 65 and above [Seniors]
For more information, watch this video about binning.
4.3 Engineer feature combinations
In some cases, you can combine two or more features to enhance the predictive power of the model. For example, you might want to use the data based on both job and marital status together. To do this, do the following:
- In the data grid, select the job and marital columns.
- To combine these columns’ data into a composite feature, choose the Merge transformation.
- The resulting column contains distinct values for every combination, such as technician-single and management-married. For more information, watch this video about composite features.
Step 5: Validate and publish the outputs
You now have a data preparation pipeline that is ready to produce outputs and train Autopilot. To do this, follow these steps:
- In the top right corner of the Flow view, choose Run Job.
- Select the output format as CSV and enter the output filename as bank-additional-full.
- To run the job, at the bottom of the page, choose the Run Job button.
- To ensure that your data quality expectations are met, review the output profile by choosing the latest job id that appears on the right pane.
- In the job profile, choose the Output Destinations tab. Next to the output file, choose View Details. From the right pane, copy the S3 output path.
- Store the S3 output path to be used later as the Autopilot input path.
You can use the same recipe for multiple job runs as well as to process new inputs. You can repeat steps 2-5 to iterate on your data preparation pipeline until you achieve the desired model performance metrics.
When you’re satisfied with the output from the data preparation pipeline, you can schedule jobs using the built-in scheduler. Alternatively, you can execute the end-to-end pipeline by invoking Trifacta using the REST API. For large-scale jobs, you can use Spark in an Amazon EMR cluster with auto-scaling to complete massive workloads. For more information, watch this video about producing outputs for Autopilot.
To clean up, in the Flows list on the left pane, choose the flow you created. In the context menu on the right, choose Delete Flow. This is an optional step. You will not incur any charges for your Trifacta trial unless you upgrade to a paid account.
In summary, using the Trifacta self-service data preparation tool helps you rapidly discover, structure, clean, enrich, validate, and publish data for ML. The improvement in data quality leads to increased model performance. Business user data scientists can contribute to ML initiatives using a low-code environment without sacrificing flexibility and scalability. It empowers you to rapidly iterate and achieve a faster time-to-market.
In Part 2 of this blog series, we show how to train a machine learning model using the data we prepared and perform inference to predict business outcomes.
The content and opinions in this post are those of the third-party author, and AWS is not responsible for the content or accuracy of this post.
About the author
Vijay Balasubramaniam, Director, Partner Solutions Architect, Trifacta
Vijay Balasubramaniam leverages his expertise in data management to help partners and customers be successful in large-scale analytics initiatives. He has over 18 years of experience helping large organizations manage their data sets and produce insights. He specializes in best-in-class data preparation workflows and developing end-to-end solutions on the AWS platform. Outside of work, he enjoys biking, tennis, music and spending time with family.