AWS Partner Network (APN) Blog
How to Use Amazon SageMaker Pipelines MLOps with Gretel Synthetic Data
By Maarten Van Segbroeck, Principal Scientist – Gretel
By Ben McCown, Sr. Software Engineer – Gretel
By Johnny Greco, Sr. Applied Scientist – Gretel
By Qiong Zhang and Michael Tindal – AWS
Collecting large volumes of high-quality labeled data can be challenging because of cost, time, and privacy concerns. Gretel’s synthetic data platform has emerged as a solution to these issues, and it plays an important role in machine learning operations (MLOps), especially as privacy laws tighten and resources remain constrained.
Gartner forecasts that synthetic data will dominate artificial intelligence (AI) model development by 2030. Gretel’s synthetic data solution, combined with Amazon SageMaker Pipelines, empowers data scientists and ML engineers to deal with data scarcity and complex workflows. It also guides ML leaders in adopting AI responsibly within their organization.
This post discusses how to integrate Gretel with Amazon SageMaker Pipelines to enhance ML training, prioritizing privacy and safety. SageMaker Pipelines streamlines all ML stages, from data pre-processing to model deployment.
The Gretel MLOps library’s source code showcases this integration, enabling training on synthetic data or augmenting real data with synthetic data to accelerate the ML model production process.
Gretel is an AWS Partner and AWS Marketplace Seller that enables the development of domain-specific AI models for creating data that mirrors, boosts, or simulates real-world data without the privacy concerns.
Benefits of Synthetic Data in Machine Learning
Synthetic data is artificially generated data that mimics the statistical characteristics of real-world data. It has several benefits for MLOps:
- Privacy protection: Synthetic data contains no real user information. This protects individuals’ privacy and helps organizations comply with data privacy regulations like GDPR and HIPAA.
- Data availability: Synthetic data models support quick generation of large datasets, which helps deal with scarce or incomplete real data.
- Bias mitigation: Synthetic data can help reduce biases inherent in real data, for example by generating additional records for underrepresented groups.
- Cost efficiency: Generating synthetic data can cost less than gathering and labeling new real data.
Gretel’s Deployment Modes
Gretel provides two deployment options: Gretel Cloud, a hassle-free software-as-a-service (SaaS) solution requiring no deployment effort, and Gretel Hybrid, which integrates into your cloud environment.
Gretel Cloud is a comprehensive, fully managed service for synthetic data generation. It operates within Gretel’s cloud compute infrastructure, and handles all aspects of compute, automation, and scalability. It provides a seamless solution that simplifies the technical demands of setting up your cloud infrastructure.
Gretel Hybrid functions within your AWS environment using Amazon Elastic Kubernetes Service (Amazon EKS) and ensures your data remains within your AWS account. It interfaces with the Gretel API only for job scheduling and metadata, and is particularly well-suited for handling sensitive or regulated data that must stay within your cloud tenant’s boundaries.
Gretel Hybrid combines the benefits of using your infrastructure for training synthetic data models with Gretel’s advanced tools, offering a balance of control and convenience.
The diagram below shows the high-level architecture of Gretel Hybrid; the Gretel Hybrid documentation covers it in detail. To deploy Gretel Hybrid, follow the instructions in the blog post on generating synthetic data using Gretel Hybrid.
Figure 1 – High-level architecture of the Gretel Hybrid deployment in AWS.
Solution Overview
The diagram below illustrates the Amazon SageMaker Pipelines workflow. Gretel’s synthetic data generation follows the data preparation phase, and the resulting synthetic data is used in the training phase of the ML model.
Figure 2 – MLOps workflow with SageMaker Pipelines and Gretel.
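The data flow in this workflow can be sketched in a few lines of plain Python. This is an illustrative sketch only, not code from the Gretel MLOps library: `build_training_set`, the strategy names, and `toy_synthesizer` are hypothetical, and the toy generator simply stands in for a trained Gretel model so the example runs without any external service.

```python
import random
from statistics import mean, pstdev
from typing import Callable, Dict, List, Optional

Record = Dict[str, float]
Synthesizer = Callable[[List[Record], int], List[Record]]


def toy_synthesizer(real: List[Record], n: int, seed: int = 0) -> List[Record]:
    """Stand-in generator (NOT Gretel): samples every field independently
    from a Gaussian fitted to the corresponding real column."""
    rng = random.Random(seed)
    columns = list(real[0].keys())
    stats = {
        c: (mean([r[c] for r in real]), pstdev([r[c] for r in real]))
        for c in columns
    }
    return [{c: rng.gauss(*stats[c]) for c in columns} for _ in range(n)]


def build_training_set(
    real: List[Record],
    synthesize: Synthesizer,
    strategy: str = "augment",
    num_synthetic: Optional[int] = None,
) -> List[Record]:
    """Choose what the training step sees (hypothetical strategy names):
    'real'      -> prepared real data only
    'synthetic' -> synthetic data only
    'augment'   -> real data followed by synthetic data
    """
    if strategy not in ("real", "synthetic", "augment"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    if strategy == "real":
        return list(real)
    # Default to generating as many synthetic records as there are real ones.
    n = len(real) if num_synthetic is None else num_synthetic
    synthetic = synthesize(real, n)
    return synthetic if strategy == "synthetic" else list(real) + synthetic


prepared = [{"age": 40.0, "bmi": 22.5}, {"age": 60.0, "bmi": 30.0}]
train = build_training_set(prepared, toy_synthesizer, strategy="augment")
print(len(train))  # 2 real records + 2 synthetic records
```

In the actual pipeline, the synthesis step would call a Gretel model trained on the prepared data, and the output would feed the SageMaker training step instead of an in-memory list.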
Prerequisites
Sign in at the Gretel console and obtain a Gretel API key. Use an AWS account to run the sample code.
Integrate Gretel with Amazon SageMaker Pipelines
To follow along, instantiate run_pipeline.ipynb from the Gretel MLOps library in Amazon SageMaker Studio.
First, retrieve your Gretel API key and store it in AWS Secrets Manager, following the instructions in Step 2 – Create Secret for the Gretel API key.
The SageMaker IAM role must have the AmazonSageMakerFullAccess permission policy attached. The role also needs the SecretsManagerReadWrite policy so that SageMaker can read the Gretel API key from AWS Secrets Manager.
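Once the secret exists and the role can read it, code running in SageMaker can fetch the key at run time instead of hard-coding it. The following is a minimal sketch, not the Gretel MLOps library's own code. It uses the boto3 Secrets Manager `get_secret_value` call; the helper name `get_gretel_api_key`, the secret name `my-gretel-api-key`, and the optional JSON field are assumptions to adapt to however you stored the secret.

```python
import json
from typing import Any, Optional


def get_gretel_api_key(
    secret_id: str,
    client: Optional[Any] = None,
    json_field: Optional[str] = None,
) -> str:
    """Read the Gretel API key from AWS Secrets Manager.

    secret_id:  name or ARN of the secret (hypothetical in this example).
    client:     a Secrets Manager client; created with boto3 if omitted.
    json_field: set this if the secret was stored as a JSON key/value pair.
    """
    if client is None:
        import boto3  # imported lazily so the helper can be tested offline

        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    secret = response["SecretString"]
    if json_field is not None:
        # Secrets created as key/value pairs are stored as a JSON string.
        secret = json.loads(secret)[json_field]
    return secret


# Example call inside SageMaker (assumed secret name, requires AWS credentials):
# api_key = get_gretel_api_key("my-gretel-api-key")
```

Passing the client in as a parameter keeps the helper easy to test with a stub and makes the Region and credentials explicit when needed.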
Step 2: Configure the SageMaker Pipeline
In run_pipeline.ipynb, install the Python package from the Gretel MLOps library by running the following command:
The installed pipeline package is versatile enough to handle many datasets and optimize for standard classification or regression ML metrics. To customize the pipeline, supply a YAML configuration file with three sections: dataset, ML, and gretel.
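Before launching a pipeline run, it can help to check that a custom configuration contains all three sections. The sketch below is a hypothetical pre-flight check, not part of the Gretel MLOps library. It assumes the file has already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`) and that the top-level keys are spelled `dataset`, `ML`, and `gretel`; the keys inside each section are placeholders, so consult the library's example files for the real schema.

```python
from typing import Any, List, Mapping

# Assumed spelling of the three top-level sections named in the text.
REQUIRED_SECTIONS = ("dataset", "ML", "gretel")


def missing_sections(config: Mapping[str, Any]) -> List[str]:
    """Return the required sections that are absent or not key/value mappings."""
    return [
        name
        for name in REQUIRED_SECTIONS
        if not isinstance(config.get(name), Mapping)
    ]


# Placeholder configuration; the inner keys are illustrative only.
example_config = {
    "dataset": {"name": "my-dataset"},
    "ML": {"objective": "classification"},
    "gretel": {"mode": "cloud"},
}

problems = missing_sections(example_config)
if problems:
    raise ValueError(f"configuration is missing sections: {problems}")
```

Failing fast on a malformed file is cheaper than discovering the problem after the pipeline has already started a processing job.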
Example MLOps configuration files are available for multiple datasets. The example below uses a healthcare dataset: