Amazon SageMaker Documentation

Amazon SageMaker is a fully managed service with features that help developers and data scientists prepare, build, train, and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from the ML process to make it easier to develop high-quality models. SageMaker provides the components used for machine learning in a single toolset to help models get to production faster with less effort and at lower cost.

Collect and Prepare Training Data

Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler can reduce the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. Using SageMaker Data Wrangler’s data selection tool, you can choose the data you want from various data sources and import it easily. SageMaker Data Wrangler contains over 300 built-in data transformations so you can quickly normalize, transform, and combine features without having to write any code. With SageMaker Data Wrangler’s visualization templates, you can quickly preview these transformations and verify that they are completed as you intended by viewing them in Amazon SageMaker Studio, a fully integrated development environment (IDE) for ML. Once your data is prepared, you can build fully automated ML workflows with Amazon SageMaker Pipelines and save them for reuse in the Amazon SageMaker Feature Store.

Amazon SageMaker Feature Store

Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, update, retrieve, and share machine learning (ML) features.

Features are the attributes or properties models use during training and inference to make predictions. For example, in an ML application that recommends a music playlist, features could include song ratings, which songs were listened to previously, and how long songs were listened to. The accuracy of an ML model is based on a precise set and composition of features. Often, these features are used repeatedly by multiple teams training multiple models. And whichever feature set was used to train the model needs to be available to make real-time predictions (inference). Keeping a single source of features that is consistent and up to date across these different access patterns is a challenge, because most organizations keep two different feature stores, one for training and one for inference.

Amazon SageMaker Feature Store is a purpose-built repository where you can store and access features so it’s easier to name, organize, and reuse them across teams. SageMaker Feature Store provides a unified store for features during training and real-time inference without the need to write additional code or create manual processes to keep features consistent. SageMaker Feature Store keeps track of the metadata of stored features (e.g., feature name or version number) so that you can query the features for the right attributes in batches or in real time using Amazon Athena, an interactive query service. SageMaker Feature Store also keeps features up to date: as new data is generated during inference, the single repository is updated so that new features are available for models to use during training and inference.
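The idea of one repository serving both access patterns can be illustrated with a small in-memory sketch. This is not the SageMaker Feature Store API; the class `ToyFeatureStore`, its method names, and the sample feature names are hypothetical and exist only to show a single store answering a real-time "latest record" read and a batch "all records" read.

```python
class ToyFeatureStore:
    """In-memory stand-in for a feature repository (illustrative only)."""

    def __init__(self):
        # record_id -> list of (event_time, features), one entry per update
        self._history = {}

    def put_record(self, record_id, event_time, features):
        """Append a new version of the features for one record."""
        self._history.setdefault(record_id, []).append((event_time, dict(features)))

    def get_latest(self, record_id):
        """Real-time style read: the most recent features for one record."""
        versions = self._history[record_id]
        return max(versions, key=lambda version: version[0])[1]

    def get_all(self):
        """Batch style read: every stored version, e.g. to build a training set."""
        return [
            {"record_id": record_id, "event_time": event_time, **features}
            for record_id, versions in self._history.items()
            for event_time, features in versions
        ]


store = ToyFeatureStore()
store.put_record("listener-1", 1, {"avg_song_rating": 3.5, "minutes_listened": 40})
store.put_record("listener-1", 2, {"avg_song_rating": 4.0, "minutes_listened": 55})
store.put_record("listener-2", 1, {"avg_song_rating": 2.0, "minutes_listened": 10})

latest = store.get_latest("listener-1")  # what an inference request would read
training_rows = store.get_all()          # what a training job would read
```

Because both reads go through the same store, the features seen at inference time are drawn from the same records that were available for training.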

Amazon SageMaker Ground Truth

Amazon SageMaker Ground Truth is a data labeling service that makes it easier to build highly accurate training datasets for machine learning. Get started with labeling your data in minutes through the SageMaker Ground Truth console using custom or built-in data labeling workflows. These workflows support a variety of use cases including 3D point clouds, video, images, and text. As part of the workflows, labelers have access to assistive labeling features such as automatic 3D cuboid snapping, removal of distortion in 2D images, and auto-segment tools to reduce the time required to label datasets. In addition, Ground Truth offers automated data labeling, which uses a machine learning model to label your data.

Amazon SageMaker Clarify

Amazon SageMaker Clarify provides machine learning developers with greater visibility into their training data and models by helping identify bias and explain ML predictions.

Biases are imbalances in the training data or prediction behavior of the model across different groups, such as age or income bracket. Biases can result from the data or algorithm used to train your model. For instance, if an ML model is trained primarily on data from middle-aged individuals, it may be less accurate when making predictions involving younger and older people. 

Amazon SageMaker Clarify is designed to detect potential bias during data preparation, after model training, and in deployed models by examining attributes you specify. For instance, you can check for bias related to age in your initial dataset or in your trained model and receive a detailed report that quantifies different types of possible bias. SageMaker Clarify also includes feature importance graphs that help you explain model predictions and produces reports you can share or use to identify issues with your model that you can take steps to correct.
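As a rough illustration of what quantifying bias in a dataset can mean, the sketch below computes two simple measures over a toy dataset: how unevenly two groups are represented, and how much their rates of positive labels differ. The record layout, the field names, and the choice of these two measures are assumptions made for illustration; this is not SageMaker Clarify's API or its report format.

```python
def class_imbalance(records, facet, favored_value):
    """(n_a - n_d) / (n_a + n_d): 0 is balanced, +1 or -1 means one group only."""
    n_a = sum(1 for r in records if r[facet] == favored_value)
    n_d = len(records) - n_a
    return (n_a - n_d) / (n_a + n_d)


def positive_rate_gap(records, facet, favored_value, label):
    """Positive-label rate in the favored group minus the rate in the rest."""
    group_a = [r[label] for r in records if r[facet] == favored_value]
    group_d = [r[label] for r in records if r[facet] != favored_value]
    return sum(group_a) / len(group_a) - sum(group_d) / len(group_d)


# Toy dataset: mostly middle-aged individuals, as in the example above.
records = (
    [{"age_group": "middle", "approved": 1}] * 4
    + [{"age_group": "middle", "approved": 0}] * 2
    + [{"age_group": "other", "approved": 1}] * 1
    + [{"age_group": "other", "approved": 0}] * 1
)

imbalance = class_imbalance(records, "age_group", "middle")             # (6 - 2) / 8
rate_gap = positive_rate_gap(records, "age_group", "middle", "approved")  # 4/6 - 1/2
```

A value of `imbalance` far from zero indicates that one group dominates the data, which is the situation described in the middle-aged example.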

Build Machine Learning Models

Amazon SageMaker Studio

Amazon SageMaker Studio provides a single, web-based visual interface where you can perform all ML development steps, which can significantly improve data science team productivity. SageMaker Studio gives you access, control, and visibility into each step required to build, train, and deploy models. You can quickly upload data, create new notebooks, train and tune models, move back and forth between steps to adjust experiments, compare results, and deploy models to production all in one place, making the build process more efficient and productive. ML development activities that can be performed within SageMaker Studio include notebooks, experiment management, automatic model creation, debugging, and model and data drift detection.

Amazon SageMaker Autopilot

Amazon SageMaker Autopilot automates the process of building, training, and tuning the best machine learning models based on your data, while allowing you to maintain full control and visibility.

Building machine learning (ML) models requires you to manually prepare features, test multiple algorithms, and optimize hundreds of model parameters in order to find the best model for your data. However, this approach requires deep ML expertise. If you don’t have that expertise, you could use an automated approach (AutoML), but AutoML approaches typically provide very little visibility into the impact of your features on model predictions. As a result, it can be difficult to recreate the process or fully understand how your model makes predictions.

Amazon SageMaker Autopilot reduces the heavy lifting of building ML models, and helps you automate the process of building, training, and tuning the best ML model based on your data. With SageMaker Autopilot, you simply provide a tabular dataset and select the target column to predict, which can be a number (such as a house price, called regression), or a category (such as spam/not spam, called classification). SageMaker Autopilot explores different solutions to find the best model based on the data you provide. You can then deploy the model directly to production with just one click, or iterate on the recommended solutions with Amazon SageMaker Studio to further improve the model quality.
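The distinction between regression and classification targets, and the idea of choosing the best of several explored candidates, can be sketched as follows. This is a simplified heuristic written for illustration, not a description of how SageMaker Autopilot works internally; the function names, the `max_classes` threshold, and the candidate scores are all hypothetical.

```python
def infer_problem_type(target_values, max_classes=10):
    """Guess the problem type from a target column (illustrative heuristic)."""
    distinct = set(target_values)
    numeric = all(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for value in distinct
    )
    # Many distinct numeric values suggest a quantity to predict (regression);
    # anything else is treated as a set of categories (classification).
    if numeric and len(distinct) > max_classes:
        return "regression"
    return "classification"


def pick_best(candidate_scores):
    """Return the candidate name with the highest validation score."""
    return max(candidate_scores, key=candidate_scores.get)


house_prices = [210_000 + 7_500 * i for i in range(12)]  # 12 distinct numbers
spam_labels = ["spam", "not spam", "spam", "not spam"]

price_task = infer_problem_type(house_prices)
spam_task = infer_problem_type(spam_labels)
best = pick_best({"candidate-a": 0.81, "candidate-b": 0.88, "candidate-c": 0.79})
```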

Amazon SageMaker JumpStart

Amazon SageMaker JumpStart helps you quickly and easily get started with machine learning. SageMaker JumpStart provides a set of solutions for many of the most common use cases that can be deployed readily with just a few clicks. The solutions are customizable and showcase the use of AWS CloudFormation templates and reference architectures so you can accelerate your ML journey. Amazon SageMaker JumpStart also supports one-click deployment and fine-tuning of more than 150 popular open source models such as natural language processing, object detection, and image classification models.

Train and Tune Machine Learning Models

Amazon SageMaker Debugger

Amazon SageMaker Debugger makes it easier to optimize machine learning (ML) models by capturing training metrics, such as data loss during regression, in real time and sending alerts when anomalies are detected. This helps you rectify inaccurate model predictions such as an incorrect identification of an image. SageMaker Debugger stops the training process when your desired level of accuracy is achieved, reducing the time and cost of training ML models.
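The two behaviors described here, flagging an anomaly in a training metric and stopping once a target is reached, can be sketched as a simple scan over per-step loss values. This is a minimal illustration, not SageMaker Debugger's rule API; the function name, the `patience` parameter, and the use of a loss threshold as a stand-in for "desired level of accuracy" are assumptions.

```python
def monitor_training(losses, target_loss, patience=3):
    """Scan per-step losses and return (status, step).

    status is "target_reached" when a loss is at or below target_loss,
    "not_decreasing" when the loss fails to improve for `patience`
    consecutive steps, and "completed" if neither happens.
    """
    best = float("inf")
    stalled = 0
    for step, loss in enumerate(losses):
        if loss <= target_loss:
            return "target_reached", step  # stop early: goal achieved
        if loss < best:
            best = loss
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                return "not_decreasing", step  # anomaly worth an alert
    return "completed", len(losses) - 1


healthy = monitor_training([1.0, 0.6, 0.3, 0.09, 0.05], target_loss=0.1)
stuck = monitor_training([1.0, 0.9, 0.95, 0.97, 0.99, 0.5], target_loss=0.1)
```

In the first run the scan stops at step 3, before the remaining steps are consumed, which is where the time and cost savings come from.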

Amazon SageMaker Debugger can also help you train models faster by profiling and monitoring system resource utilization and sending alerts when resource bottlenecks such as over-utilized CPUs are identified. You can visually monitor and profile system resources including CPUs, GPUs, network, and memory during training within Amazon SageMaker Studio so you can continuously improve resource utilization. SageMaker Debugger correlates system resource usage to different phases of the training job and to specific points in time during training, and provides recommendations on how to adjust resource utilization to help you re-allocate resources to maximize efficiency. Monitoring and profiling work across leading deep learning frameworks including PyTorch and TensorFlow, without requiring any code changes in your training scripts. Monitoring and profiling of system resources happen in real time, helping you optimize your ML models faster and at scale.

Distributed Training Libraries

Amazon SageMaker helps improve the training process for large deep learning models and datasets. Using partitioning algorithms, SageMaker's distributed training libraries split large deep learning models and training datasets across AWS GPU instances in a fraction of the time it takes to do manually. SageMaker achieves these efficiencies through two techniques: data parallelism and model parallelism. Model parallelism splits models too large to fit on a single GPU into smaller parts before distributing across multiple GPUs to train, and data parallelism splits large datasets to train concurrently in order to improve training speed.
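Data parallelism can be illustrated with a toy example in plain Python: the dataset is split into shards, each worker computes a gradient on its own shard, and the results are averaged. This sketch is not the SageMaker distributed training library; the one-parameter model `y = w * x`, the function names, and the sample data are assumptions chosen so the arithmetic is easy to follow.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def split_into_shards(data, num_workers):
    """Round-robin split; shards are equal-sized when len(data) divides evenly."""
    return [data[i::num_workers] for i in range(num_workers)]


def data_parallel_gradient(w, data, num_workers):
    """Each worker computes a gradient on its shard; the results are averaged."""
    shards = split_into_shards(data, num_workers)
    # On real hardware these per-shard computations would run concurrently.
    local_gradients = [gradient(w, shard) for shard in shards]
    return sum(local_gradients) / num_workers


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
full_gradient = gradient(0.0, data)
parallel_gradient = data_parallel_gradient(0.0, data, num_workers=2)
```

With equal-sized shards the averaged gradient matches the full-batch gradient, so the training result is unchanged while each worker processes only part of the data.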

ML use cases such as image classification and text-to-speech demand increasingly large computational resources and datasets. For example, BERT, a state-of-the-art natural language processing (NLP) model released in 2018, uses 340 million parameters. Now, state-of-the-art NLP models, such as T5, GPT-3, Turing-NLG, and Megatron, have set new accuracy records, but require tens to hundreds of billions of parameters. Training models like T5 or GPT-3 on a single GPU instance can take several days, slowing your ability to deploy the latest iterations into production. Additionally, implementing your own data and model parallelism strategies manually can take weeks of experimentation.

With just a few lines of additional code, you can add either data parallelism or model parallelism to your PyTorch and TensorFlow training scripts and Amazon SageMaker will apply your selected method for you. SageMaker splits your model by using graph partitioning algorithms to balance the computation of each GPU while minimizing the communication between GPU instances. SageMaker also helps optimize your distributed training jobs through algorithms that are designed to maximize AWS compute and network infrastructure in order to achieve near-linear scaling efficiency, which allows you to complete training more quickly than manual implementations.
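The goal of balancing computation across GPUs can be sketched with a simple greedy split of a model's layers into contiguous groups. This is a deliberately naive heuristic for illustration and is not the graph partitioning algorithm SageMaker uses; it ignores communication cost entirely, and the function name and the per-layer cost figures are hypothetical.

```python
def partition_layers(layer_costs, num_devices):
    """Greedy contiguous split: close a group once it reaches the average load."""
    target = sum(layer_costs) / num_devices
    partitions, current, load = [], [], 0.0
    for index, cost in enumerate(layer_costs):
        current.append(index)
        load += cost
        remaining_layers = len(layer_costs) - index - 1
        remaining_slots = num_devices - len(partitions) - 1
        # Close the group when it is full enough, or when every remaining
        # layer is needed to give each remaining device at least one layer.
        if remaining_slots > 0 and (load >= target or remaining_layers == remaining_slots):
            partitions.append(current)
            current, load = [], 0.0
    if current:
        partitions.append(current)
    return partitions


layer_costs = [4, 2, 2, 4, 4]  # hypothetical compute cost per layer
groups = partition_layers(layer_costs, num_devices=2)
loads = [sum(layer_costs[i] for i in group) for group in groups]
```

Here the five layers are split into two groups with equal total cost, so neither device sits idle waiting for the other.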

Deploy and Manage Machine Learning Models

Amazon SageMaker Pipelines

Amazon SageMaker Pipelines is a purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for machine learning (ML). With SageMaker Pipelines, you can create, automate, and manage end-to-end ML workflows at scale.

Orchestrating workflows across each step of the machine learning process (e.g. exploring and preparing data, experimenting with different algorithms and parameters, training and tuning models, and deploying models to production) can take months of coding.

Since it is purpose-built for machine learning, SageMaker Pipelines helps you automate different steps of the ML workflow, including data loading, data transformation, training and tuning, and deployment. With SageMaker Pipelines, you can share and re-use workflows to recreate or optimize models, helping you scale ML throughout your organization.
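A workflow of this kind can be thought of as a set of named steps with dependencies between them. The sketch below runs such steps in dependency order using Python's standard `graphlib` module. It is a conceptual illustration only and does not use the SageMaker Pipelines SDK; the step names, the placeholder step functions, and `run_workflow` are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(steps, dependencies):
    """Run each step after the steps it depends on; return (order, results)."""
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        # Each step receives the results of all steps that ran before it.
        results[name] = steps[name](results)
    return order, results


# Placeholder steps standing in for data loading, transformation,
# training, and deployment.
steps = {
    "load": lambda results: [1.0, 2.0, 3.0, 4.0],
    "transform": lambda results: [x / max(results["load"]) for x in results["load"]],
    "train": lambda results: sum(results["transform"]) / len(results["transform"]),
    "deploy": lambda results: f"serving model with parameter {results['train']}",
}

# Each key lists the steps that must finish before it can start.
dependencies = {
    "transform": {"load"},
    "train": {"transform"},
    "deploy": {"train"},
}

order, results = run_workflow(steps, dependencies)
```

Because the workflow is plain data (a mapping of steps and a mapping of dependencies), it can be stored, shared, and re-run, which is the reuse property described above.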

Amazon SageMaker Model Monitor

Amazon SageMaker Model Monitor helps you maintain high-quality machine learning (ML) models by detecting and alerting on inaccurate predictions from models deployed in production.

The accuracy of ML models can deteriorate over time, a phenomenon known as model drift. Many factors, such as changes in model features, can cause model drift. The accuracy of ML models can also be affected by concept drift, the difference between data used to train models and data used during inference.

Amazon SageMaker Model Monitor helps you maintain high-quality ML models by detecting model and concept drift in real time, and sending you alerts so you can take immediate action. Model and concept drift are detected by monitoring the quality of the model based on independent and dependent variables. Independent variables (also known as features) are the inputs to an ML model, and dependent variables are the outputs of the model. For example, with an ML model predicting a bank loan approval, independent variables could be age, income, and credit history of the applicant, and the dependent variable would be the actual result of the loan application. Further, SageMaker Model Monitor monitors model performance characteristics such as accuracy, which measures the number of correct predictions compared to the total number of predictions, so you can take action to address anomalies.
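The accuracy measure described above, along with one simple way to flag a shift in an input feature, can be sketched in a few lines. This is an illustration under stated assumptions, not SageMaker Model Monitor's API or its actual statistics: the drift test (live mean more than a chosen number of baseline standard deviations from the baseline mean), the `threshold` value, and the sample data are all hypothetical.

```python
from statistics import mean, stdev


def accuracy(predictions, actuals):
    """Number of correct predictions divided by the total number of predictions."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    return correct / len(actuals)


def feature_drifted(baseline, live, threshold=3.0):
    """Flag drift when the live mean is far from the baseline mean.

    "Far" means more than `threshold` baseline standard deviations.
    """
    return abs(mean(live) - mean(baseline)) > threshold * stdev(baseline)


# Dependent variable: predicted versus actual loan decisions.
predicted = ["approve", "deny", "approve", "approve"]
actual = ["approve", "deny", "deny", "approve"]
loan_accuracy = accuracy(predicted, actual)

# Independent variable: applicant income (in thousands) at training time
# versus two samples seen in production.
baseline_income = [40, 45, 50, 55, 60]
similar_income = [48, 52, 50]
shifted_income = [90, 95, 100]

no_drift = feature_drifted(baseline_income, similar_income)
drift = feature_drifted(baseline_income, shifted_income)
```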

Additionally, SageMaker Model Monitor is integrated with Amazon SageMaker Clarify to help you identify potential bias in your ML models with model bias detection.

Kubernetes Integration

Kubernetes is an open source system used to automate the deployment, scaling, and management of containerized applications. Kubeflow Pipelines is a workflow manager that offers an interface to manage and schedule machine learning (ML) workflows on a Kubernetes cluster. Using open source tools offers flexibility and standardization, but requires time and effort to set up infrastructure, provision notebook environments for data scientists, and stay up-to-date with the latest deep learning framework versions.

Amazon SageMaker Operators for Kubernetes and Components for Kubeflow Pipelines enable the use of fully managed SageMaker machine learning tools across the ML workflow natively from Kubernetes or Kubeflow. This eliminates the need to manually manage and optimize your Kubernetes-based ML infrastructure while still preserving control over orchestration and flexibility.

Amazon SageMaker Edge Manager

An increasing number of applications such as industrial automation, autonomous vehicles, and automated checkouts require machine learning (ML) models that run on devices at the edge so predictions can be made in real time when new data is available. Amazon SageMaker Neo is an easy way to optimize ML models for edge devices, enabling you to train ML models once in the cloud and run them on any device. As devices proliferate, customers may have thousands of deployed models running across their fleets. Amazon SageMaker Edge Manager enables you to optimize, secure, monitor, and maintain ML models on fleets of smart cameras, robots, personal computers, and mobile devices.

Amazon SageMaker Edge Manager provides a software agent that runs on edge devices. The agent comes with an ML model optimized with SageMaker Neo so you don’t need to have Neo runtime installed on your devices in order to take advantage of the model optimizations. The agent also collects prediction data and sends a sample of the data to the cloud for monitoring, labeling, and retraining so you can keep models accurate over time. All data can be viewed in the SageMaker Edge Manager dashboard, which reports on the operation of deployed models. And, because SageMaker Edge Manager enables you to manage models separately from the rest of the application, you can update the model and the application independently, which can reduce costly downtime and service disruptions. SageMaker Edge Manager also cryptographically signs your models so you can verify that they were not tampered with as they move from the cloud to edge devices.
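The tamper check described above can be illustrated with Python's standard library. The text does not say which signing scheme SageMaker Edge Manager uses, so this sketch substitutes HMAC-SHA256, a keyed message authentication code that relies on a shared secret, as a stand-in for a real digital signature. The function names, the key, and the model bytes are hypothetical.

```python
import hashlib
import hmac


def sign_model(model_bytes, key):
    """Produce a tag over the model artifact (HMAC-SHA256 stand-in for a signature)."""
    return hmac.new(key, model_bytes, hashlib.sha256).hexdigest()


def verify_model(model_bytes, key, signature):
    """Return True only if the artifact still matches the tag it shipped with."""
    expected = sign_model(model_bytes, key)
    return hmac.compare_digest(expected, signature)


key = b"example-shared-secret"           # hypothetical key, for illustration only
model = b"compiled-model-artifact-v1"    # stands in for the model file contents

signature = sign_model(model, key)                    # computed in the cloud
intact = verify_model(model, key, signature)          # checked on the device
tampered = verify_model(model + b"!", key, signature)  # any change is detected
```

A production scheme would more likely use asymmetric keys, so that devices hold only a public verification key; that detail is outside what the text above specifies.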

Amazon SageMaker Neo

Amazon SageMaker Neo enables developers to optimize machine learning (ML) models for inference on SageMaker in the cloud and supported devices at the edge.

ML inference is the process of using a trained machine learning model to make predictions. After training a model for high accuracy, developers often spend a lot of time and effort tuning the model for high performance. For inference in the cloud, developers often turn to large instances with lots of memory and powerful processing capabilities at higher costs to achieve better throughput. For inference on edge devices with limited compute and memory, developers often spend months hand-tuning the model to achieve acceptable performance within the device hardware constraints.

Amazon SageMaker Neo optimizes machine learning models for inference on cloud instances and edge devices to run faster without compromising accuracy. You start with a machine learning model already built with DarkNet, Keras, MXNet, PyTorch, TensorFlow, TensorFlow-Lite, ONNX, or XGBoost and trained in Amazon SageMaker or anywhere else. Then you choose your target hardware platform, which can be a SageMaker hosting instance or an edge device based on processors from Ambarella, Apple, ARM, Intel, MediaTek, Nvidia, NXP, Qualcomm, RockChip, Texas Instruments, or Xilinx. With a single click, SageMaker Neo optimizes the trained model and compiles it into an executable. The compiler uses a machine learning model to apply performance optimizations to your model for the cloud instance or edge device. You then deploy the model as a SageMaker endpoint or on supported edge devices and start making predictions.

For inference in the cloud, SageMaker Neo speeds up inference and saves cost by creating an inference-optimized container in SageMaker hosting. For inference at the edge, SageMaker Neo can save developers months of manual tuning by tuning the model for the selected operating system and processor hardware.

Amazon SageMaker Neo uses Apache TVM and partner-provided compilers and acceleration libraries to optimize performance for a given model and hardware target. AWS contributes the compiler code to the Apache TVM project and the runtime code to the Neo-AI open-source project, under the Apache Software License, to enable processor vendors and device makers to innovate rapidly on a common compact runtime.

Additional Information

For additional information about service controls, security features and functionalities, including, as applicable, information about storing, retrieving, modifying, restricting, and deleting data, please see This additional information does not form part of the Documentation for purposes of the AWS Customer Agreement available at, or other agreement between you and AWS governing your use of AWS’s services.