Amazon SageMaker makes it easy to build machine learning (ML) models at scale and get them ready for training, by providing everything you need to label training data, access and share notebooks, and use built-in algorithms and frameworks.
Collaborative notebook experience
Amazon SageMaker Notebooks, now in preview, provide one-click Jupyter notebooks with elastic compute that can be spun up quickly. Notebooks contain everything needed to run or recreate a machine learning workflow and are integrated within Amazon SageMaker Studio. Notebooks are pre-loaded with all the common CUDA and cuDNN drivers, Anaconda packages, and framework libraries.
The notebook environment lets you explore and visualize your data and document your findings in re-usable workflows. From within the notebook, you can bring in your data stored in Amazon S3. You can also use AWS Glue to easily move data from Amazon RDS, Amazon DynamoDB, and Amazon Redshift into S3 for analysis.
Without elastic notebooks, you need to spin up a compute instance before you can view, run, or share a notebook. If you need more compute power, you have to spin up a new instance, transfer the notebook, and shut down the old instance. And because the notebook is usually coupled to the compute instance and typically lives on a user’s workstation, there is no easy way to share notebooks and iterate collaboratively.
SageMaker Notebooks overcomes these challenges. You no longer need to lose time shutting down the old instance and recreating work in a new instance. This makes it much faster to get started building a model.
You can write or import your own notebook, or use one of the many pre-built notebooks that come with SageMaker for different use cases. Once a notebook is launched, you can increase or decrease compute resources (including GPU resources) without interruption. Your state is also saved automatically, so you can pick up exactly where you left off the next time you return to the notebook.
All code dependencies, such as software packages and their versions, are automatically captured within the notebook environment, so you don’t need to track them manually. This lets you share notebooks with colleagues so they can easily visualize and reproduce your results.
Build accurate training datasets
Amazon SageMaker Ground Truth helps you build highly accurate training datasets quickly using machine learning and reduce data labeling costs by up to 70%. Successful machine learning models are trained using data that has been labeled to teach the model how to make correct decisions. This process can often take months and large teams of people to complete. SageMaker Ground Truth provides an innovative solution to reduce cost and complexity, while also increasing the accuracy of data labeling by bringing together human labeling with a machine learning process called active learning.
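The idea of combining human labeling with active learning can be sketched roughly as follows. This is a minimal illustration and not Ground Truth's actual implementation: the `route_items` function, the 0.9 confidence threshold, and the toy model scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def route_items(
    predictions: Dict[str, Tuple[str, float]],
    threshold: float = 0.9,  # hypothetical cut-off, not a Ground Truth setting
) -> Tuple[Dict[str, str], List[str]]:
    """Split items into auto-labeled ones and ones that need human review.

    `predictions` maps an item id to (predicted_label, model_confidence).
    """
    auto_labeled: Dict[str, str] = {}
    needs_human: List[str] = []
    for item_id, (label, confidence) in predictions.items():
        if confidence >= threshold:
            # The model is confident, so accept its label without human effort.
            auto_labeled[item_id] = label
        else:
            # The model is unsure, so a human label here is the most informative.
            needs_human.append(item_id)
    return auto_labeled, needs_human


# Hypothetical model outputs for four unlabeled images.
predictions = {
    "img-001": ("cat", 0.98),
    "img-002": ("dog", 0.55),
    "img-003": ("cat", 0.91),
    "img-004": ("dog", 0.62),
}
auto_labeled, needs_human = route_items(predictions)
print(auto_labeled)
print(needs_human)
```

In a loop like this, the human-provided labels would be fed back into training so the model becomes confident on more items in the next round, which is where the cost reduction would come from.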
Fully managed data processing at scale
Data processing and analytics workloads for machine learning are often run on self-managed infrastructure that is difficult to allocate and scale as business requirements change. Stitching together different tools to do this becomes cumbersome, resulting in sub-optimal performance and increased capital and operating expenses. Amazon SageMaker Processing overcomes this challenge by extending the ease, scalability, and reliability of SageMaker to a fully managed experience for running data processing workloads at scale. SageMaker Processing lets you connect to existing storage or file system data sources, spin up the resources required to run your job, save the output to persistent storage, and collect logs and metrics. You can also bring your own containers, using frameworks of your choice, to run data processing and analytics workloads.
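As a rough sketch of the kind of script such a job might run, the example below reads a CSV file from an input location, standardizes one column, writes the result to an output location, and prints a log line. It uses only the Python standard library. The function name, the column name, and the file layout are hypothetical, and the demo uses a temporary directory in place of the input and output paths that a real job would have configured.

```python
import csv
import statistics
import tempfile
from pathlib import Path


def standardize_column(input_csv: Path, output_csv: Path, column: str) -> int:
    """Rescale one numeric column to zero mean and unit variance.

    Returns the number of data rows written.
    """
    with input_csv.open(newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        raise ValueError(f"{input_csv} contains no data rows")

    values = [float(row[column]) for row in rows]
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    for row, value in zip(rows, values):
        row[column] = f"{(value - mean) / stdev:.4f}"

    output_csv.parent.mkdir(parents=True, exist_ok=True)
    with output_csv.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    # Anything printed here would end up in the job's logs.
    print(f"wrote {len(rows)} rows to {output_csv.name}")
    return len(rows)


# Demo on a temporary directory standing in for the job's input/output paths.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input" / "orders.csv"
    src.parent.mkdir()
    src.write_text("id,amount\n1,10\n2,20\n3,30\n")
    dst = Path(tmp) / "output" / "orders.csv"
    rows_written = standardize_column(src, dst, "amount")
    result = dst.read_text().splitlines()

print(result)
```

Keeping the transformation in a plain function that takes paths as arguments makes the same script easy to test locally before it is packaged into a container.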
Built-in, high-performance algorithms
Amazon SageMaker provides high-performance, scalable machine learning algorithms, optimized for speed, scale, and accuracy, that can perform training on petabyte-scale data sets. You can choose from supervised algorithms where the correct answers are known during training and you can instruct the model where it made mistakes. SageMaker includes supervised algorithms such as XGBoost and linear/logistic regression or classification, to address recommendation and time series prediction problems. SageMaker also includes support for unsupervised learning (i.e. the algorithms must discover the correct answers on their own), such as with k-means clustering and principal component analysis (PCA), to solve problems like identifying customer groupings based on purchasing behavior.
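To make the customer-grouping example concrete, here is a minimal pure-Python sketch of k-means clustering (Lloyd's algorithm). It is an illustration only and not SageMaker's built-in implementation. The customer data, the feature choice (orders per month, average order value), and the starting centroids are hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(
    points: Sequence[Point],
    initial_centroids: Sequence[Point],
    iterations: int = 10,
) -> Tuple[List[int], List[Point]]:
    """Assign each point to its nearest centroid, then move each centroid
    to the mean of its assigned points, and repeat."""
    centroids = list(initial_centroids)  # copy so the caller's list is untouched
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: recompute each centroid as the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return assignments, centroids


# Hypothetical customers: (orders per month, average order value).
customers = [(1, 20), (2, 25), (1, 22), (10, 200), (12, 220), (11, 210)]
assignments, centroids = kmeans(customers, [customers[0], customers[3]])
print(assignments)
print(centroids)
```

No labels are supplied anywhere: the two groups (occasional low-spend and frequent high-spend customers) emerge from the distances alone, which is what distinguishes this from the supervised algorithms described above.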
SageMaker makes the most common machine learning algorithms automatically available to you. You simply specify your data source, and you can start running k-means clustering for data segmentation, factorization machines for recommendations, time-series forecasting, linear regression, or principal component analysis, or many other algorithms that are ready to use right away.
|Algorithm|Description|
|---|---|
|BlazingText Word2Vec|BlazingText implementation of the Word2Vec algorithm for scaling and accelerating the generation of word embeddings from a large number of documents.|
|DeepAR|An algorithm that generates accurate forecasts by learning patterns from many related time-series using recurrent neural networks (RNN).|
|Factorization Machines|A model with the ability to estimate all of the interactions between features even with a very small amount of data.|
|Gradient Boosted Trees (XGBoost)|Short for “Extreme Gradient Boosting”, XGBoost is an optimized distributed gradient boosting library.|
|Image Classification (ResNet)|A popular neural network for developing image classification systems.|
|IP Insights|An algorithm to detect malicious users or learn the usage patterns of IP addresses.|
|K-Means Clustering|One of the simplest ML algorithms, it is used to find groups within unlabeled data.|
|K-Nearest Neighbor (k-NN)|An index-based algorithm to address classification and regression-based problems.|
|Latent Dirichlet Allocation (LDA)|A model that is well suited to automatically discover the main topics present in a set of text files.|
|Linear Learner (Classification)|Linear classification uses an object’s characteristics to identify the appropriate group that it belongs to.|
|Linear Learner (Regression)|Linear regression is used to predict the linear relationship between two variables.|
|Neural Topic Modeling (NTM)|A neural network-based approach for learning topics from text and image datasets.|
|Object2Vec|A neural embedding algorithm to compute nearest neighbors and to visualize natural clusters.|
|Object Detection|Detects, classifies, and places bounding boxes around multiple objects in an image.|
|Principal Component Analysis (PCA)|Often used in data pre-processing, this algorithm takes a table or matrix of many features and reduces it to a smaller number of representative features.|
|Random Cut Forest|An unsupervised machine learning algorithm for anomaly detection.|
|Semantic Segmentation|Partitions an image to identify places of interest by assigning a label to the individual pixels of the image.|
|Sequence2Sequence|A general-purpose encoder-decoder for text that is often used for machine language translation, text summarization, and more.|
You can also bring your own framework or algorithm via a Docker container or select from hundreds of algorithms and pre-trained models available in AWS Marketplace.
Broad framework support
Amazon SageMaker supports many popular frameworks for deep learning such as TensorFlow, Apache MXNet, PyTorch, Chainer, and more. These frameworks are automatically configured and optimized for high performance. You don’t need to set up these frameworks manually and can use them within the built-in containers. You can also bring any framework you like to SageMaker by building it into a Docker container that you store in Amazon Elastic Container Registry (Amazon ECR).
Test and prototype locally
The open source Apache MXNet and TensorFlow Docker containers used in Amazon SageMaker are available on GitHub. You can download these containers to your local environment and use the SageMaker Python SDK to test your scripts before deploying to SageMaker training or hosting environments. When you’re ready to go from local testing to production training and hosting, a change to a single line of code is all that's needed.
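The "single line" can be pictured with the sketch below, which builds the settings one might pass to a SageMaker Python SDK estimator without importing the SDK itself. Treat it as an assumption-laden illustration: the `estimator_settings` helper, the `train.py` script name, and the `ml.m5.xlarge` choice are hypothetical. The keyword is `instance_type` in version 2 of the SDK, while earlier releases used `train_instance_type`; in both, the value `"local"` selects local mode.

```python
from typing import Any, Dict


def estimator_settings(run_locally: bool) -> Dict[str, Any]:
    """Return estimator keyword arguments for a local or a hosted run.

    Hypothetical helper: the resulting dict would be passed to a framework
    estimator from the SageMaker Python SDK, which is not imported here.
    """
    return {
        "entry_point": "train.py",  # hypothetical training script
        "instance_count": 1,
        # The one setting that changes between local testing and hosted training.
        "instance_type": "local" if run_locally else "ml.m5.xlarge",
    }


local_settings = estimator_settings(run_locally=True)
hosted_settings = estimator_settings(run_locally=False)

# Show which settings differ between the two runs.
changed = {
    key for key in local_settings if local_settings[key] != hosted_settings[key]
}
print(changed)
```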
Amazon SageMaker supports reinforcement learning in addition to traditional supervised and unsupervised learning. SageMaker has built-in, fully managed reinforcement learning algorithms, including some of the newest and best performing in the academic literature. SageMaker supports RL in multiple frameworks, including TensorFlow and MXNet, as well as newer frameworks designed from the ground up for reinforcement learning, such as Intel Coach and Ray RLlib. Multiple 2D and 3D physics simulation environments are supported, including environments based on the open source OpenAI Gym interface. Additionally, SageMaker RL will allow you to train using virtual 3D environments built in Amazon Sumerian and Amazon RoboMaker. To help you get started, SageMaker also provides a range of example notebooks and tutorials.
Most machine learning falls into a category called supervised learning. This method requires a lot of labeled training data, but the models you build are able to make sophisticated decisions. It’s the common approach with computer vision, speech, and language models. Another common, but less used, category of machine learning is called unsupervised learning. Here, algorithms try to identify a hidden structure in unlabeled data. The bar to train an unsupervised model is much lower, but the tradeoff is that the model makes less sophisticated decisions. Unsupervised models are often used to identify anomalies in data, such as abnormal fluctuations in temperature or signs of network intrusion.
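As a small illustration of finding anomalies without labels, the sketch below flags temperature readings that sit far from the mean of the data. It is a simple z-score check, named plainly as a stand-in rather than the Random Cut Forest algorithm that SageMaker offers for this purpose. The readings and the threshold of 2.0 are hypothetical.

```python
import statistics
from typing import List


def find_anomalies(readings: List[float], z_threshold: float = 2.0) -> List[int]:
    """Return the indices of readings whose z-score exceeds the threshold."""
    mean = statistics.fmean(readings)
    stdev = statistics.pstdev(readings)
    if stdev == 0:
        return []  # all readings identical, so nothing stands out
    return [
        i for i, value in enumerate(readings)
        if abs(value - mean) / stdev > z_threshold
    ]


# Hypothetical hourly temperature readings with one abnormal spike.
temperatures = [21.0, 21.4, 20.8, 21.1, 35.0, 21.2, 20.9, 21.3]
anomalies = find_anomalies(temperatures)
print(anomalies)
```

Nothing in the data says which reading is "wrong"; the spike is identified only because it differs from the structure of the rest.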
Reinforcement learning (RL) has emerged as a third, complementary approach to machine learning. RL takes a very different approach to training models. It needs virtually no labeled training data, but it can still meet (and in some cases exceed) human levels of sophistication. The best thing about RL is that it can learn to model a complex series of behaviors to arrive at a desired outcome, rather than simply making a single decision. One of the most common applications today for RL is training autonomous vehicles to navigate to a destination.
An easy way to understand how RL works is to think of a simple video game where a character needs to navigate a maze, collecting flags and avoiding enemies. Instead of a human playing, the algorithm controls the character and plays millions of games. All it needs to know to get started is that the character can move up, down, left, and right, and that it will be rewarded by scoring points. The algorithm will then learn how to play to get the highest score possible. It will learn behaviors that improve the score (such as picking up flags or taking advantage of score multipliers) and minimize penalties (such as being hit by an enemy). Over time, RL algorithms can learn advanced strategies to master the game, such as clearing the lower part of the maze first, how and when to use power-ups, and how to exploit enemy behaviors.
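The maze game can be sketched with tabular Q-learning, one of the simplest RL algorithms. The example below is a toy under stated assumptions and not one of SageMaker's built-in RL algorithms: the 3x3 grid, the single flag and enemy, the reward values, and the learning parameters are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

State = Tuple[int, int]
ACTIONS: Dict[str, Tuple[int, int]] = {
    "up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1),
}
SIZE = 3
START: State = (0, 0)
FLAG: State = (2, 2)   # reaching the flag scores points and ends the game
ENEMY: State = (1, 1)  # hitting the enemy costs points and ends the game


def step(state: State, action: str) -> Tuple[State, float, bool]:
    """Apply one move; bumping into a wall leaves the character in place."""
    d_row, d_col = ACTIONS[action]
    row = min(max(state[0] + d_row, 0), SIZE - 1)
    col = min(max(state[1] + d_col, 0), SIZE - 1)
    nxt = (row, col)
    if nxt == FLAG:
        return nxt, 10.0, True
    if nxt == ENEMY:
        return nxt, -10.0, True
    return nxt, -1.0, False  # small cost per move favors short routes


def train(episodes: int = 2000, alpha: float = 0.5, gamma: float = 0.9,
          epsilon: float = 0.2, seed: int = 0) -> Dict[Tuple[State, str], float]:
    """Play many games, nudging Q-values toward reward plus discounted future value."""
    rng = random.Random(seed)
    q = {((r, c), a): 0.0
         for r in range(SIZE) for c in range(SIZE) for a in ACTIONS}
    for _ in range(episodes):
        state, done, moves = START, False, 0
        while not done and moves < 50:
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))  # explore a random move
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (
                reward + gamma * best_next - q[(state, action)]
            )
            state, moves = nxt, moves + 1
    return q


def greedy_path(q: Dict[Tuple[State, str], float], max_moves: int = 20) -> List[State]:
    """Follow the best-known action from the start until the game ends."""
    state, path, done = START, [START], False
    while not done and len(path) <= max_moves:
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _, done = step(state, action)
        path.append(state)
    return path


q_table = train()
path = greedy_path(q_table)
print(path)
```

The agent is never told where the flag or the enemy is. It starts out knowing only the four moves and the points it receives, and the learned route around the enemy comes entirely from repeated play.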
RL can be a force multiplier on traditional machine learning techniques. For example, RL and supervised learning have been combined to create personalized treatment regimens in health care, optimize manufacturing supply chains, improve wind turbine performance, drive autonomous cars, operate robots safely, and even create personalized classes and learning plans for students.
A step-by-step guide to building ML models
Learn to build ML models in Amazon SageMaker.
Amazon SageMaker sample notebooks
Access a rich repository of sample Amazon SageMaker notebooks on GitHub.