AWS Training and Certification Blog

Building ML excellence: A practical training guide for Amazon SageMaker AI

Having guided countless individuals, engineers, and Amazon Web Services (AWS) Partners through their machine learning (ML) journey on AWS, we understand the complexity of navigating the extensive landscape of Amazon SageMaker AI. When we first started working with new learners, we noticed they often struggled to piece together which SageMaker AI features would best serve their specific needs. That’s why we’ve created this practical guide that breaks down the five essential milestones every data science team encounters.

Whether you’re just starting to explore ML or looking to scale your existing ML operations, this guide is for you. You will walk through the key SageMaker AI tools that can transform your workflow—from setting up collaborative development environments to optimizing your models for production.

Let’s demystify the path to building production-ready ML solutions on AWS.

Development environments

Any technical project begins with setting up a development environment. SageMaker AI provides several web-based development environments with graphical user interfaces (GUIs), supporting ML project development, management, and collaboration for both teams and individuals.

Amazon SageMaker AI domains – The underlying infrastructure for you and your ML team. SageMaker AI domains provide secure network definitions for your ML infrastructure. For developers, it provides a collaborative space for sharing data, ML models, and ML findings. For administrators, it provides explicit user profile definitions, administrative tooling, and resource management for every member of your team.

Amazon SageMaker Studio – An all-in-one, web-based ML integrated development environment (IDE) providing both no-code GUI and code-first tooling for all common ML tasks on SageMaker AI. SageMaker Studio is highly versatile for ML scientists and data engineers alike: it offers straightforward high-level tooling for common ML tasks as well as in-depth, specialist-oriented services that provide the infrastructure to run highly specific ML workloads.

Amazon SageMaker Canvas – A no-code ML interface that provides straightforward out-of-the-box interfaces for achieving common ML tasks. It enables data preprocessing, flows for building common ML models, and model deployment for forecasting, all without writing a single line of code. SageMaker Canvas supports common ML prediction tasks such as detecting fraud, predicting maintenance or warehouse failures, forecasting financial and sales metrics, and more.

Amazon SageMaker notebooks – Fully managed Jupyter notebooks for data exploration, feature extraction, and ML model building. SageMaker notebook instances manage the underlying compute of your JupyterLab environment on Amazon Elastic Compute Cloud (Amazon EC2) and come with relevant modules for common ML tasks (for example, NumPy, pandas, PyTorch, and TensorFlow), as well as libraries for interfacing with the AWS Cloud, such as the AWS SDK for Python (Boto3), the AWS Command Line Interface (AWS CLI), and the Amazon SageMaker Python SDK.
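As a minimal illustration of the kind of work these notebooks are used for, the sketch below derives simple features with pandas and NumPy; the data and column names are hypothetical stand-ins for what you would load from Amazon S3:

```python
# Minimal sketch of a typical notebook workflow: load data, derive
# features, and inspect the result. Data and column names are hypothetical.
import numpy as np
import pandas as pd

# In a real notebook this would be something like pd.read_csv("s3://...")
df = pd.DataFrame({
    "order_total": [120.0, 45.5, 310.0, 89.9],
    "items": [3, 1, 7, 2],
})

# Derive simple features for a downstream model
df["avg_item_price"] = df["order_total"] / df["items"]
df["log_total"] = np.log1p(df["order_total"])

print(df[["avg_item_price", "log_total"]].round(2))
```

From here, the resulting frame would typically be written back to Amazon S3 or published to a feature store for training.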

Relevant learning resources

Skill Builder – Amazon SageMaker AI Getting Started – Introductory course for general ML workloads that showcases a straightforward example of using Amazon SageMaker notebooks.

Skill Builder – Digital Classroom – Amazon SageMaker Studio for Data Scientists – A hands-on digital course showcasing SageMaker Studio features for typical ML workloads. Many of the features this course covers are also highlighted in this post.

AWS Blog – Separate lines of business or teams with multiple Amazon SageMaker domains – A blog post highlighting the encapsulation and administrative capabilities that Amazon SageMaker domains provide.

YouTube – Make better business decisions with ML using Amazon SageMaker Canvas, without code – An hour-long presentation showcasing SageMaker Canvas, including a demo walking through typical use cases of using SageMaker Canvas.

Data science

All ML tasks start with data. After establishing a development environment, the next step is preparing your data for your ML task. This crucial step often involves labeling, cleaning, and preprocessing your data for your chosen task or algorithm. By the end, your data should be ready to be transformed, stored, and queried.

Amazon SageMaker Ground Truth – Comprehensive data labeling service that helps you employ human-in-the-loop ML methodology to create highly accurate datasets using a web-based dashboard. SageMaker Ground Truth offers built-in workflows for common labeling and segmentation tasks for text, images, videos, and 3D point-cloud data. SageMaker Ground Truth also supports custom labeling workflows to label your data at scale.

Amazon Mechanical Turk – A distributed human workforce marketplace for data labeling and validation tasks. Amazon Mechanical Turk directly integrates with SageMaker Ground Truth to easily scale your data labeling jobs.

Amazon SageMaker Data Wrangler – Visual data preparation tool that simplifies the process of preparing data for ML. SageMaker Data Wrangler provides a data pipeline interface for data selection, cleaning, exploration, and preprocessing, with over 300 built-in data transformations, and it integrates directly with other SageMaker AI services as well as Amazon Simple Storage Service (Amazon S3).

Amazon SageMaker Feature Store – Fully managed data repository for storing and querying ML data features. SageMaker Feature Store provides a centralized location to categorize, group, share, and reuse features for you and your team, with storage options for time-sensitive querying (that is, offline and online storage), feature processing pipelines, Time-To-Live configurations for your data, and more.

Note that although Amazon SageMaker Feature Store offers a high-performing management service for your data features, it’s also common to use Amazon S3 as a data warehouse for them. Which you choose depends on the requirements of your ML task.
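As a small sketch of working with the Feature Store programmatically, the helper below converts a plain Python dict into the record format the PutRecord API expects (a list of FeatureName/ValueAsString pairs); the feature names here are hypothetical:

```python
# Sketch: convert a plain Python dict of features into the record format
# expected by the Feature Store PutRecord API, where each feature becomes
# a {"FeatureName": ..., "ValueAsString": ...} entry. Feature names are
# hypothetical.
def to_feature_record(features: dict) -> list[dict]:
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in features.items()
    ]

record = to_feature_record({
    "customer_id": "C-1042",
    "ltv_score": 87.3,
    "event_time": "2024-01-15T09:30:00Z",  # the required event-time feature
})
# With boto3, this record would be passed to the
# sagemaker-featurestore-runtime client's put_record call as Record=record.
```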

Relevant learning resources

Skill Builder Lab – Analyze and Prepare Data with Amazon SageMaker Data Wrangler and Amazon EMR – Hands-on lab exploring Amazon SageMaker Data Wrangler in depth.

Skill Builder AWS ML Engineer Associate 1.2 Transform Data – Course highlighting common preprocessing techniques for ML data and how to use them in tandem with Amazon SageMaker Data Wrangler and Amazon SageMaker Feature Store.

Workshop – Amazon SageMaker Ground Truth Immersion Day – A self-paced hands-on workshop going through how to use SageMaker Ground Truth for each target modality.

Model training

With your data properly prepared, it’s time to train your ML model. SageMaker AI provides many tools for running off-the-shelf ML algorithms or customized ML architectures and training in minutes, with the ability to scale as your data, model, and business grow.

Amazon SageMaker AI built-in ML algorithms – A collection of built-in, high-performance algorithms covering all common ML tasks. Integrated with SageMaker AI infrastructure, these algorithms are designed to help data scientists and ML practitioners train and deploy models quickly, without having to write the algorithm’s underlying code.

Amazon SageMaker Training jobs – A highly scalable, on-demand, managed infrastructure service designed to run containerized ML model training tasks. SageMaker Training jobs provide out-of-the-box support for training models in common ML frameworks, as well as custom code ML model training. Each training job tracks your specified hyperparameters, training metrics, and billable compute minutes, and it manages your resulting model artifact. The output model can integrate directly with other SageMaker AI services.
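To make the shape of a training job concrete, here is a sketch of the request structure that boto3’s create_training_job call accepts; the account IDs, bucket names, and image URI are placeholders, not working values:

```python
# Sketch of the request structure a SageMaker training job takes (the
# shape passed to boto3's sagemaker client create_training_job). The
# account IDs, role ARN, bucket names, and image URI are placeholders.
training_job_request = {
    "TrainingJobName": "demo-xgb-2024-01-15",
    "AlgorithmSpecification": {
        "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/xgboost:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::<account>:role/SageMakerExecutionRole",
    "InputDataConfig": [{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://<bucket>/train/",
        }},
    }],
    "OutputDataConfig": {"S3OutputPath": "s3://<bucket>/output/"},
    "ResourceConfig": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    "HyperParameters": {"max_depth": "6", "eta": "0.2"},  # values are strings
}
```

With real values filled in, this dict would be passed as keyword arguments to the sagemaker client’s create_training_job, and the resulting model artifact would land under the configured S3 output path.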

Amazon SageMaker HyperPod – A managed enterprise-scale cluster offering designed for training, fine-tuning, and inference of state-of-the-art large language models (LLMs). Each cluster is persistent and can scale to hundreds or thousands of AI accelerators, with built-in automatic cluster health check and repair, out-of-the-box recipes for training and fine-tuning techniques such as distillation and proximal policy optimization (PPO), and a highly customizable managed task queue and scheduler.

Relevant learning resources

Skill Builder – AWS SimuLearn: Model Training Using Amazon SageMaker Built-In Algorithms – This SimuLearn course provides an interactive environment to experience a typical ML use case from start to finish. It’s also tailored for ML engineers preparing for the AWS Certified Machine Learning Engineer – Associate (MLA-C01).

Skill Builder – Building Language Models on AWS – A 6-hour course going in depth on best practices for training language models on AWS using services including Amazon SageMaker HyperPod.

AWS Blog – Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage, Part 4: Training jobs – AWS Blog post showcasing different ways to track, benchmark, and optimize your SageMaker AI training jobs.

Optimizing

After training your model, the next step is optimization—making your ML model inference deployment high-performing and cost-effective. SageMaker AI offers a comprehensive suite of optimization tools that help you reduce inference latency, minimize costs, and maximize resource utilization across your entire ML pipeline.

Amazon SageMaker Inference Recommender – An automated load testing and tuning service that helps you find optimal deployment parameters for your ML model. SageMaker Inference Recommender runs load tests across different instance types and configurations, providing recommendations based on instance count, container parameters, model optimizations, maximum concurrency, and memory size.

Amazon SageMaker Savings Plans – Alternative pricing model that provides up to 64% cost savings on SageMaker AI compute usage in exchange for an hourly spend commitment over a 1- or 3-year term. Amazon SageMaker Savings Plans are designed for ML teams with consistent usage of SageMaker AI compute resources, and they automatically apply to all eligible SageMaker AI compute usage regardless of instance type or Region.
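A back-of-the-envelope comparison can help decide whether a Savings Plan pays off for your usage pattern; the rates below are illustrative placeholders, not actual AWS prices:

```python
# Back-of-the-envelope comparison of on-demand vs. Savings Plan pricing.
# All rates are illustrative placeholders, not actual AWS prices.
on_demand_rate = 1.00      # $/hour for some SageMaker instance
discount = 0.30            # hypothetical Savings Plan discount (plans offer up to 64%)
plan_rate = on_demand_rate * (1 - discount)

hours_per_month = 730
utilization = 0.80         # fraction of the month the instance actually runs

# On demand, you pay only for hours used; under this simplified model,
# a Savings Plan bills the committed discounted rate for every hour.
on_demand_cost = on_demand_rate * hours_per_month * utilization
plan_cost = plan_rate * hours_per_month

print(f"on-demand: ${on_demand_cost:.2f}/mo, savings plan: ${plan_cost:.2f}/mo")

# Under these assumptions the plan wins whenever utilization exceeds
# the discounted-rate ratio:
break_even_utilization = plan_rate / on_demand_rate
```

In this simplified model, the plan is cheaper whenever sustained utilization stays above the break-even ratio, which is why consistent workloads benefit most.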

Amazon SageMaker Neo – A model inference compiler that lets you train your ML model one time and deploy it on any supported edge device, mobile device, or cloud instance. SageMaker Neo takes in both your ML model and its target host specifications (OS, CPU architecture, and GPU or accelerator) and produces a compiled version of your model configured to run optimally on that host. SageMaker Neo integrates directly with AWS IoT Greengrass, and compiled models run on the Neo Deep Learning Runtime.

AWS Trainium and AWS Inferentia – ML accelerator instances from AWS, optimized for high-performance, cost-effective training (Trainium) and inference (Inferentia) of deep learning models. These custom silicon chips provide better price performance than traditional GPU instances for common deep learning tasks by exposing the NeuronCore architecture through the AWS Neuron SDK.

Relevant learning resources

Skill Builder – AWS ML Engineer Associate 4.2 Monitor and Optimize Infrastructure and Costs – Course highlighting how to maintain and optimize your existing ML infrastructure on AWS. This course covers Amazon SageMaker Inference Recommender, Amazon SageMaker Savings Plans, and other relevant resources.

Skill Builder – AWS AI Chips – Trainium and Inferentia Fundamentals – A brief course highlighting the use cases and value that AWS Trainium and AWS Inferentia chips provide.

AWS Blog – Achieving 1.85x higher performance for deep learning based object detection with an AWS Neuron compiled YOLOv4 model on AWS Inferentia – In-depth blog post showcasing the performance gains of these ML accelerators and how to compile your models to achieve them.

AWS Blog – Synadia builds next generation pill verification systems with AWS IoT and ML – Blog post showcasing the power of Amazon SageMaker Neo to run ML inference at the edge.

Model deployment and inference

After achieving optimal model performance, the final step is serving your model for inference in real-world applications. SageMaker AI provides multiple deployment options to match your specific inference use case requirements, from real-time predictions to batch processing.

Amazon SageMaker JumpStart – A hub providing pretrained ML models bundled with preconfigured settings to seamlessly tune and deploy on Amazon SageMaker AI or Amazon Bedrock infrastructure. Models on this hub cover areas such as large language models (LLMs), computer vision, and natural language processing (NLP), and come from both open source projects and proprietary models available in AWS Marketplace. Each model is deployable for inference through an end-to-end GUI workflow in Amazon SageMaker Studio, alongside an example Jupyter notebook highlighting how to load and use the model programmatically.

Amazon SageMaker AI endpoints – Fully managed service to host your ML models for continuous inference serving. SageMaker endpoints provide automatic scaling, model A/B testing capabilities, multi-model endpoints for cost optimization, and built-in monitoring of model performance. Endpoints also provide automatic load balancing and high availability across multiple Availability Zones, with deployment options for real-time, asynchronous, or serverless requirements.
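For A/B testing, an endpoint splits traffic across its production variants in proportion to their configured weights; the sketch below illustrates that proportional split (it is an illustration of the concept, not SageMaker’s actual routing code):

```python
# Sketch of weighted traffic splitting across endpoint production
# variants, in the spirit of SageMaker variant weights. This illustrates
# the proportional-split concept, not SageMaker's internal routing.
def variant_shares(variants: dict[str, float]) -> dict[str, float]:
    """Normalize variant weights into traffic fractions."""
    total = sum(variants.values())
    return {name: weight / total for name, weight in variants.items()}

# Two hypothetical model variants in a 90/10 A/B test
shares = variant_shares({"model-a": 9.0, "model-b": 1.0})
```

Because only the ratio matters, updating a single variant’s weight shifts traffic gradually without redeploying either model.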

Amazon SageMaker AI batch transform – Fully managed service to run ML inference jobs for large batches of data for non-real-time requirements. A single batch transform job runs inference one time on an entire dataset stored in Amazon S3, automatically managing the underlying compute and storing the inference output back to Amazon S3.
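Conceptually, batch transform with line-based splitting groups input records into payloads that stay under a size cap (analogous to SplitType="Line" with MaxPayloadInMB); the sketch below illustrates that grouping logic, not SageMaker’s actual implementation:

```python
# Sketch of line-based payload grouping, analogous to batch transform's
# SplitType="Line" with a MaxPayloadInMB cap. Illustrative only; this is
# not SageMaker's actual splitting implementation.
def chunk_lines(lines: list[str], max_bytes: int) -> list[list[str]]:
    batches, current, size = [], [], 0
    for line in lines:
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline
        if current and size + line_size > max_bytes:
            batches.append(current)   # flush the full payload
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        batches.append(current)
    return batches

# Three 10-character records with a 25-byte cap yield two payloads
batches = chunk_lines(["a" * 10, "b" * 10, "c" * 10], max_bytes=25)
```

Each resulting payload maps to one model invocation, which is why tuning the payload cap trades per-request overhead against memory pressure on the serving container.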

Relevant learning resources

Skill Builder – Amazon SageMaker JumpStart Foundations – Video-oriented course providing guidance and steps on how to search, deploy, and use ML models from Amazon SageMaker JumpStart.

Skill Builder – AWS SimuLearn: Model Deployment Using SageMaker – This SimuLearn course provides an interactive environment for running and evaluating Amazon SageMaker AI batch transform jobs. It’s also tailored for ML engineers preparing for the AWS Certified Machine Learning Engineer – Associate (MLA-C01).

YouTube – AWS Summit DC 2022 – Amazon SageMaker Inference explained: Which style is right for you? – This talk gives an excellent overview of the many ways you can host and provide your ML models for inference on Amazon SageMaker AI endpoints.

Certifications

The AWS Certified Data Engineer – Associate certification is crucial for mastering the data preparation and processing capabilities within SageMaker AI. It validates your ability to build and maintain the robust data pipelines necessary for successful ML projects, and to use SageMaker data preparation tools such as Data Wrangler and Feature Store effectively so your ML models have high-quality data for training and inference.

The AWS Certified Machine Learning Engineer – Associate certification serves as an essential milestone for practitioners working specifically with SageMaker AI. It validates your ability to build, train, tune, and deploy ML models using SageMaker built-in algorithms and custom frameworks. This certification demonstrates your proficiency in implementing end-to-end ML solutions while following AWS best practices for security, scalability, and cost optimization.

The AWS Certified Machine Learning – Specialty certification validates your expertise in designing, implementing, and maintaining ML solutions on AWS. This certification demonstrates your proficiency in selecting and justifying appropriate ML approaches and in using the comprehensive SageMaker toolkit for data engineering, model training, and deployment. With this certification, you’ll prove your ability to build intelligent solutions that drive business value through ML.

As you progress through these certifications, you’ll develop comprehensive expertise in SageMaker capabilities, from data preparation to model deployment and optimization. Remember that ML on AWS is an evolving field, so continuous learning and practical application of these skills are essential for long-term success.

Conclusion

Throughout our years as AWS solutions architects, we’ve witnessed companies and individual learners transform their ML capabilities by mastering these core Amazon SageMaker components. The journey we’ve outlined here isn’t just theoretical—it’s based on real-world implementations that have helped organizations succeed with ML on AWS.

Remember, you don’t need to tackle everything at once. Start with the development environment that suits your needs, experiment with data preparation tools, and gradually build your expertise with model training and optimization. The AWS Training resources we’ve shared for each milestone will give you hands-on experience with these tools in a structured environment. We encourage you to bookmark this guide and use it as your reference as you build your Amazon SageMaker expertise.

Your ML journey on AWS starts here, and we’re excited to see what you’ll build.

Author bios

Srividhya Pallay is a solutions architect II at Amazon Web Services (AWS) based in Seattle, where she supports small and medium-sized businesses (SMBs) and specializes in generative AI and games. With six AWS Certifications, including Machine Learning Specialty and Data Engineer Associate, she helps organizations harness the power of AWS for their AI/ML workloads. Srividhya holds a Bachelor of Science in Computational Data Science from Michigan State University College of Engineering, with a minor in Computer Science and Entrepreneurship.

Omri Gideoni is a solutions architect at Amazon Web Services (AWS) based in Seattle. He supports small and medium-sized businesses (SMBs) and he specializes in machine learning and MLOps. Omri helps customers achieve performance efficiency and operational excellence across various ML workflows on AWS.