AWS Machine Learning Blog
Category: TensorFlow on AWS
Training and deploying models using TensorFlow 2 with the Object Detection API on Amazon SageMaker
With the rapid growth of object detection techniques, several frameworks with packaged pre-trained models have been developed to provide users easy access to transfer learning. For example, GluonCV, Detectron2, and the TensorFlow Object Detection API are three popular computer vision frameworks with pre-trained models. In this post, we use Amazon SageMaker to build, train, and […]
AWS and NVIDIA achieve the fastest training times for Mask R-CNN and T5-3B
Note: At the AWS re:Invent Machine Learning Keynote, we announced performance records for T5-3B and Mask R-CNN. This blog post includes updated numbers with additional optimizations since the keynote aired live on 12/8. At re:Invent 2019, we demonstrated the fastest training times on the cloud for Mask R-CNN, a popular instance segmentation model, and BERT, a […]
Visualizing TensorFlow training jobs with TensorBoard
TensorBoard is an open-source toolkit for TensorFlow users that lets you visualize a wide range of useful information about your model, from model graphs; to loss, accuracy, or custom metrics; to embedding projections, images, and histograms of weights and biases. This post demonstrates how to use TensorBoard with Amazon SageMaker training jobs, write […]
Introducing Amazon SageMaker Components for Kubeflow Pipelines
Today we’re announcing Amazon SageMaker Components for Kubeflow Pipelines. This post shows how to build your first Kubeflow pipeline with Amazon SageMaker components using the Kubeflow Pipelines SDK. Kubeflow is a popular open-source machine learning (ML) toolkit for Kubernetes users who want to build custom ML pipelines. Kubeflow Pipelines is an add-on to Kubeflow that lets […]
Train ALBERT for natural language processing with TensorFlow on Amazon SageMaker
At re:Invent 2019, AWS shared the fastest training times on the cloud for two popular machine learning (ML) models: BERT (natural language processing) and Mask R-CNN (object detection). To train BERT in 1 hour, we efficiently scaled out to 2,048 NVIDIA V100 GPUs by improving the underlying infrastructure, network, and ML framework. Today, we’re open-sourcing the optimized training code for […]
Creating a complete TensorFlow 2 workflow in Amazon SageMaker
Managing the complete lifecycle of a deep learning project can be challenging, especially if you use multiple separate tools and services. For example, you may use different tools for data preprocessing, prototyping training and inference code, full-scale model training and tuning, model deployments, and workflow automation to orchestrate all of the above for production. Friction […]
Running distributed TensorFlow training with Amazon SageMaker
TensorFlow is an open-source machine learning (ML) library widely used to develop heavyweight deep neural networks (DNNs) that require distributed training using multiple GPUs across multiple hosts. Amazon SageMaker is a managed service that simplifies the ML workflow, starting with labeling data using active learning, hyperparameter tuning, distributed training of models, monitoring of training progression, […]
Launching TensorFlow distributed training easily with Horovod or Parameter Servers in Amazon SageMaker
Amazon SageMaker supports all the popular deep learning frameworks, including TensorFlow. Over 85% of TensorFlow projects in the cloud run on AWS. Many of these projects already run in Amazon SageMaker. This is due to the many conveniences Amazon SageMaker provides for TensorFlow model hosting and training, including fully managed distributed training with Horovod and […]