Computer Vision at Scale with Dask and PyTorch on Amazon EC2 Spot Instances

By Hugo Shi, Co-Founder and CTO at Saturn
By Stephanie Kirmer, Sr. Data Scientist at Saturn
By Kinnar Sen, EC2 Spot Solutions Architect at AWS
By Dan Kelly, GTM Specialist EC2 Spot at AWS

Saturn Cloud is a data science and machine learning (ML) platform for Python. Its multi-node, multi-GPU computing enables 100x faster workflows, reducing time to business value.

Saturn Cloud is compatible with the entire Python ecosystem and a variety of tools, which makes workflows customizable and the computing experience highly flexible for a range of setups and preferences. In a recent release, Saturn Cloud added Amazon EC2 Spot compatibility.

Amazon EC2 Spot Instances let you take advantage of unused Amazon Elastic Compute Cloud (Amazon EC2) capacity at up to a 90 percent discount compared to On-Demand pricing.

By combining Spot Instances with On-Demand, you can satisfy the performance requirements of many data science and ML workloads while optimizing price performance for your organization.

In this post, we’ll explain how Saturn Cloud makes this easy, using a deep learning workload as an example. We’ll walk through how to provision a cluster of On-Demand and Spot Instances to meet your required performance specifications, and we’ll demonstrate the strong price performance of this approach by benchmarking it against other techniques.

For this example, we’ll dive deep into three popular technologies used from Python:

PyTorch for deep learning.
CUDA, NVIDIA’s parallel computing platform, for executing code on GPU hardware.
Dask for parallelizing and distributing computations across a cluster of EC2 nodes.

Amazon EC2 Spot Instances are spare compute capacity in the Amazon Web Services (AWS) Cloud available to you at steep discounts compared to On-Demand instance prices. The only difference between an On-Demand Instance and a Spot Instance is that a Spot Instance can be interrupted by Amazon EC2, with a two-minute notification, when EC2 needs the capacity back.

Fortunately, there is a lot of spare capacity available, and handling interruptions to build resilient workloads with Spot Instances is straightforward. Spot Instances are ideal for stateless, fault-tolerant, loosely coupled, and flexible workloads that can handle interruptions.
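
As a rough sketch of what reacting to an interruption can look like: EC2 posts the interruption notice to the instance metadata service under spot/instance-action as a small JSON document containing an action and a time. The helper below only parses such a document and works out how much warning is left. The function name and the sample payload are illustrative assumptions, and fetching the document (including any metadata token handling) is left out.

```python
import json
from datetime import datetime, timezone

# Instance metadata path where EC2 posts a Spot interruption notice.
# It responds with HTTP 404 until an interruption has been scheduled.
INSTANCE_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def seconds_until_interruption(notice_json: str, now: datetime) -> float:
    """Return the seconds remaining before the action in a Spot notice takes effect."""
    notice = json.loads(notice_json)
    # The notice time is given in UTC, for example 2021-03-01T12:02:00Z.
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


# Hypothetical payload shaped like an interruption notice.
sample = '{"action": "terminate", "time": "2021-03-01T12:02:00Z"}'
now = datetime(2021, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
remaining = seconds_until_interruption(sample, now)
print(remaining)
```

A worker process could poll the metadata path periodically and use the remaining time to finish or hand back its current task.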

Saturn Cloud Architecture

Saturn Cloud runs as an application inside Kubernetes, leveraging AWS services such as Amazon EC2, AWS Identity and Access Management (IAM), and Amazon Virtual Private Cloud (VPC) to provide secure and scalable infrastructure for running data science and ML workloads within your AWS environment.

Saturn Cloud’s architecture allows users to connect to their AWS storage services, AWS real-time data sources, and AWS management tools. Users can authenticate Saturn Cloud projects with their IAM credentials, and then connect to various services through the CLI, REST API, or Python packages like boto3.


Figure 1 – Saturn Cloud is compatible with AWS tools and services.

Luckily for users, Saturn Cloud has architected Dask clusters for high fault tolerance, which enables seamless workflow continuity. If an EC2 Spot worker node in your cluster is interrupted, for example, a replacement node of the same instance type and size can spin up automatically when there’s availability using Auto Scaling Groups.

The centralized Task Scheduler is responsible for this capability, so that portion of the cluster is always provisioned using On-Demand resources.

Figure 2 – Highly available cluster computing.

This is especially valuable when applied to the most compute-intensive workloads, like computer vision and natural language processing (NLP), where cluster sizes can burst significantly to achieve your desired performance.

A key implication is that a fixed data science budget will go a lot further using Spot Instances on Saturn Cloud.

Optimizing Price Performance for Image Classification

Next, we’ll walk you through the critical steps for running image classification inference with the popular ResNet50 deep learning model on a GPU cluster. We’ll demonstrate that by executing this workload on a Dask cluster using Spot Instances, you can run 38x faster than a non-parallelized approach, at 95 percent lower cost.

We’ll provide a conceptual overview of the key steps in this post, but the full details with code can be found in this blog post: Computer Vision at Scale with Dask and PyTorch.

Step 1: Set Up a GPU Cluster on Saturn Cloud with Spot Instances Enabled

Spinning up a Dask cluster on Spot Instances is easy. Simply check the Spot Instance box in the Dask user interface (UI), and Saturn Cloud takes care of the rest.

To get started, we’ll need to make our image dataset available. A simple approach is to store data in Amazon Simple Storage Service (Amazon S3) and use the s3fs library to download it to a local filepath. To increase the computation efficiency, we’ll mirror all image files on all of the workers in our cluster.
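
A minimal sketch of the download step, assuming the images live under a single S3 prefix. The bucket name, prefix, and local directory are placeholders, and download_images and local_path_for are hypothetical helper names rather than code from the full article. It relies on s3fs.S3FileSystem with its find and get methods.

```python
import posixpath
from pathlib import Path


def local_path_for(s3_key: str, s3_prefix: str, local_root: str) -> Path:
    """Map an S3 object key under s3_prefix to a file path under local_root."""
    relative = posixpath.relpath(s3_key, s3_prefix)
    return Path(local_root) / relative


def download_images(s3_prefix: str, local_root: str) -> None:
    """Copy every object under s3_prefix to local_root on the calling machine."""
    import s3fs  # imported here so the function can be shipped to Dask workers

    fs = s3fs.S3FileSystem()  # picks up AWS credentials from the environment
    for key in fs.find(s3_prefix):
        target = local_path_for(key, s3_prefix, local_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        fs.get(key, str(target))


# Placeholder values, for illustration only.
example = local_path_for("my-bucket/images/dogs/a.jpg", "my-bucket/images", "/tmp/images")
print(example)
```

To mirror the files on every worker, one option is to run the same function across the cluster with a connected Dask client, for example client.run(download_images, "my-bucket/images", "/tmp/images").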

Next, we’ll check that the Jupyter instance and each of our worker nodes have GPU capabilities. To do this, we use PyTorch’s torch.cuda.is_available() function. If you’re not familiar with CUDA, it’s NVIDIA’s parallel computing platform and programming model for running general-purpose code on GPUs; PyTorch uses it under the hood.

Once completed, we’ll set the “device” to always be CUDA.
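
The GPU check and device selection might look roughly like the following. The helper names are hypothetical. The sketch assumes a connected dask.distributed Client named client, whose run method calls a function on every worker and returns a dictionary keyed by worker address.

```python
def gpu_available() -> bool:
    """Report whether CUDA is usable on whichever machine runs this function."""
    import torch  # imported lazily so the function can be sent to workers

    return torch.cuda.is_available()


def workers_without_gpu(results: dict) -> list:
    """Given {worker_address: bool}, list the workers that reported no GPU."""
    return sorted(address for address, has_gpu in results.items() if not has_gpu)


def cuda_device():
    """Return the CUDA device that tensors and models are moved to."""
    import torch

    return torch.device("cuda")


# Assumed usage with a connected Dask client:
#   results = client.run(gpu_available)
#   missing = workers_without_gpu(results)
# Hypothetical result from a two-worker cluster:
missing = workers_without_gpu({"tcp://10.0.0.1:8786": True, "tcp://10.0.0.2:8786": False})
print(missing)
```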

Step 2: Run Inference Tasks with PyTorch and Use Batch Processing to Accelerate Tasks

Now, we are ready to start doing some classification. To take full advantage of parallelization on the GPU cluster, we use the built-in PyTorch DataLoader class to load, transform, and batch our images. Each batch pairs the image data with its ground truth labels.
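
As an illustration only, a loader along these lines could be built with torchvision. The sketch assumes the images sit in one subfolder per class (the layout ImageFolder expects); the batch size, worker count, and function name are placeholder choices, not values from the full article. The resize, crop, and normalization values are the ones commonly used with ImageNet-pretrained models such as ResNet50.

```python
# Channel statistics commonly used to normalize inputs for ImageNet-pretrained models.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def build_loader(image_root: str, batch_size: int = 100, num_workers: int = 4):
    """Return a DataLoader that yields (image_tensor_batch, class_index_batch)."""
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    preprocess = transforms.Compose([
        transforms.Resize(256),        # shrink the shorter side to 256 pixels
        transforms.CenterCrop(224),    # crop to the 224x224 input ResNet50 expects
        transforms.ToTensor(),         # convert the PIL image to a float tensor
        transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])
    # ImageFolder derives each ground truth label from the image's folder name.
    dataset = datasets.ImageFolder(image_root, transform=preprocess)
    return DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
```

Iterating over the returned loader gives one batch at a time, so each batch can be moved to the GPU and passed through the model in a single call.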

When working with predictions and ground truths that are strings, and potentially variable, it’s handy to use regex to automatically compare them and check the model’s accuracy.
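
One possible way to do that comparison is sketched below: both strings are normalized with a regular expression, then the prediction is searched for as whole words inside the ground truth. The function names and the sample labels are made up for illustration.

```python
import re


def normalize(label: str) -> str:
    """Lowercase a label and collapse every run of non-letters into one space."""
    return re.sub(r"[^a-z]+", " ", label.lower()).strip()


def is_match(prediction: str, truth: str) -> bool:
    """True when the normalized prediction appears as whole words in the truth."""
    wanted = normalize(prediction)
    if not wanted:
        return False  # an empty prediction never counts as correct
    pattern = r"\b" + re.escape(wanted) + r"\b"
    return re.search(pattern, normalize(truth)) is not None


# Hypothetical label pairs.
print(is_match("Labrador_retriever", "labrador retriever, dog"))
print(is_match("tabby", "labrador retriever, dog"))
```

Normalizing first means differences in case, underscores, or punctuation do not count against the model.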

Step 3: Combine Each Function into a Single Function

Rather than patch together individual functions by hand, we’ll assemble them in one single function that will do this for us. We can then map this across all our batches of images across the cluster.

This part is hard to explain conceptually, but you can see the code snippets in the full article on the Saturn Cloud website: Put it All Together.
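
To give a feel for the shape of this step, here is a simplified sketch, not the code from the full article. It wraps a prediction step and a comparison step into one function per batch, then applies that function to every batch. The stand-in predict and compare functions exist only so the sketch runs without a model or a cluster.

```python
from typing import Callable, List, Tuple

Batch = Tuple[list, List[str]]  # (inputs, ground truth labels)


def make_pipeline(predict: Callable[[list], List[str]],
                  compare: Callable[[str, str], bool]) -> Callable[[Batch], list]:
    """Bundle the per-batch steps into one function that can be mapped over batches."""
    def run_batch(batch: Batch) -> list:
        inputs, truths = batch
        predictions = predict(inputs)
        # One (truth, prediction, correct) row per image in the batch.
        return [(t, p, compare(p, t)) for p, t in zip(predictions, truths)]
    return run_batch


# Stand-in steps, for illustration only.
def fake_predict(inputs: list) -> List[str]:
    return [str(x).lower() for x in inputs]


def exact(prediction: str, truth: str) -> bool:
    return prediction == truth


run_batch = make_pipeline(fake_predict, exact)
batches = [(["CAT", "DOG"], ["cat", "dog"]), (["BIRD"], ["fish"])]
results = [row for batch in batches for row in run_batch(batch)]
print(results)
```

On a Dask cluster, the final list comprehension would typically be replaced by futures = client.map(run_batch, batches) followed by client.gather(futures), so that each batch runs on a worker.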

Step 4: Runtime and Model Evaluation

Finally, we can set up our label set for ResNet so that we can interpret the predictions according to the classes they represent. We’ll first run preprocessing on the cluster, followed by our inference workflow, mapped across all the data batches.
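
Interpreting a prediction comes down to finding the highest-scoring class index and looking it up in the label set. A small plain-Python sketch with made-up labels and scores (a real ResNet label set has 1,000 ImageNet classes):

```python
def top_label(scores: list, labels: list) -> str:
    """Return the label whose score is highest."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best_index]


# Hypothetical three-class label set and model output.
labels = ["tabby", "beagle", "pug"]
scores = [0.1, 2.3, -0.4]
predicted = top_label(scores, labels)
print(predicted)
```

With PyTorch tensors, the same lookup is usually done for a whole batch at once by taking the argmax along the class dimension.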

When we evaluate the results for this example, we get the following:

Number of photos examined: 20,580
Number of photos classified correctly: 13,806
Percent of photos classified correctly: 67.085 percent
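
The percentage follows directly from the two counts above:

```python
examined = 20580
correct = 13806

# Share of photos classified correctly, as a percentage rounded to three decimals.
accuracy_pct = round(100 * correct / examined, 3)
print(f"{accuracy_pct} percent")
```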

Comparing Price Performance

In this example, Saturn Cloud has managed to classify over 20,000 images in roughly five minutes. Let’s see how different approaches to solving this problem compare in terms of price performance.

Figure 3 – Price performance by technique.

As you can see in the table below, our Spot GPU cluster performed 38x faster at just 5 percent of the cost of our single-node test.

This acceleration and price performance are made possible by shifting from single-node, serial processing to multi-node multiprocessing. This is especially valuable because it enables teams to iterate quickly and perform multiple runs at a fraction of the cost of the non-parallelized approach.

Technique	Purchasing option	Hardware setup	Runtime performance	Price*
Single-node, Single core	On-Demand	1 g4dn.4xlarge	3 hrs 21 min 13 sec	$1,467.30
GPU cluster, Multiprocessing	On-Demand + On-Demand	1 g4dn.4xlarge + 4 g4dn.8xlarge	5 min 15 sec	$109.50
GPU cluster, Multiprocessing	On-Demand + Spot	1 g4dn.4xlarge + 4 g4dn.8xlarge	5 min 15 sec	$73.00

* Price assumes each inference is run daily for 365 days. The GPU cluster that utilizes Spot assumes a discount of 50 percent for all worker nodes.
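
The headline figures can be reproduced from the table with a few lines of arithmetic. The helper name is just for illustration; the runtimes and prices are the ones listed above.

```python
def to_seconds(hours: int = 0, minutes: int = 0, seconds: int = 0) -> int:
    """Convert a runtime to seconds."""
    return hours * 3600 + minutes * 60 + seconds


single_node = to_seconds(hours=3, minutes=21, seconds=13)
cluster = to_seconds(minutes=5, seconds=15)

speedup = single_node / cluster                # Spot cluster vs. single node
cost_share = 73.00 / 1467.30                   # Spot cluster price vs. single node
spot_saving = 1 - 73.00 / 109.50               # Spot cluster vs. On-Demand cluster

print(f"speedup: {speedup:.1f}x")
print(f"cost relative to single node: {cost_share:.1%}")
print(f"saving vs. On-Demand cluster: {spot_saving:.1%}")
```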

Summary

Saturn Cloud’s Dask cluster architecture was designed for fault tolerance and optimized compute. This means Saturn Cloud users can enjoy the cost savings associated with Amazon EC2 Spot Instances while accelerating runtime performance.

With robust price performance on the most compute-intensive workloads, like computer vision and natural language processing, Saturn Cloud users can ultimately reach business value sooner, at a fraction of the cost.

If you’d like to try this yourself, start your free trial of Saturn Cloud today!


Saturn Cloud – AWS Partner Spotlight

Saturn Cloud is an AWS Partner and data science and machine learning platform for Python. Its multi-node, multi-GPU computing enables 100x faster workflows, reducing the time to business value.
