Amazon EC2 DL1 Instances

Low cost-to-train for deep learning models

Why Amazon EC2 DL1 Instances?

Amazon EC2 DL1 instances, powered by Gaudi accelerators from Habana Labs (an Intel company), deliver low cost-to-train for deep learning models in natural language processing, object detection, and image recognition use cases. DL1 instances provide up to 40% better price performance for training deep learning models compared to current-generation GPU-based EC2 instances.

Amazon EC2 DL1 instances feature 8 Gaudi accelerators with 32 GiB of high bandwidth memory (HBM) per accelerator, 768 GiB of system memory, custom 2nd generation Intel Xeon Scalable processors, 400 Gbps of networking throughput, and 4 TB of local NVMe storage.

DL1 instances include the Habana SynapseAI® SDK, which is integrated with leading machine learning frameworks such as TensorFlow and PyTorch.

You can get started easily on DL1 instances using AWS Deep Learning AMIs or AWS Deep Learning Containers, or Amazon EKS and Amazon ECS for containerized applications. Support for DL1 instances in Amazon SageMaker is coming soon.
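As a minimal sketch, one way to launch a DL1 instance programmatically is with the AWS SDK for Python (boto3). The AMI ID, key pair, and region below are placeholders; you would substitute a Habana-enabled AWS Deep Learning AMI and values from your own account.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

    response = ec2.run_instances(
        ImageId="ami-xxxxxxxxxxxxxxxxx",   # placeholder: a Habana-enabled AWS Deep Learning AMI
        InstanceType="dl1.24xlarge",
        KeyName="my-key-pair",             # placeholder: your EC2 key pair
        MinCount=1,
        MaxCount=1,
    )
    print(response["Instances"][0]["InstanceId"])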

New Amazon EC2 DL1 instances overview video

Benefits

DL1 instances deliver up to 40% better price performance for training deep learning models compared to our latest GPU-based EC2 instances. These instances feature Gaudi accelerators that are purpose-built for training deep learning models. You can reduce your training costs further by using EC2 Instance Savings Plans.

Developers at all levels of expertise can get started easily on DL1 instances, and can keep their existing workflow management services by using AWS Deep Learning AMIs and AWS Deep Learning Containers. Advanced users can also build custom kernels to optimize model performance using Gaudi’s programmable Tensor Processing Cores (TPCs). Using Habana SynapseAI® tools, they can seamlessly migrate their existing models running on GPU- or CPU-based instances to DL1 instances with minimal code changes.

DL1 instances support leading ML frameworks such as TensorFlow and PyTorch, enabling you to continue using your preferred ML workflows. You can access optimized models such as Mask R-CNN for object detection and BERT for natural language processing on Habana’s GitHub repository to quickly build, train, and deploy your models. SynapseAI’s rich Tensor Processing Core (TPC) kernel library supports a wide variety of operators and multiple data types for a range of model and performance needs.
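To illustrate what minimal code changes typically look like, here is a small PyTorch sketch. It assumes the Habana PyTorch bridge (the habana_frameworks package from the SynapseAI SDK) is installed, as it is on the Habana-enabled AWS Deep Learning AMI; the model and data are toy placeholders. The main difference from a GPU script is targeting the "hpu" device instead of "cuda" and marking execution steps.

    import torch
    import habana_frameworks.torch.core as htcore  # Habana PyTorch bridge from the SynapseAI SDK

    device = torch.device("hpu")                   # "hpu" replaces "cuda"

    model = torch.nn.Linear(128, 10).to(device)    # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    inputs = torch.randn(64, 128).to(device)       # placeholder batch
    labels = torch.randint(0, 10, (64,)).to(device)

    for _ in range(10):
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        loss.backward()
        htcore.mark_step()                          # flush the accumulated graph to the accelerator
        optimizer.step()
        htcore.mark_step()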

Features

DL1 instances are powered by Gaudi accelerators from Habana Labs (an Intel company), which feature eight fully programmable TPCs and 32 GiB of high bandwidth memory per accelerator. They have a heterogeneous compute architecture to maximize training efficiency and a configurable centralized engine for matrix-math operations. They also have the industry’s only native integration of ten 100 Gigabit Ethernet ports on every Gaudi accelerator for low-latency communication between accelerators.

The SynapseAI® SDK consists of a graph compiler and runtime, a TPC kernel library, firmware, drivers, and tools. It is integrated with leading frameworks such as TensorFlow and PyTorch. Its communication libraries help you rapidly scale to multiple accelerators using the same operations you use on GPU-based instances today. This deterministic scaling results in higher utilization and increased efficiency across a variety of neural network topologies. Using SynapseAI® tools, you can seamlessly migrate your existing models to DL1 instances and run them with minimal code changes.
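As an illustration of scaling across the eight Gaudi accelerators in a dl1.24xlarge with standard framework operations, the sketch below uses PyTorch DistributedDataParallel with Habana's HCCL collective-communication backend. It assumes the SynapseAI PyTorch bridge is installed and that one process is launched per accelerator (for example with torchrun --nproc_per_node=8).

    import torch
    import torch.distributed as dist
    import habana_frameworks.torch.distributed.hccl  # registers Habana's "hccl" backend

    dist.init_process_group(backend="hccl")          # HCCL plays the role NCCL plays on GPU instances
    device = torch.device("hpu")

    model = torch.nn.Linear(128, 10).to(device)      # placeholder model
    model = torch.nn.parallel.DistributedDataParallel(model)
    # ...the training loop itself is unchanged from the single-accelerator sketch above.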

DL1 instances offer 400 Gbps of networking throughput and support for Elastic Fabric Adapter (EFA) and Elastic Network Adapter (ENA) for applications that need access to high-speed networking. For fast access to large datasets, DL1 instances also include 4 TB of local NVMe storage and deliver 8 GB/s of read throughput.

DL1 instances are built on the AWS Nitro System, which is a rich collection of building blocks that offloads many of the traditional virtualization functions to dedicated hardware and software to deliver high performance, high availability, and high security while also reducing virtualization overhead.

Product details

Instance Size: dl1.24xlarge
vCPU: 96
Instance Memory (GiB): 768
Gaudi Accelerators: 8
Network Bandwidth (Gbps): 400
Accelerator Peer-to-Peer Bidirectional (Gbps): 100
Instance Storage: 4 x 1000 GB NVMe SSD
EBS Bandwidth (Gbps): 19
On-Demand Price/hr: $13.11
1-yr Reserved Instance Effective Hourly*: $7.87
3-yr Reserved Instance Effective Hourly*: $5.24

*Prices shown are for US East (N. Virginia) and US West (Oregon) regions.

Customer and Partner testimonials

Here are some examples of how customers and partners have achieved their business goals with Amazon EC2 DL1 instances.

  • Seagate

    Seagate Technology has been a global leader offering data storage and management solutions for over 40 years. Seagate’s data science and machine learning engineers have built an advanced deep learning (DL) defect detection system and deployed it globally across the company’s manufacturing facilities. In a recent proof of concept project, Habana Gaudi exceeded the performance targets for training one of the DL semantic segmentation models currently used in Seagate’s production. 

    “We expect the significant price performance advantage of Amazon EC2 DL1 instances, powered by Habana Gaudi accelerators, could make a compelling future addition to AWS compute clusters. As Habana Labs continues to evolve and enables broader coverage of operators, there is potential for expanding to additional enterprise use cases, and thereby harnessing additional cost savings.”

    Darrell Louder, Senior Engineering Director of Operations, Technology and Advanced Analytics - Seagate
  • Leidos

    Leidos is recognized as a Top 10 Health IT provider delivering a broad range of customizable, scalable solutions to hospitals and health systems, biomedical organizations, and every U.S. federal agency focused on health. 

    “One of the numerous technologies we are enabling to advance healthcare today is the use of machine learning and deep learning for disease diagnosis based on medical imaging data. Our massive data sets require timely and efficient training to aid researchers seeking to solve some of the most urgent medical mysteries. Given Leidos's and its customers' need for quick, easy, and cost-effective training for deep learning models, we are excited to have begun this journey with Intel and AWS to use Amazon EC2 DL1 instances based on Habana Gaudi AI processors. Using DL1 instances, we expect an increase in model training speed and efficiency, with a subsequent reduction in risk and cost of research and development.”

    Chetan Paul, CTO Health and Human Services - Leidos
  • Intel

    Intel has created 3D Athlete Tracking technology that analyzes athlete-in-action video in real time to inform performance training processes and enhance audience experiences during competitions.

    “Training our models on Amazon EC2 DL1 instances, powered by Gaudi accelerators from Habana Labs, will enable us to accurately and reliably process thousands of videos and generate associated performance data, while lowering training cost. With DL1 instances, we can now train at the speed and cost required to productively serve athletes, teams, and broadcasters of all levels across a variety of sports.”

    Rick Echevarria, Vice President, Sales and Marketing Group - Intel
  • RiskFuel

    RiskFuel provides real-time valuations and risk sensitivities to companies managing financial portfolios, helping them increase trading accuracy and performance.

    “Two factors drew us to Amazon EC2 DL1 instances based on Habana Gaudi AI accelerators. First, we want to make sure our banking and insurance clients can run Riskfuel models that take advantage of the newest hardware. Fortunately, we found migrating our models to DL1 instances to be simple and straightforward – really, it was just a matter of changing a few lines of code. Second, training costs are a big component of our spending, and the promise of up to 40% improvement in price performance offers potentially substantial benefit to our bottom line.”

    Ryan Ferguson, CEO - Riskfuel
  • Fractal

    Fractal is a global leader in artificial intelligence and analytics, powering decisions in Fortune 500 companies.

    “AI and deep learning are at the core of our Machine Vision capability, enabling customers to make better decisions across industries we serve. In order to improve accuracy, data sets are becoming larger and more complex, requiring larger and more complex models. This is driving the need for improved compute price performance. The new Amazon EC2 DL1 instances promise significantly lower cost training than GPU-based EC2 instances. We expect this to make training of AI models on cloud much more cost competitive and accessible than before for a broad array of clients.”

    Srikanth Velamakanni, Group CEO - Fractal