AWS Neuron introduces Flash Attention kernel enabling high performance and large sequence lengths
Today, AWS announces the release of Neuron 2.19, which introduces a Flash Attention kernel to enable performant LLM training and inference with large sequence lengths.
AWS Neuron is the SDK for AWS Inferentia- and Trainium-based instances, which are purpose-built for generative AI. Neuron integrates with popular ML frameworks like PyTorch and includes a compiler, runtime, tools, and libraries to support high-performance training and inference of AI models on Trn1 and Inf2 instances.
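As a minimal sketch of that workflow (not taken from this announcement), a PyTorch model can be compiled ahead of time for NeuronCores with torch_neuronx.trace and then invoked like a regular TorchScript module; the toy model and input shapes below are illustrative assumptions:

    import torch
    import torch_neuronx
    from torch import nn

    # Illustrative model and example input; any traceable PyTorch model works similarly.
    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
    example_input = torch.rand(1, 128)

    # Compile the model ahead of time for NeuronCores on a Trn1 or Inf2 instance.
    neuron_model = torch_neuronx.trace(model, example_input)

    # Run inference on the Neuron device and save the compiled artifact for reuse.
    output = neuron_model(example_input)
    torch.jit.save(neuron_model, "model_neuron.pt")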
This release adds new features and performance improvements for both training and inference, along with new Ubuntu 22 Neuron DLAMIs for PyTorch 2.1 and PyTorch 1.13. For training, Neuron 2.19 adds the Flash Attention kernel to enable large sequence lengths (8K and above), Llama3 model training, and interleaved pipeline parallelism to improve training efficiency and resource utilization. For inference, the release adds Flash Attention kernel support to enable LLM inference with context lengths of up to 32K, adds Llama3 model inference, and adds beta support for continuous batching with Mistral-7B-v0.2 models. Neuron 2.19 also introduces new tools: the Neuron Node Problem Detector and Recovery plugin in EKS and Neuron Monitor for EKS, which enable enhanced Neuron metrics monitoring in Kubernetes.
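As a hedged sketch of long-context LLM inference with the transformers-neuronx library (the checkpoint name, tensor-parallel degree, and sequence-length settings below are assumptions for illustration, and exact parameter names can vary between Neuron releases):

    from transformers import AutoTokenizer
    from transformers_neuronx.llama.model import LlamaForSampling

    model_name = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint for illustration
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # n_positions sets the maximum sequence length the compiled model supports.
    neuron_model = LlamaForSampling.from_pretrained(
        model_name, batch_size=1, tp_degree=8, amp="bf16", n_positions=8192
    )
    neuron_model.to_neuron()  # compile the graph and load weights onto NeuronCores

    input_ids = tokenizer("A very long prompt ...", return_tensors="pt").input_ids
    generated = neuron_model.sample(input_ids, sequence_length=8192, top_k=50)
    print(tokenizer.decode(generated[0]))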
You can use the AWS Neuron SDK to train and deploy models on Trn1 and Inf2 instances, available in AWS Regions as On-Demand Instances, Reserved Instances, Spot Instances, or as part of Savings Plans.
For a full list of new features and enhancements in Neuron 2.19, visit the Neuron Release Notes. To get started with Neuron, see:
AWS Neuron
Inf2 Instances
Trn1 Instances