AWS Machine Learning Blog

How to extend the functionality of AWS Trainium with custom operators

Deep learning (DL) is a fast-evolving field, and practitioners are constantly innovating DL models and inventing ways to speed them up. Custom operators are one of the mechanisms developers use to push the boundaries of DL innovation by extending the functionality of existing machine learning (ML) frameworks such as PyTorch. In general, an operator describes the mathematical function of a layer in a deep learning model. A custom operator allows developers to build their own mathematical functions for a layer in the deep learning model.

AWS Trainium and AWS Inferentia2, which are purpose built for DL training and inference, extend their functionality and performance by supporting custom operators (or CustomOps, for short). AWS Neuron, the SDK that supports these accelerators, uses the standard PyTorch interface for CustomOps. Developers can easily get started with their existing code when using Trainium-based Amazon EC2 Trn1 instances or Inferentia2-based Amazon EC2 Inf2 instances. In this post, we cover the benefits of CustomOps, their efficient implementation on Trainium, and examples to get you started with CustomOps on Trainium-powered Trn1 instances.

To follow along, familiarity with core AWS services such as Amazon Elastic Compute Cloud (Amazon EC2) is implied, and basic familiarity with deep learning, PyTorch, and C++ would be helpful.

Custom operators in PyTorch and their benefits

CustomOps for PyTorch originated in version 1.10, called PyTorch C++ Frontend, and provided an easy-to-use mechanism to register CustomOps written in C++. The following are some of the benefits that CustomOps provide:

  • Performance optimization – CustomOps can be optimized for specific use cases, leading to faster model runs and improved performance.
  • Improved model expressiveness – With CustomOps, you can express complex computations that aren’t easily expressible using the built-in operators provided by PyTorch.
  • Increased modularity – You can use CustomOps as building blocks to create more complex models by creating C++ libraries of reusable components. This makes the development process easier and more modular, and facilitates rapid experimentation.
  • Increased flexibility – CustomOps enables operations beyond the built-in operators—that is, they provide a flexible way to define complex operations that aren’t implemented using the standard ones.

Trainium support for custom operators

Trainium (and AWS Inferentia2) supports CustomOps in software through the Neuron SDK and accelerates them in hardware using the GPSIMD engine (General Purpose Single Instruction Multiple Data engine). Let’s look at how these enable efficient CustomOps implementation and provide increased flexibility and performance when developing and innovating DL models.

Neuron SDK

The Neuron SDK helps developers train models on Trainium and deploy models on the AWS Inferentia accelerators. It integrates natively with frameworks, such as PyTorch and TensorFlow, so you can continue using your existing workflows and application code to train models on Trn1 instances.

The Neuron SDK uses the standard PyTorch interface for CustomOps. Developers can use the standard programming interface in PyTorch to write CustomOps in C++ and extend Neuron’s official operator support. Neuron then compiles these CustomOps to run efficiently on the GPSIMD engine, which is described in more detail in the following section. This makes it easy to implement new experimental CustomOps and accelerate them on purpose-built hardware, without any intimate knowledge of this underlying hardware.

General Purpose Single Instruction Multiple Data engine

At the core of Trainium optimizations resides the NeuronCore architecture, a fully independent, heterogeneous compute-unit with four main engines: tensor, vector, scalar, and the GPSIMD engine. The scalar and vector engines are highly parallelized and optimized for floating-point operations. The tensor engine is based on a power-optimized, systolic-array supporting mixed precision computation.

The GPSIMD engine is a general-purpose Single Instruction Multiple Data (SIMD) engine designed for running and accelerating CustomOps. This engine consists of eight fully programmable 512-bit wide general-purpose processors, which can run straight-line C-code and have direct inline access to the other NeuronCore-v2 engines, as well as the embedded SRAM and HBM memories. Together, these capabilities help run CustomOps efficiently on Trainium.

Take for example operators such as TopK, LayerNorm, or ZeroCompression, which read data from memory and only use it for a minimal number of ALU calculations. Regular CPU systems are completely memory bound for these calculations, and performance is limited by the time required to move the data into the CPU. In Trainium, the GP-SIMD engines are tightly coupled with the on-chip caches using a high bandwidth streaming interface, which can sustain 2 TB/sec of memory bandwidth. Therefore, CustomOps like these can be run really fast on Trainium.

Neuron SDK custom operators in practice

For this post, we assume a DLAMI (refer to instructions for either Ubuntu or Amazon Linux) is being used to instantiate an EC2 Trn1 instance (either 2x.large or 32x.large). Note all necessary software, drivers, and tools have already been installed on the DLAMIs, and only the activation of the Python environment is needed to start working with the tutorial. We reference the CustomOps functionality available in Neuron as “Neuron CustomOps.”

Similar to the process of PyTorch integration with C++ code, Neuron CustomOps requires a C++ implementation of an operator via a NeuronCore-ported subset of the Torch C++ API . The C++ implementation of the operator is called the kernel function, and the port of the C++ API contains everything required for CustomOps development and model integration, specifically tensor and scalar classes in c10 (a namespace used for low-level C++ code across different PyTorch libraries), and a subset of ATen operators (or Automatic Tensor, the C++ library that provides the core tensor operations used in PyTorch).

The torch.h header needs to be included when defining the kernel for you to have access to a NeuronCore-ported subset of the Pytorch C++ API:

#include <torch/torch.h>

Neuron CustomOps also require a shape function. The shape function has the same function signature as the kernel function, but doesn’t perform any computations. It only defines the shape of the output tensor but not the actual values.

Neuron CustomOps are grouped into libraries, and macros are used to register them with the NEURON_LIBRARY scope from within the shape function. The function will be run on the host at compilation time and will require the register.h header from the torchneuron library:

#include "torchneuron/register.h"

Finally, the custom library is built by calling the load API. If supplying the build_directory parameter, the library file will be stored in the indicated directory:

import torch_neuronx
from torch_neuronx.xla_impl import custom_op

name=name,# this is the name for the library(i.e, 'relu')

To use the CustomOp from a PyTorch model, simply load the library by calling the load_library API and call the Neuron CustomOp in the same manner that CustomOps are called in PyTorch via the torch.ops namespace. The format is usually torch.ops.<library_name>.<operator_name>. See the following code:

import torch
import torch_neuronx
from torch_neuronx.xla_impl import custom_op

out_tensor = torch.ops.my_lib.my_op(in_tensor)

Note that the custom_op.load API builds the C++ library, whereas the custom_op.load_library API loads an already-built library file.

Example: Neuron CustomOps in MLP training

To get started, perform the following steps:

  1. Create and launch your EC2 Trn1 instance. Be sure that you use a DLAMI image (either Ubuntu or Amazon Linux, pre-installed with all necessary Neuron software) and that you have specified a root volume size of 512 GB.
  2. After your instance is up and running, SSH to your instance.
  3. Install PyTorch Neuron (torch-neuronx) on your running Trn1 instance. For instructions, refer to Neuron Custom C++ Operators in MLP Training.
  4. Download the sample code from the GitHub repository.

Now that your environment is set up, continue through this post as we describe the implementation of a typical C++ CustomOp in Neuron in the form of Relu forward and backward functions to be used on a simple multilayer perceptron (MLP) model. The steps are described in the AWS Neuron Documentation.

The example code from the repository shows two folders:

  • ./customop_mlp/PyTorch – Contains the Relu code that will be compiled for a CPU
  • ./customop_mlp/neuron – Contains the Relu code that will be compiled for Trainium

Develop a Neuron CustomOp: The kernel function

The host or dev environment for the development of the kernel function (the Neuron CustomOp) can run PyTorch 1.13 and a C++17 compatible compiler in a Linux environment. This is the same as developing any C++ function for PyTorch, and the only libraries that need to be present in the development environment are those for PyTorch and C++. In the following example, we create a relu.cpp file with the custom Relu forward and backward functions:

#include <stdint.h>
#include <stdlib.h>
#include <torch/torch.h>

torch::Tensor relu_forward(const torch::Tensor& t_in) {
torch::Tensor t_out = torch::zeros(t_in.sizes(), torch::kFloat);
auto t_in_acc = t_in.accessor<float, 2>();
auto t_out_acc = t_out.accessor<float, 2>();
auto shape = t_in.sizes();
for (int i = 0; i < shape[0]; i++) {
	for (int j = 0; j < shape[1]; j++) {
		t_out_acc[i][j] = t_in_acc[i][j] > 0.0 ? t_in_acc[i][j] : 0.0;
return t_out;

torch::Tensor relu_backward(const torch::Tensor& t_grad, const torch::Tensor& t_in) 
torch::Tensor t_out = torch::zeros(t_in.sizes(), torch::kFloat);
auto t_in_acc = t_in.accessor<float, 2>();
auto t_grad_acc = t_grad.accessor<float, 2>();
auto t_out_acc = t_out.accessor<float, 2>();
auto shape = t_in.sizes();
for (int i = 0; i < shape[0]; i++) {
	for (int j = 0; j < shape[1]; j++) {
		t_out_acc[i][j] = t_in_acc[i][j] > 0.0 ? t_grad_acc[i][j] : 0.0;
return t_out;

When developing a Neuron CustomOp for Neuron, make sure you take into account the currently supported features and APIs. For more information, refer to Custom Operators API Reference Guide [Experimental].

Build and register the Neuron CustomOp: The shape function

The build for the Neuron CustomOp and runtime environment is the Trn1 instance where the training will take place, and the Neuron CustomOp will be compiled and registered as a neuronx-cc library and interpreted by the Neuron runtime to run on the highly optimized GP-SIMD engine.

To build and register the Neuron CustomOp, we need to create a shape function (shape.cpp) that will define the input and output tensors and register the operators: the relu_fwd_shape and relu_bwd_shape functions. See the following code:

#include <stdint.h>
#include <stdlib.h>
#include <torch/torch.h>
#include "torchneuron/register.h"

torch::Tensor relu_fwd_shape(torch::Tensor t_in) {
torch::Tensor t_out = torch::zeros(t_in.sizes(), torch::kFloat);
return t_out;

torch::Tensor relu_bwd_shape(torch::Tensor t_grad, torch::Tensor t_in) {
torch::Tensor t_out = torch::zeros(t_in.sizes(), torch::kFloat);
return t_out;

NEURON_LIBRARY(my_ops, m) {
m.def("relu_forward", &relu_fwd_shape, "relu_forward");
m.def("relu_backward", &relu_bwd_shape, "relu_backward");

The relu_fwd_shape and relu_bwd_shape functions define the shape of the output tensor (to be the same size as the input tensor). Then we register the functions in the NEURON_LIBRARY scope.

In the ./customop_ml/neuron repository example, we have a script to run the build and registration of the CustomOp, by simply calling the load function from the torch_neuronx.xla_impl package:

import os
import torch_neuronx
from torch_neuronx.xla_impl import custom_op


In the build_directory, we should find the library ready to be loaded and used in training our model.

Build the MLP model with the Neuron CustomOp

In this section, we go through the steps to build the MLP model with the Neuron CustomOp.

Define the Relu class

For a detailed explanation of how to train an MLP model, refer to Multi-Layer Perceptron Training Tutorial.

After we build the CustomOp, we create a Python package called, where we define a Relu PyTorch class, inheriting from the torch autograd function. The autograd function implements automatic differentiation, so that it can be used in a training loop.

First we load the library, then we define the new class with the forward and backward functions defined with static method decorators. In this way, the methods can be called directly when we define the model. See the following code:

import torch
import torch_neuronx
from torch_neuronx.xla_impl import custom_op


class Relu(torch.autograd.Function):
def forward(ctx, input):
return torch.ops.my_ops.relu_forward(input)

def backward(ctx, grad):
input, = ctx.saved_tensors
return torch.ops.my_ops.relu_backward(grad, input), None

Examine the MLP model

Now we’re ready to write our multilayer perceptron model with our Neuron CustomOp by importing the my_ops package where we have defined the Relu class:

import torch
import torch.nn as nn
from torch.nn import functional as F
import my_ops

# Declare 3-layer MLP for MNIST dataset
class MLP(nn.Module):
def __init__(self, input_size = 28 * 28, output_size = 10, layers = [120, 84]):
super(MLP, self).__init__()
self.fc1 = nn.Linear(input_size, layers[0])
self.fc2 = nn.Linear(layers[0], layers[1])
self.fc3 = nn.Linear(layers[1], output_size)

def forward(self, x):
f1 = self.fc1(x)
r1 = my_ops.Relu.apply(f1)
f2 = self.fc2(r1)
r2 = my_ops.Relu.apply(f2)
f3 = self.fc3(r2)
return torch.log_softmax(f3, dim=1)

Run the training script

Now we can train our model by using the provided script:

import os
import time
import torch
from model import MLP

from torchvision.datasets import mnist
from import DataLoader
from torchvision.transforms import ToTensor

# XLA imports
import torch_xla.core.xla_model as xm

# Global constants

# Load MNIST train dataset
train_dataset = mnist.MNIST(root='./MNIST_DATA_train',
train=True, download=True, transform=ToTensor())

def main():
# Prepare data loader
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE)

# Fix the random number generator seeds for reproducibility

# XLA: Specify XLA device (defaults to a NeuronCore on Trn1 instance)
device = 'xla'

# Move model to device and declare optimizer and loss function
model = MLP().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.NLLLoss()

# Run the training loop
print('----------Training ---------------')
for epoch in range(EPOCHS):
start = time.time()
for idx, (train_x, train_label) in enumerate(train_loader):
train_x = train_x.view(train_x.size(0), -1)
train_x =
train_label =
output = model(train_x)
loss = loss_fn(output, train_label)
xm.mark_step() # XLA: collect ops and run them in XLA runtime
if idx < WARMUP_STEPS: # skip warmup iterations
start = time.time()
# Compute statistics for the last epoch
interval = idx - WARMUP_STEPS # skip warmup iterations
throughput = interval / (time.time() - start)
print("Train throughput (iter/sec): {}".format(throughput))
print("Final loss is {:0.4f}".format(loss.detach().to('cpu')))

# Save checkpoint for evaluation
os.makedirs("checkpoints", exist_ok=True)
checkpoint = {'state_dict': model.state_dict()}
# XLA: use instead of to ensure states are moved back to cpu
# This can prevent "XRT memory handle not found" at end of execution,'checkpoints/')

print('----------End Training ---------------')

if __name__ == '__main__':

By sending the model to the xla device, the model and Relu custom operator are compiled to be run by the Neuron runtime using the optimized Trainium hardware.

In this example, we showed how to create a custom Relu operator that takes advantage of the hardware engine (GP-SIMD) available on the Trainium ML accelerator chip. The result is a trained PyTorch model that can now be deployed for inferencing.


Modern state-of-the-art model architectures require an increasing number of resources from engineering staff (data scientists, ML engineers, MLOps engineers, and others) to actual infrastructure including storage, compute, memory, and accelerators. These requirements increase the cost and complexity of developing and deploying deep learning models. Trainium accelerators deliver a high-performance, low-cost solution for DL training in the cloud. The use of Trainium is facilitated by the Neuron SDK, which includes a deep learning compiler, runtime, and tools that are natively integrated into popular frameworks such as PyTorch and TensorFlow. (Note that at the time of writing, the Neuron SDK 2.9 only supports PyTorch for the development of custom operators.)

As demonstrated in this post, Trainium not only provides the means to train your models performantly and efficiently, but also offers the ability to customize your operators to add flexibility and expressiveness to both training and experimentation.

For more information, refer to the GitHub repo.

About the Authors

Lorea Arrizabalaga is a Solutions Architect aligned to the UK Public Sector, where she helps customers design ML solutions with Amazon SageMaker. She is also part of the Technical Field Community dedicated to hardware acceleration and helps with testing and benchmarking AWS Inferentia and AWS Trainium workloads.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.

Ashley Miller is a Senior AI/ML Evangelist at Amazon Web Services where he works closely with public sector partners developing AI/ML solutions on AWS.