How Sportradar used the Deep Java Library to build production-scale ML platforms for increased performance and efficiency
This is a guest post co-written with Fred Wu from Sportradar.
Sportradar is the world’s leading sports technology company, at the intersection of sports, media, and betting. More than 1,700 sports federations, media outlets, betting operators, and consumer platforms across 120 countries rely on Sportradar know-how and technology to boost their business.
Sportradar uses data and technology to:
- Keep betting operators ahead of the curve with the products and services they need to manage their sportsbook
- Give media companies the tools to engage more with fans
- Give teams, leagues, and federations the data they need to thrive
- Keep the industry clean by detecting and preventing fraud, doping, and match fixing
This post demonstrates how Sportradar used Amazon’s Deep Java Library (DJL) on AWS alongside Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon Simple Storage Service (Amazon S3) to build a production-ready machine learning (ML) inference solution that preserves essential tooling in Java, optimizes operational efficiency, and increases the team’s productivity by providing better performance and accessibility to logs and system metrics.
The DJL is a deep learning framework built from the ground up to support users of Java and JVM languages like Scala, Kotlin, and Clojure. Most deep learning frameworks today are built for Python, which neglects the large number of Java developers, including those with existing Java code bases into which they want to integrate the increasingly powerful capabilities of deep learning. With the DJL, that integration is simple.
In this post, the Sportradar team discusses the challenges they encountered and the solutions they created to build their model inference platform using the DJL.
We are the US squad of the Sportradar AI department. Since 2018, our team has been developing a variety of ML models to enable betting products for NFL and NCAA football. We recently developed four new models.
The fourth down decision models for the NFL and NCAA predict the probabilities of the outcome of a fourth down play. A play outcome could be a field goal attempt, play, or punt.
The drive outcome models for the NFL and NCAA predict the probabilities of the outcome of the current drive. A drive outcome could be an end of half, field goal attempt, touchdown, turnover, turnover on downs, or punt.
Our models are the building blocks of other models, from which we generate a list of live betting markets, including spread, total, win probability, next score type, next team to score, and more.
The business requirements for our models are as follows:
- The model predictor should be able to load the pre-trained model file one time, then make predictions on many plays
- We have to generate the probabilities for each play with a latency under 50 milliseconds
- The model predictor (feature extraction and model inference) has to be written in Java, so that the other team can import it as a Maven dependency
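The first two requirements can be sketched in plain Java. This is a minimal illustration, not our production code: `PlayPredictor`, `TimedPrediction`, and the `Function<float[], float[]>` stand-in for a loaded model are hypothetical names chosen for this example, and no DJL types are involved.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical wrapper: loads the model once, then serves many predictions. */
final class PlayPredictor {
    /** The 50-millisecond requirement, expressed in nanoseconds. */
    static final long LATENCY_BUDGET_NANOS = 50_000_000L;

    // Stand-in for a loaded model: features in, outcome probabilities out.
    private final Function<float[], float[]> model;

    PlayPredictor(Supplier<Function<float[], float[]>> loader) {
        this.model = loader.get(); // the expensive load happens exactly once
    }

    /** One prediction plus the time it took. */
    static final class TimedPrediction {
        final float[] probabilities;
        final long elapsedNanos;

        TimedPrediction(float[] probabilities, long elapsedNanos) {
            this.probabilities = probabilities;
            this.elapsedNanos = elapsedNanos;
        }

        /** True if this prediction met the latency requirement. */
        boolean withinBudget() {
            return elapsedNanos <= LATENCY_BUDGET_NANOS;
        }
    }

    /** Runs the already-loaded model on one play and measures the latency. */
    TimedPrediction predict(float[] features) {
        long start = System.nanoTime();
        float[] probabilities = model.apply(features);
        return new TimedPrediction(probabilities, System.nanoTime() - start);
    }
}
```

A caller would construct one `PlayPredictor` at startup and reuse it for every play, checking `withinBudget()` (or exporting `elapsedNanos` as a metric) to watch the latency requirement.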
Challenges with the in-place system
The main challenge we had was how to bridge the gap between model training in Python and model inference in Java. Our data scientists train the models in Python using tools like PyTorch and save them as TorchScript files. Our original plan was to also host the models in Python and use gRPC to communicate with another service, which would use a Java gRPC client to send the requests.
However, a few issues came with this solution. First, we saw network overhead between the two services running in separate runtime environments or pods, which resulted in higher latency. But the maintenance overhead was the main reason we abandoned this solution. We had to build both the gRPC server and the client program separately and keep the protocol buffer files consistent and up to date. Then we needed to Dockerize the application, write a deployment YAML file, deploy the gRPC server to our Kubernetes cluster, and make sure it was reliable and auto scalable.
Another problem was that whenever an error occurred on the gRPC server side, the application client only got a vague error message instead of a detailed error traceback. The client had to reach out to the gRPC server maintainer to learn exactly which part of the code caused the error.
Ideally, we wanted to load the TorchScript models, extract the features from the model input, and run model inference entirely in Java. Then we could build and publish it as a Maven library, hosted on our internal registry, which our service team could import into their own Java projects. When we did our research online, the Deep Java Library showed up at the top of the results. After reading a few blog posts and DJL’s official documentation, we were confident that DJL would provide the best solution to our problem.
The following diagram compares the previous and updated architecture.
The following diagram outlines the workflow of the DJL solution.
The steps are as follows:
- Training the models – Our data scientists train the models using PyTorch and save them as TorchScript files. These models are then pushed to an Amazon S3 bucket using DVC, a version control tool for ML models.
- Implementing feature extraction and feeding ML features – The framework team pulls the models from Amazon S3 into a Java repository where they implement feature extraction and feed ML features into the predictor. They use the DJL PyTorch engine to initialize the model predictor.
- Packaging and publishing the inference code and models – The GitLab CI/CD pipeline packages and publishes the JAR file that contains the inference code and models to an internal Apache Archiva registry.
- Importing the inference library and making calls – The Java client imports the inference library as a Maven dependency. All inference calls are made via Java function calls within the same Kubernetes pod. Because there are no gRPC calls, the inferencing response time is improved. Furthermore, the Java client can easily roll back the inference library to a previous version if needed. Errors raised during inference are also visible to the client directly, unlike in gRPC-based solutions, where server-side errors are not transparent to the client and error tracking is difficult.
We have seen a stable inferencing runtime and reliable prediction results. The DJL solution offers several advantages over gRPC-based solutions:
- Improved response time – With no gRPC calls, the inferencing response time is improved
- Easy rollbacks and upgrades – The Java client can easily roll back the inference library to a previous version or upgrade to a new version
- Transparent error tracking – In the DJL solution, the client can receive detailed error traceback messages in case of inferencing errors
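The error-tracking point can be illustrated with plain Java exceptions. The sketch below is a simplified picture under stated assumptions: `ErrorTransparencyDemo` and its methods are hypothetical, and `remoteStyleMessage` only mimics the case where a single status string crosses a service boundary. It is not a model of gRPC itself.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

final class ErrorTransparencyDemo {
    /** Stand-in for feature extraction failing deep inside an inference library. */
    static float[] extractFeatures(String rawValue) {
        try {
            return new float[] {Float.parseFloat(rawValue)};
        } catch (NumberFormatException e) {
            // Add context, but keep the original exception as the cause.
            throw new IllegalArgumentException("feature extraction failed for: " + rawValue, e);
        }
    }

    /** What an in-process caller can log: message, cause chain, and stack frames. */
    static String fullTrace(Throwable t) {
        StringWriter out = new StringWriter();
        t.printStackTrace(new PrintWriter(out, true));
        return out.toString();
    }

    /** Simplified view of an error reduced to one status string at a service boundary. */
    static String remoteStyleMessage(Throwable t) {
        return "INTERNAL: " + t.getMessage();
    }
}
```

When the inference code runs in the same JVM, the caller catches the exception itself, so the `Caused by:` chain and the failing method name are available without contacting another team.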
Deep Java Library overview
The DJL is a full deep learning framework that supports the deep learning lifecycle, from building a model and training it on a dataset to deploying it in production. It has intuitive helpers and utilities for modalities like computer vision, natural language processing, audio, time series, and tabular data. DJL also features an easy-to-use model zoo of hundreds of pre-trained models that can be used out of the box and integrated into existing systems.
It is also a fully open-source project under the Apache 2.0 license and can be found on GitHub. The DJL was created at Amazon and open-sourced in 2019. Today, DJL’s open-source community is led by Amazon and has grown to include contributors from many countries, companies, and educational institutions. The DJL continues to grow in its ability to support different hardware, models, and engines. This includes support for newer hardware like ARM (both in servers like AWS Graviton and laptops with Apple M1) and AWS Inferentia.
The architecture of DJL is engine agnostic. It aims to be an interface describing what deep learning could look like in the Java language, but leaves room for multiple implementations that could provide different capabilities or hardware support. Most popular frameworks today, such as PyTorch and TensorFlow, are built using a Python front end that connects to a high-performance C++ native backend. The DJL takes advantage of this design by connecting to these same native backends, which lets it benefit from their work on hardware support and performance.
For this reason, many DJL users also use it for inference only. That is, they train a model using Python and then load it using the DJL for deployment as part of their existing Java production system. Because the DJL uses the same native engine that powers the Python framework, it’s able to run without any decrease in performance or loss in accuracy. This is exactly the strategy we adopted to support the new models.
The following diagram illustrates the workflow under the hood.
When the DJL loads, it finds all the engine implementations available in the class path using Java’s ServiceLoader. In this case, it detects the DJL PyTorch engine implementation, which will act as the bridge between the DJL API and the PyTorch Native.
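The discovery mechanism described here is standard Java. The following is a minimal sketch of the same pattern using `java.util.ServiceLoader`; the `InferenceEngineProvider` interface, the `EngineDiscovery` class, and the "lowest rank wins" rule are assumptions made for this example and are not the DJL's own types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical service interface that each engine implementation would provide. */
interface InferenceEngineProvider {
    String engineName();

    int rank(); // assumed convention for this sketch: lower rank is preferred
}

final class EngineDiscovery {
    /**
     * Finds every provider registered on the class path through a
     * META-INF/services/InferenceEngineProvider file. Returns an empty
     * list when no implementation is registered.
     */
    static List<InferenceEngineProvider> discover() {
        List<InferenceEngineProvider> found = new ArrayList<>();
        for (InferenceEngineProvider provider : ServiceLoader.load(InferenceEngineProvider.class)) {
            found.add(provider);
        }
        return found;
    }

    /** Picks the provider with the lowest rank, or null if the list is empty. */
    static InferenceEngineProvider chooseDefault(List<InferenceEngineProvider> providers) {
        InferenceEngineProvider best = null;
        for (InferenceEngineProvider provider : providers) {
            if (best == null || provider.rank() < best.rank()) {
                best = provider;
            }
        }
        return best;
    }
}
```

The design benefit is that adding an engine only requires putting another JAR on the class path; the calling code does not change.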
The engine then works to load the PyTorch Native. By default, it downloads the appropriate native binary based on your OS, CPU architecture, and CUDA version, making it almost effortless to use. You can also provide the binary using one of the many available native JAR files, which are more reliable for production environments that often have limited network access for security.
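As a rough illustration of how a library can decide which native binary matches the current machine, the sketch below maps the JVM's `os.name` and `os.arch` system properties to a platform string. The `NativePlatform` class and the exact naming scheme are assumptions for this example, CUDA detection is left out, and this is not the DJL's actual selection logic.

```java
import java.util.Locale;

final class NativePlatform {
    /** Maps JVM system property values to an "os-arch" string (naming is illustrative). */
    static String classifier(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        String osPart;
        if (os.startsWith("linux")) {
            osPart = "linux";
        } else if (os.startsWith("mac")) {
            osPart = "osx";
        } else if (os.startsWith("windows")) {
            osPart = "win";
        } else {
            throw new UnsupportedOperationException("unsupported OS: " + osName);
        }

        String archPart;
        if (arch.equals("amd64") || arch.equals("x86_64")) {
            archPart = "x86_64";
        } else if (arch.equals("aarch64") || arch.equals("arm64")) {
            archPart = "aarch64";
        } else {
            throw new UnsupportedOperationException("unsupported architecture: " + osArch);
        }
        return osPart + "-" + archPart;
    }

    /** Classifier for the machine this JVM is running on. */
    static String current() {
        return classifier(System.getProperty("os.name"), System.getProperty("os.arch"));
    }
}
```

In a locked-down production environment, the same decision is made at build time instead, by declaring the native JAR for the target platform as a dependency.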
Once loaded, the DJL uses the Java Native Interface to translate the high-level functionality in DJL into the equivalent low-level native calls. Every operation in the DJL API is hand-crafted to best fit Java conventions and to be easily accessible. This also includes dealing with native memory, which is not managed by the Java Garbage Collector.
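Because native memory is invisible to the garbage collector, it has to be released explicitly. A common Java idiom for this is a scope object that owns its allocations and frees them when closed; the DJL's NDManager plays a comparable role. The sketch below shows only the idiom: `NativeBuffer` and `MemoryScope` are hypothetical classes, and no native memory is actually allocated.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical handle to memory that lives outside the Java heap. */
final class NativeBuffer implements AutoCloseable {
    private boolean released;

    boolean isReleased() {
        return released;
    }

    @Override
    public void close() {
        released = true; // a real handle would free the native memory here
    }
}

/** Hypothetical scope that owns buffers and releases all of them on close. */
final class MemoryScope implements AutoCloseable {
    private final Deque<NativeBuffer> owned = new ArrayDeque<>();

    NativeBuffer allocate() {
        NativeBuffer buffer = new NativeBuffer();
        owned.push(buffer);
        return buffer;
    }

    @Override
    public void close() {
        while (!owned.isEmpty()) {
            owned.pop().close(); // release in reverse allocation order
        }
    }

    /** Usage example: everything allocated inside the try block is released at its end. */
    static boolean demo() {
        NativeBuffer features;
        NativeBuffer output;
        try (MemoryScope scope = new MemoryScope()) {
            features = scope.allocate();
            output = scope.allocate();
        }
        return features.isReleased() && output.isReleased();
    }
}
```

Tying allocations to a try-with-resources block means a prediction cannot leak native memory even when it throws.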
Although all these details are within the library, calling it from a user standpoint couldn’t be easier. In the following section, we walk through this process.
How Sportradar implemented DJL
Because we train our models using PyTorch, we use the DJL’s PyTorch engine for the model inference.
Loading the model is incredibly easy. All it takes is to build a Criteria object describing the model to load and where it comes from. Then, we load the model from the criteria (loadModel) and use it to create a new predictor session (newPredictor). See the following code:
For our model, we also have a custom translator, which we call MyTranslator. We use the translator to encapsulate the preprocessing code that converts from a convenient Java type into the input expected by the model and the postprocessing code that converts from the model output into a convenient output. In our case, we chose to use a float as the input type and the built-in DJL classifications as the output type. The following is a snippet of our translator code:
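As a library-free illustration of that preprocessing and postprocessing split (not our production translator), the sketch below turns raw model scores into labeled probabilities. `SimpleTranslator`, `LabeledProbability`, and `FourthDownTranslator` are hypothetical stand-ins and do not match the signature of the DJL's Translator interface. The sketch also assumes that the model emits unnormalized scores (logits) in the label order shown, and that the input is an array of feature values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre/postprocessing contract; not the DJL Translator signature. */
interface SimpleTranslator<I, O> {
    float[] processInput(I input);

    O processOutput(float[] rawModelOutput);
}

/** One outcome label with its predicted probability. */
final class LabeledProbability {
    final String label;
    final double probability;

    LabeledProbability(String label, double probability) {
        this.label = label;
        this.probability = probability;
    }
}

final class FourthDownTranslator implements SimpleTranslator<float[], List<LabeledProbability>> {
    // Outcome names come from the post; their order in the model output is assumed.
    private static final List<String> LABELS =
            Arrays.asList("field goal attempt", "play", "punt");

    @Override
    public float[] processInput(float[] features) {
        return features.clone(); // already numeric, so just copy defensively
    }

    @Override
    public List<LabeledProbability> processOutput(float[] logits) {
        if (logits.length != LABELS.size()) {
            throw new IllegalArgumentException("expected " + LABELS.size() + " scores");
        }
        // Numerically stable softmax: subtract the maximum before exponentiating.
        double max = Double.NEGATIVE_INFINITY;
        for (float value : logits) {
            max = Math.max(max, value);
        }
        double[] exp = new double[logits.length];
        double sum = 0.0;
        for (int i = 0; i < logits.length; i++) {
            exp[i] = Math.exp(logits[i] - max);
            sum += exp[i];
        }
        List<LabeledProbability> result = new ArrayList<>();
        for (int i = 0; i < logits.length; i++) {
            result.add(new LabeledProbability(LABELS.get(i), exp[i] / sum));
        }
        // Most likely outcome first.
        result.sort((a, b) -> Double.compare(b.probability, a.probability));
        return result;
    }
}
```

Keeping this conversion inside the translator means callers only deal with plain Java inputs and a ranked list of outcomes, never with raw tensors.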
It’s pretty amazing that with just a few lines of code, the DJL loads the TorchScript model and our custom translator, and then the predictor is ready to make predictions.
Sportradar’s product built on the DJL solution went live before the 2022–23 NFL regular season started, and it has been running smoothly since then. In the future, Sportradar plans to re-platform existing models hosted on gRPC servers to the DJL solution.
The DJL continues to grow in many different ways. The most recent release, v0.21.0, has many improvements, including updated engine support, improvements on Spark, Hugging Face batch tokenizers, an NDScope for easier memory management, and enhancements to the time series API. It also has the first major release of DJL Zero, a new API that aims to support both using pre-trained models and training your own custom deep learning models, even with zero knowledge of deep learning.
The DJL also features a model server called DJL Serving. It makes it simple to host a model on an HTTP server from any of the 10 supported engines, including the Python engine to support Python code. v0.21.0 of DJL Serving includes faster transformer support, Amazon SageMaker multi-model endpoint support, updates for Stable Diffusion, improvements for DeepSpeed, and updates to the management console. You can now use it to deploy large models with model parallel inference using DeepSpeed and SageMaker.
Much more is planned for the DJL. The largest area under development is support for large models like ChatGPT or Stable Diffusion. There is also work to support streaming inference requests in DJL Serving, along with improvements to the demos and the extension for Spark. Of course, standard work continues as well, including features, fixes, engine updates, and more.
For more information on the DJL and its other features, see Deep Java Library.
About the authors
Fred Wu is a Senior Data Engineer at Sportradar, where he leads infrastructure, DevOps, and data engineering efforts for various NBA and NFL products. With extensive experience in the field, Fred is dedicated to building robust and efficient data pipelines and systems to support cutting-edge sports analytics.
Zach Kimberg is a Software Developer in the Amazon AI org. He works to enable the development, training, and production inference of deep learning. There, he helped found and continues to develop the Deep Java Library project.
Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.