Model serving in Java with AWS Elastic Beanstalk made easy with Deep Java Library
Deploying your machine learning (ML) models to run on a REST endpoint has never been easier. Using AWS Elastic Beanstalk and Amazon Elastic Compute Cloud (Amazon EC2) to host your endpoint and Deep Java Library (DJL) to load your deep learning models for inference makes the model deployment process extremely easy to set up. Setting up a model on Elastic Beanstalk is great if you require fast response times on all your inference calls. In this post, we cover deploying a model on Elastic Beanstalk using DJL and sending an image in a POST request to get inference results on what the image contains.
DJL is a deep learning framework written in Java that supports training and inference. DJL is built on top of modern deep learning engines (such as TensorFlow, PyTorch, and MXNet). You can easily use DJL to train your model or deploy your favorite models from a variety of engines without any additional conversion. It contains a powerful model zoo design that allows you to manage trained models and load them in a single line. The built-in model zoo currently supports more than 70 pre-trained and ready-to-use models from GluonCV, HuggingFace, TorchHub, and Keras.
The primary benefit of hosting your model using Elastic Beanstalk and DJL is that it’s very easy to set up and provides consistent sub-second responses to a POST request. With DJL, you don’t need to download any other libraries or worry about importing dependencies for your chosen deep learning framework. Using Elastic Beanstalk has two advantages:
- No cold startup – Compared to an AWS Lambda solution, the EC2 instance is running all the time, so any call to your endpoint runs instantly and there isn’t any overhead when starting up new containers.
- Scalable – Compared to a server-based solution, you can allow Elastic Beanstalk to scale horizontally.
You need to have the following gradle dependencies set up to run our PyTorch model:
We first create a RESTful endpoint using Java Spring Boot and have it accept an image request. We decode the image and turn it into an Image object to pass into our model. The model is autowired by the Spring framework by calling the model() method. For simplicity, we create the predictor object on each request, where we pass our image for inference (you can optimize this by using an object pool). When inference is complete, we return the results to the requester. See the following code:
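The object pool optimization mentioned above could look roughly like the following. This is a minimal sketch using only the Java standard library, not code from the demo: the class name SimplePool and its methods are hypothetical, and the pooled type is left generic so that a DJL predictor (or any other reusable object) could be plugged in through the factory.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool that lends out reusable objects, such as predictors. */
final class SimplePool<T> {
    private final BlockingQueue<T> idle;

    /** Creates all pooled objects up front by calling the factory {@code size} times. */
    SimplePool(int size, Supplier<T> factory) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    /** Borrows an object, runs the task with it, and always hands the object back. */
    <R> R withResource(Function<T, R> task) {
        T resource;
        try {
            resource = idle.take(); // blocks while every pooled object is in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("interrupted while waiting for a pooled object", e);
        }
        try {
            return task.apply(resource);
        } finally {
            idle.add(resource); // the queue never holds more than `size` objects, so this succeeds
        }
    }

    /** Number of objects currently idle in the pool. */
    int available() {
        return idle.size();
    }
}
```

In a controller, the factory would create the predictor from the autowired model once per pool slot, and each request would run its inference inside withResource instead of constructing a new predictor. Because each borrowed object is used by one thread at a time, this pattern also suits objects that are not safe to share across threads.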
A full copy of the code is available on the GitHub repo.
Building your JAR file
Go into the beanstalk-model-serving directory and enter the following code:
This creates a JAR file found in
Deploying to Elastic Beanstalk
To deploy this model, complete the following steps:
- On the Elastic Beanstalk console, create a new environment.
- For our use case, we name the environment DJL-Demo.
- For Platform, select Managed platform.
- For Platform settings, choose Java 8 and the appropriate branch and version.
- When selecting your application code, choose Choose file and upload the beanstalk-model-serving-0.0.1-SNAPSHOT.jar that was created in your build.
- Choose Create environment.
After Elastic Beanstalk creates the environment, we need to update the Software and Capacity boxes in our configuration, located on the Configuration overview page.
- For the Software configuration, we add an additional setting in the Environment Properties section with the name SERVER_PORT and value 5000.
- For the Capacity configuration, we change the instance type to t2.small to give our endpoint a little more compute and memory.
- Choose Apply configuration and wait for your endpoint to update.
Calling your endpoint
Now we can call our Elastic Beanstalk endpoint with our image of a smiley face.
See the following code:
We get the following response:
The output predicts that a smiley face is the most probable item in our image. Success!
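If the caller is itself a Java application, the request to the endpoint could be sketched with the JDK's built-in HTTP client (Java 11 or later on the client side). Several details here are assumptions rather than facts about the demo: the /inference path, the choice to send the raw image bytes with an application/octet-stream content type, and the class and method names are all hypothetical, so adjust them to match your controller.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical client for an image-inference endpoint hosted on Elastic Beanstalk. */
final class EndpointCallSketch {

    /** Builds a POST request carrying the image bytes; "/inference" is an assumed path. */
    static HttpRequest buildRequest(String baseUrl, byte[] imageBytes) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/inference"))
                .header("Content-Type", "application/octet-stream") // assumed body format
                .timeout(Duration.ofSeconds(30))
                .POST(HttpRequest.BodyPublishers.ofByteArray(imageBytes))
                .build();
    }

    /** Sends the request and returns the response body as text. */
    static String send(HttpRequest request) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

To use it, read the image with java.nio.file.Files.readAllBytes, pass the bytes and your environment's URL to buildRequest, and hand the result to send.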
If your model isn’t called often and there isn’t a requirement for fast inference, we recommend deploying your models on a serverless service such as Lambda, although this adds overhead due to the cold startup nature of the service. Hosting your models through Elastic Beanstalk may be slightly more expensive at low volumes because the EC2 instance runs 24 hours a day, so you pay for the service even when you’re not using it. The gap closes as traffic grows: we have found that the cost of model serving on Lambda is equal to the cost of Elastic Beanstalk using a t3.small when there are about 2.57 million inference requests to the endpoint a month.
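The break-even reasoning can be expressed as simple arithmetic: an always-on instance has a fixed monthly cost, a serverless function has a cost per request, and the two are equal when the request count reaches the fixed cost divided by the per-request cost. The sketch below illustrates only that relationship. The class name is hypothetical, the sample numbers are placeholders rather than AWS prices, and the model ignores free tiers, data transfer, and how per-request cost varies with memory and duration.

```java
/** Hypothetical helper for comparing a fixed monthly cost with a per-request cost. */
final class BreakEvenSketch {

    /**
     * Returns the smallest monthly request count at which the pay-per-request total
     * reaches or exceeds the fixed monthly cost. Costs are in micro-dollars
     * (millionths of a dollar) so the arithmetic stays in exact integers.
     */
    static long breakEvenRequests(long fixedMonthlyCostMicros, long costPerRequestMicros) {
        if (fixedMonthlyCostMicros < 0 || costPerRequestMicros <= 0) {
            throw new IllegalArgumentException("costs must be non-negative and per-request cost positive");
        }
        // Ceiling division: round up so the returned count fully covers the fixed cost.
        return (fixedMonthlyCostMicros + costPerRequestMicros - 1) / costPerRequestMicros;
    }

    public static void main(String[] args) {
        // Placeholder inputs: $10.00 per month fixed, $0.000005 per request.
        System.out.println(breakEvenRequests(10_000_000L, 5L));
    }
}
```

Plugging in the current monthly price of your instance type and your measured per-request serverless cost gives the request volume above which the always-on option becomes the cheaper one.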
In this post, we demonstrated how to start deploying and serving your deep learning models using Elastic Beanstalk and DJL. You just need to set up your endpoint with Java Spring, build your JAR file, upload that file to Elastic Beanstalk, update some configurations, and it’s deployed!
We also discussed some of the pros and cons of this deployment process, namely that it’s ideal if you need fast inference calls, but at low utilization it costs more than hosting the model on a serverless endpoint.
This demo is available in full in the DJL demo GitHub repo. You can also find other examples of serving models with DJL across different JVM tools like Spark and AWS products like Lambda. Whatever your requirements, there is an option for you.
About the Author
Frank Liu is a Software Engineer for AWS Deep Learning. He focuses on building innovative deep learning tools for software engineers and scientists. In his spare time, he enjoys hiking with friends and family.