AWS Machine Learning Blog

Creating an end-to-end application for orchestrating custom deep learning HPO, training, and inference using AWS Step Functions

Amazon SageMaker hyperparameter tuning provides a built-in solution for scalable training and hyperparameter optimization (HPO). However, some applications (such as those that prefer a different HPO library or need customized HPO features) require custom machine learning (ML) solutions that support retraining and HPO. This post offers a step-by-step guide to building a custom deep learning web application on AWS from scratch, following the Bring Your Own Container (BYOC) paradigm. We show you how to create a web application that lets non-technical end users orchestrate different deep learning operations and perform advanced tasks such as HPO and retraining from a UI. You can modify the example solution to create a deep learning web application for any regression or classification problem.

Solution overview

Creating a custom deep learning web application consists of two main steps:

  • ML component (focusing on how to dockerize a deep learning solution)
  • Full-stack application to use ML component

In the first step, we create a custom Docker image and register it in Amazon Elastic Container Registry (Amazon ECR). Amazon SageMaker uses this image to run Bayesian HPO, training/retraining, and inference. Appendix A describes how to dockerize deep learning code.

In the second step, we deploy a full-stack application with AWS Serverless Application Model (AWS SAM). We use AWS Step Functions and AWS Lambda to orchestrate the different stages of the ML pipeline. Then we create the frontend application, hosted in Amazon Simple Storage Service (Amazon S3) and Amazon CloudFront. We also use AWS Amplify with Amazon Cognito for authentication. The following diagram shows the solution architecture.

After you deploy the application, you can authenticate with Amazon Cognito to trigger training or HPO jobs from the UI (Step 2 in the diagram). User requests go through Amazon API Gateway to Step Functions, which is responsible for orchestrating the training or HPO (Step 3). When it’s complete, you can submit a set of input parameters through the UI to API Gateway and Lambda to get the inference results (Step 4).

Deploy the application

For instructions on deploying the application, see the GitHub repo README file. This application consists of four main components:

  • machine-learning – Contains SageMaker notebooks and scripts for building an ML Docker image (for HPO and training), discussed in Appendix A
  • shared-infra – Contains AWS resources used by both the backend and frontend, defined in an AWS CloudFormation template
  • backend – Contains the backend code: APIs and a step function for retraining the model, running HPO, and an Amazon DynamoDB database
  • frontend – Contains the UI code and infrastructure to host it

Deployment details can be found here.

Create a step for HPO and training in Step Functions

Training a model for inference using Step Functions requires multiple steps:

  1. Create a training job.
  2. Create a model.
  3. Create an endpoint configuration.
  4. Optionally, delete the old endpoint.
  5. Create a new endpoint.
  6. Wait until the new endpoint is deployed.
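
The last step in the list above is a polling loop. The following is a minimal sketch of such a loop in Python, not the code used in this solution (where the check is performed by a Lambda function invoked from the state machine). The function name wait_for_endpoint and its parameters are hypothetical. The status values InService, Creating, and Failed are the ones SageMaker reports in the EndpointStatus field of a DescribeEndpoint response.

```python
import time
from typing import Callable


def wait_for_endpoint(
    describe_endpoint: Callable[[], dict],
    poll_seconds: float = 30.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the endpoint is in service, fails, or the poll budget runs out.

    describe_endpoint is any zero-argument callable returning a dict with an
    "EndpointStatus" key, for example a lambda that wraps the SageMaker
    DescribeEndpoint call for one endpoint name.
    """
    for _ in range(max_polls):
        status = describe_endpoint()["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("Endpoint deployment failed")
        sleep(poll_seconds)
    raise TimeoutError("Endpoint did not reach InService in time")


# Stand-in for the SageMaker call so the sketch runs offline.
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_endpoint(
    lambda: {"EndpointStatus": next(fake_statuses)},
    poll_seconds=0,
    sleep=lambda _seconds: None,
)
print(final_status)
```

Passing the describe call in as a callable keeps the loop testable without AWS credentials. In a state machine, the same logic is usually expressed as a Wait state followed by a status check and a Choice state rather than a loop inside one function.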

Running HPO is simpler because we only create an HPO job and output the result to Amazon CloudWatch Logs. We orchestrate both model training and HPO using Step Functions. We define these steps as a state machine, using an Amazon States Language (ASL) definition. The following figure is the graphical representation of this state machine.

As the first step, we use the Choice state to decide whether to run in HPO or training mode, using the following code:

"Mode Choice": {
    "Type": "Choice",
    "Choices": [
        {
            "Variable": "$.Mode",
            "StringEquals": "HPO",
            "Next": "HPOFlow"
        }
    ],
    "Default": "TrainingModelFlow"
}

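To make the routing concrete, the following Python sketch emulates how a Choice state with a single StringEquals rule selects the next state. It is an illustration only: the helper choose_next_state is hypothetical, it handles only top-level "$.Field" paths, and it falls through to Default when the field is missing, which may differ from how the Step Functions service treats an absent path.

```python
def choose_next_state(choice_state: dict, state_input: dict) -> str:
    """Return the next state name for a Choice state with StringEquals rules."""
    for rule in choice_state["Choices"]:
        variable = rule["Variable"]
        if not variable.startswith("$."):
            raise ValueError(f"Unsupported path: {variable}")
        # Strip the leading "$." to get the top-level field name.
        if state_input.get(variable[2:]) == rule["StringEquals"]:
            return rule["Next"]
    return choice_state["Default"]


mode_choice = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.Mode", "StringEquals": "HPO", "Next": "HPOFlow"}
    ],
    "Default": "TrainingModelFlow",
}

print(choose_next_state(mode_choice, {"Mode": "HPO"}))    # HPOFlow
print(choose_next_state(mode_choice, {"Mode": "MODEL"}))  # TrainingModelFlow
```
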
Many states have the names Create a … Record and Update Status to…. These steps either create or update records in DynamoDB tables. The API queries these tables to return the status of the job and the ARN of created resources (the endpoint ARN for making an inference).

Each record has the Step Functions execution ID as a key and a field called status. As the execution progresses, the status changes from TRAINING_MODEL all the way to READY. The state machine also records important outputs such as the S3 model output, model ARN, endpoint config ARN, and endpoint ARN.

For example, the following state runs right before endpoint deployment. It updates the endpointConfigArn field in the record.

"Update Status to DEPLOYING_ENDPOINT": {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:updateItem",
    "Parameters": {
        "TableName": "${ModelTable}",
        "Key": {
            "trainingId": {
                "S.$": "$$.Execution.Id"
            },
            "created": {
                "S.$": "$$.Execution.StartTime"
            }
        },
        "UpdateExpression": "SET #st = :ns, #eca = :cf",
        "ExpressionAttributeNames": {
            "#st" : "status",
            "#eca" : "endpointConfigArn"
        },
        "ExpressionAttributeValues": {
            ":ns" : {
                "S": "DEPLOYING_ENDPOINT"
            },
            ":cf" : {
                "S.$": "$.EndpointConfigArn"
            }
        }
    },
    "ResultPath": "$.taskresult",
    "Next": "Deploy"
}

The following screenshot shows the content in the DynamoDB table.

In the preceding screenshot, the last job is still running. It finished training and creating an endpoint configuration, but hasn’t deployed the endpoint yet. Therefore, there is no endpointArn in this record.
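
For readers more familiar with the SDK than with ASL, the Update Status to DEPLOYING_ENDPOINT state corresponds roughly to a DynamoDB UpdateItem call. The sketch below builds the equivalent keyword arguments in Python. The helper name build_status_update and all example values are hypothetical; the argument names (TableName, Key, UpdateExpression, and so on) are those of the DynamoDB UpdateItem API.

```python
def build_status_update(table_name, execution_id, start_time, endpoint_config_arn):
    """Build keyword arguments for a DynamoDB UpdateItem call."""
    return {
        "TableName": table_name,
        "Key": {
            "trainingId": {"S": execution_id},
            "created": {"S": start_time},
        },
        # "status" is a DynamoDB reserved word, hence the #st placeholder.
        "UpdateExpression": "SET #st = :ns, #eca = :cf",
        "ExpressionAttributeNames": {"#st": "status", "#eca": "endpointConfigArn"},
        "ExpressionAttributeValues": {
            ":ns": {"S": "DEPLOYING_ENDPOINT"},
            ":cf": {"S": endpoint_config_arn},
        },
    }


# Placeholder values for illustration only.
update_kwargs = build_status_update(
    table_name="demo-model-table",
    execution_id="demo-execution-id",
    start_time="2021-01-01T00:00:00Z",
    endpoint_config_arn="demo-endpoint-config-arn",
)
print(update_kwargs["UpdateExpression"])
# With credentials configured, boto3.client("dynamodb").update_item(**update_kwargs)
# would apply the update.
```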

Another important state is Delete Old Endpoint. When you deploy an endpoint, an Amazon Elastic Compute Cloud (Amazon EC2) instance is running 24/7. As you train more models and create more endpoints, your inference cost grows linearly with the number of models. Therefore, we create this state to delete the old endpoint to reduce our cost.

The Delete Old Endpoint state calls a Lambda function that deletes the oldest endpoint when the number of endpoints exceeds the specified maximum. The default value is 5, but you can change it through a parameter of the CloudFormation template for the backend. Although you can set this value to any number, SageMaker has a soft limit on how many endpoints you can have at a given time. There is also a limit for each instance type.
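
The selection logic behind such a function can be sketched in a few lines of Python. This is an assumption about how the selection might work, not the Lambda function from the repository: the helper endpoints_to_delete is hypothetical, and it trims the list down to the maximum by age. The EndpointName and CreationTime keys match the fields SageMaker returns when listing endpoints.

```python
from datetime import datetime


def endpoints_to_delete(endpoints, max_endpoints=5):
    """Return names of the oldest endpoints beyond the allowed maximum.

    endpoints is a list of dicts with "EndpointName" and "CreationTime" keys.
    """
    excess = len(endpoints) - max_endpoints
    if excess <= 0:
        return []
    oldest_first = sorted(endpoints, key=lambda e: e["CreationTime"])
    return [e["EndpointName"] for e in oldest_first[:excess]]


# Six made-up endpoints, created on consecutive days in shuffled order.
demo_endpoints = [
    {"EndpointName": f"model-{day}", "CreationTime": datetime(2021, 1, day)}
    for day in (3, 1, 6, 2, 5, 4)
]
print(endpoints_to_delete(demo_endpoints))  # ['model-1']
```

Keeping the selection separate from the actual delete call makes it easy to check the behavior before anything is removed.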

Finally, we have states for updating status to ERROR (one for HPO and another one for model training). These steps are used in the Catch field when any part of the step throws an error. These steps update the DynamoDB record with the fields error and errorCause from Step Functions (see the following screenshot).

Although we can retrieve this data from the Step Functions APIs, we keep them in DynamoDB records so that the front end can retrieve all the related information in one place.

Automate state machine creation with AWS CloudFormation

We can use the state machine definition to recreate this state machine in any account. The template contains several variables, such as DynamoDB table names for tracking job status or Lambda functions that are triggered by states. The ARNs of these resources change in each deployment. Therefore, we use AWS SAM to inject these variables. You can find the state machine resource here. The following code is an excerpt of how we refer to the ASL file and how resource ARNs are passed:

TrainingModelStateMachine:
  Type: AWS::Serverless::StateMachine
  Properties:
    DefinitionUri: statemachine/model-training.asl.json
    DefinitionSubstitutions:
      DeleteOldestEndpointFunctionArn: !GetAtt DeleteOldestEndpointFunction.Arn
      CheckDeploymentStatusFunctionArn: !GetAtt CheckDeploymentStatusFunction.Arn
      ModelTable: !Ref ModelTable
      HPOTable: !Ref HPOTable
    Policies:
      - LambdaInvokePolicy:
          FunctionName: !Ref DeleteOldestEndpointFunction
    # .. the rest of policies is omitted for brevity

ModelTable:
  Type: AWS::DynamoDB::Table
  Properties:
    AttributeDefinitions:
      - AttributeName: "trainingId"
        AttributeType: "S"
      - AttributeName: "created"
        AttributeType: "S"
    # .. the rest of properties is omitted for brevity

AWS::Serverless::StateMachine is an AWS SAM resource type. The DefinitionUri refers to the state machine definition we discussed in the last step. The definition has some variables, such as ${ModelTable}. The following excerpt shows the beginning of one state that uses this variable:

"Update Status to READY": {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:updateItem",
    "Parameters": {
        "TableName": "${ModelTable}",
        "Key": {

When we run the AWS SAM CLI, the variables in this template are replaced by the key-value pairs declared in DefinitionSubstitutions. In this case, ${ModelTable} is replaced by the table name of the ModelTable resource created by AWS CloudFormation.

This way, the template is reusable and can be redeployed multiple times without any change to the state machine definition.
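
The effect of the substitution can be illustrated with a small Python function. This is only a sketch of the idea, not how AWS SAM or CloudFormation implement it: the helper substitute_definition is hypothetical, and it leaves unknown names untouched. A regular expression is used so that JSONPath tokens such as $.Mode and $$.Execution.Id, which also start with a dollar sign, are not altered.

```python
import re


def substitute_definition(definition_text: str, substitutions: dict) -> str:
    """Replace ${Name} placeholders, leaving JSONPath tokens such as $.Mode alone."""

    def replace(match):
        name = match.group(1)
        # Unknown names are left untouched in this sketch.
        return str(substitutions.get(name, match.group(0)))

    return re.sub(r"\$\{(\w+)\}", replace, definition_text)


snippet = '{"TableName": "${ModelTable}", "Key": {"trainingId": {"S.$": "$$.Execution.Id"}}}'
resolved = substitute_definition(snippet, {"ModelTable": "demo-model-table"})
print(resolved)
```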

Build an API for the application

This application has five APIs:

  • POST /infer – Retrieves the inference result for the given model
  • GET /model – Retrieves all model information
  • POST /model – Starts a new model training job with data in the given S3 path
  • GET /hpo – Retrieves all HPO job information
  • POST /hpo – Starts a new HPO job with data in the given S3 path

We create each API with an AWS SAM template. The following code is a snippet of the POST /model endpoint:

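  PostModelFunction: # placeholder logical ID; the name is missing from the original snippet
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: functions/api/
      Runtime: python3.7
      Environment:
        Variables:
          MODE: "MODEL"
          TRAINING_STATE_MACHINE_ARN: !Ref TrainingModelStateMachine
          # Other variables removed for brevity
      Policies:
        - AWSLambdaExecute
        - DynamoDBCrudPolicy:
            TableName: !Ref ModelTable
        - Version: 2012-10-17
          Statement:
            - Effect: Allow
              Action:
                - states:StartExecution
              Resource: !Ref TrainingModelStateMachine
      Events:
        PostModel: # placeholder event name
          Type: Api
          Properties:
            Path: /model
            Method: post
            Auth:
              Authorizer: MyCognitoAuth
      Layers:
        - !Ref APIDependenciesLayer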

We utilize several features from the AWS SAM template in this Lambda function. First, we pass the created state machine ARN via environment variables, using !Ref. Because the ARN isn’t available until the stack creation time, we use this method to avoid hardcoding.

Second, we follow the security best practices of the least privilege policy by using DynamoDBCrudPolicy in the AWS SAM policy template to give permission to modify the data in the specific DynamoDB table. For the permissions that aren’t available as a policy template (states:StartExecution), we define the policy statement directly.

Third, we control access to this API by setting the Authorizer property. In the following example code, we allow only users authenticated by an Amazon Cognito user pool to call this API. The authorizer is defined in the global section because it’s shared by all functions.

Globals:
  # Other properties are omitted for brevity…
  Api:
    Auth:
      Authorizers:
        MyCognitoAuth:
          UserPoolArn: !GetAtt UserPool.Arn # Can also accept an array

Finally, we use the Layers section to install API dependencies. This reduces the code package size and the build time during the development cycle. The referred APIDependenciesLayer is defined as follows:

  APIDependenciesLayer:
    Type: AWS::Serverless::LayerVersion
    Properties:
      LayerName: APIDependencies
      Description: Dependencies for API
      ContentUri: dependencies/api
      CompatibleRuntimes:
        - python3.7
    Metadata:
      BuildMethod: python3.7 # This line tells SAM to install the library before packaging

Other APIs follow the same pattern. With this setup, our backend resources are managed in a .yaml file that you can version in Git and redeploy in any other account.
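
To show what a function behind POST /model might do with the TRAINING_STATE_MACHINE_ARN environment variable, the following Python sketch builds the arguments for a Step Functions StartExecution call. It is an illustration under stated assumptions, not the handler from the repository: the helper build_start_execution_args and the DataPath input field are hypothetical, while stateMachineArn, name, and input are the parameter names of the StartExecution API, and Mode is the field the Mode Choice state inspects.

```python
import json
import uuid


def build_start_execution_args(state_machine_arn: str, mode: str, s3_path: str) -> dict:
    """Build keyword arguments for a Step Functions StartExecution call."""
    return {
        "stateMachineArn": state_machine_arn,
        # A unique, human-readable execution name.
        "name": f"{mode.lower()}-{uuid.uuid4()}",
        # The state machine input must be passed as a JSON string.
        "input": json.dumps({"Mode": mode, "DataPath": s3_path}),
    }


start_args = build_start_execution_args(
    "arn:aws:states:us-east-1:123456789012:stateMachine:demo",  # placeholder ARN
    "MODEL",
    "s3://example-bucket/train.csv",  # placeholder path
)
print(json.loads(start_args["input"])["Mode"])  # MODEL
# With credentials configured, boto3.client("stepfunctions").start_execution(**start_args)
# would start the run.
```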

Build the front end and call the API

We build our front end using the React framework, which is hosted in an S3 bucket and CloudFront. We use the following template to deploy those resources and a shell script to build the static site and upload to the bucket.

We use the Amplify library to reduce coding efforts. We create a config file to specify which Amazon Cognito user pool to sign in to and which API Gateway URL to use. The example config file can be found here. The installation script generates the actual deployment file from the template and updates the pool ARN and URL automatically.

When we first open the website, we’re prompted to sign in with an Amazon Cognito user.

This authentication screen is generated by the Amplify library’s withAuthenticator() function in the App.js file. This function wraps the existing component and checks if the user has already logged in to the configured Amazon Cognito pool. If not, it shows the login screen before showing the component. See the following code:

import {withAuthenticator} from '@aws-amplify/ui-react';

// ...create an App that extends React.Component

// Wrap the application inside the Authenticator to require user to log in
export default withAuthenticator(withRouter(App));

After we sign in, the app component is displayed.

We can upload data to an S3 bucket and start HPO or train a new model. The UI also uses Amplify to upload data to Amazon S3. Amplify handles the authentication details for us, so we can easily upload files using the following code:

import { Storage } from "aws-amplify";

// … React logic to get file object when we click the Upload button
// objectKey is a placeholder: the key argument is missing from the original snippet
const stored = await Storage.vault.put(objectKey, file, {
    contentType: file.type,
});
// stored.key will be passed to API for training

After we train a model, we can switch to inference functionality by using the drop-down menu on the top right.

On the next page, we select the model endpoint that has the READY status. Then we need to change the number of inputs. The number of inputs has to be the same as the number of features in the input file used to train the model. For example, if your input file has 19 features and one target value, we need to enter the first 18 inputs. For the last input, we have a range for the values from 1.1, 1.2, 1.3, all the way to 3.0. The purpose of allowing the last input to vary in a certain range is to understand the effects of changing that parameter on the model outcomes.

When we choose Predict, the front end calls the API to retrieve the result and display it in a graph.

The graph shows the target value as a function of values for the last input. Here, we can discover how the last input affects the target value, for the first given 18 inputs.
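
The sweep over the last input can be sketched as follows. This is an assumption about how the rows could be built, since the post does not show how the backend expands the range: the helper build_sweep_rows is hypothetical, and the 18 fixed values are dummy numbers. Each row holds the 18 fixed inputs followed by one value of the swept input, from 1.1 to 3.0 in steps of 0.1.

```python
def build_sweep_rows(fixed_inputs, start=1.1, stop=3.0, step=0.1):
    """Return one input row per value of the swept (last) feature."""
    count = int(round((stop - start) / step)) + 1
    # Rounding removes floating-point drift such as 3.0000000000000004.
    sweep = [round(start + i * step, 10) for i in range(count)]
    return [list(fixed_inputs) + [value] for value in sweep]


fixed = [0.5] * 18  # dummy values for the first 18 inputs
rows = build_sweep_rows(fixed)
print(len(rows), len(rows[0]), rows[0][-1], rows[-1][-1])  # 20 19 1.1 3.0
```

Sending all rows to the endpoint in one request yields one prediction per swept value, which is what the graph plots.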

In the code, we also use Amplify to call the APIs. Just like in the Amazon S3 scenario, Amplify handles the authentication automatically, so we can call the API with the following code:

import { API } from "aws-amplify";

// Code to retrieve inputs and the selected endpoint from drop down box
const inferResult = await API.post("pyapi", `infer`, {
  body: {
    input: inputParam,
    modelName: selectedEndpoint,
    range: rangeInput
  }
});


Conclusion

In this post, we learned how to create a web application for performing custom deep learning model training and HPO using SageMaker. We learned how to orchestrate training, HPO, and endpoint creation using Step Functions. Finally, we learned how to create APIs and a web application to upload training data to Amazon S3, start and monitor training and HPO jobs, and perform inference.

Appendix A: Dockerize custom deep learning models on SageMaker

When working on deep learning projects, you can either use pre-built Docker images in SageMaker or build your own custom Docker image from scratch. In the latter case, you can still use SageMaker for training, hosting, and inference. This method allows developers and data scientists to package software into standardized units that run consistently on any platform that supports Docker. Containerization packages the code, runtime, system tools, system libraries, and settings all in the same place, isolating it from its surroundings, and ensures a consistent runtime regardless of where it runs.

When you develop a model in SageMaker, you can provide separate Docker images for the training code and the inference code, or you can combine them into a single Docker image. In this post, we build a single image to support both training and hosting.

We build on the approach used in the post Train and host Scikit-Learn models in Amazon SageMaker by building a Scikit Docker container, which uses an example container folder to explain how SageMaker runs Docker containers for training and hosting your own algorithms. We strongly recommend that you review that post first, because it contains many details about how to run Docker containers on SageMaker. In this post, we skip the details of how containers work on SageMaker and focus on how to create them from an existing notebook that runs locally. If you use the folder structure described in that post, the key files in the container folder are Dockerfile, train, predictor.py, serve, wsgi.py, and nginx.conf.

We use Flask to launch an API to serve HTTP requests for inference. If you choose to run Flask for your service, you can use the nginx.conf, serve, and wsgi.py files from the SageMaker sample notebooks as is.

Therefore, you only need to modify three files:

  • Dockerfile
  • train
  • predictor.py

We provide the local version of the code and briefly explain how to transform it into the train and predictor.py formats that you can use inside a Docker container. We recommend that you write your local code in a format that can be easily used in a Docker container. For training, there is no significant difference between the two versions (local vs. Docker). However, the inference code requires significant changes.

Before going into details of how to prepare the train and predictor.py files, let’s look at the Dockerfile, which is a modified version of the previous work:

FROM python:3.6

RUN apt-get -y update && apt-get install -y --no-install-recommends \
         wget \
         python \
         nginx \
         ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Install all of the packages
RUN wget && python

# install code dependencies
COPY "requirements.txt" .
RUN ["pip", "install", "-r", "requirements.txt"]

RUN pip list
# Env Variables
ENV PATH="/opt/ml:${PATH}"

# Set up the program in the image
COPY scripts /opt/ml
WORKDIR /opt/ml

We use a different name (scripts) for the folder that contains the train and inference scripts.

SageMaker stores external model artifacts, training data, and other configuration information available to Docker containers in /opt/ml/. This is also where SageMaker processes model artifacts. We create local folders /opt/ml/ to make local testing mode similar to what happens inside the Docker container.
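
The directory layout can be captured in a small helper so that the same code works locally and inside the container by changing only the prefix. The sketch below is illustrative: the helper sagemaker_paths and the channel name training are assumptions, while the subpaths under /opt/ml (input/config/hyperparameters.json, input/data/<channel>, model, and output) are the locations SageMaker uses for training containers.

```python
import posixpath


def sagemaker_paths(prefix: str = "/opt/ml/", channel: str = "training") -> dict:
    """Return the standard SageMaker container paths under the given prefix."""
    return {
        "hyperparameters": posixpath.join(prefix, "input/config/hyperparameters.json"),
        "training_data": posixpath.join(prefix, "input/data", channel),
        "model": posixpath.join(prefix, "model"),
        "output": posixpath.join(prefix, "output"),
    }


print(sagemaker_paths()["model"])               # /opt/ml/model
print(sagemaker_paths("../opt/ml/")["model"])   # ../opt/ml/model
```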

To understand how to modify your local code (in a Jupyter or SageMaker notebook) to be used in a Docker container, the easiest way is to compare it to what it looks like inside a Docker container.

The following notebook contains code (along with some dummy data after cloning the GitHub repo) for running Bayesian HPO and training for a deep learning regression model using Keras (with a TensorFlow backend) and Hyperopt library (for Bayesian HPO).

The notebook contains an example of running Bayesian HPO or training (referred to as Final Training in the code) for regression problems. Although HPO and Final Training are very similar processes, we treat these two differently in the code.

HPO and Final Training setup and parameters are quite similar. However, they have some important differences:

  • Only a fraction of the training data is used for HPO to reduce the runtime (controlled by the parameter used_data_percentage in the code).
  • Each iteration of HPO should run for only a very small number of epochs. The constructed networks allow different numbers of layers for the deep network (the optimal number of layers is found using HPO).
  • The number of nodes for each layer can be optimized.

For example, for a neural network with six dense layers, the network structure (controlled by user input) looks like the following visualizations.

The following image shows a neural network with five dense layers.

The following image shows a neural network with five dense layers, which also has dropout and batch normalization.

You have the option to include both dropout and batch normalization, only one of them, or neither in your network.

The notebook loads the required libraries (Section 1) and preprocesses the data (Section 2). In Section 3, we define the train_final_model function to perform a final training, and in Section 4, we define the objective function to perform Bayesian HPO. In both functions (Sections 3 and 4), we define network architectures (in the case of HPO in Section 4, we do it iteratively). You can evaluate the training and HPO using any metric. In this example, we are interested in minimizing the value of the 95% quantile for the mean absolute error. You can modify this based on your interests.
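
One way to read this metric is as the 95th percentile of the per-sample absolute errors. The following pure-Python sketch computes it with linear interpolation between neighboring sorted values. This interpretation and the function name absolute_error_quantile are assumptions; the notebook may compute the metric differently.

```python
def absolute_error_quantile(y_true, y_pred, q=0.95):
    """Return the q-quantile of |y_true - y_pred| using linear interpolation."""
    errors = sorted(abs(t - p) for t, p in zip(y_true, y_pred))
    if not errors:
        raise ValueError("need at least one prediction")
    position = q * (len(errors) - 1)
    lower = int(position)
    upper = min(lower + 1, len(errors) - 1)
    fraction = position - lower
    return errors[lower] + fraction * (errors[upper] - errors[lower])


# Dummy data: absolute errors are 0, 1, 2, 3, 4.
y_true = [10, 10, 10, 10, 10]
y_pred = [10, 11, 12, 13, 14]
print(round(absolute_error_quantile(y_true, y_pred), 2))  # 3.8
```

Compared with a plain mean, a high quantile penalizes models whose worst predictions are far off, even if the average error is small.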

Running this notebook up to Section 9 performs a training or HPO, based on the flag that you set up in the first line of code in Section 5 (currently defaulted to run the Final Training):

final_training = True

Every section in the notebook up to Section 9, except for Sections 5 and 8, is used as is (with no change) in the train script for the Docker container. Sections 5 and 8 have to be prepared differently for Docker. In Section 5, we define parameters for Final Training or HPO. In Section 8, we simply define the directories that contain the training data and the directories that the training or HPO artifacts are saved to. We create an opt/ml folder to mimic what happens in the Docker container, but we keep it outside of our main folder because it’s not required when Dockerizing.

To make the script in this notebook work in a Docker container, we need to modify Sections 5, 8, and 9. You can compare the difference in the train script. We have two new sections in the train script called 5-D and 8-D. D stands for the Docker version of the code (the order of sections has changed). Section 8-D defines directory names for storing the model artifacts. Therefore, you can use it with no changes for your future work. Section 5-D (the equivalent of Section 5 in the local notebook) might require modification for other use cases because it defines the hyperparameters that are ingested by our Docker container.

As an example of how to add a hyperparameter in Section 5-D, check the variable nb_epochs, which specifies the number of epochs that each HPO job runs:

nb_epochs = trainingParams.get('nb_epochs', None)
if nb_epochs is not None:
    nb_epochs = int(nb_epochs)
else:
    nb_epochs = 5

For your use case, you might need to process these parameters differently. For instance, the optimizer is specified as a list of strings. Therefore, we need an eval function to turn it into a proper format and use the default value ['adam'] when it’s not provided. See the following code:

optimizer = trainingParams.get('optimizer', None)
if optimizer is not None:
    optimizer = eval(optimizer)
else:
    optimizer = ['adam']
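
The two patterns above can be combined into one helper. The sketch below is a variation, not the code from the train script: the function parse_hyperparameters is hypothetical, and it swaps eval for ast.literal_eval, which accepts only Python literals and so avoids executing arbitrary expressions passed in as hyperparameters. It relies on the fact that SageMaker delivers hyperparameter values to the container as strings. The value 'sgd' is only an example.

```python
import ast


def parse_hyperparameters(training_params: dict) -> dict:
    """Convert string hyperparameters to typed values, applying defaults."""
    nb_epochs = int(training_params.get("nb_epochs", 5))
    raw_optimizer = training_params.get("optimizer")
    if raw_optimizer is not None:
        # literal_eval parses "['adam', 'sgd']" into a list without running code.
        optimizer = ast.literal_eval(raw_optimizer)
    else:
        optimizer = ["adam"]
    return {"nb_epochs": nb_epochs, "optimizer": optimizer}


print(parse_hyperparameters({"nb_epochs": "3", "optimizer": "['adam', 'sgd']"}))
print(parse_hyperparameters({}))
```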

Now let’s see how to write the inference code in local and Docker mode in Sections 10 and 11 of the notebook. This isn’t how you would normally write inference code locally, but if you’re working with Docker containers, we recommend writing your inference code as shown in Sections 10 and 11 so that you can quickly use it inside a Docker container.

In Section 10, we define the model_path to load the saved model using the loadmodel function. We use ScoringService to keep the local code similar to what we have in predictor.py. You might need to modify this class depending on which framework you’re using for creating your model. This has been modified from its original form to work for a Keras model.

Then we define transform_data to prepare data sent for inference. Here, we load the scaler.pkl to normalize our data in the same way we normalized our training data.

In Section 11, we define the transformation function, which performs inference by reading the df_test.csv file. We removed the column names (headers) in this file from the data. Running the transformation function returns an array of predictions.

To use this code in a Docker container, we need to modify the path in Section 10:

prefix = '../opt/ml/'

The code is modified to the following line (line 38) in predictor.py:

prefix = '/opt/ml/'

This is because in local mode, we keep the model artifact outside of the Docker files. We need to include an extra section (Section 10b-D in predictor.py), which wasn’t used in the notebook. This section can be used as is for other Docker containers as well. The next section that needs to be included in predictor.py is Section 11-D (a modified version of Section 11 in the notebook).

After making these changes, you can build your Docker container, push it to Amazon ECR, and test whether it can complete a training job and do inference. You can use the following notebook to test your Docker image.

About the Authors

Mehdi E. Far is a Sr Machine Learning Specialist SA within the Manufacturing and Industrial Global and Strategic Accounts organization. He helps customers build Machine Learning and Cloud solutions for their challenging problems.




Chadchapol Vittavutkarnvej is a Specialist Solutions Architect Builder based in Amsterdam, Netherlands.