Onboard PaddleOCR with Amazon SageMaker Projects for MLOps to perform optical character recognition on identity documents
Optical character recognition (OCR) is the task of converting printed or handwritten text into machine-encoded text. OCR has been widely used in various scenarios, such as document electronization and identity authentication. Because OCR can greatly reduce the manual effort to register key information and serve as an entry step for understanding large volumes of documents, an accurate OCR system plays a crucial role in the era of digital transformation.
The open-source community and researchers are concentrating on how to improve OCR accuracy, ease of use, integration with pre-trained models, extension, and flexibility. Among many proposed frameworks, PaddleOCR has gained increasing attention recently. The proposed framework concentrates on obtaining high accuracy while balancing computational efficiency. In addition, the pre-trained models for Chinese and English make it popular in the Chinese language-based market. See the PaddleOCR GitHub repo for more details.
At AWS, we have also proposed integrated AI services that are ready to use with no machine learning (ML) expertise. To extract text and structured data such as tables and forms from documents, you can use Amazon Textract. It uses ML techniques to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort.
For the data scientists who want the flexibility to use an open-source framework to develop your own OCR model, we also offer the fully managed ML service Amazon SageMaker. SageMaker enables you to implement MLOps best practices throughout the ML lifecycle, and provides templates and toolsets to reduce the undifferentiated heavy lifting to put ML projects in production.
In this post, we concentrate on developing customized models within the PaddleOCR framework on SageMaker. We walk through the ML development lifecycle to illustrate how SageMaker can help you build and train a model, and eventually deploy the model as a web service. Although we illustrate this solution with PaddleOCR, the general guidance is true for arbitrary frameworks to be used on SageMaker. To accompany this post, we also provide sample code in the GitHub repository.
As a widely adopted OCR framework, PaddleOCR contains rich text detection, text recognition, and end-to-end algorithms. It chooses Differentiable Binarization (DB) and Convolutional Recurrent Neural Network (CRNN) as the basic detection and recognition models, and proposes a series of models, named PP-OCR, for industrial applications after a series of optimization strategies.
The PP-OCR model is aimed at general scenarios and forms a model library of different languages. It consists of three parts: text detection, box detection and rectification, and text recognition, illustrated in the following figure on the PaddleOCR official GitHub repository. You can also refer to the research paper PP-OCR: A Practical Ultra Lightweight OCR System for more information.
To be more specific, PaddleOCR consists of three consecutive tasks:
- Text detection – The purpose of text detection is to locate the text area in the image. Such tasks can be based on a simple segmentation network.
- Box detection and rectification – Each text box needs to be transformed into a horizontal rectangle box for subsequent text recognition. To do this, PaddleOCR proposes to train a text direction classifier (image classification task) to determine the text direction.
- Text recognition – After the text box is detected, the text recognizer model performs inference on each text box and outputs the results according to text box location. PaddleOCR adopts the widely used method CRNN.
PaddleOCR provides high-quality pre-trained models that are comparable to commercial effects. You can either use the pre-trained model for a detection model, direction classifier, or recognition model, or you can fine tune and retrain each individual model to serve your use case. To increase the efficiency and effectiveness of detecting Traditional Chinese and English, we illustrate how to fine-tune the text recognition model. The pre-trained model we choose is ch_ppocr_mobile_v2.0_rec_train, which is a lightweight model, supporting Chinese, English, and number recognition. The following is an example inference result using a Hong Kong identity card.
In the following sections, we walk through how to fine-tune the pre-trained model using SageMaker.
MLOps best practices with SageMaker
SageMaker is a fully managed ML service. With SageMaker, data scientists and developers can quickly and easily build and train ML models, and then directly deploy them into a production-ready managed environment.
Many data scientists use SageMaker for accelerating the ML lifecycle. In this section, we illustrate how SageMaker can help you from experimentation to productionalizing ML. Following the standard steps of an ML project, from the experimental phrase (code development and experiments), to the operational phrase (automatization of the model build workflow and deployment pipelines), SageMaker can bring efficiency in the following steps:
- Explore the data and build the ML code with Amazon SageMaker Studio notebooks.
- Train and tune the model with a SageMaker training job.
- Deploy the model with an SageMaker endpoint for model serving.
- Orchestrate the workflow with Amazon SageMaker Pipelines.
The following diagram illustrates this architecture and workflow.
It’s important to note that you can use SageMaker in a modular way. For example, you can build your code with a local integrated development environment (IDE) and train and deploy your model on SageMaker, or you can develop and train your model in your own cluster compute sources, and use a SageMaker pipeline for workflow orchestration and deploy on a SageMaker endpoint. This means that SageMaker provides an open platform to adapt for your own requirements.
See the code in our GitHub repository and README to understand the code structure.
Provision a SageMaker project
You can use Amazon SageMaker Projects to start your journey. With a SageMaker project, you can manage the versions for your Git repositories so you can collaborate across teams more efficiently, ensure code consistency, and enable continuous integration and continuous delivery (CI/CD). Although notebooks are helpful for model building and experimentation, when you have a team of data scientists and ML engineers working on an ML problem, you need a more scalable way to maintain code consistency and have stricter version control.
SageMaker projects create a preconfigured MLOps template, which includes the essential components for simplifying the PaddleOCR integration:
- A code repository to build custom container images for processing, training, and inference, integrated with CI/CD tools. This allows us to configure our custom Docker image and push to Amazon Elastic Container Registry (Amazon ECR) to be ready to use.
- A SageMaker pipeline that defines steps for data preparation, training, model evaluation, and model registration. This prepares us to be MLOps ready when the ML project goes to production.
- Other useful resources, such as a Git repository for code version control, model group that contains model versions, code change trigger for the model build pipeline, and event-based trigger for the model deployment pipeline.
You can use SageMaker seed code to create standard SageMaker projects, or a specific template that your organization created for team members. In this post, we use the standard MLOps template for image building, model building, and model deployment. For more information about creating a project in Studio, refer to Create an MLOps Project using Amazon SageMaker Studio.
Explore data and build ML code with SageMaker Studio Notebooks
SageMaker Studio notebooks are collaborative notebooks that you can launch quickly because you don’t need to set up compute instances and file storage beforehand. Many data scientists prefer to use this web-based IDE for developing the ML code, quickly debugging the library API, and getting things running with a small sample of data to validate the training script.
In Studio notebooks, you can use a pre-built environment for common frameworks such as TensorFlow, PyTorch, Pandas, and Scikit-Learn. You can install the dependencies to the pre-built kernel, or build up your own persistent kernel image. For more information, refer to Install External Libraries and Kernels in Amazon SageMaker Studio. Studio notebooks also provide a Python environment to trigger SageMaker training jobs, deployment, or other AWS services. In the following sections, we illustrate how to use Studio notebooks as an environment to trigger training and deployment jobs.
SageMaker provides a powerful IDE; it’s an open ML platform where data scientists have the flexibility to use their preferred development environment. For data scientists who prefer a local IDE such as PyCharm or Visual Studio Code, you can use the local Python environment to develop your ML code, and use SageMaker for training in a managed scalable environment. For more information, see Run your TensorFlow job on Amazon SageMaker with a PyCharm IDE. After you have a solid model, you can adopt the MLOps best practices with SageMaker.
Currently, SageMaker also provides SageMaker notebook instances as our legacy solution for the Jupyter Notebook environment. You have the flexibility to run the Docker build command and use SageMaker local mode to train on your notebook instance. We also provide sample code for PaddleOCR in our code repository: ./train_and_deploy/notebook.ipynb.
Build a custom image with a SageMaker project template
SageMaker makes extensive use of Docker containers for build and runtime tasks. You can run your own container with SageMaker easily. See more technical details at Use Your Own Training Algorithms.
However, as a data scientist, building a container might not be straightforward. SageMaker projects provide a simple way for you to manage custom dependencies through an image building CI/CD pipeline. When you use a SageMaker project, you can make updates to the training image with your custom container Dockerfile. For step-by-step instructions, refer to Create Amazon SageMaker projects with image building CI/CD pipelines. With the structure provided in the template, you can modify the provided code in this repository to build a PaddleOCR training container.
For this post, we showcase the simplicity of building a custom image for processing, training, and inference. The GitHub repo contains three folders:
- Preprocessing – image-build-process/
- Training – image-build-train/
- Inference – image-build-inference/
These projects follow a similar structure. Take the training container image as an example; the
image-build-train/ repository contains the following files:
- The codebuild-buildspec.yml file, which is used to configure AWS CodeBuild so that the image can be built and pushed to Amazon ECR.
- The Dockerfile used for the Docker build, which contains all dependencies and the training code.
- The train.py entry point for training script, with all hyperparameters (such as learning rate and batch size) that can be configured as an argument. These arguments are specified when you start the training job.
- The dependencies.
When you push the code into the corresponding repository, it triggers AWS CodePipeline to build a training container for you. The custom container image is stored in an Amazon ECR repository, as illustrated in the previous figure. A similar procedure is adopted for generating the inference image.
Train the model with the SageMaker training SDK
After your algorithm code is validated and packaged into a container, you can use a SageMaker training job to provision a managed environment to train the model. This environment is ephemeral, meaning that you can have separate, secure compute resources (such as GPU) or a Multi-GPU distributed environment to run your code. When the training is complete, SageMaker saves the resulting model artifacts to an Amazon Simple Storage Service (Amazon S3) location that you specify. All the log data and metadata persist on the AWS Management Console, Studio, and Amazon CloudWatch.
The training job includes several important pieces of information:
- The URL of the S3 bucket where you stored the training data
- The URL of the S3 bucket where you want to store the output of the job
- The managed compute resources that you want SageMaker to use for model training
- The Amazon ECR path where the training container is stored
SageMaker makes the hyperparameters in a
CreateTrainingJob request available in the Docker container in the
We use the custom training container as the entry point and specify a GPU environment for the infrastructure. All relevant hyperparameters are detailed as parameters, which allows us to track each individual job configuration, and compare them with the experiment tracking.
Because the data science process is very research-oriented, it’s common that multiple experiments are running in parallel. This requires an approach that keeps track of all the different experiments, different algorithms, and potentially different datasets and hyperparameters attempted. Amazon SageMaker Experiments lets you organize, track, compare, and evaluate your ML experiments. We demonstrate this as well in experiments-train-notebook.ipynb. For more details, refer to Manage Machine Learning with Amazon SageMaker Experiments.
Deploy the model for model serving
As for deployment, especially for real-time model serving, many data scientists might find it hard to do without help from operation teams. SageMaker makes it simple to deploy your trained model into production with the SageMaker Python SDK. You can deploy your model to SageMaker hosting services and get an endpoint to use for real-time inference.
In many organizations, data scientists might not be responsible for maintaining the endpoint infrastructure. However, testing your model as an endpoint and guaranteeing the correct prediction behaviors is indeed the responsibility of data scientists. Therefore, SageMaker simplified the tasks for deploying by adding a set of tools and SDK for this.
For the use case in the post, we want to have real-time, interactive, low-latency capabilities. Real-time inference is ideal for this inference workload. However, there are many options adapting to each specific requirement. For more information, refer to Deploy Models for Inference.
To deploy the custom image, data scientists can use the SageMaker SDK, illustrated at
create_model request, the container definition includes the
ModelDataUrl parameter, which identifies the Amazon S3 location where model artifacts are stored. SageMaker uses this information to determine from where to copy the model artifacts. It copies the artifacts to the
/opt/ml/model directory for use by your inference code. The
predictor.py is the entry point for serving, with the model artifact that is loaded when you start the deployment. For more information, see Use Your Own Inference Code with Hosting Services.
Orchestrate your workflow with SageMaker Pipelines
The last step is to wrap your code as end-to-end ML workflows, and to apply MLOps best practices. In SageMaker, the model building workload, a directed acyclic graph (DAG), is managed by SageMaker Pipelines. Pipelines is a fully managed service supporting orchestration and data lineage tracking. In addition, because Pipelines is integrated with the SageMaker Python SDK, you can create your pipelines programmatically using a high-level Python interface that we used previously during the training step.
We provide an example of pipeline code to illustrate the implementation at pipeline.py.
The pipeline includes a preprocessing step for dataset generation, training step, condition step, and model registration step. At the end of each pipeline run, data scientists may want to register their model for version controls and deploy the best performing one. The SageMaker model registry provides a central place to manage model versions, catalog models, and trigger automated model deployment with approval status of a specific model. For more details, refer to Register and Deploy Models with Model Registry.
In an ML system, automated workflow orchestration helps prevent model performance degradation, in other words model drift. Early and proactive detection of data deviations enables you to take corrective actions, such as retraining models. You can trigger the SageMaker pipeline to retrain a new version of the model after deviations have been detected. The trigger of a pipeline can be also determined by Amazon SageMaker Model Monitor, which continuously monitors the quality of models in production. With the data capture capability to record information, Model Monitor supports data and model quality monitoring, bias, and feature attribution drift monitoring. For more details, see Monitor models for data and model quality, bias, and explainability.
In this post, we illustrated how to run the framework PaddleOCR on SageMaker for OCR tasks. To help data scientists easily onboard SageMaker, we walked through the ML development lifecycle, from building algorithms, to training, to hosting the model as a web service for real-time inference. You can use the template code we provided to migrate an arbitrary framework onto the SageMaker platform. Try it out for your ML project and let us know your success stories.
About the Authors
Junyi(Jackie) LIU is an Senior Applied Scientist at AWS. She has many years of working experience in the field of machine learning. She has rich practical experience in the development and implementation of solutions in the construction of machine learning models in supply chain prediction algorithms, advertising recommendation systems, OCR and NLP area.
Yanwei Cui, PhD, is a Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building artificial intelligence powered industrial applications in computer vision, natural language processing and online user behavior prediction. At AWS, he shares the domain expertise and helps customers to unlock business potentials, and to drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.
Yi-An CHEN is a Software Developer at Amazon Lab 126. She has more than 10 years experience in developing machine learning driven products across diverse disciplines, including personalization, natural language processing and computer vision. Outside of work, she likes to do long running and biking.