How SNCF Réseau and Olexya migrated a Caffe2 vision pipeline to Managed Spot Training in Amazon SageMaker

This blog post is co-written by guest authors from SNCF and Olexya.

Transportation and logistics are fertile ground for machine learning (ML). In this post, we show how the French state-owned railway company Société Nationale des Chemins de fer Français (SNCF) uses ML from AWS with the help of its technology partner Olexya to research, develop, and deploy innovative computer vision solutions.

SNCF was founded in 1938 and employs more than 270,000 people. SNCF Réseau is a subsidiary of SNCF that manages and operates the infrastructure for the rail network. SNCF Réseau and its technology partner Olexya deploy innovative solutions to assist the operations of the infrastructure and keep the bar high for infrastructure safety and quality. The field teams detect anomalies in the infrastructure by using computer vision.

SNCF Réseau researchers have been doing ML for a long time. An SNCF Réseau team developed a computer vision detection model on premises using the Caffe2 deep learning framework. The scientists then reached out to SNCF Réseau technology partner Olexya to assist with the provisioning of GPU to support iteration on the model. To keep operational overhead low and productivity high while retaining full flexibility on the scientific code, Olexya decided to use Amazon SageMaker to orchestrate the training and inference of the Caffe2 model. The process involved the following steps:

Custom Docker creation.
Training configuration ingestion via an Amazon Simple Storage Service (Amazon S3) data channel.
Cost-efficient training via Amazon SageMaker Spot GPU training.
Cost-efficient inference with the Amazon SageMaker training API.

Custom docker creation

The team created a Docker image wrapping the original Caffe2 code that respected the Amazon SageMaker Docker specification. Amazon SageMaker can accommodate multiple data sources, and has advanced integration with Amazon S3. Datasets stored in Amazon S3 can be automatically ingested in training containers running on Amazon SageMaker. To be able to process training data available in Amazon S3, Olexya had to direct the training code to read from the associated local path opt/ml/input/data/<channel name>. Similarly, the model artifact writing location had to be set to opt/ml/model. That way, Amazon SageMaker can automatically compress and ship the trained model artifact to Amazon S3 when training is complete.

Training configuration ingestion via an Amazon S3 data channel

The original Caffe2 training code was parametrized with an exhaustive and flexible YAML configuration file, so that researchers could change model settings without altering the scientific code. This external file was easy to keep external and ingest at training time in the container via the use of data channels. Data channels are Amazon S3 ARNs passed to the Amazon SageMaker SDK at training time and ingested in the Amazon SageMaker container when training starts. Olexya configured the data channels to read via a copy (copy mode), which is the default configuration. It is also possible to stream the data via Unix pipes (Pipe mode).

Cost-efficient training via Amazon SageMaker Spot GPU training

The team configured training infrastructure to be an ml.p3.2xlarge GPU-accelerated compute instance. The Amazon SageMaker ml.p3.2xlarge compute instance is specifically adapted to deep learning computer vision workloads: it’s equipped with an NVIDIA V100 GPU featuring 5,120 cores and 16 GB of High-Bandwidth Memory (HBM), which enables the fast training of large models.

Furthermore, Amazon SageMaker training API calls were set with Managed Spot Instance usage activated, which contributed to a reported savings of 71% compared to the on-demand Amazon SageMaker price. Amazon SageMaker Managed Spot Training is an Amazon SageMaker feature that enables the use of Amazon Elastic Compute Cloud (Amazon EC2) Spot Instance capacity for training. Amazon EC2 Spot Instances allow you to purchase unused Amazon EC2 computer capacity at a highly-reduced rate. In Amazon SageMaker, Spot Instance usage is fully managed by the service, and you can invoke it by setting two training SDK parameters:

train_use_spot_instances=True to request usage of Amazon SageMaker Spot Instances
train_max_wait set to the maximal acceptable waiting time in seconds

Cost-efficient inference with the Amazon SageMaker training API

In this research initiative, inference interruptions and delayed instantiation were acceptable to the end-user. Consequently, to further optimize costs, the team also used the Amazon SageMaker training API to run inference code, so that managed Amazon SageMaker Spot Instances could be used for inference too. Using the training API came with the additional benefit of a simpler learning curve because the same API is used for both steps of the model life cycle.

Time and cost savings

By applying those four steps, Olexya successfully ported an on-premises Caffe2 deep computer vision detection model to Amazon SageMaker for both training and inference. More impressively, the team completed that onboarding in about 3 weeks, and reported that the training time of the model was reduced from 3 days to 10 hours! The team further estimated that Amazon SageMaker allows a 71% total cost of ownership (TCO) reduction compared the locally available on-premises GPU fleet. A number of extra optimization techniques could reduce the costs even more, such as intelligent hyperparameter search with Amazon SageMaker automatic model tuning and mixed-precision training with the deep learning frameworks that support it.

In addition to SNCF Réseau, numerous AWS customers operating in transportation and logistics have improved their operations and delivered innovation by applying ML to their business. For example:

The Dubai-based logistics company Aramex uses ML for address parsing and transit time prediction. The company reports having 150 models in use, doing 450,000 predictions per day.
Transport for New South Wales uses the cloud to predict patronage numbers across the entire transport network, which enables the agency to better plan workforce and asset utilization and improve customer satisfaction.
Korean Air launched innovative projects with Amazon SageMaker to help predict and preempt maintenance for its aircraft fleet.

Conclusion

Amazon SageMaker supports the whole ML development cycle, from annotation to production deployment and monitoring. As illustrated by the work of Olexya and SNCF Réseau, Amazon SageMaker is framework-agnostic and accommodates a variety of deep learning workloads and frameworks. Although Docker images and SDK objects have been created to closely support Sklearn, TensorFlow, PyTorch, MXNet, XGBoost, and Chainer, you can bring custom Docker containers to onboard virtually any framework, such as PaddlePaddle, Catboost, R, or Caffe2. If you are an ML practitioner, don’t hesitate to test the service, and let us know what you build!

About the Authors

Olivier Cruchant is a Machine Learning Specialist Solution Architect at AWS, based in Lyon, France. Olivier helps French customers – from small startups to large enterprises – develop and deploy production-grade machine learning applications. In his spare time, he enjoys reading research papers and exploring the wilderness with friends and family.

Samuel Descroix is head manager of the Geographic and Analytic Data department at SNCF Réseau. He is in charge of all project teams and infrastructures. To be able to answer to all new use cases, he is constantly looking for most innovative and most relevant solutions to manage growing volumes and needs of complex analysis.

Alain Rivero is Project Manager in the Technology and Digital Transformation (TTD) department within the General Industrial and Engineering Department of SNCF Réseau. He manages projects implementing in-depth learning solutions to detect defects on rolling stock and tracks to increase traffic safety and guide decision-making within maintenance teams. His research focuses on image processing methods, supervised and unsupervised learning and their applications.

Pierre-Yves Bonnefoy is data architect at Olexya, currently working for SNCF Réseau IT department. One of his main assignments is to provide environments and sets of datas for Data Scientists and Data Analysts to work on complex analysis, and to help them with software solutions. Thanks to his large range of skills in development and system architecture, he accelerated the deployment of the project on Sagemaker instances, rationalization of costs and optimization of performance.

Emeric Chaize is certified Solution Architect in Olexya, currently working for SNCF Réseau IT department. He is in charge of Data Migration Project for IT Data Departement, with the responsabilty of covering all needs and usages of the company in data analysis. He defines and plans deployment of all the needed infrastructure for projects and Data Scientists.

AWS Machine Learning Blog