AWS for Industries
Bayer Creates Secure Self-Service Solution for Data Scientists on AWS
This blog is guest authored by Dr. Stefan Schmitz, lead product owner of Bayer’s cross-divisional data science platform, and Maciej Wroblewski, AWS architect from the Accenture Advanced Technology Center.
Leading global life sciences organization Bayer has more than 150 years of history and expertise in health care and agriculture. To accelerate the adoption of advanced analytical methods and machine learning across the organization, Bayer needed a self-service solution that enabled data scientists to easily build, train, and deploy models without having to configure underlying resources.
Bayer built a cross-functional data science platform to provide curated, self-service access to a range of Amazon Web Services (AWS) capabilities, enabling data scientists to create projects and environments for their daily operations. The platform has simplified everyday tasks (such as data preparation, modeling, and analysis) and allowed Bayer data scientists to fully manage their API endpoints, freeing time to focus on developing and deploying business-critical solutions. The platform, which initially launched with access for 100 data scientists, has grown to 1,000+ users and is scaling to run numerous parallel projects using 1,000+ Amazon Elastic Compute Cloud (Amazon EC2) instances.
Creating a Self-Service Platform for Data Scientists
Bayer is one of the world’s largest life sciences organizations, with operations spread across 83 countries, and its data science teams needed greater efficiency and cost optimization in their operations. Bayer’s centralized, self-service data analytics platform allows data scientists to access a curated set of the technologies and IT capabilities they need, while adhering to corporate compliance and security standards.
“Using this platform, teams no longer need to duplicate efforts and costs of setting up the basic infrastructure components and services for individual projects,” says Dr. Stefan Schmitz, lead product owner of Bayer’s cross-divisional data science platform. “There’s no need to reinvent the wheel over and over again.” In addition, data scientists can control the full lifecycle of their models and manage compute-intensive projects by spinning up customizable instances with preconfigured tooling.
Built on AWS, the platform gives access to secure and resizable compute capacity using Amazon EC2. It provides a multi-tenant configuration within Amazon Elastic Kubernetes Service (Amazon EKS), a managed service that starts, runs, and scales Kubernetes. A multi-namespace configuration lets the various user groups create logically separated environments within a single Kubernetes cluster to host analytical models and run containerized applications and processing jobs.
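The namespace-per-group idea can be illustrated with a minimal sketch that generates Kubernetes manifests for one tenant. This is not Bayer's actual configuration: the `tenant` label, the quota values, and the function names are assumptions made for illustration, and the source does not state whether resource quotas are used at all.

```python
import re
from typing import Any, Dict, List


def namespace_name(group: str) -> str:
    """Turn a user-group name into a valid Kubernetes namespace name.

    Namespace names must be DNS-1123 labels: lowercase letters, digits,
    and hyphens, at most 63 characters, starting and ending alphanumerically.
    """
    name = re.sub(r"[^a-z0-9-]+", "-", group.lower()).strip("-")
    if not name:
        raise ValueError(f"cannot derive a namespace name from {group!r}")
    return name[:63].rstrip("-")


def tenant_manifests(
    group: str, cpu_limit: str = "32", memory_limit: str = "128Gi"
) -> List[Dict[str, Any]]:
    """Build a Namespace and a ResourceQuota manifest for one user group.

    The quota values and the "tenant" label are hypothetical defaults.
    """
    ns = namespace_name(group)
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": ns, "labels": {"tenant": ns}},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{ns}-quota", "namespace": ns},
        "spec": {"hard": {"limits.cpu": cpu_limit, "limits.memory": memory_limit}},
    }
    return [namespace, quota]
```

The returned dictionaries could be serialized to YAML and applied to the cluster, one pair per user group, so that each group works inside its own logically separated environment.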
The platform also uses AWS Batch, which provides batch processing, ML model training, and inference at scale. With this, data scientists can scale their workloads horizontally and process long-running, compute-intensive tasks asynchronously and automatically in the background, without having to worry about scheduling and provisioning jobs. In addition, individual projects receive access to temporary storage to stage their input data and store intermediate results through dedicated buckets in Amazon Simple Storage Service (Amazon S3), an object storage service built to store and retrieve any amount of data from anywhere.
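As a rough sketch of how a project might hand a long-running task to AWS Batch, the function below assembles the request for the `submit_job` operation. The parameter names (`jobName`, `jobQueue`, `jobDefinition`, `containerOverrides`, `arrayProperties`) belong to the AWS Batch API; the queue and job definition naming scheme, the `INPUT_PREFIX` variable, and the use of array jobs for horizontal scaling are assumptions, not details from Bayer's platform.

```python
from typing import Any, Dict


def build_training_job(project: str, s3_prefix: str, shards: int = 1) -> Dict[str, Any]:
    """Assemble keyword arguments for the AWS Batch submit_job operation.

    The "<project>-queue" and "<project>-train-def" names are hypothetical;
    they stand in for a job queue and job definition created beforehand.
    """
    request: Dict[str, Any] = {
        "jobName": f"{project}-train",
        "jobQueue": f"{project}-queue",
        "jobDefinition": f"{project}-train-def",
        "containerOverrides": {
            # Tell the container where the staged input data lives in Amazon S3.
            "environment": [{"name": "INPUT_PREFIX", "value": s3_prefix}],
        },
    }
    if shards >= 2:
        # An array job runs `shards` child jobs in parallel; each child reads
        # its index from the AWS_BATCH_JOB_ARRAY_INDEX environment variable.
        request["arrayProperties"] = {"size": shards}
    return request


# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("batch").submit_job(**build_training_job("demo", "s3://bucket/in/", 4))
```

Keeping the request construction separate from the API call makes the job parameters easy to inspect and test without touching an AWS account.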
“Our main priority while working with Bayer to design the platform was to simplify the typical operations executed by data scientists,” says Maciej Wroblewski, AWS architect from the Accenture Advanced Technology Center, which collaborated with Bayer to envision, implement, and support development of the platform. “With this platform, Bayer is empowering its data scientists to focus on data processing instead of the deployment of infrastructure components.”
Bayer hosts its computing resources in three AWS Regions spread across the world, ensuring low latencies and helping to meet the local data-processing requirements of specific countries.
Controlling Access to Individual Models through Amazon API Gateway
As the platform grew in size and scale, Bayer’s technical team realized that data scientists were interacting with a sizable number of production models through APIs that lacked a consistent URL structure and didn’t scale well to larger numbers of parallel deployments. “We realized that scalability of API deployments and access management for APIs were becoming more relevant,” Schmitz says. “Data scientists needed more governance and control over how individual model APIs were being used.”
The team then developed a self-service API Management Service to address the data scientists’ needs. Using the REST API service within Amazon API Gateway, the platform provides data scientists with secure access to the models behind the APIs. While these models are deployed on top of the Kubernetes cluster in most cases, the platform offers flexibility to configure other targets.
The Custom Authorizer feature within Amazon API Gateway makes it possible to customize the identity-based policies that control who can access particular API endpoints for specific models. Amazon API Gateway not only authorizes incoming requests but also extracts details for logging, such as the model requested, the URL, the HTTP method, and user information. The platform integrates with Bayer’s corporate security standards to allow access controls through their Active Directory groups. Dedicated Amazon Virtual Private Cloud (Amazon VPC) links route traffic through a Network Load Balancer to the API services that are running within a single namespace on the Kubernetes cluster, which allows for separation between different user groups.
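A custom authorizer (called a Lambda authorizer in current API Gateway documentation) is a function that receives the incoming request and returns an IAM policy allowing or denying the call. The sketch below shows the general shape of such a handler for a token-based authorizer. The lookup tables, the `models/<model>/...` path convention, and the group names are hypothetical; a real deployment would resolve users and groups through a directory service rather than in-memory dictionaries.

```python
from typing import Any, Dict, Set, Tuple

# Hypothetical stand-ins for a directory lookup and a policy store.
TOKEN_TO_USER: Dict[str, Tuple[str, Set[str]]] = {
    "token-alice": ("alice", {"ds-pharma"}),
}
MODEL_TO_GROUPS: Dict[str, Set[str]] = {
    "cddd": {"ds-pharma"},
}


def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    """Return an allow or deny policy for the requested model endpoint."""
    method_arn = event["methodArn"]
    # ARN layout: arn:aws:execute-api:<region>:<account>:<apiId>/<stage>/<METHOD>/<path...>
    _api_id, _stage, http_method, *path = method_arn.split(":", 5)[5].split("/")
    # Assumed URL convention: /models/<model>/...
    model = path[1] if len(path) > 1 and path[0] == "models" else ""

    token = event.get("authorizationToken", "")
    user, groups = TOKEN_TO_USER.get(token, ("anonymous", set()))
    allowed = bool(groups & MODEL_TO_GROUPS.get(model, set()))

    return {
        "principalId": user,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": "Allow" if allowed else "Deny",
                    "Resource": method_arn,
                }
            ],
        },
        # Context values are passed along with the request and can be logged.
        "context": {"model": model, "httpMethod": http_method, "user": user},
    }
```

Returning the model name, HTTP method, and user in `context` mirrors the logging details mentioned above, since API Gateway makes these values available to access logs and to the backend integration.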
The solution fully automates the provisioning and configuration of underlying resources and services. Model developers self-register their own API endpoints so they can be invoked for inference. As part of this process, developers can associate their models with access policies that are designed and maintained by dedicated policy stewards. Kubernetes namespaces in Amazon EKS and AWS Identity and Access Management (IAM) policies provide secure access to the models and data within an AWS account that is shared by many tenants, while isolating the tenants from one another.
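The self-registration flow can be pictured as a small registry that enforces a consistent URL structure and only accepts access policies that policy stewards have already defined. The following is an in-memory sketch under those assumptions; the class, the `/<namespace>/<model>/v<version>` path scheme, and the policy names are illustrative and do not describe the platform's actual interface.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class EndpointRegistry:
    """Toy registry mapping model API paths to steward-maintained policies."""

    policies: Set[str]  # names of access policies defined by policy stewards
    endpoints: Dict[str, str] = field(default_factory=dict)  # path -> policy

    def register(self, namespace: str, model: str, version: int, policy: str) -> str:
        """Register a model endpoint and return its URL path."""
        if policy not in self.policies:
            raise KeyError(f"unknown access policy: {policy}")
        # Hypothetical convention that gives every model a predictable URL.
        path = f"/{namespace}/{model}/v{version}"
        if path in self.endpoints:
            raise ValueError(f"endpoint already registered: {path}")
        self.endpoints[path] = policy
        return path

    def delete(self, path: str) -> None:
        """Remove an endpoint; raises KeyError if it was never registered."""
        del self.endpoints[path]
```

A developer would register a model once, for example `registry.register("pharma", "cddd", 1, "pharma-research-readers")`, and could later delete or re-register it without involving a platform team.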
“Bayer scientists have the power,” says Wroblewski. “They have the permissions to create, modify, and delete all the API endpoints without needing to reach out to any technical team to maintain the platform. They can do it on their own.”
Growth in Adoption and Future Outlook
While the platform is already seeing much success across the Pharmaceuticals, Consumer Health, and Enabling Function divisions, its stakeholder base is still growing in multiple areas. For example, Bayer Pharmaceuticals Research recently decided to adopt the new platform capabilities for model API management discussed above to maintain and govern their analytical models. Dr. Andreas Poehlmann, research scientist in the Machine Learning Research Group at Bayer Pharmaceuticals, says, “This setup streamlines the way we deploy our model APIs internally to make them easily available for researchers at Bayer while ensuring compliance. Having this centralized data science platform available allows us to quickly move from prototype to production and lets us focus on solving the scientific questions. For example, we’re already using it to make an in-house-developed and open-sourced molecular representation extraction tool available to our scientists, which allows extracting CDDDs (continuous and data-driven molecular descriptors) from molecular SMILES (Simplified Molecular Input Line Entry Specification) strings. It’s great to have these data-science services provided internally through a unified and centralized platform at Bayer.”
Bayer’s cross-divisional data science platform will continue to evolve. The 2023 roadmap includes topics like the integration of Amazon SageMaker, which will let data scientists build, train, and deploy ML models with fully managed infrastructure, tools, and workflows. “We provide capabilities which are well documented and built on well-tested, scalable services from AWS,” says Schmitz. “Our platform has a strong effect on how models are developed and deployed. We will continue to make things easier for data scientists, and further accelerate the adoption of advanced analytical methods and machine learning across the organization.”