AWS Big Data Blog
How Bayer transforms Pharma R&D with a cloud-based data science ecosystem using Amazon SageMaker
This post was written with Avinash Erupaka from Bayer (IT PH, Drug Innovation platform)
How can pharmaceutical companies unlock the full potential of their data to drive breakthrough innovations? Bayer, a global leader in health and nutrition, is dedicated to tackling the pressing challenges of our time, including a growing and aging population and the strain on our planet’s ecosystems. Its mission of “Health for All, Hunger for None” drives its commitment to addressing societal and environmental needs through groundbreaking research. Bayer is focused on developing innovative solutions that make a tangible difference in the world and create value for its customers, employees, and stakeholders. Headquartered in Leverkusen, Germany, Bayer operates across 80 countries and is pioneering a data science ecosystem that transforms how research teams access, analyze, and derive insights from complex scientific data.
By harnessing the power of data, analytics, artificial intelligence and machine learning (AI/ML), and generative AI, Bayer is creating a cloud-based Pharma R&D Data Science Ecosystem (DSE) on AWS that underpins cutting-edge technologies and concepts with robust data management. This helps R&D teams fully realize the potential of unified data and analytics.
In this post, we discuss how Bayer used the next generation of Amazon SageMaker to build a solution that unifies data ingestion, storage, analytics, and AI/ML workflows. Built on data mesh principles, Bayer’s DSE enables agile experimentation and scalable insight generation. It democratizes access to analytics, fosters cross-Region collaboration, and provides flexible integration of structured, semi-structured, and unstructured data.
Challenges in pharmaceutical research
In pharmaceutical research, data has become the most critical asset for driving innovation. However, managing this data effectively presents unprecedented challenges, and traditional data management approaches are becoming increasingly inadequate for complex, global research initiatives. Many pharma R&D organizations face a complex ecosystem of data- and analytics-related obstacles that hinder scientific discovery and operational efficiency:
- Siloed datasets – Research datasets are siloed across domains, limiting reuse and slowing discovery.
- Multiple data modalities – Clinical trial data (structured), real-world evidence (semi-structured), and genomic files (unstructured) exist in isolation, complicating integration and analysis.
- Inflexible ingestion capabilities – Systems struggle to support batch processing (such as trial data), real-time data streams (for example, from lab equipment), and event-driven ingestion (such as regulatory updates) side by side.
- Rising R&D costs – Disparate technologies and disconnected systems create operational inefficiencies and increased licensing and maintenance costs.
- Inconsistent landscape to fully use ML – The absence of a unified data architecture and standardized, domain-agnostic MLOps workflows means that data and analytics innovation is often ad hoc and non-repeatable. Teams lack a streamlined way to scale successful patterns, resulting in redundant efforts, longer development cycles, and missed opportunities for cross-domain synergy.
- Disconnected architectures – Software solutions are not integrated into the wider unified ecosystem, resulting in silos, redundancies, and inefficiencies.
Recognizing these systemic challenges, Bayer embarked on a transformative journey. DSE is not just a technological solution, but a strategic reimagining of how research data and analytics could be used across a global organization. By bringing together cutting-edge technologies, standardized frameworks, a collaborative data mesh, and lakehouse architecture, Bayer set out to help researchers and engineers accelerate pharmaceutical innovation.
Finding a solution with the next generation of SageMaker
Bayer envisioned a unified data science ecosystem that would provide the following:
- A unified collaborative development experience for all data scientists regardless of their location or specialization
- Seamless access to both structured and unstructured data through a consistent interface
- Built-in governance and compliance controls appropriate for pharmaceutical research
- Scalable compute resources to handle the most complex analytical workloads
Bayer conducted a comprehensive evaluation of various solutions before selecting the next generation of SageMaker as the cornerstone of their new data science ecosystem. Although other options had merits, Bayer prioritized the following capabilities:
- Access to multimodal data – Essential for genomics, proteomics, and advanced biomarker research
- Centralized asset marketplace – Central hub to discover and reuse data, features, models, and other enterprise assets
- Integrated tooling ecosystem – Streamlined access to key tools like Git, ETL, MLflow, and generative AI application builders in one place
- Multi-domain and cross-Region support – Critical for global research collaboration
- Price-performance – Necessary for sustainable, long-term scaling
The capabilities of Amazon SageMaker Unified Studio and Amazon SageMaker Catalog aligned with Bayer’s vision of decentralized mesh execution combined with centralized discovery and governance. They enabled teams to work with their preferred tools, such as Jupyter Notebooks or workflow builders, while maintaining discoverability and reusability of assets.
Solution overview
This section describes the key features and architecture of Bayer’s DSE built on SageMaker. The DSE solution addresses the identified challenges through a multi-layered architecture:
- Breaking down data silos – The solution’s multimodal data ingestion capabilities break down data silos by enabling unified storage and processing of structured, semi-structured, and unstructured data through batch, streaming, and event-driven pipelines.
- Handling diverse data modalities – A hybrid lakehouse architecture, built on Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and Amazon Redshift, provides a flexible foundation for handling diverse data modalities and maturities while providing data consistency and accessibility.
- Reducing costs through standardization – To address rising R&D costs and operational inefficiencies, pre-wired analytical workbenches offer standardized templates and integrated development environments (IDEs) that reduce redundancy and accelerate workflow development.
- Unlocking AI/ML with Amazon SageMaker AI and Amazon Bedrock – Advanced AI/ML capabilities, powered by Amazon SageMaker AI and Amazon Bedrock, create a standardized, domain-agnostic MLOps environment that enables repeatable innovation and cross-domain synergy.
- Managing tools ecosystem with end-to-end observability – Robust governance and observability features provide compliance and system reliability while integrating previously disconnected tools into a unified, well-monitored ecosystem that breaks down architectural silos and promotes efficient resource utilization.
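To make the first item in this list more concrete, the following is a minimal, purely illustrative Python sketch of how batch, streaming, and event-driven inputs could be normalized into one record shape before landing in a shared store. It is not Bayer’s implementation and does not use any SageMaker or AWS API. The names `IngestRecord`, `LandingZone`, `classify_modality`, and `ingest` are hypothetical, the modality rule is a deliberately naive assumption, and the in-memory list stands in for a real storage layer.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Iterable, List

# Hypothetical ingestion modes, mirroring the three pipeline types named above.
VALID_MODES = {"batch", "stream", "event"}


@dataclass(frozen=True)
class IngestRecord:
    """Common envelope that every ingestion path produces (illustrative only)."""
    domain: str        # for example "clinical" or "omics"
    modality: str      # "structured", "semi-structured", or "unstructured"
    mode: str          # one of VALID_MODES
    payload: Any
    ingested_at: str   # ISO 8601 timestamp in UTC


class LandingZone:
    """In-memory stand-in for a shared storage layer (assumption, not a real service)."""

    def __init__(self) -> None:
        self._records: List[IngestRecord] = []

    def write(self, record: IngestRecord) -> None:
        self._records.append(record)

    def by_domain(self, domain: str) -> List[IngestRecord]:
        return [r for r in self._records if r.domain == domain]


def classify_modality(payload: Any) -> str:
    """Naive rule: flat dict = structured, nested dict = semi-structured, anything else = unstructured."""
    if isinstance(payload, dict):
        nested = any(isinstance(v, (dict, list)) for v in payload.values())
        return "semi-structured" if nested else "structured"
    return "unstructured"


def ingest(zone: LandingZone, domain: str, mode: str, payloads: Iterable[Any]) -> int:
    """Wrap each payload in the common envelope and write it; return the number written."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown ingestion mode: {mode!r}")
    count = 0
    for payload in payloads:
        zone.write(IngestRecord(
            domain=domain,
            modality=classify_modality(payload),
            mode=mode,
            payload=payload,
            ingested_at=datetime.now(timezone.utc).isoformat(),
        ))
        count += 1
    return count


# Example usage with made-up payloads.
zone = LandingZone()
ingest(zone, "clinical", "batch", [{"subject_id": "S-001", "arm": "A"}])
ingest(zone, "lab", "stream", [{"instrument": "hypothetical-reader", "readings": [0.1, 0.2]}])
ingest(zone, "omics", "event", [b"ACGT"])
```

The point of the sketch is the single envelope: whichever path data arrives through, downstream consumers see the same fields, which is one way to read the "unified storage and processing" goal described above.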
The DSE architecture implements data mesh principles: data domains (omics, regulatory, clinical trials) are treated as products, with ownership and management responsibilities assigned to domain experts. These domains are decentralized for execution but remain discoverable and reusable through SageMaker Catalog. At the core is a hybrid mesh lakehouse that combines Amazon S3 and Iceberg, providing the flexibility to handle both structured and unstructured data efficiently. SageMaker Unified Studio provides an analytical layer where researchers can access the full suite of tools needed for their work. The following diagram illustrates this architecture.
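As a rough illustration of the data mesh idea described in this section (domain-owned data products that stay discoverable through a central catalog), the following Python sketch models a toy in-memory catalog. It is an assumption-laden teaching example, not the SageMaker Catalog API and not Bayer’s design. The `DataProduct` and `Catalog` classes, the owner names, and the `s3://example-bucket/...` locations are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class DataProduct:
    """A dataset published by a domain team (hypothetical model)."""
    name: str
    domain: str                       # owning domain, for example "omics"
    owner: str                        # accountable domain team
    location: str                     # where the data lives; stays with the domain
    tags: FrozenSet[str] = frozenset()


class Catalog:
    """Central index: stores only metadata, never the data itself."""

    def __init__(self) -> None:
        self._products: Dict[Tuple[str, str], DataProduct] = {}

    def publish(self, product: DataProduct) -> None:
        """Register a product; names must be unique within a domain."""
        key = (product.domain, product.name)
        if key in self._products:
            raise ValueError(f"already published: {product.domain}/{product.name}")
        self._products[key] = product

    def search(self, tag: Optional[str] = None,
               domain: Optional[str] = None) -> List[DataProduct]:
        """Return matching products across all domains, sorted by domain then name."""
        hits = [
            p for p in self._products.values()
            if (tag is None or tag in p.tags)
            and (domain is None or p.domain == domain)
        ]
        return sorted(hits, key=lambda p: (p.domain, p.name))


# Example usage with made-up products: each domain publishes its own assets.
catalog = Catalog()
catalog.publish(DataProduct(
    name="biomarker_panel", domain="omics", owner="omics-team",
    location="s3://example-bucket/omics/biomarker_panel/",
    tags=frozenset({"biomarker", "unstructured"}),
))
catalog.publish(DataProduct(
    name="trial_outcomes", domain="clinical", owner="clinical-team",
    location="s3://example-bucket/clinical/trial_outcomes/",
    tags=frozenset({"biomarker", "structured"}),
))

# A researcher in any domain can discover both products by tag.
biomarker_products = [p.name for p in catalog.search(tag="biomarker")]
```

The design choice to notice is the split of responsibilities: each domain keeps ownership of its data and its location, while the catalog holds only enough metadata to make products searchable across domains.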
Impact
The first phase of Bayer’s DSE confirmed the next generation of SageMaker as a powerful foundation for their R&D DSE—designed to balance decentralized innovation with centralized governance through a scalable data mesh architecture. With this solution, Bayer can catalog and manage multimodal data assets—including structured and unstructured data, ML features, models, and custom scientific assets—with context-rich metadata across diverse Pharma R&D domains. Bayer is now positioned to onboard over 300 TB of biomarker data and integrate siloed omics, clinical, and chemistry data repositories into a cohesive environment. With integrated tools like JupyterLab Spaces, MLflow, and SageMaker AI Studio, the DSE platform is laying the groundwork for a comprehensive, GxP-aware ML workbench—paving the way to operationalize over 25 high-value ML use cases and support more than 100 data scientists across the organization.
“The Data Science Ecosystem is vital for developing our medicines,” says Daniel Gusenleitner, Mission Lead for the R&D Data Science Ecosystem. “It enhances our business workflows with advanced analytics, helping us accelerate the search for new treatments. By integrating data from the entire research and development process, we improve the chances of technical success and ensure our efforts are efficient. Unlocking our data also facilitates target discovery, leading to groundbreaking advancements in patient care.”
Next steps
Bayer has successfully begun building its Data Science Ecosystem on the next generation of Amazon SageMaker and is working to onboard the first use case of advanced biomarker research. Building on this foundation, Bayer is also accelerating the evolution of the DSE solution with the following key enhancements:
- Federated catalogs and cross-domain integration – Enabling search and reuse of data assets across therapeutic areas and business units
- Advanced ontology and semantic layer – Enriching metadata with domain knowledge to support AI-based search, discovery, and reasoning
- Adoption of generative and agentic AI workflows – Driving novel drug discovery and accelerating hypothesis generation
Conclusion
By leveraging the next generation of Amazon SageMaker to build their cloud-based Data Science Ecosystem, Bayer is creating a foundation for faster, more efficient research and discovery. Amazon SageMaker is unifying diverse data types, enabling global collaboration, and standardizing ML workflows to help position Bayer at the forefront of data-driven innovation.
To learn more and get started with the next generation of SageMaker, refer to Amazon SageMaker or the AWS console.
