AWS Storage Blog

Johnson & Johnson reduces analysis time by 35% with their data science platform using Amazon EFS

Johnson & Johnson data science and business experts collaborate to incorporate science into healthcare solutions, including medical device and diagnostic technologies, consumer healthcare products, and pharmaceuticals. Johnson & Johnson needed storage to share and perform analytics on their data science workbench for genomics, neuroscience, R&D, and drug discovery. A massively scalable solution was required to share and analyze hundreds of terabytes of data from external and internal studies. In addition, their existing on-premises storage had scale limitations and high operational overhead.

In this blog, I interview Greg Rusin, IT manager, Johnson & Johnson. Greg unpacks the Johnson & Johnson solution for genomics, neuroscience, R&D, and drug discovery for our AWS Storage blog readers. We discuss Johnson & Johnson’s data science platform, its purpose, the solution they deployed, and how they use it. We also discuss the benefits and advantages of the solution, and what’s next for Johnson & Johnson as they continue to innovate in analytics for genomics and drug discovery.

Interview with Greg Rusin, IT manager, Johnson & Johnson

What do you do at Johnson & Johnson?

I work on a team that has had many roles over my time at Johnson & Johnson. We started out as a group that explored new technologies and performed proofs of concept (POCs) of those technologies. We would then pass a technology on to global IT leaders, or decide it was not a fit for our clients. Our role has changed over the years, and while we still do this, it’s not our main function. We are charged with creating High Performance Computing environments to support the various TAs (therapeutic areas) within Janssen Pharmaceuticals. These areas include genomics, neuroscience, medical devices, immunology, and computational chemistry, to name a few.

What did Johnson & Johnson set out to accomplish by building a data science platform?

A massively scalable solution was required to share and analyze hundreds of terabytes of data from external and internal studies. We must provide flexible storage solutions and enhanced computing capabilities for our scientists. We also required a storage platform that could be shared across multiple virtual private cloud (VPC) accounts, because data may be applicable to more than one therapeutic area.

What were the requirements for Johnson & Johnson’s data science environment?

Our data science team is responsible for performing analytics for R&D and drug discovery. We get large amounts of data from both external vendors and internal sources for analysis. Data might come from internal instrumentation, or from sequencers both internal and external. Raw data comes from external vendors, centered around a specific study or project. Data can vary from 1 TB to 30 TB per project, and at least 150 TB came from outside sources.

We support genomics and neuroscience platforms that require analytics on large datasets. We perform primary, secondary, and tertiary analysis, and the subsequent storage of those results, sometimes for a long time.

Usually 90% of the data is cold, and 10% is hot. So, we required a tiered storage solution, with storage that could be shared across multiple sites and areas within R&D.

What are typical use cases for data that resides on your data science platform?

Genomics sequencer data, whole genome studies, medical device data, immunology studies and data, and COVID data. All of this data supports the various therapeutic areas we serve.

What did your existing data science environment look like before implementing the solution?

Before our use of AWS resources, we did everything internally. This presented infrastructure resource issues due to limited space and budgeting concerns. We also had to deal with depreciation and infrastructure end of life, and had to dedicate employees or contractors to support it all. On top of that, we ran into the limitations of on-premises infrastructure.

AWS, both compute-wise and storage-wise, provided us with potentially limitless resources and minimal support requirements. The only real issues we face are security and accessibility. Luckily, my group has the support of internal security teams responsible for both.

What is the data science platform solution Johnson & Johnson built, and how does it work?

Amazon Elastic File System (Amazon EFS) provides analytics storage with shared file access for data scientists. Applications include open-source genomics tools, Shiny, and Domino Data Lab, running on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.
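
Johnson & Johnson’s exact deployment details aren’t covered here, but as a rough illustration of the network plumbing behind shared file access from an EKS cluster, the hypothetical boto3 sketch below creates EFS mount targets in the subnets used by the cluster’s worker nodes and allows NFS (TCP 2049) from the nodes to the file system. All resource IDs are placeholders; inside the cluster, pods would then typically mount the file system through the Amazon EFS CSI driver.

```python
# Hypothetical sketch (not J&J's actual setup): exposing an EFS file system
# to EKS worker nodes over NFS. All resource IDs below are placeholders.
import boto3

efs = boto3.client("efs", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

FILE_SYSTEM_ID = "fs-0123456789abcdef0"              # placeholder file system ID
SUBNET_IDS = ["subnet-aaaa1111", "subnet-bbbb2222"]  # worker-node subnets
NODE_SECURITY_GROUP = "sg-nodes1111"                 # SG attached to the nodes
EFS_SECURITY_GROUP = "sg-efs2222"                    # SG for the mount targets

# Allow NFS (TCP 2049) from the worker nodes to the mount targets.
ec2.authorize_security_group_ingress(
    GroupId=EFS_SECURITY_GROUP,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 2049,
        "ToPort": 2049,
        "UserIdGroupPairs": [{"GroupId": NODE_SECURITY_GROUP}],
    }],
)

# Create one mount target per subnet so every node has a nearby NFS endpoint.
for subnet_id in SUBNET_IDS:
    efs.create_mount_target(
        FileSystemId=FILE_SYSTEM_ID,
        SubnetId=subnet_id,
        SecurityGroups=[EFS_SECURITY_GROUP],
    )
```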

Amazon EFS provides access to a fully managed petabyte-scale file system supporting our genomics sequence data at 500 TB. We use EFS Lifecycle Management to place roughly 85% of data in One Zone-Standard and One Zone-IA storage classes. The other 15% of solution data is in Standard and Standard-IA storage classes.
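
For context, EFS Lifecycle Management is configured per file system with a lifecycle policy that moves files between storage classes based on access patterns. The boto3 sketch below is a minimal, hypothetical example; the file system ID and the 30-day transition window are illustrative assumptions, not Johnson & Johnson’s actual settings.

```python
# Minimal, hypothetical lifecycle policy: tier files that haven't been
# accessed for 30 days to Infrequent Access (IA), and move a file back to
# the primary storage class on its next access. The ID is a placeholder.
import boto3

efs = boto3.client("efs")

efs.put_lifecycle_configuration(
    FileSystemId="fs-0123456789abcdef0",
    LifecyclePolicies=[
        {"TransitionToIA": "AFTER_30_DAYS"},
        {"TransitionToPrimaryStorageClass": "AFTER_1_ACCESS"},
    ],
)
```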

Johnson & Johnson data science platform for genomics and neuroscience

What have you learned along the journey of building your solution on AWS?

We moved our file data from on-premises storage to the Amazon EFS Standard storage class, and then to the EFS One Zone storage class. We are now able to implement storage classes and automate the transition of data to long-term, low-cost storage platforms. We never had access to storage classes before using Amazon EFS. Additionally, Amazon EFS has allowed us to provide access to shared data pools across platforms that were previously siloed off from one another. Our solution provides a true shared storage platform for data that might be valuable across therapeutic areas, and it also supports shared storage space for home directories and application repositories.
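
As a small illustration of how the split between storage classes can be tracked once lifecycle policies are in place, the hypothetical snippet below reads the per-class size breakdown that Amazon EFS reports for a file system; the file system ID is a placeholder.

```python
# Hypothetical check of how much data has tiered to Infrequent Access (IA)
# versus the primary storage class. The file system ID is a placeholder.
import boto3

efs = boto3.client("efs")
fs = efs.describe_file_systems(FileSystemId="fs-0123456789abcdef0")["FileSystems"][0]

size = fs["SizeInBytes"]
total = size["Value"]
in_ia = size.get("ValueInIA", 0)
in_primary = size.get("ValueInStandard", 0)

print(f"Total:   {total / 1e12:.2f} TB")
print(f"Primary: {in_primary / 1e12:.2f} TB")
print(f"IA:      {in_ia / 1e12:.2f} TB ({100 * in_ia / max(total, 1):.0f}% tiered)")
```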

What benefits has Johnson & Johnson gained with the solution?

Scale is the number one benefit. Our solution scales elastically to 500 TiB and growing. We get faster time to insights with access to new datasets, and we have reduced analysis time by 35%. We also reduced costs by 37% using EFS Lifecycle Management and the lower-cost Amazon EFS One Zone storage classes.

Conclusion

Johnson & Johnson now has a clearer picture of where their data resides and the scale they need today and for future growth. They have tiered storage classes to ensure only their hot data is kept in highly accessible storage. They’ve also gained the ability to move cold data to long-term storage. Additionally, their backups and management are less hands-on, and IT personnel can focus on other pressing projects and issues.

As Greg says, when you close a data center, it’s the end of an era. You wave goodbye as they roll out the racks. It also marks the beginning of a new era for Johnson & Johnson. They now have a solution that scales elastically, with analysis time reduced by 35% and costs reduced by 37%, using Amazon EFS Lifecycle Management and the lower-cost Amazon EFS One Zone storage classes. Visit the Amazon EFS web pages to get started with your solution.

Thanks for reading this blog post about Johnson & Johnson’s data science solution for genomics, neuroscience, R&D, and drug discovery. If you have any comments or questions, share them in the comments section.

Paula Phipps

Paula is a product marketing manager for Amazon EFS. Her background is in cloud, storage, data management, data protection, and application-specific infrastructure solutions. She starts every day imagining herself in her customers’ shoes. Outside of work she likes to hike, cook, spend time with family, and loves a good mystery.

Greg Rusin

Greg creates High Performance Computing environments to support the various TAs (therapeutic areas) within Janssen Pharmaceuticals. These areas include genomics, neuroscience, medical devices, immunology, and computational chemistry, to name a few. Greg started at Johnson & Johnson many years ago as a system administrator, and has taken on additional responsibilities over the years, including managing a research and development data center, and exploring and implementing new technologies.