AWS Machine Learning Blog

Use RStudio on Amazon SageMaker to create regulatory submissions for the life sciences industry

Pharmaceutical companies seeking approval from regulatory agencies such as the US Food & Drug Administration (FDA) or Japanese Pharmaceuticals and Medical Devices Agency (PMDA) to sell their drugs on the market must submit evidence to prove that their drug is safe and effective for its intended use. A team of physicians, statisticians, chemists, pharmacologists, and other clinical scientists reviews the clinical trial submission data and proposed labeling. If the review establishes that there is sufficient statistical evidence to prove that the health benefits of the drug outweigh the risks, the drug is approved for sale.

The clinical trial submission package consists of tabulated data, analysis data, trial metadata, and statistical reports consisting of statistical tables, listings, and figures. In the case of the US FDA, the electronic common technical document (eCTD) is the standard format for submitting applications, amendments, supplements, and reports to the FDA’s Center for Biologics Evaluation and Research (CBER) and Center for Drug Evaluation and Research (CDER). For the FDA and Japanese PMDA, it’s a regulatory requirement to submit tabulated data in CDISC Study Data Tabulation Model (SDTM), analysis data in CDISC Analysis Data Model (ADaM), and trial metadata in CDISC Define-XML (based on Operational Data Model (ODM)).

In this post, we demonstrate how we can use RStudio on Amazon SageMaker to create such regulatory submission deliverables. This post describes the clinical trial submission process, how we can ingest clinical trial research data, tabulate and analyze the data, and then create statistical reports—summary tables, data listings, and figures (TLF). This method can enable pharmaceutical customers to seamlessly connect to clinical data stored in their AWS environment, process it using R, and help accelerate the clinical trial research process.

Drug development process

The drug development process can broadly be divided into five major steps, as illustrated in the following figure.

Drug Development Process

On average, it takes 10–15 years and approximately USD 1–3 billion for one drug out of around 10,000 potential molecules to receive approval. During the early phases of research (the drug discovery phase), promising drug candidates are identified and move on to preclinical research. During the preclinical phase, researchers assess the toxicity of the drug by performing in vitro experiments in the lab and in vivo experiments on animals. After preclinical testing, drugs move on to the clinical trial research phase, where they must be tested on humans to ascertain their safety and efficacy. The researchers design clinical trials and detail the study plan in the clinical trial protocol. They define the different clinical research phases: small Phase 1 studies to determine drug safety and dosage, bigger Phase 2 trials to determine drug efficacy and side effects, and even bigger Phase 3 and 4 trials to determine drug efficacy and safety and to monitor adverse reactions. After successful human clinical trials, the drug sponsor files a New Drug Application (NDA) to market the drug. The regulatory agencies review all the data, work with the sponsor on prescription labeling information, and approve the drug. After the drug’s approval, the regulatory agencies review post-market safety reports to monitor the product’s continued safety.

In 1997, the Clinical Data Interchange Standards Consortium (CDISC), a global, non-profit organization comprising pharmaceutical companies, CROs, biotech, academic institutions, healthcare providers, and government agencies, was started as a volunteer group. CDISC has published data standards to streamline the flow of data from collection through submission and to facilitate data interchange between partners and providers. CDISC has published the following standards:

  • CDASH (Clinical Data Acquisition Standards Harmonization) – Standards for collected data
  • SDTM (Study Data Tabulation Model) – Standards for submitting tabulated data
  • ADaM (Analysis Data Model) – Standards for analysis data
  • SEND (Standard for Exchange of Nonclinical Data) – Standards for nonclinical data
  • PRM (Protocol Representation Model) – Standards for protocol

These standards can help trained reviewers analyze data more effectively and quickly using standard tools, thereby reducing drug approval times. It’s a regulatory requirement from the US FDA and Japanese PMDA to submit all tabulated data using the SDTM format.

R for clinical trial research submissions

SAS and R are two of the most widely used statistical analysis software packages in the pharmaceutical industry. When CDISC started developing the SDTM standards, SAS was in almost universal use in the pharmaceutical industry and at the FDA. However, R is rapidly gaining popularity because it’s open source and new packages and libraries are continuously added. Many students use R during their studies and research, and they bring this familiarity with R to their jobs. R also offers support for emerging technologies such as advanced deep learning integrations.

Cloud providers such as AWS have now become the platform of choice for pharmaceutical customers to host their infrastructure. AWS also provides managed services such as SageMaker, which makes it effortless to create, train, and deploy machine learning (ML) models in the cloud. SageMaker also allows access to the RStudio IDE from anywhere via a web browser. This post details how statistical programmers and biostatisticians can ingest their clinical data into the R environment, how R code can be run, and how results are stored. We provide snippets of code that allow clinical trial data scientists to ingest XPT files into the R environment, create R data frames for SDTM and ADaM, and finally create TLF that can be stored in an Amazon Simple Storage Service (Amazon S3) object storage bucket.

RStudio on SageMaker

On November 2, 2021, AWS in collaboration with RStudio PBC announced the general availability of RStudio on SageMaker, the industry’s first fully managed RStudio Workbench IDE in the cloud. You can now bring your current RStudio license to easily migrate your self-managed RStudio environments to SageMaker in just a few simple steps. To learn more about this exciting collaboration, check out Announcing RStudio on Amazon SageMaker.

Along with the RStudio Workbench, the RStudio suite for R developers also offers RStudio Connect and RStudio Package Manager. RStudio Connect is designed to allow data scientists to publish insights, dashboards, and web applications. It makes it easy to share ML and data science insights from data scientists’ complicated work and put it in the hands of decision-makers. RStudio Connect also makes hosting and managing content simple and scalable for wide consumption.

Solution overview

In the following sections, we discuss how we can import raw data from a remote repository or S3 bucket in RStudio on SageMaker. It’s also possible to connect to Amazon Relational Database Service (Amazon RDS) and data warehouses like Amazon Redshift (see Connecting R with Amazon Redshift) directly from RStudio; however, this is outside the scope of this post. After data has been ingested from a couple of different sources, we process it and create R data frames for a table. Then we convert the table data frame into an RTF file and store the results back in an S3 bucket. These outputs can then potentially be used for regulatory submission purposes, provided the R packages used in the post have been validated for use for regulatory submissions by the customer.

Set up RStudio on SageMaker

For instructions on setting up RStudio on SageMaker in your environment, refer to Get started with RStudio on SageMaker. Make sure that the execution role of RStudio on SageMaker has access to download and upload data to the S3 bucket in which data is stored. To learn more about how to manage R packages and publish your analysis using RStudio on SageMaker, refer to Announcing Fully Managed RStudio on SageMaker for Data Scientists.

Ingest data into RStudio

In this step, we ingest data from various sources to make it available for our R session. We import data in SAS XPT format; however, the process is similar if you want to ingest data in other formats. One of the advantages of using RStudio on SageMaker is that if the source data is stored in your AWS accounts, then SageMaker can natively access the data using AWS Identity and Access Management (IAM) roles.

Access data stored in a remote repository

In this step, we import SDTM data from the FDA’s GitHub repository. We create a local directory called data in the RStudio environment to store the data and download demographics data (dm.xpt) from the remote repository. In this context, the local directory refers to a directory created on your private Amazon EFS storage that is attached by default to your R session environment. See the following code:

######################################################
# Step 1.1 – Ingest Data from Remote Data Repository #
######################################################

# Remote data path (a directory) and the name of the file to download from it
raw_data_url <- "https://github.com/FDA/PKView/raw/master/Installation%20Package/OCP/data/clinical/DRUG000/0000/m5/datasets/test001/tabulations/sdtm"
raw_data_name <- "dm.xpt"

# Create a local directory to store downloaded files
dir.create("data", showWarnings = FALSE)
local_file_location <- paste0(getwd(), "/data/")

# Download the file itself (directory URL + file name).
# XPT is a binary format, so write it in binary mode.
download.file(paste0(raw_data_url, "/", raw_data_name),
              paste0(local_file_location, raw_data_name),
              mode = "wb")

When this step is complete, you can find the downloaded dm.xpt file by navigating to Files, data, dm.xpt.
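
A download from a web repository can silently save an HTML page instead of the data file if the URL is slightly off. The following is a minimal sketch of a sanity check, assuming that a SAS transport file begins with the ASCII text HEADER RECORD*******; the helper name looks_like_xpt is hypothetical and not part of the original workflow. The demo uses temporary files rather than the real dm.xpt.

```r
# Hypothetical helper (an illustration, not part of the post's workflow):
# returns TRUE if the file starts with the bytes that open a SAS transport
# header record, FALSE otherwise.
looks_like_xpt <- function(path) {
  expected <- charToRaw("HEADER RECORD*******")
  if (!file.exists(path)) return(FALSE)
  con <- file(path, "rb")
  on.exit(close(con))
  first_bytes <- readBin(con, what = "raw", n = length(expected))
  identical(first_bytes, expected)
}

# Demo with temporary files (assumed contents, for illustration only)
tmp_ok <- tempfile(fileext = ".xpt")
writeBin(charToRaw("HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!"), tmp_ok)

tmp_bad <- tempfile(fileext = ".xpt")
writeLines("<!DOCTYPE html><html><body>Not Found</body></html>", tmp_bad)

print(looks_like_xpt(tmp_ok))
print(looks_like_xpt(tmp_bad))
```

In the workflow above, the same check could be applied to data/dm.xpt before the file is read.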

Access data stored in Amazon S3

In this step, we download data stored in an S3 bucket in our account. We have copied contents from the FDA’s GitHub repository to the S3 bucket named aws-sagemaker-rstudio for this example. See the following code:

#####################################################
# Step 1.2 - Ingest Data from S3 Bucket             #
#####################################################
library("reticulate")

SageMaker = import('sagemaker')
session <- SageMaker$Session()

s3_bucket = "aws-sagemaker-rstudio"
s3_key = "DRUG000/test001/tabulations/sdtm/pp.xpt"

session$download_data(local_file_location, s3_bucket, s3_key)

When the step is complete, you can find the downloaded pp.xpt file by navigating to Files, data, pp.xpt.

Process XPT data

Now that we have SAS XPT files available in the R environment, we need to convert them into R data frames and process them. We use the haven library to read XPT files. We merge the CDISC SDTM datasets dm and pp to create the ADPP dataset. Then we create a summary statistic table using the ADPP data frame. The summary table is then exported in RTF format.

First, XPT files are read using the read_xpt function of the haven library. Then an analysis dataset is created using the sqldf function of the sqldf library. See the following code:

########################################################
# Step 2.1 - Read XPT files. Create Analysis dataset.  #
########################################################

library(haven)
library(sqldf)


# Read XPT Files, convert them to R data frame
dm = read_xpt("data/dm.xpt")
pp = read_xpt("data/pp.xpt")

# Create ADaM dataset
adpp = sqldf("select a.USUBJID
                    ,a.PPCAT as ACAT
                    ,a.PPTESTCD
                    ,a.PPTEST
                    ,a.PPDTC
                    ,a.PPSTRESN as AVAL
                    ,a.VISIT as AVISIT
                    ,a.VISITNUM as AVISITN
                    ,b.SEX as SEX
                from pp a 
           left join dm b 
                  on a.usubjid = b.usubjid
             ")
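
The left join in the query above can also be expressed without SQL. The following is a minimal sketch using base R merge on toy data frames; pp_toy and dm_toy are invented stand-ins for the pp and dm domains, and their values are made up for illustration. It also shows two checks that may be useful after such a join: that no records were gained or lost, and which subjects had no demographics record.

```r
# Toy stand-ins for the SDTM pp and dm domains (invented values)
pp_toy <- data.frame(
  USUBJID  = c("S01", "S01", "S02", "S03"),
  PPTESTCD = c("AUC", "CMAX", "AUC", "AUC"),
  PPSTRESN = c(120.5, 15.2, 98.1, 110.0),
  stringsAsFactors = FALSE
)
dm_toy <- data.frame(
  USUBJID = c("S01", "S02"),
  SEX     = c("F", "M"),
  stringsAsFactors = FALSE
)

# Left join: keep every pp record and attach SEX where a dm record exists
adpp_toy <- merge(pp_toy, dm_toy, by = "USUBJID", all.x = TRUE)

# Rename the numeric result to the analysis variable name used in the post
names(adpp_toy)[names(adpp_toy) == "PPSTRESN"] <- "AVAL"

# With one dm record per subject, the join should preserve the record count
stopifnot(nrow(adpp_toy) == nrow(pp_toy))

# Subjects without a demographics record show up with SEX = NA
unmatched <- unique(adpp_toy$USUBJID[is.na(adpp_toy$SEX)])
print(unmatched)
```

Whether to use sqldf, merge, or a dplyr join is largely a matter of team preference; the record-count check applies equally to all three.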

Then, an output data frame is created using functions from the Tplyr and dplyr libraries:

########################################################
# Step 2.2 - Create output table                       #
########################################################

library(Tplyr)
library(dplyr)

t = tplyr_table(adpp, SEX) %>% 
  add_layer(
    group_desc(AVAL, by = "Area under the concentration-time curve", where= PPTESTCD=="AUC") %>% 
      set_format_strings(
        "n"        = f_str("xx", n),
        "Mean (SD)"= f_str("xx.x (xx.xx)", mean, sd),
        "Median"   = f_str("xx.x", median),
        "Q1, Q3"   = f_str("xx, xx", q1, q3),
        "Min, Max" = f_str("xx, xx", min, max),
        "Missing"  = f_str("xx", missing)
      )
  )  %>% 
  build()

output = t %>% 
  rename(Variable = row_label1,Statistic = row_label2,Female =var1_F, Male = var1_M) %>% 
  select(Variable,Statistic,Female, Male)
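
Because the summary table feeds a regulatory output, it can be worth recomputing the same statistics independently. The following is a minimal sketch of such a cross-check in base R, assuming a toy data frame (adpp_check, with invented values) in place of the real ADPP data; the helper summarise_one is hypothetical. Note that quantile uses R's default method (type 7) here, which may differ from the convention used by other software.

```r
# Toy analysis data (invented values) standing in for the ADPP data frame
adpp_check <- data.frame(
  SEX      = c("F", "F", "F", "M", "M", "M"),
  PPTESTCD = "AUC",
  AVAL     = c(100, 110, 120, 90, 95, 130),
  stringsAsFactors = FALSE
)

# Hypothetical helper: descriptive statistics matching the rows of the table
summarise_one <- function(x) {
  c(
    n       = sum(!is.na(x)),
    mean    = mean(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    median  = median(x, na.rm = TRUE),
    q1      = unname(quantile(x, 0.25, na.rm = TRUE)),
    q3      = unname(quantile(x, 0.75, na.rm = TRUE)),
    min     = min(x, na.rm = TRUE),
    max     = max(x, na.rm = TRUE),
    missing = sum(is.na(x))
  )
}

# Restrict to the AUC parameter, then summarise AVAL within each sex
auc <- subset(adpp_check, PPTESTCD == "AUC")
check <- sapply(split(auc$AVAL, auc$SEX), summarise_one)

# Rows are statistics, columns are the levels of SEX
print(round(check, 2))
```

The resulting matrix can be compared by eye, or programmatically, against the Female and Male columns of the output data frame.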

The output data frame is then stored as an RTF file in the output folder in the RStudio environment:

#####################################################
# Step 3 - Save the Results as RTF                  #
#####################################################
library(rtf)

dir.create("output")
rtf = RTF("output/tab_adpp.rtf")  
addHeader(rtf,title="Section 1 - Tables", subtitle="This Section contains all tables")
addParagraph(rtf, "Table 1 - Pharmacokinetic Parameters by Sex:\n")
addTable(rtf, output)
done(rtf)
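
Before the outputs leave the RStudio environment, it may help to record what was produced. The following is a minimal sketch, assuming a hypothetical helper named build_manifest, that lists the files in a folder with their sizes and MD5 checksums using base R and the tools package. The demo writes a placeholder file to a temporary folder rather than touching the real output folder.

```r
# Hypothetical helper: list the files in a folder with size and MD5 checksum
build_manifest <- function(dir) {
  files <- list.files(dir, full.names = TRUE)
  files <- files[!dir.exists(files)]  # skip sub-directories
  data.frame(
    file       = basename(files),
    size_bytes = file.size(files),
    md5        = unname(tools::md5sum(files)),
    stringsAsFactors = FALSE
  )
}

# Demo on a temporary folder containing a placeholder file
demo_dir <- tempfile("output_demo")
dir.create(demo_dir)
writeLines("placeholder table", file.path(demo_dir, "tab_demo.rtf"))

manifest <- build_manifest(demo_dir)
print(manifest)

# The manifest can be saved next to the outputs for later comparison
manifest_path <- file.path(tempdir(), "manifest.csv")
write.csv(manifest, manifest_path, row.names = FALSE)
```

Applied to the output folder, a manifest like this gives a simple way to confirm later that a file retrieved from storage is the one that was generated.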

Upload outputs to Amazon S3

After the output has been generated, we put the data back in an S3 bucket. We can achieve this by creating a SageMaker session again, if a session isn’t active already, and uploading the contents of the output folder to an S3 bucket using the session$upload_data function:

#####################################################
# Step 4 - Upload outputs to S3                     #
#####################################################
library("reticulate")

SageMaker = import('sagemaker')
session <- SageMaker$Session()
s3_bucket = "aws-sagemaker-rstudio"
output_location = "output/"
s3_folder_name = "output"
session$upload_data(output_location, s3_bucket, s3_folder_name)

With these steps, we have ingested data, processed it, and uploaded the results to be made available for submission to regulatory authorities.

Clean up

To avoid incurring any unintended costs, you need to quit your current session. In the top-right corner of the page, choose the power icon. This automatically stops the underlying instance, so you stop incurring compute costs for it.

Challenges

The post has outlined steps for ingesting raw data stored in an S3 bucket or from a remote repository. However, there are many other sources of raw data for a clinical trial, primarily eCRF (electronic case report forms) data stored in EDC (electronic data capture) systems such as Oracle Clinical, Medidata Rave, OpenClinica, or Snowflake; lab data; data from eCOA (clinical outcome assessment) and ePRO (electronic Patient-Reported Outcomes); real-world data from apps and medical devices; and electronic health records (EHRs) at the hospitals. Significant preprocessing is involved before this data can be made usable for regulatory submissions. Building connectors to various data sources and collecting them in a centralized data repository (CDR) or a clinical data lake, while maintaining proper access controls, poses significant challenges.

Another key challenge to overcome is that of regulatory compliance. The computer system used for creating regulatory submission outputs must be compliant with appropriate regulations, such as 21 CFR Part 11, HIPAA, GDPR, or any other GxP requirements or ICH guidelines. This translates to working in a validated and qualified environment with controls for access, security, backup, and auditability in place. This also means that any R packages that are used to create regulatory submission outputs must be validated before use.
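
One small building block for auditability is a record of exactly which R and package versions produced an output. The following is a minimal sketch, assuming a hypothetical helper named record_environment; it only reports versions and is not a substitute for a formal package validation process. Packages that are not installed are reported with a missing version instead of raising an error.

```r
# Hypothetical helper: report the installed version of each package,
# together with the version of R itself
record_environment <- function(pkgs) {
  pkg_version <- function(p) {
    # system.file() returns "" when the package is not installed
    if (nzchar(system.file(package = p))) {
      as.character(utils::packageVersion(p))
    } else {
      NA_character_
    }
  }
  data.frame(
    package   = pkgs,
    version   = unname(vapply(pkgs, pkg_version, character(1))),
    r_version = R.version.string,
    stringsAsFactors = FALSE
  )
}

# Packages used in this post; any that are not installed show NA
env_record <- record_environment(
  c("haven", "sqldf", "Tplyr", "dplyr", "rtf", "reticulate")
)
print(env_record)
```

A record like this could be stored alongside each generated output so that reviewers can trace a table back to the software that produced it.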

Conclusion

In this post, we saw that some of the key deliverables for an eCTD submission are CDISC SDTM and ADaM datasets and TLF. This post outlined the steps needed to create these regulatory submission deliverables by first ingesting data from a couple of sources into RStudio on SageMaker. We then saw how we can process the ingested data in XPT format; convert it into R data frames to create SDTM, ADaM, and TLF; and then finally upload the results to an S3 bucket.

We hope that with the broad ideas laid out in the post, statistical programmers and biostatisticians can easily visualize the end-to-end process of loading, processing, and analyzing clinical trial research data in RStudio on SageMaker and use the learnings to define a custom workflow suited for their regulatory submissions.

Can you think of any other applications of using RStudio to help researchers, statisticians, and R programmers to make their lives easier? We would love to hear about your ideas! And if you have any questions, please share them in the comments section.

About the authors

Rohit Banga is a Global Clinical Development Industry Specialist based out of London, UK. He is a biostatistician by training and helps Healthcare and LifeScience customers deploy innovative clinical development solutions on AWS. He is passionate about how data science, AI/ML, and emerging technologies can be used to solve real business problems within the Healthcare and LifeScience industry. In his spare time, Rohit enjoys skiing, BBQing, and spending time with family and friends.

Georgios Schinas is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in London and works closely with customers in the UK and Ireland. Georgios helps customers design and deploy machine learning applications in production on AWS, with a particular interest in MLOps practices and enabling customers to perform machine learning at scale. In his spare time, he enjoys traveling, cooking, and spending time with friends and family.