AWS Big Data Blog

Running R on AWS

by Markus Schmidberger

Markus Schmidberger is a Senior Big Data Consultant for AWS Professional Services.

Many AWS customers already use the popular open-source statistical software R for big data analytics and data science; others have asked for instructions and best practices for running R on AWS. Several months ago, I wrote a blog post showing you how to connect R with Amazon EMR, install RStudio on the Hadoop master node, and use R packages such as rmr2 or plyrmr to analyze a huge public weather data set. In this post, I show you how to install and run R, RStudio Server, and Shiny Server on AWS.

RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser.

After you’ve analyzed your results, you may want to visualize them. Shiny is a great R package, licensed either commercially or under AGPLv3, that you can use to create interactive dashboards. Shiny provides a web application framework for R. It turns your analyses into interactive web applications; no HTML, CSS, or JavaScript knowledge required. Shiny Server can deliver your R visualization to your customers via a web browser and execute R functions, including database queries, in the background.

The examples in this post use the AWS public data set CCAFS-Climate Data, a 6 TB data set with high-resolution climate data, to assess the impacts of climate change, primarily on agriculture. The image below shows what the architecture will look like.

Sample R architecture

The first step is to launch a server on AWS, called an Amazon EC2 instance, which is easy with the Getting Started instructions. In this post, I focus on five of the launch steps that affect your R-based analysis environment on AWS:

  • Choosing an Amazon Machine Image
  • Choosing an instance type
  • Configuring instance details (user data)
  • Configuring instance details (IAM roles)
  • Configuring a security group

After that, I’ll show you how to analyze data located on Amazon S3 and configure Shiny Server.
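For readers who prefer the command line, the five launch settings listed above map onto options of the AWS CLI command aws ec2 run-instances. The following is a minimal sketch, assuming the AWS CLI is installed and configured; the function name launch_r_instance, the AWS override variable, and every ID in the example are placeholders that do not come from this post.

```shell
# Sketch: launch an EC2 instance with the five settings discussed in this post.
# Assumption: the AWS CLI is installed and configured with EC2 permissions.
AWS="${AWS:-aws}"   # set AWS="echo aws" to print the command instead of running it

launch_r_instance() {
  ami_id="$1"          # Amazon Machine Image, for example an Amazon Linux AMI ID
  instance_type="$2"   # instance type, for example t2.medium
  sg_id="$3"           # security group that opens ports 8787 and 3838
  profile="$4"         # IAM instance profile name, for example rstats
  user_data="$5"       # local file containing the installation script
  key_name="$6"        # EC2 key pair used for SSH access
  $AWS ec2 run-instances \
    --image-id "$ami_id" \
    --instance-type "$instance_type" \
    --security-group-ids "$sg_id" \
    --iam-instance-profile "Name=$profile" \
    --user-data "file://$user_data" \
    --key-name "$key_name"
}

# Example (all values are placeholders):
#   launch_r_instance ami-12345678 t2.medium sg-12345678 rstats install-r.sh my-key
```

The sections that follow explain each of these settings in turn.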

Choosing an Amazon Machine Image for R

When launching an instance, you must choose an Amazon Machine Image (AMI), which contains all information required to start an instance. For example, an AMI defines which operating system is installed on your EC2 instance and which software is included.

You can choose the Amazon Linux AMI, which is provided at no additional cost and has a stable version of R in the repository. This AMI is maintained by AWS and includes packages and configurations that provide seamless integration with AWS and other software.

Choosing an Instance Type for R

Choose an EC2 instance type that matches the data size and processing that your analysis requires. By default, R runs on a single CPU core and, in many cases, requires a lot of memory.

For programming and development, the general-purpose T2 instance types are sufficient and inexpensive, and t2.micro is eligible for the AWS Free Tier. If you don’t know what instance type to choose, start with t2.medium.

M4 instance types are often a good choice for R workloads. If you use R packages such as foreach, parallel, or snow to parallelize, I recommend using the bigger M4 instance types because they provide a good mix of CPU power and memory.

If you want to connect R to GPU hardware, you can choose the G2 instance types to leverage packages like gputools.

AWS documentation provides more details about instance types. An advantage of using AWS is that you aren’t locked into the instance type you originally choose. You can change your instance type in minutes: just stop your instance, change the instance type, and start the instance again.
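The stop, change, and start sequence described above can also be scripted. This is a minimal sketch, assuming the AWS CLI is installed and configured and that the instance is EBS-backed; the function name resize_instance and the instance ID in the example are illustrative placeholders.

```shell
# Sketch: change the instance type of a stopped, EBS-backed EC2 instance.
# Assumption: the AWS CLI is installed and configured with EC2 permissions.
AWS="${AWS:-aws}"   # set AWS="echo aws" to print the commands instead of running them

resize_instance() {
  instance_id="$1"
  new_type="$2"
  # The instance must be fully stopped before its type can be changed.
  $AWS ec2 stop-instances --instance-ids "$instance_id" &&
  $AWS ec2 wait instance-stopped --instance-ids "$instance_id" &&
  $AWS ec2 modify-instance-attribute --instance-id "$instance_id" \
    --instance-type "{\"Value\": \"$new_type\"}" &&
  $AWS ec2 start-instances --instance-ids "$instance_id"
}

# Example (placeholder instance ID):
#   resize_instance i-0123456789abcdef0 m4.xlarge
```

Each step runs only if the previous one succeeded, so a failed stop never leads to an attempted type change.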

RStudio Server lets you share your R-based analysis server with several other scientists. Provision a Linux user for each scientist, and several scientists can work on the same machine. Every user requires at least one CPU and some memory; for multi-user activities, use at least an m4.2xlarge instance type.

Configuring instance details (user data)

When you launch an instance in EC2, you can pass in user data that can be used to perform common automated configuration tasks and even run scripts for installation after the instance starts. In the EC2 launch wizard, you can add this at the Configure Instance Details step by expanding the Advanced Details pane:

Expanding the Advanced Details pane

Add the following code to install R, RStudio Server, the Shiny R package, and Shiny Server. This also adds a user and password that you use for logging in later.

NOTE: Before running the script below, visit https://www.rstudio.com/products/rstudio/download-server/ to check for the latest version of RStudio Server, and modify the script to download and install the most recent version.


#!/bin/bash
#install R
yum install -y R
#install RStudio-Server
wget https://download2.rstudio.org/rstudio-server-rhel-0.99.465-x86_64.rpm
yum install -y --nogpgcheck rstudio-server-rhel-0.99.465-x86_64.rpm
#install shiny and shiny-server
R -e "install.packages('shiny', repos='http://cran.rstudio.com/')"
wget https://download3.rstudio.org/centos5.9/x86_64/shiny-server-1.4.0.718-rh5-x86_64.rpm
yum install -y --nogpgcheck shiny-server-1.4.0.718-rh5-x86_64.rpm
#add user(s)
useradd username
echo username:password | chpasswd

Change the user name and password based on your requirements, and check for the latest RStudio Server and Shiny Server versions. For a multi-user environment, you must add additional users at this point.
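For the multi-user case, the two user-creation lines of the script can be wrapped in a loop. This is a minimal sketch that assumes it runs as root, as user data scripts do; the function name add_rstudio_users, the DRY_RUN switch, and the names and passwords in the example are hypothetical.

```shell
# Sketch: create one Linux user per scientist for RStudio Server.
# Assumptions: runs as root; each argument has the form name:password,
# which is the input format that chpasswd expects.
add_rstudio_users() {
  for entry in "$@"; do
    name="${entry%%:*}"          # text before the first colon is the user name
    if [ -n "${DRY_RUN:-}" ]; then
      # Dry run: report what would happen without touching the system.
      echo "would create user: $name"
    else
      useradd "$name" && echo "$entry" | chpasswd
    fi
  done
}

# Example (placeholder credentials, replace them before use):
#   add_rstudio_users alice:ChangeMe1 bob:ChangeMe2
```

Because user data is stored with the instance in readable form, treat these as initial passwords and have each user change theirs after the first login.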

Configuring instance details (IAM roles)

In the same dialog box, you can add an IAM role to your EC2 instance.

IAM roles allow your application on EC2—in this case, R—to make API requests securely or access AWS services from your instances without requiring you to manage the AWS security credentials.

For your R-based data science environment, make sure that your EC2 instance has permission to access the S3 bucket that you created. You cannot read S3 files unless specifically given permission to do so. The following IAM policy shows how the role “rstats” is given privileges to list the S3 bucket “rstatsdata” and to read, write, and delete objects in it.


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::rstatsdata"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": ["arn:aws:s3:::rstatsdata/*"]
    }
  ]
}
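If you want to create the role from the command line instead of the console, the sketch below shows one possible sequence. It assumes the AWS CLI is installed with IAM permissions and that the policy above is saved in a local file; the function name create_rstats_role and the inline policy name suffix are illustrative choices, not part of the original setup.

```shell
# Sketch: create an IAM role that EC2 can assume and attach the S3 policy to it.
# Assumption: the AWS CLI is installed and configured with IAM permissions.
AWS="${AWS:-aws}"   # set AWS="echo aws" to print the commands instead of running them

create_rstats_role() {
  role_name="$1"      # for example rstats
  policy_file="$2"    # local file containing the S3 policy shown above
  trust_file="$(mktemp)"
  # Trust policy: allows the EC2 service to assume this role.
  cat > "$trust_file" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "ec2.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
  $AWS iam create-role --role-name "$role_name" \
    --assume-role-policy-document "file://$trust_file" &&
  $AWS iam put-role-policy --role-name "$role_name" \
    --policy-name "${role_name}-s3-access" \
    --policy-document "file://$policy_file" &&
  $AWS iam create-instance-profile --instance-profile-name "$role_name" &&
  $AWS iam add-role-to-instance-profile \
    --instance-profile-name "$role_name" --role-name "$role_name"
  status=$?
  rm -f "$trust_file"
  return "$status"
}

# Example:
#   create_rstats_role rstats policy.json
```

The instance profile is the object that the EC2 launch step actually references; here it simply reuses the role name.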

Configuring the security group

In the EC2 launch wizard, you define a security group, which acts as a virtual firewall that controls the traffic for one or more instances. For your R-based analysis environment, you have to open up port 8787 for RStudio Server and port 3838 for Shiny Server.
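The same two ports can be opened on an existing security group with the AWS CLI. This is a minimal sketch, assuming the AWS CLI is installed and configured; the function name open_r_ports, the group ID, and the address range in the example are placeholders. Restricting the source range to your own network, rather than opening the ports to everyone, is a suggestion of this sketch and not a requirement stated in the post.

```shell
# Sketch: allow inbound TCP traffic to RStudio Server (8787) and Shiny Server (3838).
# Assumption: the AWS CLI is installed and configured with EC2 permissions.
AWS="${AWS:-aws}"   # set AWS="echo aws" to print the commands instead of running them

open_r_ports() {
  group_id="$1"   # security group ID
  cidr="$2"       # address range that is allowed to connect
  for port in 8787 3838; do
    $AWS ec2 authorize-security-group-ingress --group-id "$group_id" \
      --protocol tcp --port "$port" --cidr "$cidr" || return 1
  done
}

# Example (placeholder group ID and documentation address range):
#   open_r_ports sg-0123456789abcdef0 203.0.113.0/24
```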

Loading data into your R-based environment on AWS

After your EC2 instance is running, you can connect via web browser to RStudio Server and R. For login credentials, use the newly created user and password. The URL looks like the following:

http://ec2-YOUR-IP.REGION.compute.amazonaws.com:8787

You can find more details about your public DNS in the EC2 console. To change your Linux user password using RStudio, choose Tools > Shell and type the Linux command passwd.
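As an alternative to looking up the public DNS name in the console, you can query it with the AWS CLI and build the URL from it. This is a small sketch, assuming the AWS CLI is installed and configured; the helper names public_dns and service_url and the instance ID are hypothetical.

```shell
# Sketch: look up an instance's public DNS name and build the service URL.
# Assumption: the AWS CLI is installed and configured with EC2 permissions.
AWS="${AWS:-aws}"

public_dns() {
  # $1 = instance ID; prints the public DNS name of that instance.
  $AWS ec2 describe-instances --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].PublicDnsName' --output text
}

service_url() {
  # $1 = public DNS name, $2 = port (8787 for RStudio Server, 3838 for Shiny Server)
  printf 'http://%s:%s\n' "$1" "$2"
}

# Example (placeholder instance ID):
#   service_url "$(public_dns i-0123456789abcdef0)" 8787
```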

You can do most of your work using RStudio Server, but in some cases you might have to log in to your EC2 instance via SSH. For example, some R packages depend on Linux system packages that must be installed first. The next steps use the RCurl package, which requires the curl-devel Linux package. Connect to your EC2 instance via SSH and execute the following command:


sudo yum install curl-devel

Now that the underlying R server is set up, load data and obtain R code.

Git is a popular code versioning system. GitHub and Amazon CodeCommit provide managed Git services, and RStudio integrates well with different services. I recommend that you put your code in a Git repository so that you can share your code with colleagues, move code from your laptop to AWS, and track your changes.

Storing data in S3

Amazon S3 is secure, durable, highly scalable object storage. It is easy to use, with a simple web service interface to store and retrieve any amount of data from anywhere on the web. It’s easy to get started with S3.

Move your data to S3 for analysis, copy the data via the AWS command line interface to your EC2 instance, and read the data into R. Alternatively, if you grant read permission on an S3 object to “Everyone”, you can read the object directly into R using the RCurl package, but making the object public might be a security issue for your data.


> install.packages("RCurl")
> library("RCurl")
> data <- read.table(textConnection(getURL(
+     "https://cgiardata.s3-us-west-2.amazonaws.com/ccafs/amzn.csv"
+ )), sep=",", header=FALSE)
> head(data)
  X1.15.2014 X395.87 X2677150 X398.94 X399.31 X392.534
1  1/14/2014  397.54  2339458  392.13 398.630   391.29
2  1/13/2014  390.98  2843810  397.98 399.780   388.45
3  1/10/2014  397.66  2678085  402.53 403.764   393.80
4   1/9/2014  401.01  2103029  403.71 406.890   398.44
5   1/8/2014  401.92  2316220  398.47 403.000   396.04
6   1/7/2014  398.03  1916017  395.04 398.470   394.29

This example reads a file from the CCAFS-Climate Data set and displays the first rows of the resulting data frame.
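The private alternative mentioned earlier, copying the object to the instance with the AWS CLI, might look like the following sketch. It assumes the AWS CLI is available on the instance and that the instance role grants read access to the bucket; the function name fetch_from_s3, the default target directory, and the object key in the example are placeholders.

```shell
# Sketch: copy an object from S3 to a local directory on the EC2 instance.
# Assumption: the AWS CLI is installed and the instance's IAM role allows s3:GetObject.
AWS="${AWS:-aws}"   # set AWS="echo aws" to print the command instead of running it

fetch_from_s3() {
  s3_uri="$1"                    # for example s3://rstatsdata/amzn.csv
  dest_dir="${2:-$HOME/data}"    # local target directory (created if missing)
  mkdir -p "$dest_dir" &&
  $AWS s3 cp "$s3_uri" "$dest_dir/"
}

# Example (placeholder object key):
#   fetch_from_s3 s3://rstatsdata/amzn.csv
# Afterwards, in R:
#   data <- read.csv("~/data/amzn.csv", header = FALSE)
```

With this approach the object stays private, because access is granted through the IAM role rather than through a public object permission.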

Configuring Shiny Server

To use Shiny Server, you have to make some small configuration changes. Connect to your EC2 instance and run the following commands:


sudo /opt/shiny-server/bin/deploy-example user-dirs
mkdir ~/ShinyApps

This configuration lets all system users host their own applications by creating a “ShinyApps” directory in their home directory. For help configuring Shiny Server, see the Quick Start section of the Shiny Server Professional v1.4.0 Administrator’s Guide.

By default, Shiny Server listens on port 3838, so your new application will be available at the following URL, where <your_username> is your Linux username:

http://ec2-YOUR-IP.REGION.compute.amazonaws.com:3838/<your_username>/MyApp

Now you can create your Shiny dashboards and deploy them via your ShinyApps folder. For advice on creating a Shiny dashboard, see the Teach Yourself Shiny tutorial.
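To try out the ShinyApps folder quickly, you can generate a very small application from the shell. The following sketch writes a two-file Shiny app (ui.R and server.R) into a MyApp directory; the function name create_shiny_app and the histogram example are illustrative and not taken from this post.

```shell
# Sketch: scaffold a minimal Shiny app in the user's ShinyApps directory.
# Assumption: the shiny R package is installed, as in the user data script.
create_shiny_app() {
  app_dir="${1:-$HOME/ShinyApps/MyApp}"
  mkdir -p "$app_dir" || return 1

  # User interface: a slider and a plot area.
  cat > "$app_dir/ui.R" <<'EOF'
library(shiny)

fluidPage(
  titlePanel("Random sample histogram"),
  sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)
EOF

  # Server logic: redraw the histogram whenever the slider changes.
  cat > "$app_dir/server.R" <<'EOF'
library(shiny)

function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Normal sample")
  })
}
EOF
}

# Example: create ~/ShinyApps/MyApp for the current user
#   create_shiny_app
```

After running it as your Linux user, the app should be reachable under the MyApp URL shown above.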

Don’t forget to stop your EC2 instance when you’re not using it. This lowers costs, and starting the instance takes less than five minutes. Consider stopping rather than terminating, because terminating deletes all data and code located on the instance.

Summary

With AWS, you can get a server with up to 244 GB of main memory and up to 40 CPUs, so your R-based analyses are no longer limited by local hardware or computation time. The Amazon Linux AMI is a good starting point for setting up your own analysis environment. You can change your instance type in minutes and optimize your infrastructure based on your requirements. Furthermore, many R packages, such as RJDBC or dplyr, let you connect to AWS big data services. AWS provides efficient, scalable infrastructure for installing R, RStudio Server, and Shiny Server for data analysis.

If you have questions or suggestions, please leave a comment below.

