AWS Big Data Blog

Running sparklyr – RStudio’s R Interface to Spark on Amazon EMR

by Tom Zeng

Tom Zeng is a Solutions Architect for Amazon EMR

The recently released sparklyr package by RStudio has made processing big data in R a lot easier. sparklyr is an R interface to Spark that allows users to use Spark as the backend for dplyr, one of the most popular data manipulation packages. sparklyr provides interfaces to Spark packages and also allows users to query data in Spark using SQL and develop extensions for the full Spark API.

Amazon EMR is a popular, hosted big data processing service on AWS that provides the latest version of Spark and other Hadoop ecosystem applications, such as Hive, Pig, Tez, and Presto.

Running RStudio and sparklyr on EMR is simple. The following AWS CLI command launches a 6-node (1 master node and 5 worker nodes) EMR 5.2.0 cluster with Spark, RStudio, and sparklyr pre-installed and ready to use:

aws emr create-cluster --applications Name=Hadoop Name=Spark Name=Hive Name=Pig Name=Tez Name=Ganglia \
--release-label emr-5.2.0 --name "EMR 5.2.0 RStudio + sparklyr" --service-role EMR_DefaultRole \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.2xlarge \
InstanceGroupType=CORE,InstanceCount=5,InstanceType=m3.2xlarge --bootstrap-actions \
Path=s3://aws-bigdata-blog/artifacts/aws-blog-emr-rstudio-sparklyr/rstudio_sparklyr_emr5.sh,\
Args=["--rstudio","--sparkr","--rexamples","--plyrmr","--rhdfs","--sparklyr"],\
Name="Install RStudio" --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,KeyName=<Your Key> \
--configurations '[{"Classification":"spark","Properties":{"maximizeResourceAllocation":"true"}}]' \
--region us-east-1

Users who are not familiar with the AWS CLI can launch an EMR cluster from the AWS Management Console by specifying the following bootstrap action and arguments:

s3://aws-bigdata-blog/artifacts/aws-blog-emr-rstudio-sparklyr/rstudio_sparklyr_emr5.sh 
--sparklyr --rstudio --sparkr --rexamples --plyrmr
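For a cluster launched from the AWS CLI, the following is a minimal sketch of waiting for the cluster to come up and then looking up the master node's public DNS name, which is needed to connect to the master node. The helper name emr_master_dns and the cluster ID in the usage line are placeholders that do not appear in the original command; aws emr wait cluster-running and aws emr describe-cluster are standard AWS CLI commands.

```shell
# Hypothetical helper (not part of the bootstrap script): block until the
# cluster is up, then print the master node's public DNS name.
#   $1 = the cluster ID printed by create-cluster, for example j-XXXXXXXXXXXXX
# Add --region to both calls if the cluster is not in your default region.
emr_master_dns() {
  aws emr wait cluster-running --cluster-id "$1" &&
    aws emr describe-cluster --cluster-id "$1" \
      --query 'Cluster.MasterPublicDnsName' --output text
}

# Usage (placeholder cluster ID):
#   emr_master_dns j-XXXXXXXXXXXXX
```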

After the cluster is launched, set up the SSH tunnel and web proxy. RStudio is available at the following locations:

  • http://localhost:8787
  • http://<EMR master instance>:8787
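One way to make the http://localhost:8787 address work is plain SSH local port forwarding. The sketch below only builds and prints the command so that it can be reviewed before it is run. The function name rstudio_tunnel_cmd, the key file path, and the host name are placeholders. It assumes the default EMR login user hadoop and that RStudio listens on port 8787 on the master node, as the URLs above suggest.

```shell
# Hypothetical helper: print an ssh command that forwards local port 8787
# to RStudio on the EMR master node.
#   $1 = path to the private key of the EC2 key pair used at launch
#   $2 = public DNS name of the master node
# Note: a key path containing spaces would need extra quoting.
rstudio_tunnel_cmd() {
  # -N: run no remote command; -L: forward local 8787 to port 8787 on the master
  printf 'ssh -i %s -N -L 8787:localhost:8787 hadoop@%s\n' "$1" "$2"
}

# Usage (placeholder values): review the printed command, then run it
rstudio_tunnel_cmd "$HOME/mykey.pem" "ec2-xx-xx-xx-xx.compute-1.amazonaws.com"
```

While the printed ssh command is running, the tunnel stays open and the browser on the local machine can use the localhost address.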

Log in to RStudio with the user name hadoop and the password hadoop (these values can be overridden with the --user and --user-pw arguments for the bootstrap action). Now, try the following sparklyr examples (adapted from RStudio, http://spark.rstudio.com/):

# load the sparklyr package
library(sparklyr)

# create the Spark context, for EMR 4.7/4.8 use version = "1.6.2" and for EMR 4.6 use "1.6.1"
sc <- spark_connect(master = "yarn-client", version = "2.0.0")

# load dplyr
library(dplyr)

# copy some built-in sample data to the Spark cluster
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
src_tbls(sc)

# filter by departure delay and print the first few records
flights_tbl %>% filter(dep_delay == 2)

# plot data on flight delays
delay <- flights_tbl %>% 
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()

library(ggplot2)
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)

# window functions  
batting_tbl %>%
  select(playerID, yearID, teamID, G, AB:H) %>%
  arrange(playerID, yearID, teamID) %>%
  group_by(playerID) %>%
  filter(min_rank(desc(H)) <= 2 & H > 0)

# use SQL  
library(DBI)
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
iris_preview

# Machine learning example using Spark MLlib

# copy the mtcars data into Spark
mtcars_tbl <- copy_to(sc, mtcars)

# transform the data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
  filter(hp >= 100) %>%
  mutate(cyl8 = cyl == 8) %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)

# fit a linear regression model to the training dataset
fit <- partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
fit

summary(fit)

# get the 10th row in test data (collect() pulls the test partition into a local data frame)
car <- partitions$test %>% collect() %>% slice(10)
# predict the mpg from the fitted coefficients
predicted_mpg <- car$cyl * fit$coefficients["cyl"] + car$wt * fit$coefficients["wt"] + fit$coefficients["(Intercept)"]

# print the original and the predicted
sprintf("original mpg = %s, predicted mpg = %s", car$mpg, predicted_mpg)

The bootstrap action can also be used to install the Shiny server by passing it the --shiny argument. Shiny allows users to develop interactive web applications in R and publish their analytics results.
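As a sketch, the bootstrap action from the create-cluster command above could be extended as follows. The variable name BOOTSTRAP_ACTION is only for illustration; the path and the other arguments are copied from that command, and the Shiny argument is the only addition.

```shell
# Same bootstrap action as in the create-cluster command, with --shiny appended.
# BOOTSTRAP_ACTION is an illustrative variable name, not something the script requires.
BOOTSTRAP_ACTION='Path=s3://aws-bigdata-blog/artifacts/aws-blog-emr-rstudio-sparklyr/rstudio_sparklyr_emr5.sh,Args=[--rstudio,--sparkr,--rexamples,--plyrmr,--rhdfs,--sparklyr,--shiny],Name=Install RStudio'

# Pass it in place of the original value:
#   aws emr create-cluster ... --bootstrap-actions "$BOOTSTRAP_ACTION" ...
echo "$BOOTSTRAP_ACTION"
```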

If Shiny is installed, create the following two files, server.R and ui.R, in the same directory to try it out:

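# server.R
library(shiny)
library(ggplot2)

# 'delay' is the data frame collected in the sparklyr example above;
# it must exist in the R session that runs the app
function(input, output) {
  output$flightDelayPlot <- renderPlot({
    ggplot(delay, aes(dist, delay)) +
      geom_point(aes(size = count), alpha = 1/2) +
      geom_smooth() +
      scale_size_area(max_size = 2)
  })
}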

# ui.R
library(shiny)
fluidPage(
  titlePanel("flight delay by distance"),
  mainPanel(
    plotOutput("flightDelayPlot")
  )
)

Run shiny::runApp() from the directory that contains server.R and ui.R to display the flight delay chart as a web page.

Conclusion

With RStudio and sparklyr running on Amazon EMR, data scientists and other R users can keep using their existing R code and favorite packages while tapping into Spark’s capabilities and speed to analyze huge amounts of data stored in Amazon S3 or HDFS.

EMR makes it easy to spin up clusters with different sizes and CPU/memory configurations to suit different workloads and budgets. If you have any questions about using R and Spark or would like to share your use cases, please leave us a comment below.

 

