Running sparklyr – RStudio’s R Interface to Spark on Amazon EMR

This post was last updated July 7th, 2021 (original version by Tom Zeng).

The Sparklyr package by RStudio has made processing big data in R a lot easier. Sparklyr is an R interface to Spark that lets you use Spark as the backend for dplyr, one of the most popular data manipulation packages. Sparklyr also lets users query data in Spark using SQL, develop extensions for the full Spark API, and provide interfaces to Spark packages.

Amazon EMR is a popular hosted big data processing service on AWS that provides the latest version of Spark and other Hadoop ecosystem applications such as Hive, Pig, Tez, and Presto.

Running RStudio and Sparklyr on EMR is simple. The following AWS CLI command launches a five-node (one master node and four worker nodes) EMR 6.1 cluster with Spark, RStudio, Shiny, and Sparklyr pre-installed and ready to use.[i] Note that the rstudio_sparklyr_emr6.sh bootstrap script can be modified to accommodate newer versions of RStudio Server:

aws emr create-cluster \
    --applications Name=Hadoop Name=Spark Name=Hive Name=Pig Name=Tez Name=Ganglia \
    --release-label emr-6.1.0 \
    --name "EMR 6.1 RStudio + sparklyr" \
    --service-role EMR_DefaultRole \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5a.2xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=m5a.2xlarge \
    --bootstrap-action Path=s3://aws-bigdata-blog/artifacts/aws-blog-emr-rstudio-sparklyr/rstudio_sparklyr_emr6.sh,Name="Install RStudio" \
    --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,KeyName=$YOUR_KEY \
    --configurations '[{"Classification":"spark","Properties":{"maximizeResourceAllocation":"true"}}]' \
    --region us-east-1

The bootstrap script installs R (version 4.0.3 as of this writing) and R packages under the path /usr/local.

If you’re not familiar with the AWS CLI, you can launch an EMR cluster from the AWS Management Console by specifying the following bootstrap action and arguments:

s3://aws-bigdata-blog/artifacts/aws-blog-emr-rstudio-sparklyr/rstudio_sparklyr_emr6.sh

After the cluster is launched, set up the SSH tunnel and web proxy per the instructions on the EMR console or docs. RStudio is available at the following locations:

http://localhost:8787
http://<EMR master instance>:8787
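
As a rough sketch of the tunnel step, the local port forward below maps port 8787 on your machine to RStudio Server on the master node, which is what makes the http://localhost:8787 address work. The key file path and the master DNS name are placeholders (assumptions, not values from this post); substitute your own, and treat the EMR documentation as the authoritative procedure. The script only prints the command so you can review it before running it.

```shell
#!/bin/sh
# Placeholder values (assumptions): replace with your key pair file and
# the master node's public DNS name shown on the EMR console.
KEY_FILE="$HOME/mykeypair.pem"
MASTER_DNS="ec2-xx-xx-xx-xx.compute-1.amazonaws.com"

# -N: do not run a remote command, only forward ports
# -L: forward local port 8787 to port 8787 on the master node
# hadoop is the login user on EMR cluster nodes
TUNNEL_CMD="ssh -i $KEY_FILE -N -L 8787:localhost:8787 hadoop@$MASTER_DNS"

# Print the command for review; run it yourself once the values are correct.
echo "$TUNNEL_CMD"
```

While the tunnel is open, RStudio should respond on the localhost address; closing the SSH session ends the forward.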

Log in to RStudio using user hadoop and password hadoop (these values can be overridden with the --user and --user-pw arguments for the bootstrap action). For easy access to AWS resources in R, install the CloudyR AWS packages from within RStudio or with the following command in R:

install.packages("cloudyr")
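
To change the RStudio login mentioned above, the bootstrap action value in the launch command can carry arguments. The sketch below assembles such a value with the AWS CLI shorthand Args=[...] list, assuming the script's flags are spelled --user and --user-pw. The user name and password are hypothetical placeholders, and the script only prints the value rather than launching a cluster.

```shell
#!/bin/sh
# Hypothetical credentials (assumptions): choose your own values.
RSTUDIO_USER="analyst"
RSTUDIO_PW="change-me"

# Bootstrap script location, as given in this post.
SCRIPT="s3://aws-bigdata-blog/artifacts/aws-blog-emr-rstudio-sparklyr/rstudio_sparklyr_emr6.sh"

# Args is the AWS CLI shorthand list of arguments handed to the bootstrap script.
BOOTSTRAP_ACTION="Path=$SCRIPT,Name=Install RStudio,Args=[--user,$RSTUDIO_USER,--user-pw,$RSTUDIO_PW]"

# Print the value for review before using it.
echo "$BOOTSTRAP_ACTION"
```

The printed value would replace the Path=...,Name=... value in the create-cluster command; pass it as a single quoted argument ("$BOOTSTRAP_ACTION") so the space in the name is preserved.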

Now, try the following Sparklyr examples (adapted from RStudio):

# load the sparklyr, dplyr, ggplot2, and DBI packages
library(sparklyr)
library(dplyr)
library(ggplot2)
library(DBI)

# connect to Spark on YARN (version should match the Spark release installed on the cluster)
sc <- spark_connect(master = "yarn", version = "3.2.1")


# copy some builtin sample data to the Spark cluster
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
src_tbls(sc)

# filter by departure delay and print the first few records
flights_tbl %>% filter(dep_delay == 2)

# plot data on flight delays
delay <- flights_tbl %>% 
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()


ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)

# window functions  
batting_tbl %>%
  select(playerID, yearID, teamID, G, AB:H) %>%
  group_by(playerID) %>%
  filter(min_rank(desc(H)) <= 2 & H > 0) %>%
  arrange(playerID, yearID, teamID)

# use SQL  
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
iris_preview

# Machine learning example using Spark MLlib

# copy the mtcars data into Spark
mtcars_tbl <- copy_to(sc, mtcars)

# transform our data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
  filter(hp >= 100) %>%
  mutate(cyl8 = cyl == 8) %>%
  sdf_random_split(training = 0.5, test = 0.5, seed = 1099)

# fit a linear regression model to the training dataset
fit <- partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
fit

summary(fit)

# get the 10th row in test data
car <- partitions$test %>% collect() %>% slice(10)
# predict the mpg 
predicted_mpg <- car$cyl * fit$coefficients["cyl"] + car$wt * fit$coefficients["wt"] + fit$coefficients["(Intercept)"]

# print the original and the predicted
sprintf("original mpg = %s, predicted mpg = %s", car$mpg, predicted_mpg)

You can also use Shiny from within Amazon EMR. Shiny allows you to develop interactive web applications in R and publish your analytics results.

You can install Shiny as a package either within RStudio or from within R with the following command:

install.packages("shiny")

Create and load the following file (named app.R) to try it out:

# app.R
library(shiny)
library(ggplot2)
library(sparklyr)
library(dplyr)
library(nycflights13)

options(bitmapType='cairo')

# create the Spark context
sc <- spark_connect(master = "yarn", version = "3.2.1")

flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
delay <- flights_tbl %>%
    group_by(tailnum) %>%
    summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
    filter(count > 20, dist < 2000, !is.na(delay)) %>%
    collect()

server <- function(input, output) {
    output$flightDelayPlot <- renderPlot({
        ggplot(delay, aes(dist, delay)) +
            geom_point(aes(size = count), alpha = 1/2) +
            geom_smooth() +
            scale_size_area(max_size = 2)
    })
}

# UI
ui <- fluidPage(
    titlePanel("flight delay by distance"),
    mainPanel(
        plotOutput("flightDelayPlot")
    )
)
shinyApp(ui, server)

You can launch and preview the Shiny app by choosing Run App from within RStudio, or by running the following R command:

shiny::runApp()


Conclusion

With RStudio and Sparklyr running on Amazon EMR, data scientists and other R users can keep using their existing R code and favorite packages while tapping into Spark’s capabilities and speed for analyzing huge amounts of data stored in Amazon S3 or HDFS.

Amazon EMR makes it easy to spin up clusters with different sizes and CPU and memory configurations to suit different workloads and budgets. If you have any questions about using R and Spark or would like to share your use cases, please leave us a comment below.

[i] Note that production deployments may want to choose a less permissive role than the default EMR role. Please see the AWS documentation and this blog post for additional guidance about securing EMR: https://aws.amazon.com/blogs/big-data/best-practices-for-securing-amazon-emr/
