Running sparklyr – RStudio’s R Interface to Spark on Amazon EMR
This post was last updated July 7th, 2021 (original version by Tom Zeng).
The Sparklyr package by RStudio has made processing big data in R a lot easier. Sparklyr is an R interface to Spark that lets you use Spark as the backend for dplyr, one of the most popular data manipulation packages. Sparklyr also lets users query data in Spark using SQL, develop extensions for the full Spark API, and provide interfaces to Spark packages.
Amazon EMR is a popular hosted big data processing service on AWS that provides the latest version of Spark and other Hadoop ecosystem applications such as Hive, Pig, Tez, and Presto.
Running RStudio and Sparklyr on EMR is simple; the following AWS CLI command launches a 5-node (1 master node and 4 worker nodes) EMR 6.1 cluster with Spark, RStudio, Shiny, and Sparklyr pre-installed and ready to use.[i] Note that the rstudio_sparklyr_emr6.sh bootstrap script can be modified to accommodate newer versions of RStudio Server:
The bootstrap script installs R (version 4.0.3 as of this writing) and R packages under the path
If you’re not familiar with the AWS CLI, you can launch an EMR cluster from the AWS Management Console by specifying the following bootstrap action and arguments:
Once the cluster is running, RStudio is available at http://<EMR master instance>:8787
Log in to RStudio using user hadoop and password hadoop (these values can be overridden with the --user and --user-pw arguments for the bootstrap action). For easy access to AWS resources in R, install the CloudyR AWS packages from within RStudio or with the following command in R:
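As a minimal sketch of such an install, assuming that aws.s3 (one of the CloudyR packages) is the one you want and that it is available from the repository configured in your R session, the command might look like this:

```r
# Assumption: aws.s3 (one of the CloudyR packages) can be installed from the
# repository configured in this R session; pass `repos` explicitly if not.
install.packages("aws.s3")

# Loading the package confirms that the installation succeeded.
library(aws.s3)
```

Other CloudyR packages can be installed the same way by substituting the package name.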
Now, try the following Sparklyr examples (adapted from RStudio):
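The sketch below is an illustrative example rather than the original listing. It assumes you are working in RStudio on the EMR master node, that SPARK_HOME is already set there, and that YARN is the cluster manager. The built-in mtcars data set stands in for real data. It shows both the dplyr backend and the SQL interface described earlier:

```r
library(sparklyr)
library(dplyr)

# Assumption: running on the EMR master node with SPARK_HOME set and YARN as
# the cluster manager; change `master` for other setups.
sc <- spark_connect(master = "yarn")

# Copy a small built-in data set into Spark (mtcars stands in for real data).
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and run on the cluster;
# collect() brings the small aggregated result back into R.
mpg_by_cyl <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE), cars = n()) %>%
  arrange(cyl) %>%
  collect()
print(mpg_by_cyl)

# The same Spark table can also be queried with SQL through DBI.
heavy_cars <- DBI::dbGetQuery(
  sc,
  "SELECT cyl, COUNT(*) AS cars FROM mtcars WHERE wt > 3 GROUP BY cyl"
)
print(heavy_cars)

# Release the cluster resources held by this session.
spark_disconnect(sc)
```

Only the aggregated results are returned to R; the grouping and counting happen in Spark, which is what makes this pattern usable on data sets far larger than mtcars.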
You can also use Shiny from within Amazon EMR. Shiny allows you to develop interactive web applications in R and publish your analytics results.
You can install Shiny as a package either within RStudio or from within R with the following command:
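Assuming the package is installed from the repository configured in your R session, a minimal form of that command is:

```r
# Installs the shiny package from the repository configured in this session.
install.packages("shiny")
```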
Create and load the following file (named app.R) to try it out:
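The following app.R is a minimal sketch, not the original listing. It uses the built-in faithful data set and hypothetical input and output names (bins, hist) to show the basic structure of a Shiny app: a UI definition, a server function, and a call that ties them together.

```r
library(shiny)

# UI: a slider that controls the number of histogram bins, plus a plot area.
ui <- fluidPage(
  titlePanel("Old Faithful eruptions"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server: redraws the histogram whenever the slider value changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    x <- faithful$eruptions
    breaks <- seq(min(x), max(x), length.out = input$bins + 1)
    hist(x, breaks = breaks, col = "steelblue", border = "white",
         xlab = "Eruption duration (minutes)", main = NULL)
  })
}

# The app object must be the last expression in app.R.
shinyApp(ui = ui, server = server)
```

In RStudio, open app.R and click Run App, or call shiny::runApp() from the directory that contains the file.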