This blog post has been translated into Japanese.
Data scientists are often concerned about managing the infrastructure behind big data platforms while running SQL on R. Amazon Athena is an interactive query service that works directly with data stored in S3 and makes it easy to analyze data using standard SQL without the need to manage infrastructure. Integrating R with Amazon Athena gives data scientists a powerful platform for building interactive analytical solutions.
In this blog post, you’ll connect R/RStudio running on an Amazon EC2 instance with Athena.
Prerequisites
Before you get started, complete the following steps.
-
- Have your AWS account administrator give your AWS account the required permissions to access Athena via Amazon’s Identity and Access Management (IAM) console. This can be done by attaching the associated Athena policies to your data scientist user group in IAM.
- Provide a staging directory in the form of an Amazon S3 bucket. Athena will use this to query datasets and store results. We’ll call this staging bucket s3://athenauser-athena-r in the instructions that follow.
NOTE: In this blog post, I create all AWS resources in the US-East region. Use the Region Table to check the availability of Athena in other regions.
Set up R and RStudio on EC2
- Follow the instructions in the blog post “Running R on AWS” to set up R on an EC2 instance (t2.medium or greater) running Amazon Linux . Read the step below before you begin.
- In that blog post under “Advanced Details,” when you reach step 3 use the following bash script to install the latest version of RStudio. Modify the password for RStudio as needed.
#!/bin/bash
#install R
yum install -y R
#install RStudio-Server
wget https://download2.rstudio.org/rstudio-server-rhel-1.0.153-x86_64.rpm
yum install -y --nogpgcheck rstudio-server-rhel-1.0.153-x86_64.rpm
#add user(s)
useradd rstudio
echo rstudio:rstudio | chpasswd
Install Java 8
- SSH into this EC2 instance.
- Remove older versions of Java.
- Install Java 8. This is required to work with Athena.
- Run the following commands on the command line.
#install Java 8, select ‘y’ from options presented to proceed with installation
sudo yum install java-1.8.0-openjdk-devel
#remove version 7 of Java, select ‘y’ from options to proceed with removal
sudo yum remove java-1.7.0-openjdk
#configure java, choose 1 as your selection option for java 8 configuration
sudo /usr/sbin/alternatives --config java
#run command below to add Java support to R
sudo R CMD javareconf
#following libraries are required for the interactive application we build later
sudo yum install -y libpng-devel
sudo yum install -y libjpeg-turbo-devel
Set up .Renviron
You need to configure the R environment variable .Renviron with the required Athena credentials.
- Get the required credentials from your AWS Administrator in the form of AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
- Type the following command from the Linux command prompt and bring up the vi editor.
sudo vim /home/rstudio/.Renviron
Provide your Athena credentials in the following form into the editor:
ATHENA_USER=< AWS_ACCESS_KEY_ID >
ATHENA_PASSWORD=< AWS_SECRET_ACCESS_KEY>
- Save this file and exit from the editor.
Log in to RStudio
Next, you’ll log in to RStudio on your EC2 instance.
- Get the public IP address of your instance from the EC2 dashboard and paste it followed by :8787 (port number for RStudio) into your browser window.
- Confirm that your IP address has been whitelisted for inbound access to port 8787 as part of the configuration for the security group associated with your EC2 instance.
- Log in to RStudio with the username and password you provided previously.
Install R packages
Next, you’ll install and load the required R packages.
#--following R packages are required for connecting R with Athena
install.packages("rJava")
install.packages("RJDBC")
library(rJava)
library(RJDBC)
#--following R packages are required for the interactive application we build later
#--steps below might take several minutes to complete
install.packages(c("plyr","dplyr","png","RgoogleMaps","ggmap"))
library(plyr)
library(dplyr)
library(png)
library(RgoogleMaps)
library(ggmap)
Connect to Athena
The following steps in R download the Athena driver and set up the required connection. Use the JDBC URL associated with your region.
#verify Athena credentials by inspecting results from command below
Sys.getenv()
#set up URL to download Athena JDBC driver
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
fil
list.files()
#set up driver connection to JDBC
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")
#connect to Athena using the driver, S3 working directory and credentials for Athena
#replace ‘athenauser’ below with prefix you have set up for your S3 bucket
con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
s3_staging_dir="s3://athenauser-athena-r",
user=Sys.getenv("ATHENA_USER"),
password=Sys.getenv("ATHENA_PASSWORD"))
#in case of error or warning from step above ensure rJava and RJDBC packages have #been loaded
#also ensure you have Java 8 running and configured for R as outlined earlier
Now you’re ready to start querying Athena from RStudio.
Sample Queries to test
# get a list of all tables currently in Athena
dbListTables(con)
# run a sample query
dfelb=dbGetQuery(con, "SELECT * FROM sampledb.elb_logs limit 10")
head(dfelb,2)
Interactive Use Case
Next, you’ll practice interactively querying Athena from R for analytics and visualization. For this purpose, you’ll use GDELT, a publicly available dataset hosted on S3.
Create a table in Athena from R using the GDELT dataset. This step can also be performed from the AWS management console as illustrated in the blog post “Amazon Athena – Interactive SQL Queries for Data in Amazon S3.”
#---sql create table statement in Athena
dbSendQuery(con,
"
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.gdeltmaster (
GLOBALEVENTID BIGINT,
SQLDATE INT,
MonthYear INT,
Year INT,
FractionDate DOUBLE,
Actor1Code STRING,
Actor1Name STRING,
Actor1CountryCode STRING,
Actor1KnownGroupCode STRING,
Actor1EthnicCode STRING,
Actor1Religion1Code STRING,
Actor1Religion2Code STRING,
Actor1Type1Code STRING,
Actor1Type2Code STRING,
Actor1Type3Code STRING,
Actor2Code STRING,
Actor2Name STRING,
Actor2CountryCode STRING,
Actor2KnownGroupCode STRING,
Actor2EthnicCode STRING,
Actor2Religion1Code STRING,
Actor2Religion2Code STRING,
Actor2Type1Code STRING,
Actor2Type2Code STRING,
Actor2Type3Code STRING,
IsRootEvent INT,
EventCode STRING,
EventBaseCode STRING,
EventRootCode STRING,
QuadClass INT,
GoldsteinScale DOUBLE,
NumMentions INT,
NumSources INT,
NumArticles INT,
AvgTone DOUBLE,
Actor1Geo_Type INT,
Actor1Geo_FullName STRING,
Actor1Geo_CountryCode STRING,
Actor1Geo_ADM1Code STRING,
Actor1Geo_Lat FLOAT,
Actor1Geo_Long FLOAT,
Actor1Geo_FeatureID INT,
Actor2Geo_Type INT,
Actor2Geo_FullName STRING,
Actor2Geo_CountryCode STRING,
Actor2Geo_ADM1Code STRING,
Actor2Geo_Lat FLOAT,
Actor2Geo_Long FLOAT,
Actor2Geo_FeatureID INT,
ActionGeo_Type INT,
ActionGeo_FullName STRING,
ActionGeo_CountryCode STRING,
ActionGeo_ADM1Code STRING,
ActionGeo_Lat FLOAT,
ActionGeo_Long FLOAT,
ActionGeo_FeatureID INT,
DATEADDED INT,
SOURCEURL STRING )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://support.elasticmapreduce/training/datasets/gdelt'
;
"
)
dbListTables(con)
You should see this newly created table named ‘gdeltmaster’ appear in your RStudio console after executing the statement above.
Query this Athena table to get a count of all CAMEO events that took place in the US in 2015.
#--get count of all CAMEO events that took place in US in year 2015
#--save results in R dataframe
dfg<-dbGetQuery(con,"SELECT eventcode,count(*) as count
FROM sampledb.gdeltmaster
where year = 2015 and ActionGeo_CountryCode IN ('US')
group by eventcode
order by eventcode desc"
)
str(dfg)
head(dfg,2)
#--get list of top 5 most frequently occurring events in US in 2015
dfs=head(arrange(dfg,desc(count)),5)
dfs
From the R output shown above, you can see that CAMEO event “042” has the highest count. From the CAMEO manual, you can determine that this event has the description “Travel to another location for a meeting or other event.”
Next, you’ll use the knowledge gained from this analysis to get a list of all geo-coordinates associated with this specific event from the Athena table.
#--get a list of latitude and longitude associated with event “042”
#--save results in R dataframe
dfgeo<-dbGetQuery(con,"SELECT actiongeo_lat,actiongeo_long
FROM sampledb.gdeltmaster
where year = 2015 and ActionGeo_CountryCode IN ('US')
and eventcode = '042'
"
)
#--duration of above query will depend on factors like size of chosen EC2 instance
#--now rename columns in dataframe for brevity
names(dfgeo)[names(dfgeo)=="actiongeo_lat"]="lat"
names(dfgeo)[names(dfgeo)=="actiongeo_long"]="long"
names(dfgeo)
#let us inspect this R dataframe
str(dfgeo)
head(dfgeo,5)
Next, generate a map for the United States.
#--generate map for the US using the ggmap package
map=qmap('USA',zoom=3)
map
Now you’ll plot the geodata obtained from your Athena table onto this map. This will help you visualize all US locations where these events had occurred in 2015.
#--plot our geo-coordinates on the US map
map + geom_point(data = dfgeo, aes(x = dfgeo$long, y = dfgeo$lat), color="blue", size=0.5, alpha=0.5)
By visually inspecting the results, you can determine that this specific event was heavily concentrated in the Northeastern part of the US.
Conclusion
You’ve learned how to build a simple interactive application with Athena and R. Athena can be used to store and query the underlying data for your big data applications using standard SQL, while R can be used to interactively query Athena and generate analytical insights using the powerful set of libraries that R provides.
If you have questions or suggestions, please leave your feedback in the comments.
About the Author
Gopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.
Related
Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight