Running R on Amazon Athena
Data scientists are often concerned about managing the infrastructure behind big data platforms while running SQL on R. Amazon Athena is an interactive query service that works directly with data stored in S3 and makes it easy to analyze data using standard SQL without the need to manage infrastructure. Integrating R with Amazon Athena gives data scientists a powerful platform for building interactive analytical solutions.
In this blog post, you’ll connect R/RStudio running on an Amazon EC2 instance with Athena.
Before you get started, complete the following steps.
- Have your AWS account administrator give your AWS account the required permissions to access Athena via Amazon’s Identity and Access Management (IAM) console. This can be done by attaching the associated Athena policies to your data scientist user group in IAM.
- Provide a staging directory in the form of an Amazon S3 bucket. Athena will use this to query datasets and store results. We’ll call this staging bucket s3://athenauser-athena-r in the instructions that follow.
NOTE: In this blog post, I create all AWS resources in the US-East region. Use the Region Table to check the availability of Athena in other regions.
Set up R and RStudio on EC2
- Follow the instructions in the blog post “Running R on AWS” to set up R on an EC2 instance (t2.medium or greater) running Amazon Linux . Read the step below before you begin.
- In that blog post under “Advanced Details,” when you reach step 3 use the following bash script to install the latest version of RStudio. Modify the password for RStudio as needed.
Install Java 8
- SSH into this EC2 instance.
- Remove older versions of Java.
- Install Java 8. This is required to work with Athena.
- Run the following commands on the command line.
Set up .Renviron
You need to configure the R environment variable .Renviron with the required Athena credentials.
- Get the required credentials from your AWS Administrator in the form of AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
- Type the following command from the Linux command prompt and bring up the vi editor.
- Save this file and exit from the editor.
Log in to RStudio
Next, you’ll log in to RStudio on your EC2 instance.
- Get the public IP address of your instance from the EC2 dashboard and paste it followed by :8787 (port number for RStudio) into your browser window.
- Confirm that your IP address has been whitelisted for inbound access to port 8787 as part of the configuration for the security group associated with your EC2 instance.
- Log in to RStudio with the username and password you provided previously.
Install R packages
Next, you’ll install and load the required R packages.
Connect to Athena
The following steps in R download the Athena driver and set up the required connection. Use the JDBC URL associated with your region.
Now you’re ready to start querying Athena from RStudio.
Sample Queries to test
Interactive Use Case
Next, you’ll practice interactively querying Athena from R for analytics and visualization. For this purpose, you’ll use GDELT, a publicly available dataset hosted on S3.
Create a table in Athena from R using the GDELT dataset. This step can also be performed from the AWS management console as illustrated in the blog post “Amazon Athena – Interactive SQL Queries for Data in Amazon S3.”
You should see this newly created table named ‘gdeltmaster’ appear in your RStudio console after executing the statement above.
Query this Athena table to get a count of all CAMEO events that took place in the US in 2015.
From the R output shown above, you can see that CAMEO event “042” has the highest count. From the CAMEO manual, you can determine that this event has the description “Travel to another location for a meeting or other event.”
Next, you’ll use the knowledge gained from this analysis to get a list of all geo-coordinates associated with this specific event from the Athena table.
Next, generate a map for the United States.
Now you’ll plot the geodata obtained from your Athena table onto this map. This will help you visualize all US locations where these events had occurred in 2015.
By visually inspecting the results, you can determine that this specific event was heavily concentrated in the Northeastern part of the US.
You’ve learned how to build a simple interactive application with Athena and R. Athena can be used to store and query the underlying data for your big data applications using standard SQL, while R can be used to interactively query Athena and generate analytical insights using the powerful set of libraries that R provides.
If you have questions or suggestions, please leave your feedback in the comments.
About the Author
Gopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.