Tag: R


Running R on Amazon Athena

by Gopal Wunnava

This blog post has been translated into Japanese.

Data scientists are often concerned about managing the infrastructure behind big data platforms while running SQL from R. Amazon Athena is an interactive query service that works directly with data stored in Amazon S3 and makes it easy to analyze data using standard SQL without the need to manage infrastructure. Integrating R with Amazon Athena gives data scientists a powerful platform for building interactive analytical solutions.

In this blog post, you’ll connect R/RStudio running on an Amazon EC2 instance with Athena.

Prerequisites

Before you get started, complete the following steps.

  1. Have your AWS account administrator grant your AWS account the permissions required to access Athena, using the AWS Identity and Access Management (IAM) console. This can be done by attaching the associated Athena policies to your data scientist user group in IAM.

 

(more…)

Crunching Statistics at Scale with SparkR on Amazon EMR

by Christopher Crosbie

Christopher Crosbie is a Healthcare and Life Science Solutions Architect with Amazon Web Services.

This post is co-authored by Gopal Wunnava, a Senior Consultant with AWS Professional Services.

SparkR is an R package that allows you to integrate complex statistical analysis with large datasets. In this blog post, we introduce you to running R with the Apache SparkR project on Amazon EMR. The diagram of SparkR below is provided as a reference, and this video provides an overview of what is depicted.

(more…)

Extending Seven Bridges Genomics with Amazon Redshift and R

by Christopher Crosbie

Christopher Crosbie is a Healthcare and Life Science Solutions Architect with Amazon Web Services.

The article was co-authored by Zeynep Onder, Scientist, Seven Bridges Genomics, an AWS Advanced Technology Partner.

“ACTGCTTCGACTCGGGTCCA”

That is probably not a coding language readily understood by many reading this blog post, but it is a programming framework that defines all life on the planet. These letters are known as base pairs in a DNA sequence and represent four chemicals found in all organisms. When put into a specific order, these DNA sequences contain the instructions that kick off processes which eventually render all the characteristics and traits (also known as phenotypes) we see in nature.

Sounds simple enough. Just store the code, perform some complex decoding algorithms, and you are on your way to a great scientific discovery. Right? Well, not quite. Genomics analysis is one of the biggest data problems out there.

Here’s why: You and I have around 20,000 to 25,000 distinct sequences of DNA (genes) that create proteins and thus contain instructions for every process from development to regeneration. This is out of the 3.2 billion individual letters (bases) in each of us. Thousands of other organisms have also been decoded and stored in databases because comparing genes across species can help us understand the way these genes actually function. Algorithms such as BLAST, which can search DNA sequences from more than 260,000 organisms containing over 190 billion bases, are now commonplace in bioinformatics. It has also been estimated that the total number of DNA base pairs on Earth is somewhere in the range of 5.0 x 10^37, or 50 trillion trillion trillion DNA base pairs. WOW! Forget about clickstream and server logs: nature has given us the ultimate Big Data problem.

(more…)

Connecting R with Amazon Redshift

by Markus Schmidberger

Markus Schmidberger is a Senior Big Data Consultant for AWS Professional Services.

Amazon Redshift is a fast, fully managed, scalable data warehouse (DWH) for petabytes of data. AWS customers are moving huge amounts of structured data into Amazon Redshift to offload analytics workloads or to operate their DWH fully in the cloud. Business intelligence and analytics teams can use JDBC or ODBC connections to import, read, and analyze data with their favorite tools, such as Informatica or Tableau.

R is an open source programming language and software environment designed for statistical computing, visualization, and data analysis. Due to its flexible package system and powerful statistical engine, R can provide methods and technologies to manage and process large amounts of data. R is the fastest-growing analytics platform in the world, and is established in both academia and business due to its robustness, reliability, and accuracy. For more tips on installing and operating R on AWS, see Running R on AWS.

In this post, I describe some best practices for efficient analyses of data in Amazon Redshift with the statistical software R running on your computer or Amazon EC2.

Preparation

Start an Amazon Redshift cluster (see Step 2: Launch a Sample Amazon Redshift Cluster) with two dc1.large nodes, and set the Publicly Accessible field to Yes to add a public IP address to your cluster.

If you run an Amazon Redshift production cluster, you might not want to choose this option. Discussing the available security mechanisms could be a separate blog post all by itself, and would add too much to this one. In the meantime, see the Amazon Redshift documentation for more details about security, VPC, and data encryption (Amazon Redshift Security Overview).

For working with the cluster, you need the following connection information:

  • Endpoint <ENDPOINT>
  • Database name <DBNAME>
  • Port <PORT>
  • (Master) username <USER> and password <PW>
  • JDBC URL <JDBCURL>

You can access the fields by logging into the AWS console, choosing Amazon Redshift, and then selecting your cluster.
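
As a rough illustration of how the connection fields listed above fit together, the following R sketch assembles a JDBC URL from the endpoint, port, and database name, and defines a helper that could open a connection with the RJDBC package. The host name, database name, and the function names build_jdbc_url and connect_redshift are hypothetical placeholders, not values from this post. The jdbc:redshift:// prefix assumes the Amazon Redshift JDBC driver; the console shows the exact JDBC URL for your cluster, so prefer that value when it is available.

```r
# Hypothetical placeholder values; substitute the fields shown for your cluster.
endpoint <- "examplecluster.abc123xyz789.us-east-1.redshift.amazonaws.com"
dbname   <- "dev"
port     <- 5439  # default Amazon Redshift port

# Build a URL of the form jdbc:redshift://<ENDPOINT>:<PORT>/<DBNAME>.
build_jdbc_url <- function(endpoint, port, dbname) {
  stopifnot(nzchar(endpoint), nzchar(dbname), port > 0)
  sprintf("jdbc:redshift://%s:%d/%s", endpoint, as.integer(port), dbname)
}

# Open a connection through RJDBC. The driver class and the path to the
# driver JAR are parameters because they depend on the JDBC driver you
# downloaded (assumption: RJDBC and a suitable driver are installed).
connect_redshift <- function(url, user, password, driver_class, driver_jar) {
  drv <- RJDBC::JDBC(driverClass = driver_class, classPath = driver_jar)
  DBI::dbConnect(drv, url, user = user, password = password)
}

jdbc_url <- build_jdbc_url(endpoint, port, dbname)
print(jdbc_url)
```

The helper connect_redshift is only defined here, not called, so the snippet runs without a cluster. Pass it the URL together with <USER> and <PW> once the cluster is available.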

(more…)

Statistical Analysis with Open-Source R and RStudio on Amazon EMR

by Markus Schmidberger

Markus Schmidberger is a Senior Big Data Consultant for AWS Professional Services.

Big Data is on every CIO’s mind. It is synonymous with technologies like Hadoop and the ‘NoSQL’ class of databases. Another technology shaking things up in Big Data is R. This blog post describes how to set up R, the RHadoop packages, and RStudio Server on Amazon Elastic MapReduce (Amazon EMR). This combination provides a powerful statistical analysis environment, including a user-friendly IDE on a fully managed Hadoop environment that starts up in minutes, and saves time and money for your data-driven analyses. At the end of this post, I’ve added a Big Data analysis using a public data set with daily global weather measurements.

R is an open source programming language and software environment designed for statistical computing, visualization, and data analysis. Due to its flexible package system and powerful statistical engine, R can provide methods and technologies to manage and process large amounts of data. It is the fastest-growing analytics platform in the world, and is established in both academia and business due to its robustness, reliability, and accuracy. Nearly every top vendor of advanced analytics has integrated R and can now import R models. This allows data scientists, statisticians, and other sophisticated enterprise users to leverage R within their analytics package.

(more…)