AWS Big Data Blog

Category: Amazon EMR*

Use Apache Oozie Workflows to Automate Apache Spark Jobs (and more!) on Amazon EMR

Mike Grimes is an SDE with Amazon EMR As a developer or data scientist, you rarely want to run a single serial job on an Apache Spark cluster. More often, to gain insight from your data you need to process it in multiple, possibly tiered steps, and then move the data into another format and […]

Read More

Supercharge SQL on Your Data in Apache HBase with Apache Phoenix

With today’s launch of Amazon EMR release 4.7, you can now create clusters with Apache Phoenix 4.7.0 for low-latency SQL and OLTP workloads. Phoenix uses Apache HBase as its backing store (HBase 1.2.1 is included on Amazon EMR release 4.7.0), using HBase scan operations and coprocessors for fast performance. Additionally, you can map Phoenix tables […]

Read More

Using Spark SQL for ETL

Ben Snively is a Solutions Architect with AWS With big data, you deal with many different formats and large volumes of data. SQL-style queries have been around for nearly four decades. Many systems support SQL-style syntax on top of the data layers, and the Hadoop/Spark ecosystem is no exception. This allows companies to try new […]

Read More

Using Python 3.4 on EMR Spark Applications

Bruno Faria is a Big Data Support Engineer for Amazon Web Services Many data scientists choose Python when developing on Spark. With last month’s Amazon EMR release 4.6, we’ve made it even easier to use Python: Python 3.4 is installed on your EMR cluster by default. You’ll still find Python 2.6 and 2.7 on your […]

Read More

Process Encrypted Data in Amazon EMR with Amazon S3 and AWS KMS

Russell Nash is a Solutions Architect with AWS. Amo Abeyaratne, a Big Data consultant with AWS, also contributed to this post. One of the most powerful features of Amazon EMR is the close integration with Amazon S3 through EMRFS. This allows you to take advantage of many S3 features, including support for S3 client-side and […]

Read More

Sharpen your Skill Set with Apache Spark on the AWS Big Data Blog

The AWS Big Data Blog has a large community of authors who are passionate about Apache Spark and who regularly publish content that helps customers use Spark to build real-world solutions. You’ll see content on a variety of topics, including deep-dives on Spark’s internals, building Spark Streaming applications, creating machine learning pipelines using MLlib, and ways […]

Read More

Combine NoSQL and Massively Parallel Analytics Using Apache HBase and Apache Hive on Amazon EMR

Ben Snively is a Solutions Architect with AWS Jon Fritz, a Senior Product Manager for Amazon EMR, co-authored this post With today’s launch of Amazon EMR release 4.6, you can now quickly and easily provision a cluster with Apache HBase 1.2. Apache HBase is a massively scalable, distributed big data store in the Apache Hadoop ecosystem. It is […]

Read More

Using CombineInputFormat to Combat Hadoop’s Small Files Problem

James Norvell is a Big Data Cloud Support Engineer for AWS Many Amazon EMR customers have architectures that track events and streams and store data in S3. This frequently leads to many small files. It’s now well known that Hadoop doesn’t deal well with small files. This issue can be amplified when migrating from Hadoop […]

Read More

Exploring Geospatial Intelligence using SparkR on Amazon EMR

Gopal Wunnava is a Senior Consultant with AWS Professional Services The number of data sources that use location, such as smartphones and sensory devices used in IoT (Internet of things), is expanding rapidly. This explosion has increased demand for analyzing spatial data. Geospatial intelligence (GEOINT) allows you to analyze data that has geographical or spatial […]

Read More

Will Spark Power the Data behind Precision Medicine?

Christopher Crosbie is a Healthcare and Life Science Solutions Architect with Amazon Web Services. This post was co-authored by Ujjwal Ratan, a Solutions Architect with Amazon Web Services. ——————————— “And that’s the promise of precision medicine — delivering the right treatments, at the right time, every time to the right person.“ (President Obama, 2015 State […]

Read More