AWS Big Data Blog

Category: Amazon EMR

Supercharge SQL on Your Data in Apache HBase with Apache Phoenix

With today’s launch of Amazon EMR release 4.7, you can now create clusters with Apache Phoenix 4.7.0 for low-latency SQL and OLTP workloads. Phoenix uses Apache HBase as its backing store (HBase 1.2.1 is included on Amazon EMR release 4.7.0), using HBase scan operations and coprocessors for fast performance. Additionally, you can map Phoenix tables […]

Using Spark SQL for ETL

Ben Snively is a Solutions Architect with AWS. With big data, you deal with many different formats and large volumes of data. SQL-style queries have been around for nearly four decades. Many systems support SQL-style syntax on top of the data layers, and the Hadoop/Spark ecosystem is no exception. This allows companies to try new […]

Process Encrypted Data in Amazon EMR with Amazon S3 and AWS KMS

Russell Nash is a Solutions Architect with AWS. Amo Abeyaratne, a Big Data consultant with AWS, also contributed to this post. One of the most powerful features of Amazon EMR is the close integration with Amazon S3 through EMRFS. This allows you to take advantage of many S3 features, including support for S3 client-side and […]

Sharpen your Skill Set with Apache Spark on the AWS Big Data Blog

The AWS Big Data Blog has a large community of authors who are passionate about Apache Spark and who regularly publish content that helps customers use Spark to build real-world solutions. You’ll see content on a variety of topics, including deep-dives on Spark’s internals, building Spark Streaming applications, creating machine learning pipelines using MLlib, and ways […]

Combine NoSQL and Massively Parallel Analytics Using Apache HBase and Apache Hive on Amazon EMR

Ben Snively is a Solutions Architect with AWS. Jon Fritz, a Senior Product Manager for Amazon EMR, co-authored this post. With today’s launch of Amazon EMR release 4.6, you can now quickly and easily provision a cluster with Apache HBase 1.2. Apache HBase is a massively scalable, distributed big data store in the Apache Hadoop ecosystem. It is […]

Using CombineInputFormat to Combat Hadoop’s Small Files Problem

James Norvell is a Big Data Cloud Support Engineer for AWS. Many Amazon EMR customers have architectures that track events and streams and store data in S3. This frequently leads to many small files. It’s now well known that Hadoop doesn’t deal well with small files. This issue can be amplified when migrating from Hadoop […]
