AWS Big Data Blog

Setting up Read Replica Clusters with HBase on Amazon S3

by Tony Nguyen and Zach York

Many customers have taken advantage of the numerous benefits of running Apache HBase on Amazon S3 for data storage, including lower costs, data durability, and easier scalability. Customers such as FINRA have lowered their costs by 60% by moving to an HBase on S3 architecture, while also gaining the operational benefits that come with decoupling storage from compute and using S3 as the storage layer. HBase on S3 allows you to turn on a cluster and immediately start querying against data within S3 rather than having to go through a lengthy snapshot restore process.

With the launch of Amazon EMR 5.7.0, you can take the high availability and durability of HBase on S3 one step further, to the cluster level: you can now start multiple HBase read-only clusters that connect to the same HBase root directory in S3. This lets you ensure that your data is always reachable through read replica clusters, and run those clusters across multiple Availability Zones.

In this post, I guide you through setting up read replica clusters with HBase on S3.

HBase Overview

Apache HBase is a massively scalable, distributed big data store in the Apache Hadoop ecosystem. It is an open-source, non-relational, versioned database that runs on top of the Hadoop Distributed File System (HDFS). It is built for random, strictly consistent, real-time access for tables with billions of rows and millions of columns. It has tight integration with Apache Hadoop, Apache Hive, and Apache Pig, so you can easily combine massively parallel analytics with fast data access. HBase’s data model, throughput, and fault tolerance are a good match for workloads in ad tech, web analytics, financial services, applications using time-series data, and many more.

Table structure in HBase, like many NoSQL technologies, should be directly influenced by the queries and access patterns of the data. Query performance varies drastically based on the way the cluster has to process and return the data.

HBase on S3

To use HBase on S3 read replicas, you must first be using HBase on S3. For those unfamiliar with HBase on S3 architecture, this section conveys some of the basics.

By using S3 as a data store for HBase, you can separate your cluster’s storage and compute nodes. This enables you to cut costs by sizing your cluster for your compute requirements. You don’t have to pay to store your entire dataset with 3x replication in the on-cluster Hadoop Distributed File System (HDFS).

EMR configures HBase on Amazon S3 to cache data in-memory and on-disk in your cluster to improve read performance from S3. You can quickly and easily scale up or scale down compute nodes without impacting your underlying storage. Or you can terminate your cluster to cut costs and quickly restore it in another Availability Zone.

HBase with support for S3 is available on EMR releases from 5.2.0 onward. To use S3 as a data store, configure the storage mode and specify a root directory in your HBase configuration. Also, it’s recommended to enable EMRFS consistent view. For more information, see Apache HBase on Amazon S3.
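As a rough illustration of the configuration described above, the following Python sketch builds the EMR configuration classifications for an HBase on S3 cluster. The bucket path is hypothetical, and the property names (`hbase.emr.storageMode`, `hbase.rootdir`, `hbase.emr.readreplica.enabled`, `fs.s3.consistent`) reflect my understanding of the EMR documentation; verify them against the release you use before relying on this.

```python
import json


def hbase_on_s3_configurations(root_dir, read_replica=False, consistent_view=True):
    """Build EMR configuration classifications for HBase on S3.

    root_dir is the S3 location of the HBase root directory (hypothetical here).
    """
    hbase_props = {"hbase.emr.storageMode": "s3"}
    if read_replica:
        # Assumption: read replica clusters (EMR 5.7.0 and later) are enabled
        # with this property on every cluster except the primary.
        hbase_props["hbase.emr.readreplica.enabled"] = "true"

    configs = [
        {"Classification": "hbase-site", "Properties": {"hbase.rootdir": root_dir}},
        {"Classification": "hbase", "Properties": hbase_props},
    ]
    if consistent_view:
        # EMRFS consistent view, as recommended in the text above.
        configs.append(
            {"Classification": "emrfs-site", "Properties": {"fs.s3.consistent": "true"}}
        )
    return configs


if __name__ == "__main__":
    # Hypothetical bucket and prefix, for illustration only.
    print(json.dumps(hbase_on_s3_configurations("s3://my-bucket/my-hbase-rootdir/"), indent=2))
```

The resulting list can be saved as JSON and supplied as the cluster configuration when you create the cluster.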

Use cases for HBase on S3 read replica clusters

Using HBase on S3 allows your data to be stored safely and durably. It persists the data off-cluster, which eliminates the dangers of data loss for persisted writes when the cluster is terminated. However, there can be situations where you want to make sure that your data on HBase is highly available, even in the rare event of a cluster or Availability Zone failure. Another case could be when you want the ability to have multiple clusters access the same root directory in S3. If you have a primary cluster that comes under heavy load during bulk loads, writes, and compactions, this feature allows you to create secondary clusters that off-load and separate the read load from the write load, ensuring that you meet your read SLAs while optimizing around cost and performance.

The following diagram shows HBase on S3 without read replicas. In this scenario, events such as cluster failure or Availability Zone failure render users unable to access data on HBase.

The HBase root directory, including HFiles and metadata, resides in S3:

Prior to EMR 5.7.0, multiple clusters could not be pointed to the same root directory. For architectures requiring high availability, you needed to create duplicate data on S3.


Analyze OpenFDA Data in R with Amazon S3 and Amazon Athena

by Ryan Hood, Vikram Anand and David Rocamora

One of the great benefits of Amazon S3 is the ability to host, share, or consume public data sets. This provides transparency into data to which an external data scientist or developer might not normally have access. By exposing the data to the public, you can glean many insights that would have been difficult with a data silo.

The openFDA project creates easy access to the high value, high priority, and public access data of the Food and Drug Administration (FDA). The data has been formatted and documented in consumer-friendly standards. Critical data related to drugs, devices, and food has been harmonized and can easily be called by application developers and researchers via API calls. OpenFDA has published two whitepapers that drill into the technical underpinnings of the API infrastructure as well as how to properly analyze the data in R. In addition, FDA makes openFDA data available on S3 in raw format.

In this post, I show how to use S3, Amazon EMR, and Amazon Athena to analyze the drug adverse events dataset. A drug adverse event is an undesirable experience associated with the use of a drug, including serious drug side effects, product use errors, product quality problems, and therapeutic failures.

Data considerations

Keep in mind that this data does have limitations. In the United States, these adverse events are submitted to the FDA voluntarily by consumers, so there may not be reports for all events that occurred. There is no certainty that the reported event was actually due to the product. The FDA does not require that a causal relationship between a product and event be proven, and reports do not always contain the detail necessary to evaluate an event. Because of this, there is no way to identify the true number of events. The important takeaway is that the information contained in this data has not been verified to establish cause-and-effect relationships. Despite this disclaimer, many interesting insights and value can be derived from the data to accelerate drug safety research.

Data analysis using SQL

For application developers who want to perform targeted searching and lookups, the API endpoints provided by the openFDA project are “ready to go” for software integration using a standard API powered by Elasticsearch, NodeJS, and Docker. However, for data analysis purposes, it is often easier to work with the data using SQL and statistical packages that expect a SQL table structure. For large-scale analysis, APIs often have query limits, such as 5000 records per query. This can cause extra work for data scientists who want to analyze the full dataset instead of small subsets of data.

To address the concern of requiring all the data in a single dataset, the openFDA project released the full 100 GB of harmonized data files that back the openFDA project onto S3. Athena is an interactive query service that makes it easy to analyze data in S3 using standard SQL. It’s a quick and easy way to answer your questions about adverse events and aspirin that does not require you to spin up databases or servers.

While you could point tools directly at the openFDA S3 files, you can find greatly improved performance and use of the data by following some of the preparation steps later in this post.


This post explains how to use the following architecture to take the raw data provided by openFDA, leverage several AWS services, and derive meaning from the underlying data.


Perform Near Real-time Analytics on Streaming Data with Amazon Kinesis and Amazon Elasticsearch Service

by Tristan Li

Nowadays, streaming data is seen and used everywhere: in social networks, mobile and web applications, IoT devices, instrumentation in data centers, and many other sources. As the speed and volume of this type of data increase, the need to perform data analysis in real time with machine learning algorithms and extract a deeper understanding from the data becomes ever more important. For example, you might want a continuous monitoring system to detect sentiment changes in a social media feed so that you can react to the sentiment in near real time.

In this post, we use Amazon Kinesis Streams to collect and store streaming data. We then use Amazon Kinesis Analytics to process and analyze the streaming data continuously. Specifically, we use the Kinesis Analytics built-in RANDOM_CUT_FOREST function, a machine learning algorithm, to detect anomalies in the streaming data. Finally, we use Amazon Kinesis Firehose to export the anomalies data to Amazon Elasticsearch Service (Amazon ES). We then build a simple dashboard in the open source tool Kibana to visualize the result.

Solution overview

The following diagram depicts a high-level overview of this solution.

Amazon Kinesis Streams

You can use Amazon Kinesis Streams to build your own streaming application. This application can process and analyze streaming data by continuously capturing and storing terabytes of data per hour from hundreds of thousands of sources.

Amazon Kinesis Analytics

Kinesis Analytics provides an easy and familiar standard SQL language to analyze streaming data in real time. One of its most powerful features is that there are no new languages, processing frameworks, or complex machine learning algorithms that you need to learn.

Amazon Kinesis Firehose

Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service.

Amazon Elasticsearch Service

Amazon ES is a fully managed service that makes it easy to deploy, operate, and scale Elasticsearch for log analytics, full text search, application monitoring, and more.

Solution summary

The following is a quick walkthrough of the solution that’s presented in the diagram:

  1. IoT sensors send streaming data into Kinesis Streams. In this post, you use a Python script to simulate an IoT temperature sensor device that sends the streaming data.
  2. By using the built-in RANDOM_CUT_FOREST function in Kinesis Analytics, you can detect anomalies in real time with the sensor data that is stored in Kinesis Streams. RANDOM_CUT_FOREST is also an appropriate algorithm for many other kinds of anomaly-detection use cases—for example, the media sentiment example mentioned earlier in this post.
  3. The processed anomaly data is then loaded into the Kinesis Firehose delivery stream.
  4. By using the built-in integration that Kinesis Firehose has with Amazon ES, you can easily export the processed anomaly data into the service and visualize it with Kibana.
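The first step of the walkthrough above, a Python script that simulates an IoT temperature sensor, could look roughly like the following sketch. It is not the script from the post: the field names (`sensor_id`, `sequence`, `temperature`), the temperature ranges, and the anomaly rate are all assumptions made for illustration. The sketch only builds the arguments for a Kinesis `PutRecord` call and does not send anything.

```python
import json
import random


def simulate_readings(count, seed=0, anomaly_rate=0.05):
    """Yield simulated temperature readings with occasional outliers.

    All field names and value ranges are hypothetical.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    for i in range(count):
        is_anomaly = rng.random() < anomaly_rate
        # Normal readings cluster around 20 C; anomalies jump to around 60 C.
        temperature = rng.gauss(60.0, 2.0) if is_anomaly else rng.gauss(20.0, 1.0)
        yield {
            "sensor_id": "sensor-1",
            "sequence": i,
            "temperature": round(temperature, 2),
        }


def to_put_record_args(stream_name, reading):
    """Shape one reading into keyword arguments for a Kinesis PutRecord call."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(reading).encode("utf-8"),
        "PartitionKey": reading["sensor_id"],
    }


if __name__ == "__main__":
    for reading in simulate_readings(5):
        print(to_put_record_args("my-sensor-stream", reading))
```

If you use boto3, each dictionary could be passed to `boto3.client("kinesis").put_record(**args)`; the stream name `my-sensor-stream` is a placeholder.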


Visualize Amazon S3 Analytics Data with Amazon QuickSight

by Luis Wang

When Amazon S3 analytics was released in November 2016, it gave you the ability to analyze storage access patterns and transition the right data to the right storage class. You could also manually export the data to an S3 bucket to analyze, using the business intelligence tool of your choice, and gather deeper insights on usage and growth patterns. This helped you reduce storage costs while optimizing performance based on usage patterns.

With today’s update, you can quickly and easily gain those deeper insights and benefits by analyzing and visualizing S3 analytics data in Amazon QuickSight. It takes just a single click from the S3 console, without the need for manual exports or additional data preparation.

If you already have S3 analytics storage class analysis enabled for your buckets, choose Explore in QuickSight on the top right.


Under the Hood of Server-Side Encryption for Amazon Kinesis Streams

by Damian Wylie

Customers are using Amazon Kinesis Streams to ingest, process, and deliver data in real time from millions of devices or applications. Use cases for Kinesis Streams vary, but a few common ones include IoT data ingestion and analytics, log processing, clickstream analytics, and enterprise data bus architectures.

Within milliseconds of data arrival, applications (KCL, Apache Spark, AWS Lambda, Amazon Kinesis Analytics) attached to a stream are continuously mining value or delivering data to downstream destinations. Customers are then scaling their streams elastically to match demand. They pay incrementally for the resources that they need, while taking advantage of a fully managed, serverless streaming data service that allows them to focus on adding value closer to their customers.

These benefits are great; however, AWS learned that many customers could not take advantage of Kinesis Streams unless their data-at-rest within a stream was encrypted. Many customers did not want to manage encryption on their own, so they asked for a fully managed, automatic, server-side encryption mechanism leveraging centralized AWS Key Management Service (AWS KMS) customer master keys (CMK).

Motivated by this feedback, AWS added another fully managed, low cost aspect to Kinesis Streams by delivering server-side encryption via KMS managed encryption keys (SSE-KMS) in the following regions:

  • US East (N. Virginia)
  • US West (Oregon)
  • US West (N. California)
  • EU (Ireland)
  • Asia Pacific (Singapore)
  • Asia Pacific (Tokyo)

In this post, I cover the mechanics of the Kinesis Streams server-side encryption feature. I also share a few best practices and considerations so that you can get started quickly.

Understanding the mechanics

The following section walks you through how Kinesis Streams uses CMKs to encrypt a message in the PutRecord or PutRecords path before it is propagated to the Kinesis Streams storage layer, and then decrypt it in the GetRecords path after it has been retrieved from the storage layer.

When server-side encryption is enabled (which takes just a few clicks in the console), the partition key and payload for every incoming record are encrypted automatically as they flow into Kinesis Streams, using the selected CMK. When data is at rest within a stream, it’s encrypted.

When records are retrieved through a GetRecords request from the encrypted stream, they are decrypted automatically as they are flowing out of the service. That means your Kinesis Streams producers and consumers do not need to be aware of encryption. You have a fully managed data encryption feature at your fingertips, which can be enabled within seconds.

AWS also makes it easy to audit the application of server-side encryption. You can use the AWS Management Console for instant stream-level verification; the responses from PutRecord, PutRecords, and GetRecords; or AWS CloudTrail.
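To make the response-based audit concrete, here is a small Python sketch that inspects response dictionaries of the shape an SDK such as boto3 returns. It assumes that PutRecord and PutRecords responses carry a top-level `EncryptionType` field and that each record in a GetRecords response carries its own `EncryptionType`, with the value `KMS` when server-side encryption was applied; check the API reference to confirm this for your SDK version.

```python
def is_kms_encrypted(put_response):
    """Return True if a PutRecord or PutRecords response reports KMS encryption."""
    return put_response.get("EncryptionType") == "KMS"


def unencrypted_records(get_records_response):
    """Return the records in a GetRecords response that were not KMS-encrypted."""
    return [
        record
        for record in get_records_response.get("Records", [])
        if record.get("EncryptionType") != "KMS"
    ]


if __name__ == "__main__":
    # Hand-written sample responses, not real service output.
    print(is_kms_encrypted({"ShardId": "shardId-000000000000", "EncryptionType": "KMS"}))
    print(unencrypted_records({"Records": [{"Data": b"a", "EncryptionType": "NONE"}]}))
```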

Calling PutRecord or PutRecords

When your applications call PutRecord or PutRecords on a stream with server-side encryption enabled, Kinesis Streams and KMS perform the following actions. The Amazon Kinesis Producer Library (KPL) uses PutRecords.


  1. Data is sent from a customer’s producer (client) to a Kinesis stream using TLS via HTTPS. Data in transit to a stream is encrypted by default.
  2. After data is received, it is momentarily stored in RAM within a front-end proxy layer.
  3. Kinesis Streams authenticates the producer, then impersonates the producer to request input keying material from KMS.
  4. KMS creates key material, encrypts it by using the CMK, and sends both the plaintext and encrypted key material to the service, encrypted with TLS.
  5. The client uses the plaintext key material to derive data encryption keys (data keys) that are unique per-record.
  6. The client encrypts the payload and partition key using the data key in RAM within the front-end proxy layer and removes the plaintext data key from memory.
  7. The client appends the encrypted key material to the encrypted data.
  8. The plaintext key material is securely cached in memory within the front-end layer for reuse, until it expires after 5 minutes.
  9. The client delivers the encrypted message to a back-end store where it is stored at rest and fetchable by an authorized consumer through GetRecords. The Amazon Kinesis Client Library (KCL) calls GetRecords to retrieve records from a stream.
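Two ideas in the steps above, caching plaintext key material for 5 minutes and deriving a unique data key per record, can be sketched in Python as follows. This is a conceptual illustration only, not how Kinesis Streams is implemented: the use of HMAC-SHA256 for derivation, the per-record nonce, and the `fetch_key_material` callback (a stand-in for the KMS request) are all assumptions.

```python
import hashlib
import hmac
import time


class KeyMaterialCache:
    """Cache plaintext key material for a fixed time, then fetch it again."""

    def __init__(self, fetch_key_material, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch_key_material  # hypothetical stand-in for the KMS call
        self._ttl = ttl_seconds           # 5 minutes, as in step 8
        self._clock = clock
        self._entry = None                # (plaintext, encrypted, expires_at)

    def get(self):
        """Return (plaintext, encrypted) key material, refetching after expiry."""
        now = self._clock()
        if self._entry is None or now >= self._entry[2]:
            plaintext, encrypted = self._fetch()
            self._entry = (plaintext, encrypted, now + self._ttl)
        return self._entry[0], self._entry[1]


def derive_record_key(plaintext_key_material, record_nonce):
    """Derive a per-record data key. HMAC-SHA256 is an illustrative choice only."""
    return hmac.new(plaintext_key_material, record_nonce, hashlib.sha256).digest()


if __name__ == "__main__":
    # Fake key material; a real system would obtain this from KMS.
    cache = KeyMaterialCache(lambda: (b"\x01" * 32, b"\x02" * 48))
    plaintext, encrypted = cache.get()
    print(derive_record_key(plaintext, b"record-1").hex())
```

The cache keeps the encrypted key material alongside the plaintext because, per step 7, the encrypted form is what gets appended to each encrypted record.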


Analysis of Top-N DynamoDB Objects using Amazon Athena and Amazon QuickSight

by Rendy Oka

If you run an operation that continuously generates a large amount of data, you may want to know what kind of data is being inserted by your application. The ability to analyze data intake quickly can be very valuable for business units, such as operations and marketing. For many operations, it’s important to see what is driving the business at any particular moment. For retail companies, for example, understanding which products are currently popular can aid in planning for future growth. Similarly, for PR companies, understanding the impact of an advertising campaign can help them market their products more effectively.

This post covers an architecture that helps you analyze your streaming data. You’ll build a solution using Amazon DynamoDB Streams, AWS Lambda, Amazon Kinesis Firehose, and Amazon Athena to analyze data intake at a frequency that you choose. And because this is a serverless architecture, you can use all of the services here without the need to provision or manage servers.

The data source

You’ll collect a random sampling of tweets via Twitter’s API and store a variety of attributes in your DynamoDB table, such as: Twitter handle, tweet ID, hashtags, location, and Time-To-Live (TTL) value.

In DynamoDB, the partition key is used as an input to an internal hash function. The output from this function determines the partition in which the data will be stored. When using a combination of partition key and sort key as a DynamoDB schema, you need to make sure that no single partition key contains many more objects than the other partition keys because this can cause partition level throttling. For the demonstration in this blog, the Twitter handle will be the partition key and the tweet ID will be the sort key. This allows you to group and sort tweets from each user.
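The key layout described above can be sketched in Python as the arguments for a DynamoDB `CreateTable` request, together with a quick check for key skew. The table name, the attribute names (`handle`, `tweet_id`), and the throughput values are hypothetical, chosen only to illustrate a hash key on the Twitter handle and a range (sort) key on the tweet ID.

```python
from collections import Counter


def tweet_table_definition(table_name="tweets"):
    """Build CreateTable arguments: handle as HASH key, tweet ID as RANGE key."""
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": "handle", "AttributeType": "S"},
            {"AttributeName": "tweet_id", "AttributeType": "N"},
        ],
        "KeySchema": [
            {"AttributeName": "handle", "KeyType": "HASH"},     # hashed to pick a partition
            {"AttributeName": "tweet_id", "KeyType": "RANGE"},  # sorts items within a handle
        ],
        # Placeholder capacity values, not a recommendation.
        "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    }


def largest_key_share(items, key="handle"):
    """Return the fraction of items held by the most common key value."""
    counts = Counter(item[key] for item in items)
    if not counts:
        return 0.0
    return max(counts.values()) / sum(counts.values())


if __name__ == "__main__":
    sample = [{"handle": "@a"}, {"handle": "@a"}, {"handle": "@b"}]
    print(tweet_table_definition()["KeySchema"])
    print(largest_key_share(sample))
```

A share close to 1.0 suggests that one handle dominates the data, which is the kind of skew that can lead to partition-level throttling.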

To help you get started, I have written a script that pulls a live Twitter stream that you can use to generate your data. All you need to do is provide your own Twitter Apps credentials, and it should generate the data immediately. Alternatively, I have also provided a script that you can use to generate random Tweets with little effort.

You can find both scripts in the GitHub repository:

There are some modules that you may need to install to run these scripts. You can find them in Python’s module repository:

To get your own Twitter credentials, go to and sign up for a free account, if you don’t already have one. After your account is set up, go to On the main landing page, choose the Create New App button. After the application is created, go to Keys and Access Tokens to get your credentials to use the Twitter API. You’ll need to generate a Consumer Key/Secret and an Access Token/Secret. All four keys will be used to authenticate your request.

Architecture overview

Before we begin, let’s take a look at what the overall flow of information will look like, from data ingestion into DynamoDB to visualization of results in Amazon QuickSight.

As illustrated in the architecture diagram above, any changes made to the items in DynamoDB will be captured and processed using DynamoDB Streams. Next, a Lambda function will be invoked by a trigger that is configured to respond to events in DynamoDB Streams. The Lambda function processes the data prior to pushing to Amazon Kinesis Firehose, which will output to Amazon S3. Finally, you use Amazon Athena to analyze the streaming data landing in Amazon S3. The result can be explored and visualized in Amazon QuickSight for your company’s business analytics.

You’ll need to implement your custom Lambda function to help transform the raw <key, value> data stored in DynamoDB to a JSON format for Athena to digest, but I can help you with sample code that you are free to modify.
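As a hedged illustration of that transformation (not the author's sample code), the following Python sketch converts the typed attribute values found in a DynamoDB Streams event into plain JSON lines. It assumes the stream is configured to include new images, handles only the common attribute types, and leaves out the step that forwards the lines to Kinesis Firehose. The attribute names in the sample event are hypothetical.

```python
import json


def _number(text):
    """Parse a DynamoDB number string as int when possible, else float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def unmarshal(attribute):
    """Convert one DynamoDB-typed attribute, such as {"S": "x"}, to a plain value."""
    (type_tag, value), = attribute.items()
    if type_tag in ("S", "BOOL"):
        return value
    if type_tag == "N":
        return _number(value)
    if type_tag == "NULL":
        return None
    if type_tag == "SS":
        return list(value)
    if type_tag == "NS":
        return [_number(v) for v in value]
    if type_tag == "L":
        return [unmarshal(v) for v in value]
    if type_tag == "M":
        return {k: unmarshal(v) for k, v in value.items()}
    raise ValueError("unsupported attribute type: " + type_tag)


def records_to_json_lines(event):
    """Return one JSON string per stream record that carries a NewImage."""
    lines = []
    for record in event.get("Records", []):
        image = record.get("dynamodb", {}).get("NewImage")
        if image is None:  # for example, REMOVE events have no new image
            continue
        plain = {name: unmarshal(attr) for name, attr in image.items()}
        lines.append(json.dumps(plain, sort_keys=True))
    return lines


if __name__ == "__main__":
    sample_event = {"Records": [{"eventName": "INSERT", "dynamodb": {
        "NewImage": {"handle": {"S": "@example"}, "tweet_id": {"N": "42"}}}}]}
    print("\n".join(records_to_json_lines(sample_event)))
```

Writing one JSON object per line keeps the output easy to query once it lands in S3, since each line can be read as a separate row.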


10 Best Practices for Amazon Redshift Spectrum

by Po Hong and Peter Dalton

Amazon Redshift Spectrum enables you to run Amazon Redshift SQL queries against data that is stored in Amazon S3. With Amazon Redshift Spectrum, you can extend the analytic power of Amazon Redshift beyond the data that is stored on local disks in your data warehouse. You can query vast amounts of data in your Amazon S3 “data lake” without having to go through a tedious and time-consuming extract, transform, and load (ETL) process. Amazon Redshift Spectrum applies sophisticated query optimization and scales processing across thousands of nodes to deliver fast performance.

In this blog post, we have collected 10 important best practices for Amazon Redshift Spectrum by grouping them into several different functional groups.

These guidelines are the product of many interactions and direct project work with Amazon Redshift customers.

Amazon Redshift vs. Amazon Athena

AWS customers often ask us: Amazon Athena or Amazon Redshift Spectrum? When should I use one over the other?

When to use Amazon Athena

Amazon Athena supports a use case in which you want interactive ad-hoc queries to run against data that is stored in Amazon S3 using SQL. The serverless architecture in Amazon Athena frees you from having to provision a cluster to perform queries. You are charged based on the amount of S3 data scanned by each query. You can get significant cost savings and better performance by compressing, partitioning, or converting your data into a columnar format, which reduces the amount of data that Amazon Athena needs to scan to execute a query. All the major BI tools and SQL clients that use JDBC can be used with Amazon Athena. You can also use Amazon QuickSight for easy visualization.

When to use Amazon Redshift

We recommend using Amazon Redshift on large sets of structured data. Amazon Redshift Spectrum gives you the freedom to store your data where you want, in the format you want, and have it available for processing when you need it. With Amazon Redshift Spectrum, you don’t have to worry about scaling your cluster. It lets you separate storage and compute, allowing you to scale each independently. You can even run multiple Amazon Redshift clusters against the same Amazon S3 data lake, enabling limitless concurrency. Amazon Redshift Spectrum automatically scales out to thousands of instances. So queries run quickly, whether they are processing a terabyte, a petabyte, or even an exabyte.

Set up the test environment

For information about prerequisites and steps to get started in Amazon Redshift Spectrum, see Getting Started with Amazon Redshift Spectrum.

You can use any data set to perform the tests to validate the best practices we have outlined in this blog post.  One important requirement is that the S3 files for the largest table need to be in three separate data formats:  CSV, non-partitioned Parquet as well as partitioned Parquet.  How to convert from one file format to another is beyond the scope of this blog post.  For more information on how this can be done, check out the following resources:

Creating the external schema

Use the Amazon Athena data catalog as the metadata store, and create an external schema named “spectrum” as follows:

create external schema spectrum 
from data catalog 
database 'spectrumdb' 
iam_role 'arn:aws:iam::<AWS_ACCOUNT_ID>:role/aod-redshift-role'
create external database if not exists;

The Redshift cluster and the data files in Amazon S3 must be in the same AWS region.  Your Redshift cluster needs authorization to access your external data catalog in Amazon Athena and your data files in Amazon S3. You provide that authorization by referencing an AWS Identity and Access Management (IAM) role (e.g. aod-redshift-role) that is attached to your cluster. For more information, see Create an IAM Role for Amazon Redshift.


Visualize and Monitor Amazon EC2 Events with Amazon CloudWatch Events and Amazon Kinesis Firehose

by Pubali Sen

Monitoring your AWS environment is important for security, performance, and cost control purposes. For example, by monitoring and analyzing API calls made to your Amazon EC2 instances, you can trace security incidents and gain insights into administrative behaviors and access patterns. The kinds of events you might monitor include console logins, Amazon EBS snapshot creation/deletion/modification, VPC creation/deletion/modification, and instance reboots.

In this post, I show you how to build a near real-time API monitoring solution for EC2 events using Amazon CloudWatch Events and Amazon Kinesis Firehose. Please be sure to have Amazon CloudTrail enabled in your account.

  • CloudWatch Events offers a near real-time stream of system events that describe changes in AWS resources. CloudWatch Events now supports Kinesis Firehose as a target.
  • Kinesis Firehose is a fully managed service for continuously capturing, transforming, and delivering data in minutes to storage and analytics destinations such as Amazon S3, Amazon Kinesis Analytics, Amazon Redshift, and Amazon Elasticsearch Service.


For this walkthrough, you create a CloudWatch event rule that matches specific EC2 events such as:

  • Starting, stopping, and terminating an instance
  • Creating and deleting VPC route tables
  • Creating and deleting a security group
  • Creating, deleting, and modifying instance volumes and snapshots

Your CloudWatch event target is a Kinesis Firehose delivery stream that delivers this data to an Elasticsearch cluster, where you set up Kibana for visualization. Using this solution, you can easily load and visualize EC2 events in minutes without setting up complicated data pipelines.
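To give a sense of what such a rule matches, the following Python sketch builds a CloudWatch Events event pattern for EC2 API calls recorded by CloudTrail. The list of event names is my own selection of EC2 API actions that correspond to the bullet points above, not the list used in the post, so adjust it to the events you care about.

```python
import json

# Assumed mapping of the listed activities to EC2 API action names.
EC2_EVENT_NAMES = [
    "StartInstances", "StopInstances", "TerminateInstances",
    "CreateRouteTable", "DeleteRouteTable",
    "CreateSecurityGroup", "DeleteSecurityGroup",
    "CreateVolume", "DeleteVolume", "ModifyVolume",
    "CreateSnapshot", "DeleteSnapshot", "ModifySnapshotAttribute",
]


def ec2_api_event_pattern(event_names=EC2_EVENT_NAMES):
    """Build an event pattern that matches the given EC2 API calls."""
    return {
        "source": ["aws.ec2"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["ec2.amazonaws.com"],
            "eventName": list(event_names),
        },
    }


if __name__ == "__main__":
    print(json.dumps(ec2_api_event_pattern(), indent=2))
```

The printed JSON can be pasted into the event pattern editor when you create the rule. Because the pattern relies on API calls delivered through CloudTrail, CloudTrail must be enabled for it to match anything.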

Set up the Elasticsearch cluster

Create the Amazon ES domain in the Amazon ES console, or by using the create-elasticsearch-domain command in the AWS CLI.

This example uses the following configuration:

  • Domain Name: esLogSearch
  • Elasticsearch Version: 1
  • Instance Count: 2
  • Instance type:elasticsearch
  • Enable dedicated master: true
  • Enable zone awareness: true
  • Restrict Amazon ES to an IP-based access policy


Analyze Database Audit Logs for Security and Compliance Using Amazon Redshift Spectrum

by Sandeep Kariro

With the increased adoption of cloud services, organizations are moving their critical workloads to AWS. Some of these workloads store, process, and analyze sensitive data that must be audited to satisfy security and compliance requirements. The most common questions from auditors are: who logged in to the system and when, who queried which sensitive data and when, and when did a user last modify or update their credentials?

By default, Amazon Redshift logs all information related to user connections, user modifications, and user activity on the database. However, to efficiently manage disk space, log tables are only retained for 2–5 days, depending on log usage and available disk space. To retain the log data for a longer period of time, enable database audit logging. After it’s enabled, Amazon Redshift automatically pushes the data to a configured S3 bucket periodically.

Amazon Redshift Spectrum is a recently released feature that enables querying and joining data stored in Amazon S3 with Amazon Redshift tables. With Redshift Spectrum, you can retrieve the audit data stored in S3 to answer all security and compliance–related questions. Redshift Spectrum can also combine the datasets from the tables in the database with the datasets stored in S3. It supports files in Parquet, textfile (csv, pipe delimited, tsv), sequence file, and RC file format. It also supports different compression types like gzip, snappy, and bz2.

In this post, I demonstrate querying the Amazon Redshift audit data logged in S3 to provide answers to common use cases described above.


You set up the following resources:

  • Amazon Redshift cluster and parameter group
  • IAM role and policies to give Redshift Spectrum access to Amazon Redshift
  • Redshift Spectrum external tables


  • Create an AWS account
  • Configure the AWS CLI to access your AWS account
  • Get access to a query tool compatible with Amazon Redshift
  • Create an S3 bucket

Cluster requirements

The Amazon Redshift cluster must:

  • Be in the same region as the S3 bucket storing the audit log files.
  • Be version 1.0.1294 or later.
  • Have read bucket and put object permissions on the S3 bucket to be configured for logging.
  • Have an IAM role attached that has at least the following two built-in policies: AmazonS3ReadOnlyAccess and AmazonAthenaFullAccess.

Set up Amazon Redshift

Create a new parameter group to enable user activity logging:

aws redshift create-cluster-parameter-group --parameter-group-name rs10-enable-log --parameter-group-family Redshift-1.0 --description "Enable Audit Logging"
aws redshift modify-cluster-parameter-group --parameter-group-name rs10-enable-log --parameters '{"ParameterName":"enable_user_activity_logging","ParameterValue":"true"}'

Create the Amazon Redshift cluster using the new parameter group created:

aws redshift create-cluster --node-type dc1.large --cluster-type single-node --cluster-parameter-group-name rs10-enable-log --master-username <Username> --master-user-password <Password> --cluster-identifier <ClusterName>

Wait for the cluster to build. When it is complete, enable audit logging:

aws redshift enable-logging --cluster-identifier <ClusterName> --bucket-name <bucketname>

Set up Redshift Spectrum

To set up Redshift Spectrum, create an IAM role and policies, an external database, and external tables.

Seven Tips for Using S3DistCp on Amazon EMR to Move Data Efficiently Between HDFS and Amazon S3

by Illya Yalovyy

Although it’s common for Amazon EMR customers to process data directly in Amazon S3, there are occasions where you might want to copy data from S3 to the Hadoop Distributed File System (HDFS) on your Amazon EMR cluster. Additionally, you might have a use case that requires moving large amounts of data between buckets or regions. In these use cases, large datasets are too big for a simple copy operation. Amazon EMR can help with this, and offers a utility, S3DistCp, to help with moving data from S3 to other S3 locations or on-cluster HDFS.

In the Hadoop ecosystem, DistCp is often used to move data. DistCp provides a distributed copy capability built on top of a MapReduce framework. S3DistCp is an extension to DistCp that is optimized to work with S3 and that adds several useful features. In addition to moving data between HDFS and S3, S3DistCp is also a Swiss Army knife of file manipulations. In this post we’ll cover the following tips for using S3DistCp, starting with basic use cases and then moving to more advanced scenarios:

1. Copy or move files without transformation
2. Copy and change file compression on the fly
3. Copy files incrementally
4. Copy multiple folders in one job
5. Aggregate files based on a pattern
6. Upload files larger than 1 TB in size
7. Submit an S3DistCp step to an EMR cluster

1. Copy or move files without transformation

We’ve observed that customers often use S3DistCp to copy data from one storage location to another, whether S3 or HDFS. Syntax for this operation is simple and straightforward:

$ s3-dist-cp --src /data/incoming/hourly_table --dest s3://my-tables/incoming/hourly_table

The source location may contain extra files that we don’t necessarily want to copy. Here, we can use filters based on regular expressions to do things such as copying files with the .log extension only.
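As a sketch of the filtering idea, the following Python helper builds an `s3-dist-cp` argument list with a `--srcPattern` regular expression and previews which paths the pattern would select. It assumes the pattern has to match the whole path, and it uses Python's `re` module for the preview, which is close to but not identical to the regular expression dialect S3DistCp itself uses. The example paths are made up.

```python
import re


def s3distcp_args(src, dest, src_pattern=None):
    """Build the argument list for an s3-dist-cp invocation."""
    args = ["s3-dist-cp", "--src", src, "--dest", dest]
    if src_pattern is not None:
        args.append("--srcPattern=" + src_pattern)
    return args


def matching_paths(paths, src_pattern):
    """Preview which paths a pattern selects, assuming whole-path matching."""
    regex = re.compile(src_pattern)
    return [path for path in paths if regex.fullmatch(path)]


if __name__ == "__main__":
    pattern = r".*\.log"  # copy only files with the .log extension
    print(s3distcp_args(
        "/data/incoming/hourly_table",
        "s3://my-tables/incoming/hourly_table",
        pattern,
    ))
    # Hypothetical file listing, for illustration only.
    print(matching_paths(
        ["/data/incoming/hourly_table/a.log", "/data/incoming/hourly_table/a.tmp"],
        pattern,
    ))
```

Previewing the pattern against a sample listing before running the job is a cheap way to catch a pattern that is too broad or too narrow.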