AWS Big Data Blog

Snakes in the Stream – Feeding and Eating Amazon Kinesis Streams with Python

Markus Schmidberger is a Senior Consultant for AWS Professional Services. The Internet of Things (IoT) is becoming increasingly popular, and it’s easy to see why: it generates new business value for your company by connecting all available machines and devices. The big challenge is real-time data processing and analysis. Cloud computing is an excellent way […]

Using LDAP via AWS Directory Service to Access and Administer Your Hadoop Environment

Erik Swensson is a Solutions Architect with AWS. In this post you will learn how to leverage a Lightweight Directory Access Protocol (LDAP) service via AWS Directory Service to authenticate and define permissions for users and administrators of Amazon EMR, Amazon’s hosted Hadoop service. A centralized LDAP repository for authentication and authorization lets you more […]

Running Apache Accumulo on Amazon EMR

Manjeet Chayel is a Solutions Architect with Amazon Web Services. This post was co-authored by Matt Yanchyshyn, a Principal Solutions Architect with Amazon Web Services. Apache Accumulo is a sorted, distributed key-value store that is built on top of Apache Hadoop, ZooKeeper, and Thrift. Accumulo was originally modeled after Google’s Bigtable and can scale to […]

Getting Started with Elasticsearch and Kibana on Amazon EMR

Hernan Vivani is a Big Data Support Engineer for Amazon Web Services. This post shows you how to install Elasticsearch and Kibana on an Amazon EMR cluster and provides a few simple ways to confirm it is working. (Please also see “Indexing Common Crawl Metadata on Amazon EMR Using Cascading and Elasticsearch.”) NOTE: If your goal […]

Strategies for Reducing Your Amazon EMR Costs

UPDATE, MAY 2019: We have updated the Amazon EC2 Spot pricing model as of November 2017. The new pricing model simplifies purchasing without bidding and with fewer interruptions. Learn more about the updated pricing model. This is a guest post by Prateek Gupta, a lead engineer at BloomReach. BloomReach has built a personalized discovery […]

Node.js Streaming MapReduce with Amazon EMR

Ian Meyers is a Solutions Architecture Senior Manager with AWS. Node.js is a JavaScript framework for running high performance server-side applications based upon non-blocking I/O and an asynchronous, event-driven processing model. When customers need to process large volumes of complex data, Node.js offers a runtime that natively supports the JSON data structure. Languages such […]

Getting HBase Running on Amazon EMR and Connecting it to Amazon Kinesis

Wangechi Doble is an AWS Solutions Architect. Apache HBase is an open-source, column-oriented, distributed NoSQL database that runs on the Apache Hadoop framework. In the AWS Cloud, you can choose to deploy Apache HBase on Amazon Elastic Compute Cloud (Amazon EC2) and manage it yourself or leverage Apache HBase as a managed service on […]

The Impact of Using Latest-Generation Instances for Your Amazon EMR Job

Nick Corbett is a Big Data Consultant for AWS Professional Services. Amazon Elastic MapReduce (Amazon EMR) is a web service that makes it easy to process large amounts of data efficiently. Amazon EMR uses the popular open source framework Apache Hadoop combined with several other AWS products to do such tasks as web indexing, data […]

ETL Processing Using AWS Data Pipeline and Amazon Elastic MapReduce

Manjeet Chayel is an AWS Solutions Architect. This blog post shows you how to build an ETL workflow that uses AWS Data Pipeline to schedule an Amazon Elastic MapReduce (Amazon EMR) cluster to clean and process web server logs stored in an Amazon Simple Storage Service (Amazon S3) bucket. AWS Data Pipeline is an ETL […]
