AWS Big Data Blog

Category: Amazon EMR*

Build a Real-time Stream Processing Pipeline with Apache Flink on AWS

This post has been translated into Japanese. In today’s business environments, data is generated in a continuous fashion by a steadily increasing number of diverse data sources. Therefore, the ability to continuously capture, store, and process this data to quickly turn high-volume streams of raw data into actionable insights has become a substantial competitive advantage […]

Read More

Securely Analyze Data from Another AWS Account with EMRFS

Sometimes, data to be analyzed is spread across buckets owned by different accounts. In order to ensure data security, appropriate credentials management needs to be in place. This is especially true for large enterprises storing data in different Amazon S3 buckets for different departments. For example, a customer service department may need access to data […]

Read More

Meet the Amazon EMR Team this Friday at a Tech Talk & Networking Event in Mountain View

Want to change the world with Big Data and Analytics? Come join us on the Amazon EMR team in Amazon Web Services! Meet the Amazon EMR team this Friday April 7th from 5:00 – 7:30 PM at Michael’s at Shoreline in Mountain View. We’ll feature short tech talks by EMR leadership who will talk about the past, […]

Read More

Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS

You have come up with an exciting hypothesis, and now you are keen to find and analyze as much data as possible to prove (or refute) it. There are many datasets that might be applicable, but they have been created at different times by different people and don’t conform to any common standard. They use […]

Read More

Secure Amazon EMR with Encryption

In the last few years, there has been a rapid rise in enterprises adopting the Apache Hadoop ecosystem for critical workloads that process sensitive or highly confidential data. Due to the highly critical nature of the workloads, the enterprises implement certain organization/industry wide policies and certain regulatory or compliance policies. Such policy requirements are designed […]

Read More

Create a Healthcare Data Hub with AWS and Mirth Connect

As anyone visiting their doctor may have noticed, gone are the days of physicians recording their notes on paper. Physicians are more likely to enter the exam room with a laptop than with paper and pen. This change is the byproduct of efforts to improve patient outcomes, increase efficiency, and drive population health. Pushing for […]

Read More

Serving Real-Time Machine Learning Predictions on Amazon EMR

by Derek Graeber and Guy Ernest | on | in Amazon EMR* | Permalink | Comments |  Share

The typical progression for creating and using a trained model for recommendations falls into two general areas: training the model and hosting the model. Model training has become a well-known standard practice. We want to highlight one of many ways to host those recommendations (for example, see the Analyzing Genomics Data at Scale using R, […]

Read More

Run Jupyter Notebook and JupyterHub on Amazon EMR

Tom Zeng is a Solutions Architect for Amazon EMR Jupyter Notebook (formerly IPython) is one of the most popular user interfaces for running Python, R, Julia, Scala, and other languages to process and visualize data, perform statistical analysis, and train and run machine learning models. Jupyter notebooks are self-contained documents that can include live code, […]

Read More

Respond to State Changes on Amazon EMR Clusters with Amazon CloudWatch Events

Jonathan Fritz is a Senior Product Manager for Amazon EMR Customers can take advantage of the Amazon EMR API to create and terminate EMR clusters, scale clusters using Auto Scaling or manual resizing, and submit and run Apache Spark, Apache Hive, or Apache Pig workloads. These decisions are often triggered from cluster state-related information. Previously, […]

Read More

Using SaltStack to Run Commands in Parallel on Amazon EMR

Miguel Tormo is a Big Data Support Engineer in AWS Premium Support Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. Amazon EMR defines three types of nodes: master node, core nodes, and task nodes. It’s common to […]

Read More