AWS Big Data Blog

Tag: Amazon EMR

Respond to State Changes on Amazon EMR Clusters with Amazon CloudWatch Events

Jonathan Fritz is a Senior Product Manager for Amazon EMR. Customers can take advantage of the Amazon EMR API to create and terminate EMR clusters, scale clusters using Auto Scaling or manual resizing, and submit and run Apache Spark, Apache Hive, or Apache Pig workloads. These decisions are often triggered from cluster state-related information. Previously, […]

Using SaltStack to Run Commands in Parallel on Amazon EMR

Miguel Tormo is a Big Data Support Engineer in AWS Premium Support. Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. Amazon EMR defines three types of nodes: master node, core nodes, and task nodes. It’s common to […]

Implementing Authorization and Auditing using Apache Ranger on Amazon EMR

Updated 9/26/2018: Updates have been made to support the latest versions of EMR and Apache Ranger.

Role-based access control (RBAC) is an important security requirement for multi-tenant Hadoop clusters. Enforcing this across always-on and transient clusters can be hard to set up and maintain. Imagine an organization that has an RBAC matrix using Active […]

Low-Latency Access on Trillions of Records: FINRA’s Architecture Using Apache HBase on Amazon EMR with Amazon S3

John Hitchingham is Director of Performance Engineering at FINRA. The Financial Industry Regulatory Authority (FINRA) is a private sector regulator responsible for analyzing 99% of the equities and 65% of the option activity in the US. In order to look for fraud, market manipulation, insider trading, and abuse, FINRA’s technology group has developed a robust […]

Dynamically Scale Applications on Amazon EMR with Auto Scaling

Jonathan Fritz is a Senior Product Manager for Amazon EMR. Customers running Apache Spark, Presto, and the Apache Hadoop ecosystem take advantage of Amazon EMR’s elasticity to save costs by terminating clusters after workflows are complete and resizing clusters with low-cost Amazon EC2 Spot Instances. For instance, customers can create clusters for daily ETL or machine learning […]

Use Apache Flink on Amazon EMR

Today we are making it even easier to run Flink on AWS, as it is now natively supported in Amazon EMR 5.1.0. EMR supports running Flink-on-YARN, so you can create either a long-running cluster that accepts multiple jobs or a short-running Flink session in a transient cluster, which helps reduce your costs because you are charged only for the time that you use.
