Now run real-time stream processing at scale with Apache Flink on Amazon EMR

Posted on: Nov 3, 2016

You can now use Apache Flink 1.1.3 and upgraded versions of Apache Zeppelin (0.6.2) and Apache HBase (1.2.3) on Amazon EMR release 5.1.0. Also, the interactive notebook in Hue now supports querying data using Presto.

Apache Flink is a streaming dataflow engine that makes it easy to run real-time stream processing on high-throughput data sources. It supports event-time semantics for out-of-order events, exactly-once semantics, backpressure control, and APIs optimized for writing both streaming and batch applications. Additionally, Flink has connectors for Amazon Kinesis Streams, Apache Kafka, Elasticsearch, the Twitter Streaming API, and Cassandra, and it can access data in Amazon S3 (with EMRFS) and HDFS.
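To illustrate what event-time semantics mean for out-of-order events, here is a minimal plain-Python sketch. It does not use Flink's API. The function name, the tuple layout, and the `max_lateness` parameter are all assumptions made for this example. It groups events into fixed windows by the timestamp each event carries rather than by arrival order, and it closes a window only after a simple watermark has passed the window's end.

```python
from collections import defaultdict


def tumbling_event_time_counts(events, window_size, max_lateness):
    """Count events per (window, key) using event time, not arrival order.

    events: iterable of (event_time, key) tuples in ARRIVAL order.
    window_size: window length, in the same integer units as event_time.
    max_lateness: how far the watermark trails the largest event time seen.
    Returns a list of (window_start, key, count) in the order windows close.
    Hypothetical helper for illustration only; not part of Flink.
    """
    open_windows = defaultdict(lambda: defaultdict(int))
    results = []
    watermark = float("-inf")

    for event_time, key in events:
        window_start = event_time - event_time % window_size

        # An event whose window has already been closed is too late: drop it.
        if window_start + window_size <= watermark:
            continue

        open_windows[window_start][key] += 1

        # The watermark only moves forward as larger timestamps arrive.
        watermark = max(watermark, event_time - max_lateness)

        # Emit every window whose end is at or before the watermark.
        closable = sorted(w for w in open_windows if w + window_size <= watermark)
        for start in closable:
            for k, count in sorted(open_windows.pop(start).items()):
                results.append((start, k, count))

    # End of input: flush whatever is still open.
    for start in sorted(open_windows):
        for k, count in sorted(open_windows[start].items()):
            results.append((start, k, count))
    return results


# Arrival order differs from event-time order: 3 arrives after 7.
arrivals = [(1, "a"), (7, "a"), (3, "a"), (12, "a"), (2, "a")]
print(tumbling_event_time_counts(arrivals, window_size=5, max_lateness=3))
```

In this run the event with timestamp 3 arrives after the one with timestamp 7 but is still counted in the window starting at 0, while the final event (timestamp 2) arrives after that window has closed and is dropped.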

You can create an Amazon EMR cluster with release 5.1.0 by choosing release label “emr-5.1.0” from the AWS Management Console, AWS CLI, or SDK. You can specify Flink, Zeppelin, and HBase to install these applications on your cluster. Please visit the Amazon EMR documentation for more information about release 5.1.0, Flink 1.1.3, Zeppelin 0.6.2, and HBase 1.2.3.
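As a rough sketch of the SDK route, the following Python example builds a cluster request for release label "emr-5.1.0" with Flink, Zeppelin, and HBase, and submits it with the boto3 `run_job_flow` call. The cluster name, instance type, instance count, region, and the default role names `EMR_DefaultRole` and `EMR_EC2_DefaultRole` are assumptions chosen for illustration; adjust them to match your account.

```python
def build_cluster_request(name="flink-demo", instance_type="m3.xlarge",
                          instance_count=3):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    The name, instance type, count, and role names are illustrative
    assumptions; only the release label and application names come from
    the announcement.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.1.0",
        "Applications": [{"Name": app} for app in ("Flink", "Zeppelin", "HBase")],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Keep the cluster running after launch so jobs can be submitted.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumed to exist already in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(region="us-east-1"):
    """Submit the request and return the new cluster (job flow) ID.

    Requires the boto3 package and valid AWS credentials; calling this
    starts billable resources.
    """
    import boto3  # imported here so the request builder works without boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**build_cluster_request())
    return response["JobFlowId"]


if __name__ == "__main__":
    # Print the request only; call launch_cluster() to actually create it.
    print(build_cluster_request())
```

Separating the request builder from the launch call lets you inspect or log the parameters before any resources are created.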