AWS Big Data Blog

Month in Review: February 2016

by Andy Werth

February brought plenty for big data enthusiasts on the AWS Big Data Blog. Take a look!

Submitting User Applications with spark-submit

Learn how to set spark-submit flags to control the memory and compute resources available to your application submitted to Spark running on EMR. Learn when to use the maximizeResourceAllocation configuration option and dynamic allocation of executors.
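As a rough illustration of the kind of flags that post discusses, here is a minimal sketch that assembles a spark-submit command line. The helper name `build_spark_submit`, its default values, and the application file name are hypothetical and chosen only for this example. The flags themselves (`--master`, `--executor-memory`, `--executor-cores`, `--num-executors`, and the `spark.dynamicAllocation.enabled` property) are standard spark-submit options. The sketch assumes the cluster already provides what dynamic allocation requires, such as the external shuffle service, and it does not cover maximizeResourceAllocation.

```python
import shlex


def build_spark_submit(app, executor_memory="4g", executor_cores=2,
                       num_executors=None, dynamic_allocation=False):
    """Return a spark-submit argument list (hypothetical helper, illustration only)."""
    cmd = [
        "spark-submit",
        "--master", "yarn",                      # Spark on EMR runs on YARN
        "--executor-memory", executor_memory,    # heap size per executor
        "--executor-cores", str(executor_cores), # cores per executor
    ]
    if dynamic_allocation:
        # Let Spark grow and shrink the executor count itself;
        # assumes the cluster is set up to support dynamic allocation.
        cmd += ["--conf", "spark.dynamicAllocation.enabled=true"]
    elif num_executors is not None:
        # Otherwise request a fixed number of executors.
        cmd += ["--num-executors", str(num_executors)]
    cmd.append(app)
    return cmd


# Example: a fixed allocation of 10 executors for a hypothetical application file.
fixed = build_spark_submit("my_app.py", num_executors=10)
print(" ".join(shlex.quote(arg) for arg in fixed))
```

In this sketch a fixed executor count and dynamic allocation are treated as alternatives, so `--num-executors` is dropped when `dynamic_allocation=True`. That is a simplification made for the example, not a rule taken from the post.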

Amazon Redshift UDF repository on AWSLabs

Discover the new Python UDF functions that AWS has released as part of the initial AWS Labs Amazon Redshift UDF repository: column encryption, parsing, date functions, and more! (And be sure to check out our Introduction to Python UDFs in Amazon Redshift.)

Optimize Spark-Streaming to Efficiently Process Amazon Kinesis Streams

If you use Amazon Kinesis Streams and Apache Spark to quickly extract value from large volumes of data streams, understanding how these frameworks work together helps you optimize performance. This post explains some ways to tune Spark Streaming for the best performance and the right semantics.

Big Data Analytics Options on AWS: Updated White Paper

We’ve made exciting changes to the Big Data Analytics Options white paper (first published December 2014). This white paper introduces you to the many big data analytics options on the AWS platform and helps you determine when to choose one solution over another. It covers ideal usage patterns, cost model, performance, durability and availability, scalability and elasticity, interfaces, and anti-patterns for many AWS services.

Process Amazon Kinesis Aggregated Data with AWS Lambda

We are excited to announce that you can now easily process data aggregated by the Kinesis Producer Library (KPL) using Amazon Kinesis de-aggregation modules. These modules support Java, Python, and Node.js, and let you extract user records from a KPL-aggregated stream in AWS Lambda or in your multi-lang KCL application.

AWS Big Data Meetup on February 24th in Palo Alto

The guest speaker for this Meetup was Cory Dolphin from Twitter, who talked about Answers, Fabric’s real-time analytics product, which processes billions of events in real time using Twitter’s new stream processing engine, Heron. Nathaniel Slater, Solutions Architect in AWS Big Data Services, spoke about Amazon DynamoDB.

Introducing On-Demand Pipeline Execution in AWS Data Pipeline

You can now trigger activation of pipelines in AWS Data Pipeline using the new on-demand schedule type. On-demand schedules make it easy to integrate pipelines in AWS Data Pipeline with other AWS services and with on-premises orchestration engines.

FROM THE ARCHIVE

Top 10 Performance Tuning Techniques for Amazon Redshift (December 2015)

——————————————–

Looking to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming Data educational pages.