AWS Big Data Blog

Large-Scale Machine Learning with Spark on Amazon EMR

This is a guest post by Jeff Smith, Data Engineer at Intent Media. Intent Media, in their own words: “Intent Media operates a platform for advertising on commerce sites. We help online travel companies optimize revenue on their websites and apps through sophisticated data science capabilities. On the data team at Intent Media, we are […]

Building a Binary Classification Model with Amazon Machine Learning and Amazon Redshift

Guy Ernest is a Solutions Architect with AWS. This post builds on Guy’s earlier posts Building a Numeric Regression Model with Amazon Machine Learning and Building a Multi-Class ML Model with Amazon Machine Learning. Many decisions in life are binary, answered either Yes or No. Many business problems also have binary answers. For example: “Is […]

Test drive two big data scenarios from the ‘Building a Big Data Platform on AWS’ bootcamp

Matt Yanchyshyn is a Sr. Manager for AWS Solutions Architecture. AWS offers a number of events during the year, such as our annual AWS re:Invent conference, the AWS Summit series, the AWS Pop-up Loft, and a variety of roadshows. All of these provide opportunities for AWS customers to attend talks focused on big data and […]

Indexing Common Crawl Metadata on Amazon EMR Using Cascading and Elasticsearch

Hernan Vivani is a Big Data Support Engineer for Amazon Web Services. A previous post showed you how to get started with Elasticsearch and Kibana on Amazon EMR. In that post, we installed Elasticsearch and Kibana on an Amazon EMR cluster using bootstrap actions. This post shows you how to build a simple application with […]

Building a Multi-Class ML Model with Amazon Machine Learning

Guy Ernest is a Solutions Architect with AWS. This post builds on our earlier post Building a Numeric Regression Model with Amazon Machine Learning. We often need to assign an object (product, article, or customer) to its class (product category, article topic or type, or customer segment). For example, which category of products is most […]

Optimizing for Star Schemas and Interleaved Sorting on Amazon Redshift

Chris Keyser is a Solutions Architect for AWS. Many organizations implement star and snowflake schema data warehouse designs, and many BI tools are optimized to work with dimensions, facts, and measure groups. Customers have moved data warehouses of all types to Amazon Redshift with great success. The Amazon Redshift team has released support for interleaved […]

Using AWS Data Pipeline’s Parameterized Templates to Build Your Own Library of ETL Use-case Definitions

Leena Joseph is an SDE for AWS Data Pipeline. In an earlier post, we introduced you to ETL processing using AWS Data Pipeline and Amazon EMR. This post shows how to build ETL workflow templates with AWS Data Pipeline and assemble a library of recipes that implement common use cases. This is an introduction to […]

Building a Numeric Regression Model with Amazon Machine Learning

Guy Ernest is a Solutions Architect with AWS. We need to predict future values in our businesses. These predictions are important for better planning of resource allocation and making other business decisions. Often, we settle for a simplified heuristic of average values from the past and some change assumption because more accurate alternatives are too […]

Running a High Performance SAS Grid Manager Cluster on AWS with Intel Cloud Edition for Lustre

Chris Keyser is a Solutions Architect for Amazon Web Services. This post was co-authored by Margaret Crevar, Sr. Manager, Performance Validation at SAS. SAS is an AWS Technology Partner. SAS (www.sas.com) is an integrated environment designed for business and advanced data analytics by enterprise and government organizations. SAS and AWS recently performed testing using the […]
