Tag: Amazon Machine Learning
This is a guest post by Tom Talpir, Software Developer at ironSource. ironSource is an Advanced AWS Partner Network (APN) Technology Partner and an AWS Big Data Competency Partner.
Ever wondered what it takes to keep a user from leaving your game or application after all the hard work you put in? Wouldn’t it be great to have a chance to interact with users before they leave?
Finding these users can be difficult, mainly because most churn happens within the first few minutes or hours of a user’s gameplay. However, machine learning (ML) can make this possible by providing insights to help developers identify these users and engage with them to decrease the churn rate.
Upopa is a gaming studio that creates cool games (that you should definitely check out), and they were a great fit for our new project: leveraging Amazon Machine Learning (Amazon ML) to give game developers the ability to predict their players’ future actions, and ultimately to reduce churn, without having to learn complex ML algorithms.
Upopa sends all their data to Amazon Redshift, using ironSource Atom, a data flow management solution that allows developers to send data from their application into many different types of data targets (including Amazon Redshift, Amazon S3, Amazon Elasticsearch Service, and other relational databases) with great ease.
Amazon ML turned out to be the right solution for Upopa, because it integrates easily with Amazon Redshift, and makes everything much easier with visualization tools and wizards that guide you through the process of creating ML models.
Air travel can be stressful due to the many factors that are simply out of airline passengers’ control. As passengers, we want to minimize this stress as much as we can. We can do this by using past data to make predictions about how likely a flight will be delayed based on the time of day or the airline carrier.
In this post, we generate a predictive model for flight delays that can be used to help us pick the flight least likely to add to our travel stress. To accomplish this, we will use Apache Spark running on Amazon EMR to extract, transform, and load (ETL) the data, Amazon Redshift for analysis, and Amazon Machine Learning for creating predictive models. This solution is a good example of combining multiple AWS services to build a sophisticated analytical application in the AWS Cloud.
At a high level, our solution includes the following steps:
Step 1 is to ingest datasets:
- We will download publicly available Federal Aviation Administration (FAA) flight data and National Oceanic and Atmospheric Administration (NOAA) weather datasets and stage them in Amazon S3.
- Note: A typical big data workload consists of ingesting data from disparate sources and integrating them. To mimic that scenario, we will store the weather data in an Apache Hive table and the flight data in an Amazon Redshift cluster.
Step 2 is to enrich data by using ETL:
- We will transform the maximum and minimum temperature columns from Celsius to Fahrenheit in the weather table in Hive by using a user-defined function in Spark.
- We enrich the flight data in Amazon Redshift to compute and include extra features and columns (departure hour, days to the nearest holiday) that will help the Amazon Machine Learning algorithm’s learning process.
- We then combine both datasets in the Spark environment by using the spark-redshift package to load data from the Amazon Redshift cluster into Spark running on an Amazon EMR cluster. We write the enriched data back to an Amazon Redshift table using the same spark-redshift package.
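The two enrichment steps above boil down to simple column transformations. As a stdlib-only sketch of that logic (the holiday calendar and field names here are hypothetical, and in the post itself the temperature conversion runs as a Spark user-defined function):

```python
from datetime import date, datetime

def celsius_to_fahrenheit(temp_c):
    """Convert a Celsius temperature to Fahrenheit."""
    return temp_c * 9.0 / 5.0 + 32.0

# Hypothetical holiday calendar for the year being analyzed.
HOLIDAYS = [date(2016, 1, 1), date(2016, 7, 4), date(2016, 12, 25)]

def days_to_nearest_holiday(flight_date):
    """Distance in days from a flight date to the closest holiday."""
    return min(abs((flight_date - h).days) for h in HOLIDAYS)

def enrich_flight(departure_ts):
    """Derive the extra feature columns described above from a departure timestamp."""
    return {
        "departure_hour": departure_ts.hour,
        "days_to_holiday": days_to_nearest_holiday(departure_ts.date()),
    }

features = enrich_flight(datetime(2016, 7, 2, 14, 30))
```

In the actual pipeline these functions would be registered as Spark UDFs and applied to whole columns rather than to one row at a time.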
Step 3 is to perform predictive analytics:
- In this last step, we use Amazon Machine Learning to create and train a ML model using Amazon Redshift as our data source. The trained Amazon ML model is used to generate predictions for the test dataset, which are output to an S3 bucket.
The typical progression for creating and using a trained model for recommendations falls into two general areas: training the model and hosting the model. Model training has become a well-known standard practice. We want to highlight one of many ways to host those recommendations (for example, see the Analyzing Genomics Data at Scale using R, AWS Lambda, and Amazon API Gateway post).
In this post, we look at one possible way to host a trained ALS model on Amazon EMR using Apache Spark to serve movie predictions in real time. It is a continuation of two recent posts that are prerequisites:
- Building a Recommendation Engine with Spark ML on Amazon EMR using Zeppelin
- Installing and Running JobServer for Apache Spark on Amazon EMR
In future posts we will cover other alternatives for serving real-time machine-learning predictions, namely AWS Lambda and Amazon EC2 Container Service, by running the prediction functions locally and loading the saved models from S3 to the local execution environments.
Walkthrough: Trained ALS model
For this walkthrough, you use the MovieLens dataset as set forth in the Building a Recommendation Engine post; the model should already have been generated and persisted to Amazon S3, trained with the Alternating Least Squares (ALS) algorithm.
Using JobServer, you take that model and persist it in memory in JobServer on Amazon EMR. After it’s persisted, you can expose RESTful endpoints to AWS Lambda, which in turn can be invoked from a static UI page hosted on S3, securing access with Amazon Cognito.
Here are the steps that you follow:
- Create the infrastructure, including EMR with JobServer and Lambda.
- Load the trained model into Spark on EMR via JobServer.
- Stage a static HTML page on S3.
- Access the AWS Lambda endpoints via the static HTML page authenticated with Amazon Cognito.
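Once the model is loaded, serving a prediction from a trained ALS model is just a dot product between a user's and a movie's latent factor vectors. The stdlib-only sketch below uses tiny hand-picked factors purely for illustration (a real model persists factors learned from MovieLens, and the ranking runs inside Spark via JobServer):

```python
# Hypothetical latent factors, as a trained ALS model would persist them:
# one low-dimensional vector per user and per movie.
user_factors = {
    42: [0.9, 0.1, 0.3],
}
movie_factors = {
    7:  [1.0, 0.2, 0.5],
    13: [0.1, 1.2, 0.4],
}

def predict_rating(user_id, movie_id):
    """An ALS prediction is the dot product of the two factor vectors."""
    u = user_factors[user_id]
    m = movie_factors[movie_id]
    return sum(a * b for a, b in zip(u, m))

def top_movies(user_id, n=1):
    """Rank all movies for a user, as a recommendation endpoint would."""
    scored = ((predict_rating(user_id, m), m) for m in movie_factors)
    return [m for _, m in sorted(scored, reverse=True)[:n]]
```

This is why keeping the model in memory in JobServer pays off: each request reduces to a handful of dot products rather than a batch job.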
The following diagram shows the infrastructure architecture.
Ujjwal Ratan is a Solutions Architect with Amazon Web Services
The Hospital Readmission Reduction Program (HRRP) was included as part of the Affordable Care Act to improve quality of care and lower healthcare spending. A hospital visit counts as a readmission if the patient is admitted to a hospital within 30 days of being discharged from an earlier hospital stay. This should be easy to measure, right? Wrong.
Unfortunately, it gets more complicated than this. Not all readmissions can be prevented, as some of them are part of an overall care plan for the patient. There are also factors beyond the hospital’s control that may cause a readmission. The Centers for Medicare and Medicaid Services (CMS) recognized the complexities of measuring readmission rates and came up with a set of measures to evaluate providers.
There is still a long way to go for hospitals to be effective in preventing unplanned readmissions. Recognizing factors affecting readmissions is an important first step, but it is also important to draw out patterns in readmission data by aggregating information from multiple clinical and non-clinical hospital systems.
Moreover, most analysis algorithms rely on financial data, which omits the clinical nuances relevant to a readmission pattern. The data sets also contain a lot of redundant information, such as patient demographics and historical data. All this creates a massive data analysis challenge that may take months to solve using conventional means.
In this post, I show how to apply advanced analytics concepts like pattern analysis and machine learning to do risk stratification for patient cohorts.
The role of Amazon ML
There have been multiple global scientific studies on scalable models for predicting readmissions with high accuracy. Some of them, like comparison of models for predicting early hospital readmissions and predicting hospital readmissions in the Medicare population, are great examples.
Readmission records exhibit patterns that a prediction algorithm can exploit. Outliers in these patterns help identify high-risk patient cohorts, and attribute correlation helps identify the significant features that affect a patient’s readmission risk. Risk stratification is enabled by categorizing patient attributes as numerical, categorical, or text and applying statistical methods such as standard deviation, median analysis, and the chi-squared test. These data sets are then used to build statistical models that identify patients whose characteristics are consistent with readmission, so that the necessary preventive steps can be taken.
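As a stdlib-only illustration of the standard-deviation method mentioned above, the sketch below flags outliers in a hypothetical numerical attribute (prior admissions per patient in the last year); the data and the two-sigma threshold are invented for illustration:

```python
import statistics

# Hypothetical numerical attribute for a small patient cohort:
# number of prior admissions in the last year.
prior_admissions = [1, 0, 2, 1, 1, 0, 2, 1, 9, 1]

mean = statistics.mean(prior_admissions)
stdev = statistics.stdev(prior_admissions)

def is_high_risk(value, threshold=2.0):
    """Flag patients whose attribute lies more than `threshold`
    standard deviations above the cohort mean."""
    return (value - mean) / stdev > threshold

high_risk = [v for v in prior_admissions if is_high_risk(v)]
```

In practice the same idea is applied per attribute across the full cohort, with categorical attributes handled by tests such as chi-squared instead.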
Amazon Machine Learning (Amazon ML) provides visual tools and wizards that guide users in creating complex ML models in minutes. You can also interact with it using the AWS CLI and API to integrate the power of ML with other applications. Based on the chosen target attribute in Amazon ML, you can build ML models like a binary classification model that predicts between states of 0 or 1 or a numeric regression model that predicts numerical values based on certain correlated attributes.
Creating an ML model for readmission prediction
The following diagram represents a reference architecture for building a scalable ML platform on AWS.
Gopal Wunnava is a Senior Consultant with AWS Professional Services
By some estimates, 80% of an organization’s data is unstructured content. This content includes web pages, call center transcripts, surveys, feedback forms, legal documents, forums, social media, and blog articles. Therefore, organizations must analyze not just transactional information but also textual content to gain insight and boost performance. A powerful way to analyze this textual content is by using text mining.
Text mining typically applies machine learning techniques such as clustering, classification, association rules, and predictive modeling. These techniques uncover meaning and relationships in the underlying content. Text mining is used in areas such as competitive intelligence, life sciences, voice of the customer, media and publishing, legal and tax, law enforcement, sentiment analysis, and trend-spotting.
In this blog post, you’ll learn how to apply machine learning techniques to text mining. I’ll show you how to build a text mining application using RapidMiner, a popular open source tool for predictive analytics, and Amazon Simple Storage Service (Amazon S3), an easy-to-use storage service that lets organizations store and retrieve any amount of data from anywhere on the web.
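Before turning to RapidMiner, it helps to see what text mining computes under the hood. The stdlib-only sketch below, with invented documents, builds bag-of-words term-frequency vectors and compares them with cosine similarity, a basic building block behind the clustering and classification techniques mentioned above:

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector for a document."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented documents for illustration.
docs = {
    "review1": "the flight was delayed and the service was poor",
    "review2": "flight delayed poor service again",
    "blog1": "our new open source release ships today",
}
vecs = {name: tf_vector(text) for name, text in docs.items()}
sim = cosine_similarity(vecs["review1"], vecs["review2"])
```

Documents with high pairwise similarity end up in the same cluster; tools like RapidMiner wrap this vectorize-then-compare pipeline in visual operators.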
This is a guest post by Jeff Smith, Data Engineer at Intent Media. Intent Media, in their own words: “Intent Media operates a platform for advertising on commerce sites. We help online travel companies optimize revenue on their websites and apps through sophisticated data science capabilities. On the data team at Intent Media, we are responsible for processing terabytes of e-commerce data per day and using that data and machine learning techniques to power prediction services for our customers.”
Our Big Data Journey
Building large-scale machine learning models has never been simple. Over the history of our team, we’ve continually evolved our approach for running modeling jobs.
The dawn of big data: Java and Pig on Apache Hadoop
Our first data processing jobs were built on Hadoop MapReduce using the Java API. After building some basic aggregation jobs, we went on to develop a scalable, reliable implementation of logistic regression on Hadoop using this paradigm. While Hadoop MapReduce certainly gave us the ability to operate at the necessary scale, using the Java API resulted in verbose, difficult-to-maintain code. More importantly, the achievable feature development velocity using the complex Java API was not fast enough to keep up with our growing business. Our implementation of Alternating Direction Method of Multipliers (ADMM) logistic regression on Hadoop consists of several thousand lines of Java code. As you might imagine, it took months to develop. Compared with a library implementation of logistic regression that can be imported and applied in a single line, this was simply too large of a time investment.
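To make the contrast concrete: with the distributed plumbing stripped away, the model itself fits in a few dozen lines. The sketch below is a minimal batch-gradient-descent logistic regression in stdlib Python on invented toy data, not our production ADMM implementation; a library call hides even this much behind a single line:

```python
import math

def sigmoid(z):
    # Clamp to keep math.exp from overflowing on separable data.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.3, epochs=2000):
    """Minimal gradient-descent logistic regression with one feature
    plus a bias term -- the model a library call hides in one line."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Classify a single example with the learned weights."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Toy separable data: the label flips to 1 once the feature exceeds ~3.
xs = [0.0, 1.0, 2.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
```

The production version adds regularization, many features, and the ADMM machinery for distributing the optimization across Hadoop, which is where the thousands of lines went.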
Around the same time, we built some of our decisioning capabilities in Pig, a domain-specific language (DSL) for Hadoop. Part of our motivation for looking into Pig was to write workflows at a higher level of abstraction than the Java Hadoop API allowed. Although we had some successes with Pig, eventually we abandoned it. Pig was still a young application, and because it is implemented as a DSL, it led to certain inherent difficulties for our team, such as immature tooling. For example, PigUnit, the xUnit testing framework for Pig, was only released in December 2010 and took a while to mature. For years after its release, it was still not integrated into standard Pig distributions or published as a Maven artifact. Given our strong TDD culture, we really craved mature tooling for testing. Other difficulties included the impedance mismatch with our codebase at the time, which was largely Java.
Guy Ernest is a Solutions Architect with AWS
This post builds on Guy’s earlier posts Building a Numeric Regression Model with Amazon Machine Learning and Building a Multi-Class ML Model with Amazon Machine Learning.
Many decisions in life are binary, answered either Yes or No. Many business problems also have binary answers. For example: “Is this transaction fraudulent?”, “Is this customer going to buy that product?”, or “Is this user going to churn?” In machine learning, this is called a binary classification problem. Many business decisions can be enhanced by accurately predicting the answer to a binary question. Amazon Machine Learning (Amazon ML) provides a simple and low-cost option to answer some of these questions at speed and scale.
Like the previous posts (Numeric Regression and Multiclass Classification), this post uses a publicly available example from Kaggle. This time, you will use the Click-Through Rate Prediction example, which is from the online advertising field. In this example, you will predict the likelihood that a specific user will click on a specific ad.
Preparing the data to build the machine learning model
You’ll be getting the data for building the model from the competition site, but to make it more realistic, you will use Amazon Redshift as an intermediary. In many cases, historical event data required to build a machine learning model is already stored in the data warehouse. Amazon ML integrates with Amazon Redshift to allow you to query relevant event data and perform aggregation, join, or manipulation operations to prepare the data to train the machine learning model. You will see some examples for these operations in this post.
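As a stdlib-only preview of the kind of aggregation you would push into Amazon Redshift (the column layout and rows here are invented, not the actual Kaggle schema), the sketch below computes a historical click-through rate per banner position, the sort of derived feature the model can learn from:

```python
from collections import defaultdict

# Hypothetical rows as they might come back from the events table:
# (hour_of_day, banner_position, clicked)
events = [
    (14, 0, 1), (14, 0, 0), (14, 1, 0),
    (15, 0, 1), (15, 1, 1), (15, 1, 0),
]

# The GROUP BY you would normally push into Amazon Redshift, in miniature:
# click-through rate per banner position.
clicks = defaultdict(int)
impressions = defaultdict(int)
for _, position, clicked in events:
    impressions[position] += 1
    clicks[position] += clicked

ctr = {pos: clicks[pos] / impressions[pos] for pos in impressions}
```

In the walkthrough itself this aggregation is expressed as SQL against the Amazon Redshift cluster, so the heavy lifting happens in the data warehouse rather than on the client.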
To follow along with this exercise, you need an AWS account, a Kaggle account (to download the data set), an Amazon Redshift cluster, and a SQL client. If you don’t already have an Amazon Redshift cluster, you can get a two-month free trial for a dw2.large single-node cluster, which you can use for this demo.
Guy Ernest is a Solutions Architect with AWS
This post builds on our earlier post Building a Numeric Regression Model with Amazon Machine Learning.
We often need to assign an object (product, article, or customer) to its class (product category, article topic or type, or customer segment). For example, which category of products is most interesting to this customer? Because of the massive scale of some businesses and the short lifespan of articles or customer visits, it’s essential to be able to assign an object to its class at scale and speed to ensure successful business transactions.
This blog post shows how to build a multiclass classification model that:
- Helps automate the process of predicting object assignment to one of more than two classes, at scale and speed
- Can be used in a simple and scalable way to accommodate classes and objects that constantly evolve
- Requires minimal help from machine learning experts
- Can be extended to many aspects of your business
In this post, you learn how to address multiclassification problems by using cartographic information to predict the type of forest cover that will occur on a land segment, from among six types. Similar multiclassification machine learning (ML) problems could include determining recommendations such as which product in an e-commerce store or on a video streaming service is most relevant for a visiting user.
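As a toy illustration of what a multiclass classifier does (this is a nearest-centroid rule in stdlib Python, not the algorithm Amazon ML uses), the sketch below assigns a land segment to a cover type based on two invented cartographic features:

```python
import math

# Hypothetical cartographic features: (elevation_m, slope_deg),
# with invented training points labeled by forest cover type.
training = {
    "spruce": [(3200.0, 10.0), (3100.0, 12.0)],
    "aspen":  [(2600.0, 18.0), (2500.0, 20.0)],
    "pine":   [(2000.0, 5.0),  (2100.0, 7.0)],
}

# Nearest-centroid classification: assign each object to the class
# whose mean feature vector is closest.
centroids = {
    label: tuple(sum(col) / len(col) for col in zip(*points))
    for label, points in training.items()
}

def classify(features):
    """Return the cover type whose centroid is nearest to the features."""
    return min(
        centroids,
        key=lambda label: math.dist(features, centroids[label]),
    )
```

Amazon ML handles the same one-of-N assignment with a multiclass model trained on the full feature set, so you never implement the classifier yourself.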
Guy Ernest is a Solutions Architect with AWS
We need to predict future values in our businesses. These predictions are important for better planning of resource allocation and making other business decisions. Often, we settle for a simplified heuristic of average past values plus an assumed rate of change, because more accurate alternatives are too complex or expensive. The new Amazon Machine Learning (Amazon ML) service changes this equation by providing a simple and inexpensive way of building and using models such as numeric regression.
This post uses the example of a bike share program where you need to know how many bikes are required at each hour of each day in a specific city. In this scenario, you need a machine learning model that predicts a number based on a set of features or predictors. You will build a regression model based on a data set that is publicly available in Kaggle, a large community site of data scientists who compete against each other to solve data science problems. By building the model, you will explore a few concepts around the successful application of machine learning to solve similar problems in your domain.
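To preview what a numeric regression model learns, the stdlib sketch below fits ordinary least squares to an invented, deliberately linear bike-rental history; the real Kaggle data is noisier and has many more predictors, which is exactly why Amazon ML is useful:

```python
# Hypothetical history: (hour_of_day, bikes_rented) pairs with a
# linear trend through the morning, invented for illustration.
history = [(6, 40), (7, 60), (8, 80), (9, 100), (10, 120)]

# Ordinary least squares for a single predictor: the closed form
# behind the simplest numeric regression model.
n = len(history)
mean_x = sum(x for x, _ in history) / n
mean_y = sum(y for _, y in history) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in history) / \
        sum((x - mean_x) ** 2 for x, _ in history)
intercept = mean_y - slope * mean_x

def predict_demand(hour):
    """Predicted number of bikes needed at a given hour."""
    return slope * hour + intercept
```

With many correlated predictors (weather, weekday, season), the same idea generalizes to the multivariate regression that Amazon ML trains for you.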
What is the difference between analytics and machine learning?
The bike share example demonstrates the limits of analytics systems when it comes to making accurate predictions. One of the Kaggle participants created the following web page to analyze the provided data. If you choose the Plots tab, you can see a visualization of the data that was created using R, a popular free analytics environment, and Shiny, a popular web application framework for R: View Bike Sharing Demand.
This is a guest post by Andrew Musselman, who as chief data scientist leads the global big data practice from the technical side at Accenture. He is a PMC member on the Apache Mahout project and is writing a book on data science for O’Reilly. Accenture is an APN Big Data Competency Partner.
This post introduces machine learning, provides context for the Apache Mahout project, and offers some specifics about recommender systems. Then, using Amazon EMR, we’ll tour the workflows for building a simple movie recommender and for writing and running a simple web service to provide results to client applications. Finally, we’ll list some ways to learn more and engage with the Mahout community.
Machine learning has its roots in artificial intelligence. The term implies that machine learning tools bring cognition and automated decision-making to data problems, but currently machine learning methods do not include computer thought. Even so, machine learning tools usually do employ some type of automated decision making, often iteratively working toward minimizing or maximizing a specific measurement about the performance of a model.
The field of machine learning encompasses many topics and approaches, usually falling into the categories of classification, clustering, and recommenders.