AWS Cloud Enterprise Strategy Blog

Machine Learning: Avoiding Garbage in, Garbage Out

From speaking with enterprise customers of all shapes and sizes, it is abundantly clear that we all share the same challenges: we’re racing to digitize our businesses, turn stalled progress into frequent innovation, and leverage data and machine learning. However, I feel there’s a gap in understanding how critical data is to the success and effectiveness of machine learning. There seems to be a focus on using data to create data products or surface actionable business insights. The reality, though, is that machine learning doesn’t exist without data. And, if your sample size is too small or the quality of your data misrepresents reality, your machine learning models will be useless. In this blog post, I’ll explore training models and go over some examples of how data and data quality can impact machine learning models.

What Is Machine Learning?

To better understand machine learning, let’s think about human learning. It takes years for the human brain to mature. Much of what we learn is taught and retained through the long processes of repetition, pattern recognition, and feedback. Knowledge builds slowly upon the foundational lessons from elementary education. It’s been shown that applying what we’ve learned and using knowledge in action cements the concepts and strengthens our decision-making skills. With a diversity of knowledge, experience, and feedback, we learn to make different and better decisions over time.

Machine learning is very similar to human learning, except the cognitive-like function of a machine is less sophisticated than the human brain’s. For example, humans are at times forced to make gut decisions, or decisions where we know we have only a partial understanding and appreciation of the problem. We fill in missing data with information from seemingly similar past experiences, we make assumptions, and we make guesses based on our tolerance for risk. Machines, on the other hand, don’t make gut decisions; all of their decisions are based on the training performed and the data provided. That’s why you have to be intentional about “teaching machines” (training models) and make sure you provide the right data for learning. Otherwise, like humans without proper training, the model will be ill-prepared to make the right decisions.

Machine Learning in Action

There are several different types of machine learning, but we’ll focus on a popular type: supervised learning. With this type of machine learning, we train models to accept an input and respond with an output. For example, we can give a picture as input and build a model that will calculate the probability of the picture containing a certain object. Another example would be using an audio recording as an input and training the model to map audio fragments to words to produce a text transcript. We’ll use object recognition in images as our example because it’s a bit easier to illustrate.

Machine learning models must be trained with enough data that they can accurately predict the probability that, for example, an input picture contains an apple. Consider the pictures of apples below. It’s easy for an adult to tell that each picture shows an apple, but when it comes to machine learning, what if you trained your model only with pictures of red apples? Or what if all the apples used in training were the same shape? How would the model be able to tell that the last two pictures were, in fact, apples?

Four apples demonstrating variation in color and size
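To make the red-apple problem concrete, here’s a deliberately tiny sketch, not a real vision model: a nearest-centroid classifier trained on a single hue-like feature, where all numbers are invented for illustration. Because the training set contains only red apples, a green apple ends up closer to the “banana” centroid and is misclassified.

```python
def centroid(values):
    return sum(values) / len(values)

# Made-up hue values: 0.0 = red, 0.15 = yellow, 0.3 = green.
red_apples = [0.00, 0.02, 0.03]   # only RED apples in the training set
bananas    = [0.14, 0.15, 0.16]

apple_c, banana_c = centroid(red_apples), centroid(bananas)

def predict(hue):
    # Classify by whichever training centroid is nearer.
    return "apple" if abs(hue - apple_c) < abs(hue - banana_c) else "banana"

print(predict(0.01))  # a red apple -> "apple"
print(predict(0.30))  # a GREEN apple is nearer the banana centroid -> "banana"
```

The model isn’t “wrong” about green apples so much as it was never shown one; the training data, not the algorithm, is the gap.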

Now, consider trying to differentiate between blueberry muffins and chihuahuas. This playful example shows how difficult it can be to detect the differences between two similar images. Obviously, we’d need to train with enough examples of each, and we would have to label each image as “blueberry muffin” or “chihuahua” so the model can learn the difference.

Muffins and chihuahuas

In each of these examples, there are several ways we could inadvertently train the model to produce inaccurate results. The data could be too uniform and lack variety. The set of data might be incomplete or contain duplicate data. Or the data could be mislabeled (a particularly tired human might mistake some chihuahuas for muffins or vice versa).
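Two of these data problems, duplicates and lack of variety, are easy to check for mechanically. Here’s a minimal sketch using only the Python standard library on a made-up labeled dataset (the filenames and labels are hypothetical); mislabeling is harder to catch automatically and usually needs human audit.

```python
from collections import Counter

dataset = [
    ("img_001.jpg", "chihuahua"),
    ("img_002.jpg", "blueberry muffin"),
    ("img_002.jpg", "blueberry muffin"),   # duplicate record
    ("img_003.jpg", "chihuahua"),
    ("img_004.jpg", "chihuahua"),
]

# 1. Duplicates: the same example counted twice skews training.
ids = [image for image, _ in dataset]
duplicates = [i for i, n in Counter(ids).items() if n > 1]

# 2. Uniformity: a heavily skewed label distribution lacks variety.
label_counts = Counter(label for _, label in dataset)

print(duplicates)     # ['img_002.jpg']
print(label_counts)   # chihuahuas outnumber muffins 3 to 2
```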

Using Feedback in Machine Learning

Another important part of training models is feedback. In addition to correctly identifying examples of blueberry muffins, the model needs to know when the prediction was wrong. One of the ways we increase prediction probability is by auditing the results of the model over and over again. This is one of the reasons why training isn’t a one-and-done activity. We must continually train our models with the latest examples and feedback.
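One way to picture this audit loop is the sketch below, with a toy stand-in for a real trainer (the “model” is just a majority-label rule, and every example is invented): predictions a human reviewer marks as wrong are folded back into the training set with corrected labels, and the model is retrained.

```python
def train(examples):
    """Toy 'trainer': always predict the majority label.
    A real trainer would fit model parameters to the features."""
    labels = [label for _, label in examples]
    majority = max(set(labels), key=labels.count)
    return lambda features: majority

training_set = [("fluffy", "chihuahua"), ("round", "muffin"), ("round", "muffin")]
model = train(training_set)           # majority label is "muffin"

# Audit: a human reviews predictions and supplies corrected examples.
corrections = [("fluffy", "chihuahua"), ("tiny", "chihuahua")]

# Fold the feedback back in and retrain -- training is never one-and-done.
training_set.extend(corrections)
model = train(training_set)           # majority label is now "chihuahua"
```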

As you can imagine, data labeling and model feedback can be laborious. Thankfully, there is a growing number of machine learning models available in the Amazon Web Services Marketplace, and the AWS Data Exchange is making data sets more accessible. There are also services like Amazon Mechanical Turk that allow you to crowdsource the completion of tasks, like data labeling. However, what I find most interesting about data labeling for machine learning is that some companies have found innovative ways to get humans to label their data for free. One example is Google reCAPTCHA. You’ve likely interacted with reCAPTCHA when logging in to a website. While this program provides the service of protecting a website from bots, its secondary purpose is to use humans to label data for machine learning. Ever thought about why you would be asked to select all the squares of the image that contain a traffic light or crosswalk? It’s to improve maps.

Gathering Data for Machine Learning

When building data lakes for our businesses, we need to think about machine learning from the beginning, in addition to data products and business insights. While the master data management discipline teaches us the importance of uniformity and accuracy, machine learning teaches us the importance of raw data. Consider the game of telephone, which demonstrates how data loss and manipulation can distort our perception of the truth. The more we collect data and repeatedly alter it to suit different uses, the greater the risk of losing valuable and meaningful data. Since we want to make the most of our data and be able to discover things we might not have known otherwise, we must create the data lake with these uses in mind. We should gather, store, and manipulate data to serve those uses. For example, at Cox Automotive, we tiered our data from raw to refined and certified to enable teams to experiment with data quickly, to build data products, to deliver business insights, and to do machine learning. The data lake served as the data foundation for everything we looked to deliver. A holistic data strategy will ensure your data lake provides the strong foundation your business requires.
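A simplified sketch of that raw-to-refined-to-certified tiering idea follows. The records, field names, and quality rules are invented for illustration, not Cox Automotive’s actual pipeline; the point is that each tier is derived from the one before it while the raw tier is preserved untouched.

```python
# Raw tier: data exactly as collected, messy but complete.
raw = [
    {"vin": " 1hgcm82633a004352 ", "price": "4,500"},
    {"vin": "2HGFA16508H000001",   "price": "7000"},
]

def refine(record):
    # Normalize a copy; the raw record itself is never modified.
    return {"vin": record["vin"].strip().upper(),
            "price": int(record["price"].replace(",", ""))}

# Refined tier: cleaned and normalized for general use.
refined = [refine(r) for r in raw]

# Certified tier: only records that pass explicit quality rules.
certified = [r for r in refined if len(r["vin"]) == 17 and r["price"] > 0]

# Because the raw tier stays untouched, future, unforeseen uses can
# reprocess it -- avoiding the game-of-telephone data loss described above.
```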

Without getting too deep into data lakes here, it’s worth noting that building such a data foundation can be challenging both politically and technically. Internally, it can be hard to convince others to share their data when there isn’t a direct benefit to the data owner. And the challenge doesn’t stop there: once you get past that hurdle, you must clean your data, ensure it’s at the right level of granularity, remove duplication, and create complete and accurate data sets or samplings. These steps can be quite difficult, so much so that AWS offers services like AWS Lake Formation that will clean and classify your data using machine learning algorithms (among other things). I hope your head didn’t explode—yes, you can use machine learning to prepare your data for machine learning! Given these challenges, I’d recommend starting small and with specific objectives: pick small vertical slices (end-to-end use cases) to flesh out patterns, learn quickly, and deliver value.

Current Limitations on Machine Learning

Here’s a real-world example from the automotive space that illustrates how attempting to solve a seemingly easy problem with machine learning becomes very difficult without the right data.

When buying and selling cars, vehicle condition is very important; it can meaningfully change the value of the car and affect the pool of potential buyers. Vehicle condition is also important to insurance companies when assessing vehicle damage following an accident. In theory, you can use machine learning to increase the accuracy of vehicle condition assessment and repair (or reconditioning) estimation. You might think that you can build a model that works for both problems: take pictures of the vehicle and use a model to determine the damage in each picture, then map the determined damage to prior repair data to generate an estimate. Problem solved, right?

Nope. In insurance, we’re typically dealing with accidents, which means large or obvious portions of the car are damaged. The overall repair is more holistic, like replacing a headlight, bumper, or front-quarter panel, and there is more tolerance in the estimation range. In the case of buying and selling cars, however, we’re often dealing with scratches, dents, or mechanical repairs. The level of image detail required in this scenario is much different. What we have found is that even some of the best cameras available can’t reliably take photos with enough detail for the human eye to see a particularly small dent or scratch in the picture, even when we know it’s there. While not exactly true, I would argue that if you can’t see it, the model can’t, either. So, what seemed like an easy problem to solve (because insurance solutions are using machine learning in a similar way right now) is actually unsolvable, because the data one might have or can create doesn’t allow a model to accurately and consistently predict a result…for now.

Some Final Thoughts on Machine Learning

As you can see, having the right data is very important for machine learning. Of all the challenges of increasing business agility and innovating at a faster rate, starting is the most difficult. But what I’ve learned is that success comes from frequent experimentation while minimizing risk. Start small and with specific objectives: pick small vertical slices (end-to-end use cases) to flesh out patterns, learn quickly, deliver value, and progress from there. This is a journey; you don’t have to do it all at once. I’d encourage you to continue exploring the data topic by reading about how to create a data-driven culture to help you overcome the hurdle of data collection and how to build data capabilities from my colleague Ishit Vachhrajani.

Bryan Landerman
Enterprise Strategist & Evangelist @ AWS