Applying Machine Learning to Text Mining with Amazon S3 and RapidMiner
Gopal Wunnava is a Senior Consultant with AWS Professional Services
By some estimates, 80% of an organization’s data is unstructured content. This content includes web pages, call center transcripts, surveys, feedback forms, legal documents, forums, social media, and blog articles. Therefore, organizations must analyze not just transactional information but also textual content to gain insight and boost performance. A powerful way to analyze this textual content is by using text mining.
Text mining typically applies machine learning techniques such as clustering, classification, association rules and predictive modeling. These techniques uncover meaning and relationships in the underlying content. Text mining is used in areas such as competitive intelligence, life sciences, voice of the customer, media and publishing, legal and tax, law enforcement, sentiment analysis and trend-spotting.
In this blog post, you’ll learn how to apply machine learning techniques to text mining. I’ll show you how to build a text mining application using RapidMiner, a popular open source tool for predictive analytics, and Amazon Simple Storage Service (Amazon S3), an easy-to-use storage service that lets organizations store and retrieve any amount of data from anywhere on the web.
Why use text mining?
Text mining techniques help reveal patterns and relationships in large volumes of textual content that are not visible to the naked eye, leading to new business opportunities and improvements in processes. Using text mining techniques can save you time and resources: the process can be automated and the results from a text mining model can be consistently derived and applied to solve specific problems.
These techniques help you:
- Extract key concepts, patterns and relationships from large volumes of textual content
- Spot trends in textual content on subjects such as travel and entertainment to understand consumer sentiment
- Summarize content from documents and gain semantic understanding of the underlying content
- Index and search text for use in predictive analytics
As you can see, if you don’t analyze textual content in addition to transactional content, you might miss huge opportunities.
Past barriers to text mining
In the past, it was often hard to extract valuable insight from large volumes of text. Doing so required complex programming and modeling tasks performed by highly skilled IT resources. In addition, the infrastructure simply couldn’t scale to handle the demands of processing large volumes of unstructured text with the speed and agility required to sustain performance and innovation cycles. Integrating tools with the underlying infrastructure was another challenge. This often resulted in data and tools being migrated from one environment to another. Moreover, business users found it hard to interpret the results. Structured data that was easy to mine and analyze became the primary source of most data analysis tasks. The result: vast pools of textual content went virtually untapped.
Recent advances in text analytics
Data and cloud infrastructure have advanced considerably, as have the tools and technology available for machine learning and text mining. With these advancements, speed, innovation and scalability are now realistic. There has also been a fundamental shift in how organizations use analytics: rather than react to past trends, they emphasize being proactive by predicting future trends based on current events. Thanks to cloud infrastructure services provided by AWS and tools such as RapidMiner that combine machine learning, text mining and visualization capabilities, organizations can analyze textual content quickly in a scalable and durable environment without the need for advanced programming skills.
Text mining workflow
Most text mining follows a typical workflow:
- Identify and retrieve documents for analysis. Apply structural, statistical and linguistic techniques (often in combination) to discern, tag and extract elements such as entities, concepts and relationships.
- Apply statistical pattern-matching and similarity techniques to classify documents and organize extracted features according to a specified grouping or category. The underlying unstructured content is transformed into structured data formats that are easier to analyze. The classification process helps discern meaning and relationships.
- Evaluate the model for performance.
- Present the findings to end users.
The flow chart below illustrates this workflow.
The role of machine learning in text mining
Text mining techniques typically establish a set of significant words and sentences based on a statistical analysis of factors such as term frequency and distribution. Words and sentences scoring highest in terms of significance typically indicate the underlying opinion, sentiment or the general subject matter.
As part of the process, modern tools typically construct a document term matrix (DTM) and use a weighting such as TF-IDF (term frequency – inverse document frequency). These tools extract and store underlying information such as standard features, keyword frequency, documents and text list features in the form of tables in a database. These tables can be queried for efficient analysis and processing. These steps are a precursor to applying machine learning techniques to the textual content.
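To make the document term matrix concrete, here is a minimal Python sketch that builds one with TF-IDF weights. It assumes one common TF-IDF variant (term frequency normalized by document length, natural-log inverse document frequency, no smoothing); the weighting that RapidMiner or any other tool applies may differ. The function name `build_dtm` and the three sample messages are made up for illustration.

```python
import math
from collections import Counter


def build_dtm(docs):
    """Return (vocabulary, matrix); matrix[i][j] is the TF-IDF weight
    of vocabulary term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = []
        for term in vocab:
            tf = counts[term] / len(tokens) if tokens else 0.0
            idf = math.log(n_docs / doc_freq[term])
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix


# Illustrative messages, not taken from any real dataset.
docs = ["win a free prize now", "are we still meeting now", "free prize waiting"]
vocab, dtm = build_dtm(docs)

for doc, row in zip(docs, dtm):
    weights = {t: round(w, 3) for t, w in zip(vocab, row) if w > 0}
    print(doc, "->", weights)
```

Terms that occur in only one document (such as "win") receive the highest weights, while a term that occurred in every document would be weighted zero.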
Text analytics typically applies machine learning techniques such as clustering, classification, association rules and predictive modeling to discern meaning and relationships in the underlying content. Several methods are used to process the text contained in unstructured data sources. These include Natural Language Processing (NLP), parsing, tokenization (identifying distinct elements such as words or n-grams), stemming (reducing word variants to a base form), term reduction (grouping like terms using synonyms and similarity measures) and part-of-speech (POS) tagging, all of which help discern facts and relationships.
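The sketch below illustrates three of these steps in plain Python: tokenization, n-gram generation and stemming. The stemmer is a deliberately crude suffix stripper written for this example, not the Porter algorithm or whatever stemmer a production tool uses, and all function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def crude_stem(token):
    """Strip one common English suffix, keeping at least three characters.
    This is a toy rule set for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


tokens = tokenize("Claiming prizes: you WON!")
stems = [crude_stem(t) for t in tokens]
bigrams = ngrams(tokens, 2)

print(tokens)   # ['claiming', 'prizes', 'you', 'won']
print(stems)    # ['claim', 'prize', 'you', 'won']
print(bigrams)  # ['claiming prizes', 'prizes you', 'you won']
```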
Another key aspect in text analytics involves organizing and structuring the underlying textual content. Typical techniques include clustering, categorization, classification and taxonomy. Some typical classification methods used by many tools include Naive Bayes, Support Vector Machine and K-nearest neighbor.
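As an illustration of one of these classifiers, here is a small multinomial Naive Bayes sketch in plain Python with add-one (Laplace) smoothing. It is a teaching sketch under simplifying assumptions (whitespace tokenization, words unseen in training are ignored) and is not how RapidMiner implements its operator. The class name and the four training messages are invented for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Start from the log prior, then add log likelihoods per word.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in doc.lower().split():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training messages for illustration.
train_docs = [
    "free prize call now",
    "win cash prize now",
    "are you coming home",
    "call me when you are home",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("free cash prize"))  # spam
print(model.predict("are you home"))     # ham
```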
The table below contains common text mining techniques, including machine learning, and the key considerations for each.
Once the text is processed, grouped and analyzed using the techniques above, it is important to evaluate the results. The goal of this evaluation is to determine whether you have found the most relevant material or if you have missed any important terms. You will use measures such as precision and recall to evaluate your results.
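Precision and recall can be computed directly from the actual and predicted labels. The following sketch treats "spam" as the positive class; the function name and the sample label lists are illustrative assumptions.

```python
def precision_recall(actual, predicted, positive="spam"):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    # Precision: of the messages flagged as positive, how many really were?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the truly positive messages, how many were flagged?
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for illustration.
actual = ["spam", "ham", "spam", "ham", "spam"]
predicted = ["spam", "spam", "ham", "ham", "spam"]

precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.67
```

Here two of the three messages flagged as spam really are spam (precision 2/3), and two of the three actual spam messages were caught (recall 2/3).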
Sentiment analysis using AWS and RapidMiner
Now let’s look at how you can use AWS and RapidMiner for sentiment analysis, a popular use case for text mining. In sentiment analysis, you identify positive and negative opinions, emotions, and evaluations, and often analyze textual content using machine learning techniques. Using AWS and RapidMiner, you can apply techniques like sentiment analysis on unstructured data stored in S3 directly, without pushing the data into another environment.
As shown below, you can use RapidMiner to create your text mining workflows integrated with S3. An object on S3 can be any kind of file or format such as a text file, a photo, or a video. This makes S3 useful for storing unstructured data required for text mining and advanced analytics.
Amazon S3 is integrated with other Amazon big data services such as Amazon Redshift, Amazon RDS, Amazon DynamoDB, Amazon Kinesis and Amazon EMR. This leads to interesting scenarios for developing text mining models on AWS using RapidMiner. For example, you can use S3 to store the data ingested from these Amazon services and then use RapidMiner to quickly build a text mining model on this data. You could store output results from your model into an S3 bucket and region of your choice and share these results with a broader end user community.
The examples below use the SMS Spam collection dataset hosted by the University of California, Irvine. The SMS spam collection is a public set of labeled messages that have been collected for mobile phone spam research. This dataset combines “spam” and non-spam messages marked as “ham”. The dataset is a tab-separated text file with one message per line, with UTF-8 encoding.
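If you want to inspect a file in this format outside RapidMiner, a short Python sketch such as the one below can parse it. It assumes that each line holds the label first and the message second, separated by a single tab; the sample text is invented rather than taken from the dataset, and `parse_sms_lines` is a hypothetical helper name.

```python
import csv
import io


def parse_sms_lines(text):
    """Parse tab-separated 'label<TAB>message' lines into (label, message) tuples."""
    # QUOTE_NONE keeps quotation marks inside messages as literal characters.
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        label, message = row
        rows.append((label, message))
    return rows


# Invented sample in the assumed format.
sample = 'ham\tSee you at 5 "sharp"\nspam\tFree entry, reply WIN now\n'
records = parse_sms_lines(sample)
print(records)
```

To read a real file, open it with `encoding="utf-8"` and pass its contents to the function.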
The following demos show you how to use RapidMiner and S3 for text mining. Note: the demos do not have sound.
To get started:
- Download and install the RapidMiner software, along with the RapidMiner Text Processing Extension available from the RapidMiner Marketplace. You can install RapidMiner either on your local machine or on an Amazon EC2 instance of your choice when you need more capacity than your current configuration provides.
- Configure your S3 connection on RapidMiner using your AWS credentials. To use S3, you need an AWS account.
- Upload the input dataset required for this text mining case study into your S3 bucket.
Importing and Reading Data from S3 into RapidMiner
The following video shows you how to create a text mining application with S3 and RapidMiner using data you uploaded into an S3 bucket. Remember that you must import the file with UTF-8 encoding and specify tab as the delimiter in order to process the file in the right format.
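For readers who want to see the same import step outside the RapidMiner interface, here is a rough Python equivalent. It is not what RapidMiner does internally. It assumes the third-party boto3 library is installed and that AWS credentials are configured in your environment; the function names and any bucket or key you pass are placeholders.

```python
def read_s3_text(s3_client, bucket, key, encoding="utf-8"):
    """Download one S3 object and decode it as text."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read().decode(encoding)


def split_records(text):
    """Split 'label<TAB>message' lines into (label, message) tuples."""
    records = []
    for line in text.split("\n"):
        label, sep, message = line.rstrip("\r").partition("\t")
        if sep:  # keep only lines that actually contain a tab
            records.append((label, message))
    return records


def load_sms_from_s3(bucket, key):
    """Fetch and parse the dataset. Requires boto3 and valid AWS credentials."""
    import boto3  # third-party; imported here so the helpers above work without it

    return split_records(read_s3_text(boto3.client("s3"), bucket, key))
```

You would call `load_sms_from_s3` with the name of your own bucket and the key under which you uploaded the dataset. Passing the client into `read_s3_text` as a parameter also makes the helper easy to test with a stand-in object.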
Working with RapidMiner’s Validation Operator
When you run a model on unseen data, you may see lower accuracy than expected. This is likely because the model was evaluated only on the data it learned from and was never tested on data it had not seen. To address this, you can work with the RapidMiner Validation operator as demonstrated in the video below.
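One widely used validation scheme is k-fold cross-validation, in which every example is held out for testing exactly once. The sketch below shows the idea in plain Python. Whether the demo configures the Validation operator in exactly this way is an assumption, and the majority-class "model" is only a stand-in baseline so that the example stays self-contained.

```python
from collections import Counter


def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx); each item lands in exactly one test fold."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, test_idx
        start += size


def cross_validate(items, labels, k, train_fn, predict_fn):
    """Return the mean accuracy over k folds."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(items), k):
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, items[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Stand-in baseline: always predict the most common training label.
def train_majority(train_items, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, item):
    return model


# Placeholder data; in practice, shuffle the examples before splitting.
items = [f"message {i}" for i in range(10)]
labels = ["ham"] * 8 + ["spam"] * 2

score = cross_validate(items, labels, 5, train_majority, predict_majority)
print(score)
```

Because the accuracy is measured only on held-out folds, it gives a more realistic estimate of how the model will behave on new messages than accuracy measured on the training data.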
Applying Store Operators in RapidMiner
To apply the learned model to new data, you must save the model as well as the word list into the RapidMiner repository. You must save the word list because, when you predict the probability of a new message being either “spam” or “ham”, you have to use the same attributes or words used in the original process. Therefore, you need the same word list and the same model, and you need to process the new data in exactly the same way as you processed the learning data. The following video demonstrates how this can be done.
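The reason for keeping the word list can be shown with a small plain-Python analogue. This is not the RapidMiner Store operator; it simply saves a word list to a JSON file in a temporary directory, reloads it, and uses it to turn a new message into a vector with the same attributes as the training data. All names and messages are illustrative.

```python
import json
import os
import tempfile


def build_word_list(messages):
    """Collect the distinct words seen in training, in a fixed order."""
    return sorted({word for m in messages for word in m.lower().split()})


def vectorize(message, word_list):
    """Count each word-list term in the message; other words are ignored."""
    tokens = message.lower().split()
    return [tokens.count(word) for word in word_list]


# Made-up training messages.
training = ["free prize now", "see you now"]
word_list = build_word_list(training)

# Save the word list alongside the model so later runs can reuse it.
path = os.path.join(tempfile.mkdtemp(), "word_list.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(word_list, f)

# Later: reload it and process new data exactly as the training data.
with open(path, encoding="utf-8") as f:
    restored = json.load(f)

new_vector = vectorize("free free gift now", restored)
print(restored)    # ['free', 'now', 'prize', 'see', 'you']
print(new_vector)  # [2, 1, 0, 0, 0]
```

The word "gift" never appeared in training, so it has no column and is dropped. A model trained on five attributes can only score vectors with those same five attributes in the same order.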
Applying Unseen Data to the RapidMiner Model
The following video demonstrates how to apply your model on new unseen data using Retrieve operators in order to predict whether a new message is ham or spam.
Saving Results using Write S3 Operator
The following video demonstrates how the Write S3 operator can be used in RapidMiner to save output results into an S3 bucket configured with a connection as outlined previously. You can download the output results file from your specified S3 bucket into your local machine and view the results using a text editor.
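Outside RapidMiner, a comparable upload can be sketched in Python. As before, this assumes the third-party boto3 library and configured AWS credentials, and it is not a description of how the Write S3 operator works internally. The output layout (predicted label, a tab, then the message) and the function names are choices made for this example.

```python
def format_predictions(rows):
    """Turn (message, predicted_label) pairs into UTF-8 tab-separated bytes."""
    lines = [f"{label}\t{message}" for message, label in rows]
    return ("\n".join(lines) + "\n").encode("utf-8")


def write_predictions_to_s3(s3_client, bucket, key, rows):
    """Upload the formatted predictions as a single S3 object."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=format_predictions(rows))


def save_results(bucket, key, rows):
    """Convenience wrapper. Requires boto3 and valid AWS credentials."""
    import boto3  # third-party; imported here so the helpers above work without it

    write_predictions_to_s3(boto3.client("s3"), bucket, key, rows)
```

You would call `save_results` with your own bucket name, an object key of your choice, and the list of predictions produced by your model.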
I’ve shown you how to easily create a text mining application using RapidMiner and Amazon S3. Typically, such tasks would require complex programming knowledge along with hardware and software resources that are often difficult to provision and manage. Integrating RapidMiner with S3 lets you quickly and easily create text mining models while allowing you to tap into the flexible, secure, durable, and highly scalable environment provided by S3.
If you have questions or suggestions, please leave a comment below.