Description

Text mining, also known as text analytics, is the process of analyzing and extracting valuable insights from unstructured text data. The objective of this study was to extract sentiment from Amazon reviews using data mining techniques. The study was conducted under the guidance of Dr. Bei Yu and utilized a combination of modeling and programming strategies.

The data used in the study was obtained from a Kaggle repository and required only light cleaning: removal of special characters, removal of stop words, and lowercasing of words after tokenization. Tokenization is the process of breaking text data down into smaller units such as words or phrases.
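
As an illustration, here is a minimal preprocessing sketch in Python using NLTK. The exact cleaning code is not part of this write-up, so the function below is only an assumption of how these steps fit together:

    import re
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Requires one-time downloads: nltk.download("punkt"), nltk.download("stopwords")
    STOP_WORDS = set(stopwords.words("english"))

    def preprocess(text):
        """Remove special characters, tokenize, lowercase, and drop stop words."""
        text = re.sub(r"[^A-Za-z\s]", " ", text)  # strip special characters
        tokens = word_tokenize(text.lower())      # tokenize and lowercase
        return [t for t in tokens if t not in STOP_WORDS]

    print(preprocess("This charger died after 2 days... terrible!"))
    # ['charger', 'died', 'days', 'terrible']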

The study used two data sets, one for training and one for testing. The training data was used to fit three models: multinomial naive Bayes, Bernoulli naive Bayes, and logistic regression. These are machine learning algorithms commonly used for text classification tasks such as sentiment analysis.
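
A minimal sketch of this setup with scikit-learn (the write-up does not name the library, so this is an assumption; train_texts, train_labels, test_texts, and test_labels are placeholder names for the two pre-split data sets):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB, BernoulliNB
    from sklearn.linear_model import LogisticRegression

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary on training data
    X_test = vectorizer.transform(test_texts)        # reuse that vocabulary on test data

    models = {
        "Multinomial NB": MultinomialNB(),
        "Bernoulli NB": BernoulliNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }
    for name, model in models.items():
        model.fit(X_train, train_labels)
        print(name, model.score(X_test, test_labels))  # test-set accuracy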

Sentiment analysis is the process of determining the emotional tone or polarity of a given text, usually expressed as positive, negative, or neutral. In this study, the goal was to determine the sentiment of Amazon product reviews in order to rank the best and worst-rated items by sentiment.

To measure the performance of the three models, a Receiver Operating Characteristic (ROC) curve was used. The ROC curve is a graphical representation of the performance of a binary classifier. Each of the three models was scored on its True Positive Rate (TPR) and False Positive Rate (FPR): the TPR is the proportion of actual positive cases that are correctly identified, while the FPR is the proportion of negative cases that are incorrectly classified as positive.
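
A sketch of how such a curve is typically computed with scikit-learn (y_true and y_score are placeholders for the true labels and the predicted probability of the positive class, e.g. model.predict_proba(X_test)[:, 1]):

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, auc

    fpr, tpr, _ = roc_curve(y_true, y_score)  # FPR/TPR at each decision threshold
    print("AUC:", auc(fpr, tpr))              # area under the ROC curve

    plt.plot(fpr, tpr, label="classifier")
    plt.plot([0, 1], [0, 1], "--", label="chance")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()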

Overall, the study highlights the usefulness of text mining and machine learning techniques in analyzing large volumes of unstructured data, such as online reviews, to extract valuable insights and make data-driven decisions.

Data Exploration

During initial exploratory data analysis, it was found that the training and test sets differed in size and that certain columns contained NaN values. To label sentiment, a new column was created and star ratings were used to classify reviews as positive or negative. More than 90% of the reviews in the training set were positive. The test set had slightly more reviews than the training set, but a similar distribution of positive and negative reviews. Despite this class imbalance, the models were trained to identify sentiment from the reviews in both data sets.
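
A sketch of the labeling step with pandas; the exact star-rating cutoff is not stated in this write-up, so the 4-and-up rule below is an assumption based on common practice, and the file path and column name are placeholders:

    import pandas as pd

    df = pd.read_csv("train.csv")  # placeholder path
    df["sentiment"] = (df["star_rating"] >= 4).map({True: "positive", False: "negative"})
    print(df["sentiment"].value_counts(normalize=True))  # class balance check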

Before conducting the analysis, a feature extractor was built using NLTK's Naïve Bayes classifier to identify the most informative features in the dataset. The model returned 65.2% accuracy, and the top five features all signaled negative sentiment, despite the overwhelmingly positive reviews in the dataset: ‘worst’, ‘dies’, ‘ruined’, ‘acid’, and ‘terrible’.
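
A minimal sketch of this step with NLTK (labeled_tokens is a placeholder for a list of (token_list, label) pairs built from the training reviews):

    import nltk

    def to_features(tokens):
        # Presence features in the form expected by nltk.NaiveBayesClassifier
        return {word: True for word in tokens}

    train_set = [(to_features(tokens), label) for tokens, label in labeled_tokens]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    classifier.show_most_informative_features(5)  # e.g. 'worst', 'terrible', ...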

Model Generation

Multinomial Naïve Bayes

The Multinomial Naïve Bayes model was used to analyze the sentiment of the data. Tokens were obtained and lowercased to add structure, and NLTK stop words were applied to filter out irrelevant words. Five iterations of the model were run; the most accurate was the first, which analyzed tokens appearing five times or more with stop words applied and an n-gram range of (1, 1). This iteration returned an accuracy of 92.1%.
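
A sketch of that best-scoring configuration with scikit-learn (variable names are placeholders, and the use of CountVectorizer is an assumption):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from nltk.corpus import stopwords

    # Unigrams, tokens appearing 5+ times, NLTK stop words removed
    vec = CountVectorizer(lowercase=True, min_df=5, ngram_range=(1, 1),
                          stop_words=stopwords.words("english"))
    X_train = vec.fit_transform(train_texts)
    mnb = MultinomialNB().fit(X_train, train_labels)
    print(mnb.score(vec.transform(test_texts), test_labels))  # ~0.921 reported above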

Bernoulli Naïve Bayes

The Bernoulli Naïve Bayes model was tested using the same five iterations as the Multinomial Naïve Bayes, with tokenization, lowercasing, and NLTK stop words applied. The Bernoulli NB was less accurate than the MNB: its highest accuracy was 90.6%, compared to the MNB’s 92.1%. The Bernoulli NB scored highest on bigrams (n-gram range of (2, 2)). Overall, the results suggest that the MNB is the more accurate of the two.
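
The bigram configuration that scored highest for the Bernoulli NB might look like the following sketch (binary=True reflects that the Bernoulli model works on word presence/absence rather than counts; min_df=5 is carried over from the shared iterations as an assumption):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import BernoulliNB
    from nltk.corpus import stopwords

    vec = CountVectorizer(lowercase=True, min_df=5, ngram_range=(2, 2), binary=True,
                          stop_words=stopwords.words("english"))
    X_train = vec.fit_transform(train_texts)
    bnb = BernoulliNB().fit(X_train, train_labels)
    print(bnb.score(vec.transform(test_texts), test_labels))  # ~0.906 reported above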

Logistic Regression

The final analysis used Logistic Regression, again tokenizing and lowercasing the data and applying NLTK stop words. The accuracy results for the six tests conducted were 93.8%, 95.3%, 95.8%, 96.5%, 90.2%, and 95.9%. Logistic Regression was the most accurate of the three models, with test four the most accurate test at 96.5%. This is interesting because test four, like the other tests, uses tokens that appear five or more times, but adds trigrams and scored noticeably higher.
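
A sketch of test four under those settings; ngram_range=(1, 3) is an assumption, since the write-up does not say whether trigrams were used alone or alongside shorter n-grams:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from nltk.corpus import stopwords

    vec = CountVectorizer(lowercase=True, min_df=5, ngram_range=(1, 3),
                          stop_words=stopwords.words("english"))
    X_train = vec.fit_transform(train_texts)
    lr = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    print(lr.score(vec.transform(test_texts), test_labels))  # ~0.965 reported above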

Interpreting Results

The analysis included a classifier comparison using the ROC curve, which showed Logistic Regression to be the most accurate model, followed by the MNB and Bernoulli NB. Further analysis examined each model’s performance in predicting positive and negative reviews using F1 scores and weighted averages: the MNB was best at identifying positive reviews, while Logistic Regression was best at identifying negative reviews. The top five most-reviewed products were all Amazon-branded technology products. Word clouds were created for the top five products using positive and negative words drawn from the reviews.
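
A sketch of the per-class evaluation and word-cloud steps (lr, X_test, test_labels, and product_text are placeholders; the wordcloud package is an assumption, as the write-up does not name the tool used):

    from sklearn.metrics import classification_report
    from wordcloud import WordCloud

    # Per-class precision, recall, F1, and weighted averages
    print(classification_report(test_labels, lr.predict(X_test),
                                target_names=["negative", "positive"]))

    # One word cloud per top product, from its concatenated review text
    cloud = WordCloud(background_color="white").generate(product_text)
    cloud.to_file("product_wordcloud.png")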


Reflection and Learning Goals

The exercise involved collecting and organizing data from external sources, such as online reviews and other text-based content. Through this process, I was able to identify patterns within the data, grouping similar texts into clusters based on their content or characteristics.

By analyzing these clusters, I was able to gain insights into the behavior of the reviewers or authors of the text. This analysis helped me understand their motivations, preferences, or biases, and how these factors shape the language and content of their writing.

During the course, I also learned about advanced text-mining algorithms that can be used to extract information from large volumes of text. These algorithms make it possible to identify key phrases or concepts within text, classify documents based on their content, and group similar texts together in clusters.

Opinion mining was another important aspect of the course. This technique analyzes text to identify sentiment, emotion, or opinion. By understanding the opinions and attitudes expressed in text, I can gain valuable insights into the attitudes and preferences of a target audience.

Overall, the course helped me develop the skills and knowledge needed to apply advanced text-mining techniques to real-world problems. Whether analyzing customer feedback, monitoring social media sentiment, or conducting research in a specific field, these techniques help extract valuable insights from large volumes of text data.

Text mining, also known as text analytics, is the process of extracting meaningful information and knowledge from unstructured text data. It involves several basic concepts and methods that are widely used in the field of natural language processing (NLP). Some of these concepts and methods are as follows:

  1. Document Representation: In text mining, a document is typically represented as a bag of words, where each word in the document is treated as a separate entity. This representation allows for the analysis of the frequency and distribution of words in the document (see the sketch after this list).
  2. Information Extraction: Information extraction involves identifying and extracting specific pieces of information from a text document, such as names, dates, and locations. This process often involves the use of regular expressions and other techniques to extract structured data from unstructured text.
  3. Text Classification and Clustering: Text classification involves grouping similar documents into predefined categories based on their content. This process is often used in applications such as sentiment analysis, where documents are classified as positive, negative, or neutral based on the language used. Text clustering, on the other hand, involves grouping documents into clusters based on their similarity.
  4. Topic Modeling: Topic modeling is a method for identifying the underlying topics or themes in a collection of documents. This technique involves using statistical models to identify patterns in the data and group similar documents together based on their content.
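
As an example of concept 1, a minimal bag-of-words sketch with scikit-learn:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the battery died fast", "great battery, great price"]
    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    print(vec.get_feature_names_out())  # the learned vocabulary
    print(X.toarray())                  # per-document word counts; word order is discarded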

To explore interesting patterns in text data, various benchmark corpora and text analysis and visualization tools are available, both commercial and open-source. Examples include the Natural Language Toolkit (NLTK), RapidMiner, and Tableau.

Advanced text mining algorithms, such as deep learning and neural networks, can be used for information extraction, text classification and clustering, opinion mining, and other applications. These algorithms allow for more accurate and nuanced analysis of text data, but also require significant computational resources and expertise to implement.

The full project can be found here