
Sentiment Analysis (ML)


🗂️Link: GitHub Repository

📋Link: Project Report

1. Introduction

This report communicates the processes relevant to this assignment and elaborates on the insights from the analysis. The assignment uses the IMDB dataset of single-sentence reviews collected from three domains: imdb.com, amazon.com, and yelp.com. The dataset underwent processing stages including data preparation, data exploration, feature engineering, hyperparameter fine-tuning, data modelling, and evaluation. The goal of the sentiment model is to classify whether a review is positive or not. The report discusses insights from the data analysis; the results of the classification model are addressed and explained in a later section.

2. Bag-of-Words Design Decision Description

Bag-of-Words (BoW) is a feature-extraction technique for text data. The BoW principle is to represent each word's occurrences across a document as a number and to convert these numbers into a fixed-length vector. With this technique, the vector can then be processed by a machine learning algorithm.
An N-gram is a sequence of N consecutive tokens. A 2-gram (bigram) is a two-word sequence such as “so good”, “really sad”, or “like it”; a 3-gram (trigram) is a three-word sequence such as “I love it” or “this is perfect”. Using N-grams in BoW can produce more meaningful tokens and improve model performance [1].
To perform the Bag-of-Words technique, all steps and the methodology are described as follows.
Methodology
Data Cleaning and Standardization: The initial step in this project involved importing the entire dataset, including both the training and testing subsets. The data was then cleansed by detecting and deleting duplicated rows and by replacing null values with empty strings. This was followed by data standardization, which included removing punctuation and converting all text to lowercase to ensure uniformity.
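The cleaning steps above can be sketched as follows; the DataFrame layout and the column name 'text' are assumptions for illustration, not taken from the project code.

```python
import string
import pandas as pd

def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()            # detect and delete duplicated rows
    df = df.fillna({"text": ""})         # replace null values with empty strings
    # standardize: strip punctuation and lowercase for uniformity
    table = str.maketrans("", "", string.punctuation)
    df["text"] = df["text"].str.translate(table).str.lower()
    return df

sample = pd.DataFrame({"text": ["Great movie!", "Great movie!", None]})
cleaned = clean_reviews(sample)
```

The duplicate row is dropped, the null becomes an empty string, and the remaining text is lowercased with punctuation removed.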
Vocabulary Construction: The construction of the final vocabulary set commenced with an exploratory visual analysis of the training data. This process used tokenization to break the text into individual words and count their frequencies within each type of review. The analysis revealed several insights:
Positive and negative reviews have similar length distributions, indicating that review length may not be a distinguishing feature for sentiment analysis.
The most common unigrams in both positive and negative reviews share similarities, including words such as "the", "and", "it", "is", “this”, "to", "of", "was", and "in". Notably, the frequency of the top ten common words ranges between 200 to 800 occurrences.
Bigrams, which are pairs of consecutive words, also show similarities in both positive and negative reviews, particularly in terms of their meaning, with phrases like "of the", "and the", “this is”, "in the", "it is", and "it was" appearing frequently. The top five common bigrams occur approximately 40 to 80 times.
The trigrams, which are sequences of three consecutive words, differ significantly between positive and negative reviews. However, the trigram "one of the" is a mutual common element across both types, with a frequency of 10 to 20 mentions. Other trigrams appear less frequently, suggesting that they may not be as effective for a Bag-of-Words (BoW) strategy due to their low occurrence rate.
These findings have informed the strategy for setting the final vocabulary. The ngram_range was set to (1, 2) to include both unigrams and bigrams, thereby capturing more context than unigrams alone. 'English' stop words were excluded to remove commonly occurring, non-informative words. The min_df parameter was set to 0.002 to ignore words that appear in less than 0.2% of the documents, while max_df was set to 0.5 to exclude words present in more than 50% of the documents, as such words are likely too common to be useful for differentiation. The min_df setting also helps handle out-of-vocabulary (OOV) words in the test set by filtering out very rare terms, which are more likely to be OOV in new data.
Feature Representation: The options for feature representation can include counts (term frequencies or normalised counts) or binary values (indicating whether a word occurs or not). In this assignment, the strategy of counting the frequency of words is chosen over binary values to achieve higher model performance. There are two main approaches to numerical text representation: Count Vectorizer and Term Frequency-Inverse Document Frequency (TF-IDF Vectorizer).
Count Vectorizer: This method is used to understand the type of text by analysing the frequency of words within a document. However, it does not identify the importance of words nor the relationships between them [2].
TF-IDF Vectorizer: This method rescales the frequency of words based on their occurrence across all documents (Term frequency = number of times the word appears in the document, and IDF = log [total number of documents / number of documents containing the word]) [1, 3, 4].

These two methods were initially applied to the dataset with the same Bag-of-Words (BoW) configuration to explore their outputs. Both approaches produced the same vocabulary (a feature matrix of 2388 rows × 551 columns). The TF-IDF Vectorizer was then chosen for the final pipeline because rescaling word frequencies by their occurrence across all documents tends to yield higher model performance.
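As a quick check of the claim that both vectorizers share a vocabulary, the sketch below fits CountVectorizer and TfidfVectorizer with identical settings on toy documents; only the cell values of the resulting matrices differ.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["good good movie", "bad plot", "good acting bad pacing"]
shared = dict(ngram_range=(1, 2), stop_words="english")

counts = CountVectorizer(**shared).fit(docs)
tfidf = TfidfVectorizer(**shared).fit(docs)

# Identical settings yield identical vocabularies (TfidfVectorizer builds
# its vocabulary exactly like CountVectorizer, then rescales the weights).
same_vocab = counts.vocabulary_ == tfidf.vocabulary_
```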

3. Cross Validation and Hyperparameter Selection Design Description

Cross-validation is a technique used to evaluate the performance of a predictive machine learning model. It estimates how well a model generalises to an unseen dataset by splitting the data into two parts: training and testing data. Hyperparameter tuning is an approach to finding the most appropriate and effective model parameters to optimise model performance. Combining these two methods can enhance model configuration and training strategies, achieving a robust machine learning model. This section describes several techniques involved in this sentiment analysis assignment as follows:
Performance Metrics: To measure machine learning performance, several metrics must be considered, including accuracy, precision, recall, and F1-score. In this assignment, accuracy (the ratio of correct predictions to the total number of predictions) will be the primary measurement for this binary classification model due to its straightforward nature, simplicity in computation, and ease of interpretation. Although accuracy can be misleading in cases of class imbalance, this assignment's dataset shows no indication of such issues, as the counts of positive and negative reviews are the same. Additionally, other metrics such as precision, recall, and F1-score are computed and displayed for a comprehensive evaluation of model performance [5].
Classification model: the multinomial Naive Bayes classifier was chosen because it is well-suited for classifications with discrete features, such as those found in text sentiment analysis tasks and text classifications [6].
Hyperparameter Search Strategy: Grid Search is the strategy used for hyperparameter tuning, as it provides a thorough and interpretable exploration of hyperparameter values [7]. The main hyperparameters varied in this step include:
alpha: the additive (Laplace/Lidstone) smoothing parameter, varied from 1e-10 to 10.0 over the values [1e-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0].
fit_prior: whether to learn class prior probabilities; if False, a uniform prior is used. This hyperparameter was tested with the values True and False.
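The search over these two hyperparameters can be sketched with scikit-learn's GridSearchCV; the feature matrix here is random toy data standing in for the real TF-IDF features.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

param_grid = {
    "alpha": [1e-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0],
    "fit_prior": [True, False],
}

# Toy non-negative features standing in for the real TF-IDF matrix.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 30))
y = rng.integers(0, 2, size=200)

search = GridSearchCV(MultinomialNB(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)  # best_params_ / best_score_ hold the winning combination
```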
Data Splitting: The training dataset was divided into folds for cross-validation, with the number of folds (k) set to 3, 5, and 10. The fold sizes are as follows:
Number of Folds = 3: Train size: 1592, Validate size: 796
Number of Folds = 5;
3 folds of Train size: 1910, Validate size: 478
2 folds of Train size: 1911, Validate size: 477
Number of Folds = 10;
8 folds of Train size: 2149, Validate size: 239
2 folds of Train size: 2150, Validate size: 238
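The fold sizes above follow directly from splitting the 2388-row training set (the row count reported in Section 2); a sketch using scikit-learn's KFold:

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 2388  # training rows after cleaning, per the report
fold_sizes = {}
for k in (3, 5, 10):
    splits = KFold(n_splits=k).split(np.arange(n_samples))
    # distinct validation-fold sizes for this k
    fold_sizes[k] = sorted({len(val) for _, val in splits})
```

With k = 3 the 2388 rows divide evenly (796 per fold); with k = 5 and k = 10 the remainder is spread across the first folds, giving two slightly different validation sizes.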
Final Model Training: The cross-validation results were used to select hyperparameters and train the final model on the entire training dataset. The results for the different numbers of folds are shown in Table 1; they indicate that the number of folds affects model performance.
According to Table 1, five folds gave the highest score, 0.78182, with the best hyperparameters {'alpha': 0.5, 'fit_prior': False}. These parameters were applied in the final model training.
Table 1 Results from different K-Fold numbers
Number of Folds = 3: Best score = 0.77806; Accuracy: 0.77806, Precision: 0.7818, Recall: 0.7699, F1-score: 0.7758
Number of Folds = 5: Best score = 0.78182; Accuracy: 0.78183, Precision: 0.7898, Recall: 0.7666, F1-score: 0.7780
Number of Folds = 10: Best score = 0.78140; Accuracy: 0.78141, Precision: 0.7876, Recall: 0.7691, F1-score: 0.7782

4. ​Analysis of Predictions for the Classifier

To assess the performance of the classifier, several measures for the best estimator are described as follows.
Classifier’s performance
Accuracy (0.78182): during model training, the model performed well, with an overall accuracy of 0.78182, indicating it predicts correctly most of the time.
Precision (0.7898): when the model predicts a positive outcome, it is correct about 79% of the time.
Recall (0.7666): the model has reasonably good sensitivity and detects the majority of positive cases.
F1-score (0.7780): the model has a balanced performance between precision and recall.
The predicted values were merged with the training dataset to support the analyses below.
Performance on short sentences and long sentences:
To answer whether the model performs better on longer or shorter sentences, the sentiment texts were classified by comparison with the average word length. A text with more than 11.85 words is labelled a long sentence; otherwise, it is considered a short sentence. Table 2 reveals that the model performs better on short sentences overall and identifies the positive class more easily than it does for long sentences.
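A minimal sketch of this length-based split, assuming a DataFrame with hypothetical 'text', 'label', and 'pred' columns (the real column names and data are not given in the report):

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["loved it",
             "terrible boring plot and a very weak disappointing ending overall sadly",
             "great fun",
             "bad"],
    "label": [1, 0, 1, 0],  # true sentiment
    "pred":  [1, 0, 0, 0],  # model prediction
})
df["n_words"] = df["text"].str.split().str.len()
df["is_long"] = df["n_words"] > df["n_words"].mean()  # above-average length = long

# accuracy within each length group
accuracy_by_length = (df["label"] == df["pred"]).groupby(df["is_long"]).mean()
```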
Table 2 Performance on short and long sentences (train set)
Short sentence: [metrics figure]
Long sentence: [metrics figure]
Performance on each website: According to Table 3, the model exhibits similar performance for each website.
Performance without negation words: To answer this, negation words within the dataset were identified. Texts containing any of the following words were excluded: 'not', 'didn't', 'shouldn't', 'doesn't', 'wasn't', 'don't', 'haven't', 'isn't', 'can't', 'won't'. The analysis suggests that excluding reviews with negation words slightly improves the model's performance compared with the entire training set.
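The negation filter can be sketched with a word-boundary regex over the listed words; the column name and example rows are illustrative:

```python
import pandas as pd

negations = ["not", "didn't", "shouldn't", "doesn't", "wasn't",
             "don't", "haven't", "isn't", "can't", "won't"]
# match any negation word as a whole token
pattern = r"\b(?:" + "|".join(negations) + r")\b"

df = pd.DataFrame({"text": ["did not like it",
                            "loved every minute",
                            "can't recommend"]})
# keep only rows that contain none of the negation words
without_negation = df[~df["text"].str.contains(pattern, case=False, regex=True)]
```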
Table 3 Overall accuracy based on website and negation words (train set)
Entire train set: [accuracy figures]
Train set without negation words: [accuracy figures]

5. ​Performance on Test Set

Performance Metrics: To evaluate the classifier's final performance, the Confusion Matrix, Accuracy, Precision, Recall, and F1-score were calculated on the test dataset and are presented in Figure 1.
Accuracy (0.7609): The model achieved an overall accuracy of 0.7609 on the test set, indicating its capability to make correct predictions.
Precision (0.7826): The model is correct about 78.26% of the time when predicting a positive outcome.
Recall (0.7224): The model demonstrates reasonably good sensitivity, detecting the majority of positive cases.
F1-score (0.7513): The model exhibits balanced performance between precision and recall.
Overall, the model retains a good ability to predict unseen data on the test set. Compared with the held-out data from cross-validation during the training phase, the model shows a marginally reduced performance when applied to new, unseen data.
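The metrics above can be reproduced with scikit-learn's metric functions; the labels and predictions below are toy stand-ins, not the actual test data.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_test = [1, 0, 1, 1, 0, 0, 1, 0]  # stand-in true labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # stand-in model predictions

cm = confusion_matrix(y_test, y_pred)  # rows = true class, cols = predicted
metrics = {
    "accuracy":  accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall":    recall_score(y_test, y_pred),
    "f1":        f1_score(y_test, y_pred),
}
```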
Figure 1 Model performance on test set
Performance on short sentences and long sentences: As in the cross-validation step, an average word length (here 11.67) is the criterion used to classify sentences in the test set as short or long. The results in Table 4 indicate that the model performs better on short sentences, which aligns with the cross-validation results. For longer sentences, the model's ability to identify positive reviews within the overall positive class drops, as evidenced by a lower recall of 0.7000.
Table 4 Performance on short and long sentences (test set)
Short sentence: [metrics figure]
Long sentence: [metrics figure]
Performance on each website: The model exhibits similar performance across each website.
Performance without negation words: Table 5 demonstrates that the model performs well on both the entire test set and the subset without negation words, with no significant difference.
Table 5 Overall accuracy based on website and negation words (test set)
Entire test set: [accuracy figures]
Test set without negation words: [accuracy figures]
Comparison with estimated model performance from cross-validation
Table 6 Performance comparison
Overall performance:
Cross-validation: Accuracy: 0.78183, Precision: 0.7898, Recall: 0.7666, F1-score: 0.7780
Test set: Accuracy: 0.7609, Precision: 0.7826, Recall: 0.7224, F1-score: 0.7513
Short sentence:
Cross-validation: Accuracy: 0.7922, Precision: 0.8057, Recall: 0.7800, F1-score: 0.7926
Test set: Accuracy: 0.7544, Precision: 0.7610, Recall: 0.7246, F1-score: 0.7423
Long sentence:
Cross-validation: Accuracy: 0.7643, Precision: 0.7565, Recall: 0.7580, F1-score: 0.7572
Test set: Accuracy: 0.7480, Precision: 0.7845, Recall: 0.7000, F1-score: 0.7398
Website (entire dataset): [accuracy figures for cross-validation and test set]
Website (without negation words): [accuracy figures for cross-validation and test set]
In summary, the overall performance estimated by cross-validation does not differ significantly from the test performance. The estimation and final test results align, with cross-validation slightly higher; in both, short sentences and reviews without negation words contribute to higher predictability. Across websites, the model performs consistently, with the highest performance on the Amazon reviews.
Possible reasons for these results could be due to the consistent quality of text across different platforms, suggesting that the distinguishing features for sentiment classification are not heavily dependent on the source of the reviews. Moreover, the presence of negation words and sentence length might be universal indicators of sentiment that transcend variations in text from different websites. This uniformity in text characteristics across domains could explain why the model's performance remains stable regardless of the source.

6. Conclusions

This assignment illustrates the training of a classification model and the analysis of its performance using relevant metrics for a sentiment prediction task. After loading and cleansing the raw data, an exploratory analysis was conducted to deepen our understanding of the data. Prior to modelling, the dataset underwent a transformation process to convert the textual data into numerical format using the Bag of Words model. The data was then ready for the model training, which included cross-validation and hyper-parameter tuning, utilising a multinomial Naive Bayes classifier with varying numbers of K-folds.

Following the model training, various metrics such as accuracy, precision, recall, and F1-score were computed and visualised for the evaluation of the model, allowing for the selection of the best hyperparameters. Subsequently, the final model was trained using the optimal settings and its performance was tested on a new and unseen test dataset. Ultimately, the model's capabilities were demonstrated through performance metrics, which were compared with the outcomes from the previous cross-validation training step.

7. References

[1] MyGreatLearning (2022). Bag of Words.
[2] Saket, S. (2020). Count Vectorizers vs TFIDF - Natural Language Processing. LinkedIn.
[3] Towards Data Science (2020). A Guide to Text Classification and Sentiment Analysis.
[4] Raschka, S. (n.d.). Bag of Words and Sparsity.
[5] DataScientest (2023). The Importance of Cross-Validation.
[6] Scikit-Learn (n.d.). sklearn.naive_bayes.MultinomialNB.
