Tweet Sentiment Analysis

tags: Python Machine Learning Deep Learning NLP CNN RNN PyTorch

Aim

The aim of this project is to develop a sentiment analysis model that classifies a tweet as having a positive or a negative sentiment. While implementing this sentiment analysis model, I explore different Deep Learning models such as 1-D Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). In this project, I compare the performance of the Deep Learning models with Machine Learning techniques such as Logistic Regression and Random Forest Classifier.

Implementation

- Data Exploration

  1. Twitter Sentiment Extraction

    This is an open dataset provided for the Twitter Sentiment Extraction competition on Kaggle. The dataset contains around 30,000 text-to-sentiment mappings. Running various models on this dataset yielded a maximum accuracy of around 68%. Analysis suggested the models were under-fitting because the dataset contained too few data points.

  2. Sentiment140 dataset

    This is a much larger Twitter dataset, containing 1.6 million tweets collected using the Twitter API. The tweet texts are labeled as positive or negative, along with other tweet-specific fields.

  3. Twitter Sentiment Analysis dataset

    This is an entity-level sentiment analysis dataset of tweets, containing around 75,000 text-to-sentiment mappings. It ran into the same issue as the first dataset, i.e. under-fitting due to its small size.

To get a more generalized and sufficiently large dataset, all three datasets above were combined, with each datapoint/tweet carrying either a positive or a negative label.

- Data Preprocessing

  1. Data cleaning

    First, I processed the data to remove non-contributing tokens such as stop words, URLs, emojis, and punctuation. To ignore the specific usernames mentioned in a tweet, I replaced any text prefixed with @ with the generic token @user. Finally, I converted all the text to lowercase to make the data more uniform.

  2. Continuous letters

    Many tweets contained characters repeated more than 2 times in a row (e.g. goooooood). This caused the model to treat goooooood and good as two different words despite them meaning the same thing. To avoid such cases, I collapsed any run of more than 2 identical characters down to 2, so goooooood was now treated as good. This reduced the overall number of distinct words in the vocabulary.

  3. Standardizing tweet length

    To standardize the input, I padded all the tweets to the same length before converting them into embeddings. The tweets were padded equally on both sides with 0s to a uniform length of 64 tokens. Any tweet shorter than 3 words was discarded. A sketch of these preprocessing steps follows this list.
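
  The following is a minimal sketch of the preprocessing pipeline described above. The regular expressions, the stop-word list, and the `pad_tweet` helper are illustrative assumptions, not the exact code used in the project.

  ```python
  import re

  STOP_WORDS = {"the", "a", "an", "is", "are", "to", "and"}  # illustrative subset
  PAD_LEN = 64  # uniform tweet length after padding

  def clean_tweet(text: str) -> list[str]:
      text = text.lower()
      text = re.sub(r"https?://\S+", "", text)    # drop URLs
      text = re.sub(r"@\w+", "@user", text)       # anonymize mentions
      text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # goooooood -> good
      text = re.sub(r"[^\w@\s]", "", text)        # strip punctuation/emojis
      return [w for w in text.split() if w not in STOP_WORDS]

  def pad_tweet(token_ids: list[int]) -> list[int] | None:
      """Pad equally on both sides with 0s to PAD_LEN; drop tweets shorter than 3 words."""
      if len(token_ids) < 3:
          return None
      left = (PAD_LEN - len(token_ids)) // 2
      right = PAD_LEN - len(token_ids) - left
      return [0] * left + token_ids + [0] * right

  print(clean_tweet("Goooooood morning @alice, check https://t.co/xyz!!!"))
  # ['good', 'morning', '@user', 'check']
  ```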

- Model Exploration

In this step, I compared the performance of various Deep Learning and Machine Learning models for tweet sentiment analysis, experimenting with different values for hyperparameters such as the learning rate, number of epochs, and kernel size.

  1. 1-D CNN (sketched after this list)

    Figure: CNN Model Layers and Model Summary

  2. RNN (sketched after this list)

    Figure: RNN Model Layers and Model Summary

  3. Logistic Regression

  4. Random Forest Classifier
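
Below are minimal PyTorch sketches of the two Deep Learning models. The hyperparameters here (vocabulary size, embedding dimension, number of filters, hidden size) are illustrative assumptions, not the values from the original experiments.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 50_000  # assumed vocabulary size after cleaning
EMBED_DIM = 128      # assumed embedding dimension
SEQ_LEN = 64         # padded tweet length from the preprocessing step

class CNNSentiment(nn.Module):
    """1-D CNN over token embeddings, producing one sentiment logit per tweet."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM, padding_idx=0)
        self.conv = nn.Conv1d(EMBED_DIM, 100, kernel_size=5)  # 100 width-5 filters
        self.pool = nn.AdaptiveMaxPool1d(1)                   # global max pool over time
        self.fc = nn.Linear(100, 1)

    def forward(self, x):                  # x: (batch, SEQ_LEN) token ids
        e = self.embed(x).transpose(1, 2)  # -> (batch, EMBED_DIM, SEQ_LEN) for Conv1d
        h = torch.relu(self.conv(e))
        h = self.pool(h).squeeze(-1)       # -> (batch, 100)
        return self.fc(h).squeeze(-1)      # -> (batch,) logits

class RNNSentiment(nn.Module):
    """LSTM over token embeddings; the final hidden state feeds a linear classifier."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM, padding_idx=0)
        self.lstm = nn.LSTM(EMBED_DIM, 128, batch_first=True)
        self.fc = nn.Linear(128, 1)

    def forward(self, x):
        _, (h_n, _) = self.lstm(self.embed(x))  # h_n: (1, batch, 128)
        return self.fc(h_n[-1]).squeeze(-1)     # -> (batch,) logits
```

A sigmoid over the single logit gives the probability of positive sentiment, so both models pair naturally with a binary cross-entropy loss on the logits.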

- Performance Validation

To compare the performance of all the models, I used an 80-20 train-test split of the dataset.
For the CNN and RNN, L2 regularization was used to prevent over-fitting and improve validation accuracy. A sketch of this setup appears below.
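
Here is a hedged sketch of how this setup might look in PyTorch, with L2 regularization applied through the optimizer's weight_decay term. The placeholder data, batch size, learning rate, regularization coefficient, and epoch count are all assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Placeholder data standing in for the real preprocessed tweets (shapes only).
token_ids = torch.randint(0, 50_000, (1000, 64))  # (num_tweets, padded length)
labels = torch.randint(0, 2, (1000,)).float()     # 0 = negative, 1 = positive

dataset = TensorDataset(token_ids, labels)
n_train = int(0.8 * len(dataset))                 # 80-20 train-test split
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=256, shuffle=True)

model = CNNSentiment()  # from the sketch in the previous section
# weight_decay applies an L2 penalty to the weights, curbing over-fitting.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = torch.nn.BCEWithLogitsLoss()

for epoch in range(10):                           # epoch count is illustrative
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```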

Figure: CNN - Loss vs # of Epochs
Figure: RNN - Loss vs # of Epochs

| Model                     | Accuracy |
|---------------------------|----------|
| 1-D CNN                   | 82.24%   |
| RNN                       | 79.85%   |
| Logistic Regression       | 79.73%   |
| Random Forest Classifier  | 77.64%   |

Performance comparison of all the models

What I learned

  1. Importance of clean and sufficient data

    The key takeaway for me from this project is the importance of clean and sufficient data while training a model.

    Without cleaning the data, the same models perform extremely poorly. Also, when the same models were trained on only the first dataset, the highest accuracy achieved was 68%. This shows how much data is required to train these models well.

  2. Hyperparameter tuning and their impact on under- and over-fitting

    The above results were achieved after experimenting with different values of various hyperparameters such as the learning rate, kernel sizes, number of epochs, and batch size. I learned how to tune these hyperparameters while dealing with under- and over-fitting.

  3. PyTorch Library

    I learned how to create Deep Learning models using the PyTorch library and explored its other classes and functions. I learned how to write a forward and a backward pass for the models and how to evaluate the training and validation losses during training.
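
    As an example of the evaluation part, here is a hedged sketch of computing the validation loss with gradients disabled; the helper name and the loader are assumptions, not code from the project.

    ```python
    import torch

    def validation_loss(model, val_loader, criterion):
        """Average loss over the validation set, without tracking gradients."""
        model.eval()              # switch off training-only behaviour
        total, count = 0.0, 0
        with torch.no_grad():     # no gradient bookkeeping during evaluation
            for x, y in val_loader:
                total += criterion(model(x), y.float()).item() * len(y)
                count += len(y)
        model.train()
        return total / count
    ```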

Finally, I also learned that CNNs can perform great while dealing with text data!
