This tutorial shows how to analyse Twitter data with the aim of acquiring sentiment on various topics. It covers sentiment analysis using ‘textblob’, entity recognition with spaCy and translation using deep_translator. See my github for the full jupyter notebook. The following are all the Python libraries used.

# import libraries
!python -m spacy download en_core_web_md
!python -m textblob.download_corpora

import pandas as pd
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from textblob import TextBlob
import matplotlib.pyplot as plt

import time # for testing function speed
import numpy as np
from datetime import datetime
from deep_translator import GoogleTranslator
import re

import csv
import gzip

nlp = spacy.load('en_core_web_md')

Next we import the Twitter files. This example uses only (part of) one day of data. The full dataset can be found on Kaggle here: https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows

# get files
path='/Documents/ukraine_war_analysis/twitter/20230201_UkraineCombinedTweetsDeduped.csv.gzip'
data0 = pd.read_csv(path, compression='gzip')
# keep relevant columns and drop rows without text, show the final data ready for the analysis
data0 = data0[['username', 'text', 'totaltweets', 'followers', 'location', 'extractedts']]
data0 = data0.dropna(subset=['text'])  # assign the result back, otherwise dropna does nothing
data_test = data0.head(1000)
data_test.head()
Figure 1: initial output; the text still needs to be translated, the timestamp formatted and emojis removed.

Below we can see the first (and most time-consuming) function. It takes a dataframe as input and translates the ‘text’ column to English. We keep the location information so we can always figure out where a tweet came from. The function removes emojis, as they cause errors during translation, and it can optionally remove stopwords. Emojis can also be turned into text instead of being removed, but that is not done here as it does not automatically improve results (see the short sketch after Figure 2).

# translate to english and remove emojis
def rmv_emoji_and_trans_to_en(data, rmv_stopwords=False):
    emoji_pattern = re.compile("["
            u"\U0001F600-\U0001F64F"  # emoticons
            u"\U0001F300-\U0001F5FF"  # symbols & pictographs
            u"\U0001F680-\U0001F6FF"  # transport & map symbols
            u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
            "]+", flags=re.UNICODE)
    translator = GoogleTranslator(source='auto', target='en')  # create once, reuse for every row
    store = pd.DataFrame({'username': [], 'text': [], 'totaltweets': [], 'followers': [], 'location': [], 'extractedts': []})
    for _, row in data.iterrows():
        text = emoji_pattern.sub(r'', row['text'])
        if text.strip():  # avoid sending empty strings to the translator
            text = translator.translate(text)

        if rmv_stopwords:
            doc = nlp(text)
            text = ' '.join([token.text for token in doc if not token.is_stop])

        store2 = pd.DataFrame({
            'username': row['username'],
            'text': text,
            'totaltweets': row['totaltweets'],
            'followers': row['followers'],
            'location': row['location'],
            # format the timestamp of this row only
            'extractedts': pd.to_datetime(row['extractedts']).strftime('%d/%m/%Y')
        }, index=[0])
        store = pd.concat([store, store2], ignore_index=True)
    return store

data_en = rmv_emoji_and_trans_to_en(data_test)
data_en.head()
Figure 2: text has been translated and the timestamp has been formatted.
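
As an aside, emojis can be turned into text rather than dropped. A minimal sketch, assuming the third-party ‘emoji’ package (which is not among the imports above):

# optional: turn emojis into text instead of removing them
# (sketch using the third-party 'emoji' package, not imported in the notebook above)
import emoji

print(emoji.demojize('I stand with Ukraine 👍'))
# -> 'I stand with Ukraine :thumbs_up:'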

Below we can see the sentiment analysis for the tweets. I also added some links with additional information on how to perform sentiment analysis. This is a very fast process (for Python, anyway).
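
As a quick illustration of what TextBlob returns (a minimal sketch; the sample sentence is made up), the sentiment property contains both polarity and subjectivity, of which only polarity is stored below:

# quick illustration of TextBlob output (the sample sentence is made up)
example = TextBlob('The negotiations were surprisingly successful.')
print(example.sentiment)           # Sentiment(polarity=..., subjectivity=...)
print(example.sentiment.polarity)  # a float between -1 and 1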

# sentiment analysis 
# https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis
# https://stackabuse.com/python-for-nlp-introduction-to-the-textblob-library/
def sentiment(data):
    store = []
    for row in data.itertuples():
        testimonial = TextBlob(row.text)
        store.append({'username':row.username, 
                      'text':row.text, 
                      'totaltweets':int(row.totaltweets), 
                      'followers':int(row.followers), 
                      'location':row.location, 
                      'extractedts':row.extractedts,
                      'polarity': testimonial.sentiment.polarity })
    return pd.DataFrame(store)

data_sentiment = sentiment(data_en)   
data_sentiment.head()
Figure 3: polarity (sentiment) is added for each tweet. Note that polarity ranges from -1 to 1.

Next we extract named entities from the tweets with spaCy. The function below keeps three entity types: places (GPE), organisations (ORG) and people (PERSON).
def get_entities(data):
    def extract_entities(text):
        # TODO: also store hashtags (they always start with '#' and are one text string)
        # Extract entities from text
        doc = nlp(text)
        entities = {}
        for ent in doc.ents:
            if ent.label_ in ['GPE', 'ORG', 'PERSON']:
                if ent.label_ not in entities:
                    entities[ent.label_] = set()
                entities[ent.label_].add(ent.text.lower())
        return entities

    data['entities'] = data['text'].apply(extract_entities)
    # Convert the entities dictionary to separate columns in the DataFrame
    data = pd.concat([data, data['entities'].apply(pd.Series, dtype='object')], axis=1)
    data.drop(columns=['entities'], inplace=True)
    return data

dataset_entities = get_entities(data_sentiment) 
dataset_entities.head()
Figure 4: The three entity types I am interested in are stored in separate columns (GPE, ORG, PERSON). Each cell holds a set of entity strings (in the hope that the dashboarding software can filter on these columns).
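
As the TODO inside get_entities suggests, hashtags could be stored as well. A minimal sketch of how that might look with a simple regular expression (the helper extract_hashtags is hypothetical, not part of the pipeline):

# sketch: extract hashtags, as suggested by the TODO in get_entities
# (hashtags always start with '#' and are one text string)
def extract_hashtags(text):
    return set(re.findall(r'#\w+', text))

print(extract_hashtags('No to war #Ukraine #StopTheWar'))
# -> {'#Ukraine', '#StopTheWar'} (set order may vary)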

Below I plot the distribution of sentiment. The first line removes rows where polarity = 0, since that is the value the majority of tweets have.

removed_zero = dataset_entities[dataset_entities['polarity'] != 0]
plt.rcParams.update({'figure.figsize':(7,5), 'figure.dpi':100})
x = removed_zero['polarity']
plt.hist(x, bins=50)
plt.gca().set(title='Frequency Histogram of twitter sentiment', ylabel='Frequency');
Figure 5: After removing polarity = 0 we get quite a good bell curve, as we would expect. Polarity = -0.5 and 0.5 do have a higher occurrence than expected, likely a quirk of the textblob library; no issue for now.
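
To verify those spikes, a quick sketch that lists the most frequent polarity values:

# inspect the most frequent polarity values to check the spikes at +/- 0.5
print(removed_zero['polarity'].value_counts().head(10))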

The data preparation is now finished for a subset of the data. These manipulations can now be applied to the full dataset and the result stored locally (a sketch follows below). It can then be used in data visualisation software; Google Data Studio is what I will use. The dashboard could also be built in Python (for instance with Dash), but I want to share it easily, so Google Data Studio it is. That will become a different teachback page.
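
A minimal sketch of applying the pipeline to the full file and storing the result (the output filename is hypothetical):

# sketch: apply the same steps to the full file and store the result locally
# (the output filename is hypothetical)
data_full = pd.read_csv(path, compression='gzip')
data_full = data_full[['username', 'text', 'totaltweets', 'followers', 'location', 'extractedts']]
data_full = data_full.dropna(subset=['text'])

data_full = rmv_emoji_and_trans_to_en(data_full)
data_full = sentiment(data_full)
data_full = get_entities(data_full)

data_full.to_csv('20230201_UkraineTweets_processed.csv', index=False)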

Here are some additional code snippets that can help to evaluate the process. First, we can measure the run time of code with the ‘time’ library.

# example code: how to measure function run time.
start = time.time()
get_entities(data_sentiment)
end = time.time()
print('get_entities: ')
print(end - start)  # elapsed time in seconds

Lastly, an example of how to filter the dataframe to show only tweets with a certain polarity.

dataset_entities[dataset_entities['polarity'] >= 0.3]
dataset_entities[dataset_entities['polarity'] <= -0.55]