This tutorial shows how to analyse Twitter data with the aim of gauging sentiment on various topics. It covers sentiment analysis with TextBlob, entity recognition with spaCy, and translation with deep_translator. See my GitHub for the full Jupyter notebook. The Python libraries used are imported below.
# import libraries
!python -m spacy download en_core_web_md
!python -m textblob.download_corpora
import pandas as pd
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from textblob import TextBlob
import matplotlib.pyplot as plt
import time # for testing function speed
import numpy as np
from datetime import datetime
from deep_translator import GoogleTranslator
import re
import csv
import gzip
nlp = spacy.load('en_core_web_md')
Next we import the Twitter files. This example only uses (part of) one day of data. The full dataset can be found on Kaggle here: https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows
# get files
path = '/Documents/ukraine_war_analysis/twitter/20230201_UkraineCombinedTweetsDeduped.csv.gzip'
data0 = pd.read_csv(path, compression='gzip')
# keep relevant columns and drop tweets without text, then show the data ready for the analysis
data0 = data0[['username', 'text', 'totaltweets', 'followers', 'location', 'extractedts']]
data0 = data0.dropna(subset=['text'])
data_test = data0.head(1000)
data_test.head()

Below we can see the first (and most time-consuming) function. It takes a dataframe as input and translates the ‘text’ column to English. We keep the location information so we can always figure out where a tweet came from. The function can optionally remove stopwords, and it always removes emojis, as these cause errors during translation. Emojis can also be converted into text instead of removed (a short aside below shows this), but that is not done here as it does not automatically improve results.
# translate to english and remove emojis
def rmv_emoji_and_trans_to_en(data, rmv_stopwords=False):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "]+", flags=re.UNICODE)
    translator = GoogleTranslator(source='auto', target='en')  # build once, reuse for every row
    store = pd.DataFrame({'username': [], 'text': [], 'totaltweets': [], 'followers': [], 'location': [], 'extractedts': []})
    for _, row in data.iterrows():
        text = emoji_pattern.sub(r'', row['text'])
        text = translator.translate(text)
        if rmv_stopwords:
            doc = nlp(text)
            text = ' '.join([word.text for word in doc if not word.is_stop])
        store2 = pd.DataFrame({
            'username': row['username'],
            'text': text,
            'totaltweets': row['totaltweets'],
            'followers': row['followers'],
            'location': row['location'],
            'extractedts': pd.to_datetime(row['extractedts']).strftime('%d/%m/%Y')
        }, index=[0])
        store = pd.concat([store, store2], ignore_index=True)
    return store
data_en = rmv_emoji_and_trans_to_en(data_test)
data_en.head()
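As an aside, here is a minimal sketch of turning emojis into text instead of removing them. It uses the emoji library, which is not part of this pipeline (pip install emoji is assumed):

# aside: convert emojis to text with the emoji library (not used in this pipeline)
import emoji
sample = 'Stay strong 💪'
print(emoji.demojize(sample))  # e.g. 'Stay strong :flexed_biceps:' (exact alias depends on the emoji version)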

Below we can see the sentiment analysis for the tweets. I also added some links with additional information on how to perform the sentiment analysis. This is a very fast process (for Python, lol).
# sentiment analysis
# https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis
# https://stackabuse.com/python-for-nlp-introduction-to-the-textblob-library/
def sentiment(data):
    store = []
    for row in data.itertuples():
        testimonial = TextBlob(row.text)
        store.append({'username': row.username,
                      'text': row.text,
                      'totaltweets': int(row.totaltweets),
                      'followers': int(row.followers),
                      'location': row.location,
                      'extractedts': row.extractedts,
                      'polarity': testimonial.sentiment.polarity})
    return pd.DataFrame(store)
data_sentiment = sentiment(data_en)
data_sentiment.head()
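For reference, a quick standalone example of what TextBlob returns. The sentiment property holds both polarity (from -1 to 1) and subjectivity (from 0 to 1), though only polarity is stored above; the example string is made up:

# standalone example of TextBlob sentiment on a single string
example = TextBlob('The ceasefire announcement is wonderful news')
print(example.sentiment)           # a Sentiment(polarity=..., subjectivity=...) named tuple
print(example.sentiment.polarity)  # a float between -1 (negative) and 1 (positive)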

Next we extract named entities (places, organisations, and people) from each tweet with spaCy, so sentiment can later be linked to specific topics.

def get_entities(data):
    def extract_entities(text):
        # IMPROVE: ALSO STORE HASHTAGS (ALWAYS START WITH #, ONE TEXT STRING)
        # extract entities from a single text
        doc = nlp(text)
        entities = {}
        for ent in doc.ents:
            if ent.label_ in ['GPE', 'ORG', 'PERSON']:
                if ent.label_ not in entities:
                    entities[ent.label_] = set()
                entities[ent.label_].add(ent.text.lower())
        return entities
    data['entities'] = data['text'].apply(extract_entities)
    # convert the entities dictionary to separate columns in the DataFrame
    data = pd.concat([data, data['entities'].apply(pd.Series, dtype='object')], axis=1)
    data.drop(columns=['entities'], inplace=True)
    return data
dataset_entities = get_entities(data_sentiment)
dataset_entities.head()
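To see what extract_entities picks up, here is a small standalone illustration with the same model; the sentence is made up and the exact predictions can vary per model version:

# standalone example of spaCy entity recognition
doc = nlp('Zelensky met EU officials in Kyiv on Monday')
for ent in doc.ents:
    print(ent.text, ent.label_)
# expected output along the lines of: Zelensky PERSON, EU ORG, Kyiv GPE, Monday DATE
# only GPE, ORG and PERSON are kept by get_entities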

Below I plot the distribution of sentiment. The first line removes rows where polarity equals 0, which is the case for the majority of tweets.
removed_zero = dataset_entities[dataset_entities['polarity'] != 0]
plt.rcParams.update({'figure.figsize': (7, 5), 'figure.dpi': 100})
x = removed_zero['polarity']
plt.hist(x, bins=50)
plt.gca().set(title='Frequency Histogram of Twitter sentiment', ylabel='Frequency');
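To check how large that zero-polarity group actually is, the share can be computed directly:

# fraction of tweets with a polarity of exactly zero
(dataset_entities['polarity'] == 0).mean()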

The data preparation is now finished for a subset of the data. These manipulations can now be applied to the full dataset and the result stored locally. It can then be used in data visualisation software; this could also be done in Python (for instance with Dash), but I want to easily share the dashboard, so Google Data Studio is what I will use. That will become a different teachback page.
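As a minimal sketch, the full run could look like the snippet below, assuming all daily files sit in one folder (the glob pattern and output filename are hypothetical):

# sketch: process every daily file and store the combined result locally
import glob

frames = []
for path in glob.glob('/Documents/ukraine_war_analysis/twitter/*.csv.gzip'):  # hypothetical pattern
    day = pd.read_csv(path, compression='gzip')
    day = day[['username', 'text', 'totaltweets', 'followers', 'location', 'extractedts']]
    day = day.dropna(subset=['text'])
    day = get_entities(sentiment(rmv_emoji_and_trans_to_en(day)))
    frames.append(day)
pd.concat(frames, ignore_index=True).to_csv('processed_tweets.csv', index=False)  # hypothetical filename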
Here are some additional code snippets that can help to evaluate the process. First, we can measure the run time of code with the ‘time’ library.
# example code: how to measure function run time.
start = time.time()
get_entities(data_sentiment)
end = time.time()
print('get_entities: ')
print(end - start)

Lastly, an example of how to filter the dataframe to only show tweets with a certain polarity.
dataset_entities[dataset_entities['polarity'] >= 0.3]
dataset_entities[dataset_entities['polarity'] <= -0.55]
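Related to this, sorting on polarity is a quick way to eyeball the most positive and most negative tweets:

# most positive and most negative tweets first
dataset_entities.sort_values('polarity', ascending=False).head(10)
dataset_entities.sort_values('polarity', ascending=True).head(10)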