This page gives a quick overview of the main points to pay attention to when optimizing code for speed and simplicity. For this example I optimize code from a previous article HERE. It needs to process 100+ data tables, each with 3000+ rows. The original code takes 30+ seconds for 1 file, while the improved version takes less than 1 second: a speed increase of more than 30x. After the general tips below, we go through all the steps of improving the code. See GitHub for the full Jupyter Notebook.
- Use built-in functions and libraries: Python has a wide range of built-in functions and libraries that are optimized for performance. For example, use the built-in `sum()` function instead of looping through a list.
- Use NumPy: NumPy is a Python library that provides support for large, multi-dimensional arrays and matrices. It is optimized for performance and can significantly speed up numerical computations.
- Avoid loops and use vectorized operations: loops can be slow in Python. Instead, use vectorized operations like those provided by NumPy to perform operations on arrays and matrices (short sketches follow this list).
- Use generators: If you need to iterate over a large dataset, use generators instead of lists. Generators are more memory-efficient and can be faster.
- Use caching: If you have a function that performs a time-consuming calculation, consider caching the result. This way, you can avoid repeating the calculation and save time.
- Use multiprocessing: If your code is CPU-bound and you have a multi-core CPU, consider using the multiprocessing module to parallelize your code.
- Use Cython: Cython is a Python compiler that can generate C code from Python code. This can significantly speed up the execution of your code.
- Profile your code: Use a profiler to identify the parts of your code that are taking the most time to execute. This will help you focus your optimization efforts on the parts of your code that will provide the most benefit.
- Simplify your code: Sometimes, the most effective way to speed up your code is to simplify it. If you can simplify your code, you may be able to eliminate unnecessary operations and improve performance.
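To make the vectorization and generator tips a bit more concrete, here is a minimal sketch. The million-element list is just an illustration, not data from this project:
import numpy as np

values = list(range(1_000_000))

# Plain Python loop: every iteration runs interpreted bytecode
total = 0
for v in values:
    total += v * v

# Vectorized NumPy version: the same computation runs in optimized C code
arr = np.asarray(values)
total_vectorized = int((arr * arr).sum())

# Generator expression: no intermediate list is materialized in memory
total_generator = sum(v * v for v in values)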
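For the caching tip, functools.lru_cache is usually the easiest option. The slow_score function below is a hypothetical stand-in for any expensive, deterministic computation:
from functools import lru_cache

@lru_cache(maxsize=None)
def slow_score(word):
    # stand-in for an expensive, deterministic computation
    return sum(ord(c) for c in word) / len(word)

slow_score("ukraine")   # computed on the first call
slow_score("ukraine")   # returned instantly from the cache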
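And for the profiling tip, the standard library's cProfile shows where the time actually goes. Here my_pipeline is just a placeholder for whatever function you want to inspect:
import cProfile
import pstats

def my_pipeline():
    # placeholder for the code you actually want to profile
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
my_pipeline()
profiler.disable()

# print the ten most expensive calls, sorted by cumulative time
pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)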
For completeness' sake, I first show the full original code snippet that we will optimize.
import pandas as pd
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from textblob import TextBlob
import matplotlib.pyplot as plt
import time # for testing function speed
import numpy as np
from datetime import datetime
from deep_translator import GoogleTranslator
import re
import textfeatures as tf
import csv
import gzip
nlp = spacy.load('en_core_web_md')
def twitter_analysis(data):
    nlp = spacy.load('en_core_web_md')
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+", flags=re.UNICODE)
    translator = GoogleTranslator(source='auto', target='en')

    def rmv_emoji_and_trans_to_en(data, rmv_stopwords=False):
        store = []
        for _, row in data.iterrows():
            text = emoji_pattern.sub(r'', row['text'])
            text = re.sub(r'http\S+', '', text).replace('\n', '')
            text = translator.translate(text)
            if rmv_stopwords:
                doc = nlp(text)
                text = ' '.join([word.text for word in doc if not word.is_stop])
            store.append((
                row['username'],
                text,
                row['totaltweets'],
                row['followers'],
                row['location'],
                pd.to_datetime(row['extractedts']).strftime('%d/%m/%Y')
            ))
        return pd.DataFrame(store, columns=['username', 'text', 'totaltweets', 'followers', 'location', 'extractedts'])

    def sentiment(data):
        store = []
        for row in data.itertuples():
            testimonial = TextBlob(row.text)
            store.append({'username': row.username,
                          'text': row.text,
                          'totaltweets': int(row.totaltweets),
                          'followers': int(row.followers),
                          'location': row.location,
                          'extractedts': row.extractedts,
                          'polarity': testimonial.sentiment.polarity})
        return pd.DataFrame(store)

    def get_entities(data):
        def extract_entities(text):
            # Extract entities from text
            doc = nlp(text)
            entities = {
                ent.label_: {ent.text.lower()}
                for ent in doc.ents
                if ent.label_ in ['GPE', 'ORG', 'PERSON']
            }
            return entities
        data = data.assign(**data['text'].apply(extract_entities).apply(pd.Series, dtype='object'))
        return data

    data_en = rmv_emoji_and_trans_to_en(data)
    data_sentiment = sentiment(data_en)
    dataset_entities = get_entities(data_sentiment)
    dataset_output = pd.DataFrame(tf.hashtags(dataset_entities, "text", "hastags"))
    return dataset_output
To measure the speed of a function, the `time` library can be used as shown in the following code snippet. Printing the function run time shows it takes 30.96 seconds to process 50 rows. This is slow, but at the same time a lot happens. It needs to be improved to realistically process 3000+ rows.
#import data
path='twitter/20230201_UkraineCombinedTweetsDeduped.csv.gzip'
data0 = pd.read_csv(path, compression='gzip')
data=pd.DataFrame(data=data0,columns=['username', 'text', 'totaltweets','followers', 'location', 'extractedts'])
data = data.dropna(subset=['text'])  # drop rows without text
data = data.head(50)
#run function and measure run time
start = time.time()
data = twitter_analysis(data)
end = time.time()
#print the resulting run time
print('time duration full process: ')
print(end - start)
time duration full process:
30.967555046081543
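If you time functions like this more than once, a small decorator keeps the stopwatch code in one place. This is just a convenience sketch using time.perf_counter, which is slightly more precise than time.time for short durations; twitter_analysis is the function defined above:
import time
from functools import wraps

def timed(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f'{func.__name__} took {time.perf_counter() - start:.2f} seconds')
        return result
    return wrapper

# usage: wrap the analysis function once, every call then prints its own run time
timed_analysis = timed(twitter_analysis)
data = timed_analysis(data)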
I prefer to first write functionally correct code that might not yet be optimal speed-wise, to achieve the desired functionality. Once it works, it can be optimized for speed. Now that ChatGPT is available, we can ask it to optimize the code for us. It gives the following tips:
- Remove unnecessary imports: `numpy` and `matplotlib.pyplot` are imported but not used in the code. `datetime` is imported but not used within the function, so it can be removed.
- Avoid reloading the `en_core_web_md` model: the `nlp` object is already loaded outside the function, so it can be reused inside the function instead of being loaded again.
- Use list comprehensions to improve performance: instead of using a for-loop to iterate over the data, a list comprehension can be used to improve performance.
- Remove unnecessary variables: the `store` variable is not necessary in the `rmv_emoji_and_trans_to_en` and `sentiment` functions. It can be replaced with a list comprehension to avoid unnecessary memory usage.
- Avoid using regular expressions where possible: the regular expression pattern to remove emojis can be replaced with `unicodedata.normalize`, which is faster and more efficient.
- Combine the `rmv_emoji_and_trans_to_en` and `sentiment` functions: the two functions can be combined into a single function to avoid iterating over the data twice.
- Avoid using external libraries: instead of the `deep_translator` library, the `googletrans` library can be used to translate the text.
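To illustrate the list-comprehension tip, here is a simplified, hypothetical version of the sentiment step that builds its rows in one pass instead of appending to a store list; it only keeps a few columns for brevity and is not the exact code from the article:
from textblob import TextBlob
import pandas as pd

def sentiment(data):
    # one list comprehension instead of a loop that appends to a store list
    rows = [
        {'username': row.username,
         'text': row.text,
         'polarity': TextBlob(row.text).sentiment.polarity}
        for row in data.itertuples()
    ]
    return pd.DataFrame(rows)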
It is important to note that ChatGPT is often not correct. It can give example code that might or might not work. Asking it to optimize a subset of the full code increases the chances that it is correct. I did not, for instance, use the `googletrans` library, and I dealt with removing emojis in a better way, using the `re` library. After some tinkering I ended up with the following code.
import re
import pandas as pd
import spacy
import unicodedata
from textblob import TextBlob
from googletrans import Translator
import emoji
import textfeatures as tf
nlp = spacy.load('en_core_web_md')
translator = Translator()
def twitter_analysis(data):
    def extract_entities(text):
        # Extract entities from text
        doc = nlp(text)
        entities = {
            ent.label_: {ent.text.lower()}
            for ent in doc.ents
            if ent.label_ in ['GPE', 'ORG', 'PERSON']
        }
        return entities

    def clean_text(text):
        # Remove URLs, mentions, punctuation and emojis
        text = re.sub(r'http\S+', '', text)  # remove URLs
        text = re.sub(r'@\w+', '', text)  # remove mentions
        #text = re.sub(r'#\w+', '', text)  # remove hashtags
        #text = re.sub(r'\d+', '', text)  # remove digits
        text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
        emoji_pattern = re.compile('[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]', flags=re.UNICODE)
        text = emoji_pattern.sub(r'', text)  # remove emojis
        text = text.strip().lower()  # convert to lowercase and remove leading/trailing whitespace
        return text

    dataset_output = (
        data.assign(text=data['text'].apply(clean_text))
        .assign(polarity=data['text'].apply(lambda x: TextBlob(x).sentiment.polarity))
        .assign(**data['text'].apply(extract_entities).apply(pd.Series, dtype='object'))
        .pipe(tf.hashtags, 'text', 'hashtags')
    )
    return dataset_output
time duration full process:
0.9002244472503662
The new version of the code has around 30 fewer lines and a runtime of 0.9 seconds. Compared to the original runtime of 30.96 seconds, that is a speed improvement of over 30x! It is worth noting that a large part of this gain comes from no longer calling the online translator for every tweet, which was the slowest step in the original function.
Let me know if you have other tips to quickly improve code efficiency!