This page gives a quick overview of the main points to pay attention to when optimizing code for speed and simplicity. For this example I optimize code from a previous article HERE. It needs to process 100+ data tables, each with 3000+ rows. The original code takes 30+ seconds for one file, while the improved version takes less than 1 second, a speed-up of more than 30x. After the overview we go through all the steps of improving the code. See GitHub for the full Jupyter Notebook.

  1. Use built-in functions and libraries: Python has a wide range of built-in functions and libraries that are optimized for performance. For example, use the built-in sum() function instead of looping through a list.
  2. Use NumPy: NumPy is a Python library that provides support for large, multi-dimensional arrays and matrices. It is optimized for performance and can significantly speed up numerical computations.
  3. Avoid loops and use vectorized operations: Loops can be slow in Python. Instead, use vectorized operations like those provided by NumPy to perform operations on arrays and matrices (see the sketch after this list).
  4. Use generators: If you need to iterate over a large dataset, use generators instead of lists. Generators are more memory-efficient and can be faster.
  5. Use caching: If you have a function that performs a time-consuming calculation, consider caching the result. This way, you can avoid repeating the calculation and save time.
  6. Use multiprocessing: If your code is CPU-bound and you have a multi-core CPU, consider using the multiprocessing module to parallelize your code.
  7. Use Cython: Cython is a Python compiler that can generate C code from Python code. This can significantly speed up the execution of your code.
  8. Profile your code: Use a profiler to identify the parts of your code that are taking the most time to execute. This will help you focus your optimization efforts on the parts of your code that will provide the most benefit.
  9. Simplify your code: Sometimes, the most effective way to speed up your code is to simplify it. If you can simplify your code, you may be able to eliminate unnecessary operations and improve performance.
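
To make tips 2, 3 and 5 concrete, here is a minimal sketch that compares a plain Python loop with the vectorized NumPy equivalent and caches an expensive calculation with functools.lru_cache. The values array and the slow_square function are made up for illustration and are not part of the Twitter code later on.

import time
from functools import lru_cache
import numpy as np

# Tips 2 and 3: replace a Python loop with a vectorized NumPy operation
values = np.random.rand(1_000_000)

start = time.time()
total_loop = sum(v * v for v in values)  # plain Python loop over every element
print('loop:      ', time.time() - start)

start = time.time()
total_vec = np.dot(values, values)  # vectorized equivalent, runs in optimized C
print('vectorized:', time.time() - start)

# Tip 5: cache the result of an expensive pure function
@lru_cache(maxsize=None)
def slow_square(n):
    time.sleep(0.5)  # stand-in for a time-consuming calculation
    return n * n

slow_square(12)  # first call takes about half a second
slow_square(12)  # second call returns instantly from the cache

The vectorized version is typically orders of magnitude faster than the loop, and the second call to slow_square returns immediately because the result comes straight from the cache.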

For completeness' sake, I first show the full original code snippet that we will optimize.

import pandas as pd
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from textblob import TextBlob
import matplotlib.pyplot as plt
import time # for testing function speed
import numpy as np
from datetime import datetime
from deep_translator import GoogleTranslator
import re
import textfeatures as tf
import csv
import gzip

nlp = spacy.load('en_core_web_md')
def twitter_analysis(data):
    nlp = spacy.load('en_core_web_md')
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+", flags=re.UNICODE)
    translator = GoogleTranslator(source='auto', target='en')
    def rmv_emoji_and_trans_to_en(data, rmv_stopwords=False):
        store = []
        for _, row in data.iterrows():
            text = emoji_pattern.sub(r'', row['text'])
            text = re.sub(r'http\S+', '', text).replace('\n','')
            text = translator.translate(text)

            if rmv_stopwords:
                doc = nlp(text)
                text = ' '.join([word.text for word in doc if not word.is_stop])
            store.append((
                row['username'],
                text,
                row['totaltweets'],
                row['followers'],
                row['location'],
                pd.to_datetime(row['extractedts']).strftime('%d/%m/%Y')  # use this row's timestamp, not the global data0
            ))
        return pd.DataFrame(store, columns=['username', 'text', 'totaltweets', 'followers', 'location', 'extractedts'])

    def sentiment(data):
        store = []
        for row in data.itertuples():
            testimonial = TextBlob(row.text)
            store.append({'username':row.username, 
                          'text':row.text, 
                          'totaltweets':int(row.totaltweets), 
                          'followers':int(row.followers), 
                          'location':row.location, 
                          'extractedts':row.extractedts,
                          'polarity': testimonial.sentiment.polarity })
        return pd.DataFrame(store)
    
    def get_entities(data):
        def extract_entities(text):
            # Extract entities from text
            doc = nlp(text)
            entities = {
                ent.label_: {ent.text.lower()}
                for ent in doc.ents
                if ent.label_ in ['GPE', 'ORG', 'PERSON']
            }
            return entities

        data = data.assign(**data['text'].apply(extract_entities).apply(pd.Series, dtype='object'))
        return data
    data_en = rmv_emoji_and_trans_to_en(data)
    data_sentiment = sentiment(data_en)
    dataset_entities = get_entities(data_sentiment) 
    dataset_output = pd.DataFrame(tf.hashtags(dataset_entities, "text", "hashtags"))
    return dataset_output

To measure the speed of a function, the time library can be used as shown in the following code snippet. Printing the run time shows it takes 30.96 seconds to process 50 rows. This is slow, even though a lot happens per row. It needs to be improved to realistically process 3000+ rows.

#import data
path='twitter/20230201_UkraineCombinedTweetsDeduped.csv.gzip'
data0 = pd.read_csv(path, compression='gzip')
data = pd.DataFrame(data=data0, columns=['username', 'text', 'totaltweets', 'followers', 'location', 'extractedts'])
data = data.dropna(subset=['text'])  # drop rows with missing text
data = data.head(50)

#run function and measure run time
start = time.time()
data = twitter_analysis(data)
end = time.time()

#print the resulting run time
print('time duration full process: ')
print(end - start)
time duration full process: 
30.967555046081543

I prefer to first write code that achieves the desired functionality, even if it is not yet optimal speed-wise. Once it works, it can be optimized. Now that ChatGPT is available we can ask it to optimize the code for us. It gives the following tips:

  1. Remove unnecessary imports:
    • numpy and matplotlib.pyplot are imported but not used in the code.
    • datetime is imported but not used within the function, so it can be removed.
  2. Avoid reloading the (en_core_web_md) model:
    • The nlp object is already loaded outside the function, so it can be reused inside the function instead of loading it again.
  3. Use list comprehension to improve performance:
    • Instead of using a for-loop to iterate over the data, list comprehension can be used to improve performance (see the sketch after this list).
  4. Remove unnecessary variables:
    • The store variable is not necessary in the rmv_emoji_and_trans_to_en and sentiment functions. The store variable can be replaced with a list comprehension to avoid unnecessary memory usage.
  5. Avoid using regular expressions where possible:
    • The regular expression pattern to remove emojis can be replaced with unicodedata.normalize which is faster and more efficient.
  6. Combine rmv_emoji_and_trans_to_en and sentiment functions:
    • The rmv_emoji_and_trans_to_en and sentiment functions can be combined into a single function to avoid iterating over the data twice.
  7. Avoid using external libraries:
    • Instead of using deep_translator library, the googletrans library can be used to translate the text.
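
As a concrete example of tips 3 and 4, the sentiment() helper from the original snippet can be collapsed into a single list comprehension. This is just a sketch of that intermediate step, not the final version shown below.

import pandas as pd              # same imports as in the original snippet
from textblob import TextBlob

def sentiment(data):
    # Build the records in one list comprehension; no separate store list needed
    return pd.DataFrame([
        {'username': row.username,
         'text': row.text,
         'totaltweets': int(row.totaltweets),
         'followers': int(row.followers),
         'location': row.location,
         'extractedts': row.extractedts,
         'polarity': TextBlob(row.text).sentiment.polarity}
        for row in data.itertuples()
    ])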

It is important to note that ChatGPT is often not correct. It can give example code that might or might not work. Asking it to optimize a subset of the full code increases the chances that it is correct. For instance, I did not use the googletrans library, and I dealt with removing emojis in a better way, using the re library. After some tinkering I ended up with the following code.

import re
import pandas as pd
import spacy
from textblob import TextBlob
import textfeatures as tf

nlp = spacy.load('en_core_web_md')  # load the model once, outside the function

def twitter_analysis(data):
    def extract_entities(text):
        # Extract entities from text
        doc = nlp(text)
        entities = {
            ent.label_: {ent.text.lower()}
            for ent in doc.ents
            if ent.label_ in ['GPE', 'ORG', 'PERSON']
        }
        return entities
    
    def clean_text(text):
        # Remove emojis and translate to English
        text = re.sub(r'http\S+', '', text)  # remove URLs
        text = re.sub(r'@\w+', '', text)  # remove mentions
        #text = re.sub(r'#\w+', '', text)  # remove hashtags
        #text = re.sub(r'\d+', '', text)  # remove digits
        text = re.sub(r'[^\w\s#]', '', text)  # remove punctuation, but keep '#' so hashtags survive for tf.hashtags
        emoji_pattern = re.compile('[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]', flags=re.UNICODE)
        text = emoji_pattern.sub(r'', text)  # remove emojis
        text = text.strip().lower()  # convert to lowercase and remove leading/trailing whitespaces
        return text

    # Clean the text first, then compute polarity and entities on the cleaned column
    cleaned = data.assign(text=data['text'].apply(clean_text))
    dataset_output = (
        cleaned
            .assign(polarity=cleaned['text'].apply(lambda x: TextBlob(x).sentiment.polarity))
            .assign(**cleaned['text'].apply(extract_entities).apply(pd.Series, dtype='object'))
            .pipe(tf.hashtags, 'text', 'hashtags')
    )

    return dataset_output
time duration full process: 
0.9002244472503662

The new version of the code is around 30 lines shorter and has a runtime of 0.9 seconds. Compared to the original runtime of 30.96 seconds, that is a speed improvement of over 30x!

Let me know if you have other tips to quickly improve code efficiency!
