Summarizing a text is useful in many situations. Here it is used to condense the output of the web scraper described above. The function is standalone, so we can run it without the web scraper and use it to summarize any text. We do need to specify which language the text is in, because the stop-word list depends on it.

Github for all the code: LINK

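import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
# First run only: download the NLTK data used below, e.g. nltk.download("stopwords")
# and nltk.download("punkt") (newer NLTK releases ask for "punkt_tab" instead)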
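def summarize_txt(text, language="dutch"):
    # The stop-word list must match the language of the text, e.g. "dutch" or "english"
    stopWords = set(stopwords.words(language))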
    words = word_tokenize(text)
   
    # frequency table to keep score of each word
    freqTable = dict()
    for word in words:
        word = word.lower()
        if word in stopWords:
            continue
        if word in freqTable:
            freqTable[word] += 1
        else:
            freqTable[word] = 1
   
    #dictionary to keep the score of each sentence
    sentences = sent_tokenize(text)
    sentenceValue = dict()
   
    for sentence in sentences:
        for word, freq in freqTable.items():
            if word in sentence.lower():
                if sentence in sentenceValue:
                    sentenceValue[sentence] += freq
                else:
                    sentenceValue[sentence] = freq
    # Nothing scored (e.g. empty text): return an empty summary instead of dividing by zero
    if not sentenceValue:
        return ''
    sumValues = sum(sentenceValue.values())
   
    # Average value of a sentence from the original text
    average = int(sumValues / len(sentenceValue))
   
    # Storing sentences into our summary
    summary = ''
    for sentence in sentences:
        if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
            summary += " " + sentence
    return summary
print('summarized text length: ' + str(len(summarize_txt(web_scraper('https://en.wikipedia.org/wiki/Python_(programming_language)', False)))))
print('original text length: ' + str(len(web_scraper('https://en.wikipedia.org/wiki/Python_(programming_language)', False))))
summarized text length: 49259
original text length: 92784

Note: the above code uses the output of the web scraper described in an earlier article. The text shrinks from 92,784 to 49,259 characters, so a fairly simple piece of code reduces its size by roughly half.
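To see how the scoring works on a small input, here is a minimal sketch that uses only the standard library. It is not the function above: the name toy_summarize, the stop_words argument and the regex-based splitting are assumptions made for illustration. Unlike the original, it matches whole words rather than substrings and does not round the average down.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and return its words (simple regex tokenizer)."""
    return re.findall(r"\w+", text.lower())


def toy_summarize(text, stop_words, threshold=1.2):
    """Keep the sentences that score above threshold * average score."""
    # Naive sentence split after '.', '!' or '?'; NLTK's sent_tokenize is more robust
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Frequency table of all words that are not stop words
    freq = Counter(w for w in tokenize(text) if w not in stop_words)
    # A sentence scores the summed frequency of the distinct words it contains
    scores = {s: sum(freq[w] for w in set(tokenize(s))) for s in sentences}
    if not scores:
        return ""
    average = sum(scores.values()) / len(scores)
    return " ".join(s for s in sentences if scores[s] > threshold * average)


example = ("Python is popular. "
           "Python code is readable and Python tools are everywhere. "
           "Cats sleep.")
print(toy_summarize(example, stop_words={"is", "and", "are"}))
```

In this example the three sentences score 4, 7 and 2, so the average is about 4.3 and only the middle sentence exceeds 1.2 times the average.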

If you are not familiar with the details of the print statements above, it can be worthwhile to try them yourself. The function returns a text, len() returns the length of that text as an integer, and str() converts that integer to a string so it can be joined to the label in the print statement.
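As a small illustration of len() and str(), the snippet below reuses the two lengths reported above instead of calling the web scraper. The variable names are chosen for this example only, and the f-string variant is simply an alternative to str(), not something the original code uses.

```python
sample = "any text"
print(len(sample))  # len() returns the number of characters as an integer

original_length = 92784  # length reported above for the scraped text
summary_length = 49259   # length reported above for the summary

# str() converts the integer so it can be concatenated with a string
print('summarized text length: ' + str(summary_length))
# An f-string performs the same conversion implicitly
print(f'original text length: {original_length}')

# Share of the text that the summary removed
reduction = 1 - summary_length / original_length
print(f'reduction: {reduction:.1%}')
```

The last line prints a reduction of 46.9%.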

With specific preprocessing before calling the function we could reduce the length even further. In the case of the Wikipedia text, we could first remove the literature references and the in-text reference numbers.
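One possible preprocessing step is sketched below. It assumes that the scraped text still contains the in-text reference numbers in the bracketed form [12]; whether the web scraper output actually looks like this is not confirmed here, and strip_reference_markers is a hypothetical helper name.

```python
import re


def strip_reference_markers(text):
    """Remove bracketed reference numbers such as [12] or [3][4]."""
    cleaned = re.sub(r"\[\d+\]", "", text)
    # Collapse runs of spaces that the removal may leave behind
    return re.sub(r" {2,}", " ", cleaned)


raw = "Python is dynamically typed[31] and garbage-collected.[32][33] It supports"
print(strip_reference_markers(raw))
```

The cleaned text could then be passed to summarize_txt in place of the raw scraper output.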
