summarized text length: 49259 original text length: 92784
Note: the above code uses the output of the web scraper described in an earlier article. The summary is roughly 47% shorter than the original (49,259 versus 92,784 characters), which is a sizable reduction for a pretty simple piece of code.
If you are not familiar with the details of the above print statement, it can be worthwhile to try it for yourself. The summarization function returns a string, len() returns the length of that string as an integer, and str() converts that integer to a string so it can be concatenated with the rest of the printed text.
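To experiment with the print statement on its own, a minimal sketch could look like the following. The two example strings and the helper name length_report are made up for illustration and stand in for the scraped article and its summary; only the two lengths in the last step are taken from the output shown above.

```python
# Hypothetical stand-ins for the scraped article and its summary.
original_text = (
    "Python is a programming language. It is widely used. "
    "It is often chosen for text processing."
)
summarized_text = "Python is a programming language."


def length_report(summary, original):
    # len() returns an int; str() turns it into text so it can be
    # joined to the surrounding strings with +.
    return (
        "summarized text length: " + str(len(summary))
        + " original text length: " + str(len(original))
    )


print(length_report(summarized_text, original_text))

# Share of the text that was removed, using the lengths reported above.
reduction = 1 - 49259 / 92784
print("reduction: " + str(round(reduction * 100, 1)) + "%")
```

Leaving out str() raises a TypeError, because Python does not concatenate a string and an integer with +.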
So a basic summarization function can cut the length almost in half. With specific preprocessing before calling the function, we could reduce it even further. In the case of the Wikipedia text, we could first remove the literature references and the in-text reference numbers.
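As a rough sketch of such a preprocessing step, the function below strips bracketed reference numbers and cuts off a trailing reference section. It assumes the scraped text marks citations as [1], [23] and so on, and that the literature list starts at a line containing only the word "References". Both are assumptions about the scraper output, and strip_wikipedia_references is a hypothetical helper name.

```python
import re


def strip_wikipedia_references(text):
    # Keep only the part before a line that consists of "References"
    # (assumed heading of the literature list).
    body = re.split(r"\n\s*References\s*\n", text, maxsplit=1)[0]
    # Remove in-text reference numbers such as [1] or [23].
    body = re.sub(r"\[\d+\]", "", body)
    return body


sample = (
    "Python was released in 1991.[1] It is popular.[23]\n"
    "References\n"
    "1. Some book\n"
)
print(strip_wikipedia_references(sample))
```

The cleaned text can then be passed to the summarization function in place of the raw scraper output.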
Relevant literature
- https://medium.com/analytics-vidhya/simple-text-summarization-using-nltk-eedc36ebaaf8
- https://stackabuse.com/text-summarization-with-nltk-in-python/