A haiku is a short poem originating from Japan. It follows very specific rules: it consists of three phrases, where the first line has 5 syllables, the second line has 7 syllables, and the last line 5 syllables again. Since spaCy can be run over any text dataset, I chose lines from the show South Park spoken by the character Cartman. He has quite a specific way of speaking, so I was curious whether that would show up in the haikus. Of course, you can use any other text dataset to get different results. See GitHub for the code.
Part of this dataset was used: https://www.kaggle.com/datasets/tovarischsukhov/southparklines. Let’s first import the libraries.
# dataset: https://www.kaggle.com/datasets/tovarischsukhov/southparklines
# import libraries
import random
import spacy
import syllapy
from spacy.matcher import Matcher

# load the medium English model (provides the POS tags used below)
nlp = spacy.load("en_core_web_md")
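The matching code further down assumes a variable texts that holds all of Cartman's lines as a single string. The dataset itself is never loaded in the snippets here, so below is a minimal sketch of how that could look, assuming pandas is installed and the Kaggle CSV is named All-seasons.csv with Character and Line columns (check the file and column names in your own download).
# build the texts string used later on (file name and column names are assumptions)
import pandas as pd
df = pd.read_csv("All-seasons.csv")
cartman_lines = df.loc[df["Character"] == "Cartman", "Line"]
texts = " ".join(cartman_lines.astype(str))
# if the combined text is very long, nlp.max_length may need to be raised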
The first step is defining the patterns that we want to find in the text. Since the maximum number of syllables allowed in a line is 7, we do not need to consider long word patterns: the maximum chosen was 5 words and the minimum 2. There are many different types of patterns to choose from, and the choice has a big effect on the end result. This website has a useful overview of the Part Of Speech (POS) tags: https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/
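Before committing to specific patterns, it can help to see which POS tags spaCy assigns to an example sentence (the sentence below is only an illustration):
# print the POS tag of every token in a sample sentence
for token in nlp("Screw you guys, I'm going home"):
    print(token.text, token.pos_)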
# 2 words
match2words = Matcher(nlp.vocab)
pattern = [{"POS": {"IN": ["NOUN", "ADV", "ADJ", "ADP"]}},
           {"POS": {"IN": ["VERB", "NOUN"]}}]
match2words.add("2words", [pattern])

# 3 words
match3words = Matcher(nlp.vocab)
pattern = [{"POS": {"IN": ["NOUN", "ADV", "ADJ", "VERB", "ADP"]}},
           {"POS": {"IN": ["VERB", "NOUN"]}},
           {"POS": {"IN": ["VERB", "NOUN", "ADV", "ADJ"]}}]
match3words.add("3words", [pattern])

# 4 words
match4words = Matcher(nlp.vocab)
pattern = [{"POS": {"IN": ["NOUN", "ADV", "ADJ", "VERB", "ADP"]}},
           {"IS_ASCII": True, "IS_PUNCT": False},
           {"IS_ASCII": True, "IS_PUNCT": False},
           {"POS": {"IN": ["VERB", "NOUN", "ADV", "ADJ"]}}]
match4words.add("4words", [pattern])

# 5 words
match5words = Matcher(nlp.vocab)
pattern = [{"POS": {"IN": ["NOUN", "ADV", "ADJ", "VERB", "ADP"]}},
           {"IS_ASCII": True, "IS_PUNCT": False},
           {"IS_ASCII": True, "IS_PUNCT": False},
           {"POS": {"IN": ["VERB", "NOUN", "ADV", "ADJ"]}},
           {"POS": {"IN": ["VERB", "NOUN", "CONJ", "ADJ"]}}]
match5words.add("5words", [pattern])
# find and store matches
doc = nlp(texts)  # texts holds all of Cartman's lines as a single string
matches_2words = match2words(doc)
matches_3words = match3words(doc)
matches_4words = match4words(doc)
matches_5words = match5words(doc)
The last four lines of the code above run each matcher over the text and store the matches. At this point we do not know how many syllables each matched span has; that is determined in the next step. Haiku lines need either five or seven syllables, so those are the only ones we keep. The library syllapy is used to count them. Each span that meets the requirement is stored in either ‘five_syll’ or ‘seven_syll’.
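Note that syllapy counts syllables one word at a time, which is why the loop below sums the counts over every token in a span. A quick check on a single word and on the first matched two-word span (exact output depends on your dataset):
# syllapy works per word, so span counts are summed token by token
print(syllapy.count("satellite"))  # expected: 3
if matches_2words:
    match_id, start, end = matches_2words[0]
    span = doc[start:end]
    print(span.text, sum(syllapy.count(token.text) for token in span))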
# keep only matched spans with exactly five or seven syllables
five_syll = []
seven_syll = []
for match_id, start, end in matches_2words + matches_3words + matches_4words + matches_5words:
    span = doc[start:end]
    syllable_count = 0
    for token in span:
        syllable_count += syllapy.count(token.text)
    if syllable_count == 5:
        if span.text not in five_syll:
            five_syll.append(span.text)
    if syllable_count == 7:
        if span.text not in seven_syll:
            seven_syll.append(span.text)
# ~ generate a haiku ~
print(random.choice(five_syll) + '\n' + random.choice(seven_syll) + '\n' + random.choice(five_syll))
The last and most fun step is generating the haiku! For this the random library is used: we simply pick a random line from ‘five_syll’, then a random line from ‘seven_syll’, and finally one from ‘five_syll’ again. This works because we already know these lines fit the patterns and contain the correct number of syllables. If you want a handful of haikus in one go, the generation line can also be wrapped in a small helper, sketched below.
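A minimal sketch of such a helper (the function name and the number of haikus are just illustrative, and it assumes both lists are non-empty):
# hypothetical helper: print a few random haikus
def generate_haiku():
    return (random.choice(five_syll) + "\n"
            + random.choice(seven_syll) + "\n"
            + random.choice(five_syll))
for _ in range(3):
    print(generate_haiku())
    print()
Let's review some of the results.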
huge satellite dish
little brother is all right
cut off their life force
hippies playing drums
now my video game time
really blessed
Powerful stuff there
still just one little problem
actually worked
Yep, it works quite well (sometimes). If you have seen the first episode, you know that a satellite dish is a fitting thing for Cartman to mention. Many generated haikus are odd and, since the lines are picked at random, meaningless most of the time. The patterns could be improved and the dataset pruned to get better results. Still, I'm quite happy with these results for such a short amount of time!