By Rachit Kumbalaparambil
In this tutorial, I'll be using a topic modeling technique called Latent Dirichlet Allocation (LDA) to identify themes in popular song lyrics over time.
LDA is a form of unsupervised learning that uses a probabilistic model of language to generate "topics" from a set of documents, in our case, song lyrics.
Each document is modeled as a Bag of Words (BoW), meaning we only keep the words that were used along with their frequencies, without any information on word order. LDA treats each of these bags of words as a mixture of all topics, with a weighting for each topic.
Each topic is a list of words and their associated probabilities of occurrence, and LDA determines these topics by looking at which words tend to appear together.
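As a rough illustration (toy data, not from the actual dataset), a bag of words is just a mapping from each word to its count; word order is thrown away:

from collections import Counter

# Two toy "documents" (song snippets) reduced to bags of words
doc_a = "love love heart night love night".split()
doc_b = "dance party night dance dance".split()

print(Counter(doc_a))  # Counter({'love': 3, 'night': 2, 'heart': 1})
print(Counter(doc_b))  # Counter({'dance': 3, 'party': 1, 'night': 1})

# LDA would model each of these count vectors as a mixture of topics,
# e.g. doc_a leaning toward a "love" topic and doc_b toward a "party" topic.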
Topic modeling is in theory very useful in this case, since we don't have labeled data (genre is not specified), and what we really want to do is identify the themes the lyrics talk about. It would be difficult to analyze themes like loneliness, heartbreak, or love from genre labels alone. Musical genres are by no means hard boundaries either, as many artists may not fit into any one category.
The Data
The dataset consists of the Billboard Top 100 songs in the US for each year from 1959 to 2023. The features we will be using are:
- Lyrics
- Title, Artist
- Year
- Unique word count
The data was web scraped from https://billboardtop100of.com/ and the lyrics were pulled from the Genius API (learn more here: https://genius.com/developers).
It was contributed to Kaggle by Brian Blakely and released under the MIT license.
To build a good topic model, it was important to pre-process the text. The raw lyrics contained a large number of stop words, common words like "the," "and," "is," etc., which carry little to no semantic meaning. In this project, I also chose to filter out additional words that weren't providing meaning, in an effort to improve the model.
The steps I took in pre-processing the text were as follows:
- Tokenization (splitting the text into words)
- Lowercasing
- Removing punctuation
- Removing stop words (standard + custom list)
- Lemmatization (reducing words to their base form, e.g. "running" to "run")
The main thing to consider at this step, when you apply your own pre-processing, is how the model will be affected when you remove certain words. That will depend a lot on your specific application, so act accordingly.
In my case, I iteratively chose the words I wanted to remove by running the pre-processing, then creating a document-term matrix and examining the top ~30 words. From those, I selected words that don't provide semantic meaning and added them to my custom set of stop words.
Words were also added to this list after running the LDA algorithm and examining the topics, removing words whose appearance in every topic highlighted their lack of semantic meaning.
The following is the custom list I ended up using for the final model, as well as the code I used to create and examine the document-term matrix. This code is built off of code provided by my Text Analytics professor, Dr. Anglin at the University of Connecticut.
import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

stoplist = set(nltk.corpus.stopwords.words('english'))
custom_stop_words = {'na', 'got', 'let', 'come', 'ca', 'wan', 'gon',
                     'oh', 'yeah', 'ai', 'ooh', 'thing', 'hey', 'la',
                     'wo', 'ya', 'ta', 'like', 'know', 'u', 'uh',
                     'ah', 'as', 'yo', 'get', 'go', 'say', 'may',
                     'would', 'take', 'one', 'make', 'way', 'said',
                     'really', 'turn', 'cause', 'put', 'also',
                     'might', 'back', 'baby', 'ass', 'girl', 'boy',
                     'man', 'woman', 'round', 'every', 'ever'}
stoplist.update(custom_stop_words)
# make lyric_tokens a string of tokens instead of a list for CountVectorizer
df["lyric_tokens_str"] = df["lyric_tokens_list"].apply(lambda x: " ".join(x))

vec = CountVectorizer(lowercase=True, strip_accents="ascii")
X = vec.fit_transform(df["lyric_tokens_str"])

# X refers to the sparse matrix we saved as X. df is the original dataframe we created the matrix from.
matrix = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out(), index=df.index)

# top 10 most frequent words
matrix.sum().sort_values(ascending=False).head(10)
The following is the process_text function used to clean the data. I modified it for my purposes to include an argument for the custom stoplist.
def process_text(
    text: str,
    lower_case: bool = True,
    remove_punct: bool = True,
    remove_stopwords: bool = False,
    lemma: bool = False,
    string_or_list: str = "str",
    stoplist: set = None
):
    # tokenize text
    tokens = nltk.word_tokenize(text)
    if lower_case:
        tokens = [token.lower() if token.isalpha() else token for token in tokens]
    if remove_punct:
        tokens = [token for token in tokens if token.isalpha()]
    if remove_stopwords:
        tokens = [token for token in tokens if token not in stoplist]
    if lemma:
        tokens = [nltk.wordnet.WordNetLemmatizer().lemmatize(token) for token in tokens]
    if string_or_list != "list":
        doc = " ".join(tokens)
    else:
        doc = tokens
    return doc
An example of how this works on "Sexy And I Know It" by LMFAO:
Raw: "Yeah, yeah, when I walk on by, girls be looking like damn he's fly"
Processed: ['walk', 'girl', 'look', 'damn', 'fly']
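For context, here is a minimal sketch of how the function could be applied to build the token column used below (this assumes the raw lyrics live in a column named "lyrics"; the column name is an assumption, so adjust it to your data):

# Assumes a raw "lyrics" column; the column name is hypothetical, not from the original post.
df["lyric_tokens_list"] = df["lyrics"].apply(
    lambda lyric: process_text(
        lyric,
        lower_case=True,
        remove_punct=True,
        remove_stopwords=True,
        lemma=True,
        string_or_list="list",
        stoplist=stoplist,
    )
)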
As mentioned, we set up for the model by creating a bag of words for each document, and a list of BoWs for the corpus:
from gensim.corpora import Dictionary

gensim_dictionary = Dictionary(df['lyric_tokens_list'])
gensim_dictionary.filter_extremes(no_below=313, no_above=0.60)
# no_below 313 (out of 6292 documents, ~5% of the total corpus)
# no_above 0.60

# Create a list of BoW representations for the corpus
corpus = [gensim_dictionary.doc2bow(doc) for doc in df['lyric_tokens_list']]
We filter out extremes of 5% and 60%, meaning we're filtering out words that appear in fewer than 5% of songs or in more than 60% of songs. These cutoffs were chosen iteratively, just like the custom stop word list. This is another point where you might make a different decision based on your data.
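One quick sanity check, not part of the original analysis but useful when tuning these cutoffs, is to compare the vocabulary size before and after filtering:

# Sketch: see how aggressive the filter_extremes cutoffs are.
unfiltered_dictionary = Dictionary(df['lyric_tokens_list'])
print(f"Vocabulary before filtering: {len(unfiltered_dictionary)}")
print(f"Vocabulary after filtering:  {len(gensim_dictionary)}")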
In fitting the model, I used Gensim's LdaModel and experimented with different numbers of topics (5 to 50). A for loop was used to build a model for 5, 10, 30, and 50 topics.
from gensim.models import LdaModel
from gensim.models import CoherenceModel

topic_range = [5, 10, 30, 50]
coherence_scores = []
lda_models = []

gensim_dictionary[0]  # required to initialize the dictionary's id2token mapping

for num_topics in topic_range:
    lda_model = LdaModel(
        corpus=corpus,
        id2word=gensim_dictionary,
        num_topics=num_topics,
        random_state=1)
    lda_models.append(lda_model)
    coherence_model = CoherenceModel(model=lda_model,
                                     texts=df['lyric_tokens_list'],
                                     dictionary=gensim_dictionary,
                                     coherence='c_v',
                                     processes=1)  # avoids a weird addition error
    coherence = coherence_model.get_coherence()
    coherence_scores.append(coherence)
    print(f"Coherence score for {num_topics} topics: {coherence}")
A key decision here is the number of topics you want to go with, and that is based on how many different models you create and evaluate. In my case, I fit four models and choose among them.
We evaluate the models we fit using their coherence scores, which measure the semantic similarity among the top words in a topic. In my case, the best performing model was the 30-topic model, with a coherence score of 0.408.
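A small helper, shown here as a sketch, picks the model with the highest coherence score automatically; it is equivalent to hard-coding lda_models[2] as done below:

# Index of the best coherence score determines the chosen model.
best_idx = coherence_scores.index(max(coherence_scores))
print(f"Best model: {topic_range[best_idx]} topics "
      f"(coherence = {coherence_scores[best_idx]:.3f})")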
Results
Let's examine the contents of the generated topics below. I used the following block of code to create a dataframe from the chosen final model, for ease of inspection.
list_of_topic_tables = []
final_model = lda_models[2]

for topic in final_model.show_topics(
    num_topics=-1, num_words=10, formatted=False
):
    list_of_topic_tables.append(
        pd.DataFrame(
            data=topic[1],
            columns=["Word" + "_" + str(topic[0]), "Prob" + "_" + str(topic[0])],
        )
    )

pd.set_option('display.max_columns', 500)
bigdf = pd.concat(list_of_topic_tables, axis=1)
bigdf
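As a follow-up, and not part of the original write-up, you can also inspect the topic mixture of an individual song, which makes the "composition of topics" idea from the introduction concrete. A minimal sketch:

# Sketch: topic mixture of a single song (the first document in the corpus).
doc_topics = final_model.get_document_topics(corpus[0], minimum_probability=0.05)
for topic_id, prob in sorted(doc_topics, key=lambda pair: -pair[1]):
    top_words = [word for word, _ in final_model.show_topic(topic_id, topn=5)]
    print(f"Topic {topic_id} ({prob:.2f}): {', '.join(top_words)}")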