
Topic Modeling Lyrics of Popular Songs



    By Rachit Kumbalaparambil

In this tutorial, I'll be using a topic modeling technique called Latent Dirichlet Allocation (LDA) to identify themes in popular song lyrics over time.

LDA is a form of unsupervised learning that uses a probabilistic model of language to generate "topics" from a set of documents, in our case, song lyrics.

Each document is modeled as a Bag of Words (BoW), meaning we simply have the words that were used along with their frequencies, without any information about word order. LDA sees each of these bags of words as a composition of all topics, with a weighting for each topic.
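
For intuition, here is a minimal sketch of what a bag of words looks like in code, using Python's built-in Counter on a made-up line of lyrics:

from collections import Counter

# A bag of words keeps only each word and how often it appears; word order is discarded.
lyric = "love me love me say that you love me"
bow = Counter(lyric.lower().split())
print(bow)
# Counter({'love': 3, 'me': 3, 'say': 1, 'that': 1, 'you': 1})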

Each topic is a list of words and their associated probabilities of occurrence, and LDA determines these topics by which words tend to appear together.

Topic modeling is in principle very useful in this case, as we don't have labeled data (genre is not specified), and what we really want to do is identify themes that the lyrics are talking about. It would be difficult to analyze themes like loneliness, heartbreak, or love from just genre-labeled data. Musical genres are by no means hard boundaries either, as most artists may not fit into any given category.

The Data

The dataset consists of the Billboard Top 100 songs in the US for each year from 1959 to 2023. The features we will be using include:

• Lyrics
• Title, Artist
• Year
• Unique word count

The data was web scraped from https://billboardtop100of.com/ and the lyrics were pulled from the Genius API (learn more here: https://genius.com/developers).

It was contributed to Kaggle by Brian Blakely and released under the MIT license.
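
As a rough sketch, loading the downloaded CSV into the dataframe used throughout this post might look like the following (the filename here is an assumption; use whatever the Kaggle download is actually called):

import pandas as pd

# Filename is an assumption - substitute the actual file downloaded from Kaggle.
df = pd.read_csv("billboard_lyrics_1959_2023.csv")
df.head()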

To build a good topic model, it was necessary to pre-process the text. The raw lyrics contained a large number of stop words, which are common words like "the," "and," "is," etc., that carry little to no semantic meaning. In this project, I also chose to filter out additional words that weren't providing meaning, in an effort to improve the model.

The steps I took in pre-processing the text were as follows:

• Tokenization (splitting up the text into words)
• Lowercasing
• Removing punctuation
• Removing stop words (standard + custom list)
• Lemmatization (reducing words to their base, e.g. "running" to "run")

The main thing to consider in this step is how removing certain words will affect the model. That depends a lot on your particular application, so act accordingly.

In my case, I chose words to remove iteratively: I ran the pre-processing, built a document-term matrix, and examined the top ~30 words. From these, I selected the words that don't provide semantic meaning and added them to my custom set of stop words.

Words were also added to this list after running the LDA algorithm and examining the topics, removing words that revealed their lack of semantic meaning by appearing in every topic.

The following is the custom list I ended up using for the final model, as well as the code I used to create and examine the document-term matrix. This code is built off of code provided by my Text Analytics professor, Dr. Anglin at the University of Connecticut.

import nltk
from sklearn.feature_extraction.text import CountVectorizer

# NLTK resources (stopwords, punkt, wordnet) need to be downloaded beforehand.
stoplist = set(nltk.corpus.stopwords.words('english'))
custom_stop_words = {'na', 'got', 'let', 'come', 'ca', 'wan', 'gon',
                     'oh', 'yeah', 'ai', 'ooh', 'thing', 'hey', 'la',
                     'wo', 'ya', 'ta', 'like', 'know', 'u', 'uh',
                     'ah', 'as', 'yo', 'get', 'go', 'say', 'may',
                     'would', 'take', 'one', 'make', 'way', 'said',
                     'really', 'turn', 'cause', 'put', 'also',
                     'might', 'back', 'baby', 'ass', 'girl', 'boy',
                     'man', 'woman', 'around', 'every', 'ever'}
stoplist.update(custom_stop_words)
# make lyric_tokens a string of tokens instead of a list for CountVectorizer
df["lyric_tokens_str"] = df["lyric_tokens_list"].apply(lambda x: " ".join(x))

vec = CountVectorizer(lowercase=True, strip_accents="ascii")
X = vec.fit_transform(df["lyric_tokens_str"])

# X refers to the sparse matrix we stored as X. df is the original dataframe we created the matrix from.
matrix = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out(), index=df.index)

# top 10 most frequent words
matrix.sum().sort_values(ascending=False).head(10)

The following is the process_text function used to clean the data. I modified it for my purposes to include an argument for the custom stoplist.

def process_text(
    text: str,
    lower_case: bool = True,
    remove_punct: bool = True,
    remove_stopwords: bool = False,
    lemma: bool = False,
    string_or_list: str = "str",
    stoplist: set = None
):

    # tokenize text
    tokens = nltk.word_tokenize(text)

    if lower_case:
        tokens = [token.lower() if token.isalpha() else token for token in tokens]

    if remove_punct:
        tokens = [token for token in tokens if token.isalpha()]

    if remove_stopwords:
        tokens = [token for token in tokens if token not in stoplist]

    if lemma:
        tokens = [nltk.wordnet.WordNetLemmatizer().lemmatize(token) for token in tokens]

    if string_or_list != "list":
        doc = " ".join(tokens)
    else:
        doc = tokens

    return doc

An example of how this would work on "Sexy And I Know It" by LMFAO:

Raw: "Yeah, yeah, when I walk on by, girls be looking like damn he's fly"
Processed: ['walk', 'girl', 'look', 'damn', 'fly']
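
To produce the lyric_tokens_list column used in the steps below, the function can be applied to the raw lyrics column, roughly like this (a sketch; the lyrics column name is an assumption):

# Sketch: build the token lists used later, assuming raw lyrics live in df["lyrics"].
df["lyric_tokens_list"] = df["lyrics"].apply(
    lambda text: process_text(
        text,
        lower_case=True,
        remove_punct=True,
        remove_stopwords=True,
        lemma=True,
        string_or_list="list",
        stoplist=stoplist,
    )
)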

As I mentioned, we set up the model by creating a bag of words for each document, and a list of BoWs for the corpus:

from gensim.corpora import Dictionary

gensim_dictionary = Dictionary(df['lyric_tokens_list'])
gensim_dictionary.filter_extremes(no_below=313, no_above=0.60)

# no_below 313 (out of 6292 songs, for ~5% of the total corpus)
# no_above 0.60

# Create a list of BoW representations for the corpus
corpus = [gensim_dictionary.doc2bow(doc) for doc in df['lyric_tokens_list']]

We filter out extremes of 5% and 60%, meaning we remove words that appear in fewer than 5% of songs or in more than 60% of songs. These cutoffs were chosen iteratively, much like the custom stop word list. This is another point where you might make a different decision based on your data.
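
If you want to see how much the filtering shrinks the vocabulary before committing to cutoffs, a quick check like this can help (a sketch using standard Gensim calls):

# Sketch: compare vocabulary size before and after filter_extremes.
check_dict = Dictionary(df['lyric_tokens_list'])
print(f"Tokens before filtering: {len(check_dict)}")
check_dict.filter_extremes(no_below=313, no_above=0.60)
print(f"Tokens after filtering: {len(check_dict)}")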

In fitting the model, I used Gensim's LdaModel and experimented with different numbers of topics (5 to 50). A for loop was used to build a model for 5, 10, 30, and 50 topics.

from gensim.models import LdaModel
from gensim.models import CoherenceModel

topic_range = [5, 10, 30, 50]

coherence_scores = []
lda_models = []

gensim_dictionary[0]  # required to initialize the dictionary's id2token mapping

for num_topics in topic_range:
    lda_model = LdaModel(
        corpus=corpus,
        id2word=gensim_dictionary,
        num_topics=num_topics,
        random_state=1)
    lda_models.append(lda_model)

    coherence_model = CoherenceModel(model=lda_model,
                                     texts=df['lyric_tokens_list'],
                                     dictionary=gensim_dictionary,
                                     coherence='c_v',
                                     processes=1)  # avoids a weird addition error
    coherence = coherence_model.get_coherence()
    coherence_scores.append(coherence)
    print(f"Coherence score for {num_topics} topics: {coherence}")

A key decision here is the number of topics you want to go with, and that is based on how many different models you create and evaluate. In my case, I fit four models and choose among them.

We evaluate the fitted models using their coherence scores, a measure of semantic similarity among the top words in a topic. In my case, the best performing model was the 30-topic model, with a coherence score of 0.408.
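
A quick plot of the coherence scores makes the comparison easier to eyeball (a sketch, assuming matplotlib is available):

import matplotlib.pyplot as plt

# Sketch: coherence score for each candidate number of topics.
plt.plot(topic_range, coherence_scores, marker="o")
plt.xlabel("Number of topics")
plt.ylabel("Coherence (c_v)")
plt.title("Coherence by number of topics")
plt.show()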

Results

Let's examine the contents of the generated topics below. I used the following block of code to create a dataframe of the chosen final model, for ease of inspection.

list_of_topic_tables = []
final_model = lda_models[2]

for topic in final_model.show_topics(
    num_topics=-1, num_words=10, formatted=False
):
    list_of_topic_tables.append(
        pd.DataFrame(
            data=topic[1],
            columns=["Word" + "_" + str(topic[0]), "Prob" + "_" + str(topic[0])],
        )
    )

pd.set_option('display.max_columns', 500)
bigdf = pd.concat(list_of_topic_tables, axis=1)
bigdf
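
From here, one natural next step is tagging each song with its most probable topic so that theme prevalence can be tracked over time. A rough sketch (the Year column name is an assumption based on the dataset description above):

# Sketch: assign each song its single most probable topic.
def dominant_topic(bow):
    doc_topics = final_model.get_document_topics(bow)
    return max(doc_topics, key=lambda pair: pair[1])[0]

df["dominant_topic"] = [dominant_topic(bow) for bow in corpus]

# Count how often each topic dominates, by year ("Year" column name is an assumption).
df.groupby("Year")["dominant_topic"].value_counts()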


