
    Topic Model Labelling with LLMs



    By: Petr Koráb*, Martin Feldkircher**,***, Viktoriya Teliha** (*Text Mining Stories, Prague, **Vienna School of International Studies, ***Centre for Applied Macroeconomic Analysis, Australia).

    Labelling of words produced by topic models requires domain expertise and may be subjective to the labeler. Especially when the number of topics grows large, it may be convenient to assign human-readable names to topics automatically with an LLM. Simply copying and pasting the results into UIs such as chatgpt.com is quite a “black-box” and unsystematic approach. A better option is to add topic labeling to the code with a documented labeler, which gives the engineer more control over the results and ensures reproducibility. This tutorial will explore in detail how to:

    • train a topic model with the fresh new Turftopic Python package
    • label topic model results with GPT-4o mini.

    We will train a state-of-the-art FASTopic model by Xiaobao Wu et al. [3], presented at last year’s NeurIPS. This model outperforms other competing models, such as BERTopic, in several key metrics (e.g., topic diversity) and has broad applications in business intelligence.

    1. Components of the Topic Modelling Pipeline

    Labelling is an essential part of the topic modelling pipeline because it bridges the model outputs with real-world decisions. The model assigns a number to each topic, but a business decision relies on a human-readable text label summarizing the typical words in each topic. Models are typically labelled by (1) labellers with domain expertise, usually following a well-defined labelling strategy, (2) LLMs, and (3) commercial tools. The path from raw data to decision-making via a topic model is illustrated in Image 1.

    Image 1. Components of the topic modeling pipeline.
    Source: adapted and extended from Kardos et al. [2].

    The pipeline starts with raw data, which is preprocessed and vectorized for the topic model. The model returns topics named with integers, together with their typical terms (single words or bigrams). The labeling layer replaces the integer in the topic name with a text label. The model user (product manager, customer care department, etc.) then works with the labelled terms to make data-informed decisions. In the following modeling example, we will follow this pipeline step by step.

    2. Data

    We will use FASTopic to classify customer complaint data into 10 topics. The example use case uses a synthetically generated Customer Care Email dataset available on Kaggle, licensed under the GPL-3 license. The prefiltered data covers 692 incoming emails to the customer care department and looks like this:

    Image 2. Customer Care Email dataset. Image by authors.

    2.1. Data preprocessing

    Text data is sequentially preprocessed in six steps. Numbers are removed first, followed by emojis. English stopwords are removed afterward, followed by punctuation. Extra tokens (such as company and person names) are removed in the next step, before lemmatization. Read more on text preprocessing for topic models in our previous tutorial.
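
    As an illustration only (this is not the code from that tutorial), a minimal helper covering the six steps might look like the sketch below. It assumes the nltk and emoji packages are installed, and EXTRA_TOKENS is a hypothetical list of company and person names:

    import re
    import string

    import emoji
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)

    STOPWORDS = set(stopwords.words("english"))
    EXTRA_TOKENS = {"acme", "john"}                       # hypothetical company/person names
    lemmatizer = WordNetLemmatizer()

    def clean_text(text: str) -> str:
        text = re.sub(r"\d+", "", text)                   # 1. remove numbers
        text = emoji.replace_emoji(text, replace="")      # 2. remove emojis
        tokens = [t for t in text.lower().split()
                  if t not in STOPWORDS]                  # 3. remove English stopwords
        tokens = [t.translate(str.maketrans("", "", string.punctuation))
                  for t in tokens]                        # 4. remove punctuation
        tokens = [t for t in tokens if t and t not in EXTRA_TOKENS]  # 5. remove extra tokens
        return " ".join(lemmatizer.lemmatize(t) for t in tokens)     # 6. lemmatize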

    First, we read the clean data and create the corpus list:

    import pandas as pd

    # Read the cleaned data
    data = pd.read_csv("data.csv", usecols=['message_clean'])

    # Create the corpus list
    docs = data["message_clean"].tolist()
    Image 3. Recommended cleaning pipeline for topic models. Image by authors.

    2.2. Bigram vectorization

    Next, we create a bigram tokenizer to process tokens as bigrams during model training. Bigram models provide more relevant information and identify key qualities and problems for business decisions better than single-word models (“delivery” vs. “poor delivery”, “stomach” vs. “sensitive stomach”, etc.).

    from sklearn.feature_extraction.text import CountVectorizer

    bigram_vectorizer = CountVectorizer(
        ngram_range=(2, 2),               # only bigrams
        max_features=1000                 # top 1,000 bigrams by frequency
    )
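
    As a quick illustration (a toy example, not part of the pipeline), we can check which bigrams a CountVectorizer of this kind extracts from a small made-up corpus:

    # Toy example: inspect the bigrams extracted by a bigram CountVectorizer
    demo_vectorizer = CountVectorizer(ngram_range=(2, 2))
    demo_vectorizer.fit(["poor delivery of the order", "sensitive stomach issues"])
    print(demo_vectorizer.get_feature_names_out())
    # ['delivery of' 'of the' 'poor delivery' 'sensitive stomach' 'stomach issues' 'the order']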

    3. Model training

    The FASTopic model is currently implemented in two Python packages:

    • Fastopic: the official package by X. Wu
    • Turftopic: a new Python package that brings many useful topic modeling features, including labeling with LLMs [2]

    We will use the Turftopic implementation because of the direct link between the model and the Namer that adds LLM labelling.

    Let’s set up the model and fit it to the data. It is essential to set a random state to ensure training reproducibility.

    from turftopic import FASTopic

    # Model specification
    topic_size = 10
    model = FASTopic(n_components = topic_size,       # train for 10 topics
                     vectorizer = bigram_vectorizer,  # generate bigrams in topics
                     random_state = 32).fit(docs)     # set random state

    # Prepare topic data from the corpus
    topic_data = model.prepare_topic_data(docs)

    Now, let’s prepare a dataframe with the topic IDs and the top 10 bigrams with the highest probability obtained from the model (the code is here).
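
    The linked code is not reproduced here; as a minimal sketch, the same topics_df() helper that we use for the labelled topics below can also produce this table before labeling (the topic names at this point are still the default, integer-based ones):

    # Sketch: topic IDs with their top bigrams, before any LLM labeling
    unlabeled_topics = model.topics_df()
    print(unlabeled_topics.head(10))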

    Image 4. Unlabeled topics in FASTopic. Image by authors.

    4. Topic labeling

    In the next step, we add text labels to the topic IDs with GPT-4o mini. With the code below, we label the topics and add a new column, topic_name, to the dataframe.

    from turftopic.namers import OpenAITopicNamer
    import os

    # OpenAI API key to access GPT-4o mini
    os.environ["OPENAI_API_KEY"] = ""

    # use the Namer to label the topic model with an LLM
    namer = OpenAITopicNamer("gpt-4o-mini")
    model.rename_topics(namer)

    # create a dataframe with labelled topics
    topics_df = model.topics_df()
    topics_df.columns = ['topic_id', 'topic_name', 'topic_words']

    # split the comma-separated topic words and explode into rows
    topics_df['topic_word'] = topics_df['topic_words'].str.split(',')
    topics_df = topics_df.explode('topic_word')
    topics_df['topic_word'] = topics_df['topic_word'].str.strip()

    # add a rank for each word within a topic
    topics_df['word_rank'] = topics_df.groupby('topic_id').cumcount() + 1

    # pivot to wide format
    wide = topics_df.pivot(index='word_rank',
                           columns=['topic_id', 'topic_name'], values='topic_word')

    Here is the table with the labeled topics after some additional transformations. It would be interesting to compare the LLM results with those of a company insider who is familiar with the company’s processes and customer base. The dataset is synthetic, though, so let’s rely on the GPT-4 labeling.

    Image 5. Labeled topics in FASTopic by GPT-4o mini. Image by authors.

    We can also visualize the labeled topics for a nicer presentation. The code for the bigram word cloud visualization, generated from the topics produced by the model, is here.
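
    The linked code is not reproduced here, but a minimal sketch of one possible version (assuming the wordcloud and matplotlib packages, and using equal weights for each topic’s top bigrams rather than the model probabilities) could look like this:

    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    # one word cloud per labelled topic, built from the exploded topics_df above
    for (topic_id, topic_name), group in topics_df.groupby(['topic_id', 'topic_name']):
        # equal frequency per bigram; underscores keep each bigram as a single token
        freqs = {bigram.replace(" ", "_"): 1 for bigram in group['topic_word']}
        cloud = WordCloud(background_color="white").generate_from_frequencies(freqs)
        plt.figure()
        plt.imshow(cloud, interpolation="bilinear")
        plt.axis("off")
        plt.title(topic_name)
    plt.show()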

    Image 6. Word cloud visualization of labeled topics by GPT-4o mini. Image by authors.

    Summary

    • The new Turftopic Python package links recent topic models with an LLM-based labeler for producing human-readable topic names.
    • The main benefits are: 1) independence from the labeler’s subjective experience, 2) the ability to label models with many topics that a human labeler might struggle to label on their own, and 3) more control over the code and reproducibility.
    • Topic labeling with LLMs has a wide range of applications in diverse areas. Read our recent paper [1] on topic modeling of central bank communication, where GPT-4 labeled the FASTopic model.
    • The labels are slightly different for each training run, even with a fixed random state. This is not caused by the Namer, but by the random processes in model training that output bigrams with probabilities in descending order. The differences between those probabilities are tiny, so each training run places a few new words in a topic’s top 10, which in turn affects the LLM labeler.

    The data and full code for this tutorial are here.

    Petr Korab is a Senior Data Analyst and Founder of Text Mining Stories with over eight years of experience in Business Intelligence and NLP.

    Sign up for our blog to get the latest news from the NLP industry!

    References

    [1] Feldkircher, M., Korab, P., Teliha, V. (2025). “What Do Central Bankers Talk About? Evidence From the BIS Archive,” CAMA Working Papers 2025-35, Centre for Applied Macroeconomic Analysis, Crawford School of Public Policy, The Australian National University.

    [2] Kardos, M., Enevoldsen, K. C., Kostkan, J., Kristensen-McLachlan, R. D., Rocca, R. (2025). Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers. Journal of Open Source Software, 10(111), 8183, https://doi.org/10.21105/joss.08183.

    [3] Wu, X., Nguyen, T., Ce Zhang, D., Yang Wang, W., Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. arXiv preprint arXiv:2405.17978.


