    Fine-Tune Your Topic Modeling Workflow with BERTopic

    By Team_AIBS News | August 13, 2025 | 8 Mins Read


    Topic modeling remains a vital tool in the AI and NLP toolbox. While large language models (LLMs) handle text exceptionally well, extracting high-level topics from massive datasets still requires dedicated topic modeling techniques. A typical workflow includes four core steps: embedding, dimensionality reduction, clustering, and topic representation.

    One of the most popular frameworks today is BERTopic, which simplifies each stage with modular components and an intuitive API. In this post, I’ll walk through practical adjustments you can make to improve clustering results and boost interpretability, based on hands-on experiments using the open-source 20 Newsgroups dataset, which is distributed under the Creative Commons Attribution 4.0 International license.

    Project Overview

    We’ll start with the default settings recommended in BERTopic’s documentation and progressively update specific configurations to highlight their effects. Along the way, I’ll explain the purpose of each module and how to make informed decisions when customizing them.

    Dataset Preparation

    We load a sample of 500 news documents.

    import random
    from datasets import load_dataset

    dataset = load_dataset("SetFit/20_newsgroups")
    random.seed(42)  # fixed seed so the same 500 documents are sampled on every run
    text_label = list(zip(dataset["train"]["text"], dataset["train"]["label_text"]))
    text_label_500 = random.sample(text_label, 500)

    Since the data originates from casual Usenet discussions, we apply cleaning steps to strip headers, remove clutter, and preserve only informative sentences.

    This preprocessing ensures higher-quality embeddings and a smoother downstream clustering process.

    import re

    def clean_for_embedding(text, max_sentences=5):
        lines = text.split("\n")
        # Drop quoted replies and header-like lines
        lines = [line for line in lines if not line.strip().startswith(">")]
        lines = [line for line in lines if not re.match(
            r"^\s*(from|subject|organization|lines|writes|article)\s*:", line, re.IGNORECASE)]
        text = " ".join(lines)
        # Collapse whitespace and remove runs of three or more "!" or "?"
        text = re.sub(r"\s+", " ", text).strip()
        text = re.sub(r"[!?]{3,}", "", text)
        # Keep the first few sentences that are long enough and not all caps
        sentence_split = re.split(r'(?<=[.!?]) +', text)
        sentence_split = [
            s for s in sentence_split
            if len(s.strip()) > 15 and not s.strip().isupper()
        ]
        return " ".join(sentence_split[:max_sentences])

    texts_clean = [clean_for_embedding(text) for text, _ in text_label_500]
    labels = [label for _, label in text_label_500]
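
    To make the effect of the cleaning step concrete, the sketch below applies the same header and quote rules to a made-up Usenet-style message. The sample text and the helper name strip_usenet_noise are illustrative assumptions, not part of the original pipeline, and the sentence filtering is left out for brevity.

```python
import re

# Header-like prefixes to drop, mirroring the pattern in clean_for_embedding.
HEADER_RE = re.compile(
    r"^\s*(from|subject|organization|lines|writes|article)\s*:", re.IGNORECASE
)

def strip_usenet_noise(text):
    """Drop quoted replies and header lines, then collapse whitespace."""
    kept = []
    for line in text.split("\n"):
        if line.strip().startswith(">"):  # quoted reply
            continue
        if HEADER_RE.match(line):         # header line
            continue
        kept.append(line)
    return re.sub(r"\s+", " ", " ".join(kept)).strip()

# Hypothetical sample message, invented for illustration.
sample = (
    "From: someone@example.com\n"
    "Subject: Re: disk upgrade\n"
    "> which controller do you use?\n"
    "I replaced the old drive last week.\n"
    "It has been stable since then.\n"
)
cleaned = strip_usenet_noise(sample)
print(cleaned)
```

    Only the two body sentences survive; the headers and the quoted reply are gone before anything reaches the embedding model.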

    Initial BERTopic Pipeline

    Using BERTopic’s modular design, we configure each component: SentenceTransformer for embeddings, UMAP for dimensionality reduction, HDBSCAN for clustering, and CountVectorizer + KeyBERT for topic representation.

    from bertopic import BERTopic
    from umap import UMAP
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer

    from sklearn.feature_extraction.text import CountVectorizer
    from bertopic.vectorizers import ClassTfidfTransformer
    from bertopic.representation import KeyBERTInspired

    # Step 1 - Extract embeddings
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

    # Step 2 - Reduce dimensionality
    umap_model = UMAP(n_neighbors=10, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

    # Step 3 - Cluster reduced embeddings
    hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

    # Step 4 - Tokenize topics
    vectorizer_model = CountVectorizer(stop_words="english")

    # Step 5 - Create topic representation
    ctfidf_model = ClassTfidfTransformer()

    # Step 6 - (Optional) Fine-tune topic representations with
    # a `bertopic.representation` model
    representation_model = KeyBERTInspired()

    # All steps together
    topic_model = BERTopic(
      embedding_model=embedding_model,          # Step 1 - Extract embeddings
      umap_model=umap_model,                    # Step 2 - Reduce dimensionality
      hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
      vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
      ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
      representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
    )
    topics, probs = topic_model.fit_transform(texts_clean)

    This setup yields only a few broad topics with noisy representations. This result highlights the need for fine-tuning to achieve more coherent results.

    Original discovered topics (Image generated by author)

    Parameter Tuning for Granular Topics

    n_neighbors from UMAP module

    UMAP is the dimensionality reduction module that reduces the original embeddings to lower-dimensional dense vectors. By adjusting UMAP’s n_neighbors, we control how locally or globally the data is interpreted during dimensionality reduction. Lowering this value uncovers finer-grained clusters and improves topic distinctiveness.

    umap_model_new = UMAP(n_neighbors=5, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
    topic_model.umap_model = umap_model_new
    topics, probs = topic_model.fit_transform(texts_clean)
    topic_model.get_topic_info()
    Topics discovered after setting UMAP’s n_neighbors parameter (Image generated by author)

    min_cluster_size and cluster_selection_method from HDBSCAN module

    HDBSCAN is the default clustering module in BERTopic. Lowering HDBSCAN’s min_cluster_size and switching the cluster_selection_method from “eom” to “leaf” further sharpens topic resolution. These settings help uncover smaller, more focused themes and balance the distribution across clusters.

    hdbscan_model_leaf = HDBSCAN(min_cluster_size=5, metric='euclidean', cluster_selection_method='leaf', prediction_data=True)
    topic_model.hdbscan_model = hdbscan_model_leaf
    topics, _ = topic_model.fit_transform(texts_clean)
    topic_model.get_topic_info()

    The number of clusters increases to 30 by setting cluster_selection_method to leaf and min_cluster_size to 5.

    Topics discovered after setting HDBSCAN’s related parameters (Image generated by author)
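
    When comparing settings like these, it helps to look beyond the raw topic count and check how many documents ended up as outliers and how evenly the clusters are sized. The helper below is a minimal sketch of such a check on a plain list of topic ids, where -1 marks outlier documents. The name summarize_assignments and the toy data are assumptions made for illustration; in the tutorial the input would be the topics list returned by fit_transform.

```python
from collections import Counter

def summarize_assignments(topics):
    """Summarize a list of topic ids, where -1 marks outlier documents."""
    counts = Counter(topics)
    outliers = counts.pop(-1, 0)
    sizes = sorted(counts.values(), reverse=True)
    return {
        "n_topics": len(sizes),
        "n_outliers": outliers,
        "outlier_share": outliers / len(topics) if topics else 0.0,
        "largest": sizes[0] if sizes else 0,
        "smallest": sizes[-1] if sizes else 0,
    }

# Toy assignments for illustration only.
toy_topics = [0, 0, 0, 1, 1, -1, 2, 2, -1, 0]
summary = summarize_assignments(toy_topics)
print(summary)
```

    A rising topic count paired with a sharply rising outlier share usually means the clustering has become too strict for the amount of data available.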

    Controlling Randomness for Reproducibility

    UMAP is inherently non-deterministic, meaning it can produce different results on each run unless you explicitly set a fixed random_state. This detail is often omitted in example code, so be sure to include it to ensure reproducibility.

    Similarly, if you’re using a third-party embedding API (like OpenAI), be cautious. Some APIs introduce slight variations on repeated calls. For reproducible outputs, cache embeddings and feed them directly into BERTopic.

    from bertopic.backend import BaseEmbedder
    import numpy as np

    class CustomEmbedder(BaseEmbedder):
        """Lightweight wrapper to call NVIDIA's embedding endpoint via the OpenAI SDK."""

        def __init__(self, embedding_model, client):
            super().__init__()
            self.embedding_model = embedding_model
            self.client = client

        def embed(self, documents, verbose=False):
            # BERTopic calls embed() on BaseEmbedder subclasses
            response = self.client.embeddings.create(
                input=documents,
                model=self.embedding_model,
                encoding_format="float",
                extra_body={"input_type": "passage", "truncate": "NONE"},
            )
            embeddings = np.array([item.embedding for item in response.data])
            return embeddings

    # `client` (an OpenAI SDK client) and `embedding_model_name` are assumed to be defined elsewhere
    embedder = CustomEmbedder(embedding_model_name, client)
    embeddings = embedder.embed(texts_clean)  # compute once, then reuse across runs
    topic_model.embedding_model = embedder
    topics, probs = topic_model.fit_transform(texts_clean, embeddings=embeddings)
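
    The advice to cache embeddings can be sketched with the standard library alone. The function below stores vectors on disk under a key derived from the input texts, so a second call with the same texts never reaches the embedding API. The names cached_embeddings and fake_embed, and the JSON file layout, are assumptions made for this illustration; fake_embed is a stand-in that returns toy vectors and counts how often it is called.

```python
import hashlib
import json
import os
import tempfile

def cached_embeddings(texts, embed_fn, cache_dir):
    """Return embeddings for `texts`, calling `embed_fn` only on a cache miss."""
    # Key the cache on the exact texts so changed inputs are re-embedded.
    key = hashlib.sha256(json.dumps(texts).encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, f"embeddings_{key[:16]}.json")
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    embeddings = [[float(x) for x in vec] for vec in embed_fn(texts)]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(embeddings, f)
    return embeddings

# Stand-in embedder for illustration only: toy vectors plus a call counter.
calls = {"n": 0}

def fake_embed(texts):
    calls["n"] += 1
    return [[float(len(t)), float(t.count(" "))] for t in texts]

cache_dir = tempfile.mkdtemp()
docs = ["first document", "second one here"]
first = cached_embeddings(docs, fake_embed, cache_dir)
second = cached_embeddings(docs, fake_embed, cache_dir)  # served from disk
print(first == second, calls["n"])
```

    Before passing cached vectors to fit_transform through the embeddings argument, convert them to a NumPy array, for example with np.asarray.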

    Every dataset domain may require different clustering settings for optimal results. To streamline experimentation, consider defining evaluation criteria and automating the tuning process. For this tutorial, we’ll use the cluster configuration that sets n_neighbors to 5, min_cluster_size to 5, and cluster_selection_method to “eom”. This combination strikes a balance between granularity and coherence.
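
    One simple way to automate the tuning is an exhaustive search over a small parameter grid, scored by whatever evaluation criterion you settle on. The sketch below shows only the search loop. The toy_evaluate function is a placeholder with an arbitrary formula, not a real quality measure; in practice it would fit a topic model with the given parameters and return a score such as topic coherence or outlier share.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Score every parameter combination and return (score, params), best first."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(params), params))
    results.sort(key=lambda item: item[0], reverse=True)
    return results

# Placeholder scoring function with an arbitrary formula, for illustration only.
def toy_evaluate(params):
    score = -abs(params["n_neighbors"] - 5) - abs(params["min_cluster_size"] - 5)
    if params["cluster_selection_method"] == "eom":
        score += 0.5
    return score

grid = {
    "n_neighbors": [5, 10, 15],
    "min_cluster_size": [5, 10, 15],
    "cluster_selection_method": ["eom", "leaf"],
}
ranked = grid_search(grid, toy_evaluate)
best_score, best_params = ranked[0]
print(best_score, best_params)
```

    Because each evaluation refits UMAP and HDBSCAN, keeping the grid small and reusing precomputed embeddings keeps the search affordable.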

    Improving Topic Representations

    Representation plays a crucial role in making clusters interpretable. By default, BERTopic generates unigram-based representations, which often lack sufficient context. In the next section, we’ll explore several techniques to enrich these representations and improve topic interpretability.

    N-gram range

    In BERTopic, CountVectorizer is the default tool to convert text data into bag-of-words representations. Instead of relying on generic unigrams, switch to bigrams or trigrams using ngram_range in CountVectorizer. This simple change adds much-needed context.

    Since we’re only updating the representation, BERTopic provides the update_topics function to avoid redoing the modeling all over again.

    topic_model.update_topics(texts_clean, vectorizer_model=CountVectorizer(stop_words="english", ngram_range=(2,3)))
    topic_model.get_topic_info()
    Topic representations using bigrams (Image generated by author)
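
    To see what an n-gram range of (2, 3) actually yields, the sketch below builds word n-grams by hand for a short token list. The word_ngrams function is a simplified illustration written for this post, not CountVectorizer’s own code, and the token list is made up.

```python
def word_ngrams(tokens, ngram_range):
    """Return space-joined word n-grams for every n in the inclusive range."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["hard", "disk", "controller", "card"]
unigrams = word_ngrams(tokens, (1, 1))
bi_and_trigrams = word_ngrams(tokens, (2, 3))
print(unigrams)
print(bi_and_trigrams)
```

    The unigram "disk" says little on its own, while "hard disk controller" points at a specific subject, which is why longer n-grams tend to read better as topic keywords.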

    Custom Tokenizer

    Some bigrams are still hard to interpret, e.g. 486dx 50, ac uk, dxf doc. For greater control, implement a custom tokenizer that filters n-grams based on part-of-speech patterns. This removes meaningless combinations and elevates the quality of your topic keywords.

    import spacy
    from typing import List

    class ImprovedTokenizer:
        def __init__(self):
            self.nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
            # Keep only the most meaningful syntactic bigram patterns
            self.MEANINGFUL_BIGRAMS = {
                ("ADJ", "NOUN"),
                ("NOUN", "NOUN"),
                ("VERB", "NOUN"),
            }

        def __call__(self, text: str, max_tokens=200) -> List[str]:
            doc = self.nlp(text[:3000])  # truncate long docs for speed
            tokens = [(t.text, t.lemma_.lower(), t.pos_) for t in doc if t.is_alpha]

            bigrams = []
            for i in range(len(tokens) - 1):
                word1, lemma1, pos1 = tokens[i]
                word2, lemma2, pos2 = tokens[i + 1]
                if (pos1, pos2) in self.MEANINGFUL_BIGRAMS:
                    # Join the lowercased lemmas so variants of a phrase are counted together
                    bigrams.append(f"{lemma1} {lemma2}")

            return bigrams

    topic_model.update_topics(docs=texts_clean, vectorizer_model=CountVectorizer(tokenizer=ImprovedTokenizer()))
    topic_model.get_topic_info()
    Topic representations with messy bigrams removed (Image generated by author)
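
    The filtering rule inside ImprovedTokenizer can be tried out without loading spaCy by feeding it tokens that are already tagged. The sketch below does that with hand-assigned part-of-speech tags. The tags and the filter_bigrams helper are assumptions for illustration, and a real tagger may label these words differently.

```python
# Part-of-speech pairs to keep, matching the patterns used in ImprovedTokenizer.
MEANINGFUL_BIGRAMS = {("ADJ", "NOUN"), ("NOUN", "NOUN"), ("VERB", "NOUN")}

def filter_bigrams(tagged_tokens):
    """Keep adjacent (lemma, pos) pairs whose POS pattern is in MEANINGFUL_BIGRAMS."""
    bigrams = []
    for (lemma1, pos1), (lemma2, pos2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (pos1, pos2) in MEANINGFUL_BIGRAMS:
            bigrams.append(f"{lemma1} {lemma2}")
    return bigrams

# Hand-assigned tags, for illustration only.
tagged = [
    ("the", "DET"), ("new", "ADJ"), ("driver", "NOUN"),
    ("fix", "VERB"), ("printer", "NOUN"), ("problem", "NOUN"),
    ("quickly", "ADV"),
]
kept = filter_bigrams(tagged)
print(kept)
```

    Pairs such as "the new" or "problem quickly" are dropped because their tag patterns are not in the allowed set, which is how the tokenizer weeds out uninformative combinations.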

    LLM

    Finally, you can integrate LLMs to generate coherent titles or summaries for each topic. BERTopic supports OpenAI integration directly or through custom prompting. These LLM-based summaries drastically improve explainability.

    import os
    import openai
    from bertopic.representation import OpenAI

    client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    topic_model.update_topics(texts_clean, representation_model=OpenAI(client, model="gpt-4o-mini", delay_in_seconds=5))
    topic_model.get_topic_info()

    The representations are now all meaningful sentences.

    Topic representations as LLM-generated sentences (Image generated by author)

    You can also write your own function for getting the LLM-generated title, and update it back to the topic model object by using the set_topic_labels function. Please refer to the example code snippet below.

    import os
    import openai
    from typing import Dict, List, Tuple
    from tqdm import tqdm

    def generate_topic_titles_with_llm(
        topic_model,
        docs: List[str],
        api_key: str,
        model: str = "gpt-4o"
    ) -> Dict[int, Tuple[str, str]]:
        client = openai.OpenAI(api_key=api_key)
        topic_info = topic_model.get_topic_info()
        topic_repr = {}
        # Skip the outlier topic (-1)
        topics = topic_info[topic_info.Topic != -1].Topic.tolist()

        for topic in tqdm(topics, desc="Generating titles"):
            indices = [i for i, t in enumerate(topic_model.topics_) if t == topic]
            if not indices:
                continue
            # Use the first document assigned to the topic as its example text
            top_doc = docs[indices[0]]

            prompt = f"""You are a helpful summarizer for topic clustering.
            Given the following text that represents a topic, generate:
            1. A short **title** for the topic (2–6 words)
            2. A one or two sentence **summary** of the topic.
            Text:
            \"\"\"
            {top_doc}
            \"\"\"
            """

            try:
                response = client.chat.completions.create(
                    model=model,
                    messages=[
                        {"role": "system", "content": "You are a helpful assistant for summarizing topics."},
                        {"role": "user", "content": prompt}
                    ],
                    temperature=0.5
                )
                output = response.choices[0].message.content.strip()
                lines = output.split('\n')
                title = lines[0].replace("Title:", "").strip()
                summary = lines[1].replace("Summary:", "").strip() if len(lines) > 1 else ""
                topic_repr[topic] = (title, summary)
            except Exception as e:
                print(f"Error with topic {topic}: {e}")
                topic_repr[topic] = ("[Error]", str(e))

        return topic_repr

    topic_repr = generate_topic_titles_with_llm(topic_model, texts_clean, os.environ["OPENAI_API_KEY"])
    # set_topic_labels expects one string per topic, so keep only the title
    topic_repr_dict = {
        topic: topic_repr.get(topic, ("Topic", ""))[0]
        for topic in topic_model.get_topic_info()["Topic"]
    }
    topic_model.set_topic_labels(topic_repr_dict)
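
    LLM replies do not always arrive as exactly two clean lines, so splitting on the first and second line can pick up a blank line or leftover markdown. The parser below is a minimal sketch of a more tolerant approach. The name parse_title_summary and the sample replies are assumptions for illustration, and it only covers the "Title:" and "Summary:" labels that the prompt above asks for.

```python
import re

def parse_title_summary(output):
    """Extract (title, summary) from an LLM reply, tolerating blank lines and bold markers."""
    title, summary = "", ""
    # Strip markdown emphasis and list numbering, then drop empty lines.
    lines = [
        re.sub(r"^\s*\d+[.)]\s*", "", line.replace("*", "")).strip()
        for line in output.splitlines()
    ]
    lines = [line for line in lines if line]
    for line in lines:
        lowered = line.lower()
        if lowered.startswith("title:"):
            title = line.split(":", 1)[1].strip()
        elif lowered.startswith("summary:"):
            summary = line.split(":", 1)[1].strip()
    # Fall back to positional parsing when the labels are missing.
    if not title and lines:
        title = lines[0]
    if not summary and len(lines) > 1:
        summary = " ".join(lines[1:])
    return title, summary

# Made-up replies for illustration.
labeled = "1. **Title:** Hard Drive Upgrades\n\n2. **Summary:** Users compare disk controllers."
unlabeled = "Disk Issues\nPeople discuss failing drives."
print(parse_title_summary(labeled))
print(parse_title_summary(unlabeled))
```

    Asking the model for structured output such as JSON is another option, at the cost of a slightly longer prompt.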

    Conclusion

    This guide outlined actionable strategies to boost topic modeling results using BERTopic. By understanding the role of each module and tuning parameters for your specific domain, you can achieve more focused, stable, and interpretable topics.

    Representation matters just as much as clustering. Whether it’s through n-grams, syntactic filtering, or LLMs, investing in better representations makes your topics easier to understand and more useful in practice.

    BERTopic also offers advanced modeling techniques beyond the basics covered here. In a future post, we’ll explore these capabilities in depth. Stay tuned!


