
    A Practical Guide to BERTopic for Transformer-Based Topic Modeling

    By Team_AIBS News | May 8, 2025 | 15 min read


    Topic modeling has a variety of use cases in the natural language processing (NLP) domain, such as document tagging, survey analysis, and content organization. It is an unsupervised learning technique, which makes it cost-effective because it reduces the resources required to collect human-annotated data. We will dive deeper into BERTopic, a popular Python library for transformer-based topic modeling, to help us process financial news faster and reveal how trending topics change over time.
    BERTopic consists of 6 core modules that can be customized to suit different use cases. In this article, we’ll examine and experiment with each module individually and explore how they work together to produce the end results.

    BERTopic: Transformer-Based Topic Modeling (unless otherwise noted, all images are by the author)

    At a high level, a typical BERTopic architecture consists of:

    • Embeddings: transform text into vector representations (i.e. embeddings) that capture semantic meaning using sentence-transformer models.
    • Dimensionality Reduction: reduce the high-dimensional embeddings to a lower-dimensional space while preserving important relationships, using techniques such as PCA and UMAP.
    • Clustering: group similar documents together based on their reduced-dimensionality embeddings to form distinct topics, using algorithms such as HDBSCAN and K-Means.
    • Vectorizers: after topic clusters are formed, vectorizers convert text into numerical features that can be used for topic analysis, for example the count vectorizer and the online vectorizer.
    • c-TF-IDF: calculate importance scores for words within and across topic clusters to identify key terms.
    • Representation Model: leverage semantic similarity between the embeddings of candidate keywords and the embeddings of documents to find the most representative topic keywords, using KeyBERT, LLM-based techniques, and others.

    Project Overview

    In this practical application, we will use topic modeling to identify trending topics in Apple financial news. Using NewsAPI, we collect daily top-ranked Apple stock news from Google Search and compile them into a dataset of 250 documents, with each document containing the financial news for one specific day. However, data collection is not the main focus of this article, so feel free to replace it with your own dataset. The objective is to demonstrate how to transform raw text documents containing top Google search results into meaningful topic keywords and refine those keywords to be more representative.


    BERTopic’s 6 Fundamental Modules

    1. Embeddings

    BERTopic uses sentence transformer models as its first building block, converting sentences into dense vector representations (i.e. embeddings) that capture semantic meaning. These models are based on transformer architectures like BERT and are specifically trained to produce high-quality sentence embeddings. We can then compute the semantic similarity between sentences using the cosine distance between their embeddings. Common models include:

    • all-MiniLM-L6-v2: lightweight, fast, good general performance.
    • BAAI/bge-base-en-v1.5: larger model with stronger semantic understanding, at the cost of much slower training and inference.

    There is a wide range of pre-trained sentence transformers to choose from on the “Sentence Transformer” website and the Hugging Face model hub. We can use a few lines of code to load a sentence transformer model and encode text sequences into high-dimensional numerical embeddings.

    from sentence_transformers import SentenceTransformer
    
    # Initialize model
    model = SentenceTransformer("all-MiniLM-L6-v2")
    
    # Convert sentences to embeddings
    sentences = ["First sentence", "Second sentence"]
    embeddings = model.encode(sentences)  # Returns numpy array of embeddings
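
    To make the cosine comparison mentioned above concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and their names are hypothetical stand-ins for real sentence embeddings, which have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumed values, not model output)
emb_earnings = [0.9, 0.1, 0.0]
emb_revenue = [0.8, 0.2, 0.1]
emb_weather = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(emb_earnings, emb_revenue)
sim_far = cosine_similarity(emb_earnings, emb_weather)
print(sim_close > sim_far)  # vectors pointing the same way score higher
```

    Cosine distance is simply one minus this similarity, so semantically related sentences end up with a small distance.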

    In this example, we feed a set of financial news data from October 2024 to March 2025 into a sentence transformer. As shown in the result below, these text documents are transformed into vector embeddings with 250 rows, each with 384 dimensions (384 is the output size of all-MiniLM-L6-v2; bge-base-en-v1.5 produces 768-dimensional vectors).

    embeddings result

    We can then feed this sentence transformer into the BERTopic pipeline and keep all other modules at their default settings.

    from sentence_transformers import SentenceTransformer
    from bertopic import BERTopic
    
    emb_minilm = SentenceTransformer("all-MiniLM-L6-v2")
    topic_model = BERTopic(
        embedding_model=emb_minilm,
    )
    
    # docs: list of document strings (here, the 250 daily news documents)
    topic_model.fit_transform(docs)
    topic_model.get_topic_info()

    As the end result, we get the following topic representation.

    topic result

    Switching to the larger and more powerful “bge-base-en-v1.5” model gives the following result, which is slightly more meaningful than that of the smaller “all-MiniLM-L6-v2” model but still leaves plenty of room for improvement.

    One area for improvement is reducing the dimensionality, because sentence transformers typically produce high-dimensional embeddings. As BERTopic relies on comparing the spatial proximity of embeddings to form meaningful clusters, it is important to apply a dimensionality reduction technique so that the embedding space is less sparse. Therefore, we introduce various dimensionality reduction techniques in the next section.

    2. Dimensionality Reduction

    After converting the financial news documents into embeddings, we face the problem of high dimensionality. Since each embedding contains 384 dimensions, the vector space becomes too sparse for the distance between two embeddings to be meaningful. Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) are common techniques for reducing dimensionality while preserving as much of the structure in the data as possible. We will look at UMAP, BERTopic’s default dimensionality reduction technique, in more detail. It is a non-linear algorithm adapted from topological data analysis that seeks the manifold structure within the data. It works by extending a radius outwards from each data point and connecting each point with its close neighbors. You can explore UMAP visually on the website “Understanding UMAP”.

    UMAP n_neighbors Experimentation

    An important UMAP parameter is n_neighbors, which controls how UMAP balances local and global structure in the data. Low values of n_neighbors force UMAP to concentrate on local structure, while large values make it look at larger neighborhoods around each point.
    The diagram below shows several scatterplots demonstrating the effect of different n_neighbors values, with each plot visualizing the embeddings in a 2-dimensional space after applying UMAP dimensionality reduction.

    With smaller n_neighbors values (e.g. n=2, n=5), the plots show more tightly coupled micro clusters, indicating a focus on local structure. As n_neighbors increases (towards n=100, n=150), the points form more cohesive global patterns, demonstrating how larger neighborhood sizes help UMAP capture broader relationships in the data.
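
    The effect of neighborhood size can be illustrated without UMAP itself. The sketch below is a deliberate simplification: it builds a plain, unweighted k-nearest-neighbor graph over hypothetical 1-D points and counts its connected components, whereas UMAP builds a weighted graph and then optimizes a low-dimensional layout. The function name and data are assumptions made for illustration.

```python
def knn_graph_components(points, k):
    """Count connected components of an undirected k-nearest-neighbor graph over 1-D points."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        # Indices of all other points, closest first
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: abs(points[i] - points[j]),
        )
        for j in others[:k]:
            parent[find(i)] = find(j)  # link point i with each of its k nearest neighbors

    return len({find(i) for i in range(n)})

# Hypothetical data: four tight pairs, the last pair far from the rest
points = [0.0, 0.1, 1.0, 1.1, 2.0, 2.1, 10.0, 10.1]
components_by_k = {k: knn_graph_components(points, k) for k in (1, 2, 4)}
print(components_by_k)
```

    With k=1 every point only links to its closest partner, so the graph falls apart into small local groups; with larger k the groups merge into one connected structure, which mirrors the local-versus-global trade-off described above.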

    UMAP experimentation

    UMAP min_dist Experimentation

    The min_dist parameter in UMAP controls how tightly points are allowed to be packed together in the lower-dimensional representation. It sets the minimum distance between points in the embedding space. A smaller min_dist allows points to be packed very closely together, while a larger min_dist forces points to be more scattered and evenly spread out. The diagram below shows an experiment with min_dist values from 0.0001 to 1 at n_neighbors=5. When min_dist is set to smaller values, UMAP emphasizes preserving local structure, while larger values spread the embeddings into a circular shape.

    UMAP experimentation

    We set n_neighbors=5 and min_dist=0.01 based on the hyperparameter tuning results, as this forms more distinct data clusters that are easier for the subsequent clustering model to process.

    import umap
    
    UMAP_N = 5
    UMAP_DIST = 0.01
    umap_model = umap.UMAP(
        n_neighbors=UMAP_N,
        min_dist=UMAP_DIST, 
        random_state=0
    )

    3. Clustering

    Following the dimensionality reduction module comes the process of grouping embeddings in close proximity into clusters. This process is fundamental to topic modeling, as it groups related text documents together by looking at their semantic relationships. BERTopic employs the HDBSCAN model by default, which has the advantage of capturing structures with varying densities. Additionally, BERTopic provides the flexibility to choose other clustering models based on the nature of the dataset, such as K-Means (for spherical, equally-sized clusters) or agglomerative clustering (for hierarchical clusters).

    HDBSCAN Experimentation

    We will explore how two important parameters, min_cluster_size and min_samples, affect the behavior of the HDBSCAN model.
    min_cluster_size determines the minimum number of data points allowed to form a cluster; clusters not meeting the threshold are treated as outliers. If min_cluster_size is set too low, you might get many small, unstable clusters that may be noise. If it is set too high, you might merge several clusters into one, losing their distinct characteristics.

    min_samples sets the k used when measuring the distance between a point and its k-th nearest neighbor, which determines how strict the cluster formation process is. The larger the min_samples value, the more conservative the clustering becomes, as clusters are restricted to forming in dense areas and sparse points are classified as noise.
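
    The k-th nearest neighbor distance behind min_samples can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the hdbscan library’s code: the helper name core_distances and the toy points are hypothetical, and whether a point counts itself among its k neighbors is an implementation detail that this sketch resolves by excluding it.

```python
import math

def core_distances(points, min_samples):
    """Distance from each point to its min_samples-th nearest neighbor (the point itself excluded)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[min_samples - 1])
    return result

# Hypothetical data: four points in a dense group plus one isolated point
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
core_k1 = core_distances(points, min_samples=1)
core_k3 = core_distances(points, min_samples=3)
print(core_k1)
print(core_k3)
```

    Points in the dense group keep a small distance even for k=3, while the isolated point’s distance stays large. Raising min_samples therefore pushes sparse points further away from everything else, which is why they are more likely to end up as noise.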

    The condensed tree is a useful tool to help us decide on appropriate values for these two parameters. Clusters that persist over a wide range of lambda values (shown on the left vertical axis of a condensed tree plot) are considered stable and more meaningful. We prefer the selected clusters to be both tall (more stable) and wide (large cluster size). We use condensed_tree_ from HDBSCAN to test min_cluster_size from 3 to 50, then visualize the data points in their vector space, color coded by the predicted cluster labels. As we progress through different min_cluster_size values, we can identify optimal values that group close data points together.

    In this experiment, we selected min_cluster_size=15 as it generates 4 clusters (highlighted in pink in the condensed tree plot below) with good stability and cluster size. Additionally, the scatterplot indicates reasonable cluster formation based on proximity and density.

    Condensed Trees for HDBSCAN min_cluster_size Experimentation
    Scatterplots for HDBSCAN min_cluster_size Experimentation

    We then perform the same exercise to test min_samples from 1 to 80 and select min_samples=5. As you can observe from the visuals, the parameters min_samples and min_cluster_size exert distinct impacts on the clustering process.

    Condensed Trees for HDBSCAN min_samples Experimentation
    Scatterplots for HDBSCAN min_samples Experimentation

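    import hdbscan
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer
    
    # The larger embedding model compared in the Embeddings section
    emb_bge = SentenceTransformer("BAAI/bge-base-en-v1.5")
    
    MIN_CLUSTER_SIZE = 15
    MIN_SAMPLES = 5
    # No random_state here: hdbscan.HDBSCAN does not document such a parameter
    clustering_model = hdbscan.HDBSCAN(
        min_cluster_size=MIN_CLUSTER_SIZE,
        metric='euclidean',
        cluster_selection_method='eom',
        min_samples=MIN_SAMPLES,
    )
    
    topic_model = BERTopic(
        embedding_model=emb_bge,
        umap_model=umap_model,  # UMAP model configured in the previous section
        hdbscan_model=clustering_model,
    )
    
    topic_model.fit_transform(docs)  # docs: list of document strings
    topic_model.get_topic_info()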

    K-Means Experimentation

    Compared to HDBSCAN, K-Means clustering lets us generate more granular topics by specifying the n_clusters parameter, which directly controls the number of topics generated from the text documents.

    This image shows a series of scatter plots demonstrating different clustering results when varying the number of clusters (n_clusters) from 3 to 50 using K-Means. With n_clusters=3, the data is divided into just three large groups. As n_clusters increases (5, 8, 10, and so on), the data points are split into more granular groupings. Overall, K-Means forms more rounded clusters than HDBSCAN. We selected n_clusters=8, where the clusters are neither too broad (losing important distinctions) nor too granular (creating artificial divisions). Additionally, it is a reasonable number of topics for categorizing 250 days of financial news. However, feel free to adjust the code snippet to your requirements if you need to identify more granular or broader topics.

    Scatterplots for K-Means n_clusters Experimentation

    from sklearn.cluster import KMeans
    
    N_CLUSTER = 8
    clustering_model = KMeans(
        n_clusters=N_CLUSTER,
        random_state=0
    )
    
    topic_model = BERTopic(
        embedding_model=emb_bge,
        umap_model=umap_model,
        hdbscan_model=clustering_model, 
    )
    
    topic_model.fit_transform(docs)
    topic_model.get_topic_info()

    Comparing the topic cluster results of K-Means and HDBSCAN shows that K-Means produces more distinct and meaningful topic representations. However, both methods still generate many stop words, indicating that the subsequent modules are necessary to refine the topic representations.

    HDBSCAN Output
    K-Means Output

    4. Vectorizer

    The previous modules group documents into semantically similar clusters. Starting from this module, the main focus is to fine-tune the topics by choosing more representative and meaningful keywords. BERTopic offers various Vectorizer options, from the basic CountVectorizer to the more advanced OnlineCountVectorizer, which incrementally updates topic representations. For this exercise, we will experiment with CountVectorizer, a text processing tool that creates a matrix of token counts from a set of documents. Each row in the matrix represents a document and each column represents a term from the vocabulary, with the values showing how many times each term appears in each document. This matrix representation enables machine learning algorithms to process the text data mathematically.
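
    The document-term matrix described above can be sketched in plain Python. This is a simplified illustration rather than scikit-learn’s implementation: the three example headlines are made up, and tokens are split on whitespace without any further normalization.

```python
from collections import Counter

# Hypothetical example documents
docs = [
    "apple stock rises",
    "apple unveils new iphone",
    "stock market falls as apple stock dips",
]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary term
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

    Each row can now be compared or aggregated numerically; for instance, the column for “stock” records that the word appears twice in the third document.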

    Vectorizer Experimentation

    We will go through a few important parameters of CountVectorizer and see how they affect the topic representations.

    • ngram_range specifies how many words to combine into topic phrases. It is particularly useful for documents consisting of short phrases, which is not the case here.
      Example output if we set ngram_range=(1, 3):
    0                -1_apple nasdaq aapl_apple stock_apple nasdaq_nasdaq aapl   
    1  0_apple warren buffett_apple stock_berkshire hathaway_apple nasdaq aapl   
    2           1_apple nasdaq aapl_nasdaq aapl apple_apple stock_apple nasdaq   
    3              2_apple aapl stock_apple nasdaq aapl_apple stock_aapl stock   
    4           3_apple nasdaq aapl_cramer apple aapl_apple nasdaq_apple stock 
    • stop_words determines whether stop words are removed from the topics; removing them significantly improves topic representations.
    • min_df and max_df determine the frequency thresholds for terms to be included in the vocabulary. min_df sets the minimum number of documents a term must appear in, while max_df sets the maximum document frequency above which terms are considered too common and discarded.
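
    The document-frequency thresholds can be illustrated with a small stand-alone sketch. The helper filter_by_document_frequency and the toy documents are hypothetical; the sketch assumes min_df is given as an absolute document count and max_df as a fraction of documents, which is one common way of using the two thresholds.

```python
def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms that appear in at least min_df documents and in at most a max_df share of them."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )

# Hypothetical tokenized documents
tokenized_docs = [
    ["apple", "stock", "rises"],
    ["apple", "iphone", "launch"],
    ["apple", "stock", "buffett"],
    ["apple", "tariff", "stock"],
    ["apple", "iphone", "sales"],
]
kept = filter_by_document_frequency(tokenized_docs, min_df=2, max_df=0.8)
print(kept)
```

    Here “apple” appears in every document and is dropped by max_df=0.8, while words that occur in only one document are dropped by min_df=2.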

    We explore the effect of adding CountVectorizer with max_df=0.8 (i.e. ignore words appearing in more than 80% of the documents) to both the HDBSCAN and K-Means models from the previous step.

    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer_model = CountVectorizer(
        max_df=0.8,
        stop_words="english"
    )
    
    topic_model = BERTopic(
        embedding_model=emb_bge,
        umap_model=umap_model,
        hdbscan_model=clustering_model, 
        vectorizer_model=vectorizer_model
    )

    Both show improvements after introducing CountVectorizer, which significantly reduces keywords that appear frequently in all documents and bring no additional value, such as “aapl”, “stock”, and “apple”.

    HDBSCAN Output with Vectorizer
    K-Means Output with Vectorizer

    5. c-TF-IDF

    While the Vectorizer module focuses on adjusting the topic representation at the document level, c-TF-IDF primarily looks at the cluster level to reduce words that occur frequently across clusters. This is achieved by merging all documents belonging to one cluster into a single document and calculating keyword importance based on the traditional TF-IDF approach.
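
    As a rough sketch of this idea, the class-based score of a term t in cluster c can be written as tf(t, c) × log(1 + A / f(t)), where tf(t, c) is the frequency of t within the cluster, f(t) is its frequency across all clusters, and A is the average number of words per cluster. The plain-Python version below is an illustration under assumptions rather than BERTopic’s implementation: the function name, the whitespace tokenization, and the toy clusters are all hypothetical.

```python
import math
from collections import Counter

def c_tf_idf(docs_by_cluster):
    """Score terms per cluster as tf(term, cluster) * log(1 + A / f(term))."""
    # Treat all documents of a cluster as one long document
    cluster_counts = {
        cluster: Counter(token for doc in docs for token in doc.split())
        for cluster, docs in docs_by_cluster.items()
    }
    total_per_term = Counter()
    for counts in cluster_counts.values():
        total_per_term.update(counts)
    avg_words = sum(total_per_term.values()) / len(cluster_counts)  # A

    scores = {}
    for cluster, counts in cluster_counts.items():
        n_words = sum(counts.values())
        scores[cluster] = {
            term: (count / n_words) * math.log(1 + avg_words / total_per_term[term])
            for term, count in counts.items()
        }
    return scores

# Hypothetical clusters of short documents
docs_by_cluster = {
    0: ["apple iphone sales", "apple iphone launch"],
    1: ["apple buffett stake", "apple berkshire buffett"],
}
scores = c_tf_idf(docs_by_cluster)
top_term_0 = max(scores[0], key=scores[0].get)
top_term_1 = max(scores[1], key=scores[1].get)
print(top_term_0, top_term_1)
```

    Although “apple” is as frequent as “iphone” inside the first cluster, it also appears in the second one, so its score is pulled down and the cluster-specific word ranks first.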

    c-TF-IDF Experimentation

    • reduce_frequent_words: determines whether to down-weight frequently occurring words across topics.
    • bm25_weighting: when set to True, uses BM25 weighting instead of standard TF-IDF, which can better handle variations in document length. In smaller datasets, this variant can be more robust to stop words.

    We use the following code snippet to add c-TF-IDF (with bm25_weighting=True) to our BERTopic pipeline.

    from bertopic.vectorizers import ClassTfidfTransformer
    
    ctfidf_model = ClassTfidfTransformer(bm25_weighting=True)
    topic_model = BERTopic(
        embedding_model=emb_bge,
        umap_model=umap_model,
        hdbscan_model=clustering_model, 
        vectorizer_model=vectorizer_model,
        ctfidf_model=ctfidf_model
    )

    The topic cluster outputs below show that adding c-TF-IDF has no major impact on the end results when CountVectorizer has already been added. This is probably because our CountVectorizer already sets a high bar by eliminating words that appear in more than 80% of documents. This in turn reduces overlapping vocabulary at the topic cluster level, which is what c-TF-IDF is meant to achieve.

    HDBSCAN Output with Vectorizer and c-TF-IDF
    K-Means Output with Vectorizer and c-TF-IDF

    However, if we replace CountVectorizer with c-TF-IDF, the result below shows slight improvements compared to when neither is added, but too many stop words remain, making the topic representations less valuable. Therefore, it appears that for the documents in this scenario, the c-TF-IDF module does not bring additional value.

    HDBSCAN Output with c-TF-IDF only
    K-Means Output with c-TF-IDF only

    6. Representation Model

    The last module is the representation model, which we observed to have a significant impact on tuning the topic representations. Instead of using a frequency-based approach like the Vectorizer and c-TF-IDF, it leverages semantic similarity between the embeddings of candidate keywords and the embeddings of documents to find the most representative topic keywords. This can result in more semantically coherent topic representations and fewer near-synonymous keywords. BERTopic also offers various customization options for representation models, including but not limited to the following:

    • KeyBERTInspired: uses the KeyBERT technique to extract topic words based on semantic similarity.
    • ZeroShotClassification: takes advantage of open-source transformers in the Hugging Face model hub to assign labels to topics.
    • MaximalMarginalRelevance: reduces synonyms in topics (e.g. stock and stocks).
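
    To show how maximal marginal relevance trims near-synonyms, here is a minimal plain-Python sketch of the general technique. It is not BERTopic’s code: the function names, the diversity weighting, and the two-dimensional keyword vectors are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(topic_vec, candidates, top_n, diversity):
    """Greedily pick keywords that are relevant to the topic but dissimilar to earlier picks."""
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < top_n:
        def score(word):
            relevance = cosine(remaining[word], topic_vec)
            # Similarity to the closest keyword that was already selected
            redundancy = max(
                (cosine(remaining[word], candidates[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

# Hypothetical 2-D keyword embeddings (assumed values, not model output)
topic_vec = [1.0, 0.2]
candidates = {
    "stock": [1.0, 0.1],
    "stocks": [1.0, 0.15],
    "dividend": [0.5, 0.9],
}
without_diversity = mmr(topic_vec, candidates, top_n=2, diversity=0.0)
with_diversity = mmr(topic_vec, candidates, top_n=2, diversity=0.7)
print(without_diversity, with_diversity)
```

    With diversity set to 0, the two near-identical keywords are both selected because they are the most relevant. With a higher diversity weight, the second pick is penalized for resembling the first, so a different keyword takes its place.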

    KeyBERTInspired Experimentation

    We found KeyBERTInspired to be a very cost-effective approach, as it significantly improves the end result with just a few extra lines of code and no extensive hyperparameter tuning.

    from bertopic.representation import KeyBERTInspired
    
    representation_model = KeyBERTInspired()
    
    topic_model = BERTopic(
        embedding_model=emb_bge,
        umap_model=umap_model,
        hdbscan_model=clustering_model, 
        vectorizer_model=vectorizer_model,
        representation_model=representation_model
    )

    After incorporating the KeyBERT-inspired representation model, we observe that both models generate noticeably more coherent and valuable themes.

    HDBSCAN Output with KeyBERTInspired
    K-Means Output with KeyBERTInspired

    Take-Home Message

    This article explores the BERTopic technique and its implementation for topic modeling, detailing its six key modules with practical examples that use Apple stock market news data to demonstrate each component’s impact on the quality of topic representations.

    • Embeddings: use transformer-based embedding models to convert documents into numerical representations that capture semantic meaning and contextual relationships in text.
    • Dimensionality Reduction: use UMAP or other dimensionality reduction techniques to reduce high-dimensional embeddings while preserving both the local and global structure of the data.
    • Clustering: compare the HDBSCAN (density-based) and K-Means (centroid-based) clustering algorithms for grouping similar documents into coherent topics.
    • Vectorizers: use CountVectorizer to create document-term matrices and refine topics with a statistical approach.
    • c-TF-IDF: update topic representations by analyzing term frequency at the cluster level (topic class) and down-weight words that are common across different topics.
    • Representation Model: refine topic keywords using semantic similarity, with options like KeyBERTInspired and MaximalMarginalRelevance for better topic descriptions.


