    Contextual Topic Modelling in Chinese Corpora with KeyNMF | by Márton Kardos | Jan, 2025

    By Team_AIBS News | January 14, 2025 | 8 min read


    A comprehensive guide on getting the most out of your Chinese topic models, from preprocessing to interpretation.

    Towards Data Science

    With our recent paper on discourse dynamics in European Chinese diaspora media, our team has tapped into an almost unanimous frustration with the quality of topic modelling approaches when applied to Chinese data. In this article, I will introduce you to our novel topic modelling method, KeyNMF, and show how to apply it most effectively to Chinese textual data.

    Before diving into practicalities, I would like to give you a brief introduction to topic modelling theory, and motivate the advancements introduced in our paper.

    Topic modelling is a discipline of Natural Language Processing for uncovering latent topical information in textual corpora in an unsupervised manner. This information is then presented to the user in a human-interpretable form (usually 10 keywords for each topic).

    There are many ways to formalize this task in mathematical terms, but one rather popular conceptualization of topic discovery is matrix factorization. This is a rather natural and intuitive way to tackle the problem, and in a minute, you will see why. The primary insight behind topic modelling as matrix factorization is the following: words that frequently occur together are likely to belong to the same latent structure. In other words: terms whose occurrence is highly correlated are part of the same topic.

    You can discover topics in a corpus by first constructing a bag-of-words matrix of documents. A bag-of-words matrix represents documents in the following way: each row corresponds to a document, while each column corresponds to a unique word from the model’s vocabulary. The values in the matrix are then the number of times a word occurs in a given document.

    Schematic Overview of Non-negative Matrix Factorization

    This matrix can be decomposed into the product of a document-topic matrix, which indicates how important a given topic is for a given document, and a topic-term matrix, which indicates how important a word is for a given topic. One method for this decomposition is Non-negative Matrix Factorization (NMF), where we decompose a non-negative matrix into two other non-negative matrices, instead of allowing arbitrary signed values.
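The decomposition can be sketched with scikit-learn's NMF on a tiny count matrix. The counts below are invented, and the variable names `document_topic` and `topic_term` are chosen here only to mirror the terminology of the text.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy bag-of-words matrix: 4 documents x 6 words (counts invented for illustration).
X = np.array([
    [3, 2, 0, 0, 1, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 3, 2, 0, 1],
    [0, 1, 2, 3, 0, 0],
], dtype=float)

nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=1000)
document_topic = nmf.fit_transform(X)  # shape (4 documents, 2 topics)
topic_term = nmf.components_           # shape (2 topics, 6 words)

reconstruction = document_topic @ topic_term  # non-negative approximation of X
print(document_topic.shape, topic_term.shape)
print(np.round(reconstruction, 1))
```

Both factors contain only non-negative values, so every entry can be read as "how much" of a topic is present rather than as a signed loading.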

    NMF is not the only method one can use for decomposing the bag-of-words matrix. A method of high historical significance, Latent Semantic Analysis, utilizes Truncated Singular-Value Decomposition for this purpose. NMF, however, is generally a better choice, as:

    1. The discovered latent factors are of a different quality from those of other decomposition methods. NMF typically discovers localized patterns or parts in the data, which are easier to interpret.
    2. Non-negative topic-term and document-topic relations are easier to interpret than signed ones.
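The second point can be illustrated with a small comparison, assuming scikit-learn's TruncatedSVD as a stand-in for the decomposition used in Latent Semantic Analysis. The count matrix is invented; the sketch only checks whether the term loadings of each method contain negative values.

```python
import numpy as np
from sklearn.decomposition import NMF, TruncatedSVD

# Toy count matrix (invented); every document shares at least one word with the others.
X = np.array([
    [3, 2, 1, 0, 1],
    [2, 3, 0, 1, 1],
    [1, 0, 3, 2, 1],
    [0, 1, 2, 3, 1],
], dtype=float)

svd_terms = TruncatedSVD(n_components=2, random_state=0).fit(X).components_
nmf_terms = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=1000).fit(X).components_

print("SVD has negative loadings:", bool((svd_terms < 0).any()))
print("NMF has negative loadings:", bool((nmf_terms < 0).any()))
```

A negative loading in the SVD factors has no direct reading as "this word belongs to this topic", whereas NMF loadings can be ranked and displayed as keyword lists directly.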

    Using NMF with just BoW matrices, however attractive and simple it may be, does come with its setbacks:

    1. NMF typically minimizes the Frobenius norm of the error matrix. This entails an assumption of Gaussianity of the outcome variable, which is obviously false, as we are modelling word counts.
    2. BoW representations are just word counts. This means that words won’t be interpreted in context, and syntactical information will be ignored.
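For reference, the objective mentioned in the first point can be written as below, where X is the document-term matrix, W the document-topic matrix and H the topic-term matrix (the symbol names are chosen here for illustration). Minimizing a sum of squared errors corresponds to maximum likelihood under additive Gaussian noise, which is where the Gaussianity assumption comes from.

```latex
\min_{W \ge 0,\; H \ge 0} \; \lVert X - WH \rVert_F^2
  = \sum_{i,j} \bigl( X_{ij} - (WH)_{ij} \bigr)^2
```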

    To account for these limitations, and with the help of modern transformer-based language representations, we can significantly improve NMF for our purposes.

    The key intuition behind KeyNMF is that most words in a document are semantically insignificant, and we can get an overview of topical information in the document by highlighting the top N most relevant terms. We will select these terms by using contextual embeddings from sentence-transformer models.

    A Schematic Overview of the KeyNMF Model

    The KeyNMF algorithm consists of the following steps:

    1. Embed each document using a sentence-transformer, along with all words in the document.
    2. Calculate cosine similarities of word embeddings to document embeddings.
    3. For each document, keep the N words with the highest positive cosine similarities to the document.
    4. Arrange cosine similarities into a keyword matrix, where each row is a document, each column is a keyword, and values are cosine similarities of the word to the document.
    5. Decompose the keyword matrix with NMF.
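The steps above can be sketched in a few lines of NumPy and scikit-learn. This is an illustrative approximation, not the Turftopic implementation: the helper `keyword_matrix`, the random `word_emb` vectors, the `doc_word_mask` occurrence table and the averaged `doc_emb` are all hypothetical stand-ins for what a sentence-transformer and a vectorizer would provide in step 1.

```python
import numpy as np
from sklearn.decomposition import NMF


def keyword_matrix(doc_emb, word_emb, doc_word_mask, top_n):
    """Steps 2-4: build a document-by-word matrix of top-N positive cosine similarities."""
    doc_unit = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    word_unit = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    sim = doc_unit @ word_unit.T             # step 2: cosine similarities, shape (n_docs, n_words)
    sim = np.where(doc_word_mask, sim, 0.0)  # only consider words that occur in the document
    sim = np.clip(sim, 0.0, None)            # step 3: discard negative similarities
    keywords = np.zeros_like(sim)
    for i, row in enumerate(sim):
        top = np.argsort(row)[-top_n:]       # step 3: indices of the top_n highest similarities
        keywords[i, top] = row[top]          # step 4: store similarities as matrix entries
    return keywords


# Step 1 stand-in: random vectors instead of sentence-transformer embeddings (assumption).
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(8, 16))          # 8 vocabulary words, 16-dimensional embeddings
doc_word_mask = np.array([                   # which words occur in which of the 5 documents
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1],
], dtype=bool)
# Toy document embedding: the mean of the embeddings of the words it contains.
doc_emb = (doc_word_mask.astype(float) @ word_emb) / doc_word_mask.sum(axis=1, keepdims=True)

keywords = keyword_matrix(doc_emb, word_emb, doc_word_mask, top_n=2)

# Step 5: decompose the keyword matrix with NMF.
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
document_topic = nmf.fit_transform(keywords)
topic_term = nmf.components_
print(keywords.round(2))
print(document_topic.shape, topic_term.shape)
```

Because at most `top_n` entries per row are non-zero, the matrix passed to NMF is much sparser than a full bag-of-words matrix, and its values are continuous similarities rather than counts.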

    This formulation helps us in multiple ways: a) we substantially reduce the model’s vocabulary, thereby having fewer parameters, resulting in faster and better model fit; b) we get a continuous distribution, which is a better fit for NMF’s assumptions; and c) we incorporate contextual information into our topic model.

    Now that you understand how KeyNMF works, let’s get our hands dirty and apply the model in a practical context.

    Preparation and Data

    First, let’s install the packages we are going to use in this demonstration:

    pip install turftopic[jieba] datasets sentence_transformers topicwizard

    Then let’s get some openly available data. I chose to go with the SIB200 corpus, as it is freely available under the CC-BY-SA 4.0 open license. This piece of code will fetch us the corpus.

    from datasets import load_dataset

    # Load the Simplified Chinese portion of the dataset
    ds = load_dataset("Davlan/sib200", "zho_Hans", split="all")
    corpus = ds["text"]

    Building a Chinese Topic Model

    There are a number of tricky aspects to applying language models to Chinese, since most of these systems are developed and tested on English data. When it comes to KeyNMF, there are two aspects that need to be taken into account.

    Components of a Topic Modelling Pipeline in Turftopic

    Firstly, we will need to figure out how to tokenize texts in Chinese. Luckily, the Turftopic library, which contains our implementation of KeyNMF (among other things), comes prepackaged with tokenization utilities for Chinese. Normally, you would use a CountVectorizer object from sklearn to extract words from text. We added a ChineseCountVectorizer object that uses the Jieba tokenizer in the background, and has an optionally usable Chinese stop word list.

    from turftopic.vectorizers.chinese import ChineseCountVectorizer

    vectorizer = ChineseCountVectorizer(stop_words="chinese")
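To see why a dedicated tokenizer is needed, the sketch below runs scikit-learn's default CountVectorizer analyzer on a Chinese sentence. The example sentence is invented for illustration. Since Chinese does not separate words with spaces, the default word-boundary pattern has nothing to split on inside a clause.

```python
from sklearn.feature_extraction.text import CountVectorizer

# scikit-learn's default analyzer splits on word boundaries,
# which Chinese text does not mark with spaces.
analyzer = CountVectorizer().build_analyzer()
tokens = analyzer("我喜欢滑雪，也喜欢旅行")  # "I like skiing, and I also like travelling"
print(tokens)  # each clause comes back as a single token rather than as separate words
```

A segmenter such as Jieba splits these clauses into individual words, which is what makes word-level keyword extraction possible.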

    Then we will need a Chinese embedding model for producing document and word representations. We will use the paraphrase-multilingual-MiniLM-L12-v2 model for this, as it is quite compact and fast, and was specifically trained to be used in multilingual retrieval contexts.

    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    We can then build a fully Chinese KeyNMF model! I will initialize a model with 20 topics and N=25 (a maximum of 25 keywords will be extracted for each document).

    from turftopic import KeyNMF

    model = KeyNMF(
        n_components=20,
        top_n=25,
        vectorizer=vectorizer,
        encoder=encoder,
        random_state=42,  # Setting seed so that our results are reproducible
    )

    We can then fit the model to the corpus and see what results we get!

    document_topic_matrix = model.fit_transform(corpus)
    model.print_topics()
    Topic ID │ Highest Ranking
    ─────────┼──────────────────────────────────────────────────────────────────────────────
           0 │ 旅行, 非洲, 徒步旅行, 漫步, 活动, 通常, 发展中国家, 进行, 远足, 徒步
           1 │ 滑雪, 活动, 滑雪板, 滑雪运动, 雪板, 白雪, 地形, 高山, 旅游, 滑雪者
           2 │ 会, 可能, 他们, 地球, 影响, 北加州, 并, 它们, 到达, 船
           3 │ 比赛, 选手, 锦标赛, 大回转, 超级, 男子, 成绩, 获胜, 阿根廷, 获得
           4 │ 航空公司, 航班, 旅客, 飞机, 加拿大航空公司, 机场, 达美航空公司, 票价, 德国汉莎航空公司, 行李
           5 │ 原子核, 质子, 能量, 电子, 氢原子, 有点像, 原子弹, 氢离子, 行星, 粒子
           6 │ 疾病, 传染病, 疫情, 细菌, 研究, 病毒, 病原体, 蚊子, 感染者, 真菌
           7 │ 细胞, cella, 小房间, cell, 生物体, 显微镜, 单位, 生物, 最小, 科学家
           8 │ 卫星, 望远镜, 太空, 火箭, 地球, 飞机, 科学家, 卫星电话, 电话, 巨型
           9 │ 猫科动物, 动物, 猎物, 狮子, 狮群, 啮齿动物, 鸟类, 狼群, 行为, 吃
          10 │ 感染, 禽流感, 医院, 病毒, 鸟类, 土耳其, 病人, h5n1, 家禽, 医护人员
          11 │ 抗议, 酒店, 白厅, 抗议者, 人群, 警察, 保守党, 广场, 委员会, 政府
          12 │ 旅行者, 文化, 耐心, 国家, 目的地, 适应, 人们, 水, 旅行社, 国外
          13 │ 速度, 英里, 半英里, 跑步, 公里, 跑, 耐力, 月球, 变焦镜头, 镜头
          14 │ 原子, 物质, 光子, 微小, 粒子, 宇宙, 辐射, 组成, 亿, 而光
          15 │ 游客, 对, 地区, 自然, 地方, 旅游, 时间, 非洲, 开车, 商店
          16 │ 互联网, 网站, 节目, 大众传播, 电台, 传播, toginetradio, 广播剧, 广播, 内容
          17 │ 运动, 运动员, 美国, 体操, 协会, 支持, 奥委会, 奥运会, 发现, 安全
          18 │ 火车, metroplus, metro, metrorail, 车厢, 开普敦, 通勤, 绕城, 城内, 三等舱
          19 │ 投票, 投票箱, 信封, 选民, 投票者, 法国, 候选人, 签名, 透明, 箱内

    As you can see, we’ve already gained a sensible overview of what there is in our corpus! The topics are quite distinct, with some of them concerned with scientific subjects, such as astronomy (8), chemistry (5) or animal behaviour (9), while others are oriented towards leisure (e.g. 0, 1, 12) or politics (19, 11).
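Beyond the keyword table, the document-topic matrix itself can be summarized. The sketch below assumes that matrix is a dense array with one row per document and one column per topic, and uses a small invented stand-in so that it runs without fitting a model.

```python
import numpy as np

# Stand-in for the matrix returned by fit_transform: 4 documents x 3 topics (values invented).
document_topic_matrix = np.array([
    [0.9, 0.0, 0.1],
    [0.0, 0.7, 0.2],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.0],
])

dominant_topic = document_topic_matrix.argmax(axis=1)  # most important topic per document
topic_sizes = np.bincount(dominant_topic, minlength=document_topic_matrix.shape[1])

print(dominant_topic)  # topic index for each document
print(topic_sizes)     # number of documents assigned to each topic
```

Counting documents per dominant topic gives a rough sense of how the corpus is distributed over the discovered topics.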

    Visualization

    To gain further aid in understanding the results, we can use the topicwizard library to visually investigate the topic model’s parameters.

    Since topicwizard uses wordclouds, we will need to tell the library that it should be using a font that is compatible with Chinese. I took a font from the ChineseWordCloud repo, which we will download and then pass to topicwizard.

    import urllib.request
    import topicwizard

    # Download a font that can render Chinese characters in the wordclouds
    urllib.request.urlretrieve(
        "https://github.com/shangjingbo1226/ChineseWordCloud/raw/refs/heads/master/fonts/STFangSong.ttf",
        "./STFangSong.ttf",
    )
    topicwizard.visualize(
        corpus=corpus, model=model, wordcloud_font_path="./STFangSong.ttf"
    )

    This will open the topicwizard web app in a notebook or in your browser, with which you can interactively investigate your topic model:

    Investigating the relations of topics, documents and words in your corpus using topicwizard

    In this article, we’ve looked at what KeyNMF is, how it works, what it’s motivated by and how it can be used to discover high-quality topics in Chinese text, as well as how to visualize and interpret your results. I hope this tutorial will prove useful to those who are looking to explore Chinese textual data.

    For further information on the models, and how to improve your results, I encourage you to check out our Documentation. If you have any questions or encounter issues, feel free to submit an issue on Github, or reach out in the comments :))

    All figures presented in the article were produced by the author.


