Tokenization in NLP | by Sunita Rai | April 2025



Tokenization is a fundamental preprocessing step in Natural Language Processing (NLP). It involves breaking a sequence of text into smaller units called tokens. These tokens can be words, subwords, characters, or sentences, depending on the task at hand.

Tokenization helps NLP models understand and process the structure of language, turning raw text into manageable pieces for analysis. It is typically one of the first steps in text preprocessing, followed by other tasks such as stemming, lemmatization, and stopword removal.

Types of Tokenization

1. Word Tokenization:
• Breaks text into individual words.
• Useful when words are the primary unit of analysis.

Example:

• Input: “Natural Language Processing is fun.”
    • Output: [“Natural”, “Language”, “Processing”, “is”, “fun”]
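A quick way to approximate this in Python is a regular expression over word characters (a minimal sketch; real word tokenizers such as NLTK’s also handle punctuation and contractions, as shown later):

import re

text = "Natural Language Processing is fun."
print(re.findall(r"\w+", text))  # ['Natural', 'Language', 'Processing', 'is', 'fun']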

2. Sentence Tokenization:

• Divides text into sentences.
• Useful when the structure of sentences needs to be preserved or analyzed.

Example:

• Input: “Natural Language Processing is fun. NLP helps computers understand text.”
    • Output: [“Natural Language Processing is fun.”, “NLP helps computers understand text.”]
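A naive implementation splits on sentence-ending punctuation (a minimal sketch; it misfires on abbreviations like “Dr.”, which is why the library tokenizers shown later are preferred):

import re

text = "Natural Language Processing is fun. NLP helps computers understand text."
print(re.split(r"(?<=[.!?])\s+", text))
# ['Natural Language Processing is fun.', 'NLP helps computers understand text.']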

3. Character Tokenization:

• Breaks text down into individual characters.
• Useful for tasks that need fine-grained analysis, such as character-level text generation or spelling correction.

Example:

• Input: “Hello”
    • Output: [“H”, “e”, “l”, “l”, “o”]
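In Python, character tokenization is simply a list conversion:

text = "Hello"
print(list(text))  # ['H', 'e', 'l', 'l', 'o']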

4. Subword Tokenization:

• Breaks words down into smaller units, which is useful for handling rare words or misspellings.
• Popular in modern NLP models like BERT, GPT, etc.

Example:

• Input: “unhappiness”
• Output: [“un”, “happiness”] or [“un”, “##happiness”] (for models like BERT)

    Tokenization helps in:

• Text normalization: Converting text into a structured format.
• Word frequency analysis: Counting occurrences of words for tasks like text classification (a short example follows this list).
• Input for NLP models: Preparing text to be fed into models like RNNs, LSTMs, and Transformers.
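For instance, word frequency analysis takes only a few lines once the text is tokenized (a minimal sketch using collections.Counter over lowercased, whitespace-split tokens):

from collections import Counter

text = "NLP helps computers. NLP is fun."
tokens = text.lower().replace(".", "").split()
print(Counter(tokens).most_common(2))  # [('nlp', 2), ('helps', 1)]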

Let’s see how to perform tokenization in Python using popular libraries like nltk and spaCy.

1. Tokenization with NLTK (Natural Language Toolkit)

First, you need to install nltk. You can do this via pip if it’s not already installed:

pip install nltk

Code: Word Tokenization and Sentence Tokenization

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the tokenizer data on first use
nltk.download('punkt')

# Sample text
text = "Natural Language Processing is fun. NLP helps computers understand text."

# Word Tokenization
word_tokens = word_tokenize(text)
print("Word Tokens:", word_tokens)

# Sentence Tokenization
sentence_tokens = sent_tokenize(text)
print("Sentence Tokens:", sentence_tokens)

    Output:

Word Tokens: ['Natural', 'Language', 'Processing', 'is', 'fun', '.', 'NLP', 'helps', 'computers', 'understand', 'text', '.']

Sentence Tokens: ['Natural Language Processing is fun.', 'NLP helps computers understand text.']

    2. Tokenization with spaCy

To use spaCy, install it and its English model via:

pip install spacy
python -m spacy download en_core_web_sm

Code: Word Tokenization with spaCy

import spacy

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Natural Language Processing is fun. NLP helps computers understand text."

# Process the text
doc = nlp(text)

# Word Tokenization
word_tokens_spacy = [token.text for token in doc]
print("Word Tokens using spaCy:", word_tokens_spacy)

    Output:

Word Tokens using spaCy: ['Natural', 'Language', 'Processing', 'is', 'fun', '.', 'NLP', 'helps', 'computers', 'understand', 'text', '.']
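The same doc object also exposes sentence boundaries, so sentence tokenization with spaCy is a one-line follow-up using its sents iterator:

sentence_tokens_spacy = [sent.text for sent in doc.sents]
print("Sentence Tokens using spaCy:", sentence_tokens_spacy)
# ['Natural Language Processing is fun.', 'NLP helps computers understand text.']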

Subword tokenization is essential for modern NLP models like BERT or GPT, which handle rare words by breaking them into subword units. One popular library for this is Hugging Face’s Transformers.

Example: Subword Tokenization with Hugging Face’s Tokenizer

pip install transformers

from transformers import AutoTokenizer

# Load the tokenizer for the BERT model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Sample text
text = "unhappiness"

# Subword Tokenization
tokens = tokenizer.tokenize(text)
print("Subword Tokens:", tokens)

    Output:

Subword Tokens: ['un', '##happiness']

Here, the ## prefix indicates that happiness is a continuation of the preceding token un rather than the start of a new word.
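The tokenizer can also reassemble subword tokens back into a normal string, which is handy for inspecting model output (using Transformers’ convert_tokens_to_string on the tokens from above):

print(tokenizer.convert_tokens_to_string(tokens))  # unhappiness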

Best Practices

• Use appropriate tokenization: Depending on your task, you may prefer word-level, character-level, or subword-level tokenization.
• Handle special characters: Some tokenizers, like those in spaCy or Hugging Face, automatically handle punctuation and other symbols, while others may require custom handling.
• Pre-trained models: For complex tasks, consider using tokenizers from pre-trained models like BERT, GPT-2, etc., as they handle tokenization efficiently for their specific vocabularies (a short encoding sketch follows this list).
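In practice, feeding text to such a model goes through the tokenizer’s call interface, which returns vocabulary IDs rather than token strings (a minimal sketch reusing the BERT tokenizer loaded above):

encoded = tokenizer("Natural Language Processing is fun.")
print(encoded["input_ids"])  # IDs for [CLS], the subword tokens, and [SEP]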

Tokenization is a critical preprocessing step in NLP, enabling efficient handling of text for machine learning models. By splitting text into tokens (words, sentences, subwords, etc.), we can structure it in a way that machines can easily interpret and process. Different tokenization techniques are available depending on the complexity of the task and the NLP model being used.


