Tokenization is a fundamental preprocessing step in Natural Language Processing (NLP). It involves breaking a sequence of text into smaller units called tokens. These tokens can be words, subwords, characters, or sentences, depending on the task at hand.
Tokenization helps NLP models understand and process the structure of language, turning raw text into manageable pieces for analysis. It is often one of the first steps in text preprocessing, followed by other tasks such as stemming, lemmatization, and stopword removal.
Types of Tokenization
1. Word Tokenization:
- Breaks text into individual words.
- Useful when you are focusing on words as the primary unit of analysis.
Example:
- Input: "Natural Language Processing is fun."
- Output: ["Natural", "Language", "Processing", "is", "fun"]
2. Sentence Tokenization:
- Divides text into sentences.
- Useful when the structure of sentences needs to be preserved or analyzed.
Example:
- Input: "Natural Language Processing is fun. NLP helps computers understand text."
- Output: ["Natural Language Processing is fun.", "NLP helps computers understand text."]
3. Character Tokenization:
- Breaks text down into individual characters.
- Useful for tasks that need fine-grained analysis, such as character-level text generation or spelling correction (a minimal Python sketch follows this list).
Example:
- Input: "Hello"
- Output: ["H", "e", "l", "l", "o"]
4. Subword Tokenization:
- Breaks words down into smaller units, which can be helpful for handling rare words or misspellings.
- Popular in modern NLP models like BERT, GPT, and others.
Example:
- Input: "unhappiness"
- Output: ["un", "happiness"] or ["un", "##happiness"] (for models like BERT)
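Before moving to dedicated libraries, here is a minimal, library-free sketch of word-level and character-level tokenization using plain Python string methods (the whitespace split is a deliberate simplification and does not separate punctuation):
# Naive tokenization with the Python standard library only
text = "Natural Language Processing is fun."
word_tokens = text.split()   # word-level: split on whitespace (punctuation stays attached)
char_tokens = list("Hello")  # character-level: every character becomes a token
print("Word tokens:", word_tokens)
print("Character tokens:", char_tokens)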
Tokenization helps in:
- Text normalization: Converting text into a structured format.
- Word frequency analysis: Counting occurrences of words for tasks like text classification (see the short example after this list).
- Input for NLP models: Preparing text to be fed into models like RNNs, LSTMs, and Transformers.
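As a quick illustration of word frequency analysis, here is a small sketch that counts tokens with collections.Counter from the standard library (the lowercase/whitespace tokenization is again a simplification):
from collections import Counter
text = "NLP is fun and NLP is useful"
tokens = text.lower().split()       # naive tokenization for illustration
word_counts = Counter(tokens)       # count how often each token occurs
print(word_counts.most_common(2))   # [('nlp', 2), ('is', 2)]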
Let's see how to perform tokenization in Python using popular libraries like nltk and spaCy.
1. Tokenization with NLTK (Natural Language Toolkit)
First, you need to install nltk. You can do so via pip if it isn't already installed:
pip install nltk
Code: Word Tokenization and Sentence Tokenization
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# Download the tokenizer data (needed once)
nltk.download('punkt')
# Sample text
text = "Natural Language Processing is fun. NLP helps computers understand text."
# Word Tokenization
word_tokens = word_tokenize(text)
print("Word Tokens:", word_tokens)
# Sentence Tokenization
sentence_tokens = sent_tokenize(text)
print("Sentence Tokens:", sentence_tokens)
Output:
Word Tokens: ['Natural', 'Language', 'Processing', 'is', 'fun', '.', 'NLP', 'helps', 'computers', 'understand', 'text', '.']
Sentence Tokens: ['Natural Language Processing is fun.', 'NLP helps computers understand text.']
2. Tokenization with spaCy
To use spaCy, install it via:
pip install spacy
python -m spacy download en_core_web_sm
Code: Word Tokenization with spaCy
import spacy
# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")
# Sample text
text = "Natural Language Processing is fun. NLP helps computers understand text."
# Process the text
doc = nlp(text)
# Word Tokenization
word_tokens_spacy = [token.text for token in doc]
print("Word Tokens using spaCy:", word_tokens_spacy)
Output:
Word Tokens using spaCy: ['Natural', 'Language', 'Processing', 'is', 'fun', '.', 'NLP', 'helps', 'computers', 'understand', 'text', '.']
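The same doc object also exposes sentence boundaries through doc.sents, so sentence tokenization needs no extra setup beyond the pipeline loaded above; a short sketch:
# Sentence Tokenization with spaCy (reuses the `doc` from the code above)
sentence_tokens_spacy = [sent.text for sent in doc.sents]
print("Sentence Tokens using spaCy:", sentence_tokens_spacy)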
Subword tokenization is essential for modern NLP models like BERT and GPT, which handle rare words by breaking them into subword units. One popular library for this is Hugging Face's Transformers, which provides the tokenizers these models use.
Example: Subword Tokenization with Hugging Face's Tokenizer
pip install transformers
from transformers import AutoTokenizer
# Load the tokenizer for the BERT model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Sample text
text = "unhappiness"
# Subword Tokenization
tokens = tokenizer.tokenize(text)
print("Subword Tokens:", tokens)
Output:
Subword Tokens: ['un', '##happiness']
Here, the ## prefix indicates that "happiness" continues the word started by "un" rather than beginning a new word; the exact split depends on the model's vocabulary.
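To feed these tokens into a model they must be mapped to integer IDs; a brief sketch using the same tokenizer (the exact ID values depend on the model's vocabulary):
# Map subword tokens to vocabulary IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)
# Calling the tokenizer directly also adds special tokens such as [CLS] and [SEP]
encoded = tokenizer("unhappiness")
print("Input IDs with special tokens:", encoded["input_ids"])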
- Use appropriate tokenization: Depending on your task, you may prefer word-level, character-level, or subword-level tokenization.
- Handle special characters: Some tokenizers, like those in spaCy or Hugging Face, automatically handle punctuation and other symbols, while others may require custom handling (see the regex sketch after this list).
- Pre-trained models: For complex tasks, consider using tokenizers from pre-trained models like BERT, GPT-2, and others, since they handle tokenization consistently with their specific vocabularies.
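As a small illustration of custom handling, here is a regex-based sketch that keeps punctuation as separate tokens (it uses Python's standard re module and is not a substitute for a full tokenizer):
import re
text = "NLP helps computers understand text, doesn't it?"
# \w+ matches runs of word characters; [^\w\s] matches single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['NLP', 'helps', 'computers', 'understand', 'text', ',', 'doesn', "'", 't', 'it', '?']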
Tokenization is a critical preprocessing step in NLP, enabling efficient handling of text by machine learning models. By splitting text into tokens (words, sentences, subwords, and so on), we can structure it in a way that machines can easily interpret and process. Different tokenization techniques are available depending on the complexity of the task and the NLP model being used.