My first encounter with programming was through Python. And I'm glad it was, because it didn't scare me off as a molecular biologist analysing NMR data. Although I later moved on to machine learning with MATLAB, modules such as sklearn and TensorFlow have already brought me back to Python.
Now I'm learning NLP with Python. Bear with me.
Conventionally, we use programming languages such as Python, Java, C and so on to make a computer perform a task we want. With NLP, Natural Language Processing, computers can learn from and understand human language. The idea may have been harder to grasp a few years ago, apart from the vague memory of Star Trek's talking, all-knowing computer, but as we use generative chatbots such as ChatGPT, DeepSeek, Gemini and others more and more frequently, we understand it better.
So, how can we start using NLP for our own purposes? Take me, for example: I want to use NLP to understand amino acid sequences of proteins, with the goal of designing novel ones. But before reaching that stage, let's start with the basics, using Python.
We will use NLTK (Natural Language Toolkit), the main NLP library in Python.
import nltk
Let's take a set of 10 sentences to be analysed (I used ChatGPT to generate them):
corpus = [
"The quick brown fox jumps over the lazy dog.",
"I am learning Natural Language Processing.",
"Text preprocessing is a crucial step in NLP.",
"Tokenization breaks text into words.",
"Stopwords are common words that may be removed.",
"Lemmatization reduces words to their base form.",
"TF-IDF reflects word importance in documents.",
"This is an example sentence for testing.",
"We can extract features from text data.",
"Python is widely used in machine learning."
]
Before feeding raw text into a machine learning model, we typically perform several pre-processing steps to clean and standardize the input. These steps help convert unstructured text into a format that models can more easily understand and analyze. One of the fundamental steps in this process is tokenization.
Tokenization
Now we will start with the tokenization of these sentences, meaning we will divide the sentences into smaller units: words. So "The quick brown fox jumps over the lazy dog." will be tokenized as [‘The’, ‘quick’, ‘brown’, ‘fox’, ‘jumps’, ‘over’, ‘the’, ‘lazy’, ‘dog’, ‘.’]. First we download the pre-trained model for tokenizing text. It is called "punkt_tab": punkt after the German word for "point" (full stop), since the model predicts sentence boundaries using punctuation, and tab because its data is stored in a tabular, table-like format. Then we call the tokenizer function word_tokenize. When we print, we see that our sentences have been split into one-word pieces.
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
tokens=[word_tokenize(sentence) for sentence in corpus]
print(tokens)
After tokenization, we have a list of individual words (tokens), but not all of them carry meaningful or useful information for downstream tasks like classification, topic modeling and so on. Therefore, the next step is stop-word removal.
Stop-Word Removal
Words like "the," "is," "in," "and," or "of" are very common in English and appear frequently across texts, regardless of their content. These are known as stop words. They are not crucial for the meaning of the text, and since we want the model's attention to be on the important words, we remove them.
For example, for a spam classifier we don't want to rely on the word "is"; words like "free", "offer" and "click" are much more likely to indicate spam.
However, there are important exceptions. For instance, large language models (LLMs) rely heavily on context and sentence structure, so even stop words and conjunctions can play a crucial role in understanding and producing coherent text. In some tasks, such as sentiment analysis, words like "but" may significantly affect the interpretation of a sentence by signaling a contrast or shift in tone.
But our sentences are short and simple, so let's remove the stop words. First we download and import a set of predefined stop words for various languages. Setting the language to English, we keep only the tokens (filtered_tokens) that are not present in stop_words. So our set of tokens from the first example, [‘The’, ‘quick’, ‘brown’, ‘fox’, ‘jumps’, ‘over’, ‘the’, ‘lazy’, ‘dog’, ‘.’], will become [‘The’, ‘quick’, ‘brown’, ‘fox’, ‘jumps’, ‘lazy’, ‘dog’, ‘.’].
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words=set(stopwords.words('english'))
# tokens is a list of token lists (one per sentence), so filter each sentence separately
filtered_tokens=[[word for word in sentence_tokens if word not in stop_words] for sentence_tokens in tokens]
print(filtered_tokens)
After removing stop words, the next step is to reduce redundancy and bring consistency to the remaining tokens. That is where lemmatization comes in.
Lemmatization
Different forms of the same word, like "running," "ran," and "runs", can clutter the data and dilute meaning. To address this, we apply lemmatization, a process that converts words to their base or dictionary form, known as the lemma. This step is important for standardizing context, which is why search engines rely on it to improve search accuracy by matching different word forms to the same root concept.
However, a potential drawback of lemmatization is that it can reduce the contextual richness of the text. By collapsing all forms of a word into a single base form, we may lose nuances that are important for understanding tense, aspect, or intent, especially in tasks like sentiment analysis or text generation. So, let's see how we can apply lemmatization in practice.
First, we download WordNet, a large lexical database of English. In this database, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms called synsets. It provides definitions, examples, and relationships between words (like synonyms, antonyms, hypernyms, and so on). For each token, the lemmatizer function WordNetLemmatizer is called.
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()
# lemmatize each sentence's tokens, treating every word as a verb for now
lemmatized_tokens = [
    [lemmatizer.lemmatize(word, pos='v') for word in sentence_tokens]
    for sentence_tokens in filtered_tokens
]
print(lemmatized_tokens)
When we use the parameter pos='v', pos standing for part of speech, the lemmatizer treats each word as a verb when looking for its base form. So our tokens [‘The’, ‘quick’, ‘brown’, ‘fox’, ‘jumps’, ‘lazy’, ‘dog’] will become [‘The’, ‘quick’, ‘brown’, ‘fox’, ‘jump’, ‘lazy’, ‘dog’]. There are also pos='n', 'a' and 'r' to find the base noun, adjective and adverb forms respectively. However, given that our input texts usually include a variety of parts of speech, an adaptive approach that accounts for each kind is essential for accurate lemmatization. One effective way is to automatically map each word's POS tag to the corresponding root class, such as verb, noun, adjective, or adverb.
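As a small illustration of how the pos argument changes the result (a sketch; the exact lemmas come from WordNet's entries, so edge cases may differ slightly on your setup):
# How the pos argument affects the lemma returned
print(lemmatizer.lemmatize('running', pos='v'))  # 'run' (treated as a verb)
print(lemmatizer.lemmatize('running', pos='n'))  # 'running' (treated as a noun, left as-is)
print(lemmatizer.lemmatize('better', pos='a'))   # 'good' (adjective mapped to its base form)
print(lemmatizer.lemmatize('quickly', pos='r'))  # 'quickly' (adverb, no shorter base form)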
POS Tagging (Part-of-Speech Tagging)
POS tagging helps computers understand the grammatical role of a word as it is used (e.g., noun, verb, adverb and so on), so that context information can be parsed properly and further processing (like lemmatization or vectorization) can be more accurate.
nltk.download('averaged_perceptron_tagger_eng')
from nltk import pos_tag
# tag each sentence's tokens separately
pos_tags = [pos_tag(sentence_tokens) for sentence_tokens in tokens]
print(pos_tags)
In our simple example, the initial tokens are tagged as follows.
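For instance, the first sentence comes out roughly like this (the exact tags can vary slightly between NLTK versions):
# pos_tags[0], roughly:
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'),
#  ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]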
As you can see, all tags are correctly assigned except for "brown." Statistical taggers like pos_tag sometimes make mistakes with ambiguous words, and "brown" can be interpreted as either a noun or an adjective depending on context. A single incorrect tag, such as labeling "brown" as a noun (NN) instead of an adjective (JJ), may not significantly affect performance in simple tasks. However, if high accuracy is critical, we can either manually correct such tags in small datasets or use a more advanced model like spaCy, Stanza, or Flair for better context-aware tagging.
Once we have our POS-tagged tokens, they can be further classified into predefined categories such as people (e.g., Marie Curie), organizations (e.g., NASA), locations (e.g., Istanbul), dates (e.g., May 19, 2025) and so on. This process is called Named Entity Recognition (NER).
Named Entity Recognition (NER)
To categorize the POS-tagged tokens, we download NLTK's maximum entropy named entity chunker model ('maxent_ne_chunker_tab') together with a list of known English 'words'. The imported ne_chunk function, which performs named entity chunking, finds and groups named entities (like names, organizations, places) based on POS-tagged tokens.
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')
from nltk import ne_chunk
# chunk each POS-tagged sentence separately
chunked_text=[ne_chunk(sentence_tags) for sentence_tags in pos_tags]
print(chunked_text)
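Our toy corpus contains hardly any named entities, so here is a hedged sketch on a made-up sentence that reuses the examples above; the exact labels and grouping depend on the chunker model, but the output has roughly this shape:
# A hypothetical sentence with obvious named entities
example = "Marie Curie visited NASA in Istanbul on May 19, 2025."
example_tree = ne_chunk(pos_tag(word_tokenize(example)))
print(example_tree)
# Roughly: (S (PERSON Marie/NNP Curie/NNP) visited/VBD (ORGANIZATION NASA/NNP)
#             in/IN (GPE Istanbul/NNP) on/IN May/NNP 19/CD ,/, 2025/CD ./.)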
Additional Text Preprocessing Techniques
To further standardize the text input, we can convert all tokens to lowercase and remove special characters, extra spaces, punctuation and numbers when they are not relevant to the task. You can use basic Python string methods and the re module (regular expressions) to carry out these cleaning steps.
textual content="Pure language processing is a department of synthetic intelligence. %100"
# Lowercasing
textual content=textual content.decrease()
print(textual content)# Removes punctuations (comma, dot, and so forth.)
import re #common expression
textual content=re.sub(r'[^ws]','',textual content)
print(textual content)
# Removes numbers
textual content=re.sub(r'd+','',textual content)
print(textual content)
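Run in order, the three print statements produce roughly the following output: first only lowercasing, then the dot and percent sign disappear with the punctuation, and finally the digits are stripped.
# natural language processing is a branch of artificial intelligence. %100
# natural language processing is a branch of artificial intelligence 100
# natural language processing is a branch of artificial intelligence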
Vectorization
Once we have our text inputs clean and standardized, the next step is to turn them into numerical representations (vectors) that machine learning models can understand. Two popular approaches to vectorizing text are bag of words and term frequency-inverse document frequency (TF-IDF).
Bag of words first builds a vocabulary of all unique tokens in our dataset or corpus and then, for each document or sentence, counts how many times each token appears. The CountVectorizer class from the text feature extraction module of sklearn can be used to perform bag of words by learning from our corpus (fit_transform).
# Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(tokenizer=custom_tokenizer)  # custom_tokenizer is defined below
X = vectorizer.fit_transform(corpus)
# Print results
print("Feature names (vocabulary):")
print(vectorizer.get_feature_names_out())
print("\nDocument-term matrix:")
print(X.toarray())
It is possible to pass a custom tokenizer to this vectorizer, inside which all of the above-mentioned preprocessing steps can be carried out before training the vectorizer model.
## Custom Tokenizer
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
# Download required resources
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
# Initialize lemmatizer and stop words
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
# Helper to convert a Treebank POS tag into a WordNet POS tag
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun
# Custom tokenizer function
def custom_tokenizer(text):
    tokens = word_tokenize(text.lower())   # lowercase and tokenize
    tagged_tokens = pos_tag(tokens)        # POS tagging
    lemmatized = [
        lemmatizer.lemmatize(word, get_wordnet_pos(pos))
        for word, pos in tagged_tokens
        if word.isalpha() and word not in stop_words
    ]
    return lemmatized
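As a quick sanity check (a sketch; the exact lemmas depend on the POS tags the tagger assigns), applying the custom tokenizer to the first sentence of our corpus should give something like:
print(custom_tokenizer(corpus[0]))
# Roughly: ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']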
The bag of words approach ignores grammar and word order, and only cares about which words appear and how many times. It also treats every word as equally important. However, when it matters how informative or distinctive a word is across multiple documents or sentences, another approach called term frequency-inverse document frequency can be more useful.
Term Frequency-Inverse Document Frequency (TF-IDF) is a technique for measuring how important a word is in a document/sentence relative to a whole collection of documents (corpus). It improves upon the simple Bag of Words approach by not just counting words, but also considering how common or rare those words are. However, just like the Bag of Words approach, TF-IDF still ignores the position and order of words in a sentence.
TF-IDF combines the TF score, which is how often a word appears in a document/sentence, with the IDF score, which reflects the rarity of a word across all documents in the corpus. High TF-IDF scores indicate words that are frequent within a document but rare across the corpus, and therefore important for understanding the document's (or the sentence's) distinctive content.
Suppose we have 20 documents, each with 100 words in total, and the token "play" is used in 2 of them, 5 and 3 times respectively. Let's calculate the TF-IDF score for 'play' in these two documents:
TF(token_x, document_x) = [count of token_x / total no of tokens] in document_x
TF('play', document1) = 5/100 = 0.05
TF('play', document2) = 3/100 = 0.03
IDF(token_x) = log(total no of documents / count of documents containing token_x)
IDF('play') = log(20/2) = 1
TF-IDF(token_x, document_x) = TF(token_x, document_x) × IDF(token_x)
TF-IDF('play', document1) = 0.05 × 1 = 0.05
TF-IDF('play', document2) = 0.03 × 1 = 0.03
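The same toy calculation in Python, as a minimal sketch (base-10 logarithm, as in the worked example; note that sklearn's TfidfVectorizer uses a smoothed and normalized variant of this formula, so its numbers will not match exactly):
import math

tf_doc1 = 5 / 100                 # 'play' appears 5 times out of 100 tokens
tf_doc2 = 3 / 100                 # 'play' appears 3 times out of 100 tokens
idf_play = math.log10(20 / 2)     # 20 documents, 2 of them contain 'play' -> 1.0
print(tf_doc1 * idf_play)         # 0.05
print(tf_doc2 * idf_play)         # 0.03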
Another class from the text feature extraction module of sklearn, TfidfVectorizer, can be used to perform TF-IDF vectorization by learning from our corpus (fit_transform).
# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer2=TfidfVectorizer(tokenizer=custom_tokenizer)
X2=vectorizer2.fit_transform(corpus)
print(vectorizer2.get_feature_names_out())
print(X2.toarray())
Our corpus produces the following token features, and the sentence "The quick brown fox jumps over the lazy dog." is converted into the count vector [0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0] by the bag of words approach and into [0. 0. 0.41 0. 0. 0. 0. 0.41 0. 0. 0. 0. 0.41 0. 0.41 0. 0.41 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.41 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] by TF-IDF, with the extracted token features shown below.
Conclusion
Writing this post has been part of my own learning journey into NLP, and I hope it has helped you take a step forward too. Starting from simple sentences and moving through each preprocessing step has shown me how language, something so natural to us, can be gently translated into something a machine can begin to understand. As someone coming from a biology background, it's exciting to think that the same methods could one day help me interpret amino acid sequences and design proteins. For now, though, mastering these basics with Python and NLTK is a solid start. Thanks for bearing with me; more to come!