News Category Classification Machine Learning Project | by Rajesh | Feb, 2025



    Introduction

In today’s digital age, news articles are published every second, covering topics ranging from politics to sports, entertainment, and technology. Manually categorizing this huge volume of information is impractical. This is where News Category Classification comes into play. In this blog, we will explore how to build an end-to-end machine learning model to classify news articles into predefined categories.

By the end of this tutorial, you will have a working classifier that can predict the category of a given news article. We’ll cover everything from data collection to deployment.

Understanding the Problem:

What is News Category Classification?

News category classification is a multi-class text classification problem where a given news article is automatically assigned to a predefined category, such as:

    • Politics
• Sports
• Technology
• Entertainment
• Business

Real-World Applications

• News Aggregation: Automatically organizing news articles for websites like Google News.
• Content Recommendations: Suggesting relevant news to users based on their interests.
• Fake News Detection: Filtering misleading information based on its category.

Collecting and Preparing Data:

    Selecting a Dataset

There are several ways to collect data for news classification:

• Kaggle Datasets: AG News, BBC News, or other publicly available datasets.
• Web Scraping: Scraping news from websites like BBC or CNN.
• News APIs: Using APIs such as NewsAPI or Google News API to collect fresh articles.

For this tutorial, we use the News Aggregator Dataset from Kaggle, which contains news headlines and their corresponding categories.

Data Cleaning:

1. First, we load the data into a pandas DataFrame: import the pandas library, then load the data. The dataset has 422,419 records and 8 features in total.
2. The dataset contains many features, so I selected only the ones required for this project: the two features TITLE and CATEGORY.
3. In the CATEGORY feature, categories are stored as short codes (b, t, e, m), so I replaced these codes with meaningful names: Business, Science and Technology, Entertainment, and Health.
4. The CATEGORY feature has 4 unique categories in total, so this is a multiclass classification problem. A loading sketch covering these steps is shown below.
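A minimal sketch of these steps (the CSV file name is an assumption based on the Kaggle download; adjust the path to your copy):

import pandas as pd

# Load only the two features we need
# (file name assumed from the Kaggle News Aggregator Dataset download)
df = pd.read_csv('uci-news-aggregator.csv', usecols=['TITLE', 'CATEGORY'])

# Replace the short category codes with meaningful names
category_names = {'b': 'Business', 't': 'Science and Technology',
                  'e': 'Entertainment', 'm': 'Health'}
df['CATEGORY'] = df['CATEGORY'].map(category_names)

print(df.shape)                 # number of records and selected features
print(df['CATEGORY'].unique())  # the 4 unique categories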

Data Preprocessing:

In this section we perform:

    • Tokenization
• Remove Stopwords and Punctuation
    • Stemming
1. Tokenization:

Here we tokenize the TITLE feature: tokenization basically splits the corpus into sentences or word tokens. We also convert those tokens to lowercase.

# Headline tokenization
from nltk.tokenize import word_tokenize
# nltk.download('punkt') may be required once for word_tokenize

tokenized_titles = []

for headline in df['TITLE']:
    tokenized_titles.append(word_tokenize(headline.lower()))

To see what the data looks like after tokenization, run the code below:

for title in tokenized_titles[0:10]:
    print(title)

2. Stopwords and Punctuation:

Here we remove all the stop words and punctuation marks from the TITLE feature. Before that, we first need to import the nltk library and download the stopword list.

# Download 'stopwords' from nltk
import nltk
nltk.download('stopwords')

First, let's look at what the stop words and punctuation characters are:

# Removal of stopwords and punctuation
# Also remove 's

# required libraries
from nltk.corpus import stopwords
import string

# stopwords for the English language
stop_words = set(stopwords.words('english'))
print('Stop Words : ', stop_words)

# punctuation characters
punctuations = set(string.punctuation)
print("Punctuations : ", punctuations)

Stop Words :  {'down', 'y', 'then', "weren't", 'be', 'doing', 'their', 'but', 'where', 'with', 'here', 'didn', 'he', 'by', "hadn't", 'who', 'i', "that'll", 'haven', 'have', 'so', 've', 'd', 'has', 'did', 'its', 'after', 'itself', 'they', 'that', 'until', 're', 'below', 'some', "you'd", 'been', 'does', 'shouldn', 'mustn', 'about', "won't", ... (truncated)}
Punctuations :  {'`', '}', '^', '!', '.', '"', '(', '\\', '?', ')', ',', '*', '#', '/', '[', '_', '@', ':', '~', '|', "'", ';', '$', '=', '>', '%', ']', '+', '&', '-', '<', '{'}

Now we will remove these from the TITLE feature; let's run the code below:

# filtered_title = titles without stopwords and punctuation
filtered_title = []

for title in tokenized_titles:
    temp_title = []
    for word in title:
        if (word not in stop_words) and (word not in punctuations) and (word != "'s"):
            temp_title.append(word)

    filtered_title.append(temp_title)

print("\nFiltered Titles : ")
print(filtered_title[0:5])

3. Stemming:

Let's first understand why we use stemming. Say we have three different words with the same meaning, e.g. (love, loved, loving); the root meaning of all three words is "love", so instead of considering three separate words we consider only one, i.e., love.
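You can verify this quickly by running the stemmer on those three words (a small sanity check, separate from the pipeline):

from nltk.stem import PorterStemmer

porter = PorterStemmer()
for word in ['love', 'loved', 'loving']:
    print(word, '->', porter.stem(word))   # all three stem to 'love'

Now let's apply stemming to all the filtered titles: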

# Stemming using the Porter stemmer
from nltk.stem import PorterStemmer

porter = PorterStemmer()

stemmed_titles = []

for title in filtered_title:
    temp_title = []
    for word in title:
        temp_title.append(porter.stem(word))

    stemmed_titles.append(" ".join(temp_title))

print("Stemmed title headlines : \n", stemmed_titles[0:5])

Now let's replace the original TITLE feature in the DataFrame with these stemmed titles.

# Replacing the TITLE headlines with stemmed_titles
df = df.drop(['TITLE'], axis=1)
df.insert(0, 'Title', stemmed_titles, True)

With that, the preprocessing of the data is done.

Now, if you want to see which words occur most often in each category, you can check this using the wordcloud library. Here I will show it for the Business category only.


from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      stopwords=stopwords,
                      min_font_size=10).generate(df[df['CATEGORY'] == "Business"]['Title'].str.cat(sep=" "))

plt.figure(figsize=(12, 7))
plt.imshow(wordcloud)
plt.show()

Feature Engineering:

Our category data is in text format, so we need to assign numeric labels to it.

To assign labels to the categories, we will use the LabelEncoder technique.

# Encoding news categories

from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()

# Adding an Encoded_Category column
df['Encoded_Category'] = labelencoder.fit_transform(df['CATEGORY'])

After label encoding, our categories are mapped as follows:

Business : 0
Entertainment : 1
Health : 2
Science and Technology : 3
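This mapping arises because LabelEncoder assigns codes in alphabetical order of the class names. You can confirm it from the fitted encoder with a quick check:

# Recover the category-to-code mapping from the fitted encoder
mapping = dict(zip(labelencoder.classes_,
                   labelencoder.transform(labelencoder.classes_)))
print(mapping)  # each category name mapped to its numeric code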

Feature Selection and Model Training:

Let's split the data into a training set and a testing set.

# Independent and dependent features
X = df['Title']
y = df['Encoded_Category']

# Splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split

# Testing set = 25% and training set = 75%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=51)

Checking the shapes of the data:

    print("Form of X : " + str(X.form))
    print("Form of y : " + str(y.form))
    print("Form of X_train : " + str(X_train.form))
    print("Form of y_train : " + str(y_train.form))
    print("Form of X_test : " + str(X_test.form))
    print("Form of y_test : " + str(y_test.form))

Feature Selection:

Now, let's jump to the real NLP techniques.

We have data in textual format, so we need to convert this text data into numbers or vector format. For this we have several techniques, for example:

• BOW (Bag of Words)
• TF-IDF (Term Frequency - Inverse Document Frequency)
    • Word2Vec

In this project we will use two of these techniques, BOW and TF-IDF, and compare the model's performance on each; see the toy example below for the intuition. First we will use TF-IDF and then BOW.
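Here is a tiny illustrative sketch (toy headlines, not from the dataset) of what the two vectorizers produce:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy = ["fed raises rates", "fed cuts rates again"]

bow = CountVectorizer().fit_transform(toy)
print(bow.toarray())      # raw word counts per headline

tfidf = TfidfVectorizer().fit_transform(toy)
print(tfidf.toarray())    # counts reweighted: words shared by both
                          # headlines ('fed', 'rates') get lower weight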

Feature Selection: TF-IDF Approach

# Feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiating TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fitting and transforming the training data (X_train)
tfidf_x_train = tfidf_vectorizer.fit_transform(X_train.values)

# Transforming the testing data (X_test)
tfidf_x_test = tfidf_vectorizer.transform(X_test.values)

# Saving tfidf_vectorizer
import pickle
pickle.dump(tfidf_vectorizer, open("pickle_files/tfidf_vectorizer.pkl", 'wb'))

Model Training: Naive Bayes Classifier Algorithm

Let's train the model now.

# Multinomial Naive Bayes classifier

from sklearn.naive_bayes import MultinomialNB

# Instantiating the Naive Bayes classifier with alpha = 1.0
nb_classifier = MultinomialNB()

# Fitting nb_classifier to the training data
nb_classifier.fit(tfidf_x_train, y_train)

# Saving nb_classifier for tfidf_vectorizer
pickle.dump(nb_classifier, open("pickle_files/nb_classifier.pkl", 'wb'))

Now let's predict on the test data:

    # Prediction
    pred = nb_classifier.predict(tfidf_x_test)

Let's check the confusion matrix and accuracy:

# Accuracy and confusion matrix

from sklearn import metrics

print("Multinomial Naive Bayes : TF-IDF Approach \n")

# Accuracy
a_score = metrics.accuracy_score(y_test, pred)
print("Accuracy : " + str("{:.2f}".format(a_score * 100)), '%')

print('\n')

# Confusion matrix
# labels : 0 (Business), 1 (Entertainment), 2 (Health), 3 (Science and Technology)
# By default, rows and columns are ordered by label, 0 to 3
confusion_matrix = metrics.confusion_matrix(y_test, pred)

print("Confusion Matrix: \n", confusion_matrix)

The output looks as below:

Multinomial Naive Bayes : TF-IDF Approach

Accuracy : 92.07 %

Confusion Matrix:
[[26497   674   175  1845]
 [  530 36828   104   585]
 [  754   685  9586   305]
 [ 1910   690   113 24324]]
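Beyond overall accuracy, a per-class report can be more informative here, since the classes are imbalanced (Health has far fewer test examples than Entertainment). A short optional check using scikit-learn's built-in report:

# Per-class precision, recall and F1 for the TF-IDF model
print(metrics.classification_report(
    y_test, pred,
    target_names=["Business", "Entertainment", "Health",
                  "Science and Technology"]))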

Now we will apply hyperparameter tuning and check the model's performance.

# Laplace smoothing (tuning parameter - alpha)
import numpy as np

# List of alphas
alphas = np.arange(0, 1, 0.1)

# Function for training nb_classifier with different alpha values
def train_and_predict(alpha):
    # instantiating the naive bayes classifier
    nb_classifier = MultinomialNB(alpha=alpha)

    # fitting nb_classifier to the training data
    nb_classifier.fit(tfidf_x_train, y_train)

    # prediction
    pred = nb_classifier.predict(tfidf_x_test)

    # accuracy score
    a_score = metrics.accuracy_score(y_test, pred)

    return a_score

# Iterating over alphas and printing the corresponding accuracy score
for alpha in alphas:
    print("Alpha : ", alpha)
    print("Accuracy score : ", train_and_predict(alpha))
    print()

With the default alpha = 1.0, we are getting an accuracy of about 92%.

Trying different values of alpha, we still get approximately 92% accuracy.

So we don't need to change the default value of alpha = 1.0.
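For context: alpha is the Laplace (additive) smoothing parameter of multinomial Naive Bayes. Roughly, each word-given-class probability is estimated as (count(word, class) + alpha) / (count(class) + alpha * vocabulary_size), so with alpha = 0 any word never seen with a class during training would force that class's probability to zero; this is why very small alpha values are generally discouraged.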

Now let's check the model's performance with BOW.

Feature Selection: Bag of Words (BOW) Approach

# Feature extraction
from sklearn.feature_extraction.text import CountVectorizer

# Instantiating CountVectorizer
count_vectorizer = CountVectorizer()

# Fitting and transforming the training data (X_train)
count_x_train = count_vectorizer.fit_transform(X_train.values)

# Transforming the testing data (X_test)
count_x_test = count_vectorizer.transform(X_test.values)

# Saving count_vectorizer
pickle.dump(count_vectorizer, open("pickle_files/count_vectorizer.pkl", 'wb'))

Model Training

# Multinomial Naive Bayes classifier

from sklearn.naive_bayes import MultinomialNB

# Instantiating the Naive Bayes classifier with alpha = 1.0
nb_classifier = MultinomialNB()

# Fitting nb_classifier to the training data
nb_classifier.fit(count_x_train, y_train)

# Saving nb_classifier for count_vectorizer
pickle.dump(nb_classifier, open("pickle_files/nb_classifier_for count_vectorizer.pkl", 'wb'))

# prediction
pred = nb_classifier.predict(count_x_test)
# Accuracy and confusion matrix

from sklearn import metrics

print("Multinomial Naive Bayes : BOW Approach \n")

# Accuracy
a_score = metrics.accuracy_score(y_test, pred)
print("Accuracy : " + str("{:.2f}".format(a_score * 100)), '%')

print('\n')

# Confusion matrix
# labels : 0 (Business), 1 (Entertainment), 2 (Health), 3 (Science and Technology)
# By default, rows and columns are ordered by label, 0 to 3
confusion_matrix = metrics.confusion_matrix(y_test, pred)

print("Confusion Matrix: \n", confusion_matrix)

Multinomial Naive Bayes : BOW Approach

Accuracy : 92.23 %

Confusion Matrix:
[[26272   556   421  1942]
 [  604 36433   303   707]
 [  460   364 10300   206]
 [ 1819   535   284 24399]]
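Comparing the two confusion matrices, the most visible difference is on the Health class (third row): BOW classifies 10,300 Health headlines correctly versus 9,586 for TF-IDF, which accounts for most of the small overall gap in accuracy (92.23% vs 92.07%).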

Let's try hyperparameter tuning here as well.

# Laplace smoothing (tuning parameter - alpha)

# List of alphas
alphas = np.arange(0, 1, 0.1)

# Function for training nb_classifier with different alpha values
def train_and_predict(alpha):
    # instantiating the naive bayes classifier
    nb_classifier = MultinomialNB(alpha=alpha)

    # fitting nb_classifier to the training data
    nb_classifier.fit(count_x_train, y_train)

    # prediction
    pred = nb_classifier.predict(count_x_test)

    # accuracy score
    a_score = metrics.accuracy_score(y_test, pred)

    return a_score

# Iterating over alphas and printing the corresponding accuracy score
for alpha in alphas:
    print("Alpha : ", alpha)
    print("Accuracy score : ", train_and_predict(alpha))
    print()

Here too, with alpha = 1.0 we are getting an accuracy of about 92%.

Trying different values of alpha, we still get approximately 92% accuracy.

So we don't need to change the default value of alpha = 1.0.

Prediction System:

Let's create a prediction system. Earlier we saved some pickle files, so we will load those pickle files and check how our model works on new data.

# Prediction of a user's news headline
import pickle

# loading the model
count_vectorizer = pickle.load(open('pickle_files/count_vectorizer.pkl', 'rb'))
nb_classifier = pickle.load(open("pickle_files/nb_classifier_for count_vectorizer.pkl", 'rb'))

# Values encoded by LabelEncoder
encoded = {0: "Business", 1: "Entertainment", 2: "Health", 3: "Science and Technology"}

# input
user_headline = [input("news_headline : ")]

# transformation and prediction of the user's headline
headline_counts = count_vectorizer.transform(user_headline)
prediction = nb_classifier.predict(headline_counts)

print("News Category : ", encoded[prediction[0]])

If we run the above code, it will ask us to type a headline and then give us the output as a category.

News Category :  Health
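One caveat: the model was trained on stemmed, lowercase, stopword-free titles, while the snippet above vectorizes the raw headline. For fully consistent results, the same preprocessing should be applied at inference time; a minimal sketch of such a helper (mirroring the steps used earlier) could be:

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import string

stop_words = set(stopwords.words('english'))
punctuations = set(string.punctuation)
porter = PorterStemmer()

def preprocess_headline(headline):
    # tokenize, lowercase, drop stopwords/punctuation, stem - as in training
    tokens = word_tokenize(headline.lower())
    kept = [porter.stem(w) for w in tokens
            if w not in stop_words and w not in punctuations and w != "'s"]
    return " ".join(kept)

headline_counts = count_vectorizer.transform([preprocess_headline(user_headline[0])])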

App Development:

We have developed a web application using the Streamlit framework; below is the code for it.

import streamlit as st
import pickle

# loading the model
count_vectorizer = pickle.load(open('pickle_files/count_vectorizer.pkl', 'rb'))
nb_classifier = pickle.load(open("pickle_files/nb_classifier_for count_vectorizer.pkl", 'rb'))

def classify(headline):
    # transformation and prediction of the user's headline
    headline_counts = count_vectorizer.transform([headline])
    predicted_category = nb_classifier.predict(headline_counts)

    return predicted_category

# Values encoded by LabelEncoder
encoded = {0: "Business", 1: "Entertainment", 2: "Health", 3: "Science and Technology"}

st.title("News Category Classification")

input_title = st.text_input("News Headline")

if st.button('Predict Category'):
    category = classify(input_title)
    st.write(encoded[category[0]])
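Assuming the script is saved as app.py (the name is arbitrary), the app can be launched locally with:

streamlit run app.py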

The output is a simple page with a text box for the headline and a button that displays the predicted category.

Conclusion:

In this project I have learned tokenization and stemming techniques, and understood the power of the TF-IDF and BOW techniques. I also learned how to train a Naive Bayes classifier on textual data and how to tune it.


