
    Data Cleaning: A Detailed Guide (Part 2) | by Mani Krishna | Mar, 2025

    By Team_AIBS News | March 18, 2025 | 3 Mins Read


    Before proceeding, it's advisable to read Part 1 to gain foundational knowledge of data cleaning techniques.

    Continuing from Part 1, this section explores further essential steps in the data cleaning process, focusing on handling outliers and standardizing data formats.

    Step 4: Handling Outliers

    Outliers are observations that deviate significantly from the other data points. They can adversely affect the performance of statistical analyses and predictive models by skewing results.

    Methods for Detecting Outliers:

    1. Visualization Techniques:

    • Box Plot: Outliers are identified using the Interquartile Range (IQR) method; points drawn beyond the whiskers are potential outliers.
    import seaborn as sns
    import matplotlib.pyplot as plt
    sns.boxplot(x=df['salary'])
    plt.show()
    • Histogram and Distribution Plot: Helps visualize the data distribution to identify outliers visually.
    import matplotlib.pyplot as plt
    plt.hist(df['salary'], bins=50)
    plt.show()

    2. Statistical Methods:

    • Z-Score Method: Detects outliers by measuring how many standard deviations each value lies from the mean.
    from scipy import stats
    import numpy as np
    # Absolute z-score of each salary; rows with |z| < 3 are kept
    z = np.abs(stats.zscore(df['salary']))
    df_no_outliers = df[z < 3]
    • Interquartile Range (IQR) Method: Detects outliers based on quartiles, specifically observations that lie significantly outside the range defined by the lower (Q1) and upper (Q3) quartiles. Calculate the IQR as the difference between the third quartile (75th percentile) and the first quartile (25th percentile). Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are considered outliers.
    Q1 = df['salary'].quantile(0.25) 
    Q3 = df['salary'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df_no_outliers = df[(df['salary'] >= lower_bound) & (df['salary'] <= upper_bound)]
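
The two statistical methods can disagree on small samples. The sketch below applies both to a made-up series of ten salaries (the data and the helper names `zscore_mask` and `iqr_mask` are illustrative assumptions, not part of the original article) and uses plain pandas so the arithmetic stays visible.

```python
import pandas as pd

# Hypothetical sample: nine typical salaries and one extreme value.
salaries = pd.Series(
    [48_000, 50_000, 51_000, 52_000, 53_000,
     54_000, 55_000, 56_000, 58_000, 250_000],
    name="salary",
)

def zscore_mask(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """True where the absolute z-score is at or above the threshold."""
    # ddof=0 (population std) matches the default of scipy.stats.zscore.
    z = (s - s.mean()) / s.std(ddof=0)
    return z.abs() >= threshold

def iqr_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

z_outliers = salaries[zscore_mask(salaries)]
iqr_outliers = salaries[iqr_mask(salaries)]

print("z-score outliers:", z_outliers.tolist())
print("IQR outliers:    ", iqr_outliers.tolist())
```

On this sample the IQR rule flags 250,000 while the z-score rule flags nothing. With the population standard deviation, the largest possible |z| among n points is sqrt(n - 1), so with ten points no value can exceed 3. For small datasets the IQR rule, or a lower z threshold, may be the more practical choice.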

    Different Ways to Handle Outliers:

    • Removing Outliers: Directly remove outliers if their number is minimal.
    # Keep rows below the 99th percentile (trims only the upper tail)
    df = df[df['salary'] < df['salary'].quantile(0.99)]
    • Winsorization (Replacing Outliers): Replace extreme values with the nearest value within an acceptable range.
    from scipy.stats.mstats import winsorize
    # limits=[0.01, 0.01] caps the lowest 1% and the highest 1% of values
    df['salary'] = winsorize(df['salary'], limits=[0.01, 0.01])
    • Log Transformation: Useful when data is highly skewed, particularly for reducing right-skewness and handling exponential growth patterns.
    # np.log1p(x) computes log(1 + x); values must be greater than -1
    df['salary_log'] = np.log1p(df['salary'])
    • Square Root Transformation: Effective for stabilizing variance and reducing moderate skewness, often in count data.
    # Requires non-negative values
    df['salary_sqrt'] = np.sqrt(df['salary'])
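
To get a feel for how these treatments change a skewed column, here is a minimal sketch on synthetic data. The log-normal sample, the seed, and the 1st/99th percentile bounds are assumptions chosen for illustration. Capping is done with pandas `clip`, which is a close relative of winsorization rather than the `scipy` function itself.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed "salary" data drawn from a log-normal distribution.
rng = np.random.default_rng(seed=42)
salary = pd.Series(rng.lognormal(mean=10.8, sigma=0.6, size=1_000), name="salary")

# Capping: values below the 1st or above the 99th percentile are pulled back to those bounds.
lower, upper = salary.quantile(0.01), salary.quantile(0.99)
salary_capped = salary.clip(lower=lower, upper=upper)

# Log transform: np.log1p(x) computes log(1 + x).
salary_log = np.log1p(salary)

# Square root transform: a milder correction than the log.
salary_sqrt = np.sqrt(salary)

# Compare range and skewness of each variant side by side.
summary = pd.DataFrame({
    "raw": salary,
    "capped": salary_capped,
    "log": salary_log,
    "sqrt": salary_sqrt,
}).agg(["min", "max", "skew"]).round(2)
print(summary)
```

Capping keeps every row and only limits the extremes, whereas the transforms rescale the whole column. Which one suits a given dataset depends on whether the extreme values are errors or genuine observations.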

    Step 5: Formatting and Standardization of Data

    Proper data formatting and standardization ensure consistency and improve the effectiveness of analysis and modeling.

    • Convert Data Types: Ensure variables are of the correct type for analysis.
    # astype(int) raises if the column contains NaN or non-numeric strings
    df['age'] = df['age'].astype(int)
    df['salary'] = df['salary'].astype(float)
    • Trim Whitespace: Remove unnecessary spaces from strings to prevent mismatches.
    df['name'] = df['name'].str.strip()
    • Fix Encoding Issues: Address potential encoding issues that affect readability and processing.
    # Lossy: every non-ASCII character (accents, symbols) is dropped
    df['text'] = df['text'].str.encode('utf-8').str.decode('ascii', 'ignore')
    • Standardize Numeric Data: Apply scaling techniques to standardize numeric features.
    from sklearn.preprocessing import StandardScaler
    # Rescale each column to zero mean and unit variance
    scaler = StandardScaler()
    df[['age', 'salary']] = scaler.fit_transform(df[['age', 'salary']])
    • Standardize Categorical Data: Ensure categorical variables have consistent naming conventions.
    df['gender'] = df['gender'].str.capitalize().replace({'M': 'Male', 'F': 'Female'})
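
The formatting steps above can be chained on a small, made-up DataFrame to see their combined effect. The column values below are hypothetical, and the use of `pd.to_numeric(..., errors='coerce')` to turn unparseable entries into missing values before casting is an assumption added for illustration, since calling `astype` directly raises on bad values. Standardization is written out by hand (subtract the mean, divide by the population standard deviation), which is the same calculation `StandardScaler` performs by default.

```python
import pandas as pd

# Hypothetical messy input.
df = pd.DataFrame({
    "name":   ["  Alice", "Bob  ", " Carol "],
    "age":    ["34", "41", "n/a"],
    "salary": ["52000", "61000.5", "58000"],
    "gender": ["f", "M", "female"],
})

# Trim whitespace from strings.
df["name"] = df["name"].str.strip()

# Convert types; entries that cannot be parsed become missing instead of raising.
df["age"] = pd.to_numeric(df["age"], errors="coerce").astype("Int64")  # nullable integer
df["salary"] = pd.to_numeric(df["salary"], errors="coerce").astype(float)

# Standardize category labels: normalize case, then map abbreviations.
df["gender"] = (
    df["gender"].str.strip().str.capitalize().replace({"M": "Male", "F": "Female"})
)

# Z-score standardization by hand: zero mean, unit (population) standard deviation.
df["salary_scaled"] = (df["salary"] - df["salary"].mean()) / df["salary"].std(ddof=0)

print(df)
```

Coercing first makes the bad entries visible as missing values, which can then be handled with whichever missing-data strategy fits the dataset.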

    By following these detailed steps and practices, data analysts and scientists can significantly enhance the quality and effectiveness of their data-driven insights and models.


