
    Explained: How Does L1 Regularization Perform Feature Selection?

    By Team_AIBS News | April 23, 2025 | 8 Mins Read


    Feature selection is the process of choosing an optimal subset of features from a given set of features; an optimal feature subset is the one that maximizes the performance of the model on the given task.

    Feature selection can be a manual, or rather explicit, process when performed with filter or wrapper methods. In these methods, features are added or removed iteratively based on the value of a fixed measure, which quantifies the relevance of the feature in making the prediction. The measure could be information gain, variance or the chi-squared statistic, and the algorithm decides to accept or reject the feature by comparing the measure against a fixed threshold. Note that these methods are not part of the model training stage and are performed prior to it.
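A filter method of this kind can be sketched in a few lines. The snippet below is a minimal illustration, assuming variance as the measure; the function name `variance_filter`, the synthetic data, and the threshold of 0.1 are all hypothetical choices made for the example.

```python
import numpy as np

def variance_filter(X, threshold):
    """Return the column indices whose variance exceeds `threshold`."""
    variances = X.var(axis=0)          # one score per feature
    return np.flatnonzero(variances > threshold)

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0.0, 1.0, size=200),    # feature 0: varies noticeably
    np.full(200, 3.0),                 # feature 1: constant
    rng.normal(0.0, 0.01, size=200),   # feature 2: almost constant
])

kept = variance_filter(X, threshold=0.1)
print(kept)
```

Only feature 0 passes the threshold here. The decision is made before any model is trained, which is what separates filter methods from the embedded methods discussed next.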

    Embedded methods perform feature selection implicitly, without using any pre-defined selection criteria, deriving it from the training data itself. This intrinsic feature selection process is part of the model training stage. The model learns to select features and make relevant predictions at the same time. In later sections, we will describe the role of regularization in performing this intrinsic feature selection.

    Regularization and Model Complexity

    Regularization is the process of penalizing the complexity of the model to avoid overfitting and achieve generalization on the task.

    Here, the complexity of the model is analogous to its power to adapt to the patterns in the training data. Assuming a simple polynomial model in 'x' with degree 'd', as we increase the degree 'd' of the polynomial, the model gains greater flexibility to capture patterns in the observed data.

    Overfitting and Underfitting

    If we try to fit a polynomial model with d = 2 on a set of training samples derived from a cubic polynomial with some noise, the model will not be able to capture the distribution of the samples to a sufficient extent. The model simply lacks the flexibility or complexity to model data generated from a degree 3 (or higher order) polynomial. Such a model is said to under-fit the training data.

    Working on the same example, assume we now have a model with d = 6. With the increased complexity, it should be easy for the model to estimate the original cubic polynomial that was used to generate the data (for instance, by setting the coefficients of all terms with exponent > 3 to 0). If the training process is not terminated at the right time, the model will continue to use its additional flexibility to reduce the error further and start capturing the noise in the samples too. This will reduce the training error significantly, but the model now overfits the training data. The noise will change in real-world settings (or in the test phase), and anything the model has learned about predicting it will break down, leading to high test error.
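The d = 2 versus d = 6 comparison can be tried out with a small experiment. This is a sketch under stated assumptions: the cubic `true_fn`, the noise level of 0.5, and the use of `np.polyfit` (a closed-form least-squares fit, not an iterative training loop) are illustrative choices, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return x**3 - 2.0 * x              # hypothetical cubic generating process

x = np.linspace(-2.0, 2.0, 30)
y_train = true_fn(x) + rng.normal(0.0, 0.5, size=x.size)
y_test = true_fn(x) + rng.normal(0.0, 0.5, size=x.size)   # same inputs, fresh noise

train_mse, test_mse = {}, {}
for d in (2, 3, 6):
    coeffs = np.polyfit(x, y_train, deg=d)     # least-squares polynomial fit
    pred = np.polyval(coeffs, x)
    train_mse[d] = np.mean((pred - y_train) ** 2)
    test_mse[d] = np.mean((pred - y_test) ** 2)
    print(f"d={d}: train MSE={train_mse[d]:.3f}, test MSE={test_mse[d]:.3f}")
```

The training error can only fall as the degree grows, because each lower-degree model is a special case of the higher-degree one. The test error carries no such guarantee, which is where overfitting shows up.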

    How to determine the optimal model complexity?

    In practical settings, we have little to no understanding of the data-generation process or the true distribution of the data. Finding the optimal model with the right complexity, such that neither under-fitting nor overfitting occurs, is a challenge.

    One technique could be to start with a sufficiently powerful model and then reduce its complexity by means of feature selection. The fewer the features, the lower the complexity of the model.

    As discussed in the previous section, feature selection can be explicit (filter, wrapper methods) or implicit. Redundant features that have insignificant relevance in determining the value of the response variable should be eliminated to prevent the model from learning uncorrelated patterns in them. Regularization also performs a similar task. So, how are regularization and feature selection connected in reaching the common goal of optimal model complexity?

    L1 Regularization As A Feature Selector

    Continuing with our polynomial model, we represent it as a function f, with inputs x, parameters θ and degree d,

    (Image by author)

    For a polynomial model, each power of the input x_i can be considered a feature, forming a vector of the form,

    (Image by author)

    We also define an objective function which, when minimized, leads us to the optimal parameters θ* and which includes a regularization term penalizing the complexity of the model.

    (Image by author)
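The figures for the model, the feature vector, and the objective are not reproduced in this copy. A plausible reconstruction from the surrounding text is sketched below. The symbols N (number of training samples), y_i (targets), λ (regularization strength), the name J for the objective, the 1/N normalization, and the choice to include θ_0 in the penalty are assumptions, not necessarily the author's notation.

```latex
\begin{align*}
f(x; \theta) &= \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_d x^d
              = \sum_{j=0}^{d} \theta_j x^j \\
\phi(x_i) &= \left[\, 1,\; x_i,\; x_i^2,\; \dots,\; x_i^d \,\right] \\
J(\theta) &= \frac{1}{N} \sum_{i=1}^{N} \bigl( f(x_i; \theta) - y_i \bigr)^2
             + \lambda \sum_{j=0}^{d} \lvert \theta_j \rvert,
\qquad \theta^* = \arg\min_{\theta} J(\theta)
\end{align*}
```

The first term of J is the mean squared error (MSE) and the second is the L1 penalty.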

    To determine the minima of this function, we need to analyze all of its critical points, i.e. points where the derivative is zero or undefined.

    The partial derivative w.r.t. one of the parameters, θj, can be written as,

    (Image by author)

    where the function sgn is defined as,

    (Image by author)
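As a hedged reconstruction of the two missing figures, assume the objective J(θ) is the mean squared error over N samples plus λ times the sum of the absolute parameter values. The partial derivative and the sign function would then read as follows; the 2/N factor follows from the assumed 1/N normalization, and the later mention of a tolerance ε suggests the original may treat values within ε of zero as zero.

```latex
\begin{align*}
\frac{\partial J}{\partial \theta_j}
  &= \frac{2}{N} \sum_{i=1}^{N} \bigl( f(x_i; \theta) - y_i \bigr)\, x_i^{\,j}
     + \lambda \,\operatorname{sgn}(\theta_j) \\[4pt]
\operatorname{sgn}(x) &=
  \begin{cases}
    +1 & \text{if } x > 0 \\
    \phantom{+}0 & \text{if } x = 0 \\
    -1 & \text{if } x < 0
  \end{cases}
\end{align*}
```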

    Note: The derivative of the absolute value function is not identical to the sgn function defined above, because the original derivative is undefined at x = 0. We augment the definition to handle the kink at x = 0, so that a derivative value is available across the entire domain. Moreover, such augmented functions are also used by ML frameworks when the underlying computation involves the absolute value function. Check this thread on the PyTorch forum.
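The augmented definition is easy to inspect in NumPy, whose `np.sign` returns 0 at exactly 0. The parameter values and the regularization strength below are hypothetical, chosen only to show the three cases.

```python
import numpy as np

theta = np.array([-2.0, 0.0, 0.5])     # hypothetical parameter values
lam = 0.1                              # hypothetical regularization strength

l1_penalty = lam * np.abs(theta).sum() # lambda * sum_j |theta_j|
l1_grad = lam * np.sign(theta)         # np.sign returns 0.0 at exactly 0.0

print(l1_penalty, l1_grad)
```

Each entry of `l1_grad` is -λ, 0 or +λ regardless of how large the parameter is, which is the constant pull that the rest of the argument relies on.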

    By computing the partial derivative of the objective function w.r.t. a single parameter θj, and setting it to zero, we can build an equation that relates the optimal value of θj to the predictions, targets, and features.

    (Image by author)
    (Image by author)
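A hedged reconstruction of this step, assuming the objective is the mean squared error over N samples plus λ times the sum of absolute parameter values (the 1/N normalization and the resulting factor 2 are assumptions):

```latex
\begin{align*}
\frac{2}{N} \sum_{i=1}^{N} \bigl( f(x_i; \theta^*) - y_i \bigr)\, x_i^{\,j}
  + \lambda \,\operatorname{sgn}(\theta_j^*) &= 0 \\[4pt]
\frac{2}{N} \sum_{i=1}^{N} \bigl( f(x_i; \theta^*) - y_i \bigr)\, x_i^{\,j}
  &= -\lambda \,\operatorname{sgn}(\theta_j^*)
\end{align*}
```

The left-hand side averages the product of the jth feature with the prediction error. The right-hand side can only take the values -λ, 0 and +λ.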

    Let us examine the equation above. If we assume that the inputs and targets were centered about the mean (i.e. the data had been standardized in the preprocessing step), the term on the LHS effectively represents the covariance between the jth feature and the difference between the predicted and target values.

    Statistical covariance between two variables quantifies how strongly the two variables vary together, i.e. how much a change in one is accompanied by a change in the other.
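The covariance reading of the left-hand side can be checked numerically. The sketch below uses synthetic values for the feature and for the prediction error; both arrays are hypothetical stand-ins, and the only point is that for a centered feature the mean of the product equals the covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
feature = rng.normal(size=n)
feature -= feature.mean()                  # center the feature
residual = rng.normal(size=n) + 0.3 * feature   # hypothetical f(x_i) - y_i values

mean_product = np.mean(residual * feature)               # (1/N) * sum_i r_i * x_ij
covariance = np.cov(feature, residual, bias=True)[0, 1]  # population covariance

print(mean_product, covariance)
```

The two numbers agree up to floating-point error because the covariance is the mean of the product minus the product of the means, and the feature mean is zero after centering.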

    The sign function on the RHS forces the covariance on the LHS to assume only three values (as the sign function only returns -1, 0 and 1). If the jth feature is redundant and does not influence the predictions, the covariance will be nearly zero, bringing the corresponding parameter θj* to zero. This results in the feature being eliminated from the model.

    Imagine the sign function as a canyon carved by a river. You can walk in the canyon (i.e. the river bed), but to get out of it you have to climb huge barriers or steep slopes. L1 regularization induces a similar 'thresholding' effect on the gradient of the loss function. The gradient must be strong enough to break through the barriers; otherwise it becomes zero, which eventually brings the parameter to zero.

    For a more grounded example, consider a dataset that contains samples derived from a straight line (parameterized by two coefficients) with some added noise. The optimal model should have no more than two parameters, or else it will adapt to the noise present in the data (using the added freedom/power of the polynomial). Changing the parameters of the higher powers in the polynomial model does not affect the difference between the targets and the model's predictions, thus reducing their covariance with the feature.

    During the training process, a constant step gets added to or subtracted from the gradient of the loss function. If the gradient of the loss function (MSE) is smaller than the constant step, the parameter will eventually reach a value of 0. Observe the equation below, depicting how parameters are updated with gradient descent,

    (Image by author)
    (Image by author)
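A hedged reconstruction of the update rule, with α as the learning rate and assuming the objective J(θ) is the mean squared error over N samples plus λ times the sum of absolute parameter values. The colour labels follow the text's description of a blue loss-gradient part and a red constant-step part; the 2/N factor is an assumption tied to the assumed normalization.

```latex
\begin{align*}
\theta_j &\leftarrow \theta_j + \Delta\theta_j,
\qquad \Delta\theta_j = -\alpha \,\frac{\partial J}{\partial \theta_j} \\[4pt]
\Delta\theta_j &=
  \underbrace{-\,\alpha \,\frac{2}{N} \sum_{i=1}^{N}
      \bigl( f(x_i; \theta) - y_i \bigr)\, x_i^{\,j}}_{\text{loss gradient part (blue)}}
  \;\;
  \underbrace{-\,\alpha \lambda \,\operatorname{sgn}(\theta_j)}_{\text{constant step (red)}}
\end{align*}
```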

    If the blue part above is smaller than λα, which itself is a very small number, Δθj is nearly a constant step of size λα. The sign of this step (red part) depends on sgn(θj), whose output depends on θj. If θj is positive, i.e. greater than ε, sgn(θj) equals 1, making Δθj approximately equal to -λα and pushing θj towards zero.

    To suppress the constant step (red part) that drives the parameter to zero, the gradient of the loss function (blue part) must be larger than the step size. For the loss function gradient to be that large, the value of the feature must affect the output of the model significantly.
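The tug-of-war between the two parts can be seen in a one-parameter toy. This is a deliberately simplified sketch: the loss gradient is frozen at a constant value (in real training it changes as θ changes), and the helper `run` and all numeric values are hypothetical.

```python
import numpy as np

def run(loss_grad, theta=0.5, lam=0.1, alpha=0.1, steps=300):
    """Repeat theta <- theta - alpha * (loss_grad + lam * sgn(theta))."""
    for _ in range(steps):
        theta -= alpha * (loss_grad + lam * np.sign(theta))
    return theta

weak = run(loss_grad=-0.05)    # |loss gradient| < lam: the constant step wins
strong = run(loss_grad=-0.2)   # |loss gradient| > lam: the loss gradient wins
print(weak, strong)
```

With the weak gradient the parameter is dragged down to zero and then hovers in a narrow band around it, since a fixed step size rarely lands on exactly 0; treating values within a small tolerance of zero as zero matches the ε mentioned in the text. With the strong gradient the parameter moves away from zero and the feature survives.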

    This is how a feature is eliminated or, more precisely, how its corresponding parameter, whose value does not correlate with the output of the model, is zeroed by L1 regularization during training.
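The whole mechanism can be sketched end to end with plain gradient descent. The example below departs from the article's polynomial setting: it assumes six independent, standardized random features of which only the first two influence the target, and the values of `lam`, `alpha` and the tolerance `eps` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 6
X = rng.normal(size=(n, p))                    # hypothetical standardized features
true_theta = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])   # only two matter
y = X @ true_theta + rng.normal(0.0, 0.1, size=n)

lam, alpha, eps = 0.1, 0.05, 0.02              # illustrative hyperparameters
theta = np.zeros(p)
for _ in range(2000):
    mse_grad = (2.0 / n) * X.T @ (X @ theta - y)        # gradient of the MSE term
    theta -= alpha * (mse_grad + lam * np.sign(theta))  # constant L1 step added

selected = np.flatnonzero(np.abs(theta) > eps)  # parameters not within eps of zero
print(np.round(theta, 3), selected)
```

The parameters of the four irrelevant features end up within `eps` of zero, while the two relevant ones stay close to their true values, shrunk slightly by the penalty. Plain gradient descent only brings parameters near zero; solvers that return exact zeros use other update schemes, which are outside the scope of this sketch.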

    Further Reading And Conclusion

    • To get more insights on the topic, I have posted a question on the r/MachineLearning subreddit, and the resulting thread contains different explanations that you may want to read.
    • Madiyar Aitbayev also has an interesting blog covering the same question, but with a geometrical explanation.
    • Brian Keng's blog explains regularization from a probabilistic perspective.
    • This thread on CrossValidated explains why the L1 norm encourages sparse models. A detailed blog by Mukul Ranjan explains why the L1 norm, and not the L2 norm, encourages the parameters to become zero.

    "L1 regularization performs feature selection" is a simple statement that most ML learners agree with, without diving deep into how it works internally. This blog is an attempt to share my understanding and mental model with readers in order to answer the question in an intuitive manner. For suggestions and doubts, you can find my email at my website. Keep learning and have a nice day ahead!


