Whether you’re training a small neural network or fine-tuning a large transformer model, there’s one optimization workhorse quietly doing the heavy lifting behind the scenes:
Gradient Descent.
It’s simple. It’s powerful. And it’s everywhere in machine learning.
But while most of us learn the basics early on, the different variants of gradient descent, and how they behave in practice, are worth a closer look.
At its core, Gradient Descent is an optimization algorithm. It helps a model minimize its loss by adjusting its parameters in the direction that reduces the error.
Imagine you’re standing on a foggy hilltop, trying to reach the valley below. You can’t see far, but you can feel the slope beneath your feet. So you take small steps downhill, one at a time, until you find the lowest point.
That’s exactly what gradient descent does: it uses the gradient (slope) of the loss function to update the weights step by step, until we (hopefully) reach the optimal model.
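To make the hill analogy concrete, here’s a minimal sketch of the update rule θ ← θ − η·∇L(θ) applied to a toy linear-regression problem. The synthetic data, learning rate, and step count are illustrative choices for this sketch, not something prescribed by the method itself:

```python
import numpy as np

# Synthetic data for a simple linear model y ≈ w * x + b (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + rng.normal(scale=0.1, size=100)

# Parameters to learn, starting from zero.
w, b = 0.0, 0.0
learning_rate = 0.1  # step size: how far downhill we move on each iteration

for step in range(200):
    # Forward pass: predictions and mean squared error loss.
    y_pred = w * x + b
    error = y_pred - y
    loss = np.mean(error ** 2)

    # Gradients of the loss with respect to w and b (the "slope" we feel underfoot).
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)

    # Update rule: take a small step in the direction that reduces the loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w={w:.3f}, b={b:.3f}, final loss={loss:.5f}")
```

Each pass through the loop computes the slope of the loss at the current parameters and nudges them a small step downhill; run long enough, w and b settle close to the values that generated the data.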