Gradient Descent in Machine Learning

Sai Chandra Nerella
4 min read · Aug 7, 2021

--

So far we have gone through how we process, augment and train a machine learning model on images, and we have also covered neurons, activation functions and loss functions. Ouch!!! If you're new and this terminology is confusing, please refer to my previous articles to get some insight.
You can find all my stories here: https://saichandra1199.medium.com/

Now let's have a look at optimization. The chance of getting a good model on the first attempt is very small, so we optimize the model to perform better by reducing its loss. Gradient Descent is one of the algorithms used to make the model better by minimizing the loss function through small tweaks to its parameters. It starts from some initial parameters and repeatedly adjusts them in the direction that decreases the cost (loss) function, using calculus. When the loss is a convex function, following the gradient downhill leads straight to the minimum. Because the loss descends step by step by following gradients, the algorithm is known as gradient descent.
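To make this concrete, here is a minimal sketch of the update loop in Python, assuming a toy convex loss L(w) = (w − 3)² chosen purely for illustration (it is not from any real model):

```python
# Minimal gradient descent sketch on a toy convex loss L(w) = (w - 3)**2,
# whose minimum is at w = 3. Loss, starting point and learning rate are
# illustrative assumptions.

def loss(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)              # dL/dw, the slope at w

w = 0.0                              # initial parameter
learning_rate = 0.1

for step in range(50):
    w = w - learning_rate * gradient(w)   # step against the slope

print(round(w, 4))                   # ends up close to 3.0, the minimum
```

Each iteration nudges w in the direction that lowers the loss, which is exactly the "descending along the gradient" idea described above.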

The gradient here is nothing but the derivative of the loss function with respect to the weights, also known as the slope in mathematical terms. It tells us which direction to move in order to reach the global minimum, the point where the loss is as low as it can possibly get. A quick numerical check below makes this concrete, and the figure after it shows the idea visually.
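As a sanity check on the slope interpretation, here is a hedged sketch comparing the analytic derivative with a finite-difference approximation, reusing the same toy loss L(w) = (w − 3)² assumed above:

```python
# Gradient as slope: the finite-difference slope of L(w) = (w - 3)**2
# should match the analytic derivative 2*(w - 3). The toy loss is an
# illustrative assumption.

def loss(w):
    return (w - 3) ** 2

def analytic_grad(w):
    return 2 * (w - 3)

def numeric_grad(w, eps=1e-6):
    # central difference: slope of the loss curve around w
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

w = 1.0
print(analytic_grad(w))   # -4.0
print(numeric_grad(w))    # approximately -4.0
```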

Gradient Descent

As we discussed, the blue dot in the figure moves towards the optimum value that minimizes the loss, and each step is taken based on the gradient of the current loss function. Along with the weights and the loss function, there is one more important parameter that plays a major role in reducing the loss: the learning rate.

Choosing a good learning rate helps us reach the global optimum smoothly, so its value should be set carefully. A much higher learning rate (alpha) causes big jumps towards the optimum, with a high chance of overshooting the minimum, and this is where the problem known as exploding gradients shows up. A much lower value of alpha makes the iterations extremely slow, and that is where vanishing gradients comes in. The short sketch below makes the difference concrete, and the figure after it illustrates both problems.
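Here is a rough sketch of that trade-off, again on the toy loss L(w) = (w − 3)²; the specific learning rates are illustrative assumptions, not recommendations:

```python
# Effect of the learning rate on the toy loss L(w) = (w - 3)**2.
# The values 0.1, 1.1 and 0.001 are illustrative assumptions only.

def gradient(w):
    return 2 * (w - 3)

def run(learning_rate, steps=30):
    w = 0.0
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

print(run(0.1))     # converges near 3.0
print(run(1.1))     # each step overshoots further: the updates blow up
print(run(0.001))   # barely moves away from the starting point
```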

Problems with Gradients

As you can see, with exploding gradients the gradient value keeps growing until it heads towards infinity. When that happens, instead of moving towards the minimum, the updates shoot away from it, so we cannot obtain a good model. With vanishing gradients, on the other hand, the updates never reach the correct minimum: the parameters get stuck at some point, and after a few iterations the gradient becomes so negligible that they barely move at all. To make this even more transparent, let's look at the mathematical side of these problems. Where are you looking? Have a look below :)

Exploding Gradients

As you can see, when the change in a weight is large, the output it produces grows as well, which in turn produces an even larger gradient from the loss function, and that is what leads to a gradient explosion. The tiny sketch below shows the effect in a couple of lines. I think by now you have some idea about vanishing gradients too. Yes!!!! You're right, it is exactly the opposite problem; see the figure after the sketch for details.
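A tiny sketch of that multiplication effect, assuming a single repeated chain-rule factor of 1.5 across 50 layers (both numbers are illustrative assumptions):

```python
# Exploding gradients: repeatedly multiplying a backpropagated gradient
# by a chain-rule factor greater than 1. The factor and the number of
# layers are illustrative assumptions.

factor = 1.5
grad = 1.0
for layer in range(50):
    grad = grad * factor        # one chain-rule multiplication per layer

print(grad)                     # roughly 6.4e8 -- the gradient explodes
```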

Vanishing gradients

Here the value of the weight is far less than one, and the gradients calculated with it come out tiny as well, so there is hardly any update to the weights. This makes the iterations slow: the parameters circle around the same place without ever moving towards the optimum of the loss function. The mirror-image sketch below shows how quickly the gradient shrinks to nothing.
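And the mirror image, assuming a repeated factor of 0.5 over the same 50 layers (again purely illustrative):

```python
# Vanishing gradients: a chain-rule factor well below 1 shrinks the
# gradient towards zero, so the weight updates become negligible.

factor = 0.5
grad = 1.0
for layer in range(50):
    grad = grad * factor        # one chain-rule multiplication per layer

print(grad)                     # roughly 8.9e-16 -- effectively zero
```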

Plain gradient descent calculates the loss over all the data points at every step and still faces these problems. But there are other algorithms that overcome its drawbacks and make it work better: SGD, mini-batch GD, RMSprop, Momentum, Adam and so on. The small sketch below gives a taste of the mini-batch idea; let's get to know the rest in a coming article :} Stay tuned till then….
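As a small preview, here is a hedged sketch of the mini-batch idea: instead of computing the gradient over every data point, the weight is updated from a few shuffled samples at a time. The data, the toy linear model y = w·x and the batch size are assumptions made up for illustration:

```python
import random

# Toy data following y = 2x; we try to recover w = 2 with mini-batch updates.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]

def grad_on(batch, w):
    # derivative of the mean squared error (w*x - y)**2 with respect to w
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w, lr = 0.0, 0.01
data = list(zip(xs, ys))

for epoch in range(100):
    random.shuffle(data)                      # the "stochastic" part
    for i in range(0, len(data), 2):          # mini-batches of size 2
        w -= lr * grad_on(data[i:i + 2], w)

print(round(w, 3))                            # close to 2.0
```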

Thank You!!!!! Do follow for more interesting stories, and please share this with anyone who might find it useful.
