Basics of Machine Learning Series: Gradient Descent

Hello Everyone!

In today’s post I would like to discuss one of the basics of machine learning and something that I had a little bit of trouble understanding when I was first learning it: Gradient Descent.

Gradient Descent is the backbone of a lot of machine learning algorithms and even has applications outside the field of machine learning. While it may seem intimidating at first, with a basic knowledge of calculus and a few of its first principles, it becomes a lot less daunting.

Loss Functions

Before we can properly define gradient descent, we first need to define the notion of a loss function at a very high level. The loss of a machine learning algorithm over a dataset measures how different its results are from the anticipated results. Ideally, the goal of any machine learning algorithm is to minimize its loss over a training dataset, which, in theory, should lead to the algorithm performing well on test datasets as well as in real applications. Changing the parameters of a machine learning algorithm changes its outputs, and therefore changes the loss as well.
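As a small, hedged illustration (the post doesn't commit to any particular loss function), here is one common choice, mean squared error, sketched in Python. The data in it are made up purely for the example:

```python
# A minimal sketch of one common loss function: mean squared error (MSE).
# The numbers below are hypothetical illustrations, not real data.

def mse_loss(predictions, targets):
    """Average squared difference between predicted and anticipated results."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

targets = [1.0, 2.0, 3.0]

# Two different parameter settings would give different predictions, and
# therefore different losses; training tries to find the lower-loss setting.
print(mse_loss([0.9, 2.1, 2.8], targets))  # close to the targets -> small loss
print(mse_loss([0.0, 0.0, 0.0], targets))  # far from the targets -> large loss
```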

Gradient Descent is the process of minimizing the loss function during training in order to make our algorithm as accurate as possible. In essence, it is one of the tools that allows us to train our algorithm.

How Does it Work?

To illustrate the basic idea behind gradient descent, we can look at a simple function with one input and one output that we want to minimize.

[Figure: a one-input, one-output function that we want to minimize]

The idea of minimizing this function might instantly give you flashbacks to some of the first things you learned in your introductory calculus class. The idea of finding critical points and then discovering the global minimum via derivatives is always a fun exercise. However, there is another way to look at finding a minimum in this graph. Suppose, for instance, we started off at a random point on the graph as such:

Here, we are at the arbitrarily chosen point (0.7, -1.112). What is one way that we could reach a local minimum in this graph (more on why we don't, and sometimes can't, aim for a global minimum later)?

Well, one way to reach a local minimum in this graph is to find the derivative at this given point and simply move in the opposite direction. Whatever the derivative turns out to be, we take its negative and take a predefined “step” in that direction. Instead of just subtracting the raw derivative from 0.7, we scale it by a factor of 0.03 (a step size, often called the learning rate) so that each step moves us a controlled distance toward the minimum.

Then we arrive at a new point that should be closer to the minimum. At this new point, we can once again find the derivative and step in its “negative” direction. On the second iteration, our step is a little smaller than the first: as we approach the minimum, the derivative shrinks toward zero, so scaling it by the same factor naturally produces smaller steps and helps prevent overshooting past the minimum.

We can continue this process until the difference between our new point and our old point falls under a specified threshold; at that point, we have reached a local minimum.
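Putting those steps together, here is a minimal Python sketch of the procedure. The curve f below is a hypothetical stand-in for the plotted function (the post never writes it out); only the starting point of 0.7, the 0.03 scaling factor, and the stopping rule come from the description above:

```python
# A minimal sketch of 1-D gradient descent. The curve f is a hypothetical
# stand-in for the plotted function; the 0.7 start, the 0.03 scale factor,
# and the stopping rule follow the description above.

def f(x):
    return x ** 4 - 2 * x ** 2 + 0.5 * x   # hypothetical example curve

def f_prime(x):
    return 4 * x ** 3 - 4 * x + 0.5        # its derivative

x = 0.7            # arbitrarily chosen starting point
step_scale = 0.03  # factor we scale the derivative by on each step
threshold = 1e-6   # stop once consecutive points barely move

while True:
    x_new = x - step_scale * f_prime(x)    # step opposite to the derivative
    if abs(x_new - x) < threshold:         # difference under the threshold:
        break                              # we have reached a local minimum
    x = x_new

print(x, f(x))                             # a local minimum of the curve
```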

Applying to Machine Learning Algorithms

Generally, machine learning algorithms have a lot more than just one parameter to tune and optimize, so we need some extra tools to generalize the above concept. Since the ordinary derivative only handles functions of a single variable, we replace it with the gradient of the function at a given point. The gradient of a function with an N-dimensional input is an N-dimensional vector of partial derivatives, and it points in the direction in which the function increases most steeply. Moving in the opposite direction (simply multiplying the gradient by -1) takes us closer to a local minimum.
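As a hedged sketch of the same idea in more than one dimension, here is gradient descent on a made-up two-parameter quadratic bowl; the function and the numbers are hypothetical, but the update rule is exactly the one described above, with the derivative replaced by the gradient vector:

```python
# A sketch of gradient descent with an N-dimensional input, using a
# hypothetical 2-D bowl f(w) = (w0 - 1)^2 + 3*(w1 + 2)^2.
import numpy as np

def grad_f(w):
    # Gradient: the vector of partial derivatives, pointing in the
    # direction of steepest increase of f at w.
    return np.array([2 * (w[0] - 1), 6 * (w[1] + 2)])

w = np.array([5.0, 5.0])  # arbitrary starting point
step_scale = 0.05

for _ in range(500):
    w = w - step_scale * grad_f(w)  # move opposite to the gradient

print(w)  # should approach the minimum at (1, -2)
```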

The reason we cannot guarantee a global minimum with gradient descent is that, depending on where we start, we may only reach a local minimum that is not the global one. This is evident in the example above: starting at x = 0.7 we reached a local minimum, whereas starting at a point like x = -2 we would have reached the global minimum. Gradient descent is therefore only guaranteed to find a local minimum, not necessarily a global one.
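To make the dependence on the starting point concrete, here is a short sketch that reuses the same hypothetical double-well curve as the earlier sketch and runs the descent from the two starting points mentioned above (x = 0.7 and x = -2); with this made-up curve, the two runs settle into different minima:

```python
# Hypothetical double-well curve: which minimum gradient descent finds
# depends entirely on where it starts.

def f(x):
    return x ** 4 - 2 * x ** 2 + 0.5 * x

def f_prime(x):
    return 4 * x ** 3 - 4 * x + 0.5

def descend(x, step_scale=0.03, threshold=1e-6):
    while True:
        x_new = x - step_scale * f_prime(x)
        if abs(x_new - x) < threshold:
            return x_new
        x = x_new

x_a = descend(0.7)   # settles in the shallower (local) minimum near x ~  0.93
x_b = descend(-2.0)  # settles in the deeper (global) minimum near x ~ -1.06
print(x_a, f(x_a))
print(x_b, f(x_b))
```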

How does it all tie together?

Minimizing the loss of a machine learning algorithm over a training dataset is how we make the algorithm “better,” so that it performs well on test data and on future data. The most common way to do this is to first calculate the loss of the algorithm (whether it is a neural network or a support vector machine) over the training data (essentially run the data through the algorithm and measure how wrong the output is compared to what we want it to be), and then perform gradient descent on that loss by tweaking the N-dimensional weight vector that serves as the parameters for the algorithm. Repeating this drives the loss down toward a (local) minimum.
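As a rough end-to-end sketch of that loop, here is a deliberately tiny example: a linear model with a two-entry weight vector, trained by gradient descent on a mean-squared-error loss. The data and the model are hypothetical; a neural network or support vector machine follows the same pattern with many more parameters and a different loss:

```python
# End-to-end sketch: compute the loss of a tiny linear model over training
# data, then nudge the weight vector opposite to the gradient of that loss.
# The data and the model here are hypothetical illustrations.
import numpy as np

# Hypothetical training data: 2 input features per example, 1 target each.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [0.5, 3.0]])
y = np.array([5.0, 3.5, 5.5, 6.0])

w = np.zeros(2)     # the N-dimensional weight vector we tune
step_scale = 0.05

for _ in range(2000):
    predictions = X @ w                   # run the data through the model
    errors = predictions - y              # how wrong the outputs are
    loss = np.mean(errors ** 2)           # mean squared error loss
    grad = 2 * X.T @ errors / len(y)      # gradient of the loss w.r.t. w
    w = w - step_scale * grad             # gradient descent step

print(w, loss)
```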

This was a pretty basic, high-level overview of how gradient descent works; it doesn't cover some of its applications or how it is practically used in most settings (stochastic gradient descent). Minimizing a given loss function via gradient descent is essentially how an algorithm “learns,” as it tunes its parameters.
