Introduction to Machine Learning: Decision Trees

Today I will briefly introduce the idea of decision trees, delving into exactly what they are and why they are used.

What are decision trees?

Decision trees are a type of machine learning model that can be used for both classification and regression problems. A decision tree essentially builds a tree of "decisions" to make about a new input, which together help you decide on the value of an attribute for that input. For example, we could take an input dataset of NBA players' stats and use it to build a decision tree that decides whether a player is a hall of famer or not. In this example, a very simple decision tree could consist of a single node that splits players on points per game (PPG).

This decision tree would classify any player who scored more than 20 points per game as a hall of fame player, while any player who scored 20 or fewer would be classified as not a hall of fame player.

The above decision tree is quite simplistic and would likely make many mistakes if it were actually tested; however, it conveys the idea of what exactly a decision tree is quite well.
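
To make this concrete, a minimal sketch of the tree above as code might look like this; the function name is illustrative, and the 20 PPG threshold is just the one from our example.

```typescript
// A minimal sketch of the single-split tree described above.
// The 20 PPG threshold comes from the example; the names are illustrative.
function classifyHallOfFamer(pointsPerGame: number): boolean {
  // Root node: a single decision on points per game.
  return pointsPerGame > 20;
}

console.log(classifyHallOfFamer(27.0)); // true  -> classified as a hall of famer
console.log(classifyHallOfFamer(12.4)); // false -> classified as not a hall of famer
```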

How are they built?

Now that we know what decision trees are, we can discuss how exactly they are built. Firstly, as with any machine learning model, decision trees require a set of training data to actually build out the model. Since we are building a tree, it makes sense to start from the root and work our way down to the leaf nodes (which, in our case, are where a classification is decided).

To actually understand the process of building a decision tree, we should first set a goal for ourselves: what would we like to accomplish over this training data to consider our decision tree "successful"?

Unlike with some other machine learning models, we usually do not minimize a loss function to build a decision tree; rather, we utilize the concept of entropy.

What is entropy?

Entropy is a concept that is certainly not particular to machine learning and has its own applications throughout the sciences. For our purposes, to understand entropy, I will first define the notion of surprise.

Suppose we were drafting NBA players and we had 4 rooms in front of us:

  • Room 1: 3 hall of fame players and 9 non hall of fame players
  • Room 2: 6 hall of fame players and 6 non hall of fame players
  • Room 3: 9 hall of fame players and 3 non hall of fame players
  • Room 4: 12 hall of fame players and 0 non hall of fame players

In order to draft players, you pick one of the rooms and randomly select a player from the room.

If we selected the first room and randomly drew a player who was a non hall of famer, we wouldn't be very surprised with our pick. In the second room, if we drew a non hall of famer, we would be a little more surprised, but not too surprised (after all, the odds are 50-50). In the third room, if we drew a non hall of famer, we would be quite surprised (the odds of that happening are 25%), and in the fourth room, if we drew a non hall of famer, you would start to believe I had lied to you in the premise; your surprise would know no bounds, it would be undefined (the probability of that happening is 0%, after all!).

From the above thought experiment, we can see that as the probability of something happening decreases, the surprise associated with it increases. This means that surprise and probability have an inverse relationship. We can define the surprise of an event E to be

Surprise(E) = log2(1 / P(E))

Here, we take the log of the inverse probability because when the probability of an event is 1 (certain to happen), we want our surprise to be 0, and taking the log accomplishes that. When the probability of an event is 0, the log leaves the surprise undefined, as that event should never happen in the first place. This also maintains the inverse relationship between surprise and probability.

In order to go from this definition of surprise to the form of entropy more commonly seen in other fields of study, we first apply a simple logarithm rule:

Surprise(E) = log2(1 / P(E)) = -log2(P(E))

When looking at an event, the surprise is tied to one distinct outcome of that event: if in room 1 you choose a player and he turns out to be a hall of famer, there is a surprise associated with that outcome that we can calculate via the above formula. To get the entropy of any of the rooms, we simply have to calculate the expected value (or average) of the surprise. To do this, take all the possible values of the surprise and multiply each by the probability of getting that value; this is just how expectation is regularly calculated:

Entropy = Σ P(E) * log2(1 / P(E)) = -Σ P(E) * log2(P(E))

where the sum runs over the distinct outcomes E.

In our example, the entropy of a room is calculated based on how surprised we expect to be when we choose a player from that room.
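
To make this concrete, here is a sketch that computes surprise and entropy for the four rooms above (in bits, since we take log base 2); the function names and the [hall of famers, non hall of famers] array representation are illustrative.

```typescript
// A sketch of surprise and entropy for the four-room thought experiment.
function surprise(p: number): number {
  return Math.log2(1 / p); // Infinity (undefined surprise) when p = 0
}

// Entropy is the expected surprise over all outcomes in a room.
function entropy(counts: number[]): number {
  const total = counts.reduce((a, b) => a + b, 0);
  return counts
    .filter((c) => c > 0) // zero-probability outcomes contribute nothing
    .map((c) => c / total)
    .reduce((sum, p) => sum + p * surprise(p), 0);
}

// [hall of famers, non hall of famers] in each room
console.log(entropy([3, 9]));  // ~0.811
console.log(entropy([6, 6]));  // 1 (maximum uncertainty, 50-50)
console.log(entropy([9, 3]));  // ~0.811
console.log(entropy([12, 0])); // 0 (no uncertainty at all)
```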

How is entropy used?

To understand how entropy is used in decision trees, we have to first understand how decision trees work.

While building a decision tree, at any given node there is a subset of the training data that satisfies the path in the tree to that node. For example, at the root, the subset we are considering is the full training data; in our example above, if the PPG node is the root, then we are considering all of the training data as our subset. If we then partition into > 20 PPG and < 20 PPG, the subsets at the two child nodes of the root would be the players who scored more than 20 and the players who scored less than 20, respectively. These subsets are how the entropy at any node is calculated.

We have 3 broad goals while building the decision tree:

  • To minimize the entropy at any given node in a greedy fashion
  • To have 0 entropy at leaf nodes
  • Or, failing that, to exhaust all the attributes on which we can partition the data

To do this, we can employ a metric called information gain at each new node while we are recursively building the tree, choosing the split that most reduces the entropy at the resulting nodes.

The information gain at a node is defined with respect to the subset of the training data available at that node and the attribute we choose to partition on there. At any given node, we choose the attribute that gives us the highest information gain, partition the data based on that attribute, and then determine the information gain at each of the child nodes in turn (unless there are no more attributes to partition on, or the entropy is 0, in which case we are at a leaf node and we are done).

The formula for information gain is as follows:

Gain(A, E) = Entropy(A) - Σ_V (|A_V| / |A|) * Entropy(A_V)

where A is the subset of data at the node, E is the attribute we split on, the sum runs over each value V that E can take, and A_V is the subset of A for which E has value V.

Here, the gain for any given node is calculated based on the subset of data at the node as well as the attribute we would split on there. First we calculate the entropy of the node based on the subset of data available at it. Then we essentially simulate what the entropy of each of the child nodes would be if we split on the attribute E; to weight the children correctly, we multiply the probability of getting each value V by the entropy of the corresponding subset A_V. We take the sum of those values and subtract it from the entropy at A, which gives us how much, on average, our entropy would decrease if we split on attribute E. Let's illustrate this with an example:

In this example, let's take the total number of players in our training data at the root node to be 14, with 9 hall of famers and 5 non hall of famers. First, we would calculate the entropy at the node as follows:

En(root) = -(9/14) * log2(9/14) - (5/14) * log2(5/14)

This gives us an entropy of 0.94.

Now, assuming that we want to split on PPG greater than 20 versus less than 20, we calculate our gain based on this attribute. Let's say that in the left child, 6 players are hall of famers and 2 aren't, and in the right child, 3 players are hall of famers and 3 aren't. We can calculate the entropies of these two children and then calculate the gain:

En(left) = -(6/8) * log2(6/8) - (2/8) * log2(2/8)

En(right) = -(3/6) * log2(3/6) - (3/6) * log2(3/6)

From these calculations, we get En(left) = 0.81 and En(right) = 1.

Now we can calculate the gain as follows:

Gain = 0.94 - (8/14) * 0.81 - (6/14) * 1 ≈ 0.05

We choose 8/14 and 6/14 because they are the probabilities of being in the left and right child respectively. Here we can see that our information gain is not great, so this attribute probably shouldn't be used.

This makes sense on an intuitive level as well: splitting on PPG leads to a right child where half of the players are hall of fame players and half are not. This means that our split essentially did not achieve much, and there might be better attributes to split on.
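
Putting this together, here is a sketch that reproduces the worked example above; entropy() is repeated from the earlier snippet so this is self-contained, and the helper names are illustrative.

```typescript
// Entropy of a subset, as in the earlier snippet.
function entropy(counts: number[]): number {
  const total = counts.reduce((a, b) => a + b, 0);
  return counts
    .filter((c) => c > 0)
    .map((c) => c / total)
    .reduce((sum, p) => sum + p * Math.log2(1 / p), 0);
}

// Gain = entropy(parent) - weighted average of the child entropies.
function informationGain(parent: number[], children: number[][]): number {
  const total = parent.reduce((a, b) => a + b, 0);
  const weighted = children.reduce((sum, child) => {
    const size = child.reduce((a, b) => a + b, 0);
    return sum + (size / total) * entropy(child);
  }, 0);
  return entropy(parent) - weighted;
}

// 14 players at the root: 9 hall of famers, 5 not.
// Splitting on PPG > 20 gives children [6, 2] and [3, 3].
console.log(informationGain([9, 5], [[6, 2], [3, 3]])); // ~0.048
```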

So, information gain is maximized at each node while we are building a decision tree, and while going down a branch of the tree, we cannot use the same attribute twice to separate the subset of data available.

Conclusion

This was just a very brief summary of how decision trees are built. There are other measures besides information gain (such as Gini impurity) that can be optimized, depending on the needs of the decision tree. Also, we did not cover the distinction between discrete and continuous attributes. For example, PPG is a continuous attribute, and to split on PPG we had to divide it into two buckets. It is possible that a different split on PPG (at 30 or 40 PPG, say) would lead to better results. All of these are considerations to take into account when building a decision tree.


Basics of Machine learning series: Loss Functions

Hello! Today I will be writing a post to give a brief overview of loss functions and what purpose they serve in the world of machine learning and optimization.

Loss functions are something I had briefly covered in my post discussing gradient descent, and they play a huge part in gradient descent.

Why do we need loss functions?

Before describing the different types of loss functions, it is important to understand exactly why we need them. Loss functions simply help map the performance of your machine learning algorithm over a given input or set of inputs to a single number. The closer our loss is to zero, the closer our algorithm is to performing optimally. The reason we need loss functions is also quite simple: they give us a direction in which we can move to improve our machine learning model (I use the word direction on purpose, as it ties in nicely with gradient descent).

Mean Squared Loss

For different types of machine learning problems, loss functions work a little differently, but the main idea behind them remains the same. One of the most popular loss functions, and one that is quite intuitive to understand, is mean squared loss, or L2 loss. In this loss function, you take the output of the machine learning algorithm, subtract it from the desired outcome, and square the difference. Summing this squared difference over all the inputs and then taking the average yields your loss. The formula for mean squared loss is as follows:

MSE = (1/n) * Σ (E_i - A_i)^2

where E_i is the expected output for input i, A_i is the actual output, and the sum runs over all n inputs.
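
As a sketch, mean squared loss fits in a few lines; representing the expected and actual outputs as two parallel arrays is an assumption for illustration.

```typescript
// A sketch of mean squared loss over parallel arrays of outputs.
function meanSquaredLoss(expected: number[], actual: number[]): number {
  if (expected.length !== actual.length || expected.length === 0) {
    throw new Error("inputs must be non-empty and the same length");
  }
  const sumOfSquares = expected.reduce(
    (sum, e, i) => sum + (e - actual[i]) ** 2, // square each difference
    0
  );
  return sumOfSquares / expected.length; // average over all inputs
}

console.log(meanSquaredLoss([1, 2, 3], [1.1, 1.9, 3.2])); // ~0.02
```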

The mean squared loss function has its own pros and cons. Since each difference is squared before averaging, outliers will impact your overall loss heavily. However, if there aren't expected to be many outliers in your dataset, mean squared loss works quite well.

Mean squared loss works best for regression-type problems, where the result is a number or some other continuous value.

This has been just one example of a loss function; I will be sure to write updates covering other types of loss functions as well!


Basics of Machine Learning Series: Gradient Descent

Hello Everyone!

In today’s post I would like to discuss one of the basics of machine learning and something that I had a little bit of trouble understanding when I was first learning it: Gradient Descent.

Gradient Descent is the backbone of a lot of machine learning algorithms and even has applications outside the field of machine learning. While it may seem intimidating at first, with a basic knowledge of calculus and some of its first principles, it becomes a lot less daunting.

Loss Functions

Before we can properly define gradient descent, we first need to define the notion of a loss function at a very high level. The loss, when calculated for a machine learning algorithm over a dataset, measures how different our result is from the anticipated result. Ideally, the goal of any machine learning algorithm is to minimize its loss over a training dataset, which, in theory, should lead to the algorithm performing well on test datasets as well as in real applications. Changing the various parameters associated with a machine learning algorithm changes the overall result of the algorithm, thus changing the loss as well.

Gradient Descent is the process of minimizing the loss function during training time in order to make our algorithm as accurate as possible. In essence, it is one of the tools that allows us to train our algorithm.

How Does it Work?

To illustrate the basic idea behind gradient descent, we can look at a basic 1 input 1 output function that we want to minimize.

[Figure: a 1 input, 1 output function that we want to minimize]

The idea of minimizing this function might instantly give you flashbacks to some of the first things you learned in your introductory calculus class. The idea of finding critical points and then discovering the global minimum via derivatives is always a fun exercise. However, there is another way to look at finding a minimum in this graph. Suppose, for instance, we started off at a random point on the graph as such:

Here, we are at the arbitrarily chosen point (0.7, -1.112). What is one way that we could reach a local minimum in this graph (more on why we don’t and sometimes can’t worry about a global minimum later)?

Well, one way to reach a local minimum in this graph is to find the derivative at this given point and simply go in the opposite direction of the derivative: whatever we find the derivative to be, we take the negative of it and take a predefined "step" in that direction. Instead of just subtracting the raw derivative from 0.7, we scale the derivative by a factor of 0.03 here to approach our minimum gradually.

Then we arrive at a new point that should be closer to the minimum. At this new point, we can once again find the derivative and go in the "negative" direction of the derivative. On the second iteration of this process, we can see that our "step" should be a little smaller than the first, since we are getting closer to our minimum and would like to prevent overshooting past it.

We can continue doing this until our new point and our old point differ by less than a specific threshold; at that point, we have reached a local minimum.
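
Here is a sketch of this loop for a one-input function. The function f and its derivative df are stand-ins (not the function from the original figure), but the 0.03 step scale, the stopping threshold, and the starting point of 0.7 follow the discussion above.

```typescript
// A sketch of 1-D gradient descent. f and df are stand-in examples.
const f = (x: number): number => x ** 4 - 3 * x ** 2 + x;
const df = (x: number): number => 4 * x ** 3 - 6 * x + 1;

function gradientDescent1D(start: number, stepScale = 0.03, threshold = 1e-6): number {
  let x = start;
  while (true) {
    const next = x - stepScale * df(x); // step opposite the derivative
    if (Math.abs(next - x) < threshold) return next; // barely moved: local minimum
    x = next;
  }
}

console.log(gradientDescent1D(0.7)); // settles into the nearest local minimum (~1.13 here)
console.log(gradientDescent1D(-2));  // a different start finds a lower minimum (~-1.30 here)
```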

Applying to Machine Learning Algorithms

Generally, machine learning algorithms have a lot more than just one parameter to tune and optimize, so we need some extra tools to generalize the above concept to machine learning algorithms. Since the derivative applies to functions of a single input, we replace taking the derivative of the function at a given point with taking its gradient. The gradient is the direction in which a given n-dimensional function is most steeply increasing, and the gradient of a function with N-dimensional inputs is an N-dimensional vector. By moving in the direction opposite to the gradient (just multiplying the gradient by -1), we can get closer to a local minimum.
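
As a sketch of this generalized version, the snippet below estimates the gradient numerically (central differences) and descends on a simple two-input bowl function; the function, step size, and fixed iteration count are all illustrative choices.

```typescript
// A sketch of gradient descent in N dimensions with a numerical gradient.
function numericalGradient(f: (v: number[]) => number, v: number[], h = 1e-5): number[] {
  return v.map((_, i) => {
    const plus = v.slice();
    const minus = v.slice();
    plus[i] += h;
    minus[i] -= h;
    return (f(plus) - f(minus)) / (2 * h); // central difference along axis i
  });
}

function gradientDescentND(f: (v: number[]) => number, start: number[], stepScale = 0.1, steps = 200): number[] {
  let v = start.slice();
  for (let s = 0; s < steps; s++) {
    const gradient = numericalGradient(f, v);
    v = v.map((x, i) => x - stepScale * gradient[i]); // move opposite the gradient
  }
  return v;
}

// f(x, y) = (x - 1)^2 + (y + 2)^2 has its minimum at (1, -2).
const bowl = (v: number[]) => (v[0] - 1) ** 2 + (v[1] + 2) ** 2;
console.log(gradientDescentND(bowl, [5, 5])); // ~[1, -2]
```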

The reason we cannot guarantee that gradient descent reaches a global minimum is that, depending on where we start, we might only reach a local minimum that is not the global one. This is evident in the above example: we only reached a local minimum because of where we started (x = 0.7), whereas if we had started at a point like x = -2, we would've reached the global minimum. So, for a given function, gradient descent is only guaranteed to find a local minimum, not necessarily a global one.

How does it all tie together?

Minimizing the loss of a machine learning algorithm over a training dataset allows us to make the algorithm "better", so that it performs well on test data and thus on future datasets. The most common way to do this is to first calculate the loss of the algorithm (whether it be a neural network or a support vector machine) over the training data (essentially, run the training data through the algorithm and measure how far the output is from what we want it to be) and then perform gradient descent to minimize the loss by tweaking the N-dimensional weight vector that serves as the parameters of the algorithm. This leads us to a low loss, though, as noted above, only a local minimum is guaranteed.

This was a pretty basic, high-level overview of how gradient descent works, and it doesn't tackle some of the variants that are practically used in most settings (such as stochastic gradient descent). Minimizing a given loss function via gradient descent is essentially how the algorithm "learns" (as it tunes its parameters).


Why does Javascript seem to have asynchronous execution?

Javascript is a synchronous, interpreted, and dynamically typed programming language that is generally used to make browsers/HTML more interactive. When I first learned this, I was a little confused, to say the least. How is Javascript synchronous? Doesn't Javascript allow the user to click multiple things at once and execute all of them together? Initially I was sure there must've been a mistake in what I was reading, but further research revealed that Javascript is in fact synchronous. So what gives? I looked into it further, and while the results were surprising to me, they were also very informative.

The way Javascript (and most other programming languages) works is that it has something called an execution stack. The execution stack can be thought of as the stack that contains all the functions you call (this is also why, when you call too many functions in recursion, the error is referred to as a stack overflow: the execution stack has literally run out of memory to store your function calls). [Figure: an illustration of the execution stack, along with model Javascript code corresponding to it]

While the execution stack is common to most programming languages, and Javascript's execution stack works very similarly to most others, there is another underlying data structure that gives Javascript (at least the feeling of) asynchronous execution: the event queue. The event queue is what the rest of the browser uses to communicate with the Javascript interpreter. When you click something in the browser, the browser communicates with the Javascript engine and that event gets put on the event queue.

However, Javascript is still synchronous, so even though that event is ON the event queue, it will only get executed after the execution stack is empty. In other words, Javascript keeps events on the event queue, and once all the items on the execution stack have been popped off, it looks at the event queue and checks whether there are instructions in the code (left by the programmer) on what to do for specific events. These are usually called asynchronous callbacks, and while they do work asynchronously, Javascript in and of itself is synchronous and executes lines one by one.

The process of putting events on the event queue is asynchronous, as multiple events can occur at once (for example, an HTTP request completing as well as a button press); however, Javascript will only ever deal with events one at a time and execute their code line by line.
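
A classic way to see this behavior is with setTimeout: even with a delay of 0, the queued callback only runs once the execution stack is empty. A minimal sketch:

```typescript
// The setTimeout callback goes on the event queue immediately,
// but it only runs after all synchronous code has finished.
console.log("start");

setTimeout(() => {
  console.log("callback from the event queue");
}, 0);

for (let i = 0; i < 3; i++) {
  console.log(`synchronous work ${i}`); // these all run first
}

console.log("end");
// Prints: start, synchronous work 0-2, end, and only then the callback.
```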


Euler’s Identity

In the last post, we talked about imaginary numbers and how they are defined. We said that the imaginary number i is defined as the square root of -1, and that using i we can express all imaginary numbers. However, imaginary numbers also appear in a more intricate relationship: Euler's identity. In a form you may have seen before, it looks like this:

e^(iπ)+1=0

This relationship is often touted as one of the most beautiful in mathematics, as it includes many of the most important mathematical entities used throughout math: the base of the natural logarithm, e (also referred to as Euler's number), the imaginary number i, the ratio of the circumference of a circle to its diameter, π, and perhaps the two most important whole numbers, 1 and 0.

Euler’s identity is just one very specific instance of Euler’s formula.

How to derive Euler’s formula?

Now that we can appreciate why Euler's identity is so important, it is worth understanding exactly how it is derived. What mathematical gymnastics does one have to do to arrive at an identity that uses some of the most important constants in mathematics? The answer: Taylor series.

To get to the above result, we will first look at the Taylor series for e^x.

e^x = 1 + x + (x^2/2!) + (x^3/3!) + (x^4/4!) + (x^5/5!) + …

In this formula, let's substitute i*x for x, so we get:

e^(i*x) = 1 + (i*x) + ((i*x)^2/2!) + ((i*x)^3/3!) + ((i*x)^4/4!) + ((i*x)^5/5!) + …

From the last lesson we know i^2 = -1, i^3 = -i, and i^4 = 1. Using this information:

e^(i*x) = 1 + ix + (-x^2/2!) + (-ix^3/3!) + (x^4/4!) + (ix^5/5!) + …

And now, grouping the real and imaginary terms, we get:

e^(i*x) = (1 - (x^2/2!) + (x^4/4!) - …) + (ix - (ix^3/3!) + (ix^5/5!) - …)

Factoring out i from the second group, we get:

e^(i*x) = (1 - (x^2/2!) + (x^4/4!) - …) + i(x - (x^3/3!) + (x^5/5!) - …)

Now, the two groups in this series should be familiar if you have studied Taylor series, as they are simply the expansions of cos x (the first group) and sin x (the second group). Substituting cos x and sin x in, we get:

e^(i*x) = cos x + i*sin x

If we plug in π for x, we get:

e^(i*π) = cos π + i*sin π

e^(i*π) = -1 + i*0

e^(i*π) + 1 = 0

This is how Euler’s identity was derived.
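
As a numerical sanity check of this derivation, the sketch below sums a truncated Taylor series for e^(i*x) using complex multiplication and compares it to cos x + i*sin x; the term count and the Complex type are illustrative.

```typescript
// Sum (i*x)^n / n! term by term with complex arithmetic.
type Complex = { re: number; im: number };

function expI(x: number, terms = 20): Complex {
  let sum: Complex = { re: 0, im: 0 };
  let term: Complex = { re: 1, im: 0 }; // (i*x)^0 / 0!
  for (let n = 0; n < terms; n++) {
    sum = { re: sum.re + term.re, im: sum.im + term.im };
    // Multiply the previous term by (i*x) / (n + 1) to get the next one.
    term = {
      re: (-term.im * x) / (n + 1),
      im: (term.re * x) / (n + 1),
    };
  }
  return sum;
}

const z = expI(Math.PI);
console.log(z.re, z.im); // ~-1, ~0, i.e. cos π + i*sin π
console.log(Math.cos(Math.PI), Math.sin(Math.PI)); // -1, ~0
// So e^(iπ) + 1 ≈ 0, which is Euler's identity.
```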

At the end of the day, Euler’s identity gives us a relationship between the most important constants in mathematics and while its derivation might be complicated, it is worth understanding.


Complex Numbers

Hey guys, today I will be giving a brief summary of complex numbers and the concepts surrounding them.

First, we will start off with the basic concept behind complex numbers: what are they? In order to understand complex numbers, we first need to understand imaginary numbers.

Imaginary Numbers

So what are imaginary numbers? Imaginary numbers can be seen as a tool that allows us to take square roots of negative numbers. Without this special tool, something such as the square root of -1 is not something we can evaluate. With imaginary numbers, however, we can assign the value i to the square root of -1. This is the foundation of complex numbers in their simplest form. So, sqrt(-1) = i is one of the founding ideas behind imaginary numbers.

Complex Numbers

Complex numbers (which sound more intimidating than they are) are simply numbers that have both a real and an imaginary part. In other words, any number that can be expressed in the form a + bi, with a and b being real numbers, is a complex number. In a complex number of the form a + bi, a is called the real component while b is called the imaginary component. Note that any real number can be represented as a complex number; for example, the number 5 is simply 5 + 0i. Complex numbers can also be plotted on a coordinate plane called the complex plane.

When a number has an imaginary component of 0, we can essentially ignore the vertical axis; this is why numbers with an imaginary component of 0 can simply be represented on a number line, i.e., the horizontal axis.

To plot a complex number on this plane, treat the real and imaginary components of the number as coordinates, with the real component along the horizontal axis and the imaginary component along the vertical axis. So, to plot the complex number a + bi, we would go a units along the horizontal (real) axis and b units along the vertical (imaginary) axis.

Operations on Complex Numbers

Addition

To start the discussion of operations on complex numbers, I would like to discuss addition. To add two complex numbers, we simply add the real components and the imaginary components of the numbers. For example, if a + bi is one complex number and c + di is another, their sum is (a + c) + (b + d)i. This is very similar to the addition of two vectors.

To subtract, simply add the negative of the number being subtracted. So, in the above example, if we wanted to subtract c + di from a + bi, we would simply add -c - di to a + bi.

Multiplication

Multiplication is a little more confusing, as there is no clear analog to it for vectors; however, the actual process and formula are fairly straightforward. The formula for the multiplication of two complex numbers is the same as the multiplication of any two binomials:

(a+bi)(c+di) = ac + i(ad) + i(bc) + i^2(bd)

Here we can see that there is a slight wrinkle that prevents us from leaving the expression as is. The powers of i are something we will discuss later; however, we know from before that i = sqrt(-1), so i^2 is just -1.

Knowing that, we can finish simplifying the above expression.

(a+bi)(c+di) = ac + i(ad) + i(bc) - bd

(a+bi)(c+di) = (ac - bd) + i(ad + bc)

If we want to multiply a complex number by a scalar, say a + bi by c, all the terms with d simply become 0 in the above equation and we are left with:

(a+bi)(c)=ac+i(bc)

This is just the basic distributive property we know and love; to take the negative of a complex number, we simply distribute -1 across it.

Division

To divide two complex numbers, say a + bi divided by c + di, represent the division as a fraction and simply multiply the top and the bottom of the fraction by the conjugate of the bottom. The conjugate of a complex number a + bi is a - bi. So, using this information, the above division would look something like this:

(a+bi)(c-di)/((c+di)(c-di))

Simplifying the above expression results in the division of the two complex numbers expressed in the a + bi form; in particular, the denominator (c+di)(c-di) becomes the real number c^2 + d^2.

This is done to express the division in the form most widely accepted for complex numbers.
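
As a sketch, all of the operations above fit into a small class; the class and method names are illustrative.

```typescript
// A sketch of complex-number arithmetic on numbers of the form a + bi.
class Complex {
  constructor(public a: number, public b: number) {} // a + bi

  add(other: Complex): Complex {
    // (a + c) + (b + d)i
    return new Complex(this.a + other.a, this.b + other.b);
  }

  multiply(other: Complex): Complex {
    // (ac - bd) + (ad + bc)i, using i^2 = -1
    return new Complex(
      this.a * other.a - this.b * other.b,
      this.a * other.b + this.b * other.a
    );
  }

  conjugate(): Complex {
    return new Complex(this.a, -this.b); // a - bi
  }

  divide(other: Complex): Complex {
    // Multiply top and bottom by the conjugate of the bottom;
    // the bottom becomes the real number c^2 + d^2.
    const denom = other.a ** 2 + other.b ** 2;
    const num = this.multiply(other.conjugate());
    return new Complex(num.a / denom, num.b / denom);
  }
}

const z = new Complex(1, 2).multiply(new Complex(3, -1));
console.log(`${z.a} + ${z.b}i`); // 5 + 5i
```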

Powers of Imaginary Numbers

We have touched on this already; however, there is a pattern to i and its powers:

i=sqrt(-1)

i^2=-1

i^3=i^2*i=-1*i=-i=-sqrt(-1)

i^4=i^3*i=-i*i=-sqrt(-1)*sqrt(-1)=-(-1)=1

i^5= i^4*i=1*i=sqrt(-1)

i^6=i^4*i^2=1*i^2=i^2=-1

i^7=i^4*i^3=1*i^3=i^3=-sqrt(-1)

The first 4 powers of i (1, 2, 3, 4) show a pattern, and after the 4th power this pattern repeats. Suppose we have i^n, where n is a number greater than 4. We know that i^a will be 1 whenever a is a multiple of 4. So, in order to evaluate i^n, we just have to divide n by 4, take the remainder r from that calculation, and evaluate i^r (with i^0 = 1).
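
In code, this remainder trick is a one-liner; here is a sketch (the string output is just for readability):

```typescript
// Evaluate i^n via the remainder of n divided by 4.
function powerOfI(n: number): string {
  const cycle = ["1", "i", "-1", "-i"]; // i^0, i^1, i^2, i^3
  return cycle[n % 4];
}

console.log(powerOfI(6));   // -1 (matches i^6 above)
console.log(powerOfI(7));   // -i (matches i^7 above)
console.log(powerOfI(100)); // 1
```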


Welcome to my site

Hello All,

I am Kumar Vaibhav Jha and I welcome you all to my site.

Regards,

Kumar Vaibhav
