Today I will give a brief introduction to the idea of decision trees and delve into exactly what they are and why they are used.
What are decision trees?
Decision trees are a type of machine learning model that can be used for classification and regression problems. A decision tree essentially builds a tree of “decisions” you have to make about a new input, which helps you decide on the value of a target attribute for that input. For example, we could take an input dataset of NBA players’ stats and build a decision tree that decides whether or not each player is a hall of famer. In this example, a very simple decision tree would look like the following:
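PPG > 20?
├── Yes → hall of famer
└── No → not a hall of famer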
This decision tree would classify any player who scored more than 20 points per game as a hall of fame player while any player who scored less than 20 would be classified as not a hall of fame player.
The above decision tree is quite simplistic and would likely make many mistakes if it were actually tested; however, it gets across quite well the idea of what exactly a decision tree is.
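In practice, a library like scikit-learn can fit a tree like this for us. Below is a minimal sketch, assuming a tiny made-up dataset of points per game and hall of fame labels (the numbers are purely illustrative, not real player stats):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: career points per game for a handful of players
X = [[12.1], [31.0], [27.4], [8.9], [22.5], [15.3]]
# 1 = hall of famer, 0 = not a hall of famer (made-up labels for illustration)
y = [0, 1, 1, 0, 1, 0]

# A depth-1 tree mirrors our single "PPG > 20?" decision;
# criterion="entropy" tells scikit-learn to choose splits using information gain
clf = DecisionTreeClassifier(criterion="entropy", max_depth=1)
clf.fit(X, y)

print(clf.predict([[25.0]]))  # [1]: classified as a hall of famer
print(clf.predict([[10.0]]))  # [0]: classified as not a hall of famer
```

Note that the tree learns its own cutoff from the data; it will land near 20 PPG only because the toy labels above happen to separate there.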
How are they built?
Now that we know what decision trees are, we can discuss how exactly they are built. Firstly, as with any machine learning model, decision trees require a set of training data to actually build out the model. Since we are building a tree, it makes sense to start from the root and work our way down to the leaf nodes (where, in our case, each leaf node should decide on a classification).
To actually understand the process of building a decision tree, we should first set a goal for ourselves: what would we like to accomplish on this training data to consider our decision tree “successful”?
Unlike with some other machine learning models, we usually do not minimize a loss function to build a decision tree; rather, we utilize the concept of entropy.
What is entropy?
Entropy is a concept that is certainly not particular to machine learning and has its own applications throughout the sciences. For our purposes, to understand entropy, I will first define the notion of surprise.
Suppose we were drafting NBA players and we had 4 rooms in front of us: the first room had 3 hall of fame players and 9 non hall of fame players; the second room had 6 hall of fame players and 6 non hall of fame players; the third room had 9 hall of fame players and 3 non hall of fame players; and the fourth room had 12 hall of fame players and 0 non hall of fame players. In order to draft players, you pick one of the rooms and randomly select a player from the room.
If we selected the first room and randomly drew a player that was a non hall of famer, we wouldn’t be very surprised with our pick. In the second room, if we drew a non hall of famer, we would be a little more surprised, but not too surprised (after all, the odds are 50-50). In the third room, if we drew a non hall of famer, we would be quite surprised (the odds of that happening here are 25%), and in the fourth room, if we drew a non hall of famer, you would start to believe I had lied to you in the premise; your surprise would know no bounds, it would be undefined (the probability of that happening is 0%, after all!).
From the above thought experiment, we can see that as the probability of something happening decreases, the surprise associated with it increases. This means that surprise and probability have an inverse relationship. We can define the surprise of an event to be:
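$$\text{Surprise}(x) = \log_2\left(\frac{1}{P(x)}\right)$$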
Here, we take the log of the inverse probability because when the probability of an event is 1 (certain to happen), we want our surprise to be 0, and taking the log allows for that. When the probability of an event is 0, the log allows the surprise to be undefined, as that event shouldn’t happen in the first place. This also allows us to maintain the inverse relationship between surprise and probability.
In order to go from this definition of surprise to the definition of entropy that is more commonly seen in other fields of study, we first apply some simple logarithm rules:
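$$\log_2\left(\frac{1}{P(x)}\right) = \log_2(1) - \log_2(P(x)) = -\log_2(P(x))$$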
When looking at an event, the surprise is tied to one of the distinct outcomes of that event: if in room 1 you choose a player and it turns out to be a hall of famer, there is a surprise associated with that outcome that we can calculate via the above formula. To get the entropy of any of the rooms, we simply calculate the expected value (or average) of the surprise. To do this, we take all the possible values of the surprise and multiply them by the probability of getting those values; this is just how expectation is regularly calculated.
In our example, in a room A, the entropy of the room is calculated based on how surprised we expect to be when we choose a player from that room.
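In symbols, for a room (or, later on, a node) A:

$$\text{En}(A) = \sum_{x} P(x)\,\text{Surprise}(x) = -\sum_{x} P(x)\log_2 P(x)$$

For example, the entropy of the first room (3 hall of famers and 9 non hall of famers) works out to

$$\text{En}(\text{room 1}) = -\left(\frac{3}{12}\log_2\frac{3}{12} + \frac{9}{12}\log_2\frac{9}{12}\right) \approx 0.81$$

while the 50-50 second room has the maximum possible entropy of 1, and the fourth room, where every player is a hall of famer, has an entropy of 0.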
How is entropy used?
To understand how entropy is used in decision trees, we have to first understand how decision trees work.
While building a decision tree, at any given node there is a subset of the training data that satisfies the path through the tree to that node. For example, at the root, the subset of training data we are considering is the full training set; in our example above, if the PPG node is the root, then we are considering all of the training data as our subset. After partitioning on > 20 PPG and < 20 PPG, the subsets at the two child nodes of the root would be the players who scored more than 20 and the players who scored less than 20. These subsets are how the entropy at any node is calculated.
We have 3 broad goals while building the decision tree:
- To minimize, in a greedy fashion, the entropy of the child nodes created at each split
- To have 0 entropy at leaf nodes
- To exhaust all the attributes on which we can partition the data
To do this, we can employ a metric called information gain at each new node while we are recursively building the tree, using it to minimize the entropy at the child nodes.
The information gain at a node is computed with respect to the subset of the training data available at that node and the attribute that we choose to partition on there. The attribute that gives us the most information gain at any given node is chosen, the data is partitioned based on that attribute, and then we look at the child nodes to determine the information gain there (unless there are no more attributes to partition on, or the entropy is 0, in which case we are at a leaf node and we are done).
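To make this recursion concrete, here is a rough Python sketch of the procedure for discrete attributes (the function names and data layout are my own choices rather than a reference implementation, and the information gain computation it uses is spelled out just below):

```python
import math
from collections import Counter

def entropy(labels):
    """Expected surprise of a list of class labels (e.g. hall of famer / not)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the node minus the weighted entropy of its children after splitting."""
    total = len(labels)
    children = {}
    for row, label in zip(rows, labels):
        children.setdefault(row[attribute], []).append(label)
    weighted = sum((len(ls) / total) * entropy(ls) for ls in children.values())
    return entropy(labels) - weighted

def build_tree(rows, labels, attributes):
    # Leaf cases: the node is pure, or there are no attributes left to split on
    if entropy(labels) == 0 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedily pick the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    # Partition the data on the chosen attribute and recurse into each child
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[best], ([], []))
        partitions[row[best]][0].append(row)
        partitions[row[best]][1].append(label)
    for value, (child_rows, child_labels) in partitions.items():
        tree[best][value] = build_tree(child_rows, child_labels, remaining)
    return tree
```

Here each row is assumed to be a dict of discrete attribute values; a continuous attribute like PPG would first need to be bucketed, a point we come back to in the conclusion.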
The formula for information gain is as follows:
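$$\text{Gain}(A, E) = \text{En}(A) - \sum_{V \in \text{Values}(E)} \frac{|A_V|}{|A|}\,\text{En}(A_V)$$

where A is the subset of data at the node, E is the attribute we are considering, and A_V is the part of A for which E takes the value V (so |A_V| / |A| is the probability of landing in that child).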
Here, the gain for any given node A is calculated based on the subset of data at that node as well as the attribute E that we would split on there. First we calculate the entropy of A from the subset of data available at that node. Then we essentially simulate what the entropy of each child node would be if we split on the attribute E: for each value V that E can take, we multiply the probability of getting that value by the entropy of the corresponding subset. We then take the sum of these weighted entropies and subtract it from the entropy at A, which tells us how much, on average, our entropy would decrease if we split on attribute E. Let’s illustrate this with an example:
In this example, let’s take the total number of players in our training data at the root node to be 14, with 9 hall of famers and 5 non hall of famers. First, we would calculate the entropy at the node as follows:
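$$\text{En}(\text{root}) = -\left(\frac{9}{14}\log_2\frac{9}{14} + \frac{5}{14}\log_2\frac{5}{14}\right) \approx 0.94$$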
This gives us an entropy of roughly 0.94.
Now, assuming that we want to split on PPG greater than 20 and less than 20, we calculate our gain based on this attribute. Let’s say that in the left child, 6 players are hall of famers and 2 aren’t and in the right child, 3 players are hall of famers and 3 aren’t. We can calculate the entropies of these two children and then calculate the gain.
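$$\text{En}(\text{left}) = -\left(\frac{6}{8}\log_2\frac{6}{8} + \frac{2}{8}\log_2\frac{2}{8}\right) \approx 0.81$$

$$\text{En}(\text{right}) = -\left(\frac{3}{6}\log_2\frac{3}{6} + \frac{3}{6}\log_2\frac{3}{6}\right) = 1$$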
From these calculations, we get En(left) = 0.81 and En(right) = 1.
Now we can calculate the gain as follows:
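$$\text{Gain} = 0.94 - \left(\frac{8}{14} \times 0.81 + \frac{6}{14} \times 1\right) \approx 0.048$$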
We use 8/14 and 6/14 because they are the probabilities of landing in the left and right child respectively. Here we can see that our information gain (roughly 0.048) is not great, and so this attribute probably shouldn’t be used.
This makes sense on an intuitive level as well, as splitting on PPG leads to a right child where half of the players are hall of fame players and half are not. This means that our split essentially did not achieve much and there might be better attributes to split on.
So, information gain is maximized at each node while we are building a decision tree, and while going down a branch of the tree, we cannot use the same attribute twice to partition the subset of data available.
Conclusion
This was just a very brief summary of how decision trees are built. There are other splitting criteria besides information gain, such as Gini impurity, that can be chosen based on the needs of the decision tree. Also, we did not cover the distinction between discrete and continuous attributes. For example, PPG is a continuous attribute, and to split on PPG we had to split it into two buckets. It is possible that a different split on PPG (say, at 30 or 40 PPG) would lead to better results. All of these are considerations to take into account when building a decision tree.
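As a small illustration of that last point, one simple way to handle a continuous attribute like PPG is to try several candidate cutoffs and keep the one with the highest information gain. Below is a rough sketch of that idea (the candidate cutoffs and function names are purely illustrative, and the entropy helper is repeated from the earlier sketch so the snippet stands alone):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_ppg_threshold(ppg_values, labels, candidates=(10, 20, 30, 40)):
    """Pick the PPG cutoff whose split yields the highest information gain."""
    parent = entropy(labels)
    best, best_gain = None, -1.0
    for t in candidates:
        left = [lab for ppg, lab in zip(ppg_values, labels) if ppg > t]
        right = [lab for ppg, lab in zip(ppg_values, labels) if ppg <= t]
        if not left or not right:
            continue  # a cutoff that puts everyone on one side tells us nothing
        weighted = (len(left) / len(labels)) * entropy(left) \
                 + (len(right) / len(labels)) * entropy(right)
        gain = parent - weighted
        if gain > best_gain:
            best, best_gain = t, gain
    return best, best_gain
```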