We often use the decision tree algorithm for classification in our machine learning models. But have you ever thought about how a decision tree actually works, or which algorithm sits behind it? In this blog, we will create a decision tree using the ID3 algorithm, one of the algorithms used to build decision trees, and we will walk through the algorithm step by step on a dataset for better understanding.

Creating Decision tree using the ID3 algorithm

What are Decision Trees?

A Decision Tree is a supervised machine learning algorithm that helps us create classification and regression models in a tree-like structure. It consists of :-

  • A Node - which represents an attribute or feature
  • A Branch - which represents a decision
  • A Leaf - which represents the outcome, categorical or continuous

A decision tree can be built using many algorithms; here we will be creating one using the ID3 algorithm.

Let’s Understand Decision Trees With an Example

[Figure: a simple decision tree that classifies a person as FIT or UNFIT]

Consider the picture shown above, which uses the decision tree approach to decide whether a person is FIT or UNFIT. The algorithm predicts FIT or UNFIT based on the choice made at each node at each level, e.g. whether the person eats junk food or not. The leaf nodes hold the outcome; in this example the outcome is categorical, i.e. fit or unfit.
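In code, such a tree is nothing more than a chain of questions. Here is a tiny, purely illustrative Python sketch; the questions are assumptions made for this post, and the actual nodes in the picture may differ:

def fitness(eats_junk_food, exercises_daily):
    # Each node asks a question, each branch is an answer, each leaf is the outcome.
    # (The two questions here are assumed for illustration only.)
    if eats_junk_food:
        return "UNFIT"
    return "FIT" if exercises_daily else "UNFIT"

print(fitness(eats_junk_food=False, exercises_daily=True))  # FIT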

What is ID3?

ID3 is an acronym for Iterative Dichotomiser 3. It follows the popular greedy approach in a top-down manner: at each step it chooses the best possible node (attribute) at that moment, and the tree grows from the root downwards. Because these greedy choices are never revisited, the resulting tree is not guaranteed to be the best possible one, so its predictions may or may not be correct every time.

Metrics in ID3

ID3 uses Information Gain to select the best node at each level of the tree. Now, you might be wondering what Information Gain is, but before understanding Information Gain we should know what Entropy is. Entropy, in simple terms, is a measure of uncertainty in the given dataset D, and Information Gain tells us how much of that uncertainty is reduced after splitting the dataset D on a particular node. At each level, the node with the highest Information Gain is chosen for the split.

Mathematical expression for Entropy:-

H(D) = Σ x ∈ X ( - p(x) * log2(p(x)) )

Where,

D- is the dataset

X- is the set of classes in D

p(x)- is the probability of x in D.

Mathematical expression for Information Gain:-

Gain(A, D) = H(D) - I(A, D),  where  I(A, D) = Σ y ∈ Y ( p(y) * H(y) ) is the average entropy of D after splitting on attribute A

H(D)- is the entropy of D

Y- is the set of subsets created by splitting D on attribute A.

p(y)- is the fraction of rows of D that fall into subset y.

H(y) – is the entropy of subset y.
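To make the two formulas concrete, here is a minimal Python sketch of both metrics. The function names (entropy, information_gain) and the list-of-dicts data layout are choices made for illustration, not something defined in the article.

from collections import Counter
from math import log2

def entropy(labels):
    # H(D): uncertainty of a list of class labels.
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, labels, attribute):
    # Gain(A, D): reduction in entropy after splitting the rows on `attribute`.
    # `rows` is a list of dicts {attribute name: value}, `labels` the class of each row.
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # I(A, D): weighted average entropy of the subsets created by the split.
    average_entropy = sum((len(subset) / total) * entropy(subset)
                          for subset in groups.values())
    return entropy(labels) - average_entropy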

Procedure of ID3 Algorithm

  1. Calculate the entropy of the dataset.
  2. For each attribute (node)
    1. Calculate the entropy for each of its categorical values.
    2. Calculate the information gain for that attribute.
  3. Pick the attribute with the highest information gain at that level and split on it.
  4. Repeat steps 1 to 3 on each subset until we reach leaf nodes and the decision tree is complete (a code sketch of this procedure follows below).
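Here is a rough sketch of how these steps fit together as a recursive procedure. It reuses the entropy and information_gain helpers from the previous snippet, and the nested-dict tree representation is an assumption made for illustration, not a reference implementation.

from collections import Counter

# Assumes information_gain() and entropy() from the previous sketch are defined.
def id3(rows, labels, attributes):
    # Leaf: all labels are identical, or there is nothing left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Steps 1-3: pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    # Step 4: recurse into each subset created by the chosen attribute.
    for value in {row[best] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(sub_rows, sub_labels, remaining)
    return tree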

Creating a Decision Tree using the ID3 algorithm

Let’s look at an example to make it clearer. Consider the following dataset, used to predict whether a student needs a tutor or not.

[Table: 12-row dataset with the attributes Regular and Pass and a target column indicating whether the student needs a tutor]

The dataset is self-explanatory: for each attribute, Yes means the student does that activity and No means he doesn’t.

Here the target column is binary: 4 out of 12 rows are Yes and 8 out of 12 are No.

Note :- For binary classification we take log base 2 in the formula; for an n-class problem you can take log base n instead of log base 2, so that the maximum possible entropy stays 1.
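As a quick illustration of that note, with three equally likely classes the base-3 logarithm keeps the maximum entropy at 1 (a small check written for this post, not taken from the article):

from math import log

counts = [4, 4, 4]                     # three equally likely classes
total = sum(counts)
h = -sum(c / total * log(c / total, 3) for c in counts)   # log base 3
print(round(h, 3))  # 1.0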

Complete entropy of the dataset is -
H(S) = - p(Yes) * log2(p(Yes)) - p(No) * log2(p(No))
     = - (4/12) * log2(4/12) - (8/12) * log2(8/12)
     = - (-0.528) - (-0.390)
     = 0.918
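
The same number falls out of a couple of lines of Python, assuming the 4 Yes / 8 No class counts read off the table:

from math import log2

p_yes, p_no = 4 / 12, 8 / 12          # class proportions in the target column
h_s = -p_yes * log2(p_yes) - p_no * log2(p_no)
print(round(h_s, 3))  # 0.918
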
Information Gain for Regular:-

From the dataset we can see that 5 of the 12 rows have Regular = Yes; among these, 2 rows have Yes in the target column and 3 have No.


From the dataset we can see that 7 of the 12 rows have Regular = No; among these, 2 rows have Yes in the target column and 5 have No.

H(regular=YES) = -(2/5)*log2(2/5)-(3/5)*log2(3/5)
               = - (- 0.5288) - (- 0.4422)
               = 0.971
H(regular=NO) = -(2/7)*log2(2/7)-(5/7)*log2(5/7)
              = -( - 0.516) - (- 0.347)
              = 0.863
Average Entropy Information for Regular- 
I(Regular) = p(Yes) * H(Regular=Yes) + p(No) * H(Regular=No) 
           = (5/12)*0.971 + (7/12)*0.863 
           = 0.908
Information Gain = H(S) - I(Regular)
                 = 0.918 - 0.908
                 = 0.010
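
The same check for Regular can be written in Python, using the class counts quoted above (5 Yes rows split 2/3 and 7 No rows split 2/5); the small h() helper is defined here just for this illustration:

from math import log2

def h(counts):
    # Entropy of a list of class counts, e.g. [2, 3] for 2 Yes / 3 No.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

h_s = h([4, 8])                                    # entropy of the whole dataset
i_regular = (5 / 12) * h([2, 3]) + (7 / 12) * h([2, 5])
print(round(h_s - i_regular, 3))  # ~0.01
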
Information Gain for Pass:-

From the dataset we can see that 6 of the 12 rows have Pass = Yes; among these, 2 rows have Yes in the target column and 4 have No.


From the dataset we can see that 6 of the 12 rows have Pass = No; among these, 2 rows have Yes in the target column and 4 have No.

H(pass=YES) = -(2/6)*log2(2/6)-(4/6)*log2(4/6)
 = - (- 0.5726) - (- 0.39534)
 = 0.92294
H(pass=NO) = -(2/6)*log2(2/6)-(4/6)*log2(4/6)
 = - (- 0.5726) - (- 0.39534)
 = 0.92294
Average Entropy Information for Pass- 
I(Pass) = p(YES) * H(Pass=YES) + p(No) * H(Pass=No) 
= (6/12)*0.92294 + (6/12)*0.9224 
= 0.36896
Information Gain = H(S) - I(Pass)
                 = 0.918 - 0.36896
                 = 0.549

Here, Pass is the node with the maximum Information Gain, so from the information computed above the decision tree will look like :-

[Figure: partial decision tree with Pass, the node with the highest information gain, at the root]

Now repeat the same procedure for the Regular node, separately for the rows where Pass is Yes and the rows where Pass is No.

Now, the rows with the value Yes for Pass are 1, 2, 5, 6, 9, 10.

Entropy for the Pass = Yes subset :-

H(Pass=YES) = -(2/6)*log2(2/6)-(4/6)*log2(4/6)
            = - (- 0.528) - (- 0.390)
            = 0.918
Information Gain when Pass is True :-
H(Regular=YES,Pass=YES) = -(1/3)*log2(1/3)-(2/3)*log2(2/3)
                        = - (- 0.528) - (- 0.390)
                        = 0.918
H(Regular=NO,Pass=YES) = -(1/3)*log2(1/3)-(2/3)*log2(2/3)
                       = - (- 0.528) - (- 0.390)
                       = 0.918
Average Entropy Information for Regular- 
I(Regular,Pass=YES) = p(Regular=YES,Pass=YES) * H(Regular=YES,Pass=YES) +
                      p(Regular=NO,Pass=YES) * H(Regular=NO,Pass=YES)
                    = (3/6)*0.918 + (3/6)*0.918
                    = 0.918
Information Gain = H(Pass=YES) - I(Regular,Pass=YES)
                 = 0.918 - 0.918
                 = 0
Information Gain when Pass is False :-
H(Regular=YES,Pass=NO) = -(1/2)*log2(1/2)-(1/2)*log2(1/2)
                       = - (- 0.5) - (- 0.5)
                       = 1.0
H(Regular=NO,Pass=NO) = -(1/4)*log2(1/4)-(3/4)*log2(3/4)
                      = - (- 0.5) - (- 0.311)
                      = 0.811
I(Regular,Pass=NO) = p(Regular=YES,Pass=NO) * H(Regular=YES,Pass=NO) +
                     p(Regular=NO,Pass=NO) * H(Regular=NO,Pass=NO)
                   = (2/6)*1.0 + (4/6)*0.811
                   = 0.874
Information Gain = H(Pass=NO) - I(Regular,Pass=NO)
                 = 0.918 - 0.874
                 = 0.044
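The two branch-wise gains can be verified the same way, reusing the small h() helper from the earlier snippet; the per-branch class counts are the ones used in the calculations above:

# Pass = YES subset: Regular = YES -> 1 Yes / 2 No, Regular = NO -> 1 Yes / 2 No
gain_pass_yes = h([2, 4]) - ((3 / 6) * h([1, 2]) + (3 / 6) * h([1, 2]))
# Pass = NO subset:  Regular = YES -> 1 Yes / 1 No, Regular = NO -> 1 Yes / 3 No
gain_pass_no = h([2, 4]) - ((2 / 6) * h([1, 1]) + (4 / 6) * h([1, 3]))
print(round(gain_pass_yes, 3), round(gain_pass_no, 3))  # 0.0 0.044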

So, as calculated above from the information gain, we place the Regular node on the left side and the Not Regular node on the right side of the root node. From the entropy for Pass = "YES" with Regular = "YES"/"NO", we can only say that there is an equal probability that he "Needs tutor" or "Doesn't need tutor". From the entropy for Pass = "NO" and Not Regular = "YES" we can say he "Needs tutor", and from the entropy for Pass = "NO" and Not Regular = "NO" we can say he "Doesn't need tutor".

[Figure: final decision tree]

Limitations of ID3 Algorithm

There are some disadvantages of the ID3 algorithm :-

  1. It tends to overfit the data on small datasets.
  2. Only one attribute is considered at a time when making a decision.
  3. Classifying continuous data can be computationally expensive, since many candidate split points have to be evaluated.