We have often used decision tree algorithm for classification in our Machine Learning Model. But have you ever think about how this decision tree works or what is the algorithm behind it? In this blog, we will be Creating a Decision Tree using the ID3 algorithm, which is one of the algorithms used to create a decision tree. We will see step by step implementation of algorithm on our dataset for better understanding.

**What are Decision Trees ?**

A Decision Tree is a Supervised Machine Learning Algorithm, which helps us to create classification and regression models in a tree-like structure. It consist of :-

- A Node- which represents a attribute or feature
- A Branch -which represents the decision
- A Leaf- which represent the outcome, categorical or continuous

A Decision tree can be built using many algorithms, here we will be creating a decision tree using the ID3 algorithm.

Let’s Understand About Decision Tree With Example

Consider the picture shown above which depicts whether the person is FIT or UNFIT where we use the decision tree approach. The algorithm will predict whether the person is FIT or UNFIT based on the choice of each node at each level i.e. whether he eats junk food or not. The leaf node depicts the outcome, here in this example the outcome is categorical i.e. He is fit or unfit.

**What is ID3?**

ID3 is an acronym of the term Iterative Dichotomiser 3. It basically follows one of the popular approach known as greedy algorithm in a top down manner, in which we basically choose the best possible node at that time and top-down is the pattern in which it flows in the structure. It means the prediction may or may not be correct every time.

**Metrics in ID3**

ID3 uses the information gain to select the best node at each level in the tree. Now, you might be wondering what is Information Gain? But, before understanding about the term Information Gain we should know about what is Entropy. **Entropy** in simple terms is a measure of uncertainty in the given dataset D and **Information Gain **tells us how much uncertainty in dataset D is reduced after splitting the dataset D on a particular node. The node with high Information Gain is splitted at each level.

**Mathematical expression for Entropy:-**

*H(D)=∑ _{x ⊂ X }( – p(x) log_{2 }(x) )*

Where,

D- is the dataset

X- is the set of classes in D

p(x)- is the probability of x in D.

**Mathematical expression for Information Gain:-**

*I(A,D) = H(D)-∑ _{y ⊂ Y }( – p(y) H(y) )*

H(D)- is the entropy of D

Y- is the subset created by splitting D at node A.

p(y)- is the probability of y in the set D.

H(y) – is the entropy of subset y.

**Procedure of ID3 Algorithm**

- Calculate entropy for the dataset.
- For each node
- Calculate entropy for all its categorical values.
- Calculate information gain for the node.

- Find the node with highest information gain at a particular level.
- Repeat steps from 1 to 3 till we reach the leaf node and have created our decision tree.

#### Creating a Decision Tree using the ID3 algorithm

Let’s look an example to make it more clear, Consider the dataset for prediction of student needs tutor or not.

The dataset is self explanatory, here Yes means he does that work and No means he doesn’t do that work.

Here the dataset is of binary classes, where 4 out of 12 are** Y**es** **and 8 out of 12 are** N**o**.**

*Note*** :- In case of binary classification w**e

**,**

*take log*i_{2}*n the formula*

*for n classification take log*_{n}*in the formula instead of log with base 2*.Complete entropy of dataset is - H(S) = - p(YES) * log2(p(YES)) - p(no) * log2(p(no)) = - (4/12) * log2(4/12) - (8/12) * log2(8/12) = - (-0.527) - (-0.395) = 0.922

**Information Gain for Regular:-**

From the dataset we can see we have 5 rows with **Y**es** **value out of 12 for which there are 2 **Y**es in the target column and 3** N**o in the target column.

From the dataset we can see we have 7 rows with **N**o value out of 12 for which there are 2 **Y**es in the target column and 5 **N**o in the target column.

H(regular=YES) = -(2/5)*log2(2/5)-(3/5)*log2(3/5) = - (- 0.5288) - (- 0.4422) = 0.971 H(regular=NO) = -(2/7)*log2(2/7)-(5/7)*log2(5/7) = -( - 0.516) - (- 0.347) = 0.863 Average Entropy Information for Regular- I(Regular) = p(YES) * H(Regular=YES) + p(no) * H(Regular=No) = (5/12)*0.971 + (7/12)*0.863 = 0.907 Information Gain = H(S) - I(Regular) = 0.983 - 0.907 = 0.076

**Information Gain for Pass:-**

From the dataset we can see we have 6 rows with **Y**es value out of 12 for which there are 2 **Y**es in the target column and 4 **N**o in the target column.

From the dataset we can see we have 6 rows with “NO” value out of 12 for which there are 2 “YES” in the target column and 4 No in the target column.

H(pass=YES) = -(2/6)*log2(2/6)-(4/6)*log2(4/6) = - (- 0.5726) - (- 0.39534) = 0.92294 H(pass=NO) = -(2/6)*log2(2/6)-(4/6)*log2(4/6) = - (- 0.5726) - (- 0.39534) = 0.92294 Average Entropy Information for Pass- I(Pass) = p(YES) * H(Pass=YES) + p(No) * H(Pass=No) = (6/12)*0.92294 + (6/12)*0.9224 = 0.36896 Information Gain = H(S) - I(Pass) = 0.983 - 0.36896 = 0.614

Here, Pass is the node with maximum Information Gain, from the given information the decision tree will look like :-

Repeat the same procedure for regular node considering rows with values whether he is pass or not.

Now, rows with values true for pass are 1, 2, 5, 6 , 9, 10

Entropy for pass :-

H(pass=YES) = -(2/6)*log2(2/6)-(4/6)*log2(4/6) = - (- 0.5726) - (- 0.39534) = 0.92294

##### Information Gain when pass is True** :-**

H(regular=YES,Pass=Yes) = -(1/3)*log2(1/3)-(2/3)*log2(2/3) = - (- 0.5726) - (- 0.39534) = 0.92294 H(regular=NO,Pass=Yes) = -(1/3)*log2(1/3)-(2/3)*log2(2/3) = - (- 0.5726) - (- 0.39534) = 0.92294 Average Entropy Information for Regular- I(Regular,Pass=YES) = p(Regular=yes,pass=YES) * H(Regular=YES,pass=Yes) + p(Regular=no,PasS=YES) * H(Regular=No,pass=YES) = (3/6)*0.92294 + (3/6)*0.9224 = 0.92294 Information Gain = H(S) - I(Regular,Pass=YES) = 0.9224 - 0.92294 = 0

**Information Gai**n** when **Pass is False** :-**

H(regular=YES,Pass=No) = -(1/2)*log2(1/2)-(1/2)*log2(1/2) = - (- 0.5) - (- 0.5) = 1.0 H(regular=NO,Pass=No) = -(1/4)*log2(1/4)-(3/4)*log2(3/4) = - (- 0.5) - (- 0.35625) = 0.85625 I(Regular,Pass=No) = p(Regular=yes,pass=No) * H(Regular=YES,pass=No) + p(Regular=no,PasS=No) * H(Regular=No,pass=No) = (2/6)*1.0 + (4/6)*0.85625 = 0.9002 Information Gain = H(S) - I(Regular,Pass=No) = 0.9224 - 0.9002 = 0.0222

So, we should place Regular node at left side and Not Regular node at right side of the root node as calculated above from the information gain. From the entropy of pass=”YES” and regular=”YES”/ ”NO” we can say that there is equal probability he “**Needs tutor**” or he “**Don’t need tutor**”. From the entropy of pass=”NO” and not regular=”YES” we can say he “**Needs tutor**” .From the entropy of pass=”NO” and not regular=”NO” we can say he “**Not Needs tutor**” .

#### Limitations of ID3 Algorithm

There are some disadvantages of ID3 Algorithm :-

- Overfitting of Data for small dataset.
- One attribute is considered at the time of making decisions.
- Continuous data classification can be computationally expensive.