Decision Tree Algorithm in Machine Learning and What Is Pruning?

Akhil Kasare
3 min readJul 17, 2020

A Decision Tree is a supervised learning algorithm. It can be used for both classification and regression problems.

The input features to a Decision Tree can be either continuous or categorical.

The Decision Tree solves a problem with a series of if-then rules, represented as a tree structure of internal nodes and leaves.

There are several assumptions to keep in mind while building a decision tree; here are some of them:

  • Initially, the entire training dataset is considered the root.
  • Feature values are preferred to be categorical; continuous values should be discretized.
  • Records are distributed recursively on the basis of attribute values.
  • The attribute placed at the root node is chosen using a statistical approach.
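The discretization step mentioned above can be sketched with NumPy; the age values and bin boundaries here are made up purely for illustration:

```python
import numpy as np

# Hypothetical continuous feature: ages of applicants
ages = np.array([18, 22, 25, 31, 40, 47, 55, 63])

# Discretize into categorical bins; boundaries chosen for illustration
bins = [30, 45, 60]                  # buckets: <30, 30-44, 45-59, >=60
labels = np.digitize(ages, bins)     # bucket index (0-3) for each age

print(labels.tolist())  # [0, 0, 0, 1, 1, 2, 2, 3]
```

Each continuous age is now a categorical bucket label that a tree can split on directly.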

We decide the root node on the basis of measures such as Entropy, Information Gain (IG), and the Gini Index.

During this evaluation, the feature that achieves the highest Information Gain (IG) is chosen as the root node.

The most commonly used criterion is the Gini Index, because it contains no logarithmic calculation and is therefore faster to compute than entropy, which does.
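To see why the Gini Index is cheaper, here is a minimal sketch of both impurity measures in Python; note that `gini` needs no logarithm:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a class-probability vector p (uses log2)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini impurity: 1 - sum(p_i^2); no logarithm needed."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

# A 50/50 class split is maximally impure under both measures
print(entropy([0.5, 0.5]))  # 1.0
print(gini([0.5, 0.5]))     # 0.5
```

Both measures peak at an even class split and reach zero for a pure node, which is why either works as a splitting criterion.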

Gini Index vs Entropy

Now we will understand the working of a Decision tree with the help of an example.

In this example we use a dataset known as the credit card approval dataset. In tree form, the dataset is represented as follows.

Based on the dataset, we plotted a Decision Tree.

After observing the dataset, the question arises: how do we predict whether or not to grant credit to an individual?

For example, consider a customer whose age is 20 and whose salary is 20k; this is our test point.

Now our Decision Tree evaluates the point: the customer's age is between 20 and 30 and his salary is 20k. Analyzing the data, the tree predicts that the credit card should not be issued to this customer.

The main aim of the Decision Tree is to find the best split; in this example, the best split is on the basis of income.
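The walkthrough above can be sketched with scikit-learn; the age/salary values and labels below are hypothetical, invented only to mirror the article's scenario (assuming `0` means reject and `1` means approve):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical credit-approval data: [age, salary in thousands]
X = [[22, 20], [25, 18], [28, 22], [35, 60],
     [45, 80], [50, 75], [30, 55], [24, 15]]
y = [0, 0, 0, 1, 1, 1, 1, 0]   # 1 = approve credit, 0 = reject

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Test point from the article: age 20, salary 20k
print(clf.predict([[20, 20]]))  # [0] -> do not issue the card
```

With this toy data, the fitted tree splits on salary, matching the article's observation that income gives the best separation.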

How to control leaf size & pruning?

To control the leaf size, we can set parameters like:

(1.) Maximum depth:

Maximum depth stops further splitting of nodes once the specified depth of the tree has been reached while building the Decision Tree; without this limit, the tree grows to the largest possible depth.

(2.) Minimum leaf size:

Minimum leaf size blocks a split when it would produce a child node containing fewer samples than the specified minimum.
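In scikit-learn (an assumption on my part; the article names no library), these two controls map to the `max_depth` and `min_samples_leaf` parameters:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning via hyperparameters: cap the depth at 3 and require at
# least 5 samples in every leaf, so the tree cannot memorize points.
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                             random_state=0)
clf.fit(X, y)

print(clf.get_depth())  # never exceeds 3
```

Tightening either parameter yields a smaller, simpler tree at the cost of some training accuracy.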

How to control pruning?

Pruning is mostly done to reduce overfitting on the training dataset; it also reduces the complexity of the tree.

There are basically two types of pruning:

  • Pre-pruning
  • Post-pruning

Pre-pruning

Pre-pruning is also known as an early stopping criterion. The criteria are specified while building the model, and the tree stops growing as soon as it meets any of them.

Post-pruning

In post-pruning, we allow the tree to grow fully and then observe the complexity parameter (CP) values. Next, we prune the tree back at an optimal complexity parameter (CP) value.
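In scikit-learn, the analogue of the complexity parameter (CP, the term used by R's rpart) is `ccp_alpha` for minimal cost-complexity pruning; a rough sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow the tree fully, then inspect its cost-complexity pruning path
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Refit with a nonzero alpha: larger alpha means heavier pruning
pruned = DecisionTreeClassifier(random_state=0,
                                ccp_alpha=path.ccp_alphas[-2])
pruned.fit(X_train, y_train)

print(full.get_n_leaves(), pruned.get_n_leaves())
```

In practice, the optimal `ccp_alpha` is chosen by cross-validating accuracy across the candidate alphas in the pruning path.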
