# Pruning (decision trees)

Pruning is a technique in machine learning that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. The dual goal of pruning is to reduce the complexity of the final classifier and to improve predictive accuracy by reducing overfitting and removing sections of the classifier that may be based on noisy or erroneous data.

## Introduction

One of the questions that arises in a decision tree algorithm is the optimal size of the final tree. A tree that is too large risks overfitting the training data and generalizing poorly to new samples. A tree that is too small might not capture important structural information about the sample space. However, it is hard to tell when a tree algorithm should stop, because it is impossible to tell whether the addition of a single extra node will dramatically decrease error. This problem is known as the horizon effect. A common strategy is to grow the tree until each node contains a small number of instances, then use pruning to remove nodes that do not provide additional information.[1]

Pruning should reduce the size of a learning tree without reducing predictive accuracy as measured by a test set or using cross-validation. There are many techniques for tree pruning that differ in the measurement that is used to optimize performance.

## Techniques

Pruning can proceed top-down or bottom-up: top-down pruning traverses nodes and trims subtrees starting at the root, while bottom-up pruning starts at the leaf nodes. Below are several popular pruning algorithms.

### Reduced error pruning

One of the simplest forms of pruning is reduced error pruning. Starting at the leaves, each node is replaced with its most popular class. If the change does not reduce prediction accuracy on a validation set, it is kept. While somewhat naive, reduced error pruning has the advantage of simplicity and speed.
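The procedure above can be sketched in Python. This is a minimal illustration, not a standard API: the dict-based node format, the `counts` field holding per-class training counts, and the toy tree and data are all assumptions made for the example.

```python
def predict(node, x):
    """Route a sample down the tree to a leaf class."""
    while "cls" not in node:
        node = node["left"] if x[node["feat"]] <= node["thr"] else node["right"]
    return node["cls"]

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def majority_class(node):
    """Most popular class among the training counts stored at the node."""
    return max(node["counts"], key=node["counts"].get)

def reduced_error_prune(tree, node, val_data):
    """Bottom-up: try replacing each subtree with its majority-class leaf;
    keep the replacement when validation accuracy does not drop."""
    if "cls" in node:                        # already a leaf
        return
    reduced_error_prune(tree, node["left"], val_data)
    reduced_error_prune(tree, node["right"], val_data)
    before = accuracy(tree, val_data)
    saved = dict(node)                       # remember the subtree
    node.clear()
    node["cls"] = majority_class(saved)      # collapse to a leaf
    if accuracy(tree, val_data) < before:    # pruning hurt accuracy: undo
        node.clear()
        node.update(saved)

# Toy tree: the right-hand split separates two leaves of the same class,
# so it carries no information and should be pruned away.
tree = {
    "feat": 0, "thr": 0.5, "counts": {0: 1, 1: 2},
    "left": {"cls": 0},
    "right": {"feat": 1, "thr": 0.5, "counts": {1: 2},
              "left": {"cls": 1}, "right": {"cls": 1}},
}
val = [((0.0, 0.0), 0), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]
reduced_error_prune(tree, tree, val)
# The uninformative right subtree collapses to a single class-1 leaf;
# the root split survives because removing it would reduce accuracy.
```

Note that the root split is tried last and rejected: replacing it with a single leaf would misclassify the class-0 sample, so the undo branch restores it.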

### Cost complexity pruning

Cost complexity pruning generates a series of trees $T_0, \dots, T_m$, where $T_0$ is the initial tree and $T_m$ is the root alone. At step $i$, the tree $T_i$ is created by removing a subtree from $T_{i-1}$ and replacing it with a leaf node whose value is chosen as in the tree-building algorithm. The subtree to remove is chosen as follows. Define the error rate of tree $T$ over data set $S$ as $\operatorname{err}(T,S)$, and let $\operatorname{prune}(T,t)$ denote the tree obtained by replacing the subtree rooted at node $t$ of $T$ with a leaf. The subtree that minimizes $$\frac{\operatorname{err}(\operatorname{prune}(T,t),S)-\operatorname{err}(T,S)}{|\operatorname{leaves}(T)|-|\operatorname{leaves}(\operatorname{prune}(T,t))|}$$ is chosen for removal. Once the series of trees has been created, the best tree is chosen by generalized accuracy as measured by a validation set or cross-validation.
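The weakest-link selection and the resulting sequence $T_0, \dots, T_m$ can be sketched as follows. As before, the dict-based node format, the `counts` field, and the toy data are illustrative assumptions, not part of any standard library.

```python
import copy

def predict(node, x):
    while "cls" not in node:
        node = node["left"] if x[node["feat"]] <= node["thr"] else node["right"]
    return node["cls"]

def err(tree, data):
    """Error rate err(T, S)."""
    return sum(predict(tree, x) != y for x, y in data) / len(data)

def n_leaves(node):
    return 1 if "cls" in node else n_leaves(node["left"]) + n_leaves(node["right"])

def internal_nodes(node):
    if "cls" in node:
        return []
    return [node] + internal_nodes(node["left"]) + internal_nodes(node["right"])

def to_leaf(node):
    """Replace a subtree in place by a leaf, valued as in tree building."""
    counts = node["counts"]
    node.clear()
    node["cls"] = max(counts, key=counts.get)

def weakest_link(tree, data):
    """Return the internal node t minimizing
    (err(prune(T,t),S) - err(T,S)) / (|leaves(T)| - |leaves(prune(T,t))|)."""
    base_err, base_leaves = err(tree, data), n_leaves(tree)
    best_t, best_alpha = None, float("inf")
    for t in internal_nodes(tree):
        saved = dict(t)
        to_leaf(t)                        # tentatively prune at t
        alpha = (err(tree, data) - base_err) / (base_leaves - n_leaves(tree))
        t.clear(); t.update(saved)        # restore the subtree
        if alpha < best_alpha:
            best_t, best_alpha = t, alpha
    return best_t

def cost_complexity_sequence(tree, data):
    """Generate T_0 ... T_m by repeatedly pruning the weakest link."""
    seq = [copy.deepcopy(tree)]
    while "cls" not in tree:
        to_leaf(weakest_link(tree, data))
        seq.append(copy.deepcopy(tree))
    return seq

tree = {
    "feat": 0, "thr": 0.5, "counts": {0: 1, 1: 2},
    "left": {"cls": 0},
    "right": {"feat": 1, "thr": 0.5, "counts": {1: 2},
              "left": {"cls": 1}, "right": {"cls": 1}},
}
S = [((0.0, 0.0), 0), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]
seq = cost_complexity_sequence(tree, S)
# seq holds T_0 (3 leaves), T_1 (2 leaves), T_2 (the root alone)
```

The uninformative split is removed first (its ratio is 0, since pruning it removes a leaf without increasing error); each later tree trades more error for fewer leaves, and the final choice among `seq` would be made on held-out data.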

## References

• Judea Pearl, Heuristics, Addison-Wesley, 1984
• Pessimistic decision tree pruning based on tree size[2]

1. ^ Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2001, pp. 269–272
2. ^ Mansour, Y. (1997), "Pessimistic decision tree pruning based on tree size", Proc. 14th International Conference on Machine Learning: 195–201

## See also

• MDL-based decision tree pruning
• Decision tree pruning using backpropagation
• Neural networks

