# C4.5 algorithm

C4.5 algorithm

C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier.

Algorithm

C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set $S = \left\{s_1, s_2, ...\right\}$ of already classified samples. Each sample $s_i = \left\{x_1, x_2, ...\right\}$ is a vector where $x_1, x_2, ...$ represent attributes or features of the sample. The training data is augmented with a vector $C = \left\{c_1, c_2, ...\right\}$ where $c_1, c_2, ...$ represent the class that each sample belongs to.

C4.5 uses the fact that each attribute of the data can be used to make a decision that splits the data into smaller subsets. C4.5 examines the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is the one used to make the decision. The algorithm then recurs on the smaller sublists.

This algorithm has a few base cases, the most common base case is when all the samples in your list belong to the same class. Once this happens, you simply create a leaf node for your decision tree telling you to choose that class. It might also happen that none of the features give you any information gain, in this case C4.5 creates a decision node higher up the tree using the expected value of the class. It also might happen that you've never seen any instances of a class; again, C4.5 creates a decision node higher up the tree using expected value.

In pseudocode the algorithm is:

1. Check for base cases 2. For each attribute "a" 3. Find the normalized information gain from splitting on "a" 4. Let "a_best" be the attribute with the highest normalized information gain 5. Create a decision "node" that splits on "a_best" 6. Recur on the sublists obtained by splitting on "a_best" and add those nodes as children of "node"

Information gain and information entropy

Although explained further in their respective sections, $Entropy\left(S\right)$ can be thought of as a measure of how random the class distribution is in "S". Information gain is a measure given to an attribute "a". Attribute "a" can separate "S" into subsets $S_a1, S_a2, S_a3, ..., S_an$ the information gain of "a" is then $Entropy\left(S\right) - Entropy\left(S_a1\right) - Entropy\left(S_a2\right) - ... - Entropy\left(S_an\right)$. Information gain is then normalized by multiplying the entropy of each attribute choice by the proportion of attribute values that have that choice.

Improvements from ID3 algorithm

C4.5 made a number of improvements to ID3. Some of these are:

* Handling both continuous and discrete attributes - In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it. [Quinlan, 96]
* Handling training data with missing attribute values - C4.5 allows attribute values to be marked as ? for missing. Missing attribute values are simply not used in gain and entropy calculations.
* Handling attributes with differing costs.
* Pruning trees after creation - C4.5 goes back through the tree once it's been created and attempts to remove branches that do not help by replacing them with leaf nodes.

Improvements in C5.0/See5 algorithm

Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows) which he markets commercially. C5.0 offers a number of improvements on C4.5. Some of these are:

* Speed - C5.0 is significantly faster than C4.5 (several orders of magnitude)
* Memory usage - C5.0 is more memory efficient than C4.5
* Smaller decision trees - C5.0 gets similar results to C4.5 with considerably smaller decision trees.
* Support for boosting - Boosting improves the trees and gives them more accuracy.
* Weighting - C5.0 allows you to weight different attributes and misclassification types.
* Winnowing - C5.0 automatically winnows the data to help reduce noise.

C5.0/See5 is a commercial and closed-source product, although free source code is available for interpreting and using the decision trees and rule sets it outputs.

* ID3 algorithm

References

* Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
* J. R. Quinlan. Improved use of continuous attributes in c4.5. Journal of Artificial Intelligence Research, 4:77-90, 1996.

* Original implementation on Ross Quinlan's homepage: [http://www.rulequest.com/Personal/ http://www.rulequest.com/Personal/]

Wikimedia Foundation. 2010.

### См. также в других словарях:

• Algorithm design — is a specific method to create a mathematical process in solving problems. Applied algorithm design is algorithm engineering.Algorithm design is identified and incorporated into many solution theories of operation research, such as dynamic… …   Wikipedia

• Algorithm engineering — is a combination of theoretical algorithm design with real world data. By taking an algorithm and combining it with a hardware device connected to the real world, you are able to more accurately verify and validate the algorithm results and… …   Wikipedia

• algorithm — UK US /ˈælgərɪðəm/ noun [C] IT ► a set of mathematical instructions that must be followed in a fixed order, and that, especially if given to a computer, will help to calculate an answer to a mathematical problem: a computer/mathematical/search… …   Financial and business terms

• algorithm — n. a precise rule (or set of rules) specifying how to solve some problem; a set of procedures guaranteed to find the solution to a problem. Syn: algorithmic rule, algorithmic program [WordNet 1.5 +PJC] …   The Collaborative International Dictionary of English

• algorithm — [al′gə rith΄əm] n. [altered (after ARITHMETIC) < ALGORISM] 1. Math. a) any systematic method of solving a certain kind of problem b) the repetitive calculations used in finding the greatest common divisor of two numbers: called in full… …   English World dictionary

• Algorithm Builder — ist eine Entwicklungsumgebung für AVR Mikrocontroller, basierend auf einer graphischen Makro Assembler Sprache. Der gesamte Programmablauf wird in graphischer Form als Flussdiagramm eingegeben. Es lässt sich mit einzelnen Anweisungen direkt auf… …   Deutsch Wikipedia

• Algorithm —   [engl.], Algorithmus.   …   Universal-Lexikon

• algorithm — (n.) 1690s, from Fr. algorithme, refashioned (under mistaken connection with Gk. arithmos number ) from O.Fr. algorisme the Arabic numeral system (13c.), from M.L. algorismus, a mangled transliteration of Arabic al Khwarizmi native of Khwarazm,… …   Etymology dictionary

• algorithm — ► NOUN ▪ a process or set of rules used in calculations or other problem solving operations. DERIVATIVES algorithmic adjective. ORIGIN Latin algorismus, from an Arabic word meaning the man of Kw rizm , referring to a 9th century mathematician …   English terms dictionary

• Algorithm — Flow chart of an algorithm (Euclid s algorithm) for calculating the greatest common divisor (g.c.d.) of two numbers a and b in locations named A and B. The algorithm proceeds by successive subtractions in two loops: IF the test B ≤ A yields yes… …   Wikipedia

• Algorithm characterizations — The word algorithm does not have a generally accepted definition. Researchers are actively working in formalizing this term. This article will present some of the characterizations of the notion of algorithm in more detail. This article is a… …   Wikipedia

### Поделиться ссылкой на выделенное

##### Прямая ссылка:
Нажмите правой клавишей мыши и выберите «Копировать ссылку»