Classification in machine learning

In machine learning and pattern recognition, classification is an algorithmic procedure for assigning a given piece of input data to one of a fixed number of categories. Examples include labeling an email as "spam" or "non-spam", or assigning a diagnosis to a patient on the basis of observed characteristics (gender, blood pressure, presence or absence of certain symptoms, etc.). An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.

The piece of input data is formally termed an instance, and the categories are termed classes. An instance is described by a vector of features, which together constitute a description of all its known characteristics. Typically, features are categorical (also known as nominal, i.e. taking one of a set of unordered values, such as a gender of "male" or "female", or a blood type of "A", "B", "AB" or "O"), ordinal (taking one of a set of ordered values, e.g. "large", "medium" or "small"), integer-valued (e.g. a count of the occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure). Categorical and ordinal data are often grouped together, as are integer-valued and real-valued data. Furthermore, many algorithms work only with categorical data and require that real-valued or integer-valued data be discretized into groups (e.g. less than 5, between 5 and 10, or greater than 10).
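
The discretization step mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name discretize is hypothetical, and the handling of the boundary values 5 and 10 (both placed in the middle group) is an assumption, since the text does not say which group they belong to.

```python
def discretize(value):
    """Map an integer- or real-valued feature to one of three ordered groups.

    Assumption: the boundary values 5 and 10 fall into the middle group.
    """
    if value < 5:
        return "less than 5"
    if value <= 10:
        return "between 5 and 10"
    return "greater than 10"


# Example: turn a list of word counts into categorical features.
word_counts = [0, 5, 7.5, 10, 42]
groups = [discretize(count) for count in word_counts]
print(groups)
```

An algorithm that accepts only categorical input could then be given the group labels in place of the raw counts.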

Classification normally refers to a supervised procedure, i.e. a procedure that learns to classify new instances from a training set of instances that have been labeled by hand with the correct classes. The corresponding unsupervised procedure is known as clustering, and involves grouping data into classes based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space). Note that the terminology differs in some fields: for example, in community ecology, the term "classification" is synonymous with what is commonly known in machine learning as "clustering".
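
As a rough illustration of the supervised setting, the sketch below classifies a new instance by copying the label of the closest instance in a hand-labeled training set. The nearest-neighbor rule is chosen here only because it is short; the article does not single out any particular algorithm, and the feature vectors and the function name nearest_neighbor_classify are made up for the example.

```python
import math


def nearest_neighbor_classify(training_set, instance):
    """Return the class of the training instance closest to `instance`.

    `training_set` is a list of (feature_vector, class_label) pairs.
    Distance is Euclidean, with instances treated as vectors.
    """
    _, label = min(training_set, key=lambda pair: math.dist(pair[0], instance))
    return label


# Hypothetical hand-labeled training set with two real-valued features.
training_set = [
    ((1.0, 1.0), "non-spam"),
    ((1.5, 2.0), "non-spam"),
    ((8.0, 9.0), "spam"),
    ((9.0, 8.5), "spam"),
]

print(nearest_neighbor_classify(training_set, (2.0, 1.5)))
print(nearest_neighbor_classify(training_set, (8.5, 8.0)))
```

A clustering procedure would receive the same feature vectors without the labels and would have to form the groups from the distances alone.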

Classification and clustering are examples of the more general problem of pattern recognition, which is the assignment of some sort of output value to a given input value. Other examples are regression, which assigns a real-valued output to each input; sequence labeling, which assigns a class to each member of a sequence of values (for example, part of speech tagging, which assigns a part of speech to each word in an input sentence); parsing, which assigns a parse tree to an input sentence, describing the syntactic structure of the sentence; etc.

A common subclass of classification is probabilistic classification. Algorithms of this kind use statistical inference to find the best class for a given instance. Unlike other algorithms, which simply output a "best" class, probabilistic algorithms output the probability of the instance belonging to each of the possible classes. The best class is then normally selected as the one with the highest probability. Such an algorithm has several advantages over non-probabilistic classifiers:

  • It can output a confidence value associated with its choice (in general, a classifier that can do this is known as a confidence-weighted classifier).
  • Correspondingly, it can abstain when its confidence in any particular output is too low.
  • Because they output probabilities, probabilistic classifiers can be incorporated more effectively into larger machine-learning tasks, in a way that partially or completely avoids the problem of error propagation.
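
The selection and abstention behavior described above can be sketched as follows. The sketch assumes the per-class probabilities have already been produced by some probabilistic classifier; the function name choose_class and the threshold value of 0.7 are illustrative assumptions, not values taken from the article.

```python
def choose_class(probabilities, threshold=0.7):
    """Pick the most probable class, or abstain if confidence is too low.

    `probabilities` maps each class to its estimated probability.
    Returns (class, confidence); the class is None when abstaining.
    The default threshold is an arbitrary illustrative value.
    """
    best_class = max(probabilities, key=probabilities.get)
    confidence = probabilities[best_class]
    if confidence < threshold:
        return None, confidence
    return best_class, confidence


# A confident prediction and a borderline one.
confident = choose_class({"spam": 0.9, "non-spam": 0.1})
borderline = choose_class({"spam": 0.55, "non-spam": 0.45})
print(confident)
print(borderline)
```

Returning the confidence alongside the class lets a larger system weigh or defer the decision rather than commit to a possibly wrong label.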

Note that the term statistical classification is often encountered, but is used inconsistently in the technical literature. For some writers (especially within the field of machine learning), "statistical classification" and "probabilistic classification" are synonymous. For others, "statistical classification" encompasses any classifier that makes soft decisions using weights, whether or not there is an associated statistical model or probabilistic output. For yet others, "statistical classification" is wider still, encompassing practically all of the classification algorithms commonly used in machine learning, including algorithms such as decision trees that make hard decisions using if-then rules similar to those of older hand-coded classifiers.
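
The contrast between hard if-then decisions and soft weighted decisions can be made concrete with a small sketch. Everything in it is hypothetical: the feature names, the rule, the weights and the bias are invented for illustration and are not learned from data or taken from any real classifier.

```python
def rule_based(features):
    """Hard decision: a hand-written if-then rule, as in a decision tree."""
    if features["contains_link"] and features["exclamation_marks"] > 3:
        return "spam"
    return "non-spam"


# Hypothetical weights; a real classifier would learn these from data.
WEIGHTS = {"contains_link": 1.5, "exclamation_marks": 0.4}
BIAS = -2.0


def weighted_score(features):
    """Soft evidence: a weighted sum of the feature values plus a bias."""
    return BIAS + sum(WEIGHTS[name] * float(features[name]) for name in WEIGHTS)


def weighted(features):
    """Decide from the sign of the score; its magnitude shows how strongly."""
    return "spam" if weighted_score(features) > 0 else "non-spam"


email = {"contains_link": True, "exclamation_marks": 5}
print(rule_based(email), weighted(email), weighted_score(email))
```

Here the rule-based classifier yields only a label, whereas the weighted classifier also exposes a score that can be compared across instances, even though no probabilistic model is attached to it.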


Formal problem statement

See the article on pattern recognition for a formal statement of the problem of classification and related labeling tasks, including a rigorous mathematical treatment.

Application domains

Classification problems arise in many data mining applications.

Wikimedia Foundation. 2010.
