Statistical classification


In machine learning, statistical classification is the problem of identifying the sub-population to which new observations belong, where the identity of the sub-population is unknown, on the basis of a training set of data containing observations whose sub-population is known. The resulting classifications therefore show variable behaviour, which can be studied using statistical methods.

Thus the requirement is that new individual items are placed into groups based on quantitative information on one or more measurements, traits or characteristics, and on a training set in which the groupings have already been established.

The problem here may be contrasted with that for cluster analysis, where the problem is to analyse a single data-set and decide how and whether the observations in the data-set can be divided into groups. In certain terminology, particularly that of machine learning, the classification problem is known as supervised learning, while clustering is known as unsupervised learning.

Unfortunately, terminology can be different in various fields of application. For example, in community ecology, the term "classification" is synonymous with cluster analysis.

Learning classifiers. Problem statement

A learning classifier is able to learn based on a sample. The data-set used for training consists of information x and y for each data-point, where x denotes what is generally a vector of observed characteristics for the data-item and y denotes a group-label. The label y can take only a finite number of values.

The classification problem can be stated as follows: given training data \{(x_1,y_1),\dots,(x_n, y_n)\}, produce a rule (or "classifier") h such that h(x) can be evaluated for any possible value of x (not just those included in the training data) and such that the group attributed to any new observation,

\hat{y}=h(x),

is as close as possible to the true group label y. For the training data-set, the true labels y_i are known but will not necessarily match their in-sample approximations

\hat{y}_i=h(x_i).

For new observations, the true labels y_j are unknown, but a prime target of the classification procedure is that the approximation

\hat{y}_j=h(x_j) \approx y_j

holds as well as possible, where the quality of this approximation needs to be judged on the basis of the statistical or probabilistic properties of the overall population from which future observations will be drawn.
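The problem statement can be made concrete with a small sketch. The nearest-centroid rule below is only one possible choice of h, picked here for brevity and not taken from the article; the function name fit_nearest_centroid and the toy data are illustrative assumptions.

```python
from collections import defaultdict


def fit_nearest_centroid(training_data):
    """Build a rule h from pairs (x_i, y_i), where each x_i is a tuple of numbers."""
    sums = {}
    counts = defaultdict(int)
    for x, y in training_data:
        if y not in sums:
            sums[y] = [0.0] * len(x)
        for d, value in enumerate(x):
            sums[y][d] += value
        counts[y] += 1
    # One centroid (mean vector) per group label.
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def h(x):
        # Assign x to the group whose centroid is nearest (squared Euclidean distance).
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return h


# Hypothetical training set with two groups, "a" and "b".
training_data = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((4.0, 4.2), "b"),
    ((3.8, 4.0), "b"),
]
h = fit_nearest_centroid(training_data)
in_sample = [h(x) for x, _ in training_data]  # in-sample approximations h(x_i)
new_label = h((3.5, 3.9))                     # label attributed to an unseen x_j
```

The rule h can be evaluated at any point, not only at the training points, which is the property the problem statement asks for.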

Frequentist procedures

Early work on statistical classification was undertaken by Fisher,[1][2] in the context of two-group problems, leading to Fisher's linear discriminant function as the rule for assigning a group to a new observation.[3] This early work assumed that data-values within each of the two groups had a multivariate normal distribution. The extension of this same context to more than two groups has also been considered, with the restriction imposed that the classification rule should be linear.[3][4] Later work for the multivariate normal distribution allowed the classifier to be nonlinear:[5] several classification rules can be derived based on slightly different adjustments of the Mahalanobis distance, with a new observation being assigned to the group whose centre has the lowest adjusted distance from the observation.
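As a rough illustration of a distance-based rule of this kind, the sketch below assigns an observation to the group whose centre has the smallest squared Mahalanobis distance. It assumes two-dimensional data, a single covariance matrix shared by all groups, and no further adjustment of the distance; the helper names and the numbers are hypothetical.

```python
def invert_2x2(m):
    """Invert a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis_sq(x, centre, cov_inv):
    """Squared Mahalanobis distance (x - centre)^T cov_inv (x - centre)."""
    diff = [a - b for a, b in zip(x, centre)]
    n = len(diff)
    return sum(diff[i] * cov_inv[i][j] * diff[j] for i in range(n) for j in range(n))


def assign_group(x, centres, cov_inv):
    """Return the label of the group whose centre is closest to x."""
    return min(centres, key=lambda g: mahalanobis_sq(x, centres[g], cov_inv))


# Assumed group centres and shared covariance: wide spread along the first
# coordinate, narrow spread along the second.
centres = {"g1": (0.0, 0.0), "g2": (4.0, 2.0)}
cov_inv = invert_2x2([[4.0, 0.0], [0.0, 0.25]])

x_new = (3.0, 0.2)
label = assign_group(x_new, centres, cov_inv)
```

With these numbers x_new is closer to the centre of g2 in ordinary Euclidean distance, but the Mahalanobis rule assigns it to g1, because a deviation along the tightly concentrated second coordinate counts for more.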

Bayesian procedures

Unlike frequentist procedures, Bayesian classification procedures provide a natural way of taking into account any available information about the relative sizes of the sub-populations associated with the different groups within the overall population.[6] Bayesian procedures tend to be computationally expensive and, in the days before Markov chain Monte Carlo computations were developed, approximations for Bayesian clustering rules were devised.[7]

Some Bayesian procedures involve the calculation of group membership probabilities: these can be viewed as providing a more informative outcome of a data analysis than a simple attribution of a single group-label to each new observation.
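A minimal sketch of group membership probabilities, assuming one-dimensional data and a normal density with known mean and standard deviation within each group; the prior for each group plays the role of the relative size of its sub-population. All names and numbers are illustrative assumptions.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def membership_probabilities(x, groups):
    """groups maps label -> (prior, mean, sd); returns label -> posterior probability."""
    # Bayes' rule: posterior is proportional to prior times within-group density.
    joint = {g: prior * normal_pdf(x, mean, sd) for g, (prior, mean, sd) in groups.items()}
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}


# Hypothetical population: 90% of items belong to "common", 10% to "rare".
groups = {"common": (0.9, 0.0, 1.0), "rare": (0.1, 2.0, 1.0)}

# x = 1.0 is equally far from both group means, so the priors decide.
posterior = membership_probabilities(1.0, groups)
```

The returned dictionary carries more information than a single label: here the observation is attributed to "common", but with an explicit probability attached to each group.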

Binary and multiclass classification

Classification can be thought of as two separate problems: binary classification and multiclass classification. In binary classification, a better understood task, only two classes are involved, whereas multiclass classification involves assigning an object to one of several classes.[8] Since many classification methods have been developed specifically for binary classification, multiclass classification often requires the combined use of multiple binary classifiers.
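One common way of combining binary classifiers is the one-vs-rest scheme, sketched below; the article does not say which combination scheme is meant, so this is an assumption. The binary learner train_binary is a hypothetical stand-in (a centroid-based scorer) and could be replaced by any method that returns a real-valued score.

```python
import math


def mean_point(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def train_binary(points, is_positive):
    """Hypothetical binary learner: the score is larger when x looks positive."""
    pos = mean_point([x for x, flag in zip(points, is_positive) if flag])
    neg = mean_point([x for x, flag in zip(points, is_positive) if not flag])
    return lambda x: math.dist(x, neg) - math.dist(x, pos)


def train_one_vs_rest(points, labels):
    """Train one binary scorer per class and predict the class with the top score."""
    scorers = {
        k: train_binary(points, [y == k for y in labels])
        for k in sorted(set(labels))
    }
    return lambda x: max(scorers, key=lambda k: scorers[k](x))


# Hypothetical three-class training set.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0), (2.0, 4.0), (2.0, 5.0)]
labels = ["a", "a", "b", "b", "c", "c"]

classify = train_one_vs_rest(points, labels)
predictions = [classify(x) for x in [(0.2, 0.3), (3.9, 0.6), (2.1, 4.2)]]
```

Each binary problem treats one class as positive and all remaining classes as negative, so a problem with K classes needs K binary classifiers under this scheme.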

Algorithms

Widely used classifiers include the neural network (multi-layer perceptron), support vector machines, k-nearest neighbours, Gaussian mixture model, Gaussian, naive Bayes, decision tree and RBF classifiers.

Evaluation

Classifier performance depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems (a phenomenon that may be explained by the no-free-lunch theorem). Various empirical tests have been performed to compare classifier performance and to find the characteristics of data that determine classifier performance. Determining a suitable classifier for a given problem is, however, still more an art than a science.

The measures precision and recall are popular metrics used to evaluate the quality of a classification system. More recently, receiver operating characteristic (ROC) curves have been used to evaluate the tradeoff between true- and false-positive rates of classification algorithms.
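These quantities can be computed directly from the true and predicted labels. The sketch below uses the standard definitions for one designated positive class; returning 0.0 when a denominator is zero is a convention assumed here, and the labels are made up for illustration.

```python
def binary_metrics(true_labels, predicted_labels, positive):
    """Return precision, recall (true-positive rate) and false-positive rate."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    # Assumed convention: a ratio with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, false_positive_rate


# Hypothetical labels for eight items.
true_labels = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
predicted_labels = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

precision, recall, fpr = binary_metrics(true_labels, predicted_labels, positive="spam")
```

A single (false-positive rate, true-positive rate) pair such as (fpr, recall) is one point on a ROC curve; the full curve is traced by varying the decision threshold of a classifier that outputs scores.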

As a performance metric, the uncertainty coefficient has the advantage over simple accuracy in that it is not affected by the relative sizes of the different classes.[9] Further, it will not penalize an algorithm for simply rearranging the classes.
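Both properties can be checked numerically. The sketch below assumes the usual definition of the uncertainty coefficient as the mutual information between true and predicted classes divided by the entropy of the true classes; whether reference [9] normalises in exactly this way is not confirmed here. The confusion matrices are made up for illustration.

```python
import math


def uncertainty_coefficient(confusion):
    """confusion[i][j] = number of items with true class i and predicted class j."""
    total = sum(sum(row) for row in confusion)
    true_marginal = [sum(row) / total for row in confusion]
    pred_marginal = [sum(col) / total for col in zip(*confusion)]
    h_true = -sum(p * math.log(p) for p in true_marginal if p > 0)
    if h_true == 0:
        raise ValueError("undefined when only one true class is present")
    mutual_information = 0.0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if count:
                p = count / total
                mutual_information += p * math.log(
                    p / (true_marginal[i] * pred_marginal[j])
                )
    return mutual_information / h_true


def accuracy(confusion):
    """Fraction of items on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total


perfect = [[90, 0], [0, 10]]   # every item labelled correctly
swapped = [[0, 90], [10, 0]]   # the two class labels are consistently exchanged
majority = [[90, 0], [10, 0]]  # everything is assigned to the larger class
```

Under these assumptions, perfect and swapped both give a coefficient of 1 although their accuracies are 1 and 0, while majority reaches an accuracy of 0.9 on this imbalanced sample but has a coefficient of 0, since its output carries no information about the true class.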

An intriguing problem in pattern recognition yet to be solved is the relationship between the problem to be solved (data to be classified) and the performance of various pattern recognition algorithms (classifiers).

Application domains

Classification has many applications. In some of these it is employed as a data mining procedure, while in others more detailed statistical modeling is undertaken.

See also

  • Classification test

References

  1. ^ Fisher, R.A. (1936) "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, 179–188.
  2. ^ Fisher, R.A. (1938) "The statistical utilization of multiple measurements", Annals of Eugenics, 8, 376–386.
  3. ^ a b Gnanadesikan, R. (1977) Methods for Statistical Data Analysis of Multivariate Observations, Wiley. ISBN 0-471-30845-5 (pp. 83–86).
  4. ^ Rao, C.R. (1952) Advanced Statistical Methods in Multivariate Analysis, Wiley. (Section 9c)
  5. ^ Anderson, T.W. (1958) An Introduction to Multivariate Statistical Analysis, Wiley.
  6. ^ Binder, D.A. (1978) "Bayesian cluster analysis", Biometrika, 65, 31–38.
  7. ^ Binder, D.A. (1981) "Approximations to Bayesian clustering rules", Biometrika, 68, 275–285.
  8. ^ Har-Peled, S., Roth, D., Zimak, D. (2003) "Constraint Classification for Multiclass Classification and Ranking". In: Becker, B., Thrun, S., Obermayer, K. (Eds) Advances in Neural Information Processing Systems 15: Proceedings of the 2002 Conference, MIT Press. ISBN 0262025507
  9. ^ Mills, P. (2011) "Efficient statistical classification of satellite measurements", International Journal of Remote Sensing. doi:10.1080/01431161.2010.507795.

Wikimedia Foundation. 2010.

Look at other dictionaries:

  • Statistical Classification of Economic Activities in the European Community — The Statistical Classification of Economic Activities in the European Community (in French: Nomenclature statistique des activités économiques dans la Communauté européenne), commonly referred to as NACE, is a European industry standard… …   Wikipedia

  • Statistical classification of economic activities in the European Community — The statistical classification of economic activities in the European Community (in French: Nomenclature statistique des activités économiques dans la Communauté européenne), commonly referred to as NACE, is a European industry standard… …   Wikipedia

  • International Statistical Classification of Diseases and Related Health Problems — Classification internationale des maladies Pour les articles homonymes, voir CIM. La CIM 10. La Classification internationale des maladies, dont l appellation complète est …   Wikipédia en Français

  • International Statistical Classification of Diseases and Related Health Problems — Die Internationale Klassifikation der Krankheiten (ICD, engl.: International Classification of Diseases) ist das wichtigste, weltweit anerkannte Diagnoseklassifikationssystem der Medizin. Es wird von der Weltgesundheitsorganisation (WHO)… …   Deutsch Wikipedia

  • Classification — may refer to: Library classification and classification in general Taxonomic classification (see Taxonomy) Biological classification of organisms Medical classification Scientific classification (disambiguation) Classification (literature)… …   Wikipedia

  • Classification Internationale Des Maladies — Pour les articles homonymes, voir CIM. La CIM 10. La Classification internationale des maladies, dont l appellation complète est …   Wikipédia en Français

  • Classification in machine learning — See also: Pattern recognition This section needs integrating with Statistical classification (Discuss). Integration means cross linking and distinguishing (to/from each other), or sometimes merging (if consensus suggests). In machine learning and …   Wikipedia

  • Classification rule — See also: Statistical classification and Classification in machine learning Given a population whose members can be potentially separated into a number of different sets or classes, a classification rule is a procedure in which the elements… …   Wikipedia

  • Classification type des industries — Une classification type des industries est un système de classification normalisé des activités et des produits économiques utilisé à des fins statistiques, souvent désignée sous le terme de nomenclature des secteurs économiques ou nomenclatures… …   Wikipédia en Français

  • Classification internationale des maladies — Pour les articles homonymes, voir CIM. La CIM 10. La Classification internationale des maladies, dont l appellation complète est Classification statistique internationale des maladies et des problèmes de santé con …   Wikipédia en Français