Contingency table

In statistics, a contingency table (also referred to as cross tabulation or cross tab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. It is often used to record and analyze the relation between two or more categorical variables.

The term contingency table was first used by Karl Pearson in "On the Theory of Contingency and Its Relation to Association and Normal Correlation", part of the Drapers' Company Research Memoirs Biometric Series I published in 1904.

A central problem in multivariate statistics is finding the (direct) dependence structure underlying the variables in high-dimensional contingency tables. If some of the conditional independences are revealed, even the storage of the data can be made more efficient (see Lauritzen (2002)). Information-theoretic concepts can be used for this purpose: they draw only on the probability distribution, which is easily estimated from the contingency table by the relative frequencies.


Example

Suppose that we have two variables, sex (male or female) and handedness (right- or left-handed). Further suppose that 100 individuals are randomly sampled from a very large population as part of a study of sex differences in handedness. A contingency table can be created to display the numbers of individuals who are male and right-handed, male and left-handed, female and right-handed, and female and left-handed. Such a contingency table is shown below.

	Right-handed	Left-handed	Totals
Males	43	9	52
Females	44	4	48
Totals	87	13	100

The numbers of the males, females, and right- and left-handed individuals are called marginal totals. The grand total, i.e., the total number of individuals represented in the contingency table, is the number in the bottom right corner.
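The marginal totals and the grand total can be recomputed from the four cell counts alone. The following is a minimal sketch in Python; the nested-dictionary layout and the function name `marginal_totals` are illustrative choices, not part of any standard.

```python
# Observed counts from the example: rows are sex, columns are handedness.
table = {
    "Males": {"Right-handed": 43, "Left-handed": 9},
    "Females": {"Right-handed": 44, "Left-handed": 4},
}


def marginal_totals(table):
    """Return (row_totals, column_totals, grand_total) for a nested dict of counts."""
    row_totals = {row: sum(cells.values()) for row, cells in table.items()}
    column_totals = {}
    for cells in table.values():
        for column, count in cells.items():
            column_totals[column] = column_totals.get(column, 0) + count
    grand_total = sum(row_totals.values())
    return row_totals, column_totals, grand_total


row_totals, column_totals, grand_total = marginal_totals(table)
print(row_totals)     # {'Males': 52, 'Females': 48}
print(column_totals)  # {'Right-handed': 87, 'Left-handed': 13}
print(grand_total)    # 100
```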

The table allows us to see at a glance that the proportion of men who are right-handed (43 of 52, about 83%) is similar to the proportion of women who are right-handed (44 of 48, about 92%), although the proportions are not identical. The significance of the difference between the two proportions can be assessed with a variety of statistical tests, including Pearson's chi-squared test, the G-test, Fisher's exact test, and Barnard's test, provided the entries in the table represent individuals randomly sampled from the population about which we want to draw a conclusion. If the proportions of individuals in the different columns vary significantly between rows (or vice versa), we say that there is a contingency between the two variables; in other words, the two variables are not independent. If there is no contingency, we say that the two variables are independent.
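As an illustration of one of these tests, the sketch below computes Pearson's chi-squared statistic for the handedness table using only the Python standard library. The function names are hypothetical, no continuity correction is applied, and the p-value helper assumes a 2 x 2 table (one degree of freedom); a statistics package would normally be used in practice.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in column_totals] for r in row_totals]


def pearson_chi_squared(table):
    """Pearson's chi-squared statistic, without continuity correction."""
    expected = expected_counts(table)
    return sum(
        (observed - exp) ** 2 / exp
        for observed_row, expected_row in zip(table, expected)
        for observed, exp in zip(observed_row, expected_row)
    )


def p_value_one_degree_of_freedom(chi_squared):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom.

    Valid for 2 x 2 tables only; uses P(Z**2 > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi_squared / 2))


observed = [[43, 9], [44, 4]]  # rows: males, females; columns: right, left
chi_squared = pearson_chi_squared(observed)
p_value = p_value_one_degree_of_freedom(chi_squared)
print(round(chi_squared, 3), round(p_value, 2))  # roughly 1.777 and 0.18
```

With a p-value of this size, the sample gives no strong evidence of a contingency between sex and handedness at conventional significance levels.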

The example above is the simplest kind of contingency table, a table in which each variable has only two levels; this is called a 2 × 2 contingency table. In principle, any number of rows and columns may be used. There may also be more than two variables, but higher-order contingency tables are difficult to represent on paper. The relation between ordinal variables, or between ordinal and categorical variables, may also be represented in contingency tables, although such a practice is rare.

Measures of association

Main article: Phi coefficient

Main article: Cramér's V

The degree of association between the two variables can be assessed by a number of coefficients: the simplest is the phi coefficient defined by

$\phi=\sqrt{\frac{\chi^2}{N}}$ ,

where χ² is derived from Pearson's chi-squared test, and N is the grand total of observations. Computed this way, φ is the magnitude of the association and ranges from 0 (no association between the variables) to 1 (complete association); when φ is computed with a sign, as the Pearson correlation between the two binary variables, it ranges from −1 (complete inverse association) to 1. This coefficient can only be calculated for frequency data represented in 2 × 2 tables. φ can reach its minimum of −1.00 and its maximum of 1.00 only when every marginal proportion is equal to .50 (and two diagonal cells are empty). Otherwise, the phi coefficient cannot reach those extreme values.^[1]
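The following sketch computes φ for the handedness example in two ways: as a signed value directly from the four cell counts, and as the square root of χ²/N. The cell-count formulas are the standard shortcuts for a 2 x 2 table, and the function names are illustrative only.

```python
import math


def phi_coefficient(a, b, c, d):
    """Signed phi for the 2 x 2 table [[a, b], [c, d]]."""
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denominator


def chi_squared_2x2(a, b, c, d):
    """Pearson's chi-squared for a 2 x 2 table, without continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


a, b, c, d = 43, 9, 44, 4  # males right/left, females right/left
n = a + b + c + d

signed_phi = phi_coefficient(a, b, c, d)
phi_from_chi_squared = math.sqrt(chi_squared_2x2(a, b, c, d) / n)

print(round(signed_phi, 4))            # about -0.1333
print(round(phi_from_chi_squared, 4))  # about 0.1333, the magnitude only

# The extremes are reached only with equal margins and two empty diagonal cells.
print(phi_coefficient(50, 0, 0, 50))   # 1.0
print(phi_coefficient(0, 50, 50, 0))   # -1.0
```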

Alternatives include the tetrachoric correlation coefficient (also only applicable to 2 × 2 tables), the contingency coefficient C, and Cramér's V.

C suffers from the disadvantage that it does not reach a maximum of 1: the highest value it can reach in a 2 × 2 table is 0.707, and the highest it can reach in a 4 × 4 table is 0.866. It can reach values closer to 1 in contingency tables with more categories. It should, therefore, not be used to compare associations among tables with different numbers of categories.^[2] Moreover, it does not apply to asymmetrical tables (those where the numbers of rows and columns are not equal).

The formulae for the C and V coefficients are:

$C=\sqrt{\frac{\chi^2}{N+\chi^2}}$ and

$V=\sqrt{\frac{\chi^2}{N(k-1)}}$ ,

k being the number of rows or the number of columns, whichever is less.

C can be adjusted so that it reaches a maximum of 1 when there is complete association in a square table of any size by dividing C by $\sqrt{\frac{k-1}{k}}$ (recall that C only applies to tables in which the number of rows is equal to the number of columns, and therefore equal to k).
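The formulae for C, V, and the adjusted C can be checked numerically. The sketch below is a plain-Python illustration with hypothetical function names; the diagonal tables at the end are made-up examples of complete association, used only to show the maximum that C reaches for a given k.

```python
import math


def pearson_chi_squared(table):
    """Pearson's chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    return sum(
        (table[i][j] - r * c / n) ** 2 / (r * c / n)
        for i, r in enumerate(row_totals)
        for j, c in enumerate(column_totals)
    )


def contingency_coefficient(table):
    """C = sqrt(chi2 / (N + chi2))."""
    chi2 = pearson_chi_squared(table)
    n = sum(sum(row) for row in table)
    return math.sqrt(chi2 / (n + chi2))


def cramers_v(table):
    """V = sqrt(chi2 / (N * (k - 1))), with k the smaller of rows and columns."""
    chi2 = pearson_chi_squared(table)
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))


def adjusted_contingency_coefficient(table):
    """C divided by sqrt((k - 1) / k); defined here for square tables only."""
    k = len(table)
    if any(len(row) != k for row in table):
        raise ValueError("the adjustment assumes a square table")
    return contingency_coefficient(table) / math.sqrt((k - 1) / k)


observed = [[43, 9], [44, 4]]
print(round(contingency_coefficient(observed), 4))
print(round(cramers_v(observed), 4))  # equals phi for a 2 x 2 table

# Hypothetical tables with complete association (all counts on the diagonal).
diagonal_2 = [[5, 0], [0, 5]]
diagonal_4 = [[5 if i == j else 0 for j in range(4)] for i in range(4)]
print(round(contingency_coefficient(diagonal_2), 3))           # 0.707
print(round(contingency_coefficient(diagonal_4), 3))           # 0.866
print(round(adjusted_contingency_coefficient(diagonal_4), 3))  # 1.0
```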

The tetrachoric correlation coefficient assumes that the variable underlying each dichotomous measure is normally distributed.^[3] The tetrachoric correlation coefficient provides "a convenient measure of [the Pearson product-moment] correlation when graduated measurements have been reduced to two categories."^[4] The tetrachoric correlation should not be confused with the Pearson product-moment correlation coefficient computed by assigning, say, values 0 and 1 to represent the two levels of each variable (which is mathematically equivalent to the phi coefficient). An extension of the tetrachoric correlation to tables involving variables with more than two levels is the polychoric correlation coefficient.

The Lambda coefficient is a measure of the strength of association of the cross tabulations when the variables are measured at the nominal level. Values range from 0 (no association) to 1 (the theoretical maximum possible association). Asymmetric lambda measures the percentage improvement in predicting the dependent variable. Symmetric lambda measures the percentage improvement when prediction is done in both directions.
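A minimal sketch of the lambda coefficient follows, assuming the usual Goodman and Kruskal definitions: the proportional reduction in prediction errors when the modal category within each row (or column) is used instead of the overall modal category. The function names are illustrative, and returning 0 when every observation falls in one category is a convention chosen for this sketch.

```python
def asymmetric_lambda(table):
    """Lambda for predicting the column variable from the row variable."""
    n = sum(sum(row) for row in table)
    largest_column_total = max(sum(column) for column in zip(*table))
    sum_of_row_maxima = sum(max(row) for row in table)
    if n == largest_column_total:
        return 0.0  # only one column category occurs; nothing to improve
    return (sum_of_row_maxima - largest_column_total) / (n - largest_column_total)


def transpose(table):
    """Swap rows and columns, so the other variable becomes the dependent one."""
    return [list(column) for column in zip(*table)]


def symmetric_lambda(table):
    """Lambda when prediction is done in both directions."""
    n = sum(sum(row) for row in table)
    largest_column_total = max(sum(column) for column in zip(*table))
    largest_row_total = max(sum(row) for row in table)
    sum_of_row_maxima = sum(max(row) for row in table)
    sum_of_column_maxima = sum(max(column) for column in zip(*table))
    denominator = 2 * n - largest_column_total - largest_row_total
    if denominator == 0:
        return 0.0
    numerator = (sum_of_row_maxima + sum_of_column_maxima
                 - largest_column_total - largest_row_total)
    return numerator / denominator


observed = [[43, 9], [44, 4]]  # rows: sex; columns: handedness
print(asymmetric_lambda(observed))             # handedness from sex: 0.0
print(asymmetric_lambda(transpose(observed)))  # sex from handedness: 1/48
print(symmetric_lambda(observed))              # 1/61
```

In the handedness example, knowing a person's sex does not change the best guess (right-handed in both rows), so lambda for predicting handedness is 0 even though the row proportions differ.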

The uncertainty coefficient is another measure for variables at the nominal level.
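The text does not give a formula for the uncertainty coefficient. The sketch below assumes the common entropy-based definition, U(column | row) = (H(column) − H(column | row)) / H(column), where H is the Shannon entropy estimated from the relative frequencies in the table. Function names are hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy (natural log) of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def uncertainty_coefficient(table):
    """U(column | row): share of column-variable entropy explained by the rows."""
    n = sum(sum(row) for row in table)
    column_totals = [sum(column) for column in zip(*table)]
    h_column = entropy(column_totals)
    # Conditional entropy: entropy within each row, weighted by the row's share.
    h_column_given_row = sum(
        (sum(row) / n) * entropy(row) for row in table if sum(row) > 0
    )
    return (h_column - h_column_given_row) / h_column


observed = [[43, 9], [44, 4]]  # rows: sex; columns: handedness
u = uncertainty_coefficient(observed)
print(u)  # a small positive value: sex explains little about handedness
```

Under this definition the coefficient is 0 when the rows carry no information about the columns and 1 when the row determines the column exactly.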

All of the following measures are used for variables at the ordinal level. The values range from -1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.

Goodman and Kruskal's gamma: no adjustment for either table size or ties.
Kendall's tau: adjustment for ties.
- Tau-b: for square tables.
- Tau-c: for rectangular tables.
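The ordinal measures listed above are all built from the numbers of concordant and discordant pairs of observations. The sketch below assumes the usual textbook definitions of gamma, tau-b, and tau-c; the function names and the 3 x 3 table of counts are hypothetical and serve only as an illustration.

```python
import math


def concordant_discordant(table):
    """Count concordant and discordant pairs in a table with ordered categories."""
    rows, columns = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(rows):
        for j in range(columns):
            for k in range(i + 1, rows):      # cells in a strictly lower row
                for l in range(columns):
                    if l > j:                 # higher on both variables
                        concordant += table[i][j] * table[k][l]
                    elif l < j:               # higher on one, lower on the other
                        discordant += table[i][j] * table[k][l]
    return concordant, discordant


def goodman_kruskal_gamma(table):
    """Gamma ignores tied pairs entirely."""
    c, d = concordant_discordant(table)
    return (c - d) / (c + d)


def kendall_tau_b(table):
    """Tau-b corrects the denominator for pairs tied on either variable."""
    c, d = concordant_discordant(table)
    n = sum(sum(row) for row in table)
    all_pairs = n * (n - 1) // 2
    tied_on_rows = sum(r * (r - 1) // 2 for r in (sum(row) for row in table))
    tied_on_columns = sum(t * (t - 1) // 2 for t in (sum(col) for col in zip(*table)))
    return (c - d) / math.sqrt((all_pairs - tied_on_rows) * (all_pairs - tied_on_columns))


def kendall_tau_c(table):
    """Tau-c (Stuart's form) rescales by the smaller table dimension m."""
    c, d = concordant_discordant(table)
    n = sum(sum(row) for row in table)
    m = min(len(table), len(table[0]))
    return 2 * (c - d) / (n ** 2 * (m - 1) / m)


# Hypothetical counts for two ordinal variables, each with levels low/medium/high.
ordinal_table = [
    [20, 10, 5],
    [10, 15, 10],
    [5, 10, 20],
]
print(goodman_kruskal_gamma(ordinal_table))
print(kendall_tau_b(ordinal_table))
print(kendall_tau_c(ordinal_table))
```

Because gamma discards ties, its magnitude is never smaller than that of tau-b for the same table.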

References

[1] Ferguson, G. A. (1966). Statistical analysis in psychology and education. New York: McGraw-Hill.
[2] Smith, S. C., & Albaum, G. S. (2004). Fundamentals of marketing research. Thousand Oaks, CA: Sage. p. 631.
[3] Ferguson (1966).
[4] Ferguson (1966), p. 244.

Andersen, Erling B. (1980). Discrete Statistical Models with Social Science Applications. North Holland.
Bishop, Y. M. M.; Fienberg, S. E.; Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press. ISBN 978-0262021135. MR 381130.
Christensen, Ronald (1997). Log-linear models and logistic regression. Springer Texts in Statistics (Second ed.). New York: Springer-Verlag. pp. xvi+483. ISBN 0-387-98247-7. MR 1633357.
Lauritzen, Steffen L. (2002). Lectures on Contingency Tables (updated electronic version of the 3rd (1989) ed., University of Aalborg; earlier editions 1979, 1982). http://www.stats.ox.ac.uk/~steffen/papers/cont.pdf.

Wikimedia Foundation. 2010.
