P-value

In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. One often "rejects the null hypothesis" when the p-value is less than the significance level α (Greek alpha), which is often 0.05 or 0.1. When the null hypothesis is rejected, the result is said to be statistically significant.
A closely related concept is the E-value,^{[1]} which is the average number of times in multiple testing that one expects to obtain a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. The E-value is the product of the number of tests and the p-value.
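The E-value arithmetic above can be sketched in a few lines of Python (a minimal illustration; the function name is ours, not a standard library API):

```python
def e_value(p_value: float, num_tests: int) -> float:
    """Expected number of results at least as extreme as the observed
    one across num_tests tests, assuming the null hypothesis is true."""
    return num_tests * p_value

# A p-value of 0.01 across 500 independent tests: about 5 such
# "hits" are expected by chance alone.
print(e_value(0.01, 500))
```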
Although there is often confusion, the p-value is not the probability of the null hypothesis being true, nor is the p-value the same as the Type I error rate, α.^{[2]}
Coin flipping example
Main article: Checking whether a coin is fair
For example, an experiment is performed to determine whether a coin flip is fair (50% chance, each, of landing heads or tails) or unfairly biased (≠ 50% chance of one of the outcomes).
Suppose that the experimental results show the coin turning up heads 14 times out of 20 total flips. The p-value of this result would be the chance of a fair coin landing on heads at least 14 times out of 20 flips. The probability that 20 flips of a fair coin would result in 14 or more heads can be computed from binomial coefficients as

Prob(14 or more heads out of 20) = (1/2^20) · Σ_{k=14}^{20} C(20, k) = 60460/1048576 ≈ 0.058.
This probability is the (one-sided) p-value. It measures the chance that a fair coin would give a result at least this extreme.
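The one-sided calculation above can be checked numerically. This sketch uses Python's standard-library `math.comb`; the function name `one_sided_p` is ours:

```python
from math import comb

def one_sided_p(heads: int, flips: int) -> float:
    """P(at least `heads` heads in `flips` flips of a fair coin)."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2**flips

# 14 or more heads out of 20 flips of a fair coin
print(f"{one_sided_p(14, 20):.4f}")  # 0.0577
```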
Interpretation
Traditionally, one rejects the null hypothesis if the p-value is smaller than or equal to the significance level,^{[3]} often represented by the Greek letter α (alpha). (Greek α is also used for the Type I error rate; the connection is that a hypothesis test that rejects the null hypothesis for all samples that have a p-value less than α will have a Type I error rate of α.) A significance level of 0.05 would deem as extraordinary any result that is within the most extreme 5% of all possible results under the null hypothesis. In this case a p-value less than 0.05 would result in the rejection of the null hypothesis at the 5% (significance) level.
When we ask whether a given coin is fair, often we are interested in the deviation of our result from the equality of numbers of heads and tails. In this case, the deviation can be in either direction, favoring either heads or tails. Thus, in this example of 14 heads and 6 tails, we may want to calculate the probability of getting a result deviating by at least 4 from parity in either direction (two-sided test). This is the probability of getting at least 14 heads or at least 14 tails. As the binomial distribution is symmetrical for a fair coin, the two-sided p-value is simply twice the above calculated one-sided p-value; i.e., the two-sided p-value is 0.115.
In the above example we thus have:
 null hypothesis (H_{0}): fair coin; P(heads) = 0.5
 observation O: 14 heads out of 20 flips; and
 p-value of observation O given H_{0} = Prob(≥ 14 heads or ≥ 14 tails) = 0.115.
The calculated p-value exceeds 0.05, so the observation is consistent with the null hypothesis — that the observed result of 14 heads out of 20 flips can be ascribed to chance alone — as it falls within the range of what would happen 95% of the time were the coin in fact fair. In our example, we fail to reject the null hypothesis at the 5% level. Although the coin did not fall evenly, the deviation from the expected outcome is small enough to be consistent with chance.
However, had one more head been obtained, the resulting p-value (two-tailed) would have been 0.0414 (4.14%). This time the null hypothesis – that the observed result of 15 heads out of 20 flips can be ascribed to chance alone – is rejected when using a 5% cutoff.
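The two cases can be compared directly. A short sketch (function name is ours) that doubles the one-sided tail probability, as described above, and applies the 5% cutoff:

```python
from math import comb

def two_sided_p(heads: int, flips: int) -> float:
    """Two-sided p-value under a fair-coin null; by symmetry it is
    twice the one-sided tail probability."""
    tail = sum(comb(flips, k) for k in range(heads, flips + 1)) / 2**flips
    return 2 * tail

for heads in (14, 15):
    p = two_sided_p(heads, 20)
    verdict = "reject H0" if p <= 0.05 else "fail to reject H0"
    print(f"{heads} heads: p = {p:.4f} -> {verdict}")
```

Running this reproduces the two-sided values quoted in the text: 0.1153 for 14 heads (not significant at 5%) and 0.0414 for 15 heads (significant at 5%).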
To understand both the original purpose of the p-value p and the reasons p is so often misinterpreted, it helps to know that p constitutes the main result of statistical significance testing (not to be confused with hypothesis testing), popularized by Ronald A. Fisher. Fisher promoted this testing as a method of statistical inference. To call this testing inferential is misleading, however, since inference makes statements about general hypotheses based on observed data, such as the post-experimental probability that a hypothesis is true. As explained above, p is instead a statement about data assuming the null hypothesis; consequently, indiscriminately considering p as an inferential result can lead to confusion, including many of the misinterpretations noted in the next section.
On the other hand, Bayesian inference, the main alternative to significance testing, generates probabilistic statements about hypotheses based on data (and a priori estimates), and therefore truly constitutes inference. Bayesian methods can, for instance, calculate the probability that the null hypothesis H_{0} above is true assuming an a priori estimate of the probability that a coin is unfair. Since a priori we would be quite surprised that a coin could consistently give 75% heads, a Bayesian analysis would find the null hypothesis (that the coin is fair) quite probable even if a test gave 15 heads out of 20 tries (which as we saw above is considered a "significant" result at the 5% level according to its p-value).
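A Bayesian calculation of this kind can be sketched for the coin example. The prior (99% confidence that the coin is fair) and the single biased alternative (75% heads) are illustrative assumptions, not values from the text:

```python
from math import comb

def posterior_fair(heads: int, flips: int,
                   prior_fair: float = 0.99, biased_p: float = 0.75) -> float:
    """Posterior probability that the coin is fair, comparing a fair coin
    against one biased alternative that lands heads with probability
    biased_p (prior and alternative are illustrative choices)."""
    lik_fair = comb(flips, heads) * 0.5**flips
    lik_biased = comb(flips, heads) * biased_p**heads * (1 - biased_p)**(flips - heads)
    numerator = prior_fair * lik_fair
    return numerator / (numerator + (1 - prior_fair) * lik_biased)

# 15 heads out of 20: "significant" at the 5% level by its p-value,
# yet the fair-coin hypothesis remains probable under this prior.
print(f"{posterior_fair(15, 20):.3f}")
```

Under these assumptions the posterior probability that the coin is fair stays well above one half, even though the two-sided p-value (0.0414) falls below the 5% cutoff.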
Strictly speaking, then, p is a statement about data rather than about any hypothesis, and hence it is not inferential. This raises the question, though, of how science has been able to advance using significance testing. The reason is that, in many situations, p approximates some useful post-experimental probabilities about hypotheses, such as the post-experimental probability of the null hypothesis. When this approximation holds, it could help a researcher to judge the post-experimental plausibility of a hypothesis.^{[4]}^{[5]}^{[6]}^{[7]} Even so, this approximation does not eliminate the need for caution in interpreting p inferentially, as shown in the Jeffreys–Lindley paradox mentioned below.
Misunderstandings
The data obtained by comparing the p-value to a significance level will yield one of two results: either the null hypothesis is rejected, or the null hypothesis cannot be rejected at that significance level (which however does not imply that the null hypothesis is true). A small p-value that indicates statistical significance does not indicate that an alternative hypothesis is ipso facto correct.
Despite the ubiquity of p-value tests, this particular test for statistical significance has come under heavy criticism due both to its inherent shortcomings and the potential for misinterpretation.
There are several common misunderstandings about p-values.^{[8]}^{[9]}
 The p-value is not the probability that the null hypothesis is true.
In fact, frequentist statistics does not, and cannot, attach probabilities to hypotheses. Comparison of Bayesian and classical approaches shows that a p-value can be very close to zero while the posterior probability of the null is very close to unity (if there is no alternative hypothesis with a large enough a priori probability which would explain the results more easily). This is the Jeffreys–Lindley paradox.
 The p-value is not the probability that a finding is "merely a fluke."
As the calculation of a p-value is based on the assumption that a finding is the product of chance alone, it patently cannot also be used to gauge the probability of that assumption being true. The p-value's real meaning is the chance of obtaining such results if the null hypothesis is true.
 The p-value is not the probability of falsely rejecting the null hypothesis. This error is a version of the so-called prosecutor's fallacy.
 The p-value is not the probability that a replicating experiment would not yield the same conclusion.
 1 − (p-value) is not the probability of the alternative hypothesis being true (see the first point above).
 The significance level of the test is not determined by the p-value.
The significance level of a test is a value that should be decided upon by the agent interpreting the data before the data are viewed, and is compared against the p-value (or any other statistic calculated after the test has been performed). (However, reporting a p-value is more useful than simply saying that the results were or were not significant at a given level, as it allows the reader to decide whether to consider the results significant.)
 The p-value does not indicate the size or importance of the observed effect (compare with effect size). The two do vary together, however: the larger the effect, the smaller the sample size required to obtain a significant p-value.
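The relationship between effect size, sample size, and significance can be seen numerically: the same observed proportion of heads yields very different p-values at different sample sizes (a sketch; the function name is ours):

```python
from math import comb

def one_sided_p(heads: int, flips: int) -> float:
    """P(at least `heads` heads in `flips` flips of a fair coin)."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2**flips

# The same observed proportion (60% heads) at two sample sizes:
p_small = one_sided_p(12, 20)    # 60% of 20 flips
p_large = one_sided_p(120, 200)  # 60% of 200 flips
print(f"n = 20:  p = {p_small:.4f}")   # not significant at 5%
print(f"n = 200: p = {p_large:.4f}")   # significant at 5%
```

A modest deviation from fairness that is invisible in 20 flips becomes highly significant with 200, which is why a p-value alone says nothing about the size of the effect.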
Problems
Main article: Statistical hypothesis testing#Controversy
Critics of p-values point out that the criterion used to decide "statistical significance" is based on the somewhat arbitrary choice of level (often set at 0.05).^{[10]} If significance testing is applied to hypotheses that are known to be false in advance, an insignificant result will simply reflect an insufficient sample size. Another problem is that the definition of "more extreme" data depends on the intentions of the investigator; for example, the situation in which the investigator flips the coin 100 times has a set of extreme data that is different from the situation in which the investigator continues to flip the coin until 50 heads are achieved.^{[11]}
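The stopping-rule problem can be shown with a smaller setup than the 100-flip scenario in the text (the 12-flip numbers here are an illustrative assumption): the same observed data, 9 heads and 3 tails, yields different p-values under the two sampling plans.

```python
from math import comb

# Observed data: 9 heads, 3 tails.

# Plan A: flip exactly 12 times (binomial model).
# "More extreme" means 9 or more heads out of 12.
p_fixed = sum(comb(12, k) for k in range(9, 13)) / 2**12

# Plan B: flip until the 3rd tail appears (negative binomial model).
# "More extreme" means needing 12 or more flips, i.e. at most
# 2 tails in the first 11 flips.
p_stop = sum(comb(11, k) for k in range(0, 3)) / 2**11

print(f"fixed-n p-value:       {p_fixed:.4f}")  # ~0.073: not significant at 5%
print(f"stopping-rule p-value: {p_stop:.4f}")   # ~0.033: significant at 5%
```

Identical data, two different conclusions at the 5% level, purely because the investigator's intentions define which unobserved outcomes count as "more extreme."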
As noted above, the p-value p is the main result of statistical significance testing. Fisher proposed p as an informal measure of evidence against the null hypothesis. He called on researchers to combine p in their minds with other types of evidence for and against that hypothesis, such as the a priori plausibility of the hypothesis and the relative strengths of results from previous studies. Many misunderstandings concerning p arise because statistics classes and instructional materials ignore or at least do not emphasize the role of prior evidence in interpreting p. A renewed emphasis on prior evidence could encourage researchers to place p in the proper context, evaluating a hypothesis by weighing p together with all the other evidence about the hypothesis.^{[12]}
See also
 Binomial test
 Counternull
 Fisher's Method
 Generalized p-value
 p-rep
 Statistical hypothesis testing
 t-value
 Bayesian inference, use of prior estimates
 False Discovery Rate
References
 ^ National Institutes of Health definition of E-value
 ^ Raymond Hubbard, M.J. Bayarri, P Values are not Error Probabilities. A working paper that explains the difference between Fisher's evidential p-value and the Neyman–Pearson Type I error rate α.
 ^ Dictionary definition of p-value
 ^ Hartley, Andrew. Christian and Humanist Foundations for Statistical Inference, 2008, Eugene, OR: Wipf and Stock.
 ^ GomezVillegas MA, Sanz L (1998). "Reconciling Bayesian and frequentist evidence in the point null testing problem". Sociedad de Estadistica e Investigacion Operativa Test 7 (1): 207–16. http://www.springerlink.com/content/hj4w370653524861/.
 ^ Casella G, Berger RL (1987). "Reconciling Bayesian and frequentist evidence in the onesided testing problem". JASA 82: 106–11.
 ^ Howard JV (1998). "The 2 x 2 Table, A Discussion from a Bayesian Viewpoint". Statistical Science 13 (4): 351–67. doi:10.1214/ss/1028905830.
 ^ Sterne JAC, Smith GD (2001). "Sifting the evidence—what's wrong with significance tests?". BMJ 322 (7280): 226–231. doi:10.1136/bmj.322.7280.226. PMC 1119478. PMID 11159626. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=1119478.
 ^ Schervish MJ (1996). "P Values: What They Are and What They Are Not". The American Statistician 50 (3): 203–206. doi:10.2307/2684655. JSTOR 2684655.
 ^ Sellke, Thomas; Bayarri, M.J.; Berger, James (2001). "Calibration of p values for testing precise null hypotheses". The American Statistician 55 (1): 62–71. doi:10.1198/000313001300339950. JSTOR 2685531.
 ^ Johnson, Douglas H. (1999). "The Insignificance of Statistical Significance Testing". Journal of Wildlife Management 63 (3): 763–772. doi:10.2307/3802789. http://www.stats.org.uk/statisticalinference/Johnson1999.pdf.
 ^ Goodman, SN (1999). "Toward EvidenceBased Medical Statistics. 1: The P Value Fallacy.". Annals of Internal Medicine 130: 995–1004.
Further reading
 Dallal GE (2007) Historical background to the origins of p-values and the choice of 0.05 as the cutoff for significance
 Hubbard R, Armstrong JS (2005) Historical background on the widespread confusion of the p-value (PDF)
 Fisher's method for combining independent tests of significance using their p-values
 Dallal GE (2007) The Little Handbook of Statistical Practice (A tutorial)
External links
 Free online p-value calculators for various specific tests (chi-square, Fisher's F-test, etc.).
 Understanding p-values, including a Java applet that illustrates how the numerical values of p-values can give quite misleading impressions about the truth or falsity of the hypothesis under test.