Friday, April 16, 2010

To P or not to P, that is the question...

I think that at least part of the reason most of us have a hard time stating what a frequentist P-value is (and is not) is that we do not know the difference between statisticians' correct definition and our common but erroneous definitions. In addition -- or maybe another way of looking at it -- the ambiguity of words in general (as opposed to math) contributes substantially to the confusion.

Here I take a stab at it.

A frequentist P-value (say, of a t-test for a difference between means) is the probability that a difference as large as or larger than the observed difference would occur if our two samples were drawn from the same distribution, and the same experiment were conducted repeatedly ad infinitum (or ad nauseam). "A P value is often described as the probability of seeing results as or more extreme as those actually observed if the null hypothesis were true" (2).
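A minimal sketch of this long-run-frequency idea, as a permutation test on two made-up samples (all numbers are illustrative, not from any real experiment):

```python
# Sketch: the P-value as a long-run frequency, via a permutation test.
# The two samples below are hypothetical numbers chosen for illustration.
import random

random.seed(1)

group_a = [4.1, 5.0, 4.7, 5.3, 4.9, 5.1]   # hypothetical sample A
group_b = [5.6, 5.2, 6.0, 5.8, 5.4, 5.9]   # hypothetical sample B

def mean(xs):
    return sum(xs) / len(xs)

observed = abs(mean(group_a) - mean(group_b))

# Under the null, both samples come from the same distribution, so group
# labels are exchangeable: shuffle and re-split many times, and count how
# often the shuffled difference is as large as or larger than the observed one.
pooled = group_a + group_b
n_reps = 10_000
extreme = 0
for _ in range(n_reps):
    random.shuffle(pooled)
    diff = abs(mean(pooled[:6]) - mean(pooled[6:]))
    if diff >= observed:
        extreme += 1

p_value = extreme / n_reps  # the long-run frequency the definition describes
```

Shuffling the pooled data enacts the null hypothesis directly: if both samples really came from the same distribution, the group labels carry no information.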

A frequentist P-value of such a t-test is apparently not lots of things we wish it were (1):
  1. it is not the probability that our null hypothesis is true.
  2. it is not the probability that such a difference could occur by chance alone.
  3. it is not the probability of falsely rejecting the null hypothesis.
Each of these leaves information out. Important elements of a correct definition include
  • The probability of observing data as extreme or more extreme in ...
  • repeated identical experiments and analyses, given ...
  • the a priori belief that the null hypothesis is true (this is a Bayesian "prior").
Thus we see that the incorrect definitions typically share something with a complete definition, and that under most circumstances, analyses that fit incorrect definitions (e.g., Bayesian posteriors) will be correlated with frequentist P-values (see the comment by Bolker below).
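A toy illustration of that correlation, assuming a normal mean with known sigma and a flat prior (my choice of model, for illustration only): in this special case the posterior probability Pr(mu > 0 | data) is exactly one minus the one-sided P-value for H0: mu <= 0, so the two quantities are perfectly monotone in each other.

```python
# Sketch: why P-values and Bayesian posteriors tend to track each other.
# Toy model (assumed): normal mean, known sigma, flat prior. Here the
# posterior Pr(mu > 0 | data) equals 1 minus the one-sided P-value exactly;
# in general the relationship is merely a correlation.
import math
import random

random.seed(4)

def phi(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

sigma, n = 1.0, 30
pairs = []
for _ in range(200):
    mu_true = random.uniform(-0.5, 0.5)       # the effect varies across "studies"
    xbar = random.gauss(mu_true, sigma / math.sqrt(n))
    z = xbar * math.sqrt(n) / sigma
    one_sided_p = 1.0 - phi(z)                # frequentist tail probability
    posterior = phi(z)                        # Pr(mu > 0 | data), flat prior
    pairs.append((one_sided_p, posterior))

# In this toy case posterior = 1 - p exactly.
assert all(abs(p + post - 1.0) < 1e-9 for p, post in pairs)
```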

A frequentist confidence interval is a region calculated from our data and our selected significance level, α, in a manner which would include the "true" population parameter (e.g., the "true" mean) 100(1-α)% of the time if the same experiment and analysis were conducted repeatedly ad infinitum. For α = 0.05, it is a region which, so calculated, would include the true population parameter in 95% of all hypothetically repeated identical experiments. Thus, the population parameter of interest is fixed (i.e., a "true" value exists), and the interval is random (because it is based on a randomized experiment).
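A quick simulation of this coverage property, under assumed normal data with known sigma (all numbers illustrative): the "true" mean stays fixed while the interval jumps around from experiment to experiment, landing on the truth about 95% of the time.

```python
# Sketch: frequentist coverage. The true mean is fixed; the interval is
# random. Normal data with known sigma = 1, so the 95% interval is
# xbar +/- 1.96 / sqrt(n). All numbers are illustrative.
import math
import random

random.seed(2)

true_mean = 10.0   # the fixed "Platonic archetype"
sigma = 1.0
n = 25
half_width = 1.96 * sigma / math.sqrt(n)

n_experiments = 2000
covered = 0
for _ in range(n_experiments):
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    if xbar - half_width <= true_mean <= xbar + half_width:
        covered += 1

coverage = covered / n_experiments  # should be close to 0.95
```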

A frequentist confidence interval is not an interval which we are 100(1-α)% certain contains the true parameter. I don't even know what this statement (i.e., 95% certain) means -- what is "certainty"?

A 95% Bayesian credible interval (a.k.a. Bayesian confidence interval) is a continuous subset (i.e., an interval) of the posterior probability distribution of a parameter of interest (e.g., a mean). A posterior probability distribution is the probability distribution which results from combining our prior beliefs about the parameter with the likelihood of our data, given each possible value of the parameter. It assigns probability to all possible values of our parameter of interest (given our priors, our data, and our model). The credible interval is merely an interval which contains the most probable subset of those values. That is, we are not sure what the parameter was while the world turned and our data were collected, but the safest bet is inside the credible interval.
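A sketch with a conjugate Beta-binomial model (hypothetical data and a prior of my choosing): with a flat Beta(1, 1) prior and 7 successes in 20 trials, the posterior is Beta(8, 14), and an equal-tailed 95% credible interval simply brackets the central 95% of that posterior's mass.

```python
# Sketch: a Bayesian credible interval from a conjugate Beta-binomial model.
# Hypothetical data: 7 successes, 13 failures; flat Beta(1, 1) prior.
import random

random.seed(3)

successes, failures = 7, 13
a, b = 1 + successes, 1 + failures    # posterior is Beta(8, 14)

# Draw from the posterior and take the central 95% of the draws as an
# (equal-tailed) 95% credible interval.
draws = sorted(random.betavariate(a, b) for _ in range(20_000))
lo = draws[int(0.025 * len(draws))]
hi = draws[int(0.975 * len(draws))]
# Under the posterior, the parameter itself is the random quantity:
# Pr(lo < p < hi | data) is approximately 0.95.
```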

Key differences between frequentist and Bayesian statistics:
  • Parameter of interest (e.g., a mean): fixed (the Platonic archetype exists) vs. random (i.e., subject to the whims of the gods of stochasticity)
  • Prior beliefs: implicit and sometimes hard to discern vs. explicit and plainly stated.
  • Statements of probability: Pr(data|null) vs. Pr(hypothesis|data). That is, frequentist P-values are the probability of observing your data (or more extreme data) given that the null hypothesis is true, whereas Bayesian posterior probability distributions describe the probability of your scientific hypothesis (not the null), given your data.
  • "P-values": exist vs. not typically presented. Frequentist P-values are usually misinterpreted: Fisher used the P-value as a "weight of evidence," whereas Neyman used it as a basis for yes/no decisions, but not necessarily true/false claims -- I do not understand how or why they can do all that. In the Bayesian context, it is not always clear what such a beast would even be, because of the differences in underlying interpretations.

My problem is that I do not have an intuition about how these things all differ or under what conditions they are likely to differ substantially. Therefore, I cannot keep them clearly differentiated in my head. All I can do is repeat them. It would help to explore the pathological cases where frequentist and Bayesian methods result in very different outcomes.
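One classic pathological case (often attributed to Lindley) can be sketched numerically. The setup below -- a point null against a uniform alternative, with n = 10,000 -- is my illustration, not from the post: with a very large sample, a result can be "significant" by the frequentist P-value while the Bayes factor favors the null.

```python
# Sketch of a Lindley-type divergence (assumed setup):
#   H0: p = 0.5  vs.  H1: p ~ Uniform(0, 1)
# Binomial data: n = 10_000 trials, x = 5_100 successes.
import math

n, x = 10_000, 5_100

# Two-sided P-value via the normal approximation to the binomial under H0.
z = (x - n * 0.5) / math.sqrt(n * 0.25)
p_value = math.erfc(z / math.sqrt(2.0))        # about 0.046: "significant"

# Bayes factor for H0 against H1. Under H1 with a uniform prior, the marginal
# likelihood of x is 1 / (n + 1); under H0 it is C(n, x) * 0.5^n (done in
# logs to avoid underflow).
log_pmf_h0 = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
              + n * math.log(0.5))
bf01 = (n + 1) * math.exp(log_pmf_h0)          # about 11: evidence FOR the null

# With equal prior odds, the posterior probability of H0 is bf01 / (1 + bf01),
# roughly 0.9 -- the P-value and the posterior point in opposite directions.
posterior_h0 = bf01 / (1.0 + bf01)
```

The same data yield p < 0.05 and Pr(H0 | data) near 0.9; which summary you trust is exactly the interpretive gap the post is worrying about.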


1 comment:

  1. I think the problem is that a lot of these different metrics (relative likelihood of null vs alternative hypothesis (= relative probabilities if we take a Bayesian approach and set equal prior probabilities on the null and the alternative), p-value, etc.) *in general* tend to be correlated with each other under normal circumstances, even if they are not identical. I think the best way to gain intuition is to look over the various pathological examples (quoted in various papers by Berger, Royall's 1997 book, etc. etc.) where the p-values are really seriously misleading as measures of strength of evidence and try to figure out what the different metrics say in those cases, and whether those cases are atypical in some way. I think the piece of this that I am shakiest about is the use of Fisherian p-values [i.e. as opposed to Neyman-Pearson rejection procedures] as "measures of strength of evidence" -- Royall says this is a really bad idea and gives examples (he would prefer the likelihood ratio of null and alternative hypotheses), but I find them tempting.