Hypothesis
What is your question? You cannot organize your data for exploration or analysis if you do not have a question that you want your data to answer, or a hypothesis that you're interested in testing with your data. Statistics is not data dredging. Set forth a provisional theory explaining some phenomenon, which can then be tested and argued for or against. As a general rule, think about your hypothesis in terms of cause and effect: "I hypothesize that process p causes effect q". If you can, state your hypothesis in precise and quantitative terms.
The 'scientific method' (Popper 1968) lays out a philosophical program for the acceptance of scientific ideas ('hypotheses') based on rejection of null hypotheses. Classical (or 'Frequentist') statistical hypothesis testing follows Popper's exposition of the scientific method. An example illustrates this process.
A scientist wishes to study the effects of acid rain on death rates of pine trees in the Berkshire Mountains of Massachusetts. She first establishes a null hypothesis (indicated as H[0]): Acid rain has no effect on pine tree death rate. The alternative to this null hypothesis (indicated by H[A] or H[1]) is that acid rain has some effect on pine tree death rate. Note that there are (at least) three ways to specify this alternative hypothesis:
H[A-1]: Pine trees exposed to acid rain have a different death rate from those not exposed to acid rain;
H[A-2]: Pine trees exposed to acid rain have a greater death rate relative to those not exposed to acid rain;
H[A-3]: Pine trees exposed to acid rain have a lesser death rate relative to those not exposed to acid rain.
Mathematically (symbolically), suppose that the death rate of pine trees in the absence of acid rain is indicated by m (in units of trees per year), and that we collect a set of observations {X} = {x1, x2, x3, ..., xn} of the death rate of pine trees exposed to acid rain, where the observations are independent and random. Then the null hypothesis H[0] is that the expected value of X, E(X), which we will call m0, is equal to m. In other words:
H[0]: m0 = m.
By analogy,
H[A-1]: m0 <> m (where <> indicates 'not equal to')
H[A-2]: m0 > m
H[A-3]: m0 < m.
In this example, H[0] is called a 'point-null' hypothesis and H[A-1] is a 'two-sided' alternative, while H[A-2] and H[A-3] are called 'one-sided' hypotheses.
Naturally, the scientist wishes to 'prove' that acid rain kills trees. Unfortunately, neither statistics nor the scientific method allows for that proof. Rather, the scientific method requires that you do an experiment that will allow you to reject your null hypothesis. Statistical hypothesis testing allows you to quantify how strongly your data argue against your null hypothesis (the relevant probability is discussed below, in the section on P-values).
Simply because you've rejected your null hypothesis does not allow you to prove your alternative hypothesis (rejecting the null hypothesis that all swans are black by finding one white swan is not the same as proving that all swans are white). Rather, rejection of your null hypothesis may provide some supporting evidence for your alternative hypothesis, but does not provide complete evidence for anything. A scientific theory gains its credibility because no challenging hypothesis has been supported (yet!). The essence of the experimental, hypothetico-deductive method of Popper is that hypotheses can only be falsified, never proven.
To determine how much confidence you have in your ability to reject the null hypothesis, you need to consider P-values.
A P-value is the probability of getting a result as extreme as (or more extreme than) the one you observed, if the null hypothesis were true. If the P-value is "acceptably low", you can conclude the results are "statistically significant" and reject the null hypothesis. Otherwise, you should not reject the null hypothesis, and you conclude that your observations do not provide sufficient evidence for the alternative hypothesis. How low is acceptably low?
Scientists don't like to make mistakes (such as incorrectly rejecting a null hypothesis), but we recognize that nature is very complex and that we don't know everything. Consequently, we would like to express how confident we are in an assertion rejecting a null hypothesis. By convention, we say that a result is "statistically significant" when our testing procedure would incorrectly reject a true null hypothesis no more than 5% of the time. In statistical notation, we say that our results are statistically significant when P, the probability of collecting a dataset X given a null hypothesis H, is less than or equal to a critical value a, which by convention is normally set to 0.05 (although you could set it to any value between 0 and 1):
P(X|H) <= a
This probability is often called the observed significance level of the test because it measures, in a sense, the weight of evidence favoring rejection of H[0]. To compute this P-value, we compare our data X with the data predicted by hypothesis H, using a test statistic, which is a number calculated from the sample measurements and used as a "decision-maker". Think of the test statistic as a point that falls on a line. This line is made up of values that either support the alternative hypothesis (the rejection region) or are consistent with the null hypothesis (the acceptance region). The rejection and acceptance regions are separated by a critical value.
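As a rough illustration of a test statistic and its P-value, the sketch below runs an approximate one-sample z-test of H[0]: m0 = m against each of the three alternatives. The death rates in `exposed`, the value of `m`, and the function name `z_test` are all hypothetical, chosen only for this example, and the normal approximation is an assumption that is reasonable only for moderately large samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_test(sample, m, alternative="two-sided"):
    """Approximate one-sample z-test of H[0]: E(X) = m.

    Returns (z, p), where z is the test statistic and p the P-value.
    Uses the sample standard deviation and a normal approximation.
    """
    n = len(sample)
    z = (mean(sample) - m) / (stdev(sample) / sqrt(n))
    phi = NormalDist().cdf  # standard normal cumulative distribution
    if alternative == "greater":      # H[A-2]: m0 > m
        p = 1.0 - phi(z)
    elif alternative == "less":       # H[A-3]: m0 < m
        p = phi(z)
    elif alternative == "two-sided":  # H[A-1]: m0 <> m
        p = 2.0 * (1.0 - phi(abs(z)))
    else:
        raise ValueError("unknown alternative: " + alternative)
    return z, p


# Hypothetical death rates (trees per year) on plots exposed to acid rain.
exposed = [12.1, 14.3, 11.8, 15.2, 13.9, 12.7,
           14.8, 13.1, 15.5, 12.4, 14.0, 13.6]
m = 12.0  # hypothetical death rate in the absence of acid rain
a = 0.05  # conventional critical value

results = {alt: z_test(exposed, m, alt)
           for alt in ("two-sided", "greater", "less")}
for alt, (z, p) in results.items():
    decision = "reject H[0]" if p <= a else "do not reject H[0]"
    print(f"{alt:>9}: z = {z:.2f}, P = {p:.4f} -> {decision}")
```

The same test statistic z is used in all three cases; only the region of the line counted as "extreme" changes with the alternative hypothesis.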
Please keep in mind that with statistical hypothesis testing, we are testing the probability of collecting a set of data X given a null hypothesis H (written as P(X|H[0])). What we'd really like to know, however, is the probability of our alternative hypothesis given the data that we observe (P(H[A]|X)). These are two different things, as the following example shows:
From elementary logic, recall that we can write: p ---> q (or if hypothesis p is true, then data q would be collected).
This also means that if q (the data) didn't happen, then neither did p (the null hypothesis): ~q ---> ~p (where ~ indicates 'not').
However, this is not the same as ~p ---> ~q (that if the hypothesis is false, the data would not be collected), or equivalently q ---> p (that collecting the data implies the hypothesis is true).
Treating these as the same is called the logical fallacy of affirming the consequent (Howson & Urbach 1991).
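The difference between P(X|H[0]) and P(H[0]|X) can be made concrete with Bayes' rule. The sketch below assumes only two competing hypotheses, H[0] and H[A], and treats P(X|H) loosely as the probability of the observed data under each one; the function name `posterior_null` and every number are hypothetical, chosen only to show how the two quantities can diverge.

```python
def posterior_null(p_x_given_h0, p_x_given_ha, prior_h0):
    """P(H[0] | X) from Bayes' rule, assuming H[0] and H[A] are the
    only two hypotheses, so that P(H[A]) = 1 - P(H[0])."""
    numerator = p_x_given_h0 * prior_h0
    denominator = numerator + p_x_given_ha * (1.0 - prior_h0)
    return numerator / denominator


# Hypothetical inputs: the data are improbable under H[0] (0.04),
# but only somewhat more probable under H[A] (0.08).
p_x_h0 = 0.04
p_x_ha = 0.08

for prior in (0.5, 0.9):
    post_h0 = posterior_null(p_x_h0, p_x_ha, prior)
    print(f"prior P(H[0]) = {prior:.1f}: "
          f"P(H[0]|X) = {post_h0:.3f}, P(H[A]|X) = {1.0 - post_h0:.3f}")
```

With these assumed inputs, P(X|H[0]) is below 0.05 and yet P(H[0]|X) stays well above 0.05, because the answer also depends on how probable the data are under H[A] and on the prior probability of H[0].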
It turns out that even if P(X|H[0]) is small (that is, your P-value is "statistically significant"), the inverse probability P(H[0]|X) could be quite large, and consequently P(H[A]|X) (the probability of the alternative hypothesis, given the data you collected) could be quite small, as low as an order of magnitude smaller than your P-value! (Lindley 1957). There are three ways around this problem:
Paying attention to statistical power
Using maximum likelihood approaches to data analysis
Ignoring P-values and statistical power completely and using Bayesian approaches to data analysis.
The easiest step is to pay attention to statistical power, which requires a deeper understanding of errors you can make when using P-values for hypothesis testing.
Two types of errors: when you draw conclusions from P-values and hypothesis tests, there are two types of mistakes you can make:
              | Do not reject H[0] | Reject H[0]      |
H[0] is true  | Correct decision   | Type I Error     |
H[0] is false | Type II Error      | Correct decision |
1. A Type I error is incorrectly rejecting the null hypothesis. The probability of making a Type I error is the probability of the statistical test: P(X|H). The critical value a is the acceptable upper bound on the probability of making a Type I statistical error.
2. A Type II error is incorrectly accepting (that is, failing to reject) the null hypothesis. The probability of making a Type II statistical error is denoted by b.
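The two error rates can be estimated by simulation. The sketch below is a minimal illustration, assuming normally distributed observations with a known standard deviation and a one-sided z-test; the helper names (`rejects_null`, `rejection_rate`) and all parameter values are hypothetical. It repeats an experiment many times, once with H[0] true and once with H[0] false, and counts how often the test rejects H[0].

```python
import random
from math import sqrt
from statistics import NormalDist


def rejects_null(sample, m, sigma, a=0.05):
    """One-sided z-test with known sigma: reject H[0] (E(X) = m)
    in favor of E(X) > m when z exceeds the critical value."""
    n = len(sample)
    z = (sum(sample) / n - m) / (sigma / sqrt(n))
    return z > NormalDist().inv_cdf(1.0 - a)


def rejection_rate(true_mean, m, sigma, n, trials, rng, a=0.05):
    """Fraction of simulated experiments in which H[0] is rejected."""
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        if rejects_null(sample, m, sigma, a):
            rejections += 1
    return rejections / trials


rng = random.Random(1)  # fixed seed so the run is repeatable
m, sigma, n, trials = 12.0, 2.0, 20, 4000  # hypothetical settings

# H[0] true: every rejection is a Type I error.
type_1 = rejection_rate(m, m, sigma, n, trials, rng)
# H[0] false (true mean is m + 1): every non-rejection is a Type II error.
type_2 = 1.0 - rejection_rate(m + 1.0, m, sigma, n, trials, rng)

print(f"estimated Type I error rate:  {type_1:.3f} (a = 0.05)")
print(f"estimated Type II error rate: {type_2:.3f} (this is b)")
```

The estimated Type I error rate should land close to the chosen a, while the Type II error rate depends on the sample size and on how far the true mean is from m.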
Ideally, we wish to minimize both the probability of making a Type I error and of making a Type II error. In practice, most scientists are fixated on minimizing P(X|H) and ignore b (hence the goal of having a really small P-value). In science, it's considered better to falsely accept H[0] than to incorrectly reject it. Why do you think this is so?
By now, you should have a good sense of hypothesis testing, and the difference between a type I and type II error. Normally, the acceptable probability of committing a type I error (a) is fixed prior to the start of an experiment (traditionally, a = 0.05). This acceptable probability determines the "significance" of your results, and what you hope to report in your scientific paper is that the probability of your data given the null hypothesis is less than the a-level: P(data|H[0]) < a. The other kind of error, the type II error (b), is rarely discussed. More importantly, what you may be most interested in is the probability of rejecting your null hypothesis when it is in fact false, and should be rejected. This quantity, referred to as the power of your statistical test, equals (1-b). Interestingly (hopefully not surprisingly), statistical power does not simply equal 1-a, nor 1 minus the obtained P-value. Rather, it depends in a rather complex way on sample size, effect size, and your pre-determined a-level. In this light, power analysis can be used to address several important questions:
1. What sample size is needed in order to detect a difference (i.e. to see an effect) of a particular size, given predetermined values for a and b? This question should be asked before you set up your experiment.
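For the simplest case, the sample-size question has a closed-form answer. The sketch below assumes a one-sided z-test on normally distributed observations with a known standard deviation, which is a simplification and not necessarily the test you would use in practice; the function names and input values are hypothetical. Under those assumptions, n is approximately ((z(1-a) + z(1-b)) * sigma / effect)^2, where z(q) is the standard normal quantile.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size(effect, sigma, a=0.05, b=0.20):
    """Smallest n giving power of at least 1 - b for a one-sided z-test
    with known sigma, where effect is the true difference from m."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    return ceil(((z(1.0 - a) + z(1.0 - b)) * sigma / effect) ** 2)


def power(effect, sigma, n, a=0.05):
    """Power (1 - b) of the same test for a given sample size n."""
    nd = NormalDist()
    return nd.cdf(effect * sqrt(n) / sigma - nd.inv_cdf(1.0 - a))


# Hypothetical inputs: detect an increase of 1 tree per year when the
# standard deviation is 2 trees per year, with a = 0.05 and b = 0.20.
n = sample_size(effect=1.0, sigma=2.0, a=0.05, b=0.20)
print(f"required sample size: {n}")
print(f"power at that sample size: {power(1.0, 2.0, n):.3f}")
```

With these assumed inputs the required sample size works out to 25. Halving the effect size roughly quadruples the required n, which is why the question is worth asking before the experiment is set up.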
Howson, C. and P. Urbach. 1991. Bayesian reasoning in science. Nature 350: 371-374.
Lindley, D. V. 1957. A statistical paradox. Biometrika 44: 187-192.
Popper, K. R. 1968. The logic of scientific discovery. Harper and Row, New York, USA.