# The Significance of Statistical Significance

## What is “Statistical Significance” and Who Cares Anyway?

As a cancer patient, the phrase “statistical significance” is certainly not one of the first things you are likely to think about. It’s probably not the first thing you *want* to think about either! But if you start getting into the data in depth and read papers and technical articles, it’s going to come up again and again. You will need to understand statistical significance if:

- You read the technical literature and see (as you are bound to):
  - Results characterized as statistically significant (or not).
  - References to “p values” like “p=.01” or “p<.05”.
- The phrase “statistically significant” comes up in conversation with your doctor.

Statistically significant may not mean what you think it does. You might tell your doctor that treatment A had a response rate of 67% and treatment B 77% in some paper you read, and he might say, “I know about that, but the difference is not statistically significant.” Does that mean the higher response rate doesn’t matter? Does it mean that Treatment B is really no better than treatment A? It would be easy to assume either of these, but actually the answer is, “none of the above”! Read on to learn what “statistical significance” is really about.

## Statistical Significance is about the *Reliability* of Conclusions

Statistical significance is about deciding whether differences observed between groups in experiments are “real” or whether they might well just be due to chance. The groups can be groups of patients who were given different treatments as in a randomized trial. They can also be groups of patients with different characteristics, rather than people who were treated differently. For instance, you might want to compare survival in male and female patients with the same cancer or the survival of patients with different stages of some cancer.

## Examples

It may not be obvious exactly how the question of whether observed differences are real or might be merely due to chance actually arises, so here are some examples which should give you some intuition:

Suppose that in a clinical trial patients were randomly assigned to either Treatment A or Treatment B. Suppose that so far, Treatment A has a cure rate of 100% and Treatment B has a cure rate of only 50%. That’s a pretty dramatic sounding difference! But is the difference real, or is the difference just due to chance? Well, suppose that so far two patients have been assigned to each treatment and both were cured with treatment A, but only one of the two was cured with treatment B. Maybe the next patient assigned to treatment A will not be cured and the next patient assigned to treatment B will be. Now, with just two more patients, the cure rate is 2 out of 3, or about 67%, for both treatments. What does all this tell you about the “real” cure rate for these treatments? Can you conclude that they are both the same, 67%? Probably not! With so few patients you should be feeling pretty unsure about what the real cure rates are.

But what if you treated 100 patients with A, and 100 with B, and the cure rate was still 100% for treatment A and still only 50% for B? I bet you would conclude that there was a major and very real difference between A and B and you would certainly opt for A, given the choice!

Now suppose with 100 patients in each group the cure rates were 66% and 60% for A and B. Is this a real difference, or is the difference just due to chance? It’s not so obvious! Here is where statistical significance comes to the rescue.

## Intuitions

These examples bring out some basic intuitions which are behind the idea of statistical significance:

- The bigger the difference between groups the less likely it is that it’s due to chance.
- The larger the sample size (number of patients) the more likely it is that the observed difference is close to the actual difference. This is an example of the “law of large numbers.”

These two principles interact, so that if you observe a large difference between the groups, the groups could be relatively small before you conclude that the difference is likely to be real, and not due to chance alone. Similarly if the observed difference is small you would need more patients before you are convinced that it’s probably not due to chance alone.
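To get a feel for how much chance alone can do, here is a small Python sketch. It simulates trials in which both treatments have exactly the same true cure rate and counts how often the observed cure rates still differ by 20 percentage points or more. The 60% cure rate, the group sizes, and the function names are all made up for illustration.

```python
import random

def observed_differences(true_rate, patients_per_group, trials, seed=0):
    """Simulate trials in which both treatments share the same true cure rate.

    Returns the observed difference in cure rate (A minus B) for each trial.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        cured_a = sum(rng.random() < true_rate for _ in range(patients_per_group))
        cured_b = sum(rng.random() < true_rate for _ in range(patients_per_group))
        diffs.append((cured_a - cured_b) / patients_per_group)
    return diffs

def share_at_least(diffs, threshold):
    """Fraction of simulated trials whose difference is at least `threshold` in size."""
    return sum(abs(d) >= threshold for d in diffs) / len(diffs)

# Hypothetical numbers: a true cure rate of 60% for BOTH treatments.
for n in (2, 20, 200):
    diffs = observed_differences(true_rate=0.6, patients_per_group=n, trials=2000)
    print(n, "patients per group:", share_at_least(diffs, 0.20))
```

With two patients per group, a gap of 20 points or more shows up in well over half of the simulated trials even though the treatments are identical. With 200 per group it almost never does.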

## Statistical testing, “p Values”, and Statistical Significance

Roughly speaking, statistical testing uses mathematical procedures to examine particular differences between groups to see if it is likely that the observed difference could have arisen by chance alone. If it is unlikely enough that the difference would have arisen by chance alone, the difference is “statistically significant.”

More precisely, statistical testing works by assuming that the groups are actually the same – that there is no difference – and then mathematically estimating the probability that you would see a difference between groups at least as big as the one you actually saw – just due to chance. This probability is called ‘p’ and is referred to as a “p-value”. Mathematical probabilities, like p values, range from 0 to 1 where 0 means no chance and 1 means certainty. 0.5 means a 50% chance and 0.05 means a 5% chance.
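The definition can be acted out directly in a few lines of Python. This is only a sketch, not how real statistical tests are computed: it assumes that “no difference” means both groups share the combined cure rate, then simulates many trials and counts how often chance alone gives a gap at least as big as the observed one. The function name and the number of simulated trials are my own choices.

```python
import random

def simulated_p_value(cured_a, n_a, cured_b, n_b, trials=5000, seed=1):
    """Estimate p by simulation: how often does chance alone produce a gap this big?

    Assumption: under "no real difference", both groups share the pooled cure rate.
    """
    rng = random.Random(seed)
    pooled_rate = (cured_a + cured_b) / (n_a + n_b)
    observed_gap = abs(cured_a / n_a - cured_b / n_b)
    at_least_as_big = 0
    for _ in range(trials):
        sim_a = sum(rng.random() < pooled_rate for _ in range(n_a))
        sim_b = sum(rng.random() < pooled_rate for _ in range(n_b))
        # Small tolerance guards against floating-point rounding in the comparison.
        if abs(sim_a / n_a - sim_b / n_b) >= observed_gap - 1e-12:
            at_least_as_big += 1
    return at_least_as_big / trials

# The examples from earlier: 2 of 2 versus 1 of 2, and 66 of 100 versus 60 of 100.
p_tiny_trial = simulated_p_value(2, 2, 1, 2)
p_66_vs_60 = simulated_p_value(66, 100, 60, 100)
print(f"2/2 vs 1/2:       estimated p = {p_tiny_trial:.2f}")
print(f"66/100 vs 60/100: estimated p = {p_66_vs_60:.2f}")
```

Both estimates come out large (somewhere around 0.4 to 0.5), which says that differences like these turn up by chance all the time when the treatments are really the same.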

**When it comes to p values, bigger is not usually better!**

The usual case is that you are hoping a difference between groups is real and *not* due to chance. In this case, *smaller* p-values are better than *larger* ones. p=.05 is better than p=.10. If you are hoping there is no real difference, then *larger* p values are better than *smaller* ones. This might be the case if you were hoping less radical surgery was just as good as more radical or disfiguring surgery. In this case p=.5 is better than p=.05! Often in a randomized trial various characteristics of the two groups are compared in the hopes that they are *not* significantly different. Again in this case bigger is better!

If the p-value is relatively large, so that the chances are relatively high that the difference could have arisen by chance alone, then the results are at least consistent with the idea that there is no real difference between the groups for the characteristic being tested. But if p is very small, then the results are not consistent with the idea that there was no real difference between the groups, or at least it is very unlikely that there is no difference.

By convention if p<.05, the difference is said to be “statistically significant.” Again, this means that if there was no true difference the probability of seeing a difference at least as big as the one you actually saw by chance alone is less than 5%. Roughly speaking, “P<.05” means that the probability is less than 5 percent that the observed difference was due to chance alone.

The choice of .05 (or a 1 in 20 false positive rate) as the usual value to declare statistical significance is arbitrary – scientists do want to have a high confidence in their conclusions but attaining ever smaller p values requires larger samples, more time and expense, and possibly subjecting people to treatments which the data already indicates are very likely to be inferior.

As a cancer patient, you may choose to make different judgements about what p value you want to use to guide your decision making – suppose two treatments have about the same side effects, and one looks like 5 year survival may be better than the other, but p=.06 – not quite statistically significant. Would you decide it didn’t matter which treatment you got because the chance of a false positive is six percent instead of five? I wouldn’t! (Note that p between .10 and .05 is often referred to as a “trend” – a marginal result). You might decide to choose based on even higher values such as p=.2 (20% chance of false positive), but at some point you would conclude the wisp of evidence favoring one treatment isn’t enough to believe it. You need to consider not only the p value but also the totality of circumstances – how different the treatments are in side effects, how big the difference in the results was and so forth. Man does not make rational decisions on p values alone!

While I don’t intend to get into the mathematics of statistical significance (I couldn’t even if I wanted to!) the basic intuition I described earlier can be recast in the terms of statistical testing:

- The larger the sample size, the smaller an observed difference has to be in order to be statistically significant.
- The smaller the sample size, the larger an observed difference would have to be in order to be statistically significant.
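Here is a rough Python sketch of these two points, using the 66% versus 60% example from earlier. It uses a standard textbook shortcut (a two-proportion z-test based on the normal approximation), which is my choice for illustration and not necessarily the test a real trial would use. Equal group sizes are assumed.

```python
import math

def two_proportion_p_value(rate_a, rate_b, patients_per_group):
    """Two-sided p-value from a pooled two-proportion z-test (normal approximation)."""
    pooled = (rate_a + rate_b) / 2  # equal group sizes assumed
    std_error = math.sqrt(2 * pooled * (1 - pooled) / patients_per_group)
    z = abs(rate_a - rate_b) / std_error
    # For a standard normal Z, P(|Z| >= z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))

# Same observed difference (66% vs 60%), different trial sizes.
for n in (100, 1000, 10000):
    print(n, "per group: p =", two_proportion_p_value(0.66, 0.60, n))
```

The very same six point difference gives p of roughly 0.38 with 100 patients per group (nowhere near significant) and roughly 0.005 with 1,000 per group (comfortably significant).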

If you design a clinical trial to detect a possible difference between treatments, you need to have a large enough sample size so that if there is a difference big enough that you care about it, you are actually likely to find that difference to be statistically significant when you are done with the trial. Some intuitions about the number of patients in a clinical trial:

- The smaller the real difference is, the more patients you need to be likely to detect a statistically significant difference in a clinical trial.
- The larger the real difference is, the fewer patients you need to be likely to detect a statistically significant difference in an actual clinical trial.

Perhaps the above suggests to you that even when a trial is designed to detect a difference of a given size, and even when that difference is really there, it is possible that the trial will not result in a statistically significant difference – that is there is a chance the trial will be a false negative. The chance of a false negative can be calculated given the size of the trial and the size of the minimum difference of interest.

### Statistics and Statistical Tests

I have been talking about conducting statistical tests on a “difference” between groups without being very specific about what kind of difference is tested. Take survival. You could test the difference between the average or mean survival, or (more commonly in cancer research) the median survival, or the five or ten year survival, or any number of other statistics. The point is that you test for a difference in a particular statistic like the mean or the median. A difference in median survival may be statistically significant but for the same experiment the difference in survival at five years may not be statistically significant.

## More of the Terminology of Statistical Significance

You will actually see these terms and concepts in papers in the medical literature!

### The Null Hypothesis

The “Null Hypothesis” is the hypothesis that the treatments (or characteristics) being compared are all the same. The scientific approach is to not assume one thing is better than another until there is reliable evidence to the contrary. So the “null hypothesis” is accepted by default. Only if enough evidence accumulates – in the form of a statistically significant difference from an experiment – is the null hypothesis “rejected”. So “rejecting the null hypothesis” just means finding a statistically significant difference.

### Power

When a clinical trial is designed it is important to have a large enough group of patients (sample size) so that if there is a difference between treatments big enough that you care about it, you are likely to get a statistically significant difference from the actual trial. You can never be guaranteed that you will see a statistically significant difference even if there is a real difference, since just by chance you might not happen to get as good results for the better treatment than its real success rate, or you might happen to get results for the less effective treatment that are better than its real success rate. The larger the sample size the lower the chance this will happen.

Given the minimum difference you want to detect and the p value you require to declare the results statistically significant (usually .05), you can calculate, for any sample size, the probability that you will detect a statistically significant difference. This probability is the *power* of the experiment. In practice, what is done is to figure out what sample size is needed to achieve a desired power – usually between 0.8 and 0.9 (an 80 to 90 percent chance of getting a statistically significant difference, if there is a real difference of at least the specified size). A trial which doesn’t have enough patients is often called “underpowered” in the literature. Underpowered trials risk not finding real differences even when they are there.
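As a rough illustration, here is a Python sketch of a power calculation. It uses a common normal-approximation formula for comparing two proportions with equal group sizes. The true cure rates of 66% and 60% are hypothetical, and real trial designers use more careful methods than this.

```python
import math
from statistics import NormalDist

def approximate_power(rate_a, rate_b, patients_per_group, alpha=0.05):
    """Approximate chance that a two-sided two-proportion z-test reaches p < alpha.

    Normal approximation, equal group sizes. The negligible chance of a
    significant result in the wrong direction is ignored.
    """
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    pooled = (rate_a + rate_b) / 2
    se_if_no_difference = math.sqrt(2 * pooled * (1 - pooled) / patients_per_group)
    se_if_real_difference = math.sqrt(
        (rate_a * (1 - rate_a) + rate_b * (1 - rate_b)) / patients_per_group
    )
    gap = abs(rate_a - rate_b)
    return normal.cdf((gap - z_crit * se_if_no_difference) / se_if_real_difference)

# Hypothetical true cure rates of 66% and 60%.
for n in (100, 500, 1000, 1500):
    print(n, "per group: power is about", round(approximate_power(0.66, 0.60, n), 2))
```

Under these assumptions a trial with 100 patients per group has only about a one in seven chance of finding this real difference significant, so it would be badly underpowered. It takes on the order of 1,000 patients per group to reach a power of about 0.8.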

### Type I and II errors

You will sometimes see references to “Type I” or “Type II” errors.

**A Type I Error** is a false positive – deciding that there is a real difference when in fact there is no difference. If the cutoff for statistical significance is p<.05, then when there is truly no difference there is a 5% chance of a type I error. Type I error is also called “alpha error”.

**A Type II Error** is a false negative – that is, there is a real difference but the statistical test fails to show the difference to be statistically significant. The chance of a type II error is one minus the power of the experiment. Type II error is also called “beta error”.
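Both kinds of error can be seen in a simulation. The Python sketch below runs many imaginary trials with 100 patients per group and checks each one with a simple two-proportion z-test (a normal-approximation shortcut chosen for illustration). The cure rates, trial size, and function names are all made up.

```python
import math
import random

def is_significant(cured_a, cured_b, n, alpha=0.05):
    """Pooled two-proportion z-test (normal approximation), equal group sizes."""
    pooled = (cured_a + cured_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False  # everyone or no one was cured in both groups: nothing to test
    std_error = math.sqrt(2 * pooled * (1 - pooled) / n)
    z = abs(cured_a - cured_b) / n / std_error
    return math.erfc(z / math.sqrt(2)) < alpha

def significant_fraction(rate_a, rate_b, n, trials=2000, seed=2):
    """Fraction of simulated trials that come out statistically significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        cured_a = sum(rng.random() < rate_a for _ in range(n))
        cured_b = sum(rng.random() < rate_b for _ in range(n))
        hits += is_significant(cured_a, cured_b, n)
    return hits / trials

# No real difference: every "significant" result is a Type I error.
type_1_rate = significant_fraction(0.60, 0.60, n=100)
# Real difference (66% vs 60%): every non-significant result is a Type II error.
type_2_rate = 1 - significant_fraction(0.66, 0.60, n=100)
print("Type I error rate: ", type_1_rate)
print("Type II error rate:", type_2_rate)
```

The Type I rate should land near 5%, as the p<.05 cutoff promises. The Type II rate for this small trial is very high, because 100 patients per group is far too few to reliably detect a six point difference.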

### Names of Statistical Tests

Different statistical tests are used depending on what kind of data is being tested (for instance something with a discrete outcome like responded versus didn’t respond requires a different test than something with a continuous outcome like survival time) and what statistic is being tested – say the mean or the median. Each of these tests is a mathematical procedure (often complex) which has a unique set of assumptions. I do not intend to try to get into the details of which test is used when, but here are the names of a few of the many statistical tests you may see used in research papers.

- Fisher’s exact test
- Student’s t test
- Chi-Squared test
- Mann-Whitney test
- Log Rank Test

## Pitfalls

There are a lot of gotchas in interpreting statistics of all kinds and statistical testing seems to have a particularly large number. So by the time you finish this next section you might think “statistically significant” is completely insignificant! Nothing could be further from the truth. It’s just that you have to be very careful about the conclusions you draw from statistics.

### Statistically Significant Doesn’t Mean it Matters!

There are several reasons why statistically significant doesn’t necessarily mean significant to you. There is always the question of whether the right thing is being tested. For instance, in cancer treatment, the “response rate” is very roughly the percentage of patients who experience at least a total 50% tumor shrinkage with treatment. The problem is that “responses” can be, and very often are, only temporary. So if Treatment A has a response rate of 40% and treatment B a response rate of 50%, and this difference is statistically significant, does that mean B is “better” than A? It certainly doesn’t if the responses for A seem to be very long lasting and the responses to B seem to be very short lived! At least you should look to see if there appears to be any difference in the quality of the response! Also you have to look at the big picture: What about quality of life? If treatment B is extremely toxic and difficult, while A is easy and non-toxic, then even if there is a real difference in a meaningful outcome in favor of B, it’s a personal judgement whether it’s worth paying the price of the increased side effects for a relatively small increase in benefit.

It’s also important to realize that, “statistically significant difference” does not mean “big difference”. If two treatments are very similar in outcome, but not exactly the same, you can find a statistically significant difference by just testing enough people. In general the more people you include in a trial, the smaller a difference is needed before that difference proves to be statistically significant. So if in some trial, treatment A has a cure rate of 52% and treatment B 54%, then even if they tested enough patients to make this difference “statistically significant” you are not likely to decide that B is really much better, and very likely other characteristics of A and B such as side effects would guide your choice between them.

### Statistically Significant Doesn’t Mean the Difference was Due to Treatment Differences!

Suppose you have a trial in which people decide whether they would prefer to get treatment A or treatment B, and then are given the treatment of their choice. You observe that the patients who choose treatment A have a statistically significant advantage in survival compared to those who chose treatment B. But suppose treatment A and B are really no different in their effect on survival, but A has more difficult side effects. Perhaps the patients who chose B tended to be sicker so they decided they would prefer a treatment with easier side effects, while conversely if A is new perhaps patients who are healthier might tend to decide to go for the “new hope” more often despite the side effects.

Similarly, if a group of patients getting the new treatment, Treatment A, is compared to a group of patients who got Treatment B, the old treatment, in the past (a so-called “historical control”) it is possible that with new technology, patients in the current group tended to be diagnosed earlier in the course of their disease than patients who in the past got the old treatment. So even if A and B have the same effect on the disease it is possible that patients who got the new treatment may appear to survive longer than patients who got the old one.

Finally, not all statistical testing is done about testing some intervention – like a treatment. Often epidemiological studies will look to see if there is an *association* between some characteristic or behavior, and the chance of getting or surviving some disease. It might be seen that people who drink coffee are more likely to get lung cancer than those who do not and the difference in lung cancer rates among coffee drinkers and non coffee drinkers might be found to be statistically significant. But this doesn’t mean coffee drinking causes lung cancer! Instead coffee drinking might be associated with some other characteristic which pre-disposes people to lung cancer, such as smoking. In this case it’s obvious and statisticians would try to “adjust” for smoking, but in general, no one knows all of the causes of any form of cancer, so it could be an association of the tested characteristic with some completely unknown risk factor which is actually causing the difference.

In sum, you need to look carefully at the study design to see if it is possible that something other than treatment differences might explain the difference in results. Noticing that a trial has a historical control rather than a randomized design should reduce your confidence in the results. But you may still choose to make decisions based on uncertain information. After all, that’s the only kind of information there is! Sometimes, the trick is figuring out which information is *least* uncertain and that’s where awareness of things like randomized designs versus historical control designs helps.

### Statistically Significant doesn’t mean the real difference is the observed difference!

“Statistically significant” means that it is unlikely the treatments have the *same* success rate, but it’s quite likely that the real difference between the treatments is not *exactly* the observed difference. So if Treatment A has a 57% success rate and Treatment B, a 46% success rate, and the difference is statistically significant, this does not mean these are the exact true success rates. It only means that it’s unlikely that the advantage of treatment A is *entirely* due to chance. The actual difference could either be larger or smaller. All “statistically significant” tells you is that there is probably some difference that is not due to chance alone.

This does not mean that the difference is equally likely to be any size other than zero. Usually it is likely to be close to the observed difference and this is more true the larger the trial. The larger the sample sizes, the greater the reliability of the estimate of the success rate of each treatment, and the less likely it is that the observed success rates differ much from the real ones. A related statistical concept called “confidence intervals” (which will be the subject of a future article) can give better insight into how big the difference might really be.
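As a small taste of the idea, here is a Python sketch of an approximate 95% confidence interval for the difference between the 57% and 46% success rates above. The trial sizes are hypothetical (none were given for this example), and the simple formula used here (often called a Wald interval) is just one of several ways to do it.

```python
import math

def difference_interval(rate_a, rate_b, patients_per_group, z=1.96):
    """Approximate 95% confidence interval for rate_a - rate_b (Wald interval)."""
    std_error = math.sqrt(
        (rate_a * (1 - rate_a) + rate_b * (1 - rate_b)) / patients_per_group
    )
    gap = rate_a - rate_b
    return gap - z * std_error, gap + z * std_error

# Hypothetical trial sizes for the 57% vs 46% example.
for n in (200, 2000):
    low, high = difference_interval(0.57, 0.46, n)
    print(f"{n} per group: difference between {low:.3f} and {high:.3f}")
```

With 200 patients per group the plausible range for the real difference runs from about 1 to about 21 percentage points. With 2,000 per group it narrows to roughly 8 to 14 points, much closer to the observed 11.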

### Not Statistically Significant Doesn’t Mean There’s No Difference!

“Not statistically significant” doesn’t mean that the observed differences *are* due to chance – only that it would not be surprising if they turned out to be due to chance. It may be that there is a difference which you care about, but not enough people were included in the test for the difference to show as statistically significant. If a trial is way too small, it’s even possible that there’s a big and meaningful difference between the treatments but hardly any difference was actually seen – just due to chance. When there is a negative result you need to consider the “power” of the trial to detect meaningful differences as statistically significant (see above for a definition of “power”).

### You need to decide whether they are testing the right “statistic”

Statistical tests test for differences in specific “statistics,” often the median, or the mean. So what is said to be a statistically significant difference is a difference in the specific statistic tested, such as the median. The median is absolutely not the whole story and real differences that matter need not show up as statistically significant differences in a mean or median. For a wonderful introduction to why this is, please read Stephen Jay Gould’s “The Median Isn’t the Message.”

As a specific example, consider my own case. The treatment which saved me, Interleukin-2, has a low response rate – only 15 or 20 percent – and maybe only five percent of patients get a long term remission like I have gotten. Still, if no other treatment offers any real chance of survival (as was the case when I chose IL-2), then a small but real chance is something worth going for – at least it was to me.

Now the median survival is the amount of time half the patients survive. IL-2 dramatically increases the survival of kidney cancer patients – but only a few of them. This could not affect the median survival much. If all of the patients who got a long term remission happen to be those who were destined to live a little longer than the median anyway, then IL-2 treatment might not improve the median one bit! In fact, a treatment which is toxic, and slightly shortens the survival for many patients, but produces a few cures could actually *decrease* the median survival but still be worth trying for people who are willing to pay a price for a chance at long term survival. In cases like these long term survival rather than median survival is the statistic of interest. Of course it takes years to accumulate reliable data on the proportion of long term survivors, and if this proportion is small, it will take a huge sample size as well. You may have to take your best guess based on the data… even when statistical significance for what matters to you has not been achieved.
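A toy example in Python shows how this can happen. All of the survival times below are invented purely for illustration and are not real IL-2 data.

```python
from statistics import median

# Invented survival times in months for 100 patients without the treatment.
untreated = [0.5 * i for i in range(1, 101)]  # 0.5, 1.0, ..., 50.0

# Same patients with a treatment that cures only five of them, all of whom
# would have outlived the median anyway.
few_cures = untreated[:95] + [120.0] * 5

# A toxic version: the 95 patients who are not cured each lose one month.
toxic_few_cures = [max(m - 1.0, 0.0) for m in untreated[:95]] + [120.0] * 5

def five_year_survival(months):
    """Fraction of patients surviving at least 60 months."""
    return sum(m >= 60 for m in months) / len(months)

groups = [
    ("untreated", untreated),
    ("few cures", few_cures),
    ("toxic, few cures", toxic_few_cures),
]
for label, group in groups:
    print(label, "| median:", median(group), "| 5-year survival:", five_year_survival(group))
```

The median is 25.25 months with or without the five cures, and the toxic version actually lowers it to 24.25 months. Yet five-year survival goes from 0% to 5% in both treated groups, which is the number a patient hoping for long term survival would care about.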

Often in technical papers and presentations, median survival is just referred to as “survival” and if there is no statistically significant difference in median survival, you will hear or read that, “there was no difference in overall survival.” You need to think about exactly what they are saying, and about what the data says about the effect of the treatment on the entire population – not just the effect on an average or median.

### Statistically Significant Doesn’t Mean There’s a Real Difference!

That a difference is statistically significant only means it’s *unlikely* that the observed difference is due to chance. It doesn’t mean it’s impossible. For instance if p=.05 there is a 5%, or 1 in 20 chance, that a difference this large could have occurred by chance alone. A one in 20 chance is just not that rare!

I think that for most cancer patients, knowing that the chances are 95% that an observed difference in favor of some treatment is not due to chance, is good enough that they will not spend undue time worrying that the difference isn’t real.

Depending on the p value, the chances might be a lot better than that anyway. You can look at the p value to tell how unlikely it is that the result occurred by chance alone. So if you have p=.000001, then the chances are only one in a million that the result was due to chance.

### You have to decide what to test *before* you do the experiment!

A common, but dangerous, practice is to do an experiment and notice some interesting pattern in the data only after you are done. Maybe women survived longer, or maybe black people, or maybe left handers, or maybe… well you get the idea. The temptation is to then do a statistical test and voila! The difference is highly statistically significant and you’ve made a great discovery! Not so fast.

The mind is a pattern finding machine extraordinaire! There is nothing else like it in the universe! When you look at data, searching for patterns, you consciously or unconsciously check many different possibilities to see if it looks like this or that property makes a difference in the outcome. Similarly you might check many different outcomes to see if the treatment made a difference in that outcome – if the patients didn’t live any longer maybe they spent less time in the hospital or used fewer pain pills or… well who knows. There can be hundreds or thousands of possibilities.

The problem is that if you check hundreds of things then it is actually likely that you will see a statistically significant difference in at least one of them *just by chance alone*! Remember that you could get a result that is statistically significant at the p<.05 level one in twenty times *just by chance alone*. So if you check enough things this is actually *likely* to happen at least once. If you buy enough lottery tickets you’re probably gonna win – only in this case the prize is the bitter fruit of false conclusions. It is easy to be fooled by this trick of the statistical light.

The way not to be fooled is to specify exactly what you are going to test *before* you start the experiment (this is called defining endpoints prospectively) and to specify a limited number of things. There are statistical methods of correcting for multiple tests if you know in advance exactly what you are testing.
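The one in twenty arithmetic is easy to check in Python. The sketch below assumes the tests are independent of one another, which real data often are not. It also shows one simple and well known correction, the Bonferroni correction, which I am using only as an example of the kind of method meant here.

```python
def chance_of_any_false_positive(number_of_tests, alpha=0.05):
    """Chance that at least one of several independent tests is 'significant' by chance alone."""
    return 1 - (1 - alpha) ** number_of_tests

def bonferroni_threshold(number_of_tests, alpha=0.05):
    """Per-test p-value cutoff that keeps the overall false positive chance at or below alpha."""
    return alpha / number_of_tests

for k in (1, 20, 100):
    print(
        k, "tests | chance of at least one false positive:",
        round(chance_of_any_false_positive(k), 2),
        "| Bonferroni cutoff per test:", bonferroni_threshold(k),
    )
```

With 20 independent tests and no real differences at all, the chance of at least one “significant” result is about 64%, and with 100 tests it is about 99%. The Bonferroni idea is simply to demand a much smaller p value from each individual test when many tests are planned.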

If you see something interesting in the data afterwards and do a statistical test that was not planned in advance then often all you have is an interesting hypothesis which must be confirmed by doing the experiment again, rather than a statistically valid conclusion. If the p value is very small (much less than .05) and if the problem space is not one that lends itself to an indefinite number of possibilities to check, then it is more likely that you really have something. It is also more likely if it makes sense in terms of what is known about the disease – the observation that, say, left handers have worse survival doesn’t make sense in terms of any known biological mechanism and seems unlikely – a trick of the statistical light. In contrast if you were to observe that people with higher levels of testosterone had worse survival in prostate cancer it would be relatively plausible because prostate cancer cells are known to be stimulated by male hormones.

This CancerGuide Page By Steve Dunn. © Steve Dunn

**Page Created:** 2000

**Last Updated:** January 21, 2001