# Why Good Results From Small Trials Often Don’t Pan Out (And what to do about it)

## The Problem

In years of reading the literature I’ve noticed that very promising results, particularly from small clinical trials, often don’t pan out when another trial is done to try to confirm them. Often the treatment turns out to have value, but less than the first small trial suggested. This was certainly the case with the treatment I chose for my advanced kidney cancer, Interleukin-2. IL-2 remains the best proven therapy for metastatic renal cell cancer (as of March 2000), but it initially looked like it might have a much higher response rate than the 15-20% actually seen in large samples. In some cases where there is another standard treatment, the new treatment will eventually turn out to be no better than the standard, despite the early promising results. It could even turn out to be worse!

## Why This Happens

I’ve thought about *why* this might happen so often and I think I know.

Imagine, if you will, one hundred identical *small* trials of the *same* chemotherapy drug. When not every patient responds to a treatment, as is nearly always the case in cancer therapy, each trial will accrue some patients who will be responders, and others who, unfortunately, will not be responders. By random chance, some of these trials will happen to get more responders and thus show a higher response rate than others. If the trials are small enough and there are enough trials, probably a *few* of these identical trials will get a *much* higher response rate than the others. Most of them will get response rates not too far from the true response rate of the treatment – only a few will have vastly higher (or lower) response rates, but the more identical trials you conduct, the greater the chance that at least one trial will get a response rate that is *much* higher than the true rate.
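The thought experiment above can be tried out with a small simulation. This is only a sketch: the 15% true response rate and the 15 patients per trial are assumptions chosen for illustration, and `simulate_trials` is a made-up helper name, not something from any trial software.

```python
import random

def simulate_trials(true_rate, patients_per_trial, n_trials, seed=0):
    """Return the observed response rate of each simulated trial."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rates = []
    for _ in range(n_trials):
        # Each patient responds independently with probability true_rate.
        responders = sum(rng.random() < true_rate for _ in range(patients_per_trial))
        rates.append(responders / patients_per_trial)
    return rates

# Assumed numbers, for illustration only: 100 identical 15-patient trials
# of a treatment whose true response rate is 15%.
rates = simulate_trials(true_rate=0.15, patients_per_trial=15, n_trials=100)

print(f"lowest observed rate:  {min(rates):.0%}")
print(f"highest observed rate: {max(rates):.0%}")
print(f"average across trials: {sum(rates) / len(rates):.0%}")
```

Changing the seed gives a different set of 100 trials. The thing to watch is the gap between the highest observed rate and the average across all the trials.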

If you were scanning the literature for treatments for your cancer and came across 100 papers on these 100 trials, without realizing they were 100 repetitions of the same trial, you might pick the ones with the high response rates as the most promising.

Now imagine that there is a huge cancer research establishment which conducts hundreds of trials for any kind of cancer including tests of many different variations of the standard treatments, as well as many new treatments… well actually you don’t have to imagine because this is exactly what is happening! A great number of these trials are small phase II studies.

Now in the real world they aren’t conducting hundreds of identical studies, but it may well be that there is a great number of variations of the same treatment or a great number of new treatments which work some of the time but which are not actually any better than standard treatments. So there may be lots of small studies of treatments which actually have similar, and not very exciting, low *true* response rates. But by random chance, some of them will get high response rates just like in the example of many identical trials. Mixed in with these there may also be small early trials of new treatments which actually do have a much higher success rate than the standard treatment. Unfortunately, it may be impossible to tell which early small trials really point to a better treatment – and which are just tricks of the statistical light. But there are some clues:

## What to Do About It

If there is a high response rate in a small trial, and one more small trial of the same treatment also gets a high response rate, this is evidence that the true response rate really is relatively high. The chance of randomly getting a *much* higher response rate than the true response rate in any one small trial is small, so getting such results twice in a row is not likely. So you should be on the lookout for trials with promising results which are confirmed by other studies of the same treatment.

If the trials are larger, the chance of getting misleading results in the first place is smaller. A fundamental law of statistics called “The Law of Large Numbers” says that the more patients you have in a trial, the more reliably the response rate in the trial (or other measure of success) will estimate the true response rate of the treatment in question. So a large trial is less likely to give misleadingly promising results, and a study with no independent confirmation is more believable if it is larger.
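One common way to put a number on this is the standard error of a proportion, which shrinks with the square root of the number of patients. The sketch below assumes a 15% true response rate purely for illustration. The standard error is only a rough guide to how far an observed rate tends to stray, especially in very small trials.

```python
from math import sqrt

def standard_error(true_rate, n_patients):
    """Standard error of the observed response rate for a given trial size."""
    return sqrt(true_rate * (1 - true_rate) / n_patients)

TRUE_RATE = 0.15  # assumed for illustration only

for n_patients in (15, 60, 240, 960):
    se = standard_error(TRUE_RATE, n_patients)
    # A smaller standard error means the observed rate hugs the true rate more tightly.
    print(f"{n_patients:>3} patients: standard error of the response rate is about {se:.1%}")
```

Each fourfold increase in trial size cuts the standard error in half.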

If the study of the treatment you are looking at really is the only study of a new and apparently promising treatment, then you have to consider the results in light of your particular situation. If there is a reasonably effective treatment for your situation, it may not be worth risking the chance that the new treatment is actually *worse* than the old treatment. (For the new treatment to actually be worse than the standard when it appears much better, its observed response rate has to deviate from its true response rate even further than if it were merely no better, so this is much less likely, but it is not impossible.) In addition, if the standard therapy yields some number of long term survivors or cures, it takes years to establish this, and the first reports on a new therapy basically never have the long follow-up needed to establish the long term success rate. But if, in contrast, your situation is desperate and no established treatment gives decent odds of a worthwhile outcome, it may make plenty of sense to gamble on the new promising treatment. (What you consider “decent odds” and a “worthwhile outcome” depends on your personal values.)

## A Real World Example

Motivated by a specific example of a promising trial for my own cancer, I figured out a mathematical approach to the small trial problem for one particular situation. The following is my analysis. It will be helpful if you are somewhat comfortable with math, but even if not, you should still be able to get the gist of it.

### Background

At the May 2002 ASCO meeting, Dr. Robert Amato reported a promising small trial of a combination of two drugs, low dose Interleukin-2 and Thalidomide for my cancer, renal cell carcinoma. This phase I trial had 15 patients and 6/15 responded to the treatment. Interleukin-2 is a standard immunotherapy treatment for metastatic renal cell carcinoma and yields perhaps a 15% response rate. Thalidomide is an anti-angiogenic and immune modulating drug which by itself has a very low response rate (5% or less) and if it works at all, temporarily stabilizes the disease in some patients without shrinking the tumors.

The question I asked is how likely it is that the combination is better than low dose IL-2 alone. Since low dose IL-2 has been extensively tested, it’s possible to give a reasonable estimate of the response rate for that drug alone, and for the purpose of the analysis I take it to be 15%.

### Analysis

To start, let’s assume that Thalidomide added nothing to the treatment. Then we can ask what is the probability of getting at least 6/15 responders given the true response rate of 15%? This question can be answered exactly using something called **The Binomial Distribution**. It turns out that the chance of getting 6/15 responses or better given a true response rate of 15% is about 1.7%.

At first glance that looks pretty good – in any one trial of a therapy with a true response rate of 15%, there’s only about a 1.7% chance you’d see results as good as Dr. Amato’s trial. That’s pretty unlikely, which in turn suggests the real response rate for this new treatment is indeed likely to be higher than 15%. Not so fast!

**More on The Binomial Distribution**

Basic Concepts of Probability from The University of Toronto Psychology Department has a great explanation of the binomial distribution.

For your reference, in our case the number of "trials" (trials here means patients, not separate clinical trials) is 15 and the probability of success is 0.15 (15%). We want the probability of 6 or more successes, which requires adding up the probabilities for exactly 6, 7, ..., 15 successes.
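The sum described above can be checked with a few lines of Python using only the standard library. This is a minimal sketch. The function names `prob_exactly` and `prob_at_least` are made up for this example.

```python
from math import comb

def prob_exactly(k, n, p):
    """Binomial probability of exactly k responders among n patients."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_at_least(k_min, n, p):
    """Probability of k_min or more responders: add up the exact cases."""
    return sum(prob_exactly(k, n, p) for k in range(k_min, n + 1))

N_PATIENTS = 15   # patients in the trial
TRUE_RATE = 0.15  # response rate taken for low dose IL-2 alone

# The first few terms carry almost all of the total.
for k in range(6, 10):
    print(f"exactly {k} responders: {prob_exactly(k, N_PATIENTS, TRUE_RATE):.4%}")

print(f"6 or more responders: {prob_at_least(6, N_PATIENTS, TRUE_RATE):.2%}")
```

The total comes to about 1.68%, which rounds to the 1.7% quoted above.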

Here’s where it gets interesting! Let’s suppose that doctors are trying to add all kinds of things to IL-2 alone (as is actually the case) in an effort to improve this treatment’s relatively poor results. Let’s also suppose that none of them actually improve the true response rate (which may or may not be the case). Metastatic renal cell cancer is a desperate disease, so we’ll be excited if any one of these trials looks really good. In fact, let’s say any trial with results at least as good as Dr. Amato’s trial is “exciting”. So now let’s ask what the chance is that we see at least one “exciting” trial if we run two 15-patient trials of just IL-2 alone.

It turns out that the easiest way to figure out the chance of at least one exciting trial is to first calculate the chance that no trials will be exciting which is the chance that the first trial isn’t exciting and then the second isn’t either (life is boring!).

The chance that any one trial is *not* as good as Dr. Amato’s is 100% – 1.7% = 98.3%. So the chance that the second trial isn’t that good is also 98.3%. The chance then that both trials are *not* that good is 98.3% of 98.3% = 96.7%. Thus the chance that at least one of these trials is as good as Dr. Amato’s is 100%-96.7% = 3.3%, which is a little more than the 1.7% chance that any one trial of 15 patients will be as good as Dr. Amato’s trial. This shouldn’t be surprising – buy more lottery tickets – increase your chance of winning!

Generalizing, if you do N different 15 person trials all of which are actually no better than IL-2 alone, the probability of seeing at least one trial as good as Dr. Amato’s is:

$$\left(1 - 0.983^{N}\right) \times 100\%$$

Here’s a table of the chance of seeing an exciting trial just by chance given different numbers of trials.

| N | Chance of at Least One “Exciting” Trial |
|----|----|
| 1 | 1.7% |
| 2 | 3.3% |
| 5 | 8.1% |
| 10 | 16% |
| 20 | 29% |
| 40 | 49% |
| 50 | 57% |

Notice that the probability of seeing at least one trial at least as good as Dr. Amato’s gets very high as the number of trials increases. In fact if the number of trials is more than a few dozen, it’s more likely than not! Given that, in fact, there actually have been many small trials over the years testing various things added to IL-2 alone (very likely at least a few dozen) it isn’t surprising to see a few really good looking trials by chance alone.
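The table can be recomputed with a short script. This sketch uses the unrounded single-trial chance (about 1.68%) rather than the rounded 1.7%, and it keeps the assumption that every trial is an independent 15-patient study with a true response rate of 15%. The function names are made up for this example.

```python
from math import comb

def prob_at_least(k_min, n, p):
    """Binomial probability of k_min or more responders among n patients."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# Chance that a single 15-patient trial with a true 15% response rate
# shows 6 or more responders (an "exciting" trial).
p_exciting = prob_at_least(6, 15, 0.15)

def chance_at_least_one(n_trials, p_single=p_exciting):
    """Chance that at least one of n_trials independent trials is exciting."""
    return 1 - (1 - p_single) ** n_trials

for n_trials in (1, 2, 5, 10, 20, 40, 50):
    print(f"{n_trials:>2} trials: {chance_at_least_one(n_trials):.1%}")

# Smallest number of trials at which an exciting result becomes more likely than not.
first_n = 1
while chance_at_least_one(first_n) <= 0.5:
    first_n += 1
print(f"more likely than not from {first_n} trials onward")
```

Under these assumptions the crossover comes at 41 trials, in line with “more than a few dozen”.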

This doesn’t mean that Dr. Amato’s results actually *are* due to chance. Only that we can’t tell and that the uncertainty is far greater than you might’ve imagined. If low dose IL-2 alone were the best treatment, then since it’s quite unlikely Dr. Amato’s treatment is actually *worse* than low dose IL-2 alone, a patient would be justified in trying this instead. In fact, the data (as of June 2002) tend to show that high dose IL-2 has a better chance of giving a durable response than low dose IL-2 alone. The side effects are also worse. All of which makes a decision between these options complex and difficult.

Finally, although good results from combination therapy, where something was added to a base treatment with a known response rate, can be analyzed this way, I don’t think all small trials with good results are subject to this particular mathematical analysis. I do think something akin to this, even if not quite so amenable to precise analysis, is responsible for the small trial effect.

This CancerGuide Page By Steve Dunn. © Steve Dunn

**Page Created:** 2000

**Last Updated:** July 2, 2002