Wonder-drug wipeouts

Physicians have a standing joke they like to tell whenever a new wonder drug comes on the market.

“Use it quickly, because it will be useless after the first year.”

They are not talking about the natural decay of pharmaceuticals over time. (Everyone knows you should never take meds that have been in your medicine cabinet for over a year. They deteriorate.)

No, the objects of the joke are the newly discovered drugs that promise to alleviate depression, calm agitation or lower blood pressure.

“They seem to work well for the first six months they are around,” says Bob, a family physician. “But a year later, no one is talking about them any more. It’s funny that these drugs seem to lose their effectiveness as time goes on. I don’t know why. Perhaps it’s my imagination.”

It’s not imaginary. The decline effect, as it is popularly known, has been noted not only in medicine, but also in the basic sciences, including biology, chemistry and physics. Papers trumpeting the discovery of striking phenomena are followed by research that reports much more modest effects. But the decline effect is most pronounced in medicine, particularly among mood-altering drugs such as tranquilizers and antidepressants.
Steven Novella, a clinical neurologist, described the effect in a 13 December 2010 blog post. A January article in Nature News by Jonathan Schooler dealt with the same phenomenon.

(An interesting aside: a passing comment by Schooler that the very fact of scientists observing a phenomenon could “change some scientific effects” brought a rebuke from Novella, who described the comment as “dangerously close to quantum woo.” Novella’s discussion appears in a June 2011 blog post.)

Despite that disagreement, Novella and Schooler concur on the more prosaic causes of the decline effect; but neither organizes these causes in a coherent way, so I’ll do it for them. We can slot them into three overlapping categories: problems with preliminary investigative research (cherry picking, part one), study design glitches (polishing the cherries) and publication bias (cherry picking, part two).

Cherry picking, part one

I was once involved in a major observational study designed to tease out the principal risk factors and preventive factors for Alzheimer’s disease. We interviewed more than 10,000 subjects, checking them for signs of the disease and also looking at factors as diverse as diet, exercise, occupation and exposure to chemicals. The data analysis included an assessment of whether each of these factors — and there were more than 70 of them — was associated with Alzheimer’s, other dementias or cardiovascular diseases. We found the usual culprits: older people had higher rates of dementia; overweight people were more at risk of stroke or heart disease. But one factor stood out as preventive. People who drank a lot of tea had remarkably low rates of Alzheimer’s disease.

Now this was a cherry ripe for the picking: an unexpected result, a simple preventive for a nasty disease. Publication could bring substantial fame. But, to its credit, our research team was circumspect about announcing the tea result. Instead of rushing into print, the neurologists, neurophysiologists and neuropsychologists hit the literature to find a physiological rationale for this surprising result. Nothing.

By then, most of us suspected that we were looking at a statistical artifact — we had just happened by chance to select a sample of subjects that included a lot of tea drinkers who also happened to be free of Alzheimer’s disease. The protective effects of tea drinking never reached the general medical community. And sure enough, the effect declined in future studies.

But cherries like this do get published, and that is what brings the decline effect to light. One group of researchers I knew thought they had discovered a genetic marker for Alzheimer’s disease. Surveying a long list of blood proteins, they found one that was particularly prevalent in Alzheimer’s patients. This shotgun approach usually generates spurious results: look in enough corners and chance will come through with a significant result. An added trouble was that the researchers had to delete some data to make the statistics back their contention. They published. There was a bit of excitement. And in confirmatory research, the effect disappeared. What had appeared to be a nice ripe cherry had turned out to be a lemon.
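The shotgun problem is easy to make concrete. The sketch below is a hypothetical illustration, not the researchers’ actual analysis; the marker count is borrowed loosely from the 70-factor study above, and the 5% significance level is the conventional assumption. It shows that screening many candidates that have no real association with the disease almost guarantees at least one spurious “significant” hit:

```python
import random

# Hypothetical illustration (assumed numbers): screen many candidate markers
# that have NO real association with the disease, each tested at the
# conventional 5% significance level.
N_MARKERS = 70      # roughly the number of factors in the study described above
ALPHA = 0.05        # false-positive rate of each individual test
N_RUNS = 10_000     # simulated repetitions of the whole screen

# Analytic answer: chance that at least one null marker looks "significant"
p_at_least_one = 1 - (1 - ALPHA) ** N_MARKERS
print(f"P(at least one false positive among {N_MARKERS} markers): {p_at_least_one:.3f}")

# Simulation agrees: each null marker clears the bar with probability ALPHA
random.seed(42)
hits = sum(
    any(random.random() < ALPHA for _ in range(N_MARKERS))
    for _ in range(N_RUNS)
)
print(f"Simulated share of screens with a spurious hit: {hits / N_RUNS:.3f}")
```

With 70 null markers the chance of at least one false positive is about 97 per cent; a “significant” finding is close to guaranteed before the data are even collected.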

It does not take much imagination to suspect that pharmaceutical companies also indulge in cherry-picking. By chance an early trial shows an outstanding effect for compound X. All research focusses on compound X; negative results are downplayed and positive ones followed up. Compound X goes on to the clinical trial stage.

Polishing the cherries

Even if a particular treatment or preventive measure reaches the clinical trial stage, the design of the trial can artificially enhance a result by using a select group of patients and a select set of outcome measures. Steve Novella described the situation in a recent Skeptics’ Guide to the Universe podcast (13 June 2011).

“There [are]. . . subtleties in how the research is designed. . . . There are lots of choices … as to how to design a study. It’s not always obvious or straight forward. For example, your inclusion and exclusion criteria — what people are you going to study the drug on? . . . . We don’t want people to have too many . . . coexisting conditions or to be on certain kinds of other drugs . . . . But also the outcome measures. We choose the outcome measures that have the best chance of looking positive. [The researchers] may do some preliminary testing where they look at four or five different outcome measures. Then they pick the one that looks really good and they use that in their big trial. So there’s lots of subtle ways to tweak a trial so it looks totally good on paper but the process was all geared towards exaggerating the positive effect of the study. And then when it gets used in the real world on patients with every kind of disorder and other drugs and . . . more real world outcome measures are being used, you can’t expect that in the pristine . . . context of the clinical trial that the effect size is going to be the same.”

As Novella notes, there is nothing necessarily deceitful about this. Clinical trials have to be pristine in order to determine the fundamental efficacy of an intervention; the pragmatic studies that follow determine how effective the intervention is in the real world.
Cherry picking, part two

Here’s a fact that health sciences students learn (or should learn) in their statistics classes: on average, one out of every twenty ineffective treatments tested in clinical trials turns out to be statistically significantly effective. I won’t go into details here; just know that if the treatment under scrutiny is in fact ineffective, there is a five per cent chance that a carefully conducted clinical trial of the treatment will render a false positive.

Now consider the fact that there are more than 100,000 clinical trials underway around the world at any time. Let’s be conservative and say that just 10 per cent of these trials (10,000 of them) are testing treatments that are in fact ineffective. Then five per cent of those 10,000 trials, or 500 of them, are going to show statistically significant effects.

And those 500 are the ones that get published; the results of the other 9,500 never see print. The trouble is, the results of those 9,500 are more valuable than the phoney 500. But journal editors don’t like nonsignificant results.
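The 500-out-of-10,000 figure follows directly from the five per cent false-positive rate, and a quick simulation confirms it. This is a sketch only; the trial counts are the illustrative guesses from the paragraph above, not real registry numbers:

```python
import random

random.seed(1)
N_INEFFECTIVE = 10_000   # conservative guess at trials of truly ineffective treatments
ALPHA = 0.05             # conventional significance level = false-positive rate

# Each ineffective treatment still has a 5% chance of a "significant" result
false_positives = sum(random.random() < ALPHA for _ in range(N_INEFFECTIVE))

print(f"Expected false positives:  {int(N_INEFFECTIVE * ALPHA)}")
print(f"Simulated false positives: {false_positives}")
```

The simulated count lands within sampling error of the expected 500, which is the crop of “effective” treatments that publication bias then harvests.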

This phenomenon is called publication bias, and it leads to the spread of drugs that appear at the outset to be effective, but whose effectiveness vanishes within a year of licensing.

In his Nature article, Jonathan Schooler suggests a solution: register all trials at their outset. Then the nonsignificant results will show up. Schooler says:
“I suggest an open-access repository for all research findings, which would let scientists log their hypotheses and methodologies before an experiment, and their results afterwards, regardless of outcome. Such a database would reveal how published studies fit into the larger set of conducted studies, and would help to answer many questions about the decline effect.”

Notice that Schooler does not restrict his recommendation to medical research.
In fact, medical researchers are already on this, and open registries for clinical trials are already online. Putting “clinical trial registry” into your favourite search engine should bring up several links.

In summary

The causes of the decline effect remind me of the statistical phenomenon called regression to the mean, which says, essentially, that an exceptional event will likely be followed by a more mundane one. So it is with the newest wonder drug: initially exceptional, finally mediocre.
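Regression to the mean is easy to see in a toy simulation. Assume, purely for illustration, that every candidate drug has the same modest true effect and that measured effects differ only by trial-to-trial noise; the “wonder drug” chosen as the best first-trial performer will then look exceptional at first and ordinary on replication:

```python
import random

random.seed(0)

# Toy model (assumed, for illustration): every drug shares the same modest
# true effect; measured effects differ only by trial-to-trial noise.
TRUE_EFFECT = 1.0
NOISE_SD = 1.0
N_DRUGS = 100

# First round of trials: one noisy measurement per drug
first_trials = [TRUE_EFFECT + random.gauss(0, NOISE_SD) for _ in range(N_DRUGS)]
best_effect = max(first_trials)

# Replicate only the "wonder drug" (the best performer) and average many runs
replications = [TRUE_EFFECT + random.gauss(0, NOISE_SD) for _ in range(1000)]
mean_replication = sum(replications) / len(replications)

print(f"Best first-trial effect: {best_effect:.2f}")        # inflated by selection
print(f"Mean replication effect: {mean_replication:.2f}")   # drifts back toward 1.0
```

The selection step, not any change in the drug, creates the apparent decline: picking the maximum of many noisy measurements guarantees an inflated first impression.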

About aharmlessdrudge

Way back during the late Bronze age -- actually it was the 1950s -- all of us in high school had to take a vocational test to determine our interests and, supposedly, our future careers. I cannot remember the outcome, but I do recall one question that gave me pause. "If you were to win a Nobel prize, would it be in literature or in physics?" I hesitated over the question: although I enjoyed mathematics and science more than English class, I did have a couple of unfinished (and very bad) novels hidden away at home. I cannot remember what I chose back then, but the dilemma followed me to university, where I switched from mathematics to English and -- after a five-year stint in journalism -- back to mathematics. I recently retired as a professor of statistics. Retirement. What a good chance to revive my literary ambitions. I have finished a novel -- more about that in good time -- and a rubble of drafts of articles about mathematics and statistics is taking up space on my hard disk.
This entry was posted in Science and Medicine.

3 Responses to Wonder-drug wipeouts

  1. I hear what you are saying, and I believe I can cite another instance of this phenomenon. I have been a dairy farmer for around 30 years. In that time we were zealous to improve the genetic merit of our dairy herd. We subscribed to a system of herd testing and analysis, using computer models to identify traits that improved the herd’s milk production, longevity, reproduction and other desirable characteristics. We used high breeding index bulls to improve the herd. What annoyed us a lot was how these young bulls would come through the system with the highest indexes for conformation, production, longevity and so on. But after committing to breed with these bulls (remembering that getting progeny on the ground and producing milk requires several years’ investment), we would find to our amazement that once a bull had more progeny, which in turn produced more statistics, his indexes often dropped to the point that, had we known where they would end up, we would never have used him in the first place! One particular wonder-bull we used extensively; only after his crop of daughters came into production did we realize that he had virtually ruined our herd for temperament. It is as you said: publication bias, especially when there are companies competing for your breeding buck!

    • Kerry:
      I wish I had heard this sad story twenty years ago when I was a professor of biostatistics at Atlantic Veterinary College in Prince Edward Island, Canada. I know very little about bovine reproduction, so I would have turned it over to one of the theriogenologists for analysis. As a statistician, however, I would say that this looks a lot like regression to the mean: the spectacular bull sires generations of successively mediocre offspring. Has this sort of thing happened to other dairy farmers that you know of?

      Thanks for your comment. I hope your herd has recovered.

      Alan Donald

  2. Thanks for your concern. Yes, we recovered, and our herd was in the top 5% for genetic value in NZ when we eventually sold it. But that bull’s temperament values sure regressed. I think all farmers moan a bit about the difference between what the stats say and what we seem to experience. But the fact is the system works overall; otherwise we wouldn’t enjoy the genetic gain we have! I think the problem is that proving a sire on a small group of daughters exaggerates the variation in his values, until the bull has many more daughters that are tested.
