About fMRI clusterf**ks

Andrew Gelman has a nice blog post about a recent paper making waves in the fMRI community.  He knows much more than me, but I probably have a bit more experience in this small niche than he does.

I wanted to provide a bit more context/background for his recent fMRI post.  Unfortunately, this is another case where a sensational headline misrepresents the actual content of the paper.

Before going any further, it’s useful to keep in mind that fMRI has only been used for about 25 years.  Sure, that sounds like a long time.  In that time, however, there have been great advances in the statistics and statistical methodology.  Things that were published 10 years ago wouldn’t be published today, and things that were published 20 years ago wouldn’t have been published 10 years ago.  I know neuroeconomics best because of my background, and I can say that some of the landmark papers that people love to cite wouldn’t have been published 5 years later because of higher and better statistical standards.

There are (at least) four big issues with statistics in fMRI when saying we want p<0.05 as a threshold, and it’s good to be clear what we’re really talking about here.

1. Naively using p<0.05 and not doing any correction for multiple comparisons.  With a typical fMRI dataset, you can easily be running 100,000 statistical tests because that’s how many voxels you have in the data, and you run the test at each voxel.  Yeah, this will give you a bunch of false positives, and that was the main point of the “dead salmon” study that went around a few years ago.  Key message: You need to do *something* about multiple comparisons.
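To put a rough number on that, here is a quick simulation of my own (an illustrative sketch, not anything from the paper): 100,000 voxels of pure noise, a one-sample t-test at each voxel, and a naive p<0.05 cutoff still “find” about 5,000 active voxels.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_voxels = 100_000   # roughly the number of voxels in a whole-brain analysis
n_subjects = 20      # a typical group study

# Pure noise: no voxel has any true effect
data = rng.standard_normal((n_subjects, n_voxels))

# One-sample t-test against zero at every voxel
t_stats, p_values = stats.ttest_1samp(data, popmean=0.0, axis=0)

print("voxels with p < 0.05:", int(np.sum(p_values < 0.05)))
# Expect roughly 0.05 * 100,000 = 5,000 false positives
```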

2. Double-dipping, which was the “voodoo correlations” issue raised by Vul and others.  It’s been many years since I read the paper.  The basic double-dipping procedure is the following.  First, you run your analyses on all of the data, and you use this to select clusters or regions-of-interest with strong activations for a second analysis.  You then re-run the same analysis on the same data but restrict attention to those clusters or regions-of-interest which you selected in the first step.  You’ll get highly significant results without correcting for multiple comparisons because you selected them specifically because they were highly significant in the first step.  The “bigger” problem is that your significance remains inflated even AFTER correcting for multiple comparisons.  Let’s say there are 1,000 voxels in your cluster.  Now, at the second stage, you only need to correct for 1,000 comparisons, whereas you would have had to correct for 100,000 comparisons if you had only done the first step.  [See notes 1 and 2 below for relevant but off-topic comments.]  I’ll make an analogy to machine learning here.  Double-dipping in fMRI is the same as training and validating your model on the same set of data in machine learning.  Of course you should have a good fit on the training data… it’s what you used to train the model!  Validating it using the same data will give you a good fit even if you were overfitting when training the model.
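Here is a toy sketch of that circularity, again my own illustration rather than anything from Vul’s paper: pick the “most active” voxels using all of the data, then compare re-testing them on the same data with testing them on held-out subjects.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects, n_voxels = 40, 100_000
data = rng.standard_normal((n_subjects, n_voxels))   # pure noise again

# Step 1: select the 1,000 "most active" voxels using ALL of the data
t_all, _ = stats.ttest_1samp(data, 0.0, axis=0)
roi = np.argsort(t_all)[-1000:]

# Step 2a: double-dipping -- re-test those voxels on the SAME data.
# The t-values are big by construction: we picked them for being big.
t_dip, _ = stats.ttest_1samp(data[:, roi], 0.0, axis=0)
print("double-dipped mean t in 'ROI':", round(float(t_dip.mean()), 2))

# Step 2b: the honest version -- select on one half, test on the other half.
t_half, _ = stats.ttest_1samp(data[:20], 0.0, axis=0)
roi_half = np.argsort(t_half)[-1000:]
t_holdout, _ = stats.ttest_1samp(data[20:, roi_half], 0.0, axis=0)
print("held-out mean t in 'ROI':   ", round(float(t_holdout.mean()), 2))
# The held-out t-values hover around zero, as they should for pure noise
```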

3. Spatial correlations in fMRI and multiple comparisons corrections.  This is the main issue addressed by the current paper.  Voxels are not brain regions; they are a way of splitting up 3-D data (which, let’s also be clear… 3-D fMRI data is not generated by an MRI scan… it is created by assembling a set of 2-D images acquired during a single “scan” and accounting – somehow – for the time series nature of the data).  A voxel is also not equivalent to a neuron.  There are tons of neurons in a single voxel.  So, a relevant “brain region” is likely to span *parts* of several contiguous voxels.  You really SHOULD be trying to take this spatial correlation into account when doing multiple-comparison corrections, since it is much more likely that a brain region spans two contiguous voxels than that a brain region spans voxels in opposite hemispheres of the brain.

Eklund, Nichols, and Knutsson are really asking about the assumptions regarding these spatial correlations that are built into commonly used software packages (AFNI, SPM, and FSL).

I should also point out that the spatial correlation also means that another obvious multiple comparisons correction – the Bonferroni correction – will be too strong/conservative.  You’re asking for something that many would argue is unreasonably conservative and will lead to too many false negatives, i.e., failures to detect true effects.
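A quick toy simulation of my own shows the conservativeness (with 1-D smoothing standing in for real 3-D spatial correlation): Bonferroni holds the familywise error rate near 0.05 when voxels are independent, but once the voxels are correlated the realized familywise error rate drops far below 0.05, which is exactly the “too strong” complaint.

```python
import numpy as np
from scipy import stats
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(2)
n_voxels, n_sims, alpha = 10_000, 500, 0.05
z_bonf = stats.norm.isf(alpha / n_voxels)   # one-sided Bonferroni cutoff, ~4.42

hits_indep = hits_smooth = 0
for _ in range(n_sims):
    z = rng.standard_normal(n_voxels)
    # Independent voxels: the familywise error rate sits near the nominal 0.05
    hits_indep += z.max() > z_bonf
    # Smoothed (spatially correlated) voxels, rescaled back to unit variance
    # so that only the correlation structure changes
    zs = gaussian_filter1d(z, sigma=5)
    zs /= zs.std()
    hits_smooth += zs.max() > z_bonf

print("FWE with independent voxels:", hits_indep / n_sims)   # ~0.05
print("FWE with correlated voxels: ", hits_smooth / n_sims)  # well below 0.05
```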

So, we now have the crux of the problem.  Voxelwise inference with multiple comparisons corrections may be too conservative because we ignore or underestimate the real spatial correlations due to a brain region spanning parts of multiple voxels, and we under-reject the null.  If we’re too liberal with our thresholds, then we over-reject the null.

The approach I’ve seen most often is contained in the middle images of Figure 1 in Eklund, Nichols, and Knutsson.  You use something more conservative than an uncorrected p<0.05 or p<0.01 but also apply multiple comparison corrections which somehow address spatial correlation.  In fact, the thresholds they use in those middle panels are the most common ones I see: an uncorrected p<0.001 combined with a cluster-level FWE-corrected p<0.05.
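Mechanically, that two-step procedure looks something like the sketch below.  To be clear, this is my own simplified illustration, not code from any of the packages: the height threshold is the uncorrected p<0.001, and the minimum cluster size would in practice come from the software’s FWE calculation (the very number whose validity Eklund, Nichols, and Knutsson are probing); here it is just a placeholder.

```python
import numpy as np
from scipy import stats
from scipy.ndimage import label

def cluster_threshold(t_map, df, p_voxel=0.001, min_cluster_size=50):
    """Keep only clusters of supra-threshold voxels that are 'large enough'.

    t_map: 3-D array of voxelwise t statistics.
    df: degrees of freedom of the t-test.
    p_voxel: cluster-defining threshold (the uncorrected p < 0.001 step).
    min_cluster_size: critical cluster extent, which in real analyses comes
        from the software's FWE calculation; 50 is just a placeholder.
    """
    t_cut = stats.t.isf(p_voxel, df)           # voxelwise height threshold
    supra = t_map > t_cut                      # voxels passing the height test
    labels, n_clusters = label(supra)          # connected components in 3-D
    keep = np.zeros_like(supra)
    for c in range(1, n_clusters + 1):
        cluster = labels == c
        if cluster.sum() >= min_cluster_size:  # the cluster-extent test
            keep |= cluster
    return keep

# Toy usage on random data: no real signal, so nothing should survive
rng = np.random.default_rng(3)
survivors = cluster_threshold(rng.standard_normal((40, 48, 40)), df=19)
print("surviving voxels:", int(survivors.sum()))
```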

Now, let’s take a look at their Figure 1.  Panels A and B do the same thing using different datasets.  The left plots use a very liberal uncorrected p<0.01 with cluster FWE p<0.05 and have substantial over-rejection of the null.  The middle plots use the common (today, but not historically) uncorrected p<0.001 combined with a cluster-level FWE-corrected p<0.05 and get pretty close to what you’d expect to see, though there is a bit of over-rejection.  The right plots use voxelwise inference with FWE corrected p<0.05 and lead to under-rejection in most cases.

What does this mean?  Well, one way of thinking about using uncorrected p<0.001 with cluster-level FWE p<0.05 is that we’ll over-reject the null in a single study but can be reasonably confident if similar results are obtained in multiple studies.  And I think this is the attitude of (most) people I know in the field… you really shouldn’t trust any result from one single study, but finding similar results in a small neighborhood across multiple studies is reasonable (based on the best available information, which of course may change in the future… the change to interpreting dopamine as reward prediction-error rather than simply reward was a monumental shift in the late 1990s).

All in all, Eklund, Nichols, and Knutsson actually seem to validate current methods to a large extent.

Except… their Figure 2.  That one looks bad.  And it is bad.  But it’s also for a two-sample test, and I rarely – RARELY – see two-sample tests.  Most studies that I have read in neuroeconomics / decision neuroscience are one-sample tests in which multiple conditions are measured within a single subject.  My guess about where two-sample tests are likely to be used to study decision-making is in situations such as (i) lesion studies, where you have a control group without lesions for obvious reasons, and (ii) pharmacological studies in which you either have treatment and control groups with different subjects or you do multiple sessions and have the same subjects in both groups.

Finally, I do remember that I mentioned a fourth big issue with statistics and fMRI when saying we want p<0.05 as a threshold.

4. The *other* multiple comparisons problem.  This is the dirty secret that everybody knows about, even though many don’t seem to consider it an issue.  The multiple comparisons corrections mentioned above apply to a single model and contrast.  Well, there are lots of ways to model your data with fMRI and lots of contrasts you can run.  And people mess around with those models and try tons of contrasts just looking for one that works.  Basically, what people are now calling p-hacking in psychology.  Everybody is doing it.  Nobody reports the failed models and contrasts.  Nobody corrects for these multiple comparisons.
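A back-of-the-envelope calculation makes the scale of the problem clear.  Assume, unrealistically, that each model/contrast attempt is an independent shot at the null:

```python
# If each of k model/contrast attempts were an independent shot at p < 0.05
# under the null, the chance of at least one "working" is 1 - 0.95**k.
# Real attempts are correlated, so this overstates things a bit, but it
# makes the point about uncorrected exploration.
for k in (1, 5, 10, 20, 50):
    print(f"{k:2d} contrasts tried -> P(at least one 'hit') = {1 - 0.95**k:.2f}")
# 1 -> 0.05, 5 -> 0.23, 10 -> 0.40, 20 -> 0.64, 50 -> 0.92
```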

NOTES

1. You still see “small-volume corrections” or “region-of-interest analyses” today, and I think they tend to be more frequent in the fancy journals.  I basically don’t trust any of them.  See the fourth issue I mention at the end.  With rare exceptions, everybody is running a whole-brain analysis first and checking the statistics with multiple-comparisons correction.  My prior belief when seeing a small-volume or ROI analysis today is that they didn’t get a corrected p<0.05 using the whole brain, so they do the small-volume or ROI analysis to reduce the strength of the correction by limiting the search volume.
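The arithmetic behind that suspicion is simple.  Shrinking the search volume shrinks the correction, and a Bonferroni-style calculation shows the scale of the effect (real small-volume corrections typically use something fancier than plain Bonferroni, but the direction is the same):

```python
from scipy import stats

alpha = 0.05
for n_tests in (100_000, 10_000, 1_000, 100):
    z_cut = stats.norm.isf(alpha / n_tests)
    print(f"{n_tests:6d} voxels searched -> Bonferroni-style z cutoff {z_cut:.2f}")
# 100,000 voxels: z > 4.89;  1,000 voxels: z > 3.89;  100 voxels: z > 3.29
```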

2.  We should be careful here to distinguish between good statistical analysis of the data followed by bad statistical analysis presented in the paper and just plain bad statistical analysis.  Both will lead to the same puzzlingly high correlations.  In principle, there may have been people who were not double-dipping and had “valid” results with multiple comparisons correction but still had to report the double-dipped version, especially in figures, because it’s what was common in the field and expected by reviewers and editors.

3 thoughts on “About fMRI clusterf**ks”

  1. Nice post, and appreciate the comment that this work is more about how to interpret a single fMRI study, and less about generalisability of non-null results over different studies.

    > Except… their Figure 2. That one looks bad. And it is bad. But it’s also for a two-sample test, and I rarely – RARELY – see two-sample tests.

    As one of the co-authors, I can assure you that there is nothing special about the use of two-sample tests in Figure 2. I don’t have the results handy, but I would be confident that the one-sample results are *yet* worse. The reason is that two-sample t-tests are much more robust to any non-Gaussianity (a difference of two r.v.s with the same distribution is symmetric, which is a big chunk of Gaussian-ness).

    Rather, the point of Figure 2 is that we quantified the familywise false positive rate of a historically common approach to fMRI inference, a ‘folk’ multiple testing method based on P<0.001 and an arbitrary cluster size criterion. No one has ever claimed that this folk method controlled the familywise error rate, but we simply put some numbers on it.

    -Tom


    • Thanks, Tom, and sorry for the delayed reply. I’m hoping you get an email from WordPress that I replied.

      First and foremost, I want to thank you for reading in detail before responding. One of the things that I (idealistically) hope to accomplish with this blog is to emphasize substance over style… too few academics, in my opinion, actually read and think about the content of papers. They’ll go for things like the catchy line from your paper about possibly invalidating every single fMRI paper. If they actually read the content, they would know that you didn’t make that argument in your paper. I think you’d also agree with me that most – but, unfortunately, not all – of the worst errors happened in the past, when they were semi-reasonable given the best statistical analyses available to applied researchers. Again, not an excuse, but there’s a real difference between “50% of papers between 2005 and 2015 is bullshit” and “50% of papers in every year between 2005 and 2015 is bullshit.”

      Before I say more about your Figure 2, I want to admit that I only came across your paper because a coauthor wanted me to cite it as validating our approach. I insisted on reading it so that I understood more.

      That said, I didn’t look at Figure 2 in much detail, as it wasn’t as relevant for my purposes. My very naive guess was that the bad results are due to two-sample tests with a fixed N, which I would assume would perform worse than one-sample and paired-sample tests with the same N. With such a low number of degrees of freedom, each extra one makes a BIG difference for statistical inference.

      By the way, have you seen this thing that gets used to compute the necessary cluster size given a statistical threshold? I think it’s a tool in AFNI. I can send you a paper using it if you’d like. Much like my attitude towards small-volume or ROI analyses, it seems like it’s primarily used as a post-hoc tool to p-hack results to p<0.05 at the desired threshold.

      Many thanks, Tom!

