Andrew Gelman has a nice blog post about a recent paper making waves in the fMRI community. He knows much more statistics than I do, but I probably have a bit more experience in this small niche than he does.
I wanted to provide a bit more context/background for his recent fMRI post. Unfortunately, this is another time that a sensational headline misrepresents the actual content of the paper.
Before going any further, it’s useful to keep in mind that fMRI has only been used for about 25 years. Sure, that sounds like a long time. In that time, however, there have been great advances in statistics and statistical methodology. Things that were published 10 years ago wouldn’t be published today, and things that were published 20 years ago wouldn’t have been published 10 years ago. I know neuroeconomics best because of my background, and I’d say that some of the landmark papers that people love to cite wouldn’t have been published 5 years later because of higher statistical standards.
There are (at least) four big issues with statistics in fMRI when we say we want p<0.05 as a threshold, and it’s good to be clear what we’re really talking about here.
1. Naively using p<0.05 and not doing any correction for multiple comparisons. With a typical fMRI dataset, you can easily be running 100,000 statistical tests because that’s how many voxels you have in the data, and you run the test at each voxel. Yeah, this will give you a bunch of false positives, and that was the main point of the “dead salmon” study that went around a few years ago. Key message: You need to do *something* about multiple comparisons.
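To make the scale of the problem concrete, here is a minimal simulation (hypothetical numbers, pure noise rather than real fMRI data): run 100,000 voxelwise z-tests on data with no true effect anywhere and count how many clear p<0.05, with and without a Bonferroni correction.

```python
# Hypothetical simulation: 100,000 voxelwise z-tests on pure noise
# (no voxel has a true effect), 20 "subjects" per voxel.
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
n_voxels, n_subjects = 100_000, 20

noise = rng.standard_normal((n_voxels, n_subjects))
z = noise.mean(axis=1) * np.sqrt(n_subjects)  # z-statistic per voxel

crit = NormalDist().inv_cdf(1 - 0.05 / 2)                    # ~1.96, uncorrected
crit_bonf = NormalDist().inv_cdf(1 - 0.05 / (2 * n_voxels))  # ~5.0, Bonferroni

n_uncorrected = int((np.abs(z) > crit).sum())
n_bonferroni = int((np.abs(z) > crit_bonf).sum())
print(n_uncorrected)  # roughly 5,000 false positives
print(n_bonferroni)   # almost always 0
```

You get on the order of 5,000 “activated” voxels from noise alone at an uncorrected p<0.05, and essentially none after Bonferroni. That’s the dead-salmon point in two lines of arithmetic.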
2. Double-dipping, which was the “voodoo correlations” issue raised by Vul and others. It’s been many years since I read the paper. The basic double-dipping procedure is the following. First, you run your analyses on all of the data, and you use this to select clusters or regions-of-interest with strong activations for a second analysis. You then re-run the same analysis on the same data but restrict attention to those clusters or regions-of-interest which you selected in the first step. You’ll get highly significant results without correcting for multiple comparisons because you selected them specifically because they were highly significant in the first step. The “bigger” problem is that you are also overstating your significance AFTER correcting for multiple comparisons. Let’s say there are 1,000 voxels in your cluster. Now, at the second stage, you only need to correct for 1,000 comparisons, whereas you would have had to correct for 100,000 comparisons if you only did the first step. [See notes 1 and 2 below for relevant but off-topic comments.] I’ll make an analogy to machine learning here. Double-dipping in fMRI is the same as training and validating your model on the same set of data in machine learning. Of course, you should have a good fit on the training data… it’s what you used to train the model! Validating it using the same data will give you a good fit even if you overfit when training the model.
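A toy simulation (hypothetical, pure noise again) shows why this is circular: pick the “most active” voxels in a first pass, and they will look strongly active when re-tested on the same data, but not on independent data.

```python
# Hypothetical double-dipping demo on pure noise: no voxel has a true effect.
import numpy as np

rng = np.random.default_rng(1)
n_voxels, n_subjects = 10_000, 20

noise = rng.standard_normal((n_voxels, n_subjects))
z = noise.mean(axis=1) * np.sqrt(n_subjects)

# "ROI": the top 100 voxels from the first pass.
selected = np.argsort(z)[-100:]

# Re-test on the SAME data: the average z looks impressive.
z_same = float(z[selected].mean())

# Independent replication data for the same "ROI": the effect vanishes.
fresh = rng.standard_normal((n_voxels, n_subjects))
z_fresh = float((fresh.mean(axis=1) * np.sqrt(n_subjects))[selected].mean())

print(round(z_same, 2), round(z_fresh, 2))
```

The selected voxels average a z well above 2 on the data that chose them and roughly zero on fresh data, which is exactly the training-set/validation-set point from the machine-learning analogy.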
3. Spatial correlations in fMRI and multiple comparisons corrections. This is the main issue being addressed by the current paper. Voxels are not brain regions; they are a way of splitting up 3-D data (which, let’s also be clear… 3-D fMRI data is not generated by an MRI scan… it is created by assembling a set of 2-D images acquired during a single “scan” and accounting – somehow – for the time series nature of the data). A voxel is also not equivalent to a neuron. There are tons of neurons in a single voxel. So, a relevant “brain region” is likely to span *parts* of several contiguous voxels. You really SHOULD be trying to take this spatial correlation into account when doing multiple-comparison corrections, since it is much more likely that a brain region spans two contiguous voxels than that a brain region spans voxels in opposite hemispheres of the brain.
Eklund, Nichols, and Knutsson are really asking about the assumptions regarding these spatial correlations that are built into commonly used software packages (AFNI, SPM, and FSL).
I should also point out that this spatial correlation means that another obvious multiple comparisons correction – the Bonferroni correction – will be too strong/conservative. You’re asking for something that many would argue is unreasonably conservative and will lead to too many false negatives, i.e., failures to detect true effects.
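Here is a 1-D sketch of the issue (hypothetical, but the same logic applies to smoothed 3-D fMRI data): once noise is spatially smoothed, neighboring voxels are strongly correlated, so there are far fewer than 100,000 effectively independent tests, and a Bonferroni correction over all voxels is stricter than it needs to be.

```python
# Hypothetical 1-D illustration: smoothing noise induces spatial correlation.
import numpy as np

rng = np.random.default_rng(2)
noise = rng.standard_normal(100_000)

# Box-kernel smoothing (width 5), scaled so the output keeps unit variance.
kernel = np.ones(5) / np.sqrt(5)
smooth = np.convolve(noise, kernel, mode="same")

# Neighboring "voxels" now share 4 of 5 kernel taps, so their
# correlation is about 4/5 = 0.8.
r = float(np.corrcoef(smooth[:-1], smooth[1:])[0, 1])
print(round(r, 2))
```

With correlations that high between neighbors, treating every voxel as an independent test (as Bonferroni does) badly overcounts the comparisons being made.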
So, we now have the crux of the problem. Voxelwise inference with multiple comparisons corrections may be too conservative because we ignore or underestimate the real spatial correlations due to a brain region spanning parts of multiple voxels, and we under-reject the null. If we’re too liberal with our thresholds, then we over-reject the null.
The approach I’ve seen most often is contained in the middle images of Figure 1 in Eklund, Nichols, and Knutsson. You use something more conservative than an uncorrected p<0.05 or p<0.01 but also apply multiple comparison corrections which somehow address spatial correlation. In fact, the thresholds they use in those middle panels are the most common ones I see, an uncorrected p<0.001 combined with a cluster-level FWE-corrected p<0.05.
Now, let’s take a look at their Figure 1. Panels A and B do the same thing using different datasets. The left plots use a very liberal uncorrected p<0.01 with cluster FWE p<0.05 and have substantial over-rejection of the null. The middle plots use the common (today, but not historically) uncorrected p<0.001 combined with a cluster-level FWE-corrected p<0.05 and get pretty close to what you’d expect to see, though there is a bit of over-rejection. The right plots use voxelwise inference with FWE-corrected p<0.05 and lead to under-rejection in most cases.
What does this mean? Well, one way of thinking about using uncorrected p<0.001 with cluster-level FWE p<0.05 is that we’ll over-reject the null in a single study but can be reasonably confident if similar results are obtained in multiple studies. And I think this is the attitude of (most) people I know in the field… you really shouldn’t trust any result from one single study, but finding similar results in a small neighborhood across multiple studies is reasonable (based on the best available information, which of course may change in the future… the change in interpreting dopamine as reward prediction-error rather than simply reward was a monumental shift in the late 1990s).
All in all, Eklund, Nichols, and Knutsson actually seem to validate current methods to a large extent.
Except… their Figure 2. That one looks bad. And it is bad. But it’s also for a two-sample test, and I rarely – RARELY – see two-sample tests. Most studies that I have read in neuroeconomics / decision neuroscience are one-sample tests in which multiple conditions are measured within a single subject. My guess about when two-sample tests are likely to be used to study decision-making is situations such as (i) lesion studies, where you have a control group without lesions for obvious reasons, and (ii) pharmacological studies in which you either have treatment and control groups with different subjects or you do multiple sessions and have the same subjects in both groups.
Finally, recall that I mentioned a fourth big issue with statistics and fMRI when we say we want p<0.05 as a threshold.
4. The *other* multiple comparisons problem. This is the dirty secret that everybody knows about, even though many don’t seem to consider it an issue. The multiple comparisons corrections mentioned above apply to a single model and contrast. Well, there are lots of ways to model your data with fMRI and lots of contrasts you can run. And people mess around with those models and try tons of contrasts just looking for one that works. Basically, what people are now calling p-hacking in psychology. Everybody is doing it. Nobody reports the failed models and contrasts. Nobody corrects for these multiple comparisons.
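A back-of-the-envelope simulation (hypothetical numbers) of how bad this can get: if a researcher tries 20 independent models/contrasts on null data and reports whichever clears p<0.05, the chance of at least one “finding” is about 1 - 0.95^20, roughly 64%.

```python
# Hypothetical p-hacking simulation: 20 independent "contrasts" per study
# on pure noise; count studies where at least one clears p < 0.05.
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
n_contrasts, n_subjects, n_studies = 20, 20, 2_000

crit = NormalDist().inv_cdf(1 - 0.05 / 2)  # ~1.96, two-sided

hits = 0
for _ in range(n_studies):
    z = rng.standard_normal((n_contrasts, n_subjects)).mean(axis=1) * np.sqrt(n_subjects)
    if np.any(np.abs(z) > crit):
        hits += 1

rate = hits / n_studies
print(round(rate, 2))  # close to 1 - 0.95**20, about 0.64
```

Real contrasts on the same data are correlated rather than independent, so the exact rate will differ, but the qualitative point stands: unreported model and contrast searches are a multiple comparisons problem that no voxelwise correction touches.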
1. You still see “small-volume corrections” or “region-of-interest analyses” today, and I think they tend to be more frequent in the fancy journals. I basically don’t trust any of them. See the fourth issue I mention at the end. With rare exceptions, everybody is running a whole-brain analysis first and checking the statistics with multiple-comparisons correction. My prior belief when seeing a small-volume or ROI analysis today is that they didn’t get a corrected p<0.05 using the whole brain, so they do the small-volume or ROI analysis to reduce the strength of the correction by limiting the search volume.
2. We should be careful here to distinguish between good statistical analysis of the data followed by bad statistical analysis presented in the paper and just plain bad statistical analysis. Both will lead to the same puzzlingly high correlations. In principle, there may have been people who were not double-dipping and had “valid” results with multiple comparisons correction but still had to report the double-dipped version, especially in figures, because it’s what was common in the field and expected by reviewers and editors.