It all started with a rejected grant proposal. Ahmad Hariri, a neuroscientist at Duke University, was interested in using so-called “task fMRI” — in which subjects perform specially designed cognitive tasks while their brains are scanned — in conjunction with genetic testing and psychological evaluations. The aim was to identify specific biomarkers for differences in how people process thoughts and emotions, markers that could indicate whether a given subject is more or less likely to experience depression, anxiety, or age-related cognitive decline such as dementia later in life.
“The idea was to collect this data once and then collect it again and again and again and be able to track changes in a person’s brain over time to help us understand what happens over the course of their lives,” Hariri told Ars. So he submitted a funding proposal outlining his plans for a longitudinal study along those lines. The proposal hypothesized that a person’s history of trauma would map onto, for example, how their amygdala responded to threat-related stimuli. That, in turn, would allow the researchers to say something about the individual’s future mental well-being.
Hariri and his team designed four core task-related measures to do this: one targeting the amygdala and threat response, one targeting the hippocampus and memory, one targeting the striatum and reward, and the fourth targeting the prefrontal cortex and executive control. He thought he was on solid scientific ground. So he was shocked when the proposal wasn’t even scored by reviewers, owing to skepticism about whether task fMRI was reliable enough to collect that kind of data.
“That was the real kick in the pants I needed to think more seriously about the reliability of task fMRI,” Hariri said. Those concerns led him to conduct a comprehensive review of published studies claiming it is possible to predict a person’s thought or feeling patterns using task fMRI. He looked specifically at what is known as “test-retest reliability”: how well the measurements correlate when a person performs the same cognitive task in two separate scanning sessions. The results, described in a recent Psychological Science article, overwhelmingly showed that task fMRI was not a reliable indicator: the correlation between one scan and a later scan of the same person was fair to poor.
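Test-retest reliability is quantified with correlation-style statistics. As a minimal sketch — not the published paper’s actual pipeline, which uses intraclass correlation coefficients, and with invented numbers — the basic idea can be shown by correlating two sessions’ activation estimates:

```python
# Toy sketch of test-retest reliability as a Pearson correlation
# between two scanning sessions. The subject data below are
# hypothetical, for illustration only.

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical amygdala activation estimates for five subjects,
# measured in two sessions months apart.
session1 = [0.42, 0.51, 0.38, 0.60, 0.47]
session2 = [0.30, 0.55, 0.45, 0.40, 0.52]

r = pearson_r(session1, session2)
print(f"test-retest r = {r:.2f}")
```

A value near 1 would mean a person’s activation pattern is stable across sessions; the review found values in the fair-to-poor range instead.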
The findings caused a bit of a professional crisis for Hariri. “This is more relevant to my work than just about anyone else’s,” he told Duke Today with remarkable candor. “This is my fault. I’m going to throw myself under the bus. This whole sub-branch of fMRI could go extinct if we don’t address this critical limitation.”
To be clear, he’s not saying it’s impossible to reliably measure brain activity with task fMRI. “You just can’t do it the way we did it, with the tasks we used,” he told Ars.
“It’s not that we don’t know about these reliability issues, but this article brings them together more sharply,” Russell Poldrack told Duke Today. Poldrack is a psychologist at Stanford University who was not involved in the review study, although one of his fMRI papers from 15 years ago was included in the analysis. “This is a good wake-up call, and it is a sign of Ahmad’s integrity that he is taking this on,” he said.
A little background
fMRI is one of the most popular brain imaging techniques in use today, in part because it produces stunning color images — striking visualizations of statistical data — showing bright patches of brain activity in response to various tasks. Conventional medical MRI produces a static image of the brain, similar to an X-ray, but functional MRI (fMRI) monitors the increase in blood flow produced by groups of neurons that fire together in response to a particular stimulus. Specifically, it detects slight elevations in blood oxygenation levels, known as the BOLD response.
The imaging process produces a lot of raw data — up to 50,000 data points per scan — so neuroscientists rely on computer algorithms to sort through it all, averaging the scans of many different study participants performing the same tasks (usually one control task and one designed to probe a specific target). The greater the difference in BOLD response between the control task and the targeted task, the stronger the measured effect. Only signals that exceed a certain statistical threshold are considered evidence of a correlation between the targeted task and activity in the affected brain areas.
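As an illustrative sketch — with invented numbers and a deliberately simplified statistic, not any real analysis pipeline — the thresholding step amounts to comparing the target-versus-control contrast against a cutoff, voxel by voxel:

```python
# Toy illustration of contrast thresholding: for each "voxel" we take
# the difference between the target-task and control-task responses,
# convert it to a z-like score, and keep only voxels above a cutoff.
# All numbers are invented for demonstration.

target  = [1.7, 0.9, 2.6, 1.1, 3.0]   # mean BOLD signal, target task
control = [1.0, 0.8, 1.1, 1.0, 1.2]   # mean BOLD signal, control task
noise_sd = 0.4                         # assumed per-voxel noise level
z_cutoff = 2.0                         # statistical threshold

active = []
for i, (t, c) in enumerate(zip(target, control)):
    z = (t - c) / noise_sd             # bigger contrast, bigger score
    if z > z_cutoff:
        active.append(i)

print("voxels passing threshold:", active)
# → voxels passing threshold: [2, 4]
```

Real analyses model the full time series and apply far more careful statistics, but the principle is the same: only contrasts that clear the threshold count as activations.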
There are inevitably false positives (the same area “lights up” in two different scans by random chance), but neuroscientists work hard to factor potential false positives into their statistical analyses. The importance of this was famously illustrated in a 2010 paper that found a measurable BOLD response in an fMRI scan of a dead salmon. Neuroscientist Craig Bennett of the University of California, Santa Barbara, was one of the co-authors and was then a graduate student at Dartmouth. He was responsible for calibrating the MRI machine, which is usually done by scanning a balloon filled with mineral oil. He and his lab partner decided to have some fun and tried scanning a Cornish game hen, a pumpkin, and finally the infamous salmon.
Bennett and his lab partner placed the salmon in the head coil and then performed the calibration test, in which the fish was “presented” with images of human faces and “asked” to determine the emotion shown in each image. Lo and behold, when he analyzed the data, a signal appeared — even though the dead salmon should have shown no brain activity at all. Bennett et al. won the 2012 Ig Nobel Prize in Neuroscience for their illuminating work.
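The salmon result is a textbook multiple-comparisons problem: test tens of thousands of voxels at an ordinary significance level and some will “light up” by chance. A back-of-the-envelope sketch, using the 50,000-data-point figure above and a standard Bonferroni correction for illustration:

```python
# Sketch of why multiple-comparisons correction matters: with tens of
# thousands of voxels tested at once, an uncorrected per-voxel
# threshold virtually guarantees false positives. Numbers are
# illustrative order-of-magnitude figures.

n_voxels = 50_000          # comparisons per scan (from the text above)
alpha = 0.05               # uncorrected per-voxel false-positive rate

# Expected number of chance "activations" with no correction:
expected_fp = n_voxels * alpha
print(f"expected false positives, uncorrected: {expected_fp:.0f}")

# Bonferroni correction: divide alpha by the number of comparisons,
# so the chance of even one false positive stays near 5 percent.
bonferroni_alpha = alpha / n_voxels
print(f"Bonferroni per-voxel threshold: {bonferroni_alpha:.1e}")
```

That is, an uncorrected analysis would be expected to flag a couple of thousand voxels purely by chance — plenty of room for a dead fish to “respond” to photographs.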
The point is not that fMRI is an unreliable technique. Rather, it has proven to be quite robust for studies of groups of participants performing the same task, since it provides a broad, general sample that allows scientists to establish similarities across populations. Things get trickier when it comes to studies that try to measure a BOLD response from just one person — say, to determine whether the subject is lying, their belief in God, or their level of empathy. For example, if you put 100 people in a scanner and try to figure out which of them are lying, the best you can say is that one subgroup probably lies more often than another. You’ve gotten a statistically significant snapshot of the group as a whole, but that’s not the same as definitively establishing that a particular person within that group is lying.
That is why fMRI studies of individuals typically have the subject participate in multiple scanning sessions to compensate for the small sample size (N=1) and reach the required statistical threshold. Even so, it’s much harder to get strong correlations from such data, and it’s easy to convince yourself that you’re seeing patterns and correlations that aren’t really there.