LONDON: The advantages of making scientific data available for further analysis are clear, but doing so could also enable the trawling of data to find significant, or preferred, results. Dorothy Bishop argues that we need a system to keep all those re-analysing data honest.
For years, researchers squirrelled their data away after completing a study. When I started out in research in the 1970s, there were few options for sharing data: there was no email or internet. I have dim memories of analysing data from the 1958 National Child Development Study. The files arrived on enormous disks that I had to take to the local computer centre to read.
Now, though, we have ways of not just storing, but electronically sharing data. Archiving is not trivial: it requires proper documentation of data, and anonymisation when human participants are involved. But the advantages are clear to see: data in an archive can be re-used by other scientists, increasing its potential value. Data can also be future-proofed, avoiding the scenario where key results exist only on a kind of floppy disk that can no longer be read.
But as we move to wider data-sharing, new questions arise. In particular, who should have access to the data? The simplest answer is everyone: the scientist could just put their data out there, and anyone and everyone could view it. In many areas, this is unproblematic, but some scientists have reservations about completely free access, even if they agree in principle with open data.
In some cases, there are concerns that data may be misused by people with conflicted interests or a specific ideological agenda. A few weeks ago, there was uproar when it was found that Robert De Niro planned to screen a film, Vaxxed, at the Tribeca Film Festival. The film highlights an analysis of data on autism and vaccination from a large database held by the US Centers for Disease Control and Prevention (CDC). The analysis claimed to find a greatly increased rate of autism in children who had been vaccinated, provided they were African-American boys vaccinated in a specific time window. It was argued that there was a conspiracy to cover up this shocking statistic, even though the analysis was clearly flawed, the results were discrepant with the rest of the literature, and the paper was subsequently retracted.
It could be argued that overall, this was a win for the self-correcting process of science, because the errors in the analysis were quickly discovered, and when Robert De Niro was made aware of the concerns about the misinformation in the film, he withdrew it from the festival. But there’s no doubt that damage was done. Once conspiracy theories get established, they can be difficult to dislodge. From the point of view of anti-vaxxers, the withdrawal of the film just provides further evidence that there is a conspiracy to silence those who speak the truth.
Would the situation have been different if there had been restrictions on access to the data? Probably not. The problem is not so much who has the data, as what they do with it. A particular danger comes from unrestricted data-trawling of the kind that was evident in the CDC analysis. Although these dangers are especially serious when those doing the analysis are determined to find a particular result, they are not negligible when reputable and relatively open-minded scientists do secondary analyses.
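To make the danger of trawling concrete, here is a minimal sketch in Python. It is not the CDC analysis or any real dataset. It rests on an assumption that I should state plainly: every comparison is independent and there is truly no effect, so each p-value is just a random number between 0 and 1. The function names are hypothetical, chosen for illustration only.

```python
import random


def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests
    when no real effect exists (illustrative assumption)."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_trawl(num_tests: int, alpha: float = 0.05,
                   num_datasets: int = 10_000, seed: int = 0) -> float:
    """Fraction of simulated null datasets in which a trawl through
    `num_tests` comparisons turns up at least one 'significant' result."""
    rng = random.Random(seed)
    datasets_with_hit = 0
    for _ in range(num_datasets):
        # Assumption: under the null hypothesis, a p-value from a
        # continuous test statistic is uniformly distributed on [0, 1].
        if any(rng.random() < alpha for _ in range(num_tests)):
            datasets_with_hit += 1
    return datasets_with_hit / num_datasets


if __name__ == "__main__":
    for k in (1, 5, 20, 100):
        print(k, round(familywise_error_rate(k), 3), simulate_trawl(k))
```

Under these assumptions, an analyst who tries 20 subgroup comparisons has roughly a 64% chance of finding at least one "significant" result at the conventional 0.05 level, even though nothing is there. Real secondary analyses involve correlated comparisons, so the exact figure will differ, but the direction of the problem is the same.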
Large datasets allow for analytic flexibility, and it is all too tempting to trawl a dataset for “significant” associations. Exploratory analysis is important for scientific progress, but inferential statistics lose their






