Research Article: Semi-automated fact-checking of nucleotide sequence reagents in biomedical research publications: The Seek & Blastn tool

Date Published: March 1, 2019

Publisher: Public Library of Science

Author(s): Cyril Labbé, Natalie Grima, Thierry Gautier, Bertrand Favier, Jennifer A. Byrne, Suzannah Rutherford.


Nucleotide sequence reagents are verifiable experimental reagents in biomedical publications, because their sequence identities can be independently verified and compared with associated text descriptors. We have previously reported that incorrectly identified nucleotide sequence reagents are characteristic of highly similar human gene knockdown studies, some of which have been retracted from the literature on account of possible research fraud. Because of the throughput limitations of manual verification of nucleotide sequences, we developed a semi-automated fact-checking tool, Seek & Blastn, to verify the targeting or non-targeting status of published nucleotide sequence reagents. From previously described and unknown corpora of 48 and 155 publications, respectively, Seek & Blastn correctly extracted 304/342 (88.9%) and 1066/1522 (70.0%) nucleotide sequences, each with a predicted targeting/non-targeting status. Seek & Blastn correctly predicted the targeting/non-targeting status of 293/304 (96.4%) and 988/1066 (92.7%) of the correctly extracted nucleotide sequences. A total of 38/39 (97.4%) and 31/79 (39.2%) Seek & Blastn predictions of incorrect nucleotide sequence reagent use were correct in the two literature corpora, respectively. Combined Seek & Blastn and manual analyses identified a list of 91 misidentified nucleotide sequence reagents, which could be built upon through future studies. In summary, incorrect nucleotide sequence reagents represent an under-recognized source of error within the biomedical literature, and fact-checking tools such as Seek & Blastn may help to identify papers and manuscripts affected by these errors.

As biomedical science increases in both volume and complexity, the problem of irreproducible and incorrect published results is also growing [1, 2]. Up to 50% of published pre-clinical research results have been estimated to be incorrect, leading to the possible waste of billions of dollars of research funds per year [3, 4]. As the post-publication correction of errors remains highly problematic [1, 2], there is an urgent need to reduce and deter the publication of incorrect research findings.

We report the derivation and testing of the novel open-access Seek & Blastn (S&B) tool, which permits the semi-automated fact-checking of nucleotide sequence reagents, a class of experimental reagent that has been employed in hundreds of thousands of biomedical research publications. If incorrect nucleotide sequence reagents go undetected, the associated results could misdirect future research, and the incorrect reagents could continue to be used in future studies. The S&B tool therefore directly addresses the larger problem that material reagents and standards represent the major source of incorrect published results from pre-clinical research [3, 4].

S&B involves text extraction, text cleaning, sequence extraction, targeting/non-targeting (T/NT) status identification, blastn results analysis, and gene name extraction [50, 68, 69].
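The text cleaning, sequence extraction, and T/NT status identification steps can be illustrated with a minimal Python sketch. It is not the S&B implementation: the regular expression, the 15-nucleotide minimum length, the keyword list, and the function names `clean_text` and `extract_sequences` are all assumptions chosen for illustration. The blastn analysis and gene name extraction steps are omitted.

```python
import re

# Assumption: a reagent is an uninterrupted uppercase run of at least 15 bases.
SEQ_RE = re.compile(r"[ACGTU]{15,}")

# Assumption: these words in the same sentence mark a non-targeting control.
NT_KEYWORDS = ("scramble", "non-targeting", "negative control", "non-silencing")


def clean_text(text):
    """Normalise prime marks, dashes, and whitespace left by PDF extraction."""
    text = text.replace("\u2032", "'").replace("\u2019", "'")
    text = re.sub(r"[\u2010-\u2015]", "-", text)
    return re.sub(r"\s+", " ", text).strip()


def extract_sequences(text):
    """Return (sequence, claimed_status) pairs, one per sequence found.

    The claimed status comes only from the wording of the sentence that
    contains the sequence; checking that claim would need a blastn search,
    which this sketch does not perform.
    """
    results = []
    # Naive sentence split on full stops and semicolons.
    for sentence in re.split(r"(?<=[.;])\s+", clean_text(text)):
        lowered = sentence.lower()
        is_control = any(keyword in lowered for keyword in NT_KEYWORDS)
        status = "non-targeting" if is_control else "targeting"
        for match in SEQ_RE.finditer(sentence):
            # Convert RNA to DNA letters so the sequence can be used as a query.
            results.append((match.group(0).replace("U", "T"), status))
    return results
```

For a sentence describing a targeting siRNA followed by one describing a scrambled control, this sketch would return one `targeting` pair and one `non-targeting` pair. A real pipeline would then compare each claimed status against blastn hits for the extracted sequence.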




