Research Article: Human Contamination in Public Genome Assemblies

Date Published: September 9, 2016

Publisher: Public Library of Science

Author(s): Kirill Kryukov, Tadashi Imanishi, Deyou Zheng.


Contamination in a genome assembly can lead to wrong or confusing results when that genome is used as a reference in sequence comparison. Although bacterial contamination is well known, the problem of human-originated contamination has received little attention. In this study we surveyed 45,735 available genome assemblies for evidence of human contamination. We used lineage specificity to distinguish between contamination and conservation. We found that 154 genome assemblies contain fragments that, with high confidence, originate from human DNA contamination. The majority of the contaminating human sequences have been present in the reference human genome assembly for over a decade. We recommend that existing contaminated genomes be revised to remove contaminated sequence, and that new assemblies be thoroughly checked for the presence of human DNA before they are submitted to public databases.

Partial Text

Databases of reference genome sequences are an important resource in a vast number of biological and medical studies. For example, metagenomics depends on a good reference set of genome sequences. Contamination present in a reference genome sequence could lead to incorrect or confusing results [1]. The problem of contamination has been known for over two decades [2]. Bacteria are the most common contaminant [3]. Humans are another important source of contamination, since humans are present at all stages of sample handling and lab procedures. Ancient DNA is particularly affected by human contamination [4]. However, outside the field of ancient DNA this problem receives little attention.

We were able to detect 3,416 likely human-originated (LHO) sequences within public genome sequences. Each LHO sequence is at least 100 bp long and has strong similarity to human sequence (≥95% identity at the nucleotide level). In addition, each LHO has homology within non-human primates that is much stronger than its homology to any sequence outside primates (excluding the source genome of that particular LHO sequence).
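
The three LHO criteria above can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The 100 bp and 95% thresholds come from the text, but the record fields, the function name, and especially the `MIN_SCORE_RATIO` cutoff are assumptions made for illustration: the text says only that primate homology must be "much stronger" than non-primate homology and does not give a number.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate fragment from a non-human assembly (hypothetical record layout)."""
    seq_id: str
    length_bp: int
    human_identity_pct: float      # nucleotide identity to the best human match
    best_primate_score: float      # best alignment score against non-human primates
    best_non_primate_score: float  # best score outside primates, source genome excluded


MIN_LENGTH_BP = 100            # from the text: at least 100 bp
MIN_HUMAN_IDENTITY_PCT = 95.0  # from the text: >= 95% nucleotide identity
MIN_SCORE_RATIO = 2.0          # ASSUMPTION: stand-in for "much stronger"; not from the source


def is_likely_human_originated(c: Candidate, min_ratio: float = MIN_SCORE_RATIO) -> bool:
    """Return True if the candidate passes all three LHO-style criteria."""
    if c.length_bp < MIN_LENGTH_BP:
        return False
    if c.human_identity_pct < MIN_HUMAN_IDENTITY_PCT:
        return False
    if c.best_primate_score <= 0:
        # No primate homology at all, so primate specificity cannot be shown.
        return False
    # Primate homology must exceed non-primate homology by the chosen margin.
    return c.best_primate_score >= min_ratio * c.best_non_primate_score


candidates = [
    Candidate("frag_a", 250, 99.1, 400.0, 60.0),   # long, human-like, primate-specific
    Candidate("frag_b", 80, 99.5, 150.0, 0.0),     # too short
    Candidate("frag_c", 300, 97.0, 350.0, 340.0),  # conserved beyond primates
]
lho_ids = [c.seq_id for c in candidates if is_likely_human_originated(c)]
```

In this toy input only `frag_a` passes: `frag_b` fails the length threshold, and `frag_c` matches non-primates almost as well as primates, which points to conservation rather than contamination.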

In this study we detected sequences that are highly similar between human and distantly related organisms, including non-vertebrates. In theory such similarity can result from several scenarios: (1) genuine conservation, (2) recent horizontal gene transfer, or (3) contamination in the genome sequence. Our use of a primate specificity score allows us to separate genuine conservation from the remaining cases. Although we cannot completely rule out horizontal gene transfer, such events are considered to be extremely rare in eukaryotes. Contamination, on the other hand, is a known issue in sequencing experiments. Thus we conclude that most, if not all, of the LHOs that we found are indeed contamination from human DNA.