Date Published: August 28, 2017
Publisher: Public Library of Science
Author(s): Pierre Balmer, Anina Bauer, Shashikant Pujar, Kelly M. McGarvey, Monika Welle, Arnaud Galichet, Eliane J. Müller, Kim D. Pruitt, Tosso Leeb, Vidhya Jagannathan, Claire Wade.
Keratins represent a large protein family with essential structural and functional roles in epithelial cells of skin, hair follicles, and other organs. During evolution, the genes encoding keratins have undergone multiple rounds of duplication, and humans have two clusters with a total of 55 functional keratin genes in their genomes. Due to the high similarity between different keratin paralogs and species-specific differences in gene content, the currently available keratin gene annotation in species with draft genome assemblies such as dog and horse is still imperfect. We compared the National Center for Biotechnology Information (NCBI) (dog annotation release 103, horse annotation release 101) and Ensembl (release 87) gene predictions for the canine and equine keratin gene clusters to RNA-seq data that were generated from adult skin of five dogs and two horses and from adult hair follicle tissue of one dog. Taking into consideration the conserved exon/intron structure of keratin genes, we annotated 61 putatively functional keratin genes in each of the dog and horse genomes. Subsequently, curators in the RefSeq group at NCBI reviewed their annotation of keratin genes in the dog and horse genomes (Annotation Release 104 and Annotation Release 102, respectively) and updated the annotation and gene nomenclature of several keratin genes. The updates are now available in the NCBI Gene database (https://www.ncbi.nlm.nih.gov/gene).
Keratins are intermediate filament proteins of the epithelial cytoskeleton. They are expressed in a cell-, tissue- and differentiation-dependent manner in stratified (e.g. epidermis and cornea) and simple (liver, pancreas and intestine) epithelia as well as in skin appendages such as hairs and nails [1,2]. As structural proteins, keratins provide mechanical stability to maintain epithelial integrity and barrier function. This is best exemplified in the epidermis. Keratins represent a major protein fraction of the keratinocytes, the main cell type in the epidermis. In the different layers of the epidermis, keratinocytes differentially express specific keratins, which correlate with their differentiation stage and contribute to the integrity of the epidermis through interaction with both cell-matrix and intercellular adhesion complexes [4,5]. Keratins are also involved in the regulation of cellular processes such as embryonic development as well as cell motility, proliferation and death by modulating signal molecule activity [6–11]. Furthermore, keratins were recently found to be major regulators of cellular stiffness, a key parameter in cancer development. Thus, the correct expression of specific keratin genes is essential for normal skin homeostasis.
We present a curated catalog of both canine and equine keratin genes, which is based on evolutionarily conserved features of keratin genes and experimental support from RNA-seq data derived from adult skin. While the initial comparison of our curated annotation of dog and horse keratin genes to the NCBI annotation revealed several differences, the updated NCBI annotation led to the prediction of several new models that were identical to our annotation (S1 Table). Additionally, a manual review of the NCBI annotation by RefSeq curators resulted in several updates to gene annotation and gene nomenclature. Differences persist in the annotation of five dog and nine horse keratin genes. Many of the errors in the currently existing keratin gene predictions were due to gaps and errors in the genome reference assemblies. Although automated gene predictions in general reach a very high quality and are essential for scientists to make use of the wealth of publicly available genomic sequence information, this study also shows some of the limitations of automated gene predictions on imperfect genome assemblies.
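As an illustration only, the kind of model-by-model comparison described above can be sketched in a few lines of Python. This is not the procedure used in the study: the function name `compare_models`, the gene labels and all exon coordinates below are hypothetical placeholders, and each gene model is reduced to a plain list of exon start/end positions on a single strand.

```python
from typing import Dict, List, Tuple

# A gene model is simplified here to a list of (start, end) exon coordinates.
Exons = List[Tuple[int, int]]


def compare_models(curated: Dict[str, Exons],
                   predicted: Dict[str, Exons]) -> Dict[str, str]:
    """Classify each curated gene model against an automated prediction.

    Returns one of three labels per curated gene:
      "identical" - same set of exon coordinates in both annotations
      "different" - gene present in both, but exon coordinates differ
      "missing"   - gene absent from the predicted annotation
    """
    status: Dict[str, str] = {}
    for gene, exons in curated.items():
        if gene not in predicted:
            status[gene] = "missing"
        elif sorted(exons) == sorted(predicted[gene]):
            status[gene] = "identical"
        else:
            status[gene] = "different"
    return status


# Hypothetical example data (not real keratin gene coordinates).
curated = {
    "GENE_A": [(100, 200), (300, 400)],
    "GENE_B": [(1000, 1100), (1200, 1300)],
    "GENE_C": [(5000, 5100)],
}
predicted = {
    "GENE_A": [(100, 200), (300, 400)],
    "GENE_B": [(1000, 1100), (1250, 1300)],  # second exon starts later
}

status = compare_models(curated, predicted)
for gene in sorted(status):
    print(gene, status[gene])
```

In practice such a comparison would also have to account for strand, alternative transcripts and assembly gaps, which this sketch deliberately ignores.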
The study of keratin genes in dog and horse is difficult because of the high duplication rate among the genes, which can lead to misassemblies in the reference genomes. The manual annotation of keratin gene clusters enabled not only identification of new pseudogenes in the keratin clusters of both dog and horse, but also the identification of missing or functional canine and equine homologs of human pseudogenes. We provide a curated annotation and representative transcript sequences for canine and equine keratin genes. Correct transcript sequences are essential for many modern high-throughput technologies that rely on the bioinformatic processing of large datasets. This analysis also shows that manual intervention is still critical for the annotation of complex gene families and the production of a good reference gene set for these gene families. This work represents a collaboration between a research group and a public annotation database, which resulted in improved annotation of a gene family in the dog and horse genomes. Such collaborations provide an opportunity for researchers and annotation groups to work together to represent specific genes or gene families that may not be accurately annotated solely by means of an automated pipeline. The updated NCBI RefSeq entries will be incorporated as supporting evidence during future annotation releases by Ensembl. Many instances of low-confidence annotation are in regions where the reference assembly is of poor quality, underlining the importance of a high-quality genome assembly as a prerequisite for accurate annotation of genes.