Date Published: May 21, 2019
Publisher: Public Library of Science
Author(s): Boas Pucker, Daniela Holtgräwe, Kai Bernd Stadermann, Katharina Frey, Bruno Huettel, Richard Reinhardt, Bernd Weisshaar, Frank Alexander Feltus.
In addition to the BAC-based reference sequence of the accession Columbia-0 from the year 2000, several short read assemblies of THE plant model organism Arabidopsis thaliana have been published in recent years. A SMRT-based assembly of Landsberg erecta has also been generated, which identified translocation and inversion polymorphisms between two genotypes of the species. Here we provide a chromosome-arm level assembly of the A. thaliana accession Niederzenz-1 (AthNd-1_v2c) based on SMRT sequencing data. The best assembly comprises 69 nucleome sequences and displays a contig length of up to 16 Mbp. Compared to an earlier Illumina short read-based NGS assembly (AthNd-1_v1), a 75-fold increase in contiguity was observed for AthNd-1_v2c. To assign contig locations independently of the Col-0 gold standard reference sequence, we used genetic anchoring to generate a de novo assembly. In addition, we assembled the chondrome and plastome sequences. Detailed analyses of AthNd-1_v2c allowed reliable identification of large genomic rearrangements between A. thaliana accessions, which contribute to differences in the gene sets that distinguish the genotypes. One of the detected differences revealed a gene that is lacking from the Col-0 gold standard sequence. This de novo assembly extends the known proportion of the A. thaliana pan-genome.
Arabidopsis thaliana became the most important model for plant biology within decades due to properties valuable for basic research, such as a short generation time, a small footprint, and a small genome. Shortcomings of the BAC-by-BAC assembled, 120 Mbp long Col-0 gold standard sequence are some missing sequences and gaps in almost inaccessible regions like repeats in the centromeres [3, 4], at the telomeres and throughout nucleolus organizing regions (NORs), as well as a few mis-assemblies [5, 6]. Information about genomic differences between A. thaliana accessions was mostly derived from short read data [7–9]. Only selected accessions were sequenced deeply enough and with sufficient read length to reach almost reference-size assemblies [7, 10–15]. While the identification of SNPs can be based on short read mappings, the identification of structural variants had an upper limit of 40 bp for most of the investigated accessions. Larger insertions and deletions, which often result in presence/absence variations of entire genes, are often missed in short read data sets but are easily recovered by long read sequencing [14–16]. De novo assemblies based on long sequencing reads are therefore currently favored to resolve structural variants without an upper size limit and to facilitate A. thaliana pan-genomics. Even a fully complete Col-0 genome sequence would not reveal the entire diversity of this species, as this accession is assumed to have a relatively small genome compared to other A. thaliana accessions.
We report a high-quality long read de novo assembly (AthNd-1_v2c) of the A. thaliana accession Nd-1, which significantly improves on the previously released NGS assembly sequence AthNd-1_v1.0. Comparison of the GeneSet_Nd-1_v2.0 with the Araport11 nuclear protein coding genes revealed 24,453 reciprocal best hits (RBHs), supporting an overall synteny between both A. thaliana accessions except for an approximately 1 Mbp inversion at the north of chromosome 4. Moreover, large structural variants were identified in the pericentromeric regions. Comparisons with the Col-0 gold standard sequence also revealed a collapsed locus around At4g22214 in Col-0. Therefore, this work contributes to the growing A. thaliana pan-genome with significantly extended details about genomic rearrangements.
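The RBH comparison between two gene sets can be illustrated with a minimal sketch. It assumes that similarity hits are already available in both directions as (query, subject, score) tuples, for example parsed from a tabular alignment report. The function names, the toy gene identifiers, and the scores below are hypothetical and chosen for illustration only; this is not the pipeline used for the comparison reported here.

```python
from typing import Dict, Iterable, List, Tuple

# One similarity hit: (query gene ID, subject gene ID, score).
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, Tuple[str, float]]:
    """Keep the highest-scoring subject for every query gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        # On ties the hit seen first is kept (an arbitrary choice for this sketch).
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Return gene pairs that are each other's best hit in both directions."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    pairs = []
    for gene_a, (gene_b, _score) in best_ab.items():
        back = best_ba.get(gene_b)
        if back is not None and back[0] == gene_a:
            pairs.append((gene_a, gene_b))
    return sorted(pairs)


# Toy data: hypothetical Nd-1 gene IDs against Col-0 style locus IDs.
nd1_vs_col0 = [
    ("Nd1_g1", "AT1G01010", 500.0),
    ("Nd1_g1", "AT1G01020", 120.0),
    ("Nd1_g2", "AT1G01020", 430.0),
    ("Nd1_g3", "AT1G01020", 410.0),
]
col0_vs_nd1 = [
    ("AT1G01010", "Nd1_g1", 498.0),
    ("AT1G01020", "Nd1_g2", 431.0),
    ("AT1G01020", "Nd1_g3", 409.0),
]

pairs = reciprocal_best_hits(nd1_vs_col0, col0_vs_nd1)
print(pairs)
```

In this toy example, Nd1_g3 has a best hit in the other gene set but is not that gene's best hit in return, so it is not reported as an RBH. Genes without a reciprocal partner are candidates for closer inspection when looking for presence/absence variation.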