
DENTIST - using long reads for closing assembly gaps at high accuracy

Background: Long sequencing reads allow increasing contiguity and completeness of fragmented, short-read–based genome assemblies by closing assembly gaps, ideally at high accuracy. While several gap-closing methods have been developed, these methods often close an assembly gap with sequence that does not accurately represent the true sequence.

Findings: Here, we present DENTIST, a sensitive, highly accurate, and automated pipeline method to close gaps in short-read assemblies with long error-prone reads. DENTIST comprehensively determines repetitive assembly regions to identify reliable and unambiguous alignments of long reads to the correct loci, integrates a consensus sequence computation step to obtain a high base accuracy for the inserted sequence, and validates the accuracy of closed gaps. Unlike previous benchmarks, we generated test assemblies that have gaps at the exact positions where real short-read assemblies have gaps. Generating such realistic benchmarks for Drosophila (134 Mb genome), Arabidopsis (119 Mb), hummingbird (1 Gb), and human (3 Gb) and using simulated or real PacBio continuous long reads, we show that DENTIST consistently achieves a substantially higher accuracy compared to previous methods, while having a similar sensitivity.

Conclusion: DENTIST provides an accurate approach to improve the contiguity and completeness of fragmented assemblies with long reads. DENTIST's source code including a Snakemake workflow, conda package, and Docker container is available at https://github.com/a-ludi/dentist. All test assemblies as a resource for future benchmarking are at https://bds.mpi-cbg.de/hillerlab/DENTIST/.
Metadata
Author: Arne Ludwig, Martin Pippel, Gene Myers, Michael Hiller
URN: urn:nbn:de:hebis:30:3-633082
DOI: https://doi.org/10.1093/gigascience/giab100
ISSN: 2047-217X
Parent Title (English): GigaScience
Publisher: Oxford University Press
Place of publication: Oxford
Document Type: Article
Language: English
Date of Publication (online): 2022/01/25
Date of first Publication: 2022/01/25
Publishing Institution: Universitätsbibliothek Johann Christian Senckenberg
Release Date: 2023/02/21
Tag: assembly gaps; benchmarking; genome assembly; long sequencing reads
Volume: 11
Page Number: 12
First Page: 1
Last Page: 12
Note:
This work was supported by the Max Planck Society, the Federal Ministry of Education and Research (grant 01IS18026C), and the LOEWE-Centre for Translational Biodiversity Genomics (TBG) funded by the Hessen State Ministry of Higher Education, Research and the Arts (HMWK).
Note:
All data underlying this article, including the reference and test assemblies with introduced gaps and their true sequence as valuable data for future method comparisons, are available via our institutional server [31]. Supporting data and an archival copy of the code are also available via the GigaScience repository GigaDB [33].
HeBIS-PPN: 507913833
Institutes: Biological Sciences
Affiliated and cooperating institutions / Senckenbergische Naturforschende Gesellschaft
Dewey Decimal Classification: 5 Natural sciences and mathematics / 57 Life sciences; biology / 570 Life sciences; biology
6 Technology, medicine, applied sciences / 61 Medicine and health / 610 Medicine and health
Collections: University publications
Licence (German): Creative Commons - CC BY - Attribution 4.0 International