
False gene and chromosome losses in genome assemblies caused by GC content variation and repeats.

Publication, Journal Article
Kim, J; Lee, C; Ko, BJ; Yoo, DA; Won, S; Phillippy, AM; Fedrigo, O; Zhang, G; Howe, K; Wood, J; Durbin, R; Formenti, G; Brown, S; Cantin, L ...
Published in: Genome Biol
September 27, 2022

BACKGROUND: Many short-read genome assemblies have been found to be incomplete and contain mis-assemblies. The Vertebrate Genomes Project has been producing new reference genome assemblies with an emphasis on being as complete and error-free as possible, which requires utilizing long reads, long-range scaffolding data, new assembly algorithms, and manual curation. A more thorough evaluation of the recent references relative to prior assemblies can provide a detailed overview of the types and magnitude of improvements.

RESULTS: Here we evaluate new vertebrate genome references relative to the previous assemblies for the same species and, in two cases, the same individuals, including a mammal (platypus), two birds (zebra finch, Anna's hummingbird), and a fish (climbing perch). We find that up to 11% of genomic sequence is entirely missing in the previous assemblies. In the Vertebrate Genomes Project zebra finch assembly, we identify eight new GC- and repeat-rich micro-chromosomes with high gene density. The impact of missing sequences is biased towards GC-rich 5'-proximal promoters and 5' exon regions of protein-coding genes and long non-coding RNAs. Between 26 and 60% of genes include structural or sequence errors that could lead to misunderstanding of their function when using the previous genome assemblies.

CONCLUSIONS: Our findings reveal novel regulatory landscapes and protein coding sequences that have been greatly underestimated in previous assemblies and are now present in the Vertebrate Genomes Project reference genomes.


Published In

Genome Biol

DOI

10.1186/s13059-022-02765-0

EISSN

1474-760X

Publication Date

September 27, 2022

Volume

23

Issue

1

Start / End Page

204

Location

England

Related Subject Headings

  • Vertebrates
  • Sequence Analysis, DNA
  • Genome
  • Chromosomes
  • Bioinformatics
  • Base Composition
  • Animals
  • 08 Information and Computing Sciences
  • 06 Biological Sciences
  • 05 Environmental Sciences
 

Citation

APA

Kim, J., Lee, C., Ko, B. J., Yoo, D. A., Won, S., Phillippy, A. M., … Jarvis, E. D. (2022). False gene and chromosome losses in genome assemblies caused by GC content variation and repeats. Genome Biol, 23(1), 204. https://doi.org/10.1186/s13059-022-02765-0

Chicago

Kim, Juwan, Chul Lee, Byung June Ko, Dong Ahn Yoo, Sohyoung Won, Adam M. Phillippy, Olivier Fedrigo, et al. “False gene and chromosome losses in genome assemblies caused by GC content variation and repeats.” Genome Biol 23, no. 1 (September 27, 2022): 204. https://doi.org/10.1186/s13059-022-02765-0.

ICMJE

Kim J, Lee C, Ko BJ, Yoo DA, Won S, Phillippy AM, et al. False gene and chromosome losses in genome assemblies caused by GC content variation and repeats. Genome Biol. 2022 Sep 27;23(1):204.

MLA

Kim, Juwan, et al. “False gene and chromosome losses in genome assemblies caused by GC content variation and repeats.” Genome Biol, vol. 23, no. 1, Sept. 2022, p. 204. Pubmed, doi:10.1186/s13059-022-02765-0.

NLM

Kim J, Lee C, Ko BJ, Yoo DA, Won S, Phillippy AM, Fedrigo O, Zhang G, Howe K, Wood J, Durbin R, Formenti G, Brown S, Cantin L, Mello CV, Cho S, Rhie A, Kim H, Jarvis ED. False gene and chromosome losses in genome assemblies caused by GC content variation and repeats. Genome Biol. 2022 Sep 27;23(1):204.
