Biological Sciences Faculty Publications

Estimating Error Models for Whole Genome Sequencing Using Mixtures of Dirichlet-Multinomial Distributions

Steven Wu
Rachel Schwartz, University of Rhode Island
David Winter
Donald Conrad
Reed Cartwright

Document Type

Article

Date of Original Version

8-2017

Department

Biological Sciences

Abstract

Motivation: Accurate identification of genotypes is an essential part of the analysis of genomic data, including identifying sequence polymorphisms, linking mutations with disease and determining mutation rates. Biological and technical processes that adversely affect genotyping include copy-number variation, paralogous sequences, library preparation, sequencing error and reference-mapping biases, among others.

Results: We modeled the read depth for all data as a mixture of Dirichlet-multinomial distributions, resulting in significant improvements over previously used models. In most cases the best model was composed of two distributions. The major-component distribution is similar to a binomial distribution with low error and low reference bias. The minor-component distribution is overdispersed with higher error and reference bias. We also found that sites fitting the minor component are enriched for copy number variants and low-complexity regions, which can produce erroneous genotype calls. By removing sites that do not fit the major component, we can improve the accuracy of genotype calls.

Availability and Implementation: Methods and data files are available at https://github.com/CartwrightLab/WuEtAl2017/ (doi:10.5281/zenodo.256858).

Contact: cartwright@asu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
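To make the idea of a two-component mixture concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the textbook concentration-parameter form of the Dirichlet-multinomial and three read categories per site (reference, alternate, error). The function names, mixture weights and parameter values are hypothetical and chosen only for illustration; the published model may be parameterized differently.

```python
import math


def dm_log_pmf(counts, alpha):
    """Log-probability of read counts under a Dirichlet-multinomial
    with concentration parameters alpha (standard textbook form)."""
    n = sum(counts)
    a0 = sum(alpha)
    # Multinomial coefficient n! / prod(c_k!)
    log_p = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    # Ratio of gamma functions for the total concentration
    log_p += math.lgamma(a0) - math.lgamma(n + a0)
    # Per-category gamma ratios
    log_p += sum(math.lgamma(c + a) - math.lgamma(a)
                 for c, a in zip(counts, alpha))
    return log_p


def responsibilities(counts, weights, alphas):
    """Posterior probability that a site belongs to each mixture component."""
    logs = [math.log(w) + dm_log_pmf(counts, a)
            for w, a in zip(weights, alphas)]
    m = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical parameters for a heterozygous site (ref, alt, error):
# the major component is nearly multinomial with low error and low bias,
# the minor component is overdispersed with reference bias and more error.
WEIGHTS = (0.9, 0.1)
ALPHAS = ((500.0, 490.0, 10.0), (6.0, 3.0, 1.0))


def fits_major(counts, threshold=0.5):
    """Keep a site only if the major component explains it best."""
    return responsibilities(counts, WEIGHTS, ALPHAS)[0] > threshold


if __name__ == "__main__":
    for site in [(20, 19, 1), (35, 3, 2)]:
        print(site, fits_major(site))
```

Under these assumed parameters, a balanced site such as (20, 19, 1) is assigned to the major component, while a strongly reference-skewed site such as (35, 3, 2) is assigned to the minor component and would be filtered out.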

Citation/Publisher Attribution

Steven H. Wu, Rachel S. Schwartz, David J. Winter, Donald F. Conrad, Reed A. Cartwright; "Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions," Bioinformatics, Volume 33, Issue 15, 1 August 2017, Pages 2322–2329, https://doi.org/10.1093/bioinformatics/btx133

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License

DOI

https://doi.org/10.1093/bioinformatics/btx133
