Date of Award

2020

Degree Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science and Statistics

First Advisor

Ying Zhang

Abstract

The field of metagenomics studies microbes from environmental samples in a process that generates millions of short DNA sequences called reads. The ultimate goal is to uncover the diversity and function of complex microbial communities using computational tools. Most taxonomic classifiers, which aim to assign a taxonomic group to each read, compare the target DNA sequences against a reference database and report the taxon of the best match. This approach has several disadvantages, including the large computing resources needed to process large sequence databases. This project explores the application of deep learning architectures to taxonomic classification. For this purpose, datasets of 150 bp reads were generated from the genomes of 6 bacterial species thought to be relevant to the marine microbiome, and three different deep neural networks were implemented, resulting in a collection of 8 toy models trained with the same parameters. An evaluation of their performance showed that the best configuration in terms of training time and test accuracy was a convolutional neural network. In parallel, a new method was developed to expand the training data via simulated evolution, which is useful when dealing with imbalanced data. This method also has the potential to improve the identification of reads from unknown genomes that are closely related to the species in the training dataset.
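
The data preparation summarized in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not say how reads were sampled or how evolution was simulated, so this example assumes uniform random sampling of 150 bp windows, uniform random point substitutions as the simulated-evolution step, and one-hot encoding as the network input. The function names `sample_reads`, `mutate`, and `one_hot`, and the `rate` parameter, are hypothetical.

```python
import random

BASES = "ACGT"


def sample_reads(genome, read_len=150, n_reads=10, rng=None):
    """Draw fixed-length reads from uniformly random start positions (assumed scheme)."""
    if rng is None:
        rng = random.Random()
    if len(genome) < read_len:
        raise ValueError("genome is shorter than the requested read length")
    last_start = len(genome) - read_len
    starts = [rng.randrange(last_start + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


def mutate(read, rate=0.01, rng=None):
    """Apply independent point substitutions at the given per-base rate.

    Assumption: a stand-in for 'simulated evolution'; the thesis may use
    a different mutation model.
    """
    if rng is None:
        rng = random.Random()
    mutated = []
    for base in read:
        if rng.random() < rate:
            # Substitute with one of the other three bases.
            mutated.append(rng.choice([b for b in BASES if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)


def one_hot(read):
    """Encode a read as a list of 4-element rows; unknown bases become all zeros."""
    index = {base: i for i, base in enumerate(BASES)}
    return [[1 if index.get(base) == i else 0 for i in range(4)] for base in read]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy random genome, used only to exercise the functions.
    genome = "".join(rng.choice(BASES) for _ in range(1000))
    reads = sample_reads(genome, read_len=150, n_reads=5, rng=rng)
    augmented = [mutate(r, rate=0.02, rng=rng) for r in reads]
    print(len(reads), len(augmented), len(one_hot(augmented[0])))
```

Under these assumptions, a species with few training reads could be oversampled by adding mutated copies of its reads, and the one-hot matrices (150 rows by 4 columns) would be the input to a classifier such as a convolutional neural network.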

Available for download on Saturday, December 17, 2022
