Master of Science in Computer Science
Computer Science and Statistics
Metagenomics studies microbes taken directly from environmental samples, a process that generates millions of short DNA sequences called reads. The ultimate goal is to characterize the diversity and function of complex microbial communities using computational tools. Most taxonomic classifiers, which aim to assign a taxonomic group to each individual read, compare the target DNA sequences to a reference database and report the taxon of the best match. This approach has several disadvantages, including the substantial computing resources needed to process large sequence databases. The current project explores the application of deep learning architectures to taxonomic classification. For this purpose, datasets of 150 bp reads were generated from the genomes of six bacterial species thought to be relevant to the marine microbiome, and three different deep neural networks were implemented, resulting in a collection of eight toy models trained with the same parameters. After evaluating their performance, it was concluded that the best configuration in terms of training time and test accuracy was a convolutional neural network. In parallel, a new method that expands the training data via simulated evolution was developed for dealing with imbalanced data. This method also has the potential to improve the identification of reads from unknown genomes that are closely related to the species in the training dataset.
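The abstract does not spell out how reads were generated or how the simulated evolution works, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes that reads are drawn uniformly at random from a genome string and that "simulated evolution" can be approximated by independent point substitutions at a fixed per-base rate. The function names `sample_reads`, `mutate`, and `augment`, as well as the `rate` and `copies` parameters, are hypothetical.

```python
import random

BASES = "ACGT"


def sample_reads(genome, read_length=150, n_reads=10, rng=None):
    """Draw fixed-length reads uniformly at random from a genome string."""
    rng = rng or random.Random()
    if len(genome) < read_length:
        raise ValueError("genome is shorter than the read length")
    last_start = len(genome) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_length] for s in starts]


def mutate(read, rate=0.01, rng=None):
    """Substitute each A/C/G/T base independently with probability `rate`.

    Assumption: substitutions only (no insertions or deletions), so the
    read length is preserved.
    """
    rng = rng or random.Random()
    out = []
    for base in read:
        if base in BASES and rng.random() < rate:
            # Replace with one of the three other bases.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def augment(reads, copies=3, rate=0.01, rng=None):
    """Return the original reads followed by `copies` mutated variants of each."""
    rng = rng or random.Random()
    mutated = [mutate(r, rate, rng) for r in reads for _ in range(copies)]
    return list(reads) + mutated
```

Under these assumptions, an under-represented species could be given a larger `copies` value than a well-represented one, so that every class contributes a similar number of training reads.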
Cres, Cecile, "TAXONOMIC PROFILING OF METAGENOMIC READS USING DEEP LEARNING" (2020). Open Access Master's Theses. Paper 1918.