Date of Award
2025
Degree Type
Dissertation
Degree Name
Doctor of Philosophy in Biological and Environmental Sciences
Department
Biological Sciences
First Advisor
Ying Zhang
Abstract
Metagenomics has driven major advances in environmental microbiology. By sequencing all of the DNA in a sample collected from an environment of interest, metagenomics makes it possible to study the composition and function of complex microbial communities. The volume and complexity of the nucleotide sequences in metagenomics data require accurate and efficient bioinformatics software for downstream analysis. In this context, deep learning is a natural choice, given the ability of artificial neural networks to find patterns in large datasets. This study examines the application of deep learning algorithms to the taxonomic classification of metagenomics data.
In Manuscript I, a convolutional neural network based on a modified version of a popular image-processing model was developed for multiclass classification of short metagenomics reads. The network was trained on a large dataset of simulated reads from over 3,000 bacterial species, then tested on a dataset of simulated reads containing no examples seen during training. This study showed that the convolutional neural network outperforms traditional classification methods at the species level.
In Chapter 2, further improvement of taxonomic classification at the species level was pursued by comparing two neural networks: the convolutional neural network presented in Manuscript I and a large language model called BERT (Bidirectional Encoder Representations from Transformers). For this study, 709 binary classifiers were generated with the convolutional neural network and the BERT model. A binary classifier format was selected to facilitate training, which is otherwise cumbersome with a multiclass classifier. Each classifier was trained on a dataset designed for classifying short DNA sequences originating from one of 709 genomes, each associated with a different bacterial species. This study found that BERT outperforms the convolutional neural network in classifying short DNA sequences sampled from genomes different from the training genomes. Furthermore, BERT was pretrained on a large set of DNA sequences representing over 3,000 bacterial species, and the resulting model was fine-tuned into a binary classifier trained to identify sequences from the marine bacterial species Prochlorococcus_B marinus_B. The masked language modeling task implemented during pretraining allows BERT to learn the subtleties of the DNA language across a large number of bacterial species. Learning representations of DNA-related tokens facilitates downstream fine-tuning tasks such as binary classification. Comparison of the two aforementioned models and the fine-tuned BERT model showed that the latter improves the classification of DNA sequences and sequencing reads originating from cultures as well as metagenomes.
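As a rough illustration of the masked language modeling idea mentioned above, the sketch below splits a DNA sequence into tokens and hides a fraction of them so that a model could be trained to predict the hidden tokens from their context. The abstract does not describe the tokenizer or masking settings used in the dissertation, so the overlapping k-mer tokenization, the k-mer size of 6, the 15% masking rate, and the function names `kmer_tokens` and `mask_tokens` are all assumptions made for illustration. The sketch also simplifies the usual BERT procedure by always substituting a `[MASK]` token.

```python
import random


def kmer_tokens(seq, k=6):
    """Split a DNA sequence into overlapping k-mers with a stride of 1.

    The tokenization scheme and k=6 are assumptions, not the
    dissertation's documented settings.
    """
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask token.

    Returns the masked token list and a parallel list of labels that
    holds the original token at masked positions and None elsewhere.
    The 15% default rate is an assumption for illustration.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked[i] = mask_token
            labels[i] = tok  # the model would be trained to recover this token
    return masked, labels


if __name__ == "__main__":
    tokens = kmer_tokens("ACGTACGTTAGC", k=6)
    masked, labels = mask_tokens(tokens)
    print(tokens)
    print(masked)
    print(labels)
```

During pretraining, the loss would be computed only at positions where the label is not None. With overlapping k-mers, neighboring tokens reveal most of a masked token, so a practical setup might mask contiguous spans instead; that choice is left open here.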
Chapter 3 investigated the correct and incorrect classifications of BERT at the gene level to gain a better understanding of BERT’s behavior and to prepare better training datasets in the future. Conserved and unique genes from interspecies and intraspecies genomes within the Prochlorococcus_B, Alteromonas, and Marinobacter genera were associated with correct and incorrect predictions for each corresponding BERT model trained to classify DNA sequences into Prochlorococcus_B marinus_B, Alteromonas macleodii, or Marinobacter psychrophilus. This study found that BERT tends to misclassify conserved genes from interspecies genomes and correctly classify unique genes from intraspecies genomes. The relationship between BERT’s decision making and the similarity of conserved genes to the training genome provided indications for future directions. The model’s accuracy could be improved by addressing the underrepresentation of unique genomic regions and the redundancy of conserved genomic regions in the training dataset, and by tuning the model’s sensitivity. On the other hand, the behavior of BERT towards unique genes is more complex and requires further investigation, such as analysis of attention scores. Finally, a preliminary analysis of the misclassified unique genes from intraspecies genomes suggested a potential use of the current setup to identify genes related to microdiversity and genomic evolution.
In conclusion, these studies demonstrate that deep learning has the potential to accurately classify metagenomics data. In addition to achieving higher precision and recall on DNA sequences than a convolutional neural network, the large language model handles sequencing errors, achieves comparable performance on simulated reads, and reaches almost 90% accuracy on raw sequencing reads. This work provides the foundation for applying deep learning to taxonomic classification of metagenomics data at the species level.
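For readers less familiar with the metrics named above, the following minimal sketch shows how precision and recall are computed for a binary classifier from true and predicted labels. The function name `precision_recall` and the example labels are hypothetical and are not taken from the dissertation's data or code.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = target species).

    Precision = TP / (TP + FP); recall = TP / (TP + FN).
    Returns 0.0 for a metric whose denominator is zero.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical labels for five reads; 1 means "belongs to the target species".
    y_true = [1, 1, 0, 0, 1]
    y_pred = [1, 0, 0, 1, 1]
    print(precision_recall(y_true, y_pred))
```

In this setting, precision is the fraction of reads assigned to the target species that truly belong to it, and recall is the fraction of the target species' reads that the classifier recovers.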
Recommended Citation
Cres, Cecile, "APPLICATION OF DEEP LEARNING TO TAXONOMIC CLASSIFICATION OF METAGENOMICS DATA" (2025). Open Access Dissertations. Paper 4523.
https://digitalcommons.uri.edu/oa_diss/4523
Included in
Artificial Intelligence and Robotics Commons, Bioinformatics Commons, Microbiology Commons