Advisor: Gurjit Randhawa [1], Computer Science
Proposed co-advisor: TBD
Introduction
Viruses inhabiting extreme environments present unique genomic characteristics that have yet to be fully explored. These extremophilic viruses have evolved to survive harsh conditions, such as extreme temperatures, pH levels, or salinity, and hold great potential for advancing our understanding of viral adaptation and biodiversity. Traditional alignment-based methods, while useful, struggle with rapidly evolving viruses that share little homology with known species. To overcome these challenges, alignment-free methods are gaining traction in viral genomics because they offer faster and more scalable genome analysis while remaining accurate.
Objectives
The primary objective of this study is to perform an alignment-free analysis of extremophilic viruses using the MLDSP (Machine Learning with Digital Signal Processing) tool. Specifically, we will apply Chaos Game Representation (CGR) and machine learning techniques to analyze viral genomes, uncover patterns in their sequence composition, and classify them based on their extremophilic characteristics. The dataset to be used includes viral sequences derived from metagenomic samples, with a particular focus on RNA viruses identified in extreme environments, building on previous studies that discovered thousands of new RNA viruses using AI models.
Methodology
Dataset: We will use viral sequences from a publicly available dataset, produced by a study that identified 70,500 RNA viruses in various extreme environments (hot springs, salt lakes, and air). The dataset will be prepared by curating the viral genomes associated with extremophilic conditions such as high temperature, high salinity, or extreme pH.
MLDSP and Chaos Game Representation (CGR): The MLDSP tool will be employed for this analysis. MLDSP combines Digital Signal Processing (DSP) techniques with machine learning to process 1D numerical representations of viral genomes, allowing for fast and accurate classification without sequence alignment. The CGR method will be used to convert viral DNA or RNA sequences into 2D genomic representations, in which k-mer frequencies are visualized as fractal patterns. This approach allows us to capture sequence features that are otherwise difficult to detect using traditional methods. We will use the machine learning analysis in MLDSP to classify viral sequences based on taxonomy and environmental factors:
Taxonomy: Determine the taxonomic relationships of the viruses, identifying similarities between newly discovered viruses and known viral taxa.
Environmental Factors: Investigate whether specific viral genomes share features linked to their extremophilic environments (e.g., thermophiles, halophiles).
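As a rough illustration of the signal-processing idea behind the 1D numerical representations mentioned above, the following sketch encodes two sequences as numbers, takes the magnitude of their discrete Fourier transform, and compares the spectra with a Pearson-correlation distance. It is not the MLDSP implementation: the purine/pyrimidine encoding, the zero-padding to a fixed length, the (1 - r) / 2 distance, and all function names are assumptions chosen for the example, and the two sequences are made up.

```python
import cmath
from math import sqrt

# Assumed numeric encoding (purine = -1, pyrimidine = +1); other encodings are possible.
ENCODING = {"A": -1.0, "G": -1.0, "C": 1.0, "T": 1.0, "U": 1.0}


def to_signal(seq, length):
    """Encode a sequence as numbers, truncated or zero-padded to `length`."""
    signal = [ENCODING.get(base, 0.0) for base in seq.upper()[:length]]
    return signal + [0.0] * (length - len(signal))


def magnitude_spectrum(signal):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                for t, x in enumerate(signal)))
        for f in range(n)
    ]


def pearson_distance(u, v):
    """Map the Pearson correlation r in [-1, 1] to a distance (1 - r) / 2."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        r = 0.0  # correlation is undefined for a constant vector; treated as 0 here
    else:
        r = cov / sqrt(var_u * var_v)
    return (1 - r) / 2


# Two made-up RNA fragments, compared at a common signal length.
seq_a, seq_b = "ACGUACGUAC", "GGGGCCCCAA"
n = 10
distance = pearson_distance(
    magnitude_spectrum(to_signal(seq_a, n)),
    magnitude_spectrum(to_signal(seq_b, n)),
)
print(f"spectrum distance: {distance:.3f}")
```

A matrix of such pairwise distances could then serve as input to a supervised classifier. For real genomes, a library FFT would replace the naive transform used here.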
We will apply supervised learning methods to classify the viral sequences and assess their taxonomic and environmental characteristics based on k-mer frequency vectors generated by the CGR method.
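The k-mer frequency vectors and the supervised classification described above can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the MLDSP code: the corner assignment (A, C, G, T/U), the Euclidean distance, the nearest-neighbour rule, and every function name are choices made for this sketch, and the training sequences and labels are toy placeholders rather than real viral data.

```python
from math import sqrt

# Assumed CGR corner assignment: A=(0,0), C=(0,1), G=(1,1), T/U=(1,0).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0), "U": (1, 0)}


def fcgr(seq, k):
    """Return a 2^k x 2^k grid of k-mer counts laid out as a CGR image."""
    size = 1 << k
    grid = [[0] * size for _ in range(size)]
    x = y = 0  # integer cell coordinates of the current k-mer
    run = 0    # number of consecutive valid nucleotides seen so far
    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:  # ambiguous symbol such as N: restart the window
            x = y = run = 0
            continue
        # The newest nucleotide sets the most significant bit, mirroring
        # the CGR rule "move halfway towards the corner of this nucleotide".
        x = (x >> 1) | (corner[0] << (k - 1))
        y = (y >> 1) | (corner[1] << (k - 1))
        run += 1
        if run >= k:
            grid[y][x] += 1
    return grid


def frequency_vector(seq, k):
    """Flatten the FCGR grid into a vector of relative k-mer frequencies."""
    flat = [count for row in fcgr(seq, k) for count in row]
    total = sum(flat)
    return [count / total for count in flat] if total else flat


def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(query, labelled, k=3):
    """Return the label of the training sequence nearest to the query."""
    q = frequency_vector(query, k)
    nearest = min(labelled, key=lambda item: euclidean(q, frequency_vector(item[0], k)))
    return nearest[1]


# Toy training set: made-up sequences with placeholder labels.
TRAINING = [("ATATATAT", "AT-rich (toy)"), ("GCGCGCGC", "GC-rich (toy)")]
print(classify("TATATA", TRAINING, k=2))
```

In practice the single nearest-neighbour rule would be replaced by the classifiers provided by MLDSP or a machine learning library, and k would be chosen by cross-validation on the curated dataset.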
Expected Outcomes
Novel Viral Classifications: The alignment-free method is expected to reveal novel viral classifications previously undetectable through alignment-based approaches. This could lead to the identification of new viral families or subtypes.
Insights into Extremophilic Adaptations: By analyzing the genomic signatures of viruses from extreme environments, the study aims to uncover patterns of viral adaptation, such as codon usage or nucleotide biases, which may contribute to their survival in hostile conditions.
Contribution to Viral Metagenomics: This study will contribute to ongoing efforts in viral metagenomics by providing a new, computationally efficient method for exploring the vast diversity of viruses in extreme environments.
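The nucleotide biases mentioned under Insights into Extremophilic Adaptations can be summarized with simple composition statistics. The sketch below computes GC content, GC skew, and AT skew for one sequence. The function name and the choice of these three statistics are assumptions for illustration, and nothing here implies that any of them is linked to a particular environment.

```python
def nucleotide_bias(seq):
    """Return GC content, GC skew and AT skew for a DNA or RNA sequence."""
    s = seq.upper().replace("U", "T")  # treat RNA and DNA uniformly
    a, c, g, t = (s.count(base) for base in "ACGT")
    total = a + c + g + t  # ambiguous symbols such as N are ignored
    if total == 0:
        raise ValueError("sequence contains no A, C, G, T or U")
    return {
        "gc_content": (g + c) / total,
        "gc_skew": (g - c) / (g + c) if g + c else 0.0,
        "at_skew": (a - t) / (a + t) if a + t else 0.0,
    }


# Made-up fragment, used only to show the call.
print(nucleotide_bias("GGCAUUNACG"))
```

Such per-genome values could be compared across environmental groups alongside the k-mer based classification.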
Conclusion
This project leverages cutting-edge machine learning and alignment-free techniques to analyze extremophilic viruses. The use of MLDSP and CGR in analyzing viral sequences holds the potential to accelerate viral discovery and deepen our understanding of viral adaptations to extreme environments. Results from this study could provide significant insights into the evolution of viruses and their role in biodiversity.
This project is suitable for one or two semesters. The student will occasionally be required to be on-site.
Knowledge/Skills
Machine learning algorithms, genomic sequence analysis, data curation, and experience with MATLAB and/or Python