PhD Defence: Eddie Ma

Date and Time

Location

Rozanski Hall, Room 106

Details

Correcting Ambiguous Base Labels in DNA Sequencing using Neural Networks and its Impact on DNA Barcoding in Sequencing Accuracy, Phylogeny, and Species Identification: Eddie Ma

Chair: Dr. Mark Wineberg
Advisor: Dr. Stefan Kremer
Advisory Committee Member: Dr. David Calvert
Non-Advisory Committee Member: Dr. Deborah Stacey
External Examiner: Dr. Lucian Ilie (University of Western Ontario)

ABSTRACT:

Procuring DNA sequences relies on a combination of software algorithms, sequencing technology, and human effort. To improve the rate at which DNA sequences are obtained, new methods are adopted in each of these processes. In this thesis, an artificial neural network method is used to increase the number of bases in DNA sequences obtained from Sanger sequencing, by post-processing the sequences and replacing ambiguous DNA base labels with discrete base labels. The existing KB basecalling algorithm produces the initial basecalled sequence that is post-processed by the proposed method. One platform that depends on highly accurate DNA sequences is DNA Barcoding. In DNA Barcoding, species are identified by DNA sequences corresponding to short reads of a standardized gene region (e.g. 600-700 bases for COI). Barcode of Life Datasystems (BOLD) is the largest repository and online analytics platform serving the International Barcode of Life (iBOL) project. In this thesis, a novel automated, machine-learned error correction system, the System-3 N-label Editor (S3), is developed. S3 is developed and validated on DNA Barcoding data, using nearly 850,000 ambiguous base labels across 160,000 aligned sequences. S3 uses an internal representation of uncertainty to estimate error and only commits to replacing an ambiguous N-label if its predicted error is lower than 1%. On this data, S3 is able to maintain an observed error rate lower than 1%, while disambiguating 79% of N-labels in animal barcodes, 80% of N-labels in plant barcodes, and 58% of N-labels in non-coding genes across eukarya (e.g. ribosomal genes and other genetic markers that produce no protein product). S3 is then tested for its impact on DNA barcoding and bioinformatics applications, using 90,000 animal barcode sequences from the Canadian National Parks Malaise Project. In this impact testing, the 1% error rate is found to be maintained (0.96% mean rate of error).
When the initial KB sequence, the S3 sequence, and a human-edited sequence are compared, treating sequences with S3 falls short of adding enough information to impact species identification (BOLD BIN system) or discovery (ABGD and GMYC methods). Treatment with S3 is found to significantly improve evolutionary tree construction, in that the resulting trees are more similar to those constructed from human-edited sequences. This result is found for both neighbour-joining and maximum-likelihood methods, when comparing the trimmed and aligned sequence region shared by the KB, S3, and human-edited sequences. For the purpose of finding the separation between species (the barcode gap), S3 improves on KB in the difference between between-species and within-species genetic distance for adjacent species in about two-thirds of species (238 of 369 species tested). The success obtained in performance validation of S3 for N-label replacement, the generalization of that performance to Malaise data, and the encouraging, albeit modest, results in DNA barcoding applications point the way to future work that operates on next generation sequencing (NGS) platforms and performs the remaining modes of error correction in sequence finishing, including missing/extra base labels (insertions/deletions), multiple bases in one location (heteroplasmy), and variable counts of repeated bases (homopolymers) in Sanger and NGS methods.
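
The commit rule described in the abstract (replace an N-label only when the predicted error is below 1%) can be illustrated with a minimal sketch. The abstract does not say how S3 represents uncertainty, so this example assumes, purely for illustration, that a model supplies a probability for each of A, C, G, T at every ambiguous position and that the predicted error is one minus the top probability. The function name `replace_n_labels`, its parameters, and the sample values are hypothetical and are not taken from the thesis.

```python
from typing import Mapping, Sequence

BASES = "ACGT"


def replace_n_labels(
    sequence: str,
    base_probs: Mapping[int, Sequence[float]],
    max_error: float = 0.01,
) -> str:
    """Replace 'N' labels only where the predicted error is below max_error.

    base_probs maps a 0-based position to assumed probabilities for A, C, G, T.
    Predicted error is taken as 1 - max(probabilities); this is an assumption
    made for the sketch, not the uncertainty model used by S3.
    """
    out = list(sequence)
    for pos, probs in base_probs.items():
        if out[pos] != "N":
            continue  # only ambiguous labels are candidates for replacement
        best = max(range(len(BASES)), key=lambda i: probs[i])
        predicted_error = 1.0 - probs[best]
        if predicted_error < max_error:
            out[pos] = BASES[best]  # commit to a discrete base label
        # otherwise the N-label is left in place
    return "".join(out)


# Hypothetical input: position 2 is confidently G, position 4 is uncertain.
example = replace_n_labels(
    "ACNTN",
    {2: [0.001, 0.002, 0.995, 0.002], 4: [0.4, 0.3, 0.2, 0.1]},
)
```

In this sketch the first N is replaced and the second is kept, which mirrors the trade-off in the abstract: a stricter threshold lowers the error rate but leaves more N-labels unresolved.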
