Biological Sequence Classification Using Machine Learning
Next-generation sequencing technologies have made it possible to generate large amounts of biological sequence data at low cost, opening great opportunities in the life sciences. As genomes are sequenced, a major challenge is their annotation: the identification of genes and regulatory elements, their locations, and their functions. Machine learning techniques can be used to train computer programs to find such annotations automatically by formulating the annotation problems as sequence classification tasks. However, the success of machine learning approaches to sequence classification depends significantly on the choice of features used to represent sequence instances. In this talk, I will define the problem of identifying alternatively spliced exons as a sequence classification task and show that a set of features experimentally known to affect alternative splicing can be used to distinguish alternatively spliced exons from constitutive exons. Furthermore, I will present our recent results on the problem of predicting consecutive-gene-pair transcription patterns using motif features, and show how this problem can be exploited to identify regulatory motifs that might be important for transcription (such as transcription factor binding sites).
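
As a rough illustration of what formulating annotation as sequence classification over motif features can look like, the sketch below represents each sequence by how often each motif occurs in it and assigns the label of the nearest class centroid. It is not the method presented in the talk: the motifs, the toy sequences, their labels, and the nearest-centroid classifier are all assumptions chosen only to keep the example small.

```python
# Minimal sketch: motif-count features plus a nearest-centroid classifier.
# All motifs, sequences, and labels below are made-up toy values.

def motif_counts(seq, motifs):
    """Return the number of (overlapping) occurrences of each motif in seq."""
    seq = seq.upper()
    counts = []
    for motif in motifs:
        k = len(motif)
        counts.append(sum(1 for i in range(len(seq) - k + 1)
                          if seq[i:i + k] == motif))
    return counts

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train(examples, motifs):
    """Compute one centroid per class label from (sequence, label) pairs."""
    by_label = {}
    for seq, label in examples:
        by_label.setdefault(label, []).append(motif_counts(seq, motifs))
    return {label: centroid(vecs) for label, vecs in by_label.items()}

def predict(model, seq, motifs):
    """Assign the label whose centroid is closest to the sequence's features."""
    features = motif_counts(seq, motifs)
    return min(model, key=lambda label: squared_distance(model[label], features))

# Hypothetical motifs and toy training data (not real splicing signals).
MOTIFS = ["GGG", "TTT"]
TRAINING = [
    ("ACGGGTGGGA", "alternative"),
    ("GGGACGGGCA", "alternative"),
    ("ATTTCATTTA", "constitutive"),
    ("CTTTGTTTTA", "constitutive"),
]

model = train(TRAINING, MOTIFS)
print(predict(model, "TTTACGTTT", MOTIFS))
```

In practice the feature set would come from experimentally supported signals, as described above, and the classifier would be evaluated on held-out sequences. The point of the sketch is only that the choice of motifs determines what the classifier is able to separate.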
