Noncoding RNA (ncRNA) genes, unlike protein-coding genes, produce transcripts that exert their functions without being translated into proteins. Evidence indicates that ncRNAs are numerous and participate in many important biological pathways, and ncRNA has become one of the most active research topics in functional genomics. This dissertation addresses the computational identification of ncRNAs and the extraction of their features using bioinformatics approaches. It consists of the following three parts.

1) MicroRNAs (miRNAs) are a group of short (~22 nt) noncoding RNAs that play important regulatory roles. miRNA precursors (pre-miRNAs) are characterized by their hairpin structures, but an ab initio method for distinguishing pre-miRNAs from other sequence segments with pre-miRNA-like hairpin structures has been lacking. In this dissertation, a set of novel features based on local contiguous structure-sequence information is proposed for distinguishing the hairpins of real pre-miRNAs from those of pseudo pre-miRNAs. A support vector machine (SVM) applied to these features classifies real versus pseudo pre-miRNAs with about 90% accuracy on human data. Remarkably, the SVM classifier built on human data correctly identifies up to 90% of the pre-miRNAs from other species, including plants and viruses, without using any comparative genomics information.

2) Identifying ncRNAs with computational algorithms or biological methods has become an important task. Using EST alignment and comparative genomics, 118 putative ncRNA transcripts were identified in the human genome. These transcripts align to low-abundance ESTs but have no apparent open reading frame. Comparative genomic analysis indicates that at least some of these ncRNA transcripts are highly conserved across 8 mammalian species. Ten putative ncRNA transcripts were randomly selected for further experimental validation, and RT-PCR verified that 8 of these putative ncRNA genes are indeed transcribed in human 2BS cells.
We believe that this is an efficient strategy for screening ncRNA transcripts using low-abundance EST data and that it could be applied to other organisms.

3) Many RNAs are evolutionarily conserved in secondary structure rather than in primary sequence. A challenging problem is to search a large genome database quickly for structural similarities to a structured RNA sequence; existing methods are too slow to be used on large genomes. This dissertation presents an implementation of a fast structural alignment algorithm, RScan, that addresses this problem. RScan combines the advantages of hashing algorithms and local alignment algorithms. It runs fast on a standard personal computer while maintaining high accuracy. These results indicate that RScan is a superior choice for real-life applications of searching for structural homologs of structured RNAs in large genomes.
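
As an illustration of the local contiguous structure-sequence features described in part 1), the following Python sketch counts, for every interior position of a hairpin, the pairing status of three adjacent nucleotides together with the identity of the middle nucleotide. This is only a sketch under stated assumptions: the three-position window, the collapsing of "(" and ")" into a single paired state, the resulting 32-dimensional encoding, the normalization, and the function and variable names are all chosen for illustration and are not specified in this abstract. The input is assumed to be an RNA sequence with a dot-bracket secondary structure of the same length.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
STATES = "(."  # "(" = paired, "." = unpaired (assumed two-state encoding)

# All 4 * 2**3 = 32 feature names, e.g. "A(((" or "G.(."
FEATURES = [n + "".join(s) for n in NUCLEOTIDES for s in product(STATES, repeat=3)]


def local_structure_sequence_features(seq, structure):
    """Return normalized counts of (middle nucleotide, 3-position pairing state)."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq = seq.upper().replace("T", "U")
    # Collapse both bracket directions into one "paired" state.
    states = structure.replace(")", "(")
    counts = dict.fromkeys(FEATURES, 0)
    for i in range(1, len(seq) - 1):
        key = seq[i] + states[i - 1:i + 2]
        if key in counts:  # windows with N or other unexpected symbols are skipped
            counts[key] += 1
    total = sum(counts.values())
    return [counts[f] / total if total else 0.0 for f in FEATURES]


# Toy hairpin, for illustration only (not a real pre-miRNA).
vector = local_structure_sequence_features("GGGAAACCC", "(((...)))")
```

A vector of this kind, computed for real and pseudo hairpins, could then serve as input to any standard SVM implementation; the choice of kernel and parameters is not given here.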