Studies on computational identification approaches for non-coding RNA


	非编码RNA的计算识别方法研究
Alternative title	Studies on computational identification approaches for non-coding RNA
	薛成海
	2006-05-15
Degree type	Doctor of Engineering
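Abstract (translated from Chinese)	Research on non-coding RNA is one of the important frontier problems of the functional genomics era. Using bioinformatics methods, this dissertation studies the computational identification and feature analysis of non-coding RNA. It covers three topics: computational identification of microRNA (miRNA), computational identification of non-coding RNA, and similarity search for RNA secondary structures. 1) miRNAs are a class of non-coding RNAs that regulate gene expression. miRNA precursors form characteristic stem-loop structures. However, the genome contains a large number of sequence segments whose structures resemble those of miRNA precursors (called pseudo miRNA precursors in this dissertation). Distinguishing real from pseudo miRNA precursors is important for understanding the nature of miRNAs and also helps in developing methods for predicting them. This dissertation proposes a local structure-sequence feature, based on the stem-loop structure, to describe miRNA precursors. With this feature, the differences between miRNA precursors and pseudo miRNA precursors are analyzed. A pattern recognition technique, the support vector machine (SVM), is then applied to classify the two kinds of data, with good results. In addition, the conservation of miRNA precursors from different species under the local structure-sequence feature is analyzed, and a miRNA identification strategy that does not rely on comparative genomics is proposed. 2) Non-coding RNA genes directly produce functional RNA molecules rather than being translated into proteins, and they participate in many important cellular regulatory processes. Unlike protein-coding genes, non-coding RNA genes lack obvious common features such as open reading frames and codon bias, so computational identification of non-coding RNA is a difficult and important task. This dissertation proposes a strategy for identifying non-coding RNA based on integrated features. Applying this strategy to the human genome, and using low-abundance EST data from intergenic regions combined with EST clustering, comparative genomics, and transcription signal analysis, highly reliable non-coding RNA genes are predicted, and some of the results are experimentally validated and analyzed. 3) Many non-coding RNAs have evolutionarily conserved secondary structures rather than evolutionarily conserved primary sequences. Previously reported structural alignment methods look for conserved secondary structures within pairs or sets of sequences. An open problem is: given an RNA sequence of known structure, search a large database for sequences with a similar structure. To address this problem, this dissertation develops the algorithm RScan. Compared with existing methods, RScan runs quickly on a single machine while maintaining relatively high accuracy. Most importantly, RScan is practical for real-world use and can complete searches of large databases.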
Abstract (English)	Noncoding RNA (ncRNA) genes, unlike protein-coding genes, produce transcripts that function without being translated into proteins. It has been shown that ncRNAs may be numerous and participate in many important biological pathways, and ncRNA has become one of the most active research topics in functional genomics. This dissertation uses bioinformatics approaches to study the computational identification and feature analysis of ncRNA. It has three parts. 1) MicroRNAs (miRNAs) are a group of short (~22 nt) non-coding RNAs that play important regulatory roles. MiRNA precursors (pre-miRNAs) are characterized by their hairpin structures. An ab initio method for distinguishing pre-miRNAs from other sequence segments with pre-miRNA-like hairpin structures has been lacking. This dissertation proposes a set of novel features, based on local contiguous structure-sequence information, for distinguishing the hairpins of real pre-miRNAs from those of pseudo pre-miRNAs. A support vector machine (SVM) applied to these features classifies real versus pseudo pre-miRNAs with about 90% accuracy on human data. Remarkably, the SVM classifier built on human data correctly identifies up to 90% of the pre-miRNAs from other species, including plants and viruses, without using any comparative genomics information. 2) Identifying ncRNAs by computational or experimental methods has become an important task. Using EST alignment and comparative genomics, 118 putative ncRNA transcripts were identified in the human genome. These transcripts align to low-abundance ESTs but have no apparent open reading frame. Comparative genomic analysis indicates that at least some of them are highly conserved across 8 mammalian species. Ten putative ncRNA transcripts were randomly selected for experimental validation, and RT-PCR verified that 8 of them are indeed transcribed in human 2BS cells. We believe that this is an efficient strategy for screening ncRNA transcripts with low-abundance EST data and that it could be applied to other organisms. 3) Many RNAs have evolutionarily conserved secondary structures rather than conserved primary sequences. A challenging problem is to search a large genome database quickly for sequences that are structurally similar to a given structured RNA; existing methods are too slow for large genomes. This dissertation presents RScan, an implementation of a fast structural alignment algorithm for this problem. RScan leverages the advantages of both hashing algorithms and local alignment algorithms. It runs quickly on a standard personal computer while maintaining high accuracy. These results indicate that RScan is a superior choice for real-life searches for structural homologs of structured RNAs in large genomes.
Keywords	Bioinformatics; Non-coding RNA; MicroRNA; Secondary Structure; Computational Identification; Database Search
Language	Chinese
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/5898
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	薛成海. 非编码RNA的计算识别方法研究[D]. 中国科学院自动化研究所. 中国科学院研究生院,2006.
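
The abstract describes a local contiguous structure-sequence feature for hairpins but does not spell out its encoding. The sketch below shows one plausible reading, offered as an illustration and not as the dissertation's exact definition: for each interior position, the middle nucleotide is combined with the paired/unpaired state of the three adjacent positions, giving 4 x 8 = 32 normalized frequencies. The function name `triplet_features`, the dot-bracket input format, and the decision to count every interior position are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGU"
# 32 hypothetical feature keys: middle nucleotide + pairing state of 3 adjacent
# positions, where "(" means paired and "." means unpaired.
FEATURE_KEYS = [n + "".join(p) for n in NUCLEOTIDES for p in product("(.", repeat=3)]


def triplet_features(seq, dot_bracket):
    """Return normalized frequencies of the 32 structure-sequence triplets.

    Assumption: the secondary structure is given in dot-bracket notation and
    both "(" and ")" are treated simply as "paired".
    """
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    seq = seq.upper().replace("T", "U")
    # Collapse the structure to two states: paired "(" or unpaired ".".
    state = "".join("(" if c in "()" else "." for c in dot_bracket)
    counts = Counter()
    for i in range(1, len(seq) - 1):
        if seq[i] in NUCLEOTIDES:  # skip ambiguous bases such as N
            counts[seq[i] + state[i - 1:i + 2]] += 1
    total = sum(counts.values())
    return {k: (counts[k] / total if total else 0.0) for k in FEATURE_KEYS}


# Toy hairpin: a 3-bp stem closing a 3-nt loop.
example = triplet_features("GGGAAACCC", "(((...)))")
```

Each hairpin then maps to a fixed-length vector of 32 values that could be passed to any SVM implementation to separate real from pseudo pre-miRNAs, which is the classification setting the abstract describes.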

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20031801460298 (1735KB)			Not yet open	CC BY-NC-SA