Computational Bioinformatics and Machine Learning Models to Identify the Diseasome and Neurological Disease Comorbidities


	Computational Bioinformatics and Machine Learning Models to Identify the Diseasome and Neurological Disease Comorbidities
	Md Habibur Rahman
	2020-05-31
Pages	158
Degree type	Doctoral
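Chinese abstract	In a patient with a disease, a comorbidity is a second (or further) disease occurring in the same patient at the same time. The presence of a comorbidity can complicate the standard treatment of the primary disease or cause it to fail. Consequently, compared with individuals who have a single disease, individuals with comorbidities have (depending on the diseases involved) a higher risk of severe illness or death. Studying disease comorbidity interactions with multi-omics, disease-gene association (i.e., diseasome) and molecular data has improved our understanding of the pathogenic mechanisms of many diseases and has led to significant advances in diagnosis, prevention and treatment. However, as the global burden of disease grows, comorbidity is increasingly becoming a major problem in clinical and biomedical practice. Identifying and characterizing comorbidity interactions is important not only for understanding complex pathophysiology, but also for designing rational and creative drug therapies, and for patient self-management, health care utilization and treatment strategy. Because of shared risk factors (including genetic, molecular, environmental and lifestyle-based factors), certain comorbidities, including cancers, are more likely to occur in the same patient. Because the etiology of non-communicable diseases is always complex and their risk factors often overlap, their biological basis and the underlying molecular mechanisms that lead to such comorbidity remain unclear. This complexity not only makes the molecular mechanisms of individual diseases elusive and hard to study, but also makes comorbidity interactions more challenging still. In addition, compared with traditional studies, most comorbidity studies have focused on the role of a single type of clinical or molecular phenotype data in determining how comorbid diseases interact. In this study, we designed and developed a bioinformatics and machine learning approach that identifies important mediators of comorbidity interactions by using genetic, multi-omics and molecular-level data. Our research focuses on developing network-based and machine-learning-based bioinformatics models to identify disease comorbidities. We applied the models in two projects. One identifies comorbidity interactions between type 2 diabetes (T2D) and neurological diseases (NDs); the other identifies comorbidity interactions between central nervous system (CNS) disorders (also known as NDs) and glioblastoma (a CNS cancer), and how these may affect the survival of cancer patients. We first proposed a network-based, high-throughput, quantitative bioinformatics pipeline that uses agnostic approaches to identify T2D molecular biomarkers associated with the progression of neurological diseases. We compared gene expression transcriptomic data from control and disease-affected tissues of T2D and ND patients. We analysed these datasets with linear models for microarray data (LIMMA) and identified differentially expressed genes (DEGs) by comparing affected and control individuals. T2D and ND shared 197 DEGs, of which 99 were up-regulated and 98 were down-regulated. These overlapping DEGs (i.e., those seen in both the T2D and ND datasets) revealed the involvement of important molecular pathways related to cell signaling. They were then used to extract the most significant gene ontology (GO) terms. Protein-protein interaction analysis identified the key or "hub" proteins in the identified pathways; many of the hub proteins have not previously been described as playing a role in these diseases. To reveal some of the transcriptional and post-transcriptional regulators of the DEGs, we used DEG-transcription factor (TF) interaction analysis and DEG-microRNA (miRNA) interaction analysis, respectively. We validated these results with gold benchmark databases and literature searches, which clarified which genes and pathways had previously been linked to NDs or T2D and which are new. Our transcriptomic data analysis has thus identified new potential links between NDs and T2D that may underlie comorbidity interactions and may include potential targets for therapeutic intervention. In this network-based bioinformatics approach we identified only new biological processes involved in disease comorbidity, so their semantic similarity was not determined. Semantic similarity measures compute the similarity of gene ontology terms and gene products to assess similarity in terms of disease concepts. In further computational analyses we therefore determined the semantic similarity between T2D and ND comorbidities. For this, we designed a bioinformatics pipeline that analyses, uses and combines gene expression, GO and molecular pathway data by incorporating gene set enrichment analysis and semantic similarity. To reduce bias, we used several publicly available T2D and ND datasets from different sources and cell types to maximize the power of this approach. We also computed the similarity between T2D and neurological pathologies using gene and term semantic similarity, which strengthens the identification and characterization of comorbidity interactions beyond simply identifying new biological processes associated with each disease. We validated the results with standard benchmark databases and literature searches. Finally, we built machine learning models and used bioinformatics and machine learning approaches to identify the comorbidity of cancer with NDs and to predict survival in cancer patients. Glioblastoma is a common malignant brain tumor with a high mortality rate that often coexists with NDs. We used a quantitative analytical bioinformatics framework to uncover shared genes and cell signaling pathways that may link NDs and glioblastoma. We obtained datasets from the National Center for Biotechnology Information (NCBI) and The Cancer Genome Atlas (TCGA) that came from studies comparing normal tissue with diseased/glioblastoma tissue. After identifying differentially expressed genes (DEGs) with our framework, we predicted their functions using disease-gene association networks, signaling pathways, enrichment analysis and protein-protein interaction (PPI) networks. Through univariate and multivariate analysis, using a Cox proportional hazards (Cox-PH) model and the product-limit (PL) estimator, we assessed which clinical factors and genes play an important role in determining the survival time of GBM patients.
This study identified 177 DEGs (129 up-regulated and 48 down-regulated). Of these, 54 genes were associated with patient survival. The diseasome networks, molecular pathways, ontological pathways, protein-protein interaction (PPI) networks and survival analysis of the significant genes all indicate that NDs may influence the progression, growth or establishment of glioblastoma. The shared DEGs identified here may also serve as biomarkers for glioblastoma prognosis and as potential therapeutic targets. We also validated all of the signature genes and pathways we identified using the standard benchmark databases dbGaP, OMIM and OMIM-Expanded and literature reviews. These further demonstrate that the genes we identified are involved in the pathological processes of glioblastoma progression. This work has the potential to lead to new diagnostic approaches and new treatment designs.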
English abstract	In a patient suffering from a disease, a comorbidity is a second (or further) disease co-occurring in the same patient at the same time. The existence of a comorbidity can complicate, or cause the failure of, a standard treatment given for the primary disease. Thus, compared with an individual with a single disease, individuals with disease comorbidities can (depending on the diseases involved) have a higher risk of severe illness or mortality. The study of disease comorbidity interactions using multi-omics, disease-gene association (i.e., diseasome) and molecular data has improved our knowledge of the pathogenic mechanisms of many diseases and led to significant advances in diagnosis, prognosis, and treatment. However, as the global burden of disease has increased, disease comorbidity has increasingly become a major clinical and biomedical problem. Identification and characterization of comorbidity interactions are important not only for understanding complex pathophysiologies, but also for the design of rational and creative pharmacotherapies, and for patient self-management, health care utilization and treatment strategy. Due to shared risk factors (including genetic, molecular, environmental, and lifestyle-based factors), certain comorbidities, including cancers, are more likely to occur in the same patient. As the etiology of non-infectious diseases is always complex and their risk factors tend to overlap, the biological basis and molecular mechanisms that underlie such comorbidity are still poorly understood. This complexity not only makes the molecular mechanisms of individual diseases elusive and difficult to study, but also makes comorbidity interactions even more challenging. In addition, compared with traditional studies, most comorbidity studies have concentrated on the role of a single type of clinical, molecular or phenotype data in identifying how comorbid diseases interact.
In this study, we designed and developed a bioinformatics and machine learning approach that can identify important mediators of comorbidity interactions by utilising genetic, multi-omics and molecular-level data. Our research focuses on developing network-based and machine-learning-based bioinformatics models to identify disease comorbidities. We applied the models in two projects. One identifies comorbidity interactions between type 2 diabetes (T2D) and neurological diseases (NDs); the other identifies comorbidity interactions between central nervous system (CNS) disorders, also known as NDs, and glioblastoma, a type of CNS cancer, and how these may affect the survival of the cancer patients involved. We first proposed a high-throughput, network-based, quantitative bioinformatics pipeline that uses agnostic approaches to identify molecular biomarkers for T2D that are linked to the progression of NDs. We compared gene expression transcriptomic datasets from control and disease-affected tissues of T2D and ND patients. We applied linear models for microarray data (LIMMA) to these datasets and identified differentially expressed genes (DEGs) by comparing affected and control individuals. A total of 197 DEGs were common to both the T2D and ND datasets, of which 99 were up-regulated and 98 were down-regulated in affected individuals. These overlapping DEGs (i.e., those seen in both T2D and ND datasets) revealed the involvement of significant molecular pathways associated with cell signaling. They were then used to extract the most significant gene ontology (GO) terms. The critical or "hub" proteins in the identified pathways were found using protein-protein interaction analysis; many of these hub proteins have not previously been described as playing a role in these diseases.
To reveal some of the transcriptional and post-transcriptional regulators of the DEGs, we used DEG-transcription factor (TF) interaction analysis and DEG-microRNA (miRNA) interaction analysis, respectively. We validated these results with gold benchmark databases and literature searches, which clarified which genes and pathways had previously been linked to NDs or T2D and which are novel. Thus, our transcriptomic data analysis identified novel potential links between ND and T2D pathologies that may underlie comorbidity interactions and may include potential targets for therapeutic intervention. In this network-based bioinformatics approach, we identified only novel biological processes involved in disease comorbidity, so their semantic similarity was not determined. Semantic similarity measures compute the similarity of gene ontology terms and gene products to assess proximity in terms of disease concepts. In further computational analyses, we therefore determined the semantic similarity between T2D and ND comorbidities. For this, we designed a bioinformatics pipeline that analyses, utilizes and combines gene expression, GO and molecular pathway data by incorporating gene set enrichment analysis and semantic similarity. To reduce bias, we used several publicly available datasets for T2D and NDs from different sources and cell types to maximize the power of this approach. We also computed the proximity between T2D and neurological pathologies using gene and GO term semantic similarity, which enhances the identification and characterization of comorbidity interactions beyond simply identifying novel biological processes involved in each disease. We validated the results with gold benchmark databases and literature searches.
Finally, we developed machine learning models and moved on to identifying cancer comorbidity with NDs and predicting survival in cancer patients using bioinformatics and machine learning approaches. Glioblastoma (GBM) is a common malignant brain tumor with a high mortality rate that often presents as a comorbidity with NDs. We employed a quantitative analytical bioinformatics framework to unravel shared genes and cell signaling pathways that can link NDs and glioblastoma. We acquired datasets from the National Center for Biotechnology Information (NCBI) and The Cancer Genome Atlas (TCGA), drawn from studies comparing normal tissue with diseased/glioblastoma tissue. After identifying differentially expressed genes (DEGs) with our framework, we used disease-gene association networks, signaling pathway analysis, GO enrichment analysis and protein-protein interaction (PPI) networks to predict the function of these DEGs. We expanded our study to evaluate which clinical factors and genes play significant roles in determining survival time in GBM patients, using a Cox proportional hazards (Cox PH) model and the product-limit (PL) estimator in both univariate and multivariate analyses. In this study, 177 DEGs (129 up-regulated and 48 down-regulated) were identified. Among these, 54 genes were associated with an effect on patient survival. Diseasome networks, molecular pathways, ontological pathways, PPI networks, and survival analysis of the significant genes all indicate ways in which NDs may influence the progression, growth or establishment of glioblastoma. The shared DEGs identified here may also function as biomarkers for glioblastoma prognosis and as potential targets for therapies. We also validated all of our identified signature genes and pathways using the gold benchmark databases dbGaP, OMIM and OMIM Expanded and literature reviews.
These validations provide further evidence that the identified genes are involved in the pathological processes underlying glioblastoma progression. This work has the potential to support the development of new diagnostic approaches and lead to the design of new treatments.
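The product-limit (PL) estimator named in the abstract multiplies, at each observed event time, the fraction of at-risk patients who survive it: S(t) is the product of (1 - d_i / n_i) over event times t_i <= t. The following is a minimal sketch of that calculation only, not the thesis's implementation; the function name `product_limit` and the follow-up data are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def product_limit(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- follow-up duration for each patient
    events -- 1 if death was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)        # everyone leaving the risk set at t
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # survival drops only at times with observed deaths
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        # deaths and censored patients both leave the risk set
        at_risk -= exits[t]
    return curve


# Hypothetical follow-up times in months; NOT data from the thesis.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = product_limit(times, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.2f}")
```

Censored patients contribute to the at-risk count until they leave follow-up but never lower the curve, which is what distinguishes this estimator from a simple proportion of survivors.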
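Keywords	Disease comorbidity identification; survival analysis; bioinformatics; machine learning; type 2 diabetes; neurological diseases; glioblastoma; pathways; gene ontology; proteins
Language	English
Seven major directions: sub-direction classification	Artificial intelligence + healthcare
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/39725
Collection	Research Center for Intelligent Manufacturing Technology and Systems_Multidimensional Data Analysis (Peng Silong) Technical Team
Recommended citation (GB/T 7714)	Md Habibur Rahman. Computational Bioinformatics and Machine Learning Models to Identify the Diseasome and Neurological Disease Comorbidities[D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2020.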

Files in this item
File name/size	Document type	Version type	Access type	License
Md Habibur Rahman.pd (12630KB)	Thesis		Open access	CC BY-NC-SA