Automatic Construction of a Cross-Species Brain Cognitive Knowledge Graph Based on Spans of Interest in Neuroscience Literature | |
朱洪银 | |
2020-06 | |
Pages | 176 |
Subtype | Doctoral |
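Abstract | Humanity has never stopped exploring the brain. The brain is regarded as the source of human intelligence and is the most complex and mysterious organ of the human body. This organ plays a vital role in the lives of all kinds of organisms. With the development of the information age, more and more methods are being applied to explore the structure and function of the brain and to study major questions in brain science. Existing methods rely mainly on neuroimaging and biological experiments to observe brain activity and to study the relationship between the brain and cognitive functions, but such methods are costly and time-consuming. In recent years, researchers have accumulated a large body of results in brain science: more than two million publications relate to the field, and new findings appear rapidly. Yet this literature has not been used efficiently or fully, and cross-species brain science research remains relatively scarce. Against this background, this thesis is the first to explore the automatic construction of a cross-species brain cognitive knowledge graph from large-scale neuroscience literature, providing practical support for brain-inspired intelligence and brain science research. The work has theoretical significance and practical value for improving the efficiency of knowledge acquisition, enabling innovative scientific discoveries, and advancing brain science. Starting from the challenges encountered in brain science research and the characteristics of the field, this thesis studies the problems that arise in constructing a cross-species brain science knowledge graph. The main work and contributions are summarized as follows: 1. A species classification method for brain science literature based on spans of interest. Brain science research covers a wide variety of species, but the large body of brain science knowledge is not organized by species and is mixed together. As a result, researchers and literature analysis systems cannot tell which species a piece of knowledge comes from, and cannot conduct cross-species research. This thesis finds that the abstracts of half of the publications do not mention a species. The task is a multi-label classification problem. Existing methods usually encode a document as a single vector for classification; their limitation is that a threshold is needed to decide the final label subset, and they are not flexible enough, because each species is often derived from a different part of the document. This thesis proposes a generative model that attends to spans of interest in the document and adaptively generates species labels, so that each label can flexibly focus on a different part of the document. In addition, it proposes a hierarchical attention decoding mechanism that incorporates the discourse section structure of the document and yields a clear performance improvement. Three species classification datasets in the brain science domain are annotated, and two sets of corpus annotation standards are proposed. The proposed method can also distinguish whether a species is the main experimental subject, which alleviates the problem of cross-species knowledge acquisition. 2. A named entity recognition method based on entity-of-interest augmentation. For biological experiments and brain science literature analysis, recognizing terms is a fundamental step. Deep learning methods have achieved good results in the general domain, but most of them depend on large amounts of training data and exploit only fixed entity-context pairings. Brain science research mainly faces a low-resource problem: annotating a brain science corpus requires domain experts, which greatly increases the difficulty of annotation. This thesis proposes a named entity recognition method based on entity-of-interest augmentation, together with a bilateral neural network architecture and a model training method. It improves model performance on 2 biomedical datasets and on datasets in different languages, with especially clear gains on low-resource datasets. A brain science term recognition corpus is annotated with six categories of entities (cognitive function, brain region, brain disease, neuron, protein, neurotransmitter), and an annotation guideline is proposed. The model also achieves improvements on the brain science term recognition task. 3. A modular neural network relation extraction method based on pairs of entities of interest. Brain cognitive relation extraction aims to extract functional connections between brain regions from the brain science literature. Existing models have achieved good results on information extraction tasks. Brain science, however, is an exploratory field in which many brain region connections remain unsettled and many mechanisms remain unknown, so existing research paradigms are limited in their application to the field. The limitation of pipeline methods is that they separate entity recognition from relation extraction and cannot learn the two jointly. This thesis proposes a cascaded inference learning method and a shared representation mechanism to achieve end-to-end joint extraction of entities and relations, and uses a self-attention mechanism to model pairs of entities of interest for relation classification. It achieves improvements on 4 public datasets. It also summarizes 4 common types of relation extraction protocols and proposes a modular neural network with 4 information flows that adaptively integrate the different protocols, thereby alleviating the limitations of existing research paradigms. Using distant supervision over existing knowledge, a brain cognitive relation dataset is annotated, and the proposed method also achieves improvements on it. In brain science research, the value of knowledge, data, and services often attracts more attention than improvements to models. This thesis aims to extract structured knowledge from the literature to help neuroscientists, biologists, and researchers in literature and information analysis. The extracted cross-species brain cognitive knowledge graph is integrated into the brain science knowledge engine framework, yielding a brain science knowledge engine that supports cross-species brain cognition research; the key problems encountered and the strategies for addressing them are summarized. Through the engine's online service, the work connects with neuroscience researchers, and users can access the system over the Internet at any time and from anywhere, promoting the development of brain science research. Finally, 25 cross-species brain function maps are drawn, covering cognitive functions such as "working memory", "navigation", "olfaction", and "social relationships" in different species. |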
Other Abstract | Humanity has never stopped exploring the brain. The brain is considered the source of human intelligence and the most complex and mysterious organ of the human body. This organ plays a vital role in the lives of various organisms. With the development of information technology, researchers have more and more methods to study the structure and function of the brain and the major questions in brain science. Existing methods mainly use neuroimaging and biological experiments to observe brain activity and study the relationship between the brain and cognitive function, but these methods are costly and time-consuming. Researchers have accumulated a massive body of results in brain science: more than two million scientific publications relate to the field. This literature has not been fully utilized, and there are few studies on cross-species brain science. This thesis is the first to study the automatic construction of a cross-species brain cognitive knowledge graph from large-scale neuroscience literature, which can provide practical support for brain-inspired intelligence and brain science research. The research has theoretical significance and practical value for improving the efficiency of knowledge acquisition, obtaining innovative scientific discoveries, reducing manual work, and promoting the development of brain science.
1. A novel method for classifying neuroscience literature by species based on the span of interest. Brain science research covers various species, but a large amount of brain science knowledge is not organized by species and is instead mixed together. This makes it difficult for researchers and literature analysis systems to tell which species a piece of knowledge originates from and to conduct cross-species research. This thesis finds that about half of the abstracts do not mention a species. The task is a multi-label classification problem. Existing methods usually encode a document into a single vector for classification. Their limitation is that a threshold is needed to determine the final label subset, and they are not flexible enough, because different species are usually derived from different parts of a document. This thesis proposes a generative model that adaptively generates species labels by attending to spans of interest in the document, enabling each label to emphasize a different part of the document. In addition, it proposes a hierarchical attentive decoding mechanism, which integrates the discourse section structure of the document and achieves a significant improvement. Three species classification datasets in the field of brain science are annotated, and two sets of corpus annotation standards are proposed. The proposed method can distinguish whether a species is the main experimental subject, which alleviates the problem of cross-species knowledge acquisition.
2. A novel named entity recognition method based on augmenting the entity of interest. For biological experiments and brain science literature analysis, identifying terms is a foundational task. Deep learning-based methods have achieved good results in the general domain. However, most existing methods rely on a large amount of training data and use only fixed entity-context combinations. Brain science research faces a low-resource problem: annotating corpora in the field requires domain experts, which greatly increases the difficulty of data annotation. This thesis proposes a novel method that augments the entity of interest for named entity recognition, together with a bilateral neural network architecture and a training method. The proposed method improves model performance on 2 biomedical datasets and on datasets in different languages, especially on low-resource datasets. This thesis annotates a corpus for brain science term recognition covering six types of entities (cognitive function, brain region, brain disease, neuron, protein, neurotransmitter) and proposes a corpus annotation guideline. The proposed method also achieves an improvement on the brain science term recognition dataset.
3. A modular neural network relation extraction method based on pairs of entities of interest. Brain cognitive relation extraction aims to extract functional connections between brain regions from the brain science literature. Existing models have achieved good results in information extraction tasks. Brain science, however, is an exploratory field: many connections between brain regions remain unsettled and many mechanisms remain unknown, so existing research paradigms are limited in their application to the field. The limitation of pipeline methods is that they separate entity recognition from relation extraction, so the subtasks cannot be learned jointly. This thesis proposes a cascaded inference learning method and a shared representation mechanism to achieve end-to-end joint entity and relation extraction. A self-attention mechanism is used to model the relation representation based on pairs of entities of interest. The approach achieves improvements on 4 public datasets. This thesis also summarizes four common types of relation extraction protocols and proposes a modular neural network in which four information flows adaptively integrate the different protocols. Based on existing knowledge, it uses distant supervision to annotate a brain cognitive relation extraction dataset, and the proposed method achieves an improvement on that dataset as well.
|
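Keyword | brain science, cross-species, neuroinformatics, knowledge graph, knowledge engine, span of interest, corpus annotation, term recognition, relation extraction, deep learning |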
Language | Chinese |
Sub direction classification | Knowledge representation and reasoning |
Document Type | Thesis |
Identifier | http://ir.ia.ac.cn/handle/173211/39296 |
Collection | Graduates_Doctoral Dissertations |
Recommended Citation GB/T 7714 | 朱洪银. 基于神经科学文献感兴趣片段的跨物种脑认知知识图谱自动构建[D]. 中国科学院自动化研究所. 中国科学院大学,2020. |
Files in This Item: | ||||||
File Name/Size | DocType | Version | Access | License | ||
Thesis.pdf(8378KB) | Thesis | | Restricted access | CC BY-NC-SA |