Research on Machine Learning Models and Algorithms for Medical Consultation Big Data

	Research on Machine Learning Models and Algorithms for Medical Consultation Big Data
	张似衡
	2020-08-25
Pages	126
Degree type	Doctoral
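Chinese abstract (English translation)	With the rapid growth of the "Internet + Healthcare" industry, the number of users in online medical question answering communities keeps rising, and their growing demand far exceeds the workload that human doctors can handle. Using machine learning algorithms to model the question answering mechanism from medical consultation big data, and thereby enable intelligent assisted consultation, has therefore become an important research topic. However, owing to users' non-standard expressions, the scarcity of annotated corpora, and insufficient interpretability in medical scenarios, existing intelligent consultation techniques generally suffer from low accuracy and poor robustness. This gives rise to an important machine learning problem: how to use medical consultation big data, which has low value density and a high noise level, to solve the problem of representing language and knowledge, and thus improve the semantic understanding of intelligent consultation systems. To this end, intelligent consultation technology must break through three bottlenecks: first, linguistic structure is hard to exploit in the semantic feature representation of medical text; second, colloquial expressions cannot be aligned with medical terminology when the system interprets the chief complaint; third, relational knowledge is missing from medical knowledge graphs. To address these three problems, and taking into account the medical nature of online consultation and the characteristics of its corpora, this thesis focuses on three key underlying technologies of medical consultation, namely text feature representation, colloquial entity extraction, and knowledge graph completion, and builds an intelligent consultation system adapted to the characteristics of the medical domain. The main contributions of this thesis are:
1. A convolutional neural network text feature representation model that embeds dependency syntax. To address the mismatch between sequence models and the recursive nature of natural language, a weight layer based on the dependency syntax tree is introduced; it maps the depth of each word in the syntax tree to a weight on its word vector, implicitly fusing syntactic structure information. On this basis, a convolutional neural network extracts semantic features to form the text representation. The model keeps the advantage of parallel computation and requires no fine-grained word-level annotation. In addition, the proposed model can be extended to different tasks such as text classification, synonymy judgment, and text pair ranking. Validated on each of these tasks, it outperforms the current state-of-the-art models, and the learned word weights are consistent with human linguistic cognition.
2. A two-stage "candidate-deletion" symptom entity extraction method. To address the difficulties of insufficient context, word form variation, and missing annotation in colloquial questions, a cross-attention network is proposed; by matching associations between question-answer pairs, it learns the attention distribution of human doctors over users' questions and extracts candidate symptom entities. On this basis, a semantic cluster filtering model is proposed, which clusters known entities to determine the centers and boundaries of semantic clusters and then removes outliers among the candidate entities. In addition, a training set synthesized by automatic machine annotation is designed for training statistical learning models; it effectively combines the advantages of dictionaries and statistical learning and improves the generalization performance of symptom entity extraction.
3. A topology-adaptive knowledge graph embedding method. To address the inability of the translational distance family of models to capture cyclic structures and link density, the thesis analyzes, from a generalized perspective, the constraints that cyclic structures impose on the semantic differences of entities at different positions, and proves that knowledge uncertainty is equivalent to an adaptive margin between positive and negative triples. On this basis, a position-sensitive self-attention model is proposed, which improves the representation capacity of the embedding model by distinguishing the semantics of head and tail entities and introduces a self-attention mechanism to score knowledge triples; applied to various existing models, it yields substantial improvements. Furthermore, three simplified adaptive-margin models are proposed; through covariance matrix decomposition, the margin can be adjusted according to the link density of the knowledge, which simplifies the Gaussian embedding model while improving its representation capacity. In comparisons with the current state-of-the-art models, the proposed models achieve comparable or better performance.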
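The depth-to-weight idea in contribution 1 can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the toy sentence, the two-dimensional vectors, and the hand-set table `depth_to_weight` (standing in for the learned weight layer) are all assumptions made for illustration, and the convolutional feature extractor that would follow is omitted.

```python
from typing import List, Sequence


def token_depths(heads: Sequence[int]) -> List[int]:
    """Return each token's depth in a dependency tree.

    heads[i] is the index of token i's head; the root token has head -1.
    """
    depths = []
    for i in range(len(heads)):
        depth, node = 0, i
        while heads[node] != -1:
            node = heads[node]
            depth += 1
            if depth > len(heads):  # guard against malformed (cyclic) input
                raise ValueError("head array contains a cycle")
        depths.append(depth)
    return depths


def weight_by_depth(
    vectors: Sequence[Sequence[float]],
    depths: Sequence[int],
    depth_to_weight: Sequence[float],
) -> List[List[float]]:
    """Scale each word vector by the weight assigned to its tree depth.

    depth_to_weight is a hypothetical lookup table standing in for the
    learned weight layer; depths beyond the table reuse its last entry.
    """
    last = len(depth_to_weight) - 1
    return [
        [depth_to_weight[min(d, last)] * x for x in vec]
        for vec, d in zip(vectors, depths)
    ]


# Toy sentence "head hurts badly": "hurts" is the root, the others attach to it.
heads = [1, -1, 1]
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # made-up 2-d word vectors
depths = token_depths(heads)
weighted = weight_by_depth(vectors, depths, depth_to_weight=[1.0, 0.5])
print(depths)
print(weighted)
```

In this toy setting the root keeps its full vector while its dependents are halved. In a trained model the weights would be learned jointly with the downstream task rather than fixed by hand.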
English abstract	With the rapid development of the 'E-Health' industry, the number of users in online medical communities is increasing day by day, and the growing demand far exceeds the workload of human doctors. Hence, using machine learning algorithms to learn from medical big data and build an intelligent question answering system has become an important research topic. However, due to users' non-standard expressions, the scarcity of annotations, and the high standard of interpretability required in medical scenarios, existing methods generally suffer from poor accuracy and low robustness. This leads to an important machine learning problem: how to effectively use the medical question answering corpus, which has low value density and a high noise level, to solve the problem of representing language and knowledge and to improve the level of semantic understanding. Research on medical question answering methods therefore needs to break through three bottlenecks: first, semantic feature representations of medical text do not utilize the structural information of language; second, when understanding the chief complaint, it is difficult to align users' colloquial language with medical terminology during semantic analysis; third, some relational knowledge is missing from existing medical knowledge graphs. To address these problems, and considering the attributes of the online medical corpus, this thesis focuses on three key technologies of medical question answering (text semantic representation, colloquial entity extraction, and knowledge graph completion) and constructs an intelligent medical question answering system adapted to the characteristics of the medical field. The main innovations of this thesis are as follows:
1. This thesis proposes a short text representation model that embeds dependency parsing. To address the mismatch between convolutional neural networks and the recursive property of natural language, a weight layer based on the dependency syntax tree is introduced to map the depth of a word in the syntax tree to the weight of its word vector, implicitly fusing syntactic structure information. On this basis, a convolutional neural network is used to extract semantic features and form the text representation. While maintaining the advantages of parallel computing, the model does not need any fine-grained word-level annotation. In addition, the proposed model can be extended to different tasks, such as text classification, duplicate classification, and text pair ranking. Experiments on each task confirm that the proposed model outperforms the state-of-the-art models, and that the learned word weights accord with human cognition.
2. This thesis proposes a 'proposing-rejecting' two-stage symptom entity extraction method. To address the difficulties of insufficient context, word form changes, and lack of annotations in colloquial queries, a cross-attention network is proposed to model the attention distribution of human doctors over users' queries and to propose candidate symptom entities through the associations between query-answer pairs. On this basis, a semantic-cluster-based filtering model is proposed to determine the cluster centers and boundaries of the existing entities, which are then used to reject outliers among the candidate entities. In addition, this thesis designs an automatically annotated training set for training statistical learning models, effectively combining the advantages of dictionaries and statistical learning to improve generalization performance on the symptom entity extraction task.
3. This thesis proposes a topology-adaptive knowledge graph embedding method. To address the failure of the family of translational distance models on cyclic structures and link density in knowledge graphs, this thesis analyzes the semantic difference of entities in different positions from a generalized viewpoint, and proves that knowledge uncertainty is equivalent to an adaptive margin between positive and negative triplets. On this basis, a position-sensitive model with a self-attention scoring block is proposed; it improves the representation capacity of the embedding model through the semantic distinction between head and tail entities. The self-attention block is introduced to score knowledge triplets, and it greatly improves the performance of each existing model to which it is applied. In addition, three simplified models with adaptive margins are proposed. By decomposing the covariance matrices, the margin can be adapted to the link density of the knowledge, so that the Gaussian embedding model is simplified while its representation capacity is greatly improved. Experiments confirm that the proposed models achieve performance higher than, or at least comparable to, that of the state-of-the-art models.
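The role of the margin in contribution 3 can be seen in a small sketch of a translational distance score with a margin-based ranking loss. This is only an illustration under stated assumptions: the abstract does not say how the adaptive margin is computed, so the sketch takes the margin as a per-triplet input, and the function names and toy embeddings are hypothetical.

```python
from typing import Sequence


def transe_distance(
    head: Sequence[float], relation: Sequence[float], tail: Sequence[float]
) -> float:
    """L1 translational distance ||h + r - t||_1, as in TransE-style models."""
    return sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))


def margin_ranking_loss(pos_dist: float, neg_dist: float, margin: float) -> float:
    """Hinge loss that pushes the negative triplet at least `margin` farther
    away than the positive one. With a fixed margin every triplet is treated
    alike; an adaptive scheme would pass a different margin per triplet.
    """
    return max(0.0, margin + pos_dist - neg_dist)


# Made-up 2-d embeddings: the true tail satisfies h + r = t exactly.
h, r, t = [0.0, 0.0], [1.0, 0.0], [1.0, 0.0]
t_corrupt = [0.0, 1.0]  # corrupted tail used as the negative sample

pos = transe_distance(h, r, t)
neg = transe_distance(h, r, t_corrupt)
loss_small_margin = margin_ranking_loss(pos, neg, margin=1.0)
loss_large_margin = margin_ranking_loss(pos, neg, margin=3.0)
print(pos, neg, loss_small_margin, loss_large_margin)
```

The same pair of triplets incurs no loss under the smaller margin and a positive loss under the larger one, which is why letting the margin vary per triplet changes what the embedding is trained to separate.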
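Keywords	medical consultation; machine learning; convolutional neural network; attention mechanism; knowledge reasoning
Language	Chinese
Funding project	Natural Science Foundation of China [61432008]
Seven major directions: sub-direction classification	Artificial intelligence + healthcare
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/40401
Collection	Graduates_Doctoral dissertations
Recommended citation (GB/T 7714)	张似衡. 医疗问诊大数据机器学习模型与算法研究[D]. 中国科学院自动化研究所. 中国科学院大学,2020.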

Files in this item
File name/size	Document type	Version type	Access type	License
张似衡-医疗问诊大数据机器学习模型与算法 (3564KB)	Thesis		Restricted access	CC BY-NC-SA