Research on Diagnosis Prediction Based on Electronic Health Records

	Research on Diagnosis Prediction Based on Electronic Health Records
	王礼萍
	2023-05-19
Pages	82
Degree type	Master's
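Abstract (translated from Chinese)	Healthcare institutions accumulate large volumes of electronic health record (EHR) data while delivering medical services, and mining and analyzing these data has significant academic and social value. Diagnosis prediction is the most central task in EHR data mining: given a patient's historical visit records, the goal is to predict the symptoms likely to appear over a future period, so that doctors can intervene early and take appropriate medical measures. Deep learning-based diagnosis prediction models have achieved some success, but they still depend on large amounts of labeled data, predict unstably, and lack interpretability. This thesis addresses these problems from two directions. First, to handle the confounding of multiple causes in EHR data, we propose a multi-cause network that captures several causes simultaneously; to handle unstable prediction and poor out-of-distribution generalization, we combine the multi-cause network with the stable learning framework and propose a stable multi-cause diagnosis prediction model. Second, because existing deep diagnosis prediction models need large amounts of labeled data and produce predictions that lack interpretability, we propose enhancing them with a medical knowledge graph, which reduces the dependence on labeled data and yields interpretable predictions. The main contributions of this thesis are the following two points.
(1) Stable multi-cause diagnosis prediction model. In deep diagnosis prediction models, information about a patient's different diseases is mixed together in a single vector representation, which makes precise diagnosis prediction difficult. Some existing methods use multi-head attention to capture several aspects of disease progression at once, but without appropriate regularization the cause representations these models learn are usually highly correlated, which limits their expressive power. This thesis therefore designs a dedicated multi-cause network. Introducing multiple causes, however, aggravates the problem of unstable prediction. To improve predictive performance, a model fully exploits the statistical correlations in the training data; when the test distribution shifts away from the training distribution, the estimation and learning of the model parameters become inaccurate and predictions on the test set degrade. To address this, the thesis introduces the Hilbert-Schmidt independence criterion to measure the independence among the multiple vector representations the model produces. In addition, inspired by sample reweighting, the thesis designs a cause-correlation regularization term to estimate sample weights, so that the cause vectors learned on the reweighted training samples are decorrelated. This improves out-of-distribution generalization, so that the model achieves consistent predictive performance on arbitrary unknown test data. Experiments on the large-scale public dataset MIMIC-III demonstrate the effectiveness of the proposed method.
(2) Knowledge-enhanced interpretable diagnosis prediction model. Deep learning-based diagnosis prediction models have a large number of parameters that must be learned by gradient descent, so they usually need large amounts of labeled training data, and prediction accuracy drops markedly when the training set is small. Moreover, the occurrence of disease symptoms follows a long-tailed distribution: most symptoms appear rarely in the training set and are predicted poorly. To overcome these two shortcomings, the thesis proposes using a large-scale medical knowledge graph to assist diagnosis prediction and, based on the characteristics of medical knowledge graphs and of the diagnosis prediction setting, designs a state-aware hierarchical relation attention neural network that selectively uses the medical knowledge in the graph according to the patient's current state. Introducing the knowledge graph not only reduces the dependence on labeled data but also explains the predictions through the attention mechanism. The proposed model is designed as a pluggable general-purpose module that can be combined with various sequential prediction models. Experiments on two public datasets show that the proposed model improves the performance of a range of prediction models and outperforms existing diagnosis prediction models that also use medical knowledge graphs. Ablation experiments further verify the soundness of the model design and the effectiveness of each module.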
English abstract	Healthcare institutions have accumulated a large amount of electronic health record (EHR) data while providing medical services, and mining and analyzing these data has important academic and social value. Diagnosis prediction is the most central task in EHR data mining: it aims to predict the symptoms a patient may develop over a future period from the patient's historical medical records, so that doctors can intervene early and take appropriate medical measures. Although deep learning-based diagnosis prediction models have achieved some success, they still rely on large amounts of labeled data, predict unstably, and lack interpretability. This thesis addresses these problems from two directions. First, to address the confounding of multiple causes in EHR data, we propose a multi-cause network that captures multiple causes simultaneously; to tackle unstable prediction and poor out-of-distribution generalization, we combine the multi-cause network with a stable learning framework, yielding a stable multi-cause diagnosis prediction model. Second, because existing deep diagnosis prediction models require large amounts of labeled data and produce predictions that lack interpretability, we propose enhancing them with medical knowledge graphs, which reduces the reliance on labeled data and provides interpretable predictions. The main contributions of this thesis are summarized as follows.
(1) Stable multi-cause diagnosis prediction model. In deep diagnosis prediction models, information about a patient's different diseases is mixed in a single vector representation, which makes accurate diagnosis prediction difficult. Some existing methods use multi-head attention to capture multiple aspects of disease development simultaneously. However, without appropriate regularization constraints, the cause representations obtained by such models are usually highly correlated, which limits their expressive power. We therefore designed a multi-cause network, but introducing multiple causes exacerbates the problem of unstable prediction. To improve prediction performance, a model fully exploits the statistical correlations in the training data; when the distribution of the test set shifts from that of the training set, the model parameters are estimated and learned inaccurately, and predictions on the test set are poor. We therefore introduced the Hilbert-Schmidt independence criterion to measure the independence of the multiple vector representations obtained by the model. Inspired by sample reweighting, we also designed a cause-correlation regularization term to estimate sample weights, so that the cause vectors learned on the reweighted training samples are decorrelated. This improves the out-of-distribution generalization of the model, so that it achieves consistent prediction performance on any unknown test data.
(2) Knowledge-enhanced interpretable diagnosis prediction model. Deep learning-based diagnosis prediction models have a large number of parameters that must be learned and updated by gradient descent, which typically requires a large amount of labeled training data; when the training set is small, prediction accuracy decreases significantly. In addition, the occurrence of disease symptoms follows a long-tailed distribution: most symptoms occur infrequently in the training set and are predicted poorly. To overcome these limitations, this thesis proposes using a large-scale medical knowledge graph to assist diagnosis prediction. A state-aware hierarchical relation attention neural network is designed based on the characteristics of medical knowledge graphs and of diagnosis prediction scenarios; it selectively uses the medical knowledge contained in the knowledge graph according to the patient's state. Introducing the knowledge graph not only reduces the dependence on labeled data but also makes the predictions interpretable through the attention mechanism. The model is designed as a plug-and-play general module that can be combined with various time-series prediction models. Experimental results on two publicly available datasets demonstrate that the proposed model improves the prediction performance of various prediction models, and ablation experiments verify the rationality of the model design and the effectiveness of each module.
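Keywords	electronic health record data; diagnosis prediction; interpretability; knowledge graph; stable learning; graph neural network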
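Subject area	Artificial intelligence; Other artificial intelligence disciplines
Discipline category	Engineering::Computer Science and Technology (engineering or science degrees may be conferred)
Language	Chinese
Seven major directions: sub-direction classification	Data mining
State Key Laboratory planned direction classification	Explainable artificial intelligence
Associated dataset requiring deposit	No
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/51728
Collection	Graduates_Master's Theses; Pattern Recognition Laboratory
Recommended citation (GB/T 7714)	王礼萍. 基于电子健康记录的诊断预测问题研究[D],2023.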
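
The abstract states that the Hilbert-Schmidt independence criterion (HSIC) is used to measure the independence of the model's multiple vector representations. As a rough illustration only, the sketch below computes the standard biased empirical HSIC estimate between two sets of sample vectors. The Gaussian kernel, the bandwidth `sigma`, the function names, and the toy data are all assumptions made for this example; the record does not describe the thesis's kernel choice, its sample-weight estimation, or its regularization term, and none of those are reproduced here.

```python
import math
import random


def rbf_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel over a list of equal-length vectors."""
    n = len(xs)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(xs[i], xs[j]))
            gram[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return gram


def center(gram):
    """Double-center a symmetric Gram matrix, i.e. compute H K H with H = I - (1/n) 11^T."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    grand_mean = sum(row_means) / n
    return [
        [gram[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2 for paired samples xs, ys."""
    n = len(xs)
    k_centered = center(rbf_gram(xs, sigma))
    l_gram = rbf_gram(ys, sigma)
    # trace((H K H) L) equals the elementwise sum because both matrices are symmetric.
    total = sum(k_centered[i][j] * l_gram[i][j] for i in range(n) for j in range(n))
    return total / (n - 1) ** 2


# Toy data (hypothetical): one-dimensional stand-ins for learned cause vectors.
random.seed(0)
cause_a = [[random.gauss(0.0, 1.0)] for _ in range(30)]
cause_b = [[a[0] + 0.1 * random.gauss(0.0, 1.0)] for a in cause_a]  # depends on cause_a
cause_c = [[random.gauss(0.0, 1.0)] for _ in range(30)]  # drawn independently

print("dependent pair:  ", hsic(cause_a, cause_b))
print("independent pair:", hsic(cause_a, cause_c))
```

A larger value indicates stronger statistical dependence between the two sets of vectors, so a penalty built on such a quantity would push representations toward independence. How the thesis combines this measure with sample reweighting is not specified in this record.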

Files in this item
File name/size	Document type	Version type	Access type	License
201928014629010王礼萍 (7552KB)	Thesis		Restricted access	CC BY-NC-SA