Research on Models and Applications of Semantic Representation for Medical Text and Physiological Big Data
Niu Jinghao (牛景昊)
2020-05
Pages: 101
Degree type: Doctoral
Chinese Abstract (English translation)

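With the rapid development of modern medical testing technology and health service scenarios, medical big data has become an important national strategic resource with broad application prospects. A new generation of machine learning algorithms, represented by deep learning, has achieved many breakthroughs in analyzing and using medical big data, and the ability to learn feature representations autonomously from data is the key to these models' success. Transforming medical big data into high-quality representations in a semantic space builds a bridge from raw data to practical clinical applications and provides strong support for downstream applications such as assisted diagnosis, case retrieval, and medical question answering. At the same time, modern evidence-based medicine emphasizes judgments grounded in existing clinical evidence, yet this evidence is often buried in massive, cluttered medical big data. Learning hierarchical, abstract semantic representations of the raw data can link medical big data and physicians' cognition in both directions, thereby improving the efficiency and reliability of the clinical decision-making process.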

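This thesis focuses on three common challenges in medical text and physiological big data: diversity of expression, multimodality, and sparsity. It studies how to use medical big data to build semantic representation models adapted to the characteristics of the domain, addressing tasks and challenges in real-world scenarios. Specifically, to handle divergent expressions in medical text data, it proposes a text semantic representation model based on a character-level convolutional neural network and position-dependent attention, applied to the medical concept standardization task. To address the multimodal challenge facing semantic representation of medical data, it extends the text semantic representation model to the physiological-signal modality and proposes a sequence semantic representation model based on symbolic embedding fusion, applied to the ECG signal analysis task. To address the sparsity caused by missing medical knowledge and high annotation costs in medical big data, it proposes a knowledge semantic representation model that uses topological structure features for sampling and weighted training, applied to the medical knowledge graph completion task. The main work and innovations of this thesis are as follows: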

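(1) A character-level, multi-task, attention-based semantic representation model for medical text is proposed. To address noisy words in non-standard colloquial medical expressions, the model introduces character-level text encoding into the medical concept standardization task, effectively alleviating the loss of semantic information for out-of-vocabulary words. Using a multi-task learning framework, the model is trained with the glyph (character-form) information in concept labels as additional supervision, a new way of incorporating a glyph prior of the target concept into the character semantic representation. The model also contains a network structure that generates character-level attention over relevance to medical concepts, and position-wise attention weighting improves performance on the medical concept standardization task. Experiments on three medical text datasets built from patient expressions verify the effectiveness of these innovations, and additional experiments with four simulated types of common character-level noise verify the robustness of the model's structure and performance.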

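(2) A semantic representation model for medical physiological signals based on symbolic embedding fusion is proposed. Targeting the waveform and rhythm characteristics of physiological signals within a cycle, the model uses a symbolization method that includes baseline correction and aligns different signals through shared amplitude semantics, alleviating the effect of inter-patient bias on physiological signal data. Drawing on diagnostic-criterion knowledge about physiological signals, the model uses a sparse embedding strategy tied to characteristic wave segments, helping semantic representation learning focus on the sequence ranges with higher medical discriminative value. By fusing multi-channel semantic representations, the model offers a new framework for physiological signal semantic representation and opens a new route for transferring symbol-sequence processing methods to physiological signal analysis tasks. In experiments using ECG signals as the example, the proposed model outperforms both existing expert-feature representations and deep learning representation models, and case analysis verifies the effectiveness of the relevant innovations.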

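(3) A semantic representation model for medical knowledge that incorporates topological structure features is proposed. To address the scarcity of negative knowledge samples and the high cost of labeling in the medical domain, the model adopts a positive and unlabeled learning framework with a negative-sample filtering method: using information provided by the graph's topological structure, it iteratively discriminates and screens candidates in the negative sampling pool, improving the reliability of negative sampling for the knowledge graph completion task. Based on the topological distance between sample pairs, the model uses a cost-sensitive loss function weighted with the help of graph topological features, assigning different importance weights to positive and negative sample pairs. Experiments on two general-purpose databases that include medical concept categories, WN18RR and FB15K-237, and on the standard medical concept knowledge base UMLS verify that the proposed model can effectively use topological structure information to improve training.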

English Abstract

With the rapid development of modern medical testing technology and health service scenarios, medical big data has become an important national strategic resource with broad application prospects. The new generation of machine learning algorithms, represented by deep learning, has made many breakthroughs in analyzing and using medical big data, and the ability to learn features from the data is the key to the success of these models. Transforming medical big data into a high-quality representation in the semantic space builds a bridge from the original data to actual clinical applications and provides strong support for downstream applications such as auxiliary diagnosis, case retrieval, and medical Q&A. In addition, modern evidence-based medicine attaches great importance to analysis and judgment based on existing clinical evidence. However, much of this evidence is submerged in large volumes of cluttered medical big data. By learning hierarchical and abstract semantic representations of the original data, we can connect medical big data with doctors' conceptual cognition and so improve the efficiency and reliability of the clinical decision-making process.

This thesis focuses on three common challenges in medical big data: diversity, multimodality, and sparsity. It studies how to use medical big data to build and learn semantic representation models that adapt to the characteristics of the medical field and solve the tasks and challenges of real medical scenarios. Specifically, to address the challenges posed by non-standard colloquial expressions in medical text data, a text semantic representation model based on a character-level convolutional neural network and position-dependent attention is proposed under a multi-task learning framework and applied to the medical concept standardization task. To address the multimodal challenges of semantic representation for medical big data, the text semantic representation model is extended to physiological signals: a sequence semantic representation model based on symbolic embedding fusion is proposed and applied to the ECG signal analysis task. To address the challenges posed by unobserved knowledge and high labeling costs in medical big data, sampling and weighting strategies based on the topology of the medical knowledge graph are proposed, and the resulting knowledge semantic representation model is applied to the task of medical knowledge graph completion.

The main work and innovations of this thesis are as follows:

(1) A character-level multi-task attention model for the semantic representation of medical text is proposed. To address the challenge of noisy words in non-standard colloquial medical expressions, character-level text encoding is introduced into the medical concept standardization task, which effectively alleviates the loss of semantic information for out-of-vocabulary words. Under the multi-task learning framework, the glyph (character-form) information in the concept label is used as additional supervision, injecting a prior on the target concept's glyphs into the character semantic representation. A network structure is designed to generate character-level attention related to the medical concept, and position-wise attention weighting improves performance on the medical concept standardization task. Three medical text datasets built from real-world patient expressions are used to verify the effectiveness of these innovations, and four common types of character-level noise are simulated to verify the robustness of the model.
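The robustness check described above can be illustrated with a small sketch. The abstract does not name the four noise types, so the kinds used here (deletion, insertion, substitution, adjacent swap), the function name `perturb`, and the Latin alphabet are all assumptions made for illustration, not the thesis's actual protocol.

```python
import random

# Hypothetical noise kinds; the abstract does not say which four were used.
KINDS = ("delete", "insert", "substitute", "swap")


def perturb(text, kind, rng, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Apply one character-level perturbation of the given kind to text."""
    if kind not in KINDS:
        raise ValueError(f"unknown perturbation kind: {kind}")
    if len(text) < 2:
        return text  # too short to perturb meaningfully
    i = rng.randrange(len(text) - 1)  # position i and i + 1 are both valid
    if kind == "delete":
        return text[:i] + text[i + 1:]
    if kind == "insert":
        return text[:i] + rng.choice(alphabet) + text[i:]
    if kind == "substitute":
        return text[:i] + rng.choice(alphabet) + text[i + 1:]
    # kind == "swap": exchange two adjacent characters
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for kind in KINDS:
        print(kind, "->", perturb("stomach ache", kind, rng))
```

A robustness experiment in this style would perturb each test mention and compare accuracy before and after; for Chinese text the `alphabet` argument would need to be a set of Chinese characters instead.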

(2) A semantic representation model for physiological signals based on symbolic embedding fusion is proposed. Based on the waveform and rhythm characteristics of a physiological signal within a cycle, a symbolization method that includes baseline correction is designed; it aligns different signals by sharing amplitude semantics, which alleviates the effect of inter-patient bias on physiological signal data. Drawing on diagnostic-criterion knowledge about physiological signals, a sparse embedding strategy tied to characteristic wave bands is proposed to help semantic representation learning focus on the sequence ranges with higher medical discriminative value. Through the fusion of multi-channel semantic representations, a new framework for physiological signal semantic representation is proposed, which opens up a new way to transfer symbol-sequence processing methods to physiological signal analysis tasks. Experiments are conducted on an authoritative physiological signal dataset, using ECG signals as the example. The proposed model achieves better results than existing expert-feature representations and deep learning representation models, and case analysis verifies the effectiveness of the relevant innovations.
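As a rough illustration of symbolization with baseline correction, the sketch below subtracts a per-recording baseline and quantizes the corrected amplitudes into a shared symbol vocabulary. Using the median as the baseline, uniform amplitude bins, and the names `symbolize`, `bin_width`, and `n_symbols` are assumptions for this example; the abstract does not specify how the thesis performs either step.

```python
from statistics import median


def symbolize(signal, bin_width=1.0, n_symbols=32):
    """Map amplitudes to integer symbols shared across recordings.

    Assumed scheme: the median of the recording is taken as its baseline,
    and corrected amplitudes fall into uniform bins of width bin_width.
    """
    baseline = median(signal)  # assumed baseline estimate
    half = n_symbols // 2      # symbol index given to the baseline level
    symbols = []
    for x in signal:
        idx = int((x - baseline) // bin_width) + half
        # Clip so extreme amplitudes stay inside the vocabulary.
        symbols.append(min(max(idx, 0), n_symbols - 1))
    return symbols


if __name__ == "__main__":
    beat = [0.0, 0.5, 10.0, 0.5, 0.0]      # toy waveform, arbitrary units
    shifted = [x + 5.0 for x in beat]      # same shape, different offset
    print(symbolize(beat))
    print(symbolize(shifted))
```

Because the baseline is removed before quantization, the two toy recordings produce the same symbol sequence even though their raw offsets differ, which is the kind of alignment through shared amplitude semantics the paragraph describes. The resulting integer symbols could then be fed to an embedding layer in the same way as token ids in text models.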

(3) A knowledge semantic representation model with sampling and weighting based on topological structure is proposed. To address the lack of negative knowledge samples and the high cost of labeling in the medical field, a filtering method for negative sampling is designed under the framework of positive and unlabeled learning. Using the information provided by the topological structure of the graph, candidate samples in the negative sampling pool are screened and filtered iteratively to improve the reliability of sampling for the knowledge graph completion task. A cost-sensitive loss function is also designed: by calculating the distance between sample pairs in the topological structure, it assigns different importance weights to positive and negative sample pairs. Experiments are conducted on two general databases involving medical concepts, WN18RR and FB15K-237, and on one medical concept knowledge base, UMLS. The results verify that the proposed method uses topological structure information effectively to improve the training of the knowledge semantic representation model.
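One way to picture topology-aware filtering of negative samples is sketched below. It is a single-pass simplification built on an assumed rule: a corrupted tail entity is kept as a negative only if it is at least `min_hops` away from the head in the undirected graph, on the guess that nearby entities are more likely to be unobserved positives. The rule, the threshold, the function names, and the toy triples are hypothetical; the thesis's iterative filtering criterion is not given in the abstract.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def filter_negative_tails(triples, head, relation, candidates, min_hops=2):
    """Keep candidate tails that look like reliable negatives (assumed rule)."""
    adj = {}
    for h, _, t in triples:
        adj.setdefault(h, set()).add(t)
        adj.setdefault(t, set()).add(h)  # treat the graph as undirected
    known = set(triples)
    dist = bfs_distances(adj, head)
    kept = []
    for t in candidates:
        if (head, relation, t) in known:
            continue  # an observed positive is never a negative
        if dist.get(t, float("inf")) >= min_hops:
            kept.append(t)  # far away or unreachable from the head
    return kept


if __name__ == "__main__":
    toy = [
        ("aspirin", "treats", "headache"),
        ("headache", "symptom_of", "migraine"),
        ("insulin", "treats", "diabetes"),
    ]
    print(filter_negative_tails(
        toy, "aspirin", "treats",
        ["headache", "migraine", "diabetes", "aspirin"]))
```

The same hop distances could also serve as inputs to a cost-sensitive weight on each sample pair, although the exact weighting function used in the thesis is not stated here.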

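Keywords: medical big data; deep learning; medical text processing; physiological signal classification; semantic representation analysis
Language: Chinese
Seven major directions, sub-direction classification: Artificial intelligence + healthcare
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/39050
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems_Artificial Intelligence and Machine Learning (Yang Xuebing) - Technical Team
Recommended citation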
GB/T 7714
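Niu Jinghao. Research on Models and Applications of Semantic Representation for Medical Text and Physiological Big Data [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2020.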
Files in this item
File name/size: njh博士论文.pdf (2603KB); document type: thesis; access type: open access; license: CC BY-NC-SA