Research on Analysis Methods for Chinese Short Text Conversation Based on Representation Learning
周玉军1,2
Degree type: Doctor of Engineering
Supervisor: 徐波
2017-12-04
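Degree-granting institution: University of Chinese Academy of Sciences
Place of degree conferral: Beijing
Keywords: Chinese short text conversation; deep neural networks; representation learning; word/character embeddings; attention mechanism
Abstract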
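Conversation is everywhere, and the text it produces is rich in information: it carries the topics people discuss and also reflects speakers' emotions, attitudes, and opinions, which makes it widely useful. In recent years, deep neural networks have made notable progress in image classification, speech recognition, and other fields, demonstrating strong representation learning ability, and have since been applied to natural language processing. Text representation is the foundation of natural language processing, but traditional methods suffer from the curse of dimensionality and feature sparsity; representation learning based on deep neural networks can automatically learn low-dimensional, dense features from data and effectively alleviates these problems. Although conversational text resembles an ordinary document in form, in that both consist of multiple sentences in sequence, it also has structural characteristics that set it apart from ordinary documents. This dissertation therefore focuses on the characteristics of Chinese short text conversation and on the need to process massive volumes of short text in specific domains. Building on representation learning, it studies semantic representation methods for Chinese conversation and applies the learned conversational features to downstream tasks such as topic classification and sentiment analysis of Chinese conversation. The research and main contributions are as follows: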
 
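1. A Chinese short text representation method based on combining word embeddings and character embeddings. Every utterance in a Chinese conversation is a Chinese short text. To address the feature sparsity and word segmentation errors found in Chinese short text, we propose the C-LSTMs/BLSTMs model, which combines word and character embeddings. Experiments on Chinese short text classification datasets show that the combined word-character features have better semantic representation ability than word features or character features used alone. This is the first systematic exploration of the effectiveness of RNN-based representations built on combined Chinese word and character embeddings for Chinese short text classification, and the method outperforms multiple baselines.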
 
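2. An attention-based joint semantic representation method for Chinese short text. Building on the C-LSTMs/BLSTMs model, we propose HANs, an attention-based joint semantic representation method for Chinese short text. The method introduces an attention mechanism and uses CNN and RNN networks to automatically learn to select, from the word-embedding and character-embedding sequences of the input text, the key words or characters that are decisive for the text's semantics. Experiments on Chinese short text classification datasets show that HANs, with its attention mechanism, further improves semantic representation ability for Chinese short text.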
 
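3. A hierarchical joint semantic representation method for Chinese short text conversation. Taking the structure of conversational text into account, we propose H-HANs, a two-level (sentence-level and conversation-level) joint semantic representation method for Chinese short text conversation that effectively fuses speaker information with short text content. Experiments on a topic classification dataset for Chinese conversation show that H-HANs can automatically select from a Chinese conversation the key utterance features that determine the topic category of the whole conversation, and that it outperforms multiple baselines. In addition, the conversation-level topic classification dataset for Chinese built in this work will be released publicly for related research.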
 
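4. A sentiment analysis method for Chinese short text conversation. We build a sentiment analysis corpus for Chinese short text conversation, which will be released publicly for related research. Experiments on this corpus, covering sentiment polarity classification and fine-grained emotion category classification of Chinese conversation, further show that H-HANs learns semantic representations of Chinese short text conversation well: by learning hierarchical attention weights, it identifies the key sentences that are decisive for the semantics of the whole conversation and ultimately produces a unified semantic representation vector for the whole conversation.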
 
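In summary, for Chinese short text conversation, this dissertation combines deep neural networks with attention mechanisms to perform joint representation learning at the character, word, sentence, and conversation levels and, based on the learned unified semantic representation vector of a Chinese conversation, studies conversation-level NLP tasks such as topic classification and sentiment analysis. Experimental results show that, compared with existing methods, this work further improves performance on these tasks. We have also applied these results in a domain-specific short text analysis system that, based on massive short text data, provides short text classification, Chinese conversation topic classification, and sentiment analysis.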
Other abstract
Conversations are everywhere, and the resulting texts contain a wealth of information: they not only carry the topics that people talk about but also reflect the speakers' emotions, attitudes, and views, and so have a wide range of applications. Recently, deep neural networks have made significant progress in fields such as image analysis and speech recognition, showing a strong ability in representation learning, and have been applied to natural language processing (NLP). Text representation is the foundation of NLP; however, traditional methods have problems such as the curse of dimensionality and feature sparsity. Representation learning based on deep neural networks can automatically learn low-dimensional, dense features from data, which effectively alleviates these problems. Conversational texts are similar in form to general documents, i.e., they consist of many sentences in order; however, they have structural characteristics that differ from those of general documents. Therefore, the research in this dissertation focuses on the characteristics of Chinese short text conversation and, to meet the requirements of massive short text processing in specific domains, studies semantic representation methods for Chinese short text conversation based on representation learning. The learned conversational features can be used in subsequent NLP tasks, such as topic classification and sentiment analysis for Chinese conversation. The research and main contributions of this dissertation are as follows:
 
1. Research on compositional recurrent neural networks for Chinese short text classification. Each utterance in a conversation is a Chinese short text. In view of the feature sparsity and word segmentation errors found in Chinese short texts, we propose the C-LSTMs/BLSTMs model, which is based on the combination of word and character embeddings. Experimental results on datasets for Chinese short text classification show that features combining words and characters have better semantic representation ability than word or character features used separately. This work is the first to systematically explore the power of RNNs for Chinese short text classification with the composition of word and character embeddings, and it outperforms multiple baselines.
 
2. Research on hybrid attention networks for Chinese short text classification. Based on the C-LSTMs/BLSTMs model, we propose HANs, a hybrid semantic representation approach with an attention mechanism for Chinese short text. The HANs model combines the attention mechanism with RNN and CNN networks to automatically learn representative features, namely the key words or characters that determine the overall semantics of a short text, from the sequences of word and character embeddings respectively. Experimental results on datasets for Chinese short text classification show that the HANs model with the attention mechanism further improves semantic representation ability for Chinese short text.
 
3. Research on hierarchical hybrid attention networks for Chinese conversation topic classification. Considering the structural characteristics of conversational text, we propose H-HANs, a two-level (sentence-level and dialogue-level) hybrid semantic representation approach for Chinese short text conversation that effectively integrates speaker information with each utterance. Experimental results on datasets for Chinese conversation topic classification show that the H-HANs model can automatically select the key utterance features that determine the topic category of the whole conversation, and that it outperforms multiple baselines. In addition, the corpus for topic classification of Chinese dialogue will be released publicly for research purposes.
 
4. Research on sentiment analysis for Chinese short text conversation. Because annotated Chinese conversation corpora for sentiment analysis are lacking, we construct a new one; the corpus, with gold-standard annotations, will be released publicly for research purposes. Experimental results on the corpus further show that the H-HANs model can effectively learn semantic representations for Chinese short text conversation: by learning hierarchical attention weights, it identifies the critical sentences that play a decisive role in the overall semantics of the conversation and finally obtains a unified semantic representation of the conversation.
 
In summary, for Chinese short text conversation, this work combines deep neural networks (DNNs) with an attention mechanism and studies joint learning methods at multiple levels: character, word, sentence, and conversation. Based on the unified semantic representation of a Chinese conversation, NLP tasks such as topic classification and sentiment analysis can be carried out on the whole conversation. Experimental results indicate that, compared with existing methods, the proposed approaches further improve the performance of the related NLP tasks. We have also applied these research results to an analytic system for short texts in a specific domain; based on massive short text data, we have developed applications for short text classification, Chinese dialogue topic classification, and sentiment analysis.
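
The multi-level idea summarized in this abstract (attention over words and characters within an utterance, then attention over utterances within a conversation) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names `attention_pool`, `encode_utterance`, and `encode_conversation`, the fixed query vectors, and the dot-product scoring are all assumptions made for illustration, and the CNN/RNN encoders and speaker information described above are omitted.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors: Sequence[Vector], query: Vector) -> Tuple[Vector, List[float]]:
    """Weighted sum of `vectors`; weights are a softmax of dot products with `query`.

    Dot-product scoring against a fixed query is an assumed simplification;
    in a trained model the query would be a learned parameter.
    """
    scores = [sum(x * q for x, q in zip(vec, query)) for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def encode_utterance(word_vecs: Sequence[Vector], char_vecs: Sequence[Vector],
                     word_query: Vector, char_query: Vector) -> Vector:
    """Attend over words and characters separately, then concatenate the two summaries."""
    word_repr, _ = attention_pool(word_vecs, word_query)
    char_repr, _ = attention_pool(char_vecs, char_query)
    return word_repr + char_repr  # list concatenation: word part followed by char part


def encode_conversation(utterances: Sequence[Tuple[Sequence[Vector], Sequence[Vector]]],
                        word_query: Vector, char_query: Vector,
                        utterance_query: Vector) -> Tuple[Vector, List[float]]:
    """Second level: attend over utterance vectors to get one conversation vector."""
    utterance_vecs = [encode_utterance(w, c, word_query, char_query) for w, c in utterances]
    return attention_pool(utterance_vecs, utterance_query)


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; all values are made up for the demo.
    word_query, char_query = [1.0, 0.0], [0.0, 1.0]
    utterance_query = [1.0, 0.0, 0.0, 1.0]
    conversation = [
        ([[0.2, 0.1], [0.9, 0.3]], [[0.1, 0.4], [0.3, 0.8], [0.5, 0.2]]),
        ([[0.4, 0.6]], [[0.7, 0.1], [0.2, 0.2]]),
    ]
    vector, weights = encode_conversation(conversation, word_query, char_query, utterance_query)
    print("utterance weights:", weights)
    print("conversation vector:", vector)
```

The utterance-level weights returned by `encode_conversation` indicate which utterances contribute most to the conversation vector, which is the kind of signal the abstract describes when it speaks of identifying the critical sentences. The resulting vector would then feed a classifier for topic or sentiment.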
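Language: Chinese
Document type: Dissertation
Item identifier: http://ir.ia.ac.cn/handle/173211/15619
Collection: Graduates_Doctoral Dissertations
Author affiliations: 1. Institute of Automation, Chinese Academy of Sciences
2. University of Chinese Academy of Sciences
Recommended citation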
GB/T 7714
周玉军. 基于表示学习的中文短文本对话分析方法研究[D]. 北京. 中国科学院大学,2017.
Files in this item
File: [博士学位论文]基于表示学习的中文短文本 (2456KB); document type: dissertation; access: not yet open; license: CC BY-NC-SA