CASIA OpenIR > Graduates > Master Theses
基于深度学习的短文本分类研究 (Research on Short Text Classification Based on Deep Learning)
Author: 田俊
Date: 2016-05
Degree Type: Master of Engineering
Chinese Abstract

The rapid development of the Internet has greatly broadened the ways people access information in recent years. With the popularity of social media such as Weibo and Twitter, large volumes of short texts are emerging on the web, including tweets, film reviews, instant messages, and news headlines. Automatically classifying these massive short texts greatly facilitates information management and storage, and further supports tasks such as public opinion analysis and real-time hot topic analysis, making information easier to access and understand.

Traditional text classification systems were designed for long texts such as books, documents, and news, with core algorithms built on term frequencies and the vector space model. Directly applying long-text classification algorithms to short texts yields unsatisfactory results, mainly because short texts, constrained by their length, carry limited information, so traditional frequency-based algorithms do not transfer well to short text classification; moreover, existing algorithms tend to measure how well keywords in a document match a topic, lacking a holistic understanding of the text content.

In recent years, as research on deep learning has deepened, related methods have shown great advantages in speech, image, and text processing, achieving breakthroughs on core problems in each of these fields. Drawing on the strengths of deep learning, this thesis addresses the shortcomings of traditional algorithms on short text classification from three aspects. First, it improves the structure of the single-layer neural network: by comparing existing recurrent units such as Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), it identifies a unit suited to short text classification, and it improves the output of the recurrent network. Traditional methods take only the final output as the semantic representation of a short text; this thesis borrows the pooling idea from convolutional neural networks to fuse the forward and backward outputs of the recurrent network, yielding a better short text representation. Second, it optimizes the network's inputs and intermediate parameters by pre-training the input variables with word vectors and the network structure with autoencoders; comparative experiments show that this pre-training helps the network parameters converge, producing better classification results. Finally, it introduces an improved method for fusing multi-layer neural networks for short text classification: traditional deep networks simply stack layers, feeding each layer's output to the next, whereas this thesis draws on the gating idea in LSTM to improve the connections between layers of a multi-layer recurrent network, further refining the semantic representation of short texts. Experiments show that the improved multi-layer network outperforms the single-layer network.
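The autoencoder pre-training described above can be sketched as follows. This is a minimal NumPy illustration under assumed dimensions and a plain squared-error loss, not the thesis's implementation: a tied-weight linear autoencoder is trained by gradient descent, and the resulting weights would then initialize the corresponding layer of the classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 64 samples of 8-dimensional inputs (e.g. word vectors).
X = rng.standard_normal((64, 8))

# Tied-weight linear autoencoder: encode 8 -> 4 dims, decode back
# with the transpose of the same matrix.
W = rng.standard_normal((8, 4)) * 0.1

def loss(W):
    E = X @ W @ W.T - X          # reconstruction error
    return 0.5 * np.mean(E ** 2)

loss_before = loss(W)
lr = 0.02
for _ in range(1000):
    E = X @ W @ W.T - X
    # Gradient of the squared reconstruction error w.r.t. the
    # tied weight matrix W (both encoder and decoder terms).
    grad = (X.T @ E @ W + E.T @ X @ W) / len(X)
    W -= lr * grad
loss_after = loss(W)
# After pre-training, W reconstructs X noticeably better; these
# weights would seed the network before supervised fine-tuning.
```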
English Abstract

The rapid development of Internet technology has enriched the ways of accessing information in recent years. With the popularity of microblogs, Twitter, and other social media, many short texts, including tweets, film reviews, instant messages, and headlines, emerge every day. Building an automatic classification system for short texts can greatly facilitate information management and storage. Furthermore, tasks such as public opinion analysis and real-time hot topic analysis can be built on top of it, making information easier to obtain and understand.

Traditional text classification systems were designed to manage long texts such as books, documents, and news, and their core algorithms are built on term frequencies and vector space models. Simply applying long-text categorization algorithms to short texts does not yield good results: on the one hand, short texts contain limited information, so term frequency-based algorithms are ill-suited to short text categorization; on the other hand, existing algorithms tend to measure the match between keywords and a topic rather than build on a holistic understanding of the text.

In recent years, research on deep learning has advanced rapidly; related methods have shown great advantages in speech recognition, image processing, and natural language processing, and breakthroughs have been made on core problems in these areas. Leveraging the advantages of deep learning, this thesis addresses the shortcomings of traditional algorithms on short text classification in the following three parts.

First, we improve the structure of the single-layer neural network. By comparing existing recurrent units such as Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), we identify the most effective unit for short text classification. We also improve the outputs of the Recurrent Neural Network (RNN) and its variants: traditionally only the last output is used to classify a text, whereas we adopt the pooling method commonly used in Convolutional Neural Networks (CNNs) to merge all outputs of the RNN and obtain a better representation of a short text.

Next, we develop two ways to fine-tune the inputs and intermediate parameters of our model. One is to initialize the input of the neural network with pre-trained word vectors; the other is to pre-train the intermediate parameters of the network with an autoencoder. Our experiments show that these two methods help the model converge to a better optimum and thus considerably improve short text classification results.

Finally, we introduce an improved multi-layer recurrent neural network for short text classification. A traditional multi-layer network is built simply by feeding the output of one layer to the next, layer upon layer. Drawing on the gating idea in LSTM, we improve the connections between layers and thereby obtain a more expressive representation of short texts. Our experiments show that the improved multi-layer network outperforms the single-layer network.
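The output-fusion step in the first contribution can be sketched as follows. This is a minimal NumPy illustration, not the thesis code: it assumes a bidirectional recurrent encoder has already produced per-timestep hidden states `h_fwd` and `h_bwd` (shape T×H each), and contrasts the traditional last-state readout with CNN-style max-pooling over all timesteps.

```python
import numpy as np

def last_state_repr(h_fwd, h_bwd):
    """Traditional readout: concatenate only the final forward
    and final backward hidden states -> vector of length 2H."""
    return np.concatenate([h_fwd[-1], h_bwd[-1]])

def pooled_repr(h_fwd, h_bwd):
    """CNN-style readout: concatenate forward and backward states
    at every timestep (T x 2H), then max-pool over time so every
    position can contribute to the sentence vector."""
    h = np.concatenate([h_fwd, h_bwd], axis=1)  # (T, 2H)
    return h.max(axis=0)                        # (2H,)

# Toy example: T=4 timesteps, H=3 hidden units per direction.
rng = np.random.default_rng(0)
h_fwd = rng.standard_normal((4, 3))
h_bwd = rng.standard_normal((4, 3))

v_last = last_state_repr(h_fwd, h_bwd)
v_pool = pooled_repr(h_fwd, h_bwd)
print(v_last.shape, v_pool.shape)  # both (6,)
```

Both readouts produce a fixed-length vector regardless of text length; the pooled variant lets salient features from any position survive into the final representation.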

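The gated layer-to-layer connection in the final contribution can be illustrated with a highway-style gate. The specific parameterization below (a sigmoid gate with weights `W_g`, `b_g` interpolating between the lower layer's input and the current layer's output) is an assumption for illustration, not the thesis's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_layer_output(x, layer_out, W_g, b_g):
    """Instead of passing layer_out straight to the next layer,
    an LSTM-style gate g decides, per dimension, how much of the
    transformed output to keep and how much of the lower-layer
    input x to carry through unchanged."""
    g = sigmoid(x @ W_g + b_g)              # gate values in (0, 1)
    return g * layer_out + (1.0 - g) * x    # convex combination

# Toy example: hidden size H=4.
rng = np.random.default_rng(1)
x = rng.standard_normal(4)          # input from the layer below
layer_out = rng.standard_normal(4)  # this layer's raw output
W_g = rng.standard_normal((4, 4))
b_g = np.zeros(4)

y = gated_layer_output(x, layer_out, W_g, b_g)
# Each element of y lies between x and layer_out, so lower-layer
# information can flow upward even through many stacked layers.
```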
Keywords: Short Text Classification; Deep Learning; Recurrent Neural Network; Long Short-Term Memory
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/11711
Collection: Graduates_Master Theses
Affiliation: Institute of Automation, Chinese Academy of Sciences
First Author Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended Citation (GB/T 7714):
田俊. 基于深度学习的短文本分类研究[D]. 北京: 中国科学院研究生院, 2016.
Files in This Item:
File Name/Size | Document Type | Version | Access | License
基于深度学习的短文本分类研究-终稿.pd (2013KB) | Thesis | | Restricted | CC BY-NC-SA
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.