|Place of Conferral||Beijing|
|Keywords||machine translation; out-of-vocabulary words; self-attention neural network; disfluency detection; punctuation restoration|
Cross-language communication and information transmission are an essential foundation for frontier academic research and for sustainable enterprise development. The aim of speech translation research is therefore to help monolingual users achieve low-cost, fast, and high-quality communication across language barriers. Speech-to-speech translation consists of three components: automatic speech recognition (ASR), machine translation, and text-to-speech synthesis. In recent years, machine translation for spoken language has become both a difficulty and a hotspot of speech translation research. An automatic speech recognizer typically produces a word sequence without punctuation that may contain ellipsis, repetition, or even confusion, so non-canonical sentences frequently appear in speech translation. Rare out-of-vocabulary words are another difficulty, since they introduce ambiguity into machine translation. The performance of a speech translation system is therefore determined by the fault tolerance, comprehension ability, and adaptability of its translation models with respect to ASR output.
In this thesis, we focus on machine translation and spoken-language normalization, taking both normal text and ASR output as research objects. To improve speech translation quality, we concentrate on the following three problems: rare out-of-vocabulary (OOV) words in machine translation, punctuation restoration for ASR output, and disfluency detection in ASR output.
First, a class-specific copy neural network is proposed to alleviate the OOV problem, which has long been a difficulty in machine translation. We design a set of named-entity categories, including common named entities and number phrases, and collect a named-entity corpus for machine translation. To relieve the model of the burden of "understanding" rare words, the copy mechanism was previously proposed to handle rare and unseen words in attention-based neural network models. Its drawback, however, is that the model can only decide whether or not to copy; it cannot determine which class a rare word should be copied as, such as person, location, or organization. This thesis investigates this limitation of NMT models in depth and proposes a new NMT model that incorporates a class-specific copy network. With this network, the model can decide which class a target word belongs to and which class in the source it should be copied from. Experimental results on Chinese-English translation tasks show that the proposed model outperforms the traditional NMT model by a large margin, especially on sentences containing rare words.
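The class-specific mixing described above can be sketched as a single decoding step; the function below and its inputs (per-class mode scores, entity tags on source positions, the rule that an empty class simply loses its mass) are illustrative assumptions, not the thesis's exact formulation:

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

def class_specific_copy_step(gen_logits, copy_attn, mode_logits, src_tags):
    """One decoding step of a class-specific copy mechanism (sketch).

    gen_logits : scores over the target vocabulary
    copy_attn  : attention scores over source positions
    mode_logits: scores over modes [generate, copy-PER, copy-LOC, copy-ORG]
    src_tags   : entity tag per source position ('O', 'PER', 'LOC', 'ORG')
    Returns (p_vocab, p_copy): probability mass assigned to generating a
    vocabulary word and to copying each source position.
    """
    p_mode = softmax(mode_logits)
    p_vocab = p_mode[0] * softmax(gen_logits)
    p_copy = np.zeros(len(src_tags))
    for m, tag in enumerate(['PER', 'LOC', 'ORG'], start=1):
        # restrict the copy attention of class `tag` to positions of that class
        mask = np.array([t == tag for t in src_tags])
        if mask.any():
            scores = np.where(mask, np.asarray(copy_attn, dtype=float), -np.inf)
            p_copy += p_mode[m] * softmax(scores)
    return p_vocab, p_copy
```

The point of the sketch is the per-class masking: an ordinary copy mechanism would compute one attention over all source positions, whereas here each copy class can only place mass on source tokens of its own entity type.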
Second, a hybrid-attention character-level neural network framework is proposed for machine translation, which elegantly blends character attention with attention over internally composed words. In our work, a bidirectional Gated Recurrent Unit (GRU) network automatically composes word-level information from the input character sequence. In contrast to traditional NMT models, two different attentions are incorporated into the proposed model: a character-level attention that attends to the original input characters, and a word-level attention that attends to the automatically composed words. With these two attentions, the model encodes character-level and word-level information at the same time. We find that the composed word-level information is compatible with and complementary to the original character-level input. The experimental results show that the proposed method has a clear advantage in machine translation.
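The two-attention idea can be sketched as follows; mean pooling over character spans stands in for the thesis's BiGRU word composition, and the function name and shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hybrid_attention(query, char_states, word_spans):
    """Blend character-level and word-level attention (sketch).

    query      : (d,) decoder query vector
    char_states: (T, d) character encoder states
    word_spans : (start, end) character ranges of the composed words; mean
                 pooling stands in here for the BiGRU composition
    Returns the concatenation [character context; word context].
    """
    word_states = np.stack([char_states[s:e].mean(axis=0) for s, e in word_spans])
    a_char = softmax(char_states @ query)   # attention over raw characters
    a_word = softmax(word_states @ query)   # attention over composed words
    return np.concatenate([a_char @ char_states, a_word @ word_states])
```

Both contexts are computed from the same character encoder, so the word-level view adds information without requiring a separate word-segmented input.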
Third, a multi-task self-attention neural network is proposed for punctuation restoration. Traditional methods built on the sequence-labelling framework are weak at handling joint punctuation. To tackle this problem, we propose a novel multi-task network. The key difference between our model and the vanilla self-attention network lies in the final output layer of the decoder, where two softmax layers predict the label sequence and the word sequence separately. We conduct extensive experiments on complex punctuation tasks. The results show that the proposed model achieves significant improvements on the joint punctuation task while also remaining superior to traditional methods on the simple punctuation task. In addition, this thesis applies punctuation restoration to an in-house machine translation system, which noticeably improves its performance.
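The two-headed output layer can be sketched in a few lines; the single-head attention with identity projections and the label set in the comment are simplifying assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Single-head scaled dot-product self-attention (identity Q/K/V
    projections kept for brevity)."""
    A = softmax(X @ X.T / np.sqrt(X.shape[-1]), axis=-1)
    return A @ X

def punctuation_step(X, W_label, W_word):
    """Multi-task output layer: two softmax heads over one shared state,
    one predicting the punctuation-label sequence (e.g. O / COMMA /
    PERIOD / QUESTION), one predicting the word sequence."""
    H = self_attention(X)
    return softmax(H @ W_label, axis=-1), softmax(H @ W_word, axis=-1)
```

Sharing the encoder state between the two heads is what makes this multi-task: the word-prediction head acts as an auxiliary signal for the punctuation head.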
Finally, a semi-supervised model based on multi-task self-attention and weight sharing is proposed for disfluency detection. In this work, we view disfluency detection as a translation task. The multi-task self-attention network incorporates the word-sequence information and the labelling information at the same time, and a constrained decoding method is applied during testing. Experimental results show that the proposed model achieves significant improvements over strong baseline models. We are among the first to use a translation model for disfluency detection, and the approach can easily be applied to other sequence-labelling tasks. In addition, we utilize unlabelled corpora to enhance performance by introducing a weight-sharing strategy and generative adversarial training, which enforce similar distributions between labelled and unlabelled data. Experimental results show that the semi-supervised model further improves performance significantly.
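The translation view can be illustrated with a minimal keep-or-delete sketch: every output token must come from the input, in order, which is the constraint the decoder enforces. The per-token probabilities here are given by hand rather than produced by a trained model:

```python
def remove_disfluencies(tokens, disfl_prob, threshold=0.5):
    """Sketch of disfluency detection as 'translating' a disfluent
    utterance into its fluent form: the only operations are keep ('F')
    or delete ('D'), so the output is a subsequence of the input.

    disfl_prob gives an assumed per-token disfluency probability in
    place of a trained model's scores.
    """
    labels = ['D' if p > threshold else 'F' for p in disfl_prob]
    fluent = [t for t, l in zip(tokens, labels) if l == 'F']
    return labels, fluent
```

For example, in the classic repair "a flight to Boston uh I mean to Denver", the reparandum "to Boston" and the editing phrase "uh I mean" should be labelled disfluent, leaving "a flight to Denver".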
|First Author Affilication||Institute of Automation, Chinese Academy of Sciences|
|Wang Feng. Research on Key Technologies of Neural-Network-Based Speech Translation [D]. Beijing: University of Chinese Academy of Sciences, 2018.|