Language Preprocessing and Performance Optimization for Machine Translation

	Language Preprocessing and Performance Optimization for Machine Translation
	汪春奇1,2
	2018
Degree type	Master of Engineering
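Abstract (translated from Chinese)	In recent years, with the development of deep neural networks, neural machine translation models have been widely studied, and their translation quality is considerably better than that of traditional statistical machine translation. However, neural machine translation still faces many problems, such as named entity translation, low-resource translation, and decoding latency. In this thesis, we design models and methods to alleviate the problems that arise when building a machine translation system. The thesis centers on two directions. The first is language preprocessing: we expect that preprocessing the sentences on both sides of the translation (mainly word segmentation and named entity recognition) helps the translation system better understand their meaning. The second is performance optimization for machine translation, which has two aspects. One is translation quality: we use monolingual data to improve translation quality and reduce the dependence on parallel corpora. The other is translation speed: we design a new model that increases the parallelism of neural machine translation decoding and reduces decoding latency. The main contributions of this thesis are as follows:
1. We propose a sequence labeling model based on convolutional networks. Sequence labeling is one of the most basic tasks in natural language processing, and many natural language processing problems can be cast as sequence labeling problems. With the recent rise of neural networks, the application of recurrent neural networks to sequence labeling has received much attention. However, the structure of a recurrent neural network forces it to process a sentence one word at a time, which hinders parallel computation. Our convolutional network overcomes this obstacle. Experiments on Chinese and English named entity recognition show that our model achieves faster processing speed while also surpassing sequence labeling models based on recurrent neural networks in accuracy.
2. We design a Chinese word segmentation system that combines character-level and word-level information. The sequence labeling framework solves a range of natural language processing problems very efficiently, including Chinese word segmentation. However, a segmentation system based on sequence labeling cannot naturally incorporate word-level information. We design a novel method that uses complete word-level information in a sequence-labeling-based segmentation system. The method can also exploit large-scale unlabeled corpora, which yields a semi-supervised learning setup.
3. We propose a novel semi-supervised learning framework for neural machine translation. The conventional neural machine translation framework only models the conditional language model of the target sentence given the source sentence. We extend it with a unified framework that simultaneously models the target-side conditional language model, the unconditional source-side language model, and the target-side language model. In our framework, monolingual data on both the source and target sides can therefore be put to good use.
4. We propose a semi-autoregressive neural machine translation model. Conventional neural machine translation models are autoregressive, so only one word can be decoded per time step; when the target sentence is long, this takes a great deal of time. Our semi-autoregressive model breaks this limitation by producing several consecutive words at once, which makes better use of parallel computing hardware and significantly increases decoding speed while maintaining good translation quality.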
English abstract	In recent years, with the development of deep neural networks, machine translation models based on neural networks have been widely studied. Compared with traditional statistical machine translation, their translation performance has greatly improved. However, neural machine translation still faces many problems, such as named entity translation, low-resource translation, and translation latency. In this thesis, we aim to design models and methods to alleviate the problems faced during the construction of a machine translation system. The thesis focuses on two directions. One is language preprocessing. We expect that preprocessing the sentences on both sides of the translation (mainly word segmentation and named entity recognition) can help translation systems better understand the meaning of sentences. The other is performance optimization for machine translation, which includes two aspects. The first aspect is the optimization of translation quality. We use monolingual data to improve the quality of translation and ease the reliance on parallel corpora. The second aspect is the optimization of translation speed. We design a new model to increase the parallelizability of the decoding procedure and thus reduce the decoding latency. The main research results of this thesis are as follows:
1. We propose a sequence labeling framework based on convolutional networks. Sequence labeling is one of the most basic tasks in natural language processing. Many natural language processing problems can be transformed into sequence labeling problems. Recently, with the rise of neural networks, the application of recurrent neural networks to sequence labeling has attracted much attention. However, the structure of the recurrent neural network restricts it to processing sentences word by word, which hinders computational parallelism. Our proposed convolutional network overcomes this obstacle. In addition to running faster, it also surpasses the recurrent neural network in accuracy on the task of named entity recognition.
2. We design a Chinese word segmentation system that combines character-level and word-level information. The sequence labeling framework can solve a series of natural language processing problems with high efficiency, including Chinese word segmentation. However, a word segmentation system based on sequence labeling cannot naturally incorporate word-level information. We design a novel method that can use complete word-level information in a sequence-labeling-based segmentation system. At the same time, our method can also use large-scale unlabeled corpora, forming a semi-supervised learning setup.
3. We propose a novel semi-supervised learning method for neural machine translation. The standard neural machine translation model only models the conditional target-side language model (given a source sentence). We extend this into a unified framework that simultaneously models the conditional target-side language model, the unconditional source-side language model, and the target-side language model. Therefore, in our framework, the source and target monolingual data can be reasonably applied.
4. We propose a semi-autoregressive model for neural machine translation. The standard neural machine translation model is autoregressive, so only one word can be produced at each time step when decoding. If the target sentence is long, this process takes a lot of time. The semi-autoregressive model we propose breaks this limitation and produces multiple consecutive words at each time step, thus making better use of parallel computing hardware and speeding up the decoding procedure significantly while maintaining good translation quality.
Keywords	machine translation; sequence labeling; Chinese word segmentation; semi-supervised learning; semi-autoregressive
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/21169
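Collection	Graduates_Master's theses
Author affiliations	1. Institute of Automation, Chinese Academy of Sciences 2. University of Chinese Academy of Sciences
First author affiliation	Institute of Automation, Chinese Academy of Sciences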
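Recommended citation (GB/T 7714)	汪春奇 (Wang Chunqi). Language Preprocessing and Performance Optimization for Machine Translation [D]. Beijing. Graduate University of Chinese Academy of Sciences, 2018.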

Files in this item
File name/size	Document type	Version type	Access type	License
面向机器翻译的语言预处理与性能优化.pd (2217KB)	Thesis		Restricted access	CC BY-NC-SA