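Language Preprocessing and Performance Optimization for Machine Translation (面向机器翻译的语言预处理与性能优化)
汪春奇 (Wang Chunqi)1,2
Degree type: Master of Engineering
Advisor: 徐波 (Xu Bo)
2018
Degree-granting institution: Graduate University of Chinese Academy of Sciences
Place of degree conferral: Beijing
Keywords: machine translation; sequence labeling; Chinese word segmentation; semi-supervised learning; semi-autoregressive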
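Abstract
In recent years, with the development of deep neural networks, neural machine translation models have been studied extensively, and their translation performance is much better than that of traditional statistical machine translation. Nevertheless, neural machine translation still faces many problems, such as named entity translation, low-resource translation, and decoding latency. In this thesis, we aim to design models and methods that alleviate problems encountered when building machine translation systems. The work follows two directions. The first is language preprocessing: we expect that preprocessing the sentences on both sides of the translation (mainly word segmentation and named entity recognition) helps the translation system better understand their meaning. The second is performance optimization for machine translation, which has two aspects. One is translation quality: we use monolingual corpora to improve translation quality and reduce the dependence on parallel corpora. The other is translation speed: we design a new model that increases the parallelism of neural machine translation decoding and reduces decoding latency. The main contributions of this thesis are as follows:
1. We propose a sequence labeling model based on convolutional networks. Sequence labeling is one of the most basic tasks in natural language processing, and many natural language processing problems can be cast as sequence labeling problems. With the recent rise of neural networks, the application of recurrent neural networks to sequence labeling has attracted much attention. However, the structure of a recurrent neural network forces it to process a sentence one word at a time, which hinders parallel computation. The convolutional network we propose overcomes this obstacle. Experiments on Chinese and English named entity recognition show that our model is both faster and more accurate than sequence labeling models based on recurrent neural networks.
2. We design a Chinese word segmentation system that combines character-level and word-level information. The sequence labeling framework solves a range of natural language processing problems very efficiently, including Chinese word segmentation. However, a segmentation system based on sequence labeling cannot naturally incorporate word-level information. We design a novel method that uses complete word-level information within a sequence-labeling-based segmentation system. The method can also exploit large-scale unlabeled corpora, which yields a semi-supervised learning scheme.
3. We propose a novel semi-supervised learning framework for neural machine translation. The conventional neural machine translation framework models only the conditional language model of the target sentence given the source sentence. We extend it with a unified framework that simultaneously models the target-side conditional language model, an unconditional source-side language model, and a target-side language model. In our framework, source-side and target-side monolingual corpora can therefore also be put to good use.
4. We propose a semi-autoregressive neural machine translation model. Conventional neural machine translation models are autoregressive, so decoding produces only one word per time step, which takes a long time when the target sentence is long. Our semi-autoregressive model breaks this limitation by producing several consecutive words at a time. It makes better use of parallel computing hardware, so decoding becomes significantly faster while translation quality remains good.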
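The effect of emitting several words per step, as described in item 4, can be illustrated with a toy decoding loop. This is only a minimal sketch under stated assumptions: the record does not describe the thesis model's architecture, so `decode`, `make_toy_model`, `predict_block`, and the block size `k` are hypothetical names, and the "model" simply replays a fixed sentence. The point is the step count, not translation quality.

```python
import math
from typing import Callable, List, Tuple

EOS = "</s>"  # end-of-sentence marker (name chosen for this sketch)

def decode(predict_block: Callable[[List[str], int], List[str]],
           k: int, max_len: int = 100) -> Tuple[List[str], int]:
    """Greedy decoding that emits up to k consecutive tokens per step.

    k = 1 corresponds to ordinary autoregressive decoding; k > 1 is the
    semi-autoregressive case. Returns (tokens, number_of_model_calls).
    """
    output: List[str] = []
    steps = 0
    while len(output) < max_len:
        block = predict_block(output, k)  # one model call per step
        steps += 1
        for tok in block:
            if tok == EOS:
                return output, steps
            output.append(tok)
    return output, steps

def make_toy_model(target: List[str]) -> Callable[[List[str], int], List[str]]:
    """Hypothetical stand-in for a trained model: replays a fixed target."""
    padded = target + [EOS]

    def predict_block(prefix: List[str], k: int) -> List[str]:
        start = len(prefix)
        return padded[start:start + k]

    return predict_block

target = "we propose a semi autoregressive translation model".split()

for k in (1, 2, 4):
    tokens, steps = decode(make_toy_model(target), k)
    # Model calls needed: ceil((sentence length + 1 for EOS) / k)
    print(k, steps, math.ceil((len(target) + 1) / k))
```

For this 7-word sentence the loop needs 8, 4, and 2 model calls for `k` = 1, 2, and 4. In a real system each call would be one parallel forward pass, which is where the latency reduction would come from.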
Other Abstract
In recent years, with the development of deep neural networks, machine translation models based on neural networks have been widely studied. Compared with traditional statistical machine translation, translation performance has improved greatly. However, neural machine translation still faces many problems, such as named entity translation, low-resource translation, and translation latency. In this thesis, we aim to design models and methods that alleviate the problems faced when building a machine translation system. The thesis focuses on two directions. One is language preprocessing: we expect that preprocessing the sentences on both sides of the translation (mainly word segmentation and named entity recognition) can help translation systems better understand the meaning of sentences. The other is performance optimization for machine translation, which has two aspects. The first aspect is translation quality: we use monolingual data to improve the quality of translation and ease the reliance on parallel corpora. The second aspect is translation speed: we design a new model that increases the parallelizability of the decoding procedure and thus reduces decoding latency. The main research results of this thesis are as follows:
1. We propose a sequence labeling framework based on convolutional networks. Sequence labeling is one of the most basic tasks in natural language processing, and many natural language problems can be transformed into sequence labeling problems. Recently, with the rise of neural networks, the application of recurrent neural networks to sequence labeling has attracted much attention. However, the structure of a recurrent neural network restricts it to processing a sentence word by word, which hinders computational parallelism. Our proposed convolutional network overcomes this obstacle. In addition to running faster, it also surpasses recurrent-network-based sequence labeling models in accuracy on Chinese and English named entity recognition.
2. We design a Chinese word segmentation system that combines character-level and word-level information. The sequence labeling framework can solve a range of natural language processing problems with high efficiency, including Chinese word segmentation. However, a word segmentation system based on sequence labeling cannot naturally incorporate word-level information. We design a novel method that uses complete word-level information in a sequence-labeling-based segmentation system. At the same time, our method can use large-scale unlabeled corpora, which makes it a semi-supervised learning approach.
3. We propose a novel semi-supervised learning method for neural machine translation. The standard neural machine translation model only models the conditional language model of the target sentence given a source sentence. We extend this into a unified framework that simultaneously models the conditional target-side language model and the unconditional source-side and target-side language models. Therefore, in our framework, source and target monolingual data can be put to reasonable use.
4. We propose a semi-autoregressive model for neural machine translation. The standard neural machine translation model is autoregressive, so only one word can be produced at each time step when decoding. If the target sentence is long, this process takes a lot of time. The semi-autoregressive model we propose breaks this limitation by producing multiple consecutive words at each time step. It thus makes better use of parallel computing hardware and speeds up decoding significantly while maintaining good translation quality.
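Item 2 above treats Chinese word segmentation as a sequence labeling problem. A minimal sketch of that reduction follows, assuming the common B/M/E/S tag scheme (begin, middle, end, single-character word). The record does not say which tag set the thesis uses, and `words_to_tags`, `tags_to_words`, and the example sentence are hypothetical illustrations rather than the author's code.

```python
from typing import List

def words_to_tags(words: List[str]) -> List[str]:
    """Turn a segmented sentence into one B/M/E/S tag per character."""
    tags: List[str] = []
    for w in words:
        if not w:
            continue  # ignore empty strings
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from per-character B/M/E/S tags."""
    if len(chars) != len(tags):
        raise ValueError("need exactly one tag per character")
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that stops mid-word
        words.append(current)
    return words

# Assumed example sentence, not taken from the thesis.
words = ["机器", "翻译", "很", "有趣"]
tags = words_to_tags(words)
print(tags)
print(tags_to_words("".join(words), tags))
```

A tagger that predicts one such tag per character sees only characters, which is one way to read the abstract's remark that word-level information does not fit this framework naturally.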
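Language: Chinese
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/21169
Collection: Graduates_Master's Theses
Author affiliations: 1. Institute of Automation, Chinese Academy of Sciences
2. University of Chinese Academy of Sciences
Recommended citation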
GB/T 7714
汪春奇. 面向机器翻译的语言预处理与性能优化[D]. 北京. 中国科学院研究生院,2018.
Files in this item
File name/size: 面向机器翻译的语言预处理与性能优化.pd (2217KB)
Document type: Thesis
Version type: not specified
Access type: Not yet open
License: CC BY-NC-SA
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.