Statistical machine translation (SMT) comprises three sub-problems: the language model (LM), the translation model (TM), and the decoding algorithm. The "major bottleneck", however, is the model itself, since the model determines the upper limit of translation performance. With the objective of improving translation performance, this paper conducts systematic research on the key technologies of the models (LM and TM) in SMT, from the aspects of training, adaptation and learning. Our main contributions are as follows:

1. Distributed Training of a Large-Scale Language Model

We propose a fast training method for language models on large-scale corpora based on a distributed architecture. In our distributed training infrastructure, we utilize a client-server based distributed computing platform to increase the capacity for large-scale data processing, a memory-interactive paradigm to improve training speed, and modified Kneser-Ney discounting to enable parallel training in blocks.

2. Corpus Selection based Model Adaptation

2.1. Cross-lingual Corpus Selection based Language Model Adaptation

Improving the performance of an LM requires not only more training corpus, but also training corpus that matches the translation task. Many researchers therefore prefer to select training corpus similar to the translation task. However, most traditional corpus selection methods are monolingual methods based on bag-of-words models, which cause noise proliferation and do not take any contextual information into account. We propose two context-aware cross-lingual corpus selection methods, and experimental results show that both significantly outperform the traditional methods in corpus selection and translation performance. Our two methods are:

Translation model based cross-lingual corpus selection.
Inspired by the phrase-based TM and by cross-lingual information retrieval, we propose phrase-TM-based cross-lingual corpus selection for LM adaptation. Given a source sentence in the translation task, our method directly estimates, before any translation is performed, the probability that a sentence in the target-side LM training corpus is similar to it, and then selects corpus according to this probability. Our method operates at the phrase level and captures contextual information by modeling the selection of each phrase as a whole. Moreover, we propose a linear ranking model framework to further improve performance, where different models a...
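The selection step described above can be sketched as follows. This is a minimal illustration, not the thesis's actual model: the phrase table, its probabilities, and the candidate sentences are toy assumptions, and single words stand in for the multi-word phrases a real system would extract from word-aligned parallel corpora.

```python
import math

# Toy phrase translation table P(e | f): the probability that target
# phrase e translates source phrase f. All entries are illustrative.
phrase_table = {
    ("la", "the"): 0.9,
    ("maison", "house"): 0.8,
    ("maison", "home"): 0.15,
    ("bleue", "blue"): 0.9,
}

def selection_score(source_tokens, target_sentence):
    """Cross-lingual relevance of a candidate LM-training sentence:
    for each source phrase, take the best translation probability found
    in the candidate, and sum the log probabilities (smoothed so that
    unmatched phrases do not zero out the whole score)."""
    target_tokens = target_sentence.split()
    score = 0.0
    for f in source_tokens:
        best = max((phrase_table.get((f, e), 0.0) for e in target_tokens),
                   default=0.0)
        score += math.log(best + 1e-9)
    return score

# Rank target-side candidates against a source sentence from the
# translation task; high-scoring sentences are kept for LM adaptation.
source = ["la", "maison", "bleue"]
candidates = [
    "the blue house is old",
    "stock prices fell sharply",
]
ranked = sorted(candidates, key=lambda s: selection_score(source, s),
                reverse=True)
```

Sentences sharing translated phrases with the source sentence rank first, so the selected subset matches the translation task without translating the task input beforehand.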