Research on Speech Translation Methods with Cross-Modal Information Fusion (跨模态信息融合的语音翻译方法研究)
Author: 刘宇宸
Date: 2021-05
Pages: 128
Degree type: Doctoral
Chinese Abstract

Speech translation aims to translate speech in one language into speech or text in another language. With the advent of globalization, exchanges among countries in fields such as economy and trade, tourism, and culture are becoming increasingly frequent. Speech translation, as a key technology for breaking language barriers, is a research hotspot in both academia and industry. Current speech translation systems are usually built as a cascade of modules such as automatic speech recognition and machine translation: the speech recognition module first transcribes the source-language speech into text, and the machine translation module then translates the text into the target language. Although cascaded systems are widely used, they suffer from inherent defects such as error accumulation, high computational and storage costs, and high translation latency. In contrast, end-to-end speech translation directly models the mapping between source-language speech and target-language sentences within a unified model framework, which can in principle alleviate the defects of cascaded systems, and has therefore attracted growing attention from researchers. However, accomplishing both recognition and translation in a single model is difficult and challenging, and end-to-end training data is extremely scarce, resulting in unsatisfactory translation performance. Effectively fusing data and information from different modalities is therefore crucial to improving the translation quality of end-to-end speech translation methods. This thesis focuses on end-to-end speech-to-text translation models and studies cross-modal information fusion methods for speech translation, aiming to remedy the insufficient use of existing data and to enable the model to better exploit information from different modalities to improve speech-to-text translation quality.

The main work and innovations of this thesis are summarized as follows:

1. An end-to-end speech translation method based on knowledge distillation

To address the poor performance of end-to-end speech translation models caused by the difficulty of modeling, this thesis proposes a speech translation method based on knowledge distillation. Leveraging the superior performance of text machine translation systems, the method fuses the translation knowledge learned by a text translation model into the speech translation model to assist its training. Specifically, the text translation model and the speech translation model serve as the teacher model and the student model, respectively. Two knowledge distillation strategies are then applied, which take the teacher model's output probability distribution and output sequence, respectively, as translation knowledge to train the student model, thereby narrowing the performance gap between the student and the teacher. Experiments show that, compared with baseline systems, the method makes fuller use of text translation data and significantly improves the translation quality of the speech translation model.

2. An end-to-end speech translation method based on interactive learning

In existing speech translation models based on multi-task learning, the tasks are trained independently of one another; in fact, the target outputs of speech recognition and speech translation express the same semantic content, so the two tasks can complement each other. To address this, this thesis proposes an end-to-end speech translation method based on interactive learning. The method enables the speech recognition task and the speech translation task to run synchronously and interact with each other during both training and decoding within a single model. In addition, a delayed decoding strategy is proposed that makes the translation task start later than the recognition task, so that the translation model can exploit a richer recognized-text prefix as auxiliary information, reducing the difficulty of the speech translation task. Experimental analysis shows that the method significantly outperforms several strong baseline systems, improving not only the quality of speech translation but also the accuracy of speech recognition.

3. An end-to-end speech translation method based on cross-modal transfer

To address the difficulty of fusing and transferring feature representations across the speech and text modalities, this thesis proposes an end-to-end speech translation method based on cross-modal transfer. The method uses the text representations learned by a text machine translation model to assist the learning of speech representations in the speech translation model. Specifically, the encoder of the speech translation model is first decoupled into an acoustic encoder and a semantic encoder, and a filtering mechanism for redundant information in speech features is proposed to resolve the length inconsistency between speech and text representations. To bring the semantic spaces of the two modalities' representations closer, the text machine translation model is embedded into the end-to-end speech translation model through parameter sharing, and the speech translation and text translation tasks are trained jointly. Furthermore, two modality transfer strategies are proposed to constrain the two semantic spaces. Experimental results and analyses show that the proposed method effectively reduces the discrepancy between the representations of the different modalities and thus significantly improves the performance of the speech translation model.

In summary, to address the insufficient use of data in current end-to-end speech translation methods, this thesis proposes several speech translation methods that fuse information across modalities, from three perspectives: fusing the knowledge of text translation models, fusing the intermediate results of speech recognition models, and fusing the semantic features of different modalities. Experiments show that the proposed methods effectively exploit cross-modal information and improve the performance of speech translation models; the research results strongly advance the study and development of speech translation technology.

English Abstract

Speech translation aims to translate source-language speech into text in another language. With the advent of globalization, communication around the world is becoming increasingly frequent in various fields, such as economy and trade, tourism, and culture. Speech translation, as a key technology for breaking language barriers, is a research hotspot in both academia and industry. The traditional speech translation method is usually a pipeline system composed of automatic speech recognition and machine translation: the speech input is first transcribed into text, and the text is then translated into the target language by the machine translation model. Although the pipeline system has been widely used, it suffers from inherent defects, including error propagation, expensive computational costs, and high latency. In contrast, the end-to-end speech translation method, which directly models the mapping between the source speech and the target text with a unified neural network framework, has potential advantages over the pipeline system and has therefore attracted more and more attention. However, implementing the recognition and translation tasks in a single model is difficult and challenging, and the training data is scarce as well, resulting in poor translation performance. How to integrate data and information from different modalities is therefore crucial to improving the translation quality of end-to-end methods. This thesis focuses on the end-to-end speech-to-text translation model, investigating cross-modal information fusion methods to better leverage information from different modalities and improve model performance.

The main contributions of this thesis are summarized as follows:

1. End-to-End Speech Translation with Knowledge Distillation

Considering that text-based machine translation models are markedly superior to speech translation models, this thesis proposes a method based on knowledge distillation, which integrates the knowledge learned by the text-based model into the speech translation model. Specifically, the text-based machine translation model and the speech translation model are taken as the teacher model and the student model, respectively. Two knowledge distillation strategies are then proposed, in which the output probability distribution and the output sequence of the teacher model are taken as knowledge and used to train the student model. By learning from the knowledge of the teacher model, the speech translation model narrows the performance gap between itself and the text-based model. Experiments show that, with the guidance of the teacher model, the end-to-end speech translation model makes fuller use of text data and achieves significant improvements.
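As a concrete illustration, the following is a minimal PyTorch sketch of a word-level distillation objective of the kind described above, mixing cross-entropy against the reference with a KL term toward the teacher's output distribution. All names (`word_level_kd_loss`, `alpha`, `temperature`) are illustrative assumptions, not the thesis's actual implementation.

```python
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, targets,
                       pad_id, alpha=0.8, temperature=1.0):
    """Mix hard cross-entropy with KL divergence toward the MT teacher.

    student_logits, teacher_logits: (batch, tgt_len, vocab)
    targets: (batch, tgt_len) reference token ids
    """
    vocab = student_logits.size(-1)
    # Hard loss: standard cross-entropy against the reference translation.
    ce = F.cross_entropy(student_logits.reshape(-1, vocab),
                         targets.reshape(-1), ignore_index=pad_id)
    # Soft loss: match the teacher's output distribution token by token.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(student_logp, teacher_probs, reduction="none").sum(-1)
    mask = targets.ne(pad_id).float()  # ignore padded positions
    kd = (kd * mask).sum() / mask.sum()
    return alpha * kd + (1.0 - alpha) * ce
```

The sequence-level strategy mentioned above would instead replace `targets` with the teacher model's beam-search output and train with plain cross-entropy.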

2. Synchronous Speech Translation and Speech Recognition with Interactive Learning

Existing multi-task learning methods treat the different tasks independently. However, the target sentences of the speech recognition task and the speech translation task express the same semantics, so the two tasks can complement each other. To this end, this thesis proposes a method based on interactive learning, which trains and decodes the speech recognition and speech translation tasks in the same model synchronously and interactively. In addition, a delayed decoding strategy is proposed that makes the start of the translation task lag behind the recognition task, so that the translation process can use more transcribed results as auxiliary information and its difficulty is reduced. Experimental results show that the proposed method significantly outperforms strong baseline systems: it improves not only the quality of speech translation but also that of speech recognition.
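Synchronous decoding with a delay could look like the sketch below, assuming a model that exposes one-step prediction functions for each task; `encode`, `asr_step`, and `st_step` are hypothetical interfaces, not the thesis's API.

```python
def synchronous_decode(model, speech_features, delay_k=3, max_len=200,
                       bos=1, eos=2):
    """Greedy synchronous decoding of ASR and ST with a delayed ST start."""
    enc = model.encode(speech_features)
    asr_hyp, st_hyp = [bos], [bos]
    for t in range(max_len):
        # Recognition step: extend the transcription, conditioned on both
        # partial hypotheses so the two tasks can interact.
        if asr_hyp[-1] != eos:
            asr_hyp.append(model.asr_step(enc, asr_hyp, st_hyp))
        # Translation starts delay_k steps later, so every translation step
        # sees a longer recognized prefix as auxiliary information.
        if t >= delay_k and st_hyp[-1] != eos:
            st_hyp.append(model.st_step(enc, st_hyp, asr_hyp))
        if asr_hyp[-1] == eos and st_hyp[-1] == eos:
            break
    return asr_hyp, st_hyp
```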

3. End-to-End Speech Translation based on Cross-Modal Adaptation

The feature representations of the speech and text modalities are heterogeneous and difficult to fuse. This thesis proposes a method based on cross-modal adaptation, which leverages the text representations learned by the text-based machine translation model to help the speech translation model learn speech representations. Specifically, the method first decouples the encoder of the speech translation model into an acoustic encoder and a semantic encoder, with a filtering mechanism that removes redundant speech features to solve the length inconsistency between the two modalities. To bridge the semantic gap between the two representations, the text-based machine translation model is embedded into the speech translation model by sharing parameters, and the two tasks are trained jointly. This thesis further proposes two modal adaptation strategies to constrain the semantic spaces of the two representations. Experimental results and analyses show that the proposed method effectively reduces the representation gap between the two modalities and significantly improves the performance of the speech translation model.
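One way to realize the filtering mechanism and the semantic-space constraint is sketched below; the mean-pooled L2 distance is only one plausible choice for the adaptation strategies, and all names are illustrative rather than taken from the thesis.

```python
import torch.nn.functional as F

def shrink_by_mask(acoustic_states, keep_mask):
    """Drop redundant frames so the speech sequence length is closer to text.

    acoustic_states: (src_len, d_model) acoustic-encoder output
    keep_mask: (src_len,) bool, True for the frames the filter retains
    """
    return acoustic_states[keep_mask]

def modality_distance_loss(speech_sem, text_sem):
    """Pull the speech semantic representation toward the shared text one.

    speech_sem: (s_len, d_model) semantic-encoder output for the speech
    text_sem:   (t_len, d_model) shared MT-encoder output for the transcript
    """
    return F.mse_loss(speech_sem.mean(dim=0), text_sem.mean(dim=0))
```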

To sum up, this thesis fuses cross-modal information into the end-to-end speech translation method, including the knowledge of text-based translation models, the intermediate results of speech recognition models, and the semantic features of different modalities. A variety of methods are proposed to integrate such cross-modal information into the speech translation model. Experimental results show that these methods effectively leverage cross-modal information to improve the performance of speech translation. The related work strongly promotes the research and application of speech translation systems.

Keywords: speech translation, speech recognition, machine translation, multimodal learning
Language: Chinese
Sub-direction classification: Natural Language Processing
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/44894
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems / Natural Language Processing
Recommended citation (GB/T 7714):
刘宇宸. 跨模态信息融合的语音翻译方法研究[D]. 中国科学院自动化研究所, 2021.
Files in this item:
跨模态信息融合的语音翻译方法研究_终版. (2516KB), thesis, open access, license CC BY-NC-SA