Research on Gloss-Free Sign Language Translation (面向无词注释的手语翻译研究)
陈志刚
2024-05-14
Pages: 70
Degree type: Master's
Chinese Abstract

The number of people with hearing impairments is growing worldwide, and sign language is their primary means of communication. Sign language translation aims to convert sign language videos into natural spoken-language text; research on this task can greatly ease communication between deaf and hearing people and bring convenience to the daily lives of the deaf community. With the rapid development of deep learning, sign language translation has made great progress in recent years. However, previous work has typically relied on a special kind of sign language label, the gloss, i.e., a direct annotation of each gesture in a sign language video. Gloss annotation is highly specialized work that requires trained sign language experts to spend large amounts of time labeling by hand. Consequently, datasets with gloss annotations are scarce, which limits the scalability of gloss-based translation methods. This thesis therefore focuses on gloss-free sign language translation, abandoning glosses entirely during both model training and testing in pursuit of a truly general sign language translation system. The main contributions of this thesis are as follows:

(1) Gloss-free sign language translation based on visual-language pretraining. In the absence of glosses, a translation framework must generate spoken-language text directly from sign language videos and be trained end to end. Without glosses, however, neither the visual model nor the text decoder can be pretrained, which leads to poor translation performance. To address this, this thesis proposes a gloss-free sign language translation method based on visual-language pretraining: the visual model is pretrained with a visual-language contrastive learning paradigm, the text decoder is pretrained with masked-text reconstruction, and the whole model is then fine-tuned end to end on the downstream translation task. Experiments show that visual-language pretraining effectively improves sign language translation performance.

(2) Introducing large language models into gloss-free sign language translation through decomposition learning. Pretrained large language models have proven effective on a wide range of natural language processing tasks and have been applied successfully in gloss-based sign language translation. However, this thesis finds that without glosses, directly training a large language model and a visual model end to end on sign language data leads to a significant drop in performance. The analysis attributes this to the large language model suppressing the learning of the visual model during end-to-end training. To address this, this thesis proposes a decomposition learning method that splits end-to-end training into two stages: the first stage focuses on learning the visual model, and the second on fine-tuning the large language model. Experiments confirm that this decomposition effectively introduces large language models into the gloss-free setting and significantly improves translation performance.

(3) Sign language representation learning based on content understanding and context reasoning. The first two works demonstrate that contrastive learning and the introduction of large language models via decomposition learning both improve gloss-free sign language translation. This thesis argues that contrastive learning captures the overall content relationship between sign language videos and spoken-language text, while decomposition learning uses a small language model to guide the visual module toward the contextual associations of spoken-language text. This thesis therefore proposes a method that combines the contrastive and decomposition learning paradigms, learning a strong and general sign language representation through content understanding and context reasoning. Combining this representation with a large language model in the downstream translation task further unlocks its potential. Experiments show that pretraining from the content and context perspectives each independently improves translation performance, and combining the two brings an even larger gain.

Through these three works, this thesis substantially improves gloss-free sign language translation, approaching the performance of gloss-based methods on two datasets that carry gloss annotations. The proposed methods are also evaluated on other large-scale sign language datasets without gloss annotations, demonstrating their superiority and scalability and contributing toward a general sign language translation system.

English Abstract

The number of people with hearing impairments is increasing worldwide, and sign language is their primary means of communication. Sign language translation aims to convert sign language videos into natural spoken-language text, greatly facilitating communication between deaf and hearing people and thus improving the daily lives of those with hearing impairments. With the rapid development of deep learning, sign language translation has made significant progress in recent years. However, previous work has often relied on special annotations called glosses, which directly label each gesture in a sign language video. Gloss annotation is a highly specialized task that requires trained sign language experts to spend considerable time on manual labeling. Datasets with such annotations are therefore rare, and the scalability of gloss-based translation methods is limited. This work instead focuses on gloss-free sign language translation, completely abandoning gloss annotations during model training and testing, with the aim of building a truly universal sign language translation system. The main contributions of this paper are as follows:

(1) Gloss-free sign language translation based on visual-language pretraining. In the absence of gloss annotations, a translation framework must generate spoken-language text directly from sign language videos and be trained end to end. Without glosses, however, neither the visual model nor the text decoder can be pretrained, resulting in poor translation performance. To address this issue, this paper proposes a gloss-free sign language translation method based on visual-language pretraining: the visual model is pretrained with a visual-language contrastive learning paradigm, the text decoder is pretrained with masked-text reconstruction, and the model is then fine-tuned end to end on downstream sign language translation. Experiments show that visual-language pretraining effectively improves translation performance.
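To make the contrastive pretraining step concrete, the following is a minimal sketch of a symmetric video-text contrastive (InfoNCE) loss in PyTorch. It is illustrative only: the function name, tensor shapes, and temperature value are assumptions, not the thesis implementation.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched (video, text) pairs.

    video_emb, text_emb: [batch, dim] pooled embeddings from the visual
    encoder and the text encoder (hypothetical shapes, not the thesis API).
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature  # [batch, batch] similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal pairs match
    loss_v2t = F.cross_entropy(logits, targets)      # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> video direction
    return (loss_v2t + loss_t2v) / 2
```

Pulling the two directions toward the diagonal of the similarity matrix aligns each sign video with its own spoken sentence while pushing apart mismatched pairs in the batch.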


(2) Introducing large language models into gloss-free sign language translation through decomposition learning. Pretrained large language models have proven effective across natural language processing tasks and have been applied successfully in gloss-based sign language translation. However, this paper finds that in the absence of gloss annotations, directly training a large language model and a visual model end to end on sign language data leads to a significant drop in performance. The analysis attributes this to the large language model suppressing the learning of the visual model during end-to-end training. To address this issue, this paper proposes a decomposition learning method that splits end-to-end training into two stages: the first stage focuses on learning the visual model, and the second on fine-tuning the large language model. Experiments show that this decomposition effectively introduces large language models into the gloss-free setting and significantly improves translation performance.
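As a rough illustration of such a two-stage schedule, the sketch below freezes the large language model while a small text decoder supervises the visual encoder, then fixes the visual encoder and fine-tunes the LLM. All module names (visual_encoder, light_decoder, llm) and the per-stage loss functions are hypothetical stand-ins, not the thesis architecture.

```python
from torch import nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Enable or disable gradients for every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def train_stage1(visual_encoder, light_decoder, llm, loader, optimizer, loss_fn):
    """Stage 1: a small decoder supervises the visual encoder; the LLM is
    excluded from the loop so it cannot suppress visual learning."""
    set_trainable(visual_encoder, True)
    set_trainable(llm, False)
    for videos, texts in loader:
        feats = visual_encoder(videos)               # sign features from video
        loss = loss_fn(light_decoder(feats), texts)  # small-decoder translation loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def train_stage2(visual_encoder, llm, loader, optimizer, loss_fn):
    """Stage 2: fine-tune the LLM on top of the learned sign features,
    keeping the visual encoder fixed."""
    set_trainable(visual_encoder, False)
    set_trainable(llm, True)
    for videos, texts in loader:
        feats = visual_encoder(videos)
        loss = loss_fn(llm(feats), texts)            # translation loss through the LLM
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Each stage would use an optimizer built only over that stage's trainable parameters, so the frozen component contributes features but receives no updates.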


(3) Sign language representation learning based on content understanding and context reasoning. The previous two works demonstrate that contrastive learning and the introduction of large language models through decomposition learning both improve gloss-free sign language translation. This paper argues that contrastive learning captures the overall content relationship between sign language videos and spoken-language text, while decomposition learning uses a small language model to guide the visual module toward the contextual associations of spoken-language text. This paper therefore proposes a method that combines the contrastive and decomposition learning paradigms, learning a strong and universal sign language representation through content understanding and context reasoning. Combining this representation with a large language model in the downstream translation task further unlocks its potential. Experimental results show that pretraining from the content and context perspectives each independently improves translation performance, and combining the two brings an even greater improvement.
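One plausible way to combine the two pretraining signals is a weighted joint objective: a contrastive term (reusing video_text_contrastive_loss from the first sketch) aligns global content, while a token-level decoding term through a small language model supplies contextual supervision. The weighting coefficients, tensor shapes, and padding id below are assumptions for illustration, not values from the thesis.

```python
import torch
import torch.nn.functional as F

def joint_pretraining_loss(video_emb: torch.Tensor,
                           text_emb: torch.Tensor,
                           decoder_logits: torch.Tensor,
                           text_ids: torch.Tensor,
                           alpha: float = 1.0,
                           beta: float = 1.0,
                           pad_id: int = 0) -> torch.Tensor:
    """Weighted sum of a content (contrastive) term and a context
    (token-level decoding) term; shapes and weights are illustrative."""
    # Content understanding: global video-text alignment (first sketch).
    content_loss = video_text_contrastive_loss(video_emb, text_emb)
    # Context reasoning: token prediction through a small decoder.
    # decoder_logits: [batch, seq_len, vocab]; text_ids: [batch, seq_len].
    context_loss = F.cross_entropy(decoder_logits.flatten(0, 1),
                                   text_ids.flatten(),
                                   ignore_index=pad_id)
    return alpha * content_loss + beta * context_loss
```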


Through the above three works, this paper significantly improves gloss-free sign language translation, approaching the performance of gloss-based methods on two datasets that carry gloss annotations. Furthermore, the proposed methods are also evaluated on other large-scale sign language datasets without gloss annotations, demonstrating their superiority and scalability and contributing to the realization of a universal sign language translation system.

Keywords: sign language translation; gloss-free; multimodal pretraining; sign language representation learning
Language: Chinese
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/56592
Collection: Graduates / Master's Theses
Recommended citation (GB/T 7714):
陈志刚. 面向无词注释的手语翻译研究[D]. 2024.
Files in this item:
陈志刚_中国科学院大学学位论文.pdf (3701KB); document type: Thesis; access: Restricted; license: CC BY-NC-SA