Research on Annotation and Classification Methods for Speech Transcription Text Based on Active Learning


	曾杰林
	2023-05
Pages	86
Degree type	Master's
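Abstract (Chinese)	In recent years, as speech transcription technology has matured to the point of commercial use, related industries have accumulated large amounts of speech transcription text. In some scenarios this text carries very rich information, and classifying it by scene would advance natural language processing technology in those industries. However, such text generally lacks annotations and contains many colloquial words and transcription errors, so existing language models struggle to understand its semantics accurately. To address these problems, this thesis studies the annotation and classification of speech transcription text based on active learning, and proposes two different methods that reduce the number of samples to be annotated while preserving the model's classification performance. The main work is as follows. First, a dataset is built for the speech transcription text scene classification task, and a solution to the task is proposed. An annotation platform was developed to collect and annotate speech transcription text from real call scenarios, yielding a dataset for transcription text scene classification. Based on the characteristics of this dataset, a classification method based on text correction is proposed; the model's preprocessing module and correction module handle the many colloquial words and transcription errors in the transcribed text, so the model can complete the scene classification task. Second, a two-stage text active learning algorithm based on feature mixing is proposed, which substantially reduces corpus annotation cost. To address the high cost of annotation, the first stage uses feature mixing to find features that the current model cannot recognize and samples the examples containing such features; the second stage ranks these examples by importance, selects those most effective for improving the model, and trains the model iteratively. The method effectively reduces the amount of text to be annotated and lowers annotation cost. Third, by introducing the speech modality, a multimodal active learning algorithm based on speech and text is proposed. Adding the speech modality to the text modality gives another solution to the speech transcription text scene classification task. This solution designs a speech feature extraction method and a modality information fusion strategy, which broaden the model's information sources and improve its performance. Its active learning framework can likewise lower annotation cost while preserving model performance.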
Abstract (English)	In recent years, as speech transcription technology has reached a level of maturity suitable for commercial applications, related industries have accumulated a large amount of speech transcription text. In certain scenarios, this type of text contains extremely rich information. If these texts can be categorized according to their scenes, it will promote the development of natural language processing technology in related industries. However, this type of text generally lacks annotations and contains many colloquial words and transcription errors, so existing language models find it difficult to understand its semantics accurately. To address these issues, this thesis investigates the annotation and classification of speech transcription text based on active learning. While preserving the model's classification performance, two different methods are proposed to reduce the number of samples that must be annotated. The main contributions are as follows. First, we constructed a dataset and proposed a solution for the transcription text scene classification task. Using a text annotation platform developed in-house, we collected and annotated speech transcription texts from real call scenarios and constructed a dataset for transcription text scene classification. We proposed a method that chains a text correction model with a text classification model to handle the many colloquial words and transcription errors in transcription texts, which enables speech transcription texts to be classified by scene. Second, in response to the high cost of annotation, we proposed a two-stage text active learning algorithm based on feature mixing. In the first stage, the algorithm uses feature mixing to identify features that the model cannot currently recognize and samples the examples containing these features. In the second stage, it ranks these samples by importance, selects the ones most effective for improving the model, and trains the model iteratively. The algorithm effectively reduces the amount of text to be annotated and lowers annotation cost. Third, we introduced the speech modality as an extension of the text modality and proposed a multimodal active learning algorithm based on speech and text as another solution for the speech transcription text scene classification task. The proposed algorithm includes a speech feature extraction method and a modality information fusion strategy, which expand the model's information sources and improve its performance. Moreover, the active learning framework of this method can also reduce annotation costs while maintaining model performance.
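Keywords	active learning, multimodal learning, speech transcription, scene classification
Language	Chinese
Seven major directions: sub-direction classification	Natural language processing
State Key Laboratory planned direction classification	Social information perception and understanding
Associated dataset to be deposited	No
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/52169
Collection	Graduates_Master's theses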
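Recommended citation (GB/T 7714)	曾杰林. 基于主动学习的语音转录文本标注和分类方法研究 [Research on Annotation and Classification Methods for Speech Transcription Text Based on Active Learning][D], 2023.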

Files in this item
File name/size	Document type	Version type	Access type	License
2020E8014682037曾杰林 (6969KB)	Thesis		Restricted access	CC BY-NC-SA