Take a Closer Look at Multilinguality! Improve Multilingual Pre-Training Using Monolingual Corpora Only
Lu JL(陆金梁)1,2; Zhang JJ(张家俊)1,2,3
2023-12
Conference name: Findings of the Association for Computational Linguistics: EMNLP 2023
Conference date: December 6-10, 2023
Conference location: Singapore
Publisher: Association for Computational Linguistics
Abstract

Recent studies have revealed the remarkable cross-lingual capability of multilingual pre-trained language models (mPLMs), even when pre-trained without parallel corpora (mono-mPLMs). Intuitively, semantic alignments may be the reason behind such capability but remain under-explored. In this work, we investigate the alignment properties from the token perspective in mono-mPLMs and find that the alignments correspond to the geometric similarity of embedding space across different languages. Nevertheless, mono-mPLMs tend to damage this geometric similarity at the higher layers due to the lack of cross-lingual interactions, thus limiting their cross-lingual transfer capabilities. To address this issue, we introduce token-level and semantic-level code-switched masked language modeling, employing the self-induced token alignments to explicitly improve cross-lingual interactions over layers of mono-mPLMs without relying on parallel sentences. We evaluate our method on various natural language understanding tasks and unsupervised machine translation tasks. The results demonstrate that our methods outperform the strong baselines and achieve comparable performance with mPLMs trained with parallel corpora.
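
The abstract does not spell out how the token alignments are induced or how the code-switched inputs are built, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that alignments can be approximated by cosine nearest neighbours between per-language token embeddings, and that token-level code-switching replaces an unmasked token with its aligned counterpart before masked language modeling. The function names, the toy embeddings, the threshold, and the probabilities are all hypothetical.

```python
import math
import random

MASK = "[MASK]"  # placeholder mask symbol (assumption, not taken from the paper)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def induce_alignments(src_emb, tgt_emb, threshold=0.9):
    """Map each source token to its most similar target token.

    Pairs whose best similarity falls below `threshold` are dropped.
    Nearest-neighbour matching is an illustrative stand-in for the
    self-induced alignments mentioned in the abstract.
    """
    alignments = {}
    for s_tok, s_vec in src_emb.items():
        best_tok, best_sim = max(
            ((t_tok, cosine(s_vec, t_vec)) for t_tok, t_vec in tgt_emb.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            alignments[s_tok] = best_tok
    return alignments


def code_switch_and_mask(tokens, alignments, switch_prob, mask_prob, rng):
    """Build one training example for code-switched masked language modeling.

    Returns (inputs, labels). `labels` holds the original token at masked
    positions and None elsewhere. Unmasked tokens that have an alignment
    are swapped for their counterpart with probability `switch_prob`.
    """
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        elif tok in alignments and rng.random() < switch_prob:
            inputs.append(alignments[tok])
            labels.append(None)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


# Toy two-dimensional embeddings (hypothetical values for illustration only).
en_emb = {"cat": [1.0, 0.1], "dog": [0.1, 1.0], "the": [0.7, 0.7]}
de_emb = {"Katze": [0.9, 0.1], "Hund": [0.2, 1.0]}

alignments = induce_alignments(en_emb, de_emb)
inputs, labels = code_switch_and_mask(
    ["the", "cat", "dog"], alignments, switch_prob=0.5, mask_prob=0.15,
    rng=random.Random(0),
)
```

In this sketch the model would be trained to predict the tokens in `labels` from `inputs`, so the context around a masked position can mix languages even though no parallel sentences are used.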

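Indexed by: EI
Representative paper:
Seven major directions, sub-direction category: Natural Language Processing
State Key Laboratory planned direction category: Speech and Language Processing
Associated dataset requiring deposit:
Document type: Conference paper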
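Item identifier: http://ir.ia.ac.cn/handle/173211/57386
Collection: Zidong Taichu Large Model Research Center (紫东太初大模型研究中心)
Corresponding author: Zhang JJ(张家俊)
Author affiliations:
1. Institute of Automation, Chinese Academy of Sciences, Beijing, China
2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
3. Wuhan AI Research, Wuhan, China
First author affiliation: Institute of Automation, Chinese Academy of Sciences
Corresponding author affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended citation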
GB/T 7714
Lu JL,Zhang JJ. Take a Closer Look at Multilinguality! Improve Multilingual Pre-Training Using Monolingual Corpora Only[C]:Association for Computational Linguistics,2023.
Files in this item
File name/size: 2023.findings-emnlp.(1097KB)
Document type: Conference paper
Version type:
Access type: Open access
License: CC BY-NC-SA
File name: 2023.findings-emnlp.190.pdf
Format: Adobe PDF