Token-level Direct Preference Optimization
Zeng, Yongcheng (1); Liu, Guoqing (2); Ma, Weiyu (1); Yang, Ning (1); Zhang, Haifeng (1); Wang, Jun (3)
2024
Conference: 41st International Conference on Machine Learning (ICML 2024)
Conference Dates: July 21-27, 2024
Conference Location: Vienna, Austria
Abstract

Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs at the token level, in a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO on the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.
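To make the abstract's token-level idea concrete, the following is a minimal, hypothetical PyTorch-style sketch, not the paper's exact objective: it combines a DPO-style Bradley-Terry preference loss over chosen/rejected responses with a per-token forward KL penalty against the reference model. The function name, tensor shapes, and the alpha/beta weighting are illustrative assumptions; the authors' actual formulation is in the linked repository.

# Hypothetical sketch (not the paper's exact loss): DPO-style pairwise
# preference objective plus a per-token forward KL penalty w.r.t. the
# reference model, illustrating token-level regularization.
import torch
import torch.nn.functional as F

def token_level_preference_loss(
    policy_logits_w, policy_logits_l,   # (B, T, V) policy logits for chosen / rejected responses
    ref_logits_w, ref_logits_l,         # (B, T, V) reference-model logits
    labels_w, labels_l,                 # (B, T) token ids of the responses
    mask_w, mask_l,                     # (B, T) 1.0 for response tokens, 0.0 for prompt/padding
    beta=0.1, alpha=0.5,                # hypothetical hyper-parameters
):
    def seq_logprob(logits, labels, mask):
        # Sum of per-token log-probabilities of the realized tokens.
        logp = F.log_softmax(logits, dim=-1)
        tok = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
        return (tok * mask).sum(-1)

    def seq_forward_kl(policy_logits, ref_logits, mask):
        # Per-token forward KL D_KL(pi_ref || pi_theta), summed over the response.
        ref_logp = F.log_softmax(ref_logits, dim=-1)
        pol_logp = F.log_softmax(policy_logits, dim=-1)
        kl = (ref_logp.exp() * (ref_logp - pol_logp)).sum(-1)
        return (kl * mask).sum(-1)

    # DPO-style implicit reward margins (log-ratio of policy to reference).
    margin_w = seq_logprob(policy_logits_w, labels_w, mask_w) - seq_logprob(ref_logits_w, labels_w, mask_w)
    margin_l = seq_logprob(policy_logits_l, labels_l, mask_l) - seq_logprob(ref_logits_l, labels_l, mask_l)

    # Token-level KL regularization: penalize divergence on the rejected response
    # relative to a stop-gradient copy of the chosen one.
    kl_w = seq_forward_kl(policy_logits_w, ref_logits_w, mask_w)
    kl_l = seq_forward_kl(policy_logits_l, ref_logits_l, mask_l)
    kl_term = alpha * (kl_l - kl_w.detach())

    # Bradley-Terry / logistic preference loss on the regularized margin.
    return -F.logsigmoid(beta * (margin_w - margin_l - kl_term)).mean()

The stop-gradient on the chosen-response KL in this sketch reflects the abstract's emphasis on regulating KL divergence growth without an explicit reward model; it should be read as an illustration of the general idea rather than the published algorithm.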

Discipline: Engineering; Engineering::Computer Science and Technology (degrees conferrable in Engineering or Science)
Indexed By: EI
Language: English
Representative Paper:
Seven Major Research Directions, Sub-direction: Theory and Methods of Decision Intelligence
State Key Laboratory Planned Direction: Other
Associated Dataset Requiring Deposit:
Document Type: Conference Paper
Item Identifier: http://ir.ia.ac.cn/handle/173211/57249
Collection: Laboratory of Cognition and Decision Intelligence for Complex Systems, Group Decision Intelligence Team
Corresponding Authors: Zhang, Haifeng; Wang, Jun
Author Affiliations:
1. Institute of Automation, Chinese Academy of Sciences
2. Microsoft Research AI4Science
3. University College London
First Author Affiliation: Institute of Automation, Chinese Academy of Sciences
Corresponding Author Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended Citation (GB/T 7714):
Zeng, Yongcheng, Liu, Guoqing, Ma, Weiyu, et al. Token-level Direct Preference Optimization[C], 2024.
Files in This Item:
Token-level Direct Preference Optimization.pdf (883 KB), Conference Paper, Adobe PDF, Open Access, License: CC BY-NC-SA
