Token-level Direct Preference Optimization
Zeng, Yongcheng1; Liu, Guoqing2; Ma, Weiyu1; Yang, Ning1; Zhang, Haifeng1; Wang, Jun3
Conference Name: 2024 41st International Conference on Machine Learning (ICML)
Conference Date: 2024/7/21-27
Conference Place: Vienna, Austria

Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods such as pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, these responses are generated at the token level, in a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to aligning LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence while preserving simplicity, without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO on the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at level-Direct-Preference-Optimization.
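As a rough illustration of the idea sketched in the abstract — a DPO-style Bradley-Terry preference loss built from token-level quantities, with an extra term regulating sequential KL divergence — the snippet below shows a simplified toy version. This is not the paper's exact objective: the function names are hypothetical, the coefficients `beta`/`alpha` are illustrative, and details such as stop-gradient placement in TDPO are omitted here.

```python
import math

def seq_log_ratio(policy_logps, ref_logps):
    """Sum over tokens of log pi(y_t | x, y_<t) - log ref(y_t | x, y_<t)."""
    return sum(p - r for p, r in zip(policy_logps, ref_logps))

def tdpo_style_loss(pi_w, ref_w, pi_l, ref_l, kl_w, kl_l, beta=0.1, alpha=0.5):
    """Toy token-level preference loss (simplified, not the paper's exact form).

    pi_w/ref_w: per-token log-probs of the preferred response under the
    policy / reference model; pi_l/ref_l: the same for the dispreferred one.
    kl_w/kl_l: per-token forward-KL estimates accumulated along each response.
    """
    # Preference margin: difference of sequence-level log ratios (as in DPO),
    # assembled from token-level terms.
    margin = beta * (seq_log_ratio(pi_w, ref_w) - seq_log_ratio(pi_l, ref_l))
    # Token-level KL regularizer: penalize the gap in accumulated sequential
    # KL between the dispreferred and preferred responses.
    kl_gap = alpha * (sum(kl_l) - sum(kl_w))
    # Bradley-Terry negative log-likelihood of preferring y_w over y_l.
    return -math.log(1.0 / (1.0 + math.exp(-(margin - kl_gap))))
```

Raising the policy's per-token log-probabilities on the preferred response lowers the loss, while the KL term discourages the policy from drifting far from the reference on either response.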

MOST Discipline Catalogue: Engineering; Engineering::Computer Science and Technology (may confer engineering or science degrees)
Indexed By: EI
Is Representative Paper
Sub-direction Classification: Decision Intelligence Theory and Methods
Planning Direction of the National Key Laboratory: Other
Paper Associated Data
Document Type: Conference Paper
Corresponding Author: Zhang, Haifeng; Wang, Jun
Affiliation:
1. Institute of Automation, Chinese Academy of Sciences
2. Microsoft Research AI4Science
3. University College London
First Author Affiliation: Institute of Automation, Chinese Academy of Sciences
Corresponding Author Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended Citation (GB/T 7714):
Zeng, Yongcheng, Liu, Guoqing, Ma, Weiyu, et al. Token-level Direct Preference Optimization[C], 2024.
Files in This Item:
File Name/Size: Token-level Direct Preference Optimization.pdf (883KB); DocType: Conference Paper; Access: Open Access; License: CC BY-NC-SA
Related Services
Google Scholar
Similar articles in Google Scholar
[Zeng,Yongcheng]'s Articles
[Liu,Guoqing]'s Articles
[Ma,Weiyu]'s Articles
Baidu academic
Similar articles in Baidu academic
[Zeng,Yongcheng]'s Articles
[Liu,Guoqing]'s Articles
[Ma,Weiyu]'s Articles
Bing Scholar
Similar articles in Bing Scholar
[Zeng,Yongcheng]'s Articles
[Liu,Guoqing]'s Articles
[Ma,Weiyu]'s Articles

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.