LLM Notes, 2023-04 to 2023-06

RLHF

RAFT

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
code

RRHF

RRHF: Rank Responses to Align Language Models with Human Feedback without tears
code
$$p_i=\frac{\sum_{t}\log P_{\pi}(y_{i,t}\mid y_{i,<t})}{\|y_i\|}$$
$$L_{rank}=\sum_{r_i<r_j}\max(0,\,p_i-p_j)$$
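
To make the two RRHF formulas above concrete, here is a minimal sketch in plain Python. It assumes that $\|y_i\|$ is the number of tokens in response $y_i$ and that the per-token log-probabilities under the policy $\pi$ are already available. The function names and the sample numbers are hypothetical and only serve the illustration; this is not the code from the RRHF repository.

```python
def length_normalized_logprob(token_logprobs):
    """p_i: sum of per-token log-probabilities divided by response length."""
    return sum(token_logprobs) / len(token_logprobs)


def rank_loss(scores, rewards):
    """L_rank: sum of max(0, p_i - p_j) over all pairs with r_i < r_j."""
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if rewards[i] < rewards[j]:
                # Penalize only when the lower-reward response scores higher.
                loss += max(0.0, scores[i] - scores[j])
    return loss


# Hypothetical log P(y_t | y_<t) values for three candidate responses.
token_logprobs = [
    [-0.5, -0.5],        # p_0 = -0.5
    [-1.0, -2.0, -3.0],  # p_1 = -2.0
    [-1.0, -1.0],        # p_2 = -1.0
]
rewards = [0.1, 0.9, 0.5]  # hypothetical reward scores r_i

scores = [length_normalized_logprob(lp) for lp in token_logprobs]
loss = rank_loss(scores, rewards)
print(scores, loss)
```

In this toy input the highest-reward response has the lowest score, so every pair with $r_i<r_j$ contributes to the loss. When the scores are already ordered the same way as the rewards, each hinge term is zero and the loss vanishes.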


Reposted from blog.csdn.net/dragonchow123/article/details/130026411