- On-policy distillation combines RL's error-correcting ability with SFT's reward density, improving training stability.
- The teacher model can serve as a process reward model, avoiding the "OOD shock" problem during rollout.
- The method outperforms traditional approaches on math reasoning and internal chat-assistant tasks.
Lilian Weng on X: "On-policy distillation provides an elegant way to use the teacher model as a process reward model to provide dense reward while preventing SFT style "OOD shock" during rollout."

Lilian Weng 
On-policy distillation provides an elegant way to use the teacher model as a process reward model to provide dense reward while preventing SFT style "OOD shock" during rollout.
Quoting Thinking Machines (@thinkymachines) · Oct 27, 2025:
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other…
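The core idea can be sketched in code: sample completions from the student (on-policy), then use the teacher's per-token log-probabilities as a dense reward, penalizing the per-token reverse KL between student and teacher. This is a minimal illustrative sketch, not the authors' implementation; the function name and toy shapes are assumptions.

```python
# Hypothetical sketch of an on-policy distillation objective:
# the teacher acts as a process reward model, scoring every token
# the student itself sampled (so there is no SFT-style OOD shock).
import torch
import torch.nn.functional as F

def per_token_reverse_kl(student_logits, teacher_logits, sampled_tokens):
    """Dense per-token penalty: log p_student(t) - log p_teacher(t),
    evaluated on tokens the *student* sampled (on-policy)."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    s_tok = s_logp.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    t_tok = t_logp.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    return s_tok - t_tok  # one dense reward signal per token

# Toy usage with random logits: (batch=2, seq=4, vocab=10)
student = torch.randn(2, 4, 10)
teacher = torch.randn(2, 4, 10)
tokens = torch.distributions.Categorical(logits=student).sample()  # student rollout
loss = per_token_reverse_kl(student, teacher, tokens).mean()
```

Because the rollout comes from the student's own distribution, the teacher grades states the student actually visits, which is what gives the "process reward" flavor described in the post.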

Relevant people
- Lilian Weng (@lilianweng): Co-founder of Thinking Machines Lab @thinkymachines; Ex-VP, AI Safety & robotics, applied research @OpenAI; Author of Lil'Log
- Thinking Machines (@thinkymachines): Thinking, beeping, and booping. @tinkerapi