Lilian Weng (@lilianweng)

AI Summary
  • On-policy distillation combines the error-correcting relevance of RL with the reward density of SFT, stabilizing training.
  • The teacher model can serve as a process reward model, avoiding the SFT-style "OOD shock" problem during the rollout phase.
  • The method outperforms conventional approaches on math reasoning and internal chat-assistant tasks.
#ReinforcementLearning #ModelDistillation #AITraining


Lilian Weng

@lilianweng

On-policy distillation provides an elegant way to use the teacher model as a process reward model to provide dense reward while preventing SFT style "OOD shock" during rollout.
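The idea in the post — the teacher scoring every token the student actually samples, so reward is dense rather than sequence-level — can be sketched as a per-token reverse-KL term. This is a minimal illustrative sketch, not Thinking Machines' implementation; the function name, shapes, and reward definition below are assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def per_token_distill_reward(student_logits, teacher_logits, sampled_ids):
    """Dense per-token reward on a student-sampled rollout.

    For each position t with student-sampled token y_t, return
        log p_teacher(y_t) - log p_student(y_t),
    a single-sample estimate of the negative reverse KL at that token:
    tokens the teacher likes more than the student gets rewarded,
    tokens the teacher dislikes get penalized — at every step, not
    just at the end of the rollout.

    student_logits, teacher_logits: [seq_len, vocab]
    sampled_ids: [seq_len] token ids sampled by the student
    """
    s_logp = np.log(softmax(student_logits))
    t_logp = np.log(softmax(teacher_logits))
    idx = np.arange(len(sampled_ids))
    return t_logp[idx, sampled_ids] - s_logp[idx, sampled_ids]
```

Because the rollout comes from the student's own policy, the reward is always evaluated on-distribution — which is what avoids the "OOD shock" of grading teacher-forced SFT trajectories.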

Quote

Thinking Machines
@thinkymachines · Oct 27, 2025

Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other…


5:31 PM · Oct 27, 2025 · 142.3K Views
