Figure 1
Avg@8 and Pass@8 across three settings, matched data budget. Higher is better.
Panels: Avg@8 (%) and Pass@8 (%).
Figure 2
RLVE multi-task (20 envs). All methods use unfiltered supervision data (noisy: mixed correct/incorrect). Higher is better.
Panels: Avg@8 (%) and Pass@8 (%).
Figure 3
After 3-epoch SFT warm-up, online RL (GRPO) and on-policy distillation (OPD) both benefit more from a ROSE cold start.
Panels: Online GRPO · Avg@8 (%); Online OPD · Avg@8 (%); Online GRPO · Pass@8 (%); Online OPD · Pass@8 (%).