ROSE: Roadside Supervision Enables Learning at the Edge of Reasoning Capabilities

Figure 1

ROSE consistently outperforms full off-policy SFT

Avg@8 and Pass@8 across three settings, matched data budget. Higher is better.

Legend: Original, Verified SFT, ROSE (Stitch SFT). Panels: Avg@8 (%) and Pass@8 (%).

Single-task (Kukurasu, Minesweeper): 5K examples each. Multi-task (RLVE): 2.5K examples total across 10 tasks (250 per task). ROSE training data is unfiltered (mixed correct and incorrect trajectories), whereas Verified SFT uses only verified-correct trajectories.
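To make the difference between the two data regimes concrete, the following is a minimal sketch of how an unfiltered pool and a verified-only subset could be drawn at a matched budget. The record layout, the `is_correct` callback, and the function name `build_sft_sets` are hypothetical, introduced only for illustration; this does not reproduce the ROSE stitching procedure, which is not described in this section.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical record layout: {"prompt": ..., "response": ...}
Trajectory = Dict[str, str]


def build_sft_sets(
    trajectories: List[Trajectory],
    is_correct: Callable[[Trajectory], bool],
    budget: int,
) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Return (unfiltered, verified) training sets, each capped at `budget`.

    `is_correct` stands in for a task verifier (an assumption, not the
    authors' API). The unfiltered set keeps correct and incorrect
    trajectories alike; the verified set keeps only those the verifier accepts.
    """
    unfiltered = trajectories[:budget]
    verified = [t for t in trajectories if is_correct(t)][:budget]
    return unfiltered, verified


# Toy usage with a trivial exact-match verifier.
pool = [
    {"prompt": "2+2", "response": "4"},
    {"prompt": "2+2", "response": "5"},
    {"prompt": "2+2", "response": "4"},
]
unfiltered, verified = build_sft_sets(pool, lambda t: t["response"] == "4", budget=2)
```

Note that at a fixed budget the verified set needs a larger raw pool, since incorrect trajectories are discarded before the cap is applied.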

Figure 2

ROSE is robust to noisy supervision

RLVE multi-task (20 envs). All methods use unfiltered supervision data (noisy: mixed correct/incorrect). Higher is better.

Legend: Mixed, Teacher Prefix, ROSE (Stitch). Panels: Avg@8 (%) and Pass@8 (%).

Setup: Student Qwen3-1.7B, Teacher Qwen3-4B-Thinking-2507, RLVE 20 environments. All three methods receive unfiltered (noisy) supervision data of matched size. Notably, ROSE at 8K already exceeds Mixed and Teacher Prefix at 16K on both metrics.
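The two metrics reported in these figures can be sketched as follows. This assumes the common reading of the names, which the section itself does not spell out: Avg@8 is the mean accuracy over 8 sampled responses per problem, and Pass@8 is the share of problems where at least one of the 8 samples is correct. The function names are illustrative.

```python
from typing import Sequence


def avg_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Mean per-sample accuracy in percent.

    `results[i][j]` is True if sample j for problem i was judged correct.
    """
    per_problem = [sum(r) / len(r) for r in results]
    return 100.0 * sum(per_problem) / len(per_problem)


def pass_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Percent of problems with at least one correct sample."""
    return 100.0 * sum(any(r) for r in results) / len(results)


# Four toy problems, 8 samples each.
results = [
    [True] * 8,                # always solved
    [False] * 7 + [True],      # solved once
    [False] * 8,               # never solved
    [True] * 4 + [False] * 4,  # solved half the time
]
print(avg_at_k(results))   # 40.625
print(pass_at_k(results))  # 75.0
```

Under these definitions Pass@8 is never lower than Avg@8, which is why the two are usually read together: the gap between them indicates how often a model can solve a problem but does not do so reliably.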

Figure 3

ROSE provides a stronger initialization for online learning

After a 3-epoch SFT warm-up, online RL (GRPO) and on-policy distillation (OPD) both benefit more from a ROSE cold start.

Legend: Original (no warm-up), Verified SFT warm-up, ROSE warm-up. Panels: Online GRPO and Online OPD, each reporting Avg@8 (%) and Pass@8 (%).

Setup: a 3-epoch offline SFT warm-up on 5K examples (the same setup as in Section 3), followed by online learning (GRPO or OPD). With matched warm-up data, ROSE-initialized models reach substantially higher final performance than Verified-SFT-initialized ones, indicating that ROSE produces a stronger starting point for online learning.
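For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage computation that is commonly associated with it: rewards for several responses to the same prompt are normalized by the group mean and standard deviation. Whether the experiments here use exactly this variant is not stated, so treat the function name, the `eps` term, and the 0/1 rewards as assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Responses that score above the group mean get a positive advantage,
    those below get a negative one. `eps` (an assumed constant) avoids
    division by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: two verified-correct responses (reward 1) and two incorrect (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

One consequence of this formulation is that a group in which every sample fails (or every sample succeeds) yields all-zero advantages and therefore no learning signal, which is one plausible reason a stronger warm-up initialization could matter for online learning.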