Wave 20: Tier-0 fidelity fixes — k1-in-reward KL + Composer-2 behavior rewards
Closes the two cheapest "next-level" gaps from the F5 fidelity audit
(Rubric A items c2, h, i). Both reproduce Composer-2 §4.1/§4.2 exactly,
without the hint mystery. 33 new tests; full suite 415 passed / 66 skipped.
(1) k1-in-reward KL (F5 #1, the highest-fidelity-leverage single change).
Composer-2 §4.1 chooses the k1 estimator IN THE REWARD (-log r); TRL uses
k3 IN THE LOSS. The 2025/26 evidence (arXiv:2512.21852, verl k1-only, TRL
#4967) says k1-in-reward improves OOD generalization where k3-in-reward can
collapse — exactly the "take any model to the next level" axis.
- trainer/kl_in_reward.py: pure, tested math. k1/k3 per-sequence
estimators + apply_kl_in_reward, which adjusts GRPO advantages by
-coef·(KL - group_mean(KL)). test_kl_in_reward.py PROVES this exactly
equals folding -coef·KL into the reward then re-running GRPO's
group-mean baseline (9 group/gen combos) — the justification for not
forking TRL's scoring code. Exact only without std-norm (Dr.GRPO regime).
- composer_trainer.py: opt-in kl_in_reward=True / kl_estimator=. Keeps
beta>0 (TRL needs it to compute ref logps) but suppresses TRL's in-loss
k3 per step (beta→0 across the parent _compute_loss) and folds the k1
penalty into advantages at scoring time. validate_kl_in_reward_config
(extracted, unit-tested) guards beta>0 + scale_rewards∈{none,false}.
OFF by default → byte-for-byte legacy. getattr-guarded so __new__-built
test stubs default to legacy.
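The advantage-adjustment identity described in (1) can be checked by hand. The sketch below is a minimal, standalone illustration and not the code in trainer/kl_in_reward.py: the function names are hypothetical, the per-sequence estimators are assumed to sum over tokens, and the baseline is the plain group mean with no std-normalization (the Dr.GRPO regime named above).

```python
import math


def k1_per_sequence(logp, ref_logp):
    """k1 = sum_t (logp_t - ref_logp_t), i.e. -log r with r = pi_ref / pi."""
    return sum(lp - rlp for lp, rlp in zip(logp, ref_logp))


def k3_per_sequence(logp, ref_logp):
    """k3 = sum_t (r_t - 1 - log r_t), with log r_t = ref_logp_t - logp_t."""
    total = 0.0
    for lp, rlp in zip(logp, ref_logp):
        log_r = rlp - lp
        total += math.exp(log_r) - 1.0 - log_r
    return total


def group_mean_advantages(rewards):
    """GRPO baseline without std-normalization: reward minus group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def adjust_advantages(advantages, kls, coef):
    """Subtract coef * (KL - group_mean(KL)) from each advantage."""
    kl_mean = sum(kls) / len(kls)
    return [a - coef * (kl - kl_mean) for a, kl in zip(advantages, kls)]


# One toy group of three completions (made-up numbers, two tokens each).
rewards = [1.0, 0.0, 0.5]
logps = [[-0.5, -1.0], [-2.0, -0.1], [-0.3, -0.3]]
ref_logps = [[-0.7, -0.9], [-1.5, -0.4], [-0.3, -0.2]]
coef = 0.05

kls = [k1_per_sequence(lp, rlp) for lp, rlp in zip(logps, ref_logps)]

# Route A: adjust the already-computed advantages.
route_a = adjust_advantages(group_mean_advantages(rewards), kls, coef)
# Route B: fold the penalty into the reward, then re-run the baseline.
route_b = group_mean_advantages([r - coef * kl for r, kl in zip(rewards, kls)])

assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(route_a, route_b))
```

The two routes agree because the group mean is linear in the reward. Dividing by the group standard deviation breaks that linearity, which is why the commit restricts the exact claim to the case without std-norm.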
(2) Composer-2 behavior rewards (F5 #2, Rubric A h+i).
- datagen/behavior_rewards.py: c_length (the exact nonlinear length/effort
penalty ((1+kx)^(1-q)-1)/(k(1-q)) + q→1 log-limit), EffortWeights (the
weighted {thinking/tool-call/tool-output/final-msg tokens, #calls,
#turns} combination), LengthEffortPenalty, and the aux scalar shapers
UnfinishedTodoPenalty / LeftoverCoTPenalty / CommunicationReward — the
exact emergent behaviors §4.2 names. All TRL RewardFuncs composable with
env.reward_fn via reward_weights. default_behavior_rewards() = the
starter array. test_behavior_rewards.py pins the math + contract.
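To make the shape of c_length in (2) concrete, here is a small standalone sketch of the same formula, restated without the input validation. The hyperparameters k=0.01, q=2.0 and coef=0.1 are the defaults that appear later in this diff; the sample effort values are made up for illustration.

```python
import math


def c_length(x: float, k: float, q: float) -> float:
    """((1 + k*x)**(1 - q) - 1) / (k * (1 - q)), with the q -> 1 log limit."""
    base = 1.0 + k * x
    if abs(1.0 - q) < 1e-9:
        return math.log(base) / k
    return (base ** (1.0 - q) - 1.0) / (k * (1.0 - q))


k, q, coef = 0.01, 2.0, 0.1

# For q > 1 the curve saturates at this value as x grows.
upper_bound = 1.0 / (k * (q - 1.0))

for x in (0.0, 100.0, 880.0, 10_000.0):
    cost = c_length(x, k, q)
    print(f"x={x:>8.0f}  C_length={cost:8.3f}  reward term={-coef * cost:7.3f}")
```

For q=2 the formula reduces to x/(1+k·x), so x=100 gives about 50 and the curve approaches 1/(k·(q−1)) = 100 from below. With coef=0.1 the reward term can therefore approach −10, so anyone pairing it with a correctness reward on a 0..1 scale may want a smaller coef or weight.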
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- composer_replication/datagen/behavior_rewards.py +282 -0
- composer_replication/datagen/tests/test_behavior_rewards.py +166 -0
- composer_replication/trainer/composer_trainer.py +213 -3
- composer_replication/trainer/kl_in_reward.py +182 -0
- composer_replication/trainer/tests/test_dr_grpo_config_and_alignment.py +33 -0
- composer_replication/trainer/tests/test_kl_in_reward.py +146 -0
- research/design-F5-fidelity-audit.md +6 -6
|
@@ -0,0 +1,282 @@
| 1 |
+
"""behavior_rewards.py — Composer-2 §4.2 behavior shaping (hint-free).
|
| 2 |
+
|
| 3 |
+
THE FIDELITY GAP (F5 Rubric A items h + i). Composer-2 shapes agent behavior
|
| 4 |
+
NOT with hints (that's the unreproducible Composer-2.5 mechanism) but with an
|
| 5 |
+
**array of auxiliary scalar rewards** + a **nonlinear length/effort penalty**.
|
| 6 |
+
research/10 verified both against the report and gives the exact equation. None
|
| 7 |
+
of it was in code; this module implements them as TRL ``RewardFunc``s
|
| 8 |
+
composable with ``FeatureDeletionEnv.reward_fn`` (the test-pass-fraction reward).
|
| 9 |
+
|
| 10 |
+
Quoting Composer-2 §4.2 (research/10-composer2-techreport-mining.md:74-77):
|
| 11 |
+
* "we apply an array of auxiliary rewards … rewards for coding style,
|
| 12 |
+
communication, and product-specific penalties for poor tool calls, such as
|
| 13 |
+
creating to-do list items and then leaving them unfinished."
|
| 14 |
+
* Nonlinear length/effort penalty:
|
| 15 |
+
C_length{k,q}(x) = ((1 + k·x)^{1−q} − 1) / (k·(1−q))
|
| 16 |
+
concave-down & increasing, x = weighted combination of {thinking tokens,
|
| 17 |
+
tool-calling tokens, tool-output tokens, final-message tokens, #tool calls,
|
| 18 |
+
#turns}. Goal: quick on easy tasks, think longer on hard ones; observed to
|
| 19 |
+
induce parallel tool calls.
|
| 20 |
+
|
| 21 |
+
DESIGN. Every reward here is a callable matching TRL's ``RewardFunc`` contract
|
| 22 |
+
``(prompts, completions, **kwargs) -> list[float]`` so it drops straight into
|
| 23 |
+
``GRPOConfig.reward_funcs`` alongside the env's correctness reward. They are
|
| 24 |
+
deliberately SIGNAL-ONLY shapers (small magnitudes, penalties ≤ 0): the
|
| 25 |
+
correctness reward must dominate, and these never gate it. TRL combines multiple
|
| 26 |
+
reward funcs via ``reward_weights`` (sum), so the composed objective is
|
| 27 |
+
``reward = correctness + Σ_i w_i · behavior_i`` — exactly Composer-2's "array of
|
| 28 |
+
auxiliary rewards" added to the primary signal.
|
| 29 |
+
|
| 30 |
+
The length penalty is the substantive piece; the aux rewards are heuristic
|
| 31 |
+
pattern-matchers over the completion text (unfinished-todo, leftover-CoT-in-code,
|
| 32 |
+
terminal-only collapse — the exact emergent behaviors the report says it
|
| 33 |
+
reactively penalized). They are intentionally simple and overridable.
|
| 34 |
+
"""
|
| 35 |
+
from __future__ import annotations
|
| 36 |
+
|
| 37 |
+
import re
|
| 38 |
+
from collections.abc import Callable, Mapping
|
| 39 |
+
from dataclasses import dataclass, field
|
| 40 |
+
|
| 41 |
+
#: A TRL reward function: (prompts, completions, **kwargs) -> per-completion floats.
|
| 42 |
+
RewardFunc = Callable[..., list[float]]
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
# ---------------------------------------------------------------------
|
| 46 |
+
# (i) Nonlinear length / effort penalty — C_length{k,q}(x)
|
| 47 |
+
# ---------------------------------------------------------------------
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
def c_length(x: float, k: float, q: float) -> float:
|
| 51 |
+
"""Composer-2 nonlinear length/effort penalty C_length{k,q}(x).
|
| 52 |
+
|
| 53 |
+
C_length(x) = ((1 + k·x)^{1−q} − 1) / (k·(1−q))
|
| 54 |
+
|
| 55 |
+
Concave-down & increasing in x for k>0, q>1 (the report's regime): the
|
| 56 |
+
marginal penalty for more effort SHRINKS as effort grows, so the model is
|
| 57 |
+
pushed to be quick on easy tasks while still allowed to think longer on hard
|
| 58 |
+
ones (cheap incremental effort once already deep in a hard task).
|
| 59 |
+
|
| 60 |
+
Args:
|
| 61 |
+
x: the (non-negative) effort scalar — a weighted token/turn combination.
|
| 62 |
+
k: curvature hyperparameter (>0). Larger k flattens the curve sooner (the slope at x=0 is 1 for every k).
|
| 63 |
+
q: curvature hyperparameter. Any q>0 gives the concave-down increasing shape
|
| 64 |
+
(q>1 also caps it at 1/(k·(q−1))); q→1 is the log limit (handled below); q<0 is convex.
|
| 65 |
+
|
| 66 |
+
Returns:
|
| 67 |
+
The penalty magnitude (≥ 0 for x≥0, k>0). Callers negate it to subtract
|
| 68 |
+
from reward.
|
| 69 |
+
|
| 70 |
+
Edge cases:
|
| 71 |
+
* x = 0 → 0 (no effort, no penalty).
|
| 72 |
+
* q = 1 → the limit is log(1 + k·x)/k (L'Hôpital on the (1−q) factor),
|
| 73 |
+
which we return directly to avoid a 0/0.
|
| 74 |
+
"""
|
| 75 |
+
if x < 0:
|
| 76 |
+
raise ValueError(f"effort scalar x must be non-negative, got {x}")
|
| 77 |
+
if k <= 0:
|
| 78 |
+
raise ValueError(f"curvature k must be positive, got {k}")
|
| 79 |
+
base = 1.0 + k * x
|
| 80 |
+
if abs(1.0 - q) < 1e-9:
|
| 81 |
+
# Limit q→1: ((1+kx)^{1-q}-1)/(k(1-q)) → log(1+kx)/k.
|
| 82 |
+
import math
|
| 83 |
+
|
| 84 |
+
return math.log(base) / k
|
| 85 |
+
return (base ** (1.0 - q) - 1.0) / (k * (1.0 - q))
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
@dataclass(frozen=True)
|
| 89 |
+
class EffortWeights:
|
| 90 |
+
"""Weights for the effort scalar x (Composer-2's weighted combination).
|
| 91 |
+
|
| 92 |
+
x = (w_think·thinking_tokens + w_tool_call·tool_calling_tokens
|
| 93 |
+
+ w_tool_out·tool_output_tokens + w_final·final_message_tokens
|
| 94 |
+
+ w_calls·num_tool_calls + w_turns·num_turns).
|
| 95 |
+
|
| 96 |
+
Defaults weight raw token counts at 1.0 (per-token) and the discrete
|
| 97 |
+
call/turn counts more heavily, since each call/turn carries fixed overhead
|
| 98 |
+
the report wanted to discourage (it observed the penalty inducing PARALLEL
|
| 99 |
+
tool calls — fewer turns for the same work).
|
| 100 |
+
"""
|
| 101 |
+
|
| 102 |
+
think: float = 1.0
|
| 103 |
+
tool_call: float = 1.0
|
| 104 |
+
tool_output: float = 1.0
|
| 105 |
+
final_message: float = 1.0
|
| 106 |
+
num_calls: float = 50.0
|
| 107 |
+
num_turns: float = 100.0
|
| 108 |
+
|
| 109 |
+
def effort(self, features: Mapping[str, float]) -> float:
|
| 110 |
+
"""Compute x from a per-completion feature dict (missing keys → 0)."""
|
| 111 |
+
return (
|
| 112 |
+
self.think * features.get("thinking_tokens", 0.0)
|
| 113 |
+
+ self.tool_call * features.get("tool_calling_tokens", 0.0)
|
| 114 |
+
+ self.tool_output * features.get("tool_output_tokens", 0.0)
|
| 115 |
+
+ self.final_message * features.get("final_message_tokens", 0.0)
|
| 116 |
+
+ self.num_calls * features.get("num_tool_calls", 0.0)
|
| 117 |
+
+ self.num_turns * features.get("num_turns", 0.0)
|
| 118 |
+
)
|
| 119 |
+
|
| 120 |
+
|
| 121 |
+
@dataclass(frozen=True)
|
| 122 |
+
class LengthEffortPenalty:
|
| 123 |
+
"""TRL RewardFunc applying ``-coef · C_length{k,q}(x)`` per completion.
|
| 124 |
+
|
| 125 |
+
The effort scalar x is read from a per-completion ``effort_features`` kwarg
|
| 126 |
+
(a list of feature dicts, one per completion) when the agent harness can
|
| 127 |
+
supply structured token/turn counts; otherwise it falls back to a crude
|
| 128 |
+
proxy = completion character count / ``chars_per_token`` so the penalty is
|
| 129 |
+
never silently zero.
|
| 130 |
+
|
| 131 |
+
Penalty is ≤ 0 (a cost). ``coef`` scales it into reward units; keep it small
|
| 132 |
+
so correctness dominates.
|
| 133 |
+
"""
|
| 134 |
+
|
| 135 |
+
k: float = 0.01
|
| 136 |
+
q: float = 2.0
|
| 137 |
+
coef: float = 0.1
|
| 138 |
+
weights: EffortWeights = field(default_factory=EffortWeights)
|
| 139 |
+
chars_per_token: float = 4.0
|
| 140 |
+
|
| 141 |
+
def __call__(
|
| 142 |
+
self,
|
| 143 |
+
prompts: list[str] | None = None,
|
| 144 |
+
completions: list[str] | None = None,
|
| 145 |
+
*,
|
| 146 |
+
effort_features: list[Mapping[str, float]] | None = None,
|
| 147 |
+
**kwargs: object,
|
| 148 |
+
) -> list[float]:
|
| 149 |
+
completions = completions or []
|
| 150 |
+
out: list[float] = []
|
| 151 |
+
for i, comp in enumerate(completions):
|
| 152 |
+
if effort_features is not None and i < len(effort_features):
|
| 153 |
+
x = self.weights.effort(effort_features[i])
|
| 154 |
+
else:
|
| 155 |
+
# Proxy when structured features aren't wired: token-count estimate.
|
| 156 |
+
x = len(comp) / self.chars_per_token
|
| 157 |
+
out.append(-self.coef * c_length(x, self.k, self.q))
|
| 158 |
+
return out
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
# ---------------------------------------------------------------------
|
| 162 |
+
# (h) Auxiliary scalar rewards — coding style / communication / poor tool use
|
| 163 |
+
# ---------------------------------------------------------------------
|
| 164 |
+
|
| 165 |
+
# Heuristic patterns for the emergent behaviors Composer-2 §4.2 names. Simple
|
| 166 |
+
# and overridable by design; the report itself added these REACTIVELY by
|
| 167 |
+
# monitoring, so they're meant to grow, not be exhaustive on day one.
|
| 168 |
+
|
| 169 |
+
_TODO_CREATE = re.compile(r"\b(?:add|create|TODO|todo)[\s_-]*(?:item|list)?\b", re.I)
|
| 170 |
+
_TODO_DONE = re.compile(r"\b(?:done|completed|finished|resolved|checked off)\b", re.I)
|
| 171 |
+
# Long chain-of-thought left as a code comment (an explicitly-penalized emergent
|
| 172 |
+
# behavior): a comment line that's prose-long rather than a terse code note.
|
| 173 |
+
_LONG_COMMENT = re.compile(r"^\s*(?:#|//)\s?.{120,}$", re.M)
|
| 174 |
+
|
| 175 |
+
|
| 176 |
+
@dataclass(frozen=True)
|
| 177 |
+
class UnfinishedTodoPenalty:
|
| 178 |
+
"""Penalize creating to-do items then leaving them unfinished (the report's
|
| 179 |
+
explicit "poor tool call" example). Heuristic: penalty scales with
|
| 180 |
+
max(0, #todo-creations − #todo-completions)."""
|
| 181 |
+
|
| 182 |
+
coef: float = 0.05
|
| 183 |
+
|
| 184 |
+
def __call__(
|
| 185 |
+
self, prompts: list[str] | None = None, completions: list[str] | None = None,
|
| 186 |
+
**kwargs: object,
|
| 187 |
+
) -> list[float]:
|
| 188 |
+
out: list[float] = []
|
| 189 |
+
for comp in completions or []:
|
| 190 |
+
created = len(_TODO_CREATE.findall(comp))
|
| 191 |
+
done = len(_TODO_DONE.findall(comp))
|
| 192 |
+
unfinished = max(0, created - done)
|
| 193 |
+
out.append(-self.coef * unfinished)
|
| 194 |
+
return out
|
| 195 |
+
|
| 196 |
+
|
| 197 |
+
@dataclass(frozen=True)
|
| 198 |
+
class LeftoverCoTPenalty:
|
| 199 |
+
"""Penalize leaving long chain-of-thought in code comments (an emergent
|
| 200 |
+
behavior the report observed and reactively penalized). Heuristic: penalty
|
| 201 |
+
per over-long comment line."""
|
| 202 |
+
|
| 203 |
+
coef: float = 0.05
|
| 204 |
+
|
| 205 |
+
def __call__(
|
| 206 |
+
self, prompts: list[str] | None = None, completions: list[str] | None = None,
|
| 207 |
+
**kwargs: object,
|
| 208 |
+
) -> list[float]:
|
| 209 |
+
return [
|
| 210 |
+
-self.coef * len(_LONG_COMMENT.findall(comp))
|
| 211 |
+
for comp in (completions or [])
|
| 212 |
+
]
|
| 213 |
+
|
| 214 |
+
|
| 215 |
+
@dataclass(frozen=True)
|
| 216 |
+
class CommunicationReward:
|
| 217 |
+
"""Small POSITIVE reward for a concise, present final message (the report's
|
| 218 |
+
"communication" reward). Heuristic: reward presence of a non-empty final
|
| 219 |
+
message under a soft length cap; penalize empty or runaway-long finals.
|
| 220 |
+
|
| 221 |
+
Composer-2 rewards good communication; we approximate "good" as "a final
|
| 222 |
+
message exists and is reasonably concise" — the cheapest non-degenerate
|
| 223 |
+
proxy. Override for a real judge-based communication score.
|
| 224 |
+
"""
|
| 225 |
+
|
| 226 |
+
coef: float = 0.02
|
| 227 |
+
soft_cap_chars: int = 2000
|
| 228 |
+
|
| 229 |
+
def __call__(
|
| 230 |
+
self, prompts: list[str] | None = None, completions: list[str] | None = None,
|
| 231 |
+
*, final_messages: list[str] | None = None, **kwargs: object,
|
| 232 |
+
) -> list[float]:
|
| 233 |
+
completions = completions or []
|
| 234 |
+
out: list[float] = []
|
| 235 |
+
for i, comp in enumerate(completions):
|
| 236 |
+
msg = (
|
| 237 |
+
final_messages[i]
|
| 238 |
+
if final_messages is not None and i < len(final_messages)
|
| 239 |
+
else comp
|
| 240 |
+
)
|
| 241 |
+
n = len(msg.strip())
|
| 242 |
+
if n == 0:
|
| 243 |
+
out.append(-self.coef) # no communication at all
|
| 244 |
+
elif n <= self.soft_cap_chars:
|
| 245 |
+
out.append(self.coef) # present + concise
|
| 246 |
+
else:
|
| 247 |
+
# present but runaway-long: decays back toward 0 then negative.
|
| 248 |
+
overflow = (n - self.soft_cap_chars) / self.soft_cap_chars
|
| 249 |
+
out.append(self.coef * max(-1.0, 1.0 - overflow))
|
| 250 |
+
return out
|
| 251 |
+
|
| 252 |
+
|
| 253 |
+
def default_behavior_rewards() -> list[RewardFunc]:
|
| 254 |
+
"""The Composer-2 §4.2 starter array, ready to extend with reward_weights.
|
| 255 |
+
|
| 256 |
+
Returns the length/effort penalty + the three aux shapers. Compose with the
|
| 257 |
+
env correctness reward, e.g.::
|
| 258 |
+
|
| 259 |
+
reward_funcs = [env.reward_fn, *default_behavior_rewards()]
|
| 260 |
+
reward_weights = [1.0, 1.0, 1.0, 1.0, 1.0] # correctness dominates
|
| 261 |
+
|
| 262 |
+
The report adds these REACTIVELY by monitoring emergent behaviors, so treat
|
| 263 |
+
this as the seed set, not the final list.
|
| 264 |
+
"""
|
| 265 |
+
return [
|
| 266 |
+
LengthEffortPenalty(),
|
| 267 |
+
UnfinishedTodoPenalty(),
|
| 268 |
+
LeftoverCoTPenalty(),
|
| 269 |
+
CommunicationReward(),
|
| 270 |
+
]
|
| 271 |
+
|
| 272 |
+
|
| 273 |
+
__all__ = [
|
| 274 |
+
"RewardFunc",
|
| 275 |
+
"c_length",
|
| 276 |
+
"EffortWeights",
|
| 277 |
+
"LengthEffortPenalty",
|
| 278 |
+
"UnfinishedTodoPenalty",
|
| 279 |
+
"LeftoverCoTPenalty",
|
| 280 |
+
"CommunicationReward",
|
| 281 |
+
"default_behavior_rewards",
|
| 282 |
+
]
|
|
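The module docstring describes the composed objective as ``reward = correctness + Σ_i w_i · behavior_i``. The sketch below mimics that weighted sum in plain Python so the arithmetic is visible. It is not TRL's implementation; `combine`, `toy_correctness` and `toy_length_penalty` are hypothetical names introduced only for this illustration.

```python
from collections.abc import Callable, Sequence

RewardFunc = Callable[..., list[float]]


def toy_correctness(prompts=None, completions=None, **kwargs) -> list[float]:
    """Stand-in for an env correctness reward: 1.0 if the text contains PASS."""
    return [1.0 if "PASS" in c else 0.0 for c in (completions or [])]


def toy_length_penalty(prompts=None, completions=None, **kwargs) -> list[float]:
    """Stand-in shaper: a small negative cost per character."""
    return [-0.001 * len(c) for c in (completions or [])]


def combine(
    reward_funcs: Sequence[RewardFunc],
    reward_weights: Sequence[float],
    prompts: list[str],
    completions: list[str],
    **kwargs: object,
) -> list[float]:
    """Weighted sum of per-completion rewards, one total per completion."""
    if len(reward_funcs) != len(reward_weights):
        raise ValueError("need exactly one weight per reward function")
    totals = [0.0] * len(completions)
    for func, weight in zip(reward_funcs, reward_weights):
        values = func(prompts=prompts, completions=completions, **kwargs)
        if len(values) != len(completions):
            raise ValueError("reward function must return one float per completion")
        for i, value in enumerate(values):
            totals[i] += weight * value
    return totals


completions = ["PASS", "fail"]
totals = combine(
    [toy_correctness, toy_length_penalty],
    [1.0, 1.0],
    prompts=["p", "p"],
    completions=completions,
)
print(totals)
```

Because every shaper accepts ``**kwargs``, extra per-completion inputs such as ``effort_features`` or ``final_messages`` can be passed through the same call without changing the other functions.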
@@ -0,0 +1,166 @@
| 1 |
+
"""Tests for Composer-2 §4.2 behavior-shaping rewards (F5 #2 fidelity fix).
|
| 2 |
+
|
| 3 |
+
The substantive coverage is the C_length math (the exact report equation +
|
| 4 |
+
its concave-down/increasing shape + edge cases) and the TRL RewardFunc
|
| 5 |
+
contract (one float per completion, penalties ≤ 0).
|
| 6 |
+
"""
|
| 7 |
+
from __future__ import annotations
|
| 8 |
+
|
| 9 |
+
import math
|
| 10 |
+
|
| 11 |
+
import pytest
|
| 12 |
+
|
| 13 |
+
from composer_replication.datagen.behavior_rewards import (
|
| 14 |
+
CommunicationReward,
|
| 15 |
+
EffortWeights,
|
| 16 |
+
LeftoverCoTPenalty,
|
| 17 |
+
LengthEffortPenalty,
|
| 18 |
+
UnfinishedTodoPenalty,
|
| 19 |
+
c_length,
|
| 20 |
+
default_behavior_rewards,
|
| 21 |
+
)
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
# ---------------------------------------------------------------------
|
| 25 |
+
# C_length{k,q}(x) — the exact equation
|
| 26 |
+
# ---------------------------------------------------------------------
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
def test_c_length_matches_closed_form():
|
| 30 |
+
k, q, x = 0.01, 2.0, 500.0
|
| 31 |
+
expected = ((1 + k * x) ** (1 - q) - 1) / (k * (1 - q))
|
| 32 |
+
assert c_length(x, k, q) == pytest.approx(expected)
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
def test_c_length_zero_effort_is_zero():
|
| 36 |
+
assert c_length(0.0, k=0.01, q=2.0) == pytest.approx(0.0)
|
| 37 |
+
|
| 38 |
+
|
| 39 |
+
def test_c_length_increasing_and_concave_down():
|
| 40 |
+
"""q>1, k>0: increasing in x, with SHRINKING marginal penalty (concave-down)."""
|
| 41 |
+
k, q = 0.01, 2.0
|
| 42 |
+
xs = [0.0, 100.0, 200.0, 300.0, 400.0]
|
| 43 |
+
vals = [c_length(x, k, q) for x in xs]
|
| 44 |
+
diffs = [vals[i + 1] - vals[i] for i in range(len(vals) - 1)]
|
| 45 |
+
# strictly increasing
|
| 46 |
+
assert all(d > 0 for d in diffs), f"not increasing: {vals}"
|
| 47 |
+
# concave-down: each marginal increment smaller than the previous
|
| 48 |
+
assert all(diffs[i + 1] < diffs[i] for i in range(len(diffs) - 1)), (
|
| 49 |
+
f"not concave-down (marginal penalty should shrink): {diffs}"
|
| 50 |
+
)
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
def test_c_length_q_equals_one_limit():
|
| 54 |
+
"""q→1 limit is log(1+kx)/k (avoids the 0/0 in the closed form)."""
|
| 55 |
+
k, x = 0.02, 300.0
|
| 56 |
+
assert c_length(x, k, q=1.0) == pytest.approx(math.log(1 + k * x) / k)
|
| 57 |
+
# continuity: q just off 1 should be very close to the limit.
|
| 58 |
+
near = c_length(x, k, q=1.0 + 1e-7)
|
| 59 |
+
assert near == pytest.approx(math.log(1 + k * x) / k, rel=1e-3)
|
| 60 |
+
|
| 61 |
+
|
| 62 |
+
def test_c_length_validates_inputs():
|
| 63 |
+
with pytest.raises(ValueError, match="non-negative"):
|
| 64 |
+
c_length(-1.0, k=0.01, q=2.0)
|
| 65 |
+
with pytest.raises(ValueError, match="k must be positive"):
|
| 66 |
+
c_length(1.0, k=0.0, q=2.0)
|
| 67 |
+
|
| 68 |
+
|
| 69 |
+
# ---------------------------------------------------------------------
|
| 70 |
+
# EffortWeights
|
| 71 |
+
# ---------------------------------------------------------------------
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
def test_effort_weights_weighted_combination():
|
| 75 |
+
w = EffortWeights(think=1.0, tool_call=1.0, tool_output=1.0, final_message=1.0,
|
| 76 |
+
num_calls=50.0, num_turns=100.0)
|
| 77 |
+
feats = {
|
| 78 |
+
"thinking_tokens": 100, "tool_calling_tokens": 50,
|
| 79 |
+
"tool_output_tokens": 200, "final_message_tokens": 30,
|
| 80 |
+
"num_tool_calls": 4, "num_turns": 3,
|
| 81 |
+
}
|
| 82 |
+
# 100+50+200+30 + 50*4 + 100*3 = 380 + 200 + 300 = 880
|
| 83 |
+
assert w.effort(feats) == pytest.approx(880.0)
|
| 84 |
+
|
| 85 |
+
|
| 86 |
+
def test_effort_weights_missing_keys_are_zero():
|
| 87 |
+
assert EffortWeights().effort({}) == pytest.approx(0.0)
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
# ---------------------------------------------------------------------
|
| 91 |
+
# LengthEffortPenalty — the RewardFunc
|
| 92 |
+
# ---------------------------------------------------------------------
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
def test_length_penalty_one_value_per_completion_and_nonpositive():
|
| 96 |
+
rf = LengthEffortPenalty(coef=0.1)
|
| 97 |
+
comps = ["short", "a much longer completion " * 50, ""]
|
| 98 |
+
out = rf(prompts=None, completions=comps)
|
| 99 |
+
assert len(out) == len(comps)
|
| 100 |
+
assert all(v <= 0 for v in out), f"length penalty must be ≤ 0: {out}"
|
| 101 |
+
|
| 102 |
+
|
| 103 |
+
def test_length_penalty_longer_is_more_penalized():
|
| 104 |
+
rf = LengthEffortPenalty(coef=0.1)
|
| 105 |
+
short, long = "x" * 100, "x" * 10000
|
| 106 |
+
out = rf(completions=[short, long])
|
| 107 |
+
assert out[1] < out[0], "longer completion must incur a larger (more negative) penalty"
|
| 108 |
+
|
| 109 |
+
|
| 110 |
+
def test_length_penalty_uses_structured_features_when_present():
|
| 111 |
+
rf = LengthEffortPenalty(coef=1.0, k=0.01, q=2.0)
|
| 112 |
+
# Same short text, but big structured effort → penalty driven by features.
|
| 113 |
+
feats = [{"num_turns": 10, "thinking_tokens": 5000}]
|
| 114 |
+
out_feat = rf(completions=["ok"], effort_features=feats)
|
| 115 |
+
out_proxy = rf(completions=["ok"])
|
| 116 |
+
assert out_feat[0] < out_proxy[0], "structured high-effort must out-penalize the char proxy"
|
| 117 |
+
|
| 118 |
+
|
| 119 |
+
# ---------------------------------------------------------------------
|
| 120 |
+
# Auxiliary scalar rewards
|
| 121 |
+
# ---------------------------------------------------------------------
|
| 122 |
+
|
| 123 |
+
|
| 124 |
+
def test_unfinished_todo_penalty():
|
| 125 |
+
rf = UnfinishedTodoPenalty(coef=0.05)
|
| 126 |
+
# Heuristic counts create-like and done-like tokens; penalty scales with
|
| 127 |
+
# max(0, creates - dones). More creates than completions → penalized.
|
| 128 |
+
unbalanced = "TODO: add item one. create todo item two." # several create tokens, 0 done
|
| 129 |
+
over_done = "Done. completed. finished. resolved." # dones ≥ creates → no penalty
|
| 130 |
+
out = rf(completions=[unbalanced, over_done, "no todos here"])
|
| 131 |
+
assert out[0] < 0, "more creations than completions must be penalized"
|
| 132 |
+
assert out[1] == pytest.approx(0.0), "completions ≥ creations → no penalty"
|
| 133 |
+
assert out[2] == pytest.approx(0.0), "no todo activity → no penalty"
|
| 134 |
+
|
| 135 |
+
|
| 136 |
+
def test_leftover_cot_penalty():
|
| 137 |
+
rf = LeftoverCoTPenalty(coef=0.05)
|
| 138 |
+
long_comment = "# " + "this is a very long chain of thought left in a comment " * 5
|
| 139 |
+
clean = "# terse note\nx = 1"
|
| 140 |
+
out = rf(completions=[long_comment + "\n" + "code()", clean])
|
| 141 |
+
assert out[0] < 0, "long CoT comment must be penalized"
|
| 142 |
+
assert out[1] == pytest.approx(0.0)
|
| 143 |
+
|
| 144 |
+
|
| 145 |
+
def test_communication_reward_concise_present_vs_empty():
|
| 146 |
+
rf = CommunicationReward(coef=0.02, soft_cap_chars=2000)
|
| 147 |
+
out = rf(completions=["Fixed the bug by adding a null check.", ""])
|
| 148 |
+
assert out[0] == pytest.approx(0.02), "concise present final → +coef"
|
| 149 |
+
assert out[1] == pytest.approx(-0.02), "empty final → -coef"
|
| 150 |
+
|
| 151 |
+
|
| 152 |
+
def test_communication_reward_runaway_long_decays():
|
| 153 |
+
rf = CommunicationReward(coef=0.02, soft_cap_chars=100)
|
| 154 |
+
out = rf(completions=["x" * 1000]) # 9x over cap
|
| 155 |
+
assert out[0] < 0.02, "runaway-long final should decay below the concise reward"
|
| 156 |
+
|
| 157 |
+
|
| 158 |
+
def test_default_array_contract():
|
| 159 |
+
"""All default behavior rewards honor the TRL RewardFunc contract."""
|
| 160 |
+
funcs = default_behavior_rewards()
|
| 161 |
+
assert len(funcs) == 4
|
| 162 |
+
comps = ["completion one", "TODO: add item, never done", "x" * 5000]
|
| 163 |
+
for f in funcs:
|
| 164 |
+
out = f(prompts=["p"] * len(comps), completions=comps)
|
| 165 |
+
assert len(out) == len(comps), f"{type(f).__name__} returned wrong count"
|
| 166 |
+
assert all(isinstance(v, float) for v in out)
|
|
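The piecewise rule that the two CommunicationReward tests exercise is easier to see as a function of message length alone. This is a minimal restatement for illustration, assuming the caller has already stripped whitespace and passes the character count; `communication_score` is a hypothetical helper name, not part of the module.

```python
def communication_score(
    n_chars: int, coef: float = 0.02, soft_cap_chars: int = 2000
) -> float:
    """Score a final message by its stripped length, mirroring the rule above."""
    if n_chars == 0:
        return -coef  # no final message at all
    if n_chars <= soft_cap_chars:
        return coef  # present and concise
    # Past the cap the score decays linearly and is floored at -coef.
    overflow = (n_chars - soft_cap_chars) / soft_cap_chars
    return coef * max(-1.0, 1.0 - overflow)


for n in (0, 500, 2000, 3000, 4000, 6000, 20_000):
    print(n, communication_score(n))
```

With the default cap, the score crosses zero at twice the cap (4000 characters) and reaches the floor of −coef at three times the cap (6000 characters). The test's 1000-character message against a 100-character cap sits well past that floor.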
@@ -47,6 +47,10 @@ except ImportError: # pragma: no cover — only hit in unit-test stubs without
|
|
| 47 |
_TRL_AVAILABLE = False
|
| 48 |
|
| 49 |
from composer_replication.opsd import generalized_jsd_loss
|
| 50 |
|
| 51 |
logger = logging.getLogger(__name__)
|
| 52 |
|
|
@@ -67,6 +71,22 @@ class ComposerReplicationTrainer(GRPOTrainer): # type: ignore[misc, valid-type]
|
|
| 67 |
sdpo_temperature: temperature for SDPO loss; SDPO paper uses 1.0.
|
| 68 |
sdpo_token_clip: per-token JSD clip for stability; None = no clip.
|
| 69 |
replay_dpo_beta: beta param of the DPO loss (β in the standard DPO formula).
|
| 70 |
heldout_guard: optional ``HeldOutGuard`` (the #2 collapse safeguard from
|
| 71 |
``composer_replication.safety``). Default None = OFF (no behavior
|
| 72 |
change whatsoever). When supplied, the trainer folds one checkpoint's
|
|
@@ -98,6 +118,8 @@ class ComposerReplicationTrainer(GRPOTrainer): # type: ignore[misc, valid-type]
|
|
| 98 |
sdpo_token_clip: float | None = None,
|
| 99 |
replay_dpo_beta: float = 0.1,
|
| 100 |
strict_sdpo_alignment: bool = True,
|
| 101 |
heldout_guard: HeldOutGuard | None = None,
|
| 102 |
heldout_eval_fn: Callable[[], float] | None = None,
|
| 103 |
strict_killswitch: bool = True,
|
|
@@ -121,6 +143,19 @@ class ComposerReplicationTrainer(GRPOTrainer): # type: ignore[misc, valid-type]
|
|
| 121 |
# trust-gap flagged in ADR-008). Set False only for production runs
|
| 122 |
# where a single malformed batch should warn-and-skip rather than abort.
|
| 123 |
self.strict_sdpo_alignment = strict_sdpo_alignment
|
| 124 |
# --- run-level collapse kill-switch (#2 safeguard) -------------------
|
| 125 |
# OPTIONAL + OFF BY DEFAULT: when heldout_guard is None the loss path is
|
| 126 |
# byte-for-byte the legacy behavior. When set, _maybe_update_killswitch
|
|
@@ -141,14 +176,85 @@ class ComposerReplicationTrainer(GRPOTrainer): # type: ignore[misc, valid-type]
|
|
| 141 |
# Loss override (the integration core)
|
| 142 |
# ----------------------------------------------------------------------
|
| 143 |
|
| 144 |
def _compute_loss(
|
| 145 |
self,
|
| 146 |
model: torch.nn.Module,
|
| 147 |
inputs: dict[str, torch.Tensor],
|
| 148 |
) -> torch.Tensor:
|
| 149 |
-
"""Override: total_loss = grpo + α*sdpo + β*replay.
|
| 150 |
-
|
| 151 |
-
|
| 152 |
|
| 153 |
# Channel 2: SDPO hint-distill at error sites
|
| 154 |
sdpo_kl = self._compute_sdpo_loss(model, inputs)
|
|
@@ -177,6 +283,64 @@ class ComposerReplicationTrainer(GRPOTrainer): # type: ignore[misc, valid-type]
|
|
| 177 |
|
| 178 |
return total
|
| 179 |
|
| 180 |
# ----------------------------------------------------------------------
|
| 181 |
# Run-level collapse kill-switch (#2 safeguard) — optional, OFF by default
|
| 182 |
# ----------------------------------------------------------------------
|
|
@@ -480,6 +644,51 @@ def _device_of(model: torch.nn.Module) -> torch.device:
|
|
| 480 |
return next(model.parameters()).device
|
| 481 |
|
| 482 |
|
| 483 |
def make_dr_grpo_config(**overrides: Any):
|
| 484 |
"""Build a `trl.GRPOConfig` configured to the **Dr. GRPO** recipe.
|
| 485 |
|
|
@@ -685,4 +894,5 @@ __all__ = [
|
|
| 685 |
"make_dr_grpo_config",
|
| 686 |
"make_po_config",
|
| 687 |
"PO_OBJECTIVES",
|
| 688 |
]
|
|
|
|
| 47 |
_TRL_AVAILABLE = False
|
| 48 |
|
| 49 |
from composer_replication.opsd import generalized_jsd_loss
|
| 50 |
+
from composer_replication.trainer.kl_in_reward import (
|
| 51 |
+
apply_kl_in_reward,
|
| 52 |
+
kl_penalty_per_sequence,
|
| 53 |
+
)
|
| 54 |
|
| 55 |
logger = logging.getLogger(__name__)
|
| 56 |
|
|
|
|
| 71 |
sdpo_temperature: temperature for SDPO loss; SDPO paper uses 1.0.
|
| 72 |
sdpo_token_clip: per-token JSD clip for stability; None = no clip.
|
| 73 |
replay_dpo_beta: beta param of the DPO loss (β in the standard DPO formula).
|
| 74 |
+
kl_in_reward: when True, apply the KL-to-reference penalty in the
|
| 75 |
+
**reward** (Composer-2 §4.1 / verl choice) instead of TRL's native
|
| 76 |
+
**in-loss** k3 term. The penalty is folded into GRPO's advantages at
|
| 77 |
+
scoring time (``adv -= beta·(KL - group_mean(KL))``) and TRL's
|
| 78 |
+
in-loss KL is suppressed for that step. The F5 audit's #1 fidelity
|
| 79 |
+
fix: the 2025/26 evidence (arXiv:2512.21852, verl, TRL #4967) shows
|
| 80 |
+
k1-in-reward improves OOD generalization where k3-in-reward can
|
| 81 |
+
collapse. REQUIRES ``beta>0`` (the KL coefficient — also how TRL
|
| 82 |
+
decides to compute reference logprobs) and ``scale_rewards`` in
|
| 83 |
+
{none,false} (the advantage-adjustment identity is exact only
|
| 84 |
+
without std-normalization — the Dr.GRPO / Composer regime). Default
|
| 85 |
+
False = TRL's native in-loss KL, byte-for-byte legacy behavior.
|
| 86 |
+
kl_estimator: ``"k1"`` (default; ``logp - ref_logp``, the Composer-2 /
|
| 87 |
+
verl choice this path exists for) or ``"k3"`` (Schulman; lets an
|
| 88 |
+
experiment A/B k1-in-reward vs k3-in-reward). Only consulted when
|
| 89 |
+
``kl_in_reward=True``.
|
| 90 |
heldout_guard: optional ``HeldOutGuard`` (the #2 collapse safeguard from
|
| 91 |
``composer_replication.safety``). Default None = OFF (no behavior
|
| 92 |
change whatsoever). When supplied, the trainer folds one checkpoint's
|
|
|
|
| 118 |
sdpo_token_clip: float | None = None,
|
| 119 |
replay_dpo_beta: float = 0.1,
|
| 120 |
strict_sdpo_alignment: bool = True,
|
| 121 |
+
kl_in_reward: bool = False,
|
| 122 |
+
kl_estimator: str = "k1",
|
| 123 |
heldout_guard: HeldOutGuard | None = None,
|
| 124 |
heldout_eval_fn: Callable[[], float] | None = None,
|
| 125 |
strict_killswitch: bool = True,
|
|
|
|
| 143 |
# trust-gap flagged in ADR-008). Set False only for production runs
|
| 144 |
# where a single malformed batch should warn-and-skip rather than abort.
|
| 145 |
self.strict_sdpo_alignment = strict_sdpo_alignment
|
| 146 |
+
# --- k1-in-reward KL (F5 #1 fidelity fix; Composer-2 §4.1 / verl) ----
|
| 147 |
+
# OFF by default → TRL's native in-loss k3 KL, byte-for-byte legacy.
|
| 148 |
+
# When ON we keep self.beta as the KL coef (TRL needs beta>0 to even
|
| 149 |
+
# create the ref model + compute ref logps), fold the k1 penalty into
|
| 150 |
+
# advantages during scoring, and zero TRL's in-loss KL per step.
|
| 151 |
+
self.kl_in_reward = kl_in_reward
|
| 152 |
+
self.kl_estimator = kl_estimator
|
| 153 |
+
if kl_in_reward:
|
| 154 |
+
validate_kl_in_reward_config(
|
| 155 |
+
kl_estimator=kl_estimator,
|
| 156 |
+
beta=float(getattr(self.args, "beta", 0.0)),
|
| 157 |
+
scale_rewards=getattr(self.args, "scale_rewards", "group"),
|
| 158 |
+
)
|
| 159 |
# --- run-level collapse kill-switch (#2 safeguard) -------------------
|
| 160 |
# OPTIONAL + OFF BY DEFAULT: when heldout_guard is None the loss path is
|
| 161 |
# byte-for-byte the legacy behavior. When set, _maybe_update_killswitch
|
|
|
|
| 176 |
# Loss override (the integration core)
|
| 177 |
# ----------------------------------------------------------------------
|
| 178 |
|
| 179 |
+
# ----------------------------------------------------------------------
|
| 180 |
+
# k1-in-reward: fold the KL penalty into advantages at scoring time, and
|
| 181 |
+
# suppress TRL's native in-loss k3 KL inside _compute_loss.
|
| 182 |
+
# ----------------------------------------------------------------------
|
| 183 |
+
|
| 184 |
+
def _generate_and_score_completions(
|
| 185 |
+
self,
|
| 186 |
+
inputs: list[dict[str, Any]],
|
| 187 |
+
) -> dict[str, Any]:
|
| 188 |
+
"""Override: after TRL scores completions, fold a k1 KL penalty into the
|
| 189 |
+
advantages (Composer-2 in-reward KL) when ``kl_in_reward`` is set.
|
| 190 |
+
|
| 191 |
+
No-op (exact legacy path) when ``kl_in_reward`` is False. When set, TRL
|
| 192 |
+
has already computed ``advantages``, ``ref_per_token_logps`` (because
|
| 193 |
+
``beta>0``), and the completion logprobs; we recompute the per-sequence
|
| 194 |
+
k1 penalty and apply the exact group-mean-baseline correction.
|
| 195 |
+
"""
|
| 196 |
+
output = super()._generate_and_score_completions(inputs)
|
| 197 |
+
if not getattr(self, "kl_in_reward", False):
|
| 198 |
+
return output
|
| 199 |
+
|
| 200 |
+
ref_logps = output.get("ref_per_token_logps")
|
| 201 |
+
# The "old" (sampling-time) policy logps are TRL's in-loss π term; they
|
| 202 |
+
# may be lazily None when generation/optimization are aligned and not
|
| 203 |
+
# vLLM (see TRL _compute_loss: old := per_token_logps.detach()). In that
|
| 204 |
+
# aligned case we cannot read π logps here, so we defer to _compute_loss
|
| 205 |
+
# (which always has per_token_logps) and only flag the penalty as pending.
|
| 206 |
+
old_logps = output.get("old_per_token_logps")
|
| 207 |
+
completion_mask = output.get("completion_mask")
|
| 208 |
+
if ref_logps is None or completion_mask is None:
|
| 209 |
+
# beta>0 guarantees ref_logps; this branch only trips on a TRL
|
| 210 |
+
# internals change — fail loud rather than silently skip the penalty.
|
| 211 |
+
raise RuntimeError(
|
| 212 |
+
"kl_in_reward=True but TRL did not return ref_per_token_logps / "
|
| 213 |
+
"completion_mask from scoring (beta>0 should guarantee them). "
|
| 214 |
+
"TRL internals may have changed; re-verify the in-reward path."
|
| 215 |
+
)
|
| 216 |
+
|
| 217 |
+
if old_logps is not None:
|
| 218 |
+
penalty = kl_penalty_per_sequence(
|
| 219 |
+
policy_logps=old_logps,
|
| 220 |
+
ref_logps=ref_logps,
|
| 221 |
+
completion_mask=completion_mask,
|
| 222 |
+
estimator=self.kl_estimator,
|
| 223 |
+
)
|
| 224 |
+
output["advantages"] = apply_kl_in_reward(
|
| 225 |
+
advantages=output["advantages"],
|
| 226 |
+
kl_penalty=penalty,
|
| 227 |
+
num_generations=self.num_generations,
|
| 228 |
+
coef=float(self.args.beta),
|
| 229 |
+
)
|
| 230 |
+
output["_kl_in_reward_applied"] = torch.tensor(True)
|
| 231 |
+
else:
|
| 232 |
+
# Aligned non-vLLM case: π logps materialize only in _compute_loss.
|
| 233 |
+
# Ref logps + mask are already in ``output``; flag the penalty as pending.
|
| 234 |
+
output["_kl_in_reward_applied"] = torch.tensor(False)
|
| 235 |
+
return output
|
| 236 |
+
|
| 237 |
def _compute_loss(
|
| 238 |
self,
|
| 239 |
model: torch.nn.Module,
|
| 240 |
inputs: dict[str, torch.Tensor],
|
| 241 |
) -> torch.Tensor:
|
| 242 |
+
"""Override: total_loss = grpo + α*sdpo + β*replay.
|
| 243 |
+
|
| 244 |
+
When ``kl_in_reward`` is set, TRL's native in-loss KL term (gated on
|
| 245 |
+
``self.beta``) is suppressed by temporarily zeroing ``self.beta`` for the
|
| 246 |
+
duration of the parent call — the KL has already been (or is about to be)
|
| 247 |
+
accounted for in the reward/advantage, so double-counting it in the loss
|
| 248 |
+
would be wrong. ``self.beta`` is restored in ``finally``.
|
| 249 |
+
"""
|
| 250 |
+
# Channel 1: standard GRPO loss. ``getattr`` (not ``self.kl_in_reward``)
|
| 251 |
+
# so an instance built via ``__new__`` + manual wiring (the SDPO /
|
| 252 |
+
# kill-switch unit-test pattern that skips __init__) defaults to the
|
| 253 |
+
# legacy path instead of raising AttributeError.
|
| 254 |
+
if getattr(self, "kl_in_reward", False):
|
| 255 |
+
grpo_loss = self._grpo_loss_kl_in_reward(model, inputs)
|
| 256 |
+
else:
|
| 257 |
+
grpo_loss = super()._compute_loss(model, inputs)
|
| 258 |
|
| 259 |
# Channel 2: SDPO hint-distill at error sites
|
| 260 |
sdpo_kl = self._compute_sdpo_loss(model, inputs)
|
|
|
|
| 283 |
|
| 284 |
return total
|
| 285 |
|
| 286 |
+
def _grpo_loss_kl_in_reward(
|
| 287 |
+
self,
|
| 288 |
+
model: torch.nn.Module,
|
| 289 |
+
inputs: dict[str, torch.Tensor],
|
| 290 |
+
) -> torch.Tensor:
|
| 291 |
+
"""GRPO loss with the KL applied in the reward, not the loss.
|
| 292 |
+
|
| 293 |
+
Two responsibilities:
|
| 294 |
+
1. Suppress TRL's native in-loss k3 KL term for this step by zeroing
|
| 295 |
+
``self.beta`` across the parent ``_compute_loss`` call (restored in
|
| 296 |
+
``finally``). ``self.beta`` gates the in-loss KL add (TRL
|
| 297 |
+
``_compute_loss``: ``if self.beta != 0.0: per_token_loss += beta*kl``).
|
| 298 |
+
2. Handle the deferred case: when generation/optimization are aligned
|
| 299 |
+
and not using vLLM, the sampling-time policy logps are None at
|
| 300 |
+
scoring time, so ``_generate_and_score_completions`` could not fold
|
| 301 |
+
the penalty into advantages. Here ``per_token_logps`` is available,
|
| 302 |
+
so we apply the exact same advantage correction in-place on
|
| 303 |
+
``inputs["advantages"]`` BEFORE the parent computes the surrogate.
|
| 304 |
+
"""
|
| 305 |
+
# Deferred-penalty path: advantages not yet KL-adjusted (aligned, no vLLM).
|
| 306 |
+
applied = inputs.get("_kl_in_reward_applied")
|
| 307 |
+
already_applied = bool(applied.item()) if applied is not None else False
|
| 308 |
+
if not already_applied and "ref_per_token_logps" in inputs:
|
| 309 |
+
with torch.no_grad():
|
| 310 |
+
prompt_ids, completion_ids = inputs["prompt_ids"], inputs["completion_ids"]
|
| 311 |
+
completion_mask = inputs["completion_mask"]
|
| 312 |
+
input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
|
| 313 |
+
attention_mask = torch.cat([inputs["prompt_mask"], completion_mask], dim=1)
|
| 314 |
+
logits_to_keep = completion_ids.size(1)
|
| 315 |
+
policy_logps, _ = self._get_per_token_logps_and_entropies(
|
| 316 |
+
model, input_ids, attention_mask, logits_to_keep
|
| 317 |
+
)
|
| 318 |
+
penalty = kl_penalty_per_sequence(
|
| 319 |
+
policy_logps=policy_logps,
|
| 320 |
+
ref_logps=inputs["ref_per_token_logps"],
|
| 321 |
+
completion_mask=completion_mask,
|
| 322 |
+
estimator=self.kl_estimator,
|
| 323 |
+
)
|
| 324 |
+
advantages = inputs["advantages"]
|
| 325 |
+
# advantages may be (B,) or (B,1) — squeeze for the penalty math,
|
| 326 |
+
# restore the original shape after.
|
| 327 |
+
adv_flat = advantages.reshape(advantages.shape[0])
|
| 328 |
+
adj = apply_kl_in_reward(
|
| 329 |
+
advantages=adv_flat,
|
| 330 |
+
kl_penalty=penalty,
|
| 331 |
+
num_generations=self.num_generations,
|
| 332 |
+
coef=float(self.args.beta),
|
| 333 |
+
)
|
| 334 |
+
inputs["advantages"] = adj.reshape(advantages.shape)
|
| 335 |
+
|
| 336 |
+
# Suppress TRL's in-loss KL: zero beta for the parent call, restore after.
|
| 337 |
+
saved_beta = self.beta
|
| 338 |
+
try:
|
| 339 |
+
self.beta = 0.0
|
| 340 |
+
return super()._compute_loss(model, inputs)
|
| 341 |
+
finally:
|
| 342 |
+
self.beta = saved_beta
|
| 343 |
+
|
| 344 |
# ----------------------------------------------------------------------
|
| 345 |
# Run-level collapse kill-switch (#2 safeguard) — optional, OFF by default
|
| 346 |
# ----------------------------------------------------------------------
|
|
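The suppression step in `_grpo_loss_kl_in_reward` above is a save, zero, restore sequence around the parent call. The following is a minimal sketch of that pattern only, using hypothetical stand-in classes (`ToyParent` and `ToyChild` are illustrative names, not TRL classes), and it assumes the parent gates its KL term on `self.beta != 0.0` as the docstring above describes. It shows that the coefficient is restored even when the parent call raises.

```python
class ToyParent:
    """Hypothetical stand-in for a trainer whose loss adds beta * kl."""

    def __init__(self, beta: float) -> None:
        self.beta = beta

    def compute_loss(self, surrogate: float, kl: float, fail: bool = False) -> float:
        if fail:
            raise RuntimeError("simulated failure inside the parent loss")
        loss = surrogate
        if self.beta != 0.0:  # the in-loss KL term is gated on beta
            loss += self.beta * kl
        return loss


class ToyChild(ToyParent):
    """Suppresses the parent's in-loss KL for one call, then restores beta."""

    def compute_loss(self, surrogate: float, kl: float, fail: bool = False) -> float:
        saved_beta = self.beta
        try:
            self.beta = 0.0  # parent now skips its KL term
            return super().compute_loss(surrogate, kl, fail)
        finally:
            self.beta = saved_beta  # runs on success and on exceptions


trainer = ToyChild(beta=0.04)
loss = trainer.compute_loss(surrogate=1.5, kl=10.0)
```

Here `loss` carries no KL contribution, while `trainer.beta` is still the configured coefficient afterwards, so it remains available as the in-reward coefficient.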
|
|
| 644 |
return next(model.parameters()).device
|
| 645 |
|
| 646 |
|
| 647 |
+
def validate_kl_in_reward_config(
|
| 648 |
+
*,
|
| 649 |
+
kl_estimator: str,
|
| 650 |
+
beta: float,
|
| 651 |
+
scale_rewards: Any,
|
| 652 |
+
) -> None:
|
| 653 |
+
"""Validate the (kl_estimator, beta, scale_rewards) combo for k1-in-reward.
|
| 654 |
+
|
| 655 |
+
Extracted so the preconditions are unit-testable without standing up a real
|
| 656 |
+
GRPOTrainer (which needs a model + dataset). Raises ``ValueError`` on any
|
| 657 |
+
invalid combination; returns None when the config is sound.
|
| 658 |
+
|
| 659 |
+
Preconditions (see ``kl_in_reward.py`` for the algebra):
|
| 660 |
+
* ``kl_estimator`` in {k1, k3}.
|
| 661 |
+
* ``beta != 0`` — TRL only builds the reference model and computes ref
|
| 662 |
+
logprobs when beta>0, and the in-reward penalty needs ref logps. beta
|
| 663 |
+
doubles as the in-reward KL coefficient (the in-loss k3 term is
|
| 664 |
+
suppressed per step).
|
| 665 |
+
* ``scale_rewards`` in {none, false} — the advantage-adjustment identity
|
| 666 |
+
is exact only without per-group std-normalization (the Dr.GRPO /
|
| 667 |
+
Composer regime).
|
| 668 |
+
"""
|
| 669 |
+
if kl_estimator not in ("k1", "k3"):
|
| 670 |
+
raise ValueError(f"kl_estimator must be 'k1' or 'k3', got {kl_estimator!r}.")
|
| 671 |
+
if float(beta) == 0.0:
|
| 672 |
+
raise ValueError(
|
| 673 |
+
"kl_in_reward=True requires a non-zero `beta` (the KL coefficient): "
|
| 674 |
+
"TRL only creates the reference model and computes ref logprobs when "
|
| 675 |
+
"beta>0, and k1-in-reward needs those ref logps. Set beta to your KL "
|
| 676 |
+
"coefficient (e.g. make_po_config('dr_grpo', beta=0.04)); the in-loss "
|
| 677 |
+
"k3 term is suppressed automatically so beta acts purely as the "
|
| 678 |
+
"in-reward k1 coefficient."
|
| 679 |
+
)
|
| 680 |
+
if str(scale_rewards).lower() not in ("none", "false"):
|
| 681 |
+
raise ValueError(
|
| 682 |
+
"kl_in_reward=True requires scale_rewards in {none,false} "
|
| 683 |
+
f"(got {scale_rewards!r}). The advantage-adjustment identity "
|
| 684 |
+
"adv -= beta·(KL - group_mean(KL)) is EXACT only without per-group "
|
| 685 |
+
"std-normalization (the Dr.GRPO / Composer regime). With std-norm, "
|
| 686 |
+
"folding KL into the reward also shifts the group std, so the linear "
|
| 687 |
+
"correction no longer matches true in-reward KL. Use "
|
| 688 |
+
"make_po_config('dr_grpo', beta=…) (scale_rewards='none')."
|
| 689 |
+
)
|
| 690 |
+
|
| 691 |
+
|
| 692 |
def make_dr_grpo_config(**overrides: Any):
|
| 693 |
"""Build a `trl.GRPOConfig` configured to the **Dr. GRPO** recipe.
|
| 694 |
|
|
|
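The `scale_rewards` precondition above can be checked by hand on a single group. The sketch below is plain Python (no TRL or torch) with made-up rewards and KL values; it uses the population standard deviation for simplicity, which is an assumption and may differ from the std TRL applies. It compares "fold KL into the reward, then baseline" against "baseline first, then subtract the centered KL", with and without std scaling.

```python
import math
from statistics import mean, pstdev


def center(xs):
    """Group-mean baseline: x - mean(x)."""
    m = mean(xs)
    return [x - m for x in xs]


def center_and_scale(xs):
    """Group-mean baseline followed by division by the group std."""
    s = pstdev(xs)
    return [x / s for x in center(xs)]


def same(xs, ys):
    return all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(xs, ys))


# One group of four completions; values are illustrative only.
rewards = [1.0, 0.0, 0.0, 1.0]
kl = [4.0, 0.0, 2.0, 2.0]
coef = 0.5

penalized = [r - coef * k for r, k in zip(rewards, kl)]
kl_centered = center(kl)

# No std scaling: the linear correction reproduces fold-then-baseline.
exact_plain = center(penalized)
shortcut_plain = [a - coef * k for a, k in zip(center(rewards), kl_centered)]

# With std scaling: the penalty changes the group std, so the two differ.
exact_scaled = center_and_scale(penalized)
shortcut_scaled = [a - coef * k for a, k in zip(center_and_scale(rewards), kl_centered)]

plain_matches = same(exact_plain, shortcut_plain)
scaled_matches = same(exact_scaled, shortcut_scaled)
```

With these numbers `plain_matches` is true and `scaled_matches` is false, which is the mismatch the `ValueError` above guards against.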
|
| 894 |
"make_dr_grpo_config",
|
| 895 |
"make_po_config",
|
| 896 |
"PO_OBJECTIVES",
|
| 897 |
+
"validate_kl_in_reward_config",
|
| 898 |
]
|
|
@@ -0,0 +1,182 @@
|
| 1 |
+
"""k1-in-reward KL penalty — the Composer-2 / verl fidelity choice.
|
| 2 |
+
|
| 3 |
+
THE FIDELITY GAP (F5 Rubric A item c2, the single highest-leverage fix).
|
| 4 |
+
Composer-2 §4.1 explicitly chooses the **k1** KL estimator applied **in the
|
| 5 |
+
reward** (``-log r``), citing a variance argument (Amini et al.). TRL's
|
| 6 |
+
``GRPOTrainer`` instead applies the **k3** estimator (``exp(Δ) - Δ - 1``,
|
| 7 |
+
Δ = ref_logp - logp) **in the loss**, gated on ``beta != 0``. The 2025/26
|
| 8 |
+
literature says this is not cosmetic:
|
| 9 |
+
|
| 10 |
+
* arXiv:2512.21852 ("A Comedy of Estimators") — k1-in-reward improves OOD
|
| 11 |
+
generalization; k3-in-reward can collapse.
|
| 12 |
+
* verl adopted k1-in-reward as its *only* reverse-KL option.
|
| 13 |
+
* TRL issue #4967 tracks the same divergence.
|
| 14 |
+
|
| 15 |
+
OOD generalization is exactly the "take any model to the next level" axis, so
|
| 16 |
+
this module gives the trainer an opt-in k1-in-reward path that matches
|
| 17 |
+
Composer-2 / verl, with TRL's native k3-in-loss term suppressed (the trainer keeps ``beta > 0`` and zeroes it only around the parent loss call).
|
| 18 |
+
|
| 19 |
+
THE ALGEBRA (why this is a clean advantage adjustment, not a TRL fork).
|
| 20 |
+
k1-in-reward means: penalize each sequence's reward by ``coef * KL_i`` before
|
| 21 |
+
GRPO computes its group-relative advantage:
|
| 22 |
+
|
| 23 |
+
reward'_i = reward_i - coef * KL_i
|
| 24 |
+
KL_i = Σ_t mask_{i,t} · (logp_{i,t} - ref_logp_{i,t}) # k1 estimator
|
| 25 |
+
# of KL(π‖π_ref)
|
| 26 |
+
|
| 27 |
+
GRPO's advantage (with ``scale_rewards="none"``, the Dr.GRPO / Composer regime)
|
| 28 |
+
is the group-mean baseline ``adv_i = reward_i - mean_group(reward)``. Because
|
| 29 |
+
that baseline is LINEAR, folding-then-baselining equals adjusting the final
|
| 30 |
+
advantage:
|
| 31 |
+
|
| 32 |
+
adv'_i = reward'_i - mean_group(reward')
|
| 33 |
+
= adv_i - coef · (KL_i - mean_group(KL))
|
| 34 |
+
|
| 35 |
+
So the trainer can let TRL compute advantages normally, then apply this exact
|
| 36 |
+
correction — no reimplementation of TRL's reward→advantage code.
|
| 37 |
+
|
| 38 |
+
THE STD-NORM CAVEAT (why we require scale_rewards="none"). The identity above
|
| 39 |
+
is EXACT only when there is no per-group std normalization. With std-norm,
|
| 40 |
+
folding KL into the reward also changes the group std, so the linear correction
|
| 41 |
+
is no longer equivalent. Composer-2 and verl both train WITHOUT std scaling
|
| 42 |
+
(Dr.GRPO's recommendation), so we make the math exact for that regime and the
|
| 43 |
+
trainer raises if k1-in-reward is requested with std-norm on, rather than
|
| 44 |
+
silently applying an approximation.
|
| 45 |
+
|
| 46 |
+
Note: ``-log r`` (Composer-2's phrasing) with ``r = π/π_ref = exp(logp-ref_logp)``
|
| 47 |
+
gives ``-log r = ref_logp - logp = -(logp - ref_logp)`` *per token*. The KL
|
| 48 |
+
PENALTY subtracted from reward is ``coef · Σ_t (logp - ref_logp)`` — i.e. the
|
| 49 |
+
k1 estimator of the reverse KL, which is what discourages drift from π_ref. The
|
| 50 |
+
sign convention here matches the standard RLHF KL-in-reward penalty
|
| 51 |
+
(Stiennon et al. 2020; verl ``kl_penalty="kl"``).
|
| 52 |
+
"""
|
| 53 |
+
from __future__ import annotations
|
| 54 |
+
|
| 55 |
+
import torch
|
| 56 |
+
|
| 57 |
+
#: Supported KL estimators for the in-reward penalty. Only k1 is meaningful here
|
| 58 |
+
#: (the whole point is to use k1 instead of TRL's native-in-loss k3); k3 is
|
| 59 |
+
#: accepted only so an experiment can A/B k3-in-reward against k1-in-reward.
|
| 60 |
+
KL_ESTIMATORS = ("k1", "k3")
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
def k1_kl_penalty_per_sequence(
|
| 64 |
+
policy_logps: torch.Tensor,
|
| 65 |
+
ref_logps: torch.Tensor,
|
| 66 |
+
completion_mask: torch.Tensor,
|
| 67 |
+
) -> torch.Tensor:
|
| 68 |
+
"""Per-sequence k1 estimator of KL(π ‖ π_ref) over completion tokens.
|
| 69 |
+
|
| 70 |
+
Args:
|
| 71 |
+
policy_logps: ``(B, T)`` per-token logprobs under the (sampling) policy π.
|
| 72 |
+
ref_logps: ``(B, T)`` per-token logprobs under the reference policy π_ref,
|
| 73 |
+
on the SAME tokens/positions as ``policy_logps``.
|
| 74 |
+
completion_mask: ``(B, T)`` 1.0 at real completion tokens, 0.0 at prompt /
|
| 75 |
+
padding positions (the k1 sum is taken only over real tokens).
|
| 76 |
+
|
| 77 |
+
Returns:
|
| 78 |
+
``(B,)`` per-sequence KL penalty ``Σ_t mask·(logp - ref_logp)``.
|
| 79 |
+
|
| 80 |
+
The k1 estimator ``logp - ref_logp`` is the unbiased (higher-variance)
|
| 81 |
+
single-sample estimate of the reverse KL; summed over the response it is the
|
| 82 |
+
sequence-level KL used as the reward penalty.
|
| 83 |
+
"""
|
| 84 |
+
if policy_logps.shape != ref_logps.shape:
|
| 85 |
+
raise ValueError(
|
| 86 |
+
f"policy_logps {tuple(policy_logps.shape)} and ref_logps "
|
| 87 |
+
f"{tuple(ref_logps.shape)} must have identical shape (same tokens)."
|
| 88 |
+
)
|
| 89 |
+
if completion_mask.shape != policy_logps.shape:
|
| 90 |
+
raise ValueError(
|
| 91 |
+
f"completion_mask {tuple(completion_mask.shape)} must match "
|
| 92 |
+
f"policy_logps {tuple(policy_logps.shape)}."
|
| 93 |
+
)
|
| 94 |
+
per_token = (policy_logps - ref_logps) * completion_mask
|
| 95 |
+
return per_token.sum(dim=-1)
|
| 96 |
+
|
| 97 |
+
|
| 98 |
+
def k3_kl_penalty_per_sequence(
|
| 99 |
+
policy_logps: torch.Tensor,
|
| 100 |
+
ref_logps: torch.Tensor,
|
| 101 |
+
completion_mask: torch.Tensor,
|
| 102 |
+
) -> torch.Tensor:
|
| 103 |
+
"""Per-sequence k3 (Schulman) estimator of KL over completion tokens.
|
| 104 |
+
|
| 105 |
+
``k3 = exp(Δ) - Δ - 1``, Δ = ref_logp - logp. Always ≥ 0, lower variance.
|
| 106 |
+
Provided for the in-reward path so an experiment can A/B k1-in-reward against
|
| 107 |
+
k3-in-reward (the comparison arXiv:2512.21852 makes) without touching TRL.
|
| 108 |
+
"""
|
| 109 |
+
if not (policy_logps.shape == ref_logps.shape == completion_mask.shape):
|
| 110 |
+
raise ValueError("policy_logps, ref_logps, completion_mask must share shape.")
|
| 111 |
+
delta = ref_logps - policy_logps
|
| 112 |
+
per_token = (torch.exp(delta) - delta - 1.0) * completion_mask
|
| 113 |
+
return per_token.sum(dim=-1)
|
| 114 |
+
|
| 115 |
+
|
| 116 |
+
def kl_penalty_per_sequence(
|
| 117 |
+
policy_logps: torch.Tensor,
|
| 118 |
+
ref_logps: torch.Tensor,
|
| 119 |
+
completion_mask: torch.Tensor,
|
| 120 |
+
estimator: str = "k1",
|
| 121 |
+
) -> torch.Tensor:
|
| 122 |
+
"""Dispatch to the k1 or k3 per-sequence KL penalty."""
|
| 123 |
+
if estimator == "k1":
|
| 124 |
+
return k1_kl_penalty_per_sequence(policy_logps, ref_logps, completion_mask)
|
| 125 |
+
if estimator == "k3":
|
| 126 |
+
return k3_kl_penalty_per_sequence(policy_logps, ref_logps, completion_mask)
|
| 127 |
+
raise ValueError(
|
| 128 |
+
f"Unknown KL estimator {estimator!r}; choose from {KL_ESTIMATORS}. "
|
| 129 |
+
"k1 is the Composer-2 / verl in-reward choice this module exists for."
|
| 130 |
+
)
|
| 131 |
+
|
| 132 |
+
|
| 133 |
+
def apply_kl_in_reward(
|
| 134 |
+
advantages: torch.Tensor,
|
| 135 |
+
kl_penalty: torch.Tensor,
|
| 136 |
+
num_generations: int,
|
| 137 |
+
coef: float,
|
| 138 |
+
) -> torch.Tensor:
|
| 139 |
+
"""Adjust GRPO advantages to fold a KL penalty into the reward.
|
| 140 |
+
|
| 141 |
+
Exact (not approximate) under the group-mean baseline with NO std
|
| 142 |
+
normalization (``scale_rewards="none"`` — the Dr.GRPO / Composer regime).
|
| 143 |
+
See the module docstring for the linearity argument.
|
| 144 |
+
|
| 145 |
+
Args:
|
| 146 |
+
advantages: ``(B,)`` GRPO advantages as TRL computed them
|
| 147 |
+
(= reward - group_mean(reward), no std division).
|
| 148 |
+
kl_penalty: ``(B,)`` per-sequence KL penalty (from
|
| 149 |
+
``kl_penalty_per_sequence``).
|
| 150 |
+
num_generations: G — the number of completions per prompt (group size).
|
| 151 |
+
``B`` must be divisible by G; groups are contiguous as TRL lays them
|
| 152 |
+
out (``rewards.view(-1, num_generations)``).
|
| 153 |
+
coef: the KL coefficient β. ``coef=0`` returns advantages unchanged.
|
| 154 |
+
|
| 155 |
+
Returns:
|
| 156 |
+
``(B,)`` adjusted advantages ``adv - coef·(KL - group_mean(KL))``.
|
| 157 |
+
"""
|
| 158 |
+
if coef == 0.0:
|
| 159 |
+
return advantages
|
| 160 |
+
if advantages.shape != kl_penalty.shape:
|
| 161 |
+
raise ValueError(
|
| 162 |
+
f"advantages {tuple(advantages.shape)} and kl_penalty "
|
| 163 |
+
f"{tuple(kl_penalty.shape)} must have identical shape (B,)."
|
| 164 |
+
)
|
| 165 |
+
b = advantages.shape[0]
|
| 166 |
+
if num_generations <= 0 or b % num_generations != 0:
|
| 167 |
+
raise ValueError(
|
| 168 |
+
f"batch size B={b} must be a positive multiple of num_generations="
|
| 169 |
+
f"{num_generations} (GRPO lays groups out contiguously)."
|
| 170 |
+
)
|
| 171 |
+
kl_grouped = kl_penalty.view(-1, num_generations)
|
| 172 |
+
kl_centered = (kl_grouped - kl_grouped.mean(dim=1, keepdim=True)).reshape(b)
|
| 173 |
+
return advantages - coef * kl_centered
|
| 174 |
+
|
| 175 |
+
|
| 176 |
+
__all__ = [
|
| 177 |
+
"KL_ESTIMATORS",
|
| 178 |
+
"k1_kl_penalty_per_sequence",
|
| 179 |
+
"k3_kl_penalty_per_sequence",
|
| 180 |
+
"kl_penalty_per_sequence",
|
| 181 |
+
"apply_kl_in_reward",
|
| 182 |
+
]
|
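The module above can be traced end to end on a two-sequence group. The following is a plain-Python sketch (lists instead of tensors; the helper names `k1_per_sequence` and `adjust_advantages` are illustrative, not the module's API) that mirrors the masked k1 sum and the group-centered correction, then checks the result against folding the penalty into the reward first. The logprob values are the ones used in `test_k1_penalty_sums_masked_logp_diff`; the rewards and coefficient are made up.

```python
import math


def k1_per_sequence(policy_logps, ref_logps, mask):
    """Masked sum of (logp - ref_logp) for each sequence."""
    return [
        sum(m * (p - r) for p, r, m in zip(p_row, r_row, m_row))
        for p_row, r_row, m_row in zip(policy_logps, ref_logps, mask)
    ]


def adjust_advantages(advantages, kl, group_size, coef):
    """adv - coef * (KL - group_mean(KL)), with contiguous groups."""
    out = []
    for start in range(0, len(kl), group_size):
        group = kl[start:start + group_size]
        group_mean = sum(group) / group_size
        advs = advantages[start:start + group_size]
        out.extend(a - coef * (k - group_mean) for a, k in zip(advs, group))
    return out


policy = [[0.0, -1.0, -2.0], [-0.5, -0.5, -0.5]]
ref = [[0.0, -0.5, -1.0], [-1.0, -1.0, -1.0]]
mask = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]  # row 0 drops its last token

kl = k1_per_sequence(policy, ref, mask)

# Equal raw rewards, so the plain group-mean advantages are zero.
rewards = [1.0, 1.0]
coef = 0.1
baseline = sum(rewards) / len(rewards)
advantages = [r - baseline for r in rewards]
adjusted = adjust_advantages(advantages, kl, group_size=2, coef=coef)

# Reference computation: penalize the reward first, then baseline.
penalized = [r - coef * k for r, k in zip(rewards, kl)]
penalized_mean = sum(penalized) / len(penalized)
folded = [r - penalized_mean for r in penalized]

identity_holds = all(math.isclose(a, f, abs_tol=1e-12) for a, f in zip(adjusted, folded))
```

In this example the second sequence has the larger k1 value, so it ends up with the lower adjusted advantage even though both raw rewards are equal.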
|
@@ -75,6 +75,15 @@ def test_trl_kl_estimator_is_k3_not_k1():
|
|
| 75 |
the installed trl==1.5.0 source (grpo_trainer.py ~L2513) shows it actually
|
| 76 |
computes k3: `torch.exp(ref - logp) - (ref - logp) - 1`. This test pins
|
| 77 |
that finding so the docstring stays honest and a future TRL change is caught.
|
| 78 |
"""
|
| 79 |
# Known logprob pairs (student logp, reference logp).
|
| 80 |
logp = torch.tensor([-1.0, -2.0, -0.5, -3.0])
|
|
@@ -121,6 +130,30 @@ def test_trl_kl_estimator_is_k3_not_k1():
|
|
| 121 |
)
|
| 122 |
|
| 123 |
|
| 124 |
# ---------------------------------------------------------------------------
|
| 125 |
# Gate 2 — SDPO strict-alignment guard (no real GRPOTrainer needed)
|
| 126 |
# ---------------------------------------------------------------------------
|
|
|
|
| 75 |
the installed trl==1.5.0 source (grpo_trainer.py ~L2513) shows it actually
|
| 76 |
computes k3: `torch.exp(ref - logp) - (ref - logp) - 1`. This test pins
|
| 77 |
that finding so the docstring stays honest and a future TRL change is caught.
|
| 78 |
+
|
| 79 |
+
NOTE (Wave 20, F5 #1 fidelity fix): TRL's NATIVE behavior is still k3-in-loss
|
| 80 |
+
(this test stays valid). But Composer-2 §4.1 chooses k1-IN-REWARD, and the
|
| 81 |
+
2025/26 evidence (arXiv:2512.21852, verl, TRL #4967) says that matters for
|
| 82 |
+
OOD generalization. ComposerReplicationTrainer now offers an opt-in
|
| 83 |
+
`kl_in_reward=True` path (composer_replication.trainer.kl_in_reward) that
|
| 84 |
+
suppresses this in-loss k3 term (beta→0 for the parent loss) and folds a k1
|
| 85 |
+
penalty into the reward/advantage instead — see
|
| 86 |
+
test_trainer_offers_k1_in_reward_path below and test_kl_in_reward.py.
|
| 87 |
"""
|
| 88 |
# Known logprob pairs (student logp, reference logp).
|
| 89 |
logp = torch.tensor([-1.0, -2.0, -0.5, -3.0])
|
|
|
|
| 130 |
)
|
| 131 |
|
| 132 |
|
| 133 |
+
def test_trainer_offers_k1_in_reward_path():
|
| 134 |
+
"""The F5 #1 fix: ComposerReplicationTrainer exposes an opt-in k1-in-reward
|
| 135 |
+
path whose preconditions are validated (beta>0, scale_rewards∈{none,false},
|
| 136 |
+
estimator∈{k1,k3}). Validates the extracted precondition function without
|
| 137 |
+
needing a real GRPOTrainer (model+dataset)."""
|
| 138 |
+
from composer_replication.trainer.composer_trainer import validate_kl_in_reward_config
|
| 139 |
+
|
| 140 |
+
# Valid Composer-2 / verl config: k1 estimator, non-zero beta, no std-norm.
|
| 141 |
+
validate_kl_in_reward_config(kl_estimator="k1", beta=0.04, scale_rewards="none")
|
| 142 |
+
validate_kl_in_reward_config(kl_estimator="k3", beta=0.1, scale_rewards="false")
|
| 143 |
+
|
| 144 |
+
# beta=0 is the trap: TRL would build no ref model, so no ref logps.
|
| 145 |
+
with pytest.raises(ValueError, match="non-zero `beta`"):
|
| 146 |
+
validate_kl_in_reward_config(kl_estimator="k1", beta=0.0, scale_rewards="none")
|
| 147 |
+
|
| 148 |
+
# std-norm breaks the exact advantage-adjustment identity.
|
| 149 |
+
with pytest.raises(ValueError, match="scale_rewards in"):
|
| 150 |
+
validate_kl_in_reward_config(kl_estimator="k1", beta=0.04, scale_rewards="group")
|
| 151 |
+
|
| 152 |
+
# bad estimator
|
| 153 |
+
with pytest.raises(ValueError, match="k1.*k3|k3"):
|
| 154 |
+
validate_kl_in_reward_config(kl_estimator="k2", beta=0.04, scale_rewards="none")
|
| 155 |
+
|
| 156 |
+
|
| 157 |
# ---------------------------------------------------------------------------
|
| 158 |
# Gate 2 — SDPO strict-alignment guard (no real GRPOTrainer needed)
|
| 159 |
# ---------------------------------------------------------------------------
|
|
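The k1 versus k3 distinction pinned by the tests above can be seen per token with ordinary floats. This sketch uses only the standard library; the `logp` values match the test above, while the `ref` values are hypothetical and chosen for illustration. It uses the same definitions as the diff: Δ = ref_logp - logp, k1 = logp - ref_logp, k3 = exp(Δ) - Δ - 1.

```python
import math


def k1(logp: float, ref_logp: float) -> float:
    """Signed single-sample estimator: logp - ref_logp."""
    return logp - ref_logp


def k3(logp: float, ref_logp: float) -> float:
    """Schulman's estimator: exp(delta) - delta - 1, delta = ref_logp - logp."""
    delta = ref_logp - logp
    return math.exp(delta) - delta - 1.0


logp = [-1.0, -2.0, -0.5, -3.0]
ref = [-1.5, -1.0, -0.5, -2.0]  # hypothetical reference logprobs

k1_values = [k1(p, r) for p, r in zip(logp, ref)]
k3_values = [k3(p, r) for p, r in zip(logp, ref)]

# k1 changes sign with the direction of the gap; k3 never goes below zero.
k1_has_negative = any(v < 0 for v in k1_values)
k3_all_nonnegative = all(v >= 0 for v in k3_values)

# For a small gap, k3 is close to delta**2 / 2.
small_delta = 0.01
k3_small = k3(0.0, small_delta)
half_delta_squared = 0.5 * small_delta**2
```

Because k1 is signed, a per-sequence k1 sum can raise or lower a reward, whereas a k3 sum can only lower it; that structural difference is what `test_k1_can_be_negative_k3_cannot` checks on tensors.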
@@ -0,0 +1,146 @@
|
| 1 |
+
"""Tests for k1-in-reward KL (F5 #1 fidelity fix; Composer-2 §4.1 / verl).
|
| 2 |
+
|
| 3 |
+
The load-bearing test is `test_apply_equals_fold_then_baseline`: it proves the
|
| 4 |
+
advantage adjustment `adv -= coef·(KL - group_mean(KL))` is EXACTLY equal to
|
| 5 |
+
folding the KL penalty into the reward and re-running GRPO's group-mean
|
| 6 |
+
baseline (with no std-norm). That equivalence is the entire justification for
|
| 7 |
+
adjusting advantages post-hoc instead of forking TRL's reward→advantage code.
|
| 8 |
+
"""
|
| 9 |
+
from __future__ import annotations
|
| 10 |
+
|
| 11 |
+
import pytest
|
| 12 |
+
import torch
|
| 13 |
+
|
| 14 |
+
from composer_replication.trainer.kl_in_reward import (
|
| 15 |
+
apply_kl_in_reward,
|
| 16 |
+
k1_kl_penalty_per_sequence,
|
| 17 |
+
k3_kl_penalty_per_sequence,
|
| 18 |
+
kl_penalty_per_sequence,
|
| 19 |
+
)
|
| 20 |
+
|
+
+# ---------------------------------------------------------------------
+# Per-sequence KL estimators
+# ---------------------------------------------------------------------
+
+
+def test_k1_penalty_sums_masked_logp_diff():
+    policy = torch.tensor([[0.0, -1.0, -2.0], [-0.5, -0.5, -0.5]])
+    ref = torch.tensor([[0.0, -0.5, -1.0], [-1.0, -1.0, -1.0]])
+    mask = torch.tensor([[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])  # row0 drops last token
+    out = k1_kl_penalty_per_sequence(policy, ref, mask)
+    # row0: (0-0) + (-1-(-0.5)) [+ masked 0] = -0.5
+    # row1: (-0.5-(-1.0))*3 = +1.5
+    torch.testing.assert_close(out, torch.tensor([-0.5, 1.5]))
+
+
+def test_k1_can_be_negative_k3_cannot():
+    """Structural difference: k1 is signed, k3 ≥ 0 (the whole reason they differ)."""
+    policy = torch.tensor([[0.0, 0.0]])
+    ref = torch.tensor([[1.0, 1.0]])  # ref > policy → Δ=ref-logp>0 → k1<0
+    mask = torch.ones_like(policy)
+    k1 = k1_kl_penalty_per_sequence(policy, ref, mask)
+    k3 = k3_kl_penalty_per_sequence(policy, ref, mask)
+    assert (k1 < 0).all(), "k1 = Σ(logp-ref) is negative when ref>logp"
+    assert (k3 >= -1e-6).all(), "k3 (Schulman) is always non-negative"
+
+
+def test_k3_leading_order_is_half_delta_squared():
+    """For small Δ, k3 ≈ Δ²/2 — the minor-delta claim in make_dr_grpo_config."""
+    policy = torch.tensor([[0.0, 0.0, 0.0]])
+    ref = torch.tensor([[0.01, -0.02, 0.005]])
+    mask = torch.ones_like(policy)
+    k3 = k3_kl_penalty_per_sequence(policy, ref, mask)
+    delta = ref - policy
+    expected = (0.5 * delta**2).sum()
+    torch.testing.assert_close(k3, expected.unsqueeze(0), atol=1e-4, rtol=1e-3)
+
+
+def test_dispatch_and_unknown_estimator():
+    policy = torch.zeros(1, 2)
+    ref = torch.ones(1, 2)
+    mask = torch.ones(1, 2)
+    torch.testing.assert_close(
+        kl_penalty_per_sequence(policy, ref, mask, "k1"),
+        k1_kl_penalty_per_sequence(policy, ref, mask),
+    )
+    with pytest.raises(ValueError, match="Unknown KL estimator"):
+        kl_penalty_per_sequence(policy, ref, mask, "k2")
+
+
+def test_penalty_shape_validation():
+    with pytest.raises(ValueError, match="identical shape"):
+        k1_kl_penalty_per_sequence(torch.zeros(1, 3), torch.zeros(1, 2), torch.zeros(1, 3))
+    with pytest.raises(ValueError, match="must match"):
+        k1_kl_penalty_per_sequence(torch.zeros(1, 3), torch.zeros(1, 3), torch.zeros(1, 2))
+
+
+# ---------------------------------------------------------------------
+# apply_kl_in_reward — the advantage adjustment
+# ---------------------------------------------------------------------
+
+
+def test_apply_coef_zero_is_identity():
+    adv = torch.tensor([1.0, -1.0, 0.5, -0.5])
+    kl = torch.tensor([2.0, 3.0, 1.0, 0.0])
+    out = apply_kl_in_reward(adv, kl, num_generations=2, coef=0.0)
+    torch.testing.assert_close(out, adv)
+
+
+def test_apply_centers_kl_within_group():
+    # Two groups of 2. coef=1. adv -= (KL - group_mean(KL)).
+    adv = torch.zeros(4)
+    kl = torch.tensor([1.0, 3.0, 10.0, 20.0])
+    out = apply_kl_in_reward(adv, kl, num_generations=2, coef=1.0)
+    # group0 mean=2 → centered [-1,+1] → adv-(-1,+1)=[1,-1]
+    # group1 mean=15 → centered [-5,+5] → adv-(-5,+5)=[5,-5]
+    torch.testing.assert_close(out, torch.tensor([1.0, -1.0, 5.0, -5.0]))
+
+
+def test_apply_divisibility_validation():
+    with pytest.raises(ValueError, match="multiple of num_generations"):
+        apply_kl_in_reward(torch.zeros(5), torch.zeros(5), num_generations=2, coef=1.0)
+    with pytest.raises(ValueError, match="identical shape"):
+        apply_kl_in_reward(torch.zeros(4), torch.zeros(2), num_generations=2, coef=1.0)
+
+
+@pytest.mark.parametrize("num_generations", [2, 3, 4])
+@pytest.mark.parametrize("n_groups", [1, 2, 5])
+def test_apply_equals_fold_then_baseline(num_generations, n_groups):
+    """THE load-bearing property: adjusting baselined advantages by
+    -coef·(KL - group_mean(KL)) equals folding -coef·KL into the reward and
+    re-running GRPO's group-mean baseline (scale_rewards='none').
+
+    This proves the post-hoc advantage adjustment IS exact k1-in-reward, not an
+    approximation — the justification for not forking TRL's scoring code.
+    """
+    torch.manual_seed(0)
+    g, k = num_generations, n_groups
+    b = g * k
+    coef = 0.137
+
+    rewards = torch.randn(b)
+    kl = torch.randn(b).abs()  # KL ≥ 0 in spirit, though sign-agnostic here
+
+    # GRPO baseline (no std-norm): adv = reward - group_mean(reward).
+    def group_baseline(x):
+        means = x.view(k, g).mean(dim=1).repeat_interleave(g)  # (b,)
+        return x - means
+
+    advantages = group_baseline(rewards)
+
+    # Reference: fold KL into reward, THEN baseline.
+    folded_reward = rewards - coef * kl
+    adv_fold_then_baseline = group_baseline(folded_reward)
+
+    # Under test: adjust the ALREADY-baselined advantages.
+    adv_adjusted = apply_kl_in_reward(advantages, kl, num_generations=g, coef=coef)
+
+    torch.testing.assert_close(adv_adjusted, adv_fold_then_baseline, atol=1e-5, rtol=1e-5)
+
+
+def test_apply_does_not_mutate_input():
+    adv = torch.tensor([1.0, 2.0])
+    adv_copy = adv.clone()
+    apply_kl_in_reward(adv, torch.tensor([0.0, 1.0]), num_generations=2, coef=1.0)
+    torch.testing.assert_close(adv, adv_copy)  # functional, not in-place
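
The tests above pin down the contract of the helpers in `trainer/kl_in_reward.py` without showing them. Below is a minimal sketch of functions that would satisfy that contract. The function names mirror the tests, but the bodies are an assumption and not the module's actual code; plain Python lists stand in for torch tensors so the sketch runs without dependencies. k1 sums the masked `logp - ref` (signed), k3 sums the masked `exp(Δ) - Δ - 1` with `Δ = ref - logp` (non-negative, roughly Δ²/2 for small Δ), and `apply_kl_in_reward` subtracts `coef` times the group-centered KL from each advantage.

```python
import math


def k1_kl_penalty_per_sequence(policy_logps, ref_logps, mask):
    """k1 estimator: masked sum of (logp_policy - logp_ref). Can be negative."""
    return [
        sum(m * (p - r) for p, r, m in zip(p_row, r_row, m_row))
        for p_row, r_row, m_row in zip(policy_logps, ref_logps, mask)
    ]


def k3_kl_penalty_per_sequence(policy_logps, ref_logps, mask):
    """k3 estimator: masked sum of exp(d) - d - 1, d = logp_ref - logp_policy."""
    out = []
    for p_row, r_row, m_row in zip(policy_logps, ref_logps, mask):
        total = 0.0
        for p, r, m in zip(p_row, r_row, m_row):
            d = r - p
            total += m * (math.exp(d) - d - 1.0)  # each term is >= 0
        out.append(total)
    return out


def apply_kl_in_reward(advantages, kl, num_generations, coef):
    """Return new advantages: adv_i - coef * (kl_i - mean of kl over i's group).

    Assumes the flat batch is laid out group by group, num_generations per group.
    """
    if len(advantages) != len(kl):
        raise ValueError("advantages and kl must have identical shape")
    if len(kl) % num_generations != 0:
        raise ValueError("batch size must be a multiple of num_generations")
    out = []
    for start in range(0, len(kl), num_generations):
        group = kl[start:start + num_generations]
        group_mean = sum(group) / num_generations
        for offset, kl_i in enumerate(group):
            out.append(advantages[start + offset] - coef * (kl_i - group_mean))
    return out  # a new list; the input is not mutated


if __name__ == "__main__":
    policy = [[0.0, -1.0, -2.0], [-0.5, -0.5, -0.5]]
    ref = [[0.0, -0.5, -1.0], [-1.0, -1.0, -1.0]]
    mask = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
    print(k1_kl_penalty_per_sequence(policy, ref, mask))
    print(apply_kl_in_reward([0.0] * 4, [1.0, 3.0, 10.0, 20.0], 2, 1.0))
```

The list-based layout mirrors the `x.view(k, g)` grouping used in `test_apply_equals_fold_then_baseline`; a tensor version would vectorize the same arithmetic.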
@@ -24,13 +24,13 @@ The good news the report stresses: the substrate for all of this already exists.
 | (b1) | **25× synthetic data — Feature Deletion generator** — 2.5 blog | **FULLY-REPLICATED (substrate-inversion form)** | `datagen/substrates.py::SweBenchAdapter` (revert gold patch → broken repo, FAIL_TO_PASS=reward target, license filter), `datagen/env.py::FeatureDeletionEnv`, `datagen/validator.py` (4-gate solvability), `datagen/schema.py`. | Real *generative* synthesis (manufacture novel broken states beyond SWE-bench inversion) absent; only adapts existing SWE-* instances. No "25×" scale-out generator suite. |
 | (b2) | **Dynamic-difficulty curriculum ("select for AND create harder tasks dynamically")** — 2.5 blog + Composer-2 §3 (keyed on #turns + thinking-tokens) | **FULLY-REPLICATED (select-for half)** | `datagen/curriculum.py::DifficultyCurriculum` — p̂(1−p̂) frontier weighting, retire >0.95, quarantine <0.02, **effort tilt on turns/think-tokens** (ADR-012 #4, matching Composer-2's exact heuristic). | **CREATE half missing**: no live escalation of deletion-span / coupling / multi-feature difficulty during the run. Curriculum scores an *existing* pool; it doesn't mint harder tasks. |
 | (c1) | **Dr.GRPO base objective** — Composer-2 §4.1 | **FULLY-REPLICATED** | `composer_trainer.py::make_dr_grpo_config` + `make_po_config` (PO menu: grpo/dr_grpo/bnpo/dapo/gspo/cispo, pure TRL 1.5.0 config). `loss_type="dr_grpo"`, `scale_rewards="none"`, `num_iterations=1`, drift-guard asserts. ADR-014. | None on the objective itself. |
-| (c2) | **k1-vs-k3 KL** — Composer-2 §4.1 explicitly chooses **k1 = −log r in *reward*** (variance argument, citing Amini et al.) | **
 | (d) | **CPT → SFT → RL phase structure** — Composer-2 §3-4 (CPT loss ↓ ⇒ RL ceiling ↑, replicated on Qwen3-Coder-30B) | **PARTIAL (intentional skip + plumbing)** | Documented decision to skip CPT and start from a code-tuned base (COMPOSER_RECIPE_MAPPING.md row a; corroborated by Composer-2's own CPT→RL causal claim). Inner/outer loop split exists (datagen=outer, `ComposerReplicationTrainer`=inner). | **No SFT-first stage in code.** Report §5 prescribes "SFT-first on clean winning trajectories before RL" — there is no SFT trainer/recipe; only the RL trainer exists. CPT correctly skipped. |
 | (e) | **Sharded Muon + dual-mesh HSDP** (2.5 blog) / FSDP+CP+decoupled-EP, Adam (Composer-2 §6) | **MISSING (intentional, irrelevant at our scale)** | — | Correctly out of scope for dense Qwen3-{7,32}B (the mapping doc + report both say skip until MoE base). Distributed substrate is DiLoCo-over-S3, not HSDP. Note research/10 *corrects* the blog: Composer-2 uses **Adam**, not Muon, and FSDP+CP+decoupled-EP, not HSDP. |
 | (f) | **Anyrun production-fidelity sandboxed RL harness** (>500 pods/s, per-pod Firecracker microVM, fork/snapshot, Anygress egress proxy) — Composer-2 §6.2 | **PARTIAL** | `datagen/sandbox.py` (`Sandbox` Protocol, `LocalSubprocessSandbox`, `scrub_tree` primary control, denylist defense-in-depth), `datagen/docker_sandbox.py`, `diloco/serverless/{executor.py,eks.py,sagemaker.py,modal_spawn.py}`. | No microVM isolation (gVisor/Kata-Firecracker), no fork/snapshot, no egress proxy, no >100k-pod orchestration. The report's EKS plan (§8: gVisor default → Kata+Firecracker → container-free SWE-MiniSandbox) is design-only. `eks.py`/`sagemaker.py` are executor skeletons, not the full Anyrun analogue. |
 | (g) | **Reward-hacking monitoring** (2.5 blog: bytecode decompile / type-cache hacks; "agentic monitoring tools") | **FULLY-REPLICATED (defense-in-depth) + run-level guard now wired** | `datagen/monitor.py::HackMonitor` (signature + patch-provenance, obfuscation-resistant), `sandbox.py::scrub_tree` (physical cache/.git removal = "the wall"), `datagen/validator.py` (4-gate), `safety/holdout.py::HeldoutSplit` (id + content-hash disjointness), `safety/kill_switch.py::HeldOutGuard` (proxy-real Hacking-Gap + KL hard-stop), **now wired into the trainer** (`composer_trainer.py::_maybe_update_killswitch`, ADR-015, 2026-06-08). | The held-out kill-switch — the report's "most load-bearing safeguard, documented gap" — is **now CLOSED** (ADR-015). Remaining: `HackMonitor` validated only on constructed examples (report warns synthetic-hack monitors fail to generalize); offline LLM-judge monitor (EvilGenie-style) not built. |
-| (h) | **Aux scalar rewards (style/communication/unfinished-todo penalties)** — Composer-2 §4.2 | **
-| (i) | **Nonlinear length/effort penalty** `C_length{k,q}(x)=((1+kx)^{1−q}−1)/(k(1−q))` — Composer-2 §4.2 (exact equation) | **
 | (j) | **Self-summarization (reward-to-all-chain-tokens)** — Composer-2 §4.1 | **MISSING** | — | The mechanism that handles 100k-token long-horizon rollouts (the regime the report says the *tree* is for). Not built. |
 | (k) | **MoE router replay** — Composer-2 §6.2 | **MISSING (out of scope, dense bases)** | — | Only relevant for MoE-base RL; correct to defer. |

@@ -74,9 +74,9 @@ What exists today (the *ancestor*): `teacher_replay.py` is **flat depth-1** (N t

 Ordered by (fidelity-leverage × cheapness), front-loading the items that move the "next-level" needle for the least build.

-**Tier 0 — cheap fidelity fixes the evidence says move OOD generalization (
-1. **k1-in-reward KL** (Rubric A c2).
-2. **Composer-2 behavior rewards** (Rubric A h+i): `datagen/behavior_rewards.py` —

 **Tier 1 — close the highest-value PARTIALs:**
 3. **SDPO live-GPU smoke** (Rubric A a): instantiate `ComposerReplicationTrainer` against a real `trl.GRPOTrainer` on a small model (Qwen2.5-0.5B) on a SageMaker Training Job (g5/g6e) or HyperPod node-group — discharges the ADR-008 "never smoke-tested against real GRPOTrainer" caveat.

 | (b1) | **25× synthetic data — Feature Deletion generator** — 2.5 blog | **FULLY-REPLICATED (substrate-inversion form)** | `datagen/substrates.py::SweBenchAdapter` (revert gold patch → broken repo, FAIL_TO_PASS=reward target, license filter), `datagen/env.py::FeatureDeletionEnv`, `datagen/validator.py` (4-gate solvability), `datagen/schema.py`. | Real *generative* synthesis (manufacture novel broken states beyond SWE-bench inversion) absent; only adapts existing SWE-* instances. No "25×" scale-out generator suite. |
 | (b2) | **Dynamic-difficulty curriculum ("select for AND create harder tasks dynamically")** — 2.5 blog + Composer-2 §3 (keyed on #turns + thinking-tokens) | **FULLY-REPLICATED (select-for half)** | `datagen/curriculum.py::DifficultyCurriculum` — p̂(1−p̂) frontier weighting, retire >0.95, quarantine <0.02, **effort tilt on turns/think-tokens** (ADR-012 #4, matching Composer-2's exact heuristic). | **CREATE half missing**: no live escalation of deletion-span / coupling / multi-feature difficulty during the run. Curriculum scores an *existing* pool; it doesn't mint harder tasks. |
 | (c1) | **Dr.GRPO base objective** — Composer-2 §4.1 | **FULLY-REPLICATED** | `composer_trainer.py::make_dr_grpo_config` + `make_po_config` (PO menu: grpo/dr_grpo/bnpo/dapo/gspo/cispo, pure TRL 1.5.0 config). `loss_type="dr_grpo"`, `scale_rewards="none"`, `num_iterations=1`, drift-guard asserts. ADR-014. | None on the objective itself. |
+| (c2) | **k1-vs-k3 KL** — Composer-2 §4.1 explicitly chooses **k1 = −log r in *reward*** (variance argument, citing Amini et al.) | **CLOSED (Wave 20) — opt-in k1-in-reward path** | `trainer/kl_in_reward.py` (pure, tested: k1/k3 per-seq estimators + `apply_kl_in_reward` advantage adjustment) + `composer_trainer.py` (`kl_in_reward=True`/`kl_estimator=`: folds the k1 penalty into advantages at scoring, zeros TRL's in-loss k3 per step, `validate_kl_in_reward_config` guards beta>0 + scale_rewards∈{none,false}). `test_kl_in_reward.py` proves the advantage-adjustment EXACTLY equals fold-then-baseline (9 group/gen combos). TRL's native k3 still pinned by `test_trl_kl_estimator_is_k3_not_k1` (still true; now documents the override). | Live-GPU A/B (k1-in-reward vs k3-in-loss OOD) deferred to the SageMaker smoke. The math + wiring are done and CPU-verified. |
 | (d) | **CPT → SFT → RL phase structure** — Composer-2 §3-4 (CPT loss ↓ ⇒ RL ceiling ↑, replicated on Qwen3-Coder-30B) | **PARTIAL (intentional skip + plumbing)** | Documented decision to skip CPT and start from a code-tuned base (COMPOSER_RECIPE_MAPPING.md row a; corroborated by Composer-2's own CPT→RL causal claim). Inner/outer loop split exists (datagen=outer, `ComposerReplicationTrainer`=inner). | **No SFT-first stage in code.** Report §5 prescribes "SFT-first on clean winning trajectories before RL" — there is no SFT trainer/recipe; only the RL trainer exists. CPT correctly skipped. |
 | (e) | **Sharded Muon + dual-mesh HSDP** (2.5 blog) / FSDP+CP+decoupled-EP, Adam (Composer-2 §6) | **MISSING (intentional, irrelevant at our scale)** | — | Correctly out of scope for dense Qwen3-{7,32}B (the mapping doc + report both say skip until MoE base). Distributed substrate is DiLoCo-over-S3, not HSDP. Note research/10 *corrects* the blog: Composer-2 uses **Adam**, not Muon, and FSDP+CP+decoupled-EP, not HSDP. |
 | (f) | **Anyrun production-fidelity sandboxed RL harness** (>500 pods/s, per-pod Firecracker microVM, fork/snapshot, Anygress egress proxy) — Composer-2 §6.2 | **PARTIAL** | `datagen/sandbox.py` (`Sandbox` Protocol, `LocalSubprocessSandbox`, `scrub_tree` primary control, denylist defense-in-depth), `datagen/docker_sandbox.py`, `diloco/serverless/{executor.py,eks.py,sagemaker.py,modal_spawn.py}`. | No microVM isolation (gVisor/Kata-Firecracker), no fork/snapshot, no egress proxy, no >100k-pod orchestration. The report's EKS plan (§8: gVisor default → Kata+Firecracker → container-free SWE-MiniSandbox) is design-only. `eks.py`/`sagemaker.py` are executor skeletons, not the full Anyrun analogue. |
 | (g) | **Reward-hacking monitoring** (2.5 blog: bytecode decompile / type-cache hacks; "agentic monitoring tools") | **FULLY-REPLICATED (defense-in-depth) + run-level guard now wired** | `datagen/monitor.py::HackMonitor` (signature + patch-provenance, obfuscation-resistant), `sandbox.py::scrub_tree` (physical cache/.git removal = "the wall"), `datagen/validator.py` (4-gate), `safety/holdout.py::HeldoutSplit` (id + content-hash disjointness), `safety/kill_switch.py::HeldOutGuard` (proxy-real Hacking-Gap + KL hard-stop), **now wired into the trainer** (`composer_trainer.py::_maybe_update_killswitch`, ADR-015, 2026-06-08). | The held-out kill-switch — the report's "most load-bearing safeguard, documented gap" — is **now CLOSED** (ADR-015). Remaining: `HackMonitor` validated only on constructed examples (report warns synthetic-hack monitors fail to generalize); offline LLM-judge monitor (EvilGenie-style) not built. |
+| (h) | **Aux scalar rewards (style/communication/unfinished-todo penalties)** — Composer-2 §4.2 | **CLOSED (Wave 20)** | `datagen/behavior_rewards.py`: `UnfinishedTodoPenalty`, `LeftoverCoTPenalty`, `CommunicationReward` (the exact emergent behaviors §4.2 names) as TRL `RewardFunc`s, composable with `env.reward_fn` via `reward_weights`. `default_behavior_rewards()` = the starter array. `test_behavior_rewards.py` pins the contract. | Heuristic pattern-matchers by design (the report added these reactively by monitoring); meant to grow. A judge-based communication score is the upgrade. |
+| (i) | **Nonlinear length/effort penalty** `C_length{k,q}(x)=((1+kx)^{1−q}−1)/(k(1−q))` — Composer-2 §4.2 (exact equation) | **CLOSED (Wave 20)** | `datagen/behavior_rewards.py::c_length` (exact equation + q→1 log-limit) + `EffortWeights` (the weighted {thinking, tool-call, tool-output, final-msg tokens, #calls, #turns} combination) + `LengthEffortPenalty` (the RewardFunc). `test_behavior_rewards.py` verifies the closed form, the concave-down/increasing shape, and the edge cases. | Effort features read from a per-completion `effort_features` kwarg when the harness supplies them; char-count proxy otherwise. |
 | (j) | **Self-summarization (reward-to-all-chain-tokens)** — Composer-2 §4.1 | **MISSING** | — | The mechanism that handles 100k-token long-horizon rollouts (the regime the report says the *tree* is for). Not built. |
 | (k) | **MoE router replay** — Composer-2 §6.2 | **MISSING (out of scope, dense bases)** | — | Only relevant for MoE-base RL; correct to defer. |

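
Row (c2) rests on the claim that adjusting already-baselined advantages equals folding the penalty into the reward first. The identity follows from linearity of the group mean. A short derivation, with $r_i$ the reward, $\mathrm{KL}_i$ the per-sequence penalty, $c$ the coefficient, and $G$ the generation group containing $i$ (notation introduced here for illustration, not taken from the source):

```latex
\begin{aligned}
A_i &= r_i - \frac{1}{|G|}\sum_{j \in G} r_j, \\
\tilde r_i &= r_i - c\,\mathrm{KL}_i, \\
\tilde A_i &= \tilde r_i - \frac{1}{|G|}\sum_{j \in G} \tilde r_j
  = \Bigl(r_i - \frac{1}{|G|}\sum_{j \in G} r_j\Bigr)
    - c\Bigl(\mathrm{KL}_i - \frac{1}{|G|}\sum_{j \in G} \mathrm{KL}_j\Bigr) \\
  &= A_i - c\Bigl(\mathrm{KL}_i - \frac{1}{|G|}\sum_{j \in G} \mathrm{KL}_j\Bigr).
\end{aligned}
```

If advantages are also divided by the group standard deviation, the divisor itself changes once the penalty is folded in, so the identity no longer holds exactly. That is why the path requires `scale_rewards∈{none,false}`.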

 Ordered by (fidelity-leverage × cheapness), front-loading the items that move the "next-level" needle for the least build.

+**Tier 0 — cheap fidelity fixes the evidence says move OOD generalization — ✅ DONE (Wave 20):**
+1. ✅ **k1-in-reward KL** (Rubric A c2). Implemented as `trainer/kl_in_reward.py` (pure, tested math) + `composer_trainer.py` `kl_in_reward=True`/`kl_estimator=`. Keeps `beta` as the KL coefficient (TRL needs beta>0 to compute ref logps), folds the k1 penalty into advantages at scoring, and zeros TRL's in-loss k3 per step. `validate_kl_in_reward_config` enforces beta>0 + scale_rewards∈{none,false}. `test_kl_in_reward.py` proves the advantage adjustment exactly equals fold-then-baseline. **The highest-fidelity-leverage single change — shipped.**
+2. ✅ **Composer-2 behavior rewards** (Rubric A h+i): `datagen/behavior_rewards.py` — `c_length` (exact eq + q→1 limit), `EffortWeights`, `LengthEffortPenalty`, `UnfinishedTodoPenalty`, `LeftoverCoTPenalty`, `CommunicationReward`, `default_behavior_rewards()`, all TRL `RewardFunc`s composable with `env.reward_fn`. `test_behavior_rewards.py` pins the math + contract. Reproduced *without* the hint mystery.
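
To make the `c_length` closed form from item 2 concrete, here is a minimal sketch of `C_length{k,q}(x)=((1+kx)^{1−q}−1)/(k(1−q))` together with its q→1 limit, `ln(1+kx)/k`. The signature, the tolerance used to detect q≈1, and the input checks are assumptions made for illustration; the real `datagen/behavior_rewards.py::c_length` may differ.

```python
import math


def c_length(x: float, k: float, q: float) -> float:
    """Length/effort cost ((1 + k*x)**(1 - q) - 1) / (k * (1 - q)).

    Hypothetical signature: x is the effort measure, k > 0 a scale,
    q the curvature exponent.
    """
    if x < 0 or k <= 0:
        raise ValueError("expected x >= 0 and k > 0")
    if math.isclose(q, 1.0, abs_tol=1e-9):
        # The closed form is 0/0 at q == 1; its limit is log(1 + k*x) / k.
        return math.log1p(k * x) / k
    return ((1.0 + k * x) ** (1.0 - q) - 1.0) / (k * (1.0 - q))


if __name__ == "__main__":
    # q = 0 reduces to the identity; q > 0 bends the curve concave-down.
    for q in (0.0, 0.5, 1.0):
        print(q, [round(c_length(x, k=2.0, q=q), 4) for x in (0.0, 1.0, 3.0)])
```

The first derivative is `(1+kx)^(−q)`, which is positive, and the second is negative whenever q > 0, so the cost is increasing and concave: each extra unit of effort is penalized less than the one before.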

 **Tier 1 — close the highest-value PARTIALs:**
 3. **SDPO live-GPU smoke** (Rubric A a): instantiate `ComposerReplicationTrainer` against a real `trl.GRPOTrainer` on a small model (Qwen2.5-0.5B) on a SageMaker Training Job (g5/g6e) or HyperPod node-group — discharges the ADR-008 "never smoke-tested against real GRPOTrainer" caveat.