Baladithya Balamurugan Claude Opus 4.8 (1M context) committed on
Commit 41289bf · 1 Parent(s): bd37412

Wave 20: Tier-0 fidelity fixes — k1-in-reward KL + Composer-2 behavior rewards

Closes the two cheapest "next-level" gaps from the F5 fidelity audit
(Rubric A items c2, h, i). Both reproduce Composer-2 §4.1/§4.2 exactly,
without depending on the unreproducible hint mechanism. Adds 33 new tests;
the full suite reports 415 passed / 66 skipped.

(1) k1-in-reward KL (F5 #1, the highest-fidelity-leverage single change).
Composer-2 §4.1 chooses the k1 estimator IN THE REWARD (-log r); TRL uses
k3 IN THE LOSS. The 2025/26 evidence (arXiv:2512.21852, verl k1-only, TRL
#4967) says k1-in-reward improves OOD generalization where k3-in-reward can
collapse — exactly the "take any model to the next level" axis.
- trainer/kl_in_reward.py: pure, tested math. k1/k3 per-sequence
estimators + apply_kl_in_reward, which adjusts GRPO advantages by
-coef·(KL - group_mean(KL)). test_kl_in_reward.py PROVES this exactly
equals folding -coef·KL into the reward then re-running GRPO's
group-mean baseline (9 group/gen combos) — the justification for not
forking TRL's scoring code. Exact only without std-norm (Dr.GRPO regime).
- composer_trainer.py: opt-in kl_in_reward=True / kl_estimator=. Keeps
beta>0 (TRL needs it to compute ref logps) but suppresses TRL's in-loss
k3 per step (beta→0 across the parent _compute_loss) and folds the k1
penalty into advantages at scoring time. validate_kl_in_reward_config
(extracted, unit-tested) guards beta>0 + scale_rewards∈{none,false}.
OFF by default → byte-for-byte legacy. getattr-guarded so __new__-built
test stubs default to legacy.
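The identity that test_kl_in_reward.py is said to prove can be checked on a toy group. The following is a minimal pure-Python sketch, not the contents of trainer/kl_in_reward.py: the helper names (`k1`, `k3`, `group_mean_baseline`), the token-sum aggregation, and the sample numbers are all assumptions made for illustration.

```python
import math


def k1(logp, ref_logp):
    """k1 estimator: sum over tokens of log pi - log pi_ref (-log r, r = pi_ref/pi).

    Summing over tokens is an assumption; the real module may average instead.
    """
    return sum(p - r for p, r in zip(logp, ref_logp))


def k3(logp, ref_logp):
    """k3 estimator: sum over tokens of r - 1 - log r with r = pi_ref/pi (each term >= 0)."""
    total = 0.0
    for p, r in zip(logp, ref_logp):
        log_r = r - p
        total += math.exp(log_r) - 1.0 - log_r
    return total


def group_mean_baseline(values):
    """GRPO-style advantage without std-normalization: value minus the group mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# One prompt group of four completions (made-up numbers).
rewards = [1.0, 0.0, 0.5, 0.0]
kl = [0.30, 0.10, 0.20, 0.00]  # per-sequence KL penalties
coef = 0.05

# Route A: fold -coef*KL into the reward, then apply the group-mean baseline.
route_a = group_mean_baseline([r - coef * k for r, k in zip(rewards, kl)])

# Route B: baseline the raw rewards, then shift by -coef*(KL - group_mean(KL)).
kl_mean = sum(kl) / len(kl)
route_b = [a - coef * (k - kl_mean) for a, k in zip(group_mean_baseline(rewards), kl)]

assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(route_a, route_b))
```

The two routes agree because the group-mean baseline is linear in the reward. Dividing by the group standard deviation is not linear (the std of `reward - coef*KL` generally differs from the std of `reward`), which is why the commit restricts the path to the regime without std-normalization.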

(2) Composer-2 behavior rewards (F5 #2, Rubric A h+i).
- datagen/behavior_rewards.py: c_length (the exact nonlinear length/effort
penalty ((1+kx)^(1-q)-1)/(k(1-q)) + q→1 log-limit), EffortWeights (the
weighted {thinking/tool-call/tool-output/final-msg tokens, #calls,
#turns} combination), LengthEffortPenalty, and the aux scalar shapers
UnfinishedTodoPenalty / LeftoverCoTPenalty / CommunicationReward — the
exact emergent behaviors §4.2 names. All TRL RewardFuncs composable with
env.reward_fn via reward_weights. default_behavior_rewards() = the
starter array. test_behavior_rewards.py pins the math + contract.
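One consequence of the c_length formula is worth seeing numerically: for q>1 the penalty saturates at 1/(k·(q−1)). The sketch below re-implements the formula from this commit message in a few lines so it runs on its own; the sample x values are arbitrary and the `ceiling` name is introduced here for illustration only.

```python
import math


def c_length(x, k, q):
    """((1 + k*x)^(1-q) - 1) / (k*(1-q)), with the log limit at q = 1."""
    base = 1.0 + k * x
    if abs(1.0 - q) < 1e-9:
        return math.log(base) / k
    return (base ** (1.0 - q) - 1.0) / (k * (1.0 - q))


k, q = 0.01, 2.0
# As x grows, (1 + k*x)^(1-q) tends to 0 for q > 1, so the penalty approaches this bound.
ceiling = 1.0 / (k * (q - 1.0))

for x in (0.0, 100.0, 500.0, 5000.0):
    print(f"x={x:>7.0f}  C_length={c_length(x, k, q):8.3f}  ceiling={ceiling:.1f}")
```

With k=0.01 and q=2.0 the formula reduces to x/(1+k·x), so the value at x=100 is about 50 and the ceiling is 100. Multiplied by a coefficient of 0.1, the length term can therefore reach about −10 reward units, which is worth comparing against the scale of the correctness reward when choosing the coefficient.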

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

composer_replication/datagen/behavior_rewards.py ADDED
@@ -0,0 +1,282 @@
+ """behavior_rewards.py — Composer-2 §4.2 behavior shaping (hint-free).
+
+ THE FIDELITY GAP (F5 Rubric A items h + i). Composer-2 shapes agent behavior
+ NOT with hints (that's the unreproducible Composer-2.5 mechanism) but with an
+ **array of auxiliary scalar rewards** + a **nonlinear length/effort penalty**.
+ research/10 verified both against the report and gives the exact equation. None
+ of it was in code; this module implements them as TRL ``RewardFunc``s
+ composable with ``FeatureDeletionEnv.reward_fn`` (the test-pass-fraction reward).
+
+ Quoting Composer-2 §4.2 (research/10-composer2-techreport-mining.md:74-77):
+   * "we apply an array of auxiliary rewards … rewards for coding style,
+     communication, and product-specific penalties for poor tool calls, such as
+     creating to-do list items and then leaving them unfinished."
+   * Nonlinear length/effort penalty:
+         C_length{k,q}(x) = ((1 + k·x)^{1−q} − 1) / (k·(1−q))
+     concave-down & increasing, x = weighted combination of {thinking tokens,
+     tool-calling tokens, tool-output tokens, final-message tokens, #tool calls,
+     #turns}. Goal: quick on easy tasks, think longer on hard ones; observed to
+     induce parallel tool calls.
+
+ DESIGN. Every reward here is a callable matching TRL's ``RewardFunc`` contract
+ ``(prompts, completions, **kwargs) -> list[float]`` so it drops straight into
+ ``GRPOConfig.reward_funcs`` alongside the env's correctness reward. They are
+ deliberately SIGNAL-ONLY shapers (small magnitudes, penalties ≤ 0): the
+ correctness reward must dominate, and these never gate it. TRL combines multiple
+ reward funcs via ``reward_weights`` (sum), so the composed objective is
+ ``reward = correctness + Σ_i w_i · behavior_i`` — exactly Composer-2's "array of
+ auxiliary rewards" added to the primary signal.
+
+ The length penalty is the substantive piece; the aux rewards are heuristic
+ pattern-matchers over the completion text (unfinished-todo, leftover-CoT-in-code,
+ terminal-only collapse — the exact emergent behaviors the report says it
+ reactively penalized). They are intentionally simple and overridable.
+ """
+ from __future__ import annotations
+
+ import re
+ from collections.abc import Callable, Mapping
+ from dataclasses import dataclass, field
+
+ #: A TRL reward function: (prompts, completions, **kwargs) -> per-completion floats.
+ RewardFunc = Callable[..., list[float]]
+
+
+ # ---------------------------------------------------------------------
+ # (i) Nonlinear length / effort penalty — C_length{k,q}(x)
+ # ---------------------------------------------------------------------
+
+
+ def c_length(x: float, k: float, q: float) -> float:
+     """Composer-2 nonlinear length/effort penalty C_length{k,q}(x).
+
+         C_length(x) = ((1 + k·x)^{1−q} − 1) / (k·(1−q))
+
+     Concave-down & increasing in x for k>0, q>1 (the report's regime): the
+     marginal penalty for more effort SHRINKS as effort grows, so the model is
+     pushed to be quick on easy tasks while still allowed to think longer on hard
+     ones (cheap incremental effort once already deep in a hard task).
+
+     Args:
+         x: the (non-negative) effort scalar — a weighted token/turn combination.
+         k: curvature hyperparameter (>0). Larger k = faster saturation of the penalty.
+         q: curvature hyperparameter. Any q>0 keeps the shape increasing and concave-down
+             (q>1 also bounds it by 1/(k·(q−1))); q→1 is handled below; q<0 makes it convex.
+
+     Returns:
+         The penalty magnitude (≥ 0 for x≥0, k>0). Callers negate it to subtract
+         from reward.
+
+     Edge cases:
+         * x = 0 → 0 (no effort, no penalty).
+         * q = 1 → the limit is log(1 + k·x)/k (L'Hôpital on the (1−q) factor),
+           which we return directly to avoid a 0/0.
+     """
+     if x < 0:
+         raise ValueError(f"effort scalar x must be non-negative, got {x}")
+     if k <= 0:
+         raise ValueError(f"curvature k must be positive, got {k}")
+     base = 1.0 + k * x
+     if abs(1.0 - q) < 1e-9:
+         # Limit q→1: ((1+kx)^{1-q}-1)/(k(1-q)) → log(1+kx)/k.
+         import math
+
+         return math.log(base) / k
+     return (base ** (1.0 - q) - 1.0) / (k * (1.0 - q))
+
+
+ @dataclass(frozen=True)
+ class EffortWeights:
+     """Weights for the effort scalar x (Composer-2's weighted combination).
+
+     x = (w_think·thinking_tokens + w_tool_call·tool_calling_tokens
+          + w_tool_out·tool_output_tokens + w_final·final_message_tokens
+          + w_calls·num_tool_calls + w_turns·num_turns).
+
+     Defaults weight raw token counts at 1.0 (per-token) and the discrete
+     call/turn counts more heavily, since each call/turn carries fixed overhead
+     the report wanted to discourage (it observed the penalty inducing PARALLEL
+     tool calls — fewer turns for the same work).
+     """
+
+     think: float = 1.0
+     tool_call: float = 1.0
+     tool_output: float = 1.0
+     final_message: float = 1.0
+     num_calls: float = 50.0
+     num_turns: float = 100.0
+
+     def effort(self, features: Mapping[str, float]) -> float:
+         """Compute x from a per-completion feature dict (missing keys → 0)."""
+         return (
+             self.think * features.get("thinking_tokens", 0.0)
+             + self.tool_call * features.get("tool_calling_tokens", 0.0)
+             + self.tool_output * features.get("tool_output_tokens", 0.0)
+             + self.final_message * features.get("final_message_tokens", 0.0)
+             + self.num_calls * features.get("num_tool_calls", 0.0)
+             + self.num_turns * features.get("num_turns", 0.0)
+         )
+
+
+ @dataclass(frozen=True)
+ class LengthEffortPenalty:
+     """TRL RewardFunc applying ``-coef · C_length{k,q}(x)`` per completion.
+
+     The effort scalar x is read from a per-completion ``effort_features`` kwarg
+     (a list of feature dicts, one per completion) when the agent harness can
+     supply structured token/turn counts; otherwise it falls back to a crude
+     proxy = completion character count / ``chars_per_token`` so the penalty is
+     never silently zero.
+
+     Penalty is ≤ 0 (a cost). ``coef`` scales it into reward units; keep it small
+     so correctness dominates.
+     """
+
+     k: float = 0.01
+     q: float = 2.0
+     coef: float = 0.1
+     weights: EffortWeights = field(default_factory=EffortWeights)
+     chars_per_token: float = 4.0
+
+     def __call__(
+         self,
+         prompts: list[str] | None = None,
+         completions: list[str] | None = None,
+         *,
+         effort_features: list[Mapping[str, float]] | None = None,
+         **kwargs: object,
+     ) -> list[float]:
+         completions = completions or []
+         out: list[float] = []
+         for i, comp in enumerate(completions):
+             if effort_features is not None and i < len(effort_features):
+                 x = self.weights.effort(effort_features[i])
+             else:
+                 # Proxy when structured features aren't wired: token-count estimate.
+                 x = len(comp) / self.chars_per_token
+             out.append(-self.coef * c_length(x, self.k, self.q))
+         return out
+
+
+ # ---------------------------------------------------------------------
+ # (h) Auxiliary scalar rewards — coding style / communication / poor tool use
+ # ---------------------------------------------------------------------
+
+ # Heuristic patterns for the emergent behaviors Composer-2 §4.2 names. Simple
+ # and overridable by design; the report itself added these REACTIVELY by
+ # monitoring, so they're meant to grow, not be exhaustive on day one.
+
+ _TODO_CREATE = re.compile(r"\b(?:add|create|TODO|todo)[\s_-]*(?:item|list)?\b", re.I)
+ _TODO_DONE = re.compile(r"\b(?:done|completed|finished|resolved|checked off)\b", re.I)
+ # Long chain-of-thought left as a code comment (an explicitly-penalized emergent
+ # behavior): a comment line that's prose-long rather than a terse code note.
+ _LONG_COMMENT = re.compile(r"^\s*(?:#|//)\s?.{120,}$", re.M)
+
+
+ @dataclass(frozen=True)
+ class UnfinishedTodoPenalty:
+     """Penalize creating to-do items then leaving them unfinished (the report's
+     explicit "poor tool call" example). Heuristic: penalty scales with
+     max(0, #todo-creations − #todo-completions)."""
+
+     coef: float = 0.05
+
+     def __call__(
+         self, prompts: list[str] | None = None, completions: list[str] | None = None,
+         **kwargs: object,
+     ) -> list[float]:
+         out: list[float] = []
+         for comp in completions or []:
+             created = len(_TODO_CREATE.findall(comp))
+             done = len(_TODO_DONE.findall(comp))
+             unfinished = max(0, created - done)
+             out.append(-self.coef * unfinished)
+         return out
+
+
+ @dataclass(frozen=True)
+ class LeftoverCoTPenalty:
+     """Penalize leaving long chain-of-thought in code comments (an emergent
+     behavior the report observed and reactively penalized). Heuristic: penalty
+     per over-long comment line."""
+
+     coef: float = 0.05
+
+     def __call__(
+         self, prompts: list[str] | None = None, completions: list[str] | None = None,
+         **kwargs: object,
+     ) -> list[float]:
+         return [
+             -self.coef * len(_LONG_COMMENT.findall(comp))
+             for comp in (completions or [])
+         ]
+
+
+ @dataclass(frozen=True)
+ class CommunicationReward:
+     """Small POSITIVE reward for a concise, present final message (the report's
+     "communication" reward). Heuristic: reward presence of a non-empty final
+     message under a soft length cap; penalize empty or runaway-long finals.
+
+     Composer-2 rewards good communication; we approximate "good" as "a final
+     message exists and is reasonably concise" — the cheapest non-degenerate
+     proxy. Override for a real judge-based communication score.
+     """
+
+     coef: float = 0.02
+     soft_cap_chars: int = 2000
+
+     def __call__(
+         self, prompts: list[str] | None = None, completions: list[str] | None = None,
+         *, final_messages: list[str] | None = None, **kwargs: object,
+     ) -> list[float]:
+         completions = completions or []
+         out: list[float] = []
+         for i, comp in enumerate(completions):
+             msg = (
+                 final_messages[i]
+                 if final_messages is not None and i < len(final_messages)
+                 else comp
+             )
+             n = len(msg.strip())
+             if n == 0:
+                 out.append(-self.coef)  # no communication at all
+             elif n <= self.soft_cap_chars:
+                 out.append(self.coef)  # present + concise
+             else:
+                 # present but runaway-long: decays back toward 0 then negative.
+                 overflow = (n - self.soft_cap_chars) / self.soft_cap_chars
+                 out.append(self.coef * max(-1.0, 1.0 - overflow))
+         return out
+
+
+ def default_behavior_rewards() -> list[RewardFunc]:
+     """The Composer-2 §4.2 starter array, ready to extend with reward_weights.
+
+     Returns the length/effort penalty + the three aux shapers. Compose with the
+     env correctness reward, e.g.::
+
+         reward_funcs = [env.reward_fn, *default_behavior_rewards()]
+         reward_weights = [1.0, 1.0, 1.0, 1.0, 1.0]  # correctness dominates
+
+     The report adds these REACTIVELY by monitoring emergent behaviors, so treat
+     this as the seed set, not the final list.
+     """
+     return [
+         LengthEffortPenalty(),
+         UnfinishedTodoPenalty(),
+         LeftoverCoTPenalty(),
+         CommunicationReward(),
+     ]
+
+
+ __all__ = [
+     "RewardFunc",
+     "c_length",
+     "EffortWeights",
+     "LengthEffortPenalty",
+     "UnfinishedTodoPenalty",
+     "LeftoverCoTPenalty",
+     "CommunicationReward",
+     "default_behavior_rewards",
+ ]
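The composed objective named in the module docstring, `reward = correctness + Σ_i w_i · behavior_i`, can be sketched without TRL. The code below is an illustration only: `combine_rewards` and the two toy reward functions are hypothetical names, not part of this commit or of TRL, and they only mimic the weighted sum the docstring describes.

```python
from collections.abc import Callable, Sequence

RewardFunc = Callable[..., list[float]]


def combine_rewards(
    funcs: Sequence[RewardFunc],
    weights: Sequence[float],
    prompts: list[str],
    completions: list[str],
    **kwargs: object,
) -> list[float]:
    """Weighted sum of per-completion rewards (hypothetical helper)."""
    if len(funcs) != len(weights):
        raise ValueError("need exactly one weight per reward function")
    per_func = [f(prompts=prompts, completions=completions, **kwargs) for f in funcs]
    return [
        sum(w * scores[i] for w, scores in zip(weights, per_func))
        for i in range(len(completions))
    ]


def toy_correctness(prompts=None, completions=None, **kwargs):
    """Stand-in for the env correctness reward: 1.0 if the text contains 'pass'."""
    return [1.0 if "pass" in c else 0.0 for c in completions or []]


def toy_length_cost(prompts=None, completions=None, **kwargs):
    """Stand-in shaper: a small non-positive cost proportional to length."""
    return [-0.001 * len(c) for c in completions or []]


totals = combine_rewards(
    [toy_correctness, toy_length_cost],
    [1.0, 1.0],
    prompts=["p", "p"],
    completions=["pass", "fail" * 10],
)
print(totals)
```

Because every function follows the same `(prompts, completions, **kwargs) -> list[float]` shape, shapers can be added or dropped by editing the two lists in step.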
composer_replication/datagen/tests/test_behavior_rewards.py ADDED
@@ -0,0 +1,166 @@
+ """Tests for Composer-2 §4.2 behavior-shaping rewards (F5 #2 fidelity fix).
+
+ The substantive coverage is the C_length math (the exact report equation +
+ its concave-down/increasing shape + edge cases) and the TRL RewardFunc
+ contract (one float per completion, shapers ≤ 0).
+ """
+ from __future__ import annotations
+
+ import math
+
+ import pytest
+
+ from composer_replication.datagen.behavior_rewards import (
+     CommunicationReward,
+     EffortWeights,
+     LeftoverCoTPenalty,
+     LengthEffortPenalty,
+     UnfinishedTodoPenalty,
+     c_length,
+     default_behavior_rewards,
+ )
+
+
+ # ---------------------------------------------------------------------
+ # C_length{k,q}(x) — the exact equation
+ # ---------------------------------------------------------------------
+
+
+ def test_c_length_matches_closed_form():
+     k, q, x = 0.01, 2.0, 500.0
+     expected = ((1 + k * x) ** (1 - q) - 1) / (k * (1 - q))
+     assert c_length(x, k, q) == pytest.approx(expected)
+
+
+ def test_c_length_zero_effort_is_zero():
+     assert c_length(0.0, k=0.01, q=2.0) == pytest.approx(0.0)
+
+
+ def test_c_length_increasing_and_concave_down():
+     """q>1, k>0: increasing in x, with SHRINKING marginal penalty (concave-down)."""
+     k, q = 0.01, 2.0
+     xs = [0.0, 100.0, 200.0, 300.0, 400.0]
+     vals = [c_length(x, k, q) for x in xs]
+     diffs = [vals[i + 1] - vals[i] for i in range(len(vals) - 1)]
+     # strictly increasing
+     assert all(d > 0 for d in diffs), f"not increasing: {vals}"
+     # concave-down: each marginal increment smaller than the previous
+     assert all(diffs[i + 1] < diffs[i] for i in range(len(diffs) - 1)), (
+         f"not concave-down (marginal penalty should shrink): {diffs}"
+     )
+
+
+ def test_c_length_q_equals_one_limit():
+     """q→1 limit is log(1+kx)/k (avoids the 0/0 in the closed form)."""
+     k, x = 0.02, 300.0
+     assert c_length(x, k, q=1.0) == pytest.approx(math.log(1 + k * x) / k)
+     # continuity: q just off 1 should be very close to the limit.
+     near = c_length(x, k, q=1.0 + 1e-7)
+     assert near == pytest.approx(math.log(1 + k * x) / k, rel=1e-3)
+
+
+ def test_c_length_validates_inputs():
+     with pytest.raises(ValueError, match="non-negative"):
+         c_length(-1.0, k=0.01, q=2.0)
+     with pytest.raises(ValueError, match="k must be positive"):
+         c_length(1.0, k=0.0, q=2.0)
+
+
+ # ---------------------------------------------------------------------
+ # EffortWeights
+ # ---------------------------------------------------------------------
+
+
+ def test_effort_weights_weighted_combination():
+     w = EffortWeights(think=1.0, tool_call=1.0, tool_output=1.0, final_message=1.0,
+                       num_calls=50.0, num_turns=100.0)
+     feats = {
+         "thinking_tokens": 100, "tool_calling_tokens": 50,
+         "tool_output_tokens": 200, "final_message_tokens": 30,
+         "num_tool_calls": 4, "num_turns": 3,
+     }
+     # 100+50+200+30 + 50*4 + 100*3 = 380 + 200 + 300 = 880
+     assert w.effort(feats) == pytest.approx(880.0)
+
+
+ def test_effort_weights_missing_keys_are_zero():
+     assert EffortWeights().effort({}) == pytest.approx(0.0)
+
+
+ # ---------------------------------------------------------------------
+ # LengthEffortPenalty — the RewardFunc
+ # ---------------------------------------------------------------------
+
+
+ def test_length_penalty_one_value_per_completion_and_nonpositive():
+     rf = LengthEffortPenalty(coef=0.1)
+     comps = ["short", "a much longer completion " * 50, ""]
+     out = rf(prompts=None, completions=comps)
+     assert len(out) == len(comps)
+     assert all(v <= 0 for v in out), f"length penalty must be ≤ 0: {out}"
+
+
+ def test_length_penalty_longer_is_more_penalized():
+     rf = LengthEffortPenalty(coef=0.1)
+     short, long = "x" * 100, "x" * 10000
+     out = rf(completions=[short, long])
+     assert out[1] < out[0], "longer completion must incur a larger (more negative) penalty"
+
+
+ def test_length_penalty_uses_structured_features_when_present():
+     rf = LengthEffortPenalty(coef=1.0, k=0.01, q=2.0)
+     # Same short text, but big structured effort → penalty driven by features.
+     feats = [{"num_turns": 10, "thinking_tokens": 5000}]
+     out_feat = rf(completions=["ok"], effort_features=feats)
+     out_proxy = rf(completions=["ok"])
+     assert out_feat[0] < out_proxy[0], "structured high-effort must out-penalize the char proxy"
+
+
+ # ---------------------------------------------------------------------
+ # Auxiliary scalar rewards
+ # ---------------------------------------------------------------------
+
+
+ def test_unfinished_todo_penalty():
+     rf = UnfinishedTodoPenalty(coef=0.05)
+     # Heuristic counts create-like and done-like tokens; penalty scales with
+     # max(0, creates - dones). More creates than completions → penalized.
+     unbalanced = "TODO: add item one. create todo item two."  # several create tokens, 0 done
+     over_done = "Done. completed. finished. resolved."  # dones ≥ creates → no penalty
+     out = rf(completions=[unbalanced, over_done, "no todos here"])
+     assert out[0] < 0, "more creations than completions must be penalized"
+     assert out[1] == pytest.approx(0.0), "completions ≥ creations → no penalty"
+     assert out[2] == pytest.approx(0.0), "no todo activity → no penalty"
+
+
+ def test_leftover_cot_penalty():
+     rf = LeftoverCoTPenalty(coef=0.05)
+     long_comment = "# " + "this is a very long chain of thought left in a comment " * 5
+     clean = "# terse note\nx = 1"
+     out = rf(completions=[long_comment + "\n" + "code()", clean])
+     assert out[0] < 0, "long CoT comment must be penalized"
+     assert out[1] == pytest.approx(0.0)
+
+
+ def test_communication_reward_concise_present_vs_empty():
+     rf = CommunicationReward(coef=0.02, soft_cap_chars=2000)
+     out = rf(completions=["Fixed the bug by adding a null check.", ""])
+     assert out[0] == pytest.approx(0.02), "concise present final → +coef"
+     assert out[1] == pytest.approx(-0.02), "empty final → -coef"
+
+
+ def test_communication_reward_runaway_long_decays():
+     rf = CommunicationReward(coef=0.02, soft_cap_chars=100)
+     out = rf(completions=["x" * 1000])  # 9x over cap
+     assert out[0] < 0.02, "runaway-long final should decay below the concise reward"
+
+
+ def test_default_array_contract():
+     """All default behavior rewards honor the TRL RewardFunc contract."""
+     funcs = default_behavior_rewards()
+     assert len(funcs) == 4
+     comps = ["completion one", "TODO: add item, never done", "x" * 5000]
+     for f in funcs:
+         out = f(prompts=["p"] * len(comps), completions=comps)
+         assert len(out) == len(comps), f"{type(f).__name__} returned wrong count"
+         assert all(isinstance(v, float) for v in out)
composer_replication/trainer/composer_trainer.py CHANGED
@@ -47,6 +47,10 @@ except ImportError: # pragma: no cover — only hit in unit-test stubs without
      _TRL_AVAILABLE = False

  from composer_replication.opsd import generalized_jsd_loss

  logger = logging.getLogger(__name__)

@@ -67,6 +71,22 @@ class ComposerReplicationTrainer(GRPOTrainer): # type: ignore[misc, valid-type]
          sdpo_temperature: temperature for SDPO loss; SDPO paper uses 1.0.
          sdpo_token_clip: per-token JSD clip for stability; None = no clip.
          replay_dpo_beta: beta param of the DPO loss (β in the standard DPO formula).
          heldout_guard: optional ``HeldOutGuard`` (the #2 collapse safeguard from
              ``composer_replication.safety``). Default None = OFF (no behavior
              change whatsoever). When supplied, the trainer folds one checkpoint's
@@ -98,6 +118,8 @@ class ComposerReplicationTrainer(GRPOTrainer): # type: ignore[misc, valid-type]
          sdpo_token_clip: float | None = None,
          replay_dpo_beta: float = 0.1,
          strict_sdpo_alignment: bool = True,
          heldout_guard: HeldOutGuard | None = None,
          heldout_eval_fn: Callable[[], float] | None = None,
          strict_killswitch: bool = True,
@@ -121,6 +143,19 @@ class ComposerReplicationTrainer(GRPOTrainer): # type: ignore[misc, valid-type]
          # trust-gap flagged in ADR-008). Set False only for production runs
          # where a single malformed batch should warn-and-skip rather than abort.
          self.strict_sdpo_alignment = strict_sdpo_alignment
          # --- run-level collapse kill-switch (#2 safeguard) -------------------
          # OPTIONAL + OFF BY DEFAULT: when heldout_guard is None the loss path is
          # byte-for-byte the legacy behavior. When set, _maybe_update_killswitch
@@ -141,14 +176,85 @@ class ComposerReplicationTrainer(GRPOTrainer): # type: ignore[misc, valid-type]
      # Loss override (the integration core)
      # ----------------------------------------------------------------------

      def _compute_loss(
          self,
          model: torch.nn.Module,
          inputs: dict[str, torch.Tensor],
      ) -> torch.Tensor:
-         """Override: total_loss = grpo + α*sdpo + β*replay."""
-         # Channel 1: standard GRPO loss
-         grpo_loss = super()._compute_loss(model, inputs)

          # Channel 2: SDPO hint-distill at error sites
          sdpo_kl = self._compute_sdpo_loss(model, inputs)
@@ -177,6 +283,64 @@ class ComposerReplicationTrainer(GRPOTrainer): # type: ignore[misc, valid-type]

          return total

      # ----------------------------------------------------------------------
      # Run-level collapse kill-switch (#2 safeguard) — optional, OFF by default
      # ----------------------------------------------------------------------
@@ -480,6 +644,51 @@ def _device_of(model: torch.nn.Module) -> torch.device:
      return next(model.parameters()).device


  def make_dr_grpo_config(**overrides: Any):
      """Build a `trl.GRPOConfig` configured to the **Dr. GRPO** recipe.

@@ -685,4 +894,5 @@ __all__ = [
      "make_dr_grpo_config",
      "make_po_config",
      "PO_OBJECTIVES",
  ]

      _TRL_AVAILABLE = False

  from composer_replication.opsd import generalized_jsd_loss
+ from composer_replication.trainer.kl_in_reward import (
+     apply_kl_in_reward,
+     kl_penalty_per_sequence,
+ )

  logger = logging.getLogger(__name__)


          sdpo_temperature: temperature for SDPO loss; SDPO paper uses 1.0.
          sdpo_token_clip: per-token JSD clip for stability; None = no clip.
          replay_dpo_beta: beta param of the DPO loss (β in the standard DPO formula).
+         kl_in_reward: when True, apply the KL-to-reference penalty in the
+             **reward** (Composer-2 §4.1 / verl choice) instead of TRL's native
+             **in-loss** k3 term. The penalty is folded into GRPO's advantages at
+             scoring time (``adv -= beta·(KL - group_mean(KL))``) and TRL's
+             in-loss KL is suppressed for that step. The F5 audit's #1 fidelity
+             fix: the 2025/26 evidence (arXiv:2512.21852, verl, TRL #4967) shows
+             k1-in-reward improves OOD generalization where k3-in-reward can
+             collapse. REQUIRES ``beta>0`` (the KL coefficient — also how TRL
+             decides to compute reference logprobs) and ``scale_rewards`` in
+             {none,false} (the advantage-adjustment identity is exact only
+             without std-normalization — the Dr.GRPO / Composer regime). Default
+             False = TRL's native in-loss KL, byte-for-byte legacy behavior.
+         kl_estimator: ``"k1"`` (default; ``logp - ref_logp``, the Composer-2 /
+             verl choice this path exists for) or ``"k3"`` (Schulman; lets an
+             experiment A/B k1-in-reward vs k3-in-reward). Only consulted when
+             ``kl_in_reward=True``.
          heldout_guard: optional ``HeldOutGuard`` (the #2 collapse safeguard from
              ``composer_replication.safety``). Default None = OFF (no behavior
              change whatsoever). When supplied, the trainer folds one checkpoint's

          sdpo_token_clip: float | None = None,
          replay_dpo_beta: float = 0.1,
          strict_sdpo_alignment: bool = True,
+         kl_in_reward: bool = False,
+         kl_estimator: str = "k1",
          heldout_guard: HeldOutGuard | None = None,
          heldout_eval_fn: Callable[[], float] | None = None,
          strict_killswitch: bool = True,

          # trust-gap flagged in ADR-008). Set False only for production runs
          # where a single malformed batch should warn-and-skip rather than abort.
          self.strict_sdpo_alignment = strict_sdpo_alignment
+         # --- k1-in-reward KL (F5 #1 fidelity fix; Composer-2 §4.1 / verl) ----
+         # OFF by default → TRL's native in-loss k3 KL, byte-for-byte legacy.
+         # When ON we keep self.beta as the KL coef (TRL needs beta>0 to even
+         # create the ref model + compute ref logps), fold the k1 penalty into
+         # advantages during scoring, and zero TRL's in-loss KL per step.
+         self.kl_in_reward = kl_in_reward
+         self.kl_estimator = kl_estimator
+         if kl_in_reward:
+             validate_kl_in_reward_config(
+                 kl_estimator=kl_estimator,
+                 beta=float(getattr(self.args, "beta", 0.0)),
+                 scale_rewards=getattr(self.args, "scale_rewards", "group"),
+             )
          # --- run-level collapse kill-switch (#2 safeguard) -------------------
          # OPTIONAL + OFF BY DEFAULT: when heldout_guard is None the loss path is
          # byte-for-byte the legacy behavior. When set, _maybe_update_killswitch
 
      # Loss override (the integration core)
      # ----------------------------------------------------------------------

+     # ----------------------------------------------------------------------
+     # k1-in-reward: fold the KL penalty into advantages at scoring time, and
+     # suppress TRL's native in-loss k3 KL inside _compute_loss.
+     # ----------------------------------------------------------------------
+
+     def _generate_and_score_completions(
+         self,
+         inputs: list[dict[str, Any]],
+     ) -> dict[str, Any]:
+         """Override: after TRL scores completions, fold a k1 KL penalty into the
+         advantages (Composer-2 in-reward KL) when ``kl_in_reward`` is set.
+
+         No-op (exact legacy path) when ``kl_in_reward`` is False. When set, TRL
+         has already computed ``advantages``, ``ref_per_token_logps`` (because
+         ``beta>0``), and the completion logprobs; we recompute the per-sequence
+         k1 penalty and apply the exact group-mean-baseline correction.
+         """
+         output = super()._generate_and_score_completions(inputs)
+         if not getattr(self, "kl_in_reward", False):
+             return output
+
+         ref_logps = output.get("ref_per_token_logps")
+         # The "old" (sampling-time) policy logps are TRL's in-loss π term; they
+         # may be lazily None when generation/optimization are aligned and not
+         # vLLM (see TRL _compute_loss: old := per_token_logps.detach()). In that
+         # aligned case we cannot read π logps here, so we defer to _compute_loss
+         # (which always has per_token_logps) by stashing what we need.
+         old_logps = output.get("old_per_token_logps")
+         completion_mask = output.get("completion_mask")
+         if ref_logps is None or completion_mask is None:
+             # beta>0 guarantees ref_logps; this branch only trips on a TRL
+             # internals change — fail loud rather than silently skip the penalty.
+             raise RuntimeError(
+                 "kl_in_reward=True but TRL did not return ref_per_token_logps / "
+                 "completion_mask from scoring (beta>0 should guarantee them). "
+                 "TRL internals may have changed; re-verify the in-reward path."
+             )
+
+         if old_logps is not None:
+             penalty = kl_penalty_per_sequence(
+                 policy_logps=old_logps,
+                 ref_logps=ref_logps,
+                 completion_mask=completion_mask,
+                 estimator=self.kl_estimator,
+             )
+             output["advantages"] = apply_kl_in_reward(
+                 advantages=output["advantages"],
+                 kl_penalty=penalty,
+                 num_generations=self.num_generations,
+                 coef=float(self.args.beta),
+             )
+             output["_kl_in_reward_applied"] = torch.tensor(True)
+         else:
+             # Aligned non-vLLM case: π logps materialize only in _compute_loss.
+             # Stash ref logps + mask so _compute_loss can apply the penalty there.
+             output["_kl_in_reward_applied"] = torch.tensor(False)
+         return output
+
      def _compute_loss(
          self,
          model: torch.nn.Module,
          inputs: dict[str, torch.Tensor],
      ) -> torch.Tensor:
+         """Override: total_loss = grpo + α*sdpo + β*replay.
+
+         When ``kl_in_reward`` is set, TRL's native in-loss KL term (gated on
+         ``self.beta``) is suppressed by temporarily zeroing ``self.beta`` for the
+         duration of the parent call — the KL has already been (or is about to be)
+         accounted for in the reward/advantage, so double-counting it in the loss
+         would be wrong. ``self.beta`` is restored in ``finally``.
+         """
+         # Channel 1: standard GRPO loss. ``getattr`` (not ``self.kl_in_reward``)
+         # so an instance built via ``__new__`` + manual wiring (the SDPO /
+         # kill-switch unit-test pattern that skips __init__) defaults to the
+         # legacy path instead of raising AttributeError.
+         if getattr(self, "kl_in_reward", False):
+             grpo_loss = self._grpo_loss_kl_in_reward(model, inputs)
+         else:
+             grpo_loss = super()._compute_loss(model, inputs)

          # Channel 2: SDPO hint-distill at error sites
          sdpo_kl = self._compute_sdpo_loss(model, inputs)
 
         return total
 
+    def _grpo_loss_kl_in_reward(
+        self,
+        model: torch.nn.Module,
+        inputs: dict[str, torch.Tensor],
+    ) -> torch.Tensor:
+        """GRPO loss with the KL applied in the reward, not the loss.
+
+        Two responsibilities:
+          1. Suppress TRL's native in-loss k3 KL term for this step by zeroing
+             ``self.beta`` across the parent ``_compute_loss`` call (restored in
+             ``finally``). ``self.beta`` gates the in-loss KL add (TRL
+             ``_compute_loss``: ``if self.beta != 0.0: per_token_loss += beta*kl``).
+          2. Handle the deferred case: when generation/optimization are aligned
+             and not using vLLM, the sampling-time policy logps are None at
+             scoring time, so ``_generate_and_score_completions`` could not fold
+             the penalty into advantages. Here ``per_token_logps`` is available,
+             so we apply the exact same advantage correction in-place on
+             ``inputs["advantages"]`` BEFORE the parent computes the surrogate.
+        """
+        # Deferred-penalty path: advantages not yet KL-adjusted (aligned, no vLLM).
+        applied = inputs.get("_kl_in_reward_applied")
+        already_applied = bool(applied.item()) if applied is not None else False
+        if not already_applied and "ref_per_token_logps" in inputs:
+            with torch.no_grad():
+                prompt_ids, completion_ids = inputs["prompt_ids"], inputs["completion_ids"]
+                completion_mask = inputs["completion_mask"]
+                input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
+                attention_mask = torch.cat([inputs["prompt_mask"], completion_mask], dim=1)
+                logits_to_keep = completion_ids.size(1)
+                policy_logps, _ = self._get_per_token_logps_and_entropies(
+                    model, input_ids, attention_mask, logits_to_keep
+                )
+                penalty = kl_penalty_per_sequence(
+                    policy_logps=policy_logps,
+                    ref_logps=inputs["ref_per_token_logps"],
+                    completion_mask=completion_mask,
+                    estimator=self.kl_estimator,
+                )
+                advantages = inputs["advantages"]
+                # advantages may be (B,) or (B,1): flatten for the math, restore the
+                # shape after. Assumes rows are still contiguous num_generations groups.
+                adv_flat = advantages.reshape(advantages.shape[0])
+                adj = apply_kl_in_reward(
+                    advantages=adv_flat,
+                    kl_penalty=penalty,
+                    num_generations=self.num_generations,
+                    coef=float(self.args.beta),
+                )
+                inputs["advantages"] = adj.reshape(advantages.shape)
+
+        # Suppress TRL's in-loss KL: zero beta for the parent call, restore after.
+        saved_beta = self.beta
+        try:
+            self.beta = 0.0
+            return super()._compute_loss(model, inputs)
+        finally:
+            self.beta = saved_beta
+
     # ----------------------------------------------------------------------
     # Run-level collapse kill-switch (#2 safeguard) - optional, OFF by default
     # ----------------------------------------------------------------------
 
644
  return next(model.parameters()).device
645
 
646
 
+def validate_kl_in_reward_config(
+    *,
+    kl_estimator: str,
+    beta: float,
+    scale_rewards: Any,
+) -> None:
+    """Validate the (kl_estimator, beta, scale_rewards) combo for k1-in-reward.
+
+    Extracted so the preconditions are unit-testable without standing up a real
+    GRPOTrainer (which needs a model + dataset). Raises ``ValueError`` on any
+    invalid combination; returns None when the config is sound.
+
+    Preconditions (see ``kl_in_reward.py`` for the algebra):
+      * ``kl_estimator`` in {k1, k3}.
+      * ``beta != 0``: TRL only builds the reference model and computes ref
+        logprobs when beta is non-zero, and the in-reward penalty needs ref
+        logps. beta doubles as the in-reward KL coefficient (the in-loss k3
+        term is suppressed per step).
+      * ``scale_rewards`` in {none, false}: the advantage-adjustment identity
+        is exact only without per-group std-normalization (the Dr.GRPO /
+        Composer regime).
+    """
+    if kl_estimator not in ("k1", "k3"):
+        raise ValueError(f"kl_estimator must be 'k1' or 'k3', got {kl_estimator!r}.")
+    if float(beta) == 0.0:
+        raise ValueError(
+            "kl_in_reward=True requires a non-zero `beta` (the KL coefficient): "
+            "TRL only creates the reference model and computes ref logprobs when "
+            "beta is non-zero, and k1-in-reward needs those ref logps. Set beta to "
+            "your KL coefficient (e.g. make_po_config('dr_grpo', beta=0.04)); the "
+            "in-loss k3 term is suppressed automatically so beta acts purely as "
+            "the in-reward KL coefficient."
+        )
+    if str(scale_rewards).lower() not in ("none", "false"):
+        raise ValueError(
+            "kl_in_reward=True requires scale_rewards in {none,false} "
+            f"(got {scale_rewards!r}). The advantage-adjustment identity "
+            "adv -= beta·(KL - group_mean(KL)) is EXACT only without per-group "
+            "std-normalization (the Dr.GRPO / Composer regime). With std-norm, "
+            "folding KL into the reward also shifts the group std, so the linear "
+            "correction no longer matches true in-reward KL. Use "
+            "make_po_config('dr_grpo', beta=…) (scale_rewards='none')."
+        )
+
+
 def make_dr_grpo_config(**overrides: Any):
     """Build a `trl.GRPOConfig` configured to the **Dr. GRPO** recipe.
 
     "make_dr_grpo_config",
     "make_po_config",
     "PO_OBJECTIVES",
+    "validate_kl_in_reward_config",
 ]
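The beta-suppression step in `_grpo_loss_kl_in_reward` can be looked at in isolation. The sketch below is a minimal stand-in, not the trainer: `ToyTrainer`, `parent_loss`, and `loss_without_in_loss_kl` are hypothetical names, and the loss is a plain float rather than a tensor. It only illustrates why `try`/`finally` is used: the saved coefficient comes back even if the parent call raises.

```python
class ToyTrainer:
    """Hypothetical stand-in for a trainer whose parent loss adds beta * kl."""

    def __init__(self, beta: float) -> None:
        self.beta = beta

    def parent_loss(self, surrogate: float, kl: float, fail: bool = False) -> float:
        # Mimics a parent loss that gates its in-loss KL term on self.beta.
        if fail:
            raise RuntimeError("simulated failure inside the parent loss")
        loss = surrogate
        if self.beta != 0.0:
            loss += self.beta * kl
        return loss

    def loss_without_in_loss_kl(
        self, surrogate: float, kl: float, fail: bool = False
    ) -> float:
        # Zero beta only for the duration of the parent call, then restore it.
        saved_beta = self.beta
        try:
            self.beta = 0.0
            return self.parent_loss(surrogate, kl, fail=fail)
        finally:
            self.beta = saved_beta


trainer = ToyTrainer(beta=0.04)
suppressed = trainer.loss_without_in_loss_kl(surrogate=1.0, kl=5.0)
beta_after_success = trainer.beta

try:
    trainer.loss_without_in_loss_kl(surrogate=1.0, kl=5.0, fail=True)
except RuntimeError:
    pass
beta_after_failure = trainer.beta
```

Under these assumptions the suppressed loss carries no KL term, and `beta` is unchanged after both the successful and the failing call, so the configured coefficient stays available for the in-reward penalty.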
composer_replication/trainer/kl_in_reward.py ADDED
@@ -0,0 +1,182 @@
+"""k1-in-reward KL penalty: the Composer-2 / verl fidelity choice.
+
+THE FIDELITY GAP (F5 Rubric A item c2, the single highest-leverage fix).
+Composer-2 §4.1 explicitly chooses the **k1** KL estimator applied **in the
+reward** (``-log r``), citing a variance argument (Amini et al.). TRL's
+``GRPOTrainer`` instead applies the **k3** estimator (``exp(Δ) - Δ - 1``,
+Δ = ref_logp - logp) **in the loss**, gated on ``beta != 0``. The 2025/26
+literature says this is not cosmetic:
+
+* arXiv:2512.21852 ("A Comedy of Estimators"): k1-in-reward improves OOD
+  generalization; k3-in-reward can collapse.
+* verl adopted k1-in-reward as its *only* reverse-KL option.
+* TRL issue #4967 tracks the same divergence.
+
+OOD generalization is exactly the "take any model to the next level" axis, so
+this module gives the trainer an opt-in k1-in-reward path that matches
+Composer-2 / verl, with TRL's in-loss k3 suppressed per step (config beta stays > 0).
+
+THE ALGEBRA (why this is a clean advantage adjustment, not a TRL fork).
+k1-in-reward means: penalize each sequence's reward by ``coef * KL_i`` before
+GRPO computes its group-relative advantage:
+
+    reward'_i = reward_i - coef * KL_i
+    KL_i      = Σ_t mask_{i,t} · (logp_{i,t} - ref_logp_{i,t})   # k1 estimator
+                                                                 # of KL(π‖π_ref)
+
+GRPO's advantage (with ``scale_rewards="none"``, the Dr.GRPO / Composer regime)
+is the group-mean baseline ``adv_i = reward_i - mean_group(reward)``. Because
+that baseline is LINEAR, folding-then-baselining equals adjusting the final
+advantage:
+
+    adv'_i = reward'_i - mean_group(reward')
+           = adv_i - coef · (KL_i - mean_group(KL))
+
+So the trainer can let TRL compute advantages normally, then apply this exact
+correction, with no reimplementation of TRL's reward→advantage code.
+
+THE STD-NORM CAVEAT (why we require scale_rewards="none"). The identity above
+is EXACT only when there is no per-group std normalization. With std-norm,
+folding KL into the reward also changes the group std, so the linear correction
+is no longer equivalent. Composer-2 and verl both train WITHOUT std scaling
+(Dr.GRPO's recommendation), so we make the math exact for that regime and the
+trainer raises if k1-in-reward is requested with std-norm on, rather than
+silently applying an approximation.
+
+Note: ``-log r`` (Composer-2's phrasing) with ``r = π/π_ref = exp(logp-ref_logp)``
+gives ``-log r = ref_logp - logp = -(logp - ref_logp)`` *per token*. The KL
+PENALTY subtracted from reward is ``coef · Σ_t (logp - ref_logp)``, i.e. the
+k1 estimator of the reverse KL, which is what discourages drift from π_ref. The
+sign convention here matches the standard RLHF KL-in-reward penalty
+(Stiennon et al. 2020; verl ``kl_penalty="kl"``).
+"""
+from __future__ import annotations
+
+import torch
+
+#: Supported KL estimators for the in-reward penalty. Only k1 is meaningful here
+#: (the whole point is to use k1 instead of TRL's native-in-loss k3); k3 is
+#: accepted as an explicit no-divergence opt-out for experiments.
+KL_ESTIMATORS = ("k1", "k3")
+
+
+def k1_kl_penalty_per_sequence(
+    policy_logps: torch.Tensor,
+    ref_logps: torch.Tensor,
+    completion_mask: torch.Tensor,
+) -> torch.Tensor:
+    """Per-sequence k1 estimator of KL(π ‖ π_ref) over completion tokens.
+
+    Args:
+        policy_logps: ``(B, T)`` per-token logprobs under the (sampling) policy π.
+        ref_logps: ``(B, T)`` per-token logprobs under the reference policy π_ref,
+            on the SAME tokens/positions as ``policy_logps``.
+        completion_mask: ``(B, T)`` 1.0 at real completion tokens, 0.0 at prompt /
+            padding positions (the k1 sum is taken only over real tokens).
+
+    Returns:
+        ``(B,)`` per-sequence KL penalty ``Σ_t mask·(logp - ref_logp)``.
+
+    The k1 estimator ``logp - ref_logp`` is the unbiased (higher-variance)
+    single-sample estimate of the reverse KL; summed over the response it is the
+    sequence-level KL used as the reward penalty.
+    """
+    if policy_logps.shape != ref_logps.shape:
+        raise ValueError(
+            f"policy_logps {tuple(policy_logps.shape)} and ref_logps "
+            f"{tuple(ref_logps.shape)} must have identical shape (same tokens)."
+        )
+    if completion_mask.shape != policy_logps.shape:
+        raise ValueError(
+            f"completion_mask {tuple(completion_mask.shape)} must match "
+            f"policy_logps {tuple(policy_logps.shape)}."
+        )
+    per_token = (policy_logps - ref_logps) * completion_mask
+    return per_token.sum(dim=-1)
+
+
+def k3_kl_penalty_per_sequence(
+    policy_logps: torch.Tensor,
+    ref_logps: torch.Tensor,
+    completion_mask: torch.Tensor,
+) -> torch.Tensor:
+    """Per-sequence k3 (Schulman) estimator of KL over completion tokens.
+
+    ``k3 = exp(Δ) - Δ - 1``, Δ = ref_logp - logp. Always ≥ 0, lower variance.
+    Provided for the in-reward path so an experiment can A/B k1-in-reward against
+    k3-in-reward (the comparison arXiv:2512.21852 makes) without touching TRL.
+    """
+    if not (policy_logps.shape == ref_logps.shape == completion_mask.shape):
+        raise ValueError("policy_logps, ref_logps, completion_mask must share shape.")
+    delta = ref_logps - policy_logps
+    per_token = (torch.exp(delta) - delta - 1.0) * completion_mask
+    return per_token.sum(dim=-1)
+
+
+def kl_penalty_per_sequence(
+    policy_logps: torch.Tensor,
+    ref_logps: torch.Tensor,
+    completion_mask: torch.Tensor,
+    estimator: str = "k1",
+) -> torch.Tensor:
+    """Dispatch to the k1 or k3 per-sequence KL penalty."""
+    if estimator == "k1":
+        return k1_kl_penalty_per_sequence(policy_logps, ref_logps, completion_mask)
+    if estimator == "k3":
+        return k3_kl_penalty_per_sequence(policy_logps, ref_logps, completion_mask)
+    raise ValueError(
+        f"Unknown KL estimator {estimator!r}; choose from {KL_ESTIMATORS}. "
+        "k1 is the Composer-2 / verl in-reward choice this module exists for."
+    )
+
+
+def apply_kl_in_reward(
+    advantages: torch.Tensor,
+    kl_penalty: torch.Tensor,
+    num_generations: int,
+    coef: float,
+) -> torch.Tensor:
+    """Adjust GRPO advantages to fold a KL penalty into the reward.
+
+    Exact (not approximate) under the group-mean baseline with NO std
+    normalization (``scale_rewards="none"``, the Dr.GRPO / Composer regime).
+    See the module docstring for the linearity argument.
+
+    Args:
+        advantages: ``(B,)`` GRPO advantages as TRL computed them
+            (= reward - group_mean(reward), no std division).
+        kl_penalty: ``(B,)`` per-sequence KL penalty (from
+            ``kl_penalty_per_sequence``).
+        num_generations: G, the number of completions per prompt (group size).
+            ``B`` must be divisible by G; groups are contiguous as TRL lays them
+            out (``rewards.view(-1, num_generations)``).
+        coef: the KL coefficient β. ``coef=0`` returns advantages unchanged.
+
+    Returns:
+        ``(B,)`` adjusted advantages ``adv - coef·(KL - group_mean(KL))``.
+    """
+    if coef == 0.0:
+        return advantages
+    if advantages.shape != kl_penalty.shape:
+        raise ValueError(
+            f"advantages {tuple(advantages.shape)} and kl_penalty "
+            f"{tuple(kl_penalty.shape)} must have identical shape (B,)."
+        )
+    b = advantages.shape[0]
+    if num_generations <= 0 or b % num_generations != 0:
+        raise ValueError(
+            f"batch size B={b} must be a positive multiple of num_generations="
+            f"{num_generations} (GRPO lays groups out contiguously)."
+        )
+    kl_grouped = kl_penalty.reshape(-1, num_generations)  # reshape: safe if non-contiguous
+    kl_centered = (kl_grouped - kl_grouped.mean(dim=1, keepdim=True)).reshape(b)
+    return advantages - coef * kl_centered
+
+
+__all__ = [
+    "KL_ESTIMATORS",
+    "k1_kl_penalty_per_sequence",
+    "k3_kl_penalty_per_sequence",
+    "kl_penalty_per_sequence",
+    "apply_kl_in_reward",
+]
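The linearity argument in the module docstring can be checked by hand on a tiny batch. The following is a minimal pure-Python sketch (no torch), assuming two contiguous groups of two completions; `group_baseline` and `adjust_advantages` are illustrative helper names, not functions from this commit, and the reward and KL values are made up.

```python
def group_baseline(values, group_size):
    """Subtract each contiguous group's mean (no std division)."""
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        group_mean = sum(group) / len(group)
        out.extend(v - group_mean for v in group)
    return out


def adjust_advantages(advantages, kl, group_size, coef):
    """adv - coef * (KL - group_mean(KL)), applied to baselined advantages."""
    kl_centered = group_baseline(kl, group_size)
    return [a - coef * c for a, c in zip(advantages, kl_centered)]


# Illustrative numbers only: two groups of two completions each.
rewards = [1.0, 0.0, 0.5, 2.0]
kl = [0.2, 0.6, 1.0, 3.0]
coef, group_size = 0.5, 2

# Route 1: fold the penalty into the reward, then take the group-mean baseline.
folded = [r - coef * k for r, k in zip(rewards, kl)]
fold_then_baseline = group_baseline(folded, group_size)

# Route 2: baseline the raw rewards first, then adjust the advantages.
adjust_after = adjust_advantages(
    group_baseline(rewards, group_size), kl, group_size, coef
)
```

Both routes give the same advantages up to floating-point rounding, which is the identity the post-hoc adjustment relies on when there is no std normalization.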
composer_replication/trainer/tests/test_dr_grpo_config_and_alignment.py CHANGED
@@ -75,6 +75,15 @@ def test_trl_kl_estimator_is_k3_not_k1():
     the installed trl==1.5.0 source (grpo_trainer.py ~L2513) shows it actually
     computes k3: `torch.exp(ref - logp) - (ref - logp) - 1`. This test pins
     that finding so the docstring stays honest and a future TRL change is caught.
+
+    NOTE (Wave 20, F5 #1 fidelity fix): TRL's NATIVE behavior is still k3-in-loss
+    (this test stays valid). But Composer-2 §4.1 chooses k1-IN-REWARD, and the
+    2025/26 evidence (arXiv:2512.21852, verl, TRL #4967) says that matters for
+    OOD generalization. ComposerReplicationTrainer now offers an opt-in
+    `kl_in_reward=True` path (composer_replication.trainer.kl_in_reward) that
+    suppresses this in-loss k3 term (beta→0 for the parent loss) and folds a k1
+    penalty into the reward/advantage instead; see
+    test_trainer_offers_k1_in_reward_path below and test_kl_in_reward.py.
     """
     # Known logprob pairs (student logp, reference logp).
     logp = torch.tensor([-1.0, -2.0, -0.5, -3.0])
@@ -121,6 +130,30 @@ def test_trl_kl_estimator_is_k3_not_k1():
     )
 
 
+def test_trainer_offers_k1_in_reward_path():
+    """The F5 #1 fix: ComposerReplicationTrainer exposes an opt-in k1-in-reward
+    path whose preconditions are validated (beta>0, scale_rewards∈{none,false},
+    estimator∈{k1,k3}). Validates the extracted precondition function without
+    needing a real GRPOTrainer (model+dataset)."""
+    from composer_replication.trainer.composer_trainer import validate_kl_in_reward_config
+
+    # Valid Composer-2 / verl config: k1 estimator, non-zero beta, no std-norm.
+    validate_kl_in_reward_config(kl_estimator="k1", beta=0.04, scale_rewards="none")
+    validate_kl_in_reward_config(kl_estimator="k3", beta=0.1, scale_rewards="false")
+
+    # beta=0 is the trap: TRL would build no ref model, so no ref logps.
+    with pytest.raises(ValueError, match="non-zero `beta`"):
+        validate_kl_in_reward_config(kl_estimator="k1", beta=0.0, scale_rewards="none")
+
+    # std-norm breaks the exact advantage-adjustment identity.
+    with pytest.raises(ValueError, match="scale_rewards in"):
+        validate_kl_in_reward_config(kl_estimator="k1", beta=0.04, scale_rewards="group")
+
+    # bad estimator
+    with pytest.raises(ValueError, match="k1.*k3|k3"):
+        validate_kl_in_reward_config(kl_estimator="k2", beta=0.04, scale_rewards="none")
+
+
 # ---------------------------------------------------------------------------
 # Gate 2 - SDPO strict-alignment guard (no real GRPOTrainer needed)
 # ---------------------------------------------------------------------------
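The `scale_rewards` precondition tested above exists because the linear shortcut stops being exact once advantages are divided by the group std. The sketch below is a small pure-Python counterexample under stated assumptions: a single group of four completions, a population std with no epsilon term (TRL's exact normalization may differ), and made-up reward and KL values. `std_normalized` is an illustrative helper, not part of this commit.

```python
import math


def std_normalized(values):
    """Group-mean baseline followed by division by the group std (one group)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


# Illustrative numbers only: one group of four completions.
rewards = [1.0, 0.0, 0.5, 2.0]
kl = [0.2, 0.6, 1.0, 3.0]
coef = 0.5

# Ground truth: fold the penalty into the reward, then normalize.
true_adv = std_normalized([r - coef * k for r, k in zip(rewards, kl)])

# Linear shortcut applied to advantages that were already std-normalized.
kl_mean = sum(kl) / len(kl)
shortcut_adv = [
    a - coef * (k - kl_mean) for a, k in zip(std_normalized(rewards), kl)
]

# Largest per-completion disagreement between the two routes.
max_gap = max(abs(t - s) for t, s in zip(true_adv, shortcut_adv))
```

Here the two routes disagree by a clearly non-negligible margin, because folding the penalty into the reward changes the group std as well as the group mean. That is why the validator rejects std-normalized configurations instead of applying the shortcut as an approximation.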
composer_replication/trainer/tests/test_kl_in_reward.py ADDED
@@ -0,0 +1,146 @@
+"""Tests for k1-in-reward KL (F5 #1 fidelity fix; Composer-2 §4.1 / verl).
+
+The load-bearing test is `test_apply_equals_fold_then_baseline`: it checks the
+advantage adjustment `adv -= coef·(KL - group_mean(KL))` is EXACTLY equal to
+folding the KL penalty into the reward and re-running GRPO's group-mean
+baseline (with no std-norm). That equivalence is the entire justification for
+adjusting advantages post-hoc instead of forking TRL's reward→advantage code.
+"""
+from __future__ import annotations
+
+import pytest
+import torch
+
+from composer_replication.trainer.kl_in_reward import (
+    apply_kl_in_reward,
+    k1_kl_penalty_per_sequence,
+    k3_kl_penalty_per_sequence,
+    kl_penalty_per_sequence,
+)
+
+
+# ---------------------------------------------------------------------
+# Per-sequence KL estimators
+# ---------------------------------------------------------------------
+
+
+def test_k1_penalty_sums_masked_logp_diff():
+    policy = torch.tensor([[0.0, -1.0, -2.0], [-0.5, -0.5, -0.5]])
+    ref = torch.tensor([[0.0, -0.5, -1.0], [-1.0, -1.0, -1.0]])
+    mask = torch.tensor([[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])  # row0 drops last token
+    out = k1_kl_penalty_per_sequence(policy, ref, mask)
+    # row0: (0-0) + (-1-(-0.5)) [+ masked 0] = -0.5
+    # row1: (-0.5-(-1.0))*3 = +1.5
+    torch.testing.assert_close(out, torch.tensor([-0.5, 1.5]))
+
+
+def test_k1_can_be_negative_k3_cannot():
+    """Structural difference: k1 is signed, k3 ≥ 0 (the whole reason they differ)."""
+    policy = torch.tensor([[0.0, 0.0]])
+    ref = torch.tensor([[1.0, 1.0]])  # ref > policy → Δ=ref-logp>0 → k1<0
+    mask = torch.ones_like(policy)
+    k1 = k1_kl_penalty_per_sequence(policy, ref, mask)
+    k3 = k3_kl_penalty_per_sequence(policy, ref, mask)
+    assert (k1 < 0).all(), "k1 = Σ(logp-ref) is negative when ref>logp"
+    assert (k3 >= -1e-6).all(), "k3 (Schulman) is always non-negative"
+
+
+def test_k3_leading_order_is_half_delta_squared():
+    """For small Δ, k3 ≈ Δ²/2 (the minor-delta claim in make_dr_grpo_config)."""
+    policy = torch.tensor([[0.0, 0.0, 0.0]])
+    ref = torch.tensor([[0.01, -0.02, 0.005]])
+    mask = torch.ones_like(policy)
+    k3 = k3_kl_penalty_per_sequence(policy, ref, mask)
+    delta = ref - policy
+    expected = (0.5 * delta**2).sum()
+    torch.testing.assert_close(k3, expected.unsqueeze(0), atol=1e-4, rtol=1e-3)
+
+
+def test_dispatch_and_unknown_estimator():
+    policy = torch.zeros(1, 2)
+    ref = torch.ones(1, 2)
+    mask = torch.ones(1, 2)
+    torch.testing.assert_close(
+        kl_penalty_per_sequence(policy, ref, mask, "k1"),
+        k1_kl_penalty_per_sequence(policy, ref, mask),
+    )
+    with pytest.raises(ValueError, match="Unknown KL estimator"):
+        kl_penalty_per_sequence(policy, ref, mask, "k2")
+
+
+def test_penalty_shape_validation():
+    with pytest.raises(ValueError, match="identical shape"):
+        k1_kl_penalty_per_sequence(torch.zeros(1, 3), torch.zeros(1, 2), torch.zeros(1, 3))
+    with pytest.raises(ValueError, match="must match"):
+        k1_kl_penalty_per_sequence(torch.zeros(1, 3), torch.zeros(1, 3), torch.zeros(1, 2))
+
+
+# ---------------------------------------------------------------------
+# apply_kl_in_reward: the advantage adjustment
+# ---------------------------------------------------------------------
+
+
+def test_apply_coef_zero_is_identity():
+    adv = torch.tensor([1.0, -1.0, 0.5, -0.5])
+    kl = torch.tensor([2.0, 3.0, 1.0, 0.0])
+    out = apply_kl_in_reward(adv, kl, num_generations=2, coef=0.0)
+    torch.testing.assert_close(out, adv)
+
+
+def test_apply_centers_kl_within_group():
+    # Two groups of 2. coef=1. adv -= (KL - group_mean(KL)).
+    adv = torch.zeros(4)
+    kl = torch.tensor([1.0, 3.0, 10.0, 20.0])
+    out = apply_kl_in_reward(adv, kl, num_generations=2, coef=1.0)
+    # group0 mean=2  → centered [-1,+1] → adv-(-1,+1)=[1,-1]
+    # group1 mean=15 → centered [-5,+5] → adv-(-5,+5)=[5,-5]
+    torch.testing.assert_close(out, torch.tensor([1.0, -1.0, 5.0, -5.0]))
+
+
+def test_apply_divisibility_validation():
+    with pytest.raises(ValueError, match="multiple of num_generations"):
+        apply_kl_in_reward(torch.zeros(5), torch.zeros(5), num_generations=2, coef=1.0)
+    with pytest.raises(ValueError, match="identical shape"):
+        apply_kl_in_reward(torch.zeros(4), torch.zeros(2), num_generations=2, coef=1.0)
+
+
+@pytest.mark.parametrize("num_generations", [2, 3, 4])
+@pytest.mark.parametrize("n_groups", [1, 2, 5])
+def test_apply_equals_fold_then_baseline(num_generations, n_groups):
+    """THE load-bearing property: adjusting baselined advantages by
+    -coef·(KL - group_mean(KL)) equals folding -coef·KL into the reward and
+    re-running GRPO's group-mean baseline (scale_rewards='none').
+
+    This checks that the post-hoc advantage adjustment IS exact k1-in-reward, not
+    an approximation: the justification for not forking TRL's scoring code.
+    """
+    torch.manual_seed(0)
+    g, k = num_generations, n_groups
+    b = g * k
+    coef = 0.137
+
+    rewards = torch.randn(b)
+    kl = torch.randn(b).abs()  # KL ≥ 0 in spirit, though sign-agnostic here
+
+    # GRPO baseline (no std-norm): adv = reward - group_mean(reward).
+    def group_baseline(x):
+        means = x.view(k, g).mean(dim=1).repeat_interleave(g)  # (b,)
+        return x - means
+
+    advantages = group_baseline(rewards)
+
+    # Reference: fold KL into reward, THEN baseline.
+    folded_reward = rewards - coef * kl
+    adv_fold_then_baseline = group_baseline(folded_reward)
+
+    # Under test: adjust the ALREADY-baselined advantages.
+    adv_adjusted = apply_kl_in_reward(advantages, kl, num_generations=g, coef=coef)
+
+    torch.testing.assert_close(adv_adjusted, adv_fold_then_baseline, atol=1e-5, rtol=1e-5)
+
+
+def test_apply_does_not_mutate_input():
+    adv = torch.tensor([1.0, 2.0])
+    adv_copy = adv.clone()
+    apply_kl_in_reward(adv, torch.tensor([0.0, 1.0]), num_generations=2, coef=1.0)
+    torch.testing.assert_close(adv, adv_copy)  # functional, not in-place
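The structural difference between the two estimators that these tests pin down can also be seen per token with plain floats. This is a minimal sketch using only the standard library; `k1_term` and `k3_term` are illustrative names that mirror the per-token formulas in `kl_in_reward.py` (with Δ = ref_logp - logp), not functions exported by it.

```python
import math


def k1_term(logp: float, ref_logp: float) -> float:
    """Per-token k1 term: logp - ref_logp (signed)."""
    return logp - ref_logp


def k3_term(logp: float, ref_logp: float) -> float:
    """Per-token k3 term: exp(delta) - delta - 1 with delta = ref_logp - logp."""
    delta = ref_logp - logp
    return math.exp(delta) - delta - 1.0


# A token where the reference assigns more probability than the policy.
k1_value = k1_term(logp=0.0, ref_logp=1.0)
k3_value = k3_term(logp=0.0, ref_logp=1.0)

# For a small gap, k3 is close to delta**2 / 2.
small_delta = 0.01
k3_small = k3_term(logp=0.0, ref_logp=small_delta)
half_delta_squared = 0.5 * small_delta**2
```

On a single token k1 can be negative while k3 cannot, so a k1 penalty can reward or punish an individual completion relative to its group, whereas k3 only ever punishes. For small gaps k3 is second-order in Δ, which matches the tolerance used in `test_k3_leading_order_is_half_delta_squared`.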
research/design-F5-fidelity-audit.md CHANGED
@@ -24,13 +24,13 @@ The good news the report stresses: the substrate for all of this already exists.
24
  | (b1) | **25× synthetic data — Feature Deletion generator** — 2.5 blog | **FULLY-REPLICATED (substrate-inversion form)** | `datagen/substrates.py::SweBenchAdapter` (revert gold patch → broken repo, FAIL_TO_PASS=reward target, license filter), `datagen/env.py::FeatureDeletionEnv`, `datagen/validator.py` (4-gate solvability), `datagen/schema.py`. | Real *generative* synthesis (manufacture novel broken states beyond SWE-bench inversion) absent; only adapts existing SWE-* instances. No "25×" scale-out generator suite. |
25
  | (b2) | **Dynamic-difficulty curriculum ("select for AND create harder tasks dynamically")** — 2.5 blog + Composer-2 §3 (keyed on #turns + thinking-tokens) | **FULLY-REPLICATED (select-for half)** | `datagen/curriculum.py::DifficultyCurriculum` — p̂(1−p̂) frontier weighting, retire >0.95, quarantine <0.02, **effort tilt on turns/think-tokens** (ADR-012 #4, matching Composer-2's exact heuristic). | **CREATE half missing**: no live escalation of deletion-span / coupling / multi-feature difficulty during the run. Curriculum scores an *existing* pool; it doesn't mint harder tasks. |
26
  | (c1) | **Dr.GRPO base objective** — Composer-2 §4.1 | **FULLY-REPLICATED** | `composer_trainer.py::make_dr_grpo_config` + `make_po_config` (PO menu: grpo/dr_grpo/bnpo/dapo/gspo/cispo, pure TRL 1.5.0 config). `loss_type="dr_grpo"`, `scale_rewards="none"`, `num_iterations=1`, drift-guard asserts. ADR-014. | None on the objective itself. |
27
- | (c2) | **k1-vs-k3 KL**: Composer-2 §4.1 explicitly chooses **k1 = −log r in *reward*** (variance argument, citing Amini et al.) | **PARTIAL (DOCUMENTED INFIDELITY)** | `composer_trainer.py:496-509` documents that TRL's `_compute_loss` uses **k3-in-loss** (`exp(Δ)−Δ−1`), NOT k1. `test_dr_grpo_config_and_alignment.py::test_trl_kl_estimator_is_k3_not_k1` pins this. Honest delta, not patched. | **The evidence says this delta matters for the "next level":** arXiv:2512.21852 + TRL #4967 + verl (k1-in-reward only) show k1-in-reward improves OOD generalization; k3-in-reward can collapse. Composer chose k1 deliberately. Fix is implementable (see Gap #1). |
28
  | (d) | **CPT → SFT → RL phase structure** — Composer-2 §3-4 (CPT loss ↓ ⇒ RL ceiling ↑, replicated on Qwen3-Coder-30B) | **PARTIAL (intentional skip + plumbing)** | Documented decision to skip CPT and start from a code-tuned base (COMPOSER_RECIPE_MAPPING.md row a; corroborated by Composer-2's own CPT→RL causal claim). Inner/outer loop split exists (datagen=outer, `ComposerReplicationTrainer`=inner). | **No SFT-first stage in code.** Report §5 prescribes "SFT-first on clean winning trajectories before RL" — there is no SFT trainer/recipe; only the RL trainer exists. CPT correctly skipped. |
29
  | (e) | **Sharded Muon + dual-mesh HSDP** (2.5 blog) / FSDP+CP+decoupled-EP, Adam (Composer-2 §6) | **MISSING (intentional, irrelevant at our scale)** | — | Correctly out of scope for dense Qwen3-{7,32}B (the mapping doc + report both say skip until MoE base). Distributed substrate is DiLoCo-over-S3, not HSDP. Note research/10 *corrects* the blog: Composer-2 uses **Adam**, not Muon, and FSDP+CP+decoupled-EP, not HSDP. |
30
  | (f) | **Anyrun production-fidelity sandboxed RL harness** (>500 pods/s, per-pod Firecracker microVM, fork/snapshot, Anygress egress proxy) — Composer-2 §6.2 | **PARTIAL** | `datagen/sandbox.py` (`Sandbox` Protocol, `LocalSubprocessSandbox`, `scrub_tree` primary control, denylist defense-in-depth), `datagen/docker_sandbox.py`, `diloco/serverless/{executor.py,eks.py,sagemaker.py,modal_spawn.py}`. | No microVM isolation (gVisor/Kata-Firecracker), no fork/snapshot, no egress proxy, no >100k-pod orchestration. The report's EKS plan (§8: gVisor default → Kata+Firecracker → container-free SWE-MiniSandbox) is design-only. `eks.py`/`sagemaker.py` are executor skeletons, not the full Anyrun analogue. |
31
  | (g) | **Reward-hacking monitoring** (2.5 blog: bytecode decompile / type-cache hacks; "agentic monitoring tools") | **FULLY-REPLICATED (defense-in-depth) + run-level guard now wired** | `datagen/monitor.py::HackMonitor` (signature + patch-provenance, obfuscation-resistant), `sandbox.py::scrub_tree` (physical cache/.git removal = "the wall"), `datagen/validator.py` (4-gate), `safety/holdout.py::HeldoutSplit` (id + content-hash disjointness), `safety/kill_switch.py::HeldOutGuard` (proxy-real Hacking-Gap + KL hard-stop), **now wired into the trainer** (`composer_trainer.py::_maybe_update_killswitch`, ADR-015, 2026-06-08). | The held-out kill-switch — the report's "most load-bearing safeguard, documented gap" — is **now CLOSED** (ADR-015). Remaining: `HackMonitor` validated only on constructed examples (report warns synthetic-hack monitors fail to generalize); offline LLM-judge monitor (EvilGenie-style) not built. |
32
- | (h) | **Aux scalar rewards (style/communication/unfinished-todo penalties)** — Composer-2 §4.2 | **MISSING** | Reward is pure test-pass-fraction (`env.py::_grade`). No auxiliary reward array. `integrations/altered_minds/reward.py` is an MMLU-format reward for ADR-013 ladder, not the Composer behavior-reward suite. | Fully specified in research/10; reproducible without the hint mystery. Build a `behavior_rewards.py` reward-fn bank. |
33
- | (i) | **Nonlinear length/effort penalty** `C_length{k,q}(x)=((1+kx)^{1−q}−1)/(k(1−q))` — Composer-2 §4.2 (exact equation) | **MISSING** | | Trivially implementable (≈30 LOC reward shaper over {thinking, tool-call, tool-output, final-msg tokens, #calls, #turns}). Induces parallel tool calls per the report. |
34
  | (j) | **Self-summarization (reward-to-all-chain-tokens)** — Composer-2 §4.1 | **MISSING** | — | The mechanism that handles 100k-token long-horizon rollouts (the regime the report says the *tree* is for). Not built. |
35
  | (k) | **MoE router replay** — Composer-2 §6.2 | **MISSING (out of scope, dense bases)** | — | Only relevant for MoE-base RL; correct to defer. |
36
 
@@ -74,9 +74,9 @@ What exists today (the *ancestor*): `teacher_replay.py` is **flat depth-1** (N t
74
 
75
  Ordered by (fidelity-leverage × cheapness), front-loading the items that move the "next-level" needle for the least build.
76
 
77
- **Tier 0 — cheap fidelity fixes the evidence says move OOD generalization (do first):**
78
- 1. **k1-in-reward KL** (Rubric A c2). Add a `kl_estimator="k1"` + `use_kl_in_reward=True` path to the trainer: compute `−log r` per token, fold into the *advantage/reward* (not the loss), set TRL `beta=0.0` to disable its k3-in-loss term. Mirror TRL issue #4967 / verl's choice. `composer_trainer.py` ~60 LOC + test flipping the pinned k3 assertion. **This is the highest-fidelity-leverage single change.**
79
- 2. **Composer-2 behavior rewards** (Rubric A h+i): `datagen/behavior_rewards.py` — the aux scalar reward array (style/communication/unfinished-todo) + the nonlinear length/effort penalty `C_length` (exact eq. in research/10), as TRL `RewardFunc`s composable with `env.reward_fn`. ~120 LOC. Reproducible *without* the hint mystery; directly targets Composer's "communication style + effort calibration" goal.
 
  **Tier 1 — close the highest-value PARTIALs:**
  3. **SDPO live-GPU smoke** (Rubric A a): instantiate `ComposerReplicationTrainer` against a real `trl.GRPOTrainer` on a small model (Qwen2.5-0.5B) on a SageMaker Training Job (g5/g6e) or HyperPod node-group — discharges the ADR-008 "never smoke-tested against real GRPOTrainer" caveat.
 
  | (b1) | **25× synthetic data — Feature Deletion generator** — 2.5 blog | **FULLY-REPLICATED (substrate-inversion form)** | `datagen/substrates.py::SweBenchAdapter` (revert gold patch → broken repo, FAIL_TO_PASS=reward target, license filter), `datagen/env.py::FeatureDeletionEnv`, `datagen/validator.py` (4-gate solvability), `datagen/schema.py`. | Real *generative* synthesis (manufacture novel broken states beyond SWE-bench inversion) absent; only adapts existing SWE-* instances. No "25×" scale-out generator suite. |
  | (b2) | **Dynamic-difficulty curriculum ("select for AND create harder tasks dynamically")** — 2.5 blog + Composer-2 §3 (keyed on #turns + thinking-tokens) | **FULLY-REPLICATED (select-for half)** | `datagen/curriculum.py::DifficultyCurriculum` — p̂(1−p̂) frontier weighting, retire >0.95, quarantine <0.02, **effort tilt on turns/think-tokens** (ADR-012 #4, matching Composer-2's exact heuristic). | **CREATE half missing**: no live escalation of deletion-span / coupling / multi-feature difficulty during the run. Curriculum scores an *existing* pool; it doesn't mint harder tasks. |
  | (c1) | **Dr.GRPO base objective** — Composer-2 §4.1 | **FULLY-REPLICATED** | `composer_trainer.py::make_dr_grpo_config` + `make_po_config` (PO menu: grpo/dr_grpo/bnpo/dapo/gspo/cispo, pure TRL 1.5.0 config). `loss_type="dr_grpo"`, `scale_rewards="none"`, `num_iterations=1`, drift-guard asserts. ADR-014. | None on the objective itself. |
+ | (c2) | **k1-vs-k3 KL** — Composer-2 §4.1 explicitly chooses **k1 = −log r in *reward*** (variance argument, citing Amini et al.) | **CLOSED (Wave 20) opt-in k1-in-reward path** | `trainer/kl_in_reward.py` (pure, tested: k1/k3 per-seq estimators + `apply_kl_in_reward` advantage adjustment) + `composer_trainer.py` (`kl_in_reward=True`/`kl_estimator=`: folds the k1 penalty into advantages at scoring, zeros TRL's in-loss k3 per step, `validate_kl_in_reward_config` guards beta>0 + scale_rewards∈{none,false}). `test_kl_in_reward.py` proves the advantage-adjustment EXACTLY equals fold-then-baseline (9 group/gen combos). TRL's native k3 still pinned by `test_trl_kl_estimator_is_k3_not_k1` (still true; now documents the override). | Live-GPU A/B (k1-in-reward vs k3-in-loss OOD) deferred to the SageMaker smoke. The math + wiring are done and CPU-verified. |
  | (d) | **CPT → SFT → RL phase structure** — Composer-2 §3-4 (CPT loss ↓ ⇒ RL ceiling ↑, replicated on Qwen3-Coder-30B) | **PARTIAL (intentional skip + plumbing)** | Documented decision to skip CPT and start from a code-tuned base (COMPOSER_RECIPE_MAPPING.md row a; corroborated by Composer-2's own CPT→RL causal claim). Inner/outer loop split exists (datagen=outer, `ComposerReplicationTrainer`=inner). | **No SFT-first stage in code.** Report §5 prescribes "SFT-first on clean winning trajectories before RL" — there is no SFT trainer/recipe; only the RL trainer exists. CPT correctly skipped. |
  | (e) | **Sharded Muon + dual-mesh HSDP** (2.5 blog) / FSDP+CP+decoupled-EP, Adam (Composer-2 §6) | **MISSING (intentional, irrelevant at our scale)** | — | Correctly out of scope for dense Qwen3-{7,32}B (the mapping doc + report both say skip until MoE base). Distributed substrate is DiLoCo-over-S3, not HSDP. Note research/10 *corrects* the blog: Composer-2 uses **Adam**, not Muon, and FSDP+CP+decoupled-EP, not HSDP. |
  | (f) | **Anyrun production-fidelity sandboxed RL harness** (>500 pods/s, per-pod Firecracker microVM, fork/snapshot, Anygress egress proxy) — Composer-2 §6.2 | **PARTIAL** | `datagen/sandbox.py` (`Sandbox` Protocol, `LocalSubprocessSandbox`, `scrub_tree` primary control, denylist defense-in-depth), `datagen/docker_sandbox.py`, `diloco/serverless/{executor.py,eks.py,sagemaker.py,modal_spawn.py}`. | No microVM isolation (gVisor/Kata-Firecracker), no fork/snapshot, no egress proxy, no >100k-pod orchestration. The report's EKS plan (§8: gVisor default → Kata+Firecracker → container-free SWE-MiniSandbox) is design-only. `eks.py`/`sagemaker.py` are executor skeletons, not the full Anyrun analogue. |
  | (g) | **Reward-hacking monitoring** (2.5 blog: bytecode decompile / type-cache hacks; "agentic monitoring tools") | **FULLY-REPLICATED (defense-in-depth) + run-level guard now wired** | `datagen/monitor.py::HackMonitor` (signature + patch-provenance, obfuscation-resistant), `sandbox.py::scrub_tree` (physical cache/.git removal = "the wall"), `datagen/validator.py` (4-gate), `safety/holdout.py::HeldoutSplit` (id + content-hash disjointness), `safety/kill_switch.py::HeldOutGuard` (proxy-real Hacking-Gap + KL hard-stop), **now wired into the trainer** (`composer_trainer.py::_maybe_update_killswitch`, ADR-015, 2026-06-08). | The held-out kill-switch — the report's "most load-bearing safeguard, documented gap" — is **now CLOSED** (ADR-015). Remaining: `HackMonitor` validated only on constructed examples (report warns synthetic-hack monitors fail to generalize); offline LLM-judge monitor (EvilGenie-style) not built. |
+ | (h) | **Aux scalar rewards (style/communication/unfinished-todo penalties)** — Composer-2 §4.2 | **CLOSED (Wave 20)** | `datagen/behavior_rewards.py`: `UnfinishedTodoPenalty`, `LeftoverCoTPenalty`, `CommunicationReward` (the exact emergent behaviors §4.2 names) as TRL `RewardFunc`s, composable with `env.reward_fn` via `reward_weights`. `default_behavior_rewards()` = the starter array. `test_behavior_rewards.py` pins the contract. | Heuristic pattern-matchers by design (the report added these reactively by monitoring); meant to grow. A judge-based communication score is the upgrade. |
+ | (i) | **Nonlinear length/effort penalty** `C_length{k,q}(x)=((1+kx)^{1−q}−1)/(k(1−q))` — Composer-2 §4.2 (exact equation) | **CLOSED (Wave 20)** | `datagen/behavior_rewards.py::c_length` (exact equation + q→1 log-limit) + `EffortWeights` (the weighted {thinking, tool-call, tool-output, final-msg tokens, #calls, #turns} combination) + `LengthEffortPenalty` (the RewardFunc). `test_behavior_rewards.py` verifies the closed form, the concave-down/increasing shape, and the edge cases. | Effort features read from a per-completion `effort_features` kwarg when the harness supplies them; char-count proxy otherwise. |
  | (j) | **Self-summarization (reward-to-all-chain-tokens)** — Composer-2 §4.1 | **MISSING** | — | The mechanism that handles 100k-token long-horizon rollouts (the regime the report says the *tree* is for). Not built. |
  | (k) | **MoE router replay** — Composer-2 §6.2 | **MISSING (out of scope, dense bases)** | — | Only relevant for MoE-base RL; correct to defer. |
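
Row (i) quotes the length/effort penalty in closed form. A minimal sketch of that equation and of a weighted effort input is given here for orientation. It is not the code in `datagen/behavior_rewards.py`: the names `weighted_effort` and `length_effort_reward`, the dict-based feature format, and the choice to return the penalty as a negative reward are all assumptions made for illustration.

```python
import math


def c_length(x: float, k: float, q: float) -> float:
    """C_length{k,q}(x) = ((1 + k*x)^(1 - q) - 1) / (k * (1 - q)).

    At q = 1 the closed form is 0/0; its limit is log(1 + k*x) / k.
    """
    if x < 0 or k <= 0:
        raise ValueError("expects x >= 0 and k > 0")
    if math.isclose(q, 1.0, abs_tol=1e-9):
        return math.log1p(k * x) / k
    return ((1.0 + k * x) ** (1.0 - q) - 1.0) / (k * (1.0 - q))


def weighted_effort(features: dict[str, float], weights: dict[str, float]) -> float:
    """Hypothetical helper: linear combination of effort features.

    Features without a weight contribute nothing.
    """
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def length_effort_reward(
    features: dict[str, float], weights: dict[str, float], k: float, q: float
) -> float:
    """Hypothetical helper: the penalty expressed as a negative reward term."""
    return -c_length(weighted_effort(features, weights), k, q)
```

With `q = 0` the penalty is linear in the effort (`c_length(x, k, 0) == x`). For `q > 0` the slope is `(1 + k*x)^(-q)`, which is positive and decreasing, so the curve is increasing and concave-down, the shape the test suite is described as checking.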
 
 
 
  Ordered by (fidelity-leverage × cheapness), front-loading the items that move the "next-level" needle for the least build.
 
+ **Tier 0 — cheap fidelity fixes the evidence says move OOD generalization — ✅ DONE (Wave 20):**
+ 1. **k1-in-reward KL** (Rubric A c2). Implemented as `trainer/kl_in_reward.py` (pure, tested math) + `composer_trainer.py` `kl_in_reward=True`/`kl_estimator=`. Keeps `beta` as the KL coefficient (TRL needs beta>0 to compute ref logps), folds the k1 penalty into advantages at scoring, and zeros TRL's in-loss k3 per step. `validate_kl_in_reward_config` enforces beta>0 + scale_rewards∈{none,false}. `test_kl_in_reward.py` proves the advantage adjustment exactly equals fold-then-baseline. **The highest-fidelity-leverage single change — shipped.**
+ 2. **Composer-2 behavior rewards** (Rubric A h+i): `datagen/behavior_rewards.py` — `c_length` (exact eq + q→1 limit), `EffortWeights`, `LengthEffortPenalty`, `UnfinishedTodoPenalty`, `LeftoverCoTPenalty`, `CommunicationReward`, `default_behavior_rewards()`, all TRL `RewardFunc`s composable with `env.reward_fn`. `test_behavior_rewards.py` pins the math + contract. Reproduced *without* the hint mystery.
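
The equivalence claimed in item 1 (adjusting advantages by `-coef·(KL - group_mean(KL))` equals folding `-coef·KL` into the reward and then re-applying the group-mean baseline) can be sketched in plain Python. This is an illustration under stated assumptions, not the contents of `trainer/kl_in_reward.py`: the function names are hypothetical, per-sequence KL is taken as a token sum, and `r` is taken as `pi_ref / pi_theta` so that k1 is `-log r` and k3 is `(r - 1) - log r`.

```python
import math


def k1_per_sequence(policy_logps: list[float], ref_logps: list[float]) -> float:
    """k1 = -log r with r = pi_ref / pi_theta, summed over tokens (assumed)."""
    return sum(lp - rlp for lp, rlp in zip(policy_logps, ref_logps))


def k3_per_sequence(policy_logps: list[float], ref_logps: list[float]) -> float:
    """k3 = (r - 1) - log r per token, summed over tokens (assumed)."""
    total = 0.0
    for lp, rlp in zip(policy_logps, ref_logps):
        log_r = rlp - lp
        total += math.exp(log_r) - log_r - 1.0
    return total


def group_mean_advantages(values: list[float], group_size: int) -> list[float]:
    """Subtract each group's mean; no std normalization (Dr.GRPO-style)."""
    out: list[float] = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        mean = sum(group) / len(group)
        out.extend(v - mean for v in group)
    return out


def adjust_advantages(
    advantages: list[float], kls: list[float], coef: float, group_size: int
) -> list[float]:
    """Apply -coef * (KL - group_mean(KL)) to already-computed advantages."""
    centered_kl = group_mean_advantages(kls, group_size)
    return [a - coef * c for a, c in zip(advantages, centered_kl)]


# Two groups of three completions each (made-up numbers).
rewards = [1.0, 0.0, 0.5, 0.25, 0.0, 1.0]
kls = [0.2, 0.1, 0.4, 0.0, 0.3, 0.05]
coef, group_size = 0.1, 3

# Path A: fold the penalty into the reward, then baseline.
folded = group_mean_advantages(
    [r - coef * kl for r, kl in zip(rewards, kls)], group_size
)
# Path B: baseline first, then adjust the advantages.
adjusted = adjust_advantages(
    group_mean_advantages(rewards, group_size), kls, coef, group_size
)
```

The two paths agree because the group mean is linear. Dividing by the group standard deviation breaks that linearity, which is consistent with the config guard requiring `scale_rewards` to be `none` or `false`.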
 
  **Tier 1 — close the highest-value PARTIALs:**
  3. **SDPO live-GPU smoke** (Rubric A a): instantiate `ComposerReplicationTrainer` against a real `trl.GRPOTrainer` on a small model (Qwen2.5-0.5B) on a SageMaker Training Job (g5/g6e) or HyperPod node-group — discharges the ADR-008 "never smoke-tested against real GRPOTrainer" caveat.