Wave 20: Tier-0 fidelity fixes — k1-in-reward KL + Composer-2 behavior rewards

Closes the two cheapest "next-level" gaps from the F5 fidelity audit
(Rubric A items c2, h, i). Both reproduce Composer-2 §4.1/§4.2 exactly,
without the hint mystery. 33 new tests; full suite 415 passed / 66 skipped.

(1) k1-in-reward KL (F5 #1, the highest-fidelity-leverage single change).
Composer-2 §4.1 chooses the k1 estimator IN THE REWARD (-log r); TRL uses
k3 IN THE LOSS. The 2025/26 evidence (arXiv:2512.21852, verl k1-only, TRL
#4967) says k1-in-reward improves OOD generalization where k3-in-reward can
collapse — exactly the "take any model to the next level" axis.
- trainer/kl_in_reward.py: pure, tested math. k1/k3 per-sequence
estimators + apply_kl_in_reward, which adjusts GRPO advantages by
-coef·(KL - group_mean(KL)). test_kl_in_reward.py verifies that this
equals folding -coef·KL into the reward and then re-running GRPO's
group-mean baseline (9 group/gen combos), which is the justification for
not forking TRL's scoring code. The identity is exact only without
std-norm (Dr.GRPO regime).
- composer_trainer.py: opt-in kl_in_reward=True / kl_estimator="k1"|"k3". Keeps
beta>0 (TRL needs it to compute ref logps) but suppresses TRL's in-loss
k3 per step (beta→0 across the parent _compute_loss) and folds the k1
penalty into advantages at scoring time. validate_kl_in_reward_config
(extracted, unit-tested) guards beta>0 + scale_rewards∈{none,false}.
OFF by default → byte-for-byte legacy. getattr-guarded so __new__-built
test stubs default to legacy.
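
A minimal sketch of the identity behind item (1), assuming a single group of completions for one prompt, the mean-only baseline (no std-norm), and a per-sequence KL obtained by summing per-token values. The helper names (`k1`, `k3`, `group_advantages`) and all numbers are illustrative; they are not the signatures or data in trainer/kl_in_reward.py.

```python
import math

def k1(logp, ref_logp):
    """k1 estimator for one sampled token: log pi - log pi_ref (can be negative)."""
    return logp - ref_logp

def k3(logp, ref_logp):
    """k3 estimator: r - 1 - log r with r = pi_ref / pi; never negative."""
    log_r = ref_logp - logp
    return math.exp(log_r) - 1.0 - log_r

def group_advantages(rewards):
    """Mean-only GRPO baseline (no std normalization)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# One group of 4 completions with 2 tokens each (hypothetical numbers).
rewards = [1.0, 0.0, 0.5, 1.0]
policy_logps = [[-0.2, -0.4], [-1.0, -0.3], [-0.5, -0.5], [-0.1, -0.9]]
ref_logps = [[-0.3, -0.5], [-0.8, -0.6], [-0.5, -0.4], [-0.4, -0.7]]
coef = 0.05

# Per-sequence k1 penalty: sum of per-token values (an assumption of this sketch).
kl = [sum(k1(p, r) for p, r in zip(ps, rs)) for ps, rs in zip(policy_logps, ref_logps)]

# Route A: fold the penalty into the reward, then take the group-mean baseline.
route_a = group_advantages([r - coef * k for r, k in zip(rewards, kl)])

# Route B: keep the original advantages and subtract the centered penalty.
kl_mean = sum(kl) / len(kl)
route_b = [a - coef * (k - kl_mean) for a, k in zip(group_advantages(rewards), kl)]

assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(route_a, route_b))
```

The two routes agree because the group-mean baseline is linear in the reward. Dividing by the group std breaks that linearity (the std of the penalized rewards differs from the std of the raw rewards), which is why the path is restricted to scale_rewards∈{none,false}.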

(2) Composer-2 behavior rewards (F5 #2, Rubric A h+i).
- datagen/behavior_rewards.py: c_length (the exact nonlinear length/effort
penalty ((1+kx)^(1-q)-1)/(k(1-q)) + q→1 log-limit), EffortWeights (the
weighted {thinking/tool-call/tool-output/final-msg tokens, #calls,
#turns} combination), LengthEffortPenalty, and the aux scalar shapers
UnfinishedTodoPenalty / LeftoverCoTPenalty / CommunicationReward — the
exact emergent behaviors §4.2 names. All TRL RewardFuncs composable with
env.reward_fn via reward_weights. default_behavior_rewards() = the
starter array. test_behavior_rewards.py pins the math + contract.
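
A rough sketch of the composed objective from item (2), assuming the per-function reward lists are combined as a plain weighted sum per completion. TRL's own combination code is not reproduced here; `combine_rewards`, `correctness`, and `length_cost` are hypothetical stand-ins used only for illustration.

```python
def combine_rewards(reward_funcs, reward_weights, prompts, completions, **kwargs):
    """Weighted per-completion sum: reward_j = sum_i w_i * f_i(...)[j]."""
    if len(reward_funcs) != len(reward_weights):
        raise ValueError("need exactly one weight per reward function")
    totals = [0.0] * len(completions)
    for func, weight in zip(reward_funcs, reward_weights):
        values = func(prompts=prompts, completions=completions, **kwargs)
        if len(values) != len(completions):
            raise ValueError("a reward function must return one float per completion")
        for j, value in enumerate(values):
            totals[j] += weight * value
    return totals

# Hypothetical stand-in for the env's correctness reward.
def correctness(prompts=None, completions=None, **kwargs):
    return [1.0 if "pass" in c else 0.0 for c in completions]

# Hypothetical stand-in for a small behavior shaper (a per-character cost).
def length_cost(prompts=None, completions=None, **kwargs):
    return [-0.001 * len(c) for c in completions]

completions = ["tests pass", "tests fail after a very long explanation"]
rewards = combine_rewards(
    [correctness, length_cost], [1.0, 1.0], ["p", "p"], completions
)
```

Keeping each shaper as its own function means a weight of 0.0 switches it off without touching the correctness term.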

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Files changed (7)

composer_replication/datagen/behavior_rewards.py +282 -0
composer_replication/datagen/tests/test_behavior_rewards.py +166 -0
composer_replication/trainer/composer_trainer.py +213 -3
composer_replication/trainer/kl_in_reward.py +182 -0
composer_replication/trainer/tests/test_dr_grpo_config_and_alignment.py +33 -0
composer_replication/trainer/tests/test_kl_in_reward.py +146 -0
research/design-F5-fidelity-audit.md +6 -6

composer_replication/datagen/behavior_rewards.py ADDED

	@@ -0,0 +1,282 @@

+"""behavior_rewards.py — Composer-2 §4.2 behavior shaping (hint-free).
+THE FIDELITY GAP (F5 Rubric A items h + i). Composer-2 shapes agent behavior
+NOT with hints (that's the unreproducible Composer-2.5 mechanism) but with an
+**array of auxiliary scalar rewards** + a **nonlinear length/effort penalty**.
+research/10 verified both against the report and gives the exact equation. None
+of it was in code; this module implements them as TRL ``RewardFunc``s
+composable with ``FeatureDeletionEnv.reward_fn`` (the test-pass-fraction reward).
+Quoting Composer-2 §4.2 (research/10-composer2-techreport-mining.md:74-77):
+  * "we apply an array of auxiliary rewards … rewards for coding style,
+    communication, and product-specific penalties for poor tool calls, such as
+    creating to-do list items and then leaving them unfinished."
+  * Nonlinear length/effort penalty:
+        C_length{k,q}(x) = ((1 + k·x)^{1−q} − 1) / (k·(1−q))
+    concave-down & increasing, x = weighted combination of {thinking tokens,
+    tool-calling tokens, tool-output tokens, final-message tokens, #tool calls,
+    #turns}. Goal: quick on easy tasks, think longer on hard ones; observed to
+    induce parallel tool calls.
+DESIGN. Every reward here is a callable matching TRL's ``RewardFunc`` contract
+``(prompts, completions, **kwargs) -> list[float]`` so it drops straight into
+``GRPOTrainer(reward_funcs=...)`` alongside the env's correctness reward. They are
+deliberately SIGNAL-ONLY shapers (small magnitudes, penalties ≤ 0): the
+correctness reward must dominate, and these never gate it. TRL combines multiple
+reward funcs as a weighted sum (``GRPOConfig.reward_weights``), so the composed
+objective is ``reward = correctness + Σ_i w_i · behavior_i``, i.e. Composer-2's
+"array of auxiliary rewards" added to the primary signal.
+The length penalty is the substantive piece; the aux rewards are heuristic
+pattern-matchers over the completion text (unfinished to-dos, leftover CoT in
+code comments, and final-message communication, the behaviors this module's
+three shapers target). They are intentionally simple and overridable.
+"""
+from __future__ import annotations
+import re
+from collections.abc import Callable, Mapping
+from dataclasses import dataclass, field
+#: A TRL reward function: (prompts, completions, **kwargs) -> per-completion floats.
+RewardFunc = Callable[..., list[float]]
+# ---------------------------------------------------------------------
+# (i) Nonlinear length / effort penalty — C_length{k,q}(x)
+# ---------------------------------------------------------------------
+def c_length(x: float, k: float, q: float) -> float:
+    """Composer-2 nonlinear length/effort penalty C_length{k,q}(x).
+        C_length(x) = ((1 + k·x)^{1−q} − 1) / (k·(1−q))
+    Concave-down & increasing in x for k>0, q>1 (the report's regime): the
+    marginal penalty for more effort SHRINKS as effort grows, so the model is
+    pushed to be quick on easy tasks while still allowed to think longer on hard
+    ones (cheap incremental effort once already deep in a hard task).
+    Args:
+        x: the (non-negative) effort scalar — a weighted token/turn combination.
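+        k: curvature hyperparameter (>0). The slope at x=0 is 1 for every k;
+            for q>0 a larger k makes the marginal penalty decay faster, so the
+            total penalty at a given x is smaller.
+        q: curvature hyperparameter. Any q>0 is increasing and concave-down;
+            q>1 also caps the penalty at 1/(k·(q−1)); q=1 is the log limit
+            (handled below); q=0 is linear (C=x); q<0 is convex.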
+    Returns:
+        The penalty magnitude (≥ 0 for x≥0, k>0). Callers negate it to subtract
+        from reward.
+    Edge cases:
+        * x = 0 → 0 (no effort, no penalty).
+        * q = 1 → the limit is log(1 + k·x)/k (L'Hôpital on the (1−q) factor),
+          which we return directly to avoid a 0/0.
+    """
+    if x < 0:
+        raise ValueError(f"effort scalar x must be non-negative, got {x}")
+    if k <= 0:
+        raise ValueError(f"curvature k must be positive, got {k}")
+    base = 1.0 + k * x
+    if abs(1.0 - q) < 1e-9:
+        # Limit q→1: ((1+kx)^{1-q}-1)/(k(1-q)) → log(1+kx)/k.
+        import math
+        return math.log(base) / k
+    return (base ** (1.0 - q) - 1.0) / (k * (1.0 - q))
+@dataclass(frozen=True)
+class EffortWeights:
+    """Weights for the effort scalar x (Composer-2's weighted combination).
+    x = (w_think·thinking_tokens + w_tool_call·tool_calling_tokens
+         + w_tool_out·tool_output_tokens + w_final·final_message_tokens
+         + w_calls·num_tool_calls + w_turns·num_turns).
+    Defaults weight raw token counts at 1.0 (per-token) and the discrete
+    call/turn counts more heavily, since each call/turn carries fixed overhead
+    the report wanted to discourage (it observed the penalty inducing PARALLEL
+    tool calls — fewer turns for the same work).
+    """
+    think: float = 1.0
+    tool_call: float = 1.0
+    tool_output: float = 1.0
+    final_message: float = 1.0
+    num_calls: float = 50.0
+    num_turns: float = 100.0
+    def effort(self, features: Mapping[str, float]) -> float:
+        """Compute x from a per-completion feature dict (missing keys → 0)."""
+        return (
+            self.think * features.get("thinking_tokens", 0.0)
+            + self.tool_call * features.get("tool_calling_tokens", 0.0)
+            + self.tool_output * features.get("tool_output_tokens", 0.0)
+            + self.final_message * features.get("final_message_tokens", 0.0)
+            + self.num_calls * features.get("num_tool_calls", 0.0)
+            + self.num_turns * features.get("num_turns", 0.0)
+        )
+@dataclass(frozen=True)
+class LengthEffortPenalty:
+    """TRL RewardFunc applying ``-coef · C_length{k,q}(x)`` per completion.
+    The effort scalar x is read from a per-completion ``effort_features`` kwarg
+    (a list of feature dicts, one per completion) when the agent harness can
+    supply structured token/turn counts; otherwise it falls back to a crude
+    proxy = completion character count / ``chars_per_token`` so the penalty is
+    never silently zero.
+    Penalty is ≤ 0 (a cost). ``coef`` scales it into reward units; keep it small
+    so correctness dominates.
+    """
+    k: float = 0.01
+    q: float = 2.0
+    coef: float = 0.1
+    weights: EffortWeights = field(default_factory=EffortWeights)
+    chars_per_token: float = 4.0
+    def __call__(
+        self,
+        prompts: list[str] | None = None,
+        completions: list[str] | None = None,
+        *,
+        effort_features: list[Mapping[str, float]] | None = None,
+        **kwargs: object,
+    ) -> list[float]:
+        completions = completions or []
+        out: list[float] = []
+        for i, comp in enumerate(completions):
+            if effort_features is not None and i < len(effort_features):
+                x = self.weights.effort(effort_features[i])
+            else:
+                # Proxy when structured features aren't wired: token-count estimate.
+                x = len(comp) / self.chars_per_token
+            out.append(-self.coef * c_length(x, self.k, self.q))
+        return out
+# ---------------------------------------------------------------------
+# (h) Auxiliary scalar rewards — coding style / communication / poor tool use
+# ---------------------------------------------------------------------
+# Heuristic patterns for the emergent behaviors Composer-2 §4.2 names. Simple
+# and overridable by design; the report itself added these REACTIVELY by
+# monitoring, so they're meant to grow, not be exhaustive on day one.
+_TODO_CREATE = re.compile(r"\b(?:add|create|TODO|todo)[\s_-]*(?:item|list)?\b", re.I)
+_TODO_DONE = re.compile(r"\b(?:done|completed|finished|resolved|checked off)\b", re.I)
+# Long chain-of-thought left as a code comment (an explicitly-penalized emergent
+# behavior): a comment line that's prose-long rather than a terse code note.
+_LONG_COMMENT = re.compile(r"^\s*(?:#|//)\s?.{120,}$", re.M)
+@dataclass(frozen=True)
+class UnfinishedTodoPenalty:
+    """Penalize creating to-do items then leaving them unfinished (the report's
+    explicit "poor tool call" example). Heuristic: penalty scales with
+    max(0, #todo-creations − #todo-completions)."""
+    coef: float = 0.05
+    def __call__(
+        self, prompts: list[str] | None = None, completions: list[str] | None = None,
+        **kwargs: object,
+    ) -> list[float]:
+        out: list[float] = []
+        for comp in completions or []:
+            created = len(_TODO_CREATE.findall(comp))
+            done = len(_TODO_DONE.findall(comp))
+            unfinished = max(0, created - done)
+            out.append(-self.coef * unfinished)
+        return out
+@dataclass(frozen=True)
+class LeftoverCoTPenalty:
+    """Penalize leaving long chain-of-thought in code comments (an emergent
+    behavior the report observed and reactively penalized). Heuristic: penalty
+    per over-long comment line."""
+    coef: float = 0.05
+    def __call__(
+        self, prompts: list[str] | None = None, completions: list[str] | None = None,
+        **kwargs: object,
+    ) -> list[float]:
+        return [
+            -self.coef * len(_LONG_COMMENT.findall(comp))
+            for comp in (completions or [])
+        ]
+@dataclass(frozen=True)
+class CommunicationReward:
+    """Small POSITIVE reward for a concise, present final message (the report's
+    "communication" reward). Heuristic: reward presence of a non-empty final
+    message under a soft length cap; penalize empty or runaway-long finals.
+    Composer-2 rewards good communication; we approximate "good" as "a final
+    message exists and is reasonably concise" — the cheapest non-degenerate
+    proxy. Override for a real judge-based communication score.
+    """
+    coef: float = 0.02
+    soft_cap_chars: int = 2000
+    def __call__(
+        self, prompts: list[str] | None = None, completions: list[str] | None = None,
+        *, final_messages: list[str] | None = None, **kwargs: object,
+    ) -> list[float]:
+        completions = completions or []
+        out: list[float] = []
+        for i, comp in enumerate(completions):
+            msg = (
+                final_messages[i]
+                if final_messages is not None and i < len(final_messages)
+                else comp
+            )
+            n = len(msg.strip())
+            if n == 0:
+                out.append(-self.coef)  # no communication at all
+            elif n <= self.soft_cap_chars:
+                out.append(self.coef)  # present + concise
+            else:
+                # present but runaway-long: decays back toward 0 then negative.
+                overflow = (n - self.soft_cap_chars) / self.soft_cap_chars
+                out.append(self.coef * max(-1.0, 1.0 - overflow))
+        return out
+def default_behavior_rewards() -> list[RewardFunc]:
+    """The Composer-2 §4.2 starter array, ready to extend with reward_weights.
+    Returns the length/effort penalty + the three aux shapers. Compose with the
+    env correctness reward, e.g.::
+        reward_funcs = [env.reward_fn, *default_behavior_rewards()]
+        reward_weights = [1.0, 1.0, 1.0, 1.0, 1.0]   # correctness dominates
+    The report adds these REACTIVELY by monitoring emergent behaviors, so treat
+    this as the seed set, not the final list.
+    """
+    return [
+        LengthEffortPenalty(),
+        UnfinishedTodoPenalty(),
+        LeftoverCoTPenalty(),
+        CommunicationReward(),
+    ]
+__all__ = [
+    "RewardFunc",
+    "c_length",
+    "EffortWeights",
+    "LengthEffortPenalty",
+    "UnfinishedTodoPenalty",
+    "LeftoverCoTPenalty",
+    "CommunicationReward",
+    "default_behavior_rewards",
+]
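
A short numeric check of the C_length shape, offered as a sketch rather than part of the diff. The function body restates the formula from behavior_rewards.py (without its input validation) so the snippet runs standalone; the sample points are arbitrary.

```python
import math

def c_length(x, k, q):
    """Restates the diff's formula so this snippet is self-contained."""
    base = 1.0 + k * x
    if abs(1.0 - q) < 1e-9:
        return math.log(base) / k
    return (base ** (1.0 - q) - 1.0) / (k * (1.0 - q))

k, q = 0.01, 2.0  # the LengthEffortPenalty defaults

# For q = 2 the formula simplifies to x / (1 + k*x).
for x in (0.0, 100.0, 1_000.0, 100_000.0):
    assert math.isclose(c_length(x, k, q), x / (1.0 + k * x),
                        rel_tol=1e-9, abs_tol=1e-12)

# q > 1 saturates at 1 / (k * (q - 1)), so the penalty is bounded.
bound = 1.0 / (k * (q - 1.0))
assert c_length(1e9, k, q) < bound

# 0 < q < 1 is still concave (increments keep shrinking), only unbounded.
half = [c_length(x, k, 0.5) for x in (0.0, 100.0, 200.0, 300.0)]
steps = [b - a for a, b in zip(half, half[1:])]
assert steps[0] > steps[1] > steps[2] > 0

# q = 0 is the linear case C(x) = x.
assert math.isclose(c_length(250.0, k, 0.0), 250.0)
```

With the defaults (k=0.01, q=2.0, coef=0.1) the bound is 100, so the LengthEffortPenalty output lies in (-10, 0]. That range may be worth weighing against a test-pass-fraction reward in [0, 1] when choosing coef or reward_weights.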

composer_replication/datagen/tests/test_behavior_rewards.py ADDED

	@@ -0,0 +1,166 @@

+"""Tests for Composer-2 §4.2 behavior-shaping rewards (F5 #2 fidelity fix).
+The substantive coverage is the C_length math (the exact report equation +
+its concave-down/increasing shape + edge cases) and the TRL RewardFunc
+contract (one float per completion, penalties ≤ 0).
+"""
+from __future__ import annotations
+import math
+import pytest
+from composer_replication.datagen.behavior_rewards import (
+    CommunicationReward,
+    EffortWeights,
+    LeftoverCoTPenalty,
+    LengthEffortPenalty,
+    UnfinishedTodoPenalty,
+    c_length,
+    default_behavior_rewards,
+)
+# ---------------------------------------------------------------------
+# C_length{k,q}(x) — the exact equation
+# ---------------------------------------------------------------------
+def test_c_length_matches_closed_form():
+    k, q, x = 0.01, 2.0, 500.0
+    expected = ((1 + k * x) ** (1 - q) - 1) / (k * (1 - q))
+    assert c_length(x, k, q) == pytest.approx(expected)
+def test_c_length_zero_effort_is_zero():
+    assert c_length(0.0, k=0.01, q=2.0) == pytest.approx(0.0)
+def test_c_length_increasing_and_concave_down():
+    """q>1, k>0: increasing in x, with SHRINKING marginal penalty (concave-down)."""
+    k, q = 0.01, 2.0
+    xs = [0.0, 100.0, 200.0, 300.0, 400.0]
+    vals = [c_length(x, k, q) for x in xs]
+    diffs = [vals[i + 1] - vals[i] for i in range(len(vals) - 1)]
+    # strictly increasing
+    assert all(d > 0 for d in diffs), f"not increasing: {vals}"
+    # concave-down: each marginal increment smaller than the previous
+    assert all(diffs[i + 1] < diffs[i] for i in range(len(diffs) - 1)), (
+        f"not concave-down (marginal penalty should shrink): {diffs}"
+    )
+def test_c_length_q_equals_one_limit():
+    """q→1 limit is log(1+kx)/k (avoids the 0/0 in the closed form)."""
+    k, x = 0.02, 300.0
+    assert c_length(x, k, q=1.0) == pytest.approx(math.log(1 + k * x) / k)
+    # continuity: q just off 1 should be very close to the limit.
+    near = c_length(x, k, q=1.0 + 1e-7)
+    assert near == pytest.approx(math.log(1 + k * x) / k, rel=1e-3)
+def test_c_length_validates_inputs():
+    with pytest.raises(ValueError, match="non-negative"):
+        c_length(-1.0, k=0.01, q=2.0)
+    with pytest.raises(ValueError, match="k must be positive"):
+        c_length(1.0, k=0.0, q=2.0)
+# ---------------------------------------------------------------------
+# EffortWeights
+# ---------------------------------------------------------------------
+def test_effort_weights_weighted_combination():
+    w = EffortWeights(think=1.0, tool_call=1.0, tool_output=1.0, final_message=1.0,
+                      num_calls=50.0, num_turns=100.0)
+    feats = {
+        "thinking_tokens": 100, "tool_calling_tokens": 50,
+        "tool_output_tokens": 200, "final_message_tokens": 30,
+        "num_tool_calls": 4, "num_turns": 3,
+    }
+    # 100+50+200+30 + 50*4 + 100*3 = 380 + 200 + 300 = 880
+    assert w.effort(feats) == pytest.approx(880.0)
+def test_effort_weights_missing_keys_are_zero():
+    assert EffortWeights().effort({}) == pytest.approx(0.0)
+# ---------------------------------------------------------------------
+# LengthEffortPenalty — the RewardFunc
+# ---------------------------------------------------------------------
+def test_length_penalty_one_value_per_completion_and_nonpositive():
+    rf = LengthEffortPenalty(coef=0.1)
+    comps = ["short", "a much longer completion " * 50, ""]
+    out = rf(prompts=None, completions=comps)
+    assert len(out) == len(comps)
+    assert all(v <= 0 for v in out), f"length penalty must be ≤ 0: {out}"
+def test_length_penalty_longer_is_more_penalized():
+    rf = LengthEffortPenalty(coef=0.1)
+    short, long = "x" * 100, "x" * 10000
+    out = rf(completions=[short, long])
+    assert out[1] < out[0], "longer completion must incur a larger (more negative) penalty"
+def test_length_penalty_uses_structured_features_when_present():
+    rf = LengthEffortPenalty(coef=1.0, k=0.01, q=2.0)
+    # Same short text, but big structured effort → penalty driven by features.
+    feats = [{"num_turns": 10, "thinking_tokens": 5000}]
+    out_feat = rf(completions=["ok"], effort_features=feats)
+    out_proxy = rf(completions=["ok"])
+    assert out_feat[0] < out_proxy[0], "structured high-effort must out-penalize the char proxy"
+# ---------------------------------------------------------------------
+# Auxiliary scalar rewards
+# ---------------------------------------------------------------------
+def test_unfinished_todo_penalty():
+    rf = UnfinishedTodoPenalty(coef=0.05)
+    # Heuristic counts create-like and done-like tokens; penalty scales with
+    # max(0, creates - dones). More creates than completions → penalized.
+    unbalanced = "TODO: add item one. create todo item two."  # several create tokens, 0 done
+    over_done = "Done. completed. finished. resolved."  # dones ≥ creates → no penalty
+    out = rf(completions=[unbalanced, over_done, "no todos here"])
+    assert out[0] < 0, "more creations than completions must be penalized"
+    assert out[1] == pytest.approx(0.0), "completions ≥ creations → no penalty"
+    assert out[2] == pytest.approx(0.0), "no todo activity → no penalty"
+def test_leftover_cot_penalty():
+    rf = LeftoverCoTPenalty(coef=0.05)
+    long_comment = "# " + "this is a very long chain of thought left in a comment " * 5
+    clean = "# terse note\nx = 1"
+    out = rf(completions=[long_comment + "\n" + "code()", clean])
+    assert out[0] < 0, "long CoT comment must be penalized"
+    assert out[1] == pytest.approx(0.0)
+def test_communication_reward_concise_present_vs_empty():
+    rf = CommunicationReward(coef=0.02, soft_cap_chars=2000)
+    out = rf(completions=["Fixed the bug by adding a null check.", ""])
+    assert out[0] == pytest.approx(0.02), "concise present final → +coef"
+    assert out[1] == pytest.approx(-0.02), "empty final → -coef"
+def test_communication_reward_runaway_long_decays():
+    rf = CommunicationReward(coef=0.02, soft_cap_chars=100)
+    out = rf(completions=["x" * 1000])  # 9x over cap
+    assert out[0] < 0.02, "runaway-long final should decay below the concise reward"
+def test_default_array_contract():
+    """All default behavior rewards honor the TRL RewardFunc contract."""
+    funcs = default_behavior_rewards()
+    assert len(funcs) == 4
+    comps = ["completion one", "TODO: add item, never done", "x" * 5000]
+    for f in funcs:
+        out = f(prompts=["p"] * len(comps), completions=comps)
+        assert len(out) == len(comps), f"{type(f).__name__} returned wrong count"
+        assert all(isinstance(v, float) for v in out)
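
The runaway-long test above only asserts that the reward falls below +coef. A worked sketch of the piecewise rule, restating CommunicationReward's per-message logic for a stripped character count; `communication_value` is a hypothetical helper, not a function in the module.

```python
def communication_value(n_chars, coef=0.02, soft_cap_chars=100):
    """Per-message value for a stripped length, mirroring CommunicationReward."""
    if n_chars == 0:
        return -coef                      # empty final message
    if n_chars <= soft_cap_chars:
        return coef                       # present and concise
    overflow = (n_chars - soft_cap_chars) / soft_cap_chars
    return coef * max(-1.0, 1.0 - overflow)  # linear decay, floored at -coef

values = {n: communication_value(n) for n in (0, 100, 150, 200, 300, 1000)}
```

Under this rule the reward reaches 0 at twice the cap and hits the -coef floor at three times the cap, so the 1000-character case in the test scores the same as an empty message.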

composer_replication/trainer/composer_trainer.py CHANGED

@@ -47,6 +47,10 @@ except ImportError:  # pragma: no cover — only hit in unit-test stubs without
     _TRL_AVAILABLE = False
 from composer_replication.opsd import generalized_jsd_loss
 logger = logging.getLogger(__name__)
@@ -67,6 +71,22 @@ class ComposerReplicationTrainer(GRPOTrainer):  # type: ignore[misc, valid-type]
         sdpo_temperature: temperature for SDPO loss; SDPO paper uses 1.0.
         sdpo_token_clip: per-token JSD clip for stability; None = no clip.
         replay_dpo_beta: beta param of the DPO loss (β in the standard DPO formula).
         heldout_guard: optional ``HeldOutGuard`` (the #2 collapse safeguard from
             ``composer_replication.safety``). Default None = OFF (no behavior
             change whatsoever). When supplied, the trainer folds one checkpoint's
@@ -98,6 +118,8 @@ class ComposerReplicationTrainer(GRPOTrainer):  # type: ignore[misc, valid-type]
         sdpo_token_clip: float | None = None,
         replay_dpo_beta: float = 0.1,
         strict_sdpo_alignment: bool = True,
         heldout_guard: HeldOutGuard | None = None,
         heldout_eval_fn: Callable[[], float] | None = None,
         strict_killswitch: bool = True,
@@ -121,6 +143,19 @@ class ComposerReplicationTrainer(GRPOTrainer):  # type: ignore[misc, valid-type]
         # trust-gap flagged in ADR-008). Set False only for production runs
         # where a single malformed batch should warn-and-skip rather than abort.
         self.strict_sdpo_alignment = strict_sdpo_alignment
         # --- run-level collapse kill-switch (#2 safeguard) -------------------
         # OPTIONAL + OFF BY DEFAULT: when heldout_guard is None the loss path is
         # byte-for-byte the legacy behavior. When set, _maybe_update_killswitch
@@ -141,14 +176,85 @@ class ComposerReplicationTrainer(GRPOTrainer):  # type: ignore[misc, valid-type]
     # Loss override (the integration core)
     # ----------------------------------------------------------------------
     def _compute_loss(
         self,
         model: torch.nn.Module,
         inputs: dict[str, torch.Tensor],
     ) -> torch.Tensor:
-        """Override: total_loss = grpo + α*sdpo + β*replay."""
-        # Channel 1: standard GRPO loss
-        grpo_loss = super()._compute_loss(model, inputs)
         # Channel 2: SDPO hint-distill at error sites
         sdpo_kl = self._compute_sdpo_loss(model, inputs)
@@ -177,6 +283,64 @@ class ComposerReplicationTrainer(GRPOTrainer):  # type: ignore[misc, valid-type]
         return total
     # ----------------------------------------------------------------------
     # Run-level collapse kill-switch (#2 safeguard) — optional, OFF by default
     # ----------------------------------------------------------------------
@@ -480,6 +644,51 @@ def _device_of(model: torch.nn.Module) -> torch.device:
     return next(model.parameters()).device
 def make_dr_grpo_config(**overrides: Any):
     """Build a `trl.GRPOConfig` configured to the **Dr. GRPO** recipe.
@@ -685,4 +894,5 @@ __all__ = [
     "make_dr_grpo_config",
     "make_po_config",
     "PO_OBJECTIVES",
 ]

     _TRL_AVAILABLE = False
 from composer_replication.opsd import generalized_jsd_loss
+from composer_replication.trainer.kl_in_reward import (
+    apply_kl_in_reward,
+    kl_penalty_per_sequence,
+)
 logger = logging.getLogger(__name__)
         sdpo_temperature: temperature for SDPO loss; SDPO paper uses 1.0.
         sdpo_token_clip: per-token JSD clip for stability; None = no clip.
         replay_dpo_beta: beta param of the DPO loss (β in the standard DPO formula).
+        kl_in_reward: when True, apply the KL-to-reference penalty in the
+            **reward** (Composer-2 §4.1 / verl choice) instead of TRL's native
+            **in-loss** k3 term. The penalty is folded into GRPO's advantages at
+            scoring time (``adv -= beta·(KL - group_mean(KL))``) and TRL's
+            in-loss KL is suppressed for that step. The F5 audit's #1 fidelity
+            fix: the 2025/26 evidence (arXiv:2512.21852, verl, TRL #4967) shows
+            k1-in-reward improves OOD generalization where k3-in-reward can
+            collapse. REQUIRES ``beta>0`` (the KL coefficient — also how TRL
+            decides to compute reference logprobs) and ``scale_rewards`` in
+            {none,false} (the advantage-adjustment identity is exact only
+            without std-normalization — the Dr.GRPO / Composer regime). Default
+            False = TRL's native in-loss KL, byte-for-byte legacy behavior.
+        kl_estimator: ``"k1"`` (default; ``logp - ref_logp``, the Composer-2 /
+            verl choice this path exists for) or ``"k3"`` (Schulman; lets an
+            experiment A/B k1-in-reward vs k3-in-reward). Only consulted when
+            ``kl_in_reward=True``.
         heldout_guard: optional ``HeldOutGuard`` (the #2 collapse safeguard from
             ``composer_replication.safety``). Default None = OFF (no behavior
             change whatsoever). When supplied, the trainer folds one checkpoint's
         sdpo_token_clip: float | None = None,
         replay_dpo_beta: float = 0.1,
         strict_sdpo_alignment: bool = True,
+        kl_in_reward: bool = False,
+        kl_estimator: str = "k1",
         heldout_guard: HeldOutGuard | None = None,
         heldout_eval_fn: Callable[[], float] | None = None,
         strict_killswitch: bool = True,
         # trust-gap flagged in ADR-008). Set False only for production runs
         # where a single malformed batch should warn-and-skip rather than abort.
         self.strict_sdpo_alignment = strict_sdpo_alignment
+        # --- k1-in-reward KL (F5 #1 fidelity fix; Composer-2 §4.1 / verl) ----
+        # OFF by default → TRL's native in-loss k3 KL, byte-for-byte legacy.
+        # When ON we keep self.beta as the KL coef (TRL needs beta>0 to even
+        # create the ref model + compute ref logps), fold the k1 penalty into
+        # advantages during scoring, and zero TRL's in-loss KL per step.
+        self.kl_in_reward = kl_in_reward
+        self.kl_estimator = kl_estimator
+        if kl_in_reward:
+            validate_kl_in_reward_config(
+                kl_estimator=kl_estimator,
+                beta=float(getattr(self.args, "beta", 0.0)),
+                scale_rewards=getattr(self.args, "scale_rewards", "group"),
+            )
         # --- run-level collapse kill-switch (#2 safeguard) -------------------
         # OPTIONAL + OFF BY DEFAULT: when heldout_guard is None the loss path is
         # byte-for-byte the legacy behavior. When set, _maybe_update_killswitch
     # Loss override (the integration core)
     # ----------------------------------------------------------------------
+    # ----------------------------------------------------------------------
+    # k1-in-reward: fold the KL penalty into advantages at scoring time, and
+    # suppress TRL's native in-loss k3 KL inside _compute_loss.
+    # ----------------------------------------------------------------------
+    def _generate_and_score_completions(
+        self,
+        inputs: list[dict[str, Any]],
+    ) -> dict[str, Any]:
+        """Override: after TRL scores completions, fold a k1 KL penalty into the
+        advantages (Composer-2 in-reward KL) when ``kl_in_reward`` is set.
+        No-op (exact legacy path) when ``kl_in_reward`` is False. When set, TRL
+        has already computed ``advantages``, ``ref_per_token_logps`` (because
+        ``beta>0``), and the completion logprobs; we recompute the per-sequence
+        k1 penalty and apply the exact group-mean-baseline correction.
+        """
+        output = super()._generate_and_score_completions(inputs)
+        if not getattr(self, "kl_in_reward", False):
+            return output
+        ref_logps = output.get("ref_per_token_logps")
+        # The "old" (sampling-time) policy logps are TRL's in-loss π term; they
+        # may be lazily None when generation/optimization are aligned and not
+        # vLLM (see TRL _compute_loss: old := per_token_logps.detach()). In that
+        # aligned case we cannot read π logps here, so we defer to _compute_loss
+        # (which always has per_token_logps) by marking the penalty as pending
+        # via ``_kl_in_reward_applied``.
+        old_logps = output.get("old_per_token_logps")
+        completion_mask = output.get("completion_mask")
+        if ref_logps is None or completion_mask is None:
+            # beta>0 guarantees ref_logps; this branch only trips on a TRL
+            # internals change — fail loud rather than silently skip the penalty.
+            raise RuntimeError(
+                "kl_in_reward=True but TRL did not return ref_per_token_logps / "
+                "completion_mask from scoring (beta>0 should guarantee them). "
+                "TRL internals may have changed; re-verify the in-reward path."
+            )
+        if old_logps is not None:
+            penalty = kl_penalty_per_sequence(
+                policy_logps=old_logps,
+                ref_logps=ref_logps,
+                completion_mask=completion_mask,
+                estimator=self.kl_estimator,
+            )
+            output["advantages"] = apply_kl_in_reward(
+                advantages=output["advantages"],
+                kl_penalty=penalty,
+                num_generations=self.num_generations,
+                coef=float(self.args.beta),
+            )
+            output["_kl_in_reward_applied"] = torch.tensor(True)
+        else:
+            # Aligned non-vLLM case: π logps materialize only in _compute_loss.
+            # ref logps + mask already travel in ``output``; flag the penalty as
+            # not yet applied so _compute_loss applies it there.
+            output["_kl_in_reward_applied"] = torch.tensor(False)
+        return output
     def _compute_loss(
         self,
         model: torch.nn.Module,
         inputs: dict[str, torch.Tensor],
     ) -> torch.Tensor:
+        """Override: total_loss = grpo + α*sdpo + β*replay.
+        When ``kl_in_reward`` is set, TRL's native in-loss KL term (gated on
+        ``self.beta``) is suppressed by temporarily zeroing ``self.beta`` for the
+        duration of the parent call — the KL has already been (or is about to be)
+        accounted for in the reward/advantage, so double-counting it in the loss
+        would be wrong. ``self.beta`` is restored in ``finally``.
+        """
+        # Channel 1: standard GRPO loss. ``getattr`` (not ``self.kl_in_reward``)
+        # so an instance built via ``__new__`` + manual wiring (the SDPO /
+        # kill-switch unit-test pattern that skips __init__) defaults to the
+        # legacy path instead of raising AttributeError.
+        if getattr(self, "kl_in_reward", False):
+            grpo_loss = self._grpo_loss_kl_in_reward(model, inputs)
+        else:
+            grpo_loss = super()._compute_loss(model, inputs)
         # Channel 2: SDPO hint-distill at error sites
         sdpo_kl = self._compute_sdpo_loss(model, inputs)
         return total
+    def _grpo_loss_kl_in_reward(
+        self,
+        model: torch.nn.Module,
+        inputs: dict[str, torch.Tensor],
+    ) -> torch.Tensor:
+        """GRPO loss with the KL applied in the reward, not the loss.
+        Two responsibilities:
+          1. Suppress TRL's native in-loss k3 KL term for this step by zeroing
+             ``self.beta`` across the parent ``_compute_loss`` call (restored in
+             ``finally``). ``self.beta`` gates the in-loss KL add (TRL
+             ``_compute_loss``: ``if self.beta != 0.0: per_token_loss += beta*kl``).
+          2. Handle the deferred case: when generation/optimization are aligned
+             and not using vLLM, the sampling-time policy logps are None at
+             scoring time, so ``_generate_and_score_completions`` could not fold
+             the penalty into advantages. Here ``per_token_logps`` is available,
+             so we apply the exact same advantage correction in-place on
+             ``inputs["advantages"]`` BEFORE the parent computes the surrogate.
+        """
+        # Deferred-penalty path: advantages not yet KL-adjusted (aligned, no vLLM).
+        applied = inputs.get("_kl_in_reward_applied")
+        already_applied = bool(applied.item()) if applied is not None else False
+        if not already_applied and "ref_per_token_logps" in inputs:
+            with torch.no_grad():
+                prompt_ids, completion_ids = inputs["prompt_ids"], inputs["completion_ids"]
+                completion_mask = inputs["completion_mask"]
+                input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
+                attention_mask = torch.cat([inputs["prompt_mask"], completion_mask], dim=1)
+                logits_to_keep = completion_ids.size(1)
+                policy_logps, _ = self._get_per_token_logps_and_entropies(
+                    model, input_ids, attention_mask, logits_to_keep
+                )
+                penalty = kl_penalty_per_sequence(
+                    policy_logps=policy_logps,
+                    ref_logps=inputs["ref_per_token_logps"],
+                    completion_mask=completion_mask,
+                    estimator=self.kl_estimator,
+                )
+                advantages = inputs["advantages"]
+                # advantages may be (B,) or (B,1) — squeeze for the penalty math,
+                # restore the original shape after.
+                adv_flat = advantages.reshape(advantages.shape[0])
+                adj = apply_kl_in_reward(
+                    advantages=adv_flat,
+                    kl_penalty=penalty,
+                    num_generations=self.num_generations,
+                    coef=float(self.args.beta),
+                )
+                inputs["advantages"] = adj.reshape(advantages.shape)
+        # Suppress TRL's in-loss KL: zero beta for the parent call, restore after.
+        saved_beta = self.beta
+        try:
+            self.beta = 0.0
+            return super()._compute_loss(model, inputs)
+        finally:
+            self.beta = saved_beta
     # ----------------------------------------------------------------------
     # Run-level collapse kill-switch (#2 safeguard) — optional, OFF by default
     # ----------------------------------------------------------------------
     return next(model.parameters()).device
+def validate_kl_in_reward_config(
+    *,
+    kl_estimator: str,
+    beta: float,
+    scale_rewards: Any,
+) -> None:
+    """Validate the (kl_estimator, beta, scale_rewards) combo for k1-in-reward.
+    Extracted so the preconditions are unit-testable without standing up a real
+    GRPOTrainer (which needs a model + dataset). Raises ``ValueError`` on any
+    invalid combination; returns None when the config is sound.
+    Preconditions (see ``kl_in_reward.py`` for the algebra):
+      * ``kl_estimator`` in {k1, k3}.
+      * ``beta != 0`` — TRL only builds the reference model and computes ref
+        logprobs when beta>0, and the in-reward penalty needs ref logps. beta
+        doubles as the in-reward KL coefficient (the in-loss k3 term is
+        suppressed per step).
+      * ``scale_rewards`` in {none, false} — the advantage-adjustment identity
+        is exact only without per-group std-normalization (the Dr.GRPO /
+        Composer regime).
+    """
+    if kl_estimator not in ("k1", "k3"):
+        raise ValueError(f"kl_estimator must be 'k1' or 'k3', got {kl_estimator!r}.")
+    if float(beta) == 0.0:
+        raise ValueError(
+            "kl_in_reward=True requires a non-zero `beta` (the KL coefficient): "
+            "TRL only creates the reference model and computes ref logprobs when "
+            "beta>0, and k1-in-reward needs those ref logps. Set beta to your KL "
+            "coefficient (e.g. make_po_config('dr_grpo', beta=0.04)); the in-loss "
+            "k3 term is suppressed automatically so beta acts purely as the "
+            "in-reward k1 coefficient."
+        )
+    if str(scale_rewards).lower() not in ("none", "false"):
+        raise ValueError(
+            "kl_in_reward=True requires scale_rewards in {none,false} "
+            f"(got {scale_rewards!r}). The advantage-adjustment identity "
+            "adv -= beta·(KL - group_mean(KL)) is EXACT only without per-group "
+            "std-normalization (the Dr.GRPO / Composer regime). With std-norm, "
+            "folding KL into the reward also shifts the group std, so the linear "
+            "correction no longer matches true in-reward KL. Use "
+            "make_po_config('dr_grpo', beta=…) (scale_rewards='none')."
+        )
 def make_dr_grpo_config(**overrides: Any):
     """Build a `trl.GRPOConfig` configured to the **Dr. GRPO** recipe.
     "make_dr_grpo_config",
     "make_po_config",
     "PO_OBJECTIVES",
+    "validate_kl_in_reward_config",
 ]
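
The save / zero / restore of `self.beta` in `_grpo_loss_kl_in_reward` can also be expressed as a context manager. The following is a minimal sketch of that pattern only, not code from this PR: `suppressed_beta` and `StubTrainer` are hypothetical names, and the stub's `parent_loss` merely imitates the `beta != 0` gate described in the docstring.

```python
from contextlib import contextmanager


@contextmanager
def suppressed_beta(trainer):
    """Zero ``trainer.beta`` inside the block and restore it on exit."""
    saved = trainer.beta
    trainer.beta = 0.0
    try:
        yield trainer
    finally:
        trainer.beta = saved  # runs even if the wrapped call raises


class StubTrainer:
    """Hypothetical stand-in for the trainer; not TRL's GRPOTrainer."""

    def __init__(self, beta):
        self.beta = beta

    def parent_loss(self, surrogate=1.0, kl=0.25):
        # Imitates the gate described in _grpo_loss_kl_in_reward's docstring:
        # the KL term is added only when beta != 0.
        if self.beta != 0.0:
            return surrogate + self.beta * kl
        return surrogate


stub = StubTrainer(beta=0.04)
with suppressed_beta(stub):
    loss_without_kl = stub.parent_loss()  # in-loss KL suppressed
loss_with_kl = stub.parent_loss()         # beta restored, KL term back
```

The `try` / `finally` form used in the diff is equivalent; a context manager mainly helps if the same suppression is needed around more than one call.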

composer_replication/trainer/kl_in_reward.py ADDED

	@@ -0,0 +1,182 @@

+"""k1-in-reward KL penalty — the Composer-2 / verl fidelity choice.
+THE FIDELITY GAP (F5 Rubric A item c2, the single highest-leverage fix).
+Composer-2 §4.1 explicitly chooses the **k1** KL estimator applied **in the
+reward** (``-log r``), citing a variance argument (Amini et al.). TRL's
+``GRPOTrainer`` instead applies the **k3** estimator (``exp(Δ) - Δ - 1``,
+Δ = ref_logp - logp) **in the loss**, gated on ``beta != 0``. The 2025/26
+literature says this is not cosmetic:
+  * arXiv:2512.21852 ("A Comedy of Estimators") — k1-in-reward improves OOD
+    generalization; k3-in-reward can collapse.
+  * verl adopted k1-in-reward as its *only* reverse-KL option.
+  * TRL issue #4967 tracks the same divergence.
+OOD generalization is exactly the "take any model to the next level" axis, so
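+this module gives the trainer an opt-in k1-in-reward path that matches
+Composer-2 / verl. The trainer suppresses TRL's native k3-in-loss term by
+zeroing ``self.beta`` only around the parent loss call; the configured ``beta``
+stays non-zero so TRL still computes ref logps.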
+THE ALGEBRA (why this is a clean advantage adjustment, not a TRL fork).
+k1-in-reward means: penalize each sequence's reward by ``coef * KL_i`` before
+GRPO computes its group-relative advantage:
+    reward'_i = reward_i - coef * KL_i
+    KL_i      = Σ_t mask_{i,t} · (logp_{i,t} - ref_logp_{i,t})    # k1 estimator
+                                                                  # of KL(π‖π_ref)
+GRPO's advantage (with ``scale_rewards="none"``, the Dr.GRPO / Composer regime)
+is the group-mean baseline ``adv_i = reward_i - mean_group(reward)``. Because
+that baseline is LINEAR, folding-then-baselining equals adjusting the final
+advantage:
+    adv'_i = reward'_i - mean_group(reward')
+           = adv_i - coef · (KL_i - mean_group(KL))
+So the trainer can let TRL compute advantages normally, then apply this exact
+correction — no reimplementation of TRL's reward→advantage code.
+THE STD-NORM CAVEAT (why we require scale_rewards="none"). The identity above
+is EXACT only when there is no per-group std normalization. With std-norm,
+folding KL into the reward also changes the group std, so the linear correction
+is no longer equivalent. Composer-2 and verl both train WITHOUT std scaling
+(Dr.GRPO's recommendation), so we make the math exact for that regime and the
+trainer raises if k1-in-reward is requested with std-norm on, rather than
+silently applying an approximation.
+Note: ``-log r`` (Composer-2's phrasing) with ``r = π/π_ref = exp(logp-ref_logp)``
+gives ``-log r = ref_logp - logp = -(logp - ref_logp)`` *per token*. The KL
+PENALTY subtracted from reward is ``coef · Σ_t (logp - ref_logp)`` — i.e. the
+k1 estimator of the reverse KL, which is what discourages drift from π_ref. The
+sign convention here matches the standard RLHF KL-in-reward penalty
+(Stiennon et al. 2020; verl ``kl_penalty="kl"``).
+"""
+from __future__ import annotations
+import torch
+#: Supported KL estimators for the in-reward penalty. k1 is the Composer-2 /
+#: verl choice this module exists for; k3 is accepted so experiments can A/B
+#: k3-in-reward against k1-in-reward without touching TRL.
+KL_ESTIMATORS = ("k1", "k3")
+def k1_kl_penalty_per_sequence(
+    policy_logps: torch.Tensor,
+    ref_logps: torch.Tensor,
+    completion_mask: torch.Tensor,
+) -> torch.Tensor:
+    """Per-sequence k1 estimator of KL(π ‖ π_ref) over completion tokens.
+    Args:
+        policy_logps: ``(B, T)`` per-token logprobs under the (sampling) policy π.
+        ref_logps: ``(B, T)`` per-token logprobs under the reference policy π_ref,
+            on the SAME tokens/positions as ``policy_logps``.
+        completion_mask: ``(B, T)`` 1.0 at real completion tokens, 0.0 at prompt /
+            padding positions (the k1 sum is taken only over real tokens).
+    Returns:
+        ``(B,)`` per-sequence KL penalty ``Σ_t mask·(logp - ref_logp)``.
+    The k1 estimator ``logp - ref_logp`` is the unbiased (higher-variance)
+    single-sample estimate of the reverse KL; summed over the response it is the
+    sequence-level KL used as the reward penalty.
+    """
+    if policy_logps.shape != ref_logps.shape:
+        raise ValueError(
+            f"policy_logps {tuple(policy_logps.shape)} and ref_logps "
+            f"{tuple(ref_logps.shape)} must have identical shape (same tokens)."
+        )
+    if completion_mask.shape != policy_logps.shape:
+        raise ValueError(
+            f"completion_mask {tuple(completion_mask.shape)} must match "
+            f"policy_logps {tuple(policy_logps.shape)}."
+        )
+    per_token = (policy_logps - ref_logps) * completion_mask
+    return per_token.sum(dim=-1)
+def k3_kl_penalty_per_sequence(
+    policy_logps: torch.Tensor,
+    ref_logps: torch.Tensor,
+    completion_mask: torch.Tensor,
+) -> torch.Tensor:
+    """Per-sequence k3 (Schulman) estimator of KL over completion tokens.
+    ``k3 = exp(Δ) - Δ - 1``, Δ = ref_logp - logp. Always ≥ 0, lower variance.
+    Provided for the in-reward path so an experiment can A/B k1-in-reward against
+    k3-in-reward (the comparison arXiv:2512.21852 makes) without touching TRL.
+    """
+    if not (policy_logps.shape == ref_logps.shape == completion_mask.shape):
+        raise ValueError("policy_logps, ref_logps, completion_mask must share shape.")
+    delta = ref_logps - policy_logps
+    per_token = (torch.exp(delta) - delta - 1.0) * completion_mask
+    return per_token.sum(dim=-1)
+def kl_penalty_per_sequence(
+    policy_logps: torch.Tensor,
+    ref_logps: torch.Tensor,
+    completion_mask: torch.Tensor,
+    estimator: str = "k1",
+) -> torch.Tensor:
+    """Dispatch to the k1 or k3 per-sequence KL penalty."""
+    if estimator == "k1":
+        return k1_kl_penalty_per_sequence(policy_logps, ref_logps, completion_mask)
+    if estimator == "k3":
+        return k3_kl_penalty_per_sequence(policy_logps, ref_logps, completion_mask)
+    raise ValueError(
+        f"Unknown KL estimator {estimator!r}; choose from {KL_ESTIMATORS}. "
+        "k1 is the Composer-2 / verl in-reward choice this module exists for."
+    )
+def apply_kl_in_reward(
+    advantages: torch.Tensor,
+    kl_penalty: torch.Tensor,
+    num_generations: int,
+    coef: float,
+) -> torch.Tensor:
+    """Adjust GRPO advantages to fold a KL penalty into the reward.
+    Exact (not approximate) under the group-mean baseline with NO std
+    normalization (``scale_rewards="none"`` — the Dr.GRPO / Composer regime).
+    See the module docstring for the linearity argument.
+    Args:
+        advantages: ``(B,)`` GRPO advantages as TRL computed them
+            (= reward - group_mean(reward), no std division).
+        kl_penalty: ``(B,)`` per-sequence KL penalty (from
+            ``kl_penalty_per_sequence``).
+        num_generations: G — the number of completions per prompt (group size).
+            ``B`` must be divisible by G; groups are contiguous as TRL lays them
+            out (``rewards.view(-1, num_generations)``).
+        coef: the KL coefficient β. ``coef=0`` returns advantages unchanged.
+    Returns:
+        ``(B,)`` adjusted advantages ``adv - coef·(KL - group_mean(KL))``.
+    """
+    if coef == 0.0:
+        return advantages
+    if advantages.shape != kl_penalty.shape:
+        raise ValueError(
+            f"advantages {tuple(advantages.shape)} and kl_penalty "
+            f"{tuple(kl_penalty.shape)} must have identical shape (B,)."
+        )
+    b = advantages.shape[0]
+    if num_generations <= 0 or b % num_generations != 0:
+        raise ValueError(
+            f"batch size B={b} must be a positive multiple of num_generations="
+            f"{num_generations} (GRPO lays groups out contiguously)."
+        )
+    kl_grouped = kl_penalty.view(-1, num_generations)
+    kl_centered = (kl_grouped - kl_grouped.mean(dim=1, keepdim=True)).reshape(b)
+    return advantages - coef * kl_centered
+__all__ = [
+    "KL_ESTIMATORS",
+    "k1_kl_penalty_per_sequence",
+    "k3_kl_penalty_per_sequence",
+    "kl_penalty_per_sequence",
+    "apply_kl_in_reward",
+]
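
The std-norm caveat in the module docstring can be checked numerically. The following is a minimal sketch in plain Python (no torch), using one group of four made-up rewards and KL values. The helper names `center` and `standardize` are illustrative assumptions, and population std is used purely for illustration; it is not a claim about how TRL computes its group std.

```python
from statistics import fmean, pstdev


def center(xs):
    """Group-mean baseline (the scale_rewards='none' case): x_i - mean(x)."""
    m = fmean(xs)
    return [x - m for x in xs]


def standardize(xs):
    """Group-mean baseline followed by division by the group std."""
    s = pstdev(xs)
    return [c / s for c in center(xs)]


coef = 0.5
rewards = [1.0, 0.0, 0.0, 1.0]  # one group, G = 4 (made-up numbers)
kl = [0.2, 1.0, 0.1, 0.7]       # per-sequence KL penalties (made-up numbers)
folded = [r - coef * k for r, k in zip(rewards, kl)]

# No std-norm: adjusting the baselined advantages matches fold-then-baseline.
adjusted = [a - coef * c for a, c in zip(center(rewards), center(kl))]
exact_gap = max(abs(a - b) for a, b in zip(adjusted, center(folded)))

# With std-norm: the same linear correction no longer matches, because
# folding KL into the reward also changes the group std.
adjusted_std = [a - coef * c for a, c in zip(standardize(rewards), center(kl))]
std_gap = max(abs(a - b) for a, b in zip(adjusted_std, standardize(folded)))

print(f"gap without std-norm: {exact_gap:.2e}")
print(f"gap with std-norm:    {std_gap:.2e}")
```

The first gap is at floating-point noise level while the second is not, which is the reason `validate_kl_in_reward_config` rejects std-normalized configs instead of applying the correction as an approximation.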

composer_replication/trainer/tests/test_dr_grpo_config_and_alignment.py CHANGED

@@ -75,6 +75,15 @@ def test_trl_kl_estimator_is_k3_not_k1():
     the installed trl==1.5.0 source (grpo_trainer.py ~L2513) shows it actually
     computes k3:  `torch.exp(ref - logp) - (ref - logp) - 1`. This test pins
     that finding so the docstring stays honest and a future TRL change is caught.
     """
     # Known logprob pairs (student logp, reference logp).
     logp = torch.tensor([-1.0, -2.0, -0.5, -3.0])
@@ -121,6 +130,30 @@ def test_trl_kl_estimator_is_k3_not_k1():
     )
 # ---------------------------------------------------------------------------
 # Gate 2 — SDPO strict-alignment guard (no real GRPOTrainer needed)
 # ---------------------------------------------------------------------------

     the installed trl==1.5.0 source (grpo_trainer.py ~L2513) shows it actually
     computes k3:  `torch.exp(ref - logp) - (ref - logp) - 1`. This test pins
     that finding so the docstring stays honest and a future TRL change is caught.
+    NOTE (Wave 20, F5 #1 fidelity fix): TRL's NATIVE behavior is still k3-in-loss
+    (this test stays valid). But Composer-2 §4.1 chooses k1-IN-REWARD, and the
+    2025/26 evidence (arXiv:2512.21852, verl, TRL #4967) says that matters for
+    OOD generalization. ComposerReplicationTrainer now offers an opt-in
+    `kl_in_reward=True` path (composer_replication.trainer.kl_in_reward) that
+    suppresses this in-loss k3 term (beta→0 for the parent loss) and folds a k1
+    penalty into the reward/advantage instead — see
+    test_trainer_offers_k1_in_reward_path below and test_kl_in_reward.py.
     """
     # Known logprob pairs (student logp, reference logp).
     logp = torch.tensor([-1.0, -2.0, -0.5, -3.0])
     )
+def test_trainer_offers_k1_in_reward_path():
+    """The F5 #1 fix: ComposerReplicationTrainer exposes an opt-in k1-in-reward
+    path whose preconditions are validated (beta>0, scale_rewards∈{none,false},
+    estimator∈{k1,k3}). Validates the extracted precondition function without
+    needing a real GRPOTrainer (model+dataset)."""
+    from composer_replication.trainer.composer_trainer import validate_kl_in_reward_config
+    # Valid Composer-2 / verl config: k1 estimator, non-zero beta, no std-norm.
+    validate_kl_in_reward_config(kl_estimator="k1", beta=0.04, scale_rewards="none")
+    validate_kl_in_reward_config(kl_estimator="k3", beta=0.1, scale_rewards="false")
+    # beta=0 is the trap: TRL would build no ref model, so no ref logps.
+    with pytest.raises(ValueError, match="non-zero `beta`"):
+        validate_kl_in_reward_config(kl_estimator="k1", beta=0.0, scale_rewards="none")
+    # std-norm breaks the exact advantage-adjustment identity.
+    with pytest.raises(ValueError, match="scale_rewards in"):
+        validate_kl_in_reward_config(kl_estimator="k1", beta=0.04, scale_rewards="group")
+    # bad estimator
+    with pytest.raises(ValueError, match="kl_estimator must be"):
+        validate_kl_in_reward_config(kl_estimator="k2", beta=0.04, scale_rewards="none")
 # ---------------------------------------------------------------------------
 # Gate 2 — SDPO strict-alignment guard (no real GRPOTrainer needed)
 # ---------------------------------------------------------------------------

composer_replication/trainer/tests/test_kl_in_reward.py ADDED

	@@ -0,0 +1,146 @@

+"""Tests for k1-in-reward KL (F5 #1 fidelity fix; Composer-2 §4.1 / verl).
+The load-bearing test is `test_apply_equals_fold_then_baseline`: it proves the
+advantage adjustment `adv -= coef·(KL - group_mean(KL))` is EXACTLY equal to
+folding the KL penalty into the reward and re-running GRPO's group-mean
+baseline (with no std-norm). That equivalence is the entire justification for
+adjusting advantages post-hoc instead of forking TRL's reward→advantage code.
+"""
+from __future__ import annotations
+import pytest
+import torch
+from composer_replication.trainer.kl_in_reward import (
+    apply_kl_in_reward,
+    k1_kl_penalty_per_sequence,
+    k3_kl_penalty_per_sequence,
+    kl_penalty_per_sequence,
+)
+# ---------------------------------------------------------------------
+# Per-sequence KL estimators
+# ---------------------------------------------------------------------
+def test_k1_penalty_sums_masked_logp_diff():
+    policy = torch.tensor([[0.0, -1.0, -2.0], [-0.5, -0.5, -0.5]])
+    ref = torch.tensor([[0.0, -0.5, -1.0], [-1.0, -1.0, -1.0]])
+    mask = torch.tensor([[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])  # row0 drops last token
+    out = k1_kl_penalty_per_sequence(policy, ref, mask)
+    # row0: (0-0) + (-1-(-0.5)) [+ masked 0] = -0.5
+    # row1: (-0.5-(-1.0))*3 = +1.5
+    torch.testing.assert_close(out, torch.tensor([-0.5, 1.5]))
+def test_k1_can_be_negative_k3_cannot():
+    """Structural difference: k1 is signed, k3 ≥ 0 (the whole reason they differ)."""
+    policy = torch.tensor([[0.0, 0.0]])
+    ref = torch.tensor([[1.0, 1.0]])  # ref > policy → Δ=ref-logp>0 → k1<0
+    mask = torch.ones_like(policy)
+    k1 = k1_kl_penalty_per_sequence(policy, ref, mask)
+    k3 = k3_kl_penalty_per_sequence(policy, ref, mask)
+    assert (k1 < 0).all(), "k1 = Σ(logp-ref) is negative when ref>logp"
+    assert (k3 >= -1e-6).all(), "k3 (Schulman) is always non-negative"
+def test_k3_leading_order_is_half_delta_squared():
+    """For small Δ, k3 ≈ Δ²/2 — the minor-delta claim in make_dr_grpo_config."""
+    policy = torch.tensor([[0.0, 0.0, 0.0]])
+    ref = torch.tensor([[0.01, -0.02, 0.005]])
+    mask = torch.ones_like(policy)
+    k3 = k3_kl_penalty_per_sequence(policy, ref, mask)
+    delta = ref - policy
+    expected = (0.5 * delta**2).sum()
+    torch.testing.assert_close(k3, expected.unsqueeze(0), atol=1e-4, rtol=1e-3)
+def test_dispatch_and_unknown_estimator():
+    policy = torch.zeros(1, 2)
+    ref = torch.ones(1, 2)
+    mask = torch.ones(1, 2)
+    torch.testing.assert_close(
+        kl_penalty_per_sequence(policy, ref, mask, "k1"),
+        k1_kl_penalty_per_sequence(policy, ref, mask),
+    )
+    with pytest.raises(ValueError, match="Unknown KL estimator"):
+        kl_penalty_per_sequence(policy, ref, mask, "k2")
+def test_penalty_shape_validation():
+    with pytest.raises(ValueError, match="identical shape"):
+        k1_kl_penalty_per_sequence(torch.zeros(1, 3), torch.zeros(1, 2), torch.zeros(1, 3))
+    with pytest.raises(ValueError, match="must match"):
+        k1_kl_penalty_per_sequence(torch.zeros(1, 3), torch.zeros(1, 3), torch.zeros(1, 2))
+# ---------------------------------------------------------------------
+# apply_kl_in_reward — the advantage adjustment
+# ---------------------------------------------------------------------
+def test_apply_coef_zero_is_identity():
+    adv = torch.tensor([1.0, -1.0, 0.5, -0.5])
+    kl = torch.tensor([2.0, 3.0, 1.0, 0.0])
+    out = apply_kl_in_reward(adv, kl, num_generations=2, coef=0.0)
+    torch.testing.assert_close(out, adv)
+def test_apply_centers_kl_within_group():
+    # Two groups of 2. coef=1. adv -= (KL - group_mean(KL)).
+    adv = torch.zeros(4)
+    kl = torch.tensor([1.0, 3.0, 10.0, 20.0])
+    out = apply_kl_in_reward(adv, kl, num_generations=2, coef=1.0)
+    # group0 mean=2 → centered [-1,+1] → adv-(-1,+1)=[1,-1]
+    # group1 mean=15 → centered [-5,+5] → adv-(-5,+5)=[5,-5]
+    torch.testing.assert_close(out, torch.tensor([1.0, -1.0, 5.0, -5.0]))
+def test_apply_divisibility_validation():
+    with pytest.raises(ValueError, match="multiple of num_generations"):
+        apply_kl_in_reward(torch.zeros(5), torch.zeros(5), num_generations=2, coef=1.0)
+    with pytest.raises(ValueError, match="identical shape"):
+        apply_kl_in_reward(torch.zeros(4), torch.zeros(2), num_generations=2, coef=1.0)
+@pytest.mark.parametrize("num_generations", [2, 3, 4])
+@pytest.mark.parametrize("n_groups", [1, 2, 5])
+def test_apply_equals_fold_then_baseline(num_generations, n_groups):
+    """THE load-bearing property: adjusting baselined advantages by
+    -coef·(KL - group_mean(KL)) equals folding -coef·KL into the reward and
+    re-running GRPO's group-mean baseline (scale_rewards='none').
+    This proves the post-hoc advantage adjustment IS exact k1-in-reward, not an
+    approximation — the justification for not forking TRL's scoring code.
+    """
+    torch.manual_seed(0)
+    g, k = num_generations, n_groups
+    b = g * k
+    coef = 0.137
+    rewards = torch.randn(b)
+    kl = torch.randn(b).abs()  # non-negative for realism; the identity holds for any sign (k1 can be negative)
+    # GRPO baseline (no std-norm): adv = reward - group_mean(reward).
+    def group_baseline(x):
+        means = x.view(k, g).mean(dim=1).repeat_interleave(g)  # (b,)
+        return x - means
+    advantages = group_baseline(rewards)
+    # Reference: fold KL into reward, THEN baseline.
+    folded_reward = rewards - coef * kl
+    adv_fold_then_baseline = group_baseline(folded_reward)
+    # Under test: adjust the ALREADY-baselined advantages.
+    adv_adjusted = apply_kl_in_reward(advantages, kl, num_generations=g, coef=coef)
+    torch.testing.assert_close(adv_adjusted, adv_fold_then_baseline, atol=1e-5, rtol=1e-5)
+def test_apply_does_not_mutate_input():
+    adv = torch.tensor([1.0, 2.0])
+    adv_copy = adv.clone()
+    apply_kl_in_reward(adv, torch.tensor([0.0, 1.0]), num_generations=2, coef=1.0)
+    torch.testing.assert_close(adv, adv_copy)  # functional, not in-place
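
The per-token properties the estimator tests pin (k1 is signed, k3 is non-negative and close to Δ²/2 for small Δ) can be reproduced without torch. This is a minimal sketch; the function names `k1` and `k3` and the sample logprob pairs are illustrative assumptions, not part of the test suite.

```python
import math


def k1(logp, ref_logp):
    """Per-token k1 term: logp - ref_logp (can be negative)."""
    return logp - ref_logp


def k3(logp, ref_logp):
    """Per-token k3 term: exp(d) - d - 1 with d = ref_logp - logp."""
    d = ref_logp - logp
    return math.exp(d) - d - 1.0


# (policy logp, reference logp) pairs; made-up values.
pairs = [(-1.0, -0.5), (-0.5, -1.0), (-2.0, -2.01)]

for logp, ref in pairs:
    print(
        f"logp={logp:+.2f} ref={ref:+.2f} "
        f"k1={k1(logp, ref):+.4f} k3={k3(logp, ref):.6f}"
    )

# Small-delta check on the last pair: k3 is close to d**2 / 2.
small_d = -2.01 - (-2.0)
small_gap = abs(k3(-2.0, -2.01) - 0.5 * small_d**2)
```

Summing these per-token terms over masked completion tokens gives the per-sequence penalties that `k1_kl_penalty_per_sequence` and `k3_kl_penalty_per_sequence` return.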

research/design-F5-fidelity-audit.md CHANGED

@@ -24,13 +24,13 @@ The good news the report stresses: the substrate for all of this already exists.
 | (b1) | **25× synthetic data — Feature Deletion generator** — 2.5 blog | **FULLY-REPLICATED (substrate-inversion form)** | `datagen/substrates.py::SweBenchAdapter` (revert gold patch → broken repo, FAIL_TO_PASS=reward target, license filter), `datagen/env.py::FeatureDeletionEnv`, `datagen/validator.py` (4-gate solvability), `datagen/schema.py`. | Real *generative* synthesis (manufacture novel broken states beyond SWE-bench inversion) absent; only adapts existing SWE-* instances. No "25×" scale-out generator suite. |
 | (b2) | **Dynamic-difficulty curriculum ("select for AND create harder tasks dynamically")** — 2.5 blog + Composer-2 §3 (keyed on #turns + thinking-tokens) | **FULLY-REPLICATED (select-for half)** | `datagen/curriculum.py::DifficultyCurriculum` — p̂(1−p̂) frontier weighting, retire >0.95, quarantine <0.02, **effort tilt on turns/think-tokens** (ADR-012 #4, matching Composer-2's exact heuristic). | **CREATE half missing**: no live escalation of deletion-span / coupling / multi-feature difficulty during the run. Curriculum scores an *existing* pool; it doesn't mint harder tasks. |
 | (c1) | **Dr.GRPO base objective** — Composer-2 §4.1 | **FULLY-REPLICATED** | `composer_trainer.py::make_dr_grpo_config` + `make_po_config` (PO menu: grpo/dr_grpo/bnpo/dapo/gspo/cispo, pure TRL 1.5.0 config). `loss_type="dr_grpo"`, `scale_rewards="none"`, `num_iterations=1`, drift-guard asserts. ADR-014. | None on the objective itself. |
-| (c2) | **k1-vs-k3 KL** — Composer-2 §4.1 explicitly chooses **k1 = −log r in *reward*** (variance argument, citing Amini et al.) | **PARTIAL — DOCUMENTED INFIDELITY** | `composer_trainer.py:496-509` documents that TRL's `_compute_loss` uses **k3-in-loss** (`exp(Δ)−Δ−1`), NOT k1. `test_dr_grpo_config_and_alignment.py::test_trl_kl_estimator_is_k3_not_k1` pins this. Honest delta, not patched. | **The evidence says this delta matters for the "next level":** arXiv:2512.21852 + TRL #4967 + verl (k1-in-reward only) show k1-in-reward ↑ OOD generalization; k3-in-reward can collapse. Composer chose k1 deliberately. Fix is implementable (see Gap #1). |
 | (d) | **CPT → SFT → RL phase structure** — Composer-2 §3-4 (CPT loss ↓ ⇒ RL ceiling ↑, replicated on Qwen3-Coder-30B) | **PARTIAL (intentional skip + plumbing)** | Documented decision to skip CPT and start from a code-tuned base (COMPOSER_RECIPE_MAPPING.md row a; corroborated by Composer-2's own CPT→RL causal claim). Inner/outer loop split exists (datagen=outer, `ComposerReplicationTrainer`=inner). | **No SFT-first stage in code.** Report §5 prescribes "SFT-first on clean winning trajectories before RL" — there is no SFT trainer/recipe; only the RL trainer exists. CPT correctly skipped. |
 | (e) | **Sharded Muon + dual-mesh HSDP** (2.5 blog) / FSDP+CP+decoupled-EP, Adam (Composer-2 §6) | **MISSING (intentional, irrelevant at our scale)** | — | Correctly out of scope for dense Qwen3-{7,32}B (the mapping doc + report both say skip until MoE base). Distributed substrate is DiLoCo-over-S3, not HSDP. Note research/10 *corrects* the blog: Composer-2 uses **Adam**, not Muon, and FSDP+CP+decoupled-EP, not HSDP. |
 | (f) | **Anyrun production-fidelity sandboxed RL harness** (>500 pods/s, per-pod Firecracker microVM, fork/snapshot, Anygress egress proxy) — Composer-2 §6.2 | **PARTIAL** | `datagen/sandbox.py` (`Sandbox` Protocol, `LocalSubprocessSandbox`, `scrub_tree` primary control, denylist defense-in-depth), `datagen/docker_sandbox.py`, `diloco/serverless/{executor.py,eks.py,sagemaker.py,modal_spawn.py}`. | No microVM isolation (gVisor/Kata-Firecracker), no fork/snapshot, no egress proxy, no >100k-pod orchestration. The report's EKS plan (§8: gVisor default → Kata+Firecracker → container-free SWE-MiniSandbox) is design-only. `eks.py`/`sagemaker.py` are executor skeletons, not the full Anyrun analogue. |
 | (g) | **Reward-hacking monitoring** (2.5 blog: bytecode decompile / type-cache hacks; "agentic monitoring tools") | **FULLY-REPLICATED (defense-in-depth) + run-level guard now wired** | `datagen/monitor.py::HackMonitor` (signature + patch-provenance, obfuscation-resistant), `sandbox.py::scrub_tree` (physical cache/.git removal = "the wall"), `datagen/validator.py` (4-gate), `safety/holdout.py::HeldoutSplit` (id + content-hash disjointness), `safety/kill_switch.py::HeldOutGuard` (proxy-real Hacking-Gap + KL hard-stop), **now wired into the trainer** (`composer_trainer.py::_maybe_update_killswitch`, ADR-015, 2026-06-08). | The held-out kill-switch — the report's "most load-bearing safeguard, documented gap" — is **now CLOSED** (ADR-015). Remaining: `HackMonitor` validated only on constructed examples (report warns synthetic-hack monitors fail to generalize); offline LLM-judge monitor (EvilGenie-style) not built. |
-| (h) | **Aux scalar rewards (style/communication/unfinished-todo penalties)** — Composer-2 §4.2 | **MISSING** | Reward is pure test-pass-fraction (`env.py::_grade`). No auxiliary reward array. `integrations/altered_minds/reward.py` is an MMLU-format reward for ADR-013 ladder, not the Composer behavior-reward suite. | Fully specified in research/10; reproducible without the hint mystery. Build a `behavior_rewards.py` reward-fn bank. |
-| (i) | **Nonlinear length/effort penalty** `C_length{k,q}(x)=((1+kx)^{1−q}−1)/(k(1−q))` — Composer-2 §4.2 (exact equation) | **MISSING** | — | Trivially implementable (≈30 LOC reward shaper over {thinking, tool-call, tool-output, final-msg tokens, #calls, #turns}). Induces parallel tool calls per the report. |
 | (j) | **Self-summarization (reward-to-all-chain-tokens)** — Composer-2 §4.1 | **MISSING** | — | The mechanism that handles 100k-token long-horizon rollouts (the regime the report says the *tree* is for). Not built. |
 | (k) | **MoE router replay** — Composer-2 §6.2 | **MISSING (out of scope, dense bases)** | — | Only relevant for MoE-base RL; correct to defer. |
@@ -74,9 +74,9 @@ What exists today (the *ancestor*): `teacher_replay.py` is **flat depth-1** (N t
 Ordered by (fidelity-leverage × cheapness), front-loading the items that move the "next-level" needle for the least build.
-**Tier 0 — cheap fidelity fixes the evidence says move OOD generalization (do first):**
-1. **k1-in-reward KL** (Rubric A c2). Add a `kl_estimator="k1"` + `use_kl_in_reward=True` path to the trainer: compute `−log r` per token, fold into the *advantage/reward* (not the loss), set TRL `beta=0.0` to disable its k3-in-loss term. Mirror TRL issue #4967 / verl's choice. `composer_trainer.py` ~60 LOC + test flipping the pinned k3 assertion. **This is the highest-fidelity-leverage single change.**
-2. **Composer-2 behavior rewards** (Rubric A h+i): `datagen/behavior_rewards.py` — the aux scalar reward array (style/communication/unfinished-todo) + the nonlinear length/effort penalty `C_length` (exact eq. in research/10), as TRL `RewardFunc`s composable with `env.reward_fn`. ~120 LOC. Reproducible *without* the hint mystery; directly targets Composer's "communication style + effort calibration" goal.
 **Tier 1 — close the highest-value PARTIALs:**
 3. **SDPO live-GPU smoke** (Rubric A a): instantiate `ComposerReplicationTrainer` against a real `trl.GRPOTrainer` on a small model (Qwen2.5-0.5B) on a SageMaker Training Job (g5/g6e) or HyperPod node-group — discharges the ADR-008 "never smoke-tested against real GRPOTrainer" caveat.

 | (b1) | **25× synthetic data — Feature Deletion generator** — 2.5 blog | **FULLY-REPLICATED (substrate-inversion form)** | `datagen/substrates.py::SweBenchAdapter` (revert gold patch → broken repo, FAIL_TO_PASS=reward target, license filter), `datagen/env.py::FeatureDeletionEnv`, `datagen/validator.py` (4-gate solvability), `datagen/schema.py`. | Real *generative* synthesis (manufacture novel broken states beyond SWE-bench inversion) absent; only adapts existing SWE-* instances. No "25×" scale-out generator suite. |
 | (b2) | **Dynamic-difficulty curriculum ("select for AND create harder tasks dynamically")** — 2.5 blog + Composer-2 §3 (keyed on #turns + thinking-tokens) | **FULLY-REPLICATED (select-for half)** | `datagen/curriculum.py::DifficultyCurriculum` — p̂(1−p̂) frontier weighting, retire >0.95, quarantine <0.02, **effort tilt on turns/think-tokens** (ADR-012 #4, matching Composer-2's exact heuristic). | **CREATE half missing**: no live escalation of deletion-span / coupling / multi-feature difficulty during the run. Curriculum scores an *existing* pool; it doesn't mint harder tasks. |
 | (c1) | **Dr.GRPO base objective** — Composer-2 §4.1 | **FULLY-REPLICATED** | `composer_trainer.py::make_dr_grpo_config` + `make_po_config` (PO menu: grpo/dr_grpo/bnpo/dapo/gspo/cispo, pure TRL 1.5.0 config). `loss_type="dr_grpo"`, `scale_rewards="none"`, `num_iterations=1`, drift-guard asserts. ADR-014. | None on the objective itself. |
+| (c2) | **k1-vs-k3 KL** — Composer-2 §4.1 explicitly chooses **k1 = −log r in *reward*** (variance argument, citing Amini et al.) | **CLOSED (Wave 20) — opt-in k1-in-reward path** | `trainer/kl_in_reward.py` (pure, tested: k1/k3 per-seq estimators + `apply_kl_in_reward` advantage adjustment) + `composer_trainer.py` (`kl_in_reward=True`/`kl_estimator=`: folds the k1 penalty into advantages at scoring, zeros TRL's in-loss k3 per step, `validate_kl_in_reward_config` guards beta>0 + scale_rewards∈{none,false}). `test_kl_in_reward.py` proves the advantage-adjustment EXACTLY equals fold-then-baseline (9 group/gen combos). TRL's native k3 still pinned by `test_trl_kl_estimator_is_k3_not_k1` (still true; now documents the override). | Live-GPU A/B (k1-in-reward vs k3-in-loss OOD) deferred to the SageMaker smoke. The math + wiring are done and CPU-verified. |
 | (d) | **CPT → SFT → RL phase structure** — Composer-2 §3-4 (CPT loss ↓ ⇒ RL ceiling ↑, replicated on Qwen3-Coder-30B) | **PARTIAL (intentional skip + plumbing)** | Documented decision to skip CPT and start from a code-tuned base (COMPOSER_RECIPE_MAPPING.md row a; corroborated by Composer-2's own CPT→RL causal claim). Inner/outer loop split exists (datagen=outer, `ComposerReplicationTrainer`=inner). | **No SFT-first stage in code.** Report §5 prescribes "SFT-first on clean winning trajectories before RL" — there is no SFT trainer/recipe; only the RL trainer exists. CPT correctly skipped. |
 | (e) | **Sharded Muon + dual-mesh HSDP** (2.5 blog) / FSDP+CP+decoupled-EP, Adam (Composer-2 §6) | **MISSING (intentional, irrelevant at our scale)** | — | Correctly out of scope for dense Qwen3-{7,32}B (the mapping doc + report both say skip until MoE base). Distributed substrate is DiLoCo-over-S3, not HSDP. Note research/10 *corrects* the blog: Composer-2 uses **Adam**, not Muon, and FSDP+CP+decoupled-EP, not HSDP. |
 | (f) | **Anyrun production-fidelity sandboxed RL harness** (>500 pods/s, per-pod Firecracker microVM, fork/snapshot, Anygress egress proxy) — Composer-2 §6.2 | **PARTIAL** | `datagen/sandbox.py` (`Sandbox` Protocol, `LocalSubprocessSandbox`, `scrub_tree` primary control, denylist defense-in-depth), `datagen/docker_sandbox.py`, `diloco/serverless/{executor.py,eks.py,sagemaker.py,modal_spawn.py}`. | No microVM isolation (gVisor/Kata-Firecracker), no fork/snapshot, no egress proxy, no >100k-pod orchestration. The report's EKS plan (§8: gVisor default → Kata+Firecracker → container-free SWE-MiniSandbox) is design-only. `eks.py`/`sagemaker.py` are executor skeletons, not the full Anyrun analogue. |
 | (g) | **Reward-hacking monitoring** (2.5 blog: bytecode decompile / type-cache hacks; "agentic monitoring tools") | **FULLY-REPLICATED (defense-in-depth) + run-level guard now wired** | `datagen/monitor.py::HackMonitor` (signature + patch-provenance, obfuscation-resistant), `sandbox.py::scrub_tree` (physical cache/.git removal = "the wall"), `datagen/validator.py` (4-gate), `safety/holdout.py::HeldoutSplit` (id + content-hash disjointness), `safety/kill_switch.py::HeldOutGuard` (proxy-real Hacking-Gap + KL hard-stop), **now wired into the trainer** (`composer_trainer.py::_maybe_update_killswitch`, ADR-015, 2026-06-08). | The held-out kill-switch — the report's "most load-bearing safeguard, documented gap" — is **now CLOSED** (ADR-015). Remaining: `HackMonitor` validated only on constructed examples (report warns synthetic-hack monitors fail to generalize); offline LLM-judge monitor (EvilGenie-style) not built. |
+| (h) | **Aux scalar rewards (style/communication/unfinished-todo penalties)** — Composer-2 §4.2 | **CLOSED (Wave 20)** | `datagen/behavior_rewards.py`: `UnfinishedTodoPenalty`, `LeftoverCoTPenalty`, `CommunicationReward` (the exact emergent behaviors §4.2 names) as TRL `RewardFunc`s, composable with `env.reward_fn` via `reward_weights`. `default_behavior_rewards()` = the starter array. `test_behavior_rewards.py` pins the contract. | Heuristic pattern-matchers by design (the report describes adding these reactively, in response to monitoring); the set is meant to grow. A judge-based communication score is the natural upgrade. |
+| (i) | **Nonlinear length/effort penalty** `C_length{k,q}(x)=((1+kx)^{1−q}−1)/(k(1−q))` — Composer-2 §4.2 (exact equation) | **CLOSED (Wave 20)** | `datagen/behavior_rewards.py::c_length` (exact equation + q→1 log-limit) + `EffortWeights` (the weighted {thinking, tool-call, tool-output, final-msg tokens, #calls, #turns} combination) + `LengthEffortPenalty` (the RewardFunc). `test_behavior_rewards.py` verifies the closed form, the concave-down/increasing shape, and the edge cases. | Effort features read from a per-completion `effort_features` kwarg when the harness supplies them; char-count proxy otherwise. |
 | (j) | **Self-summarization (reward-to-all-chain-tokens)** — Composer-2 §4.1 | **MISSING** | — | The mechanism that handles 100k-token long-horizon rollouts (the regime the report says the *tree* is for). Not built. |
 | (k) | **MoE router replay** — Composer-2 §6.2 | **MISSING (out of scope, dense bases)** | — | Only relevant for MoE-base RL; correct to defer. |
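
To make the shape of the row (i) penalty concrete, here is a minimal standalone sketch of the closed form `C_length{k,q}(x)=((1+kx)^{1−q}−1)/(k(1−q))` together with its q→1 log-limit `log(1+kx)/k`. It is not the code in `datagen/behavior_rewards.py`: the function signature, the argument checks, and the assumed domain (k > 0, x ≥ 0) are illustrative choices, and `expm1`/`log1p` are used only to keep the expression numerically stable near q = 1.

```python
import math


def c_length(x: float, k: float, q: float) -> float:
    """Nonlinear length/effort penalty ((1+kx)^(1-q) - 1) / (k(1-q)).

    Assumed domain for this sketch: x >= 0 and k > 0.
    """
    if x < 0 or k <= 0:
        raise ValueError("this sketch assumes x >= 0 and k > 0")
    log_term = math.log1p(k * x)  # log(1 + kx)
    if q == 1.0:
        # q -> 1 limit of the closed form: log(1 + kx) / k
        return log_term / k
    # (1+kx)^(1-q) - 1 == expm1((1-q) * log(1+kx)), stable when q is near 1
    return math.expm1((1.0 - q) * log_term) / (k * (1.0 - q))


# Sample the curve at a few effort values (k and q here are arbitrary).
values = [c_length(float(x), 0.5, 0.5) for x in range(5)]
increments = [b - a for a, b in zip(values, values[1:])]
```

With q = 0 the expression reduces to x (a linear penalty), and for q > 0 each additional unit of effort costs less than the previous one, which is the increasing, concave-down shape the tests are described as checking.
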
 Ordered by (fidelity-leverage × cheapness), front-loading the items that move the "next-level" needle for the least build.
+**Tier 0 — cheap fidelity fixes the evidence says move OOD generalization — ✅ DONE (Wave 20):**
+1. ✅ **k1-in-reward KL** (Rubric A c2). Implemented as `trainer/kl_in_reward.py` (pure, tested math) + the opt-in `composer_trainer.py` flags `kl_in_reward=True`/`kl_estimator=` (off by default, so legacy behavior is unchanged). Keeps `beta` as the KL coefficient (TRL needs beta>0 to compute ref logps), folds the k1 penalty into advantages at scoring time as `-coef·(KL - group_mean(KL))`, and zeros TRL's in-loss k3 per step. `validate_kl_in_reward_config` enforces beta>0 + scale_rewards∈{none,false}. `test_kl_in_reward.py` proves the advantage adjustment exactly equals folding `-coef·KL` into the reward and then re-running the group-mean baseline; the equivalence is exact only without std normalization, which is why `scale_rewards` is restricted. **The highest-fidelity-leverage single change, now shipped.**
+2. ✅ **Composer-2 behavior rewards** (Rubric A h+i): `datagen/behavior_rewards.py` — `c_length` (exact eq + q→1 limit), `EffortWeights`, `LengthEffortPenalty`, `UnfinishedTodoPenalty`, `LeftoverCoTPenalty`, `CommunicationReward`, `default_behavior_rewards()`, all TRL `RewardFunc`s composable with `env.reward_fn`. `test_behavior_rewards.py` pins the math + contract. Reproduced *without* the hint mystery.
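
The equivalence that item 1 relies on can be illustrated with a small pure-Python sketch. It is not the repository's `apply_kl_in_reward`: every function name and signature below is hypothetical, the per-sequence estimators are assumed to sum over tokens, and the toy numbers are made up. It shows that adjusting mean-baselined advantages by `-coef·(KL - group_mean(KL))` gives the same result as subtracting `coef·KL` from the reward before baselining, as long as no std normalization is applied.

```python
import math


def k1_seq(logps, ref_logps):
    """k1 = -log r with r = pi_ref / pi, summed over tokens sampled from pi."""
    return sum(lp - ref for lp, ref in zip(logps, ref_logps))


def k3_seq(logps, ref_logps):
    """k3 = (r - 1) - log r, a non-negative estimator of the same KL."""
    return sum(math.exp(ref - lp) - 1.0 - (ref - lp) for lp, ref in zip(logps, ref_logps))


def center_by_group(values, group_size):
    """Subtract each group's mean (group-mean baseline, no std normalization)."""
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        mean = sum(group) / len(group)
        out.extend(v - mean for v in group)
    return out


def adjust_advantages(advantages, kl, coef, group_size):
    """Apply -coef * (KL - group_mean(KL)) to precomputed advantages."""
    centered_kl = center_by_group(kl, group_size)
    return [a - coef * c for a, c in zip(advantages, centered_kl)]


# Toy batch: 2 prompts x 3 generations; all numbers are illustrative.
group_size, coef = 3, 0.05
rewards = [1.0, 0.0, 0.5, 0.0, 1.0, 1.0]
logps = [[-0.2, -1.0], [-0.5], [-0.1, -0.3, -0.9], [-2.0], [-0.7, -0.7], [-1.5, -0.1]]
ref_logps = [[-0.4, -0.8], [-0.5], [-0.6, -0.2, -1.0], [-1.0], [-0.9, -0.4], [-1.5, -0.6]]
kl = [k1_seq(p, r) for p, r in zip(logps, ref_logps)]

# Path A: baseline the raw rewards, then adjust the advantages.
adjusted = adjust_advantages(center_by_group(rewards, group_size), kl, coef, group_size)
# Path B: fold the penalty into the reward, then baseline.
folded = center_by_group([rw - coef * k for rw, k in zip(rewards, kl)], group_size)
```

The two paths agree because mean-centering is linear; dividing by a per-group standard deviation would break that linearity, which is consistent with the config guard on `scale_rewards`.
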
 **Tier 1 — close the highest-value PARTIALs:**
 3. **SDPO live-GPU smoke** (Rubric A a): instantiate `ComposerReplicationTrainer` against a real `trl.GRPOTrainer` on a small model (Qwen2.5-0.5B) on a SageMaker Training Job (g5/g6e) or HyperPod node-group — discharges the ADR-008 "never smoke-tested against real GRPOTrainer" caveat.