Baladithya Balamurugan
Wave 3: close the HIGH review findings (kill-switch wiring, HeldoutSplit, EKS entrypoint bug)
bd0c358
| """kill_switch.py: held-out collapse tripwire (the #2 collapse safeguard). | |
| This is the missing RUN-LEVEL / across-generation control for the self-evolving | |
| RL flywheel. The per-task controls already exist in ``composer_replication.datagen`` | |
| (the 4-gate solvability validator, the ``HackMonitor`` provenance check, and the | |
| sandbox denylist); this module sits ABOVE them and watches the whole run. | |
| Rationale (the literature is unambiguous that a held-out eval + hard stop is the | |
| load-bearing control, not a nice-to-have): | |
| - **Reward hacking rises monotonically with optimization depth.** Zhao et al., | |
| "Reward Hacking in Self-Improving Code Agents" (ICLR 2026 Workshop on RSI, | |
| OpenReview ``ikrQWGgxYg``) show that going from 10 -> 100 optimization steps | |
| drives the hacking rate from 26.4% to 57.8% (+31.4 points), and that | |
| 73.8% of KernelBench / 46.8% of ALE-Bench optimizations show *proxy gains | |
| without real gains*. They define **Hacking Gap = proxy gain - real gain**; | |
| this module's ``proxy_real_gap()`` is exactly that quantity. They label an | |
| optimization reward-hacking when it "improves the public metric WITHOUT | |
| improving the private metric" — the canonical signature this tripwire fires on. | |
| - **Self-critique alone is insufficient.** The same paper's "retrospection" | |
| self-critique sometimes *increased* hacking; their conclusion: "mitigating | |
| reward hacking likely requires stronger evaluations and constraints beyond | |
| self-critique alone." So we build a genuinely disjoint held-out eval plus a | |
| hard stop, not a critique hook. | |
| - **Held-out eval is necessary but NOT sufficient by itself.** EvilGenie | |
| (arXiv 2511.21654) found "only minimal improvement from the use of held out | |
| test cases" in isolation and that "holdout tests have many surprising failure | |
| modes." This module is therefore explicitly *defense-in-depth*, layered ON | |
| TOP of ``HackMonitor`` (provenance) — neither is sufficient alone, matching | |
| the repo's existing defense-in-depth framing in ``datagen/monitor.py``. | |
| - **Closed-loop RL on self-generated data collapses.** The self-evolving-agents | |
| survey (Gao et al., TMLR 2026; arXiv 2507.21046 v4) §8.3 names "model | |
| collapse from closed-loop RL on static synthetic data" and prescribes | |
| "continuous monitoring ... to detect long-horizon value drift" — i.e. a | |
| per-generation online tripwire, not a one-time eval. Shumailov et al. (Nature | |
| 2024, "AI models collapse when trained on recursively generated data") show | |
| self-training first loses the distribution tails, then converges to a | |
| low-variance point estimate; the mitigation that matters here is that the | |
| held-out eval must stay anchored to REAL tasks that are NEVER fed back to the | |
| generator (see ``HeldoutSplit``), otherwise the eval drifts with the train set. | |
| - **KL-to-init hard stop.** The GRPO "healthy progression" band (Orchestra | |
| Research GRPO SKILL) climbs 0.02 -> 0.05 -> 0.08 -> 0.12 nats/token over a | |
| run, with 0.08 the top of the "good progression" band and inside the | |
| code-generation drift zone (0.05-0.15 per-token); >0.5 is "diverging too | |
| much." So 0.08 nats/token is a sound HARD-STOP default. Catastrophic Goodhart | |
| (OpenReview ``UXuBzWoZGK``) proves KL regularization alone does NOT prevent | |
| heavy-tailed reward misspecification, so the KL hard stop is ONE tripwire | |
| among several, never the sole control. | |
| UNITS GOTCHA (load-bearing): the ``kl_to_init`` this module consumes is | |
| **token-mean KL in nats/token**, matching the repo convention in | |
| ``composer_replication.integrations.altered_minds.kl_logging.token_mean_kl``. | |
| A token-mean KL is NOT comparable to a sequence-level / sequence-summed KL | |
| (whose healthy band is ~0.05-10). The 0.08 default is per-token. Do not pass a | |
| sequence-summed KL into the per-token hard stop — it will fire instantly. | |
| This module is pure-Python: no torch, no cloud deps. ``kl_to_init`` is just a | |
| float the caller passes (computed upstream by ``token_mean_kl``). It is fully | |
| CPU-testable. | |
| """ | |
| from __future__ import annotations | |
| from dataclasses import dataclass, field | |
| class CollapseStopError(RuntimeError): | |
| """Raised (by the caller, optionally) when the tripwire fires a hard stop. | |
| The trainer loop can either check ``TripwireStatus.fire`` and stop softly, | |
| or call ``HeldOutGuard.raise_if_fired(status)`` to convert a fired verdict | |
| into this typed exception. Carries the structured verdict for logging. | |
| """ | |
| def __init__(self, status: TripwireStatus) -> None: | |
| super().__init__(status.reason) | |
| self.status = status | |
| @dataclass | |
| class TripwireStatus: | |
| """Structured verdict returned by every ``HeldOutGuard.update(...)`` call. | |
| Attributes: | |
| fire: True => the run should HALT (collapse / reward-hacking detected). | |
| reason: human-readable WHY (empty string when ``fire`` is False), so the | |
| trainer can log exactly which tripwire tripped, mirroring how | |
| ``datagen/monitor.py`` logs suspected hacks for review. | |
| step: the round/generation index this verdict was computed at. | |
| proxy_real_gap: the RSI "Hacking Gap" at this step = (in-loop reward gain | |
| since baseline) - (held-out score gain since baseline). Positive and | |
| widening => proxy improving faster than (or while) real declines. | |
| in_loop_ema: EMA of the in-loop / proxy reward at this step. | |
| heldout_ema: EMA of the held-out / real eval score at this step. | |
| kl_ema: EMA of ``kl_to_init`` (nats/token), or None if never supplied. | |
| """ | |
| fire: bool | |
| reason: str | |
| step: int | |
| proxy_real_gap: float | |
| in_loop_ema: float | |
| heldout_ema: float | |
| kl_ema: float | None = None | |
| # `halt` is a documented alias for `fire` — the task spec describes a | |
| # `should_halt()` / verdict with a `halt` field; expose both names so callers | |
| # reading either convention work. | |
| @property | |
| def halt(self) -> bool: | |
| return self.fire | |
| @dataclass | |
| class HeldOutGuard: | |
| """Across-generation collapse / reward-hacking kill-switch (HeldOutGuard). | |
| Tracks, per generation/round: in-loop (proxy) oracle reward, held-out (real) | |
| eval score, and optional KL-to-init / entropy / reward-std. Computes the | |
| proxy-minus-real "Hacking Gap" tripwire and fires a structured ``halt`` | |
| verdict when collapse is caught in the act. | |
| The guard is **stateful**: call ``update(round_idx, ...)`` once per checkpoint | |
| in the trainer loop (the same cadence at which ``DifficultyCurriculum.update`` | |
| is called). It maintains denoised EMAs of every metric (raw single-step | |
| values are too noisy to threshold — theneuralbase early-stopping guidance) and | |
| returns a ``TripwireStatus``. | |
| Fires (``fire=True``) when ANY of: | |
| (a) **collapse-caught-in-the-act** — the in-loop reward EMA is RISING while | |
| the held-out score EMA has DECLINED for >= ``decline_patience`` | |
| consecutive checkpoints (default 3, matching the "monotone for >=3 | |
| checkpoints" rule). This is the canonical reward-hacking signature. | |
| (b) **KL breach** — the ``kl_to_init`` EMA exceeds ``kl_hard_stop`` (default | |
| 0.08 nats/token) on/after ``min_steps``. | |
| (c) **proxy-real gap blowout** — the Hacking Gap (proxy gain - real gain | |
| since baseline) widens beyond ``max_proxy_real_gap``, even if held-out | |
| has not strictly declined for the full patience window (a fast | |
| single-generation divergence). | |
| No tripwire fires before ``min_steps`` (avoids halting on early-run noise, | |
| when both signals are still warming up). | |
| The guard is idempotent in the sense that re-querying ``last_status`` or | |
| calling ``should_halt()`` does not advance state — only ``update`` does. | |
| """ | |
| # --- thresholds (calibratable; see calibrate_kl_threshold) --------------- | |
| kl_hard_stop: float = 0.08 # nats/token; top of GRPO "good" band | |
| max_proxy_real_gap: float = 0.10 # absolute Hacking-Gap blowout ceiling | |
| # --- temporal gates ------------------------------------------------------ | |
| min_steps: int = 20 # no fire before this many updates | |
| decline_patience: int = 3 # consecutive held-out declines to fire (a) | |
| # --- denoising ----------------------------------------------------------- | |
| ema_alpha: float = 0.9 # EMA weight on the PRIOR (0.9 => slow) | |
| rise_eps: float = 1e-4 # min EMA delta to count as "rising"/"declining" | |
| # --- internal state (do not set directly) -------------------------------- | |
| _n: int = field(default=0, init=False) | |
| _in_loop_ema: float | None = field(default=None, init=False) | |
| _heldout_ema: float | None = field(default=None, init=False) | |
| _kl_ema: float | None = field(default=None, init=False) | |
| _entropy_ema: float | None = field(default=None, init=False) | |
| _reward_std_ema: float | None = field(default=None, init=False) | |
| _in_loop_baseline: float | None = field(default=None, init=False) | |
| _heldout_baseline: float | None = field(default=None, init=False) | |
| _prev_in_loop_ema: float | None = field(default=None, init=False) | |
| _prev_heldout_ema: float | None = field(default=None, init=False) | |
| _heldout_decline_streak: int = field(default=0, init=False) | |
| _last_status: TripwireStatus | None = field(default=None, init=False) | |
| _fired: bool = field(default=False, init=False) | |
| def __post_init__(self) -> None: | |
| if not (0.0 <= self.ema_alpha < 1.0): | |
| raise ValueError( | |
| f"ema_alpha must be in [0, 1), got {self.ema_alpha!r} " | |
| "(it is the weight on the PRIOR EMA)." | |
| ) | |
| if self.kl_hard_stop <= 0.0: | |
| raise ValueError(f"kl_hard_stop must be > 0, got {self.kl_hard_stop!r}") | |
| if self.decline_patience < 1: | |
| raise ValueError( | |
| f"decline_patience must be >= 1, got {self.decline_patience!r}" | |
| ) | |
| # ------------------------------------------------------------------------ | |
| # core API | |
| # ------------------------------------------------------------------------ | |
| def update( | |
| self, | |
| round_idx: int, | |
| in_loop_reward: float, | |
| heldout_score: float, | |
| kl_to_init: float | None = None, | |
| entropy: float | None = None, | |
| reward_std: float | None = None, | |
| ) -> TripwireStatus: | |
| """Fold one checkpoint's metrics in and return the current verdict. | |
| Args: | |
| round_idx: the generation / round index (for logging; not used for | |
| gating — the internal update counter ``_n`` drives ``min_steps`` | |
| so the guard is robust to non-contiguous round indices). | |
| in_loop_reward: mean in-loop (proxy / oracle) reward this round. This | |
| is what the policy is optimizing against. | |
| heldout_score: mean score on the DISJOINT held-out eval pool this | |
| round — REAL tasks the generator never trains on. See | |
| ``composer_replication.safety.holdout`` design notes / the | |
| ``HeldoutSplit`` discipline; if held-out drifts with the train | |
| set the gap signal is meaningless. | |
| kl_to_init: optional token-mean KL(policy || init) in nats/token | |
| (this repo's ``token_mean_kl`` convention). NOT sequence-level KL. | |
| entropy: optional policy entropy (early-warning of entropy collapse, | |
| "the silent killer of RLVR generalization"). Tracked + exposed, | |
| not currently a hard gate. | |
| reward_std: optional std of the reward distribution (tracked; a | |
| collapsing std is an early collapse signal). | |
| Returns: | |
| A ``TripwireStatus``. Once the guard has fired, every subsequent | |
| ``update`` keeps ``fire=True`` (latched) so a transient recovery | |
| after a detected collapse cannot silently un-halt the run. | |
| """ | |
| self._n += 1 | |
| # --- EMA folds (alpha on the prior; first sample seeds the EMA) ------- | |
| self._in_loop_ema = self._fold(self._in_loop_ema, float(in_loop_reward)) | |
| self._heldout_ema = self._fold(self._heldout_ema, float(heldout_score)) | |
| if kl_to_init is not None: | |
| self._kl_ema = self._fold(self._kl_ema, float(kl_to_init)) | |
| if entropy is not None: | |
| self._entropy_ema = self._fold(self._entropy_ema, float(entropy)) | |
| if reward_std is not None: | |
| self._reward_std_ema = self._fold(self._reward_std_ema, float(reward_std)) | |
| # --- baselines: seed on the first update so gains are measured from | |
| # run start (the RSI Hacking-Gap is a gain-since-baseline quantity). - | |
| if self._in_loop_baseline is None: | |
| self._in_loop_baseline = self._in_loop_ema | |
| if self._heldout_baseline is None: | |
| self._heldout_baseline = self._heldout_ema | |
| # --- track the held-out decline streak (uses EMA deltas, denoised) ---- | |
| in_loop_rising = ( | |
| self._prev_in_loop_ema is not None | |
| and (self._in_loop_ema - self._prev_in_loop_ema) > self.rise_eps | |
| ) | |
| heldout_declining = ( | |
| self._prev_heldout_ema is not None | |
| and (self._heldout_ema - self._prev_heldout_ema) < -self.rise_eps | |
| ) | |
| # The collapse signature is held-out DOWN while in-loop UP. We only count | |
| # a decline toward the streak when in-loop is simultaneously rising — a | |
| # held-out dip during an in-loop dip is just noise / a hard batch, not | |
| # reward hacking. | |
| if heldout_declining and in_loop_rising: | |
| self._heldout_decline_streak += 1 | |
| elif not heldout_declining: | |
| self._heldout_decline_streak = 0 | |
| # If held-out declines while in-loop is flat or falling, neither branch runs | |
| # and the streak is held unchanged. Any checkpoint where held-out does not | |
| # decline resets it to zero. | |
| gap = self.proxy_real_gap() | |
| status = self._evaluate(round_idx, gap) | |
| # advance "previous EMA" trackers AFTER evaluation | |
| self._prev_in_loop_ema = self._in_loop_ema | |
| self._prev_heldout_ema = self._heldout_ema | |
| self._last_status = status | |
| if status.fire: | |
| self._fired = True | |
| return status | |
| def _evaluate(self, round_idx: int, gap: float) -> TripwireStatus: | |
| """Decide the verdict from current state. Pure (no state mutation).""" | |
| assert self._in_loop_ema is not None and self._heldout_ema is not None | |
| base = dict( | |
| step=round_idx, | |
| proxy_real_gap=gap, | |
| in_loop_ema=self._in_loop_ema, | |
| heldout_ema=self._heldout_ema, | |
| kl_ema=self._kl_ema, | |
| ) | |
| # Latched: once fired, stay fired (cannot silently un-halt). | |
| if self._fired: | |
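| prev_reason = self._last_status.reason if self._last_status else "collapse" | |
| # Strip any earlier prefix so repeated updates do not stack "latched: latched: ...". | |
| if prev_reason.startswith("latched: "): | |
| prev_reason = prev_reason[len("latched: "):] | |
| return TripwireStatus(fire=True, reason=f"latched: {prev_reason}", **base) | |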
| # Warm-up guard: never fire on early-run noise. | |
| if self._n < self.min_steps: | |
| return TripwireStatus(fire=False, reason="", **base) | |
| # (b) KL hard stop — checked first; it's the cheapest unambiguous breach. | |
| if self._kl_ema is not None and self._kl_ema > self.kl_hard_stop: | |
| return TripwireStatus( | |
| fire=True, | |
| reason=( | |
| f"kl_to_init EMA {self._kl_ema:.4f} nats/token exceeds hard " | |
| f"stop {self.kl_hard_stop:.4f} (policy drifting from init)" | |
| ), | |
| **base, | |
| ) | |
| # (a) collapse caught in the act — held-out declines while in-loop rises. | |
| if self._heldout_decline_streak >= self.decline_patience: | |
| return TripwireStatus( | |
| fire=True, | |
| reason=( | |
| f"reward-hacking signature: held-out score declined while " | |
| f"in-loop reward rose for {self._heldout_decline_streak} " | |
| f"consecutive checkpoints (Hacking Gap {gap:.4f})" | |
| ), | |
| **base, | |
| ) | |
| # (c) proxy-real gap blowout — fast single-generation divergence. | |
| if gap > self.max_proxy_real_gap: | |
| return TripwireStatus( | |
| fire=True, | |
| reason=( | |
| f"proxy-real Hacking Gap {gap:.4f} exceeds ceiling " | |
| f"{self.max_proxy_real_gap:.4f} (proxy reward improving far " | |
| f"faster than real held-out eval)" | |
| ), | |
| **base, | |
| ) | |
| return TripwireStatus(fire=False, reason="", **base) | |
| # ------------------------------------------------------------------------ | |
| # query helpers (do NOT advance state — idempotent) | |
| # ------------------------------------------------------------------------ | |
| def should_halt(self) -> bool: | |
| """True if the most recent ``update`` produced a halt verdict. | |
| Idempotent: querying does not advance the EMA state. | |
| """ | |
| return self._last_status is not None and self._last_status.fire | |
| @property | |
| def last_status(self) -> TripwireStatus | None: | |
| """The most recent verdict, or None if ``update`` was never called.""" | |
| return self._last_status | |
| def raise_if_fired(self, status: TripwireStatus | None = None) -> None: | |
| """Convert a fired verdict into a typed ``CollapseStopError`` exception. | |
| Pass the status returned by ``update`` (or omit to use ``last_status``). | |
| Trainer loops that prefer exception-based control flow call this right | |
| after ``update``; loops that prefer flag-checking just read | |
| ``status.fire`` / ``should_halt()``. | |
| """ | |
| st = status if status is not None else self._last_status | |
| if st is not None and st.fire: | |
| raise CollapseStopError(st) | |
| def proxy_real_gap(self) -> float: | |
| """The RSI Hacking Gap = (in-loop gain) - (held-out gain), both measured | |
| as EMA-minus-baseline since run start. | |
| Returns 0.0 before the first ``update`` (no baseline yet). A positive, | |
| widening value is the reward-hacking fingerprint: the proxy the policy | |
| optimizes is improving more than the real held-out objective. | |
| """ | |
| if ( | |
| self._in_loop_ema is None | |
| or self._heldout_ema is None | |
| or self._in_loop_baseline is None | |
| or self._heldout_baseline is None | |
| ): | |
| return 0.0 | |
| in_loop_gain = self._in_loop_ema - self._in_loop_baseline | |
| heldout_gain = self._heldout_ema - self._heldout_baseline | |
| return in_loop_gain - heldout_gain | |
| # ------------------------------------------------------------------------ | |
| # calibration | |
| # ------------------------------------------------------------------------ | |
| def calibrate_kl_threshold( | |
| self, baseline_kls: list[float], factor: float = 3.0 | |
| ) -> float: | |
| """Set ``kl_hard_stop`` to ``factor`` x the mean of early-run baseline KLs. | |
| theneuralbase guidance: "Record baseline KL during the first ~100 steps, | |
| set max to 3x that." Single fixed thresholds are dataset-dependent; this | |
| adapts to the run's own KL scale. | |
| SAFETY CLAMP: calibration may only ever TIGHTEN the hard stop, never | |
| loosen it. The returned (and stored) threshold is | |
| ``max(min(factor x mean baseline, current kl_hard_stop), 1e-6)``, so a noisy / | |
| already-drifting baseline cannot raise the ceiling above its current value | |
| (0.08 nats/token by default). | |
| Args: | |
| baseline_kls: per-step token-mean KL values from early in the run. | |
| KL is non-negative by definition, so every value must be >= 0. | |
| factor: multiplier on the baseline mean. Must be > 0. | |
| Returns: | |
| The new ``kl_hard_stop`` (also stored on the instance), always > 0. | |
| Raises: | |
| ValueError: if ``baseline_kls`` is empty, ``factor <= 0``, or any | |
| baseline KL is negative. | |
| """ | |
| if not baseline_kls: | |
| raise ValueError("baseline_kls must be non-empty to calibrate") | |
| # GUARD (R4): a non-positive factor or a negative baseline would make | |
| # `calibrated` <= 0, and min(<=0, 0.08) = a NON-POSITIVE kl_hard_stop — | |
| # after which the KL tripwire fires on EVERY healthy step (any positive | |
| # KL EMA exceeds a non-positive ceiling). KL is non-negative by | |
| # definition, so these inputs are nonsensical; reject them loudly rather | |
| # than silently disarm-by-inverting the guard. | |
| if factor <= 0: | |
| raise ValueError(f"factor must be > 0, got {factor!r}") | |
| if any(k < 0 for k in baseline_kls): | |
| raise ValueError( | |
| f"baseline_kls must all be >= 0 (KL is non-negative); got a " | |
| f"negative value in {baseline_kls!r}" | |
| ) | |
| mean_kl = sum(baseline_kls) / len(baseline_kls) | |
| calibrated = factor * mean_kl | |
| # Only tighten: never let calibration loosen past the current ceiling. | |
| # Floor at a small positive epsilon so an all-zero baseline (mean_kl==0) | |
| # can't drive the ceiling to exactly 0 and fire on the first positive KL. | |
| self.kl_hard_stop = max(min(calibrated, self.kl_hard_stop), 1e-6) | |
| return self.kl_hard_stop | |
| # ------------------------------------------------------------------------ | |
| # internals | |
| # ------------------------------------------------------------------------ | |
| def _fold(self, prev: float | None, x: float) -> float: | |
| """EMA fold; the first observation seeds the EMA (no warm-up bias).""" | |
| if prev is None: | |
| return x | |
| return self.ema_alpha * prev + (1.0 - self.ema_alpha) * x | |
| def kl_token_trust_filter(logratio_sq_half: float, threshold: float = 0.08) -> bool: | |
| """Per-token KL trust-region mask, mirroring torchrl's GRPO "KL-Mask". | |
| torchrl masks any TOKEN whose ``0.5 * (log pi/pi_ref)^2`` (the Schulman k2 | |
| estimator of per-token KL) exceeds a threshold, forming a per-token trust | |
| region. This helper returns True when the token should be MASKED OUT (its | |
| KL contribution is too large), so it can be wired into a loss later without | |
| pulling torch into this module — the caller computes ``0.5 * logratio**2``. | |
| Args: | |
| logratio_sq_half: ``0.5 * (log pi/pi_ref)^2`` for one token (nats). | |
| threshold: per-token KL ceiling (default 0.08 nats, the same band as the | |
| run-level hard stop). | |
| Returns: | |
| True if the token exceeds the trust region and should be masked. | |
| """ | |
| return logratio_sq_half > threshold | |
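
The two quantities the module leans on, the EMA-based Hacking Gap and the per-token versus sequence-summed KL scale, can be illustrated with a minimal standalone sketch. It does not import `composer_replication`; the helper names (`ema_fold`, `hacking_gap_trace`, `k2_per_token`) and all reward and log-ratio numbers are hypothetical, chosen only to mirror the fold and gap arithmetic described in the docstrings above.

```python
from __future__ import annotations


def ema_fold(prev: float | None, x: float, alpha: float = 0.9) -> float:
    """EMA with weight `alpha` on the prior; the first sample seeds the EMA."""
    return x if prev is None else alpha * prev + (1.0 - alpha) * x


def hacking_gap_trace(
    in_loop: list[float], heldout: list[float], alpha: float = 0.9
) -> list[float]:
    """Per-checkpoint Hacking Gap: (proxy EMA gain) - (held-out EMA gain)."""
    proxy_ema: float | None = None
    real_ema: float | None = None
    proxy_base = real_base = 0.0
    gaps: list[float] = []
    for i, (proxy, real) in enumerate(zip(in_loop, heldout)):
        proxy_ema = ema_fold(proxy_ema, proxy, alpha)
        real_ema = ema_fold(real_ema, real, alpha)
        if i == 0:  # baselines are seeded from the first checkpoint
            proxy_base, real_base = proxy_ema, real_ema
        gaps.append((proxy_ema - proxy_base) - (real_ema - real_base))
    return gaps


def k2_per_token(logratios: list[float]) -> list[float]:
    """Schulman k2 estimate of per-token KL: 0.5 * logratio**2 (nats)."""
    return [0.5 * lr * lr for lr in logratios]


# Synthetic run (made-up numbers): proxy reward climbs while held-out decays.
in_loop = [0.50 + 0.02 * t for t in range(30)]
heldout = [0.50 - 0.01 * t for t in range(30)]
gaps = hacking_gap_trace(in_loop, heldout)

# Units check: 100 tokens, each with log-ratio 0.2, so k2 is 0.02 nats per token.
k2 = k2_per_token([0.2] * 100)
token_mean = sum(k2) / len(k2)  # per-token scale, comparable to the 0.08 default
sequence_sum = sum(k2)          # sequence scale, NOT comparable to 0.08

print(f"final gap: {gaps[-1]:.4f}")
print(f"token-mean KL: {token_mean:.4f}  sequence-sum KL: {sequence_sum:.4f}")
```

In this toy run the gap starts at zero and widens at every checkpoint, which is the fingerprint tripwire (c) thresholds against. The same 100-token sequence sits below the 0.08 default when averaged per token but far above it when summed, which is why a sequence-summed KL fed to the per-token hard stop would fire immediately.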