Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
Wave 21: adversarial-review fixes — all 9 verified findings closed
Browse files3 reviewers (correctness/security/fidelity) + independent verifier over the
Stage-0 pipeline commit; every P0/P1 empirically confirmed before fixing.
P0: reverse_unified_diff silently corrupted file-add/delete patches (git
apply rc 128: naive +/- swap inverts the ---/+++ order and leaves a stale
`new file mode` header). Fix: route them to the UNREVERSED_MARKER
fallback; regression test with a /dev/null patch.
P1: DECONTAMINATION_LIST omitted SWE-Gym's 11 eval repos (the architecture
named both families) — added; V3 now fully closed; regression test.
P1: COMPOSER_RECIPE_MAPPING.md:20 still asserted KL(teacher||student) as
blog-stated — corrected to direction-unverified per F-12.
P1: build_corpus budget was a soft ceiling (check-then-spend overshot by one
rollout) — now a pre-charge hard ceiling; test asserts cost <= budget.
P2: dedup stats double-counted rows that were both within-run and cross-gen
dups (disjoint partition now); run_id path-traversal guard on RunLayout;
gold_apply_plan() helper dispatches -R for marked diffs; removed unused
collect_trajectory budget param; OpenRouterPolicy key name-mangled +
masked __repr__ + model_slug now recorded in trajectory provenance.
Full suite: 516 passed / 66 skipped. Also includes the deepread vault sources
fetched by the reader agents (research/notes/*).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- composer_replication/datagen/repo_gate.py +22 -5
- composer_replication/datagen/rollout_harness.py +11 -3
- composer_replication/datagen/swesmith_adapter.py +23 -1
- composer_replication/datagen/tests/test_repo_gate.py +13 -3
- composer_replication/datagen/tests/test_swesmith_adapter.py +33 -0
- composer_replication/pipeline/build_corpus.py +6 -1
- composer_replication/pipeline/dedup.py +8 -8
- composer_replication/pipeline/s3_contract.py +10 -0
- composer_replication/pipeline/tests/test_pipeline.py +35 -0
- docs/COMPOSER_RECIPE_MAPPING.md +1 -1
|
@@ -251,13 +251,18 @@ def license_tier(info: LicenseInfo) -> Tier:
|
|
| 251 |
# Benchmark decontamination (V3 / D-5)
|
| 252 |
# ---------------------------------------------------------------------------
|
| 253 |
|
| 254 |
-
#:
|
| 255 |
-
#:
|
| 256 |
-
#:
|
| 257 |
-
#:
|
| 258 |
-
#:
|
|
|
|
|
|
|
|
|
|
|
|
|
| 259 |
DECONTAMINATION_LIST: frozenset[str] = frozenset(
|
| 260 |
{
|
|
|
|
| 261 |
"astropy/astropy",
|
| 262 |
"django/django",
|
| 263 |
"matplotlib/matplotlib",
|
|
@@ -270,6 +275,18 @@ DECONTAMINATION_LIST: frozenset[str] = frozenset(
|
|
| 270 |
"scikit-learn/scikit-learn",
|
| 271 |
"sphinx-doc/sphinx",
|
| 272 |
"sympy/sympy",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 273 |
}
|
| 274 |
)
|
| 275 |
|
|
|
|
| 251 |
# Benchmark decontamination (V3 / D-5)
|
| 252 |
# ---------------------------------------------------------------------------
|
| 253 |
|
| 254 |
+
#: Eval repos we must never train on (V3). Two families:
|
| 255 |
+
#: * the canonical 12 SWE-bench test repos (SWE-bench / -Lite / -Verified /
|
| 256 |
+
#: -Multimodal all draw eval instances from these);
|
| 257 |
+
#: * SWE-Gym's 11 repos (deliberately disjoint from SWE-bench's 12, but they
|
| 258 |
+
#: are a benchmark/eval surface of their own — added per the Wave-21
|
| 259 |
+
#: adversarial review, which caught the architecture spec naming them
|
| 260 |
+
#: while the shipped list silently admitted all 11).
|
| 261 |
+
#: Lowercase "org/repo" form. Extend via a JSON file (list of "org/repo"
|
| 262 |
+
#: strings) passed to is_eval_contaminated(extra_list=...) for new releases.
|
| 263 |
DECONTAMINATION_LIST: frozenset[str] = frozenset(
|
| 264 |
{
|
| 265 |
+
# --- SWE-bench family (12) ---
|
| 266 |
"astropy/astropy",
|
| 267 |
"django/django",
|
| 268 |
"matplotlib/matplotlib",
|
|
|
|
| 275 |
"scikit-learn/scikit-learn",
|
| 276 |
"sphinx-doc/sphinx",
|
| 277 |
"sympy/sympy",
|
| 278 |
+
# --- SWE-Gym (11, arXiv:2412.21139 Table 1) ---
|
| 279 |
+
"pandas-dev/pandas",
|
| 280 |
+
"bokeh/bokeh",
|
| 281 |
+
"project-monai/monai",
|
| 282 |
+
"python/mypy",
|
| 283 |
+
"getmoto/moto",
|
| 284 |
+
"iterative/dvc",
|
| 285 |
+
"dask/dask",
|
| 286 |
+
"conan-io/conan",
|
| 287 |
+
"pydantic/pydantic",
|
| 288 |
+
"modin-project/modin",
|
| 289 |
+
"facebookresearch/hydra",
|
| 290 |
}
|
| 291 |
)
|
| 292 |
|
|
@@ -74,9 +74,14 @@ class OpenRouterPolicy:
|
|
| 74 |
) from e
|
| 75 |
from composer_replication.teacher_replay import _load_api_key
|
| 76 |
self.model_slug = model_slug
|
| 77 |
-
|
|
|
|
|
|
|
| 78 |
self.max_tokens = max_tokens
|
| 79 |
|
|
|
|
|
|
|
|
|
|
| 80 |
def act(self, observation: str, history: list[TrajectoryStep]) -> ToolCall | str:
|
| 81 |
import httpx # noqa: PLC0415
|
| 82 |
|
|
@@ -86,7 +91,7 @@ class OpenRouterPolicy:
|
|
| 86 |
OPENROUTER_URL,
|
| 87 |
json={"model": self.model_slug, "messages": messages,
|
| 88 |
"max_tokens": self.max_tokens, "temperature": 0.2},
|
| 89 |
-
headers={"Authorization": f"Bearer {self.
|
| 90 |
timeout=120.0,
|
| 91 |
)
|
| 92 |
r.raise_for_status()
|
|
@@ -110,7 +115,6 @@ def collect_trajectory(
|
|
| 110 |
policy: RolloutPolicy,
|
| 111 |
*,
|
| 112 |
max_turns: int = 40,
|
| 113 |
-
budget_usd: float | None = None,
|
| 114 |
provenance: dict | None = None,
|
| 115 |
) -> CanonicalTrajectory:
|
| 116 |
"""Run one episode and return the graded CanonicalTrajectory.
|
|
@@ -149,6 +153,9 @@ def collect_trajectory(
|
|
| 149 |
final = env.step({"type": "submit"})
|
| 150 |
|
| 151 |
info = final.info or {}
|
|
|
|
|
|
|
|
|
|
| 152 |
return CanonicalTrajectory(
|
| 153 |
task_id=task.task_id,
|
| 154 |
steps=steps,
|
|
@@ -157,6 +164,7 @@ def collect_trajectory(
|
|
| 157 |
hacked=bool(info.get("hacked", False)),
|
| 158 |
provenance={"source": "rollout_harness",
|
| 159 |
"policy": type(policy).__name__,
|
|
|
|
| 160 |
**(provenance or {})},
|
| 161 |
)
|
| 162 |
|
|
|
|
| 74 |
) from e
|
| 75 |
from composer_replication.teacher_replay import _load_api_key
|
| 76 |
self.model_slug = model_slug
|
| 77 |
+
# Name-mangled so vars()/asdict-style dumps and debug reprs don't
|
| 78 |
+
# expose the key (Wave-21 review P2).
|
| 79 |
+
self.__api_key = api_key or _load_api_key()
|
| 80 |
self.max_tokens = max_tokens
|
| 81 |
|
| 82 |
+
def __repr__(self) -> str: # mask the credential in any debug output
|
| 83 |
+
return f"OpenRouterPolicy(model_slug={self.model_slug!r}, api_key='***')"
|
| 84 |
+
|
| 85 |
def act(self, observation: str, history: list[TrajectoryStep]) -> ToolCall | str:
|
| 86 |
import httpx # noqa: PLC0415
|
| 87 |
|
|
|
|
| 91 |
OPENROUTER_URL,
|
| 92 |
json={"model": self.model_slug, "messages": messages,
|
| 93 |
"max_tokens": self.max_tokens, "temperature": 0.2},
|
| 94 |
+
headers={"Authorization": f"Bearer {self.__api_key}"},
|
| 95 |
timeout=120.0,
|
| 96 |
)
|
| 97 |
r.raise_for_status()
|
|
|
|
| 115 |
policy: RolloutPolicy,
|
| 116 |
*,
|
| 117 |
max_turns: int = 40,
|
|
|
|
| 118 |
provenance: dict | None = None,
|
| 119 |
) -> CanonicalTrajectory:
|
| 120 |
"""Run one episode and return the graded CanonicalTrajectory.
|
|
|
|
| 153 |
final = env.step({"type": "submit"})
|
| 154 |
|
| 155 |
info = final.info or {}
|
| 156 |
+
# Model attribution (Wave-21 review P2): a corpus must record WHICH model
|
| 157 |
+
# produced each trajectory; policies expose it via a `model_slug` attr.
|
| 158 |
+
model = getattr(policy, "model_slug", None)
|
| 159 |
return CanonicalTrajectory(
|
| 160 |
task_id=task.task_id,
|
| 161 |
steps=steps,
|
|
|
|
| 164 |
hacked=bool(info.get("hacked", False)),
|
| 165 |
provenance={"source": "rollout_harness",
|
| 166 |
"policy": type(policy).__name__,
|
| 167 |
+
**({"model": model} if model else {}),
|
| 168 |
**(provenance or {})},
|
| 169 |
)
|
| 170 |
|
|
@@ -113,8 +113,14 @@ def reverse_unified_diff(patch: str) -> str | None:
|
|
| 113 |
"""
|
| 114 |
if not patch or "@@" not in patch:
|
| 115 |
return None
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 116 |
unsupported = ("old mode ", "new mode ", "rename from ", "rename to ",
|
| 117 |
-
"copy from ", "copy to ", "GIT binary patch"
|
|
|
|
| 118 |
if any(marker in patch for marker in unsupported):
|
| 119 |
return None
|
| 120 |
|
|
@@ -225,6 +231,21 @@ class SwesmithAdapter:
|
|
| 225 |
return task, SwesmithMeta(strategy=strategy, diff_reversed=diff_reversed)
|
| 226 |
|
| 227 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 228 |
def load_swesmith_instances(
|
| 229 |
path_or_hf_id: str,
|
| 230 |
*,
|
|
@@ -263,6 +284,7 @@ __all__ = [
|
|
| 263 |
"SwesmithAdapter",
|
| 264 |
"SwesmithMeta",
|
| 265 |
"UNREVERSED_MARKER",
|
|
|
|
| 266 |
"load_swesmith_instances",
|
| 267 |
"parse_strategy",
|
| 268 |
"reverse_unified_diff",
|
|
|
|
| 113 |
"""
|
| 114 |
if not patch or "@@" not in patch:
|
| 115 |
return None
|
| 116 |
+
# "new file mode"/"deleted file mode" added per the Wave-21 adversarial
|
| 117 |
+
# review (P0): naively swapping +/- on a file-add patch emits
|
| 118 |
+
# `+++ /dev/null` before `--- a/<file>` with a stale `new file mode`
|
| 119 |
+
# header — git apply REJECTS it (rc 128, "inconsistent new filename"),
|
| 120 |
+
# so these must take the UNREVERSED_MARKER fallback, not silent corruption.
|
| 121 |
unsupported = ("old mode ", "new mode ", "rename from ", "rename to ",
|
| 122 |
+
"copy from ", "copy to ", "GIT binary patch",
|
| 123 |
+
"new file mode ", "deleted file mode ")
|
| 124 |
if any(marker in patch for marker in unsupported):
|
| 125 |
return None
|
| 126 |
|
|
|
|
| 231 |
return task, SwesmithMeta(strategy=strategy, diff_reversed=diff_reversed)
|
| 232 |
|
| 233 |
|
| 234 |
+
def gold_apply_plan(task_golden_diff: str) -> tuple[str, str]:
|
| 235 |
+
"""How gate-4 validation must apply this task's gold diff.
|
| 236 |
+
|
| 237 |
+
Returns ``(diff_text, git_apply_flag)`` where flag is ``""`` (apply
|
| 238 |
+
forward) or ``"-R"`` (apply in reverse — the UNREVERSED_MARKER case where
|
| 239 |
+
`golden_diff` still holds the FORWARD bug patch). Exported per the Wave-21
|
| 240 |
+
adversarial review (P2): without this dispatch a naive `git apply` on a
|
| 241 |
+
marked diff conflicts and gate-4 silently discards a perfectly good task —
|
| 242 |
+
and the marker path is now the NORMAL route for file-add/delete patches.
|
| 243 |
+
"""
|
| 244 |
+
if task_golden_diff.startswith(UNREVERSED_MARKER):
|
| 245 |
+
return task_golden_diff[len(UNREVERSED_MARKER):], "-R"
|
| 246 |
+
return task_golden_diff, ""
|
| 247 |
+
|
| 248 |
+
|
| 249 |
def load_swesmith_instances(
|
| 250 |
path_or_hf_id: str,
|
| 251 |
*,
|
|
|
|
| 284 |
"SwesmithAdapter",
|
| 285 |
"SwesmithMeta",
|
| 286 |
"UNREVERSED_MARKER",
|
| 287 |
+
"gold_apply_plan",
|
| 288 |
"load_swesmith_instances",
|
| 289 |
"parse_strategy",
|
| 290 |
"reverse_unified_diff",
|
|
@@ -301,10 +301,14 @@ def test_license_tier_mapping(spdx: str, tier: Tier):
|
|
| 301 |
# ---------------------------------------------------------------------
|
| 302 |
|
| 303 |
|
| 304 |
-
def
|
| 305 |
-
|
|
|
|
|
|
|
| 306 |
assert "django/django" in DECONTAMINATION_LIST
|
| 307 |
assert "sympy/sympy" in DECONTAMINATION_LIST
|
|
|
|
|
|
|
| 308 |
|
| 309 |
|
| 310 |
@pytest.mark.parametrize(
|
|
@@ -325,7 +329,6 @@ def test_is_eval_contaminated_hits(repo: str):
|
|
| 325 |
@pytest.mark.parametrize(
|
| 326 |
"repo",
|
| 327 |
[
|
| 328 |
-
"pandas-dev/pandas",
|
| 329 |
"https://github.com/torvalds/linux",
|
| 330 |
"someuser/django", # fork-org differs: NOT the eval repo
|
| 331 |
],
|
|
@@ -334,6 +337,13 @@ def test_is_eval_contaminated_misses(repo: str):
|
|
| 334 |
assert is_eval_contaminated(repo) is False
|
| 335 |
|
| 336 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 337 |
def test_normalize_repo_forms():
|
| 338 |
assert normalize_repo("git@github.com:PSF/Requests.git") == "psf/requests"
|
| 339 |
assert normalize_repo("https://github.com/pydata/xarray/tree/main") == "pydata/xarray"
|
|
|
|
| 301 |
# ---------------------------------------------------------------------
|
| 302 |
|
| 303 |
|
| 304 |
+
def test_decontamination_list_covers_swebench_and_swegym():
|
| 305 |
+
# 12 SWE-bench eval repos + 11 SWE-Gym repos (Wave-21 review P1: the
|
| 306 |
+
# architecture names both families; the original list shipped only 12).
|
| 307 |
+
assert len(DECONTAMINATION_LIST) == 23
|
| 308 |
assert "django/django" in DECONTAMINATION_LIST
|
| 309 |
assert "sympy/sympy" in DECONTAMINATION_LIST
|
| 310 |
+
assert "pandas-dev/pandas" in DECONTAMINATION_LIST # SWE-Gym
|
| 311 |
+
assert "getmoto/moto" in DECONTAMINATION_LIST # SWE-Gym
|
| 312 |
|
| 313 |
|
| 314 |
@pytest.mark.parametrize(
|
|
|
|
| 329 |
@pytest.mark.parametrize(
|
| 330 |
"repo",
|
| 331 |
[
|
|
|
|
| 332 |
"https://github.com/torvalds/linux",
|
| 333 |
"someuser/django", # fork-org differs: NOT the eval repo
|
| 334 |
],
|
|
|
|
| 337 |
assert is_eval_contaminated(repo) is False
|
| 338 |
|
| 339 |
|
| 340 |
+
def test_swegym_repos_are_contaminated():
|
| 341 |
+
"""Wave-21 review P1 regression: SWE-Gym's 11 repos must hit the gate."""
|
| 342 |
+
for repo in ("pandas-dev/pandas", "getmoto/moto", "python/mypy",
|
| 343 |
+
"https://github.com/Project-MONAI/MONAI"):
|
| 344 |
+
assert is_eval_contaminated(repo) is True, repo
|
| 345 |
+
|
| 346 |
+
|
| 347 |
def test_normalize_repo_forms():
|
| 348 |
assert normalize_repo("git@github.com:PSF/Requests.git") == "psf/requests"
|
| 349 |
assert normalize_repo("https://github.com/pydata/xarray/tree/main") == "pydata/xarray"
|
|
@@ -163,3 +163,36 @@ def test_load_local_jsonl(tmp_path):
|
|
| 163 |
rows = load_swesmith_instances(str(p), limit=3)
|
| 164 |
assert len(rows) == 3
|
| 165 |
assert rows[0]["instance_id"] == "r__x.abc.pr_0"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 163 |
rows = load_swesmith_instances(str(p), limit=3)
|
| 164 |
assert len(rows) == 3
|
| 165 |
assert rows[0]["instance_id"] == "r__x.abc.pr_0"
|
| 166 |
+
|
| 167 |
+
|
| 168 |
+
# ---------------------------------------------------------------------
|
| 169 |
+
# Wave-21 adversarial-review regressions
|
| 170 |
+
# ---------------------------------------------------------------------
|
| 171 |
+
|
| 172 |
+
FILE_ADD_PATCH = """\
|
| 173 |
+
diff --git a/newfile.py b/newfile.py
|
| 174 |
+
new file mode 100644
|
| 175 |
+
index 0000000..1111111
|
| 176 |
+
--- /dev/null
|
| 177 |
+
+++ b/newfile.py
|
| 178 |
+
@@ -0,0 +1,2 @@
|
| 179 |
+
+def fresh():
|
| 180 |
+
+ return 1
|
| 181 |
+
"""
|
| 182 |
+
|
| 183 |
+
|
| 184 |
+
def test_reverse_refuses_file_add_and_delete_patches():
|
| 185 |
+
"""Review P0: naive reversal of a file-add patch emits git-apply-rejected
|
| 186 |
+
output (+++ before ---, stale `new file mode`). Must take the marker path."""
|
| 187 |
+
assert reverse_unified_diff(FILE_ADD_PATCH) is None
|
| 188 |
+
task, meta = SwesmithAdapter().to_task_with_meta(_instance(patch=FILE_ADD_PATCH))
|
| 189 |
+
assert meta.diff_reversed is False
|
| 190 |
+
assert task.golden_diff.startswith(UNREVERSED_MARKER)
|
| 191 |
+
|
| 192 |
+
|
| 193 |
+
def test_gold_apply_plan_dispatch():
|
| 194 |
+
from composer_replication.datagen.swesmith_adapter import gold_apply_plan
|
| 195 |
+
diff, flag = gold_apply_plan("normal diff text")
|
| 196 |
+
assert (diff, flag) == ("normal diff text", "")
|
| 197 |
+
diff, flag = gold_apply_plan(UNREVERSED_MARKER + "bug patch body")
|
| 198 |
+
assert diff == "bug patch body" and flag == "-R"
|
|
@@ -87,7 +87,12 @@ def build_corpus(
|
|
| 87 |
traj_rows: list[dict] = []
|
| 88 |
partial = False
|
| 89 |
for task in train_tasks:
|
| 90 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 91 |
partial = True
|
| 92 |
break
|
| 93 |
traj = collect_trajectory(env_factory(), task, policy_factory(),
|
|
|
|
| 87 |
traj_rows: list[dict] = []
|
| 88 |
partial = False
|
| 89 |
for task in train_tasks:
|
| 90 |
+
# Hard ceiling (Wave-21 review P1): a rollout only starts if its cost
|
| 91 |
+
# still fits — pre-charging prevents the one-rollout overshoot the
|
| 92 |
+
# old check-then-spend ordering allowed.
|
| 93 |
+
if manifest.budget_usd is not None and (
|
| 94 |
+
manifest.cost_usd + cost_per_rollout_usd > manifest.budget_usd
|
| 95 |
+
):
|
| 96 |
partial = True
|
| 97 |
break
|
| 98 |
traj = collect_trajectory(env_factory(), task, policy_factory(),
|
|
@@ -96,18 +96,18 @@ def dedup(
|
|
| 96 |
"""
|
| 97 |
pairs = find_near_duplicates(rows, key_fn, threshold,
|
| 98 |
prior_signatures=prior_signatures)
|
| 99 |
-
drop:
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
kept = [r for i, r in enumerate(rows) if i not in drop]
|
| 106 |
return kept, {
|
| 107 |
"rows_in": len(rows),
|
| 108 |
"rows_kept": len(kept),
|
| 109 |
-
"dropped_within_run": len(
|
| 110 |
-
"dropped_cross_generation": len(
|
| 111 |
"threshold": threshold,
|
| 112 |
}
|
| 113 |
|
|
|
|
| 96 |
"""
|
| 97 |
pairs = find_near_duplicates(rows, key_fn, threshold,
|
| 98 |
prior_signatures=prior_signatures)
|
| 99 |
+
# Partition into disjoint drop-reason sets (Wave-21 review P2: a row that
|
| 100 |
+
# is both a within-run AND cross-generation duplicate must count once;
|
| 101 |
+
# cross-generation wins the attribution since the prior run owns the row).
|
| 102 |
+
drop_cross: set[int] = {i for i, j in pairs if j < 0}
|
| 103 |
+
drop_within: set[int] = {j for _, j in pairs if j >= 0} - drop_cross
|
| 104 |
+
drop = drop_cross | drop_within
|
| 105 |
kept = [r for i, r in enumerate(rows) if i not in drop]
|
| 106 |
return kept, {
|
| 107 |
"rows_in": len(rows),
|
| 108 |
"rows_kept": len(kept),
|
| 109 |
+
"dropped_within_run": len(drop_within),
|
| 110 |
+
"dropped_cross_generation": len(drop_cross),
|
| 111 |
"threshold": threshold,
|
| 112 |
}
|
| 113 |
|
|
@@ -77,6 +77,16 @@ class RunLayout:
|
|
| 77 |
root: str
|
| 78 |
run_id: str
|
| 79 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 80 |
def _p(self, *parts: str) -> str:
|
| 81 |
base = self.root.rstrip("/")
|
| 82 |
return f"{base}/runs/{self.run_id}/" + "/".join(parts)
|
|
|
|
| 77 |
root: str
|
| 78 |
run_id: str
|
| 79 |
|
| 80 |
+
def __post_init__(self) -> None:
|
| 81 |
+
# Defense-in-depth (Wave-21 review P2): run_id is operator-supplied,
|
| 82 |
+
# but a separator or `..` would silently escape the corpus root.
|
| 83 |
+
if not self.run_id or "/" in self.run_id or "\\" in self.run_id \
|
| 84 |
+
or ".." in self.run_id:
|
| 85 |
+
raise ValueError(
|
| 86 |
+
f"run_id {self.run_id!r} must be a single non-empty path "
|
| 87 |
+
"segment (no separators, no '..')."
|
| 88 |
+
)
|
| 89 |
+
|
| 90 |
def _p(self, *parts: str) -> str:
|
| 91 |
base = self.root.rstrip("/")
|
| 92 |
return f"{base}/runs/{self.run_id}/" + "/".join(parts)
|
|
@@ -221,3 +221,38 @@ def test_dataset_card_contents(tmp_path):
|
|
| 221 |
assert "sft_rows: 3" in card
|
| 222 |
assert "REDISTRIBUTABLE: 3" in card
|
| 223 |
assert "Decontamination" in card
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 221 |
assert "sft_rows: 3" in card
|
| 222 |
assert "REDISTRIBUTABLE: 3" in card
|
| 223 |
assert "Decontamination" in card
|
| 224 |
+
|
| 225 |
+
|
| 226 |
+
# ---------------------------------------------------------------------
|
| 227 |
+
# Wave-21 adversarial-review regressions
|
| 228 |
+
# ---------------------------------------------------------------------
|
| 229 |
+
|
| 230 |
+
|
| 231 |
+
def test_budget_is_a_hard_ceiling(tmp_path):
|
| 232 |
+
"""Review P1: cost must never exceed budget (pre-charge check)."""
|
| 233 |
+
tasks = [_task(i) for i in range(6)]
|
| 234 |
+
lay = RunLayout(root=str(tmp_path), run_id="hardcap")
|
| 235 |
+
manifest = RunManifest(run_id="hardcap", created_at="2026-06-09T00:00:00Z",
|
| 236 |
+
source="fixture", budget_usd=0.25)
|
| 237 |
+
out = build_corpus(tasks, _env, _passing_policy, lay, manifest,
|
| 238 |
+
holdout_frac=0.2, holdout_seed=7,
|
| 239 |
+
cost_per_rollout_usd=0.1)
|
| 240 |
+
assert out.cost_usd <= out.budget_usd
|
| 241 |
+
assert out.status == "partial"
|
| 242 |
+
|
| 243 |
+
|
| 244 |
+
def test_run_id_path_traversal_rejected():
|
| 245 |
+
"""Review P2: separators / .. in run_id must be rejected at construction."""
|
| 246 |
+
for bad in ("../../escape", "a/b", "a\\b", "", ".."):
|
| 247 |
+
with pytest.raises(ValueError, match="path"):
|
| 248 |
+
RunLayout(root="/data", run_id=bad)
|
| 249 |
+
|
| 250 |
+
|
| 251 |
+
def test_dedup_stats_partition_disjoint():
|
| 252 |
+
"""Review P2: a row that is both within-run and cross-gen dup counts once."""
|
| 253 |
+
prior_sigs = [minhash_signature(_TEXT_A)]
|
| 254 |
+
rows = [{"text": _TEXT_A}, {"text": _TEXT_A2}]
|
| 255 |
+
kept, stats = dedup(rows, lambda r: r["text"], threshold=0.5,
|
| 256 |
+
prior_signatures=prior_sigs)
|
| 257 |
+
total_dropped = stats["dropped_within_run"] + stats["dropped_cross_generation"]
|
| 258 |
+
assert total_dropped == stats["rows_in"] - stats["rows_kept"]
|
|
@@ -17,7 +17,7 @@ The Cursor blog discusses **only three** training innovations explicitly. Everyt
|
|
| 17 |
- **Same model** acts as both teacher and student. Not two separate models.
|
| 18 |
- The teacher is "the policy at this turn, *with* a hint inserted into the context."
|
| 19 |
- The student is "the policy at this turn, *without* the hint" (the original context).
|
| 20 |
-
- Loss = on-policy KL
|
| 21 |
- Sits **on top of** an outer RLVR (verifiable-reward RL) objective; doesn't replace it.
|
| 22 |
|
| 23 |
**Cited prior art** (Cursor's footnote 1):
|
|
|
|
| 17 |
- **Same model** acts as both teacher and student. Not two separate models.
|
| 18 |
- The teacher is "the policy at this turn, *with* a hint inserted into the context."
|
| 19 |
- The student is "the policy at this turn, *without* the hint" (the original context).
|
| 20 |
+
- Loss = on-policy distillation KL at the turn. DIRECTION UNVERIFIED (deepread F-12): the blog says only "moves the student's token probabilities toward the teacher's" — directionless; published SDPO Eq. 1 is student-first, `KL( student || stopgrad(teacher) )`. Applied **only at the problematic turn**, not over the full trajectory.
|
| 21 |
- Sits **on top of** an outer RLVR (verifiable-reward RL) objective; doesn't replace it.
|
| 22 |
|
| 23 |
**Cited prior art** (Cursor's footnote 1):
|