# Real-trace SDPO alignment validation

Runs the full **ingestion → adapter → collator → SDPO** data path against your
own local Claude Code session logs (`~/.claude/projects/**/*.jsonl`) and reports
the live SDPO mask alignment ratio. This is the population-level proof that
Wave 21's `_build_chat_aligned_mask` fix holds on real-world data, not just the
synthetic fixture.

## Run

```bash
python examples/validate_real_trace_alignment/run.py
# options:
#   --projects-dir ~/.claude/projects   where to discover sessions
#   --max-sessions 8                    how many error-bearing sessions to sample
#   --model Qwen/Qwen2.5-0.5B-Instruct  a real chat-template tokenizer
#   --pass-threshold 0.95               min alignment ratio to PASS
#   --strip-thinking                    (default OFF — see below)
```

Exit codes: `0` PASS (alignment ≥ threshold and no crashes), `1` FAIL, `2` when
no error-bearing sessions are found or the tokenizer has no chat template.
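
The exit-code contract can be read as a small decision function. This is a minimal sketch, not the code in `run.py`: the name `exit_code`, its parameters, and the order in which the conditions are checked are all assumptions made for illustration.

```python
EXIT_PASS = 0      # alignment >= threshold and no crashes
EXIT_FAIL = 1      # alignment below threshold, or a crash
EXIT_NO_DATA = 2   # nothing to measure: no error-bearing sessions or no chat template


def exit_code(
    aligned: int,
    measured: int,
    error_sessions: int,
    threshold: float = 0.95,
    crashed: bool = False,
    has_chat_template: bool = True,
) -> int:
    """Hypothetical mapping from run statistics to the documented exit codes."""
    # Assumption: the "cannot measure" cases are checked before pass/fail.
    if error_sessions == 0 or not has_chat_template:
        return EXIT_NO_DATA
    # Assumption: a crash, or zero measurable in-loss positions, counts as FAIL.
    if crashed or measured == 0:
        return EXIT_FAIL
    return EXIT_PASS if aligned / measured >= threshold else EXIT_FAIL
```

A CI job could branch on the return value of the real script in the same way, for example treating `2` as "skipped" rather than "failed".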

## What it measures

- **ingestion yield** — states emitted, error sites detected
- **structural vs string-only flagging** — the Wave 21 `is_error` fix. The
  ingester sets a structural `tool_error: True` boolean; `string-tag-only`
  should be ~0 (the brittle `[TOOL_RESULT (ERROR)]` grep is fallback-only).
- **empty-recovery rate** — see below.
- **SDPO alignment** — fraction of in-loss `sdpo_loss_mask` positions where
  student token id == teacher token id. ~100% means the mask lands exactly on
  content tokens; <95% means the chat-template drift fix has regressed.
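
The alignment metric in the last bullet can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: the function name `alignment_ratio` is hypothetical, and it assumes the student and teacher token id sequences have already been brought to the same length so that a single mask indexes both.

```python
from typing import Sequence, Tuple


def alignment_ratio(
    student_ids: Sequence[int],
    teacher_ids: Sequence[int],
    loss_mask: Sequence[int],
) -> Tuple[int, int]:
    """Return (matched, in_loss) over positions where loss_mask is truthy.

    Assumption: all three sequences are position-aligned and equal in length.
    """
    if not (len(student_ids) == len(teacher_ids) == len(loss_mask)):
        raise ValueError("sequences must have equal length")
    # Only positions that contribute to the loss are counted.
    in_loss = [i for i, m in enumerate(loss_mask) if m]
    matched = sum(1 for i in in_loss if student_ids[i] == teacher_ids[i])
    return matched, len(in_loss)


# Toy data: three in-loss positions, one of which disagrees.
matched, total = alignment_ratio(
    student_ids=[1, 2, 3, 4, 5],
    teacher_ids=[1, 2, 9, 4, 5],
    loss_mask=[0, 0, 1, 1, 1],
)
ratio = matched / total if total else 0.0
```

Returning the two counts rather than a single float keeps the `matched/total` form used in the report and avoids a division when no position is in-loss.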

## The `--strip-thinking` gotcha (important for SDPO)

`ClaudeCodeIngester(strip_thinking=...)` controls whether `[THINKING]` blocks
survive. For most ingestion you strip them. **For SDPO hint-distillation you
must NOT** — on real Claude Code traces the error-*recovery* turn is very often
**pure thinking** (the model reasons about the failure, then silently retries a
tool). Strip it and that turn's content goes empty, so ~67% of error sites carry
no recovery content to distill against and produce a zero-signal SDPO row.

This script therefore defaults to `strip_thinking=False`. The collator also
guards against the empty case (an empty-recovery error turn is treated as a
non-error site rather than firing an all-`ignore_index` mask), but the *signal*
only exists if you keep the thinking. Pass `--strip-thinking` to see the
empty-recovery warning fire.
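
The effect of stripping can be shown on toy data. This is a sketch under assumptions: the block layout (dicts with `type` and `text` keys), the block type names, and the helpers `recovery_text` and `empty_recovery_rate` are hypothetical and do not mirror `ClaudeCodeIngester` internals. The sample turns are made up and are not real trace data.

```python
from typing import Dict, List

Block = Dict[str, str]


def recovery_text(blocks: List[Block], strip_thinking: bool) -> str:
    """Join the content a recovery turn would contribute to distillation."""
    kept = []
    for block in blocks:
        if block["type"] == "text":
            kept.append(block["text"])
        elif block["type"] == "thinking" and not strip_thinking:
            kept.append(block["text"])
        # Assumption: tool_use blocks contribute no distillable content.
    return "\n".join(kept).strip()


def empty_recovery_rate(sites: List[List[Block]], strip_thinking: bool) -> float:
    """Fraction of error sites whose recovery turn ends up empty."""
    if not sites:
        return 0.0
    empty = sum(1 for blocks in sites if not recovery_text(blocks, strip_thinking))
    return empty / len(sites)


# Three made-up recovery turns; the first is pure thinking plus a tool retry.
sites = [
    [{"type": "thinking", "text": "The path was wrong; retry."},
     {"type": "tool_use", "text": ""}],
    [{"type": "thinking", "text": "Missing import."},
     {"type": "text", "text": "Adding the import and rerunning."}],
    [{"type": "text", "text": "Retrying with the corrected flag."}],
]
rate_stripped = empty_recovery_rate(sites, strip_thinking=True)
rate_kept = empty_recovery_rate(sites, strip_thinking=False)
```

With thinking kept, every toy site has content; with thinking stripped, the pure-thinking turn goes empty, which is the case the collator guard described above has to absorb.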

## Representative result (Codeseys' machine, 2026-05-28)

```
sessions processed:       10/10
total error sites:        141
structural-flagged users: 170
string-tag-only users:    0
empty-recovery sites:     0/141 (0%)     # strip_thinking=False
SDPO alignment (REAL):    832/832 = 100.0%
RESULT: PASS ✅
```

With `--strip-thinking`, the same sessions report ~67% empty-recovery sites and
the number of measurable in-loss positions drops accordingly, which makes the
flag's effect directly visible.