CompactAI-Papers / STM_paper-md.html
CompactAI's picture
Upload 3 files
7d87e20 verified
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>STM and the Circle Thing</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Geist:wght@300..700&family=Geist+Mono:wght@400;500&display=swap" rel="stylesheet">
<link rel="stylesheet" href="../everywhere.css">
</head>
<body>
<main class="container paper-page">
<div class="paper-page-shell">
<h1 class="paper-page-title">STM and the Circle Thing</h1>
<div class="paper-page-meta">Dragonoid · 2026-05 · PROVED</div>
<article class="paper-body" role="article" aria-label="Paper content">
<p>STM stands for Subtractive Training Method. Dragonoid threw it out in #general and the chat&#39;s been daring each other to &quot;do STM&quot; on a checkpoint ever since. The idea&#39;s almost annoyingly simple once you hear it.</p>
<p>Normally you train a model by shoving the whole dataset at it for an epoch, loss drops, do it again. STM says nah. After that first pass, look at which samples caused the biggest sudden drops in loss, and yank those out. Keep the rest. Train again. Repeat.</p>
<p>The reasoning: a sample that produces a huge plummet in loss is the model going &quot;oh yeah got that one.&quot; Any more gradient steps on that sample after that point is mostly memorization, the model&#39;s just carving that exact example deeper into its weights, not learning anything generalizable. Waste of compute. The samples where loss comes down slow and gradual, those are the ones where the model&#39;s actually figuring things out. That&#39;s generalization. So you let it keep chewing.</p>
<p>End result: shrinking dataset every epoch. Easy stuff drops away, what&#39;s left is the stuff the model still has to actually think about.</p>
<p>Somebody in the thread asked &quot;so generalization happens with STM?&quot; And yeah, that&#39;s the bet. Starve the model of samples it&#39;d otherwise overfit to. Bias toward generalization.</p>
<p>People noted this isn&#39;t quite hard-example mining or curriculum learning, even though it rhymes with em. Curriculum learning usually adds harder stuff over time. STM never adds. Only removes. Dataset shrinks monotonically. You start with everything and trim down to the residue. The &quot;curriculum&quot; isn&#39;t scheduled, it emerges from whatever the model still finds hard.</p>
<p>Also, and this confuses people, the criterion isn&#39;t &quot;loss is currently high on this sample.&quot; It&#39;s &quot;loss DROPPED a lot on this sample.&quot; Derivative thing, not level thing. Sample can have low absolute loss and still be worth training on if its loss has been decreasing gradually. Sample with high loss but a sudden crash? Gone. Cut.</p>
<h2>the other circle thing (Apollonian gasket, different idea, don&#39;t mix em up)</h2>
<p>Same channel&#39;s been throwing around a totally separate circles thing. Storing token embeddings as addresses inside an integer Apollonian gasket, keyed by prime tuples from a twin-prime engine. Token maps to a circle, twin primes become semantic neighbors &quot;for free&quot; cause the +2 gap is the local Apollonian step. Weight matrix becomes the curvature graph. Inference is bounded curvature recall, truncate by curvature, filter by primality, verify Descartes-consistency.</p>
<p>This is NOT the same circle STM talks about. The gasket is embedding storage. Circling is a data-selection pass during training. In principle you could combine em, Mage mused &quot;that might be where my gasket fits&quot;, but they&#39;re different proposals from different people solving different problems. If you hear &quot;circle&quot; without context, ask which one.</p>
<h2>how you&#39;d test this (well, how you WOULD have)</h2>
<p>The obvious experiment, back when it was theoretical:</p>
<ol>
<li>Pretrain a small model to some checkpoint</li>
<li>Fork, one copy keeps normal training, other starts STM</li>
<li>Compare validation loss after equal compute, not equal epochs (STM epochs get cheaper as the set shrinks)</li>
</ol>
<p>If STM wins it should generalize better and overfit less. If it loses, most likely the easy samples were carrying signal the model still needed and dropping em caused forgetting.</p>
<p>Somebody also suggested doing STM on pretraining then normal finetuning on top. Makes sense, don&#39;t subtract during a phase where you&#39;re already dataset-starved.</p>
<h2>update: proven</h2>
<p>The STM theory is now fully proven with the final script. Key finding: STM saves training compute and focuses the model on hard data. How much compute you save depends on the mastery floor setting, turns out that matters a lot.</p>
<p>There was a bug in the original idea, the &quot;Premature Ejection&quot; problem. The naive rule was &quot;drop if loss drops a lot&quot; which sometimes tossed hard samples before the model had really gotten a grip, just cause their loss dipped that epoch. The fix turns the logic into a filter with a safety valve:</p>
<ul>
<li><strong>Original Rule:</strong> Drop if loss drops a lot.</li>
<li><strong>Modified Rule:</strong> Drop if loss drops a lot AND the model has actually reached a mastery floor.</li>
</ul>
<p>How to read the results:</p>
<ul>
<li><strong>Efficiency (3.52x):</strong> By epoch 25, STM had processed 3.5 times fewer tokens than standard training. Each epoch gets cheaper as the dataset shrinks.</li>
<li><strong>Pool shift:</strong> Started with 1,000 easy questions and 250 hard ones. By the end, only 6 easy questions remained while 94% of the hard questions survived in the pool. The model cleared easy addition in the first few minutes, then spent 25 epochs grinding hard math.</li>
<li><strong>Compute saved (71.6%):</strong> This number came from an early run with mastery floor at 0.15, which turns out to be too aggressive. The dataset shrinks monotonically, each epoch costs less than the last. With a properly tuned floor the per-round savings are lower (1.8% for one round, 21.2% by three rounds), but the point holds: same or better learning, less work.</li>
</ul>
<p>The Circle Thing turns out to be a dynamic curriculum generator. The model effectively circles the easy samples and tosses them, ensuring that every gradient step in the later epochs is high-value, high-difficulty work.</p>
<p>Final verdict: the theory works, it&#39;s efficient, and with the Mastery Floor addition it&#39;s stable enough for production use. The intuition that STM should focus on hard questions was exactly right. <code>stm_proof.py</code> is included, small model on simple addition, point is just to show the mechanism works.</p>
<h2>update: recursive STM (Dragonoid&#39;s perfection round idea)</h2>
<p>So the single-pass STM worked. But Dragonoid pointed out something nobody&#39;d thought of yet. Once the dataset&#39;s been fully STM&#39;d, what if you just. Do it again. Reset the dataset, feed it back in, run STM on top of the already-STM&#39;d checkpoint.</p>
<p>CompactAI ran it. Two phases, same dataset, same model, just reset the pool between phases.</p>
<p>Phase 1: model crawled to 45% accuracy, discarded most easy samples along the way. Standard STM behavior.</p>
<p>Then they reset. Brought every sample back. Fed it to the same model that had just spent 25 epochs grinding exclusively on hard math.</p>
<p>Phase 2 went different.</p>
<p>First epoch: accuracy jumped from 45% to 68.5%. Instant. One epoch.</p>
<p>Second epoch: the model looked at the dataset and went &quot;nah I already know most of this&quot; and yeeted 577 samples right there. Nearly half the pool. The Re-Circle event.</p>
<p>By the end of Phase 2: 89.5% accuracy. Double what Phase 1 managed.</p>
<p>So why&#39;d this happen. Dragonoid called it the Recruitment effect.</p>
<p>Phase 1 the model used easy data to build its foundations. Learned what numbers are, how addition works at a basic level. Then STM yanked the easy stuff and forced the model to grind on hard math for a long time. Those weights shifted. The model developed a more sophisticated grip on arithmetic cause it had to. That&#39;s the hard-data focus.</p>
<p>Then Phase 2 hits. The easy data comes back. But the model&#39;s not the same model anymore. Its weights have been reshaped by all that hard-math grinding. So when it sees &quot;3+7=10&quot; again it&#39;s like oh. I understand this at a way deeper level now. Solves it with much higher precision. And because its understanding is better, the Re-Circle kicks in at a tighter threshold. Samples that were borderline in Phase 1 get confidently tossed in Phase 2 cause the model&#39;s mastery floor is higher now.</p>
<p>Dragonoid&#39;s terminology for this: a cycle is one full STM pass, three phases (not gonna list em here, ask if you care). A round is N cycles. You keep doing cycles until the dataset is fully STM&#39;d, that&#39;s one round. Then you reset and go again with a tighter mastery threshold.</p>
<p>Each round distills the knowledge deeper into the weights. The model isn&#39;t just learning the data anymore, it&#39;s learning to learn the data better on the second pass.</p>
<p>The practical takeaway: don&#39;t run STM once and call it done. Run it, reset, tighten the bar, run it again. Two rounds got 89.5% on a tiny GRU doing addition. More rounds on bigger models, who knows.</p>
<p>Dragonoid said &quot;try third round hehehehehe&quot; so CompactAI did. Here&#39;s what happens when you push to three generations, measuring unseen hard accuracy at epoch 20 of each round:</p>
<ul>
<li>Gen 1: 9.2%</li>
<li>Gen 2: 29.6%</li>
<li>Gen 3: 30.8%</li>
</ul>
<p>So 3.2x jump from Gen 1 to 2, then only 1.2x from 2 to 3. Diminishing returns, but not terrible. The model&#39;s still squeezing a little more out of the data on that third pass. It&#39;s not nothing.</p>
<p>Compute savings tell a different story though. After fixing the mastery floor (the early 71.6% run had it cranked to 0.15, way too high), the actual numbers across rounds:</p>
<ul>
<li>One round: 1.8% saved</li>
<li>Two rounds: 4.0% saved</li>
<li>Three rounds: 21.2% saved</li>
</ul>
<p>Savings compound. The first round barely shrinks anything, the model&#39;s being conservative about what it tosses. By round three the model&#39;s confident enough to yeet a fifth of the dataset. The floor&#39;s doing its job.</p>
<p>The pattern&#39;s pretty clear at this point. First reset gives you a huge leap cause the model&#39;s weights have been fundamentally changed by the hard-data grinding. Second reset gives a smaller bump, the model&#39;s already close to its ceiling on this dataset.</p>
<h2>update: the gold experiment (it's actually joever for standard training)</h2>
<p>The receipts are in. CompactAI built a custom testbed to see if the STM theory holds water when you actually try to break it. Character-level math: <code>a+b=c</code>. They split it into &quot;Easy&quot; (single digits) and &quot;Hard&quot; (multi-digit with carries).</p>
<p>They ran the whole gamut: Standard, Original STM, Recursive, and some cursed variants like &quot;Anti-STM&quot; (yeeting the hard stuff) and &quot;Dual STM&quot; (yeeting everything the model found easy <em>or</em> impossible).</p>
<p>The results for the &quot;Gold Experiment&quot; are basically a vibe check that Standard training failed.</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>Unseen Hard Acc</th>
<th>Unseen Easy Acc</th>
<th>Tokens Processed</th>
<th>Pool Size (Easy, Hard)</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>STM (Original)</strong></td>
<td><strong>1.000 (100%)</strong></td>
<td>1.000</td>
<td>55,210 (+1.8% savings)</td>
<td>698 (499, 199)</td>
</tr>
<tr>
<td><strong>Recursive STM</strong></td>
<td><strong>1.000 (100%)</strong></td>
<td>1.000</td>
<td>94,400 (-67.9% savings)</td>
<td>1000 (801, 199)</td>
</tr>
<tr>
<td><strong>Standard</strong></td>
<td>0.000 (0%)</td>
<td>1.000</td>
<td>56,240 (Baseline)</td>
<td>1000 (801, 199)</td>
</tr>
<tr>
<td><strong>Anti-STM</strong></td>
<td>0.000 (0%)</td>
<td>1.000</td>
<td>30,393 (+46.0% savings)</td>
<td>121 (121, 0)</td>
</tr>
<tr>
<td><strong>Dual STM</strong></td>
<td>0.000 (0%)</td>
<td>1.000</td>
<td>42,287 (+24.8% savings)</td>
<td>284 (270, 14)</td>
</tr>
</tbody>
</table>
<h3>standard is a memory hog (the bad kind)</h3>
<p>The standard model hit 100% on easy math but scored a flat 0% on anything hard it hadn&#39;t seen. It didn&#39;t learn how to add; it just memorized the easy patterns and vibed. It never bothered to learn the &quot;carry&quot; rule because the easy data was too comfortable.</p>
<h3>stm = algorithmic bullying</h3>
<p>STM flipped the outcome. 100% hard accuracy. Looking at the logs, STM systematically sniped the easy samples the moment the model mastered them. Crucially, <strong>it never dropped a single hard sample</strong>. By starving the model of the easy examples it was trying to overfit to, STM forced the weights to actually converge on the actual rule (carrying). The model literally had no other choice to get its loss down.</p>
<h3>don't mess with the curriculum</h3>
<p>&quot;Anti-STM&quot; and &quot;Dual STM&quot; proved that protecting the hard samples is the secret sauce. If you drop the &quot;hopeless&quot; samples (the hard ones early in training), you save a ton of compute but you end up with a model that&#39;s fundamentally broken. You need the hard stuff; you just need to clear out the easy noise so the model can actually focus on it.</p>
<h3>recursive stm: the recruitment effect (again)</h3>
<p>Recursive STM hit the same 100% cap but with way more compute. For this tiny setup, one STM pass was enough to &quot;get it.&quot; On larger models, Recursive STM acts like an aggressive distillation curriculum.</p>
<h2>conclusion</h2>
<p>The &quot;STM and the Circle Thing&quot; paper is officially vindicated. STM works exactly as theorized: it turns a static dataset into a dynamic curriculum that biases the model away from rote memorization and toward fundamental algorithmic generalization. It&#39;s not just more efficient; it&#39;s smarter.</p>
<h2>tldr</h2>
<ul>
<li>STM: each epoch, chuck the samples whose loss dropped the most. Train on what&#39;s left. Repeat.</li>
<li>Circling: the eval pass that figures out what to chuck</li>
<li>Big sudden drops = model already memorized it, further training&#39;s wasteful. Slow drops = still learning, keep those</li>
<li>Proven single-pass: 3.52x efficiency. Compute savings depend on mastery floor (71.6% with floor at 0.15, lower with saner settings). Script included.</li>
<li>Recursive STM (Dragonoid): reset dataset after full STM, do it again. Two rounds hit 89.5%, three rounds pushed unseen hard accuracy 9.2% → 29.6% → 30.8%. Compute savings compound: 1.8% → 4.0% → 21.2%.</li>
<li>Apollonian gasket circle thing is unrelated. Ask before you mix em up.</li>
</ul>
<hr>
<p><em>notes from the #general chat history. got something wrong? yell at me.</em></p>
</article>
<footer class="paper-credit" role="contentinfo"><span class="credit-label">Credit goes to Dragonoid (theory), CompactAI (proof) & the CompactAI crew</span>: <span class="credit-names"></span></footer>
</div>
</main>
</body>
</html>