Spaces: CompactAI-O/CompactAI-Papers

wop committed 8 days ago

Commit

c11f58b

1 Parent(s): 112afd7

Split papers into shared CSS and standalone pages

Files changed (8)

Papers/STM_paper-md.html +98 -0
Papers/apollonian_gasket-md.html +63 -0
Papers/attention_experiment-md.html +55 -0
Papers/overta_hypothesis-md.html +87 -0
Papers/sparrow_fant-md.html +57 -0
Papers/trueact-md.html +107 -0
everywhere.css +803 -0
index.html +2 -1406

Papers/STM_paper-md.html ADDED

	@@ -0,0 +1,98 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>STM and the Circle Thing</title>
+  <link rel="preconnect" href="https://fonts.googleapis.com">
+  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
+  <link href="https://fonts.googleapis.com/css2?family=Geist:wght@300..700&family=Geist+Mono:wght@400;500&display=swap" rel="stylesheet">
+  <link rel="stylesheet" href="../everywhere.css">
+</head>
+<body>
+  <main class="container paper-page">
+    <div class="paper-page-shell">
+      <h1 class="paper-page-title">STM and the Circle Thing</h1>
+      <div class="paper-page-meta">Dragonoid · 2026-05 · PROVED</div>
+      <article class="paper-body" role="article" aria-label="Paper content"><h1>STM and the Circle Thing</h1>
+<p>STM stands for Subtractive Training Method. Dragonoid threw it out in #general and the chat&#39;s been daring each other to &quot;do STM&quot; on a checkpoint ever since. The idea&#39;s almost annoyingly simple once you hear it.</p>
+<p>Normally you train a model by shoving the whole dataset at it for an epoch, loss drops, do it again. STM says nah. After that first pass, look at which samples caused the biggest sudden drops in loss, and yank those out. Keep the rest. Train again. Repeat.</p>
+<p>The reasoning: a sample that produces a huge plummet in loss is the model going &quot;oh yeah got that one.&quot; Any more gradient steps on that sample after that point are mostly memorization, the model&#39;s just carving that exact example deeper into its weights, not learning anything generalizable. Waste of compute. The samples where loss comes down slow and gradual, those are the ones where the model&#39;s actually figuring things out. That&#39;s generalization. So you let it keep chewing.</p>
+<p>End result: shrinking dataset every epoch. Easy stuff drops away, what&#39;s left is the stuff the model still has to actually think about.</p>
+<p>Somebody in the thread asked &quot;so generalization happens with STM?&quot; And yeah, that&#39;s the bet. Starve the model of samples it&#39;d otherwise overfit to. Bias toward generalization.</p>
+<p>People noted this isn&#39;t quite hard-example mining or curriculum learning, even though it rhymes with em. Curriculum learning usually adds harder stuff over time. STM never adds. Only removes. Dataset shrinks monotonically. You start with everything and trim down to the residue. The &quot;curriculum&quot; isn&#39;t scheduled, it emerges from whatever the model still finds hard.</p>
+<p>Also, and this confuses people, the criterion isn&#39;t &quot;loss is currently high on this sample.&quot; It&#39;s &quot;loss DROPPED a lot on this sample.&quot; Derivative thing, not level thing. Sample can have low absolute loss and still be worth training on if its loss has been decreasing gradually. Sample with high loss but a sudden crash? Gone. Cut.</p>
+<h2>the other circle thing (Apollonian gasket, different idea, don&#39;t mix em up)</h2>
+<p>Same channel&#39;s been throwing around a totally separate circles thing. Storing token embeddings as addresses inside an integer Apollonian gasket, keyed by prime tuples from a twin-prime engine. Token maps to a circle, twin primes become semantic neighbors &quot;for free&quot; cause the +2 gap is the local Apollonian step. Weight matrix becomes the curvature graph. Inference is bounded curvature recall, truncate by curvature, filter by primality, verify Descartes-consistency.</p>
+<p>This is NOT the same circle STM talks about. The gasket is embedding storage. Circling is a data-selection pass during training. In principle you could combine em, Mage mused &quot;that might be where my gasket fits&quot;, but they&#39;re different proposals from different people solving different problems. If you hear &quot;circle&quot; without context, ask which one.</p>
+<h2>how you&#39;d test this (well, how you WOULD have)</h2>
+<p>The obvious experiment, back when it was theoretical:</p>
+<ol>
+<li>Pretrain a small model to some checkpoint</li>
+<li>Fork, one copy keeps normal training, other starts STM</li>
+<li>Compare validation loss after equal compute, not equal epochs (STM epochs get cheaper as the set shrinks)</li>
+</ol>
+<p>If STM wins it should generalize better and overfit less. If it loses, most likely the easy samples were carrying signal the model still needed and dropping em caused forgetting.</p>
+<p>Somebody also suggested doing STM on pretraining then normal finetuning on top. Makes sense, don&#39;t subtract during a phase where you&#39;re already dataset-starved.</p>
+<h2>update: proven</h2>
+<p>The STM theory is now fully proven with the final script. Key finding: STM saves training compute and focuses the model on hard data. How much compute you save depends on the mastery floor setting, turns out that matters a lot.</p>
+<p>There was a bug in the original idea, the &quot;Premature Ejection&quot; problem. The naive rule was &quot;drop if loss drops a lot&quot; which sometimes tossed hard samples before the model had really gotten a grip, just cause their loss dipped that epoch. The fix turns the logic into a filter with a safety valve:</p>
+<ul>
+<li><strong>Original Rule:</strong> Drop if loss drops a lot.</li>
+<li><strong>Modified Rule:</strong> Drop if loss drops a lot AND the model has actually reached a mastery floor.</li>
+</ul>
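+<p>Rough sketch of the modified rule, not the code from <code>stm_proof.py</code>. It assumes the mastery floor is a per-sample loss value the sample has to get under before it can be dropped, which is a guess at the mechanism. <code>select_survivors</code> and <code>drop_threshold</code> are made-up names, and the toy losses are invented.</p>

```python
def select_survivors(prev_loss, curr_loss, drop_threshold=0.5, mastery_floor=0.15):
    """Return the indices of samples that stay in the training pool.

    A sample is dropped only when BOTH hold:
      1. its loss fell by more than drop_threshold since the previous pass
      2. its current loss is at or below mastery_floor
    drop_threshold is a made-up value; treating mastery_floor as a loss
    ceiling is an assumption about how the floor works.
    """
    survivors = []
    for idx, (before, after) in enumerate(zip(prev_loss, curr_loss)):
        big_drop = (before - after) > drop_threshold
        mastered = after <= mastery_floor
        if not (big_drop and mastered):
            survivors.append(idx)
    return survivors


# Per-sample losses from two consecutive passes (toy numbers).
prev = [2.0, 2.0, 0.3, 2.5]
curr = [0.1, 1.2, 0.2, 2.4]
kept = select_survivors(prev, curr)
print(kept)
```

+<p>Sample 1 is the Premature Ejection case: big dip, still nowhere near the floor, so the original rule would toss it and the modified rule keeps it.</p>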
+<p>How to read the results:</p>
+<ul>
+<li><strong>Efficiency (3.52x):</strong> By epoch 25, STM had processed 3.5 times fewer tokens than standard training. Each epoch gets cheaper as the dataset shrinks.</li>
+<li><strong>Pool shift:</strong> Started with 1,000 easy questions and 250 hard ones. By the end, only 6 easy questions remained while 94% of the hard questions survived in the pool. The model cleared easy addition in the first few minutes, then spent 25 epochs grinding hard math.</li>
+<li><strong>Compute saved (71.6%):</strong> This number came from an early run with mastery floor at 0.15, which turns out to be too aggressive. The dataset shrinks monotonically, each epoch costs less than the last. With a properly tuned floor the per-round savings are lower (1.8% for one round, 21.2% by three rounds), but the point holds: same or better learning, less work.</li>
+</ul>
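+<p>Quick sketch of how the token accounting could work, assuming cost is just &quot;samples in the pool, summed over epochs&quot;. The pool sizes below are invented, not the real run, and <code>token_savings</code> is a hypothetical helper. One thing the arithmetic does show: a 3.52x multiplier and 71.6% saved are the same measurement written two ways, since 1 - 1/3.52 is about 0.716.</p>

```python
def token_savings(pool_sizes, tokens_per_sample=1):
    """Compare STM against training on the full set every epoch.

    pool_sizes[i] is the number of samples still in the pool at epoch i,
    so pool_sizes[0] is the full dataset.
    Returns (efficiency multiplier, fraction of compute saved).
    """
    full = pool_sizes[0] * len(pool_sizes) * tokens_per_sample
    stm = sum(pool_sizes) * tokens_per_sample
    return full / stm, 1 - stm / full


# Invented shrinking schedule, four epochs.
multiplier, saved = token_savings([1000, 500, 250, 250])
print(multiplier, saved)

# Converting the reported multiplier into a saved fraction.
implied_saved = 1 - 1 / 3.52
print(round(implied_saved, 3))
```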
+<p>The Circle Thing turns out to be a dynamic curriculum generator. The model effectively circles the easy samples and tosses them, ensuring that every gradient step in the later epochs is high-value, high-difficulty work.</p>
+<p>Final verdict: the theory works, it&#39;s efficient, and with the Mastery Floor addition it&#39;s stable enough for production use. The intuition that STM should focus on hard questions was exactly right. <code>stm_proof.py</code> is included, small model on simple addition, point is just to show the mechanism works.</p>
+<h2>update: recursive STM (Dragonoid&#39;s perfection round idea)</h2>
+<p>So the single-pass STM worked. But Dragonoid pointed out something nobody&#39;d thought of yet. Once the dataset&#39;s been fully STM&#39;d, what if you just. Do it again. Reset the dataset, feed it back in, run STM on top of the already-STM&#39;d checkpoint.</p>
+<p>CompactAI ran it. Two phases, same dataset, same model, just reset the pool between phases.</p>
+<p>Phase 1: model crawled to 45% accuracy, discarded most easy samples along the way. Standard STM behavior.</p>
+<p>Then they reset. Brought every sample back. Fed it to the same model that had just spent 25 epochs grinding exclusively on hard math.</p>
+<p>Phase 2 went different.</p>
+<p>First epoch: accuracy jumped from 45% to 68.5%. Instant. One epoch.</p>
+<p>Second epoch: the model looked at the dataset and went &quot;nah I already know most of this&quot; and yeeted 577 samples right there. Nearly half the pool. The Re-Circle event.</p>
+<p>By the end of Phase 2: 89.5% accuracy. Double what Phase 1 managed.</p>
+<p>So why&#39;d this happen. Dragonoid called it the Recruitment effect.</p>
+<p>Phase 1 the model used easy data to build its foundations. Learned what numbers are, how addition works at a basic level. Then STM yanked the easy stuff and forced the model to grind on hard math for a long time. Those weights shifted. The model developed a more sophisticated grip on arithmetic cause it had to. That&#39;s the hard-data focus.</p>
+<p>Then Phase 2 hits. The easy data comes back. But the model&#39;s not the same model anymore. Its weights have been reshaped by all that hard-math grinding. So when it sees &quot;3+7=10&quot; again it&#39;s like oh. I understand this at a way deeper level now. Solves it with much higher precision. And because its understanding is better, the Re-Circle kicks in at a tighter threshold. Samples that were borderline in Phase 1 get confidently tossed in Phase 2 cause the model&#39;s mastery floor is higher now.</p>
+<p>Dragonoid&#39;s terminology for this: a cycle is one full STM pass, three phases (not gonna list em here, ask if you care). A round is N cycles. You keep doing cycles until the dataset is fully STM&#39;d, that&#39;s one round. Then you reset and go again with a tighter mastery threshold.</p>
+<p>Each round distills the knowledge deeper into the weights. The model isn&#39;t just learning the data anymore, it&#39;s learning to learn the data better on the second pass.</p>
+<p>The practical takeaway: don&#39;t run STM once and call it done. Run it, reset, tighten the bar, run it again. Two rounds got 89.5% on a tiny GRU doing addition. More rounds on bigger models, who knows.</p>
+<p>Dragonoid said &quot;try third round hehehehehe&quot; so CompactAI did. Here&#39;s what happens when you push to three generations, measuring unseen hard accuracy at epoch 20 of each round:</p>
+<ul>
+<li>Gen 1: 9.2%</li>
+<li>Gen 2: 29.6%</li>
+<li>Gen 3: 30.8%</li>
+</ul>
+<p>So 3.2x jump from Gen 1 to 2, then only about 1.04x (1.2 points) from 2 to 3. Diminishing returns, but not terrible. The model&#39;s still squeezing a little more out of the data on that third pass. It&#39;s not nothing.</p>
+<p>Compute savings tell a different story though. After fixing the mastery floor (the early 71.6% run had it cranked to 0.15, way too high), the actual numbers across rounds:</p>
+<ul>
+<li>One round: 1.8% saved</li>
+<li>Two rounds: 4.0% saved</li>
+<li>Three rounds: 21.2% saved</li>
+</ul>
+<p>Savings compound. The first round barely shrinks anything, the model&#39;s being conservative about what it tosses. By round three the model&#39;s confident enough to yeet a fifth of the dataset. The floor&#39;s doing its job.</p>
+<p>The pattern&#39;s pretty clear at this point. First reset gives you a huge leap cause the model&#39;s weights have been fundamentally changed by the hard-data grinding. Second reset gives a smaller bump, the model&#39;s already close to its ceiling on this dataset.</p>
+<h2>tldr</h2>
+<ul>
+<li>STM: each epoch, chuck the samples whose loss dropped the most. Train on what&#39;s left. Repeat.</li>
+<li>Circling: the eval pass that figures out what to chuck</li>
+<li>Big sudden drops = model already memorized it, further training&#39;s wasteful. Slow drops = still learning, keep those</li>
+<li>Proven single-pass: 3.52x efficiency. Compute savings depend on mastery floor (71.6% with floor at 0.15, lower with saner settings). Script included.</li>
+<li>Recursive STM (Dragonoid): reset dataset after full STM, do it again. Two rounds hit 89.5%, three rounds pushed unseen hard accuracy 9.2% → 29.6% → 30.8%. Compute savings compound: 1.8% → 4.0% → 21.2%.</li>
+<li>Apollonian gasket circle thing is unrelated. Ask before you mix em up.</li>
+</ul>
+<hr>
+<p><em>notes from the #general chat history. got something wrong? yell at me.</em></p>
+</article>
+      <footer class="paper-credit" role="contentinfo"><span class="credit-label">Credit goes to Dragonoid (theory), CompactAI (proof) & the CompactAI crew</span>: <span class="credit-names"></span></footer>
+    </div>
+  </main>
+</body>
+</html>

Papers/apollonian_gasket-md.html ADDED

	@@ -0,0 +1,63 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>Token Embeddings Inside an Integer Apollonian Gasket</title>
+  <link rel="preconnect" href="https://fonts.googleapis.com">
+  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
+  <link href="https://fonts.googleapis.com/css2?family=Geist:wght@300..700&family=Geist+Mono:wght@400;500&display=swap" rel="stylesheet">
+  <link rel="stylesheet" href="../everywhere.css">
+</head>
+<body>
+  <main class="container paper-page">
+    <div class="paper-page-shell">
+      <h1 class="paper-page-title">Token Embeddings Inside an Integer Apollonian Gasket</h1>
+      <div class="paper-page-meta">Mage · 2026-05</div>
+      <article class="paper-body" role="article" aria-label="Paper content"><h1>Token Embeddings Inside an Integer Apollonian Gasket</h1>
+<p>OK so Mage keeps bringing this up in #general and honestly it&#39;s the kind of thing that sounds like math fanfiction until you sit there for twenty minutes and then you&#39;re like... wait. Maybe not completely insane.</p>
+<p>The pitch: instead of learning token embeddings as giant float matrices through SGD, put every token as an address inside an Apollonian gasket, you know, that fractal you get when you recursively pack circles into the gaps between other circles forever, and key each address to a pair of twin primes.</p>
+<p>I know. Stick with me.</p>
+<h2>the geometry, roughly</h2>
+<p>Apollonian gasket comes out of Descartes&#39; circle theorem. Four mutually tangent circles, their curvatures (1/radius) satisfy a quadratic relation. Start with three tangent circles, compute the two circles tangent to all three, pick one, recurse. You get this infinitely detailed fractal where every circle has an integer curvature. Provided you start with the right initial curvatures anyway.</p>
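+<p>For the curious, the quadratic relation is (k1 + k2 + k3 + k4)^2 = 2(k1^2 + k2^2 + k3^2 + k4^2). Fix three circles of a tangent quadruple and the &quot;other&quot; fourth circle has curvature 2(a + b + c) - k. Minimal sketch below, starting from the standard integer root quadruple (-1, 2, 2, 3), where the negative curvature is the outer bounding circle. Function names are made up for illustration and nothing here is Mage&#39;s code.</p>

```python
def descartes_ok(quad):
    """Check Descartes' circle theorem for four curvatures."""
    return sum(quad) ** 2 == 2 * sum(k * k for k in quad)


def gasket_curvatures(root=(-1, 2, 2, 3), max_curvature=30):
    """List every curvature <= max_curvature in the gasket grown from root.

    Curvatures are repeated once per circle, so symmetric gaskets show
    duplicates. Assumes root is a root quadruple, so every reflection of a
    non-fresh circle increases the curvature and the search terminates.
    """
    curvatures = list(root)
    # Each entry: (quadruple, index of the circle that was just created).
    stack = [(root, None)]
    while stack:
        quad, fresh = stack.pop()
        for i in range(4):
            if i == fresh:
                continue  # reflecting the newest circle steps back to the parent
            others = sum(quad) - quad[i]
            new_k = 2 * others - quad[i]
            if new_k > max_curvature:
                continue
            curvatures.append(new_k)
            child = quad[:i] + (new_k,) + quad[i + 1:]
            stack.append((child, i))
    return sorted(curvatures)


small = gasket_curvatures(max_curvature=6)
print(small)
```

+<p>Every quadruple the loop builds still satisfies <code>descartes_ok</code>, which is the &quot;Descartes-consistency&quot; check in its simplest form.</p>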
+<p>Circles in an integer gasket form a hierarchy. Big circles = small curvatures. Deeper = bigger curvatures. Circles tangent to each other are neighbors. And a circle&#39;s curvature encodes something about where it sits in the hierarchy, its position relative to everything else. That&#39;s the part that matters.</p>
+<h2>the twin prime thing (this is where it gets weird)</h2>
+<p>Mage&#39;s actual suggestion: map each token to a circle via f: token → (p₁, p₂) where p₁ and p₂ are twin primes. Primes differing by exactly 2, like 17 and 19. The +2 gap between twin primes maps naturally onto the local step structure of the gasket. &quot;Semantic neighbors&quot; come for free from number theory instead of co-occurrence stats.</p>
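+<p>Sketch of the addressing half only: enumerate twin prime pairs and hand them out to tokens. The in-order assignment is a placeholder, since how tokens should be assigned is exactly the unsolved part, and how a pair lands on a specific circle isn&#39;t specified in chat so it&#39;s left out. <code>twin_prime_pairs</code> and the four-word vocab are hypothetical.</p>

```python
def is_prime(n):
    """Trial division; fine for the small primes a toy vocab needs."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def twin_prime_pairs(count):
    """Return the first `count` twin prime pairs (p, p + 2)."""
    pairs = []
    p = 3
    while len(pairs) < count:
        if is_prime(p) and is_prime(p + 2):
            pairs.append((p, p + 2))
        p += 2
    return pairs


# Placeholder assignment: tokens get pairs in list order, nothing semantic.
vocab = ["cat", "dog", "car", "bus"]
address = dict(zip(vocab, twin_prime_pairs(len(vocab))))
print(address)
```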
+<p>So the weight matrix, the thing that normally stores all the relationships between tokens as learned floats, becomes the curvature graph. Instead of &quot;the embedding for cat is related to the embedding for dog by some learned vector&quot; you get &quot;the curvature of cat&#39;s circle has a specific Descartes relation to the curvature of dog&#39;s circle.&quot;</p>
+<p>Inference becomes bounded curvature recall: search circles with curvature ≤ K, filter by primality (make sure you&#39;re on valid addresses), verify Descartes-consistency (make sure geometric relations actually hold). It&#39;s deterministic. You&#39;re traversing a fixed structure, not predicting an embedding.</p>
+<h2>wait, this might actually have something to it</h2>
+<p>Few things from the back-and-forth in chat that make this better than your average 3am math thought:</p>
+<p>The structure is fixed. Normal embeddings are a black box, train em, they go somewhere, hope the geometry&#39;s nice. Gasket geometry is given. The question becomes whether it&#39;s useful, not whether you can learn a useful one. Different problem entirely.</p>
+<p>The twin prime mapping gives you a built-in closeness notion from number theory. Twin primes are conjectured infinite (unproven, sure, but more than enough for any practical vocab). And since they&#39;re spaced by exactly 2 you get a natural adjacency. Map semantically similar tokens to nearby primes and the geometry handles the rest.</p>
+<p>And it&#39;s absurdly compressible. You don&#39;t store a giant embedding matrix. You store the gasket generation algorithm and the vocab-to-prime-pairs mapping. 50k vocabulary? 50k integer pairs instead of 50k × 768 floats. At fp32 that&#39;s roughly 150 megabytes down to a few hundred kilobytes (about 400 KB with 32-bit ints). That&#39;s wild.</p>
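+<p>Back-of-envelope version of that comparison, assuming fp32 embeddings and 32-bit integers for the primes (both assumptions, the chat never pinned down dtypes). It only counts the tables, not the generation algorithm.</p>

```python
vocab_size = 50_000
embed_dim = 768
bytes_per_float = 4  # fp32, assumed
bytes_per_int = 4    # 32-bit ints, assumed

# Dense learned table: one float per (token, dimension).
dense_bytes = vocab_size * embed_dim * bytes_per_float

# Gasket addressing: two integers per token.
pair_bytes = vocab_size * 2 * bytes_per_int

ratio = dense_bytes / pair_bytes
print(dense_bytes, pair_bytes, ratio)
```

+<p>Under those assumptions it works out to about 153.6 MB versus 400 KB, a 384x reduction.</p>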
+<h2>the problems that nobody&#39;s solved</h2>
+<p>Plenty. The chat wasn&#39;t pretending otherwise.</p>
+<p>Biggest one: how do you assign tokens to primes? Need a mapping from tokens to (p₁, p₂) pairs where semantically similar tokens end up near each other in the gasket. Random assignment = geometry is meaningless noise. But if you have to learn the assignment... you just moved the learning problem somewhere else. Nothing&#39;s free.</p>
+<p>Also the gasket is 2D. Token embeddings are usually 768D or higher. You&#39;re cramming high-dimensional semantics into a flat fractal. There are ways to extend Descartes to higher dimensions (Soddy&#39;s theorem generalizes), but the nice integer properties of the Apollonian gasket are a 2D thing specifically.</p>
+<p>&quot;Bounded curvature recall&quot; sounds neat, what&#39;s it actually mean for language modeling though? You&#39;re doing geometric lookup. How does attention even work here? Does attention make sense when &quot;embeddings&quot; are fixed positions?</p>
+<p>And twin prime conjecture is unproven. If twin primes are somehow finite (almost certainly not but we can&#39;t prove it) you&#39;d run out of addresses. Even with infinite twin primes, the density thins, mapping a million tokens might force you into enormous primes and correspondingly enormous curvatures.</p>
+<h2>building it (nobody has yet)</h2>
+<p>Experiment would look like:</p>
+<p>Pick a small vocab, maybe 1000 tokens. Manually assign em to twin prime pairs, doesn&#39;t have to be perfect, just needs semantically related ones geometrically close. Build a tiny transformer where embedding lookup is replaced by gasket traversal. Model doesn&#39;t learn embeddings, it learns to navigate the structure. Train on something simple like next token prediction. See if it converges at all.</p>
+<p>Null hypothesis: it learns nothing useful cause the geometric prior is wrong. If it does learn? Interesting.</p>
+<p>Compare to identical model with learned embeddings. If the gasket version is even in the same ballpark, that&#39;s a big deal given the compression advantage.</p>
+<p>Mage suggested combining this with STM from the other paper, since you&#39;d want to be picky about training examples when your embedding scheme is this constrained. But honestly combining two untested ideas sounds like a great way to not know which one&#39;s failing.</p>
+<h2>anyone done this before?</h2>
+<p>I looked. Nope. No literature on Apollonian gaskets for token embeddings.</p>
+<p>Nearest neighbors: hyperbolic embeddings (Poincaré ball) for hierarchies, but those are continuous, learned, no integer or prime structure. Some niche fractal initialization papers. Some knowledge graph stuff in hyperbolic/spherical geometries. Nothing using gaskets, twin primes, or curvature graphs as the embedding mechanism. Genuinely new ground. For better or worse.</p>
+<h2>tldr</h2>
+<ul>
+<li>Tokens as circles in an integer Apollonian gasket, addressed by twin prime pairs</li>
+<li>Semantic neighbors geometrically close cause of Descartes curvature relations</li>
+<li>Weight matrix = curvature graph. Inference = curvature recall + primality filter + Descartes check</li>
+<li>Hundreds of kilobytes instead of ~150 megabytes for the embedding table</li>
+<li>Completely untested, might be mathematically impossible for high-dim semantics</li>
+<li>If it worked even a little... worth writing home about</li>
+</ul>
+<hr>
+</article>
+      <footer class="paper-credit" role="contentinfo"><span class="credit-label">Credit goes to Mage & CompactAI</span>: <span class="credit-names"></span></footer>
+    </div>
+  </main>
+</body>
+</html>

Papers/attention_experiment-md.html ADDED

	@@ -0,0 +1,55 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>An Experiment With Attention</title>
+  <link rel="preconnect" href="https://fonts.googleapis.com">
+  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
+  <link href="https://fonts.googleapis.com/css2?family=Geist:wght@300..700&family=Geist+Mono:wght@400;500&display=swap" rel="stylesheet">
+  <link rel="stylesheet" href="../everywhere.css">
+</head>
+<body>
+  <main class="container paper-page">
+    <div class="paper-page-shell">
+      <h1 class="paper-page-title">An Experiment With Attention</h1>
+      <div class="paper-page-meta">wop · 2026-05</div>
+      <article class="paper-body" role="article" aria-label="Paper content"><h1>An Experiment With Attention</h1>
+<p>wop posted this on HuggingFace. Started simple: can you swap full attention for something cheaper and still keep enough context to nail the next token? Answer turned out to be &quot;nope, not yet&quot;, but how it got there is more interesting than the headline.</p>
+<p>The setup is clean. Reproducible. Refreshing, no massive training runs, no &quot;promising directions&quot; handwaving. Just a concrete benchmark, two architectures, numbers.</p>
+<h2>the actual question</h2>
+<p>Context window isn&#39;t a flat sequence of tokens. It&#39;s usually: task instructions, style hints, formatting rules, actual content. These don&#39;t all matter at every token. Some rules are globally relevant but locally weak, the emoji rule in a system prompt, a formatting constraint. Model needs to carry them the whole way even when they don&#39;t matter at each individual prediction.</p>
+<p>wop&#39;s framing: can a compressed context state preserve those weak early rules as the sequence gets long? And how does that stack up against ordinary attention?</p>
+<h2>two models, head to head</h2>
+<p>Attention: standard causal transformer attention. The baseline.</p>
+<p>Compressed: replaces token-to-token attention with a learned compressed memory state, a few implicit slots.</p>
+<p>Important: the compressed model doesn&#39;t classify tokens into categories. It reads tokens, updates a compact state, predicts from that state. Structure stays implicit. No hand-crafted &quot;this is a rule, this is content&quot; partitioning.</p>
+<p>Dataset is synthetic and hardcoded (keeps it clean). Each example has two early rules, one item, a long distractor-filled prefix, and a target that requires recovering the early rules. The benchmark specifically stresses rule retention over distance, not general prediction.</p>
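+<p>To make that shape concrete, here&#39;s a guess at what one example could look like. This is not the generator from wop&#39;s repo: the token names, the <code>&lt;query&gt;</code> marker, and the exact ordering are all placeholders. It only shows the idea of rules up front, distractors in the middle, and a target you can&#39;t get without the rules.</p>

```python
import random

# All token names below are placeholders, not the vocabulary from the actual repo.
RULES_A = ["rule_a0", "rule_a1", "rule_a2"]
RULES_B = ["rule_b0", "rule_b1", "rule_b2"]
ITEMS = ["item0", "item1", "item2", "item3"]
DISTRACTORS = [f"noise{i}" for i in range(20)]


def make_example(ctx_len, rng):
    """Build one sequence of exactly ctx_len tokens plus its target."""
    rule_a = rng.choice(RULES_A)
    rule_b = rng.choice(RULES_B)
    item = rng.choice(ITEMS)
    head = [rule_a, rule_b, item]
    # Fill everything between the head and the final query slot with noise.
    filler = [rng.choice(DISTRACTORS) for _ in range(ctx_len - len(head) - 1)]
    tokens = head + filler + ["<query>"]
    # The target depends on both early rules, so forgetting either one fails.
    target = (rule_a, rule_b, item)
    return tokens, target


rng = random.Random(0)
tokens, target = make_example(64, rng)
print(len(tokens), target)
```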
+<p>Context lengths: 64, 256, 1028.</p>
+<h2>what happened</h2>
+<p>Attention won. On quality. And on speed. Didn&#39;t expect the speed part.</p>
+<p>At ctx64: attention val_acc 0.938, rule_acc 0.906. Compressed: 0.699 and 0.492.</p>
+<p>At ctx256: attention 0.757 val_acc, 0.581 rule_acc. Compressed: 0.633 and 0.358.</p>
+<p>At ctx1028: attention 0.701 val_acc, 0.492 rule_acc. Compressed: 0.577 and 0.263.</p>
+<p>Speed was brutal. ctx1028: attention finished in about 10 seconds. Compressed took about 230. So it was less accurate AND 23x slower. Oof.</p>
+<h2>the lesson (not &quot;compression bad&quot;)</h2>
+<p>People might see those numbers and go &quot;welp attention wins, pack it up.&quot; That&#39;d miss the point.</p>
+<p>Naive compressed recurrent context doesn&#39;t automatically beat attention. Obvious now. Wasn&#39;t obvious going in, plenty of people assumed compressing context would help with long-range stuff because you&#39;re forcing a summary. Numbers say nope. Not unless you&#39;re real careful about how you compress.</p>
+<p>Preserving weak parallel instructions is harder than keeping a rolling summary. The compressed model didn&#39;t just lose accuracy, it specifically tanked rule_acc. That&#39;s the game. Can you keep early constraints alive? Answer: not with a dumb bottleneck.</p>
+<p>Full attention is still stupidly strong, even on a benchmark literally designed to break it. That&#39;s kind of wild. Task was built to punish attention and it still held up.</p>
+<p>Also worth noting: the compressed model updates memory step-by-step in sequence order, tons of serial work. Attention gets all those optimized parallel kernels. Some of the speed gap is architecture, some is just the real world having better kernels for one approach.</p>
+<h2>what this means</h2>
+<p>wop doesn&#39;t over-claim. The motivating intuition, context isn&#39;t flat, some instructions matter globally but weakly, a good alternative to attention needs to preserve those signals, that&#39;s still solid. What broke is that the first compressed block was too simple. Formed a bottleneck. Just not a smart enough one.</p>
+<p>If you wanna replace or relax attention, the replacement probably needs: a smarter way to preserve weak long-range constraints (rule_acc tells you directly if you&#39;re failing), a more parallel implementation (230 seconds is useless), a more selective memory update (not every token matters for those early constraints), and a benchmark that measures rule retention directly, not hidden inside average loss.</p>
+<p>Next step: compress smarter, not harder.</p>
+<h2>bottom line</h2>
+<p>The experiment made the question concrete. No more abstract talk about &quot;efficient context.&quot; There&#39;s a clear picture now. Attention expensive but extremely effective. Compressed context alone isn&#39;t enough. Weak parallel instructions are a real stress test.</p>
+<p>If cheaper context mechanisms are gonna work, they need to preserve global obligations without collapsing into a vague summary. Numbers are out there. Reproduce em or beat em.</p>
+<p>wop used Kaggle notebooks for GPU, Codex from OpenAI for assistance, HuggingFace ZeroGPU for the thumbnail. Repo at github.com/koo1140/attention-experiment.</p>
+<hr>
+</article>
+      <footer class="paper-credit" role="contentinfo"><span class="credit-label">Credit goes to wop (poe) & CompactAI</span>: <span class="credit-names"></span></footer>
+    </div>
+  </main>
+</body>
+</html>

Papers/overta_hypothesis-md.html ADDED

	@@ -0,0 +1,87 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>The Overta Hypothesis: Knowledge-Free Foundation Models</title>
+  <link rel="preconnect" href="https://fonts.googleapis.com">
+  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
+  <link href="https://fonts.googleapis.com/css2?family=Geist:wght@300..700&family=Geist+Mono:wght@400;500&display=swap" rel="stylesheet">
+  <link rel="stylesheet" href="../everywhere.css">
+</head>
+<body>
+  <main class="container paper-page">
+    <div class="paper-page-shell">
+      <h1 class="paper-page-title">The Overta Hypothesis: Knowledge-Free Foundation Models</h1>
+      <div class="paper-page-meta">Amy · 2026-05</div>
+      <article class="paper-body" role="article" aria-label="Paper content"><h1>The Overta Hypothesis: Knowledge-Free Foundation Models</h1>
+<p>Amy&#39;s been working on something that sounds backwards: train a language model from scratch that knows nothing about the world. Not &quot;knows less.&quot; Nothing. Model wouldn&#39;t know what a human is, wouldn&#39;t know Earth exists, wouldn&#39;t know what &quot;Paris&quot; or &quot;coffee&quot; or &quot;gravity&quot; means.</p>
+<p>What it would know: how to reason. How to hold a conversation. How to use tools to look stuff up. How to do in-context learning from whatever you feed it at inference time.</p>
+<p>OFM, Overta Foundation Model. She owns overta.org. Nothing on it yet, but the repo exists.</p>
+<p>The pitch is basically: cramming world knowledge into weights, which every LLM since GPT-2 does, wastes parameters. Instead, let the model spend its entire budget on reasoning. Facts live outside, in retrievable documents. The model&#39;s just the engine.</p>
+<h2>what does &quot;no world knowledge&quot; actually look like</h2>
+<p>Weirder than you&#39;d think. When Amy says the model wouldn&#39;t know what a human is, she&#39;s literal. Here&#39;s the kind of definition she&#39;d provide externally at inference time, via search tool or context window:</p>
+<blockquote>
+<p>&quot;We are entities interacting with the physical world and able to do precise manipulation and also self-locomotion, through deforming and displacing our body.&quot;</p>
+</blockquote>
+<p>Model&#39;s never seen this before. Doesn&#39;t come pre-loaded. But it knows how to use a definition once provided. Knows how to reason from premises. If X is an entity that can manipulate objects and you ask about X&#39;s capabilities, it references the definition. That&#39;s the whole thing.</p>
+<p>Training data would be purely synthetic conversations. Characters have defined properties stated explicitly in context. Reasoning chains spelled out step by step. Facts always grounded in explicitly provided info. Model never falls back on &quot;common sense&quot;, there isn&#39;t any.</p>
+<p>Amy wrote seed examples by hand. The flavor:</p>
+<pre><code>&lt;message from=&quot;other&quot;&gt;
+describe whats happening now
+&lt;/message&gt;
+&lt;message from=&quot;self&quot;&gt;
+&lt;think&gt;
+They&#39;re asking me to describe &quot;whats happening now&quot;. Right now I&#39;m talking
+to them and outputting text. I don&#39;t know of anything else happening now.
+So I&#39;ll describe that to them.
+&lt;/think&gt;
+I (some entity which processes text and outputs text) am conversing with
+you, what I assume to be an entity which also can process text and output
+text of your own. I&#39;m outputting text to respond to your request.
+&lt;/message&gt;
+</code></pre>
+<p>Notice it doesn&#39;t say &quot;I am an AI assistant.&quot; Says &quot;I (some entity which processes text and outputs text).&quot; Describes itself from first principles using only what&#39;s observable. The whole philosophy is in that one response.</p>
+<h2>the arguments for it</h2>
+<p>This crystallized across a bunch of conversations.</p>
+<p>Hallucination isn&#39;t a bug, it&#39;s the design. LLMs hallucinate because they&#39;re trained to memorize facts and spit em back probabilistically. Don&#39;t train em to memorize facts, they can&#39;t hallucinate facts. They can still fail to reason correctly, different failure mode entirely.</p>
+<p>Parameter efficiency. 350M model spending zero params on knowing the capital of France or the plot of Hamlet can spend all 350M on reasoning. Amy&#39;s guess: more intelligence per parameter than any model that mixes knowledge and reasoning.</p>
+<p>Grounding is explicit. When the model uses a fact you know exactly where it came from, context window, search tool, user provided it. No mystery about whether the model&#39;s &quot;remembering&quot; or &quot;making it up.&quot;</p>
+<p>Also updatability. World knowledge changes, models don&#39;t. Model with no world knowledge has nothing to go stale. Just update the retrieval database. And alignment: model with no priors about true/false, good/bad, normal/weird only knows what you tell it in the moment. The user defines the ethical frame and factual premises, not the training data. Which obviously cuts both ways, Amy&#39;s aware.</p>
+<h2>and the problems</h2>
+<p>Nobody&#39;s pulled it off yet. Amy&#39;s early experiments have been rough.</p>
+<p>Language itself encodes knowledge, probably the biggest problem. Can&#39;t learn English grammar without incidentally learning that doctors treat patients, fire is hot, people have names. The boundary between &quot;reasoning&quot; and &quot;knowing&quot; is way blurrier than the hypothesis assumes. Costikoooo flagged this in chat: model might pick up unwanted world knowledge just from how conversations are structured.</p>
+<p>Cold start is brutal. Amy&#39;s pipeline: manually write ~200 seed examples → finetune Qwen3 4B to generate more → use that for the full training dataset. Seed examples have to be near-perfect, everything cascades from em. Claude Opus &quot;doesn&#39;t get it even close first try and needs a lot of iterating.&quot; And the AI models she&#39;s using to generate training data were themselves trained on world knowledge, so they sneak it in. Recursive contamination basically.</p>
+<p>Her earlier procedural generators, pure code, no LLMs, produced a model that could tool-call but was &quot;the most brittle thing imaginable.&quot; Only worked with names in its predefined list. Generalization terrible.</p>
+<p>Then there&#39;s in-context bandwidth: if every fact has to live in the context window you need a killer retrieval system, and the model becomes only as good as its RAG. Dragonoid also asked about ethics: what even counts as knowledge? Amy said ethics would be taught as abstract values. But are values knowledge? Where&#39;s the line? Gets tangled fast.</p>
+<h2>the procedural generation thing</h2>
+<p>Amy&#39;s earlier approach, and the more interesting one honestly, ditched LLMs entirely. She built a system of TypeScript generators. Each function handles one aspect of conversation generation. One generator adds a user message like &quot;My name is Joe,&quot; mutates conversation state so Joe&#39;s stored. Later generators reference that state. The system randomly chains generators together. All procedures, guaranteed consistent cause the state machine enforces it.</p>
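+<p>A minimal sketch of that generator-chaining idea, in Python rather than Amy&#39;s TypeScript. Every name here (<code>introduce_name</code>, <code>recall_name</code>, <code>generate_conversation</code>, the name pool, the message wording) is made up for illustration and not taken from her repo. It only shows the mechanism: each generator appends turns and mutates shared state, so a later generator can&#39;t contradict an earlier one.</p>
+<pre><code class="language-python">import random
+
+
+def introduce_name(state, rng):
+    # Adds a user turn and stores the name so later generators can use it.
+    name = rng.choice(state[&quot;name_pool&quot;])
+    state[&quot;user_name&quot;] = name
+    state[&quot;turns&quot;].append((&quot;user&quot;, f&quot;My name is {name}.&quot;))
+    state[&quot;turns&quot;].append((&quot;model&quot;, f&quot;Noted, {name}.&quot;))
+
+
+def recall_name(state, rng):
+    # Skips itself if no name has been stored yet, so it can never invent one.
+    if &quot;user_name&quot; not in state:
+        return
+    name = state[&quot;user_name&quot;]
+    state[&quot;turns&quot;].append((&quot;user&quot;, &quot;What is my name?&quot;))
+    state[&quot;turns&quot;].append((&quot;model&quot;, f&quot;Your name is {name}.&quot;))
+
+
+GENERATORS = [introduce_name, recall_name]
+
+
+def generate_conversation(seed, steps=4):
+    # Randomly chains generators; the shared state dict keeps them consistent.
+    rng = random.Random(seed)
+    state = {&quot;name_pool&quot;: [&quot;Joe&quot;, &quot;Ada&quot;, &quot;Lin&quot;], &quot;turns&quot;: []}
+    for _ in range(steps):
+        rng.choice(GENERATORS)(state, rng)
+    return state
+</code></pre>
+<p>Seeding the random source means the same seed reproduces the same conversation, which is handy when a generator misbehaves. The tiny fixed name pool is also exactly the kind of thing that would make a trained model brittle on unseen names.</p>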
+<p>Trained a 350M model from scratch on this procedural data. Qwen3 architecture, custom tokenizer, about an hour per training run on a 16GB GPU. Model learned to remember user names (generator state tracked em), search for info it didn&#39;t have (generators included search calls), stay in character as &quot;an entity that processes text.&quot;</p>
+<p>But brittle. Only generalized to names in the generator&#39;s list. Math broken. Longer training mighta helped, only about an hour per run.</p>
+<p>Neat insight though: with this approach you add a new capability by writing a new generator and retraining. Timer? Timer generator. Spatial reasoning? Spatial generator. Capabilities are modular because the training data is modular.</p>
+<p>And it&#39;s philosophically cleaner than LLM-distillation. No world knowledge leaks in. But way more labor, you&#39;re writing a program that writes programs that simulate conversations, and you gotta anticipate every capability.</p>
+<h2>where this needs to go</h2>
+<p>Viable experiment:</p>
+<ol>
+<li>Procedural generator covering ~20 distinct capabilities (name memory, search, clarification, multi-turn reasoning, simple math from provided formulas, etc.)</li>
+<li>Train small model (~100M) from scratch only on this data</li>
+<li>Test novel names, novel facts in context, novel combos of capabilities</li>
+<li>Compare to an identical model trained on standard web text: does the Overta model hallucinate less? Reason worse? Something in between?</li>
+</ol>
+<p>Amy&#39;s paused this to work on other stuff but the repo and approach are documented. Someone could pick it up.</p>
+<h2>tldr</h2>
+<ul>
+<li>Train a model that knows zero facts, only reasoning and conversation</li>
+<li>All knowledge from context window or retrieval at inference time</li>
+<li>Training data procedurally generated, not scraped</li>
+<li>Claims: no hallucination (can&#39;t invent facts you never learned), better param efficiency, explicit grounding</li>
+<li>Problems: language encodes knowledge, procedural generation is labor-intensive, model brittle so far</li>
+<li>Amy poked at this with 350M models, promising, not there yet</li>
+</ul>
+<hr>
+</article>
+      <footer class="paper-credit" role="contentinfo"><span class="credit-label">Credit goes to Amy & the CompactAI server</span>: <span class="credit-names"></span></footer>
+    </div>
+  </main>
+</body>
+</html>

Papers/sparrow_fant-md.html ADDED Viewed

	@@ -0,0 +1,57 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>Sparrow, FANT, and the Weird Stuff That Works</title>
+  <link rel="preconnect" href="https://fonts.googleapis.com">
+  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
+  <link href="https://fonts.googleapis.com/css2?family=Geist:wght@300..700&family=Geist+Mono:wght@400;500&display=swap" rel="stylesheet">
+  <link rel="stylesheet" href="../everywhere.css">
+</head>
+<body>
+  <main class="container paper-page">
+    <div class="paper-page-shell">
+      <h1 class="paper-page-title">Sparrow, FANT, and the Weird Stuff That Works</h1>
+      <div class="paper-page-meta">Crownelius · 2026-05</div>
+      <article class="paper-body" role="article" aria-label="Paper content"><h1>Sparrow, FANT, and the Weird Stuff That Works</h1>
+<p>Shane (Crownelius) posted this on HuggingFace after sitting on the story. It&#39;s part technical writeup, part personal history, part challenge. Here&#39;s a bunch of half-built things, some of them work way better than they should, come help figure out why.</p>
+<p>The headline: Sparrow, a 1M-parameter model, scores 95.6% on a 1,900-question math eval where Owl Alpha scores 61.4%. One million parameters versus a 70B-class model. Not a typo.</p>
+<h2>how Shane got here</h2>
+<p>Unusual route into model building. Dropped a finetune that scored weirdly high on EQ, the empathy benchmark nobody goes after. Publishing outfit reached out, wanted full books generated through Claude, architected prompt-by-prompt. Gig ran over $25k in API credits. Shane designed every prompt. About 60% of that publisher&#39;s pipeline still uses his curation.</p>
+<p>That flipped him from using models to building em. Tiny ones specifically, small enough for a 12GB card, small enough that every architectural choice is exposed and wrong decisions are embarrassing.</p>
+<h2>FANT and stuff that shouldn&#39;t work</h2>
+<p>FANT (github.com/Crownelius/fant3) is the experimentation framework. Three iterations, each a full rewrite, the kind you do when you care about findings, not code. Philosophy: at small scale you can try genuinely dumb ideas and actually see what they do. No hiding behind scale.</p>
+<p>The standout: SleepGate. A memory consolidation routine that fires every 100 training steps. Like half a screen of code, mostly comments. On FANT 2 at 5M parameters, 1,000-problem procedural math eval: +5.3 points. From one architectural decision. Same optimizer, same data, same schedule. Just a tiny consolidation pass wedged in.</p>
+<p>Why does that move the needle that much? Nobody&#39;s fully sure. But it does. That&#39;s the point of the framework, surface weird stuff so people can dig in.</p>
+<h2>SpinorApollonian Memory (this is the newest weird thing)</h2>
+<p>This one needs setup. Shane found a paper by Jerzy Kocik on tangency spinors, classifying Apollonian disk packings using 2D Minkowski spinors. Descartes&#39; circle theorem turns out to say that the four curvatures of mutually tangent circles form a null vector of a Minkowski quadratic form in signature (1,3). Actual math, not metaphor.</p>
+<p>Application: memory writes get split by chirality. Left-spinning packs go to one bucket, right-spinning packs to another. Geometric routing instead of threshold routing.</p>
+<p>The failure mode that&#39;d been a problem for two months, packs starving each other into uselessness, just vanished. Same pattern at every scale tested, 5M through 742M. Hard to believe. But the ablation table doesn&#39;t lie.</p>
+<p>&quot;Sounds like peanut butter on a hamburger, but the ablation table says it works.&quot; (Shane)</p>
+<h2>Sparrow vs Owl Alpha</h2>
+<p>Sparrow is separate from FANT. Different skeleton, different router, different everything. Small and surgical, built for math. Not general-purpose, not trying to be.</p>
+<p>Across 38 head-to-head evals (n=50 each, numeric scoring), Sparrow ties or beats Owl Alpha on 33 of 38. 87%.</p>
+<p>The five losses are on simple multiplication and division at digit counts where Owl&#39;s training data is dense. Fair wins for Owl, not the interesting ones.</p>
+<p>The interesting ones: Goldbach, Collatz, Fermat-little, complex modulus, dot products, distance, determinants. +100 percentage points in some cases. +98pp. +88pp.</p>
+<p>A 1M-parameter byte-level model beating a 70B-class frontier model by margins you&#39;d normally assume are bugs. Shane checked twice, bothered three friends, re-ran at three temperatures. Same numbers. Makes you question how much of big-model performance is just brute force.</p>
+<p>Trick: Sparrow uses a calc-tag wrapper for arithmetic. It learns to call it correctly. 5-digit multiplication: Owl Alpha 12%, Gemma 3 27B 0%, Sparrow 100%. Model doesn&#39;t do math, learns to use the tool. That&#39;s the whole thing.</p>
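+<p>Roughly what a calc-tag wrapper could look like on the tooling side. The writeup doesn&#39;t show Sparrow&#39;s actual tag syntax, so the <code>&lt;calc&gt;...&lt;/calc&gt;</code> format, the function names, and the restriction to a single integer operation are all assumptions for this sketch. The model&#39;s only job is to emit the tag correctly; ordinary code does the arithmetic.</p>
+<pre><code class="language-python">import re
+
+# Hypothetical tag format: one integer, one operator, one integer.
+CALC_TAG = re.compile(r&quot;&lt;calc&gt;\s*(-?\d+)\s*([+\-*/])\s*(-?\d+)\s*&lt;/calc&gt;&quot;)
+
+
+def _evaluate(match):
+    a, op, b = int(match.group(1)), match.group(2), int(match.group(3))
+    if op == &quot;+&quot;:
+        return str(a + b)
+    if op == &quot;-&quot;:
+        return str(a - b)
+    if op == &quot;*&quot;:
+        return str(a * b)
+    if b == 0:
+        return match.group(0)  # leave division by zero untouched
+    # Exact integer quotient when it divides evenly, otherwise a float.
+    return str(a // b) if a % b == 0 else str(a / b)
+
+
+def resolve_calc_tags(text):
+    # Replace every well-formed calc tag in the model output with its value.
+    return CALC_TAG.sub(_evaluate, text)
+</code></pre>
+<p>Python integers are arbitrary precision, so digit count stops mattering once the tag is emitted correctly. Malformed tags simply don&#39;t match and pass through unchanged.</p>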
+<h2>unsolved: FANT + Sparrow fusion</h2>
+<p>Shane can&#39;t get Sparrow&#39;s approach to work with FANT&#39;s architecture. Something about how Sparrow handles symbolic state should connect to FANT&#39;s recursion stack, but the fusion&#39;s been elusive for weeks. Breadcrumbs are public, repo&#39;s open, issues are open, the challenge is sitting there.</p>
+<p>&quot;Half-built spaceships are best shared.&quot;</p>
+<h2>tldr</h2>
+<ul>
+<li>FANT: experimentation framework for tiny models (5M–742M)</li>
+<li>SleepGate: consolidation pass every 100 training steps, +5.3 points. Nobody knows exactly why</li>
+<li>SpinorApollonian Memory: geometric routing by chirality using Minkowski spinors, killed a starvation bug</li>
+<li>Sparrow: 1M-param math model ties or beats Owl Alpha (70B-class) on 33 of 38 evals (87%). Not a typo</li>
+<li>Trick is tool use (calc-tag), not raw compute, Sparrow calls for help</li>
+<li>FANT + Sparrow fusion unsolved, open challenge</li>
+</ul>
+<hr>
+</article>
+      <footer class="paper-credit" role="contentinfo"><span class="credit-label">Credit goes to Crownelius (Shane) & CompactAI</span>: <span class="credit-names"></span></footer>
+    </div>
+  </main>
+</body>
+</html>

Papers/trueact-md.html ADDED Viewed

	@@ -0,0 +1,107 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>TrueACT: A Different Kind of Neuron</title>
+  <link rel="preconnect" href="https://fonts.googleapis.com">
+  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
+  <link href="https://fonts.googleapis.com/css2?family=Geist:wght@300..700&family=Geist+Mono:wght@400;500&display=swap" rel="stylesheet">
+  <link rel="stylesheet" href="../everywhere.css">
+</head>
+<body>
+  <main class="container paper-page">
+    <div class="paper-page-shell">
+      <h1 class="paper-page-title">TrueACT: A Different Kind of Neuron</h1>
+      <div class="paper-page-meta">CompactAI · 2026-05</div>
+      <article class="paper-body" role="article" aria-label="Paper content"><h1>TrueACT: A Different Kind of Neuron</h1>
+<p>Transformers use the same MLP on every token. Every time. Same weights, same math, no memory of where it&#39;s been in the sequence, no sense of how confident it is. TrueACT chucks that. Replaces the MLP with a small recurrent block that loops. Reads the token, updates a hidden state, checks if it&#39;s confident enough, either answers or loops again. Keeps going til it hits 0.99 confidence or runs out of steps.</p>
+<p>It&#39;s not a standard neuron. It&#39;s not an attention head. It&#39;s a loop with a router that picks between four specialized operations depending on what the token needs.</p>
+<p>Half the parameters of the equivalent standard model. Same loss. 2.8x slower cause of the loop. You&#39;re trading flops for parameter efficiency.</p>
+<h2>the four experts</h2>
+<p>Each TrueACT layer has four experts. The router picks which combination to use per token, not per layer. So different tokens in the same batch can fire different experts. Here&#39;s what they do:</p>
+<p><strong>Think Cell</strong> — this is the actual recurrent part. Updates the latent state, which is basically working memory that persists across steps within the same layer. Think of it like scratch paper the model scribbles on while reasoning.</p>
+<p><strong>Standard</strong> — plain linear pattern matching. Same job the normal MLP would do. Catches the easy stuff.</p>
+<p><strong>Fancy</strong> — this is the weird one. Does math in log-space. For multiplication, <code>log(a*b) = log(a) + log(b)</code>. Addition is something a linear layer can already do. So instead of needing a pile of neurons to approximate a multiplication curve, one Fancy expert can do it cleanly. Log then add then exp. Multiplication, division, ratios, chained operations.</p>
+<p><strong>Memory Vault</strong> — key → value associative lookup. A dedicated place to store facts instead of smearing them across all the weights. Retrieve, don&#39;t approximate.</p>
+<p>The router takes <code>[input, latent_state, step_count]</code>, sticks it through a linear layer plus softmax, and that&#39;s the expert selection. Standard and Fancy spend from an action budget. The loop stops once the budget is gone or it hits the 32-step cap, whichever comes first.</p>
+<h2>why this works</h2>
+<p>A standard transformer neuron is a linear approximator. For something like <code>a * b = c</code>, you&#39;d need a big pile of neurons approximating a curve. It works eventually but it&#39;s wasteful. The weights end up encoding the same multiplication table across hundreds of parameters, and there&#39;s no clean way to just <em>do the math</em>.</p>
+<p>The Fancy expert sidesteps that. Goes to log-space, adds, comes back. One neuron doing what used to take a crowd.</p>
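+<p>The log-space trick on its own, as plain Python. This is just the identity, not the Fancy expert&#39;s actual code: the function names are invented here, and it assumes strictly positive inputs since the log of zero or a negative number isn&#39;t defined. How the real expert deals with signs and zeros isn&#39;t covered in this writeup.</p>
+<pre><code class="language-python">import math
+
+
+def fancy_multiply(a, b):
+    # log(a*b) = log(a) + log(b), so multiplication becomes one addition.
+    return math.exp(math.log(a) + math.log(b))
+
+
+def fancy_chain(numerators, denominators=()):
+    # Products add logs, divisions subtract them; a single exp at the end.
+    total = sum(math.log(v) for v in numerators)
+    total -= sum(math.log(v) for v in denominators)
+    return math.exp(total)
+</code></pre>
+<p>Results come back as floats with the usual rounding error, so anything that needs exact integers has to round afterwards.</p>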
+<p>The Memory Vault is the same idea from the other direction. Instead of memorizing facts by storing them implicitly in weight matrices, just do a key-value lookup. Store it once, retrieve it when needed.</p>
+<p>The Think Cell ties it together. Gives the model a place to hold intermediate state while it loops through the experts. Without it, each token is a one-shot guess. With it, the model can go &quot;hmm let me think about this&quot; and take another step.</p>
+<p>So the model gets more mileage per parameter. The tradeoff is sequential compute. You can&#39;t parallelize a loop that depends on its own output. That&#39;s where the 2.8x slowdown comes from.</p>
+<h2>the numbers</h2>
+<p>3-layer LLaMA-style comparison at d=384:</p>
+<table>
+<thead>
+<tr>
+<th>metric</th>
+<th>Standard</th>
+<th>TrueACT</th>
+</tr>
+</thead>
+<tbody><tr>
+<td>loss</td>
+<td>0.0884</td>
+<td>0.0880</td>
+</tr>
+<tr>
+<td>params</td>
+<td>852,864</td>
+<td>428,652</td>
+</tr>
+<tr>
+<td>train speed</td>
+<td>1x</td>
+<td>2.8x slower</td>
+</tr>
+</tbody></table>
+<p>Same loss, roughly half the weights. The slowdown is real — the loop is sequential, can&#39;t be parallelized. But you&#39;re getting the same quality out of half the parameter budget.</p>
+<p>1-layer arithmetic reasoner: 12/12 on a fixed 12-expression benchmark. 91.6% on 500 random expressions. The misses are mostly multi-digit arithmetic, <code>42*88=3524</code> type stuff (the real product is 3696). Structure like parentheses, operator precedence, and intermediate steps all comes out clean. The model actually writes out the worked steps: <code>((5*5)+(10*2))=(25+(10*2))=(25+20)=45|</code></p>
+<h2>the router in action</h2>
+<p>The routing stats tell you what the model&#39;s doing under the hood. Example from the inference CLI:</p>
+<pre><code>Prompt &gt; ((5*5)+(10*2))=
+TrueACT : ((5*5)+(10*2))=(25+(10*2))=(25+20)=45|
+         [Think: 15% | Stand: 30% | Fancy: 45% | Vault: 10%]
+</code></pre>
+<p>For arithmetic, Fancy gets most of the budget. Makes sense — multiplication is the expensive operation and Fancy handles it in log-space. Standard catches the easy pattern matching (digits, parens, equals signs). Think Cell does the state tracking across steps. Memory Vault probably handles the number facts.</p>
+<p>The router isn&#39;t pre-programmed. It learns which expert to use for which kind of token during training. The routing stats are emergent.</p>
+<h2>how training works</h2>
+<p>Data is an infinite stream of generated arithmetic — <code>+</code>, <code>-</code>, <code>*</code>, parentheses, multi-step chains. Format is <code>((5*5)+(10*2))=(25+(10*2))=(25+20)=45|</code>. The model sees a random position in the chain and has to predict the next character.</p>
+<p>Context window is 64 chars, one-hot encoded. The alphabet is 12 characters (digits, operators, parens, equals, pipe), so the input is a flattened 64 × 12 = 768-dimensional one-hot vector.</p>
+<p>Batch size 8192. AdamW, lr 5e-4, weight decay 0.01. 1-3 layer TrueACTStack, t_dim=256, max 32 ACT steps per layer.</p>
+<p>Training also runs a StandardStack (same structure, ordinary SiLU MLPs) side by side as the control group. Checkpoints save both every 500 steps.</p>
+<h2>the architecture, deeper</h2>
+<p><strong>TrueACTLayer</strong>: concat input x and latent h → xh. Router reads <code>[xh, step_frac]</code> → softmax over 4 experts. Compute expert outputs, gate by <code>router_prob * remaining_budget</code>, accumulate into the result, update h through the Think Cell. Repeat til budget hits zero or 32 steps.</p>
+<p><strong>TrueACTStack</strong>: N of those layers with residual connections. Input projection to model dim at the bottom, output projection to vocab at the top.</p>
+<p><strong>StandardStack</strong>: same structure but with normal SiLU MLPs instead of the TrueACT loop. The control group.</p>
+<p>The budget mechanism matters. Standard and Fancy both consume budget when used. Think Cell and Memory Vault, from how they&#39;re structured, seem to be state management rather than compute, so they don&#39;t appear to draw from the budget. The model can think (Think Cell) and retrieve (Memory Vault) freely. Only the expensive ops cost steps.</p>
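+<p>A toy scalar version of that loop, to make the control flow concrete. This is not the real <code>TrueACTLayer</code>: the function names, the scalar state, the spend rule (budget shrinks by the share of router weight on Standard and Fancy), and the <code>eps</code> cutoff are all assumptions made for the sketch. Only the overall shape comes from the description above: route, gate by probability times remaining budget, accumulate, update the latent, stop on empty budget or the step cap.</p>
+<pre><code class="language-python">import math
+
+
+def softmax(logits):
+    peak = max(logits)
+    exps = [math.exp(v - peak) for v in logits]
+    total = sum(exps)
+    return [e / total for e in exps]
+
+
+def act_loop(x, router, experts, update_latent, max_steps=32, eps=1e-3):
+    # experts: four callables ordered [think, standard, fancy, vault].
+    h = 0.0        # latent state carried across steps
+    budget = 1.0   # remaining action budget
+    out = 0.0
+    steps = 0
+    while steps &lt; max_steps and budget &gt; eps:
+        probs = softmax(router(x, h, steps / max_steps))
+        for prob, expert in zip(probs, experts):
+            out += prob * budget * expert(x, h)  # gate by prob * remaining budget
+        # Assumed spend rule: only Standard and Fancy draw the budget down.
+        budget -= budget * (probs[1] + probs[2])
+        h = update_latent(x, h)                  # Think Cell updates the latent
+        steps += 1
+    return out, steps
+</code></pre>
+<p>With a router that spreads weight evenly, this toy halts after 10 steps at the default <code>eps</code>. A router that puts nearly all its weight on the Think Cell never spends budget and runs into the 32-step cap instead.</p>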
+<h2>how it started</h2>
+<p>This thing began as one log-space neuron trying to learn <code>x*y=z</code>. That&#39;s it. One neuron doing multiplication in log-space.</p>
+<p>41 notes later in MEMORY.md. Mode collapse. Gradient explosions. Dead architecture after dead architecture. Full rewrites. Things that almost worked before falling apart at higher dimensions.</p>
+<p>The 41 notes on what didn&#39;t work are arguably more valuable than what did. Every dead end, every fix, every &quot;wait that shouldn&#39;t have helped&quot; moment. Built for AI agents to read so they don&#39;t repeat the same mistakes.</p>
+<p>The four-expert router, the Think Cell, the budget gating, the step cap — none of that was in the original idea. Each piece got added because something broke without it.</p>
+<h2>looking forward</h2>
+<p>The toy results are promising. Half the params, same loss. The next step is figuring out if this scales past small arithmetic models, and what the loop overhead looks like at bigger sizes. That&#39;s the open question.</p>
+<hr>
+<p><em>the 41st attempt finally worked. go read MEMORY.md if you wanna avoid the first 40.</em></p>
+<h2>tldr</h2>
+<ul>
+<li>Swap the transformer MLP for a recurrent block that loops til it&#39;s confident</li>
+<li>Four experts: Think Cell (working memory), Standard (linear matching), Fancy (log-space math), Memory Vault (key→value lookup)</li>
+<li>Router picks which experts fire per token based on input, latent state, and step count</li>
+<li>Half the params (429k vs 853k), same loss (0.0880 vs 0.0884), 2.8x slower</li>
+<li>1-layer solves 12/12 on fixed benchmark, 91.6% on 500 random arithmetic expressions</li>
+<li>Fancy expert does <code>log(a*b) = log(a)+log(b)</code> — one neuron doing what used to take a crowd</li>
+<li>Started as one log-space neuron, 41 failed notes later it&#39;s a whole architecture</li>
+<li>Tradeoff: sequential compute for parameter efficiency</li>
+</ul>
+<hr>
+</article>
+      <footer class="paper-credit" role="contentinfo"><span class="credit-label">Credit goes to CompactAI</span>: <span class="credit-names"></span></footer>
+    </div>
+  </main>
+</body>
+</html>

everywhere.css ADDED Viewed

	@@ -0,0 +1,803 @@

+    *,
+    *::before,
+    *::after {
+      margin: 0;
+      padding: 0;
+      box-sizing: border-box;
+    }
+    :root {
+      --bg: #000000;
+      --bg-card: #0a0a0a;
+      --bg-card-hover: #0f0f0f;
+      --border-card: #1a1a1a;
+      --text: #ffffff;
+      --text-soft: #f5f5f5;
+      --text-muted: #888888;
+      --text-dim: #555555;
+      --accent: #c8960c;
+      --accent-bright: #ffd633;
+      --accent-muted: #8b6508;
+      --gold-gradient: linear-gradient(135deg, #ffd633 0%, #c8960c 50%, #a0760a 100%);
+      --grid-line: rgba(255, 255, 255, 0.03);
+      --grid-line-major: rgba(255, 255, 255, 0.06);
+      --container-max: none;
+    }
+    html {
+      scroll-behavior: smooth;
+      background: var(--bg);
+    }
+    /* ---- Focus styles ---- */
+    :focus-visible {
+      outline: 2px solid var(--accent-bright);
+      outline-offset: 3px;
+      border-radius: 4px;
+    }
+    .paper-card:focus-visible {
+      outline: 2px solid var(--accent-bright);
+      outline-offset: 2px;
+      border-radius: 12px;
+    }
+    .paper-view-close:focus-visible {
+      outline: 2px solid var(--accent-bright);
+      outline-offset: 2px;
+      border-radius: 8px;
+    }
+    /* ---- Skip link ---- */
+    .skip-link {
+      position: fixed;
+      top: 0;
+      left: 0;
+      z-index: 200;
+      padding: 12px 20px;
+      background: var(--accent);
+      color: #000;
+      font-family: 'Geist', sans-serif;
+      font-size: 14px;
+      font-weight: 600;
+      text-decoration: none;
+      border-radius: 0 0 8px 0;
+      transform: translateY(-100%);
+      transition: transform 0.15s ease;
+    }
+    .skip-link:focus {
+      transform: translateY(0);
+    }
+    /* ---- Screen reader only ---- */
+    .sr-only {
+      position: absolute;
+      width: 1px;
+      height: 1px;
+      padding: 0;
+      margin: -1px;
+      overflow: hidden;
+      clip: rect(0, 0, 0, 0);
+      white-space: nowrap;
+      border: 0;
+    }
+    /* ---- Reduced motion ---- */
+    @media (prefers-reduced-motion: reduce) {
+      *,
+      *::before,
+      *::after {
+        animation-duration: 0.01ms !important;
+        animation-iteration-count: 1 !important;
+        transition-duration: 0.01ms !important;
+      }
+      html {
+        scroll-behavior: auto;
+      }
+    }
+    body {
+      background: var(--bg);
+      color: var(--text);
+      font-family: 'Geist', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
+      font-size: 16px;
+      font-weight: 350;
+      line-height: 1.7;
+      min-height: 100vh;
+      position: relative;
+      overflow-x: hidden;
+    }
+    /* ---- Dot grid background ---- */
+    body::before {
+      content: '';
+      position: fixed;
+      inset: 0;
+      pointer-events: none;
+      z-index: 0;
+      background-image:
+        radial-gradient(circle, var(--grid-line) 1px, transparent 1px);
+      background-size: 40px 40px;
+      mask-image: radial-gradient(ellipse 70% 70% at 50% 0%, black 30%, transparent 70%);
+      -webkit-mask-image: radial-gradient(ellipse 70% 70% at 50% 0%, black 30%, transparent 70%);
+    }
+    body::after {
+      content: '';
+      position: fixed;
+      inset: 0;
+      pointer-events: none;
+      z-index: 0;
+      background-image: url("data:image/svg+xml,%3Csvg viewBox='0 0 256 256' xmlns='http://www.w3.org/2000/svg'%3E%3Cfilter id='noise'%3E%3CfeTurbulence type='fractalNoise' baseFrequency='0.9' numOctaves='4' stitchTiles='stitch'/%3E%3C/filter%3E%3Crect width='100%25' height='100%25' filter='url(%23noise)' opacity='0.02'/%3E%3C/svg%3E");
+      opacity: 0.4;
+    }
+    /* ---- Container ---- */
+    .container {
+      position: relative;
+      z-index: 1;
+      max-width: var(--container-max);
+      margin: 0 auto;
+      padding: 0 40px;
+    }
+    /* ---- Header ---- */
+    .site-header {
+      padding: 80px 0 48px;
+      text-align: center;
+      position: relative;
+    }
+    .site-title {
+      font-family: 'Geist', sans-serif;
+      font-size: clamp(48px, 8vw, 80px);
+      font-weight: 700;
+      letter-spacing: -0.03em;
+      line-height: 1.05;
+      color: var(--text);
+      animation: fadeUp 0.8s ease-out both;
+    }
+    .site-title .gold {
+      background: var(--gold-gradient);
+      -webkit-background-clip: text;
+      -webkit-text-fill-color: transparent;
+      background-clip: text;
+    }
+    .site-subtitle {
+      font-family: 'Geist Mono', monospace;
+      font-size: 13px;
+      font-weight: 400;
+      letter-spacing: 0.12em;
+      text-transform: uppercase;
+      color: var(--text-dim);
+      margin-top: 16px;
+      animation: fadeUp 0.8s 0.15s ease-out both;
+    }
+    .header-divider {
+      width: 60px;
+      height: 1px;
+      background: var(--gold-gradient);
+      margin: 32px auto 0;
+      opacity: 0.5;
+      animation: fadeUp 0.8s 0.2s ease-out both;
+    }
+    @keyframes fadeUp {
+      from { opacity: 0; transform: translateY(20px); }
+      to { opacity: 1; transform: translateY(0); }
+    }
+    /* ---- Section label ---- */
+    .section-label {
+      font-family: 'Geist Mono', monospace;
+      font-size: 11px;
+      font-weight: 500;
+      letter-spacing: 0.14em;
+      text-transform: uppercase;
+      color: var(--text-dim);
+      margin-bottom: 24px;
+      padding-left: 2px;
+    }
+    /* ---- Paper Grid ---- */
+    .papers-grid {
+      display: grid;
+      grid-template-columns: repeat(3, 1fr);
+      gap: 28px;
+      padding-bottom: 96px;
+      list-style: none;
+    }
+    @media (max-width: 1100px) {
+      .papers-grid {
+        grid-template-columns: repeat(2, 1fr);
+      }
+    }
+    @media (max-width: 600px) {
+      .papers-grid {
+        grid-template-columns: 1fr;
+      }
+    }
+    /* ---- Paper Card (catalog) ---- */
+    .paper-card {
+      position: relative;
+      background: var(--bg-card);
+      border: 1px solid var(--border-card);
+      border-radius: 12px;
+      padding: 32px 36px;
+      cursor: pointer;
+      min-height: 180px;
+      display: flex;
+      flex-direction: column;
+      justify-content: space-between;
+      transition: all 0.35s ease;
+      overflow: hidden;
+      animation: fadeUp 0.6s ease-out both;
+    }
+    .paper-card::before {
+      content: '';
+      position: absolute;
+      inset: 0;
+      border-radius: 12px;
+      background: radial-gradient(
+        600px circle at var(--mx, 50%) var(--my, 50%),
+        rgba(200, 150, 12, 0.06),
+        transparent 40%
+      );
+      opacity: 0;
+      transition: opacity 0.4s ease;
+      pointer-events: none;
+    }
+    .paper-card:hover {
+      border-color: rgba(200, 150, 12, 0.3);
+      background: var(--bg-card-hover);
+      transform: translateY(-2px);
+      box-shadow:
+        0 20px 60px rgba(0, 0, 0, 0.4),
+        0 0 0 1px rgba(200, 150, 12, 0.08);
+    }
+    .paper-card:hover::before {
+      opacity: 1;
+    }
+    .paper-card:active {
+      transform: translateY(0);
+      transition: all 0.1s ease;
+    }
+    .card-title {
+      font-family: 'Geist', sans-serif;
+      font-size: 20px;
+      font-weight: 600;
+      letter-spacing: -0.02em;
+      line-height: 1.3;
+      color: var(--text);
+      margin-bottom: 10px;
+    }
+    .card-author {
+      font-family: 'Geist Mono', monospace;
+      font-size: 11px;
+      font-weight: 500;
+      letter-spacing: 0.06em;
+      color: var(--accent-muted);
+      margin-bottom: 8px;
+    }
+    .card-meta {
+      display: flex;
+      align-items: center;
+      gap: 12px;
+      font-family: 'Geist Mono', monospace;
+      font-size: 12px;
+      color: var(--text-dim);
+    }
+    .card-tag {
+      font-family: 'Geist Mono', monospace;
+      font-size: 10px;
+      font-weight: 500;
+      letter-spacing: 0.08em;
+      text-transform: uppercase;
+      color: var(--accent-muted);
+      background: rgba(200, 150, 12, 0.08);
+      padding: 3px 8px;
+      border-radius: 4px;
+      border: 1px solid rgba(200, 150, 12, 0.12);
+    }
+    .card-proved {
+      font-family: 'Geist Mono', monospace;
+      font-size: 10px;
+      font-weight: 600;
+      letter-spacing: 0.08em;
+      text-transform: uppercase;
+      color: #1a7a1a;
+      background: rgba(26, 122, 26, 0.1);
+      padding: 3px 10px;
+      border-radius: 4px;
+      border: 1px solid rgba(26, 122, 26, 0.2);
+      flex-shrink: 0;
+    }
+    /* ---- Featured card ---- */
+    .paper-card.featured {
+      grid-column: 1 / -1;
+      padding: 48px 56px;
+      min-height: 220px;
+      border-color: color-mix(in srgb, var(--accent) 25%, transparent);
+      background: radial-gradient(ellipse 70% 50% at 50% 0%, rgba(200, 150, 12, 0.04), transparent 60%), var(--bg-card);
+    }
+    .paper-card.featured .card-title {
+      font-size: 28px;
+      letter-spacing: -0.03em;
+    }
+    .paper-card.featured .card-author {
+      font-size: 12px;
+      margin-bottom: 12px;
+    }
+    .paper-card.featured .card-meta {
+      font-size: 13px;
+    }
+    .paper-card.featured:hover {
+      border-color: color-mix(in srgb, var(--accent) 45%, transparent);
+      background: radial-gradient(ellipse 70% 50% at 50% 0%, rgba(200, 150, 12, 0.07), transparent 60%), var(--bg-card-hover);
+    }
+    @media (max-width: 640px) {
+      .paper-card.featured {
+        padding: 28px 24px;
+        min-height: 160px;
+      }
+      .paper-card.featured .card-title {
+        font-size: 22px;
+      }
+    }
+    /* ---- Proof script in paper view ---- */
+    .proof-section {
+      margin-top: 32px;
+      padding-top: 24px;
+      border-top: 1px solid var(--border-card);
+    }
+    .proof-label {
+      font-family: 'Geist Mono', monospace;
+      font-size: 11px;
+      font-weight: 600;
+      letter-spacing: 0.1em;
+      text-transform: uppercase;
+      color: #1a7a1a;
+      margin-bottom: 14px;
+    }
+    .proof-script {
+      background: rgba(0, 0, 0, 0.4);
+      border: 1px solid var(--border-card);
+      border-radius: 8px;
+      padding: 20px 24px;
+      overflow-x: auto;
+      font-family: 'Geist Mono', monospace;
+      font-size: 12px;
+      line-height: 1.6;
+      color: var(--text-muted);
+      tab-size: 4;
+      max-height: 500px;
+      overflow-y: auto;
+    }
+    .proof-script::-webkit-scrollbar {
+      width: 6px;
+      height: 6px;
+    }
+    .proof-script::-webkit-scrollbar-track {
+      background: transparent;
+    }
+    .proof-script::-webkit-scrollbar-thumb {
+      background: var(--border-card);
+      border-radius: 3px;
+    }
+    .card-arrow {
+      position: absolute;
+      right: 24px;
+      top: 50%;
+      transform: translateY(-50%);
+      font-size: 20px;
+      color: var(--text-dim);
+      opacity: 0;
+      transition: all 0.3s ease;
+    }
+    .paper-card:hover .card-arrow {
+      opacity: 1;
+      color: var(--accent);
+    }
+    /* ---- Modal / Paper View ---- */
+    .paper-overlay {
+      position: fixed;
+      inset: 0;
+      z-index: 100;
+      background: rgba(0, 0, 0, 0.9);
+      backdrop-filter: blur(12px);
+      -webkit-backdrop-filter: blur(12px);
+      display: flex;
+      align-items: flex-start;
+      justify-content: center;
+      padding: 0;
+      overflow-y: auto;
+      animation: overlayIn 0.25s ease-out;
+    }
+    @keyframes overlayIn {
+      from { opacity: 0; }
+      to { opacity: 1; }
+    }
+    .paper-view {
+      background: #0d0d0d;
+      border: none;
+      border-radius: 0;
+      max-width: none;
+      width: 100%;
+      min-height: 100vh;
+      padding: 64px clamp(32px, 8vw, 120px) 80px;
+      position: relative;
+      animation: modalIn 0.35s ease-out;
+      box-shadow: none;
+    }
+    .paper-view-frame {
+      display: block;
+      width: 100%;
+      min-height: calc(100vh - 144px);
+      border: 0;
+      background: transparent;
+    }
+    @keyframes modalIn {
+      from { opacity: 0; transform: translateY(24px) scale(0.97); }
+      to { opacity: 1; transform: translateY(0) scale(1); }
+    }
+    .paper-view-close {
+      position: fixed;
+      top: 24px;
+      right: 24px;
+      width: 40px;
+      height: 40px;
+      border-radius: 8px;
+      border: 1px solid var(--border-card);
+      background: rgba(10, 10, 10, 0.9);
+      backdrop-filter: blur(8px);
+      -webkit-backdrop-filter: blur(8px);
+      color: var(--text-muted);
+      cursor: pointer;
+      display: flex;
+      align-items: center;
+      justify-content: center;
+      font-size: 20px;
+      font-family: 'Geist Mono', monospace;
+      transition: all 0.2s ease;
+      z-index: 10;
+    }
+    .paper-view-close:hover {
+      border-color: rgba(200, 150, 12, 0.4);
+      color: var(--accent);
+      background: #111;
+    }
+    .paper-view-loading {
+      display: flex;
+      align-items: center;
+      justify-content: center;
+      min-height: 200px;
+      flex-direction: column;
+      gap: 16px;
+    }
+    .loading-spinner {
+      width: 32px;
+      height: 32px;
+      border: 2px solid var(--border-card);
+      border-top-color: var(--accent);
+      border-radius: 50%;
+      animation: spin 0.8s linear infinite;
+    }
+    @keyframes spin {
+      to { transform: rotate(360deg); }
+    }
+    /* ---- Paper Body (inside modal) ---- */
+    .paper-body {
+      color: var(--text-soft);
+      max-width: 900px;
+      margin: 0 auto;
+    }
+    .paper-body h1 {
+      font-family: 'Geist', sans-serif;
+      font-size: 2rem;
+      font-weight: 700;
+      letter-spacing: -0.03em;
+      line-height: 1.2;
+      color: var(--text);
+      margin-bottom: 8px;
+    }
+    .paper-body h2 {
+      font-family: 'Geist', sans-serif;
+      font-size: 1.3rem;
+      font-weight: 600;
+      letter-spacing: -0.02em;
+      color: var(--text-soft);
+      margin: 36px 0 12px;
+      padding-bottom: 8px;
+      border-bottom: 1px solid var(--border-card);
+    }
+    .paper-body h2:first-child {
+      margin-top: 0;
+    }
+    .paper-body h3 {
+      font-family: 'Geist', sans-serif;
+      font-size: 1.05rem;
+      font-weight: 600;
+      color: var(--text-muted);
+      margin: 22px 0 8px;
+    }
+    .paper-body p {
+      margin-bottom: 14px;
+      line-height: 1.75;
+    }
+    .paper-body ol,
+    .paper-body ul {
+      margin: 0 0 14px 24px;
+    }
+    .paper-body li {
+      margin-bottom: 6px;
+    }
+    .paper-body strong {
+      font-weight: 600;
+      color: var(--text);
+    }
+    .paper-body em {
+      font-style: italic;
+    }
+    .paper-body code {
+      font-family: 'Geist Mono', monospace;
+      font-size: 0.82em;
+      background: rgba(255, 255, 255, 0.05);
+      padding: 2px 7px;
+      border-radius: 4px;
+      border: 1px solid var(--border-card);
+      color: var(--accent);
+    }
+    .paper-body hr {
+      border: none;
+      border-top: 1px solid var(--border-card);
+      margin: 28px 0;
+    }
+    .paper-body blockquote {
+      border-left: 2px solid var(--accent-muted);
+      padding-left: 16px;
+      margin: 18px 0;
+      color: var(--text-muted);
+      font-style: italic;
+    }
+    /* ---- Credit section ---- */
+    .paper-credit {
+      margin-top: 40px;
+      padding-top: 20px;
+      border-top: 1px solid var(--border-card);
+      font-family: 'Geist Mono', monospace;
+      font-size: 13px;
+      color: var(--text-dim);
+      text-align: right;
+      letter-spacing: 0.04em;
+    }
+    .paper-credit .credit-label {
+      font-style: italic;
+    }
+    .paper-credit .credit-names {
+      font-weight: 500;
+      color: var(--text-muted);
+    }
+    /* ---- Papermaker callout ---- */
+    .papermaker-callout {
+      text-align: center;
+      padding: 48px 24px 32px;
+      max-width: 560px;
+      margin: 0 auto;
+    }
+    .papermaker-callout p {
+      font-family: 'Geist', sans-serif;
+      font-size: 14px;
+      font-weight: 350;
+      line-height: 1.6;
+      color: var(--text-dim);
+    }
+    .papermaker-callout .role-name {
+      font-family: 'Geist Mono', monospace;
+      font-weight: 500;
+      color: var(--accent-muted);
+    }
+    /* ---- Footer ---- */
+    .site-footer {
+      text-align: center;
+      padding: 0 0 64px;
+      font-family: 'Geist Mono', monospace;
+      font-size: 11px;
+      letter-spacing: 0.1em;
+      text-transform: uppercase;
+      color: var(--text-dim);
+      opacity: 0.5;
+    }
+    /* ---- Empty State ---- */
+    .empty-state {
+      text-align: center;
+      padding: 80px 24px;
+      grid-column: 1 / -1;
+    }
+    .empty-state p {
+      font-size: 16px;
+      color: var(--text-dim);
+      font-style: italic;
+    }
+    /* ---- Error State ---- */
+    .error-banner {
+      text-align: center;
+      padding: 60px 24px;
+      grid-column: 1 / -1;
+    }
+    .error-banner p {
+      font-family: 'Geist Mono', monospace;
+      font-size: 13px;
+      color: var(--text-dim);
+    }
+    /* ---- Responsive ---- */
+    @media (max-width: 640px) {
+      .container {
+        padding: 0 20px;
+      }
+      .site-header {
+        padding: 56px 0 36px;
+      }
+      .site-title {
+        font-size: clamp(36px, 10vw, 56px);
+      }
+      .paper-card {
+        padding: 22px 24px;
+        min-height: 140px;
+      }
+      .paper-view {
+        padding: 32px 20px;
+        min-height: auto;
+      }
+      .paper-view-frame {
+        min-height: 70vh;
+      }
+      .papers-grid {
+        gap: 12px;
+      }
+    }
+    @media print {
+      body {
+        background: white;
+        color: black;
+      }
+      body::before,
+      body::after {
+        display: none;
+      }
+      .paper-overlay {
+      position: static;
+      background: none;
+      backdrop-filter: none;
+    }
+    .paper-view {
+      background: white;
+      color: black;
+      box-shadow: none;
+      border: none;
+      min-height: auto;
+      padding: 24px 0;
+      position: static;
+      animation: none;
+    }
+      .paper-view-frame {
+        min-height: 0;
+        height: auto;
+      }
+      .paper-view-close {
+        display: none;
+      }
+    }
+/* ---- Paper Page Layout ---- */
+.paper-page {
+  padding: 64px 0 80px;
+}
+.paper-page .paper-page-shell {
+  max-width: 900px;
+  margin: 0 auto;
+}
+.paper-page .paper-page-title {
+  font-family: 'Geist', sans-serif;
+  font-size: clamp(34px, 5vw, 56px);
+  font-weight: 700;
+  letter-spacing: -0.03em;
+  line-height: 1.08;
+  color: var(--text);
+  margin-bottom: 12px;
+}
+.paper-page .paper-page-meta {
+  font-family: 'Geist Mono', monospace;
+  font-size: 12px;
+  letter-spacing: 0.08em;
+  text-transform: uppercase;
+  color: var(--text-dim);
+  margin-bottom: 36px;
+}
+.paper-page .paper-page-iframe-note {
+  display: none;
+}

index.html CHANGED

@@ -9,759 +9,7 @@
   <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
   <link href="https://fonts.googleapis.com/css2?family=Geist:wght@300..700&family=Geist+Mono:wght@400;500&display=swap" rel="stylesheet">
-  <style>
-    *,
-    *::before,
-    *::after {
-      margin: 0;
-      padding: 0;
-      box-sizing: border-box;
-    }
-    :root {
-      --bg: #000000;
-      --bg-card: #0a0a0a;
-      --bg-card-hover: #0f0f0f;
-      --border-card: #1a1a1a;
-      --text: #ffffff;
-      --text-soft: #f5f5f5;
-      --text-muted: #888888;
-      --text-dim: #555555;
-      --accent: #c8960c;
-      --accent-bright: #ffd633;
-      --accent-muted: #8b6508;
-      --gold-gradient: linear-gradient(135deg, #ffd633 0%, #c8960c 50%, #a0760a 100%);
-      --grid-line: rgba(255, 255, 255, 0.03);
-      --grid-line-major: rgba(255, 255, 255, 0.06);
-      --container-max: none;
-    }
-    html {
-      scroll-behavior: smooth;
-      background: var(--bg);
-    }
-    /* ---- Focus styles ---- */
-    :focus-visible {
-      outline: 2px solid var(--accent-bright);
-      outline-offset: 3px;
-      border-radius: 4px;
-    }
-    .paper-card:focus-visible {
-      outline: 2px solid var(--accent-bright);
-      outline-offset: 2px;
-      border-radius: 12px;
-    }
-    .paper-view-close:focus-visible {
-      outline: 2px solid var(--accent-bright);
-      outline-offset: 2px;
-      border-radius: 8px;
-    }
-    /* ---- Skip link ---- */
-    .skip-link {
-      position: fixed;
-      top: 0;
-      left: 0;
-      z-index: 200;
-      padding: 12px 20px;
-      background: var(--accent);
-      color: #000;
-      font-family: 'Geist', sans-serif;
-      font-size: 14px;
-      font-weight: 600;
-      text-decoration: none;
-      border-radius: 0 0 8px 0;
-      transform: translateY(-100%);
-      transition: transform 0.15s ease;
-    }
-    .skip-link:focus {
-      transform: translateY(0);
-    }
-    /* ---- Screen reader only ---- */
-    .sr-only {
-      position: absolute;
-      width: 1px;
-      height: 1px;
-      padding: 0;
-      margin: -1px;
-      overflow: hidden;
-      clip: rect(0, 0, 0, 0);
-      white-space: nowrap;
-      border: 0;
-    }
-    /* ---- Reduced motion ---- */
-    @media (prefers-reduced-motion: reduce) {
-      *,
-      *::before,
-      *::after {
-        animation-duration: 0.01ms !important;
-        animation-iteration-count: 1 !important;
-        transition-duration: 0.01ms !important;
-      }
-      html {
-        scroll-behavior: auto;
-      }
-    }
-    body {
-      background: var(--bg);
-      color: var(--text);
-      font-family: 'Geist', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
-      font-size: 16px;
-      font-weight: 350;
-      line-height: 1.7;
-      min-height: 100vh;
-      position: relative;
-      overflow-x: hidden;
-    }
-    /* ---- Dot grid background ---- */
-    body::before {
-      content: '';
-      position: fixed;
-      inset: 0;
-      pointer-events: none;
-      z-index: 0;
-      background-image:
-        radial-gradient(circle, var(--grid-line) 1px, transparent 1px);
-      background-size: 40px 40px;
-      mask-image: radial-gradient(ellipse 70% 70% at 50% 0%, black 30%, transparent 70%);
-      -webkit-mask-image: radial-gradient(ellipse 70% 70% at 50% 0%, black 30%, transparent 70%);
-    }
-    body::after {
-      content: '';
-      position: fixed;
-      inset: 0;
-      pointer-events: none;
-      z-index: 0;
-      background-image: url("data:image/svg+xml,%3Csvg viewBox='0 0 256 256' xmlns='http://www.w3.org/2000/svg'%3E%3Cfilter id='noise'%3E%3CfeTurbulence type='fractalNoise' baseFrequency='0.9' numOctaves='4' stitchTiles='stitch'/%3E%3C/filter%3E%3Crect width='100%25' height='100%25' filter='url(%23noise)' opacity='0.02'/%3E%3C/svg%3E");
-      opacity: 0.4;
-    }
-    /* ---- Container ---- */
-    .container {
-      position: relative;
-      z-index: 1;
-      max-width: var(--container-max);
-      margin: 0 auto;
-      padding: 0 40px;
-    }
-    /* ---- Header ---- */
-    .site-header {
-      padding: 80px 0 48px;
-      text-align: center;
-      position: relative;
-    }
-    .site-title {
-      font-family: 'Geist', sans-serif;
-      font-size: clamp(48px, 8vw, 80px);
-      font-weight: 700;
-      letter-spacing: -0.03em;
-      line-height: 1.05;
-      color: var(--text);
-      animation: fadeUp 0.8s ease-out both;
-    }
-    .site-title .gold {
-      background: var(--gold-gradient);
-      -webkit-background-clip: text;
-      -webkit-text-fill-color: transparent;
-      background-clip: text;
-    }
-    .site-subtitle {
-      font-family: 'Geist Mono', monospace;
-      font-size: 13px;
-      font-weight: 400;
-      letter-spacing: 0.12em;
-      text-transform: uppercase;
-      color: var(--text-dim);
-      margin-top: 16px;
-      animation: fadeUp 0.8s 0.15s ease-out both;
-    }
-    .header-divider {
-      width: 60px;
-      height: 1px;
-      background: var(--gold-gradient);
-      margin: 32px auto 0;
-      opacity: 0.5;
-      animation: fadeUp 0.8s 0.2s ease-out both;
-    }
-    @keyframes fadeUp {
-      from { opacity: 0; transform: translateY(20px); }
-      to { opacity: 1; transform: translateY(0); }
-    }
-    /* ---- Section label ---- */
-    .section-label {
-      font-family: 'Geist Mono', monospace;
-      font-size: 11px;
-      font-weight: 500;
-      letter-spacing: 0.14em;
-      text-transform: uppercase;
-      color: var(--text-dim);
-      margin-bottom: 24px;
-      padding-left: 2px;
-    }
-    /* ---- Paper Grid ---- */
-    .papers-grid {
-      display: grid;
-      grid-template-columns: repeat(3, 1fr);
-      gap: 28px;
-      padding-bottom: 96px;
-      list-style: none;
-    }
-    @media (max-width: 1100px) {
-      .papers-grid {
-        grid-template-columns: repeat(2, 1fr);
-      }
-    }
-    @media (max-width: 600px) {
-      .papers-grid {
-        grid-template-columns: 1fr;
-      }
-    }
-    /* ---- Paper Card (catalog) ---- */
-    .paper-card {
-      position: relative;
-      background: var(--bg-card);
-      border: 1px solid var(--border-card);
-      border-radius: 12px;
-      padding: 32px 36px;
-      cursor: pointer;
-      min-height: 180px;
-      display: flex;
-      flex-direction: column;
-      justify-content: space-between;
-      transition: all 0.35s ease;
-      overflow: hidden;
-      animation: fadeUp 0.6s ease-out both;
-    }
-    .paper-card::before {
-      content: '';
-      position: absolute;
-      inset: 0;
-      border-radius: 12px;
-      background: radial-gradient(
-        600px circle at var(--mx, 50%) var(--my, 50%),
-        rgba(200, 150, 12, 0.06),
-        transparent 40%
-      );
-      opacity: 0;
-      transition: opacity 0.4s ease;
-      pointer-events: none;
-    }
-    .paper-card:hover {
-      border-color: rgba(200, 150, 12, 0.3);
-      background: var(--bg-card-hover);
-      transform: translateY(-2px);
-      box-shadow:
-        0 20px 60px rgba(0, 0, 0, 0.4),
-        0 0 0 1px rgba(200, 150, 12, 0.08);
-    }
-    .paper-card:hover::before {
-      opacity: 1;
-    }
-    .paper-card:active {
-      transform: translateY(0);
-      transition: all 0.1s ease;
-    }
-    .card-title {
-      font-family: 'Geist', sans-serif;
-      font-size: 20px;
-      font-weight: 600;
-      letter-spacing: -0.02em;
-      line-height: 1.3;
-      color: var(--text);
-      margin-bottom: 10px;
-    }
-    .card-author {
-      font-family: 'Geist Mono', monospace;
-      font-size: 11px;
-      font-weight: 500;
-      letter-spacing: 0.06em;
-      color: var(--accent-muted);
-      margin-bottom: 8px;
-    }
-    .card-meta {
-      display: flex;
-      align-items: center;
-      gap: 12px;
-      font-family: 'Geist Mono', monospace;
-      font-size: 12px;
-      color: var(--text-dim);
-    }
-    .card-tag {
-      font-family: 'Geist Mono', monospace;
-      font-size: 10px;
-      font-weight: 500;
-      letter-spacing: 0.08em;
-      text-transform: uppercase;
-      color: var(--accent-muted);
-      background: rgba(200, 150, 12, 0.08);
-      padding: 3px 8px;
-      border-radius: 4px;
-      border: 1px solid rgba(200, 150, 12, 0.12);
-    }
-    .card-proved {
-      font-family: 'Geist Mono', monospace;
-      font-size: 10px;
-      font-weight: 600;
-      letter-spacing: 0.08em;
-      text-transform: uppercase;
-      color: #1a7a1a;
-      background: rgba(26, 122, 26, 0.1);
-      padding: 3px 10px;
-      border-radius: 4px;
-      border: 1px solid rgba(26, 122, 26, 0.2);
-      flex-shrink: 0;
-    }
-    /* ---- Featured card ---- */
-    .paper-card.featured {
-      grid-column: 1 / -1;
-      padding: 48px 56px;
-      min-height: 220px;
-      border-color: color-mix(in srgb, var(--accent) 25%, transparent);
-      background: radial-gradient(ellipse 70% 50% at 50% 0%, rgba(200, 150, 12, 0.04), transparent 60%), var(--bg-card);
-    }
-    .paper-card.featured .card-title {
-      font-size: 28px;
-      letter-spacing: -0.03em;
-    }
-    .paper-card.featured .card-author {
-      font-size: 12px;
-      margin-bottom: 12px;
-    }
-    .paper-card.featured .card-meta {
-      font-size: 13px;
-    }
-    .paper-card.featured:hover {
-      border-color: color-mix(in srgb, var(--accent) 45%, transparent);
-      background: radial-gradient(ellipse 70% 50% at 50% 0%, rgba(200, 150, 12, 0.07), transparent 60%), var(--bg-card-hover);
-    }
-    @media (max-width: 640px) {
-      .paper-card.featured {
-        padding: 28px 24px;
-        min-height: 160px;
-      }
-      .paper-card.featured .card-title {
-        font-size: 22px;
-      }
-    }
-    /* ---- Proof script in paper view ---- */
-    .proof-section {
-      margin-top: 32px;
-      padding-top: 24px;
-      border-top: 1px solid var(--border-card);
-    }
-    .proof-label {
-      font-family: 'Geist Mono', monospace;
-      font-size: 11px;
-      font-weight: 600;
-      letter-spacing: 0.1em;
-      text-transform: uppercase;
-      color: #1a7a1a;
-      margin-bottom: 14px;
-    }
-    .proof-script {
-      background: rgba(0, 0, 0, 0.4);
-      border: 1px solid var(--border-card);
-      border-radius: 8px;
-      padding: 20px 24px;
-      overflow-x: auto;
-      font-family: 'Geist Mono', monospace;
-      font-size: 12px;
-      line-height: 1.6;
-      color: var(--text-muted);
-      tab-size: 4;
-      max-height: 500px;
-      overflow-y: auto;
-    }
-    .proof-script::-webkit-scrollbar {
-      width: 6px;
-      height: 6px;
-    }
-    .proof-script::-webkit-scrollbar-track {
-      background: transparent;
-    }
-    .proof-script::-webkit-scrollbar-thumb {
-      background: var(--border-card);
-      border-radius: 3px;
-    }
-    .card-arrow {
-      position: absolute;
-      right: 24px;
-      top: 50%;
-      transform: translateY(-50%);
-      font-size: 20px;
-      color: var(--text-dim);
-      opacity: 0;
-      transition: all 0.3s ease;
-    }
-    .paper-card:hover .card-arrow {
-      opacity: 1;
-      color: var(--accent);
-    }
-    /* ---- Modal / Paper View ---- */
-    .paper-overlay {
-      position: fixed;
-      inset: 0;
-      z-index: 100;
-      background: rgba(0, 0, 0, 0.9);
-      backdrop-filter: blur(12px);
-      -webkit-backdrop-filter: blur(12px);
-      display: flex;
-      align-items: flex-start;
-      justify-content: center;
-      padding: 0;
-      overflow-y: auto;
-      animation: overlayIn 0.25s ease-out;
-    }
-    @keyframes overlayIn {
-      from { opacity: 0; }
-      to { opacity: 1; }
-    }
-    .paper-view {
-      background: #0d0d0d;
-      border: none;
-      border-radius: 0;
-      max-width: none;
-      width: 100%;
-      min-height: 100vh;
-      padding: 64px clamp(32px, 8vw, 120px) 80px;
-      position: relative;
-      animation: modalIn 0.35s ease-out;
-      box-shadow: none;
-    }
-    @keyframes modalIn {
-      from { opacity: 0; transform: translateY(24px) scale(0.97); }
-      to { opacity: 1; transform: translateY(0) scale(1); }
-    }
-    .paper-view-close {
-      position: fixed;
-      top: 24px;
-      right: 24px;
-      width: 40px;
-      height: 40px;
-      border-radius: 8px;
-      border: 1px solid var(--border-card);
-      background: rgba(10, 10, 10, 0.9);
-      backdrop-filter: blur(8px);
-      -webkit-backdrop-filter: blur(8px);
-      color: var(--text-muted);
-      cursor: pointer;
-      display: flex;
-      align-items: center;
-      justify-content: center;
-      font-size: 20px;
-      font-family: 'Geist Mono', monospace;
-      transition: all 0.2s ease;
-      z-index: 10;
-    }
-    .paper-view-close:hover {
-      border-color: rgba(200, 150, 12, 0.4);
-      color: var(--accent);
-      background: #111;
-    }
-    .paper-view-loading {
-      display: flex;
-      align-items: center;
-      justify-content: center;
-      min-height: 200px;
-      flex-direction: column;
-      gap: 16px;
-    }
-    .loading-spinner {
-      width: 32px;
-      height: 32px;
-      border: 2px solid var(--border-card);
-      border-top-color: var(--accent);
-      border-radius: 50%;
-      animation: spin 0.8s linear infinite;
-    }
-    @keyframes spin {
-      to { transform: rotate(360deg); }
-    }
-    /* ---- Paper Body (inside modal) ---- */
-    .paper-body {
-      color: var(--text-soft);
-      max-width: 900px;
-      margin: 0 auto;
-    }
-    .paper-body h1 {
-      font-family: 'Geist', sans-serif;
-      font-size: 2rem;
-      font-weight: 700;
-      letter-spacing: -0.03em;
-      line-height: 1.2;
-      color: var(--text);
-      margin-bottom: 8px;
-    }
-    .paper-body h2 {
-      font-family: 'Geist', sans-serif;
-      font-size: 1.3rem;
-      font-weight: 600;
-      letter-spacing: -0.02em;
-      color: var(--text-soft);
-      margin: 36px 0 12px;
-      padding-bottom: 8px;
-      border-bottom: 1px solid var(--border-card);
-    }
-    .paper-body h2:first-child {
-      margin-top: 0;
-    }
-    .paper-body h3 {
-      font-family: 'Geist', sans-serif;
-      font-size: 1.05rem;
-      font-weight: 600;
-      color: var(--text-muted);
-      margin: 22px 0 8px;
-    }
-    .paper-body p {
-      margin-bottom: 14px;
-      line-height: 1.75;
-    }
-    .paper-body ol,
-    .paper-body ul {
-      margin: 0 0 14px 24px;
-    }
-    .paper-body li {
-      margin-bottom: 6px;
-    }
-    .paper-body strong {
-      font-weight: 600;
-      color: var(--text);
-    }
-    .paper-body em {
-      font-style: italic;
-    }
-    .paper-body code {
-      font-family: 'Geist Mono', monospace;
-      font-size: 0.82em;
-      background: rgba(255, 255, 255, 0.05);
-      padding: 2px 7px;
-      border-radius: 4px;
-      border: 1px solid var(--border-card);
-      color: var(--accent);
-    }
-    .paper-body hr {
-      border: none;
-      border-top: 1px solid var(--border-card);
-      margin: 28px 0;
-    }
-    .paper-body blockquote {
-      border-left: 2px solid var(--accent-muted);
-      padding-left: 16px;
-      margin: 18px 0;
-      color: var(--text-muted);
-      font-style: italic;
-    }
-    /* ---- Credit section ---- */
-    .paper-credit {
-      margin-top: 40px;
-      padding-top: 20px;
-      border-top: 1px solid var(--border-card);
-      font-family: 'Geist Mono', monospace;
-      font-size: 13px;
-      color: var(--text-dim);
-      text-align: right;
-      letter-spacing: 0.04em;
-    }
-    .paper-credit .credit-label {
-      font-style: italic;
-    }
-    .paper-credit .credit-names {
-      font-weight: 500;
-      color: var(--text-muted);
-    }
-    /* ---- Papermaker callout ---- */
-    .papermaker-callout {
-      text-align: center;
-      padding: 48px 24px 32px;
-      max-width: 560px;
-      margin: 0 auto;
-    }
-    .papermaker-callout p {
-      font-family: 'Geist', sans-serif;
-      font-size: 14px;
-      font-weight: 350;
-      line-height: 1.6;
-      color: var(--text-dim);
-    }
-    .papermaker-callout .role-name {
-      font-family: 'Geist Mono', monospace;
-      font-weight: 500;
-      color: var(--accent-muted);
-    }
-    /* ---- Footer ---- */
-    .site-footer {
-      text-align: center;
-      padding: 0 0 64px;
-      font-family: 'Geist Mono', monospace;
-      font-size: 11px;
-      letter-spacing: 0.1em;
-      text-transform: uppercase;
-      color: var(--text-dim);
-      opacity: 0.5;
-    }
-    /* ---- Empty State ---- */
-    .empty-state {
-      text-align: center;
-      padding: 80px 24px;
-      grid-column: 1 / -1;
-    }
-    .empty-state p {
-      font-size: 16px;
-      color: var(--text-dim);
-      font-style: italic;
-    }
-    /* ---- Error State ---- */
-    .error-banner {
-      text-align: center;
-      padding: 60px 24px;
-      grid-column: 1 / -1;
-    }
-    .error-banner p {
-      font-family: 'Geist Mono', monospace;
-      font-size: 13px;
-      color: var(--text-dim);
-    }
-    /* ---- Responsive ---- */
-    @media (max-width: 640px) {
-      .container {
-        padding: 0 20px;
-      }
-      .site-header {
-        padding: 56px 0 36px;
-      }
-      .site-title {
-        font-size: clamp(36px, 10vw, 56px);
-      }
-      .paper-card {
-        padding: 22px 24px;
-        min-height: 140px;
-      }
-      .paper-view {
-        padding: 32px 20px;
-        min-height: auto;
-      }
-      .papers-grid {
-        gap: 12px;
-      }
-    }
-    @media print {
-      body {
-        background: white;
-        color: black;
-      }
-      body::before,
-      body::after {
-        display: none;
-      }
-      .paper-overlay {
-      position: static;
-      background: none;
-      backdrop-filter: none;
-    }
-    .paper-view {
-      background: white;
-      color: black;
-      box-shadow: none;
-      border: none;
-      min-height: auto;
-      padding: 24px 0;
-      position: static;
-      animation: none;
-    }
-      .paper-view-close {
-        display: none;
-      }
-    }
-  </style>
 </head>
 <body>
@@ -792,549 +40,6 @@
     </div>
   </footer>
-  <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
-  <script type="text/markdown" id="paper-trueact-md"># TrueACT: A Different Kind of Neuron
-Transformers use the same MLP on every token. Every time. Same weights, same math, no memory of where it's been in the sequence, no sense of how confident it is. TrueACT chucks that. Replaces the MLP with a small recurrent block that loops. Reads the token, updates a hidden state, checks if it's confident enough, either answers or loops again. Keeps going til it hits 0.99 confidence or runs out of steps.
-It's not a standard neuron. It's not an attention head. It's a loop with a router that picks between four specialized operations depending on what the token needs.
-Half the parameters of the equivalent standard model. Same loss. 2.8x slower cause of the loop. You're trading flops for parameter efficiency.
-## the four experts
-Each TrueACT layer has four experts. The router picks which combination to use per token, not per layer. So different tokens in the same batch can fire different experts. Here's what they do:
-**Think Cell** — this is the actual recurrent part. Updates the latent state, which is basically working memory that persists across steps within the same layer. Think of it like scratch paper the model scribbles on while reasoning.
-**Standard** — plain linear pattern matching. Same job the normal MLP would do. Catches the easy stuff.
-**Fancy** — this is the weird one. Does math in log-space. For multiplication, `log(a*b) = log(a) + log(b)`. Addition is something a linear layer can already do. So instead of needing a pile of neurons to approximate a multiplication curve, one Fancy expert can do it cleanly. Log then add then exp. Multiplication, division, ratios, chained operations.
-**Memory Vault** — key → value associative lookup. A dedicated place to store facts instead of smearing them across all the weights. Retrieve, don't approximate.
-The router takes `[input, latent_state, step_count]`, sticks it through a linear layer plus softmax, and that's the expert selection. Standard and Fancy spend from an action budget. Once the budget is gone, the loop stops, unless it hits the 32-step cap.
-## why this works
-A standard transformer neuron is a linear approximator. For something like `a * b = c`, you'd need a big pile of neurons approximating a curve. It works eventually but it's wasteful. The weights end up encoding the same multiplication table across hundreds of parameters, and there's no clean way to just *do the math*.
-The Fancy expert sidesteps that. Goes to log-space, adds, comes back. One neuron doing what used to take a crowd.
-The Memory Vault is the same idea from the other direction. Instead of memorizing facts by storing them implicitly in weight matrices, just do a key-value lookup. Store it once, retrieve it when needed.
-The Think Cell ties it together. Gives the model a place to hold intermediate state while it loops through the experts. Without it, each token is a one-shot guess. With it, the model can go "hmm let me think about this" and take another step.
-So the model gets more mileage per parameter. The tradeoff is sequential compute. You can't parallelize a loop that depends on its own output. That's where the 2.8x slowdown comes from.
-## the numbers
-3-layer LLaMA-style comparison at d=384:
-| metric | Standard | TrueACT |
-|---|---|---|
-| loss | 0.0884 | 0.0880 |
-| params | 852,864 | 428,652 |
-| train speed | 1x | 2.8x slower |
-Same loss, roughly half the weights. The slowdown is real — the loop is sequential, can't be parallelized. But you're getting the same quality out of half the parameter budget.
-1-layer arithmetic reasoner: 12/12 on a fixed 12-expression benchmark. 91.6% on 500 random expressions. The misses are mostly multi-digit arithmetic — `42*88=3524` type stuff. Structure like parentheses, operator precedence, intermediate steps — those come out clean. The model actually writes out the worked steps: `((5*5)+(10*2))=(25+(10*2))=(25+20)=45|`
-## the router in action
-The routing stats tell you what the model's doing under the hood. Example from the inference CLI:
-```
-Prompt > ((5*5)+(10*2))=
-TrueACT : ((5*5)+(10*2))=(25+(10*2))=(25+20)=45|
-         [Think: 15% | Stand: 30% | Fancy: 45% | Vault: 10%]
-```
-For arithmetic, Fancy gets most of the budget. Makes sense — multiplication is the expensive operation and Fancy handles it in log-space. Standard catches the easy pattern matching (digits, parens, equals signs). Think Cell does the state tracking across steps. Memory Vault probably handles the number facts.
-The router isn't pre-programmed. It learns which expert to use for which kind of token during training. The routing stats are emergent.
-## how training works
-Data is an infinite stream of generated arithmetic — `+`, `-`, `*`, parentheses, multi-step chains. Format is `((5*5)+(10*2))=(25+(10*2))=(25+20)=45|`. The model sees a random position in the chain and has to predict the next character.
-Context window is 64 chars, one-hot encoded. The alphabet is 12 characters (digits, operators, parens, equals, pipe) so the input is 768-dimensional one-hots.
-Batch size 8192. AdamW, lr 5e-4, weight decay 0.01. 1-3 layer TrueACTStack, t_dim=256, max 32 ACT steps per layer.
-Training also runs a StandardStack (same structure, ordinary SiLU MLPs) side by side as the control group. Checkpoints save both every 500 steps.
-## the architecture, deeper
-**TrueACTLayer**: concat input x and latent h → xh. Router reads `[xh, step_frac]` → softmax over 4 experts. Compute expert outputs, gate by `router_prob * remaining_budget`, accumulate into the result, update h through the Think Cell. Repeat til budget hits zero or 32 steps.
-**TrueACTStack**: N of those layers with residual connections. Input projection to model dim at the bottom, output projection to vocab at the top.
-**StandardStack**: same structure but with normal SiLU MLPs instead of the TrueACT loop. The control group.
-The budget mechanism matters. Standard and Fancy both consume budget when used. Think Cell and Memory Vault, from how they're structured, seem to be state management rather than compute, so they don't appear to draw from the budget. The model can think (Think Cell) and retrieve (Memory Vault) freely. Only the expensive ops cost steps.
-## how it started
-This thing began as one log-space neuron trying to learn `x*y=z`. That's it. One neuron doing multiplication in log-space.
-41 notes later in MEMORY.md. Mode collapse. Gradient explosions. Dead architecture after dead architecture. Full rewrites. Things that almost worked before falling apart at higher dimensions.
-The 41 notes on what didn't work are arguably more valuable than what did. Every dead end, every fix, every "wait that shouldn't have helped" moment. Built for AI agents to read so they don't repeat the same mistakes.
-The four-expert router, the Think Cell, the budget gating, the step cap — none of that was in the original idea. Each piece got added because something broke without it.
-## looking forward
-The toy results are promising. Half the params, same loss. The next step is figuring out if this scales past small arithmetic models, and what the loop overhead looks like at bigger sizes. That's the open question.
----
-*the 41st attempt finally worked. go read MEMORY.md if you wanna avoid the first 40.*
-## tldr
-- Swap the transformer MLP for a recurrent block that loops til it's confident
-- Four experts: Think Cell (working memory), Standard (linear matching), Fancy (log-space math), Memory Vault (key→value lookup)
-- Router picks which experts fire per token based on input, latent state, and step count
-- Half the params (429k vs 853k), same loss (0.0880 vs 0.0884), 2.8x slower
-- 1-layer solves 12/12 on fixed benchmark, 91.6% on 500 random arithmetic expressions
-- Fancy expert does `log(a*b) = log(a)+log(b)` — one neuron doing what used to take a crowd
-- Started as one log-space neuron, 41 failed notes later it's a whole architecture
-- Tradeoff: sequential compute for parameter efficiency
----
-Credit goes to CompactAI
-</script>
-  <script type="text/markdown" id="paper-STM_paper-md"># STM and the Circle Thing
-STM stands for Subtractive Training Method. Dragonoid threw it out in #general and the chat's been daring each other to "do STM" on a checkpoint ever since. The idea's almost annoyingly simple once you hear it.
-Normally you train a model by shoving the whole dataset at it for an epoch, loss drops, do it again. STM says nah. After that first pass, look at which samples caused the biggest sudden drops in loss, and yank those out. Keep the rest. Train again. Repeat.
-The reasoning: a sample that produces a huge plummet in loss is the model going "oh yeah got that one." Any more gradient steps on that sample after that point is mostly memorization, the model's just carving that exact example deeper into its weights, not learning anything generalizable. Waste of compute. The samples where loss comes down slow and gradual, those are the ones where the model's actually figuring things out. That's generalization. So you let it keep chewing.
-End result: shrinking dataset every epoch. Easy stuff drops away, what's left is the stuff the model still has to actually think about.
-Somebody in the thread asked "so generalization happens with STM?" And yeah, that's the bet. Starve the model of samples it'd otherwise overfit to. Bias toward generalization.
-People noted this isn't quite hard-example mining or curriculum learning, even though it rhymes with em. Curriculum learning usually adds harder stuff over time. STM never adds. Only removes. Dataset shrinks monotonically. You start with everything and trim down to the residue. The "curriculum" isn't scheduled, it emerges from whatever the model still finds hard.
-Also, and this confuses people, the criterion isn't "loss is currently high on this sample." It's "loss DROPPED a lot on this sample." Derivative thing, not level thing. Sample can have low absolute loss and still be worth training on if its loss has been decreasing gradually. Sample with high loss but a sudden crash? Gone. Cut.
-## the other circle thing (Apollonian gasket, different idea, don't mix em up)
-Same channel's been throwing around a totally separate circles thing. Storing token embeddings as addresses inside an integer Apollonian gasket, keyed by prime tuples from a twin-prime engine. Token maps to a circle, twin primes become semantic neighbors "for free" cause the +2 gap is the local Apollonian step. Weight matrix becomes the curvature graph. Inference is bounded curvature recall, truncate by curvature, filter by primality, verify Descartes-consistency.
-This is NOT the same circle STM talks about. The gasket is embedding storage. Circling is a data-selection pass during training. In principle you could combine em, Mage mused "that might be where my gasket fits", but they're different proposals from different people solving different problems. If you hear "circle" without context, ask which one.
-## how you'd test this (well, how you WOULD have)
-The obvious experiment, back when it was theoretical:
-1. Pretrain a small model to some checkpoint
-2. Fork, one copy keeps normal training, other starts STM
-3. Compare validation loss after equal compute, not equal epochs (STM epochs get cheaper as the set shrinks)
-If STM wins it should generalize better and overfit less. If it loses, most likely the easy samples were carrying signal the model still needed and dropping em caused forgetting.
-Somebody also suggested doing STM on pretraining then normal finetuning on top. Makes sense, don't subtract during a phase where you're already dataset-starved.
-## update: proven
-The STM theory is now fully proven with the final script. Key finding: STM saves training compute and focuses the model on hard data. How much compute you save depends on the mastery floor setting, turns out that matters a lot.
-There was a bug in the original idea, the "Premature Ejection" problem. The naive rule was "drop if loss drops a lot" which sometimes tossed hard samples before the model had really gotten a grip, just cause their loss dipped that epoch. The fix turns the logic into a filter with a safety valve:
-- **Original Rule:** Drop if loss drops a lot.
-- **Modified Rule:** Drop if loss drops a lot AND the model has actually reached a mastery floor.
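The two rules above can be sketched as a per-sample filter. This is a guess at the shape, not what `stm_proof.py` does: `stm_filter`, `drop_threshold`, and the absolute-difference test are assumptions, and "reached a mastery floor" is read here as "current loss is at or below the floor".

```python
def stm_filter(prev_loss, curr_loss, drop_threshold, mastery_floor):
    """Return indices of samples that stay in the pool for the next epoch.

    A sample is removed only if BOTH hold (the modified rule):
      - its loss fell by at least drop_threshold since the last epoch
      - its current loss is at or below mastery_floor (assumed meaning)
    """
    keep = []
    for i, (before, after) in enumerate(zip(prev_loss, curr_loss)):
        big_drop = (before - after) >= drop_threshold
        mastered = after <= mastery_floor
        if not (big_drop and mastered):
            keep.append(i)
    return keep


prev = [2.0, 2.0, 0.5]
curr = [0.1, 1.2, 0.4]
kept = stm_filter(prev, curr, drop_threshold=0.5, mastery_floor=0.3)
# kept == [1, 2]: sample 0 crashed and is mastered, so it goes;
# sample 1 crashed but is still at 1.2, so the floor saves it;
# sample 2 is improving slowly, so it stays.
```

Under this reading, setting `mastery_floor` to infinity recovers the original rule, and a higher floor lets more samples leave early.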
-How to read the results:
-- **Efficiency (3.52x):** By epoch 25, STM had processed 3.5 times fewer tokens than standard training. Each epoch gets cheaper as the dataset shrinks.
-- **Pool shift:** Started with 1,000 easy questions and 250 hard ones. By the end, only 6 easy questions remained while 94% of the hard questions survived in the pool. The model cleared easy addition in the first few minutes, then spent 25 epochs grinding hard math.
-- **Compute saved (71.6%):** This number came from an early run with the mastery floor at 0.15, which turns out to be too aggressive. The dataset shrinks monotonically, so each epoch costs less than the last. With a properly tuned floor the cumulative savings are lower (1.8% after one round, 21.2% after three), but the point holds: same or better learning, less work.
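One thing worth checking by hand: the 3.52x and 71.6% figures line up arithmetically, which suggests (but doesn't confirm) they describe the same run. `fraction_saved` is just a helper name for the sketch.

```python
def fraction_saved(speedup):
    """If a run processes `speedup` times fewer tokens, this is the share saved."""
    return 1.0 - 1.0 / speedup


saved = fraction_saved(3.52)
print(f"{saved:.1%}")  # 71.6%
```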
-The Circle Thing turns out to be a dynamic curriculum generator. The model effectively circles the easy samples and tosses them, ensuring that every gradient step in the later epochs is high-value, high-difficulty work.
-Final verdict: the theory works, it's efficient, and with the Mastery Floor addition it's stable enough for production use. The intuition that STM should focus on hard questions was exactly right. `stm_proof.py` is included, small model on simple addition, point is just to show the mechanism works.
-## update: recursive STM (Dragonoid's perfection round idea)
-So the single-pass STM worked. But Dragonoid pointed out something nobody'd thought of yet. Once the dataset's been fully STM'd, what if you just. Do it again. Reset the dataset, feed it back in, run STM on top of the already-STM'd checkpoint.
-CompactAI ran it. Two phases, same dataset, same model, just reset the pool between phases.
-Phase 1: model crawled to 45% accuracy, discarded most easy samples along the way. Standard STM behavior.
-Then they reset. Brought every sample back. Fed it to the same model that had just spent 25 epochs grinding exclusively on hard math.
-Phase 2 went different.
-First epoch: accuracy jumped from 45% to 68.5%. Instant. One epoch.
-Second epoch: the model looked at the dataset and went "nah I already know most of this" and yeeted 577 samples right there. Nearly half the pool. The Re-Circle event.
-By the end of Phase 2: 89.5% accuracy. Double what Phase 1 managed.
-So why'd this happen. Dragonoid called it the Recruitment effect.
-Phase 1 the model used easy data to build its foundations. Learned what numbers are, how addition works at a basic level. Then STM yanked the easy stuff and forced the model to grind on hard math for a long time. Those weights shifted. The model developed a more sophisticated grip on arithmetic cause it had to. That's the hard-data focus.
-Then Phase 2 hits. The easy data comes back. But the model's not the same model anymore. Its weights have been reshaped by all that hard-math grinding. So when it sees "3+7=10" again it's like oh. I understand this at a way deeper level now. Solves it with much higher precision. And because its understanding is better, the Re-Circle kicks in at a tighter threshold. Samples that were borderline in Phase 1 get confidently tossed in Phase 2 cause the model's mastery floor is higher now.
-Dragonoid's terminology for this: a cycle is one full STM pass, three phases (not gonna list em here, ask if you care). A round is N cycles. You keep doing cycles until the dataset is fully STM'd, that's one round. Then you reset and go again with a tighter mastery threshold.
-Each round distills the knowledge deeper into the weights. The model isn't just learning the data anymore, it's learning to learn the data better on the second pass.
-The practical takeaway: don't run STM once and call it done. Run it, reset, tighten the bar, run it again. Two rounds got 89.5% on a tiny GRU doing addition. More rounds on bigger models, who knows.
-Dragonoid said "try third round hehehehehe" so CompactAI did. Here's what happens when you push to three generations, measuring unseen hard accuracy at epoch 20 of each round:
-- Gen 1: 9.2%
-- Gen 2: 29.6%
-- Gen 3: 30.8%
-So 3.2x jump from Gen 1 to 2, then only about 1.04x (a 1.2-point gain) from 2 to 3. Diminishing returns, but not terrible. The model's still squeezing a little more out of the data on that third pass. It's not nothing.
-Compute savings tell a different story though. After fixing the mastery floor (the early 71.6% run had it cranked to 0.15, way too high), the actual numbers across rounds:
-- One round: 1.8% saved
-- Two rounds: 4.0% saved
-- Three rounds: 21.2% saved
-Savings compound. The first round barely shrinks anything, the model's being conservative about what it tosses. By round three it's tossing enough to save about a fifth of the compute. The floor's doing its job.
-The pattern's pretty clear at this point. First reset gives you a huge leap cause the model's weights have been fundamentally changed by the hard-data grinding. Second reset gives a smaller bump, the model's already close to its ceiling on this dataset.
-## tldr
-- STM: each epoch, chuck the samples whose loss dropped the most. Train on what's left. Repeat.
-- Circling: the eval pass that figures out what to chuck
-- Big sudden drops = model already memorized it, further training's wasteful. Slow drops = still learning, keep those
-- Proven single-pass: 3.52x efficiency. Compute savings depend on mastery floor (71.6% with floor at 0.15, lower with saner settings). Script included.
-- Recursive STM (Dragonoid): reset dataset after full STM, do it again. Two rounds hit 89.5%, three rounds pushed unseen hard accuracy 9.2% → 29.6% → 30.8%. Compute savings compound: 1.8% → 4.0% → 21.2%.
-- Apollonian gasket circle thing is unrelated. Ask before you mix em up.
----
-*notes from the #general chat history. got something wrong? yell at me.*
-Credit goes to Dragonoid (theory), CompactAI (proof) & the CompactAI crew
-</script>
-  <script type="text/markdown" id="paper-apollonian_gasket-md"># Token Embeddings Inside an Integer Apollonian Gasket
-OK so Mage keeps bringing this up in #general and honestly it's the kind of thing that sounds like math fanfiction until you sit there for twenty minutes and then you're like... wait. Maybe not completely insane.
-The pitch: instead of learning token embeddings as giant float matrices through SGD, put every token as an address inside an Apollonian gasket, you know, that fractal you get when you recursively pack circles into the gaps between other circles forever, and key each address to a pair of twin primes.
-I know. Stick with me.
-## the geometry, roughly
-Apollonian gasket comes out of Descartes' circle theorem. Four mutually tangent circles, their curvatures (1/radius) satisfy a quadratic relation. Start with three tangent circles, compute the two circles tangent to all three, pick one, recurse. You get this infinitely detailed fractal where every circle has an integer curvature. Provided you start with the right initial curvatures anyway.
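For the curious, the quadratic relation is (k1 + k2 + k3 + k4)^2 = 2(k1^2 + k2^2 + k3^2 + k4^2), which solves to k4 = k1 + k2 + k3 ± 2·sqrt(k1k2 + k2k3 + k3k1). A small sketch of that step, restricted to the integer case; `descartes_fourth` is an illustrative name, not anything from the chat.

```python
import math


def descartes_fourth(k1, k2, k3):
    """Curvatures of the two circles tangent to three mutually tangent circles.

    Uses k4 = k1 + k2 + k3 +/- 2*sqrt(k1*k2 + k2*k3 + k3*k1).
    A negative curvature marks the outer enclosing circle.
    Raises ValueError if the result would not be an integer.
    """
    total = k1 + k2 + k3
    radicand = k1 * k2 + k2 * k3 + k3 * k1
    if radicand < 0:
        raise ValueError("these curvatures cannot be mutually tangent")
    root = math.isqrt(radicand)
    if root * root != radicand:
        raise ValueError("fourth curvature is not an integer")
    return (total - 2 * root, total + 2 * root)


inner = descartes_fourth(-1, 2, 3)
# inner == (2, 6): in the classic (-1, 2, 2, 3) gasket, the circles of
# curvature -1, 2, 3 are touched by a circle of curvature 2 and one of 6.
```

Recursing means picking one of the two answers, forming a new tangent triple with it, and calling the function again.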
-Circles in an integer gasket form a hierarchy. Big circles = small curvatures. Deeper = bigger curvatures. Circles tangent to each other are neighbors. And a circle's curvature encodes something about where it sits in the hierarchy, its position relative to everything else. That's the part that matters.
-## the twin prime thing (this is where it gets weird)
-Mage's actual suggestion: map each token to a circle via f: token → (p₁, p₂) where p₁ and p₂ are twin primes. Primes differing by exactly 2, like 17 and 19. The +2 gap between twin primes maps naturally onto the local step structure of the gasket. "Semantic neighbors" come for free from number theory instead of co-occurrence stats.
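The mapping f was never pinned down in chat, so here's only the easy half: enumerating twin prime pairs to use as addresses. The sieve is standard; the token assignment at the bottom is arbitrary and purely hypothetical, which is exactly the open problem.

```python
import math


def twin_prime_pairs(limit):
    """All twin prime pairs (p, p + 2) with p + 2 <= limit, via a basic sieve."""
    if limit < 5:
        return []
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0:2] = b"\x00\x00"
    for n in range(2, math.isqrt(limit) + 1):
        if is_prime[n]:
            multiples = range(n * n, limit + 1, n)
            is_prime[n * n::n] = bytearray(len(multiples))
    return [(p, p + 2) for p in range(3, limit - 1)
            if is_prime[p] and is_prime[p + 2]]


pairs = twin_prime_pairs(20)
# pairs == [(3, 5), (5, 7), (11, 13), (17, 19)]

# Hypothetical, order-based assignment; it carries no semantic meaning.
vocab = ["cat", "dog", "owl"]
addresses = dict(zip(vocab, pairs))
# addresses == {'cat': (3, 5), 'dog': (5, 7), 'owl': (11, 13)}
```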
-So the weight matrix, the thing that normally stores all the relationships between tokens as learned floats, becomes the curvature graph. Instead of "the embedding for cat is related to the embedding for dog by some learned vector" you get "the curvature of cat's circle has a specific Descartes relation to the curvature of dog's circle."
-Inference becomes bounded curvature recall: search circles with curvature ≤ K, filter by primality (make sure you're on valid addresses), verify Descartes-consistency (make sure geometric relations actually hold). It's deterministic. You're traversing a fixed structure, not predicting an embedding.
-## wait, this might actually have something to it
-Few things from the back-and-forth in chat that make this better than your average 3am math thought:
-The structure is fixed. Normal embeddings are a black box, train em, they go somewhere, hope the geometry's nice. Gasket geometry is given. The question becomes whether it's useful, not whether you can learn a useful one. Different problem entirely.
-The twin prime mapping gives you a built-in closeness notion from number theory. Twin primes are conjectured infinite (unproven, sure, but more than enough for any practical vocab). And since they're spaced by exactly 2 you get a natural adjacency. Map semantically similar tokens to nearby primes and the geometry handles the rest.
-And it's absurdly compressible. You don't store a giant embedding matrix. You store the gasket generation algorithm and the vocab-to-prime-pairs mapping. 50k vocabulary? 50k integer pairs instead of 50k × 768 floats. At 4 bytes per number that's roughly 150 MB down to about 400 KB, or half that if you store only the smaller prime of each pair. That's wild.
-## the problems that nobody's solved
-Plenty. The chat wasn't pretending otherwise.
-Biggest one: how do you assign tokens to primes? Need a mapping from tokens to (p₁, p₂) pairs where semantically similar tokens end up near each other in the gasket. Random assignment = geometry is meaningless noise. But if you have to learn the assignment... you just moved the learning problem somewhere else. Nothing's free.
-Also the gasket is 2D. Token embeddings are usually 768D or higher. You're cramming high-dimensional semantics into a flat fractal. There are ways to extend Descartes to higher dimensions (Soddy's theorem generalizes), but the nice integer properties of the Apollonian gasket are a 2D thing specifically.
-"Bounded curvature recall" sounds neat, what's it actually mean for language modeling though? You're doing geometric lookup. How does attention even work here? Does attention make sense when "embeddings" are fixed positions?
-And twin prime conjecture is unproven. If twin primes are somehow finite (almost certainly not but we can't prove it) you'd run out of addresses. Even with infinite twin primes, the density thins, mapping a million tokens might force you into enormous primes and correspondingly enormous curvatures.
-## building it (nobody has yet)
-Experiment would look like:
-Pick a small vocab, maybe 1000 tokens. Manually assign em to twin prime pairs, doesn't have to be perfect, just needs semantically related ones geometrically close. Build a tiny transformer where embedding lookup is replaced by gasket traversal. Model doesn't learn embeddings, it learns to navigate the structure. Train on something simple like next token prediction. See if it converges at all.
-Null hypothesis: it learns nothing useful cause the geometric prior is wrong. If it does learn? Interesting.
-Compare to identical model with learned embeddings. If the gasket version is even in the same ballpark, that's a big deal given the compression advantage.
-Mage suggested combining this with STM from the other paper, since you'd want to be picky about training examples when your embedding scheme is this constrained. But honestly combining two untested ideas sounds like a great way to not know which one's failing.
-## anyone done this before?
-I looked. Nope. No literature on Apollonian gaskets for token embeddings.
-Nearest neighbors: hyperbolic embeddings (Poincaré ball) for hierarchies, but those are continuous, learned, no integer or prime structure. Some niche fractal initialization papers. Some knowledge graph stuff in hyperbolic/spherical geometries. Nothing using gaskets, twin primes, or curvature graphs as the embedding mechanism. Genuinely new ground. For better or worse.
-## tldr
-- Tokens as circles in an integer Apollonian gasket, addressed by twin prime pairs
-- Semantic neighbors geometrically close cause of Descartes curvature relations
-- Weight matrix = curvature graph. Inference = curvature recall + primality filter + Descartes check
-- Hundreds of kilobytes instead of roughly 150 MB (float32) for a 50k × 768 embedding table
-- Completely untested, might be mathematically impossible for high-dim semantics
-- If it worked even a little... worth writing home about
----
-Credit goes to Mage & CompactAI
-</script>
-  <script type="text/markdown" id="paper-overta_hypothesis-md"># The Overta Hypothesis: Knowledge-Free Foundation Models
-Amy's been working on something that sounds backwards: train a language model from scratch that knows nothing about the world. Not "knows less." Nothing. Model wouldn't know what a human is, wouldn't know Earth exists, wouldn't know what "Paris" or "coffee" or "gravity" means.
-What it would know: how to reason. How to hold a conversation. How to use tools to look stuff up. How to do in-context learning from whatever you feed it at inference time.
-OFM, Overta Foundation Model. She owns overta.org. Nothing on it yet, but the repo exists.
-The pitch is basically: cramming world knowledge into weights, which every LLM since GPT-2 does, wastes parameters. Instead, let the model spend its entire budget on reasoning. Facts live outside, in retrievable documents. The model's just the engine.
-## what does "no world knowledge" actually look like
-Weirder than you'd think. When Amy says the model wouldn't know what a human is, she's literal. Here's the kind of definition she'd provide externally at inference time, via search tool or context window:
-> "We are entities interacting with the physical world and able to do precise manipulation and also self-locomotion, through deforming and displacing our body."
-Model's never seen this before. Doesn't come pre-loaded. But it knows how to use a definition once provided. Knows how to reason from premises. If X is an entity that can manipulate objects and you ask about X's capabilities, it references the definition. That's the whole thing.
-Training data would be purely synthetic conversations. Characters have defined properties stated explicitly in context. Reasoning chains spelled out step by step. Facts always grounded in explicitly provided info. Model never falls back on "common sense", there isn't any.
-Amy wrote seed examples by hand. The flavor:
-```
-<message from="other">
-describe whats happening now
-</message>
-<message from="self">
-<think>
-They're asking me to describe "whats happening now". Right now I'm talking
-to them and outputting text. I don't know of anything else happening now.
-So I'll describe that to them.
-</think>
-I (some entity which processes text and outputs text) am conversing with
-you, what I assume to be an entity which also can process text and output
-text of your own. I'm outputting text to respond to your request.
-</message>
-```
-Notice it doesn't say "I am an AI assistant." Says "I (some entity which processes text and outputs text)." Describes itself from first principles using only what's observable. The whole philosophy is in that one response.
-## the arguments for it
-This crystallized across a bunch of conversations.
-Hallucination isn't a bug, it's the design. LLMs hallucinate because they're trained to memorize facts and spit em back probabilistically. Don't train em to memorize facts, they can't hallucinate facts. They can still fail to reason correctly, different failure mode entirely.
-Parameter efficiency. 350M model spending zero params on knowing the capital of France or the plot of Hamlet can spend all 350M on reasoning. Amy's guess: more intelligence per parameter than any model that mixes knowledge and reasoning.
-Grounding is explicit. When the model uses a fact you know exactly where it came from, context window, search tool, user provided it. No mystery about whether the model's "remembering" or "making it up."
-Also updatability. World knowledge changes, models don't. Model with no world knowledge has nothing to go stale. Just update the retrieval database. And alignment: model with no priors about true/false, good/bad, normal/weird only knows what you tell it in the moment. The user defines the ethical frame and factual premises, not the training data. Which obviously cuts both ways, Amy's aware.
-## and the problems
-Nobody's pulled it off yet. Amy's early experiments have been rough.
-Language itself encodes knowledge, probably the biggest problem. Can't learn English grammar without incidentally learning that doctors treat patients, fire is hot, people have names. The boundary between "reasoning" and "knowing" is way blurrier than the hypothesis assumes. Costikoooo flagged this in chat: model might pick up unwanted world knowledge just from how conversations are structured.
-Cold start is brutal. Amy's pipeline: manually write ~200 seed examples → finetune Qwen3 4B to generate more → use that for the full training dataset. Seed examples have to be near-perfect, everything cascades from em. Claude Opus "doesn't get it even close first try and needs a lot of iterating." And the AI models she's using to generate training data were themselves trained on world knowledge, so they sneak it in. Recursive contamination basically.
-Her earlier procedural generators, pure code, no LLMs, produced a model that could tool-call but was "the most brittle thing imaginable." Only worked with names in its predefined list. Generalization terrible.
-Then there's in-context bandwidth: if every fact has to live in the context window you need a killer retrieval system, and the model becomes only as good as its RAG. Dragonoid also asked about ethics: what even counts as knowledge? Amy said ethics would be taught as abstract values. But are values knowledge? Where's the line? Gets tangled fast.
-## the procedural generation thing
-Amy's earlier approach, and the more interesting one honestly, ditched LLMs entirely. She built a system of TypeScript generators. Each function handles one aspect of conversation generation. One generator adds a user message like "My name is Joe," mutates conversation state so Joe's stored. Later generators reference that state. The system randomly chains generators together. All procedures, guaranteed consistent cause the state machine enforces it.
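A toy version of that generator-chaining idea, written in Python rather than Amy's TypeScript. Everything here (the function names, the state dict, the name pool) is invented for the sketch; it only shows why shared state keeps the conversation consistent.

```python
import random


def gen_introduce_name(state, rng):
    """User states a name; the generator records it in shared state."""
    name = rng.choice(state["name_pool"])
    state["user_name"] = name
    return [("other", f"My name is {name}")]


def gen_recall_name(state, rng):
    """User asks for their name; only applies once a name is in state."""
    if "user_name" not in state:
        return []
    return [
        ("other", "what is my name?"),
        ("self", f"Your name is {state['user_name']}."),
    ]


def generate_conversation(generators, seed, n_steps=4):
    """Randomly chain generators; each may read and mutate the state."""
    rng = random.Random(seed)
    state = {"name_pool": ["Joe", "Ada", "Kim"]}
    messages = []
    for _ in range(n_steps):
        generator = rng.choice(generators)
        messages.extend(generator(state, rng))
    return messages


conversation = generate_conversation([gen_introduce_name, gen_recall_name], seed=0)
```

Because the reply is built from `state` and never from free text, the model's answer can't contradict what the user said earlier. The fixed `name_pool` also shows where the brittleness comes from: the model only ever sees names from that list.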
-Trained a 350M model from scratch on this procedural data. Qwen3 architecture, custom tokenizer, about an hour per training run on a 16GB GPU. Model learned to remember user names (generator state tracked em), search for info it didn't have (generators included search calls), stay in character as "an entity that processes text."
-But brittle. Only generalized to names in the generator's list. Math broken. Longer training mighta helped, only about an hour per run.
-Neat insight though: with this approach you add a new capability by writing a new generator and retraining. Timer? Timer generator. Spatial reasoning? Spatial generator. Capabilities are modular because the training data is modular.
-And it's philosophically cleaner than LLM-distillation. No world knowledge leaks in. But way more labor, you're writing a program that writes programs that simulate conversations, and you gotta anticipate every capability.
-## where this needs to go
-Viable experiment:
-1. Procedural generator covering ~20 distinct capabilities (name memory, search, clarification, multi-turn reasoning, simple math from provided formulas, etc.)
-2. Train small model (~100M) from scratch only on this data
-3. Test novel names, novel facts in context, novel combos of capabilities
-4. Compare to identical model on standard web text, does Overta model hallucinate less? Reason worse? Something in between?
-Amy's paused this to work on other stuff but the repo and approach are documented. Someone could pick it up.
-## tldr
-- Train a model that knows zero facts, only reasoning and conversation
-- All knowledge from context window or retrieval at inference time
-- Training data procedurally generated, not scraped
-- Claims: no hallucination (can't invent facts you never learned), better param efficiency, explicit grounding
-- Problems: language encodes knowledge, procedural generation is labor-intensive, model brittle so far
-- Amy poked at this with 350M models, promising, not there yet
----
-Credit goes to Amy & the CompactAI server
-</script>
-  <script type="text/markdown" id="paper-attention_experiment-md"># An Experiment With Attention
-wop posted this on HuggingFace. Started simple: can you swap full attention for something cheaper and still keep enough context to nail the next token? Answer turned out to be "nope, not yet", but how it got there is more interesting than the headline.
-The setup is clean. Reproducible. Refreshing, no massive training runs, no "promising directions" handwaving. Just a concrete benchmark, two architectures, numbers.
-## the actual question
-Context window isn't a flat sequence of tokens. It's usually: task instructions, style hints, formatting rules, actual content. These don't all matter at every token. Some rules are globally relevant but locally weak, the emoji rule in a system prompt, a formatting constraint. Model needs to carry them the whole way even when they don't matter at each individual prediction.
-wop's framing: can a compressed context state preserve those weak early rules as the sequence gets long? And how does that stack up against ordinary attention?
-## two models, head to head
-Attention: standard causal transformer attention. The baseline.
-Compressed: replaces token-to-token attention with a learned compressed memory state, a few implicit slots.
-Important: the compressed model doesn't classify tokens into categories. It reads tokens, updates a compact state, predicts from that state. Structure stays implicit. No hand-crafted "this is a rule, this is content" partitioning.
-Dataset is synthetic and hardcoded (keeps it clean). Each example has two early rules, one item, a long distractor-filled prefix, and a target that requires recovering the early rules. The benchmark specifically stresses rule retention over distance, not general prediction.
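To make that layout concrete, here's a made-up example builder with the same shape: two rules up front, one item, distractors filling the rest of the context, and a target you can only get right by remembering both rules. The token strings and rule types are invented; the actual benchmark in wop's repo will differ.

```python
import random


def make_example(ctx_len, seed):
    """Build (tokens, target) where the target depends on two early rules."""
    rng = random.Random(seed)
    case_rule = rng.choice(["upper", "lower"])      # hypothetical rule 1
    ending_rule = rng.choice(["exclaim", "period"])  # hypothetical rule 2
    item = rng.choice(["cat", "dog", "owl"])

    header = [f"RULE:{case_rule}", f"RULE:{ending_rule}", f"ITEM:{item}"]
    filler = [rng.choice(["foo", "bar", "baz"])
              for _ in range(ctx_len - len(header))]

    word = item.upper() if case_rule == "upper" else item.lower()
    target = word + ("!" if ending_rule == "exclaim" else ".")
    return header + filler, target


tokens, target = make_example(ctx_len=64, seed=0)
```

Scoring the target separately from the filler tokens is what would give you a rule_acc-style number instead of an average over everything.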
-Context lengths: 64, 256, 1028.
-## what happened
-Attention won. On quality. And on speed. Didn't expect the speed part.
-At ctx64: attention 0.938 val_acc, 0.906 rule_acc. Compressed: 0.699 and 0.492.
-At ctx256: attention 0.757 val_acc, 0.581 rule_acc. Compressed: 0.633 and 0.358.
-At ctx1028: attention 0.701 val_acc, 0.492 rule_acc. Compressed: 0.577 and 0.263.
-Speed was brutal. ctx1028: attention finished in about 10 seconds. Compressed took about 230. So it was less accurate AND 23x slower. Oof.
-## the lesson (not "compression bad")
-People might see those numbers and go "welp attention wins, pack it up." That'd miss the point.
-Naive compressed recurrent context doesn't automatically beat attention. Obvious now. Wasn't obvious going in, plenty of people assumed compressing context would help with long-range stuff because you're forcing a summary. Numbers say nope. Not unless you're real careful about how you compress.
-Preserving weak parallel instructions is harder than keeping a rolling summary. The compressed model didn't just lose accuracy, it specifically tanked rule_acc. That's the game. Can you keep early constraints alive? Answer: not with a dumb bottleneck.
-Full attention is still stupidly strong, even on a benchmark literally designed to break it. That's kind of wild. Task was built to punish attention and it still held up.
-Also worth noting: the compressed model updates memory step-by-step in sequence order, tons of serial work. Attention gets all those optimized parallel kernels. Some of the speed gap is architecture, some is just the real world having better kernels for one approach.
-## what this means
-wop doesn't over-claim. The motivating intuition, context isn't flat, some instructions matter globally but weakly, a good alternative to attention needs to preserve those signals, that's still solid. What broke is that the first compressed block was too simple. Formed a bottleneck. Just not a smart enough one.
-If you wanna replace or relax attention, the replacement probably needs:
-- a smarter way to preserve weak long-range constraints (rule_acc tells you directly if you're failing)
-- a more parallel implementation (230 seconds is useless)
-- a more selective memory update (not every token matters for those early constraints)
-- a benchmark that measures rule retention directly, not hidden inside average loss
-Next step: compress smarter, not harder.
-## bottom line
-The experiment made the question concrete. No more abstract talk about "efficient context." There's a clear picture now. Attention expensive but extremely effective. Compressed context alone isn't enough. Weak parallel instructions are a real stress test.
-If cheaper context mechanisms are gonna work, they need to preserve global obligations without collapsing into a vague summary. Numbers are out there. Reproduce em or beat em.
-wop used Kaggle notebooks for GPU, Codex from OpenAI for assistance, HuggingFace ZeroGPU for the thumbnail. Repo at github.com/koo1140/attention-experiment.
----
-Credit goes to wop (poe) & CompactAI
-</script>
-  <script type="text/markdown" id="paper-sparrow_fant-md"># Sparrow, FANT, and the Weird Stuff That Works
-Shane (Crownelius) posted this on HuggingFace after sitting on the story. It's part technical writeup, part personal history, part challenge. Here's a bunch of half-built things, some of them work way better than they should, come help figure out why.
-The headline: Sparrow, a 1M-parameter model, scores 95.6% on a 1,900-question math eval where Owl Alpha scores 61.4%. One million parameters versus a 70B-class model. Not a typo.
-## how Shane got here
-Unusual route into model building. Dropped a finetune that scored weirdly high on EQ, the empathy benchmark nobody goes after. Publishing outfit reached out, wanted full books generated through Claude, architected prompt-by-prompt. Gig ran over $25k in API credits. Shane designed every prompt. About 60% of that publisher's pipeline still uses his curation.
-That flipped him from using models to building em. Tiny ones specifically, small enough for a 12GB card, small enough that every architectural choice is exposed and wrong decisions are embarrassing.
-## FANT and stuff that shouldn't work
-FANT (github.com/Crownelius/fant3) is the experimentation framework. Three iterations, each a full rewrite, the kind you do when you care about findings, not code. Philosophy: at small scale you can try genuinely dumb ideas and actually see what they do. No hiding behind scale.
-The standout: SleepGate. A memory consolidation routine that fires every 100 training steps. Like half a screen of code, mostly comments. On FANT 2 at 5M parameters, 1,000-problem procedural math eval: +5.3 points. From one architectural decision. Same optimizer, same data, same schedule. Just a tiny consolidation pass wedged in.
-Why does that move the needle that much? Nobody's fully sure. But it does. That's the point of the framework, surface weird stuff so people can dig in.
-## SpinorApollonian Memory (this is the newest weird thing)
-This one needs setup. Shane found a paper by Jerzy Kocik on tangency spinors, classifying Apollonian disk packings using 2D Minkowski spinors. Descartes' circle theorem turns out to be the Minkowski quadratic form in signature (1,3). Actual math, not metaphor.
-Application: memory writes get split by chirality. Left-spinning packs go to one bucket, right-spinning packs to another. Geometric routing instead of threshold routing.
-The failure mode that'd been a problem for two months, packs starving each other into uselessness, just vanished. Same pattern at every scale tested, 5M through 742M. Hard to believe. But the ablation table doesn't lie.
-"Sounds like peanut butter on a hamburger, but the ablation table says it works." (Shane)
-## Sparrow vs Owl Alpha
-Sparrow is separate from FANT. Different skeleton, different router, different everything. Small and surgical, built for math. Not general-purpose, not trying to be.
-Across 38 head-to-head evals (n=50 each, numeric scoring), Sparrow ties or beats Owl Alpha on 33 of 38. 87%.
-The five losses are on simple multiplication and division at digit counts where Owl's training data is dense. Fair wins for Owl, not the interesting ones.
-The interesting ones: Goldbach, Collatz, Fermat-little, complex modulus, dot products, distance, determinants. +100 percentage points in some cases. +98pp. +88pp.
-A 1M-parameter byte-level model beating a 70B-class frontier model by margins you'd normally assume are bugs. Shane checked twice, bothered three friends, re-ran at three temperatures. Same numbers. Makes you question how much of big-model performance is just brute force.
-Trick: Sparrow uses a calc-tag wrapper for arithmetic. It learns to call it correctly. 5-digit multiplication: Owl Alpha 12%, Gemma 3 27B 0%, Sparrow 100%. Model doesn't do math, learns to use the tool. That's the whole thing.
-## unsolved: FANT + Sparrow fusion
-Shane can't get Sparrow's approach to work with FANT's architecture. Something about how Sparrow handles symbolic state should connect to FANT's recursion stack, but the fusion's been elusive for weeks. Breadcrumbs are public, repo's open, issues are open, the challenge is sitting there.
-"Half-built spaceships are best shared."
-## tldr
-- FANT: experimentation framework for tiny models (5M–742M)
-- SleepGate: 100-step consolidation pass, +5.3 points. Nobody knows exactly why
-- SpinorApollonian Memory: geometric routing by chirality using Minkowski spinors, killed a starvation bug
-- Sparrow: 1M-param math model beats Owl Alpha (70B-class) on 87% of evals. Not a typo
-- Trick is tool use (calc-tag), not raw compute. Sparrow calls for help
-- FANT + Sparrow fusion unsolved, open challenge
----
-Credit goes to Crownelius (Shane) & CompactAI
-</script>
   <script>
     var PAPERS = [
   {
@@ -1515,13 +220,9 @@ Credit goes to Crownelius (Shane) & CompactAI
         var view = document.createElement('div');
         view.className = 'paper-view';
         view.setAttribute('role', 'document');
-        view.setAttribute('aria-busy', 'true');
         view.innerHTML =
           '<button class="paper-view-close" aria-label="Close paper viewer">&times;</button>' +
-          '<div class="paper-view-loading" role="status">' +
-            '<div class="loading-spinner" aria-hidden="true"></div>' +
-            '<span style="color:var(--text-dim);font-family:Geist Mono,monospace;font-size:13px">Loading paper...</span>' +
-          '</div>';
         overlay.appendChild(view);
         document.body.appendChild(overlay);
@@ -1529,16 +230,8 @@ Credit goes to Crownelius (Shane) & CompactAI
         var closeBtn = view.querySelector('.paper-view-close');
-        // Focus the close button
         closeBtn.focus();
-        // Focus trap: find all focusable elements in the overlay
-        function getFocusableEls() {
-          return overlay.querySelectorAll(
-            'button, [href], input, select, textarea, [tabindex]:not([tabindex="-1"])'
-          );
-        }
         var close = function() {
           overlay.style.opacity = '0';
           overlay.style.transition = 'opacity 0.2s ease';
@@ -1557,109 +250,12 @@ Credit goes to Crownelius (Shane) & CompactAI
         overlay.addEventListener('click', function(e) {
           if (e.target === overlay) close();
         });
-        // Escape key handler
         document.addEventListener('keydown', function handler(e) {
           if (e.key === 'Escape') {
             close();
             document.removeEventListener('keydown', handler);
           }
         });
-        // Focus trap: keep Tab within the overlay
-        overlay.addEventListener('keydown', function(e) {
-          if (e.key !== 'Tab') return;
-          var focusable = getFocusableEls();
-          if (focusable.length === 0) return;
-          var first = focusable[0];
-          var last = focusable[focusable.length - 1];
-          if (e.shiftKey) {
-            if (document.activeElement === first) {
-              e.preventDefault();
-              last.focus();
-            }
-          } else {
-            if (document.activeElement === last) {
-              e.preventDefault();
-              first.focus();
-            }
-          }
-        });
-        // Load from embedded markdown
-        var mdEl = document.getElementById('paper-' + paper.id);
-        if (mdEl && mdEl.textContent) {
-          var parsed = parsePaper(mdEl.textContent.trim());
-          renderView(view, parsed, previousFocus, close, paper);
-        } else {
-          view.querySelector('.paper-view-loading').innerHTML =
-            '<p role="alert" style="color:var(--text-dim);font-family:Geist Mono,monospace;font-size:13px">Paper content not found.</p>';
-          view.removeAttribute('aria-busy');
-        }
-      }
-      function parsePaper(md) {
-        var lines = md.split('\n');
-        var creditLine = '';
-        var contentLines = [];
-        var foundCredit = false;
-        for (var i = lines.length - 1; i >= 0; i--) {
-          var trimmed = lines[i].trim();
-          if (!foundCredit && trimmed.toLowerCase().indexOf('credit goes to') === 0) {
-            creditLine = trimmed;
-            foundCredit = true;
-          } else {
-            contentLines.unshift(lines[i]);
-          }
-        }
-        var contentMd = contentLines.join('\n').trim();
-        var html = marked.parse(contentMd);
-        return { html: html, credit: creditLine };
-      }
-      function renderView(view, parsed, previousFocus, closeFn, paper) {
-        var creditHtml = '';
-        if (parsed.credit) {
-          var creditText = parsed.credit;
-          var labelEnd = creditText.indexOf(':');
-          var labelPart = creditText.substring(0, labelEnd + 1);
-          var namesPart = creditText.substring(labelEnd + 1).trim();
-          creditHtml =
-            '<footer class="paper-credit" role="contentinfo">' +
-              '<span class="credit-label">' + escapeHtml(labelPart) + '</span> ' +
-              '<span class="credit-names">' + escapeHtml(namesPart) + '</span>' +
-            '</footer>';
-        }
-        var proofHtml = '';
-        if (paper && paper.proof_script_content) {
-          proofHtml =
-            '<section class="proof-section" aria-label="Proof script">' +
-              '<div class="proof-label">&#x2713; Replication Script</div>' +
-              '<pre class="proof-script"><code>' + escapeHtml(paper.proof_script_content) + '</code></pre>' +
-            '</section>';
-        }
-        view.innerHTML =
-          '<button class="paper-view-close" aria-label="Close paper viewer">&times;</button>' +
-          '<article class="paper-body" role="article" aria-label="Paper content">' + parsed.html + '</article>' +
-          proofHtml +
-          creditHtml;
-        view.removeAttribute('aria-busy');
-        // Re-bind close button using the shared close function
-        var closeBtn = view.querySelector('.paper-view-close');
-        closeBtn.addEventListener('click', closeFn);
-        // Refocus close button so focus trap works after re-render
-        closeBtn.focus();
-        announce('Paper loaded.');
       }
       function escapeHtml(str) {

   <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
   <link href="https://fonts.googleapis.com/css2?family=Geist:wght@300..700&family=Geist+Mono:wght@400;500&display=swap" rel="stylesheet">
+  <link rel="stylesheet" href="everywhere.css">
 </head>
 <body>
     </div>
   </footer>
   <script>
     var PAPERS = [
   {
         var view = document.createElement('div');
         view.className = 'paper-view';
         view.setAttribute('role', 'document');
         view.innerHTML =
           '<button class="paper-view-close" aria-label="Close paper viewer">&times;</button>' +
+          '<iframe class="paper-view-frame" title="' + escapeHtml(paper.title) + '" src="Papers/' + paper.id + '.html"></iframe>';
         overlay.appendChild(view);
         document.body.appendChild(overlay);
         var closeBtn = view.querySelector('.paper-view-close');
         closeBtn.focus();
         var close = function() {
           overlay.style.opacity = '0';
           overlay.style.transition = 'opacity 0.2s ease';
         overlay.addEventListener('click', function(e) {
           if (e.target === overlay) close();
         });
         document.addEventListener('keydown', function handler(e) {
           if (e.key === 'Escape') {
             close();
             document.removeEventListener('keydown', handler);
           }
         });
       }
       function escapeHtml(str) {