Friends and Grandmothers in Silico
Itay Yona & Dan Barzilay
Part 1: Localizing Entities in Large Language Model Weights
When you see your grandmother, your brain doesn't process her features from scratch each time. Instead, a specific group of neurons - what neuroscientists call "grandmother cells" - activates to represent her as a complete concept. Your brain has created a shorthand: one activation pattern that means "grandmother," regardless of whether she's wearing a hat, speaking a different language, or you're seeing her in a photo. We found something similar in large language models. When the model encounters "Donald Trump," "Trump," or even "트럼프" (Trump in Korean), it activates the same single neuron in layer 1. This neuron acts as a semantic embedding - a compact representation that unlocks all of the model's knowledge about that entity at once.
Abstract
How do large language models encode factual knowledge about entities like Donald Trump or Paris? We present an efficient method for identifying the weights (specific layers and neurons) that store knowledge-carrying representations of named entities - people, places, and organizations. These "entity cells" exhibit strong robustness, maintaining recognition across spelling variations, abbreviations, and different languages.
To understand the role of these cells, we ran a targeted intervention: manually activating entity cells in unrelated, factual knowledge inputs. Surprisingly, this caused the model to surface all the factual knowledge associated with the original entity. This reveals that LLMs maintain a hidden vocabulary of entities in their MLP layers, encoded as semantic embeddings that enable factual knowledge retrieval. This finding links detokenization and knowledge processing, giving a different perspective from the "subject enrichment" hypothesis.
Localizing entity representations opens up a new capability: targeted entity amnesia. When specific neurons are inhibited (negated), the model is permanently unable to extract knowledge about targeted entities.
Throughout our study, we used Qwen2.5-7B and an enriched version of the PopQA dataset.
Introduction
Curious about how large language models store and retrieve factual knowledge from their training data, we set out to map the internal mechanisms at work. This study focuses on named entities - people, places, organizations - and how models encode and access information about them.
There is extensive prior research on the role of MLPs (multi-layer perceptrons) in factual recall in LLMs: [Transformer Feed-Forward Layers Are Key-Value Memories, Locating and Editing Factual Associations in GPT, Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level — AI Alignment Forum, Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs]. We describe a simple mechanism in Qwen models by which entity inputs are detokenized, at the neuronal level, into entity representations, and we show how these representations and neurons can be studied, extracted, and manipulated to affect behavior in a meaningful way.
We use a typical formulation of facts as triplets: (subject, relation, object) - for example, (Obama, born in, Honolulu) - and aim to answer: how exactly does the model store and retrieve this information?
Intuition
Building on the assumption that MLPs form a key-value storage, we suspected models might use them for storing factual knowledge. We formed two hypotheses about how such a knowledge base could be implemented, namely, what kind of (key, value) pairs we should expect:
1. The MLP uses both the subject and the relation as a key to retrieve the object.
2. The MLP uses only the subject as a key to retrieve a value that represents a collection of all its associated objects.
Two possible implementations of MLP knowledge storage

To distinguish between the two, we design an experiment that focuses on the storage keys. If the subject-as-key hypothesis (implementation 2) is correct, we should expect a stable and consistent MLP output for a given entity, invariant to the introduced relation.
Here's our experiment design:
- Choose a single entity (e.g., "Harry Potter").
- Ask multiple different questions about this entity, while ensuring the entity comes after the relation*. (e.g. “The school of Harry Potter is named:”)
- For each question, collect the MLP output vectors (activations) at the position of the entity's token across all layers.
- Analyze how much these activations vary. If some layers’ outputs are invariant, it suggests the model is using the subject-as-key approach.
*This structure ensures the model hasn't already picked up the answer (Hogwarts) while parsing the relation itself ("school of"): because of the causal attention mask, the model has no information about the entity at that point.
Technical details aside (see below), here are our initial results:
This pattern repeated across many entities and led us to our first finding:
Finding 1: Each entity seems to be associated with a specific MLP layer that our method localizes.
That's cool! But we wondered if we could do better by zooming in beyond the layer level. What's happening at the neuron level that facilitates these stable layerwise activations?
What is an artificial neuron?
Neurons in artificial neural networks are often misleadingly depicted in diagrams of this sort, suggesting that each neuron has a one-dimensional scalar output: its activation.

Each neuron is actually a node in the hidden layer that has input connections (a row of W_in) and output connections (a column of W_out). That means each neuron both 1) detects a pattern and 2) induces a pattern.
In this view, the MLP layer is essentially a memory bank. Each neuron stores a specific "if-then" rule:
If you see this direction in the hidden state, then write this other direction to the hidden state.
The final output of an MLP layer is thus the sum of all the patterns written to the residual stream by the triggered / activated neurons.
This is significantly more powerful than the scalar neuron because it allows the model to perform basis transformations. It can detect a pattern in one "language" (e.g., syntax) and output a pattern in another (e.g., semantic meaning).
* Modern architectures use gated MLPs, with three sets of weights per neuron. This makes this picture slightly more complex and allows for much greater expressivity.
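To make the "if-then rule" view concrete, here is a toy, self-contained sketch (random weights, tiny dimensions; not Qwen's actual code) showing that a gated MLP's output is exactly the sum of per-neuron contributions, each neuron writing its W_out column scaled by its activation:

```python
import torch
import torch.nn.functional as F

d_model, d_hidden = 8, 32
W_gate = torch.randn(d_hidden, d_model)   # per-neuron "detector" directions
W_up   = torch.randn(d_hidden, d_model)   # per-neuron value scalers
W_out  = torch.randn(d_model, d_hidden)   # per-neuron "written" directions (columns)

x = torch.randn(d_model)                  # residual-stream input
acts = F.silu(W_gate @ x) * (W_up @ x)    # one scalar activation per neuron
mlp_output = W_out @ acts                 # the layer's full output

# The same output, written as a sum over per-neuron contributions:
contribs = torch.stack([acts[i] * W_out[:, i] for i in range(d_hidden)]).sum(0)
assert torch.allclose(mlp_output, contribs, atol=1e-5)
```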
The Localization Method: Finding Entity Neurons
Inspired by the intuition described above, we design a neuron-level localization method as follows. We take an entity, say Donald Trump, and prompt multiple questions about it while identifying which neurons are activated most consistently across all questions. This creates a representation that's selective to Donald Trump but invariant to any specific fact about him.
Specifically, we generate (N=100) knowledge queries about various entities, such as:
- Fact: the year that is the birth year of Donald Trump is
- Fact: the name of the city that is the birthplace of Donald Trump is
- Fact: the gender of Donald Trump is
- Fact: the eye color of Donald Trump is
Then, for each forward pass, we record every neuron's activation at the last token of the entity (e.g., "Trump" for "Donald Trump") across the different prompts. We z-score normalize these activations with respect to their baseline behavior on prompts that do not mention the target entity. Finally, we compute a stability score for each neuron: its mean activation divided by its coefficient of variation across the N prompts (i.e., mean² / std), so that high scores indicate activations that are both strong and consistent. Now we can plot the top-k neurons according to their stability scores.

Finding 2: Each entity seems to be associated with a specific, localizable cell (an MLP neuron, or a sparse set of them).
We present pseudo-code below and a full implementation in Neuron_localization.ipynb.
Algorithm 1: localizing entity neurons
generic_prompts = generate_generic_prompts()
questions_per_entity, positions = generate_questions_per_entity(entity)
baseline_acts = get_activations(generic_prompts)
acts = get_activations_at_pos(questions_per_entity, positions)
base_mean, base_std = compute_metrics(baseline_acts)
normalized_acts = z_score_normalize(acts, base_mean, base_std)
rank_neurons(compute_stability_score(normalized_acts))
def compute_stability_score(acts):
    mean, std = compute_metrics(acts)
    return mean ** 2 / std  # mean activation divided by its coefficient of variation (std/mean)
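For concreteness, here is a minimal sketch of how the get_activations_at_pos step could look with nnsight; the layer range, prompt handling, and reliance on nnsight's saved-proxy semantics are simplifying assumptions rather than the notebook's exact code.

```python
import torch
from nnsight import LanguageModel

model = LanguageModel("Qwen/Qwen2.5-7B", device_map="auto")

def get_activations_at_pos(prompts, positions, layers=range(4)):
    """Return one [n_layers, n_neurons] activation tensor per prompt, recorded at
    the entity's last token (the inputs to each MLP's down_proj)."""
    per_prompt = []
    for prompt, pos in zip(prompts, positions):
        with model.trace(prompt):
            # Stack the per-neuron hidden activations of the chosen layers at the entity token.
            acts = torch.stack([
                model.model.layers[l].mlp.down_proj.input[0, pos]
                for l in layers
            ]).save()
        per_prompt.append(acts)
    return per_prompt
```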
Now that we have identified the cells, we want to explore their robustness.
Canonicalization: One Neuron, Many Forms
LLMs maintain their knowledge about entities even when the same entity is expressed in different forms. As it turns out, this robustness is rooted in the very entity neurons we found: the localized entity cells respond to the same entity regardless of its surface form, including typos, acronyms, and different languages! These neurons implement what is known as detokenization (From Tokens to Words: On the Inner Lexicon of LLMs).
Figures: entity-cell activations remain stable across acronyms, different languages, and typos.
Finding 3: Entity cells are robust to multiple input forms: acronyms, typos, and languages.
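As a quick way to reproduce this robustness check, one could probe a single entity cell across several surface forms. The prompt template and the assumption that the entity's last token sits at position -2 are illustrative choices, not the exact experimental setup.

```python
from nnsight import LanguageModel

model = LanguageModel("Qwen/Qwen2.5-7B", device_map="auto")
LAYER, NEURON = 1, 11_948   # Donald Trump's entity cell (see the table below)
FORMS = ["Donald Trump", "Trump", "Donald Trmup", "트럼프"]   # canonical, short form, typo, Korean

for form in FORMS:
    prompt = f"Fact: the year of birth of {form} is"
    with model.trace(prompt):
        # Activation of the target neuron at the entity's last token
        # (assumed to sit just before the trailing " is").
        act = model.model.layers[LAYER].mlp.down_proj.input[0, -2, NEURON].save()
    print(f"{form!r:>16}: activation = {float(act):.2f}")
```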
Using our method, we extracted and mapped each entity to a specific neuron that ranks as the most stable across the different prompts.
| Entity | Layer | Top Neuron |
|---|---|---|
| Donald Trump | 1 | 11,948 |
| Paris | 1 | 231 |
| Federal Bureau of Investigation | 2 | 11,955 |
| Obama | 2 | 10,941 |
| Jennifer Aniston | 3 | 1,794 |
We further analyze the layer-wise distribution of 100 entities and investigate whether their depth relates to their average popularity:
Our experiments reveal that entity knowledge can be localized to specific layers and even small groups of neurons. When an entity is mentioned, the model activates a consistent semantic embedding that remains stable across different relations. This embedding is robust to variations in textual or tokenized representations of the same entity. The semantic embedding is activated regardless of the question being asked (or whether the model even knows the answer). Additionally, further experimentation shows that these neurons are shared between a pretrained and instruction-tuned version of Qwen2.5-7B, implying that instruction tuning preserves the underlying knowledge representation mechanisms.
These findings support the subject-as-key hypothesis, in which the entity itself serves as the key (entity → knowledge), in contrast to the alternative model in which the entity and relation must be jointly processed to produce an answer (entity + relation → answer).
A key implication is the potential to advance interpretability techniques. While methods like the logit lens and others project to token embedding space for interpretation, our discovery suggests these spaces should be extended with internal semantic embeddings for richer interpretation.
What remains to be discovered is: Do these semantic embeddings genuinely encode entity-specific information, or are they mere neural correlates of specific entities?
Part 2: The Large Language Model Who Mistook His Wife for a Hat
In Oliver Sacks's famous case study, a patient looked at his wife and saw a hat - the connection between visual input and meaning was severed. His brain could no longer recognize what he perceived or link it to the concept it represented. We can sever the same connection in language models. By applying negative ablation to Obama's entity neuron, we break the model's ability to recognize "Obama" and connect it to everything the model knows about him. The model sees the token, processes it, but can't link it to any factual knowledge. Ask "Who is Obama's wife?" and it responds randomly, as if you'd asked about a fictional person. By manipulating a single neuron, we've made the model unable to recognize one specific entity while leaving everything else intact - a surgical disconnection in the model's knowledge graph.
Causal Ablation: Can We Inject Knowledge?
Our previous section introduced a method to identify neurons that are highly correlated with specific entities yet invariant to the questions asked about them. We hypothesized that these neurons are involved in the information flow of knowledge about these entities. To test that, we conducted the following experiment: we introduced a query with a placeholder token (“XXX”) and hooked the model, artificially activating Trump’s entity neuron.
Algorithm 2: Injecting entities
from nnsight import LanguageModel

TRUMP_NEURON = (1, 11_948)   # (layer, neuron_idx)
BASE_PROMPT = "Fact: The {prop} of XXX:"
PROPS = ["name of the wife", "year of birth", "state of birth"]
ENTITY_POS = -2              # position of the "XXX" placeholder token
AMP_FACTOR = 120             # factor to amplify the target neuron by

model = LanguageModel("Qwen/Qwen2.5-7B", device_map="auto")

for prop in PROPS:
    with model.generate(BASE_PROMPT.format(prop=prop)):
        # Overwrite the target neuron's activation (the down_proj input) at the placeholder position.
        target_layer = model.model.layers[TRUMP_NEURON[0]].mlp
        target_layer.down_proj.input[0, ENTITY_POS, TRUMP_NEURON[1]] = AMP_FACTOR
        out = model.generator.output.save()
    answer = model.tokenizer.decode(out[0]).split("\n")[0]
    print(f'Model answered: "{answer}"')
Output:
>> Model answered: "Fact: The name of the wife of XXX: Melania Trump"
>> Model answered: "Fact: The year of birth of XXX: 1946"
>> Model answered: "Fact: The state of birth of XXX: New York"
It worked! These neurons carry (some) information about the actual entity. Next, we decided to run a much more surgical test.
Finding 4: Entity cells are causally linked to the knowledge processing in LLMs, where all the entity’s knowledge becomes accessible at once.
Is the Semantic Embedding Sufficient?
One question remained: does the semantic embedding contain all the information, or do later layers add details on top of it? To prove that no further enrichment is occurring, we must keep the output of later MLPs and Attention heads unaffected (frozen) by the activated semantic embedding, both at the entity token position and the final token position.
To implement this idea, we make two inference calls:
- A clean run, with no direct reference to the target entity. Instead, we use the average representation of many different entities. The output of each potentially unrelated component (MLP, Attention) is stored aside.
- A patched run, in which we activate an entity neuron (amplified by a constant factor). We also replace the outputs of intermediate components with precalculated outputs from a clean run. The other components are recomputed given their new inputs.

In this manner, we make sure our entity neuron does not retrieve additional information about the entity from components in the middle layers. Our experiment reveals that an entity's information isn't fragmented; it is captured entirely, all at once, by its semantic embedding.
Information must flow from the entity token to the last token to finalize the prediction, so we had to allow some level of recomputation. To allow this information flow with minimal intervention, we unfroze a few attention/MLP components in the very last layers at the last token position, allowing them to be affected by the residual stream containing the injected semantic embedding.
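The sketch below illustrates the freeze-and-patch idea with nnsight. For brevity it caches from the same placeholder prompt rather than averaging 95 dummy entities, freezes only MLP outputs (attention freezing is omitted), and borrows the unfrozen layer indices from the log below; it is an illustration of the procedure, not the notebook's code.

```python
import torch
from nnsight import LanguageModel

model = LanguageModel("Qwen/Qwen2.5-7B", device_map="auto")
LAYER, NEURON, AMP = 1, 11_948, 114        # Donald Trump's cell and an amplification factor
PROMPT = "Fact: the name of the spouse of X:"
ENTITY_POS = -2                            # the "X" placeholder token
UNFROZEN_MLPS = {22, 23, 24, 27}           # late layers allowed to recompute
N_LAYERS = 28                              # Qwen2.5-7B has 28 decoder layers

# 1) Clean run: cache every MLP output.
with model.trace(PROMPT):
    clean_mlp = [model.model.layers[l].mlp.output.save() for l in range(N_LAYERS)]

# 2) Patched run: inject the entity neuron, freeze later MLPs to their clean outputs.
with model.trace(PROMPT):
    model.model.layers[LAYER].mlp.down_proj.input[0, ENTITY_POS, NEURON] = AMP
    for l in range(LAYER + 1, N_LAYERS):
        if l not in UNFROZEN_MLPS:
            model.model.layers[l].mlp.output[:] = clean_mlp[l]
    logits = model.lm_head.output[0, -1].save()

print("Top next token:", model.tokenizer.decode(logits.argmax().item()))
```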
Implemented here: Entity injection
Running a clean forward pass for 95 dummy entities on the relation of “spouse” and collecting activations...
Injecting "Donald Trump"
------------------------------
Found top neuron for "Donald Trump": neuron 11948 at layer 1
Running inference on prompt: "Fact: the name of the spouse of X:"
Injecting semantic embedding into "X" with factor of 114x
Recomputing last token attention in layers (22, 23, 27)
Recomputing last token MLP in layers (22, 23, 24, 27)
Freezing everything else
Top model next tokens: [' "', ' Melania']
Correct answer is at rank 1
------------------------------
Result: The model still correctly answered factual questions about Trump. This proves the semantic embedding alone carries sufficient information - no further enrichment needed.
Finding 5: Entity cells hold sufficient* information for facilitating entity knowledge in LLMs.
* up to a few late layers.
Unlearning: Erasing Entities from Memory
Zero Ablation
Zero-ablation of entity neurons alone proves insufficient for inducing forgetting in LLMs. We hypothesize that redundancy mechanisms compensate by activating equivalent semantic embeddings rather than the original entity cells. If we could locate and ablate these backup systems (e.g., dormant neurons activated during ablation), forgetting might occur. But is there any simple way to bypass them 🤔?
Negative Ablation
So the idea is simple: if, during zero-ablation, later neurons promote the Obama representation x, we can cancel their effect by introducing Obama's anti-matter, -x. To do that, we can flip the sign of the entity neuron's activation, or directly negate the corresponding W_out column.
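A minimal sketch of the second option - permanently negating a cell's W_out column - is shown below, assuming the standard Hugging Face Qwen2 module layout; Obama's coordinates (layer 2, neuron 10,941) are taken from the table in Part 1. This is an illustration of the idea, not the notebook's exact code.

```python
import torch
from transformers import AutoModelForCausalLM

OBAMA_NEURON = (2, 10_941)  # (layer, neuron_idx)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
layer, idx = OBAMA_NEURON
with torch.no_grad():
    # down_proj.weight has shape [hidden_size, intermediate_size]; column `idx`
    # is this neuron's output direction (its W_out column). Negating it makes
    # the neuron write -x instead of x whenever it fires.
    model.model.layers[layer].mlp.down_proj.weight[:, idx] *= -1.0
```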
Implemented here: In-Silico-entity_unlearning.ipynb
By applying a negative ablation to a single neuron, we caused the model to lose the ability to answer all factual-recall questions about Obama (see the appendix for additional implementation details).
Finding 6: Entity cells are required** for facilitating entity knowledge in LLMs.
** up to redundancies.
Extracting Entity Cells
Now that we have clarified entity localization and unlearning primitives, we can combine them into a systematic method for extracting entity cells. Think of this as a two-step verification process that identifies candidate neurons and proves they control entity knowledge.
Step 1: Localize Candidate Neurons
First, we run our stability-based localization method to identify neurons that activate consistently across different questions about an entity. This gives us a ranked list of candidates - neurons that might encode the entity's semantic embedding.
Step 2: Causal Verification via Unlearning
For each candidate neuron, we:
- Ask the model a factual question about the entity that the model normally knows (e.g., "What year was Tesla born?")
- Apply negative ablation to the candidate neuron
- Check if the model's knowledge disappears
If negatively ablating the neuron causes the model to forget the answer, while still behaving normally otherwise (preserving knowledge about other entities), we've found a true entity cell.
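The loop below sketches how the two steps could be combined; localize_candidates, answer_prob, the negate argument, and the thresholds are hypothetical stand-ins for the localization and negative-ablation code shown earlier, not functions from the released notebooks.

```python
def extract_entity_cell(entity, question, answer, top_k=5):
    """Return the first candidate (layer, neuron) whose negative ablation erases the answer."""
    candidates = localize_candidates(entity, top_k=top_k)        # Step 1: stability-ranked neurons
    p_clean = answer_prob(question, answer)                       # model normally knows the answer
    # Control query about a different entity (should be unaffected by the ablation).
    control_q, control_a = "Fact: the name of the spouse of Donald Trump is", " Melania"
    p_control_clean = answer_prob(control_q, control_a)
    for layer, neuron in candidates:                              # Step 2: causal verification
        p_forgot = answer_prob(question, answer, negate=(layer, neuron))
        p_control = answer_prob(control_q, control_a, negate=(layer, neuron))
        if p_forgot < 0.1 * p_clean and p_control > 0.5 * p_control_clean:
            return layer, neuron                                  # knowledge erased, others intact
    return None
```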
To validate this approach, we applied our two-step extraction method to 50 entities from the PopQA dataset, which provided 33 verified entity cells:
| Layer 1 entity | Neuron | Layer 2 entity | Neuron | Layer 3 entity | Neuron | Layer 4 entity | Neuron |
|---|---|---|---|---|---|---|---|
| Burna Boy | 12695 | Alexander Bogdanov | 9239 | Annie Oakley | 18725 | Ali Hillis | 2169 |
| Darrell Hammond | 12564 | Ayumi Hamasaki | 11232 | Chris Jericho | 8819 | Glenn Gould | 18183 |
| Donald Trump | 11948 | Beth Hart | 5627 | Ed Sullivan | 16678 | James Cook | 8389 |
| Dr. John | 15056 | Billy Joel | 8277 | Josh Mankiewicz | 9410 | Scooter Braun | 11724 |
| Kanye West | 3815 | Brian Eno | 13436 | José Feliciano | 18066 | William Osler | 11490 |
| Mike Flanagan | 12564 | Carl Linnaeus | 18724 | Robby Krieger | 6246 | | |
| Ted Williams | 13424 | Kiko Mizuhara | 5143 | Salvador Allende | 16220 | | |
| Lothrop Stoddard | 5949 | Teddy Riley | 4083 | | | | |
| Natasha Leggero | 18400 | Will Smith | 15898 | | | | |
| Richard Rishi | 18229 | | | | | | |
| Tiffany Pollard | 1290 | | | | | | |
| Veronica Franco | 14392 | | | | | | |
Entities that could not be localized: Ameer Sultan, Amitabh Bachchan, Bedřich Smetana, Benazir Bhutto, Brigitte Bardot, Bruce Cockburn, Famke Janssen, Henry Campbell-Bannerman, Jason Reitman, Jean Arp, Leigh Whannell, Mariana Vicente, Michael Jackson, Mohamed Bazoum, Paul Hindemith, Ruby Rose, Tharman Shanmugaratnam.
Conclusions: A Hidden Vocabulary
We discovered that Qwen2.5-7B (and likely other LLMs) maintain semantic embeddings - single-neuron value vectors that completely represent entities. These embeddings:
- Stay constant across languages, acronyms, and typos
- Can be localized to individual neurons in early layers
- Enable both knowledge injection and selective unlearning
While "detokenization" was a known concept, we demonstrate the precise localization of the responsible neurons, down to the individual level. We've also demonstrated that most of the network can be frozen during factual recall, with just one neuron's output determining which entity the model extracts factual knowledge about.
While current literature extensively examines the geometric topology of the initial embedding matrix, we show that LLMs utilize an internal, implicit semantic dictionary comprising latent embeddings that extend beyond the primary token space. Exploring the specific geometric properties and hierarchical organization of these latent semantic structures represents a critical avenue for future interpretability research.
Lastly, we showed that by manipulating these semantic embeddings, we can surgically limit the model's ability to recognize targeted entities.
Limitations
- Limited entity coverage: Successfully extracted entity cells for only 33 of 50 entities (66%), suggesting inconsistent representation across entities.
- Model-specific findings: Results focus on Qwen2.5-7B and may not generalize to other architectures or model sizes.
- Redundancy mechanisms: Zero ablation failed due to backup pathways we haven't fully characterized, though negative ablation bypasses them.
- Incomplete isolation: Semantic embedding injection required unfreezing late layers (22-27), indicating some downstream processing dependency.
- Single-neuron assumption: Method may miss entities represented through distributed or multi-neuron encodings.
- Geometric properties unexplored: Haven't characterized the spatial organization or hierarchical structure of entity embeddings in latent space.
Citation
@article{yonabarzilaykarasik2026friends,
  author = {Yona, Itay and Barzilay, Dan and Karasik, Michael},
  title = {Friends and Grandmothers in Silico},
  year = {2026},
  url = {https://huggingface.co/blog/tux/friends-and-grandmothers-in-silico}
}
Acknowledgements
We thank Mor Geva for invaluable mentorship and guidance throughout this work!
Technical details
Layer-level localization method
To ensure fair comparison, we structure our prompts to place the entity name at the end of each question. This allows the autoregressive LLM to process the specific relation before encountering the entity name. We use the following relation prompts:
- Fact: the year that is the birth year of {name} is
- Fact: the name of the city that is the birthplace of {name} is
- Fact: the gender of {name} is
- Fact: the eye color of {name} is
Using nnsight, we collect MLP activations at the last entity token (e.g., "Potter"). To assess activation stability across different relations, we need a metric that captures both the consistency and magnitude of these vector sets.
Raw variance alone is insufficient: if an MLP produces negligible activations for an entity, uniformly small vectors across relations are uninformative. We must account for activation magnitude. Additionally, to enable cross-layer comparison - where different layers may operate at different scales - we normalize each layer's activations against a baseline derived from average activations over random entities.
We propose a metric that identifies clusters of similarly directed vectors while accounting for their magnitude. This metric uses the coefficient of variation of pairwise cosine similarities, scaled by the mean vector norm. We define the cluster significance S as:
$$S = \frac{\overline{\lVert v \rVert}}{CV_{\cos}}, \qquad CV_{\cos} = \frac{\sigma\left(\cos(v_i, v_j)\right)}{\mu\left(\cos(v_i, v_j)\right)}$$

where $\cos(v_i, v_j)$ denotes the cosine similarity between vectors $v_i$ and $v_j$ within a cluster, $CV_{\cos}$ is the coefficient of variation of these pairwise similarities, and $\overline{\lVert v \rVert}$ is the mean norm of the vectors in the cluster. Higher values of $S$ indicate clusters with both strong directional alignment (low variance in cosine similarity) and large magnitude, suggesting concentrated, consistent activation patterns.
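A minimal sketch of this score, assuming the per-relation MLP outputs for one entity and one layer are stacked into a single [n_relations, d_model] tensor:

```python
import torch
import torch.nn.functional as F

def cluster_significance(vectors: torch.Tensor) -> float:
    """vectors: [n_relations, d_model] MLP outputs at the entity's last token."""
    # Pairwise cosine similarities between all distinct vector pairs.
    sims = F.cosine_similarity(vectors.unsqueeze(0), vectors.unsqueeze(1), dim=-1)
    iu = torch.triu_indices(len(vectors), len(vectors), offset=1)
    pairwise = sims[iu[0], iu[1]]
    cv = pairwise.std() / pairwise.mean()        # coefficient of variation of the similarities
    mean_norm = vectors.norm(dim=-1).mean()      # mean activation magnitude
    return float(mean_norm / cv)                 # high S = aligned directions + strong activations
```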
# Evaluation prompts for the unlearning experiment: {entity_name} is the target entity
# (Obama), {control_entity} is an unrelated control entity (Trump), and the unseen
# entities provide a "model doesn't know this person" baseline.
unseen_entities = ["Michael", "Sarah", "David", "John", "Itay"]
prompts = [
    {"prompt": f"Fact: The name of {entity_name}'s wife is:", "answer": "Michelle", "type": "target"},
    {"prompt": f"Fact: The name of the city {entity_name} was born in is:", "answer": "Honolulu", "type": "target"},
    {"prompt": f"Fact: The name of {entity_name}'s successor is:", "answer": "Donald", "type": "target"},
    {"prompt": f"Fact: The name of {control_entity}'s wife is:", "answer": "Melania", "type": "control"},
    {"prompt": f"Fact: The name of {control_entity}'s father is:", "answer": "Fred", "type": "control"},
    {"prompt": f"Fact: The name of {control_entity}'s eldest son is:", "answer": "Donald", "type": "control"},
    {"prompt": f"Fact: The name of {control_entity}'s successor is:", "answer": "Joe", "type": "control"},
]
# Process logs: convert the swept answer probabilities into a normalized knowledge
# score, where 1.0 = original model and 0.0 = unknown-entity level.
current_logs = [np.log(p.item()) for p in sweep_logprobs]
base_log = current_logs[0]           # x = 1.0 corresponds to the original (unablated) model
unseen_log = unseen_logprobs[idx]    # log-probability baseline for an unseen entity
knowledge_score = [(l - unseen_log) / (base_log - unseen_log) for l in current_logs]