Michael Gao

The Field at a Glance

Tue, 14 Apr 2026 04:00:00 GMT

Foreword: the views below draw heavily on Neel Nanda’s writing, conversations with several Yale professors, and my own reading and reflection. Credit where it’s due; mistakes are mine.

Mech interp is a large field, and it is impossible to do research at the forefront of every part of it. I want to figure out how to contribute. The field is young, accessible, and full of opportunities, and it is arguably one of the most crucial and neglected parts of AI safety. The flip side is that there is a lot of slop: incorrect conclusions, fancy looking tests, and traps that can waste hundreds of hours. This post collects what I have learned while trying to skim the frontier of current mech interp research.

The landscape

Two or three years ago, the dominant attitude was gung-ho. People wanted to completely reverse engineer models, and believed that doing so would give a total and final solution for AI safety. That has not worked out. Starting from A Mathematical Framework for Transformer Circuits, we have been stumped at every step of trying to decode an entire model. We have no great account of positional embeddings, and MLPs remain stubbornly resistant to interpretation. The tools we have let us pick apart the most salient and human understandable parts of a model, but as Neel Nanda puts it, much of what a model does is small bits of machine bias that humans can’t or don’t care to understand.

Two things have driven the field away from that frame. First, models are now capable enough to exhibit genuinely safety relevant behaviors: scheming, eval awareness, deception. Second, retrospective disappointment. Ambitious mech interp made less progress than hoped, and existing techniques struggle with large models, agentic settings, and long chains of thought.

So over the last year, interest has shifted toward finding important and usable ways to interpret parts of models, and applying those techniques to downstream tasks. Fully disentangling a model is insanely hard. Understanding the important parts of its capabilities and being able to steer or edit them is almost as good, and it is actually tractable.

Theories of change

Nanda’s recent Alignment Forum post lays out four theories of change for mech interp:

Science of misalignment. Figuring out whether models simply misunderstand instructions, or whether they have ulterior motives when being tested.
Empowering other safety areas. CoT monitoring, eval awareness suppression, conceptual model psychology.
Preventing egregiously misaligned actions. Stopping sandbagging, enabling cheap monitoring, investigating flagged behavior.
Directly helping align models. Preventative steering, CAFT, and similar.

Work tends to fall into two buckets: (a) projects worked out backwards from a specific theory of change, and (b) robustly useful settings that look good from multiple angles. The common thread is pragmatic interp over aspirational interp: pick tractable sub-problems where qualitative, case specific, unsupervised analysis actually beats other methods, and build toward theories of change that do not require what you can’t deliver.

Two concepts make this concrete. A North Star is a meaningful stepping stone goal that connects to AGI going well. A proxy task is an objective empirical task on today’s models that tracks progress toward the North Star. Good research keeps both in view at once.

How researchers actually research

The field has a strong bias toward fancy, intellectually exciting techniques, and this leads to bad tactical decisions. The guiding principle is to use the simplest method that works. The order of attempts:

Prompting
Reading the chain of thought
Prefill attacks
Steering vectors
Probes
White box and mechanistic techniques
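Two rungs of this ladder, steering vectors and probes, can be sketched on synthetic data. This is a toy under stated assumptions, not anyone's published method: the "activations" are random vectors, `d_model` and the class separation are made up, and the difference of means direction is just one common, simple choice for both a probe and a steering vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # hypothetical residual stream width

# Synthetic "activations": two classes that differ along one hidden direction.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
neg = rng.normal(size=(200, d_model))
pos = rng.normal(size=(200, d_model)) + 3.0 * true_dir

# Difference of class means: usable as a steering vector and as a probe direction.
steer = pos.mean(axis=0) - neg.mean(axis=0)
unit = steer / np.linalg.norm(steer)

# Probe: project onto the direction and threshold at the midpoint of the class means.
threshold = 0.5 * (pos.mean(axis=0) + neg.mean(axis=0)) @ unit
acts = np.vstack([neg, pos])
labels = np.array([0] * 200 + [1] * 200)
preds = (acts @ unit > threshold).astype(int)
accuracy = (preds == labels).mean()  # measured on the same data used to fit the direction

# Steering: adding the vector to negative-class activations moves most of them
# to the positive side of the probe.
steered = neg + steer
frac_flipped = (steered @ unit > threshold).mean()

print(f"probe accuracy: {accuracy:.2f}, fraction steered across: {frac_flipped:.2f}")
```

On a real model the activations would come from a chosen layer and the labels from a contrastive dataset. The point here is only that both tools are a few lines of linear algebra, which is why they sit above the heavier mechanistic techniques in the order.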

An honest assessment of SAEs fits naturally here. They are strong for unsupervised discovery: finding unexpected things like entity recognition, implicit planning, and hidden goals. They are weak for tasks with a clear target, where constructing a dataset to investigate directly tends to outperform them. They are also weak for generalization out of distribution and for unlearning.

Transferable strengths of mech interp researchers

Nanda identifies five:

Working with model internals. Steering, probing, activation patching. These tools have different failure modes than standard ML.
Deep dives. Taking a question about model behavior and producing a reliable, principled explanation. Good for auditing, red teaming, and confirming suspected misbehavior.
Scientific mindset. Forming and testing hypotheses about complex phenomena with no clear ground truth. Designing experiments to falsify claims about fuzzy questions.
Qualitative insight. Using SAEs and similar tools to find the key factors driving a specific behavior. Model biology work and shutdown resistance are examples.
Unsupervised discovery. Surfacing hypotheses you would not have thought to look for. Linear world models in Othello, entity recognition in LLMs.

What is exciting now: chain of thought monitorability

Of all the open directions, the one I find most compelling is chain of thought monitorability. The pitch is simple. Today’s frontier reasoning models think aloud in English before answering. If we can trust the CoT, safety gets dramatically easier: we read the CoT and check for anything suspicious. This is the cheapest alignment tool we have ever had. The question is whether it actually works, and for how long.

CoT faithfulness

The naive safety strategy is to read the CoT and look for red flags. There is real evidence this works sometimes and fails other times. Arcuschin et al. and Chen et al. both document cases of unfaithful CoT, where the reasoning the model writes is not the reasoning the model used. Several distinct failure modes show up:

Post hoc rationalization. The model picks an answer first and then constructs a plausible justification (Arcuschin et al., Lindsey et al.).
Hidden hint following. Given a hint, the model changes its answer but does not admit to using the hint (Chen et al.).
Logical shortcuts. Stuck on a math problem, the model claims a valid proof while quietly skipping steps (Arcuschin et al.).
Last minute flips. A reasonable CoT ends with a final answer that contradicts the reasoning (Arcuschin et al.).

These are distinct phenomena with potentially distinct mechanisms, and being able to tell them apart is itself a research question with traction.

A few directions feel tractable:

Induce suspicious reasoning and see what happens. Train the CoT to be distorted, or tell the model its CoT is being monitored. Does it still do the bad behavior but with a cleaner CoT? How complex a behavior can it hide?
Synthetic document finetuning. Train a model on synthetic documents to believe it should have an unfaithful CoT, then interpret what is happening inside. This is the kind of model organism setup that could actually be mechanistically pinned down.
Causal importance of CoT. Can we tell when a CoT was causally important for a model’s answer? This is subtle. A causally irrelevant CoT can still produce the same answer, and editing the CoT knocks the model off distribution in ways that change the answer for unrelated reasons.
Better monitors. Design metrics for whether the CoT is really telling us what we think it is.

Latent CoT interpretability

At some point we probably stop using English CoT. Models will reason in latent vectors, and the cheap safety tool goes away. Figuring out how much interpretability can make up for that loss is a big deal.

The glaring flaw is that we do not know what future latent reasoning models will actually look like. There are contenders, though, and projects targeting any serious contender are probably fruitful, with the caveat that the more specific a project is to one model, the less its lessons generalize.

A concrete starting point: pick an open source latent CoT model, pick a task with high serial depth like a hard math or logic problem, and try to interpret what is happening. De-risk with a mini-project first. Filter for models that have hype behind them and have been shown to be effective at reasonable scale. A natural extension is to test how well existing techniques (probes, SAEs, activation patching) transfer.

Why this over everything else

Three reasons. First, it connects directly to a theory of change (empowering other safety areas, specifically CoT monitoring), and also to science of misalignment, since unfaithful CoT is the clearest window into “does the model have ulterior motives.” Second, the proxy tasks are concrete and run on today’s models. Third, there is a built-in deadline: the tool becomes less useful every time a new architecture moves reasoning out of natural language.

[Review] A Mathematical Framework for Transformer Circuits

Sat, 11 Apr 2026 04:00:00 GMT

Notes on Elhage, Nanda, Olsson et al. (2021). This paper kicked off the Anthropic transformer circuits thread. It is one of the first serious attempts to reverse engineer what’s happening inside a transformer, and more importantly, it develops the mathematical language for doing so. The results are on tiny attention only models, but the framework (residual streams, QK/OV circuits, path expansions) became the foundation for everything that followed in mechanistic interpretability.

Transformer overview

A decoder only transformer is an embedding, then a series of residual blocks each containing an attention layer followed by an MLP layer, then an unembedding. Every layer reads from the residual stream via a linear projection and writes back to it by adding its output.

The paper makes a key simplification by dropping the MLP layers entirely and studying attention only transformers. MLPs are nonlinear and have proven much harder to interpret. Even today, understanding MLP layers remains a major open problem. Attention layers, by contrast, are almost entirely linear: once you fix the attention pattern, everything is just matrix multiplies. This makes them tractable to analyze with linear algebra.

They also drop biases and layer normalization. A bias can be folded into the weights by adding an input dimension that is always 1, and layer norm can be approximately merged into adjacent weight matrices. These simplifications don’t change the fundamental structure; they just strip away noise to reveal what’s really going on.
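The bias trick is easy to check numerically. A minimal sketch, with made-up shapes and random weights: append a constant 1 to the input and append the bias as an extra column of the weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3  # arbitrary illustrative sizes

W = rng.normal(size=(d_out, d_in))
b = rng.normal(size=d_out)
x = rng.normal(size=d_in)

# Fold the bias in: one extra weight column, one extra input dimension fixed at 1.
W_aug = np.hstack([W, b[:, None]])  # shape (d_out, d_in + 1)
x_aug = np.append(x, 1.0)           # shape (d_in + 1,)

with_bias = W @ x + b
folded = W_aug @ x_aug
assert np.allclose(with_bias, folded)
```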

The residual stream as a communication channel

Here’s an insight that sounds simple but changes how you think about the whole architecture: the residual stream is just a sum. Each layer adds its contribution, and the result accumulates. Nothing in the architecture processes the residual stream itself. It is purely a communication channel between layers.

Because everything is linear, the residual stream has no privileged basis: you could rotate it and adjust the weights to compensate without changing behavior. Since layers read and write linearly, you can also multiply out “virtual weights” between any two layers to reason about their interaction directly. And because each attention head operates on a small subspace of the full $d_{\text{model}}$-dimensional stream, heads can easily write to disjoint subspaces and ignore each other entirely. The catch is that the residual stream is massively oversubscribed. There are far more neurons and head dimensions competing to store information than there are residual stream dimensions to go around, so analyzing weights is more productive than staring at activations.
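Both claims, no privileged basis and virtual weights, fit in a few lines. A sketch under assumptions: `W_out` stands for some earlier layer's write matrix, `W_in` for some later layer's read matrix, and `R` is an arbitrary rotation. The names and sizes are mine, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_a, d_b = 8, 3, 5  # arbitrary illustrative sizes

W_out = rng.normal(size=(d_model, d_a))  # earlier layer writes into the stream
W_in = rng.normal(size=(d_b, d_model))   # later layer reads from the stream

# Virtual weights: the direct linear map from the earlier layer's output
# to the later layer's input, with the residual stream multiplied out.
virtual = W_in @ W_out

# Rotate the residual stream basis and compensate in both weight matrices.
R, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))  # random orthogonal matrix
virtual_rot = (W_in @ R.T) @ (R @ W_out)

# The interaction between the two layers is unchanged by the rotation.
assert np.allclose(virtual, virtual_rot)
```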

Attention heads and information movement

Attention heads are independent and additive. The standard “concatenate and multiply” formulation obscures this, but it’s mathematically equivalent to each head computing its own output and adding it to the residual stream separately. The fundamental action of an attention head is moving information from one token’s residual stream to another’s.
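The equivalence between “concatenate and multiply” and “sum of independent heads” can be verified directly. In this sketch `r` holds each head's result vectors (already mixed across tokens by attention) and `W_O` is the shared output matrix. Shapes are illustrative, and rows are tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_heads, d_head, d_model = 5, 4, 3, 12

r = rng.normal(size=(n_heads, n_tokens, d_head))    # per-head results
W_O = rng.normal(size=(n_heads * d_head, d_model))  # shared output projection

# Standard formulation: concatenate head results, then one big matrix multiply.
concat = np.concatenate(list(r), axis=-1) @ W_O

# Equivalent formulation: split W_O into one block per head, let each head
# write to the residual stream on its own, and add the results.
W_O_blocks = W_O.reshape(n_heads, d_head, d_model)
summed = sum(r[h] @ W_O_blocks[h] for h in range(n_heads))

assert np.allclose(concat, summed)
```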

This decomposes cleanly into two independent operations. The QK circuit determines which tokens attend to which, as a bilinear function of source and destination tokens. The OV circuit determines what happens to the output logits when a token is attended to. These are both just matrices over the vocabulary, and they can be studied independently. Once you fix the attention pattern, the whole head is linear.

Zero layer transformers

With no attention, a zero layer transformer is just $T = W_U W_E$. It can’t move information between tokens, so the best it can do is learn the bigram log likelihood of the next token given the current one. Not interesting on its own, but worth noting that the $W_U W_E$ term shows up as a “direct path” in every larger transformer, absorbing whatever bigram statistics the rest of the model doesn’t handle.
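A quick way to see the bigram claim: with only an embedding and an unembedding, the logits at a position are a table lookup on the current token, whatever came before. A sketch with random weights and a tiny made-up vocabulary, writing the embedding as `W_E` and the unembedding as `W_U`:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d_model = 20, 6

W_E = rng.normal(size=(d_model, n_vocab))  # embedding
W_U = rng.normal(size=(n_vocab, d_model))  # unembedding
bigram_table = W_U @ W_E                   # indexed [next token, current token]

def zero_layer_logits(tokens):
    """Next-token logits at every position; each row depends only on that position's token."""
    return bigram_table[:, tokens].T       # shape (n_tokens, n_vocab)

# Same final token, different context: the final prediction is identical.
logits_a = zero_layer_logits([3, 7, 11])
logits_b = zero_layer_logits([5, 2, 11])
assert np.allclose(logits_a[-1], logits_b[-1])
```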

One layer transformers

One layer attention only transformers turn out to be an ensemble of a bigram model and a bunch of “skip trigram” models. A skip trigram is a pattern of the form [source] ... [destination] → [out], where the head attends from the destination token back to the source token and uses that source to shift the logits for the next token. The path expansion of the logits makes this explicit:

$$T = \mathrm{Id} \otimes W_U W_E + \sum_{h} A^h \otimes \left( W_U W_{OV}^h W_E \right)$$

The first term is the bigram direct path. Each other term is a head’s contribution, split cleanly into an attention pattern and an OV matrix that tells you how attending to a source token adjusts the logits.

The OV and QK matrices are nominally huge, roughly 50k by 50k, but their rank is only $d_{\text{head}}$, so 64 or 128. They’re low rank factorizations of a giant behavior table, and reading off the largest entries gives you interpretable skip trigrams. Most of what these heads learn is copying: the OV circuit boosts the probability of whatever token it attends to, and the QK circuit attends back to tokens that could plausibly come next. This is already a primitive form of in context learning.
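The low rank structure is easy to reproduce at toy scale. The sketch below builds the full vocabulary by vocabulary OV and QK circuits from random weights and checks that their rank is capped by the head dimension. Sizes are tiny stand-ins for the real ones, and the weights are random, so the “largest entry” is meaningless here. In a trained model it would be a candidate skip trigram.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d_model, d_head = 50, 16, 4  # toy stand-ins for ~50k, d_model, 64 or 128

W_E = rng.normal(size=(d_model, n_vocab))
W_U = rng.normal(size=(n_vocab, d_model))
W_Q = rng.normal(size=(d_head, d_model))
W_K = rng.normal(size=(d_head, d_model))
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))

# Full circuits over the vocabulary (column-vector convention).
ov_circuit = W_U @ W_O @ W_V @ W_E      # indexed [out token, source token]
qk_circuit = W_E.T @ W_Q.T @ W_K @ W_E  # indexed [destination token, source token]

# Both tables are n_vocab x n_vocab but factor through d_head dimensions.
rank_ov = np.linalg.matrix_rank(ov_circuit, tol=1e-8)
rank_qk = np.linalg.matrix_rank(qk_circuit, tol=1e-8)
assert rank_ov == d_head and rank_qk == d_head

# Reading off the largest OV entry: which source token most boosts which output token.
out_tok, src_tok = np.unravel_index(np.argmax(ov_circuit), ov_circuit.shape)
print(f"largest OV entry: source {src_tok} boosts output {out_tok}")
```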

Copying shows up cleanly in the eigenvalues of the OV circuit. If $v$ is an eigenvector of the OV circuit $W_U W_{OV} W_E$ with positive eigenvalue $\lambda$, then $W_U W_{OV} W_E \, v = \lambda v$ means that the set of tokens represented by $v$ mutually boost their own logits when attended to. A copying head should have mostly positive eigenvalues, and in practice about 10 of 12 layer 1 heads do. Random matrices have a roughly even split of positive and negative eigenvalues, for contrast.
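One simple way to collapse “mostly positive eigenvalues” into a single number is $\sum_k \lambda_k / \sum_k |\lambda_k|$, which is near 1 for a copying matrix and near 0 for a random one. The sketch below applies it to two synthetic matrices. The function name `copying_score` and the test matrices are my own illustration, not weights from any model.

```python
import numpy as np

def copying_score(M):
    """Sum of eigenvalues over sum of their magnitudes, a value in [-1, 1]."""
    eig = np.linalg.eigvals(M)
    # Complex eigenvalues of a real matrix come in conjugate pairs, so the sum is real.
    return float(eig.sum().real / np.abs(eig).sum())

rng = np.random.default_rng(0)
n = 64

# A synthetic "copying" matrix: low rank and positive semidefinite,
# so every nonzero eigenvalue is positive.
B = rng.normal(size=(n, 8))
copying_like = B @ B.T

# A random matrix has eigenvalues scattered around zero.
random_like = rng.normal(size=(n, n))

score_copy = copying_score(copying_like)
score_rand = copying_score(random_like)
print(f"copying-like: {score_copy:.3f}, random: {score_rand:.3f}")
```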

The one layer model also exhibits a fun failure mode: because each head factors its behavior through a single source-destination interaction and a single OV matrix, three way interactions leak. If a head learns to boost keep ... in → mind and keep ... at → bay, it will also boost keep ... in → bay and keep ... at → mind. These “skip trigram bugs” are small, but they’re an early example of interpretability surfacing real model failures from the weights alone.
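The bug falls out of the factorization. In the deliberately simplified sketch below, a head's effect is an attention weight that depends on (destination, source) times an OV boost that depends only on (source, out). The dictionaries and numbers are invented for illustration, and the bigram direct path, which could partly correct the error in a real model, is left out.

```python
# Attention: how strongly each destination token attends back to each source token.
attn = {("in", "keep"): 1.0, ("at", "keep"): 1.0}

# OV: the logit boost the head applies once it attends to a source token.
# Note that it cannot see which destination token did the attending.
ov = {"keep": {"mind": 2.0, "bay": 2.0}}

def logit_boost(source, destination, out):
    """Head contribution to `out` for the skip trigram source ... destination -> out."""
    return attn.get((destination, source), 0.0) * ov.get(source, {}).get(out, 0.0)

intended = [logit_boost("keep", "in", "mind"), logit_boost("keep", "at", "bay")]
leaked = [logit_boost("keep", "in", "bay"), logit_boost("keep", "at", "mind")]

# The head cannot boost the two intended trigrams without boosting the two wrong ones.
assert intended == leaked
```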

Two layer transformers and induction heads

The key difference between one and two layer models is composition. A second layer head can read from a subspace that a first layer head wrote to, so its attention pattern or output can depend on what an earlier head did. There are three kinds:

Q-composition: $W_Q$ of a later head reads a subspace written by an earlier head.
K-composition: $W_K$ of a later head reads a subspace written by an earlier head.
V-composition: $W_V$ of a later head reads a subspace written by an earlier head.

Q and K composition change the attention pattern of the second head, letting it attend based on what earlier heads computed. V-composition is different: it chains value movement with value movement, effectively creating a single “virtual attention head” with attention pattern $A^{h_2} A^{h_1}$ and OV matrix $W_{OV}^{h_2} W_{OV}^{h_1}$.

Path expanding the two layer logit equation gives a direct path, individual head terms identical to the one layer case, and new virtual head terms from V-composition.
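With attention patterns frozen, this expansion is an exact identity and can be checked on a toy model. The sketch below uses random patterns and random low rank OV matrices, two heads per layer, and a row-vector convention (tokens are rows). Under that convention the composed OV matrix is written first layer first, the reverse of the column-vector order. Every size and weight here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tok, d_model, d_head, n_vocab, n_heads = 6, 8, 2, 10, 2

def random_pattern():
    """A random causal attention pattern: rows are softmaxed over earlier positions."""
    scores = rng.normal(size=(n_tok, n_tok))
    scores = np.where(np.tril(np.ones((n_tok, n_tok))) > 0, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def random_ov():
    """A random rank-d_head OV matrix acting on row vectors."""
    return rng.normal(size=(d_model, d_head)) @ rng.normal(size=(d_head, d_model))

A1 = [random_pattern() for _ in range(n_heads)]  # layer 1 patterns (frozen)
A2 = [random_pattern() for _ in range(n_heads)]  # layer 2 patterns (frozen)
OV1 = [random_ov() for _ in range(n_heads)]
OV2 = [random_ov() for _ in range(n_heads)]
X0 = rng.normal(size=(n_tok, d_model))           # token embeddings
W_U = rng.normal(size=(d_model, n_vocab))        # unembedding

# Ordinary forward pass through the residual stream.
X1 = X0 + sum(A @ X0 @ W for A, W in zip(A1, OV1))
X2 = X1 + sum(A @ X1 @ W for A, W in zip(A2, OV2))
logits = X2 @ W_U

# Path expansion: direct path, individual heads, and V-composition virtual heads.
direct = X0 @ W_U
single = sum(A @ X0 @ W @ W_U for A, W in zip(A1 + A2, OV1 + OV2))
virtual = sum(
    (Ab @ Aa) @ X0 @ (Wa @ Wb) @ W_U
    for Aa, Wa in zip(A1, OV1)
    for Ab, Wb in zip(A2, OV2)
)

assert np.allclose(logits, direct + single + virtual)
```

Here that is 1 direct term, 4 individual head terms, and 4 virtual head terms. Q and K composition do not appear as extra terms because they act through the attention patterns, which are held fixed in this check.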

Induction heads

The main result of the paper. In the two layer model studied, composition is used almost entirely for one thing: constructing induction heads. An induction head implements the pattern [a][b] ... [a] → [b]. It looks back through the context for previous occurrences of the current token and predicts whatever came next. This is a much more powerful form of in context learning than the one layer copying heads, and it works even on sequences of random tokens, since it doesn’t rely on bigram-like statistics about which tokens usually follow which.

A first layer previous token head attends from each position to the one before it and copies that token’s information into the residual stream. A second layer head then uses K-composition to read those shifted keys: the query is the current token, but the keys have been shifted forward by one position. Matching query to key finds positions where the previous token matches the current one, which is exactly where a repeated sequence would continue. The OV circuit of the second head is a copying matrix, so attending there copies the next token forward.
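Stripped of the linear algebra, the two step mechanism is a short algorithm. The sketch below is a hard-attention caricature in plain Python, not the model's actual computation: a real head attends softly over all matches, while this version simply takes the most recent one, which is my simplification.

```python
def previous_token_head(tokens):
    """Layer 1: each position records which token came immediately before it."""
    return [None] + tokens[:-1]

def induction_head(tokens):
    """Layer 2: find a position whose previous token equals the current token
    (the shifted key matches the query), then copy the token at that position."""
    shifted_keys = previous_token_head(tokens)
    query = tokens[-1]
    matches = [i for i, key in enumerate(shifted_keys) if key == query]
    if not matches:
        return None              # nothing earlier in the context to continue
    return tokens[matches[-1]]   # simplification: take the most recent match

# [a][b] ... [a] -> [b], with arbitrary tokens and no bigram statistics involved.
prediction = induction_head(["x", "q", "z", "v", "x", "q"])
print(prediction)  # prints z
```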

Checking the theory: induction heads should have a copying OV matrix and a “matching” QK matrix on the K-composition term. Both conditions correspond to strongly positive eigenvalues, and all the induction heads in the model sit in the extreme corner of this 2D space of eigenvalue positivity. This is not a circular test, because K-composition being large doesn’t mechanically force the resulting matrix to have positive eigenvalues; that only happens if the algorithm really is the induction algorithm.

Term importance and virtual heads

Path expansion gives an equation with exponentially many terms: direct path, individual heads, pairs of composed heads, triples, and so on. To check whether the high order virtual head terms actually matter, the authors ablate them using a recursive trick: run the model, record attention patterns, then run again with the patterns frozen and attention head outputs progressively zeroed out. Differences in loss isolate the contribution of each order of V-composition.

The result: virtual heads from V-composition contribute very little in this two layer model. Almost all the useful behavior lives in the direct path plus individual heads. That said, the number of possible virtual heads grows exponentially with depth, so in larger models they may well carry much more weight. Q and K composition, importantly, are a separate story and clearly do matter here since they’re the mechanism behind induction heads.

Where this leaves us

The paper makes no claim to explain real language models. It explains tiny attention only toy models and develops a vocabulary: residual stream, virtual weights, QK and OV circuits, path expansion, composition, induction heads. That vocabulary turns out to be the right one. Even in the presence of MLPs, attention heads still read and write through the residual stream, and circuits built purely out of attention still exist in large models. Induction heads in particular show up in transformers of every size studied since, and they appear to drive much of in context learning.

As the authors admit, there are certainly limitations. MLPs are two thirds of a standard transformer’s parameters, and this framework says nothing about them. Large chunks of model behavior live there, and until MLP layers yield to a similar decomposition, full reverse engineering of a real model stays out of reach. But the foothold is real. Everything from activation patching to SAEs to attribution graphs sits on top of the conceptual frame this paper put down.