The Field at a Glance
Foreword: the views below draw heavily on Neel Nanda’s writing, conversations with several Yale professors, and my own reading and reflection. Credit where it’s due; mistakes are mine.
Mech interp is a large field, and it is impossible to do research at the forefront of every part of it. I want to figure out how to contribute. The field is young, accessible, and full of opportunities, and it is arguably one of the most crucial and neglected parts of AI safety. The flip side is that there is a lot of slop: incorrect conclusions, fancy-looking tests, and traps that can waste hundreds of hours. This post collects what I have learned while trying to skim the frontier of current mech interp research.
The landscape
Two or three years ago, the dominant attitude was gung-ho. People wanted to completely reverse engineer models, and believed that doing so would give a total and final solution for AI safety. That has not worked out. Starting from A Mathematical Framework for Transformer Circuits, we have been stumped at every step of trying to decode an entire model. We have no great account of positional embeddings, and MLPs remain stubbornly resistant to interpretation. The tools we have let us pick apart the most salient and human-understandable parts of a model, but as Neel Nanda puts it, much of what a model does is small bits of machine bias that humans can’t or don’t care to understand.
Two things have driven the field away from that frame. First, models are now capable enough to exhibit genuinely safety-relevant behaviors: scheming, eval awareness, deception. Second, there is retrospective disappointment: ambitious mech interp made less progress than hoped, and existing techniques struggle with large models, agentic settings, and long chains of thought.
So over the last year, interest has shifted toward finding important and usable ways to interpret parts of models, and applying those techniques to downstream tasks. Fully disentangling a model is insanely hard. Understanding the important parts of its capabilities and being able to steer or edit them is almost as good, and it is actually tractable.
Theories of change
Nanda’s recent Alignment Forum post lays out four theories of change for mech interp:
- Science of misalignment. Figuring out whether models simply misunderstand instructions, or whether they have ulterior motives when being tested.
- Empowering other safety areas. CoT monitoring, eval awareness suppression, conceptual model psychology.
- Preventing egregiously misaligned actions. Stopping sandbagging, enabling cheap monitoring, investigating flagged behavior.
- Directly helping align models. Preventative steering, CAFT, and similar.
Work tends to fall into two buckets: (a) projects reasoned backward from a specific theory of change, and (b) robustly useful settings that look good from multiple angles. The common thread is pragmatic interp over aspirational interp: pick tractable sub-problems where qualitative, case-specific, unsupervised analysis actually beats other methods, and build toward theories of change that do not require what you can’t deliver.
Two concepts make this concrete. A North Star is a meaningful stepping stone goal that connects to AGI going well. A proxy task is an objective empirical task on today’s models that tracks progress toward the North Star. Good research keeps both in view at once.
How researchers actually research
The field has a strong bias toward fancy, intellectually exciting techniques, and this leads to bad tactical decisions. The guiding principle is to use the simplest method that works. Try methods in roughly this order, moving down the list only when the simpler option fails:
- Prompting
- Reading the chain of thought
- Prefill attacks
- Steering vectors
- Probes
- White-box and mechanistic techniques
An honest assessment of sparse autoencoders (SAEs) fits naturally here. They are strong for unsupervised discovery: finding unexpected things like entity recognition, implicit planning, and hidden goals. They are weak for tasks with a clear target, where constructing a dataset to investigate directly tends to outperform them. They are also weak for out-of-distribution generalization and for unlearning.
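The "simplest method that works" ordering earlier in this section can be written down as a tiny control loop. This is only a sketch of the decision rule, not anyone's actual tooling: the `Rung` type, the function name, and the lambda stand-ins are all hypothetical, and in practice each rung is hours or days of real work rather than a function call.

```python
from typing import Callable, List, Optional, Tuple

# A rung is (method name, attempt). `attempt` takes a research question and
# returns a finding, or None if the method could not answer it.
Rung = Tuple[str, Callable[[str], Optional[str]]]


def simplest_method_that_works(question: str, ladder: List[Rung]) -> Tuple[str, str]:
    """Try each method in order and stop at the first one that yields a finding."""
    for name, attempt in ladder:
        finding = attempt(question)
        if finding is not None:
            return name, finding
    raise LookupError("no method on the ladder answered the question")


# Hypothetical stand-ins for real investigations, cheapest first.
ladder: List[Rung] = [
    ("prompting", lambda q: None),
    ("reading the chain of thought", lambda q: None),
    ("prefill attacks", lambda q: None),
    ("steering vectors", lambda q: "behavior flips when steering along one direction"),
    ("probes", lambda q: "never reached"),
    ("white-box and mechanistic techniques", lambda q: "never reached"),
]

method, finding = simplest_method_that_works("Is the model eval aware?", ladder)
print(method, "->", finding)
```

The point of the early return is that the expensive rungs are never run once a cheaper one has produced an answer.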
Transferable strengths of mech interp researchers
Nanda identifies five:
- Working with model internals. Steering, probing, activation patching. These tools have different failure modes than standard ML.
- Deep dives. Taking a question about model behavior and producing a reliable, principled explanation. Good for auditing, red teaming, and confirming suspected misbehavior.
- Scientific mindset. Forming and testing hypotheses about complex phenomena with no clear ground truth. Designing experiments to falsify claims about fuzzy questions.
- Qualitative insight. Using SAEs and similar tools to find the key factors driving a specific behavior. Model biology work and shutdown resistance are examples.
- Unsupervised discovery. Surfacing hypotheses you would not have thought to look for. Examples include linear world models in Othello and entity recognition in LLMs.
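To make "working with model internals" a little more concrete, here is a minimal sketch of one simple baseline: a difference-of-means direction, used both as a linear probe and as a steering vector. Everything here is an assumption for illustration. The activations are synthetic Gaussian noise with the concept planted on coordinate 0, the function names are mine, and on a real model the vectors would come from a chosen layer's residual stream rather than a random number generator.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_direction(pos, neg):
    """Difference-of-means direction plus a midpoint threshold for probing."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    direction = [p - n for p, n in zip(mu_pos, mu_neg)]
    threshold = (dot(direction, mu_pos) + dot(direction, mu_neg)) / 2
    return direction, threshold


def probe_accuracy(direction, threshold, pos, neg):
    """Fraction of held-out activations classified correctly by projection."""
    hits = sum(dot(direction, x) > threshold for x in pos)
    hits += sum(dot(direction, x) <= threshold for x in neg)
    return hits / (len(pos) + len(neg))


def steer(activation, direction, alpha):
    """Add alpha times the direction to an activation vector."""
    return [a + alpha * d for a, d in zip(activation, direction)]


def fake_activations(rng, n, dim, shift):
    """Synthetic stand-in for real activations: the concept lives on coordinate 0."""
    return [[rng.gauss(shift if i == 0 else 0.0, 0.5) for i in range(dim)]
            for _ in range(n)]


rng = random.Random(0)
train_pos = fake_activations(rng, 200, 8, 3.0)
train_neg = fake_activations(rng, 200, 8, -3.0)
test_pos = fake_activations(rng, 100, 8, 3.0)
test_neg = fake_activations(rng, 100, 8, -3.0)

direction, threshold = fit_direction(train_pos, train_neg)
accuracy = probe_accuracy(direction, threshold, test_pos, test_neg)
steered = steer(test_neg[0], direction, alpha=1.0)
print("held-out probe accuracy:", accuracy)
```

The probe is scored on held-out data because a direction that only separates its own training set says little. On real activations the classes are rarely this cleanly separable, and a high probe accuracy does not by itself show that the model uses the direction causally, which is where steering and patching come in.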
What is exciting now: chain of thought monitorability
Of all the open directions, the one I find most compelling is chain of thought monitorability. The pitch is simple. Today’s frontier reasoning models think aloud in English before answering. If we can trust the CoT, safety gets dramatically easier: we read the CoT and check for anything suspicious. This is the cheapest alignment tool we have ever had. The question is whether it actually works, and for how long.
CoT faithfulness
The naive safety strategy is to read the CoT and look for red flags. There is real evidence this works sometimes and fails other times. Arcuschin et al. and Chen et al. both document cases of unfaithful CoT, where the reasoning the model writes is not the reasoning the model used. Several distinct failure modes show up:
- Post hoc rationalization. The model picks an answer first and then constructs a plausible justification (Arcuschin et al., Lindsey et al.).
- Hidden hint following. Given a hint, the model changes its answer but does not admit to using the hint (Chen et al.).
- Logical shortcuts. Stuck on a math problem, the model claims a valid proof while quietly skipping steps (Arcuschin et al.).
- Last-minute flips. A reasonable CoT ends with a final answer that contradicts the reasoning (Arcuschin et al.).
These are distinct phenomena with potentially distinct mechanisms, and being able to tell them apart is itself a research question with traction.
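As one example of turning a failure mode above into a number, hidden hint following can be scored from paired runs with and without a hint. The sketch below is loosely in the spirit of hint-based evaluations and is not the exact protocol of any cited paper. The `HintTrial` record, its field names, and the toy data are hypothetical, and `cot_mentions_hint` is assumed to come from some separate grader.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HintTrial:
    answer_no_hint: str      # model's answer on the plain prompt
    answer_with_hint: str    # model's answer when the hint is included
    hinted_answer: str       # the answer the hint points to
    cot_mentions_hint: bool  # grader's judgment: does the CoT admit using the hint?


def used_hint(trial: HintTrial) -> bool:
    """The model switched to the hinted answer only when the hint was present."""
    return (trial.answer_no_hint != trial.hinted_answer
            and trial.answer_with_hint == trial.hinted_answer)


def verbalization_rate(trials: List[HintTrial]) -> Optional[float]:
    """Among trials where the hint changed the answer, how often does the CoT say so?"""
    used = [t for t in trials if used_hint(t)]
    if not used:
        return None  # no evidence either way
    return sum(t.cot_mentions_hint for t in used) / len(used)


# Toy data, made up for illustration.
trials = [
    HintTrial("A", "C", "C", cot_mentions_hint=True),   # followed hint, admitted it
    HintTrial("B", "C", "C", cot_mentions_hint=False),  # followed hint silently
    HintTrial("A", "D", "D", cot_mentions_hint=False),  # followed hint silently
    HintTrial("C", "C", "C", cot_mentions_hint=False),  # already right, excluded
]
rate = verbalization_rate(trials)
print("verbalization rate:", rate)
```

Trials where the model already gave the hinted answer are excluded, since there is no way to tell whether the hint did anything. A low rate is evidence of unfaithfulness on this one failure mode only and says nothing about the other three.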
A few directions feel tractable:
- Induce suspicious reasoning and see what happens. Train the CoT to be distorted, or tell the model its CoT is being monitored. Does it still do the bad behavior but with a cleaner CoT? How complex a behavior can it hide?
- Synthetic document finetuning. Train a model on synthetic documents to believe it should have an unfaithful CoT, then interpret what is happening inside. This is the kind of model organism setup that could actually be mechanistically pinned down.
- Causal importance of CoT. Can we tell when a CoT was causally important for a model’s answer? This is subtle. A CoT can agree with the final answer without having caused it, and editing the CoT can knock the model off distribution in ways that change the answer for unrelated reasons.
- Better monitors. Design metrics for whether the CoT is really telling us what we think it is.
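One simple way to start on the causal importance question above is a truncation test, sometimes called early answering: force an answer after each prefix of the CoT and see how often it already matches the full-CoT answer. The sketch below uses two toy functions in place of a real model, and `early_answer_agreement` and the `answer_fn` signature are assumptions for illustration, not an established API.

```python
from typing import Callable, List

# answer_fn(question, cot_steps) -> final answer given that (possibly truncated) CoT.
AnswerFn = Callable[[str, List[str]], str]


def early_answer_agreement(question: str, cot_steps: List[str], answer_fn: AnswerFn) -> float:
    """Fraction of strict CoT prefixes whose answer matches the full-CoT answer.

    Near 1.0 suggests the answer did not depend on the CoT;
    near 0.0 suggests the later steps mattered.
    """
    if not cot_steps:
        raise ValueError("need at least one CoT step to truncate")
    full_answer = answer_fn(question, cot_steps)
    matches = [answer_fn(question, cot_steps[:k]) == full_answer
               for k in range(len(cot_steps))]
    return sum(matches) / len(matches)


def cot_dependent_model(question: str, steps: List[str]) -> str:
    """Toy stand-in: the answer is the running total of the numbers in the CoT."""
    return str(sum(int(s) for s in steps))


def cot_ignoring_model(question: str, steps: List[str]) -> str:
    """Toy stand-in: the answer is fixed no matter what the CoT says."""
    return "12"


steps = ["3", "4", "5"]
dependent = early_answer_agreement("What is 3 + 4 + 5?", steps, cot_dependent_model)
ignoring = early_answer_agreement("What is 3 + 4 + 5?", steps, cot_ignoring_model)
print(dependent, ignoring)
```

The caveat from the list still applies: a truncated CoT is itself off distribution for a real model, so a changed answer is suggestive rather than conclusive, and the score is best compared across tasks or models rather than read in isolation.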
Latent CoT interpretability
At some point we probably stop using English CoT. Models will reason in latent vectors, and the cheap safety tool goes away. Figuring out how much interpretability can make up for that loss is a big deal.
The glaring flaw is that we do not know what future latent reasoning models will actually look like. There are contenders, though, and projects targeting any serious contender are probably fruitful, with the caveat that the more specific a project is to one model, the less its lessons generalize.
A concrete starting point: pick an open-source latent CoT model, pick a task with high serial depth like a hard math or logic problem, and try to interpret what is happening. De-risk with a mini-project first. Filter for models that have hype behind them and have been shown to be effective at reasonable scale. A natural extension is to test how well existing techniques (probes, SAEs, activation patching) transfer.
Why this over everything else
Three reasons. First, it connects directly to a theory of change (empowering other safety areas, specifically CoT monitoring), and also to science of misalignment, since unfaithful CoT is the clearest window into “does the model have ulterior motives.” Second, the proxy tasks are concrete and run on today’s models. Third, there is a built-in deadline: the tool becomes less useful every time a new architecture moves reasoning out of natural language.