Published on September 3, 2025
From Curiosity to Research Pt. 4 - Understanding Transformers and Transformer Circuits, with a surprise curveball!
Covering the papers and my thoughts as I'm studying the foundations of transformers and their place in mechanistic interpretability.
After committing to the active learning approach I described in part 3, I dove headfirst into understanding the foundational architecture of transformers. This deep dive led me down an unexpected path that I'm excited to share with you.
Going Back to Transformer Fundamentals
Before tackling the research papers on my list, I realized I needed to really solidify my understanding of how transformers actually work at a fundamental level. I started with some incredible resources that break down these complex systems into digestible pieces.
Neel Nanda's two-part transformer explainer videos were my starting point. His approach of explaining transformers from first principles, combined with actual code walkthroughs, helped me understand not just what transformers do, but how they do it. I also revisited Jay Alammar's visual transformer explainer, which remains one of the best visual breakdowns of the architecture I've come across.
The Circuits Framework
With that foundation in place, I finally felt ready to tackle A Mathematical Framework for Transformer Circuits, the paper I mentioned I'd be starting with in my last post. This paper formally introduces the concept of "circuits" into the world of mechanistic interpretability.
In this context, a circuit describes how specific components of a transformer, most notably its attention heads, work together to move and transform information in service of the model's core capability of next token prediction. Think of circuits as the specific computational subgraphs within the transformer that implement particular behaviors or functions.
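To make that concrete for myself, here's a minimal sketch of the idea at the heart of the paper: a single attention head factors into a QK circuit (which decides where to attend) and an OV circuit (which decides what information gets moved). This is my own toy example with made-up dimensions and random weights, not code from the paper, but the factorization W_QK = W_Q W_K^T and W_OV = W_V W_O is the one the paper works with.

```python
# A toy, self-contained sketch of the per-head QK and OV circuit matrices
# from "A Mathematical Framework for Transformer Circuits". All sizes and
# weights here are made up purely for illustration.
import torch

d_model, d_head = 64, 16  # toy dimensions; real models are far larger

# Randomly initialized per-head weight matrices
W_Q = torch.randn(d_model, d_head)
W_K = torch.randn(d_model, d_head)
W_V = torch.randn(d_model, d_head)
W_O = torch.randn(d_head, d_model)

# QK circuit: scores (destination, source) pairs of residual-stream
# vectors, i.e. it determines *where* the head attends.
W_QK = W_Q @ W_K.T   # (d_model, d_model), rank at most d_head

# OV circuit: maps a source residual-stream vector to the head's output,
# i.e. it determines *what* gets written once attention is decided.
W_OV = W_V @ W_O     # (d_model, d_model), rank at most d_head

print(W_QK.shape, W_OV.shape)
```

Seeing both circuits as low-rank d_model × d_model matrices is part of what makes the paper's later analysis of how heads compose across layers feel tractable.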
If you're interested in diving into this paper yourself, I highly recommend watching Neel Nanda's explainer video on the Mathematical Framework paper alongside reading it. His breakdown helped me understand some of the more complex mathematical concepts and their practical implications.
Circuits themselves are still a developing area of research, and my understanding of them is still evolving. So I'm going to take the time to really dig in here: re-read the paper and explore other related works to build my intuition around them. I'll share those in future posts as I come across them.
The Surprise Curveball: Georgia Tech's AISI
Life has a funny way of quietly nudging you in unexpected directions.
While researching transformer circuits this week, I stumbled onto Georgia Tech's AI Safety Initiative (AISI) group while searching for meetups and conferences that were happening throughout the rest of the year. I found out that they were hosting a discussion on a paper titled "The Circuits Research Landscape: Results and Perspectives" today (September 3rd, 2025), so obviously I decided to attend.
The discussion was great, and it gave me a fairly broad look at the current research landscape of transformer circuits within the context of AI safety and interpretability, along with adjacent areas. I left with a lot more energy and a clearer sense of the direction I wanted to go, but I still needed to better understand the open problems the paper covered and where my current skill set fit within that problem space.
Open Problems and What I'm Focusing On
The open problems highlighted and discussed in "The Circuits Research Landscape" vary widely in complexity. There are several concrete challenges that need solving, and many of them felt approachable even for someone like me who is still building intuition for the transformer architecture. Some problems involve developing better tools for circuit discovery, others focus on understanding how circuits compose and interact with one another, and others deal with scaling these techniques to larger models, both reasoning and non-reasoning.
After rolling a few of the problems around in my head, and chatting with AI to understand what might be feasible at the moment, an overall goal became pretty clear: I want to build tools that help other researchers in the field, especially those who can't easily build good tooling themselves, conduct their research more effectively. The problems where I feel I could make a significant contribution are the following:
- Chain-of-thought interpretability in reasoning models
- Visualizing the circuit structure within transformers more deeply (sketch below)
- Observing and visualizing model behavior across different prompting techniques at scale
- Introspecting how models aggregate information over long contexts
There are a few others, but these are where I feel like I can make an impact the soonest.
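To give a rough sense of where the visualization work might start, here's a tiny sketch that plots a single attention head's pattern as a heatmap. The attention matrix below is fabricated; actual tooling would capture these values from a real model's forward pass.

```python
# A minimal sketch of visualizing one attention head's pattern as a heatmap.
# The attention weights here are fabricated; real tooling would pull them
# from a model's forward pass instead.
import torch
import matplotlib.pyplot as plt

seq_len = 8
tokens = [f"tok{i}" for i in range(seq_len)]  # placeholder token labels

# Fake causal attention pattern: random scores, masked and softmaxed
scores = torch.randn(seq_len, seq_len)
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)

fig, ax = plt.subplots()
im = ax.imshow(attn.numpy(), cmap="viridis")
ax.set_xticks(range(seq_len))
ax.set_xticklabels(tokens, rotation=45)
ax.set_yticks(range(seq_len))
ax.set_yticklabels(tokens)
ax.set_xlabel("source (attended-to) position")
ax.set_ylabel("destination (attending) position")
fig.colorbar(im, ax=ax, label="attention weight")
plt.tight_layout()
plt.show()
```

It's deliberately basic, but it's the kind of building block I'd want to generalize across heads, layers, and prompts.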
Looking Ahead
In my next post, I'll dive deeper into the specific open problems I've identified and share my thoughts on which one I'll start with first. While doing that, I'll continue learning about the underlying principles of transformers and circuits, and get to the heart of why these models are so hard to interpret.
I'm also planning to put together an introductory talk to submit to general tech conferences, aiming to share insights from my research journey and convey the importance of mechanistic interpretability in AI to a broader audience.
If you'd like to stay updated on my progress or have any questions, feel free to reach out!
Thanks for reading.