Published on August 30, 2025
From Curiosity to Research Pt. 3 - A Change in My Study Strategy, Embracing Active Learning
I started with a top-down approach to studying, but I realized how poorly it fit my learning style, so I switched it up.
In part 2, I covered the curriculum I built for myself to explore mechanistic interpretability in neural networks. As I started down that path, though, I quickly realized how poorly it fit the way my brain works. I like building things, and I also like watching videos of other people building things. So I decided to be honest with myself and embrace a strategy that fits much better, even if it leaves some deeper (but broadly less important) gaps open: active learning.
The Active Learning Approach
The main strategy is to engage deeply with research papers, making sure I understand the underlying concepts and the overall ideas and findings all the way through, and noting any questions or areas of confusion as I go. This keeps me actively engaged with the material and helps me build a deeper understanding over time. There's nothing new about this approach on its own, but the key difference is that I'm going to intentionally look for connections between different papers and ideas, and how they all fit together, and I'll note those connections when I encounter them.
When I come across something that I just don't understand or have a good intuition for, I'll take a detour into resources that clarify that blocker before moving on. There are simply too many good resources out there not to leverage them in this process. The field is still young, but the amount of information available is already large and only growing as open problems continue to emerge.
So, where are you starting?
The idea is to start at a high level and then dig deeper and deeper into the field and the current problems within mechanistic interpretability. The main bank of papers I'll be working through comes from Neel Nanda's "An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers V2".
The first one mentioned is A Mathematical Framework for Transformer Circuits. It's an "old" paper in the world of AI, but these papers build on each other, and to understand the current problems in the field, you need a baseline intuition for the foundational concepts and the earlier problems that have been more or less solved.
Learn with me
As I work through these papers, I'll cover what I've learned, what I used to accompany my reading (videos, blog posts, etc.), and any internal thoughts I have about the material. The hope is that this not only helps me document my learning, but also lets me look back and see how my understanding has evolved over time.
The other goal is that maybe someone else out there who's also interested in understanding neural networks and solving the problems around interpretability can learn from my journey and use the resources I've shared here to help them.
Stay tuned for the first paper review, and thanks for reading.