
Published on August 18, 2025

From Curiosity to Research Pt. 1 - From Using LLMs to Understanding the Internals

Covering the journey from initial curiosity about neural networks to a deeper understanding of how they make decisions.

Tags: career, llm, ai, machine learning, research

Like a lot of people interested in AI, my understanding of neural networks started with a little bit of confusion, but mostly wonder. How do these systems actually work? What's happening inside those layers of connected nodes that allows them to recognize images, generate text, and perform what seems to be some form of intelligence? My journey toward answering these questions started in one of the most accessible places on the internet: YouTube.

The Spark: 3Blue1Brown's Neural Networks

I started watching Grant Sanderson's incredible 3Blue1Brown series on neural networks. If you've never watched these videos, you're missing out on some of the clearest explanations of complex mathematical concepts available anywhere. Grant has this ability to make abstract ideas tangible through really beautiful visualizations and intuitive explanations.

As I worked through the series, something clicked. I started to understand the fundamentals of backpropagation, gradient descent, and how these networks learn. But the more I learned, the more questions came up for me. The same questions kept surfacing: are these LLMs actually learning? What patterns are they detecting? How are they determining meaning?
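To make that training loop concrete, here's a minimal sketch of my own (not something from the series) of gradient descent on a toy one-weight model, assuming NumPy: compute a loss, take its gradient with respect to the weight, and step the weight downhill.

```python
import numpy as np

# Toy data: y = 3x + noise. The "model" is a single weight w.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

w = 0.0    # initial weight
lr = 0.1   # learning rate

for step in range(200):
    y_hat = w * x                         # forward pass
    loss = np.mean((y_hat - y) ** 2)      # mean squared error
    grad = np.mean(2 * (y_hat - y) * x)   # dLoss/dw, the one-weight case of backprop
    w -= lr * grad                        # gradient descent step

print(w, loss)  # w converges to roughly 3.0
```

Real networks repeat this same loop over millions of weights, with backpropagation computing all of the gradients at once.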

The Rabbit Hole: Superposition

While watching the videos, the idea that really set off my curiosity came near the end of the series: superposition. When Grant mentioned it in the context of neural networks, the concept felt completely foreign to me at the time.

The idea that a network might represent multiple features in the same set of neurons, overlapping and interfering with each other, seemed a little crazy to me. After a few more web searches, I found myself digging deeper, picking up where the video series left off and trying to understand not just what superposition was, but why it mattered. That curiosity led me down a rabbit hole, and eventually to this blog post.
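As a rough illustration of what "overlapping features" means, here's a toy sketch of my own (loosely in the spirit of the superposition work from the interpretability literature, not anything from the video series), assuming NumPy: if you try to pack more feature directions than you have neurons, the directions can't all be orthogonal, so reading out one feature picks up a little interference from the others.

```python
import numpy as np

# Hypothetical toy: 20 "features" crammed into only 8 "neurons".
rng = np.random.default_rng(0)
n_features, n_neurons = 20, 8

# Give each feature a random unit-length direction in neuron space.
W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Interference: how much feature i's direction overlaps with feature j's.
overlaps = W @ W.T
np.fill_diagonal(overlaps, 0)

print(f"max interference between distinct features: {np.abs(overlaps).max():.2f}")
# In 8 dimensions you can only have 8 exactly-orthogonal directions,
# so with 20 features some overlap (and thus interference) is unavoidable.
```

That tension between how many features a network "wants" to represent and how few neurons it has is what made the idea click for me.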

The Discovery: Mechanistic Interpretability

While digging into superposition, I bumped into a term I'd never heard of before: mechanistic interpretability.

I knew that people at frontier labs like OpenAI and Anthropic had been working on this, but I honestly didn't know much about it, much less that there was an entire field focused on these questions. Mechanistic interpretability, in the context of neural networks, isn't just about improving performance or building better models; it's about understanding the internal mechanisms that drive these systems.

The Guides: Neel Nanda and Christopher Olah

As I started to get a sense of how old (or new) this field was and how much progress had been made, two names kept appearing, among others: Neel Nanda and Christopher Olah. Their writing, videos, and podcast appearances have really helped me get a solid mental grasp of what the hell is going on in this field.

Christopher Olah's work, and his foundational thinking about interpretability, opened my eyes to the possibility of truly understanding neural networks. His ability to communicate complex ideas clearly, combined with his groundbreaking research, made it pretty obvious that he's a key figure in this space.

Neel Nanda's approach to mechanistic interpretability, with its emphasis on concrete, actionable research and approachable educational content, showed me that this field wasn't only theoretical; it was practical, achievable, and very much needed. His work demonstrated that we can actually reverse-engineer neural networks and understand their internal algorithms.

Reading their blog posts and educational materials, and watching and listening to the videos and podcasts they've been part of, I realized this was something I wanted to dedicate a significant amount of time to.

The Commitment: Building a Learning Plan

Realizing the depth and complexity of mechanistic interpretability, I knew I needed more than casual interest; I needed a systematic approach to really learning the fundamentals and then applying them to practical problems and research. So I created a personal curriculum designed to take me from my current casual awareness to what I hope will be a reasonably advanced understanding of the field.

I'm planning to spend at least an hour every day studying, learning, and practicing. Some days that might mean reading research papers; others, implementing techniques in code; and still others, working through the mathematical concepts I need in order to make progress on something that previously went over my head.

The goal isn't just to understand mechanistic interpretability in theory; I want to contribute meaningfully to the field. Eventually, I hope to work on a known, concrete problem, whether by helping on a research paper or by contributing to open-source software that advances the field, preferably both.

The Dream: Contributing to the Field

Part of me dreams of eventually working alongside researchers like Neel and Chris, but I'm approaching this with realistic expectations and a long-term mindset.

One step at a time. Eventually, maybe I'll be fortunate enough to collaborate with the researchers whose work inspired this journey.

What's Next?

I'm planning to follow up with detailed posts about my learning plan and the specific resources I'll be using. I want to document this journey not just for my own accountability, but because I think there might be others out there with similar curiosity who could benefit from seeing my systematic approach to entering this field.

The journey from knowing the absolute basics to being a contributing researcher or engineer won't be easy, but it feels like a path that's genuinely possible, at least for me.

Thanks for reading.