Mech-Interp Toolkit: Tools for Mechanistic Interpretability

The Mech-Interp Toolkit is a curated library of essential methods for mechanistic interpretability in transformer-based language models. Built on top of TransformerLens and AutoCircuit, the toolkit brings together widely cited techniques from the mech-interp literature into a unified, accessible implementation.

Designed as both an educational resource and a research accelerator, this toolkit enables fine-grained analysis of internal model mechanisms through:

Observational methods (e.g., Logit Lens, Direct Logit Attribution)
Interventional methods (e.g., Activation and Path Patching)
Automatic circuit discovery (e.g., ACDC, Edge Attribution)
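
To make the first two categories concrete, here is a minimal, self-contained sketch of a logit lens and of activation patching on a toy residual-stream model. It is not the toolkit's implementation and does not use the TransformerLens or AutoCircuit APIs. The toy model, the NumPy setup, and every name in it (`run`, `logit_lens`, `patching_effect`, `W_layers`, `W_U`) are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab, n_layers = 8, 5, 3

# Hypothetical toy model: each layer adds tanh(W @ resid) to a residual stream.
W_layers = [rng.normal(scale=0.5, size=(d_model, d_model)) for _ in range(n_layers)]
W_U = rng.normal(size=(d_model, d_vocab))  # toy unembedding matrix


def run(x, patch=None):
    """Run the toy model.

    patch: optional (layer_index, vector) that overwrites that layer's output
    before it is added to the residual stream.
    Returns (final_logits, per-layer outputs, per-layer residual streams).
    """
    resid = x.copy()
    outs, resids = [], []
    for i, W in enumerate(W_layers):
        out = np.tanh(W @ resid)
        if patch is not None and patch[0] == i:
            out = patch[1]
        resid = resid + out
        outs.append(out)
        resids.append(resid.copy())
    return resid @ W_U, outs, resids


def logit_lens(resids):
    """Observational: project each intermediate residual through the unembedding."""
    return np.stack([r @ W_U for r in resids])  # shape (n_layers, d_vocab)


def patching_effect(clean_x, corrupt_x, layer, tok_a, tok_b):
    """Interventional: share of the clean logit difference restored when one
    layer's output in the corrupted run is replaced by its clean value."""
    def diff(logits):
        return logits[tok_a] - logits[tok_b]

    clean_logits, clean_outs, _ = run(clean_x)
    corrupt_logits, _, _ = run(corrupt_x)
    patched_logits, _, _ = run(corrupt_x, patch=(layer, clean_outs[layer]))
    return (diff(patched_logits) - diff(corrupt_logits)) / (
        diff(clean_logits) - diff(corrupt_logits)
    )


clean_x = rng.normal(size=d_model)
corrupt_x = rng.normal(size=d_model)

lens = logit_lens(run(clean_x)[2])
effects = [patching_effect(clean_x, corrupt_x, layer, 0, 1) for layer in range(n_layers)]
```

In a real transformer the same two ideas are applied per token position and per component (attention heads, MLPs), which is what makes the resulting attribution maps informative.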

All methods are demonstrated on the classic Indirect Object Identification (IOI) task in ready-to-use Jupyter notebooks, which are designed to be extended to other tasks.
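
For readers new to IOI, the sketch below shows what a clean/corrupted prompt pair and the usual logit-difference score might look like. The template, the corruption scheme (replacing the repeated subject with a third name), and the helper names `make_ioi_pair` and `logit_diff` are illustrative assumptions, not the toolkit's dataset code. The logits are made-up numbers standing in for a model's output.

```python
def make_ioi_pair(io_name, subj_name, distractor):
    """Build one clean and one corrupted IOI prompt from a fixed template."""
    template = "When {a} and {b} went to the store, {c} gave a drink to"
    # Clean prompt: the subject is repeated, so the indirect object is the answer.
    clean = template.format(a=io_name, b=subj_name, c=subj_name)
    # Corrupted prompt (assumed scheme): a third name removes the repetition cue.
    corrupted = template.format(a=io_name, b=subj_name, c=distractor)
    return clean, corrupted


def logit_diff(logits, io_name, subj_name):
    """Logit of the indirect object minus logit of the subject."""
    return logits[io_name] - logits[subj_name]


clean, corrupted = make_ioi_pair("Mary", "John", "Alice")

# Made-up next-token logits for the two candidate names.
toy_logits = {"Mary": 3.5, "John": 1.0}
score = logit_diff(toy_logits, "Mary", "John")
```

A positive score means the model prefers the indirect object ("Mary") over the repeated subject ("John"). Patching and circuit-discovery methods are typically evaluated by how much of this difference they preserve or restore.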

🚀 Explore the code and contribute via GitHub

🧠 Ideal for researchers and students working on explainable AI, in line with INNOLABS’ focus on AI transparency, agentic systems, and responsible innovation.