The Mech-Interp Toolkit is a curated library of essential methods for mechanistic interpretability in transformer-based language models. Built on top of TransformerLens and AutoCircuit, the toolkit brings together widely cited techniques from the mech-interp literature into a unified, accessible implementation.
Designed as both an educational resource and a research accelerator, this toolkit enables fine-grained analysis of internal model mechanisms through:
- Observational methods (e.g., Logit Lens, Direct Logit Attribution); a minimal Logit Lens sketch follows this list
- Interventional methods (e.g., Activation and Path Patching)
- Automatic circuit discovery (e.g., ACDC, Edge Attribution)
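As an example of the observational methods above, the Logit Lens fits in a few lines of TransformerLens. The snippet below is a minimal sketch rather than the toolkit's own implementation; the model choice (`gpt2-small`), the prompt, and the layer loop are illustrative assumptions.

```python
# Minimal Logit Lens sketch (illustrative; not the toolkit's API).
# Assumes TransformerLens is installed; model and prompt are arbitrary choices.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

# Project the residual stream after each layer through the final LayerNorm and
# the unembedding, revealing what the model "predicts" at intermediate depths.
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][:, -1:, :]        # final token position
    layer_logits = model.ln_final(resid) @ model.W_U     # shape [1, 1, d_vocab]
    top_id = layer_logits.argmax(dim=-1).item()
    print(f"layer {layer:2d} -> {model.tokenizer.decode(top_id)!r}")
```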
All methods are demonstrated on the classic Indirect Object Identification (IOI) task, with ready-to-use Jupyter notebooks and easy extensibility to other tasks.
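To give a flavour of the interventional methods on the IOI task, the sketch below patches a clean residual-stream activation into a corrupted run using raw TransformerLens hooks rather than the toolkit's own wrappers; the layer and position are arbitrary assumptions chosen for illustration.

```python
# Minimal activation-patching sketch on an IOI-style prompt pair
# (illustrative; the single layer/position chosen is an assumption, not a toolkit default).
from functools import partial

from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")

clean_prompt   = "When Mary and John went to the store, John gave a drink to"
corrupt_prompt = "When Mary and John went to the store, Mary gave a drink to"
answer_id = model.to_single_token(" Mary")

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)
_, clean_cache = model.run_with_cache(clean_tokens)

def patch_resid(corrupt_resid, hook, pos):
    # Overwrite the corrupted residual stream at one position with the clean value.
    corrupt_resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return corrupt_resid

layer, pos = 9, clean_tokens.shape[1] - 1   # arbitrary single layer and position
patched_logits = model.run_with_hooks(
    corrupt_tokens,
    fwd_hooks=[(utils.get_act_name("resid_pre", layer), partial(patch_resid, pos=pos))],
)
print("patched logit for ' Mary':", patched_logits[0, -1, answer_id].item())
```

Sweeping the patched layer and position over the whole model recovers the familiar activation-patching heatmap used in the IOI literature.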
🚀 Explore the code and contribute via GitHub
🧠 Ideal for researchers and students advancing explainable AI, especially in alignment with INNOLABS’ focus on AI transparency, agentic systems, and responsible innovation.