Software and Tools

Mech-Interp Toolkit: Tools for Mechanistic Interpretability

The Mech-Interp Toolkit is a curated library of essential methods for mechanistic interpretability in transformer-based language models. Built on top of TransformerLens and AutoCircuit, the toolkit brings together widely cited techniques from the mech-interp literature into a unified, accessible implementation.

Designed as both an educational resource and a research accelerator, this toolkit enables fine-grained analysis of internal model mechanisms through:

  • Observational methods (e.g., Logit Lens, Direct Logit Attribution)
  • Interventional methods (e.g., Activation and Path Patching)
  • Automatic circuit discovery (e.g., ACDC, Edge Attribution Patching)
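
To make the interventional idea concrete, here is a minimal sketch of activation patching on a toy two-component model. It is not the toolkit's API and not a transformer: `run_toy_model`, `patching_effects`, and the component names `comp_a` and `comp_b` are hypothetical, chosen only to show the bookkeeping (run clean, run corrupted, copy one clean activation into the corrupted run, measure how much of the output gap is recovered).

```python
from typing import Dict, List, Optional


def run_toy_model(
    x: List[float], patch: Optional[Dict[str, float]] = None
) -> Dict[str, float]:
    """Run a toy model and return every named activation.

    `patch` maps a component name to a value that overrides what the
    component would normally compute (the "patched" activation).
    """
    patch = patch or {}
    acts: Dict[str, float] = {}
    # comp_a depends on both inputs; comp_b depends only on the first.
    acts["comp_a"] = patch.get("comp_a", x[0] + x[1])
    acts["comp_b"] = patch.get("comp_b", 0.5 * x[0])
    # Linear readout standing in for the task metric.
    acts["output"] = 2.0 * acts["comp_a"] - acts["comp_b"]
    return acts


def patching_effects(
    clean_x: List[float], corrupted_x: List[float], components: List[str]
) -> Dict[str, float]:
    """Fraction of the clean-vs-corrupted output gap recovered per component.

    Assumes the clean and corrupted outputs differ (non-zero gap).
    """
    clean = run_toy_model(clean_x)
    corrupted = run_toy_model(corrupted_x)
    gap = clean["output"] - corrupted["output"]
    effects: Dict[str, float] = {}
    for name in components:
        # Corrupted run with one activation copied over from the clean run.
        patched = run_toy_model(corrupted_x, patch={name: clean[name]})
        effects[name] = (patched["output"] - corrupted["output"]) / gap
    return effects


# The inputs differ only in x[1], which only comp_a reads.
effects = patching_effects([1.0, 2.0], [1.0, 0.0], ["comp_a", "comp_b"])
print(effects)  # {'comp_a': 1.0, 'comp_b': 0.0}
```

In this toy setting, patching `comp_a` restores the full clean output while patching `comp_b` changes nothing, which is the kind of signal used to localize which component carries the task-relevant information.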

All methods are demonstrated on the classic Indirect Object Identification (IOI) task, with ready-to-use Jupyter notebooks and easy extensibility to other tasks.
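
For readers new to IOI, the sketch below shows one common way to set up the task: a templated prompt whose correct completion is the indirect object, plus the logit-difference metric usually reported for it. The template wording, the name-swap corruption, the helper names `make_ioi_pair` and `logit_diff`, and the logit values are all illustrative assumptions, not taken from the toolkit's notebooks.

```python
from typing import Dict

# Hypothetical template; "io" is the indirect object, "s" the subject.
TEMPLATE = "When {io} and {s} went to the store, {s} gave a drink to"


def make_ioi_pair(io: str, s: str) -> Dict[str, str]:
    """Build a clean prompt and a corrupted one with the name roles swapped.

    Swapping roles is just one possible corruption, assumed here for simplicity.
    """
    return {
        "clean": TEMPLATE.format(io=io, s=s),
        "corrupted": TEMPLATE.format(io=s, s=io),
        "answer": " " + io,  # leading space, as in many BPE tokenizers
        "distractor": " " + s,
    }


def logit_diff(next_token_logits: Dict[str, float], answer: str, distractor: str) -> float:
    """Logit of the correct name minus logit of the repeated subject."""
    return next_token_logits[answer] - next_token_logits[distractor]


pair = make_ioi_pair("Mary", "John")
print(pair["clean"])

# Placeholder logits for illustration only; a real run would read these
# from the model's final-position output.
fake_logits = {" Mary": 3.0, " John": 1.0}
print(logit_diff(fake_logits, pair["answer"], pair["distractor"]))  # 2.0
```

A positive logit difference means the model prefers the indirect object over the repeated subject, so the same scalar can serve as the metric for patching and circuit-discovery experiments.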

🚀 Explore the code and contribute via GitHub

🧠 Ideal for researchers and students working on explainable AI, in line with INNOLABS’ focus on AI transparency, agentic systems, and responsible innovation.