This research project was developed as part of a Bachelor’s thesis at the Faculty of Mathematics and Information Science, Warsaw University of Technology.

Co-authors
Michał Piechota
Vladimir Zaigrajew (supervisor)
Przemysław Biecek (supervisor)
Abstract
The increasing complexity of neural networks has created a critical need for methods that interpret their internal decision-making processes. One prominent approach is neuron labeling, which aims to explain model behavior by assigning human-understandable textual explanations (concepts) to individual neurons. We propose LINE, an iterative method for neuron labeling that automatically discovers the concept that most strongly activates a specific neuron without being restricted to a predefined, fixed vocabulary. Our method uses an optimization loop in which a Language Model proposes candidate concepts, which are then validated by generating synthetic images with Stable Diffusion and scoring the target neuron's response to them. This iterative process lets the system discover relevant new concepts that are absent from predefined vocabularies.
We evaluate our approach across several ResNet-family architectures, demonstrating its robustness in various deep learning contexts. Quantitative results show that LINE outperforms the current state-of-the-art methods, CLIP-Dissect and INVERT, providing more accurate and relevant explanations. In our experiments, 31% of the labels generated by our method lie outside standard predefined vocabularies, showing that LINE can uncover features that traditional closed-vocabulary approaches overlook.
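The optimization loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (label_neuron, best_label) and all toy stand-ins are hypothetical. In the real system the proposer would be a Language Model, the image generator would be Stable Diffusion, and the activation would be read from a neuron in a ResNet-family network. Scoring a concept by the mean activation over its generated images is also an assumption made here for illustration.

```python
from typing import Callable, Dict, List, Sequence


def label_neuron(
    propose: Callable[[Dict[str, float]], List[str]],
    generate_images: Callable[[str], Sequence[object]],
    neuron_activation: Callable[[object], float],
    seed_concepts: List[str],
    n_iterations: int = 5,
) -> Dict[str, float]:
    """Iteratively propose, render, and score concepts for one neuron.

    Returns every concept tried, mapped to its mean activation.
    All callables are hypothetical placeholders for the real components.
    """
    scores: Dict[str, float] = {}
    candidates = list(seed_concepts)
    for _ in range(n_iterations):
        for concept in candidates:
            if concept in scores:
                continue  # skip concepts that were already evaluated
            images = generate_images(concept)
            activations = [neuron_activation(img) for img in images]
            scores[concept] = sum(activations) / len(activations)
        # The proposer sees the full score history and suggests new concepts.
        candidates = propose(scores)
    return scores


def best_label(scores: Dict[str, float]) -> str:
    """Pick the concept with the highest mean activation."""
    return max(scores, key=scores.get)


# --- Toy stand-ins (assumptions, for illustration only) ---
TOY_RESPONSE = {"animal": 0.2, "dog": 0.5, "golden retriever": 0.9, "car": 0.1}
TOY_REFINEMENTS = {"animal": ["dog"], "dog": ["golden retriever"]}


def toy_generate_images(concept: str) -> List[str]:
    # A toy "image" is just the prompt text repeated three times.
    return [concept] * 3


def toy_neuron_activation(image: str) -> float:
    # Look up a fixed response instead of running a real network.
    return TOY_RESPONSE.get(image, 0.0)


def toy_propose(scores: Dict[str, float]) -> List[str]:
    # Refine the current best concept instead of querying a language model.
    return TOY_REFINEMENTS.get(best_label(scores), [])


scores = label_neuron(
    toy_propose, toy_generate_images, toy_neuron_activation,
    seed_concepts=["animal", "car"],
)
print(best_label(scores))  # prints "golden retriever"
```

In this toy run the winning label is not among the seed concepts, which mirrors the open-vocabulary behavior the abstract describes: the loop can reach concepts that a fixed starting vocabulary does not contain.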