Project Webpage (Accepted at NeurIPS'22)
This paper tackles post-hoc interpretability for audio processing networks. Our goal is to interpret decisions of a trained network in terms of high-level audio objects that are also listenable for the end-user. To this end, we propose a novel interpreter design that incorporates non-negative matrix factorization (NMF). In particular, a regularized interpreter module is trained to take hidden layer representations of the targeted network as input and produce time activations of pre-learnt NMF components as intermediate outputs. Our methodology allows us to generate intuitive audio-based interpretations that explicitly enhance parts of the input signal most relevant for a network's decision. We demonstrate our method's applicability on popular benchmarks, including a real-world multi-label classification task.
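To make the NMF-based masking idea concrete, here is a minimal sketch (not the paper's implementation — there the interpreter is a trained, regularized module operating on the classifier's hidden representations). It learns an NMF dictionary on a magnitude spectrogram with plain Euclidean multiplicative updates and soft-masks the input with the part explained by a chosen subset of "relevant" components; all function names and the choice of update rule are illustrative assumptions:

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix V (freq x time) as V ~ W @ H
    using standard Euclidean multiplicative updates."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, k)) + eps   # spectral patterns (components)
    H = rng.random((k, T)) + eps   # time activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ (W @ H) + eps)
        W *= (V @ H.T) / ((W @ H) @ H.T + eps)
    return W, H

def listenable_interpretation(V, W, H, relevant, eps=1e-9):
    """Soft-mask V, keeping only the part of the NMF reconstruction
    explained by the 'relevant' components (Wiener-like filtering)."""
    num = W[:, relevant] @ H[relevant, :]
    mask = np.clip(num / (W @ H + eps), 0.0, 1.0)
    return mask * V
```

The masked magnitude spectrogram can then be recombined with the input phase and inverted to obtain listenable audio.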
It's worth emphasizing that audio interpretability is not the same as the classical audio tasks of separation or denoising. Those tasks involve recovering the complete object of interest in the output audio. A classifier network, on the other hand, may focus only on the most salient regions. When interpreting its decision and making the interpretation listenable, we expect to uncover those regions, not necessarily the complete object of interest.
Example interpretations on ESC50-fold 1 test data, where samples are corrupted with audio from a different class, are given below.
For each sample you can listen to the input audio to the classifier and the interpretation audio for the class predicted by the classifier. Additionally, four spectrograms further support the observations from listening to the interpretations: (i) uncorrupted target-class signal (Top-Left), (ii) corrupting/mixing-class signal (Top-Right), (iii) corrupted/mixed signal, which is also the classifier's input audio (Bottom-Left), (iv) interpretation audio (Bottom-Right).
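The corruption setup used throughout these demos (mixing in audio from another class here, and white noise at 0 dB SNR in the later sections) amounts to scaling a corrupting signal to a target signal-to-noise ratio before adding it. A minimal sketch, with an assumed function name and no claim about the exact preprocessing used in the paper:

```python
import numpy as np

def mix_at_snr(target, corrupting, snr_db, eps=1e-12):
    """Mix `corrupting` (another class's audio, or white noise) into
    `target` so that the target-to-corruption power ratio equals
    snr_db decibels."""
    p_t = np.mean(target ** 2)
    p_c = np.mean(corrupting ** 2) + eps
    # scale so that 10*log10(p_t / p_scaled_corruption) == snr_db
    scale = np.sqrt(p_t / (p_c * 10.0 ** (snr_db / 10.0)))
    return target + scale * corrupting
```

At 0 dB SNR (`snr_db=0`), target and corruption carry equal power.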
Predicted class by the classifier: 'DOG' (classifier probability: 0.713)
Target class: 'DOG'                         Corrupting class: 'CRYING-BABY'
Input sample
Interpretation audio
Predicted class by the classifier: 'CRYING-BABY' (classifier probability: 0.998)
Target class: 'CRYING-BABY'                     Corrupting class: 'DOG'
Input sample
Interpretation audio
Predicted class by the classifier: 'CHURCH-BELL' (classifier probability: 0.999)
Target class: 'CHURCH-BELL'                         Corrupting class: 'ROOSTER'
Input sample
Interpretation audio
Predicted class by the classifier: 'DOG' (classifier probability: 0.999)
Target class: 'DOG'                         Corrupting class: 'CAT'
Input sample
Interpretation audio
Some example interpretations on ESC50-fold 1 test data, corrupted with white noise at 0 dB SNR, are given below.
For each sample you can listen to the input to the classifier and the interpretation audio for the class predicted by the classifier. Again, since the corrupting signal is already known, we further support the listening observations with spectrograms of (i) the input audio (corrupted with white noise) and (ii) the interpretation audio. In all cases, the noise in the interpretation audio is significantly reduced compared to the input, and the most emphasis falls on the part of the input belonging to the class of interest.
Predicted class by the classifier: 'ROOSTER' (classifier probability: 0.964)
Input sample
Interpretation audio
Predicted class by the classifier: 'CAT' (classifier probability: 0.978)
Input sample
Interpretation audio
Predicted class by the classifier: 'SHEEP' (classifier probability: 0.998)
Input sample
Interpretation audio
Example interpretations generated using an attribution-map approach (IBA) on ESC50-fold 1 test data, corrupted with white noise at 0 dB SNR, are given below.
Experimental details: We used the PyTorch version of their package and followed the standard example given on their webpage (repository link: https://github.com/BioroboticsLab/IBA). They insert a bottleneck in a conv layer of the 4th block of VGG16. Since our network architecture is also similar to VGG architectures, we applied the bottleneck at the output of our 4th conv block, which we also access via our interpreter. They use Adam optimization for 10 iterations, and we follow the same procedure through their package. The saliency map is applied to the mel-spectrogram; the corresponding STFT is then approximated and finally inverted to the time domain using the input phase, yielding a time-domain audio output.
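The mel-domain-mask-to-audio step described above (saliency on the mel-spectrogram → approximate STFT → inversion with the input phase) can be sketched roughly as follows. The function name, filterbank handling, and STFT parameters are illustrative assumptions, not the IBA package's API; the linear-frequency mask is approximated with the pseudo-inverse of the mel filterbank:

```python
import numpy as np
from scipy.signal import stft, istft

def mel_mask_to_audio(x, mel_fb, mel_mask, fs=16000, nperseg=512):
    """Lift a mel-domain saliency mask to a listenable waveform.
    mel_fb: (n_mels, n_freq) filterbank; mel_mask: (n_mels, n_frames)
    with values in [0, 1], frame count matching the STFT of x."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)
    # approximate the mask on the linear-frequency axis via pseudo-inverse
    lin_mask = np.clip(np.linalg.pinv(mel_fb) @ mel_mask, 0.0, 1.0)
    # recombine masked magnitude with the *input* phase, then invert
    Z_masked = (lin_mask * mag) * np.exp(1j * phase)
    _, y = istft(Z_masked, fs=fs, nperseg=nperseg)
    return y
```

This phase-reuse shortcut is what makes the result only an approximation of a true filtered signal, which is consistent with the noisy IBA interpretations heard below.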
For each sample you can listen to the input to the classifier and the IBA interpretation audio for the predicted class. We also show the saliency maps generated by IBA in mel-spectrogram space. From the timing of activations in the saliency maps below, it is clear that IBA visually identifies relevant regions, but as an audio filter it performs poorly, given the high amount of noise still present in the interpretations. A very interesting observation is that the brightest spot of the saliency map in time corresponds to the most emphasized signal in our interpretations (for both examples), which reinforces the insights from our system, given the significant methodological differences between how the two are obtained.
Predicted class by the classifier: 'ROOSTER'
Input sample
IBA interpretation audio
Predicted class by the classifier: 'SHEEP'
Input sample
IBA interpretation audio
Local interpretations with FLINT on ESC50-fold 1 test data, corrupted with white noise at 0 dB SNR, are given below.
Predicted class by the classifier: 'ROOSTER'
Input sample
FLINT interpretation audio (Attribute 77)
FLINT interpretation audio (Attribute 62)
Predicted class by the classifier: 'SHEEP'
Input sample
FLINT interpretation audio (Attribute 77)
FLINT interpretation audio (Attribute 7)
Predicted class by the classifier: 'CAR-HORN' (classifier probability: 0.931)                     Ground truth class: 'CRYING-BABY'
Input sample
Interpretation audio (for predicted class)
Predicted class by the classifier: 'GLASS-BREAKING' (classifier probability: 0.437)                     Ground truth class: 'CAN-OPENING'
Input sample
Interpretation audio (for predicted class)
Sample 1 contains two horns towards the end of the audio. The interpretation almost entirely filters out the noise and very clearly emphasizes the two horns.
Sample 2 contains heavy noise with a very weak presence of horns, the most prominent one being around the 5-second mark. Nevertheless, the interpretation is able to emphasize this horn and filter out most of the other signals.
Sample 1 | Sample 2
Interpretation 1 | Interpretation 2
Both samples contain isolated dog barks, which are emphasized in the interpretation while most other signals are simultaneously filtered out.
Sample 1 | Sample 2
Interpretation 1 | Interpretation 2
Sample 1 contains a weak music signal with a high intensity of other signals (background noise, chirping birds, etc.). The interpretation suppresses most of the other signals in time frames where music is not present. It also clearly emphasizes the music signal, although some background noise is captured with it. It is still a good indicator that the classifier made its decision by actually detecting the music signal.
Sample 2 contains a relatively strong music signal along with noise and human voices of considerable strength. The interpretation significantly suppresses the human voices, and to some extent the background noise, while clearly emphasizing the music signal.
Sample 1 | Sample 2
Interpretation 1 | Interpretation 2
Sample 1 | Sample 2
Predicted classes: 'ALERT-SIGNAL', 'HUMAN' | Predicted classes: 'ALERT-SIGNAL', 'MUSIC', 'HUMAN'
Interpretation 1 - 'ALERT-SIGNAL' | Interpretation 2 - 'ALERT-SIGNAL'
Interpretation 1 - 'HUMAN' | Interpretation 2 - 'MUSIC'
Interpretation 2 - 'HUMAN' (Poor Interpretation)
This sample is quite difficult for the class 'HUMAN'. The human voices are considerably weak in intensity and are dominated by other classes and noise.