Our Invertible Interpretation Network T can be applied to arbitrary existing models. Its invertibility guarantees that the translation from z to z̃ does not affect the performance of the model to be interpreted.

Abstract

Neural networks have greatly boosted performance in computer vision by learning powerful representations of input data. The drawback of end-to-end training for maximal overall performance is black-box models whose hidden representations lack interpretability: since distributed coding is optimal for latent layers to improve their robustness, attributing meaning to parts of a hidden feature vector or to individual neurons is hindered. We formulate interpretation as a translation of hidden representations onto semantic concepts that are comprehensible to the user. The mapping between both domains has to be bijective so that semantic modifications in the target domain correctly alter the original representation. The proposed invertible interpretation network can be transparently applied on top of existing architectures with no need to modify or retrain them. Consequently, we translate an original representation to an equivalent yet interpretable one and back without affecting the expressiveness and performance of the original. The invertible interpretation network disentangles the hidden representation into separate, semantically meaningful concepts. Moreover, we present an efficient approach to define semantic concepts by only sketching two images, as well as an unsupervised strategy. Experimental evaluation demonstrates the wide applicability to the interpretation of existing classification and image generation networks as well as to semantically guided image manipulation.
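For intuition, the following minimal PyTorch sketch (not the paper's actual architecture; all module and variable names are illustrative assumptions) shows how a bijective mapping T can be stacked on top of a frozen model: since T⁻¹(T(z)) = z, the model's predictions are provably unchanged.

```python
# Minimal sketch: an invertible mapping T built from one affine coupling block,
# applied to a frozen model's hidden representation z. Because T is bijective,
# T^{-1}(T(z)) == z, so the model's output is untouched.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible coupling block: splits z in half and transforms one half
    conditioned on the other. The inverse is available in closed form."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),  # predicts log-scale and shift
        )

    def forward(self, z):
        z1, z2 = z.chunk(2, dim=1)
        log_s, t = self.net(z1).chunk(2, dim=1)
        return torch.cat([z1, z2 * log_s.exp() + t], dim=1)

    def inverse(self, z_tilde):
        z1, y2 = z_tilde.chunk(2, dim=1)
        log_s, t = self.net(z1).chunk(2, dim=1)
        return torch.cat([z1, (y2 - t) * (-log_s).exp()], dim=1)

# Placeholder frozen model to be interpreted: encoder f and head g.
f = nn.Linear(784, 64).eval()   # hidden representation z = f(x)
g = nn.Linear(64, 10).eval()    # predictions computed from z
T = AffineCoupling(dim=64)      # interpretation network on top of z

x = torch.randn(8, 784)
z = f(x)
z_tilde = T(z)                  # interpretable representation
z_back = T.inverse(z_tilde)     # exact reconstruction of z
assert torch.allclose(z, z_back, atol=1e-5)
assert torch.allclose(g(z), g(z_back), atol=1e-5)  # performance unaffected
```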

Results

This section presents results and applications of our model.

Semantic image modifications and embeddings on CelebA

We interpolate within individual semantic concepts and visualize representations embedded onto semantically meaningful dimensions.

Semantic image modifications and embeddings on CMNIST

We interpolate within individual semantic concepts and visualize representations embedded onto semantically meaningful dimensions.

Transfer on AnimalFaces

We combine z̃_0 (residual) of the target image (leftmost column) with z̃_1 (animal class) of the source image (top row), resulting in a transfer of animal type from source to target.
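A possible implementation of this transfer, as a hedged sketch (encoder, decoder, T, and the factor sizes are assumed placeholders, not the released code):

```python
# Hypothetical factor transfer: encode source and target, map both to
# disentangled factors with T, keep the target's residual z̃_0, take the
# source's animal-class factor z̃_1, then invert T and decode.
import torch

def transfer(encoder, decoder, T, x_target, x_source, factor_dims=(32, 32)):
    d0, d1 = factor_dims                   # sizes of z̃_0 (residual) and z̃_1 (class)
    zt = T(encoder(x_target))              # factors of the target image
    zs = T(encoder(x_source))              # factors of the source image
    z0_target = zt[:, :d0]                 # keep the target's residual
    z1_source = zs[:, d0:d0 + d1]          # take the source's animal class
    z_tilde = torch.cat([z0_target, z1_source], dim=1)
    return decoder(T.inverse(z_tilde))     # target image with the source's class
```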

Linearization of Latent Space

The inverse of our interpretation network T maps linear walks in the interpretable domain back to nonlinear walks on the data manifold in the encoder space, which get decoded to meaningful images (bottom right). In contrast, decoded images of linear walks in the encoder space contain ghosting artifacts (bottom left).
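The comparison could be reproduced along these lines; the helper names are assumptions, not the released code:

```python
# Sketch: a straight line between two interpretable codes is mapped back
# through T^{-1}, giving a nonlinear path in encoder space, versus the
# baseline of interpolating z directly.
import torch

def walk_interpretable(decoder, T, z_a, z_b, steps=8):
    """Linear interpolation in the interpretable domain, decoded via T^{-1}."""
    za_t, zb_t = T(z_a), T(z_b)
    frames = []
    for alpha in torch.linspace(0, 1, steps):
        z_tilde = (1 - alpha) * za_t + alpha * zb_t  # straight line in z̃
        frames.append(decoder(T.inverse(z_tilde)))   # nonlinear path in z
    return frames

def walk_encoder(decoder, z_a, z_b, steps=8):
    """Baseline: linear interpolation directly in encoder space (ghosting)."""
    return [decoder((1 - a) * z_a + a * z_b) for a in torch.linspace(0, 1, steps)]
```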

Turning Autoencoders into Generative Models

Applied to latent representations z of an autoencoder, our approach enables semantic image analogies. After sampling an interpretable representation z̃, we use the inverse of T to transform it into a latent representation z = T⁻¹(z̃) of the autoencoder and obtain a sampled image by decoding z. Our approach significantly improves FID scores compared to previous autoencoder-based generative models and simple GAN models.
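A sampling step of this kind might look as follows; this is a sketch under the assumption that the interpretable factors follow a standard Gaussian prior, and the names are placeholders:

```python
# Sketch: draw an interpretable code z̃ from its prior, map it back to the
# autoencoder's latent space with T^{-1}, and decode it into an image.
import torch

@torch.no_grad()
def sample_images(decoder, T, num_samples, z_dim):
    z_tilde = torch.randn(num_samples, z_dim)  # sample interpretable codes
    z = T.inverse(z_tilde)                     # back to the autoencoder's latent space
    return decoder(z)                          # decoded samples
```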

Autoencoder Analogies

Applied to latent representations z of an autoencoder, our approach enables semantic image analogies. After transforming z into disentangled semantic factors (z̃_k) = T(z), we replace z̃_k of the target image (leftmost column) with z̃_k of the source image (top row). From left to right: k=1 (digit), k=2 (color), k=0 (residual).

Semantic Manipulations of Representations

Modifications of latent representations z of a CMNIST classifier visualized through UMAP embeddings. Colors of dots represent classes of test examples. We map latent representations z to interpretable representations z̃=T(z), where we perform a random walk in one of the factors z̃_k. Using T⁻¹, this random walk is mapped back to the latent space and shown as black crosses connected by gray lines. On the left, a random walk in the digit factor jumps between digit clusters, whereas on the right, a random walk in the color factor stays (mostly) within the digit cluster it starts from.
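One way to reproduce such a walk, sketched with assumed factor boundaries and umap-learn for the embedding:

```python
# Sketch: perturb only one factor z̃_k of a single example, map every step
# back to the latent space with T^{-1}, and embed the trajectory together
# with the test-set representations via UMAP.
import torch
import umap  # pip install umap-learn

@torch.no_grad()
def random_walk_in_factor(T, z, factor_slice, steps=50, step_size=0.2):
    z_tilde = T(z.unsqueeze(0)).squeeze(0)
    path = []
    for _ in range(steps):
        noise = torch.zeros_like(z_tilde)
        noise[factor_slice] = step_size * torch.randn(factor_slice.stop - factor_slice.start)
        z_tilde = z_tilde + noise                  # move only inside factor k
        path.append(T.inverse(z_tilde.unsqueeze(0)).squeeze(0))
    return torch.stack(path)

# Example usage (digit factor assumed to occupy dimensions 16:48):
# walk = random_walk_in_factor(T, z_test[0], slice(16, 48))
# embedding = umap.UMAP(n_components=2).fit_transform(
#     torch.cat([z_test, walk]).numpy())
```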

Network Response Analysis

Left: Output variance per class of a digit classifier on ColorMNIST, assessed via the distribution of class predictions. T disentangles z̃_0 (residual), z̃_1 (digit), and z̃_2 (color). The distribution of predicted classes is indeed not sensitive to variations in the color factor, but is quite responsive when the digit representation is altered. Right: 1D disentangled UMAP embeddings of z̃_1 and z̃_2.
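A sketch of such a response analysis; the classifier head, T, and the factor slice are assumed placeholders:

```python
# Sketch: resample one factor of z̃ many times, map back with T^{-1}, and
# inspect how the classifier's predicted class distribution reacts.
# Little change => invariance to that factor (color); large change => sensitivity (digit).
import torch

@torch.no_grad()
def response_to_factor(classifier_head, T, z, factor_slice, num_draws=256):
    z_tilde = T(z.unsqueeze(0)).repeat(num_draws, 1)
    z_tilde[:, factor_slice] = torch.randn(num_draws, factor_slice.stop - factor_slice.start)
    preds = classifier_head(T.inverse(z_tilde)).argmax(dim=1)
    return torch.bincount(preds, minlength=10).float() / num_draws  # class distribution
```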

Overview

Our presentation for the virtual CVPR conference.

Acknowledgement

This work has been supported in part by the German Federal Ministry for Economic Affairs and Energy (BMWi) within the project KI Absicherung, by the German Research Foundation (DFG) projects 371923335 and 421703927, and by a hardware donation from NVIDIA Corporation. This page is based on a design by TEMPLATED.