Our Invertible Interpretation Network T can be applied to arbitrary existing models. Its invertibility guarantees that the translation from z to z̃ does not affect the performance of the model to be interpreted.
* equal contribution

Abstract

Neural networks have greatly boosted performance in computer vision by learning powerful representations of input data. The drawback of end-to-end training for maximal overall performance is that it yields black-box models whose hidden representations lack interpretability: since distributed coding is optimal for latent layers to improve their robustness, attributing meaning to parts of a hidden feature vector or to individual neurons is hindered. We formulate interpretation as a translation of hidden representations onto semantic concepts that are comprehensible to the user. The mapping between both domains has to be bijective so that semantic modifications in the target domain correctly alter the original representation. The proposed invertible interpretation network can be transparently applied on top of existing architectures with no need to modify or retrain them. Consequently, we translate an original representation to an equivalent yet interpretable one and backwards without affecting the expressiveness and performance of the original. The invertible interpretation network disentangles the hidden representation into separate, semantically meaningful concepts. Moreover, we present an efficient approach to define semantic concepts by only sketching two images and also an unsupervised strategy. Experimental evaluation demonstrates the wide applicability to interpretation of existing classification and image generation networks as well as to semantically guided image manipulation.
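The bijectivity requirement can be illustrated with a single affine coupling block, a common building block of invertible networks. This is a minimal sketch, not the architecture used in this work: the function name make_coupling, the fixed random weights, and the toy dimensions are all assumptions for illustration, and a real implementation would use learned sub-networks in a deep learning framework.

```python
import math
import random


def make_coupling(dim_half, seed=0):
    """Return (forward, inverse) for one toy affine coupling block.

    Scale and shift are fixed random linear functions of the first half of
    the input; in a trained INN they would be learned networks (assumption).
    """
    rng = random.Random(seed)
    w_s = [[rng.uniform(-0.5, 0.5) for _ in range(dim_half)] for _ in range(dim_half)]
    w_t = [[rng.uniform(-0.5, 0.5) for _ in range(dim_half)] for _ in range(dim_half)]

    def linear(w, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

    def forward(z):
        a, b = z[:dim_half], z[dim_half:]
        s = [math.tanh(v) for v in linear(w_s, a)]  # bounded log-scale
        t = linear(w_t, a)
        # The first half passes through unchanged; the second half is
        # scaled and shifted conditioned on the first half.
        return a + [bi * math.exp(si) + ti for bi, si, ti in zip(b, s, t)]

    def inverse(z_tilde):
        a, b = z_tilde[:dim_half], z_tilde[dim_half:]
        s = [math.tanh(v) for v in linear(w_s, a)]
        t = linear(w_t, a)
        # Undo the shift, then the scale, to recover the input exactly.
        return a + [(bi - ti) * math.exp(-si) for bi, si, ti in zip(b, s, t)]

    return forward, inverse


T, T_inv = make_coupling(dim_half=2)
z = [0.3, -1.2, 0.7, 2.0]
z_tilde = T(z)
z_back = T_inv(z_tilde)
max_err = max(abs(u - v) for u, v in zip(z, z_back))
```

Because the inverse recovers z up to floating-point error, anything computed from z (for example, the downstream layers of the interpreted model) is unaffected by the round trip through z̃.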

New Works on Understanding and Disentangling Latent Representations with INNs

Given the ever-increasing computational costs of modern machine learning models, we need to find new ways to reuse such expert models and thus tap into the resources that have been invested in their creation. Recent work suggests that the power of these massive models is captured by the representations they learn. Therefore, we seek a model that can relate different existing representations to one another and propose to solve this task with a conditionally invertible network. This network demonstrates its capability by (i) providing generic transfer between diverse domains, (ii) enabling controlled content synthesis by allowing modification in other domains, and (iii) facilitating diagnosis of existing representations by translating them into interpretable domains such as images. Our domain transfer network can translate between fixed representations without having to learn or finetune them. This allows users to utilize various existing domain-specific expert models from the literature that had been trained with extensive computational resources. Experiments on diverse conditional image synthesis tasks, competitive image modification results, and experiments on image-to-image and text-to-image generation demonstrate the generic applicability of our approach. For example, we translate between BERT and BigGAN, state-of-the-art text and image models, to provide text-to-image generation, which neither expert can perform on its own.

To tackle increasingly complex tasks, it has become an essential ability of neural networks to learn abstract representations. These task-specific representations and, particularly, the invariances they capture turn neural networks into black box models that lack interpretability. To open such a black box, it is, therefore, crucial to uncover the different semantic concepts a model has learned as well as those that it has learned to be invariant to. We present an approach based on INNs that (i) recovers the task-specific, learned invariances by disentangling the remaining factor of variation in the data and that (ii) invertibly transforms these recovered invariances combined with the model representation into an equally expressive one with accessible semantic concepts. As a consequence, neural network representations become understandable by providing the means to (i) expose their semantic meaning, (ii) semantically modify a representation, and (iii) visualize individual learned semantic concepts and invariances. Our invertible approach significantly extends the abilities to understand black box models by enabling post-hoc interpretations of state-of-the-art networks without compromising their performance.

Results and Applications of Our Invertible Interpretation Network

Semantic image modifications and embeddings on CelebA

We interpolate within individual semantic concepts and visualize representations embedded onto semantically-meaningful dimensions.

Semantic image modifications and embeddings on CMNIST

We interpolate within individual semantic concepts and visualize representations embedded onto semantically-meaningful dimensions.

Transfer on AnimalFaces

We combine z̃0 (residual) of the target image (leftmost column) with z̃1 (animal class) of the source image (top row), resulting in a transfer of animal type from source to target.

Linearization of Latent Space

The inverse of our interpretation network T maps linear walks in the interpretable domain back to nonlinear walks on the data manifold in the encoder space, which get decoded to meaningful images (bottom right). In contrast, decoded images of linear walks in the encoder space contain ghosting artifacts (bottom left).
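The difference between the two kinds of walks can be sketched with a toy invertible map. The code below is illustrative only: the elementwise asinh/sinh pair stands in for a trained T and its inverse, and the helper names lerp and walk_in_interpretable_space are hypothetical.

```python
import math


def lerp(a, b, alpha):
    """Linear interpolation between two vectors."""
    return [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]


def walk_in_interpretable_space(T, T_inv, z_start, z_end, steps):
    """Interpolate linearly in the interpretable space, then map back with T_inv."""
    zt_start, zt_end = T(z_start), T(z_end)
    return [T_inv(lerp(zt_start, zt_end, i / (steps - 1))) for i in range(steps)]


# Toy nonlinear bijection standing in for a trained network (assumption).
def T(z):
    return [math.asinh(v) for v in z]


def T_inv(z_tilde):
    return [math.sinh(v) for v in z_tilde]


z_start, z_end = [0.0, 1.0], [3.0, -2.0]
path = walk_in_interpretable_space(T, T_inv, z_start, z_end, steps=5)
naive_mid = lerp(z_start, z_end, 0.5)  # midpoint of a linear walk in encoder space
```

Both walks share their endpoints, but the intermediate points differ: the path obtained through the inverse map is curved in the encoder space, which is the behavior described above.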

Turning Autoencoders into Generative Models

Applied to latent representations z of an autoencoder, our approach enables semantic image analogies. After sampling an interpretable representation z̃, we use the inverse of T to transform it to a latent representation z=T⁻¹(z̃) of the autoencoder and obtain a sampled image after decoding z. Our approach significantly improves FID scores compared to previous autoencoder-based generative models and simple GAN models.
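The sampling procedure can be sketched as follows. This assumes that z̃ is drawn from a standard normal distribution, which the text above does not state explicitly; the per-dimension affine map is a toy stand-in for a trained T, and the decoding step of the autoencoder is omitted.

```python
import random

# Toy invertible map standing in for a trained T (assumption): a
# per-dimension affine transform with hypothetical parameters.
MU = [0.5, -1.0, 2.0]
SIGMA = [2.0, 0.5, 1.5]


def T(z):
    return [(v - m) / s for v, m, s in zip(z, MU, SIGMA)]


def T_inv(z_tilde):
    return [m + s * v for v, m, s in zip(z_tilde, MU, SIGMA)]


def sample_latent(T_inv, dim, rng):
    """Draw z_tilde from N(0, I) (assumed prior) and map it to a latent code z."""
    z_tilde = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return z_tilde, T_inv(z_tilde)


rng = random.Random(0)
z_tilde, z = sample_latent(T_inv, dim=3, rng=rng)
# In the full pipeline, z would now be passed to the autoencoder's decoder.
```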

Autoencoder Analogies

Applied to latent representations z of an autoencoder, our approach enables semantic image analogies. After transforming z to disentangled semantic factors (z̃k)=T(z), we replace z̃k of the target image (leftmost column) with z̃k of the source image (top row). From left to right: k=1 (digit), k=2 (color), k=0 (residual).
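The factor replacement described above can be sketched as a split, swap, and merge in the interpretable space. The factor sizes in dims, the helper names, and the elementwise toy map used for T are assumptions for illustration; a trained T would mix all dimensions of z.

```python
def split_factors(z_tilde, dims):
    """Split a flat interpretable vector into consecutive factors of the given sizes."""
    factors, start = [], 0
    for d in dims:
        factors.append(z_tilde[start:start + d])
        start += d
    return factors


def swap_factor(T, T_inv, z_target, z_source, dims, k):
    """Replace factor k of the target with factor k of the source, then map back."""
    f_target = split_factors(T(z_target), dims)
    f_source = split_factors(T(z_source), dims)
    f_target[k] = f_source[k]
    merged = [v for factor in f_target for v in factor]
    return T_inv(merged)


# Toy invertible map standing in for a trained T (assumption).
def T(z):
    return [2.0 * v + 1.0 for v in z]


def T_inv(z_tilde):
    return [(v - 1.0) / 2.0 for v in z_tilde]


dims = [2, 1, 1]  # hypothetical sizes: k=0 residual, k=1 digit, k=2 color
z_target = [1.0, 2.0, 3.0, 4.0]
z_source = [5.0, 6.0, 7.0, 8.0]
z_mixed = swap_factor(T, T_inv, z_target, z_source, dims, k=1)
```

With this toy map, only the coordinate belonging to factor k=1 is taken from the source; decoding the resulting z_mixed would give the analogy image.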

Semantic Manipulations of Representations

Modifications of latent representations z of a CMNIST classifier visualized through UMAP embeddings. Colors of dots represent classes of test examples. We map latent representations z to interpretable representations z̃=T(z), where we perform a random walk in one of the factors k. Using T⁻¹, this random walk is mapped back to the latent space and shown as black crosses connected by gray lines. On the left, a random walk in the digit factor jumps between digit clusters, whereas on the right, a random walk in the color factor stays (mostly) within the digit cluster it starts from.
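A random walk restricted to one factor can be sketched as below. The Gaussian step distribution, the step size, and the two-dimensional linear toy map are assumptions; they only illustrate that perturbing a single factor in z̃ moves several coordinates of z while leaving the other factors fixed.

```python
import random


def random_walk_in_factor(T, T_inv, z_start, factor_slice, steps, step_size, rng):
    """Random walk in one factor of z_tilde, mapped back to the latent space."""
    z_tilde = list(T(z_start))
    path = [list(z_start)]
    for _ in range(steps):
        # Perturb only the coordinates that belong to the chosen factor.
        for i in range(*factor_slice.indices(len(z_tilde))):
            z_tilde[i] += rng.gauss(0.0, step_size)
        path.append(T_inv(z_tilde))
    return path


# Toy invertible linear map standing in for a trained T (assumption).
def T(z):
    return [z[0] + z[1], z[0] - z[1]]


def T_inv(z_tilde):
    return [(z_tilde[0] + z_tilde[1]) / 2.0, (z_tilde[0] - z_tilde[1]) / 2.0]


rng = random.Random(0)
z_start = [1.0, 3.0]
path = random_walk_in_factor(T, T_inv, z_start, slice(0, 1), steps=10,
                             step_size=0.5, rng=rng)
```

Here the walk changes both coordinates of z, yet the second interpretable coordinate (z[0] - z[1] in this toy map) stays constant along the whole path.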

Network Response Analysis

Left: Output variance per class of a digit classifier on ColorMNIST, assessed via distribution of class predictions. T disentangles z̃0 (residual), z̃1 (digit) and z̃2 (color). The distribution of predicted classes is indeed not sensitive to variations in the factor color, but turns out to be quite responsive when altering the digit representation. Right: 1d disentangled UMAP embeddings of z̃1 and z̃2.

Overview

Our presentation for the virtual CVPR conference.

Acknowledgement

This work has been supported in part by the German federal ministry BMWi within the project KI Absicherung, the German Research Foundation (DFG) projects 371923335 and 421703927, and a hardware donation from NVIDIA corporation.