We provide post-hoc interpretation for a given neural network f. For a deep representation z, a conditional INN t recovers the model's invariances v from a representation z̄ which contains entangled information about both z and v. The INN e then translates z̄ into a factorized representation with accessible semantic concepts. This approach allows for various applications, including visualizations of network representations of natural and altered inputs, semantic network analysis and semantic image modifications.


* equal contribution


git clone https://github.com/CompVis/invariances.git
cd invariances
conda env create -f environment.yaml
conda activate invariances
streamlit run invariances/demo.py


To tackle increasingly complex tasks, it has become an essential ability of neural networks to learn abstract representations. These task-specific representations and, particularly, the invariances they capture turn neural networks into black box models that lack interpretability. To open such a black box, it is, therefore, crucial to uncover the different semantic concepts a model has learned as well as those that it has learned to be invariant to. We present an approach based on INNs that (i) recovers the task-specific, learned invariances by disentangling the remaining factor of variation in the data and that (ii) invertibly transforms these recovered invariances combined with the model representation into an equally expressive one with accessible semantic concepts. As a consequence, neural network representations become understandable by providing the means to (i) expose their semantic meaning, (ii) semantically modify a representation, and (iii) visualize individual learned semantic concepts and invariances. Our invertible approach significantly extends the abilities to understand black box models by enabling post-hoc interpretations of state-of-the-art networks without compromising their performance.

Related Work on Understanding and Disentangling Latent Representations with INNs

Given the ever-increasing computational costs of modern machine learning models, we need to find new ways to reuse such expert models and thus tap into the resources that have been invested in their creation. Recent work suggests that the power of these massive models is captured by the representations they learn. Therefore, we seek a model that can relate different existing representations to one another and propose to solve this task with a conditionally invertible network. This network demonstrates its capability by (i) providing generic transfer between diverse domains, (ii) enabling controlled content synthesis by allowing modification in other domains, and (iii) facilitating diagnosis of existing representations by translating them into interpretable domains such as images. Our domain transfer network can translate between fixed representations without having to learn or finetune them. This allows users to utilize various existing domain-specific expert models from the literature that had been trained with extensive computational resources. Experiments on diverse conditional image synthesis tasks, competitive image modification results and experiments on image-to-image and text-to-image generation demonstrate the generic applicability of our approach. For example, we translate between BERT and BigGAN, state-of-the-art text and image models, to provide text-to-image generation, which neither expert can perform on its own.

Neural networks have greatly boosted performance in computer vision by learning powerful representations of input data. The drawback of end-to-end training for maximal overall performance is black-box models whose hidden representations lack interpretability: since distributed coding is optimal for latent layers to improve their robustness, attributing meaning to parts of a hidden feature vector or to individual neurons is hindered. We formulate interpretation as a translation of hidden representations onto semantic concepts that are comprehensible to the user. The mapping between both domains has to be bijective so that semantic modifications in the target domain correctly alter the original representation. The proposed invertible interpretation network can be transparently applied on top of existing architectures with no need to modify or retrain them. Consequently, we translate an original representation to an equivalent yet interpretable one and backwards without affecting the expressiveness and performance of the original. The invertible interpretation network disentangles the hidden representation into separate, semantically meaningful concepts. Moreover, we present an efficient approach to define semantic concepts by only sketching two images and also an unsupervised strategy. Experimental evaluation demonstrates the wide applicability to interpretation of existing classification and image generation networks as well as to semantically guided image manipulation.


and applications of our model.

Short Presentation

Our short video presentation for ECCV 2020.

Long Presentation

Our long video presentation for ECCV 2020.
Proposed architecture. We provide post-hoc interpretation for a given deep network f = Ψ ◦ Φ. For a deep representation z = Φ(x) a conditional INN t recovers Φ’s invariances v from a representation z̄ which contains entangled information about both z and v. The INN e then translates the representation z̄ into a factorized representation with accessible semantic concepts. This approach allows for various applications, including visualizations of network representations of natural and altered inputs, semantic network analysis and semantic image modifications.
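The invertibility of the conditional INN t can be illustrated with a single affine coupling step. The following is a minimal sketch and not the authors' implementation: the function names, the toy conditioner with fixed weights, and the example vectors are all assumptions made for illustration. A real INN would stack several such blocks with learned subnetworks and permutations in between.

```python
import math

def _scale_shift(cond, half):
    """Toy conditioner (hypothetical, fixed weights): derives a log-scale and a
    shift from the conditioning vector z and the untouched half of the input."""
    h = sum(cond) + sum(half)
    s = math.tanh(0.5 * h)   # bounded log-scale keeps exp(s) well-behaved
    b = 0.1 * h              # shift
    return s, b

def coupling_forward(z_bar, z):
    """One conditional affine coupling step: z_bar -> v, given z."""
    k = len(z_bar) // 2
    a, c = z_bar[:k], z_bar[k:]
    s, b = _scale_shift(z, a)
    c_out = [ci * math.exp(s) + b for ci in c]   # only the second half changes
    return a + c_out

def coupling_inverse(v, z):
    """Exact inverse: v -> z_bar, given the same z."""
    k = len(v) // 2
    a, c_out = v[:k], v[k:]
    s, b = _scale_shift(z, a)                    # a is unchanged, so s and b match
    c = [(ci - b) * math.exp(-s) for ci in c_out]
    return a + c

z = [0.3, -1.2]                 # stands in for z = Phi(x)
z_bar = [0.5, 1.5, -0.7, 2.0]   # stands in for the entangled representation
v = coupling_forward(z_bar, z)
z_bar_rec = coupling_inverse(v, z)
max_err = max(abs(p - q) for p, q in zip(z_bar, z_bar_rec))
print(v, max_err)
```

Because the first half passes through unchanged, the inverse can recompute the same scale and shift, which is what makes the step exactly invertible for any conditioner.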
Comparison to existing network inversion methods for AlexNet. In contrast to the methods of [13] (D&B) and [39] (M&V), our invertible method explicitly samples the invariances of Φ w.r.t. the data, which circumvents a common cause for artifacts and produces natural images independent of the depth of the layer which is reconstructed. The bottom row shows FID scores for layer visualizations of AlexNet, obtained with our method and [13] (D&B).
Visualizing FGSM adversarial attacks on ResNet-101. To the human eye, the original image and its attacked version are almost indistinguishable. However, the input image is correctly classified as “siamese cat”, while the attacked version is classified as “mountain lion”. Our approach visualizes how the attack spreads throughout the network. Reconstructions of representations of attacked images demonstrate that the attack targets the semantic content of deep layers. The variance of z̄ explained by v, combined with these visualizations, shows how increasing invariances cause vulnerability to adversarial attacks.
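For readers unfamiliar with FGSM, the attack perturbs the input by a small step ε in the direction of the sign of the loss gradient. The sketch below applies it to a toy logistic classifier rather than ResNet-101; the weights, the input, and ε are made-up values chosen only so that the prediction flips.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def predict(w, b, x):
    """Probability of class 1 under a logistic model (toy stand-in for a network)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, y, eps):
    """Fast gradient sign method: x_adv = x + eps * sign(dLoss/dx).
    For binary cross-entropy on a logistic model, dLoss/dx = (p - y) * w."""
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + eps * si for xi, si in zip(x, sign)]

# Hypothetical weights and input, chosen for illustration only.
w, b = [2.0, -3.0, 1.0], 0.1
x, y = [0.2, -0.1, 0.3], 1

p_clean = predict(w, b, x)
x_adv = fgsm(w, b, x, y, eps=0.25)
p_adv = predict(w, b, x_adv)
print(p_clean, p_adv)
```

Each input coordinate moves by at most ε, yet the predicted probability crosses the decision threshold, which mirrors the barely visible but label-changing perturbations described in the caption.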
More visualizations of adversarial attacks as in the previous figure. Predictions of original vs. attacked version of the input image for all depicted examples: top left: ‘Lycaon pictus’ vs. ‘Cuon alpinus’; top right: ‘Snow Leopard’ vs. ‘Leopard’; bottom left: ‘West Highland white Terrier’ vs. ‘Yorkshire Terrier’; bottom right: ‘Blenheim Spaniel’ vs. ‘Japanese Spaniel’.
Revealing texture bias in ImageNet classifiers. We compare visualizations of z from the penultimate layer of ResNet-50 trained on standard ImageNet (left) and on a stylized version of ImageNet (right). On natural images (rows 1-3), both models recognize the input. Removing textures through stylization (rows 4-6) makes images unrecognizable to the standard model; however, it still recognizes objects from textured patches (rows 7-9). Rows 10-12 show that a model without texture bias can be used for sketch-to-image synthesis.
Texture bias: Additional examples for representation-conditional samples of two variants of ResNet-50, one trained on standard ImageNet, the other on a stylized version of ImageNet. See also the previous figure.
Analyzing how the degree to which different semantic concepts are captured by a network representation changes as training progresses. For SqueezeNet on ColorMNIST, we measure how much the data varies in different semantic concepts e_i and how much of this variability is captured by z at different training iterations. Early on, z is sensitive to foreground and background color; later on, it learns to focus on the digit attribute. The ability to encode this semantic concept is proportional to the classification accuracy achieved by z. At training iterations 4k and 36k, we apply our method to visualize model representations and thereby illustrate how their content changes during training.
Additional z-conditional samples after 4k and 36k training steps, as in the previous figure. Each row is conditioned on z = Φ(x) and each column is conditioned on a v ∼ N(v|0, 1). At 4k (resp. 36k) iterations, z explains 2.57% (resp. 36.08%) of the variance in the digit factor. Thus, the digit class of samples obtained at 4k iterations changes with the sampled invariances across columns, while it stays the same at 36k iterations. Conversely, at 4k (resp. 36k) iterations, z explains 38.44% (resp. 2.76%) of the variance in the background color factor. Thus, the background color of samples obtained at 4k iterations changes with the sampled representation z = Φ(x) across rows, while it stays the same at 36k iterations.
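The percentages above can be read as a fraction of explained variance. This page does not state the exact estimator, so the sketch below assumes the generic decomposition from the law of total variance, Var(e) = Var(E[e|z]) + E[Var(e|z)], and uses made-up numbers: each inner list holds values of one semantic factor for samples that share z but use different sampled v.

```python
from statistics import fmean, pvariance

def fraction_explained(groups):
    """Fraction of the variance of a semantic factor explained by z.

    groups[i] contains factor values for samples with the same z and different
    sampled invariances v. Assumes equally sized groups, so that the mean of
    the within-group variances estimates E[Var(e|z)].
    """
    all_vals = [e for g in groups for e in g]
    total = pvariance(all_vals)
    within = fmean([pvariance(g) for g in groups])
    return 1.0 - within / total

# Illustrative data only: early in training the factor varies freely across v,
# later it is almost fully determined by z.
early = [[0.0, 1.0, 2.0], [1.0, 2.0, 0.0], [2.0, 0.0, 1.0]]
late = [[0.0, 0.1, -0.1], [1.0, 1.1, 0.9], [2.0, 2.1, 1.9]]

print(fraction_explained(early), fraction_explained(late))
```

A value near zero means the factor changes across columns (different v) for a fixed row, while a value near one means it is fixed by the row's z.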
Zooming into feature visualization samples for an example of a snow leopard. σ denotes the softmax function.
Zooming into feature visualization samples for an example of a wolf. σ denotes the softmax function.
Left: Visualizing FaceNet representations and their invariances. Sampling multiple reconstructions x̄ = D(t⁻¹(v|z)) shows the degree of invariance learned by different layers. The invariance w.r.t. pose increases for deeper layers, as expected for face identification. Surprisingly, FaceNet uses glasses as an identity feature throughout all its layers, as evident from the spatial mean and variance plots, where the glasses are still visible. This reveals a bias and weakness of the model. Right: Spatially averaged variances over multiple x for different layers.
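The mean and variance plots mentioned in this caption are per-pixel statistics over several sampled reconstructions. A minimal sketch of that computation follows; the tiny 2x2 "images" are invented values, and the function name is hypothetical.

```python
from statistics import fmean, pvariance

def pixel_stats(samples):
    """Per-pixel mean and variance across equally sized 2D images.

    samples: list of images, each a list of rows of floats.
    Returns (mean_image, variance_image).
    """
    h, w = len(samples[0]), len(samples[0][0])
    mean = [[fmean([s[r][c] for s in samples]) for c in range(w)] for r in range(h)]
    var = [[pvariance([s[r][c] for s in samples]) for c in range(w)] for r in range(h)]
    return mean, var

# Three toy reconstructions: pixel (0, 0) is identical in every sample,
# pixel (1, 1) varies strongly with the sampled invariance.
samples = [
    [[1.0, 0.0], [0.5, 0.0]],
    [[1.0, 0.2], [0.5, 1.0]],
    [[1.0, 0.1], [0.5, 2.0]],
]
mean, var = pixel_stats(samples)
avg_var = fmean([v for row in var for v in row])   # spatially averaged variance
print(mean, var, avg_var)
```

Pixels with low variance correspond to content that the representation pins down, while high-variance pixels correspond to content the layer is invariant to.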
Shifting domains: Human faces to animal faces evaluated with a fixed FaceNet. The evaluation procedure is similar to the method described in Fig. 3. Although never trained on anything other than human faces, FaceNet is able to capture the “identity” of the input to a certain degree. Information about appearance is approximately preserved until the last layer, i.e. the final identity embedding.
Semantic Modifications on CelebA. In each column, we replace a semantic factor e_i(E(x)) by e_i*, which is obtained from another, randomly chosen image that differs in the corresponding attribute. Subsequently, we decode a semantically modified image using the invertibility of e to obtain x̄* = D(e⁻¹((e_i*))). The results of StarGAN [8] are obtained by negating the binary value for the column’s attribute. FID scores are given in the next figure.
Additional examples as in the previous figure. Moreover, the last row contains FID scores of semantically modified images obtained by our approach and [8] (StarGAN), which shows that our approach consistently outperforms [8].
Semantic Modifications on CelebA. For each column, after inferring the semantic factors (e_i)_i = e(E(x)) of the input x, we replace one factor e_i by that from another randomly chosen image that differs in this concept. The inverse of e translates this semantic change back into a modified z̄, which is decoded to a semantically modified image. Distances between FaceNet embeddings before and after modification demonstrate its sensitivity to differences in gender and glasses.
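The factor-swapping procedure in this caption can be sketched with a toy invertible map standing in for the INN e. Everything here is an assumption for illustration: the map is a fixed rotation instead of a learned network, there are only two factors, and the encoder E and decoder D are omitted, so the code operates directly on z̄.

```python
import math

THETA = 0.7  # fixed rotation angle; stands in for learned INN parameters

def e_forward(z_bar):
    """Toy invertible map: rotates coordinate pairs (0, 2) and (1, 3) of a
    4-dimensional z_bar, then splits the result into two 2-dimensional factors."""
    c, s = math.cos(THETA), math.sin(THETA)
    a0, a2 = c * z_bar[0] - s * z_bar[2], s * z_bar[0] + c * z_bar[2]
    a1, a3 = c * z_bar[1] - s * z_bar[3], s * z_bar[1] + c * z_bar[3]
    return [[a0, a1], [a2, a3]]

def e_inverse(factors):
    """Exact inverse of e_forward (applies the transposed rotation)."""
    c, s = math.cos(THETA), math.sin(THETA)
    (a0, a1), (a2, a3) = factors
    return [c * a0 + s * a2, c * a1 + s * a3, -s * a0 + c * a2, -s * a1 + c * a3]

def swap_factor(z_bar, z_bar_other, i):
    """Replace factor i of z_bar with factor i of z_bar_other, then map back."""
    factors = e_forward(z_bar)
    factors[i] = e_forward(z_bar_other)[i]
    return e_inverse(factors)

z_src = [0.5, -1.0, 2.0, 0.3]    # representation of the input image
z_ref = [-0.4, 0.8, 1.1, -2.0]   # representation of the reference image
z_mod = swap_factor(z_src, z_ref, 1)
f_src, f_ref, f_mod = e_forward(z_src), e_forward(z_ref), e_forward(z_mod)
print(z_mod)
```

Since the map is bijective, the modified z̄ carries exactly the swapped factor and leaves the other factor untouched, which is the property the caption relies on when decoding the modified image.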

Hyperparameters of INNs for each experiment. n_flow denotes the number of invertible blocks within the model, see Fig. 8. h_w and h_d refer to the width and depth of the fully connected subnetworks s_i and t_i described in the supplementary.
Additional examples for layerwise reconstructions from model representations z = Φ(x) with our method and [13] (D&B). We show 10 samples per layer representation obtained with our generative approach. Here, σ denotes the softmax function, i.e. reconstructions are obtained from class probabilities provided by the model.


This work has been supported in part by the German Research Foundation (DFG) projects 371923335, 421703927, and EXC 2181/1 - 390900948 and the German federal ministry BMWi within the project KI Absicherung.