We learn a conditional invertible neural network (cINN) to translate between representations of different domain experts. This results in a fused model, which can be controlled through the first expert to create novel and diverse content in the domain of the second expert.

Abstract

Artificial Intelligence for Content Creation has the potential to reduce the amount of manual content creation work significantly. While automation of laborious work is welcome, it is only useful if it allows users to control aspects of the creative process when desired. Furthermore, widespread adoption of semi-automatic content creation depends on low barriers regarding the expertise, computational budget and time required to obtain results and experiment with new techniques. With state-of-the-art approaches relying on task-specific models, multi-GPU setups and weeks of training time, we must find ways to reuse and recombine them to meet these requirements. Instead of designing and training methods for controllable content creation from scratch, we thus present a method to repurpose powerful, existing models for new tasks, even though they have never been designed for them. We formulate this problem as a translation between expert models, which includes common content creation scenarios, such as text-to-image and image-to-image translation, as a special case. As this translation is ambiguous, we learn a generative model of hidden representations of one expert conditioned on hidden representations of the other expert. Working on the level of hidden representations makes optimal use of the computational effort that went into the training of the expert model to produce these efficient, low-dimensional representations. Experiments demonstrate that our approach can translate from BERT, a state-of-the-art expert for text, to BigGAN, a state-of-the-art expert for images, to enable text-to-image generation, which neither of the experts can perform on its own. Additional experiments show the wide applicability of our approach across different conditional image synthesis tasks and improvements over existing methods for image modifications.
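To make the recipe concrete: a conditional INN is trained by maximum likelihood to map the hidden representation z of one frozen expert to a Gaussian residual v, conditioned on the hidden representation c of another frozen expert; inverting it turns samples of v back into z, which the second expert can decode. The snippet below is only a minimal sketch of this idea, not the authors' implementation; the single coupling block, the linear stand-ins for the experts, and all dimensions are hypothetical placeholders.

```python
# Minimal sketch of network-to-network translation with a conditional INN.
# All module names, shapes and the single coupling block are illustrative.
import torch
import torch.nn as nn

class CouplingBlock(nn.Module):
    """One conditional affine coupling layer: invertible in z, conditioned on c."""
    def __init__(self, dim, cond_dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, z, c):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        s, t = self.net(torch.cat([z1, c], dim=1)).chunk(2, dim=1)
        s = torch.tanh(s)                      # keep scales bounded for stability
        v2 = z2 * torch.exp(s) + t             # affine transform of the second half
        logdet = s.sum(dim=1)                  # log|det J| of this layer
        return torch.cat([z1, v2], dim=1), logdet

    def inverse(self, v, c):
        v1, v2 = v[:, :self.half], v[:, self.half:]
        s, t = self.net(torch.cat([v1, c], dim=1)).chunk(2, dim=1)
        s = torch.tanh(s)
        z2 = (v2 - t) * torch.exp(-s)
        return torch.cat([v1, z2], dim=1)

# Frozen "experts" (placeholders): one produces the condition c, the other the code z.
cond_dim, z_dim = 32, 16
text_expert  = nn.Linear(100, cond_dim).eval()   # stand-in for e.g. BERT features
image_expert = nn.Linear(784, z_dim).eval()      # stand-in for e.g. a GAN/AE encoder
for p in list(text_expert.parameters()) + list(image_expert.parameters()):
    p.requires_grad_(False)

cinn = CouplingBlock(z_dim, cond_dim)            # in practice: a stack of such blocks
opt = torch.optim.Adam(cinn.parameters(), lr=1e-4)

# One maximum-likelihood training step on a random "paired" batch.
x_text, x_image = torch.randn(8, 100), torch.randn(8, 784)
c, z = text_expert(x_text), image_expert(x_image)
v, logdet = cinn(z, c)
nll = 0.5 * (v ** 2).sum(dim=1) - logdet         # standard-normal prior on v
loss = nll.mean()
loss.backward(); opt.step(); opt.zero_grad()

# Generation: sample v ~ N(0, I), invert the cINN given a condition c,
# then hand the resulting z to the second expert's (frozen) generator.
with torch.no_grad():
    z_new = cinn.inverse(torch.randn(8, z_dim), c)
```

Only the coupling network is optimized; gradients never touch the frozen experts, which is what keeps the approach cheap.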

Our architecture builds upon our prior work, "A Disentangling Invertible Interpretation Network for Explaining Latent Representations".

Results

Results and applications of our model.

Overview

Our oral presentation for the AI for Content Creation Workshop.

BERT-to-BigGAN Transfer

Our approach enables translation between fixed, off-the-shelf expert models such as BERT and BigGAN without having to modify or fine-tune them.

Exemplar-Guided Image Synthesis

By combining the segmentation of an image x with the invariances obtained from an exemplar image y, our approach fuses a segmentation network and an autoencoder to enable exemplar-guided image synthesis.
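The data flow can be written down in a few lines. The sketch below is purely illustrative: `segment`, `encode`, `decode` and `cinn` are assumed placeholders for a frozen segmentation expert, a frozen autoencoder, and a trained conditional INN whose forward pass returns the residual v (as in the sketch above) and whose `inverse` maps it back.

```python
# Hypothetical inference sketch for exemplar-guided synthesis; all callables
# are assumed pretrained/frozen placeholders, not the actual implementation.
def exemplar_guided(x, y, segment, encode, decode, cinn):
    c_x = segment(x)              # layout / semantics taken from image x
    c_y = segment(y)              # condition under which the exemplar's residual is computed
    z_y = encode(y)               # autoencoder code of the exemplar y
    v, _ = cinn(z_y, c_y)         # invariances of y w.r.t. the segmentation expert
    z = cinn.inverse(v, c_x)      # re-attach those invariances to x's segmentation
    return decode(z)              # image with x's layout and y's appearance
```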

Computational Cost and Energy Consumption

We compare the computational costs of our conditional INN to those of BERT, BigGAN and FUNIT. Once strong domain experts are available, they can be repurposed by our approach in a time-, energy- and cost-effective way. Since the training cost of our conditional INN is two orders of magnitude smaller than that of the domain experts, the experts' cost is amortized over all the new tasks that can be solved by fusing them with our approach. The energy consumption of a Titan X is based on NVIDIA's recommended system power (0.6 kW), and that of eight V100s on the power (3.5 kW) of an NVIDIA DGX-1 system. Costs are based on the average price of 0.216 EUR per kWh in the EU, and CO2 emissions on the average emissions of 0.296 kg CO2 per kWh in the EU.
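For reference, the conversion from training time and system power to energy, cost and CO2 is simple arithmetic. The helper below uses exactly the constants quoted above; the two example durations are hypothetical, not measured numbers from the paper.

```python
# Back-of-the-envelope footprint of a training run: hours x power -> kWh,
# then EUR and kg CO2 using the EU averages quoted in the text.
def training_footprint(hours, power_kw, eur_per_kwh=0.216, kg_co2_per_kwh=0.296):
    energy_kwh = hours * power_kw
    return {
        "energy_kWh": energy_kwh,
        "cost_EUR": energy_kwh * eur_per_kwh,
        "co2_kg": energy_kwh * kg_co2_per_kwh,
    }

# Illustrative only: a short run on a Titan X system (0.6 kW) vs. a long run on a DGX-1 (3.5 kW).
print(training_footprint(hours=2 * 24, power_kw=0.6))
print(training_footprint(hours=14 * 24, power_kw=3.5))
```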

Insights into learned invariances

Translating different layers of an expert model to the representation of an autoencoder reveals the expert's learned invariances and thus provides diagnostic insights. Here, the expert is a segmentation model, and the visualizations demonstrate how its internal representations become increasingly invariant to style and appearance.

Sketch-to-Image Transfer

When trained on a stylized version of ImageNet to remove its texture bias, a ResNet-50 classifier can be fused with a BigGAN generator to produce coherent sketch-to-image synthesis results (left), whereas a vanilla ResNet-50 fails to produce meaningful representations for sketches (right).

Unsupervised disentangling of shape and appearance

When trained on synthetically deformed images, our conditional INN learns to extract a disentangled shape representation v from y, which can be recombined with arbitrary appearances obtained from x.

Layout-guided image synthesis

The ability to sample the invariances v enables the generation of diverse, realistic images that are consistent with a given label map.
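A minimal sketch of how such diverse samples could be drawn is shown below; it assumes a trained conditional INN with an `inverse` method and a frozen decoder as in the earlier sketches, and all names and shapes are illustrative placeholders.

```python
# Hypothetical sketch: draw several samples for a single label-map condition.
import torch

def sample_images(c, cinn, decode, n_samples=6, z_dim=16):
    # c: [1, cond_dim] condition vector derived from the label map (placeholder shape)
    c = c.expand(n_samples, -1)            # identical condition for every sample
    v = torch.randn(n_samples, z_dim)      # a different invariance sample v each time
    z = cinn.inverse(v, c)                 # map back into the decoder's code space
    return decode(z)                       # diverse images, all consistent with the label map
```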

Face Attribute Modification

Compared to StarGAN, our approach produces more coherent changes: e.g., changes in gender also cause changes in hair length, while changes in the beard attribute have no effect on female faces. This demonstrates the advantage of fusing attribute information with a low-dimensional representation of a generic autoencoder.

Text-to-Image

Fusing text representations of BERT with a BigGAN image generator enables text-to-image generation. Results show a high diversity of samples, fine-grained control over the generation process (e.g., color in the first two rows), and a large diversity in the objects that can be synthesized (e.g., school buses and broccoli plants).

Edge-to-Image

Edge-to-image synthesis requires a good representation of edges, which can be obtained from a ResNet-50 trained on a stylized version of ImageNet.

Inpainting

Inpainting tasks require a good representation of textures, which can be obtained from a standard ResNet-50.

Segmentation-to-Image

Fusing the argmax predictions (left) of a segmentation expert with a decoder produces diverse outputs (right) for segmentation-to-image tasks.

Controlling Output Diversity

The logits of a segmentation expert (visualized on the left via a random projection to RGB) contain more information about the input than its argmax predictions, resulting in less diverse outputs for shape-guided image synthesis.

Unaligned Image-to-Image Translation

As our approach does not require gradients of the domain experts, we can also directly use labels obtained from human experts to perform unaligned image-to-image translation between different domains.

Acknowledgement

This page is based on a design by TEMPLATED.