arXiv
BibTeX
GitHub
* equal contribution

Abstract

Deep generative models have demonstrated impressive performance in image synthesis. However, results deteriorate in the case of spatial deformations, since such models generate images of objects directly rather than modeling the intricate interplay of their inherent shape and appearance. We present a conditional U-Net for shape-guided image generation, conditioned on the output of a variational autoencoder for appearance. The approach is trained end-to-end on images, without requiring samples of the same object with varying pose or appearance. Experiments show that the model enables conditional image generation and transfer: either shape or appearance can be retained from a query image while the other is freely altered. Moreover, thanks to its stochastic latent representation, appearance can be sampled while shape is preserved. In quantitative and qualitative experiments on COCO, DeepFashion, shoes, Market-1501, and handbags, the approach demonstrates significant improvements over the state of the art.
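The factorization described above can be illustrated with a minimal numpy sketch. This is not the paper's architecture (the actual model is a U-Net operating on image feature maps, trained end-to-end); the toy linear "networks", dimensions, and function names below are purely illustrative assumptions. The sketch only shows the structural idea: an appearance encoder produces a Gaussian latent, and a decoder conditioned on a shape estimate (e.g. keypoints or edges) combines that latent with the shape, so appearance can be transferred from one image or sampled freely while shape is held fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper).
D_IMG, D_SHAPE, D_Z = 64, 16, 8

# Randomly initialised toy weights standing in for trained networks.
W_enc = rng.normal(scale=0.1, size=(D_IMG, 2 * D_Z))        # appearance VAE encoder
b_enc = np.zeros(2 * D_Z)
W_dec = rng.normal(scale=0.1, size=(D_SHAPE + D_Z, D_IMG))  # shape-conditioned decoder
b_dec = np.zeros(D_IMG)

def encode_appearance(x):
    """VAE encoder: image -> Gaussian posterior (mu, log_var) over appearance z."""
    h = x @ W_enc + b_enc
    return h[:D_Z], h[D_Z:]

def sample_z(mu, log_var):
    """Reparameterisation trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(shape_feat, z):
    """Decoder conditioned on shape (keypoint/edge features) and appearance z."""
    return np.concatenate([shape_feat, z]) @ W_dec + b_dec

# Transfer: appearance taken from image A, shape (pose) taken from image B.
img_a = rng.normal(size=D_IMG)     # stand-in for image A
pose_b = rng.normal(size=D_SHAPE)  # stand-in for B's shape estimate

mu, log_var = encode_appearance(img_a)
out_transfer = decode(pose_b, sample_z(mu, log_var))  # B's pose, A's appearance
out_sampled = decode(pose_b, rng.normal(size=D_Z))    # B's pose, sampled appearance
```

Because the latent is stochastic, calling `decode` with fresh draws of `z` yields varied appearances under the same fixed shape, which is the mechanism behind the appearance-sampling results shown below.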

Results

Below are results and applications of our model.

Video synthesis using appearances from COCO and poses from PennAction

Transfer of appearance and pose on DeepFashion using keypoints

Transfer of appearance and pose on Market using keypoints

Transfer of appearance and pose on Shoes using edges

Transfer of appearance and pose on Handbags using edges

Transfer of appearance and pose on COCO using keypoints

Transfer across datasets using appearance from Handbags and pose from Shoes

Transfer across datasets using appearance from Shoes and pose from Handbags

Acknowledgement

This work has been supported in part by the Heidelberg Academy of Science
and a hardware donation from NVIDIA. This page is based on a design by TEMPLATED.