Abstract

Deep generative models promise to learn an explainable representation of visual objects that allows image sampling, synthesis, and selective modification. The main challenge is to properly model the independent latent characteristics of an object, especially its appearance and pose. We present a novel approach that learns disentangled representations of these characteristics and explains them individually. Training requires only pairs of images depicting the same object appearance, but no pose annotations. We propose an additional classifier that estimates the minimal amount of regularization required to enforce disentanglement. Thus, both representations together can completely explain an image while being independent of each other. Previous methods based on adversarial approaches fail to enforce this independence, while methods based on variational approaches lead to uninformative representations. In experiments on diverse object categories, the approach successfully recombines pose and appearance to reconstruct and retarget novel synthesized images. We achieve significant improvements over state-of-the-art methods that use the same level of supervision and reach performance comparable to that of pose-supervised approaches. Unlike those approaches, however, ours can handle the vast body of articulated object classes for which no pose models or annotations are available.

Unsupervised Robust Disentangling of Latent Characteristics for Image Synthesis

Results

Video synthesis on the NTU dataset.

Video synthesis on the BBC dataset.

Video synthesis on the NORB dataset.

Transfer results on the PKU dataset.

Retrieval and synthesis using disentangled shape and appearance.

Transfer results on animals.

Transfer results on the BBC dataset.

Transfer results on the NTU dataset.

Interpolation along pose (horizontal direction) and appearance (vertical direction).

Comparison on the Sprites dataset.

Overview of our model.

Quantitative comparison on the Sprites dataset.

Transfer results on the NORB dataset.

Comparison to our previous work, VUNet, which relied on fully supervised keypoint estimators.

Transfer results on the BBC dataset.

Transfer results on the NTU dataset.

Transfer results on the PKU dataset.

Transfer results on the NORB dataset.

Transfer results on the DeepFashion dataset.

Acknowledgement
