Large intra-class variation is the result of changes in multiple object characteristics. Images, however, only show the superposition of different variable factors such as appearance or shape. Therefore, learning to disentangle and represent these different characteristics poses a great challenge, especially in the unsupervised case. Moreover, large object articulation calls for a flexible part-based model. We present an unsupervised approach for disentangling appearance and shape by learning parts consistently over all instances of a category. Our model for learning an object representation is trained by simultaneously exploiting invariance and equivariance constraints between synthetically transformed images. Since no part annotation or prior information on an object class is required, the approach is applicable to arbitrary classes. We evaluate our approach on a wide range of object categories and diverse tasks including pose prediction, disentangled image synthesis, and video-to-video translation. The approach outperforms the state-of-the-art on unsupervised keypoint prediction and compares favorably even against supervised approaches on the task of shape and appearance transfer.
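The invariance and equivariance constraints mentioned above can be illustrated with a toy example. The sketch below is not our model. It assumes a hypothetical stand-in for a part detector (`part_location`, an intensity-weighted centroid), a horizontal flip as the synthetic spatial transformation, and a global brightness gain as the synthetic appearance change. All function names are illustrative. Under these assumptions, a shape estimate should move with the spatial transformation (equivariance) and stay fixed under the appearance change (invariance).

```python
import numpy as np

def part_location(image):
    """Toy shape estimate (assumption): intensity-weighted centroid (x, y)."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    weights = image / image.sum()  # normalize to a spatial distribution
    return np.array([(weights * xs).sum(), (weights * ys).sum()])

def flip_image(image):
    """Synthetic spatial transformation: horizontal flip."""
    return image[:, ::-1]

def flip_point(point, width):
    """The same flip applied to an (x, y) coordinate."""
    x, y = point
    return np.array([width - 1 - x, y])

def change_appearance(image, gain=2.5):
    """Synthetic appearance change: global brightness gain."""
    return gain * image

def equivariance_error(image):
    """Distance between 'detect after flipping' and 'flip after detecting'."""
    width = image.shape[1]
    detected_after = part_location(flip_image(image))
    flipped_after = flip_point(part_location(image), width)
    return np.abs(detected_after - flipped_after).max()

def invariance_error(image):
    """Change in the shape estimate caused by the appearance change alone."""
    return np.abs(part_location(change_appearance(image)) - part_location(image)).max()

rng = np.random.default_rng(0)
image = rng.random((8, 12)) + 0.1  # strictly positive toy image

print("equivariance error:", equivariance_error(image))
print("invariance error:", invariance_error(image))
```

In a learned model, error terms of this kind would presumably serve as training penalties rather than hold exactly. The centroid satisfies both by construction, which is why it is convenient for illustration.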


and applications of our model.

Our unsupervised learning of a disentangled part-based shape and appearance enables numerous tasks ranging from unsupervised pose estimation to image synthesis and retargeting.

Learned shape representation on Penn Action. (a) Different instances, showing intra-class consistency, and (b) a video sequence, showing consistency and smoothness under motion, although each frame is processed individually.

Unsupervised video-to-video transfer on the BBC-Pose Dataset

Local appearance transfer on the DeepFashion Dataset

Swapping part appearance on DeepFashion. Appearances can be exchanged for parts individually and without altering shape. We show part-wise swaps for (a) head, (b) torso, (c) legs, and (d) shoes.

Transferring shape and appearance on DeepFashion. Without annotation, the model estimates shape (2nd column). Target appearance is extracted from the images in the top row.

Unsupervised keypoints overview

Unsupervised keypoints on CelebA-Dataset

Unsupervised keypoints on Cat-Head Dataset

Unsupervised keypoints on CUB Dataset


This work has been supported in part by DFG grant OM81/1-1
and a hardware donation from NVIDIA Corporation.