Abstract

How would a static scene react to a local poke? What are the effects on other parts of an object if you could locally push it? There will be distinctive movement, despite evident variations caused by the stochastic nature of our world. These outcomes are governed by the characteristic kinematics of objects that dictate their overall motion caused by a local interaction. Conversely, the movement of an object provides crucial information about its underlying distinctive kinematics and the interdependencies between its parts. This two-way relation motivates learning a bijective mapping between object kinematics and plausible future image sequences. Therefore, we propose iPOKE -- invertible Prediction of Object Kinematics -- that, conditioned on an initial frame and a local poke, allows sampling object kinematics and establishes a one-to-one correspondence to the corresponding plausible videos, thereby providing controlled stochastic video synthesis. In contrast to previous works, we do not generate arbitrary realistic videos, but provide efficient control of movements, while still capturing the stochastic nature of our environment and the diversity of plausible outcomes it entails. Moreover, our approach can transfer kinematics onto novel object instances and is not confined to particular object classes.

Results from our GUI. If you also want to check it out, just clone our code and follow the instructions there!

Controlled stochastic video synthesis on PokingPlants [1]. Each row shows different samples from our learned residual kinematics distribution for the same poke, indicating that our model can synthesize motion stochastically while maintaining localized control.

Controlled stochastic video synthesis on iPER [2]. Each row shows different samples from our learned residual kinematics distribution for the same poke, indicating that our model can synthesize motion stochastically while maintaining localized control.

Controlled stochastic video synthesis on Human3.6m [3]. Each row shows different samples from our learned residual kinematics distribution for the same poke, indicating that our model can synthesize motion stochastically while maintaining localized control.

Controlled stochastic video synthesis on Tai-Chi-HD [4]. Each row shows different samples from our learned residual kinematics distribution for the same poke, indicating that our model can synthesize motion stochastically while maintaining localized control.

Kinematics Transfer on iPER: We extract the residual kinematics from a ground truth sequence (top row) and use it together with the corresponding control \(c\) (red arrow) to animate an image \(x_t\) showing similar initial object posture (second row). We also visualize a random sample from \(q(r)\) for the same \((x_t,c)\) (bottom row), indicating that the residual kinematics representation solely contains motion information not present in \((x_t,c)\).
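The transfer described above relies on the mapping between residual kinematics and video being invertible for a fixed condition \((x_t,c)\). The following is only a toy sketch of that idea, not the iPOKE model: a single conditional affine map stands in for the learned invertible network, and the `conditioning` function, the feature vectors, and the three-dimensional "video codes" are all hypothetical placeholders chosen for illustration.

```python
import math


def conditioning(features):
    """Hypothetical stand-in for a learned network: maps the condition
    (features of the frame and the poke) to a log-scale and a shift."""
    total = sum(features)
    log_scale = math.tanh(0.1 * total)  # bounded, so exp() stays well behaved
    shift = 0.5 * total
    return log_scale, shift


def residual_to_video(r, cond):
    """Forward direction: residual r -> video code z, given the condition."""
    log_scale, shift = conditioning(cond)
    return [ri * math.exp(log_scale) + shift for ri in r]


def video_to_residual(z, cond):
    """Inverse direction: video code z -> residual r, given the same condition."""
    log_scale, shift = conditioning(cond)
    return [(zi - shift) * math.exp(-log_scale) for zi in z]


# Toy kinematics transfer: extract r from a source video code, then decode it
# under the condition of a different target frame with the same poke.
source_cond = [0.2, -0.4, 1.0]  # assumed features of the source frame and poke
target_cond = [0.3, -0.1, 1.0]  # assumed features of the target frame and poke
source_video = [1.5, -0.7, 0.2]

r = video_to_residual(source_video, source_cond)
reconstructed = residual_to_video(r, source_cond)
transferred_video = residual_to_video(r, target_cond)
```

Because the map is a bijection for each condition, decoding `r` under the source condition recovers the source code exactly (up to floating-point error), while decoding it under the target condition yields a different code that reuses the same residual.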

Control Sensitivity: Evaluating the sensitivity of our model to different poke vectors at the same pixel on iPER [2]. For a given poke location in an image \(x_0\), we randomly sample four poke magnitudes and directions and visualize the resulting synthesized sequences in the rows of this video.
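As a rough sketch of how such a set of pokes could be drawn, the snippet below samples four vectors that share one start pixel but differ in direction and length. The function name `sample_pokes`, the `(x, y, dx, dy)` layout, the uniform sampling, and the `max_magnitude` bound are assumptions made for this example; the page does not specify how the pokes are sampled.

```python
import math
import random


def sample_pokes(location, num_pokes=4, max_magnitude=20.0, seed=None):
    """Sample poke vectors with a shared start pixel.

    Each poke is a tuple (x, y, dx, dy): the start pixel and a displacement
    whose direction and length are drawn uniformly at random (an assumption).
    """
    rng = random.Random(seed)
    x, y = location
    pokes = []
    for _ in range(num_pokes):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        magnitude = rng.uniform(0.0, max_magnitude)
        pokes.append((x, y, magnitude * math.cos(angle), magnitude * math.sin(angle)))
    return pokes


pokes = sample_pokes(location=(64, 48), seed=0)
```

Each sampled poke would then be passed to the model together with the same image \(x_0\), producing one row of the visualization.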

Comparing different user inputs for the same source image on iPER [2].

Comparing different user inputs for the same source image on PokingPlants [1].

Understanding object structure: By performing 100 random interactions at the same location \(l\) within a given image frame \(x_0\) we obtain varying video sequences, from which we compute motion correlations for \(l\) with all remaining pixels. By mapping these correlations to the pixel space, we visualize distinct object parts.
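The correlation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code: motion is represented by one scalar magnitude per pixel and sampled video, Pearson correlation is used as the correlation measure, and `toy_motion` is a hypothetical stand-in for the videos the model would synthesize (a 4x4 image whose left half moves with the poked pixel).

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def correlation_map(motion, poke_location):
    """motion[k][i][j] is the motion magnitude at pixel (i, j) in the k-th sample.

    Returns a height x width grid of correlations with the poked pixel.
    """
    li, lj = poke_location
    height, width = len(motion[0]), len(motion[0][0])
    reference = [sample[li][lj] for sample in motion]
    return [
        [pearson(reference, [sample[i][j] for sample in motion]) for j in range(width)]
        for i in range(height)
    ]


def toy_motion(num_samples=100, seed=0):
    """Hypothetical model output: columns 0-1 move together with the poke,
    columns 2-3 move independently of it."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        strength = rng.uniform(0.0, 1.0)  # shared motion of the poked part
        grid = [
            [
                strength + rng.gauss(0.0, 0.05) if j < 2 else rng.uniform(0.0, 1.0)
                for j in range(4)
            ]
            for _ in range(4)
        ]
        samples.append(grid)
    return samples


motion = toy_motion()
corr = correlation_map(motion, poke_location=(1, 0))
```

In this toy setting, pixels in the left half correlate strongly with the poked pixel and pixels in the right half do not, so thresholding or color-coding `corr` separates the two "parts".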

Besides the quantitative results presented in our paper, we qualitatively compare iPOKE to IVRNN [5], which is the best performing competing method when considering both diversity and video quality. Here we show the results on iPER [2].

Comparison in controlled video synthesis with the approach of Hao et al. [6] on iPER.

Comparison in controlled video synthesis with the approach of Hao et al. [6] on PokingPlants.

References

[1] Andreas Blattmann, Timo Milbich, Michael Dorkenwald, and Bjorn Ommer. Understanding object dynamics for interactive image-to-video synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5171–5181, 2021.
[2] Wen Liu, Zhixin Piao, Jie Min, Wenhan Luo, Lin Ma, and Shenghua Gao. Liquid warping gan: A unified framework for human motion imitation, appearance transfer and novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[3] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7): 1325–1339, 2014.
[4] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. In Advances in Neural Information Processing Systems (NeurIPS), pages 7135–7145, 2019.
[5] Lluis Castrejon, Nicolas Ballas, and Aaron Courville. Improved conditional vrnns for video prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[6] Zekun Hao, Xun Huang, and Serge Belongie. Controllable video generation with sparse trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Our Related Work on Video Synthesis

Understanding Object Dynamics for Interactive Image-To-Video Synthesis

What would be the effect of locally poking a static scene? We present an approach that learns natural-looking global articulations caused by a local manipulation at a pixel level. Training requires only videos of moving objects but no information about the underlying manipulation of the physical scene. Our generative model learns to infer natural object dynamics as a response to user interaction and learns about the interrelations between different object body regions. Given a static image of an object and a local poking of a pixel, the approach then predicts how the object would deform over time. In contrast to existing work on video prediction, we do not synthesize arbitrary realistic videos but enable local interactive control of the deformation. Our model is not restricted to particular object categories and can transfer dynamics onto novel unseen object instances. Extensive experiments on diverse objects demonstrate the effectiveness of our approach compared to common video prediction frameworks.

Stochastic Image-To-Video Synthesis using cINNs

Video understanding calls for a model to learn the characteristic interplay between static scene content and its dynamics: Given an image, the model must be able to predict a future progression of the portrayed scene and, conversely, a video should be explained in terms of its static image content and all the remaining characteristics not present in the initial frame. This naturally suggests a bijective mapping between the video domain and the static content as well as residual information. In contrast to common stochastic image-to-video synthesis, such a model does not merely generate arbitrary videos progressing the initial image. Given this image, it rather provides a one-to-one mapping between the residual vectors and the video with stochastic outcomes when sampling. The approach is naturally implemented using a conditional invertible neural network (cINN) that can explain videos by independently modelling static and other video characteristics, thus laying the basis for controlled video synthesis. Experiments on diverse video datasets demonstrate the effectiveness of our approach in terms of both the quality and diversity of the synthesized results.

Behavior-Driven Synthesis of Human Dynamics

Generating and representing human behavior are of major importance for various computer vision applications. Commonly, human video synthesis represents behavior as sequences of postures while directly predicting their likely progressions or merely changing the appearance of the depicted persons, thus not being able to exercise control over their actual behavior during the synthesis process. In contrast, controlled behavior synthesis and transfer across individuals requires a deep understanding of body dynamics and calls for a representation of behavior that is independent of appearance and also of specific postures. In this work, we present a model for human behavior synthesis which learns a dedicated representation of human dynamics independent of postures. Using this representation, we are able to change the behavior of a person depicted in an arbitrary posture, or to even directly transfer behavior observed in a given video sequence. To this end, we propose a conditional variational framework which explicitly disentangles posture from behavior. We demonstrate the effectiveness of our approach on this novel task, evaluating capturing, transferring, and sampling fine-grained, diverse behavior, both quantitatively and qualitatively.

Our Related Work on Visual Synthesis

Taming Transformers for High-Resolution Image Synthesis

Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers.

Network-to-Network Translation with Conditional Invertible Neural Networks

Given the ever-increasing computational costs of modern machine learning models, we need to find new ways to reuse such expert models and thus tap into the resources that have been invested in their creation. Recent work suggests that the power of these massive models is captured by the representations they learn. Therefore, we seek a model that can relate between different existing representations and propose to solve this task with a conditionally invertible network. This network demonstrates its capability by (i) providing generic transfer between diverse domains, (ii) enabling controlled content synthesis by allowing modification in other domains, and (iii) facilitating diagnosis of existing representations by translating them into interpretable domains such as images. Our domain transfer network can translate between fixed representations without having to learn or finetune them. This allows users to utilize various existing domain-specific expert models from the literature that had been trained with extensive computational resources. Experiments on diverse conditional image synthesis tasks, competitive image modification results and experiments on image-to-image and text-to-image generation demonstrate the generic applicability of our approach. For example, we translate between BERT and BigGAN, state-of-the-art text and image models to provide text-to-image generation, which neither of both experts can perform on their own.

iPOKE: Poking a Still Image for Controlled Stochastic Video Synthesis

Video Presentation

Enabling Users to Interact with Still Images

We provide a graphical user interface that lets users test our model directly and interact with still images, as visualized below.

Approach

Invertible Model for Controlled Stochastic Video Synthesis

Results on the Object Classes of Humans and Plants

Controlled Stochastic Video Synthesis

Further Results

Understanding Object Structure

Comparison with Other Models