tl;dr We present a framework for both stochastic and controlled image-to-video synthesis. We bridge the gap between the image and video domains using conditional invertible neural networks and account for the inherent ambiguity with a dedicated, learned scene dynamics representation.

Abstract

Video understanding calls for a model to learn the characteristic interplay between static scene content and its dynamics: Given an image, the model must be able to predict a future progression of the portrayed scene and, conversely, a video should be explained in terms of its static image content and all the remaining characteristics not present in the initial frame. This naturally suggests a bijective mapping between the video domain and the static content as well as residual information. In contrast to common stochastic image-to-video synthesis, such a model does not merely generate arbitrary videos progressing the initial image. Given this image, it rather provides a one-to-one mapping between the residual vectors and the video with stochastic outcomes when sampling. The approach is naturally implemented using a conditional invertible neural network (cINN) that can explain videos by independently modelling static and other video characteristics, thus laying the basis for controlled video synthesis. Experiments on diverse video datasets demonstrate the effectiveness of our approach in terms of both the quality and diversity of the synthesized results.

Overview

Poster presentation of our work.
Video presentation of our work.
Overview of our proposed framework.
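To make the bijection described in the abstract concrete, the following minimal sketch, written by us in PyTorch and not taken from the released implementation, shows a single conditional affine coupling block that maps a residual \(\nu\) to a video representation \(z\) given features of the starting frame \(x_0\), and back; all module, argument, and dimension names are assumptions. A full cINN stacks several such blocks with permutations in between, so that given the starting frame every video representation corresponds to exactly one residual, and vice versa.

```python
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """One conditional affine coupling block: invertible in its input,
    conditioned on features of the starting frame x0 (illustrative sketch)."""

    def __init__(self, dim=128, cond_dim=64, hidden=256):
        super().__init__()
        self.half = dim // 2
        # small sub-network predicting scale and shift for the second half
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, nu, cond):
        """Map residual nu -> video code z, i.e. z = T_theta(nu; x0)."""
        a, b = nu[:, :self.half], nu[:, self.half:]
        scale, shift = self.net(torch.cat([a, cond], dim=1)).chunk(2, dim=1)
        return torch.cat([a, b * torch.exp(scale) + shift], dim=1)

    def inverse(self, z, cond):
        """Map video code z -> residual nu, i.e. nu = T_theta^{-1}(z; x0)."""
        a, b = z[:, :self.half], z[:, self.half:]
        scale, shift = self.net(torch.cat([a, cond], dim=1)).chunk(2, dim=1)
        return torch.cat([a, (b - shift) * torch.exp(-scale)], dim=1)


if __name__ == "__main__":
    block = ConditionalCoupling()
    cond = torch.randn(4, 64)      # stand-in for starting-frame features
    nu = torch.randn(4, 128)       # residual drawn from q(nu) = N(0, I)
    z = block(nu, cond)            # image-to-video direction
    assert torch.allclose(block.inverse(z, cond), nu, atol=1e-5)  # bijection holds
```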

Results

Results on Landscape

Animations from our model given a single conditioning frame. This video corresponds to Fig. 3 in the main paper.
Visualization of various animations of the same starting frame. Our model produces diverse outputs, capturing different speeds and directions, when randomly drawing scene dynamics from the residual space \(\nu \sim q(\nu)\) (sketched in code at the end of this section).
Qualitative comparison to previous work, i.e., AL [1], DTVNet [2], and MDGAN [3], with `GT' denoting the ground-truth. Both MDGAN [3] and DTVNet [2] produce blurry videos when using the officially provided pretrained weights and code from the respective webpages. AL [1] produces decent animations in the presence of small motion, but exhibits warping artifacts when animating fast motions, cf. row 3. In contrast, our method produces realistic-looking results for both small and large motions.
Visualization of longer sequences (48 frames). Our model is applied sequentially to the last frame of the previously predicted video sequence using the same \(\nu\) (see the sketch at the end of this section).
More animations from our model, which can be generated using the code provided on our GitHub page.
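The two sampling procedures referenced above, drawing different residuals for diverse animations and chaining predictions for longer sequences, can be summarized by the following sketch; `cond_net`, `cinn_forward`, and `decoder` are placeholders for the trained components rather than names from our released code, and the tensor shapes are assumptions.

```python
import torch

@torch.no_grad()
def sample_animations(x0, cond_net, cinn_forward, decoder, n_samples=5, nu_dim=128):
    """Animate one starting frame several times with different residuals nu ~ N(0, I)."""
    cond = cond_net(x0)                          # conditioning features of x0
    videos = []
    for _ in range(n_samples):
        nu = torch.randn(x0.shape[0], nu_dim)    # scene-dynamics residual
        z = cinn_forward(nu, cond)               # z = T_theta(nu; x0)
        videos.append(decoder(z, x0))            # render a video, e.g. (B, T, C, H, W)
    return videos

@torch.no_grad()
def animate_long(x0, cond_net, cinn_forward, decoder, n_chunks=3, nu_dim=128):
    """Longer sequences: re-apply the model to the last predicted frame, keeping nu fixed."""
    nu = torch.randn(x0.shape[0], nu_dim)        # one fixed dynamics residual
    clips, current = [], x0
    for _ in range(n_chunks):
        clip = decoder(cinn_forward(nu, cond_net(current)), current)
        clips.append(clip)
        current = clip[:, -1]                    # continue from the last predicted frame
    return torch.cat(clips, dim=1)
```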

Results on BAIR

Qualitative comparison with IVRNN [4]. While both approaches render the robot's end effector and the visible environment well, we observe significant differences when the effector interacts with or occludes background objects; see, e.g., the interaction with the object in the middle of the scene in row 2.
This visualization qualitatively illustrates the prediction diversity of our model by animating a fixed starting frame \(x_0\) multiple times. `GT' denotes ground-truth. Our model synthesizes diverse samples by broadly covering motions in the \(x\), \(y\), and \(z\) directions.
More animations from our model, which can be generated using the code provided on our GitHub page.

Results on Dynamic Textures (DTDB)

Qualitative comparison on various dynamic textures with DG [5] and AL [1]. As described in the main paper, DG [5] is directly optimized on test samples and thus overfits to the test distribution. Consequently, their generations almost perfectly reproduce the ground-truth motion, which is most evident for the clouds texture. However, their method suffers from blurring due to optimization with an L2 pixel loss. As in the comparisons on Landscape, AL [1] has problems learning and generating the motion of dynamic textures exhibiting rapid motion changes, such as fire. This is explained by the susceptibility of optical flow to inaccuracies when capturing very fast motion, as well as by dynamic patterns outside the scope of optical flow, e.g., flicker. Our model, on the other hand, produces sharp video sequences with realistic-looking motions for all textures. Note that for each method, one model is trained per texture.

More animations from our model, which can be generated using the code provided on our GitHub page.

Results on iPER

Animations from our model given a single conditioning frame. This video corresponds to Fig. 4 in the main paper.
Qualitative comparison with IVRNN [4]. Our method produces more natural motions, e.g., row 3. `GT' denotes ground-truth.
More animations from our model, which can be generated using the code provided on our GitHub page.

Results on Controlled Video Synthesis

Motion transfer on Landscape. The task is to directly transfer a query motion extracted from a given landscape video \(\tilde{X}\) to a random starting frame \(x_0\). To this end, we extract the residual representation \(\tilde{\nu}\) of \(\tilde{X}\) by first obtaining its video representation \(\tilde{z} \sim q_\phi(z|\tilde{X})\) and then the corresponding residual \(\tilde{\nu} = \mathcal{T}_\theta^{-1}(\tilde{z};\tilde{x}_0)\), with \(\tilde{x}_0\) being the starting frame of \(\tilde{X}\). We then use \(\tilde{\nu}\) to animate the starting frame \(x_0\). Our model accurately transfers the query motion, e.g., the direction and speed of the clouds, to the target landscape images (rows 1-3, left to right); a code sketch of this procedure is given at the end of this section.
Visualization of controlled video-to-video synthesis using cloud video sequences from DTDB. We explicitly adjust the control factor \(\tilde{\eta}\) of an observed video sequence \(\tilde{X}\). To this end, we first obtain its video representation \(\tilde{z} \sim q_\phi(z|\tilde{X})\) and then extract the corresponding residual information \(\tilde{\nu} = \mathcal{T}_\theta^{-1}(\tilde{z};\tilde{x}_0, \tilde{\eta})\). Subsequently, to generate the video sequence depicting our controlled adjustment of \(\tilde{X}\), we simply choose a new value \(\tilde{\eta}=\tilde{\eta}^*\) and perform the image-to-sequence inference process (see the sketch at the end of this section). In each example (second row), the motion direction of the query video (leftmost) is adjusted according to the provided control (top row). To highlight that the residual representations \(\nu\) in these cases actually correspond to the query video, we additionally animate the initial image of the query videos by sampling a new residual representation \(\nu \sim q(\nu)\) and applying the same controls (bottom rows). We observe that, while the directions of the synthesized videos are identical, their speeds differ significantly, as desired.
More motion transfer samples from our model, which can be generated using the code provided on our GitHub page. The first row depicts the original query sequence; the remaining rows show the animations with the transferred dynamics.
This video illustrates several image-to-video generation examples for controlling the direction of cloud movements with \(\eta\), similar to Fig. 7 in our main paper. We observe that our model renders crisp future progressions (rows 2-5) of a given starting frame \(x_0\) while following the provided movement control (top row).
This video illustrates several image-to-video generation examples while controlling \(\eta = (x,y,z)\), the 3D end effector position, similar to Fig. 6 in our main paper. It shows that, while in each example the effector approximately stops at the provided end position (end frame of GT), its movements between the starting and end frame, which are determined by the sampled residual representations \(\nu \sim q(\nu)\), exhibit significantly varying yet natural progressions.
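Both controlled-inference procedures in this section follow the same recipe: invert the cINN to recover the residual of an observed video, then map it forward again under a new condition. The sketch below illustrates this under the same placeholder naming as above (`video_encoder`, `cond_net`, `cinn_forward`/`cinn_inverse`, `decoder` are our assumptions, not the released API); for the controlled model, the conditioning additionally includes \(\eta\).

```python
import torch

@torch.no_grad()
def transfer_motion(x0, query_video, video_encoder, cond_net,
                    cinn_forward, cinn_inverse, decoder):
    """Transfer the dynamics of a query video X~ to a new starting frame x0."""
    x0_query = query_video[:, 0]                          # first frame of X~
    z_query = video_encoder(query_video)                  # z~ ~ q_phi(z | X~)
    nu_query = cinn_inverse(z_query, cond_net(x0_query))  # nu~ = T_theta^{-1}(z~; x~_0)
    z_new = cinn_forward(nu_query, cond_net(x0))          # re-attach the dynamics to x0
    return decoder(z_new, x0)

@torch.no_grad()
def adjust_control(query_video, eta_old, eta_new, video_encoder, cond_net,
                   cinn_forward, cinn_inverse, decoder):
    """Keep the residual nu~ of an observed video but change the control factor eta."""
    x0 = query_video[:, 0]
    z = video_encoder(query_video)                  # z~ ~ q_phi(z | X~)
    nu = cinn_inverse(z, cond_net(x0), eta_old)     # nu~ = T_theta^{-1}(z~; x~_0, eta~)
    z_star = cinn_forward(nu, cond_net(x0), eta_new)  # re-synthesize with the new eta*
    return decoder(z_star, x0)
```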

References

[1] Yuki Endo, Yoshihiro Kanamori, and Shigeru Kuriyama. Animating landscape: self-supervised learning of decoupled motion and appearance for single-image video synthesis. ACM Transactions on Graphics, pages 175:1–175:19, 2019.
[2] Wei Xiong, Wenhan Luo, Lin Ma, Wei Liu, and Jiebo Luo. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2364–2373, 2018.
[3] Jiangning Zhang, Chao Xu, Liang Liu, Mengmeng Wang, Xia Wu, Yong Liu, and Yunliang Jiang. DTVNet: Dynamic time-lapse video generation via single still image. In Proceedings of the European Conference on Computer Vision (ECCV), pages 300–315, 2020.
[4] Lluís Castrejón, Nicolas Ballas, and Aaron C. Courville. Improved conditional VRNNs for video prediction. In Proceedings of the International Conference on Computer Vision (ICCV), pages 7607–7616, 2019.
[5] Jianwen Xie, Ruiqi Gao, Zilong Zheng, Song-Chun Zhu, and Ying Nian Wu. Learning dynamic generator model by alternating back-propagation through time. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pages 5498–5507, 2019.

Our Related Work on Video Synthesis

Generating and representing human behavior are of major importance for various computer vision applications. Commonly, human video synthesis represents behavior as sequences of postures while directly predicting their likely progressions or merely changing the appearance of the depicted persons, thus not being able to exercise control over their actual behavior during the synthesis process. In contrast, controlled behavior synthesis and transfer across individuals requires a deep understanding of body dynamics and calls for a representation of behavior that is independent of appearance and also of specific postures. In this work, we present a model for human behavior synthesis which learns a dedicated representation of human dynamics independent of postures. Using this representation, we are able to change the behavior of a person depicted in an arbitrary posture, or to even directly transfer behavior observed in a given video sequence. To this end, we propose a conditional variational framework which explicitly disentangles posture from behavior. We demonstrate the effectiveness of our approach on this novel task, evaluating the capturing, transfer, and sampling of fine-grained, diverse behavior, both quantitatively and qualitatively.

What would be the effect of locally poking a static scene? We present an approach that learns naturally-looking global articulations caused by a local manipulation at a pixel level. Training requires only videos of moving objects but no information about the underlying manipulation of the physical scene. Our generative model learns to infer natural object dynamics as a response to user interaction and learns about the interrelations between different object body regions. Given a static image of an object and a local poking of a pixel, the approach then predicts how the object would deform over time. In contrast to existing work on video prediction, we do not synthesize arbitrary realistic videos but enable local interactive control of the deformation. Our model is not restricted to particular object categories and can transfer dynamics onto novel unseen object instances. Extensive experiments on diverse objects demonstrate the effectiveness of our approach compared to common video prediction frameworks.
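At the interface level, the interaction described above amounts to mapping an image and a pixel-level poke to a short video. The snippet below only sketches this assumed interface, with `dynamics_model` standing in for the trained network; all names and shapes are our assumptions.

```python
import torch

@torch.no_grad()
def poke_to_video(image, poke_xy, poke_shift, dynamics_model, n_frames=16):
    """Predict how the object in `image` deforms over `n_frames` frames in response
    to a local poke at pixel `poke_xy` with displacement `poke_shift` (placeholder
    interface; `dynamics_model` stands in for the trained network)."""
    poke = torch.cat([poke_xy, poke_shift], dim=-1)   # (B, 4): poke location and shift
    return dynamics_model(image, poke, n_frames)      # e.g. video of shape (B, n_frames, C, H, W)
```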

Our Related Work on Visual Synthesis

Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers.
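A rough sketch of the two-stage idea under our own simplifying assumptions: a toy nearest-neighbour quantizer stands in for the learned CNN codebook of stage (i), and the transformer of stage (ii) is only outlined in comments; none of this mirrors the released implementation.

```python
import torch
import torch.nn as nn

class ToyQuantizer(nn.Module):
    """Toy stand-in for stage (i): map CNN features to a discrete vocabulary
    of codebook entries (illustrative only, no training losses shown)."""

    def __init__(self, n_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, features):                         # features: (B, N, dim) from a CNN encoder
        b, n, d = features.shape
        flat = features.reshape(b * n, d)
        dists = torch.cdist(flat, self.codebook.weight)  # distance to every codebook entry
        indices = dists.argmin(dim=-1).view(b, n)        # nearest-code token ids
        return self.codebook(indices), indices           # quantized features and token sequence

# Stage (ii): an autoregressive transformer is trained to predict each token id from
# its predecessors (and, for conditional synthesis, from class or segmentation tokens),
# so that sampling tokens and decoding them with the CNN decoder yields new images.
```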

Given the ever-increasing computational costs of modern machine learning models, we need to find new ways to reuse such expert models and thus tap into the resources that have been invested in their creation. Recent work suggests that the power of these massive models is captured by the representations they learn. Therefore, we seek a model that can relate between different existing representations and propose to solve this task with a conditionally invertible network. This network demonstrates its capability by (i) providing generic transfer between diverse domains, (ii) enabling controlled content synthesis by allowing modification in other domains, and (iii) facilitating diagnosis of existing representations by translating them into interpretable domains such as images. Our domain transfer network can translate between fixed representations without having to learn or finetune them. This allows users to utilize various existing domain-specific expert models from the literature that had been trained with extensive computational resources. Experiments on diverse conditional image synthesis tasks, competitive image modification results, and experiments on image-to-image and text-to-image generation demonstrate the generic applicability of our approach. For example, we translate between BERT and BigGAN, state-of-the-art text and image models, to provide text-to-image generation, which neither of the two experts can perform on its own.
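The translation mechanism can be pictured roughly as follows; `cinn_forward` and `target_decoder` are placeholders for a trained conditional invertible network and a frozen target-domain expert (e.g., a generator), so the sketch reflects our assumptions rather than the paper's actual API.

```python
import torch

@torch.no_grad()
def translate_representation(source_repr, cinn_forward, target_decoder,
                             residual_dim=128, n_samples=3):
    """Translate a frozen source embedding (e.g. from a text model) into several
    plausible target embeddings (e.g. GAN latents) and decode them with the frozen
    target expert; the residual captures the ambiguity of the translation."""
    outputs = []
    for _ in range(n_samples):
        nu = torch.randn(source_repr.shape[0], residual_dim)  # one sample of the ambiguity
        target_repr = cinn_forward(nu, source_repr)           # invertible map, conditioned on the source
        outputs.append(target_decoder(target_repr))           # e.g. images from the target generator
    return outputs
```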

Acknowledgement

This work was started as part of M.D.'s internship at Ryerson University and was supported by a DAAD scholarship, by the NSERC Discovery Grant program (K.G.D.), in part by the German Research Foundation (DFG) within project 421703927 (B.O.), and by the BW Stiftung (B.O.). K.G.D. contributed to this work in his capacity as an Associate Professor at Ryerson University. This page is based on a design by TEMPLATED.