TL;DR: We present iPOKE, a model for locally controlled, stochastic video synthesis based on poking a single pixel in a static scene, which enables users to animate still images with simple mouse drags.

Abstract

How would a static scene react to a local poke? What are the effects on other parts of an object if you could locally push it? There will be distinctive movement, despite evident variations caused by the stochastic nature of our world. These outcomes are governed by the characteristic kinematics of objects that dictate their overall motion caused by a local interaction. Conversely, the movement of an object provides crucial information about its underlying distinctive kinematics and the interdependencies between its parts. This two-way relation motivates learning a bijective mapping between object kinematics and plausible future image sequences. Therefore, we propose iPOKE -- invertible Prediction of Object Kinematics -- which, conditioned on an initial frame and a local poke, allows sampling of object kinematics and establishes a one-to-one correspondence to the corresponding plausible videos, thereby providing controlled stochastic video synthesis. In contrast to previous works, we do not generate arbitrary realistic videos, but provide efficient control of movements, while still capturing the stochastic nature of our environment and the diversity of plausible outcomes it entails. Moreover, our approach can transfer kinematics onto novel object instances and is not confined to particular object classes.

Video Presentation

Enabling Users to Interact with Still Images


We present a graphical user interface (GUI) that enables users to test our model directly and interact with still images, as visualized below.

Check it out now!

Results from our GUI. If you also want to check it out, just clone our code and follow the instructions there!
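For readers curious how a mouse drag might map to the model's poke control \(c\): the sketch below encodes a drag as a sparse displacement map around the drag start point. The helper drag_to_poke, the patch size, and the map layout are illustrative assumptions, not the encoding used by the released GUI.

```python
import numpy as np

def drag_to_poke(start_xy, end_xy, image_size, poke_size=5):
    """Encode a mouse drag as a sparse displacement map ("poke").

    The poke is a 2-channel map of shape (2, H, W) that is zero everywhere
    except in a small patch around the drag start point, where it holds the
    (dx, dy) displacement of the drag. This is a hypothetical encoding that
    mirrors the sparse local control described on this page.
    """
    h, w = image_size
    poke = np.zeros((2, h, w), dtype=np.float32)
    x0, y0 = start_xy
    dx, dy = end_xy[0] - x0, end_xy[1] - y0
    half = poke_size // 2
    ys = slice(max(0, y0 - half), min(h, y0 + half + 1))
    xs = slice(max(0, x0 - half), min(w, x0 + half + 1))
    poke[0, ys, xs] = dx
    poke[1, ys, xs] = dy
    return poke

# Example: a drag from pixel (40, 60) to (55, 48) on a 128x128 frame.
c = drag_to_poke((40, 60), (55, 48), (128, 128))
```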

Approach


Invertible Model for Controlled Stochastic Video Synthesis

Overview of our proposed framework iPOKE for controlled video synthesis: We apply a conditional bijective transformation \(\tau_{\theta}\) to learn a residual kinematics representation \(r\) that captures all video information not present in the user control \(c\), which defines the intended local object motion in an image frame \(x_0\) (orange path). To keep the computational complexity feasible, we pre-train a video autoencoding framework \((E, GRU, D)\) (blue path), yielding a dedicated video representation \(z\) that serves as the training input for \(\tau_\theta\). Controlled video synthesis is achieved by sampling a residual \(r\), thus inferring plausible motion for the remaining object parts not defined by \(c\), and generating video sequences \(\hat{\boldsymbol{X}}\) from the resulting \(z = \tau_{\theta}(r \vert x_0,c)\) using \(GRU\) and \(D\) (black path).
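To make the generation path concrete, here is a minimal, hypothetical PyTorch sketch of the black path: a single conditional affine coupling block stands in for the full invertible \(\tau_\theta\), the conditioning embedding of \((x_0, c)\) and all dimensions are placeholders, and the pretrained \(GRU\) and decoder \(D\) are only indicated in comments.

```python
import torch
import torch.nn as nn

class CondAffineCoupling(nn.Module):
    """One conditional affine coupling block: half of the input is rescaled
    and shifted by a small network that sees the other half plus the
    conditioning (an embedding of x_0 and the poke c). Such blocks are
    invertible by construction, so stacking them yields a bijection."""
    def __init__(self, dim, cond_dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, r, cond):               # r -> z (generation direction)
        r1, r2 = r[:, :self.half], r[:, self.half:]
        scale, shift = self.net(torch.cat([r1, cond], dim=1)).chunk(2, dim=1)
        z2 = r2 * torch.exp(torch.tanh(scale)) + shift
        return torch.cat([r1, z2], dim=1)

    def inverse(self, z, cond):               # z -> r (training direction)
        z1, z2 = z[:, :self.half], z[:, self.half:]
        scale, shift = self.net(torch.cat([z1, cond], dim=1)).chunk(2, dim=1)
        r2 = (z2 - shift) * torch.exp(-torch.tanh(scale))
        return torch.cat([z1, r2], dim=1)

# Controlled synthesis (black path): sample a residual r, map it to a video
# code z and decode. In the actual model, tau is a stack of such blocks.
dim, cond_dim = 64, 128
tau = CondAffineCoupling(dim, cond_dim)
cond = torch.randn(1, cond_dim)               # placeholder embedding of (x_0, c)
r = torch.randn(1, dim)                       # residual kinematics sample
z = tau(r, cond)                              # z = tau_theta(r | x_0, c)
# video = D(GRU(z, x_0)); the pretrained GRU and decoder D are not shown here.
```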

Results on the Object Classes of Humans and Plants

Controlled Stochastic Video Synthesis

Controlled stochastic video synthesis on PokingPlants [1]. Each row shows different samples from our learned residual kinematics distribution for the same poke, indicating that our model can stochastically synthesize motion while maintaining localized control.
Controlled stochastic video synthesis on iPER [2]. Each row shows different samples from our learned residual kinematics distribution for the same poke, indicating that our model can stochastically synthesize motion while maintaining localized control.
Controlled stochastic video synthesis on Human3.6m [3]. Each row shows different samples from our learned residual kinematics distribution for the same poke, indicating that our model can stochastically synthesize motion while maintaining localized control.
Controlled stochastic video synthesis on Tai-Chi-HD [4]. Each row shows different samples from our learned residual kinematics distribution for the same poke, indicating that our model can stochastically synthesize motion while maintaining localized control.

Further Results

Kinematics Transfer and Control Sensitivity

Kinematics Transfer on iPER: We extract the residual kinematics from a ground truth sequence (top row) and use it together with the corresponding control \(c\) (red arrow) to animate an image \(x_t\) showing a similar initial object posture (second row). We also visualize a random sample from \(q(r)\) for the same \((x_t,c)\) (bottom row), indicating that the residual kinematics representation contains only motion information not present in \((x_t,c)\).
Control Sensitivity: Evaluating the sensitivity of our model to different poke vectors at the same pixel on iPER [2]. For a given poke location in an image \(x_0\), we randomly sample four poke magnitudes and directions and visualize the resulting synthesized sequences in the rows of this video.
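The kinematics transfer shown above hinges on the invertibility of \(\tau_\theta\): the residual is read out from a source sequence and re-attached to a new frame and poke. A minimal sketch, reusing the coupling block from the Approach sketch above (all tensors are placeholders; the real model operates on video codes from the pretrained autoencoder):

```python
import torch

def transfer_kinematics(tau, z_src: torch.Tensor,
                        cond_src: torch.Tensor,
                        cond_tgt: torch.Tensor) -> torch.Tensor:
    """Move the residual kinematics of a source sequence onto a target frame.

    tau is a conditional bijection like the coupling sketch above:
    tau(r, cond) maps residual -> video code, tau.inverse(z, cond) maps back.
    cond_src / cond_tgt are embeddings of (source frame, poke) and
    (target frame, poke); both are assumed, not the released interface.
    """
    r = tau.inverse(z_src, cond_src)   # strip what is explained by (x_src, c)
    return tau(r, cond_tgt)            # re-attach it to the new frame / poke

# With tau and cond from the sketch above and a source video code z_src:
# z_tgt = transfer_kinematics(tau, z_src, cond_src, cond_tgt)
# The transferred video is then obtained as D(GRU(z_tgt, x_tgt)).
```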

Additional visualized results from real user input

Comparing different user inputs for the same source image on iPER [2].
Comparing different user inputs for the same source image on PokingPlants [1].

Understanding Object Structure

Understanding object structure: By performing 100 random interactions at the same location \(l\) within a given image frame \(x_0\), we obtain varying video sequences, from which we compute motion correlations between \(l\) and all remaining pixels. By mapping these correlations back to pixel space, we visualize distinct object parts.
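A rough sketch of how such a correlation map could be computed from the sampled sequences. The motion measure used here, a mean absolute frame difference per pixel, is an assumption; the exact measure in the paper may differ.

```python
import numpy as np

def motion_correlation_map(videos, l):
    """Correlate motion at the poked pixel l with motion at all other pixels.

    videos: array of shape (N, T, H, W, C) holding N synthesized sequences
    for the same frame and the same poke location l = (row, col).
    """
    # (N, H, W): per-video motion magnitude, averaged over time and channels
    motion = np.abs(np.diff(videos.astype(np.float32), axis=1)).mean(axis=(1, 4))
    ref = motion[:, l[0], l[1]]                      # motion at the poke, per video
    m = motion - motion.mean(axis=0, keepdims=True)  # center across the N samples
    r = ref - ref.mean()
    num = (m * r[:, None, None]).sum(axis=0)
    den = np.sqrt((m ** 2).sum(axis=0) * (r ** 2).sum()) + 1e-8
    return num / den                                 # (H, W) Pearson correlation

# 100 random pokes at the same location l yield `videos`; the resulting map
# highlights pixels whose motion co-varies with the poked object part.
```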

Comparison with other models

Stochastic Video Synthesis: We compare with the recent state of the art in stochastic video prediction and obtain results with higher visual and temporal fidelity as well as greater diversity.

Besides the quantitative results presented in our paper, we qualitatively compare iPOKE to IVRNN [5], the best-performing competing method when considering both diversity and video quality. Here we show the results on iPER [2].
Besides the quantitative results presented in our paper, we qualitatively compare iPOKE to IVRNN [5], the best-performing competing method when considering both diversity and video quality. Here we show the results on PokingPlants [1].

Controlled Video Synthesis: We compare with the baseline of Hao et al. [6], the closest related work to our model in controlled video synthesis. iPOKE performs clearly better in terms of controllability while also synthesizing videos with superior visual and temporal coherence.

Comparison in controlled video synthesis with the approach of Hao et al. [6] on iPER.
Comparison in controlled video synthesis with the approach of Hao et al. [6] on PokingPlants.

Our Related Work on Video Synthesis

What would be the effect of locally poking a static scene? We present an approach that learns naturally-looking global articulations caused by a local manipulation at a pixel level. Training requires only videos of moving objects but no information of the underlying manipulation of the physical scene. Our generative model learns to infer natural object dynamics as a response to user interaction and learns about the interrelations between different object body regions. Given a static image of an object and a local poking of a pixel, the approach then predicts how the object would deform over time. In contrast to existing work on video prediction, we do not synthesize arbitrary realistic videos but enable local interactive control of the deformation. Our model is not restricted to particular object categories and can transfer dynamics onto novel unseen object instances. Extensive experiments on diverse objects demonstrate the effectiveness of our approach compared to common video prediction frameworks.

Video understanding calls for a model to learn the characteristic interplay between static scene content and its dynamics: Given an image, the model must be able to predict a future progression of the portrayed scene and, conversely, a video should be explained in terms of its static image content and all the remaining characteristics not present in the initial frame. This naturally suggests a bijective mapping between the video domain and the static content as well as residual information. In contrast to common stochastic image-to-video synthesis, such a model does not merely generate arbitrary videos progressing the initial image. Given this image, it rather provides a one-to-one mapping between the residual vectors and the video with stochastic outcomes when sampling. The approach is naturally implemented using a conditional invertible neural network (cINN) that can explain videos by independently modelling static and other video characteristics, thus laying the basis for controlled video synthesis. Experiments on diverse video datasets demonstrate the effectiveness of our approach in terms of both the quality and diversity of the synthesized results.

Generating and representing human behavior are of major importance for various computer vision applications. Commonly, human video synthesis represents behavior as sequences of postures while directly predicting their likely progressions or merely changing the appearance of the depicted persons, thus not being able to exercise control over their actual behavior during the synthesis process. In contrast, controlled behavior synthesis and transfer across individuals requires a deep understanding of body dynamics and calls for a representation of behavior that is independent of appearance and also of specific postures. In this work, we present a model for human behavior synthesis which learns a dedicated representation of human dynamics independent of postures. Using this representation, we are able to change the behavior of a person depicted in an arbitrary posture, or to even directly transfer behavior observed in a given video sequence. To this end, we propose a conditional variational framework which explicitly disentangles posture from behavior. We demonstrate the effectiveness of our approach on this novel task, evaluating capturing, transferring, and sampling fine-grained, diverse behavior, both quantitatively and qualitatively.

Our Related Work on Visual Synthesis

Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers.

Given the ever-increasing computational costs of modern machine learning models, we need to find new ways to reuse such expert models and thus tap into the resources that have been invested in their creation. Recent work suggests that the power of these massive models is captured by the representations they learn. Therefore, we seek a model that can relate between different existing representations and propose to solve this task with a conditionally invertible network. This network demonstrates its capability by (i) providing generic transfer between diverse domains, (ii) enabling controlled content synthesis by allowing modification in other domains, and (iii) facilitating diagnosis of existing representations by translating them into interpretable domains such as images. Our domain transfer network can translate between fixed representations without having to learn or finetune them. This allows users to utilize various existing domain-specific expert models from the literature that had been trained with extensive computational resources. Experiments on diverse conditional image synthesis tasks, competitive image modification results and experiments on image-to-image and text-to-image generation demonstrate the generic applicability of our approach. For example, we translate between BERT and BigGAN, state-of-the-art text and image models to provide text-to-image generation, which neither of both experts can perform on their own.

Acknowledgement

The research leading to these results is funded by the German Federal Ministry for Economic Affairs and Energy within the project “KI-Absicherung – Safe AI for automated driving” and by the German Research Foundation (DFG) within project 421703927. This page is based on a design by TEMPLATED.