What would be the effect of locally poking a static scene? We present an approach that learns natural-looking global articulations caused by a local manipulation at the pixel level. Training requires only videos of moving objects but no information about the underlying manipulation of the physical scene. Our generative model infers natural object dynamics in response to user interaction and captures the interrelations between different body regions of an object. Given a static image of an object and a local poke at a pixel, the approach then predicts how the object would deform over time. In contrast to existing work on video prediction, we do not synthesize arbitrary realistic videos but enable local interactive control of the deformation. Our model is not restricted to particular object categories and can transfer dynamics onto novel unseen object instances. Extensive experiments on diverse objects demonstrate the effectiveness of our approach compared to common video prediction frameworks.
When trained on interactions consisting of a shift \(p \in \mathbb{R}^2\) of the pixel at location \(l \in \mathbb{N}^2\), human users can apply our model to synthesize plausible object responses to such interactions based on still images. That is, human users define the intended target location for the poked object part while our model infers matching object dynamics for the remaining object parts.
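As a minimal sketch of how such an interaction could be represented, the snippet below rasterizes a single poke (location \(l\), shift \(p\)) into a sparse two-channel displacement map matching the image resolution. The function name, the map layout, and the final model call are illustrative assumptions, not the exact interface used by our implementation.

```python
import numpy as np

def encode_poke(image_hw, l, p):
    """Hypothetical poke encoding.

    image_hw: (H, W) spatial size of the input image
    l: (row, col) pixel location of the poke
    p: (dy, dx) shift of that pixel, in pixels
    """
    H, W = image_hw
    poke_map = np.zeros((2, H, W), dtype=np.float32)
    poke_map[:, l[0], l[1]] = p  # non-zero only at the poked pixel
    return poke_map

# Example: poke the pixel at (120, 64) of a 256x256 image by (10, -5) pixels.
poke = encode_poke((256, 256), l=(120, 64), p=np.array([10.0, -5.0]))
# The still image and this poke map would then be passed to the generative
# model, e.g. video = model(image, poke), which predicts the resulting sequence.
```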
As our model makes no assumptions about the objects to interact with and can thus be flexibly learned from unlabeled videos, it can generate realistic-looking video sequences of the distinct types of plants contained in our self-recorded PokingPlants dataset, despite their drastically varying shapes.
Dynamics of highly articulated objects such as humans can also be learned without any annotations. Moreover, our model generalizes to novel instances that were not seen during training.
Our hierarchical model also applies to in-the-wild settings, as demonstrated by the sequences it generates on the Tai-Chi dataset, and it can synthesize complex human motion such as walking sequences on the Human3.6M dataset.
When normalizing the magnitude of pokes over the entire dataset to lie between 0 and 1, thus removing information about the intended target location, the poke can be interpreted as an initial force or impulse on the object part interacted with. Consequently, our model now generates sequences showing object reactions to the initial impulse defined by the interaction, where larger pokes induce correspondingly large object dynamics, whereas small-magnitude interactions result in subtle object motion.
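A minimal sketch of this normalization, under the assumption that it simply rescales all poke shifts by the maximum magnitude observed in the dataset; the function name is a hypothetical illustration.

```python
import numpy as np

def normalize_poke_magnitudes(pokes):
    """Rescale poke shifts so their magnitudes lie in [0, 1].

    pokes: array of shape (N, 2), each row a shift p in pixels.
    Direction and relative strength are preserved; absolute target
    locations are discarded.
    """
    pokes = np.asarray(pokes, dtype=np.float32)
    magnitudes = np.linalg.norm(pokes, axis=1)
    max_mag = magnitudes.max()
    if max_mag == 0:
        return pokes
    return pokes / max_mag  # the largest poke maps to magnitude 1

pokes = np.array([[10.0, -5.0], [2.0, 1.0], [-20.0, 0.0]])
normalized = normalize_poke_magnitudes(pokes)
# A large normalized poke now acts like a strong initial impulse, a small
# one like a gentle nudge, independent of any particular target pixel.
```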
When combining the PokingPlants dataset with the vegetation samples from the Dynamic Textures Database, our model also generalizes to images obtained from web search.
The research leading to these results is funded by the German Federal Ministry for Economic Affairs and Energy within the project “KI-Absicherung – Safe AI for automated driving” and by the German Research Foundation (DFG) within project 421703927.