What If: Understanding Motion Through Sparse Interactions

CompVis @ LMU Munich, MCML     *equal contribution
ICCV 2025

TL;DR: Our Flow Poke Transformer (FPT) directly models the uncertainty of the world by predicting distributions of how objects (×) may move, conditioned on given input movements (pokes, →). We see that whether the hand (below the paw) or the paw (on top of the hand) moves downwards directly influences the other's movement. Left: the paw pushing the hand down forces the hand downwards, resulting in a unimodal distribution. Right: the hand moving down results in two modes: the paw following along or staying put.

Abstract

Understanding the dynamics of a physical scene involves reasoning about the diverse ways it can potentially change, especially as a result of local interactions. We present the Flow Poke Transformer (FPT), a novel framework for directly predicting the distribution of local motion, conditioned on sparse interactions termed “pokes”. Unlike traditional methods that typically only enable dense sampling of a single realization of scene dynamics, FPT provides an interpretable, directly accessible representation of multi-modal scene motion, its dependency on physical interactions, and the inherent uncertainties of scene dynamics.

We also evaluate our model on several downstream tasks to enable comparisons with prior methods and to highlight the flexibility of our approach. On dense face motion generation, our generic pre-trained model surpasses specialized baselines. FPT can be fine-tuned on strongly out-of-distribution tasks, such as synthetic datasets, yielding significant improvements over in-domain methods in articulated object motion estimation. Additionally, predicting explicit motion distributions directly enables our method to achieve competitive performance on tasks like moving part segmentation from pokes, further demonstrating the versatility of FPT.

How can objects move and interact in the world?

Our Flow Poke Transformer directly models sparse motion at its query points and predicts multimodal motion distributions that are diverse yet reflect a realistic understanding of physical movement in scenes. Its training objective is to predict Gaussian Mixture Models (GMMs) in a single forward pass by minimizing the negative log-likelihood (NLL) of the observed motion under its output distributions. Once the GMMs have been predicted, samples can be drawn from them directly with minimal latency.
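As a concrete illustration, the per-query NLL of such a GMM can be computed directly with torch.distributions. The sketch below is a minimal, hypothetical version of this objective; the tensor names, shapes, and the choice of diagonal Gaussian components are our assumptions, not the released implementation.

import torch
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

def gmm_nll_loss(logits, means, log_scales, target_flow):
    # Hypothetical shapes: logits [B, Q, K], means/log_scales [B, Q, K, 2],
    # target_flow [B, Q, 2] -- B images, Q query points, K mixture components.
    mixture = Categorical(logits=logits)                          # component weights
    components = Independent(Normal(means, log_scales.exp()), 1)  # diagonal 2D Gaussians
    gmm = MixtureSameFamily(mixture, components)
    return -gmm.log_prob(target_flow).mean()                      # average NLL over queries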

The primary advantage of modeling sparse motion lies in the reduced computational overhead during both training and inference. Coupled with our query-causal attention pattern (see the paper for more details), which further reduces computational complexity, this enables us to train a Flow Poke Transformer within one day on 8 H200 GPUs and to sample more than 100k queries per second for a single image on one H200.
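One plausible realization of such a pattern (the exact mask is defined in the paper) lets every token attend to the poke tokens while query tokens do not attend to each other, so attention cost grows only linearly in the number of queries. The mask below is our own hypothetical sketch, not the paper's definition.

import torch

def query_causal_mask(num_pokes: int, num_queries: int) -> torch.Tensor:
    # Boolean mask (True = may attend) over poke tokens followed by query tokens.
    # Assumption: pokes attend to all pokes; each query attends to all pokes and
    # to itself, but not to other queries.
    n = num_pokes + num_queries
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :num_pokes] = True                  # every token sees the pokes
    idx = torch.arange(num_pokes, n)
    mask[idx, idx] = True                       # each query sees itself
    return mask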

Generative modelling of distributions often relies on multi-step approaches such as diffusion models, which are less prone to issues like mode collapse or mode averaging but suffer from high latency. Our Flow Poke Transformer is a single-step generative model that nonetheless predicts diverse modes and tends to be confident in correct modes. We find that the predicted uncertainty correlates strongly with the predicted motion's error relative to the ground truth.
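A simple uncertainty score that can be read off a predicted GMM is its total variance, obtained via the law of total variance. The snippet below is an assumed formulation for illustration, not necessarily the exact measure used in our evaluation.

import torch

def gmm_total_variance(weights, means, scales):
    # weights [..., K], means/scales [..., K, 2]; returns the trace of the
    # mixture covariance per query as a scalar uncertainty score.
    w = weights.unsqueeze(-1)                                     # [..., K, 1]
    mean = (w * means).sum(dim=-2)                                # mixture mean per dim
    second_moment = (w * (scales ** 2 + means ** 2)).sum(dim=-2)  # E[x^2] per dim
    return (second_moment - mean ** 2).sum(dim=-1)                # sum of per-dim variances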

High-level Overview

Given an image, a set of pokes (visualized as arrows →), and query positions (×), our model directly predicts an explicit distribution of the movement at each query position. The Flow Poke Transformer cross-attends to features from a jointly trained image encoder to incorporate visual information. Crucially, our architecture represents movement at individual points (enabling sparse and off-grid motion processing) and directly predicts continuous, multimodal output distributions.
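For concreteness, a minimal sketch of this interface is given below. It is not the actual FPT architecture (layer counts, embeddings, and the attention layout are placeholders); it only mirrors the described structure of point tokens cross-attending to image features, with a head that emits per-query GMM parameters.

import torch
import torch.nn as nn

class FlowPokeSketch(nn.Module):
    # Hypothetical stand-in for the described interface; all sizes are illustrative.
    def __init__(self, dim=256, n_components=8, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.point_embed = nn.Linear(2 + 2, dim)   # (x, y, flow_x, flow_y); queries use zero flow
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(dim, n_components * (1 + 2 + 2))  # weight, mean, scale per component
        self.n_components = n_components

    def forward(self, image, poke_xy, poke_flow, query_xy):
        ctx = self.patch_embed(image).flatten(2).transpose(1, 2)           # image tokens [B, N, D]
        pokes = self.point_embed(torch.cat([poke_xy, poke_flow], dim=-1))  # poke tokens  [B, P, D]
        queries = self.point_embed(
            torch.cat([query_xy, torch.zeros_like(query_xy)], dim=-1))     # query tokens [B, Q, D]
        tokens = torch.cat([pokes, queries], dim=1)
        out = self.decoder(tgt=tokens, memory=ctx)[:, pokes.shape[1]:]     # keep query tokens only
        K = self.n_components
        w, mu, log_s = self.head(out).split([K, 2 * K, 2 * K], dim=-1)
        B, Q, _ = query_xy.shape
        return (w.softmax(-1),                      # mixture weights   [B, Q, K]
                mu.view(B, Q, K, 2),                # component means   [B, Q, K, 2]
                log_s.view(B, Q, K, 2).exp())       # component scales  [B, Q, K, 2]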

Comparison and Capabilities

Face Motion Generation

We show fine-grained zero-shot poking results on faces and compare against InstantDrag, which was trained specifically for this task. We further visualize the predicted motion as warps using InstantDrag's face warping model. The qualitative results show that our model tends to predict more accurate and localized motion.

Moving Part Segmentation

We perform moving part segmentation with our method by thresholding the KL divergence between the pointwise unconditional motion distribution and the pointwise motion distribution conditioned on a specific poke. Our method shows strong moving part segmentation performance in generic open-set cases.
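A sketch of this criterion, assuming the GMM parametrization from above and a Monte Carlo estimate of the KL divergence, could look as follows; the KL direction and the threshold value here are our assumptions, and the paper defines the exact criterion.

import torch
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

def make_gmm(weights, means, scales):
    # Build a per-query diagonal GMM from predicted parameters.
    return MixtureSameFamily(Categorical(probs=weights),
                             Independent(Normal(means, scales), 1))

def moving_part_mask(p_uncond, p_cond, threshold, n_samples=128):
    # Monte Carlo estimate of KL(p_cond || p_uncond) per query point,
    # thresholded into a binary moving-part mask.
    x = p_cond.sample((n_samples,))                            # [S, B, Q, 2]
    kl = (p_cond.log_prob(x) - p_uncond.log_prob(x)).mean(0)   # [B, Q]
    return kl > threshold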

We directly replicate Fig. 7 from DragAPart [21] with our method. Our method provides spatially continuous predictions and makes fewer critical mistakes, such as segmenting the furniture body together with the drawer (top right). Quantitatively, we find that our method, especially when fine-tuned in-domain, outperforms DragAPart, which introduced this benchmark.

Articulated Motion Estimation

We compare on the Drag-A-Move dataset with Motion-I2V, DragAPart, and PuppetMaster. Our fine-tuned model is qualitatively more capable of capturing complex conditioning with multiple different pokes than DragAPart and PuppetMaster in this setup. Motion-I2V often fails to accurately follow the conditioning locally.

Unconditional Sampling

By sampling autoregressively without an initial poke, we can sample from the joint distribution of motion of the whole scene. We show such samples of generated flow without any prior poke conditioning. Our model can generate a wide variety of realistic motions.
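A hypothetical version of this sampling loop, reusing the GMM interface sketched above, predicts one query at a time, draws a flow vector from its mixture, and feeds the result back in as an additional poke. The visiting order and the interface are assumptions for illustration.

import torch

def sample_scene_flow(model, image, query_xy):
    # Autoregressive loop: predict a GMM for one query at a time, sample a flow
    # vector, and append it as a poke so later predictions remain consistent
    # with earlier ones. `model` follows the FlowPokeSketch interface above.
    B, Q, _ = query_xy.shape
    device = query_xy.device
    poke_xy = torch.zeros(B, 0, 2, device=device)
    poke_flow = torch.zeros(B, 0, 2, device=device)
    flows = torch.zeros(B, Q, 2, device=device)
    for i in torch.randperm(Q).tolist():                # random visiting order (assumption)
        q = query_xy[:, i:i + 1]                        # current query      [B, 1, 2]
        w, mu, scale = model(image, poke_xy, poke_flow, q)
        k = torch.multinomial(w[:, 0], 1)               # pick a component   [B, 1]
        idx = k.unsqueeze(-1).expand(-1, -1, 2)         # gather index       [B, 1, 2]
        f = torch.normal(mu[:, 0].gather(1, idx), scale[:, 0].gather(1, idx))
        flows[:, i] = f[:, 0]
        poke_xy = torch.cat([poke_xy, q], dim=1)        # condition subsequent queries
        poke_flow = torch.cat([poke_flow, f], dim=1)
    return flows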

BibTeX

@inproceedings{baumann2025whatif,
    title={What If: Understanding Motion Through Sparse Interactions}, 
    author={Stefan Andreas Baumann and Nick Stracke and Timy Phan and Bj{\"o}rn Ommer},
    booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    year={2025}
}