
Our Flow Poke Transformer directly models sparse kinematics for its queries and predicts multimodal motion distributions that are not only diverse but also reflect a realistic understanding of physical movement in the scene. Its training objective is to predict Gaussian Mixture Models (GMMs) in a single forward pass by minimizing the negative log-likelihood (NLL) of its output distributions. Once the GMMs have been predicted, samples can be drawn directly from them with minimal latency.
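As a concrete illustration, here is a minimal sketch of such a GMM objective and sampler, assuming diagonal-covariance components over 2D motion vectors. The tensor shapes, function names, and PyTorch parameterization are our assumptions for exposition, not the paper's implementation:

```python
import math
import torch
import torch.nn.functional as F

def gmm_nll(target, logits, means, log_scales):
    """NLL of `target` under a diagonal-covariance GMM (shapes are assumptions).

    target:     (B, Q, 2)     ground-truth 2D motion per query point
    logits:     (B, Q, K)     unnormalized mixture weights
    means:      (B, Q, K, 2)  component means
    log_scales: (B, Q, K, 2)  log standard deviations per dimension
    """
    log_w = F.log_softmax(logits, dim=-1)                        # (B, Q, K)
    diff = target.unsqueeze(-2) - means                          # (B, Q, K, 2)
    comp_ll = (-0.5 * diff**2 * torch.exp(-2.0 * log_scales)
               - log_scales - 0.5 * math.log(2.0 * math.pi)).sum(-1)
    return -torch.logsumexp(log_w + comp_ll, dim=-1).mean()

@torch.no_grad()
def gmm_sample(logits, means, log_scales):
    """Draw one motion sample per query directly from the predicted GMM:
    pick a component from the mixture weights, then sample its Gaussian."""
    comp = torch.distributions.Categorical(logits=logits).sample()   # (B, Q)
    idx = comp[..., None, None].expand(*comp.shape, 1, means.shape[-1])
    mu = means.gather(-2, idx).squeeze(-2)                           # (B, Q, 2)
    sigma = log_scales.gather(-2, idx).squeeze(-2).exp()
    return mu + sigma * torch.randn_like(mu)
```

Because sampling is just a categorical draw plus a reparameterized Gaussian draw, it needs no further network evaluations, which is what makes the per-sample latency minimal.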
The primary advantage of modeling sparse motion is the reduced computational overhead during both training and inference. Coupled with our query-causal attention pattern (see the paper for details), which further reduces computational complexity, this lets us train a Flow Poke Transformer within one day on 8 H200s and sample more than 100k queries per second for a single image on one H200.
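To make the attention pattern concrete, the sketch below builds one plausible query-causal mask: context tokens attend to each other, while each query attends to the context and to itself only, so queries never see one another. This structure is our assumption from the name; the exact pattern is specified in the paper:

```python
import torch

def query_causal_mask(n_ctx: int, n_query: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) over [context | queries]."""
    n = n_ctx + n_query
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_ctx, :n_ctx] = True                     # context <-> context
    mask[n_ctx:, :n_ctx] = True                     # queries  -> context
    mask[n_ctx:, n_ctx:] = torch.eye(n_query, dtype=torch.bool)  # query -> itself
    return mask
```

A mask like this can be passed as a boolean `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`. Since each query row only interacts with the context and itself, the per-query attention cost is independent of the number of other queries, so throughput scales linearly as queries are added.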

Generative modeling of distributions typically benefits from multi-step approaches such as diffusion models, which suffer from high latency but are less prone to issues like mode collapse or mode averaging. Our Flow Poke Transformer is a single-step generative model that nonetheless predicts diverse modes and tends to be confident in correct ones. We find that the predicted uncertainty correlates strongly with the predicted motion's error relative to the ground truth.
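One way to measure such a correlation, sketched below under our assumptions, is to take the mixture mean as the point prediction and the mixture's total variance as the uncertainty, then compute a Pearson correlation against the L2 error. This reuses the hypothetical GMM parameterization from the first sketch and is not necessarily the metric used in the paper:

```python
import torch

def gmm_mean_and_var(logits, means, log_scales):
    """Mixture mean and scalar total variance per query.

    Total variance = E[within-component variance] + Var[component means],
    summed over the 2 motion dimensions to give one uncertainty value."""
    w = torch.softmax(logits, dim=-1).unsqueeze(-1)        # (B, Q, K, 1)
    mean = (w * means).sum(-2)                             # (B, Q, 2)
    var_within = (w * torch.exp(2.0 * log_scales)).sum(-2)
    var_between = (w * (means - mean.unsqueeze(-2))**2).sum(-2)
    return mean, (var_within + var_between).sum(-1)        # (B, Q)

def uncertainty_error_corr(logits, means, log_scales, target):
    """Pearson correlation between predicted uncertainty and L2 motion error."""
    mean, var = gmm_mean_and_var(logits, means, log_scales)
    err = (mean - target).norm(dim=-1)                     # (B, Q)
    return torch.corrcoef(torch.stack([var.flatten(), err.flatten()]))[0, 1]
```

A strong positive value here means the model "knows when it does not know": queries with ambiguous or hard-to-predict motion receive wide distributions rather than confidently wrong point estimates.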
