Generating and representing human behavior are of major importance for various computer vision applications. Commonly, human video synthesis represents behavior as sequences of postures while directly predicting their likely progressions or merely changing the appearance of the depicted persons, thus not being able to exercise control over their actual behavior during the synthesis process. In contrast, controlled behavior synthesis and transfer across individuals requires a deep understanding of body dynamics and calls for a representation of behavior that is independent of appearance and also of specific postures. In this work, we present a model for human behavior synthesis which learns a dedicated representation of human dynamics independent of postures. Using this representation, we are able to change the behavior of a person depicted in an arbitrary posture, or to even directly transfer behavior observed in a given video sequence. To this end, we propose a conditional variational framework which explicitly disentangles posture from behavior. We demonstrate the effectiveness of our approach on this novel task, evaluating capturing, transferring, and sampling fine-grained, diverse behavior, both quantitatively and qualitatively.
Below, we present results and applications of our model.
We transfer fine-grained, characteristic body dynamics of an observed behavior \(\boldsymbol{x}_\beta\) to unrelated, significantly different target postures \(x_t\). If required, the target posture is first adjusted in a transition phase before the inferred behavior is re-enacted. A minimal sketch of this transfer is shown below.
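The following PyTorch sketch illustrates the idea only at a schematic level: a behavior encoder summarizes the observed sequence into a code \(z_\beta\), and a decoder conditioned on both \(z_\beta\) and the target posture \(x_t\) generates the re-enacted sequence. All module and dimension names (`behavior_encoder`, `decoder`, `D_POSE`, `D_BETA`, `T`) are hypothetical placeholders, not the architectures used in the paper.

```python
import torch
import torch.nn as nn

# Placeholder dimensions: pose vector size, behavior code size, sequence length.
D_POSE, D_BETA, T = 34, 64, 16

# Stand-ins for the learned networks: a recurrent behavior encoder q(z_beta | x_beta)
# and a conditional decoder p(x | z_beta, x_t).
behavior_encoder = nn.GRU(D_POSE, D_BETA, batch_first=True)
decoder = nn.GRU(D_BETA + D_POSE, D_POSE, batch_first=True)

def transfer(x_beta: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
    """Re-enact the behavior observed in x_beta starting from target posture x_t."""
    _, h = behavior_encoder(x_beta)                       # summarize the dynamics
    z_beta = h[-1]                                        # behavior code, shape (B, D_BETA)
    # Condition every generation step on the behavior code and the target posture.
    cond = torch.cat([z_beta.unsqueeze(1).expand(-1, T, -1),
                      x_t.unsqueeze(1).expand(-1, T, -1)], dim=-1)
    y, _ = decoder(cond)                                  # generated posture sequence (B, T, D_POSE)
    return y

x_beta = torch.randn(1, T, D_POSE)                        # observed behavior sequence
x_t = torch.randn(1, D_POSE)                              # unrelated target posture
generated = transfer(x_beta, x_t)
```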
As we learn a parametric prior behavior distribution \(p(z_\beta)\), we can use our model to sample novel, unseen behaviors for a given target posture.
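A minimal sketch of this sampling procedure, assuming for illustration a standard normal prior \(p(z_\beta)\) and the same placeholder decoder as above: behavior codes are drawn from the prior and decoded for one fixed target posture.

```python
import torch
import torch.nn as nn

D_POSE, D_BETA, T = 34, 64, 16                               # placeholder dimensions
decoder = nn.GRU(D_BETA + D_POSE, D_POSE, batch_first=True)  # stand-in for p_theta(x | z_beta, x_t)

x_t = torch.randn(1, D_POSE)                                 # fixed target posture
samples = []
for _ in range(5):
    z_beta = torch.randn(1, D_BETA)                          # z_beta ~ p(z_beta), here a standard normal
    cond = torch.cat([z_beta.unsqueeze(1).expand(-1, T, -1),
                      x_t.unsqueeze(1).expand(-1, T, -1)], dim=-1)
    y, _ = decoder(cond)                                     # a novel behavior for the same posture
    samples.append(y)
```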
We interpolate between the behavior observed in two sequences \(\boldsymbol{x}_\beta^1\) and \(\boldsymbol{x}_\beta^2\). To this end, we first extract their corresponding behavior representations \(z_\beta^1, z_\beta^2\) and interpolate between them at equidistant steps, i.e. \((1 - \lambda) \cdot z_\beta^1 + \lambda \cdot z_\beta^2; \; \lambda \in \{0.0,0.2,0.4,0.6,0.8,1.0\}\). Next, we generate a sequence of interpolated behavior using our decoder \(p_\theta(\boldsymbol{x}|z_\beta,x_t)\), with \(x_t\) being the first frame of \(\boldsymbol{x}_\beta^1\) and \(\boldsymbol{x}_\beta^2\), respectively. Note that for \(\lambda \in \{0.0, 1.0\}\) we essentially reconstruct the source sequences \(\boldsymbol{x}_\beta^1\), \(\boldsymbol{x}_\beta^2\).
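The interpolation itself is a simple convex combination in the behavior latent space. The sketch below (again with placeholder encoder and decoder modules, not the paper's architectures) shows how each interpolated code could be decoded starting from the first frame of either source sequence.

```python
import torch
import torch.nn as nn

D_POSE, D_BETA, T = 34, 64, 16                               # placeholder dimensions
behavior_encoder = nn.GRU(D_POSE, D_BETA, batch_first=True)
decoder = nn.GRU(D_BETA + D_POSE, D_POSE, batch_first=True)

def encode(x):
    """Behavior code z_beta summarizing the dynamics of sequence x."""
    _, h = behavior_encoder(x)
    return h[-1]

def decode(z_beta, x_t):
    """Generate a posture sequence from a behavior code and a target posture."""
    cond = torch.cat([z_beta.unsqueeze(1).expand(-1, T, -1),
                      x_t.unsqueeze(1).expand(-1, T, -1)], dim=-1)
    return decoder(cond)[0]

x1, x2 = torch.randn(1, T, D_POSE), torch.randn(1, T, D_POSE)  # two observed sequences
z1, z2 = encode(x1), encode(x2)
for lam in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    z = (1 - lam) * z1 + lam * z2                            # convex combination in behavior space
    row1 = decode(z, x1[:, 0])                               # anchored at the first frame of x1
    row2 = decode(z, x2[:, 0])                               # anchored at the first frame of x2
```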
Our approach to disentangling latent representations is applicable not only to factorizing static from temporal information, but also to disentangling posture from appearance. We demonstrate this capability of our model on the DeepFashion and Market1501 datasets.
The research leading to these results is funded by the German Federal Ministry for Economic Affairs and Energy within the project “KI-Absicherung – Safe AI for automated driving” and by the German Research Foundation (DFG) within project 421703927.