DisMo: Disentangled Motion Representations for Open-World Motion Transfer

CompVis @ LMU Munich, MCML
* equal contribution
NeurIPS 2025 Spotlight

TL;DR: We present DisMo, a paradigm that learns a semantic motion representation space from videos that is disentangled from static content information such as appearance, structure, viewing angle and even object category. We leverage this invariance and condition off-the-shelf video models on extracted motion embeddings. This setup achieves state-of-the-art performance on open-world motion transfer with a high degree of transferability in cross-category and -viewpoint settings. Beyond that, DisMo's learned representations are suitable for downstream tasks such as zero-shot action classification.

Abstract

Recent advances in text-to-video (T2V) and image-to-video (I2V) models have enabled the creation of visually compelling and dynamic videos from simple textual descriptions or initial frames. However, these models often fail to provide an explicit representation of motion separate from content, limiting their applicability for content creators. To address this gap, we propose DisMo, a novel paradigm for learning abstract motion representations directly from raw video data via an image-space reconstruction objective. Our representation is generic and independent of static information such as appearance, object identity, or pose. This enables open-world motion transfer, allowing motion to be transferred across semantically unrelated entities without requiring object correspondences, even between vastly different categories.

Unlike prior methods, which trade off motion fidelity against prompt adherence, either overfitting to the source structure or drifting from the described action, our approach disentangles motion semantics from appearance, enabling accurate transfer and faithful conditioning. Furthermore, our motion representation can be combined with any existing video generator via lightweight adapters, allowing us to effortlessly benefit from future advancements in video models. We demonstrate the effectiveness of our method through a diverse set of motion transfer tasks. Finally, we show that the learned representations are well-suited for downstream motion understanding tasks, consistently outperforming state-of-the-art video representation models such as V-JEPA in zero-shot action classification on benchmarks including Something-Something v2 and Jester.

Motion Transfer

We present qualitative results of open-world motion transfer. Each video shows the source motion on the left and generated targets on the right.

Cross-category motion transfer

Motion from a dog transferred to another non-human and a human target, while preserving motion semantics.

Cross-viewpoint motion transfer

Source motion captured from a front-facing viewpoint, transferred to targets viewed from oblique angles.

Camera trajectory transfer

Camera trajectories are transferred to vastly different target scenes.

Appearance-invariant transfer

Different targets share the same semantic motion while retaining their individual appearance, structure and position.

Method

(a) During training, our motion extractor receives augmented video frames and outputs a sequence of motion embeddings. Each embedding is then passed to a frame generator together with the corresponding source frame, from which the generator learns to reconstruct a frame at a future timestep. (b) To transfer a sequence of motion embeddings onto a new target image, the frame generator can be applied autoregressively as a low-cost option. (c) For high-quality motion transfer, we adapt pre-trained off-the-shelf video models using conditional LoRA layers. The motion sequence is arranged such that each video token receives a conditioning signal only from the motion embedding that is temporally aligned with it.
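To make parts (a) and (b) concrete, below is a minimal PyTorch-style sketch of the reconstruction objective and the low-cost autoregressive transfer. The module interfaces (a motion extractor returning one embedding per frame transition, a frame generator taking a source frame plus one motion embedding), the noise-based stand-in augmentation, and the MSE loss are illustrative assumptions, not the exact implementation.

import torch
import torch.nn.functional as F

def training_step(motion_extractor, frame_generator, video, optimizer):
    """One reconstruction step on a raw clip of shape (B, T, C, H, W)."""
    # Stand-in for the appearance augmentations applied before motion
    # extraction (the actual augmentation strategy will differ); the idea is
    # to discourage the embeddings from carrying static content information.
    augmented = video + 0.05 * torch.randn_like(video)
    motion = motion_extractor(augmented)              # (B, T-1, D) motion embeddings

    src = video[:, :-1]                               # source frames t
    tgt = video[:, 1:]                                # target frames t+1
    pred = frame_generator(src.flatten(0, 1),         # each source frame ...
                           motion.flatten(0, 1))      # ... with its motion embedding

    loss = F.mse_loss(pred, tgt.flatten(0, 1))        # image-space reconstruction
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def autoregressive_transfer(frame_generator, motion, target_image):
    """Low-cost transfer (b): roll the frame generator out from a new target image."""
    frames = [target_image]                           # (B, C, H, W) starting frame
    for t in range(motion.shape[1]):
        frames.append(frame_generator(frames[-1], motion[:, t]))
    return torch.stack(frames, dim=1)                 # (B, T, C, H, W) generated clip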

More Examples

Quantitative Motion Transfer Evaluation

We quantitatively evaluate open-world motion transfer using both automatic metrics and human studies. While prior methods exhibit a clear trade-off between motion fidelity and prompt adherence, DisMo achieves state-of-the-art performance on both simultaneously, breaking the usual compromise.

Quantitative motion transfer evaluation with automatic metrics where DisMo outperforms all baselines without a trade-off between motion fidelity and prompt adherence.

Disentangled Motion Embeddings

We analyze disentanglement using a k-NN retrieval setup where queries target either action or identity. For motion transfer, desirable neighbors share the same action but differ in identity. DisMo exhibits exactly this behavior: it retrieves videos with the same motion performed by different actors, while standard video representation learning baselines mostly retrieve the same identity performing different actions.

k-NN retrieval evaluation showing that DisMo retrieves clips with the same action and different identities, while baselines focus on identity.
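For reference, the retrieval probe can be sketched as cosine-similarity k-NN over pooled motion embeddings; the temporal mean pooling and the choice of k below are assumptions rather than the exact evaluation protocol.

import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_retrieve(motion_extractor, query_clip, gallery_clips, k=5):
    """Return indices of the k gallery clips closest to the query in motion space."""
    def embed(clip):                                   # clip: (T, C, H, W)
        z = motion_extractor(clip.unsqueeze(0))        # (1, T-1, D) motion embeddings
        return F.normalize(z.mean(dim=1), dim=-1)      # temporal mean pooling

    q = embed(query_clip)                              # (1, D)
    g = torch.cat([embed(c) for c in gallery_clips])   # (N, D)
    sims = (q @ g.T).squeeze(0)                        # cosine similarities
    return sims.topk(k).indices

For a disentangled representation, the returned neighbors should agree with the query on the action label while differing in identity.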

Zero-Shot Action Classification

Beyond motion transfer, we demonstrate that DisMo's motion embeddings are genuinely semantic. In a zero-shot action classification setup, where no task-specific fine-tuning is performed, DisMo outperforms video representation baselines on multiple datasets. This shows that DisMo captures high-level temporal dynamics that are directly useful for downstream motion understanding tasks.

Zero-shot action classification results showing DisMo outperforming other video representation baselines.
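As a rough illustration of how such embeddings can be used without any fine-tuning, the sketch below classifies a clip by comparing it against per-action prototype embeddings averaged from a handful of labelled examples. This nearest-prototype protocol is an assumption and may differ from the exact evaluation setup; in either case, no DisMo parameters are updated.

import torch
import torch.nn.functional as F

@torch.no_grad()
def embed_clip(motion_extractor, clip):
    """Pool a clip (T, C, H, W) into a single normalized motion embedding."""
    z = motion_extractor(clip.unsqueeze(0))            # (1, T-1, D)
    return F.normalize(z.mean(dim=1), dim=-1)          # (1, D)

@torch.no_grad()
def zero_shot_classify(motion_extractor, test_clip, examples_per_action):
    """examples_per_action: dict mapping action label -> list of example clips."""
    prototypes = {}
    for label, clips in examples_per_action.items():
        embs = torch.cat([embed_clip(motion_extractor, c) for c in clips])
        prototypes[label] = F.normalize(embs.mean(0, keepdim=True), dim=-1)

    q = embed_clip(motion_extractor, test_clip)
    scores = {label: (q @ p.T).item() for label, p in prototypes.items()}
    return max(scores, key=scores.get)                 # closest prototype wins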

BibTeX

@inproceedings{resslerdismo,
  title={DisMo: Disentangled Motion Representations for Open-World Motion Transfer},
  author={Ressler-Antal, Thomas and Fundel, Frank and Alaya, Malek Ben and Baumann, Stefan Andreas and Krause, Felix and Gui, Ming and Ommer, Bj{\"o}rn},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025}
}