Envisioning the Future, One Step at a Time

Baumann, Stefan Andreas; Wiese, Jannik; Martorella, Tommaso; Kalayeh, Mahdi M.; Ommer, Björn

CVPR 2026

Stefan Andreas Baumann¹,²,*, Jannik Wiese¹,²,*, Tommaso Martorella¹,², Mahdi M. Kalayeh³, Björn Ommer¹,²

¹CompVis @ LMU Munich, ²MCML, ³Netflix

*Equal contribution

Diverse future motion predictions from single images across different open-world scenes. From a single image, our model envisions diverse, physically consistent futures by predicting sparse point trajectories step by step. Its efficiency enables exploring thousands of counterfactual rollouts directly in motion space, illustrated here for billiards planning, where candidate shots are evaluated by simulating many possible outcomes.

Planning billiard shots by exploring thousands of counterfactual motion trajectories.

TL;DR

Instead of generating dense future video, we predict distributions over sparse point trajectories, step by step, from a single image. An autoregressive diffusion model with an efficiency-oriented architecture makes this orders of magnitude faster than video-based world models: fast enough to explore thousands of plausible futures and plan over them.

3,000× faster than video models: 2,200 vs <1 trajectory samples per minute

10× fewer parameters: 0.6B vs 1.3-14B for video baselines

5× more accurate under a fixed compute budget: OWM minADE of 0.013 vs 0.066 for the best video model

78% vs 16% billiard planning accuracy: ours vs the best dense video baseline

Method

We formulate future reasoning as autoregressive prediction over sparse point trajectories. Given a single image and a set of query points, the model factorizes the joint distribution over future motion causally: first over time, then over individual trajectories within each step. A lightweight flow-matching head captures the multi-modal distribution of next-step displacements, enabling fast sampling with KV-cache decoding.
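The causal factorization described above can be written compactly. The notation here is our own and is only an illustrative assumption: $I$ is the input image, $x_t^{(n)}$ is the position of trajectory $n$ at step $t$, $x_0^{(1:N)}$ are the query points, and $x_{<t}^{(1:N)}$ denotes all trajectories at steps $0$ to $t-1$.

```latex
p\!\left(x_{1:T}^{(1:N)} \mid I,\, x_0^{(1:N)}\right)
  = \prod_{t=1}^{T} \prod_{n=1}^{N}
    p\!\left(x_t^{(n)} \mid x_t^{(<n)},\, x_{<t}^{(1:N)},\, I\right)
```

The outer product runs over time steps and the inner product over trajectories within a step, so each factor is one next-step displacement distribution of the kind the flow-matching head is described as modeling.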

**Motion Tokens.** Each token combines Fourier-embedded motion, a randomized trajectory identifier, and image features bilinearly sampled at the current and origin positions.
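As a rough illustration of these ingredients, the sketch below builds one token in NumPy. It is not the authors' implementation: the function names (`fourier_embed`, `bilinear_sample`, `build_token`), the octave-spaced frequencies, and the use of plain concatenation to combine the parts are all assumptions made for the example.

```python
import numpy as np

def fourier_embed(motion, num_freqs=4):
    """Embed 2D displacements as sin/cos features at octave-spaced frequencies.

    motion: array of shape (..., 2). Returns shape (..., 4 * num_freqs).
    """
    motion = np.asarray(motion, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi        # (F,), assumed spacing
    angles = motion[..., None] * freqs                   # (..., 2, F)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*motion.shape[:-1], -1)

def bilinear_sample(feat, xy):
    """Sample a feature map feat (H, W, C) at pixel coordinates xy = (x, y)."""
    H, W, _ = feat.shape
    x = float(np.clip(xy[0], 0, W - 1))
    y = float(np.clip(xy[1], 0, H - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom

def build_token(motion, traj_id, feat, current_xy, origin_xy):
    """Concatenate the token parts (the combination rule is an assumption)."""
    return np.concatenate([
        fourier_embed(motion),               # embedded displacement
        traj_id,                             # random per-trajectory identifier
        bilinear_sample(feat, current_xy),   # image features at current position
        bilinear_sample(feat, origin_xy),    # image features at origin position
    ])
```

In a trained model the parts would more likely be projected to a common width and summed; concatenation is used here only to keep the example short.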

**Positional Encoding.** Motion and image tokens share one reference frame via axial RoPE that encodes current position, origin position, and time.
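A minimal sketch of axial rotary embeddings follows, assuming the common construction in which the channel dimension is split into equal slices and each slice is rotated according to one coordinate axis. The five-axis layout (current x, current y, origin x, origin y, time), the frequency base, and the function names are assumptions for illustration, not details from the paper.

```python
import numpy as np

def rope_1d(x, pos, base=100.0):
    """Rotate consecutive channel pairs of x (..., D) by angles pos * freq."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)            # (D/2,)
    ang = pos[..., None] * freqs                         # (..., D/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def axial_rope(x, coords):
    """x: (N, D) queries or keys; coords: (N, A) with one column per axis.

    Each axis rotates its own D / A channel slice, so D must be divisible
    by 2 * A.
    """
    num_axes = coords.shape[-1]
    chunks = np.split(x, num_axes, axis=-1)
    rotated = [rope_1d(c, coords[..., a]) for a, c in enumerate(chunks)]
    return np.concatenate(rotated, axis=-1)
```

Because each slice is a pure rotation, the dot product between a rotated query and a rotated key depends only on the coordinate differences along each axis, which is what lets tokens of different kinds share one reference frame.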

**Fast Reasoning Blocks.** Parallel residual blocks fuse self-attention, cross-attention, and FFN into a single step, instead of running them as standard sequential layers, cutting kernel launches for high rollout throughput.
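The data flow of such a parallel block can be sketched in NumPy as below. This only illustrates the structure (three branches reading the same normalized input and added to the residual in one step); it does not show kernel fusion itself. The single-head attention, the shared normalization, the ReLU FFN, and the parameter names in `p` are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q, k, v, causal=False):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Mask out keys that lie in the future of each query.
        scores = np.where(np.tril(np.ones_like(scores)) > 0, scores, -np.inf)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v

def parallel_block(x, ctx, p):
    """x: (N, D) motion tokens; ctx: (M, D) image tokens; p: weight dict."""
    h = layer_norm(x)  # one normalization feeds all three branches
    self_attn = attention(h @ p["wq"], h @ p["wk"], h @ p["wv"], causal=True)
    cross_attn = attention(h @ p["cq"], ctx @ p["ck"], ctx @ p["cv"])
    ffn = np.maximum(h @ p["w1"], 0.0) @ p["w2"]
    # All branches are summed into the residual at once, rather than
    # applied one after another as in a sequential transformer layer.
    return x + self_attn + cross_attn + ffn
```

Since no branch waits on another's output, an optimized implementation can batch their input projections together, which is one plausible source of the reduced kernel launches mentioned above.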

**Flow-Matching Head.** A cached conditioning mechanism and a multiscale tanh-saturated input stack handle the heavy-tailed distribution of real-world motion.
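To make the two ideas concrete, here is a hedged sketch. `multiscale_tanh` is our reading of a "multiscale tanh-saturated" input: the same value is squashed at several scales so that both tiny and very large displacements remain distinguishable. `euler_sample` is a generic flow-matching sampler that integrates a velocity field from noise (t = 0) to data (t = 1), reusing one conditioning vector across all steps. The scales, step count, time convention, and the toy velocity field are assumptions, not the paper's settings.

```python
import numpy as np

def multiscale_tanh(x, scales=(0.01, 0.1, 1.0)):
    """Stack tanh(x / s) over several scales; every output lies in [-1, 1]."""
    x = np.asarray(x, dtype=np.float64)
    return np.concatenate([np.tanh(x / s) for s in scales], axis=-1)

def euler_sample(velocity_fn, cond, dim=2, steps=8, rng=None):
    """Draw one sample by Euler-integrating dx/dt = velocity_fn(x, t, cond).

    cond is computed once by the backbone and reused at every step, which
    is the role a cached conditioning would play.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(size=dim)             # start from Gaussian noise at t = 0
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_fn(x, k * dt, cond)
    return x

def toy_velocity(x, t, cond):
    """Exact velocity field when the target is a point mass at cond.

    Stand-in for a learned network, used only to exercise the sampler.
    """
    return (cond - x) / (1.0 - t)
```

A learned head would replace `toy_velocity` and could apply `multiscale_tanh` to its noisy input before the first layer.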

OWM: Open-World Motion Benchmark

To evaluate open-world motion prediction, we introduce OWM, a benchmark of 95 diverse in-the-wild videos recorded with static cameras. Each scene provides a reference frame, query points, and verified ground-truth trajectories spanning 2.5 to 6.5 seconds. We assess both accuracy (best-of-N) and search efficiency: given a fixed 5-minute wall-clock budget on a reference GPU, how many plausible futures can a method explore? We supplement OWM with physical diagnostics from Physics-IQ and Physion.

**OWM Composition.** The benchmark covers a wide variety of motion settings across rigid/non-rigid objects, single/multi-agent scenes, and constrained/free-will dynamics.

**Examples.** Diverse real-world scenes from OWM, shown with predicted trajectories, spanning different motion types and complexities.

Results

Evaluation Setup

We compare against state-of-the-art open-weight video generation models. For video baselines, trajectories are extracted from generated frames using off-the-shelf point trackers.

minADE ↓: Mean L2 distance between the ground truth and the closest hypothesis. Lower is better.
N=5: 5 hypotheses per method; the best one is scored.
T=5min: Fixed 5-minute GPU budget (equal compute). Each method generates as many hypotheses as possible and the best one is scored. Measures search efficiency. DNF = did not finish.
(a) OWM: Open-world motion (in-the-wild videos).
(b) Physics-IQ: Physical plausibility in controlled solid-mechanics settings (Motamed et al., WACV 2026).
(c) Physion: Intuitive physics understanding benchmark (Bear et al., NeurIPS D&B 2021).
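
The minADE metric defined above can be sketched as follows. The array layout and the choice to average over all query points and time steps before taking the minimum over hypotheses are assumptions; the benchmark code may aggregate differently.

```python
import numpy as np

def min_ade(hypotheses, ground_truth):
    """Best-of-N average displacement error.

    hypotheses:   (N, P, T, 2) predicted positions for N hypotheses,
                  P query points, and T time steps.
    ground_truth: (P, T, 2) reference positions.
    """
    hyps = np.asarray(hypotheses, dtype=np.float64)
    gt = np.asarray(ground_truth, dtype=np.float64)
    dist = np.linalg.norm(hyps - gt[None], axis=-1)   # (N, P, T) L2 errors
    ade = dist.mean(axis=(1, 2))                      # one score per hypothesis
    return float(ade.min())                           # keep the closest one
```

Because the minimum is taken over hypotheses, the score can only improve as N grows, which is why the N=5 and T=5min settings reward sampling speed as well as model quality.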

Open-World Motion & Physical Diagnostics

With only 5 samples, our approach, despite being orders of magnitude faster and over 10× smaller, matches the prediction accuracy of the best open-weight video generation models. Under the primary 5-minute budget, this efficiency advantage becomes decisive: most video models cannot even finish within the time limit, while ours generates thousands of hypotheses to find substantially more accurate predictions.

Method	Params	Throughput	(a) OWM ↓, N=5	(a) OWM ↓, T=5min	(b) Physics-IQ ↓, N=5	(b) Physics-IQ ↓, T=5min	(c) Physion ↓, N=5	(c) Physion ↓, T=5min
MAGI-1	4.5B	0.3 / min	0.037	0.066	0.126	0.169	0.061	0.081
Wan2.2 I2V	14B	0.14 / min	0.039	DNF	0.116	DNF	0.069	DNF
CogVideoX 1.5	5B	0.05 / min	0.051	DNF	0.100	DNF	0.063	DNF
SkyReels V2 (DF)	1.3B	0.3 / min	0.058	0.068	0.128	0.137	0.069	0.084
SVD 1.1	1.5B	0.71 / min	0.054	0.119	0.138	0.241	0.070	0.147
Ours	0.6B	2,200 / min	0.029	0.013	0.115	0.045	0.048	0.020
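
As a back-of-the-envelope reading of the throughput column, the snippet below estimates how many complete hypotheses each method could produce within the 5-minute budget. It assumes throughput stays constant over the budget and ignores any fixed startup cost, so the numbers are indicative only.

```python
import math

BUDGET_MIN = 5

# Samples per minute, copied from the throughput column of the table.
throughput = {
    "MAGI-1": 0.3,
    "Wan2.2 I2V": 0.14,
    "CogVideoX 1.5": 0.05,
    "SkyReels V2 (DF)": 0.3,
    "SVD 1.1": 0.71,
    "Ours": 2200,
}

def hypotheses_in_budget(rate_per_min, budget_min=BUDGET_MIN):
    """Number of complete samples that fit in the budget at a constant rate."""
    return math.floor(rate_per_min * budget_min)

counts = {name: hypotheses_in_budget(rate) for name, rate in throughput.items()}
for name, n in counts.items():
    print(f"{name}: {n}")
```

Under these assumptions the two methods with a count of zero are exactly the ones marked DNF, and the remaining video baselines complete fewer than five samples, which is consistent with their T=5min scores being worse than their N=5 scores.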

**Time-Accuracy Trade-off on OWM.** Log time versus best-of-N MSE. More hypotheses improve accuracy for all methods; our sparse formulation makes this orders of magnitude more efficient.

minADE (lower is better) across all OWM subsets. Values are mean L2 distance in normalized coordinates.

Method	Rigid, N=5	Rigid, T=5min	Non-rigid, N=5	Non-rigid, T=5min	Single-agent, N=5	Single-agent, T=5min	Multi-agent, N=5	Multi-agent, T=5min	w/ Free will, N=5	w/ Free will, T=5min	w/o Free will, N=5	w/o Free will, T=5min
MAGI-1	.032	.058	.039	.069	.020	.044	.048	.080	.040	.066	.030	.065
Wan2.2	.042	DNF	.038	DNF	.039	DNF	.039	DNF	.036	DNF	.045	DNF
CogVideoX	.051	DNF	.051	DNF	.041	DNF	.052	DNF	.049	DNF	.054	DNF
SkyReels V2	.061	.071	.056	.066	.048	.056	.064	.075	.054	.063	.065	.076
SVD 1.1	.048	.055	.057	.073	.037	.053	.065	.077	.060	.069	.042	.064
Ours	.031	.007	.039	.016	.036	.008	.044	.017	.037	.014	.044	.011

Downstream: Billiard Planning

The efficiency of our model enables a new capability: planning by exploring motion-space rollouts. Importantly, the model is completely task-agnostic: it is never trained for or made aware of any downstream planning objective, and simply predicts how the scene might evolve.

To isolate the model's suitability for planning from any influence of the planning algorithm, we deliberately use the simplest possible approach: pure random search over candidate actions. Initial velocities ("pokes") on the cue ball are sampled at random, the model rolls out many stochastic futures for each, and the action whose rollouts best satisfy the task objective is selected. Even with this naive planner, our approach achieves 78% accuracy, far above all baseline models (4% to 16%) and approaching the 84% of a ground-truth physics simulator oracle. A more sophisticated planning algorithm would likely yield substantially better results still.
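
The random-search procedure described above can be sketched in a few lines. The planner only needs two callables: a stochastic rollout function and a reward function. Here `toy_rollout`, `toy_reward`, the target position, and all hyperparameters are hypothetical stand-ins; in the setting described, the rollout would be the motion model conditioned on a poke, and the reward would encode the billiards objective.

```python
import numpy as np

def random_search_plan(rollout_fn, reward_fn, num_actions=64,
                       num_rollouts=16, max_speed=1.0, rng=None):
    """Pick the action whose stochastic rollouts score best on average."""
    rng = np.random.default_rng() if rng is None else rng
    # Candidate initial velocities ("pokes"), sampled uniformly at random.
    actions = rng.uniform(-max_speed, max_speed, size=(num_actions, 2))
    scores = np.array([
        np.mean([reward_fn(rollout_fn(a, rng)) for _ in range(num_rollouts)])
        for a in actions
    ])
    best = int(np.argmax(scores))
    return actions[best], float(scores[best])

TARGET = np.array([0.5, -0.3])  # hypothetical goal position

def toy_rollout(action, rng):
    """Stand-in world model: noisy final ball position for a given poke."""
    return action + rng.normal(scale=0.05, size=2)

def toy_reward(final_pos):
    """1.0 if the ball ends within 0.2 of the target, else 0.0."""
    return float(np.linalg.norm(final_pos - TARGET) < 0.2)

action, score = random_search_plan(
    toy_rollout, toy_reward, num_actions=1000, rng=np.random.default_rng(0)
)
```

The cost is `num_actions * num_rollouts` model calls per plan, which is why rollout throughput matters so much for this kind of search.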

**Planning via Random Search in Motion Space.** Candidate initial velocities are sampled randomly; for each, the model rolls out many stochastic trajectories and scores them against the task objective, and the action with the highest expected reward is selected. The model has no knowledge of the planning task. *Right:* an illustrated walkthrough of the billiard planning process.

Method	Accuracy ↑	Throughput (actions/min)
Simulator Oracle	84%	55,162
Images-to-Video Diff.	16%	19.8
AR Images-to-Video Diff.	8%	18.6
Full Trajectory Diffusion	8%	160.8
Flow Poke Transformer	4%	13,423
Ours	78%	496

All methods use the same random-search planner to ensure a fair comparison that isolates world model quality from planning algorithm sophistication.

BibTeX

@inproceedings{baumann2026envisioning,
  title     = {Envisioning the Future, One Step at a Time},
  author    = {Stefan Andreas Baumann and Jannik Wiese and Tommaso Martorella and Mahdi M. Kalayeh and Bj{\"o}rn Ommer},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}