CVPR 2026

Envisioning the Future,
One Step at a Time

1 CompVis @ LMU Munich, 2 MCML, 3 Netflix

*Equal contribution

Diverse future motion predictions from single images across different open-world scenes.
Planning billiard shots by exploring thousands of counterfactual motion trajectories.
From a single image, our model envisions diverse, physically consistent futures by predicting sparse point trajectories step by step. Its efficiency enables exploring thousands of counterfactual rollouts directly in motion space - here illustrated for billiards planning, where candidate shots are evaluated by simulating many possible outcomes.

TL;DR

Instead of generating dense future video, we predict distributions over sparse point trajectories, step by step, from a single image. An autoregressive diffusion model with an efficiency-oriented architecture makes this orders of magnitude faster than video-based world models - fast enough to explore thousands of plausible futures and plan over them.

3,000× faster than video models: 2,200 vs <1 trajectory samples per minute
10× fewer parameters: 0.6B vs 1.3-14B for video baselines
5× more accurate under compute budget: OWM minADE 0.013 vs 0.066 for best video model
78% vs 16% billiard planning accuracy: ours vs best dense video baseline

Method

We formulate future reasoning as autoregressive prediction over sparse point trajectories. Given a single image and a set of query points, the model factorizes the joint distribution over future motion causally - first over time, then over individual trajectories within each step. A lightweight flow-matching head captures the multi-modal distribution of next-step displacements, enabling fast sampling with KV-cache decoding.
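The causal factorization described above (first over time, then over trajectories within each step) can be sketched as two nested loops. This is a minimal toy illustration, not the authors' implementation: `sample_step` is a hypothetical stand-in for the learned next-step sampler and simply continues the previous motion with Gaussian noise, whereas the real model conditions on image features and all previously generated tokens.

```python
import random


def sample_step(history, earlier_this_step, rng):
    """Hypothetical stand-in for the learned next-step sampler.

    history: past (x, y) positions of this trajectory.
    earlier_this_step: displacements already sampled for other
    trajectories in the current time step (the inner causal factor).
    """
    if len(history) >= 2:
        vx = history[-1][0] - history[-2][0]
        vy = history[-1][1] - history[-2][1]
    else:
        vx = vy = 0.0
    if earlier_this_step:
        # Toy coupling: lean slightly toward motion already sampled this step.
        vx += 0.1 * sum(d[0] for d in earlier_this_step) / len(earlier_this_step)
        vy += 0.1 * sum(d[1] for d in earlier_this_step) / len(earlier_this_step)
    return vx + rng.gauss(0.0, 0.01), vy + rng.gauss(0.0, 0.01)


def rollout(query_points, num_steps, seed=0):
    """Sample one future: one list of (x, y) positions per query point."""
    rng = random.Random(seed)
    trajectories = [[tuple(p)] for p in query_points]
    for _ in range(num_steps):              # outer factor: time
        displacements = []
        for traj in trajectories:           # inner factor: trajectories
            displacements.append(sample_step(traj, displacements, rng))
        for traj, (dx, dy) in zip(trajectories, displacements):
            x, y = traj[-1]
            traj.append((x + dx, y + dy))
    return trajectories


futures = [rollout([(0.2, 0.3), (0.7, 0.5)], num_steps=4, seed=s) for s in range(3)]
```

Calling `rollout` with different seeds yields different plausible futures for the same query points, which is the property that best-of-N evaluation and planning rely on.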

Motion Tokens. Each token combines Fourier-embedded motion, a randomized trajectory identifier, and image features bilinearly sampled at the current and origin positions.
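As a rough illustration of the Fourier-embedded motion component, the sketch below maps a 2D displacement to sine and cosine features at several frequencies. The frequency schedule (powers of two times pi) and the number of frequencies are assumptions chosen for illustration; the page does not specify them.

```python
import math


def fourier_embed(value, num_frequencies=4):
    """Embed a scalar as [sin(2^k * pi * v), cos(2^k * pi * v)] for k = 0..num_frequencies-1."""
    features = []
    for k in range(num_frequencies):
        angle = (2.0 ** k) * math.pi * value
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


def embed_motion(dx, dy, num_frequencies=4):
    """Concatenate the embeddings of both displacement components."""
    return fourier_embed(dx, num_frequencies) + fourier_embed(dy, num_frequencies)


motion_features = embed_motion(0.1, -0.2)  # 16 values: 2 axes x 4 frequencies x (sin, cos)
```

In a full token, features like these would be combined with the trajectory identifier and the sampled image features; how they are combined is not described here.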
Positional Encoding. Motion and image tokens share one reference frame via axial RoPE, which encodes current position, origin position, and time.
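A minimal sketch of axial rotary position encoding, under assumptions: channels are split into equal groups, one group per axis, and each channel pair is rotated by an angle proportional to that axis' coordinate. The axis layout `(x, y, origin_x, origin_y, t)`, the `base` constant, and the frequency schedule are illustrative choices, not details taken from the page.

```python
import math


def rotate_pair(a, b, angle):
    """Rotate the 2D vector (a, b) by the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    return a * c - b * s, a * s + b * c


def axial_rope(vec, coords, base=100.0):
    """Rotate channel pairs of vec; each axis in coords owns an equal share of pairs."""
    num_axes = len(coords)
    pairs_per_axis = len(vec) // (2 * num_axes)
    assert pairs_per_axis * 2 * num_axes == len(vec), "channels must split evenly"
    out = []
    for axis, coord in enumerate(coords):
        for i in range(pairs_per_axis):
            freq = base ** (-i / pairs_per_axis)
            idx = 2 * (axis * pairs_per_axis + i)
            out.extend(rotate_pair(vec[idx], vec[idx + 1], coord * freq))
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.1 * (i + 1) for i in range(20)]
k = [0.05 * (20 - i) for i in range(20)]
pos_q = (0.2, 0.4, 0.1, 0.3, 1.0)   # assumed layout: x, y, origin_x, origin_y, t
pos_k = (0.5, 0.1, 0.5, 0.1, 0.0)
shift = 0.37
score = dot(axial_rope(q, pos_q), axial_rope(k, pos_k))
score_shifted = dot(
    axial_rope(q, tuple(c + shift for c in pos_q)),
    axial_rope(k, tuple(c + shift for c in pos_k)),
)
```

The two scores agree up to floating-point error: attention logits under rotary encoding depend only on coordinate differences, which is what lets motion and image tokens share one reference frame.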
Fast Reasoning Blocks. Parallel residual blocks fuse self-attention, cross-attention, and FFN into a single step instead of standard sequential layers, cutting kernel launches for high rollout throughput.
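The structural difference between sequential and parallel residual blocks can be shown with toy sublayers. This sketch only illustrates the data flow: the three sublayers are hypothetical scaling functions standing in for self-attention, cross-attention, and the FFN, and normalization is omitted.

```python
def scale(factor):
    """Hypothetical toy sublayer: multiplies every channel by a constant."""
    return lambda v: [factor * a for a in v]


def sequential_block(x, sublayers):
    """Standard layout: each sublayer reads the output of the previous one."""
    for layer in sublayers:
        x = [xi + yi for xi, yi in zip(x, layer(x))]
    return x


def parallel_block(x, sublayers):
    """Parallel layout: all sublayers read the same input and are summed once."""
    outputs = [layer(x) for layer in sublayers]
    return [xi + sum(o[i] for o in outputs) for i, xi in enumerate(x)]


sublayers = [scale(0.1), scale(0.2), scale(0.3)]  # stand-ins for self-attn, cross-attn, FFN
x = [1.0, 2.0]
seq_out = sequential_block(x, sublayers)
par_out = parallel_block(x, sublayers)
```

Because every branch of the parallel block depends only on the shared input, the branches have no data dependency on one another, which is what makes fusing them into fewer kernel launches possible.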
Flow-Matching Head. A cached conditioning mechanism and a multiscale tanh-saturated input stack handle the heavy-tailed distribution of real-world motion.
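To make the two ingredients of the head concrete, here is a toy sketch under stated assumptions. `multiscale_tanh` shows one plausible reading of "multiscale tanh-saturated inputs" (the scales are invented for illustration). `sample_displacement` shows generic flow-matching sampling by Euler integration; `toy_velocity` is a hypothetical closed-form field that transports noise to a single target along a straight line, standing in for the learned, conditioned velocity network.

```python
import math


def multiscale_tanh(x, scales=(0.01, 0.1, 1.0)):
    """Bounded features: small scales resolve tiny motions, large scales keep big ones distinguishable."""
    return [math.tanh(x / s) for s in scales]


def toy_velocity(x, t, target):
    """Hypothetical velocity field for a straight-line path from noise to one target."""
    return (target - x) / (1.0 - t)


def sample_displacement(noise, target, num_steps=8):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with explicit Euler steps."""
    x = noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * toy_velocity(x, t, target)
    return x


small = multiscale_tanh(0.005)   # only the finest scale responds strongly
large = multiscale_tanh(5.0)     # every scale saturates near 1, so inputs stay bounded
sampled = sample_displacement(noise=-1.3, target=0.25)
```

A learned head would replace `toy_velocity` with a network conditioned on the transformer output, so different noise draws can land on different modes of the next-step distribution.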

OWM: Open-World Motion Benchmark

To evaluate open-world motion prediction, we introduce OWM - a benchmark of 95 diverse in-the-wild videos recorded with static cameras. Each scene provides a reference frame, query points, and verified ground-truth trajectories spanning 2.5-6.5 seconds. We assess both accuracy (best-of-N) and search efficiency: given a fixed 5-minute wall-clock budget on a reference GPU, how many plausible futures can a method explore? We supplement OWM with physical diagnostics from Physics-IQ and Physion.

OWM Composition. The benchmark covers a wide variety of motion settings across rigid/non-rigid objects, single/multi-agent scenes, and constrained/free-will dynamics.
Examples. Diverse real-world scenes from OWM, shown with predicted trajectories, spanning different motion types and complexities.

Results

Evaluation Setup

We compare against state-of-the-art open-weight video generation models. For video baselines, trajectories are extracted from generated frames using off-the-shelf point trackers.

minADE ↓: L2 distance of the closest hypothesis to ground truth. Lower is better.
N=5: 5 hypotheses per method, best scored.
T=5min: Fixed 5-minute GPU budget (equal compute); generate as many hypotheses as possible, best scored. Measures search efficiency. DNF = did not finish.
(a) OWM: Open-world motion (in-the-wild videos).
(b) PhysicsIQ: Physical plausibility in controlled solid-mechanics settings (Motamed et al., WACV 2026).
(c) Physion: Intuitive physics understanding benchmark (Bear et al., NeurIPS D&B 2021).
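The best-of-N metric defined above can be sketched in a few lines. This is a minimal illustration that averages the L2 error over all points and time steps and then takes the minimum over hypotheses; the exact averaging and normalization conventions used for the reported numbers are not specified here and may differ.

```python
import math


def ade(pred, gt):
    """Average displacement error.

    pred, gt: list over query points, each a list of (x, y) per time step.
    """
    total, count = 0.0, 0
    for pred_traj, gt_traj in zip(pred, gt):
        for (px, py), (gx, gy) in zip(pred_traj, gt_traj):
            total += math.hypot(px - gx, py - gy)
            count += 1
    return total / count


def min_ade(hypotheses, gt):
    """Score only the hypothesis closest to the ground truth."""
    return min(ade(h, gt) for h in hypotheses)


ground_truth = [[(0.0, 0.0), (1.0, 0.0)]]
hypothesis_a = [[(0.0, 0.0), (1.0, 1.0)]]   # ADE 0.5
hypothesis_b = [[(0.0, 0.1), (1.0, 0.1)]]   # ADE 0.1
best = min_ade([hypothesis_a, hypothesis_b], ground_truth)
```

Since the minimum can only decrease as hypotheses are added, a method that samples more futures within the same budget has more chances to land close to the ground truth.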

Open-World Motion & Physical Diagnostics

With only 5 samples, our approach - despite being orders of magnitude faster and over 10× smaller - matches the prediction accuracy of the best open-weight video generation models. Under the primary 5-minute budget, this efficiency advantage becomes decisive: most video models cannot even finish within the time limit, while ours generates thousands of hypotheses to find substantially more accurate predictions.

| Method | Params | Throughput | (a) OWM ↓ N=5 | (a) OWM ↓ T=5min | (b) PhysicsIQ ↓ N=5 | (b) PhysicsIQ ↓ T=5min | (c) Physion ↓ N=5 | (c) Physion ↓ T=5min |
|---|---|---|---|---|---|---|---|---|
| MAGI-1 | 4.5B | 0.3 / min | 0.037 | 0.066 | 0.126 | 0.169 | 0.061 | 0.081 |
| Wan2.2 I2V | 14B | 0.14 / min | 0.039 | DNF | 0.116 | DNF | 0.069 | DNF |
| CogVideoX 1.5 | 5B | 0.05 / min | 0.051 | DNF | 0.100 | DNF | 0.063 | DNF |
| SkyReels V2 (DF) | 1.3B | 0.3 / min | 0.058 | 0.068 | 0.128 | 0.137 | 0.069 | 0.084 |
| SVD 1.1 | 1.5B | 0.71 / min | 0.054 | 0.119 | 0.138 | 0.241 | 0.070 | 0.147 |
| Ours | 0.6B | 2,200 / min | 0.029 | 0.013 | 0.115 | 0.045 | 0.048 | 0.020 |
Time-Accuracy Trade-off on OWM (log time vs. best-of-N MSE). More hypotheses improve accuracy for all methods; our sparse formulation makes this orders of magnitude more efficient.
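A quick back-of-the-envelope calculation shows what the fixed budget means in practice. The throughput values are copied from the table above; the assumptions (labeled in the code) are that throughput is constant over the budget and that only fully completed hypotheses count.

```python
import math

# Throughput in trajectory samples per minute, taken from the results table.
THROUGHPUT_PER_MIN = {
    "MAGI-1": 0.3,
    "Wan2.2 I2V": 0.14,
    "CogVideoX 1.5": 0.05,
    "SkyReels V2 (DF)": 0.3,
    "SVD 1.1": 0.71,
    "Ours": 2200,
}


def hypotheses_within_budget(throughput_per_min, budget_min=5.0):
    """Assumption: constant throughput, and partial samples do not count."""
    return math.floor(throughput_per_min * budget_min)


budget_counts = {
    name: hypotheses_within_budget(rate) for name, rate in THROUGHPUT_PER_MIN.items()
}
for name, count in budget_counts.items():
    print(f"{name}: {count} hypotheses in 5 minutes")
```

Under these assumptions, the two methods that come out at zero completed hypotheses are the same two reported as DNF, while the sparse model has on the order of ten thousand candidates to choose its best prediction from.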

MinADE (lower is better) across all OWM subsets. Values are mean L2 distance in normalized coordinates.

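| Method | Rigid N=5 | Rigid T=5m | Non-rigid N=5 | Non-rigid T=5m | Single-agent N=5 | Single-agent T=5m | Multi-agent N=5 | Multi-agent T=5m | w/ Free will N=5 | w/ Free will T=5m | w/o Free will N=5 | w/o Free will T=5m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MAGI-1 | .032 | .058 | .039 | .069 | .020 | .044 | .048 | .080 | .040 | .066 | .030 | .065 |
| Wan2.2 | .042 | DNF | .038 | DNF | .039 | DNF | .039 | DNF | .036 | DNF | .045 | DNF |
| CogVideoX | .051 | DNF | .051 | DNF | .041 | DNF | .052 | DNF | .049 | DNF | .054 | DNF |
| SkyReels V2 | .061 | .071 | .056 | .066 | .048 | .056 | .064 | .075 | .054 | .063 | .065 | .076 |
| SVD 1.1 | .048 | .055 | .057 | .073 | .037 | .053 | .065 | .077 | .060 | .069 | .042 | .064 |
| Ours | .031 | .007 | .039 | .016 | .036 | .008 | .044 | .017 | .037 | .014 | .044 | .011 |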

Downstream: Billiard Planning

The efficiency of our model enables a new capability: planning by exploring motion-space rollouts. Importantly, the model is completely task-agnostic - it is never trained for or made aware of any downstream planning objective; it simply predicts how the scene might evolve.

To isolate the model's suitability for planning from any influence of the planning algorithm, we deliberately use the simplest possible approach: pure random search over candidate actions. Initial velocities ("pokes") on the cue ball are sampled at random, the model rolls out many stochastic futures for each, and the action whose rollouts best satisfy the task objective is selected. Even with this naive planner, our approach achieves 78% accuracy - far above all non-oracle baselines (4-16%), including the best dense video baseline at 16%, and approaching the 84% of a ground-truth physics simulator oracle. A more sophisticated planning algorithm would likely yield substantially better results still.
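The random-search procedure described above can be sketched on a toy problem. Everything task-specific here is a hypothetical stand-in: `toy_world_model` replaces the learned motion model with a one-dimensional noisy rollout, and `reward` replaces the billiards objective with "the ball ends near a target position". Only the planning loop itself mirrors the description.

```python
import random

TARGET = 1.0       # hypothetical goal position
TOLERANCE = 0.1    # hypothetical success radius


def toy_world_model(action, rng):
    """Hypothetical stochastic rollout: final 1D ball position after a poke."""
    return action + rng.gauss(0.0, 0.05)


def reward(final_position):
    return 1.0 if abs(final_position - TARGET) < TOLERANCE else 0.0


def plan_by_random_search(num_actions=200, rollouts_per_action=50, seed=0):
    """Sample actions at random, score each by its mean reward over rollouts, keep the best."""
    rng = random.Random(seed)
    best_action, best_score = None, -1.0
    for _ in range(num_actions):
        action = rng.uniform(0.0, 2.0)
        total = sum(
            reward(toy_world_model(action, rng)) for _ in range(rollouts_per_action)
        )
        score = total / rollouts_per_action
        if score > best_score:
            best_action, best_score = action, score
    return best_action, best_score


best_action, best_score = plan_by_random_search()
```

The world model never sees the reward function; the objective enters only when rollouts are scored. The cost is `num_actions * rollouts_per_action` rollouts per decision, which is why rollout throughput determines whether this kind of search is practical.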

Planning via Random Search in Motion Space. Candidate initial velocities are sampled randomly; for each, the model rolls out many stochastic trajectories and scores them against the task objective, and the action with the highest expected reward is selected. The model has no knowledge of the planning task. Right: an illustrated walkthrough of the billiard planning process.
| Method | Accuracy ↑ | Throughput (actions/min) |
|---|---|---|
| Simulator Oracle | 84% | 55,162 |
| Images-to-Video Diff. | 16% | 19.8 |
| AR Images-to-Video Diff. | 8% | 18.6 |
| Full Trajectory Diffusion | 8% | 160.8 |
| Flow Poke Transformer | 4% | 13,423 |
| Ours | 78% | 496 |

All methods use the same random-search planner to ensure a fair comparison that isolates world model quality from planning algorithm sophistication.

BibTeX

@inproceedings{baumann2026envisioning,
  title     = {Envisioning the Future, One Step at a Time},
  author    = {Baumann, Stefan Andreas and Wiese, Jannik and Martorella, Tommaso and Kalayeh, Mahdi M. and Ommer, Bjorn},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}