RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

Prestel, Ulrich; Baumann, Stefan Andreas; Stracke, Nick; Ommer, Björn

ECCV 2026

Ulrich Prestel^1,2,*, Stefan Andreas Baumann^1,2,*, Nick Stracke^1,2, Björn Ommer^1,2

¹CompVis @ LMU Munich ²MCML

^*Equal contribution

**Training Static-scene Novel View Synthesis from Abundant Video.** Existing approaches rely on scarce data sources: supervised NVS requires posed multi-view images, while prior self-supervised methods require unposed videos of static scenes. RayDer instead trains from generic unposed videos that may contain dynamic objects, enabling learning from the dominant form of visual data and unlocking improved scaling with dataset size.

Zero-shot qualitative video samples on DL3DV-10K.

TL;DR

Self-supervised novel view synthesis methods are fundamentally data-limited: they require static-scene training data, which is scarce. RayDer removes this bottleneck by enabling stable training on general, dynamic real-world video. By consolidating three separate networks into one unified transformer, introducing dynamic state prediction with dropout, and improving pose learning through autoregressive training, RayDer's performance scales predictably with data, model size, and compute – following power-law scaling relationships (R² > 0.99) analogous to those observed in LLMs.

0 pose supervision required: fully self-supervised, no pretrained weights.

+2.9 dB over supervised LVSM: self-supervised outperforms GT-pose training.

10× more training data: 2.7M general videos vs. the ~250K static-scene limit.

1 single transformer: consolidated from 3 separate ViTs.

R² > 0.99 clean scaling: power-law fit across data, model size, and compute.

In-the-Wild Comparisons

General self-supervised NVS on in-the-wild scenes. E-RayZer (left) vs. RayDer (right).

E-RayZer

RayDer (Ours)

Method

Starting from RayZer – three separate ViTs for camera estimation, scene reconstruction, and rendering, trained on static-scene video – we identify and address three scaling bottlenecks:

Data. Existing methods require static scenes, severely limiting data scale. Training on dynamic video causes instabilities because the model is forced to hide scene dynamics inside the pose representation. We resolve this with a per-view dynamic state embedding that is randomly dropped out at training time, which eliminates all training divergence.

System. We consolidate all three networks into a single unified model with two modes (camera/state estimation and NVS), conditioned on token role via adaptive norms. A factorized attention pattern enables KV caching, which makes NVS efficient.

Quality. Random-order autoregressive training over views prevents frame-order pose shortcuts, improving both camera estimation and NVS quality. Shallow high-resolution neighborhood-attention layers around the inner backbone add high-frequency detail at minimal cost.
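The state dropout described under Data can be illustrated with a minimal sketch. It assumes dropout is applied independently per view by swapping that view's state embedding for a shared null vector; the function name `apply_state_dropout`, the `null_state` placeholder, and the dropout probability are hypothetical and not taken from the RayDer code.

```python
import random


def apply_state_dropout(states, null_state, drop_prob, rng):
    """Replace each per-view state vector with `null_state` with probability `drop_prob`.

    Assumption: dropout is decided independently for every view.
    """
    dropped = []
    for state in states:
        if rng.random() < drop_prob:
            dropped.append(list(null_state))  # view rendered without dynamic state
        else:
            dropped.append(list(state))       # view keeps its predicted state
    return dropped


rng = random.Random(0)
# Toy 2-dimensional state embeddings for three views (illustrative values only).
states = [[0.3, -1.2], [0.8, 0.1], [-0.5, 0.4]]
null_state = [0.0, 0.0]

train_states = apply_state_dropout(states, null_state, drop_prob=0.5, rng=rng)
# With drop_prob=1.0 every view falls back to the null state.
stateless = apply_state_dropout(states, null_state, drop_prob=1.0, rng=rng)
```

Because the model regularly sees the null state during training, it cannot rely on the state embedding being present. This is presumably why row C in the results recovers the "w/o state" quality that row B loses.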

**Architecture Overview.** RayDer unifies camera estimation **(a)** and novel view synthesis **(b)** in a single transformer backbone. Lightweight local intra-frame encoder and decoder layers handle high-resolution processing.

**Consolidation.** We combine RayZer's three separate networks (camera estimator, scene reconstructor, renderer) into a single unified model, simplifying scaling decisions.

**Autoregressive Pose Learning.** Many input views **(a)** allow encoding poses via an implicit "time" axis; sparse views **(b)** require true relative camera poses. Random-order autoregression forces learning both.
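One way to picture random-order autoregression is as a causal attention mask built over a shuffled view order instead of the temporal order. The sketch below works at the level of whole views and lets each view attend to itself and to views earlier in the sampled order. The function name `random_order_causal_mask` and the view-level granularity are assumptions; the page does not specify RayDer's exact attention pattern.

```python
import random


def random_order_causal_mask(num_views, rng):
    """Sample a random view order and build a view-level causal mask.

    mask[q][k] is True when view q may attend to view k, i.e. when k comes
    no later than q in the sampled order (assumed convention).
    """
    order = list(range(num_views))
    rng.shuffle(order)
    rank = {view: position for position, view in enumerate(order)}
    mask = [
        [rank[k] <= rank[q] for k in range(num_views)]
        for q in range(num_views)
    ]
    return order, mask


rng = random.Random(0)
order, mask = random_order_causal_mask(4, rng)
first_view, last_view = order[0], order[-1]
# The first view in the sampled order sees only itself (sparse context),
# while the last one sees every view (dense context).
```

Since the order is resampled for every training sample, the position of a frame in the video no longer predicts how much context it sees, so frame order cannot serve as a shortcut for pose.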

Scaling Behavior

By enabling stable training on general videos, RayDer allows studying scaling behavior across multiple orders of magnitude in both data and model size. We train at four model scales on three dataset fractions of SpatialVid: 1% (~27k videos), 10% (~270k, matching the combined size of commonly used static-scene NVS datasets), and 100% (~2.7M).

Both data and model scaling consistently lead to improvements, provided neither is a bottleneck. Large models overfit on small data, and small models saturate early, so scaling compute or data alone is not sufficient.

**Scaling Across Data and Model Size.** Models trained at different scales (shades of green) and dataset fractions (shades of blue), evaluated zero-shot on RE10K. *Left:* Increasing data scale consistently improves performance when model capacity is sufficient. *Right:* Increasing model scale helps, but insufficient data imposes a ceiling.

**Compute-Optimal Scaling.** RayDer's compute-optimal Pareto frontier across both compute and dataset size is well-approximated by a single power law (R² > 0.99 across all metrics).
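A power-law fit with an R² value can be obtained from an ordinary least-squares line in log-log space. The sketch below shows that generic procedure on synthetic points; the data values, the helper name `fit_power_law`, and the choice to compute R² in log space are assumptions, and none of the numbers come from the RayDer experiments.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b via least squares on (log x, log y).

    Returns (a, b, r2), where r2 is the coefficient of determination
    measured in log space (an assumed convention).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx
    log_a = mean_y - b * mean_x
    ss_res = sum((v - (log_a + b * u)) ** 2 for u, v in zip(log_x, log_y))
    ss_tot = sum((v - mean_y) ** 2 for v in log_y)
    return math.exp(log_a), b, 1.0 - ss_res / ss_tot


# Synthetic frontier: an error-like metric that decays as compute grows.
compute = [1e18, 1e19, 1e20, 1e21]
error = [2.0 * c ** -0.05 for c in compute]
a, b, r2 = fit_power_law(compute, error)
```

On real measurements the points scatter around the line, and R² quantifies how much of the variance in the log-metric the single power law explains.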

**Scaling Surface.** 3D visualization of the compute-data scaling surface, showing how performance depends on both axes simultaneously.

Qualitative Scaling

RayDer's qualitative behavior follows the trends seen in quantitative evaluations: jointly increasing data and compute improves NVS quality, with larger models on more data producing sharper, more detailed novel views.

**Qualitative Scaling.** More data and compute jointly improve NVS quality, consistent with quantitative scaling trends.

Results

Progressive modifications evaluated on RE10K (NVS) and DL3DV-10K (camera estimation). Trained on SV-HQ.

ID	Configuration	Stable	NVS PSNR↑ (w/o state)	NVS PSNR↑ (w/ state)	Camera Est. R@10°↑	Camera Est. t@0.1↑
A	RayZer-like Baseline	~	22.69*	—	66.0*	7.7*
B	+ Dynamic State Prediction	✓	13.48†	24.67	54.4	6.0
C	+ State Dropout	✓	23.02	24.10	69.2	8.2
D	+ Single-network Consolidation	✓	26.98	27.49	74.1	19.7
E	+ Parallel-target Attention	✓	25.91	26.21	70.9	18.2
F	+ Autoregression (ordered)	✓	23.53‡	25.78‡	76.5	25.2
G	+ Random-order Autoregression	✓	27.27	29.57	86.0	39.1
H	+ Local High-resolution Layers	✓	27.78	30.23	88.7	42.4

*Results for A are from runs that did not diverge. †State modeling without dropout creates inference-time state dependency. ‡Ordered AR does not generalize to standard NVS test settings.
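The page does not define R@10°. A common convention in pose evaluation, assumed here, is the fraction of view pairs whose relative rotation error falls below the threshold. The sketch below implements that generic metric with hypothetical helper names (`rot_z`, `rotation_error_deg`, `accuracy_at_threshold`) and toy rotations; it is not the evaluation code behind the table.

```python
import math


def rot_z(angle_deg):
    """3x3 rotation about the z-axis, used only to build toy examples."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_error_deg(r_pred, r_true):
    """Geodesic angle between two rotation matrices, in degrees."""
    # trace(r_pred^T r_true) equals the element-wise product summed over entries.
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))


def accuracy_at_threshold(errors_deg, threshold_deg):
    """Fraction of errors strictly below the threshold."""
    return sum(1 for e in errors_deg if e < threshold_deg) / len(errors_deg)


true_rotations = [rot_z(0.0)] * 4
pred_rotations = [rot_z(5.0), rot_z(25.0), rot_z(8.0), rot_z(40.0)]
errors = [rotation_error_deg(p, t) for p, t in zip(pred_rotations, true_rotations)]
r_at_10 = accuracy_at_threshold(errors, 10.0)  # two of four pairs are within 10 degrees
```

Under this reading, the table's R@10° values would be the same fraction expressed as a percentage.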

Open-set Novel View Synthesis

We train a single RayDer model on generic data and evaluate it zero-shot across six diverse datasets, camera baselines, and numbers of input views. Unlike supervised baselines, RayDer uses its own predicted camera poses at inference – not ground-truth annotations – making this a strictly harder setting.

Method	Params	Self-sup.	Small viewpoint	Small viewpoint	Small viewpoint	Small viewpoint	Small viewpoint	Small viewpoint	Small viewpoint	Small viewpoint	Small viewpoint	Small viewpoint	Large viewpoint	Large viewpoint	Large viewpoint	Large viewpoint	Large viewpoint	Large viewpoint	Large viewpoint
			LLFF	LLFF	DTU	DTU	CO3D	CO3D	WildRGBD	WildRGBD	Mip360	T&T	CO3D	WildRGBD	WildRGBD	Mip360	Mip360	T&T	T&T
			1-view	3-view	1-view	3-view	1-view	3-view	3-view	6-view	6-view	1-view	1-view	1-view	3-view	1-view	3-view	3-view	6-view
MVSplat	12M	✗	11.23	12.50	13.87	15.52	12.52	13.52	14.56	12.54	13.56	13.22	—	—	—	—	—	—	—
DepthSplat	354M	✗	12.07	12.62	14.15	16.24	13.23	13.77	15.93	14.23	14.01	14.35	10.42	9.35	13.53	10.49	12.54	9.78	10.12
ViewCrafter^†	1.4B	✗	10.53	13.52	12.66	16.40	18.96	14.72	16.42	12.66	14.59	18.07	10.11	9.12	13.45	9.79	10.34	9.88	10.32
SEVA^†	1.3B	✗	14.03	19.48	14.47	20.82	18.40	19.25	19.75	18.91	16.70	15.16	15.30	14.37	17.28	12.93	15.78	12.65	13.80
Kaleido^†‡	3.1B	✗	15.34	20.71	—	—	—	—	—	—	18.03	—	—	—	—	13.74	16.78	13.20	14.61
E-RayZer^*	246M	✓	10.44	18.01	10.31	16.97	12.94	17.76	17.72	16.18	15.86	10.36	12.94	10.53	14.47	9.78	15.17	12.88	13.35
RayDer-L (Ours)	743M	✓	17.11	21.38	16.01	17.92	21.10	19.09	20.07	17.23	16.25	18.74	16.84	14.55	15.97	14.96	15.85	13.59	13.81

PSNR↑ (higher is better). RayDer uses its own predicted poses, not GT annotations. Bold = best, underline = second best. ^†Diffusion-based. ^‡Evaluates at 512² instead of 576². ^*Multi-dataset checkpoint.

Static vs. Dynamic Training Data

Despite evaluation being performed on static scenes, training on more (partially dynamic) video empirically outweighs the benefit of a cleaner, domain-aligned training distribution.

Training Data	PSNR↑	LPIPS↓	SSIM↑
Static Mix only (~250k)	28.68	0.158	0.888
SpatialVid only (~2.7M)	29.38	0.135	0.899
Static Mix + SpatialVid	29.42	0.136	0.901

All models are RayDer-L trained for 500k steps. Evaluated on RE10K.

Self-supervised vs. Supervised on Dynamic Data

Self-supervised RayDer outperforms LVSM trained with pseudo-GT poses from MegaSaM by +2.9 dB PSNR, despite not using any pose supervision. Obtaining the pseudo-GT annotations is an order of magnitude more expensive in GPU-hours than training either NVS model.

Method	GT Pose	PSNR↑	LPIPS↓	SSIM↑
LVSM (SpatialVid)	✓	25.44	0.184	0.729
RayDer-B (SpatialVid)	✗	28.35	0.151	0.879

Evaluated on RE10K. Both models trained on SpatialVid at matched settings (transformer size, view count, training steps).
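To put the PSNR gap in the table above into perspective, it helps to recall that PSNR is a logarithmic function of mean squared error. The sketch below uses the standard definition PSNR = 10·log10(MAX²/MSE) and converts the reported gap into an MSE ratio. The helper names are hypothetical, and the conversion assumes the gap holds per image, which is only approximately true for dataset-averaged PSNR.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_psnr_gap(gap_db):
    """Factor by which MSE shrinks when PSNR improves by `gap_db` decibels."""
    return 10.0 ** (gap_db / 10.0)


# PSNR values reported in the table above (RayDer-B vs. LVSM on RE10K).
gap_db = 28.35 - 25.44
ratio = mse_ratio_from_psnr_gap(gap_db)  # a bit under 2, i.e. roughly half the MSE
```

Under these assumptions, the reported gap corresponds to roughly halving the mean squared reconstruction error.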

Pose Transferability

RayDer learns transferable camera poses comparable to those of the specialized XFactor method, despite using a simpler, single-stage training setup with no explicit pose-transfer supervision. This is a notable improvement over RayZer, whose pose representations were shown to lack transferability.

Method	R@10°↑	R@20°↑	R@30°↑	T@10°↑	T@20°↑	T@30°↑
RayZer	0.48	0.61	0.88	0.12	0.32	0.44
XFactor	0.93	0.97	0.99	0.55	0.83	0.90
RayDer-L (Ours)	0.92	0.98	0.99	0.44	0.83	0.90

Pose transferability (TPS metric) on DL3DV-10K following the protocol of XFactor. Bold = best, underline = second best.

Qualitative Results

**Zero-shot Qualitative Comparison.** RayDer-L vs. E-RayZer in **(a)** typical NVS settings and **(b)** an extreme setting with near-zero context view overlap. Trained on large-scale video that is not restricted to static scenes, RayDer outperforms E-RayZer, a prior model trained on a multi-dataset static mixture, by a wide margin.

BibTeX

@misc{prestel2026rayderscalableselfsupervisednovel,
      title={RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video}, 
      author={Ulrich Prestel and Stefan Andreas Baumann and Nick Stracke and Björn Ommer},
      year={2026},
}