MaskFlow Training: Our baseline models apply the same masking ratio to all frames during training, whereas MaskFlow samples an independent masking ratio for each frame (frame-level masking).
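To make the contrast concrete, here is a minimal sketch of the two masking schemes; the function and argument names are illustrative, not the actual MaskFlow code:

```python
import torch

def mask_video_tokens(tokens, mask_token_id, frame_level=True):
    """Mask video tokens for training (illustrative sketch).

    tokens: (batch, frames, tokens_per_frame) integer token ids.
    frame_level: True  -> independent masking ratio per frame (MaskFlow-style),
                 False -> one shared masking ratio for the whole clip (baseline).
    """
    b, f, n = tokens.shape
    if frame_level:
        ratios = torch.rand(b, f, 1)                    # one ratio per frame
    else:
        ratios = torch.rand(b, 1, 1).expand(b, f, 1)    # same ratio for every frame
    # mask each token with probability equal to its frame's ratio
    mask = torch.rand(b, f, n) < ratios
    masked = tokens.masked_fill(mask, mask_token_id)
    return masked, mask
```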
MaskFlow is a chunkwise autoregressive approach to long video generation that uses frame-level masking and confidence-based heuristic sampling to produce seamless, high-quality video sequences efficiently. Instead of generating an entire video at once, MaskFlow generates overlapping chunks of frames, where each new chunk is conditioned on previously generated frames to ensure temporal consistency. During training, the model learns to reconstruct partially masked frames, which makes it naturally suited to extending video sequences while maintaining coherence. Because frame-level masking exposes the model to different levels of corruption across frames, it aligns naturally with chunkwise generation and yields smooth transitions between chunks. To further speed up inference, we incorporate confidence-based heuristic sampling, selectively unmasking only the most confidently predicted tokens at each step. Together, these choices let MaskFlow generate long videos with greater flexibility and efficiency than conventional approaches that generate the full video in one pass.
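As an illustration of the chunkwise rollout, a minimal sketch could condition each new chunk on the most recent clean frames as below; `sample_chunk` and all other names are hypothetical stand-ins for MaskFlow's actual sampler:

```python
def generate_long_video(model, sample_chunk, total_frames, chunk_size, context_frames):
    """Chunkwise autoregressive rollout (illustrative sketch).

    sample_chunk(model, context, num_new) is assumed to return `num_new`
    newly generated frames while keeping the clean `context` frames fixed.
    """
    video = []
    while len(video) < total_frames:
        # condition on the most recent clean frames (empty for the first chunk)
        context = video[-context_frames:] if video else []
        num_new = chunk_size - len(context)
        # the remaining frames of the chunk start fully masked and are
        # progressively unmasked by the sampler
        new_frames = sample_chunk(model, context, num_new)
        video.extend(new_frames)
    return video[:total_frames]
```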
MaskFlow (MGM-Style) Sampling
FM-Style Sampling
Tokenization enables confidence-based heuristic sampling, which is why MaskFlow can generate video chunks much faster than flow-matching-style sampling. MaskFlow produces high-quality videos in as few as 20 function evaluations, whereas flow matching sampling requires up to 250 function evaluations for comparable quality.
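A single confidence-based unmasking step can be sketched roughly as follows, assuming a model that returns per-position logits over the token codebook; the function and its arguments are illustrative, not MaskFlow's actual implementation:

```python
import torch

@torch.no_grad()
def confidence_unmask_step(model, tokens, mask, num_to_unmask):
    """One confidence-based heuristic unmasking step (illustrative sketch).

    tokens: (seq_len,) token ids, with masked positions set to a mask id.
    mask:   (seq_len,) bool, True where a position is still masked.
    """
    logits = model(tokens.unsqueeze(0)).squeeze(0)             # (seq_len, vocab_size)
    probs = logits.softmax(dim=-1)
    confidence, prediction = probs.max(dim=-1)                 # best guess per position
    confidence = confidence.masked_fill(~mask, float("-inf"))  # ignore unmasked slots
    # commit only the most confident predictions; the rest stay masked
    top = confidence.topk(num_to_unmask).indices
    tokens[top] = prediction[top]
    mask[top] = False
    return tokens, mask
```

With a budget of about 20 function evaluations per chunk, each such step would commit roughly one twentieth of the masked tokens.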
MaskFlow maintains quality for up to 160 frames on FaceForensics by conditioning on previously generated chunks. All models used to generate these videos were trained on 16-frame chunks only and require just 20 function evaluations per generated chunk.
MaskFlow can also be rolled out autoregressively frame by frame to generate long, high-quality videos beyond an extrapolation factor of 10×. The videos above show a DMLab rollout across 360 frames, still using only 20 function evaluations per frame. The models used to generate these videos were trained on 36-frame chunks only.
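The frame-by-frame rollout can be sketched in the same spirit as the chunkwise version above, sliding the conditioning window forward one frame at a time; `sample_chunk` is again a hypothetical helper, not MaskFlow's actual sampler:

```python
def rollout_frame_by_frame(model, sample_chunk, seed_frames, total_frames, chunk_size):
    """Frame-by-frame rollout (illustrative sketch): the conditioning window
    slides forward by one frame, so each call generates a single new frame."""
    video = list(seed_frames)
    while len(video) < total_frames:
        context = video[-(chunk_size - 1):]   # last chunk_size - 1 clean frames
        new_frame = sample_chunk(model, context, 1)[0]
        video.append(new_frame)
    return video[:total_frames]
```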