MaskFlow Training: Our baseline models apply the same masking ratio to all frames during training, whereas MaskFlow samples an independent masking ratio for each frame (frame-level masking).
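To make the contrast concrete, here is a minimal sketch of the two masking schemes; the function and argument names are illustrative, not the actual MaskFlow code:

```python
import torch

def mask_video_tokens(tokens, mask_token_id, frame_level=True):
    """Mask video tokens for training (illustrative sketch).

    tokens: (batch, frames, tokens_per_frame) integer token ids.
    frame_level: True  -> independent masking ratio per frame (MaskFlow-style),
                 False -> one shared masking ratio for the whole clip (baseline).
    """
    b, f, n = tokens.shape
    if frame_level:
        ratios = torch.rand(b, f, 1)                    # one ratio per frame
    else:
        ratios = torch.rand(b, 1, 1).expand(b, f, 1)    # same ratio for every frame
    # mask each token with probability equal to its frame's ratio
    mask = torch.rand(b, f, n) < ratios
    masked = tokens.masked_fill(mask, mask_token_id)
    return masked, mask
```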
MaskFlow is a chunkwise autoregressive approach to long video generation that uses frame-level masking and confidence-based heuristic sampling to produce seamless, high-quality video sequences efficiently. Instead of generating an entire video at once, MaskFlow generates overlapping chunks of frames, where each new chunk is conditioned on previously generated frames to ensure temporal consistency. During training, the model learns to reconstruct partially masked frames, which makes it naturally suited to extending video sequences while maintaining coherence. Because frame-level masking exposes the model to different levels of corruption across frames, it aligns naturally with chunkwise generation and yields smooth transitions between chunks. To further speed up inference, we incorporate confidence-based heuristic sampling, selectively unmasking only the most confidently predicted tokens at each step. Together, these choices let MaskFlow generate long videos with greater flexibility and efficiency than conventional approaches that generate the full video in one pass.
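As an illustration of the chunkwise rollout, a minimal sketch could condition each new chunk on the most recent clean frames as below; `sample_chunk` and all other names are hypothetical stand-ins for MaskFlow's actual sampler:

```python
def generate_long_video(model, sample_chunk, total_frames, chunk_size, context_frames):
    """Chunkwise autoregressive rollout (illustrative sketch).

    sample_chunk(model, context, num_new) is assumed to return `num_new`
    newly generated frames while keeping the clean `context` frames fixed.
    """
    video = []
    while len(video) < total_frames:
        # condition on the most recent clean frames (empty for the first chunk)
        context = video[-context_frames:] if video else []
        num_new = chunk_size - len(context)
        # the remaining frames of the chunk start fully masked and are
        # progressively unmasked by the sampler
        new_frames = sample_chunk(model, context, num_new)
        video.extend(new_frames)
    return video[:total_frames]
```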
MaskFlow (MGM-Style) Sampling
FM-Style Sampling
Tokenization enables confidence-based heuristic sampling, which is why MaskFlow can generate video chunks much faster than flow-matching-style sampling. MaskFlow produces high-quality videos in as few as 20 function evaluations, whereas flow matching sampling requires up to 250 function evaluations for comparable quality.
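A single confidence-based unmasking step can be sketched roughly as follows, assuming a model that returns per-position logits over the token codebook; the function and its arguments are illustrative, not MaskFlow's actual implementation:

```python
import torch

@torch.no_grad()
def confidence_unmask_step(model, tokens, mask, num_to_unmask):
    """One confidence-based heuristic unmasking step (illustrative sketch).

    tokens: (seq_len,) token ids, with masked positions set to a mask id.
    mask:   (seq_len,) bool, True where a position is still masked.
    """
    logits = model(tokens.unsqueeze(0)).squeeze(0)             # (seq_len, vocab_size)
    probs = logits.softmax(dim=-1)
    confidence, prediction = probs.max(dim=-1)                 # best guess per position
    confidence = confidence.masked_fill(~mask, float("-inf"))  # ignore unmasked slots
    # commit only the most confident predictions; the rest stay masked
    top = confidence.topk(num_to_unmask).indices
    tokens[top] = prediction[top]
    mask[top] = False
    return tokens, mask
```

With a budget of about 20 function evaluations per chunk, each such step would commit roughly one twentieth of the masked tokens.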
MaskFlow maintains quality for up to 160 frames on FaceForensics by conditioning on previously generated chunks. All models used to generate these videos were trained on 16-frame chunks only and require just 20 function evaluations per generated chunk.
MaskFlow can also be rolled out autoregressively frame by frame to generate long, high-quality videos beyond an extrapolation factor of 10×. The videos above show a DMLab rollout across 360 frames, still using only 20 function evaluations per frame. The models used to generate these videos were trained on 36-frame chunks only.
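The frame-by-frame rollout can be sketched in the same spirit as the chunkwise version above, sliding the conditioning window forward one frame at a time; `sample_chunk` is again a hypothetical helper, not MaskFlow's actual sampler:

```python
def rollout_frame_by_frame(model, sample_chunk, seed_frames, total_frames, chunk_size):
    """Frame-by-frame rollout (illustrative sketch): the conditioning window
    slides forward by one frame, so each call generates a single new frame."""
    video = list(seed_frames)
    while len(video) < total_frames:
        context = video[-(chunk_size - 1):]   # last chunk_size - 1 clean frames
        new_frame = sample_chunk(model, context, 1)[0]
        video.append(new_frame)
    return video[:total_frames]
```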