


Summary
We developed OneFlow, the first non-autoregressive multimodal model that enables variable-length and concurrent mixed-modal generation. We model sequence generation through insertions of both text tokens and image embeddings.
OneFlow introduces new capabilities such as classifier-free guidance for image understanding, and simultaneous interleaved text-image generation.
Across a diverse range of benchmarks for image generation and image understanding, we are competitive with existing SOTA models.
In controlled experiments, OneFlow scales better than Transfusion (AR + FM) on multimodal pretraining.
In controlled experiments, mixed-modal training consistently improves benchmarks for both image understanding and generation.
Concurrent generation of text and a variable number of images, interleaved via insertions.
During training, AR uses next-token prediction, masked diffusion replaces tokens with special mask tokens, and EditFlow deletes tokens, reducing sequence length and saving FLOPs.
Interleaved AR+Diffusion and OneFlow: OneFlow deletes tokens/images during training, shortening sequences.
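To make the deletion-based training objective concrete, here is a minimal sketch of the corruption step, assuming token-level deletions at a fixed keep probability; the function name and the schedule are illustrative assumptions, not OneFlow's actual training code.

```python
import torch

def delete_corrupt(target_ids: torch.Tensor, keep_prob: float):
    """Hypothetical deletion-based corruption: dropped tokens vanish entirely,
    so the corrupted sequence is shorter than the target and the model is
    trained to insert the missing elements back (saving FLOPs vs. masking)."""
    keep_mask = torch.rand(target_ids.shape[0]) < keep_prob
    return target_ids[keep_mask], keep_mask

# Toy usage: corrupt a 10-token target, keeping each token with probability 0.6.
target = torch.arange(10)
corrupted, kept = delete_corrupt(target, keep_prob=0.6)
print(len(target), "->", len(corrupted))
```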
We follow Transfusion and use U-Nets as projectors, with a shared Transformer backbone for all modalities.
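A rough sketch of that layout is below: modality-specific projectors around a single shared Transformer trunk. The U-Net projector is replaced by plain linear layers for brevity, and every module name and dimension here is an illustrative assumption rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class SharedBackboneSketch(nn.Module):
    """Illustrative Transfusion-style layout: one Transformer trunk shared by
    text tokens and image latents, with per-modality input/output projectors."""

    def __init__(self, d_model=256, n_layers=4, vocab=32000, latent_dim=16):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, d_model)
        # Stand-ins for the U-Net projector mapping image latents <-> d_model.
        self.img_in = nn.Linear(latent_dim, d_model)
        self.img_out = nn.Linear(d_model, latent_dim)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.text_head = nn.Linear(d_model, vocab)

    def forward(self, text_ids, image_latents):
        # Project both modalities to the shared width and run one backbone.
        h = torch.cat([self.text_embed(text_ids), self.img_in(image_latents)], dim=1)
        h = self.trunk(h)
        n_text = text_ids.shape[1]
        return self.text_head(h[:, :n_text]), self.img_out(h[:, n_text:])

# Toy forward pass: 12 text tokens and 64 image-latent positions.
model = SharedBackboneSketch()
text_logits, image_pred = model(torch.randint(0, 32000, (1, 12)), torch.randn(1, 64, 16))
```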
Research Questions
How does OneFlow perform compared to AR?
For text-to-image generation, we report DPG-Bench and FID. For image-to-text caption quality, we report CIDEr and ROUGE. In every benchmark, OneFlow consistently exhibits better scaling laws than AR.
What is the impact of mixed-modal vs. sequential pretraining?
Mixed-modal pretraining vs. sequential pretraining: mixed pretraining achieves a 4% relative improvement on VQA tasks, along with slight improvements on image generation.
What emergent behaviors does OneFlow exhibit during generation?
Implicit visual reasoning in hierarchical generation. OneFlow naturally develops reasoning chains without CoT prompting.
How does OneFlow compare to other unified models?
Concurrent Interleaved
Concurrent Interleaved Generation
OneFlow unlocks parallel generation by inserting new images anywhere in the existing text and denoising them concurrently.
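A toy sketch of this sampling loop is given below, under the assumption that the model exposes an insertion-proposal step and a per-image denoising step; `propose_insertions` and `denoise_step` are hypothetical placeholders, not OneFlow's API.

```python
import random

def propose_insertions(sequence):
    """Hypothetical stand-in for the model's insertion head: propose
    (position, element) pairs; here random, for illustration only."""
    if len(sequence) >= 12:
        return []
    if random.random() < 0.5:
        elem = {"type": "image", "noise": 1.0}   # a fresh, fully noisy image slot
    else:
        elem = {"type": "text", "token": "tok"}
    return [(random.randint(0, len(sequence)), elem)]

def denoise_step(image, dt=0.25):
    """Hypothetical flow step on one in-progress image slot."""
    image["noise"] = max(0.0, image["noise"] - dt)

sequence = [{"type": "text", "token": "<prompt>"}]
for _ in range(20):
    # 1) Insert new text tokens or image slots anywhere in the sequence.
    for pos, elem in propose_insertions(sequence):
        sequence.insert(pos, elem)
    # 2) Concurrently denoise every image slot that is still noisy.
    for elem in sequence:
        if elem["type"] == "image" and elem["noise"] > 0.0:
            denoise_step(elem)

print([(e["type"], e.get("noise")) for e in sequence])
```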
VQA Reasoning
Hierarchical Generation Exhibits Reasoning
OneFlow's sampling process demonstrates implicit visual reasoning. The model naturally develops reasoning chains before providing final answers, without requiring Chain-of-Thought prompting.
Image Generation
OneFlow generates high-quality images with strong prompt adherence, accurately capturing fine-grained details even in dense prompts from DPG-Bench, such as correctly rendering "a polar bear balancing on a blue barrel."
Impact of CFG
Classifier-free Guidance Improves Text Detail
Higher CFG values consistently increase the length and detail of generated text.
Text generation examples from OneFlow using classifier-free guidance (CFG). CFG produces longer, more detailed captions but increases hallucination risk. Highlighted text shows added detail at higher CFG weights.
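For reference, the guidance step itself can be sketched as mixing conditional and unconditional logits with a weight w; how OneFlow forms the unconditional pass (for example, by dropping the image condition) is an assumption here, not something stated above.

```python
import torch

def cfg_logits(cond_logits, uncond_logits, w):
    """Classifier-free guidance: w = 1 recovers the conditional model; larger w
    pushes predictions further toward the image-conditioned direction (more
    detail and, as noted above, more hallucination risk)."""
    return uncond_logits + w * (cond_logits - uncond_logits)

# Toy example over a 5-token vocabulary.
cond = torch.tensor([2.0, 0.5, 0.1, -1.0, 0.0])
uncond = torch.tensor([1.0, 1.0, 0.1, -1.0, 0.0])
for w in (1.0, 2.0, 4.0):
    probs = torch.softmax(cfg_logits(cond, uncond, w), dim=-1)
    print(f"w={w}: argmax={probs.argmax().item()}, p={probs.max().item():.3f}")
```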
