


Summary
We developed OneFlow, the first non-autoregressive multimodal model that enables variable-length and concurrent mixed-modal generation. We model sequence generation through insertions of both text tokens and image embeddings.
OneFlow introduces new capabilities such as classifier-free guidance for image understanding, and simultaneous interleaved text-image generation.
Across a diverse range of benchmarks for image generation and image understanding, we are competitive with existing SOTA models.
In controlled experiments, OneFlow scales better than Transfusion (AR + FM) on multimodal pretraining.
In controlled experiments, mixed-modal training consistently improves benchmarks for both image understanding and generation.
Concurrent generation of text and a variable number of images, interleaved via insertions.
During training, AR uses next-token prediction, masked diffusion replaces tokens with special mask tokens, and EditFlow deletes tokens, reducing sequence length and saving FLOPs.
Interleaved AR+Diffusion and OneFlow: OneFlow deletes tokens/images during training, shortening sequences.
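To make the deletion-based training objective concrete, here is a minimal sketch of the corruption step, assuming token-level deletions at a fixed keep probability; the function name and the schedule are illustrative assumptions, not OneFlow's actual training code.

```python
import torch

def delete_corrupt(target_ids: torch.Tensor, keep_prob: float):
    """Hypothetical deletion-based corruption: dropped tokens vanish entirely,
    so the corrupted sequence is shorter than the target and the model is
    trained to insert the missing elements back (saving FLOPs vs. masking)."""
    keep_mask = torch.rand(target_ids.shape[0]) < keep_prob
    return target_ids[keep_mask], keep_mask

# Toy usage: corrupt a 10-token target, keeping each token with probability 0.6.
target = torch.arange(10)
corrupted, kept = delete_corrupt(target, keep_prob=0.6)
print(len(target), "->", len(corrupted))
```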
We follow Transfusion and use U-Nets as projectors, with a shared Transformer backbone for all modalities.
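A rough sketch of that layout is below: modality-specific projectors around a single shared Transformer trunk. The U-Net projector is replaced by plain linear layers for brevity, and every module name and dimension here is an illustrative assumption rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class SharedBackboneSketch(nn.Module):
    """Illustrative Transfusion-style layout: one Transformer trunk shared by
    text tokens and image latents, with per-modality input/output projectors."""

    def __init__(self, d_model=256, n_layers=4, vocab=32000, latent_dim=16):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, d_model)
        # Stand-ins for the U-Net projector mapping image latents <-> d_model.
        self.img_in = nn.Linear(latent_dim, d_model)
        self.img_out = nn.Linear(d_model, latent_dim)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.text_head = nn.Linear(d_model, vocab)

    def forward(self, text_ids, image_latents):
        # Project both modalities to the shared width and run one backbone.
        h = torch.cat([self.text_embed(text_ids), self.img_in(image_latents)], dim=1)
        h = self.trunk(h)
        n_text = text_ids.shape[1]
        return self.text_head(h[:, :n_text]), self.img_out(h[:, n_text:])

# Toy forward pass: 12 text tokens and 64 image-latent positions.
model = SharedBackboneSketch()
text_logits, image_pred = model(torch.randint(0, 32000, (1, 12)), torch.randn(1, 64, 16))
```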
Research Questions
How does OneFlow perform compared to AR?
For text-to-image generation, we report DPG-Bench and FID. For image-to-text caption quality, we report CIDEr and ROUGE. In every benchmark, OneFlow consistently exhibits better scaling laws than AR.
What is the impact of mixed-modal vs. sequential pretraining?
Mixed-modal pretraining vs. sequential pretraining: mixed pretraining achieves a 4% relative improvement on VQA tasks, along with slight improvements on image generation.
What emergent behaviors does OneFlow exhibit during generation?
Implicit visual reasoning in hierarchical generation. OneFlow naturally develops reasoning chains without CoT prompting.
How does OneFlow compare to other unified models?
Concurrent Interleaved
Concurrent Interleaved Generation
OneFlow unlocks parallel generation by inserting new images anywhere in the existing text and denoising them concurrently.
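A toy sketch of this sampling loop is given below, under the assumption that the model exposes an insertion-proposal step and a per-image denoising step; `propose_insertions` and `denoise_step` are hypothetical placeholders, not OneFlow's API.

```python
import random

def propose_insertions(sequence):
    """Hypothetical stand-in for the model's insertion head: propose
    (position, element) pairs; here random, for illustration only."""
    if len(sequence) >= 12:
        return []
    if random.random() < 0.5:
        elem = {"type": "image", "noise": 1.0}   # a fresh, fully noisy image slot
    else:
        elem = {"type": "text", "token": "tok"}
    return [(random.randint(0, len(sequence)), elem)]

def denoise_step(image, dt=0.25):
    """Hypothetical flow step on one in-progress image slot."""
    image["noise"] = max(0.0, image["noise"] - dt)

sequence = [{"type": "text", "token": "<prompt>"}]
for _ in range(20):
    # 1) Insert new text tokens or image slots anywhere in the sequence.
    for pos, elem in propose_insertions(sequence):
        sequence.insert(pos, elem)
    # 2) Concurrently denoise every image slot that is still noisy.
    for elem in sequence:
        if elem["type"] == "image" and elem["noise"] > 0.0:
            denoise_step(elem)

print([(e["type"], e.get("noise")) for e in sequence])
```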
VQA Reasoning
Hierarchical Generation Exhibits Reasoning
OneFlow's sampling process demonstrates implicit visual reasoning. The model naturally develops reasoning chains before providing final answers, without requiring Chain-of-Thought prompting.
Image Generation
OneFlow generates high-quality images with strong prompt adherence, accurately capturing fine-grained details even in dense prompts from DPG-Bench, such as correctly rendering "a polar bear balancing on a blue barrel."
Impact of CFG
Classifier-free Guidance Improves Text Detail
Higher CFG values consistently increase the length and detail of generated text.
Text generation examples from OneFlow using classifier-free guidance (CFG). CFG produces longer, more detailed captions but increases hallucination risk. Highlighted text shows added detail at higher CFG weights.
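For reference, the guidance step itself can be sketched as mixing conditional and unconditional logits with a weight w; how OneFlow forms the unconditional pass (for example, by dropping the image condition) is an assumption here, not something stated above.

```python
import torch

def cfg_logits(cond_logits, uncond_logits, w):
    """Classifier-free guidance: w = 1 recovers the conditional model; larger w
    pushes predictions further toward the image-conditioned direction (more
    detail and, as noted above, more hallucination risk)."""
    return uncond_logits + w * (cond_logits - uncond_logits)

# Toy example over a 5-token vocabulary.
cond = torch.tensor([2.0, 0.5, 0.1, -1.0, 0.0])
uncond = torch.tensor([1.0, 1.0, 0.1, -1.0, 0.0])
for w in (1.0, 2.0, 4.0):
    probs = torch.softmax(cfg_logits(cond, uncond, w), dim=-1)
    print(f"w={w}: argmax={probs.argmax().item()}, p={probs.max().item():.3f}")
```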
