OneFlow: Concurrent Mixed-Modal And Interleaved Generation with Edit Flows

John Nguyen¹, Marton Havasi¹, Tariq Berrada¹,², Luke Zettlemoyer¹, Ricky T. Q. Chen¹

¹FAIR at Meta ²Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK

Summary

We developed OneFlow, the first non-autoregressive multimodal model that enables variable-length and concurrent mixed-modal generation. We model sequence generation through insertions of both text tokens and image embeddings.

  1. OneFlow introduces new capabilities such as classifier-free guidance for image understanding and simultaneous interleaved text-image generation.

  2. Across a diverse range of benchmarks for image generation and image understanding, OneFlow is competitive with existing state-of-the-art (SOTA) models.

  3. In controlled experiments, OneFlow scales better than Transfusion (AR + FM) on multimodal pretraining.

  4. In controlled experiments, mixed-modal training consistently improves benchmarks for both image understanding and generation.
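
To make the idea of generating a sequence through insertions concrete, here is a minimal toy sketch. It is not OneFlow's model or sampler: the function name `apply_insertions`, the gap-indexing convention, and the `<image>` placeholder token are all assumptions made for illustration. It only shows how several insertions can be applied to a sequence at once, so the sequence grows to a variable length without a left-to-right order.

```python
from typing import Dict, List, Tuple


def apply_insertions(seq: List[str], insertions: List[Tuple[int, str]]) -> List[str]:
    """Apply a batch of insertions to `seq` in one step (hypothetical helper).

    Each (gap, token) pair places `token` into gap `gap` of the ORIGINAL
    sequence: gap g sits immediately before seq[g], and gap len(seq) is
    the end of the sequence. Tokens sharing a gap keep their given order.
    """
    by_gap: Dict[int, List[str]] = {}
    for gap, token in insertions:
        if not 0 <= gap <= len(seq):
            raise ValueError(f"gap {gap} is outside 0..{len(seq)}")
        by_gap.setdefault(gap, []).append(token)

    out: List[str] = []
    for g in range(len(seq) + 1):
        out.extend(by_gap.get(g, []))  # new tokens for this gap
        if g < len(seq):
            out.append(seq[g])         # then the existing token
    return out


# Start from an empty sequence and grow it over two rounds.
round_one = apply_insertions([], [(0, "a"), (0, "cat")])
round_two = apply_insertions(round_one, [(1, "black"), (2, "<image>")])
```

In this sketch every insertion in a round refers to positions in the sequence as it stood before the round, which is what lets the insertions be decided independently and applied together.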

Research Questions

How does OneFlow perform compared to AR?

For text-to-image generation, we report DPG-Bench and FID. For image-to-text caption quality, we report CIDEr and ROUGE. On every benchmark, OneFlow consistently scales better than the autoregressive (AR) baseline.

What is the impact of mixed-modal vs. sequential pretraining?

Compared with sequential pretraining, mixed-modal pretraining achieves a 4% relative improvement on VQA tasks and slight improvements on image generation.

What emergent behaviors does OneFlow exhibit during generation?

OneFlow's hierarchical generation exhibits implicit visual reasoning: the model naturally develops reasoning chains without Chain-of-Thought (CoT) prompting.

How does OneFlow compare to other unified models?

Concurrent Interleaved

Concurrent Interleaved Generation

OneFlow unlocks parallel generation by inserting new images anywhere in existing text and denoising them concurrently.
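
The following toy sketch illustrates what "insert anywhere, denoise concurrently" can look like as a loop. It is an illustration under strong assumptions, not OneFlow's sampler: `ImageSlot`, `denoise_step`, `generate`, and the insertion `schedule` are hypothetical names, the latents are two-element lists, and a fixed `target` stands in for whatever a trained network would predict. Each image is moved along a straight path from noise to its target with Euler steps, and images inserted at different iterations are updated in the same pass.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class ImageSlot:
    latent: List[float]  # current (initially noisy) latent
    target: List[float]  # toy stand-in for a trained model's prediction
    step: int = 0        # Euler steps taken so far


Item = Union[str, ImageSlot]


def denoise_step(slot: ImageSlot, num_steps: int) -> None:
    """Take one Euler step along a straight path from noise to `target`."""
    if slot.step >= num_steps:
        return
    remaining = num_steps - slot.step
    # With velocity (target - x) / (1 - t) and step size 1 / num_steps,
    # the update simplifies to x += (target - x) / remaining.
    slot.latent = [x + (y - x) / remaining for x, y in zip(slot.latent, slot.target)]
    slot.step += 1


def generate(sequence: List[Item],
             schedule: Dict[int, Tuple[int, ImageSlot]],
             num_steps: int) -> List[Item]:
    """Insert images at scheduled iterations and denoise all of them together.

    `schedule` maps a non-negative iteration number to (insert index, slot).
    """
    seq = list(sequence)
    pending = dict(schedule)
    iteration = 0
    while pending or any(isinstance(s, ImageSlot) and s.step < num_steps for s in seq):
        if iteration in pending:
            index, slot = pending.pop(iteration)
            seq.insert(index, slot)
        for item in seq:  # every unfinished image advances in the same pass
            if isinstance(item, ImageSlot):
                denoise_step(item, num_steps)
        iteration += 1
    return seq


text = ["A", "dog", ".", "A", "cat", "."]
img1 = ImageSlot(latent=[0.3, -1.2], target=[1.0, 0.0])
img2 = ImageSlot(latent=[-0.7, 0.5], target=[0.0, 1.0])
result = generate(text, {0: (2, img1), 3: (6, img2)}, num_steps=4)
```

Here the second image is inserted while the first is still being denoised, so at iteration 3 both are updated in the same pass. In a real model the per-image update would come from one batched network call rather than a Python loop.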

VQA Reasoning

Hierarchical Generation Exhibits Reasoning

OneFlow's sampling process demonstrates implicit visual reasoning. The model naturally develops reasoning chains before providing final answers, without requiring Chain-of-Thought prompting.

Image Generation

OneFlow generates high-quality images with strong prompt adherence, accurately capturing fine-grained details even in dense prompts from DPG-Bench, such as correctly rendering "a polar bear balancing on a blue barrel."

Impact of CFG

Classifier-free Guidance Improves Text Detail

Higher CFG values consistently increase the length and detail of generated text.

Text generation examples from OneFlow using classifier-free guidance (CFG). CFG produces longer, more detailed captions but increases hallucination risk. Highlighted text shows added detail at higher CFG weights.
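
For readers unfamiliar with CFG, the sketch below shows the commonly used guidance rule, which extrapolates from an unconditional prediction toward a conditional one. How OneFlow applies guidance inside its own sampler is not described in this section, so the function name `cfg_combine`, the use of per-token logits, and the weight convention (0 gives the unconditional prediction, 1 the conditional one) are assumptions for illustration only.

```python
from typing import List


def cfg_combine(cond: List[float], uncond: List[float], weight: float) -> List[float]:
    """Standard CFG rule: uncond + weight * (cond - uncond), element-wise.

    weight = 0 returns the unconditional prediction, weight = 1 the
    conditional one, and weight > 1 pushes further in the direction
    that the conditioning (here, the image) suggests.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]


# Toy logits for two candidate tokens, with and without the image as context.
cond_logits = [2.0, 0.0]
uncond_logits = [1.0, 0.0]
guided = cfg_combine(cond_logits, uncond_logits, weight=3.0)
```

Some papers write the same idea as (1 + w) * cond - w * uncond, which matches this form with weight = 1 + w, so reported CFG values are only comparable once the convention is known.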

Citation:

@article{nguyen2025oneflow,
      title={OneFlow: Concurrent Mixed-Modal and Interleaved Generation with Edit Flows},
      author={John Nguyen and Marton Havasi and Tariq Berrada and Luke Zettlemoyer and Ricky T. Q. Chen},
      year={2025},
      eprint={2510.03506},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.03506},
}
