---
pipeline_tag: image-to-text
license: apache-2.0
tags:
- Non-Autoregressive
- Masked-Generative-Transformer
- Discrete-Diffusion
- Unified-Model
language:
- en
---

# Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

[Paper](https://arxiv.org/abs/2505.23606) | [Model](https://huggingface.co/MeissonFlow/Muddit) | [Code](https://github.com/M-E-AGI-Lab/Muddit) | [Demo](https://huggingface.co/spaces/MeissonFlow/muddit)

## Introduction
Welcome to the official repository of **Muddit**, a foundation model in the Meissonic family built on discrete diffusion for unified and efficient multimodal generation.

Unlike traditional autoregressive methods, **Muddit** uses discrete diffusion (MaskGIT-style masking) as its core mechanism, which enables fast, parallel decoding across modalities.

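To make "parallel decoding" concrete, here is a minimal toy sketch of MaskGIT-style iterative unmasking. It is not Muddit's implementation: `toy_predictor`, `parallel_decode`, the `MASK` sentinel, and the cosine schedule are illustrative assumptions, and a random predictor stands in for the actual model.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a masked position; real models reserve a vocabulary id


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the model: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode(seq_len=16, vocab_size=8, steps=4, seed=0):
    """Start fully masked; at each step commit the most confident guesses at once."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # One forward pass predicts every position in parallel.
        preds = toy_predictor(tokens, vocab_size, rng)
        # Cosine schedule (an assumption here): how many positions stay masked
        # after this step. The final step always unmasks everything.
        if step == steps:
            keep_masked = 0
        else:
            keep_masked = math.floor(seq_len * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the n_commit most confident masked positions; the rest stay masked.
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:n_commit]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(parallel_decode())
```

With the defaults above, all 16 positions are filled in 4 predictor calls rather than 16, which is the source of the speedup over token-by-token decoding.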
While most unified models are still rooted in language priors, **Muddit** is developed from a visual-first perspective for scalable and flexible generation.

Muddit (512) and Muddit Plus (1024) aim to handle diverse tasks across modalities, such as text generation, image generation, and vision-language reasoning, within a single architecture and decoding paradigm.

## Usage

For usage instructions, please refer to the [GitHub repository](https://github.com/M-E-AGI-Lab/Muddit).

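If only the checkpoint files are needed, they can be fetched from the Hugging Face Hub. The sketch below assumes the `huggingface_hub` package is installed and that the files are hosted in the `MeissonFlow/Muddit` model repository linked at the top of this card. `download_muddit` is a hypothetical helper name, and running inference still requires the code from the GitHub repository.

```python
REPO_ID = "MeissonFlow/Muddit"  # model repository linked at the top of this card


def download_muddit(local_dir=None):
    """Download a snapshot of the model repository and return its local path.

    Assumes `pip install huggingface_hub`; the import is deferred so that
    defining this helper does not require the package.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_muddit()` stores the files in the default Hub cache. Pass `local_dir` to place them in a specific directory instead.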
## Citation
If you find this work helpful, please consider citing:
```bibtex
@article{shi2025muddit,
  title={Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model},
  author={Shi, Qingyu and Bai, Jinbin and Zhao, Zhuoran and Chai, Wenhao and Yu, Kaidong and Wu, Jianzong and Song, Shuangyong and Tong, Yunhai and Li, Xiangtai and Li, Xuelong and others},
  journal={arXiv preprint arXiv:2505.23606},
  year={2025}
}
```