# daVinci-MagiHuman — FastVideo Diffusers port
A FastVideo-format port of SII-GAIR + Sand.ai's daVinci-MagiHuman joint audio-visual generative model: a single repo with four sibling subfolders, one umbrella HF string per variant, and all four verified bit-exact against the official reference.

A 15B-parameter single-stream transformer jointly denoises video and audio in a unified token sequence. It generates a 5-second 256p clip with synchronized audio in ~2 s on a single H100. See the paper and the official repo.
## Variant matrix

| Subfolder | Model | Inference modes | Steps | CFG | Output | DiT files |
|---|---|---|---|---|---|---|
| `base/` | base 15B | T2V, TI2V | 32 | CFG=2 | 480x256 mp4 (video + audio) | 7 |
| `distill/` | DMD-2 distilled 15B | T2V, TI2V | 8 | no CFG | 480x256 mp4 (video + audio) | 7 |
| `sr_540p/` | base + SR 540p | T2V, TI2V | 32 + 5 | CFG=2 + SR cfg-trick | ~896x512 mp4 (video + audio) | 20 |
| `sr_1080p/` | base + SR 1080p (block-sparse local-window attention on 32/40 SR DiT layers) | T2V, TI2V | 32 + 5 | CFG=2 + SR cfg-trick | ~1920x1088 mp4 (video + audio) | 15 |
T2V = text only. TI2V = text + reference image; the image is encoded
through the Wan VAE and stitched into the first video latent frame at every
denoise step (matches upstream evaluate_with_latent per-step overwrite).
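In pseudocode, that per-step overwrite looks like the sketch below (names are hypothetical and the tensors are stand-ins; the real pipeline performs this on Wan VAE latents inside the scheduler loop):

```python
import numpy as np

def denoise_with_reference(latents, ref_latent, num_steps, step_fn):
    """Sketch of the TI2V per-step overwrite: re-stitch the encoded
    reference image into the first video latent frame before every
    denoising step (mirrors upstream evaluate_with_latent)."""
    for _ in range(num_steps):
        # Keep frame 0 pinned to the clean reference latent so the
        # model always conditions on the provided image.
        latents[:, 0] = ref_latent
        latents = step_fn(latents)
    return latents
```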
All four DiTs share the same architecture (40 layers, hidden=5120, head_dim=128,
GQA num_query_groups=8); only the weights differ. SR-1080p additionally
restricts video→video attention to a local window of frame_receptive_field=11
on 32 of 40 SR DiT layers (matches upstream's SR2_1080 config override).
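A frame-level sketch of that window (assuming a symmetric receptive field centered on each frame; the exact window placement in the upstream SR2_1080 override may differ):

```python
import numpy as np

def local_window_frame_mask(num_frames, receptive_field=11):
    """Boolean frame-to-frame attention mask: frame i may attend to
    frame j only when |i - j| <= receptive_field // 2, i.e. a window
    of `receptive_field` frames. Assumed symmetric-window semantics."""
    half = receptive_field // 2
    idx = np.arange(num_frames)
    return np.abs(idx[:, None] - idx[None, :]) <= half
```

The remaining 8 of 40 SR DiT layers keep full video→video attention.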
## Quick start

Install FastVideo (commit `c05c1048` or later on the `will/magi` branch contains all four variants):

```shell
uv pip install fastvideo
# or pin the branch:
uv pip install 'fastvideo @ git+https://github.com/hao-ai-lab/FastVideo@will/magi'
```

Accept the terms on the two gated upstream repos that the pipeline lazy-loads from:

- `google/t5gemma-9b-9b-ul2` — text encoder + tokenizer
- `stabilityai/stable-audio-open-1.0` — audio VAE

Then export your `HF_TOKEN` and run any of the following:
### Base T2V (~5 s on H100)

```python
from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained(
    "FastVideo/MagiHuman-Diffusers/base",
    num_gpus=1,
)
generator.generate_video(
    prompt="A warm afternoon scene: a person sits on a park bench reading a book, "
           "surrounded by softly swaying trees.",
    output_path="output.mp4",
    save_video=True,
)
generator.shutdown()
```
### Distill T2V (~2 s on H100, no CFG)

```python
from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained("FastVideo/MagiHuman-Diffusers/distill", num_gpus=1)
generator.generate_video(prompt="...", output_path="output.mp4", save_video=True)
generator.shutdown()
```
### Base TI2V (text + reference image)

```python
from fastvideo import VideoGenerator
from fastvideo.pipelines.basic.magi_human.pipeline_configs import MagiHumanBaseI2VConfig

generator = VideoGenerator.from_pretrained(
    "FastVideo/MagiHuman-Diffusers/base",
    num_gpus=1,
    workload_type="i2v",
    override_pipeline_cls_name="MagiHumanI2VPipeline",
    pipeline_config=MagiHumanBaseI2VConfig(),
)
generator.generate_video(
    prompt="A cheerful saxophonist performs a short line in a small jazz club.",
    image_path="reference.jpg",
    output_path="output.mp4",
    save_video=True,
)
generator.shutdown()
```
### Super-resolution (540p)

```python
from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained("FastVideo/MagiHuman-Diffusers/sr_540p", num_gpus=1)
generator.generate_video(prompt="...", output_path="output_540p.mp4", save_video=True)
generator.shutdown()
```
### Super-resolution (1080p)

```python
from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained("FastVideo/MagiHuman-Diffusers/sr_1080p", num_gpus=1)
generator.generate_video(prompt="...", output_path="output_1080p.mp4", save_video=True)
generator.shutdown()
```
For full runnable examples covering all eight (variant × mode) combinations, see `examples/inference/basic/basic_magi_human*.py`.
## Lazy-load contract — what FastVideo fetches

Each subfolder of this repo ships only variant-specific weights:

```
<subfolder>/
├── model_index.json
├── transformer/      ← variant DiT weights
├── scheduler/        ← FlowUniPCMultistepScheduler config
└── sr_transformer/   ← only in sr_540p/, sr_1080p/
```
The four cross-variant shared components (~25 GB total) are lazy-loaded from their canonical upstream HF repos the first time the pipeline runs:

| Component | Source | Gated? |
|---|---|---|
| Wan 2.2 TI2V-5B VAE (video decode) | `Wan-AI/Wan2.2-TI2V-5B-Diffusers` | no |
| T5-Gemma 9B encoder + tokenizer | `google/t5gemma-9b-9b-ul2` | yes (Google terms of use) |
| Stable Audio Open 1.0 VAE (audio decode) | `stabilityai/stable-audio-open-1.0` | yes (Stability AI terms of use) |
Net effect: a user running all four variants downloads ~50 GB of variant weights + a single ~25 GB shared cache, totaling ~75 GB instead of ~400 GB if each variant bundled its own copies.
## Re-converting from raw upstream weights

If you need to re-convert from the raw `GAIR/daVinci-MagiHuman` weights:

```shell
# Each variant is converted individually; the umbrella layout is the
# concatenation of these four outputs.
python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
    --source GAIR/daVinci-MagiHuman \
    --subfolder base \
    --output local_weights/base

python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
    --source GAIR/daVinci-MagiHuman \
    --subfolder distill \
    --output local_weights/distill \
    --cast-bf16

python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
    --source GAIR/daVinci-MagiHuman \
    --subfolder base \
    --sr-source GAIR/daVinci-MagiHuman --sr-subfolder 540p_sr \
    --output local_weights/sr_540p \
    --cast-bf16

python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
    --source GAIR/daVinci-MagiHuman \
    --subfolder base \
    --sr-source GAIR/daVinci-MagiHuman --sr-subfolder 1080p_sr \
    --output local_weights/sr_1080p \
    --cast-bf16
```
Pass `--bundle-vae` / `--bundle-audio-vae` / `--bundle-text-encoder` if you want a self-contained snapshot instead of relying on lazy-load.
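Under the hood, a conversion like this is essentially a key-remapping (and optional dtype-casting) pass over the upstream state dict. The sketch below is purely illustrative: the prefixes and mapping rules are hypothetical placeholders, not the script's actual tables:

```python
import numpy as np

# Hypothetical prefix mapping, for illustration only; the real mapping
# tables live in convert_magi_human_to_diffusers.py.
KEY_MAP = {"blocks.": "transformer_blocks.", "final_norm.": "norm_out."}

def remap_keys(state_dict, key_map=KEY_MAP, cast=None):
    """Rename state-dict keys by prefix and optionally cast dtypes
    (the --cast-bf16 analogue; numpy lacks bf16, so any target dtype
    stands in here)."""
    out = {}
    for name, tensor in state_dict.items():
        for src, dst in key_map.items():
            if name.startswith(src):
                name = dst + name[len(src):]
                break
        out[name] = tensor.astype(cast) if cast is not None else tensor
    return out
```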
## Parity vs official daVinci-MagiHuman

All four variants pass FastVideo's local parity battery bit-exact (`diff_max=0.0`, `diff_mean=0.0`) against the official reference DiT:

| Test | Result |
|---|---|
| `test_magi_human_dit_parity` (base) | bit-exact |
| `test_magi_human_distill_dit_parity` | bit-exact |
| `test_magi_human_pipeline_latent_parity` (base T2V) | bit-exact |
| `test_magi_human_ti2v_pipeline_latent_parity` | bit-exact |
| `test_magi_human_sr540p_pipeline_latent_parity[t2v / ti2v]` | bit-exact |
| `test_magi_human_sr1080p_pipeline_latent_parity[t2v / ti2v]` | bit-exact |
| `test_magi_human_t5gemma_parity` | bit-exact |
| `test_magi_human_sa_audio_parity` (FV + official) | bit-exact |
| `test_magi_human_vae_parity` (Wan VAE decode) | 8e-4 max (fp32 op-order drift, tracked) |
Block-sparse local-window attention for SR-1080p is implemented as a 3-block accumulator over vanilla SDPA (per-frame video→local-video, all-video→audio+text, and audio+text→all), which mathematically matches upstream's `magi_attention.api.flex_flash_attn_func` contract for this 3-block layout. The equivalence is verified bit-exact.
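The accumulator trick can be illustrated with a two-block version over plain softmax attention: splitting keys/values into blocks and renormalizing with a shared running max reproduces full attention exactly. A minimal numpy sketch (illustrative only, not the FastVideo kernel):

```python
import numpy as np

def sdpa(q, k, v):
    """Plain softmax attention over a single key/value block."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def sdpa_two_block(q, k1, v1, k2, v2):
    """Accumulate attention over two key/value blocks with a shared
    max for numerical stability; mathematically identical to attending
    over the concatenation of both blocks."""
    d = np.sqrt(q.shape[-1])
    s1, s2 = q @ k1.T / d, q @ k2.T / d
    m = np.maximum(s1.max(axis=-1, keepdims=True),
                   s2.max(axis=-1, keepdims=True))
    e1, e2 = np.exp(s1 - m), np.exp(s2 - m)
    z = e1.sum(axis=-1, keepdims=True) + e2.sum(axis=-1, keepdims=True)
    return (e1 @ v1 + e2 @ v2) / z
```

The SR-1080p path uses three such blocks with per-block masks instead of two, but the renormalization argument is the same.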
## Citation

```bibtex
@article{davinci-magihuman-2026,
  title   = {Speed by Simplicity: A Single-Stream Architecture for Fast
             Audio-Video Generative Foundation Model},
  author  = {SII-GAIR and Sand.ai},
  journal = {arXiv preprint arXiv:2603.21986},
  year    = {2026}
}

@misc{fastvideo-magihuman-port,
  title        = {{daVinci-MagiHuman} for {FastVideo}},
  author       = {{FastVideo team}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/FastVideo/MagiHuman-Diffusers}}
}
```
## License
Apache 2.0 (matches upstream GAIR/daVinci-MagiHuman).
## Acknowledgments

- SII-GAIR and Sand.ai for the original daVinci-MagiHuman model and inference code.
- Wan-AI for the Wan 2.2 video VAE.
- Google for the T5-Gemma text encoder.
- Stability AI for the Stable Audio Open 1.0 audio VAE.
- SandAI-org / MagiAttention for the canonical FFA / `flex_flash_attn_func` reference.