# MIDI Generation Pipeline: Text-to-Music

Complete pipelines for training and inference of text-conditioned MIDI generation models, covering both a from-scratch GPT2-style model and a Qwen3-based autoregressive model.
## Two Architectures

### 1. GPT2-Style (`train_midi_gpt.py`)

- From-scratch GPT2 model with a custom vocabulary
- ~50M parameters (configurable)
- Fast training; good for experimentation

### 2. Qwen3-0.6B (`train_midi_qwen3.py`) ⭐ Recommended

- Pretrained LLM with vocabulary expansion (inspired by MIDI-LLM)
- 751M parameters with rich text understanding
- Tied embeddings handled automatically
- Apache-2.0 license
## Files

| File | Purpose |
|---|---|
| `prepare_dataset.py` | Preprocess for the GPT2 pipeline |
| `prepare_dataset_qwen3.py` | Preprocess for the Qwen3 pipeline (rich prompts) |
| `train_midi_gpt.py` | Train the GPT2-style model |
| `train_midi_qwen3.py` | Fine-tune Qwen3-0.6B with MIDI vocab expansion |
| `inference_midi_gpt.py` | Generate MIDI with the GPT2 model |
| `inference_midi_qwen3.py` | Generate MIDI with the Qwen3 model |
| `create_synthetic_dataset.py` | Generate synthetic test data |
| `test_end_to_end.py` | Validate the GPT2 pipeline |
| `test_qwen3_e2e.py` | Validate the Qwen3 pipeline |
## Dataset

- Processed dataset: `rahuldshetty/midi-generation-dataset`
- Source: `B-K/midi-dataset-2` (MidiCaps with MIDI bytes)
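The processed dataset can be inspected directly with 🤗 Datasets; a minimal sketch (split and column names are whatever the dataset defines):

```python
# Minimal sketch: load the processed dataset and inspect its splits.
from datasets import load_dataset

ds = load_dataset("rahuldshetty/midi-generation-dataset")
print(ds)  # shows available splits, columns, and row counts
```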
## Quick Start: Qwen3 Pipeline

### 1. Install

```bash
pip install transformers torch datasets miditok miditoolkit tokenizers tqdm
```

### 2. Prepare Dataset

```bash
python prepare_dataset_qwen3.py \
    --dataset B-K/midi-dataset-2 \
    --output_dir ./midi_data_qwen3 \
    --max_seq_len 2048
```

### 3. Train

```bash
python train_midi_qwen3.py \
    --data_dir ./midi_data_qwen3 \
    --output_dir ./midi_qwen3_model \
    --num_epochs 20 \
    --batch_size 2 \
    --gradient_accumulation_steps 8 \
    --learning_rate 5e-5 \
    --bf16 \
    --gradient_checkpointing
```

### 4. Generate MIDI

```bash
python inference_midi_qwen3.py \
    --model_dir ./midi_qwen3_model/final \
    --prompt "A cheerful jazz piece with piano and saxophone in C major, 120 BPM" \
    --output_path output.mid \
    --max_midi_tokens 1024
```
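For programmatic use, generation can also be driven directly through `transformers`. The sketch below is an assumption-laden illustration, not the repo's code: the prompt text and sampling settings are made up, and MIDI detokenization (turning the generated tokens into a `.mid` file) is left to `inference_midi_qwen3.py`.

```python
# Hedged sketch of programmatic generation; the canonical entry point
# is inference_midi_qwen3.py, which also detokenizes the REMI tokens
# back into a .mid file. Paths and sampling settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./midi_qwen3_model/final"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16)
model.eval()

prompt = (
    "You are a world-class composer. Please compose some music "
    "according to the following description:\n"
    "Description: A cheerful jazz piece with piano and saxophone\n"
)
# Append <|midi_start|> so the model begins emitting MIDI tokens.
inputs = tokenizer(prompt + "<|midi_start|>", return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.9,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|midi_end|>"),
    )

# Keep only the newly generated MIDI tokens; pass these to the REMI
# detokenizer (see inference_midi_qwen3.py) to write output.mid.
midi_ids = output_ids[0, inputs["input_ids"].shape[1]:].tolist()
print(tokenizer.convert_ids_to_tokens(midi_ids)[:10])
```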
## Qwen3 Architecture Details

### Vocabulary Expansion

- Qwen3 base vocab: 151,936 tokens
- MIDI special tokens: `<|midi_start|>`, `<|midi_end|>`, `<|midi_pad|>`
- MIDI vocab tokens: `<|midi_0|>` ... `<|midi_515|>` (REMI tokenization)
- Total vocab: 152,455 (= 151,936 + 3 + 516)
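A minimal sketch of this expansion step, using the token names listed above (the real logic lives in `train_midi_qwen3.py`):

```python
# Minimal sketch of the vocabulary expansion; token names follow the
# list above, the actual implementation is train_midi_qwen3.py.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

midi_tokens = ["<|midi_start|>", "<|midi_end|>", "<|midi_pad|>"]
midi_tokens += [f"<|midi_{i}|>" for i in range(516)]  # REMI vocab slots
tokenizer.add_special_tokens({"additional_special_tokens": midi_tokens})

# resize_token_embeddings also re-ties the output head to the input
# embeddings when the config uses weight tying, which is why tied
# embeddings are "handled automatically".
model.resize_token_embeddings(len(tokenizer))
```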
### Training Labels

- Text prefix → `-100` (not trained on)
- MIDI tokens + special tokens → actual token IDs
- The model learns only music generation
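In code, this masking amounts to cloning `input_ids` and overwriting the prompt span with `-100`. A sketch, assuming the tokenized prompt length is stored per example (how `prepare_dataset_qwen3.py` records it is an assumption):

```python
# Sketch of the label-masking scheme above: loss is computed only on
# the MIDI span. prompt_len is the tokenized length of the text prefix.
import torch

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    labels = input_ids.clone()
    labels[:prompt_len] = -100  # -100 is ignored by PyTorch cross-entropy
    return labels
```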
### Rich Prompt Format

```
You are a world-class composer. Please compose some music according to the following description:

Description: [caption]
Genre: [genre]
Mood: [mood]
Key: [key]
Time Signature: [time_signature]
Tempo: [tempo] BPM
Duration: [duration] seconds
Instruments: [instruments]
Chords: [chords]
```
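A prompt of this shape can be rendered from a metadata record; the dictionary keys below are hypothetical, mirroring the template's placeholder names rather than confirmed dataset column names:

```python
# Hypothetical helper that renders the template above; the metadata
# keys are assumptions based on the placeholder names.
def build_prompt(meta: dict) -> str:
    return (
        "You are a world-class composer. Please compose some music "
        "according to the following description:\n"
        f"Description: {meta['caption']}\n"
        f"Genre: {meta['genre']}\n"
        f"Mood: {meta['mood']}\n"
        f"Key: {meta['key']}\n"
        f"Time Signature: {meta['time_signature']}\n"
        f"Tempo: {meta['tempo']} BPM\n"
        f"Duration: {meta['duration']} seconds\n"
        f"Instruments: {meta['instruments']}\n"
        f"Chords: {meta['chords']}"
    )
```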
## Recommended Datasets
| Dataset | Link | Description |
|---|---|---|
| B-K/midi-dataset-2 | HF | Best - rich metadata + MIDI bytes |
| amaai-lab/MidiCaps | HF | 168K captions (no MIDI bytes) |
| foldl/midi | HF | Name + genre + MIDI bytes |
## Hardware Recommendations
| Model | GPU | Batch | Notes |
|---|---|---|---|
| GPT2 (50M) | t4-small | 4 | Fast experimentation |
| Qwen3-0.6B | a10g-large | 2 | Enable `--gradient_checkpointing` |
| Qwen3-0.6B | a100-large | 4 | Full training |
## SOTA References
- MIDI-LLM (Wu et al., 2025): LLM vocab expansion for MIDI
- MIDI-GPT (Pasquier et al., 2025): GPT2 for MIDI
- text2midi (Bhandari et al., AAAI 2025): T5 encoder + decoder
## License

Scripts: Apache-2.0 (follows the Qwen3 license)