# MIDI Generation Pipeline: Text-to-Music

Complete pipelines for training and inference of text-conditioned MIDI generation models, covering both a from-scratch GPT2-style model and a Qwen3-based autoregressive model.
## Two Architectures

### 1. GPT2-Style (`train_midi_gpt.py`)

- From-scratch GPT2 model with a custom vocabulary
- ~50M parameters (configurable)
- Fast training; good for experimentation

### 2. Qwen3-0.6B (`train_midi_qwen3.py`) ⭐ Recommended

- Pretrained LLM with vocabulary expansion (inspired by MIDI-LLM)
- 751M parameters with rich text understanding
- Tied embeddings handled automatically
- Apache-2.0 license
## Files

| File | Purpose |
|---|---|
| `prepare_dataset.py` | Preprocess for the GPT2 pipeline |
| `prepare_dataset_qwen3.py` | Preprocess for the Qwen3 pipeline (rich prompts) |
| `train_midi_gpt.py` | Train the GPT2-style model |
| `train_midi_qwen3.py` | Fine-tune Qwen3-0.6B with MIDI vocab expansion |
| `inference_midi_gpt.py` | Generate MIDI with the GPT2 model |
| `inference_midi_qwen3.py` | Generate MIDI with the Qwen3 model |
| `create_synthetic_dataset.py` | Generate synthetic test data |
| `test_end_to_end.py` | Validate the GPT2 pipeline |
| `test_qwen3_e2e.py` | Validate the Qwen3 pipeline |
## Dataset

- Processed dataset: `rahuldshetty/midi-generation-dataset`
- Source: `B-K/midi-dataset-2` (MidiCaps with MIDI bytes)
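The processed dataset can be inspected directly with 🤗 Datasets; a minimal sketch (split and column names are whatever the dataset defines):

```python
# Minimal sketch: load the processed dataset and inspect its splits.
from datasets import load_dataset

ds = load_dataset("rahuldshetty/midi-generation-dataset")
print(ds)  # shows available splits, columns, and row counts
```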
## Quick Start: Qwen3 Pipeline

### 1. Install

```bash
pip install transformers torch datasets miditok miditoolkit tokenizers tqdm
```

### 2. Prepare Dataset

```bash
python prepare_dataset_qwen3.py \
    --dataset B-K/midi-dataset-2 \
    --output_dir ./midi_data_qwen3 \
    --max_seq_len 2048
```

### 3. Train

```bash
python train_midi_qwen3.py \
    --data_dir ./midi_data_qwen3 \
    --output_dir ./midi_qwen3_model \
    --num_epochs 20 \
    --batch_size 2 \
    --gradient_accumulation_steps 8 \
    --learning_rate 5e-5 \
    --bf16 \
    --gradient_checkpointing
```

### 4. Generate MIDI

```bash
python inference_midi_qwen3.py \
    --model_dir ./midi_qwen3_model/final \
    --prompt "A cheerful jazz piece with piano and saxophone in C major, 120 BPM" \
    --output_path output.mid \
    --max_midi_tokens 1024
```
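For programmatic use, generation can also be driven directly through `transformers`. The sketch below is an assumption-laden illustration, not the repo's code: the prompt text and sampling settings are made up, and MIDI detokenization (turning the generated tokens into a `.mid` file) is left to `inference_midi_qwen3.py`.

```python
# Hedged sketch of programmatic generation; the canonical entry point
# is inference_midi_qwen3.py, which also detokenizes the REMI tokens
# back into a .mid file. Paths and sampling settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./midi_qwen3_model/final"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16)
model.eval()

prompt = (
    "You are a world-class composer. Please compose some music "
    "according to the following description:\n"
    "Description: A cheerful jazz piece with piano and saxophone\n"
)
# Append <|midi_start|> so the model begins emitting MIDI tokens.
inputs = tokenizer(prompt + "<|midi_start|>", return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.9,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|midi_end|>"),
    )

# Keep only the newly generated MIDI tokens; pass these to the REMI
# detokenizer (see inference_midi_qwen3.py) to write output.mid.
midi_ids = output_ids[0, inputs["input_ids"].shape[1]:].tolist()
print(tokenizer.convert_ids_to_tokens(midi_ids)[:10])
```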
## Qwen3 Architecture Details

### Vocabulary Expansion

- Qwen3 base vocab: 151,936 tokens
- MIDI special tokens: `<|midi_start|>`, `<|midi_end|>`, `<|midi_pad|>`
- MIDI vocab tokens: `<|midi_0|>` ... `<|midi_515|>` (REMI tokenization)
- Total vocab: 152,455 (= 151,936 + 3 + 516)
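A minimal sketch of this expansion step, using the token names listed above (the real logic lives in `train_midi_qwen3.py`):

```python
# Minimal sketch of the vocabulary expansion; token names follow the
# list above, the actual implementation is train_midi_qwen3.py.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

midi_tokens = ["<|midi_start|>", "<|midi_end|>", "<|midi_pad|>"]
midi_tokens += [f"<|midi_{i}|>" for i in range(516)]  # REMI vocab slots
tokenizer.add_special_tokens({"additional_special_tokens": midi_tokens})

# resize_token_embeddings also re-ties the output head to the input
# embeddings when the config uses weight tying, which is why tied
# embeddings are "handled automatically".
model.resize_token_embeddings(len(tokenizer))
```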
### Training Labels

- Text prefix → `-100` (not trained on)
- MIDI tokens + special tokens → actual token IDs
- The model learns only music generation
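In code, this masking amounts to cloning `input_ids` and overwriting the prompt span with `-100`. A sketch, assuming the tokenized prompt length is stored per example (how `prepare_dataset_qwen3.py` records it is an assumption):

```python
# Sketch of the label-masking scheme above: loss is computed only on
# the MIDI span. prompt_len is the tokenized length of the text prefix.
import torch

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    labels = input_ids.clone()
    labels[:prompt_len] = -100  # -100 is ignored by PyTorch cross-entropy
    return labels
```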
### Rich Prompt Format

```
You are a world-class composer. Please compose some music according to the following description:

Description: [caption]
Genre: [genre]
Mood: [mood]
Key: [key]
Time Signature: [time_signature]
Tempo: [tempo] BPM
Duration: [duration] seconds
Instruments: [instruments]
Chords: [chords]
```
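A prompt of this shape can be rendered from a metadata record; the dictionary keys below are hypothetical, mirroring the template's placeholder names rather than confirmed dataset column names:

```python
# Hypothetical helper that renders the template above; the metadata
# keys are assumptions based on the placeholder names.
def build_prompt(meta: dict) -> str:
    return (
        "You are a world-class composer. Please compose some music "
        "according to the following description:\n"
        f"Description: {meta['caption']}\n"
        f"Genre: {meta['genre']}\n"
        f"Mood: {meta['mood']}\n"
        f"Key: {meta['key']}\n"
        f"Time Signature: {meta['time_signature']}\n"
        f"Tempo: {meta['tempo']} BPM\n"
        f"Duration: {meta['duration']} seconds\n"
        f"Instruments: {meta['instruments']}\n"
        f"Chords: {meta['chords']}"
    )
```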
## Recommended Datasets
| Dataset | Link | Description |
|---|---|---|
| B-K/midi-dataset-2 | HF | Best - rich metadata + MIDI bytes |
| amaai-lab/MidiCaps | HF | 168K captions (no MIDI bytes) |
| foldl/midi | HF | Name + genre + MIDI bytes |
## Hardware Recommendations
| Model | GPU | Batch | Notes |
|---|---|---|---|
| GPT2 (50M) | t4-small | 4 | Fast experimentation |
| Qwen3-0.6B | a10g-large | 2 | Enable `--gradient_checkpointing` |
| Qwen3-0.6B | a100-large | 4 | Full training |
## SOTA References
- MIDI-LLM (Wu et al., 2025): LLM vocab expansion for MIDI
- MIDI-GPT (Pasquier et al., 2025): GPT2 for MIDI
- text2midi (Bhandari et al., AAAI 2025): T5 encoder + decoder
## License

Scripts: Apache-2.0 (follows the Qwen3 license)