---
title: ACE-Step 1.5 XL Music Generation (CPU)
emoji: 🎵
colorFrom: indigo
colorTo: yellow
sdk: docker
pinned: false
license: mit
tags:
- music-generation
- ace-step
- gguf
- lora
- training
- cpu
- mcp-server
short_description: ACE-Step 1.5 XL - CPU music generation + LoRA training
models:
- ACE-Step/Ace-Step1.5
startup_duration_timeout: 2h
---
# ACE-Step 1.5 XL Music Generation (CPU)
**GGUF inference + LoRA training** on free CPU Spaces. Powered by [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp).
## Features
- **Music Generation** -- text/lyrics to stereo 48kHz MP3 via GGUF quantized models
- **LoRA Training** -- fine-tune on your own audio (~11s/epoch CPU, ~1.4s/epoch GPU)
- **Auto-Captioning** -- librosa BPM/key/signature + LM understand mode (caption + lyrics extraction)
- **Multiple LM Sizes** -- 0.6B / 1.7B / 4B language models (on-demand download)
- **Cancel + Download** -- cancel training mid-epoch, download trained LoRA adapter
## Music Generation
1. Enter a music description
2. Enter lyrics or check **Instrumental**
3. Adjust BPM, duration, steps, seed
4. Select LoRA adapter if trained
5. Click **Generate Music**
**Timing:** ~270s for 10s of audio with the 1.7B LM at 8 steps on CPU.
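For planning longer generations, a rough back-of-envelope helper can be anchored to the measured figure above. This is a hypothetical sketch that assumes generation time scales roughly linearly with duration and step count, which is only an approximation:

```python
def estimate_cpu_seconds(duration_s: float, steps: int) -> float:
    """Rough CPU generation-time estimate, anchored to the measured
    ~270s for 10s of audio at 8 steps with the 1.7B LM.
    Assumes (approximately) linear scaling in duration and steps."""
    BASELINE_SECONDS = 270.0   # measured on the free CPU Space
    BASELINE_DURATION = 10.0   # seconds of audio
    BASELINE_STEPS = 8
    return BASELINE_SECONDS * (duration_s / BASELINE_DURATION) * (steps / BASELINE_STEPS)

print(f"~{estimate_cpu_seconds(30, 8):.0f}s for 30s of audio at 8 steps")
```

Treat the result as an order-of-magnitude hint, not a guarantee; real timings depend on CPU load and model warm-up.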
## LoRA Training
1. Upload audio files (any length; auto-tiled into 30s chunks by the VAE)
2. Set LoRA name, epochs, learning rate, rank
3. Click **Train** -- ace-server stops during training, restarts after
4. Use **Cancel** to stop early (saves checkpoint)
5. **Download** the trained adapter file
6. Trained adapter appears in the LoRA dropdown
**Timing:** ~170s preprocessing + ~11s/epoch on CPU. GPU: ~1.4s/epoch.
**Limits:** 30 min total audio across all files. Files exceeding the cap are truncated with a warning. 50 files max. 8h training timeout.
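The cap-and-truncate behavior described above can be sketched as follows. This is a hypothetical illustration of the documented limits (30 minutes total, truncation with a warning, 50 files max), not the Space's actual implementation:

```python
MAX_TOTAL_SECONDS = 30 * 60  # 30 min of audio across all files
MAX_FILES = 50

def select_training_audio(durations):
    """Given (filename, seconds) pairs, keep files until the 30-min cap.
    The file that crosses the cap is truncated; later files are dropped.
    Sketch of the documented limits, not the Space's code."""
    kept, warnings, total = [], [], 0.0
    for name, secs in durations[:MAX_FILES]:
        if total >= MAX_TOTAL_SECONDS:
            warnings.append(f"{name}: dropped (30-min cap reached)")
            continue
        if total + secs > MAX_TOTAL_SECONDS:
            remaining = MAX_TOTAL_SECONDS - total
            warnings.append(f"{name}: truncated to {remaining:.0f}s")
            secs = remaining
        kept.append((name, secs))
        total += secs
    return kept, warnings

kept, warnings = select_training_audio([("a.mp3", 1000), ("b.mp3", 700), ("c.mp3", 200)])
print(kept, warnings)  # c.mp3 is truncated to the last 100s of budget
```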
**Settings (per Side-Step author recommendations):**
- LR: 3e-4
- Rank: 32, Alpha: 64
- Epochs: 200-500 for 3-10 files
- Optimizer: Adafactor (minimal memory)
- Variant: standard turbo (not XL; the XL model hits swap on 18 GB RAM)
## Captioning Pipeline
Training audio is auto-captioned before preprocessing:
| Method | What it extracts | Speed |
|--------|-----------------|-------|
| **librosa** | BPM, key, time signature | ~3s/file |
| **LM understand** (GPU) | Rich caption + lyrics + metadata | ~52s/file |
| **ace-server /understand** (Space) | Same as LM, via GGUF | ~30s/file |
| **.txt/.json sidecar** | User-provided caption (if present) | instant |
On the Space, training uses ace-server /understand; locally, it uses the PyTorch LM understand mode.
## Models
| Component | GGUF | Size | Purpose |
|-----------|------|------|---------|
| DiT XL turbo | acestep-v15-xl-turbo-Q4_K_M | 2.8 GB | Music generation (no LoRA) |
| DiT standard turbo | acestep-v15-turbo-Q4_K_M | 1.1 GB | Music generation (with LoRA) |
| LM 1.7B | acestep-5Hz-lm-1.7B-Q8_0 | 1.7 GB | Caption understanding |
| Text Encoder | Qwen3-Embedding-0.6B-Q8_0 | 0.75 GB | Text encoding |
| VAE | vae-BF16 | 0.32 GB | Audio encode/decode |
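To budget disk space, the table above can be summed directly. Note that a single generation run loads either the XL or the standard DiT, not both, so the working set is smaller than the full download:

```python
# GGUF sizes from the models table above, in GB
MODELS = {
    "acestep-v15-xl-turbo-Q4_K_M": 2.8,
    "acestep-v15-turbo-Q4_K_M": 1.1,
    "acestep-5Hz-lm-1.7B-Q8_0": 1.7,
    "Qwen3-Embedding-0.6B-Q8_0": 0.75,
    "vae-BF16": 0.32,
}
total = sum(MODELS.values())
print(f"Full download: {total:.2f} GB")  # 6.67 GB if every model is fetched
```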
## API
### Generate Music
```python
from gradio_client import Client
client = Client("WeReCooking/ACE-Step-CPU")
result = client.predict(
caption="upbeat electronic dance music",
lyrics="[Instrumental]",
instrumental=True, bpm=120, duration=10, seed=-1, steps=8,
lora_select="None (no LoRA)",
lm_model_select="acestep-5Hz-lm-1.7B-Q8_0.gguf",
api_name="/generate"
)
```
### Train LoRA
```python
from gradio_client import Client, handle_file
client = Client("WeReCooking/ACE-Step-CPU")
result = client.predict(
audio_files=[handle_file("song.mp3")],
lora_name="my-style", epochs=200, lr=0.0003, rank=32,
api_name="/train_lora"
)
```
### MCP (Model Context Protocol)
```json
{
"mcpServers": {
"ace-step": {"url": "https://werecooking-ace-step-cpu.hf.space/gradio_api/mcp/"}
}
}
```
## CLI
```bash
python app.py "upbeat electronic dance music" --duration 10 --steps 8
python app.py "jazz piano" --adapter my-style --seed 42
```
## Architecture
- **Inference:** GGUF via [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp)
- **Training:** PyTorch, ported from [Side-Step](https://github.com/koda-dernet/Side-Step) (commit ecd13bd)
- **Captioning:** librosa + LM understand (PyTorch or ace-server /understand)
- Training stops ace-server to free RAM, restarts after with new adapters
- Inference blocked during training with clear message
## Credits
- [ACE-Step 1.5](https://github.com/ace-step/ACE-Step-1.5)
- [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp)
- [Side-Step](https://github.com/koda-dernet/Side-Step)
- [Serveurperso/ACE-Step-1.5-GGUF](https://huggingface.co/Serveurperso/ACE-Step-1.5-GGUF)