---
title: ACE-Step 1.5 XL Music Generation (CPU)
emoji: 🎵
colorFrom: indigo
colorTo: yellow
sdk: docker
pinned: false
license: mit
tags:
  - music-generation
  - ace-step
  - gguf
  - lora
  - training
  - cpu
  - mcp-server
short_description: ACE-Step 1.5 XL - CPU music generation + LoRA training
models:
  - ACE-Step/Ace-Step1.5
startup_duration_timeout: 2h
---

# ACE-Step 1.5 XL Music Generation (CPU)

**GGUF inference + LoRA training** on free CPU Spaces. Powered by [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp).

## Features

- **Music Generation** -- text/lyrics to stereo 48kHz MP3 via GGUF quantized models
- **LoRA Training** -- fine-tune on your own audio (~11s/epoch CPU, ~1.4s/epoch GPU)
- **Auto-Captioning** -- librosa BPM/key/signature + LM understand mode (caption + lyrics extraction)
- **Multiple LM Sizes** -- 0.6B / 1.7B / 4B language models (on-demand download)
- **Cancel + Download** -- cancel training mid-epoch, download trained LoRA adapter

## Music Generation

1. Enter a music description
2. Enter lyrics or check **Instrumental**
3. Adjust BPM, duration, steps, seed
4. Select LoRA adapter if trained
5. Click **Generate Music**

**Timing:** ~270s to generate 10s of audio with the 1.7B LM at 8 steps on CPU.

## LoRA Training

1. Upload audio files (any length; auto-tiled into 30s chunks by the VAE)
2. Set LoRA name, epochs, learning rate, rank
3. Click **Train** -- ace-server stops during training, restarts after
4. Use **Cancel** to stop early (saves checkpoint)
5. **Download** the trained adapter file
6. Trained adapter appears in the LoRA dropdown

**Timing:** ~170s preprocessing + ~11s/epoch on CPU. GPU: ~1.4s/epoch.

**Limits:** 30 min total audio across all files. Files exceeding the cap are truncated with a warning. 50 files max. 8h training timeout.
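
The limits above can be sketched as a pre-upload check. This is illustrative only; the function name and exact truncation behavior are assumptions, not the Space's actual code:

```python
# Illustrative sketch of the documented upload limits (assumed helper,
# not the Space's actual validation code).
MAX_TOTAL_SECONDS = 30 * 60   # 30 min of audio across all files
MAX_FILES = 50                # 50 files max

def check_upload(durations_s):
    """Validate per-file durations (seconds) against the documented caps,
    truncating the file that crosses the 30 min total with a warning."""
    if len(durations_s) > MAX_FILES:
        raise ValueError(f"too many files: {len(durations_s)} > {MAX_FILES}")
    total = 0.0
    kept = []
    for d in durations_s:
        if total + d > MAX_TOTAL_SECONDS:
            kept.append(MAX_TOTAL_SECONDS - total)  # truncated portion
            print("warning: audio truncated at the 30 min cap")
            break
        kept.append(d)
        total += d
    return kept
```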

**Settings (per Side-Step author recommendations):**
- LR: 3e-4
- Rank: 32, Alpha: 64
- Epochs: 200-500 for 3-10 files
- Optimizer: Adafactor (minimal memory)
- Variant: standard turbo (not XL -- the XL model swaps on 18GB RAM)
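
The recommendations above map onto the `/train_lora` endpoint's documented parameters; a minimal defaults dict, assuming alpha, optimizer, and variant are fixed server-side (only `epochs`, `lr`, and `rank` appear in the API section below):

```python
# Recommended LoRA training settings (per the Side-Step author), expressed as
# kwargs for client.predict(..., api_name="/train_lora"). Alpha (= 2 * rank),
# the Adafactor optimizer, and the standard-turbo variant are assumed to be
# configured server-side in this sketch.
SIDE_STEP_DEFAULTS = {
    "epochs": 300,   # 200-500 recommended for 3-10 files
    "lr": 3e-4,
    "rank": 32,      # alpha = 2 * rank = 64
}
```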

## Captioning Pipeline

Training audio is auto-captioned before preprocessing:

| Method | What it extracts | Speed |
|--------|-----------------|-------|
| **librosa** | BPM, key, time signature | ~3s/file |
| **LM understand** (GPU) | Rich caption + lyrics + metadata | ~52s/file |
| **ace-server /understand** (Space) | Same as LM, via GGUF | ~30s/file |
| **.txt/.json sidecar** | User-provided caption (if present) | instant |

On the Space, captions come from ace-server /understand before training; locally, from the PyTorch LM understand mode.
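
As a sketch of the sidecar path from the table above, a user-provided caption might sit next to the audio file; both the filename convention and the JSON schema here are assumptions for illustration, not the pipeline's documented format:

```python
import json
from pathlib import Path

# Hypothetical sidecar: "song.mp3" -> "song.json" with caption/lyrics fields.
# Naming convention and schema are assumptions, not the documented format.
audio = Path("song.mp3")
sidecar = audio.with_suffix(".json")
sidecar.write_text(json.dumps({
    "caption": "upbeat electronic dance music, 120 BPM, C minor",
    "lyrics": "[Instrumental]",
}))

# What the captioning step would read when a sidecar is present:
meta = json.loads(sidecar.read_text())
print(meta["caption"])
```

A plain `.txt` sidecar with the caption as its entire contents would work the same way; either form skips the ~30-52s/file LM pass.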

## Models

| Component | GGUF | Size | Purpose |
|-----------|------|------|---------|
| DiT XL turbo | acestep-v15-xl-turbo-Q4_K_M | 2.8 GB | Music generation (no LoRA) |
| DiT standard turbo | acestep-v15-turbo-Q4_K_M | 1.1 GB | Music generation (with LoRA) |
| LM 1.7B | acestep-5Hz-lm-1.7B-Q8_0 | 1.7 GB | Caption understanding |
| Text Encoder | Qwen3-Embedding-0.6B-Q8_0 | 0.75 GB | Text encoding |
| VAE | vae-BF16 | 0.32 GB | Audio encode/decode |

## API

### Generate Music

```python
from gradio_client import Client

client = Client("WeReCooking/ACE-Step-CPU")
result = client.predict(
    caption="upbeat electronic dance music",
    lyrics="[Instrumental]",
    instrumental=True, bpm=120, duration=10, seed=-1, steps=8,
    lora_select="None (no LoRA)",
    lm_model_select="acestep-5Hz-lm-1.7B-Q8_0.gguf",
    api_name="/generate"
)
```

### Train LoRA

```python
from gradio_client import Client, handle_file

client = Client("WeReCooking/ACE-Step-CPU")
result = client.predict(
    audio_files=[handle_file("song.mp3")],
    lora_name="my-style", epochs=200, lr=0.0003, rank=32,
    api_name="/train_lora"
)
```

### MCP (Model Context Protocol)

```json
{
  "mcpServers": {
    "ace-step": {"url": "https://werecooking-ace-step-cpu.hf.space/gradio_api/mcp/"}
  }
}
```

## CLI

```bash
python app.py "upbeat electronic dance music" --duration 10 --steps 8
python app.py "jazz piano" --adapter my-style --seed 42
```

## Architecture

- **Inference:** GGUF via [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp)
- **Training:** PyTorch, ported from [Side-Step](https://github.com/koda-dernet/Side-Step) (commit ecd13bd)
- **Captioning:** librosa + LM understand (PyTorch or ace-server /understand)
- Training stops ace-server to free RAM, restarts after with new adapters
- Inference blocked during training with clear message
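
The stop/train/restart lifecycle described above can be sketched with injected callables; this is an illustrative flow, not the app's actual code. The key property is the `try/finally`: ace-server comes back up even if training fails or is cancelled:

```python
def run_training(stop_server, train, start_server):
    """Illustrative training lifecycle (not the Space's actual implementation):
    stop ace-server to free RAM, run PyTorch LoRA training, then restart the
    server so new adapters appear in the LoRA dropdown. Callables are injected
    so the flow is testable in isolation."""
    stop_server()          # free RAM held by the GGUF inference server
    try:
        return train()     # may raise, or be cancelled mid-epoch
    finally:
        start_server()     # always restart, even on failure/cancel
```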

## Credits

- [ACE-Step 1.5](https://github.com/ace-step/ACE-Step-1.5)
- [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp)
- [Side-Step](https://github.com/koda-dernet/Side-Step)
- [Serveurperso/ACE-Step-1.5-GGUF](https://huggingface.co/Serveurperso/ACE-Step-1.5-GGUF)