---
title: ACE-Step 1.5 XL Music Generation (CPU)
emoji: 🎵
colorFrom: indigo
colorTo: yellow
sdk: docker
pinned: false
license: mit
tags:
  - music-generation
  - ace-step
  - gguf
  - lora
  - training
  - cpu
  - mcp-server
short_description: ACE-Step 1.5 XL - CPU music generation + LoRA training
models:
  - ACE-Step/Ace-Step1.5
startup_duration_timeout: 2h
---

# ACE-Step 1.5 XL Music Generation (CPU)

**GGUF inference + LoRA training** on free CPU Spaces. Powered by [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp).

## Features

- **Music Generation** -- text/lyrics to stereo 48kHz MP3 via GGUF quantized models
- **LoRA Training** -- fine-tune on your own audio (~11s/epoch CPU, ~1.4s/epoch GPU)
- **Auto-Captioning** -- librosa BPM/key/signature + LM understand mode (caption + lyrics extraction)
- **Multiple LM Sizes** -- 0.6B / 1.7B / 4B language models (on-demand download)
- **Cancel + Download** -- cancel training mid-epoch, download trained LoRA adapter

## Music Generation

1. Enter a music description
2. Enter lyrics or check **Instrumental**
3. Adjust BPM, duration, steps, seed
4. Select LoRA adapter if trained
5. Click **Generate Music**

**Timing:** ~270s to generate 10s of audio with the 1.7B LM at 8 steps on CPU.

## LoRA Training

1. Upload audio files (any length; auto-tiled into 30s chunks by the VAE)
2. Set LoRA name, epochs, learning rate, rank
3. Click **Train** -- ace-server stops during training, restarts after
4. Use **Cancel** to stop early (saves checkpoint)
5. **Download** the trained adapter file
6. Trained adapter appears in the LoRA dropdown

**Timing:** ~170s preprocessing + ~11s/epoch on CPU. GPU: ~1.4s/epoch.

**Limits:** 30 min total audio across all files. Files exceeding the cap are truncated with a warning. 50 files max. 8h training timeout.
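
The limits above can be sketched as a pre-upload check. This is illustrative only; the function name and exact truncation behavior are assumptions, not the Space's actual code:

```python
# Illustrative sketch of the documented upload limits (assumed helper,
# not the Space's actual validation code).
MAX_TOTAL_SECONDS = 30 * 60   # 30 min of audio across all files
MAX_FILES = 50                # 50 files max

def check_upload(durations_s):
    """Validate per-file durations (seconds) against the documented caps,
    truncating the file that crosses the 30 min total with a warning."""
    if len(durations_s) > MAX_FILES:
        raise ValueError(f"too many files: {len(durations_s)} > {MAX_FILES}")
    total = 0.0
    kept = []
    for d in durations_s:
        if total + d > MAX_TOTAL_SECONDS:
            kept.append(MAX_TOTAL_SECONDS - total)  # truncated portion
            print("warning: audio truncated at the 30 min cap")
            break
        kept.append(d)
        total += d
    return kept
```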

**Settings (per Side-Step author recommendations):**
- LR: 3e-4
- Rank: 32, Alpha: 64
- Epochs: 200-500 for 3-10 files
- Optimizer: Adafactor (minimal memory)
- Variant: standard turbo (not XL -- the XL model swaps on 18GB RAM)
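
The recommendations above map onto the `/train_lora` endpoint's documented parameters; a minimal defaults dict, assuming alpha, optimizer, and variant are fixed server-side (only `epochs`, `lr`, and `rank` appear in the API section below):

```python
# Recommended LoRA training settings (per the Side-Step author), expressed as
# kwargs for client.predict(..., api_name="/train_lora"). Alpha (= 2 * rank),
# the Adafactor optimizer, and the standard-turbo variant are assumed to be
# configured server-side in this sketch.
SIDE_STEP_DEFAULTS = {
    "epochs": 300,   # 200-500 recommended for 3-10 files
    "lr": 3e-4,
    "rank": 32,      # alpha = 2 * rank = 64
}
```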

## Captioning Pipeline

Training audio is auto-captioned before preprocessing:

| Method | What it extracts | Speed |
|--------|-----------------|-------|
| **librosa** | BPM, key, time signature | ~3s/file |
| **LM understand** (GPU) | Rich caption + lyrics + metadata | ~52s/file |
| **ace-server /understand** (Space) | Same as LM, via GGUF | ~30s/file |
| **.txt/.json sidecar** | User-provided caption (if present) | instant |

On the Space, captions come from ace-server /understand before training; locally, from the PyTorch LM understand mode.
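
As a sketch of the sidecar path from the table above, a user-provided caption might sit next to the audio file; both the filename convention and the JSON schema here are assumptions for illustration, not the pipeline's documented format:

```python
import json
from pathlib import Path

# Hypothetical sidecar: "song.mp3" -> "song.json" with caption/lyrics fields.
# Naming convention and schema are assumptions, not the documented format.
audio = Path("song.mp3")
sidecar = audio.with_suffix(".json")
sidecar.write_text(json.dumps({
    "caption": "upbeat electronic dance music, 120 BPM, C minor",
    "lyrics": "[Instrumental]",
}))

# What the captioning step would read when a sidecar is present:
meta = json.loads(sidecar.read_text())
print(meta["caption"])
```

A plain `.txt` sidecar with the caption as its entire contents would work the same way; either form skips the ~30-52s/file LM pass.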

## Models

| Component | GGUF | Size | Purpose |
|-----------|------|------|---------|
| DiT XL turbo | acestep-v15-xl-turbo-Q4_K_M | 2.8 GB | Music generation (no LoRA) |
| DiT standard turbo | acestep-v15-turbo-Q4_K_M | 1.1 GB | Music generation (with LoRA) |
| LM 1.7B | acestep-5Hz-lm-1.7B-Q8_0 | 1.7 GB | Caption understanding |
| Text Encoder | Qwen3-Embedding-0.6B-Q8_0 | 0.75 GB | Text encoding |
| VAE | vae-BF16 | 0.32 GB | Audio encode/decode |

## API

### Generate Music

```python
from gradio_client import Client

client = Client("WeReCooking/ACE-Step-CPU")
result = client.predict(
    caption="upbeat electronic dance music",
    lyrics="[Instrumental]",
    instrumental=True, bpm=120, duration=10, seed=-1, steps=8,
    lora_select="None (no LoRA)",
    lm_model_select="acestep-5Hz-lm-1.7B-Q8_0.gguf",
    api_name="/generate"
)
```

### Train LoRA

```python
from gradio_client import Client, handle_file

client = Client("WeReCooking/ACE-Step-CPU")
result = client.predict(
    audio_files=[handle_file("song.mp3")],
    lora_name="my-style", epochs=200, lr=0.0003, rank=32,
    api_name="/train_lora"
)
```

### MCP (Model Context Protocol)

```json
{
  "mcpServers": {
    "ace-step": {"url": "https://werecooking-ace-step-cpu.hf.space/gradio_api/mcp/"}
  }
}
```

## CLI

```bash
python app.py "upbeat electronic dance music" --duration 10 --steps 8
python app.py "jazz piano" --adapter my-style --seed 42
```

## Architecture

- **Inference:** GGUF via [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp)
- **Training:** PyTorch, ported from [Side-Step](https://github.com/koda-dernet/Side-Step) (commit ecd13bd)
- **Captioning:** librosa + LM understand (PyTorch or ace-server /understand)
- Training stops ace-server to free RAM, restarts after with new adapters
- Inference blocked during training with clear message
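
The stop/train/restart lifecycle described above can be sketched with injected callables; this is an illustrative flow, not the app's actual code. The key property is the `try/finally`: ace-server comes back up even if training fails or is cancelled:

```python
def run_training(stop_server, train, start_server):
    """Illustrative training lifecycle (not the Space's actual implementation):
    stop ace-server to free RAM, run PyTorch LoRA training, then restart the
    server so new adapters appear in the LoRA dropdown. Callables are injected
    so the flow is testable in isolation."""
    stop_server()          # free RAM held by the GGUF inference server
    try:
        return train()     # may raise, or be cancelled mid-epoch
    finally:
        start_server()     # always restart, even on failure/cancel
```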

## Credits

- [ACE-Step 1.5](https://github.com/ace-step/ACE-Step-1.5)
- [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp)
- [Side-Step](https://github.com/koda-dernet/Side-Step)
- [Serveurperso/ACE-Step-1.5-GGUF](https://huggingface.co/Serveurperso/ACE-Step-1.5-GGUF)