Generate Studio-Quality Music with Ace-Step 1.5 and ComfyUI
Learn how to generate professional AI music using Ace-Step 1.5 in ComfyUI - open-source, runs on consumer GPU, rivals Suno quality.
Table of Contents
- The AI Music Revolution
- What is Ace-Step 1.5?
- The Architecture
- Key Capabilities
- Model Variants
- Why Choose Ace-Step Over Suno or Udio?
- The Real Cost Breakdown
- When to Stick with Commercial Options
- Setting Up Ace-Step in ComfyUI
- Prerequisites
- Step 1: Update ComfyUI
- Step 2: Download the Models
- Step 3: Install Custom Nodes (Optional)
- Step 4: Load the Workflow
- Your First Music Generation
- Basic Text-to-Music Workflow
- Key ComfyUI Nodes
- Understanding the Parameters
- Generation Workflow
- Pro Tips for Better Results
- 1. Master the Two-Layer Prompt System
- 2. Rhythmic Lyric Phrasing
- 3. Iterate with Small Changes
- 4. For Instrumental Tracks
- 5. The Genre Sweet Spots
- 6. Use the Turbo Model for Iteration
- Troubleshooting Common Issues
- “The Vocals Sound Muffled or Skip Words”
- “ComfyUI Nodes Not Appearing”
- “CUDA Out of Memory”
- “Training LoRA Stalls or Freezes”
- “No Audio Output or Corrupted Files”
- Platform-Specific Notes
- Conclusion & Resources
- Quick Start Checklist
- Essential Links
The AI music generation landscape has a new heavyweight contender. Meet Ace-Step 1.5—an open-source foundation model that runs locally on consumer hardware while delivering quality that rivals commercial giants like Suno and Udio.
If you’ve been frustrated by subscription fees, upload limits, or privacy concerns with cloud-based music generators, this is the guide you’ve been waiting for.
The AI Music Revolution
For the past year, Suno and Udio have dominated headlines with their uncanny ability to generate full songs from text prompts. Impressive? Absolutely. But there’s been a catch: you’re renting access to someone else’s compute, on someone else’s terms.
Ace-Step 1.5 changes the equation. Developed collaboratively by ACE Studio and StepFun, it brings commercial-grade music generation to your local machine. We’re talking:
- Under 10 seconds to generate a full song on an RTX 3090
- Under 4GB VRAM—yes, it runs on modest gaming GPUs
- MIT license—free for personal and commercial use
- 50+ languages supported natively
This isn’t a toy project. It’s a legitimate alternative for developers, musicians, and content creators who want full control over their AI music pipeline.
What is Ace-Step 1.5?
Ace-Step 1.5 is a hybrid architecture that combines the best of language models and diffusion transformers. Think of it as two systems working in concert: a “planner” that understands what you want, and a “synthesizer” that builds the actual audio.
The Architecture
1. Language Model (LM) — The Planner
When you type a prompt like “upbeat synth-pop with female vocals,” the LM transforms that into a comprehensive blueprint. It synthesizes metadata, lyrics, and captions through chain-of-thought reasoning—all without requiring massive prompt engineering on your end.
The LM comes in three sizes:
- 0.6B parameters — For 6-12GB VRAM cards
- 1.7B parameters — Sweet spot for 12-20GB VRAM
- 4B parameters — Maximum quality for 24GB+ cards
2. Diffusion Transformer (DiT) — The Synthesizer
The core generative engine uses a clever optimization: linear attention instead of the standard quadratic approach. This reduces complexity from O(N²) to O(N)—a massive speedup that explains those sub-10-second generation times.
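The speedup comes from the classic linear-attention trick: apply a positive feature map to queries and keys, then reorder the matrix multiplications so the N×N score matrix is never materialized. Ace-Step's exact kernel isn't specified here; the NumPy sketch below is a generic illustration of that reordering, not the model's implementation:

```python
import numpy as np

def softmax_attention(q, k, v):
    # Standard attention: the N x N score matrix makes this O(N^2) in sequence length
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Kernelized attention: computing phi(k)^T v first yields a small d x d
    # matrix, so the total cost is O(N * d^2) -- linear in sequence length N
    kv = phi(k).T @ v               # (d, d)
    z = phi(k).sum(axis=0)          # (d,) normalizer
    return (phi(q) @ kv) / (phi(q) @ z)[:, None]

n, d = 512, 64
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```

For short sequences the two give similar-quality mixtures; the payoff shows up as audio latent sequences grow, which is exactly where quadratic attention becomes the bottleneck.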
3. Deep Compression AutoEncoder (DCAE)
Traditionally, audio autoencoders achieve 8x compression. Ace-Step’s DCAE hits 32x compression while maintaining quality. Fewer latent tokens mean faster everything—training and inference alike.
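To see why 32x matters, here's back-of-the-envelope arithmetic. Treating the ratio as acting directly on a 44.1 kHz sample stream is an illustrative simplification (the real DCAE's internals differ), but the relative savings hold:

```python
# Illustrative only: apply the compression ratio directly to the sample count
sample_rate = 44_100
duration_s = 180                      # a 3-minute song

for ratio in (8, 32):                 # conventional vs. DCAE compression
    latent_len = sample_rate * duration_s // ratio
    print(f"{ratio}x -> {latent_len:,} latent positions")
```

A 4x shorter latent sequence means 4x fewer positions for the DiT to attend over at every diffusion step, which compounds with the linear-attention savings.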
4. Multi-Modal Conditioning
Multiple encoders feed into the system:
- Text Encoder — Processes your descriptive prompts
- Lyric Encoder — Aligns words with musical timing
- Speaker Encoder — Enables voice cloning capabilities
Key Capabilities
| Feature | What It Does |
|---|---|
| Text-to-Music | Generate full songs from descriptions and lyrics |
| Cover Generation | Reimagine existing audio in different styles |
| Repaint & Edit | Selective regeneration of specific sections |
| Track Separation | Isolate vocals, drums, bass, and other stems |
| Vocal2BGM | Generate accompaniment for vocal tracks |
| LoRA Training | Fine-tune on your own music in one click |
| Audio Analysis | Extract BPM, key, time signature automatically |
Model Variants
Three flavors to match your speed/quality tradeoff:
| Model | CFG | Steps | Quality | Best For |
|---|---|---|---|---|
| `acestep-v15-base` | ✅ | 50 | Medium | Experimentation |
| `acestep-v15-sft` | ✅ | 50 | High | Balanced work |
| `acestep-v15-turbo` | ❌ | 8 | Very High | Production speed |
💡 Pro Tip: Start with `turbo` for rapid iteration, then switch to `sft` for your final render. The 8-step turbo mode generates in literal seconds.
Why Choose Ace-Step Over Suno or Udio?
Let’s be honest—Suno and Udio have polished user experiences. But Ace-Step offers something they can’t: sovereignty.
| Feature | Ace-Step 1.5 | Suno | Udio |
|---|---|---|---|
| Local Execution | ✅ Your GPU | ❌ Cloud only | ❌ Cloud only |
| Min VRAM | Under 4GB | N/A | N/A |
| License | MIT (free commercial) | Subscription | Subscription |
| Max Duration | 10 minutes | 8 minutes | 4-15 min |
| Multi-Language | 50+ languages | Limited | Limited |
| LoRA Training | ✅ Yes | ❌ No | ❌ No |
| Stem Separation | ✅ Built-in | ✅ Yes | ❌ No |
| Privacy | ✅ 100% local | ❌ Uploaded | ❌ Uploaded |
The Real Cost Breakdown
Suno Pro: $10/month for 2,500 credits (roughly 500 songs)
Udio: $10/month for 600 generations
Ace-Step: $0/month + your existing GPU
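Plugging the plan numbers above into quick arithmetic shows the gap per track. The local figure uses an assumed power draw, generation time, and electricity rate (350 W, 10 s/song, $0.15/kWh), which are rough estimates, not measurements:

```python
# Rough cost per generated track, using the plan numbers quoted above
suno = 10 / 500                       # Suno Pro: $10/month for ~500 songs
udio = 10 / 600                       # Udio: $10/month for 600 generations
local = 0.350 * (10 / 3600) * 0.15    # kW * hours * $/kWh (assumed values)
print(f"Suno ${suno:.4f}  Udio ${udio:.4f}  Local ~${local:.6f} per song")
```

Even with pessimistic local assumptions, the marginal cost of one more generation on your own GPU is effectively a rounding error.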
For heavy users, the economics speak for themselves. But beyond cost, consider:
- Privacy: Your prompts and lyrics never leave your machine
- No rate limits: Generate as much as your GPU can handle
- Ownership: MIT license means you own everything you create
- Customization: Train LoRAs on your own music library
When to Stick with Commercial Options
Ace-Step isn’t perfect. If you need:
- The absolute best vocal quality (Udio still edges ahead)
- “One-click banger” simplicity (Suno excels here)
- Zero technical setup
Then cloud services might still be your best bet. But if you’re willing to invest a little setup time for long-term freedom, read on.
Setting Up Ace-Step in ComfyUI
ComfyUI provides a visual node-based interface for Ace-Step—perfect for experimentation and building complex workflows. Here’s how to get running.
Prerequisites
| Requirement | Details |
|---|---|
| Python | 3.11-3.12 (avoid 3.13+ on Windows) |
| GPU | NVIDIA recommended; also supports MPS, ROCm, Intel XPU |
| VRAM | 4GB minimum, 8GB+ recommended |
| Disk | ~10GB for full models |
| ComfyUI | Latest version (updated) |
Step 1: Update ComfyUI
```bash
cd /path/to/ComfyUI
git pull
pip install -r requirements.txt
```
Step 2: Download the Models
Option A: All-in-One Checkpoint (Recommended for Beginners)
Download the single .safetensors file (~10GB) from Hugging Face:
```bash
# Using huggingface-cli
huggingface-cli download ACE-Step/Ace-Step1.5 \
  --local-dir ComfyUI/models/checkpoints/acestep-v15-turbo
```
Or manually download from huggingface.co/ACE-Step/Ace-Step1.5 and place in ComfyUI/models/checkpoints/.
Option B: Split Model Files (For Advanced Users)
```text
ComfyUI/models/Ace-Step1.5/
├── acestep-v15-turbo/
├── acestep-5Hz-lm-1.7B/
├── vae/
└── Qwen3-Embedding-0.6B/
```
This approach lets you mix and match components.
Step 3: Install Custom Nodes (Optional)
For extended features like cover generation, repainting, and LoRA training:
```bash
cd ComfyUI/custom_nodes

# Kaola nodes for advanced features
git clone https://github.com/kana112233/ComfyUI-kaola-ace-step.git
cd ComfyUI-kaola-ace-step && pip install -r requirements.txt

# LoRA training support (return to custom_nodes first)
cd ..
git clone https://github.com/filliptm/ComfyUI-FL-AceStep-Training.git
cd ComfyUI-FL-AceStep-Training && pip install -r requirements.txt
```
Step 4: Load the Workflow
- Launch ComfyUI
- Open the Template Library (usually in the sidebar)
- Navigate to Audio category
- Select “ACE-Step 1.5 music generation”
- Configure the `CheckpointLoaderSimple` node to point to your downloaded model
⚠️ Windows Users: If you’re on Python 3.13+, you may encounter torchaudio backend issues. Downgrade to Python 3.12 or use the portable package from the official releases.
Your First Music Generation
Let’s walk through creating your first AI-generated song from scratch.
Basic Text-to-Music Workflow
The Prompt Structure
Think of your input in two layers:
- Tags — Global control: genre, tempo, instruments, vocal style
- Lyrics — The actual words with structure markers
Example: Upbeat Indie Pop
Tags:

```text
upbeat indie pop, 120 bpm, catchy synth melody, driving drums,
warm female vocals, clean mix, 2010s radio feel
```

Lyrics:

```text
[intro]
Synth arpeggio builds
[verse]
City lights glow, chasing sunsets
Every moment a new regret
Walking fast, no time to rest
[chorus]
Oh, the rhythm takes control
Losing myself, losing my soul
Dancing through the night
[bridge]
Just for tonight, let's break free
Dancing wild, wild and free
[outro]
Fade out with synth echoes
```
Key ComfyUI Nodes
| Node | Purpose |
|---|---|
| `CheckpointLoaderSimple` | Load the Ace-Step 1.5 model |
| `TextEncodeAceStepAudio1.5` | Encode prompts and lyrics |
| `ModelSamplingAuraFlow` | Configure sampler settings |
| `KSampler` | Run the diffusion sampling |
| `VAE Decode` | Convert latent to audio waveform |
Understanding the Parameters
| Parameter | Description | Good Starting Value |
|---|---|---|
| Song Duration | Length in seconds | 60-180 |
| Seed | Random seed | Fixed for reproducibility |
| KSampler Steps | Diffusion iterations | 8 (turbo), 30-50 (quality) |
| BPM | Tempo hint | 60-180 |
| Key Scale | Musical key | C major, A minor, etc. |
Generation Workflow
- Set your tags in the text encoder node
- Input structured lyrics with section labels
- Configure duration (start with 90-120 seconds)
- Set a fixed seed for reproducibility
- Queue the prompt and wait for generation
If the result isn’t perfect on the first try, don’t worry—that’s normal. AI music generation is iterative.
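Once a workflow works in the UI, the queue step can also be scripted: ComfyUI exposes an HTTP endpoint (`POST /prompt`, default port 8188) that accepts a workflow graph exported via the UI's "Save (API Format)" option. A minimal sketch, with the filename and host as placeholders:

```python
import json
import urllib.request

def build_payload(workflow: dict) -> bytes:
    # ComfyUI expects the node graph under the "prompt" key
    return json.dumps({"prompt": workflow}).encode("utf-8")

def queue_prompt(workflow: dict, host: str = "127.0.0.1:8188") -> dict:
    """Submit a workflow graph to a locally running ComfyUI instance."""
    req = urllib.request.Request(
        f"http://{host}/prompt",
        data=build_payload(workflow),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (requires a running ComfyUI and a workflow saved in API format):
#   workflow = json.load(open("acestep_workflow_api.json"))
#   result = queue_prompt(workflow)
```

This is handy for batch-generating seed variations of the same prompt overnight.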
💡 Pro Tip: Use structured section labels like `[verse]`, `[chorus]`, and `[bridge]`. The model uses these to understand song form and create natural transitions.
Pro Tips for Better Results
After generating hundreds of tracks, here are the patterns that consistently produce better output.
1. Master the Two-Layer Prompt System
Tags control the soundscape:
```text
Genre: funk, pop, disco, lofi hip-hop, ambient
Tempo: 120 bpm, up-tempo, slow ballad
Instruments: slap bass, drum machine, distorted guitar, Rhodes piano
Vocal Type: male vocals, clean, rhythmic female vocals
Era/Production: 80s style, punchy, dry mix, lo-fi, vinyl crackle
```
Lyrics control the structure:
- Use section labels consistently
- Keep phrases rhythmic
- Avoid long, prose-like sentences
2. Rhythmic Lyric Phrasing
Short, punchy phrases work better than poetry:
Better:

```text
[verse]
Running fast, can't slow down
Heart beats like a drum
City lights, blurry now
Moving to the rhythm
```

Worse:

```text
[verse]
When I was running through the city streets with my heart pounding
and all the lights became a blur around me as I moved forward
```
Treat vocals as part of the groove. Consistent syllable counts improve vocal stability dramatically.
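A quick way to sanity-check that consistency is a rough syllable counter. The vowel-group heuristic below is a crude approximation (not real phonetics), but it's enough to spot a line that's wildly longer than its neighbors:

```python
import re

def rough_syllables(line: str) -> int:
    # Naive heuristic: count vowel groups per word, minimum one per word
    return sum(
        max(1, len(re.findall(r"[aeiouy]+", word.lower())))
        for word in re.findall(r"[a-zA-Z']+", line)
    )

verse = [
    "Running fast, can't slow down",
    "Heart beats like a drum",
    "City lights, blurry now",
    "Moving to the rhythm",
]
for line in verse:
    print(rough_syllables(line), line)
```

If one line scores 12 while the rest score 5-6, that's the line most likely to get swallowed or rushed by the vocal model.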
3. Iterate with Small Changes
The “gacha” nature of AI music means you’ll generate multiple versions. Optimize your workflow:
- Lock the seed when you find something close
- Tweak tags first (tempo, mood, instruments)
- Then adjust lyrics section by section
- Avoid rewriting everything between generations
4. For Instrumental Tracks
Be explicit about excluding vocals:
```text
Tags: "lofi hip-hop instrumental, 85 bpm, chill vibe, warm Rhodes piano,
subtle vinyl crackle, boom-bap drums, smooth bassline, no vocals,
instrumental only"

Lyrics: (leave empty or use "instrumental")
```
5. The Genre Sweet Spots
Ace-Step excels at certain genres and struggles with others:
Strong Genres:
- Pop, indie pop, synth-pop
- Electronic (house, techno, ambient)
- Lo-fi hip-hop
- Rock (classic, alternative)
- Ballads
Challenging Genres:
- Aggressive rap (especially non-English)
- Extremely niche subgenres
- Highly orchestrated classical
6. Use the Turbo Model for Iteration
The turbo model (8 steps instead of 50) generates in ~2 seconds. Use it for:
- Rapid prompt experimentation
- Testing lyric flows
- Finding the right tempo/key
Then switch to the SFT model for your final high-quality render.
⚠️ Warning: Vocals can sometimes “swallow” lyrics—skipping words or making them unintelligible. This is a known limitation. Structured prompts with short phrases help mitigate it.
Troubleshooting Common Issues
“The Vocals Sound Muffled or Skip Words”
Cause: Lyric misalignment, common in the base model
Solutions:
- Use structured section labels (`[verse]`, `[chorus]`)
- Shorten phrases to 4-8 syllables
- Try the “official original version” model which has lyric adherence fixes
- Adjust song duration in small increments (±10 seconds)
“ComfyUI Nodes Not Appearing”
Cause: Missing dependencies or outdated ComfyUI
Solutions:
```bash
# Update ComfyUI
cd /path/to/ComfyUI && git pull

# Reinstall requirements
pip install -r requirements.txt --upgrade

# Verify model placement
ls ComfyUI/models/checkpoints/acestep*
```
“CUDA Out of Memory”
Cause: GPU VRAM exceeded
Solutions:
- Use a smaller LM (0.6B instead of 1.7B or 4B)
- Enable INT8 quantization
- Use CPU offloading for the LM
- Reduce batch size to 1
For low VRAM (≤6GB): disable the LM entirely and run in DiT-only mode by setting `acestep_model.loader.llm_model: null` in the config.
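Assuming the config file is YAML (an assumption inferred from the dotted key path — check your install's actual config format), that setting expands to:

```yaml
# DiT-only mode for low-VRAM cards: skip loading the LM planner
acestep_model:
  loader:
    llm_model: null
```

Without the LM planner you lose the automatic prompt-to-blueprint step, so be more explicit in your tags and lyrics.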
“Training LoRA Stalls or Freezes”
Cause: UI conflicts or missing dependencies
Solutions:
```bash
# Use CLI training instead of the UI
cd ACE-Step-1.5
python ./acestep/training/train_lora.py --help

# Install missing dependencies
pip install lightning tensorboard
```
“No Audio Output or Corrupted Files”
Cause: VAE decode failure or audio backend issues
Solutions:
- Check that VAE decode node is connected
- Reinstall PyTorch with the correct CUDA version: `pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121`
- On Windows with Python 3.13+: use the `soundfile` or `scipy` backend instead of `torchaudio`
Platform-Specific Notes
Windows:
- Use the portable package for easiest setup
- Python 3.13+ has torchaudio issues—stick to 3.11-3.12
macOS (Apple Silicon):
- Use the MLX backend with dedicated launch scripts
- Performance is acceptable but slower than NVIDIA
AMD (ROCm):
- Windows requires Python 3.12 specifically
- Use ROCm-specific launch scripts
Conclusion & Resources
Ace-Step 1.5 represents a watershed moment for AI music generation. For the first time, you can run truly capable music synthesis locally, on consumer hardware, with a permissive open-source license.
Is it as polished as Suno? Not quite. Will it replace professional music production? Not yet. But for developers, musicians, and creators who value privacy, customization, and long-term ownership, Ace-Step offers something invaluable: control.
The ability to train LoRAs on your own music, generate unlimited tracks without subscription fees, and integrate music generation into your own applications and workflows—these aren’t just features. They’re the foundation of a new creative paradigm.
Quick Start Checklist
- Update ComfyUI to latest version
- Download Ace-Step 1.5 checkpoint from Hugging Face
- Place the model in `ComfyUI/models/checkpoints/`
- Load the ACE-Step workflow from the Template Library
- Start with turbo model for rapid iteration
- Use structured tags + lyrics for best results
Essential Links
Official Resources:
ComfyUI Integration:
Community Resources:
The future of music isn’t just generated—it’s yours to generate. Happy creating.