Generate Studio-Quality Music with Ace-Step 1.5 and ComfyUI

Learn how to generate professional AI music using Ace-Step 1.5 in ComfyUI - open-source, runs on consumer GPU, rivals Suno quality.

11 min read · Tags: AI, music, ComfyUI, audio-generation, open-source

The AI music generation landscape has a new heavyweight contender. Meet Ace-Step 1.5—an open-source foundation model that runs locally on consumer hardware while delivering quality that rivals commercial giants like Suno and Udio.

If you’ve been frustrated by subscription fees, upload limits, or privacy concerns with cloud-based music generators, this is the guide you’ve been waiting for.

The AI Music Revolution

For the past year, Suno and Udio have dominated headlines with their uncanny ability to generate full songs from text prompts. Impressive? Absolutely. But there’s been a catch: you’re renting access to someone else’s compute, on someone else’s terms.

Ace-Step 1.5 changes the equation. Developed collaboratively by ACE Studio and StepFun, it brings commercial-grade music generation to your local machine. We’re talking:

  • Under 10 seconds to generate a full song on an RTX 3090
  • Under 4GB VRAM—yes, it runs on modest gaming GPUs
  • MIT license—free for personal and commercial use
  • 50+ languages supported natively

This isn’t a toy project. It’s a legitimate alternative for developers, musicians, and content creators who want full control over their AI music pipeline.

What is Ace-Step 1.5?

Ace-Step 1.5 is a hybrid architecture that combines the best of language models and diffusion transformers. Think of it as two systems working in concert: a “planner” that understands what you want, and a “synthesizer” that builds the actual audio.

The Architecture

1. Language Model (LM) — The Planner

When you type a prompt like “upbeat synth-pop with female vocals,” the LM transforms that into a comprehensive blueprint. It synthesizes metadata, lyrics, and captions through chain-of-thought reasoning—all without requiring massive prompt engineering on your end.

The LM comes in three sizes:

  • 0.6B parameters — For 6-12GB VRAM cards
  • 1.7B parameters — Sweet spot for 12-20GB VRAM
  • 4B parameters — Maximum quality for 24GB+ cards

2. Diffusion Transformer (DiT) — The Synthesizer

The core generative engine uses a clever optimization: linear attention instead of the standard quadratic approach. This reduces complexity from O(N²) to O(N)—a massive speedup that explains those sub-10-second generation times.
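Ace-Step's actual kernel isn't published in this guide, but the general trick behind linear attention is easy to sketch: apply a positive feature map φ to queries and keys, then compute φ(K)ᵀV (a small d×d matrix) before multiplying by φ(Q), so the N×N score matrix never materializes. A toy NumPy illustration (the ReLU-based φ is an assumption; real models often use ELU+1 or learned maps):

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """O(N) attention: the (d, d) product K^T V is formed first,
    avoiding the (N, N) score matrix of standard attention."""
    phi = lambda x: np.maximum(x, 0) + eps  # simple positive feature map
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                  # (d, d) — cost scales with N, not N^2
    Z = Qp @ Kp.sum(axis=0)        # per-row normalizer, shape (N,)
    return (Qp @ KV) / Z[:, None]

N, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (1024, 64)
```

Doubling N doubles the work here, whereas standard attention would quadruple it — which is the speedup the sub-10-second generation times rely on.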

3. Deep Compression AutoEncoder (DCAE)

Traditionally, audio autoencoders achieve 8x compression. Ace-Step’s DCAE hits 32x compression while maintaining quality. Fewer latent tokens mean faster everything—training and inference alike.
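The payoff is easy to quantify. With purely illustrative numbers (the base token rate below is hypothetical, not Ace-Step's actual figure), a 32x autoencoder leaves the diffusion model a quarter as many latent tokens as an 8x one:

```python
# Illustrative only: the pre-compression token rate is a made-up stand-in.
def latent_tokens(seconds, base_tokens_per_sec, compression):
    return int(seconds * base_tokens_per_sec / compression)

song = 180   # a 3-minute song
base = 160   # hypothetical pre-compression tokens per second
print(latent_tokens(song, base, 8))   # 3600 tokens at 8x compression
print(latent_tokens(song, base, 32))  # 900 tokens at 32x — 4x fewer to process
```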

4. Multi-Modal Conditioning

Multiple encoders feed into the system:

  • Text Encoder — Processes your descriptive prompts
  • Lyric Encoder — Aligns words with musical timing
  • Speaker Encoder — Enables voice cloning capabilities

Key Capabilities

| Feature | What It Does |
|---|---|
| Text-to-Music | Generate full songs from descriptions and lyrics |
| Cover Generation | Reimagine existing audio in different styles |
| Repaint & Edit | Selective regeneration of specific sections |
| Track Separation | Isolate vocals, drums, bass, and other stems |
| Vocal2BGM | Generate accompaniment for vocal tracks |
| LoRA Training | Fine-tune on your own music in one click |
| Audio Analysis | Extract BPM, key, time signature automatically |
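Ace-Step's analyzer handles BPM extraction internally; the underlying idea is worth a quick standalone sketch. A toy autocorrelation estimator on a synthetic click track (my own helper, not Ace-Step code — the strongest periodic lag in the onset envelope gives the tempo):

```python
import numpy as np

def estimate_bpm(envelope, sr, bpm_range=(60, 180)):
    """Estimate tempo from an onset envelope via autocorrelation:
    the strongest lag inside the plausible BPM range wins."""
    ac = np.correlate(envelope, envelope, mode="full")[len(envelope) - 1:]
    lo = int(sr * 60 / bpm_range[1])   # shortest beat period in samples
    hi = int(sr * 60 / bpm_range[0])   # longest beat period in samples
    lag = lo + np.argmax(ac[lo:hi])
    return 60.0 * sr / lag

sr = 1000                                 # envelope rate, not audio rate
envelope = np.zeros(10 * sr)
envelope[::sr // 2] = 1.0                 # a click every 0.5 s -> 120 BPM
print(round(estimate_bpm(envelope, sr)))  # 120
```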

Model Variants

Three flavors to match your speed/quality tradeoff:

| Model | Steps | Quality | Best For |
|---|---|---|---|
| acestep-v15-base | 50 | Medium | Experimentation |
| acestep-v15-sft | 50 | High | Balanced work |
| acestep-v15-turbo | 8 | Very High | Production speed |

💡 Pro Tip: Start with turbo for rapid iteration, then switch to sft for your final render. The 8-step turbo mode generates in literal seconds.


Why Choose Ace-Step Over Suno or Udio?

Let’s be honest—Suno and Udio have polished user experiences. But Ace-Step offers something they can’t: sovereignty.

| Feature | Ace-Step 1.5 | Suno | Udio |
|---|---|---|---|
| Local Execution | ✅ Your GPU | ❌ Cloud only | ❌ Cloud only |
| Min VRAM | Under 4GB | N/A | N/A |
| License | MIT (free commercial) | Subscription | Subscription |
| Max Duration | 10 minutes | 8 minutes | 4-15 min |
| Multi-Language | 50+ languages | Limited | Limited |
| LoRA Training | ✅ Yes | ❌ No | ❌ No |
| Stem Separation | ✅ Built-in | ✅ Yes | ❌ No |
| Privacy | ✅ 100% local | ❌ Uploaded | ❌ Uploaded |

The Real Cost Breakdown

Suno Pro: $10/month for 2,500 credits (roughly 500 songs)

Udio: $10/month for 600 generations

Ace-Step: $0/month + your existing GPU
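Running the per-song arithmetic makes the gap concrete. Using the prices quoted above, plus assumed figures for local power draw and electricity cost (the ~350 W and $0.15/kWh below are my assumptions, not measured values):

```python
# Back-of-envelope cost per song, using the subscription prices quoted above.
suno_per_song = 10 / 500    # $10 / ~500 songs
udio_per_song = 10 / 600    # $10 / 600 generations

# Local: ~10 s per song on an RTX 3090, assuming ~350 W draw and $0.15/kWh.
local_per_song = (350 / 1000) * (10 / 3600) * 0.15

print(f"Suno:  ${suno_per_song:.4f}/song")
print(f"Udio:  ${udio_per_song:.4f}/song")
print(f"Local: ${local_per_song:.6f}/song (electricity only)")
```

Even with generous assumptions about power cost, local generation lands orders of magnitude below either subscription, before counting the absence of rate limits.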

For heavy users, the economics speak for themselves. But beyond cost, consider:

  • Privacy: Your prompts and lyrics never leave your machine
  • No rate limits: Generate as much as your GPU can handle
  • Ownership: MIT license means you own everything you create
  • Customization: Train LoRAs on your own music library

When to Stick with Commercial Options

Ace-Step isn’t perfect. If you need:

  • The absolute best vocal quality (Udio still edges ahead)
  • “One-click banger” simplicity (Suno excels here)
  • Zero technical setup

Then cloud services might still be your best bet. But if you’re willing to invest a little setup time for long-term freedom, read on.


Setting Up Ace-Step in ComfyUI

ComfyUI provides a visual node-based interface for Ace-Step—perfect for experimentation and building complex workflows. Here’s how to get running.

Prerequisites

| Requirement | Details |
|---|---|
| Python | 3.11-3.12 (avoid 3.13+ on Windows) |
| GPU | NVIDIA recommended; also supports MPS, ROCm, Intel XPU |
| VRAM | 4GB minimum, 8GB+ recommended |
| Disk | ~10GB for full models |
| ComfyUI | Latest version |

Step 1: Update ComfyUI

```bash
cd /path/to/ComfyUI
git pull
pip install -r requirements.txt
```

Step 2: Download the Models

Option A: All-in-One Checkpoint (Recommended for Beginners)

Download the single .safetensors file (~10GB) from Hugging Face:

```bash
# Using huggingface-cli
huggingface-cli download ACE-Step/Ace-Step1.5 \
  --local-dir ComfyUI/models/checkpoints/acestep-v15-turbo
```

Or manually download from huggingface.co/ACE-Step/Ace-Step1.5 and place in ComfyUI/models/checkpoints/.

Option B: Split Model Files (For Advanced Users)

```
ComfyUI/models/Ace-Step1.5/
├── acestep-v15-turbo/
├── acestep-5Hz-lm-1.7B/
├── vae/
└── Qwen3-Embedding-0.6B/
```

This approach lets you mix and match components.

Step 3: Install Custom Nodes (Optional)

For extended features like cover generation, repainting, and LoRA training:

```bash
cd ComfyUI/custom_nodes

# Kaola nodes for advanced features
git clone https://github.com/kana112233/ComfyUI-kaola-ace-step.git
cd ComfyUI-kaola-ace-step && pip install -r requirements.txt

# LoRA training support
git clone https://github.com/filliptm/ComfyUI-FL-AceStep-Training.git
cd ComfyUI-FL-AceStep-Training && pip install -r requirements.txt
```

Step 4: Load the Workflow

  1. Launch ComfyUI
  2. Open the Template Library (usually in the sidebar)
  3. Navigate to Audio category
  4. Select “ACE-Step 1.5 music generation”
  5. Configure the CheckpointLoaderSimple node to point to your downloaded model

⚠️ Windows Users: If you’re on Python 3.13+, you may encounter torchaudio backend issues. Downgrade to Python 3.12 or use the portable package from the official releases.


Your First Music Generation

Let’s walk through creating your first AI-generated song from scratch.

Basic Text-to-Music Workflow

The Prompt Structure

Think of your input in two layers:

  1. Tags — Global control: genre, tempo, instruments, vocal style
  2. Lyrics — The actual words with structure markers

Example: Upbeat Indie Pop

Tags:

upbeat indie pop, 120 bpm, catchy synth melody, driving drums, 
warm female vocals, clean mix, 2010s radio feel

Lyrics:

[intro]
Synth arpeggio builds

[verse]
City lights glow, chasing sunsets
Every moment a new regret
Walking fast, no time to rest

[chorus]
Oh, the rhythm takes control
Losing myself, losing my soul
Dancing through the night

[bridge]
Just for tonight, let's break free
Dancing wild, wild and free

[outro]
Fade out with synth echoes

Key ComfyUI Nodes

| Node | Purpose |
|---|---|
| CheckpointLoaderSimple | Load the Ace-Step 1.5 model |
| TextEncodeAceStepAudio1.5 | Encode prompts and lyrics |
| ModelSamplingAuraFlow | Configure sampler settings |
| KSampler | Run the diffusion sampling |
| VAE Decode | Convert latent to audio waveform |

Understanding the Parameters

| Parameter | Description | Good Starting Value |
|---|---|---|
| Song Duration | Length in seconds | 60-180 |
| Seed | Random seed | Fixed for reproducibility |
| KSampler Steps | Diffusion iterations | 8 (turbo), 30-50 (quality) |
| BPM | Tempo hint | 60-180 |
| Key Scale | Musical key | C major, A minor, etc. |

Generation Workflow

  1. Set your tags in the text encoder node
  2. Input structured lyrics with section labels
  3. Configure duration (start with 90-120 seconds)
  4. Set a fixed seed for reproducibility
  5. Queue the prompt and wait for generation

If the result isn’t perfect on the first try, don’t worry—that’s normal. AI music generation is iterative.
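Once a graph works in the UI, iteration can also be scripted: ComfyUI exposes an HTTP endpoint (`/prompt`) that accepts a workflow exported via "Save (API Format)". A minimal sketch for queueing seed variations (the demo graph and node id are illustrative, not from a real workflow):

```python
import json
import urllib.request

def with_seed(workflow, seed):
    """Return a copy of the workflow with every KSampler seed replaced."""
    wf = json.loads(json.dumps(workflow))  # cheap deep copy
    for node in wf.values():
        if node.get("class_type") == "KSampler":
            node["inputs"]["seed"] = seed
    return wf

def queue_prompt(workflow, host="127.0.0.1", port=8188):
    """POST the graph to a running ComfyUI instance's /prompt endpoint."""
    req = urllib.request.Request(
        f"http://{host}:{port}/prompt",
        data=json.dumps({"prompt": workflow}).encode(),
        headers={"Content-Type": "application/json"},
    )
    return json.load(urllib.request.urlopen(req))

# Toy stand-in for an exported API-format graph:
demo = {"3": {"class_type": "KSampler", "inputs": {"seed": 0, "steps": 8}}}
print(with_seed(demo, 42)["3"]["inputs"]["seed"])  # 42
```

With a live server you would load your exported JSON, loop over seeds, and call `queue_prompt(with_seed(wf, s))` for each variation.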


💡 Pro Tip: Use structured section labels like [verse], [chorus], and [bridge]. The model uses these to understand song form and create natural transitions.


Pro Tips for Better Results

After generating hundreds of tracks, here are the patterns that consistently produce better output.

1. Master the Two-Layer Prompt System

Tags control the soundscape:

Genre: funk, pop, disco, lofi hip-hop, ambient
Tempo: 120 bpm, up-tempo, slow ballad
Instruments: slap bass, drum machine, distorted guitar, Rhodes piano
Vocal Type: male vocals, clean, rhythmic female vocals
Era/Production: 80s style, punchy, dry mix, lo-fi, vinyl crackle

Lyrics control the structure:

  • Use section labels consistently
  • Keep phrases rhythmic
  • Avoid long, prose-like sentences

2. Rhythmic Lyric Phrasing

Short, punchy phrases work better than poetry:

Better:

[verse]
Running fast, can't slow down
Heart beats like a drum
City lights, blurry now
Moving to the rhythm

Worse:

[verse]
When I was running through the city streets with my heart pounding 
and all the lights became a blur around me as I moved forward

Treat vocals as part of the groove. Consistent syllable counts improve vocal stability dramatically.
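A quick way to sanity-check that consistency before generating: count vowel groups per line. This is my own crude English-only heuristic, not anything Ace-Step does, but it catches lines that are wildly longer than their neighbors:

```python
import re

def rough_syllables(line):
    """Crude English syllable estimate: one count per vowel group per word."""
    words = re.findall(r"[a-z']+", line.lower())
    return sum(max(1, len(re.findall(r"[aeiouy]+", w))) for w in words)

verse = [
    "Running fast, can't slow down",
    "Heart beats like a drum",
    "City lights, blurry now",
    "Moving to the rhythm",
]
for line in verse:
    print(rough_syllables(line), line)
```

Lines in the same section landing within a syllable or two of each other is a good sign; an outlier is a candidate for trimming.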

3. Iterate with Small Changes

The “gacha” nature of AI music means you’ll generate multiple versions. Optimize your workflow:

  1. Lock the seed when you find something close
  2. Tweak tags first (tempo, mood, instruments)
  3. Then adjust lyrics section by section
  4. Avoid rewriting everything between generations

4. For Instrumental Tracks

Be explicit about excluding vocals:

Tags: "lofi hip-hop instrumental, 85 bpm, chill vibe, warm Rhodes piano, 
subtle vinyl crackle, boom-bap drums, smooth bassline, no vocals, 
instrumental only"

Lyrics: (leave empty or use "instrumental")

5. The Genre Sweet Spots

Ace-Step excels at certain genres and struggles with others:

Strong Genres:

  • Pop, indie pop, synth-pop
  • Electronic (house, techno, ambient)
  • Lo-fi hip-hop
  • Rock (classic, alternative)
  • Ballads

Challenging Genres:

  • Aggressive rap (especially non-English)
  • Extremely niche subgenres
  • Highly orchestrated classical

6. Use the Turbo Model for Iteration

The turbo model (8 steps instead of 50) generates in ~2 seconds. Use it for:

  • Rapid prompt experimentation
  • Testing lyric flows
  • Finding the right tempo/key

Then switch to the SFT model for your final high-quality render.


⚠️ Warning: Vocals can sometimes “swallow” lyrics—skipping words or making them unintelligible. This is a known limitation. Structured prompts with short phrases help mitigate it.


Troubleshooting Common Issues

“The Vocals Sound Muffled or Skip Words”

Cause: Lyric misalignment, common in the base model

Solutions:

  • Use structured section labels ([verse], [chorus])
  • Shorten phrases to 4-8 syllables
  • Try the “official original version” model which has lyric adherence fixes
  • Adjust song duration in small increments (±10 seconds)

“ComfyUI Nodes Not Appearing”

Cause: Missing dependencies or outdated ComfyUI

Solutions:

```bash
# Update ComfyUI
cd /path/to/ComfyUI && git pull

# Reinstall requirements
pip install -r requirements.txt --upgrade

# Verify model placement
ls ComfyUI/models/checkpoints/acestep*
```

“CUDA Out of Memory”

Cause: GPU VRAM exceeded

Solutions:

  • Use a smaller LM (0.6B instead of 1.7B or 4B)
  • Enable INT8 quantization
  • Use CPU offloading for the LM
  • Reduce batch size to 1

For low VRAM (≤6GB):

```
# Disable LM entirely, use DiT-only mode
# Set acestep_model.loader.llm_model: null in config
```
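To pick the right LM size before hitting OOM, you can check available VRAM up front. A small helper (the thresholds follow the 6-12 / 12-20 / 24+ GB tiers listed earlier; treating the 20-24 GB gap as 1.7B territory is my own choice):

```python
def suggest_lm(total_vram_gb):
    """Map total VRAM (GB) to an LM size per the tiers quoted earlier."""
    if total_vram_gb is None or total_vram_gb < 6:
        return None          # skip the LM: DiT-only mode
    if total_vram_gb < 12:
        return "0.6B"
    if total_vram_gb < 24:
        return "1.7B"
    return "4B"

try:
    import torch
    gb = (torch.cuda.get_device_properties(0).total_memory / 1024**3
          if torch.cuda.is_available() else None)
except ImportError:
    gb = None  # no PyTorch / no GPU detected
print(suggest_lm(gb))
```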

“Training LoRA Stalls or Freezes”

Cause: UI conflicts or missing dependencies

Solutions:

```bash
# Use CLI training instead of UI
cd ACE-Step-1.5
python ./acestep/training/train_lora.py --help

# Install missing dependencies
pip install lightning tensorboard
```

“No Audio Output or Corrupted Files”

Cause: VAE decode failure or audio backend issues

Solutions:

  • Check that VAE decode node is connected
  • Reinstall PyTorch with correct CUDA version:
    pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
  • On Windows with Python 3.13+: use soundfile or scipy backend instead of torchaudio
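If torchaudio's save path is the culprit, you can bypass it entirely and write the decoded waveform with scipy, one of the fallback backends mentioned above. A minimal sketch (the synthetic sine stands in for the VAE decoder's float output):

```python
import numpy as np
from scipy.io import wavfile

# Write a decoded waveform (floats in [-1, 1]) to WAV without torchaudio.
sr = 44100
t = np.linspace(0, 1.0, sr, endpoint=False)
wave = (0.3 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)  # stand-in signal

wavfile.write("output.wav", sr, wave)          # note the (filename, rate, data) order
rate, data = wavfile.read("output.wav")
print(rate, data.shape)  # 44100 (44100,)
```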

Platform-Specific Notes

Windows:

  • Use the portable package for easiest setup
  • Python 3.13+ has torchaudio issues—stick to 3.11-3.12

macOS (Apple Silicon):

  • Use the MLX backend with dedicated launch scripts
  • Performance is acceptable but slower than NVIDIA

AMD (ROCm):

  • Windows requires Python 3.12 specifically
  • Use ROCm-specific launch scripts

Conclusion & Resources

Ace-Step 1.5 represents a watershed moment for AI music generation. For the first time, you can run truly capable music synthesis locally, on consumer hardware, with a permissive open-source license.

Is it as polished as Suno? Not quite. Will it replace professional music production? Not yet. But for developers, musicians, and creators who value privacy, customization, and long-term ownership, Ace-Step offers something invaluable: control.

The ability to train LoRAs on your own music, generate unlimited tracks without subscription fees, and integrate music generation into your own applications and workflows—these aren’t just features. They’re the foundation of a new creative paradigm.

Quick Start Checklist

  • Update ComfyUI to latest version
  • Download Ace-Step 1.5 checkpoint from Hugging Face
  • Place model in ComfyUI/models/checkpoints/
  • Load the ACE-Step workflow from Template Library
  • Start with turbo model for rapid iteration
  • Use structured tags + lyrics for best results


The future of music isn’t just generated—it’s yours to generate. Happy creating.

Anthony Lattanzio

Tech Enthusiast & Builder

I'm a tech enthusiast who loves building things with hardware and software. By night, I run a homelab that's grown way beyond what any reasonable person needs. Check out about me for more.
