Generate Studio-Quality Music with Ace-Step 1.5 and ComfyUI
Learn how to generate professional AI music using Ace-Step 1.5 in ComfyUI - open-source, runs on consumer GPU, rivals Suno quality.
Table of Contents
- The AI Music Revolution
- What is Ace-Step 1.5?
- The Architecture
- Key Capabilities
- Model Variants
- Why Choose Ace-Step Over Suno or Udio?
- The Real Cost Breakdown
- When to Stick with Commercial Options
- Setting Up Ace-Step in ComfyUI
- Prerequisites
- Step 1: Update ComfyUI
- Step 2: Download the Models
- Step 3: Install Custom Nodes (Optional)
- Step 4: Load the Workflow
- Your First Music Generation
- Basic Text-to-Music Workflow
- Key ComfyUI Nodes
- Understanding the Parameters
- Generation Workflow
- Pro Tips for Better Results
- 1. Master the Two-Layer Prompt System
- 2. Rhythmic Lyric Phrasing
- 3. Iterate with Small Changes
- 4. For Instrumental Tracks
- 5. The Genre Sweet Spots
- 6. Use the Turbo Model for Iteration
- Troubleshooting Common Issues
- “The Vocals Sound Muffled or Skip Words”
- “ComfyUI Nodes Not Appearing”
- “CUDA Out of Memory”
- “Training LoRA Stalls or Freezes”
- “No Audio Output or Corrupted Files”
- Platform-Specific Notes
- Conclusion & Resources
- Quick Start Checklist
- Essential Links
The AI music generation landscape has a new heavyweight contender. Meet Ace-Step 1.5—an open-source foundation model that runs locally on consumer hardware while delivering quality that rivals commercial giants like Suno and Udio.
If you’ve been frustrated by subscription fees, upload limits, or privacy concerns with cloud-based music generators, this is the guide you’ve been waiting for.
The AI Music Revolution
For the past year, Suno and Udio have dominated headlines with their uncanny ability to generate full songs from text prompts. Impressive? Absolutely. But there’s been a catch: you’re renting access to someone else’s compute, on someone else’s terms.
Ace-Step 1.5 changes the equation. Developed collaboratively by ACE Studio and StepFun, it brings commercial-grade music generation to your local machine. We’re talking:
- Under 10 seconds to generate a full song on an RTX 3090
- Under 4GB VRAM—yes, it runs on modest gaming GPUs
- MIT license—free for personal and commercial use
- 50+ languages supported natively
This isn’t a toy project. It’s a legitimate alternative for developers, musicians, and content creators who want full control over their AI music pipeline.
What is Ace-Step 1.5?
Ace-Step 1.5 is a hybrid architecture that combines the best of language models and diffusion transformers. Think of it as two systems working in concert: a “planner” that understands what you want, and a “synthesizer” that builds the actual audio.
The Architecture
1. Language Model (LM) — The Planner
When you type a prompt like “upbeat synth-pop with female vocals,” the LM transforms that into a comprehensive blueprint. It synthesizes metadata, lyrics, and captions through chain-of-thought reasoning—all without requiring massive prompt engineering on your end.
The LM comes in three sizes:
- 0.6B parameters — For 6-12GB VRAM cards
- 1.7B parameters — Sweet spot for 12-20GB VRAM
- 4B parameters — Maximum quality for 24GB+ cards
2. Diffusion Transformer (DiT) — The Synthesizer
The core generative engine uses a clever optimization: linear attention instead of the standard quadratic approach. This reduces complexity from O(N²) to O(N)—a massive speedup that explains those sub-10-second generation times.
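The speedup comes from the classic linear-attention trick: apply a positive feature map to queries and keys, then reorder the matrix multiplications so the N×N score matrix is never materialized. Ace-Step's exact kernel isn't specified here; the NumPy sketch below is a generic illustration of that reordering, not the model's implementation:

```python
import numpy as np

def softmax_attention(q, k, v):
    # Standard attention: the N x N score matrix makes this O(N^2) in sequence length
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Kernelized attention: computing phi(k)^T v first yields a small d x d
    # matrix, so the total cost is O(N * d^2) -- linear in sequence length N
    kv = phi(k).T @ v               # (d, d)
    z = phi(k).sum(axis=0)          # (d,) normalizer
    return (phi(q) @ kv) / (phi(q) @ z)[:, None]

n, d = 512, 64
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```

For short sequences the two give similar-quality mixtures; the payoff shows up as audio latent sequences grow, which is exactly where quadratic attention becomes the bottleneck.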
3. Deep Compression AutoEncoder (DCAE)
Traditionally, audio autoencoders achieve 8x compression. Ace-Step’s DCAE hits 32x compression while maintaining quality. Fewer latent tokens mean faster everything—training and inference alike.
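To see why 32x matters, here's back-of-the-envelope arithmetic. Treating the ratio as acting directly on a 44.1 kHz sample stream is an illustrative simplification (the real DCAE's internals differ), but the relative savings hold:

```python
# Illustrative only: apply the compression ratio directly to the sample count
sample_rate = 44_100
duration_s = 180                      # a 3-minute song

for ratio in (8, 32):                 # conventional vs. DCAE compression
    latent_len = sample_rate * duration_s // ratio
    print(f"{ratio}x -> {latent_len:,} latent positions")
```

A 4x shorter latent sequence means 4x fewer positions for the DiT to attend over at every diffusion step, which compounds with the linear-attention savings.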
4. Multi-Modal Conditioning
Multiple encoders feed into the system:
- Text Encoder — Processes your descriptive prompts
- Lyric Encoder — Aligns words with musical timing
- Speaker Encoder — Enables voice cloning capabilities
Key Capabilities
| Feature | What It Does |
|---|---|
| Text-to-Music | Generate full songs from descriptions and lyrics |
| Cover Generation | Reimagine existing audio in different styles |
| Repaint & Edit | Selective regeneration of specific sections |
| Track Separation | Isolate vocals, drums, bass, and other stems |
| Vocal2BGM | Generate accompaniment for vocal tracks |
| LoRA Training | Fine-tune on your own music in one click |
| Audio Analysis | Extract BPM, key, time signature automatically |
Model Variants
Three flavors to match your speed/quality tradeoff:
| Model | CFG | Steps | Quality | Best For |
|---|---|---|---|---|
| `acestep-v15-base` | ✅ | 50 | Medium | Experimentation |
| `acestep-v15-sft` | ✅ | 50 | High | Balanced work |
| `acestep-v15-turbo` | ❌ | 8 | Very High | Production speed |
💡 Pro Tip: Start with `turbo` for rapid iteration, then switch to `sft` for your final render. The 8-step turbo mode generates in literal seconds.
Why Choose Ace-Step Over Suno or Udio?
Let’s be honest—Suno and Udio have polished user experiences. But Ace-Step offers something they can’t: sovereignty.
| Feature | Ace-Step 1.5 | Suno | Udio |
|---|---|---|---|
| Local Execution | ✅ Your GPU | ❌ Cloud only | ❌ Cloud only |
| Min VRAM | Under 4GB | N/A | N/A |
| License | MIT (free commercial) | Subscription | Subscription |
| Max Duration | 10 minutes | 8 minutes | 4-15 min |
| Multi-Language | 50+ languages | Limited | Limited |
| LoRA Training | ✅ Yes | ❌ No | ❌ No |
| Stem Separation | ✅ Built-in | ✅ Yes | ❌ No |
| Privacy | ✅ 100% local | ❌ Uploaded | ❌ Uploaded |
The Real Cost Breakdown
Suno Pro: $10/month for 2,500 credits (roughly 500 songs)
Udio: $10/month for 600 generations
Ace-Step: $0/month + your existing GPU
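Plugging the plan numbers above into quick arithmetic shows the gap per track. The local figure uses an assumed power draw, generation time, and electricity rate (350 W, 10 s/song, $0.15/kWh), which are rough estimates, not measurements:

```python
# Rough cost per generated track, using the plan numbers quoted above
suno = 10 / 500                       # Suno Pro: $10/month for ~500 songs
udio = 10 / 600                       # Udio: $10/month for 600 generations
local = 0.350 * (10 / 3600) * 0.15    # kW * hours * $/kWh (assumed values)
print(f"Suno ${suno:.4f}  Udio ${udio:.4f}  Local ~${local:.6f} per song")
```

Even with pessimistic local assumptions, the marginal cost of one more generation on your own GPU is effectively a rounding error.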
For heavy users, the economics speak for themselves. But beyond cost, consider:
- Privacy: Your prompts and lyrics never leave your machine
- No rate limits: Generate as much as your GPU can handle
- Ownership: MIT license means you own everything you create
- Customization: Train LoRAs on your own music library
When to Stick with Commercial Options
Ace-Step isn’t perfect. If you need:
- The absolute best vocal quality (Udio still edges ahead)
- “One-click banger” simplicity (Suno excels here)
- Zero technical setup
Then cloud services might still be your best bet. But if you’re willing to invest a little setup time for long-term freedom, read on.
Setting Up Ace-Step in ComfyUI
ComfyUI provides a visual node-based interface for Ace-Step—perfect for experimentation and building complex workflows. Here’s how to get running.
Prerequisites
| Requirement | Details |
|---|---|
| Python | 3.11-3.12 (avoid 3.13+ on Windows) |
| GPU | NVIDIA recommended; also supports MPS, ROCm, Intel XPU |
| VRAM | 4GB minimum, 8GB+ recommended |
| Disk | ~10GB for full models |
| ComfyUI | Latest version (updated) |
Step 1: Update ComfyUI
```bash
cd /path/to/ComfyUI
git pull
pip install -r requirements.txt
```
Step 2: Download the Models
Option A: All-in-One Checkpoint (Recommended for Beginners)
Download the single .safetensors file (~10GB) from Hugging Face:
```bash
# Using huggingface-cli
huggingface-cli download ACE-Step/Ace-Step1.5 \
  --local-dir ComfyUI/models/checkpoints/acestep-v15-turbo
```
Or manually download from huggingface.co/ACE-Step/Ace-Step1.5 and place in ComfyUI/models/checkpoints/.
Option B: Split Model Files (For Advanced Users)
```text
ComfyUI/models/Ace-Step1.5/
├── acestep-v15-turbo/
├── acestep-5Hz-lm-1.7B/
├── vae/
└── Qwen3-Embedding-0.6B/
```
This approach lets you mix and match components.
Step 3: Install Custom Nodes (Optional)
For extended features like cover generation, repainting, and LoRA training:
```bash
cd ComfyUI/custom_nodes

# Kaola nodes for advanced features
git clone https://github.com/kana112233/ComfyUI-kaola-ace-step.git
cd ComfyUI-kaola-ace-step && pip install -r requirements.txt

# LoRA training support (return to custom_nodes first)
cd ..
git clone https://github.com/filliptm/ComfyUI-FL-AceStep-Training.git
cd ComfyUI-FL-AceStep-Training && pip install -r requirements.txt
```
Step 4: Load the Workflow
- Launch ComfyUI
- Open the Template Library (usually in the sidebar)
- Navigate to Audio category
- Select “ACE-Step 1.5 music generation”
- Configure the `CheckpointLoaderSimple` node to point to your downloaded model
⚠️ Windows Users: If you’re on Python 3.13+, you may encounter torchaudio backend issues. Downgrade to Python 3.12 or use the portable package from the official releases.
Your First Music Generation
Let’s walk through creating your first AI-generated song from scratch.
Basic Text-to-Music Workflow
The Prompt Structure
Think of your input in two layers:
- Tags — Global control: genre, tempo, instruments, vocal style
- Lyrics — The actual words with structure markers
Example: Upbeat Indie Pop
Tags:

```text
upbeat indie pop, 120 bpm, catchy synth melody, driving drums,
warm female vocals, clean mix, 2010s radio feel
```

Lyrics:

```text
[intro]
Synth arpeggio builds
[verse]
City lights glow, chasing sunsets
Every moment a new regret
Walking fast, no time to rest
[chorus]
Oh, the rhythm takes control
Losing myself, losing my soul
Dancing through the night
[bridge]
Just for tonight, let's break free
Dancing wild, wild and free
[outro]
Fade out with synth echoes
```
Key ComfyUI Nodes
| Node | Purpose |
|---|---|
| `CheckpointLoaderSimple` | Load the Ace-Step 1.5 model |
| `TextEncodeAceStepAudio1.5` | Encode prompts and lyrics |
| `ModelSamplingAuraFlow` | Configure sampler settings |
| `KSampler` | Run the diffusion sampling |
| `VAE Decode` | Convert latent to audio waveform |
Understanding the Parameters
| Parameter | Description | Good Starting Value |
|---|---|---|
| Song Duration | Length in seconds | 60-180 |
| Seed | Random seed | Fixed for reproducibility |
| KSampler Steps | Diffusion iterations | 8 (turbo), 30-50 (quality) |
| BPM | Tempo hint | 60-180 |
| Key Scale | Musical key | C major, A minor, etc. |
Generation Workflow
- Set your tags in the text encoder node
- Input structured lyrics with section labels
- Configure duration (start with 90-120 seconds)
- Set a fixed seed for reproducibility
- Queue the prompt and wait for generation
If the result isn’t perfect on the first try, don’t worry—that’s normal. AI music generation is iterative.
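Once a workflow works in the UI, the queue step can also be scripted: ComfyUI exposes an HTTP endpoint (`POST /prompt`, default port 8188) that accepts a workflow graph exported via the UI's "Save (API Format)" option. A minimal sketch, with the filename and host as placeholders:

```python
import json
import urllib.request

def build_payload(workflow: dict) -> bytes:
    # ComfyUI expects the node graph under the "prompt" key
    return json.dumps({"prompt": workflow}).encode("utf-8")

def queue_prompt(workflow: dict, host: str = "127.0.0.1:8188") -> dict:
    """Submit a workflow graph to a locally running ComfyUI instance."""
    req = urllib.request.Request(
        f"http://{host}/prompt",
        data=build_payload(workflow),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (requires a running ComfyUI and a workflow saved in API format):
#   workflow = json.load(open("acestep_workflow_api.json"))
#   result = queue_prompt(workflow)
```

This is handy for batch-generating seed variations of the same prompt overnight.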
💡 Pro Tip: Use structured section labels like `[verse]`, `[chorus]`, and `[bridge]`. The model uses these to understand song form and create natural transitions.
Pro Tips for Better Results
After generating hundreds of tracks, here are the patterns that consistently produce better output.
1. Master the Two-Layer Prompt System
Tags control the soundscape:
```text
Genre: funk, pop, disco, lofi hip-hop, ambient
Tempo: 120 bpm, up-tempo, slow ballad
Instruments: slap bass, drum machine, distorted guitar, Rhodes piano
Vocal Type: male vocals, clean, rhythmic female vocals
Era/Production: 80s style, punchy, dry mix, lo-fi, vinyl crackle
```
Lyrics control the structure:
- Use section labels consistently
- Keep phrases rhythmic
- Avoid long, prose-like sentences
2. Rhythmic Lyric Phrasing
Short, punchy phrases work better than poetry:
Better:

```text
[verse]
Running fast, can't slow down
Heart beats like a drum
City lights, blurry now
Moving to the rhythm
```

Worse:

```text
[verse]
When I was running through the city streets with my heart pounding
and all the lights became a blur around me as I moved forward
```
Treat vocals as part of the groove. Consistent syllable counts improve vocal stability dramatically.
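A quick way to sanity-check that consistency is a rough syllable counter. The vowel-group heuristic below is a crude approximation (not real phonetics), but it's enough to spot a line that's wildly longer than its neighbors:

```python
import re

def rough_syllables(line: str) -> int:
    # Naive heuristic: count vowel groups per word, minimum one per word
    return sum(
        max(1, len(re.findall(r"[aeiouy]+", word.lower())))
        for word in re.findall(r"[a-zA-Z']+", line)
    )

verse = [
    "Running fast, can't slow down",
    "Heart beats like a drum",
    "City lights, blurry now",
    "Moving to the rhythm",
]
for line in verse:
    print(rough_syllables(line), line)
```

If one line scores 12 while the rest score 5-6, that's the line most likely to get swallowed or rushed by the vocal model.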
3. Iterate with Small Changes
The “gacha” nature of AI music means you’ll generate multiple versions. Optimize your workflow:
- Lock the seed when you find something close
- Tweak tags first (tempo, mood, instruments)
- Then adjust lyrics section by section
- Avoid rewriting everything between generations
4. For Instrumental Tracks
Be explicit about excluding vocals:
```text
Tags: "lofi hip-hop instrumental, 85 bpm, chill vibe, warm Rhodes piano,
subtle vinyl crackle, boom-bap drums, smooth bassline, no vocals,
instrumental only"

Lyrics: (leave empty or use "instrumental")
```
5. The Genre Sweet Spots
Ace-Step excels at certain genres and struggles with others:
Strong Genres:
- Pop, indie pop, synth-pop
- Electronic (house, techno, ambient)
- Lo-fi hip-hop
- Rock (classic, alternative)
- Ballads
Challenging Genres:
- Aggressive rap (especially non-English)
- Extremely niche subgenres
- Highly orchestrated classical
6. Use the Turbo Model for Iteration
The turbo model (8 steps instead of 50) generates in ~2 seconds. Use it for:
- Rapid prompt experimentation
- Testing lyric flows
- Finding the right tempo/key
Then switch to the SFT model for your final high-quality render.
⚠️ Warning: Vocals can sometimes “swallow” lyrics—skipping words or making them unintelligible. This is a known limitation. Structured prompts with short phrases help mitigate it.
Troubleshooting Common Issues
“The Vocals Sound Muffled or Skip Words”
Cause: Lyric misalignment, common in the base model
Solutions:
- Use structured section labels (`[verse]`, `[chorus]`)
- Shorten phrases to 4-8 syllables
- Try the “official original version” model which has lyric adherence fixes
- Adjust song duration in small increments (±10 seconds)
“ComfyUI Nodes Not Appearing”
Cause: Missing dependencies or outdated ComfyUI
Solutions:
```bash
# Update ComfyUI
cd /path/to/ComfyUI && git pull

# Reinstall requirements
pip install -r requirements.txt --upgrade

# Verify model placement
ls ComfyUI/models/checkpoints/acestep*
```
“CUDA Out of Memory”
Cause: GPU VRAM exceeded
Solutions:
- Use a smaller LM (0.6B instead of 1.7B or 4B)
- Enable INT8 quantization
- Use CPU offloading for the LM
- Reduce batch size to 1
For low VRAM (≤6GB): disable the LM entirely and run in DiT-only mode by setting `acestep_model.loader.llm_model: null` in the config.
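Assuming the config file is YAML (an assumption inferred from the dotted key path — check your install's actual config format), that setting expands to:

```yaml
# DiT-only mode for low-VRAM cards: skip loading the LM planner
acestep_model:
  loader:
    llm_model: null
```

Without the LM planner you lose the automatic prompt-to-blueprint step, so be more explicit in your tags and lyrics.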
“Training LoRA Stalls or Freezes”
Cause: UI conflicts or missing dependencies
Solutions:
```bash
# Use CLI training instead of the UI
cd ACE-Step-1.5
python ./acestep/training/train_lora.py --help

# Install missing dependencies
pip install lightning tensorboard
```
“No Audio Output or Corrupted Files”
Cause: VAE decode failure or audio backend issues
Solutions:
- Check that VAE decode node is connected
- Reinstall PyTorch with the correct CUDA version: `pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121`
- On Windows with Python 3.13+: use the `soundfile` or `scipy` backend instead of `torchaudio`
Platform-Specific Notes
Windows:
- Use the portable package for easiest setup
- Python 3.13+ has torchaudio issues—stick to 3.11-3.12
macOS (Apple Silicon):
- Use the MLX backend with dedicated launch scripts
- Performance is acceptable but slower than NVIDIA
AMD (ROCm):
- Windows requires Python 3.12 specifically
- Use ROCm-specific launch scripts
Conclusion & Resources
Ace-Step 1.5 represents a watershed moment for AI music generation. For the first time, you can run truly capable music synthesis locally, on consumer hardware, with a permissive open-source license.
Is it as polished as Suno? Not quite. Will it replace professional music production? Not yet. But for developers, musicians, and creators who value privacy, customization, and long-term ownership, Ace-Step offers something invaluable: control.
The ability to train LoRAs on your own music, generate unlimited tracks without subscription fees, and integrate music generation into your own applications and workflows—these aren’t just features. They’re the foundation of a new creative paradigm.
Quick Start Checklist
- Update ComfyUI to latest version
- Download Ace-Step 1.5 checkpoint from Hugging Face
- Place the model in `ComfyUI/models/checkpoints/`
- Load the ACE-Step workflow from the Template Library
- Start with turbo model for rapid iteration
- Use structured tags + lyrics for best results
Essential Links
Official Resources:
ComfyUI Integration:
Community Resources:
The future of music isn’t just generated—it’s yours to generate. Happy creating.