We’re excited to share that ACE-Step 1.5 is now available in ComfyUI! This major update to the open-source music generation model brings commercial-grade quality to your local machine—generating full songs in under 10 seconds on consumer hardware.
What’s New in ACE-Step 1.5
ACE-Step 1.5 introduces a novel hybrid architecture that fundamentally changes how AI generates music. At its core, a Language Model acts as an omni-capable planner, transforming simple user queries into comprehensive song blueprints—scaling from short loops to 10-minute compositions.
Commercial-Grade Quality
On standard evaluation metrics, ACE-Step 1.5 achieves quality beyond most commercial music models, scoring 4.72 on musical coherence.
Blazing Fast Generation
Generate a full 4-minute song in ~1 second on an RTX 5090, or under 10 seconds on an RTX 3090.
Runs on Consumer Hardware
50+ Language Support
Strict adherence to prompts across 50+ languages, with particularly strong support for English, Chinese, Japanese, Korean, Spanish, German, French, Portuguese, Italian, and Russian.
Chain-of-Thought Planning
The model synthesizes metadata, lyrics, and captions via Chain-of-Thought reasoning to guide the diffusion process, resulting in more coherent long-form compositions.
LoRA Fine-Tuning
ACE-Step 1.5 supports lightweight personalization through LoRA training. With just a few songs—or a few dozen—you can train a LoRA that captures a specific style.
A LoRA lets creators fine-tune toward a specific style using their own music: it learns from your songs and captures your sound. And because you run it locally, you own the LoRA and don’t have to worry about data leakage.
How It Works
ACE-Step 1.5 combines several architectural innovations:
Hybrid LM + DiT Architecture: A Language Model plans the song structure while a Diffusion Transformer (DiT) handles audio synthesis.
Distribution Matching Distillation: Leverages Z-Image's DMD2 to deliver both fast generation (2 seconds on an A100) and better quality.
Intrinsic Reinforcement Learning: Alignment is achieved through the model’s internal mechanisms, eliminating biases from external reward models.
Self-Learning Tokenizer: The audio tokenizer is learned during DiT training, closing the gap between generation and tokenization.
Coming Soon
ACE-Step 1.5 has a few more tricks up its sleeve. These aren’t yet supported in ComfyUI, but we have no doubt the community will figure them out.
Cover
Give the model any song as input along with a new prompt and lyrics, and it will reimagine the track in a completely different style.
Repaint
Sometimes a generated track is 90% perfect and 10% not quite right. Repaint fixes that. Select a segment, regenerate just that section, and the model stitches it back in while keeping everything else untouched.
Vocal Examples
Neo-Soul: A warm, organic neo-soul track dripping with live instrumentation and effortless groove. A live drummer plays a loose, hip-hop influenced pocket: soft kick drum with lazy swing, snare hits that sit just behind the beat, and brushed hi-hats that breathe and shuffle with human imperfection.
UK Garage: A skippy, energetic UK garage track built on a classic two-step drum pattern with shuffling hi-hats and a punchy, syncopated kick and snare. A warm, wobbling Reese bass line provides the low-end foundation and chopped, pitched-up female vocal samples create the melodic hooks.
K-Pop: A slick, maximalist K-pop track that genre-hops with precision and style. The production shifts seamlessly between sections: a hard-hitting trap-influenced verse with rapid-fire rapping, a softer R&B pre-chorus with breathy vocals and lush harmonies, then an explosive, synth-driven pop chorus with an earworm hook.
Instrumental Examples
Synth-wave: A nostalgic, cinematic ride through neon and chrome. Punchy gated drums with big reverb snare, arpeggiated synth lines running through chorus and delay, warm analog bass, and soaring lead melodies that feel heroic and bittersweet. Driving but emotional, like the credits rolling on a film that never existed.
Meditative Roller: A deep, meditative roller locked into a hypnotic 140 BPM groove, all smooth forward motion and late-night introspection. The bass line is the soul of it: warm, undulating, endlessly cycling through subtle variations like waves lapping at a shore, never jarring, never stopping.
Progressive House: A warm, rolling journey that builds patiently. Soft four-on-the-floor kick with airy hats, a plucky melodic synth hook that repeats and evolves, pads that swell across long phrases, and subtle acid bass bubbling underneath. Emotional but restrained, always moving forward toward a sunrise.
Getting Started
For ComfyUI Desktop & Local Users
Update ComfyUI to the latest version
Go to Template Library → Audio and select the ACE-Step 1.5 workflow
Download the model when prompted (or manually from Hugging Face)
Add your style tags and lyrics, then run!
Workflow Tips
Style Tags: Be descriptive! Include genre, instruments, mood, tempo, and vocal style. Example:
rock, hard rock, alternative rock, clear male vocalist, powerful voice, energetic, electric guitar, bass, drums, anthem, 120 bpm
Lyrics Structure: Use tags like [verse], [chorus], and [bridge] to guide song structure.
Duration: Start with 90–120 seconds for more consistent results. Longer durations (180+ seconds) may require generating multiple batches.
Batch Generation: Set batch_size to 8 or 16 and pick the best result. The model can be inconsistent, so generating multiple samples helps.
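If you prepare prompts outside ComfyUI, the style tag and lyrics tips above can be captured in a small helper. This is a minimal sketch in plain Python, not part of ComfyUI or ACE-Step: the function names build_style_tags and build_lyrics are hypothetical, and limiting section names to verse, chorus, and bridge is an assumption based only on the tags named in this post. The resulting strings are meant to be pasted into the workflow's style tag and lyrics inputs.

```python
# Hypothetical helpers for preparing ACE-Step 1.5 prompt text.
# Nothing here calls ComfyUI or the model; it only builds strings.

# Assumption: only the section tags named in this post are accepted.
ALLOWED_SECTIONS = ("verse", "chorus", "bridge")


def build_style_tags(genres, vocals, mood, instruments, bpm):
    """Join descriptive tags into one comma-separated string.

    Tags are de-duplicated case-insensitively while keeping their order,
    and the tempo is appended as "<bpm> bpm".
    """
    parts = [*genres, *vocals, *mood, *instruments, f"{bpm} bpm"]
    seen = set()
    unique = []
    for part in parts:
        cleaned = part.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return ", ".join(unique)


def build_lyrics(sections):
    """Format (section_name, lines) pairs as tagged lyric blocks."""
    blocks = []
    for name, lines in sections:
        if name not in ALLOWED_SECTIONS:
            raise ValueError(f"unsupported section tag: {name!r}")
        blocks.append(f"[{name}]\n" + "\n".join(lines))
    # A blank line between blocks keeps the structure easy to read.
    return "\n\n".join(blocks)


if __name__ == "__main__":
    tags = build_style_tags(
        genres=["rock", "hard rock", "alternative rock"],
        vocals=["clear male vocalist", "powerful voice"],
        mood=["energetic", "anthem"],
        instruments=["electric guitar", "bass", "drums"],
        bpm=120,
    )
    lyrics = build_lyrics([
        ("verse", ["First line of the verse", "Second line of the verse"]),
        ("chorus", ["The hook goes here"]),
    ])
    print(tags)
    print(lyrics)
```

Keeping the tag list and the lyrics as separate strings mirrors the two inputs mentioned in the steps above, which makes it easy to vary one while holding the other fixed across a batch.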
As always, enjoy creating!





