The Stable Diffusion Moment for Audio? ACE-Step Audio Model Native Support in ComfyUI!
A ground-breaking audio generation model for the community!
Hi Comfy community! As many of you might know, a high-quality audio model just dropped this week. We would like to share that ComfyUI now supports ACE-Step natively!
ACE-Step is an open-source music generation model jointly developed by ACE Studio and StepFun. It can generate a wide range of music, including general songs, instrumentals, and experimental inputs, with support for multiple languages.
ACE-Step offers rich extensibility for the OSS community: through fine-tuning techniques such as LoRA and ControlNet, developers can adapt the model to their needs, whether for audio editing, vocal synthesis, accompaniment production, voice cloning, or style transfer. The model is a meaningful milestone for music and audio generation.
The model is released under the Apache-2.0 license and is free for commercial use. It also has good inference speed: the model synthesizes up to 4 minutes of music in just 20 seconds on an A100 GPU.
Get Started
Update ComfyUI or ComfyUI Desktop to the latest version.
Download the workflow below and drag it into ComfyUI.
Download the models as instructed and run! (A scripted download option is sketched after this list.)
Check our documentation for more detailed instructions!
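If you prefer to script the model download rather than fetching files by hand, here is a minimal sketch using the huggingface_hub client. The repo id, filename, and target folder are assumptions based on the usual Comfy-Org repackaged layout, so verify them against the workflow page and our documentation before running.

```python
# Minimal sketch of scripting the ACE-Step model download with huggingface_hub.
# The repo id and filename below are assumptions -- check the workflow page /
# documentation for the exact names and target folders before running.
from pathlib import Path

from huggingface_hub import hf_hub_download

COMFYUI_ROOT = Path("ComfyUI")  # adjust to your ComfyUI install location
checkpoints_dir = COMFYUI_ROOT / "models" / "checkpoints"
checkpoints_dir.mkdir(parents=True, exist_ok=True)

# Download the (assumed) all-in-one ACE-Step checkpoint into the checkpoints folder.
local_path = hf_hub_download(
    repo_id="Comfy-Org/ACE-Step_ComfyUI_repackaged",    # assumed repo id
    filename="all_in_one/ace_step_v1_3.5b.safetensors",  # assumed filename
    local_dir=checkpoints_dir,
)
print(f"Checkpoint saved to: {local_path}")
```

Note that hf_hub_download preserves the repo's subfolder structure under local_dir, so check the printed path and move the checkpoint directly into models/checkpoints if ComfyUI does not pick it up.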
Recently, we also added native support for HiDream-E1 and Wan2.1 FLF2V FP8.
HiDream E1 Native Support
HiDream-E1 is an interactive image editing model officially open-sourced by HiDream-ai on April 28, 2025, built on the HiDream-I1 model. It is released under the MIT License, supporting personal projects, scientific research, and commercial use.
You can now try the template in ComfyUI: Workflow → Browse Template → Image → HiDream E1, click and run! Or download the workflow below!
Example Outputs:
For a more detailed tutorial, please visit our documentation.
Wan2.1 FLF2V FP8 Update
Following Wan2.1 FLF2V fp16 support, we recently uploaded the fp8 version here. Now you can run the model with less VRAM. Please note that this model works best at 720p resolution; smaller resolutions might not produce decent results.
See our documentation for more community resources and detailed instructions.
Enjoy generating!
"The model is released under the Apache-2.0 license and is free for commercial use. It also has good inference speed: the model synthesizes up to 4 minutes of music in just 20 seconds on an A100 GPU."
hmmm, wonder if it works on a 3090/4090, not sure many of us have a spare A100 lol
EDIT: works ok on a 3090, VERY fast. Low quality audio but it does do music and lyrics. VERY interesting! Will be keen to see where this goes, maybe next year I won't need my annual Suno subscription.