ComfyUI now supports Nvidia Cosmos!

Text to Video and Image to Video

Jan 17, 2025

This year starts off with a great open model release from Nvidia, who released their confusingly marketed Cosmos family of models a few days ago.

These models, which Nvidia calls "World Models", are actually extremely good SOTA video models. ComfyUI currently supports the 7B and 14B text to video and image to video diffusion models.

For most users I recommend the 7B models. They should fit on a 24GB GPU at full 16 bit precision without offloading, and will also work on a 12GB GPU with the automatic ComfyUI weight offloading.
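
As a rough back-of-the-envelope check of these sizes, here is a small sketch. It assumes the "7B" and "14B" names are the nominal parameter counts and counts only the diffusion model weights at 2 bytes per parameter; activations, the text encoder and the VAE are ignored, so real usage will be higher.

```python
# Assumption: 16 bit precision (fp16/bf16) means 2 bytes per parameter.
BYTES_PER_PARAM_16BIT = 2


def weight_size_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_16BIT) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: "7B" and "14B" are taken as exactly 7e9 and 14e9 parameters.
for name, params in [("7B", 7e9), ("14B", 14e9)]:
    print(f"{name}: ~{weight_size_gb(params):.0f} GB of weights at 16 bit")
```

By this estimate the 7B weights come to about 14 GB, which leaves headroom on a 24GB card, while the 14B weights alone come to about 28 GB.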

This new release also comes with a new sampler, available now in your favorite sampler node: res_multistep, which Nvidia used in their Cosmos implementation. This sampler can be used with every model supported by ComfyUI, and I heard it also gives good results on hunyuan video.

What makes Nvidia Cosmos a great model:

Their VAE is by far the most compute/memory efficient video VAE yet. It is so efficient that you can encode/decode a 1280x704 121 frame video on a GPU with 12GB of VRAM without any tiling tricks, while keeping very high quality. This makes it a massive ~50x more memory efficient than the hunyuan video VAE.
Non distilled: negative prompts will work, and the model should be easier to train than distilled models like hunyuan video.
Image to video that works very well and can be controlled by a prompt. The image to video model behaves like an inpainting model so you can do things like generate from the last frame instead of the first frame or generate the video between two images.
This model will always make a video with movement if you generate the required 121 frames. I have never seen it generate a video without movement.

Some downsides:

The model really likes 121 frames and starts breaking if you generate fewer or more frames.
The lowest resolution the model can handle is 704x704.
Long prompts (a few sentences) are required. The model will not follow the prompt if it is too short.
It’s slow. It takes over 10 minutes to generate a 1280x704 121 frame video on a 4090 (perfect for heating your room in winter).
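
The constraints in this list can be turned into a quick sanity check before queueing a long generation. The sketch below is only an illustration: `check_cosmos_settings` is a hypothetical helper, not part of ComfyUI, and it assumes that "lowest resolution 704x704" means each side must be at least 704 pixels and that "a few sentences" means at least three.

```python
import re

REQUIRED_FRAMES = 121  # from the post: the model breaks with fewer or more frames
MIN_SIDE = 704         # assumption: each side must be at least 704 pixels
MIN_SENTENCES = 3      # assumption: "a few sentences" read as at least three


def check_cosmos_settings(width: int, height: int, frames: int, prompt: str) -> list[str]:
    """Hypothetical helper: return warnings for settings the post says work badly."""
    warnings = []
    if frames != REQUIRED_FRAMES:
        warnings.append(f"frames={frames}: the model expects {REQUIRED_FRAMES} frames")
    if width < MIN_SIDE or height < MIN_SIDE:
        warnings.append(f"{width}x{height}: below the {MIN_SIDE}x{MIN_SIDE} minimum")
    # Very rough sentence count: split on sentence-ending punctuation.
    sentences = [s for s in re.split(r"[.!?]+", prompt) if s.strip()]
    if len(sentences) < MIN_SENTENCES:
        warnings.append("prompt is short: the model may not follow it")
    return warnings


# Usage: an empty list means none of the known problem cases were hit.
print(check_cosmos_settings(512, 512, 60, "a cat"))
```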

For basic workflows and examples, see the Nvidia Cosmos examples page.

I’ll leave you with a few examples of what Cosmos can do:

Cosmos 7B image to video using an anime image made with Flux dev

Nvidia Cosmos 7B image to video using an anime image made with NoobAI vpred

As a reminder you can check the Nvidia Cosmos examples page for workflows.

For another piece of confusing marketing make sure to check out our 2 year anniversary post where we compare ComfyUI to an operating system:

🎂 ComfyUI Turns 2: A Journey and Call for Talent

Dmitry Markov

Jan 18

I am very interested in this point. Can it be implemented somehow in ComfyUI? (Otherwise I am afraid it will be like with other models, i.e. the model itself can do a lot of things, but it is not possible to do them in ComfyUI.)

3. Image to video that works very well and can be controlled by a prompt. The image to video model behaves like an inpainting model so you can do things like generate from the last frame instead of the first frame or generate the video between two images.

Nes

Jan 30

Using the presented workflow, you can specify and generate start and end images. However, it is not possible to generate a video where the end image is at the end of the generated video.

I tried various examples, but in most cases, the start image immediately transitions to the end image a few frames later, and after that, a video is created that combines the end image with the prompt instruction.

Is it possible to use a node to specify the position of the input image in the generated frame?
