HappyHorse 1.1 is now available in ComfyUI
Audio-native video generation with dialogue, sound effects, and multi-character consistency, built right into your workflows.
HappyHorse 1.1 is now available in ComfyUI as a Partner Node. This video model is engineered for real-world production use cases, including short episodic series, e-commerce commercials, brand marketing content, and game cutscenes.
A standout feature of the model is native synchronized audio generation. It produces dialogue, sound effects, and background music in a single render pass, with no extra steps.
Version 1.1 targets five core production-critical capabilities: dynamic, expressive motion; consistent character rendering; reliable prompt adherence; stable text rendering; and authentic cinematic framing.
What’s new in 1.1
Dynamic expressiveness: Smoother motion and frame consistency eliminate the stiff, sluggish movement from v1.0.
Enhanced multi-image reference-to-video (R2V): Faithfully preserves input details, supporting up to 9 reference images per generation.
Multi-character consistency: Each referenced character keeps a distinct look, with no visual cross-contamination between characters.
Flexible character × scene combinations: Feed characters and scenes as separate references. Characters stay fully consistent even as the background environment changes.
Upgraded instruction following: Better long-context retention handles prompts beyond 2,500 characters. A single prompt can describe 6–8 consecutive scenes, with the model autonomously allocating time and switching camera angles.
Natural skin and close-up viability: Fixes shiny skin and over-sharpening issues, with lifelike texture for series and commercials.
Cinematic language: Full support for terms like shot-reverse-shot and tracking shot, with far more cohesive transitions and pacing between shots.
Upgraded audio: More accurate dialogue and sound-effect rendering, with emotional performance layered on top of tight audio-video synchronization.
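To make the multi-scene prompting idea concrete, here is a minimal sketch of assembling several scene descriptions into one prompt string. The "Scene N:" layout and the helper name build_multi_scene_prompt are assumptions made for illustration; they are not a documented HappyHorse prompt syntax, and plain prose describing the scenes in order may work just as well.

```python
def build_multi_scene_prompt(scenes: list[str]) -> str:
    """Join scene descriptions into one numbered prompt.

    The "Scene N:" prefix is a hypothetical layout, not an official format.
    """
    if not scenes:
        raise ValueError("at least one scene is required")
    return "\n".join(
        f"Scene {i}: {text.strip()}" for i, text in enumerate(scenes, start=1)
    )


if __name__ == "__main__":
    scenes = [
        "Wide shot of a rainy street at night, neon signs reflecting in puddles.",
        "Tracking shot following a courier on a bicycle, tires hissing on wet asphalt.",
        "Close-up of the courier's face as she stops and says: 'Delivery for you.'",
    ]
    prompt = build_multi_scene_prompt(scenes)
    print(prompt)
    print(len(prompt), "characters")
```

The resulting string would be pasted into the prompt input of whichever HappyHorse node you are using.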
Three nodes, one model
HappyHorse 1.1 ships as three nodes, each tuned to a different job:
Text-to-Video (T2V): Build a complete scene from scratch. You control style, shot size, lighting, action, and audio entirely through the prompt.
Image-to-Video (I2V): Animate a static first frame. The image already carries the look, so you just describe the motion and the camera move.
Reference-to-Video (R2V): Orchestrate a multi-character stage play. Map characters and scenes to reference images, then direct them through a timestamped storyboard with per-character dialogue.
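As a rough illustration of the R2V workflow, the sketch below maps character and scene names to reference image files and renders a timestamped storyboard with per-character dialogue. Everything about the text layout (the "[0s-4s] Name: action" lines, the Dialogue label) and the names Beat and render_storyboard are assumptions for illustration only; the post does not specify the storyboard syntax the node expects. The cap of 9 reference images is the one figure taken from the post.

```python
from dataclasses import dataclass

MAX_REFERENCE_IMAGES = 9  # limit stated in the announcement


@dataclass
class Beat:
    start_s: float
    end_s: float
    speaker: str        # must be a key in the references mapping
    action: str
    dialogue: str = ""  # optional spoken line


def render_storyboard(references: dict[str, str], beats: list[Beat]) -> str:
    """Render beats as timestamped lines (hypothetical layout)."""
    if len(references) > MAX_REFERENCE_IMAGES:
        raise ValueError(
            f"{len(references)} references given, at most {MAX_REFERENCE_IMAGES} allowed"
        )
    lines = []
    for beat in beats:
        if beat.speaker not in references:
            raise KeyError(f"no reference image mapped for {beat.speaker!r}")
        if beat.end_s <= beat.start_s:
            raise ValueError(f"beat for {beat.speaker!r} must end after it starts")
        line = f"[{beat.start_s:g}s-{beat.end_s:g}s] {beat.speaker}: {beat.action}"
        if beat.dialogue:
            line += f' Dialogue: "{beat.dialogue}"'
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    # File names are placeholders for your own reference images.
    references = {"Mara": "mara.png", "Jon": "jon.png", "Kitchen": "kitchen.png"}
    beats = [
        Beat(0, 4, "Mara", "walks to the counter.", "Is anyone here?"),
        Beat(4, 7.5, "Jon", "looks up from the stove."),
    ]
    print(render_storyboard(references, beats))
```

Keeping the name-to-image mapping separate from the storyboard text makes it easy to reuse the same characters across different scenes.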
All three nodes support 720p and 1080p output, video lengths from 3 to 15 seconds, and flexible aspect ratios including 16:9, 9:16, 1:1, 4:3, 3:4, 21:9, and more. Every exported video comes with synced audio.
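The output limits above can be checked before queuing a run. The following is a small sketch of such a pre-flight check; the function name check_settings is hypothetical, the 3 to 15 second range is assumed to be inclusive, and because the post says "and more" for aspect ratios, an unlisted ratio is reported as a note and not treated as definitely unsupported.

```python
RESOLUTIONS = {"720p", "1080p"}
NAMED_ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4", "21:9"}
MIN_SECONDS, MAX_SECONDS = 3, 15  # assumed inclusive


def check_settings(resolution: str, aspect_ratio: str, seconds: float) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if resolution not in RESOLUTIONS:
        problems.append(f"resolution {resolution!r} is not 720p or 1080p")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        problems.append(
            f"duration {seconds}s is outside {MIN_SECONDS} to {MAX_SECONDS} seconds"
        )
    if aspect_ratio not in NAMED_ASPECT_RATIOS:
        # The announcement lists these ratios "and more", so this is only a note.
        problems.append(
            f"aspect ratio {aspect_ratio!r} is not among the ratios named in the post"
        )
    return problems


if __name__ == "__main__":
    for issue in check_settings("1080p", "2:1", 20):
        print("check:", issue)
```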
Getting Started
Update ComfyUI to the latest version.
Find the HappyHorse nodes via the Node Library (search “HappyHorse”) or load a ready-made template from the Templates Library.
Pick your mode (Text-to-Video, Image-to-Video, or Reference-to-Video), wire in your prompt and any reference images, then run. Output arrives at 720p or 1080p with audio baked in.


