🎬 Introduction
Welcome to a powerhouse ComfyUI workflow designed to unlock the incredible potential of the Wan 2.2 14B Sound-to-Video (S2V) model. This isn't just a simple implementation; it's a comprehensive suite that addresses two critical needs for AI video generation: accessibility and speed.
This all-in-one workflow provides two parallel generation pipelines:
⚡ Lightning Fast (4-Step) Pipeline: Utilizes a specialized LoRA to generate videos in a fraction of the time, perfect for rapid prototyping and iteration.
🎨 High Fidelity (20-Step) Pipeline: The classic, high-quality generation process for when you demand the utmost visual fidelity from your outputs.
Crucially, both versions are configured to run using GGUF-quantized models, dramatically reducing VRAM requirements and making this massive 14B parameter model accessible to users with consumer-grade hardware.
✨ Key Features & Highlights
Dual Mode Operation: Choose between speed and quality with two self-contained workflows in one JSON file. Easily enable/disable either section.
GGUF Quantization Support: Run the massive Wan 2.2 model without needing a professional GPU. Leverages the `LoaderGGUF` and `ClipLoaderGGUF` nodes.
Extended Video Generation: The workflow includes built-in "Video S2V Extend" subgraphs, each of which adds 77 frames. The template is pre-configured with two extenders, resulting in a ~5-second video at 16 FPS. Want a longer video? Simply copy and paste more extender nodes!
Audio-Driven Animation: Faithfully implements the S2V model's core function: animating a reference image in sync with an uploaded audio file (e.g., music, speech).
Smart First-Frame Fix: Includes a clever hack to correct the first frame, which is often "overbaked" by the VAE decoder.
Detailed Documentation: The workflow itself is filled with informative notes and markdown nodes explaining crucial settings like batch size and chunk length.
🧩 How It Works (The Magic Behind the Scenes)
The workflow is logically grouped into clear steps:
Load Models (GGUF): The `LoaderGGUF` and `ClipLoaderGGUF` nodes load the quantized UMT5 text encoder and the main UNet model, drastically reducing VRAM load compared to full-precision models.
Upload Inputs: You provide two key ingredients:
ref_image: the starting image you want to animate (e.g., a character portrait).
audio: the sound file that will drive the motion and pacing of the animation.
Encode Prompts & Audio: Your positive and negative prompts are processed, and the audio file is encoded into a format the model understands using the Wav2Vec2 encoder.
Base Generation (`WanSoundImageToVideo`): The core node takes your image, audio, and prompts to generate the first latent video sequence.
Extend the Video (`Video S2V Extend` subgraphs): This is where the length comes from. The latent output from the previous step is fed into a sampler (KSampler) alongside the audio context again to generate the next chunk of frames. These chunks are concatenated together.
Decode & Compile: The final latent representation is decoded into images by the VAE, and the `CreateVideo` node stitches all the frames together with the original audio to produce your final MP4 file.
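To make the chunking logic concrete, here is a tiny, purely illustrative Python sketch of the flow described above. The function names are placeholders of my own, not the actual node implementations; only the 77-frame chunk size and the two-extender default come from the workflow notes.

```python
# Toy sketch of the chunked generation scheme (illustrative only, not ComfyUI node code).
CHUNK_FRAMES = 77       # fixed chunk length the S2V model works with
NUM_EXTENDERS = 2       # the template ships with two "Video S2V Extend" subgraphs

def base_generate(ref_image, audio_ctx, prompt):
    # stands in for WanSoundImageToVideo: produces the first latent chunk
    return [f"latent_{i}" for i in range(CHUNK_FRAMES)]

def extend(prev_chunk, audio_ctx, chunk_index):
    # stands in for one Video S2V Extend subgraph: samples the next chunk of frames
    offset = chunk_index * CHUNK_FRAMES
    return [f"latent_{offset + i}" for i in range(CHUNK_FRAMES)]

chunks = [base_generate("ref_image.png", "audio_embedding", "a man plays guitar")]
for i in range(1, NUM_EXTENDERS + 1):
    chunks.append(extend(chunks[-1], "audio_embedding", i))

latent_video = [frame for chunk in chunks for frame in chunk]   # chunks concatenated in order
print(len(latent_video))  # 77 * (1 + NUM_EXTENDERS) latent frames, then VAE decode + CreateVideo
```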
⚙️ Instructions & Usage
Prerequisite: Download Models
You must download the following model files and place them in your ComfyUI models directory. The workflow includes handy markdown notes with direct download links.
Essential Models:
`umt5-xxl-encoder-q4_k_m.gguf` → Place in `/models/clip/`
`Wan2.2-S2V-14B-Q5_0.gguf` → Place in `/models/unet/` (or `/models/diffusion/`)
`wav2vec2_large_english_fp16.safetensors` → Place in `/models/audio_encoders/`
`wan_2.1_vae.safetensors` → Place in `/models/vae/`
For the 4-Step Lightning Pipeline:
`Wan2.2-Lightning_T2V-v1.1-A14B-4steps-lora_HIGH_fp16.safetensors` → Place in `/models/loras/`
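If you want to double-check that everything landed in the right folder, a minimal check script like the one below can help. It is my own sketch, not part of the workflow; it assumes a standard ComfyUI folder layout, and you should point `COMFY_ROOT` at your install (and change the UNet entry if you used `/models/diffusion/` instead).

```python
# Optional sanity check: are the files listed above where ComfyUI expects them?
from pathlib import Path

COMFY_ROOT = Path("ComfyUI")  # assumption: adjust to your ComfyUI install path

REQUIRED = {
    "models/clip":           ["umt5-xxl-encoder-q4_k_m.gguf"],
    "models/unet":           ["Wan2.2-S2V-14B-Q5_0.gguf"],  # or models/diffusion, per the note above
    "models/audio_encoders": ["wav2vec2_large_english_fp16.safetensors"],
    "models/vae":            ["wan_2.1_vae.safetensors"],
    "models/loras":          ["Wan2.2-Lightning_T2V-v1.1-A14B-4steps-lora_HIGH_fp16.safetensors"],
}

for folder, files in REQUIRED.items():
    for name in files:
        path = COMFY_ROOT / folder / name
        print(("OK      " if path.exists() else "MISSING ") + str(path))
```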
Loading the Workflow
Download the provided `video_wan2_2_14B_s2v.json` file.
In ComfyUI, drag and drop the JSON file into the window or use the Load button.
Running the Workflow
Upload Your Media:
In the "LoadImage" node, upload your starting reference image.
In the "LoadAudio" node, upload your music or audio file.
Enter Your Prompt:
Modify the text in the "CLIP Text Encode (Positive Prompt)" node.
The negative prompt is already filled with a robust, standard negative.
Choose Your Pipeline:
To use the 4-Step Lightning pipeline (Fast): Ensure the `LoraLoaderModelOnly` node is correctly pointed to your Lightning LoRA file. The `Steps` primitive node for this section is already set to 4 and `CFG` to 1.
To use the 20-Step pipeline (High Quality): The lower section of the workflow is already configured, with `Steps` set to 20 and `CFG` to 6.0. If you only want to run this pipeline, box-select the entire 4-step section and press Ctrl+B to bypass it.
Queue Prompt! Watch as your image comes to life, driven by your audio.
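As an optional alternative to pressing Queue in the UI, you can also submit the run from a script against ComfyUI's local HTTP API. This is a rough sketch under a couple of assumptions: the server is running on the default 127.0.0.1:8188 address, and `video_wan2_2_14B_s2v_api.json` is a hypothetical API-format export of this workflow (enable dev mode options and use "Save (API Format)").

```python
# Queue the workflow programmatically via ComfyUI's /prompt endpoint (sketch, not required).
import json
import urllib.request

API_WORKFLOW = "video_wan2_2_14B_s2v_api.json"   # hypothetical API-format export of the workflow
SERVER = "http://127.0.0.1:8188"                 # default ComfyUI address

with open(API_WORKFLOW, "r", encoding="utf-8") as f:
    prompt_graph = json.load(f)

req = urllib.request.Request(
    f"{SERVER}/prompt",
    data=json.dumps({"prompt": prompt_graph}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))   # response includes the prompt_id of the queued job
```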
⚠️ Important Notes & Tips
Batch Size Setting: The "Batch sizes" value (3 by default) is not a traditional batch size. It must be set to 1 + [number of Video S2V Extend subgraphs]. This workflow has 2 extenders, so the value is 3. If you add another extender, set it to 4 (a small helper sketch follows after these notes).
Chunk Length: The default is 77 frames. This is a requirement of the model and should not be changed unless you know what you're doing.
Lightning LoRA Trade-off: The 4-step LoRA is incredibly fast but may result in a slight drop in coherence and quality compared to the 20-step generation. It's the perfect tool for finding the right seed and composition quickly.
GGUF vs. Safetensors: This workflow uses GGUF for the text encoder and UNet models to save VRAM. You can replace the `LoaderGGUF` and `ClipLoaderGGUF` nodes with standard `UNETLoader` and `CLIPLoader` nodes if you have the VRAM to use the full `.safetensors` models, which may offer slightly better quality.
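The batch-size rule in the notes above boils down to simple arithmetic; this tiny helper is just an illustration of that rule, not something the workflow itself uses.

```python
# Illustration of the "Batch sizes" rule described above (not part of the workflow).
def required_batch_size(num_extend_subgraphs: int) -> int:
    """'Batch sizes' must equal 1 + the number of Video S2V Extend subgraphs."""
    return 1 + num_extend_subgraphs

print(required_batch_size(2))  # default template: two extenders -> 3
print(required_batch_size(3))  # one more extender  -> 4
```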
🎭 Example Results
Prompt: "The man is playing the guitar. He looks down at his hands playing the guitar and sings affectionately and gently."
Audio: A gentle acoustic guitar track.
(You would embed a short video example generated by this workflow here)
📁 Download & Links
Download this Workflow JSON: [Link to your uploaded JSON file]
Official Wan 2.2 Model Repo: HuggingFace - Comfy-Org/Wan_2.2_ComfyUI_Repackaged
Required GGUF Models: Search for `Wan2.2-S2V-14B-Q5_0.gguf` and `umt5-xxl-encoder-q4_k_m.gguf` on Hugging Face.
💎 Conclusion
This workflow demystifies the process of running the formidable Wan 2.2 S2V model. By integrating GGUF support and a dual-pipeline approach, it empowers users with limited hardware to experiment and create stunning, audio-synchronized animations. Whether you're quickly iterating with the Lightning LoRA or crafting a masterpiece with the full 20-step process, this suite has you covered.
Happy generating! Feel free to leave a comment with your amazing creations or any questions.
Description
Output quality and stability have been improved.
Additional features have been added for more convenient use.
Comments (8)
What a load of NONSENSE about VRAM amounts and model size. Only LLM models should reside in VRAM. Other models, with many seconds per iteration, can stream from system RAM as they are used each iteration (AI models are LINEAR data structures, not RANDOM data structures, and are hence insanely perfect for streaming).
Comfy can be forced to launch with streaming from RAM only, but you will, of course, need enough RAM for the model. 64GB is a minimum, more is better.
With S2V, you have excellent feedback to show you if your render is compromised: lip movement. Too much low CFG and speedup use can cause minimal lip movement. Good use of speedup LoRAs can still give nice, expressive lip movement. But general motion in the scene that follows your prompt will almost certainly need some slower iterations from the sampler with a high CFG at the beginning. If you only really care about lip movement, it is easier to get a fast, quality workflow.
It works quickly and well. I uploaded a model that speaks Russian, thanks. It would be great if you could add localized models to the description.
Much appreciated 🙏 - I have adapted this workflow into 2026🎇 with TTS Audio Suite + Chatterbox integration nodes for voice cloning🗣🎤👥 (TTS Engine, TTS Text, Character Voices) + RESLYF (ClownsharKsamplers offer superior quality over the core KSamplers, if set up correctly 🤡👌), plus a match expression node for automatic video length (tricky when running batches, but it works, plus perfect sync; you just have to test different lengths of TTS text to keep 12-13 sec for 3 batches). Haven't tested Character Voices yet. Still trying to achieve consistency (mainly with movement, while lip sync remains 100%) over 3 batches (~12 sec), with Q4 gguf models (I have optimized the workflow for 16GB + 12GB VRAM, with some of those Clear VRAM nodes bypassed as they can cause "hiccups").
Still testing, but I'll share the updated workflow when it's fully ready ✅
I'd love a version of this made for 32Gb Cards too :) It sounds amazing!
@skpManiac I'm still testing the workflow from @zardozai, but that's mainly because I added my own SVI Pro V2 workflow on top of it, to combine Wan 2.2-S2V with SVI V2 Pro generations. I have so far only posted 1 video (50+ sec, NSFW🔞) which is based on this "hybrid" workflow. The workflow has gotten quite large, so I need to simplify it if I decide to release it (as it looks now, it's mainly meant for people who don't mind some extra "meat on the bone", because it quite frankly overcomplicates things when combining TTS Text + S2V + SVI, lol - this will in some cases also mean dealing with inpainting and FTLF > First To Last Frame, which might come in handy when extending an S2V video to SVI batches or vice versa).
Hi there,
I am hoping you can help.
I would much prefer to run this with the higher-quality Safetensors, but I do not know enough yet to change the nodes. Would someone be so kind as to post a version with the changes made so it uses my RTX 5090?
Many thanks
Steve
GGUF Q8_0 offers performance comparable to fp16 and is significantly better than fp8.
@zardozai OK, thanks for your reply, mate. I keep getting blurry hands and eyes, so I was hoping to fix that. I've maxed everything I can work out how to do, but it's still not great.