🎬 Introduction
Welcome to a powerhouse ComfyUI workflow designed to unlock the incredible potential of the Wan 2.2 14B Sound-to-Video (S2V) model. This isn't just a simple implementation; it's a comprehensive suite that addresses two critical needs for AI video generation: accessibility and speed.
This all-in-one workflow provides two parallel generation pipelines:
⚡ Lightning Fast (4-Step) Pipeline: Utilizes a specialized LoRA to generate videos in a fraction of the time, perfect for rapid prototyping and iteration.
🎨 High Fidelity (20-Step) Pipeline: The classic, high-quality generation process for when you demand the utmost visual fidelity from your outputs.
Crucially, both versions are configured to run using GGUF-quantized models, dramatically reducing VRAM requirements and making this massive 14B parameter model accessible to users with consumer-grade hardware.
✨ Key Features & Highlights
Dual Mode Operation: Choose between speed and quality with two self-contained workflows in one JSON file. Easily enable/disable either section.
GGUF Quantization Support: Run the massive Wan 2.2 model without needing a professional GPU. Leverages the `LoaderGGUF` and `ClipLoaderGGUF` nodes.
Extended Video Generation: The workflow includes built-in "Video S2V Extend" subgraphs, each of which adds 77 frames. The template is pre-configured with two extenders, resulting in a ~5-second video at 16 FPS. Want a longer video? Simply copy and paste more extender nodes!
Audio-Driven Animation: Faithfully implements the S2V model's core function: animating a reference image in sync with an uploaded audio file (e.g., music, speech).
Smart First-Frame Fix: Includes a clever hack to correct the first frame, which is often "overbaked" by the VAE decoder.
Detailed Documentation: The workflow itself is filled with informative notes and markdown nodes explaining crucial settings like batch size and chunk length.
🧩 How It Works (The Magic Behind the Scenes)
The workflow is logically grouped into clear steps:
Load Models (GGUF): The `LoaderGGUF` and `ClipLoaderGGUF` nodes load the quantized UMT5 text encoder and the main UNet model, drastically reducing VRAM load compared to full-precision models.
Upload Inputs: You provide two key ingredients:
ref_image: the starting image you want to animate (e.g., a character portrait).
audio: the sound file that will drive the motion and pacing of the animation.
Encode Prompts & Audio: Your positive and negative prompts are processed, and the audio file is encoded into a format the model understands using the Wav2Vec2 encoder.
Base Generation (`WanSoundImageToVideo`): The core node takes your image, audio, and prompts to generate the first latent video sequence.
Extend the Video (`Video S2V Extend` subgraphs): This is where the length comes from. The latent output from the previous step is fed into a sampler (KSampler) alongside the audio context again to generate the next chunk of frames. These chunks are concatenated together.
Decode & Compile: The final latent representation is decoded into images by the VAE, and the `CreateVideo` node stitches all the frames together with the original audio to produce your final MP4 file.
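To make the chunking logic concrete, here is a tiny, purely illustrative Python sketch of the flow described above. The function names are placeholders of my own, not the actual node implementations; only the 77-frame chunk size and the two-extender default come from the workflow notes.

```python
# Toy sketch of the chunked generation scheme (illustrative only, not ComfyUI node code).
CHUNK_FRAMES = 77       # fixed chunk length the S2V model works with
NUM_EXTENDERS = 2       # the template ships with two "Video S2V Extend" subgraphs

def base_generate(ref_image, audio_ctx, prompt):
    # stands in for WanSoundImageToVideo: produces the first latent chunk
    return [f"latent_{i}" for i in range(CHUNK_FRAMES)]

def extend(prev_chunk, audio_ctx, chunk_index):
    # stands in for one Video S2V Extend subgraph: samples the next chunk of frames
    offset = chunk_index * CHUNK_FRAMES
    return [f"latent_{offset + i}" for i in range(CHUNK_FRAMES)]

chunks = [base_generate("ref_image.png", "audio_embedding", "a man plays guitar")]
for i in range(1, NUM_EXTENDERS + 1):
    chunks.append(extend(chunks[-1], "audio_embedding", i))

latent_video = [frame for chunk in chunks for frame in chunk]   # chunks concatenated in order
print(len(latent_video))  # 77 * (1 + NUM_EXTENDERS) latent frames, then VAE decode + CreateVideo
```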
⚙️ Instructions & Usage
Prerequisite: Download Models
You must download the following model files and place them in your ComfyUI models directory. The workflow includes handy markdown notes with direct download links.
Essential Models:
`umt5-xxl-encoder-q4_k_m.gguf` → Place in `/models/clip/`
`Wan2.2-S2V-14B-Q5_0.gguf` → Place in `/models/unet/` (or `/models/diffusion/`)
`wav2vec2_large_english_fp16.safetensors` → Place in `/models/audio_encoders/`
`wan_2.1_vae.safetensors` → Place in `/models/vae/`
For the 4-Step Lightning Pipeline:
`Wan2.2-Lightning_T2V-v1.1-A14B-4steps-lora_HIGH_fp16.safetensors` → Place in `/models/loras/`
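If you want to double-check that everything landed in the right folder, a minimal check script like the one below can help. It is my own sketch, not part of the workflow; it assumes a standard ComfyUI folder layout, and you should point `COMFY_ROOT` at your install (and change the UNet entry if you used `/models/diffusion/` instead).

```python
# Optional sanity check: are the files listed above where ComfyUI expects them?
from pathlib import Path

COMFY_ROOT = Path("ComfyUI")  # assumption: adjust to your ComfyUI install path

REQUIRED = {
    "models/clip":           ["umt5-xxl-encoder-q4_k_m.gguf"],
    "models/unet":           ["Wan2.2-S2V-14B-Q5_0.gguf"],  # or models/diffusion, per the note above
    "models/audio_encoders": ["wav2vec2_large_english_fp16.safetensors"],
    "models/vae":            ["wan_2.1_vae.safetensors"],
    "models/loras":          ["Wan2.2-Lightning_T2V-v1.1-A14B-4steps-lora_HIGH_fp16.safetensors"],
}

for folder, files in REQUIRED.items():
    for name in files:
        path = COMFY_ROOT / folder / name
        print(("OK      " if path.exists() else "MISSING ") + str(path))
```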
Loading the Workflow
Download the provided `video_wan2_2_14B_s2v.json` file.
In ComfyUI, drag and drop the JSON file into the window or use the Load button.
Running the Workflow
Upload Your Media:
In the "LoadImage" node, upload your starting reference image.
In the "LoadAudio" node, upload your music or audio file.
Enter Your Prompt:
Modify the text in the "CLIP Text Encode (Positive Prompt)" node.
The negative prompt is already filled with a robust, standard negative.
Choose Your Pipeline:
To use the 4-Step Lightning pipeline (Fast): Ensure the `LoraLoaderModelOnly` node is correctly pointed to your Lightning LoRA file. The `Steps` primitive node for this section is already set to 4 and `CFG` to 1.
To use the 20-Step pipeline (High Quality): The lower section of the workflow is already configured, with `Steps` set to 20 and `CFG` to 6.0. If you only want to run this pipeline, box-select the entire 4-step section and press Ctrl+B to bypass it.
Queue Prompt! Watch as your image comes to life, driven by your audio.
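As an optional alternative to pressing Queue in the UI, you can also submit the run from a script against ComfyUI's local HTTP API. This is a rough sketch under a couple of assumptions: the server is running on the default 127.0.0.1:8188 address, and `video_wan2_2_14B_s2v_api.json` is a hypothetical API-format export of this workflow (enable dev mode options and use "Save (API Format)").

```python
# Queue the workflow programmatically via ComfyUI's /prompt endpoint (sketch, not required).
import json
import urllib.request

API_WORKFLOW = "video_wan2_2_14B_s2v_api.json"   # hypothetical API-format export of the workflow
SERVER = "http://127.0.0.1:8188"                 # default ComfyUI address

with open(API_WORKFLOW, "r", encoding="utf-8") as f:
    prompt_graph = json.load(f)

req = urllib.request.Request(
    f"{SERVER}/prompt",
    data=json.dumps({"prompt": prompt_graph}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))   # response includes the prompt_id of the queued job
```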
⚠️ Important Notes & Tips
Batch Size Setting: The "Batch sizes" value (3 by default) is not a traditional batch size. It must be set to 1 + [number of Video S2V Extend subgraphs]. This workflow has 2 extenders, so the value is 3. If you add another extender, set it to 4 (a small helper sketch follows after these notes).
Chunk Length: The default is 77 frames. This is a requirement of the model and should not be changed unless you know what you're doing.
Lightning LoRA Trade-off: The 4-step LoRA is incredibly fast but may result in a slight drop in coherence and quality compared to the 20-step generation. It's the perfect tool for finding the right seed and composition quickly.
GGUF vs. Safetensors: This workflow uses GGUF for the text encoder and UNet models to save VRAM. You can replace the `LoaderGGUF` and `ClipLoaderGGUF` nodes with standard `UNETLoader` and `CLIPLoader` nodes if you have the VRAM to use the full `.safetensors` models, which may offer slightly better quality.
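The batch-size rule in the notes above boils down to simple arithmetic; this tiny helper is just an illustration of that rule, not something the workflow itself uses.

```python
# Illustration of the "Batch sizes" rule described above (not part of the workflow).
def required_batch_size(num_extend_subgraphs: int) -> int:
    """'Batch sizes' must equal 1 + the number of Video S2V Extend subgraphs."""
    return 1 + num_extend_subgraphs

print(required_batch_size(2))  # default template: two extenders -> 3
print(required_batch_size(3))  # one more extender  -> 4
```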
🎭 Example Results
Prompt: "The man is playing the guitar. He looks down at his hands playing the guitar and sings affectionately and gently."
Audio: A gentle acoustic guitar track.
(You would embed a short video example generated by this workflow here)
📁 Download & Links
Download this Workflow JSON: [Link to your uploaded JSON file]
Official Wan 2.2 Model Repo: HuggingFace - Comfy-Org/Wan_2.2_ComfyUI_Repackaged
Required GGUF Models: Search for `Wan2.2-S2V-14B-Q5_0.gguf` and `umt5-xxl-encoder-q4_k_m.gguf` on Hugging Face.
💎 Conclusion
This workflow demystifies the process of running the formidable Wan 2.2 S2V model. By integrating GGUF support and a dual-pipeline approach, it empowers users with limited hardware to experiment and create stunning, audio-synchronized animations. Whether you're quickly iterating with the Lightning LoRA or crafting a masterpiece with the full 20-step process, this suite has you covered.
Happy generating! Feel free to leave a comment with your amazing creations or any questions.
Description
Output quality and stability have been improved.
Additional features have been added for more convenient use.
Comments (8)
What a load of NONSENSE about VRAM amounts and model size. Only LLM models should reside in VRAM. Other models, with many seconds per iteration, can stream from system RAM as they are used each iteration (AI models are LINEAR data structures, not RANDOM data structures, and are hence insanely perfect for streaming).
Comfy can be forced to launch with streaming from RAM only, but you will, of course, need enough RAM for the model. 64GB is a minimum, more is better.
With S2V, you have excellent feedback to show you if your render is compromised: lip movement. Too much low CFG and speedup use can cause minimal lip movement. Good use of speedup LoRAs can still give nice, expressive lip movement. But general motion in the scene that follows your prompt will almost certainly need some slower iterations from the sampler with a high CFG at the beginning. If you only really care about lip movement, it is easier to get a fast, quality workflow.
It works quickly and well. I uploaded a model that speaks Russian, thanks. It would be great if you could add localized models to the description.
Much appreciated 🙏 - I have adapted this workflow into 2026🎇 with TTS Audio Suite + Chatterbox integration nodes for voice cloning🗣🎤👥 (TTS Engine, TTS Text, Character Voices) + RESLYF (ClownsharKsamplers offer superior quality over the core KSamplers, if set up correctly 🤡👌), plus a match expression node for automatic video length (tricky when running batches, but it works, plus perfect sync; you just have to test different lengths of TTS text to keep 12-13 sec for 3 batches). Haven't tested Character Voices yet. Still trying to achieve consistency (mainly with movement, while lip sync remains 100%) over 3 batches (~12 sec), with Q4 gguf models (I have optimized the workflow for 16GB + 12GB VRAM, with some of those Clear VRAM nodes bypassed as they can cause "hiccups").
Still testing, but I'll share the updated workflow when it's fully ready ✅
I'd love a version of this made for 32Gb Cards too :) It sounds amazing!
@skpManiac I'm still testing the workflow from @zardozai, but that's mainly because I added my own SVI Pro V2 workflow on top of it, to combine Wan 2.2-S2V with SVI V2 Pro generations. I have so far only posted 1 video (50+ sec, NSFW🔞) which is based on this "hybrid" workflow. The workflow has gotten quite large, so I need to simplify it if I decide to release it (as it looks now, it's mainly meant for people who don't mind some extra "meat on the bone", because it quite frankly overcomplicates things when combining TTS Text + S2V + SVI, lol - this will in some cases also mean dealing with inpainting and FTLF > First To Last Frame, which might come in handy when extending an S2V video to SVI batches or vice versa).
Hi there,
I am hoping you can help.
I would much prefer to run this with the higher-quality Safetensors, but I do not know enough yet to change the nodes. Would someone be so kind as to post a version with the changes made so it uses my RTX 5090?
Many thanks
Steve
GGUF Q8_0 offers performance comparable to fp16 and is significantly better than fp8.
@zardozai OK, thanks for your reply, mate. I keep getting blurry hands and eyes, so I was hoping to fix that. I've maxed everything I can work out how to do, but it's still not great.