CivArchive
    Wan 2.2 14B S2V Ultimate Suite: GGUF & Lightning Speed with Extended Video Generation - v2.0

    🎬 Introduction

    Welcome to a powerhouse ComfyUI workflow designed to unlock the incredible potential of the Wan 2.2 14B Sound-to-Video (S2V) model. This isn't just a simple implementation; it's a comprehensive suite that addresses two critical needs for AI video generation: accessibility and speed.

    This all-in-one workflow provides two parallel generation pipelines:

    1. ⚡ Lightning Fast (4-Step) Pipeline: Utilizes a specialized LoRA to generate videos in a fraction of the time, perfect for rapid prototyping and iteration.

    2. 🎨 High Fidelity (20-Step) Pipeline: The classic, high-quality generation process for when you demand the utmost visual fidelity from your outputs.

    Crucially, both versions are configured to run using GGUF-quantized models, dramatically reducing VRAM requirements and making this massive 14B parameter model accessible to users with consumer-grade hardware.


    ✨ Key Features & Highlights

    • Dual Mode Operation: Choose between speed and quality with two self-contained workflows in one JSON file. Easily enable/disable either section.

    • GGUF Quantization Support: Run the massive Wan 2.2 model without needing a professional GPU. Leverages LoaderGGUF and ClipLoaderGGUF nodes.

    • Extended Video Generation: The workflow includes built-in "Video S2V Extend" subgraphs, each of which adds 77 frames. The template is pre-configured with two extenders, resulting in a ~14-second video (231 frames at 16 FPS). Want a longer video? Simply copy and paste more extender nodes!

    • Audio-Driven Animation: Faithfully implements the S2V model's core function: animating a reference image in sync with an uploaded audio file (e.g., music, speech).

    • Smart First-Frame Fix: Includes a clever hack to correct the first frame, which is often "overbaked" by the VAE decoder.

    • Detailed Documentation: The workflow itself is filled with informative notes and markdown nodes explaining crucial settings like batch size and chunk length.
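    The extender math above can be sketched in a few lines. This is a quick sanity check, not part of the workflow: it assumes (per the feature list) that the base generation and each "Video S2V Extend" subgraph contribute one 77-frame chunk, played back at 16 FPS.

```python
# Sanity check for the extender math described above: the base generation
# plus each "Video S2V Extend" subgraph contributes one 77-frame chunk,
# and the output plays at 16 FPS.

CHUNK_FRAMES = 77
FPS = 16

def video_length(num_extenders: int) -> tuple[int, float]:
    """Total frames and duration (seconds) for the base chunk plus extenders."""
    total_frames = CHUNK_FRAMES * (1 + num_extenders)
    return total_frames, total_frames / FPS

frames, seconds = video_length(2)   # the template default: two extenders
print(frames, seconds)              # 231 frames, ~14.4 s
```

    Adding a third extender would give 308 frames, roughly 19 seconds.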


    🧩 How It Works (The Magic Behind the Scenes)

    The workflow is logically grouped into clear steps:

    1. Load Models (GGUF): The LoaderGGUF and ClipLoaderGGUF nodes load the quantized UMT5 text encoder and the main UNet model, drastically reducing VRAM load compared to full precision models.

    2. Upload Inputs: You provide two key ingredients:

      • ref_image: The starting image you want to animate (e.g., a character portrait).

      • audio: The sound file that will drive the motion and pacing of the animation.

    3. Encode Prompts & Audio: Your positive and negative prompts are processed, and the audio file is encoded into a format the model understands using the Wav2Vec2 encoder.

    4. Base Generation (WanSoundImageToVideo): The core node takes your image, audio, and prompts to generate the first latent video sequence.

    5. Extend the Video (Video S2V Extend Subgraphs): This is where the length comes from. The latent output from the previous step is fed into a sampler (KSampler) alongside the audio context again to generate the next chunk of frames. These chunks are concatenated together.

    6. Decode & Compile: The final latent representation is decoded into images by the VAE, and the CreateVideo node stitches all the frames together with the original audio to produce your final MP4 file.
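    The chunked extension in step 5 can be summarized as a simple loop. This is a hypothetical sketch only: `sample_chunk()` stands in for the KSampler call and is not a real ComfyUI API; the point is how each pass consumes the next window of audio features and the chunks are concatenated.

```python
# Hypothetical sketch of step 5's chunked extension: each pass samples the
# next 77-frame chunk conditioned on the matching window of audio features,
# then the chunks are concatenated into one sequence.

CHUNK_FRAMES = 77

def sample_chunk(prev_latent, audio_window):
    # Placeholder: a real implementation would run the sampler here,
    # conditioned on the previous chunk's latent and this audio window.
    return [f"frame_{i}" for i in range(CHUNK_FRAMES)]

def generate(audio_features, num_extenders):
    chunks = []
    prev = None
    for i in range(1 + num_extenders):              # base chunk + extenders
        start = i * CHUNK_FRAMES
        window = audio_features[start:start + CHUNK_FRAMES]
        prev = sample_chunk(prev, window)
        chunks.extend(prev)                         # concatenate the chunks
    return chunks

video = generate(list(range(3 * CHUNK_FRAMES)), num_extenders=2)
print(len(video))  # 231
```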


    ⚙️ Instructions & Usage

    Prerequisite: Download Models

    You must download the following model files and place them in your ComfyUI models directory. The workflow includes handy markdown notes with direct download links.

    Essential Models:

    • umt5-xxl-encoder-q4_k_m.gguf → Place in /models/clip/

    • Wan2.2-S2V-14B-Q5_0.gguf → Place in /models/unet/ (or /models/diffusion_models/)

    • wav2vec2_large_english_fp16.safetensors → Place in /models/audio_encoders/

    • wan_2.1_vae.safetensors → Place in /models/vae/

    For the 4-Step Lightning Pipeline:

    • Wan2.2-Lightning_T2V-v1.1-A14B-4steps-lora_HIGH_fp16.safetensors → Place in /models/loras/
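    If you want to double-check the downloads before loading the workflow, a small pre-flight script can do it. This is an optional helper, not part of the workflow; `COMFY_ROOT` is an assumption and should point at your own ComfyUI install.

```python
# Optional pre-flight check: verify the files listed above are where
# ComfyUI expects them. COMFY_ROOT is an assumed default install path.
from pathlib import Path

COMFY_ROOT = Path("~/ComfyUI").expanduser()

REQUIRED = [
    "models/clip/umt5-xxl-encoder-q4_k_m.gguf",
    "models/unet/Wan2.2-S2V-14B-Q5_0.gguf",
    "models/audio_encoders/wav2vec2_large_english_fp16.safetensors",
    "models/vae/wan_2.1_vae.safetensors",
    # Only needed for the 4-step Lightning pipeline:
    "models/loras/Wan2.2-Lightning_T2V-v1.1-A14B-4steps-lora_HIGH_fp16.safetensors",
]

missing = [p for p in REQUIRED if not (COMFY_ROOT / p).is_file()]
for p in missing:
    print("missing:", p)
print("all models present" if not missing else f"{len(missing)} file(s) missing")
```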

    Loading the Workflow

    1. Download the provided video_wan2_2_14B_s2v.json file.

    2. In ComfyUI, drag and drop the JSON file into the window or use the Load button.

    Running the Workflow

    1. Upload Your Media:

      • In the "LoadImage" node, upload your starting reference image.

      • In the "LoadAudio" node, upload your music or audio file.

    2. Enter Your Prompt:

      • Modify the text in the "CLIP Text Encode (Positive Prompt)" node.

      • The negative prompt is already filled with a robust, standard negative.

    3. Choose Your Pipeline:

      • To use the 4-Step Lightning pipeline (Fast): Ensure the LoraLoaderModelOnly node is correctly pointed to your Lightning LoRA file. The Steps primitive node for this section is already set to 4 and CFG to 1.

      • To use the 20-Step pipeline (High Quality): The lower section of the workflow is already configured, with Steps set to 20 and CFG to 6.0. If you only want to run this pipeline, box-select the entire 4-step section and press Ctrl+B to bypass it.

    4. Queue Prompt! Watch as your image comes to life, driven by your audio.
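    If you prefer to queue the job headlessly rather than clicking Queue Prompt, ComfyUI exposes an HTTP API. This is a minimal sketch under two assumptions: ComfyUI is running on localhost:8188, and you exported the workflow with "Save (API Format)", since the /prompt endpoint expects API-format JSON rather than the regular UI JSON.

```python
# Headless alternative to step 4: queue the graph via ComfyUI's HTTP API.
# Assumes a local server on port 8188 and an API-format workflow export.
import json
import urllib.request

def queue_workflow(path, server="http://127.0.0.1:8188"):
    with open(path, "r", encoding="utf-8") as f:
        workflow = json.load(f)
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        f"{server}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # response includes the queued prompt_id

# Example (uses a hypothetical API-format export of this workflow):
# queue_workflow("video_wan2_2_14B_s2v_api.json")
```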


    ⚠️ Important Notes & Tips

    • Batch Size Setting: The "Batch sizes" value (3 by default) is not a traditional batch size. It must be set to 1 + [number of Video S2V Extend subgraphs]. This workflow has 2 extenders, so the value is 3. If you add another extender, set it to 4.

    • Chunk Length: The default is 77 frames. This is a requirement of the model and should not be changed unless you know what you're doing.

    • Lightning LoRA Trade-off: The 4-step LoRA is incredibly fast but may result in a slight drop in coherence and quality compared to the 20-step generation. It's the perfect tool for finding the right seed and composition quickly.

    • GGUF vs. Safetensors: This workflow uses GGUF for the text and UNet models to save VRAM. You can replace the LoaderGGUF and ClipLoaderGGUF nodes with standard UNETLoader and CLIPLoader nodes if you have the VRAM to use the full .safetensors models, which may offer slightly better quality.
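    The batch-size rule from the notes above reduces to a one-line formula: the value must equal 1 (the base generation) plus the number of "Video S2V Extend" subgraphs in the graph.

```python
# The "Batch sizes" rule from the note above, as a formula.

def required_batch_size(num_extenders):
    return 1 + num_extenders

print(required_batch_size(2))  # 3 -- the template default (two extenders)
print(required_batch_size(3))  # 4 -- after pasting in one more extender
```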


    🎭 Example Results

    Prompt: "The man is playing the guitar. He looks down at his hands playing the guitar and sings affectionately and gently."
    Audio: A gentle acoustic guitar track.

    (You would embed a short video example generated by this workflow here)



    💎 Conclusion

    This workflow demystifies the process of running the formidable Wan 2.2 S2V model. By integrating GGUF support and a dual-pipeline approach, it empowers users with limited hardware to experiment and create stunning, audio-synchronized animations. Whether you're quickly iterating with the Lightning LoRA or crafting a masterpiece with the full 20-step process, this suite has you covered.

    Happy generating! Feel free to leave a comment with your amazing creations or any questions.

    Description

    Output quality and stability have been improved in this update.

    Additional convenience features have also been added.


    Comments (8)

    blobby99 · Sep 8, 2025 · 3 reactions

    What a load of NONSENSE about VRAM amounts and model size. Only LLM models should reside in VRAM. Other models, with many seconds per iteration, can stream from system RAM as they are used each iteration (AI models are LINEAR data structures, not RANDOM data structures, and are hence insanely perfect for streaming).

    Comfy can be forced to launch with streaming from RAM only, but you will, of course, need enough RAM for the model. 64GB is a minimum, more is better.

    With S2V, you have an excellent feedback signal to show whether your render is compromised: lip movement. Too much low-CFG and speed-up use can cause minimal lip movement, while good use of speed-up LoRAs can still give nice, expressive lip movement. But general motion in the scene that follows your prompt will almost certainly need some slower iterations from the sampler with a high CFG at the beginning. If you only really care about lip movement, it is easier to get a fast, quality workflow.

    psy_krivedko468 · Oct 10, 2025 · 2 reactions

    It works quickly and well. I uploaded a model that speaks Russian, thanks. It would be great if you could add localized models to the description.

    yefy_ai · Jan 30, 2026 · 2 reactions

    Much appreciated 🙏 - I have adapted this workflow into 2026🎇 with TTS Audio Suite + Chatterbox integration nodes for voice cloning🗣🎤👥 (TTS Engine, TTS Text, Character Voices) + RESLYF (ClownsharKsamplers offer superior quality over the core KSamplers, if set up correctly 🤡👌), plus a match expression node for automatic video length (tricky when running batches, but it works, with perfect sync - you just have to test different lengths of TTS text to keep to 12-13 sec for 3 batches). Haven't tested Character Voices yet. Still trying to achieve consistency (mainly with movement, while lip sync remains 100%) over 3 batches (~12 sec), with Q4 GGUF models (I have optimized the workflow for 16GB + 12GB VRAM, with some of the Clear VRAM nodes bypassed, as they can cause "hiccups").

    Still testing, but I'll share the updated workflow when it's fully ready ✅

    skpManiac · Feb 4, 2026

    I'd love a version of this made for 32GB cards too :) It sounds amazing!

    yefy_ai · Feb 26, 2026 · 1 reaction

    @skpManiac I'm still testing the workflow from @zardozai, but that's mainly because I added my own SVI Pro V2 workflow on top of it, to combine Wan 2.2-S2V with SVI V2 Pro generations. I have so far only posted 1 video (50+ sec, NSFW🔞) based on this "hybrid" workflow. The workflow has gotten quite large, so I need to simplify it if I decide to release it (as it looks now, it's mainly meant for people who don't mind some extra "meat on the bone", because it quite frankly overcomplicates things when combining TTS Text + S2V + SVI, lol - this will in some cases also mean dealing with inpainting and FTLF > First To Last Frame, which might come in handy when extending an S2V video to SVI batches or vice versa).

    skpManiac · Feb 4, 2026 · 1 reaction

    Hi there,

    I am hoping you can help.
    I would much prefer to run this with the higher-quality safetensors, but I do not know enough yet to change the nodes. Would someone be so kind as to post a version with the changes made so it uses my RTX 5090?
    Many thanks
    Steve

    zardozai (Author) · Feb 5, 2026 · 1 reaction

    GGUF Q8_0 offers performance comparable to fp16 and is significantly better than fp8.

    skpManiac · Feb 6, 2026

    @zardozai ok, thanks for your reply mate. I keep getting blurry hands & eyes, so I was hoping to fix that. I've maxed everything I can work out how to do, but it's still not great.

    Workflows
    Wan Video 2.2 I2V-A14B

    Details

    Downloads: 1,417
    Platform: CivitAI
    Platform Status: Available
    Created: 9/8/2025
    Updated: 5/13/2026
    Deleted: -

    Files

    wan2214BS2VUltimateSuiteGGUF_v20.zip
