Wan-S2V is an AI video generation model that can transform static images and audio into high-quality videos.
WIP: working on the description, adding all needed info/tools! Use with some caution.
Note: S2V has a very high chance of producing a few "flashy", over-saturated frames at the start. That seems to be a limitation of all Wan 2.2 S2V models right now.
Requirements:
Main Model Wan2.2-S2V-14B (GGUF): ComfyUI/models/unet
Audio Encoder wav2vec2_large_english: ComfyUI/models/audio_encoders
Text Encoder Umt5-xxl: ComfyUI/models/text_encoders
Wan2.1_VAE.safetensors: ComfyUI/models/vae
Lite LoRA for 4/8-step operation (optional)
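As a quick sanity check before loading a workflow, the folder layout above can be verified with a few lines of Python. This is a minimal sketch: the folder names are taken from the requirements list, while the function name `missing_model_dirs` and the idea of passing the ComfyUI root as an argument are illustrative assumptions, not part of ComfyUI itself. It only checks that the folders exist, not that the right files are inside them.

```python
from pathlib import Path

# Subfolders named in the requirements list, relative to the ComfyUI root.
REQUIRED_DIRS = [
    "models/unet",
    "models/audio_encoders",
    "models/text_encoders",
    "models/vae",
]


def missing_model_dirs(comfy_root):
    """Return the required subfolders that do not exist under comfy_root.

    Hypothetical helper for illustration; comfy_root is the path to your
    own ComfyUI installation.
    """
    root = Path(comfy_root)
    return [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
```

Calling `missing_model_dirs("ComfyUI")` from the directory that contains your installation returns an empty list when all four folders are present.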
Usage hints:
The audio file should be about the same length, in seconds, as the video you want to generate.
Hint: Click the sample for full-screen and play it from the post with SOUND ON!
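To match audio and video length as suggested in the usage hints, it can help to work out how many frames a given audio clip needs. The sketch below uses only the Python standard library and assumes an uncompressed PCM WAV file. The frame rate is deliberately left as a parameter, because the right value depends on your workflow settings and is not stated on this page. Both function names are hypothetical.

```python
import math
import wave


def wav_duration_seconds(path):
    """Return the length of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def frames_needed(duration_s, fps):
    """Number of video frames needed to cover duration_s seconds at fps.

    fps is whatever frame rate your workflow renders at (an assumption
    you have to supply; it is not defined by this page).
    """
    return math.ceil(duration_s * fps)
```

For example, `frames_needed(wav_duration_seconds("speech.wav"), fps)` gives the frame count to aim for, where `speech.wav` stands in for your own file. Rounding up keeps the video from ending before the audio does.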
Sources:
Clip: https://huggingface.co/city96/umt5-xxl-encoder-gguf/
Model: https://huggingface.co/QuantStack/Wan2.2-S2V-14B-GGUF/
Lite LoRA: https://huggingface.co/calcuis/wan2-gguf/
YOU are responsible for your outputs, as always! If you make ToS-violating content and I become aware of it, I WILL report it.
Description
wav2vec2_large_english_fp8_e4m3fn
Comments (3)
I was a bit confused by this at first as I assumed from the description that it was a checkpoint with the TE, VAE, AE etc all bundled into one, but I assume it isn't as I don't know of a GGUF checkpoint loader node?
It seems to perform similarly to using the GGUFs from QuantStack, but with the added bonus of not needing to load the Lightning LoRA separately. The addition of the FP8 Audio Encoder is greatly appreciated, as I think the FP16 AE was causing very long generation times and pushing the VRAM to its limits...
Unfortunately, the combination of a low-quant GGUF and the Lightning LoRA reproduces the same issue as using the separate files: the lip syncing is blurry and inconsistent, and there's next to no motion in the video. I managed to eke out a standard no-GGUF, no-Lightning render yesterday, which almost toppled my GPU and took an age to generate. The lip syncing was decent and there was some natural motion that is missing here.
This is not at all a criticism of, or a problem with, your model itself, but a sobering reminder that there just doesn't seem to be any way to get extremely demanding models like S2V to work properly on lower-VRAM systems without compromising about 85% of the quality in the process :(
What's the difference between this and Image2video?
Is it possible to use this S2V model on a 3060 12GB?