Wan Video
Note: There are other Wan Video files hosted on Civitai - these may be duplicates, but this model card is primarily to host the files used by Wan Video in the Civitai Generator.
These files are the ComfyUI Repack - the original files can be found in Diffusers/multi-part safetensors format here.
Wan2.2 is a major upgrade to our visual generative models and is now open-sourced, offering more powerful capabilities, better performance, and superior visual quality. With Wan2.2, we have focused on incorporating the following technical innovations:
👍 MoE Architecture: Wan2.2 introduces a Mixture-of-Experts (MoE) architecture into video diffusion models. By splitting the denoising process across timesteps between specialized, powerful expert models, it enlarges the overall model capacity while keeping the per-step computational cost unchanged (see the sketch after this list).
💪🏻 Data Scaling: Compared to Wan2.1, Wan2.2 is trained on significantly more data, with +65.6% more images and +83.2% more videos. This expansion notably enhances the model's generalization across multiple dimensions such as motion, semantics, and aesthetics, achieving top performance among open-source and closed-source models.
🎬 Cinematic Aesthetics: Wan2.2 incorporates specially curated aesthetic data with fine-grained labels for lighting, composition, and color. This allows for more precise and controllable cinematic style generation, facilitating the creation of videos with customizable aesthetic preferences.
🚀 Efficient High-Definition Hybrid TI2V: Wan2.2 open-sources a 5B model built with our advanced Wan2.2-VAE, which achieves a compression ratio of 16×16×4. This model supports both text-to-video and image-to-video generation at 720P resolution and 24fps, and can run on consumer-grade graphics cards such as the RTX 4090. It is one of the fastest 720P@24fps models currently available, capable of serving both industrial and academic use.
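A minimal sketch of what this timestep-based expert split means in practice (the function, the `expert.step()` interface, and the fixed 50% boundary are illustrative assumptions, not the actual Wan2.2 implementation):

```python
# Illustrative only: route each denoising step to one of two experts.
# In the real model the hand-off point is defined by the noise level rather than
# a hard-coded fraction of the step count, and only one expert is active per step,
# so per-step compute stays that of a single model.
def denoise(latent, timesteps, high_noise_expert, low_noise_expert, boundary=0.5):
    for i, t in enumerate(timesteps):
        # Early (high-noise) steps -> high-noise expert; late steps -> low-noise expert.
        expert = high_noise_expert if i < len(timesteps) * boundary else low_noise_expert
        latent = expert.step(latent, t)  # expert.step() is an assumed interface
    return latent
```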
Wan2.2-T2V-A14B
The T2V-A14B model supports generating 5-second videos at both 480P and 720P resolutions. Built with a Mixture-of-Experts (MoE) architecture, it delivers outstanding video generation quality. On our new benchmark, Wan-Bench 2.0, the model surpasses leading commercial models across most key evaluation dimensions.
Wan2.2-I2V-A14B
The I2V-A14B model, designed for image-to-video generation, supports both 480P and 720P resolutions. Built with a Mixture-of-Experts (MoE) architecture, it achieves more stable video synthesis with reduced unrealistic camera movements and offers enhanced support for diverse stylized scenes.
Wan2.2-TI2V-5B
The TI2V-5B model is built with the advanced Wan2.2-VAE, which achieves a compression ratio of 16×16×4. This model supports both text-to-video and image-to-video generation at 720P resolution and 24fps, and can run on a single consumer-grade GPU such as the RTX 4090. It is one of the fastest 720P@24fps models available, meeting the needs of both industrial applications and academic research.
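To get a feel for what the 16×16×4 compression ratio means, here is a rough latent-shape calculation (the exact resolution, frame count, rounding, and temporal handling are assumptions for illustration, not taken from the Wan2.2 code):

```python
# Rough illustration of 16x16x4 compression: 16x spatial (width and height), 4x temporal.
width, height = 1280, 704          # a typical 720P-class resolution
frames = 121                       # roughly 5 seconds at 24 fps

latent_w = width // 16             # 80
latent_h = height // 16            # 44
latent_t = (frames - 1) // 4 + 1   # 31, assuming the first frame is kept and the rest downsampled 4x

print(latent_w, latent_h, latent_t)  # 80 44 31
```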
GitHub: https://github.com/Wan-Video/Wan2.2
Original HuggingFace repo: https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/tree/main/split_files/diffusion_models
Comments (30)
From early testing, T2V for the 5B is really bad, so bad that I don't even want to waste time trying to find out why. Whereas T2V 14B 2.2 at 480p might even look a bit better than 720p gens of 2.1 at the same steps. The default workflow has the high/low noise models do an equal amount of work; I suspect it's fine for one of them to do the heavy lifting to cut generation times.
Thanks for this comment! This is the kind of compare/contrast analysis we need for new models: it shows use cases where the new model shines while listing cases where previous models are still the standard at this time. Thanks again for letting us know.
Generation takes soooooooo long, I'm going back to Wan 2.1.
Image-to-video with the WAN 2.1 Lightx2v LoRA works. Some combinations of resolution and length fail with a completely blurry result, but others are fine. With 16 GB VRAM I can generate a 5-second 640x960 video in approx. 5 minutes. 4 steps, so 2 for each KSampler. dpmpp_sde_gpu as sampler and beta as scheduler. Shift set very high, at 10. The Lightx2v LoRA needs a high strength; I was successful with 3.0.
Lightx2v LoRA download: https://huggingface.co/Kijai/WanVideo_comfy/tree/main/Lightx2v
Most people use lightx2v_T2V_14B_cfg_step_distill_v2_lora_rank32_bf16.safetensors. I also tried the rank256 version with an even better result at lower resolution.
I still need more time to test and find the best configuration...
The quality is already near the samples in my VideoFlow workflow for Wan 2.1: https://civitai.com/models/1815300/videoflow?modelVersionId=2054281
As soon as I find a stable and high-quality configuration, I will also build a workflow for Wan 2.2. Text-to-image and text-to-video workflows will follow.
ai839, can you share the current workflow you are using, please?
WildCentaur https://justpaste.it/cxy1t Copy the JSON and paste in ComfyUI. You should also try Shift 8.0 and LoRA strength of 2.0 or 2.5. Could be better...
EDIT: Shift 8 and LoRA strength 2.5 is better.
ai839 thank you, I'm gonna try it
Sampler dpmpp_sde_gpu and scheduler normal instead of beta is worth a try, too! But with higher Shift (10) and/or LoRA strength (3). Just guessing...
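Pulling the settings from this thread together (a plain recap of the comments above, not an official recommendation; the dictionary keys are illustrative):

```python
# Summary of the I2V settings reported above (Wan 2.2 14B + Wan 2.1 Lightx2v LoRA).
reported_settings = {
    "sampler": "dpmpp_sde_gpu",
    "scheduler": "beta",            # "normal" with shift 10 / LoRA strength 3 also suggested as worth a try
    "total_steps": 4,               # 2 on the high-noise KSampler, 2 on the low-noise one
    "shift": 8.0,                   # shift 10 with LoRA strength 3.0 was the earlier setting
    "lightx2v_lora_strength": 2.5,
}
```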
ai839 I'm missing two nodes that Comfy can't find for me: Create Video and Save Video.
th301041 These are new Comfy Core nodes. So you have to update ComfyUI.
ai839 thank you so much! That was exactly the issue!
What do people think, regarding a preference between high noise and low noise Wan 2.2?
You must use both the high noise and the low noise model with Wan 2.2 14B. The concept is similar to the initial SDXL release: the high noise model creates the overall composition, while the low noise one acts as a refiner and adds the finer details. "High noise" means the model works during the high-noise stage of the sampling process, and the low noise model works during the low-noise stage.
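Roughly, the default workflows implement this as two sampling passes over the same latent (a sketch only; the field names mirror ComfyUI's KSamplerAdvanced node, and the 50/50 step split is just the default, not a requirement):

```python
# Sketch of the two-stage split used in the default Wan 2.2 14B workflows.
total_steps = 20

high_noise_pass = {                 # uses the *high noise* diffusion model
    "add_noise": "enable",
    "steps": total_steps,
    "start_at_step": 0,
    "end_at_step": total_steps // 2,
    "return_with_leftover_noise": "enable",   # hand the half-denoised latent to the next pass
}

low_noise_pass = {                  # uses the *low noise* diffusion model as a refiner
    "add_noise": "disable",         # continue from the leftover noise, don't re-noise
    "steps": total_steps,
    "start_at_step": total_steps // 2,
    "end_at_step": total_steps,
    "return_with_leftover_noise": "disable",
}
```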
mmdd2543 Ah, that's why everyone is saying how long this model takes. Most people would have to unload one model and load the other into VRAM.
Jellai It's faster than or just as fast as WAN 2.1 for me using Kijai's example workflow with the 4-bit GGUF models, which fit in my 12 GB of VRAM. It just unloads the previous model when the first sampler is done.
LoRAs are a little trickier. I think you need to apply your character LoRAs to the low noise model and any movement/concept LoRAs to the high noise model, although I'm still experimenting with it. If you apply your character LoRA to the high noise model, you need to increase the strength to around 2 or 3 for it to make any difference.
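In practice that just means each model chain gets its own LoRA loader with its own strength, something like this (a sketch only; the field names follow ComfyUI's LoraLoaderModelOnly node and the file names are placeholders):

```python
# Illustrative only: separate LoRA stacks for the two experts, with per-chain strengths.
high_noise_loras = [
    {"lora_name": "movement_or_concept_lora.safetensors", "strength_model": 1.0},  # placeholder name
]
low_noise_loras = [
    {"lora_name": "character_lora.safetensors", "strength_model": 1.0},            # placeholder name
]
# Per the comment above, a character LoRA applied to the high-noise chain instead
# may need strength around 2.0-3.0 before it visibly changes the output.
```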
funscripter627 That's not been my experience. With Wan 2.2, we now have two models working together instead of one, and there's a delay when the first model is unloaded and the second model is loaded into VRAM. This makes Wan 2.2 generations typically take longer than Wan 2.1, for me at least.
mmdd2543 This shouldn't take long if the model fits in your VRAM. Generations still take around 120-200 seconds for 832x480@16fps for me.
funscripter627 I thought 2.2 was trained at 24 fps. Is that not the case?
I find it hard to understand why CIVITAI doesn't provide an OFFICIAL version... not in BILD...
You can't try WAN 2.2 or KONTEXT online... There are about fifteen versions, but none with the “CREATE” button.
It literally dropped about 12 hours ago. It takes time to build interfaces, assign GPUs, build workflows into our Orchestration system. It's not a simple thing to make a video model available for tens of thousands of users. But we're working on it.
I'm getting to grips with the 14B models and running two models for MoE. They seem to work well with improved prompt adherence and motion and even some WAN 2.1 loras work! Amazing what it can do in just 4 steps with the LightX2V lora. There is a lot more cinematic motion and camera work too. Unfortunately the 5B model is pretty terrible. The I2V is appalling. The T2V is at about the same level as LTX around 12 months ago. It's quick but...just no
Every workflow out there has used the self-forcing LoRA to speed up generations. What I've found is that if you use it, on the high noise model especially, it loses A LOT of its natural capabilities. I was wondering why a lot of the keywords that were trained into the model and presented in the official video weren't really working well. It was the lightx2v LoRA. You need to increase the CFG and disable the lightx2v for best results. Probably the same for the low noise model, but because of its function, I think you can compromise on that for speed.
I just wanted to get this out there because the use of that LoRA is so widespread so early, but the compromise of using it is on a different scale than it was in 2.1. In 2.1 it just polluted its ability to generate faces. In 2.2 its overall motion and general capabilities are affected.
sumsenchi101 My first tries were pretty good with lightx2v (the latest version), but a proper Wan 2.2 version will surely be far better.
I got this error when I use the 2.2 VAE:
Error(s) in loading state_dict for WanVAE: size mismatch for encoder.conv1.weight: copying a param with shape torch.Size([160, 12, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([96, 3, 3, 3, 3]). size mismatch for encoder.conv1.bias: copying a param with shape torch.Size([160]) from checkpoint, the shape in current model is torch.Size([96]). [...followed by dozens of similar size mismatches for the remaining encoder, decoder, conv1/conv2, and head layers...]
The 2.2 VAE is for the small 5B model only. The new 2.2 14B models still use the 2.1 VAE.
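In short, the pairing looks like this (file names follow the ComfyUI repack naming and may differ in other packages; verify against your local files):

```python
# Which VAE goes with which Wan 2.2 model (names per the ComfyUI repack; illustrative).
vae_for_model = {
    "wan2.2_ti2v_5B":            "wan2.2_vae.safetensors",   # new 16x16x4 VAE, 5B only
    "wan2.2_t2v_high_noise_14B": "wan_2.1_vae.safetensors",  # 14B models keep the 2.1 VAE
    "wan2.2_t2v_low_noise_14B":  "wan_2.1_vae.safetensors",
    "wan2.2_i2v_high_noise_14B": "wan_2.1_vae.safetensors",
    "wan2.2_i2v_low_noise_14B":  "wan_2.1_vae.safetensors",
}
```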
FYI, you can just use the low noise T2V model without the high noise T2V model. This trick only works for text-to-video; it does NOT work with image-to-video.
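In terms of the two-pass sketch earlier in the thread, that amounts to one full-range pass on the low noise model only (a sketch, not a recommendation; whether the trade-off is worth it is debated below):

```python
# Sketch: single-pass T2V using only the low noise model (no high noise stage).
single_pass = {
    "model": "wan2.2_t2v_low_noise_14B",        # illustrative name
    "add_noise": "enable",
    "steps": 20,
    "start_at_step": 0,
    "end_at_step": 20,                          # run the full schedule on one model
    "return_with_leftover_noise": "disable",
}
```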
works! thx
If you want to use just the low noise model, go back to using Wan 2.1; there's a reason they have a high noise model. Well, it's up to you really. If those slow-mo, mediocre movements are all you're looking for, you can run the low noise model.
@kakkkarot Sure, but it takes 2.5 times longer to generate for just a 0.5 gain in excitement; waiting much longer ruins any excitement before I can get excited.