V2
V2 is more consistent, has more stable movements, and should produce fewer artifacts. It seems to work very well for 2d inputs as well. All previews were prompted with one prompt for both t2i and i2v; writing separate prompts and picking a good starting image should give even better results.
Use the "turbo" lora for high-quality generations in just 4 steps!
The turbo lora is available on huggingface: https://huggingface.co/Kijai/WanVideo_comfy/blob/main/LoRAs/Wan22-Turbo/Wan22_TI2V_5B_Turbo_lora_rank_64_fp16.safetensors
To use it, set steps to 4 and cfg to 1. I'm not sure what the recommended sampler/scheduler is, but I've had great results with multiple samplers and schedulers. I personally use euler/euler a with the beta scheduler.
Using a slightly lower resolution (but not low enough to reduce quality much), I can generate 80 frames in just 2 minutes on a 3060.
This lora is recommended for i2v, but t2v might work decently as well.
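For quick reference, here are the turbo settings collected in one place. This is only an illustrative sketch, not a workflow; the dict keys are just labels, and the resolution shown is one of the valid 720p-class options described under Prompting below.

```python
# Illustrative summary of the turbo-LoRA settings described above.
# Not a ComfyUI workflow; the dict keys are just labels for the knobs.
TURBO_SETTINGS = {
    "lora": "Wan22_TI2V_5B_Turbo_lora_rank_64_fp16.safetensors",
    "steps": 4,                  # the turbo lora targets 4-step sampling
    "cfg": 1.0,                  # cfg 1 as noted above
    "sampler": "euler",          # euler / euler a both gave good results
    "scheduler": "beta",
    "resolution": (1280, 704),   # a 720p-class resolution (see Prompting below)
}
```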
Trained on my new mixed furry/human dataset with detailed captions; older versions of this dataset were also used for the experimental and semi-stable text-to-video loras.
Prompting
Prompting should use natural language. You need to generate at 720p, so for example 1280x704, 704x1280 or 960x960 are valid resolutions. This might be more important for i2v than for t2v; I've noticed artifacts with i2v.
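If you want to derive other sizes, the sketch below snaps an aspect ratio to a 720p-class resolution. The multiple-of-32 and pixel-budget constraints are assumptions inferred from the example resolutions above, not documented requirements.

```python
# Hedged helper: snap an aspect ratio to a ~720p resolution like the examples
# above (1280x704, 704x1280, 960x960). The multiple-of-32 and pixel-budget
# constraints are assumptions inferred from those examples.
def pick_720p_resolution(aspect_ratio: float, step: int = 32,
                         target_pixels: int = 1280 * 704) -> tuple[int, int]:
    best = None
    for w in range(step, 2048 + step, step):
        for h in range(step, 2048 + step, step):
            pixel_err = abs(w * h - target_pixels) / target_pixels
            if pixel_err > 0.05:           # stay close to the 720p pixel budget
                continue
            ar_err = abs(w / h - aspect_ratio) / aspect_ratio
            score = ar_err + pixel_err     # prefer sizes close in both respects
            if best is None or score < best[0]:
                best = (score, w, h)
    return best[1], best[2]

print(pick_720p_resolution(16 / 9))  # -> (1280, 704)
print(pick_720p_resolution(1.0))     # -> (960, 960)
```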
In a prompt, you can describe "a 3d animation", "a 2d animation" or "a real video"; this is most useful for t2v but could help i2v as well.
You can also check the prompts on the example videos for reference.
Comments
May I ask what training settings you used? The 5b has trained extremely fast for me but I am getting pretty glitchy epochs using a known good dataset. Also what sampler settings? I noticed the settings that worked for wan 2.1 do not work at all with wan 5b.
Check your resolutions; wan 2.2 5b does not support 480p. I trained with 700 res, and ~900 might be even better. 5b uses the new vae, which compresses 4x as much, so training is much easier for the same resolution, but you need to use 720p-like resolutions.
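To put rough numbers on the "compresses 4x as much" point, here is a quick sketch; the specific compression factors (~8x spatial for the older Wan vae, ~16x for the new 5b vae) are assumptions, not something stated in this thread.

```python
# Rough latent-size arithmetic behind the "compresses 4x as much" comment above.
# Assumes ~8x spatial compression for the older Wan VAE and ~16x for the new
# 5B VAE, i.e. roughly 4x fewer latent pixels per frame at the same resolution.
w, h = 1280, 704
old_latent = (w // 8) * (h // 8)      # 160 * 88 = 14080 latent pixels per frame
new_latent = (w // 16) * (h // 16)    # 80 * 44  = 3520 latent pixels per frame
print(old_latent / new_latent)        # -> 4.0
```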
mylo1337 Thanks. Also, what are your training times? I trained at 512x512 and it takes maybe 2 hours tops on dual 3090s.
basedbase An epoch at batch size 1 took around 18 minutes, at batch size 2 they took about the same amount of time, maybe less. I trained it for a total of a little under 24 hours on a single 4090 on runpod
mylo1337 As a side note, if I could physically fit more 3090s into my workstation I would, but 2 is the max even in a super tower case lol. It would be great to have 4x 3090s, since that would allow really fast training to test tons of settings. I'll be working on a wan 2.2 27b to wan 2.2 5b distill lora and see if the quality increase is as dramatic as the wan 14b to 1.3b lora I made.
@basedbase is it some kind of black magic?
I can't seem to get any good results on i2v; everything keeps turning out anatomically morphed.
Are you sampling at the correct res? I'm using swarmui with the lora and 960x960-based resolutions (so 720p). Using a lower resolution causes exactly the issue you described, so that might be the problem.
and make sure your comfy is up to date
mylo1337 Yea I tried 960x960 and 1208x720 and everything either has no motion or is anatomically spazzing out. Samplers I've tried: lcm + simple for 30-60 steps, no dice; dpmpp + sgm_uniform for less than 30 steps and more than 30 steps, same thing. I even tried all known combos that worked great with wan 2.1 and they are somehow even worse. No idea what is wrong. Any chance you have a workflow so I can compare to figure this out?
basedbase I've tried euler/euler a+beta and unipc+simple and they both worked for me.
I didn't use a workflow, but I used swarmui (so it used an auto-generated comfy workflow). You can link swarmui to an existing comfy install and use it in swarm if you want to compare with it.
mylo1337 I'm curious, do you have a very large dataset? My dataset is 81 videos long and it still trains incredibly fast.
basedbase about 250 videos iirc
To get a good Wan 5b generation, it took writing a script to merge the original sharded model into a single fp32 safetensors model and inferencing with that, so either my setup is cursed or wan 5b loses way too much quality when quantized. Even the fp16 was terrible. Even in full fp32 precision, VRAM usage is only 21.6gb.
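For anyone wondering what such a merge looks like, here is a minimal sketch using the safetensors library. It is not the commenter's actual script, and the file names are placeholders.

```python
# Minimal sketch: merge sharded safetensors weights into a single fp32 file.
# Not the commenter's actual script; directory and file names are placeholders.
from pathlib import Path

import torch
from safetensors.torch import load_file, save_file

shard_dir = Path("wan2.2_ti2v_5B_shards")      # folder containing the sharded weights
out_path = "wan2.2_ti2v_5B_fp32.safetensors"

merged = {}
for shard in sorted(shard_dir.glob("*.safetensors")):
    # Each shard holds a disjoint slice of the state dict; upcast everything to fp32.
    for name, tensor in load_file(str(shard)).items():
        merged[name] = tensor.to(torch.float32)

save_file(merged, out_path)
print(f"Wrote {len(merged)} tensors to {out_path}")
```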
I usually run 8bit; it's not a huge difference from 16bit in quality. q4 does lose a bit of quality though. Again, when I had issues it was the resolution, not the model weights.
If you use swarmui with default settings it'll work. I use euler a + beta. And sigma shift 8, pretty much default otherwise. It's all in the prompting and resolution. If you use a low resolution the outputs glitch out and get artifacts.
Also, have you even gotten wan 5b working without the lora first of all? I think your issue is either your comfy being outdated, using the wrong models or something similar. For reference, I'm using the official comfy merge https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/blob/main/split_files/diffusion_models/wan2.2_ti2v_5B_fp16.safetensors.
I've also uploaded more previews, with better cowgirl example gens now. Wan 2.2 models are trained to allow prompting for motion, but this motion will usually be very aggressive. Use words like "she moves her hips in rocking motions" instead of "she moves her hips up and down", as the latter usually gets interpreted as her flying up and back down.
mylo1337 I've both updated comfy and used the official fp16 and 768x1024 resolution, and most outputs are glitchy or don't have much movement. It does the same on my loras I have trained, so I am perplexed, but a lot of people on reddit are also having similar issues.
basedbase 768x1024 is why, 5B does not work outside of 1280*704 or 704*1280
Ada321 in my testing 768x1024 seems to perform better or slightly worse than 704x1024
basedbase have you tried the default workflow with the lora loader added? Multiple people including myself got it working without issues with default settings. I did have issues like yours at one point (i2v getting artifacts) but it was caused by the lower resolution I used at the time.
Make sure you're not using any custom nodes that could affect the internal resolution. And make sure the model itself works on your PC before you blame my lora.
basedbase Do you have the script for the fp32?
Overall a fine experience with this model using the basic native Wan 2.2 5B workflow provided by comfy. I used natural language in the prompting and 8/10 times I got very good results!
Can anybody share a workflow for 5b wan 2.2 nsfw? I have only 6gb vram, help me
Hey, any technical details or a tutorial on how you trained this?
I use diffusion-pipe for most of my training. I took the config I used to train wan 2.1 14b for an earlier model, changed it to use one gpu and a batch size of 2, and ran it on a runpod pod.
For captioning, I captioned everything manually in captiontool (https://civitai.com/articles/16284/captioning-with-captiontool). I made sure the captions were very descriptive, clipping only the part related to the prompt, avoiding any cuts in the original video, and limiting each captioned clip to a little over 10 seconds.
In diffusion-pipe, I used the "multiple_overlapping" video clip mode, which ends up training multiple times per epoch on different sections of the video. So a 10-second clip split into 80-frame snippets at 24 fps ends up working as about 3 clips; this trains in some variety and lets the model start as if the input frame was already in motion.
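To make the snippet count above concrete, here is a rough calculation; diffusion-pipe's actual overlapping logic may count snippets slightly differently.

```python
# Rough arithmetic for the example above: a 10-second clip at 24 fps split
# into 80-frame snippets. diffusion-pipe's real overlap logic may differ.
clip_seconds = 10
fps = 24
frames_per_snippet = 80

total_frames = clip_seconds * fps                 # 240 frames in the source clip
snippets = total_frames // frames_per_snippet     # about 3 snippets per epoch
print(f"{total_frames} frames -> about {snippets} training snippets")
```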
@mylo1337 How many hours did it train for? Which gpu and how much vram?
@ifuta v1 was <24 hours on a 4090. V2 is almost 48 hours on a 4090. Similar cost for both though, since when I trained v1 on runpod there were no working pods on community cloud, so it basically had double the hourly cost that v2 had.
Hey master, any plan for the 14B lora? <3
At some point, yeah. But with my PC it'll take ages for test gens due to the high memory usage (and the fact it's 14b x2 on my 3060). I also want to make sure I know what the optimal way of training is; and if people start running a single model instead of both, I should train for that one.