Update - T2V now available
v1 is now uploaded in T2V.
Set the high noise model's weight to 0.5. Start the low noise model at 1.0, though it might need to be bumped down too. As with I2V, don't worry too much about triggers; natural language works just as well. I'm hoping to tune it a bit better in a v2.
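As a rough illustration of the starting weights above, here is a minimal Python sketch. It reads "low" as the low noise model, which is an assumption. The key names and the `bump_down` helper are hypothetical and don't correspond to any particular tool's settings; they just record the starting point and show one way to step a weight down if results look overcooked.

```python
# Hypothetical starting strengths taken from the note above.
# Key names are illustrative only, not any specific tool's settings.
START_STRENGTHS = {"high_noise": 0.5, "low_noise": 1.0}


def bump_down(strengths, key, step=0.1, floor=0.0):
    """Return a copy of `strengths` with one entry lowered by `step`.

    The value never drops below `floor`, and the input dict is left untouched.
    """
    updated = dict(strengths)
    updated[key] = round(max(floor, updated[key] - step), 2)
    return updated


# Example: lower the low noise strength one step if 1.0 looks too strong.
adjusted = bump_down(START_STRENGTHS, "low_noise")
```

The step size of 0.1 is arbitrary; adjust it to taste.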
Intro
Like other base models, WAN seems to really bias things towards happy / neutral expressions. This makes for pretty awkward renders when the action doesn't fit that mood but everyone is smiling the whole time anyway.
I made this helper to facilitate more sensible facial expressions and body language with emotionally negative actions. It works best when it matches the rest of the scene, but it definitely seems to avoid the model's weird tendency to have people pantomime negative emotions without expressing them.
Comments (8)
They sure will break down, after they get the Gorilla Press https://civitai.com/models/2102211/the-gorilla-press :P
the bondage one is the best, the others, sadly, look like the worst actress I've ever seen xD
Ha, yeah, they really aren't great. In the first couple trainings I've tried, it seems like T2V is a lot more sensitive to the initial conditions of the image lining up with something it can anchor on. I need a lot more diversity in a V2 dataset.
It's really hard to tell whether those conditions are semantic (the image needs to be in a similar context) or more mechanical (frame count, frame size, aspect ratio, etc.). I've definitely noticed that T2V is way more brittle on generation length. I2V seems a little touchy with length at a macro level (for example, training with 33 frames gets much better generations at 49 frames than at 81 frames), but on T2V even smaller differences in frame length seem to totally change whether it converges effectively.
Tilly Norwood says "Hold my pretty umbrella drink."
I'm trying to train something like this: expression-related, but it shouldn't affect other character LoRAs. How do you prepare the dataset and captions?
Honestly, I've found that captions in Wan2.2 aren't as strong of an anchor as they are in T2I models. I've had somewhat mixed success with different techniques, but with Wan2.2 I've focused more on making sure that the training content (which is very short, generally 30-60 frames of video) clearly contains a transition from something the model understands to something it doesn't, and consistently captioning the transition.
I only really caption the starting frame (i.e., describe the scene and subject) to make sure the model doesn't confuse anything in it with features to be influenced by the model weights, and then use the same words to describe the action. For T2I LoRAs I use a full language model to first get the model to generate images that are as close as possible to what I'm training, but when I tried that with Wan it didn't do much, or just watered down the training. I also use captions to revise training and remove elements that the model picked up, but that's much rarer with Wan I2V.
Overwhelmingly, with Wan 2.2 I've found that the model already has tons and tons encoded that isn't encoded in the TE, and LoRAs mostly just bring it to the surface. I dunno if they censored the text encoder (but not the training), but as a result I don't think words help much. It's much more important to make sure the training content guides in the direction you want.
Also, train the low noise model less than people recommend. I think people are way overfitting the low noise model.
@thegipper Thank you sir