CivArchive
    Emotional Breakdown - Helper for sadness, anger, etc. - Wan 2.2 - I2V Low V1.0
    NSFW

    Update - T2V now available

    v1 is now uploaded in T2V.

    Set the high noise model's weight to 0.5. Start the low noise model at 1.0, though it might need to get bumped down too. As with I2V, don't worry too much about triggers; natural language works just as well. I'm hoping to tune it a bit better in a v2.

    Intro

    Like other base models, WAN seems to really bias things towards happy / neutral expressions. This makes for pretty awkward renders when the action doesn't fit that mood but everyone's smiling the whole time.

    I made this helper to facilitate more sensible facial expressions and body language with emotionally negative action. It works best when matching the rest of the scene, but definitely seems to avoid the weird tendency of the model to have people pantomime negative emotions without expressing them.

    Comments (30)

    oli762 Oct 27, 2025
    CivitAI

    great job the only thing missing would be an ass spank :p

    K3NK Oct 27, 2025
    CivitAI

    Wan has a hard time understanding this expression... they look like they're dropping a turd 🤣🤣

    budokan2000860 Oct 27, 2025

    It is Chinese, what do you expect? Everyone there is happy and cheerful, am I right?

    Phraxas Oct 27, 2025· 2 reactions

    Maybe the turd is not coming out and that's what's upsetting them.

    fedupscribe687 Oct 29, 2025· 2 reactions

    constipated lora. Guess it's time to make a video of someone on a toilet holding onto the sink with this one.

    Remiel Oct 27, 2025· 1 reaction
    CivitAI

    Thanks for the lora. I've also noticed a lack of negative expressions in Wan and had been considering whether to take some time to gather a dataset for the lora. Can you tell us what you trained it on? Images or video? How many?

    thegipper
    Author
    Oct 27, 2025· 5 reactions

    This is trained on video clips. This version uses only about 15, but I'll probably come back later and play with it more (or diversify into different LoRAs; I still need to experiment with how effectively individual WAN LoRAs can manage the nuances of different concepts).

    The hard part about these I2V LoRAs from a data source perspective has been finding sources where the training data actually captures a clear transition event (i.e., No Concept -> Concept), and balancing the training of the Concept itself against the transitions (how much are you training for the part of the generation after the Concept triggers, versus the transition).

    Thus far I've found training to work way better when anchored on the transition events, but the other nuance is that the default training configuration for WAN only has about 33 frames, and beyond that, training times escalate a lot. Also, surprisingly I haven't really been getting better results when I train for 81 frame full clip lengths, versus only training on 33 frames. Takes a bit of time to collect datasets where you get clear, unconfused demonstrations of a concept that can be expressed in ~2s start-to-finish.
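
    For reference, the frame counts above translate to clip lengths as follows. This is only a sketch of the arithmetic; it assumes the 16 fps rate that comes up later in the thread, and `clip_seconds` is a name made up for illustration.

```python
WAN_FPS = 16  # assumed playback rate, taken from the "downsample to 16fps" step in the thread


def clip_seconds(frames: int, fps: int = WAN_FPS) -> float:
    """Return how many seconds of footage a clip of `frames` frames covers."""
    return frames / fps


# 33 frames is the short training window, 81 the full clip length mentioned above.
for frames in (33, 81):
    print(f"{frames} frames at {WAN_FPS} fps = {clip_seconds(frames):.2f} s")
```

    So a 33-frame window covers roughly 2 seconds and an 81-frame clip roughly 5, which is why the concept has to play out start-to-finish so quickly.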

    HarryPsalms Oct 28, 2025

    @thegipper Are you using any special tool to grab those 33 frame clips? I'm currently at that stage and thinking there has to be a better way than watching the vid, remembering the start time and throwing that into an ffmpeg script to extract the 33 frames, while hoping my start frame was accurate enough

    thegipper
    Author
    Oct 28, 2025· 1 reaction

    @HarryPsalms I'm sure there is a better tool out there (would love to hear recommendations), but I used HunyClip (https://github.com/Tr1dae/HunyClip) to do extractions and video clipping. It's buggy but it works. You need to set the clip length to at least 35, since it cuts off one frame, and you need 1 frame for the I2V bucket. The nice thing about the tool is that it's very detailed about exact frame boundaries compared to other visual tools.

    I use FFMPEG first anyway to re-encode the videos and to downsample to 16fps. Without that normalization other processing gets pretty inconsistent.
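
    A rough sketch of what a scripted version of that normalize-and-clip step might look like. This is not the author's actual command: the file names, the start timestamp, the choice of libx264, and dropping audio are all assumptions, and `build_clip_command` is a hypothetical helper. It only builds the ffmpeg argument list.

```python
import shlex


def build_clip_command(src: str, dst: str, start: str,
                       frames: int = 35, fps: int = 16) -> list[str]:
    """Build an ffmpeg argument list that cuts a fixed-length, fixed-rate clip."""
    return [
        "ffmpeg",
        "-ss", start,              # seek to the clip start (placed before -i, so it applies to the input)
        "-i", src,
        "-vf", f"fps={fps}",       # resample to a constant frame rate without changing speed
        "-frames:v", str(frames),  # stop after this many output video frames
        "-an",                     # drop audio (assumed unnecessary for training)
        "-c:v", "libx264",         # re-encode so every clip shares one codec (assumed choice)
        dst,
    ]


# Hypothetical file names and timestamp, for illustration only.
cmd = build_clip_command("source.mp4", "clip_0001.mp4", start="00:01:23.500")
print(shlex.join(cmd))
```

    Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it, provided ffmpeg is installed. The default of 35 frames follows the clip length suggested above.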

    HarryPsalms Oct 30, 2025

    @thegipper Thanks, I appreciate the additional context of your reply. Assuming you downsample to 16fps so that you are able to get just over 2 seconds' worth of training data rather than 1.25 or whatever 24fps would get you. If so, could you theoretically downsample further to fit more action in? Now I'm wondering if that's where some of the weird wan slow motion is coming from for some of the Loras out there - the downsampling of frames to fit more action into the lora, resulting in a slow motion effect when played back at normal speed?

    Remiel Oct 30, 2025

    @thegipper Thanks for sharing your experience. How big (resolution wise) were your videos? Did you train it on realistic videos only or were there different styles (e.g. anime) added too?

    "I'm sure there is a better tool out there (would love to hear recommendations)"

    I haven't used it personally, but I've heard other lora makers use VidTrainPrep (https://github.com/lovisdotio/VidTrainPrep). It's based on HunyClip.

    "The hard part about these I2V LoRA's from a data source perspective has been finding sources where the training data actually captures a clear transition event"

    Yeah, I've had similar challenges for every dataset I tried to gather. Only when I started gathering did I realize that the vast majority of stuff with the concept isn't up to par. Realistic footage tends to suffer from poor color grading or overall quality: too blurry, too compressed, shaky camera, flickering lighting, etc. Anime footage often suffers from low frame rates and shitty animation. 3D renders often suffer from horrible physics, clipping bugs, low polygon counts, and poor material/texture quality that makes them look like a last-century game. Static images have the highest quality, but they are static and can't be used to teach animation. And even with concepts that aren't inherently animation based, I've heard that including just images in the dataset dampens overall animation, so videos need to be included too, though it's unknown what the ideal ratio of images to videos would be.

    In the end, when I look at my dataset, I am worried that making a lora out of it, while teaching the concept, would degrade wan's base performance. And I really don't have the time to do iterative lora training where I produce a lora and then use it to produce better footage for retraining the lora with a better dataset.

    Not to mention how hard it is to find very specific things like facial transitions. How do you even search for that? Do some reverse thinking like, let's search for some high definition footage of a funeral. Or search for some award-winning tragedies. Some things you don't even know where to start to search for.

    And I am always worried that including heads in the dataset will somehow turn all my people into faces of those who were in the dataset.

    In the end I get stuck in analysis paralysis and never actually get to the lora training phase.

    thegipper
    Author
    Oct 30, 2025· 2 reactions

    @HarryPsalms Downsampling with FFMPEG works in two potential ways. If you put the -r argument before the -i arg, it alters the video's speed; if you put it after the -i arg (or you use -vf instead), it just drops frames so the playback speed stays the same. I use it to just drop frames to avoid impacting playback speed.
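
    To make the flag placement concrete, here is a small sketch that builds the three command variants described above. The function names are invented for illustration and nothing is executed; only `-r`, `-i`, and the `fps` filter are real ffmpeg options.

```python
def retime_command(src: str, dst: str, fps: int = 16) -> list[str]:
    """-r before -i: input frames are re-timestamped at `fps`, so playback speed changes."""
    return ["ffmpeg", "-r", str(fps), "-i", src, dst]


def drop_frames_command(src: str, dst: str, fps: int = 16) -> list[str]:
    """-r after -i: frames are dropped or duplicated to reach `fps`, so speed is preserved."""
    return ["ffmpeg", "-i", src, "-r", str(fps), dst]


def fps_filter_command(src: str, dst: str, fps: int = 16) -> list[str]:
    """The fps video filter: same speed-preserving resampling, expressed as a filter."""
    return ["ffmpeg", "-i", src, "-vf", f"fps={fps}", dst]


# Hypothetical file names; print the three variants side by side.
for build in (retime_command, drop_frames_command, fps_filter_command):
    print(" ".join(build("in.mp4", "out.mp4")))
```

    Only the first variant changes how fast the action plays; the other two keep real-time speed and just reduce the number of frames.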

    I think the reason so many LoRAs have weird speed impacts isn't intent (people trying to teach it more) but lack of understanding. I think that people are training with non-normalized video data and have no idea what's going on. So they stack up 20 videos with varying frame rates (between 24 and 60), and their trainer just grabs the first 33 frames of each video. So the LoRA sometimes learns to be a little slow (33% slower for 24 fps video) and sometimes a lot slower (almost 50% for 30fps), based on the cues of the video. Then when people do renders, they trigger one of those two things based on similarity to training material, resulting in inconsistent behavior.
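
    The percentages above can be checked with a few lines. This sketch assumes the scenario described: the trainer takes the first 33 frames as-is, and the result is played back at 16 fps. `slowdown_percent` is a made-up helper name.

```python
def slowdown_percent(source_fps: float, playback_fps: float = 16.0) -> float:
    """Percent by which motion slows when source frames are replayed at `playback_fps`."""
    return (1.0 - playback_fps / source_fps) * 100.0


# Frame rates in the 24 to 60 fps range mentioned above.
for fps in (24, 30, 60):
    print(f"{fps} fps source: {slowdown_percent(fps):.0f}% slower")
```

    That gives about 33% for 24 fps and about 47% for 30 fps, in line with the figures above, and about 73% for 60 fps footage.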

    thegipper
    Author
    Oct 30, 2025· 1 reaction

    @Remiel Thanks for sharing! Will be glad to have some improvements to the HunyClip setup.

    "Thanks for sharing your experience. How big (resolution wise) were your videos?"

    Thus far I've only trained at 512x512 pixel density (split across different frame sizes). I haven't been super intentional about the aspect ratio yet, but I do suspect that WAN is much less good at aspect ratio independence, so I might start being more selective about frame widths.

    This particular LoRA was also more cropped than the other couple I've trained, and it did work notably less well, but it's also focused a lot more on minute details, so it's a little chicken-and-egg. The self slap LoRA did an incredible job of picking up subtle cues, though, despite the faces being on average much smaller in frame, so I'll probably bias much more heavily towards full-framing videos in the future.
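
    One way to read "512x512 pixel density split across different frame sizes" is aspect-ratio bucketing: each clip is scaled to roughly the same pixel count while keeping its shape. The sketch below is an assumption about how that could be computed, not a description of any particular trainer; the 16-pixel step and the `bucket_size` name are illustrative.

```python
import math


def bucket_size(width: int, height: int, side: int = 512, step: int = 16) -> tuple[int, int]:
    """Scale (width, height) to roughly side*side pixels, keeping the aspect ratio.

    Each dimension is snapped to the nearest multiple of `step` (an assumed constraint).
    """
    scale = math.sqrt((side * side) / (width * height))

    def snap(value: int) -> int:
        return max(step, round(value * scale / step) * step)

    return snap(width), snap(height)


# Square, landscape, and portrait sources land in different buckets of similar area.
for w, h in ((1024, 1024), (1920, 1080), (1080, 1920)):
    print((w, h), "->", bucket_size(w, h))
```

    Because of the snapping, the pixel count is only approximately 512x512; a 16:9 source ends up slightly above it.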

    "Did you train it on realistic videos only or were there different styles (e.g. anime) added too?"

    Just realistic, with pretty diverse levels of quality.

    thegipper
    Author
    Oct 30, 2025· 1 reaction

    @Remiel 

    Just to calibrate your expectations, what I've found is that AI training is mostly a tail-wagging-the-dog situation. People don't really start with an idea, find content, and then iterate until it works; they have readily available content and then try to train on it. If it works, they ship it; if it doesn't, they pick the next content set and try that one. If I really needed to make 1 successful LoRA, it would be 10x more reliable (and faster) to try training 5 completely different LoRAs with easily available data than to iterate and try to force one.

    In terms of training content more generally I have a few guiding thoughts.

    "really don't have the time to do iterative lora training where I produce a lora and then use it to produce better footage for retraining the lora with a better dataset."

    I only train with real, human produced ground truth, under all circumstances. Other than embeddings (which are really just compressed prompts) I never train with AI generated inputs, or staged inputs, which is a lot harder now that the internet is flooded with slop.

    My rule of thumb here would basically be "Try with your initial source material, and if it almost works, try tweaking captions, or cropping, dropping some bad examples, or getting one or two more," rather than trying super hard to force it to work. Then just keep an eye out over time for the concept in the wild and accumulate more data organically as you come across it.

    "Not to mention how hard it is to find very specific things like facial transitions. How do you even search for that?"

    It's definitely hard, and frustrating when you want something really, really specific. It's much harder than it used to be, because the internet is notably worse than it was 5 years ago, and getting worse.

    I've found amateur content shared on public forums (message boards, reddit, etc) to be by far the best place to get source material. Weirdly, porn (especially amateur porn) is legitimately an incredible source of SFW inputs too. It's diverse, and it's one of the only places where you can get access to raw human expressions.

    I made a series of SD1.5 LoRAs (SD 1.5 is still the absolute best model I've found for reproducing human faces in context, although WAN is looking really promising) based on what some scientists theorize are the 6 foundational facial expressions, and the majority of the training sources were from amateur adult material (although the actual training data didn't contain ~any NSFW stuff post-cropping).

    "And I am always worried that including heads in the dataset will somehow turn all my people into faces of those who were in the dataset."

    This was extremely real in early image generation, but I've found it less problematic with newer models. With WAN I've only needed (thus far) to train with comparatively very small datasets (like 10-15 examples), so it's much easier to avoid replicating faces. I do definitely still avoid ever including more than 1 data point per source material. Even more than the subject, things like the camera quality, the image framing, the background, etc., will get rewarded disproportionately compared to the concept. That doesn't always ruin the inference, but it can waste the model's time and capacity by making it learn unimportant details, and prune what you're trying to teach it.
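
    The "one data point per source" rule is easy to enforce mechanically when assembling a dataset. A minimal sketch, assuming each clip is tracked as a hypothetical (source id, file path) pair:

```python
def one_per_source(clips: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the first clip seen for each source id, preserving input order."""
    seen: set[str] = set()
    kept: list[tuple[str, str]] = []
    for source_id, path in clips:
        if source_id not in seen:
            seen.add(source_id)
            kept.append((source_id, path))
    return kept


# Hypothetical dataset listing: two clips were cut from the same source video.
clips = [
    ("video_a", "clips/a_001.mp4"),
    ("video_b", "clips/b_001.mp4"),
    ("video_a", "clips/a_002.mp4"),
]
print(one_per_source(clips))
```

    Which clip to keep per source is a judgment call; this version simply keeps the first one listed.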

    Remiel Oct 30, 2025

    @thegipper Thanks for sharing your invaluable experience.

    "I only train with real, human produced ground truth, under all circumstances."

    It is indeed most reliable regarding the quality of physics and animation in general to use actual footage. I'm just worried that by using only realistic footage, the lora will not work well on non-realistic types of content. I've tried a bunch of different wan loras in i2v scenario with semi-realistic/anime images, and some work great, and others turn my video into a gritty ugly low saturated mess, forcing the face into a realistic but not quite monstrosity. I don't know what the makers of those loras did wrong, but it is evident that the lora learned "style" too much, and that is known to happen when there is no style variety in the initial dataset. So to be safe, in my dataset gathering I try to include style-wise varied types of content.

    Have you had problems with generating non-realistic content with your loras that are exclusively trained on footage?

    thegipper
    Author
    Oct 31, 2025

    @Remiel Oh, sorry, I don't think I said it clearly enough. I don't mean to say that I only train on real footage across the board. Just that I only train on inputs that aren't AI generated or AI altered. I think training on human generated non-realistic content works just fine if you are intentional about it. It's possible that for WAN, non-photographic content could help for capturing motion over pixels. When training object detection models with images, it's very standard to apply significant transforms to them like changing their color space, flipping them, skewing them, etc, to make sure the model doesn't get too focused on the wrong details.

    I did do some early experimentation with the SD models and found that it was rare for it to help to provide both photographic and illustrated content (at either the text encoder or UNET level), but WAN is different enough that it's worth testing.

    "I've tried a bunch of different wan loras in i2v scenario with semi-realistic/anime images, and some work great, and others turn my video into a gritty ugly low saturated mess"

    There is a confusing misperception I see communicated over and over in online discussion that the Low Noise model, which operates over most of the denoising steps, can be trained much more aggressively (one memorable comment describes that the low noise model can "take a beating" during training), and the High Noise model needs to be babied. I have had the opposite experience (with I2V at least, I haven't done as much T2V) that it's very easy to overtrain the Low Noise model, and that doing so results in much more style transfer and overfitting to subjects and pixel specifics in the videos. My guess is that overtrained Low Noise models account for what you're seeing.

    Remiel Nov 1, 2025

    @thegipper "My guess is that overtrained Low Noise models account for what you're seeing."
    I've seen the same thing happen in wan2.1 loras too, and no amount of lowering the strength of the lora helped. Even the lowest weights introduced unwanted image degradation while it totally lost the motion it was supposed to be trained for. Again, I don't know what is causing it. I can only speculate that it's the combination of overtraining and a low-quality dataset lacking in style variety (the rationale being that if the style were varied, even an overtrained lora wouldn't settle on grainy, low saturated, realistic crap).

    thegipper
    Author
    Nov 3, 2025· 1 reaction

    @Remiel 

    "and no amount of lowering the strength of the lora helped"

    Yeah, that definitely tracks. I think overtraining generally isn't so much a matter of strength as it is a matter of which weights in the model end up being trained and impacted by the secondary model.

    A truly overtrained model can't be fixed by lowering strength. LoRA prunes out which weights it includes below an activation threshold and only includes weights having the most impact. If you overtrain, the model gets rewarded for rigidly reproducing aspects of the source material, which often is in the form of literally reproducing small aspects of the extremely limited sources us non-foundational model trainers are using.

    Having diverse source material helps prevent the model from getting rewarded disproportionally, but if you train long enough (or at a high enough strength) you'll still eventually leapfrog the local minima and reward the model for reproducing the limited features of the input ever more specifically, and pruning more and more of the useful weights. When weights in the final activations closely align with the inputs (I2V images or prompts), it quickly races them to a limited set of outputs, but when the inputs are far apart, you get really jarring discontinuities.

    The high noise model in theory makes this less likely by providing something to train which can't be rewarded (as strongly, anyway, what happens in the first high noise steps is complex) for reproducing super fine/minute details, so things get weirder when you overtrain it, but an overtrained low noise model looks very similar to an overtrained model in a single unet architecture.

    blobby99 Oct 27, 2025
    CivitAI

    Excellent work. Emotion of all types is sorely lacking in LoRA work here for recent models. And even when a model like WAN has some emotional understanding, more variation to a specific emotion is very welcome. It is possible to do some of this via capture and controlnet methods, but LoRAs avoid the resolution issues those have.

    Tezozomoctli Oct 27, 2025· 2 reactions
    CivitAI

    The king is back to making emotional loras! Can't wait to see more! (just like all those old SD1.5 ones you did)

    Jellai Oct 28, 2025
    CivitAI

    You say that "Anger" is trained in the title. What is your training phrase for that, so I can get the most out of a trigger?

    MMOFan Oct 28, 2025
    CivitAI

    This does a great job for unhappy scenes. Keep up the good work.

    badhandproductions Oct 30, 2025· 3 reactions
    CivitAI

    I appreciate you making this, and hope you keep making more of these/refining this.
    Human emotion is very complex, and wan is adept at zero of it. I would love to one day have a series of these to get a perfect ratio.
    The last part is me just speaking out, not asking you to do more.

    thegipper
    Author
    Oct 30, 2025· 3 reactions

    I'm definitely interested in doing more. One really great upside of WAN, from early playing around, is that, similar to SD 1.5, its underlying UNET really does seem to have a wealth of prior knowledge about emotional expression. Based on the source material I'm providing, there's 0 chance I'm actually teaching it most of the outputs here; I'm just surfacing them in ways that the text encoder wasn't trained to (or was trained not to).

    That's pretty different, I think, than Flux (and even SDXL) where I swear to god they stripped all source material of all but 5 plastic faces, and you have to explicitly teach it what the latents are to express the output.

    I'm really interested in WAN's unique ability to pick up emotion not just from facial expressions but body language. It's hard to find good sources (there's regrettably little of it in this training set), but it seems like it's capable of picking up cues like hesitation and reaction which are so cool.

    badhandproductions Oct 31, 2025· 1 reaction

    @thegipper Communities like ours need driven individuals like you to keep them alive, and your passion and knowledge are a breath of fresh air. Thank you, King. 

    OrangeJuiceAlien Nov 2, 2025· 1 reaction
    CivitAI

    can you do a T2V version too?

    thegipper
    Author
    Nov 3, 2025· 1 reaction

    Still working on my T2V pipeline. The T2V models are a little less forgiving on style and especially on the complexity of the image framing in training, so might take a bit, but thanks for the bump on interest. I'll see if it's worth putting out something intermediate with this model before I can refine a V2 for both (v1 still needs a lot of work).

    lost_moon Nov 3, 2025· 2 reactions

    @thegipper T2V would be awesome!

    thegipper
    Author
    Nov 5, 2025· 3 reactions

    @lost_moon / @OrangeJuiceAlien - Check in tomorrow

    lost_moon Nov 7, 2025

    @thegipper thank you :)