This lora requires an unofficial Wan model for i2v at 1.3B parameters
Note: This lora is made for wan 2.1 fun 1.3b inp 1.0, not 1.1. Using it with 1.1 probably won't work as well as with 1.0.
This lora is intended for use with https://huggingface.co/alibaba-pai/Wan2.1-Fun-1.3B-InP. Other 1.3B Wan img2vid models might be supported, but only if they use the same weight names; otherwise it will only partially work. Download the diffusion_pytorch_model.safetensors and place it in your ComfyUI checkpoints folder. The other model files are the same as the 14B's i2v files, so 14B i2v workflows should work if you switch the model.
I've also reuploaded it to civitai now, https://civarchive.com/models/1450534?modelVersionId=1640053
The 1.3b model isn't bad for nsfw content; people are likely just training it wrong. This lora was trained on a large variety of content and can output a large variety of content. Both furry and realistic content are supported.
Human characters
While this lora was made for anthro furry characters, a significant amount of human content was included for the last few epochs of v2 to make the motions look more realistic, including physics. Human content was tagged with "realistic" at the end; furry content was tagged with "furry animation" at the start.
Prompting guide
Theoretically, most natural-language prompts should work as well as tag prompts, as I varied them throughout the dataset. All videos were cut and captioned by me by hand, using a tool to make it more convenient; I might consider uploading the tool once it's more convenient to use. Not providing a prompt usually leads to very little movement.
"the woman" and "her" are interchangable, same with "the man" and "he"
Trained prompt structure (do not copy directly, the parts in brackets are just examples): furry animation, {character description} is {action description}, {additional descriptions}, [realistic|the scene is depicted with a detailed 2d drawing|the scene is depicted in 3d]
Character description example (it usually doesn't need to be that specific since it's i2v): an anthro furry fox woman. In the case of human characters you can usually just put "a woman".
The action description describes the position. Currently a few working options are: cowgirl position, reverse cowgirl position, doggystyle position, missionary position, teasing with her tongue, [a woman] uses her breasts to stroke a man's penis. There are probably a few more.
Additional descriptions consist of perspective (pov is going to work the best), speed, depth, pulling out (doesn't work well currently), and cumshots (also doesn't work very well).
Perspective is written as natural language; pov was mostly tagged as "viewed from a first-person pov perspective". Since it's i2v you don't need to worry much about this, but just tagging "pov" should also work.
Speed is described in natural language, and the words used do matter: "{speed} [thrusting|riding|sucking]" will make a difference.
Depth is described similarly to speed, just with depth terms instead.
Movement of the woman can be prompted with: "she moves up and down as she rides his cock".
Movement of the man can be prompted with: "he thrusts into her pussy" and similar. Speed can be included here as well; I've noticed it still works.
Additionally, you can add things like "the woman's ass jiggles with each thrust". I can't really put a full list here.
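To make the structure above concrete, here is a minimal sketch in Python that assembles a prompt in the trained order. It's purely illustrative: the function name and the example component strings are mine, not an exhaustive list of trained tags.

```python
# Illustrative only: assembles a prompt in the trained order
# "furry animation, {character} is {action}, {additional descriptions}, {style}".
# The example values below are samples, not the full trained vocabulary.
def build_prompt(character, action, extras=(), style="realistic", furry=True):
    parts = []
    if furry:
        parts.append("furry animation")       # trained prefix for furry content
    parts.append(f"{character} is {action}")  # character + action/position
    parts.extend(extras)                      # perspective, speed, depth, movement, etc.
    parts.append(style)                       # e.g. "realistic" for human content
    return ", ".join(parts)

print(build_prompt(
    character="an anthro furry fox woman",
    action="riding in cowgirl position",
    extras=[
        "viewed from a first-person pov perspective",
        "fast riding",
        "she moves up and down as she rides his cock",
    ],
    style="the scene is depicted in 3d",
))
```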
Version readmes
v3 readme
v3's release is not quite a new lora: it's actually a new rank 128 lora merged at 30% onto v2, then extracted as a rank 128 lora again. The v3 lora I trained was not very impressive on its own. It might be better at 2d content. After merging, I'm noticing it is more consistent, and often higher quality, with more motion than just v2 e70. For the preview images, barely any cherrypicking was involved.
v2 readme
The model has been re-trained from scratch, with a few notable changes. The img2vid results should look more fitting in nearly every case, and there should be much more motion.
Changes from v1's training:
Base model: While v1 was trained on the default Wan t2v 1.3b, the new model is trained on the actual Wan Fun 1.3B InP, which is the model this is intended to be used with.
This was achieved by simply providing the missing information in diffusion-pipe; it's technically already supported, it just needs to be activated. This PR enables that.
This not only helps the model properly use movements, it also improves consistency with img2vid.
The lora's rank has been increased from 32 to 64
The dataset has had a few changes
The videos have been 16fps from the beginning
The training resolution has been dropped from 400 to 256 as a tradeoff for memory usage (upped to 480 for e70, as this seemingly improves motion)
The training frame count buckets have been improved, from v1's [1, 24] to v2's [1, 16, 24, 32, 40]. This allowed for training on longer videos with more context info (see the sketch after this list).
The v2 model was trained at a higher learning rate than v1; I might consider a value in between the old and current one.
At only 12 epochs, the model has more consistent motion than v1 at 40 epochs!
The training dataset contains human data since the switch to 480 res. This helps with movements and physics, and it also reduces artifacts like random cutoffs. There are still some "stretch" artifacts in some situations.
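As a rough illustration of what the wider frame count buckets buy you: with v1's buckets, anything shorter than 24 frames could only be trained as a single frame, while v2 has intermediate options. This sketch assumes a clip is trained at the largest bucket that fits; diffusion-pipe's actual bucketing logic may differ.

```python
# Rough sketch of frame-count bucketing (assumption: a clip trains at the
# largest bucket that does not exceed its frame count; diffusion-pipe's
# real behavior may differ).
V1_BUCKETS = [1, 24]
V2_BUCKETS = [1, 16, 24, 32, 40]

def pick_bucket(frame_count, buckets):
    """Return the largest bucket that still fits inside the clip."""
    return max(b for b in buckets if b <= frame_count)

for frames in (20, 30, 60):
    print(f"{frames} frames -> v1 bucket {pick_bucket(frames, V1_BUCKETS)}, "
          f"v2 bucket {pick_bucket(frames, V2_BUCKETS)}")
# 20 frames -> v1 bucket 1, v2 bucket 16
# 30 frames -> v1 bucket 24, v2 bucket 24
# 60 frames -> v1 bucket 24, v2 bucket 40
```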
v1 readme
A model that should be better at animating furry porn; that's pretty much it. It's not good at txt2vid, so I don't recommend that. Maybe this could be improved by training on images as well.
This is mostly a proof-of-concept to demonstrate that a lora can be made for Wan 2.1 Fun 1.3b Inp, and I think it shows that this is indeed the case.
Btw, generating short videos (<1.5 sec) with img2vid at a slightly lower resolution lets you generate a video in about a minute on an rtx 3060. Doing the same with the 14b model takes me more than 10 minutes. The 1.3b deserves more love.
Usage
Most importantly, use Wan 2.1 Fun 1.3b Inp with img2vid, as regular txt2vid is not going to give very good results, due to the lora not being high rank enough, or even trained enough. While some concepts will be visible, it will not produce very good quality outputs.
When testing, I noticed that just prompting naturally usually yields the best results. However, there are a few things that have been tagged a few times in the dataset.
Note that neither speeds nor depths are going to have much impact, likely due to some issues described in the training section.
Positions
The model was trained on cowgirl, reverse cowgirl, missionary, blowjob, deepthroat, and some teasing as well.
Perspective
Mainly "viewed from a first-person pov perspective", "viewed from the side". Other descriptions should hopefully work.
Speeds
Speeds are written like "[speed] thrusting" or "[speed] sucking"
Available speeds are: "slow", "moderate speed", "fast" and "very fast"
Depths
Depths are written like "[depth] thrusts" or "[depth] sucks"
Available depths are: "shallow", "moderate", "deep" and "balls deep"
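For reference, here are a couple of full prompts that combine the tags above. These exact combinations are assumptions of mine based on the listed tags, not captions copied from the dataset.

```python
# Illustrative prompt combinations using the position/perspective/speed/depth tags above.
# The exact wording is assumed, not verbatim from the training captions.
example_prompts = [
    "furry animation, an anthro furry wolf woman is riding in reverse cowgirl position, "
    "viewed from a first-person pov perspective, fast riding, deep thrusts, "
    "the scene is depicted in 3d",
    "a woman is giving a blowjob, viewed from the side, "
    "moderate speed sucking, deep sucks, realistic",
]
for prompt in example_prompts:
    print(prompt)
```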
Features
Jiggling breasts (seems to be pretty noticeable in generations)
Jiggling ass
This lora has been tested with images generated with Novafurry and Willy's Noob Realism, as shown in the preview videos. It should work on outputs generated from whatever model, though.
Training info
This model is a LoRA painstakingly trained on a single rtx 3060 for a total of 40 epochs on a dataset of about 45 manually tagged clips of nsfw furry content.
The first ~36 epochs were trained with varying framerates (assuming diffusion-pipe doesn't fix that itself). I then re-encoded the dataset to use 16fps and trained 4 more epochs; this seems to have made the motion a little better. Overall, I'm still not happy.
The dataset was scaled to resolutions with pixel counts similar to 400x400, at 24 frames. This still used too much vram, so I used a block swap of 10; I was able to train at about 2 epochs per hour.
I used diffusion-pipe for the training; since I don't have a budget for anything, I trained locally.
The model seems underfitted for txt2vid, and the lora rank is also only 32. If I were to train it again, there are a few things I would do differently, namely:
I would start by training on images, so the model can get a better understanding of what anthros look like
I would retag the dataset, going over each entry multiple times instead of just once, since I feel like I might have missed some things
I would use a higher rank for the lora, as I believe 32 might be a bit low for such a broad concept
I would make sure the dataset is already at the correct framerate, as I noticed there was not much movement except with some less commonly used tags, which might be caused by high-fps videos effectively being in slow motion
While this was trained on Wan 2.1 txt2vid 1.3b, it is intended for img2vid using https://huggingface.co/alibaba-pai/Wan2.1-Fun-1.3B-InP. I have noticed that no additional training is needed, and Wan 2.1 txt2vid 1.3b loras will work properly on Wan2.1 Fun 1.3B InP. I hope this information helps others in the future.
I am overall not happy with how this turned out, but I will likely retrain this model from scratch in the future, when I can put some money into a cloud gpu provider or similar to train faster without it preventing me from doing other things.
Yap yap yap, go try the model or something
Description
Continued training with a slightly expanded dataset, should have less deforming genitals, and
FAQ
Comments
Looks good Mylo!
Keep up the great work 🧡
I have trained 2 furry Wan loras so far: I trained the furry titfuck one, and I have trained one that is not yet posted. It's an NSFW model trained on 102 videos for the 14B, but I'm having issues with it following prompts correctly and giving proper motion. What were your training settings for this one? I may need to continue training if it's not due to improper prompting.
1. Make sure it was trained on i2v; training on t2v reduces motion
2. Make sure your dataset is 16 fps; I'm not sure if diffusion-pipe converts it, so not converting could cause slow-mo
3. Not that sure about this one, but I think training at a lower resolution reduces motion at higher res; this lora was trained at 256x256. Generating at that resolution gives faster movements and a wider range of movement.
4. 14b models usually need more training steps as well
@mylo1337 Thanks! I will resume training from the last checkpoint and increase the network rank to 64 and the learning rate to 1e-4 from 9e-5. I wish the 14B trained as fast as the 1.3B; the 14B t2v is one epoch every 1.5 hours on a 4070 S using block swap 38 and 90GB of RAM.
@mylo1337 Scratch that, I will train the same model as you did, using a 104-video dataset. What do you mean by missing information in diffusion-pipe? What needs to be changed to train properly, is it some Python in diffusion-pipe? If so, please let me know, as I have a large high quality dataset that I would love to see how well it performs when trained on the 1.3b fun.
@basedbase You'll need to use the wan 2.1 fun 1.3b inp model as the base, and this patch (https://github.com/gitmylo/diffusion-pipe/tree/patch-1) will make it load the model properly.
To git clone the new variant, run `git clone --recurse-submodules https://github.com/gitmylo/diffusion-pipe -b patch-1`. Other than supporting the wan 1.3b i2v models, it is the same.
I used a learning rate of 1e-05 for the most part for v2, which might be a bit low, but seems to have been good enough.
@mylo1337 Thanks for the fast response, caching latents now! How many repeats did you use? I saw it says you have 23k steps.
@mylo1337 What workflow are you using for the wan fun model? Can't seem to get any good generations.
@basedbase Just the standard wan i2v workflow, but with the other model and the lora applied. Usually I use SwarmUI, but v2 e14's preview images were made in ComfyUI and should contain the workflow.
@basedbase I didn't use repeats in my dataset, but I used multiple_overlapping for the context, which I'm pretty sure is good for training continuous motions, since it creates multiple clips to train on per video.
I'm not sure what repeats are useful for considering you can just train on more epochs. Could be wrong though.
@mylo1337 Did you actually have 23k steps? I need to re-run my training with multiple_overlapping and no repeats, as I was using 10 repeats.
@basedbase The 23k steps is a bit of an estimate based on the last saved step checkpoint and the current epoch. But yeah, around 23k steps. I've broadened the dataset now and continued training until ~33k-34k steps, which I'm still testing and will probably be uploading soon, as the results are a lot better.
V1 was trained on [1, 16] frames 480x480 pixels on t2v
V2 (until e54) was trained on [1, 16, 24, 32, 40] frames at 256x256 pixels on i2v (which led to motion sometimes looking odd or too subtle)
V2 (from e54 until now) was trained on [1, 16, 24, 32, 40] frames at 480x480 pixels on i2v (as I was given some runpod credit, I used the rtx 4000, which is very cheap but not that fast, with more than enough vram though. Batch size was 1, but I might consider a batch size of 4 in the future if that's possible; it didn't seem to be using much vram anyway.)
@mylo1337 Interesting, I'm on epoch 8 and step 1768; that's on 104 videos and no repeats, set to train until epoch 60, which still is not 20k steps. Do you recommend doing a second training run resuming from the last epoch, then training until 33k steps total? Also, training on 256x256 is currently going very quickly, 3 seconds per it and only using 5.8gb of VRAM. I'm curious, for the new one you are currently working on, did you retrain from scratch or resume from the last checkpoint with the updated resolution?
@basedbase 3 seconds per it was about the same as I had for the first 54 epochs, before increasing the dataset resolution. After that, as I mentioned, I ran it on runpod on an rtx 4000 and it got about 5 seconds per iteration. I let it run for a couple of hours. The dataset used for the last part contains 112 videos, some short, some longer. Caching it took over an hour.
You should be able to switch to the higher resolution at any point. Getting it to learn the higher res shouldn't be an issue. So I recommend just continuing it with the higher res dataset.
@mylo1337 I'm still curious as to why my step count per epoch is significantly lower than yours with the same settings.
@basedbase If a video has 400 frames and you take 40 frames, multiple_overlapping will have 10 samples (maybe 11, not sure how it works exactly). That would lead to 10x as many steps for the same number of epochs. The longer the videos, the bigger the difference.
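A quick back-of-the-envelope version of that estimate (this assumes simple non-overlapping windows, which matches the 400/40 = 10 figure; the actual overlap logic in diffusion-pipe may add a clip or two per video):

```python
# Rough estimate of how many training samples a video contributes with
# multiple_overlapping, assuming simple non-overlapping windows.
def estimated_samples(total_frames, window_frames):
    return max(1, total_frames // window_frames)

print(estimated_samples(400, 40))      # 10 samples from a 400-frame video
print(estimated_samples(20 * 16, 40))  # 8 samples from a 20 s clip at 16 fps
```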
@mylo1337 Ok, so maybe that's why yours is so many steps; my videos average 7-20 seconds each.
@mylo1337 Resumed training from epoch 61, still at 256x256 res, as 480 increased time per step from 3.5 to 14.5 sec. Increased batch size to 4.
@mylo1337 Looks night and day better with a batch size of 4, up to epoch 75. Still undertrained; will let it train to maybe 90 epochs, then add more data since it trains so fast.
@mylo1337 How much human data was added? I may need to do that to get good movement.
creator, please mark this furry so not everyone has to see this, thanks...
It's had the furry tag since the original upload, check your blocked tags in your civitai settings. Checking "hide furry" should hide posts with the "furry" and/or "anthro" tag.
@mylo1337 I'm sorry to say that I still see your post, even though "Hide furry" is enabled (and the ANTHRO, FURRY tags are placed in my Hidden Tags section)...
Also, this behavior is the same on TWO different civitai accounts. Not sure what's wrong, but I'll just block you to "fix" :)
Anyway, thanks for your contributions.
why do you want to gatekeep it from others, my brother in fur
@ria1337 Bye felicia!