Studio Ghibli 🎥 Wan2.1-T2V-14B - CivArchive (CivitAI Archive)

Studio Ghibli 🎥 Wan2.1-T2V-14B - v1.0

NSFW

This LoRA is featured on OpenMuse, a curated initiative dedicated to open-source video LoRAs and the creative works they enable. Focused on models like Wan2.1, LTX-Video, and HunyuanVideo, OpenMuse highlights high-quality tools and artwork from across the ecosystem. Rooted in the Banodoco community, OpenMuse is a growing home for open, collaborative AI art, designed to inspire creators, spark curiosity, and offer something you'd feel proud to share, even with someone skeptical of AI-generated art.

Description

I am very happy to share my magnum opus LoRA, which I've been working on for the past month since Wan came out. This is indeed the best LoRA ~~on Civitai~~ I have ever trained, and I have to say (once again) - WanVideo is an amazing model.

LoRA was trained for ~90 hours on an RTX 3090 with musubi-tuner using a mixed dataset of 240 clips and 120 images. This could have been done faster, but I was obsessed with pushing the limits to create a state-of-the-art style model. It’s up to you to judge if I succeeded.

Usage

The trigger phrase is Studio Ghibli style - all captions for training data were prefixed with these words.

All clips I publish in gallery are raw outputs using a LoRA with a base Wan-T2V-14B model (although latest videos also may include self-forcing LoRA for inference acceleration, read more below), without further post-processing, upscaling, or interpolation.

Compatibility with other LoRAs and with Wan-I2V models has not been tested.

Workflows are embedded with each video (so you can just download and drag video into ComfyUI to open it). As an example, here is JSON for workflow (based on Kijai's wrapper), which uses the self-forcing LoRA (created by blyss), extracted from lightx2v's Wan2.1-T2V-14B-StepDistill-CfgDistill model. I chose version made by blyss (and not the original LoRA by Kijai), because, from my tests, it offers maximum compatibility and only accelerates inference, without any additional detailing or stylistic bias. (This is also the reason why I stick to the base Wan model and do not use merges like AniWan or FusionX.)

I'm using the acceleration LoRA with the UniPC sampler (and occasionally DPM++). In my experience, UniPC performs better for 2D animation than LCM, which tends to lean more toward realism, which I want to avoid. Usually I also apply NAG node, so I can use negative prompts with CFG=1. From initial testing, compared to the older workflow with TeaCache, aside from the huge speed gain (a 640×480×81 6-step clip renders in ~1 minute instead of 6 on an RTX 3090), it also slightly improves motion smoothness and text rendering.

The updated lightx2v LoRAs are also very impressive in terms of speed and quality preservation. I'm using a rank 128 LoRA, but the 32 and 64 versions also produce great results. Here's an example of the workflow in JSON format. I found out lowering lightx2v LoRA strength to 0.9, increasing number of steps to 8, and using either UniPC or DPMPP scheduler gives pretty good outputs. The obvious downside is that output usually leans to default Wan's "realistic 3D" style. Counter this by increasing number of steps, lowering strength of acceleration LoRA and increasing strength of style LoRA. Also you can try to replace lightx2v LoRA with rCM LoRA, it might occasionally give slightly better motion.

And here is "legacy" workflow in JSON format. It was used to generate 90% videos in gallery for this LoRA. It was also build on wrapper nodes and included a lot of optimizations (more information here), including fp8_e5m2 checkpoints + torch.compile, SageAttention 2, TeaCache, Enhance-A-Video, Fp16_fast, SLG, and (sometimes) Zero-Star (some of these migrated to new workflow as well), but rendering a 640x480x81 clip still took about 5 minutes (RTX 3090) in older workflow. Although the legacy workflow demonstrates slightly superior quality in a few specific areas (palette, smoothness), the 5x slowdown is a significant and decisive drawback, being the reason I migrated to lightx2v-powered version.

Prompting

To generate most prompts, I usually apply the following meta-prompt in ChatGPT (or Claude, or any other capable LLM), that helps to enhance "raw" descriptions. This prompt is based on official prompt extension code by Wan developers and looks like this:

You are a prompt engineer, specializing in refining user inputs into high-quality prompts for video generation in the distinct Studio Ghibli style. You ensure that the output aligns with the original intent while enriching details for visual and motion clarity.

Task Requirements:
- If the user input is too brief, expand it with reasonable details to create a more vivid and complete scene without altering the core meaning.
- Emphasize key features such as characters' appearances, expressions, clothing, postures, and spatial relationships.
- Always maintain the Studio Ghibli visual aesthetic - soft watercolor-like backgrounds, expressive yet simple character designs, and a warm, nostalgic atmosphere.
- Enhance descriptions of motion and camera movements for natural animation flow. Include gentle, organic movements that match Ghibli's storytelling style.
- Preserve original text in quotes or titles while ensuring the prompt is clear, immersive, and 80-100 words long.
- All prompts must begin with "Studio Ghibli style." No other art styles should be used.

Example Revised Prompts:
"Studio Ghibli style. A young girl with short brown hair and curious eyes stands on a sunlit grassy hill, wind gently rustling her simple white dress. She watches a group of birds soar across the golden sky, her bare feet sinking slightly into the soft earth. The scene is bathed in warm, nostalgic light, with lush trees swaying in the distance. A gentle breeze carries the sounds of nature. Medium shot, slightly low angle, with a slow cinematic pan capturing the serene movement."
"Studio Ghibli style. A small village at sunset, lanterns glowing softly under the eaves of wooden houses. A young boy in a blue yukata runs down a narrow stone path, his sandals tapping against the ground as he chases a firefly. His excited expression reflects in the shimmering river beside him. The atmosphere is rich with warm oranges and cool blues, evoking a peaceful summer evening. Medium shot with a smooth tracking movement following the boy's energetic steps."
"Studio Ghibli style. A mystical forest bathed in morning mist, where towering trees arch over a moss-covered path. A girl in a simple green cloak gently places her hand on the back of a massive, gentle-eyed creature resembling an ancient deer. Its fur shimmers faintly as sunlight pierces through the thick canopy, illuminating drifting pollen. The camera slowly zooms in, emphasizing their quiet connection. A soft gust of wind stirs the leaves, and tiny glowing spirits peek from behind the roots."

Instructions:
I will now provide a prompt for you to rewrite. Please expand and refine it in English while ensuring it adheres to the Studio Ghibli aesthetic. Even if the input is an instruction rather than a description, rewrite it into a complete, visually rich prompt without additional responses or quotation marks.

The prompt is: "YOUR PROMPT HERE".

Replace YOUR PROMPT HERE with something like Young blonde girl stands on the mountain near seashore beach under rain or whatever.

The negative prompt always includes the same base text (but may have additional words added depending on the specific prompt):

色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走, 3D, MMD, MikuMikuDance, SFM, Source Filmmaker, Blender, Unity, Unreal, CGI, bad quality

Dataset

In this and the following sections, I'll be doing a bit of yapping :) Feel free to skip ahead and just read the Conclusion, but maybe someone will find some useful bits of information in this wall of text. So...

Dataset selection stage was the "easiest" part, I already have all the Ghibli films in highest possible quality and splitted into scenes - over 30,000 clips in 1920x1040 resolution and high bitrate. They're patiently waiting for the day I finally will make a full fine-tune some video model with them.

And I had already prepped around 300 clips for training v0.7 of HV LoRA (in fact, I was just about to start the training when Wan came out). These clips were in the range of 65-129 frames, which I consider optimal for training HV on videos, and they were all 24 fps. For Wan, though, I wanted them to be in a different frame range (not exceeding 81 frames, explanation see later in the "Training" section). I also needed them to be in 16 fps. I'm still not entirely sure if strict 16 fps is necessary, but I had some issues with HV when clips were in 30 fps instead of HV’s native 24 fps, so I decided to stick with 16 fps.

I should mention, that for processing dataset, I usually make a lot of small "one-time" scripts (with the help of Claude, ChatGPT, and DeepSeek) - that includes mini-GUIs for manual selection of videos, one-liners for splitting frames, scripts for outputting various helper stats, dissecting clips by ranges, creating buckets in advance, etc. I don't publish these scripts because they're messy, full of hardcoded values, and designed for one-time use anyway. And nowadays anyone can easily create similar scripts by making requests to the aforementioned LLMs.

Converting all clips to 16 fps narrowed the range of frames in each video from 65-129 to around 45-88 frames, which messed up my meticulously planned, frame-perfect ranges for the frame buckets I had set up for training. Thankfully, it wasn't a big deal because I had some rules in place when selecting videos for training, specifically to handle situations like this.

First of all, the scene shouldn't have rapid transitions during its duration. I needed this because I couldn't predict the exact duration (in frames) of target frame buckets that trainer will establish for training - model size, VRAM, and other factors all affect this. Example: I might want to use a single 81-frame long clip for training, but I won't be able to do this, because I will get OOM on RTX 3090. So will have to choose some frame extraction strategy, depending of which clip might be splitted onto several shorter parts (here is excellent breakdown of various strategies). And its semantic coherence might be broken (like, on first fragment of the clip a girl might open her mouth , but from clipped first fragment it will become ambiguous whether she is gonna cry or laugh), and that kind of context incoherence may make Wan's UMT5 encoder feel sad.

Another thing to consider is that I wanted to reuse captions for any fragment of the original clip without dealing with recaptioning and recaching embeddings via the text encoder. Captioning videos takes quite a long time, but if a scene changes drastically throughout its range, the original caption might not fit all fragments, reducing training quality. By following rules "clip should not contain rapid context transitions" and "clip should be self-contained, i.e. it should not feature events that may not be understood from within the clip itself", even if a scene is to be split into subfragments, the captions would (with an acceptable margin of error) still apply to each fragment.

After conversion I looked through all clips and reduced total number of them to 240 (just took out some clips that did contained too much transitions or, vica-versa, were too static), which formed the first part of the dataset.

I decided to use a mixed dataset of videos and images. So second part of the dataset was formed by 120 images (at 768x768 resolution), taken from screencaps of various Ghibli movies.

There's an alternative approach where you train on images first and then fine-tune on videos (it was successfully applied by the creator of this LoRA), but I personally think it's not as good as mixing in a single batch (though I don't have hard numbers to back this up). To back up my assumptions, here is very good LoRA that uses the same mixed approach to training (and btw it was also done on a 24 GB GPU, if I am not mistaken).

To properly enable effective video training on mixed dataset on consumer-level GPUs I had to find the right balance between resolution, duration, and training time, and I decided to do this by mixing low-res high-duration videos and high-res images - I will give more details about this in Training section.

Considering captioning: images for dataset were actually just reused from some of my HV datasets, and they were captioned earlier using my "swiss army knife" VLM for (SFW-only) dataset captioning, also known as Qwen2-VL-7B-Instruct. I used the following captioning prompt:

Create a very detailed description of this scene. Do not use numbered lists or line breaks. IMPORTANT: The output description MUST ALWAYS start with the unaltered phrase 'Studio Ghibli style. ', followed by your detailed description. The description should 1) describe the main content of the scene, 2) describe the environment and lighting details, 3) identify the type of shot (e.g., aerial shot, close-up, medium shot, long shot), and 4) include the atmosphere of the scene (e.g., cozy, tense, mysterious). Here's a template you MUST use: 'Studio Ghibli style. {Primary Subject Action/Description}. {Environment and Lighting Details}. {Style and Technical Specifications}'.

I had some doubts about whether I should recaption them since the target caption structure was specifically designed for HunyuanVideo, and I worried that Wan might need a completely different approach. I left them as-is, and have no idea if this was the right decision, but, broadly speaking, modern text encoders are powerful enough to ignore such limitations. As we know, models like Flux and some others can even be trained without captions at all (although I believe training with captions is always better than without - but only if captions are relevant to the content).

For captioning videos I tested a bunch of local models that can natively caption video content:

CogVLM2-Video-Llama3-Chat (usually this is my go-to option for clip captioning)
MiniCPM-V 2.6
Apollo-LMMs-Apollo-7B-t32
LLaVA-Onevision
VideoChat-Flash-2B
VideoLLaMA 3
Ovis2-16B (this one seems really good! But I had already dataset captioned when I found it, so will use it in future LoRAs)

There are more models out there, but these are the ones I tested. For this LoRA, I ended up using Apollo-7B. I used this simple VLM prompt:

Create a very detailed description of this video. IMPORTANT: The output description MUST ALWAYS start with the unaltered phrase 'Studio Ghibli style. ', followed by your detailed description.

I'm attaching the full dataset I used as an addendum to the model. While it does kinda contain copyrighted material, I think this falls under fair use. This dataset is provided solely for research and educational evaluation of the model's capabilities and to offer transparency regarding the model's training process. It should not be used for redistribution or commercial exploitation.

Training

If anyone interested, here is list of trainers that I considered for training WanVideo:

diffusion-pipe - OG of the HV training, but also allows memory-efficient Wan training; config-driven, has third-party GUI and runpod templates (read more here and here). For HV I used it exclusively. Requires WSL to run on Windows.
Musubi Tuner - Maintained by responsible and friendly developer. Config-driven, has cozy community, tons of options. Currently my choice for Wan training.
AI Toolkit - My favorite trainer for Flux recently got support for Wan. It's fast, easy-to-use, config-driven, also has first-party UI (which I do not use 🤷), but currently supports training 14B only without captions, which is the main reason I do not use it.
DiffSynth Studio - I haven't had the time to test it yet and am unsure if it can train Wan models with 24 GB VRAM. However, it’s maintained by ModelScope, making it worth a closer look. I plan to test it soon.
finetrainers - Has support for Wan training, but doesn't seem to work with 24 GB GPUs (yet)
SimpleTuner - Gained support for Wan last week, so I haven't had a chance to try it yet. It definitely deserves attention since the main developer is a truly passionate and knowledgeable person.
Zero-to-Wan - Supports training only for 1.3B models.
WanTraining - I have to mention this project, as it's supported by a developer who’s done impressive work with it, including guidance-distilled LoRA and control LoRA.

So, I used Musubi Tuner. For reference, here are my hardware params: i5-12600KF, RTX 3090, Windows 11, 64Gb RAM. The commands and config files I used were the following.

For caching VAE latents (nothing specific here, just default command)

python wan_cache_latents.py --dataset_config G:/samples/musubi-tuner/_studio_ghibli_wan14b_v01_dataset.toml --vae G:/samples/musubi-tuner/wan14b/vae/wan_2.1_vae.safetensors

For caching text encoder embeddings (default):

python wan_cache_text_encoder_outputs.py --dataset_config G:/samples/musubi-tuner/_studio_ghibli_wan14b_v01_dataset.toml --t5 G:/samples/musubi-tuner/wan14b/tenc/models_t5_umt5-xxl-enc-bf16.pth --batch_size 16

For launching training:

accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 wan_train_network.py ^
    --task t2v-14B ^
    --dit G:/samples/musubi-tuner/wan14b/dit/wan2.1_t2v_14B_bf16.safetensors ^
	--vae G:/samples/musubi-tuner/wan14b/vae/wan_2.1_vae.safetensors ^
	--t5 G:/samples/musubi-tuner/wan14b/tenc/models_t5_umt5-xxl-enc-bf16.pth ^
	--sdpa ^
	--blocks_to_swap 10 ^
	--mixed_precision bf16 ^
	--fp8_base ^
	--fp8_scaled ^
	--fp8_t5 ^
	--dataset_config G:/samples/musubi-tuner/_studio_ghibli_wan14b_v01_dataset.toml ^
    --optimizer_type adamw8bit ^
	--learning_rate 5e-5 ^
	--gradient_checkpointing ^
    --max_data_loader_n_workers 2 ^
	--persistent_data_loader_workers ^
    --network_module networks.lora_wan ^
	--network_dim 32 ^
	--network_alpha 32 ^
    --timestep_sampling shift ^
	--discrete_flow_shift 3.0 ^
	--save_every_n_epochs 1 ^
	--seed 2025 ^
    --output_dir G:/samples/musubi-tuner/output ^
	--output_name studio_ghibli_wan14b_v01 ^
	--log_config ^
	--log_with tensorboard ^
	--logging_dir G:/samples/musubi-tuner/logs ^
	--sample_prompts G:/samples/musubi-tuner/_studio_ghibli_wan14b_v01_sampling.txt ^
	--save_state ^
	--max_train_epochs 50 ^
	--sample_every_n_epochs 1

Again, nothing to see here, actually. I had to use blocks_to_swap parameter because otherwise, with my dataset config (see below), I confronted into 24 Gb VRAM constraints. Hyperparameters were mostly left on defaults. I didn't want to risk anything after a bad experience - 60 hours of HV training lost due to getting too ambitious with flow shift values and adaptive optimizers instead of good old adamw.

Prompt file for sampling during training:

# prompt 1
Studio Ghibli style. Woman with blonde hair is walking on the beach, camera zoom out.  --w 384 --h 384 --f 45 --d 7 --s 20

# prompt 2
Studio Ghibli style. Woman dancing in the bar. --w 384 --h 384 --f 45 --d 7 --s 20

Dataset configuration (the most important part; I'll explain the thoughts that led me to it afterward):

[general]
caption_extension = ".txt"
enable_bucket = true
bucket_no_upscale = true

[[datasets]]
image_directory = "H:/datasets/studio_ghibli_wan_video_v01/images/768x768"
cache_directory = "H:/datasets/studio_ghibli_wan_video_v01/images/768x768/cache"
resolution = [768, 768]
batch_size = 1
num_repeats = 1

[[datasets]]
video_directory = "H:/datasets/studio_ghibli_wan_video_v01/videos/1920x1040"
cache_directory = "H:/datasets/studio_ghibli_wan_video_v01/videos/1920x1040/cache_1"
resolution = [768, 416]
batch_size = 1
num_repeats = 1
frame_extraction = "head"
target_frames = [1, 21]

[[datasets]]
video_directory = "H:/datasets/studio_ghibli_wan_video_v01/videos/1920x1040"
cache_directory = "H:/datasets/studio_ghibli_wan_video_v01/videos/1920x1040/cache_2"
resolution = [384, 208]
batch_size = 1
num_repeats = 1
frame_extraction = "uniform"
target_frames = [45]
frame_sample = 2

My dataset setup consists of three parts.

I'll start with the last one, which includes the main data array - 240 clips in 1920x1040 resolution and duration that varies from 45 to 88 frames.

Obviously, training on full-resolution 1920x1040, full-duration clips on an RTX 3090 was out of the question. I needed to find the minimum resolution and frame duration that would avoid OOM errors while keeping the bucket fragments as long as possible. Longer fragments help the model learn motion, timing, and spatial patterns (like hair twitching, fabric swaying, liquid dynamics etc.) of the Ghibli style - something you can't achieve with still frames.

From training HV, I remembered a good starting point for estimation of available resolution range for 24 Gb GPU is 512x512x33. I decided on the "uniform" frame extraction pattern, ensuring all extracted fragments were no fewer than 45 frames. Since, as I wrote before, after conversion to 16fps, maxed out at 88 frames, this approach kept the clips from being divided into more than two spans, which would've made epochs too long. At the same time, timespan of 45 frames (~3s) should be enough for model to learn spatial flow of the style.

With the target fixed to 45 frames, I started testing different resolutions. I used a script to analyze all clips in a folder and suggest valid width-height combinations that maintained the original aspect ratio (1920/1040 ≈ 1.85) and were divisible by 16 (a model requirement).

Eventually, I found that using [384, 208] for the bucket size and setting --blocks_to_swap 10 prevented OOM errors and pushing into shared memory (which eventually led to 160 s/it). The downside was that training speed dropped to around 11-12 s/it. In hindsight, lowering the resolution to [368, 192] could have bumped the speed up to ~8 s/it, which would've been great (close to what I get when training Flux at 1024p in AI Toolkit). And that would've saved me around 20 hours of training over the full 90-hour run (~28000 steps), although I didn't expect it to go > 20K steps back then.

And it needs to be noted, that I trained on Windows with my monitor connected to the GPU (and used my PC for coding at the same time 😼). On Linux (for example, with diffusion-pipe) and with using internal GPU for monitor output, it might be possible to use slightly higher spatiotemporal resolutions without hitting OOM or shared memory limits (something I think is Windows-specific).

Now about the first part (120 images in 768x768 resolution). Initially, I wanted to train on 1024p images, but I decided it'd be overkill and slow things down. My plan was to train on HD images and low-res videos simultaneously to ensure better generalization. The idea was that high-resolution images would compensate for the lower resolution of the clips. And joint video + image pretraining is how WAN was trained anyway, so I figured this approach would favor "upstream" style learning as well.

Finally, the second part, which is also important for generalization (again, that is not as "scientific" assumption, but it seems reasonable). The idea was to reuse the same clips from the third section but now train only on the first frame and the first 21 frames. This approach, I hoped, would facilitate learning temporal style motion features. At the same time, it let me bump up the resolution for the second section to [768, 416].

As the result, I hoped to achieve "cross-generalization" between:

Section 1's high-res images (768x768)
Section 2's medium-res single frames and 21-frame clips (768x416)
Section 3's low-res 45-frame clips (384x208)

Additionally, both the second and the larger part of the third sections shared the same starting frame, which I believed would benefit LoRA usage in I2V scenarios. All this seemed like the best way to fully utilize my dataset without hitting hardware limits.

Of course, I'm not the first to come up with this approach, but it seems logical and reasonable, so I hope more creators realize you don’t need an A100 to train a video-based LoRA for Wan.

Funny fact: I expected one epoch to consist of 1080 samples: 120 images (1st dataset section) + 240 single frames (2nd dataset section, "head" frame bucket=1) + 240 clips of 21 frames each (2nd dataset section, "head" frame bucket=21) + 480 clips of 45 frames each (2nd dataset section, "uniform" frame bucket=45, sampled 2 times). However, after I started training, I discovered it was actually 1078 samples. When I dug into it, I found that two of the clips reported by my scripts (which use the ffprobe command from ffmpeg to count the number of frames) were actually shorter than 45 frames, so there was an issue with rounding. This wasn't a big deal, so I just continued training without those two clips, but that was the reason the number of steps for the final LoRA seemed so off :)

The training itself went smoothly. I won't reveal loss graphs since I ~~am too shy~~ don't think they mean much. I mostly use them to check if the loss distribution starts looking too similar across epochs - that's my cue for potential overfitting.

I trained up to 28000 steps, then spent several days selecting the best checkpoint. Another thing I think I could have done better is taking checkpoints not just at the end of each epoch, but also in between. Since each epoch is 1078 steps long, it's possible that a checkpoint with even better results than the one I ended up with was lost somewhere in between.

I'm considering integrating validation loss estimation into my training pipeline (more on this here), but I haven't done it yet.

Could this be simplified? Probably yes. In my next LoRA, I'll test whether the extra image dataset in section 1 was redundant. I could've just set up a separate dataset section and reused clips' first frame, but with high resolution. On the other hand, I wanted the dataset to be as varied as possible, so I used screencaps from different scenes than the clips, in this sense they were not redundant.

I'm not even sure if the second section was necessary. Since WAN itself (according to its technical report) was pretrained on 192px clips, training at around 352x192x45 should be effective and make the most of my hardware. Ideally, I'd use 5-second clips (16 fps * 5s + 1 = 81 frames), but that’s just not feasible on the RTX 3090 without aggressive block swapping.

Conclusion

Aside from the fun and the ~~hundreds~~ thousands of insanely good clips, here are some insights I've gained from training this LoRA. I should mention that these practices are based on my personal experience and observations, I don't have any strictly analytical evidence to prove their effectiveness and I only tried style training so far. I plan to explore concept training very soon to test some of my other assumptions and see if they can be applied as well.

You can train Wan-14B on consumer-level GPUs using videos. 368x192x45 seems like a solid starting point.
Compensate for motion-targeted style learning on low-res videos by using high-res images to ensure better generalization.
Combine various frame extraction methods on the same datasets to maximize effectiveness and hardware usage.

A lot, if not all, of what I've learned to make this LoRA comes from reading countless r/StableDiffusion posts, 24/7 lurking on the awesome Banodoco Discord, reading comments ~~and opening every NSFW clip~~ to every single WanVideo model here on Civitai, and diving into every issue I could find in the musubi-tuner, diffusion-pipe, Wan2.1, and other repositories. 😽

P.S.

This model is a technological showcase of the capabilities of modern video generation systems. It is not intended to harm or infringe upon the rights of the original creators. Instead, it serves as a tribute to the remarkable work of the artists whose creations have inspired this model.

Description

FAQ

Comments (82)

Light7799Mar 28, 2025· 1 reaction

CivitAI

This is so cool! Thank you for sharing!

Kong__Mar 28, 2025· 1 reaction

CivitAI

Awesome stuff! Great work

superyinhua106Mar 28, 2025· 1 reaction

CivitAI

The effect is very good. Thank you for making such an exciting lora. Can you work harder and make a wan_1.3B_T2V lora? After all, 14B is too slow.

seruva19

Author

Mar 28, 2025· 3 reactions

In my opinion, 1.3B is too small to provide high-quality video. I haven’t seen a single LoRA for 1.3B that has convinced me otherwise. However, if no one eventually makes a Ghibli model for it, I’ll might give it a try.

upd. https://civitai.com/models/1474964 (not mine)

MyteeMar 28, 2025

@seruva19 wan_1.3B_T2V vote +1. Thx

superyinhua106Mar 29, 2025

非常感谢你的回复，在我看来，1.3B的模型如果用720P的分辨率，效果非常不错，如果能用你的lora，效率会非常的高。恳请你训练一个1.3B的。

AIWarperApr 2, 2025· 1 reaction

@seruva19 With the new VACE models it would be cool to have a 1.3bn variant

seruva19

Author

Apr 2, 2025

@AIWarper We'll see. It surely has large potential, some examples are impressive.

Le_FourbeMar 28, 2025· 1 reaction

CivitAI

really really convincing ! this is what i would have expected from tooncrafter one year ago.
i'll be watching your articles and profile with great interest wile i wait for my 5090 pre order... (i have the 3090 too)
Thanks you ! waiting for a complete article ;) !

seruva19

Author

Mar 28, 2025

Thanks! There won’t be any specific training practices in my text, just some observations that someone may find useful.

nokaiMar 28, 2025

CivitAI

Does it work with Wan2.1Fun?

seruva19

Author

Mar 28, 2025

I didn't test, but, according to this comment, LoRAs are not compatible. On Banodoco discord I've seen comments that some LoRAs work, but their effect is weakened.

transitgraveMar 29, 2025

CivitAI

Awesome work! could you provide some WAN model suggestions for 12Gb vram for this lora/workflow?

seruva19

Author

Mar 29, 2025

My workflow is primarily targeted at the RTX 3090, so it might not work with 12 GB GPUs.

You can try use the Q4_K_S (or Q4_K_M) GGUF variant of the model. And GGUF-based workflow like this.

Unfortunately, any workflows will be probably very slow for 14B with 12 Gb.

thefoodmageMar 29, 2025· 6 reactions

CivitAI

I know Miyazaki is big mad at this one but hate to say it he's wrong, this is a dream for many of us.

Thank u!

ielmaoufo454Mar 29, 2025· 1 reaction

Yep to gen porn, but to art is not yet also, ghibli is not just an style is a whole composition, make a film with AI require today just a bit lees work than manually, and you have to draw by hand to develope the style, make your own lora, etc. This is the most Ghibli lora i found, wan is really impresive, cause the other ven o4 open ai are very slop.

thefoodmageMar 29, 2025· 1 reaction

@ielmaoufo454 Yes this is very high quality! Props to the creator!

kantoMar 31, 2025

Yes, he was. Insult to life itself. https://twitter.com/i/status/1904986729888784781
If he had adopted the technology, we would have seen more of his movies. I do understand his feeling not to incorporate any AI in his wotk. An artist for life, he is.

thefoodmageMar 31, 2025· 2 reactions

@kanto I'm an artist too, I use art as a tool. I mostly use AI for transformative purposes and not to just make the art for me, but I will probably be making a game using AI soon.

munchkinApr 6, 2025· 2 reactions

@kanto That "insult to life itself" is taken out of context, it was said towards some AI animated (typical learn how to walk stuff) zombie that supposedly reminded him about his crippled friend. And it was long before AI image generations as they are today. Not to say that he wouldn't have issues with AI images, he is in general prone to such takes, but quoting this is very stupid.

thefoodmageApr 6, 2025

@munchkin quoting it isn't 'stupid' it's irony. You know that man is one of the highest-profile AI haters. You know what I'm talkin' bout >.>

munchkinApr 6, 2025· 1 reaction

@thefoodmage It is stupid because you misquote it in a weird attempt to gloat, same as anti-AI for his "support" in their ignorance. The "insult to life" is about zombie, he specifically prefaced it with a story about his friend, it is the other phrase about humans losing confidence in themselves and end times that can be attributed to AI opinion.

But there is more, in this interview (2 years later): https://realsound.jp/tech/2018/10/post-270755.html there is information about why he said what he said. He said something along the lines of:

"No, no, my relationship with Kawakami-san and his staff had not changed (because of that incident)...but to put it simply, (I thought) if one extravagantly praises something like artificial intelligence, ridiculous things will happen. At the time I felt an anarchic person such as Kawakami-san didn't have the (self) restraint (to put a stop to it)"

And the next "If it wasn't for that 'attempt at humor' I would have just thought 'how strange this (zombie) looks', but it had the face of a very serious old man you know...I thought for Kawakami-san to be okay with using an old man as an 'attempt at humor' like that was a big weakness of his"

Expressing understanding, Miyazaki also says "I have many flaws myself, right? so, I thought maybe I could do something interesting together with (Kawakami-san and his team)"

Said Kawakami was working on "ARTILIFE" project, so it's not like Miyazaki was against working with AI teams or against AI as a whole, his issue was with a zombie as a joke and the fact that AI can have issues (ridiculous things) in this specific instance (Kawakami as a person) - completely different from 'big mad' about AI perspective that you have. How is a "one of the highest-profile AI haters" not against working with AI teams?

What he thinks right now, however, is unknown and any speculation would be inevitably full of bias. I think it's better to not quote it at all.

thefoodmageApr 6, 2025· 1 reaction

@munchkin I'm not on this site to talk to you I'm not reading any of that go make some art my guy 🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣🤣

munchkinApr 6, 2025

@thefoodmage Neither am I, it's not only for you - get off your high horse

thefoodmageApr 6, 2025

@munchkin nobody is on a high horse for commenting on an ai site. stop being a weirdo

munchkinApr 6, 2025

@thefoodmage Reading comprehension isn't your strong suit, figures

thefoodmageApr 7, 2025

@munchkin please stop talking to me I don't know you. Why not talk to some friends or something instead of someone who does not want to talk to you? REALLY weird stalkerish creepy activity, please go away before I block you

stevemeveMar 30, 2025· 2 reactions

CivitAI

Thank you for making this LoRA ❤️ Almost every result looks fantastic and it follows the prompt very well!

DocShotgunMar 31, 2025· 1 reaction

CivitAI

Thanks for sharing both the lora and the process!

I've been thinking about jumping into video model training with musubi-tuner (longtime sd-scripts user here for SD1/SDXL training), and only did a few quick experiments thus far using my existing datasets - and found that it really hurts motion to train on still images alone lol. Definitely gonna save these notes on processing the video data.

seruva19

Author

Mar 31, 2025

Thank you for review!

Yes, that's what I figured out as well - training on videos is VASTLY superior to still images, even if videos are not as high quality.

blipApr 1, 2025· 2 reactions

CivitAI

Incredibly valuable to share your process like that. I'm getting started trying to train wan loras with musubi tuner and this post is gold.

One thing you might want to know is that you can increase the batch size for the image datasets since you have 24gb of VRAM. I'm on a 4090 and it handled batch size 4 fine without OOMing.

You can also try using flash attention instead of sdpa and split_attn for more memory savings. (I got that from musubi-trainer github discussions).

Could save you time, but I don't have a direct comparison for the quality diff! Just sharing what I tried so far.

On a 4090 with these settings, and bumping block swap up to 18, I can do 81 frames on the [384, 208] video dataset and batch size 4 on the [768,768] image dataset without OOMing. OS is Linux so might have more VRAM to work with than Windows though.

Training commands are similar to yours, but I set flow shift to 5 and added --loraplus_lr_ratio=2
based on what someone recommended on the github discussions.

```accelerate launch --num_cpu_threads_per_process 1 \

--mixed_precision bf16 wan_train_network.py \

--task t2v-14B \

--dit /home/blip/Desktop/AI_image_generators/comfy-ui/ComfyUI/models/diffusion_models/wan2.1_t2v_14B_bf16.safetensors \

--t5 /home/blip/Desktop/AI_image_generators/comfy-ui/ComfyUI/models/text_encoders/umt5-xxl-enc-bf16.safetensors \

--vae /home/blip/Desktop/AI_image_generators/comfy-ui/ComfyUI/models/vae/wan_2.1_vae.safetensors \

--fp8_base \

--fp8_scaled \

--network_args loraplus_lr_ratio=2 \

--dataset_config /home/blip/Desktop/training/trainings/output_dir1/wanOutputs/tomls/general_ghibli_copy.toml \

--flash_attn \

--split_attn \

--mixed_precision bf16 \

--optimizer_type adamw8bit \

--learning_rate 5e-5 \

--gradient_checkpointing \

--max_data_loader_n_workers 2 \

--persistent_data_loader_workers \

--network_module networks.lora_wan \

--network_dim 32 \

--network_alpha 32 \

--timestep_sampling shift \

--discrete_flow_shift 5.0 \

--log_with tensorboard \

--logging_dir /home/blip/Desktop/training/trainings/output_dir1/wanOutputs/wan_logs \

--max_train_epochs 25 \

--save_every_n_epochs 5 \

--seed 420 \

--blocks_to_swap 18 \

--output_dir /home/blip/Desktop/training/trainings/output_dir1/wanOutputs/testRun \

--output_name testRun
```

seruva19

Author

Apr 1, 2025· 1 reaction

Thank you very much for your observations!

Regarding flash-attention, I encountered a very strange bug while training HV LoRA in musubi-tuner (that was in January though). The LoRA trained with --flash-attn had zero influence on the generated video (both during training sampling and in ComfyUI). However, when I disabled flash-attn, it trained just fine. Maybe it was just a temporary bug, so I should try again.

(upd. after I upgraded torch to 2.6.0 and flash-attention to 2.7.4, this bug is gone, at least I am observing changes during sampling).

I agree with you that increasing the batch size speeds up training, but I used to think that for LoRA (as opposed to full fine-tuning), training with a batch size of 1 was preferable, to ensure better coverage of the data. However, I admit higher batch sizes may preserve better stability. I may need to test this further.

I saw the discussion about the loraplus_lr_ratio parameter, and while it definitely speeds up training, I wanted to prioritize quality. So, I preferred a less risky (and longer) and more reliable approach with a low learning rate.

And thanks again for your suggestions - they are really valuable!

tazmannner379Apr 5, 2025· 1 reaction

CivitAI

I just trained my first WAN lora myself and it came out nice, but it was simply 50 images. Thank you so much for the detailed breakdown of your process and the training data. I will explore on a smaller scale and hopefully build something quite nice like you have here. I actually use your lora with my other anime related lora generations as I think it picks up animation style better (even non ghibli styles) with it on.

seruva19

Author

Apr 5, 2025

Thank you! I found that training on videos improves motion in general, regardless of style. I compared outputs using the same seed, with and without the LoRA, and the model was capable of producing complex interactions with fewer artifacts.

The pipeline I used for this LoRA isn't optimized yet (in terms of dataset and training routine), but I believe it's possible to achieve similar or even better results with a smaller dataset and faster training. I'm currently experimenting with other anime styles and parameters, so hopefully I'll be able to understand it better soon.

Wan 2.1 is insane model. I compared the results I got using my LoRA with other models (both open-source and closed-source, like Kling or Runway), and honestly, I haven’t seen any other showcase that offers the same fluidity, smoothness, and level of micro- and macrodetails that Wan can provide.

tazmannner379Apr 5, 2025· 1 reaction

@seruva19 Yeah I agree we really have something special with Wan. I also have an LMM prompting for me, and half the time Wan can figure out what you want without any need for a lora (beyond NSFW bits which its not trained properly on and certain camera angles). I still have training wheels on, but will try to get more and more into this and will follow your updates :)

bkdjartApr 6, 2025· 1 reaction

CivitAI

Thank you so much for sharing your insight on training. Regarding the outputs, I noticed there aren't many dynamic camera or movement shots that I'm used to seeing in Ghibli films. Is that just the limitation of the training setup being only 45 frames or something else?

seruva19

Author

Apr 6, 2025· 2 reactions

I haven't quite figured out how to prompt dynamic camera movement in Wan yet. Another reason is that fast or dynamic camera motion often leads to more video artifacts, and I try to only publish clips that are relatively clean. So naturally, there are more clips in the gallery without camera movement, because unlike dynamic shots, those usually turn out great 90% of the time.

That said, it's true that the training data didn't include a lot of dynamic scenes either. I wasn't sure the model would learn them well enough. Maybe that assumption was wrong.

I’m continuing to experiment with different prompts and trying to figure out the best way to handle dynamic scenes, so hopefully I'll get better at it with time.

blyssApr 6, 2025· 1 reaction

CivitAI

Thank you so much for this wonderful, detailed write up! I'm pleased to see my own comments were useful as well. I haven't been training as much recently because I've been focused on LLM stuff but I still try to hang around the Musubi Github to help people as I'm really fond of the project!

protector131090Apr 6, 2025

CivitAI

why are they always moving their mouths? like they talking even if its not prompted? is captioning the reason? in my anime loras this does not happen. Did you caption "talking/speaking" on clips they do this?

seruva19

Author

Apr 6, 2025

Yes, I think this might be the reason. Most captions from my dataset did not contain any mentions of characters speaking (well, some do, but I did not explicitly control this, since captioning was done fully automated). I believe putting "talking", "speaking" to negative prompt may reduce the probability of this effect, but most of my examples on this page do not include it into negative.

SingularUnityApr 7, 2025· 2 reactions

CivitAI

master anime trainer here

nzhhsApr 7, 2025· 1 reaction

CivitAI

Thank you so much for sharing your detailed approach — it’s truly inspiring!
I’m curious about how you handled scene segmentation for all those Ghibli clips. Did you use any specific scripts, tools, or methods to automatically detect camera cuts or transitions? Or was it more of a manual process where you went through each movie to pick out the coherent scenes?

seruva19

Author

Apr 7, 2025· 1 reaction

For the initial segmentation (cutting full feature movies), I just used PySceneDetect. Then I used a custom script (written by Claude) to select only clips in the range of 44-88 frames - I didn't need clips that were too short or too long. After that, yes, I manually selected 240 fragments from about 3000 clips. I was specifically looking for scenes with maximum variety, no rapid transitions, and featuring as many of Ghibli’s signature animation elements as possible (like motion flow, environmental effects, etc.)

kallamamranApr 7, 2025· 3 reactions

CivitAI

I find Ghibli has been overused already, so something else would have been nice to see. Also the Ghibli style is already rather strong in the training data, so I guess that makes it even easier to train... Having said that... You seem to have created something REALLY good here (If someone want to generate Ghibli style animation that is 😉). I can't wait to see what you train next 😊

seruva19

Author

Apr 7, 2025

Thanks! I have a lot of plans. But first, I really need to stop generating Ghibli clips. I've been doing it non-stop for the past two or three weeks, and I feel like I've fallen into some kind of dopamine trap in the process 😄

kallamamranApr 7, 2025· 1 reaction

@seruva19 Oh, I so recognize that 🤣 It so easy just getting stuck in clicking the generate button when you've managed to create something cool.

freedomguyApr 7, 2025· 3 reactions

CivitAI

Amazing work and results!
Any chance that we could see an I2V version?

seruva19

Author

Apr 7, 2025

Glad you liked it!

I don't currently plan to train a specific I2V version, but don't T2V LoRAs usually work with I2V out of the box anyway?

SD_AI_2025Apr 24, 2025· 1 reaction

If you provide a cartoon image, no matter how close from Ghibli style it actually is, this LoRA might help prevent the result from sliding towards realism, which Wan does sadly in a few frames. Abd the idea that T2V trained models worked fine on I2V was something from Wan early days (yesterday so to say ^^). But T2V trained LoRAs make the style of I2V input image slide towards realism in a few frames. Trained I2V LoRAs respect way more the style of the input.

lior007Apr 8, 2025· 1 reaction

CivitAI

I have a NVIDIA 4060TI 16GB graphics card

A 5 second video took me.....ten hours!!!!

Why?

ko81e24wy489Apr 8, 2025

Slow in with my 4070tis 16GB too.Wan video sampler take much time.

seruva19

Author

Apr 8, 2025· 4 reactions

This should not take so long, it must be falling back to shared VRAM usage, that tremendously slows down generation. Try lowering resolution and frame count, and use more aggressive block swapping (8-10).

Also, try using native workflows, they seem to have better automated memory management (https://www.reddit.com/r/StableDiffusion/comments/1j209oq/comfyui_wan21_14b_image_to_video_example_workflow/).

Try GGUF-based workflows with Q4 models.

There is a lot of Wan workflows on Civitai (https://civitai.com/search/models?baseModel=Wan%20Video&modelType=Workflows&sortBy=models_v9), some of them are already contain VRAM optimizations (like this https://civitai.com/models/1438852/wan21-low-vram-friendly)

Some people say Wan2GP (https://github.com/deepbeepmeep/Wan2GP) is optimized for low-VRAM PCs, but I haven't used it myself, so so cannot say anything about it.

But anything below 24 GB of VRAM is painful to use. I spent several days trying to find a balance between quality and speed on an RTX 3090, but I still had to sacrifice quality. That's why my clips don't contain a lot of dynamic scenes, even though the model is capable of them. To make high-quality artifact-free dynamic scenes, I would have to turn off TeaCache and increase the step count to 25–30, which could pushed generation time from 5 minutes up to 15-20 minutes per clip, which I cannot accept.

ko81e24wy489Apr 10, 2025· 1 reaction

@seruva19 Thanks!block swapping save me.

kiryanton930Apr 11, 2025· 2 reactions

Use WanVideoWrapper. It allows you to split the model into 40 pieces (WanVideo BlockSwap), and you need to start from the number 40 and go down until the memory is full. Leave some extra space for lora, garbage etc, because as soon as you exceed the VRAM size you will get 10x slowdowns. The required amount of memory depends on the number of frames and the size of the image. For a 0.3M image (640x480) and 5 seconds (81 frames) (which is equal to 25M pixels) and 16 GB of VRAM, I have the number 26, while the generation takes 21.8 s/it without tea. There are also high requirements for RAM, you must have at least 64 GB of RAM, otherwise you will age while you generate the video. Sageattention is a must have - this is a very good speed boost (and for flux too).

kassadinm16Apr 25, 2025

With my 4070Super, it takes at least 12 minutes, maybe a small detail is causing you issues.

CatzApr 9, 2025· 4 reactions

CivitAI

"trained for ~90 hours on an RTX 3090"
Geez! Thanks for the dedication to train this for everyone to enjoy

seruva19

Author

Apr 9, 2025· 1 reaction

Thank you! <3

I'm sure this result could be achieved at a lower cost (by applying different training parameters and a smaller dataset). In fact, tomorrow I will start training of my next LoRA, and I hope the results will prove this point :)

tonythetediousti3673Apr 10, 2025

CivitAI

Any chance anyone has gotten an I2V workflow going? I've tried, but I'm ComfyUI illiterate...

seruva19

Author

Apr 10, 2025

Hmm, you can try the GGUF workflow I tested for I2V-480p: https://files.catbox.moe/9niq1g.json (and it should work with 720p too).
But I don't really do I2V, I just tested it to see if it works with my LoRA, and then forgot about it. It's probably not optimized.

supaidaman9738Apr 10, 2025· 3 reactions

CivitAI

gibli art style is definitely the peak of anime style.. so mesmerizing to watch each and every clip

qubick0Apr 13, 2025· 1 reaction

CivitAI

I was wondering if you could kindly share what the final loss rate was for training this model, or perhaps what you consider to be an appropriate range? I'm currently working on creating a LoRA, but I've noticed that the loss rate remains quite high (fluctuating between 0.1 and 0.06) and decreases very slowly. Unfortunately, I haven't been able to find much information online regarding loss-related details. If you could offer any insights or assistance, I would be deeply grateful.

seruva19

Author

Apr 13, 2025

Sure, I can share details if you think they'll be helpful.

There were a total of 9 training sessions. Accelerate doesn't allow merging them into a single run (I tried several solutions, but none worked for me), so I made a custom script to merge all the loss graphs into one. Here's how it looks:

https://ibb.co/7tjvy2c7

My loss was always pretty high and never went below 0.1.

As for the individual runs, here's a sloppy composite image of them, just in case:

https://ibb.co/0p1FWWww

There were strong peaks at the beginning of each session because I resumed from an existing state, but pretty soon, in each run, the loss would settle to the same level as at the end of the previous session.

I don't fully trust loss graphs, so I did a thorough test of all checkpoints from 10K to 29K steps and selected the one whose outputs I liked aesthetically. And it was not the one with lowest loss (which happened around 22K steps with loss 0.1119), the one I chose had loss 0.1158 and it was checkpoint at 22638 steps. It was slightly overtrained, but that was intentional - I wanted to force a very strong "Ghibli aesthetic", so a bit of overfitting was acceptable in my case. For reference, there were already some nice-looking checkpoints as early as 12K steps, but I was aiming for the "best of the best".

In my opinion, observing the training loss doesn’t provide much insight into the actual quality of the trained model (as to this article, which I already mentioned in main text), and it's better to judge by the validation loss. Unfortunately, at the time of writing, musubi-tuner doesn't support validation loss estimation.

qubick0Apr 14, 2025· 1 reaction

@seruva19 Thank you very much for your generous sharing. I will further review the LoRA's performance based on actual test results to ensure it is in its optimal state.

LazmanApr 17, 2025· 1 reaction

CivitAI

Looks like you've managed to improve on it some since the Hunyuan version. Much cleaner, not so blurry. I'll have to give the Wan model a full test run once I get my new 3D printer sorted out. And once I do, I look forward to trying this to see what it can do.

Btw, when you train styles, is that all you do is just use one keyword, or set of keywords? I'm just curious. I haven't tried training styles yet, but I have tried training a city square area(Dundas square in Toronto, using 50-80 images (don't remember how many, but around that)), and that did not work out well at all. I'll get back to tryin that one of these days.

seruva19

Author

Apr 17, 2025

I usually just auto-caption the composition, structure, objects, and subjects of the scene, without mentioning any style details at all. Then I prefix the caption with a meaningful sentence, like “Studio Ghibli style.” in case of this LoRA. In my experience, when training on images or videos that represent the same art style, the model usually learns the style anyway, but extra trigger word makes it a bit more flexible to control. Well, at least that was the case in Flux, but I think WanVideo works the same way.

LazmanApr 17, 2025· 1 reaction

@seruva19 Yea, as a general rule, a trigger word is good. Cuz the style may work well on it's own, but then, if you use more than one lora, you've got less granular control over the blend without the trigger.

Do you know if area Loras work the same as style Loras? Like, have you tried making a lora intended to recreate an entire detailed area, anime or otherwise?

Side note; I wonder if anyone's created a Naruto style lora for WAN.. Was just looking at your avatar, and thinkin, a Naruto lora of the same quality would be amazing.. the epic fights, the rasengan, the meteor planet destroying thing they do in the later seasons, etc. not to mention the insane amount of Justus..

What's more, if it could be used to turn it live action..

I've always thought that with the amount of content in that show, that it's potential has been wasted. No live action, no games that truly do it justice.

I mean, ninja storm4 and all the dlc for that, were alright. It's a good game for competition if you've got a friend to play it with, but an open world Naruto game with different playable characters and paths and methods to traverse it, Justus to learn, Senseis to train with, etc.. but it seems they're incapable of creating anime games that aren't just fighting games with some cutscenes and padding, or oldstyle RPGs.

seruva19

Author

Apr 17, 2025

@Lazman I've only been training style LoRAs so far, so can't say much about area LoRAs, sorry. But I do plan to work on some exciting (at least I believe so!) concept LoRAs for Wan soon, and I will share my experience.

While I don't plan to make a specific Naruto model, fast-paced action anime is my holy grail too. One day, I want to create a realistic martial arts anime (similar to 80s HK action movies). So I will research this direction.

dusterrrrMay 3, 2025· 2 reactions

CivitAI

this is def the best WAN lora on civit ai rn. glad to see such hard work pay off, congrats bro.

joesixpaqMay 4, 2025· 2 reactions

CivitAI

Thank you for sharing these precious tips with us! Great job!

You mentioned that you had to restart training like 9 times. AFAIU, you needed to specify the saved state folder as --resume <FOLDER>, and load the weights to continue from as --network_weights <PATH2MODEL>.

Did you use model.safetensors in the corresponding STATE folder or the last save LoRA itself?

seruva19

Author

May 4, 2025· 1 reaction

Thank you!
I resumed from a LoRA full state (not a safetensors file), so I used --save_state and --resume /path/to/saved/state.

joesixpaqMay 4, 2025

@seruva19 Thank you for your clarifications!

tosermeplsAug 3, 2025· 3 reactions

CivitAI

Thank you for the detailed write-up on the lora training. Trying to get into WAN training myself and this is very useful.

LolofaOct 11, 2025· 1 reaction

CivitAI

Hi and thanks for your work !
I had posted a comment about the awkward feeling in some of the recent generations compared to the "real" ghibli feeling of the older one, and you answered me (I see it in my menu) but the post seems to be deleted so i can't see it

Your response started like this :
"Early videos were made without any acceleration LoRAs, only with TeaCache enabled. Lastest ~2K ..."

I am very interested in the rest of what you answered, and i'm sure other too !

seruva19

Author

Oct 11, 2025

Hi! Here is link to that video https://civitai.com/images/104769463

My response is still there, but I will replicate it here as well:

Early videos were made without any acceleration LoRAs, only with TeaCache enabled. Latest ~2K videos (and about 2500 more videos I am going to publish) use lightx2v LoRA, which, sadly, sometimes tend to kill 2D animation motion and enforce smooth 3D realistic style.

So I 100% agree, earlier clips were better, but each clip took about 7 minutes to wait, while accelerated videos take 2 minutes each.

LolofaOct 13, 2025· 1 reaction

@seruva19 Oh I understand, thanks for your answer !
2500 ?? You can make the next ghibli with such generation powa !!

(PS: The link to that video https://civitai.com/images/104769463 give me error 404)

seruva19

Author

Oct 13, 2025

@Lolofa I have around 300 clips saved up that were made without acceleration LoRAs, they have a more Ghibli-like animation style and color palette. I'll publish them last, after all the pending clips are out.

In general, this LoRA has quite a few flaws, I can definitely say that now after generating over 10K clips with it 🙂. The next version of the Ghibli LoRA (either for Wan 2.2 or Wan 2.5, if it will be open-weighted) will be more consistent (hopefully) in capturing hand-drawn 2d animation feel. However, I already have three other style LoRAs planned before I return to reworking the Ghibli style, so I can't say for sure when that will happen.

tingtinginNov 12, 2025

CivitAI

Do you have examples of how your training progress through epochs?

The reason I'm asking is I'm currently attempting to train some loras and after a bit (around a few hundred steps) they settle in with the loss not decreasing anymore and the samples (generated during training) not changing anymore being the exact same generations even after another 1000 steps i'm wondering since the steps you did were so high do you have examples of the progression through epochs or steps that one should expect when training a lora like this?

Another question is how many steps do you think is necessary i see you did 28800 steps but i've seen others doing around 2400 or even less i know that the step count can vary but im more trying to make sense of the 10x difference

seruva19

Author

Nov 12, 2025

I shared some loss graphs in this comment:

https://civitai.com/models/1404755?modelVersionId=1587891&dialog=commentThread&commentId=773265&highlight=773325

The loss may flatten for a while, I'd suggest waiting a bit longer, maybe 500-700 steps more. The visual samples not changing is more concerning. Sadly I didn't save training samples for my LoRA, but I remember that the outputs started to change since the first epoch and lean towards the target style after around 4000 steps. (This was probably helped by the fact that Wan 2.1 already understands a generic anime style, so it could "grasp" Ghibli-specific art features quite easily.)

If your samples don't seem to change (literally) at all during training, that might be caused by a specific version or combination of PyTorch and Flash Attention. I once had a case where the LoRA wasn't training at all (all samples looked the same), upgrading torch fixed it.

You can also try pushing the learning rate to some really high value (2e-4) and train for 500-1000 steps, just to make sure the problem is with hyperparameters and not with the training pipeline itself.

That applies to Wan 2.1. As for Wan 2.2, I'm still trying to find the optimal training routine, right now I'm training my 5th version of a high-noise LoRA, because I wasn't satisfied with the effect I got from the earlier ones. But you might find some useful information about Wan2.2 training in this comprehensive tutorial https://civitai.com/articles/20389/tazs-anime-style-lora-training-guide-for-wan-22-part-1-3

Considering number of steps: personally, I prefer large datasets with low learning rates, training close to the point of overfitting, because the default Wan tends to lean toward realism, so achieving good generalization for non-realistic or 2D styles requires longer training. That's why I usually train for a lot of steps. But with small datasets and a moderately high learning rate (e.g., around 100 images + 50 videos, lr = 8e-5), and if the target style isn't too specific, you can often get a stable style after about 3000 steps (see this LoRA for example https://civitai.com/models/1132089/flat-color-style?modelVersionId=1474944).

qiaoy123Jan 18, 2026· 1 reaction

CivitAI

Your lora is much better than mine,and I found 2.2 seems can make ghibli style video without any lora,just less consistently.And your lora still works on 2.2.

seruva19

Author

Jan 19, 2026

For a first attempt, your LoRA is quite solid, and Wan 2.2 is generally challenging to train anyway. And more Ghibli LoRAs are always a good thing 🙂

My LoRA does work with 2.2, but it doesn't capture the essence of style as strongly as I'd hoped. At this point, I'm not sure I'll continue with a 2.2 LoRA, and I may wait for Wan 3.0 Mini (or whatever comes next).

qiaoy123Jan 19, 2026· 1 reaction

@seruva19 From my personal experience—and based on the tests I’ve run over the past couple of days—I actually think my LoRA doesn’t perform as well as yours, even when used with WAN 2.2. I suspect that WAN 2.2’s training dataset already includes a lot of Spirited Away content, which might explain why. When I compared results using the same seed, my LoRA often didn’t produce better outcomes—in fact, it mostly just added a bit more consistency in generating Ghibli-style outputs.

Given that, I don’t think my LoRA currently has much reason to exist. Your LoRA works great with WAN 2.2, and that’s more than enough for now. If a future version of WAN is released—one that no longer inherently supports the Ghibli style—I’ll consider expanding my dataset and training a new LoRA from scratch. But for the time being, your LoRA is absolutely sufficient.

So, I’ve decided to shift my focus to developing other LoRAs—ones that capture more unique and distinctive artistic styles.

seruva19

Author

Jan 19, 2026

@qiaoy123 That makes a lot of sense. Thanks for the kind words about my LoRA. And best of luck with your future LoRAs, teaching styles is really fun and exciting.

LORA

Wan Video 14B t2v

by seruva19

Download (Beta) View on CivitAI