CivArchive
    WAN 2.1 IMAGE to VIDEO with Caption and Postprocessing - v1.0
    NSFW

    Workflow: Image -> Autocaption (Prompt) -> WAN I2V with Upscale and Frame Interpolation and Video Extension

    • Creates video clips at up to 480p resolution (720p with the corresponding model)

    There is a Florence Caption version and an LTX Prompt Enhancer (LTXPE) version. LTXPE is heavier on VRAM.

    The LTX Prompt Enhancer (LTXPE) might have issues with the latest ComfyUI and Lightricks updates.


    MultiClip: Wan 2.1 I2V version supporting the Fusion X Lora to create clips with 8 steps and extend them up to 3 times; see the posted examples with 15-20 sec of length.

    The workflow creates a clip from the input image and extends it with up to 3 clips/sequences. It uses a color-match feature to keep color and lighting consistent in most cases. See the notes in the workflow for full details.

    There is a normal version which lets you use your own prompts, and a version using LTXPE for autoprompting. The normal version works well for specific or NSFW clips with Loras; the LTXPE version is made to just drop in an image, set width/height and hit run. The clips are combined into one full video at the end.

    Update, 16 July 2025: A new Lora, "LightX2v", has been released as an alternative to the Fusion X Lora. To use it, switch the Lora in the black "Lora Loader" node. It can create great motion with only 4-6 steps: https://huggingface.co/lightx2v/Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v/tree/main/loras

    More info/tips & help: https://civarchive.com/models/1309065/wan-21-image-to-video-with-caption-and-postprocessing?dialog=commentThread&commentId=869306


    V3.1: Wan 2.1 I2V version supporting the Fusion X Lora for fast processing

    Fusion X Lora: processes the video with just 8 steps (or fewer; see notes in the workflow). It does not have the issues of the CausVid Lora from V3.0 and does not require a color-match correction.

    Fusion X Lora can be downloaded here: https://civarchive.com/models/1678575?modelVersionId=1900322 (i2V)


    V3.0: Wan 2.1 I2V version supporting the Optimal Steps Scheduler (OSS) and the CausVid Lora

    • OSS is a newer ComfyUI core node that allows a lower number of steps with a boost in quality. Instead of using 50+ steps, you can get the same result with around 24 steps. https://github.com/bebebe666/OptimalSteps

    • CausVid uses a Lora to process the video with just 8-10 steps; it is fast, at a lower quality. The workflow contains a Color Match option in postprocessing to cope with the increased saturation the Lora introduces. The Lora can be downloaded here: https://huggingface.co/Kijai/WanVideo_comfy/tree/main

      (Wan21_CausVid_14B_T2V_lora_rank32.safetensors)

    • Both have a version with Florence or the LTX Prompt Enhancer (LTXPE) for captioning, can use Loras and have Teacache included.


    V2.5: Wan 2.1 Image to Video with Lora support and Skip Layer Guidance (improves motion)

    There are 2 versions: the Standard version with Teacache, Florence caption, upscale, frame interpolation etc., plus a version with the LTX Prompt Enhancer as an additional captioning tool (see notes for more info; requires custom nodes: https://github.com/Lightricks/ComfyUI-LTXVideo).

    For Lora use, it is recommended to switch to your own prompt with the Lora trigger phrase; complex prompts might confuse some Loras.


    V2.0: Wan 2.1 Image to Video with Teacache support for the GGUF model; speeds up generation by 30-40%

    It will render the first steps at normal speed and the remaining steps at higher speed. There is a minor impact on quality with more complex motion. You can bypass the Teacache node with Ctrl-B.

    Example clips with workflow in Metadata: https://civarchive.com/posts/13777557

    Info and help with Teacache: https://civarchive.com/models/1309065/wan-21-image-to-video-with-caption-and-postprocessing?dialog=commentThread&commentId=724665


    V1.0: WAN 2.1 Image to Video with Florence caption or your own prompt, plus upscale, frame interpolation and clip extend.

    The workflow is set up to use a GGUF model.

    When generating a clip, you can choose to apply upscaling and/or frame interpolation. The upscale factor depends on the upscale model used (2x or 4x; see the "load upscale model" node). Frame interpolation is set to increase the frame rate from 16 fps (model standard) to 32 fps. The result is shown in the "Video Combine Final" node on the right, while the left node shows the unprocessed clip.
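    The frame-interpolation step above doubles the frame count, not the clip length. A minimal sketch of that arithmetic (the function name and the 81-frame clip length are illustrative assumptions, not values taken from the workflow):

```python
def interpolated_frames(n_frames: int, factor: int = 2) -> int:
    # Assumption: the interpolator simply multiplies the frame count.
    # Some RIFE-style implementations yield factor * n - (factor - 1).
    return n_frames * factor

native_fps = 16          # Wan 2.1 model standard frame rate
frames = 81              # an example Wan clip length (~5 s at 16 fps)
doubled = interpolated_frames(frames)

# Duration is preserved: same seconds, twice the frames per second.
print(frames / native_fps)            # seconds before interpolation
print(doubled / (native_fps * 2))     # same seconds at 32 fps
```

    This is why interpolated clips look smoother rather than longer: playback speed stays the same while motion is sampled twice as densely.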

    It is recommended to use "Toggle Link visibility" to hide the cables.


    Models can be downloaded here:

    Wan 2.1 I2V (480p): https://huggingface.co/city96/Wan2.1-I2V-14B-480P-gguf/tree/main

    Clip (fp8): https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/text_encoders

    Clip Vision: https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/clip_vision

    VAE: https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/tree/main/split_files/vae


    Wan 2.1 I2V (720p): https://huggingface.co/city96/Wan2.1-I2V-14B-720P-gguf/tree/main

    Wan 2.1 Text to Video (also works): https://huggingface.co/city96/Wan2.1-T2V-14B-gguf/tree/main


    Location to save these files within your ComfyUI folder:

    Wan GGUF Model -> models/unet

    Textencoder -> models/clip

    Clipvision -> models/clip_vision

    Vae -> models/vae


    Tips:

    • Lower the frame rate in the "Video Combine Final" node from 30 to 24 for a slow-motion effect

    • You can use the Text to Video GGUF model; it works as well.

    • If the video output shows strange artifacts at the very right side of a frame, try changing the "divisible_by" parameter in the "Define Width and Height" node from 8 to 16; this may better match the standard Wan resolutions and avoid the artifacts.

    • See this thread if you face issues with the LTX Prompt Enhancer: https://civarchive.com/models/1823416?dialog=commentThread&commentId=955337

    • Last Frame: if you have trouble finding the node pack for that node: https://github.com/DoctorDiffusion/ComfyUI-MediaMixer
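    The "divisible_by" tip above amounts to snapping the requested width/height to a multiple of 8 or 16. A minimal sketch of that rounding (assuming the node rounds down; it may round to nearest instead):

```python
def snap_down(dim: int, multiple: int) -> int:
    """Round a requested dimension down to the nearest multiple,
    mirroring what a 'divisible_by' setting does to width/height."""
    return (dim // multiple) * multiple

# The 450-px-wide portrait case from the comments below:
print(snap_down(450, 8), snap_down(450, 16))   # both snap to 448
# A case where 8 and 16 actually differ:
print(snap_down(440, 8), snap_down(440, 16))   # 440 vs 432
```

    Dimensions that are already multiples of 16 pass through unchanged, which is why switching from 8 to 16 only affects some resolutions.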


    Comments (21)

    GK_Artist · Mar 1, 2025 · 2 reactions

    @tremolo28 thx for creating a great workflow again! going to experiment with it.

    tremolo28 (Author) · Mar 1, 2025

    Hope it works for you.

    GK_Artist · Mar 2, 2025 · 1 reaction

    @tremolo28 yes working great! (as always with your workflows)

    GK_Artist · Mar 3, 2025

    @tremolo28 Which workflow, model and/or lora do you use to generate the images? They are so sharp and good quality.

    tremolo28 (Author) · Mar 3, 2025 · 1 reaction

    @GK_Artist For the input images, I used the usual Flux Dev model (for some pics SDXL Pony). For the clips, I used a resolution of 368p, 384p or 400p, and as an upscaler I applied RealESRGAN_x2.pth (2x). As the WAN video model, I took the GGUF Q4_K_M model.

    GK_Artist · Mar 3, 2025

    @tremolo28 can you share the workflow of the image generation?

    tremolo28 (Author) · Mar 3, 2025

    @GK_Artist If you mean the image generation, I use Forge for that, so I have no workflow, but I can post a PNG with the Forge metadata in it. Or do you mean the video workflow?

    GK_Artist · Mar 3, 2025

    @tremolo28 forge is also great

    tremolo28 (Author) · Mar 3, 2025 · 1 reaction

    GK_Artist · Mar 3, 2025

    @tremolo28 what to do against the shaking camera?

    tremolo28 (Author) · Mar 3, 2025

    @GK_Artist I did not test the camera options yet, but I can imagine the model interprets the movement from the input image, i.e. a selfie shot tends to have camera shake. To counter this, I would try playing with CFG, the number of steps and the prompt (e.g. use terms like steady view/camera), maybe lower the Florence detail prompting, or add text to the negative prompt. But it is just a guess...

    lolaussie456 · Mar 2, 2025 · 1 reaction

    I used your workflow and identical setup, but the output is very trippy. I wonder if this is because I am on a Mac.

    tremolo28 (Author) · Mar 3, 2025

    Hey, what do you mean by trippy exactly?

    EliteLensCraft · Mar 3, 2025

    Got the same issue:
    https://civitai.com/images/61340436
    Maybe it's because of portrait images?

    tremolo28 (Author) · Mar 3, 2025

    @EliteLensCraft The width of that video is 450; maybe try a value divisible by 8, like 448? Are you using a Mac as well?

    And Thanks a lot for the buzz :)

    EliteLensCraft · Mar 4, 2025 · 1 reaction

    @tremolo28 Ah okay, my bad, got confused by the upscale; nope, on a Windows machine.

    JCD007 · Mar 3, 2025

    For the Florence model, clone the repo into models/LLM:

    git clone https://huggingface.co/microsoft/Florence-2-large

    Please add this to your page OP if possible.

    tremolo28 (Author) · Mar 3, 2025

    This model is supposed to autodownload when you first use the workflow. See the model node in the Florence section of the workflow.

    JCD007 · Mar 3, 2025 · 1 reaction

    @tremolo28 It did not download. There was no model to select; not sure why. But I hope this helps anyone else this happened to.

    MafiaPlay · Mar 3, 2025

    Awesome. The output is comparable to paid models such as Hailuo. The only problem is that it takes 12 minutes on an RTX 3090 with 24 GB VRAM to process a 3-second 480p video with upscaling. Hopefully it will get much faster in due time.

    tremolo28 (Author) · Mar 3, 2025 · 1 reaction

    I am currently testing a Teacache setup; just uploaded V2 with Teacache support to speed up processing time.

    Workflows
    Wan Video

    Details

    Downloads: 682
    Platform: CivitAI
    Platform Status: Deleted
    Created: 3/1/2025
    Updated: 4/21/2026
    Deleted: 4/16/2026

    Files

    wan21IMAGEToVIDEOWith_v10.zip