Simple WAN T2V Workflow for Self Forcing
Self Forcing trains autoregressive video diffusion models by simulating the inference process during training, performing autoregressive rollout with KV caching. It resolves the train-test distribution mismatch and enables real-time, streaming video generation on a single RTX 4090 while matching the quality of state-of-the-art diffusion models.
Update (i2v):
To use Vace, you will need to use a different checkpoint: https://huggingface.co/lym00/Wan2.1-T2V-1.3B-Self-Forcing-VACE/blob/main/Wan2.1-T2V-1.3B-Self-Forcing-DMD-VACE-FP16.safetensors
Download self_forcing_dmd.pt from https://huggingface.co/gdhe17/Self-Forcing/tree/main/checkpoints and use it as the t2v checkpoint.
Project website: https://self-forcing.github.io/
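The VACE link above points at a Hugging Face file-viewer page (`/blob/`), and the t2v link points at a folder listing (`/tree/`); neither is the file itself. Here is a minimal sketch of turning them into direct-download URLs, assuming the usual Hugging Face `/resolve/` URL layout and assuming `self_forcing_dmd.pt` sits directly in the `checkpoints` folder as the text says. The helper name `blob_to_resolve` is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def blob_to_resolve(url: str) -> str:
    """Turn a Hugging Face file-viewer URL (/blob/) into a direct-download URL (/resolve/)."""
    parts = urlsplit(url)
    # Path layout: /<user>/<repo>/blob/<revision>/<file path...>
    segments = parts.path.split("/")
    if len(segments) > 3 and segments[3] == "blob":
        segments[3] = "resolve"
    return urlunsplit(parts._replace(path="/".join(segments)))


# VACE checkpoint page, copied from the description above.
VACE_PAGE = (
    "https://huggingface.co/lym00/Wan2.1-T2V-1.3B-Self-Forcing-VACE/blob/main/"
    "Wan2.1-T2V-1.3B-Self-Forcing-DMD-VACE-FP16.safetensors"
)

# t2v checkpoint: folder from the description plus the file name it mentions
# (assumption: the file sits directly in that folder).
T2V_URL = "https://huggingface.co/gdhe17/Self-Forcing/resolve/main/checkpoints/self_forcing_dmd.pt"

print(blob_to_resolve(VACE_PAGE))
print(T2V_URL)
```

The printed URLs can be passed to any downloader; where the files go inside your ComfyUI install depends on which loader nodes the workflow uses.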
Comments (17)
real time! omg omg omg omg
Thanks for sharing. I can also confirm it seems to work with LoRAs trained on the 1.3B model.
I get "mat1 and mat2 shapes cannot be multiplied (154x768 and 4096x1536)" with this workflow. I've already updated ComfyUI. Is anyone else seeing this?
My bad, I was using the Kijai version of the umt5xxl encoder. I'll leave the comment up in case anyone else has the issue.
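For anyone else hitting that error, here is a rough sketch of what the message means. A matrix product needs the inner dimensions to match, and here the text embeddings arriving at the model are 768 wide while the weight they are multiplied with expects 4096. The function `matmul_shape` is a stand-in written for this example, not part of PyTorch or ComfyUI, and reading the two shapes as "encoder output" and "model weight" is an assumption based on the fix described above.

```python
def matmul_shape(a, b):
    """Return the result shape of a 2-D matrix product, or raise if the shapes do not fit."""
    (m, k1), (k2, n) = a, b
    if k1 != k2:
        # Same wording as the error quoted in the comment above.
        raise ValueError(
            f"mat1 and mat2 shapes cannot be multiplied ({m}x{k1} and {k2}x{n})"
        )
    return (m, n)


# Shapes taken from the error message.
wrong_embeddings = (154, 768)   # assumed: output of the mismatched text encoder
weight = (4096, 1536)           # assumed: model weight expecting 4096-wide input

try:
    matmul_shape(wrong_embeddings, weight)
except ValueError as err:
    print(err)

# With 4096-wide embeddings the product is defined.
print(matmul_shape((154, 4096), weight))
```

If the first number pair in your own error differs from what the model expects, the text encoder file is the first thing to check.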
i2v possible?
Yes, with Vace. You can use ComfyUI-WanVideoWrapper to patch in Vace as a module. I wasn't thrilled with the results, though.
I added an i2v workflow that uses Vace!
How can you access the preview frames as the video is rendering (as the paper mentions)? It has to go through the VAE Decode phase, and only shows you the final compiled video.
It's a setting in ComfyUI Manager. Go to the Manager (where you update extensions and such); in the left column, there's a preview setting.
I know what you're thinking. From what the paper suggests, you will need to implement sliding attention to get past the 5-second mark.
Works great with an RTX 4080 16GB. 20s with the default settings is excellent.
WOW
This is the fastest I've gotten so far.
Using ZLUDA with a 6800, I got the following results:
--- Self Forcing (8 steps, length 53, shift 8, 30 fps, fp8 scaled)
192x256 ~1.3s/it (50.47s)
240x320 ~1.4s/it (52.34s)
368x480 ~4.8s/it (84.83s)
368x512 ~5.5s/it (89.01s)
384x512 ~6.0s/it (93.66s)
480x720 ~15.3s/it (319.99s)
576x720 ~21.3s/it (424.91s)
832x368 ~11.4s/it (226.5s)
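A quick way to read those numbers: if the s/it figure is the time per sampling step (an assumption, the comment does not say), then 8 steps account for only part of each total, and the rest goes to everything else in the run. The sketch below splits each total that way; `breakdown` is a name made up for this example.

```python
STEPS = 8  # from the settings line above

# resolution -> (approx. seconds per iteration, total seconds), copied from the list above
results = {
    "192x256": (1.3, 50.47),
    "240x320": (1.4, 52.34),
    "368x480": (4.8, 84.83),
    "368x512": (5.5, 89.01),
    "384x512": (6.0, 93.66),
    "480x720": (15.3, 319.99),
    "576x720": (21.3, 424.91),
    "832x368": (11.4, 226.5),
}


def breakdown(sec_per_it, total_s, steps=STEPS):
    """Split a total run time into sampling time and the remainder."""
    sampling = sec_per_it * steps
    return round(sampling, 2), round(total_s - sampling, 2)


for res, (spi, total) in results.items():
    sampling, other = breakdown(spi, total)
    print(f"{res}: ~{sampling}s sampling, ~{other}s everything else")
```

At 192x256, for instance, that is about 10.4s of sampling out of 50.47s, so at small resolutions most of the wall time is outside the sampler under this reading.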
