Update: I'm in the process of doing a test render with my comfy workflow, which I'll post shortly. It may be able to handle 81 frames at 720P on a 4090. Use this ComfyUI module to load the checkpoint: https://github.com/silveroxides/ComfyUI_bnb_nf4_fp4_Loaders
These are NF4 Quantizations of the Wan video generation AI. They work really well.
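For anyone wondering what "NF4" means here: it's the 4-bit NormalFloat format from bitsandbytes, which stores weights in 4 bits plus per-block scales. A minimal sketch of the idea in Python (illustrative only; the 5120-wide shape is an assumption sized like Wan's layers, and this is not the exact conversion script used for these checkpoints):

import torch
import bitsandbytes.functional as bnbF

# A single fp16 weight matrix, sized like one of Wan's 5120-wide layers.
weight = torch.randn(5120, 5120, dtype=torch.float16, device="cuda")

# Quantize to 4-bit NormalFloat; returns packed bytes plus the quant state
# (per-block absmax scales etc.) needed to dequantize later.
q_weight, quant_state = bnbF.quantize_4bit(weight, quant_type="nf4")

# At inference time the weights are dequantized back to fp16 on the fly.
restored = bnbF.dequantize_4bit(q_weight, quant_state)

print(weight.nelement() * weight.element_size())      # ~50 MB in fp16
print(q_weight.nelement() * q_weight.element_size())  # ~12.5 MB packed (4 bits/weight)
print((weight - restored).abs().mean())               # small quantization error

The quality holds up well because NF4's quantization levels are spaced for normally distributed weights rather than uniformly.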
Comments (55)
Thank you!
Thanks. Will these work in ComfyUI and if not, any plans to submit a PR?
It seems to work with this node: https://github.com/comfyanonymous/ComfyUI_bitsandbytes_NF4
@hiben40387 not working - ERROR: Could not detect model type of: C:\models\checkpoints\wan21NF4_i2v14B480pNf4.safetensors
@ai_wifus That is strange, it worked for me. city96 just released GGUF versions; you can give those a try, they should be more easily compatible.
@hiben40387 This custom node is working, but when I run the workflow, I get an error saying some tensors are on the CPU while others are on the GPU. BTW my GPU has 12GB VRAM. Is there a way to fix this?
@ai_wifus Reading through that GitHub, it seems it's a memory leak error; someone made a fork to fix it. That being said, for 12GB you are better off using the GGUF Q4 from the link I gave; use it with these two nodes. The two together considerably cut down the VRAM needed.
@hiben40387 thanks for the reply! Currently I'm using the Q5 GGUF and it works great!
@hiben40387 I have 12GB as well and ran into the same issue. You can use the --gpu-only argument, and I read about the other node fixing the memory leak (hopefully other nodes fix it too).
I have been testing full vs GGUF vs NF4. I assume NF4 is easier on memory than the full model, right? GGUF is nice but slow, and I found I can run the full 480p models, or 720p with a 480p latent size. Or are GGUF and NF4 about the same speed-wise?
I am still getting round to this test. I assume it is a UNet load; I know some NF4 models, like Flux, ended up being checkpoint loads.
Does this speed up the workflows? (I'm on a 3090, so usually VRAM isn't an issue lol... usually)
And, forgive this if it's a bit of a noob question... but where do we put them and how do we load them? Guessing they won't load with the default checkpoint loader node?
You should be able to render more frames due to the memory savings. FP8 may actually be faster.
I used this NF4 model with the node mentioned, and I get nearly the same VRAM usage (about 21-22GB) and speed as the official FP8 scaled model. I don't understand why, but I guess I won't use this anymore.
@Garland It doesn't cut usage in half, but I definitely do better than that with it. That being said, if it doesn't help you fit the model into your card's VRAM, the FP8 model is faster and the quality is marginally better. The only reason to use this is to save VRAM.
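A rough back-of-envelope on why weight-only quantization never cuts total VRAM proportionally (the 14B parameter count is an assumption taken from the "i2v14B" filename; activations, latents, the VAE, and the text encoder are untouched by it):

# Weight memory for a 14B-parameter transformer at different precisions.
PARAMS = 14e9  # assumed from the "i2v14B" filename

for name, bits in [("fp16", 16), ("fp8", 8), ("nf4/q4", 4)]:
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name:>7}: ~{gb:.0f} GB of weights")

# fp16: ~28 GB   fp8: ~14 GB   nf4/q4: ~7 GB
# Peak VRAM adds latents, attention buffers, the VAE decode pass and the
# text encoder on top, which is why observed savings are well short of 4x.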
[enforce fail at alloc_cpu.cpp:115] data. DefaultCPUAllocator: not enough memory: you tried to allocate 362387865600 bytes. I am getting this on a 3090, any ideas?
What are you using?
If you're trying to generate 81 frames on 24GB at 720P, you'll get OOM. Try reducing the resolution to 480P or doing 41 frames.
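To see why frame count and resolution matter so much, here is a rough token-count estimate (the 8x spatial / 4x temporal VAE compression and 2x2 patchification factors are assumptions about Wan's architecture; adjust if the real factors differ):

# Rough sequence-length estimate for Wan video generation.
# ASSUMED factors: 8x spatial VAE compression, 4x temporal compression,
# 2x2 patchification in the DiT.
def tokens(width, height, frames):
    lat_w, lat_h = width // 8 // 2, height // 8 // 2
    lat_f = (frames - 1) // 4 + 1  # causal VAE keeps the first frame
    return lat_w * lat_h * lat_f

for w, h, f in [(1280, 720, 81), (1280, 720, 41), (832, 480, 81)]:
    print(f"{w}x{h}, {f} frames -> {tokens(w, h, f):,} tokens")

# 1280x720, 81 frames -> 75,600 tokens
# 1280x720, 41 frames -> 39,600 tokens
#  832x480, 81 frames -> 32,760 tokens
# Self-attention cost grows with the square of the token count, so halving
# the frames or dropping to 480p shrinks memory far more than linearly.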
I'm also getting the same error message. I'm using SwarmUI, and I have an RTX 4090. I downloaded the 480p version. It's not working regardless of the frame count. Can you provide a solution?
You get this error if you use the stock Comfy UNet loader node (which existing workflows use); you need the one called "Load FP4 or NF4 Quantized Checkpoint Model" from https://github.com/silveroxides/ComfyUI_bnb_nf4_fp4_Loaders/blob/master/__init__.py#L178C16-L178C58
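If you drive ComfyUI over its HTTP API instead of the graph editor, the same swap looks roughly like this (the class and input names are assumptions based on the repo's CheckpointLoaderNF4 node; check the actual node definition):

import json
import urllib.request

# Hypothetical workflow fragment: only the loader node is shown.
prompt = {
    "1": {
        "class_type": "CheckpointLoaderNF4",  # instead of the stock loader
        "inputs": {"ckpt_name": "wan21NF4_i2v14B480pNf4.safetensors"},
    },
    # ... remaining nodes: text encode, WanImageToVideo, KSampler, VAEDecode
}

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": prompt}).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)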
For all those frustrated with the poor description: you'll need these nodes to load the NF4 model
https://github.com/silveroxides/ComfyUI_bnb_nf4_fp4_Loaders
This custom node is working, but when I run the workflow, I get an error saying some tensors are on the CPU while others are on the GPU. BTW my GPU has 12GB VRAM. Is there a way to fix this?
@ai_wifus Here's my workflow. Check your resolution. It should be 1280x720 or 832x480 (or the portrait version of those). When mine was bigger than that, I got that error.
It would be great if you could share an example workflow for ComfyUI.
Thanks ;)
I just found example workflows: https://github.com/comfyanonymous/ComfyUI_examples/tree/master/wan
I'll be posting mine shortly.
Great! Both 720p and 480p i2v work nicely on a 4090 (24GB VRAM) with the fp8_e4m3fn_scaled text encoder. (The fp16 TE seems to require more VRAM when using 720p.)
I used these example workflows, just changed the UNET loader node to the NF4 one.
Yeah, I've been doing 41 frames at 720p. 81 is too many even for my 4090.
Update: I can do 81 frames at 1152x640. I hadn't tried it because it wasn't one of the "official" recommended resolutions, but it works great.
Is the NF4 node available in the Install Custom Nodes category in ComfyUI Manager?
@dims2 Just to check, as I have been considering getting the 720p model even to make 480p videos: does it use more VRAM via the text encoder for the same size video? For example, with the sample 512 x 512 run on both 480p and 720p, will 720p eat more memory?
Which base model should I use? And also, how do I run it from code, without Comfy?
The WanImageToVideo node is missing; may I ask which plugin this node is in?
The custom node doesn't create a new node; it replaces an existing node and renames it to "Load FP4 or NF4 Quantized Diffusion or UNET Model".
@mfireson I've followed the instructions about choosing channel: dev and then looking for "CheckpointLoaderNF4", and the file pops up under "Install Custom Nodes". When I hit install, it says it will not install because of the security level settings. So I am almost there, but that node loader is not attainable. I will find a workaround for installing that node, because it is the final step I need to run this workflow. However, if you have info that is worth a shot, please feel free to share and lend your guidance. I was thinking of just swapping the "CheckpointLoaderNF4" node for the diffusion/UNet loader to see if that is a viable workaround.
As for now, I am stuck.
I fixed the node, but the rest of the workflow has so many bugs.
Here. https://github.com/kijai/ComfyUI-WanVideoWrapper
If a few nodes say they're missing, do these:
pip uninstall torch -y
pip uninstall torchvision -y
pip uninstall torchaudio -y
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip3 install torch==2.7.0.dev20250123+cu126 --index-url https://download.pytorch.org/whl/nightly/cu126
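After reinstalling, a quick sanity check that the CUDA build actually landed:

import torch

print(torch.__version__)          # should report a +cu126 build
print(torch.version.cuda)         # CUDA version torch was built against
print(torch.cuda.is_available())  # True if the GPU is visible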
@Therma I think I found the issue: I was using an older ComfyUI install vs. the standalone portable version. I'll report back when I have tried this version.
This workflow has so many bugs. What is Sage Attention?
Sage Attention is a method that's been super popular for video models for a few months, usually paired with TeaCache. It sacrifices a small amount of quality for a substantial speed-up in generation.
@RuggedPineapple Ah. These add-ons make it difficult to utilize the workflow.
@Noob_ee I don't run this workflow, I just saw it scrolling through, but as general help: I currently have 3 nodes that implement Sage Attention. One from Flux-Lightning, one from HunyuanVideoWrapper, and one from KJNodes that requires turning on the beta option for it to show up in your node picker. Also, Sage requires Triton, which is a royal pain in the ass to install on Windows but a simple pip install command on Linux, so if you're on Windows the juice may not be worth the squeeze.
That said, on 40xx-series hardware, taking the attention functions down to 8-bit with Sage bumps generation speed 150% to 180%, so it's kinda magic.
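For the curious, here is roughly what Sage Attention replaces at the call site (a sketch assuming the sageattention package's sageattn() entry point; the head count and dims are made-up stand-ins, not Wan's exact config):

import torch
import torch.nn.functional as F
from sageattention import sageattn

# Q, K, V in (batch, heads, seq, head_dim) layout, as SDPA expects.
q = torch.randn(1, 40, 4096, 128, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

out_ref = F.scaled_dot_product_attention(q, k, v)  # full-precision baseline
out_sage = sageattn(q, k, v, is_causal=False)      # 8-bit quantized attention

print((out_ref - out_sage).abs().mean())  # small error, big speedup on 40xx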
@RuggedPineapple Right, I am not blessed with running Linux like others on here, who seem not to mention what OS some of these examples are using. The juice is not worth the squeeze; I was over here thinking I was doing something wrong. I am running a 40-series, but shy of the 4090. I'll just have to wait for better optimization on Windows.
@Noob_ee WSL 2
@voboyso what is that?
@Noob_ee Windows Subsystem for Linux. You can use SageAttention easily while still running Windows (you'll do the generations and all that in WSL (Ubuntu Linux)). Just search "WSL SageAttention" for more info, but here's the video I followed: https://www.youtube.com/watch?v=ZBgfRlzZ7cw. I had several issues after that I needed to address; it might go smoother for others, though. The guy is a bit all over the place, but you'll be on your way to faster generations by following it (and possibly figuring out some other issues afterwards on your own).
Works on my RTX 5070. Although I might be misunderstanding the point, because the speed is the same as the FP8 version and I don't really notice a change in VRAM usage either?
Make it make sense.
So I thought this NF4 would have a speed advantage on the new Blackwell GPUs. I guess that's only valid for FP4 and not NF4.
This NF4 has no speed difference over FP8 or FP16, and it's even more difficult to run due to the BNB nodes' inability to offload the model to system RAM.
On my RTX 5080 16GB, I can run the 720p FP16 model just fine at 1280 x 720 (81 frames) by offloading up to 50GB of model data into system RAM without any performance degradation, and yet I can't even do 960 x 544 with the NF4. Bits and Bytes needs to work on that offloading, I guess.
With the current state of Comfy nodes, if you want to save VRAM it's best to just use the FP8 and Q8 quants, because they offload much better on low-VRAM GPUs. If you use torch compile with Wan2.1 on the native official workflow and with 64GB of system RAM, it will let you offload the FP16 model and will even make it faster.
Thank you for your work with this NF4 version anyway.
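For reference, the torch compile trick mentioned above looks like this in isolation (a toy sketch on a stand-in module; the real gain comes from compiling the actual Wan transformer in the workflow):

import torch
import torch.nn as nn

# Stand-in for the Wan DiT, sized like one of its 5120-wide blocks.
model = nn.Sequential(
    nn.Linear(5120, 5120), nn.GELU(), nn.Linear(5120, 5120)
).half().cuda()

compiled = torch.compile(model, mode="max-autotune")

x = torch.randn(1, 5120, dtype=torch.float16, device="cuda")
with torch.no_grad():
    _ = compiled(x)  # first call compiles (slow); later calls run the fused graph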
I made an FP4 quantized version and it's still slower than a Q6 GGUF on my 5070 Ti. I dunno if there's anything special that needs to be updated to get the improved speed from the Blackwell FP4 optimizations.
Seems like it specifically needs to be NVFP4 precision to get the speed advantage. MIT recently released an SVDQuant method that supports NVFP4, currently just for Flux, but they're planning to add Wan2.1 support. I tested the Flux workflow in ComfyUI with the FP4 model and the 8-step LoRA, and it's blazingly fast: I can generate a 1920x1200 image in less than 10 seconds, and a 1024x1024 in under four.
@thaddeusk Amazing! Thank you for the information, much appreciated!
When loading the model I get an error: All input tensors need to be on the same GPU, but found some tensors to not be on a GPU: [(torch.Size([1, 13107200]), device(type='cuda', index=0)), (torch.Size([409600]), device(type='cpu')), (torch.Size([5120, 5120]), device(type='cuda', index=0))]
I'm using an RTX 4070 with 12GB VRAM, and the T2V model.
Any chance to use this on WebUI Forge?