Blockwise NF4 (Full Checkpoint - 22GB)
In Forge, use Automatic FP16 LoRA, not NF4 or NF4 Automatic.
Recommended for Forge: set COMMANDLINE_ARGS= --unet-in-bf16 --vae-in-fp32 (example below).
This is a full checkpoint: do NOT load an additional TE, VAE, or CLIP.
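For reference, a minimal webui-user.bat sketch with those flags (assuming a standard Windows Forge install; adjust for your setup):

@echo off
set COMMANDLINE_ARGS=--unet-in-bf16 --vae-in-fp32
call webui.bat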
NO changes have been made to the Black Forest Labs base diffusion model other than mixed-precision quantization.
This model is likely the first of its kind, combining NF4 quantization with Black Forest Labs' recommendation not to quantize the TE blocks.
High accuracy and speed while still fitting under 24GB (it also works well on 16GB and 8GB cards).
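To illustrate the idea, here is a minimal sketch of blockwise NF4 quantization with bitsandbytes that leaves the TE blocks in higher precision. The skip_substrings filter and blocksize are assumptions for illustration, not the script used to build this checkpoint:

import torch
import bitsandbytes.functional as bnbF

def quantize_mixed(state_dict, skip_substrings=("txt",)):
    quantized = {}
    for name, w in state_dict.items():
        # Hypothetical filter: leave TE blocks (and non-matrix tensors) unquantized.
        if w.ndim != 2 or any(s in name for s in skip_substrings):
            quantized[name] = w.to(torch.bfloat16)
        else:
            # Blockwise NF4: every 64-value block gets its own absmax scale.
            q, state = bnbF.quantize_4bit(w.to("cuda", torch.float16),
                                          blocksize=64, quant_type="nf4")
            quantized[name] = (q, state)  # packed 4-bit data plus per-block stats
    return quantized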
Comments (19)
Does your approach change anything about the unfortunate LoRA and ControlNet issues of NF4 versions?
LoRAs work fully in my testing. As for ControlNet, the issue is with the base model being distilled, is it not?
@Felldude LoRAs threw errors immediately in Comfy, ControlNet too, and IPAdapter just didn't affect the output at all. That's how all NF4 models I tested behaved, but I haven't used one in a while, as I have not heard/read about any changes there. That's why I asked, hoping this might be a change that makes a difference by keeping some blocks untouched.
As far as I remember, NF4 had a somewhat flexible self-rescaling mechanic that allowed it to be as precise as necessary while being as efficient as possible, which made it perform faster and better than a quantized counterpart. Unfortunately, that made it structurally incompatible with LoRAs and ControlNets, which are built with a set scale.
@redpinkretro To my knowledge, NF4 support in Comfy was dropped, so you can only load base models with no LoRA support. Forge works fine with NF4 and LoRAs, as it up-casts the model to FP32/FP16 to allow for them (sketched below). For FLUX this really should be BF16, like Comfy uses, but I would rather have FP16 and working LoRAs than none.
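To illustrate that up-cast approach, a minimal sketch (assumed tensor names, not Forge's actual internals) of merging a LoRA into an NF4 layer by dequantizing first:

import torch
import bitsandbytes.functional as bnbF

def apply_lora_to_nf4(packed, quant_state, lora_down, lora_up, alpha):
    # Up-cast: rebuild the FP16 weight from the packed NF4 data.
    w = bnbF.dequantize_4bit(packed, quant_state).to(torch.float16)
    rank = lora_down.shape[0]
    # The LoRA delta is merged at full precision; the NF4 blocks cannot
    # absorb it directly because each block's scale is fixed by its absmax.
    return w + (alpha / rank) * (lora_up @ lora_down).to(torch.float16)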
@Felldude Yes, exactly. That's why I was not using any NF4 models, due to those limitations.
This is an interesting checkpoint. It is very fast, has a basic workflow, and it's reasonably accurate and detailed. All those qualities make this model very accessible.
Thanks
I don't know if it makes a difference, but I used these switches when starting ComfyUI for this model:
--bf16-unet --fp32-vae
Those instructions in the description are for Forge; for Comfy I think you also need the GPU-only flag.
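For reference, a full ComfyUI launch with those switches plus the GPU-only flag might look like this (assuming a standard install; paths vary):

python main.py --bf16-unet --fp32-vae --gpu-only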
How much faster is it than base NF4, per step?
That would vary per machine, as the CPU load is higher with this model unless you force it onto the GPU (assuming a 3090 or 4090). The accuracy is higher, and for me the speed is similar.
So about the same speed, but 4 steps. I have 8GB VRAM and tried NF4 at 20 steps before, but the quality was shit, and I had no time to try what worked best.
Can you give me an example prompt? What should I use for best realism, just quality tags, common tags?
@mahkidale496 Describe the scene in full natural language. Some people add something like "A raw photo of a snowman during Christmas on a winter's night."
@Felldude Thanks, does it support LoRAs in Forge?
@mahkidale496 Forge does, yes; just use Automatic FP16 LoRA.
@Felldude Still Euler with Beta? Or is there something better now?
@mahkidale496 Forge has its own, but Euler still works well.
Hm, so other NF4 checkpoints quantized the TE blocks (meaning the TE blocks inside the diffusion model?).
I don't mean CLIP/T5, which obviously shouldn't be quantized in NF4 (well, it works if it's HQQ).