Flux Blockwise (Mixed Precision Model)
I had to build several custom tools to allow for the mixed precision model; to my knowledge it is the first built like this.
Faster and more accurate than any other FP8 quantized model currently available.
Works in Comfy and Forge, but Forge needs to be set to a BF16 UNET.
In Comfy, load it as a diffusion model and USE DEFAULT WEIGHT.
FP16 upcasting should not be used unless absolutely necessary, such as when running on CPU or IPEX.
FORGE - set COMMANDLINE_ARGS= --unet-in-bf16 --vae-in-fp32
Other than the need to force Forge into BF16 (FP32 VAE optional), it should work the same as the DEV model, with the added benefit of being 5GB smaller than the full BF16.
It turns out that every quantized model, including my own up to this point, has to my knowledge been built un-optimally per Blackforest.
Only the UNET blocks should be quantized in the diffusion model, and they should be upcast to BF16 and not FP16 (Comfy does this correctly).
I am currently trying to work out how to follow the Blackforest recommendations but using GGUF.
Comments
I'll build the lora stack onto it and see what SimV4 looks like.
If you're attempting to build a model, I would force BF16 and save the diffusion model, then load it and save it as FP8, then merge back in all the BF16 text UNET blocks and the following (see the sketch after this list):
'time_in.in_layer.bias',
'time_in.in_layer.weight',
'time_in.out_layer.bias',
'time_in.out_layer.weight',
'txt_in.bias',
'txt_in.weight',
'vector_in.in_layer.bias',
'vector_in.in_layer.weight',
'vector_in.out_layer.bias',
'vector_in.out_layer.weight'
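A minimal sketch of that rebuild, assuming a single-file safetensors diffusion model, the safetensors package, and PyTorch 2.1+ (for torch.float8_e4m3fn). The file names and the ".txt_" substring match for the in-UNET text blocks are assumptions based on the Flux key layout, so verify them against your own model's keys:

```python
# Hedged sketch: quantize the UNET blocks to FP8 while keeping the text
# blocks and the input projections listed above in BF16. File names and
# the ".txt_" match are assumptions, not a tested pipeline.
import torch
from safetensors.torch import load_file, save_file

KEEP_BF16 = {
    "time_in.in_layer.bias", "time_in.in_layer.weight",
    "time_in.out_layer.bias", "time_in.out_layer.weight",
    "txt_in.bias", "txt_in.weight",
    "vector_in.in_layer.bias", "vector_in.in_layer.weight",
    "vector_in.out_layer.bias", "vector_in.out_layer.weight",
}

def build_blockwise(src_path: str, dst_path: str) -> None:
    state = load_file(src_path)  # the model saved in BF16 first
    out = {}
    for name, tensor in state.items():
        if name in KEEP_BF16 or ".txt_" in name:
            out[name] = tensor.to(torch.bfloat16)        # text blocks stay BF16
        else:
            out[name] = tensor.to(torch.float8_e4m3fn)   # UNET blocks drop to FP8
    save_file(out, dst_path)

build_blockwise("flux1-dev-bf16.safetensors", "flux1-dev-blockwise.safetensors")
```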
I am still trying to work out if the GGUF convert script can be rebuilt to ignore those blocks when quantizing; they would be converted to FP16 or FP32, but that is better than losing the precision on those blocks. I don't know if Blackforest just published the info with the Depth model release or if I just missed it before.
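If the convert script can be patched, the change boils down to a per-tensor type decision like this sketch. The converter plumbing (reading tensors, writing the GGUF file) is omitted; the prefix list covers the ten keys above, and the ".txt_" match is the same assumption as before:

```python
# Hypothetical per-tensor choice for a blockwise-aware GGUF converter.
KEEP_PREFIXES = ("time_in.", "txt_in.", "vector_in.")

def gguf_target_type(name: str, quant_type: str = "Q8_0") -> str:
    """Return the GGUF tensor type a patched converter would write."""
    if name.startswith(KEEP_PREFIXES) or ".txt_" in name:
        return "F32"        # or F16: anything but the low-bit quant
    return quant_type       # e.g. Q8_0 / Q4_K for the other UNET blocks
```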
It would not look as good as the lora stack on base BF16, but it should give a good indication of how the model would perform if it were mixed precision BF16/FP8 to save those 5GB.
@Felldude They are trained in mixed precision BF16.
@AbstractPhila You set up a training to target different blocks on the same UNET model with different precision?
@Felldude No, that's definitely different. I was planning to do some single block targeted trains though, where I committed the entire cycle of training to a single block with a fairly high learn rate on very high alpha with low dimensions.
I'd love to try training multiple blocks in this nature using BF16, FP8, and FP16 in a similar block target manner as you mentioned.
@AbstractPhila Per Blackforest, the TE blocks in the UNET should all be BF16, and probably only trained in that if targeted; the UNET blocks, single or double, could be FP8, but I wonder if FP16 is throwing the models off.
@Felldude But GGUF is VERY slow! And it needs more memory if using a lora (if that works at all)...
is that "only" a different calculation method for FP8 and it is still original DEV ?
@sevenof9247 It's the base blocks from the DEV and Schnell models from Blackforest; I have not changed them in any way other than mixing the precision within the same model, which they loosely recommend.
@Felldude Instant answer ^^ OK, any hints for SwarmUI? And steps? CFG? Distilled CFG?
And why is the FP8 version, though pruned, still 17GB? It's usually 11GB.
@sevenof9247 Not familiar with it; while torch has support for mixed precision models, it would be dependent on the Python scripts loading them in. Distilled CFG will be ignored for Schnell; Dev is the same as the base BF16 model.
@sevenof9247 17GB because the text blocks are in BF16 while the UNET is in FP8, so the reduction is 5GB instead of 10GB.
@Felldude SwarmUI is the combination of Forge and ComfyUI ;)
So it will not work well on 16GB VRAM?
@sevenof9247 I will be releasing the NF4 build that got me 3.5-4.5 seconds per iteration, which is the fastest I have ever seen.
@Felldude Hmm, okay... sec/it, haha. Resolution? Sampler? And which RTX? Maybe tell me in % how much faster it is than FP8?
@sevenof9247 I am on a 3050, so... all my it/s or sec/it figures are posted on my uncensored model.
I'm getting some really really good results from this thing. It seems like a genuine breakthrough.
That's a really advanced idea! It not only ensures high accuracy in the parameter layer, but it also manages the overall model's computational power consumption.
Thanks, I am not sure if the information was new for FLUX Depth or if I had just never read the recommendations before, either is possible.
The testing shows some interesting outcomes.
Can you modify the blocks of DeDistilled?
I would have to compare the block structure, but any paired blocks can be transferred.
@Felldude The blockwise version seems to function BETTER than the fp16 version, at a lower hardware cost!?!?!
What is this sorcery?
I did some direct side-by-side comparisons with 1d and 1dblocks; it ended up being superior to FP8 by a lot, only uses about 14 gigs of VRAM, and still generates nearly identical images to FP16, just with a better context window.
My loras combine just fine, and it reflects a very similar outcome to the fp16 version with less hardware.
@AbstractPhila In my tests the images were the same for FP8 vs blockwise FP8, but the BF16 model was faster, I am assuming because it didn't have to be upcast. I would be curious how this model trains... but it seems like your conclusion was still de-distill?
@Felldude DeDistilled is really good. This one is just very similar to the FP16 with lighter weight, so you get added details and such, but DeDistilled is still something else entirely.
I don't know what it was finetuned with, but it definitely shows in the outcome, because DeDistilled has stronger tags in some directions than others when trying to use loras on it.
@AbstractPhila Workflow? Curious what you're seeing.
I found the differences between a regular Clip L (~300mb I had lying around) and the Clip L Large BF16 noticeable, and would prefer the BF16 version.
Clip L Large BF16 had more detail, higher contrast and better accuracy in small details like eyes, pupils, reflections of jewelry.
However I still prefer using the finetuned Clip L from zer0int.
The T5xxl BF16 version showed no difference compared to the FP16 version in my test, and only shaves off less than 300mb.
I have my own finetuned clip_l, but it's not bf16 format. Give that one a try, you might like it.
@AbstractPhila tested your SimV4 Clip L a little against the zer0int one and was surprised with how different yours turned out in most cases. Can't really say which one I prefer yet, but definitely going to keep testing. I also use mostly natural language with Flux, so your Clip's true strengths are probably not even shining through. Thanks for pointing me to it. Going to include a link to it in future versions of my Flux Advanced Workflow 👌
@redpinkretro Mine is heavily devoted to subject screen control, with position, offset, and size for now. For the next major version I plan to fill it with rotation: pitch, yaw, roll. The version after that is planned to include depth-of-field controllers and relational conjunctive controllers for multiple simultaneous subject focuses.
Full camera control is the dream, but the thing takes a ton of images to learn anything so cross your fingers.
Try:
upper-left quarter-frame 1girl with blue hair,
upper-right quarter-frame 1girl with red hair,
lower-left quarter-frame 1girl with green hair,
lower-right quarter-frame 1girl with purple hair
This works inherently in base Flux; however, I gave it steroids with this CLIP_L.
Focus is divided into a 3x3 grid for now, with full-frame, half-frame, and quarter-frame as the accessors. Flux inherently does this very well, so I tapped into that by training the CLIP_L and the UNET with some directives and it worked out pretty good.
@redpinkretro With no way to force a BF16 clip, it defaults to FP16 without modification. Hopefully Forge or Comfy will add it... they added FP64, so.
@AbstractPhila Finetuning a clip has been something I have wanted to try. I take it it has the same number of tokens if it drops into FLUX?
@Felldude That would definitely be interesting to see. Would you expect the increased range of BF16 to result in more creative interpretations of prompts in exchange for precision and prompt accuracy?
FP64? Really? What for, considering how every single digit has 10x less impact than the previous 😅
@AbstractPhila I wasn't aware of the camera control aspect of your clip. Sounds impressive, will give it a go! 👌
@redpinkretro It's hit or miss. I used deepghs to detect, and used the sizes of the bounding boxes relative to the overall picture, with positional offsets and some basic math, to tag images.
So say >70% is full-frame, and 40% to 70% is half-frame, I think; I'd need to check (it's hard to remember thousands of tags, even using a memory palace). It allows a bit better control of offsets, and was a fair experiment that ended up making the CLIP_L more responsive to the requests.
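For illustration, a small sketch of that mapping, assuming deepghs-style (x, y, w, h) pixel boxes; the 70%/40% cutoffs come from the comment above (stated there from memory), and the middle/center band names are guesses to fill out the 3x3 grid:

```python
def frame_tag(box, img_w, img_h):
    """Map a detection box to a full/half/quarter-frame position tag."""
    x, y, w, h = box
    coverage = (w * h) / (img_w * img_h)  # subject size vs whole image
    if coverage > 0.70:
        return "full-frame"
    if coverage > 0.40:
        return "half-frame"
    # Smaller subjects get a 3x3 grid position from the box center.
    cx, cy = (x + w / 2) / img_w, (y + h / 2) / img_h
    col = ["left", "center", "right"][min(int(cx * 3), 2)]
    row = ["upper", "middle", "lower"][min(int(cy * 3), 2)]
    return f"{row}-{col} quarter-frame"
```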
Using viewport/angle interpolation I'll be able to identify where eyes are looking, faces are looking, and so on. It's just more complex math. Automation is the key here, if the whole pipeline can be automated we are looking at some serious outcomes.
@AbstractPhila Very interesting. Would cutting the dataset images into pieces and captioning the slices individually potentially help that approach?
Like at a 2x2 level, tagging each quarter individually, knowing beforehand if it's top-left or bottom-right, and using the 70%+ approach to only ever catch the main subjects.
Then going as deep as you think makes sense, maybe up to a 4x4 grid with 16 slices at the lowest level, and then have it auto-assemble all captions for all slices from the 4x4, the 3x3, the 2x2, and the full image together into one summarized, precise evaluation. This would of course mean a lot more passes in total, but with lower image sizes depending on the slices, it should at least be faster in those.
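A quick sketch of that slicing pass, assuming Pillow is available; the captioner itself is whatever tagger you already use:

```python
from PIL import Image

def grid_slices(path: str, n: int):
    """Yield (row, col, tile) for an n x n grid over one image."""
    img = Image.open(path)
    w, h = img.size
    tw, th = w // n, h // n
    for r in range(n):
        for c in range(n):
            yield r, c, img.crop((c * tw, r * th, (c + 1) * tw, (r + 1) * th))

# Caption each tile with its known position, then assemble the 2x2, 3x3,
# and 4x4 passes plus the full image into one summary, as proposed above.
for r, c, tile in grid_slices("sample.jpg", 2):
    pass  # run your tagger on `tile`, prefixed with its (r, c) position
```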
@redpinkretro Base Simulacrum V17 is full of body parts. https://civitai.com/models/803213?modelVersionId=937952
Trying to train Flux into understanding what these were wasn't an easy process. Getting the correct training parameters was nightmare-inducing, and I shared them in my article about NSFW.
The fact of the matter here is: yes, it's doable. The body parts and their offsets blend together with base body parts due to the bucketing, and the outcomes are really good.
The true problem is, Flux takes forever to train. Last I checked there's no xformers support, so the cross-attention optimization isn't there yet.
In effect, I sit there letting it bake for days and days only to get one or two shots at a lora expansion. I can't just train loras side by side either, because the base model is progressing and diverging over time. I CAN do some, but they are often hit or miss if I try to load a lora from an earlier version of my model versus the loras I trained for THIS version of the model.
Training the CLIP_L gave a large boost in context, and CLIP_L trains super fast in comparison.
I have an idea: generate a few hundred pictures and heatmap the block access. For the lesser-used blocks, quantize to a smaller size, prune, and compact. I wonder if that would lobotomize it, speed it up, or slow it down.
See, the idea here is that we aren't actually using those blocks for much, which means we're potentially loading a large amount of RAM for nothing, causing a reliance on CPU switching. If we spend as little time as possible in those blocks, we can potentially eliminate the need for them entirely through pruning.
A Comfy node that sets a block's values to zero might allow for fast testing.
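A rough sketch of both ideas in plain PyTorch, assuming a Flux model object with double_blocks and single_blocks attributes (true of the reference Flux code, but verify for your loader); the mean-absolute-activation "usage" signal is one simple choice among many:

```python
import torch

activity: dict[str, float] = {}

def make_hook(name: str):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        # Accumulate mean |activation| per block as a cheap usage signal.
        activity[name] = activity.get(name, 0.0) + out.abs().mean().item()
    return hook

def attach_heatmap_hooks(model):
    handles = []
    for i, blk in enumerate(model.double_blocks):
        handles.append(blk.register_forward_hook(make_hook(f"double_{i}")))
    for i, blk in enumerate(model.single_blocks):
        handles.append(blk.register_forward_hook(make_hook(f"single_{i}")))
    return handles  # call h.remove() on each when finished

@torch.no_grad()
def zero_block(model, idx: int, kind: str = "double"):
    # The crude ablation test from the comment above: zero one block out.
    for p in getattr(model, f"{kind}_blocks")[idx].parameters():
        p.zero_()

# After a few hundred generations, sort `activity` to find the least-used
# blocks as candidates for heavier quantization or pruning.
```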
"I am currently trying to workout how to follow Blackforest recommendations but using GGUF".
Waiting for this! Even better if it's an easy process to share with the community, thanks.
Is this T5XXL the Flan one? Thank you in advance!
Standard T5 V1.1
Looking for a Flan version of this as well. Let us know!
I just started using ComfyUI, but I don't know where I should put the t5xxl file. I used the Load CLIP and Load Diffusion Model nodes, but I don't know what to use for the t5xxl. Can someone help me? Thank you in advance!
Dual CLIP loader with FLUX selected
@Felldude Thanks for the help. Also, can I ask which VAE I should use?
@atlazisco390 I used the base AE.VAE from Blackforest
I get colored noise. What am I doing wrong?
