Flux Blockwise (Mixed Precision Model)
I had to build several custom tools to allow for the mixed precision model; to my knowledge it is the first built like this.
Faster and more accurate than any other FP8 quantized model currently available.
Works in Comfy and Forge, but Forge needs to be set to a BF16 UNET.
In Comfy, load it as a diffusion model and USE DEFAULT WEIGHT.
FP16 upcasting should not be used unless absolutely necessary, such as when running on CPU or IPEX.
FORGE - set COMMANDLINE_ARGS= --unet-in-bf16 --vae-in-fp32
Other than the need to force Forge into BF16 (FP32 VAE optional), it should work the same as the DEV model, with the added benefit of being 5GB smaller than the full BF16.
It turns out that every quantized model, including my own up to this point, has to my knowledge been built un-optimally per Blackforest.
Only the UNET blocks should be quantized in the diffusion model, and they should be upcast to BF16 and not FP16 (Comfy does this correctly).
I am currently trying to work out how to follow the Blackforest recommendations but using GGUF.
Comments
I'll build the lora stack onto it and see what SimV4 looks like.
If you're attempting to build a model, I would force BF16 and save the diffusion model, then load it and save it as FP8, then merge back in all the BF16 text UNET blocks and the following (see the sketch after this list):
'time_in.in_layer.bias',
'time_in.in_layer.weight',
'time_in.out_layer.bias',
'time_in.out_layer.weight',
'txt_in.bias',
'txt_in.weight',
'vector_in.in_layer.bias',
'vector_in.in_layer.weight',
'vector_in.out_layer.bias',
'vector_in.out_layer.weight'
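A minimal sketch of that rebuild, assuming a single-file safetensors diffusion model, the safetensors package, and PyTorch 2.1+ (for torch.float8_e4m3fn). The file names and the ".txt_" substring match for the in-UNET text blocks are assumptions based on the Flux key layout, so verify them against your own model's keys:

```python
# Hedged sketch: quantize the UNET blocks to FP8 while keeping the text
# blocks and the input projections listed above in BF16. File names and
# the ".txt_" match are assumptions, not a tested pipeline.
import torch
from safetensors.torch import load_file, save_file

KEEP_BF16 = {
    "time_in.in_layer.bias", "time_in.in_layer.weight",
    "time_in.out_layer.bias", "time_in.out_layer.weight",
    "txt_in.bias", "txt_in.weight",
    "vector_in.in_layer.bias", "vector_in.in_layer.weight",
    "vector_in.out_layer.bias", "vector_in.out_layer.weight",
}

def build_blockwise(src_path: str, dst_path: str) -> None:
    state = load_file(src_path)  # the model saved in BF16 first
    out = {}
    for name, tensor in state.items():
        if name in KEEP_BF16 or ".txt_" in name:
            out[name] = tensor.to(torch.bfloat16)        # text blocks stay BF16
        else:
            out[name] = tensor.to(torch.float8_e4m3fn)   # UNET blocks drop to FP8
    save_file(out, dst_path)

build_blockwise("flux1-dev-bf16.safetensors", "flux1-dev-blockwise.safetensors")
```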
I am still trying to work out if the GGUF convert script can be rebuilt to ignore those blocks when quantizing; they would be converted to FP16 or FP32, but that is better than losing the precision on those blocks. I don't know if Blackforest just published the info with the Depth model release or if I just missed it before.
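If the convert script can be patched, the change boils down to a per-tensor type decision like this sketch. The converter plumbing (reading tensors, writing the GGUF file) is omitted; the prefix list covers the ten keys above, and the ".txt_" match is the same assumption as before:

```python
# Hypothetical per-tensor choice for a blockwise-aware GGUF converter.
KEEP_PREFIXES = ("time_in.", "txt_in.", "vector_in.")

def gguf_target_type(name: str, quant_type: str = "Q8_0") -> str:
    """Return the GGUF tensor type a patched converter would write."""
    if name.startswith(KEEP_PREFIXES) or ".txt_" in name:
        return "F32"        # or F16: anything but the low-bit quant
    return quant_type       # e.g. Q8_0 / Q4_K for the other UNET blocks
```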
It would not look as good as the lora stack on base BF16, but it should give a good indication of how the model would perform if it were mixed precision BF16/FP8 to save those 5GB.
@Felldude They are trained in mixed precision BF16.
@AbstractPhila You set up a training to target different blocks on the same UNET model with different precision?
@Felldude No, that's definitely different. I was planning to do some single block targeted trains though, where I committed the entire cycle of training to a single block with a fairly high learn rate on very high alpha with low dimensions.
I'd love to try training multiple blocks in this nature using BF16, FP8, and FP16 in a similar block target manner as you mentioned.
@AbstractPhila Per Blackforest, the TE blocks in the UNET should all be BF16, and probably only trained in that if targeted; the UNET blocks, single or double, could be FP8, but I wonder if FP16 is throwing the models off.
@Felldude But GGUF is VERY slow! And it needs more memory if using a lora (if that works at all)...
is that "only" a different calculation method for FP8 and it is still original DEV ?
@sevenof9247 It's the base blocks from the DEV and Schnell models from Blackforest; I have not changed them in any way other than mixing the precision within the same model, which they loosely recommend.
@Felldude Instant answer ^^ OK, any hints for SwarmUI? And steps? CFG? Distilled CFG?
And why is the FP8 version, though pruned, still 17GB? It's usually 11GB.
@sevenof9247 Not familiar with it; while torch has support for mixed precision models, it would be dependent on the Python scripts loading them in. Distilled CFG will be ignored for Schnell; Dev is the same as the base BF16 model.
@sevenof9247 17GB because the text blocks are in BF16 while the UNET is in FP8, so the reduction is 5GB instead of 10GB.
@Felldude SwarmUI is the combination of Forge and ComfyUI ;)
So it will not work well on 16GB VRAM?
@sevenof9247 I will be releasing the NF4 build that got me 3.5-4.5 seconds per iteration, which is the fastest I have ever seen.
@Felldude Hmm, okay... sec/it, haha. Resolution? Sampler? And which RTX? Maybe tell me in % how much faster it is than FP8?
@sevenof9247 I am on a 3050, so... all my it/s or sec/it figures are posted on my uncensored model.
I'm getting some really really good results from this thing. It seems like a genuine breakthrough.
That's a really advanced idea! It not only ensures high accuracy in the parameter layer, but it also manages the overall model's computational power consumption.
Thanks, I am not sure if the information was new for FLUX Depth or if I had just never read the recommendations before, either is possible.
The testing shows some interesting outcomes.
Can you modify the blocks of DeDistilled?
I would have to compare the block structure, but any paired blocks can be transferred.
@Felldude The blockwise version seems to function BETTER than the fp16 version, at a lower hardware cost!?!?!
What is this sorcery?
I did some direct side-by-side comparisons with 1d and 1dblocks; it ended up being superior to FP8 by a lot, only uses about 14 gigs of VRAM, and still generates nearly identical images to FP16, just with a better context window.
My loras combine just fine, and it reflects a very similar outcome to the fp16 version with less hardware.
@AbstractPhila In my tests the images were the same for FP8 vs blockwise FP8, but the BF16 model was faster, I am assuming because it didn't have to be upcast. I would be curious how this model trains... but it seems like your conclusion was still de-distill?
@Felldude DeDistilled is really good. This one is just very similar to the FP16 with lighter weight, so you get added details and such, but DeDistilled is still something else entirely.
I don't know what it was finetuned with, but it definitely shows in the outcome, because DeDistilled has stronger tags in some directions than others when trying to use loras on it.
@AbstractPhila Workflow? Curious what you're seeing.
I found the differences between a regular Clip L (~300mb I had lying around) and the Clip L Large BF16 noticeable, and would prefer the BF16 version.
Clip L Large BF16 had more detail, higher contrast and better accuracy in small details like eyes, pupils, reflections of jewelry.
However I still prefer using the finetuned Clip L from zer0int.
The T5xxl BF16 version showed no difference compared to the FP16 version in my test, and only shaves off less than 300mb.
I have my own finetuned clip_l, but it's not bf16 format. Give that one a try, you might like it.
@AbstractPhila tested your SimV4 Clip L a little against the zer0int one and was surprised with how different yours turned out in most cases. Can't really say which one I prefer yet, but definitely going to keep testing. I also use mostly natural language with Flux, so your Clip's true strengths are probably not even shining through. Thanks for pointing me to it. Going to include a link to it in future versions of my Flux Advanced Workflow 👌
@redpinkretro Mine is heavily devoted to subject screen control, with position, offset, and size for now. For the next major version I plan to fill it with rotation: pitch, yaw, roll. The version after that is planned to include depth-of-field controllers and relational conjunctive controllers for multiple simultaneous subject focuses.
Full camera control is the dream, but the thing takes a ton of images to learn anything so cross your fingers.
Try:
upper-left quarter-frame 1girl with blue hair,
upper-right quarter-frame 1girl with red hair,
lower-left quarter-frame 1girl with green hair,
lower-right quarter-frame 1girl with purple hair
This works inherently in base Flux; however, I gave it steroids with this CLIP_L.
Focus is divided into a 3x3 grid for now, with full-frame, half-frame, and quarter-frame as the accessors. Flux inherently does this very well, so I tapped into that by training the CLIP_L and the UNET with some directives and it worked out pretty good.
@redpinkretro With no way to force a BF16 clip, it defaults to FP16 without modification. Hopefully Forge or Comfy will add it... they added FP64, so.
@AbstractPhila Finetuning a clip has been something I have wanted to try. I take it it has the same number of tokens if it drops into FLUX?
@Felldude That would definitely be interesting to see. Would you expect the increased range of BF16 to result in more creative interpretations of prompts in exchange for precision and prompt accuracy?
FP64? Really? What for, considering how every single digit has 10x less impact than the previous 😅
@AbstractPhila I wasn't aware of the camera control aspect of your clip. Sounds impressive, will give it a go! 👌
@redpinkretro It's hit or miss. I used deepghs to detect, and used the sizes of the bounding boxes relative to the overall picture, with positional offsets and some basic math, to tag images.
So say >70% is full-frame, and 40% to 70% is half-frame, I think; I'd need to check (it's hard to remember thousands of tags, even using a memory palace). It allows a bit better control of offsets, and was a fair experiment that ended up making the CLIP_L more responsive to the requests.
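For illustration, a small sketch of that mapping, assuming deepghs-style (x, y, w, h) pixel boxes; the 70%/40% cutoffs come from the comment above (stated there from memory), and the middle/center band names are guesses to fill out the 3x3 grid:

```python
def frame_tag(box, img_w, img_h):
    """Map a detection box to a full/half/quarter-frame position tag."""
    x, y, w, h = box
    coverage = (w * h) / (img_w * img_h)  # subject size vs whole image
    if coverage > 0.70:
        return "full-frame"
    if coverage > 0.40:
        return "half-frame"
    # Smaller subjects get a 3x3 grid position from the box center.
    cx, cy = (x + w / 2) / img_w, (y + h / 2) / img_h
    col = ["left", "center", "right"][min(int(cx * 3), 2)]
    row = ["upper", "middle", "lower"][min(int(cy * 3), 2)]
    return f"{row}-{col} quarter-frame"
```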
Using viewport/angle interpolation I'll be able to identify where eyes are looking, faces are looking, and so on. It's just more complex math. Automation is the key here, if the whole pipeline can be automated we are looking at some serious outcomes.
@AbstractPhila Very interesting. Would cutting the dataset images into pieces and captioning the slices individually potentially help that approach?
Like at a 2x2 level, tagging each quarter individually, knowing beforehand if it's top-left or bottom-right, and using the 70%+ approach to only ever catch the main subjects.
Then going as deep as you think makes sense, maybe up to a 4x4 grid with 16 slices at the lowest level, and then have it auto-assemble all captions for all slices from the 4x4, the 3x3, the 2x2, and the full image together into one summarized, precise evaluation. This would of course mean a lot more passes in total, but with lower image sizes depending on the slices, it should at least be faster in those.
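A quick sketch of that slicing pass, assuming Pillow is available; the captioner itself is whatever tagger you already use:

```python
from PIL import Image

def grid_slices(path: str, n: int):
    """Yield (row, col, tile) for an n x n grid over one image."""
    img = Image.open(path)
    w, h = img.size
    tw, th = w // n, h // n
    for r in range(n):
        for c in range(n):
            yield r, c, img.crop((c * tw, r * th, (c + 1) * tw, (r + 1) * th))

# Caption each tile with its known position, then assemble the 2x2, 3x3,
# and 4x4 passes plus the full image into one summary, as proposed above.
for r, c, tile in grid_slices("sample.jpg", 2):
    pass  # run your tagger on `tile`, prefixed with its (r, c) position
```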
@redpinkretro Base Simulacrum V17 is full of body parts. https://civitai.com/models/803213?modelVersionId=937952
Trying to train Flux into understanding what these were wasn't an easy process. Getting the correct training parameters was nightmare-inducing, and I shared them in my article about NSFW.
The fact of the matter here is: yes, it's doable. The body parts and their offsets blend together with base body parts due to the bucketing, and the outcomes are really good.
The true problem is, Flux takes forever to train. Last I checked there's no xformers support, so the cross-attention optimization isn't there yet.
In effect, I sit there letting it bake for days and days only to get one or two shots at a lora expansion. I can't just train loras side by side either, because the base model is progressing and diverging over time. I CAN do some, but they are often hit or miss if I try to load a lora from an earlier version of my model versus the loras I trained for THIS version of the model.
Training the CLIP_L gave a large boost in context, and CLIP_L trains super fast in comparison.
I have an idea: generate a few hundred pictures and heatmap the block access. For the lesser-used blocks, quantize to a smaller size, prune, and compact. I wonder if that would lobotomize it, speed it up, or slow it down.
See, the idea here is that we aren't actually using those blocks for much, which means we're potentially loading a large amount of RAM for nothing, causing a reliance on CPU switching. If we spend as little time as possible in those blocks, we can potentially eliminate the need for them entirely through pruning.
A Comfy node that sets a block's values to zero might allow for fast testing.
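A rough sketch of both ideas in plain PyTorch, assuming a Flux model object with double_blocks and single_blocks attributes (true of the reference Flux code, but verify for your loader); the mean-absolute-activation "usage" signal is one simple choice among many:

```python
import torch

activity: dict[str, float] = {}

def make_hook(name: str):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        # Accumulate mean |activation| per block as a cheap usage signal.
        activity[name] = activity.get(name, 0.0) + out.abs().mean().item()
    return hook

def attach_heatmap_hooks(model):
    handles = []
    for i, blk in enumerate(model.double_blocks):
        handles.append(blk.register_forward_hook(make_hook(f"double_{i}")))
    for i, blk in enumerate(model.single_blocks):
        handles.append(blk.register_forward_hook(make_hook(f"single_{i}")))
    return handles  # call h.remove() on each when finished

@torch.no_grad()
def zero_block(model, idx: int, kind: str = "double"):
    # The crude ablation test from the comment above: zero one block out.
    for p in getattr(model, f"{kind}_blocks")[idx].parameters():
        p.zero_()

# After a few hundred generations, sort `activity` to find the least-used
# blocks as candidates for heavier quantization or pruning.
```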
"I am currently trying to workout how to follow Blackforest recommendations but using GGUF".
Waiting for this! Even better if it's an easy process to share with the community, thanks.
Is this T5XXL the Flan one? Thank you in advance!
Standard T5 V1.1
Looking for a Flan version of this as well. Let us know!
I just started using ComfyUI, but I don't know where I should put the t5xxl file. I used the Load CLIP and Load Diffusion Model nodes, but I don't know what to use for the t5xxl. Can someone help me? Thank you in advance!
Dual CLIP loader with FLUX selected
@Felldude Thanks for the help. Also, can I ask which VAE I should use?
@atlazisco390 I used the base AE.VAE from Blackforest
I get colored noise. What am I doing wrong?
