This is a full restoration of Zer0Int and AbstractPila's work on the CLIP-L model.
Across thousands of test images, the FP32 model corrected bad anatomy in cases where the FP16 model failed (seed to seed).
In most cases the images show little difference, but where differences exist the FP32 model was reliably more accurate (as it should be, mathematically).
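For anyone who wants to reproduce that comparison, here is a minimal sketch (not the original test harness) that loads the same CLIP-L text encoder in FP32 and FP16 and measures the embedding drift. The model ID and prompt are placeholders; FP16 inference on CPU needs a recent PyTorch or a GPU.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "openai/clip-vit-large-patch14"  # placeholder; point at the restored FP32 weights
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = CLIPTokenizer.from_pretrained(model_id)
tokens = tokenizer(["a photo of a hand with five fingers"],
                   padding="max_length", return_tensors="pt").to(device)

# Same weights, two precisions.
enc_fp32 = CLIPTextModel.from_pretrained(model_id, torch_dtype=torch.float32).to(device).eval()
enc_fp16 = CLIPTextModel.from_pretrained(model_id, torch_dtype=torch.float16).to(device).eval()

with torch.no_grad():
    emb_fp32 = enc_fp32(**tokens).last_hidden_state
    emb_fp16 = enc_fp16(**tokens).last_hidden_state.float()

# The drift is tiny per token, but it is exactly the kind of difference that
# can flip a borderline seed between good and bad anatomy downstream.
print("max abs difference:", (emb_fp32 - emb_fp16).abs().max().item())
```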
Note: the vision models use the ViT-L vision blocks. Zer0Int has a new finetune of the ViT/vision model here.
MIT License
Copyright (c) 2021 OpenAI
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Comments
I'm using ForgeUI and I'm confused: do I load both CLIP-L and CLIP-L vision, or just one?
The vision CLIP is for video models like Hunyuan. It should load, but the vision blocks are ignored by most models.
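If you are unsure whether a particular .safetensors file is text-only CLIP-L or the full vision variant, you can just list its tensor names. A rough sketch (the file name is a placeholder, and key prefixes vary between exports):

```python
from safetensors import safe_open

path = "clip_l_vision_fp32.safetensors"  # placeholder file name
with safe_open(path, framework="pt") as f:
    keys = list(f.keys())

# These patterns cover the common export naming schemes.
vision = [k for k in keys if "vision_model" in k or k.startswith("visual.")]
text = [k for k in keys if "text_model" in k or k.startswith("text.") or "transformer" in k]
print(f"{len(vision)} vision tensors, {len(text)} text tensors")
# Image-generation models read only the text tensors; the vision blocks load
# but sit unused unless the model (e.g. a video model) actually calls them.
```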
Lots of versions... which one is most complete?
Vision would be for video models. Changing a CLIP will have drastic effects on an output seed to seed - I have done some write-ups, but it still comes down to personal preference.
When would we use each of these please?
Do we just switch full-time to the latest FP32 "vision" model for day-to-day Flux use, or is that overkill?
I appreciate your work but don't understand the difference in versions... thank you.
I have not tested the vision model with Flux. It should produce the same seed-to-seed output as the pruned CLIP-L; to my knowledge only certain video models use the vision blocks of the full ViT-CLIP.
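One way to check that claim yourself is to confirm that the text-encoder tensors in the full ViT-CLIP file are bit-identical to the pruned text-only CLIP-L; if they are, the conditioning (and therefore the image, seed to seed) cannot change. A sketch, assuming both files use the same key names (file names are placeholders):

```python
import torch
from safetensors.torch import load_file

full = load_file("vit_l_14_full_fp32.safetensors")       # placeholder: full ViT-CLIP
pruned = load_file("clip_l_text_only_fp32.safetensors")  # placeholder: pruned CLIP-L

mismatches = [
    name for name, tensor in pruned.items()
    if name not in full or not torch.equal(full[name], tensor)
]
# An empty list means the text weights are bit-identical, so a text-to-image
# model conditioned on either file should give the same result seed to seed.
print("mismatched tensors:", len(mismatches), mismatches[:5])
```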
Possible to get a GGUF of this text encoder?
I'm not sure if vision is part of the existing GGUF architecture, and the other models are not big enough to warrant quantization unless you're trying to fit into a 20+ year old GPU.
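Rough numbers behind that point: the CLIP-L text encoder is only about 123M parameters, so even aggressive quantization saves a few hundred megabytes at most, unlike T5-XXL or the diffusion model itself. A quick back-of-the-envelope check:

```python
# Approximate on-disk size of CLIP-L's text encoder at different precisions.
params = 123_000_000  # approximate CLIP-L text-encoder parameter count

for label, bytes_per_param in [("FP32", 4), ("FP16", 2), ("Q8 (approx.)", 1)]:
    print(f"{label}: ~{params * bytes_per_param / 1024**3:.2f} GiB")
# FP32: ~0.46 GiB, FP16: ~0.23 GiB, Q8: ~0.11 GiB -- a few hundred MB either way.
```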
@Felldude I think we might at least get faster load speeds in Comfy, so it might be worth trying; CLIP-G is substantially larger. Just running lots of tests on these different text encoders. Thanks for the research, love your TEs!
Found GGUF versions for testing: https://huggingface.co/chatpig/t5xxl/tree/main
In theory it shouldn't be a problem; there are tools that let you convert anything to GGUF. Going below full precision will also lower quality, though.
Are you considering doing a fine-tune of LongCLIP-L (also provided by zer0int)?
Thank you for your work =)
I believe the latest ViT model they put out is FP32.
The only combination I found that works with some SDXL checkpoints in ComfyUI:
- clipLCLIPGFullFP32_simulacrumCLIPGFP32 (2.7 GB)
- clipLCLIPGFullFP32_simulacrumCLIPLFP32 (494 MB)
Load both in the Dual CLIP loader and set the --fp32-text-enc argument (see the workflow sketch below).
This produces more detailed and realistic images and understands the prompt better than the built-in CLIP.
Sad reality: it produces garbage with most Pony mixes; some mixes output a low-quality image, others just noise. For some strange reason it works fine with many Illustrious checkpoints (but not all of them).
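For reference, a hedged sketch of the relevant fragment of a ComfyUI API-format workflow for the setup described above; node IDs are arbitrary, the .safetensors extensions are assumed, and the rest of the graph (checkpoint loader, KSampler, etc.) is omitted.

```python
# Minimal fragment of a ComfyUI API-format workflow, written as a Python dict.
workflow = {
    "1": {
        "class_type": "DualCLIPLoader",
        "inputs": {
            "clip_name1": "clipLCLIPGFullFP32_simulacrumCLIPGFP32.safetensors",
            "clip_name2": "clipLCLIPGFullFP32_simulacrumCLIPLFP32.safetensors",
            "type": "sdxl",
        },
    },
    "2": {
        "class_type": "CLIPTextEncode",
        "inputs": {"clip": ["1", 0], "text": "your prompt here"},
    },
}
# Start ComfyUI with the flag mentioned above so the encoders stay in FP32:
#   python main.py --fp32-text-enc
```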
I have the FP32 CLIPs for Pony as well; some Pony models absolutely require the Pony versions of the CLIP.
I'd like to get more information. Please give a link to the Simulacrum CLIP-G source.
The Authors of Sim and Zer0Int are linked at the top of the article.
Please tell me Zer0Int-Vision_CLIP_FP32 is the same file as the one on https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14/tree/main, only under a different name (model.safetensors)?
I do not know about that one.
Which of these should I use as CLIP-L for Flux? And what's the difference?
Both CLIP-L and G. The different training info for each is clearly listed, top right. Flux likes lots of encoders. Cram them in there like a clown car. Node up a sextuple CLIP encoder, or at the very least a triple encoder. Stack them up and swap them around until the KSampler stops throwing errors at you. Worked for me.
Ponder_Stibbons Good idea... but I live 50 years in the past; I'm on CPU, so there's no time for this (I mean days). Good noob-aimed info in the description always helps.
It would be CLIP-L and T5. For CPU I would focus on which CLIP works best in Schnell at 4 steps.
amazingbeauty Bless you if you've got the patience to run Flux on CPU, or any diffusion model for that matter. But if you're serious, go with the other commenter. Or forget this stuff here and go straight to the source; all required models are in the tree. https://huggingface.co/black-forest-labs/FLUX.1-schnell/tree/main
Ponder_Stibbons If you want base CLIP-L, I would use the FP32 version over the FP16 version that Blackforest posted: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/tree/main/text_encoder
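For CPU testing at 4 steps, a hedged diffusers sketch of how one might swap the CLIP-L under test into Schnell; model IDs and the alternate CLIP-L path are placeholders, and FP32 on CPU needs a lot of RAM and will be slow.

```python
import torch
from diffusers import FluxPipeline
from transformers import CLIPTextModel

# Placeholder path/repo for the alternate FP32 CLIP-L you want to compare.
clip_l = CLIPTextModel.from_pretrained("path/to/alternate-clip-l", torch_dtype=torch.float32)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    text_encoder=clip_l,        # swap in the CLIP-L under test; T5 stays stock
    torch_dtype=torch.float32,  # CPU-friendly precision
)

# Schnell is distilled for few-step sampling, so 4 steps with no CFG is the
# usual quick test; keep the seed fixed to compare CLIP variants fairly.
image = pipe(
    "a portrait photo of a woman holding a coffee cup",
    num_inference_steps=4,
    guidance_scale=0.0,
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]
image.save("schnell_clip_test.png")
```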
Could you please clarify which models these are specifically adapted for?