Long CLIP (Distilled)
Teacher/student distillation from 248/218-token context length down to a projected 77 tokens
Pruned for use in SDXL, FLUX, SD 1.5, SD3, and Hunyuan Video
DO NOT USE IN HiDream, Pony (in most cases), or Illustrious
Some of the top on-site models are built with the FP32 distilled CLIP and an FP32 VAE
Forcing FP32 CLIP is recommended for ComfyUI, Forge, and Automatic1111
The HiDream CLIP has been trained on a distillation set, and the 248- and 218-token lengths were reduced to 77 based on the pooled vision/text model output.
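The exact pooled-output-guided projection used here isn't published, but the general idea of shrinking a 248-slot positional-embedding table to the standard 77 slots can be sketched with simple linear interpolation (the function name and interpolation choice below are illustrative assumptions, not the author's actual method):

```python
import numpy as np

def shrink_positions(pos_emb: np.ndarray, new_len: int = 77) -> np.ndarray:
    """Resample an (old_len, dim) positional-embedding table to new_len rows.

    Hypothetical illustration: linearly interpolates each embedding
    dimension along the token axis, e.g. 248 -> 77.
    """
    old_len, dim = pos_emb.shape
    old_x = np.linspace(0.0, 1.0, old_len)
    new_x = np.linspace(0.0, 1.0, new_len)
    # Interpolate every embedding dimension independently along the token axis.
    return np.stack(
        [np.interp(new_x, old_x, pos_emb[:, d]) for d in range(dim)], axis=1
    )

# Example: a 248-token table reduced to the standard 77 slots.
long_table = np.random.randn(248, 768).astype(np.float32)
short_table = shrink_positions(long_table, 77)
print(short_table.shape)  # (77, 768)
```

The endpoints of the table are preserved exactly; interior positions are blended, which is one common way to re-use long-context position information at a shorter length.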
Comments (8)
Did you know you can use ViT-L-14 instead of CLIP-L with SDXL? I didn't. It's great!
ViT-L-14 is the vision source for all of the CLIP-L models I have trained, other than Gemma.
@Felldude Oh, interesting. I'm just dabbling here and found it interesting because I thought ViT-L was just for FLUX.
@vhp Even going back to SD 1.5, the CLIP has used ViT-L-14, but before mixed precision was handled well the entire model was saved in FP16 rather than mixed. SDXL also has a version of LAION's CLIP-G. https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/text_encoder/config.json
Sorry, I'm a noob with Comfy. Can anyone guide me to a workflow to merge CLIP-L onto a checkpoint? Thanks.
It depends on the model in question, but for SDXL it's the DualCLIPLoader node with CLIP-G and CLIP-L.
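In ComfyUI's API-format workflow JSON, that setup looks roughly like the fragment below (node IDs and file names are placeholders; verify the field names against your ComfyUI version):

```python
# Hypothetical fragment of a ComfyUI API-format workflow. The
# DualCLIPLoader supplies both SDXL text encoders; the CLIPTextEncode
# node references the loader's first output via ["10", 0].
workflow_fragment = {
    "10": {
        "class_type": "DualCLIPLoader",
        "inputs": {
            "clip_name1": "clip_g.safetensors",       # CLIP-G (placeholder file name)
            "clip_name2": "long_clip_l.safetensors",  # this distilled CLIP-L (placeholder)
            "type": "sdxl",
        },
    },
    "11": {
        "class_type": "CLIPTextEncode",
        "inputs": {"text": "a photo of a cat", "clip": ["10", 0]},
    },
}
print(workflow_fragment["10"]["class_type"])  # DualCLIPLoader
```

Dropping a replacement CLIP-L file into `models/clip` and pointing `clip_name2` at it is the usual way to swap the text encoder without touching the checkpoint itself.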
Regarding the linked merged checkpoints: wouldn't that nullify any of the custom training or concepts introduced by the model? I would think checkpoint merges also merge CLIP weights, but I'm not very knowledgeable about this stuff.
None of the models listed above are merges; they are distillations using an MSE comparative contrastive loss between the CLIP teacher and student.
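The exact loss formulation isn't published, but the core of a teacher/student MSE objective can be sketched like this (a minimal illustration, assuming pooled text features and L2 normalization; the "comparative contrast" component of the actual training is not reproduced here):

```python
import numpy as np

def mse_distillation_loss(teacher_feat: np.ndarray, student_feat: np.ndarray) -> float:
    """MSE between L2-normalized teacher and student embeddings.

    Illustrative only: compares pooled text features of shape (batch, dim).
    The student is trained to minimize this, pulling its 77-token output
    toward the long-context teacher's output.
    """
    t = teacher_feat / np.linalg.norm(teacher_feat, axis=-1, keepdims=True)
    s = student_feat / np.linalg.norm(student_feat, axis=-1, keepdims=True)
    return float(np.mean((t - s) ** 2))

# Identical features give zero loss.
feats = np.random.randn(4, 768)
print(mse_distillation_loss(feats, feats))  # 0.0
```

Because only the student's outputs are regressed toward the frozen teacher's, the result is a new distilled model rather than an average of existing checkpoint weights, which is why merge-style concerns about diluting trained concepts don't apply.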