PixArt Sigma XL 2 MS: 2k, 1024, and 512 full finetunes on custom captions.
INSTRUCTIONS: Place the .safetensors file where the original model would go and select bunline.
Favorite sampling settings (a hedged diffusers sketch of the 2k settings follows below):
512/1024 models: dpm++2s_a, simple scheduler, 24 steps, CFG 3.1 to 4.2, or sometimes higher
2k model: euler, sgm_uniform scheduler, 48 steps, CFG 3.5 to 5, or sometimes higher
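For anyone sampling outside ComfyUI, here is a minimal sketch of the 2k settings using diffusers. Everything in it is an assumption rather than this release's official workflow: the checkpoint path is a placeholder, the weights are assumed to already be in diffusers layout (checkpoints from the official trainer may need conversion first), and "euler"/"sgm_uniform" are ComfyUI names approximated here with EulerDiscreteScheduler and trailing timestep spacing.

```python
# Hedged inference sketch (assumes a recent diffusers with PixArt-Sigma support).
import torch
from diffusers import PixArtSigmaPipeline, EulerDiscreteScheduler
from safetensors.torch import load_file

# Load the official 2k base for the configs, VAE, and T5 text encoder.
pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-2K-MS",
    torch_dtype=torch.float16,
).to("cuda")

# Swap in the finetuned transformer weights (placeholder path; the keys must
# already match the diffusers layout for this to work as-is).
pipe.transformer.load_state_dict(load_file("pixart_sigma_2k_finetune.safetensors"))

# Rough stand-in for ComfyUI's euler sampler with sgm_uniform scheduling.
pipe.scheduler = EulerDiscreteScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)

image = pipe(
    "a photo of a lighthouse at dusk",
    num_inference_steps=48,  # 2k settings from above
    guidance_scale=3.5,
).images[0]
image.save("out.png")
```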
Comments
Hello, I'm impressed by your model. Could I ask for details about the training data, as well as how you labeled it?
Hello, and thank you. The 2k model saw 5 epochs at lr = 1e-8 on 30k JPG photos captioned by BAAI/Bunny-v1_1-Llama-3-8B-V. The 1024 model trains much faster and saw 6 epochs of 60k images at the same lr (its dataset combined with the 2k dataset). I've only used the official trainer so far and found the default CAME optimizer works best. All on a single 4090 24GB. I'm active in the PixArt Discord, along with many talented trainers: https://discord.gg/rde6eaE5Ta
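For context on the captioning step, it looks roughly like the trust_remote_code example on the BAAI/Bunny-v1_1-Llama-3-8B-V model card. A hedged sketch adapted from that example; the captioning prompt and image path below are illustrative assumptions, not what was actually used for this dataset:

```python
# Hedged captioning sketch, adapted from the Bunny model card's example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Bunny-v1_1-Llama-3-8B-V",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "BAAI/Bunny-v1_1-Llama-3-8B-V", trust_remote_code=True
)

prompt = "Describe this photo in thorough detail."  # illustrative prompt
text = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    f"questions. USER: <image>\n{prompt} ASSISTANT:"
)

# Bunny splices the image in at a special -200 placeholder token.
chunks = [tokenizer(chunk).input_ids for chunk in text.split("<image>")]
input_ids = torch.tensor(
    chunks[0] + [-200] + chunks[1][1:], dtype=torch.long
).unsqueeze(0).to(model.device)

image = Image.open("photo_0001.jpg")  # one image from the dataset
image_tensor = model.process_images([image], model.config).to(
    dtype=model.dtype, device=model.device
)

output_ids = model.generate(
    input_ids, images=image_tensor, max_new_tokens=300, use_cache=True
)[0]
caption = tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True)
print(caption.strip())
```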
@yayaman Thank you for your feedback. I'm trying to train on 1k 1920x1080 landscape images but it doesn't converge, so it seems I need many more images. Can I ask whether you used data you filtered yourself or an existing dataset? Thank you very much.
@nguyentiendat1531999953 Others have had good results with under 1k images. I filtered and captioned everything myself, at many aspect ratios. Be sure to keep the LR low. I'm using batch size 12 for the 1024 model and have gradient accumulation disabled because it gave poor results.
@nguyentiendat1531999953 One more thing to try: set "deterministic_validation = True" in the config with a fairly small validation step interval, and you get a fixed view of how the image changes at one seed. That makes it easier to see whether things are getting better or worse.
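For anyone hunting for these knobs: they live in the official PixArt-sigma trainer's Python config files. A hedged fragment under that assumption — the field names follow the repo's config style, but treat the exact keys and values as version-dependent illustrations, not a verbatim working config:

```python
# Fragment of a PixArt-sigma training config (Python-style config file).
train_batch_size = 12            # batch size discussed above (1024 model)
gradient_accumulation_steps = 1  # accumulation effectively disabled
num_epochs = 6

# Default CAME optimizer with the very low LR discussed above.
optimizer = dict(type='CAMEWrapper', lr=1e-8, weight_decay=0.0,
                 betas=(0.9, 0.999, 0.9999))

# Fixed-seed validation at a short interval, to watch one seed evolve.
deterministic_validation = True
eval_sampling_steps = 250  # placeholder; set this fairly small
```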
Oh! And my captions are almost always 300 or more tokens (clipped to 300).
@yayaman How do I disable gradient accumulation, by setting it to 1? And which lr did you mention, 1e-8?
@yayaman And I have another question: since the captions are up to 300 tokens long, how can users type prompts that long? Have you thought about using ChatGPT or another LLM to generate prompts?
@nguyentiendat1531999953 Yes, 1 accumulation step and lr 1e-8 for 1024 v0.8. I'm currently training at 5e-9 to test.
> how can users type prompts that long? Have you thought about using ChatGPT or another LLM to generate prompts?
I generate the training captions with an automated VLM. Users can type as short a prompt as they like; the rest is filled with padding tokens.
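Concretely, the 300-token budget is handled by the tokenizer rather than the user. A minimal sketch of the pad/truncate behavior, assuming PixArt-Sigma's T5 tokenizer in the standard diffusers repo layout (the repo id and subfolder here are assumptions):

```python
# Sketch: short prompts are padded and long captions clipped to the same
# 300-token budget that PixArt-Sigma's T5 text encoder expects.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", subfolder="tokenizer"
)

short_prompt = "a lighthouse at dusk"
long_caption = "a detailed 300+ token VLM caption ..."  # imagine the full text

for text in (short_prompt, long_caption):
    enc = tok(text, max_length=300, padding="max_length",
              truncation=True, return_tensors="pt")
    print(enc.input_ids.shape)  # always (1, 300): padded or clipped
```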
@yayaman Thank you very much