Hello there, finally tackling a Z-Image Base full finetune. It is complex but interesting.
This model adds nudity, anatomy, and variation to the base model, helping with proportions, posing, and angles, adding concepts like genitalia, and also giving a push to in-between body shapes and expressions, ranging from chubby, slender, soft, muscular, and realistic proportions all the way to extreme proportions.
The model can now produce nude men, nude feminine men, nude trans women, and nude women.
Call them like that and experiment; there are no real trigger words. Captioning was done with Kimi K2.5 for the natural-language part, and I added a tag line with Joytag so that some concepts repeat, helping the text encoder grab recurrent elements. So prompt in a mix of natural language and tags.
Should be able to do a lot of various styles.
Shift 5, CFG 4, Euler Beta, and 40–50 steps to see good results.
It works even better with the Fun 4-step distill LoRA: CFG 1, with 4 or 6 steps.
Still imperfect, obviously. V2 is in the back of my mind, with a larger dataset, better selection in mind, a deeper and harder funnel technique, and some tests around BF16 precision.
Still adjusting. Always learning.
(Human-written part)
Z-Image Base Full Finetune – Technical Notes (Experimental Observations)
This model was trained as a full finetune on Z-Image Base (BF16) using Musubi Tuner.
Training was performed on an H100 80GB, at 1024 resolution, with:
Batch size: 9
Gradient accumulation: 1
Full BF16
Flash attention enabled
Gradient checkpointing
Bucketed dataset (many varied buckets)
Dataset of 2100 images, with highly varied styles, poses, angles, and physiological variation.
The following notes are based on practical experimentation.
They are not official documentation, and should be considered empirical observations that worked in this setup.
1. Z-Image Base behaves differently from SD models
Z-Image Base uses a DiT transformer architecture and flow-based timestep sampling.
Compared to SD1.5 / SDXL:
Structural changes take longer to appear.
The model shows strong internal coherence.
It resists abrupt shifts.
Visible improvement often depends heavily on sampling settings.
It feels layered — structural changes must propagate across multiple refinement stages.
Z-Image Base is harder to move, but very stable once shaped.
2. About shift (Flow Shift) During Training
Using:
--timestep_sampling shift
--discrete_flow_shift X
modifies how timesteps are distributed during training.
From experimentation:
Higher shift (≈ 2.5+)
Emphasizes global structure.
Useful for early structural imprinting.
Mid shift (~2.0–2.2)
Appears to consolidate structure.
Balances geometry and detail.
Lower shift (1.5–1.7)
Seems to refine fine details.
Useful for finishing phases.
This suggests a staged approach:
Start higher for structure, progressively lower for refinement.
This is an experimental strategy — not an official rule.
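The shift remapping above can be sketched with the standard flow-matching shift formula (the form used by SD3-style schedulers; assuming Musubi Tuner's `--discrete_flow_shift` applies the same mapping, which is worth verifying against its source):

```python
def shift_timestep(t: float, shift: float) -> float:
    """Remap a uniform timestep t in [0, 1] with a flow shift.

    Higher shift pulls sampling density toward the high-noise end
    (global structure); shift = 1.0 is the identity (no remapping),
    and lower shift values concentrate steps on low-noise details.
    """
    return shift * t / (1.0 + (shift - 1.0) * t)

# A mid-schedule timestep under shift 2.5 lands well past 0.5,
# i.e. the model spends more of training on structural timesteps.
mid = shift_timestep(0.5, 2.5)
```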
3. Optimizer Choice (Adafactor vs AdamW)
In this setup, Adafactor performed better than AdamW for full finetuning.
Observed behavior:
Lower VRAM usage
Larger stable batch size (9 at 1024 on H100)
More stable long-phase convergence
Example configuration used:
--optimizer_type adafactor
--optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False"
--lr_scheduler constant_with_warmup
Again, this is empirical — other setups may vary.
4. Learning Rate Funnel Strategy
A progressive reduction strategy was used:
Early phase: higher LR for structural change
Mid phase: moderate LR for stabilization
Final phases: very low LR for micro-refinement
The core idea:
Move the structure first.
Refine later without breaking global coherence.
Z-Image Base appears to benefit from staged training rather than a flat learning rate schedule.
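The funnel can be sketched as a piecewise-constant schedule. The phase fractions and learning-rate values below are illustrative placeholders, not the exact values used for this finetune:

```python
def funnel_lr(step: int, total_steps: int,
              phases=((0.4, 2e-5), (0.35, 1e-5), (0.25, 4e-6))) -> float:
    """Piecewise-constant 'funnel' learning rate.

    phases: (fraction_of_training, lr) pairs, in order. Early phase
    uses a higher LR to move structure, later phases drop the LR to
    refine without breaking global coherence. Values are examples only.
    """
    progress = step / total_steps
    cumulative = 0.0
    for fraction, lr in phases:
        cumulative += fraction
        if progress < cumulative:
            return lr
    return phases[-1][1]  # final-phase LR past the end of the schedule
```

In practice each phase would be a separate training run resumed from the previous checkpoint, with the LR set via the trainer's flags rather than computed per step.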
5. Dataset and Bucketing
The dataset was:
1024 resolution aligned with Z-Image Base
Fully bucketed
With many varied aspect buckets
Multi-distribution (varied morphologies)
Using many buckets helped:
Preserve structural consistency
Avoid overfitting to a narrow framing
Maintain prompt flexibility
6. Sampling Matters More Than Expected
Z-Image Base can look soft or “vaporous” under weak guidance.
Under stronger guidance (e.g. 4.0) and sufficient steps (e.g. 50):
Structural improvements become significantly clearer.
Anatomical refinement is more visible.
Prompt conditioning becomes stronger.
When evaluating finetunes:
Use consistent seeds
Test guidance 3–5
Try multiple flow_shift sampling values (e.g. 3 and 5)
Compare phases side by side. Loss is not very telling here: just watch that it does not spike. You will not see deep dives in the curve, so your best bet is sampling between training phases.
The model’s internal changes may not appear under weak sampling settings.
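The evaluation checklist above can be sketched as a simple parameter sweep. The `generate` callable here is a hypothetical stand-in for whatever inference pipeline you use (ComfyUI API, diffusers, etc.); only the sweep structure is the point:

```python
from itertools import product

def evaluation_grid(generate, prompt: str):
    """Render a fixed grid of (seed, guidance, flow_shift) settings.

    Keeping seeds and settings identical across checkpoints makes
    phase-to-phase comparison meaningful.
    """
    seeds = [42, 1234]                 # fixed across all checkpoints
    guidance_scales = [3.0, 4.0, 5.0]  # guidance 3-5 range
    flow_shifts = [3.0, 5.0]           # e.g. shift 3 and 5
    results = []
    for seed, cfg, shift in product(seeds, guidance_scales, flow_shifts):
        image = generate(prompt, seed=seed, cfg=cfg, shift=shift, steps=50)
        results.append(((seed, cfg, shift), image))
    return results
```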
7. Important: These Are Hypotheses
Everything above is based on hands-on experimentation with:
Musubi Tuner
Z-Image Base (BF16)
H100 80GB
1024 resolution
Large batch (9)
Progressive shift funnel
Z-Image Base is complex.
Different datasets, hardware, or goals may respond differently.
These notes should be treated as:
Practical observations
Not universal truth
A starting point for experimentation
This finetune aimed to preserve:
Z-Image Base’s native photorealistic grain
Structural coherence
Prompt responsiveness
Stability under guidance
If you experiment further with Z-Image Base,
structured training and careful sampling evaluation seem essential.
(This note was written with AI assistance, to avoid confusion: I am not a native speaker, but I reread it, approve it, and take responsibility for this write-up of my experience.)
Description
The first one, not the last.