By their nature, training parameters have no single right answer…
It comes down to choosing which kind of LoRA (model) behavior you want to encourage. And often you can’t know for sure without actually running the fine-tune…
The prompt inconsistency you see between Turbo and Base is most likely because their prompting quirks are completely different…
Why Z-Image (Base) looks “inconsistent” for you (likely cause)
Z-Image Turbo and Z-Image Base want different prompting and sampling behaviors:
- Turbo is a few-step distilled model and “does not use negative prompts at all” in the intended inference setup. (Hugging Face)
- Base is meant to be run with CFG (guidance scale ~3–5), more steps (28–50), and negative prompts strongly recommended. (GitHub)
So if you apply “Turbo-style prompting” (short prompt, no negatives, guidance ~0) to Base, Base can look sloppy or inconsistent. Conversely, if you apply “Base-style prompting” (strong negatives + higher guidance) to Turbo, Turbo can drift (including face changes) because it’s not operating the way it was distilled to operate. (Hugging Face)
Your “face drifts when the scene changes” observation is very consistent with Turbo’s few-step behavior, and using a face LoRA to stabilize identity is a reasonable workaround.
Your current plan (face LoRA → generate 40 diverse images → train a second LoRA) is valid, but has one big constraint
Turbo tends to degrade when stacking multiple LoRAs. This is widely reported in community testing and discussion. (Hugging Face)
This actually makes your plan strategically sound: it can produce a single “combined” LoRA (identity + environment handling), so you don’t have to stack a face LoRA and a scene/style LoRA at inference.
The main risk is feedback-loop learning: if your 40 images are all synthetic outputs from the same pipeline, the second LoRA can learn the pipeline’s artifacts and biases very strongly. The fix is simple: curate hard, and force real diversity (lighting, lenses, backgrounds, compositions, distances, expressions).
The key hyperparameters you asked about
1) Rank 16 vs 32 for a 40-image dataset
Rule of thumb:
- Rank 16: best default for identity consistency + prompt flexibility, lower overfit risk.
- Rank 32: use when you specifically need micro-detail capture (skin texture, subtle facial structure across angles, fine accessories), and you have enough variety in the dataset to avoid memorization.
Community experience is mixed: some people find rank 16 “enough,” others report rank 32+ helps photorealistic micro-detail. Treat those reports as anecdotal, but directionally useful. (Reddit)
Recommendation for your 40-image “multi-scene” run
- Start with rank 16 for the first full run.
- Only move to rank 32 if you can point to a consistent failure mode that looks like “capacity” (e.g., face loses specific traits across angles, fine facial geometry collapses, small defining details disappear).
2) Will 3–4k steps be enough when moving to 40 images?
Step counts depend on batch size and repeats, but a useful way to reason is “steps per image”: total optimizer steps × batch size ÷ dataset size.
So if your “sweet spot” previously appeared around 2,400–3,000 steps on a smaller set, reaching the same per-image training intensity on 40 images will usually take more total steps (often ~6k–8k).
This matches common Turbo LoRA baselines that treat ~2,500–3,000 as a starting point for small datasets, not a hard universal number. (Hugging Face)
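The scaling above can be written as a tiny helper. This is a rough heuristic sketch, not an exact law, and the 15-image example set size below is a hypothetical number, not something from your run:

```python
def equivalent_steps(old_steps: int, old_images: int, new_images: int,
                     batch_size: int = 1) -> int:
    """Scale a step count so each image receives roughly the same number
    of optimizer updates (a heuristic, not an exact law)."""
    steps_per_image = old_steps * batch_size / old_images
    return round(steps_per_image * new_images / batch_size)

# Hypothetical example: a sweet spot of ~2,700 steps on a 15-image set
# maps to ~7,200 steps on 40 images at the same batch size.
print(equivalent_steps(2_700, 15, 40))  # -> 7200
```

This is also why “3–4k steps” that felt right on a small set tends to undertrain a 40-image set.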
Practical recommendation
- For 40 images, plan a run to 8,000 steps, but save checkpoints at (for example) 2k / 4k / 6k / 8k and pick the best by consistent evaluation prompts.
- If you see overdominance early (everything looks like your dataset no matter the prompt), stop earlier and/or reduce LR.
3) 1280 vs 1536 resolution for creating/training images
There are two separate issues here:
(A) What Turbo was “in-domain” for
A Tongyi discussion on resolution/latent sizing strongly implies staying within a domain like 768–1280 around the “1024 grid,” and points to predefined resolution choices used in their app. (Hugging Face)
(B) What tooling exposes
Many workflows expose resolution presets including 1024 / 1280 / 1536. (GitHub)
What this means for you
- If your priority is identity stability across scenes, 1280 is the safer high-detail choice.
- 1536 can work, but it’s more likely to behave “out-of-domain” unless your whole pipeline (generation and training and evaluation) is consistently 1536 and your dataset is strong.
Recommendation
- Generate your 40-image dataset at 1280 (or a mix of 1024 + 1280 buckets).
- Do final outputs at 1536 using a second pass (upscale/img2img) if needed, rather than forcing the LoRA itself to learn everything at 1536 from only 40 samples.
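A small bucketing helper can keep your generated dataset near the 1024 grid. This is a sketch under the assumption that the model prefers 64-pixel-multiple sides in roughly the 768–1280 range (per the discussion above); the snapping scheme itself is illustrative, not taken from any official tool:

```python
def bucket(aspect: float, base: int = 1024, step: int = 64,
           lo: int = 768, hi: int = 1280) -> tuple[int, int]:
    """Pick a (width, height) bucket with roughly base*base pixels,
    snapped to `step` multiples and clamped to the assumed comfort range."""
    h = (base * base / aspect) ** 0.5   # area = aspect * h^2 = base^2
    w = aspect * h
    snap = lambda v: max(lo, min(hi, round(v / step) * step))
    return snap(w), snap(h)

print(bucket(1.0))    # -> (1024, 1024)
print(bucket(3 / 4))  # -> (896, 1152), a portrait bucket near 1024^2
```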
4) Learning rate changes if you increase rank
A very common safe band for diffusion LoRA LR is 1e-4 to 5e-5. (Hugging Face)
If you go from rank 16 → 32, reducing LR (e.g., 1e-4 → 5e-5) is a reasonable safety move because you’re increasing capacity and the training can become more “aggressive.”
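One way to encode that safety move, hedged as a heuristic rather than a rule: scale LR down roughly linearly with rank, clamped to the common band mentioned above.

```python
def lr_for_rank(base_lr: float = 1e-4, base_rank: int = 16,
                rank: int = 16) -> float:
    """Heuristic: reduce LR roughly in proportion to rank growth,
    staying inside the common 5e-5 .. 1e-4 band."""
    lr = base_lr * base_rank / rank
    return max(5e-5, min(1e-4, lr))

print(lr_for_rank(rank=16))  # -> 0.0001
print(lr_for_rank(rank=32))  # -> 5e-05
```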
5) FP32 output for “quality”
If your training is actually happening in bf16/fp16 (which is typical in these toolchains), saving the LoRA as fp32 does not recover lost precision; it mostly stores already-rounded values in a larger container. (Reddit)
BF16 is generally considered a strong default for stability/efficiency, and FP32 is rarely the deciding factor for visual quality compared to dataset quality and stopping at the right step. (RunComfy)
Recommendation
- Keep bf16 unless you have a concrete numerical instability problem you can reproduce.
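You can demonstrate the “fp32 container doesn’t restore bf16 precision” point in a few lines. This sketch simulates bf16 by truncating the low 16 bits of a float32 (real bf16 training rounds to nearest, but the precision loss is the point):

```python
import struct

def to_bf16(x: float) -> float:
    """Reduce a value to bfloat16 precision by keeping only the top 16
    bits of its float32 representation (truncation, for illustration)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

w = 0.1234567
w_bf16 = to_bf16(w)           # precision is lost here, during training
w_saved_fp32 = float(w_bf16)  # "saving as fp32" changes nothing
print(w, w_bf16, w_saved_fp32 == w_bf16)  # the cast cannot restore bits
```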
Turbo-specific training stability: do this if you want to preserve “fast-step Turbo quality”
Training directly on Turbo can break the acceleration behavior; this is why “training adapter / de-distillation” and “DistillPatch” approaches exist.
- Ostris’ training adapter is explicitly designed for fine-tuning Turbo without collapsing the distilled behavior. (Hugging Face)
- DiffSynth’s DistillPatch page describes multiple schemes and recommends a strategy that preserves speed while keeping training simpler. (Hugging Face)
If your LoRA starts looking fine at 20–30 steps but falls apart at 8–9 steps, that’s a classic sign you need one of these “preserve acceleration” approaches. (Hugging Face)
A concrete “good solution” configuration for your 40-image run
Baseline run (most likely to succeed)
- Base: Turbo + training adapter (to keep distillation intact) (Hugging Face)
- Rank: 16
- LR: 1e-4 (or 7e-5 if you want safer identity) (Hugging Face)
- Steps: train to 8,000; evaluate checkpoints at 2k/4k/6k/8k
- Resolution buckets: 1024 + 1280 (avoid forcing 1536 with only 40 images unless necessary) (Hugging Face)
- Prompting for dataset creation: remember Turbo ignores negative prompts; encode constraints directly in the positive prompt. (Hugging Face)
“Higher capacity” run (only if baseline lacks detail)
- Rank: 32
- LR: 5e-5 (Hugging Face)
- Steps: still 8,000 (but watch overdominance)
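The two runs above can be captured as config sketches. The key names here are illustrative, not from any specific trainer’s schema; map them onto whatever your toolchain actually exposes:

```python
# Illustrative config sketch -- key names are NOT a real trainer schema.
baseline = {
    "base_model": "Z-Image Turbo",   # plus the training adapter
    "rank": 16,
    "learning_rate": 1e-4,           # or 7e-5 for a safer identity run
    "max_steps": 8_000,
    "checkpoint_every": 2_000,       # evaluate 2k / 4k / 6k / 8k
    "resolution_buckets": [(1024, 1024), (1280, 1280)],
    "negative_prompt": None,         # Turbo ignores negatives
}

# Higher-capacity run: change only rank and LR, nothing else.
higher_capacity = {**baseline, "rank": 32, "learning_rate": 5e-5}
```

Keeping the second run as a two-key override of the first makes the “change one axis at a time” discipline explicit.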
Why this feels so hard (and how to make it tractable)
Hyperparameters interact, but you can make the search manageable by fixing everything except one axis at a time:
- Fix dataset + captions → tune steps (pick the best checkpoint).
- Only if needed, try rank 32 at a lower LR.
- Only if needed, revisit resolution (1280 vs 1536).
If you try to change rank/LR/steps/resolution together, it becomes almost impossible to tell what helped.
One last warning specific to your “generate 40 images, then train”
Because you’re using generated images as training targets, you should deliberately inject variation so the second LoRA learns identity rather than “that one portrait’s hidden artifacts”:
- vary camera distance (close/medium/full)
- vary lighting type (softbox, daylight, tungsten, neon)
- vary background complexity (simple → cluttered)
- include some “hard cases” (profile view, harsh shadows, backlight)
This reduces memorization and improves the “face stays the same when the scene changes” property you care about.
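One way to force that variation is to enumerate the axes as a grid and sample your 40 prompts from it, so no axis is accidentally constant. The axis values below are examples, not a prescribed taxonomy, and `<subject>` stands in for your trigger token:

```python
from itertools import product

# Axes mirror the variation checklist above; values are examples only.
distances = ["close-up", "medium shot", "full body"]
lighting  = ["softbox", "daylight", "tungsten", "neon"]
settings  = ["plain studio background", "cluttered street"]

grid = [f"photo of <subject>, {d}, {l} lighting, {s}"
        for d, l, s in product(distances, lighting, settings)]
print(len(grid))  # -> 24 combinations; sample ~40 with seeds/repeats
```

Add your “hard cases” (profile view, harsh shadows, backlight) as extra axis values or hand-written prompts on top of the grid.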