About training a LoRA for Z Image Turbo

I’m going to train a LoRA for Z Image Turbo, but I don’t like the base output (background and colors). I want to add some color-adjustment steps with NAG and Detail Daemon.

Is the base output recommended for LoRA training, or should I train with the most optimal images for my goals?

How will the two approaches affect my results?


When using it as a base model for LoRA training, Base might be fundamentally more suitable than Turbo? But Turbo came out first and is more widespread, which makes the choice difficult…


Background you need for this decision

A LoRA does not “fix” a model in the abstract. It learns to reproduce the distribution of whatever you train it on (subject, lighting, palette, background types, contrast curve, etc.). This is why dataset choice matters more than most hyperparameters.

Z-Image Turbo adds a second complication: it is a few-step distilled model that does not rely on classifier-free guidance (CFG) at inference and, in the official prompting guidance, does not use negative prompts at all. (Hugging Face) Turbo is typically run at ~9 steps with guidance scale 0.0.
That pushes more of the “look” (including background/color tendencies) into the model prior, and makes classic “negative prompt fixes” weaker unless you use something like NAG.
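To make the contrast concrete, the two inference regimes described in this thread can be summarized as plain settings dictionaries. These are illustrative values quoted from the discussion, not an official config file:

```python
# Inference regimes for Z-Image as described in this thread
# (illustrative settings dictionaries, not an official config).

TURBO = {
    "num_inference_steps": 9,    # few-step distilled model
    "guidance_scale": 0.0,       # no CFG at inference
    "negative_prompt": None,     # negative prompts are not used
}

BASE = {
    "num_inference_steps": (28, 50),  # recommended range
    "guidance_scale": (3.0, 5.0),     # CFG actively used
    "negative_prompt": "recommended", # negatives strongly recommended
}

def uses_negative_prompt(regime: dict) -> bool:
    """Negative prompts only matter when the regime actually applies them."""
    return regime["negative_prompt"] is not None

print(uses_negative_prompt(TURBO))  # False
print(uses_negative_prompt(BASE))   # True
```

This is why any “don’t do X” constraint on Turbo has to be phrased in the positive prompt or handled by attention-space methods like NAG.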


1) Is the base output recommended for LoRA training?

It depends on what your LoRA is for

If your LoRA is primarily identity / subject / concept

Using images that are clean, consistent, and not heavily “styled” is often the most robust path.

  • You want the LoRA to learn what the subject is, not “a specific color grade.”
  • You can keep background/color control as an inference-time knob.

In this case, training on “base-ish” outputs can be acceptable, because you are not trying to reprogram the global aesthetic—just teach a concept.

If your LoRA is primarily look / palette / background behavior

Then training on base-like images is not recommended for your goal, because:

  • The LoRA will learn the base model’s palette/background priors as part of the target distribution.
  • You may end up fighting the LoRA with NAG/Detail Daemon every time.

For a look-changing objective, your training images should already reflect your desired look.


2) Should you train with the most optimal images for your goals?

For your specific complaint (background + colors): generally yes

If you want different backgrounds and color behavior, training on your “optimal” images is the direct route.

But “optimal” should mean:

  • Your intended palette/contrast/white balance
  • Not overprocessed (avoid extreme sharpening / HDR micro-contrast / aggressive artifact removal)

Why the caution: Detail Daemon explicitly notes that pushing it too far produces an oversharpened and/or HDR effect. (GitHub) If you train on images that already have that signature and then also apply Detail Daemon at inference, you can get “double enhancement.”


How the two approaches affect results

Approach A — Train on base-like images (neutral training set)

What improves

  • Better generalization across prompts and scenes.
  • Your LoRA is less likely to “force” one palette/background everywhere.
  • NAG and other inference controls remain clean “steering knobs.”

What stays the same

  • The model’s default background/color tendencies will still show up unless you steer them at inference (NAG, prompt strategy, post).

Typical failure mode

  • You finish training and feel: “My LoRA works, but the colors/background are still wrong.”

Approach B — Train on “final-look” images (your optimal goal images)

What improves

  • Background and palette shift by default, with less per-prompt tweaking.
  • More consistent “house style.”

What gets worse

  • Less flexibility: the LoRA can “drag” unrelated prompts toward your dataset’s look.
  • Higher risk of learning workflow artifacts (over-sharpen, haloing, crunchy texture, too-clean backgrounds).

Typical failure mode

  • “Everything looks like my training set, even when I don’t want it to.”

A practical compromise that works well

Train on images that are 80–90% your target look (good grading, but not extreme), then do the last 10–20% with NAG/Detail Daemon.


Where NAG and Detail Daemon fit

NAG (Normalized Attention Guidance)

NAG is explicitly motivated by the problem that CFG-based negative guidance collapses in few-step regimes, and it restores effective negative-style control by operating in attention space. (arXiv)
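As a rough intuition only (this is not the paper’s exact algorithm, and the parameter names are made up for illustration), the NAG idea can be sketched as: extrapolate the positive attention features away from the negative ones, cap the L1 norm of the result relative to the positive branch so it cannot blow up, then blend back toward the positive features:

```python
# Toy sketch of the NAG idea on one feature vector (NOT the paper's exact
# algorithm): extrapolate positive vs. negative attention features, cap the
# L1 norm of the result relative to the positive branch, then alpha-blend.

def nag_sketch(z_pos, z_neg, scale=3.0, tau=2.5, alpha=0.5):
    # 1) extrapolate away from the negative features
    z_ext = [p + scale * (p - n) for p, n in zip(z_pos, z_neg)]

    # 2) normalize: don't let the L1 norm exceed tau x the positive norm
    l1_pos = sum(abs(v) for v in z_pos)
    l1_ext = sum(abs(v) for v in z_ext)
    if l1_ext > tau * l1_pos and l1_ext > 0:
        z_ext = [v * (tau * l1_pos / l1_ext) for v in z_ext]

    # 3) blend back toward the positive branch
    return [alpha * e + (1 - alpha) * p for e, p in zip(z_ext, z_pos)]

# With scale=0 the extrapolation is a no-op, so the output is just z_pos:
print(nag_sketch([1.0, -2.0], [0.5, -1.0], scale=0.0, alpha=1.0))  # [1.0, -2.0]
```

The point of the normalization step is exactly why NAG survives few-step regimes where raw CFG-style extrapolation collapses: the guidance can steer but never dominate the feature magnitudes.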

Implication for you

  • If you keep training neutral (Approach A), NAG is a strong way to suppress unwanted background/color traits at inference without baking them into the LoRA.
  • If you bake the look into the LoRA (Approach B), you typically need less NAG (otherwise you can over-suppress and lose richness).

Detail Daemon

Detail Daemon adjusts sigma/sampling behavior to enhance detail and can reduce unwanted blur, but can produce an oversharpen/HDR look if pushed. (GitHub)

Implication for you

  • Use it as a finisher, not as the “source of truth” for what your LoRA should learn.
  • If you want to train on images processed with it, keep settings moderate and consistent.

Turbo vs Base as the training base model

Why Z-Image (Base) is generally better for LoRA training

The official Z-Image repo recommends for the Base model:

  • guidance scale 3.0–5.0
  • inference steps 28–50
  • negative prompts strongly recommended (GitHub)

That is exactly the environment where you can:

  • control background/colors at inference (CFG + negatives)
  • train a LoRA without depending on special “distillation-preserving” tricks

Why Z-Image Turbo is riskier to train on directly

Turbo is distilled for fast steps and “no CFG/negative prompts.” (Hugging Face)
A commonly reported issue is that LoRAs trained directly on Turbo can lose the acceleration behavior—images become blurry at “fast” settings while looking normal at slower settings. (Hugging Face)

If you must train on Turbo anyway, there are two established mitigation paths:

  1. Training adapter / de-distillation approach (Ostris adapter) (Hugging Face)
  2. DistillPatch LoRA to restore fast-step behavior after Turbo LoRA training (Hugging Face)

What I would do for your purpose

If your priority is “fix background + colors” and you want stability

Train on Z-Image (Base) with a dataset that reflects your desired palette/background, then:

  • generate on Base when you want maximum control/quality
  • optionally test the same LoRA on Turbo for speed

This leverages Base’s controllability and avoids Turbo’s distillation fragility. (GitHub)

If your priority is “must run on Turbo at 8–9 steps”

Train for Turbo, but plan the pipeline up front:

  1. Train with a Turbo training adapter (Hugging Face)
  2. Evaluate every checkpoint at Turbo inference settings (few steps, guidance off) (Hugging Face)
  3. If fast-step quality degrades, apply DistillPatch (Hugging Face)
  4. Keep your training images “goal-like but not overcooked” to avoid learning artifacts (GitHub)

Concrete “better approach” patterns (high leverage)

Pattern 1: Separate “subject LoRA” and “look LoRA”

  • Train subject/identity on clean images (neutral look).
  • Train look/palette/background on curated goal images.
  • Use lower strengths and avoid stacking too many LoRAs on Turbo (Turbo is more sensitive). (Hugging Face)

Pattern 2: Make the look toggleable with a trigger token

If you want the palette/background sometimes but not always:

  • Caption the look with a distinct token (e.g., my_grade_style) and keep other captions content-focused.
  • Then you can turn the look on/off per prompt.
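The caption convention above can be sketched in a few lines; the token name `my_grade_style` is just an example placeholder, as in the pattern description:

```python
# Sketch of "toggleable look" captioning: the look is tied to a distinct
# trigger token (the name "my_grade_style" is only an example), while the
# rest of the caption stays content-focused.

def build_caption(content: str, with_look: bool,
                  look_token: str = "my_grade_style") -> str:
    parts = ([look_token] if with_look else []) + [content]
    return ", ".join(parts)

print(build_caption("woman reading in a cafe, window light", with_look=True))
# my_grade_style, woman reading in a cafe, window light
print(build_caption("woman reading in a cafe, window light", with_look=False))
# woman reading in a cafe, window light
```

If roughly half the training captions carry the token and half don’t, the look stays attached to the token rather than leaking into every generation.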

Pattern 3: Don’t “teach” Detail Daemon artifacts

Use Detail Daemon primarily at inference; if you must use it in dataset creation, keep it mild and consistent to avoid baking in HDR/sharpen artifacts. (GitHub)


Summary answers to your questions

1) Is the base output recommended for LoRA training?

  • Only if your LoRA is for subject/concept and you want maximum generality.
  • If your complaint is specifically background + colors, base-like images will tend to preserve the base look.

2) Should you train with the most optimal images for your goals?

  • Yes, if your goal is to change palette/background behavior.
  • Use “optimal” as “desired look, but not overprocessed,” to avoid training in artifacts (especially from aggressive detail enhancement). (GitHub)

Turbo vs Base for your purpose?

  • For reliable look control: train on Z-Image (Base) (better CFG/negative/steps control). (GitHub)
  • For Turbo-at-8-steps delivery: train with a Turbo training adapter and expect to use DistillPatch if acceleration quality breaks. (Hugging Face)

Thank you very much. The base model is very inconsistent and produces poor-quality images, or perhaps I’m not getting the desired results because I’m applying the same prompting technique to Turbo as I did to the original. In Turbo, the face changed significantly when the scene changed, so I created one portrait and then converted it into 15 different portraits from various angles using Nano Banana. I trained a Face LoRA with it, and now I can produce images with more consistent faces in Turbo, almost exactly as I want. My goal, as you suggested, is to produce 40 images with different locations, lighting, and elements, then train a new LoRA with them. Currently, with a 1e-4 LR, rank 16, no quantization, and BF16 output, I get a sweet spot of 2400-3000 on a 3k-step run. I previously tested a 4k-step run, and the sweet spots were 2200 and 2600. I want to continue with this approach for now.

However, I still have some questions. For 40 images, is rank 16 ideal, or should I choose 32 (I’ll choose FP32 output for quality anyway)? Since all parameters affect each other, finding the optimal combination is really difficult. Will 3-4k steps be enough when I go up to 40 images? Should I use a 1280 base resolution or 1536 when creating the images, given that Ostris has a 1536 resolution option? If I increase the rank, I think I’ll need to adjust the LR to 5e-5.

To be honest, it’s a very challenging process for someone who wants to achieve perfection.


There’s no absolute right answer for training parameters by nature…
It becomes a matter of choosing what kind of LoRA (model) behavior is likely to emerge. And often, you can’t say for sure without actually going through fine-tuning…

The inconsistency in prompts between Turbo and Base seems to be because their prompt quirks are completely different…


Why Z-Image (Base) looks “inconsistent” for you (likely cause)

Z-Image Turbo and Z-Image Base want different prompting and sampling behaviors:

  • Turbo is a few-step distilled model and “does not use negative prompts at all” in the intended inference setup. (Hugging Face)
  • Base is meant to be run with CFG (guidance scale ~3–5), more steps (28–50), and negative prompts strongly recommended. (GitHub)

So if you applied “Turbo-style prompting” (short prompt, no negatives, guidance ~0) to Base, Base can look sloppy or inconsistent; and if you apply “Base-style prompting” (strong negatives + higher guidance) to Turbo, Turbo can drift (including face changes) because it’s not operating how it was distilled to operate. (Hugging Face)

Your “face drifts when the scene changes” observation is very consistent with Turbo’s few-step behavior, and using a face LoRA to stabilize identity is a reasonable workaround.


Your current plan (face LoRA → generate 40 diverse images → train a second LoRA) is valid, but has one big constraint

Turbo tends to degrade when stacking multiple LoRAs. This is widely reported in community testing and discussion. (Hugging Face)

That means your plan is actually strategically good because it can produce a single “combined” LoRA (identity + environment handling) so you don’t have to stack a face LoRA + a scene/style LoRA at inference.

The main risk is feedback-loop learning: if your 40 images are all synthetic outputs from the same pipeline, the second LoRA can learn the pipeline’s artifacts and biases very strongly. The fix is simple: curate hard, and force real diversity (lighting, lenses, backgrounds, compositions, distances, expressions).


The key hyperparameters you asked about

1) Rank 16 vs 32 for a 40-image dataset

Rule of thumb:

  • Rank 16: best default for identity consistency + prompt flexibility, lower overfit risk.
  • Rank 32: use when you specifically need micro-detail capture (skin texture, subtle facial structure across angles, fine accessories), and you have enough variety in the dataset to avoid memorization.

Community experience is mixed: some people find rank 16 “enough,” others report rank 32+ helps photorealistic micro-detail. Treat those reports as anecdotal, but directionally useful. (Reddit)

Recommendation for your 40-image “multi-scene” run

  • Start with rank 16 for the first full run.
  • Only move to rank 32 if you can point to a consistent failure mode that looks like “capacity” (e.g., face loses specific traits across angles, fine facial geometry collapses, small defining details disappear).

2) Will 3–4k steps be enough when moving to 40 images?

Steps depend on batch size/repeats, but you can reason with “steps per image”:

  • If (roughly) batch=1 and repeats=1:

    • 3,000 steps on 15 images ≈ 200 updates per image
    • 3,000 steps on 40 images ≈ 75 updates per image

So if your “sweet spot” previously appeared around 2,400–3,000 steps on a smaller set, the same “training intensity” on 40 images will usually occur later (often ~6k–8k steps).

This matches common Turbo LoRA baselines that treat ~2,500–3,000 as a starting point for small datasets, not a hard universal number. (Hugging Face)
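The “steps per image” reasoning above is simple enough to write down directly (assuming, as in the text, batch size 1 and repeats 1; real trainers multiply in batch and repeat factors):

```python
# "Steps per image" heuristic from the text (assumes batch=1, repeats=1):
# the same training intensity on a larger dataset needs more total steps.

def updates_per_image(total_steps: int, num_images: int,
                      batch_size: int = 1, repeats: int = 1) -> float:
    return total_steps * batch_size / (num_images * repeats)

def steps_for_same_intensity(ref_steps: int, ref_images: int,
                             new_images: int) -> int:
    # keep updates-per-image constant when the dataset grows
    return round(ref_steps * new_images / ref_images)

print(updates_per_image(3000, 15))             # 200.0
print(updates_per_image(3000, 40))             # 75.0
print(steps_for_same_intensity(2400, 15, 40))  # 6400
print(steps_for_same_intensity(3000, 15, 40))  # 8000
```

Scaling the earlier 2400-3000 sweet spot from 15 to 40 images lands in the ~6k-8k range, which is where the 8,000-step recommendation below comes from.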

Practical recommendation

  • For 40 images, plan a run to 8,000 steps, but save checkpoints at (for example) 2k / 4k / 6k / 8k and pick the best by consistent evaluation prompts.
  • If you see overdominance early (everything looks like your dataset no matter the prompt), stop earlier and/or reduce LR.

3) 1280 vs 1536 resolution for creating/training images

There are two separate issues here:

(A) What Turbo was “in-domain” for
A Tongyi discussion on resolution/latent sizing strongly implies staying within a domain like 768–1280 around the “1024 grid,” and points to predefined resolution choices used in their app. (Hugging Face)

(B) What tooling exposes
Many workflows expose resolution presets including 1024 / 1280 / 1536. (GitHub)

What this means for you

  • If your priority is identity stability across scenes, 1280 is the safer high-detail choice.
  • 1536 can work, but it’s more likely to behave “out-of-domain” unless your whole pipeline (generation and training and evaluation) is consistently 1536 and your dataset is strong.

Recommendation

  • Generate your 40-image dataset at 1280 (or a mix of 1024 + 1280 buckets).
  • Do final outputs at 1536 using a second pass (upscale/img2img) if needed, rather than forcing the LoRA itself to learn everything at 1536 from only 40 samples.
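Aspect-ratio bucketing around a base resolution can be sketched as follows. This is the general idea only; the exact bucket lists real trainers use differ, and the divisibility step of 64 is a common convention, not a Z-Image requirement I can confirm:

```python
# Sketch of aspect-ratio bucketing: keep the pixel area near base*base
# while snapping each side to a multiple of `step`. Illustrative only;
# real trainers ship their own predefined bucket lists.

def bucket_dims(aspect: float, base: int = 1280, step: int = 64) -> tuple[int, int]:
    """aspect = width / height; returns (width, height)."""
    area = base * base
    height = (area / aspect) ** 0.5
    width = height * aspect

    def snap(v: float) -> int:
        return max(step, round(v / step) * step)

    return snap(width), snap(height)

print(bucket_dims(1.0))     # (1280, 1280) -- square bucket
print(bucket_dims(16 / 9))  # landscape bucket, area ~ 1280^2
print(bucket_dims(3 / 4))   # portrait bucket
```

Mixing 1024 and 1280 bases simply means running this with `base=1024` for part of the dataset.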

4) Learning rate changes if you increase rank

A very common safe band for diffusion LoRA LR is 1e-4 to 5e-5. (Hugging Face)

If you go from rank 16 → 32, reducing LR (e.g., 1e-4 → 5e-5) is a reasonable safety move because you’re increasing capacity and the training can become more “aggressive.”
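One common heuristic (a rule of thumb, not a law) is to scale the learning rate inversely with rank, which reproduces the 1e-4 to 5e-5 move above:

```python
# Heuristic: when LoRA rank doubles, capacity grows, so drop the LR to
# keep updates gentle. LR here scales inversely with rank (rule of thumb).

def scaled_lr(base_lr: float, base_rank: int, new_rank: int) -> float:
    return base_lr * base_rank / new_rank

print(scaled_lr(1e-4, 16, 32))  # 5e-05
```

Treat the result as a starting point for the first higher-rank run, not a value to lock in.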

5) FP32 output for “quality”

If your training is actually happening in bf16/fp16 (which is typical in these toolchains), saving the LoRA as fp32 does not recover lost precision; it mostly stores already-rounded values in a larger container. (Reddit)

BF16 is generally considered a strong default for stability/efficiency, and FP32 is rarely the deciding factor for visual quality compared to dataset quality and stopping at the right step. (RunComfy)

Recommendation

  • Keep bf16 unless you have a concrete numerical instability problem you can reproduce.
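The “fp32 container doesn’t recover precision” point can be demonstrated with a few lines of stdlib Python. Truncating the low 16 bits of an fp32 value approximates bf16 rounding (real hardware rounds to nearest, but the conclusion is the same):

```python
import struct

# Why saving a bf16-trained LoRA as fp32 doesn't recover precision:
# bf16 keeps only the top 16 bits of an fp32 value, so once a weight has
# been rounded to bf16, storing it in a larger fp32 container just stores
# the same rounded number. (Truncation below approximates bf16 rounding.)

def to_bf16_bits(x: float) -> float:
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

w = 0.1234567
w_bf16 = to_bf16_bits(w)          # precision lost here
w_fp32_container = float(w_bf16)  # "upcasting" changes nothing

print(w_bf16 == w_fp32_container)  # True: the lost bits don't come back
print(abs(w - w_bf16) > 0)         # True: rounding error already happened
```

So the deciding factors for visual quality remain dataset curation and checkpoint selection, not the container format.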

Turbo-specific training stability: do this if you want to preserve “fast-step Turbo quality”

Training directly on Turbo can break the acceleration behavior; this is why “training adapter / de-distillation” and “DistillPatch” approaches exist.

  • Ostris’ training adapter is explicitly designed for fine-tuning Turbo without collapsing the distilled behavior. (Hugging Face)
  • DiffSynth’s DistillPatch page describes multiple schemes and recommends a strategy that preserves speed while keeping training simpler. (Hugging Face)

If your LoRA starts looking fine at 20–30 steps but falls apart at 8–9 steps, that’s a classic sign you need one of these “preserve acceleration” approaches. (Hugging Face)


A concrete “good solution” configuration for your 40-image run

Baseline run (most likely to succeed)

  • Base: Turbo + training adapter (to keep distillation intact) (Hugging Face)
  • Rank: 16
  • LR: 1e-4 (or 7e-5 if you want safer identity) (Hugging Face)
  • Steps: train to 8,000; evaluate checkpoints at 2k/4k/6k/8k
  • Resolution buckets: 1024 + 1280 (avoid forcing 1536 with only 40 images unless necessary) (Hugging Face)
  • Prompting for dataset creation: remember Turbo ignores negative prompts; encode constraints directly in the positive prompt. (Hugging Face)

“Higher capacity” run (only if baseline lacks detail)

  • Rank: 32
  • LR: 5e-5 (Hugging Face)
  • Steps: still 8,000 (but watch overdominance)

Why this feels so hard (and how to make it tractable)

Hyperparameters interact, but you can make the search manageable by fixing everything except one axis at a time:

  1. Fix dataset + captions → tune steps (pick the best checkpoint).
  2. Only if needed, try rank 32 at a lower LR.
  3. Only if needed, revisit resolution (1280 vs 1536).

If you try to change rank/LR/steps/resolution together, it becomes almost impossible to tell what helped.


One last warning specific to your “generate 40 images, then train”

Because you’re using generated images as training targets, you should deliberately inject variation so the second LoRA learns identity rather than “that one portrait’s hidden artifacts”:

  • vary camera distance (close/medium/full)
  • vary lighting type (softbox, daylight, tungsten, neon)
  • vary background complexity (simple → cluttered)
  • include some “hard cases” (profile view, harsh shadows, backlight)

This reduces memorization and improves the “face stays the same when the scene changes” property you care about.


I can’t thank you enough for your detailed explanations. I really appreciate it, and I’m going to follow these steps exactly now. The Face LoRA is working well; only the ear isn’t closing :smiley: because it was visible in all the images, so I added this and similar things to the captions and will train it again. That way I’ll get rid of this problem. Even at strength 1, it gives the perfect exposure for the composition (except for the ear, which requires a very specific prompt to close :D).

So, the Face LoRA seems to be working, hypothetically. The next step is to create 40 images with real variety. But before doing that, I want to train a Look LoRA, as you suggested. I’ve collected many images from Unsplash: travel, gym, cafe, streets, etc. They are all very high quality and high resolution, but I think I’ll need to resize them a bit first.

So, what are your suggestions for the Look LoRA? Since there will be a lot of detail here, and it won’t be learning just one subject, I think rank 32 or higher would be logical. I currently have 60 images. Do you think this is enough to train a Look LoRA? I’m leaving a screenshot below. What are my chances of getting sharp images with the LoRA I’ll be creating from these images? I specifically tried to select images where a person might be present.


I also trained a LoRA with the base model; now I’m going to try it with Turbo.


Claude also suggested that applying img2img with ZIT to the Nano Banana outputs might be a good idea. But I think the images are good: high-quality, realistic, and consistent (except for the watermark, which I remove with another tool). Do you think it’s necessary?


Meanwhile, I think training can be done on the base model and then Turbo can be used for inference. But as you said, 3k steps were insufficient; more are needed.


Using the capabilities and knowledge of high-end commercial AI for training open-source models is generally a good idea, not just for Nano Banana.


What a “Look LoRA” can and can’t do for Z-Image Turbo

A Look LoRA is best at learning global visual biases: color palette, contrast curve, lighting mood, lens/render “feel,” background complexity preference. It is not a reliable way to “upgrade sharpness” beyond what the base model + sampling can already produce—unless the base model is already capable and your LoRA is mainly nudging it toward consistently sharp photographic structure.

With Z-Image Turbo, keep in mind:

  • Turbo is a few-step distilled model that does not use negative prompts in the intended setup. (Hugging Face)
  • Training LoRAs directly on Turbo can cause loss of the “8–9 step” behavior (blurry at fast settings) unless you use mitigation like a training adapter or a distillation “patch” approach. (Hugging Face)

This affects your “chances of sharp images” more than rank alone.


Dataset: are ~60 Unsplash photos enough?

Yes—if the look is coherent

60 high-quality photos is enough for one coherent look (e.g., “clean editorial travel photography, natural skin, balanced contrast, slightly warm highlights”). If your set spans many unrelated aesthetics (harsh HDR city night + soft pastel cafe + gritty gym flash + cinematic teal/orange), the LoRA tends to learn a blurry average of the styles and may feel like it “does nothing” or introduces instability.

What your selection is doing right

You intentionally included scenes where people might be present. That is good if your Look LoRA must preserve:

  • skin rendering
  • exposure on faces across lighting types
  • “person in scene” realism

What to improve before training

  • Remove any images with logos/text/signage dominance or remaining watermarks (even small ones). Turbo follows written instructions unusually well, but a Look LoRA trained on text-heavy photos can accidentally increase “text-like artifacts.”
  • Remove duplicates / near-duplicates (same place, same light, same composition).
  • Ensure you have lighting diversity (daylight shade, indoor tungsten, mixed neon, backlight), but keep the grade consistent.
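Near-duplicate removal can be sketched with a tiny average-hash comparison. A real pipeline would use a library such as `imagehash` on actual images; here small grayscale pixel lists stand in for downscaled photos, purely to show the mechanism:

```python
# Toy near-duplicate filter: average-hash each image (a small grayscale
# list stands in for a downscaled photo), then drop images whose hash is
# within a Hamming-distance threshold of one already kept.

def average_hash(pixels: list[int]) -> int:
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def dedup(images: dict[str, list[int]], threshold: int = 2) -> list[str]:
    kept: list[str] = []
    for name, px in images.items():
        h = average_hash(px)
        if all(hamming(h, average_hash(images[k])) > threshold for k in kept):
            kept.append(name)
    return kept

imgs = {
    "cafe_a": [10, 200, 30, 180, 20, 190, 40, 170],
    "cafe_b": [12, 198, 31, 181, 22, 189, 41, 171],  # near-duplicate of cafe_a
    "gym":    [200, 10, 180, 30, 190, 20, 170, 40],  # clearly different
}
print(dedup(imgs))  # ['cafe_a', 'gym']
```

Running something like this before training is cheap insurance against the “same place, same light, same composition” redundancy called out above.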

Rank: 16 vs 32 for a Look LoRA (and why “higher rank” is not automatically better)

Rank is capacity. Higher rank can learn more variation, but it also makes it easier to learn unwanted correlations and artifacts. Diffusers’ LoRA config explicitly treats r (rank) as a core capacity parameter. (Hugging Face)

Recommended starting point for your case

  • Rank 16 is usually the best first run for a Look LoRA with ~60 images.

  • Move to Rank 32 only if you can clearly describe a “capacity failure,” such as:

    • the look doesn’t “take” unless you crank LoRA strength high (and then it breaks identity),
    • fine photographic cues aren’t learned (consistent exposure/contrast behavior across conditions),
    • the LoRA learns the palette but not the lighting behavior.

Why not start at 32?
With only 60 images, Rank 32 can more easily overfit on specific scene content (“this kind of street = this color cast”) instead of learning a transferable grade.

Practical rule for your pipeline

  • Look LoRA: r=16 first
  • Face/identity LoRA: r=16–32 (identity sometimes benefits from more capacity)
  • Final combined LoRA (after you generate the 40 diverse shots): decide based on what fails; often r=16 is still enough.

Steps and LR: what changes when you go from 15 → 60 images

When dataset size increases, a fixed step count means fewer updates per image. So if 3k steps was your “sweet spot” on a smaller set, that does not transfer directly.

A simple planning heuristic:

  • Aim for roughly 100–200 effective updates per image as a starting search band for LoRA-style finetunes (varies by trainer, repeats, batch).
  • For 60 images, that often lands closer to “several thousand” steps, and you should rely on checkpoint evaluation, not a single target number.

Recommendation (works with your “sweet spot hunting” approach)

  • Keep your LR in the conservative band (your idea of dropping LR when increasing rank is reasonable).
  • Train to a higher ceiling (e.g., 8k-ish) but save checkpoints frequently and evaluate them under fixed prompts/settings.

If you increase rank to 32, reducing LR (e.g., toward 5e-5) is a common stability move. (This aligns with typical LoRA guidance that higher capacity usually wants gentler updates.) (Hugging Face)


Resolution: 1280 vs 1536 for Look LoRA training images

Use 1280 (or 1024+1280 buckets) for the Look LoRA dataset

Reasons:

  • With 60 images, you want the LoRA to learn global look cues robustly.
  • 1536 increases detail load and can increase overfitting risk if your dataset doesn’t consistently contain similar “detail distribution” (e.g., always sharp, low noise, no motion blur, consistent lens behavior).
  • You can always render final outputs at higher resolution later.

When 1536 makes sense

Only if:

  • your end goal is consistently 1536,
  • your dataset is consistently sharp at that scale,
  • and you are willing to accept more tuning complexity.

Given your stated goal (look + backgrounds + colors, not microtexture replication), 1280 is the safer choice.


“What are my chances of sharp images?”

High, if these conditions hold:

  1. You preserve Turbo’s fast-step behavior

    • LoRAs trained directly on Turbo can lose acceleration quality (blurry at 8 steps) according to DistillPatch documentation. (Hugging Face)
    • Use a Turbo training adapter if your trainer supports it. (https://cnb.cool)
  2. Your training set is consistently sharp

    • No heavy motion blur, no extreme denoise, no aggressive HDR sharpening.
    • Avoid training on strongly “Detail Daemon–crunched” images: the ComfyUI Detail Daemon port explicitly warns it can become oversharpened/HDR if pushed. (GitHub)
  3. Your captions don’t teach the wrong thing

    • Don’t label content too specifically (you don’t want the LoRA to force “travel/cafe/gym”).
    • Do label the look: lighting mood + color intent + “photorealistic photo / editorial photo / natural skin,” etc.

A LoRA can reduce sharpness if it breaks Turbo distillation or if it learns “soft” images as part of the look.


Should you img2img your Nano Banana outputs with Z-Image Turbo before using them?

Not strictly necessary, but sometimes beneficial

When it helps

  • If Nano Banana outputs have consistent “generator fingerprints” (texture patterns, odd skin microdetail, repeated noise structure).
  • If you see “off-manifold” artifacts that Turbo tends to reintroduce or amplify.
  • If you want the training images to sit more cleanly in Turbo’s image manifold, which can make LoRA training more stable.

When it hurts

  • If denoise is too high, it can change identity or reintroduce the drift you already solved.
  • If you do it to all images, you can accidentally make the dataset too homogeneous.

Practical compromise

  • Only do it for images that show artifacts or instability.
  • Keep denoise low (roughly in the “light cleanup” range), and keep the same positive prompt constraints (Turbo ignores negative prompts). (Hugging Face)

A good “Look LoRA” recipe for your exact plan

Step 1 — Define the look in one sentence (critical)

Example:

  • “Clean editorial photography, neutral WB, natural skin, balanced contrast, slightly muted saturation, realistic indoor/outdoor lighting.”

If you can’t summarize the look, the dataset is probably too mixed.

Step 2 — Curate the 60 → ~40–60 “coherent” subset

  • Keep variety in location + lighting, but keep color science consistent.
  • Ensure at least ~20–30 images have a clearly visible person/face if you care about exposure on skin.

Step 3 — Train Look LoRA on Turbo with distillation-safe tooling

  • Prefer using the training adapter path when available. (https://cnb.cool)
  • If you later notice “looks fine at 30 steps but blurry at 8–9,” that’s exactly the failure DistillPatch describes. (Hugging Face)

Step 4 — Use Look LoRA to generate your 40-shot matrix images

  • Apply Face LoRA + Look LoRA together only to generate the 40 images (offline).
  • Then train a final combined LoRA so you don’t need to stack LoRAs at inference (Turbo can be sensitive to stacking).

Step 5 — Keep NAG/Detail Daemon as optional finishing, not baked-in

Turbo doesn’t use negative prompts, so any “don’t do X” must be phrased as positive constraints in your prompt. (Hugging Face)
Detail Daemon can help detail, but don’t overuse it in training data or you risk teaching the HDR/oversharpen signature. (GitHub)


Turbo-friendly Look prompt template (for sampling while building your 40 images)

Use one base template and change only {location}, {lighting}, {angle}:

{trigger_face}, {trigger_look}, photorealistic editorial portrait photo,
identity consistent face, natural skin texture, sharp focus on eyes,
balanced exposure, realistic shadows, clean color science, neutral white balance,
{lighting}, {angle}, {location},
high detail, clean edges, no text, no watermark, no logo, no extra faces

Turbo expects you to put constraints in the positive prompt, because negatives aren’t used. (Hugging Face)
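Filling the template over a small grid of locations, lightings, and angles yields exactly the 40-shot matrix discussed earlier. The trigger tokens and the specific location/lighting/angle values below are placeholders for illustration:

```python
import itertools

# Build the "40-shot matrix" from one base template: vary only
# location / lighting / angle (trigger tokens are placeholders).

TEMPLATE = (
    "{trigger_face}, {trigger_look}, photorealistic editorial portrait photo, "
    "identity consistent face, natural skin texture, sharp focus on eyes, "
    "balanced exposure, realistic shadows, clean color science, neutral white balance, "
    "{lighting}, {angle}, {location}, "
    "high detail, clean edges, no text, no watermark, no logo, no extra faces"
)

locations = ["street cafe", "gym interior", "night market", "forest trail", "rooftop"]
lightings = ["soft daylight", "tungsten indoor", "neon mixed light", "backlight"]
angles    = ["eye-level medium shot", "low-angle full shot"]

prompts = [
    TEMPLATE.format(trigger_face="my_face_token", trigger_look="my_grade_style",
                    location=loc, lighting=light, angle=ang)
    for loc, light, ang in itertools.product(locations, lightings, angles)
]

print(len(prompts))  # 5 * 4 * 2 = 40 prompts, one per training image
```

Keeping everything except the three slots constant is what makes the resulting dataset diverse in scene but consistent in identity and grade.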


If you want one clear recommendation set to start

  • Look LoRA dataset: 40–60 curated images (from your 60), coherent grade, watermark-free.
  • Resolution: 1280 (or 1024+1280 buckets).
  • Rank: start 16; only try 32 if you see a specific capacity shortfall.
  • Turbo training stability: use the Turbo training adapter if possible; be ready to apply a distillation “patch” approach if 8-step sharpness collapses. (https://cnb.cool)
  • Img2img on Nano Banana outputs: optional; do it only for problematic images, low-denoise.

Hopefully this title will rank highly on Google because we’ve discussed everything necessary throughout this process and provided very detailed links.


Great question! For LoRA training on Z Image Turbo, I’d recommend using the base output as your training foundation.

Why base output works better:

  1. The model has already learned the optimal latent representations for image generation
  2. Training on post-processed images can introduce artifacts that the model wasn’t trained on
  3. If you want color adjustments, you can apply them during inference rather than training

Alternative approach:
If you specifically need the color-adjusted outputs in your LoRA, consider:

  • Fine-tuning on a small dataset of your preferred style (10-20 images)
  • Using IP-Adapter or ControlNet for structural guidance while keeping your color preferences
  • Training a separate color correction LoRA and blending during inference

The key insight is that a LoRA learns the “difference” from the base model, so if you train on color-adjusted images, that adjustment gets baked into the LoRA and applied everywhere, rather than remaining an inference-time choice. Hope this helps!


Thanks for the reply. However, there are no IP-Adapters or similar solutions for ZIT yet. I’m currently researching trigger words. Also, when exactly is a LoRA considered to have learned? When it responds to every prompt that includes a female description and a trigger word, or when it responds to a neutral prompt that uses the trigger word without specifying gender?
