What’s special about Z-Image Turbo LoRA training (and why the YAML differs)
Z-Image Turbo is step-distilled (built to look good in ~8 steps). If you train a LoRA on it “normally,” the distillation can break quickly (“Turbo drift”), and you end up needing more steps / higher CFG to recover quality. (RunComfy)
To address this, Ostris provides a training adapter (“de-distillation” LoRA) you load during training, then remove at inference so your LoRA still runs at distilled (fast) speeds. (Hugging Face)
Recommended baseline for your case (realistic character, A100)
This baseline matches the current “known-good” structure used by AI Toolkit configs and common Z-Image Turbo setups (FlowMatch, ~3000 steps, LR 1e-4, 8-step sampling, guidance 0). (GitHub)
YAML template (Turbo + training adapter, character LoRA, A100-friendly)
Replace paths, dataset name, and trigger_word. Keep everything else unchanged for run #1.
```yaml
job: extension
config:
  name: "zit_char_realistic_lora_a100"
  process:
    - type: diffusion_trainer
      # Output + bookkeeping
      training_folder: "/workspace/ai-toolkit/output/zit_char_realistic_lora_a100"
      sqlite_db_path: "/workspace/ai-toolkit/aitk_db_zit_char_realistic.db" # keep per-instance to avoid DB contention
      device: "cuda"
      performance_log_every: 25

      # Trigger token used in prompts/captions to "call" the character
      trigger_word: "zch4r_001"

      # LoRA capacity
      network:
        type: "lora"
        linear: 16
        linear_alpha: 16
        conv: 16
        conv_alpha: 16
        network_kwargs:
          ignore_if_contains: []

      # Checkpoint saving
      save:
        dtype: "bf16"
        save_every: 250
        max_step_saves_to_keep: 12
        save_format: "diffusers"
        push_to_hub: false

      # Dataset
      datasets:
        - folder_path: "/workspace/datasets/zit_char_realistic" # images + optional .txt captions
          mask_path: null
          mask_min_value: 0.1
          caption_ext: "txt"
          default_caption: "" # leave "" if you provide per-image captions
          caption_dropout_rate: 0.05
          cache_latents_to_disk: true
          cache_text_embeddings: false # safer for Z-Image Turbo training right now
          is_reg: false
          network_weight: 1
          # Multi-res buckets recommended for Z-Image LoRAs
          resolution: [512, 768, 1024]
          # Z-Image configs often keep these fields even for 1-frame image training
          controls: []
          shrink_video_to_frames: true
          num_frames: 1
          do_i2v: true
          flip_x: false
          flip_y: false

      # Training hyperparameters
      train:
        batch_size: 1
        gradient_accumulation: 2 # effective batch ~2, stable for character learning
        steps: 3000
        lr: 0.0001
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        noise_scheduler: "flowmatch"
        optimizer: "adamw8bit"
        optimizer_params:
          weight_decay: 0.0001
        timestep_type: "weighted"
        content_or_style: "balanced"
        loss_type: "mse"
        unload_text_encoder: false
        dtype: "bf16"
        # Leave these off for baseline stability
        diff_output_preservation: false
        skip_first_sample: false
        force_first_sample: false
        disable_sampling: false
        bypass_guidance_embedding: false
        ema_config:
          use_ema: false
          ema_decay: 0.99

      # Base model + training adapter
      model:
        name_or_path: "Tongyi-MAI/Z-Image-Turbo"
        arch: "zimage:turbo"
        # Key: use the training adapter while training
        assistant_lora_path: "ostris/zimage_turbo_training_adapter/zimage_turbo_training_adapter_v1.safetensors"
        quantize: false
        qtype: "qfloat8"
        quantize_te: false
        qtype_te: "qfloat8"
        low_vram: false
        model_kwargs: {}
        layer_offloading: false
        layer_offloading_text_encoder_percent: 1
        layer_offloading_transformer_percent: 1

      # Sampling previews during training
      sample:
        sampler: "flowmatch"
        sample_every: 250
        width: 1024
        height: 1024
        # Turbo preview settings (fast + representative)
        sample_steps: 8
        guidance_scale: 0
        seed: 42
        walk_seed: false
        neg: ""
        samples:
          - prompt: "zch4r_001, photorealistic portrait, natural skin texture, 85mm lens, soft window light, neutral background"
          - prompt: "zch4r_001, full body photo, casual street outfit, outdoor city background, golden hour, realistic"
          - prompt: "zch4r_001, candid photo, sitting in a cafe, shallow depth of field, realistic lighting"
          - prompt: "zch4r_001, studio headshot, clean backdrop, sharp focus, photorealistic"
        num_frames: 1
        fps: 1
```
Why these specific defaults:
- Steps 2500–3000, LR 1e-4, rank 16, buckets 512/768/1024 are widely used as a baseline for Turbo+adapter LoRAs. (RunComfy)
- Guidance scale = 0 is the standard preview target for Turbo (guidance-distilled behavior). (RunComfy)
- Adapter v1 is the “safe baseline.” v2 is worth A/B testing after you get a stable run. (RunComfy)
- The adapter is explicitly intended for shorter runs (styles/concepts/characters); very long runs can still drift and produce artifacts once you remove the adapter. (Hugging Face)
How to fill in the YAML correctly (field-by-field)
1) config.name
A label for your job. It becomes your output folder name in many workflows.
2) training_folder
Where checkpoints and previews go. Use local NVMe if possible (faster saves, fewer I/O stalls).
3) sqlite_db_path (important in cloud pods)
AI Toolkit’s UI uses SQLite for job tracking. On some cloud/persistent-storage setups, SQLite can time out (especially on shared filesystems). Keeping the DB on fast local disk, one per instance, reduces the risk. (GitHub)
4) trigger_word
Make it unique (not a normal word). You’ll use it:
- in per-image captions (recommended), and
- in your sample prompts.
This is how you “call” the character reliably.
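Since a forgotten trigger token is one of the most common "nothing changed" causes, it can be worth sanity-checking the dataset before launching a run. A minimal stdlib-only sketch (the file names and caption text are hypothetical):

```python
import tempfile
from pathlib import Path

def captions_missing_trigger(dataset_dir, trigger: str, ext: str = ".txt") -> list:
    """Return the names of caption files that never mention the trigger token."""
    missing = []
    for path in sorted(Path(dataset_dir).glob(f"*{ext}")):
        if trigger not in path.read_text(encoding="utf-8"):
            missing.append(path.name)
    return missing

# Tiny self-contained demo with two fake caption files.
demo = Path(tempfile.mkdtemp())
(demo / "img1.txt").write_text("zch4r_001 person, photorealistic portrait")
(demo / "img2.txt").write_text("person, full body photo")  # trigger forgotten here

print(captions_missing_trigger(demo, "zch4r_001"))  # ['img2.txt']
```

Run this against your real `folder_path` before training; an empty list means every caption carries the trigger.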
5) network (LoRA capacity)
- Start with rank 16 (`linear: 16`) as the baseline. (RunComfy)
- If the character identity is weak after ~3000 steps, increase to linear 32 (and alpha 32) before increasing steps.
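For intuition on what bumping the rank costs: per adapted linear layer, LoRA adds two small matrices, so the added parameter count scales linearly with rank. A quick back-of-envelope sketch (the 3072-wide projection is a hypothetical layer size, purely for illustration):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA factorizes the weight update as B @ A,
    # where A is (rank, d_in) and B is (d_out, rank).
    return rank * d_in + d_out * rank

# Hypothetical 3072-wide attention projection:
p16 = lora_params(3072, 3072, 16)
p32 = lora_params(3072, 3072, 32)
print(p16, p32)  # 98304 196608
```

So going from rank 16 to 32 roughly doubles the adapter size (and its capacity to memorize), which is why it is the first knob to try when identity is weak, before adding steps.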
6) datasets
Key knobs for character realism:
- `resolution: [512, 768, 1024]`: buckets improve generalization across crops/aspect ratios and are a common baseline for Turbo LoRAs. (RunComfy)
- `caption_ext: "txt"` + per-image captions. For realistic characters, captions matter more than people expect:
  - include the trigger token
  - include a class word (`person`, `man`, `woman`) and simple shot descriptors (portrait, full body)
  - avoid heavy style words if you want “neutral photoreal.”
- `caption_dropout_rate: 0.05`: small dropout reduces prompt overfitting (the LoRA learns the identity rather than memorizing caption phrases).
- `cache_latents_to_disk: true`: often speeds training once latents are cached. (RunComfy)
- `cache_text_embeddings: false`: there have been Z-Image Turbo embedding/latent batch-mismatch issues reported; leaving this off is the “least surprise” baseline. (GitHub)
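Caption dropout is conceptually simple: at each training sample, with probability equal to the dropout rate, the caption is replaced by the empty/default caption. A seeded stdlib sketch of the idea (not AI Toolkit's actual implementation):

```python
import random

def maybe_drop_caption(caption: str, rate: float, rng: random.Random) -> str:
    # With probability `rate`, train this sample on an empty caption instead,
    # so the LoRA can't lean entirely on memorized caption phrases.
    return "" if rng.random() < rate else caption

rng = random.Random(0)
captions = ["zch4r_001 person, photorealistic portrait"] * 1000
dropped = sum(1 for c in captions if maybe_drop_caption(c, 0.05, rng) == "")
print(dropped)  # roughly 50 of 1000 at rate 0.05
```

At `0.05` the model still sees the caption 95% of the time, which is why such a small rate is enough to keep the identity from being caption-dependent.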
7) train
- `steps: 3000` is a common first-run target for 10–30 images. (RunComfy)
- `batch_size: 1` + `gradient_accumulation: 2` gives you more stable updates without triggering batch-related weirdness.
- `noise_scheduler: flowmatch` matches known working Turbo configs. (GitHub)
- `dtype: bf16` is generally stable on A100.
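What gradient accumulation buys you, in one toy sketch: gradients from `accumulation` micro-batches are averaged before a single optimizer step, so `batch_size: 1` with `gradient_accumulation: 2` behaves like an effective batch of ~2. A scalar-gradient illustration (not the trainer's real code):

```python
def accumulated_grads(per_sample_grads, accumulation: int):
    """Average gradients over `accumulation` micro-batches per optimizer step."""
    steps, buf, count = [], 0.0, 0
    for g in per_sample_grads:
        buf += g
        count += 1
        if count == accumulation:
            steps.append(buf / accumulation)  # one optimizer step fires here
            buf, count = 0.0, 0
    return steps

# Four single-sample gradients -> two optimizer steps on averaged gradients.
print(accumulated_grads([1.0, 3.0, -2.0, 4.0], accumulation=2))  # [2.0, 1.0]
```

The averaging is why noisy single-image gradients get smoothed without needing the VRAM for a true batch of 2.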
8) model.assistant_lora_path (the critical Turbo bit)
This is the training adapter. The adapter’s model card explains:
- why it’s needed for step-distilled training,
- why it’s best for shorter runs,
- and that you remove it at inference to keep Turbo speed. (Hugging Face)
Dataset recipe for a realistic character (what “good results” usually require)
For a photoreal character LoRA, the usual failure modes are “face drift,” “same pose every time,” or “background leaks into identity.”
A practical dataset layout (10–30 images) consistent with common Turbo LoRA advice: (RunComfy)
- 40% close/portrait (face fidelity, skin texture, hairline)
- 40% half/full body (body proportions, clothing fit)
- 20% action/candid (generalization)
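To turn the 40/40/20 split into concrete image counts for your dataset size, a trivial helper (illustrative only; the remainder is dumped into the candid bucket):

```python
def split_counts(n_images: int) -> dict:
    # 40% portrait / 40% half-full body / 20% action-candid,
    # with any rounding remainder assigned to candid.
    portrait = round(n_images * 0.4)
    body = round(n_images * 0.4)
    candid = n_images - portrait - body
    return {"portrait": portrait, "body": body, "candid": candid}

print(split_counts(20))  # {'portrait': 8, 'body': 8, 'candid': 4}
print(split_counts(25))  # {'portrait': 10, 'body': 10, 'candid': 5}
```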
Backgrounds:
- Use varied backgrounds (indoor/outdoor) so the LoRA doesn’t glue the character to one scene.
- Avoid repeating the same studio backdrop in most images.
Captions (simple, consistent):
- Good: `zch4r_001 person, photorealistic portrait, soft window light`
- Good: `zch4r_001 person, full body photo, outdoors, casual outfit`
- Avoid: long style chains (they turn your “character LoRA” into a style LoRA)
Common issues & fixes (seen in the wild)
1) “DB timeout / Prisma P1008” in cloud storage
Often appears when the SQLite DB sits on slow/shared storage. Use a local per-instance DB path. (GitHub)
2) “Batch size of latents must be the same or half the batch size of text embeddings”
A reported Z-Image Turbo training issue; safest mitigations:
- keep `batch_size: 1`
- keep `cache_text_embeddings: false` (GitHub)
3) Samples “don’t change” even after 2000 steps
Common causes:
- trigger_word not present in sample prompts/captions
- LoRA scale is effectively 0 in your inference workflow
- training without the adapter (Turbo drift symptoms: needs more steps/CFG to show changes) (GitHub)
4) Loss becomes NaN
Usually: too aggressive settings (dtype/LR/quantization). A known config example uses fp16 and LR 1e-4; on A100, prefer bf16 and disable quantization for baseline stability. (GitHub)
Good references (configs, guides, and issues)
High-signal guides (Turbo + adapter)
- Hugging Face Engineering Notes: “Training a LoRA for Z-Image Turbo with the Ostris AI Toolkit” (Hugging Face)
- RunComfy deep guide (Turbo+adapter vs De-Turbo, baseline params, guidance 0, buckets) (RunComfy)
- Training adapter model card (why it exists, how it’s made, remove at inference, long-run caveat) (Hugging Face)
AI Toolkit config mechanics (how configs are expected to be created)
- AI Toolkit README: copy an example config from `config/examples/...` and edit paths (GitHub)
Issues worth knowing about
- SQLite timeout on cloud storage (GitHub)
- Latents/text-embeds batch mismatch (GitHub)
- “Samples not changing” report (GitHub)
Practical tuning path (do this in order)
1. Run the baseline YAML exactly as above for 3000 steps.
2. If identity is weak: increase `linear` to 32 (keep LR/steps the same).
3. If you get overfitting (outputs look like the training photos):
   - reduce steps to 2000–2500, or
   - increase `caption_dropout_rate` to 0.1, or
   - add more varied backgrounds/poses.
4. After you have one “good” run, A/B test adapter v1 vs v2 with everything else identical. (RunComfy)