If you want to achieve this without fine-tuning the model, the practical path is to combine several lightweight models:
Recommendations that fit your constraints (no training, mobile-first, edit-grade masks)
1) Treat modal segmentation as the primary deliverable
For recolor/pattern edits, the most important property is: do not paint occluders (plants, chairs, blankets). That is a modal mask problem (visible pixels only). “True amodal” (hallucinating hidden parts) is a separate research task and typically depends on specialized training data and models (e.g., KINS formalizes amodal instance segmentation and dataset design). (CVF Open Access)
Practical implication: build a pipeline where the user can correct the mask with one or two “remove” actions rather than betting on a fully automatic amodal model.
2) Keep YOLO for detection/selection, not for the final mask
YOLOv8-seg is convenient, but it has a structural behavior that can hurt edit-grade boundaries: the mask is cropped to the predicted bounding box during inference (then upsampled), which can produce “boxy” artifacts and makes edges sensitive to how tight the box is. (Ultralytics)
Practical implication:
- Use YOLO (or any fast detector) to provide instances + boxes + tap targets.
- Use a promptable segmenter to generate the final high-quality mask for the selected object.
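A sketch of this split, assuming the Ultralytics Python API for detection; `segment_with_box` stands in for whichever promptable segmenter you choose (Options A–E below):

```python
# Detector picks candidates and tap targets; the promptable segmenter owns
# the final edit-grade mask. The weights file is the standard Ultralytics
# detect-only checkpoint (not the -seg variant).
from ultralytics import YOLO
import numpy as np

detector = YOLO("yolov8n.pt")

def detect(image_path: str):
    """One detector pass: boxes (xyxy, pixels) + class ids for the UI."""
    result = detector(image_path)[0]
    return result.boxes.xyxy.cpu().numpy(), result.boxes.cls.cpu().numpy()

def final_mask(image: np.ndarray, box_xyxy: np.ndarray, segment_with_box):
    """After the user taps a detection, prompt the segmenter with its box."""
    # segment_with_box(image, box) -> HxW mask; model-specific (see options below).
    return segment_with_box(image, box_xyxy)
```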
3) Use a promptable / interactive segmenter to solve “sofa includes cushions” without training
The “sofa + cushions merged” issue is almost always a taxonomy/definition mismatch learned from training labels. Without training, the reliable fix is: start with a box prompt, then subtract cushions/occluders with negative prompts.
Best-fit on-device options (ordered by “how likely you can ship quickly”)
Option A — MediaPipe Interactive Image Segmenter (MagicTouch) (lowest integration risk)
- Designed for “tap a location → return object mask around that location” interactions. (Google AI for Developers)
- Works naturally with your UX: user selects sofa → you call interactive segmentation → show mask → user taps “remove” on cushion if needed.
Why it’s good for your case: it aligns with user selection and makes corrections easy without requiring model customization. (Google AI for Developers)
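A rough desktop-prototyping sketch of the tap-to-mask call via MediaPipe's Python Tasks API (on device you would use the Android/iOS Tasks APIs instead; the model filename and tap coordinates are placeholders, and class/option names should be checked against the current MediaPipe docs):

```python
import mediapipe as mp
from mediapipe.tasks import python as mp_python
from mediapipe.tasks.python import vision
from mediapipe.tasks.python.components import containers

RegionOfInterest = vision.InteractiveSegmenterRegionOfInterest
NormalizedKeypoint = containers.keypoint.NormalizedKeypoint

options = vision.InteractiveSegmenterOptions(
    base_options=mp_python.BaseOptions(model_asset_path="magic_touch.tflite"),
    output_category_mask=True)

with vision.InteractiveSegmenter.create_from_options(options) as segmenter:
    image = mp.Image.create_from_file("room.jpg")
    # Normalized (x, y) of the user's tap on the sofa.
    roi = RegionOfInterest(format=RegionOfInterest.Format.KEYPOINT,
                           keypoint=NormalizedKeypoint(0.42, 0.61))
    result = segmenter.segment(image, roi)
    sofa_mask = result.category_mask.numpy_view()  # uint8 HxW
```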
Option B — EfficientSAM (OpenCV packaging on Hugging Face) with ONNX/INT8
OpenCV’s EfficientSAM package explicitly supports:
- box prompts
- foreground points
- background points (your “remove cushion/occluder” tool) (Hugging Face)
It also publishes ONNX weights including an INT8 variant, which is often the practical path for on-device latency/memory control. (Hugging Face)
Why it’s good for your case: detector box → initial mask → 1–2 background taps on cushions → clean sofa-only mask.
To run ONNX on mobile, ONNX Runtime Mobile is the standard deployment route for iOS/Android. (ONNX Runtime)
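A minimal onnxruntime sketch of "box prompt + background point" for a SAM-style model such as EfficientSAM. The tensor names and the label convention (1 = foreground point, 0 = background point, 2/3 = box corners) follow common SAM/EfficientSAM exports; verify them against your actual file with `session.get_inputs()`/`get_outputs()`:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("efficientsam_s_int8.onnx")  # filename illustrative

def segment(image_chw: np.ndarray, coords: np.ndarray,
            labels: np.ndarray) -> np.ndarray:
    """image_chw: 3xHxW float32 in [0,1]; coords: Nx2 (x, y) in pixels."""
    logits = session.run(None, {
        "batched_images": image_chw[None],                          # 1x3xHxW
        "batched_point_coords": coords[None, None].astype(np.float32),
        "batched_point_labels": labels[None, None].astype(np.float32),
    })[0]
    return logits[0, 0, 0] > 0.0  # top-ranked candidate mask (layout may vary)

image = np.random.rand(3, 512, 512).astype(np.float32)  # stand-in photo
x0, y0, x1, y1 = 90.0, 140.0, 430.0, 470.0              # detector box
sofa = segment(image, np.array([[x0, y0], [x1, y1]]), np.array([2.0, 3.0]))
# One correction tap on a cushion (background point, label 0):
sofa = segment(image,
               np.array([[x0, y0], [x1, y1], [250.0, 300.0]]),
               np.array([2.0, 3.0, 0.0]))
```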
Option C — EdgeSAM (higher-end “SAM-like” interactivity if you want premium speed)
EdgeSAM is explicitly built for on-device SAM-style prompting and reports 30+ FPS on iPhone 14 in its paper/project materials. (arXiv)
Why it’s good for your case: if you want very fast interactive refinement (tap/drag updates feel instant) on newer phones.
Option D — MobileSAM / Qualcomm MobileSam (Snapdragon-focused Android path)
- MobileSAM proposes a lightweight encoder distillation approach to make SAM mobile-friendly. (arXiv)
- Qualcomm’s Hugging Face repo provides pre-exported model files optimized for Qualcomm devices (often less work to get good performance on Snapdragon). (Hugging Face)
Option E — Qualcomm FastSam-S (speed-first CNN promptable model)
Qualcomm positions FastSam-S as “generate segmentation mask on device,” with support for prompt-based segmentation. (Hugging Face)
When it makes sense: if you need something very fast and you accept that boundary quality may vary more than the best SAM-like options.
4) Boundary quality strategy (what usually determines whether edits look “real”)
Even with a good segmentation model, edit realism is often decided by boundary handling at full resolution.
A proven production pattern is:
- infer mask at a moderate resolution for responsiveness
- do edge-preserving upsampling/refinement to full-res after the gesture ends
Google’s Snapseed write-up describes exactly this: predict a mask at 768×768, then upsample to image resolution (capped at 4K) using edge-preserving joint-bilateral upsampling. (Google Research)
If you need an algorithmic building block for edge-aware refinement, guided filtering is a standard reference (fast, edge-preserving). (people.csail.mit.edu)
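A sketch of that "infer small, refine at full res" pattern using OpenCV's guided filter (the `cv2.ximgproc` module requires `opencv-contrib-python`; radius/eps values here are starting points, not tuned):

```python
import cv2
import numpy as np

def refine_mask(full_res_bgr: np.ndarray, low_res_mask: np.ndarray,
                radius: int = 8, eps: float = 1e-4) -> np.ndarray:
    """Upsample a soft mask to image resolution, snapping it to image edges."""
    h, w = full_res_bgr.shape[:2]
    # Cheap upsample first, then edge-preserving refinement against the image.
    mask = cv2.resize(low_res_mask.astype(np.float32), (w, h),
                      interpolation=cv2.INTER_LINEAR)
    guide = cv2.cvtColor(full_res_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    refined = cv2.ximgproc.guidedFilter(guide, mask, radius, eps)
    return np.clip(refined, 0.0, 1.0)
```

Run this once after the gesture ends, not per frame, mirroring the Snapseed pattern of responsive low-res preview plus deferred full-res refinement.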
5) Occlusion handling that works today (without amodal)
Still photo workflow
Use the modal pipeline from sections 2 and 3: detector box → promptable mask → negative taps to subtract cushions/occluders. Leave the occluded region unedited rather than hallucinating hidden geometry; the correction can be as simple as subtracting the tapped occluder's own mask.
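A minimal numpy sketch of that "remove" action, assuming `segment_at_point` is a hypothetical wrapper around the same promptable segmenter:

```python
import numpy as np

def remove_occluder(sofa_mask: np.ndarray, image: np.ndarray,
                    tap_xy, segment_at_point) -> np.ndarray:
    """Both masks are HxW float in [0,1]; returns sofa minus the occluder."""
    occluder = segment_at_point(image, tap_xy)          # e.g. plant or blanket
    return np.clip(sofa_mask - occluder, 0.0, 1.0)      # never paint occluders
```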
If you later add live AR preview
Depth-based occlusion is usually more reliable than trying to infer amodal furniture geometry without training. (This is a separate layer from still-photo segmentation.)
6) Concrete “v1” suggestion (minimize risk, maximize quality)
Ship a two-model pipeline:
- Detector: YOLO (detect-only) for boxes + UI selection
- Segmenter: choose one
  - MediaPipe Interactive Image Segmenter (fastest to integrate end-to-end) (Google AI for Developers)
  - or EfficientSAM ONNX (INT8) if you want “box + background points” as the core correction mechanic (Hugging Face)
UX that solves the cushion problem with minimal friction
- Tap the sofa (or its detection box) → prompt the segmenter → show the initial mask.
- A single “remove” tool: each tap on a cushion/plant adds a background point and re-runs only the lightweight decoder.
- Optionally a symmetric “add” tool (foreground point) for when part of the sofa is missed.
7) Deployment tips that matter in practice
- Cache the “heavy” computation per image when possible (image-encoder-once, lightweight decoder per prompt is the ideal shape); see the sketch after this list.
- Treat the mobile runtime as a first-class decision:
  - ONNX models → ONNX Runtime Mobile for iOS/Android (ONNX Runtime)
  - vendor packs (Qualcomm) can reduce performance-tuning time on Snapdragon devices (Hugging Face)
- Plan a parity test suite early (same preprocess, same resize, same normalization), because “looks slightly different” issues tend to come from mismatched preprocessing and resize/interpolation settings.
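A sketch of the encoder-once/decoder-per-prompt shape, assuming the segmenter ships as two ONNX graphs (how SAM-style models are commonly split); file and tensor names here are illustrative, not from any specific package:

```python
import numpy as np
import onnxruntime as ort

class PromptSession:
    def __init__(self, encoder_path: str, decoder_path: str):
        self.encoder = ort.InferenceSession(encoder_path)
        self.decoder = ort.InferenceSession(decoder_path)
        self._embedding = None  # cached per image

    def set_image(self, image_chw: np.ndarray) -> None:
        """Heavy step: run the image encoder once per photo."""
        self._embedding = self.encoder.run(None, {"image": image_chw[None]})[0]

    def prompt(self, coords: np.ndarray, labels: np.ndarray) -> np.ndarray:
        """Cheap step: re-run only the decoder for every tap/drag update."""
        mask = self.decoder.run(None, {
            "image_embeddings": self._embedding,
            "point_coords": coords[None].astype(np.float32),
            "point_labels": labels[None].astype(np.float32),
        })[0]
        return mask[0, 0]
```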
(If you use ONNX export/optimization via Hugging Face tooling, Optimum-ONNX and Transformers ONNX utilities are the standard workflow.)
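For the parity suite, even a tiny harness catches most preprocessing drift; `reference_preprocess` and `exported_preprocess` are placeholders for your two implementations (e.g., Python reference vs. the mobile pipeline's output captured to file):

```python
import numpy as np

def assert_preprocess_parity(image: np.ndarray,
                             reference_preprocess, exported_preprocess,
                             atol: float = 1e-3) -> None:
    """Fail loudly if the two preprocessing paths diverge."""
    ref = reference_preprocess(image)
    exp = exported_preprocess(image)
    assert ref.shape == exp.shape, f"shape drift: {ref.shape} vs {exp.shape}"
    max_diff = float(np.max(np.abs(ref - exp)))
    assert max_diff <= atol, f"value drift: max abs diff {max_diff:.5f}"
```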
8) What not to bet your product on (given “no training”)
- “True amodal segmentation on-device” as the primary path: it’s a defined research direction (e.g., KINS), but it is not generally turnkey for indoor furniture occlusions without training/adaptation. (CVF Open Access)
Quick decision guide
- Fastest path to a good v1: MediaPipe Interactive Image Segmenter + boundary refinement. (Google AI for Developers)
- Best control over “sofa but not cushions”: EfficientSAM (OpenCV HF) + ONNX Runtime Mobile + negative-point refinement. (Hugging Face)
- Premium interactivity on newer phones: EdgeSAM. (arXiv)
- Snapdragon-first Android packaging: Qualcomm MobileSam / FastSam-S. (Hugging Face)