Seeking Advice: Mobile-Friendly Furniture Segmentation + Occlusion Handling

Hi everyone,

I’m building a mobile app with the following pipeline:

1. User takes a photo of a room
2. The app detects furniture items
3. User selects a specific furniture piece (e.g., sofa)
4. The app generates a precise segmentation mask
5. The user can change the color or pattern of that furniture

I’m looking for a mobile-friendly model (on-device preferred) that can:

  • Perform instance-level segmentation of furniture
  • Generate high-quality, accurate masks (fine boundaries matter)
  • Handle occlusions (e.g., a chair partially blocking a table)

Segment Anything Model 3 is too heavy for mobile and not practical for on-device inference. Using YOLOv8 segmentation works reasonably well for detection + masks, BUT:

  • If I segment a sofa, the cushions on the sofa are included as part of the sofa mask.

Any practical amodal segmentation approaches that can run on-device?
If you’ve worked on AR interior apps, real-time segmentation, or mobile vision systems — I’d really value your advice.

At this point of time, I can’t build my own model.


If you want to achieve this without fine-tuning the model, perhaps you could combine multiple lightweight models?


Recommendations that fit your constraints (no training, mobile-first, edit-grade masks)

1) Treat modal segmentation as the primary deliverable

For recolor/pattern edits, the most important property is: do not paint occluders (plants, chairs, blankets). That is a modal mask problem (visible pixels only). “True amodal” (hallucinating hidden parts) is a separate research task and typically depends on specialized training data and models (e.g., KINS formalizes amodal instance segmentation and dataset design). (CVF Open Access)

Practical implication: build a pipeline where the user can correct the mask with one or two “remove” actions rather than betting on a fully automatic amodal model.


2) Keep YOLO for detection/selection, not for the final mask

YOLOv8-seg is convenient, but it has a structural behavior that can hurt edit-grade boundaries: the mask is cropped to the predicted bounding box during inference (then upsampled), which can produce “boxy” artifacts and makes edges sensitive to how tight the box is. (Ultralytics)

Practical implication:

  • Use YOLO (or any fast detector) to provide instances + boxes + tap targets.
  • Use a promptable segmenter to generate the final high-quality mask for the selected object.
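The hand-off between the two stages can be sketched in a few lines: the detector supplies instance boxes, the user's tap selects one, and that box becomes the prompt for the segmenter. The tuple layout and names here are illustrative, not tied to any specific detector library; preferring the smallest containing box is one way to handle nested objects (a cushion inside a sofa's box).

```python
def pick_tapped_box(detections, tap_x, tap_y):
    """Return the smallest detection box containing the tap, or None.

    detections: list of (x1, y1, x2, y2, label) tuples (illustrative format).
    Preferring the smallest box handles nested objects (cushion on sofa).
    """
    hits = [d for d in detections
            if d[0] <= tap_x <= d[2] and d[1] <= tap_y <= d[3]]
    if not hits:
        return None
    return min(hits, key=lambda d: (d[2] - d[0]) * (d[3] - d[1]))

dets = [(10, 20, 400, 300, "sofa"), (120, 180, 220, 260, "cushion")]
print(pick_tapped_box(dets, 150, 200))  # tap hits both; picks the smaller box
print(pick_tapped_box(dets, 50, 50))    # only inside the sofa box
```

The selected box then becomes the segmenter's prompt; the detector's own (possibly boxy) mask is discarded.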

3) Use a promptable / interactive segmenter to solve “sofa includes cushions” without training

The “sofa + cushions merged” issue is almost always a taxonomy/definition mismatch learned from training labels. Without training, the reliable fix is: start with a box prompt, then subtract cushions/occluders with negative prompts.

Best-fit on-device options (ordered by “how likely you can ship quickly”)

Option A — MediaPipe Interactive Image Segmenter (MagicTouch) (lowest integration risk)

  • Designed for “tap a location → return object mask around that location” interactions. (Google AI for Developers)
  • Works naturally with your UX: user selects sofa → you call interactive segmentation → show mask → user taps “remove” on cushion if needed.

Why it’s good for your case: it aligns with user selection and makes corrections easy without requiring model customization. (Google AI for Developers)


Option B — EfficientSAM (OpenCV packaging on Hugging Face) with ONNX/INT8

OpenCV’s EfficientSAM package explicitly supports:

  • box prompts
  • foreground points
  • background points (your “remove cushion/occluder” tool) (Hugging Face)

It also publishes ONNX weights including an INT8 variant, which is often the practical path for on-device latency/memory control. (Hugging Face)

Why it’s good for your case: detector box → initial mask → 1–2 background taps on cushions → clean sofa-only mask.

To run ONNX on mobile, ONNX Runtime Mobile is the standard deployment route for iOS/Android. (ONNX Runtime)
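For the "box prompt + background taps" mechanic, SAM-family decoders typically take point prompts as coordinate/label pairs: label 1 for a foreground click, 0 for a background ("remove") click, and, in the original SAM ONNX convention, a box encoded as two extra points labeled 2 (top-left) and 3 (bottom-right). Exact input tensor names and shapes vary between exports, so treat this as an assumed layout to adapt to whichever EfficientSAM export you ship:

```python
import numpy as np

def build_prompt(box, neg_points):
    """Assemble SAM-style prompt tensors.

    box: (x1, y1, x2, y2) from the detector.
    neg_points: list of (x, y) background taps ("remove cushion" clicks).
    Labels: 2/3 = box corners, 0 = background point (SAM convention;
    verify against the specific export you use).
    """
    coords = [[box[0], box[1]], [box[2], box[3]]] + [list(p) for p in neg_points]
    labels = [2, 3] + [0] * len(neg_points)
    # batch dimension of 1, float32 as most exports expect
    return (np.asarray([coords], dtype=np.float32),
            np.asarray([labels], dtype=np.float32))

coords, labels = build_prompt((10, 20, 400, 300), [(150, 200)])
print(coords.shape, labels.tolist())  # (1, 3, 2) [[2.0, 3.0, 0.0]]
```

These arrays, together with the cached image embedding, are what you feed to the decoder session on each tap.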


Option C — EdgeSAM (higher-end “SAM-like” interactivity if you want premium speed)

EdgeSAM is explicitly built for on-device SAM-style prompting and reports 30+ FPS on iPhone 14 in its paper/project materials. (arXiv)

Why it’s good for your case: if you want very fast interactive refinement (tap/drag updates feel instant) on newer phones.


Option D — MobileSAM / Qualcomm MobileSam (Snapdragon-focused Android path)

  • MobileSAM proposes a lightweight encoder distillation approach to make SAM mobile-friendly. (arXiv)
  • Qualcomm’s Hugging Face repo provides pre-exported model files optimized for Qualcomm devices (often less work to get good performance on Snapdragon). (Hugging Face)

Option E — Qualcomm FastSam-S (speed-first CNN promptable model)

Qualcomm positions FastSam-S as an on-device mask generator, and it supports prompt-based segmentation. (Hugging Face)

When it makes sense: if you need something very fast and you accept that boundary quality may vary more than the best SAM-like options.


4) Boundary quality strategy (what usually determines whether edits look “real”)

Even with a good segmentation model, edit realism is often decided by boundary handling at full resolution.

A proven production pattern is:

  1. infer mask at a moderate resolution for responsiveness
  2. do edge-preserving upsampling/refinement to full-res after the gesture ends

Google’s Snapseed write-up describes exactly this: predict a mask at 768×768, then upsample to image resolution (capped at 4K) using edge-preserving joint-bilateral upsampling. (Google Research)

If you need an algorithmic building block for edge-aware refinement, guided filtering is a standard reference (fast, edge-preserving). (people.csail.mit.edu)
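As a concrete reference point, here is a minimal NumPy sketch of guided-filter refinement in the spirit of He et al.: upsample the coarse mask, then re-fit it to the full-resolution image so the boundary snaps to real edges. It assumes a single-channel float guide; production code would typically use an optimized implementation (e.g., OpenCV's `ximgproc` module) rather than this didactic version.

```python
import numpy as np

def box_mean(x, r):
    """Mean over a (2r+1)^2 window via 2-D cumulative sums (edge-padded)."""
    pad = np.pad(x, r, mode="edge")
    c = np.cumsum(np.cumsum(pad, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))  # zero row/col so window sums index cleanly
    h, w = x.shape
    k = 2 * r + 1
    return (c[k:k + h, k:k + w] - c[k:k + h, :w]
            - c[:h, k:k + w] + c[:h, :w]) / (k * k)

def guided_refine(guide, mask, r=4, eps=1e-4):
    """Guided filter: locally fit mask ~ a*guide + b, keeping guide edges."""
    I, p = guide.astype(np.float64), mask.astype(np.float64)
    m_I, m_p = box_mean(I, r), box_mean(p, r)
    var_I = box_mean(I * I, r) - m_I * m_I
    cov_Ip = box_mean(I * p, r) - m_I * m_p
    a = cov_Ip / (var_I + eps)
    b = m_p - a * m_I
    return box_mean(a, r) * I + box_mean(b, r)

# Toy example: upsample a coarse half/half mask 4x (nearest), then refine
# against the full-res guide; real use passes the photo's luma as the guide.
coarse = np.zeros((16, 16)); coarse[:, 8:] = 1.0
full = coarse.repeat(4, axis=0).repeat(4, axis=1)
refined = guided_refine(full, full)
```

The same shape (coarse inference, edge-aware full-res refinement after the gesture ends) is what the Snapseed write-up describes with joint-bilateral upsampling.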


5) Occlusion handling that works today (without amodal)

Still photo workflow

  • Always prefer modal correctness (exclude occluders).

  • Provide “Remove” refinement:

    • tap/short stroke over the occluder region to subtract
    • optionally: tap the occluder as its own object and subtract that mask (works well with interactive/tap-to-mask segmenters)
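Both "remove" variants above reduce to plain mask algebra at the same resolution: subtract either a stroked region or the tapped occluder's own mask from the selected object's mask. A minimal boolean-array sketch:

```python
import numpy as np

def subtract_region(target_mask, removal_mask):
    """Erase removal_mask pixels (stroke or occluder mask) from target_mask."""
    return np.logical_and(target_mask, np.logical_not(removal_mask))

# toy masks: 4x4 sofa block with a 2x2 cushion inside it
sofa = np.zeros((6, 6), dtype=bool); sofa[1:5, 1:5] = True
cushion = np.zeros((6, 6), dtype=bool); cushion[2:4, 2:4] = True

sofa_only = subtract_region(sofa, cushion)
print(int(sofa_only.sum()))  # 16 sofa pixels minus 4 cushion pixels = 12
```

Because it is pure set subtraction, the same function serves stroke-based removal (rasterize the stroke into `removal_mask`) and tap-an-occluder removal (use the occluder's segmenter mask).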

If you later add live AR preview

Depth-based occlusion is usually more reliable than trying to infer amodal furniture geometry without training. (This is a separate layer from still-photo segmentation.)


6) Concrete “v1” suggestion (minimize risk, maximize quality)

Ship a two-model pipeline

  1. Detector: YOLO (detect-only) for boxes + UI selection

  2. Segmenter: choose one

    • MediaPipe Interactive Image Segmenter (fastest to integrate end-to-end) (Google AI for Developers)
    • or EfficientSAM ONNX (INT8) if you want “box + background points” as the core correction mechanic (Hugging Face)

UX that solves the cushion problem with minimal friction

  • Default: auto mask from box prompt

  • One button: Refine

    • Mode A: Add (foreground)
    • Mode B: Remove (background)
  • Visual feedback: show removed region preview immediately; commit high-res boundary refinement when the user stops interacting.


7) Deployment tips that matter in practice

  • Cache the “heavy” computation per image when possible (image-encoder-once, lightweight decoder per prompt is the ideal shape).

  • Treat mobile runtime as a first-class decision:

    • ONNX models → ONNX Runtime Mobile for iOS/Android (ONNX Runtime)
    • vendor packs (Qualcomm) can reduce performance tuning time on Snapdragon devices (Hugging Face)
  • Plan a parity test suite early (same preprocess, same resize, same normalization) because “looks slightly different” issues tend to come from mismatched preprocessing and resize/interpolation settings.
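The encoder-once/decoder-per-prompt shape from the first bullet can be captured in a tiny session wrapper. `encode_fn`/`decode_fn` are hypothetical placeholders for the heavy image encoder and the light prompt decoder of a SAM-style model; the point is only that the encoder runs once per photo, however many taps follow.

```python
class PromptSession:
    """Cache the image embedding; run only the light decoder per prompt."""

    def __init__(self, encode_fn, decode_fn):
        self.encode_fn = encode_fn  # heavy: run once per image
        self.decode_fn = decode_fn  # light: run per tap/box prompt
        self._embedding = None

    def set_image(self, image):
        self._embedding = self.encode_fn(image)

    def mask_for(self, prompt):
        return self.decode_fn(self._embedding, prompt)

# stand-in callables to show the call pattern
calls = {"enc": 0}
def fake_encode(img):
    calls["enc"] += 1
    return "embedding"
def fake_decode(emb, prompt):
    return (emb, prompt)

s = PromptSession(fake_encode, fake_decode)
s.set_image("photo")
s.mask_for("box"); s.mask_for("tap1"); s.mask_for("tap2")
print(calls["enc"])  # → 1: three prompts, one encoder pass
```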

(If you use ONNX export/optimization via Hugging Face tooling, Optimum-ONNX and Transformers ONNX utilities are the standard workflow.)
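A parity test suite can be as simple as fingerprinting the preprocessing output so iOS/Android/desktop pipelines can be compared bit-for-bit. The constants and the nearest-neighbor resize below are illustrative stand-ins; match them to your actual model card and interpolation mode.

```python
import hashlib
import numpy as np

MEAN, STD, SIZE = 127.5, 127.5, 8  # illustrative; use your model's values

def preprocess(img, size=SIZE):
    """Deterministic nearest-neighbor resize + fixed normalization."""
    h, w = img.shape
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    small = img[np.ix_(ys, xs)].astype(np.float32)
    return (small - MEAN) / STD

def fingerprint(arr, decimals=4):
    # round before hashing so float noise below tolerance doesn't trip tests
    return hashlib.sha256(np.round(arr, decimals).tobytes()).hexdigest()

img = (np.arange(256).reshape(16, 16) % 256).astype(np.uint8)
print(fingerprint(preprocess(img)))  # compare this hash across platforms
```

Run the same fingerprint on each platform's preprocessed tensor; a mismatch localizes the "looks slightly different" bug to preprocessing before you suspect the model.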


8) What not to bet your product on (given “no training”)

  • “True amodal segmentation on-device” as the primary path: it’s a defined research direction (e.g., KINS), but it is not generally turnkey for indoor furniture occlusions without training/adaptation. (CVF Open Access)

Quick decision guide

  • Fastest path to a good v1: MediaPipe Interactive Image Segmenter + boundary refinement. (Google AI for Developers)
  • Best control over “sofa but not cushions”: EfficientSAM (OpenCV HF) + ONNX Runtime Mobile + negative-point refinement. (Hugging Face)
  • Premium interactivity on newer phones: EdgeSAM. (arXiv)
  • Snapdragon-first Android packaging: Qualcomm MobileSam / FastSam-S. (Hugging Face)

Great question! This is a common challenge in AR/VR applications.

Possible approaches:

  1. SAM + refinement: Use SAM for initial mask, apply morphological operations to separate instances
  2. YOLO + Mask R-CNN hybrid: YOLO for detection (fast), Mask R-CNN for quality masks
  3. Amodal completion: Train a model to predict the full shape even when partially occluded
  4. Mobile optimization: Try MobileSAM or EdgeSAM - much lighter than SAM3

For furniture specifically, check interior design datasets like ARChair or SUN RGB-D.
