Training-Free Medical Image Segmentation vs Trained Networks

A multi-domain data-efficiency study — frozen LLM polygon traces + a numpy genome + a gated SAM decoder, with zero gradient training.

4 modalities | measured from scratch | GT-free at inference | LLM-agnostic skills | updated 2026-07-02

Results — best trained NN across N vs the LLM pipeline

Each cell reports the strongest trained NN: the maximum over SegFormer (mit_b2), pretrained U-Net (resnet34), and nnU-Net, each a 2-seed mean, at matched budgets N = 10 / 25 / 50 / 100. These are compared against the LLM pipeline, re-measured from scratch via eval_all_domains.py. The LLM pipeline's only label-fit module is a genome of roughly 5–8 parameters; localization and SAM are GT-free and frozen, so the LLM number is roughly flat in N while the NN climbs. An asterisk marks a best-NN score that the LLM pipeline beats at that budget.

Domain              Best NN, N=10   N=25     N=50     N=100    LLM pipeline
ISIC 2018 (derm)    0.815*          0.829*   0.832*   0.868*   0.880
MSD Spleen (CT)     0.633*          0.802*   0.859*   0.907    0.903
Kvasir-SEG (endo)   0.756           0.824    0.833    0.889    0.348
ChestXray (X-ray)   0.938           0.948    0.952    0.953    0.898

The LLM pipeline wins the scarce-label regime where it has a cue: on ISIC through N=100 (the best NN needs N=189 to reach 0.887) and on Spleen through N=50 (crossover ~N=70, via a GT-free cross-slice prior). It loses at every budget on Kvasir polyp, a structural failure because no GT-free cue localizes the polyp, and on Chest X-ray. There, a few-shot static lung box (N=10 labels) lifts the LLM pipeline to 0.898, which shows that box quality, not architecture, was the gap. Even so, box-prompted SAM caps at ~0.93 with a GT box, below the best NN's 0.953. SegFormer is the strongest NN in 10 of 12 sweep cells.
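The win/loss pattern described above can be re-derived directly from the table. The following is a minimal sketch that only re-uses the published scores; the names `BUDGETS`, `RESULTS`, and `llm_ahead` are illustrative and are not part of eval_all_domains.py.

```python
# Best-NN score per budget and the single LLM pipeline score, copied from the results table.
BUDGETS = (10, 25, 50, 100)
RESULTS = {
    "ISIC 2018":  ((0.815, 0.829, 0.832, 0.868), 0.880),
    "MSD Spleen": ((0.633, 0.802, 0.859, 0.907), 0.903),
    "Kvasir-SEG": ((0.756, 0.824, 0.833, 0.889), 0.348),
    "ChestXray":  ((0.938, 0.948, 0.952, 0.953), 0.898),
}

def llm_ahead(results=RESULTS, budgets=BUDGETS):
    """Return, per domain, the budgets N at which the LLM pipeline beats the best NN."""
    return {
        domain: [n for n, nn_score in zip(budgets, nn_scores) if llm_score > nn_score]
        for domain, (nn_scores, llm_score) in results.items()
    }

for domain, wins in llm_ahead().items():
    print(f"{domain}: LLM ahead at N = {wins}")
```

Running it lists all four budgets for ISIC, N = 10 / 25 / 50 for Spleen, and no budget for Kvasir or Chest X-ray, matching the summary.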

Data-efficiency

Matched data-efficiency: LLM pipeline (genome fit per N) vs trained NN
LLM pipeline (dashed — genome re-fit per N; ~flat) vs trained NN (solid — matched N). Spleen is capped at N=25 (30 training draws) and Kvasir at N=50 (91 draws).

Qualitative examples — 3 best / 3 worst per dataset

Green = ground truth, red = prediction. Top row = 3 highest-Dice cases, bottom row = 3 lowest.
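For reference, a minimal sketch of how per-case Dice and the 3-best / 3-worst selection could be computed. The function names are hypothetical, and treating two empty masks as a perfect match is an assumption rather than something this report specifies.

```python
import numpy as np

def dice(pred, gt):
    """Dice overlap between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return 2.0 * np.logical_and(pred, gt).sum() / denom

def best_and_worst(scores, k=3):
    """Return (indices of the k highest scores, indices of the k lowest scores)."""
    order = np.argsort(scores)  # ascending by score
    return order[::-1][:k].tolist(), order[:k].tolist()

# Toy per-case Dice scores, for illustration only.
best, worst = best_and_worst([0.9, 0.1, 0.5, 0.95, 0.0, 0.7])
print(best, worst)
```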

ISIC 2018 — skin lesion (pigment cue, LLM's best domain)

ISIC best/worst

MSD Spleen — CT (localizer + cross-slice priors + SAM re-prompt polish)

Spleen best/worst
The worst cases are localizer failures: the prediction lands on the stomach or liver, giving Dice≈0. Three additions recover them: a GT-free cross-slice prior (centroid trajectory + sibling-mask seed), a selective multi-prototype re-decode, and a SAM re-prompt boundary polish that is kept only where it lands on stronger image edges. Together they lift the spleen pipeline to 0.903 (0.786→0.840→0.868→0.875→0.903 this session, zero regressions), now level with the best trained NN at N=100 (0.907).
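One way to read "kept only where it lands on stronger image edges" is as a GT-free acceptance gate: compare the mean image-gradient magnitude along the boundary of the polished mask with that of the original mask, and keep the polish only if it is higher. The sketch below illustrates that idea under this assumption; `mask_boundary`, `edge_strength`, and `keep_polish` are hypothetical names, and the pipeline's actual criterion may differ.

```python
import numpy as np

def mask_boundary(mask):
    """Pixels of the mask that have at least one 4-neighbour outside the mask."""
    m = np.asarray(mask, dtype=bool)
    p = np.pad(m, 1, constant_values=False)
    interior = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
                & p[1:-1, :-2] & p[1:-1, 2:])
    return m & ~interior

def edge_strength(image, mask):
    """Mean gradient magnitude of the image along the mask boundary (no GT needed)."""
    gy, gx = np.gradient(np.asarray(image, dtype=float))
    magnitude = np.hypot(gx, gy)
    boundary = mask_boundary(mask)
    return float(magnitude[boundary].mean()) if boundary.any() else 0.0

def keep_polish(image, base_mask, polished_mask):
    """Accept the polished mask only if its boundary sits on stronger edges."""
    if edge_strength(image, polished_mask) > edge_strength(image, base_mask):
        return polished_mask
    return base_mask
```

Because the gate uses only the image and the two candidate masks, it can run at inference without labels, which is consistent with the GT-free constraint stated above.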

Kvasir-SEG — polyp (no GT-free cue: structural ceiling)

Kvasir best/worst
Even "best" cases are modest — polyps share color/texture with mucosa, so the coordinate draws mislocalize and SAM segments the wrong region. No GT-free signal fixes this (a redness-box lever was tested and rejected).

ChestXray — bilateral lungs (few-shot static box → SAM)

CXR best/worst
Replacing the heuristic upper-60% box with a fixed few-shot lung box (mean of N=10 train GT boxes, split at midline; box-IoU 0.545→0.67) lifts the pipeline 0.825→0.898, so CXR's gap was box localization, not architecture. A four-lever study (richer multi-point prompt, shape-prior mask_input, dark-field trim, localizer point) confirmed that none of them adds anything on top of the better box. A perfect GT box still caps SAM at ~0.93, below the trained U-Net's 0.953, so box-prompted SAM stays behind the trained network at every budget.
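A minimal sketch of the static few-shot box and the box-IoU measure. It assumes boxes are given as (x0, y0, x1, y1) and that "split at midline" means splitting the mean box at its own horizontal centre; both the box format and that reading are assumptions, and the function names are hypothetical.

```python
import numpy as np

def static_lung_boxes(train_boxes):
    """Average the training GT boxes, then split the mean box into left/right halves."""
    x0, y0, x1, y1 = np.asarray(train_boxes, dtype=float).mean(axis=0)
    mid = 0.5 * (x0 + x1)  # assumption: midline of the mean box, not of the image
    return (x0, y0, mid, y1), (mid, y0, x1, y1)

def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

# Toy boxes in normalized coordinates, for illustration only.
left, right = static_lung_boxes([[0.1, 0.2, 0.9, 0.8], [0.2, 0.2, 0.8, 0.9]])
print(left, right)
```

Each half would then serve as one fixed box prompt per lung for every test image, which is why the box needs only the N=10 training labels and no per-image ground truth.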

What we learned

Honest-reporting note: earlier drafts carried optimistic hardcoded values (ISIC 0.885, Kvasir 0.860, CXR 0.855). This report uses numbers re-measured from scratch with the current pipeline: ISIC 0.880, Kvasir 0.348, CXR 0.898.