Training-Free Medical Image Segmentation vs Trained Networks
A multi-domain data-efficiency study — frozen LLM polygon traces + a numpy genome + a gated SAM decoder, with zero gradient training.
4 modalities · measured, from scratch · GT-free at inference · LLM-agnostic skills · updated 2026-07-02
Results — best trained NN across N vs the LLM pipeline
Strongest trained NN per cell — max of SegFormer (mit_b2), pretrained U-Net (resnet34), and nnU-Net, 2-seed means — at matched budgets N = 10 / 25 / 50 / 100, against the LLM pipeline (re-measured from scratch via eval_all_domains.py). The LLM's only label-fit module is a ~5–8-param genome; localization and SAM are GT-free/frozen — so the LLM number is ~flat in N while the NN climbs. Bold = LLM ahead of the best NN at that budget.
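| Domain | Best NN, N=10 | Best NN, N=25 | Best NN, N=50 | Best NN, N=100 | LLM pipeline |
|---|---|---|---|---|---|
| ISIC 2018 (derm) | 0.815 | 0.829 | 0.832 | 0.868 | 0.880 |
| MSD Spleen (CT) | 0.633 | 0.802 | 0.859 | 0.907 | 0.903 |
| Kvasir-SEG (endo) | 0.756 | 0.824 | 0.833 | 0.889 | 0.617 |
| ChestXray (X-ray) | 0.938 | 0.948 | 0.952 | 0.953 | 0.911 |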
The LLM pipeline wins the scarce-label regime wherever it has a cue: on ISIC through N=100 (the best NN needs N=189 to reach 0.887) and on Spleen through N=50 (crossover at ~N=70, via a GT-free cross-slice prior). It loses at every budget on Kvasir polyp and on Chest X-ray. On Kvasir, no GT-free cue localizes the polyp, and the few-shot AMG ranker reaches only 0.617. On Chest X-ray, a few-shot static lung box (N=10 labels) plus a per-image mask-bbox self-reprompt lift the LLM to 0.911 (the recoverable gap was localization, not architecture), but box-prompted SAM still caps at ~0.93 even with a GT box, below the NN's 0.953. SegFormer is the strongest NN in 10 of 12 sweep cells.
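The win/loss pattern can be checked directly from the table. The sketch below copies the reported values into plain dictionaries (the names `BEST_NN`, `LLM`, and `budgets_where_llm_leads` are illustrative, not taken from eval_all_domains.py) and lists the budgets at which the LLM pipeline is ahead.

```python
# Values copied from the results table above (best trained NN per budget, LLM pipeline).
BUDGETS = (10, 25, 50, 100)
BEST_NN = {
    "ISIC 2018": (0.815, 0.829, 0.832, 0.868),
    "MSD Spleen": (0.633, 0.802, 0.859, 0.907),
    "Kvasir-SEG": (0.756, 0.824, 0.833, 0.889),
    "ChestXray": (0.938, 0.948, 0.952, 0.953),
}
LLM = {
    "ISIC 2018": 0.880,
    "MSD Spleen": 0.903,
    "Kvasir-SEG": 0.617,
    "ChestXray": 0.911,
}


def budgets_where_llm_leads(domain):
    """Return the budgets N at which the LLM score exceeds the best NN score."""
    return [n for n, nn in zip(BUDGETS, BEST_NN[domain]) if LLM[domain] > nn]


llm_leads = {domain: budgets_where_llm_leads(domain) for domain in BEST_NN}
for domain, wins in llm_leads.items():
    print(f"{domain:12s} LLM ahead at N = {wins}")
```

Because the LLM number is treated as flat in N, a single value per domain is compared against all four budget columns.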
Data-efficiency
LLM pipeline (dashed — genome re-fit per N; ~flat) vs trained NN (solid — matched N). Spleen is capped at N=25 (30 training draws) and Kvasir at N=50 (91 draws).
Qualitative examples — 3 best / 3 worst per dataset
Green = ground truth, red = prediction. Top row = 3 highest-Dice cases, bottom row = 3 lowest.
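A gallery like this can be assembled by scoring every case with Dice and taking the two ends of the ranking. The following is a minimal sketch, assuming binary masks as numpy arrays; `dice` and `best_and_worst` are hypothetical helper names, not functions from the study's code.

```python
import numpy as np


def dice(pred, gt, eps=1e-8):
    """Dice overlap between two binary masks (1.0 when both are empty)."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))


def best_and_worst(scores, k=3):
    """Return (indices of the k highest scores, indices of the k lowest scores)."""
    order = np.argsort(scores)          # ascending
    worst = order[:k].tolist()
    best = order[::-1][:k].tolist()     # descending
    return best, worst


# Toy per-case Dice scores for illustration only.
scores = [0.90, 0.10, 0.50, 0.95, 0.00, 0.70, 0.85]
best, worst = best_and_worst(scores)
print("top row:", best, "bottom row:", worst)
```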
ISIC 2018 — skin lesion (pigment cue, LLM's best domain)
MSD Spleen (CT)
The worst cases are localizer failures: the prediction lands on the stomach or liver, giving Dice≈0. A GT-free cross-slice prior (centroid trajectory + sibling-mask seed), a selective multi-prototype re-decode, and a SAM re-prompt boundary polish (kept only where it lands on stronger image edges) recover them. Together they lift the spleen pipeline to 0.903 (0.786→0.840→0.868→0.875→0.903 this session, zero regressions), now level with the best trained NN at N=100.
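The edge-gated polish can be pictured as a comparison of boundary edge strength before and after the re-prompt. The sketch below is one plausible reading, not the study's implementation: it gates a whole mask rather than a region, measures edges with a plain gradient magnitude, and the `margin` parameter stands in for whatever threshold the train-calibrated gate uses. All function names are hypothetical.

```python
import numpy as np


def boundary(mask):
    """Mask pixels that touch background in the 4-neighbourhood."""
    mask = np.asarray(mask, dtype=bool)
    m = np.pad(mask, 1, constant_values=False)
    interior = (m[1:-1, 1:-1] & m[:-2, 1:-1] & m[2:, 1:-1]
                & m[1:-1, :-2] & m[1:-1, 2:])
    return mask & ~interior


def edge_strength(image, mask):
    """Mean image-gradient magnitude along the mask boundary (0.0 if empty)."""
    gy, gx = np.gradient(np.asarray(image, dtype=float))
    grad = np.hypot(gx, gy)
    b = boundary(mask)
    return float(grad[b].mean()) if b.any() else 0.0


def gated_polish(image, base_mask, polished_mask, margin=0.0):
    """Keep the polished mask only if its boundary sits on stronger edges."""
    gain = edge_strength(image, polished_mask) - edge_strength(image, base_mask)
    return polished_mask if gain > margin else base_mask
```

A gate of this shape can only ever swap in a mask whose boundary scores higher, which is one way a polish step can be added without regressions on the gating criterion.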
Kvasir-SEG — polyp (AMG ranker localizer; the gap was localization)
Polyps share colour and texture with the surrounding mucosa, so the heuristic box and coordinate draws mislocalize (~43/100 images fall to Dice≈0), and a hand-crafted redness box was rejected. SAM's automatic-mask generator (AMG), however, proposes the polyp in every image (best-candidate oracle 0.798). A few-shot ranker, which combines the standardized cosine similarity of frozen SAM features to a polyp prototype with a train-fit size prior, picks that candidate and lifts the pipeline 0.348→0.617 (median 0.202→0.765). The residual to the NN is the ~0.78–0.80 boundary oracle plus ranker error, not a perception ceiling. Note: the example gallery predates this localizer and shows the older box→SAM masks.
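As a rough illustration of how such a ranker could score AMG candidates, the sketch below assumes each candidate comes with a pooled feature vector and an area fraction. Several details are assumptions rather than facts from the study: "standardized" is read as z-scoring the cosine similarities within one image, the size prior is a Gaussian on log area fraction, and the two terms are summed with a hypothetical `size_weight`.

```python
import numpy as np


def fit_prototype(train_feats):
    """Polyp prototype: mean feature vector over the few-shot training masks."""
    return np.asarray(train_feats, dtype=float).mean(axis=0)


def fit_size_prior(train_area_fracs):
    """Gaussian prior on log area fraction, fit on the training masks."""
    log_a = np.log(np.asarray(train_area_fracs, dtype=float))
    return float(log_a.mean()), float(log_a.std() + 1e-6)


def rank_candidates(feats, area_fracs, prototype, size_mu, size_sigma,
                    size_weight=1.0):
    """Return (index of the best candidate, per-candidate scores)."""
    feats = np.asarray(feats, dtype=float)
    proto = np.asarray(prototype, dtype=float)
    proto = proto / np.linalg.norm(proto)
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    cos = unit @ proto
    z = (cos - cos.mean()) / (cos.std() + 1e-8)   # standardize within the image
    log_a = np.log(np.asarray(area_fracs, dtype=float))
    size_term = -0.5 * ((log_a - size_mu) / size_sigma) ** 2
    scores = z + size_weight * size_term
    return int(np.argmax(scores)), scores
```

In this toy form the size prior breaks ties between candidates that look alike in feature space, for example a full-frame mucosa mask versus a polyp-sized region.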
ChestXray (X-ray)
Two localization skills drive the gain: (1) a fixed few-shot lung box (the mean of N=10 train GT boxes, split at the midline; box-IoU 0.545→0.67) replaces the heuristic upper-60% box and lifts Dice 0.825→0.898; (2) a mask-bbox self-reprompt (the box's own mask gives a tighter per-image bbox; box-IoU 0.67→0.83) lifts it 0.898→0.911. The recoverable CXR gap was box localization, not architecture. A four-lever prompt study and a three-method localization study both hit the same wall: even a perfect GT box caps SAM at ~0.93, below the trained U-Net's 0.953, so box-prompted SAM loses at every budget, and the residual is the frozen decoder.
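The two box skills are simple enough to sketch in numpy. The version below is a sketch under stated assumptions: boxes are `(x0, y0, x1, y1)` with exclusive upper bounds, the midline is taken to be the centre of the mean box, and `segment_fn(image, box)` is a hypothetical stand-in for a box-prompted SAM call that returns a binary mask.

```python
import numpy as np


def mean_box(boxes):
    """Static few-shot box: coordinate-wise mean of the training GT boxes."""
    return tuple(np.asarray(boxes, dtype=float).mean(axis=0).tolist())


def split_at_midline(box):
    """Split one box into left and right halves at its vertical midline."""
    x0, y0, x1, y1 = box
    xm = 0.5 * (x0 + x1)
    return (x0, y0, xm, y1), (xm, y0, x1, y1)


def mask_bbox(mask):
    """Tight bounding box of a binary mask, or None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)


def box_iou(a, b):
    """Intersection over union of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def self_reprompt(segment_fn, image, box):
    """Segment once, then re-prompt with the first mask's own tight bbox."""
    first = segment_fn(image, box)
    tight = mask_bbox(first)
    if tight is None:
        return first, box
    return segment_fn(image, tight), tight
```

Tracking `box_iou` against the GT box before and after the re-prompt is how a box-IoU improvement like the one reported above would be measured.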
What we learned
ISIC: the LLM pipeline beats nnU-Net at every budget (+0.116 at N=10 → +0.025 at N=100) — genuinely more label-efficient for dermoscopy.
Spleen: a stack of GT-free skills — within-volume prior, cross-slice geometry, selective multi-prototype, and a SAM re-prompt boundary polish (gate calibrated on train) — lifts 0.786 → 0.903, zero regressions each step, beating the best trained NN through N=50 and reaching parity at N=100 (0.903 vs 0.907).
Kvasir: an AMG ranker localizer lifts the polyp 0.348→0.617 — the gap was localization, not a boundary ceiling. SAM's automatic-mask generator proposes the polyp in every image (oracle 0.798); a few-shot ranker (frozen-SAM-feature similarity to a polyp prototype + a size prior) picks it, rescuing 33/43 mislocalized cases where hand-crafted colour cues (redness) failed. Still below the NN at every budget (residual = boundary + ranker error), but no longer the "structural ceiling" it looked like.
Chest X-ray: a few-shot static lung box (mean of N=10 train GT boxes) plus a per-image mask-bbox self-reprompt lift the LLM 0.825→0.898→0.911. The recoverable gap was box localization: richer prompts, shape priors, dark-field trims, and box regressors all add nothing beyond it. What remains is architectural: box-prompted frozen SAM reaches only ~0.93 even with a perfect GT box, below the trained U-Net's 0.953, so CXR loses at every budget and the residual is the frozen decoder, not localization.
Honest-reporting note: earlier drafts carried stale hardcoded values (ISIC 0.885, Kvasir 0.860, CXR 0.855), optimistic for ISIC and Kvasir. This report uses numbers re-measured from scratch with the current pipeline: ISIC 0.880, Kvasir 0.617, CXR 0.911.