Training-Free Medical Image Segmentation vs Trained Networks
A multi-domain data-efficiency study — frozen LLM polygon traces + a numpy genome + a gated SAM decoder, with zero gradient training.
4 modalities | measured, from scratch | GT-free at inference | LLM-agnostic skills | updated 2026-07-02
Results — trained NN across N vs the LLM pipeline
Trained-NN Dice at matched label budgets N = 10 / 25 / 50 / 100 (best seed), compared with the LLM pipeline, which was re-measured from scratch with the current config via eval_all_domains.py. The pipeline's only label-fitted module is a genome of roughly 5–8 parameters; everything else is frozen or GT-free, so its Dice stays roughly flat in N while the NN's climbs.
Domain               NN N=10   NN N=25   NN N=50   NN N=100   LLM pipeline
ISIC 2018 (derm)     0.775     0.839     0.864     0.869      0.880
MSD Spleen (CT)      0.542     0.736     0.823     0.932      0.840
Kvasir-SEG (endo)    0.613     0.792     0.846     0.846      0.332
ChestXray (X-ray)    0.929     0.946     0.950     0.954      0.825
NN = pretrained U-Net, best seed (for ISIC, nnU-Net). An nnU-Net benchmark at N = 10 / 25 / 50 / 100 for Spleen, Kvasir, and CXR is running and will replace the U-Net columns when complete. The LLM pipeline wins the scarce-label regime wherever it has a cue: ISIC at every N, and Spleen through N=50 (crossover at roughly N=55). It loses on targets with no cue: Kvasir polyps hit a structural ceiling, and CXR hits the box-prompted-SAM ceiling, where even a perfect GT box reaches only 0.943 < 0.953.
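The per-budget margins and the Spleen crossover can be sanity-checked from the reported numbers alone. The sketch below is illustrative and is not part of eval_all_domains.py: it assumes the NN curve is linear in log N between measured budgets (an assumption made here, not stated in the report), and the names `NN_DICE`, `LLM_DICE`, `margins`, and `crossover_budget` are hypothetical.

```python
import math

# Best-seed NN Dice per label budget and LLM pipeline Dice, copied from the results above.
BUDGETS = [10, 25, 50, 100]
NN_DICE = {
    "ISIC 2018": [0.775, 0.839, 0.864, 0.869],
    "MSD Spleen": [0.542, 0.736, 0.823, 0.932],
    "Kvasir-SEG": [0.613, 0.792, 0.846, 0.846],
    "ChestXray": [0.929, 0.946, 0.950, 0.954],
}
LLM_DICE = {
    "ISIC 2018": 0.880,
    "MSD Spleen": 0.840,
    "Kvasir-SEG": 0.332,
    "ChestXray": 0.825,
}

# LLM-minus-NN Dice margin at each budget (positive = LLM pipeline ahead).
margins = {
    domain: [round(LLM_DICE[domain] - score, 3) for score in scores]
    for domain, scores in NN_DICE.items()
}


def crossover_budget(budgets, nn_scores, llm_score):
    """Estimate the budget N at which the NN first matches the LLM score.

    Assumes NN Dice is linear in log(N) between measured budgets.
    Returns the smallest budget if the NN is already ahead there,
    and None if the NN never reaches the LLM score in the measured range.
    """
    if nn_scores[0] >= llm_score:
        return float(budgets[0])
    for i in range(len(budgets) - 1):
        s0, s1 = nn_scores[i], nn_scores[i + 1]
        if s0 < llm_score <= s1:
            frac = (llm_score - s0) / (s1 - s0)
            log_n = math.log(budgets[i]) + frac * (
                math.log(budgets[i + 1]) - math.log(budgets[i])
            )
            return math.exp(log_n)
    return None


spleen_crossover = crossover_budget(BUDGETS, NN_DICE["MSD Spleen"], LLM_DICE["MSD Spleen"])
isic_crossover = crossover_budget(BUDGETS, NN_DICE["ISIC 2018"], LLM_DICE["ISIC 2018"])
```

Under the log-linear assumption the Spleen crossover falls between N=50 and N=100, in the mid-50s, while ISIC returns `None` because the NN never reaches 0.880 within the measured budgets. A plain linear interpolation in N would give a slightly larger estimate.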
Data-efficiency
LLM pipeline (dashed — genome re-fit per N; ~flat) vs trained NN (solid — matched N). Spleen is capped at N=25 (30 training draws) and Kvasir at N=50 (91 draws).
Qualitative examples — 3 best / 3 worst per dataset
Green = ground truth, red = prediction. Top row = 3 highest-Dice cases, bottom row = 3 lowest.
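For reference, the Dice score behind this ranking and the best/worst selection can be sketched as follows. This is a minimal illustration rather than the report's evaluation code: the function names `dice` and `best_and_worst` are hypothetical, and treating two empty masks as a perfect match is an assumed convention.

```python
import numpy as np


def dice(pred, gt):
    """Dice coefficient between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * np.logical_and(pred, gt).sum() / denom)


def best_and_worst(scores, k=3):
    """Return (indices of the k highest scores, indices of the k lowest scores)."""
    order = np.argsort(scores)           # ascending by Dice
    worst = order[:k].tolist()           # lowest first
    best = order[::-1][:k].tolist()      # highest first
    return best, worst


# Toy example: six per-image Dice scores.
scores = [0.90, 0.10, 0.50, 0.95, 0.00, 0.70]
best, worst = best_and_worst(scores)
```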
ISIC 2018 — skin lesion (pigment cue, LLM's best domain)
MSD Spleen — CT (localizer + within-volume prior)
The worst cases are localizer failures: the prediction lands on the stomach or liver (Dice ≈ 0). These are exactly the slices the GT-free within-volume prior targets; the deployed pipeline recovers them to Dice 0.36–0.60 using confident sibling slices.
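One way such a within-volume prior could work is sketched below. This is not the deployed implementation, whose details are not given here. The sketch assumes each slice has a prompt box and a scalar confidence, keeps slices above an assumed threshold, and replaces every other slice's box by interpolating along the slice axis between its nearest confident siblings. The function name `sibling_prior_boxes` and the `min_conf` threshold are hypothetical.

```python
import numpy as np


def sibling_prior_boxes(boxes, confidence, min_conf=0.8):
    """Replace low-confidence slice boxes with boxes interpolated from confident siblings.

    boxes:      (S, 4) array of per-slice boxes (x0, y0, x1, y1), ordered by slice index.
    confidence: (S,) array of per-slice confidence scores (source is an assumption).
    Returns (new_boxes, trusted_mask). No ground truth is used.
    """
    boxes = np.asarray(boxes, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    trusted = confidence >= min_conf
    out = boxes.copy()
    if not trusted.any():
        # Nothing to anchor on: leave the volume unchanged.
        return out, trusted
    idx = np.arange(len(boxes))
    for c in range(4):
        # np.interp clamps to the nearest trusted slice beyond either end of the range.
        out[~trusted, c] = np.interp(idx[~trusted], idx[trusted], boxes[trusted, c])
    return out, trusted


# Toy volume: slice 2 is mislocalized far from its neighbours.
boxes = [
    [10, 10, 20, 20],
    [12, 10, 22, 20],
    [90, 90, 99, 99],
    [16, 10, 26, 20],
    [18, 10, 28, 20],
]
confidence = [0.9, 0.9, 0.2, 0.9, 0.9]
fixed, trusted = sibling_prior_boxes(boxes, confidence)
```

In this toy volume the outlier box on slice 2 is pulled back between its neighbours. The corrected boxes would then be passed back to the decoder for re-decoding, which is not shown.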
Kvasir-SEG — polyp (no GT-free cue: structural ceiling)
Even "best" cases are modest — polyps share color/texture with mucosa, so the coordinate draws mislocalize and SAM segments the wrong region. No GT-free signal fixes this (a redness-box lever was tested and rejected).
ChestXray — bilateral lungs (box-prompted SAM)
The worst cases miss the costophrenic recesses or an entire lung field; a perfect GT box caps SAM at 0.943, below the trained U-Net's 0.953.
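The "perfect GT box" used for this ceiling is the tightest box around the ground-truth mask. A minimal sketch of deriving it is shown below; the helper name `mask_to_box` and the inclusive (x0, y0, x1, y1) pixel convention are assumptions, and the SAM call that consumes the box is omitted.

```python
import numpy as np


def mask_to_box(mask):
    """Return the tight bounding box (x0, y0, x1, y1) of a binary mask, inclusive.

    Returns None for an empty mask.
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return None
    rows = np.flatnonzero(mask.any(axis=1))  # row indices that contain foreground
    cols = np.flatnonzero(mask.any(axis=0))  # column indices that contain foreground
    return (int(cols[0]), int(rows[0]), int(cols[-1]), int(rows[-1]))


# Toy mask: a 3x4 foreground rectangle inside a 6x8 image.
gt = np.zeros((6, 8), dtype=bool)
gt[2:5, 3:7] = True
box = mask_to_box(gt)
```

Because this box comes from the ground truth, any score obtained with it is an upper bound on what a box prompt can deliver, not a deployable result.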
What we learned
ISIC: the LLM pipeline beats nnU-Net at every budget (by the results table, +0.105 at N=10, narrowing to +0.011 at N=100), making it genuinely more label-efficient for dermoscopy.
Spleen: a GT-free within-volume prior (re-decode mislocalized/over-seg CT slices from confident sibling slices) lifts 0.786 → 0.840, beating U-Net through N=50.
Kvasir: honest 100-image Dice is 0.332 — no GT-free cue localizes the polyp (the 0.86 sometimes cited used GT-hand-redrawn images).
Chest X-ray: box-prompted frozen SAM is architecturally capped — even a perfect GT box reaches only 0.943 < the trained U-Net's 0.953.
Honest-reporting note: earlier drafts carried optimistic hardcoded values (ISIC 0.885, Kvasir 0.860, CXR 0.855). This report uses numbers re-measured from scratch with the current pipeline: ISIC 0.880, Kvasir 0.332, CXR 0.825.