Training-Free Medical Image Segmentation vs Trained Networks
A multi-domain data-efficiency study: frozen LLM polygon traces + a numpy genome + a gated SAM decoder, with zero gradient training.
4 modalities · measured, from scratch · GT-free at inference · LLM-agnostic skills · updated 2026-07-03
Pipeline
One image, end to end. A GT-free router selects skills per image. The drawer seam (amber) is the only model-coupled point: depending on the domain it is an LLM polygon drawer, a frozen SAM-feature prototype localizer, an AMG ranker, or a static box. Everything downstream is numpy/config. Green boxes are iterative loops: SAM fixed-point re-prompts and the spleen two-pass within-volume prior.
The deployed pipeline is feed-forward, with SAM fixed-point re-prompt loops and a gated repair stage. The red box lists iterative-feedback ideas whose fixes work but that no valid GT-free gate could deploy on no-cue targets (crop-zoom, verbal judge→redraw, keep-best fix-loop, catastrophic negative-point re-prompt). The recurring lesson: the fix is rarely the bottleneck; the GT-free gate is.
Results: best trained NN across N vs the LLM pipeline
Strongest trained NN per cell (the max of SegFormer (mit_b2), pretrained U-Net (resnet34), and nnU-Net, 2-seed means) at matched budgets N = 10 / 25 / 50 / 100, against the LLM pipeline (re-measured from scratch via eval_all_domains.py). The LLM's only label-fit module is a ~5-8-param genome; localization and SAM are GT-free/frozen, so the LLM number is ~flat in N while the NN climbs. Bold = LLM ahead of the best NN at that budget.
| Domain | best NN N=10 | N=25 | N=50 | N=100 | LLM pipeline |
|---|---|---|---|---|---|
| ISIC 2018 (derm) | 0.815 | 0.829 | 0.832 | 0.868 | 0.880 |
| MSD Spleen (CT) | 0.633 | 0.802 | 0.859 | 0.907 | 0.920 |
| Kvasir-SEG (endo) | 0.756 | 0.824 | 0.833 | 0.889 | 0.659 |
| ChestXray (X-ray) | 0.938 | 0.948 | 0.952 | 0.953 | 0.922 |
The LLM pipeline wins the scarce-label regime where it has a cue: ISIC through N=100 (the best NN needs N=189 to reach 0.887) and Spleen through N=100 (0.920 vs best-NN 0.907, via GT-free cross-slice + pyramid priors). It loses at every budget on Kvasir polyp and Chest X-ray. On Kvasir, an LLM-box localizer + an AMG margin keep-best lift it to 0.659 (the gap was localization, not a boundary ceiling), still below best-NN 0.889. On Chest X-ray, a static lung box (N=10) + a self-reprompt + a repair agent + a catastrophic fix + a letterbox trim lift the LLM to 0.922 (localization, not architecture, was the gap), but box-prompted SAM still caps at ~0.93 even with a GT box, below 0.953. SegFormer is the strongest NN in 10 of 12 sweep cells.
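The win/loss pattern can be read straight off the table. The short sketch below only re-derives the per-budget margins from the tabulated Dice scores; the dictionary and function names are illustrative and are not part of eval_all_domains.py.

```python
# Dice scores copied from the results table; names below are illustrative only.
BUDGETS = (10, 25, 50, 100)
BEST_NN = {
    "ISIC 2018": (0.815, 0.829, 0.832, 0.868),
    "MSD Spleen": (0.633, 0.802, 0.859, 0.907),
    "Kvasir-SEG": (0.756, 0.824, 0.833, 0.889),
    "ChestXray": (0.938, 0.948, 0.952, 0.953),
}
LLM = {
    "ISIC 2018": 0.880,
    "MSD Spleen": 0.920,
    "Kvasir-SEG": 0.659,
    "ChestXray": 0.922,
}


def llm_margins(domain):
    """Return {N: LLM Dice minus best-NN Dice} for one domain."""
    return {n: round(LLM[domain] - nn, 3)
            for n, nn in zip(BUDGETS, BEST_NN[domain])}


def llm_wins(domain):
    """Budgets at which the LLM pipeline is ahead of the best trained NN."""
    return [n for n, margin in llm_margins(domain).items() if margin > 0]


if __name__ == "__main__":
    for name in BEST_NN:
        print(name, llm_margins(name), "LLM ahead at N =", llm_wins(name))
```

Because the LLM number is roughly flat in N, the margin shrinks monotonically as the NN budget grows; the question per domain is only whether it has crossed zero by N=100.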
Data-efficiency
LLM pipeline (dashed; genome re-fit per N; ~flat) vs trained NN (solid; matched N). Spleen is capped at N=25 (30 training draws) and Kvasir at N=50 (91 draws).
Qualitative examples: 3 best / 3 worst per dataset
Green = ground truth, red = prediction. Top row = 3 highest-Dice cases, bottom row = 3 lowest.
ISIC 2018: skin lesion (pigment cue, LLM's best domain)
The worst spleen cases are localizer failures (the prediction lands on stomach/liver, Dice≈0). Four GT-free steps recover them: a cross-slice prior (centroid trajectory + sibling-mask seed), a selective multi-prototype re-decode, a SAM re-prompt boundary polish, and a final pyramid keep-best (crop→zoom→SAM-decode each tiny tip slice at higher resolution). Together they lift the spleen pipeline to 0.920 (0.786→0.840→0.868→0.875→0.903→0.920 this session), now beating the best trained NN at N=100 (0.920 vs 0.907). Pyramid is complementary to the within-volume prior: the prior re-decodes a tip slice from its siblings, while pyramid re-decodes it at higher resolution.
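The pyramid keep-best step (crop, zoom, re-decode, keep only if the gate improves) can be sketched in a few lines of numpy. This is a minimal sketch under stated assumptions, not the pipeline's code: `decode_fn` stands in for a SAM decode on the zoomed crop, `gate_fn` for the GT-free soft-tissue gate, and `zoom` and `margin` are hypothetical defaults. It assumes a 2-D grayscale slice.

```python
import numpy as np


def crop_box(mask, margin, shape):
    """Bounding box of a binary mask, padded by `margin` and clipped to the image."""
    ys, xs = np.nonzero(mask)
    y0 = max(int(ys.min()) - margin, 0)
    x0 = max(int(xs.min()) - margin, 0)
    y1 = min(int(ys.max()) + 1 + margin, shape[0])
    x1 = min(int(xs.max()) + 1 + margin, shape[1])
    return y0, y1, x0, x1


def pyramid_keep_best(image, mask, decode_fn, gate_fn, zoom=4, margin=8):
    """Re-decode a small mask at higher resolution; keep it only if the gate improves.

    decode_fn(zoomed_crop) -> bool mask of the same shape (hypothetical SAM stand-in)
    gate_fn(image, mask)   -> float, higher is better (hypothetical GT-free gate)
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return mask
    y0, y1, x0, x1 = crop_box(mask, margin, image.shape)
    crop = image[y0:y1, x0:x1]
    # Nearest-neighbour upsampling by an integer factor.
    zoomed = np.repeat(np.repeat(crop, zoom, axis=0), zoom, axis=1)
    zoomed_mask = np.asarray(decode_fn(zoomed), dtype=bool)
    # Downsample by majority vote over each zoom x zoom block.
    h, w = crop.shape
    small = zoomed_mask.reshape(h, zoom, w, zoom).mean(axis=(1, 3)) >= 0.5
    candidate = np.zeros_like(mask)
    candidate[y0:y1, x0:x1] = small
    # Keep-best: the re-decode replaces the original only if the gate prefers it.
    return candidate if gate_fn(image, candidate) > gate_fn(image, mask) else mask
```

The keep-best comparison is what makes the step safe to stack on top of the within-volume prior: a bad high-resolution re-decode is simply discarded, provided the gate is valid for the domain.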
Kvasir-SEG: polyp (LLM-box localizer; the gap was localization)
Polyps share colour/texture with mucosa, so the heuristic box and coordinate draws mislocalize (~43/100 to Dice≈0), and a hand-crafted redness box was rejected. Localization turns out to be solvable two ways.
The deployed path is an LLM-box localizer: the vision LLM boxes the polyp per image (box-IoU 0.46 vs GT; genuine perception, no leak) → SAM decode, lifting the pipeline 0.348→0.633 (median 0.743). A final AMG-candidate margin keep-best (add SAM auto-mask re-localization proposals to the family, and swap one in only when the gate clearly beats the deployed box) then rescues the ~3 unambiguous mislocalizations, reaching 0.659 (+0.026, zero regressions).
A fully LLM-agnostic variant reaches 0.617 (median 0.837): SAM's automatic-mask generator proposes the polyp in every image (best-candidate oracle 0.798), and a few-shot ranker picks it. The ranker scores each candidate as the standardized cosine similarity of frozen SAM features to a polyp prototype, plus a train-fit size prior, minus a vignette penalty.
The LLM box wins the mean by +0.016 (at the cost of a −0.094 median and model coupling). Either way the residual to the NN is the ~0.78-0.80 boundary oracle plus localizer error, not a perception ceiling.
Gallery (LLM-box path): top row = hits (the box lands on the polyp, Dice ≈ 0.98); bottom row = localization misses (Dice 0: the ~13 cases where the box lands off the polyp), which is the whole residual.
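A minimal sketch of the LLM-agnostic ranker described above: standardized cosine similarity to a prototype, plus a size prior, minus a vignette penalty. The exact feature pooling, the form of the size prior (a Gaussian on log-area is assumed here) and the weights are not given in this report, so every parameter name below is hypothetical.

```python
import numpy as np


def zscore(x):
    """Standardize scores across the candidates of one image."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + 1e-8)


def rank_candidates(feats, areas, border_frac, prototype,
                    size_mu, size_sigma, w_size=1.0, w_vignette=1.0):
    """Return (index of the best candidate, per-candidate scores).

    feats:       (K, D) one pooled frozen-feature vector per candidate mask
    areas:       (K,)   mask area as a fraction of the image
    border_frac: (K,)   fraction of each mask lying in the dark vignette border
    prototype:   (D,)   mean feature of the few labelled training polyps
    size_mu, size_sigma: mean/std of log-area fit on training masks (assumed form)
    """
    feats = np.asarray(feats, dtype=float)
    prototype = np.asarray(prototype, dtype=float)
    proto = prototype / (np.linalg.norm(prototype) + 1e-8)
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    cos = zscore(unit @ proto)
    # Gaussian log-prior on log-area; an assumption, not the report's stated form.
    size_prior = -0.5 * ((np.log(np.asarray(areas, dtype=float)) - size_mu)
                         / size_sigma) ** 2
    penalty = np.asarray(border_frac, dtype=float)
    score = cos + w_size * size_prior - w_vignette * penalty
    return int(np.argmax(score)), score
```

Standardizing the cosine term per image keeps it on a comparable scale to the prior and the penalty, which matters because raw cosine similarities between frozen features tend to cluster in a narrow band.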
On Chest X-ray, five GT-free skills close most of the gap:
(1) a fixed few-shot lung box (mean of N=10 train GT boxes; box-IoU 0.545→0.67) replaces the heuristic box: 0.825→0.898;
(2) a mask-bbox self-reprompt (box-IoU 0.67→0.83): 0.898→0.911;
(3) a repair agent that trims SAM's bright "out of lung" over-seg into the mediastinum (a decoder error a redraw loop can't reach), fire-gated + keep-best: 0.911→0.914, zero-regression;
(4) a catastrophic fix, a SAM negative-point re-prompt (positive points on the dark lung, negative on the spill) that carves the worst engulfing masks the conservative trim can't: 0.914→0.920;
(5) a letterbox-component trim that drops mask blobs sitting over near-black padding (on letterboxed films the misregistered atlas + darkness both score the black border as lung; only raw intensity separates it, since lung air is dark-gray, not pure black), fixing MCUCXR_0060 0.76→0.88, zero-regression: 0.920→0.922.
The remaining worst cases (bottom row) are a different, harder failure: inferior over-extension into the abdomen, a decoder-boundary error. CXR's gap was box localization, not architecture, but a perfect GT box still caps SAM at ~0.93 < the U-Net's 0.953, so it stays capped at every budget; the residual is the frozen decoder.
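Skills (1) and (2) are simple enough to sketch. The block below is an illustration under assumptions, not the deployed code: boxes are `(x0, y0, x1, y1)` in a shared coordinate frame, `decode_fn` is a hypothetical stand-in for a box-prompted SAM decode, and the stopping tolerance `tol` and `max_iters` are made-up defaults.

```python
import numpy as np


def mean_box(train_boxes):
    """Static prompt box: coordinate-wise mean of N training GT boxes (x0, y0, x1, y1)."""
    return np.asarray(train_boxes, dtype=float).mean(axis=0)


def box_iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    def area(r):
        return max(r[2] - r[0], 0.0) * max(r[3] - r[1], 0.0)
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def mask_bbox(mask):
    """Tight (x0, y0, x1, y1) box around a non-empty binary mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max() + 1, ys.max() + 1], dtype=float)


def self_reprompt(image, box, decode_fn, max_iters=3, tol=0.98):
    """Re-prompt with the bbox of the decoded mask until the box stops moving.

    decode_fn(image, box) -> bool mask (hypothetical box-prompted decoder).
    """
    mask = decode_fn(image, box)
    for _ in range(max_iters):
        if not mask.any():
            break
        new_box = mask_bbox(mask)
        if box_iou(box, new_box) >= tol:  # fixed point reached
            break
        box = new_box
        mask = decode_fn(image, box)
    return mask, box
```

The fixed-point test on box-IoU is one plausible stopping rule; the report only states that the re-prompt raises box-IoU from 0.67 to 0.83, not how convergence is detected.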
What we learned
ISIC: the LLM pipeline beats nnU-Net at every budget (+0.116 at N=10 → +0.025 at N=100), so it is genuinely more label-efficient for dermoscopy.
Spleen: a stack of GT-free skills (within-volume prior, cross-slice geometry, selective multi-prototype, a SAM re-prompt boundary polish, and a pyramid keep-best that re-decodes the tiny tip slices at higher resolution) lifts 0.786 → 0.920, now beating the best trained NN at N=100 (0.920 vs 0.907; CV held-out). The pyramid and the volume prior see the tip slices two different ways (resolution vs siblings) and combine additively via a valid soft-tissue gate.
Kvasir: an LLM-box localizer + an AMG-candidate margin keep-best (re-localization proposals swapped in only on the clearest mislocalizations) lift the polyp 0.348→0.633→0.659 (median 0.743). The gap was localization, not a boundary ceiling, and it is solvable both coupled and agnostic. The deployed path has the vision LLM box the polyp per image (box-IoU 0.46, no leak) → SAM decode; a fully LLM-agnostic variant (SAM automatic-mask generator, oracle 0.798, + a frozen-feature ranker) reaches 0.617 (median 0.837) where hand-crafted colour cues (redness) failed. The LLM box wins the mean by +0.016 at a −0.094 median. Still below the NN at every budget (residual = boundary + localizer error), but no longer the "structural ceiling" it looked like.
Chest X-ray: a few-shot static lung box + a mask-bbox self-reprompt + a repair agent (trims out-of-lung over-seg) + a catastrophic fix (SAM negative-point re-prompt on the worst spilled masks) + a letterbox trim (drops blobs over near-black padding) lift the LLM 0.825→0.898→0.911→0.914→0.920→0.922. The gap was box localization, not architecture. But box-prompted frozen SAM stays architecturally capped: even a perfect GT box reaches only ~0.93 < the trained U-Net's 0.953, so CXR loses at every budget; the residual is the frozen decoder, not localization.
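The letterbox trim mentioned in the Chest X-ray item relies only on raw intensity, which makes it easy to sketch: label the mask's connected components and drop any that sit mostly over near-black padding. The thresholds `black_thresh` and `max_black_frac` below are hypothetical, the image is assumed to be scaled to [0, 1], and the component labelling is a plain breadth-first search so the sketch needs nothing beyond numpy.

```python
from collections import deque

import numpy as np


def connected_components(mask):
    """4-connected component labels (0 = background) via breadth-first search."""
    labels = np.zeros(mask.shape, dtype=int)
    h, w = mask.shape
    current = 0
    for sy, sx in zip(*np.nonzero(mask)):
        if labels[sy, sx]:
            continue
        current += 1
        labels[sy, sx] = current
        queue = deque([(sy, sx)])
        while queue:
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not labels[ny, nx]:
                    labels[ny, nx] = current
                    queue.append((ny, nx))
    return labels, current


def letterbox_trim(image, mask, black_thresh=0.02, max_black_frac=0.5):
    """Drop mask components that lie mostly over near-black letterbox padding.

    Lung air is dark gray, not pure black, so a low absolute intensity
    threshold separates padding from lung (thresholds here are assumptions).
    """
    mask = np.asarray(mask, dtype=bool)
    labels, n = connected_components(mask)
    near_black = np.asarray(image) <= black_thresh
    keep = np.zeros_like(mask)
    for i in range(1, n + 1):
        component = labels == i
        if near_black[component].mean() <= max_black_frac:
            keep |= component
    return keep
```

Working per component rather than per pixel is what keeps the trim conservative: a lung blob that merely touches the padding is kept whole, while a blob that lives on the border is removed whole.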
When does iterative feedback deploy? The fix is general, the gate is the wall
We built a full repair-agent loop (verbal + visual feedback → SAM re-prompt → keep-best): the repair stage renders a diagnosis overlay (current mask + a GT-free "out-of-region" heat-map + a coordinate grid); the vision LLM looks, says what is wrong verbally, and marks the over-segmentation as SAM negative points; SAM re-decodes; a GT-free keep-best gate accepts or rejects; repeat. On chest X-ray the LLM's visual negatives beat the heuristic catastrophic fix (CHNCXR_0447 0.849→0.902, MCUCXR_0055 0.860→0.955).
The cases the GT-free gate accepted (top = feed-forward catfix, bottom = the loop). The LLM's visual negatives tighten SAM off the mediastinum/abdomen better than the spill heuristic. Full-107: 0.9216→0.9240 (+0.0024), zero-regression, gate near-perfect (oracle 0.9241).
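The control flow of that loop is small. The sketch below shows only the keep-best skeleton; the three callables are hypothetical stand-ins (`propose_negatives` for the vision LLM marking negative points, `redecode` for the SAM re-prompt, `gate` for the GT-free mask-quality score), and `max_rounds` and `margin` are illustrative defaults.

```python
def keep_best_repair(image, mask, propose_negatives, redecode, gate,
                     max_rounds=3, margin=0.0):
    """Iterate propose -> re-decode -> gate, keeping a candidate only if the gate improves.

    propose_negatives(image, mask)   -> list of negative points (empty = nothing to fix)
    redecode(image, mask, negatives) -> new candidate mask
    gate(image, mask)                -> float, higher is better, computed without GT
    Returns the best mask and the history of accepted gate scores.
    """
    best, best_score = mask, gate(image, mask)
    history = [best_score]
    for _ in range(max_rounds):
        negatives = propose_negatives(image, best)
        if not negatives:
            break  # the proposer sees no over-segmentation left
        candidate = redecode(image, best, negatives)
        score = gate(image, candidate)
        if score <= best_score + margin:
            break  # rejected: keep the previous best, never regress on the gate
        best, best_score = candidate, score
        history.append(score)
    return best, history
```

Note that "never regress" here is a statement about the gate score only. Whether it also means zero regression in true Dice depends entirely on how well the gate tracks Dice in that domain.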
But the fix is domain-general: the LLM can place negatives on any over-seg it sees. What decides deployability is the gate: whether a GT-free signal can rank candidate masks by true Dice. We measured that gate-Dice correlation on all four domains:
The loop deploys iff the GT-free gate is valid, which is a property of the domain, not the loop. CXR (+0.63, lung darkness+atlas) and Spleen (+0.56, soft-tissue) have valid gates and the loop deploys; Kvasir (+0.04: polyp≈mucosa, no signal) and ISIC (−0.23: the gate prefers over-grown high-contrast masks) do not, so no amount of good feedback can be deployed. 2 of 4 domains.
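A gate-validity check of this kind can be sketched as follows. It is run offline on labelled data (the gate itself stays GT-free at inference). Two assumptions are worth flagging: the report does not say whether the correlation is Pearson or rank-based (Pearson is used here), and the cut-off `min_r` is a hypothetical threshold chosen only so that it reproduces the 2-of-4 split reported above.

```python
import numpy as np


def dice(pred, gt):
    """Dice overlap between two binary masks (1.0 when both are empty)."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    denom = pred.sum() + gt.sum()
    return float(2.0 * (pred & gt).sum() / denom) if denom else 1.0


def gate_dice_correlation(gate_scores, dice_scores):
    """Pearson correlation between GT-free gate scores and true Dice of candidate masks."""
    return float(np.corrcoef(gate_scores, dice_scores)[0, 1])


def loop_deploys(r, min_r=0.5):
    """Treat the gate as valid when its correlation with Dice clears `min_r` (assumed)."""
    return r >= min_r


# Correlations as reported in this section.
MEASURED = {"CXR": 0.63, "Spleen": 0.56, "Kvasir": 0.04, "ISIC": -0.23}

if __name__ == "__main__":
    for domain, r in MEASURED.items():
        print(f"{domain}: r={r:+.2f} deploy={loop_deploys(r)}")
```

A negative correlation, as on ISIC, is worse than no signal: a keep-best loop driven by that gate would actively select the lower-Dice candidates.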
The core lesson. Iteration is not the lever; perception, or its GT-free proxy, is. The same wall shows up as the SIZE law, the structural-ceiling taxonomy, and here as gate-validity: a feedback loop is only as deployable as the GT-free mask-quality signal in that domain. Where the target is cheaply separable (dark lung air, soft-tissue organ edges) a gated fix pays off; where it is not (polyp on mucosa) neither a loop nor a single fix can deploy without ground truth.
Honest-reporting note: earlier drafts carried optimistic hardcoded values (ISIC 0.885, Kvasir 0.860, CXR 0.855). This report uses numbers re-measured from scratch with the current pipeline: ISIC 0.880, Spleen 0.920, Kvasir 0.659, CXR 0.922.