Training-Free Medical Image Segmentation vs Trained Networks
A multi-domain data-efficiency study — frozen LLM polygon traces + a numpy genome + a gated SAM decoder, with zero gradient training.
4 modalities | measured, from scratch | GT-free at inference | LLM-agnostic skills | updated 2026-07-05
Results — best trained NN across N vs the LLM pipeline
Strongest trained NN per cell — max of SegFormer (mit_b2), pretrained U-Net (resnet34), and nnU-Net, 2-seed means — at matched budgets N = 10 / 25 / 50 / 100, against the LLM pipeline (re-measured from scratch via eval_all_domains.py). The LLM's only label-fit module is a ~5–8-param genome; localization and SAM are GT-free/frozen — so the LLM number is ~flat in N while the NN climbs. Bold = LLM ahead of the best NN at that budget.
| Domain | best NN N=10 | N=25 | N=50 | N=100 | training-free LLM | few-shot decoder N=100 (oracle) |
|---|---|---|---|---|---|---|
| ISIC 2018 (derm) | 0.815 | 0.829 | 0.832 | 0.868 | 0.882 | 0.911 |
| MSD Spleen (CT) | 0.633 | 0.802 | 0.859 | 0.907 | 0.920 | 0.953 |
| Kvasir-SEG (endo) | 0.756 | 0.824 | 0.833 | 0.889 | 0.824 | 0.944 |
| ChestXray (X-ray) | 0.938 | 0.948 | 0.952 | 0.953 | 0.922 | 0.956 |
Two LLM columns: the training-free pipeline (zero gradient training — the study's headline) and the few-shot tuned decoder (generic-SAM init, encoder frozen, ~4M-param mask decoder tuned on N=100 labels, GT-box oracle). The few-shot decoder beats the from-scratch NN on all four domains given a good box — the residual to it is localization, detailed below.
The LLM pipeline wins the scarce-label regime wherever it has a cue. On ISIC it leads through N=100 (the best NN needs N=189 to reach 0.887). On Spleen it leads through N=100 (0.920 vs best-NN 0.907), via GT-free cross-slice + pyramid priors. On Kvasir polyp, a frozen-DINOv2 dense-correspondence localizer (OP-SAM CPG + peak-CC box) lifts it 0.659→0.824. The gap there was localization (the frozen GT-box oracle, 0.914, already exceeds the NN), and the localizer closes 72% of it: the median 0.924 now beats the NN mean, though the mean (0.824) stays below the best-NN 0.889. On Chest X-ray, a static lung box (N=10), a self-reprompt, a repair agent, a catastrophic fix, and a letterbox trim lift the LLM to 0.922 (the gap was localization, not architecture), but box-prompted SAM still caps at ~0.93 even with a GT box, below the NN's 0.953. SegFormer is the strongest NN in 10 of 12 sweep cells.
Data-efficiency
Each panel: the best trained NN (gray, climbing with N) vs the training-free LLM pipeline (colored, flat, dashed: a single measured Dice that is ~N-independent, since only a ≤40-label genome is fit). Shading marks where the LLM leads; the bold label is its margin at N=100. The LLM leads the whole curve on ISIC (0.882, +0.014) and Spleen (0.920, +0.013), leads at N=10 and ties at N=25 on Kvasir before the NN pulls ahead (0.824, the dense-correspondence localizer), and trails at every budget on Chest X-ray (0.922, −0.031: the box-SAM decoder wall).
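The margins and crossover points quoted in this caption can be recomputed from the results table. The following is a minimal sketch; the dictionary and function names are illustrative and are not taken from eval_all_domains.py.

```python
# Best-NN Dice per label budget and the flat LLM Dice, copied from the results table.
BUDGETS = [10, 25, 50, 100]
BEST_NN = {
    "ISIC 2018": [0.815, 0.829, 0.832, 0.868],
    "MSD Spleen": [0.633, 0.802, 0.859, 0.907],
    "Kvasir-SEG": [0.756, 0.824, 0.833, 0.889],
    "ChestXray": [0.938, 0.948, 0.952, 0.953],
}
LLM = {"ISIC 2018": 0.882, "MSD Spleen": 0.920, "Kvasir-SEG": 0.824, "ChestXray": 0.922}


def summarize(domain):
    """Return (LLM margin at N=100, first budget where the NN strictly leads, or None)."""
    nn, llm = BEST_NN[domain], LLM[domain]
    margin = round(llm - nn[-1], 3)
    overtakes = next((n for n, d in zip(BUDGETS, nn) if d > llm), None)
    return margin, overtakes


for name in BEST_NN:
    print(name, summarize(name))
```

A result of `None` in the second slot means the NN never strictly leads within the measured budgets.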
Where is the wall — the decoder, or localization? A fair few-shot probe
Is the residual to the trained NN a decoder limit or a localization limit? MedSAM (SAM fine-tuned on ~1M medical masks) beats box-prompted SAM — but it is data-leaky (its training set likely includes these test distributions), so it is really "import a trained medical net." The fair, leak-free counterpart of the same operation: init from the generic SAM (no medical data), freeze the image + prompt encoders, and fine-tune only the ~4M-param mask decoder on our own N-label train split — GT never touches test. Given a good box (the GT-box oracle, which isolates the decoder), this few-shot-tuned decoder matches or beats the from-scratch NN at every budget on 3 of 4 domains, and it saturates at N=10 (pretrained features + a tiny adapted head), so its curve is flat-high while the NN climbs:
| Domain (tuned decoder, oracle) | N=10 | N=25 | N=50 | N=100 | best NN N=100 |
|---|---|---|---|---|---|
| ISIC 2018 (derm) | 0.905 | 0.904 | 0.907 | 0.911 | 0.868 |
| MSD Spleen (CT) | 0.957 | 0.960 | 0.960 | 0.953 | 0.907 |
| Kvasir-SEG (endo) | 0.936 | 0.934 | 0.937 | 0.944 | 0.889 |
| ChestXray (X-ray) | 0.933 | 0.946 | 0.955 | 0.956 | 0.953 |
Few-shot tuned decoder (GT-box oracle, ◆) vs best trained NN (gray, from scratch) vs the training-free LLM pipeline (dotted, flat). The tuned decoder saturates at N=10 and sits at/above the NN at every budget on ISIC/Spleen/Kvasir; CXR ties by N=50. ★ = the measured deployable number (tuned decoder + the domain's real GT-free localizer).
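The fair few-shot probe described above (generic SAM init, image and prompt encoders frozen, only the mask decoder tuned) comes down to choosing which parameters receive gradients. The sketch below shows only that selection step in plain Python. The parameter names are illustrative examples following the `image_encoder` / `prompt_encoder` / `mask_decoder` module naming, and `select_trainable` is a hypothetical helper, not the project's code.

```python
def select_trainable(param_names, train_prefixes=("mask_decoder.",)):
    """Map each parameter name to True (fine-tune) or False (keep frozen)."""
    prefixes = tuple(train_prefixes)
    return {name: name.startswith(prefixes) for name in param_names}


# Illustrative parameter names only; a real checkpoint has many more.
names = [
    "image_encoder.blocks.0.attn.qkv.weight",
    "prompt_encoder.no_mask_embed.weight",
    "mask_decoder.iou_token.weight",
]
flags = select_trainable(names)
# In PyTorch, such flags would typically be applied by setting
# param.requires_grad = flags[name] while iterating model.named_parameters().
print(flags)
```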
The decoder was never the wall; localization is. The oracle hands the decoder a good box, so the gap between the oracle and the deployed pipeline is localization. Kvasir is the proof: tuned-decoder oracle 0.944 vs the original deployed 0.659 is 0.285 of pure localization headroom and zero decoder headroom. The frozen dense-correspondence localizer has since realized most of it (0.659→0.824), exactly as this predicted, leaving only ~0.12 to the oracle box. Where localization is already solved, the tuned decoder is a clean deployable win over production and matches or beats the NN: CXR (static box + self-reprompt) 0.950 > production 0.922, and ≈ NN 0.953; Spleen (3D volume-prior box) 0.942 > production 0.920 > NN 0.907. This is few-shot (N=100 labels = the NN's budget), not training-free. It is an explicit, leak-free counterpart to MedSAM, and it says the training-free pipeline's remaining gap is a localizer problem, not a decoder one. So the residual "structural / architectural ceilings" (the polyp boundary, the CXR decoder) were localization gaps in disguise.
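The headroom arithmetic in this paragraph is simple enough to check directly. A minimal sketch, using only numbers reported in this document; the function names are illustrative.

```python
def localization_headroom(oracle, deployed):
    """Dice the deployed pipeline could still gain if its box matched the oracle box."""
    return oracle - deployed


def fraction_closed(before, after, target):
    """Share of the gap from `before` to `target` that moving to `after` has closed."""
    return (after - before) / (target - before)


# Kvasir: tuned-decoder oracle 0.944, deployed 0.659 before and 0.824 after the localizer.
print(round(localization_headroom(0.944, 0.659), 3))  # headroom before the localizer
print(round(localization_headroom(0.944, 0.824), 3))  # headroom that remains
# Gap to the best NN at N=100 (0.889) closed by the localizer.
print(round(fraction_closed(0.659, 0.824, 0.889), 2))
```

The last value corresponds to the 72% closure quoted for Kvasir elsewhere in this report.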
Is the medical pretraining (MedSAM) worth it? — a leaky benchmark
MedSAM is SAM already fine-tuned on ~1M medical masks — a leaky upper reference (its training likely saw these test distributions). Arch-matched (both vit_b, GT-box oracle, N=100), few-shot tuning of the GENERIC SAM matches or beats the leaky MedSAM on 3 of 4 domains: medical pretraining buys no consistent edge once the decoder is adapted, and where generic SAM is already strong (Kvasir/Spleen) it starts above MedSAM even frozen. Bold = better of the two few-shot columns.
| Domain (N=100 oracle, vit_b) | from-scratch NN | generic frozen | generic few-shot (fair) | MedSAM frozen (leaky) | MedSAM few-shot (leaky) |
|---|---|---|---|---|---|
| ISIC 2018 (derm) | 0.868 | 0.870 | 0.915 | 0.921 | 0.943 |
| MSD Spleen (CT) | 0.907 | 0.941 | 0.956 | 0.909 | 0.946 |
| Kvasir-SEG (endo) | 0.889 | 0.917 | 0.936 | 0.888 | 0.918 |
| ChestXray (X-ray) | 0.953 | 0.925 | 0.954 | 0.935 | 0.952 |
Fair few-shot tuning of generic SAM (dark blue) vs the leaky MedSAM (orange), each frozen + few-shot-tuned, vs the from-scratch NN (gray) — arch-matched vit_b, N=100 GT-box oracle. The fair generic few-shot ≥ leaky MedSAM few-shot on Spleen/Kvasir/ChestXray; MedSAM only leads on ISIC. Medical pretraining is not the lever — decoder adaptation is.
Can it be ONE domain-agnostic pipeline? — few-shot skills + a router, SAM frozen
Each domain has a distinct setup — so can it be a single domain-agnostic pipeline that few-shot-fits per domain WITHOUT tuning SAM? We test the transferable skill — a frozen-SAM-feature localizer prototype fit from N labeled images — with identical code across all four domains, SAM never touched.
(A) One fixed pipeline (prototype-sim localizer): the skill few-shot-fits, saturating by ~N=10 on every domain, and scores high where the target is prototype-localizable (ISIC 0.71, CXR 0.62), low where it needs a specialized localizer (Kvasir 0.35, Spleen 0.12). (B) Routing the per-domain topology (genome / static-box / AMG-rank / volume-prior) + the few-shot skill recovers most of the deployed number (Kvasir 0.35→0.53, Spleen 0.14→0.66→0.83@N=50).
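One way to read the prototype-sim localizer in (A) is as masked average pooling over frozen features followed by a cosine-similarity map. The sketch below assumes features are already extracted as patch grids and that masks are resized to the same grid. All function names and the 0.5 threshold are hypothetical choices, not the project's implementation.

```python
import numpy as np


def fit_prototype(feats, masks):
    """Mean foreground feature over N support images.
    feats: (N, H, W, C) frozen features; masks: (N, H, W) boolean foreground."""
    proto = feats[masks].mean(axis=0)
    return proto / (np.linalg.norm(proto) + 1e-8)


def similarity_map(query_feats, proto):
    """Cosine similarity of every query patch to the prototype. query_feats: (H, W, C)."""
    norms = np.linalg.norm(query_feats, axis=-1, keepdims=True) + 1e-8
    return (query_feats / norms) @ proto


def box_from_similarity(sim, thresh=0.5):
    """Tight inclusive box (x0, y0, x1, y1) in patch coordinates, or None if nothing passes."""
    ys, xs = np.nonzero(sim > thresh)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

Because a single mean vector summarizes the target, this kind of localizer would be expected to struggle where foreground and background features overlap, which matches the low Kvasir and Spleen scores reported in (A).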
Yes — a domain-agnostic pipeline = frozen SAM + a GT-free router + few-shot skills. The learning (localizer prototype + gate weights + genome) is domain-agnostic, few-shot (~10 labels; ~50 for the isodense spleen), and SAM-frozen — it transfers with identical code. The per-domain "distinct setup" you'd notice is which localizer topology to route to — a GT-free routing choice (learned config, the project's route.py), not SAM tuning. So onboarding a new domain is ~10 labels + a routing decision, zero gradient training. Robust localizer skill: ranking AMG candidates by a SAM + DINOv2 feature ensemble (both frozen, 10-shot prototypes) is the safe domain-agnostic default — it is at/near the best single feature space on all three (Kvasir 0.65, Spleen 0.64, ISIC 0.67) and never catastrophically fails, where SAM-alone fails the isodense spleen (0.40) and DINOv2-alone is weakest on ISIC. On Kvasir it was the one lever (of five tried) that beat the wall — the polyp/fold top-1 gap yields to a second frozen feature space, not to distractor-verification tricks.
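The feature-ensemble ranking of AMG candidates can be sketched as averaging, across feature spaces, the cosine similarity between each candidate's pooled feature and that space's few-shot prototype. This is an assumed formulation: the report does not specify how the two spaces are combined, and `rank_candidates` and its arguments are hypothetical.

```python
import numpy as np


def pooled(feats, mask):
    """Unit-norm mean feature inside a (non-empty) candidate mask. feats: (H, W, C)."""
    v = feats[mask].mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)


def rank_candidates(cand_masks, feat_spaces, prototypes):
    """Rank candidate masks by their mean prototype similarity across feature spaces.
    feat_spaces: name -> (H, W, C) features; prototypes: name -> (C,) unit vector."""
    scores = []
    for m in cand_masks:
        sims = [float(pooled(feat_spaces[k], m) @ prototypes[k]) for k in feat_spaces]
        scores.append(sum(sims) / len(sims))
    order = sorted(range(len(cand_masks)), key=lambda i: scores[i], reverse=True)
    return order, scores
```

Averaging means a candidate must look plausible in both spaces to rank first, which is one plausible reason an ensemble avoids the single-space failures described above.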
Qualitative examples — 3 best / 3 worst per dataset
Green = ground truth, red = prediction. Top row = 3 highest-Dice cases, bottom row = 3 lowest.
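For reference, the Dice score used to pick these cases, and the best/worst selection itself, can be sketched as follows. Scoring two empty masks as 1.0 is an assumed convention, and the helper names are illustrative.

```python
import numpy as np


def dice(pred, gt):
    """Dice overlap of two binary masks; two empty masks count as a perfect match (assumption)."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0
    return 2.0 * np.logical_and(pred, gt).sum() / denom


def best_and_worst(scores, k=3):
    """Return (k highest-Dice case ids, k lowest-Dice case ids, worst first)."""
    order = sorted(scores, key=scores.get, reverse=True)
    return order[:k], order[-k:][::-1]
```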
ISIC 2018 — skin lesion (pigment cue, LLM's best domain)
MSD Spleen (CT)
The worst cases are localizer failures (the prediction lands on stomach/liver, Dice≈0). A stack of GT-free skills recovers them: a cross-slice prior (centroid trajectory + sibling-mask seed), a selective multi-prototype re-decode, a SAM re-prompt boundary polish, and a final pyramid keep-best (crop→zoom→SAM-decode each tiny tip slice at higher resolution). Together they lift the spleen pipeline to 0.920 (0.786→0.840→0.868→0.875→0.903→0.920 this session), now beating the best trained NN at N=100 (0.920 vs 0.907). Pyramid is complementary to the within-volume prior: the prior re-decodes a tip slice from its siblings, while pyramid re-decodes it at higher resolution.
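The pyramid keep-best step (crop, zoom, re-decode, keep only if a GT-free gate improves) can be sketched as below. The `decode` and `gate` callables are placeholders for the SAM decode and the soft-tissue gate, which this sketch does not implement. Nearest-neighbour zoom and block-vote downsampling are assumptions chosen to keep the example dependency-free.

```python
import numpy as np


def pyramid_keep_best(image, mask, box, decode, gate, zoom=2):
    """Re-decode a crop at higher resolution and keep it only if the gate score improves.
    box = (x0, y0, x1, y1) with exclusive ends; decode(img) -> bool mask of img's shape;
    gate(image, mask) -> float, higher is better. Both callables are hypothetical."""
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]
    # Nearest-neighbour upsample of the crop.
    zoomed = np.repeat(np.repeat(crop, zoom, axis=0), zoom, axis=1)
    hi = decode(zoomed)
    # Downsample the high-resolution mask by majority vote within each zoom x zoom block.
    lo = hi.reshape(y1 - y0, zoom, x1 - x0, zoom).mean(axis=(1, 3)) >= 0.5
    candidate = mask.copy()
    candidate[y0:y1, x0:x1] = lo
    return candidate if gate(image, candidate) > gate(image, mask) else mask
```

The strict comparison means the original mask survives ties, which is what makes a keep-best step zero-regression with respect to the gate.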
Kvasir-SEG — polyp (dense-correspondence localizer; the gap was localization, 72% closed)
Polyps share colour/texture with mucosa, so a mean-prototype / AMG-candidate ranker mislocalizes (~43/100 to Dice≈0), and four verification signals (including SAM self-consistency, FG-BG margin, and cycle-consistency) all failed to rank the right blob. The unlock, grounded in the polyp-SAM literature (OP-SAM, ICCV'25, frozen Dice 0.845 on this exact dataset), is frozen-DINOv2 dense-correspondence label transfer (CPG): transfer 10 few-shot support masks onto the query via the full patch cross-correlation matrix, not a mean prototype (which washes out on polyp≈mucosa). The prior peak lands in the polyp 94% of the time (vs the old LLM box's 0.46 box-IoU); a peak-connected-component box + prior-guided SAM mask selection then decodes it. Deployed 0.659 → 0.824 (median 0.924, above the NN mean 0.889), leak-free, frozen DINOv2+SAM, no tuning. The frozen GT-box oracle is 0.914 > NN, so Kvasir never had a decoder wall: its entire gap was localization, now 72% closed (was −0.230 to the NN, now −0.065). Gallery shows the earlier LLM-box path; the correspondence localizer replaces the bottom-row localization misses.
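The contrast with a mean prototype is easier to see in code. The sketch below transfers support labels through the full query-by-support patch correlation matrix, then boxes the connected component around the prior's peak. It is a simplified stand-in, not OP-SAM's CPG: the top-k label averaging, the relative threshold, and all names are assumptions.

```python
import numpy as np


def transfer_prior(query, supports, support_labels, top_k=5):
    """query: (P, C) query patch features; supports: (S, C) patches from all support images;
    support_labels: (S,) 0/1 mask labels. Returns a (P,) foreground prior in [0, 1]."""
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
    s = supports / (np.linalg.norm(supports, axis=1, keepdims=True) + 1e-8)
    corr = q @ s.T                                # full (P, S) patch cross-correlation
    idx = np.argsort(-corr, axis=1)[:, :top_k]    # best-matching support patches per query patch
    return support_labels[idx].mean(axis=1)


def peak_component_box(prior, rel_thresh=0.5):
    """Inclusive box (x0, y0, x1, y1) around the 4-connected component containing the peak."""
    H, W = prior.shape
    start = np.unravel_index(np.argmax(prior), prior.shape)
    keep = prior >= rel_thresh * prior[start]
    seen = np.zeros_like(keep)
    seen[start] = True
    stack, ys, xs = [start], [], []
    while stack:
        y, x = stack.pop()
        ys.append(y)
        xs.append(x)
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < H and 0 <= nx < W and keep[ny, nx] and not seen[ny, nx]:
                seen[ny, nx] = True
                stack.append((ny, nx))
    return int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))
```

Each query patch is matched to individual support patches, so a minority of polyp-like support patches can still vote, whereas a mean prototype averages them away.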
ChestXray (X-ray)
Five GT-free skills:
(1) A fixed few-shot lung box (mean of N=10 train GT boxes; box-IoU 0.545→0.67) replaces the heuristic box: 0.825→0.898.
(2) A mask-bbox self-reprompt (box-IoU 0.67→0.83): 0.898→0.911.
(3) A repair agent that trims SAM's bright "out of lung" over-segmentation into the mediastinum (a decoder error a redraw loop can't reach), fire-gated + keep-best: 0.911→0.914, zero-regression.
(4) A catastrophic fix: a SAM negative-point re-prompt (positive points on the dark lung, negative on the spill) that carves the worst engulfing masks the conservative trim can't: 0.914→0.920.
(5) A letterbox-component trim that drops mask blobs sitting over near-black padding. On letterboxed films the misregistered atlas and darkness both score the black border as lung; only raw intensity separates it, since lung air is dark-gray, not pure black. It fixes MCUCXR_0060 0.76→0.88 with zero regression: 0.920→0.922.
The remaining worst cases (bottom row) are a different, harder failure: inferior over-extension into the abdomen, a decoder-boundary error. CXR's gap was box localization, not architecture, but a perfect GT box still caps SAM at ~0.93 < the U-Net's 0.953, so it stays capped at every budget; the residual is the frozen decoder.
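Skills (1) and (2) above are easy to sketch: a static box is the coordinate-wise mean of the training boxes, and the self-reprompt feeds the decoded mask's own bounding box back as the next prompt until it stops changing. The `decode` callable stands in for a box-prompted SAM call and is hypothetical, as are the function names.

```python
import numpy as np


def mean_box(boxes):
    """Coordinate-wise mean of (x0, y0, x1, y1) training boxes."""
    return tuple(np.mean(np.asarray(boxes, dtype=float), axis=0))


def box_iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mask_bbox(mask):
    """Tight box around a non-empty boolean mask, with exclusive ends."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1


def self_reprompt(decode, box, max_iters=5):
    """Re-prompt with the mask's own bbox until the box reaches a fixed point.
    decode(box) -> bool mask is a placeholder for a box-prompted decoder."""
    mask = decode(box)
    for _ in range(max_iters):
        if not mask.any():
            break
        new_box = mask_bbox(mask)
        if new_box == tuple(box):
            break
        box, mask = new_box, decode(new_box)
    return box, mask
```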
What we learned
ISIC: the LLM pipeline beats nnU-Net at every budget (+0.116 at N=10 → +0.025 at N=100) — genuinely more label-efficient for dermoscopy.
Spleen: a stack of GT-free skills — within-volume prior, cross-slice geometry, selective multi-prototype, a SAM re-prompt boundary polish, and a pyramid keep-best (higher-resolution re-decode of the tiny tip slices) — lifts 0.786 → 0.920, now beating the best trained NN at N=100 (0.920 vs 0.907; CV held-out). The pyramid and the volume prior see the tip slices two different ways — resolution vs siblings — and combine additively via a valid soft-tissue gate.
Kvasir: a frozen-DINOv2 dense-correspondence localizer (OP-SAM CPG label transfer + peak-connected-component box) lifts the polyp 0.659→0.824 — the biggest single gain in the study, closing 72% of the −0.230 gap to the NN. The gap was pure localization (the frozen GT-box→SAM oracle is 0.914 > NN 0.889, so there is no decoder wall); the unlock is transferring few-shot support masks by dense correspondence, not a mean prototype (which washes out on polyp≈mucosa, and defeated four verification signals). The prior peak lands in the polyp 94% of the time; median 0.924 now beats the NN mean, the mean stays just below best-NN 0.889 (residual = a tail of hard localizations). Grounded in the domain's published frozen-SAM SOTA (OP-SAM, ICCV'25), leak-free, no tuning.
Chest X-ray: a few-shot static lung box + a mask-bbox self-reprompt + a repair agent (trims out-of-lung over-seg) + a catastrophic fix (SAM negative-point re-prompt on the worst spilled masks) + a letterbox trim (drops blobs over near-black padding) lift the LLM 0.825→0.898→0.911→0.914→0.920→0.922 — the gap was box localization, not architecture. But box-prompted frozen SAM stays architecturally capped — even a perfect GT box reaches only ~0.93 < the trained U-Net's 0.953, so CXR loses at every budget; the residual is the frozen decoder, not localization.
When does iterative feedback deploy? — the fix is general, the gate is the wall
We built a full repair-agent → verbal + visual feedback → SAM re-prompt → keep-best loop: the repair stage renders a diagnosis overlay (current mask + a GT-free "out-of-region" heat-map + a coordinate grid), the vision LLM looks, says what is wrong verbally, and marks the over-segmentation as SAM negative points; SAM re-decodes; a GT-free keep-best gate accepts or rejects; repeat. On chest X-ray the LLM's visual negatives beat the heuristic catastrophic fix (CHNCXR_0447 0.849→0.902, MCUCXR_0055 0.860→0.955).
The cases the GT-free gate accepted (top = feed-forward catfix, bottom = the loop). The LLM's visual negatives tighten SAM off the mediastinum/abdomen better than the spill heuristic. Full-107: 0.9216→0.9240 (+0.0024), zero-regression, gate near-perfect (oracle 0.9241).
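The control flow of the feedback loop above reduces to propose, score with the GT-free gate, and keep only strict improvements. A minimal sketch follows, with `propose` standing in for the LLM feedback + SAM re-prompt step and `gate` for the GT-free quality signal; both are hypothetical callables.

```python
def keep_best_loop(mask, propose, gate, max_rounds=3):
    """Iteratively propose a repaired mask and keep it only if the gate score improves.
    Returns the best mask and the history of best gate scores (non-decreasing)."""
    best, best_score = mask, gate(mask)
    history = [best_score]
    for _ in range(max_rounds):
        candidate = propose(best)
        if candidate is None:  # the proposer has nothing more to fix
            break
        score = gate(candidate)
        if score > best_score:
            best, best_score = candidate, score
        history.append(best_score)
    return best, history
```

The loop can never lower the gate score, so it is zero-regression in true Dice only to the extent that the gate tracks Dice in that domain.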
But the fix is domain-general — the LLM can place negatives on any over-seg it sees. What decides deployability is the gate: whether a GT-free signal can rank candidate masks by true Dice. We measured that gate↔Dice correlation on all four domains:
The loop deploys iff the GT-free gate is valid, which is a property of the domain, not the loop. CXR (+0.63, lung darkness+atlas) and Spleen (+0.56, soft-tissue) have valid gates and the loop deploys; Kvasir (+0.04: polyp≈mucosa, no signal) and ISIC (−0.23: the gate prefers over-grown high-contrast masks) do not, so no amount of good feedback can be deployed there. The loop therefore deploys on 2 of 4 domains.
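Gate validity as described here can be measured by correlating GT-free gate scores with true Dice over candidate masks on a labelled validation set. The sketch below uses Pearson correlation and a 0.5 cut-off. Both are assumptions: the report does not state the correlation type or a threshold, and the function names are illustrative.

```python
import numpy as np


def gate_validity(gate_scores, dice_scores):
    """Pearson correlation between GT-free gate scores and true Dice (0.0 if either is constant)."""
    g = np.asarray(gate_scores, dtype=float)
    d = np.asarray(dice_scores, dtype=float)
    if g.std() == 0 or d.std() == 0:
        return 0.0
    return float(np.corrcoef(g, d)[0, 1])


def valid_gate_domains(correlations, min_r=0.5):
    """Domains whose gate correlation clears the (hypothetical) deployment threshold."""
    return sorted(d for d, r in correlations.items() if r >= min_r)


# Correlations reported in this section.
MEASURED = {"ChestXray": 0.63, "MSD Spleen": 0.56, "Kvasir-SEG": 0.04, "ISIC 2018": -0.23}
print(valid_gate_domains(MEASURED))
```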
The core lesson. Iteration is not the lever — perception, or its GT-free proxy, is. The same wall shows up as the SIZE law, the structural-ceiling taxonomy, and here as gate-validity: a feedback loop is only as deployable as the GT-free mask-quality signal in that domain. Where the target is cheaply separable (dark lung air, soft-tissue organ edges) a gated fix pays off; where it is not (polyp on mucosa) neither a loop nor a single fix can deploy without ground truth.
Honest-reporting note: earlier drafts carried stale hardcoded values (ISIC 0.885, Kvasir 0.860, CXR 0.855), optimistic for ISIC and Kvasir. This report uses numbers re-measured from scratch with the current pipeline: ISIC 0.882, Spleen 0.920, Kvasir 0.824, CXR 0.922. (Kvasir was 0.659 before the dense-correspondence localizer.)
Pipeline
One image, end to end. A GT-free router selects skills per image; the drawer seam (amber) is the only model-coupled point — one config-driven dispatch_localizer picks per domain among an LLM polygon drawer, a frozen SAM-feature prototype, a frozen-DINOv2 dense-correspondence (CPG) localizer, or a static box — and everything downstream is numpy/config. Green boxes (🔄) are iterative loops: SAM fixed-point re-prompts and the spleen two-pass within-volume prior.
The deployed pipeline is feed-forward, with SAM fixed-point re-prompt loops and a gated repair stage. The four drawer backends now unify behind one config-driven localizer dispatch (Kvasir uses the frozen-DINOv2 CPG dense-correspondence localizer). The red box lists iterative-feedback ideas whose fixes work but which could not be deployed on no-cue targets for lack of a valid GT-free gate (crop-zoom, verbal judge→redraw, keep-best fix-loop, catastrophic negative-point re-prompt). The recurring lesson: the fix is rarely the bottleneck; the GT-free gate is.
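A config-driven dispatch of this kind is commonly written as a registry lookup. The sketch below is an assumed shape for `dispatch_localizer`, not the project's code: the backend keys and stub functions are hypothetical, and the config lists only the two domain assignments this report states explicitly (Kvasir to the CPG localizer, chest X-ray to the static box).

```python
def _stub(name):
    """Build a placeholder backend; a real one would return a box or polygon for the image."""
    def run(image):
        raise NotImplementedError(f"{name} backend is not implemented in this sketch")
    run.__name__ = name
    return run


# The four drawer backends named in this section (keys are hypothetical).
REGISTRY = {
    "llm_polygon": _stub("llm_polygon"),
    "sam_prototype": _stub("sam_prototype"),
    "dino_cpg": _stub("dino_cpg"),
    "static_box": _stub("static_box"),
}

# Only assignments stated in the report; other domains are left unspecified here.
CONFIG = {"kvasir": "dino_cpg", "cxr": "static_box"}


def dispatch_localizer(domain, config=CONFIG, registry=REGISTRY):
    """Return the localizer backend configured for a domain."""
    if domain not in config:
        raise KeyError(f"no localizer configured for domain {domain!r}")
    return registry[config[domain]]
```

Keeping the choice in config rather than code is what lets everything downstream of the drawer seam stay model-agnostic.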