Training-Free Medical Image Segmentation vs Trained Networks

A multi-domain data-efficiency study: frozen LLM polygon traces + a numpy genome + a gated SAM decoder, with zero gradient training.

4 modalities · measured from scratch · GT-free at inference · LLM-agnostic skills · updated 2026-07-03

Pipeline

One image, end to end. A GT-free router selects skills per image; the drawer seam (amber) is the only model-coupled point (an LLM polygon drawer, a frozen SAM-feature prototype localizer, an AMG ranker, or a static box, per domain) and everything downstream is numpy/config. Green boxes (🔄) are iterative loops: SAM fixed-point re-prompts and the spleen two-pass within-volume prior.

Figure: agenticSeg pipeline flow chart.
The deployed pipeline is feed-forward with SAM fixed-point re-prompt loops + a gated repair stage. The red box lists iterative-feedback ideas whose fixes work but which could not be deployed on no-cue targets because no valid GT-free gate exists for them (crop-zoom, verbal judge→redraw, keep-best fix-loop, catastrophic negative-point re-prompt). The recurring lesson: the fix is rarely the bottleneck, the GT-free gate is.

Results: best trained NN across N vs the LLM pipeline

Strongest trained NN per cell, i.e. the max of SegFormer (mit_b2), pretrained U-Net (resnet34), and nnU-Net as 2-seed means, at matched budgets N = 10 / 25 / 50 / 100, against the LLM pipeline (re-measured from scratch via eval_all_domains.py). The LLM's only label-fit module is a ~5–8-param genome; localization and SAM are GT-free/frozen, so the LLM number is ~flat in N while the NN climbs. Bold = LLM ahead of the best NN at that budget.

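| Domain | best NN N=10 | best NN N=25 | best NN N=50 | best NN N=100 | training-free LLM | few-shot decoder N=100 (oracle) |
|---|---|---|---|---|---|---|
| ISIC 2018 (derm) | 0.815 | 0.829 | 0.832 | 0.868 | 0.880 | 0.911 |
| MSD Spleen (CT) | 0.633 | 0.802 | 0.859 | 0.907 | 0.920 | 0.953 |
| Kvasir-SEG (endo) | 0.756 | 0.824 | 0.833 | 0.889 | 0.659 | 0.944 |
| ChestXray (X-ray) | 0.938 | 0.948 | 0.952 | 0.953 | 0.922 | 0.956 |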
Two LLM columns: the training-free pipeline (zero gradient training; the study's headline) and the few-shot tuned decoder (generic-SAM init, encoder frozen, ~4M-param mask decoder tuned on N=100 labels, GT-box oracle). The few-shot decoder beats the from-scratch NN on all four domains given a good box; the residual to it is localization, detailed below.
The LLM pipeline wins the scarce-label regime where it has a cue: ISIC through N=100 (the best NN needs N=189 to reach 0.887) and Spleen through N=100 (0.920 vs best-NN 0.907, via GT-free cross-slice + pyramid priors). It loses at every budget on Kvasir polyp and Chest X-ray. On Kvasir, an LLM-box localizer + an AMG margin keep-best lift it to 0.659 (a localization gap, not a boundary ceiling), still below best-NN 0.889. On Chest X-ray, a static lung box (N=10) + a self-reprompt + a repair agent + a catastrophic fix + a letterbox trim lift the LLM to 0.922 (localization, not architecture, was the gap), but box-prompted SAM still caps at a GT-box ~0.93 < 0.953. SegFormer is the strongest NN in 10 of 12 sweep cells.
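The win/loss pattern can be checked directly from the table. The short sketch below copies the table's Dice scores into plain dictionaries (the variable and function names are illustrative, not part of the project's code) and lists the budgets at which the training-free pipeline is ahead of the best NN.

```python
# Dice scores copied from the results table: best trained NN per budget N,
# and the (N-independent) training-free LLM pipeline score per domain.
BEST_NN = {
    "ISIC 2018": {10: 0.815, 25: 0.829, 50: 0.832, 100: 0.868},
    "MSD Spleen": {10: 0.633, 25: 0.802, 50: 0.859, 100: 0.907},
    "Kvasir-SEG": {10: 0.756, 25: 0.824, 50: 0.833, 100: 0.889},
    "ChestXray": {10: 0.938, 25: 0.948, 50: 0.952, 100: 0.953},
}
TRAINING_FREE_LLM = {
    "ISIC 2018": 0.880,
    "MSD Spleen": 0.920,
    "Kvasir-SEG": 0.659,
    "ChestXray": 0.922,
}


def budgets_where_llm_leads(domain):
    """Return the budgets N at which the LLM pipeline beats the best NN."""
    llm = TRAINING_FREE_LLM[domain]
    return [n for n, nn in sorted(BEST_NN[domain].items()) if llm > nn]


for domain in BEST_NN:
    print(domain, budgets_where_llm_leads(domain))
```

On these numbers the pipeline leads at all four budgets on ISIC and Spleen and at none on Kvasir and Chest X-ray, matching the summary above.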

Data-efficiency

Figure: matched data-efficiency, LLM pipeline (genome fit per N) vs trained NN.
LLM pipeline (dashed; genome re-fit per N; ~flat) vs trained NN (solid; matched N). Spleen is capped at N=25 (30 training draws) and Kvasir at N=50 (91 draws).

Where is the wall: the decoder, or localization? A fair few-shot probe

Is the residual to the trained NN a decoder limit or a localization limit? MedSAM (SAM fine-tuned on ~1M medical masks) beats box-prompted SAM, but it is data-leaky (its training set likely includes these test distributions), so it is really "import a trained medical net." The fair, leak-free counterpart of the same operation: init from the generic SAM (no medical data), freeze the image + prompt encoders, and fine-tune only the ~4M-param mask decoder on our own N-label train split, so GT never touches test. Given a good box (the GT-box oracle, which isolates the decoder), this few-shot-tuned decoder matches or beats the from-scratch NN at every budget on 3 of 4 domains, and it saturates at N=10 (pretrained features + a tiny adapted head), so its curve is flat-high while the NN climbs:

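| Domain (tuned decoder, oracle) | N=10 | N=25 | N=50 | N=100 | best NN N=100 |
|---|---|---|---|---|---|
| ISIC 2018 (derm) | 0.905 | 0.904 | 0.907 | 0.911 | 0.868 |
| MSD Spleen (CT) | 0.957 | 0.960 | 0.960 | 0.953 | 0.907 |
| Kvasir-SEG (endo) | 0.936 | 0.934 | 0.937 | 0.944 | 0.889 |
| ChestXray (X-ray) | 0.933 | 0.946 | 0.955 | 0.956 | 0.953 |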
Figure: few-shot tuned decoder (oracle) vs trained NN vs training-free LLM.
Few-shot tuned decoder (GT-box oracle, ◆) vs best trained NN (gray, from scratch) vs the training-free LLM pipeline (dotted, flat). The tuned decoder saturates at N=10 and sits at/above the NN at every budget on ISIC/Spleen/Kvasir; CXR ties by N=50. ★ = the measured deployable number (tuned decoder + the domain's real GT-free localizer).
The decoder was never the wall; localization is. The oracle hands the decoder a good box; the gap between it and the deployed pipeline is localization. Kvasir is the proof: tuned-decoder oracle 0.944 vs deployed 0.659, i.e. 0.285 of pure localization headroom, zero decoder. Where localization is already solved, the tuned decoder is a clean deployable win over production and is level with or above the NN: CXR (static box + self-reprompt) 0.950 > production 0.922 and ≈ NN 0.953; Spleen (3D volume-prior box) 0.942 > production 0.920 > NN 0.907. This is few-shot (N=100 labels = the NN's budget), not training-free: it is an explicit, leak-free counterpart to MedSAM, and it says the training-free pipeline's remaining gap is a localizer problem, not a decoder one. So the residual "structural / architectural ceilings" (the polyp boundary, the CXR decoder) were localization gaps in disguise.
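The headroom arithmetic is simply oracle minus deployed. A minimal sketch using only the numbers quoted in this section (the dictionary layout and function name are illustrative); note that the Kvasir "deployed" figure is the production pipeline, while the CXR and Spleen figures are the tuned decoder behind each domain's real GT-free localizer:

```python
# (GT-box oracle Dice at N=100, deployed Dice) as quoted in the text.
ORACLE_VS_DEPLOYED = {
    "Kvasir-SEG": (0.944, 0.659),
    "ChestXray": (0.956, 0.950),
    "MSD Spleen": (0.953, 0.942),
}


def localization_gap(oracle, deployed):
    """Dice lost between a perfect box and the deployed localizer."""
    return round(oracle - deployed, 3)


gaps = {d: localization_gap(*pair) for d, pair in ORACLE_VS_DEPLOYED.items()}
for domain, gap in gaps.items():
    print(f"{domain}: {gap:.3f}")
```

The gap is about 0.01 or less where localization is solved (CXR, Spleen) and 0.285 on Kvasir, which is the contrast the paragraph draws.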

Is the medical pretraining (MedSAM) worth it? A leaky benchmark

MedSAM is SAM already fine-tuned on ~1M medical masks, which makes it a leaky upper reference (its training likely saw these test distributions). Arch-matched (both vit_b, GT-box oracle, N=100), few-shot tuning of the GENERIC SAM matches or beats the leaky MedSAM on 3 of 4 domains: medical pretraining buys no consistent edge once the decoder is adapted, and where generic SAM is already strong (Kvasir/Spleen) it starts above MedSAM even frozen. Bold = better of the two few-shot columns.

| Domain (N=100 oracle, vit_b) | from-scratch NN | generic frozen | generic few-shot (fair) | MedSAM frozen (leaky) | MedSAM few-shot (leaky) |
|---|---|---|---|---|---|
| ISIC 2018 (derm) | 0.868 | 0.870 | 0.915 | 0.921 | 0.943 |
| MSD Spleen (CT) | 0.907 | 0.941 | 0.956 | 0.909 | 0.946 |
| Kvasir-SEG (endo) | 0.889 | 0.917 | 0.936 | 0.888 | 0.918 |
| ChestXray (X-ray) | 0.953 | 0.925 | 0.954 | 0.935 | 0.952 |
Figure: fair few-shot generic SAM vs leaky MedSAM benchmark, N=100 oracle.
Fair few-shot tuning of generic SAM (dark blue) vs the leaky MedSAM (orange), each frozen + few-shot-tuned, vs the from-scratch NN (gray); arch-matched vit_b, N=100 GT-box oracle. The fair generic few-shot ≥ leaky MedSAM few-shot on Spleen/Kvasir/ChestXray; MedSAM only leads on ISIC. Medical pretraining is not the lever; decoder adaptation is.

Can it be ONE domain-agnostic pipeline? Few-shot skills + a router, SAM frozen

Each domain has a distinct setup, so can it be a single domain-agnostic pipeline that few-shot-fits per domain WITHOUT tuning SAM? We test the transferable skill (a frozen-SAM-feature localizer prototype fit from N labeled images) with identical code across all four domains, SAM never touched.
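A prototype localizer of this kind can be sketched in a few lines of numpy. The version below is an illustration under stated assumptions, not the project's implementation: it assumes per-image feature maps of shape (H, W, C) are already extracted from a frozen encoder, takes the prototype to be the mean foreground feature over the N labeled images, and turns the cosine-similarity map into a box with a hypothetical threshold of 0.5.

```python
import numpy as np


def fit_prototype(features, masks):
    """Mean foreground feature over N labeled images, L2-normalised.

    features: list of (H, W, C) float arrays (assumed frozen-encoder output)
    masks:    list of (H, W) boolean ground-truth masks
    """
    fg = np.concatenate([f[m] for f, m in zip(features, masks)], axis=0)
    proto = fg.mean(axis=0)
    return proto / (np.linalg.norm(proto) + 1e-8)


def similarity_map(feature, proto):
    """Cosine similarity of every spatial location to the prototype."""
    norms = np.linalg.norm(feature, axis=-1, keepdims=True) + 1e-8
    return (feature / norms) @ proto


def box_from_similarity(sim, thresh=0.5):
    """Bounding box (x0, y0, x1, y1) of locations above the threshold."""
    ys, xs = np.nonzero(sim >= thresh)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def toy_features(rows, cols, size=8, channels=4):
    """Synthetic stand-in: background on channel 0, target on channel 1."""
    feat = np.zeros((size, size, channels))
    feat[..., 0] = 1.0
    feat[rows, cols] = [0.0, 1.0, 0.0, 0.0]
    mask = np.zeros((size, size), dtype=bool)
    mask[rows, cols] = True
    return feat, mask


train_feat, train_mask = toy_features(slice(2, 5), slice(3, 6))
proto = fit_prototype([train_feat], [train_mask])

test_feat, _ = toy_features(slice(1, 4), slice(4, 7))
box = box_from_similarity(similarity_map(test_feat, proto))
print(box)  # box prompt that would be handed to the frozen SAM decoder
```

Only the prototype is fit from labels; the encoder and decoder stay untouched, which is what lets the same code run on every domain.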

Figure: domain-agnostic few-shot skill transfer; skills few-shot-fit, topology is routed not tuned.
(A) One fixed pipeline (prototype-sim localizer): the skill few-shot-fits (it saturates by ~N=10 on every domain) and scores high where the target is prototype-localizable (ISIC 0.71, CXR 0.62), low where it needs a specialized localizer (Kvasir 0.35, Spleen 0.12). (B) Routing the per-domain topology (genome / static-box / AMG-rank / volume-prior) + the few-shot skill recovers most of the deployed number (Kvasir 0.35→0.53, Spleen 0.14→0.66→0.83@N=50).
Yes: a domain-agnostic pipeline = frozen SAM + a GT-free router + few-shot skills. The learning (localizer prototype + gate weights + genome) is domain-agnostic, few-shot (~10 labels; ~50 for the isodense spleen), and SAM-frozen; it transfers with identical code. The per-domain "distinct setup" you'd notice is which localizer topology to route to: a GT-free routing choice (learned config, the project's route.py), not SAM tuning. So onboarding a new domain is ~10 labels + a routing decision, zero gradient training.

Qualitative examples: 3 best / 3 worst per dataset

Green = ground truth, red = prediction. Top row = 3 highest-Dice cases, bottom row = 3 lowest.
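For reference, the Dice score used to rank these cases is 2|P∩G| / (|P| + |G|) for a predicted mask P and ground-truth mask G. A minimal numpy sketch follows; the helper names and the per-case scores are made up for illustration, and returning 1.0 when both masks are empty is a common convention rather than something this report specifies.

```python
import numpy as np


def dice(pred, gt):
    """Dice overlap between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred, gt).sum() / denom)


def best_and_worst(scores, k=3):
    """Split case ids into the k highest and k lowest Dice scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], ranked[-k:]


gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 2:4] = True  # same size, shifted one column to the right
print(dice(pred, gt))

# Hypothetical per-case scores, only to show the gallery selection.
case_scores = {"a": 0.98, "b": 0.10, "c": 0.91, "d": 0.0,
               "e": 0.95, "f": 0.45, "g": 0.88}
best, worst = best_and_worst(case_scores)
```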

ISIC 2018: skin lesion (pigment cue, LLM's best domain)

Figure: ISIC best/worst.

MSD Spleen: CT (localizer + cross-slice priors + SAM re-prompt polish)

Figure: Spleen best/worst.
The worst cases are localizer failures (prediction lands on stomach/liver, Dice≈0). A GT-free cross-slice prior (centroid trajectory + sibling-mask seed), a selective multi-prototype re-decode, a SAM re-prompt boundary polish, and a final pyramid keep-best (crop→zoom→SAM-decode each tiny tip slice at higher resolution) recover them, lifting the spleen pipeline to 0.920 (0.786→0.840→0.868→0.875→0.903→0.920 this session), now beating the best trained NN at N=100 (0.920 vs 0.907). Pyramid is complementary to the within-volume prior: the prior re-decodes a tip slice from its siblings, pyramid re-decodes it at higher resolution.
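The idea behind a cross-slice prior can be illustrated without any ground truth: within one volume, slices whose predicted centroid jumps away from their siblings are suspect, and the sibling centroid gives a seed for re-decoding. The sketch below is a deliberately simplified stand-in (a per-volume median centroid rather than a fitted trajectory, and a hypothetical pixel tolerance), not the pipeline's actual prior.

```python
import numpy as np


def centroid(mask):
    """(row, col) centre of mass of a binary mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return float(ys.mean()), float(xs.mean())


def flag_outlier_slices(masks, tol=10.0):
    """Flag slices whose centroid is far from the volume's median centroid.

    Returns (flagged slice indices, median centroid to use as a re-prompt
    seed). `tol` is a hypothetical tolerance in pixels.
    """
    cents = [centroid(m) for m in masks]
    valid = np.array([c for c in cents if c is not None])
    if valid.size == 0:
        return list(range(len(masks))), None
    med = np.median(valid, axis=0)
    flagged = [
        i for i, c in enumerate(cents)
        if c is None or np.hypot(c[0] - med[0], c[1] - med[1]) > tol
    ]
    return flagged, (float(med[0]), float(med[1]))


# Toy volume: the organ sits at the same place on four slices; on slice 3
# the prediction has jumped to a different structure.
volume = []
for i in range(5):
    m = np.zeros((32, 32), dtype=bool)
    if i == 3:
        m[25:29, 2:6] = True
    else:
        m[10:14, 10:14] = True
    volume.append(m)

flagged, seed = flag_outlier_slices(volume)
print(flagged, seed)
```

A flagged slice would then be re-decoded from the seed (or from a sibling mask) and kept only if a GT-free gate prefers the new mask.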

Kvasir-SEG: polyp (LLM-box localizer; the gap was localization)

Figure: Kvasir best/worst.
Polyps share colour/texture with mucosa, so the heuristic box and coordinate draws mislocalize (~43/100 to Dice≈0) and a hand-crafted redness box was rejected. Localization turns out to be solvable two ways. The deployed path is an LLM-box localizer: the vision LLM boxes the polyp per image (box-IoU 0.46 vs GT; genuine perception, no leak) → SAM decode, lifting the pipeline 0.348→0.633 (median 0.743); a final AMG-candidate margin keep-best (add SAM auto-mask re-localization proposals to the family, swap one in only when the gate clearly beats the deployed box) then rescues the ~3 unambiguous mislocalizations → 0.659 (+0.026, zero regressions). A fully LLM-agnostic variant reaches 0.617 (median 0.837): SAM's automatic-mask generator proposes the polyp in every image (best-candidate oracle 0.798) and a few-shot ranker (standardized cosine-similarity of frozen SAM features to a polyp prototype + a train-fit size prior − a vignette penalty) picks it. The LLM box wins the mean by +0.016 (at a −0.094 median + model-coupling); either way the residual to the NN is the ~0.78–0.80 boundary oracle plus localizer error, not a perception ceiling. Gallery (LLM-box path): top row = hits (the box lands on the polyp → 0.98); bottom row = localization misses (Dice 0; the ~13 cases where the box lands off the polyp), which is the whole residual.
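The few-shot ranker's score has three named terms: a standardized cosine similarity, plus a size prior, minus a vignette penalty. One way such a score could be assembled is sketched below. The functional forms are assumptions made for illustration (z-scoring across candidates, a Gaussian log-prior on log area fraction, and the fraction of the mask touching the dark border as the penalty); the report does not specify them.

```python
import numpy as np


def rank_candidates(sims, area_fracs, border_fracs,
                    size_mu, size_sigma, w_size=1.0, w_vignette=1.0):
    """Score auto-mask candidates and return (best index, all scores).

    sims:         cosine similarity of each candidate to the prototype
    area_fracs:   candidate area as a fraction of the image (must be > 0)
    border_fracs: fraction of each candidate lying on the vignette border
    size_mu/size_sigma: mean/std of log area fraction, fit on train masks
    All weights and functional forms here are hypothetical.
    """
    sims = np.asarray(sims, dtype=float)
    z = (sims - sims.mean()) / (sims.std() + 1e-8)  # standardize per image
    log_area = np.log(np.asarray(area_fracs, dtype=float))
    size_term = -0.5 * ((log_area - size_mu) / size_sigma) ** 2
    penalty = np.asarray(border_fracs, dtype=float)
    scores = z + w_size * size_term - w_vignette * penalty
    return int(np.argmax(scores)), scores


# Three toy candidates: a huge border-hugging mask, a plausible polyp,
# and a tiny speck that is similar in features but implausibly small.
best_idx, scores = rank_candidates(
    sims=[0.20, 0.80, 0.75],
    area_fracs=[0.60, 0.10, 0.001],
    border_fracs=[0.90, 0.00, 0.00],
    size_mu=np.log(0.10),
    size_sigma=1.0,
)
print(best_idx, np.round(scores, 2))
```

The size prior is what separates the second and third candidates here: their similarities are close, but only one has a plausible area.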

ChestXray: bilateral lungs (few-shot static box + self-reprompt → SAM)

Figure: CXR best/worst.
Five GT-free skills:
(1) a fixed few-shot lung box (mean of N=10 train GT boxes; box-IoU 0.545→0.67) replaces the heuristic box → 0.825→0.898;
(2) a mask-bbox self-reprompt (box-IoU 0.67→0.83) → 0.898→0.911;
(3) a repair agent that trims SAM's bright "out of lung" over-seg into the mediastinum (a decoder error a redraw loop can't reach), fire-gated + keep-best → 0.911→0.914 zero-regression;
(4) a catastrophic fix: a SAM negative-point re-prompt (positive points on the dark lung, negative on the spill) that carves the worst engulfing masks the conservative trim can't → 0.914→0.920;
(5) a letterbox-component trim that drops mask blobs sitting over near-black padding (on letterboxed films the misregistered atlas + darkness both score the black border as lung; only raw intensity separates it, since lung air is dark-gray, not pure black), fixing MCUCXR_0060 0.76→0.88 zero-regression → 0.920→0.922.
The remaining worst cases (bottom row) are a different, harder failure: inferior over-extension into the abdomen, a decoder-boundary error. CXR's gap was box localization, not architecture, but a perfect GT box still caps SAM at ~0.93 < the U-Net's 0.953, so it stays capped at every budget; the residual is the frozen decoder.
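Skills (1) and (2) are simple enough to sketch. The static box is the coordinate-wise mean of the training GT boxes, and the self-reprompt feeds the bounding box of the current mask back in as the next box prompt until the box stops changing. In the sketch below `toy_decode` is a made-up stand-in for a box-prompted SAM decode, and the boxes are invented; only the two helper ideas come from the text.

```python
import numpy as np


def mean_box(gt_boxes):
    """Static box: coordinate-wise mean of train GT boxes (x0, y0, x1, y1)."""
    arr = np.asarray(gt_boxes, dtype=float)
    return tuple(float(v) for v in arr.mean(axis=0))


def mask_bbox(mask):
    """Tight bounding box (x0, y0, x1, y1) of a binary mask, or None."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))


def self_reprompt(decode, box, max_iters=5):
    """Iterate box -> mask -> bbox(mask) until the box reaches a fixed point."""
    mask = decode(box)
    for _ in range(max_iters):
        new_box = mask_bbox(mask)
        if new_box is None or new_box == box:
            break
        box = new_box
        mask = decode(box)
    return mask, box


# Toy stand-in for the decoder: returns the part of a fixed object that
# falls inside the prompt box. A real pipeline would call SAM here.
OBJECT = np.zeros((10, 10), dtype=bool)
OBJECT[3:8, 2:7] = True


def toy_decode(box):
    x0, y0, x1, y1 = (int(round(v)) for v in box)
    inside = np.zeros_like(OBJECT)
    inside[y0:y1 + 1, x0:x1 + 1] = True
    return OBJECT & inside


static_box = mean_box([(0, 1, 7, 8), (2, 1, 9, 8)])  # invented train boxes
final_mask, final_box = self_reprompt(toy_decode, static_box)
print(static_box, final_box)
```

The loose static box tightens to the object's own bounding box in one step and then stays put, which is the fixed point the loop stops on.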

What we learned

When does iterative feedback deploy? The fix is general, the gate is the wall

We built a full repair-agent → verbal + visual feedback → SAM re-prompt → keep-best loop: the repair stage renders a diagnosis overlay (current mask + a GT-free "out-of-region" heat-map + a coordinate grid), the vision LLM looks, says what is wrong verbally, and marks the over-segmentation as SAM negative points; SAM re-decodes; a GT-free keep-best gate accepts or rejects; repeat. On chest X-ray the LLM's visual negatives beat the heuristic catastrophic fix (CHNCXR_0447 0.849→0.902, MCUCXR_0055 0.860→0.955).
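The control flow of such a loop is small; the hard part is the gate. The sketch below shows only the skeleton, with the fix proposer and the gate passed in as callables. Both toy callables are invented for illustration: masks are plain sets of pixel ids, the "fix" removes one pixel outside an expected region, and the gate rewards pixels inside that region and penalises pixels outside it.

```python
def keep_best_loop(mask, propose_fix, gate_score, max_rounds=3, margin=0.0):
    """Propose a fix, re-score with a GT-free gate, keep it only if better.

    Returns the best mask and the history of accepted gate scores.
    `margin` is a hypothetical safety margin a candidate must clear.
    """
    best, best_score = mask, gate_score(mask)
    history = [best_score]
    for _ in range(max_rounds):
        candidate = propose_fix(best)
        if candidate is None:  # nothing left to fix
            break
        score = gate_score(candidate)
        if score > best_score + margin:
            best, best_score = candidate, score
            history.append(score)
        else:
            break  # gate rejected the fix: keep the current best
    return best, history


EXPECTED_REGION = {1, 2, 3, 4}  # toy stand-in for a GT-free region prior


def toy_gate(mask):
    return len(mask & EXPECTED_REGION) - len(mask - EXPECTED_REGION)


def toy_fix(mask):
    spill = sorted(mask - EXPECTED_REGION)
    if not spill:
        return None
    return mask - {spill[-1]}  # carve off one spilled pixel


final, history = keep_best_loop(frozenset({1, 2, 3, 8, 9}), toy_fix, toy_gate)
print(sorted(final), history)
```

The loop can never lower the gate score, but it only avoids regressions in true Dice if the gate score tracks Dice in that domain.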

Figure: CXR loop vs feed-forward.
The cases the GT-free gate accepted (top = feed-forward catfix, bottom = the loop). The LLM's visual negatives tighten SAM off the mediastinum/abdomen better than the spill heuristic. Full-107: 0.9216→0.9240 (+0.0024), zero-regression, gate near-perfect (oracle 0.9241).

But the fix is domain-general: the LLM can place negatives on any over-seg it sees. What decides deployability is the gate: whether a GT-free signal can rank candidate masks by true Dice. We measured that gate↔Dice correlation on all four domains:

Figure: gate-validity across domains.
The loop deploys iff the GT-free gate is valid, which is a property of the domain, not the loop. CXR (+0.63, lung darkness+atlas) and Spleen (+0.56, soft-tissue) have valid gates and the loop deploys; Kvasir (+0.04; polyp≈mucosa, no signal) and ISIC (−0.23; the gate prefers over-grown high-contrast masks) do not, so no amount of good feedback can be deployed. 2 of 4 domains.
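A gate-validity check of this kind amounts to correlating the GT-free gate score with true Dice on labeled validation candidates. The sketch below uses a plain Pearson correlation and a hypothetical cut-off of 0.5; the report does not say which correlation statistic or threshold was used, so both are assumptions. The four per-domain values are the ones quoted above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


# Gate-to-Dice correlations as reported for the four domains.
REPORTED = {"ChestXray": 0.63, "MSD Spleen": 0.56,
            "Kvasir-SEG": 0.04, "ISIC 2018": -0.23}
MIN_CORR = 0.5  # hypothetical validity threshold, not from the report

deployable = sorted(d for d, r in REPORTED.items() if r >= MIN_CORR)
print(deployable)

# Toy check on made-up candidate masks: gate score vs measured Dice.
toy_gate_scores = [0.2, 0.4, 0.6, 0.8]
toy_dice = [0.70, 0.78, 0.85, 0.93]
print(round(pearson(toy_gate_scores, toy_dice), 3))
```

Measuring the correlation needs labels, but only on a validation split; at inference the gate itself stays GT-free.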
The core lesson. Iteration is not the lever; perception, or its GT-free proxy, is. The same wall shows up as the SIZE law, the structural-ceiling taxonomy, and here as gate-validity: a feedback loop is only as deployable as the GT-free mask-quality signal in that domain. Where the target is cheaply separable (dark lung air, soft-tissue organ edges) a gated fix pays off; where it is not (polyp on mucosa) neither a loop nor a single fix can deploy without ground truth.
Honest-reporting note: earlier drafts carried optimistic hardcoded values (ISIC 0.885, Kvasir 0.860, CXR 0.855). This report uses numbers re-measured from scratch with the current pipeline: ISIC 0.880, Spleen 0.920, Kvasir 0.659, CXR 0.922.