Training-free LLM segmentation vs trained networks

A frozen LLM traces the target without any ground truth (GT-free), and a small numpy post-process refines the trace; no neural network is trained. It is compared here against a U-Net trained at matched label budgets (N = 10, 25, 50, 100) across four modalities. The LLM line is flat because it never trains; the NN line climbs with labels. The point where they cross is the whole story.

Summary — Dice vs label budget

Dataset      Modality           LLM (0 train)  NN N=10  NN N=25  NN N=50  NN N=100  Crossover
ISIC 2018    Dermoscopy         0.882          0.749*   0.830*   0.828*   0.858*    never (≤100)
Kvasir-SEG   Polyp / endoscopy  0.580          0.628    0.777    0.808    0.845     N≈10
Chest X-ray  Chest X-ray        0.788          0.925    0.943    0.950    0.956     N≈10
Spleen       Abdominal CT       0.420          0.537    0.686    0.838    0.885     N≈10

* (red cells in the original) = the LLM (zero training labels) matches or beats the trained net at that budget. NN = U-Net/resnet34, ImageNet-pretrained.
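The Crossover column can be reproduced from the Dice values in the summary table. The sketch below assumes "crossover" means the smallest tested budget at which the NN strictly beats the LLM; the report does not spell out its definition, and the names `SUMMARY` and `crossover` are illustrative only.

```python
# Dice scores copied from the summary table in this report.
BUDGETS = [10, 25, 50, 100]
SUMMARY = {
    "ISIC 2018":   {"llm": 0.882, "nn": [0.749, 0.830, 0.828, 0.858]},
    "Kvasir-SEG":  {"llm": 0.580, "nn": [0.628, 0.777, 0.808, 0.845]},
    "Chest X-ray": {"llm": 0.788, "nn": [0.925, 0.943, 0.950, 0.956]},
    "Spleen":      {"llm": 0.420, "nn": [0.537, 0.686, 0.838, 0.885]},
}


def crossover(llm, nn_scores, budgets=BUDGETS):
    """Smallest tested budget where the NN strictly beats the LLM, else None.

    Assumed definition (not stated in the report).
    """
    for n, score in zip(budgets, nn_scores):
        if score > llm:
            return n
    return None


for name, row in SUMMARY.items():
    n = crossover(row["llm"], row["nn"])
    label = "never within tested budgets" if n is None else "N = " + str(n)
    print(name, "->", label)
```

Under this definition only ISIC 2018 has no crossover within the tested budgets; the other three datasets cross at the smallest budget tested.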

ISIC 2018 (dermoscopy)

Dermoscopy

N     LLM     NN
10    0.882   0.749
25    0.882   0.830
50    0.882   0.828
100   0.882   0.858
The training-free LLM (0.882) beats the U-Net at every budget up to N=100 — the standout positive case (even the strongest pretrained net, SegFormer, only reaches 0.875 at N=100).
Per-case Dice (original panels per case: raw image, ground truth, LLM, NN at each N; overlay legend: GT, LLM, NN @N)

✓ Good cases
Case (isic)     LLM    NN N=10  NN N=25  NN N=50  NN N=100
ISIC_0036159    0.96   0.91     0.91     0.95     0.94
ISIC_0023694    0.95   0.77     0.92     0.93     0.95
ISIC_0023836    0.95   0.86     0.96     0.96     0.97

✗ Bad cases
Case (isic)     LLM    NN N=10  NN N=25  NN N=50  NN N=100
ISIC_0023306    0.09   0.54     0.30     0.15     0.04
ISIC_0022313    0.08   0.36     0.35     0.71     0.52
ISIC_0022007    0.06   0.04     0.13     0.10     0.15
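All scores in this report are Dice coefficients between a predicted mask and the ground-truth mask. A minimal numpy sketch of the metric follows; the function name and the convention for two empty masks are assumptions, since the report does not show its evaluation code.

```python
import numpy as np


def dice(pred, gt):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, gt).sum() / denom


# Toy example: a 4x4 ground-truth square and a prediction shifted down one row.
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool)
pred[3:7, 2:6] = True
print(dice(pred, gt))  # overlap is 12 pixels, each mask has 16: 24 / 32 = 0.75
```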

Kvasir-SEG (polyp / endoscopy)

Polyp / endoscopy

N     LLM     NN
10    0.580   0.628
25    0.580   0.777
50    0.580   0.808
100   0.580   0.845
The NN leads from N=10 and runs away — the lobed, gradual-boundary polyp caps the LLM near 0.58 (≈45% of draws mislocate, ~0.35 shape residual).
Per-case Dice (original panels per case: raw image, ground truth, LLM, NN at each N; overlay legend: GT, LLM, NN @N)

✓ Good cases
Case (kvasir)                LLM    NN N=10  NN N=25  NN N=50  NN N=100
cju7fq7mm2pw508176uk5ugtx    0.90   0.79     0.82     0.94     0.92
cju5o1vu9gz8a0818eyy92bns    0.87   0.72     0.95     0.93     0.96
cju7amjna1ly40871ugiokehb    0.85   0.69     0.91     0.96     0.96

✗ Bad cases
Case (kvasir)                LLM    NN N=10  NN N=25  NN N=50  NN N=100
cju3xl264ingx0850rcf0rshj    0.12   0.72     0.70     0.96     0.95
cju5u8gz4kj5b07552e2wpkwp    0.10   0.13     0.56     0.01     0.42
cju884985nlmx0817vzpax3y4    0.06   0.26     0.55     0.88     0.85
🧪 Skill: gated polyp ROI contrast-enhance (new)
Bad Kvasir cases mislocate — the draw lands on the wrong region. A GT-free contrast enhancer (specular-inpaint + CLAHE-L + saturation) makes the polyp salient enough to find. It rescues mislocations but distorts already-good boundaries, so it is gated on K-draw self-consistency (applied only to low-confidence draws). Skill polyp_roi_enhance (b45cd77).
Case group                            LLM raw  LLM enhanced  Δ
bad                                   0.094    0.434         +0.340
good                                  0.874    0.729         -0.145
gated (raw on good, enhance on bad)   0.484    0.654         +0.170
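The gate described for this skill decides, without ground truth, whether a case needs the enhanced view. A minimal sketch, assuming self-consistency is measured as the mean pairwise Dice over the K raw draws and that a fixed threshold separates low-confidence cases; the report states neither, so `self_consistency`, `gated_view` and the 0.5 threshold are hypothetical.

```python
from itertools import combinations

import numpy as np


def dice(a, b):
    """Dice overlap of two binary masks (two empty masks count as 1.0)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom


def self_consistency(draws):
    """Mean pairwise Dice across K >= 2 draws of the same image (GT-free)."""
    return float(np.mean([dice(a, b) for a, b in combinations(draws, 2)]))


def gated_view(raw_draws, threshold=0.5):
    """Pick 'enhanced' when the raw draws disagree, otherwise keep 'raw'.

    The threshold value is an assumption for illustration.
    """
    return "enhanced" if self_consistency(raw_draws) < threshold else "raw"


def square(row, col, size=4, shape=(16, 16)):
    """Helper for the toy example: a filled square mask."""
    mask = np.zeros(shape, dtype=bool)
    mask[row:row + size, col:col + size] = True
    return mask


agreeing = [square(4, 4), square(4, 4), square(4, 4)]       # draws coincide
scattered = [square(0, 0), square(6, 6), square(12, 12)]    # draws land apart
print(gated_view(agreeing), gated_view(scattered))
```

With a gate that separates the two groups perfectly and equally sized groups, the gated row of the table follows as (0.874 + 0.434) / 2 = 0.654, against (0.094 + 0.874) / 2 = 0.484 for raw draws everywhere.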
Per-case Dice, raw view vs enhanced view (original panels per case: raw, enhanced view, GT, LLM raw, LLM enhanced; overlay legend: GT, LLM raw, LLM on enhanced view)

✗ Bad cases: rescued by enhancement (2 of 3)
Case (kvasir)                LLM raw  LLM enhanced
cju3xl264ingx0850rcf0rshj    0.12     0.63
cju5u8gz4kj5b07552e2wpkwp    0.10     0.61
cju884985nlmx0817vzpax3y4    0.06     0.06

✓ Good cases: enhancement not needed (gated off)
Case (kvasir)                LLM raw  LLM enhanced
cju7fq7mm2pw508176uk5ugtx    0.90     0.65
cju5o1vu9gz8a0818eyy92bns    0.87     0.81
cju7amjna1ly40871ugiokehb    0.85     0.73

Chest X-ray (lung fields)

Chest X-ray

N     LLM     NN
10    0.788   0.925
25    0.788   0.943
50    0.788   0.950
100   0.788   0.956
Lung fields are consistent anatomy: an NN nails them from just 10 labels (0.925). The LLM also segments them well (0.788) but never leads, because the target is too learnable.
Per-case Dice (original panels per case: raw image, ground truth, LLM, NN at each N; overlay legend: GT, LLM, NN @N)

✓ Good cases
Case (lungseg)   LLM    NN N=10  NN N=25  NN N=50  NN N=100
MCUCXR_0150_1    0.86   0.89     0.91     0.91     0.93
CHNCXR_0330_1    0.85   0.96     0.97     0.98     0.98
CHNCXR_0421_1    0.85   0.94     0.95     0.96     0.95

✗ Bad cases
Case (lungseg)   LLM    NN N=10  NN N=25  NN N=50  NN N=100
CHNCXR_0229_0    0.71   0.91     0.94     0.94     0.94
MCUCXR_0055_0    0.71   0.96     0.97     0.97     0.98
CHNCXR_0027_0    0.70   0.94     0.95     0.95     0.96
🧪 Skill: X-ray CLAHE + non-convex lung trace (new)
Bad lung draws came out as rounded ovals. The polygon rasterizer already supports non-convex polygons (PIL even-odd fill), so this was a drawing issue, not a skill limit. The fix is the literature-standard CXR CLAHE enhancer (xray_clahe) plus a guide to trace the true shape (concave mediastinal border and costophrenic angle). It is net-positive and nearly non-destructive on good cases (-0.009; lungs are large and consistent), so it is applied ungated. Part of the new modality→preprocess registry (8b535f1).
Case group      LLM raw  LLM enhanced  Δ
bad             0.707    0.762         +0.055
good            0.855    0.846         -0.009
all (ungated)   0.781    0.804         +0.023
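The point about non-convex polygons can be illustrated with the even-odd rule itself. The report's rasterizer uses PIL; the numpy-only sketch below is a stand-in that applies the same rule at pixel centers, and the function name and the L-shaped test polygon are hypothetical.

```python
import numpy as np


def fill_polygon_even_odd(vertices, height, width):
    """Rasterize a polygon given as [(x, y), ...] with the even-odd rule.

    A pixel is filled when a horizontal ray from its center crosses the
    polygon boundary an odd number of times, which also handles concave shapes.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    px = xs + 0.5  # pixel-center coordinates
    py = ys + 0.5
    inside = np.zeros((height, width), dtype=bool)
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        if y0 == y1:
            continue  # a horizontal edge never crosses a horizontal ray
        straddles = (y0 > py) != (y1 > py)
        x_cross = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
        inside ^= straddles & (px < x_cross)  # toggle parity on each crossing
    return inside


# Concave L-shape: a convex fill would wrongly cover the inner corner.
l_shape = [(0, 0), (6, 0), (6, 2), (2, 2), (2, 6), (0, 6)]
mask = fill_polygon_even_odd(l_shape, 8, 8)
print(mask.astype(int))
```

The pixel at row 3, column 3 lies inside the convex hull of the L but outside the L itself, so it stays empty. A traced lung with a concave mediastinal border needs the same property.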
Per-case Dice, raw view vs enhanced view (original panels per case: raw, enhanced view, GT, LLM raw, LLM enhanced; overlay legend: GT, LLM raw, LLM on enhanced view)

✗ Bad cases: round ovals fixed
Case (lungseg)   LLM raw  LLM enhanced
CHNCXR_0229_0    0.71     0.78
MCUCXR_0055_0    0.71     0.74
CHNCXR_0027_0    0.70     0.77

✓ Good cases: unaffected
Case (lungseg)   LLM raw  LLM enhanced
MCUCXR_0150_1    0.86     0.85
CHNCXR_0330_1    0.85     0.83
CHNCXR_0421_1    0.85     0.86
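The xray_clahe enhancer is not shown in this report. As a rough illustration of what contrast limiting does, the sketch below applies clip-limited histogram equalization to a whole image as one tile. This is only the per-tile step of CLAHE, without the tiling and interpolation a full implementation adds; the function name and `clip_limit=2.0` are assumptions.

```python
import numpy as np


def clipped_equalize(gray, clip_limit=2.0):
    """Single-tile contrast-limited histogram equalization for uint8 images.

    Histogram bins above clip_limit times the mean bin count are clipped and
    the excess is spread evenly over all bins before building the mapping.
    """
    gray = np.asarray(gray, dtype=np.uint8)
    bins = 256
    hist = np.bincount(gray.ravel(), minlength=bins).astype(np.float64)
    limit = clip_limit * gray.size / bins
    excess = np.clip(hist - limit, 0.0, None).sum()
    hist = np.minimum(hist, limit) + excess / bins
    cdf = np.cumsum(hist)
    cdf = (cdf - cdf[0]) / max(cdf[-1] - cdf[0], 1e-12)
    lut = np.round(cdf * 255).astype(np.uint8)  # monotone intensity lookup table
    return lut[gray]


# Low-contrast toy image: intensities confined to 100..120.
flat = np.tile(np.arange(100, 121, dtype=np.uint8), (21, 1))
out = clipped_equalize(flat)
print(int(flat.max()) - int(flat.min()), int(out.max()) - int(out.min()))
```

The output range is wider than the input range but does not stretch to the full 0..255, which is the effect of the clip limit.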

Spleen (abdominal CT)

Abdominal CT

N     LLM     NN
10    0.420   0.537
25    0.420   0.686
50    0.420   0.838
100   0.420   0.885
Raw LLM draws score 0.22; the diagnose-repair soft-tissue window lifts the routed pipeline to 0.42, but the NN already leads at N=10 (0.54). Spleen case masks below are the raw draws (illustrating the SIZE-law failures).
Per-case Dice (original panels per case: raw image, ground truth, LLM, NN at each N; overlay legend: GT, LLM, NN @N)

✓ Good cases
Case (spleen)     LLM    NN N=10  NN N=25  NN N=50  NN N=100
spleen_8_z024     0.76   0.68     0.49     0.91     0.96
spleen_33_z057    0.75   0.72     0.70     0.81     0.96
spleen_6_z098     0.68   0.71     0.90     0.86     0.96

✗ Bad cases
Case (spleen)     LLM    NN N=10  NN N=25  NN N=50  NN N=100
spleen_21_z058    0.07   0.55     0.94     0.85     0.87
spleen_21_z041    0.01   0.36     0.52     0.65     0.59
spleen_40_z068    0.00   0.70     0.75     0.95     0.96
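The soft-tissue window mentioned for the spleen pipeline is not specified in this report. A common way to apply a CT window is to clip Hounsfield units to a center/width range and rescale to 8 bits. The sketch below assumes that approach; the center of 50 HU and width of 400 HU are typical abdominal presets chosen for illustration, not the report's values.

```python
import numpy as np


def apply_ct_window(hu, center=50.0, width=400.0):
    """Clip a Hounsfield-unit slice to a window and rescale to uint8 0..255.

    center and width are assumed soft-tissue presets, not taken from the report.
    """
    lo = center - width / 2.0
    hi = center + width / 2.0
    clipped = np.clip(np.asarray(hu, dtype=np.float32), lo, hi)
    return np.round((clipped - lo) / (hi - lo) * 255.0).astype(np.uint8)


# Air, window floor, soft tissue, window ceiling, dense bone (in HU).
sample = np.array([-1000.0, -150.0, 40.0, 250.0, 1000.0])
windowed = apply_ct_window(sample)
print(windowed)
```

Everything below the window floor maps to 0 and everything above the ceiling maps to 255, so the available gray levels are spent on the soft-tissue range where the spleen sits.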