agenticSeg — LLM vs NN segmentation

Summary — Dice vs label budget

Dataset	Modality	LLM (0 train)	NN N=10	N=25	N=50	N=100	Crossover
ISIC 2018	Dermoscopy	0.882	0.749	0.830	0.828	0.858	never (≤100)
Kvasir-SEG	Polyp / endoscopy	0.580	0.628	0.777	0.808	0.845	N≈10
Chest X-ray	Chest X-ray	0.788	0.925	0.943	0.950	0.956	N≈10
Spleen	Abdominal CT	0.420	0.537	0.686	0.838	0.885	N≈10

Red cells mark budgets where the LLM (zero training labels) matches or beats the trained net; in this table that is every ISIC 2018 budget and no other cell. NN = U-Net/resnet34, ImageNet-pretrained.
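The Crossover column can be recomputed from the Dice values in the summary table. The sketch below assumes crossover means the smallest label budget at which the NN strictly exceeds the LLM; the `crossover` helper and the dict layout are illustrative and not part of the report's code.

```python
# Dice values copied from the summary table above.
SUMMARY = {
    "ISIC 2018":   {"llm": 0.882, "nn": {10: 0.749, 25: 0.830, 50: 0.828, 100: 0.858}},
    "Kvasir-SEG":  {"llm": 0.580, "nn": {10: 0.628, 25: 0.777, 50: 0.808, 100: 0.845}},
    "Chest X-ray": {"llm": 0.788, "nn": {10: 0.925, 25: 0.943, 50: 0.950, 100: 0.956}},
    "Spleen":      {"llm": 0.420, "nn": {10: 0.537, 25: 0.686, 50: 0.838, 100: 0.885}},
}

def crossover(llm, nn_by_budget):
    """Smallest budget N where the NN Dice exceeds the LLM Dice, else None.

    Assumed definition: a tie counts for the LLM ("matches or beats").
    """
    for n in sorted(nn_by_budget):
        if nn_by_budget[n] > llm:
            return n
    return None

for name, row in SUMMARY.items():
    n = crossover(row["llm"], row["nn"])
    label = "never (<=100)" if n is None else f"by N={n}"
    print(f"{name}: {label}")
```

Because N=10 is the smallest budget tested, "by N=10" only bounds the crossover from above; the true crossover may sit below 10 labels.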

ISIC 2018 (dermoscopy)

N	LLM	NN
10	0.882	0.749
25	0.882	0.830
50	0.882	0.828
100	0.882	0.858

The training-free LLM (0.882) beats the U-Net at every budget up to N=100 — the standout positive case (even the strongest pretrained net, SegFormer, only reaches 0.875 at N=100).

Legend: GT · LLM · NN @N

✓ Good cases

Per-case scores (panels shown: raw image, ground truth, LLM, NN at each N):

case	LLM	NN N=10	N=25	N=50	N=100
isic · ISIC_0036159	0.96	0.91	0.91	0.95	0.94
isic · ISIC_0023694	0.95	0.77	0.92	0.93	0.95
isic · ISIC_0023836	0.95	0.86	0.96	0.96	0.97

✗ Bad cases

Per-case scores (panels shown: raw image, ground truth, LLM, NN at each N):

case	LLM	NN N=10	N=25	N=50	N=100
isic · ISIC_0023306	0.09	0.54	0.30	0.15	0.04
isic · ISIC_0022313	0.08	0.36	0.35	0.71	0.52
isic · ISIC_0022007	0.06	0.04	0.13	0.10	0.15

Kvasir-SEG (polyp / endoscopy)

N	LLM	NN
10	0.580	0.628
25	0.580	0.777
50	0.580	0.808
100	0.580	0.845

The NN leads from N=10 and runs away — the lobed, gradual-boundary polyp caps the LLM near 0.58 (≈45% of draws mislocate, ~0.35 shape residual).

Legend: GT · LLM · NN @N

✓ Good cases

Per-case scores (panels shown: raw image, ground truth, LLM, NN at each N):

case	LLM	NN N=10	N=25	N=50	N=100
kvasir · cju7fq7mm2pw508176uk5ugtx	0.90	0.79	0.82	0.94	0.92
kvasir · cju5o1vu9gz8a0818eyy92bns	0.87	0.72	0.95	0.93	0.96
kvasir · cju7amjna1ly40871ugiokehb	0.85	0.69	0.91	0.96	0.96

✗ Bad cases

Per-case scores (panels shown: raw image, ground truth, LLM, NN at each N):

case	LLM	NN N=10	N=25	N=50	N=100
kvasir · cju3xl264ingx0850rcf0rshj	0.12	0.72	0.70	0.96	0.95
kvasir · cju5u8gz4kj5b07552e2wpkwp	0.10	0.13	0.56	0.01	0.42
kvasir · cju884985nlmx0817vzpax3y4	0.06	0.26	0.55	0.88	0.85

🧪 Skill: gated polyp ROI contrast-enhance (new)

Bad Kvasir cases mislocate — the draw lands on the wrong region. A GT-free contrast enhancer (specular-inpaint + CLAHE-L + saturation) makes the polyp salient enough to find. It rescues mislocations but distorts already-good boundaries, so it is gated on K-draw self-consistency (applied only to low-confidence draws). Skill polyp_roi_enhance (b45cd77).
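One way to read "gated on K-draw self-consistency" is to measure how much the K draws agree with each other and enhance only when they disagree. The sketch below assumes masks stored as sets of (row, col) pixels, mean pairwise Dice as the agreement measure, and a placeholder threshold of 0.5; none of these choices is confirmed by the report, and all function names are hypothetical.

```python
from itertools import combinations

def dice(a, b):
    """Dice overlap of two masks given as sets of (row, col) pixels."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def self_consistency(draws):
    """Mean pairwise Dice across K draws; low values mean the draws disagree."""
    pairs = list(combinations(draws, 2))
    if not pairs:  # a single draw cannot disagree with itself
        return 1.0
    return sum(dice(a, b) for a, b in pairs) / len(pairs)

def needs_enhancement(draws, threshold=0.5):
    """Gate: use the enhanced view only when the draws are inconsistent.

    The 0.5 threshold is a placeholder, not the report's value.
    """
    return self_consistency(draws) < threshold

def box(r0, c0, r1, c1):
    """Toy rectangular mask covering rows r0..r1-1 and cols c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}

# Three draws that nearly coincide, and three that land in different places.
confident = [box(10, 10, 30, 30), box(11, 10, 31, 30), box(10, 11, 30, 31)]
scattered = [box(10, 10, 30, 30), box(40, 40, 60, 60), box(5, 45, 25, 65)]

print(needs_enhancement(confident))  # False: draws agree, keep the raw view
print(needs_enhancement(scattered))  # True: draws disagree, try the enhanced view
```

The gate needs no ground truth, which fits the GT-free framing above: it relies only on agreement between the model's own draws.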

case group	LLM raw	LLM enhanced	Δ
bad	0.094	0.434	+0.340
good	0.874	0.729	-0.145
gated (raw on good, enhance on bad)	0.484	0.654	+0.170
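The gated row follows arithmetically from the two group rows. The check below assumes the bad and good groups are equal in size (three cases of each are shown) and that the gate sorts cases perfectly, which is what "raw on good, enhance on bad" describes; the variable names are illustrative.

```python
# Group means copied from the table above.
raw = {"bad": 0.094, "good": 0.874}
enhanced = {"bad": 0.434, "good": 0.729}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

ungated_raw = mean(raw.values())               # raw view for every case
gated = mean([enhanced["bad"], raw["good"]])   # enhance only the bad group
always_on = mean(enhanced.values())            # enhance every case

print(f"raw {ungated_raw:.3f} -> gated {gated:.3f} (delta {gated - ungated_raw:+.3f})")
print(f"always-on enhancement would sit between them: {ungated_raw < always_on < gated}")
```

Under these assumptions, enhancing every case still beats the raw mean but gives up part of the gain, because the loss on good cases offsets some of the rescue on bad ones.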

Legend: GT · LLM raw · LLM on enhanced view

✗ Bad cases: two of three rescued by enhancement

Per-case scores (panels shown: raw, enhanced view, LLM raw, LLM enhanced):

case	LLM raw	LLM enhanced
kvasir · cju3xl264ingx0850rcf0rshj	0.12	0.63
kvasir · cju5u8gz4kj5b07552e2wpkwp	0.10	0.61
kvasir · cju884985nlmx0817vzpax3y4	0.06	0.06

✓ Good cases — enhancement not needed (gated off)

Per-case scores (panels shown: raw, enhanced view, LLM raw, LLM enhanced):

case	LLM raw	LLM enhanced
kvasir · cju7fq7mm2pw508176uk5ugtx	0.90	0.65
kvasir · cju5o1vu9gz8a0818eyy92bns	0.87	0.81
kvasir · cju7amjna1ly40871ugiokehb	0.85	0.73

Chest X-ray (lung fields)

N	LLM	NN
10	0.788	0.925
25	0.788	0.943
50	0.788	0.950
100	0.788	0.956

Lung fields are consistent anatomy: an NN nails them from just 10 labels (0.93). The LLM also segments them well (0.79) but never leads, because the target is too learnable.

Legend: GT · LLM · NN @N

✓ Good cases

Per-case scores (panels shown: raw image, ground truth, LLM, NN at each N):

case	LLM	NN N=10	N=25	N=50	N=100
lungseg · MCUCXR_0150_1	0.86	0.89	0.91	0.91	0.93
lungseg · CHNCXR_0330_1	0.85	0.96	0.97	0.98	0.98
lungseg · CHNCXR_0421_1	0.85	0.94	0.95	0.96	0.95

✗ Bad cases

Per-case scores (panels shown: raw image, ground truth, LLM, NN at each N):

case	LLM	NN N=10	N=25	N=50	N=100
lungseg · CHNCXR_0229_0	0.71	0.91	0.94	0.94	0.94
lungseg · MCUCXR_0055_0	0.71	0.96	0.97	0.97	0.98
lungseg · CHNCXR_0027_0	0.70	0.94	0.95	0.95	0.96

🧪 Skill: X-ray CLAHE + non-convex lung trace (new)

Bad lung draws came out as rounded ovals. The polygon rasterizer already supports non-convex polygons (PIL even-odd fill), so this was a drawing issue, not a skill limit. Fix = the literature-standard CXR CLAHE enhancer (xray_clahe) + a guide to trace the true shape (concave mediastinal border + costophrenic angle). Net-positive (+0.023 overall) and non-destructive on good cases (-0.009; lungs are large/consistent) → applied ungated. Part of the new modality→preprocess registry (8b535f1).
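To illustrate why the rounded ovals were a drawing issue rather than a rasterizer limit, here is a pure-Python sketch of even-odd polygon filling. It is not PIL's implementation, and the toy outlines are invented for the example; it only shows that a concave outline rasterizes correctly once the vertices actually trace the concavity, and how much Dice a convex blob loses against it.

```python
def point_in_polygon(x, y, poly):
    """Even-odd rule: inside if a ray toward +x crosses the outline an odd number of times."""
    inside = False
    n = len(poly)
    for i in range(n):
        x0, y0 = poly[i]
        x1, y1 = poly[(i + 1) % n]
        if (y0 > y) != (y1 > y):  # edge straddles the horizontal line through y
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside

def rasterize(poly, width, height):
    """Set of (row, col) pixels whose centers fall inside the polygon."""
    return {(r, c) for r in range(height) for c in range(width)
            if point_in_polygon(c + 0.5, r + 0.5, poly)}

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

# Toy "lung" with a concave notch on one side; vertices are (x, y), not real anatomy.
concave = [(2, 2), (12, 2), (12, 6), (7, 6), (7, 12), (12, 12), (12, 18), (2, 18)]
# The same outline with the notch skipped, i.e. a convex blob.
convex = [(2, 2), (12, 2), (12, 18), (2, 18)]

gt = rasterize(concave, 16, 20)
blob = rasterize(convex, 16, 20)
print(len(gt), len(blob), round(dice(gt, blob), 3))  # 130 160 0.897
```

The notch is left unfilled in the first mask, so the remaining gap comes from which vertices are drawn, which is what the tracing guide targets.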

case group	LLM raw	LLM enhanced	Δ
bad	0.707	0.762	+0.055
good	0.855	0.846	-0.009
all (ungated)	0.781	0.804	+0.023
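The modality→preprocess registry could be organized along the following lines. This is a hypothetical sketch: only the skill names `polyp_roi_enhance` and `xray_clahe` and their gated/ungated status come from this report; the modality keys, the `Preprocess` record, the `choose_view` helper and the stub enhancer are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

Image = list  # placeholder image type for this sketch

@dataclass(frozen=True)
class Preprocess:
    name: str
    fn: Callable[[Image], Image]
    gated: bool  # True: apply only to low-confidence draws

def _stub(image):
    """Stand-in for the real enhancer, which this report does not show."""
    return image

# Hypothetical modality keys mapped to the two skills named in the report.
REGISTRY = {
    "endoscopy": Preprocess("polyp_roi_enhance", _stub, gated=True),
    "chest_xray": Preprocess("xray_clahe", _stub, gated=False),
}

def choose_view(modality, image, low_confidence):
    """Return (view_name, image) to hand to the LLM for this draw."""
    entry = REGISTRY.get(modality)
    if entry is None or (entry.gated and not low_confidence):
        return "raw", image
    return entry.name, entry.fn(image)

print(choose_view("chest_xray", [[0]], low_confidence=False)[0])  # xray_clahe
print(choose_view("endoscopy", [[0]], low_confidence=False)[0])   # raw
print(choose_view("endoscopy", [[0]], low_confidence=True)[0])    # polyp_roi_enhance
```

Keeping the gated flag next to each enhancer records the per-modality decision in one place: ungated where enhancement is safe on good cases, gated where it distorts them.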

Legend: GT · LLM raw · LLM on enhanced view

✗ Bad cases — round ovals fixed

Per-case scores (panels shown: raw, enhanced view, LLM raw, LLM enhanced):

case	LLM raw	LLM enhanced
lungseg · CHNCXR_0229_0	0.71	0.78
lungseg · MCUCXR_0055_0	0.71	0.74
lungseg · CHNCXR_0027_0	0.70	0.77

✓ Good cases — unaffected

Per-case scores (panels shown: raw, enhanced view, LLM raw, LLM enhanced):

case	LLM raw	LLM enhanced
lungseg · MCUCXR_0150_1	0.86	0.85
lungseg · CHNCXR_0330_1	0.85	0.83
lungseg · CHNCXR_0421_1	0.85	0.86

Spleen (abdominal CT)

N	LLM	NN
10	0.420	0.537
25	0.420	0.686
50	0.420	0.838
100	0.420	0.885

Raw LLM draws score 0.22; the diagnose-repair soft-tissue window lifts the routed pipeline to 0.42, but the NN already leads at N=10 (0.54). Spleen case masks below are the raw draws (illustrating the SIZE-law failures).
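A soft-tissue window on CT is conventionally a linear clip-and-rescale of Hounsfield units before the slice is rendered. The sketch below shows that generic operation only; the level 40 / width 400 defaults are a commonly used abdominal soft-tissue setting assumed here, not values taken from this report, and the function names are hypothetical.

```python
def apply_window(hu, level=40.0, width=400.0):
    """Map one Hounsfield value to an 8-bit gray level with a linear window.

    level/width defaults are an assumed soft-tissue setting, not the report's.
    """
    lo = level - width / 2
    hi = level + width / 2
    clipped = min(max(hu, lo), hi)
    return round(255 * (clipped - lo) / (hi - lo))

def window_slice(slice_hu, level=40.0, width=400.0):
    """Apply the window to a 2D slice given as nested lists of HU values."""
    return [[apply_window(v, level, width) for v in row] for row in slice_hu]

# Toy 2x3 slice in HU: air, fat, water on the first row; soft tissue, soft tissue, bone on the second.
toy = [[-1000, -100, 0], [40, 60, 1000]]
for row in window_slice(toy):
    print(row)
```

Air and bone saturate to 0 and 255, so the display range is spent on the narrow band of soft-tissue values around the organ.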

Legend: GT · LLM · NN @N

✓ Good cases

spleen · spleen_8_z024