I'm looking for advice from people with experience in remote sensing and instance/semantic segmentation.

I'm working on building footprint extraction from aerial imagery. My baseline is an EfficientNet-B7 U-Net segmentation model, which performs reasonably well on my test areas. I wanted to explore whether a YOLO segmentation approach could provide competitive results, so I fine-tuned a YOLO11 segmentation model.

The results, however, are significantly worse than my U-Net baseline, and I'm trying to understand whether this is expected, whether I'm using the model incorrectly, or what I should try next.

Dataset

Task: single-class building footprint extraction
Imagery: high-resolution aerial/satellite imagery (~50 cm GSD)
Training images: 891 for fine-tuning (12k were used for pre-training the model)
Validation images: 156 for fine-tuning (2,155 were used for pre-training the model)
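To put the resolution in perspective, here is a rough sketch of the ground-to-pixel arithmetic. It assumes tiles reach the network at native resolution with the 640-pixel training size and no resampling, which may not match the actual pipeline. The 10 m building is a made-up example, not a statistic from this dataset.

```python
def tile_extent_m(tile_px: int, gsd_m: float) -> float:
    """Ground distance covered by one side of a square tile, in metres."""
    return tile_px * gsd_m


def metres_to_pixels(length_m: float, gsd_m: float) -> float:
    """Length of a ground object in pixels at a given ground sample distance."""
    return length_m / gsd_m


GSD_M = 0.5    # ~50 cm per pixel, as stated for the imagery
TILE_PX = 640  # training image size; assumed to be fed without resampling

print(tile_extent_m(TILE_PX, GSD_M))   # 320.0 metres per tile side
print(metres_to_pixels(10.0, GSD_M))   # 20.0 pixels for a hypothetical 10 m wall
```

Under these assumptions a small house occupies only a few hundred pixels, which is worth keeping in mind when reading the mask metrics.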

The model was initialized from weights previously trained on a large building footprint dataset and then fine-tuned on my local dataset.

Training configuration

Model: YOLO11m-seg
Epochs: 100
Best epoch: 78
Image size: 640
Batch size: 16
Initial LR: 0.0005
Cosine scheduler: enabled
Mosaic: 0.5
Rotation augmentation: ±90°
Horizontal flip: disabled
Vertical flip: disabled
Patience: 20
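For anyone who wants to reproduce the setup, the settings above could be expressed roughly as follows. This is a sketch that assumes the Ultralytics Python API was used for training; the argument names are Ultralytics training arguments, while `fine_tune`, `weights_path`, and `data_yaml` are hypothetical names introduced for illustration.

```python
# The settings above, written as keyword arguments for Ultralytics' train().
TRAIN_ARGS = {
    "epochs": 100,
    "imgsz": 640,
    "batch": 16,
    "lr0": 0.0005,    # initial learning rate
    "cos_lr": True,   # cosine learning-rate schedule
    "mosaic": 0.5,    # probability of mosaic augmentation
    "degrees": 90.0,  # random rotation within +/-90 degrees
    "fliplr": 0.0,    # horizontal flip disabled
    "flipud": 0.0,    # vertical flip disabled
    "patience": 20,   # early-stopping patience in epochs
}


def fine_tune(weights_path: str, data_yaml: str):
    """Fine-tune a segmentation checkpoint with the settings above.

    weights_path: hypothetical path to the pre-trained building weights.
    data_yaml:    hypothetical path to the dataset description file.
    """
    # Imported lazily so the config can be inspected without the package installed.
    from ultralytics import YOLO

    model = YOLO(weights_path)
    return model.train(data=data_yaml, **TRAIN_ARGS)
```

Writing the configuration out this way makes it easy to see that both flips are set to zero while rotation is enabled.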

Best validation metrics

Box metrics: mAP50 = 0.6438, mAP50-95 = 0.3894

Mask metrics: mAP50 = 0.6345, mAP50-95 = 0.3236

Precision(M) = 0.7436, Recall(M) = 0.5957
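As a single summary number, the mask precision and recall above can be combined into an F1 score. The snippet is only the standard harmonic-mean formula applied to the two reported values; `f1_score` is a helper name introduced here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Mask precision and recall reported at the best epoch.
mask_f1 = f1_score(0.7436, 0.5957)
print(f"Mask F1: {mask_f1:.3f}")  # about 0.66
```

The gap between precision and recall suggests the model misses buildings more often than it invents them.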

Inference observations

One thing that concerns me is that I need to use a very low confidence threshold during inference:

confidence_threshold = 0.05

to obtain what I would consider reasonable predictions. This feels unusually low to me, but perhaps that's normal for this type of application.
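One diagnostic that might show whether 0.05 is really needed is a confidence sweep: track how precision and recall change as the threshold moves. The sketch below assumes each prediction has already been matched against ground truth (for example at IoU ≥ 0.5), a step that is not shown. The scores and match flags in the example are invented purely to exercise the function.

```python
from typing import List, Sequence, Tuple


def sweep_confidence(
    scores: Sequence[float],
    is_true_positive: Sequence[bool],
    n_ground_truth: int,
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each confidence threshold."""
    rows = []
    for t in thresholds:
        # Keep only the predictions whose confidence reaches the threshold.
        kept = [tp for s, tp in zip(scores, is_true_positive) if s >= t]
        n_tp = sum(kept)
        # With no predictions kept, precision is undefined; reported as 0.0 here.
        precision = n_tp / len(kept) if kept else 0.0
        recall = n_tp / n_ground_truth if n_ground_truth else 0.0
        rows.append((t, precision, recall))
    return rows


# Invented example: 5 predictions, 5 ground-truth buildings.
scores = [0.90, 0.40, 0.08, 0.06, 0.30]
matched = [True, True, True, True, False]
for t, p, r in sweep_confidence(scores, matched, 5, [0.05, 0.25, 0.50]):
    print(f"conf>={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

If recall collapses well before precision improves, as in this toy data, correct detections are being assigned very low scores, which points toward a scoring or calibration issue rather than a shortage of candidate detections.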

On an independent test set, the EfficientNet-B7 U-Net significantly outperforms YOLO-seg. Visually (see attached images):

Ground truth = white
EfficientNet-U-Net prediction = red
YOLO prediction = green

The U-Net predictions are substantially more complete and better aligned with building footprints.
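Because mAP and pixel-level scores are not directly comparable, a like-for-like comparison could collapse the YOLO instance masks into one binary mask and score both models with the same pixel metrics. The following is a minimal sketch of that idea using NumPy; the function names and the tiny 4×4 arrays are illustrative assumptions, not part of either model's tooling.

```python
import numpy as np


def instances_to_semantic(instance_masks: np.ndarray) -> np.ndarray:
    """Collapse (N, H, W) instance masks into a single (H, W) boolean mask."""
    return instance_masks.astype(bool).any(axis=0)


def pixel_scores(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Pixel-level IoU, precision and recall between two binary masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = int(np.logical_and(pred, truth).sum())
    fp = int(np.logical_and(pred, ~truth).sum())
    fn = int(np.logical_and(~pred, truth).sum())
    return {
        "iou": tp / (tp + fp + fn) if (tp + fp + fn) else 1.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy example: one 2x2 ground-truth building and two predicted instances.
truth = np.zeros((4, 4), dtype=bool)
truth[0:2, 0:2] = True

instances = np.zeros((2, 4, 4), dtype=bool)
instances[0, 0:2, 0:1] = True  # covers the left column of the building
instances[1, 0:1, 1:3] = True  # covers one building pixel and one background pixel

print(pixel_scores(instances_to_semantic(instances), truth))
```

The same `pixel_scores` call can be applied to the thresholded U-Net output, so both models are judged on identical terms.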

Is this level of performance degradation expected when moving from semantic segmentation (U-Net) to YOLO instance segmentation for building footprint extraction?

Does the need for a confidence threshold of 0.05 suggest a calibration problem, overfitting, or something else?

Are there training parameters that stand out as problematic?

Are there remote sensing-specific tricks for YOLO segmentation that I should consider?

I'd really appreciate suggestions from anyone who has successfully used YOLO-based instance segmentation for building footprint extraction or similar geospatial tasks.

Also, if there are additional diagnostics, metrics, visualizations, or training details that would help identify the problem, please let me know and I'll add them.