YOLO11-seg underperforming EfficientNet-UNet for building footprint extraction from aerial imagery – what should I try next?
I'm looking for advice from people with experience in remote sensing and instance/semantic segmentation.

I'm working on building footprint extraction from aerial imagery. I have a baseline segmentation model based on an EfficientNet-B7 U-Net, which performs reasonably well on my test areas. I wanted to explore whether a YOLO segmentation approach could provide competitive results, so I fine-tuned a YOLO11 segmentation model. The results, however, are significantly worse than my U-Net baseline, and I'm trying to understand whether this is expected, whether I'm using the model incorrectly, or what I should try next.

Dataset
Task: single-class building footprint extraction
Imagery: high-resolution aerial/satellite imagery (~50 cm GSD)
Training images: 891 (for fine-tuning; I used 12k for pre-training the model)
Validation images: 156 (for fine-tuning; I used 2155 for pre-training the model)

The model was initialized from weights previously trained on a large building footprint dataset and then fine-tuned on my local dataset.

Training configuration
Model: YOLO11m-seg
Epochs: 100
Best epoch: 78
Image size: 640
Batch size: 16
Initial LR: 0.0005
Cosine scheduler: enabled
Mosaic: 0.5
Rotation augmentation: ±90°
Horizontal flip: disabled
Vertical flip: disabled
Patience: 20

Best validation metrics
Box metrics:
mAP50 = 0.6438
mAP50-95 = 0.3894
Mask metrics:
mAP50 = 0.6345
mAP50-95 = 0.3236
Precision(M) = 0.7436
Recall(M) = 0.5957

Inference observations
One thing that concerns me…