Teaching robots to see 13 types of facade defects
How we built a computer vision system trained on 35,000+ buildings that classifies cracks, spalling, corrosion, and 10 other defect types.
Mar 20, 2026 · 7 min read
The problem with facade defects
Detecting facade defects is not like detecting cats in photos. Defects are subtle, irregular, and deeply context-dependent. A hairline crack on a concrete panel might be cosmetic. The same crack near exposed rebar signals active corrosion and structural risk. Staining below a window joint tells a different story than staining in the middle of a wall.
The visual signatures overlap constantly. Efflorescence (white salt deposits leaching through concrete) looks a lot like peeling paint from a distance. Mold and weathering share color profiles. Water damage rarely announces itself with a clean boundary.
Any useful detection system needs to handle all of these simultaneously, at high resolution, in real-world lighting conditions, on surfaces that range from glass curtain walls to century-old limestone. That is the bar we set for ourselves.
Building the dataset
Good models start with good data, and good data for facade defects barely existed when we started. There is no ImageNet for building pathology. So we built one.
The VitroBOT Composite Dataset contains 35,566 annotated images in YOLO format, with roughly 145,000 individual annotations. We split it into 28,452 training images, 3,556 validation, and 3,558 test. Three main sources feed into it.
Public and semi-public datasets. We pulled from 15 Roboflow datasets covering cracks, spalling, corrosion, and other defect types. Each one had different class names, different annotation conventions, and different image qualities. Merging them required extensive normalization: remapping class labels, filtering garbage annotations, and resolving overlapping definitions.
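The label-remapping step can be sketched in a few lines. This is illustrative rather than our actual merge script: the source class IDs, the target IDs, and the drop rule are all hypothetical.

```python
# Hypothetical per-source mapping: source class ID -> unified class ID.
# None means the class has no counterpart in our taxonomy and is dropped.
SOURCE_TO_UNIFIED = {0: 0,     # e.g. "fissure"  -> crack
                     1: 3,     # e.g. "spall"    -> spalling
                     2: None}  # e.g. "graffiti" -> not a defect class we keep

def remap_yolo_labels(lines, mapping):
    """Rewrite YOLO annotation lines ('cls cx cy w h') with unified class IDs.

    Lines whose class is unmapped (or mapped to None) are filtered out.
    """
    out = []
    for line in lines:
        parts = line.split()
        new_cls = mapping.get(int(parts[0]))
        if new_cls is None:  # unmapped or explicitly dropped
            continue
        out.append(" ".join([str(new_cls)] + parts[1:]))
    return out
```

The same pass is a natural place to filter garbage annotations, for example boxes with zero width or coordinates outside [0, 1].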
MBDD2025. This multi-building defect dataset contributed 14,294 images and gave us much better coverage of real-world facade conditions. It was the single largest quality source.
SAM3 auto-annotation. This is where things got interesting. We scraped 2,048 facade photos from the web, then used Meta's SAM3 (Segment Anything Model 3) running on an H100 GPU to auto-annotate them. Of the 2,048 images, 1,400 produced usable annotations, generating 33,225 detections in 4.5 minutes. That is not a typo. Four and a half minutes for over thirty thousand annotations that would have taken a human team weeks.
The auto-annotated data is noisier than hand-labeled data. But at scale, the signal compounds. The model learns general defect patterns from the bulk data and refines on the cleaner hand-labeled sets.
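At its core, converting segmentation output into YOLO training data means turning each predicted mask into a normalized bounding box. A minimal sketch, with the SAM3 inference call itself omitted and `mask` standing in as a plain 0/1 grid:

```python
def mask_to_yolo_box(mask):
    """Convert a binary mask (list of rows of 0/1) to a normalized
    YOLO box (cx, cy, w, h), or None if the mask is empty."""
    h, w = len(mask), len(mask[0])
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None  # empty mask -> no usable annotation
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    return ((x0 + x1 + 1) / 2 / w,   # center x, normalized
            (y0 + y1 + 1) / 2 / h,   # center y, normalized
            (x1 - x0 + 1) / w,       # width, normalized
            (y1 - y0 + 1) / h)       # height, normalized
```

Empty or degenerate masks returning None is one reason only 1,400 of the 2,048 scraped images ended up contributing usable annotations.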
The 13 classes
We settled on 13 defect categories that cover the vast majority of what inspectors encounter on commercial and residential facades.
Crack is the most common and most varied: structural, shrinkage, settlement. Spalling is concrete breaking away in chunks, exposing the substrate. Exposed rebar is the red flag: reinforcement steel visible and corroding, meaning moisture has penetrated the concrete cover. Efflorescence shows up as white crystalline deposits where water has migrated through masonry. Rust and corrosion covers metal elements, from window frames to structural steel.
Mold and algae appear as dark green or black biological growth, typically on north-facing or shaded surfaces. Peeling paint signals adhesion failure and often underlying moisture issues. Water damage captures staining, streaking, and material degradation from persistent water exposure. Stain covers discoloration from pollution, organic matter, or chemical runoff.
Weathering is the general degradation from UV, wind, and thermal cycling. Glass defect includes chips, cracks, delamination, and seal failures. Window damage covers frame deterioration and sealant failure. Tile damage addresses cracked, displaced, or missing tiles on clad facades.
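In a YOLO-format dataset, this taxonomy is simply an ordered list of class names. The ordering and exact identifiers below are illustrative; the actual VitroBOT class indices may differ.

```python
# Illustrative class list as it might appear in the dataset config.
CLASSES = [
    "crack", "spalling", "exposed_rebar", "efflorescence",
    "rust_corrosion", "mold_algae", "peeling_paint", "water_damage",
    "stain", "weathering", "glass_defect", "window_damage", "tile_damage",
]
```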
Each class matters because it maps to a different repair method, a different cost, and a different urgency. The model does not just find problems. It classifies them so the downstream pipeline can price them correctly.
Training at scale: 67 experiments, 335 GPU hours
We did not hand-tune one model and call it done. We built an automated experiment pipeline inspired by Andrej Karpathy's approach to systematic hyperparameter search. The system runs on an 8xH100 cluster on Nebius, launching experiments autonomously, tracking results, and picking winners.
Over five rounds, we ran 67 experiments totaling roughly 335 hours of H100 GPU time at a compute cost of about $1,072. That sounds like a lot until you compare it to the cost of a single rope access inspection campaign.
We tested across the YOLOv8 family (small, medium, large) at different resolutions, learning rates, augmentation strategies, and training schedules. The champion configuration landed on YOLOv8l (the large variant) at 640px input resolution, with a learning rate of 2e-4, cosine learning rate schedule, mosaic augmentation at 0.5, close-mosaic at epoch 25, and label smoothing at 0.05.
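For readers who want to try something similar, here is roughly what that configuration looks like as an Ultralytics training call. The dataset path is a placeholder, and note that Ultralytics interprets `close_mosaic=N` as disabling mosaic augmentation for the final N epochs.

```python
from ultralytics import YOLO

model = YOLO("yolov8l.pt")            # large variant
model.train(
    data="vitrobot.yaml",             # placeholder dataset config
    imgsz=640,                        # input resolution
    lr0=2e-4,                         # initial learning rate
    cos_lr=True,                      # cosine LR schedule
    mosaic=0.5,                       # mosaic augmentation strength
    close_mosaic=25,                  # mosaic off for the final epochs
    label_smoothing=0.05,             # the single biggest win (see below)
)
```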
The best result: 0.325 mAP@50-95 and 0.544 mAP@50. For context, mAP (mean Average Precision) is the standard metric for object detection accuracy. The @50-95 variant averages across IoU thresholds from 50% to 95%, making it a strict measure. Scoring above 0.3 on a 13-class fine-grained defect detection task is a strong result, especially on real-world facade imagery where defects are small, overlapping, and visually ambiguous.
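To make the metric concrete, here is a toy illustration of the IoU test underneath mAP. A detection counts as a hit only if its overlap with the ground truth clears the threshold, and @50-95 averages over thresholds 0.50, 0.55, ..., 0.95, so a box that easily passes at 50% can fail most of the stricter thresholds.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

thresholds = [0.50 + 0.05 * i for i in range(10)]   # 0.50 .. 0.95
pred, gt = (0, 0, 10, 10), (2, 0, 12, 10)           # IoU = 80/120
hits = [iou(pred, gt) >= t for t in thresholds]
# This prediction is a hit at @50 but misses 6 of the 10 thresholds.
```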
For edge deployment, YOLOv8s (the small variant) hits 0.286 mAP@50 in a 22MB model that runs inference in about 15 milliseconds on a Jetson Orin Nano. That is real-time detection on robot hardware.
What surprised us
Two findings stood out from the 67 experiments.
Label smoothing was the single biggest lever. Setting label smoothing to 0.05 gave a 12.6% improvement in mAP. For those unfamiliar, label smoothing softens the training targets slightly, preventing the model from becoming overconfident on noisy labels. Given that our dataset merges 15+ sources with varying annotation quality, this makes intuitive sense. The model learns to be appropriately uncertain at boundaries, which helps generalization.
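A common formulation of the idea, applied to a one-hot target with eps = 0.05 and our 13 classes (a sketch of the mechanism, not necessarily the exact formula Ultralytics applies internally):

```python
def smooth_one_hot(true_class, num_classes, eps):
    """Soften a one-hot target: the true class keeps most of the mass,
    every class gets a small floor of eps / num_classes."""
    return [(1 - eps) + eps / num_classes if c == true_class
            else eps / num_classes for c in range(num_classes)]

target = smooth_one_hot(0, 13, 0.05)
# The loss now never rewards a fully confident prediction, which is
# exactly what you want when some labels are wrong.
```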
No other single hyperparameter change came close to that gain. Not resolution changes, not model size, not augmentation strategies. In fact, aggressive augmentation was actively destructive: cranking it up made results worse, not better. The same was true for freezing backbone layers, a common transfer learning trick that simply did not help here.
YOLO's default learning rate is catastrophically bad for this task. The default lr of 1e-2 in Ultralytics YOLO produces models that are roughly 45% worse than lr=3e-4 on facade defects. This is not a subtle difference. If we had trusted the defaults and only run a handful of experiments, we would have concluded that YOLO cannot detect facade defects well. The automated pipeline saved us from that false conclusion.
The lesson: defaults are tuned for COCO (everyday objects at medium scale). Facade defects are a different distribution entirely. Small, low-contrast, irregularly shaped. You need to re-derive your hyperparameters from scratch.
From detection to report
Detection is only the first step. A bounding box that says “crack” is not useful to a property manager. They need to know where, how big, and how much it will cost to fix.
The full VitroBOT pipeline works like this. The robot scans the facade with cameras and a Livox Mid-360 LiDAR, building a 3D point cloud of the surface. The YOLOv8 model runs on the Jetson Orin Nano in real time, detecting and classifying defects in the camera feed. Each detection is projected from 2D image coordinates onto the 3D point cloud, giving us the physical location and area of every defect on the building surface.
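The 2D-to-3D step can be sketched with a pinhole camera model: a detection's pixel coordinates plus a depth value looked up from the point cloud give a position in camera coordinates. The intrinsics below are illustrative, and the real pipeline also has to handle camera-to-LiDAR extrinsics, which this sketch omits.

```python
def pixel_to_camera_xyz(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) at metric depth (meters) to camera XYZ
    using a pinhole model with focal lengths fx, fy and principal
    point (cx, cy)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# A detection centered on the principal point maps straight down the
# optical axis (intrinsics here are made-up example values):
point = pixel_to_camera_xyz(960, 540, 12.0,
                            fx=1400.0, fy=1400.0, cx=960.0, cy=540.0)
# point == (0.0, 0.0, 12.0)
```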
From there, we calculate repair areas in square meters, match each defect type to a cost database (prices per square meter for each repair method, sourced from industry rate cards), and generate a complete priced inspection report. The report comes out as a PDF: professional, detailed, and delivered within hours of the scan rather than weeks.
The property manager gets a document that says “14 square meters of spalling on the north facade, floors 6-8, estimated repair cost: EUR 8,400” instead of “some concrete damage was observed.”
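The arithmetic behind a report line like that is straightforward: measured area times a per-square-meter rate. In this toy version the EUR 600/m2 rate is back-derived so the example reproduces the spalling line above; it is not a real industry rate, and the actual rate card covers all 13 defect types and multiple repair methods.

```python
# Hypothetical rate card: EUR per square meter, by defect type.
RATE_CARD_EUR_PER_M2 = {"spalling": 600.0, "crack": 180.0}

def price_defect(defect_type, area_m2, rate_card):
    """Price one defect line item from its measured area."""
    return round(area_m2 * rate_card[defect_type], 2)

cost = price_defect("spalling", 14.0, RATE_CARD_EUR_PER_M2)  # 14 m2 of spalling
```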
What comes next
Detection tells you what is there. Diagnosis tells you what it means.
We are actively integrating vision-language models into the pipeline. Our experiments with curriculum learning have reached 80.1% exact-match accuracy on defect diagnosis, classifying not just defect type but severity, probable cause, and recommended intervention.
The key insight from our VLM work is that curriculum learning (training first on severity classification alone before introducing the full diagnostic task) dramatically outperforms direct fine-tuning. The model learns to see before it learns to reason.
The next version of VitroBOT's reports will not just list defects and costs. They will explain why the damage occurred, how urgently it needs attention, and what happens if it is left untreated. That is the difference between a detection system and an inspection system.
We are building the second one.