Training Faster R-CNN for Geospatial Object Detection

Foundation models like SAM are excellent at segmentation, but deploying them at scale on large imagery is computationally expensive. A trained object detector runs much faster and can be optimized for your specific domain. This post walks through the pipeline I built to train Faster R-CNN models on aerial imagery, using SAM's segmentation outputs as training data.

The Bootstrap Problem

Training an object detector requires bounding box annotations. For aerial imagery of agricultural fields, that means someone manually drawing boxes around thousands of objects—tedious work that doesn't scale. SAM offers an escape hatch: run it once on your imagery to generate high-quality polygon masks, then convert those polygons to bounding boxes automatically.

The workflow becomes: use SAM (slow but accurate) to bootstrap training data, then train a lightweight detector (fast at inference) that learns from SAM's outputs. You trade one-time annotation cost for a model you can deploy efficiently.

From Polygons to Bounding Boxes

SAM outputs polygon masks in GeoJSON format, each with georeferenced coordinates. Converting to bounding boxes requires the GeoTIFF's affine transform to map between coordinate reference system units and pixel space:

from rasterio.transform import rowcol

def polygon_to_bbox(polygon, transform):
    # Georeferenced bounds in CRS units.
    minx, miny, maxx, maxy = polygon.bounds
    # Top-left pixel corresponds to (minx, maxy); bottom-right to (maxx, miny).
    row_min, col_min = rowcol(transform, minx, maxy)
    row_max, col_max = rowcol(transform, maxx, miny)
    # [xmin, ymin, xmax, ymax] in pixel space.
    return [col_min, row_min, col_max, row_max]

The key detail: GeoJSON coordinates are typically in a projected CRS (UTM, state plane) where Y increases northward, but image coordinates have Y increasing downward. The rowcol function from rasterio handles this inversion correctly.
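
Wired up end to end, the conversion looks something like this (the file names are placeholders for your own SAM output and source raster):

import geopandas as gpd
import rasterio

# Placeholder paths -- substitute your own SAM masks and orthomosaic.
masks = gpd.read_file("sam_masks.geojson")

with rasterio.open("orthomosaic.tif") as src:
    masks = masks.to_crs(src.crs)  # align the mask CRS with the raster if needed
    bboxes = [polygon_to_bbox(geom, src.transform) for geom in masks.geometry]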

Hard Negative Mining

Object detectors learn what to detect and what to ignore. Without explicit negative examples, the model might learn spurious correlations—triggering on shadows, soil patterns, or equipment that happens to appear near real objects in the training data. This is where hard negative mining becomes essential.

I generate hard negatives by spatially shifting each positive polygon by a fixed distance (1.5 feet in CRS units) and using those shifted regions as background examples. The shift is small enough that the negative patches look similar to positives—same lighting, same general context—but don't contain the target object. This forces the model to learn actual object features rather than contextual shortcuts.

import geopandas as gpd
import rasterio
from shapely.affinity import translate

def generate_hard_negatives(geojson_path, tif_path, shift_distance):
    gdf = gpd.read_file(geojson_path)
    negative_bboxes = []

    with rasterio.open(tif_path) as src:
        for idx, row in gdf.iterrows():
            # Shift the positive polygon sideways by shift_distance CRS units.
            shifted_poly = translate(row.geometry, xoff=-shift_distance)
            # ... validate (e.g. stays within the raster, doesn't overlap a positive)
            bbox = polygon_to_bbox(shifted_poly, src.transform)
            negative_bboxes.append(bbox)

    return negative_bboxes

Tiling Large Images

Drone orthomosaics routinely exceed 10,000 pixels per side. Loading a 15,000 x 12,000 pixel image directly into GPU memory isn't practical, and resizing destroys the fine detail needed to detect small objects.

The solution is tile-based training: slice images into 1024x1024 patches with a 128-pixel overlap, assign bounding boxes to the tiles they intersect, and train on the tiles. The overlap ensures that objects near tile boundaries appear fully in at least one tile.

During tiling, boxes that get clipped at tile boundaries need filtering. A box that's 90% outside a tile isn't a useful training signal, so I only keep boxes whose visible portion exceeds a minimum size threshold (10 pixels in my configuration).
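
Here's a minimal sketch of that tiling step in plain Python, using the 1024/128/10-pixel values from my configuration. The function name and box format ([xmin, ymin, xmax, ymax] in pixels) are illustrative rather than the pipeline's exact API:

def tile_boxes(image_w, image_h, boxes, tile=1024, overlap=128, min_size=10):
    """Assign pixel-space boxes to overlapping tiles, clipping and filtering remnants."""
    stride = tile - overlap
    tiles = []
    for y0 in range(0, max(image_h - overlap, 1), stride):
        for x0 in range(0, max(image_w - overlap, 1), stride):
            x1, y1 = min(x0 + tile, image_w), min(y0 + tile, image_h)
            kept = []
            for xmin, ymin, xmax, ymax in boxes:
                # Clip the box to this tile's extent.
                cx0, cy0 = max(xmin, x0), max(ymin, y0)
                cx1, cy1 = min(xmax, x1), min(ymax, y1)
                # Drop boxes whose visible portion is below the size threshold.
                if cx1 - cx0 >= min_size and cy1 - cy0 >= min_size:
                    kept.append([cx0 - x0, cy0 - y0, cx1 - x0, cy1 - y0])
            tiles.append(((x0, y0, x1, y1), kept))
    return tiles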

Model Architecture

Faster R-CNN with a ResNet50-FPN backbone hits the sweet spot for this task. The Feature Pyramid Network handles objects at multiple scales well, which matters for aerial imagery where the same object type can vary significantly in apparent size depending on flight altitude.

Starting from COCO pre-trained weights, I replace only the classification head to match my class count (background + one object class). The rest of the network transfers remarkably well despite COCO containing nothing like overhead agricultural imagery.

from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def get_model(num_classes, pretrained=True):
    # Start from COCO-pretrained Faster R-CNN with a ResNet50-FPN backbone.
    model = fasterrcnn_resnet50_fpn(pretrained=pretrained)
    # Replace the box predictor so its output matches our class count.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

Training Details

Data augmentation for aerial imagery differs from natural photos. Rotations and flips are safe—there's no canonical "up" in overhead views. Random 90-degree rotations effectively quadruple the training data. Brightness and contrast variations help with lighting differences across flight times.

Color jittering, however, tends to hurt. Agricultural objects often have distinctive color signatures (dark irrigation circles against green crops), and aggressive color augmentation can destroy that signal.
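
With albumentations, a pipeline along those lines might look like the following; the probabilities and brightness/contrast limits are illustrative defaults, not tuned values from this project:

import albumentations as A

# Geometric transforms plus mild brightness/contrast; color jitter is deliberately left out.
train_transform = A.Compose(
    [
        A.RandomRotate90(p=0.5),
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)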

I train with SGD at a learning rate of 0.005, with step decay every 10 epochs. A batch size of 2 fits comfortably on a consumer GPU with 1024x1024 tiles. Training typically converges within 30-50 epochs for datasets with a few hundred positive examples.
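
In PyTorch, that setup is roughly the following (assuming model comes from get_model above; the momentum, weight decay, and decay factor are common defaults rather than values quoted in this post):

import torch

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
# Multiply the learning rate by 0.1 every 10 epochs.
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)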

Inference and Georeferencing

At inference time, the same tiling strategy applies: slice the input GeoTIFF, run detection on each tile, then map the predicted boxes back to georeferenced coordinates. The affine transform runs in reverse—pixel coordinates to CRS coordinates—and the output is a GeoJSON file that opens directly in QGIS or any GIS tool.
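
A sketch of that reverse mapping: multiply pixel coordinates by the raster's affine transform. The helper below is illustrative; bbox is [xmin, ymin, xmax, ymax] in tile-local pixels and (col_off, row_off) is the tile's offset within the full image:

from shapely.geometry import box as shapely_box

def bbox_to_geom(bbox, transform, col_off=0, row_off=0):
    xmin, ymin, xmax, ymax = bbox
    # Tile-local pixels -> full-image pixels -> CRS coordinates.
    x0, y0 = transform * (xmin + col_off, ymin + row_off)
    x1, y1 = transform * (xmax + col_off, ymax + row_off)
    # The Y axis flips under the transform, so normalize the corner order.
    return shapely_box(min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))

From there, geopandas can collect the geometries and confidence scores into a GeoDataFrame and write them out with the GeoJSON driver.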

Overlapping tiles produce duplicate detections for objects near boundaries. A post-processing step merges overlapping boxes, using area-weighted score averaging to combine confidence values:

import numpy as np
from shapely.ops import unary_union

# `components` holds index groups of mutually overlapping detections in `gdf`.
for comp in components:
    polys = gdf.loc[comp]
    # Dissolve the overlapping boxes into a single geometry.
    merged_geom = unary_union(polys.geometry)
    # Confidence: area-weighted average of the member scores.
    score = np.average(polys.score, weights=polys.geometry.area)
    merged_results.append((merged_geom, score))

Results

A model trained on ~500 positive examples from three orthomosaics generalizes well to new imagery from the same sensor and altitude. Inference runs at roughly 50ms per 1024x1024 tile on an RTX 3080, far faster than running SAM on the same patches.

The accuracy gap versus SAM is smaller than expected. For well-defined objects with clear boundaries, the trained detector achieves 85-90% of SAM's segmentation quality while running 20x faster. The main failure modes are unusual lighting conditions and objects partially occluded by vegetation.

Lessons Learned

Quality of bootstrap data matters more than quantity. Cleaning up SAM's occasional false positives before training pays off. A hundred clean examples beat a thousand noisy ones.

Hard negatives are essential. Without them, the model learns to detect "things near field edges" rather than the actual objects. The shifted-polygon approach is simple but effective.

Tile overlap needs tuning. Too little overlap and you miss boundary objects; too much and you waste compute on redundant inference. 128 pixels (12.5% of tile size) works well for objects up to ~200 pixels in diameter.

Keep the georeferencing pipeline clean. Bugs in coordinate transformation are subtle and produce results that look almost right but are shifted by a few meters. Validate early with known reference points.

The full training pipeline is available on GitHub.
