
Performing Object Detection on Drone Orthomosaics with Meta's Segment Anything Model (SAM)

Abstract

Accurate and efficient object detection and spatial localization in remote sensing imagery are persistent challenges. In the context of precision agriculture, the extensive data annotation required by conventional deep learning models poses additional challenges. This paper presents a fully open source workflow leveraging Meta AI’s Segment Anything Model (SAM) for zero-shot segmentation, enabling scalable object detection and spatial localization in high-resolution drone orthomosaics without the need for annotated image datasets. Model training and/or fine-tuning is rendered unnecessary in our precision agriculture-focused use case. The presented end-to-end workflow takes high-resolution images and quality control (QC) check points as inputs, automatically generates masks corresponding to the objects of interest (empty plant pots, in our given context), and outputs their spatial locations in real-world coordinates. Detection accuracy (required in the given context to be within 3 cm) is then quantitatively evaluated using the ground truth QC check points and benchmarked against object detection output generated using commercially available software. Results demonstrate that the open source workflow achieves superior spatial accuracy — producing output 20% more spatially accurate, with 400% greater IoU — while providing a scalable way to perform spatial localization on high-resolution aerial imagery (with ground sampling distance, or GSD, < 30 cm).

Keywords: object detection, spatial localization, drone orthomosaic

1 Introduction

Image segmentation is a critical task in geospatial analysis, enabling the identification and extraction of relevant features from high-resolution remote sensing imagery (Wu & Osco, 2023). However, extracting actionable information (i.e., object detection and spatial localization) can be constrained by the need for large, labeled datasets to train deep learning models in order to then perform inference and (hopefully) produce the desired output. This bottleneck is particularly acute in agricultural domains, where variability in conditions and object types complicates manual annotation (Osco et al., 2023).

Recent advances in foundation models, such as Meta AI’s Segment Anything Model (SAM), offer a promising path forward. SAM is designed for promptable “zero-shot” segmentation. “Prompt engineering”, in this context, involves using points and bounding boxes to focus the model’s efforts on more efficiently generating masks corresponding to objects of interest (Mayladan et al., 2024). Providing these prompts allows accurate masks to be generated for novel objects (ones not included in SAM’s training corpus), without domain-specific training. Masks can also be generated automatically with no such prompting. SAM’s automatic mask generator will effectively “detect” everything using open source model checkpoints and generate masks for each object in a provided image (Kirillov et al., 2023).

While SAM’s ability to generalize is impressive (Kirillov et al., 2023; Osco et al., 2023), its performance on remote sensing imagery and fine-grained features requires careful workflow integration and evaluation (Wu & Osco, 2023). This paper describes a comprehensive, open source workflow for object detection and spatial localization in high-resolution remote sensing imagery, built around SAM and widely used geospatial Python libraries (GDAL/OGR contributors, 2025; Gillies et al., 2025; Jordahl et al., 2020; Gillies & others, 2013). The complete process is delineated, from data loading and preprocessing to mask generation, post-processing, and quantitative accuracy assessment, culminating in a robust comparison with the results produced using the proprietary software (see code). Precision, recall, F1 score, mean deviation (in cm), and Intersection-over-Union (IoU) are calculated in order to quantify the relative quality of the output produced using each workflow[1].

2 Motivation

Precision agriculture relies on accurate object detection for tasks such as plant counting, health monitoring, and targeted resource distribution. Traditional deep learning approaches can be hindered by the cost and effort of generating carefully annotated data, limiting scalability and accessibility. Proprietary solutions, while effective, can be expensive and opaque, impeding reproducibility and customization.

Figure 1: The derived centroids of the objects detected in the drone orthomosaic are used to automate this nursery trimmer.

SAM’s zero-shot segmentation capability directly addresses the data annotation bottleneck, enabling rapid deployment in novel contexts. By developing an open source workflow around SAM, an end-to-end pipeline is created which allows for the quantitative evaluation of spatial accuracy with respect to objects detected in high-resolution aerial imagery. This modular workflow can also be repurposed as an automated data annotation pipeline for downstream model training/fine-tuning, if required.

3 Approach

Our approach integrates SAM’s segmentation strengths with traditional geospatial data processing techniques, which lends itself to our precision agriculture use case. The workflow, like any other, can be thought of as a sequence of steps (visualized above and described below), each with its own set of substeps; a minimal configuration sketch follows the list:

  • Data Ingestion: Loading GeoTIFF orthomosaics and QC point CSVs, extracting spatial bounds and coordinate reference systems (CRS) using Rasterio or GDAL.
  • Preprocessing: Filtering QC points to those within image bounds, standardizing coordinate columns, and saving filtered data for downstream analysis.
  • Mask Generation: Tiling large images for efficient processing, running SAM’s automatic mask generator (ViT-H variant) on each tile, and filtering masks by confidence.
  • Post-Processing: Converting masks to polygons, filtering by area and compactness, merging overlapping geometry, and extracting centroids.
  • Accuracy Evaluation: Calculating point-to-centroid deviations (in centimeters) between detected objects and QC points, compiling results, and generating visual and tabular reports.
  • Benchmarking: Quantitatively comparing SAM-based results against the output produced by the proprietary software using identical evaluation metrics (precision, recall, IoU, etc.; see Appendix for details).
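The knobs referenced in these steps can be captured in a small configuration object. The sketch below is illustrative rather than the published implementation: the tile size, overlap, checkpoint variant, and tolerance come from this paper, while the file names and the confidence, area, and compactness thresholds are assumed placeholders (see code for the actual values).

```python
# A minimal sketch of the tunable parameters implied by the workflow steps.
# Values marked "assumed" are illustrative placeholders only.
from dataclasses import dataclass

@dataclass
class WorkflowConfig:
    ortho_path: str = "orthomosaic.tif"           # input GeoTIFF (assumed file name)
    qc_csv_path: str = "qc_points.csv"            # QC check points (assumed file name)
    tile_size_px: int = 1280                      # tile edge length used in this paper
    tile_overlap_px: int = 320                    # 25% overlap, as reported
    sam_checkpoint: str = "sam_vit_h_4b8939.pth"  # ViT-H checkpoint distributed by Meta AI
    pred_iou_thresh: float = 0.88                 # SAM mask-confidence filter (assumed)
    min_area_m2: float = 0.01                     # post-processing area filter (assumed)
    max_area_m2: float = 0.25                     # post-processing area filter (assumed)
    min_compactness: float = 0.6                  # 4*pi*A/P^2 threshold (assumed)
    tolerance_cm: float = 3.0                     # client-specified localization tolerance

config = WorkflowConfig()
```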

It should be noted that there are no model training or fine-tuning steps included in our workflow, as we are using a foundation model to generate masks. This is analogous to using ChatGPT to generate text, which does not require users to train or fine-tune the underlying foundation model in order to do so.

This approach is carried out entirely using open source Python libraries, ensuring transparency and extensibility.

4 Methodology

4.1 Data and Environment

4.2 Workflow

Figure 2: The high-level workflow steps.

4.2.1 Data Ingestion and Preprocessing

Figure 3: Data ingestion and preprocessing workflow substeps.
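A minimal sketch of these substeps using Rasterio, pandas, and GeoPandas follows. The file names and CSV column names ("x", "y") are assumptions, and the QC coordinates are assumed to already be expressed in the orthomosaic’s CRS.

```python
# Ingestion/preprocessing sketch: read the orthomosaic's bounds and CRS, load
# the QC check points, and keep only those that fall within the image extent.
import rasterio
import pandas as pd
import geopandas as gpd

with rasterio.open("orthomosaic.tif") as src:    # assumed file name
    bounds = src.bounds                          # (left, bottom, right, top)
    raster_crs = src.crs                         # coordinate reference system

qc = pd.read_csv("qc_points.csv")                # assumed file and column names
qc_gdf = gpd.GeoDataFrame(
    qc, geometry=gpd.points_from_xy(qc["x"], qc["y"]), crs=raster_crs
)

# Spatially filter QC points to the orthomosaic bounds and save for later steps.
qc_in_bounds = qc_gdf.cx[bounds.left:bounds.right, bounds.bottom:bounds.top]
qc_in_bounds.to_file("qc_points_filtered.gpkg", driver="GPKG")
```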

4.2.2 Mask Generation

Figure 4: Mask generation workflow substeps.
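A sketch of mask generation on a single tile is shown below; the full workflow iterates over a grid of 1280-by-1280 px tiles with 320 px overlap. The confidence threshold is an assumed value, and the checkpoint file name is the ViT-H weight file distributed by Meta AI.

```python
# Mask-generation sketch: read one tile from the orthomosaic and run SAM's
# automatic mask generator (ViT-H) on it, keeping only high-confidence masks.
import numpy as np
import rasterio
from rasterio.windows import Window
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

TILE_SIZE = 1280                                   # tile edge length (px)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")                                     # a T4 GPU in the Colab environment
generator = SamAutomaticMaskGenerator(sam)

with rasterio.open("orthomosaic.tif") as src:
    window = Window(col_off=0, row_off=0, width=TILE_SIZE, height=TILE_SIZE)
    tile = src.read([1, 2, 3], window=window)          # first three bands, (3, H, W)
    image = np.moveaxis(tile, 0, -1).astype(np.uint8)  # SAM expects H x W x 3 uint8 RGB
    tile_transform = src.window_transform(window)      # georeferencing for this tile

masks = generator.generate(image)                      # list of mask records
masks = [m for m in masks if m["predicted_iou"] >= 0.88]  # confidence filter (assumed threshold)
```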

4.2.3 Post-Processing

Figure 5: Data post-processing workflow substeps.
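Continuing from the mask-generation sketch, the post-processing substeps might look as follows; the area and compactness thresholds shown are illustrative placeholders rather than the values used in the published code.

```python
# Post-processing sketch: vectorize SAM masks into georeferenced polygons,
# filter by area and compactness, merge overlapping geometry, take centroids.
import math
import numpy as np
from rasterio.features import shapes
from shapely.geometry import shape
from shapely.ops import unary_union

def masks_to_polygons(masks, tile_transform):
    polygons = []
    for m in masks:
        seg = m["segmentation"].astype(np.uint8)
        for geom, value in shapes(seg, mask=seg.astype(bool), transform=tile_transform):
            if value == 1:
                polygons.append(shape(geom))
    return polygons

def keep(poly, min_area=0.01, max_area=0.25, min_compactness=0.6):
    # Compactness = 4*pi*A / P^2 (1.0 for a circle); pots should be near-circular.
    compactness = 4 * math.pi * poly.area / (poly.length ** 2)
    return min_area <= poly.area <= max_area and compactness >= min_compactness

polygons = [p for p in masks_to_polygons(masks, tile_transform) if keep(p)]
merged = unary_union(polygons)                       # merge overlapping detections
parts = list(merged.geoms) if merged.geom_type == "MultiPolygon" else [merged]
centroids = [p.centroid for p in parts]              # one centroid per detected pot
```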

4.2.4 Accuracy Evaluation

Figure 6: Accuracy evaluation workflow substeps; see Appendix for methodology details.
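A sketch of the deviation calculation follows, continuing from the previous steps; it assumes a projected CRS with units of meters and uses GeoPandas’ nearest-neighbor join as the point-to-centroid matching rule, which is our reading of the workflow (see code for the exact procedure).

```python
# Accuracy-evaluation sketch: for each QC point, find the nearest detected
# centroid and report the deviation in centimeters against the 3 cm tolerance.
import geopandas as gpd

qc_gdf = gpd.read_file("qc_points_filtered.gpkg")
detections = gpd.GeoDataFrame(geometry=centroids, crs=qc_gdf.crs)

# Nearest-centroid join; distances come back in CRS units (meters here).
evaluated = gpd.sjoin_nearest(qc_gdf, detections, distance_col="deviation_m")
evaluated["deviation_cm"] = evaluated["deviation_m"] * 100.0
evaluated["within_tolerance"] = evaluated["deviation_cm"] <= 3.0

print(evaluated["deviation_cm"].mean())        # mean deviation, in cm
print((~evaluated["within_tolerance"]).sum())  # QC points flagged out of tolerance
```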

4.2.5 Benchmarking

Figure 7: Benchmarking workflow substeps; see code for methodology details.
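The detection-quality side of the comparison reduces to simple counts; a sketch with placeholder counts is shown below (the actual TP/FP/FN values are reported in the Results section).

```python
# Benchmarking sketch: precision, recall, and F1 from TP/FP/FN counts.
# The counts passed in below are placeholders, not the reported results.
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

print(detection_metrics(tp=1000, fp=2, fn=1))  # placeholder example
```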

5 Results

5.1 Proprietary Workflow

Figure 8: 19 false positives (FP) and hundreds of sliver polygons were observed in the output produced using the proprietary software.

The bounding boxes that were output using this workflow (against which we are benchmarking ours) can be viewed as a layer overlain onto the GeoTIFF orthomosaic using GIS software[4]. Certain inferences can be drawn from the output that we won’t go into here; what is of particular use to us is the fact that zero false negatives (FN) were observed in the output, though 19 FP were. This empirical knowledge equips us with something not usually possessed in use cases such as this: the number of true positives (TP), which allows us to leverage such metrics as precision, recall, and the harmonic mean of the two, F1 score, in order to perform a rigorous comparison (see code).

5.2 Open Source Workflow

Figure 9: 18 FP were observed in the output produced using the open source workflow.

Knowing how many TP (18,736) there are in the benchmark output ultimately allows us to derive how many FP (18) and FN (65) there are in our workflow output and conduct our performance comparison.

Table 1: Performance Comparison

Workflow     | Precision | Recall | F1 Score | Mean Deviation (cm) | IoU
Proprietary  | 0.9990    | 1.0000 | 0.9995   | 1.39                | 0.18
Open Source  | 0.9990    | 0.9956 | 0.9973   | 1.20                | 0.74

Precision, Recall, and F1 Score are detection quality metrics; Mean Deviation and IoU are localization accuracy metrics.

It can be observed that empty plant pots tend to be ~64 pixels (px) wide and tall; with QC points corresponding to actual pot centroids, we were able to create 64-by-64px boxes to facilitate our IoU calculations (see code). These calculations further allow us to assess the relative alignment between the detection output geometry and our “ground truth” geometry.
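As a sketch of how such boxes and the resulting IoU can be computed with Shapely: the GSD value and coordinates below are placeholders, and the real GSD comes from the orthomosaic metadata.

```python
# Sketch of the 64 x 64 px ground-truth boxes built around each QC point and
# the IoU computed against a detection geometry. GSD and coordinates assumed.
from shapely.geometry import Point, box

gsd_m = 0.005                  # assumed ground sampling distance, meters/pixel
half = 32 * gsd_m              # half of a 64 px box edge, in meters

def qc_box(point: Point):
    return box(point.x - half, point.y - half, point.x + half, point.y + half)

def iou(a, b):
    union = a.union(b).area
    return a.intersection(b).area / union if union else 0.0

qc_point = Point(500000.0, 5000000.0)            # illustrative projected coordinate
detection = qc_box(Point(500000.01, 5000000.0))  # a detection offset by 1 cm
print(round(iou(qc_box(qc_point), detection), 3))
```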

This work makes it easy to identify, down to the individual QC point ID, which detection centroids deviate from their corresponding QC point by more than 3 cm, the tolerance specified by our client. In aggregate, we are able to gain a quantified sense of the mean deviation (in cm) of the output produced by each workflow. However, visual inspection reveals that some QC points flagged as having corresponding detection centroids that are out-of-tolerance were, in fact, themselves off-center. That is to say, some detections from both the open source workflow and the benchmark workflow were flagged as out-of-tolerance when they observably were not.

Figure 10: Visual inspection of the detected centroids relative to QC point 91 reveals that the QC point is off-center.

Visual inspection also reveals that our detections (in pink) have greater overall coverage of the QC geometry (in grey) than those produced using the commercial software (in beige). This provides intuition as to why the IoU calculations revealed a 400% increase in coverage for the geometry produced, zero-shot, by SAM’s automatic mask generator.

6 Discussion

6.1 Key Findings

The open-source workflow using Meta AI’s Segment Anything Model (SAM) outperformed a commercial alternative in object detection and spatial localization on high-resolution drone imagery. It achieved 20% higher spatial accuracy (1.20 cm vs 1.39 cm mean deviation) and a 400% higher Intersection-over-Union (IoU) (0.74 vs 0.18), indicating stronger alignment with object boundaries. Both methods had near-perfect precision, but the open-source approach showed slightly lower recall due to 65 false negatives. It should be noted, however, that these FN were a direct result of the filtering substep in our workflow, which filtered out detections (based on arbitrary geometry area and compactness thresholds; see code) that are present in the benchmark output.

Nevertheless, its overall performance supports its suitability for precision agriculture and downstream automation.

6.2 Precision Agriculture Challenges

Our work began with an eye toward tackling a major challenge in agricultural remote sensing: the need for extensive manual annotation. SAM’s zero-shot segmentation enables accurate object detection without domain-specific training, making it scalable and adaptable for new use cases with minimal setup.

6.3 Benefits of Open Source

Built entirely on open source geospatial tools, the workflow offers transparency, reproducibility, and flexibility. It can be tailored for various tasks like plant counting or automated annotation for model training, supporting broader adoption in agriculture and remote sensing.

6.4 Practical Impact

Meeting professional-grade tolerance requirements (e.g., < 3 cm) enables real-world applications, such as automating nursery machinery, based on precise object localization. This demonstrates how automated workflows can reduce manual labor and support more efficient agricultural practices.

6.5 Limitations and Future Work

Our approach to tiling (“chipping”) high-resolution orthomosaics, processing 588 individual 1280-by-1280 px tiles at an average pace of 11 seconds per tile, required a total processing time of ~110 minutes running on a single T4 GPU Colab instance. It is important to note that an overlap of 25% (320 px) between tiles during processing was required to ensure that the output geometry did not contain “holes” or malformations; merging overlapping polygons after filtering (based on area and compactness calculations, in this case) helped us ensure the overall quality of the geometric output.
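The arithmetic behind these figures can be sanity-checked in a few lines; the orthomosaic dimensions below are hypothetical values chosen only to illustrate how a tile count of 588 arises from a 960 px stride.

```python
# Back-of-the-envelope check of the tiling and runtime figures reported above.
import math

tile, overlap = 1280, 320               # 25% overlap => 960 px stride between tiles
stride = tile - overlap

def tiles_needed(length_px: int) -> int:
    # Smallest number of tiles (at the given stride) that covers `length_px`.
    return math.ceil((length_px - overlap) / stride)

width_px, height_px = 27_000, 20_000    # hypothetical orthomosaic dimensions
print(tiles_needed(width_px) * tiles_needed(height_px))  # 588 tiles at this size

print(588 * 11 / 60)                    # ~107.8 minutes, consistent with the ~110 reported
```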

Future work will be centered on building a CLI that we can open source, which will allow users to pass orthomosaics as inputs and get geometry meeting desired spatial characteristics as output.

7 Conclusion

We present a robust, open source workflow for object detection and spatial localization in high-resolution drone orthomosaics, leveraging SAM’s zero-shot segmentation capabilities. Our quantitative evaluation demonstrates improved accuracy over a commercially available software solution, underscoring the potential of foundation models and open source tools to advance scalable, cost-effective feature extraction in agriculture. This work provides a template for further research and deployment in diverse contexts.

To our knowledge, this is the first comparative evaluation of an open source segmentation model (SAM) against commercial software in a context requiring high (< 3 cm) spatial accuracy. Our results demonstrate that the workflow not only matches but in some cases exceeds the performance of the commercial output across the evaluated metrics.

8 Conflicts of Interest

The author declares no conflicts of interest.

9 AI Usage Disclosure

AI tools (ChatGPT, Perplexity, and NotebookLM) were used:

  • in writing portions of the workflow integration code,
  • to generate Matplotlib subplots, process flow diagrams, LaTeX, etc., and
  • for proofreading and light revision to reduce potential publication errors.

10 Code

Data and code required to replicate our approach can be found using the links below:

  • GitHub
  • Open In Colab

11 Appendix

11.1 Accuracy Evaluation Methodology

The evaluation focuses on two primary categories of metrics: localization accuracy and detection quality; the employed methodology relies on the following data:

  • Ground Truth Quality Control (QC) Points ($P_{QC}$), defined as a set of $N_{QC}$ known spatial coordinates, $P_{QC} = \{p_j\}_{j=1}^{N_{QC}}$, where each $p_j = (x_j, y_j)$ represents the centroid (in real-world coordinates) of our object of interest (empty plant pots), serving as the ground truth for spatial localization.
  • Detected Object Centroids ($C_{Det}$), a set of $N_{Det}$ centroids, $C_{Det} = \{c_k\}_{k=1}^{N_{Det}}$, where each $c_k = (x'_k, y'_k)$ is the centroid extracted from a polygon representing an object detected by the workflow.
  • Detected Object Polygons ($G_{Det}$), a set of $N_{Det}$ polygons, $G_{Det} = \{g_k\}_{k=1}^{N_{Det}}$, where each $g_k$ is a polygon generated from a SAM-produced mask after post-processing.
  • Ground Truth Polygons ($G_{GT}$): the calculation of Intersection over Union (IoU) implies the existence of a corresponding set of $N_{QC}$ ground truth polygons, $G_{GT} = \{g'_j\}_{j=1}^{N_{QC}}$, where each $g'_j$ delineates the extent of the object associated with ground truth point $p_j$. This allows for the quantification of the spatial alignment between the detection bounding boxes and those associated with the QC points, in aggregate. The provided code details how we created this geometry and performed the calculations.
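For completeness, the metrics can be written out in terms of the sets defined above. The nearest-centroid matching used for the deviation is our reading of the workflow; the provided code is authoritative.

\[
d_j = \min_{c_k \in C_{Det}} \lVert p_j - c_k \rVert_2,
\qquad
\bar{d}_{\mathrm{cm}} = \frac{100}{N_{QC}} \sum_{j=1}^{N_{QC}} d_j
\]

\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]

\[
\mathrm{IoU}(g_k, g'_j) = \frac{\operatorname{area}(g_k \cap g'_j)}{\operatorname{area}(g_k \cup g'_j)}
\]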

Acknowledgments

We gratefully acknowledge the contributions of the open source community — thank you to the giants on whose shoulders we stand.

This work was funded by FiOR Innovations and Woodburn Nursery & Azaleas. We deeply appreciate their support and partnership.

Special thanks to Paniz Herrera, MBA, MSIST, for her invaluable suggestions.

We also thank Ryan Marinelli, PhD Fellow at the University of Oslo, for his assistance with proofreading and his insightful feedback.

Finally, to Danny Clifford, your insightful questions and targeted suggestions for improvement continue to be of tremendous value. Thank you.

Footnotes
  1. Output evaluation details are discussed in the Appendix.

  2. See requirements.txt for version details.

  3. Inference was accelerated using CUDA 12 (cuDF 25.2.1) on a T4 GPU within our Colab notebook environment.

  4. We use open source QGIS (QGIS Development Team, 2021) as our selected data viewer.

References
  1. Wu, Q., & Osco, L. P. (2023). samgeo: A Python package for segmenting geospatial data with the Segment Anything Model (SAM). Journal of Open Source Software, 8(89), 5663. https://doi.org/10.21105/joss.05663
  2. Osco, L. P., Wu, Q., de Lemos, E. L., Gonçalves, W. N., Ramos, A. P. M., Li, J., & Junior, J. M. (2023). The Segment Anything Model (SAM) for Remote Sensing Applications: From Zero to One Shot. https://doi.org/10.48550/arXiv.2306.16623
  3. Mayladan, A., Nasrallah, H., Moughnieh, H., Shukor, M., & Ghandour, A. J. (2024). Zero-Shot Refinement of Buildings’ Segmentation Models using SAM. https://doi.org/10.48550/arXiv.2310.01845
  4. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., & Girshick, R. (2023). Segment Anything. https://doi.org/10.48550/arXiv.2304.02643
  5. GDAL/OGR contributors. (2025). GDAL/OGR Geospatial Data Abstraction software Library. Open Source Geospatial Foundation. https://doi.org/10.5281/zenodo.5884351
  6. Gillies, S., van der Wel, C., Van den Bossche, J., Taves, M. W., Arnott, J., Ward, B. C., & others. (2025). Shapely (Version 2.1.1). https://doi.org/10.5281/zenodo.5597138
  7. Jordahl, K., den Bossche, J. V., Fleischmann, M., Wasserman, J., McBride, J., Gerard, J., Tratner, J., Perry, M., Badaracco, A. G., Farmer, C., Hjelle, G. A., Snow, A. D., Cochran, M., Gillies, S., Culbertson, L., Bartos, M., Eubank, N., maxalbert, Bilogur, A., … Leblanc, F. (2020). geopandas/geopandas: v0.8.1 (v0.8.1). Zenodo. https://doi.org/10.5281/zenodo.3946761
  8. Gillies, S., & others. (2013–). Rasterio: geospatial raster I/O for Python programmers. Mapbox. https://github.com/rasterio/rasterio
  9. Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55
  10. Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2
  11. Bradski, G. (2000). The OpenCV Library. Dr. Dobb’s Journal of Software Tools.
  12. Gazoni, E., & Clark, C. (2024). OpenPyXL: A Python library to read/write Excel 2010 xlsx/xlsm/xltx/xltm files. Python Package. https://openpyxl.readthedocs.io
  13. The Pandas Development Team. (2020). pandas-dev/pandas: Pandas (latest). Zenodo. https://doi.org/10.5281/zenodo.3509134
  14. McKinney, W. (2010). Data Structures for Statistical Computing in Python. In Stéfan van der Walt & Jarrod Millman (Eds.), Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a
  15. Clark, A. (2015). Pillow (PIL Fork) Documentation. readthedocs. https://buildmedia.readthedocs.org/media/pdf/pillow/latest/pillow.pdf