One-line definition
Object detection outputs, for each input image, a set of (bounding box, class label, confidence) tuples for the objects present. There are three architectural families: two-stage (the Faster R-CNN family: first propose regions, then classify them), one-stage (YOLO, RetinaNet: predict box and class in one pass at every grid cell), and set-prediction (DETR: a transformer encoder-decoder predicts a fixed-size set of boxes).
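The output contract above can be made concrete with a minimal sketch; the `Detection` type and the example values are illustrative, not any library's API:

```python
from typing import List, NamedTuple

class Detection(NamedTuple):
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels
    label: str    # class name
    score: float  # confidence in [0, 1]

# A detector maps one image to a variable-length set of these tuples.
detections: List[Detection] = [
    Detection(box=(34, 50, 180, 220), label="person", score=0.94),
    Detection(box=(200, 120, 310, 200), label="dog", score=0.81),
]
```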
Why it matters
Detection is the core production CV task: autonomous driving, retail analytics, surveillance, medical imaging, robotics. The architectural choice determines the latency-accuracy tradeoff and what is feasible within your inference budget.
Two-stage: Faster R-CNN (Ren et al., 2015)
Architecture:
- Backbone (ResNet, ConvNeXt) extracts feature map.
- Region Proposal Network (RPN) slides a small network over the feature map, predicts (objectness, box-refinement) at each anchor (preset boxes of various scales / aspect ratios).
- RoI Pooling / RoIAlign: extract fixed-size feature for each top-K proposed region.
- Classification + bbox regression head: per region.
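The RPN and the final head both regress offsets relative to an anchor using the standard R-CNN parameterization (centers shift proportionally to anchor size, sizes scale exponentially). A minimal decoding sketch, using (center_x, center_y, width, height) boxes:

```python
import math

def decode_box(anchor, deltas):
    """Apply predicted (tx, ty, tw, th) regression offsets to an anchor.

    anchor: (cx, cy, w, h) in pixels; deltas: the head's raw output.
    Center offsets are scaled by anchor size; width/height scale
    exponentially, so predictions are always positive.
    """
    cx_a, cy_a, w_a, h_a = anchor
    tx, ty, tw, th = deltas
    cx = cx_a + tx * w_a
    cy = cy_a + ty * h_a
    w = w_a * math.exp(tw)
    h = h_a * math.exp(th)
    return (cx, cy, w, h)

# Zero deltas reproduce the anchor unchanged.
decode_box((100, 100, 50, 30), (0, 0, 0, 0))
```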
Strengths: high accuracy; standard for COCO leaderboards; basis for Mask R-CNN (adds segmentation head).
Weaknesses: slow (two stages, hundreds of per-region computations per image); 5–30 FPS on a GPU.
Used when: accuracy matters more than latency; offline analytics; medical imaging.
One-stage: YOLO (Redmon et al., 2015 onward), RetinaNet (Lin et al., 2017)
Predict boxes and classes directly at every grid cell of the feature map, in a single forward pass:
- YOLO v1: grid, each cell predicts B bounding boxes + class.
- YOLO v3: multi-scale predictions across three feature-map levels.
- YOLO v5/v7/v8/v10/v12: ongoing engineering improvements (anchor-free, attention, distillation).
- RetinaNet: Focal Loss to handle the extreme class imbalance between foreground and background anchors.
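The focal loss bullet can be made concrete with a minimal per-anchor sketch (binary case; the default alpha = 0.25, gamma = 2 values follow the RetinaNet paper):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss (Lin et al., 2017) for a single anchor.

    p: predicted foreground probability; y: 1 for object, 0 for background.
    The (1 - p_t)**gamma factor down-weights easy, well-classified anchors,
    so the huge background majority cannot swamp the loss.
    """
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

# An easy background anchor (p = 0.01 foreground) contributes almost nothing;
# a hard misclassified one (p = 0.9 on background) dominates.
easy = focal_loss(0.01, y=0)
hard = focal_loss(0.9, y=0)
```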
Strengths: real-time (60–300+ FPS); simpler to deploy.
Weaknesses: historically lower accuracy than two-stage; gap closed by ~2020.
Used when: real-time required; embedded / edge deployment; autonomous driving.
Set prediction: DETR (Carion et al., 2020)
A transformer encoder-decoder that outputs a fixed-size set of predictions (e.g., N = 100 object queries):
- CNN backbone extracts features.
- Flatten to a sequence; transformer encoder processes.
- Transformer decoder takes learned object queries; cross-attends to encoder features.
- Each query produces a (box, class) prediction, including a “no object” class for unused queries.
- Loss: bipartite matching (Hungarian) between predictions and ground-truth boxes.
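The matching step can be sketched with a brute-force stand-in; real implementations use the O(n³) Hungarian algorithm (e.g., `scipy.optimize.linear_sum_assignment`), and the cost shown here is a placeholder for DETR's class + L1 + generalized-IoU cost:

```python
from itertools import permutations

def bipartite_match(cost):
    """Exhaustive minimum-cost matching on a square cost matrix.

    cost[i][j]: cost of assigning prediction i to ground truth j.
    (In DETR, predictions outnumber objects; the extra queries are
    matched to padded "no object" targets.) Brute force is fine for
    tiny examples only.
    """
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(enumerate(best))  # (prediction, ground-truth) index pairs

cost = [[0.1, 0.9],   # prediction 0 matches GT 0 cheaply
        [0.8, 0.2]]   # prediction 1 matches GT 1 cheaply
```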
Strengths: no anchors, no NMS; cleaner formulation; scales with transformer pretraining.
Weaknesses: slow training convergence; lower throughput than YOLO. Deformable DETR, DINO (DETR with denoising), Co-DETR addressed convergence and accuracy.
Used when: research / leaderboard work; integration with vision-language pretraining (Grounding DINO, OWL-ViT).
Key shared components
Bounding box parameterization
Either corner form (x_min, y_min, x_max, y_max) or center form (center_x, center_y, width, height). Critical for loss design.
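Converting between the two parameterizations is a common preprocessing step; a minimal sketch:

```python
def xyxy_to_cxcywh(box):
    """(x_min, y_min, x_max, y_max) -> (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

def cxcywh_to_xyxy(box):
    """(center_x, center_y, width, height) -> (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```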
IoU loss
Intersection over Union directly measures box overlap. Variants: GIoU, DIoU, CIoU. Handle non-overlapping boxes better than naive L1 / L2 on coordinates.
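Plain IoU in a few lines of pure Python; note that disjoint boxes score exactly 0 regardless of how far apart they are, which is the zero-gradient problem GIoU/DIoU/CIoU add penalty terms to fix:

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```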
Non-Maximum Suppression (NMS)
Post-processing for anchor-based detectors: remove duplicate boxes for the same object by keeping highest-confidence and suppressing others with IoU > threshold (typically 0.5). DETR-family removes the need for NMS by design.
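Greedy NMS as described above, sketched self-contained in pure Python:

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2) boxes.

    Repeatedly keep the highest-scoring remaining box and discard any
    box overlapping it above iou_threshold. Returns the kept indices.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best, order = order[0], order[1:]
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```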
Anchors
Predefined boxes of various scales / aspect ratios. The detector predicts offsets from these. Anchor-free methods (FCOS, CenterNet, YOLOv8) predict box centers directly.
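A minimal sketch of anchor generation at one feature-map location, using the 3 scales × 3 ratios layout of the original RPN (the specific scale/ratio values here are illustrative defaults):

```python
def make_anchors(cx, cy, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Anchors centered at (cx, cy): one (x1, y1, x2, y2) box per
    (scale, aspect ratio) pair.

    ratio = width / height; each anchor keeps area = scale**2, so
    changing the ratio reshapes the box without resizing it.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * r ** 0.5
            h = s / r ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# 3 scales x 3 ratios = 9 anchors per location, repeated at every
# feature-map cell.
```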
Datasets and metrics
- PASCAL VOC (legacy): 20 classes; mAP@IoU=0.5.
- COCO: 80 classes; metric: mAP averaged over IoU thresholds 0.5 → 0.95 in steps of 0.05 (the dominant benchmark as of 2026).
- Open Images, LVIS: large-vocabulary detection.
The standard evaluation: average precision per class, then mean across classes (mAP).
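A simplified AP sketch for one class at one IoU threshold, using plain area-under-the-PR-curve accumulation (the official COCO evaluator instead interpolates precision over 101 recall points, so numbers will differ slightly):

```python
def average_precision(scores, is_tp, num_gt):
    """AP for one class at a fixed IoU threshold.

    scores: confidences of all detections for this class;
    is_tp: whether each detection matched an unmatched ground-truth box;
    num_gt: total ground-truth boxes. mAP averages this over classes
    (and over IoU thresholds 0.5:0.95 for COCO).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for i in order:  # sweep the confidence threshold from high to low
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recall = tp / num_gt
        precision = tp / (tp + fp)
        ap += precision * (recall - prev_recall)  # rectangle under PR curve
        prev_recall = recall
    return ap
```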
Production considerations
- Latency: YOLO-class models for real-time; Faster R-CNN for accuracy.
- Class imbalance: many backgrounds, few foregrounds. Focal loss (RetinaNet) addresses this.
- Small objects: hardest case; multi-scale features (FPN, Feature Pyramid Network) help.
- Open vocabulary: align detector classes with text embeddings (CLIP-style) for zero-shot detection (OWL-ViT, Grounding DINO).
Common pitfalls
- Training on small datasets without strong augmentation. Detection is data-hungry; mosaic, mixup, and copy-paste augmentations are standard.
- Confusing IoU thresholds. mAP@0.5 ≠ mAP@0.5:0.95; specify which.
- Forgetting NMS or mis-setting its threshold. Suppression that is too aggressive merges nearby distinct objects; too loose leaves duplicates.
- Using the wrong evaluation tool. COCO eval and VOC eval differ; use the one for your reporting benchmark.
- Treating YOLO as a single algorithm. Many YOLO versions exist with very different performance; cite the specific version.
Related
- CNN architecture. Backbone.
- Vision transformers. Alternative backbone.