Object detection: Faster R-CNN, YOLO, DETR

Localize and classify objects in an image. The three main architectural families: two-stage proposal-based, one-stage grid-based, and transformer-based.

One-line definition

Object detection outputs, for each input image, a set of (bounding box, class label, confidence) tuples for the objects present. The three architectural families: two-stage (the Faster R-CNN family: first propose regions, then classify them), one-stage (YOLO, RetinaNet: predict box and class in one pass at every grid cell), and set-prediction (DETR: a transformer encoder-decoder predicts a fixed-size set of boxes).

Why it matters

Detection is the core production CV task: autonomous driving, retail analytics, surveillance, medical imaging, robotics. The architectural choice determines latency-accuracy tradeoffs and what’s possible in your inference budget.

Two-stage: Faster R-CNN (Ren et al., 2015)

Architecture:

  1. Backbone (ResNet, ConvNeXt) extracts feature map.
  2. Region Proposal Network (RPN) slides a small network over the feature map, predicts (objectness, box-refinement) at each anchor (preset boxes of various scales / aspect ratios).
  3. RoI Pooling / RoIAlign: extract fixed-size feature for each top-K proposed region.
  4. Per-region head: classification + bounding-box regression.
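
A minimal inference sketch using torchvision's reference implementation (the weights flag and output format are torchvision's; the random tensor stands in for a real image):

```python
import torch
import torchvision

# COCO-pretrained Faster R-CNN with a ResNet-50 + FPN backbone
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

images = [torch.rand(3, 480, 640)]  # stand-in for a real normalized image
with torch.no_grad():
    preds = model(images)
# one dict per image: 'boxes' as (x1, y1, x2, y2), 'labels', 'scores'
print(preds[0]["boxes"].shape, preds[0]["scores"][:5])
```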

Strengths: high accuracy; standard for COCO leaderboards; basis for Mask R-CNN (adds segmentation head).

Weaknesses: slow (two stages, hundreds of region computations per image); 5–30 FPS on a GPU.

Used when: accuracy matters more than latency; offline analytics; medical imaging.

One-stage: YOLO (Redmon et al., 2015 onward), RetinaNet (Lin et al., 2017)

Predict boxes and classes directly at every grid cell of the feature map, in a single forward pass:

  • YOLO v1: 7×7 grid; each cell predicts B bounding boxes plus class probabilities.
  • YOLO v3: multi-scale predictions across three feature-map levels.
  • YOLO v5/v7/v8/v10/v12: ongoing engineering improvements (anchor-free, attention, distillation).
  • RetinaNet: Focal Loss to handle the extreme class imbalance between foreground and background anchors.
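
The focal-loss idea in the RetinaNet bullet fits in a few lines. A minimal sketch (the function name is ours; `alpha = 0.25`, `gamma = 2.0` are the paper's defaults):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy, mostly-background anchors."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)        # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# toy usage: 1,000 anchors, ~1% foreground, mirroring the imbalance it targets
logits = torch.randn(1000)
targets = (torch.rand(1000) < 0.01).float()
print(focal_loss(logits, targets))
```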

Strengths: real-time (60–300+ FPS); simpler to deploy.

Weaknesses: historically lower accuracy than two-stage; gap closed by ~2020.

Used when: real-time required; embedded / edge deployment; autonomous driving.
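
To illustrate the deployment story, a sketch using the third-party `ultralytics` package (the weights file and image path here are illustrative):

```python
from ultralytics import YOLO  # pip install ultralytics

model = YOLO("yolov8n.pt")         # nano variant, downloaded on first use
results = model("street.jpg")      # one forward pass, NMS included
for box in results[0].boxes:
    print(box.xyxy, box.conf, box.cls)  # corners, confidence, class index
```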

Set prediction: DETR (Carion et al., 2020)

A transformer encoder-decoder that outputs a fixed-size set of predictions (e.g., N = 100 object queries):

  1. CNN backbone extracts features.
  2. Flatten to a sequence; transformer encoder processes.
  3. Transformer decoder takes learned object queries; cross-attends to encoder features.
  4. Each query produces (box, class), including a “no object” class for unused queries.
  5. Loss: bipartite matching (Hungarian) between predictions and ground-truth boxes.
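
A simplified sketch of step 5 using SciPy's Hungarian solver (DETR's real cost also includes a GIoU term; the 5.0 box weight mirrors its box-loss coefficient):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes):
    """pred_probs: (Q, C) softmax scores; boxes: (N, 4) normalized (cx, cy, w, h)."""
    cost_class = -pred_probs[:, gt_labels]                    # (Q, num_gt)
    cost_box = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)
    cost = cost_class + 5.0 * cost_box
    q_idx, gt_idx = linear_sum_assignment(cost)               # optimal 1:1 matching
    return q_idx, gt_idx  # query q_idx[i] is matched to ground truth gt_idx[i]
```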

Strengths: no anchors, no NMS; cleaner formulation; scales with transformer pretraining.

Weaknesses: slow training convergence; lower throughput than YOLO. Deformable DETR, DINO (DETR with denoising), Co-DETR addressed convergence and accuracy.

Used when: research / leaderboard work; integration with vision-language pretraining (Grounding DINO, OWL-ViT).
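
Hands-on, the original checkpoint is available through Hugging Face transformers; a sketch (the image path is illustrative):

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("street.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# confidence thresholding replaces NMS: low-scoring queries fall into "no object"
results = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.9)[0]
print(results["boxes"], results["scores"], results["labels"])
```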

Key shared components

Bounding box parameterization

Either corner format (x_min, y_min, x_max, y_max) or center format (c_x, c_y, w, h). Critical for loss design.
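
Converting between the two (function names are ours):

```python
def xyxy_to_cxcywh(x1, y1, x2, y2):
    """Corner format -> center format."""
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

def cxcywh_to_xyxy(cx, cy, w, h):
    """Center format -> corner format."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

print(xyxy_to_cxcywh(10, 10, 50, 30))  # (30.0, 20.0, 40, 20)
```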

IoU loss

Intersection over Union directly measures box overlap. Variants (GIoU, DIoU, CIoU) handle non-overlapping boxes better than naive L1/L2 on coordinates.
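
A minimal IoU in corner format; GIoU extends it with a penalty based on the smallest enclosing box, so disjoint boxes still produce a gradient:

```python
def iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```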

Non-Maximum Suppression (NMS)

Post-processing for anchor-based detectors: remove duplicate boxes for the same object by keeping the highest-confidence box and suppressing others whose IoU with it exceeds a threshold (typically 0.5). The DETR family removes the need for NMS by design.
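
torchvision ships this as a ready-made op; a small sketch with hand-picked boxes:

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 50., 50.],
                      [12., 12., 52., 52.],       # near-duplicate of box 0
                      [100., 100., 150., 150.]])  # a separate object
scores = torch.tensor([0.90, 0.80, 0.95])
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([2, 0]): the duplicate (index 1) is suppressed
```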

Anchors

Predefined boxes of various scales / aspect ratios. The detector predicts offsets from these. Anchor-free methods (FCOS, CenterNet, YOLOv8) predict box centers directly.
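
A sketch of anchor generation at one feature-map location (scales and ratios chosen for illustration; conventions vary across codebases):

```python
import itertools

def anchors_at(cx, cy, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Anchors (x1, y1, x2, y2) centered at (cx, cy): area = scale^2, w/h = ratio."""
    out = []
    for s, r in itertools.product(scales, ratios):
        w, h = s * r ** 0.5, s / r ** 0.5
        out.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return out

print(len(anchors_at(100, 100)))  # 9 anchors per location
```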

Datasets and metrics

  • PASCAL VOC (legacy): 20 classes; mAP@IoU=0.5.
  • COCO: 80 classes; mAP averaged over IoU thresholds 0.5 → 0.95 (the dominant benchmark as of 2026).
  • Open Images, LVIS: large-vocabulary detection.

The standard evaluation: average precision per class, then mean across classes (mAP).
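
The COCO-vs-VOC difference in one sketch (the AP table here is random stand-in data; real APs come from matching predictions to ground truth at each threshold):

```python
import numpy as np

iou_thresholds = np.linspace(0.5, 0.95, 10)   # 0.50, 0.55, ..., 0.95
ap = np.random.rand(80, len(iou_thresholds))  # hypothetical per-class AP table

map_voc = ap[:, 0].mean()   # mAP@0.5: mean over classes at a single threshold
map_coco = ap.mean()        # mAP@0.5:0.95: also averaged over the 10 thresholds
print(map_voc, map_coco)
```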

Production considerations

  • Latency: YOLO-class models for real-time; Faster R-CNN for accuracy.
  • Class imbalance: many backgrounds, few foregrounds. Focal loss (RetinaNet) addresses this.
  • Small objects: hardest case; multi-scale features (FPN, Feature Pyramid Network) help.
  • Open vocabulary: align detector classes with text embeddings (CLIP-style) for zero-shot detection (OWL-ViT, Grounding DINO).

Common pitfalls

  • Training on small datasets without strong augmentation. Detection is data-hungry; mosaic, mixup, and copy-paste augmentations are standard.
  • Confusing IoU thresholds. mAP@0.5 ≠ mAP@0.5:0.95; specify which.
  • Forgetting NMS / NMS thresholds. Too aggressive a threshold suppresses distinct overlapping objects; too loose a threshold leaves duplicates.
  • Using the wrong evaluation tool. COCO eval and VOC eval differ; use the one for your reporting benchmark.
  • Treating YOLO as a single algorithm. Many YOLO versions exist with very different performance; cite the specific version.