RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes

1General Motors, 2Hebrew University of Jerusalem
Findings of CVPR 2026

Abstract

Monocular Metric Depth Estimation (MMDE) is essential for physically intelligent systems, yet accurate depth estimation for underrepresented classes in complex scenes remains a persistent challenge. To address this, we propose RAD, a retrieval-augmented framework that approximates the benefits of multi-view stereo by utilizing retrieved neighbors as structural geometric proxies. Our method first employs an uncertainty-aware retrieval mechanism to identify low-confidence regions in the input and retrieve RGB-D context samples containing semantically similar content. We then process both the input and retrieved context via a dual-stream network and fuse them using a matched cross-attention module, which transfers geometric information only at reliable point correspondences. Evaluations on NYU Depth v2, KITTI, and Cityscapes demonstrate that RAD significantly outperforms state-of-the-art baselines on underrepresented classes, reducing relative absolute error by 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while maintaining competitive performance on standard in-domain benchmarks.

Method

Given an input image, RAD (built on a DepthAnything v2 backbone) retrieves context views for highly uncertain objects of underrepresented classes (e.g., candles) to serve as structural geometric proxies. A dual-stream network then fuses the input with this retrieved context to produce accurate monocular metric depth, correcting the uncertain regions where the direct DepthAnything v2 baseline fails.

Teaser
RAD uses retrieved images and their ground-truth depth to improve depth estimation for underrepresented classes.

Pipeline

Given an input image, a set of context samples is sourced using either uncertainty-aware image retrieval (at both training and inference) or 3D augmentation (only during training). Subsequently, spatial correspondences are established between the input and the context samples. A dual-stream depth estimation network employing matched cross-attention then infers depth from both streams. Blue blocks indicate components used during both training and inference, while the green block is used only during training.
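The inference flow above can be summarized as a small orchestration sketch. Note that `rad_infer` and its three callables (`retrieve_fn`, `match_fn`, `depth_net`) are hypothetical placeholders standing in for the paper's components, not the actual implementation:

```python
import numpy as np

def rad_infer(image, context_set, retrieve_fn, match_fn, depth_net):
    """Skeleton of RAD's inference flow. All callables are hypothetical
    stand-ins for the paper's components:
      retrieve_fn -- uncertainty-aware retrieval of context samples
      match_fn    -- spatial correspondence estimation
      depth_net   -- dual-stream network with matched cross-attention
    """
    context = retrieve_fn(image, context_set)   # 1. source context samples
    matches = match_fn(image, context)          # 2. establish correspondences
    return depth_net(image, context, matches)   # 3. predict metric depth
```

At training time, the 3D-augmentation branch would replace `retrieve_fn` as the source of context samples.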

System architecture
System architecture of RAD.

Uncertainty-aware Retrieval Flow

Pixel-wise depth uncertainty is computed in parallel with image segmentation. We keep only highly uncertain segments and mask out the rest of the image. Given the masked image, we retrieve relevant examples from the context/training set using DINO descriptors.
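A minimal sketch of the two steps, masking uncertain segments and retrieving nearest neighbors by descriptor similarity. The function names, the threshold `tau`, and the use of mean per-segment uncertainty and cosine similarity are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def mask_uncertain_segments(image, uncertainty, segment_ids, tau=0.5):
    """Keep only pixels of segments whose mean depth uncertainty exceeds tau."""
    keep = np.zeros_like(uncertainty, dtype=bool)
    for sid in np.unique(segment_ids):
        region = segment_ids == sid
        if uncertainty[region].mean() > tau:
            keep |= region
    masked = image.copy()
    masked[~keep] = 0.0  # zero out confident regions
    return masked, keep

def retrieve_neighbors(query_desc, context_descs, k=2):
    """Cosine-similarity nearest neighbors over image descriptors (e.g., DINO)."""
    q = query_desc / np.linalg.norm(query_desc)
    c = context_descs / np.linalg.norm(context_descs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k]  # indices of the k most similar context samples
```

In practice the descriptor of the masked image would be computed with a frozen DINO encoder and compared against precomputed descriptors of the context set.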

Uncertainty-aware retrieval
Uncertainty-aware retrieval of relevant context samples.

Matched Cross-attention

(a) The modified attention architecture transfers information from the context stream to the input stream: for each token j in the input image, attention is computed over key/value matrices formed by concatenating the input's own keys/values with the matched context keys/values for j. (b) Matched tokens are those within a spatial neighborhood of the corresponding point in the context image.
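The per-token concatenation in (a) can be sketched as follows. This is a single-head, NumPy-only illustration under stated assumptions: `match[j]` holds the context-token indices in token j's neighborhood, and the actual model would use multi-head attention inside the dual-stream network:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def matched_cross_attention(q_in, k_in, v_in, k_ctx, v_ctx, match):
    """For each input token j, attend over the input's own keys/values
    concatenated with the context keys/values matched to j."""
    d = q_in.shape[-1]
    out = np.empty_like(v_in)
    for j in range(q_in.shape[0]):
        # keys/values: all input tokens + only the context tokens matched to j
        K = np.concatenate([k_in, k_ctx[match[j]]], axis=0)
        V = np.concatenate([v_in, v_ctx[match[j]]], axis=0)
        attn = softmax(q_in[j] @ K.T / np.sqrt(d))
        out[j] = attn @ V
    return out
```

Restricting each token to its matched context neighborhood is what limits geometric transfer to reliable point correspondences.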

Matched cross-attention
Matched cross-attention.

Results

Underrepresented Classes Evaluation

Underrepresented classes results
Performance on underrepresented classes.

All-classes Evaluation

All classes results
Performance across all classes.

Qualitative Results

Qualitative depth results
Qualitative results.

Citation

BibTeX