DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model

Published in CVPR 2026, 2025

Recommended citation: Tao, Z., Wang, S., Hua, Y., Cao, H., & Xu, L. (2026). "DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://arxiv.org/pdf/2512.12633

Multimodal Large Language Models (MLLMs) have achieved impressive performance across various vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning capabilities remain limited. This work introduces DiG (Differential Grounding), a novel proxy task framework where MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs, without prior knowledge of the number of differences.
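To make the task format concrete, the following is a minimal sketch of how a variable-length set of predicted difference boxes might be scored against the ground-truth differences. The `(x1, y1, x2, y2)` box format, the greedy IoU matching, and the 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_differences(pred, gt, thresh=0.5):
    """Greedily match predicted boxes to ground-truth difference boxes.

    Hypothetical scoring helper: each ground-truth box can be matched at
    most once, and a match requires IoU >= thresh. Returns (precision, recall).
    """
    unmatched = list(range(len(gt)))
    hits = 0
    for p in pred:
        best, best_iou = None, thresh
        for j in unmatched:
            score = iou(p, gt[j])
            if score >= best_iou:
                best, best_iou = j, score
        if best is not None:
            unmatched.remove(best)
            hits += 1
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gt) if gt else 0.0
    return precision, recall
```

Because the model is not told how many differences exist, both precision (penalizing spurious boxes) and recall (penalizing missed differences) matter in a score of this kind.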

To support scalable training, we develop an automated 3D-rendering-based data generation pipeline that produces high-quality paired images with fully controllable differences. Because difference signals are sparse, we further employ curriculum learning, progressively increasing task complexity from a single difference to multiple differences to keep optimization stable.
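The curriculum idea can be sketched as a schedule that caps the number of differences per training pair and raises that cap as training progresses. The linear schedule, the default cap of 5, and the function names below are assumptions for illustration; the paper's actual schedule may differ.

```python
import random


def max_differences(step, total_steps, max_diffs=5):
    """Linearly raise the allowed number of differences from 1 to max_diffs.

    Assumes total_steps > 0. Hypothetical schedule, not the paper's.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return 1 + int(frac * (max_diffs - 1))


def sample_num_differences(step, total_steps, max_diffs=5, rng=random):
    """Sample how many differences the next image pair should contain."""
    return rng.randint(1, max_differences(step, total_steps, max_diffs))
```

Early in training every sampled pair contains exactly one difference; later batches mix easy and hard pairs because the count is drawn uniformly from 1 up to the current cap.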

Key Results:

  • Significantly improves model performance across diverse visual perception benchmarks
  • Learned fine-grained perception skills transfer effectively to the grounding benchmarks RefCOCO, RefCOCO+, and RefCOCOg
  • Achieves strong performance on general multimodal perception benchmarks

Paper (arXiv): https://arxiv.org/pdf/2512.12633