Publications
2025 – 2026
CVPR
DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
Multimodal LLM, Visual Grounding, Fine-Grained Perception, Computer Vision
Multimodal Large Language Models (MLLMs) have achieved impressive performance across various vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning capabilities remain limited. This work introduces DiG (Differential Grounding), a novel proxy task framework where MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs, without prior knowledge of the number of differences. To support scalable training, we develop an automated 3D-rendering-based data generation pipeline that produces high-quality paired images with fully controllable differences. Curriculum learning progressively increases complexity from single to multiple differences for stable optimization. Extensive experiments demonstrate that DiG significantly improves model performance across diverse visual perception benchmarks, with learned skills effectively transferring to RefCOCO, RefCOCO+, RefCOCOg, and general multimodal perception benchmarks.
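Because the number of differences is not known in advance, a prediction has to be judged as a set of boxes rather than a single box. The sketch below shows one way such a set could be scored against the ground-truth differences, using greedy IoU matching. The function names, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are illustrative assumptions, not the metric or reward used in the paper.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score_differences(predicted, ground_truth, iou_threshold=0.5):
    """Greedily match predicted boxes to ground-truth boxes.

    Hypothetical scoring helper (an assumption, not from the paper).
    Returns (precision, recall, f1) over the matched pairs.
    """
    # All candidate pairs, highest overlap first.
    pairs = sorted(
        ((iou(p, g), i, j)
         for i, p in enumerate(predicted)
         for j, g in enumerate(ground_truth)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    for overlap, i, j in pairs:
        if overlap < iou_threshold:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_gt:
            continue  # each box can be matched at most once
        used_pred.add(i)
        used_gt.add(j)
    matches = len(used_gt)
    precision = matches / len(predicted) if predicted else 0.0
    recall = matches / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return precision, recall, f1
```

Under this scheme, a model that reports too many differences loses precision and one that misses differences loses recall, so neither over- nor under-counting is rewarded.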
arXiv 2025
arXiv preprint arXiv:2508.02092, 2025
LLM Security, Model Fingerprinting, Knowledge Editing, Model Protection
Large language models represent significant investments in computation, data, and engineering expertise, making them extraordinarily valuable intellectual assets. Nevertheless, they remain vulnerable to unauthorized redistribution and commercial exploitation through fine-tuning or black-box deployment. We introduce FPEdit, a novel framework that leverages knowledge editing to inject semantically coherent natural language fingerprints through sparse, targeted modifications to model weights. Our approach introduces Promote-Suppress Value Vector Optimization, which raises the likelihood of the target token while suppressing competing tokens, ensuring robust fingerprint integration without degrading core model functionality. FPEdit achieves 95–100% fingerprint retention under both full-parameter fine-tuning and parameter-efficient adaptation, remains robust under quantization, pruning, and stochastic decoding, and can embed 10 fingerprint pairs into LLaMA2-7B in under 2 minutes using less than 30 GB of GPU memory.
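The promote-suppress idea can be illustrated with a toy objective over a single next-token distribution. The abstract does not state the actual loss, so the form below (negative log-likelihood of the target plus a weighted mean probability of competing tokens) is an assumption made for illustration only, and `promote_suppress_loss` and `suppress_weight` are hypothetical names.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def promote_suppress_loss(logits, target_id, competitor_ids, suppress_weight=1.0):
    """Toy objective (an assumption, not the paper's formulation).

    Promote term: negative log-probability of the target token.
    Suppress term: mean probability assigned to the competing tokens.
    """
    log_probs = log_softmax(logits)
    promote = -log_probs[target_id]
    if competitor_ids:
        suppress = sum(math.exp(log_probs[c]) for c in competitor_ids) / len(competitor_ids)
    else:
        suppress = 0.0
    return promote + suppress_weight * suppress


# Logits that favor the target give a lower loss than logits that favor a competitor.
good = promote_suppress_loss([5.0, 0.0, 0.0], target_id=0, competitor_ids=[1, 2])
bad = promote_suppress_loss([0.0, 5.0, 0.0], target_id=0, competitor_ids=[1, 2])
print(good < bad)
```

In an editing setting, a loss of this kind would be minimized with respect to the edited value vector rather than the logits directly; that optimization step is outside this sketch.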