Publications

2025 – 2026

CVPR 2026

DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model

Zhou Tao*, Shida Wang*, Yongxiang Hua, Haoyu Cao, Linli Xu (* equal contribution)

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Multimodal LLM, Visual Grounding, Fine-Grained Perception, Computer Vision

Multimodal Large Language Models (MLLMs) have achieved impressive performance across various vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning capabilities remain limited. This work introduces DiG (Differential Grounding), a novel proxy task framework where MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs, without prior knowledge of the number of differences. To support scalable training, we develop an automated 3D-rendering-based data generation pipeline that produces high-quality paired images with fully controllable differences. Curriculum learning progressively increases complexity from single to multiple differences for stable optimization. Extensive experiments demonstrate that DiG significantly improves model performance across diverse visual perception benchmarks, with learned skills effectively transferring to RefCOCO, RefCOCO+, RefCOCOg, and general multimodal perception benchmarks.


arXiv 2025

FPEdit: Robust LLM Fingerprinting through Localized Parameter Editing

Shida Wang, Chaohu Liu, Yubo Wang, Linli Xu

arXiv preprint arXiv:2508.02092, 2025

LLM Security, Model Fingerprinting, Knowledge Editing, Model Protection

Large language models represent significant investments in computation, data, and engineering expertise, making them extraordinarily valuable intellectual assets. Nevertheless, these AI assets remain vulnerable to unauthorized redistribution and commercial exploitation through fine-tuning or black-box deployment. We introduce FPEdit, a novel framework that leverages knowledge editing to inject semantically coherent natural language fingerprints through sparse, targeted modifications to model weights. Our approach introduces Promote-Suppress Value Vector Optimization, which simultaneously enhances target token likelihood while suppressing competing tokens, ensuring robust fingerprint integration without degrading core model functionality. FPEdit achieves 95–100% fingerprint retention under both full-parameter fine-tuning and parameter-efficient adaptation, remains robust under quantization, pruning, and stochastic decoding, and can embed 10 fingerprint pairs into LLaMA2-7B in under 2 minutes using less than 30 GB of GPU memory.
