Shida Wang

I am Shida Wang (王世炟), an M.S. student at the School of Artificial Intelligence and Data Science, University of Science and Technology of China (USTC), advised by Prof. Linli Xu. My research spans Multimodal Large Language Models, Video Understanding, Token Compression, and LLM Security, with a focus on building efficient and trustworthy vision-language systems. Outside research, I enjoy basketball and R&B music.

News

Jun 2026

Started as a dots · Ace Top Intern at Xiaohongshu (RED), Shanghai.

Mar 2026

Released SCORE (Dynamic Token Compression for video) on arXiv.

Feb 2026

Two papers (TableMix, DiG) accepted to CVPR 2026.

Oct 2025

Joined Tencent Youtu Lab as a Research Intern.

Aug 2025

Released FPEdit (robust LLM fingerprinting) on arXiv.

Research Interests

Multimodal LLMs · Video Understanding · Token Compression · Efficient Inference · LLM Security

Publications

arXiv 2026

Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning

Shida Wang*, Yongxiang Hua*, Zhou Tao, Haoyu Cao, Linli Xu (* equal contribution)

An RL-learned policy compresses video tokens, giving a 16× prefill speedup while preserving 99.5% of original performance at a 10% retention ratio.

Multimodal Large Language Models have demonstrated remarkable capabilities in video understanding, yet face prohibitive computational costs and performance degradation from "context rot" due to massive visual token redundancy. Existing compression strategies typically rely on heuristics or fixed transformations that are often decoupled from the downstream task objectives. We propose SCORE (Surprise-augmented token COmpression via REinforcement learning), a unified framework that learns an adaptive token compression policy. SCORE introduces a lightweight policy network conditioned on a surprise-augmented state representation that incorporates inter-frame residuals to capture temporal dynamics and motion saliency, optimized via a group-wise reinforcement learning scheme with a split-advantage estimator and a static→real two-stage curriculum. SCORE achieves a 16× prefill speedup while preserving 99.5% of original performance at a 10% retention ratio.
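To make the ideas of an inter-frame residual and a retention ratio concrete, here is a minimal sketch in plain Python. It is not SCORE's learned policy: it swaps in a fixed heuristic that keeps the tokens that changed most since the previous frame. The names `residual_scores` and `select_tokens`, the L1 distance, and the choice to compare the first frame against an all-zero frame are all assumptions made for illustration.

```python
from typing import List, Tuple

Frame = List[List[float]]  # one frame: a list of token feature vectors


def residual_scores(frames: List[Frame]) -> List[List[float]]:
    """Score each token by its L1 distance to the same position in the previous frame.

    Assumption: the first frame is compared against an all-zero frame.
    """
    scores: List[List[float]] = []
    prev = None
    for frame in frames:
        row = []
        for i, token in enumerate(frame):
            ref = prev[i] if prev is not None else [0.0] * len(token)
            row.append(sum(abs(a - b) for a, b in zip(token, ref)))
        scores.append(row)
        prev = frame
    return scores


def select_tokens(frames: List[Frame], retention_ratio: float = 0.1) -> List[Tuple[int, int]]:
    """Return (frame, token) indices of the highest-residual tokens, in temporal order."""
    scores = residual_scores(frames)
    # Rank by descending score; ties are broken by (frame, token) index.
    ranked = sorted(
        (-s, t, i) for t, row in enumerate(scores) for i, s in enumerate(row)
    )
    # Keep at least one token so the output is never empty.
    k = max(1, round(retention_ratio * len(ranked)))
    return sorted((t, i) for _, t, i in ranked[:k])
```

Calling `select_tokens(frames, retention_ratio=0.1)` on a clip with 10 tokens in total keeps a single token. A learned policy as described in the abstract would replace the fixed ranking with a network trained against the downstream task.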

arXiv · PDF

CVPR 2026

TableMix: Enhancing Multimodal Table Reasoning in MLLMs from a Data-Centric Perspective

Chaohu Liu, Shida Wang, Yubo Wang, Linli Xu

A data-centric augmentation strategy that boosts multimodal table reasoning in MLLMs.

PDF · CVF

CVPR 2026

DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Model

Zhou Tao*, Shida Wang*, Yongxiang Hua, Haoyu Cao, Linli Xu (* equal contribution)

MLLMs learn fine-grained perception by localizing all differences between similar image pairs.

Multimodal Large Language Models (MLLMs) have achieved impressive performance across vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning remain limited. We introduce DiG (Differential Grounding), a proxy task framework where MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs, without prior knowledge of the number of differences. To support scalable training, we develop an automated 3D-rendering-based data generation pipeline and employ curriculum learning for stable optimization. DiG significantly improves performance on RefCOCO, RefCOCO+, RefCOCOg, and general multimodal perception benchmarks.

arXiv · PDF

arXiv 2025

FPEdit: Robust LLM Fingerprinting through Localized Parameter Editing

Shida Wang, Chaohu Liu, Yubo Wang, Linli Xu

Knowledge editing injects robust natural-language fingerprints into LLMs in under 2 minutes.

We introduce FPEdit, a framework that leverages knowledge editing to inject semantically coherent natural language fingerprints through sparse, targeted modifications to model weights. Our Promote-Suppress Value Vector Optimization achieves 95–100% fingerprint retention under full-parameter fine-tuning and parameter-efficient adaptation, remaining robust under quantization, pruning, and stochastic decoding. FPEdit embeds 10 fingerprint pairs into LLaMA2-7B in under 2 minutes using less than 30 GB GPU memory.

arXiv · PDF

Internships

June 2026 – Present

dots · Ace Top Intern

Xiaohongshu (RED) · Shanghai, China

Oct. 2025 – May 2026

Research Intern

Tencent Youtu Lab · Hefei, China

Education

Sept. 2024 – Present

M.S. in Artificial Intelligence

USTC · School of Artificial Intelligence and Data Science, Hefei

Sept. 2021 – July 2024

B.S. in Artificial Intelligence

USTC · School of Artificial Intelligence and Data Science, Hefei

Sept. 2020 – July 2021

B.S. in Management (transferred)

USTC · School of Management, Hefei

Honors & Awards

2023 China Petroleum Scholarship · Top 1

2023 Soong Ching Ling Future Scholarship

2022 Third Prize, Contemporary Undergraduate Mathematical Contest in Modeling, Anhui Province

2019 First Prize, National High School Mathematics Competition, Jilin Province

2019 Second Prize, National High School Physics Competition, Jilin Province

2019 Second Prize, China Chemistry Olympiad (Preliminary Round), Jilin Province