CLUSTER · TIER 3
Study exposes brittleness in RL-finetuned vision-language models under text perturbations.
Researchers show that RL-finetuned VLMs still have weak visual grounding and remain vulnerable to text perturbations: misleading captions or incorrect chain-of-thought traces cause substantial drops in robustness and confidence.
Sources
1
X mentions
—
First seen
2h ago
Velocity
+102%/6h