VisReflect

Abstract

Large Vision-Language Models (LVLMs) have achieved remarkable success on vision–language tasks, yet fine-grained perception over high-resolution images and long-context videos remains challenging. As the number of visual tokens increases, the visual attention sink phenomenon becomes more severe, causing irrelevant tokens to absorb a disproportionate share of the attention mass. Recent approaches attempt to mitigate this issue by explicitly predicting bounding boxes or temporal spans and re-encoding the cropped visual regions. However, such methods depend on unreliable numeric localization in the discrete token space and incur significant computational overhead from the additional forward passes. In this work, we propose VisReflect, a simple yet effective framework that improves fine-grained perception in long visual contexts through latent visual reflection. Instead of decoding intermediate predictions into discrete tokens, the model generates continuous visual reflections that represent question-relevant visual features in the latent space. These reflections selectively emphasize salient regions or frames, guiding attention toward relevant visual tokens within a single forward pass. We conduct comprehensive evaluations on challenging high-resolution image benchmarks, including BLINK, V*, and HRBench-4K/8K, as well as video understanding benchmarks such as MVBench, VideoMME, and MLVU. Our method consistently improves over strong baselines, achieving gains of 4.1% on image benchmarks and 1.8% on video benchmarks. Compared with zooming-based methods, our model achieves comparable performance while reducing inference time by roughly 44% on video understanding.

Overview

Existing zooming-based methods localize relevant regions or frames by predicting coordinates in discrete token space and performing repeated forward passes. In contrast, VisReflect generates latent visual reflection tokens in continuous visual space, enabling the model to internally recall question-relevant visual features within a single forward pass.

Model Design

Given visual tokens extracted from a high-resolution image or a long-form video together with a textual query, an LVLM generates a sequence of visual reflection tokens between a pair of special delimiter tokens. These tokens are trained to approximate the latent visual representations of the region of interest (for images) or the frames of interest (for videos) that are relevant to the question. Instead of explicitly cropping images or re-encoding selected frames, the model recalls question-relevant visual features directly in the latent embedding space using its last hidden states. The reflected visual features then guide the model’s attention toward informative parts of the visual context, enabling accurate fine-grained perception in high-resolution images and long-form videos while requiring only a single forward pass.
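The latent feedback loop described above can be illustrated with a small, dependency-free sketch. It is not the VisReflect implementation: `toy_last_hidden` is a hypothetical stand-in for the LVLM (a single attention read by the last position), `generate_reflections` and `alignment_loss` are illustrative names, and the cosine-based objective is an assumption, since the text only states that reflection tokens are trained to approximate the latent features of the region or frames of interest.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_last_hidden(seq):
    """Hypothetical stand-in for the LVLM: the last position attends over
    the whole sequence and returns the attention-weighted average."""
    query = seq[-1]
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(d) for tok in seq]
    weights = softmax(scores)
    return [sum(w * tok[i] for w, tok in zip(weights, seq)) for i in range(d)]


def generate_reflections(step_fn, context, start_embed, num_reflections):
    """Roll out continuous reflection tokens. Each last hidden state is
    appended to the sequence as the next input embedding, so nothing is
    decoded into a discrete token and nothing is cropped or re-encoded."""
    seq = list(context) + [start_embed]  # copy, so the caller's context is untouched
    reflections = []
    for _ in range(num_reflections):
        hidden = step_fn(seq)
        reflections.append(hidden)
        seq.append(hidden)  # continuous feedback: no argmax, no embedding lookup
    return reflections


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(reflections, targets):
    """Assumed training signal: one minus the mean cosine similarity between
    each reflection and the target feature of the region or frame of interest."""
    sims = [cosine(r, t) for r, t in zip(reflections, targets)]
    return 1.0 - sum(sims) / len(sims)


# Toy usage: three 2-d "visual tokens" followed by a start-delimiter embedding.
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
start_embed = [1.0, 0.0]
reflections = generate_reflections(toy_last_hidden, context, start_embed, 3)
loss = alignment_loss(reflections, [[1.0, 0.0]] * 3)
```

In this toy setting the loss falls as the reflections point toward the target feature direction. A real model would replace `toy_last_hidden` with its own decoder step, and would take the targets from the visual encoder's features for the annotated region or frames.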

VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context

Image Understanding

Long Video Understanding

Attention Visualization