Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding

King Abdullah University of Science and Technology · Meta AI
NeurIPS 2025 Spotlight

Abstract

Understanding and reasoning over long videos pose significant challenges for large video language models (LVLMs), owing to the difficulty of processing large numbers of video tokens beyond the context window and of retaining long-term sequential information. Retrieval-Augmented Generation (RAG) has proven effective at processing long context for Large Language Models (LLMs); however, applying RAG to long videos faces challenges such as disrupted temporal dependencies and the inclusion of irrelevant information, both of which can hinder accurate reasoning. To address these limitations, we propose Vgent, a novel graph-based retrieval-reasoning-augmented generation framework that enhances LVLMs for long video understanding. Our approach introduces two key innovations: (i) it represents videos as structured graphs that preserve semantic relationships across video clips, improving retrieval effectiveness; (ii) it introduces an intermediate reasoning step to mitigate the reasoning limitations of LVLMs, leveraging structured verification to reduce retrieval noise and to explicitly aggregate relevant information across clips, yielding more accurate and context-aware responses. We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yields an overall performance improvement of 3.0%–5.4% over base models on MLVU and outperforms state-of-the-art video RAG methods by 8.6%.

Vgent Pipeline

The Vgent pipeline: a graph-based retrieval-reasoning-augmented generation framework for long-context video understanding. It consists of four key stages: (1) Offline video graph construction: builds a video graph by extracting structured knowledge from the long video. (2) Graph-based retrieval: retrieves relevant clips based on keywords extracted from the user query. (3) Structured reasoning: refines the retrieved clips using structured queries and aggregates cross-clip information. (4) Multimodal augmented generation: combines the refined clips and reasoning results to generate the final response.
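To make the four stages concrete, below is a minimal Python sketch of the pipeline's control flow. It is an illustration under stated assumptions, not the paper's implementation: caption_clip, extract_entities, lvlm_verify, and lvlm_answer are hypothetical stand-ins for an LVLM captioner, a keyword/entity extractor, and LVLM calls, and linking clips that share extracted entities is one plausible way to encode the semantic relationships the paper describes.

    # Minimal sketch of the four-stage pipeline. caption_clip, extract_entities,
    # lvlm_verify, and lvlm_answer are hypothetical stand-ins, not Vgent's actual API.
    import itertools
    import networkx as nx

    def build_video_graph(clips):
        """Stage 1 (offline): one node per clip; edges link clips sharing entities."""
        g = nx.Graph()
        for i, clip in enumerate(clips):
            caption = caption_clip(clip)            # hypothetical LVLM captioner
            g.add_node(i, caption=caption, entities=extract_entities(caption))
        for i, j in itertools.combinations(g.nodes, 2):
            shared = g.nodes[i]["entities"] & g.nodes[j]["entities"]
            if shared:                              # preserve semantic links across clips
                g.add_edge(i, j, shared=shared)
        return g

    def retrieve(g, query, hops=1):
        """Stage 2: match query keywords to node entities, then expand over edges."""
        keywords = extract_entities(query)
        hits = {n for n in g.nodes if keywords & g.nodes[n]["entities"]}
        for _ in range(hops):                       # pull in graph-linked context clips
            hits |= {m for n in hits for m in g.neighbors(n)}
        return sorted(hits)

    def reason(clips, hits, query):
        """Stage 3: verify each candidate clip, drop retrieval noise, aggregate evidence."""
        verified, notes = [], []
        for n in hits:
            verdict = lvlm_verify(clips[n], query)  # hypothetical: {"relevant": bool, "evidence": str}
            if verdict["relevant"]:
                verified.append(n)
                notes.append(verdict["evidence"])
        return verified, "\n".join(notes)

    def answer(clips, query):
        """Stage 4: generate from the refined clips plus aggregated reasoning notes."""
        g = build_video_graph(clips)                # in practice, built once per video offline
        verified, notes = reason(clips, retrieve(g, query), query)
        return lvlm_answer([clips[n] for n in verified],
                           f"Evidence aggregated across clips:\n{notes}\nQuestion: {query}")

Note that in the framework, graph construction is performed once per video offline, so only retrieval, reasoning, and generation run at query time.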

Video Understanding Results

Model                 Params  VideoMME (w/o sub.)  VideoMME (w/ sub.)  MLVU         LongVideoBench
InternVL2.5           2B      49.5                 55.2                56.7         52.0
InternVL2.5 + Vgent   2B      50.9 (+1.4)          56.8 (+1.6)         61.1 (+4.4)  54.8 (+2.8)
Qwen2.5-VL            3B      61.4                 67.6                66.2         54.2
Qwen2.5-VL + Vgent    3B      63.0 (+1.6)          69.6 (+2.0)         70.4 (+4.2)  57.8 (+3.6)
LongVU                7B      55.2                 60.9                65.4         50.2
LongVU + Vgent        7B      57.3 (+2.1)          63.7 (+2.8)         70.8 (+5.4)  52.7 (+2.5)
Qwen2-VL              7B      62.7                 68.1                65.7         55.6
Qwen2-VL + Vgent      7B      63.5 (+0.8)          70.1 (+2.0)         70.3 (+4.6)  58.4 (+2.8)
LLaVA-Video           7B      64.3                 69.2                69.5         59.5
LLaVA-Video + Vgent   7B      66.7 (+2.4)          71.1 (+1.9)         72.5 (+3.0)  62.4 (+2.9)
Qwen2.5-VL            7B      65.1                 71.1                68.8         56.0
Qwen2.5-VL + Vgent    7B      68.9 (+3.8)          74.3 (+3.2)         72.1 (+3.3)  59.7 (+3.7)

RAG Methods Comparison

Qualitative Examples


Citation

@inproceedings{shen2025vgent,
  title={Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding},
  author={Shen, Xiaoqian and Zhang, Wenxuan and Chen, Jun and Elhoseiny, Mohamed},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}