DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding


¹School of Computer Science, Peking University   
²Google Cloud AI Research

Corresponding author(s): lisujian@pku.edu.cn, jinsungyoon@google.com
DocLens Workflow and Performance Overview

Workflow and performance of our proposed method, DocLens. (a) The workflow grounds its answer by navigating from the full document to visual elements (e.g., Text, Chart) within relevant pages. (b) It yields substantial improvements on MMLongBench-Doc, particularly in understanding visual elements and reducing hallucination.

We introduce DocLens, a tool-augmented multi-agent framework that, for the first time, achieves superhuman performance in long visual document understanding. DocLens attains this performance by fully leveraging existing document parsing tools and orchestrating specialized agents. Given a long visual document and a corresponding question, DocLens first applies a tool-augmented Lens Module to retrieve relevant pages (Page Navigator Agent) and locate relevant visual and textual elements (Element Localizer Agent) within those pages. A Reasoning Module then performs an in-depth analysis of these elements, proposes candidate answers (Answer Sampler Agent), and selects the most accurate and reliable one (Adjudicator Agent). Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's advantage is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.

Method Overview

We propose DocLens, a framework consisting of two primary components: a Lens Module and a Reasoning Module. Given a long visual document and a question, the Lens Module first identifies relevant pages and the key visual & textual elements within them. Subsequently, the Reasoning Module conducts an in-depth analysis of this extracted evidence to generate a precise answer.


  • Lens Module: This module acts like a magnifying glass. It uses a Page Navigator agent to find the correct pages and an Element Localizer agent to zoom in on specific charts, tables, or figures on those pages.
  • Reasoning Module: After the evidence is collected, this module uses a "sampling-adjudication" mechanism. An Answer Sampler agent proposes several potential answers, and an Adjudicator agent critically assesses them to select the most reliable and accurate final answer (a minimal sketch of this two-module workflow follows the list).
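
To make this orchestration concrete, below is a minimal Python sketch of how the two modules could be wired together. Every name in it (the agent classes, the vlm_call placeholder, the prompts, and the method signatures) is an illustrative assumption rather than the released DocLens implementation.

# Hypothetical sketch of the DocLens two-module workflow.
# All names (classes, vlm_call, prompts, signatures) are illustrative assumptions.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Element:
    """A localized piece of evidence on a page."""
    page_index: int
    kind: str                        # e.g., "chart", "table", "text"
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def vlm_call(prompt: str, context: List) -> str:
    """Placeholder for a call to the backbone VLM (e.g., Gemini-2.5-Pro).
    `context` is a list of page images or cropped element images."""
    raise NotImplementedError


class PageNavigator:
    """Lens Module, step 1: pick the pages likely to contain the evidence."""

    def select_pages(self, question: str, page_images: List) -> List[int]:
        reply = vlm_call(
            f"Question: {question}\nReturn the indices of relevant pages, comma-separated.",
            page_images,
        )
        return [int(tok) for tok in reply.split(",") if tok.strip().isdigit()]


class ElementLocalizer:
    """Lens Module, step 2: locate the charts, tables, figures, and text spans that matter."""

    def locate(self, question: str, page_images: List, pages: List[int]) -> List[Element]:
        # In practice this step can lean on document-parsing tools for layout,
        # then ask the VLM which detected elements on the selected pages are
        # relevant to the question. Left as a stub here.
        return []


class AnswerSampler:
    """Reasoning Module, step 1: propose several candidate answers from the evidence."""

    def sample(self, question: str, evidence: List[Element], n: int = 3) -> List[str]:
        return [vlm_call(f"Using only the given evidence, answer: {question}", evidence)
                for _ in range(n)]


class Adjudicator:
    """Reasoning Module, step 2: pick the most reliable candidate (or reject all)."""

    def pick(self, question: str, candidates: List[str], evidence: List[Element]) -> str:
        return vlm_call(
            "Compare the candidate answers against the evidence and return the best one, "
            f"or 'unanswerable' if none is supported.\nQuestion: {question}\nCandidates: {candidates}",
            evidence,
        )


def doclens(question: str, page_images: List) -> str:
    # Lens Module: narrow from the full document down to fine-grained evidence.
    pages = PageNavigator().select_pages(question, page_images)
    evidence = ElementLocalizer().locate(question, page_images, pages)
    # Reasoning Module: sample candidate answers, then adjudicate.
    candidates = AnswerSampler().sample(question, evidence)
    return Adjudicator().pick(question, candidates, evidence)

The sketch preserves the key ordering described above: evidence is narrowed down before any answer is generated, and multiple candidates are compared before one is returned.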

Experimental Results

We evaluate DocLens on two challenging benchmarks: MMLongBench-Doc and FinRAGBench-V. Our method achieves substantial performance improvements across all three backbone models (Gemini-2.5-Pro, Gemini-2.5-Flash, and Claude-4-Sonnet).

We highlight the following findings:
  • State-of-the-Art Performance: DocLens with Gemini-2.5-Pro achieves a new state-of-the-art on MMLongBench-Doc (67.6) and FinRAGBench-V (70.4).
  • Surpassing Human Experts: On MMLongBench-Doc, our framework (67.6) surpasses the reported human expert performance of 65.8, demonstrating the effectiveness of our approach.
  • Reduced Hallucination: Our method achieves significant improvements on the Unanswerable (UNA) subset, with an absolute gain of up to +13.8% for Gemini-2.5-Pro. This indicates that our framework effectively mitigates model hallucination.
  • Superior Handling of Visual Evidence: The performance boost is primarily driven by our method's superior handling of visual evidence such as charts and tables, where fine-grained element localization is especially critical.

Case Study

The following cases from our paper highlight the effectiveness of the Element Localizer in handling complex visual queries. By first identifying and then cropping these visual elements for detailed inspection, our Localizer effectively addresses challenges where vanilla VLMs fail.
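
To illustrate the crop-and-inspect step described above, here is a small, hypothetical helper that crops a located element (with a margin) out of a rendered page image so it can be re-examined by the VLM at a higher effective resolution. The function name, bounding-box format, and padding value are assumptions, not code from the paper.

# Hypothetical crop-and-inspect helper for the Element Localizer.
# Bounding-box format and padding are assumptions; PIL is used for image handling.

from typing import Tuple

from PIL import Image


def crop_element(page_image: Image.Image,
                 bbox: Tuple[int, int, int, int],
                 padding: int = 10) -> Image.Image:
    """Crop a located element with a small margin so labels and captions survive.

    The returned crop can be sent back to the VLM on its own, so fine print in
    charts and tables is not lost in a downscaled full-page view.
    """
    left, top, right, bottom = bbox
    width, height = page_image.size
    box = (
        max(0, left - padding),
        max(0, top - padding),
        min(width, right + padding),
        min(height, bottom + padding),
    )
    return page_image.crop(box)


# Example usage (path and coordinates are illustrative):
# page = Image.open("page_12.png")
# chart_crop = crop_element(page, bbox=(80, 240, 560, 640))
# chart_crop is then passed to the backbone VLM together with the question.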

BibTeX


@misc{zhu2025doclenstoolaugmentedmultiagent,
    title={DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding}, 
    author={Dawei Zhu and Rui Meng and Jiefeng Chen and Sujian Li and Tomas Pfister and Jinsung Yoon},
    year={2025},
    eprint={2511.11552},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2511.11552}, 
}