We introduce DocLens, a tool-augmented multi-agent framework that, for the first time, achieves superhuman performance in long visual document understanding. DocLens does so by combining existing document parsing tools with a set of specialized agents. Given a long visual document and a corresponding question, DocLens first applies a tool-augmented Lens Module to retrieve relevant pages (Page Navigator Agent) and to locate relevant visual and textual elements within those pages (Element Localizer Agent). A Reasoning Module then analyzes these elements in depth, produces candidate answers (Answer Sampler Agent), and selects the most accurate and reliable one (Adjudicator Agent). Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's advantage is most evident on vision-centric and unanswerable queries, which highlights the benefit of its enhanced localization capabilities.
We propose DocLens, a framework consisting of two primary components: a Lens Module and a Reasoning Module. Given a long visual document and a question, the Lens Module first identifies the relevant pages and the key visual and textual elements within them. The Reasoning Module then analyzes this extracted evidence in depth to generate a precise answer.
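The two-module flow described above can be sketched as a short Python pipeline. This is only an illustrative outline under stated assumptions, not the DocLens implementation: the names `Evidence`, `answer_question`, `keyword_navigator`, `sentence_localizer`, `toy_sampler`, and `majority_adjudicator` are all hypothetical, the agents are replaced by toy stand-ins so the example runs without a model, and the majority vote is a substitute for the Adjudicator Agent, whose actual selection procedure is not described here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Evidence:
    page: int   # 0-based index of the page the element came from
    text: str   # stand-in for a located visual or textual element


def answer_question(
    pages: Sequence[str],
    question: str,
    navigate: Callable[[Sequence[str], str], List[int]],
    localize: Callable[[str, str], List[str]],
    sample: Callable[[str, List[Evidence]], List[str]],
    adjudicate: Callable[[str, List[str]], str],
) -> str:
    # Lens Module: choose relevant pages, then elements within those pages.
    relevant_pages = navigate(pages, question)
    evidence = [
        Evidence(page=i, text=element)
        for i in relevant_pages
        for element in localize(pages[i], question)
    ]
    # Reasoning Module: propose candidate answers, then select one of them.
    candidates = sample(question, evidence)
    return adjudicate(question, candidates)


# Toy stand-ins (assumptions for illustration; real agents would call a VLM).
def keyword_navigator(pages, question):
    terms = {w.strip("?.,:").lower() for w in question.split() if len(w) > 3}
    return [
        i for i, page in enumerate(pages)
        if terms & {w.strip("?.,:").lower() for w in page.split()}
    ]


def sentence_localizer(page, question):
    # Treat each sentence on the page as one "element".
    return [s.strip() for s in page.split(".") if s.strip()]


def toy_sampler(question, evidence):
    # Fixed candidates standing in for several independent model samples.
    if not evidence:
        return ["Not answerable"]
    return ["grew", "grew", "fell"]


def majority_adjudicator(question, candidates):
    # Majority vote, used here only as a simple substitute selection rule.
    return Counter(candidates).most_common(1)[0][0]


pages = ["Cover page", "Revenue grew in 2023. Figure: revenue chart", "Appendix"]
agents = (keyword_navigator, sentence_localizer, toy_sampler, majority_adjudicator)
answer = answer_question(pages, "What happened to revenue?", *agents)
no_answer = answer_question(pages, "Who is the CEO?", *agents)
```

Passing the four agents as plain callables keeps the two modules separable, so any single stage can be swapped for a model-backed version without touching the rest of the flow.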
We evaluate DocLens on two challenging benchmarks: MMLongBench-Doc and FinRAGBench-V. Our method achieves substantial performance improvements across all three backbone models (Gemini-2.5-Pro, Gemini-2.5-Flash, and Claude-4-Sonnet).
The following cases from our paper highlight the effectiveness of the Element Localizer in handling complex visual queries. By first identifying the relevant visual elements and then cropping them for detailed inspection, the Localizer addresses challenges where vanilla VLMs fail.
Case 1: Identifying a trend from a small bar chart embedded within a dense newspaper page.
Case 2: Locating a specific line plot, extracting precise numerical values, and presenting them in descending order.
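The identify-then-crop step behind these cases can be illustrated with a small, dependency-free Python sketch. It assumes the localizer returns a pixel bounding box `(left, top, right, bottom)` and represents the page as a list of pixel rows; the helpers `expand_box` and `crop`, the box format, and the padding margin are hypothetical choices for illustration, not details taken from DocLens.

```python
def expand_box(box, width, height, pad=0):
    """Grow a (left, top, right, bottom) box by `pad` pixels, clamped to the page."""
    left, top, right, bottom = box
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(width, right + pad),
        min(height, bottom + pad),
    )


def crop(image, box):
    """Return the sub-image inside `box`; right and bottom edges are exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


# A 4x4 toy "page" whose pixel values encode their position (row * 4 + column).
page = [[r * 4 + c for c in range(4)] for r in range(4)]

chart_box = (1, 1, 3, 3)                      # hypothetical localizer output
chart = crop(page, chart_box)                 # tight crop of the element
padded_box = expand_box(chart_box, 4, 4, pad=2)
```

In a real pipeline the cropped region would be passed to the model at higher effective resolution than the full page, which is the intuition for why small embedded charts become readable after cropping.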
@misc{zhu2025doclenstoolaugmentedmultiagent,
title={DocLens : A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding},
author={Dawei Zhu and Rui Meng and Jiefeng Chen and Sujian Li and Tomas Pfister and Jinsung Yoon},
year={2025},
eprint={2511.11552},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.11552},
}