Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.
We propose PaperBanana, a reference-driven agentic framework for automated academic illustration. As illustrated in the diagram below (itself generated by PaperBanana), PaperBanana orchestrates a collaborative team of five specialized agents (Retriever, Planner, Stylist, Visualizer, and Critic) to transform raw scientific content into publication-quality diagrams and plots.
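A minimal sketch of how such a five-agent loop could be wired together is given below. The agent interfaces and the `generate_illustration` entry point are illustrative assumptions for exposition, not the released PaperBanana API.

```python
# Illustrative sketch of a five-agent illustration loop (not the actual
# PaperBanana interfaces). Each agent object is assumed to expose the
# single method used on it here.
from dataclasses import dataclass

@dataclass
class Draft:
    plan: str      # content plan produced by the Planner
    style: str     # style guideline chosen by the Stylist
    image: bytes   # rendered diagram from the Visualizer

def generate_illustration(context, caption, retriever, planner, stylist,
                          visualizer, critic, max_rounds=3):
    """Hypothetical orchestration: retrieve references, plan content,
    choose a style, render, then iteratively refine via self-critique."""
    references = retriever.search(context, caption)        # reference diagrams
    plan = planner.plan(context, caption, references)      # what to draw
    style = stylist.choose(references)                     # how it should look
    draft = Draft(plan, style, visualizer.render(plan, style))

    for _ in range(max_rounds):
        feedback = critic.review(draft, context, caption)  # self-critique
        if feedback.approved:
            break
        plan = planner.revise(plan, feedback)              # refine the content plan
        draft = Draft(plan, style, visualizer.render(plan, style))
    return draft
```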
The lack of benchmarks hinders rigorous evaluation of automated diagram generation. We address this with PaperBananaBench, a dedicated benchmark curated from NeurIPS 2025 methodology diagrams, capturing the sophisticated aesthetics and diverse logical compositions of modern AI papers. The construction pipeline ensures high quality through: (1) Collection & Parsing, (2) Filtering, (3) Categorization, and (4) Human Curation. The final dataset comprises 584 valid samples, partitioned into 292 test and 292 reference cases.
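The four-stage pipeline can be sketched as follows. The stage callables are placeholders for the actual curation tooling, and how the 292/292 partition is drawn is an assumption; only the stage order and the final split sizes come from the text above.

```python
# Sketch of the four-stage PaperBananaBench curation pipeline.
# parse_figures / is_methodology_diagram / categorize / passes_human_review
# are placeholder callables standing in for the real tooling.
import random

def build_benchmark(papers, parse_figures, is_methodology_diagram,
                    categorize, passes_human_review, seed=0):
    samples = []
    for paper in papers:
        for figure in parse_figures(paper):             # 1. Collection & Parsing
            if not is_methodology_diagram(figure):      # 2. Filtering
                continue
            figure["category"] = categorize(figure)     # 3. Categorization
            if passes_human_review(figure):             # 4. Human Curation
                samples.append(figure)

    random.Random(seed).shuffle(samples)                # partition scheme assumed
    half = len(samples) // 2                            # 584 valid samples -> 292 / 292
    return samples[:half], samples[half:]               # test split, reference split
```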
[Plot generated by PaperBanana from raw data] Statistics of the test set of PaperBananaBench (292 samples in total). The average length of the source context / figure caption is 3,020.1 / 70.4 words.
We evaluate PaperBanana on PaperBananaBench along four dimensions: faithfulness, conciseness, readability, and aesthetics. Our method consistently outperforms leading baselines on all four.
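As a rough illustration, such a four-dimension evaluation can be run with a VLM judge as sketched below; the prompt wording and the `judge` callable are assumptions, not the paper's exact protocol.

```python
# Sketch of a VLM-as-judge scoring pass over the four evaluation dimensions.
# `judge` is any callable taking (prompt, image) and returning the model's reply.
DIMENSIONS = ["faithfulness", "conciseness", "readability", "aesthetics"]

JUDGE_PROMPT = (
    "You are reviewing a methodology diagram for a research paper.\n"
    "Source context:\n{context}\n\nCaption:\n{caption}\n\n"
    "Rate the attached diagram on {dimension} from 1 (poor) to 5 (excellent). "
    "Reply with a single integer."
)

def score_diagram(judge, image, context, caption):
    scores = {}
    for dim in DIMENSIONS:
        prompt = JUDGE_PROMPT.format(context=context, caption=caption, dimension=dim)
        reply = judge(prompt, image)
        scores[dim] = int(reply.strip())   # assumes the judge replies with a bare integer
    return scores
```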
We further show that our method seamlessly extends to the generation of high-quality statistical plots. The plot below was itself generated by PaperBanana from our raw data.
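For the code-based path, the sketch below shows how model-written plotting code could be executed over the raw data. The `render_plot` runner is a simplification (no sandboxing) and not PaperBanana's actual executor; the example script and dummy data are purely illustrative.

```python
# Sketch of the code-based plotting path: the Visualizer emits a matplotlib
# script, which is executed with the raw data in scope to produce the figure.
import matplotlib
matplotlib.use("Agg")              # headless rendering
import matplotlib.pyplot as plt

def render_plot(plot_code, data, out_path="plot.pdf"):
    """Execute generated plotting code (simplified: no sandboxing)."""
    scope = {"plt": plt, "data": data}
    exec(plot_code, scope)         # the generated script draws onto the current figure
    plt.savefig(out_path, bbox_inches="tight")
    plt.close("all")
    return out_path

# Example of a script the Visualizer might emit for a simple bar chart.
example_code = """
labels = list(data.keys())
values = [data[k] for k in labels]
plt.figure(figsize=(4, 3))
plt.bar(labels, values, color="#4C72B0")
plt.ylabel("Value")
plt.title("Illustrative example only")
"""
render_plot(example_code, {"A": 1.0, "B": 2.0, "C": 1.5})   # dummy data
```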
1. Enhancing Aesthetics of Human-Drawn Diagrams. We explore using our auto-summarized aesthetic guidelines to polish existing human-drawn diagrams. An example is shown below:
2. Coding vs. Image Generation for Visualizing Statistical Plots. We explore using image generation models for statistical plot generation, comparing them with code-based approaches (the image-generation path is sketched after this list). The results below reveal distinct trade-offs: image generation excels in presentation but underperforms in content fidelity. [The plot below was itself generated by PaperBanana from our raw data.]
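The image-generation path can be sketched as a single prompted call, as below; the `generate_image` client and the prompt wording are placeholders, not a specific vendor API.

```python
# Sketch of the image-generation path for statistical plots: the raw data is
# serialized into the prompt and the model draws the plot directly.
# `generate_image(prompt) -> bytes` is a placeholder text-to-image client.
import json

PLOT_PROMPT = (
    "Draw a clean, publication-style bar chart for the data below. "
    "Use a flat modern color palette, label both axes, and do not invent "
    "or omit any values.\n\nData (JSON):\n{data_json}\n\nCaption: {caption}"
)

def plot_via_image_model(generate_image, data, caption):
    prompt = PLOT_PROMPT.format(data_json=json.dumps(data, indent=2), caption=caption)
    return generate_image(prompt)   # returns the rendered plot as image bytes
```

Unlike the code-based path, the numerical values here pass through the image model itself, which is consistent with the numerical-hallucination and element-repetition errors noted in the case study below.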
Case Study of Diagram Generation. Given the same source context and caption, the vanilla Nano-Banana-Pro often produces diagrams with outdated color tones and overly verbose content. In contrast, our PaperBanana generates results that are more concise and aesthetically pleasing, while maintaining faithfulness to the source context.
Enhancing Aesthetics. Additional cases for enhancing the aesthetics of human-drawn diagrams with our auto-summarized style guidelines. The polished diagrams demonstrate significant stylistic improvements in color schemes, typography, graphical elements, etc.
Visualizing Statistical Plots. Case study of visualizing statistical plots with code-based and image-based generation. The image generation model produces more visually appealing plots but incurs more faithfulness errors, such as numerical hallucination and element repetition.
Failure Cases. The primary failure mode involves connection errors, such as redundant connections and mismatched source-target nodes. Our preliminary analysis reveals that the critic model often fails to identify these connectivity issues, suggesting these errors may originate from the foundation model's inherent perception limitations.