PaperBanana: Automating Academic Illustration for AI Scientists


Peking University   ·   Google Cloud AI Research

PaperBanana Workflow

Examples of methodology diagrams and statistical plots generated by PaperBanana, showing the potential of automated academic illustration generation.

Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.

Method Overview

We propose PaperBanana, a reference-driven agentic framework for automated academic illustration. As illustrated in the diagram below (itself generated by PaperBanana), PaperBanana orchestrates a collaborative team of five specialized agents (Retriever, Planner, Stylist, Visualizer, and Critic) to transform raw scientific content into publication-quality diagrams and plots; a minimal orchestration sketch follows the agent list below.


  • Retriever Agent: Identifies relevant reference examples to guide downstream agents.
  • Planner Agent: Acts as the cognitive core, translating context into detailed textual descriptions.
  • Stylist Agent: Ensures adherence to academic aesthetic standards by synthesizing guidelines from references.
  • Visualizer Agent: Transforms textual descriptions into visual output or executable code.
  • Critic Agent: Inspects generated images/plots against the source to provide feedback for refinement.
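
To make the division of labor concrete, here is a minimal orchestration sketch of such a pipeline. The agent interfaces (`retrieve`, `plan`, `summarize_style`, `render`, `review`), the `Feedback` fields, and the fixed refinement budget are illustrative assumptions, not PaperBanana's actual implementation.

```python
# Minimal orchestration sketch of a reference-driven illustration pipeline.
# Agent method names and the refinement budget are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class IllustrationTask:
    source_context: str        # method-section text from the paper
    caption: str               # target figure caption
    max_refine_rounds: int = 3


@dataclass
class Feedback:
    acceptable: bool           # Critic's verdict on the current image
    notes: str                 # concrete revision instructions


def generate_illustration(task, retriever, planner, stylist, visualizer, critic):
    # 1. Retriever: find reference diagrams relevant to the task.
    references = retriever.retrieve(task.source_context, task.caption)

    # 2. Planner: translate the source context into a detailed textual plan.
    plan = planner.plan(task.source_context, task.caption, references)

    # 3. Stylist: distill aesthetic guidelines from the references.
    style_guide = stylist.summarize_style(references)

    # 4-5. Visualizer renders; Critic inspects; iterate until acceptable.
    image = visualizer.render(plan, style_guide)
    for _ in range(task.max_refine_rounds):
        feedback = critic.review(image, task.source_context, plan)
        if feedback.acceptable:
            break
        image = visualizer.render(plan, style_guide, feedback=feedback.notes)
    return image
```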

Benchmark Construction

The lack of benchmarks hinders rigorous evaluation of automated diagram generation. We address this with PaperBananaBench, a dedicated benchmark curated from NeurIPS 2025 methodology diagrams, capturing the sophisticated aesthetics and diverse logical compositions of modern AI papers. The construction pipeline ensures high quality through: (1) Collection & Parsing, (2) Filtering, (3) Categorization, and (4) Human Curation. The final dataset comprises 584 valid samples, partitioned into 292 test and 292 reference cases.
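
As a rough illustration of how such a curation pipeline could be wired together, the sketch below walks through the four stages and the 50/50 test/reference split. The record fields, the `is_methodology_diagram`, `categorize`, and `human_approved` hooks, and the seeded shuffle are assumptions for illustration, not the actual PaperBananaBench code.

```python
# Illustrative sketch of the four-stage curation pipeline:
# (1) collection & parsing, (2) filtering, (3) categorization, (4) human curation + split.
# Field names and filter hooks are assumptions; the real pipeline may differ.
import random


def build_benchmark(papers, is_methodology_diagram, categorize, human_approved, seed=0):
    # 1. Collection & parsing: extract (figure, caption, source context) triples.
    candidates = [
        {"figure": fig, "caption": cap, "context": ctx, "paper_id": p["id"]}
        for p in papers
        for fig, cap, ctx in p["figures"]
    ]

    # 2. Filtering: keep only methodology diagrams with usable captions/context.
    filtered = [c for c in candidates
                if is_methodology_diagram(c) and c["caption"] and c["context"]]

    # 3. Categorization: tag each sample by research domain / illustration style.
    for c in filtered:
        c["category"] = categorize(c)

    # 4. Human curation, then an even test/reference split (e.g., 292 / 292).
    curated = [c for c in filtered if human_approved(c)]
    random.Random(seed).shuffle(curated)
    half = len(curated) // 2
    return curated[:half], curated[half:]   # test cases, reference cases
```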

Statistics of the PaperBananaBench test set (292 samples), plotted by PaperBanana from the raw data. The average length of the source context / figure caption is 3,020.1 / 70.4 words.

Experimental Results

We evaluate PaperBanana on PaperBananaBench, assessing performance across faithfulness, conciseness, readability, and aesthetics. Our method consistently outperforms leading baselines across all four evaluation dimensions.
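
One plausible way to obtain such per-dimension scores is a VLM-as-judge protocol like the sketch below; the rubric wording, the 1-5 scale, and the `vlm_judge` interface are assumptions rather than the paper's exact evaluation setup.

```python
# Plausible VLM-as-judge scoring sketch for the four evaluation dimensions.
# Rubric wording, scale, and judge interface are assumptions, not the paper's protocol.
import json

DIMENSIONS = ["faithfulness", "conciseness", "readability", "aesthetics"]

RUBRIC = (
    "You are grading an academic methodology diagram. "
    "Given the paper's method description and the generated diagram, "
    "rate each dimension from 1 (poor) to 5 (excellent): "
    + ", ".join(DIMENSIONS)
    + ". Reply with a JSON object mapping each dimension to an integer score."
)


def score_diagram(vlm_judge, method_text, diagram_image):
    """Ask a multimodal judge for per-dimension scores; returns a dict."""
    reply = vlm_judge(prompt=RUBRIC + "\n\nMethod description:\n" + method_text,
                      image=diagram_image)
    scores = json.loads(reply)
    return {d: int(scores.get(d, 0)) for d in DIMENSIONS}
```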

Main Results of PaperBanana


We further show that our method extends seamlessly to the generation of high-quality statistical plots. The plot below was itself generated by PaperBanana from our raw data.

Statistical Plots Comparison

Two Advanced Applications

1. Enhancing Aesthetics of Human-Drawn Diagrams. We explore applying our summarized aesthetic guidelines to elevate the visual quality of existing human-drawn diagrams. Below is an example:


2. Coding vs Image Generation for Visualizing Statistical Plots. We explore using image generation models for statistical plot generation and compare them with code-based approaches. The results below reveal a distinct trade-off: image generation excels in presentation but underperforms in content fidelity. (The plot below was itself generated by PaperBanana from our raw data.)
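
For reference, the code-based route looks roughly like the sketch below: the Visualizer emits plotting code over the raw data, so every rendered value stays exact. The data values and styling here are placeholders, not results from the paper.

```python
# Sketch of the code-based route for statistical plots: plotting code is
# executed over the raw data, so the rendered values are exact by construction.
# Scores below are placeholders for illustration only.
import matplotlib.pyplot as plt

methods = ["Baseline A", "Baseline B", "PaperBanana"]
scores = [3.1, 3.4, 4.2]          # placeholder judge scores, not real results

fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(methods, scores, color=["#b0c4de", "#9fb8d8", "#4c72b0"])
ax.set_ylabel("Judge score (1-5)")
ax.set_ylim(0, 5)
ax.set_title("Code-based rendering keeps data values exact")
for side in ("top", "right"):
    ax.spines[side].set_visible(False)
fig.tight_layout()
fig.savefig("plot_code_based.png", dpi=300)
```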

Code vs Image Generation

Case Study

BibTeX