Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.
We propose PaperBanana, a reference-driven agentic framework for automated academic illustration. As illustrated in the diagram below (itself generated by PaperBanana), PaperBanana orchestrates a collaborative team of five specialized agents (Retriever, Planner, Stylist, Visualizer, and Critic) to transform raw scientific content into publication-quality diagrams and plots.
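A minimal sketch of how such a five-agent loop could be wired together is given below. The agent interfaces and the `generate_illustration` entry point are illustrative assumptions for exposition, not the released PaperBanana API.

```python
# Illustrative sketch of a five-agent illustration loop (not the actual
# PaperBanana interfaces). Each agent object is assumed to expose the
# single method used on it here.
from dataclasses import dataclass

@dataclass
class Draft:
    plan: str      # content plan produced by the Planner
    style: str     # style guideline chosen by the Stylist
    image: bytes   # rendered diagram from the Visualizer

def generate_illustration(context, caption, retriever, planner, stylist,
                          visualizer, critic, max_rounds=3):
    """Hypothetical orchestration: retrieve references, plan content,
    choose a style, render, then iteratively refine via self-critique."""
    references = retriever.search(context, caption)        # reference diagrams
    plan = planner.plan(context, caption, references)      # what to draw
    style = stylist.choose(references)                     # how it should look
    draft = Draft(plan, style, visualizer.render(plan, style))

    for _ in range(max_rounds):
        feedback = critic.review(draft, context, caption)  # self-critique
        if feedback.approved:
            break
        plan = planner.revise(plan, feedback)              # refine the content plan
        draft = Draft(plan, style, visualizer.render(plan, style))
    return draft
```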
The lack of benchmarks hinders rigorous evaluation of automated diagram generation. We address this with PaperBananaBench, a dedicated benchmark curated from NeurIPS 2025 methodology diagrams, capturing the sophisticated aesthetics and diverse logical compositions of modern AI papers. The construction pipeline ensures high quality through: (1) Collection & Parsing, (2) Filtering, (3) Categorization, and (4) Human Curation. The final dataset comprises 584 valid samples, partitioned into 292 test and 292 reference cases.
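The four-stage pipeline can be sketched as follows. The stage callables are placeholders for the actual curation tooling, and how the 292/292 partition is drawn is an assumption; only the stage order and the final split sizes come from the text above.

```python
# Sketch of the four-stage PaperBananaBench curation pipeline.
# parse_figures / is_methodology_diagram / categorize / passes_human_review
# are placeholder callables standing in for the real tooling.
import random

def build_benchmark(papers, parse_figures, is_methodology_diagram,
                    categorize, passes_human_review, seed=0):
    samples = []
    for paper in papers:
        for figure in parse_figures(paper):             # 1. Collection & Parsing
            if not is_methodology_diagram(figure):      # 2. Filtering
                continue
            figure["category"] = categorize(figure)     # 3. Categorization
            if passes_human_review(figure):             # 4. Human Curation
                samples.append(figure)

    random.Random(seed).shuffle(samples)                # partition scheme assumed
    half = len(samples) // 2                            # 584 valid samples -> 292 / 292
    return samples[:half], samples[half:]               # test split, reference split
```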
[Plot generated by PaperBanana from raw data] Statistics of the test set of PaperBananaBench (292 samples in total). The average length of the source context / figure caption is 3,020.1 / 70.4 words.
We evaluate PaperBanana on PaperBananaBench along four dimensions: faithfulness, conciseness, readability, and aesthetics. Our method consistently outperforms leading baselines on all four.
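As a rough illustration, such a four-dimension evaluation can be run with a VLM judge as sketched below; the prompt wording and the `judge` callable are assumptions, not the paper's exact protocol.

```python
# Sketch of a VLM-as-judge scoring pass over the four evaluation dimensions.
# `judge` is any callable taking (prompt, image) and returning the model's reply.
DIMENSIONS = ["faithfulness", "conciseness", "readability", "aesthetics"]

JUDGE_PROMPT = (
    "You are reviewing a methodology diagram for a research paper.\n"
    "Source context:\n{context}\n\nCaption:\n{caption}\n\n"
    "Rate the attached diagram on {dimension} from 1 (poor) to 5 (excellent). "
    "Reply with a single integer."
)

def score_diagram(judge, image, context, caption):
    scores = {}
    for dim in DIMENSIONS:
        prompt = JUDGE_PROMPT.format(context=context, caption=caption, dimension=dim)
        reply = judge(prompt, image)
        scores[dim] = int(reply.strip())   # assumes the judge replies with a bare integer
    return scores
```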
We further show that our method seamlessly extends to the generation of high-quality statistical plots. The plot below was itself generated by PaperBanana from our raw data.
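For the code-based path, the sketch below shows how model-written plotting code could be executed over the raw data. The `render_plot` runner is a simplification (no sandboxing) and not PaperBanana's actual executor; the example script and dummy data are purely illustrative.

```python
# Sketch of the code-based plotting path: the Visualizer emits a matplotlib
# script, which is executed with the raw data in scope to produce the figure.
import matplotlib
matplotlib.use("Agg")              # headless rendering
import matplotlib.pyplot as plt

def render_plot(plot_code, data, out_path="plot.pdf"):
    """Execute generated plotting code (simplified: no sandboxing)."""
    scope = {"plt": plt, "data": data}
    exec(plot_code, scope)         # the generated script draws onto the current figure
    plt.savefig(out_path, bbox_inches="tight")
    plt.close("all")
    return out_path

# Example of a script the Visualizer might emit for a simple bar chart.
example_code = """
labels = list(data.keys())
values = [data[k] for k in labels]
plt.figure(figsize=(4, 3))
plt.bar(labels, values, color="#4C72B0")
plt.ylabel("Value")
plt.title("Illustrative example only")
"""
render_plot(example_code, {"A": 1.0, "B": 2.0, "C": 1.5})   # dummy data
```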
1. Enhancing Aesthetics of Human-Drawn Diagrams. We explore using our auto-summarized aesthetic guidelines to polish existing human-drawn diagrams. An example is shown below:
2. Coding vs. Image Generation for Visualizing Statistical Plots. We explore using image generation models for statistical plot generation, comparing them with code-based approaches (the image-generation path is sketched after this list). The results below reveal distinct trade-offs: image generation excels in presentation but underperforms in content fidelity. [The plot below was itself generated by PaperBanana from our raw data.]
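The image-generation path can be sketched as a single prompted call, as below; the `generate_image` client and the prompt wording are placeholders, not a specific vendor API.

```python
# Sketch of the image-generation path for statistical plots: the raw data is
# serialized into the prompt and the model draws the plot directly.
# `generate_image(prompt) -> bytes` is a placeholder text-to-image client.
import json

PLOT_PROMPT = (
    "Draw a clean, publication-style bar chart for the data below. "
    "Use a flat modern color palette, label both axes, and do not invent "
    "or omit any values.\n\nData (JSON):\n{data_json}\n\nCaption: {caption}"
)

def plot_via_image_model(generate_image, data, caption):
    prompt = PLOT_PROMPT.format(data_json=json.dumps(data, indent=2), caption=caption)
    return generate_image(prompt)   # returns the rendered plot as image bytes
```

Unlike the code-based path, the numerical values here pass through the image model itself, which is consistent with the numerical-hallucination and element-repetition errors noted in the case study below.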
Case Study of Diagram Generation. Given the same source context and caption, the vanilla Nano-Banana-Pro often produces diagrams with outdated color tones and overly verbose content. In contrast, our PaperBanana generates results that are more concise and aesthetically pleasing, while maintaining faithfulness to the source context.
Enhancing Aesthetics. Additional cases for enhancing the aesthetics of human-drawn diagrams with our auto-summarized style guidelines. The polished diagrams demonstrate significant stylistic improvements in color schemes, typography, graphical elements, etc.
Visualizing Statistical Plots. Case study of visualizing statistical plots with code-based and image-based generation. The image generation model produces more visually appealing plots but incurs more faithfulness errors, such as numerical hallucination and element repetition.
Failure Cases. The primary failure mode involves connection errors, such as redundant connections and mismatched source-target nodes. Our preliminary analysis reveals that the critic model often fails to identify these connectivity issues, suggesting these errors may originate from the foundation model's inherent perception limitations.