DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models

Renqiu Xia1,2,*, Song Mao1,*, Xiangchao Yan1,*, Hongbin Zhou1, Bo Zhang1,‡, Haoyang Peng1, Jiahao Pi1, Daocheng Fu1, Wenjie Wu1,2, Hancheng Ye1, Shiyang Feng4, Bin Wang1, Chao Xu1, Conghui He1, Pinlong Cai1, Min Dou1, Botian Shi1,‡, Sheng Zhou3, Yongwei Wang3, Bin Wang4, Junchi Yan1,2, Fei Wu3, Yu Qiao1
1 Shanghai Artificial Intelligence Laboratory 2 Shanghai Jiao Tong University  3 Zhejiang University  4 Fudan University  
* Equal Contribution   Corresponding author

Abstract

Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Thus, leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extraction and understanding tasks, and their capacity to process within-document data formats such as charts and equations remains under-explored. To address these issues, we present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline. DocGenome features four key characteristics:

1. Completeness: It is the first dataset to structure data from all modalities, covering 13 layout attributes along with their LaTeX source code.
2. Logicality: It provides 6 logical relationships between different entities within each scientific document.
3. Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA and multi-page QA.
4. Correctness: It undergoes rigorous quality control checks conducted by a specialized team.

In addition, we conduct extensive experiments on DocGenome to demonstrate its advantages and to objectively evaluate the performance of current large models on our benchmark.


Overview of the DocGenome Dataset

Our work introduces DocGenome, a multi-modal dataset of academic documents encompassing 8 primary disciplines, 153 secondary disciplines, 13 categories of component units, and 6 types of entity relationships between units.

Figure 1: Overview of the DocGenome Dataset.



Dataset-Train Download



DocParser: A Cutting-edge Auto-labeling Pipeline

DocParser converts the LaTeX source code of a complete document into annotations for its component units, each with source code, attribute, relationships, and bounding box, together with a rendered PNG of the entire document. The process has four stages:
1. Data Preprocessing
2. Unit Segmentation
3. Attribute Assignment and Relation Retrieval
4. Color Rendering
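
The output described above (per-unit source code, attribute, relationships, and bounding box) could be held in a record like the one below. This is only a minimal sketch: the class names, field names, relation labels, and the (x0, y0, x1, y1) box convention are assumptions made for illustration and are not the actual DocGenome annotation schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Tuple


@dataclass
class Relation:
    # Hypothetical link to another unit: its position in the unit list
    # and a free-form label for the kind of relationship.
    target: int
    kind: str


@dataclass
class ComponentUnit:
    source: str    # LaTeX source code of the unit
    attribute: str  # layout attribute, e.g. "Title" or "Figure"
    # Assumed box convention: (x0, y0, x1, y1) in pixels on the rendered page.
    bbox: Tuple[float, float, float, float]
    page: int = 0  # assumed zero-based page index
    relations: List[Relation] = field(default_factory=list)


def to_json(units: List[ComponentUnit]) -> str:
    # asdict() recurses into nested dataclasses, so relations serialize too.
    return json.dumps([asdict(u) for u in units], indent=2)


# Toy annotation for two units; all values are made up for illustration.
example_units = [
    ComponentUnit(r"\section{Introduction}", "Title", (72.0, 90.0, 300.0, 110.0)),
    ComponentUnit(
        "Scientific documents record research findings.",
        "Text",
        (72.0, 120.0, 540.0, 180.0),
        relations=[Relation(target=0, kind="level")],
    ),
]
example_json = to_json(example_units)
```

Keeping the record flat and JSON-serializable makes it easy to store one file per page or per document, whichever layout the real annotations use.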

Figure 2: Schematic of the designed DocParser pipeline for automated document annotation.



DocGenome Benchmark Introduction

To comprehensively show the advantages of the proposed DocGenome dataset, we have reviewed visual document datasets and summarized them in Table 1. By comparison, our DocGenome demonstrates more comprehensive features, including the number of disciplines and training samples covered, types of tasks, evaluation metrics, and entity relationships.

Table 1: Comparison with document-related benchmarks. “-” indicates that the corresponding part is not mentioned in the original paper. “*” means that each sample in their training set is cropped from the entire page, resulting in a total of 6.4M samples at the region level rather than the page level.


• Definition of relationships between component units

The DocGenome benchmark contains 4 level relation types and 2 cite relation types, as defined in Table 2:

Table 2: The definition of logical relationships between component units


• Attribute of component units

DocGenome has 13 attributes of component units, which can be categorized into two classes:
1. Fixed-form units, including Text, Title, Abstract, etc., which are read sequentially and whose hierarchical relationships are readily discernible from the list obtained in Stage 2 (Unit Segmentation) of DocParser.
2. Floating-form units, including Table, Figure, etc., which establish directional references to fixed-form units through commands like \ref and \label.
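
The link between floating-form and fixed-form units can be illustrated with a small label-matching routine. The sketch below is not DocParser's implementation: it assumes each unit is available as a raw LaTeX string, matches only plain \label{...} and \ref{...} with regular expressions, and uses a hypothetical function name.

```python
import re
from typing import Dict, List, Tuple

# Plain \label{...} and \ref{...} only; variants such as \eqref or \autoref
# are deliberately left out of this sketch.
LABEL_RE = re.compile(r"\\label\{([^}]+)\}")
REF_RE = re.compile(r"\\ref\{([^}]+)\}")


def link_floating_to_fixed(
    fixed_units: List[str], floating_units: List[str]
) -> List[Tuple[int, int]]:
    # Returns (floating_index, fixed_index) pairs: the floating unit defines
    # a label and the fixed-form unit refers to that label.
    label_owner: Dict[str, int] = {}
    for j, src in enumerate(floating_units):
        for label in LABEL_RE.findall(src):
            label_owner[label] = j

    links: List[Tuple[int, int]] = []
    for i, src in enumerate(fixed_units):
        for label in REF_RE.findall(src):
            if label in label_owner:
                links.append((label_owner[label], i))
    return links


# Toy inputs, made up for illustration.
fixed = [
    r"Table~\ref{tab:cmp} compares the benchmarks.",
    "This paragraph cites nothing.",
    r"Figure~\ref{fig:overview} gives an overview.",
]
floating = [
    r"\begin{figure}\caption{Overview}\label{fig:overview}\end{figure}",
    r"\begin{table}\caption{Comparison}\label{tab:cmp}\end{table}",
]
print(link_floating_to_fixed(fixed, floating))  # [(1, 0), (0, 2)]
```

References to labels that no floating unit defines are silently skipped here; a real pipeline would probably want to log them.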

Index  Category    Notes
0      Algorithm
1      Caption     Titles of Images, Tables, and Algorithms
2      Equation
3      Figure
4      Footnote
5      List
7      Table
8      Text
9      Text-EQ     Text block with inline equations
10     Title       Section titles
12     PaperTitle
13     Code
14     Abstract
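
For code that consumes the annotations, the index-to-category table can be written as a lookup. The sketch below copies the indices exactly as listed, including the unlisted indices 6 and 11. The variable and function names are hypothetical, and whether annotation files store the integer index or the category name is an assumption to check against the released data.

```python
from typing import Dict, Optional

# Index-to-category mapping as listed in the table; 6 and 11 are not listed.
CATEGORY_BY_INDEX: Dict[int, str] = {
    0: "Algorithm",
    1: "Caption",
    2: "Equation",
    3: "Figure",
    4: "Footnote",
    5: "List",
    7: "Table",
    8: "Text",
    9: "Text-EQ",
    10: "Title",
    12: "PaperTitle",
    13: "Code",
    14: "Abstract",
}

# Reverse lookup from category name to index.
INDEX_BY_CATEGORY: Dict[str, int] = {
    name: idx for idx, name in CATEGORY_BY_INDEX.items()
}


def category_name(index: int) -> Optional[str]:
    # Returns None for indices the table does not list (e.g. 6 or 11).
    return CATEGORY_BY_INDEX.get(index)
```

Returning None for unlisted indices, rather than raising, lets a caller decide whether an unknown label should be skipped or reported.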

• Types of disciplines

In DocGenome, 20% of documents are five pages or fewer, 50% are ten pages or fewer, and 80% are nineteen pages or fewer.

Figure A.1: Page distribution of DocGenome.

In the distribution of secondary disciplines, the count on the x-axis represents the number of documents, and documents from the same primary discipline are marked with the same color.

Figure A.2: Distribution of secondary disciplines in our DocGenome.