Recent advances in Large Multi-modal Models (LMMs) underscore the importance of scaling with ever-larger image-text paired datasets, achieving impressive performance on general tasks. Despite their effectiveness in broad applications, these generalist models are trained primarily on web-scale data dominated by natural images, which sacrifices the specialized capabilities needed for domain-specific tasks requiring extensive prior domain knowledge. Moreover, directly integrating expert models tailored to specific domains is challenging due to the representational gap and the imbalanced optimization between the generalist model and the experts. To address these challenges, we introduce Chimera, a scalable and low-cost multi-modal pipeline designed to augment existing LMMs with domain-specific experts. Specifically, we design a progressive training strategy to integrate features from expert models into the input of a generalist LMM. To address the imbalanced optimization caused by the well-aligned general visual encoder, we introduce a novel Generalist-Specialist Collaboration Masking (GSCM) mechanism. The result is a versatile model that excels across the chart, table, math, and document domains, achieving state-of-the-art performance on multi-modal reasoning and visual content extraction, both of which are challenging tasks for assessing existing LMMs. We will release Chimera's weights, along with the data used for training and evaluation, to facilitate future research on LMMs.
We introduce Chimera, a scalable pipeline that integrates specialist models into generalist LMMs, facilitating their adaptation to diverse specialized tasks.
Chimera comprises a general visual encoder, a general projector, and a language model, the latter two initialized from a pre-trained LMM, together with a router and an expert model suite consisting of specialized expert models and their corresponding expert projectors.
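To make this architecture concrete, the sketch below shows one way features from a domain expert could be projected and appended to the general visual tokens before entering the language model. The module names, token counts, dimensions, and the hard per-domain router are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class ExpertBranch(nn.Module):
    """Takes features from a domain expert encoder and projects them into the LLM embedding space."""
    def __init__(self, expert_dim: int, llm_dim: int):
        super().__init__()
        self.projector = nn.Linear(expert_dim, llm_dim)   # expert projector

    def forward(self, expert_feats: torch.Tensor) -> torch.Tensor:
        return self.projector(expert_feats)               # (B, N_expert, llm_dim)

class ChimeraSketch(nn.Module):
    def __init__(self, vis_dim=1024, expert_dim=512, llm_dim=4096,
                 domains=("chart", "table", "math", "doc")):
        super().__init__()
        self.general_projector = nn.Linear(vis_dim, llm_dim)   # general projector
        self.experts = nn.ModuleDict({d: ExpertBranch(expert_dim, llm_dim) for d in domains})

    def forward(self, general_feats: torch.Tensor, expert_feats: torch.Tensor, domain: str) -> torch.Tensor:
        general_tokens = self.general_projector(general_feats)   # tokens from the general visual encoder
        expert_tokens = self.experts[domain](expert_feats)        # router selects the matching expert branch
        # Expert tokens are appended to the general tokens and fed to the language model.
        return torch.cat([general_tokens, expert_tokens], dim=1)

model = ChimeraSketch()
# 256 general visual tokens and 64 expert tokens for a chart image (shapes are illustrative).
fused = model(torch.randn(2, 256, 1024), torch.randn(2, 64, 512), domain="chart")
print(fused.shape)  # torch.Size([2, 320, 4096])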
Chimera uses a Generalist-Specialist Collaboration Masking (GSCM) mechanism to facilitate the alignment with expert models.
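The sketch below illustrates one possible form such a masking step could take, assuming, purely for illustration, that a random subset of general visual tokens is dropped during training so that optimization pressure shifts toward the less-aligned expert branches; the exact masking rule used by Chimera may differ.

import torch

def collaboration_mask(general_tokens: torch.Tensor,
                       expert_tokens: torch.Tensor,
                       mask_ratio: float = 0.5,
                       training: bool = True) -> torch.Tensor:
    """Hypothetical GSCM-style step: zero out a random subset of general tokens, keep expert tokens intact."""
    if training and mask_ratio > 0:
        batch, num_tokens, _ = general_tokens.shape
        keep = torch.rand(batch, num_tokens, 1, device=general_tokens.device) > mask_ratio
        general_tokens = general_tokens * keep   # masked-out general tokens contribute no signal
    return torch.cat([general_tokens, expert_tokens], dim=1)

# Usage: combine the two token streams before they enter the language model.
fused = collaboration_mask(torch.randn(2, 256, 4096), torch.randn(2, 64, 4096))
print(fused.shape)  # torch.Size([2, 320, 4096])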
Chimera is trained with a progressive two-stage procedure.
We conduct quantitative experiments to evaluate Chimera's capabilities in multi-modal reasoning and visual content extraction. Chimera achieves new state-of-the-art results among LMMs of comparable scale on two multi-modal reasoning benchmarks. It also surpasses or matches the performance of representative expert models in visual content extraction tasks across the chart, table, and document domains.
In addition to quantitatively reporting Chimera's performance across various benchmarks, we provide several demos below to showcase Chimera's capabilities on challenging domain-specific tasks, such as table format transformation, chart structural extraction, and document context extraction.
Output of Chimera-Reasoner-8B on Table Format Transformation.
Output of Chimera-Reasoner-8B on Chart Structural Extraction.
Output of Chimera-Extractor-1B on Document Context Extraction.
@misc{peng2024chimeraimprovinggeneralistmodel,
      title={Chimera: Improving Generalist Model with Domain-Specific Experts},
      author={Tianshuo Peng and Mingsheng Li and Hongbin Zhou and Renqiu Xia and Renrui Zhang and Lei Bai and Song Mao and Bin Wang and Conghui He and Aojun Zhou and Botian Shi and Tao Chen and Bo Zhang and Xiangyu Yue},
      year={2024},
      eprint={2412.05983},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.05983},
}