Recent advances in Large Multi-modal Models (LMMs) underscore the importance of scaling with ever-larger image-text paired datasets, achieving impressive performance on general tasks. Despite their effectiveness in broad applications, these generalist models are trained primarily on web-scale data dominated by natural images, which sacrifices the specialized capabilities needed for domain-specific tasks requiring extensive prior domain knowledge. Moreover, directly integrating expert models tailored to specific domains is challenging due to the representational gap and the imbalanced optimization between the generalist model and the experts. To address these challenges, we introduce Chimera, a scalable and low-cost multi-modal pipeline designed to augment existing LMMs with domain-specific experts. Specifically, we design a progressive training strategy to integrate features from expert models into the input of a generalist LMM. To address the imbalanced optimization caused by the well-aligned general visual encoder, we introduce a novel Generalist-Specialist Collaboration Masking (GSCM) mechanism. The result is a versatile model that excels across the chart, table, math, and document domains, achieving state-of-the-art performance on multi-modal reasoning and visual content extraction, both of which are challenging tasks for assessing existing LMMs. We will release Chimera's weights, along with the data used for training and evaluation, to facilitate future research on LMMs.
We introduce Chimera, a scalable pipeline that integrates specialist models into generalist LMMs, facilitating their adaptation to diverse specialized tasks.
Chimera comprises a general visual encoder, a general projector, and a language model, the latter two initialized from a pre-trained LMM, together with a router and an expert model suite consisting of specialized expert models and their corresponding expert projectors.
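To make this architecture concrete, the sketch below shows one way features from a domain expert could be projected and appended to the general visual tokens before entering the language model. The module names, token counts, dimensions, and the hard per-domain router are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class ExpertBranch(nn.Module):
    """Takes features from a domain expert encoder and projects them into the LLM embedding space."""
    def __init__(self, expert_dim: int, llm_dim: int):
        super().__init__()
        self.projector = nn.Linear(expert_dim, llm_dim)   # expert projector

    def forward(self, expert_feats: torch.Tensor) -> torch.Tensor:
        return self.projector(expert_feats)               # (B, N_expert, llm_dim)

class ChimeraSketch(nn.Module):
    def __init__(self, vis_dim=1024, expert_dim=512, llm_dim=4096,
                 domains=("chart", "table", "math", "doc")):
        super().__init__()
        self.general_projector = nn.Linear(vis_dim, llm_dim)   # general projector
        self.experts = nn.ModuleDict({d: ExpertBranch(expert_dim, llm_dim) for d in domains})

    def forward(self, general_feats: torch.Tensor, expert_feats: torch.Tensor, domain: str) -> torch.Tensor:
        general_tokens = self.general_projector(general_feats)   # tokens from the general visual encoder
        expert_tokens = self.experts[domain](expert_feats)        # router selects the matching expert branch
        # Expert tokens are appended to the general tokens and fed to the language model.
        return torch.cat([general_tokens, expert_tokens], dim=1)

model = ChimeraSketch()
# 256 general visual tokens and 64 expert tokens for a chart image (shapes are illustrative).
fused = model(torch.randn(2, 256, 1024), torch.randn(2, 64, 512), domain="chart")
print(fused.shape)  # torch.Size([2, 320, 4096])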
Chimera uses a Generalist-Specialist Collaboration Masking (GSCM) mechanism to facilitate the alignment with expert models.
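The sketch below illustrates one possible form such a masking step could take, assuming, purely for illustration, that a random subset of general visual tokens is dropped during training so that optimization pressure shifts toward the less-aligned expert branches; the exact masking rule used by Chimera may differ.

import torch

def collaboration_mask(general_tokens: torch.Tensor,
                       expert_tokens: torch.Tensor,
                       mask_ratio: float = 0.5,
                       training: bool = True) -> torch.Tensor:
    """Hypothetical GSCM-style step: zero out a random subset of general tokens, keep expert tokens intact."""
    if training and mask_ratio > 0:
        batch, num_tokens, _ = general_tokens.shape
        keep = torch.rand(batch, num_tokens, 1, device=general_tokens.device) > mask_ratio
        general_tokens = general_tokens * keep   # masked-out general tokens contribute no signal
    return torch.cat([general_tokens, expert_tokens], dim=1)

# Usage: combine the two token streams before they enter the language model.
fused = collaboration_mask(torch.randn(2, 256, 4096), torch.randn(2, 64, 4096))
print(fused.shape)  # torch.Size([2, 320, 4096])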
Chimera is trained with a progressive two-stage procedure.
We conduct quantitative experiments to evaluate Chimera's capabilities in multi-modal reasoning and visual content extraction. Chimera achieves new state-of-the-art results among LMMs of comparable scale on two multi-modal reasoning benchmarks. It also surpasses or matches the performance of representative expert models in visual content extraction tasks across the chart, table, and document domains.
In addition to quantitatively reporting Chimera's performance across various benchmarks, we provide several demos below to showcase Chimera's capabilities on challenging domain-specific tasks, such as table format transformation, chart structural extraction, and document context extraction.
Output of Chimera-Reasoner-8B on Table Format Transformation.
Output of Chimera-Reasoner-8B on Chart Structural Extraction.
Output of Chimera-Extractor-1B on Document Context Extraction.
@misc{peng2024chimeraimprovinggeneralistmodel,
      title={Chimera: Improving Generalist Model with Domain-Specific Experts},
      author={Tianshuo Peng and Mingsheng Li and Hongbin Zhou and Renqiu Xia and Renrui Zhang and Lei Bai and Song Mao and Bin Wang and Conghui He and Aojun Zhou and Botian Shi and Tao Chen and Bo Zhang and Xiangyu Yue},
      year={2024},
      eprint={2412.05983},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.05983},
}