Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

A newly published article in Visual Intelligence presents lightweight, open-source multi-modal large language models (MLLMs) designed to tackle the challenges of deploying MLLMs in resource-constrained environments.
Published in Computational Sciences

In recent years, there have been significant advances in multi-modal large language models (MLLMs) such as the InternVL, GPT, and LLaVA series, which combine the powerful capabilities of pre-trained large language models (LLMs) with vision foundation models (VFMs). These models undergo multi-stage training on extensive image-text data, which effectively aligns the visual representations from VFMs with the latent space of LLMs, leading to promising performance in general vision-language understanding, reasoning, and interaction tasks. However, the large computational burden and poor performance on long-tail, domain-specific tasks hinder the widespread application of MLLMs in practical scenarios.

The emergence of lightweight MLLMs has provided a good balance between parameter size and performance, alleviating the reliance on expensive computing devices and fostering the development of various downstream applications. However, several challenges remain: 1) Most existing MLLMs use vision encoders such as CLIP, which are trained on Internet-domain image-text data and aligned with BERT. As a result, these vision encoders do not cover the extensive range of visual domains and are misaligned with the representations of LLMs. 2) To adapt MLLMs to specialized domains, existing methods mainly focus on modifying the model architecture, gathering extensive related training data, or customizing the training process for the target domain. There is still no consensus framework for the downstream adaptation of MLLMs; each domain relies on its own solutions for model design, data formatting, and training schedules.

To address these issues, there is a need for a strong vision encoder with comprehensive visual knowledge and for a general transfer learning paradigm that can be efficiently applied to downstream tasks in various domains at low marginal cost. In this work, the research team from Shanghai AI Lab and Tsinghua University introduce Mini-InternVL, a series of powerful pocket-sized MLLMs that can be easily transferred to various specialized domains. To this end, they first enhance the representational capability of a lightweight vision encoder: they initialize a 300M-parameter vision encoder with weights from CLIP and apply knowledge distillation using InternViT-6B as the teacher model. They then develop the Mini-InternVL series with 1 billion, 2 billion, and 4 billion parameters by integrating this vision encoder with pre-trained LLMs, namely Qwen2-0.5B, InternLM2-1.8B, and Phi-3-mini, respectively. Benefiting from the robust vision encoder, Mini-InternVL exhibits excellent performance on general multi-modal benchmarks such as MMBench, ChartQA, and MathVista. Remarkably, Mini-InternVL-4B achieves 90% of the performance of InternVL2-76B while using only 5% of its parameters, significantly reducing computational overhead.
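The distillation step can be pictured with a short PyTorch-style sketch. This is a minimal illustration, not the authors' released implementation: the wrapper class, the linear projection head, and the cosine-similarity objective below are assumptions made for clarity.

```python
# Minimal sketch of feature distillation from a large frozen teacher
# (e.g. InternViT-6B) into a lightweight student vision encoder.
# Module names and the choice of loss are illustrative assumptions,
# not the paper's actual training code.
import torch
import torch.nn as nn


class DistillationWrapper(nn.Module):
    def __init__(self, student: nn.Module, teacher: nn.Module,
                 student_dim: int, teacher_dim: int):
        super().__init__()
        self.student = student                 # ~300M-parameter ViT (trainable)
        self.teacher = teacher.eval()          # 6B-parameter teacher (frozen)
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        # Project student features into the teacher's embedding space
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            t_feat = self.teacher(images)              # [B, N, teacher_dim]
        s_feat = self.proj(self.student(images))       # [B, N, teacher_dim]
        # Pull student token features toward the teacher's
        loss = 1 - nn.functional.cosine_similarity(s_feat, t_feat, dim=-1).mean()
        return loss
```

In the paper, the distilled encoder becomes InternViT-300M, which is then aligned with the language model through the usual multi-stage MLLM training.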

To further adapt the models to domain-specific downstream tasks, the research team introduce a straightforward yet effective transfer learning paradigm. Within this paradigm, they develop a unified transfer approach applicable to various downstream tasks, including autonomous driving, medical image processing, and remote sensing, by standardizing the model architecture, data format, and training schedule. The experimental results demonstrate that this method enhances the models' visual understanding and reasoning capabilities in domain-specific scenarios, enabling them to match the performance of proprietary commercial models within the target domains.
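One way to picture the "standardized data format" part of this paradigm is a single conversation-style record that is reused across domains. The field names below are assumptions made for illustration, not the paper's exact schema.

```python
# Illustrative example of a unified conversation-style training record that
# could be shared across domains (autonomous driving, medical imaging,
# remote sensing). Field names are assumptions for this sketch.
sample = {
    "image": "images/remote_sensing/scene_0001.png",
    "conversations": [
        {"from": "human", "value": "<image>\nHow many airplanes are visible on the tarmac?"},
        {"from": "gpt",   "value": "There are three airplanes visible on the tarmac."},
    ],
}
# The same structure can hold a driving scene with a planning question or a
# chest X-ray with a diagnostic question; only the image and text change, so
# one model architecture and one training schedule cover all target domains.
```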

In summary, the highlights of this research are as follows:

1) They propose Mini-InternVL, a powerful pocket multi-modal model, which not only achieves robust multi-modal performance with only 4 billion parameters but also transfers easily to downstream tasks across various domains at low marginal cost.

2) They develop several design features for Mini-InternVL. They introduce a lightweight vision encoder, InternViT-300M, which is robust across various visual domains. In addition, they introduce a simple but effective paradigm that standardizes the model architecture, data format, and training schedule for effective downstream task transfer.

3) They evaluate their models through extensive experiments on general and domain-specific benchmarks. The results show that their models achieve 90% of the performance of much larger counterparts on general multi-modal benchmarks while using significantly fewer parameters. For domain-specific tasks, the models can rival closed-source commercial models with minimal fine-tuning cost. The research team also conduct a series of ablation studies on the impact of data sample size on domain adaptation, hoping to provide insights into the application of MLLMs in specialized domains.

