- This project aims to train a super-small multimodal vision-language model, MiniMind-V, from scratch for just 1.3 RMB and 1 hour of training!
- The smallest version of MiniMind-V is only about $\frac{1}{7000}$ the size of GPT-3, designed to enable fast inference and even training on personal GPUs.
- MiniMind-V is an extension of the visual capabilities of the MiniMind pure language model.
- The project includes full code for the minimalist structure of large VLM models, dataset cleaning, pretraining, and supervised fine-tuning (SFT).
- This is not only the smallest implementation of an open-source VLM model but also a concise tutorial for beginners in vision-language models.
- The hope is that this project can provide a useful example to inspire others and share the joy of creation, helping to drive progress in the wider AI community!
To avoid misunderstandings: the "1 hour" is measured for 1 epoch on a single NVIDIA 3090 GPU, and the "1.3 RMB" refers to the GPU server rental cost.
“Building a plane with Legos is much more exciting than flying in first class!” Is it really as complex as imagined to build a VLM-based multimodal large model? How is the code implementation done? Is the training process difficult? Now, let's explore the answers and feel the joy of creation together!
Tip
(As of 2025-02-20) The MiniMind-V series has completed the training of the following model versions, with the smallest requiring only 26M (0.026B) parameters, capable of both image recognition and conversation!
Model (Size) | Inference Memory | Release |
---|---|---|
MiniMind2-V (104M) | 0.6 GB | 2025.02.20 |
MiniMind2-Small-V (26M) | 1.1 GB | 2025.02.20 |
minimind-v-v1-small (27M) | 0.6 GB | 2024.10.04 |
minimind-v-v1 (109M) | 1.1 GB | 2024.10.04 |
2025-02-20 (newest 🎉)
- MiniMind2-V updated alongside MiniMind2
- Significant reduction of all redundant code, standardized code format
- Major simplification of the model's redundant structure
- Updated dataset format, expanded with new SFT datasets
- Better performance than the previous VLM version!
2024-10-05
- MiniMind-V released on schedule, first open-source release
Sharing my hardware and software configuration (for reference only)
- CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
- RAM: 128 GB
- GPU: NVIDIA GeForce RTX 3090(24GB) * 8
- Ubuntu==20.04
- CUDA==12.2
- Python==3.10.16
- requirements.txt
# Clone the code repository
git clone https://github.com/jingyaogong/minimind-v
# Download the clip model to the ./model/vision_model directory
git clone https://huggingface.co/openai/clip-vit-base-patch16
# or
git clone https://www.modelscope.cn/models/openai-mirror/clip-vit-base-patch16
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
git clone https://huggingface.co/jingyaogong/MiniMind2-V
# load=0: load from pytorch model, load=1: load from transformers-hf model
python eval_vlm.py --load 1
python web_demo_vlm.py
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
Note: Test if Torch can use CUDA
import torch
print(torch.cuda.is_available())
If unavailable, download the whl file from torch_stable for installation. Refer to this link for help.
Download the required dataset files from the dataset download link, create a `./dataset` directory, and place the files under `./dataset`.
`*.jsonl` is the Q&A dataset, and `*images` are the accompanying image data. After downloading, decompress the image data.
Note: Dataset Details
Please reserve about 5 GB of space for the dataset. If there is not enough space for the pretraining data, you can skip the pretraining step and proceed directly to SFT training.
3.1 Pretraining (Learning image description)
python train_pretrain_vlm.py --epochs 4
Run pretraining to get `pretrain_vlm_*.pth` as the pretrained model's output weights (* represents the model dimension, default is 512).
3.2 Supervised Fine-Tuning (Learning image-caption dialogue style)
python train_sft_vlm.py --epochs 4
Perform supervised fine-tuning to get `sft_vlm_*.pth` as the output weights for the fine-tuned model.
Note: Training Details
By default, the training process saves model parameters every 100 steps to the `./out/***.pth` file (it will overwrite previous weight files).
Ensure that the model `*.pth` file you want to test is located in the `./out/` directory.
You can also directly download the pre-trained `*.pth` file from here.
python eval_vlm.py --model_mode 1 # Default is 0: test pretrain model, set to 1: test sft model
Tip
The training scripts are based on PyTorch's native framework and support multi-card acceleration. If your device has N (N>1) GPUs:
Single-machine N-card training method (DDP, supports multi-machine multi-card cluster)
torchrun --nproc_per_node N train_xxx.py
Note: Other Details
Single-machine N-card training (DeepSpeed)
deepspeed --master_port 29500 --num_gpus=N train_xxx.py
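When a script is launched with `torchrun`, each worker process receives its rank through environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). Below is a minimal sketch of how a training script might detect this. The helper name `ddp_info` is illustrative, and the fallback values assume a plain single-process `python train_xxx.py` run:

```python
import os
from typing import Dict, Mapping


def ddp_info(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Read the rank variables set by torchrun, defaulting to single-process."""
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


# Simulate worker 3 of an 8-GPU launch by passing an explicit mapping.
info = ddp_info({"RANK": "3", "LOCAL_RANK": "3", "WORLD_SIZE": "8"})
print(info)                    # {'rank': 3, 'local_rank': 3, 'world_size': 8}
print(info["world_size"] > 1)  # True: running under a multi-process launcher
```

Passing the mapping explicitly keeps the helper easy to test without actually spawning several processes.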
You can enable wandb logging during training:
# You need to log in: wandb login
torchrun --nproc_per_node N train_xxx.py --use_wandb
# and
python train_xxx.py --use_wandb
By adding the `--use_wandb` parameter, you can log the training process, and after training is complete, you can view
the process on the wandb website. You can specify the project name and run name by modifying the `wandb_project`
and `wandb_run_name` parameters.
The base language model of MiniMind-V (VLM), MiniMind (LLM), comes from the twin project minimind. For detailed information on the model structure, training specifics, principles, and testing results, please refer to the minimind project. To reduce redundancy, the discussion on LLM-related topics is omitted here, assuming you have a basic understanding of MiniMind (LLM).
Even if you are not very familiar with the details of LLMs, you can still follow the "Quick Start" guide to train a MiniMind-V, as it remains unaffected and the repository focuses on the lowest cost for out-of-the-box use!
MiniMind-V's structure adds two submodules, a Visual Encoder and a feature projection, with a modality-mixing branch to
support inputs from multiple modalities:
[Important] Some Interesting Thoughts
Let's take a moment to think about two questions:
- What is a Large Language Model (LLM)?
- What is a multimodal model?
This article perfectly aligns with my thoughts:
Although the name "large language model" (LLM) contains the word "language," they are actually not closely related to
language; this is just a historical artifact. A more accurate name would be autoregressive Transformer or something else.
LLMs are more of a general statistical modeling technology, mainly using an autoregressive Transformer to model token
streams. These tokens can represent text, images, audio, action choices, and even molecules: anything, really.
Therefore, as long as the problem can be converted into a process of modeling a series of discrete tokens, an LLM can
theoretically solve it. In fact, with the increasing maturity of large language model technologies, we may see more and
more problems falling under this modeling paradigm. In other words, the task is always the same, "predict the
next token," but the role and meaning of the tokens differ in each domain.
ZJU-LiXi has also mentioned a similar viewpoint (roughly stated below):
Text, video, audio, actions, etc., are considered "multimodal" signals in human perception, but the term "modality" is
essentially just a classification concept based on how humans store information. Just like `.txt` and `.png` files,
though they differ in visual presentation and higher-level forms, they are fundamentally the same. The concept of
"multimodal" arose simply because humans need to categorize these signals based on different sensory dimensions.
However, for machines, regardless of the signal's "modality," they are ultimately presented as a sequence of binary
"monomodal" numbers. Machines do not differentiate the origin of these signals; they just process and analyze the
information contained within these sequences.
Personally, I think Generative Pretrained Transformer (GPT) is a more fitting term than **Large Language Model (LLM)**, and I prefer to use "GPT" to represent models in the LLM/VLM/GPT-like architecture series, not to ride on OpenAI's coattails.
To summarize what GPTs do in one sentence:
A GPT model predicts the next, next-next, next-next-next token, etc., based on all the tokens so far... until the model outputs the end token; here, the "token" doesn't necessarily have to be text!
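The one-sentence summary above describes an autoregressive decoding loop. Below is a minimal sketch, assuming a hypothetical `next_token` callable that stands in for the model and a made-up end-token id. A real model would score a whole vocabulary rather than follow a fixed rule:

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id


def generate(next_token: Callable[[List[int]], int],
             prompt: List[int],
             max_new_tokens: int = 16) -> List[int]:
    """Greedy autoregressive decoding: append one token at a time until EOS."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # the prediction conditions on everything so far
        tokens.append(tok)
        if tok == EOS:
            break
    return tokens


def countdown(tokens: List[int]) -> int:
    """Toy stand-in for a model: always emits the previous token minus one."""
    return max(tokens[-1] - 1, EOS)


print(generate(countdown, [3]))  # [3, 2, 1, 0]
```

Nothing in the loop cares what the integers mean, which is the point of the passage: the same procedure applies whether the ids stand for words, image patches, or audio frames.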
> For an LLM model, if we need to understand an "image," we just treat the "image" as a special "foreign language" that has never been encountered before, and translate it into the "LLM language" via a "foreign language dictionary."
> For an LLM model, if we need to understand "audio," we just treat "audio" as a special "foreign language" that has never been encountered before, and translate it into the "LLM language" via a "foreign language dictionary."
> ...
To obtain MiniMind-V, we only need to do these 2 things:
- Use the "foreign language dictionary" that is good at translating images to translate the image from the "foreign language" into a model-understandable "LLM language."
- Fine-tune the LLM so that it and the "foreign language dictionary" go through a period of adaptation, thereby better understanding images.
The "foreign language dictionary" is referred to as the Visual Encoder model.
Like LLaVA, Qwen-VL, and other vision-language models, MiniMind-V also uses the open-source CLIP series models as the
Visual Encoder.
Specifically, we use clip-vit-base-patch16, a classic Visual
Encoder based on the ViT-B/16 architecture for describing image-text information.
The input image size is 224×224, and because the patch size is 16×16, the image is split into a 14×14 grid, giving
14×14=196 tokens as the input to the encoder layer, which ultimately produces a 1×768-dimensional embedding vector
used to compute the loss against the text.
We don't need that final embedding representation, so we only take the output from the encoder layer, which is the output
feature from the core ViT backbone.
This gives a feature of size 196×768, which we use as 196 visual tokens to input into
MiniMind-V.
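The token count follows directly from the patch grid. Below is a small sketch of the arithmetic. The function name is illustrative, and the remark about a class token reflects how standard ViT encoders (CLIP's included) prepend one extra token; it is not a statement about this repository's exact code:

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT cuts an image into."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


patches = num_patch_tokens(224, 16)  # 14 patches per side
print(patches)                       # 196
# A standard ViT prepends one class token, so the encoder emits
# patches + 1 hidden states; keeping only the patch positions gives 196.
print(patches + 1)                   # 197
```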
After obtaining the image encoder features, the integration with the LLM requires aligning the 768-dimensional visual
tokens with the LLM's text tokens, and mapping the image features into the same space as text embeddings. In other
words, text tokens and native visual tokens cannot be directly treated the same; they require cross-modal feature
alignment.
LLaVA-1 uses a simple unbiased linear transformation to achieve this, with great
success, and MiniMind-V does the same.
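A linear projection of this kind is a matrix product applied to every visual token. The sketch below omits a bias term, reading "unbiased" above as "bias-free", which is an assumption. It uses plain Python lists and deliberately tiny dimensions so that it runs anywhere. In the setting described above, the input would be 196 tokens of width 768 and the output width would be the LLM's hidden size. All names are illustrative, not the repository's:

```python
import random
from typing import List

Matrix = List[List[float]]


def make_weight(d_in: int, d_out: int, seed: int = 0) -> Matrix:
    """Random d_in x d_out weight matrix; a trained model would learn these values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]


def project(tokens: Matrix, weight: Matrix) -> Matrix:
    """Apply y = x @ W to every token; there is no bias term."""
    d_out = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(tok)) for j in range(d_out)]
        for tok in tokens
    ]


# Toy sizes: 4 visual tokens of width 6, mapped into a width-3 "text" space.
visual = [[float(t + k) for k in range(6)] for t in range(4)]
W = make_weight(d_in=6, d_out=3)
out = project(visual, W)
print(len(out), len(out[0]))  # 4 3
```

The number of tokens is unchanged by the projection; only the width of each token changes, so the projected tokens can sit next to text embeddings in the same sequence.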
With that, the internal structural changes of MiniMind-V are now fully presented.
Next, let's briefly discuss the changes in the external input and output of MiniMind-V.
The input to the VLM is still a segment of text containing special placeholders.
After computing the text embedding, the vector generated by the image encoder can be projected onto the corresponding
embedding part of the placeholder, replacing the original placeholder embedding.
For example:
<image>\nWhat is in this image?
In `minimind-v`, the `<image>` tag is replaced by a 196-character `@@@...@@@` placeholder. The reason for using 196 characters
is explained earlier:
any image is encoded by the CLIP model as 196 tokens of dimension 768,
thus the `minimind-v` prompt becomes:
@@@......@@@\nWhat is in this image?
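The placeholder expansion described above can be sketched as a plain string replacement. The constant and function names are assumptions for illustration:

```python
IMAGE_TAG = "<image>"
TOKENS_PER_IMAGE = 196  # one placeholder character per visual token
PLACEHOLDER_CHAR = "@"


def expand_image_tags(prompt: str) -> str:
    """Replace every <image> tag with a run of 196 placeholder characters."""
    return prompt.replace(IMAGE_TAG, PLACEHOLDER_CHAR * TOKENS_PER_IMAGE)


expanded = expand_image_tags("<image>\nWhat is in this image?")
print(expanded.count(PLACEHOLDER_CHAR))  # 196
print(expanded[TOKENS_PER_IMAGE:])       # newline followed by the question
```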
After calculating the embedding and projection, and replacing the image token part, the entire calculation process to output is no different from that of the LLM part.
For handling multiple images at once, this can be achieved by injecting multiple `<image>` placeholders without needing
to modify the framework at all.
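At the embedding level, the replacement amounts to locating each run of placeholder token ids and overwriting those positions with projected image vectors. Below is a minimal sketch on plain Python lists. It assumes a hypothetical placeholder id and a tiny run length so that the example stays readable (the text above uses 196 per image):

```python
from typing import List, Sequence

PLACEHOLDER_ID = 34  # hypothetical token id of the placeholder character
RUN = 3              # placeholders per image in this toy example


def find_runs(ids: Sequence[int]) -> List[int]:
    """Start index of every run of RUN consecutive placeholder ids."""
    starts, i = [], 0
    while i + RUN <= len(ids):
        if all(t == PLACEHOLDER_ID for t in ids[i:i + RUN]):
            starts.append(i)
            i += RUN  # skip past this image's slots
        else:
            i += 1
    return starts


def splice(text_emb: List[list], ids: Sequence[int],
           image_embs: List[List[list]]) -> List[list]:
    """Overwrite placeholder positions with image vectors, one image per run."""
    out = list(text_emb)
    for start, img in zip(find_runs(ids), image_embs):
        out[start:start + RUN] = img
    return out


ids = [7, 34, 34, 34, 8, 34, 34, 34, 9]  # two images in one prompt
text_emb = [[float(t)] for t in ids]     # 1-d embeddings for brevity
img_a = [[-1.0], [-1.0], [-1.0]]
img_b = [[-2.0], [-2.0], [-2.0]]
merged = splice(text_emb, ids, [img_a, img_b])
print([v[0] for v in merged])
# [7.0, -1.0, -1.0, -1.0, 8.0, -2.0, -2.0, -2.0, 9.0]
```

Because each run is replaced by exactly as many vectors as it had placeholders, the sequence length stays the same and a second image needs no special handling.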
Expansion Ideas for Video Understanding
written by @xinyanghuang7
For the video understanding capabilities of multimodal large models, one feasible approach is to refer to the existing MiniCPM-V 2.6 Python example for video understanding. The main idea is to extract key frames from the video and then perform multi-image inference. Therefore, if you want to add video understanding capabilities to MiniMind-V, you can base it on the existing multi-image training, refer to the key frame extraction method in this Python script, and increase the number of images supported in the training files. The more MAX_NUM_FRAMES supported, the more GPU memory it will consume.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu  # pip install decord

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
                                  attn_implementation='sdpa',
                                  torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

MAX_NUM_FRAMES = 64  # if cuda OOM set a smaller number


def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames


video_path = "video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]

# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2  # If cuda OOM and video resolution is greater than 448*448, set to 1

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
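The `uniform_sample` helper in the script picks the midpoint of each of `n` equal-width bins over the candidate list. A standalone copy (only the argument names and type hints differ) makes that easy to check on a small input:

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def uniform_sample(items: Sequence[T], n: int) -> List[T]:
    """Pick n items, one from the middle of each of n equal-width bins."""
    gap = len(items) / n
    idxs = [int(i * gap + gap / 2) for i in range(n)]
    return [items[i] for i in idxs]


# 10 candidate frame indices reduced to 5: gap is 2.0, so the picks
# land at positions 1, 3, 5, 7 and 9.
print(uniform_sample(list(range(10)), 5))  # [1, 3, 5, 7, 9]
```

Sampling bin midpoints rather than bin starts keeps the selected frames centered in the video instead of biased toward its beginning.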
At this point, all the details of `MiniMind-V` have been presented.
The `MiniMind-V` model subclass completely inherits from `MiniMind`,
and is generated with minimal changes based on the latter,
with core algorithm modifications `< 50 lines`, making the migration difficulty very low.
Therefore, there may be differences with models like `LLaVA`, but the overall idea remains consistent.
Source: Chinese-LLaVA-Vision
Contains approximately 570,000 pretraining images from CC-3M and COCO 2014;
llava-en-zh-300k
Contains 300k instruction fine-tuning samples and 150k images.
The Q&A content has been translated for better Chinese support, then further organized and resized.
(pretrain_vlm_data.jsonl) Pre-training dataset format:
{
  "conversations": [
    {
      "role": "user",
      "content": "Provide a brief description of the given image.\n<image>"
    },
    {
      "role": "assistant",
      "content": "Olive oil is a healthy ingredient for free use."
    }
  ],
  "image": "GCC_train_002582585.jpg"
}
(sft_vlm_data.jsonl) Single image instruction fine-tuning dataset format:
{
  "conversations": [
    {
      "role": "user",
      "content": "What impact does the location of the alarm clock have on sleep quality?<image>"
    },
    {
      "role": "assistant",
      "content": "Place the digital alarm clock on the nightstand..."
    }
  ],
  "image": "train-00000-of-00001_image_0_0.jpg"
}
(sft_vlm_data_multi.jsonl) Multi-image instruction fine-tuning dataset format:
{
  "conversations": [
    {
      "role": "user",
      "content": "context: Source Image: <image> Target Image: <image> Instruction: What is the correct image edit instruction that can transform the source image to target image?<image>"
    },
    {
      "role": "assistant",
      "content": "take the people out of the back in the photo. Remove the two people behind the woman in the white dress and the man in the blue suit. remove people behind the couple in the center"
    }
  ],
  "image": "0.jpg, 1.jpg"
}
Data Description
- The multi-image dataset is relatively small and contains English conversations, focusing only on scenes with two image comparisons. Therefore, the fine-tuning effect is limited, and this is just one reference approach.
- `jsonl` contains textual instructions, and `images.zip` contains the corresponding image data (to be unzipped after download).
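A record in any of the three formats above can be inspected with the standard library. The sketch below parses one JSONL line, splits the comma-separated `image` field, and counts `<image>` placeholders. The helper name is illustrative, and the sample line is a shortened, made-up variant of the multi-image format:

```python
import json
from typing import Dict, List


def parse_record(line: str) -> Dict[str, object]:
    """Parse one JSONL line into images, placeholder count, and turns."""
    rec = json.loads(line)
    images: List[str] = [name.strip() for name in rec["image"].split(",")]
    turns = rec["conversations"]
    placeholders = sum(t["content"].count("<image>") for t in turns)
    return {"images": images, "placeholders": placeholders, "turns": len(turns)}


sample = json.dumps({
    "conversations": [
        {"role": "user", "content": "Source Image: <image> Target Image: <image>"},
        {"role": "assistant", "content": "remove people behind the couple"},
    ],
    "image": "0.jpg, 1.jpg",
})
print(parse_record(sample))
# {'images': ['0.jpg', '1.jpg'], 'placeholders': 2, 'turns': 2}
```

Comparing the placeholder count with the number of image names is a cheap sanity check to run over a file before training.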
Dataset download link: (ModelScope | HuggingFace)
train_pretrain_vlm
Pre-training learns general image knowledge from a dataset of 595K samples, for example that a deer is a deer and a dog is a dog.
train_sft_vlm
Instruction fine-tuning learns the real Q&A format for image-related questions from a dataset of 300K real conversations, which better aligns with human communication habits.
train_sft_vlm
Multi-image fine-tuning provides a demo: a bird comparison dataset with 13.6k real Q&A samples.
During training, the visual encoder (the CLIP model) is frozen, and only the Projection and LLM parts
are trained.
In pre-training, only the last layer parameters of Projection and LLM are learnable.
In instruction fine-tuning, all parameters of Projection and LLM are learnable.
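One way to read the freezing policy above is as a rule over parameter names. The sketch below is only that, a reading: the name prefixes (`vision_encoder.`, `vision_proj.`, `layers.N.`) are hypothetical, and "last layer" is interpreted as the final transformer block of the LLM. In PyTorch the returned flag would typically be assigned to each parameter's `requires_grad`.

```python
import re
from typing import Dict, List


def is_trainable(name: str, stage: str, num_layers: int) -> bool:
    """Decide whether a named parameter is updated in the given stage."""
    if name.startswith("vision_encoder."):
        return False  # CLIP stays frozen in both stages
    if stage == "sft":
        return True   # projection and the whole LLM
    if stage == "pretrain":
        if name.startswith("vision_proj."):
            return True
        m = re.match(r"layers\.(\d+)\.", name)
        return bool(m) and int(m.group(1)) == num_layers - 1
    raise ValueError(f"unknown stage: {stage}")


names: List[str] = [
    "vision_encoder.embeddings.weight",
    "vision_proj.weight",
    "layers.0.attention.wq.weight",
    "layers.7.attention.wq.weight",
]
for stage in ("pretrain", "sft"):
    flags: Dict[str, bool] = {n: is_trainable(n, stage, num_layers=8) for n in names}
    print(stage, [n for n, on in flags.items() if on])
```

With eight blocks, the pretraining stage keeps only the projection and `layers.7` trainable, while the SFT stage unfreezes everything except the vision encoder.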
Training Time and Loss Trend (for reference only)
(Native PyTorch `*.pth` weight files) Download link:
(ModelScope | HuggingFace)
(`Transformers` format models) Download link:
(ModelScope | HuggingFace)
Note: The Transformers version is the `MiniMind-V` model after single-image instruction fine-tuning
Visual signals are treated as a special foreign language by LLMs, so the "language learning" ability highly depends on the LLM's capacity. The stronger the LLM, the more powerful the corresponding VLM, and the performance boost becomes significant.
> Simpler projection-based cross-modal feature alignment, which may be inferior compared to Cross-Attention.
> The CLIP model could be swapped for a larger, more powerful variant to obtain finer-grained token representations of image features, which are still coarse.
> The resolution is not high, theoretically only 224×224 (the minimind-v dataset is set to 128×128 for space saving).
> ...
Tip
If you find `MiniMind-V` helpful, please consider giving it a ⭐ on GitHub.
Given our limited expertise, there may be undiscovered issues. Everyone is welcome to discuss and correct them in Issues, or to submit PRs
to improve the project.
Your support is the driving force behind continuous improvements to the project. Thank you!
@xinyanghuang7: 🔗Implemented the complete multi-image branch
Reference Links & Thanks to the following excellent papers or projects
- No particular order
- LlaVA
- LlaVA-VL
- Chinese-LLaVA-Vision-Instructions
This repository is licensed under the Apache-2.0 License.