[TOC]
Abstract
The paper releases a dataset for VLM instruction tuning: https://huggingface.co/datasets/MMInstruction/M3IT
Ying-VLM, trained on M3IT, is reported to have strong multi-modal capabilities.
Experiment Setting
8 NVIDIA 80GB A100 GPUs. Stage I took about 10 days, while Stage II can be finished in a day.
- Stage I Visual-Text Alignment
Initial alignment training on LAION-400M.
AdamW optimizer, batch size 256, 300k steps; the learning rate is warmed up to 5e-5 over the first 2000 steps and then follows a cosine decay schedule (see the scheduler sketch after this list). Weight decay is set to 0.05.
- Stage II Multi-modal Instruction Tuning
Instruction tuning runs for 3 epochs with a lower learning rate of 1e-5 and a warmup stage of 1000 steps. Inspired by LoRA tuning, the weights that map query and value vectors in the LLM's attention layers are made learnable in this stage to better adapt to the instruction-tuning dataset (see the unfreezing sketch after this list). Other training parameters are consistent with Stage I.
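
A minimal sketch of the Stage I optimization setup in PyTorch. The placeholder `model` and the linear-warmup shape are assumptions; only the optimizer, peak LR, step counts, and weight decay come from the paper.

```python
import math
import torch

# Placeholder for the visual-text alignment module being trained (assumption).
model = torch.nn.Linear(768, 768)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.05)

total_steps = 300_000
warmup_steps = 2_000

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak LR over the first 2000 steps,
    # then cosine decay over the remaining steps.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() after each optimizer.step() in the training loop.
```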
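
And a minimal sketch of the Stage II partial unfreezing, assuming a HuggingFace LLaMA-style backbone whose attention query/value projections are named `q_proj` / `v_proj`; the helper name and these parameter names are assumptions, not the paper's code.

```python
import torch

def unfreeze_query_value_projections(llm: torch.nn.Module) -> None:
    """Freeze the LLM except its attention query/value projection weights."""
    for name, param in llm.named_parameters():
        # "q_proj" / "v_proj" follow HuggingFace LLaMA-style naming;
        # adjust to match the actual backbone (assumption).
        param.requires_grad = ("q_proj" in name) or ("v_proj" in name)

# Usage sketch: pass only the trainable parameters to AdamW with the lower LR.
# unfreeze_query_value_projections(llm)
# optimizer = torch.optim.AdamW(
#     (p for p in llm.parameters() if p.requires_grad), lr=1e-5, weight_decay=0.05
# )
```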