DeepSpeed-VisualChat

An easy-to-use, scalable, and efficient multi-modal training pipeline for multi-round multi-image interleave chat experience.

📰 Latest News 📰

[2023/10] DeepSpeed-VisualChat: Improve Your Chat Experience with Multi-Round Multi-Image Inputs

⭐ If you find our DeepSpeed and DeepSpeedExamples repositories beneficial, please give them a star on GitHub! To cite DeepSpeed-VisualChat, please cite our arxiv report:

@article{yao2023deepspeed-visualchat,
  title={{DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention}},
  author={Zhewei Yao and Xiaoxia Wu and Conglong Li and Minjia Zhang and Heyang Qin and Olatunji Ruwase and Ammar Ahmad Awan and Samyam Rajbhandari and Yuxiong He},
  journal={arXiv preprint arXiv:2309.14327},
  year={2023}
}

🚀 What is DeepSpeed-VisualChat 🚀

Figure 1. On the left is a DeepSpeed-VisualChat model, featuring an innovative attention design. On the right is an example of DeepSpeed-VisualChat.

With increasing interest in enabling the multi-modal capabilities of large language models, DeepSpeed is proud to announce a new training pipeline, named DeepSpeed-VisualChat. This is designed for enabling a multi-round, multi-image interleave chat framework. It enhances the language model with image understanding and reasoning capabilities. Unlike the majority of open-sourced multi-modal projects, the primary focus of DeepSpeed-VisualChat is to provide a multi-round, multi-image interleave chat experience, as illustrated in Figure 1.

To improve model quality without introducing new parameters, DeepSpeed-VisualChat incorporates a new multi-modal causal attention mechanism, which is adept at better aligning visual and text features. Additionally, to overcome the scarcity of interleaved text-and-image inputs in most available open-sourced datasets, we employ various data blending techniques on existing datasets.

Thanks to the scalable, efficient, and user-friendly nature of the DeepSpeed ecosystem, we have the capability to train using a 2B visual encoder from QWen-VL (one is additionally refined from OpenClip) and a 70B language decoder from LLaMA-2. This showcases the extraordinary scalability of the DeepSpeed-VisualChat framework.

⚓ Get Started, Tutorial, and Documents ⚓

🐼 Installation

git clone https://github.com/deepspeedai/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-VisualChat/
pip install -r requirements.txt

🐼 Datasets Preparation

Table below summarizes where to download the datasets that we support. {data_path} denotes the --data_path argument provided in training scripts.

Dataset name	Where to download
aokvqa	Download `2017 Train images [118K/18GB]` from https://cocodataset.org/#download and save at `{data_path}/coco/train2017/`. Download `aokvqa_v1p0_train.json` from https://allenai.org/project/a-okvqa/home and save at `{data_path}/aokvqa/annotations/`.
coco_caption	Download 2014 Train images and 2014 Val images from https://cocodataset.org/#download and save all images at `{data_path}/coco/2014/`. Download `dataset.json` from https://cs.stanford.edu/people/karpathy/deepimagesent/coco.zip and save at `{data_path}/coco_caption/`.
llava	Download `2017 Train images [118K/18GB]` from https://cocodataset.org/#download and save at `{data_path}/coco/train2017/`. Download `detail_23k.json` and `complex_reasoning_77k.json` from https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K and save at `{data_path}/llava/`.
llava_dial	Download `2017 Train images [118K/18GB]` from https://cocodataset.org/#download and save at `{data_path}/coco/train2017/`. Download `conversation_58k.json` from https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K and save at `{data_path}/llava/`.
llava_otter_blend	Follow instructions of the llava, llava_dial, and otter_mimicit_cgd datasets.
minigpt4	Download `image` folder and `filter_cap.json` from https://huggingface.co/datasets/Vision-CAIR/cc_sbu_align and save at `{data_path}/cc_sbu_align/`.
ocr_vqa	Download `images` folder and `dataset.json` from https://ocr-vqa.github.io/ and save at `{data_path}/OCR_VQA/`.
otter_mimicit_cgd	Download `2017 Train images [118K/18GB]` from https://cocodataset.org/#download and save at `{data_path}/coco/train2017/`. Download `CGD_instructions.json` from https://huggingface.co/datasets/pufanyi/MIMICIT and save at `{data_path}/MIMIC-IT/`.
otter_mimicit_sd	Download `SD.json` and `SD_instructions.json` from https://huggingface.co/datasets/pufanyi/MIMICIT and save at `{data_path}/MIMIC-IT/`.
otter_mimicit_sn	Download `SN.json` and `SN_instructions.json` from https://huggingface.co/datasets/pufanyi/MIMICIT and save at `{data_path}/MIMIC-IT/`.
otter_mimicit_tvc	Download `TVC.json` and `TVC_instructions.json` from https://huggingface.co/datasets/pufanyi/MIMICIT and save at `{data_path}/MIMIC-IT/`.
otter_mimicit_vst	Download `VST.json` and `VST_instructions.json` from https://huggingface.co/datasets/pufanyi/MIMICIT and save at `{data_path}/MIMIC-IT/`.
sparkles_dialogue	Download the `SparklesDialogueCC` and `SparklesDialogueVG` folders from the OneDrive link from https://github.com/HYPJUDY/Sparkles and save at `{data_path}/`.

🐼 Training, Evaluation, Chat API, and Helper

Please refer to