[NeurIPS'23 Oral] Visual Instruction Tuning: LLaVA (Large Language-and-Vision Assistant) built towards GPT-4V level capabilities.
The official repository of Qwen-VL (通义千问-VL), the chat and pretrained large vision-language model proposed by Alibaba Cloud.
Vision-Language Models for Vision Tasks: A Survey
The implementation of "Prismer: A Vision-Language Model with An Ensemble of Experts".
Effective prompting for Large Multimodal Models like GPT-4 Vision or LLaVA. 🔥
Codebase for "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists"
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Alibaba DAMO Academy.
Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Codes for VPGTrans: Transfer Visual Prompt Generator across LLMs. VL-LLaMA, VL-Vicuna.
Collection of papers and resources on Multimodal Reasoning, including Vision-Language Models, Multimodal Chain-of-Thought, Visual Inference, and others.
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning, ICCV 2023
A curated list of awesome knowledge-driven autonomous driving (continually updated)
[CVPR2023] Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective
🎉 PILOT: A Pre-trained Model-Based Continual Learning Toolbox
Recognize Any Regions
Code of the paper: On Evaluating Adversarial Robustness of Large Vision-Language Models