Welcome to OpenGVLab! 👋

OpenGVLab is a community focused on generalized vision-based AI. We strive to develop models that not only excel at a single vision benchmark but have a general understanding of vision, so that little effort is needed to adapt them to new vision-based tasks. We develop model architectures and release pre-trained models to the community to motivate further research in this area. We have made promising progress toward general vision AI, with 57 SOTA rankings from our models across both image-based and video-based tasks. We hope to empower individuals and businesses by offering a higher starting point for developing vision-based AI products and lessening the burden of building an AI model from scratch.

Our Work

  • InternImage 👈

    Best-performing image-based universal backbone model, with up to 3 billion parameters

    90.1% Top-1 accuracy on ImageNet, 65.5 mAP on COCO object detection

    Related projects

    • InternGPT - An open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc.
    • GITM - A novel framework integrating Large Language Models (LLMs) with text-based knowledge and memory, aiming to create Generally Capable Agents in Minecraft.
    • VisionLLM - A unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions.
    • STM-Evaluation - A unified architecture for different spatial token mixing paradigms, with various comparisons and analyses of these "spatial token mixers".
    • M3I-Pretraining - Successfully pre-trains a 1B-parameter model (InternImage-H) with M3I Pre-training, achieving new records of 65.4 mAP on COCO detection test-dev, 62.5 mAP on LVIS detection minival, and 62.9 mIoU on ADE20K.
    • ConvMAE - Transfer learning for object detection on COCO.
  • InternVideo 👈

    The first video foundation model to achieve high performance on both video and video-text tasks.

    SOTA performance on 39 video datasets when released in 2022.

    91.1% Top-1 accuracy on Kinetics-400, 77.2% Top-1 accuracy on Something-Something V2.

    Related projects

    • LORIS - Our model generates long-term soundtracks with state-of-the-art musical quality and rhythmic correspondence.
    • 🔥 Ask-Anything - A simple yet interesting tool for chatting with video
    • 🔥 VideoMAEv2 - Successfully trains a video ViT with a billion parameters, achieving new SOTA performance on Kinetics, Something-Something, and more.
    • Unmasked Teacher - Our scratch-built ViT-L/16 achieves SOTA performance on various video tasks.
    • UniFormerV2 - The first model to achieve 90% top-1 accuracy on Kinetics-400.
    • Efficient Video Learners - Despite small training compute and memory consumption, EVL models achieve high performance on Kinetics-400.
  • General 3D

    • 🔥 HumanBench - A large-scale and diverse human-centric benchmark.
  • Competition-winning solutions 🏆

Repositories

  • Ask-Anything Public

    [VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.

    Python 1,783 MIT 142 14 1 Updated Jun 8, 2023
  • Multi-Modality-Arena Public

    Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!

    Python 74 3 1 0 Updated Jun 8, 2023
  • InternGPT Public

    InternGPT (iGPT) is an open-source demo platform where you can easily showcase your AI models. It now supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM).

    Python 2,366 Apache-2.0 152 3 2 Updated Jun 8, 2023
  • InternImage Public

    [CVPR 2023 Highlight] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

    Python 1,487 MIT 137 94 1 Updated Jun 6, 2023
  • GITM Public

    Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

    391 6 0 0 Updated Jun 5, 2023
  • DDPS Public

    Official Implementation of "Denoising Diffusion Semantic Segmentation with Mask Prior Modeling"

    23 0 0 0 Updated Jun 5, 2023
  • LTVU-LLM Public
    0 0 0 0 Updated Jun 2, 2023
  • UniHCP Public

    Official PyTorch implementation of UniHCP

    Python 40 MIT 1 2 0 Updated Jun 1, 2023
  • InternVideo Public

    InternVideo: General Video Foundation Models via Generative and Discriminative Learning (https://arxiv.org/abs/2212.03191)

    Python 362 Apache-2.0 26 18 0 Updated May 30, 2023