Welcome to OpenGVLab! 👋

OpenGVLab is a community focused on generalized vision-based AI. We strive to develop models that not only excel at a single vision benchmark but have a general understanding of vision, so that little effort is needed to adapt them to new vision-based tasks. We develop model architectures and release pre-trained models to the community to motivate further research in this area. We have made promising progress toward general vision AI, with 57 SOTA rankings from our models across both image-based and video-based tasks. We hope to empower individuals and businesses by offering a higher starting point for developing vision-based AI products and lessening the burden of building an AI model from scratch.

Our Work

  • InternImage 👈

    Best-performing image-based universal backbone model, with up to 3 billion parameters

    90.1% Top-1 accuracy on ImageNet, 65.5 mAP on COCO object detection

    Related projects

    • InternGPT - An open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc.
    • GITM - A novel framework integrating Large Language Models (LLMs) with text-based knowledge and memory, aiming to create Generally Capable Agents in Minecraft.
    • VisionLLM - A unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions.
    • STM-Evaluation - A unified architecture for different spatial token mixing paradigms, with various comparisons and analyses of these "spatial token mixers".
    • M3I-Pretraining - Successfully pre-trains a 1B-parameter model (InternImage-H) with M3I Pre-training, achieving new records of 65.4 mAP on COCO detection test-dev, 62.5 mAP on LVIS detection minival, and 62.9 mIoU on ADE20K.
    • ConvMAE - Transfer learning for object detection on COCO.
  • InternVideo 👈

    The first video foundation model to achieve high performance on both video and video-text tasks.

    SOTA performance on 39 video datasets when released in 2022.

    91.1% Top-1 accuracy on Kinetics-400, 77.2% Top-1 accuracy on Something-Something V2.

    Related projects

    • LORIS - Our model generates long-term soundtracks with state-of-the-art musical quality and rhythmic correspondence.
    • 🔥 Ask-Anything - A simple yet interesting tool for chatting with video
    • 🔥 VideoMAEv2 - Successfully trains a video ViT with a billion parameters, achieving new SOTA performance on Kinetics, Something-Something, and more.
    • Unmasked Teacher - Our scratch-built ViT-L/16 achieves SOTA performance on various video tasks.
    • UniFormerV2 - The first model to achieve 90% top-1 accuracy on Kinetics-400.
    • Efficient Video Learners - Despite small training compute and memory consumption, EVL models achieve high performance on Kinetics-400.
  • General 3D

    • 🔥 HumanBench - A large-scale and diverse human-centric benchmark.
  • Competition-winning solutions 🏆

Repositories

  • Ask-Anything Public

    [VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.

    Python 1,783 MIT 142 14 1 Updated Jun 8, 2023
  • Multi-Modality-Arena Public

    Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!

    Python 74 3 1 0 Updated Jun 8, 2023
  • InternGPT Public

    InternGPT (iGPT) is an open-source demo platform where you can easily showcase your AI models. It now supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM).

    Python 2,366 Apache-2.0 152 3 2 Updated Jun 8, 2023
  • InternImage Public

    [CVPR 2023 Highlight] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

    Python 1,487 MIT 137 94 1 Updated Jun 6, 2023
  • GITM Public

    Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

    391 6 0 0 Updated Jun 5, 2023
  • DDPS Public

    Official Implementation of "Denoising Diffusion Semantic Segmentation with Mask Prior Modeling"

    23 0 0 0 Updated Jun 5, 2023
  • LTVU-LLM Public
    0 0 0 0 Updated Jun 2, 2023
  • UniHCP Public

    Official PyTorch implementation of UniHCP

    Python 40 MIT 1 2 0 Updated Jun 1, 2023
  • InternVideo Public

    InternVideo: General Video Foundation Models via Generative and Discriminative Learning (https://arxiv.org/abs/2212.03191)

    Python 362 Apache-2.0 26 18 0 Updated May 30, 2023