Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
-
Updated
Nov 25, 2023 - Python
Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
Vector search for humans. Also available on cloud - cloud.marqo.ai
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)
A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
"Video-ChatGPT" is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Alibaba DAMO Academy.
PyTorch implementation of the paper "Controlling Vision-Language Models for Universal Image Restoration".
CLIPort: What and Where Pathways for Robotic Manipulation
DriveLM: Drive on Language
Empowers LLMs with the ability to see and draw.
[ICCV2021 & TPAMI2023] Vision-Language Transformer and Query Generation for Referring Segmentation
多模态中文LLaMA&Alpaca大语言模型(VisualCLA)
(CVPR2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain.
Tools for movie and video research
[IEEE Transactions on Medical Imaging/TMI] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Collection of papers and resources on Multimodal Reasoning, including Vision-Language Models, Multimodal Chain-of-Thought, Visual Inference, and others.
Add a description, image, and links to the vision-language topic page so that developers can more easily learn about it.
To associate your repository with the vision-language topic, visit your repo's landing page and select "manage topics."