vision-language

OFA-Sys / OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

prompt chinese image-captioning pretrained-models visual-question-answering multimodal text-to-image-synthesis vision-language pretraining referring-expression-comprehension prompt-tuning

Updated Oct 25, 2023
Python

google-research / pix2seq

Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)

computer-vision deep-learning object-detection tensorflow2 vision-language pix2seq

Updated Nov 7, 2023
Jupyter Notebook

OFA-Sys / ONE-PEACE

A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

representation-learning multimodal vision-and-language contrastive-loss vision-language vision-transformer foundation-models audio-language

Updated Dec 5, 2023
Python

mbzuai-oryx / Video-ChatGPT

"Video-ChatGPT" is a video conversation model capable of generating meaningful conversations about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.

chatbot llama clip mulit-modal vision-language vicuna gpt-4 vision-language-pretraining llava video-chatboat video-conversation

Updated Nov 14, 2023
Python

AlibabaResearch / AdvancedLiterateMachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Alibaba DAMO Academy.

ocr computer-vision artificial-intelligence text-recognition document text-detection document-analysis end-to-end-ocr multimodal scene-text-recognition multimodal-deep-learning scene-text-detection vision-language document-understanding scene-text-detection-recognition document-recognition document-intelligence documentai vision-language-transformer vision-language-model

Updated Nov 24, 2023
C++

Algolzw / daclip-uir

PyTorch implementation of the paper "Controlling Vision-Language Models for Universal Image Restoration".

deep-learning prompt pytorch image-denoising image-restoration image-deblurring low-level-vision shadow-removal image-dehazing face-inpainting vision-language diffusion-models low-light-image-enhancement image-deraining jpeg-artifacts-removal image-desnowing

Updated Oct 24, 2023
Python

cliport / cliport

CLIPort: What and Where Pathways for Robotic Manipulation

natural-language-processing computer-vision deep-learning robotics pytorch vision manipulation clip rearrangement grounding vision-language

Updated Nov 2, 2023
Jupyter Notebook

OpenDriveLab / DriveLM

DriveLM: Drive on Language

autonomous-driving vision-language large-language-models llm prompt-engineering prompting chain-of-thought tree-of-thoughts graph-of-thoughts

Updated Dec 7, 2023
HTML

AILab-CVC / SEED

Empowers LLMs with the ability to see and draw.

multimodal vision-language foundation-model

Updated Nov 29, 2023
Python

henghuiding / Vision-Language-Transformer

[ICCV2021 & TPAMI2023] Vision-Language Transformer and Query Generation for Referring Segmentation

tensorflow keras transformer vision-language referring-segmentation tpami iccv2021 vision-language-transformer

Updated Jan 7, 2022
Python

airaria / Visual-Chinese-LLaMA-Alpaca

Multimodal Chinese LLaMA & Alpaca large language model (VisualCLA)

nlp chinese llama lora alpaca multimodal vision-language llm

Updated Jul 27, 2023
Python

mczhuge / Kaleido-BERT

(CVPR2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain.

fashion e-commerce bert multimodal pre-training vision-language

Updated Jun 29, 2022
Python

movienet / movienet-tools

Tools for movie and video research

movie computer-vision deep-learning action-recognition video-understanding cross-modality shot-detection vision-language person-analysis

Updated Jun 20, 2022
C++

HUANGLIZI / LViT

[IEEE Transactions on Medical Imaging/TMI] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"

pytorch segmentation medical-image-analysis multimodal-learning vision-language

Updated Oct 26, 2023
Python

mees / calvin

CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

natural-language-processing computer-vision deep-learning robotics pytorch vision manipulation vision-and-language grounding vision-language

Updated Dec 7, 2023
Python

atfortes / Awesome-Multimodal-Reasoning

Collection of papers and resources on Multimodal Reasoning, including Vision-Language Models, Multimodal Chain-of-Thought, Visual Inference, and others.

Updated Nov 3, 2023

108 public repositories match this topic.