Benchmarks and Evaluations, RL Alignment, Applications, and Challenges of Large Vision Language Models
🌐 Language: English · 简体中文
A frontier collection and survey of vision-language model papers and models, maintained as a GitHub repository.
🧭 The Evolution of VLM Architectures
VLM design has gone through three distinct architectural eras in just five years — and Era 3 has split into two parallel branches. Early models bridged a frozen vision encoder to a frozen language model with a learnable connector (CLIP, BLIP, Flamingo). The 2023–2025 generation made a pretrained LLM the trunk and treated vision as a bolt-on adapter (LLaVA, Qwen2.5-VL, GPT-4V). The latest 2025–2026 generation drops the bridge entirely and trains a single transformer from scratch on mixed-modality data — but it forks along the output axis:
Era 3a — Native Multimodal Input → Text Out. Image, video, and (sometimes) audio enter a single early-fused token stream, but generation is still autoregressive text. This is the design used by today's general-purpose flagships: Qwen3.5 / Qwen3.6, Gemma 4, Gemini 3, GPT-5.4, Phi-4-Reasoning-Vision, Claude Opus 4.6.
Era 3b — Omni-Modal Unified I/O. The same fused trunk plus dedicated image decoder (VAE / MMDiT / flow-matching) and/or audio codec decoder heads, so the model can also generate images and speech. This is the design used by unified models: BAGEL, Qwen3.5-Omni, InternVL-U, Emu3 / Emu3.5, ERNIE 5.0, DeepSeek-Janus-Pro.
Reading the diagram (left → right). Era 1 uses a two-tower design with a learnable cross-modal bridge (e.g. Q-Former) into a frozen LM — text-only output. Era 2 puts a pretrained LLM at the center; an MLP/Resampler projects visual tokens into the LLM's embedding space, and the LLM does all the reasoning — still text-only output. Era 3a drops the bridge: image, video, audio, and text share a single tokenizer/embedding space and flow through one transformer trained from scratch — but the output is still autoregressive text. Era 3b keeps that fused trunk and adds decoder heads (image VAE/MMDiT, audio codec) so the model can natively output text, image, and/or speech. Era 3a and Era 3b coexist; the choice is essentially "how much do you want non-text generation?"
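To make the Era 2 vs. Era 3a distinction concrete, here is a minimal, hypothetical PyTorch sketch (toy dimensions; not the implementation of any model listed here). Era 2 keeps a separate vision encoder whose outputs pass through a learnable projector into a pretrained LLM; Era 3a embeds patches and text into one shared space and runs a single trunk. Both still emit only text logits.

```python
# Illustrative sketch only: toy sizes, no pretrained weights, no causal masking.
import torch
import torch.nn as nn


class Era2Adapter(nn.Module):
    """Era 2: a separate vision encoder feeds a learnable MLP projector whose
    outputs are prepended to the text embeddings of a pretrained decoder-only LLM."""

    def __init__(self, vis_dim=768, llm_dim=1024, vocab=32000):
        super().__init__()
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(vis_dim, nhead=8, batch_first=True), num_layers=2
        )
        self.projector = nn.Sequential(  # the learnable "bridge"
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.text_embed = nn.Embedding(vocab, llm_dim)
        self.llm = nn.TransformerEncoder(  # stand-in for a causal decoder-only LLM
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2
        )
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, patch_feats, text_ids):
        vis_tokens = self.projector(self.vision_encoder(patch_feats))
        seq = torch.cat([vis_tokens, self.text_embed(text_ids)], dim=1)
        return self.lm_head(self.llm(seq))  # logits over the text vocab only


class Era3aEarlyFusion(nn.Module):
    """Era 3a: image patches and text ids are embedded into one shared space and
    a single transformer trunk attends over the fused stream (text-only output)."""

    def __init__(self, dim=1024, vocab=32000, patch_dim=768):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, dim)   # images -> shared token space
        self.text_embed = nn.Embedding(vocab, dim)
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2
        )
        self.lm_head = nn.Linear(dim, vocab)           # output is still text tokens

    def forward(self, patches, text_ids):
        fused = torch.cat([self.patch_embed(patches), self.text_embed(text_ids)], dim=1)
        return self.lm_head(self.trunk(fused))


if __name__ == "__main__":
    patches, text = torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8))
    print(Era2Adapter()(patches, text).shape)        # -> [1, 24, 32000]
    print(Era3aEarlyFusion()(patches, text).shape)   # -> [1, 24, 32000]
```

Era 3b would extend the second class with extra decoder heads (image VAE/MMDiT, audio codec) on top of the same fused trunk.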
Below we compile papers, models, and GitHub repositories covering:
State-of-the-Art VLMs: a collection of VLMs from newest to oldest (we keep adding new models and benchmarks).
Benchmarks and Evaluation: VLM benchmarks with links to the corresponding works.
Post-training/Alignment: recent work on VLM alignment, including RL and SFT.
Applications: applications of VLMs in embodied AI, robotics, and related areas.
Contributions: surveys, perspectives, and datasets on the above topics.
Progressive research reports
In dated mini-surveys, we track new VLMs, benchmarks, and post-training methods that haven't yet been folded into the main tables:
📰 2026-04-28 — latest: Qwen3.6-27B & Qwen3.6-35B-A3B, Claude Mythos (gated), S1-VL, GLM-5V-Turbo, FreshPER / GMPO / ARPO / GRPO-VPS, QUOTA, Fast-dVLM, VLA-World, SpanVLA, VLA-Forget, R-VLM, UILoop, WebForge, WorldMark, Video-MME-v2, CrossMath, BabyVision, SlowBA — 30 new entries since April 13.
📰 2026-04-13 — LFM2.5-VL-450M, EXAONE 4.5, Gemma 4, Granite 4.0 3B Vision, InternVL-U, GLM-4.6V, Vero, MolmoWeb, UniDriveVLA, QAPruner, Firebolt-VL, CoME-VL, and more.
📰 2026-03-25 — GPT-5.4, Phi-4-Reasoning-Vision-15B, Gemini 3.0, Qwen3.5, Claude Opus 4.6, Molmo2, and more.
Contributions and discussion are welcome!
🤩 Papers marked with a ⭐️ are contributed by the maintainers of this repository. If you find them useful, we would greatly appreciate it if you could give the repository a star or cite our paper.
@InProceedings{Li_2025_CVPR,
author = {Li, Zongxia and Wu, Xiyang and Du, Hongyang and Liu, Fuxiao and Nghiem, Huy and Shi, Guangyao},
title = {A Survey of State of the Art Large Vision Language Models: Benchmark Evaluations and Challenges},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2025},
pages = {1587-1606}
}
Model
Year
Architecture
Training Data
Parameters
Vision Encoder/Tokenizer
Pretrained Backbone Model
Qwen3.6-27B (Alibaba)
04/22/2026
Decoder-only / natively multimodal input (thinking + non-thinking)
Multimodal pretraining + agentic mid-training
27B dense
Native multimodal ViT
Qwen3.6
Qwen3.6-35B-A3B (Alibaba)
04/15/2026
MoE / natively multimodal input
Multimodal pretraining + agentic SFT/RL
35B total · 3B active
Native multimodal ViT
Qwen3.6
LFM2.5-VL-450M (Liquid AI)
04/11/2026
Liquid Foundation Model
Undisclosed
450M
Non-overlapping tile ViT
LFM2.5
EXAONE 4.5 (LG AI Research)
04/09/2026
Unified VL
Undisclosed
33B
Proprietary vision encoder
EXAONE 4.5
Claude Mythos (Anthropic, gated preview)
04/07/2026
Decoder-only (frontier; Project Glasswing gated)
Undisclosed
Undisclosed
Undisclosed
Undisclosed
Gemma 4 (Google)
04/02/2026
Decoder-only / MoE
Undisclosed (140+ languages)
E2B / E4B / 26B MoE / 31B Dense
Native multimodal
Gemini 3
Granite 4.0 3B Vision (IBM)
04/01/2026
Decoder-only
Enterprise document corpora
3B
Undisclosed
Granite 4.0
GLM-5V-Turbo (Zhipu / Z.AI)
04/01/2026
Natively multimodal (vision-coding) with Multi-Token Prediction
30+ task joint RL
Undisclosed
CogViT
GLM-5
InternVL-U (Shanghai AI Lab)
03/10/2026
Unified (MLLM + MMDiT)
Multimodal understanding + generation
4B
InternViT
InternVL
GPT-5.4 / GPT-5.4 Thinking (OpenAI)
03/06/2026
Decoder-only
Undisclosed
Undisclosed
Undisclosed
Undisclosed
Phi-4-Reasoning-Vision-15B (Microsoft)
03/04/2026
Decoder-only
Curated synthetic + filtered data
15B
High-res dynamic-resolution ViT
Phi-4
Gemini 3.0 (Google)
03/2026
Unified Model
Undisclosed
Undisclosed
Undisclosed
Undisclosed
Qwen3.5 (Alibaba)
02/16/2026
Unified VL (early fusion)
Trillions of multimodal tokens
0.8B–397B (MoE, 17B active)
ViT (native)
Qwen3.5
Claude Opus 4.6 (Anthropic)
02/2026
Decoder-only
Undisclosed
Undisclosed
Undisclosed
Undisclosed
ERNIE 5.0 (Baidu)
02/05/2026
Unified Model (Visual, Text, Audio)
Unified Modality Dataset
-
CNN–ViT (Understanding)/Next-Frame-and-Scale Prediction (Generation)
Unified Autoregressive Transformer
Molmo2 (Allen AI)
01/15/2026
Decoder-only
7 new video + 2 multi-image datasets (9.19M videos)
4B / 7B / 8B
Bi-directional attention ViT
Qwen 3 / OLMo
Gemini 3
11/18/2025
Unified Model
Undisclosed
-
-
-
Emu3.5
10/30/2025
Decoder-only
Unified Modality Dataset
-
SigLIP
Qwen3
DeepSeek-OCR
10/20/2025
Encoder-Decoder
70% OCR, 20% general vision, 10% text-only
3B
DeepEncoder
DeepSeek-3B
Qwen3-VL
10/11/2025
Decoder-Only
-
8B/4B
ViT
Qwen3
Qwen3-VL-MoE
09/25/2025
Decoder-Only
-
235B-A22B
ViT
Qwen3
Qwen3-Omni (Visual/Audio/Text)
09/21/2025
-
Video/Audio/Image
30B
ViT
Qwen3-Omni-MoE-Thinker
LLaVA-Onevision-1.5
09/15/2025
-
Mid-Training-85M & SFT
8B
Qwen2VLImageProcessor
Qwen3
InternVL3.5
08/25/2025
Decoder-Only
multimodal & text-only
30B/38B/241B
InternViT-300M/6B
Qwen3 / GPT-OSS
SkyWork-Unipic-1.5B
07/29/2025
-
image/video..
-
-
-
Grok 4
07/09/2025
-
image/video..
1-2 Trillion
-
-
Kwai Keye-VL (Kuaishou)
07/02/2025
Decoder-only
image/video..
8B
ViT
Qwen3-8B
OmniGen2
06/23/2025
Decoder-only & VAE
LLaVA-OneVision/ SAM-LLaVA..
-
ViT
Qwen2.5-VL
Gemini-2.5-Pro
06/17/2025
-
-
-
-
-
GPT-o3/o4-mini
06/10/2025
Decoder-only
Undisclosed
Undisclosed
Undisclosed
Undisclosed
Mimo-VL (Xiaomi)
06/04/2025
Decoder-only
24 Trillion MLLM tokens
7B
Qwen2.5-ViT
Mimo-7B-base
BAGEL (Bytedance)
05/20/2025
Unified Model
Video/Image/Text
7B
[SigLIP2-so400m/14](https://arxiv.org/abs/2502.14786)
Qwen2.5
BLIP3-o
05/14/2025
Decoder-only
(BLIP3-o 60K) GPT-4o Generated Image Generation Data
4/8B
ViT
Qwen2.5-VL
InternVL-3
04/14/2025
Decoder-only
200 Billion Tokens
1/2/8/9/14/38/78B
ViT-300M/6B
InternLM2.5/Qwen2.5
LLaMA4-Scout/Maverick
04/04/2025
Decoder-only
40/20 Trillion Tokens
17B
MetaCLIP
LLaMA4
Qwen2.5-Omni
03/26/2025
Decoder-only
Video/Audio/Image/Text
7B
Qwen2-Audio/Qwen2.5-VL ViT
End-to-End Mini-Omni
Qwen2.5-VL
01/28/2025
Decoder-only
Image caption, VQA, grounding agent, long video
3B/7B/72B
Redesigned ViT
Qwen2.5
GLM-4.6V (Zhipu / Z.AI)
12/2025
Decoder-only
Undisclosed
106B / 9B (Flash)
Undisclosed
GLM-4.6
Ola
2025
Decoder-only
Image/Video/Audio/Text
7B
OryxViT
Qwen-2.5-7B, SigLIP-400M, Whisper-V3-Large, BEATs-AS2M(cpt2)
Ocean-OCR
2025
Decoder-only
Pure Text, Caption, Interleaved, OCR
3B
NaViT
Pretrained from scratch
SmolVLM
2025
Decoder-only
SmolVLM-Instruct
250M & 500M
SigLIP
SmolLM
DeepSeek-Janus-Pro
2025
Decoder-only
Undisclosed
7B
SigLIP
DeepSeek-Janus-Pro
Inst-IT
2024
Decoder-only
Inst-IT Dataset, LLaVA-NeXT-Data
7B
CLIP/Vicuna, SigLIP/Qwen2
LLaVA-NeXT
DeepSeek-VL2
2024
Decoder-only
WiT, WikiHow
4.5B x 74
SigLIP / SAM-B
DeepSeekMoE
xGen-MM (BLIP-3)
2024
Decoder-only
MINT-1T, OBELICS, Caption
4B
ViT + Perceiver Resampler
Phi-3-mini
TransFusion
2024
Encoder-decoder
Undisclosed
7B
VAE Encoder
Pretrained from scratch on transformer architecture
Baichuan Ocean Mini
2024
Decoder-only
Image/Video/Audio/Text
7B
CLIP ViT-L/14
Baichuan
LLaMA 3.2-vision
2024
Decoder-only
Undisclosed
11B-90B
CLIP
LLaMA-3.1
Pixtral
2024
Decoder-only
Undisclosed
12B
CLIP ViT-L/14
Mistral Large 2
Qwen2-VL
2024
Decoder-only
Undisclosed
7B-14B
EVA-CLIP ViT-L
Qwen-2
NVLM
2024
Encoder-decoder
LAION-115M
8B-24B
Custom ViT
Qwen-2-Instruct
Emu3
2024
Decoder-only
Aquila
7B
MoVQGAN
LLaMA-2
Claude 3
2024
Decoder-only
Undisclosed
Undisclosed
Undisclosed
Undisclosed
InternVL
2023
Encoder-decoder
LAION-en, LAION-multi
7B/20B
EVA-CLIP ViT-g
QLLaMA
InstructBLIP
2023
Encoder-decoder
COCO, VQAv2
13B
ViT
Flan-T5, Vicuna
CogVLM
2023
Encoder-decoder
LAION-2B, COYO-700M
18B
CLIP ViT-L/14
Vicuna
PaLM-E
2023
Decoder-only
All robots, WebLI
562B
ViT
PaLM
LLaVA-1.5
2023
Decoder-only
COCO
13B
CLIP ViT-L/14
Vicuna
Gemini
2023
Decoder-only
Undisclosed
Undisclosed
Undisclosed
Undisclosed
GPT-4V
2023
Decoder-only
Undisclosed
Undisclosed
Undisclosed
Undisclosed
BLIP-2
2023
Encoder-decoder
COCO, Visual Genome
7B-13B
ViT-g
Open Pretrained Transformer (OPT)
Flamingo
2022
Decoder-only
M3W , ALIGN
80B
Custom
Chinchilla
BLIP
2022
Encoder-decoder
COCO, Visual Genome
223M-400M
ViT-B/L/g
Pretrained from scratch
CLIP
2021
Dual encoder (contrastive)
400M image-text pairs
63M-355M
ViT/ResNet
Pretrained from scratch
2. 🗂️ Benchmarks and Evaluation
2.1. Datasets for Training VLMs
Dataset
Task
Size
MolmoWebMix (Allen AI) (04/2026)
Web Agent Training Trajectories
100K+ synthetic + 30K human demos
Vero-600K (04/2026)
Broad Visual Reasoning RL Training
600K samples from 59 datasets, 6 task categories
BigEarthNet.txt (03/2026)
Multi-sensor Earth Observation Image-Text
464K images, 9.6M text annotations
OmniScience (02/2026)
Scientific Image Understanding
1.5M figure-caption-context triplets
MaD-Mix (02/2026)
Multi-modal Data Mixture Optimization
Framework (0.5B–7B scale)
OVID (2026)
Open Video Pre-training
10M hours, 300M frame-caption pairs
Molmo2 Video Datasets (01/2026)
Video Captions, QA, Tracking, Pointing
9.19M videos (7 video + 2 multi-image datasets)
MMFineReason (01/30/2026)
Reasoning
1.8M
FineVision (09/04/2025)
Mixed Domain
24.3M / 4.48TB
2.2. Datasets and Evaluation for VLM
🧮 Visual Math (+ Visual Math Reasoning)
Dataset
Task
Eval Protocol
Annotators
Size (K)
Code / Site
MathVision
Visual Math
MC / Answer Match
Human
3.04
Repo
MathVista
Visual Math
MC / Answer Match
Human
6
Repo
MathVerse
Visual Math
MC
Human
4.6
Repo
VisNumBench
Visual Number Reasoning
MC
Python program-generated / Web collection / Real-life photos
1.91
Repo
💬 Benchmarks for Unified Models
Dataset
Task
Eval Protocol
Annotators
Size (K)
Code / Site
ROVER
Reciprocal Cross-Modal Reasoning
Visual Gen + Verbal Gen Eval
Human
1.3 (1,876 images)
Paper
RealUnify
Math, World knowledge, Image Gen
Direct & Stepwise Eval (Sec 3.3)
Script & Human verification
1.0
Uni-MMMU
Science, Code, Image Gen
DreamSim (Image Gen Eval) & String Matching (Understanding Eval)
-
1.0
Repo
🎥 Video Understanding
Dataset
Task
Eval Protocol
Annotators
Size (K)
Code / Site
MMOU
Omni-modal Long Video Understanding
MC
Human
15 (9,038 videos)
Paper
Video-MMMU
Knowledge Acquisition from Professional Videos
MC + Knowledge Gain
Expert
0.9 (300 videos)
Paper
MMVU
Expert-Level Multi-Discipline Video Understanding
MC
Expert
3 (27 subjects)
Paper
VideoHallu
Video Understanding
LLM Eval
Human
3.2
Video SimpleQA
Video Understanding
LLM Eval
Human
2.03
Repo
MovieChat
Video Understanding
LLM Eval
Human
1
Repo
Perception‑Test
Video Understanding
MC
Crowd
11.6
Repo
VideoMME
Video Understanding
MC
Experts
2.7
Site
EgoSchema
Video Understanding
MC
Synth / Human
5
Site
Inst‑IT‑Bench
Fine‑grained Image & Video
MC & LLM
Human / Synth
2
Repo
💬 Multimodal Conversation
Dataset
Task
Eval Protocol
Annotators
Size (K)
Code / Site
VisionArena
Multimodal Conversation
Pairwise Pref
Human
23
Repo
🧠 Multimodal General Intelligence
Dataset
Task
Eval Protocol
Annotators
Size (K)
Code / Site
OmniEarth
Geospatial / Remote Sensing VLM Eval
MC + Open VQA
Human (verified)
44.2 (9,275 images, 28 tasks)
Paper
MultiHaystack
Multimodal Retrieval & Reasoning
Retrieval + QA
Human
0.75 (46K+ candidates)
DatBench
Discriminative, Faithful VLM Eval
MC (format-aware)
Synth
-
MMLU
General MM
MC
Human
15.9
MMStar
General MM
MC
Human
1.5
Site
NaturalBench
General MM
Yes/No, MC
Human
10
HF
PHYSBENCH
Visual Math Reasoning
MC
Grad STEM
0.10
Repo
🔎 Visual Reasoning / VQA (+ Multilingual & OCR)
Dataset
Task
Eval Protocol
Annotators
Size (K)
Code / Site
EMMA
Visual Reasoning
MC
Human + Synth
2.8
Repo
MMTBENCH
Visual Reasoning & QA
MC
AI Experts
30.1
Repo
MM‑Vet
OCR / Visual Reasoning
LLM Eval
Human
0.2
Repo
MM‑En/CN
Multilingual MM Understanding
MC
Human
3.2
Repo
GQA
Visual Reasoning & QA
Answer Match
Seed + Synth
22
Site
VCR
Visual Reasoning & QA
MC
MTurks
290
Site
VQAv2
Visual Reasoning & QA
Yes/No, Ans Match
MTurks
1100
Repo
MMMU
Visual Reasoning & QA
Ans Match, MC
College
11.5
Site
MMMU-Pro
Visual Reasoning & QA
Ans Match, MC
College
5.19
Site
R1‑Onevision
Visual Reasoning & QA
MC
Human
155
Repo
VLM²‑Bench
Visual Reasoning & QA
Ans Match, MC
Human
3
Site
VisualWebInstruct
Visual Reasoning & QA
LLM Eval
Web
0.9
Site
📝 Visual Text / Document Understanding (+ Charts)
Dataset
Task
Eval Protocol
Annotators
Size (K)
Code / Site
TableVision
Spatially Grounded Table Reasoning
3-level Cognitive Eval
Human
6.8 (13 sub-categories)
Paper
TextVQA
Visual Text Understanding
Ans Match
Expert
28.6
Repo
DocVQA
Document VQA
Ans Match
Crowd
50
Site
ChartQA
Chart Graphic Understanding
Ans Match
Crowd / Synth
32.7
Repo
🌄 Text‑to‑Image Generation
Dataset
Task
Eval Protocol
Annotators
Size (K)
Code / Site
MSCOCO‑30K
Text‑to‑Image
BLEU, ROUGE, Sim
MTurks
30
Site
GenAI‑Bench
Text‑to‑Image
Human Rating
Human
80
HF
🚨 Hallucination Detection / Control
2.3. Benchmark Datasets, Simulators, and Generative Models for Embodied VLM
Benchmark
Domain
Type
Project
Drive-Bench
Embodied AI
Autonomous Driving
Website
Habitat, Habitat 2.0, Habitat 3.0
Robotics (Navigation)
Simulator + Dataset
Website
Gibson
Robotics (Navigation)
Simulator + Dataset
Website, Github Repo
iGibson1.0, iGibson2.0
Robotics (Navigation)
Simulator + Dataset
Website, Document
Isaac Gym
Robotics (Navigation)
Simulator
Website, Github Repo
Isaac Lab
Robotics (Navigation)
Simulator
Website, Github Repo
AI2THOR
Robotics (Navigation)
Simulator
Website, Github Repo
ProcTHOR
Robotics (Navigation)
Simulator + Dataset
Website, Github Repo
VirtualHome
Robotics (Navigation)
Simulator
Website, Github Repo
ThreeDWorld
Robotics (Navigation)
Simulator
Website, Github Repo
VIMA-Bench
Robotics (Manipulation)
Simulator
Website, Github Repo
VLMbench
Robotics (Manipulation)
Simulator
Github Repo
CALVIN
Robotics (Manipulation)
Simulator
Website, Github Repo
GemBench
Robotics (Manipulation)
Simulator
Website, Github Repo
WebArena
Web Agent
Simulator
Website, Github Repo
UniSim
Robotics (Manipulation)
Generative Model, World Model
Website
GAIA-1
Robotics (Autonomous Driving)
Generative Model, World Model
Website
LWM
Embodied AI
Generative Model, World Model
Website, Github Repo
Genesis
Embodied AI
Generative Model, World Model
Github Repo
EMMOE
Embodied AI
Generative Model, World Model
Paper
RoboGen
Embodied AI
Generative Model, World Model
Website
UnrealZoo
Embodied AI (Tracking, Navigation, Multi Agent)
Simulator
Website
3.1. RL Alignment for VLM
Title
Year
Paper
RL
Code
Vero: An Open RL Recipe for General Visual Reasoning
04/2026
Paper
Task-routed rewards; GRPO-based
Code
wDPO: Winsorized Direct Preference Optimization for Robust Alignment
03/2026
Paper
wDPO
-
f-GRPO and Beyond: Divergence-Based RL for General LLM Alignment
02/2026
Paper
f-GRPO / f-HAL
From Sight to Insight: Improving Visual Reasoning of MLLMs via Reinforcement Learning
01/2026
Paper
GRPO (6 reward functions)
SaFeR-VLM: Safety-Aware Reinforcement Learning for Multimodal Reasoning
2026 (ICLR)
Paper
GRPO + safety reward
SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning
11/2025
Paper
Dual-Reward (Thinking + Judging)
GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA
10/2025
Paper
GIFT (convex MSE loss)
Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning
10/12/2025
Paper
GRPO
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
09/29/2025
Paper
GRPO
-
Vision-SR1: Self-rewarding vision-language model via reasoning decomposition
08/26/2025
Paper
GRPO
-
Group Sequence Policy Optimization
06/24/2025
Paper
GSPO
-
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning
05/20/2025
Paper
GRPO
-
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
2025/04/10
Paper
GRPO
Code
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement
2025/03/21
Paper
GRPO
Code
Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning
2025/03/10
Paper
GRPO
Code
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
2025
Paper
DPO
Code
Multimodal Open R1/R1-Multimodal-Journey
2025
-
GRPO
Code
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
2025
Paper
GRPO
Code
Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
2025
-
PPO/REINFORCE++/GRPO
Code
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning
2025
Paper
REINFORCE Leave-One-Out (RLOO)
Code
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
2025
Paper
DPO
Code
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
2025
Paper
PPO
Code
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
2025
Paper
GRPO
Code
Unified Reward Model for Multimodal Understanding and Generation
2025
Paper
DPO
Code
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
2025
Paper
DPO
Code
All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
2025
Paper
Online RL
-
Video-R1: Reinforcing Video Reasoning in MLLMs
2025
Paper
GRPO
Code
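Since GRPO and its variants dominate the table above, the following is a minimal, hypothetical sketch of the group-relative advantage and clipped surrogate loss such methods build on (toy rewards, no KL penalty; a simplification, not any listed paper's implementation):

```python
# Toy sketch: sample a group of responses per prompt, score each with a verifiable
# reward, and normalize rewards within the group so no learned critic is needed.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_prompts, group_size] scalar rewards for sampled responses."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_policy_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate applied per response with group-relative advantages."""
    ratio = torch.exp(logprobs_new - logprobs_old)            # [num_prompts, group_size]
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

if __name__ == "__main__":
    # e.g. 2 prompts, 4 sampled responses each, binary answer-match rewards
    rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]])
    adv = group_relative_advantages(rewards)
    loss = grpo_policy_loss(torch.randn(2, 4) * 0.01, torch.zeros(2, 4), adv)
    print(adv, loss.item())
```

Most GRPO-based entries above differ mainly in how the per-response reward is computed (answer match, task-routed rewards, safety rewards, etc.) and in added regularizers such as a KL term to a reference policy.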
3.2. SFT and Fine-Tuning for VLM
Title
Year
Paper
Website
Code
AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of VLMs
2026/03
Paper
-
-
CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models
2026/03
Paper
-
MERGETUNE: Continued Fine-Tuning of Vision-Language Models
2026/01 (ICLR 2026)
Paper
-
Mask Fine-Tuning (MFT): Unlocking Hidden Capabilities in Vision-Language Models
2025/12
Paper
-
Image-LoRA: Towards Minimal Fine-Tuning of VLMs
2025/12
Paper
-
Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning
2025/12
Paper
-
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
2025/04/21
Paper
Website
OMNICAPTIONER: One Captioner to Rule Them All
2025/04/09
Paper
Website
Code
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
2024
Paper
Website
Code
LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression
2024
Paper
Website
Code
ViTamin: Designing Scalable Vision Models in the Vision-Language Era
2024
Paper
Website
Code
Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model
2024
Paper
-
-
Should VLMs be Pre-trained with Image Data?
2025
Paper
-
-
VisionArena: 230K Real World User-VLM Conversations with Preference Labels
2024
Paper
-
Code
3.3. VLM Alignment GitHub
Title
Year
Paper
Website
Code
EvoPrompt: Evolving Prompt Adaptation for Vision-Language Models
2026/03
Paper
-
-
MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation
2026/02
Paper
-
Multimodal Prompt Optimizer (MPO): Joint Optimization of Multimodal Prompts
2025/10
Paper
-
Evolutionary Prompt Optimization Discovers Emergent Multimodal Reasoning Strategies
2025/03
Paper
-
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
2025/04/30
Paper
Website
Title
Year
Paper Link
Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI
2024
Paper
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
2024
Paper
ChartLlama: A Multimodal LLM for Chart Understanding and Generation
2023
Paper
SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement
2024
📄 Paper
Training a Vision Language Model as Smartphone Assistant
2024
Paper
ScreenAgent: A Vision-Language Model-Driven Computer Control Agent
2024
Paper
Embodied Vision-Language Programmer from Environmental Feedback
2024
Paper
VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method
2025
📄 Paper
MP-GUI: Modality Perception with MLLMs for GUI Understanding
2025
📄 Paper
4.2. Generative Visual Media Applications
Title
Year
Paper
Website
Code
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning
2023
📄 Paper
🌍 Website
💾 Code
Spurious Correlation in Multimodal LLMs
2025
📄 Paper
-
-
WeGen: A Unified Model for Interactive Multimodal Generation as We Chat
2025
📄 Paper
-
💾 Code
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
2025
📄 Paper
🌍 Website
💾 Code
4.3. Robotics and Embodied AI
Title
Year
Paper
Website
Code
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
2024
📄 Paper
🌍 Website
-
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
2024
📄 Paper
🌍 Website
-
Vision-language model-driven scene understanding and robotic object manipulation
2024
📄 Paper
-
-
Guiding Long-Horizon Task and Motion Planning with Vision Language Models
2024
📄 Paper
🌍 Website
-
AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers
2023
📄 Paper
🌍 Website
-
VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model
2024
📄 Paper
-
-
Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems?
2023
📄 Paper
🌍 Website
-
DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models
2024
📄 Paper
🌍 Website
-
MotionGPT: Human Motion as a Foreign Language
2023
📄 Paper
-
💾 Code
Learning Reward for Robot Skills Using Large Language Models via Self-Alignment
2024
📄 Paper
-
-
Language to Rewards for Robotic Skill Synthesis
2023
📄 Paper
🌍 Website
-
Eureka: Human-Level Reward Design via Coding Large Language Models
2023
📄 Paper
🌍 Website
-
Integrated Task and Motion Planning
2020
📄 Paper
-
-
Jailbreaking LLM-Controlled Robots
2024
📄 Paper
🌍 Website
-
Robots Enact Malignant Stereotypes
2022
📄 Paper
🌍 Website
-
LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions
2024
📄 Paper
-
-
Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics
2024
📄 Paper
🌍 Website
-
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
2025
📄 Paper
🌍 Website
💾 Code & Dataset
Gemini Robotics: Bringing AI into the Physical World
2025
📄 Technical Report
🌍 Website
-
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
2024
📄 Paper
🌍 Website
-
Magma: A Foundation Model for Multimodal AI Agents
2025
📄 Paper
🌍 Website
💾 Code
DayDreamer: World Models for Physical Robot Learning
2022
📄 Paper
🌍 Website
💾 Code
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
2025
📄 Paper
-
-
RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback
2024
📄 Paper
🌍 Website
💾 Code
KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data
2024
📄 Paper
🌍 Website
💾 Code
Unified Video Action Model
2025
📄 Paper
🌍 Website
💾 Code
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
2025
📄 Paper
🌍 Website
💾 Code
DAM-VLA: A Dynamic Action Model-Based Vision-Language-Action Framework for Robot Manipulation
03/2026
📄 Paper
-
NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models
03/2026
📄 Paper
-
Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control
02/2026
📄 Paper
-
ST4VLA: Spatial Guided Training for Vision-Language-Action Models
02/2026
📄 Paper
-
4.3.1. Manipulation
Title
Year
Paper
Website
Code
VIMA: General Robot Manipulation with Multimodal Prompts
2022
📄 Paper
🌍 Website
Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model
2023
📄 Paper
-
-
Creative Robot Tool Use with Large Language Models
2023
📄 Paper
🌍 Website
-
RoboVQA: Multimodal Long-Horizon Reasoning for Robotics
2024
📄 Paper
-
-
RT-1: Robotics Transformer for Real-World Control at Scale
2022
📄 Paper
🌍 Website
-
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
2023
📄 Paper
🌍 Website
-
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
2023
📄 Paper
🌍 Website
-
ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models
2024
📄 Paper
🌍 Website
-
AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors
2025
📄 Paper
🌍 Website
💾 Code
Masked World Models for Visual Control
2022
📄 Paper
🌍 Website
💾 Code
Multi-View Masked World Models for Visual Robotic Manipulation
2023
📄 Paper
🌍 Website
💾 Code
4.3.2. Navigation
Title
Year
Paper
Website
Code
ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings
2022
📄 Paper
-
-
LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation
2024
📄 Paper
-
-
LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action
2022
📄 Paper
🌍 Website
-
NaVILA: Legged Robot Vision-Language-Action Model for Navigation
2022
📄 Paper
🌍 Website
-
VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation
2024
📄 Paper
-
-
Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning
2023
📄 Paper
🌍 Website
-
Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments
2025
📄 Paper
-
-
Navigation World Models
2024
📄 Paper
🌍 Website
-
4.3.3. Human-robot Interaction
Title
Year
Paper
Website
Code
MUTEX: Learning Unified Policies from Multimodal Task Specifications
2023
📄 Paper
🌍 Website
-
LaMI: Large Language Models for Multi-Modal Human-Robot Interaction
2024
📄 Paper
🌍 Website
-
VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models
2024
📄 Paper
-
-
4.3.4. Autonomous Driving
Title
Year
Paper
Website
Code
UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving
04/2026
📄 Paper
-
-
AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving
03/2026
📄 Paper
-
DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe Autonomous Driving
03/2026
📄 Paper
-
HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving
02/2026
📄 Paper
-
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model
03/2025
📄 Paper
-
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
01/07/2025
📄 Paper
🌍 Website
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
2024
📄 Paper
🌍 Website
-
GPT-Driver: Learning to Drive with GPT
2023
📄 Paper
-
-
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
2023
📄 Paper
🌍 Website
-
Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving
2023
📄 Paper
-
-
Referring Multi-Object Tracking
2023
📄 Paper
-
💾 Code
VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision
2023
📄 Paper
-
💾 Code
MotionLM: Multi-Agent Motion Forecasting as Language Modeling
2023
📄 Paper
-
-
DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models
2023
📄 Paper
🌍 Website
-
VLP: Vision Language Planning for Autonomous Driving
2024
📄 Paper
-
-
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model
2023
📄 Paper
-
-
Title
Year
Paper
Website
Code
DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis
2024
📄 Paper
-
💾 Code
LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration – A Robot Sous-Chef Application
2024
📄 Paper
-
-
Pretrained Language Models as Visual Planners for Human Assistance
2023
📄 Paper
-
-
Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research
2024
📄 Paper
-
-
Image and Data Mining in Reticular Chemistry Using GPT-4V
2023
📄 Paper
-
-
Title
Year
Paper
Website
Code
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
2023
📄 Paper
-
-
CogAgent: A Visual Language Model for GUI Agents
2023
📄 Paper
-
💾 Code
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
2024
📄 Paper
-
💾 Code
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
2024
📄 Paper
-
💾 Code
ScreenAgent: A Vision Language Model-driven Computer Control Agent
2024
📄 Paper
-
💾 Code
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
2024
📄 Paper
-
💾 Code
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
04/2026
📄 Paper
🌍 Website
Title
Year
Paper
Website
Code
X-World: Accessibility, Vision, and Autonomy Meet
2021
📄 Paper
-
-
Context-Aware Image Descriptions for Web Accessibility
2024
📄 Paper
-
-
Improving VR Accessibility Through Automatic 360 Scene Description Using Multimodal Large Language Models
2024
📄 Paper
-
-
Title
Year
Paper
Website
Code
CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework
03/2026
📄 Paper
-
-
MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
02/2026
📄 Paper
-
Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning
12/2025
📄 Paper
-
Frontiers in Intelligent Colonoscopy
02/2025
📄 Paper
-
💾 Code
VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge
2024
📄 Paper
-
💾 Code
Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology
2024
📄 Paper
-
-
M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization
2023
📄 Paper
-
-
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text
2022
📄 Paper
-
💾 Code
Med-Flamingo: A Multimodal Medical Few-Shot Learner
2023
📄 Paper
-
💾 Code
Title
Year
Paper
Website
Code
Analyzing K-12 AI Education: A Large Language Model Study of Classroom Instruction on Learning Theories, Pedagogy, Tools, and AI Literacy
2024
📄 Paper
-
-
Students Rather Than Experts: A New AI for Education Pipeline to Model More Human-Like and Personalized Early Adolescence
2024
📄 Paper
-
-
Harnessing Large Vision and Language Models in Agriculture: A Review
2024
📄 Paper
-
-
A Vision-Language Model for Predicting Potential Distribution Land of Soybean Double Cropping
2024
📄 Paper
-
-
Vision-Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models
2024
📄 Paper
-
💾 Code
DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images
2024
📄 Paper
-
-
MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models
2024
📄 Paper
-
💾 Code
Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps
2024
📄 Paper
-
💾 Code
He is Very Intelligent, She is Very Beautiful? On Mitigating Social Biases in Language Modeling and Generation
2021
📄 Paper
-
-
UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling
2024
📄 Paper
-
-
Title
Year
Paper
Website
Code
Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models
04/2026
📄 Paper
-
-
VLMs Need Words: Vision Language Models Ignore Visual Detail in Favor of Semantic Anchors
04/2026
📄 Paper
-
HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token
03/2026
📄 Paper
🌍 ACL
Tone Matters: The Impact of Linguistic Tone on Hallucination in VLMs
01/2026
📄 Paper
-
Object Hallucination in Image Captioning
2018
📄 Paper
-
Evaluating Object Hallucination in Large Vision-Language Models
2023
📄 Paper
-
💾 Code
Detecting and Preventing Hallucinations in Large Vision Language Models
2023
📄 Paper
-
-
HallE-Control: Controlling Object Hallucination in Large Multimodal Models
2023
📄 Paper
-
💾 Code
Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs
2024
📄 Paper
-
💾 Code
BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models
2024
📄 Paper
🌍 Website
-
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
2023
📄 Paper
-
💾 Code
AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models
2024
📄 Paper
🌍 Website
-
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
2023
📄 Paper
-
💾 Code
Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models
2024
📄 Paper
-
💾 Code
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
2023
📄 Paper
-
💾 Code
Title
Year
Paper
Website
Code
SaFeR-VLM: Safety into Multimodal Reasoning via Reinforcement Learning
2026 (ICLR)
📄 Paper
-
-
HoliSafe: Holistic Safety Evaluation for Vision-Language Models
2026 (ICLR)
📄 Paper
-
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
2024
📄 Paper
🌍 Website
Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments
2023
📄 Paper
-
-
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models
2024
📄 Paper
-
-
JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks
2024
📄 Paper
-
-
SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models
2024
📄 Paper
-
💾 Code
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
2024
📄 Paper
-
-
Jailbreaking Attack against Multimodal Large Language Model
2024
📄 Paper
-
-
Embodied Red Teaming for Auditing Robotic Foundation Models
2025
📄 Paper
🌍 Website
Safety Guardrails for LLM-Enabled Robots
2025
📄 Paper
-
-
Title
Year
Paper
Website
Code
Hallucination of Multimodal Large Language Models: A Survey
2024
📄 Paper
-
-
Bias and Fairness in Large Language Models: A Survey
2023
📄 Paper
-
-
Fairness and Bias in Multimodal AI: A Survey
2024
📄 Paper
-
-
Multi-Modal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision–Language Models
2023
📄 Paper
-
-
FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks
2024
📄 Paper
-
-
FairCLIP: Harnessing Fairness in Vision-Language Learning
2024
📄 Paper
-
-
FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models
2024
📄 Paper
-
-
Benchmarking Vision Language Models for Cultural Understanding
2024
📄 Paper
-
-
5.4.1 Multi-modality Alignment
Title
Year
Paper
Website
Code
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
2024
📄 Paper
-
-
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
2024
📄 Paper
-
-
Assessing and Learning Alignment of Unimodal Vision and Language Models
2024
📄 Paper
🌍 Website
-
Extending Multi-modal Contrastive Representations
2023
📄 Paper
-
💾 Code
OneLLM: One Framework to Align All Modalities with Language
2023
📄 Paper
-
💾 Code
What You See is What You Read? Improving Text-Image Alignment Evaluation
2023
📄 Paper
🌍 Website
💾 Code
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
2024
📄 Paper
🌍 Website
💾 Code
5.4.2 Commonsense and Physics Alignment
Title
Year
Paper
Website
Code
VBench: Comprehensive Benchmark Suite for Video Generative Models
2023
📄 Paper
🌍 Website
💾 Code
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
2024
📄 Paper
🌍 Website
💾 Code
PhysBench: Benchmarking and Enhancing VLMs for Physical World Understanding
2025
📄 Paper
🌍 Website
💾 Code
VideoPhy: Evaluating Physical Commonsense for Video Generation
2024
📄 Paper
🌍 Website
💾 Code
WorldSimBench: Towards Video Generation Models as World Simulators
2024
📄 Paper
🌍 Website
-
WorldModelBench: Judging Video Generation Models As World Models
2025
📄 Paper
🌍 Website
💾 Code
VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
2024
📄 Paper
🌍 Website
💾 Code
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
2025
📄 Paper
-
💾 Code
Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency
2025
📄 Paper
-
💾 Code
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
2025
📄 Paper
-
-
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
2024
📄 Paper
🌍 Website
💾 Code
Do generative video models understand physical principles?
2025
📄 Paper
🌍 Website
💾 Code
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation
2024
📄 Paper
🌍 Website
💾 Code
How Far is Video Generation from World Model: A Physical Law Perspective
2024
📄 Paper
🌍 Website
💾 Code
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
2025
📄 Paper
-
-
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
2025
📄 Paper
🌍 Website
💾 Code
5.5 Efficient Training and Fine-Tuning
Title
Year
Paper
Website
Code
QAPruner: Quantization-Aware Vision Token Pruning for MLLMs
04/2026
📄 Paper
-
-
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
04/2026
📄 Paper
-
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
04/2026
📄 Paper
-
LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules
02/2026
📄 Paper
-
GRACE: Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs
01/2026
📄 Paper
-
VLMQ: Post-Training Quantization for Large Vision-Language Models
2026 (ICLR)
📄 Paper
-
VILA: On Pre-training for Visual Language Models
2023
📄 Paper
-
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
2021
📄 Paper
-
-
LoRA: Low-Rank Adaptation of Large Language Models
2021
📄 Paper
-
💾 Code
QLoRA: Efficient Finetuning of Quantized LLMs
2023
📄 Paper
-
-
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
2022
📄 Paper
-
💾 Code
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
2023
📄 Paper
-
-
5.6 Scarcity of High-Quality Datasets
Title
Year
Paper
Website
Code
A Survey on Bridging VLMs and Synthetic Data
2025
📄 Paper
-
💾 Code
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
2024
📄 Paper
🌍 Website
💾 Code
SLIP: Self-supervision meets Language-Image Pre-training
2021
📄 Paper
-
💾 Code
Synthetic Vision: Training Vision-Language Models to Understand Physics
2024
📄 Paper
-
-
Synth2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings
2024
📄 Paper
-
-
KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data
2024
📄 Paper
-
-
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
2024
📄 Paper
-
-