Paper Reading: "Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications"

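- Paper Overview
- Core Contributions and Features
  - 1. A Precise Definition of VLA (Definition I.1)
  - 2. Historical Evolution (Section III & Fig. 2)
  - 3. Architecture Taxonomy (Section IV & Fig. 4)
  - 4. Comprehensive Coverage of Data Modalities (Section IV-D)
- Key Technical Insights
  - 1. The Evolution of Action Representations (the core finding of Fig. 4)
  - 2. Cross-Embodiment Learning
  - 3. Training Strategies (Section V)
- Practical Guidance (Section VIII)
- Limitations and Future Directions (Section IX)
- Comparison with Other Surveys
- Key References and Resources
- Summary and Assessment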

Abstract

Amid growing efforts to leverage advances in large language models (LLMs) and vision-language models (VLMs) for robotics, Vision-Language-Action (VLA) models have recently gained significant attention.
By unifying vision, language, and action data at scale, which have traditionally been studied separately, VLA models aim to learn policies that generalise across diverse tasks, objects, embodiments, and environments.
This generalisation capability is expected to enable robots to solve novel downstream tasks with minimal or no additional task-specific data, facilitating more flexible and scalable real-world deployment.
Unlike previous surveys that focus narrowly on action representations or high-level model architectures, this work offers a comprehensive, full-stack review, integrating both software and hardware components of VLA systems.
In particular, this paper provides a systematic review of VLAs, covering their strategy and architectural transition, architectures and building blocks, modality-specific processing techniques, and learning paradigms.
In addition, to support the deployment of VLAs in real-world robotic applications, we also review commonly used robot platforms, data collection strategies, publicly available datasets, data augmentation methods, and evaluation benchmarks.
Throughout this comprehensive survey, this paper aims to offer practical guidance for the robotics community in applying VLAs to real-world robotic systems.
All references categorized by training approach, evaluation method, modality, and dataset are available in the table on our project website: https://vla-survey.github.io.

Conclusion

This survey provides a comprehensive review of Vision-Language-Action (VLA) models for robotics, tracing their evolution from early CNN-based approaches to sophisticated multimodal architectures integrating diffusion models and latent action representations.
We have examined the fundamental challenges, architectural innovations, training methodologies, and real-world applications that define the current landscape of VLA research.
Our analysis reveals several key insights: (1) the critical role of large-scale datasets and pre-trained foundation models in enabling generalization, (2) the emergence of hierarchical architectures that separate high-level reasoning from low-level control, (3) the growing importance of multimodal inputs beyond vision and language, and (4) the persistent challenges in sim-to-real transfer and embodiment generalization.
The field has reached a critical inflection point at which recent advances in foundation models, in conjunction with improved data collection protocols and refined training methodologies, are anticipated to facilitate the development of robotic systems with improved generalization and capability.
The incorporation of world models, affordance-based reasoning, and RL is expected to underpin the next generation of VLA models, enabling continuous learning, sophisticated task reasoning, and robust adaptation across diverse and unstructured real-world environments.

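Paper Overview

This paper is a systematic survey of Vision-Language-Action (VLA) models, co-authored by researchers from the University of Tokyo, the University of Oxford, and the University of Texas at Austin. It offers a "full-stack" review that covers not only software and algorithmic architectures but also practical matters such as hardware platforms, data collection, and real-world deployment.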

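Core Contributions and Features

1. A Precise Definition of VLA (Definition I.1)

The paper gives a strict definition of a VLA, which pins down the scope of the survey:

- Required inputs: visual observations plus a natural-language instruction
- Output: robot control commands generated directly (not a choice among high-level policies)
- Excluded: models used only for high-level reasoning that do not execute actions themselves (for example, models that select from a library of pre-trained skills)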
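As a rough illustration of this definition, the input/output contract can be written as a small interface. This is only a sketch: the names `Observation`, `VLAPolicy`, and `ConstantPolicy`, the nested-list image layout, and the 7-dimensional action are all assumptions made for the example and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Observation:
    """Required VLA inputs: a visual observation plus a language instruction."""
    image: List[List[List[int]]]  # H x W x 3 pixel values (hypothetical layout)
    instruction: str


class VLAPolicy(Protocol):
    """Anything that maps (image, instruction) directly to a control command."""

    def act(self, obs: Observation) -> List[float]:
        """Return a low-level control command, e.g. end-effector deltas."""
        ...


class ConstantPolicy:
    """Toy stand-in for a trained model: always commands zero motion."""

    def __init__(self, action_dim: int = 7) -> None:
        self.action_dim = action_dim

    def act(self, obs: Observation) -> List[float]:
        if not obs.instruction:
            raise ValueError("a VLA requires a language instruction")
        return [0.0] * self.action_dim


obs = Observation(image=[[[0, 0, 0]]], instruction="pick up the cup")
command = ConstantPolicy().act(obs)
```

A model that only returns the name of a pre-trained skill would not satisfy this interface, which mirrors the exclusion in the definition.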

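2. Historical Evolution (Section III & Fig. 2)

The paper divides the development of VLAs into four generations:

Period	Representative models	Technical characteristics
2021-2022	CLIPort, Gato, VIMA	CNN/Transformer-based architectures
2022-2023	RT-1, RT-2, RT-X, OpenVLA	Large-scale real-world data + pre-trained VLMs
2023-2024	Octo, RDT-1B, π₀	Diffusion policies / flow matching
2024-2025	LAPA, π₀.5, GR00T N1	Hierarchical control + implicit world models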

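3. Architecture Taxonomy (Section IV & Fig. 4)

The paper proposes a taxonomy of seven core architectural patterns, which is one of the survey's main theoretical contributions:

Sensorimotor Models:

- Transformer + discrete action tokens (RT-1, Gato)
- Transformer + diffusion action head (Octo)
- Diffusion Transformer (RDT-1B)
- VLM + discrete action tokens (RT-2, OpenVLA)
- VLM + diffusion action head (Diffusion-VLA, DexVLA)
- VLM + flow-matching action head (π₀, GraspVLA), the latest trend
- VLM + Diffusion Transformer (GR00T N1, CogACT)

Other architecture types:

- World Models: predict future states to support decision-making (UniPi, GR-1)
- Affordance-based Models: first predict actionable regions, then execute (VoxPoser, CLIPort)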
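To make the flow-matching action head in the list above more concrete, here is a minimal sketch of the idea under one common convention: noise sits at t=0, the demonstrated action at t=1, and the network regresses the velocity along the straight line between them. The function names and `toy_velocity`, a hand-written stand-in for a trained network, are assumptions for illustration, not code from π₀ or GraspVLA.

```python
import random
from typing import Callable, List

Vector = List[float]


def interpolate(noise: Vector, action: Vector, t: float) -> Vector:
    """Point on the straight path from noise (t=0) to the action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise: Vector, action: Vector) -> Vector:
    """Regression target for the velocity network along that straight path."""
    return [a - n for n, a in zip(noise, action)]


def sample_action(velocity_fn: Callable[[Vector, float], Vector],
                  noise: Vector, num_steps: int = 10) -> Vector:
    """Euler-integrate dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a trained network (assumption): the exact velocity field
# that transports any starting point to one fixed demonstration action.
DEMO_ACTION = [0.1, -0.2, 0.3]


def toy_velocity(x: Vector, t: float) -> Vector:
    return [(a - xi) / (1.0 - t) for a, xi in zip(DEMO_ACTION, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in DEMO_ACTION]
action = sample_action(toy_velocity, noise, num_steps=10)
```

Because sampling needs only a handful of integration steps, this style of action head is a natural fit for high-rate control; in a real system `velocity_fn` would be a network conditioned on the VLM's features.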

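4. Comprehensive Coverage of Data Modalities (Section IV-D)

The paper discusses in detail how four key modalities are processed:

Modality	Key techniques	Representative work / usage
Vision	ViT, CLIP, SigLIP, DINOv2, VQ-GAN	Widely used
Language	T5 tokenizer, LLaMA tokenizer, Universal Sentence Encoder (USE)	Text encoding
Action	Discrete binning, diffusion / flow matching, latent action learning	FAST, LAPA
Others	Audio, tactile sensing, 3D point clouds / voxels	Tactile-VLA, PointVLA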

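Key Technical Insights

1. The Evolution of Action Representations (the core finding of Fig. 4)

The paper shows a clear shift in action representation from discrete tokens to continuous generative models:

- Early (RT-1/RT-2): each action dimension is discretized into 256 bins and the model is trained with a cross-entropy loss
- Middle (Octo): diffusion policies are introduced to generate continuous actions
- Latest (π₀, π₀.5): flow matching is adopted, enabling real-time control at up to 50 Hz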
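The early discrete-token approach can be sketched in a few lines. This is a minimal illustration assuming uniform bins over a normalized range of [-1, 1]; the actual ranges and tokenization details in RT-1 and RT-2 may differ, and the function names are hypothetical.

```python
import math
from typing import List

NUM_BINS = 256


def discretize(action: List[float], low: float = -1.0, high: float = 1.0) -> List[int]:
    """Map each continuous action dimension to a bin index in [0, NUM_BINS - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        fraction = (clipped - low) / (high - low)
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens


def undiscretize(tokens: List[int], low: float = -1.0, high: float = 1.0) -> List[float]:
    """Map bin indices back to the center of each bin."""
    width = (high - low) / NUM_BINS
    return [low + (token + 0.5) * width for token in tokens]


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-probability of the target bin under softmax(logits)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target]


tokens = discretize([-1.0, 0.0, 1.0])
recovered = undiscretize(tokens)
loss = cross_entropy([0.0] * NUM_BINS, tokens[1])
```

The round trip loses at most half a bin width per dimension, which is one intuition for why continuous generative heads became attractive for fine-grained, high-frequency control.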

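2. Cross-Embodiment Learning

This is one of the central challenges for VLAs. The paper discusses three families of solutions:

- Data standardization: the Open X-Embodiment project unifies the action spaces of different robots
- Latent action spaces: LAPA learns embodiment-agnostic action representations from video
- Unified action space: UniAct proposes a Universal Action Space (UAS)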
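One simple, generic way to let robots with different action dimensions share a single model is to zero-pad every action to a common width and mask the padded entries in the loss. The sketch below shows only that convention; it is an assumption made for illustration, not the specific method of Open X-Embodiment, LAPA, or UniAct, and `MAX_ACTION_DIM` is a hypothetical value.

```python
from typing import List, Tuple

MAX_ACTION_DIM = 8  # hypothetical shared width across all robots


def pad_action(action: List[float],
               max_dim: int = MAX_ACTION_DIM) -> Tuple[List[float], List[int]]:
    """Zero-pad an action to max_dim and return (padded_action, validity_mask)."""
    if len(action) > max_dim:
        raise ValueError("action has more dimensions than the shared space")
    pad = max_dim - len(action)
    return action + [0.0] * pad, [1] * len(action) + [0] * pad


def masked_mse(pred: List[float], target: List[float], mask: List[int]) -> float:
    """Mean squared error over the valid (unpadded) dimensions only."""
    valid = sum(mask)
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask)) / valid


# A 3-DoF robot and a 5-DoF robot now share one action format.
padded_small, mask_small = pad_action([0.1, 0.2, 0.3], max_dim=5)
padded_large, mask_large = pad_action([0.1, 0.2, 0.3, 0.4, 0.5], max_dim=5)
```

Padding aligns shapes but not semantics: dimension 0 may still mean different things on different robots, which is the gap that latent or universal action spaces try to close.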

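3. Training Strategies (Section V)

Strategy	Key methods	Typical use
Supervised learning	Pre-training (large-scale data) + post-training (task-specific data)	The mainstream approach
Self-supervised learning	Modality alignment, visual representation learning, latent action learning	Exploiting unlabeled data
Reinforcement learning	PPO and SAC, used to fine-tune VLAs or to train low-level policies	Improving robustness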

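Practical Guidance (Section VIII)

The paper offers practitioners seven concrete recommendations:

- Prioritize diverse, high-quality datasets: data spanning tasks, environments, and embodiments is essential for generalization
- Adopt continuous action generation: diffusion or flow matching outperforms discrete tokens
- Use gradient isolation during pre-training: this keeps a randomly initialized action head from corrupting the pre-trained VLM's representations
- Start with lightweight adaptation: first try a frozen backbone with LoRA fine-tuning
- Integrate world models or latent action learning: especially useful for humanoid robots
- Embrace multi-task learning: auxiliary tasks (affordance detection, future-state prediction) improve the learned representations
- [Implied] Use hierarchical architectures: separate high-level planning from low-level control (as in RT-H and π₀.5)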
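The "frozen backbone + LoRA" recommendation can be illustrated with the standard LoRA formulation, y = W x + (alpha / r) B A x, where W stays frozen and only the low-rank factors A and B are trained. The pure-Python sketch below is an assumption-laden toy (tiny matrices, hypothetical function names), not the fine-tuning code of any particular VLA.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix,
                 x: List[float], alpha: float) -> List[float]:
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable."""
    rank = len(A)  # A has shape (rank, d_in); B has shape (d_out, rank)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


def trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Adapter size: A contributes rank * d_in, B contributes d_out * rank."""
    return rank * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pre-trained weight (toy values)
A = [[1.0, 1.0]]               # rank-1 down-projection
B_zero = [[0.0], [0.0]]        # B starts at zero, so the adapter is a no-op
B_trained = [[1.0], [0.0]]     # hypothetical values after some training
x = [2.0, 3.0]

y_init = lora_forward(W, A, B_zero, x, alpha=1.0)
y_tuned = lora_forward(W, A, B_trained, x, alpha=1.0)
full = 4096 * 4096
adapter = trainable_params(4096, 4096, rank=8)
```

For a single 4096 x 4096 layer, the rank-8 adapter trains 65,536 parameters instead of roughly 16.8 million, which is why it is a cheap first thing to try before full fine-tuning.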

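Limitations and Future Directions (Section IX)

The paper is candid about the current challenges:

Challenge	Specific problem	Potential solutions
Data scarcity	Large-scale robot data with action labels is hard to collect	Simulation data, world-model generation, human videos
Embodiment transfer	Robot morphologies differ enormously	Unified action spaces, latent action representations
Computational cost	Training and inference are expensive	Model distillation, quantization (BitVLA), efficient architectures
Safety	Collision avoidance and failure recovery mechanisms are lacking	Combining with model-based control, failure detection modules
Continual learning	Models are hard to update after deployment	RLHF, online adaptation, mitigating catastrophic forgetting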

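Comparison with Other Surveys

Judging from the paper's citations and its own comparisons, the distinctive value of this survey lies in the following:

Aspect	This survey	Other surveys (e.g., Ma et al., 2024)
Coverage	Full stack (hardware + software + data + evaluation)	Mainly architectures and action tokenization
Architecture taxonomy	Seven sensorimotor architectures + world models + affordance-based models	Mainly classified by action representation
Practical guidance	Detailed recommendations for practitioners	Mostly theoretical analysis
Time coverage	Up to early 2025 (including π₀.5 and GR00T N1)	Published earlier, covering older work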