[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Thinklab-SJTU--Awesome-LLM4AD":3,"tool-Thinklab-SJTU--Awesome-LLM4AD":65},[4,17,27,35,48,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",150037,2,"2026-04-10T23:33:47",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":10,"last_commit_at":33,"category_tags":34,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,"2026-04-10T11:13:16",[26,43,44,45,14,46,15,13,47],"数据工具","视频","插件","其他","音频",{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":54,"last_commit_at":55,"category_tags":56,"status":16},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 
提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[15,43,46],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":54,"last_commit_at":63,"category_tags":64,"status":16},6590,"gpt4all","nomic-ai\u002Fgpt4all","GPT4All 是一款让普通电脑也能轻松运行大型语言模型（LLM）的开源工具。它的核心目标是打破算力壁垒，让用户无需依赖昂贵的显卡（GPU）或云端 API，即可在普通的笔记本电脑和台式机上私密、离线地部署和使用大模型。\n\n对于担心数据隐私、希望完全掌控本地数据的企业用户、研究人员以及技术爱好者来说，GPT4All 提供了理想的解决方案。它解决了传统大模型必须联网调用或需要高端硬件才能运行的痛点，让日常设备也能成为强大的 AI 助手。无论是希望构建本地知识库的开发者，还是单纯想体验私有化 AI 聊天的普通用户，都能从中受益。\n\n技术上，GPT4All 基于高效的 `llama.cpp` 后端，支持多种主流模型架构（包括最新的 DeepSeek R1 蒸馏模型），并采用 GGUF 格式优化推理速度。它不仅提供界面友好的桌面客户端，支持 Windows、macOS 和 Linux 等多平台一键安装，还为开发者提供了便捷的 Python 库，可轻松集成到 LangChain 等生态中。通过简单的下载和配置，用户即可立即开始探索本地大模型的无限可能。",77307,"2026-04-11T06:52:37",[15,13],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":71,"readme_en":72,"readme_zh":73,"quickstart_zh":74,"use_case_zh":75,"hero_image_url":76,"owner_login":77,"owner_name":78,"owner_avatar_url":79,"owner_bio":80,"owner_company":81,"owner_location":81,"owner_email":81,"owner_twitter":81,"owner_website":82,"owner_url":83,"languages":81,"stars":84,"forks":85,"last_commit_at":86,"license":87,"difficulty_score":54,"env_os":88,"env_gpu":89,"env_ram":89,"env_deps":90,"category_tags":93,"github_topics":94,"view_count":23,"oss_zip_url":81,"oss_zip_packed_at":81,"status":16,"created_at":99,"updated_at":100,"faqs":101,"releases":141},3467,"Thinklab-SJTU\u002FAwesome-LLM4AD","Awesome-LLM4AD","A curated list of awesome LLM\u002FVLM\u002FVLA\u002FWorld Model for Autonomous Driving(LLM4AD) resources (continually updated)","Awesome-LLM4AD 是由上海交通大学 ReThinklab 团队维护的开源资源库，专注于整理大语言模型（LLM）在自动驾驶领域的前沿研究成果。它系统性地汇集了涵盖视觉 - 语言模型（VLM）、视觉 - 语言 - 动作模型（VLA）及世界模型等相关论文、数据集和代码项目，旨在推动\"LLM4AD\"这一统一范式的发展。\n\n当前自动驾驶技术面临仿真与现实环境的差异（Sim2Real Gap）以及真实数据中长尾场景覆盖不足的难题。Awesome-LLM4AD 通过聚合利用大模型强大推理与泛化能力的最新方案，帮助研究者探索如何从自然语言指令、复杂感知理解到决策规划等多个维度提升自动驾驶系统的智能水平，使其更接近人类驾驶员的胜任力。\n\n该资源库特别适合自动驾驶领域的研究人员、算法工程师及技术爱好者使用。其独特亮点在于不仅提供了按任务类型（如规划、感知、问答、生成）分类的详细文献列表，还关联了配套的综述论文与具体实现代码，部分条目甚至包含了未来展望性的研究（如基于自然语言指令的统一驾驶模型）。作为持续更新的社区驱动项目，Awesome-LLM4AD 为从业者提供了一站式的学术导航，助力快速把","Awesome-LLM4AD 是由上海交通大学 ReThinklab 团队维护的开源资源库，专注于整理大语言模型（LLM）在自动驾驶领域的前沿研究成果。它系统性地汇集了涵盖视觉 - 语言模型（VLM）、视觉 - 语言 - 动作模型（VLA）及世界模型等相关论文、数据集和代码项目，旨在推动\"LLM4AD\"这一统一范式的发展。\n\n当前自动驾驶技术面临仿真与现实环境的差异（Sim2Real Gap）以及真实数据中长尾场景覆盖不足的难题。Awesome-LLM4AD 通过聚合利用大模型强大推理与泛化能力的最新方案，帮助研究者探索如何从自然语言指令、复杂感知理解到决策规划等多个维度提升自动驾驶系统的智能水平，使其更接近人类驾驶员的胜任力。\n\n该资源库特别适合自动驾驶领域的研究人员、算法工程师及技术爱好者使用。其独特亮点在于不仅提供了按任务类型（如规划、感知、问答、生成）分类的详细文献列表，还关联了配套的综述论文与具体实现代码，部分条目甚至包含了未来展望性的研究（如基于自然语言指令的统一驾驶模型）。作为持续更新的社区驱动项目，Awesome-LLM4AD 为从业者提供了一站式的学术导航，助力快速把握技术风向并复现先进算法。","# Awesome-LLM-for-Autonomous-Driving-Resources\n[![Awesome](https:\u002F\u002Fcdn.rawgit.com\u002Fsindresorhus\u002Fawesome\u002Fd7305f38d29fed78fa85652e3a63e154dd8e8829\u002Fmedia\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fsindresorhus\u002Fawesome)![GitHub stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FThinklab-SJTU\u002FAwesome-LLM4AD?color=yellow) ![GitHub forks](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002FThinklab-SJTU\u002FAwesome-LLM4AD?color=9cf) [![GitHub 
license](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002FThinklab-SJTU\u002FAwesome-LLM4AD)](https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FAwesome-LLM4AD\u002Fblob\u002Fmain\u002FLICENSE)\n\nThis is a collection of research papers about **LLM-for-Autonomous-Driving(LLM4AD)**. The repository will be continuously updated to track the frontier of LLM4AD (Large Language Models for Autonomous Driving), which encompasses VLM4AD (Vision-Language Models for AD) and VLA4AD (Vision-Language-Action models for AD) as integral components of this unified paradigm.  *Maintained by SJTU-ReThinklab.*\n\n\nWelcome to follow and star! If you find any related materials could be helpful, feel free to contact us (yangzhenjie@sjtu.edu.cn or jiaxiaosong@sjtu.edu.cn) or make a PR.\n\n## Citation\nOur survey paper is at https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.01043 which includes more detailed discussions and will be continuously updated.\n\nIf you find our repo is helpful, please consider cite it.\n```BibTeX\n@misc{yang2023survey,\n      title={LLM4Drive: A Survey of Large Language Models for Autonomous Driving}, \n      author={Zhenjie Yang and Xiaosong Jia and Hongyang Li and Junchi Yan},\n      year={2023},\n      eprint={2311.01043},\n      archivePrefix={arXiv},\n      primaryClass={cs.AI}\n}\n```\n\n## Table of Contents\n- [Awesome LLM-for-Autonomous-Driving(LLM4AD)](#awesome-llm-for-autonomous-driving-resources)\n  - [Table of Contents](#table-of-contents)\n  - [Overview of LLM4AD](#overview-of-llm4ad)\n  - [Papers](#papers)\n  - [Datasets](#datasets)\n  - [Citation](#citation)\n  - [License](#license)\n\n## Overview of LLM4AD\nLLM-for-Autonomous-Driving (LLM4AD) refers to the application of Large Language Models(LLMs) in autonomous driving. We divide existing works based on the perspective of applying LLMs: planning, perception, question answering, and generation. \n\n![image info](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FThinklab-SJTU_Awesome-LLM4AD_readme_87779784afbc.png)\n\n## Motivation of LLM4AD\nThe orange circle represents the ideal level of driving competence, akin to that possessed by an experienced human driver. There are two main methods to acquire such proficiency: one, through learning-based techniques within simulated environments; and two, by learning from offline data through similar methodologies. It’s important to note that due to discrepancies between simulations and the real-world, these two domains are not fully the same, i.e. sim2real gap. Concurrently, offline data serves as a subset of real-world data since it’s collected directly from actual surroundings. However, it is difficult to fully cover the distribution as well due to the notorious long-tailed nature of autonomous driving tasks. 
The final goal of autonomous driving is to elevate driving abilities from a basic green stage to a more advanced blue level through extensive data collection and deep learning.\n\n![image info](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FThinklab-SJTU_Awesome-LLM4AD_readme_d237ced4f83e.png)\n\n## Papers\n\u003Cdetails open>\n\u003Csummary>Toggle\u003C\u002Fsummary>\n\n```\nformat:\n- [title](paper link) [links]\n  - author1, author2, and author3...\n  - publisher\n  - task\n  - keyword\n  - code or project page\n  - datasets or environment or simulator\n  - publish date\n  - summary\n  - metrics\n```\n\n- [Vega: Learning to Drive with Natural Language Instructions](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.25741)\n  - Sicheng Zuo, Yuxuan Li, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu\n  - Publish Date: 2026.03.26\n  - Task: Planning\n  - Datasets: [InstructScene](https:\u002F\u002Fgithub.com\u002F)\n  - Summary：\n    - Vega, a unified Vision-Language-World-Action model for instruction-based generation and planning in autonomous driving, employing autoregressive and diffusion paradigms.\n    - Constructs a large-scale driving dataset (InstructScene) with around 100,000 scenes annotated with diverse driving instructions and corresponding trajectories.\n    - Demonstrates superior planning performance and strong instruction-following abilities, enabling more intelligent and personalized driving systems.\n\n- [Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.25740)\n  - Zehao Wang, Huaide Jiang, Shuaiwu Dong, Yuping Wang, Hang Qiu, Jiachen Li\n  - Publish Date: 2026.03.26\n  - Project Page: [Drive My Way](https:\u002F\u002Fdmw-cvpr.github.io\u002F)\n  - Code: [Drive My Way](https:\u002F\u002Fgithub.com\u002Ftasl-lab\u002FDMW)\n  - Task: Planning\n  - Datasets: [Bench2Drive](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FBench2Drive)\n  - Summary：\n    - Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that aligns with users' long-term driving habits and adapts to real-time user instructions.\n    - DMW learns a user embedding from a personalized driving dataset and conditions the policy on this embedding for planning, with natural language instructions providing short-term guidance.\n    - Evaluation on Bench2Drive shows improved style instruction adaptation, and user studies confirm its generated behaviors are recognizable as each driver's own style.\n\n- [ETA-VLA: Efficient Token Adaptation via Temporal Fusion and Intra-LLM Sparsification for Vision-Language-Action Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.25766)\n  - Yiru Wang, Anqing Jiang, Shuo Wang, Yuwen Heng, Zichong Gu, Hao Sun\n  - Publish Date: 2026.03.26\n  - Task: End-to-End\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Proposes ETA-VLA, an Efficient Token Adaptation framework for Vision-Language-Action (VLA) models in autonomous driving to reduce the computational burden of processing historical multi-view frames.\n    - Introduces an Intra-LLM Sparse Aggregator (ILSA) that dynamically prunes redundant visual tokens using a text-guided scoring and diversity-preserving strategy, inspired by human driver attention.\n    - Achieves comparable performance to SOTA on NAVSIM v2 while reducing FLOPs by ~32%, pruning 85% of visual tokens and cutting inference FLOPs by 61% while retaining 94% accuracy.\n\n- [Learning 
Rollout from Sampling:An R1-Style Tokenized Traffic Simulation Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.24989)\n  - Ziyan Wang, Peng Chen, Ding Li, Chiwei Li, Qichao Zhang, Zhongpu Xia, Guizhen Yu\n  - Publish Date: 2026.03.26\n  - Task: Generation\n  - Datasets: [Waymo Sim Agent](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Summary：\n    - R1Sim, a novel tokenized traffic simulation policy, explores reinforcement learning based on motion token entropy patterns for learning diverse and high-fidelity traffic simulations.\n    - Introduces an entropy-guided adaptive sampling mechanism to focus on high-uncertainty, high-potential motion tokens previously overlooked.\n    - Optimizes motion behaviors using Group Relative Policy Optimization (GRPO) with a safety-aware reward design, achieving realistic, safe, and diverse multi-agent behaviors.\n\n- [TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Driven Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.24936)\n  - Xuepeng Jing, Wenhuan Lu, Hao Meng, Zhizhi Yu, Jianguo Wei\n  - Publish Date: 2026.03.26\n  - Task: Prediction\n  - Summary：\n    - TIGFlow-GRPO, a two-stage generative framework for human trajectory forecasting that aligns flow-based generation with behavioral rules.\n    - The first stage uses a CFM-based predictor with a Trajectory-Interaction-Graph (TIG) module to model fine-grained visual-spatial interactions.\n    - The second stage employs Flow-GRPO post-training, reformulating deterministic flow rollout as stochastic ODE-to-SDE sampling for exploration and optimizing with a composite reward for social compliance and physical feasibility.\n\n- [DreamerAD: Efficient Reinforcement Learning via Latent World Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.24587)\n  - Pengxuan Yang, Yupeng Zheng, Deheng Qian, Zebin Xing, Qichao Zhang, Linbo Wang, Yichen Zhang, Shaoyu Guo, Zhongpu Xia, Qiang Chen, Junyu Han, Lingyun Xu, Yifeng Pan, Dongbin Zhao\n  - Publish Date: 2026.03.25\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - DreamerAD, the first latent world model framework for efficient RL in autonomous driving, compressing diffusion sampling from 100 steps to 1 for an 80x speedup while maintaining visual interpretability.\n    - The approach leverages denoised latent features via shortcut forcing for step compression, an autoregressive dense reward model on latent representations, and Gaussian vocabulary sampling for GRPO to constrain exploration to plausible trajectories.\n    - Achieves state-of-the-art performance with 87.7 EPDMS on NavSim v2, demonstrating the effectiveness of latent-space RL for autonomous driving.\n\n- [Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.24581)\n  - Linbo Wang, Yupeng Zheng, Qiang Chen, Shiwei Li, Yichen Zhang, Zebin Xing, Qichao Zhang, Xiang Li, Deheng Qian, Pengxuan Yang, Yihang Dong, Ce Hao, Xiaoqing Ye, Junyu han, Yifeng Pan, Dongbin Zhao\n  - Publish Date: 2026.03.25\n  - Task: Prediction, Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially-aware and dynamics-informed latent world representations.\n    - It introduces a Spatial-Aware Compressive World Encoder 
(SCWE) to distill geometric knowledge and compress multi-view images, and a Dynamic Latent World Model (DLWM) to predict future world states.\n    - Achieves state-of-the-art results on NAVSIM v2 and HUGSIM benchmarks with a compact 104M-parameter model, surpassing prior methods with significantly less training data.\n\n- [Toward Physically Consistent Driving Video World Models under Challenging Trajectories](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.24506)\n  - Jiawei Zhou, Zhenxin Zhu, Lingyi Du, Linye Lyu, Lijun Zhou, Zhanqian Wu, Hongcheng Luo, Zhuotao Tian, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun, Yu Li\n  - Publisher: Huazhong University of Science and Technology, Xiaomi EV\n  - Publish Date: 2026.03.25\n  - Project Page: [PhyGenesis](https:\u002F\u002Fwm-research.github.io\u002FPhyGenesis\u002F)\n  - Task: Generation\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - Proposes PhyGenesis, a world model for generating driving videos with high visual fidelity and strong physical consistency, especially under challenging or counterfactual trajectories.\n    - The framework includes a physical condition generator to correct invalid trajectories and a physics-enhanced video generator to produce multi-view driving videos.\n    - Trained on a large-scale, physics-rich heterogeneous dataset combining real-world videos and diverse challenging scenarios generated in CARLA, enabling trajectory correction and physically consistent generation.\n\n- [Traffic Sign Recognition in Autonomous Driving: Dataset, Benchmark, and Field Experiment](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.23034)\n  - Guoyang Zhao, Weiqing Qi, Kai Zhang, Chenguang Zhang, Zeying Gong, Zhihai Bi, Kai Chen, Benshan Ma, Ming Liu, Jun Ma\n  - Publisher: The Hong Kong University of Science and Technology, Shanghai AI Laboratory\n  - Publish Date: 2026.03.24\n  - Project Page: [TS-1M](https:\u002F\u002Fguoyangzhao.github.io\u002Fprojects\u002Fts1m)\n  - Task: Perception\n  - Datasets: [TS-1M](https:\u002F\u002Fguoyangzhao.github.io\u002Fprojects\u002Fts1m)\n  - Summary：\n    - Introduces TS-1M, a large-scale, globally diverse traffic sign dataset with over one million images across 454 categories and a diagnostic benchmark for analyzing model capabilities under practical challenges.\n    - Conducts a unified benchmark across three learning paradigms (supervised, self-supervised, multimodal VLMs), revealing semantic alignment as key for generalization and rare-category recognition.\n    - Validates practical relevance through real-scene autonomous driving experiments integrating traffic sign recognition with semantic reasoning and spatial localization for map-level decision constraints.\n\n- [KLDrive: Fine-Grained 3D Scene Reasoning for Autonomous Driving based on Knowledge Graph](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.21029)\n  - Ye Tian, Jingyi Zhang, Zihao Wang, Xiaoyuan Ren, Xiaofan Yu, Onat Gungor, Tajana Rosing\n  - Publisher: University of California San Diego\n  - Publish Date: 2026.03.22\n  - Task: VQA\n  - Datasets: [NuScenes-QA](https:\u002F\u002Fgithub.com\u002Fqiantianwen\u002FNuScenes-QA)\n  - Summary：\n    - KLDrive, a knowledge-graph-augmented LLM reasoning framework for fine-grained question answering in autonomous driving, combining an energy-based scene fact construction module with an LLM agent for fact-grounded reasoning.\n    - The framework uses structured prompting and few-shot in-context exemplars to adapt to diverse reasoning tasks without heavy 
task-specific fine-tuning, reducing hallucinations and improving reliability.\n\n- [Understanding Behavior Cloning with Action Quantization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.20538)\n  - Haoqun Cao, Tengyang Xie\n  - Publish Date: 2026.03.20\n  - Task: Planning\n  - Summary：\n    - Provides theoretical foundations for behavior cloning with quantized actions, analyzing error propagation and sample complexity.\n    - Shows optimal sample complexity is achievable with log-loss and quantized actions under stable dynamics and policy smoothness.\n    - Proposes a model-based augmentation to improve error bounds and establishes fundamental limits for quantization error and statistical complexity.\n\n- [X-World: Controllable Ego-Centric Multi-Camera World Models for Scalable End-to-End Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.19979)\n  - Chaoda Zheng, Sean Li, Jinhao Deng, Zhennan Wang, Shijia Chen, Liqiang Xiao, Ziheng Chi, Hongbin Lin, Kangjie Chen, Boyang Wang, Yu Zhang, Xianming Liu\n  - Publish Date: 2026.03.20\n  - Task: End-to-End\n  - Summary：\n    - X-World, an action-conditioned multi-camera generative world model that simulates future video observations for scalable and reproducible evaluation of end-to-end autonomous driving.\n    - The model generates future multi-camera video streams from history and action sequences, supporting optional controls over traffic agents, road elements, and appearance (e.g., weather) via text prompts.\n    - It features a multi-view latent video generator designed for cross-view geometric consistency and temporal coherence, enabling high-quality, controllable simulation for evaluation and video style transfer.\n\n- [DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.19675)\n  - Xiaolu Liu, Yicong Li, Song Wang, Junbo Chen, Angela Yao, Jianke Zhu\n  - Publish Date: 2026.03.20\n  - Code: [DynFlowDrive](https:\u002F\u002Fgithub.com\u002Fxiaolul2\u002FDynFlowDrive)\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - DynFlowDrive, a latent world model that leverages flow-based dynamics to model the transition of world states under different driving actions.\n    - Introduces a stability-aware multi-mode trajectory selection strategy that evaluates candidate trajectories based on the stability of induced scene transitions.\n    - Demonstrates consistent improvements on nuScenes and NavSim benchmarks without introducing additional inference overhead.\n\n- [DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.19219)\n  - Dong Zhuo, Wenzhao Zheng, Sicheng Zuo, Siming Yan, Lu Hou, Jie Zhou, Jiwen Lu\n  - Publish Date: 2026.03.19\n  - Task: Perception\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding, designed to address inefficiency and inconsistency in multi-view driving scenes.\n    - It transforms semantically rich visual features into scene tokens using 3D deformable cross-attention and employs a multi-view transformer for reconstruction of RGB, depth, and semantics, with an added 3D head for semantic occupancy prediction.\n\n- [DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language 
Models for Safe and Deployable Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.18315)\n  - Zilin Huang, Zihao Sheng, Zhengyang Wan, Yansong Qu, Junwei You, Sicong Jiang, Sikai Chen\n  - Publish Date: 2026.03.18\n  - Project Page: [DriveVLM-RL](https:\u002F\u002Fzilin-huang.github.io\u002FDriveVLM-RL-website\u002F)\n  - Code: [DriveVLM-RL](https:\u002F\u002Fzilin-huang.github.io\u002FDriveVLM-RL-website\u002F)\n  - Task: Planning\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - Proposes DriveVLM-RL, a neuroscience-inspired framework integrating Vision-Language Models (VLMs) into Reinforcement Learning (RL) via a dual-pathway architecture for safe, deployable autonomous driving.\n    - Features a Static Pathway for continuous spatial safety assessment and a Dynamic Pathway for attention-gated multi-frame semantic risk reasoning, with a hierarchical reward synthesis mechanism.\n    - Employs an asynchronous training pipeline to decouple expensive VLM inference from environment interaction, removing all VLM components at deployment to ensure real-time feasibility.\n\n- [VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.18178)\n  - Mohammad Qazim Bhat, Yufan Huang, Niket Agarwal, Hao Wang, Michael Woods, John Kenyon, Tsung-Yi Lin, Xiaodong Yang, Ming-Yu Liu, Kevin Xie\n  - Publish Date: 2026.03.18\n  - Task: Perception\n  - Summary：\n    - VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection in driving.\n    - The framework integrates metadata captions, LLM descriptions, VQA pairs, and chain-of-thought reasoning for domain-aligned, interpretable learning.\n    - It significantly improves Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27% on real-world dashcam videos.\n\n- [VectorWorld: Efficient Streaming World Model via Diffusion Flow on Vector Graphs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.17652)\n  - Chaokang Jiang, Desen Zhou, Jiuming Liu, Kevin Li Sun\n  - Publish Date: 2026.03.18\n  - Code: [VectorWorld](https:\u002F\u002Fgithub.com\u002Fjiangchaokang\u002FVectorWorld)\n  - Task: Prediction, Generation\n  - Datasets: [Waymo Open Motion Dataset](https:\u002F\u002Fwaymo.com\u002Fopen\u002Fdata\u002Fmotion\u002F), [nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - Summary：\n    - VectorWorld, a streaming world model for autonomous driving that generates ego-centric lane-agent vector-graph tiles incrementally during rollout to enable closed-loop evaluation.\n    - It addresses key issues in generative world models: history-conditioned initialization via a motion-aware gated VAE, real-time outpainting via a one-step masked completion model, and long-horizon stability via a physics-aligned NPC policy called $Δ$Sim.\n\n- [DriveFix: Spatio-Temporally Coherent Driving Scene Restoration](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.16306)\n  - Heyu Si, Brandon James Denis, Muyang Sun, Dragos Datcu, Yaoru Li, Xin Jin, Ruiju Fu, Yuliia Tatarinova, Federico Landi, Jie Song, Mingli Song, Qi Guo\n  - Publish Date: 2026.03.17\n  - Task: Perception\n  - Datasets: [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F), [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [PandaSet](https:\u002F\u002Fscale.com\u002Fopen-datasets\u002Fpandaset)\n  - Summary：\n    - DriveFix, a novel multi-view restoration framework that ensures 
spatio-temporal coherence for driving scenes using an interleaved diffusion transformer architecture.\n    - The approach explicitly models temporal dependencies and cross-camera spatial consistency, enforcing restored views to adhere to a unified 3D geometry.\n    - Demonstrates state-of-the-art performance in reconstruction and novel view synthesis on Waymo, nuScenes, and PandaSet datasets.\n\n- [Safety Case Patterns for VLA-based driving systems: Insights from SimLingo](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.16013)\n  - Gerhard Yu, Fuyuki Ishikawa, Oluwafemi Odu, Alvine Boaye Belle\n  - Publish Date: 2026.03.16\n  - Task: End-to-End\n  - Summary：\n    - Proposes RAISE, a novel safety case design approach for Vision-Language-Action (VLA)-based driving systems, introducing tailored patterns and an extension of Hazard Analysis and Risk Assessment (HARA).\n    - Addresses new safety hazards arising from the integration of open-ended natural language inputs into the multimodal control loop of autonomous driving systems.\n    - Illustrates the approach with a case study on SimLingo to construct rigorous, evidence-based safety claims for this emerging class of systems.\n\n- [CorrectionPlanner: Self-Correction Planner with Reinforcement Learning in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.15771)\n  - Yihong Guo, Dongqiangzi Ye, Sijia Chen, Anqi Liu, Xianming Liu\n  - Publish Date: 2026.03.16\n  - Task: Planning\n  - Datasets: [Waymax](https:\u002F\u002Fgithub.com\u002Fwaymo-research\u002Fwaymax), [nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - Summary：\n    - CorrectionPlanner, an autoregressive planner with self-correction that models planning as motion-token generation within a propose, evaluate, and correct loop.\n    - The method uses a learned collision critic to predict unsafe actions and retains a self-correction trace of unsafe motion tokens to condition the generation of safe actions.\n    - Trained with imitation learning followed by model-based reinforcement learning using rollouts from a pretrained world model, it reduces collision rates and achieves state-of-the-art planning scores.\n\n- [CRASH: Cognitive Reasoning Agent for Safety Hazards in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.15364)\n  - Erick Silva, Rehana Yasmin, Ali Shoker\n  - Publisher: (Inferred from context: University of Minho, University of Aveiro)\n  - Publish Date: 2026.03.16\n  - Task: Reasoning\n  - Summary：\n    - Introduces CRASH, an LLM-based agent that automates reasoning over real-world AV incident reports from the NHTSA database to generate summaries, attribute primary causes, and assess AV contribution.\n    - Analysis of 2,168 incidents shows 64% attributed to perception or planning failures and ~50% involve rear-end collisions, highlighting persistent challenges.\n    - Validated with domain experts, achieving 86% accuracy in attributing AV system failures, demonstrating potential as a scalable, interpretable tool for automated crash analysis.\n\n- [Learning from Mistakes: Post-Training for Driving VLA with Takeover Data](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.14972)\n  - Yinfeng Gao, Deqing Liu, Qichao Zhang, Yupeng Zheng, Haochen Tian, Guang Li, Hangjun Ye, Long Chen, Da-Wei Ding, Dongbin Zhao\n  - Publish Date: 2026.03.16\n  - Task: End-to-End\n  - Datasets: [Bench2Drive](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FBench2Drive)\n  - Summary：\n    - Proposes TakeVLA, a novel Vision-Language-Action (VLA) 
post-training framework that uses takeover data to mitigate distribution shift in end-to-end autonomous driving.\n    - Introduces pre-takeover language supervision to teach the model proactively about error-prone situations, cultivating a precautionary mindset and enlarging safety margins.\n    - Proposes Scenario Dreaming, a reinforcement fine-tuning paradigm that operates in reconstructed takeover scenarios to encourage active exploration beyond passive preference fitting.\n\n- [Bridging Scene Generation and Planning: Driving with World Model via Unifying Vision and Motion Representation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.14948)\n  - Xingtai Gui, Meijie Zhang, Tianyi Yan, Wencheng Han, Jiahao Gong, Feiyang Tan, Cheng-zhong Xu, Jianbing Shen\n  - Publish Date: 2026.03.16\n  - Task: Planning, Generation\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim), [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - WorldDrive, a holistic framework that couples scene generation and real-time planning via unifying vision and motion representation.\n    - Introduces a Trajectory-aware Driving World Model to enforce consistency between visual dynamics and motion intentions, enabling diverse future scene generation.\n    - Proposes a Future-aware Rewarder to distill future latent representation from the frozen world model for real-time trajectory evaluation and selection.\n\n- [PerlAD: Towards Enhanced Closed-loop End-to-end Autonomous Driving with Pseudo-simulation-based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.14908)\n  - Yinfeng Gao, Qichao Zhang, Deqing Liu, Zhongpu Xia, Guang Li, Kun Ma, Guang Chen, Hangjun Ye, Long Chen, Da-Wei Ding, Dongbin Zhao\n  - Publish Date: 2026.03.16\n  - Task: End-to-End\n  - Datasets: [Bench2Drive](https:\u002F\u002Fbench2drive.github.io\u002F), [DOS](https:\u002F\u002Fgithub.com\u002Fopendilab\u002FDOS)\n  - Summary：\n    - Proposes PerlAD, a novel Pseudo-simulation-based RL method for closed-loop end-to-end autonomous driving that operates in vector space for efficient, rendering-free training.\n    - Introduces a prediction world model to generate reactive agent trajectories and a hierarchical decoupled planner combining IL for lateral path generation and RL for longitudinal speed optimization.\n    - Achieves state-of-the-art performance on the Bench2Drive benchmark and demonstrates reliability in safety-critical occlusion scenarios on the DOS benchmark.\n\n- [AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.14851)\n  - Wenhui Huang, Songyan Zhang, Qihang Huang, Zhidong Wang, Zhiqi Mao, Collister Chua, Zhan Chen, Long Chen, Chen Lv\n  - Publish Date: 2026.03.16\n  - Project Page: [AutoMoT](https:\u002F\u002Fautomot-website.github.io\u002F)\n  - Task: End-to-End\n  - Summary：\n    - Proposes AutoMoT, an end-to-end autonomous driving framework that unifies reasoning and action generation in a single vision-language-action model.\n    - Leverages a mixture-of-transformer architecture with joint attention sharing and asynchronous fast-slow inference for efficient policy generation.\n    - Investigates the functional boundary of pre-trained VLMs in AD, finding semantic prompting sufficient for scene understanding but fine-tuning essential for action-level tasks like planning.\n\n- [WorldVLM: Combining World Model Forecasting and Vision-Language 
Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.14497)\n  - Stefan Englmeier, Katharina Winter, Fabian B. Flohr\n  - Publisher: Technical University of Munich\n  - Publish Date: 2026.03.15\n  - Task: Planning, Prediction\n  - Summary：\n    - Proposes WorldVLM, a hybrid architecture that unifies Vision-Language Models (VLMs) for high-level contextual reasoning and World Models (WMs) for accurate future scene dynamics prediction in autonomous driving.\n    - The VLM generates interpretable behavior commands to guide the driving WM, enabling context-aware actions and addressing the complementary strengths of decision-making and environmental forecasting.\n\n- [Risk-Controllable Multi-View Diffusion for Driving Scenario Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.11534)\n  - Hongyi Lin, Wenxiu Shi, Heye Huang, Dingyi Zhuang, Song Zhang, Yang Liu, Xiaobo Qu, Jinhua Zhao\n  - Publish Date: 2026.03.12\n  - Task: Generation\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Proposes RiskMV-DPO, a pipeline for physically-informed, risk-controllable multi-view driving scenario generation to synthesize diverse long-tail, high-stakes situations.\n    - Introduces a geometry-appearance alignment module and a region-aware direct preference optimization (RA-DPO) strategy to ensure spatial-temporal coherence and geometric fidelity in generated scenes.\n    - Demonstrates state-of-the-art performance on nuScenes, improving 3D detection mAP and reducing FID, shifting world models from passive prediction to proactive, controllable synthesis.\n\n- [DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.11380)\n  - Mingzhe Tao, Ruiping Liu, Junwei Zheng, Yufan Chen, Kedi Ying, M. 
Saquib Sarfraz, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen\n  - Publish Date: 2026.03.11\n  - Task: VQA\n  - Summary：\n    - Proposes DriveXQA, a multimodal dataset for autonomous driving VQA with 102,505 QA pairs across three levels (global scene, allocentric, ego-vehicle centric), four visual modalities, and various sensor failure cases and weather conditions.\n    - Introduces MVX-LLM, a token-efficient architecture with a Dual Cross-Attention (DCA) projector to fuse multiple complementary visual modalities, showing improved performance in challenging conditions like fog.\n\n- [DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.11041)\n  - Shuyao Shang, Bing Zhan, Yunfei Yan, Yuqi Wang, Yingyan Li, Yasong An, Xiaoman Wang, Jierui Liu, Lu Hou, Lue Fan, Zhaoxiang Zhang, Tieniu Tan\n  - Publisher: Chinese Academy of Sciences, Institute of Automation\n  - Publish Date: 2026.03.11\n  - Task: Planning, Reasoning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim), [Bench2Drive](https:\u002F\u002Fbench2drive.github.io\u002F)\n  - Summary：\n    - DynVLA, a driving Vision-Language-Action (VLA) model, introduces a new Chain-of-Thought (CoT) paradigm termed Dynamics CoT, which forecasts compact world dynamics before action generation for more informed and physically grounded decision-making.\n    - The model introduces a Dynamics Tokenizer to compress future evolution into a small set of dynamics tokens and decouples ego-centric and environment-centric dynamics for more accurate world modeling in interaction-intensive scenarios.\n    - DynVLA is trained to generate dynamics tokens before actions via Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), improving decision quality while maintaining latency-efficient inference, and outperforms Textual CoT and Visual CoT methods across multiple benchmarks.\n\n- [RESBev: Making BEV Perception More Robust](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.09529)\n  - Lifeng Zhuo, Kefan Jin, Zhe Liu, Hesheng Wang\n  - Publish Date: 2026.03.10\n  - Task: Perception\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Proposes RESBev, a resilient and plug-and-play BEV perception method to enhance robustness against sensor degradation and adversarial attacks.\n    - Reframes perception robustness as a latent semantic prediction problem, using a latent world model to learn BEV state transitions and predict clean features.\n    - Operates at the semantic feature level of the Lift-Splat-Shoot pipeline, enabling recovery that generalizes across disturbances without modifying the backbone.\n\n- [Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.09512)\n  - Chun-Peng Chang, Chen-Yu Wang, Holger Caesar, Alain Pagani\n  - Publish Date: 2026.03.10\n  - Task: Reasoning\n  - Summary：\n    - Investigates the reliability of Vision-Language Models (VLMs) as driving assistants, focusing on challenges of response inconsistency and limited temporal reasoning.\n    - Introduces FutureVQA, a human-annotated benchmark dataset for assessing future scene reasoning in driving contexts.\n    - Proposes a self-supervised tuning approach with chain-of-thought reasoning to improve model consistency and temporal reasoning without requiring temporal labels.\n\n- [StyleVLA: Driving Style-Aware Vision Language Action Model for 
Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.09482)\n  - Yuan Gao, Dengyuan Hua, Mattia Piccinini, Finn Rasmus Schäfer, Korbinian Moller, Lin Li, Johannes Betz\n  - Publisher: Technical University of Munich\n  - Publish Date: 2026.03.10\n  - Task: Perception, Planning\n  - Summary：\n    - StyleVLA, a physics-informed Vision Language Action (VLA) framework for generating diverse and physically plausible driving behaviors adapted to specific styles (e.g., sporty, comfortable).\n    - Introduces a hybrid loss combining a kinematic consistency constraint with a continuous regression head to improve trajectory feasibility.\n    - Trained on a large-scale instruction dataset with over 1.2k scenarios and 118k samples, built upon Qwen3-VL-4B, and outperforms proprietary and state-of-the-art models on a composite driving score.\n\n- [Comparative Analysis of Patch Attack on VLM-Based Autonomous Driving Architectures](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.08897)\n  - David Fernandez, Pedram MohajerAnsari, Amir Salarpour, Long Cheng, Abolfazl Razi, Mert D. Pesé\n  - Publish Date: 2026.03.09\n  - Task: Perception\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - Presents a systematic framework for comparative adversarial evaluation across three VLM architectures for autonomous driving: Dolphins, OmniDrive (Omni-L), and LeapVAD.\n    - Evaluates physically realizable patch attacks using black-box optimization with semantic homogenization in CARLA simulation, revealing severe vulnerabilities and sustained multi-frame failures.\n    - Analysis exposes distinct architectural vulnerability patterns, demonstrating that current VLM designs inadequately address adversarial threats in safety-critical applications.\n\n- [SAMoE-VLA: A Scene Adaptive Mixture-of-Experts Vision-Language-Action Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.08113)\n  - Zihan You, Hongwei Liu, Chenxu Dang, Zhe Wang, Sining Ang, Aoqi Wang, Yan Wang\n  - Publish Date: 2026.03.09\n  - Task: Perception, Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - SAMoE-VLA, a scene-adaptive Vision-Language-Action framework that conditions expert selection on structured scene representations (BEV features) instead of token embeddings for stable autonomous driving.\n    - Introduces a Conditional Cross-Modal Causal Attention mechanism to integrate world state, linguistic intent, and action history into a unified causal reasoning process.\n    - Achieves state-of-the-art performance on nuScenes and LangAuto benchmarks, outperforming prior VLA-based and world-model-based approaches with fewer parameters.\n\n- [NaviDriveVLM: Decoupling High-Level Reasoning and Motion Planning for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.07901)\n  - Ximeng Tao, Pardis Taghavi, Dimitar Filev, Reza Langari, Gaurav Pandey\n  - Publisher: Texas A&M University\n  - Publish Date: 2026.03.09\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - NaviDriveVLM, a decoupled framework for autonomous driving that separates high-level reasoning from motion planning using a large-scale Navigator and a lightweight trainable Driver.\n    - This design preserves strong reasoning ability while reducing training cost and provides an explicit, interpretable intermediate representation for downstream planning.\n    - Experiments on the nuScenes benchmark show NaviDriveVLM 
outperforms large VLM baselines in end-to-end motion planning.\n\n- [Kinematics-Aware Latent World Models for Data-Efficient Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.07264)\n  - Jiazhuo Li, Linjiang Cao, Qi Liu, Xi Xiong\n  - Publish Date: 2026.03.07\n  - Task: Perception\n  - Summary：\n    - Proposes a kinematics-aware latent world model framework for autonomous driving, building upon the Recurrent State-Space Model (RSSM) to incorporate vehicle kinematic information and geometry-aware supervision.\n    - The structured latent dynamics improve long-horizon imagination fidelity and stabilize policy optimization, demonstrating gains in sample efficiency and driving performance over model-free and pixel-based world-model baselines.\n\n- [Perception-Aware Multimodal Spatial Reasoning from Monocular Images](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.06985)\n  - Yanchun Cheng, Rundong Wang, Xulei Yang, Alok Prakash, Daniela Rus, Marcelo H Ang, ShiJie Li\n  - Publisher: Massachusetts Institute of Technology\n  - Publish Date: 2026.03.07\n  - Task: Perception\n  - Summary：\n    - Proposes a perception-aware multimodal reasoning framework that equips Vision-Language Models (VLMs) with explicit object-centric grounding using Visual Reference Tokens (VRTs) for joint visual-textual reasoning.\n    - Introduces a Multimodal Chain-of-Thought (MM-CoT) dataset and a deterministic ordering strategy to supervise unordered VRT sets, enabling effective training via standard supervised fine-tuning.\n    - Achieves substantial improvements on the SURDS benchmark for spatial reasoning in monocular driving scenarios, outperforming prior methods including RL-based approaches.\n\n- [BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.06576)\n  - Thomas Monninger, Shaoyuan Xie, Qi Alfred Chen, Sihao Ding\n  - Publish Date: 2026.03.06\n  - Task: Perception\n  - Summary：\n    - BEVLM, a framework connecting spatially consistent, semantically distilled Bird's-Eye View (BEV) representations with Large Language Models (LLMs) for autonomous driving.\n    - The method enables more effective LLM reasoning in cross-view driving scenes, improving accuracy by 46% by using BEV features as unified inputs.\n    - Distilling semantic knowledge from LLMs into BEV representations significantly improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.\n\n- [NOVA: Next-step Open-Vocabulary Autoregression for 3D Multi-Object Tracking in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.06254)\n  - Kai Luo, Xu Wang, Rui Fan, Kailun Yang\n  - Publish Date: 2026.03.06\n  - Code: [NOVA](https:\u002F\u002Fgithub.com\u002Fxifen523\u002FNOVA)\n  - Task: Perception\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [KITTI](http:\u002F\u002Fwww.cvlibs.net\u002Fdatasets\u002Fkitti\u002F)\n  - Summary：\n    - NOVA, a novel paradigm for 3D Multi-Object Tracking that shifts from traditional distance-based matching to generative spatio-temporal semantic modeling using autoregressive Large Language Models.\n    - It reformulates 3D trajectories as structured spatio-temporal semantic sequences, enabling simultaneous encoding of physical motion continuity and deep linguistic priors for identity consistency.\n    - Achieves significant performance gains on novel categories, with a 20.21% absolute improvement in AMOTA on nuScenes, using a compact 0.5B parameter model.\n\n- [K-Gen: 
A Multimodal Language-Conditioned Approach for Interpretable Keypoint-Guided Trajectory Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.04868)\n  - Mingxuan Mu, Guo Yang, Lei Chen, Ping Wu, Jianxun Cui\n  - Publish Date: 2026.03.05\n  - Task: Planning\n  - Datasets: [WOMD](https:\u002F\u002Fwaymo.com\u002Fopen\u002Fdata\u002Fmotion\u002F), [nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - Summary：\n    - K-Gen, an interpretable keypoint-guided multimodal framework that leverages MLLMs to unify rasterized BEV map inputs with textual scene descriptions for trajectory generation.\n    - Instead of directly predicting full trajectories, K-Gen generates interpretable keypoints along with reasoning that reflects agent intentions, which are refined into accurate trajectories.\n    - The method applies T-DAPO, a trajectory-aware reinforcement fine-tuning algorithm, to enhance keypoint generation, outperforming baselines on WOMD and nuPlan.\n\n- [PRAM-R: A Perception-Reasoning-Action-Memory Framework with LLM-Guided Modality Routing for Adaptive Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.04222)\n  - Yi Zhang, Xian Zhang, Saisi Zhao, Yinglei Song, Chengdong Wu, Nenad Petrovic, Alois Knoll\n  - Publisher: Technical University of Munich\n  - Publish Date: 2026.03.04\n  - Task: Perception, Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - PRAM-R, a unified Perception-Reasoning-Action-Memory framework with LLM-Guided Modality Routing for adaptive autonomous driving, featuring an asynchronous dual-loop design.\n    - The framework uses an LLM router to select and weight modalities based on environmental context and sensor diagnostics, paired with a hierarchical memory module for temporal consistency.\n    - Evaluation shows an 87.2% reduction in routing oscillations via hysteresis-based stabilization and a 6.22% modality reduction while maintaining trajectory accuracy comparable to full-modality baselines on nuScenes.\n\n- [Real-Time Generative Policy via Langevin-Guided Flow Matching for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.02613)\n  - Tianze Zhu, Yinuo Wang, Wenjun Zou, Tianyi Zhang, Likun Wang, Letian Tao, Feihong Zhang, Yao Lyu, Shengbo Eben Li\n  - Publish Date: 2026.03.03\n  - Task: Planning\n  - Datasets: [DeepMind Control Suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n  - Summary：\n    - Proposes DACER-F, a diffusion actor-critic method using flow matching for online RL, enabling single-step inference for real-time decision-making in autonomous driving.\n    - Introduces a method that leverages Langevin dynamics and Q-function gradients to dynamically optimize actions toward a target distribution balancing high Q-value and exploration.\n    - Demonstrates superior performance in complex driving simulations and scalability on the DeepMind Control Suite, achieving high scores with ultra-low inference latency.\n\n- [VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.02609)\n  - A. Enes Doruk, Hasan F. 
Ates\n  - Publish Date: 2026.03.03\n  - Task: Perception\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [SemanticKITTI](http:\u002F\u002Fwww.semantic-kitti.org\u002F)\n  - Summary：\n    - VLMFusionOcc3D, a robust multimodal framework for dense 3D semantic occupancy prediction, leveraging Vision-Language Models (VLMs) to anchor ambiguous voxel features to stable semantic concepts.\n    - Introduces Instance-driven VLM Attention (InstVLM) to inject high-level semantic priors into 3D voxels and Weather-Aware Adaptive Fusion (WeathFusion) to dynamically re-weight sensor contributions based on environmental reliability.\n    - Employs a Depth-Aware Geometric Alignment (DAGA) loss for structural consistency and demonstrates performance improvements on state-of-the-art baselines, especially in challenging weather scenarios.\n\n- [AnchorDrive: LLM Scenario Rollout with Anchor-Guided Diffusion Regeneration for Safety-Critical Scenario Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.02542)\n  - Zhulin Jiang, Zetao Li, Cheng Wang, Ziwen Wang, Chen Xiong\n  - Publish Date: 2026.03.03\n  - Task: Generation\n  - Datasets: [highD](https:\u002F\u002Fwww.highd-dataset.com\u002F)\n  - Summary：\n    - AnchorDrive, a two-stage safety-critical scenario generation framework combining an LLM for controllable generation and a diffusion model for realistic trajectory regeneration.\n    - The first stage uses an LLM driver agent with a plan assessor for semantically controllable scenario generation under natural language constraints.\n    - The second stage extracts anchor points from LLM trajectories to guide a diffusion model, regenerating trajectories with improved realism while preserving user intent.\n\n- [LLM-MLFFN: Multi-Level Autonomous Driving Behavior Feature Fusion via Large Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.02528)\n  - Xiangyu Li, Tianyi Wang, Xi Cheng, Rakesh Chowdary Machineni, Zhaomiao Guo, Sikai Chen, Junfeng Jiao, Christian Claudel\n  - Publish Date: 2026.03.03\n  - Task: Perception\n  - Datasets: [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Summary：\n    - LLM-MLFFN, a novel LLM-enhanced multi-level feature fusion network for autonomous driving behavior classification, integrating priors from large-scale pre-trained models.\n    - The framework comprises a multi-level feature extraction module, a semantic description module using LLMs, and a dual-channel feature fusion network with weighted attention.\n    - Evaluation on the Waymo dataset shows superior performance, achieving over 94% classification accuracy and demonstrating the value of combining structured feature modeling with semantic abstraction.\n\n- [LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.01928)\n  - Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, Hangjun Ye, Zhi-Xin Yang, Fuxi Wen\n  - Publish Date: 2026.03.02\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - LaST-VLA, a framework shifting reasoning from discrete symbolic processing to a physically grounded Latent Spatio-Temporal Chain-of-Thought to address semantic-perceptual decoupling in Vision-Language-Action models.\n    - Implements a dual-feature alignment mechanism to distill geometric constraints from 3D foundation models and dynamic foresight from 
world models into the latent space.\n    - Employs a progressive SFT training strategy and GRPO reinforcement learning, setting new records on NAVSIM benchmarks and excelling in spatial-temporal reasoning.\n\n- [Unifying Language-Action Understanding and Generation for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.01441)\n  - Xinyang Wang, Qian Liu, Wenjie Ding, Zhao Yang, Wei Li, Chang Liu, Bailin Li, Kun Zhan, Xianpeng Lang, Wei Chen\n  - Publish Date: 2026.03.02\n  - Task: End-to-End, Generation\n  - Summary：\n    - Introduces LinkVLA, a novel Vision-Language-Action (VLA) architecture for autonomous driving that unifies language and action tokens into a shared discrete codebook to enforce cross-modal consistency.\n    - Proposes an auxiliary action understanding objective to create a deep semantic link by training the model to generate captions from trajectories, fostering bidirectional language-action mapping.\n    - Replaces auto-regressive generation with a two-step coarse-to-fine (C2F) method, reducing inference time by 86% while improving instruction following accuracy and driving performance in closed-loop benchmarks.\n\n- [Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.01063)\n  - Yuechen Luo, Qimao Chen, Fang Li, Shaoqing Xu, Jaxin Liu, Ziying Song, Zhi-xin Yang, Fuxi Wen\n  - Publish Date: 2026.03.01\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Proposes ELF-VLA, a framework that augments RL with structured diagnostic feedback to address performance plateaus in Vision-Language-Action models for autonomous driving.\n    - Generates detailed, interpretable failure reports to identify specific failure modes, enabling targeted Feedback-Guided Refinement and improved training.\n    - Achieves state-of-the-art performance on the NAVSIM benchmark for PDMS, EPDMS score, and high-level planning accuracy.\n\n- [DriveCode: Domain Specific Numerical Encoding for LLM-Based Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.00919)\n  - Zhiye Wang, Yanbo Jiang, Rui Zhou, Bo Zhang, Fang Zhang, Zhenhua Xu, Yaqin Zhang, Jianqiang Wang\n  - Publish Date: 2026.03.01\n  - Task: End-to-End\n  - Summary：\n    - Introduces DriveCode, a novel numerical encoding method for LLM-based autonomous driving that represents numbers as dedicated embeddings rather than discrete text tokens to improve numerical reasoning and precision.\n    - Employs a number projector to map numbers into the language model's hidden space, enabling seamless integration with visual and textual features in a unified multimodal sequence.\n    - Demonstrates superior performance in trajectory prediction and control signal generation on OmniDrive, DriveGPT4, and DriveGPT4-V2 datasets.\n\n- [Wild-Drive: Off-Road Scene Captioning and Path Planning via Robust Multi-modal Routing and Efficient Large Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.00694)\n  - Zihang Wang, Xu Li, Benwu Wang, Wenkai Zhu, Xieyuanli Chen, Dong Kong, Kailin Lyu, Yinan Du, Yiming Peng, Haoyang Che\n  - Publish Date: 2026.02.28\n  - Code: [Wild-Drive](https:\u002F\u002Fgithub.com\u002Fwangzihanggg\u002FWild-Drive)\n  - Task: Planning\n  - Datasets: [OR-C2P Benchmark](https:\u002F\u002Fgithub.com\u002Fwangzihanggg\u002FWild-Drive)\n  - Summary：\n    - Proposes Wild-Drive, an efficient framework for off-road scene captioning and path planning, 
addressing vulnerability to sensor degradations like rain, fog, and darkness.\n    - Introduces MoRo-Former, a task-conditioned modality-routing bridge, to adaptively aggregate reliable multimodal information under degraded sensing conditions.\n    - Integrates an efficient LLM with a planning token and GRU decoder to generate structured captions and predict future trajectories, and releases the OR-C2P Benchmark for evaluation.\n\n- [Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.23259)\n  - Jiangxin Sun, Feng Xue, Teng Long, Chang Liu, Jian-Fang Hu, Wei-Shi Zheng, Nicu Sebe\n  - Publisher: University of Macau, University of Chinese Academy of Sciences, University of Trento\n  - Publish Date: 2026.02.26\n  - Task: End-to-End\n  - Summary：\n    - Proposes RaWMPC, a unified framework for end-to-end autonomous driving that addresses generalization via robust control without reliance on expert demonstrations.\n    - Leverages a world model to predict consequences of candidate actions and selects low-risk actions through explicit risk evaluation, enhanced by a risk-aware interaction strategy.\n    - Introduces a self-evaluation distillation method to distill risk-avoidance capabilities from the world model into a generative action proposal network.\n\n- [MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.21952)\n  - Lingjun Zhang, Yujian Yuan, Changjie Wu, Xinyuan Chang, Xin Cai, Shuang Zeng, Linzhe Shi, Sijin Wang, Hang Zhang, Mu Xu\n  - Publish Date: 2026.02.25\n  - Code: [MindDriver](https:\u002F\u002Fgithub.com\u002Fhotdogcheesewhite\u002FMindDriver)\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Proposes MindDriver, a progressive multimodal reasoning framework for autonomous driving that enables VLM to imitate human-like progressive thinking through semantic understanding, imagination, and trajectory planning.\n    - Develops a feedback-guided automatic data annotation pipeline to generate aligned multimodal reasoning training data and a progressive reinforcement fine-tuning method for optimization.\n\n- [NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.21172)\n  - Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan\n  - Publish Date: 2026.02.24\n  - Project Page: [NoRD](https:\u002F\u002Fnord-vla-ai.github.io\u002F)\n  - Task: End-to-End\n  - Datasets: [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F), [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Proposes NoRD, a Vision-Language-Action model for autonomous driving that achieves competitive performance while being fine-tuned on \u003C60% of the data and requiring no reasoning annotations.\n    - Identifies and overcomes difficulty bias in standard GRPO for small, reasoning-free datasets by incorporating the Dr. 
GRPO algorithm to mitigate high-variance rollouts.\n\n- [VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.20794)\n  - Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye, Yahong Han, Long Chen\n  - Publish Date: 2026.02.24\n  - Task: Perception, Planning\n  - Summary：\n    - Proposes VGGDrive, a novel architecture that empowers Vision-Language Models (VLMs) with cross-view geometric grounding for autonomous driving by bridging 3D geometric features from frozen 3D models.\n    - Introduces a plug-and-play Cross-View 3D Geometric Enabler (CVGE) that decouples the base VLM and injects 3D features through a hierarchical adaptive mechanism.\n    - Demonstrates enhanced performance across five autonomous driving benchmarks, including cross-view risk perception, motion prediction, and trajectory planning tasks.\n\n- [Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.20577)\n  - Jiaru Zhang, Manav Gagvani, Can Cui, Juntong Peng, Ruqi Zhang, Ziran Wang\n  - Publish Date: 2026.02.24\n  - Task: Perception, Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Proposes MVLAD-AD, a masked vision-language-action diffusion framework for autonomous driving that bridges efficient planning and semantic explainability.\n    - Introduces a discrete action tokenization strategy to create a compact codebook of kinematically feasible waypoints from real-world driving distributions.\n    - Employs geometry-aware embedding learning and an action-priority decoding strategy to improve planning precision and provide high-fidelity, explainable reasoning.\n\n- [MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.20060)\n  - Junli Wang, Xueyi Liu, Yinan Zheng, Zebing Xing, Pengfei Li, Guang Li, Kun Ma, Guang Chen, Hangjun Ye, Zhongpu Xia, Long Chen, Qichao Zhang\n  - Publish Date: 2026.02.23\n  - Code: [MeanFuser](https:\u002F\u002Fgithub.com\u002Fwjl2244\u002FMeanFuser)\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - MeanFuser, an end-to-end autonomous driving method that enhances efficiency and robustness by introducing Gaussian Mixture Noise (GMN) to guide generative sampling, enabling a continuous representation of the trajectory space.\n    - It adapts \"MeanFlow Identity\" to end-to-end planning, modeling the mean velocity field to eliminate numerical errors from ODE solvers and significantly accelerate inference.\n    - It designs a lightweight Adaptive Reconstruction Module (ARM) that enables the model to implicitly select from sampled proposals or reconstruct a new trajectory via attention weights.\n\n- [Safe and Interpretable Multimodal Path Planning for Multi-Agent Cooperation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.19304)\n  - Haojun Shi, Suyu Ye, Katherine M. 
Guerrerio, Jianzhi Shen, Yifan Yin, Daniel Khashabi, Chien-Ming Huang, Tianmin Shu\n  - Publish Date: 2026.02.22\n  - Task: Planning\n  - Summary：\n    - CaPE (Code as Path Editor), a safe and interpretable multimodal path planning method for multi-agent cooperation, which generates and updates path plans based on the environment and language communication.\n    - The method leverages a vision-language model (VLM) to synthesize a path editing program verified by a model-based planner, grounding communication to safe and interpretable path plan updates.\n    - Evaluated in diverse simulated and real-world scenarios including multi-robot and human-robot cooperation in autonomous driving, household, and joint carrying tasks, demonstrating enhanced plan alignment to communication.\n\n- [Open-Vocabulary Domain Generalization in Urban-Scene Segmentation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.18853)\n  - Dong Zhao, Qi Zang, Nan Pu, Wenjing Li, Nicu Sebe, Zhun Zhong\n  - Publisher: University of Trento\n  - Publish Date: 2026.02.21\n  - Task: Perception\n  - Datasets: [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F), [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [KITTI](http:\u002F\u002Fwww.cvlibs.net\u002Fdatasets\u002Fkitti\u002F), [BDD100K](https:\u002F\u002Fbdd-data.berkeley.edu\u002F)\n  - Summary：\n    - Introduces Open-Vocabulary Domain Generalization in Semantic Segmentation (OVDG-SS), a new setting addressing both unseen domains and unseen categories in urban-driving scenarios.\n    - Proposes S2-Corr, a state-space-driven text-image correlation refinement mechanism to mitigate domain-induced distortions in pre-trained Vision-Language Models.\n    - Establishes the first benchmark for OVDG-SS in autonomous driving, covering synthetic-to-real and real-to-real generalization.\n\n- [OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.18094)\n  - Ling Lin, Yang Bai, Heng Su, Congcong Zhu, Yaoxing Wang, Yang Zhou, Huazhu Fu, Jingrun Chen\n  - Publish Date: 2026.02.20\n  - Task: Evaluation\n  - Datasets: [OODBench](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.18094)\n  - Summary：\n    - Proposes OODBench, a predominantly automated method for constructing benchmarks to evaluate Vision-Language Models (VLMs) on out-of-distribution (OOD) data, containing 40K instance-level OOD instance-category pairs.\n    - Introduces a reliable automated assessment metric using a Basic-to-Advanced Progression of prompted questions to assess the impact of OOD data on questions of varying difficulty.\n    - Summarizes substantial findings and insights to facilitate future research in the acquisition and evaluation of OOD data, showing current VLMs exhibit notable performance degradation on OODBench.\n\n- [Conditional Flow Matching for Continuous Anomaly Detection in Autonomous Driving on a Manifold-Aware Spectral Space](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.17586)\n  - Antonio Guillen-Perez\n  - Publish Date: 2026.02.19\n  - Task: Perception\n  - Datasets: [Waymo Open Motion Dataset](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Summary：\n    - Presents Deep-Flow, an unsupervised safety-critical anomaly detection framework using Optimal Transport Conditional Flow Matching (OT-CFM) to model the probability density of expert driving.\n    - Introduces a low-rank spectral manifold via PCA for kinematic smoothness and stable log-likelihood estimation, and an Early Fusion Transformer with goal conditioning for 
multi-modal junctions.\n    - Identifies a predictability gap between kinematic danger and semantic non-compliance, surfacing overlooked out-of-distribution behaviors like lane violations to enable data-driven safety validation.\n\n- [DriveFine: Refining-Augmented Masked Diffusion VLA for Precise and Robust Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.14577)\n  - Chenxu Dang, Sining Ang, Yongkang Li, Haochen Tian, Jie Wang, Guang Li, Hangjun Ye, Jie Ma, Long Chen, Yan Wang\n  - Publish Date: 2026.02.16\n  - Code: [DriveFine](https:\u002F\u002Fgithub.com\u002FMSunDYY\u002FDriveFine)\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - DriveFine, a masked diffusion Vision-Language-Action (VLA) model combining flexible decoding with self-correction for autonomous driving.\n    - Introduces a novel plug-and-play block-MoE design, decoupling a refinement expert from a generation expert to preserve pretrained capabilities and enable explicit expert selection.\n    - Employs a hybrid reinforcement learning strategy to encourage effective exploration while maintaining stability, demonstrating strong efficacy and robustness on NAVSIM benchmarks.\n\n- [Self-Supervised JEPA-based World Models for LiDAR Occupancy Completion and Forecasting](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.12540)\n  - Haoran Zhu, Anna Choromanska\n  - Publisher: New York University\n  - Publish Date: 2026.02.13\n  - Task: Prediction\n  - Datasets: [Waymo Open Dataset](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Summary：\n    - Proposes AD-LiST-JEPA, a self-supervised world model for autonomous driving that predicts future spatiotemporal evolution from LiDAR data using a Joint-Embedding Predictive Architecture (JEPA).\n    - Evaluates the learned representations through a downstream LiDAR-based occupancy completion and forecasting (OCF) task, jointly assessing perception and prediction.\n    - Demonstrates improved OCF performance with a pretrained encoder after JEPA-based world model learning, leveraging large volumes of unlabeled data.\n\n- [Talk2DM: Enabling Natural Language Querying and Commonsense Reasoning for Vehicle-Road-Cloud Integrated Dynamic Maps with Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.11860)\n  - Lu Tao, Jinxuan Luo, Yousuke Watanabe, Zhengshu Zhou, Yuhuan Lu, Shen Ying, Pan Zhang, Fei Zhao, Hiroaki Takada\n  - Publish Date: 2026.02.12\n  - Task: Reasoning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Talk2DM, a plug-and-play module that extends Vehicle-Road-Cloud Dynamic Map (VRC-DM) systems with natural-language-supported querying and commonsense reasoning capabilities.\n    - Built upon a novel chain-of-prompt (CoP) mechanism that integrates human-defined rules with the commonsense knowledge of large language models (LLMs).\n    - Experiments show Talk2DM can seamlessly switch across different LLMs while maintaining high query accuracy, with practical potential demonstrated by achieving over 93% accuracy with a 2-5 second response time.\n\n- [SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.11656)\n  - Seo Hyun Kim, Jin Bok Park, Do Yeon Koo, Ho Gun Park, Il Yong Chun\n  - Publish Date: 2026.02.12\n  - Task: End-to-End\n  - Summary：\n    - Proposes SToRM, the first Supervised Token Reduction framework for multi-modal 
LLMs to enable efficient End-to-End autonomous driving while maintaining performance comparable to using all visual tokens.\n    - The framework uses a lightweight importance predictor, a supervised training approach with pseudo-supervision from an all-token LLM, and an anchor-context merging module to reduce token redundancy.\n    - Experiments show SToRM outperforms state-of-the-art E2E driving MLLMs under the same token budget, maintaining performance while reducing computational cost by up to 30x.\n\n- [ResWorld: Temporal Residual World Model for End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.10884)\n  - Jinqing Zhang, Zehua Fu, Zelin Xu, Wenying Dai, Qingjie Liu, Yunhong Wang\n  - Publish Date: 2026.02.11\n  - Code: [ResWorld](https:\u002F\u002Fgithub.com\u002Fmengtan00\u002FResWorld.git)\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Proposes Temporal Residual World Model (TR-World) for end-to-end autonomous driving, focusing on dynamic object modeling by calculating temporal residuals of scene representations to extract dynamic information without detection\u002Ftracking.\n    - Introduces a Future-Guided Trajectory Refinement (FGTR) module that interacts prior trajectories with future BEV features, refining trajectories with future road conditions and providing supervision to prevent world model collapse.\n    - Achieves state-of-the-art planning performance on nuScenes and NAVSIM datasets.\n\n- [From Steering to Pedalling: Do Autonomous Driving VLMs Generalize to Cyclist-Assistive Spatial Perception and Planning?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.10771)\n  - Krishna Kanth Nakka, Vedasri Nakka\n  - Publish Date: 2026.02.11\n  - Task: VQA, Perception, Planning\n  - Summary：\n    - Introduces CyclingVQA, a diagnostic benchmark to evaluate Vision-Language Models (VLMs) on cyclist-centric perception, spatio-temporal understanding, and traffic-rule-to-lane reasoning.\n    - Evaluates 31+ VLMs, finding current models show encouraging capabilities but have clear gaps in cyclist-specific reasoning, with driving-specialized models often underperforming strong generalist VLMs in this domain.\n    - Provides systematic error analysis identifying recurring failure modes to guide the development of more effective cyclist-assistive intelligent systems.\n\n- [HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.13329)\n  - Yiru Wang, Zichong Gu, Yu Gao, Anqing Jiang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun\n  - Publish Date: 2026.02.11\n  - Task: End-to-End\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Proposes HiST-VLA, a Hierarchical Spatio-Temporal VLA model for reliable trajectory generation, addressing limitations like imprecise numerical reasoning and weak 3D spatial awareness.\n    - Integrates geometric awareness, state history prompting, and dynamic token sparsification to enhance reasoning and computational efficiency.\n    - Employs a hierarchical transformer-based planner with dynamic latent regularization to refine VLA waypoints, achieving state-of-the-art performance on the NAVSIM v2 benchmark.\n\n- [Found-RL: foundation model-enhanced reinforcement learning for autonomous driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.10458)\n  - 
Yansong Qu, Zihao Sheng, Zilin Huang, Jiancong Chen, Yuhao Luo, Tianyi Wang, Yiheng Feng, Samuel Labi, Sikai Chen\n  - Publisher: Purdue University\n  - Publish Date: 2026.02.11\n  - Code: [Found-RL](https:\u002F\u002Fgithub.com\u002Fys-qu\u002Ffound-rl)\n  - Task: End-to-End\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - Found-RL, a platform to efficiently enhance RL for autonomous driving using foundation models, featuring an asynchronous batch inference framework to decouple heavy VLM reasoning from the simulation loop.\n    - Introduces Value-Margin Regularization (VMR) and Advantage-Weighted Action Guidance (AWAG) to distill VLM action suggestions into the RL policy, and uses a Conditional Contrastive Action Alignment method to address CLIP's dynamic blindness for reward shaping.\n\n- [Adaptive Time Step Flow Matching for Autonomous Driving Motion Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.10285)\n  - Ananya Trivedi, Anjian Li, Mohamed Elnoor, Yusuf Umut Ciftci, Avinash Singh, Jovin D'sa, Sangjae Bae, David Isele, Taskin Padir, Faizan M. Tariq\n  - Publisher: Northeastern University\n  - Publish Date: 2026.02.10\n  - Project Page: [Flow Matching for Self-Driving](https:\u002F\u002Fflow-matching-self-driving.github.io\u002F)\n  - Task: Planning\n  - Datasets: [Waymo Open Motion Dataset](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Summary：\n    - Proposes a conditional flow matching framework for real-time joint prediction of surrounding agents and ego trajectory planning, using a lightweight variance estimator to adaptively select inference steps online.\n    - Introduces a trajectory post-processing step as a convex quadratic program to enhance ride quality with negligible computational overhead.\n    - Achieves a 20 Hz update rate, enabling online deployment, and demonstrates improved smoothness and constraint adherence over baseline models on maneuvers like lane changes and unprotected turns.\n\n- [SteerVLA: Steering Vision-Language-Action Models in Long-Tail Driving Scenarios](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.08440)\n  - Tian Gao, Celine Tan, Catherine Glossop, Timothy Gao, Jiankai Sun, Kyle Stachowicz, Shirley Wu, Oier Mees, Dorsa Sadigh, Sergey Levine, Chelsea Finn\n  - Publish Date: 2026.02.09\n  - Project Page: [SteerVLA](https:\u002F\u002Fsteervla.github.io\u002F)\n  - Task: Planning\n  - Summary：\n    - SteerVLA, a method that leverages the reasoning of Vision-Language Models (VLMs) to produce fine-grained language instructions that steer a Vision-Language-Action (VLA) driving policy.\n    - Introduces a rich language interface between a high-level VLM and a low-level VLA policy to ground reasoning in control outputs, using a VLM to augment driving data with detailed language annotations.\n    - Evaluated on a challenging closed-loop benchmark, outperforming state-of-the-art methods by 4.77 points in overall driving score and by 8.04 points on a long-tail subset.\n\n- [Vision and language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.07680)\n  - Ross Greer, Maitrayee Keskar, Angel Martinez-Sanchez, Parthib Roy, Shashank Shriram, Mohan Trivedi\n  - Publish Date: 2026.02.07\n  - Task: Planning\n  - Datasets: [Waymo Open Dataset](https:\u002F\u002Fwaymo.com\u002Fopen\u002F), [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Introduces a lightweight, category-agnostic 
hazard screening approach using CLIP-based image-text similarity for low-latency semantic hazard detection without explicit object detection.\n    - Examines integrating scene-level vision-language embeddings into a transformer-based trajectory planner, finding naive conditioning does not improve accuracy, highlighting the need for task-informed representation extraction.\n    - Investigates using natural language as an explicit behavioral constraint on motion planning, showing passenger-style instructions grounded in visual scenes suppress severe planning failures and improve safety in ambiguous scenarios.\n\n- [Seeing Roads Through Words: A Language-Guided Framework for RGB-T Driving Scene Segmentation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.07343)\n  - Ruturaj Reddy, Hrishav Bakul Barua, Junn Yong Loo, Thanh Thi Nguyen, Ganesh Krishnasamy\n  - Publish Date: 2026.02.07\n  - Task: Perception\n  - Summary：\n    - Proposes CLARITY, a framework for RGB-Thermal semantic segmentation that dynamically adapts its fusion strategy based on detected scene conditions using vision-language model (VLM) priors.\n    - Introduces mechanisms to preserve valid dark-object semantics often discarded by noise-suppression methods and a hierarchical decoder to enforce structural consistency for sharper object boundaries.\n    - Achieves state-of-the-art performance on the MFNet dataset with 62.3% mIoU and 77.5% mAcc.\n\n- [DriveWorld-VLA: Unified Latent-Space World Modeling with Vision-Language-Action for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.06521)\n  - Feiyang Jia, Lin Liu, Ziying Song, Caiyan Jia, Hangjun Ye, Xiaoshuai Hao, Long Chen\n  - Publish Date: 2026.02.06\n  - Code: [DriveWorld-VLA](https:\u002F\u002Fgithub.com\u002Fliulin815\u002FDriveWorld-VLA.git)\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim), [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - A novel framework that unifies world modeling and planning within a latent space by tightly integrating Vision-Language-Action (VLA) and world models at the representation level.\n    - Enables the VLA planner to benefit directly from holistic scene-evolution modeling and assess how candidate actions impact future scene evolution, reducing reliance on dense annotated supervision.\n    - Achieves state-of-the-art performance on NAVSIM and nuScenes benchmarks, supporting controllable, action-conditioned imagination at the feature level.\n\n- [ROMAN: Reward-Orchestrated Multi-Head Attention Network for Autonomous Driving System Testing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.05629)\n  - Jianlei Chi, Yuzhen Wu, Jiaxuan Hou, Xiaodong Zhang, Ming Fan, Suhui Sun, Weijun Dai, Bo Li, Jianguo Sun, Jun Sun\n  - Publisher: Baidu\n  - Publish Date: 2026.02.05\n  - Task: Generation\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - ROMAN, a novel scenario generation approach for ADS testing that combines a multi-head attention network with a traffic law weighting mechanism to generate high-risk violation scenarios.\n    - The approach uses an LLM-based risk weighting module to evaluate violations based on severity and occurrence, modeling interactions among vehicles and traffic signals.\n    - Evaluated on the Baidu Apollo ADS in CARLA, ROMAN surpassed state-of-the-art tools in violation count and diversity, generating scenarios for every clause of input traffic laws.\n\n- [AppleVLM: End-to-end Autonomous Driving 
with Advanced Perception and Planning-Enhanced Vision-Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.04256)\n  - Yuxuan Han, Kunyuan Wu, Qianyi Shao, Renxiang Xiao, Zilu Wang, Cansen Jiang, Yi Xiao, Liang Hu, Yunjiang Lou\n  - Publish Date: 2026.02.04\n  - Task: End-to-End\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - AppleVLM, an advanced perception and planning-enhanced VLM model for robust end-to-end driving, addressing challenges like suboptimal lane perception and language biases.\n    - Introduces a novel vision encoder using a deformable transformer for multi-view, multi-timestep fusion and a dedicated planning modality encoding Bird's-Eye-View information to mitigate navigation biases.\n    - Features a VLM decoder fine-tuned by a hierarchical Chain-of-Thought to integrate vision, language, and planning features, achieving state-of-the-art performance in CARLA and real-world deployment on an AGV.\n\n- [Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.04184)\n  - Angel Martinez-Sanchez, Parthib Roy, Ross Greer\n  - Publisher: Mi3-Lab\n  - Publish Date: 2026.02.04\n  - Code: [doScenes-VLM-Planning](https:\u002F\u002Fgithub.com\u002FMi3-Lab\u002FdoScenes-VLM-Planning)\n  - Task: Planning\n  - Datasets: [doScenes](https:\u002F\u002Fgithub.com\u002FMi3-Lab\u002FdoScenes-VLM-Planning), [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Introduces doScenes, the first real-world dataset linking free-form passenger instructions to nuScenes ground-truth motion for instruction-conditioned planning.\n    - Adapts the OpenEMMA end-to-end driving framework to integrate passenger-style language prompts, enabling linguistic conditioning before trajectory generation.\n    - Demonstrates that instruction conditioning substantially improves planning robustness, preventing extreme failures and improving trajectory alignment with well-phrased prompts.\n\n- [InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.03242)\n  - Zhuoran Yang, Xi Guo, Chenjing Ding, Chiyu Wang, Wei Wu, Yanyong Zhang\n  - Publish Date: 2026.02.03\n  - Project Page: [InstaDrive](https:\u002F\u002Fshanpoyang654.github.io\u002FInstaDrive\u002Fpage.html)\n  - Task: Generation\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - InstaDrive, a novel framework for generating realistic driving videos, addresses instance-level temporal consistency and spatial geometric fidelity in world models.\n    - Introduces an Instance Flow Guider to propagate instance features for temporal consistency and a Spatial Geometric Aligner for precise positioning and occlusion modeling.\n    - Achieves state-of-the-art video generation quality and enhances downstream autonomous driving tasks, with procedural simulation in CARLA for safety-critical scenario evaluation.\n\n- [ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.03213)\n  - Zhuoran Yang, Yanyong Zhang\n  - Publish Date: 2026.02.03\n  - Project Page: [ConsisDrive](https:\u002F\u002Fshanpoyang654.github.io\u002FConsisDrive\u002Fpage.html)\n  - Task: Generation\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Introduces ConsisDrive, an 
identity-preserving driving world model designed to enforce temporal consistency at the instance level to prevent identity drift in generated videos.\n    - Incorporates two key components: Instance-Masked Attention to preserve object identity across frames and Instance-Masked Loss to adaptively emphasize foreground regions.\n    - Achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset.\n\n- [LiFlow: Flow Matching for 3D LiDAR Scene Completion](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.02232)\n  - Andrea Matteazzi, Dietmar Tutsch\n  - Publisher: University of Parma\n  - Publish Date: 2026.02.02\n  - Code: [LiFlow](https:\u002F\u002Fgithub.com\u002Fmatteandre\u002FLiFlow)\n  - Task: Perception\n  - Datasets: [KITTI](https:\u002F\u002Fwww.cvlibs.net\u002Fdatasets\u002Fkitti\u002F)\n  - Summary：\n    - Introduces the first flow matching framework for 3D LiDAR scene completion, improving upon diffusion-based methods by ensuring consistent initial distributions between training and inference.\n    - Employs a nearest neighbor flow matching loss and a Chamfer distance loss to enhance both local structure and global coverage in the alignment of point clouds.\n    - Achieves state-of-the-art performance across multiple metrics for scene completion.\n\n- [UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.02002)\n  - Guosheng Zhao, Yaozeng Wang, Xiaofeng Wang, Zheng Zhu, Tingdong Yu, Guan Huang, Yongchen Zai, Ji Jiao, Changliang Xue, Xiaole Wang, Zhen Yang, Futang Zhu, Xingang Wang\n  - Publish Date: 2026.02.02\n  - Task: Prediction, Generation\n  - Summary：\n    - UniDriveDreamer, a single-stage unified multimodal world model for autonomous driving that directly generates multimodal future observations (multi-camera video and LiDAR sequences) without intermediate representations or cascaded modules.\n    - Introduces a LiDAR-specific VAE and a video VAE, with a proposed Unified Latent Anchoring (ULA) method to align their latent distributions for cross-modal compatibility and training stability.\n    - Employs a diffusion transformer to jointly model the geometric correspondence and temporal evolution of the aligned, fused features, conditioned by structured scene layout information per modality.\n\n- [UniDWM: Towards a Unified Driving World Model via Multifaceted Representation Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.01536)\n  - Shuai Liu, Siheng Ren, Xiaoyao Zhu, Quanmin Liang, Zefeng Li, Qiang Li, Xin Hu, Kai Huang\n  - Publish Date: 2026.02.02\n  - Code: [UniDWM](https:\u002F\u002Fgithub.com\u002FSay2L\u002FUniDWM)\n  - Task: Perception, Prediction, Planning, Generation\n  - Summary：\n    - UniDWM, a unified driving world model that advances autonomous driving through multifaceted representation learning, constructing a structure- and dynamic-aware latent world representation.\n    - The model uses a joint reconstruction pathway and a collaborative generation framework with a conditional diffusion transformer to forecast future world evolution.\n    - The framework is shown to be a variation of VAE, providing theoretical guidance for multifaceted representation learning, and is effective for trajectory planning, 4D reconstruction and generation.\n\n- [HERMES: A Holistic End-to-End Risk-Aware Multimodal Embodied System with Vision-Language Models for Long-Tail Autonomous 
Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.00993)\n  - Weizhe Tang, Junwei You, Jiaxi Liu, Zhaoyi Wang, Rui Gan, Zilin Huang, Feng Wei, Bin Ran\n  - Publisher: University of Wisconsin-Madison, University of California, Berkeley, Tsinghua University\n  - Publish Date: 2026.02.01\n  - Task: End-to-End\n  - Summary：\n    - HERMES, a holistic risk-aware end-to-end multimodal driving framework designed to inject explicit long-tail risk cues into trajectory planning for safe operation in mixed-traffic scenarios.\n    - Employs a foundation-model-assisted annotation pipeline to produce structured Long-Tail Scene Context and Long-Tail Planning Context, capturing hazard-centric cues with maneuver intent and safety preference.\n    - Introduces a Tri-Modal Driving Module that fuses multi-view perception, historical motion cues, and semantic guidance for risk-aware accurate trajectory planning under long-tail conditions.\n\n- [Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.22032)\n  - Linhan Wang, Zichong Yang, Chen Bai, Guoxiang Zhang, Xiaotong Liu, Xiaoyin Zheng, Xiao-Xiao Long, Chang-Tien Lu, Cheng Lu\n  - Publish Date: 2026.01.29\n  - Task: End-to-End\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Proposes Drive-JEPA, a framework integrating Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end autonomous driving.\n    - Adapts V-JEPA to pretrain a ViT encoder on driving videos for predictive representations aligned with planning, and introduces a proposal-centric planner that distills diverse simulator and human trajectories.\n    - Achieves state-of-the-art performance on NAVSIM, with the V-JEPA representation outperforming prior methods and the full framework setting new benchmarks.\n\n- [LLM-Driven Scenario-Aware Planning for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.21876)\n  - He Li, Zhaowei Chen, Rui Gao, Guoliang Li, Qi Hao, Shuai Wang, Chengzhong Xu\n  - Publish Date: 2026.01.29\n  - Task: Planning\n  - Summary：\n    - Proposes LAP, an LLM-driven adaptive planning method that switches between high-speed and precise driving modes for autonomous vehicles.\n    - Achieves this by leveraging an LLM for scene understanding and integrating its inference into a joint optimization of mode configuration and motion planning, solved with tree-search MPC and alternating minimization.\n    - Implementation in ROS shows superior performance in simulation compared to benchmarks in terms of driving time and success rate.\n\n- [Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.21288)\n  - Weitong Lian, Zecong Tang, Haoran Li, Tianjian Gao, Yifei Wang, Zixu Wang, Lingyi Meng, Tengju Ru, Zhejun Cui, Yichen Zhu, Hangshuo Cao, Qi Kang, Tianxing Chen, Yusen Qin, Kaixuan Wang, Yu Zhang\n  - Publish Date: 2026.01.29\n  - Task: Planning\n  - Summary：\n    - Drive-KD, a framework that decomposes autonomous driving into a \"perception-reasoning-planning\" triad and transfers these capabilities via knowledge distillation.\n    - Introduces layer-specific attention as the distillation signal to construct capability-specific single-teacher models and a multi-teacher framework with asymmetric gradient projection to mitigate gradient conflicts.\n    - The distilled InternVL3-1B model achieves better overall 
performance than the pretrained 78B model on DriveBench and surpasses GPT-5.1 on planning, with significantly reduced GPU memory and higher throughput.\n\n- [ScenePilot-Bench: A Large-Scale Dataset and Benchmark for Evaluation of Vision-Language Models in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.19582)\n  - Yujin Wang, Yutong Zheng, Wenxian Fan, Tianyi Wang, Hongqing Chu, Daxin Tian, Bingzhao Gao, Jianqiang Wang, Hong Chen\n  - Publish Date: 2026.01.27\n  - Task: Evaluation\n  - Summary：\n    - Introduces ScenePilot-Bench, a large-scale first-person driving benchmark built on the diverse ScenePilot-4K dataset (3,847 hours of driving videos with multi-granularity annotations).\n    - Features a four-axis evaluation suite assessing VLM capabilities in scene understanding, spatial perception, motion planning, and GPT-Score, with safety-aware metrics and cross-region generalization settings.\n    - Provides empirical analyses benchmarking representative VLMs to clarify performance boundaries and identify gaps for driving-oriented reasoning in safety-critical contexts.\n\n- [AutoDriDM: An Explainable Benchmark for Decision-Making of Vision-Language Models in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.14702)\n  - Zecong Tang, Zixu Wang, Yifei Wang, Weitong Lian, Tianjian Gao, Haoran Li, Tengju Ru, Lingyi Meng, Zhejun Cui, Yichen Zhu, Qi Kang, Kaixuan Wang, Yu Zhang\n  - Publish Date: 2026.01.21\n  - Task: Evaluation\n  - Datasets: [AutoDriDM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.14702)\n  - Summary：\n    - AutoDriDM, a decision-centric, progressive benchmark for evaluating Vision-Language Models (VLMs) in autonomous driving, containing 6,650 questions across Object, Scene, and Decision dimensions.\n    - The benchmark evaluates mainstream VLMs to delineate the perception-to-decision capability boundary, revealing weak alignment between perception and decision-making performance.\n    - It includes explainability analyses of models' reasoning processes, identifying key failure modes, and introduces an analyzer model to automate large-scale annotation.\n\n- [VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.12672)\n  - Qimao Chen, Fang Li, Shaoqing Xu, Zhiyi Lai, Zixun Xie, Yuechen Luo, Shengyin Jiang, Hanbing Li, Long Chen, Bing Wang, Yi Zhang, Zhi-Xin Yang\n  - Publish Date: 2026.01.19\n  - Task: Planning\n  - Summary：\n    - Introduces VILTA (VLM-In-the-Loop Trajectory Adversary), a novel framework that integrates a Vision Language Model (VLM) directly into the closed-loop training of autonomous driving agents.\n    - The VLM actively comprehends the dynamic environment and strategically generates challenging, long-tail scenarios by directly editing surrounding agents' future trajectories.\n    - This approach leverages the VLM's generalization to create a diverse curriculum of plausible critical scenarios, substantially enhancing the safety and robustness of the resulting driving policy.\n\n- [Generative Scenario Rollouts for End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.11475)\n  - Rajeev Yasarla, Deepti Hegde, Shizhong Han, Hsin-Pai Cheng, Yunxiao Shi, Meysam Sadeghigooghari, Shweta Mahajan, Apratim Bhattacharyya, Litian Liu, Risheek Garrepalli, Thomas Svantesson, Fatih Porikli, Hong Cai\n  - Publisher: Qualcomm AI Research\n  - Publish Date: 2026.01.16\n  - Task: Planning\n  - Datasets: 
[Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)\n  - Summary：\n    - Proposes Generative Scenario Rollouts (GeRo), a plug-and-play framework for Vision-Language-Action (VLA) models that jointly performs planning and generation of language-grounded future traffic scenes via autoregressive rollouts.\n    - Employs a rollout-consistency loss to stabilize predictions and preserve text-action alignment, enabling temporally consistent, language-grounded rollouts for long-horizon reasoning and multi-agent planning.\n    - Integrates reinforcement learning with generative rollouts, achieving state-of-the-art closed-loop and open-loop performance on Bench2Drive with strong zero-shot robustness.\n\n- [MAD: Motion Appearance Decoupling for efficient Driving World Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.09452)\n  - Ahmad Rahimi, Valentin Gerard, Eloi Zablocki, Matthieu Cord, Alexandre Alahi\n  - Publisher: EPFL VITA\n  - Publish Date: 2026.01.14\n  - Project Page: [MAD-World-Model](https:\u002F\u002Fvita-epfl.github.io\u002FMAD-World-Model\u002F)\n  - Task: Generation\n  - Summary：\n    - Proposes an efficient adaptation framework to convert generalist video diffusion models into controllable driving world models by decoupling motion learning from appearance synthesis.\n    - Uses a two-stage reasoning-rendering paradigm: first inferring dynamics via skeletonized agent videos, then rendering realistic RGB appearance, achieving state-of-the-art performance with less than 6% of prior compute.\n\n- [Leveraging 3D Representation Alignment and RGB Pretrained Priors for LiDAR Scene Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.07692)\n  - Nicolas Sereyjol-Garros, Ellington Kirby, Victor Besnier, Nermin Samet\n  - Publisher: Valeo AI\n  - Publish Date: 2026.01.12\n  - Project Page: [R3DPA](https:\u002F\u002Fgithub.com\u002Fvaleoai\u002FR3DPA)\n  - Code: [R3DPA](https:\u002F\u002Fgithub.com\u002Fvaleoai\u002FR3DPA)\n  - Task: Generation\n  - Datasets: [KITTI-360](https:\u002F\u002Fwww.cvlibs.net\u002Fdatasets\u002Fkitti-360\u002F)\n  - Summary：\n    - R3DPA, a LiDAR scene generation method that unlocks image-pretrained priors for LiDAR point clouds and leverages self-supervised 3D representations for state-of-the-art results.\n    - The method aligns intermediate generative features with self-supervised 3D features to improve quality and transfers knowledge from large-scale image-pretrained models to mitigate limited LiDAR data.\n    - Enables point cloud control at inference for tasks like object inpainting and scene mixing using only an unconditional model.\n\n- [Efficient Visual Question Answering Pipeline for Autonomous Driving via Scene Region Compression](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.07092)\n  - Yuliang Cai, Dongqiangzi Ye, Zitian Chen, Chongruo Wu\n  - Publish Date: 2026.01.11\n  - Task: VQA\n  - Summary：\n    - Proposes SRC-Pipeline, an efficient VLM framework for autonomous driving VQA that compresses early frame tokens into a small number of high-level tokens while keeping full tokens for recent frames.\n    - Achieves a 66% reduction in FLOPs while maintaining comparable performance, enabling more effective real-time operation in safety-critical driving settings.\n\n- [Modular Autonomy with Conversational Interaction: An LLM-driven Framework for Decision Making in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.05806)\n  - Marvin Seegert, Korbinian Moller, Johannes Betz\n  - Publisher: Technical 
University of Munich\n  - Publish Date: 2026.01.09\n  - Task: Planning\n  - Summary：\n    - Proposes an LLM-driven framework for conversational interaction in Autonomous Driving Systems (ADS), integrating an LLM-based interaction layer with the Autoware software stack.\n    - Introduces a three-component methodology: a taxonomization of interaction categories, an application-centric Domain Specific Language (DSL) for command translation, and a safety-preserving validation layer.\n    - Employs a two-stage LLM architecture to ensure high transparency and provide execution feedback, with evaluation confirming timing efficiency and translation robustness.\n\n- [SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.05640)\n  - Jingyu Li, Junjie Wu, Dongnan Hu, Xiangkai Huang, Bin Sun, Zhihui Hao, Xianpeng Lang, Xiatian Zhu, Li Zhang\n  - Publish Date: 2026.01.09\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Proposes SGDrive, a novel framework that structures a Vision-Language Model's (VLM) representation learning around a driving-specific scene-agent-goal hierarchy to mirror human driving cognition.\n    - Addresses limitations of generalist VLMs by providing structured spatial-temporal representations for safe trajectory planning, integrating multi-level information into a compact format.\n    - Achieves state-of-the-art performance among camera-only methods on the NAVSIM benchmark, validating the effectiveness of hierarchical knowledge structuring for adapting VLMs to autonomous driving.\n\n- [LatentVLA: Efficient Vision-Language Models for Autonomous Driving via Latent Action Prediction](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.05611)\n  - Chengen Xie, Bin Sun, Tianyu Li, Junjie Wu, Zhihui Hao, XianPeng Lang, Hongyang Li\n  - Publish Date: 2026.01.09\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim), [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - LatentVLA, a novel framework using self-supervised latent action prediction to train Vision-Language-Action (VLA) models without language annotations, eliminating linguistic bias.\n    - It transfers VLA generalization to efficient vision-based networks via knowledge distillation, achieving robust performance and real-time efficiency.\n    - Establishes a new state-of-the-art on the NAVSIM benchmark (PDMS score of 92.4) and shows strong zero-shot generalization on nuScenes.\n\n- [ThinkDrive: Chain-of-Thought Guided Progressive Reinforcement Learning Fine-Tuning for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.04714)\n  - Chang Zhao, Zheming Yang, Yunqing Hu, Qi Guo, Zijian Wang, Pengcheng Li, Wen Ji\n  - Publish Date: 2026.01.08\n  - Task: Planning\n  - Summary：\n    - ThinkDrive, a Chain-of-Thought (CoT) guided progressive reinforcement learning fine-tuning framework for autonomous driving that synergizes explicit reasoning with difficulty-aware adaptive policy optimization.\n    - The method employs a two-stage training strategy: first performing supervised fine-tuning (SFT) using CoT explanations, then applying progressive RL with a difficulty-aware adaptive policy optimizer.\n    - Evaluation on a public dataset shows ThinkDrive outperforms strong RL baselines and a 2B-parameter model trained with this method surpasses the larger GPT-4o on the exam metric.\n\n- [UniDrive-WM: Unified 
Understanding, Planning and Generation World Model For Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.04453)\n  - Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, Liu Ren\n  - Publisher: Washington University in St. Louis, Bosch Research\n  - Publish Date: 2026.01.07\n  - Project Page: [UniDrive-WM](https:\u002F\u002Funidrive-wm.github.io\u002FUniDrive-WM)\n  - Task: Planning, Generation\n  - Datasets: [Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)\n  - Summary：\n    - UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture.\n    - The model's predicted trajectory conditions a VLM-based image generator to produce plausible future frames, providing supervisory signals that enhance understanding and iteratively refine trajectory generation.\n    - Experiments on Bench2Drive show improvements of 5.9% in L2 trajectory error and 9.2% in collision rate over previous methods, demonstrating advantages of integrating reasoning, planning, and generative modeling.\n\n- [A Vision-Language-Action Model with Visual Prompt for OFF-Road Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.03519)\n  - Liangdong Zhang, Yiming Nie, Haoyang Li, Fanjie Kong, Baobao Zhang, Shunxin Huang, Kai Fu, Chen Min, Liang Xiao\n  - Publish Date: 2026.01.07\n  - Task: Planning\n  - Datasets: [RELLIS-3D](https:\u002F\u002Fgithub.com\u002Funmannedlab\u002FRELLIS-3D)\n  - Summary：\n    - Proposes OFF-EMMA, an end-to-end multimodal framework for off-road autonomous driving, addressing insufficient spatial perception and unstable reasoning in VLA models.\n    - Introduces a visual prompt block using semantic segmentation masks to enhance spatial understanding and a chain-of-thought with self-consistency (COT-SC) reasoning strategy to improve planning robustness.\n\n- [FROST-Drive: Scalable and Efficient End-to-End Driving with a Frozen Vision Encoder](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.03460)\n  - Zeyu Dong, Yimin Zhu, Yu Wu, Yu Sun\n  - Publish Date: 2026.01.06\n  - Task: End-to-End\n  - Datasets: [Waymo Open E2E Dataset](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Summary：\n    - FROST-Drive, a novel End-to-End (E2E) architecture that preserves the generalization of a pretrained Vision-Language Model (VLM) by keeping its vision encoder frozen.\n    - The model combines the frozen encoder with a transformer-based adapter and a GRU-based decoder for waypoint generation, optimized with a custom loss for Rater Feedback Score (RFS).\n    - Demonstrates superior performance on the Waymo Open E2E Dataset, showing that leveraging a frozen, generalist VLM encoder is more effective for robust driving than full fine-tuning.\n\n- [DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.01528)\n  - Yang Zhou, Hao Shao, Letian Wang, Zhuofan Zong, Hongsheng Li, Steven L. 
Waslander\n  - Publisher: University of Toronto, Shanghai AI Laboratory, The Chinese University of Hong Kong\n  - Publish Date: 2026.01.04\n  - Task: Evaluation\n  - Datasets: [DrivingGen](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.01528)\n  - Summary：\n    - DrivingGen, the first comprehensive benchmark for generative driving world models, combining a diverse evaluation dataset with a suite of new metrics.\n    - The benchmark assesses visual realism, trajectory plausibility, temporal coherence, and controllability to foster reliable and deployable driving world models.\n    - Evaluation of 14 state-of-the-art models reveals trade-offs between general models and driving-specific ones, offering a unified framework for scalable simulation and planning.\n\n- [Dichotomous Diffusion Policy Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.00898)\n  - Ruiming Liang, Yinan Zheng, Kexin Zheng, Tianyi Tan, Jianxiong Li, Liyuan Mao, Zhihao Wang, Guang Chen, Hangjun Ye, Jingjing Liu, Jinqiao Wang, Xianyuan Zhan\n  - Publish Date: 2025.12.31\n  - Task: End-to-End\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Proposes DIPOLE (Dichotomous diffusion Policy improvement), a novel RL algorithm for stable and controllable diffusion policy optimization by decomposing the optimal policy into a pair of dichotomous policies for reward maximization and minimization.\n    - Enables flexible control over greediness during inference by linearly combining the scores of the learned dichotomous policies.\n    - Demonstrates effectiveness in offline and offline-to-online RL settings (ExORL, OGBench) and for training a large vision-language-action model for end-to-end autonomous driving on the NAVSIM benchmark.\n\n- [LSRE: Latent Semantic Rule Encoding for Real-Time Semantic Risk Detection in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.24712)\n  - Qian Cheng, Weitao Zhou, Cheng Jing, Nanshan Deng, Junze Wen, Zhaoyang Liu, Kun Jiang, Diange Yang\n  - Publish Date: 2025.12.31\n  - Task: Perception\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - LSRE, a Latent Semantic Rule Encoding framework that converts sparsely sampled VLM judgments into decision boundaries within a recurrent world model's latent space for real-time semantic risk assessment.\n    - Enables real-time semantic risk detection at 10 Hz without per-frame VLM queries, achieving accuracy comparable to a large VLM baseline with earlier hazard anticipation and low latency.\n    - Demonstrates generalization to rarely seen semantic-similar test cases, indicating language-guided latent classification is an effective mechanism for semantic safety monitoring.\n\n- [Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.24426)\n  - Zhenghao \"Mark\" Peng, Wenhao Ding, Yurong You, Yuxiao Chen, Wenjie Luo, Thomas Tian, Yulong Cao, Apoorva Sharma, Danfei Xu, Boris Ivanovic, Boyi Li, Bolei Zhou, Yan Wang, Marco Pavone\n  - Publish Date: 2025.12.30\n  - Task: Planning\n  - Summary：\n    - Introduces Counterfactual VLA (CF-VLA), a self-reflective VLA framework that enables reasoning about and revising planned actions before execution via counterfactual reasoning.\n    - Proposes a rollout-filter-label pipeline to mine high-value scenes and label counterfactual reasoning traces for efficient training.\n    - Demonstrates improvements in trajectory accuracy (up to 
17.6%) and safety metrics (20.5%), with adaptive reasoning enabled only in challenging scenarios.\n\n- [Spatial-aware Vision Language Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.24331)\n  - Weijie Wei, Zhipeng Luo, Ling Feng, Venice Erin Liong\n  - Publish Date: 2025.12.30\n  - Task: Perception\n  - Summary：\n    - LVLDrive (LiDAR-Vision-Language), a novel framework designed to upgrade existing Vision-Language Models (VLMs) with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality.\n    - Introduces a Gradual Fusion Q-Former to incrementally inject LiDAR features, mitigating catastrophic disturbance to pre-trained VLMs and preserving their existing knowledge base.\n    - Develops a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities, achieving superior performance on driving benchmarks.\n\n- [DriveLaW: Unifying Planning and Video Generation in a Latent Driving World](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.23421)\n  - Tianze Xia, Yongkang Li, Lijun Zhou, Jingfeng Yao, Kaixin Xiong, Haiyang Sun, Bing Wang, Kun Ma, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang\n  - Publisher: Huazhong University of Science and Technology, Xiaomi EV\n  - Publish Date: 2025.12.29\n  - Task: Planning, Generation\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - DriveLaW, a novel paradigm that unifies video generation and motion planning for autonomous driving by directly injecting latent representations from a video generator into a planner.\n    - The framework consists of DriveLaW-Video, a world model for high-fidelity forecasting, and DriveLaW-Act, a diffusion planner for generating consistent trajectories, both optimized via a three-stage progressive training strategy.\n    - Achieves state-of-the-art results, significantly advancing video prediction metrics and setting a new record on the NAVSIM planning benchmark.\n\n- [ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.22939)\n  - Qihang Peng, Xuesong Chen, Chenye Yang, Shaoshuai Shi, Hongsheng Li\n  - Publisher: The Chinese University of Hong Kong, Shanghai AI Laboratory\n  - Publish Date: 2025.12.28\n  - Project Page: [ColaVLA](https:\u002F\u002Fpqh22.github.io\u002Fprojects\u002FColaVLA\u002Findex.html)\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - ColaVLA, a unified vision-language-action framework that transfers reasoning from text to a latent space for efficient trajectory planning.\n    - Introduces a Cognitive Latent Reasoner for compact, decision-oriented scene understanding and a Hierarchical Parallel Planner for single-pass, causality-consistent trajectory generation.\n    - Achieves state-of-the-art performance on the nuScenes benchmark in open-loop and closed-loop settings with improved efficiency and robustness.\n\n- [KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.20299)\n  - Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang\n  - Publish Date: 2025.12.23\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)\n  - Summary：\n    - 
Proposes KnowVal, an autonomous driving system integrating visual-language reasoning with a comprehensive driving knowledge graph and an LLM-based retrieval mechanism.\n    - Develops a human-preference dataset and a Value Model to guide interpretable, value-aligned trajectory assessment.\n    - Achieves state-of-the-art planning performance, including the lowest collision rate on nuScenes and top results on Bench2Drive.\n\n- [WorldRFT: Latent World Model Planning with Reinforcement Fine-Tuning for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.19133)\n  - Pengxuan Yang, Ben Lu, Zhongpu Xia, Chao Han, Yinfeng Gao, Teng Zhang, Kun Zhan, XianPeng Lang, Yupeng Zheng, Qichao Zhang\n  - Publish Date: 2025.12.22\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - WorldRFT, a planning-oriented latent world model framework that aligns scene representation learning with planning via hierarchical decomposition and local-aware interactive refinement, augmented by reinforcement learning fine-tuning (RFT).\n    - Introduces Group Relative Policy Optimization (GRPO) with trajectory Gaussianization and collision-aware rewards to fine-tune the driving policy for safety improvements.\n    - Achieves state-of-the-art performance, reducing collision rates by 83% on nuScenes and attaining competitive camera-only performance on NAVSIM compared to LiDAR-based SOTA methods.\n\n- [InDRiVE: Reward-Free World-Model Pretraining for Autonomous Driving via Latent Disagreement](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.18850)\n  - Feeza Khan Khanzada, Jaerock Kwon\n  - Publisher: University of Michigan-Dearborn\n  - Publish Date: 2025.12.21\n  - Task: Planning\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - InDRiVE, a DreamerV3-style model-based reinforcement learning agent for autonomous driving that performs reward-free pretraining using only intrinsic motivation derived from latent ensemble disagreement.\n    - The framework uses disagreement as a proxy for epistemic uncertainty to drive exploration and learns a planner-free exploration policy directly from the learned world model.\n    - After intrinsic pretraining, the agent demonstrates strong zero-shot robustness and robust few-shot adaptation for downstream tasks like lane following and collision avoidance in CARLA across various towns and traffic densities.\n\n- [LLaViDA: A Large Language Vision Driving Assistant for Explicit Reasoning and Enhanced Trajectory Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.18211)\n  - Yudong Liu, Spencer Hallyburton, Jiwoo Kim, Yueqian Lin, Yiming Li, Qinsi Wang, Hui Ye, Jingwei Sun, Miroslav Pajic, Yiran Chen, Hai Li\n  - Publish Date: 2025.12.20\n  - Code: [LLaViDA](https:\u002F\u002Fgithub.com\u002F)\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - LLaViDA, a Large Language Vision Driving Assistant that leverages a Vision-Language Model (VLM) for object motion prediction, semantic grounding, and chain-of-thought reasoning for trajectory planning.\n    - Employs a two-stage training pipeline of supervised fine-tuning followed by Trajectory Preference Optimization (TPO) to enhance scene understanding and planning.\n    - Achieves state-of-the-art performance on the nuScenes benchmark for open-loop trajectory planning, with an average L2 error of 0.31 m and a 0.10% collision 
rate.\n\n- [Driving in Corner Case: A Real-World Adversarial Closed-Loop Evaluation Platform for End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.16055)\n  - Jiaheng Geng, Jiatong Du, Xinyu Zhang, Ye Li, Panqu Wang, Yanjun Huang\n  - Publish Date: 2025.12.18\n  - Task: Evaluation\n  - Summary：\n    - Proposes a closed-loop evaluation platform for end-to-end autonomous driving that generates adversarial interactions in real-world scenes to create safety-critical corner cases.\n    - The platform uses a flow matching-based real-world image generator and an adversarial traffic policy to efficiently model challenging interactions and evaluate models like UniAD and VAD.\n\n- [OccSTeP: Benchmarking 4D Occupancy Spatio-Temporal Persistence](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.15621)\n  - Yu Zheng, Jie Hu, Kailun Yang, Jiaming Zhang\n  - Publish Date: 2025.12.17\n  - Code: [OccSTeP](https:\u002F\u002Fgithub.com\u002FFaterYU\u002FOccSTeP)\n  - Task: Evaluation\n  - Summary：\n    - Introduces the concept of 4D Occupancy Spatio-Temporal Persistence (OccSTeP), addressing reactive forecasting (\"what will happen next\") and proactive forecasting (\"what would happen given a specific future action\").\n    - Proposes OccSTeP-WM, a tokenizer-free world model with a linear-complexity attention backbone and recurrent state-space module for robust, online inference even with missing or noisy historical input.\n    - Establishes a new OccSTeP benchmark with challenging scenarios, reporting significant gains in semantic mIoU (+6.56%) and occupancy IoU (+9.26%).\n\n- [OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.14044)\n  - Zhenguo Zhang, Haohan Zheng, Yishen Wang, Le Xu, Tianchen Deng, Xuefeng Chen, Qu Chen, Bo Zhang, Wuxiong Huang\n  - Publish Date: 2025.12.16\n  - Task: End-to-End\n  - Datasets: [DriveLMM-o1](https:\u002F\u002Fgithub.com\u002Fayesha-ishaq\u002FDriveLMM-o1)\n  - Summary：\n    - OmniDrive-R1, an end-to-end VLM framework for autonomous driving, introduces an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism to unify perception and reasoning.\n    - The core innovation is a Reinforcement-driven visual grounding capability, enabled by a two-stage RL pipeline and the Clip-GRPO algorithm, which uses an annotation-free, process-based grounding reward.\n    - The model significantly improves reasoning, raising the overall score from 51.77% to 80.35% and final answer accuracy from 37.81% to 73.62% on DriveLMM-o1 compared to the Qwen2.5VL-7B baseline.\n\n- [MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.13636)\n  - Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie, Bing Wang, Guang Chen, Dingkang Liang, Xiang Bai\n  - Publish Date: 2025.12.15\n  - Project Page: [MindDrive](https:\u002F\u002Fxiaomi-mlab.github.io\u002FMindDrive\u002F)\n  - Task: Planning\n  - Datasets: [Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)\n  - Summary：\n    - MindDrive, a Vision-Language-Action (VLA) framework for autonomous driving that uses online reinforcement learning to address limitations of Imitation Learning like distribution shift.\n    - It features an LLM with two LoRA parameter sets: a Decision Expert for reasoning and an Action Expert to map decisions to trajectories, enabling 
trial-and-error learning over discrete linguistic decisions.\n    - Achieves a Driving Score of 78.04 and Success Rate of 55.09% on the Bench2Drive benchmark using the lightweight Qwen-0.5B LLM.\n\n- [DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.12799)\n  - Zhe Liu, Runhui Huang, Rui Yang, Siming Yan, Zining Wang, Lu Hou, Di Lin, Xiang Bai, Hengshuang Zhao\n  - Publish Date: 2025.12.14\n  - Code: [DrivePI](https:\u002F\u002Fgithub.com\u002Fhappinesslz\u002FDrivePI)\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - DrivePI, a spatial-aware 4D MLLM serving as a unified Vision-Language-Action (VLA) framework for autonomous driving, jointly performs spatial understanding, 3D perception, prediction, and planning through end-to-end optimization.\n    - The method integrates point clouds, multi-view images, and language instructions, and uses a data engine to generate text-occupancy and text-flow QA pairs for 4D spatial understanding.\n    - With only a 0.5B Qwen2.5 backbone, DrivePI as a single model matches or exceeds existing VLA and specialized VA models in key metrics across nuScenes and OpenOcc benchmarks.\n\n- [From Human Intention to Action Prediction: Intention-Driven End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.12302)\n  - Huan Zheng, Yucheng Zhou, Tianyi Yan, Jiayi Su, Hongjun Chen, Dubing Chen, Xingtai Gui, Wencheng Han, Runzhou Tao, Zhongying Qiu, Jianfei Yang, Jianbing Shen\n  - Publish Date: 2025.12.13\n  - Task: End-to-End\n  - Datasets: [Intention-Drive](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.12302)\n  - Summary：\n    - Introduces Intention-Drive, a comprehensive benchmark and dataset for Intention-Driven End-to-End Autonomous Driving, designed to bridge the gap in interpreting high-level human intentions.\n    - Proposes the Imagined Future Alignment (IFA), a novel evaluation protocol using generative world models to assess semantic goal fulfillment beyond geometric accuracy.\n    - Explores two solution paradigms—an end-to-end vision-language planner and a hierarchical agent-based framework—which demonstrate superior alignment with human intentions compared to existing models.\n\n- [FutureX: Enhance End-to-End Autonomous Driving via Latent Chain-of-Thought World Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.11226)\n  - Hongbin Lin, Yiming Yang, Yifan Zhang, Chaoda Zheng, Jie Feng, Sheng Wang, Zhennan Wang, Shijia Chen, Boyang Wang, Yu Zhang, Xianming Liu, Shuguang Cui, Zhen Li\n  - Publish Date: 2025.12.12\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - FutureX, a CoT-driven pipeline that enhances end-to-end planners via future scene latent reasoning and trajectory refinement.\n    - Introduces an Auto-think Switch to decide between an Instant mode for simple scenes and a Thinking mode for complex reasoning via a Latent World Model.\n    - Demonstrates substantial performance gains, such as a 6.2 PDMS improvement for TransFuser on NAVSIM, by producing more rational plans with fewer collisions.\n\n- [Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.10947)\n  - Jiawei Yang, Ziyu Chen, Yurong You, Yan Wang, Yiming Li, Yuxiao Chen, Boyi Li, Boris Ivanovic, Marco Pavone, Yue Wang\n  - Publisher: Stanford 
University, NVIDIA\n  - Publish Date: 2025.12.11\n  - Project Page: [Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving](https:\u002F\u002Fjiawei-yang.github.io\u002FFlex\u002F)\n  - Task: End-to-End\n  - Summary：\n    - Presents Flex, an efficient and effective geometry-agnostic scene encoder for multi-camera data in end-to-end autonomous driving, using a small set of learnable scene tokens to jointly encode information across cameras and timesteps.\n    - Achieves 2.2x greater inference throughput and improved driving performance on a large-scale proprietary dataset compared to state-of-the-art methods, challenging the necessity of explicit 3D inductive biases like BEV or occupancy.\n    - Demonstrates that the compact scene tokens develop an emergent capability for scene decomposition without explicit supervision, offering a more scalable and data-driven path for future systems.\n\n- [SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.10719)\n  - Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, Andreas Zell\n  - Publisher: University of Tübingen\n  - Publish Date: 2025.12.11\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings instead of textual digit tokens for joint semantic and spatial reasoning.\n    - Employs a universal positional encoder for 3D coordinates from multi-view depth, historical ego-states, and text prompts, enhancing visual tokens and enabling direct trajectory coordinate regression.\n    - Achieves state-of-the-art open-loop performance on nuScenes and a Driving Score of 78.02 on the Bench2Drive closed-loop benchmark among VLM-based methods.\n\n- [Latent Chain-of-Thought World Modeling for End-to-End Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.10226)\n  - Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You, Yan Wang, Wenjie Luo, Yulong Cao, Philipp Krahenbuhl, Marco Pavone, Boris Ivanovic\n  - Publish Date: 2025.12.11\n  - Task: End-to-End\n  - Summary：\n    - Latent-CoT-Drive (LCDrive): a model that expresses chain-of-thought (CoT) reasoning in a latent language that captures possible outcomes of driving actions, unifying reasoning and decision making in an action-aligned latent space.\n    - The model reasons by interleaving action-proposal tokens and world model tokens, which are grounded in a learned latent world model to express future outcomes.\n    - The approach is cold-started with supervision from ground-truth future rollouts and then post-trained with closed-loop reinforcement learning, achieving faster inference and better trajectory quality on a large-scale end-to-end driving benchmark.\n\n- [UniUGP: Unifying Understanding, Generation, and Planning For End-to-end Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.09864)\n  - Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen, Yangang Zhang, Ying-Cong Chen\n  - Publish Date: 2025.12.10\n  - Project Page: [UniUGP](https:\u002F\u002Fseed-uniugp.github.io\u002F)\n  - Task: End-to-End\n  - Summary：\n    - Proposes UniUGP, a unified Understanding-Generation-Planning framework that synergizes scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture.\n    - Introduces a four-stage training strategy 
across multiple AD datasets to progressively build capabilities, demonstrating state-of-the-art performance and superior generalization to long-tail scenarios.\n\n- [COVLM-RL: Critical Object-Oriented Reasoning for Autonomous Driving Using VLM-Guided Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.09349)\n  - Lin Li, Yuxin Cai, Jianwu Fang, Jianru Xue, Chen Lv\n  - Publish Date: 2025.12.10\n  - Task: End-to-End\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - COVLM-RL, a novel end-to-end driving framework integrating Critical Object-oriented reasoning with VLM-guided Reinforcement Learning to address generalization, efficiency, and interpretability.\n    - Introduces a Chain-of-Thought prompting strategy for the VLM to reason over critical traffic elements, generating semantic decision priors that reduce input dimensionality and inject task knowledge into RL.\n    - Proposes a consistency loss to align the VLM's semantic plans with the RL agent's continuous control outputs, improving interpretability and training stability.\n\n- [Astra: General Interactive World Model with Autoregressive Denoising](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.08931)\n  - Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, Jiwen Lu\n  - Publish Date: 2025.12.09\n  - Task: Generation\n  - Summary：\n    - Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions.\n    - Introduces an autoregressive denoising architecture with temporal causal attention and a noise-augmented history memory to balance responsiveness with temporal coherence.\n    - Proposes an action-aware adapter and a mixture of action experts to enable precise control and versatility across tasks like exploration, manipulation, and camera control.\n\n- [A Multi-Agent LLM Framework for Design Space Exploration in Autonomous Driving Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.08476)\n  - Po-An Shih, Shao-Hua Wang, Yung-Che Li, Chia-Heng Tu, Chih-Han Chang\n  - Publish Date: 2025.12.09\n  - Task: Planning\n  - Summary：\n    - A multi-agent, large language model (LLM)-based design space exploration (DSE) framework for autonomous driving systems, integrating multi-modal reasoning with 3D simulation and profiling tools.\n    - Automates the interpretation of execution outputs and guides system design exploration using specialized LLM agents for user input, design generation, execution orchestration, and output analysis.\n    - Evaluated on a robotaxi case study, the framework identifies more Pareto-optimal, cost-efficient solutions with reduced navigation time compared to a genetic algorithm baseline under the same exploration budget.\n\n- [Spatial Retrieval Augmented Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.06865)\n  - Xiaosong Jia, Chenhe Zhang, Yule Jiang, Songbur Wong, Zhiyuan Zhang, Chen Chen, Shaofeng Zhang, Xuanhe Zhou, Xue Yang, Junchi Yan, Yu-Gang Jiang\n  - Publish Date: 2025.12.07\n  - Task: End-to-End\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Proposes a spatial retrieval paradigm for autonomous driving, using offline retrieved geographic images (e.g., from Google Maps) as an additional input to enhance perception under poor visibility.\n    - Extends the nuScenes dataset with aligned geographic images and establishes baselines across five core tasks: 
detection, mapping, occupancy prediction, planning, and world modeling.\n    - Demonstrates the plug-and-play nature of the approach, showing performance enhancements for certain tasks without requiring additional sensors.\n\n- [WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.11872)\n  - Mingwang Xu, Jiahao Cui, Feipeng Cai, Hanlin Shang, Zhihao Zhu, Shan Luan, Yifang Xu, Neng Zhang, Yaoyi Li, Jia Cai, Siyu Zhu\n  - Publisher: Fudan University\n  - Publish Date: 2025.12.06\n  - Project Page: [WAM-Diff](https:\u002F\u002Fgithub.com\u002Ffudan-generative-vision\u002FWAM-Diff)\n  - Code: [WAM-Diff](https:\u002F\u002Fgithub.com\u002Ffudan-generative-vision\u002FWAM-Diff)\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - WAM-Diff, a Vision-Language-Action (VLA) framework that employs masked diffusion to iteratively refine a discrete sequence for future ego-trajectory generation in autonomous driving.\n    - Features three key innovations: adaptation of masked diffusion for flexible decoding, scalable capacity via a sparse Mixture of Experts (MoE) trained on motion prediction and VQA, and online reinforcement learning using Group Sequence Policy Optimization (GSPO).\n    - Achieves state-of-the-art performance on NAVSIM benchmarks (91.0 PDMS on v1, 89.7 EPDMS on v2), demonstrating masked diffusion as a promising alternative to autoregressive and continuous diffusion policies.\n\n- [Are AI-Generated Driving Videos Ready for Autonomous Driving? A Diagnostic Evaluation Framework](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.06376)\n  - Xinhao Xiang, Abhijeet Rastogi, Jiawei Zhang\n  - Publish Date: 2025.12.06\n  - Task: Generation\n  - Datasets: [ADGV-Bench](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.06376)\n  - Summary：\n    - Presents a diagnostic framework to evaluate the reliability of AI-generated driving videos (AIGVs) for training and evaluating autonomous driving models.\n    - Introduces ADGV-Bench, a benchmark with human annotations and dense labels for perception tasks, and ADGVE, a driving-aware evaluator for scoring video quality.\n    - Shows that filtering raw AIGVs with the proposed evaluator improves video quality metrics and downstream AD model performance, turning AIGVs into a beneficial data complement.\n\n- [WAM-Flow: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.06112)\n  - Yifang Xu, Jiahao Cui, Feipeng Cai, Zhihao Zhu, Hanlin Shang, Shan Luan, Mingwang Xu, Neng Zhang, Yaoyi Li, Jia Cai, Siyu Zhu\n  - Publisher: Fudan University, Yinwang\n  - Publish Date: 2025.12.05\n  - Code: [WAM-Flow](https:\u002F\u002Fgithub.com\u002Ffudan-generative-vision\u002FWAM-Flow)\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim), [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - Summary:\n    - The paper introduces the first discrete flow matching architecture for autonomous driving, which achieves low-latency, parallel, and coarse-to-fine motion planning.\n\n- [BeLLA: End-to-End Birds Eye View Large Language Assistant for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.06096)\n  - Karthik Mohan, Sonam Singh, Amit Arvind Kale\n  - Publish Date: 2025.12.05\n  - Task: VQA\n  - Datasets: 
[NuScenes-QA](https:\u002F\u002Fwww.nuscenes.org\u002F), [DriveLM](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM)\n  - Summary：\n    - BeLLA, an end-to-end architecture connecting unified 360° BEV representations with a large language model for question answering in autonomous driving.\n    - Outperforms existing approaches on tasks requiring spatial reasoning, such as relative object positioning and behavioral understanding, achieving up to +9.3% absolute improvement.\n    - Evaluated on NuScenes-QA and DriveLM benchmarks, demonstrating competitive performance across a diverse range of questions.\n\n- [From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.05277)\n  - Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari\n  - Publish Date: 2025.12.04\n  - Project Page: [TAD](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fvbdai\u002FTAD)\n  - Code: [TAD](https:\u002F\u002Fgithub.com\u002Fvbdi\u002Ftad_bench)\n  - Task: Reasoning\n  - Datasets: [TAD](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fvbdai\u002FTAD)\n  - Summary：\n    - Introduces the Temporal Understanding in Autonomous Driving (TAD) benchmark, comprising nearly 6,000 QA pairs across 7 tasks, to evaluate VLMs on ego-centric driving footage.\n    - Benchmarks 9 generalist and specialist models, finding substandard accuracy due to imperfect fine-grained motion understanding.\n    - Proposes two training-free solutions, Scene-CoT and TCogMap, which improve average accuracy on TAD by up to 17.72% when integrated with existing VLMs.\n\n- [E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.04733)\n  - Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu, Kehua Chen, Wei Ma, Zhenning Li, Lijun Sun, Chengzhong Xu\n  - Publish Date: 2025.12.04\n  - Task: End-to-End\n  - Summary：\n    - Proposes E3AD, an emotion-aware Vision-Language-Action (VLA) framework for Open-Domain End-to-End (OD-E2E) autonomous driving, which interprets free-form commands, infers passenger emotion, and plans trajectories.\n    - Introduces a continuous Valence-Arousal-Dominance (VAD) emotion model and a dual-pathway spatial reasoning module for enhanced semantic understanding and human-like spatial cognition.\n    - Employs a consistency-oriented training scheme combining modality pretraining with preference-based alignment, achieving state-of-the-art results in emotion estimation and improved visual grounding and planning.\n\n- [dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.04459)\n  - Yingzi Ma, Yulong Cao, Wenhao Ding, Shuibai Zhang, Yan Wang, Boris Ivanovic, Ming Jiang, Marco Pavone, Chaowei Xiao\n  - Publish Date: 2025.12.04\n  - Task: End-to-End\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [WOD-E2E](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Summary：\n    - dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving.\n    - Introduces a controllable and reliable framework using discrete diffusion with bidirectional attention, addressing consistency and controllability issues in autoregressive VLMs.\n    - Achieves improved behavior-trajectory 
consistency and planning performance on long-tail scenarios, outperforming AR-based baselines.\n\n- [MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.04441)\n  - Bin Sun, Yaoguang Cao, Yan Wang, Rui Wang, Jiachen Shang, Xiejie Feng, Jiayi Lu, Jia Shi, Shichun Yang, Xiaoyu Yan, Ziying Song\n  - Publish Date: 2025.12.04\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - MindDrive, a harmonized framework integrating high-quality trajectory generation with comprehensive decision reasoning for End-to-End Autonomous Driving.\n    - Proposes a structured reasoning paradigm of \"context simulation - candidate generation - multi-objective trade-off\", featuring a Future-aware Trajectory Generator (FaTG) and a VLM-oriented Evaluator (VLoE).\n    - Achieves state-of-the-art performance on NAVSIM benchmarks, enhancing safety, compliance, and generalization through reasoned, human-aligned decision making.\n\n- [Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.03454)\n  - Haicheng Liao, Huanming Shen, Bonan Wang, Yongkang Li, Yihong Tang, Chengyue Wang, Dingyi Zhuang, Kehua Chen, Hai Yang, Chengzhong Xu, Zhenning Li\n  - Publish Date: 2025.12.03\n  - Task: Perception\n  - Datasets: [Talk2Car](https:\u002F\u002Fgithub.com\u002Ftalk2car\u002FTalk2Car)\n  - Summary：\n    - Proposes ThinkDeeper, a framework for visual grounding in autonomous driving that uses a Spatial-Aware World Model (SA-WM) to reason about future spatial states for disambiguating context-dependent instructions.\n    - Introduces DrivePilot, a multi-source visual grounding dataset for autonomous driving, with semantic annotations generated via a RAG and Chain-of-Thought prompted LLM pipeline.\n    - Demonstrates state-of-the-art performance, ranking #1 on the Talk2Car leaderboard and outperforming baselines on DrivePilot, MoCAD, and RefCOCO benchmarks, with strong robustness in challenging scenes.\n\n- [Driving Beyond Privilege: Distilling Dense-Reward Knowledge into Sparse-Reward Policies](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.04279)\n  - Feeza Khan Khanzada, Jaerock Kwon\n  - Publisher: University of Michigan-Dearborn\n  - Publish Date: 2025.12.03\n  - Task: Planning\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - Proposes reward-privileged world model distillation, a two-stage framework where a teacher agent is trained with dense privileged rewards and its latent dynamics are distilled into a student trained solely on sparse task rewards.\n    - Demonstrates that sparse-reward students outperform dense-reward teachers and sparse-from-scratch baselines in CARLA lane-following and overtaking benchmarks, improving generalization on unseen routes.\n\n- [U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.02982)\n  - Xiang Xu, Ao Liang, Youquan Liu, Linfeng Li, Lingdong Kong, Ziwei Liu, Qingshan Liu\n  - Publisher: S-Lab, Nanyang Technological University\n  - Publish Date: 2025.12.02\n  - Task: Perception\n  - Datasets: [Waymo Open Dataset](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Summary：\n    - U4D, an uncertainty-aware framework for 4D LiDAR world modeling that estimates spatial uncertainty to localize semantically challenging regions.\n    - It performs 
generation in a \"hard-to-easy\" manner through two stages: uncertainty-region modeling and uncertainty-conditioned completion.\n    - Incorporates a mixture of spatio-temporal (MoST) block to ensure temporal coherence, producing geometrically faithful and temporally consistent LiDAR sequences.\n\n- [Lumos: Let there be Language Model System Certification](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.02966)\n  - Isha Chaudhary, Vedaant Jain, Avaljot Singh, Kavya Sachdeva, Sayan Ranu, Gagandeep Singh\n  - Publish Date: 2025.12.02\n  - Task: Reasoning\n  - Summary：\n    - Introduces Lumos, the first principled framework for specifying and formally certifying Language Model System (LMS) behaviors, using an imperative probabilistic programming DSL over graphs.\n    - Demonstrates Lumos by specifying the first safety specifications for vision-language models (VLMs) in autonomous driving, revealing critical safety failures in a state-of-the-art VLM.\n    - Provides a modular and extensible language-based framework that can encode complex relational and temporal specifications, enabling LMS certification to adapt to evolving threats.\n\n- [VLM as Strategist: Adaptive Generation of Safety-critical Testing Scenarios via Guided Diffusion](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.02844)\n  - Xinzheng Wu, Junyi Chen, Naiting Zhong, Yong Shen\n  - Publish Date: 2025.12.02\n  - Task: Generation\n  - Summary：\n    - Proposes a safety-critical testing scenario generation framework integrating Vision Language Models (VLMs) with adaptive guided diffusion models for autonomous driving systems.\n    - Establishes a three-layer hierarchical architecture: a VLM strategic layer for objective determination, a tactical layer for guidance formulation, and an operational layer for guided diffusion execution.\n    - Introduces an adaptive guided diffusion method enabling real-time, precise control of background vehicles in closed-loop simulation for generating realistic, diverse, and interactive safety-critical scenarios.\n\n- [OpenREAD: Reinforced Open-Ended Reasoning for End-to-End Autonomous Driving with LLM-as-Critic](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.01830)\n  - Songyan Zhang, Wenhui Huang, Zhan Chen, Chua Jiahao Collister, Qihang Huang, Chen Lv\n  - Publish Date: 2025.12.01\n  - Task: End-to-End\n  - Summary：\n    - OpenREAD, an OPEN-ended REasoning reinforced vision-language model (VLM)-based autonomous driving (AD) framework that enables end-to-end reinforcement fine-tuning (RFT) from high-level reasoning to low-level trajectory planning.\n    - Proposes using a large language model (LLM) as a critic in RFT to quantify reasoning quality for open-ended questions, addressing the challenge of reward modeling for scene understanding.\n    - Constructs large-scale Chain-of-Thought (CoT) annotations on driving knowledge datasets and shows that joint end-to-end RFT yields substantial improvements in both upstream and downstream tasks, achieving state-of-the-art performance on reasoning and planning benchmarks.\n\n- [RoboDriveVLM: A Novel Benchmark and Baseline towards Robust Vision-Language Models for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.01300)\n  - Dacheng Liao, Mengshi Qi, Peng Shu, Zhining Zhang, Yuxin Lin, Liang Liu, Huadong Ma\n  - Publish Date: 2025.12.01\n  - Task: Evaluation\n  - Datasets: [RoboDriveBench](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.01300)\n  - Summary：\n    - Introduces RoboDriveBench, the first robustness benchmark for VLM-based 
end-to-end autonomous driving, evaluating 11 simulated scenarios with sensor and prompt corruptions.\n    - Proposes RoboDriveVLM, a novel VLM-based framework that enhances robustness by mapping multimodal data (e.g., lidar, radar) into a unified latent space.\n    - Introduces a Test-Time Adaptation method based on cross-modal knowledge distillation to improve system robustness for real-world deployment.\n\n- [CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.22532)\n  - Zhaohui Wang, Tengbo Yu, Hao Tang\n  - Publish Date: 2025.11.27\n  - Task: Reasoning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - CoT4AD, a novel Vision-Language-Action (VLA) framework that introduces explicit Chain-of-Thought (CoT) reasoning to enhance numerical and causal reasoning for autonomous driving.\n    - It integrates a perception-question-prediction-action CoT during training to align reasoning and action spaces, and performs implicit CoT reasoning during inference for robust decision-making.\n    - Demonstrates state-of-the-art performance in open-loop and closed-loop evaluations on benchmarks like nuScenes and Bench2Drive.\n\n- [RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.22466)\n  - Xiyan Liu, Han Wang, Yuhu Wang, Junjie Cai, Zhe Cao, Jianzhong Yang, Zhen Lu\n  - Publish Date: 2025.11.27\n  - Project Page: [RoadSceneBench](https:\u002F\u002Fgithub.com\u002FXiyanLiu\u002FRoadSceneBench)\n  - Code: [RoadSceneBench](https:\u002F\u002Fgithub.com\u002FXiyanLiu\u002FRoadSceneBench)\n  - Task: Evaluation\n  - Datasets: [RoadSceneBench](https:\u002F\u002Fgithub.com\u002FXiyanLiu\u002FRoadSceneBench)\n  - Summary：\n    - RoadSceneBench, a lightweight benchmark for evaluating visual reasoning of mid-level road semantics, focusing on relational understanding and structural consistency.\n    - Proposes Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T), a training framework for VLMs to enhance spatial coherence and semantic alignment in reasoning.\n\n- [OpenTwinMap: An Open-Source Digital Twin Generator for Urban Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.21925)\n  - Alex Richardson, Jonathan Sprinkle\n  - Publisher: University of Arizona\n  - Publish Date: 2025.11.26\n  - Task: Generation\n  - Datasets: [OpenStreetMap](https:\u002F\u002Fwww.openstreetmap.org)\n  - Summary：\n    - OpenTwinMap, an open-source, Python-based framework for generating high-fidelity 3D urban digital twins from LiDAR scans and OpenStreetMap data.\n    - The framework emphasizes extensibility and parallelization, producing semantically segmented static environment assets for export into Unreal Engine for AV simulation.\n    - It aims to lower the barrier for researchers by providing a flexible and scalable pipeline, with current capabilities including OSM\u002FLiDAR preprocessing, road mesh\u002Fterrain generation, and preliminary CARLA integration.\n\n- [LaGen: Towards Autoregressive LiDAR Scene Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.21256)\n  - Sizhuo Zhou, Xiaosong Jia, Fanrui Zhang, Junjie Li, Juyong Zhang, Yukang Feng, Jianwen Sun, Songbur Wong, Junqi You, Junchi Yan\n  - Publish Date: 2025.11.26\n  - Task: Generation\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - LaGen, the first framework for 
frame-by-frame autoregressive generation of long-horizon LiDAR scenes, starting from a single-frame LiDAR input.\n    - Introduces a scene decoupling estimation module to enhance interactive object-level generation and a noise modulation module to mitigate long-horizon error accumulation.\n    - Constructs an evaluation protocol on nuScenes, demonstrating superior performance over state-of-the-art LiDAR generation and prediction models, especially on later frames.\n\n- [4DWorldBench: A Comprehensive Evaluation Framework for 3D\u002F4D World Generation Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.19836)\n  - Yiting Lu, Wei Luo, Peiyan Tu, Haoran Li, Hanxin Zhu, Zihao Yu, Xingrui Wang, Xinyi Chen, Xinge Peng, Xin Li, Zhibo Chen\n  - Publisher: University of Science and Technology of China\n  - Publish Date: 2025.11.25\n  - Project Page: [4DWorldBench](https:\u002F\u002Fyeppp27.github.io\u002F4DWorldBench.github.io\u002F)\n  - Task: Reasoning\n  - Datasets: [4DWorldBench](https:\u002F\u002Fyeppp27.github.io\u002F4DWorldBench.github.io\u002F)\n  - Summary：\n    - Introduces 4DWorldBench, a unified benchmark for evaluating 3D\u002F4D World Generation Models across four key dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency.\n    - Proposes an adaptive evaluation framework that maps multi-modal conditions into a unified textual space and integrates LLM-as-judge, MLLM-as-judge, and network-based methods for comprehensive assessment.\n\n- [AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.20325)\n  - Tianyi Yan, Tao Tang, Xingtai Gui, Yongkang Li, Jiasen Zhesng, Weiyao Huang, Lingdong Kong, Wencheng Han, Xia Zhou, Xueyang Zhang, Yifei Zhan, Kun Zhan, Cheng-zhong Xu, Jianbing Shen\n  - Publish Date: 2025.11.25\n  - Task: End-to-End\n  - Summary：\n    - Introduces a framework for post-training policy refinement for autonomous driving using an Impartial World Model, designed to overcome optimistic bias in world models used for RL.\n    - Proposes a novel Counterfactual Synthesis data pipeline to generate a curriculum of plausible collisions and off-road events, teaching the model to honestly forecast danger.\n    - Demonstrates that the framework, using the Impartial World Model as an internal critic, significantly reduces safety violations in simulation and outperforms baselines on a new Risk Foreseeing Benchmark.\n\n- [Learning from Risk: LLM-Guided Generation of Safety-Critical Scenarios with Prior Knowledge](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.20726)\n  - Yuhang Wang, Heye Huang, Zhenhua Xu, Kailai Sun, Baoshen Guo, Jinhua Zhao\n  - Publish Date: 2025.11.25\n  - Task: Generation\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F), [SMARTS](https:\u002F\u002Fgithub.com\u002Fhuawei-noah\u002FSMARTS)\n  - Summary：\n    - A high-fidelity scenario generation framework integrating a conditional variational autoencoder (CVAE) with a large language model (LLM) for autonomous driving safety validation.\n    - The CVAE learns latent traffic structures to generate physically consistent base scenarios, while the LLM acts as an adversarial reasoning engine to guide generation across varying risk levels.\n    - The framework increases coverage of high-risk and long-tail events, improves consistency with real-world traffic, and exposes autonomous systems to more challenging interactions than rule- or data-driven methods.\n\n- [DeeAD: Dynamic 
Early Exit of Vision-Language Action for Efficient Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.20720)\n  - Haibo HU, Lianming Huang, Nan Guan, Chun Jason Xue\n  - Publish Date: 2025.11.25\n  - Task: Planning\n  - Datasets: [Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)\n  - Summary：\n    - DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories against lightweight planning priors.\n    - Introduces a multi-hop controller to adaptively skip redundant transformer layers based on score change rates, achieving up to 28% layer sparsity and 29% latency reduction.\n    - Integrates into existing VLA models (e.g., ORION) without retraining, preserving planning quality and safety on the Bench2Drive benchmark.\n\n- [CoC-VLA: Delving into Adversarial Domain Transfer for Explainable Autonomous Driving via Chain-of-Causality Visual-Language-Action Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.19914)\n  - Dapeng Zhang, Fei Shen, Rui Zhao, Yinda Chen, Peng Zhi, Chenyang Li, Rui Zhou, Qingguo Zhou\n  - Publish Date: 2025.11.25\n  - Task: End-to-End\n  - Summary：\n    - Proposes CoC-VLA, a VLM-guided, end-to-end adversarial transfer framework for autonomous driving that transfers long-tail handling capabilities from simulation to real-world deployment.\n    - The framework comprises a teacher VLM, a student VLM, and a discriminator, utilizing a shared Chain-of-Causality Visual-Language Model (CoC VLM) base architecture for chain-of-thought reasoning.\n    - Introduces a novel adversarial training strategy with a discriminator to facilitate the transfer of capabilities from simulated to real-world environments.\n\n- [Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.19912)\n  - Dapeng Zhang, Zhenlong Yuan, Zhangquan Chen, Chih-Ting Liao, Yinda Chen, Fei Shen, Qingguo Zhou, Tat-Seng Chua\n  - Publish Date: 2025.11.25\n  - Task: End-to-End\n  - Datasets: [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F), [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [KITTI](http:\u002F\u002Fwww.cvlibs.net\u002Fdatasets\u002Fkitti\u002F), [Argoverse](https:\u002F\u002Fwww.argoverse.org\u002F), [BDD100K](https:\u002F\u002Fopendatalab.org.cn\u002FOpenDataLab\u002FBDD100K), [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - A general and fast Vision-Language-Action (VLA) framework for autonomous driving that uses learnable action queries interacting with reasoning-enhanced features to generate continuous trajectories in parallel.\n    - Consolidates eight public autonomous driving datasets into a standardized, Chain-of-Thought reasoning-based format for training.\n    - Achieves state-of-the-art performance, superior generalization, and excellent inference speed through supervised learning and reinforcement learning fine-tuning.\n\n- [Map-World: Masked Action planning and Path-Integral World Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.20156)\n  - Bin Hu, Zijian Lu, Haicheng Liao, Chengran Yuan, Bin Rao, Yongkang Li, Guofa Li, Zhiyong Cui, Cheng-zhong Xu, Zhenning Li\n  - Publisher: Huazhong University of Science and Technology, Xiaomi EV\n  - Publish Date: 2025.11.25\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - MAP-World, a prior-free 
multi-modal planning framework that couples masked action planning with a path-weighted world model for autonomous driving.\n    - The Masked Action Planning (MAP) module treats future ego motion as masked sequence completion, generating diverse, temporally consistent trajectories without anchor libraries or teacher policies.\n    - A lightweight world model rolls out future BEV semantics for each candidate, and training uses trajectory probabilities as path weights to learn from the full distribution of plausible futures, achieving state-of-the-art performance on NAVSIM.\n\n- [Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.19221)\n  - Jianhua Han, Meng Tian, Jiangtong Zhu, Fan He, Huixin Zhang, Sitong Guo, Dechang Zhu, Hao Tang, Pei Xu, Yuze Guo, Minzhe Niu, Haojie Zhu, Qichao Dong, Xuechao Yan, Siyuan Dong, Lu Hou, Qingqiu Huang, Xiaosong Jia, Hang Xu\n  - Publish Date: 2025.11.24\n  - Task: End-to-End\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Percept-WAM, a perception-enhanced World-Awareness-Action Model that implicitly integrates 2D\u002F3D scene understanding within a single vision-language model (VLM) for robust end-to-end autonomous driving.\n    - Introduces World-PV and World-BEV tokens to unify 2D\u002F3D perception tasks, and a grid-conditioned prediction mechanism with IoU-aware scoring and parallel autoregressive decoding for improved stability in long-tail and complex scenarios.\n    - Achieves strong performance on perception benchmarks (e.g., 51.7\u002F58.9 mAP on COCO, nuScenes BEV detection) and improves planning performance on nuScenes and NAVSIM, surpassing prior methods like DiffusionDrive.\n\n- [Thinking Ahead: Foresight Intelligence in MLLMs and World Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.18735)\n  - Zhantao Gong, Liaoyuan Fan, Qing Guo, Xun Xu, Xulei Yang, Shijie Li\n  - Publish Date: 2025.11.24\n  - Task: VQA\n  - Datasets: [FSU-QA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.18735)\n  - Summary：\n    - Introduces Foresight Intelligence as the capability to anticipate future events and presents FSU-QA, a new Visual Question-Answering dataset designed to evaluate this ability.\n    - Conducts a comprehensive study showing current Vision-Language Models struggle with foresight reasoning, and demonstrates FSU-QA can effectively enhance this capability through fine-tuning.\n\n- [GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.18729)\n  - Lin Liu, Caiyan Jia, Guanyi Yu, Ziying Song, JunQiao Li, Feiyang Jia, Peiliang Wu, Xiaoshuai Hao, Yandan Luo\n  - Publish Date: 2025.11.24\n  - Code: [GuideFlow](https:\u002F\u002Fgithub.com\u002Fliulin815\u002FGuideFlow)\n  - Task: Planning\n  - Datasets: [Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F), [NuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim), [ADV-NuScenes]()\n  - Summary：\n    - GuideFlow, a novel planning framework for End-to-End Autonomous Driving that leverages Constrained Flow Matching to explicitly model the flow matching process, mitigating multimodal trajectory mode collapse.\n    - The framework directly enforces explicit constraints within the flow matching generation process 
and unifies training with an Energy-Based Model (EBM) to robustly satisfy physical constraints.\n    - GuideFlow parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style, and achieves state-of-the-art results on benchmarks like NavSim.\n\n- [QuickLAP: Quick Language-Action Preference Learning for Autonomous Driving Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.17855)\n  - Jordan Abi Nader, David Lee, Nathaniel Dennler, Andreea Bobu\n  - Publisher: MIT-CLEAR-Lab\n  - Publish Date: 2025.11.22\n  - Code: [QuickLAP](https:\u002F\u002Fgithub.com\u002FMIT-CLEAR-Lab\u002FQuickLAP)\n  - Task: Planning\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - QuickLAP, a Bayesian framework that fuses physical and language feedback to infer reward functions in real time for autonomous driving agents.\n    - The method uses LLMs to extract reward feature attention masks from language, integrating them with physical corrections via a closed-form update rule for fast, robust reward learning.\n    - In a semi-autonomous driving simulator, QuickLAP reduces reward learning error by over 70% compared to baselines and is preferred by users for understandability and collaborative behavior.\n\n- [LiSTAR: Ray-Centric World Models for 4D LiDAR Sequences in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.16049)\n  - Pei Liu, Songtao Wang, Lang Zhang, Xingyue Peng, Yuandong Lyu, Jiaxin Deng, Songxin Lu, Weiliang Ma, Xueyang Zhang, Yifei Zhan, XianPeng Lang, Jun Ma\n  - Publish Date: 2025.11.20\n  - Project Page: [LiSTAR](https:\u002F\u002Focean-luna.github.io\u002FLiSTAR.gitub.io)\n  - Task: Prediction\n  - Datasets: [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen)\n  - Summary：\n    - LiSTAR, a novel generative world model for synthesizing high-fidelity and controllable 4D LiDAR data, operating directly on the sensor's native geometry.\n    - Introduces a Hybrid-Cylindrical-Spherical (HCS) representation to preserve data fidelity and a Spatio-Temporal Attention with Ray-Centric Transformer (START) for robust temporal coherence.\n    - Proposes a 4D point cloud-aligned voxel layout and a discrete Masked Generative START (MaskSTART) framework for efficient, high-resolution, and layout-guided compositional generation.\n\n- [Is Your VLM for Autonomous Driving Safety-Ready? 
A Comprehensive Benchmark for Evaluating External and In-Cabin Risks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.14592)\n  - Xianhui Meng, Yuchen Zhang, Zhijian Huang, Zheng Lu, Ziling Ji, Yaoyao Yin, Hongyuan Zhang, Guangfeng Jiang, Yandan Lin, Long Chen, Hangjun Ye, Li Zhang, Jun Liu, Xiaoshuai Hao\n  - Publish Date: 2025.11.18\n  - Task: VQA\n  - Datasets: [DSBench](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Introduces DSBench, the first comprehensive Driving Safety Benchmark to assess a Vision-Language Model's (VLM) awareness of both external environmental risks and in-cabin driving behavior safety in a unified manner.\n    - The benchmark covers 10 key categories and 28 sub-categories, revealing significant performance degradation of VLMs in complex safety-critical situations.\n    - Constructs a large dataset of 98K safety-focused instances, showing that fine-tuning on this data significantly enhances VLM safety performance for autonomous driving.\n\n- [Enhancing End-to-End Autonomous Driving with Risk Semantic Distillation from VLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.14499)\n  - Jack Qin, Zhitao Wang, Yinan Zheng, Keyu Chen, Yang Zhou, Yuanxin Zhong, Siyuan Cheng\n  - Publish Date: 2025.11.18\n  - Task: End-to-End\n  - Datasets: [Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)\n  - Summary：\n    - Introduces Risk Semantic Distillation (RSD), a framework that leverages Vision-Language Models (VLMs) to enhance End-to-End Autonomous Driving backbones by providing risk attention for key objects to improve generalization.\n    - Proposes RiskHead, a plug-in module that distills causal risk estimates from VLMs into Bird's-Eye-View (BEV) features to generate interpretable risk-attention maps, enabling richer spatial and risk representations.\n    - Demonstrates significant improvements in perception and planning on the Bench2Drive benchmark by aligning BEV features with human-like risk-aware driving behavior for complex and dynamic environments.\n\n- [Enhancing LLM-based Autonomous Driving with Modular Traffic Light and Sign Recognition](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.14391)\n  - Fabian Schmidt, Noushiq Mohammed Kayilan Abdul Nazar, Markus Enzweiler, Abhinav Valada\n  - Publisher: University of Stuttgart, Esslingen University of Applied Sciences\n  - Publish Date: 2025.11.18\n  - Code: [TLS-Assist](https:\u002F\u002Fgithub.com\u002Fiis-esslingen\u002FTLS-Assist)\n  - Task: Planning\n  - Datasets: [LangAuto](https:\u002F\u002Fgithub.com\u002Fwayveai\u002FLangAuto)\n  - Summary：\n    - Introduces TLS-Assist, a modular redundancy layer that augments LLM-based autonomous driving agents with explicit traffic light and sign recognition to enforce traffic rules.\n    - The plug-and-play framework converts detections into structured natural language messages injected into the LLM input, supporting both single-view and multi-view camera setups.\n    - Demonstrates relative driving performance improvements of up to 14% over LMDrive and 7% over BEVDriver on the LangAuto benchmark in CARLA, while reducing traffic infractions.\n\n- [CorrectAD: A Self-Correcting Agentic System to Improve End-to-end Planning in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.13297)\n  - Enhui Ma, Lijun Zhou, Tao Tang, Jiahuan Zhang, Junpeng Jiang, Zhan Zhang, Dong Han, Kun Zhan, Xueyang Zhang, XianPeng Lang, Haiyang Sun, Xia Zhou, Di Lin, Kaicheng Yu\n  - Publish Date: 2025.11.17\n  - Task: 
Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Proposes CorrectAD, a self-correcting agentic system to improve end-to-end planning by addressing the long-tail problem of rare, safety-critical failures.\n    - Introduces a PM-Agent to formulate data requirements and DriveSora, a generative model to create spatiotemporally consistent videos aligned with 3D layouts for data simulation.\n    - The model-agnostic pipeline corrects a significant portion of failure cases, reducing collision rates by 39% on nuScenes and 27% on an in-house dataset.\n\n- [VLMs Guided Interpretable Decision Making for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.13881)\n  - Xin Hu, Taotao Jing, Renran Tian, Zhengming Ding\n  - Publish Date: 2025.11.17\n  - Task: Planning\n  - Summary：\n    - Proposes a new approach that shifts the role of Vision-Language Models (VLMs) from direct decision generators to semantic enhancers for more reliable and interpretable autonomous driving.\n    - Introduces a multi-modal interactive architecture that fuses visual and linguistic features for accurate decision-making and textual explanations, along with a post-hoc refinement module using VLMs to enhance prediction reliability.\n    - Demonstrates state-of-the-art performance on two autonomous driving benchmarks, offering a promising direction for integrating VLMs into reliable AD systems.\n\n- [Prompt-Driven Domain Adaptation for End-to-End Autonomous Driving via In-Context RL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.12755)\n  - Aleesha Khurram, Amir Moeini, Shangtong Zhang, Rohan Chandra\n  - Publish Date: 2025.11.16\n  - Task: End-to-End\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - Proposes a few-shot prompt-driven domain adaptation method for closed-loop autonomous driving using in-context reinforcement learning (ICRL), requiring no model updates or additional data collection in the target domain.\n    - Extends prompt-driven DA to closed-loop driving by using general trajectories observed during inference, advancing beyond prior methods limited to perception tasks.\n    - Experiments in CARLA show ICRL yields safer, more efficient, and more comfortable driving policies in adverse weather compared to state-of-the-art prompt-driven DA baselines.\n\n- [Are LLMs The Way Forward? 
A Case Study on LLM-Guided Reinforcement Learning for Decentralized Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.12751)\n  - Timur Anvar, Jeffrey Chen, Yuyan Wang, Rohan Chandra\n  - Publish Date: 2025.11.16\n  - Task: Planning\n  - Summary：\n    - Investigates whether small, locally deployed LLMs (\u003C 14B parameters) can support autonomous highway driving through reward shaping for RL, rather than direct control.\n    - Presents a case study comparing RL-only, LLM-only, and hybrid LLM-RL approaches for decentralized autonomous driving in complex scenarios like dense highways and merges.\n    - Finds hybrid approaches fall between RL-only (moderate success) and LLM-only (high success but poor efficiency), with LLM-influenced methods showing a systematic conservative bias and model-dependent variability.\n\n- [VLA-R: Vision-Language Action Retrieval toward Open-World End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.12405)\n  - Hyunki Seong, Seongwoo Moon, Hojin Ahn, Jehun Kang, David Hyunchul Shim\n  - Publish Date: 2025.11.16\n  - Task: End-to-End\n  - Summary：\n    - VLA-R, an open-world end-to-end autonomous driving framework that integrates open-world perception with a novel vision-action retrieval paradigm.\n    - Leverages a frozen vision-language model for open-world detection\u002Fsegmentation and a Q-Former bottleneck to bridge perception and action domains.\n    - Introduces a vision-action contrastive learning scheme to align vision-language and action embeddings for effective open-world reasoning and action retrieval.\n\n- [FLAD: Federated Learning for LLM-based Autonomous Driving in Vehicle-Edge-Cloud Networks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.09025)\n  - Tianao Xiang, Mingjian Zhi, Yuanguo Bi, Lin Cai, Yuhao Chen\n  - Publisher: University of Victoria, University of Toronto\n  - Publish Date: 2025.11.12\n  - Task: End-to-End\n  - Summary：\n    - FLAD, a Federated Learning framework for LLM-based Autonomous Driving, designed to address challenges of high computation\u002Ftransmission costs and data privacy in collaborative model training.\n    - Introduces a cloud-edge-vehicle collaborative architecture, an intelligent parallelized training with communication scheduling, and a knowledge distillation method to personalize LLMs for heterogeneous edge data.\n    - Prototyped on a testbed with NVIDIA Jetsons, demonstrating efficient use of distributed vehicular resources and superior end-to-end AD performance.\n\n- [A Low-Rank Method for Vision Language Model Hallucination Mitigation in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.06496)\n  - Keke Long, Jiacheng Guo, Tianyun Zhang, Hongkai Yu, Xiaopeng Li\n  - Publish Date: 2025.11.09\n  - Task: Perception\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Proposes a self-contained low-rank method to automatically rank multiple VLM-generated captions by hallucination level using only the captions, without external references or model access.\n    - Constructs a sentence-embedding matrix, decomposes it into low-rank consensus and sparse residual, and uses residual magnitude to select the most hallucination-free caption.\n    - Achieves 87% selection accuracy on nuScenes, improves over baselines, shows strong correlation with human judgment, and reduces inference time by 51-67% for real-time application.\n\n- [VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous 
Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.06256)\n  - Ruifei Zhang, Wei Zhang, Xiao Tan, Sibei Yang, Xiang Wan, Xiaonan Luo, Guanbin Li\n  - Publish Date: 2025.11.09\n  - Code: [VLDrive](https:\u002F\u002Fgithub.com\u002FReaFly\u002FVLDrive)\n  - Task: End-to-End\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - VLDrive, a novel lightweight MLLM architecture for language-grounded autonomous driving, featuring enhanced vision components to address limitations in visual representation.\n    - It introduces compact visual tokens via cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation, and a distance-decoupled instruction attention mechanism for improved visual-linguistic learning.\n    - Achieves state-of-the-art driving performance in CARLA simulator while reducing parameters by 81% (from 7B to 1.3B), with substantial improvements in driving scores across various distances.\n\n- [AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.06253)\n  - Ruifei Zhang, Junlin Xie, Wei Zhang, Weikai Chen, Xiao Tan, Xiang Wan, Guanbin Li\n  - Publish Date: 2025.11.09\n  - Code: [AdaDrive](https:\u002F\u002Fgithub.com\u002FReaFly\u002FAdaDrive)\n  - Task: Planning\n  - Summary：\n    - AdaDrive, an adaptively collaborative slow-fast framework that optimally determines when and how LLMs contribute to decision-making for language-grounded autonomous driving.\n    - Introduces an adaptive activation loss to dynamically invoke the LLM only in complex scenarios and an adaptive fusion strategy to modulate a continuous, scaled LLM influence based on scene complexity.\n\n- [SAFe-Copilot: Unified Shared Autonomy Framework](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.04664)\n  - Phat Nguyen, Erfan Aasi, Shiva Sreeram, Guy Rosman, Andrew Silva, Sertac Karaman, Daniela Rus\n  - Publisher: Massachusetts Institute of Technology\n  - Publish Date: 2025.11.06\n  - Task: Planning\n  - Datasets: [Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)\n  - Summary：\n    - A unified shared autonomy framework that integrates human input and autonomous planners at a high level of abstraction using Vision Language Models (VLMs) to infer driver intent.\n    - The framework synthesizes coherent strategies to mediate between human and autonomous control, achieving strong alignment in human-subject surveys and improved performance on the Bench2Drive benchmark.\n\n- [Dynamic Model Selection for Trajectory Prediction via Pairwise Ranking and Meta-Features](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.00126)\n  - Lu Bowen\n  - Publish Date: 2025.10.31\n  - Task: Prediction\n  - Datasets: [nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - Summary：\n    - Proposes a dynamic multi-expert gating framework that adaptively selects the most reliable trajectory predictor among a physics-informed LSTM, a Transformer, and a fine-tuned GameFormer on a per-sample basis.\n    - Formulates trajectory expert selection as a pairwise-ranking problem over internal model signals (meta-features), optimizing decision quality without requiring post-hoc calibration.\n    - Evaluated on nuPlan-mini, the LLM-enhanced tri-expert gate achieves a 9.5% reduction in Final Displacement Error over GameFormer and demonstrates consistent improvements in open-loop simulations.\n\n- [Token Is All You Need: Cognitive Planning through Belief-Intent 
Co-Evolution](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.05540)\n  - Shiyao Sang\n  - Publish Date: 2025.10.30\n  - Task: Planning\n  - Datasets: [nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - Summary：\n    - Proposes that effective planning arises from the co-evolution of belief and intent within a minimal set of semantically rich tokens, challenging the need for exhaustive scene modeling.\n    - Demonstrates that sparse intent tokens achieve strong performance, and conditioning trajectory decoding on predicted future tokens yields a 21.6% improvement in ADE, showing performance emerges from cognitive planning.\n    - Observes the emergence of cognitive consistency and temporal fuzziness through training, establishing a new paradigm where intelligence lies in the tokenized duality of belief and intent.\n\n- [Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.00088)\n  - Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, Liang Feng, Greg Heinrich, Jack Huang, Peter Karkus, Boyi Li, Pinyi Li, Tsung-Yi Lin, Dongran Liu, Ming-Yu Liu, Langechuan Liu, Zhijian Liu, Jason Lu, Yunxiang Mao, Pavlo Molchanov, Lindsey Pavao, Zhenghao Peng, Mike Ranzinger, Ed Schmerling, Shida Shen, Yunfei Shi, Sarah Tariq, Ran Tian, Tilman Wekel, Xinshuo Weng, Tianjun Xiao, Eric Yang, Xiaodong Yang, Yurong You, Xiaohui Zeng, Wenyuan Zhang, Boris Ivanovic, Marco Pavone\n  - Publisher: NVIDIA\n  - Publish Date: 2025.10.30\n  - Code: [Alpamayo-R1](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Falpamayo)\n  - Task: Planning\n  - Summary：\n    - Alpamayo-R1 (AR1) is a vision-language-action model (VLA) that integrates Chain of Causation reasoning with trajectory planning for complex, long-tail driving scenarios.\n    - It introduces the Chain of Causation (CoC) dataset, built via a hybrid auto-labeling and human-in-the-loop pipeline, and a modular architecture combining a reasoning VLM with a diffusion-based trajectory decoder.\n    - The model uses a multi-stage training strategy with supervised fine-tuning and reinforcement learning, achieving improved planning accuracy and safety in simulation and real-time on-vehicle deployment.\n\n- [Beyond Imitation: Constraint-Aware Trajectory Generation with Flow Matching For End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.26292)\n  - Lin Liu, Guanyi Yu, Ziying Song, Junqiao Li, Caiyan Jia, Feiyang Jia, Peiliang Wu, Yandan Luo\n  - Publish Date: 2025.10.30\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Proposes CATG, a novel planning framework leveraging Constrained Flow Matching to address mode collapse in imitation learning and integrate safety\u002Fkinematic constraints directly into generation.\n    - Explicitly imposes constraints within the flow matching process and parameterizes driving aggressiveness as a control signal for trajectory style manipulation.\n    - Achieved 2nd place on the NavSim v2 challenge with an EPDMS score of 51.31 and received the Innovation Award.\n\n- [Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.24152)\n  - Aodi Wu, Xubo Luo\n  - Publisher: University of Chinese Academy of Sciences, Central South University\n  - Publish Date: 2025.10.28\n  - 
Code: [UCAS-CSU-phase2](https:\u002F\u002Fgithub.com\u002Fwuaodi\u002FUCAS-CSU-phase2)\n  - Task: VQA\n  - Datasets: RoboSense Challenge\n  - Summary：\n    - A systematic framework for autonomous driving scene understanding, built on a Mixture-of-Prompts router, task-specific prompts with spatial reasoning, a visual assembly module, and optimized inference parameters.\n    - Implemented on Qwen2.5-VL-72B, achieving 70.87% accuracy on clean data and 72.85% on corrupted data in the RoboSense Challenge at IROS 2025.\n\n- [Towards Physics-informed Spatial Intelligence with Human Priors: An Autonomous Driving Pilot Study](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.21160)\n  - Guanlin Wu, Boyan Su, Yang Zhao, Pu Wang, Yichen Lin, Hao Frank Yang\n  - Publish Date: 2025.10.24\n  - Task: Perception\n  - Datasets: [SIGBench](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.21160)\n  - Summary：\n    - Introduces Spatial Intelligence Grid (SIG), a structured, grid-based schema to explicitly encode object layouts, relations, and physically grounded priors for foundation-model reasoning in autonomous driving.\n    - Derives SIG-informed evaluation metrics to quantify a model's intrinsic Visual-Spatial Intelligence (VSI), separating spatial capability from language priors.\n    - Releases SIGBench, a benchmark of 1.4K driving frames annotated with ground-truth SIG labels and human gaze traces to support VSI tasks.\n\n- [Addressing Corner Cases in Autonomous Driving: A World Model-based Approach with Mixture of Experts and LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.21867)\n  - Haicheng Liao, Bonan Wang, Junxian Yang, Chengyue Wang, Zhengbin He, Guohui Zhang, Chengzhong Xu, Zhenning Li\n  - Publish Date: 2025.10.23\n  - Task: Prediction\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [NGSIM](https:\u002F\u002Fops.fhwa.dot.gov\u002Ftrafficanalysistools\u002Fngsim.htm), [HighD](https:\u002F\u002Fwww.highd-dataset.com\u002F), [MoCAD](https:\u002F\u002Fmocad-dataset.github.io\u002F)\n  - Summary：\n    - WM-MoE, a world model-based motion forecasting framework unifying perception, memory, and decision-making to address high-risk corner-case scenarios.\n    - Leverages LLMs with a lightweight temporal tokenizer for long-horizon reasoning and introduces a Mixture-of-Experts (MoE) to decompose complex corner cases.\n    - Introduces the nuScenes-corner benchmark and shows state-of-the-art performance across multiple datasets under corner-case and data-missing conditions.\n\n- [Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.19195)\n  - Kai Zeng, Zhanqian Wu, Kaixin Xiong, Xiaobao Wei, Xiangyu Guo, Zhenxin Zhu, Kalok Ho, Lijun Zhou, Bohan Zeng, Ming Lu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wentao Zhang\n  - Publisher: Huazhong University of Science and Technology, Xiaomi EV\n  - Publish Date: 2025.10.22\n  - Project Page: [Dream4Drive](https:\u002F\u002Fwm-research.github.io\u002FDream4Drive\u002F)\n  - Code: [Dream4Drive](https:\u002F\u002Fgithub.com\u002Fwm-research\u002FDream4Drive)\n  - Task: Perception\n  - Datasets: [DriveObj3D](https:\u002F\u002Fgithub.com\u002Fwm-research\u002FDream4Drive)\n  - Summary：\n    - Introduces Dream4Drive, a synthetic data generation framework that decomposes videos into 3D-aware guidance maps and renders 3D assets to produce edited, multi-view photorealistic videos for training perception models.\n    - Enables scalable generation of multi-view corner cases 
to significantly boost corner case perception in autonomous driving.\n    - Contributes a large-scale 3D asset dataset, DriveObj3D, covering typical driving scenario categories to enable diverse 3D-aware video editing.\n\n- [OmniNWM: Omniscient Driving Navigation World Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.18313)\n  - Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao, Wenjun Zeng, Xin Jin\n  - Publisher: Tsinghua University, Shanghai AI Laboratory, University of Chinese Academy of Sciences, Shanghai Jiao Tong University\n  - Publish Date: 2025.10.21\n  - Project Page: [OmniNWM](https:\u002F\u002Farlo0o.github.io\u002FOmniNWM\u002F)\n  - Task: Navigation\n  - Summary：\n    - OmniNWM, an omniscient panoramic navigation world model that addresses state, action, and reward dimensions within a unified framework for autonomous driving.\n    - It jointly generates panoramic videos of RGB, semantics, metric depth, and 3D occupancy, with a flexible forcing strategy for long-horizon auto-regressive generation.\n    - Introduces a normalized panoramic Plucker ray-map for precise trajectory control and leverages generated 3D occupancy to define rule-based dense rewards for driving compliance and safety.\n\n- [Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.19001)\n  - Seungjun Yu, Junsung Park, Youngsun Lim, Hyunjung Shim\n  - Publish Date: 2025.10.21\n  - Task: VQA\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - A two-phase vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions using a large multimodal LLM.\n    - The system conditions the model on multi-camera inputs, temporal history, and chain-of-thought prompts, enhanced by a self-consistency ensemble for reliability.\n    - Phase-2 augments prompts with scene metadata and task-specific instructions, significantly improving accuracy and demonstrating robustness under severe visual corruption.\n\n- [SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.18034)\n  - Roberto Brusnicki, David Pop, Yuan Gao, Mattia Piccinini, Johannes Betz\n  - Publish Date: 2025.10.20\n  - Task: Perception\n  - Summary：\n    - Introduces SAVANT, a structured reasoning framework for detecting anomalous driving scenarios through layered scene analysis and a two-phase pipeline of structured scene description extraction and multi-modal evaluation.\n    - Achieves high accuracy and recall on real-world driving scenarios, enabling a fine-tuned 7B open-source model to surpass proprietary models while enabling local, low-cost deployment.\n    - Addresses data scarcity by automatically labeling over 9,640 real-world images with high accuracy, providing a practical path for reliable semantic monitoring in autonomous systems.\n\n- [SimpleVSF: VLM-Scoring Fusion for Trajectory Prediction of End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.17191)\n  - Peiru Zheng, Yun Zhao, Zhan Gong, Hong Zhu, Shaohua Wu\n  - Publish Date: 2025.10.20\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - SimpleVSF, a novel framework that enhances end-to-end planning by leveraging the cognitive capabilities of Vision-Language Models (VLMs) and advanced trajectory fusion techniques.\n    
- Utilizes conventional scorers and novel VLM-enhanced scorers, with a robust weight fusioner for quantitative aggregation and a VLM-based fusioner for qualitative, context-aware decision-making.\n    - The leading approach in the ICCV 2025 NAVSIM v2 End-to-End Driving Challenge, demonstrating state-of-the-art performance in safety, comfort, and efficiency.\n\n- [DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.17148)\n  - Yu Gao, Anqing Jiang, Yiru Wang, Wang Jijun, Hao Jiang, Zhigang Sun, Heng Yuwen, Wang Shuo, Hao Zhao, Sun Hao\n  - Publish Date: 2025.10.20\n  - Task: End-to-End\n  - Summary：\n    - DiffVLA++, an enhanced autonomous driving framework that bridges cognitive reasoning and end-to-end planning through metric-guided alignment.\n    - Introduces a VLA module for semantically grounded trajectories, an E2E module for physical feasibility, and a metric-guided trajectory scorer to align their outputs.\n    - Achieves an EPDMS of 49.12 on the ICCV 2025 Autonomous Grand Challenge leaderboard.\n\n- [Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.16729)\n  - Jianbiao Mei, Yu Yang, Xuemeng Yang, Licheng Wen, Jiajun Lv, Botian Shi, Yong Liu\n  - Publish Date: 2025.10.19\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Proposes IR-WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world for vision-centric autonomous driving, avoiding full scene reconstruction.\n    - Introduces a residual prediction approach using BEV features as a temporal prior and an alignment module to mitigate error accumulation over time.\n    - Demonstrates that implicit future states from world models improve planning accuracy, achieving top performance on nuScenes for 4D occupancy forecasting and trajectory planning.\n\n- [Advancing Off-Road Autonomous Driving: The Large-Scale ORAD-3D Dataset and Comprehensive Benchmarks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.16500)\n  - Chen Min, Jilin Mei, Heng Zhai, Shuai Wang, Tong Sun, Fanjie Kong, Haoyang Li, Fangyuan Mao, Fuyang Liu, Shuo Wang, Yiming Nie, Qi Zhu, Liang Xiao, Dawei Zhao, Yu Hu\n  - Publish Date: 2025.10.18\n  - Code: [ORAD-3D](https:\u002F\u002Fgithub.com\u002Fchaytonmin\u002FORAD-3D)\n  - Task: Generation\n  - Datasets: [ORAD-3D](https:\u002F\u002Fgithub.com\u002Fchaytonmin\u002FORAD-3D)\n  - Summary：\n    - Presents ORAD-3D, the largest dataset for off-road autonomous driving, covering diverse terrains, weather conditions, and illumination levels.\n    - Establishes a comprehensive benchmark suite for five fundamental tasks: 2D free-space detection, 3D occupancy prediction, path planning, vision-language model-driven driving, and world modeling for off-road environments.\n\n- [VDRive: Leveraging Reinforced VLA and Diffusion Policy for End-to-end Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.15446)\n  - Ziang Guo, Zufeng Zhang\n  - Publish Date: 2025.10.17\n  - Task: End-to-End\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - VDRive, a novel pipeline for end-to-end autonomous driving that models state-action mapping for interpretable and robust decision making.\n    - Combines a Vision Language Action Model (VLA) for contextual state understanding with a generative diffusion policy-based action 
head for geometric action generation.\n    - Employs a reinforcement learning fine-tuning pipeline with an actor-critic framework, achieving state-of-the-art performance on Bench2Drive and nuScenes benchmarks.\n\n- [DriveCritic: Towards Context-Aware, Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.13108)\n  - Jingyu Song, Zhenxin Li, Shiyi Lan, Xinglong Sun, Nadine Chang, Maying Shen, Joshua Chen, Katherine A. Skinner, Jose M. Alvarez\n  - Publish Date: 2025.10.15\n  - Task: Planning\n  - Summary：\n    - Introduces DriveCritic, a novel framework for context-aware, human-aligned evaluation of autonomous driving planners, featuring a curated dataset of challenging scenarios annotated with human preferences and a Vision-Language Model (VLM) based evaluator.\n    - The DriveCritic model is fine-tuned using a two-stage supervised and reinforcement learning pipeline to adjudicate between trajectory pairs by integrating visual and symbolic context, significantly outperforming existing metrics in matching human preferences.\n\n- [DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.12796)\n  - Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, Lu Hou, Lue Fan, Zhaoxiang Zhang\n  - Publish Date: 2025.10.14\n  - Task: End-to-End\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Proposes DriveVLA-W0, a training paradigm using world modeling to predict future images, providing dense self-supervised signals to learn driving environment dynamics.\n    - Introduces a lightweight action expert for real-time inference, built on representations learned from world modeling.\n    - Demonstrates significant performance gains over baselines and shows the approach amplifies the data scaling law, with accelerating gains as training dataset size increases.\n\n- [CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.12560)\n  - Xiaoji Zheng, Ziyuan Yang, Yanhao Chen, Yuhang Peng, Yuanrong Tang, Gengyuan Liu, Bokui Chen, Jiangtao Gong\n  - Publish Date: 2025.10.14\n  - Code: [CoIRL-AD](https:\u002F\u002Fgithub.com\u002FSEU-zxj\u002FCoIRL-AD)\n  - Task: End-to-End\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Proposes CoIRL-AD, a competitive dual-policy framework that enables IL and RL agents to interact during training, moving beyond the conventional two-stage paradigm.\n    - Introduces a competition-based mechanism to facilitate knowledge exchange while preventing gradient conflicts.\n    - Experiments show an 18% reduction in collision rate compared to baselines, with stronger generalization and improved performance on long-tail scenarios.\n\n- [Flow Matching-Based Autonomous Driving Planning with Advanced Interactive Behavior Modeling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.11083)\n  - Tianyi Tan, Yinan Zheng, Ruiming Liang, Zexu Wang, Kexin Zheng, Jinliang Zheng, Jianxiong Li, Xianyuan Zhan, Jingjing Liu\n  - Publish Date: 2025.10.13\n  - Code: [Flow-Planner](https:\u002F\u002Fgithub.com\u002FDiffusionAD\u002FFlow-Planner)\n  - Task: Planning\n  - Datasets: [nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan), 
[interPlan](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Finterplan)\n  - Summary：\n    - Proposes Flow Planner, a framework for autonomous driving planning that addresses modeling interactive behaviors through innovations in data modeling, architecture, and learning.\n    - Introduces fine-grained trajectory tokenization and a specialized architecture for efficient temporal\u002Fspatial fusion to better capture interactive behaviors.\n    - Incorporates flow matching with classifier-free guidance for multi-modal behavior generation, dynamically reweighting agent interactions for coherent response strategies.\n\n- [Game-Theoretic Risk-Shaped Reinforcement Learning for Safe Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.10960)\n  - Dong Hu, Fenqing Hu, Lidong Yang, Chao Huang\n  - Publish Date: 2025.10.13\n  - Code: [GTR2L](https:\u002F\u002Fgithub.com\u002FDanielHu197\u002FGTR2L)\n  - Task: Planning\n  - Summary：\n    - Proposes a novel game-theoretic risk-shaped RL (GTR2L) framework for safe autonomous driving, incorporating a multi-level game-theoretic world model to predict interactive behaviors and risks.\n    - Features an adaptive rollout horizon based on predictive uncertainty and an uncertainty-aware barrier mechanism for flexible safety boundary modulation.\n    - Demonstrates superior performance in safety-critical scenarios, outperforming state-of-the-art baselines and human drivers in success rate, collision reduction, and efficiency.\n\n- [Align2Act: Instruction-Tuned Models for Human-Aligned Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.10503)\n  - Kanishkha Jaisankar, Sunidhi Tandel\n  - Publish Date: 2025.10.12\n  - Task: Planning\n  - Datasets: [nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - Summary：\n    - Align2Act, a motion planning framework that transforms instruction-tuned LLMs into interpretable planners aligned with human behavior, using structured driving instructions based on human reasoning and traffic rules.\n    - The Align2ActChain module guides step-by-step reasoning to produce an interpretable rationale and a safe trajectory, fine-tuned on LLaMA-2-7B with LoRA using the nuPlan dataset.\n    - Demonstrates improved planning quality and human-likeness on the real-world nuPlan closed-loop benchmark, with structured reasoning significantly improving performance over baseline LLM planners.\n\n- [LinguaSim: Interactive Multi-Vehicle Testing Scenario Generation via Natural Language Instruction Based on Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.08046)\n  - Qingyuan Shi, Qingwen Meng, Hao Cheng, Qing Xu, Jianqiang Wang\n  - Publish Date: 2025.10.09\n  - Task: Generation\n  - Summary：\n    - LinguaSim, an LLM-based framework that converts natural language into realistic, interactive 3D scenarios for autonomous vehicle testing and training, ensuring both dynamic vehicle interactions and faithful alignment between input descriptions and generated scenarios.\n    - A feedback calibration module refines generation precision, improving fidelity to user intent and reducing excessive aggressiveness (crash rate from 46.9% to 6.3%).\n    - The framework bridges the gap between natural language and closed-loop, interactive simulations, constraining adversarial vehicle behaviors using both the scenario description and the autonomous driving model.\n\n- [GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07791)\n  - 
Qinghongbing Xie, Zhaoyuan Xia, Feng Zhu, Lijun Gong, Ziyue Li, Rui Zhao, Long Zeng\n  - Publish Date: 2025.10.09\n  - Code: [GTR-Bench](https:\u002F\u002Fgithub.com\u002FX-Luffy\u002FGTR-Bench)\n  - Task: Evaluation\n  - Summary：\n    - Introduces GTR-Bench, a novel benchmark for evaluating geographic temporal reasoning of moving targets in a large-scale camera network, requiring perspective switches between maps and videos and joint reasoning across non-overlapping video views.\n    - Evaluations show a significant performance gap between state-of-the-art VLMs and human performance on geo-temporal reasoning, revealing key deficiencies in context utilization, temporal forecasting, and map-video alignment.\n    - The benchmark provides insights into spatial-temporal intelligence for applications like autonomous driving and embodied AI, with the code and benchmark to be released publicly.\n\n- [CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07944)\n  - Tianrui Zhang, Yichen Liu, Zilin Guo, Yuxin Guo, Jingcheng Ni, Chenjing Ding, Dan Xu, Lewei Lu, Zehuan Wu\n  - Publisher: Sensetime\n  - Publish Date: 2025.10.09\n  - Project Page: [CVD-STORM](https:\u002F\u002Fsensetime-fvg.github.io\u002FCVD-STORM)\n  - Task: Generation\n  - Summary：\n    - CVD-STORM, a cross-view video diffusion model with a spatial-temporal reconstruction VAE for generating long-term, multi-view videos with 4D reconstruction under various control inputs.\n    - The approach fine-tunes the VAE with an auxiliary 4D reconstruction task to enhance 3D structure and temporal encoding, then integrates it into the video diffusion process to improve generation quality.\n    - The model shows improvements in FID and FVD metrics, and its jointly-trained Gaussian Splatting Decoder reconstructs dynamic scenes for geometric information and scene understanding.\n\n- [Drive&Gen: Co-Evaluating End-to-End Driving and Video Generation Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.06209)\n  - Jiahao Wang, Zhenpei Yang, Yijing Bai, Yingwei Li, Yuliang Zou, Bo Sun, Abhijit Kundu, Jose Lezama, Luna Yue Huang, Zehao Zhu, Jyh-Jing Hwang, Dragomir Anguelov, Mingxing Tan, Chiyu Max Jiang\n  - Publisher: Google Research, Waymo\n  - Publish Date: 2025.10.07\n  - Task: Generation\n  - Summary：\n    - Proposes Drive&Gen, bridging driving models and generative world models to evaluate video realism for E2E planner evaluation using novel statistical measures.\n    - Exploits video generation controllability to investigate distribution gaps affecting E2E planner performance and biases.\n    - Shows synthetic data from video generation models is a cost-effective alternative to real data, improving E2E model generalization beyond existing Operational Design Domains.\n\n- [Work Zones challenge VLM Trajectory Planning: Toward Mitigation and Robust Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.02803)\n  - Yifan Liao, Zhen Sun, Xiaoyun Qiu, Zixiao Zhao, Wenbing Tang, Xinlei He, Xinhu Zheng, Tianwei Zhang, Xinyi Huang, Xingshuo Han\n  - Publish Date: 2025.10.03\n  - Task: Planning\n  - Datasets: [ROADWork](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Conducts the first systematic study of Visual Language Models (VLMs) for trajectory planning in work zones, revealing a 68.0% failure rate for mainstream VLMs and identifying 8 common failure patterns.\n    - Proposes REACT-Drive, a trajectory 
planning framework integrating VLMs with Retrieval-Augmented Generation (RAG) to convert prior failures into constraint rules and retrieve similar patterns for guidance.\n    - Demonstrates REACT-Drive's effectiveness, reducing average displacement error by ~3x compared to VLM baselines and achieving the lowest inference time (0.58s) in experiments on the ROADWork dataset and 15 real-world work zone scenarios.\n\n- [Nav-EE: Navigation-Guided Early Exiting for Efficient Vision-Language Models in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.01795)\n  - Haibo Hu, Lianming Huang, Xinyu Wang, Yufei Cui, Shangyu Wu, Nan Guan, Chun Jason Xue\n  - Publish Date: 2025.10.02\n  - Code: [Nav-EE](https:\u002F\u002Fanonymous.4open.science\u002Fr\u002FNav-EE-BBC4)\n  - Task: Planning\n  - Datasets: [CODA](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FCODA), [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F), [BOSCH](https:\u002F\u002Fwww.bosch-mobility.com\u002Fen\u002Fsolutions\u002Fautomated-driving\u002F)\n  - Summary：\n    - Proposes Nav-EE, a navigation-guided early-exit framework for Vision-Language Models (VLMs) in autonomous driving, which precomputes task-specific exit layers offline and applies them dynamically online based on navigation priors.\n    - Achieves accuracy comparable to full inference while reducing latency by up to 63.9%, with real-vehicle integration demonstrating reduced inference latency from 600ms to 300ms.\n\n- [Strategic Fusion of Vision Language Models: Shapley-Credited Context-Aware Dawid-Skene for Multi-Label Tasks in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.01126)\n  - Yuxiang Feng, Keyang Zhang, Hassane Ouchouid, Ashwil Kaniamparambil, Ioannis Souflas, Panagiotis Angeloudis\n  - Publisher: Imperial College London\n  - Publish Date: 2025.10.01\n  - Task: Perception, Reasoning\n  - Datasets: [HDD](https:\u002F\u002Fusa.honda-ri.com\u002Fhdd)\n  - Summary：\n    - Presents a game-theoretic fusion method, Shapley-credited Context-Aware Dawid-Skene with Agreement, for multi-label understanding of dashcam video to address VLM hallucination in AV pipelines.\n    - Curates a specialized dataset of 1,000 real-world dashcam clips with structured annotations using an automatic pipeline that fuses HDD ground truth, vehicle kinematics, and object tracking.\n    - The method achieves significant improvements over single models, including a 23% reduction in Hamming distance and over 47% improvement in F1 scores, providing a calibrated and robust decision-support component.\n\n- [NuRisk: A Visual Question Answering Dataset for Agent-Level Risk Assessment in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.25944)\n  - Yuan Gao, Mattia Piccinini, Roberto Brusnicki, Yuchen Zhang, Johannes Betz\n  - Publisher: Technical University of Munich\n  - Publish Date: 2025.09.30\n  - Task: VQA\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F), [CommonRoad](https:\u002F\u002Fcommonroad.in.tum.de\u002F)\n  - Summary：\n    - Proposes NuRisk, a comprehensive Visual Question Answering (VQA) dataset for agent-level risk assessment in autonomous driving, built on real-world data from nuScenes and Waymo and supplemented with safety-critical scenarios from the CommonRoad simulator.\n    - The dataset provides Bird-Eye-View (BEV) based sequential images with quantitative, agent-level risk annotations, designed to enable and benchmark spatio-temporal reasoning.\n    
- Benchmarks show standard VLMs struggle with explicit spatio-temporal reasoning on this task, while a fine-tuned 7B VLM agent improves accuracy and reduces latency, establishing NuRisk as a critical benchmark for advancing reasoning in autonomous driving.\n\n- [FuncPoison: Poisoning Function Library to Hijack Multi-agent Autonomous Driving Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24408)\n  - Yuzhen Long, Songze Li\n  - Publisher: University of California, Los Angeles\n  - Publish Date: 2025.09.29\n  - Task: Planning\n  - Summary：\n    - Introduces FuncPoison, a novel poisoning-based attack that targets the shared function library in LLM-driven multi-agent autonomous driving systems to manipulate agent behavior.\n    - Exploits weaknesses in text-based tool selection and standardized command formats to inject malicious tools, causing cascading errors that degrade system trajectory accuracy.\n    - Demonstrates the attack's effectiveness in evading defenses and highlights the function library as a critical, under-explored attack surface for system reliability.\n\n- [Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.00060)\n  - Sheng Yang, Tong Zhan, Guancheng Chen, Yanfeng Lu, Jian Wang\n  - Publish Date: 2025.09.29\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Introduces Max-V1, a novel one-stage end-to-end autonomous driving framework that reconceptualizes driving as a generalized language and formulates trajectory planning as next waypoint prediction.\n    - Proposes a single-pass generation paradigm leveraging a Vision-Language Model (VLM) for direct trajectory prediction from front-view camera input, supervised by a principled strategy from statistical modeling.\n    - Achieves state-of-the-art performance on nuScenes with over 30% improvement, demonstrating strong generalization and cross-vehicle robustness through imitation learning on large-scale expert demonstrations.\n\n- [Learning to Sample: Reinforcement Learning-Guided Sampling for Autonomous Vehicle Motion Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24313)\n  - Korbinian Moller, Roland Stroop, Mattia Piccinini, Alexander Langmann, Johannes Betz\n  - Publisher: Technical University of Munich (TUM)\n  - Publish Date: 2025.09.29\n  - Code: [Learning-to-Sample](https:\u002F\u002Fgithub.com\u002FTUM-AVS\u002FLearning-to-Sample)\n  - Task: Planning\n  - Datasets: [CommonRoad](https:\u002F\u002Fcommonroad.in.tum.de\u002F)\n  - Summary：\n    - A hybrid framework for sampling-based motion planning that uses a reinforcement learning agent to guide sampling toward promising regions of the action space, while keeping trajectory generation and evaluation analytical and verifiable.\n    - Integrates the RL sampler with a world model based on a decodable deep set encoder to handle variable numbers of traffic participants and reconstructable latent representations.\n    - Evaluated in CommonRoad, showing up to 99% fewer required samples and 84% runtime reduction while maintaining planning success and collision-free rates.\n\n- [Preventing Robotic Jailbreaking via Multimodal Domain Adaptation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.23281)\n  - Francesco Marchiori, Rohan Sinha, Christopher Agia, Alexander Robey, George J. 
Pappas, Mauro Conti, Marco Pavone\n  - Publisher: University of Padua, Stanford University, University of Pennsylvania\n  - Publish Date: 2025.09.27\n  - Project Page: [J-DAPT](https:\u002F\u002Fj-dapt.github.io)\n  - Task: Planning\n  - Datasets: [Waymo Open Dataset](https:\u002F\u002Fwaymo.com\u002Fopen\u002F), [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Introduces J-DAPT, a lightweight framework for multimodal jailbreak detection in robotic environments using attention-based fusion and domain adaptation.\n    - Integrates textual and visual embeddings to capture semantic intent and environmental grounding, aligning general-purpose jailbreak data with domain-specific references.\n    - Evaluations across autonomous driving, maritime robotics, and quadruped navigation show J-DAPT boosts detection accuracy to nearly 100% with minimal overhead.\n\n- [BEV-VLM: Trajectory Planning via Unified BEV Abstraction](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.25249)\n  - Guancheng Chen, Sheng Yang, Tong Zhan, Jian Wang\n  - Publish Date: 2025.09.27\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Introduces BEV-VLM, a framework for trajectory planning that uses Vision-Language Models with Bird's-Eye View feature maps as visual inputs.\n    - Utilizes a unified BEV-HD Map format from fused multi-modal sensor data for a geometrically consistent scene description.\n    - Demonstrates 44.8% improvements in planning accuracy and complete collision avoidance on the nuScenes dataset.\n\n- [MTRDrive: Memory-Tool Synergistic Reasoning for Robust Autonomous Driving in Corner Cases](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.20843)\n  - Ziang Luo, Kangan Qian, Jiahua Wang, Yuechen Luo, Jinyu Miao, Zheng Fu, Yunlong Wang, Sicong Jiang, Zilin Huang, Yifei Hu, Yuhao Yang, Hao Ye, Mengmeng Yang, Xiaojian Dong, Kun Jiang, Diange Yang\n  - Publish Date: 2025.09.25\n  - Task: End-to-End\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Introduces MTRDrive, a framework integrating procedural driving experiences with a dynamic toolkit to enhance generalization and proactive decision-making for end-to-end autonomous driving.\n    - Proposes a closed-loop system combining a memory-based experience retrieval mechanism with dynamic toolkits to improve reasoning and decision-making.\n    - Achieves state-of-the-art performance on the NAVSIM benchmark and demonstrates strong zero-shot generalization on a new Roadwork-VLM benchmark.\n\n- [Universal Camouflage Attack on Vision-Language Models for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.20196)\n  - Dehong Kong, Sifan Yu, Siyuan Liang, Jiawei Liang, Jianhou Gan, Aishan Liu, Wenqi Ren\n  - Publish Date: 2025.09.24\n  - Task: End-to-End\n  - Summary：\n    - Proposes the first Universal Camouflage Attack (UCA) framework for Vision-Language Models in Autonomous Driving, generating physically realizable camouflage textures that generalize across commands and model architectures.\n    - Introduces a feature divergence loss (FDL) targeting encoder and projection layer vulnerabilities, along with a multi-scale learning strategy for robustness to viewpoint and scale changes in real-world scenarios.\n    - Demonstrates strong attack performance, inducing incorrect driving commands across various VLM-AD models and significantly surpassing existing methods, with high robustness under diverse dynamic 
conditions.\n\n- [Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.20109)\n  - Pengxiang Li, Yinan Zheng, Yue Wang, Huimin Wang, Hang Zhao, Jingjing Liu, Xianyuan Zhan, Kun Zhan, Xianpeng Lang\n  - Publish Date: 2025.09.24\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Introduces ReflectDrive, a learning-based framework integrating a reflection mechanism for safe trajectory generation via discrete diffusion, addressing limitations of imitation learning in Vision-Language-Action models.\n    - Proposes a safety-aware reflection mechanism that performs iterative self-correction without gradient computation, using local search to identify unsafe tokens and inpainting-based regeneration for safe anchors.\n    - Evaluated on the NAVSIM benchmark, demonstrating significant advantages in safety-critical trajectory generation for autonomous driving systems.\n\n- [Orchestrate, Generate, Reflect: A VLM-Based Multi-Agent Collaboration Framework for Automated Driving Policy Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.17042)\n  - Zengqi Peng, Yusen Xie, Yubin Wang, Rui Yang, Qifeng Chen, Jun Ma\n  - Publish Date: 2025.09.21\n  - Task: Planning\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - Proposes OGR, a novel automated driving policy learning framework that leverages a vision-language model (VLM)-based multi-agent collaboration system to automate the design of reward functions and training curricula.\n    - Introduces a hierarchical agent system with an orchestrator, generation, and reflection module, enhanced by a memory module and a parallel generation scheme with human-in-the-loop augmentation for robust policy evolution.\n    - Demonstrates superior performance, generalizability across urban scenarios in CARLA, and compatibility with various RL algorithms, with real-world experiments validating its practical viability.\n\n- [Are VLMs Ready for Lane Topology Awareness in Autonomous Driving?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.16654)\n  - Xin Chen, Jia He, Maozheng Li, Dongliang Xu, Tianyu Wang, Yixiao Chen, Zhixin Lin, Yue Yao\n  - Publish Date: 2025.09.20\n  - Task: VQA\n  - Summary：\n    - Systematically evaluates Vision-Language Models (VLMs) on their capability for road topology understanding, a key requirement for safe autonomous driving.\n    - Proposes four diagnostic VQA tasks based on bird's-eye-view lane representations to capture essential components of spatial topology reasoning.\n    - Finds that spatial reasoning remains a fundamental bottleneck for current VLMs, with performance correlating with model size, reasoning token length, and provided examples.\n\n- [CoReVLA: A Dual-Stage End-to-End Autonomous Driving Framework for Long-Tail Scenarios via Collect-and-Refine](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.15968)\n  - Shiyu Fang, Yiming Cui, Haoyang Liang, Chen Lv, Peng Hang, Jian Sun\n  - Publish Date: 2025.09.19\n  - Code: [CoReVLA](https:\u002F\u002Fgithub.com\u002FFanGShiYuu\u002FCoReVLA)\n  - Task: End-to-End\n  - Datasets: [Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)\n  - Summary：\n    - CoReVLA, a continual learning end-to-end autonomous driving framework that improves performance in long-tail, safety-critical scenarios via a dual-stage process of data Collection and behavior Refinement.\n    - The framework is 
fine-tuned on driving QA data, collects driver takeover data in CAVE simulation, and is refined via Direct Preference Optimization (DPO) to learn from human preferences and avoid reward hacking.\n    - On the Bench2Drive benchmark, CoReVLA achieves a Driving Score of 72.18 and a Success Rate of 50%, outperforming state-of-the-art methods in long-tail scenarios.\n\n- [AdaThinkDrive: Adaptive Thinking via Reinforcement Learning for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.13769)\n  - Yuechen Luo, Fang Li, Shaoqing Xu, Zhiyi Lai, Lei Yang, Qimao Chen, Ziang Luo, Zixun Xie, Shengyin Jiang, Jiaxin Liu, Long Chen, Bing Wang, Zhi-xin Yang\n  - Publish Date: 2025.09.17\n  - Task: End-to-End\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - AdaThinkDrive, a novel Vision-Language-Action (VLA) framework with a dual-mode reasoning mechanism (fast answering and slow thinking) for adaptive reasoning in autonomous driving.\n    - Introduces an Adaptive Think Reward strategy with Group Relative Policy Optimization (GRPO) to reward the model for selectively applying Chain of Thought (CoT) reasoning.\n    - Achieves state-of-the-art performance on the Navsim benchmark (PDMS of 90.3) while reducing inference time by 14% compared to an always-reasoning baseline.\n\n- [Large Foundation Models for Trajectory Prediction in Autonomous Driving: A Comprehensive Survey](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.10570)\n  - Wei Dai, Shengen Wu, Wei Wu, Zhenhao Wang, Sisuo Lyu, Haicheng Liao, Limin Yu, Weiping Ding, Runwei Guan, Yutao Yue\n  - Publish Date: 2025.09.11\n  - Task: Prediction\n  - Summary：\n    - A systematic survey on Large Foundation Models (LFMs), including LLMs and MLLMs, for trajectory prediction in autonomous driving, highlighting their role in enabling interpretable contextual reasoning.\n    - Covers core methodologies like trajectory-language mapping, multimodal fusion, and constraint-based reasoning, along with tasks, metrics, datasets, and key challenges.\n    - Discusses future research directions such as low-latency inference, causality-aware modeling, and motion foundation models.\n\n- [DepthVision: Enabling Robust Vision-Language Models with GAN-Based LiDAR-to-RGB Synthesis for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.07463)\n  - Sven Kirchner, Nils Purschke, Ross Greer, Alois C. 
Knoll\n  - Publisher: Technical University of Munich\n  - Publish Date: 2025.09.09\n  - Task: Perception\n  - Summary：\n    - DepthVision, a multimodal framework enabling Vision-Language Models (VLMs) to exploit LiDAR data without architectural changes or retraining by synthesizing dense RGB-like images from sparse LiDAR point clouds.\n    - Introduces a Luminance-Aware Modality Adaptation (LAMA) module that fuses synthesized and real camera images by dynamically weighting each modality based on ambient lighting to compensate for degradation like darkness or motion blur.\n    - The design turns LiDAR into a drop-in visual surrogate when RGB is unreliable, extending the operational envelope of existing VLMs, with evaluations showing substantial improvements in low-light scene understanding over RGB-only baselines.\n\n- [OccVLA: Vision-Language-Action Model with Implicit 3D Occupancy Supervision](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.05578)\n  - Ruixun Liu, Lingyu Kong, Derun Li, Hang Zhao\n  - Publish Date: 2025.09.06\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Proposes OccVLA, a framework integrating 3D occupancy representations into multimodal reasoning for autonomous driving, using occupancy as both a predictive output and supervisory signal.\n    - Learns fine-grained spatial structures from 2D visual inputs without explicit 3D inputs or extra inference overhead, as occupancy predictions can be skipped.\n    - Achieves state-of-the-art results on nuScenes for trajectory planning and superior performance on 3D visual question-answering tasks.\n\n- [LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.05263)\n  - Yinglin Duan, Zhengxia Zou, Tongwei Gu, Wei Jia, Zhan Zhao, Luyi Xu, Xinzhu Liu, Yenan Lin, Hao Jiang, Kang Chen, Shuang Qiu\n  - Publish Date: 2025.09.05\n  - Project Page: [Demo Video](https:\u002F\u002Fyoutu.be\u002F8VWZXpERR18)\n  - Task: Generation\n  - Summary：\n    - LatticeWorld, a 3D world generation framework that leverages lightweight LLMs (LLaMA-2-7B) and Unreal Engine 5 to create large-scale interactive worlds from multimodal textual and visual instructions.\n    - The framework streamlines industrial 3D environment production, achieving over a 90× increase in efficiency while maintaining high creative quality compared to traditional manual methods.\n\n- [SAM-LLM: Interpretable Lane Change Trajectory Prediction via Parametric Finetuning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.03462)\n  - Zhuo Cao, Yunxiao Shi, Min Xu\n  - Publish Date: 2025.09.03\n  - Task: Prediction\n  - Summary：\n    - Introduces SAM-LLM, a hybrid architecture that combines Large Language Models (LLMs) for contextual reasoning with a kinematic Sinusoidal Acceleration Model (SAM) for physical precision in autonomous driving trajectory prediction.\n    - For lane changes, the model outputs interpretable physical parameters (e.g., lateral displacement, duration) instead of raw coordinates, enabling continuous, plausible trajectories with an 80% reduction in output size compared to coordinate-based methods.\n    - Achieves state-of-the-art intention prediction accuracy of 98.73%, matching traditional LLM predictors while offering superior explainability and computational efficiency.\n\n- [KEPT: Knowledge-Enhanced Prediction of Trajectories from Consecutive Driving Frames with Vision-Language 
Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.02966)\n  - Yujin Wang, Tianyi Wang, Quanfeng Liu, Wenxian Fan, Junfeng Jiao, Christian Claudel, Yunbing Yan, Bingzhao Gao, Jianqiang Wang, Hong Chen\n  - Publish Date: 2025.09.03\n  - Task: Prediction\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - KEPT, a knowledge-enhanced VLM framework for predicting ego trajectories directly from consecutive front-view driving frames.\n    - Integrates a temporal frequency-spatial fusion video encoder with a k-means & HNSW retrieval-augmented generation pipeline, using retrieved knowledge in chain-of-thought prompts with planning constraints.\n    - Employs a triple-stage fine-tuning paradigm to align the VLM backbone, achieving state-of-the-art open-loop performance on nuScenes.\n\n- [Do LLM Modules Generalize? A Study on Motion Generation for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.02754)\n  - Mingyi Wang, Jingke Wang, Tengju Ye, Junbo Chen, Kaicheng Yu\n  - Publish Date: 2025.09.02\n  - Task: Prediction\n  - Datasets: [Waymo Sim Agents](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Summary：\n    - A comprehensive evaluation of five key LLM modules for transfer to autonomous driving motion generation: tokenizer design, positional embedding, pre-training paradigms, post-training strategies, and test-time computation.\n    - Demonstrates that appropriately adapted LLM modules can significantly improve performance on the Waymo Sim Agents benchmark, achieving competitive results.\n    - Identifies which techniques transfer effectively, analyzes reasons for failures, and discusses necessary adaptations for the autonomous driving domain.\n\n- [2nd Place Solution for CVPR2024 E2E Challenge: End-to-End Autonomous Driving Using Vision Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.02659)\n  - Zilong Guo, Yi Luo, Long Sha, Dongxu Wang, Panqu Wang, Chenyang Xu, Yi Yang\n  - Publish Date: 2025.09.02\n  - Task: End-to-End\n  - Summary：\n    - A camera-only end-to-end autonomous driving solution that combines architectural design with Vision Language Models (VLMs), achieving 2nd place in the CVPR2024 E2E Challenge.\n    - Demonstrates that integrating knowledgeable VLMs into an end-to-end framework yields impressive performance, highlighting the potential of vision-based approaches for driving tasks.\n\n- [AutoDrive-R²: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.01944)\n  - Zhenlong Yuan, Chengxuan Qian, Jing Tang, Rui Chen, Zijian Song, Lei Sun, Xiangxiang Chu, Yujun Cai, Dapeng Zhang, Shuo Li\n  - Publish Date: 2025.09.02\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Summary：\n    - AutoDrive-R², a novel VLA framework that enhances reasoning and self-reflection for autonomous driving via chain-of-thought (CoT) processing and reinforcement learning.\n    - Introduces the nuScenesR²-6K CoT dataset for supervised fine-tuning, building a four-step logical chain with self-reflection to connect perception to planning.\n    - Employs the GRPO reinforcement learning algorithm within a physics-grounded reward framework to optimize for reliable and realistic trajectory planning.\n\n- [OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.00789)\n  - Pei Liu, 
Qingtian Ning, Xinyan Lu, Haipeng Liu, Weiliang Ma, Dangen She, Peng Jia, Xianpeng Lang, Jun Ma\n  - Publish Date: 2025.08.31\n  - Task: Reasoning\n  - Summary：\n    - OmniReason, a temporal-guided Vision-Language-Action (VLA) framework for autonomous driving, introduces robust spatiotemporal reasoning by jointly modeling dynamic 3D environments and decision-making.\n    - It proposes OmniReason-Data, large-scale VLA datasets with dense spatiotemporal annotations generated via a hallucination-mitigated auto-labeling pipeline for physical plausibility and temporal coherence.\n    - It develops the OmniReason-Agent architecture with a sparse temporal memory module and an explanation generator, using spatiotemporal knowledge distillation to capture causal reasoning patterns for interpretable, temporally-aware driving.\n\n- [DriveQA: Passing the Driving Knowledge Test](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.21824)\n  - Maolin Wei, Wanzhou Liu, Eshed Ohn-Bar\n  - Publish Date: 2025.08.29\n  - Project Page: [DriveQA](https:\u002F\u002Fdriveqaiccv.github.io\u002F)\n  - Task: VQA\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [BDD](https:\u002F\u002Fbdd-data.berkeley.edu\u002F)\n  - Summary：\n    - DriveQA, an extensive open-source text and vision-based benchmark that exhaustively covers traffic regulations and scenarios for evaluating LLMs and MLLMs.\n    - Experiments show fine-tuning on DriveQA improves model accuracy in regulatory sign recognition and intersection decision-making, and pretraining on it enhances downstream driving task performance on real-world datasets.\n\n- [DrivingGaussian++: Towards Realistic Reconstruction and Editable Simulation for Surrounding Dynamic Driving Scenes](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.20965)\n  - Yajiao Xiong, Xiaoyu Zhou, Yongtao Wan, Deqing Sun, Ming-Hsuan Yang\n  - Publish Date: 2025.08.28\n  - Project Page: [DrivingGaussian++](https:\u002F\u002Fxiong-creator.github.io\u002FDrivingGaussian_plus.github.io)\n  - Task: Generation\n  - Summary：\n    - DrivingGaussian++, an efficient framework for realistic reconstruction and controllable editing of surrounding dynamic autonomous driving scenes, using incremental 3D Gaussians for static background and a composite dynamic Gaussian graph for moving objects.\n    - Integrates a LiDAR prior for detailed, consistent scene reconstruction and supports training-free controllable editing (texture, weather, object manipulation) by leveraging multi-view images, depth priors, and large language models (LLMs) for motion trajectory generation.\n\n- [Drive As You Like: Strategy-Level Motion Planning Based on A Multi-Head Diffusion Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.16947)\n  - Fan Ding, Xuewen Luo, Hwa Hui Tew, Ruturaj Reddy, Xikun Wang, Junn Yong Loo\n  - Publish Date: 2025.08.23\n  - Task: Planning\n  - Datasets: [nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - Summary：\n    - Proposes a diffusion-based multi-head trajectory planner (M-diffusion planner) fine-tuned with Group Relative Policy Optimization (GRPO) to learn diverse policy-specific driving behaviors.\n    - Incorporates a large language model (LLM) at inference to guide strategy selection for dynamic, instruction-aware planning without model switching.\n    - Achieves state-of-the-art performance on the nuPlan benchmark, with generated trajectories showing clear diversity to satisfy multi-modal driving requirements.\n\n- [Seeing Clearly, Forgetting Deeply: Revisiting Fine-Tuned Video 
Generators for Driving Simulation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.16512)\n  - Chun-Peng Chang, Chen-Yu Wang, Julian Schmidt, Holger Caesar, Alain Pagani\n  - Publish Date: 2025.08.22\n  - Task: Generation\n  - Datasets: [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F), [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Investigates the trade-off in fine-tuning video generators for driving simulation, where visual fidelity improves but spatial accuracy for dynamic elements may degrade.\n    - Attributes this degradation to a shift in alignment between visual quality and dynamic understanding objectives in the repetitive context of driving scenes.\n    - Shows that simple continual learning strategies, like replay from diverse domains, can preserve spatial accuracy while maintaining strong visual quality.\n\n- [Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.13305)\n  - Minhao Xiong, Zichen Wen, Zhuangcheng Gu, Xuyang Liu, Rui Zhang, Hengrui Kang, Jiabing Yang, Junyuan Zhang, Weijia Li, Conghui He, Yafei Wang, Linfeng Zhang\n  - Publish Date: 2025.08.18\n  - Task: End-to-End\n  - Datasets: [DriveLM](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM), [DriveLMM-o1](https:\u002F\u002Fgithub.com\u002Fayesha-ishaq\u002FDriveLMM-o1)\n  - Summary：\n    - Prune2Drive, a plug-and-play visual token pruning framework for multi-view VLMs in autonomous driving, addressing computational overhead from high-resolution, multi-view images.\n    - Introduces a diversity-aware token selection mechanism inspired by farthest point sampling and a view-adaptive pruning controller to learn optimal pruning ratios per camera view.\n    - Achieves significant speedups and memory savings (e.g., 6.40× speedup, 13.4% of original FLOPs with 10% tokens) while maintaining task performance on DriveLM and DriveLMM-o1 benchmarks.\n\n- [ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12603)\n  - Can Cui, Yupeng Zhou, Juntong Peng, Sung-Yeon Park, Zichong Yang, Prashanth Sankaranarayanan, Jiaru Zhang, Ruqi Zhang, Ziran Wang\n  - Publish Date: 2025.08.18\n  - Task: End-to-End\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Introduces ViLaD, a novel Large Vision Language Diffusion (LVLD) framework for end-to-end autonomous driving that uses a masked diffusion model for parallel generation of driving decisions, reducing latency.\n    - The framework supports bidirectional reasoning and progressive easy-first generation, outperforming autoregressive VLM baselines in planning accuracy and speed on nuScenes.\n    - Demonstrates practical viability through real-world deployment on an autonomous vehicle for an interactive parking task, achieving a near-zero failure rate.\n\n- [LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12404)\n  - Nan Song, Bozhou Zhang, Xiatian Zhu, Jiankang Deng, Li Zhang\n  - Publish Date: 2025.08.17\n  - Task: End-to-End\n  - Datasets: [DriveLM](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM), [nuScenes-QA](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes#download)\n  - Summary：\n    - Proposes LMAD, a novel vision-language framework for autonomous driving that emulates modern end-to-end paradigms with comprehensive scene understanding and 
a task-specialized VLM structure.\n    - Introduces preliminary scene interaction and specialized expert adapters within the driving task structure to better align VLMs with autonomous driving scenarios.\n    - The approach is designed to be fully compatible with existing VLMs and seamlessly integrate with planning-oriented driving systems, setting a new standard in explainable autonomous driving.\n\n- [ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.11428)\n  - Jingyu Li, Bozhou Zhang, Xin Jin, Jiankang Deng, Xiatian Zhu, Li Zhang\n  - Publish Date: 2025.08.15\n  - Task: End-to-End\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer in a unified imagination-and-planning loop.\n    - Introduces an early stopping mechanism and a trajectory selection strategy to address efficiency and predictive accuracy challenges in the integration of action-level decisions with high-fidelity pixel-level predictions.\n\n- [VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.05852)\n  - Kaiser Hamid, Khandakar Ashrafi Akbar, Nade Liang\n  - Publish Date: 2025.08.07\n  - Task: Perception\n  - Datasets: [BDD-A](https:\u002F\u002Fbdd-data.berkeley.edu\u002F)\n  - Summary：\n    - A vision-language framework that models drivers' gaze shifts through natural language, using few-shot and zero-shot learning on single RGB images.\n    - Fine-tunes LLaVA on curated BDD-A captions to align visual perception with attention-centric scene understanding, integrating low-level cues and top-down context.\n    - Generates driver visual attention allocation and shifting predictions in natural language, offering a new direction for explainable AI in autonomous driving.\n\n- [IRL-VLA: Training an Vision-Language-Action Policy via Reward World Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.06571)\n  - Anqing Jiang, Yu Gao, Yiru Wang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun, Shichen Tang, Lijuan Zhu, Jinhao Chai, Jijun Wang, Zichong Gu, Hao Jiang, Li Sun\n  - Publish Date: 2025.08.07\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - IRL-VLA, a novel closed-loop Reinforcement Learning framework for autonomous driving using an Inverse Reinforcement Learning reward world model with a Vision-Language-Action (VLA) policy.\n    - The framework employs a three-stage paradigm: VLA architecture pretraining via imitation learning, building a lightweight reward world model via IRL, and enhancing planning via specialized PPO reinforcement learning.\n    - Achieves state-of-the-art performance on the NAVSIM v2 end-to-end driving benchmark and is the 1st runner up in the CVPR2025 Autonomous Grand Challenge.\n\n- [LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.03692)\n  - Ao Liang, Youquan Liu, Yu Yang, Dongyue Lu, Linfeng Li, Lingdong Kong, Huaici Zhao, Wei Tsang Ooi\n  - Publish Date: 2025.08.05\n  - Task: Generation\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - LiDARCrafter, a unified framework for 4D LiDAR 
generation and editing from free-form natural language, parsing instructions into ego-centric scene graphs to condition a tri-branch diffusion network.\n    - The framework includes an autoregressive module for generating temporally coherent 4D LiDAR sequences and establishes a comprehensive benchmark for standardized evaluation across scene-, object-, and sequence-level metrics.\n\n- [Mapillary Vistas Validation for Fine-Grained Traffic Signs: A Benchmark Revealing Vision-Language Model Limitations](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.02047)\n  - Sparsh Garg, Abhishek Aich\n  - Publisher: NEC Laboratories America\n  - Publish Date: 2025.08.04\n  - Code: [relabeling](https:\u002F\u002Fgithub.com\u002Fnec-labs-ma\u002Frelabeling)\n  - Task: Perception\n  - Datasets: [Mapillary Vistas](https:\u002F\u002Fwww.mapillary.com\u002Fdataset\u002Fvistas)\n  - Summary：\n    - Presents a new validation set for fine-grained traffic signs, Mapillary Vistas Validation for Traffic Signs (MVV), with pixel-level instance masks and expert annotations.\n    - Benchmarks VLMs against DINOv2, showing DINOv2's superior performance on fine-grained recognition and other categories like vehicles and humans.\n    - Reveals significant limitations in current VLMs for fine-grained visual understanding and establishes DINOv2 as a strong baseline for autonomous driving perception.\n\n- [Bench2ADVLM: A Closed-Loop Benchmark for Vision-language Models in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.02028)\n  - Tianyuan Zhang, Ting Jin, Lu Wang, Jiangfan Liu, Siyuan Liang, Mingchuan Zhang, Aishan Liu, Xianglong Liu\n  - Publish Date: 2025.08.04\n  - Task: Evaluation\n  - Summary：\n    - Bench2ADVLM, a unified hierarchical closed-loop evaluation framework for real-time, interactive assessment of Vision-Language Models (VLMs) in autonomous driving across simulation and physical platforms.\n    - Introduces a dual-system adaptation architecture for simulation and a physical control abstraction layer to bridge simulation and reality, enabling closed-loop testing on physical vehicles.\n    - Features a self-reflective scenario generation module to automatically explore model behavior and uncover potential failure modes for safety-critical scenario generation.\n\n- [Beyond Simulation: Benchmarking World Models for Planning and Causality in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.01922)\n  - Hunter Schofield, Mohammed Elmahgiubi, Kasra Rezaee, Jinjun Shan\n  - Publisher: York University\n  - Publish Date: 2025.08.03\n  - Task: Evaluation\n  - Datasets: [Waymo Open Dataset](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Summary：\n    - Proposes new metrics to evaluate the robustness of world models as traffic simulators, specifically for their use as pseudo-environments for policy training.\n    - Extends the Waymo Open Sim-Agents Challenge (WOSAC) evaluation to include agents causal to the ego vehicle, revealing scenarios where top models fail under trajectory replay.\n    - Analyzes state-of-the-art world models under these new metrics to assess their sensitivity to uncontrollable objects and suitability for policy training.\n\n- [Edge-Based Multimodal Sensor Data Fusion with Vision Language Models (VLMs) for Real-time Autonomous Vehicle Accident Avoidance](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.01057)\n  - Fengze Yang, Bo Yu, Yang Zhou, Xuewen Luo, Zhengzhong Tu, Chenxi Liu\n  - Publish Date: 2025.08.01\n  - Task: Planning\n  - Datasets: 
[DeepAccident](https:\u002F\u002Fgithub.com\u002Fsisl\u002FDeepAccident)\n  - Summary：\n    - Proposes REACT, a real-time V2X-integrated trajectory optimization framework for autonomous driving based on a fine-tuned lightweight Vision-Language Model (VLM).\n    - Integrates infrastructure hazard alerts with onboard sensor data, using visual embeddings and contextual reasoning to generate safety-oriented trajectories.\n    - Employs Residual Trajectory Fusion (RTF) and edge-adaptation strategies for efficient deployment, achieving state-of-the-art performance on the DeepAccident benchmark.\n\n- [A Unified Perception-Language-Action Framework for Adaptive Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23540)\n  - Yi Zhang, Erik Leo Haß, Kuo-Yi Chao, Nenad Petrovic, Yinglei Song, Chengdong Wu, Alois Knoll\n  - Publisher: Technical University of Munich\n  - Publish Date: 2025.07.31\n  - Task: Planning\n  - Summary：\n    - Proposes a unified Perception-Language-Action (PLA) framework integrating multi-sensor fusion with an LLM-augmented Vision-Language-Action architecture for autonomous driving.\n    - Features a GPT-4.1-powered reasoning core to couple perception with natural language understanding for context-aware, explainable, and safety-bounded decision-making.\n    - Demonstrates superior performance in trajectory tracking, speed prediction, and adaptive planning in complex urban scenarios like intersections with construction zones.\n\n- [FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23318)\n  - Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Zhuo Li, Xiaobao Wei, Sixiang Chen, Liyun Li, Xianming Liu, Ming Lu, Yang Wang, Shanghang Zhang\n  - Publish Date: 2025.07.31\n  - Task: End-to-End\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - FastDriveVLA, a novel reconstruction-based vision token pruning framework for autonomous driving, featuring a plug-and-play visual token pruner called ReconPruner that prioritizes foreground information via MAE-style pixel reconstruction.\n    - Introduces an adversarial foreground-background reconstruction strategy to train ReconPruner and a new large-scale dataset, nuScenes-FG, with 241K image-mask pairs of annotated foreground regions.\n    - Achieves state-of-the-art results on the nuScenes open-loop planning benchmark across different pruning ratios, enabling efficient end-to-end driving by reducing the computational cost of long visual tokens in VLA models.\n\n- [FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23325)\n  - Yiming Yang, Hongbin Lin, Yueru Luo, Suzhong Fu, Chao Zheng, Xinrui Yan, Shuqi Mei, Kun Tang, Shuguang Cui, Zhen Li\n  - Publish Date: 2025.07.31\n  - Task: Perception, Reasoning\n  - Datasets: [OpenLane-V2](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FOpenLane-V2)\n  - Summary：\n    - FASTopoWM, a novel fast-slow lane segment topology reasoning framework augmented with latent world models for comprehensive BEV road scene understanding.\n    - The framework enables parallel supervision of historical and new queries to reduce pose estimation failure impact and introduces latent query and BEV world models for improved temporal perception.\n    - Demonstrates state-of-the-art performance on the OpenLane-V2 benchmark for lane segment detection and centerline perception.\n\n- 
[Vision-Language Cross-Attention for Real-Time Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23064)\n  - Santosh Patapati, Trisanth Srinivasan, Murari Ambati\n  - Publish Date: 2025.07.30\n  - Task: End-to-End\n  - Datasets: [MD-NEX Outdoor-Driving](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - XYZ-Drive, a single vision-language model for autonomous driving that reads a front-camera frame, an overhead map, and a waypoint to output steering and speed.\n    - It uses a lightweight goal-centered cross-attention layer to fuse waypoint, image, and map tokens before processing with a fine-tuned LLaMA-3.2 11B model.\n    - The model achieves 95% success on the MD-NEX benchmark, surpassing prior methods, and ablations confirm the importance of each modality and the fusion mechanism.\n\n- [SafeDriveRAG: Towards Safe Autonomous Driving with Knowledge Graph-based Retrieval-Augmented Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.21585)\n  - Hao Ye, Mengshi Qi, Zhaohong Liu, Liang Liu, Huadong Ma\n  - Publish Date: 2025.07.29\n  - Code: [SafeDriveRAG](https:\u002F\u002Fgithub.com\u002FLumos0507\u002FSafeDriveRAG)\n  - Task: VQA\n  - Datasets: [SafeDrive228K](https:\u002F\u002Fgithub.com\u002FLumos0507\u002FSafeDriveRAG)\n  - Summary：\n    - Introduces SafeDrive228K, a large-scale multimodal VQA benchmark with 228K examples across 18 sub-tasks for evaluating traffic safety comprehension in driving scenarios.\n    - Proposes a plug-and-play multimodal knowledge graph-based retrieval-augmented generation (RAG) framework with a multi-scale subgraph retrieval algorithm.\n    - Demonstrates that the RAG framework significantly improves performance on safety-critical tasks across five mainstream VLMs.\n\n- [DriveAgent-R1: Advancing VLM-based Autonomous Driving with Active Perception and Hybrid Thinking](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.20879)\n  - Weicheng Zheng, Xiaofei Mao, Nanfei Ye, Pengxiang Li, Kun Zhan, Xianpeng Lang, Hang Zhao\n  - Publish Date: 2025.07.28\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - DriveAgent-R1, the first autonomous driving agent with active perception for planning, proactively invokes tools for visual reasoning to ground decisions in visual evidence.\n    - Introduces a hybrid thinking framework that adaptively switches between text-only reasoning and tool-augmented visual reasoning based on scene complexity.\n    - Trained via a three-stage progressive strategy with a core Cascaded Reinforcement Learning phase, achieving competitive performance with 3B parameters on Drive-Internal and nuScenes datasets.\n\n- [VESPA: Towards un(Human)supervised Open-World Pointcloud Labeling for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.20397)\n  - Levente Tempfli, Esteban Rivera, Markus Lienkamp\n  - Publisher: Technical University of Munich\n  - Publish Date: 2025.07.27\n  - Task: Perception\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - VESPA, a multimodal autolabeling pipeline that fuses LiDAR geometry with camera semantics for scalable 3D pseudolabel generation without ground-truth or HD maps.\n    - Leverages vision-language models (VLMs) for open-vocabulary object labeling and detection refinement directly in the point cloud domain, supporting novel category discovery.\n\n- [VLMPlanner: Integrating Visual Language Models with Motion 
Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.20342)\n  - Zhipeng Tang, Sha Zhang, Jiajun Deng, Chenjie Wang, Guoliang You, Yuting Huang, Xinrui Lin, Yanyong Zhang\n  - Publish Date: 2025.07.27\n  - Task: Planning\n  - Datasets: [nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - Summary：\n    - VLMPlanner, a hybrid framework combining a learning-based real-time planner with a vision-language model (VLM) that reasons over raw multi-view images for robust trajectory generation.\n    - Introduces a Context-Adaptive Inference Gate (CAI-Gate) to dynamically adjust VLM inference frequency based on scene complexity, balancing performance and computational efficiency.\n    - Evaluated on the nuPlan benchmark, demonstrating superior planning in scenarios with intricate road conditions and dynamic elements.\n\n- [BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.19370)\n  - Felix Brandstaetter, Erik Schuetz, Katharina Winter, Fabian Flohr\n  - Publish Date: 2025.07.25\n  - Task: Perception\n  - Datasets: [nuCaption](https:\u002F\u002Fwww.nuscenes.org\u002Fnuimages), [nuView](https:\u002F\u002Fwww.nuscenes.org\u002Fnuimages), [GroundView](https:\u002F\u002Fwww.nuscenes.org\u002Fnuimages)\n  - Summary：\n    - BEV-LLM, a lightweight model for 3D scene captioning in autonomous driving, leveraging BEVFusion to combine LiDAR and multi-view images with a novel absolute positional encoding.\n    - Achieves competitive performance on nuCaption, surpassing state-of-the-art by up to 5% in BLEU scores, using a small 1B parameter base model.\n    - Introduces and benchmarks two new datasets, nuView and GroundView, to better assess scene captioning across diverse driving scenarios and object grounding.\n\n- [BetterCheck: Towards Safeguarding VLMs for Automotive Perception Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.17722)\n  - Malsha Ashani Mahawatta Dona, Beatriz Cabrero-Daniel, Yinan Yu, Christian Berger\n  - Publisher: University of Gothenburg\n  - Publish Date: 2025.07.23\n  - Task: Perception\n  - Datasets: [Waymo Open Dataset](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Summary：\n    - Proposes BetterCheck, a method for detecting hallucinations in Vision-Language Models (VLMs) to safeguard their use in automotive perception systems.\n    - Systematically assesses the performance of 3 state-of-the-art VLMs on diverse traffic situations from the Waymo Open Dataset, finding they exhibit strong understanding but remain prone to hallucination.\n\n- [VLM-UDMC: VLM-Enhanced Unified Decision-Making and Motion Control for Urban Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.15266)\n  - Haichao Liu, Haoren Guo, Pei Liu, Benshan Ma, Yuxiang Zhang, Jun Ma, Tong Heng Lee\n  - Publish Date: 2025.07.21\n  - Code: [VLM-UDMC](https:\u002F\u002Fgithub.com\u002Fhenryhcliu\u002Fvlmudmc.git)\n  - Task: Planning\n  - Summary：\n    - Proposes VLM-UDMC, a vision-language model-enhanced framework for unified decision-making and motion control in urban autonomous driving, incorporating scene reasoning and risk-aware insights.\n    - Features a two-step reasoning policy with Retrieval-Augmented Generation (RAG) in an upper-level slow system to dynamically reconfigure motion planning based on real-time environmental changes.\n    - Employs a lightweight multi-kernel decomposed LSTM for real-time trajectory prediction of traffic participants and validates the framework through simulations and 
real-world experiments.\n\n- [AGENTS-LLM: Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.13729)\n  - Yu Yao, Salil Bhatnagar, Markus Mazzola, Vasileios Belagiannis, Igor Gilitschenski, Luigi Palmieri, Simon Razniewski, Marcel Hallgarten\n  - Publish Date: 2025.07.18\n  - Task: Planning\n  - Summary：\n    - Introduces a novel LLM-agent based framework for augmenting real-world traffic scenarios using natural language descriptions to generate challenging test cases for autonomous driving planners.\n    - Employs an agentic design to enable fine-grained control over scenario generation and maintain high performance with smaller, cost-effective LLMs, sidestepping the need for massive datasets or manual expert augmentation.\n\n- [Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.13162)\n  - Arian Mousakhan, Sudhanshu Mittal, Silvio Galesso, Karim Farid, Thomas Brox\n  - Publisher: University of Freiburg\n  - Publish Date: 2025.07.17\n  - Project Page: [Orbis](https:\u002F\u002Flmb-freiburg.github.io\u002Forbis.github.io\u002F)\n  - Code: [Orbis](https:\u002F\u002Flmb-freiburg.github.io\u002Forbis.github.io\u002F)\n  - Task: Prediction\n  - Summary：\n    - A driving world model designed to overcome challenges in long-horizon generation and generalization to difficult scenarios, using simple design choices without additional supervision or sensors.\n    - Achieves state-of-the-art performance with only 469M parameters trained on 280h of video data, excelling in challenging scenarios like turning maneuvers and urban traffic.\n    - Introduces a hybrid tokenizer for a side-by-side comparison, concluding that a continuous autoregressive model is less brittle and more powerful than a model built on discrete tokens.\n\n- [World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.12762)\n  - Yanchen Guan, Haicheng Liao, Chengyue Wang, Xingcheng Liu, Jiaxun Zhang, Zhenning Li\n  - Publish Date: 2025.07.17\n  - Task: Generation\n  - Datasets: [New Benchmark Dataset](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.12762)\n  - Summary：\n    - Proposes a framework combining generative scene augmentation with adaptive temporal reasoning for reliable accident anticipation, addressing data scarcity and missing object-level cues.\n    - Develops a video generation pipeline using a world model guided by domain-informed prompts to create high-resolution, statistically consistent driving scenarios, enriching edge cases.\n    - Constructs a dynamic prediction model with strengthened graph convolutions and dilated temporal operators to handle data incompleteness and visual noise, and releases a new benchmark dataset.\n\n- [ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.12499)\n  - Yuhang Lu, Jiadong Tu, Yuexin Ma, Xinge Zhu\n  - Publisher: 4DV Lab\n  - Publish Date: 2025.07.16\n  - Project Page: [ReAL-AD](https:\u002F\u002F4dvlab.github.io\u002Fproject_page\u002Frealad)\n  - Task: End-to-End\n  - Summary：\n    - Proposes ReAL-AD, a Reasoning-Augmented Learning framework that structures decision-making based on a three-tier human cognitive model: Driving Strategy, Driving Decision, and Driving Operation.\n    - Integrates Vision-Language Models (VLMs) to enhance situational awareness and introduces a 
Strategic Reasoning Injector, Tactical Reasoning Integrator, and Hierarchical Trajectory Decoder for hierarchical reasoning and trajectory execution.\n    - Extensive evaluations show the framework improves planning accuracy and safety by over 30%, making end-to-end autonomous driving more interpretable and aligned with human-like reasoning.\n\n- [Unreal is all you need: Multimodal ISAC Data Simulation with Only One Engine](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.08716)\n  - Kongwu Huang, Shiyi Mu, Jun Jiang, Yuan Gao, Shugong Xu\n  - Publisher: Shanghai University\n  - Publish Date: 2025.07.11\n  - Code: [Great-MCD](https:\u002F\u002Fgithub.com\u002Fhkw-xg\u002FGreat-MCD)\n  - Task: Generation\n  - Datasets: [Great-MSD](https:\u002F\u002Fgithub.com\u002Fhkw-xg\u002FGreat-MCD)\n  - Summary：\n    - Proposes Great-X, a single-engine multimodal data twin platform that reconstructs Sionna's ray-tracing within Unreal Engine and integrates with autonomous driving tools for efficient, synchronized simulation of CSI, RGB, Radar, and LiDAR data.\n    - Constructs an open-source, large-scale, low-altitude UAV multimodal synaesthesia dataset named Great-MSD and proposes a baseline CSI-based UAV 3D localization algorithm, demonstrating its feasibility across different CSI simulation engines.\n\n- [VisioPath: Vision-Language Enhanced Model Predictive Control for Safe Autonomous Navigation in Mixed Traffic](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.06441)\n  - Shanting Wang, Panagiotis Typaldos, Chenjun Li, Andreas A. Malikopoulos\n  - Publish Date: 2025.07.08\n  - Task: Planning\n  - Datasets: [SUMO](https:\u002F\u002Fwww.eclipse.org\u002Fsumo\u002F)\n  - Summary：\n    - A novel framework combining vision-language models (VLMs) with model predictive control (MPC) for safe autonomous driving in dynamic traffic.\n    - Leverages a bird's-eye view pipeline and zero-shot VLM to extract structured vehicle information, constructing elliptical collision-avoidance potential fields for trajectory planning.\n    - Implements a finite-horizon optimal control problem solved via differential dynamic programming with adaptive regularization and an event-triggered MPC loop, including a safety verification layer.\n\n- [LeAD: The LLM Enhanced Planning System Converged with End-to-end Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.05754)\n  - Yuhang Zhang, Jiaqi Liu, Chengkai Xu, Peng Hang, Jian Sun\n  - Publish Date: 2025.07.08\n  - Task: Planning\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - LeAD, a dual-rate autonomous driving architecture integrating imitation learning-based end-to-end frameworks with large language model (LLM) augmentation for enhanced scenario comprehension and decision-making.\n    - The system uses a high-frequency E2E subsystem for real-time cycles and a low-frequency LLM module that employs multi-modal perception fusion and chain-of-thought reasoning to handle complex scenarios and edge cases.\n\n- [NRSeg: Noise-Resilient Learning for BEV Semantic Segmentation via Driving World Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.04002)\n  - Siyu Li, Fei Teng, Yihong Cao, Kailun Yang, Zhiyong Li, Yaonan Wang\n  - Publish Date: 2025.07.05\n  - Code: [NRSeg](https:\u002F\u002Fgithub.com\u002Flynn-yu\u002FNRSeg)\n  - Task: Perception\n  - Summary：\n    - Proposes NRSeg, a noise-resilient learning framework for BEV semantic segmentation to harness synthetic data from driving world models.\n    - Introduces a Perspective-Geometry 
Consistency Metric (PGCM) to evaluate the guidance capability of generated data and a Bi-Distribution Parallel Prediction (BiDPP) module to enhance model robustness.\n    - Achieves state-of-the-art performance with mIoU improvements of 13.8% and 11.4% in unsupervised and semi-supervised BEV segmentation tasks, respectively.\n\n- [FMOcc: TPV-Driven Flow Matching for 3D Occupancy Prediction with Selective State Space Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.02250)\n  - Jiangxia Chen, Tongyuan Huang, Ke Song\n  - Publish Date: 2025.07.03\n  - Task: Prediction\n  - Datasets: [Occ3D-nuScenes](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FOcc3D), [OpenOcc](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FOpenOcc)\n  - Summary：\n    - Proposes FMOcc, a Tri-perspective View (TPV) refinement occupancy network with a flow matching selective state space model for few-frame 3D occupancy prediction.\n    - Introduces a Flow Matching SSM module (FMSSM) to generate missing features and a TPV SSM layer with Plane Selective SSM (PS3M) to selectively filter TPV features, enhancing efficiency and prediction for distant scenes.\n    - Designs a Mask Training (MT) method to improve robustness against sensor data loss, achieving state-of-the-art results on Occ3D-nuScenes and OpenOcc datasets.\n\n- [VLAD: A VLM-Augmented Autonomous Driving Framework with Hierarchical Planning and Interpretable Decision Process](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01284)\n  - Cristian Gariboldi, Hayato Tokida, Ken Kinjo, Yuki Asada, Alexander Carballo\n  - Publish Date: 2025.07.02\n  - Task: End-to-End\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - VLAD, a vision-language autonomous driving model integrating a fine-tuned VLM with the VAD end-to-end system, using custom QA datasets to enhance spatial reasoning.\n    - The system generates high-level navigational commands and interpretable natural language explanations for driving decisions, improving transparency.\n    - Evaluation on nuScenes shows a 31.82% reduction in average collision rates compared to baseline methods.\n\n- [LLM-based Realistic Safety-Critical Driving Video Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01264)\n  - Yongjie Fu, Ruijian Zha, Pei Tian, Xuan Di\n  - Publish Date: 2025.07.02\n  - Task: Generation\n  - Datasets: [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Summary：\n    - A novel framework leveraging Large Language Models (LLMs) for few-shot code generation to automatically synthesize diverse and safety-critical driving scenarios within the CARLA simulator.\n    - Integrates a video generation pipeline using Cosmos-Transfer1 with ControlNet to convert rendered simulation scenes into realistic driving videos, bridging the simulation-to-real appearance gap.\n    - Enables controllable generation of rare edge cases, such as occluded pedestrian crossings or sudden vehicle cut-ins, for simulation-based testing of autonomous vehicles.\n\n- [World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.00603)\n  - Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, Dongbin Zhao\n  - Publisher: University of Chinese Academy of Sciences\n  - Publish Date: 2025.07.01\n  - Code: [World4Drive](https:\u002F\u002Fgithub.com\u002Fucaszyp\u002FWorld4Drive)\n  - Task: End-to-End\n  - Datasets: 
[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - World4Drive, an end-to-end autonomous driving framework that uses vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories without perception annotations.\n    - The framework extracts scene features and intentions, generates trajectories, predicts future states in latent space, and uses a selector module to choose the best trajectory via self-supervised alignment.\n    - Achieves state-of-the-art performance on nuScenes and NavSim benchmarks with significant reductions in L2 error and collision rate, and faster training convergence.\n\n- [When Digital Twins Meet Large Language Models: Realistic, Interactive, and Editable Simulation for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.00319)\n  - Tanmay Vilas Samak, Chinmay Vilas Samak, Bing Li, Venkat Krovi\n  - Publish Date: 2025.06.30\n  - Task: Generation\n  - Summary：\n    - A unified framework for creating high-fidelity digital twins to accelerate autonomous driving research, balancing dynamical fidelity, photorealistic rendering, context-relevant scenario orchestration, and real-time performance.\n    - Leverages physics-based and data-driven techniques for real2sim reconstruction with geometric\u002Fphotorealistic accuracy and infuses assets with physical properties for real-time dynamical simulation.\n    - Incorporates a large language model (LLM) interface to flexibly edit driving scenarios online via natural language prompts, demonstrating high structural similarity, frame rates, and prompt handling capabilities.\n\n- [StyleDrive: Towards Driving-Style Aware Benchmarking of End-To-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.23982)\n  - Ruiyang Hao, Bowen Jing, Haibao Yu, Zaiqing Nie\n  - Publish Date: 2025.06.30\n  - Project Page: [StyleDrive](https:\u002F\u002Fstyledrive.github.io\u002F)\n  - Task: Evaluation\n  - Datasets: [StyleDrive](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FRyhn98\u002FStyleDrive-Dataseta)\n  - Summary：\n    - Introduces the first large-scale real-world dataset explicitly curated for personalized end-to-end autonomous driving (E2EAD), integrating scene topology with dynamic context and semantics via a fine-tuned vision-language model.\n    - Proposes a hybrid annotation pipeline combining behavioral analysis, heuristics, and VLM reasoning, refined through human-in-the-loop verification.\n    - Establishes the first standardized benchmark for evaluating personalized E2EAD models, showing that incorporating driving preferences improves behavioral alignment with human demonstrations.\n\n- [Epona: Autoregressive Diffusion World Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.24113)\n  - Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, Xun Cao, Wei Yin\n  - Publisher: Tsinghua University\n  - Publish Date: 2025.06.30\n  - Project Page: [Epona](https:\u002F\u002Fgithub.com\u002FKevin-thu\u002FEpona\u002F)\n  - Code: [Epona](https:\u002F\u002Fgithub.com\u002FKevin-thu\u002FEpona\u002F)\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - Epona, an autoregressive diffusion world model for autonomous driving that enables localized spatiotemporal distribution modeling via 
decoupled spatiotemporal factorization and modular trajectory\u002Fvideo prediction.\n    - Introduces a novel chain-of-forward training strategy to address error accumulation in autoregressive loops, achieving state-of-the-art performance with 7.4% FVD improvement and minutes-longer prediction duration.\n    - The learned world model serves as a real-time motion planner, outperforming strong end-to-end planners on NAVSIM benchmarks.\n\n- [DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.02948)\n  - Zhiyi Hou, Enhui Ma, Fang Li, Zhiyi Lai, Kalok Ho, Zhanqian Wu, Lijun Zhou, Long Chen, Chitian Sun, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Kaicheng Yu\n  - Publish Date: 2025.06.28\n  - Code: [DriveMRP](https:\u002F\u002Fgithub.com\u002FSII-HZY\u002FDriveMRP)\n  - Task: Prediction\n  - Summary：\n    - Introduces DriveMRP, a method to enhance Vision-Language Model (VLM) motion risk prediction by synthesizing high-risk motion data via a Bird's-Eye View (BEV) based simulation.\n    - Proposes the DriveMRP-Agent framework with a novel information injection strategy for global context, ego perspective, and trajectory projection to improve spatial reasoning.\n    - Demonstrates significant performance gains, boosting accident recognition accuracy from 27.13% to 88.03% on synthetic data and from 29.42% to 68.50% in zero-shot real-world evaluation.\n\n- [Case-based Reasoning Augmented Large Language Model Framework for Decision Making in Realistic Safety-Critical Driving Scenarios](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.20531)\n  - Wenbin Gan, Minh-Son Dao, Koji Zettsu\n  - Publisher: NICT, University of Hyogo\n  - Publish Date: 2025.06.25\n  - Task: Planning\n  - Summary：\n    - Presents a Case-Based Reasoning Augmented LLM (CBR-LLM) framework for evasive maneuver decision-making in complex, safety-critical driving scenarios.\n    - Integrates semantic scene understanding from dashcam videos with retrieval of relevant past driving cases to generate context-sensitive, human-aligned maneuver recommendations.\n    - Demonstrates improved decision accuracy, justification quality, and alignment with human experts across multiple LLMs, with risk-aware prompting and similarity-based case retrieval enhancing performance.\n\n- [Unified Vision-Language-Action Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.19850)\n  - Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, Zhaoxiang Zhang\n  - Publish Date: 2025.06.24\n  - Task: Planning\n  - Datasets: [CALVIN](https:\u002F\u002Fgithub.com\u002Fmees\u002Fcalvin), [LIBERO](https:\u002F\u002Flibero-project.github.io\u002F), [Simplenv-Bridge](https:\u002F\u002Fsimpler-env.github.io\u002F), [ALOHA](https:\u002F\u002Fmobile-aloha.github.io\u002F)\n  - Summary：\n    - UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences for flexible multimodal task learning.\n    - Incorporates world modeling during post-training to capture causal dynamics from videos, facilitating effective transfer to downstream policy learning, especially for long-horizon tasks.\n    - Achieves new state-of-the-art results on simulation benchmarks (e.g., 95.5% on LIBERO) and demonstrates broad applicability on real-world ALOHA manipulation and autonomous driving.\n\n- [Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with 
Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.18234)\n  - Yue Li, Meng Tian, Dechang Zhu, Jiangtong Zhu, Zhenyu Lin, Zhiwei Xiong, Xinhai Zhao\n  - Publish Date: 2025.06.23\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F), [DriveLM-nuScenes](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM)\n  - Summary：\n    - Proposes Drive-R1, a domain-specific VLM designed to bridge scenario reasoning and motion planning for autonomous driving.\n    - Employs a two-stage training approach: supervised fine-tuning on a dataset with chain-of-thought reasoning, followed by reinforcement learning to align reasoning paths with planning outcomes.\n\n- [DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.17590)\n  - Mihir Godbole, Xiangbo Gao, Zhengzhong Tu\n  - Publisher: University of California, Berkeley, University of Michigan\n  - Publish Date: 2025.06.21\n  - Task: Evaluation\n  - Datasets: [DRAMA](https:\u002F\u002Fusa.honda-ri.com\u002Fdrama)\n  - Summary：\n    - Introduces DRAMA-X, a fine-grained benchmark for evaluating multi-class intent prediction and risk reasoning in safety-critical driving scenarios, constructed from the DRAMA dataset via an automated annotation pipeline.\n    - Proposes SGG-Intent, a lightweight, training-free reasoning framework that uses a VLM-backed scene graph generator and an LLM for compositional reasoning to perform object detection, intent prediction, risk assessment, and action suggestion.\n    - Provides structured evaluation of four interrelated autonomous driving tasks and demonstrates that scene-graph-based reasoning enhances intent and risk prediction, especially with explicit contextual modeling.\n\n- [NetRoller: Interfacing General and Specialized Models for End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14589)\n  - Ren Xin, Hongji Liu, Xiaodong Mei, Wenru Liu, Maosheng Ye, Zhili Chen, Jun Ma\n  - Publish Date: 2025.06.17\n  - Code: [NetRoller](https:\u002F\u002Fgithub.com\u002FRex-sys-hk\u002FNetRoller)\n  - Task: End-to-End\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - Summary：\n    - Proposes NetRoller, an adapter with novel mechanisms to seamlessly integrate General Models (GMs) like LLMs with Specialized Models (SMs) for autonomous driving, addressing asynchronous system challenges.\n    - Introduces a three-stage interfacing approach: harvesting efficient LLM representations via early stopping, robust cross-modality translation, and performance enhancement of SMs via Query Shift and Feature Shift mechanisms.\n    - Demonstrates significant improvements in human similarity and safety for planning, and precision in detection\u002Fmapping tasks on nuScenes, enabling SMs to operate at native frequencies with GM situational awareness.\n\n- [ADRD: LLM-Driven Autonomous Driving Based on Rule-based Decision Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14299)\n  - Fanzhi Zeng, Siqi Wang, Chuzhao Zhu, Li Li\n  - Publish Date: 2025.06.17\n  - Task: Planning\n  - Summary：\n    - ADRD, a novel framework that leverages LLMs to generate executable, rule-based decision systems for interpretable autonomous driving.\n    - The framework integrates three core modules: an Information Module, an Agents Module, and a Testing Module to iteratively generate and refine driving tactics.\n    - Demonstrates superior performance in interpretability, response speed, 
and driving performance compared to traditional RL and advanced LLM-based methods.\n\n- [A Hierarchical Test Platform for Vision Language Model (VLM)-Integrated Real-World Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14100)\n  - Yupeng Zhou, Can Cui, Juntong Peng, Zichong Yang, Juanwu Lu, Jitesh H Panchal, Bin Yao, Ziran Wang\n  - Publisher: Purdue University\n  - Publish Date: 2025.06.17\n  - Task: Evaluation\n  - Summary:\n    - The paper introduces a lightweight, structured, and low-latency middleware pipeline on the vehicle and develops customizable real-world traffic scenarios on a closed test track.\n\n- [AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13757)\n  - Zewei Zhou, Tianhui Cai, Seth Z. Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, Jiaqi Ma\n  - Publisher: University of California, Los Angeles\n  - Publish Date: 2025.06.16\n  - Project Page: [AutoVLA](https:\u002F\u002Fautovla.github.io\u002F)\n  - Code: [AutoVLA](https:\u002F\u002Fgithub.com\u002Fucla-mobility\u002FAutoVLA)\n  - Task: Planning\n  - Datasets: [nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan), [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes), [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen), [Bench2Drive](https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FBench2Drive) (Using [CARLA-Garage Dataset](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fcarla_garage) for Training)\n  - Summary:\n    - AutoVLA is an end-to-end autonomous driving framework leveraging a pretrained VLM backbone integrated with physical discrete action tokens.\n    - Uses GRPO to enable adaptive reasoning and further enhance the model’s performance on end-to-end driving tasks.\n\n- [On the Natural Robustness of Vision-Language Models Against Visual Perception Attacks in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.11472)\n  - Pedram MohajerAnsari, Amir Salarpour, Michael Kühr, Siyu Huang, Mohammad Hamad, Sebastian Steinhorst, Habeeb Olufowobi, Mert D. 
Pesé \n  - Publisher: Clemson University, Technical University of Munich, University of Texas at Arlington\n  - Publish Date: 2025.06.13\n  - Task: Perception\n  - Code: [V2LM](https:\u002F\u002Fgithub.com\u002Fpedram-mohajer\u002FV2LM)\n  - Summary:\n    - This paper introduces Vehicle Vision Language Models (V2LMs), VLMs fine-tuned specifically for AV perception tasks.\n\n- [ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08052)\n  - Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang\n  - Publisher: Huazhong University of Science and Technology, Xiaomi EV\n  - Publish Date: 2025.06.09\n  - Project Page: [ReCogDrive](https:\u002F\u002Fxiaomi-research.github.io\u002Frecogdrive\u002F)\n  - Code: [ReCogDrive](https:\u002F\u002Fxiaomi-research.github.io\u002Frecogdrive\u002F)\n  - Task: Planning\n  - Datasets: [NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - Summary：\n    - ReCogDrive, a novel reinforced cognitive framework for End-to-End Autonomous Driving, with a Vision-Language Model (VLM) and a Diffusion-based Planner enhanced by Reinforcement Learning.\n    - ReCogDrive utilizes a three-stage training paradigm, starting with driving pre-training, followed by imitation learning and GRPO reinforcement learning to enhance the planning capabilities of the Vision-Language-Action (VLA) system.\n\n- [STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06218)\n  - Christian Fruhwirth-Reisinger, Dušan Malić, Wei Lin, David Schinagl, Samuel Schulter, Horst Possegger\n  - Publisher: Graz University of Technology, Christian Doppler Laboratory for Embedded Machine Learning, Johannes Kepler University Linz, Amazon\n  - Publish Date: 2025.06.06\n  - Task: QA\n  - Code: [STSBench](https:\u002F\u002Fgithub.com\u002FLRP-IVC\u002FSTSBench)\n  - Dataset: [STSBench](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fivc-lrp\u002FSTSBench)\n  - Summary:\n    - STSBench, a framework for automatic scenario mining from large-scale autonomous driving datasets with rich ground truth annotations.\n    - Applied to the nuScenes dataset, it presents STSnu, the first benchmark that evaluates the spatio-temporal reasoning capabilities of VLMs based on comprehensive 3D perception.\n\n- [Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05442)\n  - Hao Jiang, Chuan Hu, Yukang Shi, Yuan He, Ke Wang, Xi Zhang, Zhipeng Zhang\n  - Publisher: Shanghai Jiao Tong University, KargoBot\n  - Publish Date: 2025.06.05\n  - Task: VQA\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - Summary:\n    - The paper introduces a structured and concise benchmark dataset, NuScenes-S, which is derived from the nuScenes dataset and contains machine-friendly structured representations.\n\n- [AD-EE: Early Exiting for Fast and Reliable Vision-Language Models in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05404)\n  - Lianming Huang, Haibo Hu, Yufei Cui, Jiacheng Zuo, Shangyu Wu, Nan Guan, Chun Jason Xue\n  - Publisher: City University of Hong Kong, McGill University, MBZUAI, Soochow University\n  - Publish Date: 2025.06.04\n  - Task: Perception, VQA\n  - Dataset: 
[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F), [corner-case-focused CODA](https:\u002F\u002Fcoda-dataset.github.io\u002F)\n  - Summary:\n    - AD-EE, an Early Exit framework that incorporates domain characteristics of autonomous driving and leverages causal inference to identify optimal exit layers.\n    - AD-EE proposes a Causal Inference-based approach to identify and analyze the optimal early exit layers for enhanced VLM inference.\n\n- [DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20665)\n  - Muxi Diao, Lele Yang, Hongbo Yin, Zhexu Wang, Yejie Wang, Daxin Tian, Kongming Liang, Zhanyu Ma\n  - Publisher: Beijing University of Posts and Telecommunications, Zhongguancun Academy, Beihang University\n  - Publish Date: 2025.05.27\n  - Project Page: [DriveRX](https:\u002F\u002Fpris-cv.github.io\u002FDriveRX\u002F)\n  - Task: VQA\n  - Summary:\n    - AutoDriveRL, a unified training framework that formulates autonomous driving as a structured reasoning process over four core tasks. Each task is independently modeled as a vision-language QA problem and optimized using task-specific reward models, enabling fine-grained reinforcement signals at different reasoning stages.\n    - Within this framework, the authors train DriveRX, a cross-task reasoning VLM designed for real-time decision-making.\n\n- [WOMD-Reasoning: A Large-Scale Dataset for Interaction Reasoning in Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.04281)\n  - Yiheng Li, Cunxin Fan, Chongjian Ge, Zhihao Zhao, Chenran Li, Chenfeng Xu, Huaxiu Yao, Masayoshi Tomizuka, Bolei Zhou, Chen Tang, Mingyu Ding, Wei Zhan **ICML 2025**\n  - Publisher: UC Berkeley, UCLA, UNC-Chapel Hill, UT Austin\n  - Publish Date: 2025.05.25\n  - Code: [WOMD-Reasoning](https:\u002F\u002Fgithub.com\u002Fyhli123\u002FWOMD-Reasoning)\n  - Dataset: [Waymo Open Motion Dataset](https:\u002F\u002Fwaymo.com\u002Fopen\u002Fdownload)\n  - Task: VQA\n  - Summary:\n    - WOMD-Reasoning is a language annotation dataset built on the Waymo Open Motion Dataset (WOMD), with a focus on describing and reasoning about interactions and intentions in driving scenarios.\n\n- [FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17685)\n  - Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei\n  - Publisher: Alibaba Group, Xi’an Jiaotong University\n  - Publish Date: 2025.05.23\n  - Task: Generation, Planning\n  - Code: [FSDrive](https:\u002F\u002Fgithub.com\u002FMIV-XJTU\u002FFSDrive)\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - Summary:\n    - A spatio-temporal CoT reasoning method that allows the model to enhance trajectory planning by thinking visually from future temporal and spatial dimensions.\n    - A unified pre-training paradigm for visual generation and understanding. 
\n\n- [Human-like Semantic Navigation for Autonomous Driving using Knowledge Representation and Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16498)\n  - Augusto Luis Ballardini, Miguel Ángel Sotelo\n  - Publisher: University of Alcalá\n  - Publish Date: 2025.05.22\n  - Task: QA\n  - Summary:\n    - The paper explores the use of Large Language Models to generate Answer Set Programming rules by translating informal navigation instructions into structured, logic-based reasoning.\n\n- [VL-SAFE: Vision-Language Guided Safety-Aware Reinforcement Learning with World Models for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16377)\n  - Yansong Qu, Zilin Huang, Zihao Sheng, Jiancong Chen, Sikai Chen, Samuel Labi\n  - Publisher: Purdue University, University of Wisconsin-Madison\n  - Publish Date: 2025.05.22\n  - Project Page: [VL-SAFE](https:\u002F\u002Fys-qu.github.io\u002Fvlsafe-website\u002F)\n  - Code: [VL-SAFE](https:\u002F\u002Fgithub.com\u002Fys-qu\u002Fvl-safe\u002Ftree\u002Fmain)\n  - Task: Framework\n  - Summary:\n    - VL-SAFE, a world model-based safe RL framework with a Vision-Language Model (VLM)-as-safety-guidance paradigm, designed for offline safe policy learning.\n\n- [DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16278)\n  - Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan **CVPR 2026**\n  - Publisher: Shanghai Jiao Tong University\n  - Publish Date: 2025.05.22\n  - Project Page: [DriveMoE](https:\u002F\u002Fthinklab-sjtu.github.io\u002FDriveMoE\u002F)\n  - Code: [DriveMoE](https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FDriveMoE)\n  - Dataset: [Bench2Drive](https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FBench2Drive)\n  - Task: Planning\n  - Summary:\n    - DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE.\n    - DriveMoE is built upon our Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-π0.\n\n- [AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15298)\n  - Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, Yining Shi, He Zhe Lim, Li Liu, Tianbao Zhou, Huang Yu, Yifei Hu, Guang Li, Guang Chen, Hao Ye, Lijun Sun, Diange Yang\n  - Publisher: Tsinghua University, McGill University, Xiaomi Corporation, University of Wisconsin-Madison\n  - Publish Date: 2025.06.12\n  - Task: VQA\n  - Summary:\n    - AgentThink, the first framework to integrate dynamic, agent-style tool invocation into vision-language reasoning for autonomous driving tasks.\n    - A two-stage training pipeline that combines SFT with GRPO.\n\n- [Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.08725)\n  - Zongchuang Zhao, Haoyu Fu, Dingkang Liang, Xin Zhou, Dingyuan Zhang, Hongwei Xie, Bing Wang, Xiang Bai\n  - Publisher: Huazhong University of Science and Technology, Xiaomi EV\n  - Publish Date: 2025.05.13\n  - Task: VQA\n  - Code: [DriveMonkey](https:\u002F\u002Fgithub.com\u002Fzc-zhao\u002FDriveMonkey)\n  - Summary:\n    - NuInteract, a large-scale dataset for advancing LVLMs in autonomous driving. 
With 239K images, 34K frames, and over 1.5M image-language pairs across 850 scenes, NuInteract provides dense captions detailing the surrounding environment and 2D\u002F3D annotations for tasks like 2D\u002F3D visual grounding, enabling comprehensive perception, prediction, and planning.\n    - DriveMonkey, a flexible framework supporting multiple interactive tasks via user prompts.\n\n- [Towards Human-Centric Autonomous Driving: A Fast-Slow Architecture Integrating Large Language Model Guidance with Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.06875)\n  - Chengkai Xu, Jiaqi Liu, Yicheng Guo, Yuhang Zhang, Peng Hang, Jian Sun\n  - Publisher: Tongji University\n  - Publish Date: 2025.05.11\n  - Task: Planning\n  - Env: [Highway-Env](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - Summary:\n    - A “fast-slow” decision-making framework that integrates a Large Language Model (LLM) for high-level instruction parsing with a Reinforcement Learning (RL) agent for low-level real-time decision-making.\n\n- [Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.06413)\n  - Ming Liu, Siyuan Liang, Koushik Howlader, Liwen Wang, Dacheng Tao, Wensheng Zhang\n  - Publisher: Iowa State University, National University of Singapore\n  - Publish Date: 2025.05.09\n  - Task: VQA\n  - Datasets: [DriveLM](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM)\n  - Summary:\n    - It proposes a natural reflection-based backdoor attack targeting VLM systems in autonomous driving scenarios, aiming to induce substantial response delays when specific visual triggers are present.\n\n- [DSDrive: Distilling Large Language Model for Lightweight End-to-End Autonomous Driving with Unified Reasoning and Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.05360)\n  - Wenru Liu, Pei Liu, Jun Ma\n  - Publisher: The Hong Kong University of Science and Technology\n  - Publish Date: 2025.05.08\n  - Task: Planning\n  - Video: [DSDrive](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=op8PzQurugY)\n  - Summary:\n    - DSDrive, a lightweight E2E AD framework that employs a compact LLM to process multi-modal inputs for explicit reasoning and closed-loop planning. 
Specifically, it utilizes knowledge distillation to empower the compact LLM to undertake the reasoning and planning tasks, thereby improving its overall performance.\n    - A novel waypoint-driven dual-head coordination module that bridges high-level reasoning and low-level trajectory planning.\n\n- [X-Driver: Explainable Autonomous Driving with Vision-Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.05098)\n  - Wei Liu, Jiyuan Zhang, Binxiong Zheng, Yufeng Hu, Yingzhan Lin, Zengfeng Zeng\n  - Publisher: Harbin Institute of Technology, Baidu Inc\n  - Publish Date: 2025.05.08\n  - Task: VQA, Planning\n  - Dataset: [Bench2Drive](https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FBench2Drive)\n  - Summary:\n    - X-Driver, a unified multi-modal large language model (MLLM) framework designed for closed-loop autonomous driving, leveraging Chain-of-Thought (CoT) and autoregressive modeling to enhance perception and decision-making.\n\n- [Seeking to Collide: Online Safety-Critical Scenario Generation for Autonomous Driving with Retrieval Augmented Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00972)\n  - Yuewen Mei, Tong Nie, Jian Sun, Ye Tian\n  - Publisher: Tongji University, The Hong Kong Polytechnic University\n  - Publish Date: 2025.05.02\n  - Task: Generation\n  - Dataset: [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Summary:\n    - An LLM-based agent framework is proposed to generate interactive and safety-critical scenarios online.\n    - A memorization and retrieval mechanism is developed to continuously adapt LLMs to changing scenarios.\n\n- [V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00156)\n  - Jannik Lübberstedt, Esteban Rivera, Nico Uhlemann, Markus Lienkamp\n  - Publisher: Technical University of Munich, Munich Institute of Robotics and Machine Intelligence\n  - Publish Date: 2025.04.30\n  - Task: VQA\n  - Datasets: [LingoQA](https:\u002F\u002Fgithub.com\u002Fwayveai\u002FLingoQA)\n  - Summary:\n    - Proposal and evaluation of V3LMA, a novel method that combines the strengths of LLMs and LVLMs to enhance 3D scene understanding in traffic scenarios, without requiring model training or fine-tuning.\n\n- [Enhancing Autonomous Driving Systems with On-Board Deployed Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.11514)\n  - Nicolas Baumann, Cheng Hu, Paviththiren Sivasothilingam, Haotong Qin, Lei Xie, Michele Magno, Luca Benini **RSS 2025**\n  - Publisher: ETH Zurich, Zhejiang University\n  - Publish Date: 2025.04.15\n  - Task: Planning\n  - Summary:\n    - A hybrid architecture combining a low-level Model Predictive Controller (MPC) with locally deployed Large Language Models (LLMs) to enhance decision-making and Human Machine Interaction (HMI).\n\n- [NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.03164)\n  - Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, Zhengzhong Tu\n  - Publisher: Texas A&M University, University of Wisconsin-Madison\n  - Publish Date: 2025.04.07\n  - Project Page: [NuScenes-SpatialQA](https:\u002F\u002Ftaco-group.github.io\u002FNuScenes-SpatialQA\u002F)\n  - Task: VQA\n  - Summary:\n    - NuScenes-SpatialQA, the first large-scale ground-truth-based Question-Answer (QA) benchmark specifically designed to evaluate the spatial understanding and reasoning capabilities of VLMs in autonomous 
driving.\n\n- [OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23463)\n  - Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, Alois C. Knoll\n  - Publisher: Technical University of Munich, Ludwig Maximilian University of Munich\n  - Publish Date: 2025.03.30\n  - Project Page: [OpenDriveVLA](https:\u002F\u002Fdrivevla.github.io\u002F)\n  - Code: [OpenDriveVLA](https:\u002F\u002Fgithub.com\u002FDriveVLA\u002FOpenDriveVLA)\n  - Task: VQA, Planning\n  - Summary:\n    - OpenDriveVLA, a Vision-Language Action (VLA) model designed for end-to-end autonomous driving. \n\n- [VLM-C4L: Continual Core Dataset Learning with Corner Case Optimization via Vision-Language Models for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23046)\n  - Haibo Hu, Jiacheng Zuo, Yang Lou, Yufei Cui, Jianping Wang, Nan Guan, Jin Wang, Yung-Hui Li, Chun Jason Xue **COLM 2025**\n  - Publisher: City University of Hong Kong, Soochow University, McGill University, Hon Hai Research Institute, Mohamed bin Zayed University of Artificial Intelligence\n  - Publish Date: 2025.03.29\n  - Project Page: [VLM-C4L](https:\u002F\u002Fvlmc4l.site\u002F)\n  - Task: Perception\n  - Summary:\n    - VLM-C4L, a continual learning framework that introduces Vision-Language Models (VLMs) to dynamically optimize and enhance corner case datasets.\n    - VLM-C4L combines VLM-guided high-quality data extraction with a core data replay strategy, enabling the model to incrementally learn from diverse corner cases while preserving performance on previously seen routine scenarios, thus ensuring long-term stability and adaptability in real-world autonomous driving.\n\n- [Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.21505)\n  - Yue Li, Meng Tian, Zhenyu Lin, Jiangtong Zhu, Dechang Zhu, Haiqiang Liu, Zining Wang, Yueyi Zhang, Zhiwei Xiong, Xinhai Zhao\n  - Publisher: University of Science and Technology of China, Huawei Noah’s Ark Lab, University of California, Berkeley\n  - Publish Date: 2025.03.27\n  - Code: [VLADBench](https:\u002F\u002Fgithub.com\u002FDepth2World\u002FVLADBench)\n  - Task: VQA\n  - Summary:\n    - VLADBench, specifically designed to rigorously evaluate the capabilities of VLMs in AD. VLADBench employs a hierarchical structure that reflects the complex skill set required for reliable driving, progressing from fundamental scene and traffic elements comprehension to advanced reasoning and decision-making.\n\n- [ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.19755)\n  - Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, Xiang Bai\n  - Publisher: Huazhong University of Science and Technology, Xiaomi EV\n  - Publish Date: 2025.03.25\n  - Task: Planning\n  - Project Page: [ORION](https:\u002F\u002Fxiaomi-mlab.github.io\u002FOrion\u002F)\n  - Code: [ORION](https:\u002F\u002Fgithub.com\u002Fxiaomi-mlab\u002FOrion)\n  - Dataset: [Bench2Drive](https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FBench2Drive)\n  - Summary:\n    - ORION, a hOlistic E2E autonomous dRiving framework by vIsion-language instructed actiON generation. 
ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving scenario reasoning, and a generative planner for precision trajectory prediction.\n\n- [AED: Automatic Discovery of Effective and Diverse Vulnerabilities for Autonomous Driving Policy with Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.20804)\n  - Le Qiu, Zelai Xu, Qixin Tan, Wenhao Tang, Chao Yu, Yu Wang\n  - Publisher: Tsinghua University, Beijing Zhongguancun Academy\n  - Publish Date: 2025.03.24\n  - Task: Planning\n  - Env: [Highway-Env](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - Summary:\n    - AED, a framework that uses large language models (LLMs) to Automatically discover Effective and Diverse vulnerabilities in autonomous driving policies.\n    - AED first utilizes an LLM to automatically design reward functions for RL training.\n\n- [AutoDrive-QA- Automated Generation of Multiple-Choice Questions for Autonomous Driving Datasets Using Large Vision-Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.15778)\n  - Boshra Khalili, Andrew W. Smyth\n  - Publisher: Columbia University\n  - Publish Date: 2025.03.20\n  - Task: VQA\n  - Summary:\n    - AutoDrive-QA, an automatic pipeline that converts existing driving QA datasets (including DriveLM, NuScenes-QA, and LingoQA) into a structured multiple-choice question (MCQ) format.\n\n- [RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.13861)\n  - Yujin Wang, Quanfeng Liu, Zhengxin Jiang, Tianyi Wang, Junfeng Jiao, Hongqing Chu, Bingzhao Gao, Hong Chen\n  - Publisher: Tongji University, Yale University, University of Texas at Austin\n  - Publish Date: 2025.03.18\n  - Task: VQA\n  - Summary:\n    - Proposes a retrieval-augmented decision-making (RAD) framework, a novel architecture designed to enhance VLMs’ capabilities to reliably generate meta-actions in autonomous driving scenes.\n\n- [A Framework for a Capability-driven Evaluation of Scenario Understanding for Multimodal Large Language Models in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11400)\n  - Tin Stribor Sohn, Philipp Reis, Maximilian Dillitzer, Johannes Bach, Jason J. Corso, Eric Sax\n  - Publisher: Dr. Ing. h.c. F. Porsche AG, Forschungszentrum Informatik, Hochschule Esslingen, University of Michigan, Karlsruher Institut für Technologie\n  - Publish Date: 2025.03.14\n  - Task: Evaluation\n  - Summary:\n    - This paper proposes a holistic framework for a capability-driven evaluation of MLLMs in autonomous driving. 
The framework structures scenario understanding along the four core capability dimensions: semantic, spatial, temporal, and physical.\n\n- [DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11265)\n  - Xirui Zhou, Lianlei Shan, Xiaolin Gui\n  - Publisher: Xi’an Jiaotong University, University of Chinese Academy of Sciences\n  - Publish Date: 2025.03.14\n  - Task: VQA\n  - Summary:\n    - DynRsl-VLM incorporates a dynamic resolution image input processing approach that captures all entity feature information within an image while ensuring that the image input remains computationally tractable for the Vision Transformer (ViT).\n\n- [SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09594)\n  - Katrin Renz, Long Chen, Elahe Arani, Oleg Sinavski **CVPR 2025**\n  - Publisher: Wayve, University of Tübingen, Tübingen AI Center\n  - Publish Date: 2025.03.12\n  - Task: Planning\n  - Datasets: [CARLA Leaderboard V2](https:\u002F\u002Fleaderboard.carla.org\u002Fget_started_v2_0\u002F), [Bench2Drive](https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FBench2Drive)\n  - Summary:\n    - (1) A VLM-based driving model that achieves state-of-the-art driving performance on the official CARLA Leaderboard 2.0 and the local benchmark Bench2Drive in the CARLA simulator.\n    (2) A new task (Action Dreaming), which comes with a methodology to collect instruction-action pairs and a benchmark to evaluate the connection of language and action understanding without having to execute unsafe actions.\n    (3) A generalist model that achieves not only good driving performance but also includes several language-related tasks in the same model.\n\n- [CoT-Drive: Efficient Motion Forecasting for Autonomous Driving with LLMs and Chain-of-Thought Prompting](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07234)\n  - Haicheng Liao, Hanlin Kong, Bonan Wang, Chengyue Wang, Wang Ye, Zhengbing He, Chengzhong Xu, Zhenning Li **IEEE TAI 2025**\n  - Publisher: University of Macau, Massachusetts Institute of Technology\n  - Publish Date: 2025.03.10\n  - Task: VQA\n  - Summary:\n    - Introduce a teacher-student knowledge distillation strategy to effectively transfer LLMs’ advanced scene understanding capabilities to lightweight language models (LMs), ensuring that CoT-Drive operates in real-time on edge devices while maintaining comprehensive scene understanding and generalization capabilities.\n\n- [Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.06497)\n  - Enming Zhang, Peizhe Gong, Xingyuan Dai, Yisheng Lv, Qinghai Miao\n  - Publisher: University of Chinese Academy of Sciences\n  - Publish Date: 2025.03.09\n  - Code: [SCD-Bench](https:\u002F\u002Fgithub.com\u002FEMZucas\u002FSCD-Bench)\n  - Task: Evaluation\n  - Summary:\n    - Propose a novel evaluation method, the Safety Cognitive Driving Benchmark (SCD-Bench), and develop the Autonomous Driving Image-Text Annotation System (ADA).\n\n- [VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.18042)\n  - Pei Liu, Haipeng Liu, Haichao Liu, Xin Liu, Jinxin Ni, Jun Ma\n  - Publisher: The Hong Kong University of Science and Technology (Guangzhou), Li Auto Inc., Xiamen University, The Hong Kong University of Science and Technology\n  - Publish 
Date: 2025.02.25\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - Summary:\n    - VLME2E, a novel framework that uses the VLMs to enhance training by providing attentional cues.\n    - Integrates textual representations into Bird’s-Eye-View (BEV) features for semantic supervision, which enables the model to learn richer feature representations that explicitly capture the driver’s attentional semantics.\n\n- [CurricuVLM: Towards Safe Autonomous Driving via Personalized Safety-Critical Curriculum Learning with Vision-Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.15119)\n  - Zihao Sheng, Zilin Huang, Yansong Qu, Yue Leng, Sruthi Bhavanam, Sikai Chen\n  - Publisher: University of Wisconsin-Madison, Purdue University\n  - Publish Date: 2025.02.21\n  - Task: Planning\n  - Project Page: [CurricuVLM](https:\u002F\u002Fzihaosheng.github.io\u002FCurricuVLM\u002F)\n  - Datasets: [MetaDrive](https:\u002F\u002Fgithub.com\u002Fmetadriverse\u002Fmetadrive), [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Summary:\n    - CurricuVLM, a novel framework that leverages Vision-Language Models (VLMs) to enable personalized curriculum learning for autonomous driving agents.\n    - CurricuVLM is the first work to utilize VLMs for dynamic curriculum generation in closed-loop autonomous driving training.\n\n- [V2V-LLM: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multi-Modal Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09980)\n  - Hsu-kuang Chiu, Ryo Hachiuma, Chien-Yi Wang, Stephen F. Smith, Yu-Chiang Frank Wang, Min-Hung Chen\n  - Publisher: NVIDIA, Carnegie Mellon University\n  - Publish Date: 2025.02.14\n  - Task: VQA\n  - Summary:\n    - Create and introduce the V2V-QA dataset to support the development and evaluation of LLM-based approaches to end-to-end cooperative autonomous driving.\n    - Propose a baseline method V2V-LLM for cooperative autonomous driving to provide an initial benchmark for V2V-QA.\n\n- [Occ-LLM: Enhancing Autonomous Driving with Occupancy-Based Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06419)\n  - Tianshuo Xu, Hao Lu, Xu Yan, Yingjie Cai, Bingbing Liu, Yingcong Chen **ICRA 2025**\n  - Publisher: Hong Kong University of Science and Technology (Guangzhou), Huawei Noah’s Ark\nLab\n  - Publish Date: 2025.02.10\n  - Task: Perception, Planning\n  - Dataset: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - Summary:\n    - Introduce an occupancy-based large language model (Occ-LLM) for autonomous driving, demonstrating superior scene comprehension.\n\n- [INSIGHT: Enhancing Autonomous Driving Safety through Vision-Language Models on Context-Aware Hazard Detection and Edge Case Evaluation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.00262)\n  - Dianwei Chen, Zifan Zhang, Yuchen Liu, Xianfeng Terry Yang\n  - Publisher: University of Maryland, North Carolina State University\n  - Publish Date: 2025.02.01\n  - Task: VQA\n  - Summary:\n    - INSIGHT (Integration of Semantic and Visual Inputs for Generalized Hazard Tracking), a hierarchical vision-language model (VLM) framework designed to enhance hazard detection and edge-case evaluation.\n\n- [LLM-attacker: Enhancing Closed-loop Adversarial Scenario Generation for Autonomous Driving with Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.15850)\n  - Yuewen Mei, Tong Nie, Jian Sun, Ye Tian  **IEEE TITS 2025**\n  - Publisher: Tongji University, The Hong Kong Polytechnic 
University\n  - Publish Date: 2025.01.27\n  - Video: [LLM-attacker](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F15rROV_8LUcc2jXSuSNBHVOHMCFKSn__B\u002Fview)\n  - Task: Generation\n  - Dataset: [MetaDrive](https:\u002F\u002Fgithub.com\u002Fmetadriverse\u002Fmetadrive), [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Summary:\n    - LLM-attacker: a closed-loop adversarial scenario generation framework leveraging large language models (LLMs).\n\n- [Black-Box Adversarial Attack on Vision Language Models for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.13563)\n  - Lu Wang, Tianyuan Zhang, Yang Qu, Siyuan Liang, Yuwei Chen, Aishan Liu, Xianglong Liu, Dacheng Tao\n  - Publisher: Beihang University, National University of Singapore, Aviation Industry Development Research Center of China, Nanyang Technological University\n  - Publish Date: 2025.01.23\n  - Task: VQA\n  - Summary:\n    - Cascading Adversarial Disruption (CAD) first introduces Decision Chain Disruption, which targets low-level reasoning breakdown by generating and injecting deceptive semantics, ensuring the perturbations remain effective across the entire decision-making chain.\n\n- [Distilling Multi-modal Large Language Models for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.09757)\n  - Deepti Hegde, Rajeev Yasarla, Hong Cai, Shizhong Han, Apratim Bhattacharyya, Shweta Mahajan, Litian Liu, Risheek Garrepalli, Vishal M. Patel, Fatih Porikli\n  - Publisher: Johns Hopkins University, Qualcomm AI Research\n  - Publish Date: 2025.01.16\n  - Task: Planning\n  - Summary:\n    - DiMA, an end-to-end autonomous driving framework that distills knowledge from an MLLM to a vision-based planner to ensure robustness to long-tail events while maintaining efficiency.\n    - Propose a distillation task along with the following surrogate tasks to align the objectives of the vision-based planner and the MLLM: (i) masked token reconstruction, (ii) future token prediction, and (iii) scene editing.\n\n- [Modeling Language for Scenario Development of Autonomous Driving Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.09319)\n  - Toshiaki Aoki, Takashi Tomita, Tatsuji Kawai, Daisuke Kawakami, Nobuo Chida\n  - Publisher: JAIST, Kochi University, Mitsubishi Electric Corporation\n  - Publish Date: 2025.01.16\n  - Task: Development\n  - Summary:\n    - This study introduces a notation called the car position diagram (CPD). The CPD allows for the concise representation of numerous scenarios and is particularly suitable for scenario analysis and design.\n\n- [Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.06680)\n  - Haoxiang Gao, Yu Zhao\n  - Publisher: Motional AD LLC, University of Toronto\n  - Publish Date: 2025.01.12\n  - Task: VQA\n  - Summary:\n    - Analyze effective knowledge distillation of LLM semantic labels to smaller vision networks, which can be used for the semantic representation of complex scenes for downstream decision-making for planning and control.\n\n- [DriVLM: Domain Adaptation of Vision-Language Models in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.05081)\n  - Xuran Zheng, Chang D. 
Yoo\n  - Publisher: KAIST\n  - Publish Date: 2025.01.09\n  - Task: VQA\n  - Summary:\n    - Explore the utility of small-scale MLLMs and applied small-scale MLLMs to the field of autonomous driving.\n\n- [Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene Understanding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.05566)\n  - Mohammed Elhenawy, Huthaifa I. Ashqar, Andry Rakotonirainy, Taqwa I. Alhadidi, Ahmed Jaber, Mohammad Abu Tami\n  - Publisher: Queensland University of Technology, Arab American University, Columbia University\n  - Publish Date: 2025.01.09\n  - Task: Scene Understanding\n  - Summary:\n    - This study developed a dynamic scene retrieval system using Contrastive Language–Image Pretraining (CLIP) models, which can be optimized for real-time deployment on edge devices.\n\n- [Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04003)\n  - Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, Liang Pan **ICCV 2025**\n  - Project Page: [DriveBench](https:\u002F\u002Fdrive-bench.github.io)\n  - Code: [DriveBench](https:\u002F\u002Fgithub.com\u002Fdrive-bench\u002Ftoolkit)\n  - HuggingFace: [DriveBench](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fdrive-bench\u002Farena)\n  - Publish Date: 2025.01.07\n  - Task: Benchmark\n  - Summary:\n    - DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs), encompassing 19, 200 frames, 20, 498 question-answer pairs, three question types, four mainstream driving tasks, and a total of 12 popular VLMs.\n\n- [Generating Traffic Scenarios via In-Context Learning to Learn Better Motion Planner](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18086)\n  - Aizierjiang Aiersilan **AAAI 2025 Oral**\n  - Publisher: University of Macau\n  - Publish Date: 2024.12.24\n  - Task: Generation\n  - Env : [CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - Project Pages: [AutoSceneGen](https:\u002F\u002Fezharjan.github.io\u002FAutoSceneGen\u002F)\n  - Code: [AutoSceneGen](https:\u002F\u002Fgithub.com\u002FEzharjan\u002FAutoSceneGen)\n  - Summary:\n    - A universal, general, and cost-effective framework, “AutoSceneGen”, is proposed to automatically enhance the heterogeneity of traffic scenarios through scenario descriptions, thereby accelerating the simulation and testing process.\n\n- [Large Language Model guided Deep Reinforcement Learning for Decision Making in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18511)\n  - Hao Pang, Zhenpo Wang, Guoqiang Li\n  - Publisher: Beijing Institute of Technology\n  - Publish Date: 2024.12.24\n  - Task: Planning\n  - Code: [LGDRL](https:\u002F\u002Fgithub.com\u002Fbitmobility\u002FLGDRL)\n  - Env: [HighwayEnv](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - Summary:\n    - Propose a novel large language model (LLM) guided deep reinforcement learning (LGDRL) framework for addressing the decision-making problem of autonomous vehicles. 
Within this framework, an LLM-based driving expert is integrated into the DRL framework to provide intelligent guidance for the DRL learning process.\n\n- [Application of Multimodal Large Language Models in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.16410)\n  - Md Robiul Islam\n  - Publisher: William & Mary\n  - Publish Date: 2024.12.21\n  - Task: VQA\n  - Summary:\n    - Construct a Visual Question Answering (VQA) dataset to fine-tune the model and address the poor performance of MLLMs on AD.\n\n- [VLM-RL: A Unified Vision Language Models and Reinforcement Learning Framework for Safe Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15544)\n  - Zilin Huang, Zihao Sheng, Yansong Qu, Junwei You, Sikai Chen\n  - Publisher: University of Wisconsin-Madison, Purdue University\n  - Publish Date: 2024.12.20\n  - Task: Reward Design\n  - Project Page: [VLM-RL](https:\u002F\u002Fwww.huang-zilin.com\u002FVLM-RL-website\u002F)\n  - Summary:\n    - VLM-RL, a unified framework that integrates pre-trained Vision-Language Models (VLMs) with RL to generate reward signals using image observations and natural language goals.\n\n- [AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15206)\n  - Shuo Xing, Hongyuan Hua, Xiangbo Gao, Shenzhe Zhu, Renjie Li, Kexin Tian, Xiaopeng Li, Heng Huang, Tianbao Yang, Zhangyang Wang, Yang Zhou, Huaxiu Yao, Zhengzhong Tu\n  - Publisher: Texas A&M University, University of Toronto, University of Michigan, University of Wisconsin-Madison, University of Maryland, University of Texas at Austin, University of North Carolina at Chapel Hill\n  - Publish Date: 2024.12.19\n  - Task: VQA\n  - Code: [AutoTrust](https:\u002F\u002Fgithub.com\u002Ftaco-group\u002FAutoTrust)\n  - Leaderboard: [AutoTrust](https:\u002F\u002Ftaco-group.github.io\u002FAutoTrust\u002F)\n  - Summary:\n    - AutoTrust is a groundbreaking benchmark designed to assess the trustworthiness of DriveVLMs. This work aims to enhance public safety by ensuring DriveVLMs operate reliably across critical dimensions.\n\n- [VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14446)\n  - Yi Xu, Yuxin Hu, Zaiwei Zhang, Gregory P. Meyer, Siva Karthik Mustikovela, Siddhartha Srinivasa, Eric M. 
Wolff, Xin Huang\n  - Publisher: Cruise LLC, Northeastern University\n  - Publish Date: 2024.12.19\n  - Task: Planning\n  - Dataset: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - Summary:\n    - VLM-AD, a method that leverages vision-language models (VLMs) as teachers to enhance training by providing additional supervision that incorporates unstructured reasoning information and structured action labels.\n\n- [RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.11050)\n  - Yujin Wang, Quanfeng Liu, Jiaqi Fan, Jinlong Hong, Hongqing Chu, Mengjian Tian, Bingzhao Gao, Hong Chen\n  - Publisher: Tongji University, Shenzhen Technology University\n  - Publish Date: 2024.12.15\n  - Task: VQA\n  - Datasets: [CODA-LM](https:\u002F\u002Fcoda-dataset.github.io\u002Fcoda-lm\u002F)\n  - Summary:\n    - RAC3, a novel framework designed to enhance the performance of VLMs in corner case comprehension.\n    - RAC3 integrates a frequencyspatial fusion (FSF) image encoder, a cross-modal alignment training method for embedding models with hard and semihard negative mining, and a fast querying and retrieval pipeline based on K-Means clustering and hierarchical navigable small world (HNSW) indexing.\n\n- [WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.09951)\n  - Songyan Zhang, Wenhui Huang, Zihui Gao, Hao Chen, Chen Lv\n  - Publisher: Nanyang Technology University, Zhejiang University\n  - Publish Date: 2024.12.13\n  - Task: Planning\n  - Summary:\n    - Investigate the effects of the depth and breadth of fundamental driving knowledge on closed-loop trajectory planning and introduce WiseAD, a specialized VLM tailored for end-to-end autonomous driving capable of driving reasoning, action justification, object recognition, risk analysis, driving suggestions, and trajectory planning across diverse scenarios.\n\n- [PKRD-CoT: A Unified Chain-of-thought Prompting for Multi-Modal Large Language Models in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.02025)\n  - Xuewen Luo, Fan Ding, Yinsheng Song, Xiaofeng Zhang, Junnyong Loo **ICONIP 2024**\n  - Publisher: Monash University Malaysia\n  - Task: QA, Prompt Engineer\n  - Publish Date: 2024.12.02\n  - Summary:\n    - PKRD-CoT is constructed based on the four fundamental capabilities of autonomous driving—perception, knowledge, reasoning, and decision-making.\n\n- [Visual Adversarial Attack on Vision-Language Models for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18275)\n  - Tianyuan Zhang, Lu Wang, Xinwei Zhang, Yitong Zhang, Boyi Jia, Siyuan Liang, Shengshan Hu, Qiang Fu, Aishan Liu, Xianglong Liu\n  - Publisher: Beihang University, National University of Singapore, Huazhong University of Science and Technology\n  - Task: VQA\n  - Publish Date: 2024.11.27\n  - Summary:\n    - ADvLM, the first visual adversarial attack framework specifically designed for VLMs in AD.\n    - Semantic-Invariant Induction in the textual domain and Scenario-Associated Enhancement in the visual domain, ensuring attack effectiveness across varied instructions and sequential viewpoints.\n\n- [Explanation for Trajectory Planning using Multi-modal Large Language Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.09971)\n  - Shota Yamazaki, Chenyu Zhang, Takuya Nanri, Akio Shigekane, Siyuan Wang, Jo Nishiyama, Tao Chu, Kohei 
Yokosawa **ECCV 2024 VCAD Workshop**\n  - Publisher: Nissan Motor Co., Ltd, \n  - Publish Date: 2024.11.15\n  - Task: VQA\n  - Summary:\n    - Propose a reasoning model that takes future planning trajectories of the ego vehicle as input to generate reasoning text.\n\n- [Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.05898)\n  - Linfeng He, Yiming Sun, Sihao Wu, Jiaxu Liu, Xiaowei Huang **NeurIPS 2024 SafeGenAI Workshop**\n  - Publisher: University of Liverpool,  University of Nottingham\n  - Publish Date: 2024.11.08\n  - Task: Precption\n  - Dataset: [DriveLM](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM)\n  - Summary:\n    - Extend the Llama-Adapter architecture by incorporating a YOLOS-based detection network alongside the CLIP perception network, addressing limitations in object detection and localisation.\n\n- [Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22313)\n  - Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, Xinggang Wang\n  - Publisher: Huazhong University of Science and Technology\n  - Publish Date: 2024.10.29\n  - Task: VQA, Planning\n  - Code: [Senna](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FSenna)\n  - Dataset: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes), [DriveX]()\n  - Summary:\n    - Senna, an autonomous driving system that integrates an LVLM with an end-to-end model, achieving structured planning from high-level decisions to low-level trajectory prediction.\n\n- [Words to Wheels: Vision-Based Autonomous Driving Understanding Human Language Instructions Using Foundation Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10577)\n  - Chanhoe Ryu, Hyunki Seong, Daegyu Lee, Seongwoo Moon, Sungjae Min, D.Hyunchul Shim\n  - Publisher: KAIST, ETRI\n  - Publish Date: 2024.10.14\n  - Task: VQA\n  - Summary:\n    - Introduce an innovative application of foundation models, enabling Unmanned Ground Vehicles (UGVs) equipped with an RGB-D camera to navigate to designated destinations based on human language instructions.\n\n- [CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01273)\n  - Suhwan Choi, Yongjun Cho, Minchan Kim, Jaeyoon Jung, Myunchul Joe, Yubeen Park, Minseo Kim, Sungwoong Kim, Sungjae Lee, Hwiseong Park, Jiwan Chung, Youngjae Yu\n  - Publisher: MAUM.AI, Yonsei University\n  - Publish Date: 2024.10.02\n  - Project Page: [CANVAS](https:\u002F\u002Fworv-ai.github.io\u002Fcanvas\u002F)\n  - Code: [CANVAS](https:\u002F\u002Fgithub.com\u002Fworv-ai\u002Fcanvas)\n  - Task: Planning\n  - Datasets: [COMMAND](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmaum-ai\u002FCOMMAND)\n  - Summary:\n    - CANVAS is a novel framework for commonsense-aware robot navigation that excels in both simulated and real-world environments.\n\n- [LeGEND: A Top-Down Approach to Scenario Generation of Autonomous Driving Systems Assisted by Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.10066)\n  - Shuncheng Tang, Zhenya Zhang, Jixiang Zhou, Lei Lei, Yuan Zhou, Yinxing Xue **ASE 2024**\n  - Publisher: University of Science and Technology of China, Kyushu University, Zhejiang Sci-Tech University\n  - Task: Generation\n  - Publish Date: 2024.09.16\n  - Code: [LeGEND](https:\u002F\u002Fgithub.com\u002FMayDGT\u002FLeGEND)\n  - 
Summary:\n    - LeGEND, a top-down scenario generation approach that can achieve both criticality and diversity of scenarios.\n    - Devise a two-stage transformation, by using an intermediate language, from accident reports to logical scenarios; so, LeGEND involves two LLMs, each in charge of one different stage.\n    - Implement LeGEND and demonstrate its effectiveness on Apollo, and we detect 11 types of critical concrete scenarios that reflect different aspects of system defects.\n\n- [MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.07267)\n  - Enming Zhang, Xingyuan Dai, Yisheng Lv, Qinghai Miao\n  - Publisher: University of Chinese Academy of Sciences, CASIA\n  - Task: QA\n  - Publish Date: 2024.09.14\n  - Code: [MiniDrive](https:\u002F\u002Fgithub.com\u002FEMZucas\u002Fminidrive)\n  - Summary:\n    - MiniDrive addresses the challenges of efficient deployment and real-time response in VLMs for autonomous driving systems. It can be fully trained simultaneously on an\n    RTX 4090 GPU with 24GB of memory.\n    - Feature Engineering Mixture of Experts (FE-MoE) addresses the challenge of efficiently encoding 2D features from multiple perspectives into text token embeddings, effectively reducing the number of visual feature tokens and minimizing feature redundancy.\n    - Dynamic Instruction Adapter through a residual structure, which addresses the problem of fixed visual tokens for the same image before being input into the language model.\n\n- [Hint-AD: Holistically Aligned Interpretability in End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06702)\n  - Kairui Ding, Boyuan Chen, Yuchen Su, Huan-ang Gao, Bu Jin, Chonghao Sima, Wuqiang Zhang, Xiaohui Li, Paul Barsch, Hongyang Li, Hao Zhao **CoRL 2024**\n  - Publisher: Institute for AI Industry Researc, Mercedes-Benz Group China Ltd, Tsinghua University, Shanghai AI Lab \n  - Publish Date: 2024.09.10\n  - Project Page: [Hint-AD](https:\u002F\u002Fair-discover.github.io\u002FHint-AD\u002F)\n  - Task: VQA\n  - Summary:\n   - Hint-AD, an integrated AD-language system that generates language aligned with the holistic perception-prediction-planning outputs of the AD model. 
By incorporating the intermediate outputs and a holistic token mixer sub-network for effective feature adaptation, Hint-AD attains desirable accuracy, achieving state-of-the-art results in driving language tasks including driving explanation, 3D dense captioning, and command prediction.\n   - Contribute a human-labeled driving explanation dataset, Nu-X, on nuScenes to address the lack of driving explanation data on this widely-used AD dataset.\n\n- [OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.03272)\n  - Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, Wenchao Ding\n  - Publisher: Fudan University, Tsinghua University\n  - Task: Perception (Occ) + Reasoning\n  - Publish Date: 2024.09.05\n  - Summary:\n    - OccLLaMA, a unified 3D occupancy-language-action generative world model, which unifies VLA-related tasks including but not limited to scene understanding, planning, and 4D occupancy forecasting.\n    - A novel scene tokenizer (VQVAE-like architecture) that efficiently discretizes and reconstructs Occ scenes, accounting for sparsity and class imbalance (a minimal vector-quantization sketch follows the DriveGenVLM entry below).\n\n- [ContextVLM: Zero-Shot and Few-Shot Context Understanding for Autonomous Driving using Vision Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.00301)\n  - Shounak Sural, Naren, Ragunathan Rajkumar **ITSC 2024**\n  - Publisher: Carnegie Mellon University\n  - Task: Context Recognition\n  - Code: [ContextVLM](https:\u002F\u002Fgithub.com\u002Fssuralcmu\u002FContextVLM)\n  - Publish Date: 2024.08.30\n  - Summary:\n    - DrivingContexts, a large publicly available dataset with a combination of hand-annotated and machine-annotated labels to improve VLMs for better context recognition.\n    - ContextVLM uses vision-language models to detect contexts using zero- and few-shot approaches.\n\n- [DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.16647)\n  - Yongjie Fu, Anmol Jain, Xuan Di, Xu Chen, Zhaobin Mo **IAVVC 2024**\n  - Publisher: Columbia University\n  - Task: Generation\n  - Dataset: [Waymo open dataset](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Publish Date: 2024.08.29\n  - Summary:\n    - DriveGenVLM employs a video generation framework based on Denoising Diffusion Probabilistic Models to create realistic video sequences that mimic real-world dynamics.\n    - The generated videos are then evaluated for their suitability in Vision Language Models (VLMs) using a pre-trained model called Efficient In-context Learning on Egocentric Videos (EILEV). 
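\n\n*Illustrative sketch.* The OccLLaMA entry above describes a VQVAE-like scene tokenizer that turns occupancy scenes into discrete tokens. The snippet below is only a minimal NumPy illustration of the core vector-quantization step (nearest-codebook lookup over flattened occupancy patches); the shapes, patch size, and codebook are invented for this sketch and are not OccLLaMA's actual design.\n\n```python\n# Toy vector-quantization tokenizer (illustrative only, not the OccLLaMA implementation).\nimport numpy as np\n\ndef tokenize_occupancy(occ, codebook, patch=4):\n    # occ: (H, W) occupancy grid of integer class labels\n    # codebook: (K, patch*patch) code vectors (random placeholders here)\n    H, W = occ.shape\n    tokens = []\n    for i in range(0, H, patch):\n        for j in range(0, W, patch):\n            vec = occ[i:i + patch, j:j + patch].astype(np.float32).reshape(-1)\n            dists = np.linalg.norm(codebook - vec, axis=1)  # nearest-codebook lookup\n            tokens.append(int(np.argmin(dists)))\n    return tokens  # discrete token ids that a language model could consume\n\nrng = np.random.default_rng(0)\nocc = rng.integers(0, 3, size=(16, 16))   # toy 16x16 occupancy grid with 3 classes\ncodebook = rng.normal(size=(128, 16))     # toy codebook of 128 codes\nprint(tokenize_occupancy(occ, codebook)[:8])\n```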
\n\n- [Edge-Cloud Collaborative Motion Planning for Autonomous Driving with Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.09972)\n  - Jiao Chen, Suyan Dai, Fangfang Chen, Zuohong Lv, Jianhua Tang\n  - Publisher: South China University of Technology, Pazhou Lab\n  - Task: Planning + QA\n  - Project Page: [EC-Drive](https:\u002F\u002Fsites.google.com\u002Fview\u002Fec-drive)\n  - Publish Date: 2024.08.19\n  - Summary:\n    - EC-Drive, a novel edge-cloud collaborative autonomous driving system.\n\n- [V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.09251)\n  - Junwei You, Haotian Shi, Zhuoyu Jiang, Zilin Huang, Rui Gan, Keshu Wu, Xi Cheng, Xiaopeng Li, Bin Ran\n  - Publisher: University of Wisconsin-Madison, Nanyang Technological University, Texas A&M University, Cornell University\n  - Task: Planning\n  - Projcet Page: [V2X-VLM](https:\u002F\u002Fzilin-huang.github.io\u002FV2X-VLM-website\u002F)\n  - Code: [V2X-VLM](https:\u002F\u002Fgithub.com\u002Fzilin-huang\u002FV2X-VLM)\n  - Dataset: [DAIR-V2X](https:\u002F\u002Fgithub.com\u002FAIR-THU\u002FDAIR-V2X)\n  - Publish Date: 2024.08.09\n  - Summary:\n    - V2X-VLM, a large vision-language model empowered E2E VICAD framework, which improves the ability of autonomous vehicles to navigate complex traffic scenarios through advanced multimodal understanding and decision-making.\n    - A contrastive learning technique is employed to refine the model’s ability to distinguish between relevant and irrelevant features, which ensures that the model learns robust and discriminative representations of specific driving environments, leading to improved accuracy in trajectory planning in V2X cooperation scenarios.\n\n- [AgentsCoMerge: Large Language Model Empowered Collaborative Decision Making for Ramp Merging](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.03624)\n  - Senkang Hu, Zhengru Fang, Zihan Fang, Yiqin Deng, Xianhao Chen, Yuguang Fang, Sam Kwong **IEEE TMC**\n  - Publisher: City University of Hong Kong, The University of Hong Kong, Lingnan University\n  - Task: Multi Agent Planning\n  - Publish Date: 2024.08.07\n  - Summary:\n    - AgentsCoMerge, a large language model empowered collaborative decision making for ramp merging. It includes observation, planning, communication, and reinforcement training modules. Experiments demonstrate its effectiveness in improving multi-agent collaboration and efficiency.\n\n- [VLM-MPC: Vision Language Foundation Model (VLM)-Guided Model Predictive Controller (MPC) for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.04821)\n  - Keke Long, Haotian Shi, Jiaxi Liu, Xiaopeng Li\n  - Publisher: University of Wisconsin-Madison\n  - Task: Planning\n  - Publish Date: 2024.08.04\n  - Summary:\n    - It proposed a closed-loop autonomous driving controller that applies VLMs for high-level vehicle control. \n    - The upper-level VLM uses the vehicle's front camera images, textual scenario description, and experience memory as inputs to generate control parameters needed by the lower-level MPC. \n    - The lower-level MPC utilizes these parameters, considering vehicle dynamics with engine lag, to achieve realistic vehicle behavior and provide state feedback to the upper level. 
\n    - This asynchronous two-layer structure addresses the current issue of slow VLM response speeds.\n\n- [SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21293)\n  - Peiru Zheng, Yun Zhao, Zhan Gong, Hong Zhu, Shaohua Wu **IEIT Systems**\n  - Publisher: University of Wisconsin-Madison\n  - Task: QA\n  - Publish Date: 2024.07.31\n  - Summary:\n    - SimpleLLM4AD reimagines the traditional autonomous driving pipeline by structuring the task into four interconnected stages: perception, prediction, planning, and behavior. \n    - Each stage is framed as a series of visual question answering (VQA) pairs, which are interlinked to form a Graph VQA (GVQA). This graph-based structure allows the system to reason about each VQA pair systematically, ensuring a coherent flow of information and decision-making from perception to action.\n\n- [Testing Large Language Models on Driving Theory Knowledge and Skills for Connected Autonomous Vehicles](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.17211)\n  - Zuoyin Tang, Jianhua He, Dashuai Pei, Kezhong Liu, Tao Gao\n  - Publisher: Aston University, Essex University, Wuhan University of Technology, Chang’An University\n  - Task: Evaluation\n  - Publish Date: 2024.07.24\n  - Data: [UK Driving Theory Test Practice Questions and Answers](https:\u002F\u002Fwww.drivinginstructorwebsites.co.uk\u002Fuk-driving-theory-test-practice-questions-and-answers)\n  - Summary:\n    - Design and run driving theory tests for several proprietary LLM models (OpenAI GPT models, Baidu Ernie and Ali QWen) and open-source LLM models (Tsinghua MiniCPM-2B and MiniCPM-Llama3-V2.5) with more than 500 multiple-choices theory test questions.\n\n- [KoMA: Knowledge-driven Multi-agent Framework for Autonomous Driving with Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.14239)\n  - Kemou Jiang, Xuan Cai, Zhiyong Cui, Aoyong Li, Yilong Ren, Haiyang Yu, Hao Yang, Daocheng Fu, Licheng Wen, Pinlong Cai\n  - Publisher: Beihang University, Johns Hopkins University, Shanghai Artificial Intelligence Laboratory\n  - Task: Multi Agent Planning\n  - Env: [HighwayEnv](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - Project Page: [KoMA](https:\u002F\u002Fjkmhhh.github.io\u002FKoMA\u002F)\n  - Publish Date: 2024.07.19\n  - Summary:\n    - Introduce a knowledge-driven autonomous driving framework KoMA that incorporates multiple agents empowered by LLMs, comprising five integral modules: Environment, Multi-agent Interaction, Multi-step Planning, Shared Memory, and Ranking-based Reflection. \n\n- [WOMD-Reasoning: A Large-Scale Language Dataset for Interaction and Driving Intentions Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.04281)\n  - Yiheng Li, Chongjian Ge, Chenran Li, Chenfeng Xu, Masayoshi Tomizuka, Chen Tang, Mingyu Ding, Wei Zhan\n  - Publisher: UC Berkeley, UT Austin\n  - Task: Dataset + Reasoning\n  - Publish Date: 2024.07.05\n  - Datasets: [WOMD-Reasoning](https:\u002F\u002Fwaymo.com\u002Fopen\u002Fdownload)\n  - Summary:\n    - WOMD-Reasoning, a language dataset centered on interaction descriptions and reasoning. 
It provides extensive insights into critical but previously overlooked interactions induced by traffic rules and human intentions.\n    - Develop an automatic language labeling pipeline, leveraging a rule-based translator to interpret motion data into language descriptions, and a set of manual prompts for ChatGPT to generate Q&A pairs. \n\n- [Exploring the Potential of Multi-Modal AI for Driving Hazard Prediction](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10568360)\n  - Korawat Charoenpitaks, Van-Quang Nguyen, Masanori Suganuma, Masahiro Takahashi, Ryoma Niihara, Takayuki Okatani **IEEE TIV 2024**\n  - Publisher: Tohoku University, RIKEN Center for AIP, DENSO CORPORATION\n  - Task: Prediction\n  - Code: [DHPR](https:\u002F\u002Fgithub.com\u002FDHPR-dataset\u002FDHPR-dataset)\n  - Publish Date: 2024.06.21\n  - Summary:\n    - DHPR (Driving Hazard Prediction and Reasoning) dataset, consists of 15K dashcam images of street scenes, and each image is associated with a tuple containing car speed, a hypothesized hazard description, and visual entities present in the scene.\n    - Present several baseline methods and evaluate their performance. \n\n- [Asynchronous Large Language Model Enhanced Planner for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.14556)\n  - Yuan Chen, Zi-han Ding, Ziqin Wang, Yan Wang, Lijun Zhang, Si Liu  **ECCV 2024**\n  - Publisher: Beihang University, Tsinghua University\n  - Task: Planning\n  - Publish Date: 2024.06.20\n  - Code: [AsyncDriver](https:\u002F\u002Fgithub.com\u002FmemberRE\u002FAsyncDriver)\n  - Datasets: [nuPlan Closed-Loop Reactive Hard20](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - Summary:\n    - AsyncDriver, a novel asynchronous LLM-enhanced framework, in which the inference frequency of LLM is controllable and can be decoupled from that of the real-time planner.\n    - Adaptive Injection Block, which is model-agnostic and can easily integrate scene-associated instruction features into any transformer based\n    real-time planner, enhancing its ability in comprehending and following series of language-based routing instructions.\n    - Compared with existing methods, our approach demonstrates superior closedloop evaluation performance in nuPlan’s challenging scenarios.\n\n- [A Superalignment Framework in Autonomous Driving with Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.05651)\n  - Xiangrui Kong, Thomas Braunl, Marco Fahmi, Yue Wang\n  - Publisher: University of Western Australia, Queensland Government, Brisbane, Queensland University of Technology\n  - Task: QA\n  - Publish Date: 2024.06.09\n  - Summary\n    - Propose a secure interaction framework for LLMs to effectively audit data interacting with cloud-based LLMs.\n    - Analyze 11 autonomous driving methods based on large language models, including driving safety, token usage, privacy, and consistency with human values.\n    - Evaluate the effectiveness of driving prompts in the nuScenesQA dataset and compare different results between gpt-35-turbo and llama2-70b LLM backbones.\n\n- [PlanAgent: A Multi-modal Large Language Agent for Closed-loop Vehicle Motion Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.01587)\n  - Yupeng Zheng, Zebin Xing, Qichao Zhang, Bu Jin, Pengfei Li, Yuhang Zheng, Zhongpu Xia, Kun Zhan, Xianpeng Lang, Yaran Chen, Dongbin Zhao\n  - Publisher: Chinese Academy of Sciences, Beijing University of Posts and Telecommunications, Beihang University, Tsinghua University, Li Auto\n  - Task: Planning\n  - 
Publish Date: 2024.06.04\n  - Summary:\n    - PlanAgent is the first closed-loop mid-to-mid(use bev, no raw sensor) autonomous driving planning agent system based on a Multi-modal Large Language Model.\n    - Propose an efficient Environment Transformation module that extracts multi-modal information inputs with lanegraph representation.\n    - Design a Reasoning Engine module that introduces a hierarchical chain-of-thought (CoT) to instruct MLLM to generate planner code and a Reflection module that combines simulation and scoring to filter out unreasonable proposals generated by the MLLM.\n\n- [ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation for Autonomous Vehicles](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.14062)\n  - Jiawei Zhang, Chejian Xu, Bo Li **CVPR 2024**\n  - Publisher: UIUC, UChicago\n  - Task: Scenario Generation\n  - Env: [Carla](https:\u002F\u002Fgithub.com\u002Fcarla-simulator)\n  - Code: [ChatScene](https:\u002F\u002Fgithub.com\u002Fjavyduck\u002FChatScene)\n  - Publish Date: 2024.05.22\n  - Summary:\n    - ChatScene, a novel LLM-based agent capable of generating safety-critical scenarios by first providing textual descriptions and then carefully transforming them into executable simulations in CARLA via Scenic programming language.\n    - An expansive retrieval database of Scenic code snippets has been developed. It catalogs diverse adversarial behaviors and traffic configurations, utilizing the rich knowledge stored in LLMs, which significantly augments the variety and critical nature of the driving scenarios generated.\n\n- [Probing Multimodal LLMs as World Models for Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.05956)\n  - Shiva Sreeram, Tsun-Hsuan Wang, Alaa Maalouf, Guy Rosman, Sertac Karaman, Daniela Rus\n  - Publisher: MIT CSAIL, TRI, MIT LID\n  - Task: Benchmark & Evaluation\n  - Code: [DriveSim](https:\u002F\u002Fgithub.com\u002Fsreeramsa\u002FDriveSim)\n  - Publish Date: 2024.05.09\n  - Summary:\n    - A comprehensive experimental study to evaluate the capability of different MLLMs to reason\u002Funderstand scenarios involving closed-loop driving and making decisions.\n    - DriveSim, a specialized simulator designed to generate a diverse array of driving scenarios, thereby providing a platform to test and evaluate\u002Fbenchmark the capabilities of MLLMs in understanding and reasoning about real-world driving scenes from a fixed in-car camera perspective, the same as the drive viewpoint.\n\n- [OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception Reasoning and Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.01533)\n  - Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, Jose M. 
Alvarez\n  - Publisher: Beijing Inst of Tech, NVIDIA, Huazhong Univ of Sci and Tech\n  - Task: Benchmark & Planning\n  - Publish Date: 2024.05.02\n  - Code: [OmniDrive](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FOmniDrive)\n  - Summary:\n    - OmniDrive, a holistic framework for strong alignment between agent models and 3D driving tasks.\n    - Propose a new benchmark with comprehensive visual question-answering (VQA) tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making, and planning.\n\n- [Chat2Scenario: Scenario Extraction From Dataset Through Utilization of Large Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.16147)\n  - Yongqi Zhao, Wenbo Xiao, Tomislav Mihalj, Jia Hu, Arno Eichberger **IEEE IV 2024**\n  - Publisher:\n  - Publish Date: 2024.04.26\n  - Task: Generation\n  - Dataset: [Chat2Scenario](https:\u002F\u002Fgithub.com\u002FftgTUGraz\u002FChat2Scenario)\n  - Summary:\n    - Chat2Scenario is a web-based tool that allows users to search for specific driving scenarios within a dataset by inputting descriptive functional scenario text.\n\n- [VLAAD: Vision and Language Assistant for Autonomous Driving](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10495690)\n  - SungYeon Park, MinJae Lee, JiHyuk Kang, Hahyeon Choi, Yoonah Park, Juhwan Cho, Adam Lee, DongKyu Kim **WACV 2024 Workshop**\n  - Publisher: Seoul National University, University of California, Berkeley\n  - Publish Date: 2024.04.16\n  - Task: VQA\n  - Code: [VLAAD](https:\u002F\u002Fgithub.com\u002Fsungyeonparkk\u002Fvision-assistant-for-driving)\n  - Summary:\n    - Introduce a multi-modal instruction tuning dataset that facilitates language models in learning visual instructions across diverse driving scenarios.\n    - Capitalizing on this dataset, present a multi-modal LLM driving assistant named VLAAD.\n\n- [REvolve: Reward Evolution with Large Language Models for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.01309)\n  - Rishi Hazra, Alkis Sygkounas, Andreas Persson, Amy Loutfi, Pedro Zuidberg Dos Martires\n  - Publisher: Centre for Applied Autonomous Sensor Systems (AASS), Örebro University, Sweden\n  - Task: Reward Generation\n  - Env: [AirSim](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FAirSim?tab=readme-ov-file)\n  - Project Page: [REvolve](https:\u002F\u002Frishihazra.github.io\u002FREvolve\u002F)\n  - Publish Date: 2024.04.09\n  - Summary:\n    - Reward Evolution (REvolve), a novel evolutionary framework using LLMs, specifically GPT-4, to output reward functions (as executable Python code) for AD and evolve them based on human feedback (an illustrative sketch of such a generated reward function follows the next entry).\n\n- [AGENTSCODRIVER: Large Language Model Empowered Collaborative Driving with Lifelong Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.06345.pdf)\n  - Senkang Hu, Zhengru Fang, Zihan Fang, Xianhao Chen, Yuguang Fang\n  - Publisher: City University of Hong Kong, The University of Hong Kong\n  - Task: Planning (multi-vehicle collaborative)\n  - Publish Date: 2024.04.09\n  - Env: [HighwayEnv](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - Summary:\n    - AGENTSCODRIVER, an LLM-powered multi-vehicle collaborative driving framework with lifelong learning, which allows different driving agents to communicate with each other and collaboratively drive in complex traffic scenarios.\n    - It features a reasoning engine, cognitive memory, reinforcement reflection, and a communication module. 
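\n\n*Illustrative sketch.* REvolve above (and, earlier in the list, LGDRL and AED) describes LLMs that emit reward functions as executable Python for driving RL. The snippet below only shows the general shape such a generated reward might take for a highway-style observation; the observation fields and weights are assumptions, not taken from any of the papers.\n\n```python\n# Illustrative only: the kind of reward function an LLM might emit as executable code.\n# The observation fields (speed, lane_offset, min_gap, collision) are assumptions.\ndef generated_reward(obs):\n    reward = 0.0\n    reward += 0.1 * min(obs['speed'], 30.0)   # encourage progress up to a speed cap\n    reward -= 0.5 * abs(obs['lane_offset'])   # stay centered in the lane\n    if obs['min_gap'] < 5.0:                  # penalize tailgating\n        reward -= 1.0\n    if obs['collision']:                      # hard penalty on collision\n        reward -= 100.0\n    return reward\n\n# A REvolve-style loop would score many candidate functions like this with human or\n# simulator feedback, keep the best performers, and ask the LLM to mutate them.\nprint(generated_reward({'speed': 25.0, 'lane_offset': 0.3, 'min_gap': 8.0, 'collision': False}))\n```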
\n\n- [Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.19838)\n  - Akshay Gopalkrishnan, Ross Greer, Mohan Trivedi\n  - Publisher: UCSD\n  - Task: QA\n  - Publish Date: 2024.03.28\n  - Code: [official](https:\u002F\u002Fgithub.com\u002Fakshaygopalkr\u002FEM-VLM4AD)\n  - Datasets: [DriveLM](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM)\n  - Summary:\n    - EM-VLM4AD, an efficient, lightweight, multi-frame vision language model which performs Visual Question Answering for autonomous driving.\n    - EM-VLM4AD requires at least 10 times less memory and floating point operations, while also achieving higher BLEU-4, METEOR, CIDEr, and ROGUE scores than the existing baseline on the DriveLM dataset.\n\n- [LC-LLM: Explainable Lane-Change Intention and Trajectory Predictions with Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.18344)\n  - Mingxing Peng, Xusen Guo, Xianda Chen, Meixin Zhu, Kehua Chen, Hao (Frank) Yang, Xuesong Wang, Yinhai Wang\n  - Publisher: The Hong Kong University of Science and Technology, Johns Hopkins University, Tongji University, STAR Lab\n  - Task: Trajectory Prediction\n  - Publish Date: 2024.03.27\n  - Datasets: [highD](https:\u002F\u002Flevelxdata.com\u002Fhighd-dataset\u002F)\n  - Summary:\n    - LC-LLM, the first Large Language Model for lane change prediction. It leverages the powerful capabilities of LLMs to understand complex interactive scenarios, enhancing the performance of lane change prediction.\n    - LC-LLM achieves explainable predictions. It not only predicts lane change intentions and trajectories but also generates explanations for the prediction results.\n\n- [AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.17373)\n  - Mingfu Liang, Jong-Chyi Su, Samuel Schulter, Sparsh Garg, Shiyu Zhao, Ying Wu, Manmohan Chandraker\n  - Publisher: Northwestern University, NEC Laboratories America, Rutgers University, UC San Diego\n  - Publish Date: 2024.03.26\n  - Task: Object Detection\n  - Datasets: [Mapillary](https:\u002F\u002Fwww.mapillary.com\u002Fdataset\u002Fvistas), [Cityscapes](https:\u002F\u002Fwww.cityscapes-dataset.com\u002F), [nuImages](https:\u002F\u002Fwww.nuscenes.org\u002Fnuimages), [BDD100k](https:\u002F\u002Fwww.vis.xyz\u002Fbdd100k\u002F), [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F), [KITTI](https:\u002F\u002Fwww.cvlibs.net\u002Fdatasets\u002Fkitti\u002F)\n  - Summary:\n    - An Automatic Data Engine (AIDE) that can automatically identify the issues, efficiently curate data, improve the model using auto-labeling, and verify the model through generated diverse scenarios.\n\n- [Engineering Safety Requirements for Autonomous Driving with Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.16289)\n  - Ali Nouri, Beatriz Cabrero-Daniel, Fredrik Törner, Hȧkan Sivencrona, Christian Berger\n  - Publisher: Chalmers University of Technology, University of Gothenburg, Volvo Cars, Zenseact, University of Gothenburg\n  - Task: QA\n  - Publish Date: 2024.03.24\n  - Summary:\n    - Propose a prototype of a pipeline of prompts and LLMs that receives an item definition and outputs solutions in the form of safety requirements.\n\n- [LeGo-Drive: Language-enhanced Goal-oriented Closed-Loop End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.20116)\n  - Pranjal Paul, Anant Garg, Tushar Choudhary, Arun 
Kumar Singh, K. Madhava Krishna\n  - Publisher: The International Institute of Information Technology, Hyderabad, University of Tartu, Estonia\n  - Project Page: [LeGo-Drive](https:\u002F\u002Freachpranjal.github.io\u002Flego-drive\u002F)\n  - Code: [LeGo-Drive](https:\u002F\u002Fgithub.com\u002Freachpranjal\u002Flego-drive)\n  - Env: [Carla](https:\u002F\u002Fgithub.com\u002Fcarla-simulator)\n  - Task: Trajectory Prediction\n  - Publish Date: 2024.03.20\n  - Summary:\n    - A novel planning-guided end-to-end LLM-based goal point navigation solution that predicts and improves the desired state by dynamically interacting with the environment and generating a collision-free trajectory.\n\n- [Hybrid Reasoning Based on Large Language Models for Autonomous Car Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.13602v3)\n  - Mehdi Azarafza, Mojtaba Nayyeri, Charles Steinmetz, Steffen Staab, Achim Rettberg\n  - Publisher: Univ. of Applied Science Hamm-Lippstadt, University of Stuttgart\n  - Publish Date: 2024.03.18\n  - Task: Reasoning\n  - Env: [Carla](https:\u002F\u002Fgithub.com\u002Fcarla-simulator)\n  - Summary:\n    - Combining arithmetic and commonsense elements, the method utilizes the objects detected by YOLOv8.\n    - The \"location of the object,\" \"speed of our car,\" \"distance to the object,\" and \"our car’s direction\" are fed into the large language model for mathematical calculations within CARLA.\n    - After formulating these calculations while accounting for weather conditions, precise control values for brake and speed are generated.\n\n- [Large Language Models Powered Context-aware Motion Prediction](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.11057.pdf)\n  - Xiaoji Zheng, Lixiu Wu, Zhijie Yan, Yuanrong Tang, Hao Zhao, Chen Zhong, Bokui Chen, Jiangtao Gong\n  - Publisher: Tsinghua University\n  - Task: Motion Prediction\n  - Publish Date: 2024.03.17\n  - Dataset: [WOMD](https:\u002F\u002Fgithub.com\u002Fwaymo-research\u002Fwaymo-open-dataset)\n  - Summary:\n    - Design and conduct prompt engineering to enable GPT-4V, without fine-tuning, to comprehend complex traffic scenarios.\n    - Introduce a novel approach that combines the context information output by GPT-4V with [MTR](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.13508).\n\n- [Generalized Predictive Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.09630)\n  - Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, Jun Zhang, Andreas Geiger, Yu Qiao, Hongyang Li **ECCV 2024**\n  - Publisher: OpenDriveLab and Shanghai AI Lab, Hong Kong University of Science and Technology, University of Hong Kong, University of Tübingen, Tübingen AI Center\n  - Task: Datasets + Generation\n  - Code: [DriveAGI](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveAGI)\n  - Publish Date: 2024.03.14\n  - Summary:\n    - Introduce the first large-scale video prediction model in the autonomous driving discipline.\n    - The resultant dataset accumulates over 2000 hours of driving videos, spanning areas all over the world with diverse weather conditions and traffic scenarios.\n    - GenAD, inheriting the merits from recent latent diffusion models, handles the challenging dynamics in driving scenes with novel temporal reasoning blocks.\n\n- [LLM-Assisted Light: Leveraging Large Language Model Capabilities for Human-Mimetic Traffic Signal Control in Complex Urban Environments](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.08337)\n  - Maonan 
Wang, Aoyu Pang, Yuheng Kan, Man-On Pun, Chung Shue Chen, Bo Huang\n  - Publisher: The Chinese University of Hong Kong, Shanghai AI Laboratory, SenseTime Group Limited, Nokia Bell Labs\n  - Publish Date: 2024.03.13\n  - Task: Generation\n  - Code: [LLM-Assisted-Light](https:\u002F\u002Fgithub.com\u002FTraffic-Alpha\u002FLLM-Assisted-Light)\n  - Summary:\n    - LA-Light, a hybrid TSC framework that integrates the human-mimetic reasoning capabilities of LLMs, enabling the signal control algorithm to interpret and respond to complex traffic scenarios with the nuanced judgment typical of human cognition.\n    - A closed-loop traffic signal control system has been developed, integrating LLMs with a comprehensive suite of interoperable tools.\n\n- [DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.06845)\n  - Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Xinze Chen, Guan Huang, Xiaoyi Bao, Xingang Wang\n  - Publisher: Institute of Automation, Chinese Academy of Sciences, GigaAI\n  - Publish Date: 2024.03.11\n  - Task: Generation\n  - Project: [DriveDreamer-2](https:\u002F\u002Fdrivedreamer2.github.io)\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - Summary:\n    - DriveDreamer-2, which builds upon the framework of [DriveDreamer](#DriveDreamer) and incorporates a Large Language Model (LLM) to generate user-defined driving videos.\n    - UniMVM(Unified Multi-View Model) enhances temporal and spatial coherence in the generated driving videos.\n    - HDMap generator ensure the background elements do not conflict with the foreground trajectories.\n    - Utilize the constructed text-to-script dataset to finetune GPT-3.5 into an LLM with specialized trajectory generation knowledge.\n\n- [Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.05746)\n  - Yuxi Wei, Zi Wang, Yifan Lu, Chenxin Xu, Changxing Liu, Hao Zhao, Siheng Chen, Yanfeng Wang\n  - Publisher: Shanghai Jiao Tong University, Shanghai AI Laboratory, Carnegie Mellon University, Tsinghua University\n  - Publish Date: 2024.03.11\n  - Task: Generation\n  - Code: [ChatSim](https:\u002F\u002Fgithub.com\u002Fyifanlu0227\u002FChatSim)\n  - Datasets: [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Summary:\n    - ChatSim, the first system that enables editable photo-realistic 3D driving scene simulations via natural language commands with external digital assets.\n    - McNeRF, a novel neural radiance field method that incorporates multi-camera inputs, offering a broader scene rendering. It helps generate photo-realistic outcomes.\n    - McLight, a novel multicamera lighting estimation that blends skydome and surrounding lighting. It makes external digital assets with their realistic textures and materials.\n  \n- [Embodied Understanding of Driving Scenarios](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.04593)\n  - Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu, Hongzi Zhu, Minyi Guo, Yu Qiao, Hongyang Li **ECCV 2024**\n  - Shanghai AI Lab, Shanghai Jiao Tong University, University of California, Riverside\n  - Publish Date: 2024.03.07\n  - Task: Benchmark & Scene Understanding\n  - Code: [ELM](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FELM)\n  - Summary:\n    - ELM is an embodied language model for understanding the long-horizon driving scenarios in space and time. 
\n\n- [DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12289)\n  - Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, Hang Zhao\n  - Publisher: IIIS, Tsinghua University, Li Auto\n  - Publish Date: 2024.02.25\n  - Task: Planning\n  - Project: [DriveVLM](https:\u002F\u002Ftsinghua-mars-lab.github.io\u002FDriveVLM\u002F)\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes), SUP-AD\n  - Summary:\n    - DriveVLM, a novel autonomous driving system that leverages VLMs for effective scene understanding and planning.\n    - DriveVLM-Dual, a hybrid system that incorporates DriveVLM and a traditional autonomous pipeline.\n\n- [GenAD: Generative End-to-End Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.11502)\n  - Wenzhao Zheng, Ruiqi Song, Xianda Guo, Long Chen **ECCV 2024**\n  - Publisher: University of California, Berkeley, Waytous, Institute of Automation, Chinese Academy of Sciences\n  - Publish Date: 2024.02.20\n  - Task: Generation\n  - Code: [GenAD](https:\u002F\u002Fgithub.com\u002Fwzzheng\u002FGenAD)\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - Summary:\n    - GenAD models autonomous driving as a trajectory generation problem to unleash the full potential of end-to-end methods.\n    - Propose an instance-centric scene tokenizer that first transforms the surrounding scenes into map-aware instance tokens.\n    - Employ a variational autoencoder to learn the future trajectory distribution in a structural latent space for trajectory prior modeling, and adopt a temporal model to capture the agent and ego movements in the latent space to generate more effective future trajectories.\n\n- [RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.10828)\n  - Jianhao Yuan, Shuyang Sun, Daniel Omeiza, Bo Zhao, Paul Newman, Lars Kunze, Matthew Gadd\n  - Publisher: University of Oxford, Beijing Academy of Artificial Intelligence\n  - Publish Date: 2024.02.16\n  - Task: Explainable Driving\n  - Project: [RAG-Driver](https:\u002F\u002Fyuanjianhao508.github.io\u002FRAG-Driver\u002F)\n  - Summary:\n    - RAG-Driver is a Multi-Modal Large Language Model with Retrieval-augmented In-context Learning capability, designed for generalisable and explainable end-to-end driving with strong zero-shot generalisation.\n    - Achieves state-of-the-art action explanation and justification performance on both the BDD-X (in-distribution) and SAX (out-of-distribution) benchmarks. 
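\n\n*Illustrative sketch.* RAG-Driver above is built around retrieval-augmented in-context learning: similar past driving experiences are retrieved from a memory and prepended to the prompt as exemplars. The snippet below sketches that prompting pattern in a framework-agnostic way; the toy embedding, the memory contents, and the final model call are placeholders, not the paper's components.\n\n```python\n# Minimal sketch of retrieval-augmented in-context prompting (a RAG-Driver-style idea).\n# embed() and MEMORY are hypothetical placeholders; a real system would use a learned\n# scene or video encoder and a database of driving experiences.\nimport numpy as np\n\ndef embed(text):\n    v = np.zeros(64)                # toy character-count embedding\n    for ch in text:\n        v[ord(ch) % 64] += 1.0\n    return v\n\nMEMORY = [\n    ('ego slows because the lead vehicle brakes', 'action: decelerate, keep lane'),\n    ('pedestrian steps onto the crossing ahead', 'action: stop before the crossing'),\n    ('merging lane ends soon on the right', 'action: yield, then merge left'),\n]\n\ndef build_prompt(query, k=2):\n    q = embed(query)\n    # retrieve the k most similar past experiences and prepend them as exemplars\n    ranked = sorted(MEMORY, key=lambda ex: -float(np.dot(embed(ex[0]), q)))\n    lines = ['You are a driving assistant. Explain the next action.']\n    for scene, action in ranked[:k]:\n        lines.append('Example scene: ' + scene + ' -> ' + action)\n    lines.append('Current scene: ' + query)\n    return ' '.join(lines)   # the assembled prompt would then be sent to the MLLM\n\nprint(build_prompt('the lead vehicle brakes hard in our lane'))\n```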
\n\n- [Driving Everywhere with Large Language Model Policy Adaptation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.05932)\n  - Boyi Li, Yue Wang, Jiageng Mao, Boris Ivanovic, Sushant Veer, Karen Leung, Marco Pavone **CVPR 2024**\n  - Publisher: NVIDIA, University of Southern California, University of Washington, Stanford University\n  - Publish Date: 2024.02.08\n  - Task: Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - Project: [LLaDA](https:\u002F\u002Fboyiliee.github.io\u002Fllada\u002F)\n  - Summary:\n    - LLaDA is a training-free mechanism to assist human drivers and adapt autonomous driving policies to new environments.\n    - Traffic Rule Extractor (TRE), which aims to organize and filter the inputs (initial plan+unique traffic code) and feed the output into the frozen LLMs to obtain the final new plan. \n    - LLaDA set GPT-4 as default LLM.\n\n- [LimSim++](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01246)\n  - Daocheng Fu, Wenjie Lei, Licheng Wen, Pinlong Cai, Song Mao, Min Dou, Botian Shi, Yu Qiao\n  - Publisher: Shanghai Artificial Intelligence Laboratory, Zhejiang University\n  - Publish Date: 2024.02.02\n  - Project: [LimSim++](https:\u002F\u002Fpjlab-adg.github.io\u002Flimsim_plus\u002F)\n  - Summary:\n    - LimSim++, an extended version of LimSim designed for the application of (M)LLMs in autonomous driving.\n    - Introduce a baseline (M)LLM-driven framework, systematically validated through quantitative experiments across diverse scenarios.\n\n- [LangProp: A code optimization framework using Language Models applied to driving](https:\u002F\u002Fopenreview.net\u002Fforum?id=UgTrngiN16)\n  - Shu Ishida, Gianluca Corrado, George Fedoseev, Hudson Yeo, Lloyd Russell, Jamie Shotton, João F. 
Henriques, Anthony Hu\n  - Publisher: Wayve Technologies, Visual Geometry Group, University of Oxford\n  - Publish Date: 2024.01.18\n  - Task: Code generation, Planning\n  - Code: [LangProp](https:\u002F\u002Fgithub.com\u002Fshuishida\u002FLangProp)\n  - Env: [CARLA](https:\u002F\u002Fgithub.com\u002Fcarla-simulator)\n  - Summary:\n    - LangProp is a framework for iteratively optimizing code generated by large language models (LLMs) in a supervised\u002Freinforcement learning setting.\n    - Use LangProp in CARLA and generate driving code based on the state of the scene.\n\n- [VLP: Vision Language Planning for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.05577)\n  - Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, Liu Ren **CVPR 2024**\n  - Publisher: Syracuse University, Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI)\n  - Publish Date: 2024.01.14\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - Summary:\n    - Propose VLP, a Vision Language Planning model, which is composed of novel components ALP and SLP, aiming to improve the ADS from self-driving BEV reasoning and self-driving decision-making aspects, respectively.\n    - ALP(agent-wise learning paradigm) aligns the produced BEV with a true bird’s-eye-view map.\n    - SLP(selfdriving-car-centric learning paradigm) aligns the ego-vehicle query feature with the ego-vehicle textual planning feature.\n\n- [DME-Driver: Integrating Human Decision Logic and 3D Scene Perception in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.03641)\n  - Wencheng Han, Dongqian Guo, Cheng-Zhong Xu, and Jianbing Shen\n  - Publisher: SKL-IOTSC, CIS, University of Macau\n  - Publish Date: 2024.01.08\n  - Summary:\n    - DME-Driver = Decision-Maker + Executor + CL\n    - Executor network which is based on UniAD incorporates textual information for the OccFormer and the Planning module.\n    - Decision-Maker which is based on LLaVA process inputs from three different modalities: visual inputs from the current and previous scenes textual inputs in the form of prompts, and current status information detailing the vehicle’s operating state.\n    - CL is a consistency loss mechanism, slightly reducing performance metrics but significantly enhancing decision alignment between Executor and Decision-Maker.\n\n- [AccidentGPT: Accident Analysis and Prevention from V2X Environmental Perception with Multi-modal Large Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13156)\n  - Lening Wang, Yilong Ren, Han Jiang, Pinlong Cai, Daocheng Fu, Tianqi Wang, Zhiyong Cui, Haiyang Yu, Xuesong Wang, Hanchu Zhou, Helai Huang, Yinhai Wang\n  - Publisher: Beihang University, Shanghai Artificial Intelligence Laboratory, The University of Hong Kong, Zhongguancun Laboratory, Tongji University, Central South University, University of Washington, Seattle\n  - Publish Date: 2023.12.29\n  - Project page: [AccidentGPT](https:\u002F\u002Faccidentgpt.github.io)\n  - Summary:\n    - AccidentGPT, a comprehensive accident analysis and prevention multi-modal large model.\n    - Integrates multi-vehicle collaborative perception for enhanced environmental understanding and collision avoidance.\n    - Offer advanced safety features such as proactive remote safety warnings and blind spot alerts.\n    - Serve traffic police and management agencies by providing real-time intelligent analysis of traffic safety factors.\n\n- [Holistic Autonomous 
Driving Understanding by Bird’s-Eye-View Injected Multi-Modal Large Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.00988)\n  - Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, Xiaomeng Li **CVPR 2024**\n  - Publisher: Hong Kong University of Science and Technology, Huawei Noah’s Ark Lab, Sun Yat-Sen University\n  - Publish Date: 2023.12.21\n  - Task: Datasets + VQA\n  - Code: [official](https:\u002F\u002Fgithub.com\u002Fxmed-lab\u002FNuInstruct)\n  - Summary:\n    - Introduce NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks, which is based on [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes).\n    - Propose BEV-InMLMM to integrate instruction-aware BEV features with existing MLLMs, enhancing them with a full suite of information, including temporal, multi-view, and spatial details.\n\n- [LLM-ASSIST: Enhancing Closed-Loop Planning with Language-Based Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.00125)\n  - S P Sharan, Francesco Pittaluga, Vijay Kumar B G, Manmohan Chandraker\n  - Publisher: UT Austin, NEC Labs America, UC San Diego\n  - Publish Date: 2023.12.30\n  - Task: Planning\n  - Env\u002FDatasets: nuPlan Closed-Loop Non-Reactive Challenge\n  - Project: [LLM-ASSIST](https:\u002F\u002Fllmassist.github.io\u002F)\n  - Summary:\n    - LLM-Planner takes over scenarios that PDM-Closed cannot handle.\n    - Propose two LLM-based planners.\n      - LLM-ASSIST(unc) considers the most unconstrained version of the planning problem, in which the LLM must directly return a safe future trajectory for the ego car. \n      - LLM-ASSIST(par) considers a parameterized version of the planning problem, in which the LLM must only return a set of parameters for a rule-based planner, PDM-Closed.\n\n- [DriveLM: Driving with Graph Visual Question Answering](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.14150.pdf)\n  - Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, Hongyang Li **ECCV 2024**\n  - Publisher: OpenDriveLab, University of Tübingen, Tübingen AI Center, University of Hong Kong\n  - Code: [official](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM)\n  - Publish Date: 2023.12.21\n  - Summary:\n    - DriveLM-Task\n      - Graph VQA involves formulating P1-3 (Perception, Prediction, Planning) reasoning as a series of question-answer pairs (QAs) in a directed graph.\n    - DriveLM-Data\n      - DriveLM-Carla\n        - Collect data using CARLA 0.9.14 in the Leaderboard 2.0 framework with a privileged rule-based expert.\n      - DriveLM-nuScenes\n        - Selecting key frames from video clips, choosing key objects within these key frames, and subsequently annotating the frame-level P1−3 QAs for these key objects. A portion of the Perception QAs is generated from the nuScenes and [OpenLane-V2](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FOpenLane-V2) ground truth, while the remaining QAs are manually annotated.\n    - DriveLM-Agent\n      - DriveLM-Agent is built upon a general vision-language model and can therefore exploit underlying knowledge gained during pre-training. 
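\n\nLLM-ASSIST(par) above illustrates a recurring division of labour in this list: a rule-based planner (PDM-Closed) handles nominal scenarios, and the LLM is consulted only to supply a new parameter set when the rule-based plan is judged inadequate. The sketch below is a hedged illustration of that fallback pattern; every name in it (rule_planner, score_plan, ask_llm_for_params) is an illustrative stand-in, not the authors' implementation.\n\n```python\n# Hedged sketch of an 'LLM proposes planner parameters' fallback, in the spirit of\n# LLM-ASSIST(par). All names and the scoring interface are assumptions for illustration.\nDEFAULT_PARAMS = {'target_speed': 10.0, 'lateral_offset': 0.0}\n\ndef plan_with_fallback(scene, rule_planner, score_plan, ask_llm_for_params, threshold=0.8):\n    plan = rule_planner(scene, DEFAULT_PARAMS)\n    if score_plan(scene, plan) >= threshold:\n        return plan                        # nominal case: keep the rule-based plan\n    params = ask_llm_for_params(scene)     # the LLM returns planner parameters, not a raw trajectory\n    safe = {key: params.get(key, DEFAULT_PARAMS[key]) for key in DEFAULT_PARAMS}\n    return rule_planner(scene, safe)       # re-run the rule-based planner with the LLM-chosen parameters\n```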
\n\n- [LingoQA: Video Question Answering for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14115)\n  - Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, Oleg Sinavski\n  - Publisher: Wayve\n  - Task: VQA + Evaluation\u002FDatasets\n  - Code: [official](https:\u002F\u002Fgithub.com\u002Fwayveai\u002FLingoQA)\n  - Publish Date: 2023.12.21\n  - Summary:\n    - Introduce a novel benchmark for autonomous driving video QA using a learned text classifier for evaluation. \n    - Introduce a Video QA dataset of central London consisting of 419k samples with its free-form questions and answers.\n    - Establish a new baseline based on Vicuna-1.5-7B for this field with an identified model combination.\n\n- [DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.09245)\n  - Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, Hao Tian, Lewei Lu, Xizhou Zhu, Xiaogang Wang, Yu Qiao, Jifeng Dai\n  - Publisher: OpenGVLab, Shanghai AI Laboratory, The Chinese University of Hong Kong, SenseTime Research, Stanford University, Nanjing University, Tsinghua University\n  - Task: Planning + Explanation\n  - Code: [official](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FDriveMLM)\n  - Env: [Carla](https:\u002F\u002Fcarla.org\u002F)\n  - Publish Date: 2023.12.14\n  - Summary:\n    - DriveMLM, the first LLM-based AD framework that can perform close-loop\nautonomous driving in realistic simulators.\n    - Design an MLLM planner for decision prediction, and develop a data engine that can effectively generate decision states and corresponding explanation annotation for model training and evaluation.\n    - Achieve 76.1 DS, 0.955 MPI results on CARLA Town05 Long.\n\n- [Large Language Models for Autonomous Driving: Real-World Experiments](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.09397)\n  - Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, Tianren Gao, Erlong Li, Kun Tang, Zhipeng Cao, Tong Zhou, Ao Liu, Xinrui Yan, Shuqi Mei, Jianguo Cao, Ziran Wang, Chao Zheng\n  - Publisher: Purdue University\n  - Publish Date: 2023.12.14\n  - Project: [official](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgcRcf9w8BmJfZigDhk1SAfXV0FY65cO7)\n  - Summary:\n    - Introduce a Large Language Model (LLM)-based framework Talk-to-Drive (Talk2Drive) to process verbal commands from humans and make autonomous driving decisions with contextual information, satisfying their personalized preferences for safety, efficiency, and comfort.\n\n- [LMDrive: Closed-Loop End-to-End Driving with Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.07488)\n  - Hao Shao, Yuxuan Hu, Letian Wang, Steven L. 
Waslander, Yu Liu, Hongsheng Li      **CVPR 2024**\n  - Publisher: CUHK MMLab, SenseTime Research, CPII under InnoHK, University of Toronto, Shanghai Artificial Intelligence Laboratory\n  - Task: Planning + Datasets\n  - Code: [official](https:\u002F\u002Fgithub.com\u002Fopendilab\u002FLMDrive)\n  - Env: [Carla](https:\u002F\u002Fcarla.org\u002F)\n  - Publish Date: 2023.12.12\n  - Summary:\n    - LMDrive, a novel end-to-end, closed-loop, language-based autonomous driving framework.\n    - Release a 64K-clip dataset, including navigation instructions, notice instructions, multi-modal multi-view sensor data, and control signals.\n    - Present the benchmark LangAuto for evaluating autonomous agents.\n\n- [Evaluation of Large Language Models for Decision Making in Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.06351.pdf)\n  - Kotaro Tanahashi, Yuichi Inoue, Yu Yamaguchi, Hidetatsu Yaginuma, Daiki Shiotsuka, Hiroyuki Shimatani, Kohei Iwamasa, Yoshiaki Inoue, Takafumi Yamaguchi, Koki Igari, Tsukasa Horinouchi, Kento Tokuhiro, Yugo Tokuchi, Shunsuke Aoki\n  - Publisher: Turing Inc., Japan\n  - Task: Evaluation\n  - Publish Date: 2023.12.11\n  - Summary:\n    - Evaluate two core capabilities:\n      - spatial-awareness decision-making, i.e., whether LLMs can accurately identify the spatial layout based on coordinate information;\n      - the ability to follow traffic rules, i.e., whether LLMs strictly abide by traffic laws while driving.\n\n- [LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.04372)\n  - Yunsheng Ma, Can Cui, Xu Cao, Wenqian Ye, Peiran Liu, Juanwu Lu, Amr Abdelraouf, Rohit Gupta, Kyungtae Han, Aniket Bera, James M. 
Rehg, Ziran Wang\n  - Publisher: Purdue University, University of Illinois Urbana-Champaign, University of Virginia, InfoTech Labs, Toyota Motor North America\n  - Task: Benchmark\n  - Publish Date: 2023.12.07\n  - Summary:\n    - LaMPilot is the first interactive environment and dataset designed for evaluating LLM-based agents in a driving context.\n    - It contains 4.9K scenes and is specifically designed to evaluate command-tracking tasks in autonomous driving.\n\n- [Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03661)\n  - Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, Li Zhang\n  - Publisher: Fudan University, Huawei Noah’s Ark Lab\n  - Task: VQA + Datasets\n  - Code: [official](https:\u002F\u002Fgithub.com\u002Ffudan-zvg\u002FReason2Drive)\n  - Datasets:\n    - [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n    - [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n    - [ONCE](https:\u002F\u002Fonce-for-auto-driving.github.io\u002Findex.html)\n  - Publish Date: 2023.12.06\n  - Summary:\n    - Reason2Drive, a benchmark dataset with over 600K video-text pairs, aimed at facilitating the study of interpretable reasoning in complex driving.\n    - Introduce a novel evaluation metric to assess chain-based reasoning performance in autonomous driving environments, and address the semantic ambiguities of existing metrics such as BLEU and CIDEr.\n    - Introduce a straightforward yet effective framework that enhances existing VLMs with two new components: a prior tokenizer and an instructed vision decoder.\n\n- [GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03543)\n  - Haicheng Liao, Huanming Shen, Zhenning Li, Chengyue Wang, Guofa Li, Yiming Bie, Chengzhong Xu\n  - Publisher: University of Macau, UESTC, Chongqing University, Jilin University\n  - Task: Detection\u002FPrediction\n  - Code: [official](https:\u002F\u002Fgithub.com\u002FPetrichor625\u002FTalk2car_CAVG)\n  - Datasets:\n    - [Talk2car](https:\u002F\u002Fgithub.com\u002Ftalk2car\u002FTalk2Car)\n  - Publish Date: 2023.12.06\n  - Summary:\n    - Utilize five encoders (Text, Image, Context, and Cross-Modal) with a multimodal decoder to predict object bounding boxes.\n\n- [Dolphins: Multimodal Language Model for Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.00438)\n  - Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, Chaowei Xiao **ECCV 2024**\n  - Publisher: University of Wisconsin-Madison, NVIDIA, University of Michigan, Stanford University\n  - Task: VQA\n  - Project: [Dolphins](https:\u002F\u002Fvlm-driver.github.io\u002F)\n  - Code: [Dolphins](https:\u002F\u002Fgithub.com\u002Fvlm-driver\u002FDolphins)\n  - Datasets: \n    - Image instruction-following dataset\n      - [GQA](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002Fabout.html)\n      - [MSCOCO](https:\u002F\u002Fcocodataset.org\u002F#home): [VQAv2](https:\u002F\u002Fvisualqa.org\u002F), [OK-VQA](https:\u002F\u002Fokvqa.allenai.org\u002F), [TDIUC](https:\u002F\u002Fkushalkafle.com\u002Fprojects\u002Ftdiuc.html), [Visual Genome dataset](https:\u002F\u002Fhomes.cs.washington.edu\u002F~ranjay\u002Fvisualgenome\u002Findex.html)\n    - Video instruction-following dataset\n      - [BDD-X](https:\u002F\u002Fgithub.com\u002FJinkyuKimUCB\u002FBDD-X-dataset)\n  - Publish Date: 2023.12.01\n  - 
Summary:\n    - Dolphins which is base on OpenFlamingo architecture is a VLM-based conversational driving assistant.\n    - Devise grounded CoT (GCoT) instruction tuning and develop datasets.\n\n- [Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17918)\n  - Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, Zhaoxiang Zhang\n  - Publisher: CASIA, CAIR, HKISI, CAS\n  - Task: Generation\n  - Project: [Drive-WM](https:\u002F\u002Fdrive-wm.github.io\u002F)\n  - Code: [Drive-WM](https:\u002F\u002Fgithub.com\u002FBraveGroup\u002FDrive-WM)\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes), [Waymo Open Dataset](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - Publish Date: 2023.11.29\n  - Summary:\n    - Drive-WM, a multiview world model, which is capable of generating high-quality, controllable, and consistent multiview videos in autonomous driving scenes.\n    - The first to explore the potential application of the world model in end-to-end planning for autonomous driving.\n\n- [Empowering Autonomous Driving with Large Language Models: A Safety Perspective](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.00812)\n  - Yixuan Wang, Ruochen Jiao, Chengtian Lang, Sinong Simon Zhan, Chao Huang, Zhaoran Wang, Zhuoran Yang, Qi Zhu\n  - Publisher: Northwestern University, University of Liverpool, Yale University\n  - Task: Planning\n  - Env: [HighwayEnv](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - Code: [official](https:\u002F\u002Fgithub.com\u002Fwangyixu14\u002Fllm_conditioned_mpc_ad)\n  - Publish Date: 2023.11.28\n  - Summary:\n    - Deploys the LLM as an intelligent decision-maker in planning, incorporating safety verifiers for contextual safety learning to enhance overall AD performance and safety.\n\n- [GPT-4V Takes the Wheel: Evaluating Promise and Challenges for Pedestrian Behavior Prediction](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.14786)\n  - Jia Huang, Peng Jiang, Alvika Gautam, Srikanth Saripalli\n  - Publisher: Texas A&M University, College Station, USA\n  - Task: Evaluation(Pedestrian Behavior Prediction)\n  - Datasets: \n    - [JAAD](https:\u002F\u002Fdata.nvision2.eecs.yorku.ca\u002FJAAD_dataset\u002F)\n    - [PIE](https:\u002F\u002Fdata.nvision2.eecs.yorku.ca\u002FPIE_dataset\u002F)\n    - [WiDEVIEW](https:\u002F\u002Fgithub.com\u002Funmannedlab\u002FUWB_Dataset)\n  - Summary:\n    - Provides a comprehensive evaluation of the potential of GPT-4V for pedestrian behavior prediction in autonomous driving using publicly available datasets.\n    - It still falls short of the state-of-the-art traditional domain-specific models.\n    - While GPT-4V represents a considerable advancement in AI capabilities for pedestrian behavior prediction, ongoing development and refinement are necessary to fully harness its capabilities in practical applications.\n\n- [ADriver-I: A General World Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.13549)\n  - Fan Jia, Weixin Mao, Yingfei Liu, Yucheng Zhao, Yuqing Wen, Chi Zhang, Xiangyu Zhang, Tiancai Wang\n  - Publisher: MEGVII Technology, Waseda University, University of Science and Technology of China, Mach Drive\n  - Task: Generation + Planning\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes), Largescale private datasets\n  - Publish Date: 2023.11.22\n  - Summary:\n    - ADriver-I takes the vision-action pairs as inputs and 
autoregressively predicts the control signal of the current frame. The generated control signals together with the historical vision-action pairs are then used as conditions to predict the future frames. \n    - MLLM (Multimodal Large Language Model) = [LLaVA-7B-1.5](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA), VDM (Video Diffusion Model) = [latent-diffusion](https:\u002F\u002Fgithub.com\u002FCompVis\u002Flatent-diffusion)\n  - Metrics:\n    - L1 error of the speed and steering angle for the current frame.\n    - Quality of Generation: Frechet Inception Distance (FID), Frechet Video Distance (FVD).\n\n- [A Language Agent for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.10813)\n  - Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, Yue Wang\n  - Publisher: University of Southern California, Stanford University, NVIDIA\n  - Task: Generation + Planning\n  - Project: [Agent-Driver](https:\u002F\u002Fusc-gvl.github.io\u002FAgent-Driver\u002F)\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - Publish Date: 2023.11.17\n  - Summary:\n    - Agent-Driver integrates a tool library for dynamic perception and prediction, a cognitive memory for human knowledge, and a reasoning engine that emulates human decision-making.\n    - For motion planning, follow [GPT-Driver](#GPT-Driver) and fine-tune the LLM with human driving trajectories in the nuScenes training set for one epoch. \n    - For neural modules, adopt the modules in [UniAD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.10156).\n  - Metric:\n    - L2 error (in meters) and collision rate (in percentage).\n\n- [Human-Centric Autonomous Systems With LLMs for User Command Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.08206)\n  - Yi Yang, Qingwen Zhang, Ci Li, Daniel Simões Marta, Nazre Batool, John Folkesson\n  - Publisher: KTH Royal Institute of Technology, Scania AB\n  - Task: QA\n  - Code: [DriveCmd](https:\u002F\u002Fgithub.com\u002FKTH-RPL\u002FDriveCmd_LLM)\n  - Datasets: [UCU Dataset](https:\u002F\u002Fgithub.com\u002FLLVM-AD\u002Fucu-dataset)\n  - Publish Date: 2023.11.14\n  - Summary:\n    - Propose to leverage the reasoning capabilities of Large Language Models (LLMs) to infer system requirements from in-cabin users’ commands.\n    - LLVM-AD Workshop @ WACV 2024 \n  - Metric:\n    - Accuracy at the question level (accuracy for each individual question).\n    - Accuracy at the command level (accuracy is only acknowledged if all questions for a particular command are correctly identified).\n\n- [On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.05332)\n  - Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi\n  - Publisher: Shanghai Artificial Intelligence Laboratory, GigaAI, East China Normal University, The Chinese University of Hong Kong, WeRide.ai\n  - Project: [official](https:\u002F\u002Fgithub.com\u002FPJLab-ADG\u002FGPT4V-AD-Exploration)\n  - Datasets:\n    - Scenario Understanding: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes), [BDD-X](https:\u002F\u002Fgithub.com\u002FJinkyuKimUCB\u002FBDD-X-dataset), [Carla](https:\u002F\u002Fgithub.com\u002Fcarla-simulator), [TSDD](http:\u002F\u002Fwww.nlpr.ia.ac.cn\u002Fpal\u002Ftrafficdata\u002Fdetection.html), [Waymo](https:\u002F\u002Farxiv.org\u002Fabs\u002F1912.04838), 
[DAIR-V2X](https:\u002F\u002Fthudair.baai.ac.cn\u002Findex), [CitySim](https:\u002F\u002Fgithub.com\u002Fozheng1993\u002FUCF-SST-CitySim-Dataset).\n    - Reasoning Capability: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes), [D2-city](https:\u002F\u002Farxiv.org\u002Fabs\u002F1904.01975), [Carla](https:\u002F\u002Fgithub.com\u002Fcarla-simulator), [CODA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.07724) and the internet\n    - Act as a driver: Real-world driving scenarios.\n  - Publish Date: 2023.11.9 \n  - Summary:\n    - Conducted a comprehensive and multi-faceted evaluation of the GPT-4V in various autonomous driving scenarios.\n    - Test the capabilities of GPT-4V in Scenario Understanding, Reasoning, Act as a driver.\n\n- [ChatGPT as Your Vehicle Co-Pilot: An Initial Attempt](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10286969)\n  - Shiyi Wang, Yuxuan Zhu, Zhiheng Li, Yutong Wang, Li Li, Zhengbing He\n  - Publisher: Tsinghua University, Institute of Automation, Chinese Academy of Sciences, Massachusetts Institute of Technology\n  - Task: Planning\n  - Publish Date: 2023.10.17\n  - Summary:\n    - Design a universal framework that embeds LLMs as a vehicle \"Co-Pilot\" of driving, which can accomplish specific driving tasks with human intention satisfied based on the information provided.\n\n- [MagicDrive: Street View Generation with Diverse 3D Geometry Control](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02601)\n  - Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, Qiang Xu\n  - Publisher: The Chinese University of Hong Kong, Hong Kong University of Science and Technology, Huawei Noah’s Ark Lab\n  - Task: Generation\n  - Project: [MagicDrive](https:\u002F\u002Fgaoruiyuan.com\u002Fmagicdrive\u002F)\n  - Code: [MagicDrive](https:\u002F\u002Fgithub.com\u002Fcure-lab\u002FMagicDrive)\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - Publish Date: 2023.10.13\n  - Summary:\n    - MagicDrive generates highly realistic images, exploiting geometric information from 3D annotations by independently encoding road maps, object boxes, and camera parameters for precise, geometry-guided synthesis. 
This approach effectively solves the challenge of multi-camera view consistency.\n    - It also faces huge challenges in some complex scenes, such as night views and unseen weather conditions.\n\n- [Receive, Reason, and React: Drive as You Say with Large Language Models in Autonomous Vehicles](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08034)\n  - Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Ziran Wang\n  - Publisher: Purdue University,  University of Illinois Urbana-Champaign，University of Virginia，PediaMed.AI.\n  - Task: Planning\n  - Project: [video](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgcRcf9w8BmLJi_fqTGq-7KCZsbpEIE4a)\n  - Env: [HighwayEnv](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - Publish Date: 2023.10.12\n  - Summary:\n    - Utilize LLMs’ linguistic and contextual understanding abilities with specialized tools to integrate the language and reasoning capabilities of LLMs into autonomous vehicles.\n\n- [DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.07771)\n  - Xiaofan Li, Yifu Zhang, Xiaoqing Ye\n  - Publisher: Baidu Inc.\n  - Task: Generation\n  - Project: [official](https:\u002F\u002Fdrivingdiffusion.github.io\u002F)\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - Summary:\n    - Address the new problem of multi-view video data generation from 3D layout in complex urban scenes.'\n    - Propose a generative model DrivingDiffusion to ensure the cross-view, cross-frame consistency and the instance quality of the generated videos.\n    - Achieve state-of-the-art video synthesis performance on nuScenes dataset.\n  - Metrics:\n    - Quality of Generation: Frechet Inception Distance(FID), Frechet Video Distance(FVD)\n    - Segmentation Metrics: mIoU\n\n- \u003Ca id=\"LanguageMPC\">\u003C\u002Fa>[LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.03026)\n  - Hao Sha, Yao Mu, Yuxuan Jiang, Li Chen, Chenfeng Xu, Ping Luo, Shengbo Eben Li, Masayoshi Tomizuka, Wei Zhan, Mingyu Ding\n  - Publisher: Tsinghua University, The University of Hong Kong, University of California, Berkeley\n  - Task: Planning\u002FControl\n  - Code: [official](https:\u002F\u002Fsites.google.com\u002Fview\u002Fllm-mpc)\n  - Env: \n    - [ComplexUrbanScenarios](https:\u002F\u002Fgithub.com\u002Fliuyuqi123\u002FComplexUrbanScenarios)\n    - [Carla](https:\u002F\u002Fgithub.com\u002Fcarla-simulator)\n  - Publish Date: 2023.10.04\n  - Summary:\n    - Leverage LLMs to provide high-level decisions through chain-of-thought.\n    - Convert high-level decisions into mathematical representations to guide the bottom-level controller(MPC).\n    - Metrics: Number of failure\u002Fcollision cases， Inefficiency，time, Penalty\n\n- \u003Ca id=\"DrivingwithLLMs\">\u003C\u002Fa>[Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving](https:\u002F\u002Fbrowse.arxiv.org\u002Fabs\u002F2310.01957)\n  - Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, Jamie Shotton\n  - Publisher: Wayve\n  - Task: Planning + VQA\n  - Code: [official](https:\u002F\u002Fgithub.com\u002Fwayveai\u002FDriving-with-LLMs)\n  - Simulator: a custom-built realistic 2D simulator.(The simulator is not open source.)\n  - Datasets: [Driving 
QA](https:\u002F\u002Fgithub.com\u002Fwayveai\u002FDriving-with-LLMs\u002Ftree\u002Fmain\u002Fdata), data collected using RL experts in the simulator.\n  - Publish Date: 2023.10.03\n  - Summary:\n    - Propose a unique object-level multimodal LLM architecture (Llama2 + LoRA), using only vectorized representations as input.\n    - Develop a new dataset of 160k QA pairs derived from 10k driving scenarios (control commands collected by RL (PPO), QA pairs generated by GPT-3.5).\n    - Metrics: \n      - Accuracy of traffic light detection\n      - MAE for traffic light distance prediction\n      - MAE for acceleration\n      - MAE for brake pressure\n      - MAE for steering wheel angle\n\n- [Talk2BEV: Language-enhanced Bird’s-eye View Maps for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02251)\n  - Vikrant Dewangan, Tushar Choudhary, Shivam Chandhok, Shubham Priyadarshan, Anushka Jain, Arun K. Singh, Siddharth Srivastava, Krishna Murthy Jatavallabhula, K. Madhava Krishna\n  - Publisher: IIIT Hyderabad, University of British Columbia, University of Tartu, TensorTour Inc, MIT\n  - Project Page: [official](https:\u002F\u002Fllmbev.github.io\u002Ftalk2bev\u002F)\n  - Code: [Talk2BEV](https:\u002F\u002Fgithub.com\u002Fllmbev\u002Ftalk2bev)\n  - Publish Date: 2023.10.03\n  - Summary:\n    - Introduces Talk2BEV, a large vision-language model (LVLM) interface for bird’s-eye view (BEV) maps in autonomous driving contexts.\n    - Does not require any training or finetuning, relying instead on pre-trained image-language models.\n    - Develop and release Talk2BEV-Bench, a benchmark encompassing 1000 human-annotated BEV scenarios, with more than 20,000 questions and ground-truth responses from the NuScenes dataset.\n\n- \u003Ca id=\"DriveGPT4\">\u003C\u002Fa>[DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01412)\n  - Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kenneth K. Y. 
Wong, Zhenguo Li, Hengshuang Zhao\n  - Publisher: The University of Hong Kong, Zhejiang University, Huawei Noah’s Ark Lab, University of Sydney\n  - Project Page: [official](https:\u002F\u002Ftonyxuqaq.github.io\u002Fprojects\u002FDriveGPT4\u002F)\n  - Task: Planning\u002FControl + VQA\n  - Datasets: \n    - [BDD-X dataset](https:\u002F\u002Fgithub.com\u002FJinkyuKimUCB\u002FBDD-X-dataset).\n  - Publish Date: 2023.10.02\n  - Summary:\n    - Develop a new visual instruction tuning dataset(based on BDD-X) for interpretable AD assisted by ChatGPT\u002FGPT4.\n    - Present a novel multimodal LLM called DriveGPT4(Valley + LLaVA).\n  - Metrics: \n    - BLEU4, CIDEr and METETOR, ChatGPT Score.\n    - RMSE for control signal prediction.\n\n- \u003Ca id=\"GPT-Driver\">\u003C\u002Fa>[GPT-DRIVER: LEARNING TO DRIVE WITH GPT](https:\u002F\u002Fbrowse.arxiv.org\u002Fabs\u002F2310.01415v1)\n  - Jiageng Mao, Yuxi Qian, Hang Zhao, Yue Wang\n  - Publisher: University of Southern California, Tsinghua University\n  - Task: Planning(Fine-tuning Pre-trained Model)\n  - Project: [official](https:\u002F\u002Fpointscoder.github.io\u002Fprojects\u002Fgpt_driver\u002Findex.html)\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - Code: [GPT-Driver](https:\u002F\u002Fgithub.com\u002FPointsCoder\u002FGPT-Driver)\n  - Publish Date: 2023.10.02\n  - Summary:\n    - Motion planning as a language modeling problem.\n    - Align the output of the LLM with human driving behavior through fine-tuning strategies using the OpenAI fine-tuning API.\n    - Leverage the LLM to generate driving trajectories.\n  - Metrics:\n    - L2 metric and Collision rate\n\n- [GAIA-1: A Generative World Model for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17080)\n  - Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, Gianluca Corrado\n  - Publisher: Wayve\n  - Task: Generation\n  - Datasets: \n    - Training dataset consists of 4,700 hours at 25Hz of proprietary driving data collected in London,\nUK between 2019 and 2023. It corresponds to approximately 420M unique images. 
\n    - Validation dataset contains 400 hours of driving data from runs not included in the training set.\n    - text coming from either online narration or offline metadata sources\n  - Publish Date: 2023.09.29\n  - Summary:\n    - Introduce GAIA-1, a generative world model that leverages video(pre-trained DINO), text(T5-large), and action inputs to generate realistic driving scenarios.\n    - Serve as a valuable neural simulator, allowing the generation of unlimited data.\n\n- [DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.16292)\n  - Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, Yu Qiao   **ICLR 2024**\n  - Publisher: Shanghai AI Laboratory, East China Normal University, The Chinese University of Hong Kong\n  - Publish Date: 2023.09.28\n  - Task: Planning\n  - Env: \n    - [HighwayEnv](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n    - [CitySim](https:\u002F\u002Fgithub.com\u002Fozheng1993\u002FUCF-SST-CitySim-Dataset), a Drone-Based vehicle trajectory dataset.\n  - Summary: \n    - Propose the DiLu framework, which combines a Reasoning and a Reflection module to enable the system to perform decision-making based on common-sense knowledge and evolve continuously.\n\n- [SurrealDriver: Designing Generative Driver Agent Simulation Framework in Urban Contexts based on Large Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.13193)\n  - Ye Jin, Xiaoxi Shen, Huiling Peng, Xiaoan Liu, Jingli Qin, Jiayang Li, Jintao Xie, Peizhong Gao, Guyue Zhou, Jiangtao Gong\n  - Keywords: human-AI interaction, driver model, agent, generative AI, large language model, simulation framework\n  - Env: [CARLA](https:\u002F\u002Fgithub.com\u002Fcarla-simulator)\n  - Publisher: Tsinghua University\n  - Summary: Propose a generative driver agent simulation framework based on large language models (LLMs), capable of perceiving complex traffic scenarios and providing realistic driving maneuvers.\n\n- [Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.10228)\n  - Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Ziran Wang\n  - Publisher: Purdue University, PediaMed.AI Lab, University of Virginia\n  - Task: Planning\n  - Publish Date: 2023.09.18\n  - Summary:\n    - Provide a comprehensive framework for integrating Large Language Models (LLMs) into AD.\n\n- \u003Ca id=\"DriveDreamer\">\u003C\u002Fa>[DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.09777)\n  - Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiwen Lu  **ECCV 2024**\n  - Publisher: GigaAI, Tsinghua University\n  - Task: Generation\n  - Project Page: [official](https:\u002F\u002Fdrivedreamer.github.io\u002F)\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - Publish Date: 2023.09.18\n  - Summary:\n    - Harness the powerful diffusion model to construct a comprehensive representation of the complex environment.\n    - Generate future driving videos and driving policies by a multimodal(text, image, HDMap, Action, 3DBox) world model.\n\n- [Can you text what is happening? 
Integrating pre-trained language encoders into trajectory prediction models for autonomous driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.05282)\n  - Ali Keysan, Andreas Look, Eitan Kosman, Gonca Gürsun, Jörg Wagner, Yu Yao, Barbara Rakitsch\n  - Publisher: Bosch Center for Artificial Intelligence, University of Tubingen, \n  - Task: Prediction\n  - Datasets: [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - Publish Date: 2023.09.13\n  - Summary:\n    - Integrating pre-trained language models as textbased input encoders for the AD trajectory prediction task.\n  - Metrics:\n    - minimum Average Displacement Error (minADEk)\n    - Final Displacement Error (minFDEk)\n    - MissRate over 2 meters\n\n- [TrafficGPT: Viewing, Processing and Interacting with Traffic Foundation Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.06719)\n  - Siyao Zhang, Daocheng Fu, Zhao Zhang, Bin Yu, Pinlong Cai\n  - Publisher: Beihang University, Key Laboratory of Intelligent Transportation Technology and System,  Shanghai Artificial Intelligence Laboratory\n  - Task: Planning\n  - Code: [official](https:\u002F\u002Fgithub.com\u002Flijlansg\u002FTrafficGPT.git)\n  - Publish Date: 2023.09.13\n  - Summary:\n    - Present TrafficGPT—a fusion of ChatGPT and traffic foundation models. \n    - Bridges the critical gap between large language models and traffic foundation models by defining a series of prompts.\n  \n- [HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.05186)\n  - Xinpeng Ding, Jianhua Han, Hang Xu, Wei Zhang, Xiaomeng Li\n  - Publisher: The Hong Kong University of Science and Technology, Huawei Noah’s Ark Lab\n  - Task: Detection + VQA\n  - Datasets: [DRAMA](https:\u002F\u002Fusa.honda-ri.com\u002Fdrama)\n  - Publish Date: 2023.09.11\n  - Summary:\n    - Propose HiLM-D (Towards High-Resolution Understanding in MLLMs for Autonomous Driving), an efficient method to incorporate HR information into MLLMs for the ROLISP task.\n    - ROLISP that aims to identify, explain and localize the risk object for the ego-vehicle meanwhile predicting its intention and giving suggestions.\n  - Metrics:\n    - LLM metrics, BLEU4, CIDEr and METETOR, SPICE.\n    - Detection metrics, mIoU, IoUs so on.\n\n- \u003Ca id=\"LanguagePrompt\">\u003C\u002Fa>[Language Prompt for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.04379)\n  - Dongming Wu, Wencheng Han, Tiancai Wang, Yingfei Liu, Xiangyu Zhang, Jianbing Shen\n  - Publisher: Beijing Institute of Technology, University of Macau, MEGVII Technology, Beijing Academy of Artificial Intelligence\n  - Task: Tracking\n  - Code: [official](https:\u002F\u002Fgithub.com\u002Fwudongming97\u002FPrompt4Driving)\n  - Datasets: NuPrompt(not open), based on [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes). \n  - Publish Date: 2023.09.08\n  - Summary:\n    - Propose a new large-scale language prompt set(based on nuScenes) for driving scenes, named NuPrompt(3D object-text pairs).\n    - Propose an efficient prompt-based tracking model with prompt reasoning modification on PFTrack, called PromptTrack. \n\n- [MTD-GPT: A Multi-Task Decision-Making GPT Model for Autonomous Driving at Unsignalized Intersections](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16118)\n  - Jiaqi Liu, Peng Hang, Xiao Qi, Jianqiang Wang, Jian Sun. 
*ITSC 2023*\n  - Publisher: Tongji University, Tsinghua University\n  - Task: Prediction\n  - Env: [HighwayEnv](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - Publish Date: 2023.07.30\n  - Summary:\n    - Design a pipeline that leverages RL algorithms to train single-task decision-making experts and utilize expert data.\n    - Propose the MTD-GPT model for multi-task(left-turn, straight-through, right-turn) decision-making of AV at unsignalized intersections.\n\n- [Domain Knowledge Distillation from Large Language Model: An Empirical Study in the Autonomous Driving Domain](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.11769)\n  - Yun Tang, Antonio A. Bruto da Costa, Xizhe Zhang, Irvine Patrick, Siddartha Khastgir, Paul Jennings. *ITSC 2023*\n  - Publisher: University of Warwick\n  - Task: QA\n  - Publish Date: 2023.07.17\n  - Summary:\n    - Develop a web-based distillation assistant enabling supervision and flexible intervention at runtime by prompt engineering and the LLM ChatGPT.\n\n- [Drive Like a Human: Rethinking Autonomous Driving with Large Language Models](https:\u002F\u002Fbrowse.arxiv.org\u002Fabs\u002F2307.07162)\n  - Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, Yu Qiao\n  - Publisher: Shanghai AI Lab, East China Normal University\n  - Task: Planning\n  - Code: [official](https:\u002F\u002Fgithub.com\u002FPJLab-ADG\u002FDriveLikeAHuman)\n  - Env: [HighwayEnv](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - Publish Date: 2023.07.14\n  - Summary:\n    - Identify three key abilities: Reasoning, Interpretation and Memorization(accumulate experience and self-reflection).\n    - Utilize LLM in AD as decision-making to solve long-tail corner cases and increase interpretability.\n    - Verify interpretability in closed-loop offline data.\n\n- [Language-Guided Traffic Simulation via Scene-Level Diffusion](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.06344)\n  - Ziyuan Zhong, Davis Rempe, Yuxiao Chen, Boris Ivanovic, Yulong Cao, Danfei Xu, Marco Pavone, Baishakhi Ray\n  - Publisher: Columbia University, NVIDIA Research, Stanford University, Georgia Tech\n  - Task: Diffusion\n  - Publish Date: 2023.07.10\n  - Summary: \n    - Present CTG++, a language-guided scene-level conditional diffusion model for realistic query-compliant traffic simulation. 
\n    - Leverage an LLM for translating a user query into a differentiable loss function and propose a scene-level conditional diffusion model (with a spatial-temporal transformer architecture) to translate the loss function into realistic, query compliant trajectories.\n\n- [ADAPT: Action-aware Driving Caption Transformer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.00673)\n  - Bu Jin, Xinyu Liu, Yupeng Zheng, Pengfei Li, Hao Zhao, Tong Zhang, Yuhang Zheng, Guyue Zhou, Jingjing Liu **ICRA 2023**\n  - Publisher: Chinese Academy of Sciences, Tsinghua University, Peking University, Xidian University, Southern University of Science and Technology, Beihang University\n  - Code: [ADAPT](https:\u002F\u002Fgithub.com\u002Fjxbbb\u002FADAPT)\n  - Datasets: [BDD-X dataset](https:\u002F\u002Fgithub.com\u002FJinkyuKimUCB\u002FBDD-X-dataset)\n  - Summary:\n    - Propose ADAPT, a new end-to-end transformerbased action narration and reasoning framework for\nself-driving vehicles.\n    - propose a multi-task joint training framework that aligns both the driving action captioning task and the control signal prediction task.\n\u003C\u002Fdetails>\n\n## WorkShop\n\u003Cdetails open>\n\u003Csummary>Toggle\u003C\u002Fsummary>\n\n- [Large Language and Vision Models for Autonomous Driving(LLVM-AD) Workshop @ WACV 2024](https:\u002F\u002Fllvm-ad.github.io\u002F)\n  - Publisher: Tencent Maps HD Map T.Lab, University of Illinois Urbana- Champaign, Purdue University, University of Virginia\n  - Challenge 1: MAPLM: A Large-Scale Vision-Language Dataset for Map and Traffic Scene Understanding\n    - Datasets: [Download](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1cqFjBH8MLeP6nKFM0l7oV-Srfke-Mx1R?usp=sharing)\n    - Task: QA\n    - Code: https:\u002F\u002Fgithub.com\u002FLLVM-AD\u002FMAPLM\n    - Description: MAPLM combines point cloud BEV (Bird's Eye View) and panoramic images to provide a rich collection of road scenario images. It includes multi-level scene description data, which helps models navigate through complex and diverse traffic environments.\n    - Metric:\n      - Frame-overall-accuracy (FRM): A frame is considered correct if all closed-choice questions about it are answered correctly.\n      - Question-overall-accuracy (QNS): A question is considered correct if its answer is correct.\n      - LAN: How many lanes in current road?\n      - INT: Is there any road cross, intersection or lane change zone in the main road?\n      - QLT: What is the point cloud data quality in current road area of this image?\n      - SCN: What kind of road scene is it in the images? (SCN)    \n  - Challenge 2: In-Cabin User Command Understanding (UCU)\n    - Datasets: [Download](https:\u002F\u002Fgithub.com\u002FLLVM-AD\u002Fucu-dataset\u002Fblob\u002Fmain\u002Fucu.csv)\n    - Task: QA\n    - Code: https:\u002F\u002Fgithub.com\u002FLLVM-AD\u002Fucu-dataset\n    - Description: \n      - This dataset focuses on understanding user commands in the context of autonomous vehicles. It contains 1,099 labeled commands. Each command is a sentence that describes a user’s request to the vehicle. 
\n    - Metric:\n      - Command-level accuracy: A command is considered correctly understood if all eight answers are correct.\n      - Question-level accuracy: Evaluation at the individual question level.\n\u003C\u002Fdetails>\n\n## Datasets\n\u003Cdetails open>\n\u003Csummary>Toggle\u003C\u002Fsummary>\n\n```\nformat:\n- [title](dataset link) [links]\n  - author1, author2, and author3...\n  - keyword\n  - experiment environments or tasks\n```\n\n- [An interactive enhanced driving dataset for autonomous driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.20575)\n  - Haojie Feng, Peizhi Zhang, Mengjie Tian, Xinrui Zhang, Zhuoren Li, Junpeng Huang, Xiurong Wang, Junfan Zhu, Jianzhou Wang, Dongxiao Yin, Lu Xiong\n  - Publish Date: 2026.02.24\n  - Task: VQA\n  - Datasets: [IEDD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.20575)\n  - Summary：\n    - Proposes the Interactive Enhanced Driving Dataset (IEDD) to address the sparsity of interactive scenarios and inadequate multimodal alignment for Vision-Language-Action (VLA) models in autonomous driving.\n    - Develops a scalable pipeline to mine million-level interactive segments from naturalistic driving data and constructs the IEDD-VQA dataset with synthetic BEV videos where semantic actions are strictly aligned with structured language.\n    - Provides benchmark results evaluating ten mainstream Vision Language Models (VLMs) to demonstrate the dataset's reuse value for assessing and fine-tuning the reasoning capabilities of autonomous driving models.\n    - \n- [Seeing before Observable: Potential Risk Reasoning in Autonomous Driving via Vision Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.22928)\n  - Jiaxin Liu, Xiangyu Yan, Liang Peng, Lei Yang, Lingjun Zhang, Yuechen Luo, Yueming Tao, Ashton Yu Xuan Tan, Mu Li, Lei Zhang, Ziqi Zhan, Sai Guo, Hong Wang, Jun Li\n  - Publish Date: 2025.11.28\n  - Task: VQA\n  - Datasets: [PotentialRiskQA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.22928)\n  - Summary：\n    - Introduces PotentialRiskQA, a novel vision-language dataset for reasoning about potential risks in driving before they become observable, with structured annotations of scene descriptions, precursors, and inferred outcomes.\n    - Proposes PR-Reasoner, a vision-language-model-based framework tailored for onboard potential risk reasoning, which shows significant performance gains when fine-tuned on the new dataset.\n\n- [CARScenes: Semantic VLM Dataset for Safe Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.10701)\n  - Yuankai He, Weisong Shi\n  - Publisher: University of Delaware\n  - Publish Date: 2025.11.12\n  - Project Page: [CARScenes](https:\u002F\u002Fgithub.com\u002FCroquembouche\u002FCAR-Scenes)\n  - Code: [CARScenes](https:\u002F\u002Fgithub.com\u002FCroquembouche\u002FCAR-Scenes)\n  - Summary：\n    - CAR-Scenes is a frame-level semantic dataset for training and evaluating Vision-Language Models (VLMs) for interpretable, scene-level understanding in autonomous driving.\n    - The dataset provides 5,192 annotated images with a 28-category knowledge base, covering 350+ attributes related to environment, road users, and vehicle behavior, annotated via a GPT-4o-assisted pipeline with human verification.\n    - It includes tools for semantic retrieval, risk-aware scenario mining, and reproducible baselines, releasing annotation and analysis scripts to support explainable, data-centric workflows for intelligent vehicles.\n\n- [STRIDE-QA: Visual Question Answering Dataset for 
Spatiotemporal Reasoning in Urban Driving Scenes](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10427)\n  - Keishi Ishihara, Kento Sasaki, Tsubasa Takahashi, Daiki Shiono, Yu Yamaguchi\n  - Publish Date: 2025.08.14\n  - Project Page: [STRIDE-QA](https:\u002F\u002Fturingmotors.github.io\u002Fstride-qa\u002F)\n  - Task: VQA\n  - Datasets: [STRIDE-QA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10427)\n  - Summary：\n    - STRIDE-QA, a large-scale visual question answering (VQA) dataset for physically grounded spatiotemporal reasoning from an ego-centric perspective, constructed from 100 hours of multi-sensor driving data in Tokyo.\n    - The dataset offers 16 million QA pairs over 285K frames with dense, automatically generated annotations, supporting object-centric and ego-centric reasoning through three novel QA tasks requiring spatial localization and temporal prediction.\n    - Benchmarks show VLMs fine-tuned on STRIDE-QA achieve dramatic performance gains (55% spatial localization, 28% prediction consistency) compared to near-zero scores from general-purpose VLMs, establishing a foundation for reliable VLMs in autonomous systems.\n\n- [Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.00525)\n  - Djamahl Etchegaray, Yuxia Fu, Zi Huang, Yadan Luo\n  - Publish Date: 2025.07.01\n  - Project Page: [Box-QAymo](https:\u002F\u002Fdjamahl99.github.io\u002Fqaymo-pages\u002F)\n  - Task: VQA\n  - Datasets: [Box-QAymo](https:\u002F\u002Fdjamahl99.github.io\u002Fqaymo-pages\u002F)\n  - Summary：\n    - Introduces Box-QAymo, a box-referring VQA dataset and benchmark for evaluating and finetuning VLMs on spatial and temporal reasoning over user-specified objects in driving scenes.\n    - Proposes a hierarchical evaluation protocol covering binary sanity checks, attribute prediction, motion understanding, and spatiotemporal reasoning over inter-object dynamics.\n    - Provides a foundation for developing more robust and interpretable autonomous driving systems that can communicate effectively with users under real-world conditions.\n\n- [CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.10845)\n  - Hidehisa Arai, Keita Miwa, Kento Sasaki, Yu Yamaguchi, Kohei Watanabe, Shunsuke Aoki, Issei Yamamoto **WACV 2025 Oral**\n  - Publisher: Turing Inc.\n  - Publish Date: 2024.12.02\n  - Code: [CoVLA](https:\u002F\u002Fturingmotors.github.io\u002Fcovla-ad\u002F)\n  - Summary:\n    - CoVLA (Comprehensive Vision-Language-Action) Dataset, an extensive dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language descriptions of driving environments and maneuvers.\n  \n- [Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.06597)\n  - Enna Sachdeva, Nakul Agarwal, Suhas Chundi, Sean Roelofs, Jiachen Li, Behzad Dariush, Chiho Choi, Mykel Kochenderfer\n  - Publisher: Honda Research Institute, Stanford University\n  - Publish Date: 2023.09.10\n  - Summary:\n    - A multi-modal ego-centric dataset for Ranking the importance level and Telling the reason for the importance. 
\n    - Introduce a joint model for joint importance level ranking and natural language captions generation to benchmark our dataset.\n\n- [DriveLM: Drive on Language](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM)\n  - Publisher: Sima, Chonghao and Renz, Katrin and Chitta, Kashyap and Chen, Li and Zhang, Hanxue and Xie, Chengen and Luo, Ping and Geiger, Andreas and Li, Hongyang **ECCV 2024**\n  - Dataset: [DriveLM](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM\u002Fblob\u002Fmain\u002Fdocs\u002Fgetting_started.md#download-data)\n  - Publish Date: 2023.08\n  - Summary:\n    - Construct dataset based on the nuScenes dataset.\n    - Perception questions require the model to recognize objects in the scene. \n    - Prediction questions ask the model to predict the future status of important objects in the scene. \n    - Planning questions prompt the model to give reasonable planning actions and avoid dangerous ones.\n\n- [WEDGE: A multi-weather autonomous driving dataset built from generative vision-language models](https:\u002F\u002Fbrowse.arxiv.org\u002Fabs\u002F2305.07528)\n  - Aboli Marathe, Deva Ramanan, Rahee Walambe, Ketan Kotecha. **CVPR 2023**\n  - Publisher: Carnegie Mellon University, Symbiosis International University\n  - Dataset: [WEDGE](https:\u002F\u002Fgithub.com\u002FInfernolia\u002FWEDGE)\n  - Publish Date: 2023.05.12\n  - Summary:\n    - A multi-weather autonomous driving dataset built from generative vision-language models.\n\n- [NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14836)\n  - Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, Yu-Gang Jiang\n  - Publisher: Fudan University\n  - Dataset: [NuScenes-QA](https:\u002F\u002Fgithub.com\u002Fqiantianwen\u002FNuScenes-QA)\n  - Summary:\n    - NuScenes-QA provides 459,941 question-answer pairs based on the 34,149 visual scenes, with 376,604 questions from 28,130 scenes used for training, and 83,337 questions from 6,019 scenes used for testing, respectively.\n    - The multi-view images and point clouds are first processed by the feature extraction backbone\nto obtain BEV features.\n\n- [DRAMA: Joint Risk Localization and Captioning in Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.10767)\n  - Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, Jiachen Li\n  - Publisher: \n  - Datasets: [DRAMA](https:\u002F\u002Fusa.honda-ri.com\u002Fdrama#Introduction)\n  - Summary:\n    - Introduce a novel dataset DRAMA that provides linguistic descriptions (with the focus on reasons) of driving risks associated with important objects and that can be used to evaluate a range of visual captioning capabilities in driving scenarios.\n\n- [Language Prompt for Autonomous Driving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.04379)\n  - Datasets: Nuprompt(Not open)\n  - [Previous summary](#LanguagePrompt)\n\n- [Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving](https:\u002F\u002Fbrowse.arxiv.org\u002Fabs\u002F2310.01957)\n  - Datasets: [official](https:\u002F\u002Fgithub.com\u002Fwayveai\u002FDriving-with-LLMs\u002Ftree\u002Fmain\u002Fdata), data collection using RL experts in simulator.\n  - [Previous summary](#DrivingwithLLMs)\n\n- [Textual Explanations for Self-Driving Vehicles](https:\u002F\u002Farxiv.org\u002Fabs\u002F1807.11546)\n  - Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, Zeynep Akata **ECCV 2018**.\n  - Publisher: University of California, 
Berkeley, Saarland Informatics Campus, University of Amsterdam\n  - [BDD-X dataset](https:\u002F\u002Fgithub.com\u002FJinkyuKimUCB\u002FBDD-X-dataset)\n\n- [Grounding Human-To-Vehicle Advice for Self-Driving Vehicles](https:\u002F\u002Farxiv.org\u002Fabs\u002F1911.06978)\n  - Jinkyu Kim, Teruhisa Misu, Yi-Ting Chen, Ashish Tawari, John Canny **CVPR 2019**\n  - Publisher: UC Berkeley, Honda Research Institute USA, Inc.\n  - [HAD dataset](https:\u002F\u002Fusa.honda-ri.com\u002Fhad)\n\u003C\u002Fdetails>\n\n\n## License\n\nAwesome LLM for Autonomous Driving Resources is released under the Apache 2.0 license.\n","# 用于自动驾驶的大语言模型资源精选\n[![Awesome](https:\u002F\u002Fcdn.rawgit.com\u002Fsindresorhus\u002Fawesome\u002Fd7305f38d29fed78fa85652e3a63e154dd8e8829\u002Fmedia\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fsindresorhus\u002Fawesome)![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FThinklab-SJTU\u002FAwesome-LLM4AD?color=yellow) ![GitHub 分支数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002FThinklab-SJTU\u002FAwesome-LLM4AD?color=9cf) [![GitHub 许可证](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002FThinklab-SJTU\u002FAwesome-LLM4AD)](https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FAwesome-LLM4AD\u002Fblob\u002Fmain\u002FLICENSE)\n\n这是一个关于**用于自动驾驶的大语言模型（LLM4AD）**的研究论文合集。该仓库将持续更新，以追踪LLM4AD（用于自动驾驶的大语言模型）领域的前沿进展，其中VLM4AD（用于自动驾驶的视觉-语言模型）和VLA4AD（用于自动驾驶的视觉-语言-动作模型）作为这一统一范式的重要组成部分。*由上海交通大学ReThinkLab维护。*\n\n\n欢迎关注并点赞！如果您发现任何相关资料可能有所帮助，请随时联系我们（yangzhenjie@sjtu.edu.cn 或 jiaxiaosong@sjtu.edu.cn）或提交Pull Request。\n\n## 引用\n我们的综述论文位于 https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.01043，其中包含更详细的讨论，并将不断更新。\n\n如果您觉得我们的仓库有帮助，请考虑引用它。\n```BibTeX\n@misc{yang2023survey,\n      title={LLM4Drive: 自动驾驶领域的大语言模型综述}, \n      author={Zhenjie Yang 和 Xiaosong Jia 和 Hongyang Li 和 Junchi Yan},\n      year={2023},\n      eprint={2311.01043},\n      archivePrefix={arXiv},\n      primaryClass={cs.AI}\n}\n```\n\n## 目录\n- [用于自动驾驶的大语言模型资源精选（LLM4AD）](#awesome-llm-for-autonomous-driving-resources)\n  - [目录](#table-of-contents)\n  - [LLM4AD 概述](#overview-of-llm4ad)\n  - [论文](#papers)\n  - [数据集](#datasets)\n  - [引用](#citation)\n  - [许可证](#license)\n\n## LLM4AD 概述\n用于自动驾驶的大语言模型（LLM4AD）是指将大语言模型（LLMs）应用于自动驾驶领域。我们根据应用大语言模型的角度将其现有工作分为规划、感知、问答和生成四个方向。\n\n![图像说明](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FThinklab-SJTU_Awesome-LLM4AD_readme_87779784afbc.png)\n\n## LLM4AD 的动机\n橙色圆圈代表理想的驾驶能力水平，类似于经验丰富的驾驶员所具备的能力。获得这种能力主要有两种方法：一是通过在模拟环境中基于学习的技术；二是通过类似的方法从离线数据中学习。需要注意的是，由于仿真与现实世界之间存在差异，这两个领域并不完全相同，即“sim2real”差距。同时，离线数据是真实世界数据的一个子集，因为它直接从实际环境中收集而来。然而，由于自动驾驶任务具有众所周知的长尾特性，也很难完全覆盖其分布。自动驾驶的最终目标是通过大量数据收集和深度学习，将驾驶能力从基础的绿色阶段提升到更高级的蓝色阶段。\n\n![图像说明](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FThinklab-SJTU_Awesome-LLM4AD_readme_d237ced4f83e.png)\n\n## 论文\n\u003Cdetails open>\n\u003Csummary>展开\u003C\u002Fsummary>\n\n```\n格式：\n- [标题](论文链接) [链接]\n  - 作者1、作者2、作者3…\n  - 出版社\n  - 任务\n  - 关键词\n  - 代码或项目页面\n  - 数据集、环境或模拟器\n  - 发表日期\n  - 摘要\n  - 指标\n```\n\n- [Vega：学习如何通过自然语言指令进行驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.25741)\n  - Sicheng Zuo、Yuxuan Li、Wenzhao Zheng、Zheng Zhu、Jie Zhou、Jiwen Lu\n  - 发表日期：2026年3月26日\n  - 任务：规划\n  - 数据集：[InstructScene](https:\u002F\u002Fgithub.com\u002F)\n  - 摘要：\n    - Vega 是一种统一的视觉-语言-世界-动作模型，用于自动驾驶中的基于指令的生成和规划，采用自回归和扩散范式。\n    - 构建了一个大规模的驾驶数据集（InstructScene），包含约10万个场景，这些场景标注了各种驾驶指令和相应的轨迹。\n    - 展示了卓越的规划性能和强大的指令遵循能力，能够实现更加智能和个性化的驾驶系统。\n\n- [Drive My Way：面向个性化驾驶的视觉-语言-动作模型偏好对齐](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.25740)\n  - Zehao 
Wang、Huaide Jiang、Shuaiwu Dong、Yuping Wang、Hang Qiu、Jiachen Li\n  - 发表日期：2026年3月26日\n  - 项目页面：[Drive My Way](https:\u002F\u002Fdmw-cvpr.github.io\u002F)\n  - 代码：[Drive My Way](https:\u002F\u002Fgithub.com\u002Ftasl-lab\u002FDMW)\n  - 任务：规划\n  - 数据集：[Bench2Drive](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FBench2Drive)\n  - 摘要：\n    - Drive My Way（DMW）是一个个性化的视觉-语言-动作（VLA）驾驶框架，能够与用户的长期驾驶习惯保持一致，并适应实时用户指令。\n    - DMW 从个性化的驾驶数据集中学习用户嵌入，并以此嵌入为条件进行规划，而自然语言指令则提供短期指导。\n    - 在 Bench2Drive 上的评估表明，其风格指令适应性有所提高，用户研究也证实了其生成的行为确实能被识别为每位驾驶员的独特风格。\n\n- [ETA-VLA：基于时间融合与LLM内部稀疏化的高效令牌适配，适用于视觉-语言-动作模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.25766)\n  - Yiru Wang、Anqing Jiang、Shuo Wang、Yuwen Heng、Zichong Gu、Hao Sun\n  - 发表日期：2026年3月26日\n  - 任务：端到端\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - 提出 ETA-VLA，这是一种用于自动驾驶中视觉-语言-动作（VLA）模型的高效令牌适配框架，旨在降低处理历史多视角帧的计算负担。\n    - 引入了一种 LLM 内部稀疏聚合器（ILSA），该聚合器受人类驾驶员注意力的启发，利用文本引导的评分和多样性保留策略动态修剪冗余的视觉令牌。\n    - 在 NAVSIM v2 上实现了与 SOTA 相当的性能，同时将 FLOPs 降低了约 32%，修剪了 85% 的视觉令牌，并将推理 FLOPs 减少了 61%，同时保持了 94% 的准确率。\n\n- [从采样中学习轨迹：一种R1风格的分词交通仿真模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.24989)\n  - 王子渊、陈鹏、李丁、李驰威、张启超、夏仲璞、于桂珍\n  - 发表日期：2026年3月26日\n  - 任务：生成\n  - 数据集：[Waymo Sim Agent](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 摘要：\n    - R1Sim是一种新颖的分词交通仿真策略，基于运动标记熵模式探索强化学习，以学习多样且高保真的交通仿真。\n    - 引入了一种熵引导的自适应采样机制，专注于此前被忽视的高不确定性、高潜力运动标记。\n    - 使用考虑安全性的奖励设计，结合群体相对策略优化（GRPO）来优化运动行为，从而实现真实、安全且多样的多智能体行为。\n\n- [TIGFlow-GRPO：基于交互感知的流匹配与奖励驱动优化的人体轨迹预测](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.24936)\n  - 荆学鹏、陆文焕、孟浩、俞志智、魏建国\n  - 发表日期：2026年3月26日\n  - 任务：预测\n  - 摘要：\n    - TIGFlow-GRPO是一个两阶段的人体轨迹生成框架，将基于流的生成与行为规则相契合。\n    - 第一阶段使用带有轨迹-交互图（TIG）模块的CFM预测器，以建模精细的视觉-空间交互。\n    - 第二阶段采用流-GRPO后训练，将确定性流展开重新表述为随机ODE到SDE采样以进行探索，并通过复合奖励优化社会合规性和物理可行性。\n\n- [DreamerAD：基于潜在世界模型的高效强化学习在自动驾驶中的应用](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.24587)\n  - 杨鹏轩、郑宇鹏、钱德恒、邢泽彬、张启超、王林波、张义晨、郭绍宇、夏仲璞、陈强、韩俊宇、徐凌云、潘益峰、赵东斌\n  - 发表日期：2026年3月25日\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - DreamerAD是首个用于自动驾驶高效强化学习的潜在世界模型框架，将扩散采样从100步压缩至1步，速度提升80倍，同时保持视觉可解释性。\n    - 该方法利用捷径强制技术对去噪后的潜在特征进行步骤压缩，在潜在表示上构建自回归密集奖励模型，并采用高斯词汇采样进行GRPO，以将探索限制在合理轨迹范围内。\n    - 在NavSim v2上实现了87.7 EPDMS的最先进性能，证明了潜在空间强化学习在自动驾驶中的有效性。\n\n- [Latent-WAM：用于端到端自动驾驶的潜在世界动作建模](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.24581)\n  - 王林波、郑宇鹏、陈强、李世伟、张义晨、邢泽彬、张启超、李翔、钱德恒、杨鹏轩、董一航、郝策、叶晓青、韩俊宇、潘益峰、赵东斌\n  - 发表日期：2026年3月25日\n  - 任务：预测、规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - Latent-WAM是一个高效的端到端自动驾驶框架，通过空间感知和动力学信息丰富的潜在世界表示实现强大的轨迹规划。\n    - 它引入了空间感知压缩世界编码器（SCWE），用于提炼几何知识并压缩多视角图像；以及动态潜在世界模型（DLWM），用于预测未来的世界状态。\n    - 在NAVSIM v2和HUGSIM基准测试中取得了最先进的结果，仅用一个1.04亿参数的紧凑模型，便超越了先前的方法，且所需的训练数据显著减少。\n\n- [面向挑战性轨迹的物理一致性驾驶视频世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.24506)\n  - 周嘉伟、朱振鑫、杜凌翼、吕林业、周立军、吴占谦、罗洪成、田卓涛、王兵、陈光、叶杭君、孙海阳、李宇\n  - 出版单位：华中科技大学、小米电动车\n  - 发表日期：2026年3月25日\n  - 项目页面：[PhyGenesis](https:\u002F\u002Fwm-research.github.io\u002FPhyGenesis\u002F)\n  - 任务：生成\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - 提出了PhyGenesis，一种能够生成高视觉保真度且具有强物理一致性的驾驶视频的世界模型，尤其适用于挑战性或反事实轨迹。\n    - 该框架包括一个物理条件生成器，用于纠正无效轨迹；以及一个增强物理特性的视频生成器，用于生成多视角驾驶视频。\n    - 在大规模、富含物理特性的异构数据集上进行训练，该数据集结合了真实世界视频和在CARLA中生成的多样化挑战场景，从而实现轨迹修正和物理一致性的生成。\n\n- [自动驾驶中的交通标志识别：数据集、基准测试与实地实验](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.23034)\n  - 赵国洋、齐卫青、张凯、张晨光、龚泽英、毕志海、陈凯、马本善、刘明、马军\n  - 出版单位：香港科技大学、上海人工智能实验室\n 
 - 发表日期：2026年3月24日\n  - 项目页面：[TS-1M](https:\u002F\u002Fguoyangzhao.github.io\u002Fprojects\u002Fts1m)\n  - 任务：感知\n  - 数据集：[TS-1M](https:\u002F\u002Fguoyangzhao.github.io\u002Fprojects\u002Fts1m)\n  - 摘要：\n    - 介绍了TS-1M，这是一个大规模、全球多样化的交通标志数据集，包含超过一百万张图片，涵盖454个类别，并提供了一个诊断性基准，用于分析模型在实际挑战下的能力。\n    - 进行了跨三种学习范式（监督学习、自监督学习、多模态VLMs）的统一基准测试，揭示语义对齐是泛化能力和稀有类别识别的关键。\n    - 通过将交通标志识别与语义推理和空间定位相结合，以地图级决策约束为基础的实景自动驾驶实验，验证了其实际应用价值。\n\n- [KLDrive：基于知识图谱的自动驾驶细粒度3D场景推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.21029)\n  - 田晔、张静怡、王子豪、任晓远、余晓凡、奥纳特·贡戈尔、塔雅娜·罗辛\n  - 出版单位：加州大学圣地亚哥分校\n  - 发表日期：2026年3月22日\n  - 任务：VQA\n  - 数据集：[NuScenes-QA](https:\u002F\u002Fgithub.com\u002Fqiantianwen\u002FNuScenes-QA)\n  - 摘要：\n    - KLDrive是一个知识图谱增强的LLM推理框架，用于自动驾驶中的细粒度问题回答，结合基于能量的场景事实构建模块与LLM代理，实现基于事实的推理。\n    - 该框架使用结构化提示和少量上下文示例，即可适应各种推理任务，而无需进行大量特定任务的微调，从而减少幻觉并提高可靠性。\n\n- [理解基于动作量化的行为克隆](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.20538)\n  - 曹浩群、谢腾阳\n  - 发表日期：2026年3月20日\n  - 任务：规划\n  - 摘要：\n    - 为基于量化动作的行为克隆提供了理论基础，分析了误差传播和样本复杂度。\n    - 表明在稳定动力学和策略平滑性条件下，采用对数损失函数和量化动作可实现最优样本复杂度。\n    - 提出了一种基于模型的增强方法以改进误差界，并确立了量化误差与统计复杂度的基本极限。\n\n- [X-World：用于可扩展端到端驾驶的可控自我中心多摄像头世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.19979)\n  - 郑朝达、李Sean、邓金浩、王振楠、陈世嘉、肖立强、迟子恒、林洪斌、陈康杰、王博洋、张宇、刘宪明\n  - 发表日期：2026年3月20日\n  - 任务：端到端\n  - 摘要：\n    - X-World是一种动作条件下的多摄像头生成式世界模型，能够模拟未来的视频观测结果，从而实现对端到端自动驾驶的可扩展且可重复的评估。\n    - 该模型可根据历史和动作序列生成未来的多摄像头视频流，并支持通过文本提示对交通参与者、道路元素及外观（如天气）进行可选控制。\n    - 其特点是采用多视角潜在视频生成器，旨在保证跨视角的几何一致性和时间连贯性，从而实现高质量、可控的仿真，用于评估和视频风格迁移。\n\n- [DynFlowDrive：基于流的动态世界建模用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.19675)\n  - 刘晓璐、李一聪、王松、陈俊波、姚安吉拉、朱建科\n  - 发表日期：2026年3月20日\n  - 代码：[DynFlowDrive](https:\u002F\u002Fgithub.com\u002Fxiaolul2\u002FDynFlowDrive)\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - DynFlowDrive是一种潜在世界模型，利用基于流的动力学来建模不同驾驶动作下世界状态的转换。\n    - 引入了一种考虑稳定性的一致性多模式轨迹选择策略，根据诱导场景转换的稳定性评估候选轨迹。\n    - 在nuScenes和NavSim基准测试中表现出持续的性能提升，且未增加额外的推理开销。\n\n- [DriveTok：用于统一多视角重建与理解的3D驾驶场景标记化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.19219)\n  - 朱东、郑文昭、左思成、闫思明、侯陆、周杰、卢继文\n  - 发表日期：2026年3月19日\n  - 任务：感知\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - DriveTok是一种高效的3D驾驶场景标记化工具，用于统一多视角的重建与理解，旨在解决多视角驾驶场景中的低效与不一致性问题。\n    - 它利用3D可变形交叉注意力机制将语义丰富的视觉特征转换为场景标记，并采用多视角Transformer对RGB、深度和语义信息进行重建，同时添加了一个3D头部用于语义占用预测。\n\n- [DriveVLM-RL：受神经科学启发的视觉-语言模型强化学习框架，用于安全且可部署的自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.18315)\n  - 黄子琳、盛子豪、万正阳、曲燕松、游俊伟、蒋思聪、陈思凯\n  - 发表日期：2026年3月18日\n  - 项目页面：[DriveVLM-RL](https:\u002F\u002Fzilin-huang.github.io\u002FDriveVLM-RL-website\u002F)\n  - 代码：[DriveVLM-RL](https:\u002F\u002Fzilin-huang.github.io\u002FDriveVLM-RL-website\u002F)\n  - 任务：规划\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - 提出了DriveVLM-RL框架，该框架受神经科学启发，通过双路径架构将视觉-语言模型（VLMs）融入强化学习（RL），以实现安全且可部署的自动驾驶。\n    - 其中包含静态路径，用于持续的空间安全性评估；以及动态路径，用于注意力门控的多帧语义风险推理，并结合分层奖励合成机制。\n    - 采用异步训练流程，将昂贵的VLM推理与环境交互解耦，在部署时移除所有VLM组件，以确保实时可行性。\n\n- [VLM-AutoDrive：面向安全关键型自动驾驶事件的视觉-语言模型后训练方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.18178)\n  - 莫哈默德·卡齐姆·巴特、黄宇凡、尼凯特·阿加瓦尔、王浩、迈克尔·伍兹、约翰·肯扬、林宗毅、杨晓东、刘明宇、谢凯文\n  - 发表日期：2026年3月18日\n  - 任务：感知\n  - 摘要：\n    - VLM-AutoDrive是一个模块化的后训练框架，用于将预训练的视觉-语言模型（VLMs）适配到高保真度的驾驶异常检测任务中。\n    - 该框架整合了元数据字幕、LLM描述、VQA对以及思维链推理，以实现领域对齐且可解释的学习。\n    - 在真实世界的行车记录仪视频上，其碰撞F1分数从0.00提升至0.69，整体准确率则从35.35%提高到77.27%。\n\n- 
[VectorWorld：基于向量图扩散流的高效流式世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.17652)\n  - 姜超康、周德森、刘久明、孙凯文\n  - 发表日期：2026年3月18日\n  - 代码：[VectorWorld](https:\u002F\u002Fgithub.com\u002Fjiangchaokang\u002FVectorWorld)\n  - 任务：预测、生成\n  - 数据集：[Waymo开放运动数据集](https:\u002F\u002Fwaymo.com\u002Fopen\u002Fdata\u002Fmotion\u002F)、[nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - 摘要：\n    - VectorWorld是一款用于自动驾驶的流式世界模型，可在滚动过程中增量生成以车辆为中心的车道-行人向量图块，从而支持闭环评估。\n    - 该模型解决了生成式世界模型中的关键问题：通过运动感知的门控VAE实现基于历史条件的初始化；借助一步式掩码补全模型实现实时外延填充；并通过名为$Δ$Sim的物理对齐NPC策略确保长 horizon 的稳定性。\n\n- [DriveFix：时空一致性的驾驶场景修复](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.16306)\n  - 施赫宇、詹姆斯·丹尼斯、孙牧阳、达古斯·德拉戈斯、李耀儒、金鑫、傅瑞菊、塔塔里诺娃·尤利娅、兰迪·费德里科、宋杰、宋明丽、郭琪\n  - 发表日期：2026年3月17日\n  - 任务：感知\n  - 数据集：[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)、[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[PandaSet](https:\u002F\u002Fscale.com\u002Fopen-datasets\u002Fpandaset)\n  - 摘要：\n    - DriveFix是一种新颖的多视角修复框架，采用交错扩散Transformer架构，确保驾驶场景的时空一致性。\n    - 该方法显式地建模了时间依赖关系和跨摄像头的空间一致性，强制修复后的视图遵循统一的3D几何结构。\n    - 在Waymo、nuScenes和PandaSet数据集上，展示了在场景重建和新视角合成方面的最先进性能。\n\n- [基于VLA的驾驶系统的安全案例模式：来自SimLingo的洞见](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.16013)\n  - 格哈德·于、石川冬树、奥卢瓦费米·奥杜、阿尔文·博耶·贝尔\n  - 发表日期：2026年3月16日\n  - 任务：端到端\n  - 摘要：\n    - 提出RAISE，一种针对视觉-语言-行动（VLA）驱动系统的新安全案例设计方法，引入定制化模式及对危害分析与风险评估（HARA）的扩展。\n    - 解决了将开放式自然语言输入整合到自动驾驶多模态控制回路中所产生的新型安全风险问题。\n    - 以SimLingo为例，说明该方法如何为这一新兴系统类别构建严谨、基于证据的安全性声明。\n\n- [CorrectionPlanner：基于强化学习的自动驾驶自纠正规划器](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.15771)\n  - 郭一鸿、叶东强子、陈思嘉、刘安琪、刘贤明\n  - 发表日期：2026年3月16日\n  - 任务：规划\n  - 数据集：[Waymax](https:\u002F\u002Fgithub.com\u002Fwaymo-research\u002Fwaymax)、[nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - 摘要：\n    - CorrectionPlanner是一种具有自纠正功能的自回归规划器，它将规划建模为在“提出—评估—纠正”循环中生成运动标记的过程。\n    - 该方法利用学习得到的碰撞评判器预测不安全动作，并保留不安全运动标记的自纠正轨迹，以指导安全动作的生成。\n    - 通过模仿学习训练后，再使用预训练世界模型的滚动优化进行基于模型的强化学习，从而降低碰撞率并取得领先的规划性能。\n\n- [CRASH：用于自动驾驶安全风险的认知推理代理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.15364)\n  - 埃里克·席尔瓦、雷哈娜·亚斯敏、阿里·肖克尔\n  - 出版方：（根据上下文推断为米尼奥大学、阿威罗大学）\n  - 发表日期：2026年3月16日\n  - 任务：推理\n  - 摘要：\n    - 介绍CRASH，一个基于LLM的代理，能够自动对来自NHTSA数据库的真实AV事故报告进行推理，生成摘要、归因主要原因并评估AV的贡献度。\n    - 对2,168起事故的分析显示，64%归因于感知或规划失败，约50%涉及追尾事故，凸显了持续存在的挑战。\n    - 经领域专家验证，其在归因AV系统故障方面的准确率达到86%，表明其作为可扩展且可解释的自动化事故分析工具具有潜力。\n\n- [从错误中学习：利用接管数据对驾驶VLA进行后训练](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.14972)\n  - 高银峰、刘德清、张启超、郑宇鹏、田浩辰、李光、叶航军、陈龙、丁大伟、赵东彬\n  - 发表日期：2026年3月16日\n  - 任务：端到端\n  - 数据集：[Bench2Drive](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FBench2Drive)\n  - 摘要：\n    - 提出TakeVLA，一种新颖的视觉-语言-行动（VLA）后训练框架，利用接管数据缓解端到端自动驾驶中的分布偏移问题。\n    - 引入接管前的语言监督，主动教导模型识别易出错的情境，培养预防性思维并扩大安全余量。\n    - 提出场景梦魇机制，这是一种在重建的接管场景中运行的强化微调范式，鼓励主动探索而非被动适应偏好。\n\n- [连接场景生成与规划：通过统一视觉与运动表征的世界模型进行驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.14948)\n  - 桂兴泰、张美洁、闫天义、韩文成、龚家豪、谭飞阳、徐承忠、沈建兵\n  - 发表日期：2026年3月16日\n  - 任务：规划、生成\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)、[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - WorldDrive是一个整体框架，通过统一视觉和运动表征将场景生成与实时规划相结合。\n    - 引入轨迹感知型驾驶世界模型，以确保视觉动态与运动意图之间的一致性，从而实现多样化的未来场景生成。\n    - 提出未来感知奖励机制，从冻结的世界模型中提炼未来潜在表征，用于实时轨迹评估与选择。\n\n- [PerlAD：基于伪仿真强化学习的闭环端到端自动驾驶增强方案](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.14908)\n  - 高银峰、张启超、刘德清、夏仲普、李光、马坤、陈广、叶航军、陈龙、丁大伟、赵东彬\n  - 发表日期：2026年3月16日\n  - 任务：端到端\n  - 数据集：[Bench2Drive](https:\u002F\u002Fbench2drive.github.io\u002F)、[DOS](https:\u002F\u002Fgithub.com\u002Fopendilab\u002FDOS)\n 
 - 摘要：\n    - 提出PerlAD，一种基于伪仿真的强化学习方法，适用于闭环端到端自动驾驶，在向量空间中运行以实现高效、无渲染的训练。\n    - 引入预测型世界模型生成反应式智能体轨迹，以及分层解耦的规划器，结合IL进行横向路径规划和RL进行纵向速度优化。\n    - 在Bench2Drive基准测试中达到最先进水平，并在DOS基准测试中的安全关键遮挡场景中展现出可靠性。\n\n- [AutoMoT：面向端到端自动驾驶的异步混合Transformer统一视觉-语言-行动模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.14851)\n  - 黄文辉、张松岩、黄启航、王志东、毛志奇、科利斯特·丘亚、陈展、陈龙、吕晨\n  - 发表日期：2026年3月16日\n  - 项目页面：[AutoMoT](https:\u002F\u002Fautomot-website.github.io\u002F)\n  - 任务：端到端\n  - 摘要：\n    - 提出AutoMoT，一个将推理与行动生成统一于单一视觉-语言-行动模型中的端到端自动驾驶框架。\n    - 利用混合Transformer架构，通过共享注意力和异步快慢推理实现高效的策略生成。\n    - 探讨预训练VLM在自动驾驶中的功能边界，发现语义提示足以完成场景理解，但精细调优对于规划等行动级任务至关重要。\n\n- [WorldVLM：结合世界模型预测与视觉-语言推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.14497)\n  - 斯特凡·恩格迈尔、卡塔琳娜·温特、法比安·B·弗洛尔\n  - 出版方：慕尼黑工业大学\n  - 发表日期：2026年3月15日\n  - 任务：规划、预测\n  - 摘要：\n    - 提出WorldVLM，一种混合架构，将视觉-语言模型（VLM）用于高层次情境推理，与世界模型（WM）结合，用于精确预测自动驾驶中的未来场景动态。\n    - VLM生成可解释的行为指令来引导驾驶WM，从而实现情境感知型行动，并充分发挥决策制定与环境预测的互补优势。\n\n- [风险可控的多视角扩散模型用于驾驶场景生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.11534)\n  - 林鸿毅、史文秀、黄赫业、庄丁毅、张松、刘洋、曲晓波、赵金华\n  - 发表日期：2026年3月12日\n  - 任务：生成\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 提出RiskMV-DPO，一种基于物理约束、风险可控的多视角驾驶场景生成流水线，用于合成多样化的长尾高风险场景。\n    - 引入几何-外观对齐模块和区域感知直接偏好优化（RA-DPO）策略，以确保生成场景在时空上的一致性和几何保真度。\n    - 在nuScenes数据集上表现出最先进的性能，提升了3D检测mAP并降低了FID值，使世界模型从被动预测转向主动可控的场景合成。\n\n- [DriveXQA：用于恶劣驾驶场景理解的跨模态视觉问答](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.11380)\n  - 陶明哲、刘瑞平、郑俊伟、陈宇凡、应可迪、M·萨基布·萨尔夫拉兹、杨凯伦、张嘉铭、赖纳·施蒂费尔哈根\n  - 发表日期：2026年3月11日\n  - 任务：VQA\n  - 摘要：\n    - 提出DriveXQA，一个面向自动驾驶的多模态数据集，包含102,505组问答对，涵盖三个层次（全局场景、以环境为中心、以车辆为中心）、四种视觉模态，以及多种传感器故障和天气条件。\n    - 引入MVX-LLM，一种具有双交叉注意力（DCA）投影器的高效架构，用于融合多种互补的视觉模态，在如雾天等复杂条件下表现出更好的性能。\n\n- [DynVLA：学习世界动态以支持自动驾驶中的行为推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.11041)\n  - 商书瑶、詹冰、闫云飞、王宇琪、李颖妍、安亚松、王小满、刘杰睿、侯璐、范露、张兆祥、谭天牛\n  - 出版单位：中国科学院自动化研究所\n  - 发表日期：2026年3月11日\n  - 任务：规划、推理\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)、[Bench2Drive](https:\u002F\u002Fbench2drive.github.io\u002F)\n  - 摘要：\n    - DynVLA是一种驾驶用视觉-语言-行动（VLA）模型，引入了一种称为“动力学思维链”（Dynamics CoT）的新范式，在生成动作之前先预测紧凑的世界动态，从而做出更明智且符合物理规律的决策。\n    - 该模型引入了动力学标记器，将未来演化压缩为少量的动力学标记，并将自我中心与环境中心的动力学解耦，以便在交互密集型场景中实现更精确的世界建模。\n    - DynVLA通过监督微调（SFT）和强化学习微调（RFT）训练，在生成动作前先生成动力学标记，既提高了决策质量，又保持了低延迟的推理效率，并在多个基准测试中超越了文本思维链和视觉思维链方法。\n\n- [RESBev：提升BEV感知的鲁棒性](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.09529)\n  - 卓立峰、金克凡、刘哲、王鹤生\n  - 发表日期：2026年3月10日\n  - 任务：感知\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 提出RESBev，一种具有韧性且即插即用的BEV感知方法，用于增强对传感器退化和对抗攻击的鲁棒性。\n    - 将感知鲁棒性重新定义为潜在语义预测问题，利用潜在世界模型学习BEV状态转移并预测干净的特征。\n    - 在Lift-Splat-Shoot流水线的语义特征层面运行，能够在不修改骨干网络的情况下，对各种干扰进行泛化恢复。\n\n- [探究驾驶VLM的可靠性：从响应不一致到 grounded 时间推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.09512)\n  - 张春鹏、王晨宇、霍尔格·凯撒、阿兰·帕加尼\n  - 发表日期：2026年3月10日\n  - 任务：推理\n  - 摘要：\n    - 研究视觉-语言模型（VLM）作为驾驶助手的可靠性，重点关注其响应不一致和时间推理能力有限的问题。\n    - 引入FutureVQA，一个由人工标注的基准数据集，用于评估驾驶场景中的未来场景推理能力。\n    - 提出一种自监督调优方法，结合思维链推理来提高模型的一致性和时间推理能力，而无需时间标签。\n\n- [StyleVLA：面向自动驾驶的驾驶风格感知型视觉语言行动模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.09482)\n  - 高源、华登元、马蒂亚·皮奇尼尼、芬恩·拉斯穆斯·舍费尔、科尔比尼安·莫勒、李琳、约翰内斯·贝茨\n  - 出版单位：慕尼黑工业大学\n  - 发表日期：2026年3月10日\n  - 任务：感知、规划\n  - 摘要：\n    - StyleVLA是一个基于物理约束的视觉语言行动（VLA）框架，能够生成多样化且符合物理规律的驾驶行为，并适应特定风格（如运动型、舒适型）。\n    - 引入一种混合损失函数，结合运动学一致性约束和连续回归头，以提高轨迹的可行性。\n    - 该模型基于Qwen3-VL-4B训练，使用包含超过1,200个场景和118,000个样本的大规模指令数据集，其综合驾驶评分优于专有及现有最先进模型。\n\n- 
[基于VLM的自动驾驶架构中补丁攻击的比较分析](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.08897)\n  - 大卫·费尔南德斯、佩德拉姆·莫哈杰尔安萨里、阿米尔·萨拉普尔、程龙、阿博法兹尔·拉齐、梅尔特·D·佩塞\n  - 发表日期：2026年3月9日\n  - 任务：感知\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - 提出一个系统性的框架，用于对三种自动驾驶用VLM架构——Dolphins、OmniDrive（Omni-L）和LeapVAD——进行对抗性比较评估。\n    - 利用黑盒优化方法，在CARLA仿真环境中评估可物理实现的补丁攻击，并结合语义同质化技术，揭示了严重的漏洞和持续的多帧失效现象。\n    - 分析结果表明，不同架构存在不同的脆弱性模式，这说明当前的VLM设计在应对安全关键应用中的对抗威胁方面仍显不足。\n\n- [SAMoE-VLA：面向自动驾驶的场景自适应专家混合视觉-语言-行动模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.08113)\n  - 游子涵、刘宏伟、党晨旭、王哲、安思宁、王奥奇、王燕\n  - 发表日期：2026年3月9日\n  - 任务：感知、规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - SAMoE-VLA是一种场景自适应的视觉-语言-行动框架，它基于结构化的场景表示（BEV特征）而非标记嵌入来选择专家，从而实现稳定的自动驾驶。\n    - 引入了一种条件跨模态因果注意力机制，将世界状态、语言意图和动作历史整合进统一的因果推理过程。\n    - 在nuScenes和LangAuto基准测试中取得了最先进的性能，以更少的参数数量超越了先前的基于VLA和基于世界模型的方法。\n\n- [NaviDriveVLM：用于自动驾驶的高层推理与运动规划解耦框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.07901)\n  - 陶希蒙、帕尔迪斯·塔格维、季米特尔·菲列夫、雷扎·兰加里、高拉夫·潘迪\n  - 出版单位：德克萨斯农工大学\n  - 发表日期：2026年3月9日\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - NaviDriveVLM是一个用于自动驾驶的解耦框架，它通过一个大规模导航器和一个轻量级可训练驱动器将高层推理与运动规划分离。\n    - 这种设计在保持强大推理能力的同时降低了训练成本，并为下游规划提供了明确且可解释的中间表示。\n    - 在nuScenes基准上的实验表明，NaviDriveVLM在端到端运动规划方面优于大型VLM基线。\n\n- [考虑运动学的潜在世界模型：用于数据高效的自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.07264)\n  - 李家卓、曹林江、刘琪、熊曦\n  - 发表日期：2026年3月7日\n  - 任务：感知\n  - 摘要：\n    - 提出了一种用于自动驾驶的考虑运动学的潜在世界模型框架，该框架基于递归状态空间模型（RSSM），融入了车辆运动学信息和几何感知监督。\n    - 结构化的潜在动力学提高了长 horizon 想象的保真度并稳定了策略优化，在样本效率和驾驶性能上均优于无模型和基于像素的世界模型基线。\n\n- [感知增强的单目图像多模态空间推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.06985)\n  - 程延春、王润东、杨旭磊、阿洛克·普拉卡什、丹妮拉·鲁斯、马塞洛·H·安格、李世杰\n  - 出版单位：麻省理工学院\n  - 发表日期：2026年3月7日\n  - 任务：感知\n  - 摘要：\n    - 提出了一个感知增强的多模态推理框架，该框架利用视觉参考标记（VRTs）为视觉-语言模型（VLMs）提供显式的以物体为中心的语义基础，从而实现视觉与文本的联合推理。\n    - 引入了一个多模态思维链（MM-CoT）数据集以及一种确定性的排序策略来监督无序的VRT集合，使得可以通过标准的监督微调进行有效训练。\n    - 在单目驾驶场景的空间推理SURDS基准测试上取得了显著提升，性能超越了包括强化学习方法在内的先前方法。\n\n- [BEVLM：将LLMs中的语义知识蒸馏到鸟瞰图表示中](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.06576)\n  - 托马斯·莫宁格、谢绍远、陈启明、丁思浩\n  - 发表日期：2026年3月6日\n  - 任务：感知\n  - 摘要：\n    - BEVLM是一个连接空间一致、语义蒸馏的鸟瞰图（BEV）表示与大型语言模型（LLMs）的框架，用于自动驾驶。\n    - 该方法使LLMs在跨视图驾驶场景中的推理更加有效，通过使用BEV特征作为统一输入，准确率提升了46%。\n    - 将LLMs中的语义知识蒸馏到BEV表示中，显著提升了闭环端到端驾驶性能，在安全关键场景中提高了29%。\n\n- [NOVA：用于自动驾驶中3D多目标跟踪的下一步开放词汇自回归方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.06254)\n  - 骆凯、王旭、范锐、杨凯伦\n  - 发表日期：2026年3月6日\n  - 代码：[NOVA](https:\u002F\u002Fgithub.com\u002Fxifen523\u002FNOVA)\n  - 任务：感知\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[KITTI](http:\u002F\u002Fwww.cvlibs.net\u002Fdatasets\u002Fkitti\u002F)\n  - 摘要：\n    - NOVA是一种用于3D多目标跟踪的新范式，它从传统的基于距离的匹配转向使用自回归大型语言模型进行生成式的时空语义建模。\n    - 它将3D轨迹重新表述为结构化的时空语义序列，从而能够同时编码物理运动的连续性以及深层的语言先验以保证身份一致性。\n    - 在新类别上实现了显著的性能提升，在nuScenes上的AMOTA指标绝对提升了20.21%，且仅使用了一个紧凑的0.5B参数模型。\n\n- [K-Gen：一种多模态语言条件下的可解释关键点引导轨迹生成方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.04868)\n  - 穆明轩、杨国、陈雷、吴平、崔建勋\n  - 发表日期：2026年3月5日\n  - 任务：规划\n  - 数据集：[WOMD](https:\u002F\u002Fwaymo.com\u002Fopen\u002Fdata\u002Fmotion\u002F)、[nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - 摘要：\n    - K-Gen是一个可解释的关键点引导多模态框架，它利用MLLMs将光栅化的BEV地图输入与文本场景描述统一起来，用于生成轨迹。\n    - K-Gen不直接预测完整的轨迹，而是生成反映智能体意图的可解释关键点及其推理过程，随后再将其细化为精确的轨迹。\n    - 该方法应用了T-DAPO这一轨迹感知的强化微调算法来提升关键点生成效果，在WOMD和nuPlan上均优于基线。\n\n- [PRAM-R：具有LLM引导模态路由的感知-推理-行动-记忆框架，用于自适应自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.04222)\n  - 张毅、张贤、赵赛思、宋英蕾、吴成栋、内纳德·彼得罗维奇、阿洛伊斯·克诺尔\n  - 
出版单位：慕尼黑工业大学\n  - 发表日期：2026年3月4日\n  - 任务：感知、规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - PRAM-R是一个具有LLM引导模态路由的统一感知-推理-行动-记忆框架，专为自适应自动驾驶设计，采用异步双环架构。\n    - 该框架使用LLM路由器根据环境背景和传感器诊断结果选择并加权不同的模态，同时配备层次化记忆模块以确保时间一致性。\n    - 评估显示，通过基于迟滞效应的稳定化措施，路由振荡减少了87.2%，并且在保持与全模态基线相当的轨迹精度的同时，模态数量减少了6.22%。\n\n- [基于朗之万引导流匹配的实时生成式策略用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.02613)\n  - 朱天泽、王一诺、邹文俊、张天一、王立坤、陶乐天、张飞鸿、吕瑶、李圣波\n  - 发表日期：2026年3月3日\n  - 任务：规划\n  - 数据集：[DeepMind Control Suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n  - 摘要：\n    - 提出DACER-F，一种利用流匹配进行在线强化学习的扩散演员-评论家方法，支持单步推理，实现自动驾驶中的实时决策。\n    - 引入一种结合朗之万动力学和Q函数梯度的方法，动态优化动作以逼近兼顾高Q值与探索性的目标分布。\n    - 在复杂驾驶仿真中表现出色，并在DeepMind Control Suite上展现出良好的可扩展性，以极低的推理延迟获得高分。\n\n- [VLMFusionOcc3D：VLM辅助的多模态3D语义占用预测](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.02609)\n  - A. Enes Doruk、Hasan F. Ates\n  - 发表日期：2026年3月3日\n  - 任务：感知\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[SemanticKITTI](http:\u002F\u002Fwww.semantic-kitti.org\u002F)\n  - 摘要：\n    - VLMFusionOcc3D是一种鲁棒的多模态密集3D语义占用预测框架，利用视觉-语言模型（VLM）将模糊的体素特征锚定到稳定的语义概念上。\n    - 提出实例驱动的VLM注意力机制（InstVLM），将高层语义先验注入3D体素；并引入天气感知自适应融合机制（WeathFusion），根据环境可靠性动态调整传感器权重。\n    - 采用深度感知几何对齐（DAGA）损失以保证结构一致性，并在现有最先进基线上显著提升性能，尤其在恶劣天气条件下表现优异。\n\n- [AnchorDrive：基于锚点引导的扩散重生成的安全关键场景生成LLM情景展开](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.02542)\n  - 蒋竹林、李泽涛、王成、王子文、熊晨\n  - 发表日期：2026年3月3日\n  - 任务：生成\n  - 数据集：[highD](https:\u002F\u002Fwww.highd-dataset.com\u002F)\n  - 摘要：\n    - AnchorDrive是一个两阶段的安全关键场景生成框架，结合大语言模型实现可控生成，并利用扩散模型进行逼真的轨迹重生成。\n    - 第一阶段使用具备计划评估能力的大语言模型驾驶员代理，在自然语言约束下生成语义可控的场景。\n    - 第二阶段从LLM生成的轨迹中提取锚点，引导扩散模型在保留用户意图的同时，生成更具现实感的轨迹。\n\n- [LLM-MLFFN：基于大型语言模型的多级自动驾驶行为特征融合](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.02528)\n  - 李向宇、王天一、程曦、Rakesh Chowdary Machineni、郭兆淼、陈思凯、焦俊峰、Christian Claudel\n  - 发表日期：2026年3月3日\n  - 任务：感知\n  - 数据集：[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 摘要：\n    - LLM-MLFFN是一种新颖的LLM增强型多级特征融合网络，用于自动驾驶行为分类，整合了大规模预训练模型的先验信息。\n    - 该框架包括多级特征提取模块、使用LLM的语义描述模块，以及带有加权注意力的双通道特征融合网络。\n    - 在Waymo数据集上的评估显示其性能优越，分类准确率超过94%，证明了结构化特征建模与语义抽象相结合的价值。\n\n- [LaST-VLA：在自动驾驶中基于视觉-语言-行动的潜在时空思维](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.01928)\n  - 罗悦辰、李芳、徐绍青、季阳、张泽涵、王兵、沈元南、崔建伟、陈龙、陈广、叶航军、杨志新、温福熙\n  - 发表日期：2026年3月2日\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - LaST-VLA是一个框架，将推理从离散符号处理转向物理基础的潜在时空思维链，以解决视觉-语言-行动模型中的语义-感知解耦问题。\n    - 实现了一种双特征对齐机制，将3D基础模型中的几何约束和世界模型中的动态预见能力提炼到潜在空间中。\n    - 采用渐进式SFT训练策略和GRPO强化学习，在NAVSIM基准测试中创下新纪录，并在时空推理方面表现卓越。\n\n- [统一自动驾驶中的语言-行动理解与生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.01441)\n  - 王欣洋、刘倩、丁文杰、杨昭、李伟、刘畅、李柏霖、詹坤、郎贤鹏、陈伟\n  - 发表日期：2026年3月2日\n  - 任务：端到端、生成\n  - 摘要：\n    - 提出LinkVLA，一种用于自动驾驶的新颖视觉-语言-行动（VLA）架构，将语言和行动标记统一到共享的离散代码本中，以强制跨模态一致性。\n    - 提议一种辅助性的行动理解目标，通过训练模型根据轨迹生成描述性文字，从而建立深层语义联系，促进语言-行动的双向映射。\n    - 用两步粗细结合（C2F）方法替代自回归生成，使推理时间缩短86%，同时提高指令遵循精度及闭环基准下的驾驶性能。\n\n- [通过显式失败学习释放自动驾驶中VLA的潜力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.01063)\n  - 罗悦辰、陈启茂、李芳、徐绍青、刘嘉鑫、宋梓 Ying、杨志新、温福熙\n  - 发表日期：2026年3月1日\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - 提出ELF-VLA框架，通过结构化的诊断反馈增强强化学习，以解决自动驾驶中视觉-语言-行动模型的性能瓶颈问题。\n    - 生成详细且可解释的失败报告，识别具体的失败模式，从而实现针对性的反馈引导优化和更有效的训练。\n    - 在NAVSIM基准测试中，于PDMS、EPDMS评分以及高层规划准确性方面均达到当前最优水平。\n\n- [DriveCode：面向LLM自动驾驶的领域特定数值编码](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.00919)\n  - 王志业、蒋彦博、周睿、张博、张芳、徐振华、张雅琴、王建强\n  - 发表日期：2026年3月1日\n  - 
任务：端到端\n  - 摘要：\n    - 介绍DriveCode，一种基于LLM的自动驾驶新型数值编码方法，它将数字表示为专用嵌入，而非离散文本标记，从而提升数值推理能力和精度。\n    - 使用数字投影器将数字映射到语言模型的隐藏空间，实现与视觉和文本特征在统一多模态序列中的无缝集成。\n    - 在OmniDrive、DriveGPT4和DriveGPT4-V2数据集上，展示了其在轨迹预测和控制信号生成方面的优越性能。\n\n- [Wild-Drive：通过鲁棒多模态路由和高效大型语言模型实现越野场景描述与路径规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.00694)\n  - 王子航、李旭、王本武、朱文凯、陈谢元立、孔东、吕凯琳、杜一楠、彭一鸣、车浩洋\n  - 发表日期：2026年2月28日\n  - 代码：[Wild-Drive](https:\u002F\u002Fgithub.com\u002Fwangzihanggg\u002FWild-Drive)\n  - 任务：规划\n  - 数据集：[OR-C2P基准测试集](https:\u002F\u002Fgithub.com\u002Fwangzihanggg\u002FWild-Drive)\n  - 摘要：\n    - 提出Wild-Drive，一个高效的越野场景描述与路径规划框架，解决了传感器在雨、雾、黑暗等恶劣条件下易受损的问题。\n    - 引入MoRo-Former，一种任务条件化的多模态路由桥梁，可在传感条件恶化时自适应地聚合可靠信息。\n    - 将高效LLM与规划标记及GRU解码器结合，生成结构化描述并预测未来轨迹，并发布了OR-C2P基准测试集用于评估。\n\n- [面向可泛化端到端自动驾驶的风险感知世界模型预测控制](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.23259)\n  - 孙江鑫、薛峰、龙腾、刘畅、胡建方、郑伟士、尼库·塞贝\n  - 出版单位：澳门大学、中国科学院大学、特伦托大学\n  - 发表日期：2026年2月26日\n  - 任务：端到端\n  - 摘要：\n    - 提出RaWMPC，一个统一的端到端自动驾驶框架，通过稳健控制解决泛化问题，无需依赖专家示范。\n    - 利用世界模型预测候选动作的后果，并通过明确的风险评估选择低风险动作，同时辅以风险感知交互策略进一步增强效果。\n    - 引入自我评估蒸馏方法，将世界模型中的避险能力提炼到生成式动作建议网络中。\n\n- [MindDriver：为自动驾驶引入渐进式多模态推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.21952)\n  - 张凌俊、袁宇健、吴昌杰、常新源、蔡欣、曾爽、史林哲、王思瑾、张航、许牧\n  - 发表日期：2026年2月25日\n  - 代码：[MindDriver](https:\u002F\u002Fgithub.com\u002Fhotdogcheesewhite\u002FMindDriver)\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 提出MindDriver，一种用于自动驾驶的渐进式多模态推理框架，使VLM能够通过语义理解、想象和轨迹规划模仿人类的渐进式思维。\n    - 开发了一条反馈引导的自动数据标注流水线，用于生成对齐的多模态推理训练数据，并提出渐进式强化微调方法进行优化。\n\n- [NoRD：一种无需推理即可驾驶的数据高效视觉-语言-行动模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.21172)\n  - 伊山·拉瓦尔、舒布·古普塔、胡一涵、詹伟\n  - 发表日期：2026年2月24日\n  - 项目页面：[NoRD](https:\u002F\u002Fnord-vla-ai.github.io\u002F)\n  - 任务：端到端\n  - 数据集：[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)、[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - 提出NoRD，一种用于自动驾驶的视觉-语言-行动模型，仅需不到60%的数据进行微调，且无需任何推理标注，便能达到竞争力的表现。\n    - 通过引入Dr. 
GRPO算法来缓解高方差回放问题，识别并克服了标准GRPO在小型无推理数据集上的难度偏差。\n\n- [VGGDrive：通过跨视角几何对齐赋能视觉-语言模型用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.20794)\n  - 王杰、李广、黄志坚、党晨旭、叶杭军、韩亚红、陈龙\n  - 出版单位：（根据作者及上下文推断）\n  - 发表日期：2026年2月24日\n  - 任务：感知、规划\n  - 摘要：\n    - 提出VGGDrive，一种新颖架构，通过连接来自冻结3D模型的3D几何特征，赋予视觉-语言模型（VLM）用于自动驾驶的跨视角几何对齐能力。\n    - 引入即插即用的跨视角3D几何增强器（CVGE），将基础VLM解耦，并通过分层自适应机制注入3D特征。\n    - 在包括跨视角风险感知、运动预测和轨迹规划在内的五个自动驾驶基准测试中，均表现出性能提升。\n\n- [通过掩码视觉-语言-行动扩散实现高效且可解释的端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.20577)\n  - 张嘉茹、马纳夫·加格瓦尼、崔灿、彭俊通、张汝琪、王子然\n  - 发表日期：2026年2月24日\n  - 任务：感知、规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 提出MVLAD-AD，一种用于自动驾驶的掩码视觉-语言-行动扩散框架，兼顾高效规划与语义可解释性。\n    - 引入离散动作标记化策略，从真实驾驶分布中创建紧凑的运动学可行航点码本。\n    - 采用几何感知嵌入学习和动作优先解码策略，提高规划精度，并提供高保真、可解释的推理过程。\n\n- [MeanFuser：基于MeanFlow的快速单步多模态轨迹生成与自适应重建，用于端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.20060)\n  - 王俊利、刘雪怡、郑一楠、邢泽兵、李鹏飞、李光、马坤、陈广、叶航军、夏中普、陈龙、张启超\n  - 发表日期：2026年2月23日\n  - 代码：[MeanFuser](https:\u002F\u002Fgithub.com\u002Fwjl2244\u002FMeanFuser)\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - MeanFuser是一种端到端自动驾驶方法，通过引入高斯混合噪声（GMN）引导生成式采样，从而提升效率和鲁棒性，并实现对轨迹空间的连续表示。\n    - 它将“MeanFlow身份”应用于端到端规划，通过建模平均速度场来消除常微分方程求解器带来的数值误差，显著加快推理速度。\n    - 同时设计了一个轻量级的自适应重建模块（ARM），使模型能够根据注意力权重隐式地从采样方案中选择，或重新构建新的轨迹。\n\n- [面向多智能体协作的安全且可解释的多模态路径规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.19304)\n  - 史浩军、叶苏宇、凯瑟琳·M·格雷里奥、沈建志、尹一凡、丹尼尔·哈沙比、黄健明、舒天敏\n  - 发表日期：2026年2月22日\n  - 任务：规划\n  - 摘要：\n    - CaPE（代码即路径编辑器）是一种安全且可解释的多模态路径规划方法，专为多智能体协作设计。该方法可根据环境和语言交流生成并更新路径计划。\n    - 该方法利用视觉-语言模型（VLM）合成由基于模型的规划器验证的路径编辑程序，从而将通信与安全、可解释的路径计划更新相结合。\n    - 在多种模拟和真实场景中进行了评估，包括多机器人协作、人机协作以及自动驾驶、家庭服务和联合搬运等任务，结果表明其能够显著提升规划与沟通的一致性。\n\n- [城市场景分割中的开放词汇领域泛化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.18853)\n  - 赵东、臧奇、蒲楠、李文静、尼库·塞贝、钟准\n  - 出版单位：特伦托大学\n  - 发表日期：2026年2月21日\n  - 任务：感知\n  - 数据集：[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)、[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[KITTI](http:\u002F\u002Fwww.cvlibs.net\u002Fdatasets\u002Fkitti\u002F)、[BDD100K](https:\u002F\u002Fbdd-data.berkeley.edu\u002F)\n  - 摘要：\n    - 提出语义分割中的开放词汇领域泛化（OVDG-SS），这是一种新设置，旨在同时解决城市驾驶场景中未见领域和未见类别的问题。\n    - 提出了S2-Corr机制，一种基于状态空间的文本-图像相关性精炼方法，以缓解预训练视觉-语言模型中由领域差异引起的偏差。\n    - 建立了自动驾驶领域首个OVDG-SS基准，涵盖从合成数据到真实数据以及真实数据之间的泛化任务。\n\n- [OODBench：大型视觉-语言模型的分布外基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.18094)\n  - 林玲、白杨、苏恒、朱聪聪、王耀兴、周洋、傅华柱、陈景润\n  - 发表日期：2026年2月20日\n  - 任务：评估\n  - 数据集：[OODBench](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.18094)\n  - 摘要：\n    - 提出OODBench，一种以自动化为主的基准测试构建方法，用于评估视觉-语言模型（VLM）在分布外（OOD）数据上的表现，包含4万个实例级别的OOD实例-类别配对。\n    - 引入了一种可靠的自动化评估指标，通过由基础到高级的提示问题序列，来评估OOD数据对不同难度问题的影响。\n    - 总结了大量发现和见解，以促进未来关于OOD数据获取和评估的研究；结果显示，当前的VLM在OODBench上的性能有明显下降。\n\n- [流匹配条件下的连续异常检测：基于流形感知谱空间的自动驾驶应用](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.17586)\n  - 安东尼奥·吉列恩-佩雷斯\n  - 发表日期：2026年2月19日\n  - 任务：感知\n  - 数据集：[Waymo开放运动数据集](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 摘要：\n    - 提出Deep-Flow框架，这是一种无监督的安全关键异常检测系统，采用最优传输条件流匹配（OT-CFM）来建模专家驾驶的概率密度。\n    - 通过主成分分析引入低秩谱流形，以确保运动学平滑性和稳定的对数似然估计；同时使用带有目标条件的早期融合Transformer处理多模态路口情况。\n    - 识别出运动学危险与语义不合规之间的可预测性差距，揭示了诸如车道违规等被忽视的分布外行为，从而支持基于数据的安全验证。\n\n- [DriveFine：精炼增强型掩码扩散视觉-语言-行动模型，用于精准稳健的自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.14577)\n  - 党晨旭、昂思宁、李永康、田浩辰、王杰、李广、叶航军、马杰、陈龙、王燕\n  - 发表日期：2026年2月16日\n  - 代码：[DriveFine](https:\u002F\u002Fgithub.com\u002FMSunDYY\u002FDriveFine)\n  - 任务：规划\n  - 
数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - DriveFine是一种结合灵活解码与自我修正功能的掩码扩散视觉-语言-行动（VLA）模型，专为自动驾驶设计。\n    - 引入了一种新颖的即插即用块-MoE架构，将精炼专家与生成专家解耦，以保留预训练能力并实现专家的显式选择。\n    - 采用混合强化学习策略，在保持稳定性的同时鼓励有效探索，在NAVSIM基准测试中表现出强大的效能和鲁棒性。\n\n- [基于自监督JEPA的世界模型：用于LiDAR占用率补全与预测](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.12540)\n  - 朱浩然、安娜·霍罗马斯卡\n  - 出版单位：纽约大学\n  - 发表日期：2026年2月13日\n  - 任务：预测\n  - 数据集：[Waymo开放数据集](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 摘要：\n    - 提出AD-LiST-JEPA，一种用于自动驾驶的自监督世界模型，它利用联合嵌入预测架构（JEPA）从LiDAR数据中预测未来的时空演化。\n    - 通过下游的基于LiDAR的占用率补全与预测（OCF）任务来评估所学表示，从而同时评估感知与预测能力。\n    - 结果表明，在经过JEPA世界模型学习后，使用预训练编码器进行OCF任务的表现有所提升，这得益于对大量未标注数据的有效利用。\n\n- [Talk2DM：借助大型语言模型为车路云一体化动态地图实现自然语言查询与常识推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.11860)\n  - 卢涛、罗金轩、渡边洋介、周正书、陆宇桓、应申、张攀、赵飞、高田博明\n  - 发表日期：2026年2月12日\n  - 任务：推理\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - Talk2DM是一个即插即用的模块，通过支持自然语言查询和常识推理的能力扩展了车路云动态地图（VRC-DM）系统。\n    - 它基于一种新颖的提示链机制（CoP），将人类定义的规则与大型语言模型（LLM）的常识知识相结合。\n    - 实验表明，Talk2DM能够在不同LLM之间无缝切换，同时保持较高的查询准确率，并以2至5秒的响应时间实现超过93%的准确率，展现出实际应用潜力。\n\n- [SToRM：面向高效端到端自动驾驶的多模态LLM监督令牌缩减](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.11656)\n  - 金瑞贤、朴镇福、具度延、朴浩根、春一勇\n  - 发表日期：2026年2月12日\n  - 任务：端到端\n  - 摘要：\n    - 提出SToRM，这是首个用于多模态LLM的监督令牌缩减框架，旨在实现高效的端到端自动驾驶，同时保持与使用全部视觉令牌相当的性能。\n    - 该框架采用轻量级重要性预测器、基于全令牌LLM伪监督的监督训练方法，以及锚点上下文融合模块来减少令牌冗余。\n    - 实验表明，在相同的令牌预算下，SToRM的表现优于当前最先进的E2E驾驶MLLM，能够在降低高达30倍计算成本的同时维持性能。\n\n- [ResWorld：用于端到端自动驾驶的时序残差世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.10884)\n  - 张锦清、傅泽华、徐泽林、戴文英、刘庆杰、王云宏\n  - 发表日期：2026年2月11日\n  - 代码：[ResWorld](https:\u002F\u002Fgithub.com\u002Fmengtan00\u002FResWorld.git)\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - 针对端到端自动驾驶提出时序残差世界模型（TR-World），通过计算场景表示的时序残差来提取动态信息，无需目标检测或跟踪即可完成动态物体建模。\n    - 引入未来引导轨迹精炼模块（FGTR），该模块将先验轨迹与未来的BEV特征相结合，利用未来的道路状况对轨迹进行精炼，并提供监督以防止世界模型崩溃。\n    - 在nuScenes和NAVSIM数据集上实现了最先进的规划性能。\n\n- [从转向到踩踏：自动驾驶VLM能否泛化应用于骑行者辅助的空间感知与规划？](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.10771)\n  - 克里希纳·坎特·纳卡、维达斯里·纳卡\n  - 发表日期：2026年2月11日\n  - 任务：VQA、感知、规划\n  - 摘要：\n    - 引入CyclingVQA诊断基准，用于评估视觉-语言模型（VLM）在骑行者为中心的感知、时空理解以及交通规则到车道推理方面的能力。\n    - 对31多种VLM进行了评估，发现当前模型虽表现出令人鼓舞的能力，但在骑行者特定推理方面仍存在明显差距，尤其是专为驾驶设计的模型在此领域往往不如强大的通用型VLM表现优异。\n    - 提供系统的错误分析，识别出反复出现的失效模式，以指导更有效的骑行者辅助智能系统的开发。\n\n- [HiST-VLA：用于端到端自动驾驶的分层时空视觉-语言-行动模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.13329)\n  - 王一儒、顾子冲、高宇、蒋安青、孙志刚、王硕、衡宇文、孙浩\n  - 发表日期：2026年2月11日\n  - 任务：端到端\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - 提出HiST-VLA，一种用于可靠轨迹生成的分层时空VLA模型，以解决数值推理不精确和三维空间感知薄弱等局限性。\n    - 集成几何感知、状态历史提示和动态令牌稀疏化技术，以增强推理能力和计算效率。\n    - 采用基于分层Transformer的规划器，并结合动态潜在正则化来优化VLA航点，从而在NAVSIM v2基准测试中达到最先进水平。\n\n- [Found-RL：用于自动驾驶的基础模型增强强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.10458)\n  - 曲岩松、盛子豪、黄子琳、陈建聪、罗宇浩、王天义、冯一恒、塞缪尔·拉比、陈思凯\n  - 出版单位：普渡大学\n  - 发表日期：2026年2月11日\n  - 代码：[Found-RL](https:\u002F\u002Fgithub.com\u002Fys-qu\u002Ffound-rl)\n  - 任务：端到端\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - Found-RL是一个利用基础模型高效提升自动驾驶强化学习的平台，配备异步批量推理框架，可将繁重的VLM推理从仿真循环中解耦。\n    - 引入价值边际正则化（VMR）和优势加权动作引导（AWAG），将VLM的动作建议提炼为强化学习策略，并采用条件对比动作对齐方法解决CLIP在奖励塑造方面的动态盲点问题。\n\n- [自适应时间步长流匹配用于自动驾驶运动规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.10285)\n  - 
阿南雅·特里维迪、李安健、穆罕默德·埃尔努尔、优素福·乌穆特·奇夫奇、阿维纳什·辛格、乔文·德萨、裴尚载、大卫·伊瑟勒、塔斯金·帕迪尔、法伊赞·M·塔里克\n  - 出版单位：东北大学\n  - 发表日期：2026年2月10日\n  - 项目页面：[自动驾驶流匹配](https:\u002F\u002Fflow-matching-self-driving.github.io\u002F)\n  - 任务：规划\n  - 数据集：[Waymo开放运动数据集](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 摘要：\n    - 提出一种条件流匹配框架，用于实时联合预测周围交通参与者及本车轨迹规划，利用轻量级方差估计器在线自适应选择推理步长。\n    - 引入轨迹后处理步骤，将其作为凸二次规划问题来提升乘坐舒适性，且计算开销几乎可以忽略。\n    - 实现20赫兹的更新频率，支持在线部署，并在变道和无保护左转等操作中相比基线模型展现出更好的平滑性和约束遵守能力。\n\n- [SteerVLA：在长尾驾驶场景中引导视觉-语言-行动模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.08440)\n  - 高天、陈思琳、凯瑟琳·格洛索普、蒂莫西·高、孙建凯、凯尔·斯塔霍维奇、吴雪莉、奥耶尔·梅斯、多尔萨·萨迪格、谢尔盖·列维纳、切尔西·芬恩\n  - 发表日期：2026年2月9日\n  - 项目页面：[SteerVLA](https:\u002F\u002Fsteervla.github.io\u002F)\n  - 任务：规划\n  - 摘要：\n    - SteerVLA是一种利用视觉-语言模型（VLM）的推理能力，生成细粒度的语言指令来引导视觉-语言-行动（VLA）驾驶策略的方法。\n    - 在高层VLM和底层VLA策略之间引入丰富的语言接口，将推理结果映射到控制输出上，并借助VLM为驾驶数据添加详细的语言标注。\n    - 在一项具有挑战性的闭环基准测试中进行评估，其总体驾驶评分比当前最先进方法高出4.77分，在长尾子集上的表现更是领先8.04分。\n\n- [视觉与语言：用于驾驶场景安全评估和自动驾驶车辆规划的新颖表示与人工智能](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.07680)\n  - 罗斯·格里尔、迈特拉伊·凯斯卡尔、安赫尔·马丁内斯-桑切斯、帕尔提布·罗伊、沙尚克·施里拉姆、莫汉·特里维迪\n  - 发表日期：2026年2月7日\n  - 任务：规划\n  - 数据集：[Waymo开放数据集](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)、[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 提出了一种基于CLIP图像-文本相似度的轻量级、类别无关的危险筛查方法，用于低延迟语义危险检测，无需显式目标检测。\n    - 探讨将场景级视觉-语言嵌入整合到基于Transformer的轨迹规划器中，发现简单的条件化并不能提高精度，这凸显了任务导向型表示提取的重要性。\n    - 研究使用自然语言作为运动规划的显式行为约束，表明基于视觉场景的乘客式指令能够抑制严重的规划失败，并在模糊场景中提升安全性。\n\n- [透过文字看道路：一种语言引导的RGB-T驾驶场景分割框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.07343)\n  - 鲁图拉吉·雷迪、赫里沙夫·巴库尔·巴鲁阿、俊勇·卢、清氏·阮、加内什·克里希纳萨米\n  - 发表日期：2026年2月7日\n  - 任务：感知\n  - 摘要：\n    - 提出CLARITY框架，用于RGB-热红外语义分割，该框架利用视觉-语言模型（VLM）先验知识，根据检测到的场景条件动态调整融合策略。\n    - 引入机制以保留常被降噪方法忽略的有效暗目标语义，并采用层次化解码器来确保结构一致性，从而获得更清晰的目标边界。\n    - 在MFNet数据集中达到最先进的性能，mIoU为62.3%，mAcc为77.5%。\n\n- [DriveWorld-VLA：面向自动驾驶的统一潜在空间世界建模与视觉-语言-行动技术](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.06521)\n  - 贾飞阳、刘林、宋紫莹、贾彩燕、叶航军、郝晓帅、陈龙\n  - 发表日期：2026年2月6日\n  - 代码：[DriveWorld-VLA](https:\u002F\u002Fgithub.com\u002Fliulin815\u002FDriveWorld-VLA.git)\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)、[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 一种新颖的框架，通过在表示层面紧密集成视觉-语言-行动（VLA）和世界模型，将世界建模与规划统一于潜在空间中。\n    - 使VLA规划器能够直接受益于对场景整体演化的建模，并评估候选动作对未来场景演化的影响，从而减少对密集标注监督的依赖。\n    - 在NAVSIM和nuScenes基准测试中均取得最先进的性能，支持在特征层面进行可控的、动作条件化的想象。\n\n- [ROMAN：用于自动驾驶系统测试的奖励编排多头注意力网络](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.05629)\n  - 奇建磊、吴宇振、侯家轩、张晓东、范明、孙苏慧、戴伟俊、李博、孙建国、孙军\n  - 出版商：百度\n  - 发表日期：2026年2月5日\n  - 任务：生成\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - ROMAN是一种用于ADS测试的新型场景生成方法，它结合多头注意力网络与交通法规权重机制，以生成高风险违规场景。\n    - 该方法使用基于LLM的风险权重模块，根据严重性和发生频率评估违规行为，并模拟车辆与交通信号之间的交互作用。\n    - 在CARLA中的百度Apollo ADS上进行评估后，ROMAN在违规数量和多样性方面均超越了现有工具，能够针对输入交通法规的每一条款生成相应场景。\n\n- [AppleVLM：基于先进感知与规划增强型视觉-语言模型的端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.04256)\n  - 韩宇轩、吴坤元、邵千怡、肖仁翔、王子璐、蒋灿森、萧毅、胡亮、楼云江\n  - 发表日期：2026年2月4日\n  - 任务：端到端\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - AppleVLM是一种先进的感知与规划增强型VLM模型，旨在实现稳健的端到端驾驶，解决车道感知不佳和语言偏差等问题。\n    - 引入一种基于可变形Transformer的新型视觉编码器，用于多视角、多时间步的融合；同时设计专门的规划模态编码鸟瞰信息，以缓解导航偏差。\n    - 其VLM解码器经过分层“思维链”微调，能够整合视觉、语言和规划特征，在CARLA中达到最先进的性能，并成功部署于AGV的实际应用中。\n\n- [基于视觉-语言-行动模型的自动驾驶中场景响应型人机协作运动规划的自然语言指令](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.04184)\n  - 安赫尔·马丁内斯-桑切斯、帕尔提布·罗伊、罗斯·格里尔\n  - 出版商：Mi3-Lab\n  - 发表日期：2026年2月4日\n  - 
代码：[doScenes-VLM-Planning](https:\u002F\u002Fgithub.com\u002FMi3-Lab\u002FdoScenes-VLM-Planning)\n  - 任务：规划\n  - 数据集：[doScenes](https:\u002F\u002Fgithub.com\u002FMi3-Lab\u002FdoScenes-VLM-Planning)、[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 首次推出doScenes数据集，将自由形式的乘客指令与nuScenes的真实运动轨迹标注关联起来，用于指令条件化的规划。\n    - 对OpenEMMA端到端驾驶框架进行改造，以整合乘客风格的语言提示，从而在生成轨迹之前实现语言条件化。\n    - 实验证明，指令条件化能够显著提升规划的鲁棒性，防止极端失败的发生，并使轨迹更好地符合措辞得当的指令。\n\n- [InstaDrive：面向真实且一致视频生成的实例感知驾驶世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.03242)\n  - 作者：杨卓然、郭熙、丁晨静、王驰宇、吴伟、张延勇\n  - 发表日期：2026年2月3日\n  - 项目主页：[InstaDrive](https:\u002F\u002Fshanpoyang654.github.io\u002FInstaDrive\u002Fpage.html)\n  - 任务：生成\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - InstaDrive是一种新颖的驾驶视频生成框架，旨在解决世界模型中实例级别的时序一致性和空间几何保真度问题。\n    - 引入了实例流引导器以传播实例特征实现时序一致性，并设计了空间几何对齐器来精确建模物体位置及遮挡关系。\n    - 该方法在视频生成质量上达到最新水平，并能有效提升下游自动驾驶任务性能；同时结合CARLA中的程序化仿真环境，可用于评估安全关键场景。\n\n- [ConsisDrive：基于实例掩码的保身份驾驶世界模型用于视频生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.03213)\n  - 作者：杨卓然、张延勇\n  - 发表日期：2026年2月3日\n  - 项目主页：[ConsisDrive](https:\u002F\u002Fshanpoyang654.github.io\u002FConsisDrive\u002Fpage.html)\n  - 任务：生成\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 提出ConsisDrive，一种保身份的驾驶世界模型，通过在实例层面强制保持时序一致性来防止生成视频中出现身份漂移现象。\n    - 包含两个核心组件：实例掩码注意力机制用于跨帧保持目标身份，以及实例掩码损失函数用以自适应强调前景区域。\n    - 在驾驶视频生成方面达到当前最优水平，并在nuScenes数据集上的下游自动驾驶任务中表现出显著改进。\n\n- [LiFlow：用于3D LiDAR场景补全的流匹配方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.02232)\n  - 作者：安德烈亚·马特扎齐、迪特马尔·图奇\n  - 出版单位：帕尔马大学\n  - 发表日期：2026年2月2日\n  - 代码：[LiFlow](https:\u002F\u002Fgithub.com\u002Fmatteandre\u002FLiFlow)\n  - 任务：感知\n  - 数据集：[KITTI](https:\u002F\u002Fwww.cvlibs.net\u002Fdatasets\u002Fkitti\u002F)\n  - 摘要：\n    - 首次提出针对3D LiDAR场景补全的流匹配框架，相较于基于扩散的方法，该方法通过确保训练与推理阶段初始分布的一致性来提升性能。\n    - 使用最近邻流匹配损失和Chamfer距离损失，以增强点云对齐过程中的局部结构细节和全局覆盖范围。\n    - 在场景补全的多项指标上均达到当前最佳表现。\n\n- [UniDriveDreamer：用于自动驾驶的单阶段多模态世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.02002)\n  - 作者：赵国胜、王耀增、王晓峰、朱铮、于廷东、黄冠、宰永臣、焦继、薛长亮、王小乐、杨振、朱福堂、王兴刚\n  - 发表日期：2026年2月2日\n  - 任务：预测、生成\n  - 摘要：\n    - UniDriveDreamer是一种单阶段统一的多模态世界模型，可直接生成未来多模态观测结果（多摄像头视频和LiDAR序列），无需中间表示或级联模块。\n    - 引入了LiDAR专用VAE和视频VAE，并提出统一潜在锚定（ULA）方法，以对齐两者的潜在分布，从而实现跨模态兼容性和训练稳定性。\n    - 采用扩散Transformer联合建模对齐融合特征的几何对应关系和时间演化过程，并以各模态的结构化场景布局信息作为条件。\n\n- [UniDWM：通过多面表征学习迈向统一的驾驶世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.01536)\n  - 作者：刘帅、任思恒、朱晓瑶、梁泉民、李泽峰、李强、胡欣、黄凯\n  - 发表日期：2026年2月2日\n  - 代码：[UniDWM](https:\u002F\u002Fgithub.com\u002FSay2L\u002FUniDWM)\n  - 任务：感知、预测、规划、生成\n  - 摘要：\n    - UniDWM是一种通过多面表征学习推进自动驾驶的统一驾驶世界模型，能够构建兼具结构与动态感知的潜在世界表示。\n    - 该模型采用联合重建路径和协作式生成框架，并结合条件扩散Transformer来预测未来的世界演变。\n    - 研究表明，该框架是VAE的一种变体，为多面表征学习提供了理论指导，并且在轨迹规划、4D重建及生成等方面具有实际应用价值。\n\n- [HERMES：端到端风险感知多模态具身系统——结合视觉-语言模型应对长尾场景下的自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.00993)\n  - 作者：唐伟哲、游俊伟、刘佳希、王兆义、甘锐、黄子林、魏峰、冉斌\n  - 出版单位：威斯康星大学麦迪逊分校、加州大学伯克利分校、清华大学\n  - 发表日期：2026年2月1日\n  - 任务：端到端\n  - 摘要：\n    - HERMES是一个整体性的风险感知端到端多模态驾驶框架，旨在将明确的长尾风险线索注入轨迹规划中，以实现在混合交通场景下的安全运行。\n    - 利用基础模型辅助的标注流程生成结构化的长尾场景上下文和长尾规划上下文，捕捉以危险为中心的线索，同时考虑驾驶意图和安全偏好。\n    - 引入三模态驾驶模块，融合多视角感知、历史运动线索和语义引导，从而在长尾条件下实现风险敏感且精准的轨迹规划。\n\n- [Drive-JEPA：视频JEPA结合多模态轨迹蒸馏用于端到端驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.22032)\n  - 作者：王林涵、杨子冲、白晨、张国祥、刘晓彤、郑晓音、龙小小、卢昌田、陆成\n  - 发表日期：2026年1月29日\n  - 任务：端到端\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - 提出Drive-JEPA框架，将视频联合嵌入预测架构（V-JEPA）与多模态轨迹蒸馏相结合，用于端到端自动驾驶。\n    - 
对V-JEPA进行适配，使其能够在驾驶视频上预训练ViT编码器，从而获得与规划任务相契合的预测性表征；同时引入以提案为中心的规划器，从多种模拟器和人类驾驶轨迹中提炼知识。\n    - 在NAVSIM数据集上取得了当前最优性能，其中V-JEPA表征优于现有方法，而完整框架则树立了新的基准。\n\n- [LLM驱动的场景感知自动驾驶规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.21876)\n  - 李鹤、陈兆伟、高睿、李国梁、郝奇、王帅、徐成中\n  - 发表日期：2026年1月29日\n  - 任务：规划\n  - 摘要：\n    - 提出LAP，一种由大语言模型驱动的自适应规划方法，可在高速和精确驾驶模式之间切换。\n    - 通过利用大语言模型进行场景理解，并将其推理结果融入模式配置与运动规划的联合优化中实现这一目标，采用树搜索MPC和交替最小化算法求解。\n    - 在ROS中的实现表明，在仿真环境中，其驾驶时间和成功率均优于基准方法。\n\n- [Drive-KD：面向自动驾驶的多教师知识蒸馏框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.21288)\n  - 连伟通、唐泽聪、李浩然、高天健、王一飞、王子旭、孟凌翼、茹腾驹、崔哲俊、朱义晨、曹航硕、康琪、陈天行、秦宇森、王凯旋、张宇\n  - 发表日期：2026年1月29日\n  - 任务：规划\n  - 摘要：\n    - Drive-KD框架将自动驾驶分解为“感知-推理-规划”三元结构，并通过知识蒸馏传递这些能力。\n    - 引入层间注意力作为蒸馏信号，构建针对不同能力的单教师模型及采用非对称梯度投影机制的多教师框架，以缓解梯度冲突。\n    - 蒸馏后的InternVL3-1B模型在DriveBench上的综合性能优于预训练的78B模型，在规划任务上甚至超越GPT-5.1，同时显著降低了GPU内存占用并提升了吞吐量。\n\n- [ScenePilot-Bench：用于评估自动驾驶中视觉-语言模型的大规模数据集与基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.19582)\n  - 王宇锦、郑宇彤、范文贤、王天毅、褚洪庆、田大鑫、高炳钊、王建强、陈宏\n  - 发表日期：2026年1月27日\n  - 任务：评估\n  - 摘要：\n    - 介绍ScenePilot-Bench，一个基于多样化ScenePilot-4K数据集（包含3,847小时驾驶视频及多粒度标注）构建的大规模第一人称驾驶基准测试。\n    - 具有四维评估体系，分别从场景理解、空间感知、运动规划及GPT-Score四个方面评估VLM的能力，并引入安全相关指标及跨区域泛化设置。\n    - 提供代表性VLM的实证分析，以明确其性能边界并识别在安全关键场景下面向驾驶的推理能力差距。\n\n- [AutoDriDM：面向自动驾驶中视觉-语言模型决策能力的可解释性基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.14702)\n  - 唐泽聪、王子旭、王一飞、连伟通、高天健、李浩然、茹腾驹、孟凌翼、崔哲俊、朱义晨、康琪、王凯旋、张宇\n  - 发表日期：2026年1月21日\n  - 任务：评估\n  - 数据集：[AutoDriDM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.14702)\n  - 摘要：\n    - AutoDriDM是一个以决策为核心的渐进式基准测试，用于评估自动驾驶中的视觉-语言模型（VLM），包含对象、场景和决策三个维度的6,650道题目。\n    - 该基准测试评估主流VLM，以划定感知到决策的能力边界，揭示了感知与决策性能之间的薄弱关联。\n    - 包含对模型推理过程的可解释性分析，识别关键失效模式，并引入分析模型以实现大规模自动标注。\n\n- [VILTA：用于提升驾驶策略鲁棒性的VLM闭环对抗系统](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.12672)\n  - 陈启茂、李芳、许绍青、赖志毅、谢子迅、罗岳辰、江圣寅、李汉冰、陈龙、王兵、张毅、杨志新\n  - 发表日期：2026年1月19日\n  - 任务：规划\n  - 摘要：\n    - 提出VILTA（VLM闭环轨迹对抗系统），这是一种新颖的框架，可将视觉语言模型直接集成到自动驾驶智能体的闭环训练中。\n    - VLM能够主动理解动态环境，并通过直接编辑周围其他智能体的未来轨迹来生成具有挑战性的长尾场景。\n    - 该方法利用VLM的泛化能力，创建多样化的潜在危急场景课程，从而显著提升最终驾驶策略的安全性和鲁棒性。\n\n- [生成式场景回放用于端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.11475)\n  - 拉吉夫·亚萨拉、迪普蒂·赫格德、韩世忠、程信沛、史云霄、梅萨姆·萨德吉戈加里、什韦塔·马哈詹、阿普拉蒂姆·巴塔查里亚、刘立天、里希克·加雷帕利、托马斯·斯万特松、法提赫·波里克利、蔡宏\n  - 出版商：高通AI研究院\n  - 发表日期：2026年1月16日\n  - 任务：规划\n  - 数据集：[Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)\n  - 摘要：\n    - 提出生成式场景回放（GeRo），这是一个即插即用的框架，适用于视觉-语言-行动（VLA）模型，可通过自回归回放联合执行规划与基于语言的未来交通场景生成。\n    - 使用回放一致性损失来稳定预测并保持文本与动作的一致性，从而实现时间一致且基于语言的长期推理和多智能体规划。\n    - 将强化学习与生成式回放相结合，在Bench2Drive上实现了最先进的闭环和开环性能，并展现出强大的零样本鲁棒性。\n\n- [MAD：运动与外观解耦以构建高效驾驶世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.09452)\n  - 阿迈德·拉希米、瓦伦丁·热拉尔、埃洛伊·扎布洛基、马蒂厄·科尔德、亚历山大·阿拉伊\n  - 出版商：EPFL VITA\n  - 发表日期：2026年1月14日\n  - 项目页面：[MAD-World-Model](https:\u002F\u002Fvita-epfl.github.io\u002FMAD-World-Model\u002F)\n  - 任务：生成\n  - 摘要：\n    - 提出一种高效的适配框架，通过将运动学习与外观合成解耦，将通用视频扩散模型转化为可控的驾驶世界模型。\n    - 采用两阶段推理-渲染范式：首先通过骨架化智能体视频推断动态，再渲染逼真的RGB外观，从而以不到先前计算资源6%的成本达到最先进水平。\n\n- [利用3D表征对齐与RGB预训练先验进行LiDAR场景生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.07692)\n  - 尼古拉斯·塞雷约尔-加罗斯、埃林顿·柯比、维克多·贝斯尼耶、内尔敏·萨梅特\n  - 出版商：法雷奥AI\n  - 发布日期：2026年1月12日\n  - 项目页面：[R3DPA](https:\u002F\u002Fgithub.com\u002Fvaleoai\u002FR3DPA)\n  - 代码：[R3DPA](https:\u002F\u002Fgithub.com\u002Fvaleoai\u002FR3DPA)\n  - 任务：生成\n  - 数据集：[KITTI-360](https:\u002F\u002Fwww.cvlibs.net\u002Fdatasets\u002Fkitti-360\u002F)\n  - 摘要：\n    - R3DPA是一种LiDAR场景生成方法，它解锁了用于LiDAR点云的图像预训练先验，并利用自监督3D表征实现最先进的结果。\n    - 
该方法将中间生成特征与自监督3D特征对齐，以提高质量，并从大规模图像预训练模型中迁移知识，从而缓解LiDAR数据不足的问题。\n    - 在推理时仅使用无条件模型即可实现点云控制，适用于物体修复和场景混合等任务。\n\n- [通过场景区域压缩实现自动驾驶高效视觉问答流水线](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.07092)\n  - 蔡宇良、叶东强子、陈子天、吴崇若\n  - 发布日期：2026年1月11日\n  - 任务：VQA\n  - 摘要：\n    - 提出SRC-Pipeline，一种用于自动驾驶VQA的高效VLM框架，它将早期帧的token压缩为少量高级token，同时保留近期帧的完整token。\n    - 在保持相近性能的同时，计算量减少了66%，从而能够在安全关键的驾驶环境中更有效地实现实时运行。\n\n- [模块化自主性与对话交互：LLM驱动的自动驾驶决策框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.05806)\n  - 马文·泽格特、科尔比尼安·莫勒、约翰内斯·贝茨\n  - 出版商：慕尼黑工业大学\n  - 发布日期：2026年1月9日\n  - 任务：规划\n  - 摘要：\n    - 提出一种用于自动驾驶系统（ADS）中对话交互的LLM驱动框架，将基于LLM的交互层与Autoware软件栈集成。\n    - 引入三步方法：对交互类别进行分类，开发面向应用的领域特定语言（DSL）用于指令翻译，以及建立一个保障安全的验证层。\n    - 采用两阶段LLM架构以确保高度透明并提供执行反馈，评估证实其时间效率和翻译鲁棒性。\n\n- [SGDrive：面向自动驾驶的场景-目标层次世界认知](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.05640)\n  - 李京宇、吴俊杰、胡东楠、黄翔凯、孙斌、郝志辉、郎贤鹏、朱夏田、张莉\n  - 发布日期：2026年1月9日\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - 提出SGDrive，一种新颖的框架，它围绕驾驶特定的场景-智能体-目标层次结构构建视觉-语言模型（VLM）的表征学习，以模拟人类驾驶认知。\n    - 通过提供结构化的时空表征来解决通用VLM的局限性，实现安全的轨迹规划，并将多级信息整合为紧凑格式。\n    - 在NAVSIM基准测试中，作为纯摄像头方法达到了最先进水平，验证了层次化知识结构在适应VLM用于自动驾驶方面的有效性。\n\n- [LatentVLA：通过潜在动作预测实现自动驾驶高效视觉-语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.05611)\n  - 谢成恩、孙斌、李天宇、吴俊杰、郝志辉、郎贤鹏、李洪洋\n  - 发布日期：2026年1月9日\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)、[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - LatentVLA是一种新颖的框架，利用自监督的潜在动作预测来训练视觉-语言-行动（VLA）模型，无需语言标注，从而消除语言偏见。\n    - 它通过知识蒸馏将VLA的泛化能力迁移到高效的视觉网络中，实现了稳健的性能和实时效率。\n    - 在NAVSIM基准测试中建立了新的最先进水平（PDMS得分为92.4），并在nuScenes上表现出强大的零样本泛化能力。\n\n- [ThinkDrive：思维链引导的渐进式强化学习微调用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.04714)\n  - 赵昌、杨哲明、胡云清、郭琪、王子健、李鹏程、季文\n  - 发布日期：2026年1月8日\n  - 任务：规划\n  - 摘要：\n    - ThinkDrive是一种由思维链（CoT）引导的渐进式强化学习微调框架，用于自动驾驶，它将显式推理与难度感知的自适应策略优化相结合。\n    - 该方法采用两阶段训练策略：首先使用CoT解释进行监督微调（SFT），然后应用渐进式RL，并结合难度感知的自适应策略优化器。\n    - 在公开数据集上的评估表明，ThinkDrive的表现优于强大的RL基线，而使用该方法训练的2B参数模型在考试指标上超越了更大的GPT-4o。\n\n- [UniDrive-WM：用于自动驾驶的统一理解、规划和生成世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.04453)\n  - 熊哲晓、叶欣、布尔汉·亚曼、程胜、陆一仁、罗静如、内森·雅各布斯、任刘\n  - 出版商：圣路易斯华盛顿大学、博世研究院\n  - 发布日期：2026年1月7日\n  - 项目页面：[UniDrive-WM](https:\u002F\u002Funidrive-wm.github.io\u002FUniDrive-WM)\n  - 任务：规划、生成\n  - 数据集：[Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)\n  - 摘要：\n    - UniDrive-WM是一种基于VLM的统一世界模型，在单个架构中联合完成驾驶场景理解、轨迹规划以及基于轨迹条件的未来图像生成。\n    - 模型预测的轨迹会条件化一个基于VLM的图像生成器，以生成合理的未来帧，从而提供监督信号，增强理解并迭代优化轨迹生成。\n    - 在Bench2Drive上的实验显示，相比先前方法，L2轨迹误差降低了5.9%，碰撞率降低了9.2%，证明了将推理、规划和生成建模相结合的优势。\n\n- [一种用于越野自动驾驶的视觉提示型视觉-语言-行动模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.03519)\n  - 张良栋、聂一鸣、李浩洋、孔凡杰、张宝宝、黄顺鑫、傅凯、闵晨、肖亮\n  - 发表日期：2026年1月7日\n  - 任务：规划\n  - 数据集：[RELLIS-3D](https:\u002F\u002Fgithub.com\u002Funmannedlab\u002FRELLIS-3D)\n  - 摘要：\n    - 提出OFF-EMMA，这是一种端到端的多模态框架，用于越野自动驾驶，旨在解决VLA模型中空间感知不足和推理不稳定的问题。\n    - 引入基于语义分割掩码的视觉提示模块，以增强空间理解能力；同时采用具有自一致性（COT-SC）的思维链推理策略，提升规划的鲁棒性。\n\n- [FROST-Drive：基于冻结视觉编码器的可扩展且高效的端到端驾驶系统](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.03460)\n  - 董泽宇、朱一民、吴宇、孙宇\n  - 发表日期：2026年1月6日\n  - 任务：端到端\n  - 数据集：[Waymo开放端到端数据集](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 摘要：\n    - FROST-Drive是一种新颖的端到端（E2E）架构，通过保持预训练视觉-语言模型（VLM）的视觉编码器冻结状态，从而保留其泛化能力。\n    - 该模型将冻结的编码器与基于Transformer的适配器以及基于GRU的航点生成解码器相结合，并采用针对评分者反馈分数（RFS）的自定义损失函数进行优化。\n    - 在Waymo开放端到端数据集上表现出色，表明利用冻结的通用VLM编码器比完全微调更能实现稳健的驾驶性能。\n\n- 
[DrivingGen：面向自动驾驶的生成式视频世界模型综合基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.01528)\n  - 周阳、邵浩、王乐天、宗卓凡、李洪生、史蒂文·L·瓦斯兰德\n  - 出版单位：多伦多大学、上海人工智能实验室、香港中文大学\n  - 发表日期：2026年1月4日\n  - 任务：评估\n  - 数据集：[DrivingGen](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.01528)\n  - 摘要：\n    - DrivingGen是首个针对生成式驾驶世界模型的综合性基准测试，结合了多样化的评估数据集和一套新的评价指标。\n    - 该基准从视觉逼真度、轨迹合理性、时间连贯性及可控性等多个维度进行评估，以推动可靠且可部署的驾驶世界模型的发展。\n    - 对14种最先进模型的评估揭示了通用模型与驾驶专用模型之间的权衡，为可扩展的仿真与规划提供了统一的框架。\n\n- [二分扩散策略优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.00898)\n  - 梁瑞明、郑怡楠、郑可欣、谭天一、李建雄、毛立元、王志浩、陈光、叶杭军、刘静静、王金桥、詹宪远\n  - 发表日期：2025年12月31日\n  - 任务：端到端\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - 提出DIPOLE（二分扩散策略改进），这是一种新颖的强化学习算法，通过将最优策略分解为一对分别用于奖励最大化和最小化的二分策略，实现稳定且可控的扩散策略优化。\n    - 在推理过程中，可通过线性组合所学二分策略的得分来灵活控制贪婪程度。\n    - 在离线及离线转在线强化学习场景（ExORL、OGBench）中均展现出有效性，并成功应用于在NAVSIM基准上训练大型视觉-语言-行动模型以实现端到端自动驾驶。\n\n- [LSRE：用于自动驾驶中实时语义风险检测的潜在语义规则编码](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.24712)\n  - 程谦、周伟涛、景成、邓南山、文俊泽、刘兆阳、江坤、杨迪安\n  - 发表日期：2025年12月31日\n  - 任务：感知\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - LSRE是一种潜在语义规则编码框架，能够将稀疏采样的VLM判断转化为循环世界模型潜在空间内的决策边界，从而实现实时语义风险评估。\n    - 无需每帧都调用VLM即可实现10 Hz的实时语义风险检测，准确率与大型VLM基线相当，且能更早地预测危险、延迟更低。\n    - 该方法对罕见但语义相似的测试案例也表现出良好的泛化能力，表明语言引导的潜在分类机制是进行语义安全监控的有效手段。\n\n- [反事实VLA：具有自反思能力和适应性推理的视觉-语言-行动模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.24426)\n  - 彭正浩“马克”、丁文豪、游雨蓉、陈宇晓、罗文杰、田托马斯、曹宇龙、阿普尔瓦·夏尔马、徐丹菲、鲍里斯·伊万诺维奇、李博毅、周伯雷、王燕、马可·帕沃内\n  - 发表日期：2025年12月30日\n  - 任务：规划\n  - 摘要：\n    - 提出反事实VLA（CF-VLA），这是一种自反思的VLA框架，可通过反事实推理在执行前对计划行动进行思考和修正。\n    - 提议采用回放-过滤-标注流程来挖掘高价值场景，并标注反事实推理轨迹，以实现高效训练。\n    - 实验表明，该方法在轨迹准确性和安全指标方面均有显著提升（分别高达17.6%和20.5%），且自适应推理仅在复杂场景中启用。\n\n- [面向自动驾驶的空间感知视觉语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.24331)\n  - 魏伟杰、罗志鹏、冯玲、艾琳·利昂\n  - 发表日期：2025年12月30日\n  - 任务：感知\n  - 摘要：\n    - LVLDrive（激光雷达-视觉-语言）是一种新框架，旨在通过引入激光雷达点云作为额外的输入模态，将现有的视觉-语言模型（VLM）升级为具备强大3D度量空间理解能力的自动驾驶专用模型。\n    - 引入渐进式融合Q-Former，逐步注入激光雷达特征，从而减轻对预训练VLM的灾难性干扰，同时保留其原有知识体系。\n    - 开发了一套空间感知问答（SA-QA）数据集，专门用于训练模型的高级3D感知与推理能力，使其在驾驶基准测试中表现优异。\n\n- [DriveLaW：在潜在驾驶世界中统一规划与视频生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.23421)\n  - 夏天泽、李永康、周立军、姚景峰、熊凯欣、孙海洋、王兵、马坤、陈光、叶航俊、刘文宇、王兴刚\n  - 出版单位：华中科技大学、小米汽车\n  - 发表日期：2025年12月29日\n  - 任务：规划、生成\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - DriveLaW是一种新范式，通过将视频生成器中的潜在表征直接注入规划器，从而统一自动驾驶的视频生成与运动规划。\n    - 该框架由用于高保真预测的世界模型DriveLaW-Video和用于生成一致轨迹的扩散规划器DriveLaW-Act组成，两者均采用三阶段渐进式训练策略进行优化。\n    - 实现了当前最佳性能，显著提升了视频预测指标，并在NAVSIM规划基准上创下新纪录。\n\n- [ColaVLA：利用认知潜在推理实现自动驾驶中的层次化并行轨迹规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.22939)\n  - 彭启航、陈雪松、杨晨晔、史绍帅、李洪生\n  - 出版单位：香港中文大学、上海人工智能实验室\n  - 发表日期：2025年12月28日\n  - 项目页面：[ColaVLA](https:\u002F\u002Fpqh22.github.io\u002Fprojects\u002FColaVLA\u002Findex.html)\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - ColaVLA是一个统一的视觉-语言-行动框架，能够将文本中的推理迁移到潜在空间，以实现高效的轨迹规划。\n    - 引入了认知潜在推理器，用于紧凑且以决策为导向的场景理解；同时设计了层次化并行规划器，可在单次通过中生成符合因果关系的轨迹。\n    - 在nuScenes基准测试的开环和闭环设置中均取得了当前最佳性能，且效率和鲁棒性均有提升。\n\n- [KnowVal：知识增强与价值导向的自动驾驶系统](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.20299)\n  - 夏仲宇、陈文浩、王永涛、杨明轩\n  - 发表日期：2025年12月23日\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)\n  - 摘要：\n    - 提出KnowVal，一种将视觉-语言推理与全面的驾驶知识图谱及基于LLM的检索机制相结合的自动驾驶系统。\n    - 构建了人类偏好数据集和价值模型，用于指导可解释且与价值观一致的轨迹评估。\n    - 
实现了当前最佳的规划性能，包括在nuScenes上达到最低碰撞率，并在Bench2Drive上取得顶尖结果。\n\n- [WorldRFT：面向自动驾驶的强化微调潜在世界模型规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.19133)\n  - 杨鹏轩、卢奔、夏中璞、韩超、高银峰、张腾、詹坤、郎先鹏、郑宇鹏、张奇超\n  - 发表日期：2025年12月22日\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - WorldRFT是一个面向规划的潜在世界模型框架，通过层次分解和局部感知的交互式精炼，将场景表征学习与规划对齐，并辅以强化学习微调（RFT）。\n    - 引入了群体相对策略优化（GRPO），结合轨迹高斯化和碰撞感知奖励，以微调驾驶策略，提升安全性。\n    - 实现了当前最佳性能，在nuScenes上的碰撞率降低了83%，并在NavSim上实现了与基于LiDAR的SOTA方法相当的纯相机表现。\n\n- [InDRiVE：基于潜在分歧的无奖励世界模型预训练用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.18850)\n  - 费扎·汗·汗扎达、权载赫\n  - 出版单位：密歇根大学迪尔伯恩分校\n  - 发表日期：2025年12月21日\n  - 任务：规划\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - InDRiVE是基于DreamerV3风格的模型强化学习智能体，专为自动驾驶设计，仅利用来自潜在集成分歧的内在动机进行无奖励预训练。\n    - 该框架以分歧作为认识论不确定性的代理来驱动探索，并直接从学习到的世界模型中学习无需规划器的探索策略。\n    - 经过内在动机预训练后，该智能体在CARLA的各种城镇和交通密度下，展现出强大的零样本鲁棒性和针对车道保持、避撞等下游任务的稳健少样本适应能力。\n\n- [LLaViDA：用于显式推理与增强轨迹规划的大语言视觉驾驶助手](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.18211)\n  - 刘宇东、斯宾塞·哈利伯顿、金智佑、林悦谦、李一鸣、王钦思、叶辉、孙静伟、米罗斯拉夫·帕伊奇、陈怡然、李海\n  - 发表日期：2025年12月20日\n  - 代码：[LLaViDA](https:\u002F\u002Fgithub.com\u002F)\n  - 任务：规划\n  - 数据集：[NuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - LLaViDA是一款大型语言视觉驾驶助手，利用视觉-语言模型（VLM）进行目标运动预测、语义接地以及链式思维推理，从而支持轨迹规划。\n    - 采用监督微调与轨迹偏好优化（TPO）相结合的两阶段训练流程，以增强场景理解和规划能力。\n    - 在NuScenes基准测试的开环轨迹规划任务中取得了当前最佳性能，平均L2误差为0.31米，碰撞率为0.10%。\n\n- [Corner Case中的驾驶：端到端自动驾驶的真实世界对抗性闭环评估平台](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.16055)\n  - 耿嘉恒、杜家桐、张鑫宇、李烨、王攀渠、黄延军\n  - 发表日期：2025年12月18日\n  - 任务：评估\n  - 摘要：\n    - 提出了一种端到端自动驾驶的闭环评估平台，该平台通过在真实场景中生成对抗性交互，以创造安全关键的Corner Case场景。\n    - 平台使用基于流匹配的真实图像生成器和对抗性交通策略，高效地模拟挑战性交互，并评估UniAD、VAD等模型。\n\n- [OccSTeP: 4D占用时空持久性基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.15621)\n  - 郑宇、胡杰、杨凯伦、张嘉明\n  - 发表日期：2025年12月17日\n  - 代码：[OccSTeP](https:\u002F\u002Fgithub.com\u002FFaterYU\u002FOccSTeP)\n  - 任务：评估\n  - 摘要：\n    - 引入了4D占用时空持久性（OccSTeP）的概念，解决了反应式预测（“接下来会发生什么”）和主动式预测（“如果采取特定未来行动会怎样”）的问题。\n    - 提出了OccSTeP-WM，一种无分词器的世界模型，具有线性复杂度的注意力主干网络和循环状态空间模块，能够在历史输入缺失或噪声较多的情况下实现稳健的在线推理。\n    - 建立了一个包含挑战性场景的新OccSTeP基准测试，并报告了语义mIoU（+6.56%）和占用IoU（+9.26%）的显著提升。\n\n- [OmniDrive-R1: 基于强化学习的交织多模态思维链，用于可信的视觉-语言自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.14044)\n  - 张振国、郑浩然、王一申、徐乐、邓天辰、陈雪峰、陈渠、张博、黄武雄\n  - 发表日期：2025年12月16日\n  - 任务：端到端\n  - 数据集：[DriveLMM-o1](https:\u002F\u002Fgithub.com\u002Fayesha-ishaq\u002FDriveLMM-o1)\n  - 摘要：\n    - OmniDrive-R1是一个用于自动驾驶的端到端视觉-语言模型框架，引入了交织多模态思维链（iMCoT）机制来统一感知与推理。\n    - 其核心创新在于基于强化学习的视觉定位能力，通过两阶段强化学习管道和Clip-GRPO算法实现，该算法使用无标注、基于过程的定位奖励。\n    - 相比Qwen2.5VL-7B基线模型，该模型在DriveLMM-o1数据集上显著提升了推理能力，总体得分从51.77%提高到80.35%，最终答案准确率从37.81%提高到73.62%。\n\n- [MindDrive: 基于在线强化学习的视觉-语言-动作模型，用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.13636)\n  - 傅浩宇、张典坤、赵宗创、崔建峰、谢宏伟、王兵、陈光、梁定康、白翔\n  - 发表日期：2025年12月15日\n  - 项目页面：[MindDrive](https:\u002F\u002Fxiaomi-mlab.github.io\u002FMindDrive\u002F)\n  - 任务：规划\n  - 数据集：[Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)\n  - 摘要：\n    - MindDrive是一个用于自动驾驶的视觉-语言-动作（VLA）框架，采用在线强化学习来解决模仿学习中分布偏移等局限性问题。\n    - 它配备了一个带有两组LoRA参数的LLM：一个决策专家用于推理，一个动作专家用于将决策映射为轨迹，从而实现基于离散语言决策的试错式学习。\n    - 在Bench2Drive基准测试中，使用轻量级Qwen-0.5B LLM，MindDrive获得了78.04的驾驶分数和55.09%的成功率。\n\n- [DrivePI: 具有空间感知能力的4D多模态大语言模型，用于统一的自动驾驶理解、感知、预测和规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.12799)\n  - 刘哲、黄润辉、杨睿、严思明、王子宁、侯璐、林迪、白翔、赵恒爽\n  - 发表日期：2025年12月14日\n  - 
代码：[DrivePI](https:\u002F\u002Fgithub.com\u002Fhappinesslz\u002FDrivePI)\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - DrivePI是一个具有空间感知能力的4D多模态大语言模型，作为统一的视觉-语言-动作（VLA）框架用于自动驾驶，通过端到端优化联合完成空间理解、3D感知、预测和规划。\n    - 该方法整合了点云、多视角图像和语言指令，并利用数据引擎生成文本-占用和文本-流问答对，以实现4D空间理解。\n    - 仅使用0.5B Qwen2.5作为骨干网络，DrivePI作为一个单一模型，在nuScenes和OpenOcc基准测试中的关键指标上均达到或超越现有的VLA及专用VA模型。\n\n- [从人类意图到行为预测：基于意图驱动的端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.12302)\n  - 郑欢、周宇成、闫天毅、苏佳怡、陈洪军、陈杜冰、桂兴泰、韩文成、陶润州、邱中英、杨建飞、沈建兵\n  - 发表日期：2025年12月13日\n  - 任务：端到端\n  - 数据集：[Intention-Drive](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.12302)\n  - 摘要：\n    - 介绍了Intention-Drive，这是一个全面的基准测试和数据集，用于基于意图驱动的端到端自动驾驶，旨在填补对高层次人类意图解读的空白。\n    - 提出了想象未来对齐（IFA）这一新颖的评估协议，利用生成式世界模型来评估语义目标的达成情况，而不仅仅是几何精度。\n    - 探讨了两种解决方案范式——端到端视觉-语言规划器和基于代理的层次化框架——它们相比现有模型表现出更优的人类意图对齐效果。\n\n- [FutureX: 通过潜在思维链世界模型增强端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.11226)\n  - 林洪斌、杨一鸣、张逸凡、郑超达、冯杰、王盛、王振楠、陈世嘉、王博阳、张宇、刘贤明、崔书广、李震\n  - 发表日期：2025年12月12日\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - FutureX是一个基于思维链的流程，通过未来场景的潜在推理和轨迹优化来增强端到端规划器。\n    - 引入了自动思考开关，用于在简单场景下选择即时模式，而在复杂推理时则切换到基于潜在世界模型的思考模式。\n    - 通过生成更合理的计划并减少碰撞，FutureX在NAVSIM上的TransFuser模型中实现了6.2 PDMS的显著提升。\n\n- [迈向高效且有效的多摄像头编码，用于端到端驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.10947)\n  - 杨嘉伟、陈子宇、游雨蓉、王燕、李一鸣、陈宇晓、李博义、伊万诺维奇、帕沃内、王悦\n  - 出版单位：斯坦福大学、NVIDIA\n  - 发表日期：2025年12月11日\n  - 项目页面：[迈向高效且有效的多摄像头编码，用于端到端驾驶](https:\u002F\u002Fjiawei-yang.github.io\u002FFlex\u002F)\n  - 任务：端到端\n  - 摘要：\n    - 提出了Flex，一种高效且有效的几何无关场景编码器，用于端到端自动驾驶中的多摄像头数据。它使用少量可学习的场景令牌，联合编码跨摄像头和时间步的信息。\n    - 在大规模专有数据集中，相比最先进的方法，Flex实现了2.2倍的推理吞吐量和更好的驾驶性能，从而挑战了显式3D归纳偏置（如BEV或占用）的必要性。\n    - 实验证明，这些紧凑的场景令牌无需显式监督即可自发地发展出场景分解能力，为未来的系统提供了一条更具扩展性和数据驱动性的路径。\n\n- [SpaceDrive：将空间感知注入基于VLM的自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.10719)\n  - 李培正、张正浩、大卫·霍尔茨、于航、杨宇彤、赖宇志、宋锐、安德烈亚斯·盖格、安德烈亚斯·泽尔\n  - 出版单位：蒂宾根大学\n  - 发表日期：2025年12月11日\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - SpaceDrive是一种具有空间感知能力的基于VLM的驾驶框架，它将空间信息作为显式的位置编码，而非文本数字标记，以实现语义与空间的联合推理。\n    - 该框架采用通用的位置编码器，用于处理来自多视角深度图、历史自车状态以及文本提示的3D坐标，从而增强视觉特征并直接进行轨迹坐标回归。\n    - 在nuScenes数据集上实现了开环性能的最先进水平，并在Bench2Drive闭环基准测试中，基于VLM的方法中取得了78.02分的驾驶得分。\n\n- [面向端到端驾驶的潜在思维链世界建模](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.10226)\n  - 谭书涵、卡夏普·奇塔、陈宇晓、田然、游雨蓉、王妍、罗文杰、曹宇龙、菲利普·克拉亨布尔、马可·帕沃内、鲍里斯·伊万诺维奇\n  - 发表日期：2025年12月11日\n  - 任务：端到端\n  - 摘要：\n    - Latent-CoT-Drive (LCDrive)：一种模型，它将思维链（CoT）推理以潜在语言的形式表达出来，捕捉驾驶行为可能产生的结果，在与动作对齐的潜在空间中统一推理与决策。\n    - 该模型通过交错使用动作提议标记和世界模型标记来进行推理，这些标记基于学习到的潜在世界模型来表达未来的结果。\n    - 该方法首先利用真实未来的模拟轨迹进行监督训练，随后通过闭环强化学习进行后训练，从而在大规模端到端驾驶基准测试中实现了更快的推理速度和更优的轨迹质量。\n\n- [UniUGP：统一理解、生成与规划的端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.09864)\n  - 卢浩、刘子洋、蒋广峰、罗元飞、陈胜、张阳刚、陈英聪\n  - 发表日期：2025年12月10日\n  - 项目页面：[UniUGP](https:\u002F\u002Fseed-uniugp.github.io\u002F)\n  - 任务：端到端\n  - 摘要：\n    - 提出UniUGP，一个统一的理解—生成—规划框架，通过混合专家架构协同场景推理、未来视频生成和轨迹规划。\n    - 引入跨多个自动驾驶数据集的四阶段训练策略，逐步构建能力，展现出最先进的性能及对长尾场景的优越泛化能力。\n\n- [COVLM-RL：基于VLM引导的强化学习的自动驾驶关键对象导向推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.09349)\n  - 李琳、蔡宇欣、方建武、薛建儒、吕晨\n  - 发表日期：2025年12月10日\n  - 任务：端到端\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - COVLM-RL是一种新颖的端到端驾驶框架，将关键对象导向推理与VLM引导的强化学习相结合，以解决泛化、效率和可解释性问题。\n    - 引入了针对VLM的思维链提示策略，用于推理关键交通要素，生成语义决策先验，从而降低输入维度并将任务知识注入强化学习中。\n    - 提出了一个一致性损失函数，以使VLM的语义计划与强化学习智能体的连续控制输出保持一致，进而提高可解释性和训练稳定性。\n\n- 
[Astra：具有自回归去噪功能的通用交互式世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.08931)\n  - 朱一轩、冯佳琪、郑文钊、高源、陶鑫、万鹏飞、周杰、陆继文\n  - 发表日期：2025年12月9日\n  - 任务：生成\n  - 摘要：\n    - Astra是一个交互式的通用世界模型，能够为多种场景（如自动驾驶、机器人抓取等）生成精确的动作交互下的现实未来。\n    - 引入了自回归去噪架构，结合时间因果注意力机制和噪声增强的历史记忆，以在响应速度与时间连贯性之间取得平衡。\n    - 提出了动作感知适配器和动作专家混合模型，以实现精确控制，并支持探索、操作和相机控制等多种任务的多功能性。\n\n- [用于自动驾驶系统设计空间探索的多智能体LLM框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.08476)\n  - 施博安、王绍华、李永哲、涂嘉恒、张志翰\n  - 发表日期：2025年12月9日\n  - 任务：规划\n  - 摘要：\n    - 这是一个基于多智能体大型语言模型（LLM）的自动驾驶系统设计空间探索（DSE）框架，整合了多模态推理、3D仿真和性能分析工具。\n    - 自动化地解析执行结果，并利用专门的LLM智能体指导系统设计探索，包括用户输入、设计生成、执行编排和结果分析。\n    - 在一个无人出租车案例研究中评估后，该框架在相同的探索预算下，相比遗传算法基线，识别出了更多帕累托最优且成本效益更高的解决方案，同时减少了导航时间。\n\n- [基于空间检索增强的自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.06865)\n  - 贾晓松、张晨赫、江宇乐、王颂柏、张志远、陈晨、张绍峰、周玄鹤、杨雪、严俊驰、姜宇刚\n  - 发表日期：2025年12月7日\n  - 任务：端到端\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 提出了一种用于自动驾驶的空间检索范式，利用离线检索到的地理图像（例如来自谷歌地图）作为额外的输入，以在能见度较差的情况下增强感知能力。\n    - 扩展了nuScenes数据集，加入了对齐的地理图像，并在检测、地图构建、占用预测、规划和世界建模这五项核心任务上建立了基准。\n    - 展示了该方法即插即用的特性，表明在无需额外传感器的情况下，某些任务的性能得到了提升。\n\n- [WAM-Diff：一种结合MoE与在线强化学习的掩码扩散VLA框架，用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.11872)\n  - 作者：徐明旺、崔家豪、蔡飞鹏、商翰林、朱志浩、栾珊、许怡芳、张能、李耀义、蔡佳、朱思宇\n  - 出版单位：复旦大学\n  - 发表日期：2025年12月6日\n  - 项目页面：[WAM-Diff](https:\u002F\u002Fgithub.com\u002Ffudan-generative-vision\u002FWAM-Diff)\n  - 代码：[WAM-Diff](https:\u002F\u002Fgithub.com\u002Ffudan-generative-vision\u002FWAM-Diff)\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - WAM-Diff是一种视觉-语言-行动（VLA）框架，采用掩码扩散技术迭代优化离散序列，以生成未来的本车轨迹。\n    - 其核心创新包括：将掩码扩散应用于灵活解码、通过基于运动预测和VQA训练的稀疏专家混合模型（MoE）实现可扩展容量，以及利用群体序列策略优化（GSPO）进行在线强化学习。\n    - 在NAVSIM基准测试中取得最先进性能（v1版本PDMS为91.0，v2版本EPDMS为89.7），表明掩码扩散有望成为自回归及连续扩散策略的替代方案。\n\n- [AI生成的驾驶视频是否已准备好用于自动驾驶？一种诊断评估框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.06376)\n  - 作者：项新浩、阿比吉特·拉斯托吉、张嘉伟\n  - 发表日期：2025年12月6日\n  - 任务：生成\n  - 数据集：[ADGV-Bench](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.06376)\n  - 摘要：\n    - 提出了一种诊断框架，用于评估AI生成的驾驶视频（AIGVs）在自动驾驶模型训练与评估中的可靠性。\n    - 引入了ADGV-Bench基准测试，该测试包含人工标注和密集标签，适用于感知任务；同时提出了ADGVE——一种面向驾驶场景的视频质量评估器。\n    - 结果表明，使用所提出的评估器对原始AIGVs进行筛选后，视频质量指标及下游自动驾驶模型性能均有所提升，使AIGVs成为有益的数据补充。\n\n- [WAM-Flow：基于离散流匹配的并行粗细结合式运动规划，用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.06112)\n  - 作者：许怡芳、崔家豪、蔡飞鹏、朱志浩、商翰林、栾珊、徐明旺、张能、李耀义、蔡佳、朱思宇\n  - 出版单位：复旦大学、银网\n  - 发表日期：2025年12月5日\n  - 代码：[WAM-Flow](https:\u002F\u002Fgithub.com\u002Ffudan-generative-vision\u002FWAM-Flow)\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)、[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - 摘要：\n    - 本文首次提出用于自动驾驶的离散流匹配架构，实现了低延迟、并行且粗细结合式的运动规划。\n\n- [BeLLA：端到端鸟瞰视角大型语言助手，用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.06096)\n  - 作者：卡尔蒂克·莫汉、索南·辛格、阿米特·阿尔温德·卡莱\n  - 发表日期：2025年12月5日\n  - 任务：VQA\n  - 数据集：[NuScenes-QA](https:\u002F\u002Fwww.nuscenes.org\u002F)、[DriveLM](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM)\n  - 摘要：\n    - BeLLA是一种端到端架构，将统一的360°鸟瞰视图表示与大型语言模型连接起来，用于自动驾驶中的问答任务。\n    - 在需要空间推理的任务上表现优于现有方法，例如相对物体定位和行为理解，绝对提升可达9.3%。\n    - 经过NuScenes-QA和DriveLM基准测试，证明其在各类问题上均具有竞争力。\n\n- [从片段到场景：基于视觉-语言模型的自动驾驶时间理解](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.05277)\n  - 作者：凯文·坎农斯、赛义德·兰杰巴尔·阿尔瓦尔、穆罕默德·阿西富勒·侯赛因、艾哈迈德·雷扎伊、莫赫森·戈拉米、阿里雷扎·海达里哈泽伊、周卫民、张勇、穆罕默德·阿克巴里\n  - 发表日期：2025年12月4日\n  - 项目页面：[TAD](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fvbdai\u002FTAD)\n  - 
代码：[TAD](https:\u002F\u002Fgithub.com\u002Fvbdi\u002Ftad_bench)\n  - 任务：推理\n  - 数据集：[TAD](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fvbdai\u002FTAD)\n  - 摘要：\n    - 引入了自动驾驶时间理解（TAD）基准测试，包含近6,000个跨7个任务的问答对，用于评估VLM在以本车为中心的驾驶视频上的表现。\n    - 对9种通用及专业模型进行了基准测试，发现由于对细粒度运动理解不足，准确率普遍偏低。\n    - 提出了两种无需训练的解决方案——Scene-CoT和TCogMap——与现有VLM结合使用时，可将TAD的平均准确率提高多达17.72%。\n\n- [E3AD：一种情感感知的视觉-语言-行动模型，用于以人为本的端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.04733)\n  - 作者：唐一鸿、廖海成、聂彤、何俊林、曲傲、陈科华、马伟、李振宁、孙利军、徐承忠\n  - 发表日期：2025年12月4日\n  - 任务：端到端\n  - 摘要：\n    - 提出E3AD，这是一种情感感知的视觉-语言-行动（VLA）框架，适用于开放域端到端（OD-E2E）自动驾驶。该框架能够理解自由形式的指令、推断乘客情绪并规划行驶轨迹。\n    - 引入了连续的效价-唤醒-支配（VAD）情感模型及双路径空间推理模块，以增强语义理解和类人空间认知能力。\n    - 采用一致性导向的训练方案，结合模态预训练与偏好对齐，从而在情感估计、视觉接地及规划方面达到最先进水平。\n\n- [dVLM-AD：通过可控推理提升用于驾驶的扩散型视觉-语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.04459)\n  - 作者：马英姿、曹宇龙、丁文浩、张水白、王燕、鲍里斯·伊万诺维奇、蒋明、马可·帕沃内、肖超伟\n  - 发表日期：2025年12月4日\n  - 任务：端到端\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[WOD-E2E](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 摘要：\n    - dVLM-AD是一种基于扩散的视觉-语言模型，能够统一感知、结构化推理和底层规划，实现端到端驾驶。\n    - 提出了一个使用双向注意力机制的离散扩散框架，具备可控性和可靠性，有效解决了自回归VLM在一致性和可控性方面的问题。\n    - 在长尾场景下显著提升了行为与轨迹的一致性及规划性能，超越了基于自回归的基线模型。\n\n- [MindDrive：一个整合世界模型与视觉语言模型的一体化框架，用于端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.04441)\n  - 孙斌、曹耀光、王岩、王睿、商嘉辰、冯解杰、陆佳怡、施佳、杨世春、闫晓宇、宋子 Ying\n  - 发表日期：2025年12月4日\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - MindDrive 是一个协调一致的框架，将高质量轨迹生成与全面的决策推理相结合，用于端到端自动驾驶。\n    - 提出了一种“情境模拟—候选生成—多目标权衡”的结构化推理范式，包含未来感知轨迹生成器（FaTG）和面向 VLM 的评估器（VLoE）。\n    - 在 NAVSIM 基准测试中达到最先进水平，通过理性且符合人类价值观的决策，提升了安全性、合规性和泛化能力。\n\n- [先思后行：受世界模型启发的自动驾驶多模态定位框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.03454)\n  - 廖海成、沈焕明、王博楠、李永康、唐一鸿、王承悦、庄丁毅、陈克华、杨海、徐成忠、李振宁\n  - 发表日期：2025年12月3日\n  - 任务：感知\n  - 数据集：[Talk2Car](https:\u002F\u002Fgithub.com\u002Ftalk2car\u002FTalk2Car)\n  - 摘要：\n    - 提出 ThinkDeeper 框架，用于自动驾驶中的视觉定位。该框架利用空间感知世界模型（SA-WM）对未来空间状态进行推理，以消除上下文相关指令的歧义。\n    - 引入 DrivePilot 多源视觉定位数据集，专为自动驾驶设计，并通过 RAG 和思维链提示的 LLM 流程生成语义标注。\n    - 展示了最先进的性能，在 Talk2Car 排行榜上排名第一，并在 DrivePilot、MoCAD 和 RefCOCO 等基准测试中超越基线，在复杂场景中表现出强大的鲁棒性。\n\n- [突破特权限制：将密集奖励知识蒸馏到稀疏奖励策略中](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.04279)\n  - 费扎·汗·汗扎达、权载赫\n  - 出版单位：密歇根大学迪尔伯恩分校\n  - 发表日期：2025年12月3日\n  - 任务：规划\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - 提出基于奖励特权的世界模型蒸馏方法，这是一种两阶段框架：教师智能体使用密集的特权奖励进行训练，其潜在动态被蒸馏到仅使用稀疏任务奖励训练的学生身上。\n    - 证明在 CARLA 的车道保持和超车基准测试中，稀疏奖励的学生表现优于使用密集奖励的教师以及从零开始训练的基线模型，从而提高了对未见路线的泛化能力。\n\n- [U4D：基于 LiDAR 序列的不确定性感知四维世界建模](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.02982)\n  - 徐翔、梁傲、刘友泉、李林峰、孔令东、刘子威、刘青山\n  - 出版单位：南洋理工大学 S-Lab\n  - 发表日期：2025年12月2日\n  - 任务：感知\n  - 数据集：[Waymo 开放数据集](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 摘要：\n    - U4D 是一种不确定性感知的四维 LiDAR 世界建模框架，能够估计空间不确定性，从而精确定位语义上具有挑战性的区域。\n    - 采用“由难到易”的两阶段流程进行生成：不确定性区域建模和基于不确定性的补全。\n    - 集成了时空混合（MoST）模块以确保时间一致性，生成几何逼真且时间连贯的 LiDAR 序列。\n\n- [Lumos：让语言模型系统认证成为可能](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.02966)\n  - 伊莎·乔杜里、韦丹特·贾因、阿瓦尔乔特·辛格、卡维娅·萨奇德瓦、萨扬·拉努、加甘迪普·辛格\n  - 发表日期：2025年12月2日\n  - 任务：推理\n  - 摘要：\n    - 提出 Lumos，这是首个用于指定并正式认证语言模型系统（LMS）行为的原则性框架，采用基于图的命令式概率编程 DSL。\n    - 通过为自动驾驶中的视觉语言模型（VLM）指定首个安全规范来演示 Lumos，揭示了一款最先进 VLM 中的关键安全缺陷。\n    - 提供了一个模块化且可扩展的语言框架，能够编码复杂的关联性和时序规范，使 LMS 认证能够适应不断变化的安全威胁。\n\n- [VLM 作为战略家：通过引导扩散自适应生成安全关键测试场景](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.02844)\n  - 吴新正、陈俊毅、钟乃婷、沈勇\n  - 发表日期：2025年12月2日\n  - 任务：生成\n  - 摘要：\n    - 
提出一种结合视觉语言模型（VLM）与自适应引导扩散模型的安全关键测试场景生成框架，用于自动驾驶系统。\n    - 构建了一个三层分层架构：VLM 战略层用于确定目标，战术层用于制定引导方案，操作层则负责执行引导扩散。\n    - 引入自适应引导扩散方法，可在闭环仿真中实时精确控制背景车辆，从而生成真实、多样且交互性强的安全关键场景。\n\n- [OpenREAD：基于 LLM 作为批评家的强化端到端自动驾驶开放式推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.01830)\n  - 张松燕、黄文辉、陈展、柯利斯特·朱亚浩、黄启航、吕晨\n  - 发表日期：2025年12月1日\n  - 任务：端到端\n  - 摘要：\n    - OpenREAD 是一个基于视觉语言模型（VLM）的开放式推理增强型自动驾驶（AD）框架，支持从高层次推理到低层次轨迹规划的端到端强化微调（RFT）。\n    - 提议在 RFT 中使用大型语言模型（LLM）作为批评者，以量化开放式问题的推理质量，从而解决场景理解的奖励建模难题。\n    - 构建了驾驶知识数据集上的大规模思维链（CoT）标注，并表明联合端到端 RFT 可显著提升上游和下游任务的表现，在推理和规划基准测试中达到最先进水平。\n\n- [RoboDriveVLM：面向自动驾驶的鲁棒视觉-语言模型的新基准与基线](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.01300)\n  - 廖大成、齐梦实、舒鹏、张志宁、林宇欣、刘亮、马华东\n  - 发表日期：2025年12月1日\n  - 任务：评估\n  - 数据集：[RoboDriveBench](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.01300)\n  - 摘要：\n    - 介绍RoboDriveBench，这是首个基于VLM的端到端自动驾驶鲁棒性基准，评估了包含传感器和提示扰动的11种仿真场景。\n    - 提出RoboDriveVLM，一种新颖的基于VLM的框架，通过将多模态数据（如激光雷达、雷达）映射到统一的潜在空间来提升鲁棒性。\n    - 引入基于跨模态知识蒸馏的测试时适应方法，以提高系统在实际部署中的鲁棒性。\n\n- [CoT4AD：具有显式思维链推理的视觉-语言-行动模型用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.22532)\n  - 王兆辉、于腾博、唐浩\n  - 发表日期：2025年11月27日\n  - 任务：推理\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - CoT4AD，一种新颖的视觉-语言-行动（VLA）框架，引入显式思维链（CoT）推理，以增强自动驾驶中的数值和因果推理能力。\n    - 在训练过程中整合感知-问题-预测-行动的CoT，以对齐推理与动作空间；在推理阶段执行隐式CoT推理，以实现稳健的决策。\n    - 在nuScenes和Bench2Drive等基准上的开环和闭环评估中，展现出最先进的性能。\n\n- [RoadSceneBench：面向中层道路场景理解的轻量级基准](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.22466)\n  - 刘希言、王涵、王宇虎、蔡俊杰、曹哲、杨建忠、陆震\n  - 发表日期：2025年11月27日\n  - 项目页面：[RoadSceneBench](https:\u002F\u002Fgithub.com\u002FXiyanLiu\u002FRoadSceneBench)\n  - 代码：[RoadSceneBench](https:\u002F\u002Fgithub.com\u002FXiyanLiu\u002FRoadSceneBench)\n  - 任务：评估\n  - 数据集：[RoadSceneBench](https:\u002F\u002Fgithub.com\u002FXiyanLiu\u002FRoadSceneBench)\n  - 摘要：\n    - RoadSceneBench，一个用于评估中层道路语义视觉推理的轻量级基准，专注于关系理解和结构一致性。\n    - 提出具有时间一致性的层次化关系奖励传播（HRRP-T），这是一种用于VLM的训练框架，旨在增强推理中的空间连贯性和语义对齐。\n\n- [OpenTwinMap：面向城市自动驾驶的开源数字孪生生成器](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.21925)\n  - 亚历克斯·理查德森、乔纳森·斯普林克尔\n  - 出版单位：亚利桑那大学\n  - 发表日期：2025年11月26日\n  - 任务：生成\n  - 数据集：[OpenStreetMap](https:\u002F\u002Fwww.openstreetmap.org)\n  - 摘要：\n    - OpenTwinMap，一个基于Python的开源框架，可从LiDAR扫描和OpenStreetMap数据生成高保真度的3D城市数字孪生。\n    - 该框架强调可扩展性和并行化，能够生成语义分割的静态环境资产，并导出至Unreal Engine用于自动驾驶车辆仿真。\n    - 其目标是为研究人员提供一条灵活且可扩展的流程，降低技术门槛，目前功能包括OSM\u002FLiDAR预处理、道路网格\u002F地形生成以及初步的CARLA集成。\n\n- [LaGen：迈向自回归式的LiDAR场景生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.21256)\n  - 周思卓、贾晓松、张帆睿、李俊杰、张居勇、冯宇康、孙建文、王宋布尔、尤俊奇、严俊驰\n  - 发表日期：2025年11月26日\n  - 任务：生成\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - LaGen，首个从单帧LiDAR输入开始逐帧自回归生成长时程LiDAR场景的框架。\n    - 引入场景解耦估计模块，以增强交互式的对象级生成；同时引入噪声调制模块，以缓解长时程误差累积。\n    - 构建了针对nuScenes的评估协议，证明其性能优于当前最先进的LiDAR生成和预测模型，尤其是在后续帧上表现更佳。\n\n- [4DWorldBench：面向3D\u002F4D世界生成模型的综合评估框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.19836)\n  - 陆一婷、罗伟、涂佩妍、李浩然、朱瀚鑫、俞子豪、王星睿、陈心怡、彭新格、李欣、陈志博\n  - 出版单位：中国科学技术大学\n  - 发表日期：2025年11月25日\n  - 项目页面：[4DWorldBench](https:\u002F\u002Fyeppp27.github.io\u002F4DWorldBench.github.io\u002F)\n  - 任务：推理\n  - 数据集：[4DWorldBench](https:\u002F\u002Fyeppp27.github.io\u002F4DWorldBench.github.io\u002F)\n  - 摘要：\n    - 介绍4DWorldBench，这是一个统一的基准，用于从四个关键维度评估3D\u002F4D世界生成模型：感知质量、条件-4D对齐、物理真实性和4D一致性。\n    - 提出一种自适应评估框架，将多模态条件映射到统一的文本空间，并结合LLM作为评判者、MLLM作为评判者以及基于网络的方法进行综合评估。\n\n- [AD-R1：采用公正世界模型的端到端自动驾驶闭环强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.20325)\n  - 
闫天义、唐涛、桂兴泰、李永康、郑嘉森、黄伟尧、孔令东、韩文成、周霞、张雪阳、詹一飞、詹坤、徐成中、沈建兵\n  - 发表日期：2025年11月25日\n  - 任务：端到端\n  - 摘要：\n    - 介绍一种利用公正世界模型对自动驾驶策略进行后训练优化的框架，旨在克服RL中世界模型的乐观偏差。\n    - 提出一种新颖的反事实合成数据管道，用于生成一系列可能的碰撞和离路事件，从而教会模型诚实地预测危险。\n    - 实验证明，该框架以公正世界模型作为内部批评家，在仿真环境中显著减少了安全违规行为，并在新的风险预见基准测试中优于基线。\n\n- [从风险中学习：结合先验知识的LLM引导安全关键场景生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.20726)\n  - 王宇航、黄赫业、徐振华、孙凯来、郭宝申、赵金花\n  - 发表日期：2025年11月25日\n  - 任务：生成\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)、[SMARTS](https:\u002F\u002Fgithub.com\u002Fhuawei-noah\u002FSMARTS)\n  - 摘要：\n    - 一种高保真场景生成框架，将条件变分自编码器（CVAE）与大型语言模型（LLM）相结合，用于自动驾驶安全性验证。\n    - CVAE学习潜在的交通结构以生成物理上一致的基础场景，而LLM则作为对抗性推理引擎，在不同风险水平下指导场景生成。\n    - 该框架增加了对高风险和长尾事件的覆盖范围，提高了与真实世界交通的一致性，并使自动驾驶系统面临比基于规则或数据驱动方法更具挑战性的交互。\n\n- [DeeAD：视觉-语言-行动的动态提前退出机制，用于高效自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.20720)\n  - 胡海波、黄连明、关楠、薛春Jason\n  - 发表日期：2025年11月25日\n  - 任务：规划\n  - 数据集：[Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)\n  - 摘要：\n    - DeeAD是一种无需训练、由动作引导的提前退出框架，通过将中间轨迹的物理可行性与轻量级规划先验进行评估，从而加速VLA规划。\n    - 引入多跳控制器，根据得分变化率自适应跳过冗余的Transformer层，实现高达28%的层稀疏化和29%的延迟降低。\n    - 可无缝集成到现有VLA模型中（如ORION），无需重新训练，在Bench2Drive基准测试上保持规划质量和安全性。\n\n- [CoC-VLA：基于因果链的视觉-语言-行动模型，深入探索可解释自动驾驶中的对抗域迁移](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.19914)\n  - 张大鹏、沈飞、赵睿、陈寅达、智鹏、李晨阳、周锐、周庆国\n  - 发表日期：2025年11月25日\n  - 任务：端到端\n  - 摘要：\n    - 提出CoC-VLA，一种由VLM引导的端到端对抗域迁移框架，用于将长尾场景处理能力从仿真环境迁移到实际部署中。\n    - 该框架由教师VLM、学生VLM和判别器组成，利用共享的因果链视觉-语言模型（CoC VLM）基础架构进行思维链式推理。\n    - 引入了一种新颖的对抗训练策略，通过判别器促进能力从仿真环境向真实世界的迁移。\n\n- [Reasoning-VLA：一种快速且通用的视觉-语言-行动推理模型，用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.19912)\n  - 张大鹏、袁振龙、陈章权、廖志廷、陈寅达、沈飞、周庆国、蔡增森\n  - 发表日期：2025年11月25日\n  - 任务：端到端\n  - 数据集：[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)、[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[KITTI](http:\u002F\u002Fwww.cvlibs.net\u002Fdatasets\u002Fkitti\u002F)、[Argoverse](https:\u002F\u002Fwww.argoverse.org\u002F)、[BDD100K](https:\u002F\u002Fopendatalab.org.cn\u002FOpenDataLab\u002FBDD100K)、[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - 一种通用且快速的视觉-语言-行动（VLA）框架，用于自动驾驶。该框架使用可学习的动作查询与增强推理能力的特征交互，以并行方式生成连续轨迹。\n    - 将八个公开的自动驾驶数据集整合为标准化的、基于思维链推理的格式进行训练。\n    - 通过监督学习和强化学习微调，实现了最先进的性能、卓越的泛化能力和出色的推理速度。\n\n- [Map-World：面向自动驾驶的掩码动作规划与路径积分世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.20156)\n  - 胡斌、陆子健、廖海成、袁承然、饶斌、李永康、李国发、崔志勇、许成忠、李振宁\n  - 出版单位：华中科技大学、小米电动汽车\n  - 发表日期：2025年11月25日\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - MAP-World是一种无先验的多模态规划框架，将掩码动作规划与基于路径权重的世界模型相结合，用于自动驾驶。\n    - 掩码动作规划（MAP）模块将未来的本车运动视为掩码序列补全问题，无需锚点库或教师策略即可生成多样且时间一致的轨迹。\n    - 轻量级世界模型会为每个候选轨迹滚动预测未来的BEV语义，训练时以轨迹概率作为路径权重，从所有可能的未来分布中学习，从而在NAVSIM上达到最先进水平。\n\n- [Percept-WAM：感知增强型世界感知-行动模型，用于鲁棒的端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.19221)\n  - 韩建华、田萌、朱江通、何凡、张慧欣、郭思彤、朱德昌、唐浩、徐培、郭宇泽、牛民哲、朱浩杰、董启超、闫学超、董思源、侯璐、黄清秋、贾晓松、徐航\n  - 发表日期：2025年11月24日\n  - 任务：端到端\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - Percept-WAM是一种感知增强型世界感知-行动模型，将2D\u002F3D场景理解隐式地整合到单一的视觉-语言模型（VLM）中，以实现鲁棒的端到端自动驾驶。\n    - 引入World-PV和World-BEV标记，统一2D\u002F3D感知任务；并通过IoU感知评分和并行自回归解码机制，提升在长尾及复杂场景中的稳定性。\n    - 在感知基准测试中表现出色（如COCO的51.7\u002F58.9 mAP，nuScenes BEV检测），并在nuScenes和NAVSIM上的规划性能优于DiffusionDrive等先前方法。\n\n- [前瞻思考：MLLM与世界模型中的预见智能](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.18735)\n  - 宫展涛、范辽远、郭青、徐勋、杨旭磊、李世杰\n  - 发表日期：2025年11月24日\n  - 任务：VQA\n 
 - 数据集：[FSU-QA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.18735)\n  - 摘要：\n    - 提出“预见智能”这一概念，即预测未来事件的能力，并介绍了专为评估这一能力而设计的新视觉问答数据集FSU-QA。\n    - 进行了全面研究，表明当前的视觉-语言模型在预见推理方面存在困难，并证明通过微调，FSU-QA可以有效提升这一能力。\n\n- [GuideFlow：约束引导的流匹配，用于端到端自动驾驶中的规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.18729)\n  - 刘林、贾彩燕、于冠义、宋子 Ying、李俊乔、贾飞阳、吴培亮、郝晓帅、罗艳丹\n  - 发表日期：2025年11月24日\n  - 代码：[GuideFlow](https:\u002F\u002Fgithub.com\u002Fliulin815\u002FGuideFlow)\n  - 任务：规划\n  - 数据集：[Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)、[NuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)、[ADV-NuScenes]()\n  - 摘要：\n    - GuideFlow 是一种面向端到端自动驾驶的新型规划框架，它利用约束流匹配显式建模流匹配过程，从而缓解多模态轨迹模式坍缩问题。\n    - 该框架在流匹配生成过程中直接施加显式约束，并通过基于能量的模型（EBM）统一训练，以稳健地满足物理约束。\n    - GuideFlow 在生成过程中将驾驶激进程度参数化为控制信号，从而实现对轨迹风格的精确调控，并在 NavSim 等基准测试中取得了最先进的结果。\n\n- [QuickLAP：面向自动驾驶代理的快速语言-行为偏好学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.17855)\n  - 乔丹·阿比·纳德尔、大卫·李、纳撒尼尔·登勒、安德烈娅·博布\n  - 出版方：MIT-CLEAR实验室\n  - 发表日期：2025年11月22日\n  - 代码：[QuickLAP](https:\u002F\u002Fgithub.com\u002FMIT-CLEAR-Lab\u002FQuickLAP)\n  - 任务：规划\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - QuickLAP 是一种贝叶斯框架，它融合物理反馈和语言反馈，实时推断自动驾驶代理的奖励函数。\n    - 该方法利用大型语言模型从语言中提取奖励特征注意力掩码，并通过闭式更新规则将其与物理修正相结合，实现快速且鲁棒的奖励学习。\n    - 在半自动驾驶模拟器中，与基线相比，QuickLAP 将奖励学习误差降低了 70% 以上，并因其可理解性和协作性而受到用户的青睐。\n\n- [LiSTAR：面向自动驾驶中 4D LiDAR 序列的射线中心世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.16049)\n  - 刘沛、王松涛、张朗、彭兴悦、吕元东、邓佳鑫、陆颂欣、马伟良、张雪洋、詹一飞、郎先鹏、马军\n  - 发表日期：2025年11月20日\n  - 项目页面：[LiSTAR](https:\u002F\u002Focean-luna.github.io\u002FLiSTAR.gitub.io)\n  - 任务：预测\n  - 数据集：[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen)\n  - 摘要：\n    - LiSTAR 是一种新颖的生成式世界模型，可以直接在传感器的原生几何上操作，用于合成高保真且可控的 4D LiDAR 数据。\n    - 它引入了混合圆柱-球面（HCS）表示法以保持数据保真度，并采用以射线为中心的时空注意力 Transformer（START）来确保强大的时间一致性。\n    - 提出了与 4D 点云对齐的体素布局，以及离散的 Masked Generative START（MaskSTART）框架，用于高效、高分辨率且受布局指导的组合式生成。\n\n- [您的视觉-语言模型是否已准备好用于自动驾驶安全？评估外部与车内风险的综合基准](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.14592)\n  - 孟宪辉、张宇辰、黄志坚、陆正、季子凌、尹瑶瑶、张宏远、蒋广峰、林艳丹、陈龙、叶航军、张力、刘军、郝晓帅\n  - 发表日期：2025年11月18日\n  - 任务：VQA\n  - 数据集：[DSBench](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - 介绍了 DSBench，这是首个综合性驾驶安全基准，旨在以统一的方式评估视觉-语言模型（VLM）对外部环境风险及车内驾驶行为安全的认知能力。\n    - 该基准涵盖 10 个关键类别和 28 个子类别，揭示了 VLM 在复杂的安全关键情境中性能显著下降。\n    - 构建了一个包含 9.8 万个安全相关实例的大规模数据集，表明在此数据上进行微调可显著提升 VLM 在自动驾驶中的安全性能。\n\n- [利用来自 VLM 的风险语义蒸馏增强端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.14499)\n  - 秦杰克、王志涛、郑怡楠、陈可宇、周阳、钟源欣、程思远\n  - 发表日期：2025年11月18日\n  - 任务：端到端\n  - 数据集：[Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)\n  - 摘要：\n    - 介绍了风险语义蒸馏（RSD），这是一种框架，利用视觉-语言模型（VLM）为端到端自动驾驶骨干网络提供关键物体的风险注意力，从而提高泛化能力。\n    - 提出 RiskHead 插件模块，将 VLM 推导出的因果风险估计蒸馏到鸟瞰图（BEV）特征中，生成可解释的风险注意力图，从而实现更丰富的空间和风险表征。\n    - 在 Bench2Drive 基准测试中，通过使 BEV 特征与人类般的风险感知驾驶行为相一致，显著提升了复杂动态环境下的感知和规划能力。\n\n- [利用模块化交通信号灯与标志识别增强基于 LLM 的自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.14391)\n  - 法比安·施密特、努希克·穆罕默德·卡伊兰·阿卜杜勒·纳扎尔、马库斯·恩茨韦勒、阿比纳夫·瓦拉达\n  - 出版方：斯图加特大学、埃斯林根应用科学大学\n  - 发表日期：2025年11月18日\n  - 代码：[TLS-Assist](https:\u002F\u002Fgithub.com\u002Fiis-esslingen\u002FTLS-Assist)\n  - 任务：规划\n  - 数据集：[LangAuto](https:\u002F\u002Fgithub.com\u002Fwayveai\u002FLangAuto)\n  - 摘要：\n    - 介绍了 TLS-Assist，这是一个模块化冗余层，通过明确的交通信号灯和标志识别来增强基于 LLM 的自动驾驶代理，以强制执行交通规则。\n    - 该即插即用框架将检测结果转换为结构化的自然语言消息，并注入到 LLM 的输入中，支持单目和多目相机配置。\n    - 在 CARLA 中的 LangAuto 基准测试上，TLS-Assist 相较于 LMDrive 和 
BEVDriver 分别实现了最高 14% 和 7% 的相对驾驶性能提升，同时减少了交通违规行为。\n\n- [CorrectAD：一种自纠正智能体系统，用于改进自动驾驶端到端规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.13297)\n  - 马恩辉、周利军、唐涛、张嘉欢、蒋俊鹏、张展、韩东、詹坤、张雪阳、郎先鹏、孙海洋、周霞、林迪、于凯成\n  - 发表日期：2025年11月17日\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 提出CorrectAD，一种自纠正智能体系统，通过解决罕见但安全关键的长尾故障问题来改进端到端规划。\n    - 引入PM-Agent以制定数据需求，并开发DriveSora生成模型，用于创建与3D布局一致的时空连贯视频，以进行数据仿真。\n    - 该模型无关的流程可纠正大量失败案例，在nuScenes数据集上将碰撞率降低39%，在内部数据集中降低27%。\n\n- [VLMs引导的自动驾驶可解释决策](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.13881)\n  - 胡鑫、景涛涛、田仁然、丁正明\n  - 发表日期：2025年11月17日\n  - 任务：规划\n  - 摘要：\n    - 提出一种新方法，将视觉-语言模型（VLMs）的角色从直接决策生成器转变为语义增强器，以实现更可靠和可解释的自动驾驶。\n    - 引入多模态交互架构，融合视觉和语言特征以实现准确的决策及文本解释，并采用后处理精炼模块利用VLMs提升预测可靠性。\n    - 在两个自动驾驶基准测试中表现出最先进的性能，为将VLMs集成到可靠的自动驾驶系统中提供了有前景的方向。\n\n- [基于提示的领域适应：通过上下文强化学习实现端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.12755)\n  - 阿莉莎·库拉姆、阿米尔·莫埃尼、张尚通、罗汉·钱德拉\n  - 发表日期：2025年11月16日\n  - 任务：端到端\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - 提出一种少样本提示驱动的领域适应方法，用于闭环自动驾驶，借助上下文强化学习（ICRL），无需在目标域更新模型或收集额外数据。\n    - 将提示驱动的领域适应扩展至闭环驾驶，通过推理过程中观察到的一般轨迹实现，从而超越了以往仅限于感知任务的方法。\n    - 在CARLA中的实验表明，与现有的提示驱动领域适应基线相比，ICRL能够产生更安全、更高效且更舒适的驾驶策略，尤其是在恶劣天气条件下。\n\n- [LLMs是未来之路吗？LLMs引导的强化学习在去中心化自动驾驶中的案例研究](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.12751)\n  - 提穆尔·安瓦尔、杰弗里·陈、王宇燕、罗汉·钱德拉\n  - 发表日期：2025年11月16日\n  - 任务：规划\n  - 摘要：\n    - 研究小型本地部署的LLMs（参数量小于14B）是否可以通过为强化学习提供奖励塑造来支持高速公路自动驾驶，而非直接控制。\n    - 呈现一个案例研究，比较了纯强化学习、纯LLMs以及混合LLMs-强化学习方法在复杂场景下的去中心化自动驾驶表现，如密集的高速公路和汇流处。\n    - 结果显示，混合方法介于纯强化学习（中等成功）和纯LLMs（高成功率但效率较低）之间，其中由LLMs影响的方法表现出系统的保守倾向和依赖模型的差异性。\n\n- [VLA-R：面向开放世界端到端自动驾驶的视觉-语言动作检索](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.12405)\n  - 成贤基、文成佑、安浩镇、姜哲勋、沈贤哲\n  - 发表日期：2025年11月16日\n  - 任务：端到端\n  - 摘要：\n    - VLA-R是一种开放世界端到端自动驾驶框架，将开放世界的感知与新颖的视觉-动作检索范式相结合。\n    - 利用冻结的视觉-语言模型进行开放世界检测和分割，并通过Q-Former瓶颈层连接感知与行动领域。\n    - 引入视觉-动作对比学习机制，对视觉-语言和动作嵌入进行对齐，以实现有效的开放世界推理和动作检索。\n\n- [FLAD：基于LLMs的车辆-边缘-云网络联邦学习自动驾驶框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.09025)\n  - 向天傲、支明健、毕元国、蔡林、陈宇豪\n  - 出版单位：维多利亚大学、多伦多大学\n  - 发表日期：2025年11月12日\n  - 任务：端到端\n  - 摘要：\n    - FLAD是一个基于LLMs的联邦学习自动驾驶框架，旨在解决协作式模型训练中计算\u002F传输成本高昂及数据隐私方面的挑战。\n    - 提出一种云-边缘-车辆协同架构，结合智能并行训练与通信调度，以及知识蒸馏方法，以针对异构边缘数据个性化定制LLMs。\n    - 在配备NVIDIA Jetson的测试平台上进行了原型验证，展示了分布式车载资源的有效利用及优越的端到端自动驾驶性能。\n\n- [一种用于自动驾驶中视觉语言模型幻觉缓解的低秩方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.06496)\n  - 龙克克、郭家诚、张天云、于洪凯、李晓鹏\n  - 发表日期：2025年11月9日\n  - 任务：感知\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 提出一种自包含的低秩方法，仅使用VLM生成的多个字幕，无需外部参考或访问模型即可自动按幻觉程度排序。\n    - 构建句子嵌入矩阵，将其分解为低秩共识部分和稀疏残差部分，并根据残差大小选择最无幻觉的字幕。\n    - 在nuScenes数据集上达到87%的选择准确率，优于基线方法，与人类判断高度相关，并将实时应用的推理时间缩短51%-67%。\n\n- [VLDrive：视觉增强型轻量级MLLMs用于高效的语言接地自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.06256)\n  - 张瑞飞、张伟、谭晓、杨思贝、万翔、罗晓楠、李冠斌\n  - 发表日期：2025年11月9日\n  - 代码：[VLDrive](https:\u002F\u002Fgithub.com\u002FReaFly\u002FVLDrive)\n  - 任务：端到端\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - VLDrive是一种新型轻量级MLLM架构，专为语言接地的自动驾驶设计，其增强的视觉组件解决了视觉表征方面的局限性。\n    - 引入循环一致的动态视觉修剪和记忆增强的特征聚合技术，以生成紧凑的视觉令牌；同时采用距离解耦的指令注意力机制，以改善视觉-语言学习效果。\n    - 在CARLA模拟器中实现了最先进的驾驶性能，同时将参数量减少了81%（从70亿降至13亿），并在不同距离的驾驶评分上均有显著提升。\n\n- [AdaDrive：面向语言驱动自动驾驶的自适应慢速-快速系统](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.06253)\n  - 张瑞飞、谢俊林、张伟、陈维凯、谭晓、万翔、李冠斌\n  - 发表日期：2025年11月9日\n  - 代码：[AdaDrive](https:\u002F\u002Fgithub.com\u002FReaFly\u002FAdaDrive)\n  - 任务：规划\n  - 摘要：\n    - AdaDrive 
是一种自适应协作的慢速-快速框架，能够最优地决定何时以及如何让大型语言模型参与语言驱动自动驾驶中的决策过程。\n    - 提出了一种自适应激活损失函数，仅在复杂场景下动态调用大型语言模型；同时引入自适应融合策略，根据场景复杂度连续、按比例地调节大型语言模型的影响。\n\n- [SAFe-Copilot：统一的共享自治框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.04664)\n  - Phat Nguyen、Erfan Aasi、Shiva Sreeram、Guy Rosman、Andrew Silva、Sertac Karaman、Daniela Rus\n  - 出版单位：麻省理工学院\n  - 发表日期：2025年11月6日\n  - 任务：规划\n  - 数据集：[Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)\n  - 摘要：\n    - 该框架是一种统一的共享自治系统，利用视觉语言模型（VLM）在高层次抽象上整合人类输入与自主规划器，以推断驾驶员意图。\n    - 框架能够合成连贯的策略来协调人类与自主控制，在人体实验调查中表现出高度一致性，并在 Bench2Drive 基准测试中显著提升性能。\n\n- [基于成对排序与元特征的轨迹预测动态模型选择](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.00126)\n  - 卢博文\n  - 发表日期：2025年10月31日\n  - 任务：预测\n  - 数据集：[nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - 摘要：\n    - 提出一种动态多专家门控框架，能够在样本级别自适应地从基于物理信息的 LSTM、Transformer 和微调后的 GameFormer 中选择最可靠的轨迹预测器。\n    - 将轨迹专家选择问题建模为基于内部模型信号（元特征）的成对排序问题，无需事后校准即可优化决策质量。\n    - 在 nuPlan-mini 数据集上评估，LLM 增强的三专家门控方案相比 GameFormer 将最终位移误差降低了 9.5%，并在开环仿真中持续表现出改进效果。\n\n- [Token Is All You Need：通过信念-意图协同演化实现认知规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.05540)\n  - 桑世尧\n  - 发表日期：2025年10月30日\n  - 任务：规划\n  - 数据集：[nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - 摘要：\n    - 提出有效的规划源于一组语义丰富的最小化标记中信念与意图的协同演化，从而挑战了对详尽场景建模的需求。\n    - 实验证明，稀疏的意图标记即可取得优异性能，而基于预测的未来标记进行轨迹解码则可将 ADE 提升 21.6%，表明性能源自认知规划。\n    - 观察到训练过程中认知一致性和时间模糊性的涌现，确立了一种新的范式：智能在于信念与意图的标记化二元性。\n\n- [Alpamayo-R1：弥合推理与动作预测，实现长尾场景下的通用自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.00088)\n  - 王燕、罗文杰、白俊杰、曹宇龙、车彤、陈科、陈宇潇、戴蒙德·珍娜、丁一凡、丁文浩、冯亮、格雷格·海因里希、黄杰克、卡库斯·彼得、李博毅、李品毅、林宗义、刘东然、刘明宇、刘浪川、刘志坚、陆杰森、毛云翔、莫尔恰诺夫·帕夫洛、帕瓦奥·林赛、彭正浩、兰辛格·迈克、施默灵·埃德、沈诗达、石云飞、塔里克·萨拉、田冉、韦克尔·蒂尔曼、翁新硕、肖天军、杨埃里克、杨晓东、尤荣、曾晓辉、张文远、伊万诺维奇·鲍里斯、帕沃内·马可\n  - 出版单位：NVIDIA\n  - 发表日期：2025年10月30日\n  - 代码：[Alpamayo-R1](https:\u002F\u002Fgithub.com\u002FNVlabs\u002Falpamayo)\n  - 任务：规划\n  - 摘要：\n    - Alpamayo-R1 (AR1) 是一种视觉-语言-行动模型（VLA），将因果链推理与轨迹规划相结合，用于处理复杂的长尾驾驶场景。\n    - 引入了通过混合自动标注与人机协作流程构建的因果链（CoC）数据集，以及由推理型 VLM 和基于扩散的轨迹解码器组成的模块化架构。\n    - 该模型采用多阶段训练策略，结合监督微调与强化学习，在仿真和实时车载部署中均实现了更高的规划精度与安全性。\n\n- [超越模仿：基于流匹配的约束感知端到端自动驾驶轨迹生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.26292)\n  - 刘琳、于冠义、宋子 Ying、李俊乔、贾彩艳、贾飞阳、吴培良、罗燕丹\n  - 发表日期：2025年10月30日\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - 提出 CATG，一种新颖的规划框架，利用约束流匹配解决模仿学习中的模式坍塌问题，并将安全与运动学约束直接融入生成过程。\n    - 在流匹配过程中显式施加约束，并将驾驶激进程度参数化为控制信号，以调节轨迹风格。\n    - 在 NavSim v2 挑战赛中以 EPDMS 评分 51.31 获得第二名，并荣获创新奖。\n\n- [通过任务特定提示与空间推理提升自动驾驶视觉语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.24152)\n  - 吴傲迪、罗旭波\n  - 出版单位：中国科学院大学、中南大学\n  - 发表日期：2025年10月28日\n  - 代码：[UCAS-CSU-phase2](https:\u002F\u002Fgithub.com\u002Fwuaodi\u002FUCAS-CSU-phase2)\n  - 任务：VQA\n  - 数据集：RoboSense 挑战赛\n  - 摘要：\n    - 构建了一个系统的自动驾驶场景理解框架，基于提示混合路由器、带有空间推理的任务特定提示、视觉组装模块以及优化的推理参数。\n    - 在 Qwen2.5-VL-72B 上实现，在 IROS 2025 的 RoboSense 挑战赛中，清洁数据上的准确率达到 70.87%，而损坏数据上则达到 72.85%。\n\n- [结合人类先验知识的物理感知空间智能：自动驾驶试点研究](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.21160)\n  - 吴冠林、苏博彦、赵阳、王普、林一辰、杨浩弗兰克\n  - 发表日期：2025年10月24日\n  - 任务：感知\n  - 数据集：[SIGBench](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.21160)\n  - 摘要：\n    - 提出空间智能网格（SIG），这是一种基于网格的结构化方案，用于显式编码物体布局、关系以及物理约束先验，以支持自动驾驶基础模型的推理。\n    - 推导出基于SIG的评估指标，用以量化模型内在的视觉-空间智能（VSI），从而将空间能力与语言先验区分开来。\n    - 发布SIGBench基准数据集，包含1400帧驾驶场景，并附有SIG真值标签和人类视线轨迹标注，以支持VSI相关任务。\n\n- [解决自动驾驶中的Corner Case问题：基于世界模型的混合专家与大语言模型方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.21867)\n  - 廖海成、王博楠、杨俊贤、王承悦、何正斌、张国辉、徐成中、李振宁\n  - 发表日期：2025年10月23日\n  - 
任务：预测\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[NGSIM](https:\u002F\u002Fops.fhwa.dot.gov\u002Ftrafficanalysistools\u002Fngsim.htm)、[HighD](https:\u002F\u002Fwww.highd-dataset.com\u002F)、[MoCAD](https:\u002F\u002Fmocad-dataset.github.io\u002F)\n  - 摘要：\n    - WM-MoE是一种基于世界模型的运动预测框架，整合了感知、记忆和决策功能，旨在应对高风险的Corner Case场景。\n    - 利用轻量级时间分词器的大语言模型进行长时程推理，并引入混合专家（MoE）机制来分解复杂的Corner Case问题。\n    - 提出了nuScenes-corner基准，并在多种数据集上展示了在Corner Case及数据缺失条件下的最先进性能。\n\n- [重新思考驾驶世界模型作为感知任务的合成数据生成器](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.19195)\n  - 曾凯、吴展谦、熊凯欣、魏晓宝、郭向宇、朱振鑫、何嘉乐、周立军、曾博涵、陆明、孙海洋、王兵、陈光、叶航军、张文涛\n  - 出版单位：华中科技大学、小米电动汽车\n  - 发表日期：2025年10月22日\n  - 项目页面：[Dream4Drive](https:\u002F\u002Fwm-research.github.io\u002FDream4Drive\u002F)\n  - 代码：[Dream4Drive](https:\u002F\u002Fgithub.com\u002Fwm-research\u002FDream4Drive)\n  - 任务：感知\n  - 数据集：[DriveObj3D](https:\u002F\u002Fgithub.com\u002Fwm-research\u002FDream4Drive)\n  - 摘要：\n    - 提出Dream4Drive框架，该框架可将视频分解为具备3D感知的引导地图，并渲染3D资产，从而生成经过编辑的多视角逼真视频，用于训练感知模型。\n    - 支持大规模生成多视角Corner Case场景，显著提升自动驾驶中Corner Case场景的感知能力。\n    - 贡献了一个大规模3D资产数据集DriveObj3D，涵盖典型驾驶场景类别，以实现多样化的3D感知视频编辑。\n\n- [OmniNWM：全知全能的驾驶导航世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.18313)\n  - 李博文、马壮、杜达龙、彭宝瑞、梁竹瑾、刘振强、马超、金月明、赵浩、曾文俊、金鑫\n  - 出版单位：清华大学、上海人工智能实验室、中国科学院大学、上海交通大学\n  - 发表日期：2025年10月21日\n  - 项目页面：[OmniNWM](https:\u002F\u002Farlo0o.github.io\u002FOmniNWM\u002F)\n  - 任务：导航\n  - 摘要：\n    - OmniNWM是一种全景式全知全能的导航世界模型，在统一框架内同时处理状态、动作和奖励维度，适用于自动驾驶场景。\n    - 它能够联合生成RGB、语义、度量深度和3D占用率的全景视频，并采用灵活的强制策略进行长时程自回归生成。\n    - 引入归一化的全景Plucker射线图以实现精确的轨迹控制，并利用生成的3D占用率定义基于规则的密集奖励，以确保驾驶合规性和安全性。\n\n- [通过元数据驱动的上下文和任务特定提示实现稳健的驾驶问答系统](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.19001)\n  - 尹承俊、朴俊成、林永善、沈贤静\n  - 发表日期：2025年10月21日\n  - 任务：视觉问答\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 这是一个用于自动驾驶的两阶段视觉-语言问答系统，能够利用大型多模态大语言模型回答高层次的感知、预测和规划问题。\n    - 系统基于多摄像头输入、时间序列历史和思维链提示对模型进行条件设置，并通过自一致性集成提高可靠性。\n    - 第二阶段通过场景元数据和任务特定指令增强提示，显著提升了准确率，并在严重视觉退化条件下表现出强大的鲁棒性。\n\n- [SAVANT：结合视觉增强异常检测的语义分析](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.18034)\n  - 罗伯托·布鲁斯尼基、大卫·波普、高远、皮奇尼尼·马蒂亚、约翰内斯·贝茨\n  - 发表日期：2025年10月20日\n  - 任务：感知\n  - 摘要：\n    - 提出SAVANT框架，通过分层场景分析和结构化场景描述提取与多模态评估的双阶段流程，检测异常驾驶场景。\n    - 在真实驾驶场景中实现了高准确率和召回率，使一个经过微调的70亿参数开源模型超越了专有模型，同时支持本地低成本部署。\n    - 针对数据稀缺问题，自动为超过9640张真实图像打上了高精度标签，为自动驾驶系统中的可靠语义监控提供了切实可行的路径。\n\n- [SimpleVSF：用于端到端自动驾驶轨迹预测的VLM评分融合方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.17191)\n  - 郑培儒、赵云、龚展、朱宏、吴绍华\n  - 发表日期：2025年10月20日\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - SimpleVSF是一种新颖的框架，通过利用视觉-语言模型（VLM）的认知能力和先进的轨迹融合技术，提升端到端规划性能。\n    - 它结合了传统评分器和新型VLM增强评分器，使用稳健的权重融合器进行定量聚合，并借助VLM融合器进行定性、情境感知的决策。\n    - 该方法在ICCV 2025 NAVSIM v2端到端驾驶挑战赛中名列前茅，展现了在安全性、舒适性和效率方面的最先进水平。\n\n- [DiffVLA++：通过度量引导对齐桥接认知推理与端到端驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.17148)\n  - 高宇、蒋安庆、王一儒、王继军、姜浩、孙志刚、于恒文、王硕、赵浩、孙浩\n  - 发表日期：2025年10月20日\n  - 任务：端到端\n  - 摘要：\n    - DiffVLA++是一种增强型自动驾驶框架，通过度量引导对齐将认知推理与端到端规划相连接。\n    - 引入了一个用于语义化轨迹的VLA模块、一个用于物理可行性的E2E模块，以及一个度量引导的轨迹评分器来对齐两者的输出。\n    - 在ICCV 2025自动驾驶大挑战排行榜上取得了49.12的EPDMS分数。\n\n- [基于隐式残差世界模型的视觉中心4D占用预测与规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.16729)\n  - 梅建标、杨宇、杨雪萌、温立成、吕佳俊、史博天、刘勇\n  - 发表日期：2025年10月19日\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 提出IR-WM，一种专注于为视觉中心自动驾驶建模当前状态及世界演化的隐式残差世界模型，避免了完整的场景重建。\n    - 引入了一种利用BEV特征作为时间先验的残差预测方法，并设计了一个对齐模块以缓解随时间累积的误差。\n    - 
实验证明，来自世界模型的隐式未来状态能够提升规划精度，在nuScenes数据集上的4D占用预测和轨迹规划任务中达到了顶尖性能。\n\n- [推进越野自动驾驶：大规模ORAD-3D数据集及全面基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.16500)\n  - 陈敏、梅吉林、翟恒、王帅、孙通、孔凡杰、李浩阳、毛方远、刘福洋、王硕、聂一鸣、朱琪、肖亮、赵大伟、胡宇\n  - 发表日期：2025年10月18日\n  - 代码：[ORAD-3D](https:\u002F\u002Fgithub.com\u002Fchaytonmin\u002FORAD-3D)\n  - 任务：生成\n  - 数据集：[ORAD-3D](https:\u002F\u002Fgithub.com\u002Fchaytonmin\u002FORAD-3D)\n  - 摘要：\n    - 介绍了ORAD-3D，这是目前最大的越野自动驾驶数据集，涵盖了多样化的地形、天气条件和光照水平。\n    - 建立了一套针对五个基础任务的全面基准测试体系：2D自由空间检测、3D占用预测、路径规划、视觉-语言模型驱动的驾驶，以及越野环境下的世界建模。\n\n- [VDRive：利用强化VLA与扩散策略实现端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.15446)\n  - 郭子昂、张祖峰\n  - 发表日期：2025年10月17日\n  - 任务：端到端\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - VDRive是一种新颖的端到端自动驾驶流程，通过建模状态-动作映射实现可解释且鲁棒的决策。\n    - 结合了用于上下文状态理解的视觉语言动作模型（VLA）和基于生成式扩散策略的动作头，用于几何动作的生成。\n    - 采用带有演员-评论家框架的强化学习微调流程，在Bench2Drive和nuScenes基准测试中达到了最先进水平。\n\n- [DriveCritic：借助视觉-语言模型实现面向情境、符合人类偏好的自动驾驶评估](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.13108)\n  - 宋京宇、李振鑫、兰世怡、孙兴隆、张娜迪恩、沈玛英、陈约书亚、斯金纳·凯瑟琳、阿尔瓦雷斯·何塞\n  - 发表日期：2025年10月15日\n  - 任务：规划\n  - 摘要：\n    - 介绍了DriveCritic，一种用于情境感知、符合人类偏好的自动驾驶规划评估的新框架，包含标注了人类偏好意见的挑战性场景精选数据集，以及基于视觉-语言模型（VLM）的评估器。\n    - DriveCritic模型采用两阶段监督与强化学习相结合的微调流程，通过整合视觉与符号上下文来评判轨迹对之间的优劣，其在匹配人类偏好方面的表现显著优于现有指标。\n\n- [DriveVLA-W0：世界模型放大自动驾驶中的数据规模效应](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.12796)\n  - 李英燕、尚书瑶、刘伟松、战兵、王浩辰、王雨琪、陈云涛、王小满、安亚松、唐楚峰、侯陆、范卢、张兆祥\n  - 发表日期：2025年10月14日\n  - 任务：端到端\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - 提出了DriveVLA-W0，一种利用世界建模预测未来图像的训练范式，通过密集的自监督信号来学习驾驶环境的动力学特性。\n    - 引入了一个基于世界建模学习表示构建的轻量级动作专家，用于实时推理。\n    - 实验证明，该方法相比基线有显著性能提升，并显示出随着训练数据集规模的扩大，数据规模效应会进一步增强。\n\n- [CoIRL-AD：基于潜在世界模型的协作-竞争式模仿-强化学习在自动驾驶中的应用](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.12560)\n  - 郑晓吉、杨子渊、陈彦豪、彭宇航、汤元荣、刘耿源、陈博凯、龚江涛\n  - 发表日期：2025年10月14日\n  - 代码：[CoIRL-AD](https:\u002F\u002Fgithub.com\u002FSEU-zxj\u002FCoIRL-AD)\n  - 任务：端到端\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 提出了CoIRL-AD，一种竞争性的双策略框架，使IL和RL智能体在训练过程中能够相互交互，突破了传统的两阶段范式。\n    - 引入了一种基于竞争的机制，以促进知识交流并防止梯度冲突。\n    - 实验表明，与基线相比，碰撞率降低了18%，并且具有更强的泛化能力，在长尾场景中的表现也有所提升。\n\n- [基于流匹配的自动驾驶规划及其先进的交互行为建模](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.11083)\n  - 谭天毅、郑一楠、梁瑞明、王泽旭、郑可欣、郑金良、李建雄、詹宪元、刘静静\n  - 发表日期：2025年10月13日\n  - 代码：[Flow-Planner](https:\u002F\u002Fgithub.com\u002FDiffusionAD\u002FFlow-Planner)\n  - 任务：规划\n  - 数据集：[nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)、[interPlan](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Finterplan)\n  - 摘要：\n    - 提出了Flow Planner，这是一个用于自动驾驶规划的框架，通过数据建模、架构设计和学习方式的创新来解决交互行为的建模问题。\n    - 引入了细粒度的轨迹标记化技术，以及专门用于高效时空融合的架构，以更好地捕捉交互行为。\n    - 结合无分类器指导的流匹配技术进行多模态行为生成，动态调整智能体间的交互权重，从而形成连贯的响应策略。\n\n- [用于安全自动驾驶的博弈论风险塑造强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.10960)\n  - 胡东、胡芬青、杨立东、黄超\n  - 发表日期：2025年10月13日\n  - 代码：[GTR2L](https:\u002F\u002Fgithub.com\u002FDanielHu197\u002FGTR2L)\n  - 任务：规划\n  - 摘要：\n    - 提出了一种用于安全自动驾驶的新型博弈论风险塑造强化学习（GTR2L）框架，该框架结合了多层级博弈论世界模型，以预测交互行为和风险。\n    - 具有基于预测不确定性的自适应展开时域（horizon），以及一种考虑不确定性的屏障机制，用于灵活调节安全边界。\n    - 在安全关键场景中表现出色，其成功率、碰撞减少率和效率均优于当前最先进的基线方法及人类驾驶员。\n\n- [Align2Act：面向人类对齐的指令微调模型用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.10503)\n  - 卡尼什卡·贾桑卡尔、苏尼迪·坦德尔\n  - 发表日期：2025年10月12日\n  - 任务：规划\n  - 数据集：[nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - 摘要：\n    - Align2Act 是一个运动规划框架，它通过基于人类推理和交通规则的结构化驾驶指令，将指令微调的大语言模型转化为与人类行为一致的可解释规划器。\n    - Align2ActChain 
模块引导逐步推理，生成可解释的理由说明和安全轨迹，并在 LLaMA-2-7B 上使用 LoRA 技术，基于 nuPlan 数据集进行微调。\n    - 在真实的 nuPlan 封闭环基准测试中，该框架展示了更好的规划质量和更接近人类的行为表现，其结构化的推理显著提升了性能，优于基线大语言模型规划器。\n\n- [LinguaSim：基于大语言模型的自然语言指令驱动的多车辆交互测试场景生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.08046)\n  - 石庆远、孟庆文、程浩、徐庆、王建强\n  - 发表日期：2025年10月9日\n  - 任务：生成\n  - 摘要：\n    - LinguaSim 是一个基于大语言模型的框架，能够将自然语言转换为逼真的、可交互的 3D 场景，用于自动驾驶车辆的测试和训练，同时确保动态的车辆交互以及输入描述与生成场景之间的高度一致性。\n    - 反馈校准模块可以提升生成精度，更好地符合用户意图，并减少过度激进的行为（碰撞率从 46.9% 降至 6.3%）。\n    - 该框架弥合了自然语言与封闭式交互仿真之间的鸿沟，通过场景描述和自动驾驶模型双重约束来控制对抗性车辆行为。\n\n- [GTR-Bench：评估视觉-语言模型中的地理-时间推理能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07791)\n  - 谢庆洪兵、夏兆源、朱峰、龚利军、李子悦、赵睿、曾龙\n  - 发表日期：2025年10月9日\n  - 代码：[GTR-Bench](https:\u002F\u002Fgithub.com\u002FX-Luffy\u002FGTR-Bench)\n  - 任务：评估\n  - 摘要：\n    - 介绍了 GTR-Bench，这是一个用于评估大规模摄像头网络中移动目标地理-时间推理能力的新基准，要求在地图和视频之间切换视角，并在不重叠的视频视图之间进行联合推理。\n    - 评估结果显示，当前最先进的视觉-语言模型与人类在地理-时间推理方面的表现存在显著差距，揭示了这些模型在上下文利用、时间预测以及地图与视频对齐等方面的关键不足。\n    - 该基准为自动驾驶和具身智能等应用提供了关于时空智能的洞见，相关代码和基准将公开发布。\n\n- [CVD-STORM：用于自动驾驶的跨视角视频扩散与时空重建模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07944)\n  - 张天瑞、刘一辰、郭子林、郭宇欣、倪靖成、丁晨静、徐丹、陆雷威、吴泽寰\n  - 出版方：商汤科技\n  - 发表日期：2025年10月9日\n  - 项目页面：[CVD-STORM](https:\u002F\u002Fsensetime-fvg.github.io\u002FCVD-STORM)\n  - 任务：生成\n  - 摘要：\n    - CVD-STORM 是一种跨视角视频扩散模型，配备时空重建变分自编码器，可在各种控制输入下生成具有 4D 重建效果的长期多视角视频。\n    - 该方法通过辅助的 4D 重建任务对变分自编码器进行微调，以增强 3D 结构和时间编码能力，随后将其整合到视频扩散过程中，从而提升生成质量。\n    - 该模型在 FID 和 FVD 指标上均有提升，其联合训练的高斯泼溅解码器能够重建动态场景，用于获取几何信息和场景理解。\n\n- [Drive&Gen：端到端驾驶与视频生成模型的协同评估](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.06209)\n  - 王嘉豪、杨振沛、白怡静、李英伟、邹玉良、孙博、阿比吉特·昆杜、何塞·莱萨马、黄露娜、朱泽昊、黄继京、德拉戈米尔·安格洛夫、谭明兴、蒋驰宇\n  - 出版方：谷歌研究院、Waymo\n  - 发表日期：2025年10月7日\n  - 任务：生成\n  - 摘要：\n    - 提出了 Drive&Gen 方法，将驾驶模型与生成式世界模型相结合，利用新颖的统计指标评估视频的真实感，从而用于端到端规划器的评估。\n    - 利用视频生成的可控性，研究影响端到端规划器性能的分布差异和偏差。\n    - 结果表明，视频生成模型所生成的合成数据是替代真实数据的一种经济高效的方式，能够提升端到端模型在现有运行设计域之外的泛化能力。\n\n- [施工区域挑战 VLM 轨迹规划：迈向缓解与鲁棒的自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.02803)\n  - 廖义凡、孙震、邱晓云、赵子霄、唐文兵、何新磊、郑新虎、张天伟、黄心怡、韩星硕\n  - 发表日期：2025年10月3日\n  - 任务：规划\n  - 数据集：[ROADWork](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - 首次系统性地研究了视觉-语言模型（VLM）在施工区域中的轨迹规划能力，发现主流 VLM 的失败率高达 68.0%，并识别出 8 种常见的失败模式。\n    - 提出了 REACT-Drive，这是一种将 VLM 与检索增强生成（RAG）技术相结合的轨迹规划框架，能够将以往的失败案例转化为约束规则，并检索相似模式以提供指导。\n    - 实验表明，REACT-Drive 效果显著，与 VLM 基线相比，平均位移误差降低了约 3 倍，并且在 ROADWork 数据集及 15 个真实施工区域场景的实验中，实现了最低的推理时间（0.58 秒）。\n\n- [Nav-EE：自动驾驶中基于导航引导的高效视觉语言模型提前退出机制](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.01795)\n  - 胡海波、黄连明、王鑫宇、崔宇飞、吴尚宇、关楠、薛春杰\n  - 发表日期：2025年10月2日\n  - 代码：[Nav-EE](https:\u002F\u002Fanonymous.4open.science\u002Fr\u002FNav-EE-BBC4)\n  - 任务：规划\n  - 数据集：[CODA](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FCODA)、[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)、[BOSCH](https:\u002F\u002Fwww.bosch-mobility.com\u002Fen\u002Fsolutions\u002Fautomated-driving\u002F)\n  - 摘要：\n    - 提出了一种用于自动驾驶中视觉语言模型（VLM）的导航引导型提前退出框架Nav-EE，该框架可在离线阶段预计算特定于任务的退出层，并根据导航先验信息在线动态应用。\n    - 在达到与完整推理相当的精度的同时，将延迟降低了高达63.9%；实车集成测试表明，推理延迟从600毫秒降至300毫秒。\n\n- [视觉语言模型的战略融合：基于Shapley赋权的上下文感知Dawid-Skene算法在自动驾驶多标签任务中的应用](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.01126)\n  - 冯宇翔、张可扬、哈桑·乌丘伊德、阿什维尔·卡尼亚姆帕姆比尔、伊万尼斯·索夫拉斯、帕纳约蒂斯·安格鲁迪斯\n  - 出版单位：帝国理工学院\n  - 发表日期：2025年10月1日\n  - 任务：感知、推理\n  - 数据集：[HDD](https:\u002F\u002Fusa.honda-ri.com\u002Fhdd)\n  - 摘要：\n    - 提出了一种基于博弈论的融合方法——具有共识机制的Shapley赋权上下文感知Dawid-Skene算法，用于对行车记录仪视频进行多标签理解，以解决自动驾驶系统中视觉语言模型产生的幻觉问题。\n    - 
利用自动化的流水线，融合了HDD真实标注、车辆运动学和目标跟踪信息，构建了一个包含1000段真实行车记录仪片段的专用数据集，并进行了结构化标注。\n    - 该方法相比单一模型取得了显著提升，包括汉明距离降低23%，F1分数提高超过47%，为决策支持提供了校准且稳健的组件。\n\n- [NuRisk：面向自动驾驶中主体级风险评估的视觉问答数据集](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.25944)\n  - 高源、马蒂亚·皮奇尼尼、罗伯托·布鲁斯尼基、张宇辰、约翰内斯·贝茨\n  - 出版单位：慕尼黑工业大学\n  - 发表日期：2025年9月30日\n  - 任务：VQA\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)、[CommonRoad](https:\u002F\u002Fcommonroad.in.tum.de\u002F)\n  - 摘要：\n    - 提出了NuRisk，一个用于自动驾驶中主体级风险评估的综合性视觉问答（VQA）数据集。该数据集基于nuScenes和Waymo的真实世界数据，并补充了来自CommonRoad仿真器的安全关键场景。\n    - 数据集提供基于鸟瞰视角（BEV）的序列图像，附带定量的主体级风险标注，旨在支持并基准测试时空推理能力。\n    - 基准测试表明，标准视觉语言模型在此任务上难以进行明确的时空推理；而经过微调的7B参数量视觉语言模型则提高了准确性和降低了延迟，使NuRisk成为推动自动驾驶推理能力发展的重要基准。\n\n- [FuncPoison：用于劫持多智能体自动驾驶系统的中毒函数库攻击](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24408)\n  - 龙宇振、李松泽\n  - 出版单位：加州大学洛杉矶分校\n  - 发表日期：2025年9月29日\n  - 任务：规划\n  - 摘要：\n    - 介绍了一种新型的基于中毒的攻击方法——FuncPoison，该方法针对由大语言模型驱动的多智能体自动驾驶系统中的共享函数库，通过注入恶意工具来操纵智能体行为，从而引发级联错误，导致系统轨迹精度下降。\n    - 攻击利用了基于文本的工具选择机制及标准化命令格式中的漏洞，成功演示了其规避防御的能力，并强调函数库是系统可靠性方面一个关键但尚未充分探索的攻击面。\n\n- [少即是多：轻量却强大的自动驾驶视觉语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.00060)\n  - 杨盛、詹彤、陈冠成、陆延峰、王健\n  - 发表日期：2025年9月29日\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 介绍了一种名为Max-V1的新颖单阶段端到端自动驾驶框架，该框架将驾驶重新概念化为一种广义的语言，并将轨迹规划表述为下一航点预测。\n    - 提出了一种单次生成范式，利用视觉语言模型（VLM）直接从前视摄像头输入预测轨迹，并由基于统计建模的原则性策略进行监督。\n    - 在nuScenes数据集上实现了最先进的性能，提升了30%以上，同时通过大规模专家示范的模仿学习，展现了强大的泛化能力和跨车型的鲁棒性。\n\n- [学习采样：强化学习引导的采样方法在自动驾驶车辆运动规划中的应用](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24313)\n  - 科尔比尼安·莫勒、罗兰·斯特鲁普、马蒂亚·皮奇尼尼、亚历山大·朗曼、约翰内斯·贝茨\n  - 出版单位：慕尼黑工业大学（TUM）\n  - 发表日期：2025年9月29日\n  - 代码：[Learning-to-Sample](https:\u002F\u002Fgithub.com\u002FTUM-AVS\u002FLearning-to-Sample)\n  - 任务：规划\n  - 数据集：[CommonRoad](https:\u002F\u002Fcommonroad.in.tum.de\u002F)\n  - 摘要：\n    - 提出了一种混合式的基于采样的运动规划框架，利用强化学习智能体引导采样向行动空间中有希望的区域集中，同时保持轨迹生成和评估过程的分析性和可验证性。\n    - 将强化学习采样器与基于可解码深度集合编码器的世界模型相结合，以处理不同数量的交通参与者，并重建潜在表示。\n    - 在CommonRoad上的评估显示，所需样本数量减少了高达99%，运行时间缩短了84%，同时保持了规划的成功率和无碰撞率。\n\n- [通过多模态域适应防止机器人越狱](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.23281)\n  - 弗朗切斯科·马尔基奥里、罗汉·辛哈、克里斯托弗·阿吉亚、亚历山大·罗比、乔治·J·帕帕斯、毛罗·孔蒂、马尔科·帕沃内\n  - 出版单位：帕多瓦大学、斯坦福大学、宾夕法尼亚大学\n  - 发表日期：2025年9月27日\n  - 项目页面：[J-DAPT](https:\u002F\u002Fj-dapt.github.io)\n  - 任务：规划\n  - 数据集：[Waymo开放数据集](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)、[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 介绍了一种轻量级的多模态越狱检测框架——J-DAPT，它利用基于注意力的融合技术和域适应技术，在机器人环境中实现越狱行为的检测。\n    - 通过整合文本和视觉嵌入，捕捉语义意图和环境背景，将通用越狱数据与领域特定参考对齐。\n    - 在自动驾驶、海洋机器人和四足机器人导航等场景下的评估表明，J-DAPT能够在极低开销下将检测准确率提升至接近100%。\n\n- [BEV-VLM：基于统一鸟瞰图抽象的轨迹规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.25249)\n  - 陈冠成、杨盛、詹通、王健\n  - 发表日期：2025年9月27日\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 介绍BEV-VLM框架，该框架利用视觉-语言模型，并以鸟瞰图特征图为视觉输入，用于轨迹规划。\n    - 使用融合多模态传感器数据得到的统一BEV-HD地图格式，实现几何一致的场景描述。\n    - 在nuScenes数据集中，规划精度提升44.8%，并完全避免了碰撞。\n\n- [MTRDrive：面向边缘场景的鲁棒自动驾驶的记忆—工具协同推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.20843)\n  - 罗子昂、钱康安、王嘉华、罗悦辰、苗金宇、傅正、王云龙、蒋思聪、黄子林、胡一飞、杨宇浩、叶浩、杨梦梦、董晓健、姜坤、杨典歌\n  - 发表日期：2025年9月25日\n  - 任务：端到端\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - 提出MTRDrive框架，将程序化驾驶经验与动态工具箱相结合，以增强端到端自动驾驶的泛化能力和主动决策能力。\n    - 设计了一个闭环系统，将基于记忆的经验检索机制与动态工具箱结合，从而提升推理与决策能力。\n    - 在NAVSIM基准测试中达到最先进水平，并在新的Roadwork-VLM基准测试中展现出强大的零样本泛化能力。\n\n- 
[针对自动驾驶视觉-语言模型的通用伪装攻击](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.20196)\n  - 孔德宏、于思凡、梁思远、梁家伟、甘建厚、刘爱珊、任文琪\n  - 发表日期：2025年9月24日\n  - 任务：端到端\n  - 摘要：\n    - 首次提出适用于自动驾驶视觉-语言模型的通用伪装攻击（UCA）框架，生成可在物理世界中实现且对不同指令和模型架构具有泛化性的伪装纹理。\n    - 引入针对编码器和投影层漏洞的特征发散损失（FDL），并采用多尺度学习策略，以提高在真实场景中对视角和尺度变化的鲁棒性。\n    - 攻击效果显著，在多种VLM-AD模型上均能诱导错误的驾驶指令，性能大幅超越现有方法，且在多样化的动态条件下仍保持高度稳健。\n\n- [离散扩散在自动驾驶反射式视觉-语言-动作模型中的应用](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.20109)\n  - 李鹏翔、郑义楠、王岳、王慧敏、赵航、刘静静、战宪元、战坤、郎贤鹏\n  - 发表日期：2025年9月24日\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - 介绍ReflectDrive框架，该框架基于学习方法，通过离散扩散集成反射机制来生成安全轨迹，以解决视觉-语言-动作模型中模仿学习的局限性。\n    - 提出一种安全感知的反射机制，无需梯度计算即可进行迭代自我修正，利用局部搜索识别不安全标记，并通过修复填充技术生成安全锚点。\n    - 在NAVSIM基准测试中评估，结果表明其在自动驾驶系统的安全关键轨迹生成方面具有显著优势。\n\n- [编排、生成、反思：基于VLM的多智能体协作框架用于自动驾驶策略学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.17042)\n  - 彭增奇、谢宇森、王宇斌、杨锐、陈启峰、马军\n  - 发表日期：2025年9月21日\n  - 任务：规划\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - 提出OGR框架，这是一种新颖的自动驾驶策略学习框架，利用基于视觉-语言模型的多智能体协作系统自动设计奖励函数和训练课程。\n    - 引入一个包含编排、生成和反思模块的分层智能体系统，并辅以记忆模块和人机协作的并行生成方案，以促进策略的稳健演化。\n    - 实验表明，该框架在CARLA的城市场景中表现出色，具有广泛的适用性和与多种强化学习算法的兼容性，实际道路试验也验证了其可行性。\n\n- [VLM是否已具备自动驾驶中的车道拓扑感知能力？](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.16654)\n  - 陈欣、何佳、李茂政、徐东亮、王天宇、陈一骁、林志新、姚悦\n  - 发表日期：2025年9月20日\n  - 任务：VQA\n  - 摘要：\n    - 系统性评估视觉-语言模型（VLM）在道路拓扑理解方面的能力，这是安全自动驾驶的关键要求。\n    - 基于鸟瞰图车道表示，提出了四个诊断性VQA任务，以捕捉空间拓扑推理的核心要素。\n    - 研究发现，空间推理仍然是当前VLM的基本瓶颈，其性能与模型规模、推理标记长度以及提供的示例数量密切相关。\n\n- [CoReVLA：基于收集与精炼的双阶段端到端自动驾驶框架，用于长尾场景](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.15968)\n  - 方世宇、崔一鸣、梁浩洋、吕晨、韩鹏、孙健\n  - 发表日期：2025年9月19日\n  - 代码：[CoReVLA](https:\u002F\u002Fgithub.com\u002FFanGShiYuu\u002FCoReVLA)\n  - 任务：端到端\n  - 数据集：[Bench2Drive](https:\u002F\u002Fthinklab-sjtu.github.io\u002FBench2Drive\u002F)\n  - 摘要：\n    - CoReVLA是一个持续学习的端到端自动驾驶框架，通过数据收集和行为精炼的双阶段流程，提升在长尾、安全关键场景中的表现。\n    - 该框架首先在驾驶问答数据上进行微调，随后在CAVE仿真环境中收集驾驶员接管数据，并通过直接偏好优化（DPO）进一步精炼，以学习人类偏好并避免奖励欺骗。\n    - 在Bench2Drive基准测试中，CoReVLA获得了72.18分的驾驶评分和50%的成功率，优于现有的长尾场景解决方案。\n\n- [AdaThinkDrive：基于强化学习的自适应思维用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.13769)\n  - 罗悦晨、李芳、徐绍青、赖志毅、杨磊、陈启茂、罗子昂、谢子轩、蒋圣寅、刘佳鑫、陈龙、王兵、杨志新\n  - 发表日期：2025年9月17日\n  - 任务：端到端\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - AdaThinkDrive是一种新颖的视觉-语言-动作（VLA）框架，具备双模推理机制（快速回答与慢速思考），用于自动驾驶中的自适应推理。\n    - 引入了自适应思考奖励策略，并结合分组相对策略优化（GRPO），以奖励模型有选择地应用思维链（CoT）推理。\n    - 在Navsim基准测试中取得了最先进的性能（PDMS为90.3），同时相比始终进行推理的基线模型，推理时间减少了14%。\n\n- [用于自动驾驶轨迹预测的大型基础模型：综合综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.10570)\n  - 戴伟、吴生根、吴伟、王振浩、吕思硕、廖海成、于利民、丁卫平、关润威、岳宇涛\n  - 发表日期：2025年9月11日\n  - 任务：预测\n  - 摘要：\n    - 对大型基础模型（LFMs），包括大语言模型（LLM）和多模态大语言模型（MLLM），在自动驾驶轨迹预测中的系统性综述，强调其在实现可解释上下文推理方面的作用。\n    - 涵盖了核心方法学，如轨迹-语言映射、多模态融合和基于约束的推理，以及相关任务、指标、数据集和关键挑战。\n    - 讨论了未来的研究方向，如低延迟推理、因果感知建模和运动基础模型。\n\n- [DepthVision：基于GAN的LiDAR转RGB合成技术赋能鲁棒视觉-语言模型，用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.07463)\n  - 斯文·基希纳、尼尔斯·普尔施克、罗斯·格里尔、阿洛伊斯·C·克诺尔\n  - 出版单位：慕尼黑工业大学\n  - 发表日期：2025年9月9日\n  - 任务：感知\n  - 摘要：\n    - DepthVision是一个多模态框架，无需架构改动或重新训练，即可通过将稀疏的LiDAR点云合成出密集的类似RGB的图像，使视觉-语言模型（VLM）能够利用LiDAR数据。\n    - 引入了亮度感知模态适配（LAMA）模块，该模块根据环境光照动态加权融合合成图像和真实相机图像，以补偿黑暗或运动模糊等退化情况。\n    - 该设计使LiDAR在RGB不可靠时成为即插即用的视觉替代品，扩展了现有VLM的运行范围。评估表明，在低光照场景理解方面，相比仅使用RGB的基线模型有显著提升。\n\n- [OccVLA：具有隐式3D占用监督的视觉-语言-动作模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.05578)\n  - 刘瑞勋、孔凌宇、李德润、赵航\n  - 发表日期：2025年9月6日\n  - 任务：规划\n  - 
数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 提出OccVLA框架，将3D占用表示整合到自动驾驶的多模态推理中，既作为预测输出，也作为监督信号。\n    - 无需显式3D输入或额外的推理开销，即可从2D视觉输入中学习精细的空间结构，因为占用预测可以被跳过。\n    - 在nuScenes上实现了轨迹规划的最先进结果，并在3D视觉问答任务中表现出色。\n\n- [LatticeWorld：由多模态大语言模型驱动的交互式复杂世界生成框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.05263)\n  - 段英林、邹正霞、顾通伟、贾伟、赵展、许路易、刘新竹、林业楠、江浩、陈康、邱爽\n  - 发表日期：2025年9月5日\n  - 项目页面：[演示视频](https:\u002F\u002Fyoutu.be\u002F8VWZXpERR18)\n  - 任务：生成\n  - 摘要：\n    - LatticeWorld是一个3D世界生成框架，利用轻量级大语言模型（LLaMA-2-7B）和虚幻引擎5，根据多模态文本和视觉指令创建大规模交互式世界。\n    - 该框架简化了工业级3D环境制作流程，效率提升了90倍以上，同时保持了与传统手动方式相当的高质量创意产出。\n\n- [SAM-LLM：通过参数化微调实现可解释的变道轨迹预测](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.03462)\n  - 曹卓、史云霄、徐敏\n  - 发表日期：2025年9月3日\n  - 任务：预测\n  - 摘要：\n    - 介绍SAM-LLM混合架构，将大语言模型（LLM）的上下文推理能力与运动学正弦加速度模型（SAM）的物理精确性相结合，用于自动驾驶轨迹预测。\n    - 针对变道场景，模型输出可解释的物理参数（如横向位移、持续时间），而非原始坐标，从而生成连续且合理的轨迹，输出规模比基于坐标的方案减少了80%。\n    - 实现了98.73%的意图预测准确率，达到最先进的水平，与传统LLM预测器相当，同时提供了更优的可解释性和计算效率。\n\n- [KEPT：基于视觉-语言模型的连续驾驶帧轨迹知识增强预测](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.02966)\n  - 王宇金、王天一、刘泉峰、范文贤、焦俊峰、克里斯蒂安·克劳德尔、严云兵、高炳照、王建强、陈宏\n  - 发表日期：2025年9月3日\n  - 任务：预测\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - KEPT是一个知识增强型VLM框架，可直接从连续的前视驾驶帧中预测本车轨迹。\n    - 集成了时序频率-空间融合视频编码器与k-means及HNSW检索增强生成流水线，利用检索到的知识在思维链提示中结合规划约束。\n    - 采用三阶段微调范式对VLM骨干网络进行对齐，在nuScenes上实现了最先进的开环性能。\n\n- [大语言模型模块能否泛化？面向自动驾驶运动规划的研究](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.02754)\n  - 王明义、王景科、叶腾驹、陈俊博、俞凯成\n  - 发表日期：2025年9月2日\n  - 任务：预测\n  - 数据集：[Waymo Sim Agents](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 摘要：\n    - 对五种关键的大语言模型模块在自动驾驶运动规划中的迁移能力进行了全面评估，包括分词器设计、位置嵌入、预训练范式、后训练策略以及推理时的计算方法。\n    - 结果表明，经过适当调整的大语言模型模块能够显著提升在Waymo Sim Agents基准上的性能，达到具有竞争力的水平。\n    - 研究还识别出哪些技术可以有效迁移，分析了失败的原因，并讨论了针对自动驾驶领域所需的必要调整。\n\n- [CVPR2024端到端挑战赛亚军方案：基于视觉语言模型的端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.02659)\n  - 郭子龙、罗毅、沙龙、王东旭、王攀渠、徐晨阳、杨毅\n  - 发表日期：2025年9月2日\n  - 任务：端到端\n  - 摘要：\n    - 提出了一种仅使用摄像头的端到端自动驾驶解决方案，将架构设计与视觉语言模型（VLM）相结合，在CVPR2024端到端挑战赛中获得第二名。\n    - 实验证明，将具备丰富知识的视觉语言模型融入端到端框架能够带来出色的性能，凸显了基于视觉的方法在驾驶任务中的潜力。\n\n- [AutoDrive-R²：激励自动驾驶VLA模型的推理与自我反思能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.01944)\n  - 袁振龙、钱成轩、唐静、陈睿、宋子健、孙磊、褚向翔、蔡宇君、张大鹏、李硕\n  - 发表日期：2025年9月2日\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 摘要：\n    - AutoDrive-R²是一种新颖的VLA框架，通过思维链处理和强化学习来增强自动驾驶中的推理与自我反思能力。\n    - 引入了nuScenesR²-6K思维链数据集用于监督微调，构建包含自我反思的四步逻辑链条，以连接感知与规划。\n    - 在物理约束奖励框架下采用GRPO强化学习算法，优化可靠且真实的轨迹规划。\n\n- [OmniReason：面向自动驾驶的时序引导型视觉-语言-行动框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.00789)\n  - 刘沛、宁青天、陆欣妍、刘海鹏、马伟良、佘丹根、贾鹏、郎贤鹏、马军\n  - 发表日期：2025年8月31日\n  - 任务：推理\n  - 摘要：\n    - OmniReason是面向自动驾驶的时序引导型视觉-语言-行动（VLA）框架，通过联合建模动态3D环境与决策过程，实现强大的时空推理能力。\n    - 提出了OmniReason-Data大规模VLA数据集，该数据集采用抗幻觉自动标注流水线生成密集的时空标注，确保物理合理性和时间一致性。\n    - 开发了OmniReason-Agent架构，配备稀疏时序记忆模块和解释生成器，利用时空知识蒸馏捕捉因果推理模式，从而实现可解释且具备时间感知的驾驶行为。\n\n- [DriveQA：通过驾驶知识测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.21824)\n  - 魏茂林、刘万洲、Eshed Ohn-Bar\n  - 发表日期：2025年8月29日\n  - 项目页面：[DriveQA](https:\u002F\u002Fdriveqaiccv.github.io\u002F)\n  - 任务：视觉问答\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[BDD](https:\u002F\u002Fbdd-data.berkeley.edu\u002F)\n  - 摘要：\n    - DriveQA是一个内容丰富的开源文本与视觉基准，全面覆盖交通法规及场景，用于评估大语言模型和多模态大语言模型。\n    - 实验表明，在DriveQA上进行微调可以提高模型对交通标志识别和交叉路口决策的准确性；而在其上进行预训练则能提升下游真实世界数据集上的驾驶任务表现。\n\n- 
[DrivingGaussian++：迈向周围动态驾驶场景的真实重建与可编辑仿真](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.20965)\n  - 熊亚娇、周晓宇、万永涛、孙德清、杨明轩\n  - 发表日期：2025年8月28日\n  - 项目页面：[DrivingGaussian++](https:\u002F\u002Fxiong-creator.github.io\u002FDrivingGaussian_plus.github.io)\n  - 任务：生成\n  - 摘要：\n    - DrivingGaussian++是一个高效的框架，用于对周围动态自动驾驶场景进行真实重建和可控编辑。它采用增量式3D高斯分布表示静态背景，并用复合动态高斯图结构处理移动物体。\n    - 集成了激光雷达先验信息，以实现细节丰富且一致的场景重建；同时，借助多视角图像、深度先验以及大语言模型（LLM）生成的运动轨迹，支持无需训练即可进行纹理、天气和物体操作等可控编辑。\n\n- [随心所欲地驾驶：基于多头扩散模型的策略级运动规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.16947)\n  - 丁帆、罗雪文、许慧慧、鲁图拉吉·雷迪、王锡坤、卢俊勇\n  - 发表日期：2025年8月23日\n  - 任务：规划\n  - 数据集：[nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - 摘要：\n    - 提出了一种基于扩散的多头轨迹规划器（M-diffusion planner），并结合群体相对策略优化（GRPO）对其进行微调，以学习多样化的特定于不同策略的驾驶行为。\n    - 推理阶段引入大语言模型（LLM），用于指导策略选择，实现动态且指令感知的规划，而无需切换模型。\n    - 在nuPlan基准上达到了最先进的性能，生成的轨迹展现出明显的多样性，能够满足多模式驾驶需求。\n\n- [看得清楚，忘得深刻：重新审视用于驾驶模拟的微调视频生成器](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.16512)\n  - 张春鹏、王晨宇、朱利安·施密特、霍尔格·凯撒、阿兰·帕加尼\n  - 发表日期：2025年8月22日\n  - 任务：生成\n  - 数据集：[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)、[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 研究了在驾驶模拟中微调视频生成器时存在的权衡问题：视觉保真度虽有所提升，但动态元素的空间准确性可能下降。\n    - 认为这种下降源于在重复性的驾驶场景背景下，视觉质量与动态理解目标之间出现了对齐偏差。\n    - 结果表明，简单的持续学习策略，例如来自不同领域的回放，可以在保持强大视觉质量的同时，维持良好的空间准确性。\n\n- [Prune2Drive：用于加速自动驾驶中视觉-语言模型的即插即用框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.13305)\n  - 雄明浩、文子辰、顾壮成、刘旭阳、张睿、康恒瑞、杨嘉兵、张俊远、李伟佳、何聪辉、王亚飞、张林峰\n  - 发表日期：2025年8月18日\n  - 任务：端到端\n  - 数据集：[DriveLM](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM)、[DriveLMM-o1](https:\u002F\u002Fgithub.com\u002Fayesha-ishaq\u002FDriveLMM-o1)\n  - 摘要：\n    - Prune2Drive 是一种用于自动驾驶多视角 VLM 的即插即用视觉令牌剪枝框架，旨在解决高分辨率多视角图像带来的计算开销问题。\n    - 提出了一种受最远点采样启发的多样性感知令牌选择机制，以及视图自适应剪枝控制器，以学习每路摄像头的最佳剪枝比例。\n    - 在保持 DriveLM 和 DriveLMM-o1 基准上任务性能的同时，实现了显著的速度提升和内存节省（例如，速度提升 6.40 倍，仅保留 10% 的令牌即可降至原 FLOPs 的 13.4%）。\n\n- [ViLaD：面向端到端自动驾驶的大规模视觉语言扩散框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12603)\n  - 崔灿、周宇鹏、彭俊通、朴成妍、杨子冲、普拉桑特·桑卡拉纳拉扬、张嘉如、张汝琪、王子然\n  - 发表日期：2025年8月18日\n  - 任务：端到端\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 提出了 ViLaD，一种新颖的大型视觉语言扩散（LVLD）框架，用于端到端自动驾驶。该框架采用掩码扩散模型并行生成驾驶决策，从而降低延迟。\n    - 该框架支持双向推理和渐进式的“先易后难”生成策略，在 nuScenes 数据集上，其规划准确性和速度均优于自回归 VLM 基线。\n    - 通过在一辆自动驾驶汽车上进行交互式泊车任务的实际部署，证明了其可行性，并实现了接近零的失败率。\n\n- [LMAD：用于可解释自动驾驶的集成式端到端视觉-语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12404)\n  - 宋楠、张博洲、朱夏天、邓建康、张莉\n  - 发表日期：2025年8月17日\n  - 任务：端到端\n  - 数据集：[DriveLM](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM)、[nuScenes-QA](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes#download)\n  - 摘要：\n    - 提出了 LMAD，一种新型的自动驾驶视觉-语言框架，它模仿现代端到端范式，具备全面的场景理解能力和任务专用的 VLM 结构。\n    - 在驾驶任务结构中引入了初步的场景交互模块和专门的专家适配器，以更好地使 VLM 与自动驾驶场景相契合。\n    - 该方法设计为完全兼容现有 VLM，并能无缝集成到以规划为导向的驾驶系统中，为可解释性自动驾驶设定了新标准。\n\n- [ImagiDrive：用于自动驾驶的统一想象与规划框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.11428)\n  - 李京宇、张博洲、金鑫、邓建康、朱夏天、张莉\n  - 发表日期：2025年8月15日\n  - 任务：端到端\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - ImagiDrive 是一种新颖的端到端自动驾驶框架，将基于 VLM 的驾驶智能体与基于 DWM 的场景想象器整合在一个统一的想象与规划循环中。\n    - 引入了提前停止机制和轨迹选择策略，以解决动作级决策与高保真像素级预测融合中的效率和预测准确性难题。\n\n- [VISTA：模拟情境思维与注意力的视觉-语言模型，实现动态环境中类人驾驶员的专注力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.05852)\n  - 凯撒·哈米德、坎达卡尔·阿什拉菲·阿克巴尔、梁娜德\n  - 发表日期：2025年8月7日\n  - 任务：感知\n  - 数据集：[BDD-A](https:\u002F\u002Fbdd-data.berkeley.edu\u002F)\n  - 摘要：\n    - 
这是一个视觉-语言框架，通过自然语言建模驾驶员的视线转移行为，利用单张 RGB 图像进行少样本和零样本学习。\n    - 在精选的 BDD-A 字幕数据上微调 LLaVA，使视觉感知与以注意力为中心的场景理解对齐，整合低层次线索和自上而下的上下文信息。\n    - 以自然语言形式生成驾驶员视觉注意力分配及转移的预测结果，为自动驾驶中的可解释人工智能提供了新的方向。\n\n- [IRL-VLA：通过奖励世界模型训练视觉-语言-行动策略](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.06571)\n  - 蒋安庆、高宇、王怡茹、孙志刚、王硕、衡雨文、孙浩、唐世臣、朱丽娟、柴金浩、王继军、顾子冲、江浩、孙莉\n  - 发表日期：2025年8月7日\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - IRL-VLA 是一种新颖的闭环强化学习框架，用于自动驾驶，采用逆向强化学习奖励世界模型结合视觉-语言-行动（VLA）策略。\n    - 该框架采用三阶段范式：首先通过模仿学习预训练 VLA 架构，然后利用 IRL 构建轻量级奖励世界模型，最后通过专门的 PPO 强化学习进一步优化规划。\n    - 在 NAVSIM v2 端到端驾驶基准测试中取得了最先进的性能，并在 CVPR2025 自动驾驶大挑战中获得亚军。\n\n- [LiDARCrafter：从 LiDAR 序列中动态构建 4D 世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.03692)\n  - 梁傲、刘友泉、杨宇、陆东岳、李林峰、孔令东、赵怀慈、黄伟昌\n  - 发表日期：2025年8月5日\n  - 任务：生成\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - LiDARCrafter 是一个统一的框架，能够根据自由格式的自然语言指令生成和编辑 4D LiDAR 数据，将指令解析为以自我为中心的场景图，并以此作为条件输入到三分支扩散网络中。\n    - 该框架包含一个自回归模块，用于生成时间上连贯的 4D LiDAR 序列，并建立了涵盖场景、物体和序列层面指标的综合基准，用于标准化评估。\n\n- [用于细粒度交通标志的Mapillary Vistas验证集：揭示视觉-语言模型局限性的基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.02047)\n  - 斯帕什·加格、阿比谢克·艾奇\n  - 出版商：美国NEC实验室\n  - 发表日期：2025年8月4日\n  - 代码：[relabeling](https:\u002F\u002Fgithub.com\u002Fnec-labs-ma\u002Frelabeling)\n  - 任务：感知\n  - 数据集：[Mapillary Vistas](https:\u002F\u002Fwww.mapillary.com\u002Fdataset\u002Fvistas)\n  - 摘要：\n    - 提出一个新的细粒度交通标志验证集——Mapillary Vistas交通标志验证集（MVV），该数据集包含像素级实例掩码和专家标注。\n    - 将视觉-语言模型与DINOv2进行对比基准测试，结果显示DINOv2在细粒度识别以及车辆、人类等其他类别上的表现更优。\n    - 揭示了当前视觉-语言模型在细粒度视觉理解方面的显著局限性，并将DINOv2确立为自动驾驶感知任务中的强大基线。\n\n- [Bench2ADVLM：面向自动驾驶的视觉-语言模型闭环基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.02028)\n  - 张天元、金婷、王璐、刘江帆、梁思远、张明川、刘爱珊、刘向龙\n  - 发表日期：2025年8月4日\n  - 任务：评估\n  - 摘要：\n    - Bench2ADVLM是一个统一的分层闭环评估框架，用于在仿真和物理平台中对自动驾驶领域的视觉-语言模型进行实时、交互式评估。\n    - 引入了针对仿真的双系统适配架构和物理控制抽象层，以弥合仿真与现实之间的差距，从而实现在真实车辆上的闭环测试。\n    - 具备自我反思的场景生成模块，可自动探索模型行为并发现潜在故障模式，以生成安全关键场景。\n\n- [超越仿真：面向自动驾驶规划与因果关系的世界模型基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.01922)\n  - 亨特·斯科菲尔德、穆罕默德·埃尔马胡吉比、卡斯拉·雷扎伊、山金俊\n  - 出版商：约克大学\n  - 发表日期：2025年8月3日\n  - 任务：评估\n  - 数据集：[Waymo开放数据集](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 摘要：\n    - 提出了新的指标来评估世界模型作为交通仿真器的鲁棒性，特别是其作为策略训练的伪环境的能力。\n    - 将Waymo开放模拟智能体挑战赛（WOSAC）的评估范围扩展至包括与自车存在因果关系的智能体，揭示了顶级模型在轨迹回放条件下失效的场景。\n    - 基于这些新指标分析了最先进的世界模型，以评估其对不可控对象的敏感性及是否适合用于策略训练。\n\n- [基于边缘的多模态传感器数据融合与视觉-语言模型（VLMs）结合，用于自动驾驶车辆的实时避撞](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.01057)\n  - 杨峰泽、于博、周洋、罗学文、涂正中、刘晨曦\n  - 发表日期：2025年8月1日\n  - 任务：规划\n  - 数据集：[DeepAccident](https:\u002F\u002Fgithub.com\u002Fsisl\u002FDeepAccident)\n  - 摘要：\n    - 提出REACT，一个基于微调轻量级视觉-语言模型（VLM）的实时V2X集成轨迹优化框架，用于自动驾驶。\n    - 将基础设施危险预警与车载传感器数据相结合，利用视觉嵌入和上下文推理生成安全导向的轨迹。\n    - 采用残差轨迹融合（RTF）和边缘适应策略实现高效部署，在DeepAccident基准测试中达到最先进的性能。\n\n- [面向自适应自动驾驶的统一感知-语言-行动框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23540)\n  - 张毅、埃里克·利奥·哈斯、曹国义、内纳德·彼得罗维奇、宋英雷、吴成东、阿洛伊斯·克诺尔\n  - 出版商：慕尼黑工业大学\n  - 发表日期：2025年7月31日\n  - 任务：规划\n  - 摘要：\n    - 提出了一个统一的感知-语言-行动（PLA）框架，将多传感器融合与LLM增强的视觉-语言-行动架构相结合，用于自动驾驶。\n    - 该框架配备GPT-4.1驱动的推理核心，将感知与自然语言理解相结合，实现情境感知、可解释且安全约束下的决策。\n    - 在复杂城市场景中，如施工区域的交叉路口，表现出优越的轨迹跟踪、速度预测和自适应规划性能。\n\n- [FastDriveVLA：通过即插即用的重建式标记剪枝实现高效的端到端驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23318)\n  - 曹佳俊、张启哲、贾培东、赵旭辉、兰波、张晓安、李卓、魏小宝、陈思翔、李云燕、刘贤明、卢明、王阳、张尚航\n  - 发表日期：2025年7月31日\n  - 任务：端到端\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 
FastDriveVLA是一个新颖的基于重建的视觉标记剪枝框架，专为自动驾驶设计。它包含一个名为ReconPruner的即插即用视觉标记剪枝器，该剪枝器通过MAE风格的像素重建优先处理前景信息。\n    - 引入了一种对抗性的前景-背景重建策略来训练ReconPruner，并创建了一个新的大规模数据集nuScenes-FG，其中包含24.1万张带有前景区域标注的图像-掩码对。\n    - 在nuScenes开环规划基准测试中，不同剪枝比例下均取得了最先进的结果，通过降低VLA模型中长视觉标记的计算成本，实现了高效的端到端驾驶。\n\n- [FASTopoWM：基于潜在世界模型的快慢车道段拓扑推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23325)\n  - 杨一鸣、林洪斌、罗悦如、傅苏忠、郑超、闫欣睿、梅淑琪、唐坤、崔书光、李振\n  - 发表日期：2025年7月31日\n  - 任务：感知、推理\n  - 数据集：[OpenLane-V2](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FOpenLane-V2)\n  - 摘要：\n    - FASTopoWM是一个新颖的快慢车道段拓扑推理框架，通过潜在世界模型增强，以实现全面的BEV道路场景理解。\n    - 该框架能够并行监督历史查询和新查询，以减少姿态估计失败的影响；同时引入潜在查询和BEV世界模型，以提升时间维度上的感知能力。\n    - 在OpenLane-V2基准测试中，该框架在车道段检测和中心线感知方面表现出最先进的性能。\n\n- [用于实时自动驾驶的视觉-语言交叉注意力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23064)\n  - 桑托什·帕塔帕蒂、特里桑斯·斯里尼瓦桑、穆拉里·安巴蒂\n  - 发表日期：2025年7月30日\n  - 任务：端到端\n  - 数据集：[MD-NEX户外驾驶](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - XYZ-Drive是一套用于自动驾驶的单模态视觉-语言模型，能够读取前视摄像头画面、俯视地图和目标点信息，输出转向和速度指令。\n    - 它采用轻量级的目标导向交叉注意力层，在使用微调后的LLaMA-3.2 11B模型进行处理之前，融合目标点、图像和地图特征。\n    - 该模型在MD-NEX基准测试中取得了95%的成功率，超越了现有方法；消融实验进一步证实了各模态及融合机制的重要性。\n\n- [SafeDriveRAG：基于知识图谱检索增强生成的安全自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.21585)\n  - 叶浩、齐梦实、刘兆宏、刘亮、马华东\n  - 发表日期：2025年7月29日\n  - 代码：[SafeDriveRAG](https:\u002F\u002Fgithub.com\u002FLumos0507\u002FSafeDriveRAG)\n  - 任务：VQA\n  - 数据集：[SafeDrive228K](https:\u002F\u002Fgithub.com\u002FLumos0507\u002FSafeDriveRAG)\n  - 摘要：\n    - 提出了SafeDrive228K，这是一个大规模多模态VQA基准，包含228,000个示例，覆盖18个子任务，用于评估驾驶场景中的交通安全理解能力。\n    - 介绍了一种即插即用的多模态知识图谱检索增强生成（RAG）框架，并配备了多尺度子图检索算法。\n    - 实验证明，该RAG框架显著提升了五种主流VLM在安全关键任务上的性能。\n\n- [DriveAgent-R1：通过主动感知与混合思维推进基于VLM的自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.20879)\n  - 郑伟成、毛晓飞、叶南飞、李鹏翔、詹坤、郎献鹏、赵航\n  - 发表日期：2025年7月28日\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - DriveAgent-R1是首个具备主动感知功能的自动驾驶规划智能体，它会主动调用视觉推理工具，以视觉证据为基础做出决策。\n    - 引入了一种混合思维框架，可根据场景复杂度自适应地在纯文本推理和工具辅助的视觉推理之间切换。\n    - 采用三阶段渐进式训练策略，核心为级联强化学习阶段，在Drive-Internal和nuScenes数据集上以30亿参数实现了具有竞争力的性能。\n\n- [VESPA：迈向无人（类人）监督的开放世界点云标注，用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.20397)\n  - 莱文特·滕普夫利、埃斯特班·里韦拉、马库斯·连坎普\n  - 出版单位：慕尼黑工业大学\n  - 发表日期：2025年7月27日\n  - 任务：感知\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - VESPA是一个多模态自动标注流水线，将LiDAR几何信息与相机语义信息融合，从而在无需真值或高精地图的情况下实现可扩展的3D伪标签生成。\n    - 利用视觉-语言模型（VLM）直接在点云域内进行开放词汇的物体标注与检测优化，支持新类别发现。\n\n- [VLMPlanner：将视觉语言模型与运动规划相结合](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.20342)\n  - 唐志鹏、张莎、邓家俊、王晨杰、游国梁、黄玉婷、林欣睿、张燕勇\n  - 发表日期：2025年7月27日\n  - 任务：规划\n  - 数据集：[nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - 摘要：\n    - VLMPlanner是一个混合框架，将基于学习的实时规划器与视觉-语言模型（VLM）结合，后者通过对原始多视角图像进行推理来生成鲁棒的轨迹。\n    - 引入了上下文适应性推理门控（CAI-Gate），可根据场景复杂度动态调整VLM的推理频率，从而在性能与计算效率之间取得平衡。\n    - 在nuPlan基准上进行了评估，结果表明其在道路条件复杂且存在动态元素的场景中表现出更优的规划能力。\n\n- [BEV-LLM：利用多模态鸟瞰图地图进行自动驾驶场景描述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.19370)\n  - 费利克斯·布兰德施泰特、埃里克·舒茨、卡塔琳娜·温特、法比安·弗洛尔\n  - 发表日期：2025年7月25日\n  - 任务：感知\n  - 数据集：[nuCaption](https:\u002F\u002Fwww.nuscenes.org\u002Fnuimages)、[nuView](https:\u002F\u002Fwww.nuscenes.org\u002Fnuimages)、[GroundView](https:\u002F\u002Fwww.nuscenes.org\u002Fnuimages)\n  - 摘要：\n    - BEV-LLM是一种用于自动驾驶3D场景描述的轻量级模型，利用BEVFusion将LiDAR数据和多视角图像结合，并引入了一种新颖的绝对位置编码。\n    - 在nuCaption数据集上取得了具有竞争力的性能，BLEU分数相比最先进方法最高提升了5%，且仅使用了一个10亿参数的小型基础模型。\n    - 同时提出了并评测了两个新的数据集——nuView和GroundView，以更好地评估不同驾驶场景下的场景描述能力和目标定位精度。\n\n- 
[BetterCheck：迈向保障汽车感知系统中VLMs的安全性](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.17722)\n  - 马尔沙·阿沙尼·马哈瓦塔·多纳、贝亚特丽斯·卡布雷罗-丹尼尔、于一楠、克里斯蒂安·贝格尔\n  - 出版单位：哥德堡大学\n  - 发表日期：2025年7月23日\n  - 任务：感知\n  - 数据集：[Waymo开放数据集](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 摘要：\n    - 提出了BetterCheck方法，用于检测视觉-语言模型（VLM）中的幻觉现象，以保障其在汽车感知系统中的安全性。\n    - 系统性地评估了3种最先进的VLM在Waymo开放数据集中的多种交通场景下的表现，发现它们虽然具备较强的场景理解能力，但仍易产生幻觉。\n\n- [VLM-UDMC：面向城市自动驾驶的VLM增强型统一决策与运动控制](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.15266)\n  - 刘海超、郭浩然、刘沛、马本善、张宇翔、马军、李同恒\n  - 发表日期：2025年7月21日\n  - 代码：[VLM-UDMC](https:\u002F\u002Fgithub.com\u002Fhenryhcliu\u002Fvlmudmc.git)\n  - 任务：规划\n  - 摘要：\n    - 提出了VLM-UDMC框架，这是一种基于视觉-语言模型的城市自动驾驶统一决策与运动控制方案，融入了场景推理和风险感知洞察。\n    - 该框架采用两步推理策略，在上层慢速系统中运用检索增强生成（RAG），根据实时环境变化动态调整运动规划。\n    - 使用轻量级多核分解LSTM对交通参与者进行实时轨迹预测，并通过仿真和实际道路试验验证了该框架的有效性。\n\n- [AGENTS-LLM：基于智能体LLM框架的挑战性交通场景增强生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.13729)\n  - 作者：Yao Yu、Salil Bhatnagar、Markus Mazzola、Vasileios Belagiannis、Igor Gilitschenski、Luigi Palmieri、Simon Razniewski、Marcel Hallgarten\n  - 发表日期：2025年7月18日\n  - 任务：规划\n  - 摘要：\n    - 提出了一种新颖的基于LLM智能体的框架，利用自然语言描述来增强现实世界中的交通场景，从而为自动驾驶规划器生成具有挑战性的测试用例。\n    - 采用智能体设计，实现对场景生成的精细控制，并能够在较小且经济高效的LLM上保持高性能，避免了对海量数据集或人工专家标注的依赖。\n\n- [Orbis：克服驾驶世界模型中长时距预测的挑战](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.13162)\n  - 作者：Arian Mousakhan、Sudhanshu Mittal、Silvio Galesso、Karim Farid、Thomas Brox\n  - 出版单位：弗赖堡大学\n  - 发表日期：2025年7月17日\n  - 项目页面：[Orbis](https:\u002F\u002Flmb-freiburg.github.io\u002Forbis.github.io\u002F)\n  - 代码：[Orbis](https:\u002F\u002Flmb-freiburg.github.io\u002Forbis.github.io\u002F)\n  - 任务：预测\n  - 摘要：\n    - 这是一个驾驶世界模型，旨在通过简单的设计选择，在无需额外监督或传感器的情况下，克服长时距生成和复杂场景泛化方面的挑战。\n    - 仅使用4.69亿参数并在280小时视频数据上训练，便达到了最先进的性能，尤其在转弯操作和城市交通等复杂场景中表现出色。\n    - 引入了一种混合分词器用于对比实验，结果表明，连续自回归模型比基于离散标记的模型更稳健、更强大。\n\n- [基于世界模型的端到端场景生成用于自动驾驶中的事故预测](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.12762)\n  - 作者：Guan Yanchen、Liao Haicheng、Wang Chengyue、Liu Xingcheng、Zhang Jiaxun、Li Zhenning\n  - 发表日期：2025年7月17日\n  - 任务：生成\n  - 数据集：[新基准数据集](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.12762)\n  - 摘要：\n    - 提出了一种将场景生成增强与自适应时间推理相结合的框架，以可靠地预测事故，解决数据稀缺和缺失物体级线索的问题。\n    - 开发了一个由领域知识驱动的提示引导的世界模型视频生成流水线，用于创建高分辨率且统计一致的驾驶场景，从而丰富边缘案例。\n    - 构建了一个动态预测模型，结合强化的图卷积和扩张时间算子来处理数据不完整性和视觉噪声问题，并发布了一个新的基准数据集。\n\n- [ReAL-AD：迈向端到端自动驾驶中的人类式推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.12499)\n  - 作者：Lu Yuhang、Tu Jiadong、Ma Yuexin、Zhu Xinge\n  - 出版单位：4DV实验室\n  - 发表日期：2025年7月16日\n  - 项目页面：[ReAL-AD](https:\u002F\u002F4dvlab.github.io\u002Fproject_page\u002Frealad)\n  - 任务：端到端\n  - 摘要：\n    - 提出了ReAL-AD框架，即一种基于人类认知三层模型——驾驶策略、驾驶决策和驾驶操作——构建的推理增强学习框架。\n    - 集成视觉-语言模型（VLM）以提升态势感知能力，并引入战略推理注入器、战术推理集成器以及层次化轨迹解码器，实现分层推理和轨迹执行。\n    - 大量评估表明，该框架可将规划准确性和安全性提高30%以上，使端到端自动驾驶更具可解释性，并更贴近人类的推理方式。\n\n- [Unreal就是一切：仅需一个引擎即可实现多模态ISAC数据仿真](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.08716)\n  - 作者：Huang Kongwu、Mu Shiyi、Jiang Jun、Gao Yuan、Xu Shugong\n  - 出版单位：上海大学\n  - 发表日期：2025年7月11日\n  - 代码：[Great-MCD](https:\u002F\u002Fgithub.com\u002Fhkw-xg\u002FGreat-MCD)\n  - 任务：生成\n  - 数据集：[Great-MSD](https:\u002F\u002Fgithub.com\u002Fhkw-xg\u002FGreat-MCD)\n  - 摘要：\n    - 提出了Great-X平台，这是一个单引擎多模态数据孪生平台，可在Unreal Engine中重建Sionna的射线追踪技术，并与自动驾驶工具集成，从而高效同步地模拟CSI、RGB、雷达和激光雷达数据。\n    - 构建了一个开源的大规模低空无人机多模态共感觉数据集Great-MSD，并提出了一种基于CSI的无人机3D定位基准算法，证明其在不同CSI仿真引擎中的可行性。\n\n- [VisioPath：视觉-语言增强的模型预测控制用于混合交通中的安全自主导航](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.06441)\n  - 作者：Wang Shanting、Typaldos Panagiotis、Li Chenjun、Malikopoulos Andreas A.\n  - 
发表日期：2025年7月8日\n  - 任务：规划\n  - 数据集：[SUMO](https:\u002F\u002Fwww.eclipse.org\u002Fsumo\u002F)\n  - 摘要：\n    - 提出了一种将视觉-语言模型（VLM）与模型预测控制（MPC）相结合的新框架，用于在动态交通环境中实现安全的自动驾驶。\n    - 利用鸟瞰视角管道和零样本VLM提取结构化的车辆信息，构建椭圆形避撞势场进行轨迹规划。\n    - 实现了一个通过微分动态规划求解的有限时域最优控制问题，并采用自适应正则化和事件触发的MPC循环，同时包含安全验证层。\n\n- [LeAD：融合端到端自动驾驶的LLM增强规划系统](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.05754)\n  - 作者：Zhang Yuhang、Liu Jiaqi、Xu Chengkai、Hang Peng、Sun Jian\n  - 发表日期：2025年7月8日\n  - 任务：规划\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - LeAD是一种双速率自动驾驶架构，将基于模仿学习的端到端框架与大型语言模型（LLM）增强相结合，以提升场景理解和决策能力。\n    - 系统采用高频E2E子系统进行实时循环，而低频LLM模块则利用多模态感知融合和思维链推理来处理复杂场景和边缘情况。\n\n- [NRSeg：基于驾驶世界模型的BEV语义分割抗噪学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.04002)\n  - 作者：Li Siyu、Teng Fei、Cao Yihong、Yang Kailun、Li Zhiyong、Wang Yaonan\n  - 发表日期：2025年7月5日\n  - 代码：[NRSeg](https:\u002F\u002Fgithub.com\u002Flynn-yu\u002FNRSeg)\n  - 任务：感知\n  - 摘要：\n    - 提出了NRSeg框架，这是一种基于驾驶世界模型生成的合成数据，用于BEV语义分割的抗噪学习。\n    - 引入了透视几何一致性度量（PGCM）来评估生成数据的指导能力，并设计了一个双分布并行预测模块（BiDPP）以增强模型的鲁棒性。\n    - 在无监督和半监督的BEV语义分割任务中，mIoU分别提升了13.8%和11.4%，达到了当前最先进的水平。\n\n- [FMOcc：基于TPV驱动流匹配的三维占用预测，采用选择性状态空间模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.02250)\n  - 陈江霞、黄通远、宋科\n  - 发表日期：2025年7月3日\n  - 任务：预测\n  - 数据集：[Occ3D-nuScenes](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FOcc3D)、[OpenOcc](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FOpenOcc)\n  - 摘要：\n    - 提出FMOcc，一种结合流匹配选择性状态空间模型的三视角（TPV）精炼占用网络，用于少帧数的3D占用预测。\n    - 引入流匹配SSM模块（FMSSM）以生成缺失特征，并设计带有平面选择性SSM（PS3M）的TPV SSM层，对TPV特征进行选择性过滤，从而提升效率并改善远距离场景的预测效果。\n    - 设计掩码训练（MT）方法，增强对传感器数据丢失的鲁棒性，在Occ3D-nuScenes和OpenOcc数据集上取得当前最优结果。\n\n- [VLAD：基于VLM增强的分层规划与可解释决策过程的自动驾驶框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01284)\n  - 克里斯蒂安·加里博尔迪、时田隼人、金城健、浅田优希、亚历山大·卡巴略\n  - 发表日期：2025年7月2日\n  - 任务：端到端\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - VLAD是一种视觉-语言自动驾驶模型，将微调后的VLM与VAD端到端系统集成，利用自定义问答数据集增强空间推理能力。\n    - 该系统能够生成高层导航指令，并为驾驶决策提供可解释的自然语言说明，从而提高透明度。\n    - 在nuScenes上的评估显示，相比基线方法，平均碰撞率降低了31.82%。\n\n- [基于LLM的真实安全关键驾驶视频生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01264)\n  - 傅永杰、查瑞建、田培、狄轩\n  - 发表日期：2025年7月2日\n  - 任务：生成\n  - 数据集：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 摘要：\n    - 提出一种新颖的框架，利用大型语言模型（LLM）进行少样本代码生成，自动在CARLA模拟器中合成多样且具有安全关键性的驾驶场景。\n    - 集成Cosmos-Transfer1与ControlNet的视频生成流水线，将渲染的仿真场景转换为逼真的驾驶视频，弥合了仿真与真实之间的外观差距。\n    - 能够可控地生成罕见的边缘场景，如被遮挡的行人横道或突然的车辆切入，用于自动驾驶车辆的仿真测试。\n\n- [World4Drive：基于意图感知物理潜在世界模型的端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.00603)\n  - 郑宇鹏、杨鹏轩、邢泽彬、张启超、郑宇航、高银峰、李鹏飞、张腾、夏仲璞、贾鹏、赵东斌\n  - 出版单位：中国科学院大学\n  - 发表日期：2025年7月1日\n  - 代码：[World4Drive](https:\u002F\u002Fgithub.com\u002Fucaszyp\u002FWorld4Drive)\n  - 任务：端到端\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - World4Drive是一个端到端自动驾驶框架，利用视觉基础模型构建潜在世界模型，无需感知标注即可生成和评估多模态规划轨迹。\n    - 该框架提取场景特征和意图，生成轨迹，预测潜在空间中的未来状态，并通过自监督对齐的筛选模块选择最佳轨迹。\n    - 在nuScenes和NavSim基准测试中表现出色，L2误差和碰撞率显著降低，同时训练收敛速度更快。\n\n- [当数字孪生遇见大型语言模型：面向自动驾驶的真实、交互式可编辑仿真](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.00319)\n  - 坦迈·维拉斯·萨马克、钦迈·维拉斯·萨马克、李冰、文卡特·克罗维\n  - 发表日期：2025年6月30日\n  - 任务：生成\n  - 摘要：\n    - 提出一个统一框架，用于创建高保真数字孪生，以加速自动驾驶研究，兼顾动力学保真度、照片级渲染、情境相关的场景编排以及实时性能。\n    - 利用基于物理和数据驱动的技术实现从现实到仿真重建，达到几何和照片级的精度，并为资产注入物理属性以支持实时动力学仿真。\n    - 集成大型语言模型（LLM）接口，可通过自然语言提示在线灵活编辑驾驶场景，展现出高度的结构相似性、帧率和提示处理能力。\n\n- [StyleDrive：迈向驾驶风格感知的端到端自动驾驶基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.23982)\n  - 郝睿阳、景博文、于海宝、聂再青\n  - 
发表日期：2025年6月30日\n  - 项目页面：[StyleDrive](https:\u002F\u002Fstyledrive.github.io\u002F)\n  - 任务：评估\n  - 数据集：[StyleDrive](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FRyhn98\u002FStyleDrive-Dataseta)\n  - 摘要：\n    - 介绍首个大规模真实世界数据集，专为个性化端到端自动驾驶（E2EAD）而精心策划，通过微调的视觉-语言模型将场景拓扑与动态上下文及语义相结合。\n    - 提出混合标注流程，结合行为分析、启发式方法和VLM推理，并通过人机协作验证进行优化。\n    - 建立首个标准化基准，用于评估个性化E2EAD模型，表明融入驾驶偏好有助于提升行为与人类示范的一致性。\n\n- [Epona：用于自动驾驶的自回归扩散世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.24113)\n  - 张凯文、唐振宇、胡晓涛、潘兴刚、郭晓阳、刘源、黄敬伟、袁力、张倩、龙小肖、曹勋、尹伟\n  - 出版单位：清华大学\n  - 发表日期：2025年6月30日\n  - 项目页面：[Epona](https:\u002F\u002Fgithub.com\u002FKevin-thu\u002FEpona\u002F)\n  - 代码：[Epona](https:\u002F\u002Fgithub.com\u002FKevin-thu\u002FEpona\u002F)\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - Epona是一种用于自动驾驶的自回归扩散世界模型，通过解耦时空分解和模块化轨迹\u002F视频预测，实现局部时空分布建模。\n    - 引入一种新的前向训练链策略，以解决自回归循环中的误差累积问题，取得了当前最优性能，FVD指标提升了7.4%，预测时长延长数分钟。\n    - 学习到的世界模型可用作实时运动规划器，在NAVSIM基准测试中表现优于强大的端到端规划器。\n\n- [DriveMRP：利用合成运动数据增强视觉-语言模型以进行运动风险预测](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.02948)\n  - 侯志毅、马恩辉、李芳、赖志毅、何嘉乐、吴展谦、周立军、陈龙、孙驰天、孙海洋、王兵、陈光、叶航俊、俞凯成\n  - 发表日期：2025年6月28日\n  - 代码：[DriveMRP](https:\u002F\u002Fgithub.com\u002FSII-HZY\u002FDriveMRP)\n  - 任务：预测\n  - 摘要：\n    - 提出DriveMRP方法，通过基于鸟瞰视角（BEV）的仿真技术合成高风险运动数据，从而增强视觉-语言模型（VLM）的运动风险预测能力。\n    - 提出了DriveMRP-Agent框架，采用一种新颖的信息注入策略，用于全局上下文、自我视角和轨迹投影，以提升空间推理能力。\n    - 实验表明，该方法在合成数据上的事故识别准确率从27.13%提升至88.03%，在零样本真实场景评估中则从29.42%提升至68.50%，性能显著提升。\n\n- [基于案例推理增强大型语言模型的框架：用于现实安全关键驾驶场景下的决策](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.20531)\n  - 甘文斌、道明山、泽津浩二\n  - 出版单位：NICT、兵库大学\n  - 发表日期：2025年6月25日\n  - 任务：规划\n  - 摘要：\n    - 提出了一种基于案例推理增强大型语言模型（CBR-LLM）的框架，用于复杂且安全关键的驾驶场景中的规避动作决策。\n    - 将行车记录仪视频中的语义场景理解与相关过往驾驶案例的检索相结合，生成情境敏感且符合人类行为规范的动作建议。\n    - 在多个大型语言模型上均表现出更高的决策准确性、更优质的理由说明以及与人类专家的一致性；通过风险感知提示和基于相似性的案例检索进一步提升了性能。\n\n- [统一视觉-语言-动作模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.19850)\n  - 王宇奇、李兴航、王文轩、张俊博、李英燕、陈云涛、王新龙、张兆祥\n  - 发表日期：2025年6月24日\n  - 任务：规划\n  - 数据集：[CALVIN](https:\u002F\u002Fgithub.com\u002Fmees\u002Fcalvin)、[LIBERO](https:\u002F\u002Flibero-project.github.io\u002F)、[Simplenv-Bridge](https:\u002F\u002Fsimpler-env.github.io\u002F)、[ALOHA](https:\u002F\u002Fmobile-aloha.github.io\u002F)\n  - 摘要：\n    - UniVLA是一种统一且原生的多模态VLA模型，能够自回归地将视觉、语言和动作信号建模为离散的标记序列，从而实现灵活的多模态任务学习。\n    - 在后训练阶段引入世界建模，以捕捉视频中的因果动态，促进向下游策略学习的有效迁移，尤其适用于长时程任务。\n    - 在仿真基准测试中取得了新的最先进成果（例如在LIBERO上达到95.5%的准确率），并在真实的ALOHA操作和自动驾驶任务中展现出广泛的适用性。\n\n- [Drive-R1：利用强化学习在视觉-语言模型中弥合自动驾驶中的推理与规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.18234)\n  - 李悦、田萌、朱德昌、朱江通、林振宇、熊志伟、赵新海\n  - 发表日期：2025年6月23日\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[DriveLM-nuScenes](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM)\n  - 摘要：\n    - 提出Drive-R1，这是一种针对自动驾驶场景设计的领域特定视觉-语言模型，旨在连接场景推理与运动规划。\n    - 采用两阶段训练方法：首先在包含思维链推理的数据集上进行监督微调，随后通过强化学习使推理路径与规划结果相一致。\n\n- [DRAMA-X：面向驾驶的细粒度意图预测与风险推理基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.17590)\n  - 米希尔·戈博勒、高翔波、涂正中\n  - 出版单位：加州大学伯克利分校、密歇根大学\n  - 发表日期：2025年6月21日\n  - 任务：评估\n  - 数据集：[DRAMA](https:\u002F\u002Fusa.honda-ri.com\u002Fdrama)\n  - 摘要：\n    - 引入DRAMA-X，这是一个基于自动化标注流水线构建的细粒度基准测试，用于评估安全关键驾驶场景中的多类别意图预测和风险推理。\n    - 提出SGG-Intent轻量级无训练推理框架，该框架利用视觉-语言模型支持的场景图生成器和大型语言模型进行组合式推理，以完成目标检测、意图预测、风险评估及行动建议。\n    - 提供了对四个相互关联的自动驾驶任务的结构化评估，并证明基于场景图的推理能够提升意图和风险预测的准确性，尤其是在明确上下文建模的情况下。\n\n- [NetRoller：通用模型与专用模型的接口，用于端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14589)\n  - 任鑫、刘洪基、梅晓东、刘文儒、叶茂盛、陈志力、马军\n  
- 发表日期：2025年6月17日\n  - 代码：[NetRoller](https:\u002F\u002Fgithub.com\u002FRex-sys-hk\u002FNetRoller)\n  - 任务：端到端\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)\n  - 摘要：\n    - 提出NetRoller，一种具有创新机制的适配器，可无缝集成大型语言模型等通用模型与自动驾驶专用模型，以解决异步系统挑战。\n    - 引入三阶段接口方法：通过提前停止高效提取大型语言模型表示，实现稳健的跨模态转换，并借助查询偏移和特征偏移机制提升专用模型性能。\n    - 在nuScenes数据集上展示了规划任务中人类相似性和安全性以及检测\u002F地图构建任务精度的显著提升，使专用模型能够在保持原生频率的同时具备通用模型的情境感知能力。\n\n- [ADRD：基于规则驱动决策系统的大型语言模型自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14299)\n  - 曾凡志、王思琪、朱楚昭、李莉\n  - 发表日期：2025年6月17日\n  - 任务：规划\n  - 摘要：\n    - ADRD是一种新型框架，利用大型语言模型生成可执行的规则驱动决策系统，以实现可解释的自动驾驶。\n    - 该框架集成了三个核心模块：信息模块、代理模块和测试模块，用于迭代生成和优化驾驶策略。\n    - 相比传统的强化学习和先进的大型语言模型方法，ADRD在可解释性、响应速度和驾驶性能方面表现出色。\n\n- [面向集成视觉-语言模型的真实世界自动驾驶分层测试平台](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14100)\n  - 周宇鹏、崔灿、彭俊桐、杨子冲、陆俊武、潘查尔·吉特什、姚彬、王子然\n  - 出版单位：普渡大学\n  - 发表日期：2025年6月17日\n  - 任务：评估\n  - 摘要：\n    - 本文介绍了一种轻量级、结构化且低延迟的车载中间件流水线，并在封闭测试赛道上开发了一种可定制的真实世界交通场景形式。\n\n- [AutoVLA：一种用于端到端自动驾驶的视觉-语言-动作模型，具备自适应推理与强化微调功能](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13757)\n  - 周泽伟、蔡天辉、赵塞斯·Z、张云、黄志宇、周博磊、马佳琪\n  - 出版单位：加州大学洛杉矶分校\n  - 发表日期：2025年6月16日\n  - 项目页面：[AutoVLA](https:\u002F\u002Fautovla.github.io\u002F)\n  - 代码：[AutoVLA](https:\u002F\u002Fgithub.com\u002Fucla-mobility\u002FAutoVLA)\n  - 任务：规划\n  - 数据集：[nuPlan](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)、[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)、[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen)、[Bench2Drive](https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FBench2Drive)（使用[CARLA-Garage数据集](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fcarla_garage)进行训练）\n  - 摘要：\n    - AutoVLA是一个端到端自动驾驶框架，利用预训练的VLM骨干网络，并结合物理离散动作标记。\n    - 采用GRPO方法实现自适应推理，进一步提升模型在端到端驾驶任务中的性能。\n\n- [关于视觉-语言模型在自动驾驶中对视觉感知攻击的天然鲁棒性](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.11472)\n  - 佩德拉姆·莫哈杰尔安萨里、阿米尔·萨拉普尔、迈克尔·库尔、黄思宇、穆罕默德·哈马德、塞巴斯蒂安·施泰因霍斯特、哈比卜·奥卢福沃比、梅特·D·佩塞\n  - 出版单位：克莱姆森大学、慕尼黑工业大学、德克萨斯大学阿灵顿分校\n  - 发表日期：2025年6月13日\n  - 任务：感知\n  - 代码：[V2LM](https:\u002F\u002Fgithub.com\u002Fpedram-mohajer\u002FV2LM)\n  - 摘要：\n    - 本文介绍了车辆视觉语言模型（V2LMs），即专为自动驾驶感知任务微调过的VLM。\n\n- [ReCogDrive：一种用于端到端自动驾驶的强化认知框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08052)\n  - 李永康、熊凯欣、郭向宇、李芳、闫思旭、徐刚伟、周利军、陈龙、孙海阳、王兵、陈光、叶航俊、刘文宇、王兴刚\n  - 出版单位：华中科技大学、小米电动车\n  - 发表日期：2025年6月9日\n  - 项目页面：[ReCogDrive](https:\u002F\u002Fxiaomi-research.github.io\u002Frecogdrive\u002F)\n  - 代码：[ReCogDrive](https:\u002F\u002Fxiaomi-research.github.io\u002Frecogdrive\u002F)\n  - 任务：规划\n  - 数据集：[NAVSIM](https:\u002F\u002Fgithub.com\u002Fautonomousvision\u002Fnavsim)\n  - 摘要：\n    - ReCogDrive是一种新颖的强化认知框架，适用于端到端自动驾驶，包含一个视觉-语言模型（VLM）以及基于扩散的规划器，并通过强化学习进一步优化。\n    - ReCogDrive采用三阶段训练范式，首先进行驾驶预训练，随后进行模仿学习和GRPO强化学习，以提升视觉-语言-动作（VLA）系统的规划能力。\n\n- [STSBench：面向自动驾驶中多模态大型语言模型的时空场景基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06218)\n  - 克里斯蒂安·弗鲁维尔斯-赖辛格、杜尚·马利奇、林伟、大卫·希纳格尔、塞缪尔·舒尔特、霍斯特·波塞格\n  - 出版单位：格拉茨工业大学、嵌入式机器学习克里斯蒂安·多普勒实验室、约翰内斯·开普勒林茨大学、亚马逊\n  - 发表日期：2025年6月6日\n  - 任务：问答\n  - 代码：[STSBench](https:\u002F\u002Fgithub.com\u002FLRP-IVC\u002FSTSBench)\n  - 数据集：[STSBench](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fivc-lrp\u002FSTSBench)\n  - 摘要：\n    - STSBench是一个从大规模自动驾驶数据集中自动挖掘场景的框架，这些数据集具有丰富的真值标注。\n    - 该框架应用于NuScenes数据集，提出了STSnu，这是首个基于全面3D感知评估VLM时空推理能力的基准测试。\n\n- [结构化标注加速端到端自动驾驶中的视觉-语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05442)\n  - 蒋浩、胡川、石宇康、何源、王科、张曦、张志鹏\n  - 出版单位：上海交通大学、KargoBot\n  - 发表日期：2025年6月5日\n  - 任务：视觉问答\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - 
摘要：\n    - 本文介绍了一个结构化且简洁的基准数据集NuScenes-S，它源自NuScenes数据集，包含机器友好的结构化表示。\n\n- [AD-EE：用于自动驾驶中快速可靠视觉-语言模型的早期退出机制](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05404)\n  - 黄连明、胡海波、崔宇飞、左嘉诚、吴尚宇、关楠、薛春杰\n  - 出版单位：香港城市大学、麦吉尔大学、MBZUAI、苏州大学\n  - 发表日期：2025年6月4日\n  - 任务：感知、视觉问答\n  - 数据集：[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)、以边缘案例为主的[CODA](https:\u002F\u002Fcoda-dataset.github.io\u002F)\n  - 摘要：\n    - AD-EE是一种早期退出框架，结合了自动驾驶领域的特点，并利用因果推断来识别最佳退出层。\n    - AD-EE提出了一种基于因果推断的方法，用于识别和分析最优的早期退出层，从而提升VLM的推理效率。\n\n- [DriveRX：一种用于跨任务自动驾驶的视觉-语言推理模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20665)\n  - 调牧溪、杨乐乐、尹洪波、王哲旭、王业杰、田大鑫、梁孔明、马展宇\n  - 出版单位：北京邮电大学、中关村研究院、北京航空航天大学\n  - 发表日期：2025年5月27日\n  - 项目页面：[DriveRX](https:\u002F\u002Fpris-cv.github.io\u002FDriveRX\u002F)\n  - 任务：视觉问答\n  - 摘要：\n    - AutoDriveRL是一个统一的训练框架，将自动驾驶建模为一个涉及四个核心任务的结构化推理过程。每个任务都被独立地建模为视觉-语言问答问题，并使用特定于任务的奖励模型进行优化，从而在不同的推理阶段提供细粒度的强化信号。\n    - 在此框架下，训练出了DriveRX，这是一种专为实时决策设计的跨任务推理VLM。\n\n- [WOMD-Reasoning：用于驾驶中交互推理的大规模数据集](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.04281)\n  - 李一恒、范存信、葛崇健、赵志豪、李晨然、许晨峰、姚华秀、富冢正义、周博磊、唐晨、丁明宇、詹伟 **ICML 2025**\n  - 出版单位：加州大学伯克利分校、加州大学洛杉矶分校、北卡罗来纳大学教堂山分校、德克萨斯大学奥斯汀分校\n  - 发表日期：2025年5月25日\n  - 代码：[WOMD-Reasoning](https:\u002F\u002Fgithub.com\u002Fyhli123\u002FWOMD-Reasoning)\n  - 数据集：[Waymo开放运动数据集](https:\u002F\u002Fwaymo.com\u002Fopen\u002Fdownload)\n  - 任务：视觉问答\n  - 摘要：\n    - WOMD-Reasoning是一个基于Waymo开放运动数据集（WOMD）构建的语言标注数据集，专注于描述和推理驾驶场景中的交互与意图。\n\n- [FutureSightDrive：面向自动驾驶的时空思维链视觉推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17685)\n  - 作者：曾爽、常鑫源、谢孟伟、刘欣然、白一凡、潘政、徐牧、魏星\n  - 出版单位：阿里巴巴集团、西安交通大学\n  - 发表日期：2025年5月23日\n  - 任务：生成、规划\n  - 代码：[FSDrive](https:\u002F\u002Fgithub.com\u002FMIV-XJTU\u002FFSDrive)\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - 摘要：\n    - 提出一种时空思维链推理方法，使模型能够从未来的时空维度进行视觉化思考，从而提升轨迹规划能力。\n    - 构建了一个统一的视觉生成与理解预训练范式。\n\n- [基于知识表示与大语言模型的人类语义化自动驾驶导航](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16498)\n  - 作者：奥古斯托·路易斯·巴拉迪尼、米格尔·安赫尔·索特洛\n  - 出版单位：阿卡拉大学\n  - 发表日期：2025年5月22日\n  - 任务：问答\n  - 摘要：\n    - 探讨如何利用大语言模型，将非正式的导航指令转化为结构化的逻辑推理规则，从而生成答案集编程规则。\n\n- [VL-SAFE：基于世界模型的视觉-语言引导安全强化学习框架，用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16377)\n  - 作者：曲延松、黄子林、盛子豪、陈建聪、陈思凯、塞缪尔·拉比\n  - 出版单位：普渡大学、威斯康星大学麦迪逊分校\n  - 发表日期：2025年5月22日\n  - 项目页面：[VL-SAFE](https:\u002F\u002Fys-qu.github.io\u002Fvlsafe-website\u002F)\n  - 代码：[VL-SAFE](https:\u002F\u002Fgithub.com\u002Fys-qu\u002Fvl-safe\u002Ftree\u002Fmain)\n  - 任务：框架\n  - 摘要：\n    - VL-SAFE是一个基于世界模型的安全强化学习框架，采用视觉-语言模型作为安全指导机制，专为离线安全策略学习而设计。\n\n- [DriveMoE：端到端自动驾驶中基于视觉-语言-动作模型的专家混合架构](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16278)\n  - 作者：杨振杰、柴怡琳、贾晓松、李启峰、邵宇倩、朱学凯、苏海生、严俊驰 **CVPR 2026**\n  - 出版单位：上海交通大学\n  - 发表日期：2025年5月22日\n  - 项目页面：[DriveMoE](https:\u002F\u002Fthinklab-sjtu.github.io\u002FDriveMoE\u002F)\n  - 代码：[DriveMoE](https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FDriveMoE)\n  - 数据集：[Bench2Drive](https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FBench2Drive)\n  - 任务：规划\n  - 摘要：\n    - DriveMoE是一种新颖的基于专家混合的端到端自动驾驶框架，包含场景专用的视觉专家网络和技能专用的动作专家网络。\n    - DriveMoE建立在我们的视觉-语言-动作（VLA）基线之上，该基线最初源自具身智能领域，名为Drive-π0。\n\n- [AgentThink：面向自动驾驶的视觉-语言模型工具增强型思维链统一框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15298)\n  - 作者：钱侃干、蒋思聪、钟洋、罗子昂、黄子林、朱天泽、姜坤、杨梦梦、傅正、苗金宇、史依宁、林何哲、刘莉、周天宝、黄宇、胡一飞、李广、陈光、叶浩、孙立军、杨殿阁\n  - 出版单位：清华大学、麦吉尔大学、小米公司、威斯康星大学麦迪逊分校\n  - 发表日期：2025年6月12日\n  - 任务：视觉问答\n  - 摘要：\n    - AgentThink是首个将动态代理式工具调用融入视觉-语言推理以支持自动驾驶任务的框架。\n    - 采用两阶段训练流程，结合监督微调与GRPO算法。\n\n- 
[扩展大型视觉-语言模型以应对自动驾驶中的多样化交互任务](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.08725)\n  - 作者：赵宗创、付浩宇、梁定康、周欣、张鼎元、谢宏伟、王兵、白翔\n  - 出版单位：华中科技大学、小米汽车\n  - 发表日期：2025年5月13日\n  - 任务：视觉问答\n  - 代码：[DriveMonkey](https:\u002F\u002Fgithub.com\u002Fzc-zhao\u002FDriveMonkey)\n  - 摘要：\n    - NuInteract是一个大规模数据集，旨在推动大型视觉-语言模型在自动驾驶领域的应用。该数据集包含23.9万张图像、3.4万帧视频以及超过150万个跨850个场景的图像-语言对，提供详尽的环境描述性文本及2D\u002F3D标注，支持2D\u002F3D视觉定位等任务，从而实现全面的感知、预测和规划。\n    - DriveMonkey是一个灵活的框架，可通过用户提示支持多种交互任务。\n\n- [迈向以人为中心的自动驾驶：融合大语言模型指导与强化学习的快慢架构](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.06875)\n  - 作者：许成凯、刘佳琪、郭一程、张宇航、彭航、孙健\n  - 出版单位：同济大学\n  - 发表日期：2025年5月11日\n  - 任务：规划\n  - 环境：[Highway-Env](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - 摘要：\n    - 提出一种“快慢”决策框架，将大语言模型用于高层次指令解析，同时结合强化学习智能体进行低层次实时决策。\n\n- [针对自动驾驶视觉-语言模型的自然反射后门攻击](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.06413)\n  - 作者：刘明、梁思远、考希克·豪拉德、王丽雯、陶大成、张文胜\n  - 出版单位：爱荷华州立大学、新加坡国立大学\n  - 发表日期：2025年5月9日\n  - 任务：视觉问答\n  - 数据集：[DriveLM](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM)\n  - 摘要：\n    - 提出一种基于自然反射的后门攻击方法，专门针对自动驾驶场景下的视觉-语言模型系统，旨在当特定视觉触发条件出现时，诱导系统产生显著的响应延迟。\n\n- [DSDrive：通过统一推理与规划，蒸馏大语言模型实现轻量级端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.05360)\n  - 作者：刘文儒、刘沛、马俊\n  - 出版单位：香港科技大学\n  - 发表日期：2025年5月8日\n  - 任务：规划\n  - 视频：[DSDrive](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=op8PzQurugY)\n  - 摘要：\n    - DSDrive是一个轻量级的端到端自动驾驶框架，采用小型大语言模型处理多模态输入，实现明确的推理与闭环规划。具体而言，我们利用知识蒸馏技术，使小型大语言模型能够承担推理和规划任务，从而提升其整体性能。\n    - 引入了一种新颖的基于航点的双头协同模块，用于连接高层推理与底层轨迹规划。\n\n- [X-Driver：基于视觉-语言模型的可解释自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.05098)\n  - 作者：刘伟、张继元、郑彬雄、胡宇峰、林英展、曾增峰\n  - 出版单位：哈尔滨工业大学、百度公司\n  - 发表日期：2025年5月8日\n  - 任务：视觉问答、规划\n  - 数据集：[Bench2Drive](https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FBench2Drive)\n  - 摘要：\n    - X-Driver是一个统一的多模态大语言模型框架，专为闭环自动驾驶设计，利用思维链和自回归建模来增强感知与决策能力。\n\n- [寻求碰撞：基于检索增强型大语言模型的自动驾驶在线安全关键场景生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00972)\n  - 梅月文、聂彤、孙健、田晔\n  - 出版单位：同济大学、香港理工大学\n  - 发表日期：2025年5月2日\n  - 任务：生成\n  - 数据集：[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 摘要：\n    - 提出了一种基于LLM的智能体框架，用于在线生成交互式且具有安全关键性的场景。\n    - 开发了一种记忆与检索机制，使LLM能够持续适应不断变化的场景。\n\n- [V3LMA：面向自动驾驶的视觉3D增强语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00156)\n  - 扬尼克·吕伯施泰特、埃斯特班·里韦拉、尼科·乌勒曼、马库斯·利恩坎普\n  - 出版单位：慕尼黑工业大学、慕尼黑机器人与机器智能研究所\n  - 发表日期：2025年4月30日\n  - 任务：VQA\n  - 数据集：[LingoQA](https:\u002F\u002Fgithub.com\u002Fwayveai\u002FLingoQA)\n  - 摘要：\n    - 提出并评估了V3LMA，这是一种新颖的方法，结合了LLM和LVLM的优势，以增强交通场景中的3D场景理解能力——且无需进行模型训练或微调。\n\n- [通过车载部署的大语言模型增强自动驾驶系统](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.11514)\n  - 尼古拉斯·鲍曼、胡成、帕维特里伦·西瓦索蒂林加姆、秦浩通、谢磊、米凯莱·马尼奥、卢卡·贝尼尼 **RSS 2025**\n  - 出版单位：苏黎世联邦理工学院、浙江大学\n  - 发表日期：2025年4月15日\n  - 任务：规划\n  - 摘要：\n    - 提出了一种混合架构，将低层级的模型预测控制（MPC）与本地部署的大语言模型（LLM）相结合，以提升决策能力和人机交互（HMI）。\n\n- [NuScenes-SpatialQA：自动驾驶中视觉-语言模型的空间理解与推理基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.03164)\n  - 天可欣、毛静睿、张云龙、蒋继万、周洋、涂正中\n  - 出版单位：德克萨斯农工大学、威斯康星大学麦迪逊分校\n  - 发表日期：2025年4月7日\n  - 项目页面：[NuScenes-SpatialQA](https:\u002F\u002Ftaco-group.github.io\u002FNuScenes-SpatialQA\u002F)\n  - 任务：VQA\n  - 摘要：\n    - NuScenes-SpatialQA是首个大规模基于真实标注的问答（QA）基准测试，专门用于评估VLM在自动驾驶中的空间理解和推理能力。\n\n- [OpenDriveVLA：迈向端到端自动驾驶的大规模视觉语言行动模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23463)\n  - 周兴成、韩旭元、杨峰、马云普、阿洛伊斯·C·克诺尔\n  - 出版单位：慕尼黑工业大学、慕尼黑路德维希-马克西米利安大学\n  - 发表日期：2025年3月30日\n  - 项目页面：[OpenDriveVLA](https:\u002F\u002Fdrivevla.github.io\u002F)\n  - 
代码：[OpenDriveVLA](https:\u002F\u002Fgithub.com\u002FDriveVLA\u002FOpenDriveVLA)\n  - 任务：VQA、规划\n  - 摘要：\n    - OpenDriveVLA是一种专为端到端自动驾驶设计的视觉-语言行动（VLA）模型。\n\n- [VLM-C4L：基于视觉-语言模型的自动驾驶持续核心数据学习与边缘场景优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23046)\n  - 胡海波、左家诚、楼阳、崔宇飞、王建平、关楠、王进、李永辉、薛春杰 **COLM 2025**\n  - 出版单位：香港城市大学、苏州大学、麦吉尔大学、鸿海研究院、穆罕默德·本·扎耶德人工智能大学\n  - 发表日期：2025年3月29日\n  - 项目页面：[VLM-C4L](https:\u002F\u002Fvlmc4l.site\u002F)\n  - 任务：感知\n  - 摘要：\n    - VLM-C4L是一个持续学习框架，引入视觉-语言模型（VLM）来动态优化和增强边缘场景数据集。该框架将VLM引导的高质量数据提取与核心数据重放策略相结合，使模型能够在保留先前常规场景性能的同时，逐步从多样化的边缘场景中学习，从而确保其在实际自动驾驶中的长期稳定性和适应性。\n\n- [自动驾驶中大型视觉-语言模型的细粒度评估](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.21505)\n  - 李悦、田萌、林振宇、朱江桐、朱德昌、刘海强、王子宁、张悦怡、熊志伟、赵新海\n  - 出版单位：中国科学技术大学、华为诺亚方舟实验室、加州大学伯克利分校\n  - 发表日期：2025年3月27日\n  - 代码：[VLADBench](https:\u002F\u002Fgithub.com\u002FDepth2World\u002FVLADBench)\n  - 任务：VQA\n  - 摘要：\n    - VLADBench专门用于严格评估VLM在自动驾驶中的能力。该基准测试采用分层结构，反映了可靠驾驶所需的复杂技能体系，从对基本场景和交通要素的理解，逐步过渡到高级推理和决策能力。\n\n- [ORION：基于视觉-语言指令驱动行动生成的全栈端到端自动驾驶框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.19755)\n  - 傅浩宇、张典坤、赵宗创、崔建峰、梁定康、张冲、张鼎远、谢宏伟、王兵、白翔\n  - 出版单位：华中科技大学、小米电动汽车\n  - 发表日期：2025年3月25日\n  - 任务：规划\n  - 项目页面：[ORION](https:\u002F\u002Fxiaomi-mlab.github.io\u002FOrion\u002F)\n  - 代码：[ORION](https:\u002F\u002Fgithub.com\u002Fxiaomi-mlab\u002FOrion)\n  - 数据集：[Bench2Drive](https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FBench2Drive)\n  - 摘要：\n    - ORION是一个由视觉-语言指令驱动行动生成的全栈端到端自动驾驶框架。ORION独特地结合了QT-Former来聚合长期历史上下文、大语言模型（LLM）用于驾驶场景推理，以及生成式规划器来进行精确的轨迹预测。\n\n- [AED：利用大语言模型自动发现自动驾驶策略中的有效且多样化漏洞](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.20804)\n  - 邱乐、徐泽来、谭齐鑫、唐文豪、于超、王宇\n  - 出版单位：清华大学、北京中关村研究院\n  - 发表日期：2025年3月24日\n  - 任务：规划\n  - 环境：[Highway-Env](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - 摘要：\n    - AED是一个利用大语言模型（LLM）自动发现自动驾驶策略中有效且多样化漏洞的框架。\n    - AED首先使用LLM自动设计强化学习训练的奖励函数。\n\n- [AutoDrive-QA：利用大型视觉-语言模型自动生成自动驾驶数据集的选择题](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.15778)\n  - 博什拉·哈利利、安德鲁·W·史密斯\n  - 出版单位：哥伦比亚大学\n  - 发表日期：2025年3月20日\n  - 任务：VQA\n  - 摘要：\n    - AutoDrive-QA是一个自动化流程，可以将现有的驾驶问答数据集（包括DriveLM、NuScenes-QA和LingoQA）转换为结构化的多项选择题（MCQ）格式。\n\n- [RAD：基于视觉-语言模型的元动作检索增强决策在自动驾驶中的应用](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.13861)\n  - 王宇晋、刘泉峰、蒋正欣、王天一、焦俊峰、褚洪庆、高炳昭、陈宏\n  - 出版单位：同济大学、耶鲁大学、德克萨斯大学奥斯汀分校\n  - 发表日期：2025年3月18日\n  - 任务：VQA\n  - 摘要：\n    - 提出一种检索增强决策（RAD）框架，这是一种新颖的架构，旨在提升视觉-语言模型在自动驾驶场景中可靠生成元动作的能力。\n\n- [面向自动驾驶多模态大语言模型场景理解的能力驱动评估框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11400)\n  - Tin Stribor Sohn、Philipp Reis、Maximilian Dillitzer、Johannes Bach、Jason J. 
Corso、Eric Sax\n  - 出版单位：保时捷股份公司、信息学研究中心、埃斯林根应用科学大学、密歇根大学、卡尔斯鲁厄理工学院\n  - 发表日期：2025年3月14日\n  - 任务：评估\n  - 摘要：\n    - 本文提出了一种针对自动驾驶领域多模态大语言模型的能力驱动评估的综合框架。该框架从语义、空间、时间及物理四个核心能力维度对场景理解进行结构化分析。\n\n- [DynRsl-VLM：利用动态分辨率视觉-语言模型提升自动驾驶感知能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11265)\n  - 周希睿、单连磊、桂晓林\n  - 出版单位：西安交通大学、中国科学院大学\n  - 发表日期：2025年3月14日\n  - 任务：VQA\n  - 摘要：\n    - DynRsl-VLM引入了一种动态分辨率图像输入处理方法，能够在捕捉图像中所有实体特征信息的同时，确保图像输入对于视觉Transformer（ViT）而言仍具有计算上的可处理性。\n\n- [SimLingo：基于语言-动作对齐的纯视觉闭环自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09594)\n  - 卡特琳·伦茨、陈龙、Elahe Arani、奥列格·西纳夫斯基 **CVPR 2025**\n  - 出版单位：Wayve、图宾根大学、图宾根人工智能中心\n  - 发表日期：2025年3月12日\n  - 任务：规划\n  - 数据集：[Carla Leaderboard V2](https:\u002F\u002Fleaderboard.carla.org\u002Fget_started_v2_0\u002F)、[Bench2Drive](https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FBench2Drive)\n  - 摘要：\n    - 一种基于视觉-语言模型的驾驶模型，在CARLA模拟器的官方CARLA Leaderboard 2.0和本地基准Bench2Drive上均取得了最先进的驾驶性能。\n    - 引入一项新任务“动作梦境”，并配套提出了收集指令-动作对的方法以及一个无需执行危险动作即可评估语言与动作理解关联性的基准。\n    - 该模型不仅具备优秀的驾驶性能，还整合了多项与语言相关的任务。\n\n- [CoT-Drive：利用大语言模型与思维链提示实现自动驾驶高效运动预测](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07234)\n  - 廖海成、孔翰林、王博楠、王承悦、叶旺、何正兵、徐成忠、李振宁 **IEEE TAI 2025**\n  - 出版单位：澳门大学、麻省理工学院\n  - 发表日期：2025年3月10日\n  - 任务：VQA\n  - 摘要：\n    - 提出一种师生知识蒸馏策略，将大语言模型先进的场景理解能力有效迁移到轻量级语言模型（LM）中，从而确保CoT-Drive能够在边缘设备上实现实时运行，同时保持全面的场景理解和泛化能力。\n\n- [自动驾驶视觉-语言模型安全认知能力评估](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.06497)\n  - 张恩明、龚培哲、戴兴源、吕一升、苗青海\n  - 出版单位：中国科学院大学\n  - 发表日期：2025年3月9日\n  - 代码：[SCD-Bench](https:\u002F\u002Fgithub.com\u002FEMZucas\u002FSCD-Bench)\n  - 任务：评估\n  - 摘要：\n    - 提出一种新的评估方法——安全认知驾驶基准测试（SCD-Bench），并开发了自动驾驶图文标注系统（ADA）。\n\n- [VLM-E2E：通过多模态驾驶员注意力融合提升端到端自动驾驶性能](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.18042)\n  - 刘沛、刘海鹏、刘海超、刘鑫、倪金鑫、马军\n  - 出版单位：香港科技大学（广州）、理想汽车、厦门大学、香港科技大学\n  - 发表日期：2025年2月25日\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - 摘要：\n    - VLM-E2E是一种新颖的框架，利用视觉-语言模型提供注意力线索以增强训练效果。\n    - 将文本表示融入鸟瞰视角（BEV）特征中进行语义监督，从而使模型能够学习到更丰富的特征表示，明确捕捉驾驶员的注意力语义。\n\n- [CurricuVLM：通过视觉-语言模型实现个性化安全关键课程学习，迈向安全自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.15119)\n  - 盛子豪、黄子林、曲延松、冷岳、Sruthi Bhavanam、陈思凯\n  - 出版单位：威斯康星大学麦迪逊分校、普渡大学\n  - 发表日期：2025年2月21日\n  - 任务：规划\n  - 项目页面：[CurricuVLM](https:\u002F\u002Fzihaosheng.github.io\u002FCurricuVLM\u002F)\n  - 数据集：[MetaDrive](https:\u002F\u002Fgithub.com\u002Fmetadriverse\u002Fmetadrive)、[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 摘要：\n    - CurricuVLM是一个新颖的框架，利用视觉-语言模型为自动驾驶智能体实现个性化课程学习。\n    - CurricuVLM是首个将视觉-语言模型用于闭环自动驾驶训练中动态课程生成的工作。\n\n- [V2V-LLM：基于多模态大语言模型的车车协同自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09980)\n  - 邱旭光、八百原亮、王建义、Stephen F. 
Smith、Frank Wang、陈敏弘\n  - 出版单位：英伟达、卡内基梅隆大学\n  - 发表日期：2025年2月14日\n  - 任务：VQA\n  - 摘要：\n    - 创建并发布V2V-QA数据集，以支持基于大语言模型的端到端协同自动驾驶方法的开发与评估。\n    - 提出一种用于协同自动驾驶的基线方法V2V-LLM，为V2V-QA提供初始基准。\n\n- [Occ-LLM：基于占用网格的大语言模型增强自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06419)\n  - 许天硕、陆浩、闫旭、蔡英杰、刘冰冰、陈颖聪 **ICRA 2025**\n  - 出版单位：香港科技大学（广州）、华为诺亚方舟实验室\n  - 发表日期：2025年2月10日\n  - 任务：感知、规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - 摘要：\n    - 提出一种用于自动驾驶的基于占用网格的大语言模型（Occ-LLM），展示了其在场景理解方面的优越性能。\n\n- [INSIGHT：通过视觉—语言模型实现情境感知的危险检测与边缘场景评估，提升自动驾驶安全性](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.00262)\n  - 陈典伟、张子凡、刘宇辰、杨显峰特里\n  - 出版单位：马里兰大学、北卡罗来纳州立大学\n  - 发表日期：2025年2月1日\n  - 任务：VQA\n  - 摘要：\n    - INSIGHT（语义与视觉输入融合的通用危险追踪系统）是一个分层的视觉—语言模型框架，旨在增强危险检测和边缘场景评估能力。\n\n- [LLM-attacker：利用大语言模型增强自动驾驶闭环对抗场景生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.15850)\n  - 梅岳文、聂彤、孙健、田烨 **IEEE TITS 2025**\n  - 出版单位：同济大学、香港理工大学\n  - 发表日期：2025年1月27日\n  - 视频：[LLM-attacker](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F15rROV_8LUcc2jXSuSNBHVOHMCFKSn__B\u002Fview)\n  - 任务：生成\n  - 数据集：[MetaDrive](https:\u002F\u002Fgithub.com\u002Fmetadriverse\u002Fmetadrive)、[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 摘要：\n    - LLM-attacker：一个基于大语言模型（LLMs）的闭环对抗场景生成框架。\n\n- [针对自动驾驶视觉—语言模型的黑盒对抗攻击](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.13563)\n  - 王璐、张天元、曲阳、梁思远、陈宇威、刘爱珊、刘向龙、陶大成\n  - 出版单位：北京航空航天大学、新加坡国立大学、中国航空工业发展研究中心、南洋理工大学\n  - 发表日期：2025年1月23日\n  - 任务：VQA\n  - 摘要：\n    - 级联对抗干扰（CAD）首次引入决策链破坏机制，通过生成并注入欺骗性语义，导致低层次推理失效，从而确保扰动在整个决策链条中持续有效。\n\n- [面向自动驾驶的多模态大语言模型知识蒸馏](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.09757)\n  - 迪普蒂·赫格德、拉吉夫·亚萨尔拉、蔡宏、韩世忠、阿普拉蒂姆·巴塔查里亚、什韦塔·马哈詹、刘丽婷、里希克·加雷帕利、维沙尔·M·帕特尔、法提赫·波里克利\n  - 出版单位：约翰斯·霍普金斯大学、高通AI研究\n  - 发表日期：2025年1月16日\n  - 任务：规划\n  - 摘要：\n    - DiMA是一个端到端的自动驾驶框架，通过将多模态大语言模型的知识蒸馏到基于视觉的规划器中，以确保对长尾事件的鲁棒性，同时保持高效。\n    - 提出了蒸馏任务以及以下代理任务，以对齐视觉规划器和多模态大语言模型的目标：(i) 掩码标记重建 (ii) 未来标记预测 (iii) 场景编辑。\n\n- [自动驾驶场景开发中的语言建模](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.09319)\n  - 青木俊明、富田隆史、河合达治、川上大辅、千田信夫\n  - 出版单位：日本高等科学技术研究院、高知大学、三菱电机株式会社\n  - 发表日期：2025年1月16日\n  - 任务：开发\n  - 摘要：\n    - 本研究提出了一种称为“车辆位置图”（CPD）的表示方法。CPD能够简洁地表达大量场景，尤其适用于场景分析和设计。\n\n- [视觉—语言模型在自动驾驶中行人行为与场景理解中的应用](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.06680)\n  - 高浩翔、赵宇\n  - 出版单位：Motional AD LLC、多伦多大学\n  - 发表日期：2025年1月12日\n  - 任务：VQA\n  - 摘要：\n    - 分析了将大语言模型的语义标签有效蒸馏到小型视觉网络中的方法，可用于复杂场景的语义表示，从而支持下游的规划与控制决策。\n\n- [DriVLM：自动驾驶领域中视觉—语言模型的领域适应](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.05081)\n  - 郑旭然、俞昌德\n  - 出版单位：韩国科学技术院\n  - 发表日期：2025年1月9日\n  - 任务：VQA\n  - 摘要：\n    - 探讨了小型多模态大语言模型及其应用在自动驾驶领域的效用。\n\n- [自动驾驶中的视觉—语言模型：基于CLIP的动态场景理解](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.05566)\n  - 穆罕默德·埃尔赫纳维、胡泰法·I·阿什卡尔、安德里·拉科托尼赖、塔夸·I·阿尔哈迪迪、艾哈迈德·贾伯、穆罕默德·阿布·塔米\n  - 出版单位：昆士兰科技大学、阿拉伯美国大学、哥伦比亚大学\n  - 发表日期：2025年1月9日\n  - 任务：场景理解\n  - 摘要：\n    - 本研究利用对比语言—图像预训练（CLIP）模型开发了一套动态场景检索系统，该系统可优化后实现实时部署于边缘设备。\n\n- [视觉—语言模型已准备好用于自动驾驶吗？从可靠性、数据与指标角度的实证研究](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04003)\n  - 谢绍远、孔令东、董宇豪、司马冲浩、张文伟、陈启 Alfred、刘子威、潘亮 **ICCV 2025**\n  - 项目页面：[DriveBench](https:\u002F\u002Fdrive-bench.github.io)\n  - 代码：[DriveBench](https:\u002F\u002Fgithub.com\u002Fdrive-bench\u002Ftoolkit)\n  - HuggingFace：[DriveBench](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fdrive-bench\u002Farena)\n  - 发表日期：2025年1月7日\n  - 任务：基准测试\n  - 摘要：\n    - DriveBench是一个基准数据集，旨在评估视觉—语言模型在17种场景下的可靠性（包括干净、损坏及纯文本输入），涵盖19,200帧、20,498组问答对、三种问题类型、四项主流驾驶任务，以及共12种流行的视觉—语言模型。\n\n- 
[通过上下文学习生成交通场景，以训练更优的运动规划器](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18086)\n  - 艾孜尔江·艾尔西兰 **AAAI 2025 口头报告**\n  - 出版单位：澳门大学\n  - 发表日期：2024年12月24日\n  - 任务：生成\n  - 环境：[CARLA](https:\u002F\u002Fcarla.org\u002F)\n  - 项目页面：[AutoSceneGen](https:\u002F\u002Fezharjan.github.io\u002FAutoSceneGen\u002F)\n  - 代码：[AutoSceneGen](https:\u002F\u002Fgithub.com\u002FEzharjan\u002FAutoSceneGen)\n  - 摘要：\n    - 提出了一种通用、灵活且经济高效的框架“AutoSceneGen”，可通过场景描述自动增强交通场景的多样性，从而加速仿真与测试过程。\n\n- [大型语言模型引导的深度强化学习在自动驾驶决策中的应用](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18511)\n  - 邵鹏、王振坡、李国强\n  - 出版单位：北京理工大学\n  - 发表日期：2024年12月24日\n  - 任务：规划\n  - 代码：[LGDRL](https:\u002F\u002Fgithub.com\u002Fbitmobility\u002FLGDRL)\n  - 环境：[HighwayEnv](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - 摘要：\n    - 提出了一种新颖的大型语言模型（LLM）引导的深度强化学习（LGDRL）框架，用于解决自动驾驶车辆的决策问题。在此框架中，基于LLM的驾驶专家被整合到DRL中，为DRL的学习过程提供智能指导。\n\n- [多模态大型语言模型在自动驾驶中的应用](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.16410)\n  - Md Robiul Islam\n  - 出版单位：威廉与玛丽学院\n  - 发表日期：2024年12月21日\n  - 任务：VQA\n  - 摘要：\n    - 构建了一个视觉问答（VQA）数据集来微调模型，以解决MLLM在自动驾驶任务中表现不佳的问题。\n\n- [VLM-RL：一种统一的视觉语言模型与强化学习框架，用于安全自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15544)\n  - 黄子林、盛子豪、曲彦松、游俊伟、陈思凯\n  - 出版单位：威斯康星大学麦迪逊分校、普渡大学\n  - 发表日期：2024年12月20日\n  - 任务：奖励设计\n  - 项目页面：[VLM-RL](https:\u002F\u002Fwww.huang-zilin.com\u002FVLM-RL-website\u002F)\n  - 摘要：\n    - VLM-RL是一个统一的框架，将预训练的视觉语言模型（VLMs）与强化学习相结合，利用图像观测和自然语言目标生成奖励信号。\n\n- [AutoTrust：面向自动驾驶的大型视觉语言模型可信度基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15206)\n  - 邢硕、华宏远、高向波、朱深哲、李仁杰、田可欣、李晓鹏、黄恒、杨天宝、王章阳、周洋、姚华秀、涂正中\n  - 出版单位：德克萨斯农工大学、多伦多大学、密歇根大学、威斯康星大学麦迪逊分校、马里兰大学、德克萨斯大学奥斯汀分校、北卡罗来纳大学教堂山分校\n  - 发表日期：2024年12月19日\n  - 任务：VQA\n  - 代码：[AutoTrust](https:\u002F\u002Fgithub.com\u002Ftaco-group\u002FAutoTrust)\n  - 排行榜：[AutoTrust](https:\u002F\u002Ftaco-group.github.io\u002FAutoTrust\u002F)\n  - 摘要：\n    - AutoTrust是一个开创性的基准测试，旨在评估DriveVLMs的可信度。该工作旨在通过确保DriveVLMs在关键维度上可靠运行，从而提升公共安全。\n\n- [VLM-AD：通过视觉语言模型监督实现端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14446)\n  - 徐毅、胡宇鑫、张载伟、格雷戈里·P·迈耶、西瓦·卡尔蒂克·穆斯蒂科韦拉、西达尔塔·斯里尼瓦萨、埃里克·M·沃尔夫、黄欣\n  - 出版单位：Cruise LLC、东北大学\n  - 发表日期：2024年12月19日\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - 摘要：\n    - VLM-AD是一种方法，利用视觉语言模型（VLMs）作为教师，通过提供包含非结构化推理信息和结构化动作标签的额外监督来增强训练效果。\n\n- [RAC3：基于视觉语言模型的检索增强型边缘场景理解用于自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.11050)\n  - 王宇金、刘泉峰、范佳琪、洪锦龙、楚洪庆、田孟健、高炳兆、陈宏\n  - 出版单位：同济大学、深圳技术大学\n  - 发表日期：2024年12月15日\n  - 任务：VQA\n  - 数据集：[CODA-LM](https:\u002F\u002Fcoda-dataset.github.io\u002Fcoda-lm\u002F)\n  - 摘要：\n    - RAC3是一个新框架，旨在提升VLMs在边缘场景理解方面的性能。\n    - RAC3集成了频率空间融合（FSF）图像编码器、用于嵌入模型的跨模态对齐训练方法（结合硬负样本和半硬负样本挖掘），以及基于K-Means聚类和分层可导航小世界（HNSW）索引的快速查询与检索管道。\n\n- [WiseAD：基于视觉语言模型的知识增强型端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.09951)\n  - 张松岩、黄文辉、高子慧、陈浩、吕晨\n  - 出版单位：南洋理工大学、浙江大学\n  - 发表日期：2024年12月13日\n  - 任务：规划\n  - 摘要：\n    - 研究了基础驾驶知识的深度和广度对闭环轨迹规划的影响，并提出了WiseAD，这是一种专为端到端自动驾驶设计的特殊VLM，能够在不同场景下进行驾驶推理、行动解释、目标识别、风险分析、驾驶建议和轨迹规划。\n\n- [PKRD-CoT：自动驾驶中多模态大型语言模型的统一思维链提示](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.02025)\n  - 罗雪雯、丁凡、宋银生、张晓峰、卢俊勇 **ICONIP 2024**\n  - 出版单位：马来西亚莫纳什大学\n  - 任务：问答、提示工程\n  - 发表日期：2024年12月2日\n  - 摘要：\n    - PKRD-CoT基于自动驾驶的四项基本能力——感知、知识、推理和决策——构建而成。\n\n- [针对自动驾驶视觉语言模型的视觉对抗攻击](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18275)\n  - 张天元、王璐、张新伟、张一桐、贾博义、梁思源、胡圣山、傅强、刘爱珊、刘向龙\n  - 出版单位：北京航空航天大学、新加坡国立大学、华中科技大学\n  - 任务：VQA\n  - 发表日期：2024年11月27日\n  - 摘要：\n    - ADvLM是首个专为自动驾驶领域中的VLM设计的视觉对抗攻击框架。\n    - 
在文本域采用语义不变诱导，在视觉域采用场景关联增强，以确保攻击在不同指令和连续视角下均具有效力。\n\n- [基于多模态大型语言模型的自动驾驶轨迹规划解释](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.09971)\n  - 山崎翔太、张晨宇、南里拓也、重兼明夫、王思源、西山丈、朱涛、横泽浩平 **ECCV 2024 VCAD研讨会**\n  - 出版单位：日产汽车公司\n  - 发表日期：2024年11月15日\n  - 任务：VQA\n  - 摘要：\n    - 提出了一种推理模型，以本车未来的规划轨迹为输入，生成推理文本。\n\n- [将目标检测模态融入视觉语言模型以增强自动驾驶代理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.05898)\n  - 林峰何、孙一鸣、吴思豪、刘家旭、黄晓伟 **NeurIPS 2024 SafeGenAI Workshop**\n  - 出版单位：利物浦大学、诺丁汉大学\n  - 发表日期：2024年11月8日\n  - 任务：感知\n  - 数据集：[DriveLM](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM)\n  - 摘要：\n    - 通过在CLIP感知网络之外加入基于YOLOS的检测网络，扩展Llama-Adapter架构，以解决目标检测和定位方面的局限性。\n\n- [Senna：连接大型视觉语言模型与端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22313)\n  - 姜博、陈绍宇、廖本成、张星宇、尹伟、张倩、黄昌、刘文宇、王兴刚\n  - 出版单位：华中科技大学\n  - 发表日期：2024年10月29日\n  - 任务：VQA、规划\n  - 代码：[Senna](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FSenna)\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)、[DriveX]()\n  - 摘要：\n    - Senna是一种将LVLM与端到端模型相结合的自动驾驶系统，能够从高层决策到低层轨迹预测实现结构化规划。\n\n- [从文字到车轮：基于视觉的基础模型理解人类语言指令的自主驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10577)\n  - 柳灿辉、成贤基、李大奎、文成佑、闵成宰、沈贤哲\n  - 出版单位：KAIST、ETRI\n  - 发表日期：2024年10月14日\n  - 任务：VQA\n  - 摘要：\n    - 介绍了一种基础模型的创新应用，使配备RGB-D相机的无人地面车辆能够根据人类语言指令导航至指定目的地。\n\n- [CANVAS：面向直观人机交互的常识感知导航系统](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01273)\n  - 崔秀焕、赵勇俊、金敏灿、郑在允、曹民哲、朴有彬、金珉序、金成雄、李成宰、朴辉成、郑智完、柳英宰\n  - 出版单位：MAUM.AI、延世大学\n  - 发表日期：2024年10月2日\n  - 项目页面：[CANVAS](https:\u002F\u002Fworv-ai.github.io\u002Fcanvas\u002F)\n  - 代码：[CANVAS](https:\u002F\u002Fgithub.com\u002Fworv-ai\u002Fcanvas)\n  - 任务：规划\n  - 数据集：[COMMAND](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmaum-ai\u002FCOMMAND)\n  - 摘要：\n    - CANVAS是一个新颖的常识感知机器人导航框架，在模拟和真实环境中均表现出色。\n\n- [LeGEND：大型语言模型辅助的自动驾驶系统场景生成自顶向下方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.10066)\n  - 唐顺成、张振亚、周继祥、雷磊、周源、薛银星 **ASE 2024**\n  - 出版单位：中国科学技术大学、九州大学、浙江理工大学\n  - 任务：生成\n  - 发表日期：2024年9月16日\n  - 代码：[LeGEND](https:\u002F\u002Fgithub.com\u002FMayDGT\u002FLeGEND)\n  - 摘要：\n    - LeGEND是一种自顶向下的场景生成方法，能够同时实现场景的严重性和多样性。\n    - 采用中间语言作为媒介，通过两阶段转换，将事故报告转化为逻辑场景；因此，LeGEND使用两个LLM，分别负责不同阶段。\n    - 实现并验证了LeGEND在Apollo平台上的有效性，发现了11种反映系统缺陷不同方面的关键具体场景。\n\n- [MiniDrive：用于自动驾驶的多层级2D特征作为文本标记的更高效视觉语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.07267)\n  - 张恩明、戴兴元、吕义生、苗庆海\n  - 出版单位：中国科学院大学、中科院自动化所\n  - 任务：问答\n  - 发表日期：2024年9月14日\n  - 代码：[MiniDrive](https:\u002F\u002Fgithub.com\u002FEMZucas\u002Fminidrive)\n  - 摘要：\n    - MiniDrive解决了自动驾驶系统中VLM高效部署和实时响应的挑战。它可以在配备24GB显存的RTX 4090 GPU上完成全量训练。\n    - 特征工程专家混合模型（FE-MoE）解决了如何将多视角的2D特征高效编码为文本标记嵌入的问题，有效减少了视觉特征标记的数量并最小化了特征冗余。\n    - 通过残差结构实现动态指令适配器，解决了同一图像在输入语言模型之前视觉标记固定的问题。\n\n- [Hint-AD：端到端自动驾驶中的整体对齐可解释性](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06702)\n  - 丁凯瑞、陈博远、苏宇辰、高焕昂、金步、司马冲浩、张武强、李晓辉、保罗·巴尔施、李洪洋、赵浩 **CoRL 2024**\n  - 出版单位：人工智能产业研究院、梅赛德斯-奔驰集团中国有限公司、清华大学、上海人工智能实验室\n  - 发表日期：2024年9月10日\n  - 项目页面：[Hint-AD](https:\u002F\u002Fair-discover.github.io\u002FHint-AD\u002F)\n  - 任务：VQA\n  - 摘要：\n    - Hint-AD是一个集成的AD-语言系统，能够生成与AD模型的整体感知-预测-规划输出相一致的语言。通过引入中间输出和一个用于有效特征适配的全局标记混合子网络，Hint-AD实现了理想的准确性，在包括驾驶解释、3D密集字幕和命令预测在内的驾驶语言任务中达到了最先进的水平。\n    - 贡献了一个人工标注的驾驶解释数据集Nu-X on nuScenes，以弥补这一广泛使用的AD数据集中缺乏驾驶解释数据的问题。\n\n- [OccLLaMA：用于自动驾驶的占用-语言-动作生成式世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.03272)\n  - 魏聚龙、袁善帅、李鹏飞、胡庆达、甘仲学、丁文超\n  - 出版单位：复旦大学、清华大学\n  - 任务：感知（占用）+ 推理\n  - 发表日期：2024年9月5日\n  - 摘要：\n    - OccLLaMA是一个统一的3D占用-语言-动作生成式世界模型，整合了包括但不限于场景理解、规划和4D占用预测等VLA相关任务。\n    - 
一种新型场景分词器（类似VQVAE的架构），能够高效地离散化和重建Occ场景，同时考虑稀疏性和类别不平衡问题。\n\n- [ContextVLM：利用视觉语言模型进行零样本和少样本情境理解的自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.00301)\n  - 肖纳克·苏拉尔、纳伦、拉古纳坦·拉杰库马尔 **ITSC 2024**\n  - 出版单位：卡内基梅隆大学\n  - 任务：情境识别\n  - 代码：[ContextVLM](https:\u002F\u002Fgithub.com\u002Fssuralcmu\u002FContextVLM)\n  - 发表日期：2024年8月30日\n  - 摘要：\n    - DrivingContexts是一个大型公开数据集，结合了人工标注和机器标注标签，旨在提升VLM的情境识别能力。\n    - ContextVLM利用视觉语言模型，通过零样本和少样本方法来检测情境。\n\n- [DriveGenVLM：基于视觉语言模型的自动驾驶真实世界视频生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.16647)\n  - 傅勇杰、安莫尔·贾因、狄璇、陈旭、莫兆斌 **IAVVC 2024**\n  - 出版单位：哥伦比亚大学\n  - 任务：生成\n  - 数据集：[Waymo开放数据集](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 发表日期：2024年8月29日\n  - 摘要：\n    - DriveGenVLM采用基于去噪扩散概率模型的视频生成框架，创建能够模拟真实世界动态的逼真视频序列。\n    - 生成的视频随后使用一个名为“基于第一人称视频的有效上下文学习”（EILEV）的预训练模型，评估其对视觉语言模型（VLM）的适用性。\n\n- [基于大语言模型的边缘—云端协同自动驾驶运动规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.09972)\n  - 陈娇、戴素燕、陈芳芳、吕作宏、唐建华\n  - 出版单位：华南理工大学、琶洲实验室\n  - 任务：规划 + QA\n  - 项目页面：[EC-Drive](https:\u002F\u002Fsites.google.com\u002Fview\u002Fec-drive)\n  - 发表日期：2024年8月19日\n  - 摘要：\n    - EC-Drive是一种新颖的边缘—云端协同自动驾驶系统。\n\n- [V2X-VLM：通过大型视觉语言模型实现端到端的V2X协同自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.09251)\n  - 游俊伟、史浩天、蒋卓宇、黄子林、甘锐、吴克树、程曦、李晓鹏、冉斌\n  - 出版单位：威斯康星大学麦迪逊分校、南洋理工大学、德克萨斯农工大学、康奈尔大学\n  - 任务：规划\n  - 项目页面：[V2X-VLM](https:\u002F\u002Fzilin-huang.github.io\u002FV2X-VLM-website\u002F)\n  - 代码：[V2X-VLM](https:\u002F\u002Fgithub.com\u002Fzilin-huang\u002FV2X-VLM)\n  - 数据集：[DAIR-V2X](https:\u002F\u002Fgithub.com\u002FAIR-THU\u002FDAIR-V2X)\n  - 发表日期：2024年8月9日\n  - 摘要：\n    - V2X-VLM是一个由大型视觉语言模型赋能的端到端VICAD框架，它通过先进的多模态理解和决策能力，提升了自动驾驶车辆在复杂交通场景中的导航能力。\n    - 该框架采用对比学习技术来优化模型区分相关与无关特征的能力，从而确保模型能够学习到特定驾驶环境的鲁棒且具有鉴别性的表征，进而提高在V2X协同场景下轨迹规划的准确性。\n\n- [AgentsCoMerge：大语言模型赋能的匝道汇入协同决策](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.03624)\n  - 胡森康、方正儒、方子涵、邓怡琴、陈贤浩、方玉光、Sam Kwong **IEEE TMC**\n  - 出版单位：香港城市大学、香港大学、岭南大学\n  - 任务：多智能体规划\n  - 发表日期：2024年8月7日\n  - 摘要：\n    - AgentsCoMerge是一个由大语言模型赋能的匝道汇入协同决策系统，包含观测、规划、通信和强化训练模块。实验表明，该系统能够有效提升多智能体协作效率。\n\n- [VLM-MPC：视觉语言基础模型（VLM）引导的自动驾驶模型预测控制器（MPC)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.04821)\n  - 龙可可、史浩天、刘嘉熙、李晓鹏\n  - 出版单位：威斯康星大学麦迪逊分校\n  - 任务：规划\n  - 发表日期：2024年8月4日\n  - 摘要：\n    - 提出了一种闭环自动驾驶控制器，将VLM应用于车辆的高层控制。\n    - 上层VLM以车辆前视摄像头图像、文本场景描述以及经验记忆为输入，生成下层MPC所需的控制参数。\n    - 下层MPC则利用这些参数，结合考虑发动机延迟的车辆动力学特性，实现逼真的车辆行为，并将状态反馈回上层。\n    - 这种异步双层结构解决了当前VLM响应速度较慢的问题。\n\n- [SimpleLLM4AD：用于自动驾驶的图式视觉问答端到端视觉语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21293)\n  - 郑培汝、赵云、龚展、朱洪、吴绍华 **IEIT Systems**\n  - 出版单位：威斯康星大学麦迪逊分校\n  - 任务：QA\n  - 发表日期：2024年7月31日\n  - 摘要：\n    - SimpleLLM4AD重新构想了传统的自动驾驶流程，将其划分为感知、预测、规划和行为四个相互关联的阶段。\n    - 每个阶段都被表述为一系列视觉问答（VQA）对，这些问答对相互连接形成图式视觉问答（GVQA）。这种基于图的结构使系统能够系统地推理每个VQA对，从而确保从感知到行动的信息流和决策过程连贯一致。\n\n- [针对联网自动驾驶车辆的驾驶理论知识与技能对大语言模型的测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.17211)\n  - 唐左音、何建华、裴大帅、刘克中、高涛\n  - 出版单位：阿斯顿大学、埃塞克斯大学、武汉理工大学、长安大学\n  - 任务：评估\n  - 发表日期：2024年7月24日\n  - 数据：[英国驾驶理论考试练习题及答案](https:\u002F\u002Fwww.drivinginstructorwebsites.co.uk\u002Fuk-driving-theory-test-practice-questions-and-answers)\n  - 摘要：\n    - 设计并运行了针对多个专有LLM模型（OpenAI GPT系列、百度文心一言和阿里通义千问）以及开源LLM模型（清华大学MiniCPM-2B和MiniCPM-Llama3-V2.5）的驾驶理论测试，共包含500余道选择题。\n\n- [KoMA：基于知识驱动的大语言模型自动驾驶多智能体框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.14239)\n  - 蒋科谋、蔡轩、崔志勇、李傲勇、任一龙、于海阳、杨浩、傅道成、温立成、蔡品隆\n  - 出版单位：北京航空航天大学、约翰霍普金斯大学、上海人工智能实验室\n  - 任务：多智能体规划\n  - 环境：[HighwayEnv](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - 
项目页面：[KoMA](https:\u002F\u002Fjkmhhh.github.io\u002FKoMA\u002F)\n  - 发表日期：2024年7月19日\n  - 摘要：\n    - 介绍了一个基于知识驱动的自动驾驶框架KoMA，该框架整合了由大语言模型赋能的多智能体，包含五个核心模块：环境、多智能体交互、多步规划、共享内存和基于排名的反思。\n\n- [WOMD-Reasoning：用于交互与驾驶意图推理的大规模语言数据集](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.04281)\n  - 李一恒、葛崇健、李晨然、徐晨峰、富冢正芳、唐晨、丁明宇、詹伟\n  - 出版单位：加州大学伯克利分校、德克萨斯大学奥斯汀分校\n  - 任务：数据集 + 推理\n  - 发表日期：2024年7月5日\n  - 数据集：[WOMD-Reasoning](https:\u002F\u002Fwaymo.com\u002Fopen\u002Fdownload)\n  - 摘要：\n    - WOMD-Reasoning 是一个以交互描述和推理为核心的语言数据集。它深入探讨了由交通规则和人类意图引发的、此前被忽视的关键交互，提供了丰富的见解。\n    - 开发了一套自动语言标注流水线，利用基于规则的转换器将运动数据转化为语言描述，并结合一组人工提示引导 ChatGPT 生成问答对。\n\n- [探索多模态 AI 在驾驶危险预测中的潜力](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10568360)\n  - 科拉瓦特·查伦皮塔克斯、阮文光、菅沼昌则、高桥雅弘、新原龙马、冈谷隆之 **IEEE TIV 2024**\n  - 出版单位：东北大学、RIKEN AIP 中心、电装公司\n  - 任务：预测\n  - 代码：[DHPR](https:\u002F\u002Fgithub.com\u002FDHPR-dataset\u002FDHPR-dataset)\n  - 发表日期：2024年6月21日\n  - 摘要：\n    - DHPR（驾驶危险预测与推理）数据集包含 1.5 万张行车记录仪拍摄的街景图像，每张图像都关联着一个元组，其中包括车速、假设的危险描述以及场景中存在的视觉实体。\n    - 提出了几种基线方法并对其性能进行了评估。\n\n- [异步大型语言模型增强的自动驾驶规划器](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.14556)\n  - 陈源、丁子涵、王梓钦、王燕、张立军、刘思 **ECCV 2024**\n  - 出版单位：北京航空航天大学、清华大学\n  - 任务：规划\n  - 发表日期：2024年6月20日\n  - 代码：[AsyncDriver](https:\u002F\u002Fgithub.com\u002FmemberRE\u002FAsyncDriver)\n  - 数据集：[nuPlan 封闭环路反应式 Hard20](https:\u002F\u002Fwww.nuscenes.org\u002Fnuplan)\n  - 摘要：\n    - AsyncDriver 是一种新颖的异步 LLM 增强框架，其中 LLM 的推理频率可调，并可与实时规划器的频率解耦。\n    - 自适应注入模块具有模型无关性，能够轻松将场景相关的指令特征融入任何基于 Transformer 的实时规划器中，从而提升其理解和执行一系列语言驱动的导航指令的能力。\n    - 与现有方法相比，我们的方法在 nuPlan 的挑战性场景中表现出更优越的闭环评估性能。\n\n- [基于大型语言模型的自动驾驶超对齐框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.05651)\n  - 孔祥瑞、托马斯·布劳恩尔、马尔科·法米、王悦\n  - 出版单位：西澳大利亚大学、昆士兰州政府、昆士兰科技大学\n  - 任务：问答\n  - 发表日期：2024年6月9日\n  - 摘要：\n    - 提出了一种安全的交互框架，用于有效审计与云端 LLM 交互的数据。\n    - 分析了 11 种基于大型语言模型的自动驾驶方法，包括驾驶安全性、token 使用情况、隐私保护以及与人类价值观的一致性。\n    - 在 nuScenesQA 数据集中评估了驾驶提示的有效性，并比较了 gpt-35-turbo 和 llama2-70b 这两种 LLM 后端的不同结果。\n\n- [PlanAgent：用于闭环车辆运动规划的多模态大型语言代理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.01587)\n  - 郑宇鹏、邢泽斌、张启超、金步、李鹏飞、郑宇航、夏仲璞、詹坤、郎献鹏、陈亚然、赵东彬\n  - 出版单位：中国科学院、北京邮电大学、北京航空航天大学、清华大学、理想汽车\n  - 任务：规划\n  - 发表日期：2024年6月4日\n  - 摘要：\n    - PlanAgent 是首个基于多模态大型语言模型的中—中级（使用 BEV，无需原始传感器）闭环自动驾驶规划代理系统。\n    - 提出了一个高效的环境转换模块，该模块通过车道图表示提取多模态信息输入。\n    - 设计了一个推理引擎模块，引入分层思维链（CoT）来指导 MLLM 生成规划代码，并设计了一个反思模块，结合仿真和评分机制筛选掉 MLLM 生成的不合理方案。\n\n- [ChatScene：面向自动驾驶车辆的知识增强型安全关键场景生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.14062)\n  - 张嘉伟、许澈健、李博 **CVPR 2024**\n  - 出版单位：伊利诺伊大学厄巴纳-香槟分校、芝加哥大学\n  - 任务：场景生成\n  - 环境：[Carla](https:\u002F\u002Fgithub.com\u002Fcarla-simulator)\n  - 代码：[ChatScene](https:\u002F\u002Fgithub.com\u002Fjavyduck\u002FChatScene)\n  - 发表日期：2024年5月22日\n  - 摘要：\n    - ChatScene 是一种新型的基于 LLM 的智能体，能够先根据文本描述生成安全关键场景，再通过 Scenic 编程语言将其精心转化为 CARLA 中可执行的仿真场景。\n    - 开发了一个庞大的 Scenic 代码片段检索数据库，其中收录了各种对抗性行为和交通配置，充分利用 LLM 中存储的丰富知识，显著增强了所生成驾驶场景的多样性和关键性。\n\n- [探究多模态 LLM 作为驾驶世界模型的可能性](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.05956)\n  - 希瓦·斯里拉姆、王敦轩、阿拉·马卢夫、盖伊·罗斯曼、塞尔塔克·卡拉曼、丹妮拉·鲁斯\n  - 出版单位：MIT CSAIL、TRI、MIT LID\n  - 任务：基准测试与评估\n  - 代码：[DriveSim](https:\u002F\u002Fgithub.com\u002Fsreeramsa\u002FDriveSim)\n  - 发表日期：2024年5月9日\n  - 摘要：\n    - 这是一项全面的实验研究，旨在评估不同 MLLM 在闭环驾驶场景中的推理和理解能力，以及其决策能力。\n    - DriveSim 是一款专门设计的模拟器，用于生成多样化的驾驶场景，从而为测试和评估 MLLM 从固定车载摄像头视角（即驾驶者视角）理解和推理真实世界驾驶场景的能力提供平台。\n\n- [OmniDrive：用于自动驾驶的全栈式LLM-Agent框架，具备3D感知推理与规划能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.01533)\n  - 王世豪、于志鼎、蒋晓辉、兰士毅、施敏、Nadine Chang、Jan Kautz、李颖、Jose M. 
Alvarez\n  - 出版单位：北京理工大学、NVIDIA、华中科技大学\n  - 任务：基准测试与规划\n  - 发布日期：2024年5月2日\n  - 代码：[OmniDrive](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FOmniDrive)\n  - 摘要：\n    - OmniDrive是一个旨在实现智能体模型与3D驾驶任务之间强对齐的全栈式框架。\n    - 提出了一套包含场景描述、交通规则、3D定位、反事实推理、决策与规划等全面视觉问答（VQA）任务的新基准。\n\n- [Chat2Scenario：利用大型语言模型从数据集中提取驾驶场景](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.16147)\n  - 赵永奇、肖文博、Tomislav Mihalj、胡佳、Arno Eichberger **IEEE IV 2024**\n  - 出版单位：\n  - 发布日期：2024年4月26日\n  - 任务：生成\n  - 数据集：[Chat2Scenario](https:\u002F\u002Fgithub.com\u002FftgTUGraz\u002FChat2Scenario)\n  - 摘要：\n    - Chat2Scenario是一款基于Web的工具，允许用户通过输入描述性功能场景文本，在数据集中搜索特定的驾驶场景。\n\n- [VLAAD：面向自动驾驶的视觉与语言助手](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10495690)\n  - SungYeon Park、MinJae Lee、JiHyuk Kang、Hahyeon Choi、Yoonah Park、Juhwan Cho、Adam Lee、DongKyu Kim **WACV 2024工作坊**\n  - 出版单位：首尔国立大学、加州大学伯克利分校\n  - 发布日期：2024年4月16日\n  - 任务：VQA\n  - 代码：[VLAAD](https:\u002F\u002Fgithub.com\u002Fsungyeonparkk\u002Fvision-assistant-for-driving)\n  - 摘要：\n    - 介绍了一个多模态指令微调数据集，帮助语言模型在多样化的驾驶场景中学习视觉指令。\n    - 基于该数据集，提出了一款名为VLAAD的多模态LLM驾驶助手。\n\n\n- [REvolve：利用大型语言模型进行自动驾驶奖励进化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.01309)\n  - Rishi Hazra、Alkis Sygkounas、Andreas Persson、Amy Loutfi、Pedro Zuidberg Dos Martires\n  - 出版单位：瑞典厄勒布鲁大学应用自主传感器系统中心（AASS）\n  - 任务：奖励生成\n  - 环境：[AirSim](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FAirSim?tab=readme-ov-file)\n  - 项目页面：[REvolve](https:\u002F\u002Frishihazra.github.io\u002FREvolve\u002F)\n  - 发布日期：2024年4月9日\n  - 摘要：\n    - Reward Evolve（REvolve）是一种新颖的进化框架，使用LLM（特别是GPT-4）输出适用于自动驾驶的奖励函数（以可执行Python代码形式），并根据人类反馈对其进行迭代优化。\n\n- [AGENTSCODRIVER：大型语言模型赋能的终身学习型协同驾驶](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.06345.pdf)\n  - 胡森康、方正儒、方子涵、陈献浩、方宇光\n  - 出版单位：香港城市大学、香港大学\n  - 任务：规划（多车协同）\n  - 发布日期：2024年4月9日\n  - 环境：[HighwayEnv](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - 摘要：\n    - AGENTSCODRIVER是一个由LLM驱动、具备终身学习能力的多车协同驾驶框架，允许多个驾驶智能体在复杂交通场景中相互通信并协同驾驶。\n    - 该框架包含推理引擎、认知记忆、强化学习反思机制和通信模块。\n\n- [用于自动驾驶中问答任务的多帧、轻量且高效的视觉-语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.19838)\n  - Akshay Gopalkrishnan、Ross Greer、Mohan Trivedi\n  - 出版单位：UCSD\n  - 任务：问答\n  - 发布日期：2024年3月28日\n  - 代码：[官方](https:\u002F\u002Fgithub.com\u002Fakshaygopalkr\u002FEM-VLM4AD)\n  - 数据集：[DriveLM](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM)\n  - 摘要：\n    - EM-VLM4AD是一种高效、轻量级的多帧视觉语言模型，专为自动驾驶中的视觉问答任务设计。\n    - 相较于现有基线模型，EM-VLM4AD在DriveLM数据集上所需的内存和浮点运算量至少减少了10倍，同时BLEU-4、METEOR、CIDEr和ROGUE等指标均表现更优。\n\n- [LC-LLM：基于大型语言模型的可解释性变道意图与轨迹预测](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.18344)\n  - 彭明星、郭旭森、陈贤达、朱美欣、陈科华、杨浩（Frank）、王学松、王银海\n  - 出版单位：香港科技大学、约翰斯·霍普金斯大学、同济大学、STAR实验室\n  - 任务：轨迹预测\n  - 发布日期：2024年3月27日\n  - 数据集：[highD](https:\u002F\u002Flevelxdata.com\u002Fhighd-dataset\u002F)\n  - 摘要：\n    - LC-LLM是首个用于变道预测的大型语言模型。它充分利用LLM的强大能力来理解复杂的交互场景，从而提升变道预测性能。\n    - LC-LLM能够提供可解释的预测结果，不仅预测变道意图和轨迹，还能生成对预测结果的解释说明。\n\n- [AIDE：自动驾驶中目标检测的自动化数据引擎](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.17373)\n  - 梁明富、苏宗智、Samuel Schulter、Sparsh Garg、赵世宇、吴英、Manmohan Chandraker\n  - 出版单位：西北大学、NEC美国实验室、罗格斯大学、UC圣地亚哥\n  - 发布日期：2024年3月26日\n  - 任务：目标检测\n  - 
数据集：[Mapillary](https:\u002F\u002Fwww.mapillary.com\u002Fdataset\u002Fvistas)、[Cityscapes](https:\u002F\u002Fwww.cityscapes-dataset.com\u002F)、[nuImages](https:\u002F\u002Fwww.nuscenes.org\u002Fnuimages)、[BDD100k](https:\u002F\u002Fwww.vis.xyz\u002Fbdd100k\u002F)、[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)、[KITTI](https:\u002F\u002Fwww.cvlibs.net\u002Fdatasets\u002Fkitti\u002F)\n  - 摘要：\n    - AIDE是一个自动化数据引擎，能够自动识别问题、高效地整理数据、通过自动标注改进模型，并借助生成的多样化场景验证模型。\n\n- [利用大型语言模型制定自动驾驶安全要求](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.16289)\n  - Ali Nouri、Beatriz Cabrero-Daniel、Fredrik Törner、Hȧkan Sivencrona、Christian Berger\n  - 出版单位：查尔姆斯理工大学、哥德堡大学、沃尔沃汽车、Zenseact、哥德堡大学\n  - 任务：问答\n  - 发布日期：2024年3月24日\n  - 摘要：\n    - 提出了一种由提示词和LLM组成的原型流水线，该流水线接收物品定义并以安全要求的形式输出解决方案。\n\n- [LeGo-Drive：语言增强的目标导向闭环端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.20116)\n  - 作者：Pranjal Paul、Anant Garg、Tushar Choudhary、Arun Kumar Singh、K. Madhava Krishna\n  - 出版单位：海得拉巴国际信息技术研究所、爱沙尼亚塔尔图大学\n  - 项目主页：[LeGo-Drive](https:\u002F\u002Freachpranjal.github.io\u002Flego-drive\u002F)\n  - 代码：[LeGo-Drive](https:\u002F\u002Fgithub.com\u002Freachpranjal\u002Flego-drive)\n  - 环境：[Carla](https:\u002F\u002Fgithub.com\u002Fcarla-simulator)\n  - 任务：轨迹预测\n  - 发表日期：2024年3月20日\n  - 摘要：\n    - 提出了一种基于大型语言模型的新型规划引导式端到端目标点导航方案，通过与环境动态交互并生成无碰撞轨迹，从而预测和优化期望状态。\n\n- [基于大型语言模型的混合推理在自动驾驶中的应用](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.13602v3)\n  - 作者：Mehdi Azarafza、Mojtaba Nayyeri、Charles Steinmetz、Steffen Staab、Achim Rettberg\n  - 出版单位：哈姆-利普施塔特应用科学大学、斯图加特大学\n  - 发表日期：2024年3月18日\n  - 任务：推理\n  - 环境：[Carla](https:\u002F\u002Fgithub.com\u002Fcarla-simulator)\n  - 摘要：\n    - 结合算术与常识性元素，利用YOLOv8检测到的物体信息。\n    - 将“物体位置”、“本车速度”、“与物体距离”以及“本车行驶方向”输入大型语言模型，在CARLA环境中进行数学计算。\n    - 综合考虑天气条件等因素后，生成精确的刹车与加速控制指令。\n\n- [大型语言模型驱动的上下文感知运动预测](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.11057.pdf)\n  - 作者：Xiaoji Zheng、Lixiu Wu、Zhijie Yan、Yuanrong Tang、Hao Zhao、Chen Zhong、Bokui Chen、Jiangtao Gong\n  - 出版单位：清华大学\n  - 任务：运动预测\n  - 发表日期：2024年3月17日\n  - 数据集：[WOMD](https:\u002F\u002Fgithub.com\u002Fwaymo-research\u002Fwaymo-open-dataset)\n  - 摘要：\n    - 设计并实施提示工程，使未经微调的GPT-4-V能够理解复杂的交通场景。\n    - 提出一种新方法，将GPT-4-V输出的上下文信息与[MTR](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.13508)相结合。\n\n- [自动驾驶领域的通用预测模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.09630)\n  - 作者：Jiazhi Yang、Shenyuan Gao、Yihang Qiu、Li Chen、Tianyu Li、Bo Dai、Kashyap Chitta、Penghao Wu、Jia Zeng、Ping Luo、Jun Zhang、Andreas Geiger、Yu Qiao、Hongyang Li **ECCV 2024**\n  - 出版单位：OpenDriveLab与上海人工智能实验室、香港科技大学、香港大学、图宾根大学、图宾根人工智能中心\n  - 任务：数据集构建与生成\n  - 代码：[DriveAGI](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveAGI.)\n  - 发表日期：2024年3月14日\n  - 摘要：\n    - 提出首个自动驾驶领域的大规模视频预测模型。\n    - 构建的数据集累计超过2000小时的驾驶视频，覆盖全球各地，包含多样化的天气条件和交通场景。\n    - GenAD继承了近期潜在扩散模型的优点，通过新颖的时间推理模块处理驾驶场景中的复杂动态。\n\n- [LLM辅助交通信号灯：利用大型语言模型能力实现复杂城市环境下的拟人化交通信号控制](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.08337)\n  - 作者：Maonan Wang、Aoyu Pang、Yuheng Kan、Man-On Pun、Chung Shue Chen、Bo Huang\n  - 出版单位：香港中文大学、上海人工智能实验室、商汤科技有限公司、诺基亚贝尔实验室\n  - 发表日期：2024年3月13日\n  - 任务：生成\n  - 代码：[LLM-Assisted-Light](https:\u002F\u002Fgithub.com\u002FTraffic-Alpha\u002FLLM-Assisted-Light)\n  - 摘要：\n    - LA-Light是一种混合型交通信号控制系统框架，融合了大型语言模型的拟人化推理能力，使信号控制算法能够以人类认知般的细腻判断来理解和响应复杂的交通场景。\n    - 开发了一个闭环交通信号控制系统，将大型语言模型与一系列可互操作的工具集成在一起。\n\n- [DriveDreamer-2：LLM增强的世界模型用于多样化驾驶视频生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.06845)\n  - 作者：Guosheng Zhao、Xiaofeng Wang、Zheng Zhu、Xinze Chen、Guan Huang、Xiaoyi Bao、Xingang Wang\n  - 出版单位：中国科学院自动化研究所、GigaAI\n  - 
发表日期：2024年3月11日\n  - 任务：生成\n  - 项目：[DriveDreamer-2](https:\u002F\u002Fdrivedreamer2.github.io)\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - 摘要：\n    - DriveDreamer-2基于[DriveDreamer](#DriveDreamer)框架，并引入大型语言模型（LLM）以生成用户自定义的驾驶视频。\n    - UniMVM（统一多视角模型）增强了生成驾驶视频的时间与空间连贯性。\n    - HDMap生成器确保背景元素不会与前景轨迹发生冲突。\n    - 利用构建的文本到脚本数据集，对GPT-3.5进行微调，使其具备专门的轨迹生成知识。\n\n- [通过协作式LLM智能体实现自动驾驶的可编辑场景仿真](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.05746)\n  - 作者：Yuxi Wei、Zi Wang、Yifan Lu、Chenxin Xu、Changxing Liu、Hao Zhao、Siheng Chen、Yanfeng Wang\n  - 出版单位：上海交通大学、上海人工智能实验室、卡内基梅隆大学、清华大学\n  - 发表日期：2024年3月11日\n  - 任务：生成\n  - 代码：[ChatSim](https:\u002F\u002Fgithub.com\u002Fyifanlu0227\u002FChatSim)\n  - 数据集：[Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 摘要：\n    - ChatSim是首个可通过自然语言指令结合外部数字资产实现可编辑的逼真3D驾驶场景仿真的系统。\n    - McNeRF是一种新颖的神经辐射场方法，整合多摄像头输入，提供更广阔的场景渲染能力，有助于生成逼真的效果。\n    - McLight是一种新的多摄像头光照估计技术，融合天穹光与周围环境光，使外部数字资产呈现出真实的纹理和材质。\n\n- [驾驶场景的具身化理解](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.04593)\n  - 作者：Yunsong Zhou、Linyan Huang、Qingwen Bu、Jia Zeng、Tianyu Li、Hang Qiu、Hongzi Zhu、Minyi Guo、Yu Qiao、Hongyang Li **ECCV 2024**\n  - 上海人工智能实验室、上海交通大学、加州大学河滨分校\n  - 发表日期：2024年3月7日\n  - 任务：基准测试与场景理解\n  - 代码：[ELM](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FELM)\n  - 摘要：\n    - ELM是一种具身化语言模型，用于理解时空维度上的长期驾驶场景。\n\n- [DriveVLM：自动驾驶与大型视觉-语言模型的融合](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12289) \n  - 田晓宇、顾俊儒、李柏霖、刘一成、胡晨旭、王洋、詹坤、贾鹏、郎贤鹏、赵航\n  - 出版方：清华大学IIIS、理想汽车\n  - 发表日期：2024年2月25日\n  - 任务：规划\n  - 项目：[DriveVLM](https:\u002F\u002Ftsinghua-mars-lab.github.io\u002FDriveVLM\u002F)\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)、SUP-AD\n  - 摘要：\n    - DriveVLM是一种新颖的自动驾驶系统，利用VLM实现高效的场景理解和规划。\n    - DriveVLM-Dual是一种混合系统，结合了DriveVLM与传统的自动驾驶流程。\n\n- [GenAD：生成式端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.11502)\n  - 郑文钊、宋瑞琪、郭显达、陈龙  **ECCV 2024**\n  - 加州大学伯克利分校、Waytous、中国科学院自动化研究所\n  - 发表日期：2024年2月20日\n  - 任务：生成\n  - 代码：[GenAD](https:\u002F\u002Fgithub.com\u002Fwzzheng\u002FGenAD)\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - 摘要：\n    - GenAD将自动驾驶建模为轨迹生成问题，以充分发挥端到端方法的潜力。\n    - 提出一种以实例为中心的场景分词器，首先将周围场景转换为具备地图感知的实例标记。\n    - 使用变分自编码器在结构化的潜在空间中学习未来轨迹分布，用于轨迹先验建模，并采用时间模型捕捉潜在空间中的主体和本车运动，从而生成更有效的未来轨迹。\n\n- [RAG-Driver：基于检索增强上下文学习的多模态大语言模型驱动的可泛化驾驶解释](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.10828)\n  - 袁建豪、孙书阳、丹尼尔·奥梅扎、赵博、保罗·纽曼、拉尔斯·昆策、马修·加德\n  - 出版方：牛津大学、北京智源人工智能研究院\n  - 发表日期：2024年2月16日\n  - 任务：可解释性驾驶\n  - 项目：[RAG-Driver](https:\u002F\u002Fyuanjianhao508.github.io\u002FRAG-Driver\u002F)\n  - 摘要：\n    - RAG-Driver是一种具备检索增强上下文学习能力的多模态大语言模型，专为具有强大零样本泛化能力的可泛化、可解释的端到端驾驶设计。\n    - 在BDD-X（分布内）和SAX（分布外）基准测试中均达到最先进的动作解释与论证性能。\n\n- [通过大语言模型策略适配实现“随处驾驶”](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.05932)\n  - 李博毅、王悦、毛家庚、鲍里斯·伊万诺维奇、苏尚特·维尔、卡伦·梁、马可·帕沃内 **CVPR 2024**\n  - 出版方：英伟达、南加州大学、华盛顿大学、斯坦福大学\n  - 发表日期：2024年2月8日\n  - 任务：规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - 项目：[LLaDA](https:\u002F\u002Fboyiliee.github.io\u002Fllada\u002F)\n  - 摘要：\n    - LLaDA是一种无需训练的机制，旨在辅助人类驾驶员并将自动驾驶策略适配到新环境。\n    - 交通规则提取器（TRE），用于整理和过滤输入（初始计划+独特交通规则），并将输出送入冻结的大语言模型以获得最终的新计划。\n    - LLaDA默认使用GPT-4作为大语言模型。\n\n- [LimSim++](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01246)\n  - 傅道成、雷文杰、温立成、蔡品隆、毛松、窦敏、史博天、乔宇\n  - 出版方：上海人工智能实验室、浙江大学\n  - 发表日期：2024年2月2日\n  - 项目：[LimSim++](https:\u002F\u002Fpjlab-adg.github.io\u002Flimsim_plus\u002F)\n  - 摘要：\n    - LimSim++是LimSim的扩展版本，专为在自动驾驶中应用（M）LLM而设计。\n    - 
引入了一个基于（M）LLM的基础框架，并通过跨多种场景的定量实验进行了系统验证。\n\n- [LangProp：应用于驾驶的语言模型代码优化框架](https:\u002F\u002Fopenreview.net\u002Fforum?id=UgTrngiN16)\n  - 石田修、詹卢卡·科拉多、乔治·费多塞耶夫、黄德耀、劳埃德·拉塞尔、杰米·肖顿、若昂·F·恩里克斯、安东尼·胡\n  - 出版方：Wayve Technologies、牛津大学视觉几何组\n  - 发表日期：2024年1月18日\n  - 任务：代码生成、规划\n  - 代码：[LangProp](https:\u002F\u002Fgithub.com\u002Fshuishida\u002FLangProp)\n  - 环境：[CARLA](https:\u002F\u002Fgithub.com\u002Fcarla-simulator)\n  - 摘要：\n    - LangProp是一个在监督\u002F强化学习环境下迭代优化由大型语言模型（LLMs）生成的代码的框架。\n    - 在CARLA中使用LangProp，根据当前场景状态生成驾驶代码。\n\n- [VLP：面向自动驾驶的视觉语言规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.05577)\n  - 潘晨斌、布尔哈内丁·亚曼、托马索·内斯蒂、阿比鲁普·马利克、亚历山德罗·G·阿列维、塞内姆·韦利帕萨拉尔、任刘 **CVPR 2024**\n  - 出版方：雪城大学、博世北美研究院及博世人工智能中心（BCAI）\n  - 发表日期：2024年1月14日\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - 摘要：\n    - 提出VLP视觉语言规划模型，由ALP和SLP两个创新组件构成，分别从自动驾驶的BEV推理和决策方面提升ADS性能。\n    - ALP（主体导向的学习范式）将生成的BEV与真实的鸟瞰图地图对齐。\n    - SLP（以自动驾驶车辆为中心的学习范式）将本车查询特征与本车文本规划特征对齐。\n\n- [DME-Driver：在自动驾驶中整合人类决策逻辑与3D场景感知](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.03641)\n  - 韩文成、郭东倩、徐承中、沈健兵\n  - 出版方：澳门大学CIS、SKL-IOTSC\n  - 发表日期：2024年1月8日\n  - 摘要：\n    - DME-Driver = 决策者 + 执行者 + CL\n    - 基于UniAD的执行者网络引入OccFormer和规划模块中的文本信息。\n    - 基于LLaVA的决策者处理来自三种不同模态的输入：当前及先前场景的视觉输入、提示形式的文本输入，以及描述车辆运行状态的当前状态信息。\n    - CL是一种一致性损失机制，虽略微降低性能指标，但显著增强了执行者与决策者之间的决策一致性。\n\n- [AccidentGPT：基于多模态大模型的V2X环境感知事故分析与预防](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13156)\n  - 王乐宁、任一龙、江涵、蔡品隆、傅道成、王天奇、崔志勇、于海洋、王雪松、周汉初、黄和来、王银海\n  - 出版单位：北京航空航天大学、上海人工智能实验室、香港大学、中关村实验室、同济大学、中南大学、华盛顿大学西雅图分校\n  - 发表日期：2023年12月29日\n  - 项目页面：[AccidentGPT](https:\u002F\u002Faccidentgpt.github.io)\n  - 摘要：\n    - AccidentGPT是一款全面的事故分析与预防多模态大模型。\n    - 集成了多车协同感知，以增强对环境的理解和碰撞规避能力。\n    - 提供主动远程安全预警和盲区提醒等高级安全功能。\n    - 为交通警察和管理部门提供实时的交通安全因素智能分析服务。\n\n- [通过鸟瞰视角注入的多模态大模型实现整体自动驾驶理解](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.00988)\n  - 丁新鹏、韩金华、徐航、梁晓丹、张伟、李晓萌 **CVPR 2024**\n  - 出版单位：香港科技大学、华为诺亚方舟实验室、中山大学\n  - 发表日期：2023年12月21日\n  - 任务：数据集 + VQA\n  - 代码：[官方](https:\u002F\u002Fgithub.com\u002Fxmed-lab\u002FNuInstruct)\n  - 摘要：\n    - 介绍NuInstruct，一个基于[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)的新颖数据集，包含17个子任务下的91,000个多视角视频-QA对。\n    - 提出BEV-InMLMM，将具备指令感知能力的BEV特征与现有MLLM相结合，从而增强其信息完整性，包括时间、多视角和空间细节。\n\n- [LLM-ASSIST：基于语言推理增强闭环规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.00125)\n  - S P Sharan、Francesco Pittaluga、Vijay Kumar B G、Manmohan Chandraker\n  - 出版单位：德克萨斯大学奥斯汀分校、NEC美国实验室、加州大学圣地亚哥分校\n  - 发表日期：2023年12月30日\n  - 任务：规划\n  - 环境\u002F数据集：nuPlan闭环非反应式挑战赛\n  - 项目：[LLM-ASSIST](https:\u002F\u002Fllmassist.github.io\u002F)\n  - 摘要：\n    - LLM-Planner接管PDM-Closed无法处理的场景。\n    - 提出两种基于LLM的规划器。\n      - LLM-ASSIST(unc)考虑最无约束的规划问题版本，在该版本中LLM必须直接返回一辆自车的安全未来轨迹。\n      - LLM-ASSIST(par)则考虑参数化的规划问题版本，在该版本中LLM只需为基于规则的规划器PDM-Closed返回一组参数。\n\n- [DriveLM：基于图示视觉问答的驾驶](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.14150.pdf)\n  - 司马冲浩、卡特琳·伦茨、卡希亚普·奇塔、陈莉、张瀚雪、谢成恩、罗平、安德烈亚斯·盖格、李洪洋 **ECCV 2024**\n  - 出版单位：OpenDriveLab、蒂宾根大学、蒂宾根AI中心、香港大学\n  - 代码：[官方](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM)\n  - 发表日期：2023年12月21日\n  - 摘要：\n    - DriveLM任务\n      - 图形VQA将P1-3（感知、预测、规划）推理转化为有向图中的系列问答对(QAs)。\n    - DriveLM数据\n      - DriveLM-Carla\n        - 使用CARLA 0.9.14在Leaderboard 2.0框架下[17]收集数据，并由特权规则型专家进行操作。\n      - Drive-nuScenes\n        - 从视频片段中选取关键帧，挑选这些关键帧中的关键对象，随后为这些关键对象标注帧级别的P1−3 QA。部分感知QA来自nuScenes和[OpenLane-V2](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FOpenLane-V2)的真值，其余QA则由人工标注。\n    - DriveLM智能体\n      - DriveLMAgent基于通用视觉-语言模型构建，因此能够利用预训练过程中获得的基础知识。\n\n- 
[LingoQA：面向自动驾驶的视频问答](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.14115)\n  - 安娜-玛丽亚·马库、陈龙、扬·许内尔曼、爱丽丝·卡尔松、贝努瓦·哈诺特、普拉贾瓦尔·奇达南达、索拉布·奈尔、维杰·巴德里纳拉扬、亚历克斯·肯德尔、杰米·肖顿、奥列格·西纳夫斯基\n  - 出版单位：Wayve\n  - 任务：VQA + 评估\u002F数据集\n  - 代码：[官方](https:\u002F\u002Fgithub.com\u002Fwayveai\u002FLingoQA)\n  - 发表日期：2023年12月21日\n  - 摘要：\n    - 引入一种用于自动驾驶视频QA的新基准测试，采用学习型文本分类器进行评估。\n    - 构建伦敦市中心的视频QA数据集，包含419,000个样本及其自由形式的问题和答案。\n    - 基于Vicuna-1.5-7B模型组合，为该领域确立新的基准。\n\n- [DriveMLM：将多模态大语言模型与自动驾驶行为规划状态对齐](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.09245)\n  - 王文海、谢江伟、胡传阳、邹浩明、范佳楠、童文文、温杨、吴思磊、邓汉明、李志琪、田浩、陆雷威、朱锡洲、王小刚、乔宇、戴继峰\n  - 出版单位：OpenGVLab、上海人工智能实验室、香港中文大学、商汤科技研究院、斯坦福大学、南京大学、清华大学\n  - 任务：规划 + 解释\n  - 代码：[官方](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FDriveMLM)\n  - 环境：[Carla](https:\u002F\u002Fcarla.org\u002F)\n  - 发表日期：2023年12月14日\n  - 摘要：\n    - DriveMLM是首个能够在真实模拟器中执行闭环自动驾驶的基于LLM的AD框架。\n    - 设计了一种用于决策预测的MLLM规划器，并开发了一个数据引擎，可有效生成决策状态及相应的解释性标注，用于模型训练和评估。\n    - 在CARLA Town05 Long上取得了76.1 DS、0.955 MPI的成绩。\n\n- [自动驾驶中的大语言模型：现实世界实验](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.09397)\n  - 崔灿、马云生、曹旭、叶文倩、周阳、梁凯钊、陈锦泰、卢俊武、杨子冲、廖贵达、高天仁、李尔龙、唐坤、曹志鹏、周彤、刘傲、闫欣睿、梅淑琪、曹建国、王子然、郑超\n  - 出版单位：普渡大学\n  - 发表日期：2023年12月14日\n  - 项目：[官方](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgcRcf9w8BmJfZigDhk1SAfXV0FY65cO7)\n  - 摘要：\n    - 介绍基于大语言模型(LLM)的Talk-to-Drive（Talk2Drive）框架，用于处理人类口头命令并结合上下文信息做出自动驾驶决策，以满足用户在安全性、效率和舒适性方面的个性化偏好。\n\n- [LMDrive：基于大型语言模型的闭环端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.07488)\n  - 郝绍、胡宇轩、王乐天、史蒂文·L·瓦斯兰德、刘宇、李洪胜      **CVPR 2024**\n  - 出版单位：香港中文大学MMLab、商汤研究院、InnoHK下的CPII、多伦多大学、上海人工智能实验室\n  - 任务：规划 + 数据集\n  - 代码：[官方](https:\u002F\u002Fgithub.com\u002Fopendilab\u002FLMDrive)\n  - 环境：[Carla](https:\u002F\u002Fcarla.org\u002F)\n  - 发表日期：2023年12月12日\n  - 摘要：\n    - LMDrive是一种新颖的端到端、闭环、基于语言的自动驾驶框架。\n    - 发布包含导航指令、提示性指令、多模态多视角传感器数据及控制信号的6.4万片段数据集。\n    - 提出用于评估自主智能体的基准LangAuto。\n\n- [大型语言模型在自动驾驶决策中的评估](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.06351.pdf)\n  - 小田桥孝太郎、井上裕一、山口优、八木沼秀达、塩塚大辉、岛谷博之、岩政浩平、井上义明、山口隆文、五十里幸希、堀内司、徳弘健人、十口雄吾、青木俊介\n  - 出版单位：日本图灵公司\n  - 任务：评估\n  - 发表日期：2023年12月11日\n  - 摘要：\n    - 评估两大核心能力：\n      - 空间感知决策能力，即LLM能够根据坐标信息准确识别空间布局；\n      - 遵守交通规则的能力，确保LLM在驾驶过程中严格遵守交通法规。\n\n- [LaMPilot：面向语言模型程序的自动驾驶开放基准数据集](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.04372)\n  - 马云生、崔灿、曹旭、叶文谦、刘培然、陆俊武、阿姆尔·阿卜杜勒拉乌夫、罗希特·古普塔、韩京泰、阿尼凯特·贝拉、詹姆斯·M·雷格、王子然\n  - 出版单位：普渡大学、伊利诺伊大学厄巴纳-香槟分校、弗吉尼亚大学、InfoTech实验室、丰田汽车北美公司\n  - 任务：基准测试\n  - 发表日期：2023年12月7日\n  - 摘要：\n    - LaMPilot是首个专为评估基于LLM的智能体在驾驶场景中表现而设计的交互式环境与数据集。\n    - 包含4,900个场景，专门用于评估自动驾驶中的指令跟踪任务。\n\n- [Reason2Drive：迈向自动驾驶的可解释性和链式推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03661)\n  - 聂明、彭仁远、王春伟、蔡欣悦、韩建华、徐航、张莉\n  - 出版单位：复旦大学、华为诺亚方舟实验室\n  - 任务：VQA + 数据集\n  - 代码：[官方](https:\u002F\u002Fgithub.com\u002Ffudan-zvg\u002FReason2Drive)\n  - 数据集：\n    - [nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n    - [Waymo](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n    - [ONCE](https:\u002F\u002Fonce-for-auto-driving.github.io\u002Findex.html)\n  - 发表日期：2023年12月6日\n  - 摘要：\n    - Reason2Drive是一个包含超过60万个视频-文本对的基准数据集，旨在促进对复杂驾驶环境中可解释性推理的研究。\n    - 引入一种新的评估指标来衡量自动驾驶环境中基于链式推理的性能，并解决BLEU和CIDEr等现有指标存在的语义模糊问题。\n    - 提出一个简单而有效的框架，通过添加先验分词器和指令引导的视觉解码器两个新组件来增强现有的VLM。\n\n- [GPT-4增强的自动驾驶多模态定位：利用大型语言模型的跨模态注意力机制](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03543)\n  - 廖海成、沈焕明、李振宁、王承越、李国发、别一鸣、许成忠\n  - 出版单位：澳门大学、电子科技大学、重庆大学、吉林大学\n  - 任务：检测\u002F预测\n  - 代码：[官方](https:\u002F\u002Fgithub.com\u002FPetrichor625\u002FTalk2car_CAVG)\n  - 数据集：\n    - 
[Talk2car](https:\u002F\u002Fgithub.com\u002Ftalk2car\u002FTalk2Car)\n  - 发表日期：2023年12月6日\n  - 摘要：\n    - 使用五个编码器（文本、图像、上下文及跨模态）结合多模态解码器来预测目标边界框。\n\n- [Dolphins：用于驾驶的多模态语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.00438)\n  - 马英姿、曹玉龙、孙嘉晨、马可·帕沃内、肖超伟 **ECCV 2024**\n  - 出版单位：威斯康星大学麦迪逊分校、NVIDIA、密歇根大学、斯坦福大学\n  - 任务：VQA\n  - 项目：[Dolphins](https:\u002F\u002Fvlm-driver.github.io\u002F)\n  - 代码：[Dolphins](https:\u002F\u002Fgithub.com\u002Fvlm-driver\u002FDolphins)\n  - 数据集：\n    - 图像指令遵循数据集\n      - [GQA](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002Fabout.html)\n      - [MSCOCO](https:\u002F\u002Fcocodataset.org\u002F#home)：[VQAv2](https:\u002F\u002Fvisualqa.org\u002F)、[OK-VQA](https:\u002F\u002Fokvqa.allenai.org\u002F)、[TDIUC](https:\u002F\u002Fkushalkafle.com\u002Fprojects\u002Ftdiuc.html)、[Visual Genome数据集](https:\u002F\u002Fhomes.cs.washington.edu\u002F~ranjay\u002Fvisualgenome\u002Findex.html)\n    - 视频指令遵循数据集\n      - [BDD-X](https:\u002F\u002Fgithub.com\u002FJinkyuKimUCB\u002FBDD-X-dataset)\n  - 发表日期：2023年12月1日\n  - 摘要：\n    - Dolphins基于OpenFlamingo架构，是一款基于VLM的对话式驾驶助手。\n    - 设计了 grounded CoT (GCoT)指令调优方法，并开发了相关数据集。\n\n- [驶向未来：基于世界模型的多视角视觉预测与规划在自动驾驶中的应用](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17918)\n  - 王宇奇、何佳伟、范璐、李宏鑫、陈云涛、张兆祥\n  - 出版单位：中科院自动化所、CAIR、香港科学园研究所、中国科学院\n  - 任务：生成\n  - 项目：[Drive-WM](https:\u002F\u002Fdrive-wm.github.io\u002F)\n  - 代码：[Drive-WM](https:\u002F\u002Fgithub.com\u002FBraveGroup\u002FDrive-WM)\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)、[Waymo开放数据集](https:\u002F\u002Fwaymo.com\u002Fopen\u002F)\n  - 发表日期：2023年11月29日\n  - 摘要：\n    - Drive-WM是一种多视角世界模型，能够在自动驾驶场景中生成高质量、可控且一致的多视角视频。\n    - 首次探索世界模型在自动驾驶端到端规划中的潜在应用。\n\n- [从安全角度赋能自动驾驶：大型语言模型的应用](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.00812)\n  - 王一轩、焦若晨、郎成田、辛农·西蒙·詹、黄超、王兆然、杨卓然、朱琪\n  - 出版单位：西北大学、利物浦大学、耶鲁大学\n  - 任务：规划\n  - 环境：[HighwayEnv](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - 代码：[官方](https:\u002F\u002Fgithub.com\u002Fwangyixu14\u002Fllm_conditioned_mpc_ad)\n  - 发表日期：2023年11月28日\n  - 摘要：\n    - 将LLM部署为规划中的智能决策者，结合上下文安全学习的安全验证器，以提升整体自动驾驶性能和安全性。\n\n- [GPT-4V掌舵：行人行为预测的潜力与挑战评估](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.14786)\n  - 黄佳、蒋鹏、阿尔维卡·高塔姆、斯里坎特·萨里帕利\n  - 出版单位：美国科斯塔站德州农工大学\n  - 任务：评估（行人行为预测）\n  - 数据集：\n    - [JAAD](https:\u002F\u002Fdata.nvision2.eecs.yorku.ca\u002FJAAD_dataset\u002F)\n    - [PIE](https:\u002F\u002Fdata.nvision2.eecs.yorku.ca\u002FPIE_dataset\u002F)\n    - [WiDEVIEW](https:\u002F\u002Fgithub.com\u002Funmannedlab\u002FUWB_Dataset)\n  - 摘要：\n    - 利用公开可用的数据集，全面评估了GPT-4V在自动驾驶中用于行人行为预测的潜力。\n    - 目前仍不及最先进的传统领域专用模型。\n    - 尽管GPT-4V在行人行为预测的人工智能能力方面取得了显著进展，但要充分挖掘其在实际应用中的潜力，仍需持续开发和优化。\n\n- [ADriver-I：面向自动驾驶的通用世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.13549)\n  - 贾凡、毛伟欣、刘英飞、赵宇成、温玉清、张驰、张翔宇、王天赐\n  - 出版单位：旷视科技、早稻田大学、中国科学技术大学、Mach Drive\n  - 任务：生成+规划\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)、大规模私有数据集\n  - 发表日期：2023年11月22日\n  - 摘要：\n    - ADriver-I以视觉-动作对作为输入，自回归地预测当前帧的控制信号。生成的控制信号连同历史视觉-动作对进一步作为条件，用于预测未来帧。\n    - MLLM（多模态大语言模型）=[LLaVA-7B-1.5](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA)，VDM（视频扩散模型）=[latent-diffusion](https:\u002F\u002Fgithub.com\u002FCompVis\u002Flatent-diffusion)\n  - 评价指标：\n    - 当前帧的速度和转向角的L1误差。\n    - 生成质量：弗雷歇起始距离（FID）、弗雷歇视频距离（FVD）。\n\n- [用于自动驾驶的语言智能体](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.10813)\n  - 毛家庚、叶俊杰、钱宇曦、马可·帕沃内、王悦\n  - 南加州大学、斯坦福大学、NVIDIA\n  - 任务：生成+规划\n  - 
项目：[Agent-Driver](https:\u002F\u002Fusc-gvl.github.io\u002FAgent-Driver\u002F)\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - 发表日期：2023年11月17日\n  - 摘要：\n    - Agent-Driver整合了用于动态感知与预测的工具库、存储人类知识的认知记忆，以及模拟人类决策过程的推理引擎。\n    - 在运动规划方面，沿用GPT-Driver（#GPT-Driver）的方法，使用nuScenes训练集中的真实驾驶轨迹对大语言模型进行一个epoch的微调。\n    - 神经模块则采用[UniAD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.10156)中的模块。\n  - 评价指标：\n    - L2误差（以米为单位）和碰撞率（以百分比计）。\n\n- [基于LLM的以人为本的自动驾驶系统，用于用户指令推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.08206)\n  - 杨毅、张庆文、李慈、丹尼尔·西蒙斯·玛尔塔、纳兹雷·巴图尔、约翰·福尔克森\n  - 出版单位：瑞典皇家理工学院、斯堪尼亚公司\n  - 任务：问答\n  - 代码：[DriveCmd](https:\u002F\u002Fgithub.com\u002FKTH-RPL\u002FDriveCmd_LLM)\n  - 数据集：[UCU数据集](https:\u002F\u002Fgithub.com\u002FLLVM-AD\u002Fucu-dataset)\n  - 发表日期：2023年11月14日\n  - 摘要：\n    - 提出利用大型语言模型（LLM）的推理能力，从车内用户的指令中推断系统需求。\n    - LLVM-AD研讨会 @ WACV 2024\n  - 评价指标：\n    - 问题级别的准确率（每个单独问题的正确率）。\n    - 指令级别的准确率（只有当某一指令的所有问题都被正确识别时，才认定该指令的准确率）。\n\n- [与GPT-4V(ision)同行：视觉-语言模型在自动驾驶中的早期探索](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.05332)\n  - 文立成、杨雪萌、傅道成、王晓峰、蔡品龙、李鑫、马涛、李颖轩、徐林然、尚登科、朱正、孙绍燕、白业奇、蔡新宇、窦敏、胡双陆、史博天\n  - 出版单位：上海人工智能实验室、GigaAI、华东师范大学、香港中文大学、WeRide.ai\n  - 项目：[官方](https:\u002F\u002Fgithub.com\u002FPJLab-ADG\u002FGPT4V-AD-Exploration)\n  - 数据集：\n    - 场景理解：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)、[BDD-X](https:\u002F\u002Fgithub.com\u002FJinkyuKimUCB\u002FBDD-X-dataset)、[Carla](https:\u002F\u002Fgithub.com\u002Fcarla-simulator)、[TSDD](http:\u002F\u002Fwww.nlpr.ia.ac.cn\u002Fpal\u002Ftrafficdata\u002Fdetection.html)、[Waymo](https:\u002F\u002Farxiv.org\u002Fabs\u002F1912.04838)、[DAIR-V2X](https:\u002F\u002Fthudair.baai.ac.cn\u002Findex)、[CitySim](https:\u002F\u002Fgithub.com\u002Fozheng1993\u002FUCF-SST-CitySim-Dataset)。\n    - 推理能力：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)、[D2-city](https:\u002F\u002Farxiv.org\u002Fabs\u002F1904.01975)、[Carla](https:\u002F\u002Fgithub.com\u002Fcarla-simulator)、[CODA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.07724)以及互联网。\n    - 扮演驾驶员：真实道路驾驶场景。\n  - 发表日期：2023年11月9日\n  - 摘要：\n    - 对GPT-4V在各类自动驾驶场景中的表现进行了全面且多角度的评估。\n    - 测试了GPT-4V在场景理解、推理以及扮演驾驶员等方面的能力。\n\n- [ChatGPT作为你的车载副驾驶：一次初步尝试](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10286969)\n  - 王世义、朱宇轩、李志恒、王宇彤、李莉、何正兵\n  - 出版单位：清华大学、中国科学院自动化研究所、麻省理工学院\n  - 任务：规划\n  - 发表日期：2023年10月17日\n  - 摘要：\n    - 设计了一个通用框架，将大语言模型嵌入为车辆的“副驾驶”，使其能够根据提供的信息，在满足人类意图的前提下完成特定的驾驶任务。\n\n- [MagicDrive：具有多样化3D几何控制的街景生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02601)\n  - 高瑞源、陈凯、谢恩泽、洪兰青、李振国、叶德彦、许强\n  - 出版单位：香港中文大学、香港科技大学、华为诺亚方舟实验室\n  - 任务：生成\n  - 项目：[MagicDrive](https:\u002F\u002Fgaoruiyuan.com\u002Fmagicdrive\u002F)\n  - 代码：[MagicDrive](https:\u002F\u002Fgithub.com\u002Fcure-lab\u002FMagicDrive)\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - 发表日期：2023年10月13日\n  - 摘要：\n    - MagicDrive通过独立编码道路地图、物体边界框和相机参数，利用3D标注中的几何信息，实现精确的几何引导合成，从而生成高度逼真的图像。这种方法有效解决了多摄像头视角一致性的问题。\n    - 然而，在一些复杂场景中，例如夜间场景和未见过的天气条件下，仍然面临巨大挑战。\n\n- [接收、推理与反应：利用大型语言模型在自动驾驶中实现言出必行](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08034)\n  - 崔灿、马云生、曹旭、叶文谦、王子然\n  - 出版单位：普渡大学、伊利诺伊大学厄巴纳-香槟分校、弗吉尼亚大学、PediaMed.AI。\n  - 任务：规划\n  - 项目：[视频](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgcRcf9w8BmLJi_fqTGq-7KCZsbpEIE4a)\n  - 环境：[HighwayEnv](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - 发表日期：2023年10月12日\n  - 摘要：\n    - 利用专门工具发挥大型语言模型的语言和上下文理解能力，将LLM的语言与推理能力融入自动驾驶车辆。\n\n- 
[DrivingDiffusion：基于潜在扩散模型的布局引导多视角驾驶场景视频生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.07771)\n  - 李晓凡、张一夫、叶小青\n  - 出版单位：百度公司\n  - 任务：生成\n  - 项目：[官方](https:\u002F\u002Fdrivingdiffusion.github.io\u002F)\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - 摘要：\n    - 针对复杂城市场景中从3D布局生成多视角视频数据这一新问题。\n    - 提出生成模型DrivingDiffusion，以确保生成视频的跨视角、跨帧一致性及实例质量。\n    - 在nuScenes数据集上达到最先进的视频合成性能。\n  - 评价指标：\n    - 生成质量：Frechet Inception Distance(FID)、Frechet Video Distance(FVD)\n    - 分割指标：mIoU\n\n- \u003Ca id=\"LanguageMPC\">\u003C\u002Fa>[LanguageMPC：大型语言模型作为自动驾驶决策者](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.03026)\n  - 沙浩、穆瑶、蒋宇轩、陈力、徐晨峰、罗平、李恩波、富冢正敏、詹伟、丁明宇\n  - 出版单位：清华大学、香港大学、加州大学伯克利分校\n  - 任务：规划\u002F控制\n  - 代码：[官方](https:\u002F\u002Fsites.google.com\u002Fview\u002Fllm-mpc)\n  - 环境：\n    - [ComplexUrbanScenarios](https:\u002F\u002Fgithub.com\u002Fliuyuqi123\u002FComplexUrbanScenarios)\n    - [Carla](https:\u002F\u002Fgithub.com\u002Fcarla-simulator)\n  - 发表日期：2023年10月4日\n  - 摘要：\n    - 利用大型语言模型通过思维链提供高层次决策。\n    - 将高层决策转化为数学表达式，以指导底层控制器(MPC)。\n    - 评价指标：失败\u002F碰撞次数、效率、时间、惩罚分数。\n\n- \u003Ca id=\"DrivingwithLLMs\">\u003C\u002Fa>[使用LLM驾驶：融合对象级向量模态实现可解释的自动驾驶](https:\u002F\u002Fbrowse.arxiv.org\u002Fabs\u002F2310.01957)\n  - 陈龙、奥列格·西纳夫斯基、扬·许内尔曼、爱丽丝·卡恩松德、安德鲁·詹姆斯·威尔莫特、丹尼·伯奇、丹尼尔·芒德、杰米·肖顿\n  - 出版单位：Wayve公司\n  - 任务：规划+VQA\n  - 代码：[官方](https:\u002F\u002Fgithub.com\u002Fwayveai\u002FDriving-with-LLMs)\n  - 模拟器：定制开发的真实2D模拟器（该模拟器未开源）。\n  - 数据集：[Driving QA](https:\u002F\u002Fgithub.com\u002Fwayveai\u002FDriving-with-LLMs\u002Ftree\u002Fmain\u002Fdata)，由RL专家在模拟器中收集的数据。\n  - 发表日期：2023年10月3日\n  - 摘要：\n    - 提出一种独特的对象级多模态LLM架构(Llama2+Lora)，仅使用向量化表示作为输入。\n    - 开发了一个包含16万个问答对的新数据集，源自1万个驾驶场景（控制指令由RL(PPO)收集，问答对由GPT-3.5生成）。\n    - 评价指标：\n      - 交通信号灯检测准确率\n      - 交通信号灯距离预测的MAE\n      - 加速度预测的MAE\n      - 制动压力预测的MAE\n      - 转向角度预测的MAE\n\n- [Talk2BEV：用于自动驾驶的语言增强鸟瞰图地图](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02251)\n  - 维克兰特·德万甘、图沙尔·乔杜里、希瓦姆·钱多克、舒巴姆·普里亚达尔尚、阿努什卡·贾因、阿伦·K·辛格、西达尔特·斯里瓦斯塔瓦、克里希纳·穆尔蒂·贾塔瓦拉布拉胡拉、K·马达瓦·克里希纳\n  - 出版单位：海得拉巴国际信息技术学院、英属哥伦比亚大学、塔尔图大学、TensorTour公司、麻省理工学院\n  - 项目页面：[官方](https:\u002F\u002Fllmbev.github.io\u002Ftalk2bev\u002F)\n  - 代码：[Talk2BEV](https:\u002F\u002Fgithub.com\u002Fllmbev\u002Ftalk2bev)\n  - 发表日期：2023年10月3日\n  - 摘要：\n    - 介绍Talk2BEV，这是一种用于自动驾驶场景中鸟瞰图(BEV)地图的大规模视觉语言模型(LVLM)接口。\n    - 无需任何训练或微调，直接依赖预训练的图文模型。\n    - 开发并发布Talk2BEV-Bench基准测试，涵盖1000个由人工标注的BEV场景，包含来自NuScenes数据集的超过2万条问题及真实答案。\n\n- \u003Ca id=\"DriveGPT4\">\u003C\u002Fa>[DriveGPT4：通过大型语言模型实现可解释的端到端自动驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01412)\n  - 徐振华、张雨佳、谢恩泽、赵震、郭勇、Kenneth K. Y. 
Wong、李振国、赵恒爽\n  - 出版单位：香港大学、浙江大学、华为诺亚方舟实验室、悉尼大学\n  - 项目页面：[官方](https:\u002F\u002Ftonyxuqaq.github.io\u002Fprojects\u002FDriveGPT4\u002F)\n  - 任务：规划\u002F控制+VQA\n  - 数据集：\n    - [BDD-X数据集](https:\u002F\u002Fgithub.com\u002FJinkyuKimUCB\u002FBDD-X-dataset)。\n  - 发表日期：2023年10月2日\n  - 摘要：\n    - 开发一个新的视觉指令调优数据集（基于BDD-X），用于借助ChatGPT\u002FGPT4实现可解释的自动驾驶。\n    - 提出一种新型多模态LLM，名为DriveGPT4(Valley + LLaVA)。\n  - 评价指标：\n    - BLEU4、CIDEr和METEOR评分，以及ChatGPT评分。\n    - 控制信号预测的RMSE。\n\n- \u003Ca id=\"GPT-Driver\">\u003C\u002Fa>[GPT-DRIVER：用GPT学习驾驶](https:\u002F\u002Fbrowse.arxiv.org\u002Fabs\u002F2310.01415v1)\n  - 毛家庚、钱宇曦、赵航、王悦\n  - 出版单位：南加州大学、清华大学\n  - 任务：规划（微调预训练模型）\n  - 项目：[官方](https:\u002F\u002Fpointscoder.github.io\u002Fprojects\u002Fgpt_driver\u002Findex.html)\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - 代码：[GPT-Driver](https:\u002F\u002Fgithub.com\u002FPointsCoder\u002FGPT-Driver)\n  - 发表日期：2023年10月2日\n  - 摘要：\n    - 将运动规划视为语言建模问题。\n    - 通过使用OpenAI微调API的策略，使LLM的输出与人类驾驶行为保持一致。\n    - 利用LLM生成驾驶轨迹。\n  - 评价指标：\n    - L2指标和碰撞率\n\n- [GAIA-1：用于自动驾驶的生成式世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17080)\n  - 作者：Anthony Hu、Lloyd Russell、Hudson Yeo、Zak Murez、George Fedoseev、Alex Kendall、Jamie Shotton、Gianluca Corrado\n  - 出版方：Wayve\n  - 任务：生成\n  - 数据集：\n    - 训练数据集包含4,700小时、25Hz帧率的专有驾驶数据，这些数据于2019年至2023年间在英国伦敦收集，共计约4.2亿张独特图像。\n    - 验证数据集包含400小时的驾驶数据，来自未纳入训练集的行驶记录。\n    - 文本数据来源于在线叙述或离线元数据源。\n  - 发表日期：2023年9月29日\n  - 摘要：\n    - 介绍GAIA-1，这是一种利用视频（预训练的DINO）、文本（T5-large）和动作输入来生成逼真驾驶场景的生成式世界模型。\n    - 它可作为有价值的神经网络模拟器，用于生成无限量的数据。\n\n- [DiLu：基于大语言模型的自动驾驶知识驱动方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.16292)\n  - 作者：Licheng Wen、Daocheng Fu、Xin Li、Xinyu Cai、Tao Ma、Pinlong Cai、Min Dou、Botian Shi、Liang He、Yu Qiao   **ICLR 2024**\n  - 出版方：上海人工智能实验室、华东师范大学、香港中文大学\n  - 发表日期：2023年9月28日\n  - 任务：规划\n  - 环境：\n    - [HighwayEnv](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n    - [CitySim](https:\u002F\u002Fgithub.com\u002Fozheng1993\u002FUCF-SST-CitySim-Dataset)，一个基于无人机的车辆轨迹数据集。\n  - 摘要：\n    - 提出DiLu框架，该框架结合了推理模块和反思模块，使系统能够基于常识知识进行决策，并不断进化。\n\n- [SurrealDriver：基于大语言模型的城市环境生成式驾驶员代理仿真框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.13193)\n  - 作者：Ye Jin、Xiaoxi Shen、Huiling Peng、Xiaoan Liu、Jingli Qin、Jiayang Li、Jintao Xie、Peizhong Gao、Guyue Zhou、Jiangtao Gong\n  - 关键词：人机交互、驾驶员模型、代理、生成式AI、大语言模型、仿真框架\n  - 环境：[CARLA](https:\u002F\u002Fgithub.com\u002Fcarla-simulator)\n  - 出版方：清华大学\n  - 摘要：提出一种基于大语言模型（LLMs）的生成式驾驶员代理仿真框架，能够感知复杂的交通场景并提供逼真的驾驶操作。\n\n- [像说话一样驾驶：在自动驾驶车辆中实现与大语言模型的人类般交互](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.10228)\n  - 作者：Can Cui、Yunsheng Ma、Xu Cao、Wenqian Ye、Ziran Wang\n  - 出版方：普渡大学、PediaMed.AI实验室、弗吉尼亚大学\n  - 任务：规划\n  - 发表日期：2023年9月18日\n  - 摘要：\n    - 提供一个将大语言模型（LLMs）集成到AD中的综合框架。\n\n- \u003Ca id=\"DriveDreamer\">\u003C\u002Fa>[DriveDreamer：迈向由真实世界驱动的自动驾驶世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.09777)\n  - 作者：Xiaofeng Wang、Zheng Zhu、Guan Huang、Xinze Chen、Jiwen Lu  **ECCV 2024**\n  - 出版方：GigaAI、清华大学\n  - 任务：生成\n  - 项目页面：[官方](https:\u002F\u002Fdrivedreamer.github.io\u002F)\n  - 数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - 发表日期：2023年9月18日\n  - 摘要：\n    - 利用强大的扩散模型构建复杂环境的全面表示。\n    - 通过多模态（文本、图像、HDMap、动作、3DBox）世界模型生成未来的驾驶视频和驾驶策略。\n\n- [你能描述一下正在发生的事情吗？将预训练的语言编码器集成到自动驾驶轨迹预测模型中](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.05282)\n  - 作者：Ali Keysan、Andreas Look、Eitan Kosman、Gonca Gürsun、Jörg Wagner、Yu Yao、Barbara Rakitsch\n  - 出版方：博世人工智能中心、图宾根大学\n  - 任务：预测\n  - 
数据集：[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)\n  - 发表日期：2023年9月13日\n  - 摘要：\n    - 将预训练的语言模型作为基于文本的输入编码器集成到AD轨迹预测任务中。\n  - 评估指标：\n    - 最小平均位移误差（minADEk）\n    - 最终位移误差（minFDEk）\n    - 超过2米的漏检率\n\n- [TrafficGPT：查看、处理并与交通基础模型互动](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.06719)\n  - 作者：Siyao Zhang、Daocheng Fu、Zhao Zhang、Bin Yu、Pinlong Cai\n  - 出版方：北京航空航天大学、智能交通技术与系统重点实验室、上海人工智能实验室\n  - 任务：规划\n  - 代码：[官方](https:\u002F\u002Fgithub.com\u002Flijlansg\u002FTrafficGPT.git)\n  - 发表日期：2023年9月13日\n  - 摘要：\n    - 展示TrafficGPT——ChatGPT与交通基础模型的融合。\n    - 通过定义一系列提示，弥合了大语言模型与交通基础模型之间的关键鸿沟。\n\n- [HiLM-D：迈向自动驾驶多模态大语言模型中的高分辨率理解](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.05186)\n  - 作者：Xinpeng Ding、Jianhua Han、Hang Xu、Wei Zhang、Xiaomeng Li\n  - 出版方：香港科技大学、华为诺亚方舟实验室\n  - 任务：检测+VQA\n  - 数据集：[DRAMA](https:\u002F\u002Fusa.honda-ri.com\u002Fdrama)\n  - 发表日期：2023年9月11日\n  - 摘要：\n    - 提出HiLM-D（迈向自动驾驶多模态大语言模型中的高分辨率理解），这是一种将HR信息高效融入MLLMs以完成ROLISP任务的方法。\n    - ROLISP旨在识别、解释并定位对自车构成风险的对象，同时预测其意图并给出建议。\n  - 评估指标：\n    - LLM指标：BLEU4、CIDEr、METEOR、SPICE。\n    - 检测指标：mIoU、IoUs等。\n\n- \u003Ca id=\"LanguagePrompt\">\u003C\u002Fa>[自动驾驶的语言提示](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.04379)\n  - 作者：Dongming Wu、Wencheng Han、Tiancai Wang、Yingfei Liu、Xiangyu Zhang、Jianbing Shen\n  - 出版方：北京理工大学、澳门大学、旷视科技、北京人工智能研究院\n  - 任务：跟踪\n  - 代码：[官方](https:\u002F\u002Fgithub.com\u002Fwudongming97\u002FPrompt4Driving)\n  - 数据集：NuPrompt（未公开），基于[nuScenes](https:\u002F\u002Fwww.nuscenes.org\u002Fnuscenes)。\n  - 发表日期：2023年9月8日\n  - 摘要：\n    - 提出一套基于nuScenes的大规模语言提示集，用于驾驶场景，名为NuPrompt（3D物体-文本对）。\n    - 提出一种高效的基于提示的跟踪模型，在PFTrack的基础上进行提示推理优化，称为PromptTrack。\n\n- [MTD-GPT：用于无信号交叉口自动驾驶的多任务决策GPT模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16118)\n  - 刘家琪、彭航、肖琦、王建强、孙健。*ITSC 2023*\n  - 出版单位：同济大学、清华大学\n  - 任务：预测\n  - 环境：[HighwayEnv](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - 发表日期：2023年7月30日\n  - 摘要：\n    - 设计了一套利用强化学习算法训练单任务决策专家并使用专家数据的流程。\n    - 提出MTD-GPT模型，用于自动驾驶车辆在无信号交叉口的多任务（左转、直行、右转）决策。\n\n- [从大型语言模型中进行领域知识蒸馏：自动驾驶领域的实证研究](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.11769)\n  - 唐云、安东尼奥·A·布鲁托·达科斯塔、张希哲、欧文·帕特里克、西达塔·卡斯特吉尔、保罗·詹宁斯。*ITSC 2023*\n  - 出版单位：华威大学\n  - 任务：问答\n  - 发表日期：2023年7月17日\n  - 摘要：\n    - 开发了一款基于Web的知识蒸馏助手，通过提示工程和LLM ChatGPT实现运行时的监督与灵活干预。\n\n- [像人类一样驾驶：用大型语言模型重新思考自动驾驶](https:\u002F\u002Fbrowse.arxiv.org\u002Fabs\u002F2307.07162)\n  - 傅道成、李欣、温立成、窦敏、蔡品龙、史博天、乔宇\n  - 出版单位：上海人工智能实验室、华东师范大学\n  - 任务：规划\n  - 代码：[官方](https:\u002F\u002Fgithub.com\u002FPJLab-ADG\u002FDriveLikeAHuman)\n  - 环境：[HighwayEnv](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n  - 发表日期：2023年7月14日\n  - 摘要：\n    - 识别出三种关键能力：推理、解释与记忆（积累经验与自我反思）。\n    - 将LLM应用于自动驾驶中的决策环节，以解决长尾场景问题并提升可解释性。\n    - 在闭环离线数据中验证了其可解释性。\n\n- [基于场景级扩散的语言引导交通仿真](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.06344)\n  - 钟子渊、戴维斯·伦佩、陈宇晓、鲍里斯·伊万诺维奇、曹宇龙、徐丹菲、马可·帕沃内、雷百石\n  - 出版单位：哥伦比亚大学、NVIDIA研究院、斯坦福大学、佐治亚理工学院\n  - 任务：扩散\n  - 发表日期：2023年7月10日\n  - 摘要：\n    - 提出CTG++，一种语言引导的场景级条件扩散模型，用于生成符合查询要求的逼真交通仿真。\n    - 利用LLM将用户查询转化为可微分的损失函数，并提出一种基于时空Transformer架构的场景级条件扩散模型，将该损失函数转换为逼真的、符合查询要求的轨迹。\n\n- [ADAPT：动作感知型驾驶描述Transformer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.00673)\n  - 金步、刘鑫宇、郑宇鹏、李鹏飞、赵浩、张彤、郑宇航、周古悦、刘静静 **ICRA 2023**\n  - 出版单位：中国科学院、清华大学、北京大学、西安电子科技大学、南方科技大学、北京航空航天大学\n  - 代码：[ADAPT](https:\u002F\u002Fgithub.com\u002Fjxbbb\u002FADAPT)\n  - 数据集：[BDD-X数据集](https:\u002F\u002Fgithub.com\u002FJinkyuKimUCB\u002FBDD-X-dataset)\n  - 摘要：\n    - 提出ADAPT，一种全新的端到端Transformer驱动的动作叙述与推理框架，专为自动驾驶车辆设计。\n    - 
提议一种多任务联合训练框架，将驾驶动作描述任务与控制信号预测任务对齐。\n\n\u003C\u002Fdetails>\n\n\n\n## 工作坊\n\u003Cdetails open>\n\u003Csummary>展开\u002F收起\u003C\u002Fsummary>\n\n- [用于自动驾驶的大规模语言与视觉模型(LLVM-AD)研讨会 @ WACV 2024](https:\u002F\u002Fllvm-ad.github.io\u002F)\n  - 出版单位：腾讯地图高清地图T.Lab、伊利诺伊大学厄巴纳-香槟分校、普渡大学、弗吉尼亚大学\n  - 挑战1：MAPLM：用于地图与交通场景理解的大规模视觉-语言数据集\n    - 数据集：[下载](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1cqFjBH8MLeP6nKFM0l7oV-Srfke-Mx1R?usp=sharing)\n    - 任务：问答\n    - 代码：https:\u002F\u002Fgithub.com\u002FLLVM-AD\u002FMAPLM\n    - 描述：MAPLM结合点云鸟瞰图和全景图像，提供丰富的道路场景图像集合。它包含多层次的场景描述数据，有助于模型在复杂多样的交通环境中导航。\n    - 评价指标：\n      - 帧整体准确率（FRM）：若关于某一帧的所有封闭式问题均回答正确，则该帧被视为正确。\n      - 问题整体准确率（QNS）：若某一个问题的答案正确，则该问题被视为正确。\n      - LAN：当前道路上有多少条车道？\n      - INT：主干道上是否存在路口、交叉口或变道区域？\n      - QLT：该图像中当前路段的点云数据质量如何？\n      - SCN：图片中呈现的是哪种道路场景？(SCN)    \n  - 挑战2：车内用户指令理解（UCU）\n    - 数据集：[下载](https:\u002F\u002Fgithub.com\u002FLLVM-AD\u002Fucu-dataset\u002Fblob\u002Fmain\u002Fucu.csv)\n    - 任务：问答\n    - 代码：https:\u002F\u002Fgithub.com\u002FLLVM-AD\u002Fucu-dataset\n    - 描述：\n      - 本数据集专注于理解自动驾驶汽车环境下的用户指令。它包含1,099条已标注的指令，每条指令都是用户向车辆发出的请求语句。\n    - 评价指标：\n      - 指令级别准确率：若八项答案全部正确，则该指令被视为被正确理解。\n      - 问题级别准确率：按单个问题进行评估。\n\u003C\u002Fdetails>\n\n## 数据集\n\u003Cdetails open>\n\u003Csummary>展开\u002F收起\u003C\u002Fsummary>\n\n```\n格式：\n- [标题](数据集链接) [链接]\n  - 作者1、作者2、作者3…\n  - 关键词\n  - 实验环境或任务\n```\n\n- [用于自动驾驶的交互式增强驾驶数据集](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.20575)\n  - 冯浩杰、张培志、田梦洁、张欣睿、李卓仁、黄俊鹏、王秀荣、朱俊凡、王建洲、尹东晓、熊璐\n  - 发表日期：2026年2月24日\n  - 任务：VQA\n  - 数据集：[IEDD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.20575)\n  - 摘要：\n    - 提出交互式增强驾驶数据集（IEDD），以解决自动驾驶中视觉-语言-行动（VLA）模型面临的交互场景稀疏及多模态对齐不足的问题。\n    - 开发了一套可扩展的流水线，从自然驾驶数据中挖掘百万级别的交互片段，并构建了IEDD-VQA数据集，其中合成的BEV视频将语义动作与结构化语言严格对齐。\n    - 提供基准结果，评估了十种主流视觉语言模型（VLMs），以展示该数据集在评估和微调自动驾驶模型推理能力方面的重用价值。\n    - \n- [先于可见：基于视觉语言模型的自动驾驶潜在风险推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.22928)\n  - 刘佳鑫、闫翔宇、彭亮、杨磊、张凌军、罗岳晨、陶跃明、谭宇轩、李牧、张雷、战子琪、郭赛、王宏、李俊\n  - 发表日期：2025年11月28日\n  - 任务：VQA\n  - 数据集：[PotentialRiskQA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.22928)\n  - 摘要：\n    - 引入PotentialRiskQA，这是一个全新的视觉-语言数据集，用于在潜在风险变得可见之前对其进行推理，包含场景描述、前兆及推断结果的结构化标注。\n    - 提出PR-Reasoner框架，该框架基于视觉语言模型，专为车载潜在风险推理设计，在使用新数据集进行微调后表现出显著的性能提升。\n\n- [CARScenes：面向安全自动驾驶的语义VLM数据集](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.10701)\n  - 何元凯、史伟松\n  - 出版单位：特拉华大学\n  - 发表日期：2025年11月12日\n  - 项目页面：[CARScenes](https:\u002F\u002Fgithub.com\u002FCroquembouche\u002FCAR-Scenes)\n  - 代码：[CARScenes](https:\u002F\u002Fgithub.com\u002FCroquembouche\u002FCAR-Scenes)\n  - 摘要：\n    - CAR-Scenes是一个帧级语义数据集，用于训练和评估视觉-语言模型（VLMs），以实现自动驾驶中的可解释性场景级理解。\n    - 该数据集提供了5,192张标注图像，包含28类知识库，覆盖350多种与环境、道路使用者及车辆行为相关的属性，这些标注通过GPT-4o辅助的流水线完成，并经过人工验证。\n    - 数据集还包括语义检索工具、风险感知场景挖掘工具以及可复现的基线，同时发布标注和分析脚本，以支持智能车辆的可解释性和数据驱动型工作流程。\n\n- [STRIDE-QA：城市驾驶场景中的时空推理视觉问答数据集](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10427)\n  - 石原圭史、佐佐木健人、高桥翼、盐野大辉、山口优\n  - 发表日期：2025年8月14日\n  - 项目页面：[STRIDE-QA](https:\u002F\u002Fturingmotors.github.io\u002Fstride-qa\u002F)\n  - 任务：VQA\n  - 数据集：[STRIDE-QA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10427)\n  - 摘要：\n    - STRIDE-QA是一个大规模的视觉问答（VQA）数据集，旨在从自我中心视角进行物理基础的时空推理，数据来源于东京100小时的多传感器驾驶数据。\n    - 该数据集提供28.5万帧上的1600万个问答对，并配有密集的自动标注，支持以物体为中心和以自我为中心的推理，涵盖三种需要空间定位和时间预测的新颖问答任务。\n    - 基准测试表明，在STRIDE-QA上微调后的VLMs相比通用VLMs几乎为零的得分，其性能大幅提升（空间定位提升55%，预测一致性提升28%），为自动驾驶系统中可靠的VLMs奠定了基础。\n\n- [Box-QAymo：自动驾驶中的框引用VQA数据集](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.00525)\n  - 贾马尔·埃切加赖、傅雨霞、黄子、罗雅丹\n  - 发表日期：2025年7月1日\n  
- 项目页面：[Box-QAymo](https:\u002F\u002Fdjamahl99.github.io\u002Fqaymo-pages\u002F)\n  - 任务：VQA\n  - 数据集：[Box-QAymo](https:\u002F\u002Fdjamahl99.github.io\u002Fqaymo-pages\u002F)\n  - 摘要：\n    - 引入Box-QAymo，这是一个框引用的VQA数据集和基准，用于评估和微调VLMs在驾驶场景中对用户指定对象的空间和时间推理能力。\n    - 提出一种分层评估协议，涵盖二元合理性检查、属性预测、运动理解以及跨对象动态的时空推理。\n    - 为开发更 robust 和可解释的自动驾驶系统提供了基础，使其能够在真实条件下与用户有效沟通。\n\n- [CoVLA：自动驾驶的综合视觉-语言-行动数据集](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.10845)\n  - 新井英久、三轮圭太、佐佐木健人、山口优、渡边浩平、青木俊介、山本一成 **WACV 2025 口头报告**\n  - 出版单位：图灵公司\n  - 发表日期：2024年12月2日\n  - 代码：[CoVLA](https:\u002F\u002Fturingmotors.github.io\u002Fcovla-ad\u002F)\n  - 摘要：\n    - CoVLA（综合视觉-语言-行动）数据集是一个庞大的数据集，包含了超过80小时的真实驾驶视频。该数据集采用了一种新颖的可扩展方法，基于自动化数据处理和字幕生成流水线，生成准确的驾驶轨迹，并配以详细的自然语言描述，涵盖驾驶环境和操作动作。\n  \n- [Rank2Tell：用于联合重要性排序与推理的多模态驾驶数据集](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.06597)\n  - 恩娜·萨奇德瓦、纳库尔·阿加瓦尔、苏哈斯·春迪、肖恩·鲁洛夫斯、李嘉辰、贝赫扎德·达里乌什、崔智浩、迈克尔·科亨德费尔\n  - 出版单位：本田研究院、斯坦福大学\n  - 发表日期：2023年9月10日\n  - 摘要：\n    - 这是一个多模态的自我中心数据集，用于对重要性级别进行排序并说明其原因。\n    - 引入了一个联合模型，用于同时进行重要性级别排序和自然语言字幕生成，以对该数据集进行基准测试。\n\n- [DriveLM: 语言驱动的自动驾驶](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM)\n  - 发表者：Sima, Chonghao；Renz, Katrin；Chitta, Kashyap；Chen, Li；Zhang, Hanxue；Xie, Chengen；Luo, Ping；Geiger, Andreas；Li, Hongyang **ECCV 2024**\n  - 数据集：[DriveLM](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FDriveLM\u002Fblob\u002Fmain\u002Fdocs\u002Fgetting_started.md#download-data)\n  - 发表日期：2023年8月\n  - 摘要：\n    - 基于nuScenes数据集构建数据集。\n    - 感知类问题要求模型识别场景中的物体。\n    - 预测类问题要求模型预测场景中重要物体的未来状态。\n    - 规划类问题则要求模型给出合理的规划动作，并避免危险行为。\n\n- [WEDGE：基于生成式视觉-语言模型构建的多天气自动驾驶数据集](https:\u002F\u002Fbrowse.arxiv.org\u002Fabs\u002F2305.07528)\n  - Aboli Marathe、Deva Ramanan、Rahee Walambe、Ketan Kotecha。**CVPR 2023**\n  - 发表单位：卡内基梅隆大学、共生国际大学\n  - 数据集：[WEDGE](https:\u002F\u002Fgithub.com\u002FInfernolia\u002FWEDGE)\n  - 发表日期：2023年5月12日\n  - 摘要：\n    - 基于生成式视觉-语言模型构建的多天气自动驾驶数据集。\n\n- [NuScenes-QA：面向自动驾驶场景的多模态视觉问答基准](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14836)\n  - Tianwen Qian、Jingjing Chen、Linhai Zhuo、Yang Jiao、Yu-Gang Jiang\n  - 发表单位：复旦大学\n  - 数据集：[NuScenes-QA](https:\u002F\u002Fgithub.com\u002Fqiantianwen\u002FNuScenes-QA)\n  - 摘要：\n    - NuScenes-QA基于34,149个视觉场景提供了459,941对问答，其中来自28,130个场景的376,604道题用于训练，而来自6,019个场景的83,337道题则用于测试。\n    - 多视角图像和点云首先通过特征提取主干网络处理，以获得BEV特征。\n\n- [DRAMA：驾驶场景中的风险联合定位与描述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.10767)\n  - Srikanth Malla、Chiho Choi、Isht Dwivedi、Joon Hee Choi、Jiachen Li\n  - 发表单位：\n  - 数据集：[DRAMA](https:\u002F\u002Fusa.honda-ri.com\u002Fdrama#Introduction)\n  - 摘要：\n    - 介绍了一种新颖的数据集DRAMA，该数据集提供了与重要物体相关的驾驶风险的语言描述（重点在于原因），可用于评估驾驶场景中多种视觉描述能力。\n\n- [自动驾驶中的语言提示](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.04379)\n  - 数据集：Nuprompt（未公开）\n  - [先前摘要](#LanguagePrompt)\n\n- [使用LLM进行驾驶：融合对象级向量模态以实现可解释的自动驾驶](https:\u002F\u002Fbrowse.arxiv.org\u002Fabs\u002F2310.01957)\n  - 数据集：[官方数据集](https:\u002F\u002Fgithub.com\u002Fwayveai\u002FDriving-with-LLMs\u002Ftree\u002Fmain\u002Fdata)，采用RL专家在模拟器中收集的数据。\n  - [先前摘要](#DrivingwithLLMs)\n\n- [自动驾驶车辆的文本解释](https:\u002F\u002Farxiv.org\u002Fabs\u002F1807.11546)\n  - Jinkyu Kim、Anna Rohrbach、Trevor Darrell、John Canny、Zeynep Akata **ECCV 2018**。\n  - 发表单位：加州大学伯克利分校、萨尔兰信息学园区、阿姆斯特丹大学\n  - [BDD-X数据集](https:\u002F\u002Fgithub.com\u002FJinkyuKimUCB\u002FBDD-X-dataset)\n\n- [面向自动驾驶车辆的人机建议接地](https:\u002F\u002Farxiv.org\u002Fabs\u002F1911.06978)\n  - Jinkyu Kim、Teruhisa Misu、Yi-Ting Chen、Ashish Tawari、John Canny **CVPR 2019**\n  - 发表单位：加州大学伯克利分校、本田美国研究院\n  - 
[HAD数据集](https:\u002F\u002Fusa.honda-ri.com\u002Fhad)\n\u003C\u002Fdetails>\n\n\n\n\n## 许可证\n\n“自动驾驶领域优秀LLM资源”根据Apache 2.0许可证发布。","# Awesome-LLM4AD 快速上手指南\n\nAwesome-LLM4AD 是一个由上海交通大学 ReThinklab 维护的开源项目，旨在收集并整理关于**大语言模型用于自动驾驶**（LLM4AD）的前沿研究论文、数据集和资源。它涵盖了视觉 - 语言模型（VLM4AD）和视觉 - 语言 - 动作模型（VLA4AD）等统一范式。\n\n本项目主要作为一个**资源索引库**（Awesome List），而非单一的可执行软件包。因此，“安装”和“使用”主要指获取该资源列表以及复现其中列出的具体算法模型。\n\n## 环境准备\n\n由于本项目是论文与代码资源的集合，运行其中具体的模型需要通用的深度学习环境。建议配置如下：\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04\u002F22.04) 或 macOS\n*   **Python 版本**: Python 3.8 或更高版本\n*   **核心依赖**:\n    *   PyTorch (建议 2.0+)\n    *   Transformers (Hugging Face)\n    *   CUDA Toolkit (根据显卡驱动版本安装，用于 GPU 加速)\n*   **其他工具**: Git, Conda (推荐用于环境管理)\n\n> **提示**：国内开发者建议使用清华源或阿里源加速 Python 包和模型的下载。\n\n## 安装步骤\n\n### 1. 克隆项目仓库\n首先将 Awesome-LLM4AD 仓库克隆到本地，以便查阅最新的论文列表、数据集链接和对应的项目代码地址。\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FAwesome-LLM4AD.git\ncd Awesome-LLM4AD\n```\n\n### 2. 配置基础开发环境\n创建一个独立的 Conda 环境并安装基础深度学习框架（以 PyTorch 为例，使用清华镜像源加速）：\n\n```bash\nconda create -n llm4ad python=3.9 -y\nconda activate llm4ad\n\n# 使用清华源安装 PyTorch 和相关依赖\npip install torch torchvision torchaudio --index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\npip install transformers accelerate datasets --index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 3. 获取具体模型代码\n本仓库本身不包含所有模型的源代码。请浏览 `README.md` 中的 **[Papers]** 章节，找到你感兴趣的研究（例如 `Vega`, `Drive My Way`, `DreamerAD` 等），点击其对应的 **Code** 或 **Project Page** 链接，进入该特定项目的仓库进行单独克隆和安装。\n\n例如，若要复现 *Drive My Way*：\n```bash\n# 示例：克隆具体子项目（需先在上文列表中确认最新地址）\ngit clone https:\u002F\u002Fgithub.com\u002Ftasl-lab\u002FDMW.git\ncd DMW\npip install -r requirements.txt\n```\n\n## 基本使用\n\n由于 Awesome-LLM4AD 是资源导航，其核心用法是**检索资源**和**复现算法**。\n\n### 1. 检索研究方向\n打开本地的 `README.md` 文件或在 GitHub 页面查看 **Table of Contents**。你可以根据任务类型快速定位资源：\n*   **Planning **(规划): 如 Vega, Drive My Way\n*   **Perception **(感知): 如 Traffic Sign Recognition (TS-1M)\n*   **Generation **(生成\u002F仿真): 如 R1Sim, PhyGenesis\n*   **Prediction **(预测): 如 TIGFlow-GRPO\n\n### 2. 运行示例模型\n以列表中提到的 **Drive My Way **(DMW) 为例，在克隆并安装其独立仓库后，通常的使用流程如下（具体命令请以该项目最新的 `README` 为准）：\n\n```bash\n# 进入具体项目目录\ncd DMW\n\n# 下载预训练模型或数据集 (参考该项目说明)\n# python scripts\u002Fdownload_model.py \n\n# 运行推理示例 (假设脚本名为 inference.py)\npython inference.py --config configs\u002Fpersonalized_drive.yaml --input \"Change lane to the left\"\n```\n\n### 3. 
贡献与更新\n该项目持续更新。如果你发现了新的相关论文或资源，可以通过以下方式贡献：\n*   **提交 Issue**: 在 GitHub 仓库中提出建议。\n*   **提交 PR**: Fork 仓库，在 `README.md` 的 `Papers` 部分按照既定格式添加新条目，然后发起 Pull Request。\n\n```markdown\n\u003C!-- 新增条目格式参考 -->\n- [Title](Paper Link)\n  - Authors\n  - Task: Planning\u002FPerception\u002Fetc.\n  - Code: [Link]\n```\n\n> **注意**：详细的数据集下载（如 NAVSIM, Waymo, CARLA 等）请务必访问各论文对应的官方数据页面，并遵守相应的数据使用协议。","某自动驾驶初创公司的算法团队正致力于研发基于大语言模型的决策规划系统，以应对复杂城市路况中的长尾场景。\n\n### 没有 Awesome-LLM4AD 时\n- **文献检索如大海捞针**：研究人员需在 arXiv、Google Scholar 等多个平台手动搜索\"LLM+Autonomous Driving\"相关论文，极易遗漏最新的 VLA（视觉 - 语言 - 动作）模型成果。\n- **技术路线难以对齐**：缺乏统一的分类框架，团队难以快速厘清当前研究是侧重于感知增强、规划决策还是端到端生成，导致技术选型盲目。\n- **复现成本高昂**：找不到配套的代码仓库、数据集或仿真环境链接，往往花费数周时间清洗数据或重构代码，却因缺少关键依赖而失败。\n- **前沿动态滞后**：无法实时追踪领域内的最新突破（如自然语言指令驾驶），导致研发进度落后于学术界前沿水平。\n\n### 使用 Awesome-LLM4AD 后\n- **资源一站式获取**：团队直接利用该清单中 curated 的论文列表，迅速锁定了如 Vega 等支持自然语言指令的统一模型，节省了 80% 的调研时间。\n- **架构清晰明确**：借助其按“规划、感知、问答、生成”分类的概览图，团队快速确定了以 VLA 模型为核心的技术演进路线。\n- **复现效率倍增**：通过列表中提供的代码链接、数据集（如 InstructScene）及仿真器信息，算法工程师在两天内成功搭建了基线系统。\n- **持续同步前沿**：订阅该仓库更新后，团队能第一时间掌握针对长尾场景的最新解决方案，确保持续的技术竞争力。\n\nAwesome-LLM4AD 将分散的学术碎片整合为结构化知识图谱，极大缩短了从理论调研到工程落地的周期。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FThinklab-SJTU_Awesome-LLM4AD_9c21de09.png","Thinklab-SJTU","Thinklab@SJTU","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FThinklab-SJTU_c6a6209f.png","Thinklab at Shanghai Jiao Tong University, led by Prof. Junchi Yan.",null,"http:\u002F\u002Fthinklab.sjtu.edu.cn","https:\u002F\u002Fgithub.com\u002FThinklab-SJTU",1763,102,"2026-04-04T03:18:53","Apache-2.0","","未说明",{"notes":91,"python":89,"dependencies":92},"该仓库（Awesome-LLM4AD）是一个自动驾驶领域大语言模型（LLM4AD）的研究论文、数据集和资源合集清单，本身不包含可执行的源代码或具体的模型实现，因此 README 中未提供运行环境、硬件配置或依赖库的安装需求。用户需根据列表中具体引用的各个独立项目（如 Vega, Drive My Way 等）的原始仓库来获取相应的环境配置信息。",[],[15],[95,96,97,98],"large-language-models","vision-language-action-model","vision-language-model","world-model","2026-03-27T02:49:30.150509","2026-04-11T18:33:10.786597",[102,107,112,117,122,127,132,136],{"id":103,"question_zh":104,"answer_zh":105,"source_url":106},15904,"LLM for AD 领域有哪些已开源代码的论文推荐？","由于该领域发展迅速，许多论文仍处于投稿阶段，代码尚未开源。建议查阅仓库中标记为已接收的论文（如 CVPR 2024, ICLR 等），这些条目通常包含代码链接。","https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FAwesome-LLM4AD\u002Fissues\u002F7",{"id":108,"question_zh":109,"answer_zh":110,"source_url":111},15905,"如何向 Awesome-LLM4AD 仓库提交新的论文或资源？","用户可以通过创建 Issue 来推荐新论文。在 Issue 正文中提供论文标题、作者、会议\u002F期刊来源、Arxiv 链接、项目主页、GitHub 代码库地址以及 BibTeX 引用格式。维护者审核后会将其更新到列表中（例如 DriveBench, ReCogDrive, FSDrive 等均已通过此方式收录）。","https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FAwesome-LLM4AD\u002Fissues\u002F15",{"id":113,"question_zh":114,"answer_zh":115,"source_url":116},15906,"如果发现 README 或列表中存在拼写错误或日期错误怎么办？","可以直接提交 Issue 报告错误（如拼写错误 'Percetion' 应为 'Perception'，或发表年份错误）。维护者确认后会立即修正，并在下一次论文版本更新或 README 同步时生效。","https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FAwesome-LLM4AD\u002Fissues\u002F5",{"id":118,"question_zh":119,"answer_zh":120,"source_url":121},15907,"是否有整合视觉语言模型 (VLM)、扩散规划器和强化学习 (RL) 的端到端自动驾驶工作？","有的，例如 ReCogDrive。这是一个强化的认知框架，采用三阶段训练范式：驾驶预训练、模仿学习和 GRPO 强化学习，旨在增强视觉 - 语言 - 动作 (VLA) 系统的规划能力。相关代码和项目页面可通过仓库中的链接获取。","https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FAwesome-LLM4AD\u002Fissues\u002F14",{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},15908,"有没有利用大语言模型 (LLM) 的上下文学习能力生成交通场景的研究？","是的，AAAI 2025 接收的口头报告论文 AutoSceneGen 正是此类工作。它利用 LLM 的上下文学习能力生成交通场景，以训练更好的运动规划器。相关资源（Arxiv 
链接、源代码、视频等）已在仓库中收录。","https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FAwesome-LLM4AD\u002Fissues\u002F9",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},15909,"是否有使用 VLM 预测统一未来帧作为时空思维链 (CoT) 进行轨迹规划的研究？","有的，NeurIPS 2025 Spotlight 论文 FSDrive (FutureSightDrive) 提出了这种方法。它利用 VLM 可视化地思考轨迹规划，将未来帧预测作为时空 CoT。代码和论文链接可在仓库中找到。","https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FAwesome-LLM4AD\u002Fissues\u002F12",{"id":133,"question_zh":134,"answer_zh":135,"source_url":111},15910,"哪里可以找到关于 VLM 在自动驾驶可靠性、数据和指标方面的实证研究？","可以参考 ICCV 2025 接收的论文 DriveBench，标题为《Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives》。该项目提供了 toolkit 代码、HuggingFace 数据集以及详细的项目主页。",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},15911,"是否有相关的自动驾驶数据集推荐？","仓库中已收录了 CoVLA 等相关数据集。用户可以通过查看仓库列表或搜索相关 Issue 获取具体的 Arxiv 论文链接和数据集详情。","https:\u002F\u002Fgithub.com\u002FThinklab-SJTU\u002FAwesome-LLM4AD\u002Fissues\u002F16",[]]