[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Songwxuan--Embodied-AI-Paper-TopConf":3,"tool-Songwxuan--Embodied-AI-Paper-TopConf":65},[4,23,32,40,49,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,2,"2026-04-10T11:13:16",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[19,14,18],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},5773,"cs-video-courses","Developer-Y\u002Fcs-video-courses","cs-video-courses 是一个精心整理的计算机科学视频课程清单，旨在为自学者提供系统化的学习路径。它汇集了全球知名高校（如加州大学伯克利分校、新南威尔士大学等）的完整课程录像，涵盖从编程基础、数据结构与算法，到操作系统、分布式系统、数据库等核心领域，并深入延伸至人工智能、机器学习、量子计算及区块链等前沿方向。\n\n面对网络上零散且质量参差不齐的教学资源，cs-video-courses 解决了学习者难以找到成体系、高难度大学级别课程的痛点。该项目严格筛选内容，仅收录真正的大学层级课程，排除了碎片化的简短教程或商业广告，确保用户能接触到严谨的学术内容。\n\n这份清单特别适合希望夯实计算机基础的开发者、需要补充特定领域知识的研究人员，以及渴望像在校生一样系统学习计算机科学的自学者。其独特的技术亮点在于分类极其详尽，不仅包含传统的软件工程与网络安全，还细分了生成式 AI、大语言模型、计算生物学等新兴学科，并直接链接至官方视频播放列表，让用户能一站式获取高质量的教育资源，免费享受世界顶尖大学的课堂体验。",79792,"2026-04-08T22:03:59",[18,13,14,20],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":46,"last_commit_at":47,"category_tags":48,"status":22},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[17,13,20,19,18],{"id":50,"name":51,"github_repo":52,"description_zh":53,"stars":54,"difficulty_score":46,"last_commit_at":55,"category_tags":56,"status":22},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 
CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",75832,"2026-04-17T21:58:25",[19,13,20,18],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":29,"last_commit_at":63,"category_tags":64,"status":22},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,"2026-04-03T21:50:24",[20,18],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":82,"owner_twitter":82,"owner_website":82,"owner_url":83,"languages":82,"stars":84,"forks":85,"last_commit_at":86,"license":87,"difficulty_score":29,"env_os":88,"env_gpu":88,"env_ram":88,"env_deps":89,"category_tags":92,"github_topics":82,"view_count":46,"oss_zip_url":82,"oss_zip_packed_at":82,"status":22,"created_at":93,"updated_at":94,"faqs":95,"releases":96},8828,"Songwxuan\u002FEmbodied-AI-Paper-TopConf","Embodied-AI-Paper-TopConf","[Actively Maintained🔥] A list of Embodied AI papers accepted by top conferences (ICLR, NeurIPS, ICML, RSS, CoRL, ICRA, IROS, CVPR, ICCV, ECCV).","Embodied-AI-Paper-TopConf 是一个持续维护的开源项目，专门汇总了被顶级学术会议（如 ICLR、NeurIPS、CVPR、ICRA 等）录用的具身智能（Embodied AI）领域论文。在具身智能研究飞速发展的今天，相关文献散落在各个会议中，研究人员往往难以高效追踪最新进展。该项目通过系统性地整理和分类这些高质量论文，解决了信息碎片化和检索困难的问题。\n\n无论是高校研究员、企业算法工程师，还是对机器人学习感兴趣的学生，都能从中快速定位到所需的前沿成果。项目不仅按会议年份归档，还细致地将论文划分为“视觉 - 语言 - 动作模型”、“世界模型”、“规划与推理”、“灵巧操作”及“仿真到现实迁移”等多个子方向，极大提升了查阅效率。其独特的亮点在于更新频率高且覆盖范围广，从 2025 年的各大会议一直延续至 2026 年的 ICLR 录用论文，确保用户能第一时间获取学界最新动态。如果你希望紧跟具身智能领域的技术脉搏，Embodied-AI-Paper-TopConf 将是你不可或缺的学术导航助手。","# Embodied-AI-Paper-TopConf\n🔥 NeuIPS2025 &amp; CORL2025 &amp; ICCV2025 &amp; ICML2025 &amp; RSS2025 &amp; CVPR2025 &amp; ICLR2025 &amp; **ICLR2026** Embodied AI Paper List  Resources.\n\n[03\u002F22\u002F2025] We plan to organize more papers on Embodied AI from top conferences in the future and build a more comprehensive paper list. If there are any conference papers you would like to browse or if you have any other suggestions, please feel free to leave an issue.\n\n[04\u002F12\u002F2025] We are updating Embodied AI papers accepted by RSS2025 (Robotics Top Conference)!\n\n[05\u002F21\u002F2025] We are updating Embodied AI papers accepted by ICML2025!\n\n[08\u002F05\u002F2025] We are updating Embodied AI papers accepted by ICCV2025!\n\n[09\u002F30\u002F2025] We are updating Embodied AI papers accepted by CORL2025!\n\n[11\u002F30\u002F2025] We are updating Embodied AI papers accepted by NeuIPS2025!\n\n[03\u002F12\u002F2026] We are updating Embodied AI papers accepted by ICLR2026! 
([📖 ICLR2026](ICLR\u002FICLR2026.md))\n\n## 📖 Paper List\n\n- [📖 ICLR2026](#iclr2026)\n  - [Vision-Language-Action Models](#vision-language-action-models-1)\n  - [Vision-Language-Navigation Models](#vision-language-navigation-models)\n  - [World Models](#world-models)\n  - [Planning and Reasoning](#planning-and-reasoning-1)\n  - [Navigation](#navigation)\n  - [Humanoid](#humanoid)\n  - [3D Vision](#3d-vision-1)\n  - [Policy](#policy)\n  - [Dexterous Manipulation](#dexterous-manipulation)\n  - [Tactile](#tactile)\n  - [Sim2real and Real2sim](#sim2real-and-real2sim-1)\n  - [Benchmark and Dataset](#benchmark-and-dataset)\n  - [Other](#other)\n- [📖 NeuIPS2025](#neuips2025)\n  - [Vision-Language-Action Model](#vision-language-action-model)\n  - [Data](#data)\n  - [World Model](#world-model)\n  - [Planning and Reasoning](#planning-and-reasoning)\n  - [Navigation](#navigation)\n  - [Humanoid](#humanoid)\n  - [3D Vision](#3d-vision)\n  - [Policy](#policy)\n  - [Accelerating and Deploying](#accelerating-and-deploying)\n  - [Tactile](#tactile)\n  - [Dexterous](#dexterous)\n  - [Benchmark and Dataset](##benchmark-and-dataset)\n- [📖 CORL2025](#corl2025)\n  - [Vision-Language-Action Model](#vision-language-action-model)\n  - [World Model](#world-model)\n  - [Policy](#policy)\n  - [Humanoid](#humanoid)\n  - [Navigation](#navigation)\n  - [Benchmark and Dataset](#benchmark-and-dataset)\n  - [Dexterous Manipulation](dexterous-manipulation)\n  - [Sim-to-Real](#sim-to-real)\n- [📖 ICCV2025](#iccv2025)\n  - [Vision-Language-Action Model](#vision-language-action-model)\n  - [Vision-Language-Navigation Model](#vision-language-navigation-model)\n  - [Hierarchical Planning](#hierarchical-planning)\n  - [World Model](#world-model)\n  - [Policy](#policy)\n  - [Accelerating and Deploying](#accelerating-and-deploying)\n  - [Perception](#perception)\n  - [Benchmark and Dataset](#benchmark-and-dataset)\n- [📖 ICML2025](#icml2025)\n  - [Vision-Language-Action Models](#vision-language-action-models)\n  - [Planning and Reasoning](#planning-and-reasoning)\n  - [Policies](#policies)\n  - [3D Vision](#3d-vision)\n  - [Dataset](#dataset)\n- [📖 RSS2025](#rss2025)\n- [📖 CVPR2025](#cvpr2025)\n  - [Vision-Language-Action Models](#vision-language-action-models)\n  - [Policies](#policies)\n  - [Grasp](#grasp)\n  - [Humanoid](#humanoid)\n  - [Planning and Reasoning](#planning-and-reasoning)\n  - [3D Vision](#3d-vision)\n  - [Sim2real and Real2sim](#sim2real-and-real2sim)\n  - [Benchmark and Dataset](#benchmark-and-dataset)\n- [📖 ICLR2025](#iclr2025)\n  - [Vision-Language-Action Models](#vision-language-action-models)\n  - [Policies](#policies)\n  - [Planning and Reasoning](#planning-and-reasoning)\n  - [3D Vision](#3d-vision)\n  - [Sim2real and Real2sim](#sim2real-and-real2sim)\n- [📖 ICRA2025](#icra2025)\n\n\n# ICLR2026\n\n[📄 Full List](ICLR\u002FICLR2026.md)\n\n## Vision-Language-Action Models\n\n- Scaling up Memory for Robotic Control via Experience Retrieval [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=1dH4ARGdwD)\n- **MemoryVLA**: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=54U3XHf7qq)\n- **PixelVLA**: Advancing Pixel-level Understanding in Vision-Language-Action Model [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=7M6ryCABIc)\n- **Vlaser**: Vision-Language-Action Model with Synergistic Embodied Reasoning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=8xTDnj39Ti)\n- Disentangled Robot Learning via Separate 
Forward and Inverse Dynamics Pretraining [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=DdrsHWobR1)\n- **MetaVLA**: Unified Meta Co-Training for Efficient Embodied Adaptation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=E1K2Ph3LtS)\n- Unifying Diffusion and Autoregression for Generalizable Vision-Language-Action Model [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=H1KDMNOKQn)\n- Hybrid Training for Vision-Language-Action Models [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=IBJtOltTbx)\n- End-to-end Listen, Look, Speak and Act [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=LYyoRqf0Ij)\n- **WholeBodyVLA**: Towards Unified Latent VLA for Whole-body Loco-manipulation Control [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=OCJmVjyzN7)\n- **RoboOmni**: Proactive Robot Manipulation in Omni-modal Context [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=OJh7oBCYhL)\n- Unified Vision-Language-Action Model [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=PklMD8PwUy)\n- **SP-VLA**: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=RwdGIIjPlC)\n- **Align-Then-stEer**: Adapting the Vision-Language Action Models through Unified Latent Guidance [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=T3i7Ifeatk)\n- **AutoQVLA**: Not All Channels Are Equal in Vision-Language-Action Model's Quantization [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=TpL2nXanru)\n- Verifier-free Test-Time Sampling for Vision Language Action Models [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=UD4Rw8MOEK)\n- **Interleave-VLA**: Enhancing Robot Manipulation with Image-Text Interleaved Instructions [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=ULTWUuGhC3)\n- **Unified Diffusion VLA**: Vision-Language-Action Model via Joint Discrete Diffusion Diffusion Process [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=UvQOcw2oCD)\n- **Endowing GPT-4 with a Humanoid Body**: Building the Bridge Between Off-the-Shelf VLMs and the Physical World [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=aQWSEjcN9V)\n- On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=cS6xizdYD5)\n- Spatially Guided Training for Vision-Language-Action Model [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=eKhOrQWAVJ)\n- Self-Improving Vision-Language-Action Models with Data Generation via Residual RL [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=eUGoqrZ6Ea)\n- Action-aware Dynamic Pruning for Efficient Vision-Language-Action Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=ea6j8k8Rnw)\n- **Spatial Forcing**: Implicit Spatial Representation Alignment for Vision-language-action Model [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=euMVC1DO4k)\n- **Genie Envisioner**: A Unified World Foundation Platform for Robotic Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=fHLtSxDFKC)\n- **From Spatial to Actions**: Grounding Vision-Language-Action Model in Spatial Foundation Priors [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=fzmittHfq3)\n- **TwinVLA**: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=jG9W6nAwVz)\n- **FASTer**: Toward Powerful and Efficient Autoregressive Vision–Language–Action Models with Learnable Action Tokenizer and Block-wise Decoding 
[Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=k6nTUFoqeT)\n- Embodied Navigation Foundation Model [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=kkBOIsrCXh)\n- **X-VLA**: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=kt51kZH4aG)\n- **Actions as Language**: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=sFO9d6XSlf)\n- **OneTwoVLA**: A Unified Vision-Language-Action Model with Adaptive Reasoning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=tWMfhoP3as)\n- **VLM4VLA**: Revisiting Vision-Language-Models in Vision-Language-Action Models [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=tc2UsBeODW)\n- **Vision-Language-Action Instruction Tuning**: From Understanding to Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=tsxwloasw5)\n- **villa-X**: Enhancing Latent Action Modeling in Vision-Language-Action Models [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=y5CaJb17Fn)\n- **From Seeing to Doing**: Bridging Reasoning and Decision for Robotic Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=yngvAamNQi)\n\n## Vision-Language-Navigation Models\n\n- **AutoFly**: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=88RKxlFUNY)\n- **Ground Slow, Move Fast**: A Dual-System Foundation Model for Generalizable Vision-Language Navigation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=GK4rznYwhn)\n- Towards Physically Executable 3D Gaussian for Embodied Navigation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=HB6KvsqcAn)\n- Uncertainty-Aware Gaussian Map for Vision-Language Navigation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=LPv59noPAy)\n- **OpenFly**: A COMPREHENSIVE PLATFORM FOR AERIAL VISION-LANGUAGE NAVIGATION [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=OKm3w71ymP)\n- **JanusVLN**: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=RnuB0Nlbd5)\n- **CompassNav**: Steering From Path Imitation to Decision Understanding In Navigation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=eqcDckWHik)\n- **M$^3$E**: Continual Vision-and-Language Navigation via Mixture of Macro and Micro Experts [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=pFh5ygjN3V)\n- All-day Multi-scenes Lifelong Vision-and-Language Navigation with Tucker Adaptation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=qSak1Hjfdq)\n- **OmniNav**: A Unified Framework for Prospective Exploration and Visual-Language Navigation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=zGtTQTD1zu)\n\n## World Models\n\n- **Ctrl-World**: A Controllable Generative World Model for Robot Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=748bHL2BAv)\n- **Context and Diversity Matter**: The Emergence of In-Context Learning in World Models [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=0GNBqoYcAP)\n- **FantasyWorld**: Geometry-Consistent World Modeling via Unified Video and 3D Prediction [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=3q9vHEqsNx)\n- **NeMo-map**: Neural Implicit Flow Fields for Spatio-Temporal Motion Mapping [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=4HZgkwVVFO)\n- **Astra**: General Interactive World Model with Autoregressive Denoising 
[Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=8UZpmrxoLG)\n- Empowering Multi-Robot Cooperation via Sequential World Models [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=IvUM6UwYCJ)\n- Test-Time Mixture of World Models for Embodied Agents in Dynamic Environments [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=LQD1MrnbxH)\n- **RIG**: Synergizing Reasoning and Imagination in End-to-End Generalist Policy [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=LQv9LU2Ufg)\n- Learning Massively Multitask World Models for Continuous Control [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=MPabX9LEds)\n- Unified 3D Scene Understanding Through Physical World Modeling [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=NQq9JLMfNN)\n- **ExoPredicator**: Learning Abstract Models of Dynamic Worlds for Robot Planning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=a1zfcaNTkM)\n- Efficient Reinforcement Learning by Guiding World Models with Non-Curated Data [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=oBXfPyi47m)\n- **Vid2World**: Crafting Video Diffusion Models to Interactive World Models [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=pFyzqbUiF9)\n- **WMPO**: World Model-based Policy Optimization for Vision-Language-Action Models [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=qE2FyvRvuF)\n- Object-Centric World Models from Few-Shot Annotations for Sample-Efficient Reinforcement Learning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=qmEyJadwHA)\n- Building spatial world models from sparse transitional episodic memories [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=w3w7WVG4ks)\n- **Cosmos Policy**: Fine-Tuning Video Models for Visuomotor Control and Planning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=wPEIStHxYH)\n- **WoW!**: World Models in a Closed-Loop World [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=yDmb7xAfeb)\n\n## Planning and Reasoning\n\n- **VLMgineer**: Vision-Language Models as Robotic Toolsmiths [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=nESyz4PvJL)\n- **MomaGraph**: State-Aware Unified Scene Graphs with Vision-Language Models for Embodied Task Planning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=3eTr9dGwJv)\n- Planning with an Embodied Learnable Memory [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=79BOATBal9)\n- **Theory of Space**: Can Foundation Models Construct Spatial Beliefs through Active Exploration? 
[Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=8iPwqr6Adk)\n- Compositional Visual Planning via Inference-Time Diffusion Scaling [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=EEONns7ae4)\n- Experience-based Knowledge Correction for Robust Planning in Minecraft [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=N22lDHYrXe)\n- Self-Improving Loops for Visual Robotic Planning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=SzUgx5r3wy)\n- **BOLT**: Decision‑Aligned Distillation and Budget-Aware Routing for Constrained Multimodal QA on Robots [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=Vsy3nAnaX6)\n- **ReCAPA**: Hierarchical Predictive Correction to Mitigate Cascading Failures [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=WC6MJ5r5Bj)\n- **One Demo Is All It Takes**: Planning Domain Derivation with LLMs from A Single Demonstration [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=Y1VgLHbzCC)\n- **EVLP**: Learning Unified Embodied Vision-Language Planner with Reinforced Supervised Fine-Tuning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=eJcCW9oNfH)\n- **Towards Improvisational TAMP**: Learning Low-Level Shortcuts in Abstract Planning Graphs [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=enprG5H9aD)\n- **Embodied-R1**: Reinforced Embodied Reasoning for General Robotic Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=i5wlozMFsQ)\n- Self-Refining Vision Language Model for Robotic Failure Detection and Reasoning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=jr9hGWQioP)\n- Natural Language PDDL (NL-PDDL) for Open-world Goal-oriented Commonsense Regression Planning in Embodied AI [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=kWCNhRdcDI)\n- **SafeFlowMatcher**: Safe and Fast Planning using Flow Matching with Control Barrier Functions [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=refcXHU1Nh)\n- **OmniEVA**: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=tkEmIJv1tB)\n\n## Navigation\n\n- **From Seeing to Experiencing**: Scaling Navigation Foundation Models with Reinforcement Learning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=0c7nAZjyr5)\n- Lifelong Embodied Navigation Learning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=PaYo96rjij)\n- **CE-Nav**: Flow-Guided Reinforcement Refinement for Cross-Embodiment Local Navigation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=apaLoTumdO)\n- Emergence of Spatial Representation in an Actor-Critic Agent with Hippocampus-Inspired Sequence Generator [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=li1vfqDzRD)\n\n## Humanoid\n\n- **HWC-Loco**: A Hierarchical Whole-Body Control Approach to Robust Humanoid Locomotion [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=3UE3Aatcjy)\n- **Task Tokens**: A Flexible Approach to Adapting Behavior Foundation Models [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=6T3wJQhvc3)\n- **BFM-Zero**: A Promptable Behavioral Foundation Model for Humanoid Control Using Unsupervised Reinforcement Learning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=jkhl2oI0g5)\n- **From Language to Locomotion**: Retargeting-free Humanoid Control via Motion Latent Guidance [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=k3Cyx3Uets)\n\n## 3D Vision\n\n- Geometry-aware 4D Video Generation for Robot Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=18gC6pZVVc)\n- 
**PD$^{2}$GS**: Part-Level Decoupling and Continuous Deformation of Articulated Objects via Gaussian Splatting [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=W3Q2xvrZtx)\n- **Manipulation as in Simulation**: Enabling Accurate Geometry Perception in Robots [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=sWyX1BpeN4)\n\n## Policy\n\n- Master Skill Learning with Policy-Grounded Synergy of LLM-based Reward Shaping and Exploring [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=1vXMfIYFZp)\n- When would Vision-Proprioception Policies Fail in Robotic Manipulation? [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=2RIqqNqALN)\n- **ManipEvalAgent**: Promptable and Efficient Evaluation Framework for Robotic Manipulation Policies [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=3u6AkbWEls)\n- Remotely Detectable Robot Policy Watermarking [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=8s5jBVybhQ)\n- Difference-Aware Retrieval Polices for Imitation Learning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=9AA27en4go)\n- Capturing Visual Environment Structure Correlates with Control Performance [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=AmczI1k3Yk)\n- **VITA**: Vision-to-Action Flow Matching Policy [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=BTe5VLBjPg)\n- **DemoGrasp**: Universal Dexterous Grasping from a Single Demonstration [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=Bf4FeuW0Mr)\n- **When a Robot is More Capable than a Human**: Learning from Constrained Demonstrators [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=BvirMuKWV1)\n- Autonomous Play with Correspondence-Driven Trajectory Warping [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=FqDmvMZish)\n- Cross-Embodiment Offline Reinforcement Learning for Heterogeneous Robot Datasets [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=GrsoLVNy3Y)\n- Uncovering Robot Vulnerabilities through Semantic Potential Fields [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=Gsrw1vxq1G)\n- Time Optimal Execution of Action Chunk Policies Beyond Demonstration Speed [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=INsLvSCJ4z)\n- Policy Likelihood-based Query Sampling and Critic-Exploited Reset for Efficient Preference-based Reinforcement Learning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=ITeuGb2bYg)\n- Rodrigues Network for Learning Robot Actions [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=IZHk6BXBST)\n- Reference Guided Skill Discovery [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=IaGf8Eh5Uo)\n- Masked Generative Policy for Robotic Control [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=KFu4p3pd11)\n- **GRL-SNAM**: Geometric Reinforcement Learning with Differential Hamiltonians for Navigation and Mapping in Unknown Environments [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=KcC5mwfGf0)\n- **HAMLET**: Switch Your Vision-Language-Action Model into a History-Aware Policy [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=KcJ9U0x6kO)\n- Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=NEOTsyyYH7)\n- Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=OeDwYtp8n1)\n- Policy Contrastive Decoding for Robotic Foundation Models [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=P9PVdWyM3U)\n- **Demystifying Robot 
Diffusion Policies**: Action Memorization and a Simple Lookup Table Alternative [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=PL0tJOfm7I)\n- **H$^3$DP**: Triply‑Hierarchical Diffusion Policy for Visuomotor Learning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=Q1CP0iAmOb)\n- **SimpleVLA-RL**: Scaling VLA Training via Reinforcement Learning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=TQhSodCM4r)\n- Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=TnLFRhLuZ6)\n- Abstracting Robot Manipulation Skills via Mixture-of-Experts Diffusion Policies [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=VSWjHIveqZ)\n- Accelerated co-design of robots through morphological pretraining [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=WVliGyFwZv)\n- Generalizable Coarse-to-Fine Robot Manipulation via Language-Aligned 3D Keypoints [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=WXFfMLyB6y)\n- **VER**: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=aoorNQFpM6)\n- **SpikePingpong**: Spike Vision-based Fast-Slow Pingpong Robot System [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=d08yOXs1Dl)\n- **EquAct**: An SE(3)-Equivariant Multi-Task Transformer for 3D Robotic Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=d1wuA8oIH0)\n- Translating Flow to Policy via Hindsight Online Imitation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=dQ6d5bgXtM)\n- Hierarchical Value-Decomposed Offline Reinforcement Learning for Whole-Body Control [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=eSkDNIGbcd)\n- **Cortical Policy**: A Dual-Stream View Transformer for Robotic Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=eWe8zqGvs5)\n- Geometry-aware Policy Imitation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=ggofj6tyr3)\n- **Contractive Diffusion Policies**: Robust Action Diffusion via Contractive Score-Based Sampling with Differential Equations [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=iKJbmx1iuQ)\n- Scalable Exploration for High-Dimensional Continuous Control via Value-Guided Flow [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=kIYNtxE13h)\n- Mean Flow Policy with Instantaneous Velocity Constraint for One-step Action Generation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=mIeKe74W43)\n- Emergent Dexterity Via Diverse Resets and Large-Scale Reinforcement Learning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=nAO9LcV7nE)\n- Learning Part-Aware Dense 3D Feature Field For Generalizable Articulated Object Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=qXfRXfAHOK)\n- Real-Time Robot Execution with Masked Action Chunking [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=r0RGJ1j9on)\n- Robust Fine-tuning of Vision-Language-Action Robot Policies via Parameter Merging [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=uWJwQ5SZoM)\n- **ViPRA**: Video Prediction for Robot Actions [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=w3Ik8HUyTT)\n- **RAVEN**: End-to-end Equivariant Robot Learning with RGB Cameras [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=z8BN7KyaPl)\n\n## Dexterous Manipulation\n\n- **DexNDM**: Closing the Reality Gap for Dexterous In-Hand Rotation via Joint-Wise Neural Dynamics Model 
[Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=80vjyj5o7l)\n- **EgoDex**: Learning Dexterous Manipulation from Large-Scale Egocentric Video [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=FFxkFMU89E)\n- **RFS**: Reinforcement learning with Residual flow steering for dexterous manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=Kt9tJeOwjy)\n- Learning to Grasp Anything By Playing with Random Toys [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=NZDaMcpXZm)\n- **SARM**: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=aemqAxScl9)\n- **UniHM**: Unified Dexterous Hand Manipulation with Vision Language Model [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=cVX3VqO8BO)\n- **DexMove**: Learning Tactile-Guided Non-Prehensile Manipulation with Dexterous Hands [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=dT3ZciXvNX)\n- **VLBiMan**: Vision-Language Anchored One-Shot Demonstration Enables Generalizable Bimanual Robotic Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=he86smZzRk)\n- Cross-Embodied Co-Design for Dexterous Hands [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=k8ovuXEQQu)\n- Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=tv0Sz8A9Tc)\n- Primary-Fine Decoupling for Action Generation in Robotic Imitation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=wySMuWHmt4)\n\n## Tactile\n\n- **AnyTouch 2**: General Optical Tactile Representation Learning For Dynamic Tactile Perception [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=ndilONnABZ)\n- **APPLE**: Toward General Active Perception via Reinforcement Learning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=hU2gT2Ucua)\n\n## Sim2real and Real2sim\n\n- **D-REX**: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=13jshGCK9i)\n- **DemoGrasp**: Universal Dexterous Grasping from a Single Demonstration [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=Bf4FeuW0Mr)\n- **Sim2Real VLA**: Zero-Shot Generalization of Synthesized Skills to Realistic Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=H4SyKHjd4c)\n- **Exo-Plore**: Exploring Exoskeleton Control Space through Human-aligned Simulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=TmYcqOnxhN)\n- Emergent Dexterity Via Diverse Resets and Large-Scale Reinforcement Learning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=nAO9LcV7nE)\n- **Manipulation as in Simulation**: Enabling Accurate Geometry Perception in Robots [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=sWyX1BpeN4)\n- Latent Adaptation of Foundation Policies for Sim-to-Real Transfer [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=yn9dzttHvT)\n- **RobotArena $\\infty$**: Unlimited Robot Benchmarking via Real-to-Sim Translation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=OutljIofvS)\n- **PD$^{2}$GS**: Part-Level Decoupling and Continuous Deformation of Articulated Objects via Gaussian Splatting [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=W3Q2xvrZtx)\n- Contact-guided Real2Sim from Monocular Video with Planar Scene Primitives [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=xlr3NqxUqY)\n\n## Benchmark and Dataset\n\n- **D2E**: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI 
[Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=TRwQND3xpt)\n- **Memory, Benchmark & Robots**: A Benchmark for Solving Complex Tasks with Reinforcement Learning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=9cLPurIZMj)\n- **DataMIL**: Selecting Data for Robot Imitation Learning with Datamodels [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=AcTsKglDdh)\n- **MIMIC**: Mask-Injected Manipulation Video Generation with Interaction Control [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=COrUdVuInH)\n- **LeRobot**: An Open-Source Library for End-to-End Robot Learning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=CiZMMAFQR3)\n- **RobotArena $\\infty$**: Unlimited Robot Benchmarking via Real-to-Sim Translation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=OutljIofvS)\n- **RoboInter**: A Holistic Intermediate Representation Suite Towards Robotic Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=PGUC3mmMoi)\n- **ENACT**: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=Patx6MRipw)\n- **AutoBio**: A Simulation and Benchmark for Robotic Automation in Digital Biology Laboratory [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=UUE6HEtjhu)\n- Image Quality Assessment for Embodied AI [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=azj53PLJRL)\n- **MoMaGen**: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=bGPDviEtZ1)\n- **CoNavBench**: Collaborative Long-Horizon Vision-Language Navigation Benchmark [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=bMrH2PFMsi)\n- **World2Minecraft**: Occupancy-Driven simulated scenes Construction [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=dc90uPqxWF)\n- **CitySeeker**: How Do VLMs Explore Embodied Urban Navigation with Implicit Human Needs? [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=hzf23XSDcs)\n- **Seeing Across Views**: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=jXDZJAfRZB)\n- **RoboCasa365**: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=tQJYKwc3n4)\n- **REI-Bench**: Can Embodied Agents Understand Vague Human Instructions in Task Planning? 
[Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=vmBIF25KLf)\n\n## Other\n\n- On the Generalization Capacities of MLLMs for Spatial Intelligence [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=DE5ZJtR4bg)\n- **Embodied Agents Meet Personalization**: Investigating Challenges and Solutions Through the Lens of Memory Utilization [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=E5L43l5EIu)\n- Interaction-aware Representation Modeling With Co-Occurrence Consistency for Egocentric Hand-Object Parsing [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=RYwQ0xQcAh)\n- **PhyScensis**: Physics-Augmented LLM Agents for Complex Physical Scene Arrangement [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=aCVfhY4Qen)\n- **OmniActor**: A Generalist GUI and Embodied Agent for 2D&3D Worlds [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=oJAIjUDxkZ)\n- **EgoWorld**: Translating Exocentric View to Egocentric View using Rich Exocentric Observations [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=wcTuZG9P2o)\n- Contact-guided Real2Sim from Monocular Video with Planar Scene Primitives [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=xlr3NqxUqY)\n\n# NeuIPS2025\n## Vision-Language-Action Model\n- **Fast-in-Slow**: A Dual-System VLA Model Unifying Fast Manipulation within Slow Reasoning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01953) [Page](https:\u002F\u002Ffast-in-slow.github.io\u002F)\n- **AC-DiT**: Adaptive Coordination Diffusion Transformer for Mobile Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01961) [Page](https:\u002F\u002Fac-dit.github.io\u002F)\n- **BridgeVLA**: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.07961) [Page](https:\u002F\u002Fbridgevla.github.io\u002F)\n- **CogVLA**: Cognition-Aligned Vision-Language-Action Models via Instruction-Driven Routing & Sparsification [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.21046) [Page](https:\u002F\u002Fjiutian-vl.github.io\u002FCogVLA-page\u002F)\n- **VideoVLA**: Video Generators Can Be Generalizable Robot Manipulators\n- **ChatVLA-2**: Vision-Language-Action Model with Open-World Reasoning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.21906) [Page](https:\u002F\u002Fchatvla-2.github.io\u002F)\n- Exploring the Limits of Vision-Language-Action Manipulation in Cross-task Generalization [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15660) [Page](https:\u002F\u002Fjiaming-zhou.github.io\u002FAGNOSTOS\u002F)\n- **BadVLA**: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-Decoupled Optimization [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16640) [Page](https:\u002F\u002Fgithub.com\u002FZxy-MLlab\u002FBadVLA)\n- **Compliant Residual DAgger**: Improving Real-World Contact-Rich Manipulation with Human Corrections [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.16685) [Page](https:\u002F\u002Fcompliant-residual-dagger.github.io\u002F)\n- **VLA-OS**: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.17561) [Page](https:\u002F\u002Fnus-lins-lab.github.io\u002Fvlaos\u002F)\n- **ThinkAct**: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.16815) [Page](https:\u002F\u002Fjasper0314-huang.github.io\u002Fthinkact-vla\u002F)\n- Self-Improving 
Embodied Foundation Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.15155) [Page](https:\u002F\u002Fself-improving-efms.github.io\u002F)\n- **Robo2VLM**: Improving Visual Question Answering using Large-Scale Robot Manipulation Data [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15517) [Page](https:\u002F\u002Fberkeleyautomation.github.io\u002Frobo2vlm\u002F)\n- **EnerVerse**: Envisioning Embodied Future Space for Robotics Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.01895)\n- Learning Spatial-Aware Manipulation Ordering [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.25138)\n- **PRIMT**: Preference-based Reinforcement Learning with Multimodal Feedback and Trajectory Synthesis from Foundation Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.15607)\n- **BEAST**: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06072)\n- **PointMapPolicy**: Structured Point Cloud Processing for Multi-Modal Imitation Learning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.20406)\n- Real-Time Execution of Action Chunking Flow Policies [Paper](https:\u002F\u002Fwww.physicalintelligence.company\u002Fdownload\u002Freal_time_chunking.pdf) [Page](https:\u002F\u002Fwww.physicalintelligence.company\u002Fresearch\u002Freal_time_chunking)\n- **Chain-of-Action**: Trajectory Autoregressive Modeling for Robotic Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.09990) [Page](https:\u002F\u002Fchain-of-action.github.io\u002F)\n- **4D-VLA**: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration\n- **SAFE**: Multitask Failure Detection for Vision-Language-Action Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.09937) [Page](https:\u002F\u002Fvla-safe.github.io\u002F)\n- Blindfolded Experts Generalize Better: Insights from Robotic Manipulation and Videogames [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.24194) [Page](https:\u002F\u002Fsites.google.com\u002Fview\u002Fblindfoldedexperts\u002Fhome)\n- **HiMaCon:** Discovering Hierarchical Manipulation Concepts from Unlabeled Multi-Modal Data [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.11321)\n- Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23705)\n- Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01218) [Page](https:\u002F\u002Factol-pretrain.github.io\u002F)\n- **DreamVLA**: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.04447) [Page](https:\u002F\u002Fzhangwenyao1.github.io\u002FDreamVLA\u002Findex.html)\n\n## Data\n- **EgoBridge**: Domain Adaptation for Generalizable Imitation from Egocentric Human Data [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.19626) [Page](https:\u002F\u002Fego-bridge.github.io\u002F)\n- **RobotSmith**: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skill [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14763) [Page](https:\u002F\u002Fumass-embodied-agi.github.io\u002FRobotSmith\u002F)\n- **URDF-Anything**: Constructing Articulated Objects with 3D Multimodal Language Model [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.00940)\n- **DEAL**: Diffusion Evolution Adversarial Learning for 
Sim-to-Real Transfer\n- Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.18631) [Page](https:\u002F\u002Fot-sim2real.github.io\u002F)\n\n## World Model\n- **SAMPO**: Scale-wise Autoregression with Motion Prompt for Generative World Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.15536)\n- Learning 3D Persistent Embodied World Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.05495)\n- **OSVI-WM**: One-Shot Visual Imitation for Unseen Tasks using World-Model-Guided Trajectory Generation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20425)\n\n## Planning and Reasoning\n- Towards Reliable LLM-based Robots Planning via Combined Uncertainty Estimation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.08044)\n- **Towards Reliable Code-as-Policies**: A Neuro-Symbolic Framework for Embodied Task Planning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.21302)\n- **RDD**: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.14968) [Page](https:\u002F\u002Frdd-neurips.github.io\u002F)\n- **UniDomain**: Pretraining a Unified PDDL Domain from Real-World Demonstrations for Generalizable Robot Task Planning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.21545)\n- InstructFlow: Adaptive Symbolic Constraint-Guided Code Generation for Long-Horizon Planning\n\n## Navigation\n- **C-NAV**: Towards Self-Evolving Continual Object Navigation in Open World [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.20685) [Page](https:\u002F\u002Fbigtree765.github.io\u002FC-Nav-project\u002F)\n- Distilling LLM Prior to Flow Model for Generalizable Agent’s Imagination in Object Goal Navigation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.09423)\n- **TP-MDDN**: Task-Preferenced Multi-Demand-Driven Navigation with Autonomous Decision-Making\n- Active Test-time Vision-Language Navigation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06630)\n- **Aux-Think**: Exploring Reasoning Strategies for Data-Efficient Vision-Language Navigation  \n- **EfficientNav**: Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.18546)\n- **Seeing through Uncertainty**: Robust Task-Oriented Optimization in Visual Navigation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.00441) [Page](https:\u002F\u002Fgithub.com\u002FPyyWill\u002FNeuRO)\n\n## Humanoid\n- Adversarial Locomotion and Motion Imitation for Humanoid Policy Learning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.14305) [Page](https:\u002F\u002Falmi-humanoid.github.io\u002F)\n- From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.12779) [Page](https:\u002F\u002Fbeingbeyond.github.io\u002FBumbleBee\u002F)\n- **KungfuBot**: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills [Paper](https:\u002F\u002Fkungfu-bot.github.io\u002F) [Page](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.12851)\n\n## 3D Vision\n- **DynaRend**: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.24261)\n- Building 3D Representations and Generating Motions From a Single Image via Video-Generation 
[Paper](https:\u002F\u002Fneurips.cc\u002Fvirtual\u002F2025\u002Floc\u002Fsan-diego\u002Fposter\u002F118141)\n\n## Policy\n- Emerging Risks from Embodied AI Require Urgent Policy Action\n- Human-assisted Robotic Policy Refinement via Action Preference Optimization [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.07127) [Page](https:\u002F\u002Fgewu-lab.github.io\u002Faction_preference_optimization\u002F)\n- *Hyper-GoalNet*: Goal-Conditioned Manipulation Policy Learning with HyperNetworks\n- **ReinFlow**: Fine-tuning Flow Matching Policy with Online Reinforcement Learning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22094) [Page](https:\u002F\u002Freinflow.github.io\u002F)\n- Diversifying Parallel Ergodic Search: A Signature Kernel Evolution Strategy\n- **FreqPolicy**: Efficient Flow-based Visuomotor Policy via Frequency Consistency [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08822)\n- A Practical Guide for Incorporating Symmetry in Diffusion Policy [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.13431)\n- **Latent Policy Barrier**: Learning Robust Visuomotor Policies by Staying In-Distribution [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.05941) [Page](https:\u002F\u002Fproject-latentpolicybarrier.github.io\u002F)\n- Quantization-Free Autoregressive Action Transformer [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.14259)\n- Real-World Reinforcement Learning of Active Perception Behaviors\n- Failure Prediction at Runtime for Generative Robot Policies [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.09459)\n- **Act to See, See to Act**: Diffusion-Driven Perception-Action Interplay for Adaptive Policies [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.25822) [Page](https:\u002F\u002Fjingwang18.github.io\u002Fdp-ag.github.io\u002F)\n- **Dynamic Test-Time Compute Scaling in Control Policy**: Difficulty-Aware Stochastic Interpolant Policy [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.20906)\n- **DynaGuide**: Steering Diffusion Polices with Active Dynamic Guidance [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13922) [Page](https:\u002F\u002Fdynaguide.github.io\u002F)\n- World-aware Planning Narratives Enhance Large Vision-Language Model Planner [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.21230)\n \n## Accelerating and Deploying\n- Accelerating Visual-Policy Learning through Parallel Differentiable Simulation [Paper](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2505.10646) [Page](https:\u002F\u002Fhaoxiangyou.github.io\u002FDva_website\u002F)\n- **EfficientVLA**: Training-Free Acceleration and Compression for Vision-Language-Action Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10100)\n- A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05294) [Page](https:\u002F\u002Fgokul.dev\u002Fsailor\u002F)\n- **VLA-Cache**: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.02175) [Page](https:\u002F\u002Fvla-cache.github.io\u002F)\n\n## Tactile\n- Universal Visuo-Tactile Video Understanding for Embodied Interaction [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22566)\n- Enhancing Tactile-based Reinforcement Learning for Robotic Control [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.21609) [Page](https:\u002F\u002Felle-miller.github.io\u002Ftactile_rl\u002F)\n- **Taccel**: Scaling Up Vision-based Tactile Robotics via 
High-performance GPU Simulation [Paper](https:\u002F\u002Ftaccel-simulator.github.io\u002Fassets\u002Ftaccel-paper.pdf) [Page](http:\u002F\u002Ftaccel-simulator.github.io\u002F)\n- **Toward Artificial Palpation**: Representation Learning of Touch on Soft Bodies [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.16596) [Page](https:\u002F\u002Fzoharri.github.io\u002Fartificial-palpation\u002F)\n- **Touch in the Wild**: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.15062v1) [Page](https:\u002F\u002Fbinghao-huang.github.io\u002Ftouch_in_the_wild\u002F)\n\n## Dexterous\n- Contact Map Transfer with Conditional Diffusion Model for Generalizable Dexterous Grasp Generation [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2511.01276) [Page](https:\u002F\u002Fcmtdiffusion.github.io\u002F)\n- **HumanoidGen**: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.00833) [Page](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.00833)\n- **Grasp2Grasp**: Vision-Based Dexterous Grasp Translation via Schrödinger Bridges [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02489) [Page](https:\u002F\u002Fgrasp2grasp.github.io\u002F)\n- Scaffolding Dexterous Manipulation with Vision-Language Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.19212) [Page](https:\u002F\u002Fsites.google.com\u002Fview\u002Fdexterous-vlm-scaffolding)\n- **DexFlyWheel**: A Scalable and Self-improving Data Generation Framework for Dexterous Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.23829) [Page](https:\u002F\u002Fdexflywheel.github.io\u002F)\n- **DexGarmentLab**: Dexterous Garment Manipulation Environment with Generalizable Policy [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.11032) [Page](https:\u002F\u002Fwayrise.github.io\u002FDexGarmentLab\u002F)\n\n## Benchmark and Dataset\n- **RoboCerebra**: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation [Paper](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2506.06677) [Page](https:\u002F\u002Fgithub.com\u002Fqiuboxiang\u002FRoboCerebra)\n- **SutureBot**: A Precision Framework & Benchmark For Autonomous End-to-End Suturing [Paper](https:\u002F\u002Fsuturebot.github.io\u002Fstatic\u002FSutureBot_NeurIPS_2025.pdf) [Page](https:\u002F\u002Fsuturebot.github.io\u002F)\n- Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration\n- **LabUtopia**: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22634) [Page](https:\u002F\u002Frui-li023.github.io\u002Flabutopia-site\u002F)\n- **SonoGym**: High Performance Simulation for Challenging Surgical Tasks with Robotic Ultrasound [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01152) [Page](https:\u002F\u002Fgithub.com\u002FSonoGym\u002FSonoGym)\n- Embodied Crowd Counting\n- **PAC Bench**: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies? 
[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.23725)\n\n# CORL2025\n\n## Vision-Language-Action Model\n\n- **$\\pi_{0.5}$**: a Vision-Language-Action Model with Open-World Generalization [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16054) [page](https:\u002F\u002Fwww.pi.website\u002Fblog\u002Fpi05)\n- Training Strategies for Efficient Embodied Reasoning[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.08243) [page](https:\u002F\u002Fecot-lite.github.io\u002F)\n- **Long-VLA**: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.19958) [page](https:\u002F\u002Flong-vla.github.io\u002F)\n- **RoboMonkey**: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.17811) [page](https:\u002F\u002Frobomonkey-vla.github.io\u002F)\n- **RoboChemist**: Long-Horizon and Safety-Compliant Robotic Chemical Experimentation [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.08820) [page](https:\u002F\u002Fzzongzheng0918.github.io\u002FRoboChemist.github.io\u002F)\n- **TA-VLA**: Elucidating the Design Space of Torque-aware Vision-Language-Action Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.07962) [page](https:\u002F\u002Fzzongzheng0918.github.io\u002FTorque-Aware-VLA.github.io\u002F)\n- **Focusing on What Matters**: Object-Agent-centric Tokenization for Vision Language Action models [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=Ict1OjU9gl#discussion) \n- **FLOWER**: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.04996) [page](https:\u002F\u002Fintuitive-robots.github.io\u002Fflower_vla\u002F)\n- Mechanistic Interpretability for Steering Vision-Language-Action Models [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.00328) \n- **RICL**: Adding In-Context Adaptability to Pre-Trained Vision-Language-Action Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.02062) [page](https:\u002F\u002Fricl-vla.github.io\u002F)\n- **DexVLA**: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.05855) [page](https:\u002F\u002Fgithub.com\u002Fjuruobenruo\u002FDexVLA)\n- **FLARE**: Robot Learning with Implicit World Modeling [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15659) [page](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fgear\u002Fflare\u002F)\n- **3DS-VLA**: A 3D Spatial-Aware Vision Language Action Model for Robust Multi-Task Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=dT45OMevL5#discussion) \n- **GraspVLA**: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.03233) [page](https:\u002F\u002Fpku-epic.github.io\u002FGraspVLA-web\u002F)\n- **EndoVLA**: Dual-Phase Vision-Language-Action for Precise Autonomous Tracking in Endoscopy [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15206) \n- **MoTo**: A Zero-shot Plug-in Interaction-aware Navigation for General Mobile Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.01658) [page](https:\u002F\u002Fgary3410.github.io\u002FMoTo\u002F)\n- **ControlVLA**: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.16211) 
[page](https:\u002F\u002Fcontrolvla.github.io\u002F)\n- **TrackVLA**: Embodied Visual Tracking in the Wild [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23189) [page](https:\u002F\u002Fpku-epic.github.io\u002FTrackVLA-web\u002F)\n- **AnyPlace**: Learning Generalizable Object Placement for Robot Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04531) [page](https:\u002F\u002Fany-place.github.io\u002F)\n- Generalist Robot Manipulation beyond Action Labeled Data [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.19958) [page](https:\u002F\u002Fmotovla.github.io\u002F)\n- **LaVA-Man**: Learning Visual Action Representations for Robot Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.19391) [page](https:\u002F\u002Fqm-ipalab.github.io\u002FLaVA-Man\u002F)\n\n## Navigation\n\n- **MoTo**: A Zero-shot Plug-in Interaction-aware Navigation for General Mobile Manipulation \n- Meta-Optimization and Program Search using Language Models for Task and Motion Planning \n- **ObjectReact**: Learning Object-Relative Control for Visual Navigation\n- **HALO**: Human Preference Aligned Offline Reward Learning for Robot Navigation\n- Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models\n- **Long Range Navigator (LRN)**: Extending robot planning horizons beyond metric maps\n- **Search-TTA**: A Multi-Modal Test-Time Adaptation Framework for Visual Search in the Wild\n- **ActLoc**: Learning to Localize on the Move via Active Viewpoint Selection\n- Human-like Navigation in a World Built for Humans\n- **GC-VLN**: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation\n- **GraspMolmo**: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation\n- **Belief-Conditioned One-Step Diffusion**: Real-Time Trajectory Planning with Just-Enough Sensing\n## Policy\n\n- **ImMimic**: Cross-Domain Imitation from Human Videos via Mapping and Interpolation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.10952) [page](https:\u002F\u002Fsites.google.com\u002Fview\u002Fimmimic)\n- **ReWiND**: Language-Guided Rewards Teach Robot Policies without New Demonstrations [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10911) [page](https:\u002F\u002Frewind-reward.github.io\u002F)\n- Steering Your Diffusion Policy with Latent Space Reinforcement Learning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.15799) [page](https:\u002F\u002Fdiffusion-steering.github.io\u002F)\n- **Streaming Flow Policy**: Simplifying diffusion\u002Fflow-matching policies by treating action trajectories as flow trajectories [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.21851) [page](https:\u002F\u002Fsiddancha.github.io\u002Fstreaming-flow-policy\u002F)\n- **SAIL**: Faster-than-Demonstration Execution of Imitation Learning Policies [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.11948) [page](https:\u002F\u002Fnadunranawaka1.github.io\u002Fsail-policy\u002F)\n- Reactive In-Air Clothing Manipulation with Confidence-Aware Dense Correspondence and Visuotactile Affordance [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.03889) [page](https:\u002F\u002Fmhtippur.github.io\u002Finairclothmanipulation\u002F)\n- Data Retrieval with Importance Weights for Few-Shot Imitation Learning [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.01657) [page](https:\u002F\u002Frahulschand.github.io\u002Fiwr\u002F)\n- **X-Sim**: Cross-Embodiment Learning via Real-to-Sim-to-Real 
[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07096) \n- **DemoSpeedup**: Accelerating Visuomotor Policies via Entropy-Guided Demonstration Acceleration [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05064) [page](https:\u002F\u002Fdemospeedup.github.io\u002F)\n- **ManiFlow**: A General Robot Manipulation Policy via Consistency Flow Training [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.01819) [page](https:\u002F\u002Fmaniflow-policy.github.io\u002F)\n- **Text2Touch**: Tactile In-Hand Manipulation with LLM-Designed Reward Functions [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.07445) [page](https:\u002F\u002Fhpfield.github.io\u002Ftext2touch-website\u002F)\n- **Multi-Loco**: Unifying Multi-Embodiment Legged Locomotion via Reinforcement Learning Augmented Diffusion [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.11470) [page](https:\u002F\u002Fmops-tamp.github.io\u002F)\n- $\\texttt{SPIN}$: distilling $\\texttt{Skill-RRT}$ for long-horizon prehensile and non-prehensile manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.18015) \n- Imitation Learning Based on Disentangled Representation Learning of Behavioral Characteristics [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.04737) \n- Constraint-Preserving Data Generation for One-Shot Visuomotor Policy Generalization [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.03944) [page](https:\u002F\u002Fcp-gen.github.io\u002F)\n- **CLASS**: Contrastive Learning via Action Sequence Supervision for Robot Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.01600) [page](https:\u002F\u002Fclass-robot.github.io\u002F)\n- **MirrorDuo**: Reflection-Consistent Visuomotor Learning from Mirrored Demonstration Pairs [page](https:\u002F\u002Fgithub.com\u002Fzheyu-zhuang\u002Fmirror-duo?tab=readme-ov-file)\n- Dynamics-Compliant Trajectory Diffusion for Super-Nominal Payload Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.21375)\n- **Eye, Robot**: Learning to Look to Act with a BC-RL Perception-Action Loop [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10968) [page](https:\u002F\u002Fwww.eyerobot.net\u002F)\n- **ARCH**: Hierarchical Hybrid Learning for Long-Horizon Contact-Rich Robotic Assembly [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.16451) [page](https:\u002F\u002Flong-horizon-assembly.github.io\u002F)\n- **KDPE**: A Kernel Density Estimation Strategy for Diffusion Policy Trajectory Selection [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.10511) [page](https:\u002F\u002Fhsp-iit.github.io\u002FKDPE\u002F)\n- **AimBot**: A Simple Auxiliary Visual Cue to Enhance Spatial Awareness of Visuomotor Policies [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.08113) [page](https:\u002F\u002Faimbot-reticle.github.io\u002F)\n- Enabling Long(er) Horizon Imitation for Manipulation Tasks by Modeling Subgoal Transitions \n- **Mobi-$\\pi$**: Mobilizing Your Robot Learning Policy [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23692) [page](https:\u002F\u002Fmobipi.github.io\u002F)\n- Action-Free Reasoning for Policy Generalization [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03729) [page](https:\u002F\u002Frad-generalization.github.io\u002F)\n- **Learn from What We HAVE**: History-Aware VErifier that Reasons about Past Interactions Online [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.00271v1) [page](https:\u002F\u002Fliy1shu.github.io\u002FHAVE_CoRL25\u002F)\n- **D-CODA**: Diffusion for Coordinated Dual-Arm Data 
Augmentation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.04860) [page](https:\u002F\u002Fdcodaaug.github.io\u002FD-CODA\u002F)\n- **ATK**: Automatic Task-driven Keypoint Selection for Robust Policy Learning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13867) [page](https:\u002F\u002Fyunchuzhang.github.io\u002FATK\u002F)\n- **Poke and Strike**: Learning Task-Informed Exploration Policies [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.00178) [page](https:\u002F\u002Fmarina-aoyama.github.io\u002Fpoke-and-strike\u002F)\n- **SafeBimanual**: Diffusion-based trajectory optimization for safe bimanual manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.18268) [page](https:\u002F\u002Fdenghaoyuan123.github.io\u002FSafeBimanip\u002F)\n- **COMBO-Grasp**: Learning Constraint-Based Manipulation for Bimanual Occluded Grasping [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.08054) \n- **Phantom**: Training Robots Without Robots Using Only Human Videos [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.00779) [page](https:\u002F\u002Fphantom-human-videos.github.io\u002F)\n- Learning Long-Context Diffusion Policies via Past-Token Prediction [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=N4WWF8Les5) \n- **VT-Refine**: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tuning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=mV3W5givYb) \n- **COLLAGE**: Adaptive Fusion-based Retrieval for Augmented Policy Learning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.01131) [page](https:\u002F\u002Frobin-lab.cs.utexas.edu\u002FCOLLAGE\u002F)\n- **CDP**: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14769) [page](https:\u002F\u002Fgaavama.github.io\u002FCDP\u002F)\n- Robust Dexterous Grasping of General Objects [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.05287) [page](https:\u002F\u002Fzdchan.github.io\u002FRobust_DexGrasp\u002F)\n- **Point Policy**: Unifying Observations and Actions with Key Points for Robot Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.20391) [page](https:\u002F\u002Fpoint-policy.github.io\u002F)\n## Benchmark and Dataset\n\n- **RoboArena**: Distributed Real-World Evaluation of Generalist Robot Policies\n- **GraspVLA**: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data \n- **CUPID**: Curating Data your Robot Loves with Influence \n- **AutoEval**: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World\n- **ManipBench**: Benchmarking Vision-Language Models for Low-Level Robot Manipulation Functions\n- Ensuring Force Safety in Vision-Guided Robotic Manipulation via Implicit Tactile Calibration\n- Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration\n- **UniSkill**: Imitating Human Videos via Cross-Embodiment Skill Representations\n## Humanoid\n\n- **HuB**: Learning Extreme Humanoid Balance\n- Versatile Loco-Manipulation through Flexible Interlimb Coordination\n- Visual Imitation Enables Contextual Humanoid Control\n- **Hand-Eye Autonomous Delivery**: Learning Humanoid Navigation, Locomotion and Reaching\n- **CLONE**: Closed-Loop Whole-Body Humanoid Teleoperation for Long-Horizon Tasks \n- **Embrace Contacts**: humanoid shadowing with full body ground contacts\n- **Hold My Beer**: Learning Gentle Humanoid Locomotion and End-Effector Stabilization Control\n- **SLAC**: Simulation-Pretrained 
Latent Action Space for Whole-Body Real-World RL\n- **Robot Trains Robot**: Automatic Real-World Policy Adaptation and Learning for Humanoids\n- Humanoid Policy ~ Human Policy\n## World Model\n\n- **Real2Render2Real**: Scaling Robot Data Without Dynamics Simulation or Robot Hardware\n- Cross-Sensor Touch Generation\n- **WoMAP**: World Models For Embodied Open-Vocabulary Object Localization\n- **DreamGen**: Unlocking Generalization in Robot Learning through Video World Models\n- **Tool-as-Interface**: Learning Robot Policies from Observing Human Tool Use\n- Articulated Object Estimation in the Wild\n- **DiWA**: Diffusion Policy Adaptation with World Models\n- Steerable Scene Generation with Post Training and Inference-Time Search\n- Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-top Manipulation\n- **Gen2Act**: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation\n- **Reflective Planning**: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation\n- **LaDi-WM**: A Latent Diffusion-Based World Model for Predictive Manipulation\n## Dexterous Manipulation\n\n- **DexUMI**: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation [page](https:\u002F\u002Fdex-umi.github.io\u002F)\n- **Dexplore**: Scalable Neural Control for Dexterous Manipulation from Reference Scoped Exploration\n- **FFHFlow**: Diverse and Uncertainty-Aware Dexterous Grasp Generation via Flow Variational Inference\n- **GraspQP**: Differentiable Optimization of Force Closure for Diverse and Robust Dexterous Grasping [page](https:\u002F\u002Fgraspqp.github.io\u002F)\n- Morphologically Symmetric Reinforcement Learning for Ambidextrous Bimanual Manipulation\n- **KineDex**: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation\n- **D-Cubed**: Latent Diffusion Trajectory Optimisation for Dexterous Deformable Manipulation\n- **LodeStar**: Long-horizon Dexterity via Synthetic Data Augmentation from Human Demonstrations \n## Sim-to-Real\n\n- **The Sound of Simulation**: Learning Multimodal Sim-to-Real Robot Policies with Generative Audio\n- **FetchBot**: Learning Generalizable Object Fetching in Cluttered Scenes via Zero-Shot Sim2Real\n- **ClutterDexGrasp**: A Sim-to-Real System for General Dexterous Grasping in Cluttered Scenes\n- **SimShear**: Sim-to-Real Shear-based Tactile Servoing\n- **Wheeled Lab**: Modern Sim2Real for Low-cost, Open-source Wheeled Robotics\n- Articulate AnyMesh: Open-vocabulary 3D Articulated Objects Modeling\n- **AgentWorld**: An Interactive Simulation Platform for Scene Construction and Mobile Robotic Manipulation\n- Robot Learning from Any Images\n# ICCV2025\n\n## Vision-Language-Action Model\n- Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.13587) [page](https:\u002F\u002Fvlaattacker.github.io\u002F)\n- **VQ-VLA**: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01016) [page](https:\u002F\u002Fxiaoxiao0406.github.io\u002Fvqvla.github.io)\n- **Dita**: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.19757) [page](https:\u002F\u002Frobodita.github.io\u002F)\n- **Moto**: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos 
[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04445) [page](https:\u002F\u002Fchenyi99.github.io\u002Fmoto\u002F)\n- **A0**: An Affordance-Aware Hierarchical Model for General Robotic Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.12636) [page](https:\u002F\u002Fa-embodied.github.io\u002FA0\u002F)\n- **Embodied VideoAgent**: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.00358) [page](https:\u002F\u002Fembodied-videoagent.github.io\u002F)\n- **CoA-VLA**: Improving Vision-Language-Action Models via Visual-Text Chain-of-Affordance [Paper](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F542)\n- **FedVLA**: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation [Paper](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F1325)\n- Towards Long-Horizon Vision-Language-Action System: Reasoning, Acting and Memory [Paper](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F1915)\n- **PASG**: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation [Paper](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F225)\n- **SD2Actor**: Continuous State Decomposition via Diffusion Embeddings for Robotic Manipulation [Paper](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F1571)\n\n## Vision-Language-Navigation Model\n- **Move to Understand a 3D Scene**: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.04047) [page](https:\u002F\u002Fmtu3d.github.io\u002F)\n- Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.13019) [page](https:\u002F\u002Fcrystalsixone.github.io\u002Fvln_pe.github.io\u002F)\n- **P3Nav**: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.18525)\n- **SAME**: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.05552) [page](https:\u002F\u002Fgithub.com\u002FGengzeZhou\u002FSAME)\n- **NavMorph**: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments  [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.23468) [page](https:\u002F\u002Fgithub.com\u002FFeliciaxyao\u002FNavMorph)\n- Harnessing Input-adaptive Inference for Efficient VLN [Paper](https:\u002F\u002Fopenreview.net\u002Fpdf?id=5gptKWnVPF)\n- Embodied Navigation with Auxiliary Task of Action Description Prediction [Paper](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F1984)\n- 3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation [Paper](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F299)\n- **NavQ**: Learning a Q-Model for Foresighted Vision-and-Language Navigation [Paper](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F944)\n- **monoVLN**: Bridging the Observation Gap between Monocular and Panoramic Vision and Language Navigation [Paper](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F1792)\n\n## Hierarchical 
Planning\n- Adaptive Articulated Object Manipulation On The Fly with Foundation Model Reasoning and Part Grounding [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.18276)\n- **CogNav**: Cognitive Process Modeling for Object Goal Navigation with LLMs [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.10439) [page](https:\u002F\u002Fyhancao.github.io\u002FCogNav\u002F)\n- **RoBridge**: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.01709) [page](https:\u002F\u002Fabliao.github.io\u002FRoBridge\u002F)\n\n## World Model\n- **IRASim**: A Fine-Grained World Model for Robot Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.14540) [page](https:\u002F\u002Fgen-irasim.github.io\u002F)\n- **GWM**: Towards Scalable Gaussian World Models for Robotic Manipulation [Paper](https:\u002F\u002Fziweiwangthu.github.io\u002Fdata\u002FGWM.pdf) [page](https:\u002F\u002Fgaussian-world-model.github.io\u002F)\n- **DyWA**: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.16806) [page](https:\u002F\u002Fpku-epic.github.io\u002FDyWA\u002F)\n- Diffusion-Based Imaginative Coordination for Bimanual Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.11296)\n- Learning 4D Embodied World Models [Paper](https:\u002F\u002Fopenreview.net\u002Fpdf?id=mnwlhvmKMN)\n\n## Policy\n- Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09186)\n- **EC-Flow**: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.06224) [page](https:\u002F\u002Fec-flow1.github.io\u002F)\n- **Dense Policy**: Bidirectional Autoregressive Learning of Actions [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.13217) [page](https:\u002F\u002Fselen-suyue.github.io\u002FDspNet\u002F)\n- **AnyBimanual**: Transferring Unimanual Policy for General Bimanual Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06779) [page](https:\u002F\u002Fanybimanual.github.io\u002F)\n- Learning Precise Affordances from Egocentric Videos for Robotic Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.10123v1) [page](https:\u002F\u002Freagan1311.github.io\u002Faffgrasp)\n- **iManip**: Skill-Incremental Learning for Robotic Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07087) \n- Spatial-Temporal Aware Visuomotor Diffusion Policy Learning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.06710) [page](https:\u002F\u002Fzhenyangliu.github.io\u002FDP4\u002F)\n- **Wavelet Policy**: Lifting Scheme for Policy Learning in Long-Horizon Tasks [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.04331) [page](https:\u002F\u002Fhhuang-code.github.io\u002Fwavelet_policy\u002F)\n- 4D Visual Pre-training for Robot Learning [Paper](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F972)\n\n## Accelerating and Deploying\n- Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15304)\n- On-Device Diffusion Transformer Policy for Efficient Robot Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.00697)\n- **COSMO**: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation 
[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24065)\n- **CARP**: Coarse-to-Fine Autoregressive Prediction for Visuomotor Policy Learning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06782) [page](https:\u002F\u002Fcarp-robot.github.io\u002F)\n\n## Perception\n- **EmbodiedOcc**: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04380) [page](https:\u002F\u002Fykiwu.github.io\u002FEmbodiedOcc\u002F)\n- **Embodied Image Captioning**: Self-supervised Learning Agents for Spatially Coherent Image Descriptions [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.08531) [page](https:\u002F\u002Fhsp-iit.github.io\u002Fembodied-captioning\u002F)\n\n## Benchmark and Dataset\n- **VLABench**: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18194) [page](https:\u002F\u002Fvlabench.github.io\u002F)\n- **RoboFactory**: Exploring Embodied Agent Collaboration with Compositional Constraints [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.16408) [page](https:\u002F\u002Firanqin.github.io\u002Frobofactory\u002F)\n- **HUMOTO**: A 4D Dataset of Mocap Human Object Interactions [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10414) [page](https:\u002F\u002Fjiaxin-lu.github.io\u002Fhumoto\u002F)\n- **RoboMM**: All-in-One Multimodal Large Model for Robotic Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.07215)\n- **MoMa-Kitchen**: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11081) [page](https:\u002F\u002Fmomakitchen.github.io\u002F)\n- **RoboPearls**: Editable Video Simulation for Robot Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.22756) [page](https:\u002F\u002Ftangtaogo.github.io\u002FRoboPearls\u002F)\n- **DexH2R**: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.23152) [page](https:\u002F\u002Fdexh2r.github.io\u002F)\n- **Beyond the Destination**: A Novel Benchmark for Exploration-Aware Embodied Question Answering [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11117) [page](https:\u002F\u002Fgithub.com\u002FHCPLab-SYSU\u002FEXPRESS-Bench)\n- **RobAVA**: A Large-scale Dataset and Baseline Towards Video based Robotic Arm Action Understanding [Paper](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F1787)\n- **RoboAnnotatorX**: A Comprehensive and Universal Annotation Framework for Accurate Understanding of Long-horizon Robot Demonstration [Paper](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F2215)\n\n# ICML2025\n\n## Vision-Language-Action Models\n- **Hi Robot**: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.19417)\n- **OTTER**: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction [paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.03734) [page](https:\u002F\u002Fottervla.github.io\u002F)\n- **UP-VLA**: A Unified Understanding and Prediction Model for Embodied Agent [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.18867)\n- **ELEMENTAL**: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18825)\n- 
**ReinboT**: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07395)\n- **A Large Recurrent Action Model:** xLSTM enables Fast Inference for Robotics Tasks [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22391) [page](https:\u002F\u002Fgithub.com\u002Fml-jku\u002FLRAM)\n\n## Planning and Reasoning\n- Efficient Robotic Policy Learning via Latent Space Backward Planning [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.06861) [page](https:\u002F\u002Flbp-authors.github.io\u002F)\n- Closed-Loop Long-Horizon Robotic Planning via Equilibrium Sequence Modeling [paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.01440) [page](https:\u002F\u002Fgithub.com\u002FSingularity0104\u002Fequilibrium-planner)\n\n## Policies\n\n- **SAM2Act**:Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.18564)\n- Pre-training Auto-regressive Robotic Models with 4D Representations [paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.13142) [page](https:\u002F\u002Farm4r.github.io\u002F)\n- Flow-based Domain Randomization for Learning and Sequencing Robotic Skills [paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.01800)\n- **EmbodiedBench**: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09560) [page](https:\u002F\u002Fembodiedbench.github.io\u002F)\n- Learning Policy Committees for Effective Personalization in MDPs with Diverse Tasks [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01885)\n- Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14803) [page](https:\u002F\u002Fvideo-prediction-policy.github.io\u002F)\n- **STAR**: Learning Diverse Robot Skill Abstractions through Rotation-Augmented [paper](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2506.03863) [page](https:\u002F\u002Fgithub.com\u002FJiuTian-VL\u002FSTAR?tab=readme-ov-file)\n## 3D Vision\n- Unifying 2D and 3D Vision-Language Understanding [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.10745) [page](https:\u002F\u002Funivlg.github.io\u002F)\n- GAPrompt: Geometry-Aware Point Cloud Prompt for 3D Vision Model [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.04119) [page](https:\u002F\u002Fgithub.com\u002Fzhoujiahuan1991\u002FICML2025-GAPrompt)\n\n## Dataset\n- WOMD-Reasoning: A Large-Scale Dataset for Interaction Reasoning in Driving [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.04281) [page](https:\u002F\u002Fgithub.com\u002Fyhli123\u002FWOMD-Reasoning)\n\n# RSS2025\n\n- **Unified World Models**: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.02792) [Page](https:\u002F\u002Fweirdlabuw.github.io\u002Fuwm\u002F)\n- **CordViP**: Correspondence-based Visuomotor Policy for Dexterous Manipulation in Real-World [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.08449) [Page](https:\u002F\u002Faureleopku.github.io\u002FCordViP\u002F)\n- **Reactive Diffusion Policy**: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.02881) [Page](https:\u002F\u002Freactive-diffusion-policy.github.io\u002F)\n- Dynamic Rank Adjustment in Diffusion Policies for Efficient and Flexible Training 
[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03822)\n- **SpatialVLA**: Exploring Spatial Representations for Visual-Language-Action Model [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.15830)\n- **Sketch-to-Skill**: Bootstrapping Robot Learning with Human Drawn Trajectory Sketches [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11918)\n- **NaVILA**: Legged Robot Vision-Language-Action Model for Navigation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04453) [Page](https:\u002F\u002Fnavila-bot.github.io\u002F)\n- **ConRFT**: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.05450) [Page](https:\u002F\u002Fcccedric.github.io\u002Fconrft\u002F)\n- **You Only Teach Once**: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.14208) [Page](https:\u002F\u002Fhnuzhy.github.io\u002Fprojects\u002FYOTO\u002F)\n- **ASAP**: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01143) [Page](https:\u002F\u002Fagile.human2humanoid.com\u002F)\n- **Flying Hand**: End-Effector-Centric Framework for Versatile Aerial Manipulation Teleoperation and Policy Learning [Paper](https:\u002F\u002Flecar-lab.github.io\u002Fflying_hand\u002Fstatic\u002Fpdf\u002Fflying_hand.pdf) [Page](https:\u002F\u002Flecar-lab.github.io\u002Fflying_hand\u002F)\n- **DemoGen**: Synthetic Demonstration Generation for Data-Efficient Visuomotor Policy Learning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.16932) [Page](https:\u002F\u002Fdemo-generation.github.io\u002F)\n- **DOGlove**: Dexterous Manipulation with a Low-Cost Open-Source Haptic Force Feedback Glove [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.07730) [Page](https:\u002F\u002Fdo-glove.github.io\u002F)\n- **RoboSplat**: Novel Demonstration Generation with Gaussian Splatting Enables Robust One-Shot Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.13175) [Page](https:\u002F\u002Fyangsizhe.github.io\u002Frobosplat\u002F)\n- Enhancing Autonomous Driving Systems with On-Board Deployed Large Language Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.11514)\n- **SATA**: Safe and Adaptive Torque-Based Locomotion Policies Inspired by Animal Learning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12674) [Video](https:\u002F\u002Fyoutu.be\u002Fb1cpTq0Rc5w?si=sAd9y5LE2sWynu7v)\n- **FACTR**: Force-Attending Curriculum Training for Contact-Rich Policy Learning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17432) [Page](https:\u002F\u002Fjasonjzliu.com\u002Ffactr\u002F)\n- **RoboVerse**: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.18904) [Page](https:\u002F\u002Froboverseorg.github.io\u002F)\n- **STDArm**: Transferring Visuomotor Policies From Static Data Training to Dynamic Robot Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.18792)\n\n  \n# CVPR2025\n\n## Vision-Language-Action Models\n\n- **UniAct**: Universal Actions For Enhanced Embodied Foundation Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.10105) [Page](https:\u002F\u002F2toinf.github.io\u002FUniAct\u002F)\n- **MoManipVLA**: Transferring Vision-language-action Models for General Mobile Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.13446) 
[Page](https:\u002F\u002Fgary3410.github.io\u002FmomanipVLA\u002F)\n- **CoT-VLA**: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [Paper](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F33233)\n- **SOLAMI**: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00174) [Page](https:\u002F\u002Fsolami-ai.github.io\u002F)\n- A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.06960) [Page](https:\u002F\u002Fgithub.com\u002FCVMI-Lab\u002FSlotMIM)\n- **Think Small, Act Big**: Primitive Prompt Learning for Lifelong Robot Manipulation\n- **Phoenix**: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction [Paper](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F32789)\n- **OmniManip**: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.03841) [Page](https:\u002F\u002Fomnimanip.github.io\u002F)\n- Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.14235)\n- Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation [Abstract](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F34522)\n- Robotic Visual Instruction\n- **RoboGround**: Robot Manipulation with Grounded Vision-Language Priors\n\n## Policies\n- **KStar Diffuser**: Spatial-Temporal Graph Diffusion Policy with Kinematics Modeling for Bimanual Robotic Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.10743)\n- **RoboPEPP**: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.17662)\n- **Lift3D Policy**: Lifting 2D Foundation Models for Robust 3D Robotic Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18623) [Page](https:\u002F\u002Flift3d-web.github.io\u002F)\n- **PDFactor**: Learning Tri-Perspective View Policy Diffusion Field for Multi-Task Robotic Manipulation [Abstract](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F33943)\n- **Two by Two**: Learning Cross-Task Pairwise Objects Assembly for Generalizable Robot Manipulation\n- **FlowRAM**: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic Manipulation [Abstract](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F33579)\n- **G3Flow**: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18369) [Page](https:\u002F\u002Ftianxingchen.github.io\u002FG3Flow\u002F)\n- **DexHandDiff**: Interaction-aware Diffusion Planning for Adaptive Dexterous Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18562) [Page](https:\u002F\u002Fdexdiffuser.github.io\u002F)\n- **AffordDP**: Generalizable Diffusion Policy with Transferable Affordance [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.03142) [Page](https:\u002F\u002Fafforddp.github.io\u002F)\n- **Tra-MoE**: Learning Trajectory Prediction Model from Multiple Domains 
for Adaptive Policy Conditioning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14519) [Page](https:\u002F\u002Fgithub.com\u002FMCG-NJU\u002FTra-MoE)\n\n## Grasp\n- **UniGraspTransformer**: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.02699) [Page](https:\u002F\u002Fdexhand.github.io\u002FUniGraspTransformer\u002F)\n- **DexGrasp Anything**: Towards Universal Robotic Dexterous Grasping with Physics Awareness [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.08257) [Page](https:\u002F\u002Fgithub.com\u002F4DVLab\u002FDexGrasp-Anything)\n- **ZeroGrasp**: Zero-Shot Shape Reconstruction Enabled Robotic Grasping [Paper](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F32440)\n\n## Humanoid\n- Let Humanoid Robots Go Hiking! Integrative Skill Development over Complex Trails [Paper](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F34565) [Page](https:\u002F\u002Flego-h-humanoidrobothiking.github.io\u002F)\n- **MobileH2R**: Learning Generalizable Human to Mobile Robot Handover Exclusively from Scalable and Diverse Synthetic Data [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04595)\n\n## 3D Vision\n- **3D-MVP**: 3D Multiview Pretraining for Robotic Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.18158) [Page](https:\u002F\u002Fjasonqsy.github.io\u002F3DMVP\u002F)\n- **VidBot**: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07135) [Page](https:\u002F\u002Fhanzhic.github.io\u002Fvidbot-project\u002F)\n- **Touch2Shape**: Touch-Conditioned 3D Diffusion for Shape Exploration and Reconstruction [Abs](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F33415)\n\n## Planning and Reasoning\n- **RoboBrain**: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.21257)\n- **PhysVLM**: Enabling Visual Language Models to Understand Robotic Physical Reachability [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.08481)\n- **RoboSpatial**: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.16537)\n- **Tartan IMU**: A Light Foundation Model for Inertial Positioning in Robotics [Abstract](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F33873)\n- **Code-as-Monitor**: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04455) [Page](https:\u002F\u002Fzhoues.github.io\u002FCode-as-Monitor\u002F)\n\n## Video\n\n- **TASTE-Rob**: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11423)\n- **GraphMimic**: Graph-to-Graphs Generative Modeling from Videos for Policy Learning [Paper](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F34942)\n\n\n## Sim2real and Real2sim\n- **Prof. 
Robot**: Differentiable Robot Rendering Without Static and Self-Collisions [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11269) [Page](https:\u002F\u002Fwww.qrcat.cn\u002Fprof-robot\u002F)\n- **AutoURDF**: Unsupervised Robot Modeling from Point Cloud Frames Using Cluster Registration [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.05507) [Page](https:\u002F\u002Fgithub.com\u002Fjl6017\u002FAutoURDF)\n\n## Benchmark and Dataset\n- **RoboTwin**: Dual-Arm Robot Benchmark with Generative Digital Twins (early version)[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.02920) [Page](https:\u002F\u002Frobotwin-benchmark.github.io\u002Fearly-version\u002F)\n- Pixel-aligned RGB-NIR Stereo Imaging and Dataset for Robot Vision [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18025)\n- **RoboSense**: Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.15503) [Page](https:\u002F\u002Fgithub.com\u002Fsuhaisheng\u002FRoboSense)\n\n# ICLR2025\n\n## Vision-Language-Action Models\n\n- **LLaRA**: Supercharging Robot Learning Data for Vision-Language Policy [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.20095) [Page](https:\u002F\u002Fgithub.com\u002FLostXine\u002FLLaRA)\n- **VLAS**: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13508) [Page](https:\u002F\u002Fgithub.com\u002Fwhichwhichgone\u002FVLAS)\n- **TraceVLA**: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.10345) [Page](https:\u002F\u002Ftracevla.github.io\u002F)\n- **Robots Pre-train Robots**: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22325) [Page](https:\u002F\u002Frobots-pretrain-robots.github.io\u002F)\n- **PIDM**: Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15109) [Page](https:\u002F\u002Fnimolty.github.io\u002FSeer\u002F)\n\n## Policies\n\n- **GravMAD**: Grounded Spatial Value Maps Guided Action Diffusion for Generalized 3D Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.20154) [Page](https:\u002F\u002Fgravmad.github.io\u002F)\n- **ReViWo**: Learning View-invariant World Models for Visual Robotic Manipulation [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=vJwjWyt4Ed) [zhihu](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F26181243574)\n- **HAMSTER**: Hierarchical Action Models For Open-World Robot Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.05485) [Page](https:\u002F\u002Fhamster-robot.github.io\u002F)\n- **BadRobot**: Jailbreaking Embodied LLMs in the Physical World [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.20242) [Page](https:\u002F\u002Fembodied-llms-safety.github.io\u002F)\n- **STRAP**: Robot Sub-Trajectory Retrieval for Augmented Policy Learning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15182) [Page](https:\u002F\u002Fweirdlabuw.github.io\u002Fstrap\u002F)\n- **SRSA**: Skill Retrieval and Adaptation for Robotic Assembly Tasks [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.04538) [Page](https:\u002F\u002Fsrsa2024.github.io\u002F)\n- Data Scaling Laws in Imitation Learning for Robotic Manipulation 
[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18647) [Page](https:\u002F\u002Fdata-scaling-laws.github.io\u002F)\n- **Stem-OB**: Generalizable Visual Imitation Learning with Stem-Like Convergent Observation through Diffusion Inversion [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04919) [Page](https:\u002F\u002Fhukz18.github.io\u002FStem-Ob\u002F)\n\n## 3D Vision\n- **Dream to Manipulate**: Compositional World Models Empowering Robot Imitation Learning with Imagination [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14957) [Page](https:\u002F\u002Fleobarcellona.github.io\u002FDreamToManipulate\u002F)\n- **SPA**: 3D Spatial-Awareness Enables Effective Embodied Representation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08208) [Page](https:\u002F\u002Fhaoyizhu.github.io\u002Fspa\u002F)\n\n## Planning and Reasoning\n- **LASeR**: Towards Diversified and Generalizable Robot Design with Large Language Models [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=7mlvOHL6qJ) [Page](https:\u002F\u002Fgithub.com\u002FWoodySJR\u002FLASeR)\n- Physics-informed Temporal Difference Metric Learning for Robot Motion Planning [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=TOiageVNru) [Page](https:\u002F\u002Fgithub.com\u002Fruiqini\u002Fntrl-demo)\n- **AHA**: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00371) [Page](https:\u002F\u002Faha-vlm.github.io\u002F)\n- **EMOS**: Embodiment-aware Heterogeneous Multi-robot Operating System with LLM Agents [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22662) [Page](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22662)\n- **VisualPredicator**: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23156) [Page](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23156)\n- **DenseMatcher**: Learning 3D Semantic Correspondence for Category-Level Manipulation from a Single Demo [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.05268) [Page](https:\u002F\u002Ftea-lab.github.io\u002FDenseMatcher\u002F)\n- 6D Object Pose Tracking in Internet Videos for Robotic Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.10307) [Page](https:\u002F\u002Fponimatkin.github.io\u002Ffreepose\u002F)\n- Multi-Robot Motion Planning with Diffusion Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03072) [Page](https:\u002F\u002Fgithub.com\u002Fyoraish\u002Fmmd)\n\n## Video\n- **GEVRM**: Goal-Expressive Video Generation Model For Robust Visual Manipulation [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09268)\n\n## Sim2real and Real2sim\n- **ReGen**: Generative Robot Simulation via Inverse Design [Paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=EbCUbPZjM1) [Page](https:\u002F\u002Fregen-sim.github.io\u002F)\n\n# ICRA2025\n- **MoRE**: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.08007)\n- **QUART-Online**: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15576) [Page](https:\u002F\u002Fquart-online.github.io\u002F)\n- **SpatialBot**: Precise Spatial Understanding with Vision Language Models [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.13642) 
[Page](https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FSpatialBot)\n","# 身体智能论文顶级会议\n🔥 NeuIPS2025 & CORL2025 & ICCV2025 & ICML2025 & RSS2025 & CVPR2025 & ICLR2025 & **ICLR2026** 身体智能论文列表 资源。\n\n[2025年3月22日] 我们计划在未来整理更多来自顶级会议的身体智能相关论文，并构建一个更加全面的论文列表。如果您有想浏览的会议论文，或者有任何其他建议，请随时提交 issue。\n\n[2025年4月12日] 我们正在更新被 RSS2025（机器人顶级会议）接收的身体智能论文！\n\n[2025年5月21日] 我们正在更新被 ICML2025 接收的身体智能论文！\n\n[2025年8月5日] 我们正在更新被 ICCV2025 接收的身体智能论文！\n\n[2025年9月30日] 我们正在更新被 CORL2025 接收的身体智能论文！\n\n[2025年11月30日] 我们正在更新被 NeuIPS2025 接收的身体智能论文！\n\n[2026年3月12日] 我们正在更新被 ICLR2026 接收的身体智能论文！（[📖 ICLR2026](ICLR\u002FICLR2026.md)）\n\n## 📖 论文列表\n\n- [📖 ICLR2026](#iclr2026)\n  - [视觉-语言-动作模型](#vision-language-action-models-1)\n  - [视觉-语言-导航模型](#vision-language-navigation-models)\n  - [世界模型](#world-models)\n  - [规划与推理](#planning-and-reasoning-1)\n  - [导航](#navigation)\n  - [人形机器人](#humanoid)\n  - [3D 视觉](#3d-vision-1)\n  - [策略](#policy)\n  - [灵巧操作](#dexterous-manipulation)\n  - [触觉](#tactile)\n  - [Sim2real 和 Real2sim](#sim2real-and-real2sim-1)\n  - [基准与数据集](#benchmark-and-dataset)\n  - [其他](#other)\n- [📖 NeuIPS2025](#neuips2025)\n  - [视觉-语言-动作模型](#vision-language-action-model)\n  - [数据](#data)\n  - [世界模型](#world-model)\n  - [规划与推理](#planning-and-reasoning)\n  - [导航](#navigation)\n  - [人形机器人](#humanoid)\n  - [3D 视觉](#3d-vision)\n  - [策略](#policy)\n  - [加速与部署](#accelerating-and-deploying)\n  - [触觉](#tactile)\n  - [灵巧操作](#dexterous)\n  - [基准与数据集](##benchmark-and-dataset)\n- [📖 CORL2025](#corl2025)\n  - [视觉-语言-动作模型](#vision-language-action-model)\n  - [世界模型](#world-model)\n  - [策略](#policy)\n  - [人形机器人](#humanoid)\n  - [导航](#navigation)\n  - [基准与数据集](#benchmark-and-dataset)\n  - [灵巧操作](dexterous-manipulation)\n  - [Sim-to-Real](#sim-to-real)\n- [📖 ICCV2025](#iccv2025)\n  - [视觉-语言-动作模型](#vision-language-action-model)\n  - [视觉-语言-导航模型](#vision-language-navigation-model)\n  - [层次化规划](#hierarchical-planning)\n  - [世界模型](#world-model)\n  - [策略](#policy)\n  - [加速与部署](#accelerating-and-deploying)\n  - [感知](#perception)\n  - [基准与数据集](#benchmark-and-dataset)\n- [📖 ICML2025](#icml2025)\n  - [视觉-语言-动作模型](#vision-language-action-models)\n  - [规划与推理](#planning-and-reasoning)\n  - [策略](#policies)\n  - [3D 视觉](#3d-vision)\n  - [数据集](#dataset)\n- [📖 RSS2025](#rss2025)\n- [📖 CVPR2025](#cvpr2025)\n  - [视觉-语言-动作模型](#vision-language-action-models)\n  - [策略](#policies)\n  - [抓取](#grasp)\n  - [人形机器人](#humanoid)\n  - [规划与推理](#planning-and-reasoning)\n  - [3D 视觉](#3d-vision)\n  - [Sim2real 和 Real2sim](#sim2real-and-real2sim)\n  - [基准与数据集](#benchmark-and-dataset)\n- [📖 ICLR2025](#iclr2025)\n  - [视觉-语言-动作模型](#vision-language-action-models)\n  - [策略](#policies)\n  - [规划与推理](#planning-and-reasoning)\n  - [3D 视觉](#3d-vision)\n  - [Sim2real 和 Real2sim](#sim2real-and-real2sim)\n- [📖 ICRA2025](#icra2025)\n\n\n# ICLR2026\n\n[📄 完整列表](ICLR\u002FICLR2026.md)\n\n## 视觉-语言-行动模型\n\n- 通过经验检索扩展机器人控制的记忆规模 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=1dH4ARGdwD)\n- **MemoryVLA**：用于机器人操作的视觉-语言-行动模型中的感知-认知记忆 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=54U3XHf7qq)\n- **PixelVLA**：推进视觉-语言-行动模型中的像素级理解 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=7M6ryCABIc)\n- **Vlaser**：具有协同具身推理能力的视觉-语言-行动模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=8xTDnj39Ti)\n- 通过分离的前向和逆动力学预训练实现解耦的机器人学习 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=DdrsHWobR1)\n- **MetaVLA**：统一的元联合训练以实现高效的具身适应 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=E1K2Ph3LtS)\n- 统一扩散与自回归以构建可泛化的视觉-语言-行动模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=H1KDMNOKQn)\n- 视觉-语言-行动模型的混合训练 
[论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=IBJtOltTbx)\n- 端到端的听、看、说与行动 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=LYyoRqf0Ij)\n- **WholeBodyVLA**：迈向用于全身运动-操作控制的统一潜在视觉-语言-行动模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=OCJmVjyzN7)\n- **RoboOmni**：全模态情境下的主动式机器人操作 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=OJh7oBCYhL)\n- 统一的视觉-语言-行动模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=PklMD8PwUy)\n- **SP-VLA**：一种联合模型调度与标记剪枝的方法，用于加速视觉-语言-行动模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=RwdGIIjPlC)\n- **Align-Then-stEer**：通过统一的潜在指导调整视觉-语言-行动模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=T3i7Ifeatk)\n- **AutoQVLA**：并非所有通道在视觉-语言-行动模型量化中都同等重要 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=TpL2nXanru)\n- 无需验证器的测试时采样方法，适用于视觉-语言-行动模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=UD4Rw8MOEK)\n- **Interleave-VLA**：利用图像-文本交错指令提升机器人操作能力 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=ULTWUuGhC3)\n- **统一扩散VLA**：通过联合离散扩散过程构建视觉-语言-行动模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=UvQOcw2oCD)\n- **赋予GPT-4类人身体**：搭建现成视觉-语言模型与物理世界之间的桥梁 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=aQWSEjcN9V)\n- 关于视觉-语言-行动模型对多模态扰动的鲁棒性 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=cS6xizdYD5)\n- 针对视觉-语言-行动模型的空间引导训练 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=eKhOrQWAVJ)\n- 借助残差强化学习生成数据，实现自我改进的视觉-语言-行动模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=eUGoqrZ6Ea)\n- 面向高效视觉-语言-行动操作的动作感知动态剪枝 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=ea6j8k8Rnw)\n- **空间强制**：为视觉-语言-行动模型提供隐式空间表征对齐 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=euMVC1DO4k)\n- **Genie Envisioner**：一个用于机器人操作的统一世界基础平台 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=fHLtSxDFKC)\n- **从空间到动作**：将视觉-语言-行动模型锚定于空间基础先验 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=fzmittHfq3)\n- **TwinVLA**：利用双单臂视觉-语言-行动模型实现数据高效的双手操作 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=jG9W6nAwVz)\n- **FASTer**：面向强大且高效的自回归视觉-语言-行动模型，采用可学习的动作分词器与分块解码 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=k6nTUFoqeT)\n- 具身导航基础模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=kkBOIsrCXh)\n- **X-VLA**：软提示Transformer作为可扩展的跨具身视觉-语言-行动模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=kt51kZH4aG)\n- **动作即语言**：在不发生灾难性遗忘的情况下将视觉-语言模型微调为视觉-语言-行动模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=sFO9d6XSlf)\n- **OneTwoVLA**：一种具有适应性推理能力的统一视觉-语言-行动模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=tWMfhoP3as)\n- **VLM4VLA**：在视觉-语言-行动模型中重新审视视觉-语言模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=tc2UsBeODW)\n- **视觉-语言-行动指令微调**：从理解到操作 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=tsxwloasw5)\n- **villa-X**：增强视觉-语言-行动模型中的潜在动作建模 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=y5CaJb17Fn)\n- **从看见到行动**：弥合机器人操作中的推理与决策 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=yngvAamNQi)\n\n## 视觉-语言-导航模型\n\n- **AutoFly**：用于无人机在野外自主导航的视觉-语言-行动模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=88RKxlFUNY)\n- **慢行于地，快行于空**：一种用于可泛化视觉-语言导航的双系统基础模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=GK4rznYwhn)\n- 向可物理执行的3D高斯分布迈进，以支持具身导航 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=HB6KvsqcAn)\n- 面向视觉-语言导航的不确定性感知高斯地图 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=LPv59noPAy)\n- **OpenFly**：一个全面的空中视觉-语言导航平台 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=OKm3w71ymP)\n- **JanusVLN**：通过双重隐式记忆解耦语义与空间性，用于视觉-语言导航 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=RnuB0Nlbd5)\n- **CompassNav**：从路径模仿转向导航中的决策理解 
[论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=eqcDckWHik)\n- **M$^3$E**：通过宏观与微观专家的混合实现持续的视觉-语言导航 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=pFh5ygjN3V)\n- 全天候多场景终身视觉-语言导航，结合Tucker分解适应 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=qSak1Hjfdq)\n- **OmniNav**：一个用于前瞻性探索与视觉-语言导航的统一框架 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=zGtTQTD1zu)\n\n## 世界模型\n\n- **Ctrl-World**: 用于机器人操作的可控制生成式世界模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=748bHL2BAv)\n- **上下文与多样性至关重要**: 世界模型中上下文学习的涌现 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=0GNBqoYcAP)\n- **FantasyWorld**: 通过统一的视频和3D预测实现几何一致性世界建模 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=3q9vHEqsNx)\n- **NeMo-map**: 用于时空运动映射的神经隐式流场 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=4HZgkwVVFO)\n- **Astra**: 具有自回归去噪机制的通用交互式世界模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=8UZpmrxoLG)\n- 通过序列化世界模型赋能多机器人协作 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=IvUM6UwYCJ)\n- 面向动态环境中具身智能体的世界模型测试时混合 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=LQD1MrnbxH)\n- **RIG**: 在端到端通才策略中协同推理与想象力 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=LQv9LU2Ufg)\n- 学习大规模多任务世界模型以实现连续控制 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=MPabX9LEds)\n- 通过物理世界建模实现统一的3D场景理解 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=NQq9JLMfNN)\n- **ExoPredicator**: 为机器人规划学习动态世界的抽象模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=a1zfcaNTkM)\n- 利用非精选数据引导世界模型进行高效强化学习 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=oBXfPyi47m)\n- **Vid2World**: 构建视频扩散模型至交互式世界模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=pFyzqbUiF9)\n- **WMPO**: 基于世界模型的视觉-语言-动作模型策略优化 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=qE2FyvRvuF)\n- 基于少量样本标注的对象中心世界模型，用于样本高效的强化学习 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=qmEyJadwHA)\n- 从稀疏的过渡性情景记忆构建空间世界模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=w3w7WVG4ks)\n- **Cosmos Policy**: 针对视觉运动控制与规划微调视频模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=wPEIStHxYH)\n- **WoW!**: 封闭环世界中的世界模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=yDmb7xAfeb)\n\n## 规划与推理\n\n- **VLMgineer**: 视觉-语言模型作为机器人工具专家 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=nESyz4PvJL)\n- **MomaGraph**: 具备状态感知的统一场景图，结合视觉-语言模型用于具身任务规划 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=3eTr9dGwJv)\n- 基于具身可学习记忆的规划 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=79BOATBal9)\n- **空间理论**: 基础模型能否通过主动探索构建空间信念？ [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=8iPwqr6Adk)\n- 基于推理时扩散缩放的组合式视觉规划 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=EEONns7ae4)\n- 基于经验的知识修正，用于Minecraft中的鲁棒规划 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=N22lDHYrXe)\n- 面向视觉机器人规划的自我改进循环 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=SzUgx5r3wy)\n- **BOLT**: 与决策对齐的蒸馏及预算感知路由，用于受限的多模态QA机器人系统 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=Vsy3nAnaX6)\n- **ReCAPA**: 分层预测校正以缓解级联失效 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=WC6MJ5r5Bj)\n- **只需一次演示即可**: 基于LLM从单个演示中推导规划域 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=Y1VgLHbzCC)\n- **EVLP**: 通过强化监督微调学习统一的具身视觉-语言规划器 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=eJcCW9oNfH)\n- **迈向即兴TAMP**: 在抽象规划图中学习低层级捷径 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=enprG5H9aD)\n- **Embodied-R1**: 面向通用机器人操作的强化具身推理 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=i5wlozMFsQ)\n- 用于机器人故障检测与推理的自我精炼视觉语言模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=jr9hGWQioP)\n- 面向具身AI开放世界目标导向常识回归规划的自然语言PDDL (NL-PDDL) 
[论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=kWCNhRdcDI)\n- **SafeFlowMatcher**: 使用带有控制屏障函数的流匹配进行安全快速规划 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=refcXHU1Nh)\n- **OmniEVA**: 基于任务自适应3D接地与具身感知推理的具身多功能规划器 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=tkEmIJv1tB)\n\n## 导航\n\n- **从看见到体验**: 通过强化学习扩展导航基础模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=0c7nAZjyr5)\n- 终身具身导航学习 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=PaYo96rjij)\n- **CE-Nav**: 基于流引导的强化精炼，用于跨具身局部导航 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=apaLoTumdO)\n- 演员-评论家智能体中受海马体启发的序列生成器所催生的空间表征 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=li1vfqDzRD)\n\n## 类人机器人\n\n- **HWC-Loco**: 一种分层全身控制方法，用于稳健的类人机器人行走 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=3UE3Aatcjy)\n- **任务令牌**: 一种灵活的方法来适配行为基础模型 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=6T3wJQhvc3)\n- **BFM-Zero**: 一种可提示的行为基础模型，用于利用无监督强化学习进行类人机器人控制 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=jkhl2oI0g5)\n- **从语言到行走**: 通过运动潜在指导实现无需重定向的类人机器人控制 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=k3Cyx3Uets)\n\n## 3D视觉\n\n- 用于机器人操作的几何感知4D视频生成 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=18gC6pZVVc)\n- **PD$^{2}$GS**: 基于高斯泼溅实现关节物体的部件级解耦与连续变形 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=W3Q2xvrZtx)\n- **如仿真般操作**: 使机器人具备精确的几何感知能力 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=sWyX1BpeN4)\n\n## 策略\n\n- 基于大语言模型奖励塑造与探索的策略驱动协同，掌握技能学习 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=1vXMfIYFZp)\n- 视觉-本体感知策略在机器人操作中何时会失效？[论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=2RIqqNqALN)\n- **ManipEvalAgent**：可提示且高效的机器人操作策略评估框架 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=3u6AkbWEls)\n- 可远程检测的机器人策略水印技术 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=8s5jBVybhQ)\n- 面向模仿学习的差异感知检索策略 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=9AA27en4go)\n- 捕捉视觉环境结构与控制性能相关 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=AmczI1k3Yk)\n- **VITA**：从视觉到动作的流匹配策略 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=BTe5VLBjPg)\n- **DemoGrasp**：仅需一次示范即可实现通用灵巧抓取 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=Bf4FeuW0Mr)\n- **当机器人比人类更胜任时**：从受限演示者处学习 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=BvirMuKWV1)\n- 基于对应关系驱动的轨迹扭曲的自主游戏 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=FqDmvMZish)\n- 面向异构机器人数据集的跨具身离线强化学习 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=GrsoLVNy3Y)\n- 通过语义势场揭示机器人漏洞 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=Gsrw1vxq1G)\n- 超过示范速度的动作块策略的时间最优执行 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=INsLvSCJ4z)\n- 基于策略似然的查询采样与批评家利用重置，用于高效偏好型强化学习 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=ITeuGb2bYg)\n- 用于学习机器人动作的罗德里格斯网络 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=IZHk6BXBST)\n- 参考引导的技能发现 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=IaGf8Eh5Uo)\n- 用于机器人控制的掩码生成式策略 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=KFu4p3pd11)\n- **GRL-SNAM**：基于微分哈密顿量的几何强化学习，用于未知环境中的导航与地图构建 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=KcC5mwfGf0)\n- **HAMLET**：将你的视觉-语言-动作模型切换为历史感知策略 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=KcJ9U0x6kO)\n- 努力弥合人形机器人控制中大规模预训练与高效微调之间的差距 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=NEOTsyyYH7)\n- 通过协作轨迹控制学习机器人操作的视频生成 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=OeDwYtp8n1)\n- 面向机器人基础模型的策略对比解码 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=P9PVdWyM3U)\n- **揭秘机器人扩散策略**：动作记忆与简单的查找表替代方案 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=PL0tJOfm7I)\n- **H$^3$DP**：用于视觉运动学习的三重层次扩散策略 
[论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=Q1CP0iAmOb)\n- **SimpleVLA-RL**：通过强化学习扩展VLA训练规模 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=TQhSodCM4r)\n- 组合你的策略！通过测试时分布级组合改进基于扩散或基于流的机器人策略 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=TnLFRhLuZ6)\n- 通过专家混合扩散策略抽象机器人操作技能 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=VSWjHIveqZ)\n- 通过形态学预训练加速机器人协同设计 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=WVliGyFwZv)\n- 基于语言对齐的3D关键点实现可泛化的由粗到细的机器人操作 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=WXFfMLyB6y)\n- **VER**：用于机器人学习的基础蒸馏与动态路由的视觉专家Transformer [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=aoorNQFpM6)\n- **SpikePingpong**：基于脉冲视觉的快慢结合乒乓球机器人系统 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=d08yOXs1Dl)\n- **EquAct**：一种SE(3)等变的多任务Transformer，用于3D机器人操作 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=d1wuA8oIH0)\n- 通过事后在线模仿将流转换为策略 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=dQ6d5bgXtM)\n- 面向全身控制的层次化价值分解离线强化学习 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=eSkDNIGbcd)\n- **皮层策略**：一种双流视图Transformer，用于机器人操作 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=eWe8zqGvs5)\n- 几何感知的策略模仿 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=ggofj6tyr3)\n- **收缩性扩散策略**：通过基于得分的收缩采样和微分方程实现稳健的动作扩散 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=iKJbmx1iuQ)\n- 基于价值引导的流，实现高维连续控制的可扩展探索 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=kIYNtxE13h)\n- 具有瞬时速度约束的平均流策略，用于单步动作生成 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=mIeKe74W43)\n- 通过多样化重置和大规模强化学习涌现灵巧性 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=nAO9LcV7nE)\n- 学习零件感知的密集3D特征场，用于可泛化的铰接物体操作 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=qXfRXfAHOK)\n- 带有掩码动作分块的实时机器人执行 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=r0RGJ1j9on)\n- 通过参数合并实现视觉-语言-动作机器人策略的稳健微调 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=uWJwQ5SZoM)\n- **ViPRA**：用于机器人动作的视频预测 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=w3Ik8HUyTT)\n- **RAVEN**：使用RGB摄像头进行端到端等变机器人学习 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=z8BN7KyaPl)\n\n## 灵巧操作\n\n- **DexNDM**: 通过关节级神经动力学模型弥合灵巧手中旋转的现实差距 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=80vjyj5o7l)\n- **EgoDex**: 从大规模第一人称视频中学习灵巧操作 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=FFxkFMU89E)\n- **RFS**: 基于残差流引导的强化学习用于灵巧操作 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=Kt9tJeOwjy)\n- 通过玩随机玩具学习抓取任何物体 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=NZDaMcpXZm)\n- **SARM**: 面向长 horizon 机器人操作的阶段感知奖励建模 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=aemqAxScl9)\n- **UniHM**: 基于视觉语言模型的统一灵巧手操作 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=cVX3VqO8BO)\n- **DexMove**: 学习基于触觉引导的非抓握式灵巧手操作 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=dT3ZciXvNX)\n- **VLBiMan**: 视觉-语言锚定的一次性演示实现可泛化的双臂机器人操作 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=he86smZzRk)\n- 灵巧手的跨具身协同设计 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=k8ovuXEQQu)\n- 无需物理演示，通过模仿生成视频进行机器人操作 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=tv0Sz8A9Tc)\n- 机器人模仿中的动作生成：主次解耦 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=wySMuWHmt4)\n\n## 触觉\n\n- **AnyTouch 2**: 用于动态触觉感知的通用光学触觉表征学习 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=ndilONnABZ)\n- **APPLE**: 通过强化学习迈向通用主动感知 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=hU2gT2Ucua)\n\n## Sim2real 和 Real2sim\n\n- **D-REX**: 用于学习灵巧抓取的可微分实转虚转实引擎 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=13jshGCK9i)\n- **DemoGrasp**: 仅需一次演示即可实现通用灵巧抓取 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=Bf4FeuW0Mr)\n- **Sim2Real VLA**: 
将合成技能零样本泛化到真实操作中 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=H4SyKHjd4c)\n- **Exo-Plore**: 通过与人类对齐的仿真探索外骨骼控制空间 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=TmYcqOnxhN)\n- 多样化重置与大规模强化学习催生的涌现灵巧性 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=nAO9LcV7nE)\n- **模拟中的操作**: 在机器人中实现精确的几何感知 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=sWyX1BpeN4)\n- 面向虚实迁移的基础策略潜在适应 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=yn9dzttHvT)\n- **RobotArena ∞**: 通过实转虚转换实现无限机器人基准测试 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=OutljIofvS)\n- **PD$^{2}$GS**: 基于高斯泼溅的关节物体零件级解耦与连续变形 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=W3Q2xvrZtx)\n- 基于单目视频和平面场景基元的接触引导型实转虚 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=xlr3NqxUqY)\n\n## 基准与数据集\n\n- **D2E**: 在桌面数据上扩展视觉-动作预训练，以迁移到具身 AI [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=TRwQND3xpt)\n- **Memory, Benchmark & Robots**: 一个用于强化学习解决复杂任务的基准 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=9cLPurIZMj)\n- **DataMIL**: 使用数据模型选择机器人模仿学习的数据 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=AcTsKglDdh)\n- **MIMIC**: 带有交互控制的掩码注入式操作视频生成 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=COrUdVuInH)\n- **LeRobot**: 一个用于端到端机器人学习的开源库 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=CiZMMAFQR3)\n- **RobotArena ∞**: 通过实转虚转换实现无限机器人基准测试 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=OutljIofvS)\n- **RoboInter**: 一个面向机器人操作的整体中间表示套件 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=PGUC3mmMoi)\n- **ENACT**: 通过自我中心交互的世界建模评估具身认知 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=Patx6MRipw)\n- **AutoBio**: 一个用于数字生物实验室中机器人自动化的仿真和基准 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=UUE6HEtjhu)\n- 用于具身 AI 的图像质量评估 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=azj53PLJRL)\n- **MoMaGen**: 在软硬约束下生成多步双臂移动操作的演示 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=bGPDviEtZ1)\n- **CoNavBench**: 协作式长 horizon 视觉-语言导航基准 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=bMrH2PFMsi)\n- **World2Minecraft**: 基于占用率驱动的仿真场景构建 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=dc90uPqxWF)\n- **CitySeeker**: VLM 如何在隐含人类需求的情况下探索具身城市导航？ [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=hzf23XSDcs)\n- **跨视角观察**: 在机器人场景中基准测试视觉-语言模型的空间推理能力 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=jXDZJAfRZB)\n- **RoboCasa365**: 一个用于训练和基准测试通用机器人的大规模仿真框架 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=tQJYKwc3n4)\n- **REI-Bench**: 具身智能体能否在任务规划中理解模糊的人类指令？ [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=vmBIF25KLf)\n\n## 其他\n\n- 关于 MLLM 在空间智能方面的泛化能力 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=DE5ZJtR4bg)\n- **具身智能体与个性化**: 通过记忆利用的视角探讨挑战与解决方案 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=E5L43l5EIu)\n- 基于共现一致性的交互感知表征建模，用于自我中心的手物解析 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=RYwQ0xQcAh)\n- **PhyScensis**: 用于复杂物理场景布置的物理增强 LLM 智能体 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=aCVfhY4Qen)\n- **OmniActor**: 一个适用于 2D 和 3D 世界的通用 GUI 和具身智能体 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=oJAIjUDxkZ)\n- **EgoWorld**: 利用丰富的外在视角观测将外在视角转换为自我中心视角 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=wcTuZG9P2o)\n- 基于单目视频和平面场景基元的接触引导型实转虚 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=xlr3NqxUqY)\n\n# NeurIPS2025\n\n## 视觉-语言-行动模型\n- **Fast-in-Slow**: 一种将快速操作统一到慢速推理中的双系统VLA模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01953) [页面](https:\u002F\u002Ffast-in-slow.github.io\u002F)\n- **AC-DiT**: 用于移动操作的自适应协调扩散Transformer [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01961) 
[页面](https:\u002F\u002Fac-dit.github.io\u002F)\n- **BridgeVLA**: 基于输入输出对齐的高效3D操作学习，利用视觉-语言模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.07961) [页面](https:\u002F\u002Fbridgevla.github.io\u002F)\n- **CogVLA**: 通过指令驱动的路由与稀疏化实现认知对齐的视觉-语言-行动模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.21046) [页面](https:\u002F\u002Fjiutian-vl.github.io\u002FCogVLA-page\u002F)\n- **VideoVLA**: 视频生成器可以作为具有泛化能力的机器人操作器\n- **ChatVLA-2**: 具有开放世界推理能力的视觉-语言-行动模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.21906) [页面](https:\u002F\u002Fchatvla-2.github.io\u002F)\n- 探索跨任务泛化中视觉-语言-行动操作的极限 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15660) [页面](https:\u002F\u002Fjiaming-zhou.github.io\u002FAGNOSTOS\u002F)\n- **BadVLA**: 通过目标解耦优化实现对视觉-语言-行动模型的后门攻击 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16640) [页面](https:\u002F\u002Fgithub.com\u002FZxy-MLlab\u002FBadVLA)\n- **柔性残差DAgger**: 通过人类纠正改进现实世界中的接触密集型操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.16685) [页面](https:\u002F\u002Fcompliant-residual-dagger.github.io\u002F)\n- **VLA-OS**: 在视觉-语言-行动模型中构建和剖析规划表示与范式 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.17561) [页面](https:\u002F\u002Fnus-lins-lab.github.io\u002Fvlaos\u002F)\n- **ThinkAct**: 通过强化视觉潜在规划进行视觉-语言-行动推理 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.16815) [页面](https:\u002F\u002Fjasper0314-huang.github.io\u002Fthinkact-vla\u002F)\n- 自我改进的具身基础模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.15155) [页面](https:\u002F\u002Fself-improving-efms.github.io\u002F)\n- **Robo2VLM**: 利用大规模机器人操作数据改进视觉问答 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15517) [页面](https:\u002F\u002Fberkeleyautomation.github.io\u002Frobo2vlm\u002F)\n- **EnerVerse**: 构想机器人操作的具身未来空间 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.01895)\n- 学习空间感知的操作排序 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.25138)\n- **PRIMT**: 基于偏好、多模态反馈以及来自基础模型的轨迹合成的强化学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.15607)\n- **BEAST**: 针对模仿学习，高效编码B样条动作序列的分词方法 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06072)\n- **PointMapPolicy**: 面向多模态模仿学习的结构化点云处理 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.20406)\n- 动作分块流策略的实时执行 [论文](https:\u002F\u002Fwww.physicalintelligence.company\u002Fdownload\u002Freal_time_chunking.pdf) [页面](https:\u002F\u002Fwww.physicalintelligence.company\u002Fresearch\u002Freal_time_chunking)\n- **Action链**: 面向机器人操作的轨迹自回归建模 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.09990) [页面](https:\u002F\u002Fchain-of-action.github.io\u002F)\n- **4D-VLA**: 具有跨场景校准的时空视觉-语言-行动预训练\n- **SAFE**: 视觉-语言-行动模型的多任务故障检测 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.09937) [页面](https:\u002F\u002Fvla-safe.github.io\u002F)\n- 蒙眼专家泛化能力更强：来自机器人操作和电子游戏的洞见 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.24194) [页面](https:\u002F\u002Fsites.google.com\u002Fview\u002Fblindfoldedexperts\u002Fhome)\n- **HiMaCon**: 从无标签多模态数据中发现层次化操作概念 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.11321)\n- 知识隔离型视觉-语言-行动模型：快速训练、快速运行、更好泛化 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23705)\n- 可证明的顺序性和连续性：面向可泛化具身智能体的视觉-语言预训练 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01218) [页面](https:\u002F\u002Factol-pretrain.github.io\u002F)\n- **DreamVLA**: 一个基于全面世界知识的梦想视觉-语言-行动模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.04447) [页面](https:\u002F\u002Fzhangwenyao1.github.io\u002FDreamVLA\u002Findex.html)\n\n## 数据\n- **EgoBridge**: 基于第一人称视角的人类数据实现泛化模仿的领域适应 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.19626) [页面](https:\u002F\u002Fego-bridge.github.io\u002F)\n- **RobotSmith**: 
面向复杂操作技能获取的生成式机器人工具设计 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14763) [页面](https:\u002F\u002Fumass-embodied-agi.github.io\u002FRobotSmith\u002F)\n- **URDF-Anything**: 利用3D多模态语言模型构建关节物体 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.00940)\n- **DEAL**: 扩散进化对抗学习用于模拟到真实的迁移\n- 面向模拟与真实策略协同训练的泛化领域适应 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.18631) [页面](https:\u002F\u002Fot-sim2real.github.io\u002F)\n\n## 世界模型\n- **SAMPO**: 基于尺度自回归与运动提示的生成式世界模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.15536)\n- 学习3D持久性具身世界模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.05495)\n- **OSVI-WM**: 使用世界模型引导的轨迹生成，实现针对未见任务的一次性视觉模仿 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20425)\n\n## 规划与推理\n- 通过联合不确定性估计迈向可靠的LLM驱动机器人规划 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.08044)\n- **迈向可靠的代码即策略**: 一种面向具身任务规划的神经符号框架 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.21302)\n- **RDD**: 面向长时程任务规划对齐的基于检索的演示分解器 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.14968) [页面](https:\u002F\u002Frdd-neurips.github.io\u002F)\n- **UniDomain**: 从真实世界演示中预训练统一PDDL域，用于可泛化的机器人任务规划 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.21545)\n- InstructFlow: 面向长时程规划的自适应符号约束引导代码生成\n\n## 导航\n- **C-NAV**: 朝着开放世界中自我演进的持续性目标导航 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.20685) [主页](https:\u002F\u002Fbigtree765.github.io\u002FC-Nav-project\u002F)\n- 在目标导向导航中，将大语言模型先验蒸馏至流模型以实现可泛化的智能体想象力 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.09423)\n- **TP-MDDN**: 具有自主决策能力的任务偏好型多需求驱动导航\n- 主动测试时视觉-语言导航 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06630)\n- **Aux-Think**: 探索数据高效型视觉-语言导航的推理策略\n- **EfficientNav**: 基于导航地图缓存与检索的端侧目标导向导航 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.18546)\n- **透过不确定性看世界**: 视觉导航中的鲁棒任务导向优化 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.00441) [主页](https:\u002F\u002Fgithub.com\u002FPyyWill\u002FNeuRO)\n\n## 类人机器人\n- 面向类人机器人策略学习的对抗性运动与动作模仿 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.14305) [主页](https:\u002F\u002Falmi-humanoid.github.io\u002F)\n- 从专家到通用：迈向类人机器人的全身控制 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.12779) [主页](https:\u002F\u002Fbeingbeyond.github.io\u002FBumbleBee\u002F)\n- **KungfuBot**: 基于物理的类人机器人全身控制，用于学习高度动态技能 [论文](https:\u002F\u002Fkungfu-bot.github.io\u002F) [主页](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.12851)\n\n## 3D视觉\n- **DynaRend**: 通过掩码未来渲染学习3D动力学，用于机器人操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.24261)\n- 通过视频生成从单张图像构建3D表征并生成运动 [论文](https:\u002F\u002Fneurips.cc\u002Fvirtual\u002F2025\u002Floc\u002Fsan-diego\u002Fposter\u002F118141)\n\n## 政策\n- 身体化AI带来的新兴风险需要紧急政策行动\n- 通过动作偏好优化进行人类辅助的机器人策略改进 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.07127) [主页](https:\u002F\u002Fgewu-lab.github.io\u002Faction_preference_optimization\u002F)\n- *Hyper-GoalNet*: 基于超网络的目标条件操作策略学习\n- **ReinFlow**: 使用在线强化学习微调流匹配策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22094) [主页](https:\u002F\u002Freinflow.github.io\u002F)\n- 多样化并行遍历搜索：一种标志性的核进化策略\n- **FreqPolicy**: 基于频率一致性的高效流式视觉-运动策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08822)\n- 将对称性融入扩散策略的实用指南 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.13431)\n- **潜在策略屏障**: 通过保持分布内来学习鲁棒的视觉-运动策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.05941) [主页](https:\u002F\u002Fproject-latentpolicybarrier.github.io\u002F)\n- 无量化自回归动作Transformer [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.14259)\n- 现实世界中主动感知行为的强化学习\n- 针对生成式机器人策略的运行时故障预测 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.09459)\n- **行动以观，观而行之**: 扩散驱动的感知-行动交互以实现适应性策略 
[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.25822) [主页](https:\u002F\u002Fjingwang18.github.io\u002Fdp-ag.github.io\u002F)\n- **控制策略中的动态测试时计算缩放**: 基于难度感知的随机插值策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.20906)\n- **DynaGuide**: 通过主动动态引导来指导扩散策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13922) [主页](https:\u002F\u002Fdynaguide.github.io\u002F)\n- 具备世界意识的规划叙事增强了大型视觉-语言模型规划器 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.21230)\n\n## 加速与部署\n- 通过并行可微仿真加速视觉策略学习 [论文](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2505.10646) [主页](https:\u002F\u002Fhaoxiangyou.github.io\u002FDva_website\u002F)\n- **EfficientVLA**: 无需训练即可加速和压缩视觉-语言-动作模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10100)\n- 平静的海面从未造就过熟练的航海家：通过学习搜索实现稳健的模仿 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05294) [主页](https:\u002F\u002Fgokul.dev\u002Fsailor\u002F)\n- **VLA-Cache**: 通过自适应标记缓存实现高效的视觉-语言-动作操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.02175) [主页](https:\u002F\u002Fvla-cache.github.io\u002F)\n\n## 触觉\n- 面向具身交互的通用视觉-触觉视频理解 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22566)\n- 提升基于触觉的强化学习在机器人控制中的应用 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.21609) [主页](https:\u002F\u002Felle-miller.github.io\u002Ftactile_rl\u002F)\n- **Taccel**: 通过高性能GPU仿真扩大基于视觉的触觉机器人规模 [论文](https:\u002F\u002Ftaccel-simulator.github.io\u002Fassets\u002Ftaccel-paper.pdf) [主页](http:\u002F\u002Ftaccel-simulator.github.io\u002F)\n- **迈向人工触诊**: 软体上的触觉表征学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.16596) [主页](https:\u002F\u002Fzoharri.github.io\u002Fartificial-palpation\u002F)\n- **野外触感**: 使用便携式视觉-触觉夹持器学习精细操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.15062v1) [主页](https:\u002F\u002Fbinghao-huang.github.io\u002Ftouch_in_the_wild\u002F)\n\n## 灵巧操作\n- 基于条件扩散模型的接触图迁移，用于生成可泛化的灵巧抓取 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2511.01276) [主页](https:\u002F\u002Fcmtdiffusion.github.io\u002F)\n- **HumanoidGen**: 基于LLM推理生成双手灵巧操作的数据 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.00833) [主页](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.00833)\n- **Grasp2Grasp**: 基于薛定谔桥梁的视觉灵巧抓取转换 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02489) [主页](https:\u002F\u002Fgrasp2grasp.github.io\u002F)\n- 使用视觉-语言模型搭建灵巧操作框架 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.19212) [主页](https:\u002F\u002Fsites.google.com\u002Fview\u002Fdexterous-vlm-scaffolding)\n- **DexFlyWheel**: 一个可扩展且自我改进的灵巧操作数据生成框架 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.23829) [主页](https:\u002F\u002Fdexflywheel.github.io\u002F)\n- **DexGarmentLab**: 具有可泛化策略的灵巧服装操作环境 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.11032) [主页](https:\u002F\u002Fwayrise.github.io\u002FDexGarmentLab\u002F)\n\n## 基准与数据集\n- **RoboCerebra**: 用于长 horizon 机器人操作评估的大规模基准 [论文](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2506.06677) [页面](https:\u002F\u002Fgithub.com\u002Fqiuboxiang\u002FRoboCerebra)\n- **SutureBot**: 用于自主端到端缝合的高精度框架及基准 [论文](https:\u002F\u002Fsuturebot.github.io\u002Fstatic\u002FSutureBot_NeurIPS_2025.pdf) [页面](https:\u002F\u002Fsuturebot.github.io\u002F)\n- 为多模态机器人导航与协作合成照片级逼真的动态城市环境\n- **LabUtopia**: 面向科学具身智能体的高保真仿真与分层基准 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22634) [页面](https:\u002F\u002Frui-li023.github.io\u002Flabutopia-site\u002F)\n- **SonoGym**: 面向具有挑战性的机器人超声手术任务的高性能仿真 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01152) [页面](https:\u002F\u002Fgithub.com\u002FSonoGym\u002FSonoGym)\n- 具身人群计数\n- **PAC Bench**: 基础模型是否理解执行操作策略的前提条件？[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.23725)\n\n# 
CORL2025\n\n## 视觉-语言-动作模型\n\n- **$\\pi_{0.5}$**: 具有开放世界泛化能力的视觉-语言-动作模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16054) [页面](https:\u002F\u002Fwww.pi.website\u002Fblog\u002Fpi05)\n- 面向高效具身推理的训练策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.08243) [页面](https:\u002F\u002Fecot-lite.github.io\u002F)\n- **Long-VLA**: 解放视觉语言动作模型在机器人操作中的长 horizon 能力 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.19958) [页面](https:\u002F\u002Flong-vla.github.io\u002F)\n- **RoboMonkey**: 扩展视觉语言动作模型的测试时采样与验证 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.17811) [页面](https:\u002F\u002Frobomonkey-vla.github.io\u002F)\n- **RoboChemist**: 长 horizon 且符合安全规范的机器人化学实验 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.08820) [页面](https:\u002F\u002Fzzongzheng0918.github.io\u002FRoboChemist.github.io\u002F)\n- **TA-VLA**: 阐明扭矩感知型视觉语言动作模型的设计空间 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.07962) [页面](https:\u002F\u002Fzzongzheng0918.github.io\u002FTorque-Aware-VLA.github.io\u002F)\n- **聚焦于关键点**: 面向视觉语言动作模型的对象-智能体中心化标记化 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=Ict1OjU9gl#discussion)\n- **FLOWER**: 通过高效的视觉语言动作流策略 democratize 通用型机器人策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.04996) [页面](https:\u002F\u002Fintuitive-robots.github.io\u002Fflower_vla\u002F)\n- 用于引导视觉语言动作模型的机制性可解释性 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.00328)\n- **RICL**: 为预训练的视觉语言动作模型添加上下文适应能力 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.02062) [页面](https:\u002F\u002Fricl-vla.github.io\u002F)\n- **DexVLA**: 具有插件式扩散专家的视觉语言模型，用于通用机器人控制 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.05855) [页面](https:\u002F\u002Fgithub.com\u002Fjuruobenruo\u002FDexVLA)\n- **FLARE**: 基于隐式世界建模的机器人学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15659) [页面](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fgear\u002Fflare\u002F)\n- **3DS-VLA**: 一种具备 3D 空间感知能力的视觉语言动作模型，用于鲁棒的多任务操作 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=dT45OMevL5#discussion)\n- **GraspVLA**: 在数十亿规模的合成动作数据上预训练的抓取基础模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.03233) [页面](https:\u002F\u002Fpku-epic.github.io\u002FGraspVLA-web\u002F)\n- **EndoVLA**: 双阶段视觉语言动作模型，用于内窥镜检查中的精确自主跟踪 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15206)\n- **MoTo**: 一种零样本插件式交互感知导航，适用于通用移动操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.01658) [页面](https:\u002F\u002Fgary3410.github.io\u002FMoTo\u002F)\n- **ControlVLA**: 针对预训练视觉语言动作模型的少样本对象中心适应 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.16211) [页面](https:\u002F\u002Fcontrolvla.github.io\u002F)\n- **TrackVLA**: 野外具身视觉跟踪 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23189) [页面](https:\u002F\u002Fpku-epic.github.io\u002FTrackVLA-web\u002F)\n- **AnyPlace**: 学习可泛化的机器人操作中物体放置 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04531) [页面](https:\u002F\u002Fany-place.github.io\u002F)\n- 不依赖动作标注数据的通用型机器人操作 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.19958) [页面](https:\u002F\u002Fmotovla.github.io\u002F)\n- **LaVA-Man**: 学习用于机器人操作的视觉动作表征 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.19391) [页面](https:\u002F\u002Fqm-ipalab.github.io\u002FLaVA-Man\u002F)\n\n## 导航\n\n- **MoTo**: 一种零样本插件式交互感知导航，适用于通用移动操作\n- 使用语言模型进行元优化与程序搜索，用于任务与运动规划\n- **ObjectReact**: 学习基于对象相对关系的视觉导航控制\n- **HALO**: 与人类偏好一致的离线奖励学习，用于机器人导航\n- 想象、验证、执行：基于记忆引导的视觉语言模型代理式探索\n- **长距离导航器 (LRN)**: 将机器人规划视野扩展至度量地图之外\n- **Search-TTA**: 一种多模态测试时适应框架，用于野外视觉搜索\n- **ActLoc**: 通过主动视点选择学习移动中的定位\n- 类似人类的导航，在专为人类设计的世界中\n- **GC-VLN**: 将指令作为图约束，用于免训练的视觉与语言导航\n- **GraspMolmo**: 通过大规模合成数据生成实现可泛化的任务导向抓取\n- **信念条件单步扩散**: 
仅需足够感知即可进行实时轨迹规划\n\n## 政策\n\n- **ImMimic**: 通过映射和插值从人类视频中进行跨域模仿 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.10952) [页面](https:\u002F\u002Fsites.google.com\u002Fview\u002Fimmimic)\n- **ReWiND**: 语言引导的奖励无需新演示即可教授机器人策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10911) [页面](https:\u002F\u002Frewind-reward.github.io\u002F)\n- 使用潜在空间强化学习引导扩散策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.15799) [页面](https:\u002F\u002Fdiffusion-steering.github.io\u002F)\n- **Streaming Flow Policy**: 通过将动作轨迹视为流轨迹来简化扩散\u002F流匹配策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.21851) [页面](https:\u002F\u002Fsiddancha.github.io\u002Fstreaming-flow-policy\u002F)\n- **SAIL**: 比演示更快地执行模仿学习策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.11948) [页面](https:\u002F\u002Fnadunranawaka1.github.io\u002Fsail-policy\u002F)\n- 基于置信度感知的密集对应和视觉触觉效用的反应式空中衣物操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.03889) [页面](https:\u002F\u002Fmhtippur.github.io\u002Finairclothmanipulation\u002F)\n- 基于重要性权重的数据检索用于少样本模仿学习 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.01657) [页面](https:\u002F\u002Frahulschand.github.io\u002Fiwr\u002F)\n- **X-Sim**: 通过真实到仿真再到真实的跨化身学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07096)\n- **DemoSpeedup**: 通过熵引导的演示加速来提升视觉运动策略的速度 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05064) [页面](https:\u002F\u002Fdemospeedup.github.io\u002F)\n- **ManiFlow**: 通过一致性流训练的通用机器人操作策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.01819) [页面](https:\u002F\u002Fmaniflow-policy.github.io\u002F)\n- **Text2Touch**: 基于LLM设计的奖励函数的触觉手中操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.07445) [页面](https:\u002F\u002Fhpfield.github.io\u002Ftext2touch-website\u002F)\n- **Multi-Loco**: 通过增强型扩散的强化学习统一多化身足式运动 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.11470) [页面](https:\u002F\u002Fmops-tamp.github.io\u002F)\n- $\\texttt{SPIN}$: 将$\\texttt{Skill-RRT}$提炼用于长 horizon 的抓取与非抓取操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.18015)\n- 基于行为特征解耦表征学习的模仿学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.04737)\n- 用于一次性视觉运动策略泛化的保约束数据生成 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.03944) [页面](https:\u002F\u002Fcp-gen.github.io\u002F)\n- **CLASS**: 通过动作序列监督进行机器人操作的对比学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.01600) [页面](https:\u002F\u002Fclass-robot.github.io\u002F)\n- **MirrorDuo**: 从镜像演示对中进行反射一致的视觉运动学习 [页面](https:\u002F\u002Fgithub.com\u002Fzheyu-zhuang\u002Fmirror-duo?tab=readme-ov-file)\n- 动力学兼容的轨迹扩散用于超额定负载操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.21375)\n- **Eye, Robot**: 通过BC-RL感知-行动环路学习看以行动 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10968) [页面](https:\u002F\u002Fwww.eyerobot.net\u002F)\n- **ARCH**: 用于长 horizon 富接触机器人装配的分层混合学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.16451) [页面](https:\u002F\u002Flong-horizon-assembly.github.io\u002F)\n- **KDPE**: 用于扩散策略轨迹选择的核密度估计策略 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.10511) [页面](https:\u002F\u002Fhsp-iit.github.io\u002FKDPE\u002F)\n- **AimBot**: 一个简单的辅助视觉提示，用于增强视觉运动策略的空间意识 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.08113) [页面](https:\u002F\u002Faimbot-reticle.github.io\u002F)\n- 通过建模子目标转换，为操作任务实现更长的模仿 horizon\n- **Mobi-$\\pi$**: 使您的机器人学习策略移动化 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23692) [页面](https:\u002F\u002Fmobipi.github.io\u002F)\n- 用于策略泛化的无动作推理 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03729) [页面](https:\u002F\u002Frad-generalization.github.io\u002F)\n- **Learn from What We HAVE**: 一种在线推理过去交互的历史感知验证器 
[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.00271v1) [页面](https:\u002F\u002Fliy1shu.github.io\u002FHAVE_CoRL25\u002F)\n- **D-CODA**: 用于协调双臂数据增强的扩散 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.04860) [页面](https:\u002F\u002Fdcodaaug.github.io\u002FD-CODA\u002F)\n- **ATK**: 用于稳健策略学习的自动任务驱动关键点选择 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13867) [页面](https:\u002F\u002Fyunchuzhang.github.io\u002FATK\u002F)\n- **Poke and Strike**: 学习任务导向的探索策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.00178) [页面](https:\u002F\u002Fmarina-aoyama.github.io\u002Fpoke-and-strike\u002F)\n- **SafeBimanual**: 基于扩散的轨迹优化用于安全的双手操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.18268) [页面](https:\u002F\u002Fdenghaoyuan123.github.io\u002FSafeBimanip\u002F)\n- **COMBO-Grasp**: 学习基于约束的手动遮挡抓取操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.08054)\n- **Phantom**: 仅使用人类视频在没有机器人的情况下训练机器人 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.00779) [页面](https:\u002F\u002Fphantom-human-videos.github.io\u002F)\n- 通过预测过去 token 学习长上下文扩散策略 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=N4WWF8Les5)\n- **VT-Refine**: 通过仿真微调，利用视觉-触觉反馈学习双手装配 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=mV3W5givYb)\n- **COLLAGE**: 用于增强型策略学习的自适应融合检索 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.01131) [页面](https:\u002F\u002Frobin-lab.cs.utexas.edu\u002FCOLLAGE\u002F)\n- **CDP**: 通过因果扩散迈向稳健的自回归视觉运动策略学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14769) [页面](https:\u002F\u002Fgaavama.github.io\u002FCDP\u002F)\n- 对一般物体的稳健灵巧抓握 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.05287) [页面](https:\u002F\u002Fzdchan.github.io\u002FRobust_DexGrasp\u002F)\n- **Point Policy**: 通过关键点统一观察与动作，用于机器人操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.20391) [页面](https:\u002F\u002Fpoint-policy.github.io\u002F)\n## 基准与数据集\n\n- **RoboArena**: 分布式通用机器人策略的真实世界评估\n- **GraspVLA**: 在数十亿规模的合成动作数据上预训练的抓取基础模型\n- **CUPID**: 用影响力策划您的机器人喜爱的数据\n- **AutoEval**: 真实世界中通用机器人操作策略的自主评估\n- **ManipBench**: 针对低级机器人操作功能的视觉-语言模型基准测试\n- 通过隐式触觉校准确保视觉引导机器人操作中的力安全性\n- 仅用一次人类演示，通过仿真到现实的强化学习跨越人机化身差距\n- **UniSkill**: 通过跨化身技能表示模仿人类视频\n\n## 类人机器人\n\n- **HuB**: 学习极端的类人机器人平衡\n- 通过灵活的肢体间协调实现多功能的移动与操作\n- 视觉模仿实现情境感知的类人机器人控制\n- **手眼自主配送**：学习类人机器人的导航、行走和抓取\n- **CLONE**: 面向长期任务的闭环全身类人机器人遥操作\n- **拥抱接触**：全身与地面接触的类人机器人跟随\n- **帮我拿啤酒**：学习柔和的类人机器人行走及末端执行器稳定控制\n- **SLAC**: 用于全身现实世界强化学习的仿真预训练潜在动作空间\n- **机器人训练机器人**：类人机器人的自动现实世界策略适应与学习\n- 类人机器人策略 ~ 人类策略\n## 世界模型\n\n- **Real2Render2Real**: 在无需动力学仿真或机器人硬件的情况下扩展机器人数据\n- 跨传感器触觉生成\n- **WoMAP**: 面向具身开放词汇物体定位的世界模型\n- **DreamGen**: 通过视频世界模型解锁机器人学习中的泛化能力\n- **工具即接口**: 通过观察人类使用工具来学习机器人策略\n- 野外关节型物体估计\n- **DiWA**: 基于扩散的世界模型进行策略适应\n- 可引导场景生成结合训练后与推理时搜索\n- 生成式视觉预见结合任务无关姿态估计在机器人桌面操作中的应用\n- **Gen2Act**: 在新场景中生成人类视频以实现可泛化的机器人操作\n- **反思式规划**: 用于多阶段长期机器人操作的视觉-语言模型\n- **LaDi-WM**: 基于潜在扩散的世界模型用于预测性操作\n## 灵巧操作\n\n- **DexUMI**: 将人手用作灵巧操作的通用操作接口 [页面](https:\u002F\u002Fdex-umi.github.io\u002F)\n- **Dexplore**: 基于参考范围探索的可扩展神经网络控制用于灵巧操作\n- **FFHFlow**: 通过流变分推断生成多样化且考虑不确定性的灵巧抓握\n- **GraspQP**: 针对多样化和鲁棒灵巧抓握的力闭合可微优化 [页面](https:\u002F\u002Fgraspqp.github.io\u002F)\n- 形态对称的强化学习用于双手灵巧操作\n- **KineDex**: 通过动觉教学学习触觉感知的视觉运动策略，用于灵巧操作\n- **D-Cubed**: 基于潜在扩散的轨迹优化用于灵巧变形体操作\n- **LodeStar**: 通过来自人类演示的合成数据增强实现长期灵巧操作\n## 仿真到现实\n\n- **模拟的声音**: 使用生成式音频学习多模态仿真到现实机器人策略\n- **FetchBot**: 通过零样本仿真到现实学习杂乱场景中的通用物体抓取\n- **ClutterDexGrasp**: 一种用于杂乱场景中通用灵巧抓握的仿真到现实系统\n- **SimShear**: 基于剪切的触觉伺服的仿真到现实\n- **轮式实验室**: 面向低成本、开源轮式机器人的现代仿真到现实技术\n- Articulate AnyMesh: 开放词汇3D关节型物体建模\n- **AgentWorld**: 用于场景构建和移动机器人操作的交互式仿真平台\n- 从任意图像中学习机器人\n# ICCV2025\n\n## 视觉-语言-行动模型\n\n- 
探索机器人领域中视觉-语言-行动模型的对抗脆弱性 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.13587) [页面](https:\u002F\u002Fvlaattacker.github.io\u002F)\n- **VQ-VLA**: 通过扩展向量量化动作标记器改进视觉-语言-行动模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01016) [页面](https:\u002F\u002Fxiaoxiao0406.github.io\u002Fvqvla.github.io)\n- **Dita**: 扩展扩散Transformer以实现通用视觉-语言-行动策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.19757) [页面](https:\u002F\u002Frobodita.github.io\u002F)\n- **Moto**: 潜在运动标记作为连接语言，用于从视频中学习机器人操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04445) [页面](https:\u002F\u002Fchenyi99.github.io\u002Fmoto\u002F)\n- **A0**: 一种面向通用机器人操作的 affordance感知层次模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.12636) [页面](https:\u002F\u002Fa-embodied.github.io\u002FA0\u002F)\n- **具身视频代理**: 来自第一视角视频和具身传感器的持久记忆能够实现动态场景理解 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.00358) [页面](https:\u002F\u002Fembodied-videoagent.github.io\u002F)\n- **CoA-VLA**: 通过视觉-文本affordance链改进视觉-语言-行动模型 [论文](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F542)\n- **FedVLA**: 基于双门控专家混合的联邦视觉-语言-行动学习用于机器人操作 [论文](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F1325)\n- 朝向长期视觉-语言-行动系统：推理、行动与记忆 [论文](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F1915)\n- **PASG**: 用于机器人操作中自动化几何基元提取和语义锚定的闭环框架 [论文](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F225)\n- **SD2Actor**: 通过扩散嵌入进行连续状态分解以支持机器人操作 [论文](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F1571)\n\n## 视觉-语言-导航模型\n\n- **移动以理解3D场景**: 桥接视觉接地与探索，实现高效且多功能的具身导航 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.04047) [页面](https:\u002F\u002Fmtu3d.github.io\u002F)\n- 重新思考视觉与语言导航中的具身差距：对物理与视觉差异的整体研究 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.13019) [页面](https:\u002F\u002Fcrystalsixone.github.io\u002Fvln_pe.github.io\u002F)\n- **P3Nav**: 一个整合感知、规划与预测的统一具身导航框架 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.18525)\n- **SAME**: 学习基于状态自适应专家混合的通用语言引导视觉导航 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.05552) [页面](https:\u002F\u002Fgithub.com\u002FGengzeZhou\u002FSAME)\n- **NavMorph**: 一个自我演化的世界模型，用于连续环境中的视觉与语言导航 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.23468) [页面](https:\u002F\u002Fgithub.com\u002FFeliciaxyao\u002FNavMorph)\n- 利用输入自适应推理实现高效的VLN [论文](https:\u002F\u002Fopenreview.net\u002Fpdf?id=5gptKWnVPF)\n- 具身导航结合动作描述预测辅助任务 [论文](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F1984)\n- 具有开放集语义分组的3D高斯地图用于视觉与语言导航 [论文](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F299)\n- **NavQ**: 学习用于前瞻性视觉与语言导航的Q模型 [论文](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F944)\n- **monoVLN**: 桥接单目与全景视觉与语言导航之间的观测差距 [论文](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F1792)\n\n## 分层规划\n- 基于基础模型推理与部件接地的自适应关节物体即时操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.18276)\n- **CogNav**：利用大语言模型进行目标物体导航的认知过程建模 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.10439) [页面](https:\u002F\u002Fyhancao.github.io\u002FCogNav\u002F)\n- **RoBridge**：一种连接认知与执行的分层架构，用于通用机器人操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.01709) [页面](https:\u002F\u002Fabliao.github.io\u002FRoBridge\u002F)\n\n## 世界模型\n- **IRASim**：用于机器人操作的细粒度世界模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.14540) [页面](https:\u002F\u002Fgen-irasim.github.io\u002F)\n- **GWM**：迈向可扩展的高斯世界模型，用于机器人操作 [论文](https:\u002F\u002Fziweiwangthu.github.io\u002Fdata\u002FGWM.pdf) 
[页面](https:\u002F\u002Fgaussian-world-model.github.io\u002F)\n- **DyWA**：适用于泛化非抓取式操作的动力学自适应世界动作模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.16806) [页面](https:\u002F\u002Fpku-epic.github.io\u002FDyWA\u002F)\n- 基于扩散的双臂操作想象协调 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.11296)\n- 学习4D具身世界模型 [论文](https:\u002F\u002Fopenreview.net\u002Fpdf?id=mnwlhvmKMN)\n\n## 策略\n- 重新思考双臂机器人操作：基于解耦交互框架的学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09186)\n- **EC-Flow**：通过以具身为中心的流体方法，从无动作标注视频中实现多功能机器人操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.06224) [页面](https:\u002F\u002Fec-flow1.github.io\u002F)\n- **Dense Policy**：双向自回归动作学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.13217) [页面](https:\u002F\u002Fselen-suyue.github.io\u002FDspNet\u002F)\n- **AnyBimanual**：将单臂策略迁移至通用双臂操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06779) [页面](https:\u002F\u002Fanybimanual.github.io\u002F)\n- 从第一人称视频中学习精确的可供性，用于机器人操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.10123v1) [页面](https:\u002F\u002Freagan1311.github.io\u002Faffgrasp)\n- **iManip**：面向机器人操作的技能增量式学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07087)\n- 具有时空感知的视觉-运动扩散策略学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.06710) [页面](https:\u002F\u002Fzhenyangliu.github.io\u002FDP4\u002F)\n- **Wavelet Policy**：用于长 horizon 任务的策略学习提升方案 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.04331) [页面](https:\u002F\u002Fhhuang-code.github.io\u002Fwavelet_policy\u002F)\n- 面向机器人学习的4D视觉预训练 [论文](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F972)\n\n## 加速与部署\n- 基于显著性感知的量化模仿学习，用于高效机器人控制 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15304)\n- 设备端扩散 Transformer 策略，用于高效机器人操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.00697)\n- **COSMO**：结合选择性记忆的低成本视觉-语言导航 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24065)\n- **CARP**：用于视觉-运动策略学习的粗到细自回归预测 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06782) [页面](https:\u002F\u002Fcarp-robot.github.io\u002F)\n\n## 感知\n- **EmbodiedOcc**：基于视觉的在线场景理解中的具身3D占用预测 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04380) [页面](https:\u002F\u002Fykiwu.github.io\u002FEmbodiedOcc\u002F)\n- **Embodied Image Captioning**：用于空间连贯图像描述的自监督学习智能体 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.08531) [页面](https:\u002F\u002Fhsp-iit.github.io\u002Fembodied-captioning\u002F)\n\n## 基准与数据集\n- **VLABench**：一个大规模的、包含长 horizon 推理任务的语言条件机器人操作基准 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.18194) [页面](https:\u002F\u002Firanqin.github.io\u002Frobofactory\u002F)\n- **RoboFactory**：探索具有组合约束的具身智能体协作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.16408) [页面](https:\u002F\u002Fvlabench.github.io\u002F)\n- **HUMOTO**：一个包含动作捕捉的人类物体交互的4D数据集 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10414) [页面](https:\u002F\u002Fjiaxin-lu.github.io\u002Fhumoto\u002F)\n- **RoboMM**：用于机器人操作的一体化多模态大型模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.07215)\n- **MoMa-Kitchen**：一个超过10万条的基准，用于移动操作中基于可供性的最后一公里导航 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11081) [页面](https:\u002F\u002Fmomakitchen.github.io\u002F)\n- **RoboPearls**：用于机器人操作的可编辑视频仿真 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.22756) [页面](https:\u002F\u002Ftangtaogo.github.io\u002FRoboPearls\u002F)\n- **DexH2R**：一个人机交接中动态灵巧抓握的基准 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.23152) [页面](https:\u002F\u002Fdexh2r.github.io\u002F)\n- **超越目的地**：一个新颖的、面向探索的具身问答基准 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11117) 
[页面](https:\u002F\u002Fgithub.com\u002FHCPLab-SYSU\u002FEXPRESS-Bench)\n- **RobAVA**：一个大规模的数据集和基线，旨在实现基于视频的机械臂动作理解 [论文](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F1787)\n- **RoboAnnotatorX**：一个全面且通用的标注框架，用于准确理解长 horizon 的机器人演示 [论文](https:\u002F\u002Ficcv.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F2215)\n\n# ICML2025\n\n## 视觉-语言-行动模型\n- **Hi Robot**：使用分层视觉-语言-行动模型进行开放式指令遵循 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.19417)\n- **OTTER**：一种具有文本感知视觉特征提取的视觉-语言-行动模型 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.03734) [页面](https:\u002F\u002Fottervla.github.io\u002F)\n- **UP-VLA**：一个用于具身智能体的统一理解和预测模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.18867)\n- **ELEMENTAL**：通过演示和视觉-语言模型进行交互式学习，用于机器人奖励设计 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18825)\n- **ReinboT**：利用强化学习增强机器人视觉-语言操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07395)\n- **一个大型循环行动模型**：xLSTM 实现机器人任务的快速推理 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22391) [页面](https:\u002F\u002Fgithub.com\u002Fml-jku\u002FLRAM)\n\n## 规划与推理\n- 通过潜在空间逆向规划进行高效的机器人策略学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.06861) [页面](https:\u002F\u002Flbp-authors.github.io\u002F)\n- 通过平衡序列建模进行闭环长 horizon 机器人规划 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.01440) [页面](https:\u002F\u002Fgithub.com\u002FSingularity0104\u002Fequilibrium-planner)\n\n## 策略\n\n- **SAM2Act**：将视觉基础模型与记忆架构集成用于机器人操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.18564)\n- 使用4D表示进行自回归机器人模型的预训练 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.13142) [页面](https:\u002F\u002Farm4r.github.io\u002F)\n- 基于流的领域随机化用于学习和编排机器人技能 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.01800)\n- **EmbodiedBench**：面向视觉驱动具身智能体的多模态大型语言模型综合基准测试 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09560) [页面](https:\u002F\u002Fembodiedbench.github.io\u002F)\n- 在具有多样化任务的MDP中学习策略委员会以实现有效个性化 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01885)\n- 视频预测策略：一种具有预测性视觉表征的通用机器人策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14803) [页面](https:\u002F\u002Fvideo-prediction-policy.github.io\u002F)\n- **STAR**：通过旋转增强学习多样化的机器人技能抽象 [论文](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2506.03863) [页面](https:\u002F\u002Fgithub.com\u002FJiuTian-VL\u002FSTAR?tab=readme-ov-file)\n\n## 3D视觉\n- 统一2D和3D视觉-语言理解 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.10745) [页面](https:\u002F\u002Funivlg.github.io\u002F)\n- GAPrompt：面向3D视觉模型的几何感知点云提示 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.04119) [页面](https:\u002F\u002Fgithub.com\u002Fzhoujiahuan1991\u002FICML2025-GAPrompt)\n\n## 数据集\n- WOMD-Reasoning：用于驾驶中交互推理的大规模数据集 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.04281) [页面](https:\u002F\u002Fgithub.com\u002Fyhli123\u002FWOMD-Reasoning)\n\n# RSS2025\n\n- **统一世界模型**：将视频和动作扩散耦合用于大型机器人数据集上的预训练 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.02792) [页面](https:\u002F\u002Fweirdlabuw.github.io\u002Fuwm\u002F)\n- **CordViP**：基于对应关系的视动策略，用于现实世界中的灵巧操作 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.08449) [页面](https:\u002F\u002Faureleopku.github.io\u002FCordViP\u002F)\n- **反应式扩散策略**：用于接触密集型操作的慢速-快速视觉-触觉策略学习 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.02881) [页面](https:\u002F\u002Freactive-diffusion-policy.github.io\u002F)\n- 扩散策略中的动态秩调整，用于高效灵活的训练 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03822)\n- **SpatialVLA**：探索用于视觉-语言-行动模型的空间表征 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.15830)\n- **草图转技能**：利用人类绘制的轨迹草图启动机器人学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11918)\n- 
**NaVILA**：用于导航的足式机器人视觉-语言-行动模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04453) [页面](https:\u002F\u002Fnavila-bot.github.io\u002F)\n- **ConRFT**：通过一致性策略对VLA模型进行强化微调的方法 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.05450) [页面](https:\u002F\u002Fcccedric.github.io\u002Fconrft\u002F)\n- **你只教一次**：从视频演示中学习一次性双臂机器人操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.14208) [页面](https:\u002F\u002Fhnuzhy.github.io\u002Fprojects\u002FYOTO\u002F)\n- **ASAP**：对齐仿真与真实物理环境，以学习敏捷人形机器人全身技能 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01143) [页面](https:\u002F\u002Fagile.human2humanoid.com\u002F)\n- **飞行之手**：以末端执行器为中心的框架，用于多功能空中操作的遥操作和策略学习 [论文](https:\u002F\u002Flecar-lab.github.io\u002Fflying_hand\u002Fstatic\u002Fpdf\u002Fflying_hand.pdf) [页面](https:\u002F\u002Flecar-lab.github.io\u002Fflying_hand\u002F)\n- **DemoGen**：用于数据高效视动策略学习的合成演示生成 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.16932) [页面](https:\u002F\u002Fdemo-generation.github.io\u002F)\n- **DOGlove**：使用低成本开源触觉力反馈手套进行灵巧操作 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.07730) [页面](https:\u002F\u002Fdo-glove.github.io\u002F)\n- **RoboSplat**：采用高斯泼溅技术的新颖演示生成方法，可实现稳健的一次性操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.13175) [页面](https:\u002F\u002Fyangsizhe.github.io\u002Frobosplat\u002F)\n- 利用车载部署的大语言模型增强自动驾驶系统 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.11514)\n- **SATA**：受动物学习启发的安全且自适应的基于扭矩的运动策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12674) [视频](https:\u002F\u002Fyoutu.be\u002Fb1cpTq0Rc5w?si=sAd9y5LE2sWynu7v)\n- **FACTR**：面向接触密集型策略学习的力量导向课程训练 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17432) [页面](https:\u002F\u002Fjasonjzliu.com\u002Ffactr\u002F)\n- **RoboVerse**：迈向可扩展且通用的机器人学习统一平台、数据集和基准测试 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.18904) [页面](https:\u002F\u002Froboverseorg.github.io\u002F)\n- **STDArm**：将视动策略从静态数据训练转移到动态机器人操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.18792)\n\n  \n# CVPR2025\n\n## 视觉-语言-行动模型\n\n- **UniAct**：增强具身基础模型的通用行动 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.10105) [页面](https:\u002F\u002F2toinf.github.io\u002FUniAct\u002F)\n- **MoManipVLA**：用于通用移动操作的视觉-语言-行动模型迁移 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.13446) [页面](https:\u002F\u002Fgary3410.github.io\u002FmomanipVLA\u002F)\n- **CoT-VLA**：用于视觉-语言-行动模型的视觉思维链推理 [论文](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F33233)\n- **SOLAMI**：用于与3D自主角色沉浸式互动的社会视觉-语言-行动建模 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00174) [页面](https:\u002F\u002Fsolami-ai.github.io\u002F)\n- 以数据为中心重新审视用于机器人学习的预训练视觉模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.06960) [页面](https:\u002F\u002Fgithub.com\u002FCVMI-Lab\u002FSlotMIM)\n- **从小思考，大处行动**：用于终身机器人操作的原始提示学习\n- **凤凰**：一种基于运动的自我反思框架，用于精细的机器人动作修正 [论文](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F32789)\n- **OmniManip**：通过以物体为中心的交互原语作为空间约束，迈向通用机器人操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.03841) [页面](https:\u002F\u002Fomnimanip.github.io\u002F)\n- 缓解机器人操作视觉预训练中的“人-机器人”领域差异 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.14235)\n- 以物体为中心的提示驱动视觉-语言-行动模型，用于机器人操作 [摘要](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F34522)\n- 机器人视觉指令\n- **RoboGround**：基于 grounded 视觉-语言先验的机器人操作\n\n## 策略\n- **KStar Diffuser**: 基于运动学建模的时空图扩散策略，用于双臂机器人操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.10743)\n- **RoboPEPP**: 通过嵌入式预测性预训练进行基于视觉的机器人位姿与关节角估计 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.17662)\n- **Lift3D Policy**: 将2D基础模型提升至鲁棒的3D机器人操作 
[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18623) [页面](https:\u002F\u002Flift3d-web.github.io\u002F)\n- **PDFactor**: 学习多任务机器人操作的三视角视图策略扩散场 [摘要](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F33943)\n- **Two by Two**: 学习跨任务的成对物体装配，以实现可泛化的机器人操作\n- **FlowRAM**: 基于区域感知Mamba框架的流匹配策略，用于机器人操作 [摘要](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F33579)\n- **G3Flow**: 用于姿态感知且可泛化的物体操作的生成式3D语义流 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18369) [页面](https:\u002F\u002Ftianxingchen.github.io\u002FG3Flow\u002F)\n- **DexHandDiff**: 面向自适应灵巧操作的交互感知扩散规划 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18562) [页面](https:\u002F\u002Fdexdiffuser.github.io\u002F)\n- **AffordDP**: 具有可迁移效用性的可泛化扩散策略 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.03142) [页面](https:\u002F\u002Fafforddp.github.io\u002F)\n- **Tra-MoE**: 从多个领域学习轨迹预测模型，用于自适应策略条件化 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14519) [页面](https:\u002F\u002Fgithub.com\u002FMCG-NJU\u002FTra-MoE)\n\n## 抓取\n- **UniGraspTransformer**: 面向可扩展灵巧机器人抓取的简化策略蒸馏 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.02699) [页面](https:\u002F\u002Fdexhand.github.io\u002FUniGraspTransformer\u002F)\n- **DexGrasp Anything**: 基于物理感知的通用灵巧机器人抓取 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.08257) [页面](https:\u002F\u002Fgithub.com\u002F4DVLab\u002FDexGrasp-Anything)\n- **ZeroGrasp**: 支持零样本形状重建的机器人抓取 [论文](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F32440)\n\n## 类人机器人\n- 让类人机器人去徒步旅行！复杂地形下的综合技能开发 [论文](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F34565) [页面](https:\u002F\u002Flego-h-humanoidrobothiking.github.io\u002F)\n- **MobileH2R**: 仅利用可扩展且多样化的合成数据学习可泛化的从人类到移动机器人的交接 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04595)\n\n## 3D视觉\n- **3D-MVP**: 用于机器人操作的3D多视角预训练 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.18158) [页面](https:\u002F\u002Fjasonqsy.github.io\u002F3DMVP\u002F)\n- **VidBot**: 从野外2D人体视频中学习可泛化的3D动作，用于零样本机器人操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07135) [页面](https:\u002F\u002Fhanzhic.github.io\u002Fvidbot-project\u002F)\n- **Touch2Shape**: 基于触觉条件的3D扩散，用于形状探索与重建 [摘要](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F33415)\n\n## 规划与推理\n- **RoboBrain**: 从抽象到具体的一体化机器人操作大脑模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.21257)\n- **PhysVLM**: 使视觉语言模型能够理解机器人物理可达性 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.08481)\n- **RoboSpatial**: 向2D和3D视觉-语言模型教授空间理解能力，用于机器人技术 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.16537)\n- **Tartan IMU**: 用于机器人惯性定位的轻量级基础模型 [摘要](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F33873)\n- **代码即监控器**: 面向反应式与主动式机器人故障检测的约束感知视觉编程 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04455) [页面](https:\u002F\u002Fzhoues.github.io\u002FCode-as-Monitor\u002F)\n\n## 视频\n- **TASTE-Rob**: 推进面向任务的手物交互视频生成，以实现可泛化的机器人操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11423)\n- **GraphMimic**: 基于视频的图到图生成建模，用于策略学习 [论文](https:\u002F\u002Fcvpr.thecvf.com\u002Fvirtual\u002F2025\u002Fposter\u002F34942)\n\n## 模拟转真实与真实转模拟\n- **Prof. 
Robot**: 无静态与自碰撞的可微分机器人渲染 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11269) [页面](https:\u002F\u002Fwww.qrcat.cn\u002Fprof-robot\u002F)\n- **AutoURDF**: 使用聚类配准从点云帧中进行无监督的机器人建模 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.05507) [页面](https:\u002F\u002Fgithub.com\u002Fjl6017\u002FAutoURDF)\n\n## 基准与数据集\n- **RoboTwin**: 具有生成式数字孪生的双臂机器人基准（早期版本）[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.02920) [页面](https:\u002F\u002Frobotwin-benchmark.github.io\u002Fearly-version\u002F)\n- 用于机器人视觉的像素对齐RGB-NIR立体成像及数据集 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18025)\n- **RoboSense**: 大规模数据集与基准，用于拥挤且非结构化环境中的自我中心式机器人感知与导航 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.15503) [页面](https:\u002F\u002Fgithub.com\u002Fsuhaisheng\u002FRoboSense)\n\n# ICLR2025\n\n## 视觉-语言-行动模型\n- **LLaRA**: 为视觉-语言策略增强机器人学习数据 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.20095) [页面](https:\u002F\u002Fgithub.com\u002FLostXine\u002FLLaRA)\n- **VLAS**: 带有语音指令的视觉-语言-行动模型，用于定制化机器人操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13508) [页面](https:\u002F\u002Fgithub.com\u002Fwhichwhichgone\u002FVLAS)\n- **TraceVLA**: 视觉追踪提示增强了通用机器人策略的空间-时间意识 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.10345) [页面](https:\u002F\u002Ftracevla.github.io\u002F)\n- **机器人预训练机器人**: 基于大规模机器人数据集的操作导向型机器人表征 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22325) [页面](https:\u002F\u002Frobots-pretrain-robots.github.io\u002F)\n- **PIDM**: 预测性逆动力学模型是可扩展的机器人操作学习者 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15109) [页面](https:\u002F\u002Fnimolty.github.io\u002FSeer\u002F)\n\n## 策略\n\n- **GravMAD**: 基于地面空间值图引导的动作扩散模型，用于通用3D操作 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.20154) [主页](https:\u002F\u002Fgravmad.github.io\u002F)\n- **ReViWo**: 面向视觉机器人操作的视不变世界模型学习 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=vJwjWyt4Ed) [知乎](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F26181243574)\n- **HAMSTER**: 用于开放世界机器人操作的层次化动作模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.05485) [主页](https:\u002F\u002Fhamster-robot.github.io\u002F)\n- **BadRobot**: 在物理世界中越狱具身大语言模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.20242) [主页](https:\u002F\u002Fembodied-llms-safety.github.io\u002F)\n- **STRAP**: 用于增强型策略学习的机器人子轨迹检索 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15182) [主页](https:\u002F\u002Fweirdlabuw.github.io\u002Fstrap\u002F)\n- **SRSA**: 用于机器人装配任务的技能检索与适应 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.04538) [主页](https:\u002F\u002Fsrsa2024.github.io\u002F)\n- 机器人操作模仿学习中的数据规模法则 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18647) [主页](https:\u002F\u002Fdata-scaling-laws.github.io\u002F)\n- **Stem-OB**: 基于茎状收敛观测与扩散反演的可泛化视觉模仿学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04919) [主页](https:\u002F\u002Fhukz18.github.io\u002FStem-Ob\u002F)\n\n## 3D视觉\n- **Dream to Manipulate**: 组合式世界模型通过想象力赋能机器人模仿学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14957) [主页](https:\u002F\u002Fleobarcellona.github.io\u002FDreamToManipulate\u002F)\n- **SPA**: 3D空间感知能力实现有效的具身表征 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08208) [主页](https:\u002F\u002Fhaoyizhu.github.io\u002Fspa\u002F)\n\n## 规划与推理\n- **LASeR**: 基于大型语言模型实现多样化和可泛化的机器人设计 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=7mlvOHL6qJ) [主页](https:\u002F\u002Fgithub.com\u002FWoodySJR\u002FLASeR)\n- 面向机器人运动规划的物理信息约束时序差分度量学习 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=TOiageVNru) [主页](https:\u002F\u002Fgithub.com\u002Fruiqini\u002Fntrl-demo)\n- **AHA**: 用于检测和推理机器人操作失败的视觉-语言模型 
[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00371) [主页](https:\u002F\u002Faha-vlm.github.io\u002F)\n- **EMOS**: 具身感知的异构多机器人操作系统，配备LLM智能体 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22662) [主页](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22662)\n- **VisualPredicator**: 利用神经符号谓词学习抽象世界模型，用于机器人规划 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23156) [主页](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.23156)\n- **DenseMatcher**: 从单个演示中学习类别级操作的3D语义对应关系 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.05268) [主页](https:\u002F\u002Ftea-lab.github.io\u002FDenseMatcher\u002F)\n- 面向机器人操作的互联网视频中6D物体位姿跟踪 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.10307) [主页](https:\u002F\u002Fponimatkin.github.io\u002Ffreepose\u002F)\n\n## 规划与推理\n- 基于扩散模型的多机器人运动规划 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03072) [主页](https:\u002F\u002Fgithub.com\u002Fyoraish\u002Fmmd)\n\n## 视频\n- **GEVRM**: 面向鲁棒视觉操作的目标表达型视频生成模型 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09268)\n\n## Sim2real与Real2sim\n- **ReGen**: 通过逆向设计实现的生成式机器人仿真 [论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=EbCUbPZjM1) [主页](https:\u002F\u002Fregen-sim.github.io\u002F)\n\n# ICRA2025\n- **MoRE**: 解锁四足视觉-语言-动作模型强化学习的可扩展性 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.08007)\n- **QUART-Online**: 无延迟的大型多模态语言模型，用于四足机器人学习 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15576) [主页](https:\u002F\u002Fquart-online.github.io\u002F)\n- **SpatialBot**: 基于视觉语言模型的精确空间理解 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.13642) [主页](https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FSpatialBot)","# Embodied-AI-Paper-TopConf 快速上手指南\n\nEmbodied-AI-Paper-TopConf 是一个持续更新的具身智能（Embodied AI）顶会论文清单资源库，涵盖 NeurIPS、ICLR、CVPR、RSS、CORL 等顶级会议的最新研究成果。本项目主要提供分类整理的论文列表及链接，无需复杂的环境配置即可使用。\n\n## 环境准备\n\n本项目为静态资源列表，无特定的系统或依赖要求。只需具备以下条件：\n- **操作系统**：Windows \u002F macOS \u002F Linux 均可\n- **必要工具**：\n  - Web 浏览器（用于直接在线阅读）\n  - Git（可选，用于克隆仓库到本地）\n  - Markdown 阅读器（可选，用于本地查看 `.md` 文件）\n\n## 安装步骤\n\n你可以选择直接在线浏览或克隆到本地查看。\n\n### 方式一：在线浏览（推荐）\n直接访问 GitHub 仓库页面，点击对应的会议年份（如 `ICLR2026`、`NeurIPS2025`）即可查看分类论文列表。\n- **仓库地址**: [https:\u002F\u002Fgithub.com\u002FEmbodiedAI\u002FEmbodied-AI-Paper-TopConf](https:\u002F\u002Fgithub.com\u002FEmbodiedAI\u002FEmbodied-AI-Paper-TopConf) (请替换为实际仓库地址)\n\n### 方式二：本地克隆\n如果你希望离线查看或贡献内容，可以使用以下命令克隆仓库：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FEmbodiedAI\u002FEmbodied-AI-Paper-TopConf.git\ncd Embodied-AI-Paper-TopConf\n```\n\n> **提示**：国内用户若遇到克隆速度慢的问题，可使用 Gitee 镜像（如有）或通过代理加速。\n\n## 基本使用\n\n本项目核心功能是提供结构化的论文索引。以下是两种最常用的使用场景：\n\n### 1. 查找特定领域的最新论文\n在仓库根目录的 `README.md` 中，找到你关注的会议（例如 `ICLR2026`），点击进入对应的子文档（如 `ICLR\u002FICLR2026.md`）。文档内已按研究方向分类，例如：\n- **Vision-Language-Action Models** (视觉 - 语言 - 动作模型)\n- **World Models** (世界模型)\n- **Navigation** (导航)\n- **Humanoid** (人形机器人)\n\n**示例**：若想查找 ICLR2026 中关于“记忆机制”的 VLA 模型，可进入 `ICLR\u002FICLR2026.md`，定位到 `Vision-Language-Action Models` 章节，即可看到如 **MemoryVLA** 等论文标题及直达链接。\n\n### 2. 
追踪特定会议收录情况\n通过根目录的更新日志（Update Log）了解最新收录进度。例如：\n- `[03\u002F12\u002F2026] We are updating Embodied AI papers accepted by ICLR2026!`\n表示该日期正在更新 ICLR2026 的录用论文列表。\n\n点击论文标题后的 `[Paper]` 链接，即可跳转至 OpenReview 或其他官方页面阅读全文。\n\n---\n*注：本项目为纯资料整理库，不涉及代码运行或模型训练，因此无需执行额外的安装脚本或配置环境变量。*","某具身智能实验室的算法团队正筹备新一代人形机器人抓取项目，急需调研 2025 至 2026 年间顶级会议（如 ICLR、CoRL、RSS）中关于“视觉 - 语言 - 动作模型”与“灵巧操作”的最新突破。\n\n### 没有 Embodied-AI-Paper-TopConf 时\n- **检索效率极低**：研究人员需逐个访问 CVPR、ICRA、NeurIPS 等十个不同会议的官网，在海量论文中手动筛选与具身智能相关的条目，耗时数天。\n- **细分领域难定位**：即使找到会议列表，也难以快速区分哪些论文专攻“触觉反馈”或\"Sim2Real 迁移”，往往需要下载摘要甚至全文才能确认相关性。\n- **前沿动态滞后**：由于缺乏统一的更新机制，团队容易错过刚刚录用的 ICLR 2026 或 RSS 2025 最新成果，导致技术选型基于过时的 SOTA（当前最佳）基准。\n- **分类标准混乱**：不同会议对同一技术方向（如规划与推理）的归类不一致，增加了整理文献综述和对比实验的难度。\n\n### 使用 Embodied-AI-Paper-TopConf 后\n- **一站式聚合获取**：团队直接查看该仓库，即可在一个页面内获取从 ICLR 2026 到 ICRA 2025 所有顶会录用的具身 AI 论文清单，将调研时间从数天缩短至几小时。\n- **精准子领域导航**：利用仓库细致的分类标签（如\"Dexterous Manipulation\"、\"Tactile\"），研究员能瞬间锁定与人形机器人抓取直接相关的核心论文，如 MemoryVLA 等最新工作。\n- **实时同步前沿**：得益于仓库的主动维护（Active Maintenance），团队能第一时间发现并研读 2025 下半年及 2026 年初的最新录用论文，确保技术路线始终对标最前沿。\n- **结构化知识梳理**：统一的分类体系帮助团队快速构建技术地图，清晰地对比不同会议在“世界模型”或“策略学习”上的研究侧重，加速了实验方案的设计。\n\nEmbodied-AI-Paper-TopConf 通过高度结构化的顶会论文聚合，将繁琐的文献大海捞针转变为精准的技术情报获取，极大提升了具身智能研发的迭代速度。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FSongwxuan_Embodied-AI-Paper-TopConf_b257980d.png","Songwxuan","Wenxuan Song","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FSongwxuan_ce644ca9.jpg","Ph.D in The Hong Kong University of Science and Technology (Guangzhou)\r\nB.S. in Harbin Institute of Technology\r\nAAAI 2026 Best Paper Owner","The Hong Kong University of Science and Technology (Guangzhou)","Guangzhou",null,"https:\u002F\u002Fgithub.com\u002FSongwxuan",529,12,"2026-04-16T12:59:02","MIT","",{"notes":90,"python":88,"dependencies":91},"该项目仅为顶级会议（如 NeurIPS, CVPR, ICLR 等）具身智能（Embodied AI）论文的汇总列表和资源索引，并非可执行的软件工具或代码库。因此，它没有特定的操作系统、GPU、内存、Python 版本或依赖库要求。用户只需通过浏览器访问提供的论文链接即可阅读相关内容。",[],[18],"2026-03-27T02:49:30.150509","2026-04-18T11:11:41.407165",[],[]]