[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-ActiveVisionLab--Awesome-LLM-3D":3,"tool-ActiveVisionLab--Awesome-LLM-3D":65},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",159636,2,"2026-04-17T23:33:34",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":10,"last_commit_at":33,"category_tags":34,"status":16},8553,"spec-kit","github\u002Fspec-kit","Spec Kit 是一款专为提升软件开发效率而设计的开源工具包，旨在帮助团队快速落地“规格驱动开发”（Spec-Driven Development）模式。传统开发中，需求文档往往与代码实现脱节，导致沟通成本高且结果不可控；而 Spec Kit 通过将规格说明书转化为可执行的指令，让 AI 直接依据明确的业务场景生成高质量代码，从而减少从零开始的随意编码，确保产出结果的可预测性。\n\n该工具特别适合希望利用 AI 辅助编程的开发者、技术负责人及初创团队。无论是启动全新项目还是在现有工程中引入规范化流程，用户只需通过简单的命令行操作，即可初始化项目并集成主流的 AI 编程助手。其核心技术亮点在于“规格即代码”的理念，支持社区扩展与预设模板，允许用户根据特定技术栈定制开发流程。此外，Spec Kit 强调官方维护的安全性，提供稳定的版本管理，帮助开发者在享受 AI 红利的同时，依然牢牢掌握架构设计的主动权，真正实现从“凭感觉写代码”到“按规格建系统”的转变。",88749,"2026-04-17T09:48:14",[15,26,14,13],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":10,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,"2026-04-10T11:13:16",[26,51,52,53,14,54,15,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":62,"last_commit_at":63,"category_tags":64,"status":16},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[15,51,54],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":80,"owner_twitter":80,"owner_website":81,"owner_url":82,"languages":80,"stars":83,"forks":84,"last_commit_at":85,"license":86,"difficulty_score":62,"env_os":79,"env_gpu":87,"env_ram":87,"env_deps":88,"category_tags":91,"github_topics":80,"view_count":10,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":92,"updated_at":93,"faqs":94,"releases":95},8961,"ActiveVisionLab\u002FAwesome-LLM-3D","Awesome-LLM-3D","Awesome-LLM-3D: a curated list of Multi-modal Large Language Model in 3D world  Resources","Awesome-LLM-3D 是一个专注于整理“多模态大语言模型在 3D 世界应用”的精选资源库。随着人工智能从二维图像向三维空间延伸，研究人员面临着论文分散、技术路线繁杂的挑战。Awesome-LLM-3D 旨在解决这一痛点，它系统性地汇聚了全球前沿学术成果，涵盖 3D 理解、逻辑推理、内容生成以及具身智能代理等核心任务，同时也收录了 CLIP、SAM 等相关基础模型的重要研究，为从业者提供了一张清晰的领域全景图。\n\n该资源库特别适合 AI 研究人员、开发者以及对空间智能感兴趣的技术探索者使用。无论是希望快速追踪最新算法进展的学者，还是正在寻找 3D 项目灵感的工程师，都能在此高效获取经过筛选的高质量论文、代码链接及基准测试数据。其独特亮点在于不仅提供了静态列表，还持续更新包括综述论文和真实 3D 空间理解基准（如 Real-3DQA）在内的深度内容，并按时间顺序梳理技术演进脉络。通过 Awesome-LLM-3D，用户可以轻松把握大模型如何“步入”三维世界的前沿动态，加速相关技术的研发与创新。","\n# Awesome-LLM-3D [![Awesome](https:\u002F\u002Fawesome.re\u002Fbadge.svg)](https:\u002F\u002Fawesome.re) [![Maintenance](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FMaintained%3F-yes-green.svg)](https:\u002F\u002FGitHub.com\u002FNaereen\u002FStrapDown.js\u002Fgraphs\u002Fcommit-activity) [![PR's Welcome](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPRs-welcome-brightgreen.svg?style=flat)](http:\u002F\u002Fmakeapullrequest.com)  \u003Ca href=\"\" target='_blank'>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FActiveVisionLab_Awesome-LLM-3D_readme_0b3a1d550e6b.png\"> \u003C\u002Fa> [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2405.10255v2-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.10255v2)\n\n\u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FActiveVisionLab_Awesome-LLM-3D_readme_6b1f2d78a861.png\" width=\"100%\">\n\u003C\u002Fdiv>\n\n\n\n## 🏠 About\nHere is a curated list of papers about 3D-Related Tasks empowered by Large Language Models (LLMs). 
\nIt contains various tasks including 3D understanding, reasoning, generation, and embodied agents. Also, we include other Foundation Models (CLIP, SAM) for the whole picture of this area.\n\nThis is an active repository, you can watch for following the latest advances. If you find it useful, please kindly star ⭐ this repo and [cite](#citation) the paper.\n\n## 🔥 News\n- [2026-03-20] Our benchmark paper **Real-3DQA** is now available at ICLR 2026! Following our survey paper, we now release the benchmark paper on genuine 3D spatial understanding. [Project Page](https:\u002F\u002Freal-3dqa.github.io\u002F)\n- [2025-10-21] 📢 We have released the **second version** of our survey, updated to include literature up to **July 2025**:  \n👉 [*When LLMs Step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models*](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.10255v2)\n- [2024-05-16] Check out the first survey paper in the 3D-LLM domain: [When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.10255) \n- [2024-01-06] [Runsen Xu](https:\u002F\u002Frunsenxu.com\u002F) added chronological information and [Xianzheng Ma](https:\u002F\u002Fxianzhengma.github.io\u002F) reorganized it in Z-A order for better following the latest advances.\n- [2023-12-16] [Xianzheng Ma](https:\u002F\u002Fxianzhengma.github.io\u002F) and [Yash Bhalgat](https:\u002F\u002Fyashbhalgat.github.io\u002F) curated this list and published the first version;\n\n## Table of Contents\n\n- [Awesome-LLM-3D](#awesome-llm-3D)\n  - [3D Unified Understanding and Generation (LLM)](#3d-unified-understanding-and-generation-via-llm)\n  - [3D Understanding (LLM)](#3d-understanding-via-llm)\n  - [3D Understanding (other Foundation Models)](#3d-understanding-via-other-foundation-models)\n  - [3D Reasoning](#3d-reasoning)\n  - [3D Generation](#3d-generation)\n  - [3D Embodied Agent](#3d-embodied-agent)\n  - [3D Benchmarks](#3d-benchmarks)\n  - [Contributing](#contributing)\n\n\n## 3D Unified Understanding and Generation via LLM\n\n|  Date |       Keywords       |    Institute (first)   | Paper                                                                                                                                                                               | Publication | Others |\n| :-----: | :------------------: | :--------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------:\n| 2025-11-07 | Omni-View | PKU | [Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.07222) | ICLR 2026 | [github](https:\u002F\u002Fgithub.com\u002FAIDC-AI\u002FOmni-View) |\n| 2025-08-16 | UniUGG | FDU | [UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.11952) | ICLR 2026 | [github](https:\u002F\u002Fgithub.com\u002Ffudan-zvg\u002FUniUGG) |\n\n## 3D Understanding via LLM\n\n|  Date |       Keywords       |    Institute (first)   | Paper                                                                                                                                                                               | Publication | Others |\n| :-----: | :------------------: | :--------------: 
| :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------:\n| 2026-03-07 | 3D-RFT | BIGAI | [3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.04976) | Arxiv | [github](https:\u002F\u002Fgithub.com\u002F3D-RFT\u002F3D-RFT) |\n| 2025-12-05 | Fast Scenescript | Qualcomm \u002F UvA | [Fast SceneScript: Fast and Accurate Language-Based 3D Scene Understanding via Multi-Token Prediction](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.05597) | CVPR '26 | - |\n| 2025-11-27 | G\u003Csup>2\u003C\u002Fsup>VLM | Shanghai AI Lab | [G\u003Csup>2\u003C\u002Fsup>VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2511.21688) | Arxiv | [github](https:\u002F\u002Fgithub.com\u002FInternRobotics\u002FG2VLM) |\n| 2025-11-07 | Omni-View | PKU | [Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.07222) | ICLR 2026 | [github](https:\u002F\u002Fgithub.com\u002FAIDC-AI\u002FOmni-View) |\n| 2025-08-16 | UniUGG | FDU | [UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.11952) | ICLR 2026 | [github](https:\u002F\u002Fgithub.com\u002Ffudan-zvg\u002FUniUGG) |\n| 2025-07-31 | 3D-R1 | PKU | [3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23478) | Arxiv | [project](https:\u002F\u002Faigeeksgroup.github.io\u002F3D-R1\u002F) |\n| 2025-06-11 | LEO-VL | BIGAI | [LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.09935) | Arxiv | [project](https:\u002F\u002Fleo-vl.github.io\u002F) |\n| 2025-06-09 | SpatialLM | Manycore Tech \u002F HKUST | [SpatialLM: Training Large Language Models for Structured Indoor Modeling](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.07491) | NeurIPS '25 | [project](https:\u002F\u002Fmanycore-research.github.io\u002FSpatialLM\u002F) |\n| 2025-06-02 | 3DRS | HKU | [MLLMs Need 3D-Aware Representation Supervision for Scene Understanding](https:\u002F\u002Farxiv.org\u002Fhtml\u002F2506.01946v1) | Arxiv | [project](https:\u002F\u002Fvisual-ai.github.io\u002F3drs\u002F) |\n| 2025-05-30 | VG LLM | CUHK | [Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24625) | Arxiv | [project](https:\u002F\u002Flavi-lab.github.io\u002FVG-LLM\u002F) |\n| 2025-05-29 | Spatial-MLLM | THU | [Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.23747) | Arxiv | [project](https:\u002F\u002Fdiankun-wu.github.io\u002FSpatial-MLLM\u002F) |\n| 2025-05-28 | SeeGround | HKUST(GZ) | [Zero-Shot 3D Visual Grounding from Vision-Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22429) | CVPRW'25 | [project](https:\u002F\u002Fseeground.github.io) |\n| 2025-05-28 | 3DLLM-Mem | UCLA\u002FGoogle | [3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22657) | NeurIPS'25 | [project](https:\u002F\u002F3dllm-mem.github.io\u002F) |\n| 2025-04-24 | 3D-LLaVA | U of Adelaide | 
[3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.01163) | CVPR '25 | [github](https:\u002F\u002Fgithub.com\u002Fdjiajunustc\u002F3D-LLaVA) |\n| 2025-04-03 | Ross3D | CASIA| [Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.01901) | Arxiv | [project](https:\u002F\u002Fhaochen-wang409.github.io\u002Fross3d\u002F) |\n| 2025-03-08 | SplatTalk | GIT| [SplatTalk: 3D VQA with Gaussian Splatting](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.06271) | Arxiv | [github]() |\n| 2025-03-01 | Inst3D-LMM | ZJU| [Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.00513) | CVPR '25 | [github](https:\u002F\u002Fgithub.com\u002Fhanxunyu\u002FInst3D-LMM) |\n| 2025-02-13 | ENEL | SH AILab | [ENEL: Exploring the Potential of Encoder-free Architectures in 3D LMMs](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.09620v1) | Arxiv | [project](https:\u002F\u002Fgithub.com\u002FIvan-Tang-3D\u002FENEL\u002Ftree\u002Fmain?tab=readme-ov-file) |\n| 2025-02-02 | LSceneLLM | SCUT| [LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.01292) | CVPR '25 | [project](https:\u002F\u002Fgithub.com\u002FHoyyyaard\u002FLSceneLLM) |\n| 2025-01-02 | GPT4Scene | HKU | [GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.01428) | Arxiv | [project](https:\u002F\u002Fgpt4scene.github.io\u002F) |\n| 2024-12-05 | SeeGround | HKUST(GZ) | [SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04383) | CVPR '25 | [project](https:\u002F\u002Fseeground.github.io) |\n| 2024-12-03 | Video-3D LLM | CUHK | [Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00493) | CVPR '25 | [project](https:\u002F\u002Fgithub.com\u002FLaVi-Lab\u002FVideo-3D-LLM) |\n| 2024-11-29 | PerLA | Fondazione Bruno Kessler | [PerLA: Perceptive 3D Language Assistant](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.19774) | CVPR '25 | [project](https:\u002F\u002Fgfmei.github.io\u002FPerLA\u002F) |\n| 2024-10-12 | Situation3D | UIUC | [Situational Awareness Matters in 3D Vision Language Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07544) | CVPR '24 | [project](https:\u002F\u002Fyunzeman.github.io\u002Fsituation3d\u002F) |\n| 2024-09-30 | Robin3D | HKU | [Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00255) | ICCV '25 | [github](https:\u002F\u002Fgithub.com\u002FWeitaiKang\u002FRobin3D) |\n| 2024-09-28 | LLaVA-3D | HKU | [LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.18125) | Arxiv | [project](https:\u002F\u002Fzcmax.github.io\u002Fprojects\u002FLLaVA-3D\u002F) |\n| 2024-09-08 | MSR3D | BIGAI | [Multi-modal Situated Reasoning in 3D Scenes](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.02389) | NeurIPS '24| [project](https:\u002F\u002Fmsr3d.github.io\u002F) |\n| 2024-08-28 | GreenPLM | HUST | [ More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding]( https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.15966) | Arxiv| [github](https:\u002F\u002Fgithub.com\u002FTangYuan96\u002FGreenPLM) 
|\n| 2024-06-17 | LLaNA | UniBO | [LLaNA: Large Language and NeRF Assistant](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.11840)| NeurIPS '24 | [project](https:\u002F\u002Fandreamaduzzi.github.io\u002Fllana\u002F) |\n| 2024-06-07  | SpatialPIN           | Oxford                 | [SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.13438)                         | NeurIPS '24       | [project](https:\u002F\u002Fdannymcy.github.io\u002Fzeroshot_task_hallucination\u002F) |\n| 2024-06-03 | SpatialRGPT | UCSD | [SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.01584) | NeurIPS '24 | [github](https:\u002F\u002Fgithub.com\u002FAnjieCheng\u002FSpatialRGPT) |\n| 2024-05-02 | MiniGPT-3D | HUST| [MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.01413)| ACM MM '24 | [project](https:\u002F\u002Ftangyuan96.github.io\u002Fminigpt_3d_project_page\u002F) |\n| 2024-03-19 | Scenescript | Meta | [SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.13064)| ECCV '24 | [project](https:\u002F\u002Fwww.projectaria.com\u002Fscenescript\u002F) |\n| 2024-02-27 |  ShapeLLM |    XJTU  | [ShapeLLM: Universal 3D Object Understanding for Embodied Interaction](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.17766)                                                                                | Arxiv | [project](https:\u002F\u002Fqizekun.github.io\u002Fshapellm\u002F) |\n| 2024-01-22  | SpatialVLM           | Google DeepMind        | [SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.12168)                                                                  | CVPR '24    | [project](https:\u002F\u002Fspatial-vlm.github.io\u002F) |\n| 2023-12-21 |  LiDAR-LLM |    PKU  | [LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.14074.pdf)                                                                                | Arxiv | [project](https:\u002F\u002Fsites.google.com\u002Fview\u002Flidar-llm) |\n| 2023-12-15 |  3DAP |    Shanghai AI Lab  | [3DAxiesPrompts: Unleashing the 3D Spatial Task Capabilities of GPT-4V](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.09738.pdf)                                                                                | Arxiv | [project]() |\n| 2023-12-13 |  Chat-Scene |    ZJU | [Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.08168.pdf)                                                                                | NeurIPS '24 | [github](https:\u002F\u002Fgithub.com\u002FZzZZCHS\u002FChat-Scene) |\n| 2023-12-5 | GPT4Point | HKU | [GPT4Point: A Unified Framework for Point-Language Understanding and Generation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02980.pdf) |Arxiv |  [github](https:\u002F\u002Fgithub.com\u002FPointcept\u002FGPT4Point) |\n| 2023-11-30 |         LL3DA        |     Fudan University    | [LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.18651.pdf)                                                       
                                           |Arxiv|  [github](https:\u002F\u002Fgithub.com\u002FOpen3DA\u002FLL3DA) |\n| 2023-11-26 | ZSVG3D | CUHK(SZ) | [Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.15383.pdf) | Arxiv | [project](https:\u002F\u002Fcurryyuan.github.io\u002FZSVG3D\u002F) | Arxiv | \n| 2023-11-18 |          LEO          |      BIGAI      | [An Embodied Generalist Agent in 3D World](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.12871.pdf)                                                           |    ICML '24  |  [github](https:\u002F\u002Fgithub.com\u002Fembodied-generalist\u002Fembodied-generalist) |\n| 2023-10-14 | JM3D-LLM | Xiamen University | [JM3D & JM3D-LLM: Elevating 3D Representation with Joint Multi-modal Cues](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.09503v2.pdf)                                        | ACM MM '23 |  [github](https:\u002F\u002Fgithub.com\u002Fmr-neko\u002Fjm3d) |\n| 2023-10-10 |  Uni3D |    BAAI  | [Uni3D: Exploring Unified 3D Representation at Scale](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06773)                                                                                | ICLR '24 | [project](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FUni3D) |\n| 2023-9-27 |  - |    KAUST  | [Zero-Shot 3D Shape Correspondence](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.03253)                                                                                | Siggraph Asia '23 | - |\n| 2023-9-21|       LLM-Grounder       |      U-Mich      | [LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.12311.pdf)              |     ICRA '24     |  [github](https:\u002F\u002Fgithub.com\u002Fsled-group\u002Fchat-with-nerf) |\n| 2023-9-1 |        Point-Bind       |      CUHK     | [Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.00615.pdf)             |  Arxiv   |  [github](https:\u002F\u002Fgithub.com\u002FZiyuGuo99\u002FPoint-Bind_Point-LLM) |\n| 2023-8-31 |         PointLLM         |      CUHK      | [PointLLM: Empowering Large Language Models to Understand Point Clouds](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16911.pdf)                                                                             |   ECCV '24 |  [github](https:\u002F\u002Fgithub.com\u002FOpenRobotLab\u002FPointLLM) |\n| 2023-8-17|     Chat-3D     |      ZJU     | [Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.08769v1.pdf)                                                          |  Arxiv      |  [github](https:\u002F\u002Fgithub.com\u002FChat-3D\u002FChat-3D)|\n| 2023-8-8 |         3D-VisTA          |      BIGAI      | [3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment](https:\u002F\u002FArxiv.org\u002Fabs\u002F2308.04352)                                                           |    ICCV '23  | [github]() |\n| 2023-7-24 |     3D-LLM     |      UCLA    | [3D-LLM: Injecting the 3D World into Large Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.12981.pdf)                                                                                                                      |   NeurIPS '23|  [github](https:\u002F\u002Fgithub.com\u002FUMass-Foundation-Model\u002F3D-LLM) |\n| 
2023-3-29 |       ViewRefer       |      CUHK      | [ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.16894.pdf)                                                                                               |ICCV '23 |[github](https:\u002F\u002Fgithub.com\u002FIvan-Tang-3D\u002FViewRefer3D) |\n| 2022-9-12 |        -        |      MIT      | [Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05629.pdf)                                                                      |Arxiv|  [github](https:\u002F\u002Fgithub.com\u002FMIT-SPARK\u002Fllm_scene_understanding) |\n\n\n## 3D Understanding via other Foundation Models\n|  ID |       keywords       |    Institute (first)    | Paper                                                                                                                                                                               | Publication | Others |\n| :-----: | :------------------: | :--------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------: \n| 2026-03-21 | OV3D-CG | ICT, CAS | [OV3D-CG: Open-vocabulary 3D Instance Segmentation with Contextual Guidance](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2025\u002Fhtml\u002FZhou_OV3D-CG_Open-vocabulary_3D_Instance_Segmentation_with_Contextual_Guidance_ICCV_2025_paper.html) | ICCV '2025 | [github](https:\u002F\u002Fgithub.com\u002FVIPL-VSU\u002FOV3D-CG) |\n| 2025-11-20 | POMA-3D | Imperial | [POMA-3D: The Point Map Way to 3D Scene Understanding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.16567) | Arxiv | [project](https:\u002F\u002Fmatchlab-imperial.github.io\u002Fpoma3d\u002F) |\n| 2025-07-26 |  OV-3DDet |    HKUST  | [CoDAv2: Collaborative Novel Object Discovery and Box-Guided Cross-Modal Alignment for Open-Vocabulary 3D Object Detection](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.00830)                                                                                | TPAMI '25 | [github](https:\u002F\u002Fgithub.com\u002Fyangcaoai\u002FCoDA_NeurIPS2023) |\n| 2025-02-20 |  CrossOver |  Stanford | [CrossOver: 3D Scene Cross-Modal Alignment](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.15011) | CVPR '25 | [project](https:\u002F\u002Fsayands.github.io\u002Fcrossover\u002F) |\n| 2025-02-05 |  SAGA |  SJTU | [Segment Any 3D Gaussians](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.00860) | AAAI '25 | [project](https:\u002F\u002Fjumpat.github.io\u002FSAGA\u002F) |\n| 2024-10-12 |  Lexicon3D |    UIUC  | [Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.03757) | NeurIPS '24 | [project](https:\u002F\u002Fyunzeman.github.io\u002Flexicon3d\u002F) |\n| 2024-10-07 |  Diff2Scene |    CMU  | [Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.13642) | ECCV'24 | [project](https:\u002F\u002Fdiff2scene.github.io\u002F) |\n| 2024-07-19 |  OpenSU3D |  TUM  | [OpenSU3D: Open World 3D Scene Understanding using Foundation Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.03757) | ICRA '25 | [project](https:\u002F\u002Fopensu3d.github.io\u002F) |\n| 2024-04-07 |  Any2Point |    Shanghai AI Lab  | [Any2Point: Empowering Any-modality Large Models for Efficient 3D 
Understanding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.07989) | ECCV'24 | [github](https:\u002F\u002Fgithub.com\u002FIvan-Tang-3D\u002FAny2Point) |\n| 2024-03-16 |  N2F2 |    Oxford-VGG  | [N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.10997.pdf) | Arxiv | - |\n| 2023-12-17 |  SAI3D |    PKU  | [SAI3D: Segment Any Instance in 3D Scenes](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.11557.pdf)                                                                                | Arxiv | [project](https:\u002F\u002Fyd-yin.github.io\u002FSAI3D) |\n| 2023-12-17 |  Open3DIS |    VinAI  | [Open3DIS: Open-vocabulary 3D Instance Segmentation with 2D Mask Guidance](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.10671.pdf)                                                                                | Arxiv | [project](https:\u002F\u002Fopen3dis.github.io\u002F) |\n| 2023-11-6 |  OVIR-3D |    Rutgers University  | [OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.02873.pdf) | CoRL '23 | [github](https:\u002F\u002Fgithub.com\u002Fshiyoung77\u002FOVIR-3D\u002F) |\n| 2023-10-29|  OpenMask3D |    ETH  | [OpenMask3D: Open-Vocabulary 3D Instance Segmentation](https:\u002F\u002Fopenmask3d.github.io\u002Fstatic\u002Fpdf\u002Fopenmask3d.pdf)                                                                                | NeurIPS '23 | [project](https:\u002F\u002Fopenmask3d.github.io\u002F) |\n| 2023-10-5 |     Open-Fusion     |      -     | [Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.03923.pdf)                                                                            |Arxiv|  [github](https:\u002F\u002Fgithub.com\u002FUARK-AICV\u002FOpenFusion) |\n| 2023-9-22 |  OV-3DDet |    HKUST  | [CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.02960.pdf)                                                                                | NeurIPS '23 | [github](https:\u002F\u002Fgithub.com\u002Fyangcaoai\u002FCoDA_NeurIPS2023) |\n| 2023-9-19 | LAMP |      -      | [From Language to 3D Worlds: Adapting Language Model for Point Cloud Perception](https:\u002F\u002Fopenreview.net\u002Fforum?id=H49g8rRIiF)                                                              |    OpenReview     | - |\n| 2023-9-15 |  OpenNerf |    -    | [OpenNerf: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views](https:\u002F\u002Fopenreview.net\u002Fpdf?id=SgjAojPKb3)                                                                                | OpenReview | [github]() |\n| 2023-9-1|  OpenIns3D |    Cambridge  | [OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.00616.pdf)                                                                                | Arxiv | [project](https:\u002F\u002Fzheninghuang.github.io\u002FOpenIns3D\u002F) |\n| 2023-6-7 |         Contrastive Lift         |     Oxford-VGG     | [Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.04633.pdf)                                                                                        |   NeurIPS '23| 
[github](https:\u002F\u002Fgithub.com\u002Fyashbhalgat\u002FContrastive-Lift) |\n| 2023-6-4 |  Multi-CLIP |    ETH  | [Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.02329.pdf)                                                                                | Arxiv | - |\n| 2023-5-23 |  3D-OVS |    NTU  | [Weakly Supervised 3D Open-vocabulary Segmentation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14093.pdf)                                                                                | NeurIPS '23 | [github](https:\u002F\u002Fgithub.com\u002FKunhao-Liu\u002F3D-OVS) |\n| 2023-5-21 |  VL-Fields |    University of Edinburgh  | [VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.12427.pdf)                                                                                | ICRA '23 | [project](https:\u002F\u002Ftsagkas.github.io\u002Fvl-fields\u002F)  |\n| 2023-5-8 |  CLIP-FO3D |    Tsinghua University  | [CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.04748.pdf)                                                                                | ICCVW '23 | - |\n| 2023-4-12 |  3D-VQA |    ETH  | [CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.06061.pdf)                                                                                | CVPRW '23 | [github](https:\u002F\u002Fgithub.com\u002FAlexDelitzas\u002F3D-VQA) |\n| 2023-4-3 |  RegionPLC |    HKU | [RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.00962.pdf)                                                                                | Arxiv | [project](https:\u002F\u002Fjihanyang.github.io\u002Fprojects\u002FRegionPLC) |\n| 2023-3-20 |        CG3D        |      JHU      | [CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.11313.pdf)                                                                                                 |Arxiv|  [github](https:\u002F\u002Fgithub.com\u002Fdeeptibhegde\u002FCLIP-goes-3D) |\n| 2023-3-16 | LERF |     UC Berkeley     | [LERF: Language Embedded Radiance Fields](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.09553.pdf)                                                   | ICCV '23   | [github](https:\u002F\u002Fgithub.com\u002Fkerrj\u002Flerf) |\n| 2023-2-14 |  ConceptFusion |    MIT  | [ConceptFusion: Open-set Multimodal 3D Mapping](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07241.pdf)                                                                                | RSS '23 | [project](https:\u002F\u002Fconcept-fusion.github.io\u002F) |\n| 2023-1-12 |         CLIP2Scene         |      HKU      | [CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04926.pdf)                                                                                        |    CVPR '23 | [github](https:\u002F\u002Fgithub.com\u002Frunnanchen\u002FCLIP2Scene) |\n| 2022-12-1 |         UniT3D         |      TUM     | [UniT3D: A Unified Transformer for 3D Dense Captioning and Visual 
Grounding](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2023\u002Fpapers\u002FChen_UniT3D_A_Unified_Transformer_for_3D_Dense_Captioning_and_Visual_ICCV_2023_paper.pdf)                                                                          |   ICCV '23| [github]() |\n| 2022-11-29 |        PLA        |     HKU    | [PLA: Language-Driven Open-Vocabulary 3D Scene Understanding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.16312.pdf)                                                                 |CVPR '23|  [github](https:\u002F\u002Fgithub.com\u002FCVMI-Lab\u002FPLA) |\n| 2022-11-28 |       OpenScene       |      ETHz      | [OpenScene: 3D Scene Understanding with Open Vocabularies](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.15654.pdf)                                                             |   CVPR '23  | [github](https:\u002F\u002Fgithub.com\u002Fpengsongyou\u002Fopenscene) |\n| 2022-10-11 |  CLIP-Fields |    NYU  | [CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05663.pdf)                                                                                | Arxiv | [project](https:\u002F\u002Fmahis.life\u002Fclip-fields\u002F) |\n| 2022-7-23 |  Semantic Abstraction |    Columbia  | [Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11514.pdf)                                                                                | CoRL '22 | [project](https:\u002F\u002Fsemantic-abstraction.cs.columbia.edu\u002F) |\n| 2022-4-26 |   ScanNet200 |    TUM  | [Language-Grounded Indoor 3D Semantic Segmentation in the Wild](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07761.pdf)                                                                                | ECCV '22 | [project](https:\u002F\u002Frozdavid.github.io\u002Fscannet200) |\n\n\n\n\n## 3D Reasoning\n|  Date |       keywords       |    Institute (first)    | Paper                                                                                                                                                                               | Publication | Others |\n| :-----: | :------------------: | :--------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------: |\n| 2025-12-15 | RoboTracer | BUAA | [RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.13660) | Arxiv | [project](https:\u002F\u002Fzhoues.github.io\u002FRoboTracer\u002F) |\n| 2025-06-11 | SceneCOT | BIGAI | [SceneCOT: Eliciting Chain-of-Thought Reasoning in 3D Scenes](https:\u002F\u002Fscenecot.github.io\u002F) | Arxiv | [project](https:\u002F\u002Fscenecot.github.io\u002F) |\n| 2025-06-04 | RoboRefer | BUAA | [RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04308) | Arxiv | [project](https:\u002F\u002Fzhoues.github.io\u002FRoboRefer\u002F) |\n| 2024-09-08 | MSR3D | BIGAI | [Multi-modal Situated Reasoning in 3D Scenes](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.02389) | NeurIPS '24| [project](https:\u002F\u002Fmsr3d.github.io\u002F) |\n| 2023-5-20|       3D-CLR      |      UCLA     | [3D Concept Learning and Reasoning from Multi-View 
Images](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.11327.pdf)                                                 |   CVPR '23  | [github](https:\u002F\u002Fgithub.com\u002Fevelinehong\u002F3D-CLR-Official) |\n| - |         Transcribe3D         |     TTI, Chicago     | [Transcribe3D: Grounding LLMs Using Transcribed Information for 3D Referential Reasoning with Self-Corrected Finetuning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=7j3sdUZMTF)                                                                                                  |CoRL '23|  [github]() |\n\n\n## 3D Generation\n|  Date |       keywords       |    Institute    | Paper                                                                                                                                                                               | Publication | Others |\n| :-----: | :------------------: | :--------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------: \n| 2025-11-07 | Omni-View | PKU | [Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.07222) | ICLR 2026 | [github](https:\u002F\u002Fgithub.com\u002FAIDC-AI\u002FOmni-View) |\n| 2025-08-16 | UniUGG | FDU | [UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.11952) | ICLR 2026 | [github](https:\u002F\u002Fgithub.com\u002Ffudan-zvg\u002FUniUGG) |\n| 2024-11-14 |         LLaMA-Mesh         |     THU     | [LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models](https:\u002F\u002Farxiv.org\u002Fhtml\u002F2411.09595v1)                                                                                                  |Arxiv|  [project](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Ftoronto-ai\u002FLLaMA-Mesh\u002F) |  \n| 2023-11-29 |         ShapeGPT         |     Fudan University     | [ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.17618.pdf)                                                                                                  |Arxiv|  [github](https:\u002F\u002Fgithub.com\u002FOpenShapeLab\u002FShapeGPT) |                                                                                              | Arxiv  |  [github]() |\n| 2023-11-27|         MeshGPT         |     TUM     | [MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.15475.pdf)                                                                                                  |Arxiv |  [project](https:\u002F\u002Fnihalsid.github.io\u002Fmesh-gpt\u002F) |\n| 2023-10-19 |         3D-GPT        |     ANU   | [3D-GPT: Procedural 3D Modeling with Large Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.12945.pdf)                                                                                                   |Arxiv|  [github](https:\u002F\u002Fdreamllm.github.io\u002F) |\n| 2023-9-21 |         LLMR         |     MIT     | [LLMR: Real-time Prompting of Interactive Worlds using Large Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.12276.pdf)                                                                                                  |Arxiv| - |\n| 2023-9-20 |         DreamLLM        
 |     MEGVII    | [DreamLLM: Synergistic Multimodal Comprehension and Creation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.11499.pdf) | Arxiv | [github](https:\u002F\u002Fgithub.com\u002FRunpeiDong\u002FDreamLLM)\n| 2023-4-1 |      ChatAvatar      |       Deemos Tech            | [DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3592094)                                               |  ACM TOG    | [website](https:\u002F\u002Fhyperhuman.deemos.com\u002F) |\n\n## 3D Embodied Agent\n|  Date |       keywords       |    Institute   | Paper                                                                                                                                                                               | Publication | Others |\n| :-----: | :------------------: | :--------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------: |\n| 2025-12-15 | RoboTracer | BUAA | [RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.13660) | Arxiv | [project](https:\u002F\u002Fzhoues.github.io\u002FRoboTracer\u002F) |\n| 2025-06-04 | RoboRefer | BUAA | [RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04308) | Arxiv | [project](https:\u002F\u002Fzhoues.github.io\u002FRoboRefer\u002F) |\n| 2025-05-30 | VeBrain | Shanghai AI Lab | [Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.00123v1) | Arxiv | [project](https:\u002F\u002Finternvl.github.io\u002Fblog\u002F2025-05-26-VeBrain\u002F) |\n| 2025-05-28 | 3DLLM-Mem | UCLA | [3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22657) | Arxiv | [project](https:\u002F\u002F3dllm-mem.github.io\u002F) |\n| 2024-01-22 |  SpatialVLM |    Deepmind  | [SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.12168)                                                                                | CVPR '24 | [project](https:\u002F\u002Fspatial-vlm.github.io\u002F) |\n| 2023-12-05 | NaviLLM | CUHK | [Towards Learning a Generalist Model for Embodied Navigation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.02010) | CVPR '24 | [project](https:\u002F\u002Fgithub.com\u002Fzd11024\u002FNaviLLM) |\n| 2023-11-27 | Dobb-E | NYU | [On Bringing Robots Home](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16098.pdf)        |    Arxiv  |  [github](https:\u002F\u002Fgithub.com\u002Fnotmahi\u002Fdobb-e) |\n| 2023-11-26 | STEVE | ZJU | [See and Think: Embodied Agent in Virtual Environment](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15209) | Arxiv | [github](https:\u002F\u002Fgithub.com\u002Frese1f\u002FSTEVE) |\n| 2023-11-18 | LEO  |   BIGAI  | [An Embodied Generalist Agent in 3D World](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.12871.pdf)   |    ICML '24  |  [github](https:\u002F\u002Fgithub.com\u002Fembodied-generalist\u002Fembodied-generalist) |\n| 2023-9-14 |        UniHSI      |      Shanghai AI Lab     | [Unified Human-Scene Interaction via Prompted 
Chain-of-Contacts](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.07918.pdf)                                                                                                 |   Arxiv |  [github](https:\u002F\u002Fgithub.com\u002FOpenRobotLab\u002FUniHSI) |\n| 2023-7-28 |         RT-2         |     Google-DeepMind     | [RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.15818.pdf)                                                                                                  |Arxiv|  [github](https:\u002F\u002Frobotics-transformer2.github.io\u002F) |\n| 2023-7-12 |         SayPlan        |     QUT Centre for Robotics    | [SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.06135.pdf)                                                                                                  |CoRL '23|  [github](https:\u002F\u002Fsayplan.github.io\u002F) |\n| 2023-7-12 |          VoxPoser          |      Stanford      | [VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models](https:\u002F\u002Fvoxposer.github.io\u002Fvoxposer.pdf)                                                           |    Arxiv  |  [github](https:\u002F\u002Fgithub.com\u002Fhuangwl18\u002FVoxPoser) |\n| 2022-12-13|         RT-1         |     Google     | [RT-1: Robotics Transformer for Real-World Control at Scale](https:\u002F\u002Frobotics-transformer1.github.io\u002Fassets\u002Frt1.pdf)                                                                                                  |Arxiv|  [github](https:\u002F\u002Frobotics-transformer1.github.io\u002F) |\n| 2022-12-8 |         LLM-Planner         |     The Ohio State University    | [LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04088.pdf)                                                                                                  |ICCV '23|  [github](https:\u002F\u002Fgithub.com\u002FOSU-NLP-Group\u002FLLM-Planner\u002F) |\n| 2022-10-11 |          CLIP-Fields          |      NYU, Meta      | [CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05663.pdf)                                                           |    RSS '23  |  [github](https:\u002F\u002Fgithub.com\u002Fnotmahi\u002Fclip-fields) |\n| 2022-09-20|       NLMap-SayCan       |     Google     | [Open-vocabulary Queryable Scene Representations for Real World Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.09874)                                                                    | ICRA '23|  [github](https:\u002F\u002Fnlmap-saycan.github.io\u002F) |\n\n## 3D Benchmarks\n|  Date |       keywords       |    Institute    | Paper                                                                                                                                                                               | Publication | Others |\n| :-----: | :------------------: | :--------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------: |\n| 2026-01-22 | Real-3DQA | VGG, Oxford | [Do 3D Large Language Models Really Understand 3D Spatial Relationships?](https:\u002F\u002Fopenreview.net\u002Fforum?id=3vlMiJwo8b) | ICLR 2026 
| [project](https:\u002F\u002Freal-3dqa.github.io\u002F) |\n| 2025-12-15 | RoboTracer | BUAA | [RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.13660) | Arxiv | [project](https:\u002F\u002Fzhoues.github.io\u002FRoboTracer\u002F) |\n| :-----: | :------------------: | :--------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------: \n| 2025-11-28 | ORS3D | HUST | [Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.19430) | AAAI ‘26 | [project](https:\u002F\u002Fh-embodvis.github.io\u002FGRANT\u002F) |\n| 2025-06-04 | RoboRefer | BUAA | [RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04308) | Arxiv | [project](https:\u002F\u002Fzhoues.github.io\u002FRoboRefer\u002F) |\n| 2025-06-09 | SpaCE-10 | SJTU | [SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.07966) | Arxiv | [project](https:\u002F\u002Fgithub.com\u002FVisionXLab\u002FSpaCE-10) |\n| 2025-05-01 | Hypo3D | Imperial | [Hypo3D: Exploring Hypothetical Reasoning in 3D](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.00954) | ICML'25| [project](https:\u002F\u002Fgithub.com\u002FMatchLab-Imperial\u002FHypo3D) |\n| 2025-06-04 | Anywhere3D | BIGAI | [From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04897) | NeurIPS '25 | [project](https:\u002F\u002Fanywhere-3d.github.io\u002F) |\n| 2025-05-01 | SpatialVQA | JHU | [SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00788) | CVPR'25| [project]() |\n| 2025-04-03 | SPAR | Fudan University | [From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.22976) | Arxiv| [project](https:\u002F\u002Ffudan-zvg.github.io\u002Fspar) |\n| 2025-03-28 | Beacon3D | BIGAI | [Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.22420) | CVPR '25| [project](https:\u002F\u002Fbeacon-3d.github.io) |\n| 2025-03-08 | 3D-CoT | PolyU, EIT | [Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.06232) | Arxiv | [dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBattam\u002F3D-CoT) |\n| 2024-09-08 | MSQA \u002F MSNN | BIGAI | [Multi-modal Situated Reasoning in 3D Scenes](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.02389) | NeurIPS '24| [project](https:\u002F\u002Fmsr3d.github.io\u002F) |\n| 2024-08-29 | Space3D-Bench | ETHz | [Space3D-Bench: Spatial 3D Question Answering Benchmark](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.16662) | Arxiv | [project](https:\u002F\u002Fspace3d-bench.github.io\u002F) |\n| 2024-07-24 | City-3DQA | HKUST | [3D Question Answering for City Scene Understanding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.17398) | ACM MM'24 | [project](https:\u002F\u002Fsites.google.com\u002Fview\u002Fcity3dqa\u002F?pli=1) |\n| 
2024-06-13 | MMScan | Shanghai AI Lab | [MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.09401) | Arxiv | [github](https:\u002F\u002Fgithub.com\u002FOpenRobotLab\u002FEmbodiedScan) |\n| 2024-06-10 | 3D-GRAND \u002F 3D-POPE | UMich | [3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.05132.pdf) | Arxiv | [project](https:\u002F\u002F3d-grand.github.io) |\n| 2024-06-03 | SpatialRGPT-Bench | UCSD | [SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.01584) | NeurIPS '24 | [github](https:\u002F\u002Fgithub.com\u002FAnjieCheng\u002FSpatialRGPT) |\n| 2024-05-27 | Reason3D | UC Merced | [Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.17427) | 3DV'25 | [project](https:\u002F\u002Freason3d.github.io\u002F) |\n| 2024-1-18 | SceneVerse | BIGAI | [SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.09340.pdf) | ECCV '24 | [github](https:\u002F\u002Fgithub.com\u002Fscene-verse\u002Fsceneverse) |\n| 2023-12-26 | EmbodiedScan | Shanghai AI Lab | [EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.16170.pdf) | Arxiv | [github](https:\u002F\u002Fgithub.com\u002FOpenRobotLab\u002FEmbodiedScan) |\n| 2023-12-17 |         M3DBench        |     Fudan University     | [M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.10763)                                                                                                  |Arxiv|  [github](https:\u002F\u002Fgithub.com\u002FOpenM3D\u002FM3DBench) |\n| 2023-11-29 |         -         |     DeepMind  | [Leveraging VLM-Based Pipelines to Annotate 3D Objects](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.17851.pdf)                                                                                                  |ICML '24|  [github](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Fobjaverse_annotations) |\n| 2023-09-14 |CrossCoherence   |     UniBO  | [Looking at words and points with attention: a benchmark for text-to-shape coherence](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.07917)                                                                                                  |ICCV '23|  [github](https:\u002F\u002Fgithub.com\u002FAndreAmaduzzi\u002FCrossCoherence) |\n| 2022-10-14 |     SQA3D     |      BIGAI    | [SQA3D: Situated Question Answering in 3D Scenes](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07474.pdf)                                                                                                        | ICLR '23| [github](https:\u002F\u002Fgithub.com\u002FSilongYong\u002FSQA3D) |\n| 2022-09-24 |     FE-3DGQA     |      Beihang University    | [Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12028)                                                                                                        | TCSVT | [github](https:\u002F\u002Fgithub.com\u002Fzlccccc\u002F3DVL_Codebase) |\n| 2021-12-20|     ScanQA     |      RIKEN AIP    | [ScanQA: 3D Question Answering for Spatial Scene 
Understanding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10482.pdf)                                                                                                        | CVPR '23| [github](https:\u002F\u002Fgithub.com\u002FATR-DBI\u002FScanQA) |\n| 2020-12-3 |     Scan2Cap     |      TUM    | [Scan2Cap: Context-aware Dense Captioning in RGB-D Scans](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2012.02206.pdf)                                                                                                        | CVPR '21| [github](https:\u002F\u002Fgithub.com\u002Fdaveredrum\u002FScan2Cap) |\n| 2020-8-23 | ReferIt3D | Stanford | [ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes](https:\u002F\u002Fwww.ecva.net\u002Fpapers\u002Feccv_2020\u002Fpapers_ECCV\u002Fpapers\u002F123460409.pdf) | ECCV '20 | [github](https:\u002F\u002Fgithub.com\u002Freferit3d\u002Freferit3d) |\n| 2019-12-18 |     ScanRefer     |      TUM   | [ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10482.pdf)                                                                                                        | ECCV '20 | [github](https:\u002F\u002Fdaveredrum.github.io\u002FScanRefer\u002F) |\n\n## Contributing\n\nYour contributions are always welcome!\n\nI will keep some pull requests open if I'm not sure if they are awesome for 3D LLMs, you could vote for them by adding 👍 to them.\n\n---\n\nIf you have any questions about this opinionated list, please get in touch at xianzheng@robots.ox.ac.uk or Wechat ID: mxz1997112.\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FActiveVisionLab_Awesome-LLM-3D_readme_748489af54ae.png)](https:\u002F\u002Fstar-history.com\u002F#ActiveVisionLab\u002FAwesome-LLM-3D&Date)\n\n## Citation\nIf you find this repository useful, please consider citing this paper:\n```\n@misc{ma2024llmsstep3dworld,\n      title={When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models}, \n      author={Xianzheng Ma and Yash Bhalgat and Brandon Smart and Shuai Chen and Xinghui Li and Jian Ding and Jindong Gu and Dave Zhenyu Chen and Songyou Peng and Jia-Wang Bian and Philip H Torr and Marc Pollefeys and Matthias Nießner and Ian D Reid and Angel X. 
Chang and Iro Laina and Victor Adrian Prisacariu},\n      year={2024},\n      journal={arXiv preprint arXiv:2405.10255},\n}\n```\n\n## Acknowledgement\nThis repo is inspired by [Awesome-LLM](https:\u002F\u002Fgithub.com\u002FHannibal046\u002FAwesome-LLM?tab=readme-ov-file#other-awesome-lists)\n\n","# Awesome-LLM-3D [![Awesome](https:\u002F\u002Fawesome.re\u002Fbadge.svg)](https:\u002F\u002Fawesome.re) [![Maintenance](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FMaintained%3F-yes-green.svg)](https:\u002F\u002FGitHub.com\u002FNaereen\u002FStrapDown.js\u002Fgraphs\u002Fcommit-activity) [![PR's Welcome](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPRs-welcome-brightgreen.svg?style=flat)](http:\u002F\u002Fmakeapullrequest.com)  \u003Ca href=\"\" target='_blank'>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FActiveVisionLab_Awesome-LLM-3D_readme_0b3a1d550e6b.png\"> \u003C\u002Fa> [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2405.10255v2-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.10255v2)\n\n\u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FActiveVisionLab_Awesome-LLM-3D_readme_6b1f2d78a861.png\" width=\"100%\">\n\u003C\u002Fdiv>\n\n\n\n## 🏠 关于\n这里是一个精心整理的清单，收录了由大型语言模型（LLMs）赋能的3D相关任务的相关论文。内容涵盖3D理解、推理、生成以及具身智能体等多个方向。此外，我们还纳入了其他基础模型（如CLIP、SAM），以全面展示该领域的研究现状。\n\n本仓库处于持续更新状态，欢迎关注以获取最新进展。如果您觉得这份资源有用，请为本仓库点个赞⭐，并引用我们的论文（见#citation）。\n\n## 🔥 最新动态\n- [2026-03-20] 我们的基准评测论文**Real-3DQA**现已发表于ICLR 2026！继我们的综述论文之后，我们又发布了关于真正3D空间理解的基准评测论文。[项目主页](https:\u002F\u002Freal-3dqa.github.io\u002F)\n- [2025-10-21] 📢 我们发布了综述的**第二版**，更新至**2025年7月**的文献：  \n👉 [*当LLMs走进3D世界：基于多模态大型语言模型的3D任务综述与元分析*](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.10255v2)\n- [2024-05-16] 查看3D-LLM领域首篇综述论文：[当LLMs走进3D世界：基于多模态大型语言模型的3D任务综述与元分析](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.10255) \n- [2024-01-06] [Runsen Xu](https:\u002F\u002Frunsenxu.com\u002F) 添加了时间顺序信息，而[Xianzheng Ma](https:\u002F\u002Fxianzhengma.github.io\u002F)则按Z-A顺序重新整理，以便更好地追踪最新进展。\n- [2023-12-16] [Xianzheng Ma](https:\u002F\u002Fxianzhengma.github.io\u002F) 和[Yash Bhalgat](https:\u002F\u002Fyashbhalgat.github.io\u002F)共同整理了这份清单，并发布了第一版；\n\n## 目录\n\n- [Awesome-LLM-3D](#awesome-llm-3D)\n  - [通过LLM实现的3D统一理解与生成](#3d-unified-understanding-and-generation-via-llm)\n  - [通过LLM进行3D理解](#3d-understanding-via-llm)\n  - [通过其他基础模型进行3D理解](#3d-understanding-via-other-foundation-models)\n  - [3D推理](#3d-reasoning)\n  - [3D生成](#3d-generation)\n  - [3D具身智能体](#3d-embodied-agent)\n  - [3D基准测试](#3d-benchmarks)\n  - [贡献](#contributing)\n\n\n## 通过LLM实现的3D统一理解与生成\n\n| 日期 |       关键词       |    首发机构   | 论文                                                                                                                                                                               | 发表期刊 | 其他 |\n| :-----: | :------------------: | :--------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------:\n| 2025-11-07 | Omni-View | 北京大学 | [Omni-View：基于多视角图像的统一3D模型如何通过生成促进理解](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.07222) | ICLR 2026 | [github](https:\u002F\u002Fgithub.com\u002FAIDC-AI\u002FOmni-View) |\n| 2025-08-16 | UniUGG | 复旦大学 | [UniUGG：基于几何-语义编码的统一3D理解和生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.11952) | ICLR 2026 | [github](https:\u002F\u002Fgithub.com\u002Ffudan-zvg\u002FUniUGG) |\n\n## 通过LLM进行3D理解\n\n| 日期 |    
   关键词       |    机构（第一）   | 论文 | 发表平台 | 其他 |\n| :-----: | :------------------: | :--------------: | :----------------------------------------------------------------------------------------------------------------------------: | :---------: | :---------:\n| 2026-03-07 | 3D-RFT | BIGAI | [3D-RFT: 基于视频的3D场景理解的强化微调](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.04976) | Arxiv | [github](https:\u002F\u002Fgithub.com\u002F3D-RFT\u002F3D-RFT) |\n| 2025-12-05 | Fast SceneScript | Qualcomm \u002F UvA | [Fast SceneScript: 通过多令牌预测实现快速准确的语言驱动3D场景理解](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.05597) | CVPR '26 | - |\n| 2025-11-27 | G\u003Csup>2\u003C\u002Fsup>VLM | 上海人工智能实验室 | [G\u003Csup>2\u003C\u002Fsup>VLM: 具有统一3D重建和空间推理能力的几何接地视觉语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2511.21688) | Arxiv | [github](https:\u002F\u002Fgithub.com\u002FInternRobotics\u002FG2VLM) |\n| 2025-11-07 | Omni-View | 北京大学 | [Omni-View: 解锁生成如何促进基于多视角图像的统一3D模型中的理解](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.07222) | ICLR 2026 | [github](https:\u002F\u002Fgithub.com\u002FAIDC-AI\u002FOmni-View) |\n| 2025-08-16 | UniUGG | 复旦大学 | [UniUGG: 通过几何语义编码实现统一的3D理解和生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.11952) | ICLR 2026 | [github](https:\u002F\u002Fgithub.com\u002Ffudan-zvg\u002FUniUGG) |\n| 2025-07-31 | 3D-R1 | 北京大学 | [3D-R1: 增强3D VLM中的推理能力以实现统一场景理解](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23478) | Arxiv | [项目](https:\u002F\u002Faigeeksgroup.github.io\u002F3D-R1\u002F) |\n| 2025-06-11 | LEO-VL | BIGAI | [LEO-VL: 面向可扩展3D视觉-语言学习的高效场景表示](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.09935) | Arxiv | [项目](https:\u002F\u002Fleo-vl.github.io\u002F) |\n| 2025-06-09 | SpatialLM | Manycore Tech \u002F 香港科技大学 | [SpatialLM: 训练大型语言模型进行结构化室内建模](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.07491) | NeurIPS '25 | [项目](https:\u002F\u002Fmanycore-research.github.io\u002FSpatialLM\u002F) |\n| 2025-06-02 | 3DRS | 香港大学 | [MLLMs需要3D感知表示监督来进行场景理解](https:\u002F\u002Farxiv.org\u002Fhtml\u002F2506.01946v1) | Arxiv | [项目](https:\u002F\u002Fvisual-ai.github.io\u002F3drs\u002F) |\n| 2025-05-30 | VG LLM | 香港中文大学 | [从视频中学习3D世界：利用3D视觉几何先验增强MLLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24625) | Arxiv | [项目](https:\u002F\u002Flavi-lab.github.io\u002FVG-LLM\u002F) |\n| 2025-05-29 | Spatial-MLLM | 清华大学 | [Spatial-MLLM: 提升MLLM在视觉驱动的空间智能方面的能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.23747) | Arxiv | [项目](https:\u002F\u002Fdiankun-wu.github.io\u002FSpatial-MLLM\u002F) |\n| 2025-05-28 | SeeGround | 香港科技大学（广州） | [基于视觉-语言模型的零样本3D视觉接地](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22429) | CVPRW'25 | [项目](https:\u002F\u002Fseeground.github.io) |\n| 2025-05-28 | 3DLLM-Mem | UCLA\u002FGoogle | [3DLLM-Mem: 面向具身3D大型语言模型的长期时空记忆](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22657) | NeurIPS'25 | [项目](https:\u002F\u002F3dllm-mem.github.io\u002F) |\n| 2025-04-24 | 3D-LLaVA | 阿德莱德大学 | [3D-LLaVA: 借助全能超点Transformer迈向通用3D LMMs](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.01163) | CVPR '25 | [github](https:\u002F\u002Fgithub.com\u002Fdjiajunustc\u002F3D-LLaVA) |\n| 2025-04-03 | Ross3D | 中科院自动化所 | [Ross3D: 具有3D意识的重建式视觉指令微调](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.01901) | Arxiv | [项目](https:\u002F\u002Fhaochen-wang409.github.io\u002Fross3d\u002F) |\n| 2025-03-08 | SplatTalk | GIT | [SplatTalk: 基于高斯泼溅的3D 
VQA](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.06271) | Arxiv | [github]() |\n| 2025-03-01 | Inst3D-LMM | 浙江大学 | [Inst3D-LMM: 基于多模态指令微调的实例感知3D场景理解](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.00513) | CVPR '25 | [github](https:\u002F\u002Fgithub.com\u002Fhanxunyu\u002FInst3D-LMM) |\n| 2025-02-13 | ENEL | SH AILab | [ENEL: 探索无编码器架构在3D LMMs中的潜力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.09620v1) | Arxiv | [项目](https:\u002F\u002Fgithub.com\u002FIvan-Tang-3D\u002FENEL\u002Ftree\u002Fmain?tab=readme-ov-file) |\n| 2025-02-02 | LSceneLLM | 华南理工大学 | [LSceneLLM: 利用自适应视觉偏好提升大型3D场景理解能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.01292) | CVPR '25 | [项目](https:\u002F\u002Fgithub.com\u002FHoyyyaard\u002FLSceneLLM) |\n| 2025-01-02 | GPT4Scene | 香港大学 | [GPT4Scene: 利用视觉-语言模型从视频中理解3D场景](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.01428) | Arxiv | [项目](https:\u002F\u002Fgpt4scene.github.io\u002F) |\n| 2024-12-05 | SeeGround | 香港科技大学（广州） | [SeeGround: 观察并接地，实现零样本开放词汇3D视觉接地](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04383) | CVPR '25 | [项目](https:\u002F\u002Fseeground.github.io) |\n| 2024-12-03 | Video-3D LLM | 香港中文大学 | [Video-3D LLM: 学习位置感知视频表示以用于3D场景理解](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00493) | CVPR '25 | [项目](https:\u002F\u002Fgithub.com\u002FLaVi-Lab\u002FVideo-3D-LLM) |\n| 2024-11-29 | PerLA | 布鲁诺·凯斯勒基金会 | [PerLA: 感知型3D语言助手](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.19774) | CVPR '25 | [项目](https:\u002F\u002Fgfmei.github.io\u002FPerLA\u002F) |\n| 2024-10-12 | Situation3D | 伊利诺伊大学厄巴纳-香槟分校 | [情境感知在3D视觉语言推理中的重要性](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07544) | CVPR '24 | [项目](https:\u002F\u002Fyunzeman.github.io\u002Fsituation3d\u002F) |\n| 2024-09-30 | Robin3D | 香港大学 | [Robin3D: 通过稳健的指令微调改进3D大型语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00255) | ICCV '25 | [github](https:\u002F\u002Fgithub.com\u002FWeitaiKang\u002FRobin3D) |\n| 2024-09-28 | LLaVA-3D | 香港大学 | [LLaVA-3D: 一条简单而有效的途径，使LMMs具备3D感知能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.18125) | Arxiv | [项目](https:\u002F\u002Fzcmax.github.io\u002Fprojects\u002FLLaVA-3D\u002F) |\n| 2024-09-08 | MSR3D | BIGAI | [3D场景中的多模态情境推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.02389) | NeurIPS '24 | [项目](https:\u002F\u002Fmsr3d.github.io\u002F) |\n| 2024-08-28 | GreenPLM | 华中科技大学 | [更多文本，更少点云：迈向3D数据高效的点云-语言理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.15966) | Arxiv | [github](https:\u002F\u002Fgithub.com\u002FTangYuan96\u002FGreenPLM) |\n| 2024-06-17 | LLaNA | 博洛尼亚大学 | [LLaNA: 大型语言与NeRF助手](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.11840) | NeurIPS '24 | [项目](https:\u002F\u002Fandreamaduzzi.github.io\u002Fllana\u002F) |\n| 2024-06-07  | SpatialPIN           | 牛津大学                 | [SpatialPIN: 通过提示和交互式3D先验增强视觉-语言模型的空间推理能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.13438)                         | NeurIPS '24       | [项目](https:\u002F\u002Fdannymcy.github.io\u002Fzeroshot_task_hallucination\u002F) |\n| 2024-06-03 | SpatialRGPT | 加州大学圣地亚哥分校 | [SpatialRGPT: 在视觉语言模型中实现 grounded空间推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.01584) | NeurIPS '24 | [github](https:\u002F\u002Fgithub.com\u002FAnjieCheng\u002FSpatialRGPT) |\n| 2024-05-02 | MiniGPT-3D | 华中科技大学 | [MiniGPT-3D: 利用2D先验高效对齐3D点云与大型语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.01413) | ACM MM '24 | [项目](https:\u002F\u002Ftangyuan96.github.io\u002Fminigpt_3d_project_page\u002F) |\n| 2024-03-19 | Scenescript | Meta | [SceneScript: 使用自回归结构化语言模型重建场景](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.13064) | ECCV '24 | 
[项目](https:\u002F\u002Fwww.projectaria.com\u002Fscenescript\u002F) |\n| 2024-02-27 |  ShapeLLM |    西安交通大学  | [ShapeLLM: 面向具身交互的通用3D对象理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.17766)                                                                                | Arxiv | [项目](https:\u002F\u002Fqizekun.github.io\u002Fshapellm\u002F) |\n| 2024-01-22  | SpatialVLM           | Google DeepMind        | [SpatialVLM: 为视觉-语言模型赋予空间推理能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.12168)                                                                  | CVPR '24    | [项目](https:\u002F\u002Fspatial-vlm.github.io\u002F) |\n| 2023-12-21 |  LiDAR-LLM |    北京大学  | [LiDAR-LLM: 探索大型语言模型在3D激光雷达理解方面的潜力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.14074.pdf)                                                                                | Arxiv | [项目](https:\u002F\u002Fsites.google.com\u002Fview\u002Flidar-llm) |\n| 2023-12-15 |  3DAP |    上海人工智能实验室  | [3DAxiesPrompts: 释放GPT-4V的3D空间任务能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.09738.pdf)                                                                                | Arxiv | [项目]() |\n| 2023-12-13 |  Chat-Scene |    浙江大学 | [Chat-Scene: 通过对象标识符连接3D场景和大型语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.08168.pdf)                                                                                | NeurIPS '24 | [github](https:\u002F\u002Fgithub.com\u002FZzZZCHS\u002FChat-Scene) |\n| 2023-12-5 | GPT4Point | 香港大学 | [GPT4Point: 一个统一的框架，用于点云-语言的理解和生成](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02980.pdf) |Arxiv |  [github](https:\u002F\u002Fgithub.com\u002FPointcept\u002FGPT4Point) |\n| 2023-11-30 |         LL3DA        |     复旦大学    | [LL3DA: 面向全方位3D理解、推理和规划的视觉交互式指令微调](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.18651.pdf)                                                                                                  |Arxiv|  [github](https:\u002F\u002Fgithub.com\u002FOpen3DA\u002FLL3DA) |\n| 2023-11-26 | ZSVG3D | 香港中文大学（深圳） | [面向零样本开放词汇3D视觉接地的可视化编程](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.15383.pdf) | Arxiv | [项目](https:\u002F\u002Fcurryyuan.github.io\u002FZSVG3D\u002F) | Arxiv | \n| 2023-11-18 |          LEO          |      BIGAI      | [3D世界中的具身通才代理](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.12871.pdf)                                                           |    ICML '24  |  [github](https:\u002F\u002Fgithub.com\u002Fembodied-generalist\u002Fembodied-generalist) |\n| 2023-10-14 | JM3D-LLM | 厦门大学 | [JM3D & JM3D-LLM: 通过联合多模态线索提升3D表示](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.09503v2.pdf)                                        | ACM MM '23 |  [github](https:\u002F\u002Fgithub.com\u002Fmr-neko\u002Fjm3d) |\n| 2023-10-10 |  Uni3D |    BAAI  | [Uni3D: 探索大规模统一3D表示](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06773)                                                                                | ICLR '24 | [项目](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FUni3D) |\n| 2023-9-27 |  - |    KAUST  | [零样本3D形状对应](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.03253)                                                                                | Siggraph Asia '23 | - |\n| 2023-9-21|       LLM-Grounder       |      密歇根大学      | [LLM-Grounder: 以大型语言模型为代理实现开放词汇3D视觉接地](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.12311.pdf)              |     ICRA '24     |  [github](https:\u002F\u002Fgithub.com\u002Fsled-group\u002Fchat-with-nerf) |\n| 2023-9-1 |        Point-Bind       |      香港中文大学     | [Point-Bind & Point-LLM: 
通过多模态对齐点云，实现3D理解、生成和指令遵循](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.00615.pdf)             |  Arxiv   |  [github](https:\u002F\u002Fgithub.com\u002FZiyuGuo99\u002FPoint-Bind_Point-LLM) |\n| 2023-8-31 |         PointLLM         |      香港中文大学      | [PointLLM: 赋能大型语言模型理解点云](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16911.pdf)                                                                             |   ECCV '24 |  [github](https:\u002F\u002Fgithub.com\u002FOpenRobotLab\u002FPointLLM) |\n| 2023-8-17|     Chat-3D     |      浙江大学     | [Chat-3D: 以数据高效的方式微调大型语言模型，使其能够进行3D场景的通用对话](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.08769v1.pdf)                                                          |  Arxiv      |  [github](https:\u002F\u002Fgithub.com\u002FChat-3D\u002FChat-3D)|\n| 2023-8-8 |         3D-VisTA          |      BIGAI      | [3D-VisTA: 预训练的Transformer，用于3D视觉与文本对齐](https:\u002F\u002FArxiv.org\u002Fabs\u002F2308.04352)                                                           |    ICCV '23  | [github]() |\n| 2023-7-24 |     3D-LLM     |      UCLA    | [3D-LLM: 将3D世界注入大型语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.12981.pdf)                                                                                                                      |   NeurIPS '23|  [github](https:\u002F\u002Fgithub.com\u002FUMass-Foundation-Model\u002F3D-LLM) |\n| 2023-3-29 |       ViewRefer       |      香港中文大学      | [ViewRefer: 抓住多视角知识，用于3D视觉接地](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.16894.pdf)                                                                                               |ICCV '23 |[github](https:\u002F\u002Fgithub.com\u002FIvan-Tang-3D\u002FViewRefer3D) |\n| 2022-9-12 |        -        |      MIT      | [利用大型（视觉）语言模型进行机器人3D场景理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.05629.pdf)                                                                      |Arxiv|  [github](https:\u002F\u002Fgithub.com\u002FMIT-SPARK\u002Fllm_scene_understanding) |\n\n\n\n## 通过其他基础模型进行3D理解\n|  编号 |       关键词       |    机构（第一作者单位）    | 论文                                                                                                                                                                               | 发表时间 | 其他 |\n| :-----: | :------------------: | :--------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------: \n| 2026-03-21 | OV3D-CG | ICT, CAS | [OV3D-CG：基于上下文引导的开放词汇3D实例分割](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2025\u002Fhtml\u002FZhou_OV3D-CG_Open-vocabulary_3D_Instance_Segmentation_with_Contextual_Guidance_ICCV_2025_paper.html) | ICCV '2025 | [github](https:\u002F\u002Fgithub.com\u002FVIPL-VSU\u002FOV3D-CG) |\n| 2025-11-20 | POMA-3D | Imperial | [POMA-3D：点云地图驱动的3D场景理解方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.16567) | Arxiv | [项目](https:\u002F\u002Fmatchlab-imperial.github.io\u002Fpoma3d\u002F) |\n| 2025-07-26 |  OV-3DDet |    HKUST  | [CoDAv2：面向开放词汇3D目标检测的协作式新物体发现与框引导跨模态对齐](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.00830)                                                                                | TPAMI '25 | [github](https:\u002F\u002Fgithub.com\u002Fyangcaoai\u002FCoDA_NeurIPS2023) |\n| 2025-02-20 |  CrossOver |  Stanford | [CrossOver：3D场景跨模态对齐](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.15011) | CVPR '25 | 
[项目](https:\u002F\u002Fsayands.github.io\u002Fcrossover\u002F) |\n| 2025-02-05 |  SAGA |  SJTU | [Segment Any 3D Gaussians：任意3D高斯场的分割](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.00860) | AAAI '25 | [项目](https:\u002F\u002Fjumpat.github.io\u002FSAGA\u002F) |\n| 2024-10-12 |  Lexicon3D |    UIUC  | [Lexicon3D：探索单纯视觉基础模型在复杂3D场景理解中的能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.03757) | NeurIPS '24 | [项目](https:\u002F\u002Fyunzeman.github.io\u002Flexicon3d\u002F) |\n| 2024-10-07 |  Diff2Scene |    CMU  | [利用文本到图像扩散模型实现开放词汇3D语义分割](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.13642) | ECCV'24 | [项目](https:\u002F\u002Fdiff2scene.github.io\u002F) |\n| 2024-07-19 |  OpenSU3D |  TUM  | [OpenSU3D：使用基础模型进行开放世界3D场景理解](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.03757) | ICRA '25 | [项目](https:\u002F\u002Fopensu3d.github.io\u002F) |\n| 2024-04-07 |  Any2Point |    上海人工智能实验室  | [Any2Point：赋能任意模态大模型实现高效3D理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.07989) | ECCV'24 | [github](https:\u002F\u002Fgithub.com\u002FIvan-Tang-3D\u002FAny2Point) |\n| 2024-03-16 |  N2F2 |    牛津大学VGG实验室  | [N2F2：基于嵌套神经特征场的层次化场景理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.10997.pdf) | Arxiv | - |\n| 2023-12-17 |  SAI3D |    北京大学  | [SAI3D：分割3D场景中的任意实例](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.11557.pdf)                                                                                | Arxiv | [项目](https:\u002F\u002Fyd-yin.github.io\u002FSAI3D) |\n| 2023-12-17 |  Open3DIS |    VinAI  | [Open3DIS：基于2D掩码引导的开放词汇3D实例分割](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.10671.pdf)                                                                                | Arxiv | [项目](https:\u002F\u002Fopen3dis.github.io\u002F) |\n| 2023-11-6 |  OVIR-3D |    罗格斯大学  | [OVIR-3D：无需3D数据训练的开放词汇3D实例检索](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.02873.pdf) | CoRL '23 | [github](https:\u002F\u002Fgithub.com\u002Fshiyoung77\u002FOVIR-3D\u002F) |\n| 2023-10-29|  OpenMask3D |    ETH  | [OpenMask3D：开放词汇3D实例分割](https:\u002F\u002Fopenmask3d.github.io\u002Fstatic\u002Fpdf\u002Fopenmask3d.pdf)                                                                                | NeurIPS '23 | [项目](https:\u002F\u002Fopenmask3d.github.io\u002F) |\n| 2023-10-5 |     Open-Fusion     |      -     | [Open-Fusion：实时开放词汇3D建图与可查询场景表示](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.03923.pdf)                                                                            |Arxiv|  [github](https:\u002F\u002Fgithub.com\u002FUARK-AICV\u002FOpenFusion) |\n| 2023-9-22 |  OV-3DDet |    HKUST  | [CoDA：面向开放词汇3D目标检测的协作式新框发现与跨模态对齐](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.02960.pdf)                                                                                | NeurIPS '23 | [github](https:\u002F\u002Fgithub.com\u002Fyangcaoai\u002FCoDA_NeurIPS2023) |\n| 2023-9-19 | LAMP |      -      | [从语言到3D世界：将语言模型适配用于点云感知](https:\u002F\u002Fopenreview.net\u002Fforum?id=H49g8rRIiF)                                                              |    OpenReview     | - |\n| 2023-9-15 |  OpenNerf |    -    | [OpenNerf：基于像素级特征和渲染新视角的开放集3D神经场景分割](https:\u002F\u002Fopenreview.net\u002Fpdf?id=SgjAojPKb3)                                                                                | OpenReview | [github]() |\n| 2023-9-1|  OpenIns3D |    剑桥大学  | [OpenIns3D：针对开放词汇3D实例分割的抓拍与查找](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.00616.pdf)                                                                                | Arxiv | 
[项目](https:\u002F\u002Fzheninghuang.github.io\u002FOpenIns3D\u002F) |\n| 2023-6-7 |         Contrastive Lift         |     牛津大学VGG实验室     | [Contrastive Lift：通过慢速-快速对比融合实现3D目标实例分割](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.04633.pdf)                                                                                        |   NeurIPS '23| [github](https:\u002F\u002Fgithub.com\u002Fyashbhalgat\u002FContrastive-Lift) |\n| 2023-6-4 |  Multi-CLIP |    ETH  | [Multi-CLIP：面向3D场景问答任务的对比视觉-语言预训练](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.02329.pdf)                                                                                | Arxiv | - |\n| 2023-5-23 |  3D-OVS |    国立台湾大学  | [弱监督下的3D开放词汇分割](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14093.pdf)                                                                                | NeurIPS '23 | [github](https:\u002F\u002Fgithub.com\u002FKunhao-Liu\u002F3D-OVS) |\n| 2023-5-21 |  VL-Fields |    爱丁堡大学  | [VL-Fields：迈向语言接地的神经隐式空间表示](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.12427.pdf)                                                                                | ICRA '23 | [项目](https:\u002F\u002Ftsagkas.github.io\u002Fvl-fields\u002F)  |\n| 2023-5-8 |  CLIP-FO3D |    清华大学  | [CLIP-FO3D：从2D密集CLIP中学习自由开放世界的3D场景表示](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.04748.pdf)                                                                                | ICCVW '23 | - |\n| 2023-4-12 |  3D-VQA |    ETH  | [CLIP引导的面向3D场景问答任务的视觉-语言预训练](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.06061.pdf)                                                                                | CVPRW '23 | [github](https:\u002F\u002Fgithub.com\u002FAlexDelitzas\u002F3D-VQA) |\n| 2023-4-3 |  RegionPLC |    香港大学 | [RegionPLC：面向开放世界3D场景理解的区域点-语言对比学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.00962.pdf)                                                                                | Arxiv | [项目](https:\u002F\u002Fjihanyang.github.io\u002Fprojects\u002FRegionPLC) |\n| 2023-3-20 |        CG3D        |      约翰霍普金斯大学      | [CLIP走进3D：利用提示调优实现语言接地的3D识别](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.11313.pdf)                                                                                                 |Arxiv|  [github](https:\u002F\u002Fgithub.com\u002Fdeeptibhegde\u002FCLIP-goes-3D) |\n| 2023-3-16 | LERF |     加州大学伯克利分校     | [LERF：语言嵌入辐射场](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.09553.pdf)                                                   | ICCV '23   | [github](https:\u002F\u002Fgithub.com\u002Fkerrj\u002Flerf) |\n| 2023-2-14 |  ConceptFusion |    MIT  | [ConceptFusion：开放集多模态3D映射](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.07241.pdf)                                                                                | RSS '23 | [项目](https:\u002F\u002Fconcept-fusion.github.io\u002F) |\n| 2023-1-12 |         CLIP2Scene         |      香港大学      | [CLIP2Scene：通过CLIP实现标签高效的3D场景理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04926.pdf)                                                                                        |    CVPR '23 | [github](https:\u002F\u002Fgithub.com\u002Frunnanchen\u002FCLIP2Scene) |\n| 2022-12-1 |         UniT3D         |      TUM     | [UniT3D：用于3D密集字幕生成和视觉定位的统一Transformer](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FICCV2023\u002Fpapers\u002FChen_UniT3D_A_Unified_Transformer_for_3D_Dense_Captioning_and_Visual_ICCV_2023_paper.pdf)                                                                          |   ICCV '23| 
[github]() |\n| 2022-11-29 |        PLA        |     香港大学    | [PLA：语言驱动的开放词汇3D场景理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.16312.pdf)                                                                 |CVPR '23|  [github](https:\u002F\u002Fgithub.com\u002FCVMI-Lab\u002FPLA) |\n| 2022-11-28 |       OpenScene       |      ETHz      | [OpenScene：使用开放词汇进行3D场景理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.15654.pdf)                                                             |   CVPR '23  | [github](https:\u002F\u002Fgithub.com\u002Fpengsongyou\u002Fopenscene) |\n| 2022-10-11 |  CLIP-Fields |    NYU  | [CLIP-Fields：用于机器人记忆的弱监督语义场](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05663.pdf)                                                                                | Arxiv | [项目](https:\u002F\u002Fmahis.life\u002Fclip-fields\u002F) |\n| 2022-7-23 |  Semantic Abstraction |    哥伦比亚大学  | [Semantic Abstraction：从2D视觉-语言模型出发的开放世界3D场景理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.11514.pdf)                                                                                | CoRL '22 | [项目](https:\u002F\u002Fsemantic-abstraction.cs.columbia.edu\u002F) |\n| 2022-4-26 |   ScanNet200 |    TUM  | [语言接地的室内3D语义分割（野外场景)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.07761.pdf)                                                                                | ECCV '22 | [项目](https:\u002F\u002Frozdavid.github.io\u002Fscannet200) |\n\n## 3D推理\n| 日期 | 关键词 | 机构（第一） | 论文                                                                                                                                                                               | 发表平台 | 其他 |\n| :-----: | :------------------: | :--------------: | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------: | :---------: |\n| 2025-12-15 | RoboTracer | 北航 | [RoboTracer: 利用视觉-语言模型中的推理能力掌握机器人领域的空间轨迹](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.13660) | Arxiv | [项目](https:\u002F\u002Fzhoues.github.io\u002FRoboTracer\u002F) |\n| 2025-06-11 | SceneCOT | BIGAI | [SceneCOT: 在3D场景中激发思维链式推理](https:\u002F\u002Fscenecot.github.io\u002F) | Arxiv | [项目](https:\u002F\u002Fscenecot.github.io\u002F) |\n| 2025-06-04 | RoboRefer | 北航 | [RoboRefer: 面向机器人的视觉-语言模型中基于推理的空间指代](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04308) | Arxiv | [项目](https:\u002F\u002Fzhoues.github.io\u002FRoboRefer\u002F) |\n| 2024-09-08 | MSR3D | BIGAI | [3D场景中的多模态情境推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.02389) | NeurIPS '24| [项目](https:\u002F\u002Fmsr3d.github.io\u002F) |\n| 2023-5-20|       3D-CLR      |      UCLA     | [基于多视角图像的3D概念学习与推理](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.11327.pdf)                                                 |   CVPR '23  | [github](https:\u002F\u002Fgithub.com\u002Fevelinehong\u002F3D-CLR-Official) |\n| - |         Transcribe3D         |     TTI, Chicago     | [Transcribe3D: 利用转录信息对齐大语言模型，实现自纠正微调下的3D指代推理](https:\u002F\u002Fopenreview.net\u002Fpdf?id=7j3sdUZMTF)                                                                                                  |CoRL '23|  [github]() |\n\n\n## 3D生成\n| 日期 | 关键词 | 机构 | 论文                                                                                                                                                                               | 发表平台 | 其他 |\n| :-----: | :------------------: | :--------------: | 
:----------------------------------------------------------------------------------------------------------------------------: | :---------: | :---------: \n| 2025-11-07 | Omni-View | 北大 | [Omni-View: 解锁多视角图像统一3D模型中生成如何促进理解](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.07222) | ICLR 2026 | [github](https:\u002F\u002Fgithub.com\u002FAIDC-AI\u002FOmni-View) |\n| 2025-08-16 | UniUGG | 复旦大学 | [UniUGG: 基于几何-语义编码的统一3D理解和生成](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.11952) | ICLR 2026 | [github](https:\u002F\u002Fgithub.com\u002Ffudan-zvg\u002FUniUGG) |\n| 2024-11-14 | LLaMA-Mesh | 清华大学 | [LLaMA-Mesh: 将语言模型与3D网格生成统一起来](https:\u002F\u002Farxiv.org\u002Fhtml\u002F2411.09595v1) | Arxiv | [项目](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Ftoronto-ai\u002FLLaMA-Mesh\u002F) |\n| 2023-11-29 | ShapeGPT | 复旦大学 | [ShapeGPT: 基于统一多模态语言模型的3D形状生成](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.17618.pdf) | Arxiv | [github](https:\u002F\u002Fgithub.com\u002FOpenShapeLab\u002FShapeGPT) |\n| 2023-11-27 | MeshGPT | 慕尼黑工业大学 | [MeshGPT: 使用仅解码器型Transformer生成三角形网格](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.15475.pdf) | Arxiv | [项目](https:\u002F\u002Fnihalsid.github.io\u002Fmesh-gpt\u002F) |\n| 2023-10-19 | 3D-GPT | ANU | [3D-GPT: 利用大型语言模型进行程序化3D建模](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.12945.pdf) | Arxiv | [github](https:\u002F\u002Fdreamllm.github.io\u002F) |\n| 2023-9-21 | LLMR | MIT | [LLMR: 利用大型语言模型实时提示交互式世界](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.12276.pdf) | Arxiv | - |\n| 2023-9-20 | DreamLLM | MEGVII | [DreamLLM: 协同的多模态理解与创造](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.11499.pdf) | Arxiv | [github](https:\u002F\u002Fgithub.com\u002FRunpeiDong\u002FDreamLLM) |\n| 2023-4-1 | ChatAvatar | Deemos Tech | [DreamFace: 在文本指导下逐步生成可动画3D人脸](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3592094) | ACM TOG | [网站](https:\u002F\u002Fhyperhuman.deemos.com\u002F) |\n\n## 3D 具身智能体\n| 日期 | 关键词 | 研究机构 | 论文 | 发表平台 | 其他 |\n| :-----: | :------------------: | :--------------: | :----------------------------------------------------------------------------------------------------------------------------: | :---------: | :---------: |\n| 2025-12-15 | RoboTracer | 北航 | [RoboTracer: 利用视觉-语言模型中的推理能力掌握机器人空间轨迹](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.13660) | Arxiv | [项目](https:\u002F\u002Fzhoues.github.io\u002FRoboTracer\u002F) |\n| 2025-06-04 | RoboRefer | 北航 | [RoboRefer: 
面向机器人的视觉-语言模型中基于推理的空间指代](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04308) | Arxiv | [项目](https:\u002F\u002Fzhoues.github.io\u002FRoboRefer\u002F) |\n| 2025-05-30 | VeBrain | 上海人工智能实验室 | [视觉具身大脑：让多模态大语言模型在空间中看、思考并控制](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.00123v1) | Arxiv | [项目](https:\u002F\u002Finternvl.github.io\u002Fblog\u002F2025-05-26-VeBrain\u002F) |\n| 2025-05-28 | 3DLLM-Mem | UCLA | [3DLLM-Mem: 面向具身3D大语言模型的长期时空记忆](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22657) | Arxiv | [项目](https:\u002F\u002F3dllm-mem.github.io\u002F) |\n| 2024-01-22 | SpatialVLM | Deepmind | [SpatialVLM: 为视觉-语言模型赋予空间推理能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.12168)                                                                                | CVPR '24 | [项目](https:\u002F\u002Fspatial-vlm.github.io\u002F) |\n| 2023-12-05 | NaviLLM | 香港中文大学 | [迈向学习一种用于具身导航的通用模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.02010) | CVPR '24 | [项目](https:\u002F\u002Fgithub.com\u002Fzd11024\u002FNaviLLM) |\n| 2023-11-27 | Dobb-E | NYU | [关于将机器人带回家](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16098.pdf)        |    Arxiv  |  [github](https:\u002F\u002Fgithub.com\u002Fnotmahi\u002Fdobb-e) |\n| 2023-11-26 | STEVE | 浙江大学 | [看见并思考：虚拟环境中的具身智能体](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15209) | Arxiv | [github](https:\u002F\u002Fgithub.com\u002Frese1f\u002FSTEVE) |\n| 2023-11-18 | LEO  |   BIGAI  | [3D世界中的具身通用智能体](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.12871.pdf)   |    ICML '24  |  [github](https:\u002F\u002Fgithub.com\u002Fembodied-generalist\u002Fembodied-generalist) |\n| 2023-9-14 |        UniHSI      |      上海人工智能实验室     | [通过提示式接触链实现统一的人与场景交互](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.07918.pdf)                                                                                                 |   Arxiv |  [github](https:\u002F\u002Fgithub.com\u002FOpenRobotLab\u002FUniHSI) |\n| 2023-7-28 |         RT-2         |     Google-DeepMind     | [RT-2: 视觉-语言-动作模型将网络知识迁移到机器人控制中](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.15818.pdf)                                                                                                  |Arxiv|  [github](https:\u002F\u002Frobotics-transformer2.github.io\u002F) |\n| 2023-7-12 |         SayPlan        |     QUT机器人中心    | [SayPlan: 使用3D场景图接地大语言模型，实现可扩展的机器人任务规划](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.06135.pdf)                                                                                                  |CoRL '23|  [github](https:\u002F\u002Fsayplan.github.io\u002F) |\n| 2023-7-12 |          VoxPoser          |      斯坦福      | [VoxPoser: 可组合的3D价值地图，用于结合语言模型的机器人操作](https:\u002F\u002Fvoxposer.github.io\u002Fvoxposer.pdf)                                                           |    Arxiv  |  [github](https:\u002F\u002Fgithub.com\u002Fhuangwl18\u002FVoxPoser) |\n| 2022-12-13|         RT-1         |     Google     | [RT-1: 用于大规模真实世界控制的机器人Transformer](https:\u002F\u002Frobotics-transformer1.github.io\u002Fassets\u002Frt1.pdf)                                                                                                  |Arxiv|  [github](https:\u002F\u002Frobotics-transformer1.github.io\u002F) |\n| 2022-12-8 |         LLM-Planner         |     俄亥俄州立大学    | [LLM-Planner: 基于少量示例的大语言模型具身智能体接地规划](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.04088.pdf)                                                                                                  |ICCV '23|  [github](https:\u002F\u002Fgithub.com\u002FOSU-NLP-Group\u002FLLM-Planner\u002F) |\n| 
2022-10-11 | CLIP-Fields | NYU, Meta | [CLIP-Fields: 用于机器人记忆的弱监督语义场](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05663.pdf) | RSS '23 | [github](https:\u002F\u002Fgithub.com\u002Fnotmahi\u002Fclip-fields) |\n| 2022-09-20 | NLMap-SayCan | Google | [面向现实世界规划的开放词汇可查询场景表示](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.09874) | ICRA '23 | [github](https:\u002F\u002Fnlmap-saycan.github.io\u002F) |\n\n## 3D 基准测试\n| 日期 | 关键词 | 机构 | 论文 | 发表平台 | 其他 |\n| :-----: | :------------------: | :--------------: | :----------------------------------------------------------------------------------------------------------------------------: | :---------: | :---------: |\n| 2026-01-22 | Real-3DQA | VGG, 牛津大学 | [3D 大型语言模型真的理解 3D 空间关系吗？](https:\u002F\u002Fopenreview.net\u002Fforum?id=3vlMiJwo8b) | ICLR 2026 | [项目](https:\u002F\u002Freal-3dqa.github.io\u002F) |\n| 2025-12-15 | RoboTracer | 北航 | [RoboTracer：通过视觉-语言模型中的推理掌握机器人领域的空间轨迹](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.13660) | Arxiv | [项目](https:\u002F\u002Fzhoues.github.io\u002FRoboTracer\u002F) |\n| 2025-11-28 | ORS3D | 华中科技大学 | [一起烹饪与清洁：教导具身智能体进行并行任务执行](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.19430) | AAAI '26 | [项目](https:\u002F\u002Fh-embodvis.github.io\u002FGRANT\u002F) |\n| 2025-06-04 | RoboRefer | 北航 | [RoboRefer：面向机器人领域的视觉-语言模型中的空间指代推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04308) | Arxiv | [项目](https:\u002F\u002Fzhoues.github.io\u002FRoboRefer\u002F) |\n| 2025-06-09 | SpaCE-10 | 上海交通大学 | [SpaCE-10：多模态大型语言模型在组合式空间智能方面的综合基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.07966) | Arxiv | [项目](https:\u002F\u002Fgithub.com\u002FVisionXLab\u002FSpaCE-10) |\n| 2025-05-01 | Hypo3D | 帝国理工学院 | [Hypo3D：探索 3D 中的假设性推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.00954) | ICML'25 | [项目](https:\u002F\u002Fgithub.com\u002FMatchLab-Imperial\u002FHypo3D) |\n| 2025-06-04 | Anywhere3D | BIGAI | [从物体到任何地方：3D 场景中多层次视觉接地的整体基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04897) | NeurIPS '25 | [项目](https:\u002F\u002Fanywhere-3d.github.io\u002F) |\n| 2025-05-01 | SpatialVQA | 约翰霍普金斯大学 | [SpatialLLM：一种复合的 3D 信息设计，旨在构建具有空间智能的大型多模态模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00788) | CVPR'25 | [项目]() |\n| 2025-04-03 | SPAR | 复旦大学 | [从平面世界到空间：教导视觉-语言模型在 3D 中感知和推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.22976) | Arxiv | [项目](https:\u002F\u002Ffudan-zvg.github.io\u002Fspar) |\n| 2025-03-28 | Beacon3D | BIGAI | [揭开 3D 视觉-语言理解上的迷雾：以对象为中心的链式分析评估](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.22420) | CVPR '25 | [项目](https:\u002F\u002Fbeacon-3d.github.io) |\n| 2025-03-08 | 3D-CoT | 香港理工大学、香港浸会大学 | [整合思维链实现多模态对齐：关于 3D 视觉-语言学习的研究](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.06232) | Arxiv | [数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBattam\u002F3D-CoT) |\n| 2024-09-08 | MSQA \u002F MSNN | BIGAI | [3D 场景中的多模态情境推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.02389) 
| NeurIPS '24| [项目](https:\u002F\u002Fmsr3d.github.io\u002F) |\n| 2024-08-29 | Space3D-Bench | 苏黎世联邦理工学院 | [Space3D-Bench：空间 3D 问答基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.16662) | Arxiv | [项目](https:\u002F\u002Fspace3d-bench.github.io\u002F) |\n| 2024-07-24 | City-3DQA | 香港科技大学 | [用于城市场景理解的 3D 问答](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.17398) | ACM MM'24 | [项目](https:\u002F\u002Fsites.google.com\u002Fview\u002Fcity3dqa\u002F?pli=1) |\n| 2024-06-13 | MMScan | 上海人工智能实验室 | [MMScan：带有分层接地语言标注的多模态 3D 场景数据集](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.09401) | Arxiv | [GitHub](https:\u002F\u002Fgithub.com\u002FOpenRobotLab\u002FEmbodiedScan) |\n| 2024-06-10 | 3D-GRAND \u002F 3D-POPE | 密歇根大学 | [3D-GRAND：为 3D-LLM 提供更好接地且更少幻觉的大规模数据集（百万级）](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.05132.pdf) | Arxiv | [项目](https:\u002F\u002F3d-grand.github.io) |\n| 2024-06-03 | SpatialRGPT-Bench | 加州大学圣地亚哥分校 | [SpatialRGPT：视觉语言模型中的接地空间推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.01584) | NeurIPS '24 | [GitHub](https:\u002F\u002Fgithub.com\u002FAnjieCheng\u002FSpatialRGPT) |\n| 2024-05-27 | Reason3D | 加州大学默塞德分校 | [Reason3D：利用大型语言模型进行 3D 分割的搜索与推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.17427) | 3DV'25 | [项目](https:\u002F\u002Freason3d.github.io\u002F) |\n| 2024-1-18 | SceneVerse | BIGAI | [SceneVerse：扩展 3D 视觉-语言学习，用于接地场景理解](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.09340.pdf) | ECCV '24 | [GitHub](https:\u002F\u002Fgithub.com\u002Fscene-verse\u002Fsceneverse) |\n| 2023-12-26 | EmbodiedScan | 上海人工智能实验室 | [EmbodiedScan：迈向具身 AI 的整体多模态 3D 感知套件](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.16170.pdf) | Arxiv | [GitHub](https:\u002F\u002Fgithub.com\u002FOpenRobotLab\u002FEmbodiedScan) |\n| 2023-12-17 |         M3DBench        |     复旦大学     | [M3DBench：用多模态 3D 提示指导大型模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.10763)                                                                                                  |Arxiv|  [GitHub](https:\u002F\u002Fgithub.com\u002FOpenM3D\u002FM3DBench) |\n| 2023-11-29 |         -         |     DeepMind  | [利用基于 VLM 的流水线标注 3D 对象](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.17851.pdf)                                                                                                  |ICML '24|  [GitHub](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Fobjaverse_annotations) |\n| 2023-09-14 |CrossCoherence   |     UniBO  | [用注意力同时关注文字与点云：文本到形状一致性的基准测试](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.07917)                                                                                                  |ICCV '23|  [GitHub](https:\u002F\u002Fgithub.com\u002FAndreAmaduzzi\u002FCrossCoherence) |\n| 2022-10-14 |     SQA3D     |      BIGAI    | [SQA3D：3D 场景中的情境问答](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07474.pdf)                                                                                                        | ICLR '23| [GitHub](https:\u002F\u002Fgithub.com\u002FSilongYong\u002FSQA3D) |\n| 2022-09-24 |     FE-3DGQA     |      北京航空航天大学    | [迈向可解释的 3D 接地视觉问答：一个新的基准测试和强大的基线](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12028)                                                                                                        | TCSVT | [GitHub](https:\u002F\u002Fgithub.com\u002Fzlccccc\u002F3DVL_Codebase) |\n| 2021-12-20|     ScanQA     |      RIKEN AIP    | [ScanQA：用于空间场景理解的 3D 问答](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10482.pdf)                                                                                         
               | CVPR '23| [GitHub](https:\u002F\u002Fgithub.com\u002FATR-DBI\u002FScanQA) |\n| 2020-12-3 |     Scan2Cap     |      TUM    | [Scan2Cap：RGB-D 扫描中的上下文感知密集字幕](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2012.02206.pdf)                                                                                                        | CVPR '21| [GitHub](https:\u002F\u002Fgithub.com\u002Fdaveredrum\u002FScan2Cap) |\n| 2020-8-23 | ReferIt3D | 斯坦福大学 | [ReferIt3D：用于在现实场景中精细识别 3D 对象的神经听者](https:\u002F\u002Fwww.ecva.net\u002Fpapers\u002Feccv_2020\u002Fpapers_ECCV\u002Fpapers\u002F123460409.pdf) | ECCV '20 | [GitHub](https:\u002F\u002Fgithub.com\u002Freferit3d\u002Freferit3d) |\n| 2019-12-18 |     ScanRefer     |      TUM   | [ScanRefer：使用自然语言在 RGB-D 扫描中定位 3D 对象](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.10482.pdf)                                                                                                        | ECCV '20 | [GitHub](https:\u002F\u002Fdaveredrum.github.io\u002FScanRefer\u002F) |\n\n## 贡献\n\n您的贡献始终受到欢迎！\n\n如果我对某些拉取请求是否适合 3D LLMs 还不确定，我会暂时保持它们开放。您可以通过给这些请求点赞 👍 来表达您的支持。\n\n---\n\n如果您对这份带有明确观点的列表有任何疑问，请通过 xianzheng@robots.ox.ac.uk 或微信 ID: mxz1997112 与我们联系。\n\n## 星标历史\n\n[![星标历史图表](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FActiveVisionLab_Awesome-LLM-3D_readme_748489af54ae.png)](https:\u002F\u002Fstar-history.com\u002F#ActiveVisionLab\u002FAwesome-LLM-3D&Date)\n\n## 引用\n如果您觉得本仓库有用，请考虑引用以下论文：\n```\n@misc{ma2024llmsstep3dworld,\n      title={当大语言模型走进 3D 世界：基于多模态大语言模型的 3D 任务综述与元分析}, \n      author={马贤正、Yash Bhalgat、Brandon Smart、陈帅、李兴辉、丁健、顾金东、陈振宇、彭松友、卞家旺、Philip H Torr、Marc Pollefeys、Matthias Nießner、Ian D Reid、Angel X. Chang、Iro Laina、Victor Adrian Prisacariu},\n      year={2024},\n      journal={arXiv 预印本 arXiv:2405.10255},\n}\n```\n\n## 致谢\n本仓库的灵感来源于 [Awesome-LLM](https:\u002F\u002Fgithub.com\u002FHannibal046\u002FAwesome-LLM?tab=readme-ov-file#other-awesome-lists)。","# Awesome-LLM-3D 快速上手指南\n\n**Awesome-LLM-3D** 并非一个可直接安装的单一软件工具，而是一个由社区维护的**精选论文与开源项目列表**。它汇集了利用大语言模型（LLM）进行 3D 理解、推理、生成及具身智能任务的前沿研究。\n\n本指南将帮助开发者快速浏览该资源库，并选取列表中具体的开源项目（如 `LLaVA-3D`, `SpatialVLM`, `MiniGPT-3D` 等）进行本地部署和使用。\n\n## 1. 环境准备\n\n由于列表中包含多个不同的项目，每个项目的具体依赖略有不同，但大多数基于 PyTorch 和 Hugging Face Transformers 生态。建议先准备以下通用开发环境：\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04\u002F22.04) 或 macOS\n*   **Python**: 3.8 - 3.10 (具体版本需参考选定项目的要求)\n*   **GPU**: 支持 CUDA 的 NVIDIA 显卡 (显存建议 16GB 以上以运行大型多模态模型)\n*   **前置依赖**:\n    *   Git\n    *   Conda 或 Mamba (推荐用于环境管理)\n    *   CUDA Toolkit (版本需与 PyTorch 匹配)\n\n**国内加速建议**：\n在克隆仓库或下载模型时，推荐使用国内镜像源以提升速度：\n*   **代码托管**: 使用 Gitee 镜像（若项目有同步）或通过 `git clone https:\u002F\u002Fghproxy.com\u002Fhttps:\u002F\u002Fgithub.com\u002F...` 加速 GitHub 克隆。\n*   **模型下载**: 配置 `HF_ENDPOINT` 使用 Hugging Face 国内镜像。\n    ```bash\n    export HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n    ```\n*   **PyPI 源**: 使用清华或阿里源安装 Python 包。\n    ```bash\n    pip config set global.index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n    ```\n\n## 2. 获取资源列表\n\n首先，克隆 Awesome-LLM-3D 仓库以获取最新的论文列表和项目链接：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAwesome-LLM-3D.git\ncd Awesome-LLM-3D\n```\n\n> **注意**：请根据实际仓库地址调整 URL，若上述地址不可用，请在 GitHub 搜索 \"Awesome-LLM-3D\" 获取最新源地址。\n\n## 3. 
基本使用流程\n\n由于这是一个索引列表，\"使用\"的核心步骤是**从列表中选择一个感兴趣的项目**，然后进入该项目仓库进行部署。以下以列表中典型的 **LLaVA-3D** 为例演示完整流程：\n\n### 第一步：选择并进入目标项目\n在 `README.md` 的表格中找到目标项目（例如 `LLaVA-3D`），点击其 \"github\" 链接或直接克隆：\n\n```bash\n# 示例：克隆 LLaVA-3D 项目\ngit clone https:\u002F\u002Fgithub.com\u002Fdjiajunustc\u002F3D-LLaVA.git\ncd 3D-LLaVA\n```\n\n### 第二步：创建虚拟环境并安装依赖\n大多数项目会提供 `requirements.txt` 或 `environment.yml`。\n\n```bash\n# 创建 conda 环境\nconda create -n llm3d python=3.10 -y\nconda activate llm3d\n\n# 安装 PyTorch (根据 CUDA 版本选择，此处以 CUDA 11.8 为例)\npip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n\n# 安装项目特定依赖\npip install -r requirements.txt\n```\n\n### 第三步：下载预训练模型\n根据项目文档下载权重文件。利用之前设置的国内镜像加速：\n\n```bash\n# 设置 Hugging Face 镜像\nexport HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n\n# 使用 huggingface-cli 下载 (具体模型名称参考项目文档)\nhuggingface-cli download --resume-download \u003Cmodel_owner>\u002F\u003Cmodel_name> --local-dir .\u002Fweights\n```\n\n### 第四步：运行推理示例\n大多数项目提供简单的推理脚本。以下是一个通用的运行示例结构：\n\n```bash\npython inference.py \\\n    --model_path .\u002Fweights \\\n    --input_3d data\u002Fsample.ply \\\n    --prompt \"Describe the spatial relationship between objects in this scene.\"\n```\n\n## 4. 探索更多方向\n\n回到 `Awesome-LLM-3D` 的主目录，你可以查阅 `README.md` 中的分类表格，快速定位其他方向的项目：\n\n*   **3D 统一理解与生成**: 查看 `3D Unified Understanding and Generation` 章节。\n*   **3D 推理**: 查看 `3D Reasoning` 章节，寻找具备空间推理能力的模型。\n*   **具身智能**: 查看 `3D Embodied Agent` 章节，适用于机器人交互任务。\n*   **基准测试**: 查看 `3D Benchmarks` 章节，获取如 `Real-3DQA` 等评测数据集。\n\n通过这种方式，你可以利用 Awesome-LLM-3D 作为导航图，快速接入全球最新的 3D+LLM 开源生态。","某自动驾驶初创公司的感知算法团队，正致力于研发能够理解复杂三维空间并执行推理任务的具身智能代理（Embodied Agent）。\n\n### 没有 Awesome-LLM-3D 时\n- **文献检索如大海捞针**：研究人员需在 arXiv、GitHub 和各类会议网站间反复切换，手动筛选\"3D 理解”、“空间推理”等关键词，耗时数周仍难以覆盖最新成果。\n- **技术选型缺乏全局视野**：由于缺少统一的分类索引，团队难以区分哪些模型专注于纯几何理解，哪些具备多模态生成能力，导致技术路线规划盲目。\n- **复现成本高昂且易错**：找到的论文往往缺失官方代码链接或基准测试数据，工程师需花费大量时间验证论文的可复现性，甚至因信息滞后而重复造轮子。\n- **前沿动态跟进滞后**：面对该领域每周涌现的新论文，团队无法及时获取如 Real-3DQA 等最新基准测试信息，导致研发进度落后于学术界前沿。\n\n### 使用 Awesome-LLM-3D 后\n- **一站式精准获取资源**：团队直接利用其 curated list，按\"3D Embodied Agent\"或\"3D Reasoning\"分类快速锁定目标论文，将调研周期从数周缩短至几天。\n- **清晰的技术地图指引**：通过明确的表格分类（如统一理解与生成、基础模型对比），团队迅速明确了基于几何 - 语义编码的 UniUGG 等模型最适合当前场景。\n- **高效落地与复现**：列表中每条记录均附带最新的 GitHub 仓库链接和发表会议信息，工程师能直接拉取代码进行微调，大幅降低试错成本。\n- **实时同步顶尖进展**：借助其活跃的维护机制，团队第一时间掌握了 ICLR 2026 上的最新基准测试方法，确保算法评估标准始终处于行业领先地位。\n\nAwesome-LLM-3D 将分散的 3D-LLM 学术资源转化为结构化的工程导航图，让研发团队能从繁琐的信息搜集解放出来，专注于核心算法的突破与创新。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FActiveVisionLab_Awesome-LLM-3D_9bfa1119.png","ActiveVisionLab","Active Vision Laboratory","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FActiveVisionLab_3ed8c61d.png","",null,"http:\u002F\u002Fwww.robots.ox.ac.uk\u002FActiveVision\u002F","https:\u002F\u002Fgithub.com\u002FActiveVisionLab",2171,140,"2026-04-18T01:22:38","MIT","未说明",{"notes":89,"python":87,"dependencies":90},"Awesome-LLM-3D 是一个论文和资源列表（Survey\u002FCollection），而非单一的开源软件工具。README 中列出了多个独立的研究项目（如 Omni-View, UniUGG, 3D-RFT 等），每个项目都有独立的代码仓库和特定的环境需求。用户需访问列表中具体项目的 GitHub 链接以获取相应的运行环境配置。",[],[15,54],"2026-03-27T02:49:30.150509","2026-04-18T17:03:32.287210",[],[]]