[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-NVlabs--VoxFormer":3,"tool-NVlabs--VoxFormer":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":79,"owner_twitter":79,"owner_website":80,"owner_url":81,"languages":82,"stars":103,"forks":104,"last_commit_at":105,"license":106,"difficulty_score":10,"env_os":107,"env_gpu":108,"env_ram":109,"env_deps":110,"category_tags":116,"github_topics":117,"view_count":10,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":132,"updated_at":133,"faqs":134,"releases":170},303,"NVlabs\u002FVoxFormer","VoxFormer","Official PyTorch implementation of VoxFormer [CVPR 2023 Highlight]","VoxFormer 是一个基于 PyTorch 开发的 3D 语义场景补全框架，能够从普通的 RGB 图像中重建出完整的三维场景。\n\n人类天生具有“脑补”能力——即使物体被遮挡，也能想象出它的完整形状。VoxFormer 正是为了赋予 AI 这种能力而设计的。它只需要输入一张或多张 2D 照片，就能输出包含几何结构和语义信息的完整 3D 体素地图。\n\n该框架采用了两阶段设计：首先利用深度估计生成稀疏的可见区域查询，然后通过 Transformer 的自注意力机制将这些信息传播到整个空间，填补被遮挡的区域。这种设计思路非常巧妙，因为 2D 图像只能看到可见部分，直接预测全部体素容易出错，而从可见区域出发再扩展能显著提升准确性。\n\nVoxFormer 在 SemanticKITTI 数据集上取得了领先成绩，几何精度提升 20%，语义分割精度提升 18%，同时将训练时的 GPU 内存需求降低到 16GB 以下。\n\n适合对 3D 场景理解、自动驾驶感知、机器人导航等领域感兴趣的研究人员和开发者使用。如果你需要从 2D 图像构建 3D 场景模型，VoxFormer 是一个值得尝试的工具。","\u003Cdiv align=\"center\">   \n  \n# VoxFormer: a Cutting-edge Baseline for 3D Semantic Occupancy 
Prediction\n\u003C\u002Fdiv>\n\n![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRanked%20%231-Camera--Only%203D%20SSC%20on%20SemanticKITTI-green \"\")\n\n![](.\u002Fteaser\u002Fscene08_13_19.gif \"\")\n\n> **VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion**, CVPR 2023.\n\n> [Yiming Li](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=i_aajNoAAAAJ&view_op=list_works&sortby=pubdate), [Zhiding Yu](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=1VI_oYUAAAAJ&hl=en), [Chris Choy](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=2u8G5ksAAAAJ&hl=en), [Chaowei Xiao](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=Juoqtj8AAAAJ&hl=en), [Jose M. Alvarez](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=Oyx-_UIAAAAJ&hl=en), [Sanja Fidler](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=CUlqK5EAAAAJ&hl=en), [Chen Feng](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=YeG8ZM0AAAAJ&hl=en), [Anima Anandkumar](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=bEcLezcAAAAJ&hl=en)\n\n>  [[PDF]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12251.pdf) [[Project]](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer) [[Intro Video]](https:\u002F\u002Fyoutu.be\u002FKEn8oklzyvo?si=k2V4c22MCCvu9zFr) \n\n\n## News\n- [2023\u002F07]: We release the code of voxformer with 3D deformable attention module, achieving slightly better performance. \n- [2023\u002F06]: 🔥 We release [SSCBench](https:\u002F\u002Fgithub.com\u002Fai4ce\u002FSSCBench), a large-scale semantic scene completion benchmark derived from KITTI-360, nuScenes, and Waymo. 
\n- [2023\u002F06]: Welcome to our CVPR poster session on 21 June (**WED-AM-082**), and check our [online video](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=L0M9ayR316g).\n- [2023\u002F03]: 🔥 VoxFormer is accepted by [CVPR 2023](https:\u002F\u002Fcvpr2023.thecvf.com\u002F) as a highlight paper **(235\u002F9155, 2.5% acceptance rate)**.\n- [2023\u002F02]: Our paper is on [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.12251).\n- [2022\u002F11]: VoxFormer achieves the SOTA on [SemanticKITTI 3D SSC (Semantic Scene Completion) Task](http:\u002F\u002Fwww.semantic-kitti.org\u002Ftasks.html#ssc) with **13.35% mIoU** and **44.15% IoU** (camera-only)!\n\u003C\u002Fbr>\n\n\n## Abstract\nHumans can easily imagine the complete 3D geometry of occluded objects and scenes. This appealing ability is vital for recognition and understanding. To enable such capability in AI systems, we propose VoxFormer, a Transformer-based semantic scene completion framework that can output complete 3D volumetric semantics from only 2D images. Our framework adopts a two-stage design where we start from a sparse set of visible and occupied voxel queries from depth estimation, followed by a densification stage that generates dense 3D voxels from the sparse ones. A key idea of this design is that the visual features on 2D images correspond only to the visible scene structures rather than the occluded or empty spaces. Therefore, starting with the featurization and prediction of the visible structures is more reliable. Once we obtain the set of sparse queries, we apply a masked autoencoder design to propagate the information to all the voxels by self-attention. 
Experiments on SemanticKITTI show that VoxFormer outperforms the state of the art with a relative improvement of 20.0% in geometry and 18.1% in semantics and reduces GPU memory during training by ~45% to less than 16GB.\n\n\n## Method\n\n| ![space-1.jpg](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_VoxFormer_readme_e5eccee3186b.png) | \n|:--:| \n| ***Figure 1. Overall framework of VoxFormer**. Given RGB images, 2D features are extracted by ResNet50 and the depth is estimated by an off-the-shelf depth predictor. The estimated depth after correction enables the class-agnostic query proposal stage: the query located at an occupied position will be selected to carry out deformable cross-attention with image features. Afterwards, mask tokens will be added for completing voxel features by deformable self-attention. The refined voxel features will be upsampled and projected to the output space for per-voxel semantic segmentation. Note that our framework supports the input of single or multiple images.* |\n\n## Getting Started\n- [Installation](docs\u002Finstall.md) \n- [Prepare Dataset](docs\u002Fprepare_dataset.md)\n- [Run and Eval](docs\u002Fgetting_started.md)\n\n## Model Zoo\nThe query proposal network (QPN) for stage-1 is available [here](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1NzN6eqCnuxzau0m_N9B02Q2zwLBKhnBp\u002Fview?usp=share_link).\nFor stage-2, please download the trained models based on the following table.\n\n| Backbone | Method | Lr Schd | IoU| mIoU | Config | Download |\n| :---: | :---: | :---: | :---: | :---:| :---: | :---: |\n| [R50](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1A4Efx7OQ2KVokM1XTbZ6Lf2Q5P-srsyE\u002Fview?usp=share_link) | VoxFormer-T | 20ep | 44.15| 13.35|[config](projects\u002Fconfigs\u002Fvoxformer\u002Fvoxformer-T.py) |[model](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1KOYN3MGHMyCTDZWw4lNNicCdImnKqvlz\u002Fview?usp=share_link) |\n| 
[R50](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1A4Efx7OQ2KVokM1XTbZ6Lf2Q5P-srsyE\u002Fview?usp=share_link) | VoxFormer-S | 20ep | 44.02| 12.35|[config](projects\u002Fconfigs\u002Fvoxformer\u002Fvoxformer-S.py) |[model](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1UBemF77Cfr0d9rcC_Y9Qmjnqp_c4qoeb\u002Fview?usp=share_link)|\n| [R50](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1A4Efx7OQ2KVokM1XTbZ6Lf2Q5P-srsyE\u002Fview?usp=share_link) | VoxFormer-T-3D | 20ep | 44.35| 13.69|[config](projects\u002Fconfigs\u002Fvoxformer\u002Fvoxformer-T_deform3D.py) |[model](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1JQwaO5XXMMkTcF95tCHk45q6PzZnofS6\u002Fview?usp=drive_link)|\n| [R50](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1A4Efx7OQ2KVokM1XTbZ6Lf2Q5P-srsyE\u002Fview?usp=share_link) | VoxFormer-S-3D | 20ep | 44.42| 12.86|[config](projects\u002Fconfigs\u002Fvoxformer\u002Fvoxformer-S_deform3D.py) |[model](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1kwcMGRl9FOprV2k5kqCS0kfvrbCfJMcZ\u002Fview?usp=drive_link)|\n\n \n## Dataset\n\n- [x] SemanticKITTI\n- [ ] KITTI-360\n- [ ] nuScenes\n\n## Bibtex\nIf this work is helpful for your research, please cite the following BibTeX entry.\n\n```\n@InProceedings{li2023voxformer,\n      title={VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion}, \n      author={Li, Yiming and Yu, Zhiding and Choy, Christopher and Xiao, Chaowei and Alvarez, Jose M and Fidler, Sanja and Feng, Chen and Anandkumar, Anima},\n      booktitle = {Proceedings of the IEEE\u002FCVF Conference on Computer Vision and Pattern Recognition (CVPR)},\n      year={2023}\n}\n```\n\n## License\nCopyright © 2022-2023, NVIDIA Corporation and Affiliates. All rights reserved.\n\nThis work is made available under the Nvidia Source Code License-NC. 
Click [here](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer\u002Fblob\u002Fmain\u002FLICENSE) to view a copy of this license.\n\nThe pre-trained models are shared under [CC-BY-NC-SA-4.0](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc-sa\u002F4.0\u002F). If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.\n\nFor business inquiries, please visit our website and submit the form: [NVIDIA Research Licensing](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fresearch\u002Finquiries\u002F).\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_VoxFormer_readme_d9f7f2668500.png)](https:\u002F\u002Fstar-history.com\u002F#NVlabs\u002FVoxFormer)\n\n## Acknowledgement\n\nMany thanks to these excellent open source projects:\n- [BEVFormer](https:\u002F\u002Fgithub.com\u002Ffundamentalvision\u002FBEVFormer)\n- [mmdet3d](https:\u002F\u002Fgithub.com\u002Fopen-mmlab\u002Fmmdetection3d)\n- [MonoScene](https:\u002F\u002Fgithub.com\u002Fastra-vision\u002FMonoScene)\n- [LMSCNet](https:\u002F\u002Fgithub.com\u002Fastra-vision\u002FLMSCNet)\n- [semantic-kitti-api](https:\u002F\u002Fgithub.com\u002FPRBonn\u002Fsemantic-kitti-api) \n- [MobileStereoNet](https:\u002F\u002Fgithub.com\u002Fcogsys-tuebingen\u002Fmobilestereonet)\n- [Pseudo_Lidar_V2](https:\u002F\u002Fgithub.com\u002Fmileyan\u002FPseudo_Lidar_V2)\n- [wysiwyg](https:\u002F\u002Fgithub.com\u002Fpeiyunh\u002Fwysiwyg)\n","\u003Cdiv align=\"center\">   \n  \n# VoxFormer：用于3D语义占用预测的尖端基线\n\u003C\u002Fdiv>\n\n![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRanked%20%231-Camera--Only%203D%20SSC%20on%20SemanticKITTI-green \"\")\n\n![](.\u002Fteaser\u002Fscene08_13_19.gif \"\")\n\n> **VoxFormer：基于相机的3D语义场景补全的稀疏体素Transformer**, CVPR 2023.\n\n> [Yiming Li](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=i_aajNoAAAAJ&view_op=list_works&sortby=pubdate), [Zhiding 
Yu](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=1VI_oYUAAAAJ&hl=en), [Chris Choy](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=2u8G5ksAAAAJ&hl=en), [Chaowei Xiao](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=Juoqtj8AAAAJ&hl=en), [Jose M. Alvarez](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=Oyx-_UIAAAAJ&hl=en), [Sanja Fidler](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=CUlqK5EAAAAJ&hl=en), [Chen Feng](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=YeG8ZM0AAAAJ&hl=en), [Anima Anandkumar](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=bEcLezcAAAAJ&hl=en)\n\n>  [[PDF]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12251.pdf) [[项目]](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer) [[介绍视频]](https:\u002F\u002Fyoutu.be\u002FKEn8oklzyvo?si=k2V4c22MCCvu9zFr) \n\n\n## 最新动态\n- [2023\u002F07]：我们发布了带有3D可变形注意力模块的VoxFormer代码，实现了更好的性能。 \n- [2023\u002F06]：🔥 我们发布了 [SSCBench](https:\u002F\u002Fgithub.com\u002Fai4ce\u002FSSCBench)，这是一个从KITTI-360、nuScenes和Waymo衍生的大规模语义场景补全基准。 \n- [2023\u002F06]：欢迎参加我们6月21日的CVPR海报展示（**WED-AM-082**），并查看我们的[在线视频](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=L0M9ayR316g)。\n- [2023\u002F03]：🔥 VoxFormer被[CVPR 2023](https:\u002F\u002Fcvpr2023.thecvf.com\u002F)接收为亮点论文（**235\u002F9155，2.5%接收率**）。\n- [2023\u002F02]：我们的论文已在[arxiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.12251)上发表。\n- [2022\u002F11]：VoxFormer在[SemanticKITTI 3D SSC（语义场景补全）任务](http:\u002F\u002Fwww.semantic-kitti.org\u002Ftasks.html#ssc)上实现了最先进水平，纯相机方法达到**13.35% mIoU**和**44.15% IoU**！\n\u003C\u002Fbr>\n\n\n## 摘要\n人类可以轻松想象被遮挡物体和场景的完整3D几何结构。这种令人向往的能力对识别和理解至关重要。为了在AI系统中实现这种能力，我们提出了VoxFormer，一个基于Transformer的语义场景补全框架，可以仅从2D图像输出完整的3D体素语义。我们的框架采用两阶段设计：首先从深度估计中获取稀疏的可见和占用体素查询，然后通过密集化阶段从稀疏体素生成密集3D体素。该设计的一个关键思想是，2D图像上的视觉特征仅对应于可见场景结构，而不是被遮挡或空旷的空间。因此，从可见结构的特征化和预测开始更为可靠。一旦获得稀疏查询集，我们采用掩码自编码器设计，通过自注意力将信息传播到所有体素。在SemanticKITTI上的实验表明，VoxFormer在几何结构上实现了20.0%的相对提升，在语义上实现了18.1%的相对提升，并将训练时的GPU内存减少约45%至16GB以下。\n\n\n## 
方法\n\n| ![space-1.jpg](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_VoxFormer_readme_e5eccee3186b.png) | \n|:--:| \n| ***图1. VoxFormer的总体框架**。给定RGB图像，2D特征由ResNet50提取，深度由现成的深度预测器估计。校正后的估计深度启用类别无关的查询提议阶段：位于占用位置的查询将被选中，与图像特征进行可变形交叉注意力。之后，将添加掩码标记，通过可变形自注意力完成体素特征。精细化的体素特征将被上采样并投影到输出空间，进行逐体素语义分割。请注意，我们的框架支持单张或多张图像的输入。* |\n\n## 入门指南\n- [安装](docs\u002Finstall.md) \n- [准备数据集](docs\u002Fprepare_dataset.md)\n- [运行和评估](docs\u002Fgetting_started.md)\n\n## 模型库\n阶段1的查询提议网络（QPN）可在[此处](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1NzN6eqCnuxzau0m_N9B02Q2zwLBKhnBp\u002Fview?usp=share_link)下载。\n对于阶段2，请根据下表下载训练好的模型。\n\n| 骨干网络 | 方法 | 学习率调度 | IoU| mIoU | 配置 | 下载 |\n| :---: | :---: | :---: | :---: | :---:| :---: | :---: |\n| [R50](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1A4Efx7OQ2KVokM1XTbZ6Lf2Q5P-srsyE\u002Fview?usp=share_link) | VoxFormer-T | 20ep | 44.15| 13.35|[配置](projects\u002Fconfigs\u002Fvoxformer\u002Fvoxformer-T.py) |[模型](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1KOYN3MGHMyCTDZWw4lNNicCdImnKqvlz\u002Fview?usp=share_link) |\n| [R50](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1A4Efx7OQ2KVokM1XTbZ6Lf2Q5P-srsyE\u002Fview?usp=share_link) | VoxFormer-S | 20ep | 44.02| 12.35|[配置](projects\u002Fconfigs\u002Fvoxformer\u002Fvoxformer-S.py) |[模型](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1UBemF77Cfr0d9rcC_Y9Qmjnqp_c4qoeb\u002Fview?usp=share_link)|\n| [R50](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1A4Efx7OQ2KVokM1XTbZ6Lf2Q5P-srsyE\u002Fview?usp=share_link) | VoxFormer-T-3D | 20ep | 44.35| 13.69|[配置](projects\u002Fconfigs\u002Fvoxformer\u002Fvoxformer-T_deform3D.py) |[模型](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1JQwaO5XXMMkTcF95tCHk45q6PzZnofS6\u002Fview?usp=drive_link)|\n| [R50](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1A4Efx7OQ2KVokM1XTbZ6Lf2Q5P-srsyE\u002Fview?usp=share_link) | VoxFormer-S-3D | 20ep | 44.42| 
12.86|[配置](projects\u002Fconfigs\u002Fvoxformer\u002Fvoxformer-S_deform3D.py) |[模型](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1kwcMGRl9FOprV2k5kqCS0kfvrbCfJMcZ\u002Fview?usp=drive_link)|\n\n \n## 数据集\n\n- [x] SemanticKITTI\n- [ ] KITTI-360\n- [ ] nuScenes\n\n## 引用格式\n如果这项工作对您的研究有所帮助，请引用以下BibTeX条目。\n\n```\n@InProceedings{li2023voxformer,\n      title={VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion}, \n      author={Li, Yiming and Yu, Zhiding and Choy, Christopher and Xiao, Chaowei and Alvarez, Jose M and Fidler, Sanja and Feng, Chen and Anandkumar, Anima},\n      booktitle = {Proceedings of the IEEE\u002FCVF Conference on Computer Vision and Pattern Recognition (CVPR)},\n      year={2023}\n}\n```\n\n## 许可证\n版权所有 © 2022-2023，NVIDIA Corporation及其关联公司。保留所有权利。\n\n本作品根据Nvidia源代码许可证-NC提供。点击[此处](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer\u002Fblob\u002Fmain\u002FLICENSE)查看许可证副本。\n\n预训练模型根据[CC-BY-NC-SA-4.0](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc-sa\u002F4.0\u002F)共享。如果您混音、转换或构建本材料，您必须根据与原始材料相同的许可证分发您的贡献。\n\n如需商业咨询，请访问我们的网站并提交表单：[NVIDIA研究许可](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fresearch\u002Finquiries\u002F)。\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_VoxFormer_readme_d9f7f2668500.png)](https:\u002F\u002Fstar-history.com\u002F#NVlabs\u002FVoxFormer)\n\n## 致谢\n\n非常感谢这些优秀的开源项目：\n\n- [BEVFormer](https:\u002F\u002Fgithub.com\u002Ffundamentalvision\u002FBEVFormer)\n- [mmdet3d](https:\u002F\u002Fgithub.com\u002Fopen-mmlab\u002Fmmdetection3d)\n- [MonoScene](https:\u002F\u002Fgithub.com\u002Fastra-vision\u002FMonoScene)\n- [LMSCNet](https:\u002F\u002Fgithub.com\u002Fastra-vision\u002FLMSCNet)\n- [semantic-kitti-api](https:\u002F\u002Fgithub.com\u002FPRBonn\u002Fsemantic-kitti-api)\n- [MobileStereoNet](https:\u002F\u002Fgithub.com\u002Fcogsys-tuebingen\u002Fmobilestereonet)\n- 
[Pseudo_Lidar_V2](https:\u002F\u002Fgithub.com\u002Fmileyan\u002FPseudo_Lidar_V2)\n- [wysiwyg](https:\u002F\u002Fgithub.com\u002Fpeiyunh\u002Fwysiwyg)","# VoxFormer 快速上手指南\n\n## 工具简介\n\nVoxFormer 是一个基于 Transformer 的 3D 语义场景补全（Semantic Scene Completion）框架，能够仅从 2D 图像预测完整的 3D 几何和语义信息。该工作在 CVPR 2023 发表，在 SemanticKITTI 数据集上取得了领先成绩。\n\n## 环境准备\n\n### 系统要求\n\n- **操作系统**：Linux（推荐 Ubuntu 18.04 或 20.04）\n- **GPU**：NVIDIA GPU（至少 16GB 显存，推荐 24GB 以上）\n- **CUDA**：11.3 或更高版本\n- **Python**：3.7 - 3.9\n\n### 前置依赖\n\n```bash\n# 创建 conda 环境\nconda create -n voxformer python=3.8\nconda activate voxformer\n\n# 安装 PyTorch 和相关依赖\npip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 -f https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Ftorch_stable.html\n\n# 安装 mmcv 和 mmdetection\npip install mmcv-full==1.6.0 -f https:\u002F\u002Fdownload.openmmlab.com\u002Fmmcv\u002Fdist\u002Fcu113\u002Ftorch1.12.0\u002Findex.html\npip install mmdet==2.24.0\npip install mmsegmentation==0.29.0\n\n# 安装其他依赖\npip install -r requirements.txt\n```\n\n## 安装步骤\n\n### 1. 克隆仓库\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer.git\ncd VoxFormer\n```\n\n### 2. 安装 VoxFormer\n\n```bash\npip install -e .\n```\n\n### 3. 
下载预训练模型\n\n根据需求选择下载以下模型：\n\n| 模型 | IoU | mIoU | 下载链接 |\n|------|-----|------|----------|\n| VoxFormer-T | 44.15 | 13.35 | [model](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1KOYN3MGHMyCTDZWw4lNNicCdImnKqvlz\u002Fview?usp=share_link) |\n| VoxFormer-S | 44.02 | 12.35 | [model](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1UBemF77Cfr0d9rcC_Y9Qmjnqp_c4qoeb\u002Fview?usp=share_link) |\n| VoxFormer-T-3D | 44.35 | 13.69 | [model](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1JQwaO5XXMMkTcF95tCHk45q6PzZnofS6\u002Fview?usp=drive_link) |\n| VoxFormer-S-3D | 44.42 | 12.86 | [model](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1kwcMGRl9FOprV2k5kqCS0kfvrbCfJMcZ\u002Fview?usp=drive_link) |\n\n同时需要下载 [QPN 预训练模型](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1NzN6eqCnuxzau0m_N9B02Q2zwLBKhnBp\u002Fview?usp=share_link)。\n\n### 4. 准备数据集\n\n请参考 [SemanticKITTI 数据集准备文档](docs\u002Fprepare_dataset.md) 进行数据集配置。\n\n## 基本使用\n\n### 训练\n\n```bash\n# 单卡训练\npython tools\u002Ftrain.py projects\u002Fconfigs\u002Fvoxformer\u002Fvoxformer-T.py\n\n# 多卡训练（以 8 卡为例）\npython -m torch.distributed.launch --nproc_per_node=8 tools\u002Ftrain.py projects\u002Fconfigs\u002Fvoxformer\u002Fvoxformer-T.py --launcher pytorch\n```\n\n### 评估\n\n```bash\npython tools\u002Ftest.py projects\u002Fconfigs\u002Fvoxformer\u002Fvoxformer-T.py --checkpoint path\u002Fto\u002Fyour\u002Fcheckpoint.pth --eval mIoU\n```\n\n### 推理演示\n\n```bash\npython demo.py --config projects\u002Fconfigs\u002Fvoxformer\u002Fvoxformer-T.py --checkpoint path\u002Fto\u002Fyour\u002Fcheckpoint.pth --input path\u002Fto\u002Fimages --output path\u002Fto\u002Foutput\n```\n\n## 相关资源\n\n- **论文**：[arXiv PDF](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12251.pdf)\n- **项目主页**：[GitHub](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer)\n- **安装文档**：[docs\u002Finstall.md](docs\u002Finstall.md)\n- **数据集准备**：[docs\u002Fprepare_dataset.md](docs\u002Fprepare_dataset.md)\n- 
**完整教程**：[docs\u002Fgetting_started.md](docs\u002Fgetting_started.md)","# 自动驾驶汽车的环境感知系统\n\n自动驾驶车辆在十字路口左转时，需要准确判断周围环境的完整 3D 结构，包括被建筑物、车辆遮挡的区域（如停放的车辆、骑行人等）。\n\n### 没有 VoxFormer 时\n\n- **感知盲区无法处理**：传统方法只能识别相机可见的物体，对于被遮挡的区域（如建筑物后的行人、停靠车辆后的骑行人）完全无法感知，存在严重的安全隐患\n- **依赖昂贵的激光雷达**：为弥补视觉盲区，需要配备多台激光雷达，硬件成本高昂，且在雨雾天气性能急剧下降\n- **计算资源消耗巨大**：使用多传感器融合方案时，GPU 内存占用常超过 32GB，普通车载计算平台难以承载，导致部署困难\n- **场景重建精度不足**：对复杂城市场景的几何和语义理解有限，难以准确判断可行驶区域，频繁触发保守制动影响通行效率\n\n### 使用 VoxFormer 后\n\n- **单目相机即可预测遮挡场景**：仅凭车载摄像头输入，VoxFormer 就能通过 Transformer 架构推理出被遮挡区域的完整 3D 语义结构，\"看见\"视觉盲区内的物体\n- **大幅降低硬件成本**：仅需普通相机即可实现原本需要激光雷达才能完成的 3D 场景感知，硬件成本降低数万元\n- **GPU 内存占用减少 45%**：在 16GB 以内显存即可完成训练和推理，适配车载嵌入式平台，实现实时部署\n- **精度提升 20%**：在 SemanticKITTI 基准上达到 13.35% mIoU 和 44.15% IoU，准确区分可行驶区域与障碍物，驾驶决策更流畅\n\n### 核心价值\n\nVoxFormer 让自动驾驶车辆仅凭普通相机就能\"想象\"出完整的 3D 世界，在大幅降低成本的同时显著提升感知安全性和决策可靠性。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVlabs_VoxFormer_2042fac9.png","NVlabs","NVIDIA Research Projects","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FNVlabs_fc20d641.jpg","",null,"http:\u002F\u002Fresearch.nvidia.com","https:\u002F\u002Fgithub.com\u002FNVlabs",[83,87,91,95,99],{"name":84,"color":85,"percentage":86},"Python","#3572A5",78.6,{"name":88,"color":89,"percentage":90},"Cuda","#3A4E3A",12.9,{"name":92,"color":93,"percentage":94},"C++","#f34b7d",7.9,{"name":96,"color":97,"percentage":98},"Shell","#89e051",0.5,{"name":100,"color":101,"percentage":102},"CMake","#DA3434",0.1,1185,96,"2026-03-31T21:28:50","NOASSERTION","Linux","需要 NVIDIA GPU，显存 16GB 以下（训练时显存需求降低约 45%）","未说明",{"notes":111,"python":109,"dependencies":112},"基于 PyTorch 的深度学习框架；依赖 mmdetection3d、BEVFormer 等开源项目；需要下载 SemanticKITTI 数据集；需下载预训练模型权重（QPN 网络和 VoxFormer 
模型）；项目提供了详细的安装、数据准备和运行文档（docs\u002Finstall.md、docs\u002Fprepare_dataset.md、docs\u002Fgetting_started.md）",[113,114,115],"mmdet3d","torch","transformers",[26,54,13,15,14],[118,119,120,121,122,123,124,125,126,127,128,129,130,131],"3d-scene-understanding","artificial-intelligence","autonomous-driving","autonomous-vehicles","computer-vision","semantic-scene-completion","vision-transformer","3d-perception","occupancy-grid-map","machine-learning","voxel-proceessing","2d-to-3d","deep-learning","semantickitti","2026-03-27T02:49:30.150509","2026-04-06T06:53:20.814211",[135,140,145,150,155,160,165],{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},1026,"运行测试时出现 'No module named deform3dattn_custom_cn' 错误怎么办？","需要修改 multi_scale_deformable_attn_3D_custom_function.py 文件：1) 注释掉 `# raise NotImplementedError(\"Use sys.path.append here to modify the path to your .so file\")` 这一行；2) 在下一行添加你的 .so 文件路径，如 `sys.path.append(\"\u002Froot\u002FVoxFormer\u002Fdeform_attn_3d\")`。这是因为自定义 CUDA 扩展需要正确添加到 Python 路径中才能被找到。","https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer\u002Fissues\u002F44",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},1027,"如何在个人数据集上测试模型，而不是使用 KITTI 数据集？","最佳方法是将你的数据转换为与 KITTI 数据集相同的格式。项目中提供了预处理工具，可以参考 https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer\u002Ftree\u002Fmain\u002Fpreprocess 目录。开发团队表示正在开发个性化数据的工具包，完成后会发布。","https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer\u002Fissues\u002F7",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},1028,"Stage 2 训练时出现 exitcode: -9 错误是什么原因？","该错误通常是由于内存不足导致的。根据维护者回复，训练 Stage 2 需要 96GB 系统内存（RAM），而不是 VRAM。对于 VRAM，模型只需要不到 18GB。建议检查系统内存是否足够，并尝试增加 swap 空间或关闭其他占用内存的程序。","https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer\u002Fissues\u002F57",{"id":151,"question_zh":152,"answer_zh":153,"source_url":154},1029,"模型输出的 tensor shape 为 [1, 256, 256, 32]，32 个通道代表什么？如何进行体素可视化？","最后一维的 32 并不是语义类别数，而是体素网格的高度维度：SemanticKITTI 语义场景补全任务的体素网格为 256×256×32，每个体素的取值是预测的类别标签（共 20 类，含空类）。要进行可视化，可以使用 PRBonn 团队的 semantic-kitti-api 工具，具体使用其中的 
visualize_voxels.py 脚本。首先需要将预测结果从模型输出转换为 KITTI 格式的 .label 文件，然后使用官方可视化脚本进行渲染。","https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer\u002Fissues\u002F13",{"id":156,"question_zh":157,"answer_zh":158,"source_url":159},1030,"预处理数据时无法生成 query 数据怎么办？","请严格按照项目提供的预处理流程操作，参考 https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer\u002Ftree\u002Fmain\u002Fpreprocess 目录中的说明。确保按照步骤依次执行数据预处理脚本，包括体素生成和查询生成两个阶段。","https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer\u002Fissues\u002F21",{"id":161,"question_zh":162,"answer_zh":163,"source_url":164},1031,"体素可视化效果不如论文中的演示效果，网格不够密集怎么办？","需要将预测标签重新映射到 KITTI 配置格式，然后使用 KITTI 数据集的可视化脚本进行渲染。具体方法是：首先将模型输出的预测结果通过标签映射转换为 SemanticKITTI 格式，然后使用 semantic-kitti-api 中的 visualize_voxels.py 脚本进行可视化，这样可以获得与论文中一致的密集体素效果。","https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer\u002Fissues\u002F37",{"id":166,"question_zh":167,"answer_zh":168,"source_url":169},1032,"代码和预训练模型已经发布了吗？","是的，代码和预训练模型已经发布。感谢您对该工作的关注！关于设备端延迟的问题已列入下一步计划，开发团队理解这很重要，会持续更新相关信息。","https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVoxFormer\u002Fissues\u002F2",[]]