[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-ruili3--awesome-dust3r":3,"tool-ruili3--awesome-dust3r":65},[4,23,32,40,49,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,2,"2026-04-05T10:45:23",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[17,13,20,19,18],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74939,"2026-04-05T23:16:38",[19,13,20,18],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":46,"last_commit_at":47,"category_tags":48,"status":22},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,1,"2026-04-03T21:50:24",[20,18],{"id":50,"name":51,"github_repo":52,"description_zh":53,"stars":54,"difficulty_score":46,"last_commit_at":55,"category_tags":56,"status":22},2234,"scikit-learn","scikit-learn\u002Fscikit-learn","scikit-learn 是一个基于 Python 构建的开源机器学习库，依托于 SciPy、NumPy 等科学计算生态，旨在让机器学习变得简单高效。它提供了一套统一且简洁的接口，涵盖了从数据预处理、特征工程到模型训练、评估及选择的全流程工具，内置了包括线性回归、支持向量机、随机森林、聚类等在内的丰富经典算法。\n\n对于希望快速验证想法或构建原型的数据科学家、研究人员以及 Python 开发者而言，scikit-learn 是不可或缺的基础设施。它有效解决了机器学习入门门槛高、算法实现复杂以及不同模型间调用方式不统一的痛点，让用户无需重复造轮子，只需几行代码即可调用成熟的算法解决分类、回归、聚类等实际问题。\n\n其核心技术亮点在于高度一致的 API 
设计风格，所有估算器（Estimator）均遵循相同的调用逻辑，极大地降低了学习成本并提升了代码的可读性与可维护性。此外，它还提供了强大的模型选择与评估工具，如交叉验证和网格搜索，帮助用户系统地优化模型性能。作为一个由全球志愿者共同维护的成熟项目，scikit-learn 以其稳定性、详尽的文档和活跃的社区支持，成为连接理论学习与工业级应用的最",65628,"2026-04-05T10:10:46",[20,18,14],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":10,"last_commit_at":63,"category_tags":64,"status":22},3364,"keras","keras-team\u002Fkeras","Keras 是一个专为人类设计的深度学习框架，旨在让构建和训练神经网络变得简单直观。它解决了开发者在不同深度学习后端之间切换困难、模型开发效率低以及难以兼顾调试便捷性与运行性能的痛点。\n\n无论是刚入门的学生、专注算法的研究人员，还是需要快速落地产品的工程师，都能通过 Keras 轻松上手。它支持计算机视觉、自然语言处理、音频分析及时间序列预测等多种任务。\n\nKeras 3 的核心亮点在于其独特的“多后端”架构。用户只需编写一套代码，即可灵活选择 TensorFlow、JAX、PyTorch 或 OpenVINO 作为底层运行引擎。这一特性不仅保留了 Keras 一贯的高层易用性，还允许开发者根据需求自由选择：利用 JAX 或 PyTorch 的即时执行模式进行高效调试，或切换至速度最快的后端以获得最高 350% 的性能提升。此外，Keras 具备强大的扩展能力，能无缝从本地笔记本电脑扩展至大规模 GPU 或 TPU 集群，是连接原型开发与生产部署的理想桥梁。",63927,"2026-04-04T15:24:37",[20,14,18],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":82,"owner_twitter":80,"owner_website":83,"owner_url":84,"languages":80,"stars":85,"forks":86,"last_commit_at":87,"license":88,"difficulty_score":89,"env_os":90,"env_gpu":91,"env_ram":91,"env_deps":92,"category_tags":95,"github_topics":96,"view_count":10,"oss_zip_url":80,"oss_zip_packed_at":80,"status":22,"created_at":102,"updated_at":103,"faqs":104,"releases":105},4329,"ruili3\u002Fawesome-dust3r","awesome-dust3r","🌟A curated list of DUSt3R-related papers and resources, tracking recent advancements using this geometric foundation model.","awesome-dust3r 是一个专注于 DUSt3R 及其衍生模型（如 MASt3R）的精选资源库，旨在汇集最新的论文、开源代码、教程视频及技术博客。它主要解决了传统三维重建任务中过度依赖相机参数（如内参和外参）的痛点。在传统流程中，获取这些参数往往繁琐且容易出错，而 DUSt3R 作为一种新兴的几何基础模型，开创了无需预先标定相机即可从任意图像集合中进行稠密三维重建的新范式。\n\n该资源库特别适合计算机视觉领域的研究人员、AI 开发者以及三维技术爱好者使用。无论是希望快速跟进前沿算法的学者，还是寻求高效重建方案的工程师，都能在此找到从理论原理到落地应用的全方位指引。其核心亮点在于梳理了基于 Transformer 架构的创新技术，该技术通过将成对重建转化为点图回归问题，巧妙地统一了单目与双目重建场景，并提供了高效的全局对齐策略。此外，库中还涵盖了动态场景重建、高斯泼溅（Gaussian Splatting）、机器人导航及科学计算等多个扩展方向，持续更新的日志确保了用户能第一时间掌握社区的最新突破，是探索无约束三维视觉技术的理想入口。","\u003Cdiv align=\"center\">\n\u003Ch1>Awesome DUSt3R Resources \u003C\u002Fh1>\n\u003C\u002Fdiv>\n\nA curated list of papers and open-source resources related to DUSt3R\u002FMASt3R, the emerging geometric foundation models empowering a wide span of 3D geometry tasks & applications. PR requests are welcomed, including papers, open-source libraries, blog posts, videos, etc. 
Repo maintained by [@Rui Li](https:\u002F\u002Fx.com\u002Fleedaray), stay tuned for updates!\n\n## Table of contents\n\n- [Seminal Papers of DUSt3R](#seminal-papers-of-dust3r)\n- [Concurrent Works](#concurrent-works)\n\n\u003Cbr>\n\n\n- [3D Reconstruction](#3d-reconstruction)\n- [Dynamic Scene Reconstruction](#dynamic-scene-reconstruction)\n- [3D Scene Reasoning](#scene-reasoning)\n- [Gaussian Splatting](#gaussian-splatting)\n- [Scene Understanding](#scene-understanding)\n- [Robotics](#robotics)\n- [Pose Estimation](#pose-estimation)\n- [DUSt3R for Science](#dust3r-for-science)\n\n\u003Cbr>\n\n- [Related Codebase](#related-codebase)\n- [Blog Posts](#blog-posts)\n- [Tutorial Videos](#tutorial-videos)\n- [Acknowledgements](#acknowledgements)\n\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Update Log:\u003C\u002Fb>\u003C\u002Fsummary>\n\n**Oct 25, 2025**: Add Human3R, Rig3R, SegMASt3R, PLANA3R, TTT3R.\n\u003Cbr>\n**Sep 6, 2025**: Add SAIL-Recon, FastVGGT, HAMSt3R, Vista-SLAM.\n\u003Cbr>\n**Aug 16, 2025**: Add Test3R.\n\u003Cbr>\n**Aug 15, 2025**: Add MoGe-2, S3PO-GS, π^3, LONG3R， VGGT-Long, STream3R, Dens3R, StreamVGG-T, Back-on-Track, and ViPE.\n\u003Cbr>\n**July 9, 2025**: Add Point3R, GeometryCrafter, CryoFastAR.\n\u003Cbr>\n**June 19, 2025**: Add RaySt3R, Amodal3R, Styl3R.\n\u003Cbr>\n**May 6, 2025**: Add [LaRI](https:\u002F\u002Fruili3.github.io\u002Flari\u002Findex.html).\n\u003Cbr>\n**Apr 29, 2025**: Add Pow3R, Mono3R, Easi3R, FlowR, ODHSR, DPM, Geo4D, POMATO, DAS3R.\n\u003Cbr>\n**Mar 20, 2025**: Add Reloc3r, Pos3R, MASt3R-SLAM, Light3R-SfM, VGGT. \n\u003Cbr>\n**Mar 16, 2025**: Add MUSt3R, PE3R.\n\u003Cbr>\n**Jan 24, 2025**: Add CUT3R, Fast3R, EasySplat, MEt3R, Dust-to-Tower. Happy New Year!\n\u003Cbr>\n**Dec 20, 2024**: Add Align3R, PeRF3R, MV-DUSt3R+, Stereo4D, SLAM3R, LoRA3D.\n\u003Cbr>\n**Nov 15, 2024**: Add MoGe, LSM.\n\u003Cbr>\n**Oct 10, 2024**: Add MASt3R-SfM, MonST3R.\n\u003Cbr>\n**Aug 31, 2024**: Add Spurfies, Spann3R, and ReconX.\n\u003Cbr>\n**Aug 29, 2024**: Add Splatt3R, update the code of InstantSplat, etc.\n\u003Cbr>\n**Jun 21, 2024**: Add the newly released MASt3R.\n\u003Cbr>\n**May 31, 2024**: Add a concurrent work Detector-free SfM and a Mini-DUSt3R codebase.\n\u003Cbr>\n**Apr 27, 2024**: Add concurrent works including FlowMap, ACE0, MicKey, and VGGSfM.\n\u003Cbr>\n**Apr 09, 2024**: Initial list with first 3 papers, blogs and videos. \n\n\u003C\u002Fdetails>\n\u003Cbr>\n\n## Seminal Papers of DUSt3R:\n### 1. DUSt3R: Geometric 3D Vision Made Easy ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-CVPR-green)\n**Authors**: Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, Jerome Revaud\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nMulti-view stereo reconstruction (MVS) in the wild requires to first estimate the camera parameters e.g. intrinsic and extrinsic parameters. These are usually tedious and cumbersome to obtain, yet they are mandatory to triangulate corresponding pixels in 3D space, which is the core of all best performing MVS algorithms. In this work, we take an opposite stance and introduce DUSt3R, a radically novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction of arbitrary image collections, i.e. operating without prior information about camera calibration nor viewpoint poses. We cast the pairwise reconstruction problem as a regression of pointmaps, relaxing the hard constraints of usual projective camera models. 
We show that this formulation smoothly unifies the monocular and binocular reconstruction cases. In the case where more than two images are provided, we further propose a simple yet effective global alignment strategy that expresses all pairwise pointmaps in a common reference frame. We base our network architecture on standard Transformer encoders and decoders, allowing us to leverage powerful pretrained models. Our formulation directly provides a 3D model of the scene as well as depth information, but interestingly, we can seamlessly recover from it, pixel matches, relative and absolute camera. Exhaustive experiments on all these tasks showcase that the proposed DUSt3R can unify various 3D vision tasks and set new SoTAs on monocular\u002Fmulti-view depth estimation as well as relative pose estimation. In summary, DUSt3R makes many geometric 3D vision tasks easy.\n\u003C\u002Fdetails>\n  \n [📃 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.14132.pdf) | [🌐 Project Page](https:\u002F\u002Fdust3r.europe.naverlabs.com\u002F) | [⌨️ Code](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fdust3r) | [🎥 Explanation Video](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=JdfrG89iPOA) \n\n\u003Cbr>\n\n\n### 2. Grounding Image Matching in 3D with MASt3R ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Vincent Leroy, Yohann Cabon, Jérôme Revaud\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nImage Matching is a core component of all best-performing algorithms and pipelines in 3D vision. Yet despite matching being fundamentally a 3D problem, intrinsically linked to camera pose and scene geometry, it is typically treated as a 2D problem. This makes sense as the goal of matching is to establish correspondences between 2D pixel fields, but also seems like a potentially hazardous choice. In this work, we take a different stance and propose to cast matching as a 3D task with DUSt3R, a recent and powerful 3D reconstruction framework based on Transformers. Based on pointmaps regression, this method displayed impressive robustness in matching views with extreme viewpoint changes, yet with limited accuracy. We aim here to improve the matching capabilities of such an approach while preserving its robustness. We thus propose to augment the DUSt3R network with a new head that outputs dense local features, trained with an additional matching loss. We further address the issue of quadratic complexity of dense matching, which becomes prohibitively slow for downstream applications if not carefully treated. We introduce a fast reciprocal matching scheme that not only accelerates matching by orders of magnitude, but also comes with theoretical guarantees and, lastly, yields improved results. Extensive experiments show that our approach, coined MASt3R, significantly outperforms the state of the art on multiple matching tasks. In particular, it beats the best published methods by 30% (absolute improvement) in VCRE AUC on the extremely challenging Map-free localization dataset.\n\u003C\u002Fdetails>\n  \n [📃 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.09756) | [🌐 Project Page](https:\u002F\u002Feurope.naverlabs.com\u002Fblog\u002Fmast3r-matching-and-stereo-3d-reconstruction\u002F) | [⌨️ Code](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fmast3r)\n\n\u003Cbr>\n\n\n### 3. 
MASt3R-SfM: a Fully-Integrated Solution for Unconstrained Structure-from-Motion ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Bardienus Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, Jerome Revaud\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nStructure-from-Motion (SfM), a task aiming at jointly recovering camera poses and 3D geometry of a scene given a set of images, remains a hard problem with still many open challenges despite decades of significant progress. The traditional solution for SfM consists of a complex pipeline of minimal solvers which tends to propagate errors and fails when images do not sufficiently overlap, have too little motion, etc. Recent methods have attempted to revisit this paradigm, but we empirically show that they fall short of fixing these core issues. In this paper, we propose instead to build upon a recently released foundation model for 3D vision that can robustly produce local 3D reconstructions and accurate matches. We introduce a low-memory approach to accurately align these local reconstructions in a global coordinate system. We further show that such foundation models can serve as efficient image retrievers without any overhead, reducing the overall complexity from quadratic to linear. Overall, our novel SfM pipeline is simple, scalable, fast and truly unconstrained, i.e. it can handle any collection of images, ordered or not. Extensive experiments on multiple benchmarks show that our method provides steady performance across diverse settings, especially outperforming existing methods in small- and medium-scale settings.\n\u003C\u002Fdetails>\n  \n [📃 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.19152) | [🌐 Project Page](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fmast3r) | [⌨️ Code](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fmast3r)\n\n\u003Cbr>\n\n### 4. CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2022-Neurips-blue)\n**Authors**: Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Romain Brégier, Yohann Cabon, Vaibhav Arora, Leonid Antsfeld, Boris Chidlovskii, Gabriela Csurka, Jérôme Revaud\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nMasked Image Modeling (MIM) has recently been established as a potent pre-training paradigm. A pretext task is constructed by masking patches in an input image, and this masked content is then predicted by a neural network using visible patches as sole input. This pre-training leads to state-of-the-art performance when finetuned for high-level semantic tasks, e.g. image classification and object detection. In this paper we instead seek to learn representations that transfer well to a wide variety of 3D vision and lower-level geometric downstream tasks, such as depth prediction or optical flow estimation. Inspired by MIM, we propose an unsupervised representation learning task trained from pairs of images showing the same scene from different viewpoints. More precisely, we propose the pretext task of cross-view completion where the first input image is partially masked, and this masked content has to be reconstructed from the visible content and the second image. In single-view MIM, the masked content often cannot be inferred precisely from the visible portion only, so the model learns to act as a prior influenced by high-level semantics. 
In contrast, this ambiguity can be resolved with cross-view completion from the second unmasked image, on the condition that the model is able to understand the spatial relationship between the two images. Our experiments show that our pretext task leads to significantly improved performance for monocular 3D vision downstream tasks such as depth estimation. In addition, our model can be directly applied to binocular downstream tasks like optical flow or relative camera pose estimation, for which we obtain competitive results without bells and whistles, i.e., using a generic architecture without any task-specific design.\n\u003C\u002Fdetails>\n  \n [📃 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10716.pdf) | [🌐 Project Page](https:\u002F\u002Fcroco.europe.naverlabs.com\u002Fpublic\u002Findex.html) | [⌨️ Code](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fcroco)\n\n\u003Cbr>\n\n\n### 5. CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2023-ICCV-f5cac3)\n**Authors**: Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy, Yohann Cabon, Vaibhav Arora, Romain Brégier, Gabriela Csurka, Leonid Antsfeld, Boris Chidlovskii, Jérôme Revaud\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nDespite impressive performance for high-level downstream tasks, self-supervised pre-training methods have not yet fully delivered on dense geometric vision tasks such as stereo matching or optical flow. The application of self-supervised concepts, such as instance discrimination or masked image modeling, to geometric tasks is an active area of research. In this work, we build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene which makes it well suited for binocular downstream tasks. The applicability of this concept has so far been limited in at least two ways: (a) by the difficulty of collecting real-world image pairs -- in practice only synthetic data have been used -- and (b) by the lack of generalization of vanilla transformers to dense downstream tasks for which relative position is more meaningful than absolute position. We explore three avenues of improvement. First, we introduce a method to collect suitable real-world image pairs at large scale. Second, we experiment with relative positional embeddings and show that they enable vision transformers to perform substantially better. Third, we scale up vision transformer based cross-completion architectures, which is made possible by the use of large amounts of data. With these improvements, we show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques like correlation volume, iterative estimation, image warping or multi-scale reasoning, thus paving the way towards universal vision models.\n\u003C\u002Fdetails>\n  \n [📃 Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.10408) | [🌐 Project Page](https:\u002F\u002Fcroco.europe.naverlabs.com\u002Fpublic\u002Findex.html) | [⌨️ Code](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fcroco)\n\n\u003Cbr>\n\n\n\n## Concurrent Works:\n## 2024:\n### 1. 
FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient Descent ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Cameron Smith, David Charatan, Ayush Tewari, Vincent Sitzmann\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nThis paper introduces FlowMap, an end-to-end differentiable method that solves for precise camera poses, camera intrinsics, and per-frame dense depth of a video sequence. Our method performs per-video gradient-descent minimization of a simple least-squares objective that compares the optical flow induced by depth, intrinsics, and poses against correspondences obtained via off-the-shelf optical flow and point tracking. Alongside the use of point tracks to encourage long-term geometric consistency, we introduce a differentiable re-parameterization of depth, intrinsics, and pose that is amenable to first-order optimization. We empirically show that camera parameters and dense depth recovered by our method enable photo-realistic novel view synthesis on 360° trajectories using Gaussian Splatting. Our method not only far outperforms prior gradient-descent based bundle adjustment methods, but surprisingly performs on par with COLMAP, the state-of-the-art SfM method, on the downstream task of 360° novel view synthesis - even though our method is purely gradient-descent based, fully differentiable, and presents a complete departure from conventional SfM. Our result opens the door to the self-supervised training of neural networks that perform camera parameter estimation, 3D reconstruction, and novel view synthesis.\n\u003C\u002Fdetails>\n  \n [📃 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.15259) | [🌐 Project Page](https:\u002F\u002Fcameronosmith.github.io\u002Fflowmap\u002F) | [⌨️ Code](https:\u002F\u002Fgithub.com\u002Fdcharatan\u002Fflowmap)\n\n\u003Cbr>\n\n### 2. Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cavallari, Áron Monszpart, Daniyar Turmukhambetov, Victor Adrian Prisacariu\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe address the task of estimating camera parameters from a set of images depicting a scene. Popular feature-based structure-from-motion (SfM) tools solve this task by incremental reconstruction: they repeat triangulation of sparse 3D points and registration of more camera views to the sparse point cloud. We re-interpret incremental structure-from-motion as an iterated application and refinement of a visual relocalizer, that is, of a method that registers new views to the current state of the reconstruction. This perspective allows us to investigate alternative visual relocalizers that are not rooted in local feature matching. We show that scene coordinate regression, a learning-based relocalization approach, allows us to build implicit, neural scene representations from unposed images. Different from other learning-based reconstruction methods, we do not require pose priors nor sequential inputs, and we optimize efficiently over thousands of images. 
Our method, ACE0 (ACE Zero), estimates camera poses to an accuracy comparable to feature-based SfM, as demonstrated by novel view synthesis.\n\u003C\u002Fdetails>\n  \n [📃 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.14351) | [🌐 Project Page](https:\u002F\u002Fnianticlabs.github.io\u002Facezero\u002F) | [⌨️ Code](https:\u002F\u002Fgithub.com\u002Fnianticlabs\u002Facezero)\n\n\u003Cbr>\n\n### 3. Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-CVPR-green)\n**Authors**: Axel Barroso-Laguna, Sowmya Munukutla, Victor Adrian Prisacariu, Eric Brachmann\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nGiven two images, we can estimate the relative camera pose between them by establishing image-to-image correspondences. Usually, correspondences are 2D-to-2D and the pose we estimate is defined only up to scale. Some applications, aiming at instant augmented reality anywhere, require scale-metric pose estimates, and hence, they rely on external depth estimators to recover the scale. We present MicKey, a keypoint matching pipeline that is able to predict metric correspondences in 3D camera space. By learning to match 3D coordinates across images, we are able to infer the metric relative pose without depth measurements. Depth measurements are also not required for training, nor are scene reconstructions or image overlap information. MicKey is supervised only by pairs of images and their relative poses. MicKey achieves state-of-the-art performance on the Map-Free Relocalisation benchmark while requiring less supervision than competing approaches.\n\u003C\u002Fdetails>\n  \n [📃 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.06337) | [🌐 Project Page](https:\u002F\u002Fnianticlabs.github.io\u002Fmickey\u002F) | [⌨️ Code](https:\u002F\u002Fgithub.com\u002Fnianticlabs\u002Fmickey)\n\n\u003Cbr>\n\n### 4. VGGSfM: Visual Geometry Grounded Deep Structure From Motion ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-CVPR-green)\n**Authors**: Jianyuan Wang, Nikita Karaev, Christian Rupprecht, David Novotny\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nStructure-from-motion (SfM) is a long-standing problem in the computer vision community, which aims to reconstruct the camera poses and 3D structure of a scene from a set of unconstrained 2D images. Classical frameworks solve this problem in an incremental manner by detecting and matching keypoints, registering images, triangulating 3D points, and conducting bundle adjustment. Recent research efforts have predominantly revolved around harnessing the power of deep learning techniques to enhance specific elements (e.g., keypoint matching), but are still based on the original, non-differentiable pipeline. Instead, we propose a new deep SfM pipeline VGGSfM, where each component is fully differentiable and thus can be trained in an end-to-end manner. To this end, we introduce new mechanisms and simplifications. First, we build on recent advances in deep 2D point tracking to extract reliable pixel-accurate tracks, which eliminates the need for chaining pairwise matches. Furthermore, we recover all cameras simultaneously based on the image and track features instead of gradually registering cameras. Finally, we optimise the cameras and triangulate 3D points via a differentiable bundle adjustment layer. 
We attain state-of-the-art performance on three popular datasets, CO3D, IMC Phototourism, and ETH3D.\n\u003C\u002Fdetails>\n  \n [📃 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.04563) | [🌐 Project Page](https:\u002F\u002Fvggsfm.github.io\u002F) | [⌨️ Code](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fvggsfm)\n\n\u003Cbr>\n\n### 5. Detector-Free Structure from Motion ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-CVPR-green)\n**Authors**: Xingyi He, Jiaming Sun, Yifan Wang, Sida Peng, Qixing Huang, Hujun Bao, Xiaowei Zhou\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe propose a new structure-from-motion framework to recover accurate camera poses and point clouds from unordered images. Traditional SfM systems typically rely on the successful detection of repeatable keypoints across multiple views as the first step, which is difficult for texture-poor scenes, and poor keypoint detection may break down the whole SfM system. We propose a new detector-free SfM framework to draw benefits from the recent success of detector-free matchers to avoid the early determination of keypoints, while solving the multi-view inconsistency issue of detector-free matchers. Specifically, our framework first reconstructs a coarse SfM model from quantized detector-free matches. Then, it refines the model by a novel iterative refinement pipeline, which iterates between an attention-based multi-view matching module to refine feature tracks and a geometry refinement module to improve the reconstruction accuracy. Experiments demonstrate that the proposed framework outperforms existing detector-based SfM systems on common benchmark datasets. We also collect a texture-poor SfM dataset to demonstrate the capability of our framework to reconstruct texture-poor scenes. Based on this framework, we take first place in Image Matching Challenge 2023.\n\u003C\u002Fdetails>\n  \n [📃 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.15669) | [🌐 Project Page](https:\u002F\u002Fzju3dv.github.io\u002FDetectorFreeSfM\u002F) | [⌨️ Code](https:\u002F\u002Fgithub.com\u002Fzju3dv\u002FDetectorFreeSfM)\n\n\u003Cbr>\n\n\n\n\n## 3D Reconstruction:\n\n\n## 2025:\n### 1. SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**Authors**: Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yingda Yin, Yanchao Yang, Qingnan Fan, Baoquan Chen\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nIn this paper, we introduce SLAM3R, a novel and effective system for real-time, high-quality, dense 3D reconstruction using RGB videos. SLAM3R provides an end-to-end solution by seamlessly integrating local 3D reconstruction and global coordinate registration through feed-forward neural networks. Given an input video, the system first converts it into overlapping clips using a sliding window mechanism. Unlike traditional pose optimization-based methods, SLAM3R directly regresses 3D pointmaps from RGB images in each window and progressively aligns and deforms these local pointmaps to create a globally consistent scene reconstruction - all without explicitly solving any camera parameters. Experiments across datasets consistently show that SLAM3R achieves state-of-the-art reconstruction accuracy and completeness while maintaining real-time performance at 20+ FPS. [Code](https:\u002F\u002Fgithub.com\u002FPKU-VCL-3DV\u002FSLAM3R). 
\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.09401) | [💻 Code](https:\u002F\u002Fgithub.com\u002FPKU-VCL-3DV\u002FSLAM3R)\n\n\u003Cbr>\n\n\n### 2. MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**Authors**: Riku Murai, Eric Dexheimer, Andrew J. Davison\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe present a real-time monocular dense SLAM system designed bottom-up from MASt3R, a two-view 3D reconstruction and matching prior. Equipped with this strong prior, our system is robust on in-the-wild video sequences despite making no assumption on a fixed or parametric camera model beyond a unique camera centre. We introduce efficient methods for pointmap matching, camera tracking and local fusion, graph construction and loop closure, and second-order global optimisation. With known calibration, a simple modification to the system achieves state-of-the-art performance across various benchmarks. Altogether, we propose a plug-and-play monocular SLAM system capable of producing globally-consistent poses and dense geometry while operating at 15 FPS.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.12392) | [🌐 Project Page](https:\u002F\u002Fedexheim.github.io\u002Fmast3r-slam\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002Frmurai0610\u002FMASt3R-SLAM) \n\n\u003Cbr>\n\n\n### 3. MEt3R: Measuring Multi-View Consistency in Generated Images ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**Authors**: Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, Jan Eric Lenssen\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe introduce MEt3R, a metric for multi-view consistency in generated images. Large-scale generative models for multi-view image generation are rapidly advancing the field of 3D inference from sparse observations. However, due to the nature of generative modeling, traditional reconstruction metrics are not suitable to measure the quality of generated outputs and metrics that are independent of the sampling procedure are desperately needed. In this work, we specifically address the aspect of consistency between generated multi-view images, which can be evaluated independently of the specific scene. Our approach uses DUSt3R to obtain dense 3D reconstructions from image pairs in a feed-forward manner, which are used to warp image contents from one view into the other. Then, feature maps of these images are compared to obtain a similarity score that is invariant to view-dependent effects. Using MEt3R, we evaluate the consistency of a large set of previous methods for novel view and video generation, including our open, multi-view latent diffusion model.\n\u003C\u002Fdetails>\n  \n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.06336) | [🌐 Project Page](https:\u002F\u002Fgeometric-rl.mpi-inf.mpg.de\u002Fmet3r\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002Fmohammadasim98\u002FMEt3R)\n\n\n\u003Cbr>\n\n\n### 4. Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**Authors**: Jianing Yang, Alexander Sax, Kevin J. 
Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, Matt Feiszli\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nMulti-view 3D reconstruction remains a core challenge in computer vision, particularly in applications requiring accurate and scalable representations across diverse perspectives. Current leading methods such as DUSt3R employ a fundamentally pairwise approach, processing images in pairs and necessitating costly global alignment procedures to reconstruct from multiple views. In this work, we propose Fast 3D Reconstruction (Fast3R), a novel multi-view generalization to DUSt3R that achieves efficient and scalable 3D reconstruction by processing many views in parallel. Fast3R's Transformer-based architecture forwards N images in a single forward pass, bypassing the need for iterative alignment. Through extensive experiments on camera pose estimation and 3D reconstruction, Fast3R demonstrates state-of-the-art performance, with significant improvements in inference speed and reduced error accumulation. These results establish Fast3R as a robust alternative for multi-view applications, offering enhanced scalability without compromising reconstruction accuracy.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.13928) | [🌐 Project Page](https:\u002F\u002Ffast3r-3d.github.io\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffast3r)\n\u003Cbr>\n\u003Cbr>\n\n\n### 5. Light3R-SfM: Towards Feed-forward Structure-from-Motion ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**Authors**: Sven Elflein, Qunjie Zhou, Sérgio Agostinho, Laura Leal-Taixé\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe present Light3R-SfM, a feed-forward, end-to-end learnable framework for efficient large-scale Structure-from-Motion (SfM) from unconstrained image collections. Unlike existing SfM solutions that rely on costly matching and global optimization to achieve accurate 3D reconstructions, Light3R-SfM addresses this limitation through a novel latent global alignment module. This module replaces traditional global optimization with a learnable attention mechanism, effectively capturing multi-view constraints across images for robust and precise camera pose estimation. Light3R-SfM constructs a sparse scene graph via retrieval-score-guided shortest path tree to dramatically reduce memory usage and computational overhead compared to the naive approach. Extensive experiments demonstrate that Light3R-SfM achieves competitive accuracy while significantly reducing runtime, making it ideal for 3D reconstruction tasks in real-world applications with a runtime constraint. This work pioneers a data-driven, feed-forward SfM approach, paving the way toward scalable, accurate, and efficient 3D reconstruction in the wild.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.14914)\n\n\u003Cbr>\n\n\n### 6. 
MUSt3R: Multi-view Network for Stereo 3D Reconstruction ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**Authors**: Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, Vincent Leroy\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nDUSt3R introduced a novel paradigm in geometric computer vision by proposing a model that can provide dense and unconstrained Stereo 3D Reconstruction of arbitrary image collections with no prior information about camera calibration nor viewpoint poses. Under the hood, however, DUSt3R processes image pairs, regressing local 3D reconstructions that need to be aligned in a global coordinate system. The number of pairs, growing quadratically, is an inherent limitation that becomes especially concerning for robust and fast optimization in the case of large image collections. In this paper, we propose an extension of DUSt3R from pairs to multiple views, that addresses all aforementioned concerns. Indeed, we propose a Multi-view Network for Stereo 3D Reconstruction, or MUSt3R, that modifies the DUSt3R architecture by making it symmetric and extending it to directly predict 3D structure for all views in a common coordinate frame. Second, we entail the model with a multi-layer memory mechanism which allows to reduce the computational complexity and to scale the reconstruction to large collections, inferring thousands of 3D pointmaps at high frame-rates with limited added complexity. The framework is designed to perform 3D reconstruction both offline and online, and hence can be seamlessly applied to SfM and visual SLAM scenarios showing state-of-the-art performance on various 3D downstream tasks, including uncalibrated Visual Odometry, relative camera pose, scale and focal estimation, 3D reconstruction and multi-view depth estimation.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2503.01661) | [🌐 Project Page](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fmust3r) | [💻 Code](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fmust3r)\n\n\u003Cbr>\n\n\n\n\n\n### 7. VGGT: Visual Geometry Grounded Transformer ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**Authors**: Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, David Novotny\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second, and still outperforming alternatives that require post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. 
We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.11651) | [🌐 Project Page](https:\u002F\u002Fvgg-t.github.io\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fvggt) | [🤗 Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Ffacebook\u002Fvggt)\n\n\u003Cbr>\n\n\n\n\n### 8. Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**Authors**: Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, Jerome Revaud\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe present Pow3r, a novel large 3D vision regression model that is highly versatile in the input modalities it accepts. Unlike previous feed-forward models that lack any mechanism to exploit known camera or scene priors at test time, Pow3r incorporates any combination of auxiliary information such as intrinsics, relative pose, dense or sparse depth, alongside input images, within a single network. Building upon the recent DUSt3R paradigm, a transformer-based architecture that leverages powerful pre-training, our lightweight and versatile conditioning acts as additional guidance for the network to predict more accurate estimates when auxiliary information is available. During training we feed the model with random subsets of modalities at each iteration, which enables the model to operate under different levels of known priors at test time. This in turn opens up new capabilities, such as performing inference in native image resolution, or point-cloud completion. Our experiments on 3D reconstruction, depth completion, multi-view depth prediction, multi-view stereo, and multi-view pose estimation tasks yield state-of-the-art results and confirm the effectiveness of Pow3r at exploiting all available information.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.17316) | [🌐 Project Page](https:\u002F\u002Feurope.naverlabs.com\u002Fresearch\u002Fpublications\u002Fpow3r-empowering-unconstrained-3d-reconstruction-with-camera-and-scene-priors\u002F)\n\n\u003Cbr>\n\n\n\n\n### 9. Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Wenyu Li, Sidun Liu, Peng Qiao, Yong Dou\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nRecent advances in data-driven geometric multi-view 3D reconstruction foundation models (e.g., DUSt3R) have shown remarkable performance across various 3D vision tasks, facilitated by the release of large-scale, high-quality 3D datasets. However, as we observed, constrained by their matching-based principles, the reconstruction quality of existing models suffers significant degradation in challenging regions with limited matching cues, particularly in weakly textured areas and low-light conditions. To mitigate these limitations, we propose to harness the inherent robustness of monocular geometry estimation to compensate for the inherent shortcomings of matching-based methods. Specifically, we introduce a monocular-guided refinement module that integrates monocular geometric priors into multi-view reconstruction frameworks. 
This integration substantially enhances the robustness of multi-view reconstruction systems, leading to high-quality feed-forward reconstructions. Comprehensive experiments across multiple benchmarks demonstrate that our method achieves substantial improvements in both multi-view camera pose estimation and point cloud accuracy.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.13419)\n\n\u003Cbr>\n\n\n### 10. Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Yuqi Wu, Wenzhao Zheng, Jie Zhou, Jiwen Lu\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nDense 3D scene reconstruction from an ordered sequence or unordered image collections is a critical step when bringing research in computer vision into practical scenarios. Following the paradigm introduced by DUSt3R, which unifies an image pair densely into a shared coordinate system, subsequent methods maintain an implicit memory to achieve dense 3D reconstruction from more images. However, such implicit memory is limited in capacity and may suffer from information loss of earlier frames. We propose Point3R, an online framework targeting dense streaming 3D reconstruction. To be specific, we maintain an explicit spatial pointer memory directly associated with the 3D structure of the current scene. Each pointer in this memory is assigned a specific 3D position and aggregates scene information nearby in the global coordinate system into a changing spatial feature. Information extracted from the latest frame interacts explicitly with this pointer memory, enabling dense integration of the current observation into the global coordinate system. We design a 3D hierarchical position embedding to promote this interaction and design a simple yet effective fusion mechanism to ensure that our pointer memory is uniform and efficient. Our method achieves competitive or state-of-the-art performance on various tasks with low training costs.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.02863) | [🌐 Project Page](https:\u002F\u002Fykiwu.github.io\u002FPoint3R\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002FYkiWu\u002FPoint3R)\n\n\u003Cbr>\n\n\n\n### 11. MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, Jiaolong Yang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe propose MoGe-2, an advanced open-domain geometry estimation model that recovers a metric scale 3D point map of a scene from a single image. Our method builds upon the recent monocular geometry estimation approach, MoGe, which predicts affine-invariant point maps with unknown scales. We explore effective strategies to extend MoGe for metric geometry prediction without compromising the relative geometry accuracy provided by the affine-invariant point representation. Additionally, we discover that noise and errors in real data diminish fine-grained detail in the predicted geometry. We address this by developing a unified data refinement approach that filters and completes real data from different sources using sharp synthetic labels, significantly enhancing the granularity of the reconstructed geometry while maintaining the overall accuracy. 
We train our model on a large corpus of mixed datasets and conduct comprehensive evaluations, demonstrating its superior performance in achieving accurate relative geometry, precise metric scale, and fine-grained detail recovery -- capabilities that no previous methods have simultaneously achieved.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.02546) | [🌐 Project Page](https:\u002F\u002Fwangrc.site\u002FMoGe2Page\u002F) | [💻 Code ](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmoge) | [🤗 Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FRuicheng\u002FMoGe-2)\n\n\u003Cbr>\n\n\n\n### 12. Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-ICCV-pink)\n**Authors**: Chong Cheng, Sicheng Yu, Zijian Wang, Yifan Zhou, Hao Wang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\n3D Gaussian Splatting (3DGS) has become a popular solution in SLAM due to its high-fidelity and real-time novel view synthesis performance. However, some previous 3DGS SLAM methods employ a differentiable rendering pipeline for tracking but lack geometric priors in outdoor scenes. Other approaches introduce separate tracking modules, but they accumulate errors with significant camera movement, leading to scale drift. To address these challenges, we propose a robust RGB-only outdoor 3DGS SLAM method: S3PO-GS. Technically, we establish a self-consistent tracking module anchored in the 3DGS pointmap, which avoids cumulative scale drift and achieves more precise and robust tracking with fewer iterations. Additionally, we design a patch-based pointmap dynamic mapping module, which introduces geometric priors while avoiding scale ambiguity. This significantly enhances tracking accuracy and the quality of scene reconstruction, making it particularly suitable for complex outdoor environments. Our experiments on the Waymo, KITTI, and DL3DV datasets demonstrate that S3PO-GS achieves state-of-the-art results in novel view synthesis and outperforms other 3DGS SLAM methods in tracking accuracy.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.03737) | [🌐 Project Page](https:\u002F\u002F3dagentworld.github.io\u002FS3PO-GS\u002F) | [💻 Code ](https:\u002F\u002Fgithub.com\u002F3DAgentWorld\u002FS3PO-GS)\n\n\u003Cbr>\n\n\n\n### 13. π^3: Scalable Permutation-Equivariant Visual Geometry Learning ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe introduce π^3, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, π^3 employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design makes our model inherently robust to input ordering and highly scalable. 
These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular\u002Fvideo depth estimation, and dense point map reconstruction. Code and models are publicly available.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.13347) | [🌐 Project Page](https:\u002F\u002Fyyfz.github.io\u002Fpi3\u002F) | [💻 Code ](https:\u002F\u002Fgithub.com\u002Fyyfz\u002FPi3) | [🤗 Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fyyfz233\u002FPi3)\n\n\u003Cbr>\n\n\n\n### 14. LONG3R: Long Sequence Streaming 3D Reconstruction ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-ICCV-pink)\n**Authors**: Zhuoguang Chen, Minghui Qin, Tianyuan Yuan, Zhe Liu, Hang Zhao\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nRecent advancements in multi-view scene reconstruction have been significant, yet existing methods face limitations when processing streams of input images. These methods either rely on time-consuming offline optimization or are restricted to shorter sequences, hindering their applicability in real-time scenarios. In this work, we propose LONG3R (LOng sequence streaming 3D Reconstruction), a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences. Our model achieves real-time processing by operating recurrently, maintaining and updating memory with each new observation. We first employ a memory gating mechanism to filter relevant memory, which, together with a new observation, is fed into a dual-source refined decoder for coarse-to-fine interaction. To effectively capture long-sequence memory, we propose a 3D spatio-temporal memory that dynamically prunes redundant spatial information while adaptively adjusting resolution along the scene. To enhance our model's performance on long sequences while maintaining training efficiency, we employ a two-stage curriculum training strategy, each stage targeting specific capabilities. Experiments demonstrate that LONG3R outperforms state-of-the-art streaming methods, particularly for longer sequences, while maintaining real-time inference speed.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2507.18255) | [🌐 Project Page](https:\u002F\u002Fzgchen33.github.io\u002FLONG3R\u002F) | [💻 Code ](https:\u002F\u002Fzgchen33.github.io\u002FLONG3R\u002F)\n\n\u003Cbr>\n\n\n\n### 15. VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-ICCV-pink)\n**Authors**: Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, Jin Xie\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nFoundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. 
Without requiring camera calibration, depth supervision or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene reconstruction in real-world settings, especially for autonomous driving scenarios.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.16443) | [💻 Code ](https:\u002F\u002Fgithub.com\u002FDengKaiCQ\u002FVGGT-Long)\n\n\u003Cbr>\n\n\n### 16. STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, Xingang Pan\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.10893v1) | [🌐 Project Page](https:\u002F\u002Fnirvanalan.github.io\u002Fprojects\u002Fstream3r\u002F)| [💻 Code ](https:\u002F\u002Fgithub.com\u002FNIRVANALAN\u002FSTream3R)\n\n\u003Cbr>\n\n\n\n### 17. Dens3R: A Foundation Model for 3D Geometry Prediction ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Xianze Fang, Jingnan Gao, Zhe Wang, Zhuo Chen, Xingyu Ren, Jiangjing Lyu, Qiaomu Ren, Zhonglei Yang, Xiaokang Yang, Yichao Yan, Chengfei Lyu\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nRecent advances in dense 3D reconstruction have led to significant progress, yet achieving accurate unified geometric prediction remains a major challenge. Most existing methods are limited to predicting a single geometry quantity from input images. However, geometric quantities such as depth, surface normals, and point maps are inherently correlated, and estimating them in isolation often fails to ensure consistency, thereby limiting both accuracy and practical applicability. 
This motivates us to explore a unified framework that explicitly models the structural coupling among different geometric properties to enable joint regression. In this paper, we present Dens3R, a 3D foundation model designed for joint geometric dense prediction and adaptable to a wide range of downstream tasks. Dens3R adopts a two-stage training framework to progressively build a pointmap representation that is both generalizable and intrinsically invariant. Specifically, we design a lightweight shared encoder-decoder backbone and introduce position-interpolated rotary positional encoding to maintain expressive power while enhancing robustness to high-resolution inputs. By integrating image-pair matching features with intrinsic invariance modeling, Dens3R accurately regresses multiple geometric quantities such as surface normals and depth, achieving consistent geometry perception from single-view to multi-view inputs. Additionally, we propose a post-processing pipeline that supports geometrically consistent multi-view inference. Extensive experiments demonstrate the superior performance of Dens3R across various dense 3D prediction tasks and highlight its potential for broader applications.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.16290) | [🌐 Project Page](https:\u002F\u002Fg-1nonly.github.io\u002FDens3R\u002F)| [💻 Code ](https:\u002F\u002Fgithub.com\u002FG-1nOnly\u002FDens3R)\n\n\u003Cbr>\n\n\n\n### 18. ViPE: Video Pose Engine for 3D Geometric Perception ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixe, Sanja Fidler\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nAccurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, or dashcams, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We have benchmarked ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18%\u002F50% on TUM\u002FKITTI sequences, and runs at 3-5FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos. This collection includes around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames – all annotated with accurate camera poses and dense depth maps. 
We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Ftoronto-ai\u002Fvipe\u002Fassets\u002Fpaper.pdf) | [🌐 Project Page](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Ftoronto-ai\u002Fvipe\u002F)| [💻 Code ](https:\u002F\u002Fgithub.com\u002Fnv-tlabs\u002Fvipe?tab=readme-ov-file)\n\n\u003Cbr>\n\n### 19. Test3R: Learning to Reconstruct 3D at Test Time ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Yuheng Yuan, Qiuhong Shen, Shizun Wang, Xingyi Yang, Xinchao Wang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nDense matching methods like DUSt3R regress pairwise pointmaps for 3D reconstruction. However, the reliance on pairwise prediction and the limited generalization capability inherently restrict the global geometric consistency. In this work, we introduce \\textbf{Test3R}, a surprisingly simple test-time learning technique that significantly boosts geometric accuracy. Using image triplets ($I_1,I_2,I_3$), Test3R generates reconstructions from pairs ($I_1,I_2$) and ($I_1,I_3$). The core idea is to optimize the network at test time via a self-supervised objective: maximizing the geometric consistency between these two reconstructions relative to the common image $I_1$. This ensures the model produces cross-pair consistent outputs, regardless of the inputs. Extensive experiments demonstrate that our technique significantly outperforms previous state-of-the-art methods on the 3D reconstruction and multi-view depth estimation tasks. Moreover, it is universally applicable and nearly cost-free, making it easily applied to other models and implemented with minimal test-time training overhead and parameter footprint.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.13750) | [🌐 Project Page](https:\u002F\u002Ftest3r-nop.github.io\u002F)| [💻 Code ](https:\u002F\u002Fgithub.com\u002FnopQAQ\u002FTest3R)\n\n\u003Cbr>\n\n\n\n\n### 20. SAIL-Recon: Large SfM by Augmenting Scene Regression with Localization ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Junyuan Deng, Heng Li, Tao Xie, Weiqiang Ren, Qian Zhang, Ping Tan, Xiaoyang Guo\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nScene regression methods, such as VGGT, solve the Structure-from-Motion (SfM) problem by directly regressing camera poses and 3D scene structures from input images. They demonstrate impressive performance in handling images under extreme viewpoint changes. However, these methods struggle to handle a large number of input images. To address this problem, we introduce SAIL-Recon, a feed-forward Transformer for large scale SfM, by augmenting the scene regression network with visual localization capabilities. Specifically, our method first computes a neural scene representation from a subset of anchor images. The regression network is then fine-tuned to reconstruct all input images conditioned on this neural scene representation. Comprehensive experiments show that our method not only scales efficiently to large-scale scenes, but also achieves state-of-the-art results on both camera pose estimation and novel view synthesis benchmarks, including TUM-RGBD, CO3Dv2, and Tanks & Temples. 
We will publish our model and code.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.17972) | [🌐 Project Page](https:\u002F\u002Fhkust-sail.github.io\u002Fsail-recon\u002F)| [💻 Code ](https:\u002F\u002Fgithub.com\u002FHKUST-SAIL\u002Fsail-recon)\n\n\u003Cbr>\n\n\n### 21. FastVGGT: Training-Free Acceleration of Visual Geometry Transformer ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: You Shen, Zhipeng Zhang, Yansong Qu, Liujuan Cao\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nFoundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, scaling these models to long-sequence image inputs remains a significant challenge due to inference-time inefficiency. In this work, we present a detailed analysis of VGGT, a state-of-the-art feed-forward visual geometry model, and identify its primary bottleneck. Visualization further reveals a token collapse phenomenon in the attention maps. Motivated by these findings, we explore the potential of token merging in the feed-forward visual geometry model. Owing to the unique architectural and task-specific properties of 3D models, directly applying existing merging techniques proves challenging. To this end, we propose FastVGGT, which, for the first time, leverages token merging in the 3D domain through a training-free mechanism for accelerating VGGT. We devise a unique token partitioning strategy tailored to 3D architectures and tasks, effectively eliminating redundant computation while preserving VGGT's powerful reconstruction capacity. Extensive experiments on multiple 3D geometry benchmarks validate the effectiveness of our approach. Notably, with 1000 input images, FastVGGT achieves a 4x speedup over VGGT while mitigating error accumulation in long-sequence scenarios. These findings underscore the potential of token merging as a principled solution for scalable 3D vision systems. \n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.02560v1) | [🌐 Project Page](https:\u002F\u002Fmystorm16.github.io\u002Ffastvggt\u002F)| [💻 Code ](https:\u002F\u002Fgithub.com\u002Fmystorm16\u002FFastVGGT)\n\n\u003Cbr>\n\n\n### 22. HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-ICCV-pink)\n**Authors**: Sara Rojas, Matthieu Armando, Bernard Ghanem, Philippe Weinzaepfel, Vincent Leroy, Gregory Rogez\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nRecovering the 3D geometry of a scene from a sparse set of uncalibrated images is a long-standing problem in computer vision. While recent learning-based approaches such as DUSt3R and MASt3R have demonstrated impressive results by directly predicting dense scene geometry, they are primarily trained on outdoor scenes with static environments and struggle to handle human-centric scenarios. In this work, we introduce HAMSt3R, an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated multi-view images. First, we exploit DUNE, a strong image encoder obtained by distilling, among others, the encoders from MASt3R and from a state-of-the-art Human Mesh Recovery (HMR) model, multi-HMR, for a better understanding of scene geometry and human bodies. 
Our method then incorporates additional network heads to segment people, estimate dense correspondences via DensePose, and predict depth in human-centric environments, enabling a more comprehensive 3D reconstruction. By leveraging the outputs of our different heads, HAMSt3R produces a dense point map enriched with human semantic information in 3D. Unlike existing methods that rely on complex optimization pipelines, our approach is fully feed-forward and efficient, making it suitable for real-world applications. We evaluate our model on EgoHumans and EgoExo4D, two challenging benchmarks containing diverse human-centric scenarios. Additionally, we validate its generalization to traditional multi-view stereo and multi-view pose regression tasks. Our results demonstrate that our method can reconstruct humans effectively while preserving strong performance in general 3D reconstruction tasks, bridging the gap between human and scene understanding in 3D vision.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2508.16433)\n\n\u003Cbr>\n\n\n\n\n### 23. ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Ganlin Zhang, Shenhan Qian, Xi Wang, Daniel Cremers\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe present ViSTA-SLAM as a real-time monocular visual SLAM system that operates without requiring camera intrinsics, making it broadly applicable across diverse camera setups. At its core, the system employs a lightweight symmetric two-view association (STA) model as the frontend, which simultaneously estimates relative camera poses and regresses local pointmaps from only two RGB images. This design reduces model complexity significantly (the size of our frontend is only 35\\% that of comparable state-of-the-art methods) while enhancing the quality of two-view constraints used in the pipeline. In the backend, we construct a specially designed Sim(3) pose graph that incorporates loop closures to address accumulated drift. Extensive experiments demonstrate that our approach achieves superior performance in both camera tracking and dense 3D reconstruction quality compared to current methods.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.01584) | [🌐 Project Page](https:\u002F\u002Fganlinzhang.xyz\u002Fvista-slam\u002F)| [💻 Code ](https:\u002F\u002Fgithub.com\u002Fzhangganlin\u002Fvista-slam)\n\n\u003Cbr>\n\n\n### 24. Rig3R: Rig-Aware Conditioning for Learned 3D Reconstruction ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Samuel Li, Pujith Kachana, Prajwal Chidananda, Saurabh Nair, Yasutaka Furukawa, Matthew Brown\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nEstimating agent pose and 3D scene structure from multi-camera rigs is a central task in embodied AI applications such as autonomous driving. Recent learned approaches such as DUSt3R have shown impressive performance in multiview settings. However, these models treat images as unstructured collections, limiting effectiveness in scenarios where frames are captured from synchronized rigs with known or inferable structure. To this end, we introduce Rig3R, a generalization of prior multiview reconstruction models that incorporates rig structure when available, and learns to infer it when not. 
Rig3R conditions on optional rig metadata including camera ID, time, and rig poses to develop a rig-aware latent space that remains robust to missing information. It jointly predicts pointmaps and two types of raymaps: a pose raymap relative to a global frame, and a rig raymap relative to a rig-centric frame consistent across time. Rig raymaps allow the model to infer rig structure directly from input images when metadata is missing. Rig3R achieves state-of-the-art performance in 3D reconstruction, camera pose estimation, and rig discovery, outperforming both traditional and learned methods by 17-45% mAA across diverse real-world rig datasets, all in a single forward pass without post-processing or iterative refinement.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.02265) | [🌐 Project Page](https:\u002F\u002Fwayve.ai\u002Fthinking\u002Frig3r\u002F)\n\n\u003Cbr>\n\n\n\n### 25. PLANA3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Changkun Liu, Bin Tan, Zeran Ke, Shangzhan Zhang, Jiachen Liu, Ming Qian, Nan Xue, Yujun Shen, Tristan Braud\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nThis paper addresses metric 3D reconstruction of indoor scenes by exploiting their inherent geometric regularities with compact representations. Using planar 3D primitives - a well-suited representation for man-made environments - we introduce PLANA3R, a pose-free framework for metric Planar 3D Reconstruction from unposed two-view images. Our approach employs Vision Transformers to extract a set of sparse planar primitives, estimate relative camera poses, and supervise geometry learning via planar splatting, where gradients are propagated through high-resolution rendered depth and normal maps of primitives. Unlike prior feedforward methods that require 3D plane annotations during training, PLANA3R learns planar 3D structures without explicit plane supervision, enabling scalable training on large-scale stereo datasets using only depth and normal annotations. We validate PLANA3R on multiple indoor-scene datasets with metric supervision and demonstrate strong generalization to out-of-domain indoor environments across diverse tasks under metric evaluation protocols, including 3D surface reconstruction, depth estimation, and relative pose estimation. Furthermore, by formulating with planar 3D representation, our method emerges with the ability for accurate plane segmentation. \n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.18714) | [🌐 Project Page](https:\u002F\u002Flck666666.github.io\u002Fplana3r\u002F)| [💻 Code ](https:\u002F\u002Fgithub.com\u002Flck666666\u002Fplana3r)\n\n\u003Cbr>\n\n\n### 26. TTT3R: 3D Reconstruction as Test-Time Training ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nModern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. 
In this work, we revisit the 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, to balance between retaining historical information and adapting to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a  improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.26645) | [🌐 Project Page](https:\u002F\u002Frover-xingyu.github.io\u002FTTT3R\u002F)| [💻 Code ](rover-xingyu.github.io\u002FTTT3R\u002F)\n\n\u003Cbr>\n\n\n## 2024:\n### 1. Spurfies: Sparse Surface Reconstruction using Local Geometry Priors ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Kevin Raj, Christopher Wewer, Raza Yunus, Eddy Ilg, Jan Eric Lenssen\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe introduce Spurfies, a novel method for sparse-view surface reconstruction that disentangles appearance and geometry information to utilize local geometry priors trained on synthetic data. Recent research heavily focuses on 3D reconstruction using dense multi-view setups, typically requiring hundreds of images. However, these methods often struggle with few-view scenarios. Existing sparse-view reconstruction techniques often rely on multi-view stereo networks that need to learn joint priors for geometry and appearance from a large amount of data. In contrast, we introduce a neural point representation that disentangles geometry and appearance to train a local geometry prior using a subset of the synthetic ShapeNet dataset only. During inference, we utilize this surface prior as additional constraint for surface and appearance reconstruction from sparse input views via differentiable volume rendering, restricting the space of possible solutions. We validate the effectiveness of our method on the DTU dataset and demonstrate that it outperforms previous state of the art by 35% in surface quality while achieving competitive novel view synthesis quality. Moreover, in contrast to previous works, our method can be applied to larger, unbounded scenes, such as Mip-NeRF 360.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.16544) | [🌐 Project Page](https:\u002F\u002Fgeometric-rl.mpi-inf.mpg.de\u002Fspurfies\u002Findex.html) | [💻 Code ](https:\u002F\u002Fgithub.com\u002FkevinYitshak\u002Fspurfies)\n\n\u003Cbr>\n\n\n### 2. 3D Reconstruction with Spatial Memory ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Hengyi Wang, Lourdes Agapito\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe present Spann3R, a novel approach for dense 3D reconstruction from ordered or unordered image collections. Built on the DUSt3R paradigm, Spann3R uses a transformer-based architecture to directly regress pointmaps from images without any prior knowledge of the scene or camera parameters. 
Unlike DUSt3R, which predicts per image-pair pointmaps each expressed in its local coordinate frame, Spann3R can predict per-image pointmaps expressed in a global coordinate system, thus eliminating the need for optimization-based global alignment. The key idea of Spann3R is to manage an external spatial memory that learns to keep track of all previous relevant 3D information. Spann3R then queries this spatial memory to predict the 3D structure of the next frame in a global coordinate system. Taking advantage of DUSt3R's pre-trained weights, and further fine-tuning on a subset of datasets, Spann3R shows competitive performance and generalization ability on various unseen datasets and can process ordered image collections in real time.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.16061) | [🌐 Project Page](https:\u002F\u002Fhengyiwang.github.io\u002Fprojects\u002Fspanner) | [💻 Code](https:\u002F\u002Fgithub.com\u002FHengyiWang\u002Fspann3r)\n\n\u003Cbr>\n\n### 3. ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, Yueqi Duan\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nAdvancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from insufficient captured views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. However, 3D view consistency struggles to be accurately preserved in directly generated video frames from pre-trained models. To address this, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by the condition, the video diffusion model then synthesizes video frames that are both detail-preserved and exhibit a high degree of 3D consistency, ensuring the coherence of the scene from various perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of our ReconX over state-of-the-art methods in terms of quality and generalizability.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.16767) | [🌐 Project Page](https:\u002F\u002Fliuff19.github.io\u002FReconX\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002Fliuff19\u002FReconX)\n\n\u003Cbr>\n\n\n\n\n### 4. 
MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, Jiaolong Yang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe present MoGe, a powerful model for recovering 3D geometry from monocular open-domain images. Given a single image, our model directly predicts a 3D point map of the captured scene with an affine-invariant representation, which is agnostic to true global scale and shift. This new representation precludes ambiguous supervision in training and facilitate effective geometry learning. Furthermore, we propose a set of novel global and local geometry supervisions that empower the model to learn high-quality geometry. These include a robust, optimal, and efficient point cloud alignment solver for accurate global shape learning, and a multi-scale local geometry loss promoting precise local geometry supervision. We train our model on a large, mixed dataset and demonstrate its strong generalizability and high accuracy. In our comprehensive evaluation on diverse unseen datasets, our model significantly outperforms state-of-the-art methods across all tasks, including monocular estimation of 3D point map, depth map, and camera field of view. Code and models will be released on our project page.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.19115) | [🌐 Project Page](https:\u002F\u002Fwangrc.site\u002FMoGePage\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmoge) | [🎮 Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FRuicheng\u002FMoGe)\n\n\n\u003Cbr>\n\n### 5. MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, Jiaolong Yang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nRecent sparse multi-view scene reconstruction advances like DUSt3R and MASt3R no longer require camera calibration and camera pose estimation. However, they only process a pair of views at a time to infer pixel-aligned pointmaps. When dealing with more than two views, a combinatorial number of error prone pairwise reconstructions are usually followed by an expensive global optimization, which often fails to rectify the pairwise reconstruction errors. To handle more views, reduce errors, and improve inference time, we propose the fast single-stage feed-forward network MV-DUSt3R. At its core are multi-view decoder blocks which exchange information across any number of views while considering one reference view. To make our method robust to reference view selection, we further propose MV-DUSt3R+, which employs cross-reference-view blocks to fuse information across different reference view choices. To further enable novel view synthesis, we extend both by adding and jointly training Gaussian splatting heads. Experiments on multi-view stereo reconstruction, multi-view pose estimation, and novel view synthesis confirm that our methods improve significantly upon prior art. 
Code will be released.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.06974) | [🌐 Project Page](https:\u002F\u002Fmv-dust3rp.github.io\u002F) | [💻 Code ](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fmvdust3r)\n\n\u003Cbr>\n\n\n\n\n### 6. LoRA3D: Low-Rank Self-Calibration of 3D Geometric Foundation Models ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Ziqi Lu, Heng Yang, Danfei Xu, Boyi Li, Boris Ivanovic, Marco Pavone, Yue Wang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nEmerging 3D geometric foundation models, such as DUSt3R, offer a promising approach for in-the-wild 3D vision tasks. However, due to the high-dimensional nature of the problem space and scarcity of high-quality 3D data, these pre-trained models still struggle to generalize to many challenging circumstances, such as limited view overlap or low lighting. To address this, we propose LoRA3D, an efficient self-calibration pipeline to specialize the pre-trained models to target scenes using their own multi-view predictions. Taking sparse RGB images as input, we leverage robust optimization techniques to refine multi-view predictions and align them into a global coordinate frame. In particular, we incorporate prediction confidence into the geometric optimization process, automatically re-weighting the confidence to better reflect point estimation accuracy. We use the calibrated confidence to generate high-quality pseudo labels for the calibrating views and use low-rank adaptation (LoRA) to fine-tune the models on the pseudo-labeled data. Our method does not require any external priors or manual labels. It completes the self-calibration process on a single standard GPU within just 5 minutes. Each low-rank adapter requires only 18MB of storage. We evaluated our method on more than 160 scenes from the Replica, TUM and Waymo Open datasets, achieving up to 88% performance improvement on 3D reconstruction, multi-view pose estimation and novel-view rendering.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.07746)\n\n\u003Cbr>\n\n\n## Dynamic Scene Reconstruction:\n\n## 2025:\n\n### 1. Continuous 3D Perception Model with Persistent State ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, Angjoo Kanazawa\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe present a unified framework capable of solving a broad range of 3D tasks. Our approach features a stateful recurrent model that continuously updates its state representation with each new observation. Given a stream of images, this evolving state can be used to generate metric-scale pointmaps (per-pixel 3D points) for each new input in an online fashion. These pointmaps reside within a common coordinate system, and can be accumulated into a coherent, dense scene reconstruction that updates as new images arrive. Our model, called CUT3R (Continuous Updating Transformer for 3D Reconstruction), captures rich priors of real-world scenes: not only can it predict accurate pointmaps from image observations, but it can also infer unseen regions of the scene by probing at virtual, unobserved views. Our method is simple yet highly flexible, naturally accepting varying lengths of images that may be either video streams or unordered photo collections, containing both static and dynamic content. 
We evaluate our method on various 3D\u002F4D tasks and demonstrate competitive or state-of-the-art performance in each. \n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.12387) | [🌐 Project Page](https:\u002F\u002Fcut3r.github.io\u002F) | [💻 Code (to be released)](https:\u002F\u002Fcut3r.github.io\u002F)\n\n\u003Cbr>\n\n\n\n### 2. Easi3R: Estimating Disentangled Motion from DUSt3R Without Training ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nRecent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets. In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model. This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depths. In this work, we take an opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion. By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction. Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or finetuned on extensive dynamic datasets.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.24391) | [🌐 Project Page](https:\u002F\u002Feasi3r.github.io\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002FInception3D\u002FEasi3R)\n\n\u003Cbr>\n\n\n\n### 3. ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-ligntgreen)\n**Authors**: Zetong Zhang, Manuel Kaufmann, Lixin Xue, Jie Song, Martin R. Oswald\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nCreating a photorealistic scene and human reconstruction from a single monocular in-the-wild video figures prominently in the perception of a human-centric 3D world. Recent neural rendering advances have enabled holistic human-scene reconstruction but require pre-calibrated camera and human poses, and days of training time. In this work, we introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation and human-scene reconstruction in an online fashion. 3D Gaussian Splatting is utilized to learn Gaussian primitives for humans and scenes efficiently, and reconstruction-based camera tracking and human pose estimation modules are designed to enable holistic understanding and effective disentanglement of pose and appearance. Specifically, we design a human deformation module to reconstruct the details and enhance generalizability to out-of-distribution poses faithfully. 
Aiming to learn the spatial correlation between human and scene accurately, we introduce occlusion-aware human silhouette rendering and monocular geometric priors, which further improve reconstruction quality. Experiments on the EMDB and NeuMan datasets demonstrate superior or on-par performance with existing methods in camera tracking, human pose estimation, novel view synthesis and runtime.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.13167) | [🌐 Project Page](https:\u002F\u002Feth-ait.github.io\u002FODHSR\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002Feth-ait\u002FODHSR)\n\n\u003Cbr>\n\n\n### 4. Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Edgar Sucar, Zihang Lai, Eldar Insafutdinov, Andrea Vedaldi\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nDUSt3R has recently shown that one can reduce many tasks in multi-view geometry, including estimating camera intrinsics and extrinsics, reconstructing the scene in 3D, and establishing image correspondences, to the prediction of a pair of viewpoint-invariant point maps, i.e., pixel-aligned point clouds defined in a common reference frame. This formulation is elegant and powerful, but unable to tackle dynamic scenes. To address this challenge, we introduce the concept of Dynamic Point Maps (DPM), extending standard point maps to support 4D tasks such as motion segmentation, scene flow estimation, 3D object tracking, and 2D correspondence. Our key intuition is that, when time is introduced, there are several possible spatial and time references that can be used to define the point maps. We identify a minimal subset of such combinations that can be regressed by a network to solve the sub tasks mentioned above. We train a DPM predictor on a mixture of synthetic and real data and evaluate it across diverse benchmarks for video depth prediction, dynamic point cloud reconstruction, 3D scene flow and object pose tracking, achieving state-of-the-art performance.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.16318) | [🌐 Project Page](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fresearch\u002Fdynamic-point-maps\u002F)\n\n\u003Cbr>\n\n\n\n### 5. Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic prior captured by such video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, depth, and ray maps. It uses a new multi-modal alignment algorithm to align and fuse these modalities, as well as multiple sliding windows, at inference time, thus obtaining robust and accurate 4D reconstruction of long videos. 
Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods, including recent methods such as MonST3R, which are also designed to handle dynamic scenes.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.07961) | [🌐 Project Page](https:\u002F\u002Fgeo4d.github.io\u002F) | | [💻 Code](https:\u002F\u002Fgithub.com\u002Fjzr99\u002FGeo4D)\n\n\u003Cbr>\n\n\n\n### 6. POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D Reconstruction ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Songyan Zhang, Yongtao Ge, Jinyuan Tian, Guangkai Xu, Hao Chen, Chen Lv, Chunhua Shen\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\n3D reconstruction in dynamic scenes primarily relies on the combination of geometry estimation and matching modules where the latter task is pivotal for distinguishing dynamic regions which can help to mitigate the interference introduced by camera and object motion. Furthermore, the matching module explicitly models object motion, enabling the tracking of specific targets and advancing motion understanding in complex scenarios. Recently, the proposed representation of pointmap in DUSt3R suggests a potential solution to unify both geometry estimation and matching in 3D space, but it still struggles with ambiguous matching in dynamic regions, which may hamper further improvement. In this work, we present POMATO, a unified framework for dynamic 3D reconstruction by marrying pointmap matching with temporal motion. Specifically, our method first learns an explicit matching relationship by mapping RGB pixels from both dynamic and static regions across different views to 3D pointmaps within a unified coordinate system. Furthermore, we introduce a temporal motion module for dynamic motions that ensures scale consistency across different frames and enhances performance in tasks requiring both precise geometry and reliable matching, most notably 3D point tracking. We show the effectiveness of the proposed pointmap matching and temporal fusion paradigm by demonstrating the remarkable performance across multiple downstream tasks, including video depth estimation, 3D point tracking, and pose estimation.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.05692) | [💻 Code](https:\u002F\u002Fgithub.com\u002Fwyddmw\u002FPOMATO)\n\n\u003Cbr>\n\n\n### 7. GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-ICCV-pink)\n**Authors**: Tian-Xing Xu, Xiangjun Gao, Wenbo Hu, Xiaoyu Li, Song-Hai Zhang, Ying Shan\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nDespite remarkable advancements in video depth estimation, existing methods exhibit inherent limitations in achieving geometric fidelity through the affine-invariant predictions, limiting their applicability in reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity point map sequences with temporal coherence from open-world videos, enabling accurate 3D\u002F4D reconstruction, camera parameter estimation, and other depth-based applications. 
At the core of our approach lies a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions for effective point map encoding and decoding. Leveraging the VAE, we train a video diffusion model to model the distribution of point map sequences conditioned on the input videos. Extensive evaluations on diverse datasets demonstrate that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.01016) | [🌐 Project Page](https:\u002F\u002Fgeometrycrafter.github.io\u002F) | | [💻 Code](https:\u002F\u002Fgithub.com\u002FTencentARC\u002FGeometryCrafter) | [🤗 Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FTencentARC\u002FGeometryCrafter)\n\n\u003Cbr>\n\n\n### 8. Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-ICCV-pink)\n**Authors**: Weirong Chen, Ganlin Zhang, Felix Wimbauer, Rui Wang, Nikita Araslanov, Andrea Vedaldi, Daniel Cremers\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nTraditional SLAM systems, which rely on bundle adjustment, struggle with highly dynamic scenes commonly found in casual videos. Such videos entangle the motion of dynamic elements, undermining the assumption of static environments required by traditional systems. Existing techniques either filter out dynamic elements or model their motion independently. However, the former often results in incomplete reconstructions, whereas the latter can lead to inconsistent motion estimates. Taking a novel approach, this work leverages a 3D point tracker to separate the camera-induced motion from the observed motion of dynamic objects. By considering only the camera-induced component, bundle adjustment can operate reliably on all scene elements as a result. We further ensure depth consistency across video frames with lightweight post-processing based on scale maps. Our framework combines the core of traditional SLAM -- bundle adjustment -- with a robust learning-based 3D tracker front-end. Integrating motion decomposition, bundle adjustment and depth refinement, our unified framework, BA-Track, accurately tracks the camera motion and produces temporally coherent and scale-consistent dense reconstructions, accommodating both static and dynamic elements. Our experiments on challenging datasets reveal significant improvements in camera pose estimation and 3D reconstruction accuracy.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.14516) | [🌐 Project Page](https:\u002F\u002Fwrchen530.github.io\u002Fprojects\u002Fbatrack\u002F) | | [💻 Code (coming soon)](https:\u002F\u002Fwrchen530.github.io\u002Fprojects\u002Fbatrack\u002F)\n\n\u003Cbr>\n\n\n### 9. Streaming 4D Visual Geometry Transformer ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-ICCV-pink)\n**Authors**: Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, Jiwen Lu\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nPerceiving and reconstructing 4D spatial-temporal geometry from videos is a fundamental yet challenging computer vision task. To facilitate interactive and real-time applications, we propose a streaming 4D visual geometry transformer that shares a similar philosophy with autoregressive large language models. 
We explore a simple and efficient design and employ a causal transformer architecture to process the input sequence in an online manner. We use temporal causal attention and cache the historical keys and values as implicit memory to enable efficient streaming long-term 4D reconstruction. This design can handle real-time 4D reconstruction by incrementally integrating historical information while maintaining high-quality spatial consistency. For efficient training, we propose to distill knowledge from the dense bidirectional visual geometry grounded transformer (VGGT) to our causal model. For inference, our model supports the migration of optimized efficient attention operator (e.g., FlashAttention) from the field of large language models. Extensive experiments on various 4D geometry perception benchmarks demonstrate that our model increases the inference speed in online scenarios while maintaining competitive performance, paving the way for scalable and interactive 4D vision systems.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.11539) | [🌐 Project Page](https:\u002F\u002Fwzzheng.net\u002FStreamVGGT\u002F) | | [💻 Code](https:\u002F\u002Fgithub.com\u002Fwzzheng\u002FStreamVGGT) | [🤗 Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flch01\u002FStreamVGGT)\n\n\u003Cbr>\n\n\n\n\n\n### 10. Human3R: Everyone Everywhere All at Once ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Yue Chen, Xingyu Chen, Yuxuan Xue, Anpei Chen, Yuliang Xiu, Gerard Pons-Moll\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe present Human3R, a unified, feed-forward framework for online 4D human-scene reconstruction, in the world frame, from casually captured monocular videos. Unlike previous approaches that rely on multi-stage pipelines, iterative contact-aware refinement between humans and scenes, and heavy dependencies, e.g., human detection, depth estimation, and SLAM pre-processing, Human3R jointly recovers global multi-person SMPL-X bodies (\"everyone\"), dense 3D scene (\"everywhere\"), and camera trajectories in a single forward pass (\"all-at-once\"). Our method builds upon the 4D online reconstruction model CUT3R, and uses parameter-efficient visual prompt tuning, to strive to preserve CUT3R's rich spatiotemporal priors, while enabling direct readout of multiple SMPL-X bodies. Human3R is a unified model that eliminates heavy dependencies and iterative refinement. After being trained on the relatively small-scale synthetic dataset BEDLAM for just one day on one GPU, it achieves superior performance with remarkable efficiency: it reconstructs multiple humans in a one-shot manner, along with 3D scenes, in one stage, at real-time speed (15 FPS) with a low memory footprint (8 GB). Extensive experiments demonstrate that Human3R delivers state-of-the-art or competitive performance across tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation, with a single unified model. We hope that Human3R will serve as a simple yet strong baseline, be easily adapted for downstream applications.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.06219) | [🌐 Project Page](https:\u002F\u002Ffanegg.github.io\u002FHuman3R\u002F)| [💻 Code ](https:\u002F\u002Fgithub.com\u002Ffanegg\u002FHuman3R)\n\n\u003Cbr>\n\n\n\u003Cbr>\n\n## 2024:\n### 1. 
MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, Ming-Hsuan Yang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nEstimating geometry from dynamic scenes, where objects move and deform over time, remains a core challenge in computer vision. Current approaches often rely on multi-stage pipelines or global optimizations that decompose the problem into subtasks, like depth and flow, leading to complex systems prone to errors. In this paper, we present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes. Our key insight is that by simply estimating a pointmap for each timestep, we can effectively adapt DUST3R's representation, previously only used for static scenes, to dynamic scenes. However, this approach presents a significant challenge: the scarcity of suitable training data, namely dynamic, posed videos with depth labels. Despite this, we show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics, even without an explicit motion representation. Based on this, we introduce new optimizations for several downstream video-specific tasks and demonstrate strong performance on video depth and camera pose estimation, outperforming prior work in terms of robustness and efficiency. Moreover, MonST3R shows promising results for primarily feed-forward 4D reconstruction.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.03825) | [🌐 Project Page](https:\u002F\u002Fmonst3r-project.github.io\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002FJunyi42\u002Fmonst3r)\n\n\u003Cbr>\n\n### 2. Align3R: Aligned Monocular Depth Estimation for Dynamic Videos ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-ligntgreen)\n**Authors**: Jiahao Lu, Tianyu Huang, Peng Li, Zhiyang Dou, Cheng Lin, Zhiming Cui, Zhen Dong, Sai-Kit Yeung, Wenping Wang, Yuan Liu\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nRecent developments in monocular depth estimation methods enable high-quality depth estimation of single-view images but fail to estimate consistent video depth across different frames. Recent works address this problem by applying a video diffusion model to generate video depth conditioned on the input video, which is training-expensive and can only produce scale-invariant depth values without camera poses. In this paper, we propose a novel video-depth estimation method called Align3R to estimate temporal consistent depth maps for a dynamic video. Our key idea is to utilize the recent DUSt3R model to align estimated monocular depth maps of different timesteps. First, we fine-tune the DUSt3R model with additional estimated monocular depth as inputs for the dynamic scenes. Then, we apply optimization to reconstruct both depth maps and camera poses. 
Extensive experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video with superior performance compared to baseline methods.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.03079) | [🌐 Project Page](https:\u002F\u002Figl-hkust.github.io\u002FAlign3R.github.io\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002Fjiah-cloud\u002FAlign3R)\n\n\u003Cbr>\n\n\n\n### 3. Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, Aleksander Holynski\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nLearning to understand dynamic 3D scenes from imagery is crucial for applications ranging from robotics to scene reconstruction. Yet, unlike other problems where large-scale supervised training has enabled rapid progress, directly supervising methods for recovering 3D motion remains challenging due to the fundamental difficulty of obtaining ground truth annotations. We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos. Our system fuses and filters the outputs of camera pose estimation, stereo depth estimation, and temporal tracking methods into high-quality dynamic 3D reconstructions. We use this method to generate large-scale data in the form of world-consistent, pseudo-metric 3D point clouds with long-term motion trajectories. We demonstrate the utility of this data by training a variant of DUSt3R to predict structure and 3D motion from real-world image pairs, showing that training on our reconstructed data enables generalization to diverse real-world scenes.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.09621) | [🌐 Project Page](https:\u002F\u002Fstereo4d.github.io\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002FStereo4d\u002Fstereo4d-code)\n\n\u003Cbr>\n\n\n\n### 4. DAS3R: Dynamics-Aware Gaussian Splatting for Static Scene Reconstruction ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Kai Xu, Tze Ho Elden Tse, Jizong Peng, Angela Yao\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe propose a novel framework for scene decomposition and static background reconstruction from everyday videos. By integrating the trained motion masks and modeling the static scene as Gaussian splats with dynamics-aware optimization, our method achieves more accurate background reconstruction results than previous works. Our proposed method is termed DAS3R, an abbreviation for Dynamics-Aware Gaussian Splatting for Static Scene Reconstruction. Compared to existing methods, DAS3R is more robust in complex motion scenarios, capable of handling videos where dynamic objects occupy a significant portion of the scene, and does not require camera pose inputs or point cloud data from SLAM-based methods.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.19584v1) | [🌐 Project Page](https:\u002F\u002Fkai422.github.io\u002FDAS3R\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002Fkai422\u002Fdas3r)\n
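\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Illustrative sketch\u003C\u002Fb>\u003C\u002Fsummary>\n\nAs a rough illustration of the dynamics-aware optimization described above, the sketch below masks the photometric loss with the predicted motion masks so that only static pixels supervise the Gaussian splats. It is a minimal sketch under assumed inputs: the `renderer`, the `frames` iterator and the plain L1 weighting are illustrative placeholders, not the DAS3R implementation.\n\n```python\nimport torch\n\ndef masked_photometric_loss(pred_rgb, gt_rgb, motion_mask):\n    # motion_mask: [H, W] with 1 marking dynamic pixels that must not supervise the splats\n    static = (1.0 - motion_mask).unsqueeze(-1)            # [H, W, 1]\n    return (pred_rgb - gt_rgb).abs().mul(static).mean()\n\ndef optimize_static_splats(gaussians, frames, renderer, iters=2000, lr=1e-2):\n    # gaussians: dict of learnable tensors; renderer: placeholder differentiable splatter\n    opt = torch.optim.Adam([p for p in gaussians.values() if p.requires_grad], lr=lr)\n    for _ in range(iters):\n        for cam, rgb, mask in frames:                     # rgb: [H, W, 3], mask: [H, W]\n            loss = masked_photometric_loss(renderer(gaussians, cam), rgb, mask)\n            opt.zero_grad()\n            loss.backward()\n            opt.step()\n    return gaussians\n```\n\n\u003C\u002Fdetails>\n\n\u003Cbr>\n\n\n## Scene Reasoning:\n## 2025:\n### 1. 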
LaRI: Layered Ray Intersections for Single-view 3D Geometric Reasoning ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Rui Li, Biao Zhang, Zhenyu Li, Federico Tombari, Peter Wonka\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe present layered ray intersections (LaRI), a new method for unseen geometry reasoning from a single image. Unlike conventional depth estimation that is limited to the visible surface, LaRI models multiple surfaces intersected by the camera rays using layered point maps. Benefiting from the compact and layered representation, LaRI enables complete, efficient, and view-aligned geometric reasoning to unify object- and scene-level tasks. We further propose to predict the ray stopping index, which identifies valid intersecting pixels and layers from LaRI's output. We build a complete training data generation pipeline for synthetic and real-world data, including 3D objects and scenes, with necessary data cleaning steps and coordination between rendering engines. As a generic method, LaRI's performance is validated in two scenarios: It yields comparable object-level results to the recent large generative model using 4% of its training data and 17% of its parameters. Meanwhile, it achieves scene-level occluded geometry reasoning in a single feed-forward pass.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.18424) | [🌐 Project Page](https:\u002F\u002Fruili3.github.io\u002Flari\u002Findex.html) | [💻 Code](https:\u002F\u002Fgithub.com\u002Fruili3\u002Flari) | [🤗 Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fruili3\u002FLaRI) | [🎞️ Video](https:\u002F\u002Fruili3.github.io\u002Flari\u002Fstatic\u002Fvideos\u002Fteaser_video.mp4)\n\n\u003Cbr>\n\n\n\n### 2. RaySt3R: Predicting Novel Depth Maps for Zero-Shot Object Completion ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Bardienus P. Duisterhof, Jan Oberst, Bowen Wen, Stan Birchfield, Deva Ramanan, Jeffrey Ichnowski\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\n3D shape completion has broad applications in robotics, digital twin reconstruction, and extended reality (XR). Although recent advances in 3D object and scene completion have achieved impressive results, existing methods lack 3D consistency, are computationally expensive, and struggle to capture sharp object boundaries. Our work (RaySt3R) addresses these limitations by recasting 3D shape completion as a novel view synthesis problem. Specifically, given a single RGB-D image and a novel viewpoint (encoded as a collection of query rays), we train a feedforward transformer to predict depth maps, object masks, and per-pixel confidence scores for those query rays. RaySt3R fuses these predictions across multiple query views to reconstruct complete 3D shapes. We evaluate RaySt3R on synthetic and real-world datasets, and observe it achieves state-of-the-art performance, outperforming the baselines on all datasets by up to 44% in 3D chamfer distance.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.05285) | [🌐 Project Page](https:\u002F\u002Frayst3r.github.io\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002FDuisterhof\u002Frayst3r) | [🤗 Demo (coming soon)](https:\u002F\u002Frayst3r.github.io\u002F) | [🎞️ Video](https:\u002F\u002Frayst3r.github.io\u002Fstatic\u002Fvideos\u002Fteaser\u002Fteaser_fixed.mp4)\n
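\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Illustrative sketch\u003C\u002Fb>\u003C\u002Fsummary>\n\nTo make the query-ray formulation above concrete, the sketch below builds per-pixel query rays for a novel viewpoint (assuming a pinhole camera) and fuses predicted depths into a completed point cloud, keeping only confident foreground pixels. The `model` call, the returned shapes and the `conf_thresh` value are assumptions for illustration, not the RaySt3R interface.\n\n```python\nimport numpy as np\n\ndef query_rays(K, cam_to_world, H, W):\n    # Per-pixel query rays (origin, direction) for a novel pinhole view.\n    i, j = np.meshgrid(np.arange(W), np.arange(H))\n    pix = np.stack([i + 0.5, j + 0.5, np.ones_like(i)], -1).reshape(-1, 3)\n    dirs = (pix @ np.linalg.inv(K).T) @ cam_to_world[:3, :3].T\n    dirs = dirs \u002F np.linalg.norm(dirs, axis=-1, keepdims=True)\n    origins = np.broadcast_to(cam_to_world[:3, 3], dirs.shape)\n    return origins, dirs\n\ndef fuse_views(model, rgbd, query_views, conf_thresh=0.5):\n    # model: placeholder for the feed-forward transformer; each view is (K, pose, H, W).\n    points = []\n    for K, pose, H, W in query_views:\n        o, d = query_rays(K, pose, H, W)\n        depth, mask, conf = model(rgbd, o, d)             # each of shape [H*W]\n        keep = (mask > 0.5) & (conf > conf_thresh)\n        points.append(o[keep] + depth[keep, None] * d[keep])\n    return np.concatenate(points, axis=0)\n```\n\n\u003C\u002Fdetails>\n\n\u003Cbr>\n\n\n### 3. 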
Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Tianhao Wu, Chuanxia Zheng, Frank Guan, Andrea Vedaldi, Tat-Jen Cham\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nMost image-based 3D object reconstructors assume that objects are fully visible, ignoring occlusions that commonly occur in real-world scenarios. In this paper, we introduce Amodal3R, a conditional 3D generative model designed to reconstruct 3D objects from partial observations. We start from a \"foundation\" 3D generative model and extend it to recover plausible 3D geometry and appearance from occluded objects. We introduce a mask-weighted multi-head cross-attention mechanism followed by an occlusion-aware attention layer that explicitly leverages occlusion priors to guide the reconstruction process. We demonstrate that, by training solely on synthetic data, Amodal3R learns to recover full 3D objects even in the presence of occlusions in real scenes. It substantially outperforms existing methods that independently perform 2D amodal completion followed by 3D reconstruction, thereby establishing a new benchmark for occlusion-aware 3D reconstruction.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.13439) | [🌐 Project Page](https:\u002F\u002Fsm0kywu.github.io\u002FAmodal3R) | [💻 Code (coming soon)](https:\u002F\u002Fsm0kywu.github.io\u002FAmodal3R) | [🤗 Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FSm0kyWu\u002FAmodal3R)\n\n\u003Cbr>\n\n\n\n\n\n## Gaussian Splatting:\n\n\n## 2025:\n### 1. EasySplat: View-Adaptive Learning makes 3D Gaussian Splatting Easy ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Ao Gao, Luosong Guo, Tao Chen, Zhao Wang, Ying Tai, Jian Yang, Zhenyu Zhang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\n3D Gaussian Splatting (3DGS) techniques have achieved satisfactory 3D scene representation. Despite their impressive performance, they confront challenges due to the limitation of structure-from-motion (SfM) methods in acquiring accurate scene initialization, or the inefficiency of the densification strategy. In this paper, we introduce a novel framework EasySplat to achieve high-quality 3DGS modeling. Instead of using SfM for scene initialization, we employ a novel method to unleash the power of large-scale pointmap approaches. Specifically, we propose an efficient grouping strategy based on view similarity, and use robust pointmap priors to obtain high-quality point clouds and camera poses for 3D scene initialization. After obtaining a reliable scene structure, we propose a novel densification approach that adaptively splits Gaussian primitives based on the average shape of neighboring Gaussian ellipsoids, utilizing a KNN scheme. In this way, the proposed method tackles the limitations in initialization and optimization, leading to efficient and accurate 3DGS modeling. Extensive experiments demonstrate that EasySplat outperforms the current state-of-the-art (SOTA) in handling novel view synthesis.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2501.01003)\n
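\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Illustrative sketch\u003C\u002Fb>\u003C\u002Fsummary>\n\nThe KNN-based adaptive splitting above can be pictured as follows: a Gaussian is split when its extent clearly exceeds the average extent of its nearest neighbours. The sketch below is one plausible reading under assumed inputs; the neighbour count `k`, the trigger `ratio`, the shrink factor and the split along the longest axis are illustrative choices rather than the paper's exact rule.\n\n```python\nimport torch\n\ndef adaptive_split(means, scales, k=8, ratio=2.0):\n    # means: [N, 3] Gaussian centers, scales: [N, 3] per-axis extents\n    dists = torch.cdist(means, means)                      # [N, N] pairwise distances\n    knn = dists.topk(k + 1, largest=False).indices[:, 1:]  # drop the self-match\n    neighbour_size = scales.norm(dim=-1)[knn].mean(dim=1)  # average neighbour extent\n    too_big = scales.norm(dim=-1) > ratio * neighbour_size\n    # Replace each oversized Gaussian with two smaller ones offset along its longest axis.\n    offset = torch.zeros_like(means[too_big])\n    axis = scales[too_big].argmax(dim=-1)\n    offset[torch.arange(offset.shape[0]), axis] = 0.5 * scales[too_big].max(dim=-1).values\n    new_means = torch.cat([means[~too_big], means[too_big] + offset, means[too_big] - offset])\n    new_scales = torch.cat([scales[~too_big], scales[too_big] * 0.6, scales[too_big] * 0.6])\n    return new_means, new_scales\n```\n\n\u003C\u002Fdetails>\n\n\u003Cbr>\n\n### 2. 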
FlowR: Flowing from Sparse to Dense 3D Reconstructions ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Tobias Fischer, Samuel Rota Bulò, Yung-Hsu Yang, Nikhil Varma Keetha, Lorenzo Porzi, Norman Müller, Katja Schwarz, Jonathon Luiten, Marc Pollefeys, Peter Kontschieder\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\n3D Gaussian splatting enables high-quality novel view synthesis (NVS) at real-time frame rates. However, its quality drops sharply as we depart from the training views. Thus, dense captures are needed to match the high-quality expectations of some applications, e.g. Virtual Reality (VR). However, such dense captures are very laborious and expensive to obtain. Existing works have explored using 2D generative models to alleviate this requirement by distillation or generating additional training views. These methods are often conditioned only on a handful of reference input views and thus do not fully exploit the available 3D information, leading to inconsistent generation results and reconstruction artifacts. To tackle this problem, we propose a multi-view, flow matching model that learns a flow to connect novel view renderings from possibly sparse reconstructions to renderings that we expect from dense reconstructions. This enables augmenting scene captures with novel, generated views to improve reconstruction quality. Our model is trained on a novel dataset of 3.6M image pairs and can process up to 45 views at 540x960 resolution (91K tokens) on one H100 GPU in a single forward pass. Our pipeline consistently improves NVS in sparse- and dense-view scenarios, leading to higher-quality reconstructions than prior works across multiple, widely-used NVS benchmarks.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.01647) | [🌐 Project Page](https:\u002F\u002Ftobiasfshr.github.io\u002Fpub\u002Fflowr\u002F)\n\n\u003Cbr>\n\n\n\n### 3. Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Peng Wang, Xiang Liu, Peidong Liu\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nStylizing 3D scenes instantly while maintaining multi-view consistency and faithfully resembling a style image remains a significant challenge. Current state-of-the-art 3D stylization methods typically involve computationally intensive test-time optimization to transfer artistic features into a pretrained 3D representation, often requiring dense posed input images. In contrast, leveraging recent advances in feed-forward reconstruction models, we demonstrate a novel approach to achieve direct 3D stylization in less than a second using unposed sparse-view scene images and an arbitrary style image. To address the inherent decoupling between reconstruction and stylization, we introduce a branched architecture that separates structure modeling and appearance shading, effectively preventing stylistic transfer from distorting the underlying 3D scene structure. Furthermore, we adapt an identity loss to facilitate pre-training our stylization model through the novel view synthesis task. This strategy also allows our model to retain its original reconstruction capabilities while being fine-tuned for stylization. 
Comprehensive evaluations, using both in-domain and out-of-domain datasets, demonstrate that our approach produces high-quality stylized 3D content that achieve a superior blend of style and scene appearance, while also outperforming existing methods in terms of multi-view consistency and efficiency.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.21060) | [🌐 Project Page](https:\u002F\u002Fnickisdope.github.io\u002FStyl3R) | [💻 Code](https:\u002F\u002Fgithub.com\u002FWU-CVGL\u002FStyl3R)\n\n\u003Cbr>\n\n\u003Cbr>\n\n\n## 2024:\n### 1. InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 40 Seconds ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, Zhangyang Wang, Yue Wang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWhile novel view synthesis (NVS) has made substantial progress in 3D computer vision, it typically requires an initial estimation of camera intrinsics and extrinsics from dense viewpoints. This pre-processing is usually conducted via a Structure-from-Motion (SfM) pipeline, a procedure that can be slow and unreliable, particularly in sparse-view scenarios with insufficient matched features for accurate reconstruction. In this work, we integrate the strengths of point-based representations (e.g., 3D Gaussian Splatting, 3D-GS) with end-to-end dense stereo models (DUSt3R) to tackle the complex yet unresolved issues in NVS under unconstrained settings, which encompasses pose-free and sparse view challenges. Our framework, InstantSplat, unifies dense stereo priors with 3D-GS to build 3D Gaussians of large-scale scenes from sparseview & pose-free images in less than 1 minute. Specifically, InstantSplat comprises a Coarse Geometric Initialization (CGI) module that swiftly establishes a preliminary scene structure and camera parameters across all training views, utilizing globally-aligned 3D point maps derived from a pre-trained dense stereo pipeline. This is followed by the Fast 3D-Gaussian Optimization (F-3DGO) module, which jointly optimizes the 3D Gaussian attributes and the initialized poses with pose regularization. Experiments conducted on the large-scale outdoor Tanks & Temples datasets demonstrate that InstantSplat significantly improves SSIM (by 32%) while concurrently reducing Absolute Trajectory Error (ATE) by 80%. These establish InstantSplat as a viable solution for scenarios involving posefree and sparse-view conditions. Project page: http:\u002F\u002Finstantsplat.github.io\u002F.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.20309.pdf) | [🌐 Project Page](https:\u002F\u002Finstantsplat.github.io\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FInstantSplat) | [🎥 Video](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=_9aQHLHHoEM&feature=youtu.be) \n\n\u003Cbr>\n\n\n### 2. Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Brandon Smart, Chuanxia Zheng, Iro Laina, Victor Adrian Prisacariu\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nIn this paper, we introduce Splatt3R, a pose-free, feed-forward method for in-the-wild 3D reconstruction and novel view synthesis from stereo pairs. 
Given uncalibrated natural images, Splatt3R can predict 3D Gaussian Splats without requiring any camera parameters or depth information. For generalizability, we build Splatt3R upon a ``foundation'' 3D geometry reconstruction method, MASt3R, by extending it to deal with both 3D structure and appearance. Specifically, unlike the original MASt3R which reconstructs only 3D point clouds, we predict the additional Gaussian attributes required to construct a Gaussian primitive for each point. Hence, unlike other novel view synthesis methods, Splatt3R is first trained by optimizing the 3D point cloud's geometry loss, and then a novel view synthesis objective. By doing this, we avoid the local minima present in training 3D Gaussian Splats from stereo views. We also propose a novel loss masking strategy that we empirically find is critical for strong performance on extrapolated viewpoints. We train Splatt3R on the ScanNet++ dataset and demonstrate excellent generalisation to uncalibrated, in-the-wild images. Splatt3R can reconstruct scenes at 4FPS at 512 x 512 resolution, and the resultant splats can be rendered in real-time.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.13912) | [🌐 Project Page](https:\u002F\u002Fsplatt3r.active.vision\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002Fbtsmart\u002Fsplatt3r)\n\n\u003Cbr>\n\n\n\n\n### 3. Dense Point Clouds Matter: Dust-GS for Scene Reconstruction from Sparse Viewpoints ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Shan Chen, Jiale Zhou, Lei Li\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\n3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in scene synthesis and novel view synthesis tasks. Typically, the initialization of 3D Gaussian primitives relies on point clouds derived from Structure-from-Motion (SfM) methods. However, in scenarios requiring scene reconstruction from sparse viewpoints, the effectiveness of 3DGS is significantly constrained by the quality of these initial point clouds and the limited number of input images. In this study, we present Dust-GS, a novel framework specifically designed to overcome the limitations of 3DGS in sparse viewpoint conditions. Instead of relying solely on SfM, Dust-GS introduces an innovative point cloud initialization technique that remains effective even with sparse input data. Our approach leverages a hybrid strategy that integrates an adaptive depth-based masking technique, thereby enhancing the accuracy and detail of reconstructed scenes. Extensive experiments conducted on several benchmark datasets demonstrate that Dust-GS surpasses traditional 3DGS methods in scenarios with sparse viewpoints, achieving superior scene reconstruction quality with a reduced number of input images.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.08613)\n\n\u003Cbr>\n\n\n\n### 4. LM-Gaussian: Boost Sparse-view 3D Gaussian Splatting with Large Model Priors ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Hanyang Yu, Xiaoxiao Long, Ping Tan\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe aim to address sparse-view reconstruction of a 3D scene by leveraging priors from large-scale vision models. 
While recent advancements such as 3D Gaussian Splatting (3DGS) have demonstrated remarkable successes in 3D reconstruction, these methods typically necessitate hundreds of input images that densely capture the underlying scene, making them time-consuming and impractical for real-world applications. However, sparse-view reconstruction is inherently ill-posed and under-constrained, often resulting in inferior and incomplete outcomes. This is due to issues such as failed initialization, overfitting on input images, and a lack of details. To mitigate these challenges, we introduce LM-Gaussian, a method capable of generating high-quality reconstructions from a limited number of images. Specifically, we propose a robust initialization module that leverages stereo priors to aid in the recovery of camera poses and the reliable point clouds. Additionally, a diffusion-based refinement is iteratively applied to incorporate image diffusion priors into the Gaussian optimization process to preserve intricate scene details. Finally, we utilize video diffusion priors to further enhance the rendered images for realistic visual effects. Overall, our approach significantly reduces the data acquisition requirements compared to previous 3DGS methods. We validate the effectiveness of our framework through experiments on various public datasets, demonstrating its potential for high-quality 360-degree scene reconstruction. Visual results are on our website.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.03456) | [🌐 Project Page](https:\u002F\u002Fhanyangyu1021.github.io\u002Flm-gaussian.github.io\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002Fhanyangyu1021\u002FLMGaussian)\n\n\u003Cbr>\n\n\n\n\n### 5. PreF3R: Pose-Free Feed-Forward 3D Gaussian Splatting from Variable-length Image Sequence ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Zequn Chen, Jiezhi Yang, Heng Yang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nWe present PreF3R, Pose-Free Feed-forward 3D Reconstruction from an image sequence of variable length. Unlike previous approaches, PreF3R removes the need for camera calibration and reconstructs the 3D Gaussian field within a canonical coordinate frame directly from a sequence of unposed images, enabling efficient novel-view rendering. We leverage DUSt3R's ability for pair-wise 3D structure reconstruction, and extend it to sequential multi-view input via a spatial memory network, eliminating the need for optimization-based global alignment. Additionally, PreF3R incorporates a dense Gaussian parameter prediction head, which enables subsequent novel-view synthesis with differentiable rasterization. This allows supervising our model with the combination of photometric loss and pointmap regression loss, enhancing both photorealism and structural accuracy. Given a sequence of ordered images, PreF3R incrementally reconstructs the 3D Gaussian field at 20 FPS, therefore enabling real-time novel-view rendering. 
Empirical experiments demonstrate that PreF3R is an effective solution for the challenging task of pose-free feed-forward novel-view synthesis, while also exhibiting robust generalization to unseen scenes.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.16877) | [🌐 Project Page](https:\u002F\u002Fcomputationalrobotics.seas.harvard.edu\u002FPreF3R\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002FComputationalRobotics\u002FPreF3R)\n\n\u003Cbr>\n\n\n### 6. Dust to Tower: Coarse-to-Fine Photo-Realistic Scene Reconstruction from Sparse Uncalibrated Images ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Xudong Cai, Yongcai Wang, Zhaoxin Fan, Deng Haoran, Shuo Wang, Wanting Li, Deying Li, Lun Luo, Minhang Wang, Jintao Xu\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nPhoto-realistic scene reconstruction from sparse-view, uncalibrated images is highly required in practice. Although some successes have been made, existing methods are either Sparse-View but require accurate camera parameters (i.e., intrinsic and extrinsic), or SfM-free but need densely captured images. To combine the advantages of both methods while addressing their respective weaknesses, we propose Dust to Tower (D2T), an accurate and efficient coarse-to-fine framework to optimize 3DGS and image poses simultaneously from sparse and uncalibrated images. Our key idea is to first construct a coarse model efficiently and subsequently refine it using warped and inpainted images at novel viewpoints. To do this, we first introduce a Coarse Construction Module (CCM) which exploits a fast Multi-View Stereo model to initialize a 3D Gaussian Splatting (3DGS) and recover initial camera poses. To refine the 3D model at novel viewpoints, we propose a Confidence Aware Depth Alignment (CADA) module to refine the coarse depth maps by aligning their confident parts with estimated depths by a Mono-depth model. Then, a Warped Image-Guided Inpainting (WIGI) module is proposed to warp the training images to novel viewpoints by the refined depth maps, and inpainting is applied to fulfill the ``holes\" in the warped images caused by view-direction changes, providing high-quality supervision to further optimize the 3D model and the camera poses. Extensive experiments and ablation studies demonstrate the validity of D2T and its design choices, achieving state-of-the-art performance in both tasks of novel view synthesis and pose estimation while keeping high efficiency. Codes will be publicly available.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.19518) | [💻 Code (to be released)]()\n\n\u003Cbr>\n\n\n\n\n## Scene Understanding:\n\n## 2025:\n### 1.PE3R: Perception-Efficient 3D Reconstruction ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**Authors**: Jie Hu, Shizun Wang, Xinchao Wang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nRecent advancements in 2D-to-3D perception have significantly improved the understanding of 3D scenes from 2D images. However, existing methods face critical challenges, including limited generalization across scenes, suboptimal perception accuracy, and slow reconstruction speeds. To address these limitations, we propose Perception-Efficient 3D Reconstruction (PE3R), a novel framework designed to enhance both accuracy and efficiency. 
PE3R employs a feed-forward architecture to enable rapid 3D semantic field reconstruction. The framework demonstrates robust zero-shot generalization across diverse scenes and objects while significantly improving reconstruction speed. Extensive experiments on 2D-to-3D open-vocabulary segmentation and 3D reconstruction validate the effectiveness and versatility of PE3R. The framework achieves a minimum 9-fold speedup in 3D semantic field reconstruction, along with substantial gains in perception accuracy and reconstruction precision, setting new benchmarks in the field.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.07507) | [💻 Code](https:\u002F\u002Fgithub.com\u002Fhujiecpp\u002FPE3R)\n\u003Cbr>\n\u003Cbr>\n\n\n### 2. SegMASt3R: Geometry Grounded Segment Matching ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-Neurips-blue)\n**Authors**: Rohit Jayanti, Swayam Agrawal, Vansh Garg, Siddharth Tourani, Muhammad Haris Khan, Sourav Garg, Madhava Krishna\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nSegment matching is an important intermediate task in computer vision that establishes correspondences between semantically or geometrically coherent regions across images. Unlike keypoint matching, which focuses on localized features, segment matching captures structured regions, offering greater robustness to occlusions, lighting variations, and viewpoint changes. In this paper, we leverage the spatial understanding of 3D foundation models to tackle wide-baseline segment matching, a challenging setting involving extreme viewpoint shifts. We propose an architecture that uses the inductive bias of these 3D foundation models to match segments across image pairs with up to 180 degree view-point change. Extensive experiments show that our approach outperforms state-of-the-art methods, including the SAM2 video propagator and local feature matching methods, by upto 30% on the AUPRC metric, on ScanNet++ and Replica datasets. We further demonstrate benefits of the proposed model on relevant downstream tasks, including 3D instance segmentation and image-goal navigation.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.05051) | [🌐 Project Page](https:\u002F\u002Fsegmast3r.github.io\u002F) | [💻 Code](https:\u002F\u002Fgithub.com\u002FSegMASt3R)\n\n\n\n\n## 2024:\n### 1. LargeSpatialModel: End-to-end Unposed Images to Semantic 3D ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-Neurips-blue)\n**Authors**: Zhiwen Fan, Jian Zhang, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu, Boris Ivanovic, Marco Pavone, Yue Wang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nReconstructing and understanding 3D structures from a limited number of images is a well-established problem in computer vision. Traditional methods usually break this task into multiple subtasks, each requiring complex transformations between different data representations. For instance, dense reconstruction through Structure-from-Motion (SfM) involves converting images into key points, optimizing camera parameters, and estimating structures. Afterward, accurate sparse reconstructions are required for further dense modeling, which is subsequently fed into task-specific neural networks. 
This multi-step process results in considerable processing time and increased engineering complexity.\nIn this work, we present the Large Spatial Model (LSM), which processes unposed RGB images directly into semantic radiance fields. LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation, and it can generate versatile label maps by interacting with language at novel viewpoints. Leveraging a Transformer-based architecture, LSM integrates global geometry through pixel-aligned point maps. To enhance spatial attribute regression, we incorporate local context aggregation with multi-scale fusion, improving the accuracy of fine local details. To tackle the scarcity of labeled 3D semantic data and enable natural language-driven scene manipulation, we incorporate a pre-trained 2D language-based segmentation model into a 3D-consistent semantic feature field. An efficient decoder then parameterizes a set of semantic anisotropic Gaussians, facilitating supervised end-to-end learning. Extensive experiments across various tasks show that LSM unifies multiple 3D vision tasks directly from unposed images, achieving real-time semantic 3D reconstruction for the first time.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.18956) | [💻 Code](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FLSM) | [🌐 Project Page](https:\u002F\u002Flargespatialmodel.github.io\u002F) | [🎮 Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fkairunwen\u002FLSM)\n\n\n## Robotics:\n## 2024:\n### 1. Unifying Scene Representation and Hand-Eye Calibration with 3D Foundation Models ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-RAL-yellow)\n**Authors**: Weiming Zhi, Haozhan Tang, Tianyi Zhang, Matthew Johnson-Roberson\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nRepresenting the environment is a central challenge in robotics, and is essential for effective decision-making. Traditionally, before capturing images with a manipulatormounted camera, users need to calibrate the camera using a specific external marker, such as a checkerboard or AprilTag.\nHowever, recent advances in computer vision have led to the development of 3D foundation models. These are large, pre-trained neural networks that can establish fast and accurate multi-view correspondences with very few images, even in the absence of rich visual features. This paper advocates for the integration of 3D foundation models into scene representation approaches for robotic systems equipped with manipulator-mounted RGB cameras. Specifically, we propose the Joint Calibration and Representation (JCR) method. JCR uses RGB images, captured by a manipulator-mounted camera, to simultaneously construct an environmental representation and calibrate the camera relative to the robot’s end-effector, in the absence of specific calibration markers. The resulting 3D environment representation is aligned with the robot’s coordinate frame and maintains physically accurate scales. We demonstrate that JCR can build effective scene representations using a low-cost RGB camera attached to a manipulator, without prior calibration.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.11683.pdf) | [💻 Code (to be released)]()\n\n\u003Cbr>\n\n### 2. 
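A minimal sketch of the scale-recovery step implied by the JCR entry above: the up-to-scale camera translations produced by a 3D foundation model are fit, in a least-squares sense, to the metric end-effector translations known from forward kinematics. This is a simplified stand-in that ignores the hand-eye offset and treats the two motions as sharing translation magnitudes; it is not the paper's joint calibration formulation.

```python
import numpy as np

def recover_metric_scale(cam_translations, ee_translations):
    # cam_translations: (N, 3) relative camera translations from the foundation model (unknown scale)
    # ee_translations:  (N, 3) corresponding end-effector translations from forward kinematics (metric)
    a = np.linalg.norm(cam_translations, axis=1)
    b = np.linalg.norm(ee_translations, axis=1)
    # least-squares scale s minimizing sum_i (s * a_i - b_i)^2
    return float(np.dot(a, b) / np.dot(a, a))

if __name__ == '__main__':
    rng = np.random.default_rng(1)
    ee = rng.normal(size=(20, 3))
    cam = ee / 3.7 + rng.normal(scale=1e-3, size=(20, 3))   # same motion at an arbitrary scale
    print('recovered scale ~', recover_metric_scale(cam, ee))
```

### 2. 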
3D Foundation Models Enable Simultaneous Geometry and Pose Estimation of Grasped Objects ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**Authors**: Weiming Zhi, Haozhan Tang, Tianyi Zhang, Matthew Johnson-Roberson\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nHumans have the remarkable ability to use held objects as tools to interact with their environment. For this to occur, humans internally estimate how hand movements affect the object’s movement. We wish to endow robots with this capability. We contribute methodology to jointly estimate the geometry and pose of objects grasped by a robot, from RGB images captured by an external camera. Notably, our method transforms the estimated geometry into the robot’s coordinate frame, while not requiring the extrinsic parameters of the external camera to be calibrated. Our approach leverages 3D foundation models, large models pre-trained on huge datasets for 3D vision tasks, to produce initial estimates of the in-hand object. These initial estimations do not have physically correct scales and are in the camera’s frame. Then, we formulate, and efficiently solve, a coordinate-alignment problem to recover accurate scales, along with a transformation of the objects to the coordinate frame of the robot. Forward kinematics mappings can subsequently be defined from the manipulator’s joint angles to specified points on the object. These mappings enable the estimation of points on the held object at arbitrary configurations, enabling robot motion to be designed with respect to coordinates on the grasped objects. We empirically evaluate our approach on a robot manipulator holding a diverse set of real-world objects.\n\u003C\u002Fdetails>\n\n[📄 Paper](https:\u002F\u002Fwww.researchgate.net\u002Fprofile\u002FWeiming-Zhi\u002Fpublication\u002F382490016_3D_Foundation_Models_Enable_Simultaneous_Geometry_and_Pose_Estimation_of_Grasped_Objects\u002Flinks\u002F66a01a4527b00e0ca43ddd95\u002F3D-Foundation-Models-Enable-Simultaneous-Geometry-and-Pose-Estimation-of-Grasped-Objects.pdf)\n\u003Cbr>\n\n\n\n\n## Pose Estimation:\n## 2025:\n### 1. Reloc3r: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**Authors**: Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, Yanchao Yang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nVisual localization aims to determine the camera pose of a query image relative to a database of posed images. In recent years, deep neural networks that directly regress camera poses have gained popularity due to their fast inference capabilities. However, existing methods struggle to either generalize well to new scenes or provide accurate camera pose estimates. To address these issues, we present Reloc3r, a simple yet effective visual localization framework. It consists of an elegantly designed relative pose regression network, and a minimalist motion averaging module for absolute pose estimation. Trained on approximately eight million posed image pairs, Reloc3r achieves surprisingly good performance and generalization ability. We conduct extensive experiments on six public datasets, consistently demonstrating the effectiveness and efficiency of the proposed method. It provides high-quality camera pose estimates in real time and generalizes to novel scenes. 
[Code](https:\u002F\u002Fgithub.com\u002Fffrivera0\u002Freloc3r).\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.08376) | [💻 Code](https:\u002F\u002Fgithub.com\u002Fffrivera0\u002Freloc3r)\n\n\u003Cbr>\n\n\n### 2. Pos3R: 6D Pose Estimation for Unseen Objects Made Easy ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**Authors**: Weijian Deng, Dylan Campbell, Chunyi Sun, Jiahao Zhang, Shubham Kanitkar, Matthew Shaffer, Stephen Gould\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nFoundation models have significantly reduced the need for task-specific training, while also enhancing generalizability. However, state-of-the-art 6D pose estimators either require further training with pose supervision or neglect advances obtainable from 3D foundation models. The latter is a missed opportunity, since these models are better equipped to predict 3D-consistent features, which are of significant utility for the pose estimation task. To address this gap, we propose Pos3R, a method for estimating the 6D pose of any object from a single RGB image, making extensive use of a 3D reconstruction foundation model and requiring no additional training. We identify template selection as a particular bottleneck for existing methods that is significantly alleviated by the use of a 3D model, which can more easily distinguish between template poses than a 2D model. Despite its simplicity, Pos3R achieves competitive performance on the BOP benchmark across seven diverse datasets, matching or surpassing existing refinement-free methods. Additionally, Pos3R integrates seamlessly with render-and-compare refinement techniques, demonstrating adaptability for high-precision applications.\n\u003C\u002Fdetails>\n\n  [📄 Paper]() | [💻 Code]()\n\n\u003Cbr>\n\n\n\n## DUSt3R for Science:\n## 2025:\n### 1. CryoFastAR: Fast Cryo-EM Ab Initio Reconstruction Made Easy ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-ICCV-pink)\n**Authors**: Jiakai Zhang, Shouchen Zhou, Haizhao Dai, Xinhang Liu, Peihao Wang, Zhiwen Fan, Yuan Pei, Jingyi Yu\n\u003Cdetails span>\n\u003Csummary>\u003Cb>Abstract\u003C\u002Fb>\u003C\u002Fsummary>\nPose estimation from unordered images is fundamental for 3D reconstruction, robotics, and scientific imaging. Recent geometric foundation models, such as DUSt3R, enable end-to-end dense 3D reconstruction but remain underexplored in scientific imaging fields like cryo-electron microscopy (cryo-EM) for near-atomic protein reconstruction. In cryo-EM, pose estimation and 3D reconstruction from unordered particle images still depend on time-consuming iterative optimization, primarily due to challenges such as low signal-to-noise ratios (SNR) and distortions from the contrast transfer function (CTF). We introduce CryoFastAR, the first geometric foundation model that can directly predict poses from Cryo-EM noisy images for Fast ab initio Reconstruction. By integrating multi-view features and training on large-scale simulated cryo-EM data with realistic noise and CTF modulations, CryoFastAR enhances pose estimation accuracy and generalization. To enhance training stability, we propose a progressive training strategy that first allows the model to extract essential features under simpler conditions before gradually increasing difficulty to improve robustness. 
Experiments show that CryoFastAR achieves comparable quality while significantly accelerating inference over traditional iterative approaches on both synthetic and real datasets.\n\u003C\u002Fdetails>\n\n  [📄 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.05864)\n\n\u003Cbr>\n\n\n\n\n## Related Codebase\n1. [Mini-DUSt3R](https:\u002F\u002Fgithub.com\u002Fpablovela5620\u002Fmini-dust3r): A miniature version of dust3r only for performing inference. May, 2024.\n\n## Blog Posts\n\n1. [3D reconstruction models made easy](https:\u002F\u002Feurope.naverlabs.com\u002Fblog\u002F3d-reconstruction-models-made-easy\u002F)\n2. [InstantSplat: Sub Minute Gaussian Splatting](https:\u002F\u002Fradiancefields.com\u002Finstantsplat-sub-minute-gaussian-splatting\u002F)\n\n\n## Tutorial Videos\n1. [Advanced Image-to-3D AI, DUSt3R](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=kI7wCEAFFb0)\n2. [BSLIVE Pinokio Dust3R to turn 2D into 3D Mesh](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=vY7GcbOsC-U)\n3. [InstantSplat, DUSt3R](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=JdfrG89iPOA)\n\n## Acknowledgements\n- Thanks to [Janusch](https:\u002F\u002Ftwitter.com\u002Fjanusch_patas) for the awesome paper list [awesome-3D-gaussian-splatting](https:\u002F\u002Fgithub.com\u002FMrNeRF\u002Fawesome-3D-gaussian-splatting) and to [Chao Wen](https:\u002F\u002Fwalsvid.github.io\u002F) for the [Awesome-MVS](https:\u002F\u002Fgithub.com\u002Fwalsvid\u002FAwesome-MVS). This list was designed with reference to both.\n","\u003Cdiv align=\"center\">\n\u003Ch1>超棒的 DUSt3R 资源 \u003C\u002Fh1>\n\u003C\u002Fdiv>\n\n这是一份精心整理的论文和开源资源列表，涵盖了 DUSt3R\u002FMASt3R 相关内容。DUSt3R 和 MASt3R 是新兴的几何基础模型，能够支持广泛的 3D 几何任务与应用。欢迎提交 PR 请求，包括论文、开源库、博客文章、视频等。本仓库由 [@Rui Li](https:\u002F\u002Fx.com\u002Fleedaray) 维护，敬请关注后续更新！\n\n## 目录\n\n- [DUSt3R 的开创性论文](#seminal-papers-of-dust3r)\n- [同期相关工作](#concurrent-works)\n\n\u003Cbr>\n\n\n- [3D 重建](#3d-reconstruction)\n- [动态场景重建](#dynamic-scene-reconstruction)\n- [3D 场景推理](#scene-reasoning)\n- [高斯泼溅](#gaussian-splatting)\n- [场景理解](#scene-understanding)\n- [机器人技术](#robotics)\n- [位姿估计](#pose-estimation)\n- [DUSt3R 在科学中的应用](#dust3r-for-science)\n\n\u003Cbr>\n\n- [相关代码库](#related-codebase)\n- [博客文章](#blog-posts)\n- [教程视频](#tutorial-videos)\n- [致谢](#acknowledgements)\n\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>更新日志：\u003C\u002Fb>\u003C\u002Fsummary>\n\n**2025年10月25日**: 新增 Human3R、Rig3R、SegMASt3R、PLANA3R、TTT3R。\n\u003Cbr>\n**2025年9月6日**: 新增 SAIL-Recon、FastVGGT、HAMSt3R、Vista-SLAM。\n\u003Cbr>\n**2025年8月16日**: 新增 Test3R。\n\u003Cbr>\n**2025年8月15日**: 新增 MoGe-2、S3PO-GS、π^3、LONG3R、VGGT-Long、STream3R、Dens3R、StreamVGG-T、Back-on-Track 和 ViPE。\n\u003Cbr>\n**2025年7月9日**: 新增 Point3R、GeometryCrafter、CryoFastAR。\n\u003Cbr>\n**2025年6月19日**: 新增 RaySt3R、Amodal3R、Styl3R。\n\u003Cbr>\n**2025年5月6日**: 新增 [LaRI](https:\u002F\u002Fruili3.github.io\u002Flari\u002Findex.html)。\n\u003Cbr>\n**2025年4月29日**: 新增 Pow3R、Mono3R、Easi3R、FlowR、ODHSR、DPM、Geo4D、POMATO、DAS3R。\n\u003Cbr>\n**2025年3月20日**: 新增 Reloc3r、Pos3R、MASt3R-SLAM、Light3R-SfM、VGGT。 \n\u003Cbr>\n**2025年3月16日**: 新增 MUSt3R、PE3R。\n\u003Cbr>\n**2025年1月24日**: 新增 CUT3R、Fast3R、EasySplat、MEt3R、Dust-to-Tower。新年快乐！\n\u003Cbr>\n**2024年12月20日**: 新增 Align3R、PeRF3R、MV-DUSt3R+、Stereo4D、SLAM3R、LoRA3D。\n\u003Cbr>\n**2024年11月15日**: 新增 MoGe、LSM。\n\u003Cbr>\n**2024年10月10日**: 新增 MASt3R-SfM、MonST3R。\n\u003Cbr>\n**2024年8月31日**: 新增 Spurfies、Spann3R 和 ReconX。\n\u003Cbr>\n**2024年8月29日**: 新增 Splatt3R，更新 InstantSplat 的代码等。\n\u003Cbr>\n**2024年6月21日**: 新增最新发布的 MASt3R。\n\u003Cbr>\n**2024年5月31日**: 新增一项无需检测器的 SfM 
研究，以及一个 Mini-DUSt3R 代码库。\n\u003Cbr>\n**2024年4月27日**: 新增多项同期工作，包括 FlowMap、ACE0、MicKey 和 VGGSfM。\n\u003Cbr>\n**2024年4月9日**: 初始列表，包含前三篇论文、博客和视频。\n\n\u003C\u002Fdetails>\n\u003Cbr>\n\n## DUSt3R 的开创性论文：\n### 1. DUSt3R：让几何 3D 视觉变得简单！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-CVPR-green)\n**作者**: Shuzhe Wang、Vincent Leroy、Yohann Cabon、Boris Chidlovskii、Jerome Revaud\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n在野外进行多视图立体重建（MVS）时，通常需要先估计相机参数，例如内参和外参。这些参数的获取往往繁琐且耗时，但却是将对应像素三角化到三维空间的关键步骤——这也是所有高性能 MVS 算法的核心所在。在本工作中，我们提出了截然不同的方法，推出了 DUSt3R，这是一种全新的密集且无约束的立体 3D 重建范式，可处理任意图像集，无需事先了解相机标定或视角姿态信息。我们将成对重建问题转化为点映射回归问题，从而放宽了传统投影相机模型的严格约束。我们证明了这种形式可以无缝统一单目和双目重建场景。当提供超过两张图像时，我们进一步提出了一种简单而有效的全局对齐策略，将所有成对点映射表达在一个共同的参考框架中。我们的网络架构基于标准的 Transformer 编码器和解码器，使我们能够利用强大的预训练模型。我们的方法不仅可以直接生成场景的 3D 模型和深度信息，还可以轻松地从中恢复像素匹配关系、相对和绝对相机位姿。我们在这些任务上的全面实验表明，DUSt3R 可以统一多种 3D 视觉任务，并在单目\u002F多视图深度估计以及相对位姿估计方面创下新的 SOTA 记录。总之，DUSt3R 让许多几何 3D 视觉任务变得简单易行。\n\u003C\u002Fdetails>\n  \n [📃 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.14132.pdf) | [🌐 项目页面](https:\u002F\u002Fdust3r.europe.naverlabs.com\u002F) | [⌨️ 代码](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fdust3r) | [🎥 解说视频](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=JdfrG89iPOA) \n\n\u003Cbr>\n\n\n### 2. 使用 MASt3R 将图像匹配锚定在 3D 中！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**: Vincent Leroy、Yohann Cabon、Jérôme Revaud\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n图像匹配是所有高性能 3D 视觉算法和流水线中的核心组件。然而，尽管匹配本质上是一个 3D 问题，与相机位姿和场景几何紧密相关，但它通常被当作 2D 问题来处理。这样做有一定道理，因为匹配的目标是在 2D 像素场之间建立对应关系，但也可能带来潜在风险。在本工作中，我们采取了不同立场，提议将匹配视为一个 3D 任务，并借助最近推出的强大 3D 重建框架 DUSt3R 来实现。该方法基于点映射回归，在面对极端视角变化的视图时表现出惊人的鲁棒性，但精度有限。我们希望在此基础上提升此类方法的匹配能力，同时保持其鲁棒性。为此，我们提出为 DUSt3R 网络添加一个新的头部，用于输出密集的局部特征，并通过额外的匹配损失对其进行训练。此外，我们还解决了密集匹配的二次复杂度问题，如果不加以妥善处理，会使得下游应用的速度变得极其缓慢。我们引入了一种快速的互惠匹配方案，不仅能将匹配速度提升几个数量级，还具有理论保证，并最终带来更好的结果。大量实验表明，我们提出的这种方法被称为 MASt3R，显著优于现有最先进的匹配技术。特别是在极具挑战性的无地图定位数据集上，它在 VCRE AUC 指标上比最佳公开方法高出 30%（绝对提升）。\n\u003C\u002Fdetails>\n  \n [📃 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.09756) | [🌐 项目页面](https:\u002F\u002Feurope.naverlabs.com\u002Fblog\u002Fmast3r-matching-and-stereo-3d-reconstruction\u002F) | [⌨️ 代码](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fmast3r)\n\n\u003Cbr>\n\n### 3. MASt3R-SfM：一种用于无约束运动恢复结构的全集成解决方案 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**: 巴尔迪努斯·杜伊斯特霍夫、洛伊泽·祖斯特、菲利普·温扎佩尔、文森特·勒鲁瓦、约汉·卡邦、杰罗姆·雷沃\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n运动恢复结构（SfM）是一项任务，旨在根据一组图像联合恢复相机位姿和场景的三维几何信息。尽管经过数十年的显著进展，这一问题仍然充满挑战，许多核心难题尚未解决。传统的SfM解决方案通常由一系列复杂的最小二乘求解器组成，这种流水线容易传递误差，并且在图像重叠不足、运动过少等情况下往往失效。近年来，一些方法试图重新审视这一范式，但我们通过实验证明，这些方法并未真正解决上述核心问题。在本文中，我们提出基于最近发布的三维视觉基础模型来构建SfM系统，该模型能够稳健地生成局部三维重建并提供准确的匹配结果。我们进一步介绍了一种低内存消耗的方法，用于将这些局部重建精确地对齐到全局坐标系中。此外，我们还证明，这类基础模型可以作为高效的图像检索工具，且无需额外开销，从而将整体计算复杂度从二次方降低至线性。总体而言，我们的新型SfM流水线简单、可扩展、高效，并且真正实现了无约束处理——即它可以处理任意数量、有序或无序的图像集合。在多个基准数据集上的大量实验表明，我们的方法在不同场景下均能保持稳定的性能，尤其在中小型数据集上显著优于现有方法。\n\u003C\u002Fdetails>\n  \n [📃 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.19152) | [🌐 项目页面](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fmast3r) | [⌨️ 代码](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fmast3r)\n\n\u003Cbr>\n\n### 4. 
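下面用一个极简的 Python 片段示意上文 MASt3R 中“互惠匹配”（reciprocal matching）的基本思想：对两组经 L2 归一化的稠密描述子各取最近邻，只保留双向一致的配对。注意这只是互为最近邻的朴素写法，张量名与相似度度量均为演示性假设，并非论文中的快速互惠匹配算法本身。

```python
import torch

def reciprocal_match(desc_a, desc_b):
    # desc_a: (Na, D), desc_b: (Nb, D)，两幅图像的稠密描述子（已做 L2 归一化）
    sim = desc_a @ desc_b.t()               # (Na, Nb) 余弦相似度
    ab = sim.argmax(dim=1)                  # A 中每个描述子在 B 中的最近邻
    ba = sim.argmax(dim=0)                  # B 中每个描述子在 A 中的最近邻
    idx_a = torch.arange(desc_a.shape[0])
    mutual = ba[ab] == idx_a                # 双向一致才保留
    return idx_a[mutual], ab[mutual]

if __name__ == '__main__':
    torch.manual_seed(0)
    a = torch.nn.functional.normalize(torch.randn(500, 64), dim=1)
    b = torch.nn.functional.normalize(torch.randn(400, 64), dim=1)
    ia, ib = reciprocal_match(a, b)
    print(int(ia.shape[0]), '对互为最近邻的匹配')
```

### 4. 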
CroCo：通过跨视图补全实现3D视觉任务的自监督预训练 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2022-Neurips-blue)\n**作者**: 菲利普·温扎佩尔、文森特·勒鲁瓦、托马斯·卢卡斯、罗曼·布雷吉耶、约汉·卡邦、瓦伊巴夫·阿罗拉、列昂尼德·安茨费尔德、鲍里斯·奇德洛夫斯基、加布里埃拉·丘尔卡、杰罗姆·雷沃\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n掩码图像建模（MIM）最近已成为一种强大的预训练范式。其核心思想是通过遮挡输入图像中的部分区域，然后让神经网络仅利用可见区域预测被遮挡的内容。这种预训练方法在微调后能够达到最先进的性能，尤其是在高阶语义任务中，例如图像分类和目标检测。然而，在本论文中，我们希望学习能够迁移到多种3D视觉及低阶几何下游任务的表征，比如深度估计或光流估计。受MIM启发，我们提出了一种无监督表示学习任务，该任务基于同一场景的不同视角拍摄的图像对进行训练。具体来说，我们设计了一个跨视图补全的前置任务：给定第一张图像的部分遮挡，模型需要结合可见内容和第二张图像来重构被遮挡的部分。在单视图MIM中，仅凭可见部分往往难以精确推断出被遮挡的内容，因此模型会倾向于依赖高层语义信息作为先验。而在跨视图补全中，由于存在另一张未被遮挡的图像作为参考，这种不确定性便得以消除，前提是模型能够理解两张图像之间的空间关系。我们的实验表明，这一前置任务显著提升了单目3D视觉下游任务（如深度估计）的性能。此外，我们的模型还可以直接应用于双目下游任务，例如光流或相对相机位姿估计，并在不使用任何特殊设计的情况下取得了具有竞争力的结果，即仅采用通用架构即可完成任务。\n\u003C\u002Fdetails>\n  \n [📃 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10716.pdf) | [🌐 项目页面](https:\u002F\u002Fcroco.europe.naverlabs.com\u002Fpublic\u002Findex.html) | [⌨️ 代码](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fcroco)\n\n\u003Cbr>\n\n\n### 5. CroCo v2：改进的跨视图补全预训练，用于立体匹配和光流 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2023-ICCV-f5cac3)\n**作者**: 菲利普·温扎佩尔、托马斯·卢卡斯、文森特·勒鲁瓦、约汉·卡邦、瓦伊巴夫·阿罗拉、罗曼·布雷吉耶、加布里埃拉·丘尔卡、列昂尼德·安茨费尔德、鲍里斯·奇德洛夫斯基、杰罗姆·雷沃\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n尽管自监督预训练方法在高阶下游任务中表现出色，但在立体匹配和光流等密集型几何视觉任务上，其效果仍不尽如人意。将实例判别或掩码图像建模等自监督概念应用于几何任务，目前仍是研究热点。在本工作中，我们基于近期提出的跨视图补全框架展开研究，该框架是掩码图像建模的一种变体，它利用同一场景下的第二张图像作为补充信息，因此非常适合双目下游任务。然而，这一方法的应用目前仍面临至少两个限制：(a) 实际场景图像对的收集难度较大——实践中大多只能使用合成数据；(b) 原生Transformer模型在处理密集型下游任务时存在局限性，因为对于这类任务而言，相对位置比绝对位置更为重要。为此，我们从三个方面进行了改进：首先，我们提出了一种大规模收集合适真实场景图像对的方法；其次，我们尝试了相对位置嵌入技术，并证明其能使Vision Transformer的表现大幅提升；最后，我们扩大了基于Vision Transformer的跨视图补全架构规模，这得益于大量数据的支持。通过这些改进，我们首次展示了无需使用传统任务特定技术（如相关体积、迭代估计、图像扭曲或多尺度推理）即可达到立体匹配和光流领域的最先进水平，从而为通用视觉模型的发展铺平道路。\n\u003C\u002Fdetails>\n  \n [📃 论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.10408) | [🌐 项目页面](https:\u002F\u002Fcroco.europe.naverlabs.com\u002Fpublic\u002Findex.html) | [⌨️ 代码](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fcroco)\n\n\u003Cbr>\n\n\n\n## 同期工作：\n## 2024年：\n\n### 1. FlowMap：通过梯度下降获取高质量相机位姿、内参与深度 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**: 卡梅伦·史密斯、大卫·查拉坦、阿尤什·特瓦里、文森特·西茨曼\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n本文介绍了一种端到端可微分的方法——FlowMap，用于求解视频序列中精确的相机位姿、相机内参以及每帧的稠密深度。我们的方法针对每个视频执行梯度下降优化，最小化一个简单的最小二乘目标函数，该函数比较由深度、内参和位姿所诱导的光流与通过现成光流算法和点跟踪获得的对应关系。除了利用点轨迹来促进长期几何一致性外，我们还引入了深度、内参和位姿的可微重参数化方式，使其更易于一阶优化。实验表明，我们方法恢复的相机参数和稠密深度能够支持在360°轨迹上使用高斯泼溅技术进行照片级真实感的新视角合成。我们的方法不仅显著优于以往基于梯度下降的束调整方法，而且令人惊讶的是，在360°新视角合成这一下游任务上，其性能甚至可以与当前最先进的SfM方法COLMAP相媲美——尽管我们的方法完全基于梯度下降、全程可微，并且与传统SfM方法有着本质上的区别。这一成果为自监督训练用于相机参数估计、三维重建和新视角合成的神经网络打开了新的大门。\n\u003C\u002Fdetails>\n  \n [📃 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.15259) | [🌐 项目页面](https:\u002F\u002Fcameronosmith.github.io\u002Fflowmap\u002F) | [⌨️ 代码](https:\u002F\u002Fgithub.com\u002Fdcharatan\u002Fflowmap)\n\n\u003Cbr>\n\n### 2. 
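针对上文 FlowMap 所描述的最小二乘目标（比较由深度、内参、位姿诱导出的光流与现成光流算法给出的对应关系），下面给出一个可微的示意实现；坐标约定与张量形状均为演示性假设，并非 FlowMap 官方代码。

```python
import torch

def induced_flow(depth, K, R, t):
    # depth: (H, W) 第一帧深度; K: (3, 3) 内参; R, t: 第一帧到第二帧的相对位姿
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing='ij')
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)       # (H, W, 3) 齐次像素坐标
    rays = pix @ torch.linalg.inv(K).t()                        # 反投影得到相机光线
    pts = rays * depth.unsqueeze(-1)                            # 第一帧相机坐标系下的 3D 点
    pts2 = pts @ R.t() + t                                      # 变换到第二帧坐标系
    proj = pts2 @ K.t()
    uv2 = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)        # 在第二帧中的像素位置
    return uv2 - pix[..., :2]                                   # 由深度、位姿、内参诱导出的光流

def flow_loss(depth, K, R, t, target_flow):
    # 与预先计算好的光流场之间的简单 L2 目标
    return ((induced_flow(depth, K, R, t) - target_flow) ** 2).mean()
```

由于 induced_flow 对 depth、R、t、K 均可微，将它们设为可学习参数后即可像 FlowMap 那样直接用梯度下降进行逐视频优化。

### 2. 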
场景坐标重建：通过增量学习重定位器实现图像集合的位姿估计 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**: 埃里克·布拉赫曼、杰米·温、陈帅、托马索·卡瓦拉里、阿龙·蒙斯帕特、达尼娅尔·图尔穆坎贝托夫、维克多·艾德里安·普里萨卡里乌\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们研究从一组描绘同一场景的图像中估计相机参数的任务。目前流行的基于特征的运动恢复结构（SfM）工具通常采用增量式重建的方式解决这一问题：它们反复进行稀疏3D点的三角测量，并将更多相机视图注册到稀疏点云中。我们将增量式运动恢复结构重新解释为对视觉重定位器的迭代应用与精炼，即一种将新视图注册到当前重建状态的方法。这种视角使我们能够探索不依赖局部特征匹配的替代性视觉重定位器。我们证明，基于学习的场景坐标回归方法能够从无位姿约束的图像中构建隐式的神经场景表示。与其他基于学习的重建方法不同，我们既不需要位姿先验，也不需要顺序输入，并且能够在数千张图像上高效地进行优化。我们的方法ACE0（ACE Zero）能够以与基于特征的SfM相当的精度估计相机位姿，这一点已通过新视角合成任务得到验证。\n\u003C\u002Fdetails>\n  \n [📃 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.14351) | [🌐 项目页面](https:\u002F\u002Fnianticlabs.github.io\u002Facezero\u002F) | [⌨️ 代码](https:\u002F\u002Fgithub.com\u002Fnianticlabs\u002Facezero)\n\n\u003Cbr>\n\n### 3. 在3D空间中匹配2D图像：基于度量对应关系的度量相对位姿 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-CVPR-green)\n**作者**: 阿克塞尔·巴罗索-拉古纳、索米娅·穆努库特拉、维克多·艾德里安·普里萨卡里乌、埃里克·布拉赫曼\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n给定两张图像，我们可以通过建立图像间的对应关系来估计它们之间的相对相机位姿。通常，这些对应关系是2D到2D的，因此我们估计的位姿仅在尺度上是不确定的。然而，一些旨在实现随时随地即时增强现实的应用则需要尺度确定的位姿估计，为此它们往往依赖外部深度估计算法来恢复尺度信息。我们提出了MicKey关键点匹配流水线，它能够预测3D相机空间中的度量对应关系。通过学习跨图像匹配3D坐标，我们可以无需深度测量即可推断出度量相对位姿。此外，训练过程中也无需深度数据、场景重建或图像重叠信息，MicKey仅需图像及其相对位姿的配对作为监督信号。在无地图重定位基准测试中，MicKey取得了当前最佳性能，且所需的监督信息少于其他竞争方法。\n\u003C\u002Fdetails>\n  \n [📃 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.06337) | [🌐 项目页面](https:\u002F\u002Fnianticlabs.github.io\u002Fmickey\u002F) | [⌨️ 代码](https:\u002F\u002Fgithub.com\u002Fnianticlabs\u002Fmickey)\n\n\u003Cbr>\n\n### 4. VGGSfM：视觉几何驱动的深度运动恢复结构 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-CVPR-green)\n**作者**: 王建元、尼基塔·卡拉耶夫、克里斯蒂安·鲁普雷希特、大卫·诺沃特尼\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n运动恢复结构（SfM）是计算机视觉领域的一个长期难题，其目标是从一组无约束的2D图像中重建场景的相机位姿和3D结构。传统的框架通常采用增量式方法来解决这一问题：检测并匹配关键点、注册图像、三角测量3D点，并进行束调整。近年来的研究主要集中在利用深度学习技术来提升特定环节（如关键点匹配），但这些方法仍然基于原始的不可微流程。与此不同，我们提出了一种全新的深度SfM流水线VGGSfM，其中每个组件都完全可微，因此可以进行端到端的训练。为此，我们引入了若干新机制和简化措施。首先，我们基于深度2D点跟踪的最新进展，提取可靠的像素级点轨迹，从而无需再进行两两匹配的链式操作。其次，我们不再逐步注册相机，而是基于图像和轨迹特征同时恢复所有相机的位姿。最后，我们通过一个可微的束调整层来优化相机位姿并三角测量3D点。我们在三个流行的数据集CO3D、IMC Phototourism和ETH3D上均达到了当前最先进的水平。\n\u003C\u002Fdetails>\n  \n [📃 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.04563) | [🌐 项目页面](https:\u002F\u002Fvggsfm.github.io\u002F) | [⌨️ 代码](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fvggsfm)\n\n\u003Cbr>\n\n### 5. 无检测器的运动恢复结构 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-CVPR-green)\n**作者**: 贺兴义、孙嘉铭、王一凡、彭思达、黄启星、鲍虎军、周小伟\n\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了一种新的运动恢复结构框架，用于从无序图像中恢复精确的相机位姿和点云。传统的SfM系统通常依赖于在多视图中成功检测可重复的关键点作为第一步，但这对于纹理稀少的场景来说非常困难，而关键点检测不佳可能会导致整个SfM系统崩溃。为此，我们提出了一种新的无检测器SfM框架，利用近期无检测器匹配方法的成功成果，避免了早期确定关键点的需求，同时解决了无检测器匹配方法中存在的多视图不一致性问题。具体而言，我们的框架首先基于量化后的无检测器匹配重建一个粗略的SfM模型。然后，通过一种新颖的迭代优化流程对该模型进行精炼：该流程在基于注意力的多视图匹配模块与几何精炼模块之间交替运行，前者用于细化特征轨迹，后者则用于提高重建精度。实验表明，所提出的框架在常用基准数据集上优于现有的基于检测器的SfM系统。此外，我们还收集了一个纹理稀少的SfM数据集，以展示我们的框架在重建纹理稀少场景方面的能力。基于该框架，我们在2023年图像匹配挑战赛中获得了第一名。\n\u003C\u002Fdetails>\n  \n [📃 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.15669) | [🌐 项目页面](https:\u002F\u002Fzju3dv.github.io\u002FDetectorFreeSfM\u002F) | [⌨️ 代码](https:\u002F\u002Fgithub.com\u002Fzju3dv\u002FDetectorFreeSfM)\n\n\u003Cbr>\n\n\n\n\n## 3D重建:\n\n\n## 2025:\n### 1. 
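上文 MicKey 的核心是在三维相机坐标系中预测度量级的 3D-3D 对应，再由此恢复带尺度的相对位姿。在进入下面的 3D 重建条目之前，先用经典的 Kabsch 算法给出一个从 3D-3D 对应求度量相对位姿的最小示例；MicKey 实际使用的是可学习的匹配与概率化求解器，这里仅作原理示意。

```python
import numpy as np

def kabsch_relative_pose(pts_a, pts_b):
    # pts_a, pts_b: (N, 3) 两个相机坐标系下互相对应的度量 3D 点
    ca, cb = pts_a.mean(axis=0), pts_b.mean(axis=0)
    H = (pts_a - ca).T @ (pts_b - cb)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # 防止出现反射
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T       # 将 A 坐标系点变到 B 坐标系的旋转
    t = cb - R @ ca                               # 度量级平移，无尺度不确定性
    return R, t
```

### 1. 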
SLAM3R: 基于单目RGB视频的实时稠密场景重建 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**作者**: 刘宇正、董思言、王书哲、尹英达、杨延超、范庆楠、陈宝权\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n本文介绍SLAM3R，这是一种新颖且高效的系统，能够利用RGB视频实现高质量的实时稠密3D重建。SLAM3R通过前馈神经网络无缝集成局部3D重建与全局坐标配准，提供端到端的解决方案。给定一段输入视频，系统首先使用滑动窗口机制将其转换为重叠片段。与传统的基于位姿优化的方法不同，SLAM3R直接从每个窗口中的RGB图像回归出3D点云，并逐步对这些局部点云进行对齐和变形，从而构建全局一致的场景重建——整个过程无需显式求解任何相机参数。跨多个数据集的实验结果一致表明，SLAM3R在保持20+ FPS实时性能的同时，实现了最先进的重建精度和完整性。[代码](https:\u002F\u002Fgithub.com\u002FPKU-VCL-3DV\u002FSLAM3R)。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.09401) | [💻 代码](https:\u002F\u002Fgithub.com\u002FPKU-VCL-3DV\u002FSLAM3R)\n\n\u003Cbr>\n\n\n### 2. MASt3R-SLAM: 带有3D重建先验的实时稠密SLAM ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**作者**: 村井力、埃里克·德克斯海默、安德鲁·J·戴维森\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了一种自下而上设计的实时单目稠密SLAM系统，其基础是MASt3R——一种两视图3D重建与匹配先验。凭借这一强大的先验知识，我们的系统即使不对相机模型做出固定或参数化的假设（仅要求具有唯一的相机中心），也能在野外视频序列中表现出良好的鲁棒性。我们引入了高效的点云匹配、相机跟踪与局部融合、图构建与回环闭合以及二阶全局优化方法。在已知相机标定的情况下，只需对系统进行简单修改，即可在各类基准测试中达到最先进水平。综上所述，我们提出了一套即插即用的单目SLAM系统，能够在以15 FPS运行的同时生成全局一致的位姿和稠密几何信息。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.12392) | [🌐 项目页面](https:\u002F\u002Fedexheim.github.io\u002Fmast3r-slam\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002Frmurai0610\u002FMASt3R-SLAM) \n\n\u003Cbr>\n\n\n### 3. MEt3R: 测量生成图像中的多视图一致性 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**作者**: 穆罕默德·阿西姆、克里斯托弗·韦韦尔、托马斯·维默、伯恩特·希勒、扬·埃里克·伦森\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了MEt3R，一种用于评估生成图像中多视图一致性的指标。大规模的多视图图像生成模型正在迅速推动从稀疏观测中进行3D推理领域的发展。然而，由于生成建模的本质，传统的重建指标并不适用于衡量生成输出的质量，因此迫切需要独立于采样过程的评价指标。在本工作中，我们专门针对生成的多视图图像之间的一致性这一方面进行了研究，该一致性可以独立于具体场景进行评估。我们的方法利用DUSt3R以前馈方式从图像对中获取稠密3D重建，进而将一个视图中的图像内容映射到另一个视图中。随后，比较这些图像的特征图，以获得一个不受视图相关效应影响的一致性分数。借助MEt3R，我们评估了大量先前用于新视图和视频生成的方法的一致性，其中包括我们公开的多视图潜在扩散模型。\n\u003C\u002Fdetails>\n  \n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.06336) | [🌐 项目页面](https:\u002F\u002Fgeometric-rl.mpi-inf.mpg.de\u002Fmet3r\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002Fmohammadasim98\u002FMEt3R)\n\n\n\u003Cbr>\n\n### 4. Fast3R：实现单次前向传播即可完成上千张图像的三维重建！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**作者**：Jianing Yang、Alexander Sax、Kevin J. Liang、Mikael Henaff、Hao Tang、Ang Cao、Joyce Chai、Franziska Meier、Matt Feiszli\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n多视角三维重建一直是计算机视觉领域的核心挑战，尤其是在需要跨不同视角生成准确且可扩展表示的应用中。当前的领先方法，如DUSt3R，采用的是基于成对处理的基本策略，即每次只处理两张图像，并需通过代价高昂的全局对齐步骤来整合多视角信息。在本工作中，我们提出了Fast 3D Reconstruction（Fast3R），这是一种针对DUSt3R的新型多视角泛化方法，能够并行处理大量视图，从而实现高效且可扩展的三维重建。Fast3R基于Transformer的架构可在一次前向传播中同时处理N张图像，无需迭代对齐过程。通过对相机位姿估计和三维重建任务的广泛实验，Fast3R展现出最先进的性能，在推理速度上显著提升，同时有效减少了误差累积。这些结果表明，Fast3R为多视角应用提供了一种稳健的替代方案，能够在不牺牲重建精度的前提下大幅提升可扩展性。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.13928) | [🌐 项目主页](https:\u002F\u002Ffast3r-3d.github.io\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffast3r)\n\u003Cbr>\n\u003Cbr>\n\n\n### 5. 
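下面是对上文 MEt3R（第 3 条）思路的一个粗略示意：利用 DUSt3R 式的成对预测把视图 A 的点图表达在视图 B 的相机坐标系中，将 A 的内容投影到 B，再在对应位置比较两幅特征图，以余弦相似度的均值作为多视图一致性分数。特征提取器、坐标约定与有效性判断均为假设，并非 MEt3R 的官方实现。

```python
import torch
import torch.nn.functional as F

def cross_view_consistency(points_a, feats_a, feats_b, K_b):
    # points_a: (H, W, 3) 视图 A 的点图，已表达在视图 B 的相机坐标系中
    # feats_a, feats_b: (C, H, W) 两个视图的特征图; K_b: (3, 3) 视图 B 的内参
    C, H, W = feats_a.shape
    proj = points_a @ K_b.t()
    uv = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)          # 投影到视图 B 的像素坐标
    grid = torch.stack([uv[..., 0] * 2 / (W - 1) - 1,
                        uv[..., 1] * 2 / (H - 1) - 1], dim=-1)   # 归一化到 [-1, 1]
    warped = F.grid_sample(feats_b[None], grid[None], align_corners=True)[0]
    score = F.cosine_similarity(feats_a, warped, dim=0)          # 逐像素一致性
    valid = points_a[..., 2] > 0
    return score[valid].mean()
```

### 5. 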
Light3R-SfM：迈向前馈式运动恢复结构！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**作者**：Sven Elflein、Qunjie Zhou、Sérgio Agostinho、Laura Leal-Taixé\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了Light3R-SfM，这是一个前馈式的端到端可学习框架，用于从无约束的图像集合中高效地进行大规模运动恢复结构（SfM）。与现有依赖于昂贵匹配和全局优化以实现精确三维重建的SfM解决方案不同，Light3R-SfM通过一个新颖的潜在全局对齐模块克服了这一局限性。该模块用可学习的注意力机制取代了传统的全局优化，能够有效地捕捉跨图像的多视角约束，从而实现鲁棒而精确的相机位姿估计。Light3R-SfM还利用检索得分引导的最短路径树构建稀疏场景图，与朴素方法相比大幅降低了内存使用和计算开销。大量实验表明，Light3R-SfM在保持竞争力的同时显著缩短了运行时间，非常适合那些对运行时有严格限制的实际应用场景中的三维重建任务。这项工作开创了一种数据驱动的前馈式SfM方法，为在真实环境中实现可扩展、准确且高效的三维重建铺平了道路。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.14914)\n\n\u003Cbr>\n\n\n### 6. MUSt3R：用于立体三维重建的多视角网络！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**作者**：Yohann Cabon、Lucas Stoffl、Leonid Antsfeld、Gabriela Csurka、Boris Chidlovskii、Jerome Revaud、Vincent Leroy\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\nDUSt3R通过提出一种无需任何关于相机标定或视角姿态先验信息就能对任意图像集合进行密集且无约束立体三维重建的模型，为几何计算机视觉引入了一种全新范式。然而，在其内部，DUSt3R实际上是以图像对的形式进行处理，回归出局部的三维重建结果，随后还需要将其对齐到全局坐标系中。由于待处理的图像对数量会随图像数量的增加呈平方级增长，这一特性成为了一个固有的限制，尤其在面对大型图像集时，会导致优化过程既缓慢又难以收敛。在本文中，我们提出了将DUSt3R从仅支持成对处理扩展至多视角支持的方案，从而解决了上述所有问题。具体而言，我们设计了一种名为MUSt3R的多视角立体三维重建网络，它通过使DUSt3R架构对称化并直接扩展为能够在统一坐标系下预测所有视角的三维结构来实现这一目标。此外，我们还为该模型引入了多层记忆机制，这不仅降低了计算复杂度，还使其能够扩展到大规模图像集的重建任务，以高帧率推断出数千张三维点云地图，且额外计算开销极小。该框架既可以离线运行，也可以在线执行，因此能够无缝应用于SfM和视觉SLAM场景，并在多种下游三维任务中表现出最先进的性能，包括未标定的视觉里程计、相对相机位姿、尺度与焦距估计、三维重建以及多视角深度估计等。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2503.01661) | [🌐 项目主页](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fmust3r) | [💻 代码](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fmust3r)\n\n\u003Cbr>\n\n\n\n\n\n### 7. VGGT：视觉几何增强型Transformer！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**作者**：Jianyuan Wang、Minghao Chen、Nikita Karaev、Andrea Vedaldi、Christian Rupprecht、David Novotny\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了VGGT，这是一种前馈式神经网络，能够直接从一张、几张或数百张图像中推断出场景的所有关键三维属性，包括相机参数、点云地图、深度图以及三维点轨迹。这种方法是三维计算机视觉领域的一大进步，因为以往的模型通常局限于单一任务或专门针对某一特定任务设计。此外，VGGT简单高效，能在不到一秒钟内完成图像重建，且性能仍优于那些需要借助视觉几何优化技术进行后处理的替代方案。该网络在多项三维任务中均达到了最先进的水平，包括相机参数估计、多视角深度估计、密集点云重建以及三维点跟踪等。我们还证明，将预训练的VGGT作为特征主干网络使用，可以显著提升下游任务的表现，例如非刚性点跟踪和前馈式新视角合成等。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.11651) | [🌐 项目主页](https:\u002F\u002Fvgg-t.github.io\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fvggt) | [🤗 演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Ffacebook\u002Fvggt)\n\n\u003Cbr>\n\n### 8. 
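上文 Light3R-SfM（第 5 条）用检索得分引导的最短路径树来构建稀疏场景图，以避免处理全部 O(N²) 图像对。下面用一个朴素的最大生成树（Prim 算法）示意“由两两检索相似度得到稀疏连接”这一思路；这并不是论文中的最短路径树算法，数据结构与选边准则均为演示性假设。

```python
import numpy as np

def sparse_scene_graph(similarity):
    # similarity: (N, N) 对称的图像间检索得分（越大越相关，float 类型）
    N = similarity.shape[0]
    in_tree = np.zeros(N, dtype=bool)
    in_tree[0] = True
    edges = []
    for _ in range(N - 1):
        candidate = np.where(in_tree[:, None] & ~in_tree[None, :], similarity, -np.inf)
        i, j = np.unravel_index(np.argmax(candidate), candidate.shape)   # 离开当前树的最强边
        edges.append((int(i), int(j)))
        in_tree[j] = True
    return edges    # N-1 条边构成覆盖所有图像的生成树

if __name__ == '__main__':
    rng = np.random.default_rng(0)
    sim = rng.random((6, 6))
    print(sparse_scene_graph((sim + sim.T) * 0.5))
```

### 8. 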
Pow3R：利用相机与场景先验赋能无约束的3D重建 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**作者**: Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, Jerome Revaud\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了Pow3r，一种新型的大规模3D视觉回归模型，其输入模态具有极高的灵活性。与以往缺乏在测试时利用已知相机或场景先验机制的前馈模型不同，Pow3r能够在单一网络中整合任意组合的辅助信息，如内参、相对位姿、稠密或稀疏深度等，同时结合输入图像。基于最近的DUSt3R范式——一种利用强大预训练的Transformer架构——我们的轻量级且多功能的条件引导模块可在辅助信息可用时为网络提供额外指导，从而预测出更准确的结果。在训练过程中，我们每轮迭代都会向模型输入随机选择的模态子集，这使得模型在测试时能够适应不同程度的已知先验条件。由此，它还具备了新的能力，例如以原生图像分辨率进行推理，或完成点云填充。我们在3D重建、深度补全、多视角深度预测、多视角立体视觉以及多视角姿态估计等任务上的实验均取得了当前最优的结果，证实了Pow3r在充分利用所有可用信息方面的有效性。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.17316) | [🌐 项目页面](https:\u002F\u002Feurope.naverlabs.com\u002Fresearch\u002Fpublications\u002Fpow3r-empowering-unconstrained-3d-reconstruction-with-camera-and-scene-priors\u002F)\n\n\u003Cbr>\n\n\n\n\n### 9. Mono3R：利用单目线索实现几何3D重建 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: Wenyu Li, Sidun Liu, Peng Qiao, Yong Dou\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n近年来，基于数据驱动的几何多视角3D重建基础模型（如DUSt3R）在各类3D视觉任务中表现出色，这得益于大规模高质量3D数据集的发布。然而，我们观察到，受限于其基于匹配的原理，现有模型在缺乏足够匹配线索的困难区域，尤其是纹理较弱和低光照条件下，重建质量会显著下降。为缓解这些局限性，我们提出利用单目几何估计 inherent 的鲁棒性来弥补基于匹配方法的不足。具体而言，我们引入了一个单目引导的精修模块，将单目几何先验融入多视角重建框架中。这一整合大幅提升了多视角重建系统的鲁棒性，从而实现高质量的前馈式重建。在多个基准上的全面实验表明，我们的方法在多视角相机位姿估计和点云精度方面均有显著提升。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.13419)\n\n\u003Cbr>\n\n\n### 10. Point3R：基于显式空间指针记忆的流式3D重建 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXir-red)\n**作者**: Yuqi Wu, Wenzhao Zheng, Jie Zhou, Jiwen Lu\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n从有序序列或无序图像集合中进行稠密3D场景重建，是将计算机视觉研究应用于实际场景的关键步骤。遵循DUSt3R所提出的范式，该范式可将一对图像稠密地统一到共享坐标系中，后续方法通常通过维护隐式记忆来实现更多图像的稠密3D重建。然而，这种隐式记忆容量有限，且可能丢失早期帧的信息。为此，我们提出了Point3R，一个面向稠密流式3D重建的在线框架。具体来说，我们维护了一种与当前场景3D结构直接关联的显式空间指针记忆。该记忆中的每个指针都对应一个特定的3D位置，并在全球坐标系中聚合附近场景信息，形成不断变化的空间特征。来自最新帧的信息会与该指针记忆进行显式交互，从而将当前观测结果稠密地整合到全局坐标系中。我们设计了一种3D分层位置嵌入来促进这种交互，并开发了一种简单而有效的融合机制，以确保我们的指针记忆既统一又高效。我们的方法在多种任务上均实现了具有竞争力或当前最优的性能，且训练成本较低。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.02863) | [🌐 项目页面](https:\u002F\u002Fykiwu.github.io\u002FPoint3R\u002F) | | [💻 代码](https:\u002F\u002Fgithub.com\u002FYkiWu\u002FPoint3R)\n\n\u003Cbr>\n\n\n\n### 11. MoGe-2：兼具度量尺度与清晰细节的精准单目几何 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, Jiaolong Yang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了MoGe-2，一种先进的开放域几何估计模型，能够从单张图像中恢复场景的度量尺度3D点云。我们的方法建立在近期的单目几何估计方法MoGe之上，后者可预测具有未知尺度但保持仿射不变性的点云。我们探索了有效策略，以在不牺牲仿射不变性点表示所提供的相对几何精度的前提下，将MoGe扩展用于度量几何预测。此外，我们发现真实数据中的噪声和误差会削弱预测几何的细粒度细节。为此，我们开发了一套统一的数据精炼方法，利用清晰的合成标签对来自不同来源的真实数据进行过滤和补全，从而显著提升重建几何的精细程度，同时保持整体精度。我们使用混合数据集的大规模语料库训练了该模型，并进行了全面评估，结果表明其在实现精确的相对几何、准确的度量尺度以及细粒度细节恢复等方面表现卓越——这些能力是此前任何方法都无法同时达到的。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.02546) | [🌐 项目页面](https:\u002F\u002Fwangrc.site\u002FMoGe2Page\u002F) | [💻 代码 ](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmoge) | [🤗 演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FRuicheng\u002FMoGe-2)\n\n\u003Cbr>\n\n### 12. 
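下面用一个玩具化的类示意上文 Point3R（第 10 条）中“显式空间指针记忆”的大致思路：每个指针对应全局坐标系中的一个 3D 位置并聚合附近的观测；新的点要么融合进最近的指针，要么生成新指针。半径阈值、特征维度与融合规则均为假设，并非论文实现。

```python
import torch

class PointerMemory:
    # 玩具版显式空间记忆：每个指针保存一个 3D 位置与一条融合后的特征
    def __init__(self, radius=0.05, feat_dim=64):
        self.radius = radius
        self.pos = torch.empty(0, 3)
        self.feat = torch.empty(0, feat_dim)
        self.count = torch.empty(0)

    def integrate(self, points, feats):
        # points: (N, 3) 已对齐到全局坐标系的新点; feats: (N, feat_dim) 对应特征
        for p, f in zip(points, feats):
            if self.pos.shape[0] > 0:
                d = torch.linalg.norm(self.pos - p, dim=1)
                j = int(torch.argmin(d))
                if float(d[j]) <= self.radius:                  # 融合进最近的已有指针
                    self.count[j] += 1
                    self.feat[j] += (f - self.feat[j]) / self.count[j]
                    self.pos[j] += (p - self.pos[j]) / self.count[j]
                    continue
            self.pos = torch.cat([self.pos, p[None]])           # 否则生成新指针
            self.feat = torch.cat([self.feat, f[None]])
            self.count = torch.cat([self.count, torch.ones(1)])
```

### 12. 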
基于全局尺度一致3D高斯点云的地图构建的户外单目SLAM ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-ICCV-pink)\n**作者**: 程冲、于思成、王子健、周逸凡、王浩\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n3D高斯泼溅（3DGS）因其高保真度和实时新视图合成性能，已成为SLAM领域中广受欢迎的解决方案。然而，一些先前的3DGS SLAM方法采用可微分渲染管线进行位姿跟踪，在户外场景中缺乏几何先验信息。另一些方法则引入了独立的跟踪模块，但当相机发生显著移动时，误差会不断累积，从而导致尺度漂移。为应对这些挑战，我们提出了一种鲁棒的纯RGB户外3DGS SLAM方法：S3PO-GS。在技术层面，我们基于3DGS点云建立了自洽的跟踪模块，避免了尺度漂移的累积，并以更少的迭代次数实现了更为精确和稳健的位姿跟踪。此外，我们还设计了一个基于补丁的点云动态建图模块，该模块在引入几何先验的同时，避免了尺度模糊性。这显著提升了位姿跟踪精度和场景重建质量，使其特别适用于复杂的户外环境。我们在Waymo、KITTI和DL3DV数据集上的实验表明，S3PO-GS在新视图合成方面达到了当前最优水平，并且在位姿跟踪精度上优于其他3DGS SLAM方法。项目主页：此网址。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.03737) | [🌐 项目主页](https:\u002F\u002F3dagentworld.github.io\u002FS3PO-GS\u002F) | [💻 代码 ](https:\u002F\u002Fgithub.com\u002F3DAgentWorld\u002FS3PO-GS)\n\n\u003Cbr>\n\n\n\n### 13. π^3：可扩展的置换等变视觉几何学习 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: 王一凡、周建军、朱浩毅、常文政、周洋、李子尊、陈俊义、庞江淼、沈春华、何通\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了一种前馈神经网络，它为视觉几何重建提供了一种全新的方法，打破了对传统固定参考视图的依赖。以往的方法通常将重建结果锚定在一个指定的视角上，这种归纳偏置可能导致若参考视角不佳时出现不稳定或失败的情况。相比之下，π^3采用完全置换等变的架构，无需任何参考帧即可预测仿射不变的相机位姿和尺度不变的局部点云地图。这一设计使我们的模型天然具备输入顺序无关性和高度可扩展性。这些优势使得我们这种简单且无偏见的方法能够在包括相机位姿估计、单目\u002F视频深度估计以及稠密点云重建在内的广泛任务中达到当前最优性能。代码和模型已公开发布。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.13347) | [🌐 项目主页](https:\u002F\u002Fyyfz.github.io\u002Fpi3\u002F) | [💻 代码 ](https:\u002F\u002Fgithub.com\u002Fyyfz\u002FPi3) | [🤗 演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fyyfz233\u002FPi3)\n\n\u003Cbr>\n\n\n\n### 14. LONG3R：长序列流式3D重建 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-ICCV-pink)\n**作者**: 陈卓光、秦明辉、袁天元、刘哲、赵航\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n近年来，多视角场景重建技术取得了显著进展，然而现有方法在处理输入图像流时仍存在局限性。这些方法要么依赖耗时的离线优化，要么仅限于较短的序列，从而限制了其在实时场景中的应用。在本工作中，我们提出了LONG3R（LOng sequence streaming 3D Reconstruction），一种专为长序列多视角3D场景流式重建而设计的新模型。我们的模型通过循环操作实现实时处理，每次接收到新观测时都会更新并维护内存。首先，我们采用记忆门控机制筛选相关记忆，再将其与新观测一同输入到双源精炼解码器中，进行由粗到细的交互式处理。为了有效捕捉长序列记忆，我们提出了一种3D时空记忆结构，可在动态修剪冗余空间信息的同时，沿场景自适应调整分辨率。为提升模型在长序列上的表现并保持训练效率，我们采用了两阶段课程式训练策略，每个阶段针对特定能力进行优化。实验表明，LONG3R在长序列场景下优于当前最先进的流式重建方法，同时还能保持实时推理速度。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2507.18255) | [🌐 项目主页](https:\u002F\u002Fzgchen33.github.io\u002FLONG3R\u002F) | [💻 代码 ](https:\u002F\u002Fzgchen33.github.io\u002FLONG3R\u002F)\n\n\u003Cbr>\n\n\n\n### 15. VGGT-Long：分块、循环、对齐——突破VGGT在千米级长RGB序列中的极限 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-ICCV-pink)\n**作者**: 邓凯、季泽鑫、徐嘉伟、杨健、谢进\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n近年来，用于3D视觉的基础模型在3D感知方面展现了非凡的能力。然而，将这些模型扩展到大规模RGB流3D重建仍然面临内存限制等挑战。在本工作中，我们提出了VGGT-Long，一个简单却高效的系统，它将单目3D重建的极限推向了千米级、无边界的大规模户外环境。我们的方法通过分块处理策略结合重叠对齐和轻量级回环优化，解决了现有模型在 scalability 方面的瓶颈问题。无需相机标定、深度监督或模型重新训练，VGGT-Long即可实现与传统方法相当的轨迹和重建效果。我们在KITTI、Waymo和Virtual KITTI数据集上对我们的方法进行了评估。VGGT-Long不仅能在基础模型通常无法处理的长RGB序列上成功运行，还能在各种条件下生成准确且一致的几何结构。我们的研究结果凸显了利用基础模型在实际场景中实现可扩展单目3D场景重建的潜力，尤其是在自动驾驶领域。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.16443) | [💻 代码 ](https:\u002F\u002Fgithub.com\u002FDengKaiCQ\u002FVGGT-Long)\n\n\u003Cbr>\n\n### 16. 
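上文 VGGT-Long（第 15 条）采用分块处理加重叠对齐的策略。下面给出一个示意：取相邻两个分块在重叠帧上的对应点（例如相机中心），用经典的 Umeyama 相似变换估计 Sim(3)，再把新分块变换到前一分块的坐标系中。对应点的选取方式为假设，回环优化亦未包含。

```python
import numpy as np

def umeyama_sim3(src, dst):
    # src, dst: (N, 3) 对应点（例如两个分块在重叠帧处的相机中心）
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / src.shape[0]
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                                   # 处理反射情形
    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / xs.var(axis=0).sum()
    t = mu_d - scale * R @ mu_s
    return scale, R, t        # 把第 k 个分块映射到第 k-1 个分块坐标系的相似变换

def align_chunk(points_new, scale, R, t):
    return scale * points_new @ R.T + t
```

### 16. 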
STream3R：基于因果Transformer的可扩展序列化3D重建 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: 兰宇诗、罗一航、洪方舟、周尚辰、陈鸿华、吕兆阳、杨帅、戴博、洛晨澈、潘星刚\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了STream3R，这是一种新颖的3D重建方法，它将点云预测重新表述为仅解码器的Transformer问题。现有的多视角重建最先进方法要么依赖于昂贵的全局优化，要么依靠简单且随序列长度增加而性能急剧下降的记忆机制。相比之下，STream3R引入了一种流式框架，该框架受现代语言建模进展的启发，利用因果注意力高效处理图像序列。通过从大规模3D数据集中学习几何先验，STream3R能够很好地泛化到各种复杂场景，包括传统方法常常失效的动态场景。大量实验表明，我们的方法在静态和动态场景基准测试中均持续优于先前的工作。此外，STream3R与LLM风格的训练基础设施天然兼容，从而能够高效地进行大规模预训练，并针对各种下游3D任务进行微调。我们的研究结果凸显了因果Transformer模型在在线3D感知方面的潜力，为流式环境中实时3D理解铺平了道路。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.10893v1) | [🌐 项目页面](https:\u002F\u002Fnirvanalan.github.io\u002Fprojects\u002Fstream3r\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002FNIRVANALAN\u002FSTream3R)\n\n\u003Cbr>\n\n\n\n### 17. Dens3R：用于3D几何预测的基础模型 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: 方贤泽、高景楠、王哲、陈卓、任兴宇、吕江静、任乔木、杨中雷、杨晓康、颜义超、吕成飞\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n近年来，密集3D重建技术取得了显著进展，然而实现准确的统一几何预测仍然是一个重大挑战。大多数现有方法仅限于从输入图像中预测单一几何量。然而，深度、表面法向量和点云等几何量之间存在内在关联，单独估计这些量往往难以保证一致性，从而限制了精度和实际应用价值。这促使我们探索一种显式建模不同几何属性之间结构耦合的统一框架，以实现联合回归。在本文中，我们提出了Dens3R，这是一种专为联合几何密集预测设计的3D基础模型，可适应广泛的下游任务。Dens3R采用两阶段训练框架，逐步构建既具有泛化能力又具备内在不变性的点云表示。具体而言，我们设计了一个轻量级共享编码器-解码器主干网络，并引入位置插值旋转位置编码，以在保持表达能力的同时增强对高分辨率输入的鲁棒性。通过整合图像对匹配特征与内在不变性建模，Dens3R能够准确回归表面法向量和深度等多种几何量，从而实现从单视图到多视图输入的一致几何感知。此外，我们还提出了一套后处理流水线，支持几何一致性的多视图推理。大量实验表明，Dens3R在各类密集3D预测任务中表现出色，并展现出更广泛的应用潜力。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.16290) | [🌐 项目页面](https:\u002F\u002Fg-1nonly.github.io\u002FDens3R\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002FG-1nOnly\u002FDens3R)\n\n\u003Cbr>\n\n\n\n### 18. ViPE：用于3D几何感知的视频姿态引擎 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: 黄嘉辉、周群杰、赫萨姆·拉贝蒂、亚历山大·科罗夫科、凌欢、任轩驰、沈天畅、高俊、德米特里·斯列皮切夫、林陈轩、任嘉伟、谢凯文、比斯瓦斯·乔伊迪普、莱阿尔-泰克丝·劳拉、菲德勒·桑雅\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n准确的3D几何感知是众多空间AI系统的重要前提。尽管最先进的方法依赖于大规模训练数据，但从野外视频中获取一致且精确的3D标注仍是一个关键挑战。在本工作中，我们推出了ViPE，这是一个便捷且多功能的视频处理引擎，旨在填补这一空白。ViPE能够高效地从无约束的原始视频中估计相机内参、相机运动以及稠密的近度量级深度图。它对各种场景都具有鲁棒性，包括动态自拍视频、电影镜头或行车记录仪视频，并支持多种相机模型，如针孔相机、广角相机和360°全景相机。我们在多个基准上对ViPE进行了评测。值得注意的是，在TUM\u002FKITTI序列上，它分别比现有的未标定姿态估计基线高出18%\u002F50%，并且在标准输入分辨率下，单GPU运行速度可达3–5 FPS。我们使用ViPE标注了一个大规模视频集合。该集合包含约10万条真实互联网视频、100万条高质量AI生成视频以及2000条全景视频，总计约9600万帧——所有视频均附有准确的相机姿态和稠密深度图。我们开源了ViPE及其标注数据集，以期加速空间AI系统的开发。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Ftoronto-ai\u002Fvipe\u002Fassets\u002Fpaper.pdf) | [🌐 项目页面](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Ftoronto-ai\u002Fvipe\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002Fnv-tlabs\u002Fvipe?tab=readme-ov-file)\n\n\u003Cbr>\n\n### 19. 
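上文 STream3R（第 16 条）的关键是用因果注意力按帧流式处理图像序列。下面构造一个帧级因果掩码的小例子：同一帧内的 token 互相可见，且只能看到更早的帧。该布尔掩码可直接作为 torch.nn.functional.scaled_dot_product_attention 的 attn_mask（True 表示允许注意）；这只是机制示意，并非 STream3R 的网络结构。

```python
import torch

def frame_causal_mask(num_frames, tokens_per_frame):
    # 返回 (T*P, T*P) 布尔掩码：[q, k] 为 True 表示查询 token q 可以注意键 token k
    frame_id = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_id[None, :] <= frame_id[:, None]   # 只允许注意同帧或更早帧的 token

if __name__ == '__main__':
    m = frame_causal_mask(num_frames=3, tokens_per_frame=2)
    print(m.int())
    # 第 0 帧只看到自己，第 2 帧能看到 0..2 帧：后续帧到来不会改变已输出的结果，适合流式推理
```

### 19. 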
Test3R：在测试时学习重建三维场景！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**：袁宇恒、沈秋红、王世尊、杨星毅、王新超\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n像DUSt3R这样的密集匹配方法通过回归成对点云来进行三维重建。然而，这种依赖于成对预测的方式以及有限的泛化能力，天然地限制了全局几何一致性。在本工作中，我们提出了\\textbf{Test3R}——一种出人意料地简单的测试时学习技术，能够显著提升几何精度。Test3R利用图像三元组（$I_1,I_2,I_3$），分别生成由$(I_1,I_2)$和$(I_1,I_3)$组成的重建结果。其核心思想是在测试时通过自监督目标优化网络：最大化这两份重建结果相对于共同图像$I_1$的几何一致性。这确保了模型无论输入如何，都能产生跨对一致的输出。大量实验表明，我们的技术在三维重建和多视角深度估计任务上显著优于先前的最先进方法。此外，该技术具有普适性且几乎无额外开销，因此可以轻松应用于其他模型，并以极低的测试时训练开销和参数占用实现。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.13750) | [🌐 项目主页](https:\u002F\u002Ftest3r-nop.github.io\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002FnopQAQ\u002FTest3R)\n\n\u003Cbr>\n\n\n\n\n### 20. SAIL-Recon：通过结合场景回归与定位实现大规模SfM！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**：邓俊远、李恒、谢涛、任伟强、张倩、谭平、郭晓阳\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n场景回归方法，如VGGT，通过直接从输入图像中回归相机位姿和三维场景结构来解决运动恢复结构（SfM）问题。它们在处理极端视角变化的图像时表现出色。然而，这些方法难以应对大量输入图像。为了解决这一问题，我们引入了SAIL-Recon，这是一种基于前馈Transformer的大规模SfM方法，通过增强场景回归网络的视觉定位能力来实现。具体而言，我们的方法首先从一组锚定图像中计算出神经场景表征，然后微调回归网络，使其能够在该神经场景表征的条件下重建所有输入图像。全面的实验表明，我们的方法不仅能够高效扩展到大规模场景，而且在相机位姿估计和新视图合成等基准测试中均达到了最先进的水平，包括TUM-RGBD、CO3Dv2以及Tanks & Temples数据集。我们计划发布模型和代码。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.17972) | [🌐 项目主页](https:\u002F\u002Fhkust-sail.github.io\u002Fsail-recon\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002FHKUST-SAIL\u002Fsail-recon)\n\n\u003Cbr>\n\n\n### 21. FastVGGT：无需训练加速视觉几何Transformer！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**：沈友、张志鹏、曲彦松、曹柳娟\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n近年来，用于三维视觉的基础模型在三维感知方面展现了非凡的能力。然而，由于推理效率低下，将这些模型扩展到长序列图像输入仍然是一个重大挑战。在本工作中，我们对最先进的前馈视觉几何模型VGGT进行了详细分析，并确定了其主要瓶颈。可视化进一步揭示了注意力图中的标记坍缩现象。基于这些发现，我们探索了在前馈视觉几何模型中进行标记合并的可能性。然而，由于三维模型独特的架构和任务特性，直接应用现有的合并技术颇具挑战。为此，我们提出了FastVGGT，这是首次在三维领域利用无需训练的机制来加速VGGT。我们设计了一种专为三维架构和任务定制的独特标记划分策略，有效消除了冗余计算，同时保留了VGGT强大的重建能力。在多个三维几何基准上的广泛实验验证了我们方法的有效性。值得注意的是，在处理1000张输入图像时，FastVGGT相比VGGT实现了4倍的加速，同时缓解了长序列场景中的误差累积问题。这些发现凸显了标记合并作为可扩展三维视觉系统的一种原则性解决方案的潜力。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.02560v1) | [🌐 项目主页](https:\u002F\u002Fmystorm16.github.io\u002Ffastvggt\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002Fmystorm16\u002FFastVGGT)\n\n\u003Cbr>\n\n\n### 22. HAMSt3R：面向人体的多视角立体三维重建！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-ICCV-pink)\n**作者**：萨拉·罗哈斯、马蒂厄·阿曼多、伯纳德·加门、菲利普·温泽费尔、文森特·勒鲁瓦、格雷戈里·罗热\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n从稀疏的未校准图像集中恢复场景的三维几何结构，是计算机视觉领域长期存在的难题。尽管近年来基于学习的方法，如DUSt3R和MASt3R，通过直接预测密集的场景几何取得了令人瞩目的成果，但它们主要是在静态环境的户外场景上训练的，难以处理以人体为中心的场景。在本工作中，我们提出了HAMSt3R，它是MASt3R的扩展版本，用于从稀疏的未校准多视角图像中联合重建人体和场景的三维结构。首先，我们利用DUNE——一个强大的图像编码器，它通过对MASt3R和最先进的单帧人体网格恢复（HMR）模型multi-HMR等编码器进行蒸馏而获得——来更好地理解场景几何和人体结构。随后，我们的方法增加了额外的网络头部，用于分割人体、通过DensePose估计密集对应关系，并在以人为中心的环境中预测深度，从而实现更全面的三维重建。借助不同头部的输出，HAMSt3R生成了一张富含人体语义信息的密集点云。与依赖复杂优化流程的现有方法不同，我们的方法完全是前馈式的，且高效，因此非常适合实际应用。我们使用EgoHumans和EgoExo4D这两个包含多样化人体相关场景的挑战性基准测试了我们的模型。此外，我们也验证了其在传统多视角立体视觉和多视角姿态回归任务中的泛化能力。结果表明，我们的方法能够在有效重建人体的同时，保持在一般三维重建任务中的优异性能，从而弥合了三维视觉中对人体与场景理解之间的鸿沟。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2508.16433)\n\n\u003Cbr>\n\n### 23. 
ViSTA-SLAM：基于对称双视图关联的视觉SLAM ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: 张甘霖、钱申涵、王曦、丹尼尔·克雷默斯\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了ViSTA-SLAM，这是一种无需相机内参即可运行的实时单目视觉SLAM系统，因此能够广泛适用于各种不同的相机配置。该系统的核心是一个轻量级的对称双视图关联（STA）模型作为前端，它仅通过两幅RGB图像就能同时估计相对相机位姿并回归局部点云地图。这种设计显著降低了模型复杂度，我们的前端大小仅为同类最先进方法的35%，同时提升了流水线中使用的双视图约束的质量。在后端，我们构建了一个专门设计的Sim(3)位姿图，并引入回环闭合来解决累积漂移问题。大量实验表明，与现有方法相比，我们的方法在相机跟踪和稠密3D重建质量方面均表现出色。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.01584) | [🌐 项目页面](https:\u002F\u002Fganlinzhang.xyz\u002Fvista-slam\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002Fzhangganlin\u002Fvista-slam)\n\n\u003Cbr>\n\n\n### 24. Rig3R：面向学习型3D重建的相机阵列感知条件建模 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: 塞缪尔·李、普吉特·卡查纳、普拉贾瓦尔·奇达南达、索拉布·奈尔、安武隆川、马修·布朗\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n从多相机阵列中估计智能体位姿和3D场景结构是自主驾驶等具身AI应用中的核心任务。近期的DUSt3R等学习型方法在多视角场景中表现出了令人印象深刻的效果。然而，这些模型将图像视为无结构的集合，从而限制了其在使用已知或可推断结构的同步相机阵列采集的场景中的有效性。为此，我们提出了Rig3R，它是对先前多视角重建模型的扩展，在存在相机阵列结构时将其纳入考虑，并在缺失时也能推断出该结构。Rig3R会根据可选的相机阵列元数据（包括相机ID、时间及阵列位姿）进行条件建模，从而构建一个对相机阵列结构敏感的潜在空间，即使信息缺失也能保持稳健性。它同时预测点云地图以及两种类型的射线图：一种是相对于全局坐标系的位姿射线图，另一种则是相对于以相机阵列为重心的坐标系、且随时间保持一致的阵列射线图。当元数据缺失时，阵列射线图使模型能够直接从输入图像中推断出相机阵列结构。Rig3R在3D重建、相机位姿估计和相机阵列发现等方面均达到了当前最先进的水平，在多种真实世界相机阵列数据集上，其综合平均准确率（mAA）比传统方法和学习型方法高出17%至45%，且所有这些结果均可通过一次前向传播获得，无需后处理或迭代优化。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.02265) | [🌐 项目页面](https:\u002F\u002Fwayve.ai\u002Fthinking\u002Frig3r\u002F)\n\n\u003Cbr>\n\n\n\n### 25. PLANA3R：通过前馈式平面投射实现零样本度量平面3D重建 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: 刘昌坤、谭斌、柯泽然、张尚瞻、刘嘉辰、钱明、薛楠、沈宇俊、特里斯坦·布罗德\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n本文旨在利用室内场景固有的几何规律，通过紧凑的表示形式实现其度量3D重建。我们采用平面3D基元——一种非常适合人造环境的表示方式——提出PLANA3R，这是一个无需位姿输入即可从任意姿态的双视图图像中进行度量平面3D重建的框架。我们的方法利用视觉Transformer提取一组稀疏的平面基元，估计相对相机位姿，并通过平面投射监督几何学习过程，其中梯度会通过高分辨率渲染的基元深度图和法线图进行反向传播。与以往需要在训练过程中提供3D平面标注的前馈方法不同，PLANA3R无需显式的平面监督即可学习平面3D结构，从而能够在仅包含深度和法线标注的大规模立体数据集上进行可扩展的训练。我们在多个具有度量标注的室内场景数据集上验证了PLANA3R，并证明其在度量评估标准下，能够针对多种任务对域外室内环境实现强大的泛化能力，包括3D表面重建、深度估计和相对位姿估计。此外，由于采用了平面3D表示，我们的方法还具备精确的平面分割能力。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.18714) | [🌐 项目页面](https:\u002F\u002Flck666666.github.io\u002Fplana3r\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002Flck666666\u002Fplana3r)\n\n\u003Cbr>\n\n\n### 26. TTT3R：将3D重建视为测试时训练 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: 陈星宇、陈悦、修玉良、安德烈亚斯·盖格、陈安培\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n现代循环神经网络因其线性时间复杂度而成为3D重建领域中颇具竞争力的架构。然而，当应用于超出训练上下文长度的场景时，其性能会显著下降，这表明其长度泛化能力有限。在本工作中，我们从测试时训练的角度重新审视3D重建的基础模型，将其设计视为一个在线学习问题。基于这一视角，我们利用记忆状态与新观测之间的对齐置信度，推导出用于更新记忆状态的闭式学习率，以在保留历史信息和适应新观测之间取得平衡。这种无需训练的干预措施被称为TTT3R，它大幅提升了长度泛化能力，在全局位姿估计方面较基准方法提高了，同时仅需6 GB显存即可以20 FPS的速度处理数千张图像。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.26645) | [🌐 项目页面](https:\u002F\u002Frover-xingyu.github.io\u002FTTT3R\u002F) | [💻 代码](rover-xingyu.github.io\u002FTTT3R\u002F)\n\n\u003Cbr>\n\n\n## 2024年：\n\n### 1. 
Spurfies：基于局部几何先验的稀疏视图表面重建 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**：Kevin Raj、Christopher Wewer、Raza Yunus、Eddy Ilg、Jan Eric Lenssen\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了Spurfies，一种新颖的稀疏视图表面重建方法，该方法将外观和几何信息解耦，从而利用在合成数据上训练的局部几何先验。近年来的研究主要集中在使用密集多视角设置进行3D重建，通常需要数百张图像。然而，这些方法在少视图场景中往往表现不佳。现有的稀疏视图重建技术通常依赖于多视角立体网络，这些网络需要从大量数据中学习几何和外观的联合先验。相比之下，我们引入了一种神经点表示法，将几何与外观解耦，仅使用ShapeNet合成数据集的一个子集来训练局部几何先验。在推理阶段，我们利用这一表面先验作为额外的约束条件，通过可微分体积渲染从稀疏输入视图中重建表面和外观，从而限制可能解的空间。我们在DTU数据集上验证了我们方法的有效性，并证明其在表面质量方面比先前的最先进方法提高了35%，同时在新视角合成质量方面也具有竞争力。此外，与以往工作不同，我们的方法可以应用于更大、无界的场景，例如Mip-NeRF 360。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.16544) | [🌐 项目页面](https:\u002F\u002Fgeometric-rl.mpi-inf.mpg.de\u002Fspurfies\u002Findex.html) | [💻 代码](https:\u002F\u002Fgithub.com\u002FkevinYitshak\u002Fspurfies)\n\n\u003Cbr>\n\n\n### 2. 基于空间记忆的3D重建 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**：Hengyi Wang、Lourdes Agapito\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了一种名为Spann3R的新方法，用于从有序或无序的图像集合中进行稠密3D重建。Spann3R基于DUSt3R范式，采用基于Transformer的架构，直接从图像中回归点云图，而无需任何关于场景或相机参数的先验知识。与DUSt3R不同，后者为每对图像预测各自在其局部坐标系中的点云图，Spann3R则能够预测以全局坐标系表示的单张图像点云图，从而消除了基于优化的全局配准需求。Spann3R的核心思想是管理一个外部空间记忆，该记忆能够学习并跟踪所有先前的相关3D信息。随后，Spann3R会查询这一空间记忆，以全局坐标系预测下一帧的3D结构。借助DUSt3R的预训练权重，并在部分数据集上进一步微调，Spann3R在多种未见数据集上表现出具有竞争力的性能和泛化能力，并且能够实时处理有序图像集合。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.16061) | [🌐 项目页面](https:\u002F\u002Fhengyiwang.github.io\u002Fprojects\u002Fspanner) | [💻 代码](https:\u002F\u002Fgithub.com\u002FHengyiWang\u002Fspann3r)\n\n\u003Cbr>\n\n### 3. ReconX：利用视频扩散模型从稀疏视图重建任意场景 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**：Fangfu Liu、Wenqiang Sun、Hanyang Wang、Yikai Wang、Haowen Sun、Junliang Ye、Jun Zhang、Yueqi Duan\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n3D场景重建领域的进步已将现实世界的2D图像转化为3D模型，从数百张输入照片中生成逼真的3D结果。尽管在密集视图重建场景中取得了巨大成功，但从不足的拍摄视图中渲染出细节丰富的场景仍然是一个不适定的优化问题，常常导致未见区域出现伪影和扭曲。在本文中，我们提出了ReconX，这是一种全新的3D场景重建范式，它将模糊的重建挑战重新定义为时间序列生成任务。关键见解是利用大型预训练视频扩散模型的强大生成先验来进行稀疏视图重建。然而，直接由预训练模型生成的视频帧难以准确保持3D视图一致性。为此，在输入视图有限的情况下，提出的ReconX首先构建一个全局点云，并将其编码为上下文空间作为3D结构条件。在该条件的引导下，视频扩散模型会合成既保留细节又具有高度3D一致性的视频帧，从而确保场景在不同视角下的连贯性。最后，我们通过一种基于置信度的3D高斯泼溅优化方案从生成的视频中恢复3D场景。在多个真实世界数据集上的广泛实验表明，我们的ReconX在质量和泛化能力方面均优于当前最先进的方法。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.16767) | [🌐 项目页面](https:\u002F\u002Fliuff19.github.io\u002FReconX\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002Fliuff19\u002FReconX)\n\n\u003Cbr>\n\n\n\n\n### 4. 
MoGe：通过最优训练监督解锁开放域图像的精确单目几何估计 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**：Ruicheng Wang、Sicheng Xu、Cassie Dai、Jianfeng Xiang、Yu Deng、Xin Tong、Jiaolong Yang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了MoGe，这是一个强大的模型，用于从单目开放域图像中恢复3D几何信息。给定一张图像，我们的模型会直接预测所拍摄场景的3D点云图，采用仿射不变的表示形式，这种表示形式不依赖于真实的全局尺度和位移。这种新的表示方式避免了训练过程中的歧义性监督，促进了有效的几何学习。此外，我们还提出了一系列新颖的全局和局部几何监督方法，以帮助模型学习高质量的几何信息。其中包括一个鲁棒、最优且高效的点云对齐求解器，用于准确学习全局形状；以及一个多尺度局部几何损失函数，以促进精确的局部几何监督。我们在一个大型混合数据集上训练了我们的模型，并展示了其强大的泛化能力和高精度。在对我们模型在各种未见数据集上的全面评估中，它在所有任务上都显著优于当前最先进的方法，包括单目3D点云图、深度图和相机视场的估计。代码和模型将在我们的项目页面上发布。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.19115) | [🌐 项目页面](https:\u002F\u002Fwangrc.site\u002FMoGePage\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmoge) | [🎮 演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FRuicheng\u002FMoGe)\n\n\n\u003Cbr>\n\n### 5. MV-DUSt3R+: 基于稀疏视角的单阶段场景重建，仅需2秒！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**: 王瑞成、徐思诚、戴卡西、向建峰、邓宇、佟欣、杨交龙\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n近年来，像DUSt3R和MASt3R这样的稀疏多视角场景重建技术已经不再需要相机标定和位姿估计。然而，它们每次只能处理一对视角来推断像素对齐的点云图。当面对超过两个视角时，通常会先进行大量易出错的两两重建，再通过昂贵的全局优化来修正这些误差，但往往效果不佳。为了支持更多视角、减少误差并提升推理速度，我们提出了快速的单阶段前馈网络MV-DUSt3R。其核心是多视角解码块，能够在考虑一个参考视角的同时，在任意数量的视角间交换信息。为进一步增强方法对参考视角选择的鲁棒性，我们进一步提出了MV-DUSt3R+，它采用跨参考视角块，在不同参考视角的选择之间融合信息。为了进一步实现新视角合成，我们在两者基础上增加了高斯泼溅头，并对其进行联合训练。多视角立体重建、多视角位姿估计和新视角合成的实验结果表明，我们的方法显著优于现有技术。代码将公开发布。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.06974) | [🌐 项目页面](https:\u002F\u002Fmv-dust3rp.github.io\u002F) | [💻 代码 ](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fmvdust3r)\n\n\u003Cbr>\n\n\n\n\n### 6. LoRA3D: 3D几何基础模型的低秩自校准！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**: 陆子琪、杨恒、徐丹菲、李博毅、鲍里斯·伊万诺维奇、马可·帕沃内、王悦\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n新兴的3D几何基础模型，如DUSt3R，为野外环境下的3D视觉任务提供了一种很有前景的方法。然而，由于问题空间的高维性和高质量3D数据的稀缺，这些预训练模型在许多挑战性场景中仍难以泛化，例如视场重叠有限或光照不足等情况。为此，我们提出LoRA3D，这是一种高效的自校准流水线，利用模型自身的多视角预测将其专门化到目标场景。以稀疏的RGB图像作为输入，我们借助稳健的优化技术来细化多视角预测，并将它们对齐到全局坐标系中。特别地，我们将预测置信度融入几何优化过程中，自动重新加权以更好地反映点估计的准确性。然后，我们使用校准后的置信度为校准用的视角生成高质量的伪标签，并通过低秩适应（LoRA）在这些伪标签数据上微调模型。我们的方法无需任何外部先验或人工标注，仅需一块标准GPU即可在5分钟内完成自校准过程。每个低秩适配器仅需18MB存储空间。我们在Replica、TUM和Waymo Open数据集中的160多个场景上评估了该方法，结果在3D重建、多视角位姿估计和新视角渲染方面性能提升了高达88%。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.07746)\n\n\u003Cbr>\n\n\n## 动态场景重建：\n\n## 2025年：\n\n### 1. 具有持久状态的连续3D感知模型！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: 王倩倩、张一飞、亚历山大·霍林斯基、阿列克谢·埃夫罗斯、安朱·卡纳扎瓦\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了一套统一的框架，能够解决广泛的3D任务。我们的方法采用一种带有状态的循环模型，能够随着每一次新的观测不断更新其状态表示。给定一个图像流，这个不断演化的状态可以在线方式为每一个新输入生成度量尺度的点云图（即每像素的3D点）。这些点云图位于同一个坐标系中，可以累积成一个连贯、密集的场景重建，并随着新图像的到来持续更新。我们的模型称为CUT3R（用于3D重建的连续更新Transformer），它捕捉了真实世界场景的丰富先验：不仅能从图像观测中预测准确的点云图，还能通过探测虚拟的未观测视角来推断场景中未被观察到的部分。我们的方法简单而高度灵活，能够自然地接受长度不一的图像序列，无论是视频流还是无序的照片集合，其中既包含静态内容也包含动态内容。我们在多种3D\u002F4D任务上评估了该方法，并在每一项任务中都展示了具有竞争力或最先进的性能。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.12387) | [🌐 项目页面](https:\u002F\u002Fcut3r.github.io\u002F) | [💻 代码（即将发布）](https:\u002F\u002Fcut3r.github.io\u002F)\n\n\u003Cbr>\n\n\n\n### 2. 
Easi3R: 无需训练即可从DUSt3R中估计解耦的运动！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: 陈星宇、陈岳、修玉良、安德烈亚斯·盖格、陈安培\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\nDUSt3R的最新进展使得基于Transformer网络架构和大规模3D数据的直接监督，能够稳健地估计静态场景的密集点云和相机参数。相比之下，现有的4D数据集规模有限且多样性不足，这成为训练高度通用的4D模型的主要瓶颈。这一限制促使传统的4D方法通过在可扩展的动态视频数据上微调3D模型，并结合光流和深度等额外几何先验来实现。而在本工作中，我们采取了相反的路径，提出Easi3R——一种简单却高效的无训练4D重建方法。我们的方法在推理过程中应用注意力机制的适配，从而无需从头开始预训练或对网络进行微调。我们发现，DUSt3R中的注意力层本身就蕴含着关于相机和物体运动的丰富信息。通过仔细解耦这些注意力图，我们实现了精确的动态区域分割、相机位姿估计以及4D密集点云重建。对真实世界动态视频的广泛实验表明，我们轻量级的注意力适配显著优于那些在大量动态数据上训练或微调的现有最先进方法。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.24391) | [🌐 项目页面](https:\u002F\u002Feasi3r.github.io\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002FInception3D\u002FEasi3R)\n\n\u003Cbr>\n\n### 3. ODHSR：基于单目视频的人与场景在线稠密3D重建 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-ligntgreen)\n**作者**: 张泽桐、曼努埃尔·考夫曼、薛立新、宋杰、马丁·R·奥斯瓦尔德\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n从单张野外采集的单目视频中创建逼真的场景和人体重建，是实现以人为中心的3D世界感知的关键任务之一。近年来，神经渲染技术的进步使得人与场景的整体重建成为可能，但这些方法通常需要预先标定的相机参数和人体姿态，并且训练时间长达数天。在本工作中，我们提出了一种新颖的统一框架，能够以在线方式同时完成相机跟踪、人体姿态估计以及人与场景的联合重建。我们利用3D高斯泼溅技术高效地学习人体和场景的高斯基元，并设计了基于重建的相机跟踪和人体姿态估计模块，从而实现对姿态与外观的整体理解和有效解耦。特别地，我们引入了一个人体形变模块，用于精细重建细节并提升对分布外姿态的泛化能力。为了准确学习人体与场景之间的空间相关性，我们提出了遮挡感知的人体轮廓渲染和单目几何先验，进一步提升了重建质量。在EMDB和NeuMan数据集上的实验表明，该方法在相机跟踪、人体姿态估计、新视图合成以及运行时效率等方面均表现出优于或与现有方法相当的性能。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.13167) | [🌐 项目主页](https:\u002F\u002Feth-ait.github.io\u002FODHSR\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002Feth-ait\u002FODHSR)\n\n\u003Cbr>\n\n\n### 4. 动态点映射：一种用于动态3D重建的通用表示！![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: 埃德加·苏卡尔、赖子航、埃尔达尔·英萨富季诺夫、安德烈亚·韦达尔迪\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\nDUSt3R最近证明，许多多视角几何任务，包括相机内参和外参的估计、场景的3D重建以及图像对应关系的建立，都可以归结为预测一对视点不变的点映射，即定义在共同参考系中的像素对齐点云。这种表述简洁而强大，但无法处理动态场景。为了解决这一挑战，我们提出了动态点映射（DPM）的概念，将标准点映射扩展到支持4D任务，如运动分割、场景光流估计、3D目标跟踪和2D对应关系等。我们的核心思想是，当引入时间维度时，存在多种可能的空间和时间参考系可用于定义点映射。我们识别出其中的一组最小组合，可以通过网络回归来解决上述子任务。我们在合成数据和真实数据的混合数据集上训练了一个DPM预测器，并在视频深度预测、动态点云重建、3D场景光流和目标姿态跟踪等多个基准上进行了评估，取得了当前最先进的性能。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.16318) | [🌐 项目主页](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fresearch\u002Fdynamic-point-maps\u002F)\n\n\u003Cbr>\n\n\n\n### 5. Geo4D：利用视频生成模型进行几何4D场景重建！![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: 蒋泽仁、郑川夏、伊罗·莱娜、黛安·拉尔吕斯、安德烈亚·韦达尔迪\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了Geo4D，一种将视频扩散模型重新用于动态场景单目3D重建的方法。通过利用这类视频模型所捕捉的强大动态先验，Geo4D仅需使用合成数据即可训练，并能在零样本条件下很好地泛化到真实数据。Geo4D预测多种互补的几何模态，包括点云、深度图和光线图。它在推理时采用一种新的多模态对齐算法来对齐和融合这些模态，并结合多个滑动窗口，从而实现对长视频的稳健且精确的4D重建。在多个基准上的大量实验表明，Geo4D显著超越了现有的视频深度估计方法，包括近期提出的同样针对动态场景设计的MonST3R等方法。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.07961) | [🌐 项目主页](https:\u002F\u002Fgeo4d.github.io\u002F) | | [💻 代码](https:\u002F\u002Fgithub.com\u002Fjzr99\u002FGeo4D)\n\n\u003Cbr>\n\n\n\n### 6. 
POMATO：将点映射匹配与时间运动相结合，用于动态3D重建！![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: 张松岩、葛永涛、田金元、徐广凯、陈浩、吕晨、沈春华\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n动态场景中的3D重建主要依赖于几何估计和匹配模块的结合，其中匹配模块在区分动态区域方面起着关键作用，有助于缓解由相机和物体运动带来的干扰。此外，匹配模块还显式地建模物体运动，从而实现特定目标的跟踪，并推动复杂场景下的运动理解。最近，DUSt3R中提出的点映射表示为在3D空间中统一几何估计和匹配提供了一种潜在的解决方案，但它仍然难以处理动态区域中的模糊匹配问题，这可能会阻碍进一步的改进。在本工作中，我们提出了POMATO，一个通过将点映射匹配与时间运动相结合来进行动态3D重建的统一框架。具体而言，我们的方法首先通过将来自动态和静态区域的RGB像素映射到统一坐标系内的3D点映射，学习明确的匹配关系。此外，我们还引入了一个用于动态运动的时间运动模块，以确保不同帧之间尺度的一致性，并提升在需要精确几何和可靠匹配的任务中的表现，尤其是3D点跟踪任务。我们通过在多个下游任务中展示出色的表现，证明了所提出的点映射匹配与时间融合范式的有效性，这些任务包括视频深度估计、3D点跟踪和姿态估计等。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.05692) | [💻 代码](https:\u002F\u002Fgithub.com\u002Fwyddmw\u002FPOMATO)\n\n\u003Cbr>\n\n### 7. GeometryCrafter：基于扩散先验的开放世界视频几何一致性估计 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-ICCV-pink)\n**作者**: 徐天行、高翔俊、胡文博、李晓宇、张松海、单颖\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n尽管视频深度估计领域取得了显著进展，但现有方法在通过仿射不变预测实现几何保真度方面仍存在固有局限性，这限制了它们在重建及其他基于度量的下游任务中的应用。为此，我们提出了GeometryCrafter，这是一种新颖的框架，能够从开放世界视频中恢复具有时间一致性的高保真点云序列，从而支持精确的3D\u002F4D重建、相机参数估计以及其他基于深度的应用。我们的方法核心是一个点云变分自编码器（VAE），它学习了一个与视频潜在分布无关的隐空间，以实现高效的点云编码和解码。在此基础上，我们训练了一个视频扩散模型，用于建模以输入视频为条件的点云序列分布。在多种数据集上的广泛评估表明，GeometryCrafter在3D精度、时间一致性及泛化能力方面均达到了当前最优水平。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.01016) | [🌐 项目主页](https:\u002F\u002Fgeometrycrafter.github.io\u002F) | | [💻 代码](https:\u002F\u002Fgithub.com\u002FTencentARC\u002FGeometryCrafter) | [🤗 演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FTencentARC\u002FGeometryCrafter)\n\n\u003Cbr>\n\n\n### 8. 回归正轨：面向动态场景重建的束调整 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-ICCV-pink)\n**作者**: 陈伟荣、张甘霖、费利克斯·温鲍尔、王睿、尼基塔·阿拉斯拉诺夫、安德烈亚·韦达尔迪、丹尼尔·克雷默斯\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n传统的依赖束调整的SLAM系统在处理常见于日常视频中的高度动态场景时面临困难。这类视频中动态元素的运动相互交织，破坏了传统系统所假设的静态环境前提。现有的技术要么直接滤除动态元素，要么单独建模其运动。然而，前者往往导致重建不完整，而后者则可能导致运动估计不一致。本研究提出了一种新方法，利用3D点跟踪器将相机引起的运动与动态物体的观测运动分离。通过仅考虑相机诱导的运动成分，束调整便能可靠地应用于所有场景元素。此外，我们还基于尺度图进行了轻量级后处理，以确保视频帧之间的深度一致性。我们的框架将传统SLAM的核心——束调整——与鲁棒的基于学习的3D跟踪前端相结合。通过整合运动分解、束调整和深度精炼，我们的统一框架BA-Track能够准确跟踪相机运动，并生成时间连贯且尺度一致的稠密重建结果，同时适应静态和动态元素。我们在具有挑战性的数据集上的实验表明，该方法在相机位姿估计和3D重建精度方面均有显著提升。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.14516) | [🌐 项目主页](https:\u002F\u002Fwrchen530.github.io\u002Fprojects\u002Fbatrack\u002F) | | [💻 代码（即将发布）](https:\u002F\u002Fwrchen530.github.io\u002Fprojects\u002Fbatrack\u002F)\n\n\u003Cbr>\n\n\n### 9. 
流式4D视觉几何Transformer ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-ICCV-pink)\n**作者**: 朱东、郑文钊、郭嘉禾、吴宇奇、周杰、陆继文\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n从视频中感知并重建4D时空几何是一项基础但极具挑战性的计算机视觉任务。为了支持交互式和实时应用，我们提出了一种流式4D视觉几何Transformer，其设计理念与自回归大型语言模型类似。我们采用简单高效的设计，使用因果Transformer架构以在线方式处理输入序列。通过引入时间因果注意力机制，并缓存历史键值对作为隐式记忆，我们实现了高效的流式长期4D重建。这种设计能够在保持高质量空间一致性的同时，通过增量整合历史信息来完成实时4D重建。为提高训练效率，我们建议将密集双向视觉几何接地Transformer（VGGT）的知识蒸馏到我们的因果模型中。在推理阶段，我们的模型可以移植来自大型语言模型领域的优化高效注意力算子（如FlashAttention）。在多个4D几何感知基准上的大量实验表明，我们的模型在保持竞争力的同时显著提升了在线场景下的推理速度，从而为可扩展且交互式的4D视觉系统铺平了道路。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.11539) | [🌐 项目主页](https:\u002F\u002Fwzzheng.net\u002FStreamVGGT\u002F) | | [💻 代码](https:\u002F\u002Fgithub.com\u002Fwzzheng\u002FStreamVGGT) | [🤗 演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flch01\u002FStreamVGGT)\n\n\u003Cbr>\n\n### 10. Human3R：所有人、所有地、一次性完成！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: 陈岳、陈星宇、薛宇轩、陈安培、修玉良、Gerard Pons-Moll\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了Human3R，这是一个统一的前向传播框架，用于从随意拍摄的单目视频中，在世界坐标系下在线重建4D人体与场景。不同于以往依赖多阶段流水线、人与场景之间迭代接触感知优化以及对诸如人体检测、深度估计和SLAM预处理等外部模块高度依赖的方法，Human3R能够在一次前向传播中联合恢复全局多人SMPL-X人体模型（“所有人”）、稠密3D场景（“所有地”）以及相机轨迹（“一次性”）。我们的方法基于4D在线重建模型CUT3R，并采用参数高效的视觉提示调优技术，旨在保留CUT3R丰富的时空先验信息，同时实现对多个SMPL-X人体模型的直接读取。Human3R是一个消除复杂依赖和迭代优化的统一模型。仅在一块GPU上用相对小规模的合成数据集BEDLAM训练一天后，它便以卓越的效率实现了优异性能：能够以实时速度（15 FPS）和较低的内存占用（8 GB），一次性重建多个人体及3D场景。大量实验表明，Human3R在多项任务上均达到了当前最优或具有竞争力的水平，包括全局人体运动估计、局部人体网格恢复、视频深度估计和相机位姿估计，且仅需一个统一模型即可完成。我们希望Human3R能作为一个简单而强大的基线，易于适配于下游应用。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.06219) | [🌐 项目主页](https:\u002F\u002Ffanegg.github.io\u002FHuman3R\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002Ffanegg\u002FHuman3R)\n\n\u003Cbr>\n\n\n\u003Cbr>\n\n## 2024年：\n### 1. MonST3R：一种在运动存在时进行几何估计的简单方法！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**: 张俊毅、查尔斯·赫尔曼、许俊华、瓦伦·詹帕尼、特雷弗·达雷尔、福雷斯特·科尔、孙德庆、杨明轩\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n从动态场景中估计几何结构——即物体随时间移动和变形的场景——仍然是计算机视觉领域的一个核心挑战。现有的方法通常依赖于多阶段流水线或全局优化算法，将问题分解为深度和光流等子任务，从而导致系统复杂且容易出错。在本文中，我们提出了一种名为Motion DUSt3R（MonST3R）的新方法，这是一种以几何为核心的直接估计动态场景每帧几何信息的方法。我们的关键洞察是：通过为每一帧单独估计点云地图，我们可以有效地将原本仅适用于静态场景的DUST3R表示扩展到动态场景中。然而，这种方法面临一个重大挑战——即缺乏合适的训练数据，尤其是带有深度标注的动态摆拍视频。尽管如此，我们发现，通过将问题转化为微调任务，筛选出若干合适的数据集，并在这些有限的数据上进行策略性训练，模型竟然能够很好地处理动态特性，甚至无需显式的运动表示。基于此，我们针对多个下游视频特定任务引入了新的优化方法，并在视频深度和相机位姿估计方面展示了强劲的性能，其鲁棒性和效率均优于现有工作。此外，MonST3R在主要基于前向传播的4D重建任务中也表现出令人期待的结果。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.03825) | [🌐 项目主页](https:\u002F\u002Fmonst3r-project.github.io\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002FJunyi42\u002Fmonst3r)\n\n\u003Cbr>\n\n### 2. 
Align3R：面向动态视频的一致性单目深度估计！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-ligntgreen)\n**作者**: 陆嘉豪、黄天宇、李鹏、窦志阳、林成、崔志明、董振、杨赛-基特、王文平、刘源\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n近年来，单目深度估计方法在单张图像的高质量深度估计方面取得了显著进展，但在不同帧之间保持一致的视频深度估计方面仍存在不足。近期的一些研究通过使用视频扩散模型，根据输入视频生成条件化的视频深度图来解决这一问题，但这类方法训练成本高昂，且只能生成不包含相机位姿的尺度不变深度值。在本文中，我们提出了一种名为Align3R的新方法，用于为动态视频估计时间上一致的深度图。我们的核心思想是利用最新的DUSt3R模型，对不同时间步的单目深度图进行对齐。首先，我们通过对DUSt3R模型进行微调，使其能够接受额外的单目深度估计作为输入，以适应动态场景；随后，我们进一步优化以同时重建深度图和相机位姿。大量实验表明，Align3R能够为单目视频估计出一致的视频深度和相机位姿，其性能优于基线方法。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.03079) | [🌐 项目主页](https:\u002F\u002Figl-hkust.github.io\u002FAlign3R.github.io\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002Fjiah-cloud\u002FAlign3R)\n\n\u003Cbr>\n\n\n\n### 3. Stereo4D：从互联网立体视频中学习物体在3D空间中的运动规律！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**: 金琳怡、理查德·塔克、李正奇、大卫·福黑、诺亚·斯纳维利、亚历山大·霍林斯基\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n从图像中学习理解动态3D场景对于机器人技术、场景重建等应用至关重要。然而，与其他已通过大规模监督式训练迅速取得进展的问题不同，直接监督3D运动恢复方法仍然面临巨大挑战，根本原因在于难以获取真实的标注数据。我们提出了一套系统，用于从互联网上的立体广角视频中挖掘高质量的4D重建结果。该系统将相机位姿估计、立体深度估计和时间序列跟踪方法的输出融合并过滤，最终生成高质量的动态3D重建。我们利用这一方法生成大规模数据，形式为具有长期运动轨迹的世界一致性伪度量3D点云。我们通过训练DUSt3R的一个变体来预测真实世界图像对中的结构和3D运动，证明了使用我们重建的数据进行训练可以实现对多样化真实场景的泛化。项目主页：此网址\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.09621) | [🌐 项目主页](https:\u002F\u002Fstereo4d.github.io\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002FStereo4d\u002Fstereo4d-code)\n\n\u003Cbr>\n\n### 4. DAS3R：用于静态场景重建的动力学感知高斯泼溅法\n![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**: Kai Xu, Tze Ho Elden Tse, Jizong Peng, Angela Yao\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了一种新颖的框架，用于从日常视频中进行场景分解和静态背景重建。通过整合训练好的运动掩膜，并将静态场景建模为具有动力学感知优化的高斯泼溅点，我们的方法比以往的工作获得了更准确的背景重建结果。我们提出的方法被称为DAS3R，即“用于静态场景重建的动力学感知高斯泼溅法”的缩写。与现有方法相比，DAS3R在复杂运动场景中更加鲁棒，能够处理动态物体占据场景较大比例的视频，并且无需来自基于SLAM的方法的相机位姿输入或点云数据。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.19584v1) | [🌐 项目页面](https:\u002F\u002Fkai422.github.io\u002FDAS3R\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002Fkai422\u002Fdas3r)\n\n\u003Cbr>\n\n\n## 场景推理：\n## 2025年：\n### 1. LaRI：用于单视图3D几何推理的分层光线交点 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: Rui Li, Biao Zhang, Zhenyu Li, Federico Tombari, Peter Wonka\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了分层光线交点（LaRI），这是一种从未知几何体的单张图像中进行推理的新方法。与仅限于可见表面的传统深度估计不同，LaRI使用分层点云地图来建模被相机光线所截取的多个表面。得益于紧凑且分层的表示方式，LaRI能够实现完整、高效且与视角对齐的几何推理，从而统一对象级和场景级任务。我们进一步提出预测光线停止索引，该索引可从LaRI的输出中识别出有效的相交像素和层级。我们构建了一个完整的训练数据生成流水线，用于合成和真实世界数据，包括3D对象和场景，并包含了必要的数据清洗步骤以及渲染引擎之间的协调工作。作为一种通用方法，LaRI的性能已在两种场景中得到验证：它仅使用近期大型生成模型4%的训练数据和17%的参数，便能获得与其相当的对象级结果。同时，它只需一次前向传播即可完成场景级别的遮挡几何推理。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.18424) | [🌐 项目页面](https:\u002F\u002Fruili3.github.io\u002Flari\u002Findex.html) | [💻 代码](https:\u002F\u002Fgithub.com\u002Fruili3\u002Flari) | [🤗 演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fruili3\u002FLaRI) | [🎞️ 视频](https:\u002F\u002Fruili3.github.io\u002Flari\u002Fstatic\u002Fvideos\u002Fteaser_video.mp4)\n\n\u003Cbr>\n\n\n\n### 2. RaySt3R：预测零样本对象补全的新型深度图 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: Bardienus P. 
Duisterhof, Jan Oberst, Bowen Wen, Stan Birchfield, Deva Ramanan, Jeffrey Ichnowski\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n3D形状补全是机器人技术、数字孪生重建和扩展现实（XR）等领域中具有广泛应用的技术。尽管近年来在3D对象和场景补全方面取得了令人瞩目的进展，但现有方法仍存在缺乏3D一致性、计算成本高昂以及难以捕捉清晰对象边界等问题。我们的工作（RaySt3R）通过将3D形状补全重新定义为一种新型视图合成问题来解决这些局限性。具体而言，给定一张RGB-D图像和一个新的视角（以一组查询光线的形式编码），我们训练一个前馈变换器来预测这些查询光线的深度图、对象掩膜以及每个像素的置信度分数。RaySt3R会将这些预测结果在多个查询视角之间进行融合，从而重建出完整的3D形状。我们在合成和真实世界的数据集上对RaySt3R进行了评估，发现其性能达到了当前最先进的水平，在所有数据集上均以高达44%的3D查默距离优势超越了基线方法。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.05285) | [🌐 项目页面](https:\u002F\u002Frayst3r.github.io\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002FDuisterhof\u002Frayst3r) | [🤗 演示（即将上线）](https:\u002F\u002Frayst3r.github.io\u002F) | [🎞️ 视频](https:\u002F\u002Frayst3r.github.io\u002Fstatic\u002Fvideos\u002Fteaser\u002Fteaser_fixed.mp4)\n\n\u003Cbr>\n\n\n### 3. Amodal3R：从遮挡的2D图像中进行无遮挡3D重建 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: Tianhao Wu, Chuanxia Zheng, Frank Guan, Andrea Vedaldi, Tat-Jen Cham\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n大多数基于图像的3D对象重建方法都假设对象是完全可见的，忽略了在现实场景中经常出现的遮挡现象。在本文中，我们介绍了Amodal3R，这是一种条件式3D生成模型，旨在从部分观测中重建3D对象。我们以一个“基础”3D生成模型为基础，将其扩展到可以从遮挡对象中恢复合理的3D几何和外观。我们引入了一种基于掩膜加权的多头交叉注意力机制，随后又加入了一个考虑遮挡因素的注意力层，该层明确利用遮挡先验信息来指导重建过程。我们证明，仅通过在合成数据上进行训练，Amodal3R即使在真实场景中存在遮挡的情况下，也能学会恢复完整的3D对象。它显著优于那些先独立进行2D无遮挡补全再进行3D重建的现有方法，从而为考虑遮挡因素的3D重建树立了新的基准。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.13439) | [🌐 项目页面](https:\u002F\u002Fsm0kywu.github.io\u002FAmodal3R) | [💻 代码（即将上线）](https:\u002F\u002Fsm0kywu.github.io\u002FAmodal3R) | [🤗 演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FSm0kyWu\u002FAmodal3R)\n\n\u003Cbr>\n\n\n\n\n\n## 高斯泼溅：\n\n\n## 2025年：\n\n### 1. EasySplat：视图自适应学习让3D高斯泼溅变得简单！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**：Ao Gao、Luosong Guo、Tao Chen、Zhao Wang、Ying Tai、Jian Yang、Zhenyu Zhang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n3D高斯泼溅（3DGS）技术已在3D场景表示方面取得了令人满意的效果。尽管其性能出色，但由于基于运动恢复结构（SfM）的方法在获取准确的场景初始化时存在局限性，或者致密化策略效率较低，3DGS仍面临诸多挑战。本文提出了一种名为EasySplat的新框架，以实现高质量的3DGS建模。我们并未采用SfM进行场景初始化，而是引入了一种新方法，充分发挥大规模点云方法的优势。具体而言，我们提出了一种基于视图相似性的高效分组策略，并利用鲁棒的点云先验信息来获取高质量的点云和相机位姿，从而完成3D场景的初始化。在获得可靠的场景结构后，我们进一步提出了一种新颖的致密化方法，该方法基于邻近高斯椭球体的平均形状，利用KNN方案自适应地分割高斯基元。通过这种方式，所提出的算法克服了初始化和优化方面的限制，实现了高效且精确的3DGS建模。大量实验表明，EasySplat在处理新视角合成任务时，性能优于当前最先进的方法。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2501.01003)\n\n\u003Cbr>\n\n### 2. 
FlowR：从稀疏到稠密的3D重建流水线！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**：Tobias Fischer、Samuel Rota Bulò、Yung-Hsu Yang、Nikhil Varma Keetha、Lorenzo Porzi、Norman Müller、Katja Schwarz、Jonathon Luiten、Marc Pollefeys、Peter Kontschieder\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n3D高斯泼溅能够在实时帧率下实现高质量的新视角合成（NVS）。然而，一旦偏离训练视图，其质量便会急剧下降。因此，为了满足某些应用（如虚拟现实VR）对高质量图像的期望，往往需要密集的采集数据。然而，这种密集采集既费时又昂贵。现有研究尝试使用2D生成模型通过蒸馏或生成额外的训练视图来缓解这一问题。但这些方法通常仅依赖于少数参考输入视图，未能充分利用现有的3D信息，从而导致生成结果不一致以及重建伪影。为解决这一难题，我们提出了一种多视图流匹配模型，该模型学习一种流场，用于将可能来自稀疏重建的新视角渲染与我们期望的稠密重建渲染连接起来。这使得我们可以用生成的新视角补充场景采集，从而提升重建质量。我们的模型是在一个包含360万张图像对的新数据集上训练的，在单次前向传播中，可在一块H100 GPU上以540×960分辨率（91K个token）处理多达45个视图。我们的流程在稀疏和稠密视图场景中均能持续改进NVS效果，最终在多个广泛使用的NVS基准测试中，重建质量均优于先前的工作。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.01647) | [🌐 项目页面](https:\u002F\u002Ftobiasfshr.github.io\u002Fpub\u002Fflowr\u002F)\n\n\u003Cbr>\n\n\n\n### 3. Styl3R：任意场景与风格的即时3D风格化重建！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**：Peng Wang、Xiang Liu、Peidong Liu\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n在保持多视角一致性并忠实还原风格图像的同时，对3D场景进行即时风格化仍然是一项重大挑战。目前最先进的3D风格化方法通常需要在测试阶段进行计算密集型优化，以将艺术特征迁移到预训练的3D表示中，而这往往需要密集的带姿态输入图像。相比之下，我们利用近期在前馈式重建模型方面的进展，展示了一种新方法：只需使用无姿态的稀疏视图场景图像和任意风格图像，即可在不到一秒钟内实现直接的3D风格化。为了解决重建与风格化之间的固有脱耦问题，我们引入了一种分支架构，将结构建模与外观着色分离，从而有效防止风格迁移扭曲底层的3D场景结构。此外，我们还引入了一种身份损失函数，以支持通过新视角合成任务对风格化模型进行预训练。这一策略不仅使我们的模型保留了原有的重建能力，还能针对风格化任务进行微调。综合评估结果显示，无论使用域内还是域外数据集，我们的方法都能生成高质量的风格化3D内容，兼具出色的风格表现与场景外观融合度，并且在多视角一致性及效率方面均优于现有方法。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.21060) | [🌐 项目页面](https:\u002F\u002Fnickisdope.github.io\u002FStyl3R) | [💻 代码](https:\u002F\u002Fgithub.com\u002FWU-CVGL\u002FStyl3R)\n\n\u003Cbr>\n\n\u003Cbr>\n\n\n## 2024年：\n\n### 1. InstantSplat：40秒内实现无约束、稀疏视角、无姿态的高斯泼溅！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**：Fan Zhiwen、Cong Wenyan、Wen Kairun、Wang Kevin、Zhang Jian、Ding Xinghao、Xu Danfei、Ivanovic Boris、Pavone Marco、Pavlakos Georgios、Wang Zhangyang、Wang Yue\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n尽管新视图合成（NVS）在三维计算机视觉领域取得了显著进展，但它通常需要从密集视角中初步估计相机的内参和外参。这一预处理步骤一般通过运动恢复结构（SfM）流程完成，而该流程在稀疏视角场景下可能既缓慢又不可靠——尤其是在匹配特征不足、难以进行精确重建的情况下。在本工作中，我们结合基于点的表示方法（如3D高斯泼溅，3D-GS）与端到端稠密立体模型（DUSt3R）的优势，以解决无约束条件下NVS中尚未解决的复杂问题，包括无姿态和稀疏视角挑战。我们的框架InstantSplat将稠密立体先验与3D-GS相结合，在不到1分钟内即可从稀疏视角且无姿态的图像中构建大规模场景的3D高斯模型。具体而言，InstantSplat包含一个粗略几何初始化（CGI）模块，该模块利用预先训练好的稠密立体管道生成的全局对齐3D点云地图，快速建立所有训练视图中的初始场景结构和相机参数。随后，Fast 3D-Gaussian Optimization（F-3DGO）模块会联合优化3D高斯属性及已初始化的姿态，并引入姿态正则化。我们在大型户外数据集Tanks & Temples上进行的实验表明，InstantSplat在显著提升SSIM（提高32%）的同时，还将绝对轨迹误差（ATE）降低了80%。这些结果使InstantSplat成为应对无姿态和稀疏视角条件下的可行解决方案。项目主页：http:\u002F\u002Finstantsplat.github.io\u002F。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.20309.pdf) | [🌐 项目主页](https:\u002F\u002Finstantsplat.github.io\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FInstantSplat) | [🎥 视频](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=_9aQHLHHoEM&feature=youtu.be) \n\n\u003Cbr>\n\n\n### 2. 
Splatt3R：来自未校准图像对的零样本高斯泼溅！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**：Smart Brandon、Zheng Chuanxia、Laina Iro、Prisacariu Victor Adrian\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n本文我们提出了Splatt3R，一种无需姿态、前馈式的野外3D重建与新视图合成方法，适用于立体图像对。给定未校准的自然图像，Splatt3R无需任何相机参数或深度信息即可预测3D高斯泼溅。为提高通用性，我们基于“基础”3D几何重建方法MASt3R进行了扩展，使其同时处理3D结构与外观。具体来说，不同于仅重建3D点云的原始MASt3R，我们为每个点预测了构建高斯基元所需的额外高斯属性。因此，与其他新视图合成方法不同，Splatt3R首先通过优化3D点云的几何损失进行训练，然后再采用新视图合成目标函数。这样做可以避免在从立体视图训练3D高斯泼溅时陷入局部极小值的问题。此外，我们还提出了一种新颖的损失掩码策略，经实证发现其对于在推断视点上的优异表现至关重要。我们在ScanNet++数据集上训练了Splatt3R，并证明其对未校准的野外图像具有出色的泛化能力。Splatt3R可在512×512分辨率下以4FPS的速度重建场景，生成的泼溅可实时渲染。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.13912) | [🌐 项目主页](https:\u002F\u002Fsplatt3r.active.vision\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002Fbtsmart\u002Fsplatt3r)\n\n\u003Cbr>\n\n\n\n\n### 3. 稠密点云至关重要：Dust-GS用于稀疏视角下的场景重建！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**：Chen Shan、Zhou Jiale、Li Lei\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n3D高斯泼溅（3DGS）在场景合成和新视图合成任务中表现出色。通常，3D高斯基元的初始化依赖于由运动恢复结构（SfM）方法生成的点云。然而，在需要从稀疏视角进行场景重建的情况下，3DGS的效果会因初始点云的质量以及输入图像数量有限而受到显著限制。在本研究中，我们提出了Dust-GS，这是一种专门针对稀疏视角条件下3DGS局限性设计的新框架。Dust-GS不完全依赖SfM，而是引入了一种即使在稀疏输入数据下仍能有效工作的创新点云初始化技术。我们的方法采用混合策略，结合自适应深度掩码技术，从而提升重建场景的精度和细节。在多个基准数据集上开展的大量实验表明，Dust-GS在稀疏视角场景中优于传统3DGS方法，能够在更少的输入图像下实现更高质量的场景重建。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.08613)\n\n\u003Cbr>\n\n### 4. LM-Gaussian：利用大模型先验增强稀疏视角3D高斯溅射！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**：Hanyang Yu、Xiaoxiao Long、Ping Tan\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们旨在借助大规模视觉模型的先验知识，解决3D场景的稀疏视角重建问题。尽管近年来诸如3D高斯溅射（3DGS）等方法在3D重建领域取得了显著成果，但这些方法通常需要数百张密集采集场景的输入图像，这不仅耗时，而且在实际应用中并不实用。然而，稀疏视角重建本质上是一个病态且欠约束的问题，往往导致重建结果质量较差且不完整。其原因包括初始化失败、对输入图像过拟合以及细节缺失等。为应对这些挑战，我们提出了LM-Gaussian方法，该方法能够仅使用少量图像生成高质量的重建结果。具体而言，我们设计了一个鲁棒的初始化模块，利用立体先验辅助恢复相机位姿和可靠的点云；此外，我们还引入了基于扩散的迭代优化过程，将图像扩散先验融入高斯优化流程中，以更好地保留场景的精细细节。最后，我们进一步利用视频扩散先验来提升渲染图像的真实感效果。总体而言，与以往的3DGS方法相比，我们的方法显著降低了数据采集需求。我们在多个公开数据集上进行了实验验证，证明了该框架在高质量360度场景重建方面的潜力。可视化结果已发布在我们的网站上。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.03456) | [🌐 项目主页](https:\u002F\u002Fhanyangyu1021.github.io\u002Flm-gaussian.github.io\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002Fhanyangyu1021\u002FLMGaussian)\n\n\u003Cbr>\n\n\n\n\n### 5. PreF3R：无姿态信息的前馈式3D高斯溅射，适用于变长图像序列！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**：Zequn Chen、Jiezhi Yang、Heng Yang\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n我们提出了PreF3R，一种无需相机标定、可直接从任意长度的图像序列中进行3D重建的方法。与现有方法不同，PreF3R无需进行相机标定，而是直接在规范坐标系下重建3D高斯场，从而实现高效的任意视角渲染。我们利用DUSt3R的成对3D结构重建能力，并通过空间记忆网络将其扩展到多视角序列输入，避免了基于优化的全局配准步骤。此外，PreF3R还引入了一个密集的高斯参数预测头，支持后续通过可微光栅化技术进行新视角合成。这一设计使得我们可以同时使用光度损失和点云回归损失来监督模型训练，从而提升重建结果的逼真度和结构准确性。对于给定的有序图像序列，PreF3R能够以每秒20帧的速度增量式地重建3D高斯场，进而实现实时的新视角渲染。实验结果表明，PreF3R是解决无姿态信息前馈式新视角合成这一难题的有效方案，并且对未见场景具有良好的泛化能力。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.16877) | [🌐 项目主页](https:\u002F\u002Fcomputationalrobotics.seas.harvard.edu\u002FPreF3R\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002FComputationalRobotics\u002FPreF3R)\n\n\u003Cbr>\n\n\n### 6. 
Dust to Tower：从稀疏非标定图像中实现粗粒度到细粒度的逼真场景重建！[](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**：Xudong Cai、Yongcai Wang、Zhaoxin Fan、Deng Haoran、Shuo Wang、Wanting Li、Deying Li、Lun Luo、Minhang Wang、Jintao Xu\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n从稀疏视角、未经标定的图像中实现逼真场景重建在实际应用中有着迫切需求。尽管目前已有一些成功案例，但现有方法要么依赖于精确的相机参数（内参和外参），要么要求密集采集的图像。为了结合两者的优点并克服各自的不足，我们提出了Dust to Tower（D2T）框架——一个高效且准确的粗细结合流程，可同时优化3D高斯溅射模型及图像的相机位姿，且仅需稀疏、未标定的图像作为输入。我们的核心思想是先快速构建一个粗略的3D模型，然后利用在新视角下经过扭曲和修复后的图像对其进行精细化优化。为此，我们首先设计了一个粗建模块（CCM），该模块利用快速多视图立体匹配模型初始化3D高斯溅射，并恢复初始的相机位姿。随后，我们提出了一种置信度感知深度对齐模块（CADA），用于将粗略深度图中置信度较高的区域与单目深度模型估计的深度进行对齐，从而细化深度信息。接着，我们又引入了一个扭曲图像引导修复模块（WIGI），利用细化后的深度图将训练图像扭曲到新视角，并对因视角变化而产生的“空洞”区域进行修复，以此提供高质量的监督信号，进一步优化3D模型和相机位姿。大量实验及消融研究证实了D2T及其设计选择的有效性，在新视角合成和位姿估计两个任务上均达到了当前最优水平，同时保持了较高的效率。相关代码将公开发布。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.19518) | [💻 代码（待发布）]()\n\n\u003Cbr>\n\n\n\n\n## 场景理解：\n\n## 2025年：\n\n### 1.PE3R：感知高效的3D重建 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-arXiv-red)\n**作者**: 胡杰、王世尊、王欣超\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n近年来，2D到3D的感知技术取得了显著进展，极大地提升了从2D图像理解3D场景的能力。然而，现有方法仍面临关键挑战，包括跨场景泛化能力有限、感知精度不足以及重建速度较慢等问题。为解决这些局限性，我们提出了感知高效的3D重建（PE3R）框架，旨在同时提升重建的准确性和效率。PE3R采用前馈式架构，能够快速完成3D语义场的重建。该框架在不同场景和物体之间表现出强大的零样本泛化能力，并显著提高了重建速度。我们在2D到3D开放词汇分割和3D重建任务上的大量实验验证了PE3R的有效性和通用性。该框架在3D语义场重建方面实现了至少9倍的速度提升，同时大幅提高了感知精度和重建精确度，为该领域树立了新的标杆。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.07507) | [💻 代码](https:\u002F\u002Fgithub.com\u002Fhujiecpp\u002FPE3R)\n\u003Cbr>\n\u003Cbr>\n\n\n### 2. SegMASt3R：基于几何的分割匹配 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-Neurips-blue)\n**作者**: 罗希特·贾扬蒂、斯瓦亚姆·阿格瓦尔、万什·加格、西达尔特·图拉尼、穆罕默德·哈里斯·汗、索拉夫·加格、马达瓦·克里希纳\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n分割匹配是计算机视觉中的一个重要中间任务，用于建立图像间语义或几何上连贯区域之间的对应关系。与专注于局部特征的关键点匹配不同，分割匹配捕捉的是结构化的区域，因此对遮挡、光照变化和视角变化具有更强的鲁棒性。在本论文中，我们利用3D基础模型的空间理解能力来解决宽基线分割匹配问题——这一场景涉及极端的视角变化。我们提出了一种架构，借助这些3D基础模型的归纳偏置，能够在视角变化高达180度的图像对之间进行分割匹配。大量实验表明，在ScanNet++和Replica数据集上，我们的方法在AUPRC指标上比最先进的方法（包括SAM2视频传播器和局部特征匹配方法）高出多达30%。此外，我们还展示了所提出的模型在相关下游任务中的优势，例如3D实例分割和图像目标导航。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.05051) | [🌐 项目页面](https:\u002F\u002Fsegmast3r.github.io\u002F) | [💻 代码](https:\u002F\u002Fgithub.com\u002FSegMASt3R)\n\n\n\n\n## 2024年：\n### 1. 
LargeSpatialModel：端到端无姿态图像到语义3D ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-Neurips-blue)\n**作者**: 范志文、张健、丛文燕、王培浩、李仁杰、温凯润、周世杰、阿丘塔·卡丹比、王章阳、徐丹菲、鲍里斯·伊万诺维奇、马可·帕沃内、王悦\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n从少量图像中重建并理解3D结构是计算机视觉领域的一个经典问题。传统方法通常将这一任务分解为多个子任务，每个子任务都需要在不同的数据表示之间进行复杂的转换。例如，通过运动恢复结构（SfM）进行密集重建时，需要先将图像转换为关键点，优化相机参数，再估计三维结构。随后，还需要对稀疏重建结果进行进一步的密集建模，最后将其输入到特定任务的神经网络中。这种多步骤流程不仅耗时较长，还增加了工程实现的复杂性。\n在本工作中，我们提出了大型空间模型（LSM），可以直接处理无姿态的RGB图像，生成语义辐射场。LSM能够在一次前馈过程中同时估计几何、外观和语义信息，并可通过与语言交互，在任意新视角下生成多样化的标签地图。LSM基于Transformer架构，通过像素对齐的点云图整合全局几何信息。为了增强空间属性回归能力，我们引入了多尺度融合的局部上下文聚合机制，从而提高精细局部细节的准确性。为应对3D语义标注数据稀缺的问题，并实现自然语言驱动的场景操作，我们将一个预训练的2D语言分割模型融入到3D一致的语义特征场中。随后，一个高效的解码器会参数化一组语义各向异性高斯分布，从而支持监督式的端到端学习。大量跨任务的实验表明，LSM能够直接从无姿态图像统一处理多项3D视觉任务，首次实现了实时语义3D重建。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.18956) | [💻 代码](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FLSM) | [🌐 项目页面](https:\u002F\u002Flargespatialmodel.github.io\u002F) | [🎮 演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fkairunwen\u002FLSM)\n\n\n## 机器人学：\n## 2024年：\n### 1. 利用3D基础模型统一场景表示与手眼标定 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-RAL-yellow)\n**作者**: 魏明志、唐浩瞻、张天一、马修·约翰逊-罗伯森\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n环境表示是机器人学中的核心挑战之一，也是有效决策的基础。传统上，在使用机械臂搭载的摄像头采集图像之前，用户需要借助特定的外部标记（如棋盘格或AprilTag）对摄像头进行标定。\n然而，近年来计算机视觉领域的进步催生了3D基础模型。这些大型预训练神经网络能够在极少图像的情况下，即使缺乏丰富的视觉特征，也能快速且准确地建立多视角对应关系。本文主张将3D基础模型整合到配备机械臂搭载RGB摄像头的机器人系统的场景表示方法中。具体而言，我们提出了联合标定与表示方法（JCR）。JCR利用机械臂搭载的RGB摄像头拍摄的图像，无需特定的标定标记，即可同时构建环境表示，并将摄像头相对于机器人末端执行器进行标定。由此生成的3D环境表示与机器人的坐标系对齐，且保持物理尺度的准确性。我们证明了JCR可以仅使用低成本的RGB摄像头连接到机械臂，便能在无需事先标定的情况下构建有效的场景表示。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.11683.pdf) | [💻 代码（即将发布）]() \n\n\u003Cbr>\n\n### 2. 3D 基础模型实现抓取物体的几何与位姿同步估计 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2024-arXiv-red)\n**作者**: 智伟明、唐浩展、张天一、马修·约翰逊-罗伯森\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n人类具有非凡的能力，能够将手中的物体用作工具来与环境互动。为了实现这一点，人类会在内部估算手部动作如何影响物体的运动。我们希望赋予机器人这种能力。为此，我们提出了一种方法，可以从外部相机拍摄的 RGB 图像中联合估计机器人抓取的物体的几何形状和位姿。值得注意的是，我们的方法会将估计的几何形状转换到机器人的坐标系中，而无需对外部相机的外参进行标定。我们的方法利用 3D 基础模型——即在海量 3D 视觉数据集上预训练的大规模模型——来生成手中物体的初始估计。这些初始估计不具备物理上正确的尺度，并且处于相机坐标系中。随后，我们构建并高效求解一个坐标对齐问题，以恢复准确的尺度，并将物体的坐标变换到机器人的坐标系中。之后，可以定义从机械臂关节角度到物体上指定点的正向运动学映射。这些映射使得能够在任意配置下估计被抓取物体上的点，从而可以根据抓取物体上的坐标来设计机器人的运动。我们在一台机械臂上持握多种真实世界物体的情况下，对我们的方法进行了实证评估。\n\u003C\u002Fdetails>\n\n[📄 论文](https:\u002F\u002Fwww.researchgate.net\u002Fprofile\u002FWeiming-Zhi\u002Fpublication\u002F382490016_3D_Foundation_Models_Enable_Simultaneous_Geometry_and_Pose_Estimation_of_Grasped_Objects\u002Flinks\u002F66a01a4527b00e0ca43ddd95\u002F3D-Foundation-Models-Enable-Simultaneous-Geometry-and-Pose-Estimation-of-Grasped-Objects.pdf)\n\u003Cbr>\n\n\n\n\n## 位姿估计：\n## 2025：\n### 1. 
Reloc3r：用于通用、快速且精确视觉定位的相对相机位姿回归大规模训练 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**作者**: 董思言、王树哲、刘绍辉、蔡露露、范庆楠、尤霍·坎纳拉、杨燕超\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n视觉定位旨在确定查询图像相对于已知位姿图像数据库的相机位姿。近年来，直接回归相机位姿的深度神经网络因其快速推理能力而日益流行。然而，现有方法要么难以泛化到新场景，要么无法提供准确的位姿估计。为解决这些问题，我们提出了 Reloc3r，这是一个简单而有效的视觉定位框架。它由一个设计精巧的相对位姿回归网络和一个用于绝对位姿估计的极简运动平均模块组成。Reloc3r 在约八百万个已知位姿图像对上进行训练，取得了令人惊讶的好性能和泛化能力。我们在六个公开数据集上进行了大量实验，持续证明了该方法的有效性和效率。它可以实时提供高质量的相机位姿估计，并能泛化到新场景。[代码](https:\u002F\u002Fgithub.com\u002Fffrivera0\u002Freloc3r)。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.08376) | [💻 代码](https:\u002F\u002Fgithub.com\u002Fffrivera0\u002Freloc3r)\n\n\u003Cbr>\n\n\n### 2. Pos3R：轻松实现未知物体的 6D 位姿估计 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-CVPR-brightgreen)\n**作者**: 邓伟健、迪伦·坎贝尔、孙春义、张嘉豪、舒巴姆·卡尼特卡尔、马修·沙弗、斯蒂芬·古尔德\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n基础模型显著减少了任务特定训练的需求，同时提升了泛化能力。然而，目前最先进的 6D 位姿估计算法要么需要额外的带位姿监督的训练，要么忽视了利用 3D 基础模型所能带来的进步。后者实际上错失了一个重要机会，因为这类模型更擅长预测 3D 一致性的特征，而这些特征对于位姿估计任务至关重要。为弥补这一差距，我们提出了 Pos3R，这是一种仅需一张 RGB 图像即可估计任意物体 6D 位姿的方法，它充分利用了 3D 重建基础模型，无需任何额外训练。我们发现，模板选择是现有方法中的一个关键瓶颈，而使用 3D 模型可以显著缓解这一问题，因为相比 2D 模型，3D 模型更容易区分不同的模板位姿。尽管方法简单，Pos3R 在 BOP 基准测试的七个不同数据集中仍取得了具有竞争力的成绩，其表现可与现有的无微调方法相媲美甚至超越。此外，Pos3R 还能无缝集成渲染对比微调技术，展现出适用于高精度应用的适应性。\n\u003C\u002Fdetails>\n\n  [📄 论文]() | [💻 代码]()\n\n\u003Cbr>\n\n\n\n## DUSt3R 用于科学：\n## 2025：\n### 1. CryoFastAR：轻松实现快速冷冻电镜从头重建 ![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F2025-ICCV-pink)\n**作者**: 张佳凯、周守臣、戴海钊、刘欣航、王培浩、范志文、裴源、于静怡\n\u003Cdetails span>\n\u003Csummary>\u003Cb>摘要\u003C\u002Fb>\u003C\u002Fsummary>\n从无序图像中进行位姿估计是 3D 重建、机器人技术和科学成像的基础。最近出现的几何基础模型，如 DUSt3R，能够实现端到端的密集 3D 重建，但在科学成像领域，例如用于近原子级蛋白质重构的冷冻电子显微镜（cryo-EM）中，这些模型尚未得到充分探索。在 cryo-EM 中，从无序粒子图像进行位姿估计和 3D 重建仍然依赖于耗时的迭代优化，这主要是由于低信噪比（SNR）以及对比度传递函数（CTF）引起的畸变等挑战所致。我们推出了 CryoFastAR，这是首个可以直接从 cryo-EM 噪声图像中预测位姿，从而实现快速从头重建的几何基础模型。通过整合多视角特征，并在包含真实噪声和 CTF 调制的大规模模拟 cryo-EM 数据上进行训练，CryoFastAR 提升了位姿估计的准确性和泛化能力。为提高训练稳定性，我们提出了一种渐进式训练策略，先让模型在较简单的条件下提取关键特征，再逐步增加难度以提升鲁棒性。实验表明，CryoFastAR 在合成和真实数据集上均实现了与传统迭代方法相当的质量，同时显著加快了推理速度。\n\u003C\u002Fdetails>\n\n  [📄 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.05864)\n\n\u003Cbr>\n\n\n\n\n## 相关代码库\n1. [Mini-DUSt3R](https:\u002F\u002Fgithub.com\u002Fpablovela5620\u002Fmini-dust3r)：仅为执行推理而设计的 dust3r 微型版本。2024 年 5 月。\n\n## 博文\n\n1. [轻松实现3D重建模型](https:\u002F\u002Feurope.naverlabs.com\u002Fblog\u002F3d-reconstruction-models-made-easy\u002F)\n2. [InstantSplat：不到一分钟的高斯泼溅技术](https:\u002F\u002Fradiancefields.com\u002Finstantsplat-sub-minute-gaussian-splatting\u002F)\n\n\n## 教程视频\n1. [先进的图像转3D AI，DUSt3R](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=kI7wCEAFFb0)\n2. [BSLIVE Pinokio Dust3R：将2D转换为3D网格](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=vY7GcbOsC-U)\n3. 
[InstantSplat，DUSt3R](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=JdfrG89iPOA)\n\n## 致谢\n- 感谢[Janusch](https:\u002F\u002Ftwitter.com\u002Fjanusch_patas)提供的精彩论文清单[awesome-3D-gaussian-splatting](https:\u002F\u002Fgithub.com\u002FMrNeRF\u002Fawesome-3D-gaussian-splatting)，以及[Chao Wen](https:\u002F\u002Fwalsvid.github.io\u002F)提供的[Awesome-MVS](https:\u002F\u002Fgithub.com\u002Fwalsvid\u002FAwesome-MVS)。本列表在编写过程中参考了这两份资源。","# Awesome DUSt3R 快速上手指南\n\n本指南基于 `awesome-dust3r` 资源列表中的核心项目（DUSt3R 和 MASt3R），帮助开发者快速搭建环境并运行基础的 3D 几何重建任务。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04\u002F22.04) 或 Windows (WSL2)。macOS 支持有限，主要依赖 Metal 后端，建议优先使用 Linux。\n*   **Python 版本**: Python 3.8 - 3.10。\n*   **GPU**: 推荐使用 NVIDIA GPU (显存建议 8GB 以上，处理高分辨率或多视图建议 16GB+)，需安装对应的 CUDA 驱动。\n*   **前置依赖**:\n    *   Git\n    *   Conda (推荐用于环境管理) 或 venv\n    *   PyTorch (需匹配您的 CUDA 版本)\n\n> **国内加速提示**：\n> 建议使用清华源或阿里源加速 Python 包和模型下载。\n> *   Pip 镜像：`https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`\n> *   HuggingFace 镜像：设置环境变量 `HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com`\n\n## 安装步骤\n\n以下以官方核心库 **DUSt3R** 为例进行安装（MASt3R 安装流程类似）。\n\n### 1. 创建并激活虚拟环境\n```bash\nconda create -n dust3r python=3.9 -y\nconda activate dust3r\n```\n\n### 2. 安装 PyTorch\n请根据您的 CUDA 版本选择对应的安装命令（以下为 CUDA 11.8 示例）：\n```bash\npip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n# 若使用国内镜像加速：\n# pip install torch torchvision torchaudio --index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 3. 克隆代码库\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fnaver\u002Fdust3r.git\ncd dust3r\n```\n\n### 4. 安装项目依赖\n```bash\npip install -e .\n```\n> **注意**：如果在国内网络环境下安装 `git+` 开头的依赖失败，建议手动克隆相关子模块或使用镜像源替换 `requirements.txt` 中的源地址后执行 `pip install -r requirements.txt`。\n\n### 5. 配置模型下载加速 (可选但推荐)\n首次运行时会自动从 HuggingFace 下载预训练模型。为避免超时，请设置镜像环境变量：\n```bash\nexport HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n```\n\n## 基本使用\n\nDUSt3R 的核心功能是从任意图像集合中恢复 3D 点云和相机姿态，无需预先知道相机参数。\n\n### 最简单的使用示例 (Python API)\n\n创建一个名为 `demo.py` 的文件，运行以下代码即可对两张图片进行 3D 重建：\n\n```python\nimport torch\nfrom dust3r.inference import inference\nfrom dust3r.model import AsymmetricCroCo3DStereo\nfrom dust3r.utils.image import load_images\n\n# 1. 加载图片 (替换为您本地的图片路径)\nimage_paths = ['assets\u002Fdust3r_logo.png', 'assets\u002Fexample2.jpg'] \nimages = load_images(image_paths, size=512)\n\n# 2. 加载预训练模型\ndevice = 'cuda' if torch.cuda.is_available() else 'cpu'\nmodel = AsymmetricCroCo3DStereo.from_pretrained(\"naver\u002FDUSt3R_ViTLarge_BaseDecoder_512_dpt\").to(device)\n\n# 3. 执行推理\noutput = inference([tuple(images)], model, device, batch_size=1, verbose=False)\n\n# 4. 
获取结果\n# output 是一个字典，其中 pred1\u002Fpred2 包含各视图的 3D 点图 (pts3d) 与置信度 (conf)\nprint(\"Reconstruction done!\")\nprint(f\"Pointmap shape: {output['pred1']['pts3d'].shape}\")\nprint(f\"Confidence map shape: {output['pred1']['conf'].shape}\")\n\n# 如需保存点云为 PLY 文件供可视化软件查看，可先进行全局对齐\nfrom dust3r.cloud_opt import global_aligner, GlobalAlignerMode\n\n# 简单的全局对齐示例 (针对多视图；若仅有一对图像，也可改用 GlobalAlignerMode.PairViewer)\nif len(images) > 1:\n    scene = global_aligner(output, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)\n    scene.compute_global_alignment(init='mst', niter=300, schedule='cosine', lr=0.01)\n    pts3d = scene.get_pts3d()\n    # 此处可添加代码将 pts3d 导出为 .ply 文件（可参考文末的补充示例）\n```\n\n### 命令行快速测试 (如果库支持 CLI)\n\n部分版本提供了简单的脚本入口，可直接运行：\n\n```bash\npython demo.py --model_name DUSt3R_ViTLarge_BaseDecoder_512_dpt --images assets\u002Fdust3r_logo.png assets\u002Fexample2.jpg --output_dir output_demo\n```\n\n### 结果查看\n运行成功后，您可以在输出目录中找到生成的深度图、点云数据或相机轨迹信息。推荐使用 **CloudCompare** 或 **MeshLab** 打开生成的 `.ply` 文件查看 3D 重建效果。\n\n---\n*注：对于 MASt3R (匹配与重建) 或其他衍生项目（如 Fast3R, Light3R-SfM），请参考其各自 GitHub 仓库的 `README.md` 获取特定的模型权重和微调指令，基础环境配置与上述步骤一致。*","某文化遗产保护团队需要利用游客拍摄的非专业照片，快速重建一座古建筑的精细 3D 模型用于数字化归档。\n\n### 没有 awesome-dust3r 时\n- **相机参数依赖严重**：传统多视图立体视觉（MVS）算法必须预先知道相机的内参和外参，而游客照片缺乏这些元数据，导致无法直接计算。\n- **特征匹配失败率高**：面对光照变化大、纹理重复或无特征的区域，传统基于特征点检测的方法极易丢失匹配点，重建模型出现大量空洞。\n- **技术选型迷茫低效**：开发者需在海量论文和代码库中盲目搜索最新进展，难以区分哪些方案支持“无标定”重建，耗费数周时间试错。\n- **动态场景处理棘手**：若照片中包含移动的游客或车辆，传统静态场景假设会导致模型产生严重的伪影和拉伸变形。\n\n### 使用 awesome-dust3r 后\n- **零标定直接重建**：借助列表中集成的 DUSt3R 核心算法，团队直接输入任意图片集即可回归出点云图，完全无需相机校准信息。\n- **鲁棒的几何理解**：利用列表推荐的 MASt3R 等进阶模型，即使在弱纹理或动态干扰下，也能通过几何基础模型的特性获得稠密且一致的 3D 结构。\n- **一站式资源导航**：通过 awesome-dust3r 整理的分类清单，团队迅速定位到针对动态场景（如 Geo4D）和快速重建（如 Fast3R）的最优开源代码，将调研周期缩短至几天。\n- **生态扩展便捷**：依据列表指引，轻松接入 Gaussian Splatting 相关工具（如 Splatt3R），将重建结果快速转化为可实时渲染的高保真场景。\n\nawesome-dust3r 通过聚合前沿几何基础模型生态，让非专家团队也能在无标定条件下高效完成高质量的 3D 重建任务。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fruili3_awesome-dust3r_eea2c8d3.png","ruili3","Rui Li","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fruili3_d23380e0.jpg","CS Ph.D. | working on 3D computer vision",null,"Zürich, Switzerland","lirui.david@gmail.com","https:\u002F\u002Fruili3.github.io","https:\u002F\u002Fgithub.com\u002Fruili3",789,26,"2026-04-05T10:14:23","MIT",4,"","未说明",{"notes":93,"python":91,"dependencies":94},"提供的 README 内容仅为 DUSt3R\u002FMASt3R 相关论文、代码库和资源的项目列表（Awesome List），不包含具体的安装指南、环境配置要求或依赖版本信息。如需获取运行环境需求，请访问文中列出的具体项目代码库链接（如 github.com\u002Fnaver\u002Fdust3r 或 github.com\u002Fnaver\u002Fmast3r）。",[],[18],[97,98,99,100,101],"bundle-adjustment","depth-estimation","multiview-geometry","pointcloud-registration","pose-estimation","2026-03-27T02:49:30.150509","2026-04-06T18:55:56.634888",[],[]]
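
### 补充示例：多视图重建并导出 PLY 点云（示意）

上文快速上手指南的多视图对齐部分以注释形式留出了 PLY 导出步骤。下面给出一个最小化的示意代码，演示如何从多张图片构建图像对、执行全局对齐并把点云写成 `.ply` 文件，供 MeshLab 或 CloudCompare 查看。其中 `make_pairs`、`GlobalAlignerMode`、`compute_global_alignment` 按 naver/dust3r 仓库 README 的常见用法书写，`trimesh` 假定已随依赖安装；这只是基于上述假设的草图，实际接口请以所安装版本的文档为准。

```python
# 示意代码（非官方完整实现）：多视图 DUSt3R 推理 + 全局对齐 + PLY 导出
import numpy as np
import torch
import trimesh  # 假设已安装（dust3r 的常见依赖之一），否则 pip install trimesh

from dust3r.inference import inference
from dust3r.model import AsymmetricCroCo3DStereo
from dust3r.utils.image import load_images
from dust3r.image_pairs import make_pairs
from dust3r.cloud_opt import global_aligner, GlobalAlignerMode

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AsymmetricCroCo3DStereo.from_pretrained(
    "naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt").to(device)

# 1. 加载多张图片（替换为本地路径），并两两组合成图像对
images = load_images(['view1.jpg', 'view2.jpg', 'view3.jpg'], size=512)
pairs = make_pairs(images, scene_graph='complete', prefilter=None, symmetrize=True)

# 2. 成对推理，再做全局对齐，把各对的点图融合到同一坐标系
output = inference(pairs, model, device, batch_size=1)
scene = global_aligner(output, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)
scene.compute_global_alignment(init='mst', niter=300, schedule='cosine', lr=0.01)

# 3. 拼接各视图的 (H, W, 3) 点图并导出为 PLY
pts3d = scene.get_pts3d()
pts = np.concatenate([p.detach().cpu().numpy().reshape(-1, 3) for p in pts3d], axis=0)
trimesh.PointCloud(pts).export('scene.ply')
print(f"Exported {pts.shape[0]} points to scene.ply")
```

导出的 `scene.ply` 即可直接拖入 MeshLab 或 CloudCompare 查看；如需同时保存相机位姿或深度图，可在对齐完成后调用场景对象提供的相应读取方法（具体以所用版本的接口为准）。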