<div align="center">
<h1 style="border-bottom: none; margin-bottom: 0px ">Depth Anything 3: Recovering the Visual Space from Any Views</h1>

[**Haotong Lin**](https://haotongl.github.io/)<sup>&ast;</sup> · [**Sili Chen**](https://github.com/SiliChen321)<sup>&ast;</sup> · [**Jun Hao Liew**](https://liewjunhao.github.io/)<sup>&ast;</sup> · [**Donny Y. Chen**](https://donydchen.github.io)<sup>&ast;</sup> · [**Zhenyu Li**](https://zhyever.github.io/) · [**Guang Shi**](https://scholar.google.com/citations?user=MjXxWbUAAAAJ&hl=en) · [**Jiashi Feng**](https://scholar.google.com.sg/citations?user=Q8iay0gAAAAJ&hl=en)
<br>
[**Bingyi Kang**](https://bingyikang.com/)<sup>&ast;&dagger;</sup>

&dagger;project lead&emsp;&ast;Equal Contribution

<a href="https://arxiv.org/abs/2511.10647"><img src='https://img.shields.io/badge/arXiv-Depth Anything 3-red' alt='Paper PDF'></a>
<a href='https://depth-anything-3.github.io'><img src='https://img.shields.io/badge/Project_Page-Depth Anything 3-green' alt='Project Page'></a>
<a href='https://huggingface.co/spaces/depth-anything/Depth-Anything-3'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-blue'></a>

</div>

This work presents **Depth Anything 3 (DA3)**, a model that predicts spatially consistent geometry from arbitrary visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights:
- 💎 A **single plain transformer** (e.g., a vanilla DINO encoder) is sufficient as a backbone, without architectural specialization.
- ✨ A singular **depth-ray representation** obviates the need for complex multi-task learning.

🏆 DA3 significantly outperforms [DA2](https://github.com/DepthAnything/Depth-Anything-V2) for monocular depth estimation, and [VGGT](https://github.com/facebookresearch/vggt) for multi-view depth estimation and pose estimation. All models are trained exclusively on **public academic datasets**.

<p align="center">
  <img src="https://oss.gittoolsai.com/images/ByteDance-Seed_Depth-Anything-3_readme_68537eaa3539.gif" alt="Depth Anything 3 - Left" width="70%">
</p>
<p align="center">
  <img src="https://oss.gittoolsai.com/images/ByteDance-Seed_Depth-Anything-3_readme_d7bb5426caae.png" alt="Depth Anything 3" width="100%">
</p>


## 📰 News
- **11-12-2025:** 🚀 New models and [**DA3-Streaming**](da3_streaming/README.md) released! Handle ultra-long video sequence inference with less than 12 GB of GPU memory via sliding-window streaming inference. Special thanks to [Kai Deng](https://github.com/DengKaiCQ) for his contribution to DA3-Streaming!
- **08-12-2025:** 📊 [Benchmark evaluation pipeline](docs/BENCHMARK.md) released! Evaluate pose estimation and 3D reconstruction on 5 datasets.
- **30-11-2025:** Added [`use_ray_pose`](#use-ray-pose) and [`ref_view_strategy`](docs/funcs/ref_view_strategy.md) (reference-view selection for multi-view inputs).
- **25-11-2025:** Added [Awesome DA3 Projects](#-awesome-da3-projects), a community-driven section featuring DA3-based applications.
- **14-11-2025:** Paper, project page, code, and models are all released.

## ✨ Highlights

### 🏆 Model Zoo
We release three series of models, each tailored for specific use cases in visual geometry.

- 🌟 **DA3 Main Series** (`DA3-Giant`, `DA3-Large`, `DA3-Base`, `DA3-Small`): our flagship foundation models, trained with a unified depth-ray representation. By varying the input configuration, a single model can perform a wide range of tasks:
  + 🌊 **Monocular Depth Estimation**: predicts a depth map from a single RGB image.
  + 🌊 **Multi-View Depth Estimation**: generates consistent depth maps from multiple images for high-quality fusion.
  + 🎯 **Pose-Conditioned Depth Estimation**: achieves superior depth consistency when camera poses are provided as input.
  + 📷 **Camera Pose Estimation**: estimates camera extrinsics and intrinsics from one or more images.
  + 🟡 **3D Gaussian Estimation**: directly predicts 3D Gaussians, enabling high-fidelity novel view synthesis.

- 📐 **DA3 Metric Series** (`DA3Metric-Large`): a specialized model fine-tuned for metric depth estimation in monocular settings, ideal for applications requiring real-world scale.

- 🔍 **DA3 Monocular Series** (`DA3Mono-Large`): a dedicated model for high-quality relative monocular depth estimation. Unlike disparity-based models (e.g., [Depth Anything 2](https://github.com/DepthAnything/Depth-Anything-V2)), it predicts depth directly, resulting in superior geometric accuracy.

🔗 Building on these models, we developed a **nested series** (`DA3Nested-Giant-Large`), which combines an any-view giant model with a metric model to reconstruct visual geometry at real-world metric scale.

### 🛠️ Codebase Features
Our repository is designed to be a powerful and user-friendly toolkit for both practical application and future research.
- 🎨 **Interactive Web UI & Gallery**: visualize model outputs and compare results with an easy-to-use Gradio-based web interface.
- ⚡ **Flexible Command-Line Interface (CLI)**: a powerful, scriptable CLI for batch processing and integration into custom workflows.
- 💾 **Multiple Export Formats**: save results in various formats, including `glb`, `npz`, depth images, `ply`, 3DGS videos, etc., to connect seamlessly with other tools.
- 🔧 **Extensible and Modular Design**: the codebase is structured to facilitate future research and the integration of new models or functionalities.


## 🚀 Quick Start

### 📦 Installation

```bash
pip install xformers torch\>=2 torchvision
pip install -e . # Basic
pip install --no-build-isolation git+https://github.com/nerfstudio-project/gsplat.git@0b4dddf04cb687367602c01196913cde6a743d70 # for gaussian head
pip install -e ".[app]" # Gradio, python>=3.10
pip install -e ".[all]" # ALL
```

For detailed model information, please refer to the [Model Cards](#-model-cards) section below.

### 💻 Basic Usage

```python
import glob, os, torch
from depth_anything_3.api import DepthAnything3

device = torch.device("cuda")
model = DepthAnything3.from_pretrained("depth-anything/DA3NESTED-GIANT-LARGE")
model = model.to(device=device)
example_path = "assets/examples/SOH"
images = sorted(glob.glob(os.path.join(example_path, "*.png")))
prediction = model.inference(images)
# prediction.processed_images : [N, H, W, 3] uint8   array
print(prediction.processed_images.shape)
# prediction.depth            : [N, H, W]    float32 array
print(prediction.depth.shape)
# prediction.conf             : [N, H, W]    float32 array
print(prediction.conf.shape)
# prediction.extrinsics       : [N, 3, 4]    float32 array # OpenCV w2c (COLMAP) convention
print(prediction.extrinsics.shape)
# prediction.intrinsics       : [N, 3, 3]    float32 array
print(prediction.intrinsics.shape)
```

```bash
export MODEL_DIR=depth-anything/DA3NESTED-GIANT-LARGE
# This can be a Hugging Face repository or a local directory.
# If you encounter network issues, consider using the following mirror: export HF_ENDPOINT=https://hf-mirror.com
# Alternatively, you can download the model directly from Hugging Face.
export GALLERY_DIR=workspace/gallery
mkdir -p $GALLERY_DIR

# CLI auto mode with backend reuse
da3 backend --model-dir ${MODEL_DIR} --gallery-dir ${GALLERY_DIR} # Cache model on the GPU
da3 auto assets/examples/SOH \
    --export-format glb \
    --export-dir ${GALLERY_DIR}/TEST_BACKEND/SOH \
    --use-backend

# CLI video processing with feature visualization
da3 video assets/examples/robot_unitree.mp4 \
    --fps 15 \
    --use-backend \
    --export-dir ${GALLERY_DIR}/TEST_BACKEND/robo \
    --export-format glb-feat_vis \
    --feat-vis-fps 15 \
    --process-res-method lower_bound_resize \
    --export-feat "11,21,31"

# CLI auto mode without backend reuse
da3 auto assets/examples/SOH \
    --export-format glb \
    --export-dir ${GALLERY_DIR}/TEST_CLI/SOH \
    --model-dir ${MODEL_DIR}
```
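As a usage sketch (not part of the official API), the per-view outputs above can be lifted into a world-space point cloud with the standard pinhole model, assuming the documented conventions: `prediction.intrinsics` as pixel-unit K matrices and `prediction.extrinsics` as OpenCV world-to-camera `[R|t]`. The helper name `unproject_to_world` is hypothetical.

```python
import numpy as np

def unproject_to_world(depth, K, extrinsic_w2c):
    """Back-project a depth map to world-space 3D points.

    depth          : [H, W] float array, depth along the camera z-axis
    K              : [3, 3] pinhole intrinsics (pixel units)
    extrinsic_w2c  : [3, 4] world-to-camera [R|t], i.e. x_cam = R @ x_world + t
    returns        : [H, W, 3] world-space points
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))          # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # [3, H*W]
    rays = np.linalg.inv(K) @ pix                           # camera rays at z = 1
    cam = rays * depth.reshape(1, -1)                       # scale rays by depth
    R, t = extrinsic_w2c[:, :3], extrinsic_w2c[:, 3:]
    world = R.T @ (cam - t)                                 # invert w2c transform
    return world.T.reshape(H, W, 3)
```

Applied per view, `unproject_to_world(prediction.depth[i], prediction.intrinsics[i], prediction.extrinsics[i])` would give each view's pixels in a shared world frame; fusing views this way is what the consistent multi-view depth maps are designed to support.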
The model architecture is defined in [`DepthAnything3Net`](src/depth_anything_3/model/da3.py) and specified with a YAML config file located in [`src/depth_anything_3/configs`](src/depth_anything_3/configs); input and output processing are handled by [`DepthAnything3`](src/depth_anything_3/api.py). To customize the model architecture, simply create a new config file (*e.g.*, `path/to/new/config`) as:

```yaml
__object__:
  path: depth_anything_3.model.da3
  name: DepthAnything3Net
  args: as_params

net:
  __object__:
    path: depth_anything_3.model.dinov2.dinov2
    name: DinoV2
    args: as_params

  name: vitb
  out_layers: [5, 7, 9, 11]
  alt_start: 4
  qknorm_start: 4
  rope_start: 4
  cat_token: True

head:
  __object__:
    path: depth_anything_3.model.dualdpt
    name: DualDPT
    args: as_params

  dim_in: &head_dim_in 1536
  output_dim: 2
  features: &head_features 128
  out_channels: &head_out_channels [96, 192, 384, 768]
```

The model can then be created with the following code snippet.
```python
from depth_anything_3.cfg import create_object, load_config

Model = create_object(load_config("path/to/new/config"))
```


## 📚 Useful Documentation

- 🖥️ [Command Line Interface](docs/CLI.md)
- 📑 [Python API](docs/API.md)
- 📊 [Benchmark Evaluation](docs/BENCHMARK.md)

## 🗂️ Model Cards

In general, you should observe that DA3-LARGE achieves results comparable to VGGT.

The Nested series uses an any-view model to estimate pose and depth, and a monocular metric depth estimator for scaling.

⚠️ Models with the `-1.1` suffix were retrained after fixing a training bug; prefer these refreshed checkpoints. The original `DA3NESTED-GIANT-LARGE`, `DA3-GIANT`, and `DA3-LARGE` remain available but are deprecated. You can expect much better performance on street scenes with the `-1.1` models.

| 🗃️ Model Name | 📏 Params | 📊 Rel. Depth | 📷 Pose Est. | 🧭 Pose Cond. | 🎨 GS | 📐 Met. Depth | ☁️ Sky Seg | 📄 License |
|---|---|---|---|---|---|---|---|---|
| **Nested** | | | | | | | | |
| [DA3NESTED-GIANT-LARGE-1.1](https://huggingface.co/depth-anything/DA3NESTED-GIANT-LARGE-1.1) | 1.40B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | CC BY-NC 4.0 |
| [DA3NESTED-GIANT-LARGE](https://huggingface.co/depth-anything/DA3NESTED-GIANT-LARGE) | 1.40B | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | CC BY-NC 4.0 |
| **Any-view Model** | | | | | | | | |
| [DA3-GIANT-1.1](https://huggingface.co/depth-anything/DA3-GIANT-1.1) | 1.15B | ✅ | ✅ | ✅ | ✅ | | | CC BY-NC 4.0 |
| [DA3-GIANT](https://huggingface.co/depth-anything/DA3-GIANT) | 1.15B | ✅ | ✅ | ✅ | ✅ | | | CC BY-NC 4.0 |
| [DA3-LARGE-1.1](https://huggingface.co/depth-anything/DA3-LARGE-1.1) | 0.35B | ✅ | ✅ | ✅ | | | | CC BY-NC 4.0 |
| [DA3-LARGE](https://huggingface.co/depth-anything/DA3-LARGE) | 0.35B | ✅ | ✅ | ✅ | | | | CC BY-NC 4.0 |
| [DA3-BASE](https://huggingface.co/depth-anything/DA3-BASE) | 0.12B | ✅ | ✅ | ✅ | | | | Apache 2.0 |
| [DA3-SMALL](https://huggingface.co/depth-anything/DA3-SMALL) | 0.08B | ✅ | ✅ | ✅ | | | | Apache 2.0 |
| **Monocular Metric Depth** | | | | | | | | |
| [DA3METRIC-LARGE](https://huggingface.co/depth-anything/DA3METRIC-LARGE) | 0.35B | ✅ | | | | ✅ | ✅ | Apache 2.0 |
| **Monocular Depth** | | | | | | | | |
| [DA3MONO-LARGE](https://huggingface.co/depth-anything/DA3MONO-LARGE) | 0.35B | ✅ | | | | | ✅ | Apache 2.0 |


## ❓ FAQ

- **Monocular Metric Depth**: To obtain metric depth in meters from `DA3METRIC-LARGE`, use `metric_depth = focal * net_output / 300.`, where `focal` is the focal length in pixels (typically the average of fx and fy from the camera intrinsic matrix K). Note that the output of `DA3NESTED-GIANT-LARGE` is already in meters.
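The `DA3METRIC-LARGE` conversion above can be sketched as follows; the helper name `metric_depth_from_net_output` is hypothetical, and the formula is taken directly from the FAQ.

```python
import numpy as np

def metric_depth_from_net_output(net_output, K):
    """Convert DA3METRIC-LARGE network output to metric depth in meters.

    net_output : array of raw network outputs
    K          : [3, 3] camera intrinsic matrix in pixel units
    """
    focal = 0.5 * (K[0, 0] + K[1, 1])   # average of fx and fy, in pixels
    return focal * net_output / 300.0   # FAQ formula: metric_depth = focal * net_output / 300.
```

For example, with fx = fy = 600 pixels, a raw output of 1.0 maps to a depth of 2.0 meters.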
- <a id="use-ray-pose"></a>**Ray Head (`use_ray_pose`)**: The API and CLI support a `use_ray_pose` argument. When enabled, the model derives the camera pose from the ray head, which is generally slightly slower but more accurate. Note that the default is `False` for faster inference speed.
  <details>
  <summary>AUC3 Results for DA3NESTED-GIANT-LARGE</summary>

  | Model | HiRoom | ETH3D | DTU | 7Scenes | ScanNet++ |
  |-------|------|-------|-----|---------|-----------|
  | `ray_head` | 84.4 | 52.6 | 93.9 | 29.5 | 89.4 |
  | `cam_head` | 80.3 | 48.4 | 94.1 | 28.5 | 85.0 |

  </details>

- **Older GPUs without XFormers support**: See [Issue #11](https://github.com/ByteDance-Seed/Depth-Anything-3/issues/11). Thanks to [@S-Mahoney](https://github.com/S-Mahoney) for the solution!


## 🏢 Awesome DA3 Projects

A community-curated list of Depth Anything 3 integrations across 3D tools, creative pipelines, robotics, and web/VR viewers. You are welcome to submit your DA3-based project via PR; we will review and feature it if applicable.

- [DA3-blender](https://github.com/xy-gao/DA3-blender): Blender addon for DA3-based 3D reconstruction from a set of images.

- [ComfyUI-DepthAnythingV3](https://github.com/PozzettiAndrea/ComfyUI-DepthAnythingV3): ComfyUI nodes for Depth Anything 3, supporting single/multi-view and video-consistent depth with optional point-cloud export.

- [DA3-ROS2-Wrapper](https://github.com/GerdsenAI/GerdsenAI-Depth-Anything-3-ROS2-Wrapper): real-time DA3 depth in ROS2 with multi-camera support.

- [DA3-ROS2-CPP-TensorRT](https://github.com/ika-rwth-aachen/ros2-depth-anything-v3-trt): a ROS2 C++ node for DA3 depth estimation using TensorRT for real-time inference.

- [VideoDepthViewer3D](https://github.com/amariichi/VideoDepthViewer3D): streams videos with DA3 metric depth to a Three.js/WebXR 3D viewer for VR/stereo playback.


## 🧑‍💻 Official Codebase Core Contributors and Maintainers

[**Bingyi Kang**](https://bingykang.github.io/) · [Haotong Lin](https://haotongl.github.io/) · [Sili Chen](https://github.com/SiliChen321) · [Jun Hao Liew](https://liewjunhao.github.io/) · [Donny Y. Chen](https://donydchen.github.io/) · [Kai Deng](https://github.com/DengKaiCQ)

## 📝 Citations
If you find Depth Anything 3 useful in your research or projects, please cite our work:

```
@article{depthanything3,
  title={Depth Anything 3: Recovering the visual space from any views},
  author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
  journal={arXiv preprint arXiv:2511.10647},
  year={2025}
}
```
[DA3-GIANT](https:\u002F\u002Fhuggingface.co\u002Fdepth-anything\u002FDA3-GIANT)                     | 1.15B     | ✅             | ✅            | ✅             | ✅     |               |           | CC BY-NC 4.0   |\n| [DA3-LARGE-1.1](https:\u002F\u002Fhuggingface.co\u002Fdepth-anything\u002FDA3-LARGE-1.1)                     | 0.35B     | ✅             | ✅            | ✅             |       |               |           | CC BY-NC 4.0     |\n| [DA3-LARGE](https:\u002F\u002Fhuggingface.co\u002Fdepth-anything\u002FDA3-LARGE)                     | 0.35B     | ✅             | ✅            | ✅             |       |               |           | CC BY-NC 4.0     |\n| [DA3-BASE](https:\u002F\u002Fhuggingface.co\u002Fdepth-anything\u002FDA3-BASE)                     | 0.12B     | ✅             | ✅            | ✅             |       |               |           | Apache 2.0     |\n| [DA3-SMALL](https:\u002F\u002Fhuggingface.co\u002Fdepth-anything\u002FDA3-SMALL)                     | 0.08B     | ✅             | ✅            | ✅             |       |               |           | Apache 2.0     |\n|                               |           |               |              |               |               |       |           |                |\n| **单目度量深度 (Monocular Metric Depth)** | | | | | | | | |\n| [DA3METRIC-LARGE](https:\u002F\u002Fhuggingface.co\u002Fdepth-anything\u002FDA3METRIC-LARGE)              | 0.35B     | ✅             |              |               |       | ✅             | ✅         | Apache 2.0     |\n|                               |           |               |              |               |               |       |           |                |\n| **单目深度 (Monocular Depth)** | | | | | | | | |\n| [DA3MONO-LARGE](https:\u002F\u002Fhuggingface.co\u002Fdepth-anything\u002FDA3MONO-LARGE)                | 0.35B     | ✅             |              |               |               |       | ✅         | Apache 2.0     |\n\n\n## ❓ 常见问题\n\n- **单目度量深度 (Monocular Metric Depth)**：要从 
`DA3METRIC-LARGE` 获取以米为单位的度量深度，请使用 `metric_depth = focal * net_output \u002F 300.`，其中 `focal` 是以像素为单位的焦距（通常是相机内参矩阵 K 中 fx 和 fy 的平均值）。注意 `DA3NESTED-GIANT-LARGE` 的输出已经是米为单位。\n\n- \u003Ca id=\"use-ray-pose\">\u003C\u002Fa>**射线头 (`use_ray_pose`)**：我们的 API 和 CLI 支持 `use_ray_pose` 参数，这意味着模型将从射线头推导相机姿态，这通常稍慢一些，但更准确。注意默认值为 `False` 以获得更快的推理速度。 \n  \u003Cdetails>\n  \u003Csummary>DA3NESTED-GIANT-LARGE 的 AUC3 结果\u003C\u002Fsummary>\n  \n  | 模型 | HiRoom | ETH3D | DTU | 7Scenes | ScanNet++ | \n  |-------|------|-------|-----|---------|-----------|\n  | `ray_head` | 84.4 | 52.6 | 93.9 | 29.5 | 89.4 |\n  | `cam_head` | 80.3 | 48.4 | 94.1 | 28.5 | 85.0 |\n\n  \u003C\u002Fdetails>\n\n\n\n\n- **不支持 XFormers 的旧 GPU**：请参阅 [Issue #11](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FDepth-Anything-3\u002Fissues\u002F11)。感谢 [@S-Mahoney](https:\u002F\u002Fgithub.com\u002FS-Mahoney) 提供的解决方案！\n\n\n## 🏢 优秀的 DA3 项目\n\n这是一个由社区整理的 Depth Anything 3 集成列表，涵盖 3D 工具、创意管线、机器人以及 Web\u002FVR 查看器等领域，包括但不限于以下项目。欢迎您通过 PR 提交基于 DA3 的项目，如果适用，我们将审核并展示。\n\n- [DA3-blender](https:\u002F\u002Fgithub.com\u002Fxy-gao\u002FDA3-blender): 用于从一组图像进行基于 DA3 的 3D 重建的 Blender 插件。 \n\n- [ComfyUI-DepthAnythingV3](https:\u002F\u002Fgithub.com\u002FPozzettiAndrea\u002FComfyUI-DepthAnythingV3): Depth Anything 3 的 ComfyUI 节点，支持单\u002F多视图和视频一致深度，可选导出点云。\n\n- [DA3-ROS2-Wrapper](https:\u002F\u002Fgithub.com\u002FGerdsenAI\u002FGerdsenAI-Depth-Anything-3-ROS2-Wrapper): 支持多摄像头的 ROS2 实时 DA3 深度。 \n\n- [DA3-ROS2-CPP-TensorRT](https:\u002F\u002Fgithub.com\u002Fika-rwth-aachen\u002Fros2-depth-anything-v3-trt): DA3 ROS2 C++ TensorRT 推理节点：一个使用 TensorRT 进行实时推理的 DA3 深度估计 ROS2 节点。\n\n- [VideoDepthViewer3D](https:\u002F\u002Fgithub.com\u002Famariichi\u002FVideoDepthViewer3D): 将带有 DA3 度量深度的视频流式传输到 Three.js\u002FWebXR 3D 查看器，用于 VR\u002F立体播放。\n\n## 🧑‍💻 官方代码库核心贡献者与维护者\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Ctd align=\"center\">\n      \u003Ca href=\"https:\u002F\u002Fbingykang.github.io\u002F\">\n        \u003Cimg 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Depth-Anything-3_readme_3b3d6fe30713.png\" width=\"100px;\" alt=\"\"\u002F>\n      \u003C\u002Fa>\n        \u003Cbr \u002F>\n        \u003Csub>\u003Cb>Bingyi Kang\u003C\u002Fb>\u003C\u002Fsub>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\">\n      \u003Ca href=\"https:\u002F\u002Fhaotongl.github.io\u002F\">\n        \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Depth-Anything-3_readme_9f105e84168b.png\" width=\"100px;\" alt=\"\"\u002F>\n      \u003C\u002Fa>\n        \u003Cbr \u002F>\n        \u003Csub>Haotong Lin\u003C\u002Fsub>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\">\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FSiliChen321\">\n        \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Depth-Anything-3_readme_ab7189683e54.png\" width=\"100px;\" alt=\"\"\u002F>\n      \u003C\u002Fa>\n        \u003Cbr \u002F>\n        \u003Csub>Sili Chen\u003C\u002Fsub>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\">\n      \u003Ca href=\"https:\u002F\u002Fliewjunhao.github.io\u002F\">\n        \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Depth-Anything-3_readme_3a3edf0d150a.png\" width=\"100px;\" alt=\"\"\u002F>\n       \u003C\u002Fa>\n        \u003Cbr \u002F>\n        \u003Csub>Jun Hao Liew\u003C\u002Fsub>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\">\n      \u003Ca href=\"https:\u002F\u002Fdonydchen.github.io\u002F\">\n        \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Depth-Anything-3_readme_bc863e529e55.png\" width=\"100px;\" alt=\"\"\u002F>\n      \u003C\u002Fa>\n        \u003Cbr \u002F>\n        \u003Csub>Donny Y. 
Chen\u003C\u002Fsub>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\">\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FDengKaiCQ\">\n        \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Depth-Anything-3_readme_83227532e374.png\" width=\"100px;\" alt=\"\"\u002F>\n      \u003C\u002Fa>\n        \u003Cbr \u002F>\n        \u003Csub>Kai Deng\u003C\u002Fsub>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n## 📝 引用\n如果您在研究或项目中发现 Depth Anything 3 有用，请引用我们的工作：\n\n```\n@article{depthanything3,\n  title={Depth Anything 3: Recovering the visual space from any views},\n  author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},\n  journal={arXiv preprint arXiv:2511.10647},\n  year={2025}\n}\n```","# Depth Anything 3 (DA3) 快速上手指南\n\nDepth Anything 3 (DA3) 是一个能够从任意视角恢复视觉空间的模型，支持单目、多目深度估计及相机位姿估计。本指南帮助您快速完成环境搭建与基础调用。\n\n## 环境准备\n\n- **操作系统**: Linux \u002F macOS \u002F Windows\n- **Python 版本**: >= 3.10\n- **硬件要求**: 建议配备 NVIDIA GPU（CUDA 环境）\n- **核心依赖**: PyTorch >= 2.0, xformers\n\n## 安装步骤\n\n### 1. 配置下载加速（可选但推荐）\n为避免 Hugging Face 下载缓慢，建议在安装前设置国内镜像：\n```bash\nexport HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n```\n\n### 2. 安装依赖\n请根据您的需求选择以下命令之一：\n\n**基础安装：**\n```bash\npip install xformers torch>=2 torchvision\npip install -e . 
\n```\n\n**如需高斯头支持（3DGS）：**\n```bash\npip install --no-build-isolation git+https:\u002F\u002Fgithub.com\u002Fnerfstudio-project\u002Fgsplat.git@0b4dddf04cb687367602c01196913cde6a743d70\n```\n\n**如需 Web UI 功能：**\n```bash\npip install -e \".[app]\" \n```\n\n**完整安装（包含所有功能）：**\n```bash\npip install -e \".[all]\" \n```\n\n## 基本使用\n\n### Python API 调用\n\n以下示例演示如何加载模型并进行推理。\n\n> **注意**：虽然示例中使用的是 `DA3NESTED-GIANT-LARGE`，但官方建议优先使用带 `-1.1` 后缀的模型（如 `DA3NESTED-GIANT-LARGE-1.1`），以获取修复后的最佳性能。\n\n```python\nimport glob, os, torch\nfrom depth_anything_3.api import DepthAnything3\n\ndevice = torch.device(\"cuda\")\n# 建议使用 -1.1 版本以获得更优性能\nmodel = DepthAnything3.from_pretrained(\"depth-anything\u002FDA3NESTED-GIANT-LARGE\") \nmodel = model.to(device=device)\n\nexample_path = \"assets\u002Fexamples\u002FSOH\"\nimages = sorted(glob.glob(os.path.join(example_path, \"*.png\")))\n\nprediction = model.inference(images)\n\n# prediction.processed_images : [N, H, W, 3] uint8   array\nprint(prediction.processed_images.shape)\n# prediction.depth            : [N, H, W]    float32 array\nprint(prediction.depth.shape)  \n# prediction.conf             : [N, H, W]    float32 array\nprint(prediction.conf.shape)  \n# prediction.extrinsics       : [N, 3, 4]    float32 array\nprint(prediction.extrinsics.shape)\n# prediction.intrinsics       : [N, 3, 3]    float32 array\nprint(prediction.intrinsics.shape)\n```\n\n### 命令行工具 (CLI)\n\n#### 自动模式处理图片\n```bash\nexport MODEL_DIR=depth-anything\u002FDA3NESTED-GIANT-LARGE\nexport GALLERY_DIR=workspace\u002Fgallery\nmkdir -p $GALLERY_DIR\n\n# 启动后端缓存\nda3 backend --model-dir ${MODEL_DIR} --gallery-dir ${GALLERY_DIR} \n\n# 执行自动处理并导出为 glb 格式\nda3 auto assets\u002Fexamples\u002FSOH \\\n    --export-format glb \\\n    --export-dir ${GALLERY_DIR}\u002FTEST_BACKEND\u002FSOH \\\n    --use-backend\n```\n\n#### 视频处理\n```bash\nda3 video assets\u002Fexamples\u002Frobot_unitree.mp4 \\\n    --fps 15 \\\n    --use-backend \\\n    --export-dir 
${GALLERY_DIR}\u002FTEST_BACKEND\u002Frobo \\\n    --export-format glb-feat_vis \\\n    --feat-vis-fps 15 \\\n    --process-res-method lower_bound_resize \\\n    --export-feat \"11,21,31\"\n```\n\n更多高级用法请参考项目文档中的 [Command Line Interface](docs\u002FCLI.md) 和 [Python API](docs\u002FAPI.md)。","某自动驾驶研发团队正在优化巡检机器人的室内建图模块，核心目标是利用普通摄像头替代昂贵传感器实现环境感知。\n\n### 没有 Depth-Anything-3 时\n- 必须搭载昂贵的激光雷达才能获取准确深度信息，导致单机硬件成本居高不下。\n- 传统多视角融合算法逻辑复杂，推理速度慢，难以满足实时动态导航的延迟要求。\n- 处理长视频序列时显存占用过大，经常导致程序崩溃中断，无法连续作业。\n- 针对不同场景需反复微调模型，缺乏通用性，部署周期长达数周且维护困难。\n\n### 使用 Depth-Anything-3 后\n- 仅凭单目或双目摄像头即可恢复高精度几何结构，大幅降低硬件门槛与采购成本。\n- 统一架构同时支持单目与多视图估计，简化了工程管线并显著提升推理效率。\n- 借助流式推理功能，在有限显存下稳定处理超长监控视频流，确保持续运行。\n- 基于公开数据集预训练，无需额外标注数据即可直接落地应用，缩短交付时间。\n\nDepth-Anything-3 让普通摄像头也能具备专业级的实时三维空间理解能力，彻底改变了视觉感知系统的构建方式。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FByteDance-Seed_Depth-Anything-3_74ea7004.png","ByteDance-Seed","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FByteDance-Seed_8c020fee.png","",null,"seed.feedback@bytedance.com","https:\u002F\u002Fseed.bytedance.com\u002F","https:\u002F\u002Fgithub.com\u002FByteDance-Seed",[83,87,91,95],{"name":84,"color":85,"percentage":86},"Python","#3572A5",60.5,{"name":88,"color":89,"percentage":90},"Jupyter Notebook","#DA5B0B",39.3,{"name":92,"color":93,"percentage":94},"C++","#f34b7d",0.2,{"name":96,"color":97,"percentage":98},"Shell","#89e051",0,4897,505,"2026-04-05T06:21:00","Apache-2.0","未说明","需要 NVIDIA GPU 及 CUDA 环境，流式推理建议显存小于 12GB",{"notes":106,"python":107,"dependencies":108},"推荐使用 -1.1 后缀的模型权重以修复训练 bug；支持滑动窗口处理超长视频；模型文件需从 Hugging Face 下载","3.10+",[109,110,111,112,113],"torch>=2","torchvision","xformers","gsplat","gradio",[14,37],7,"2026-03-27T02:49:30.150509","2026-04-06T06:46:07.042211",[119,124,128,133,137,142],{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},2024,"为什么我的深度图有伪影且多视角点云几乎没有点？","请确保正确加载了模型权重。必须使用 `from_pretrained` 
方法加载预训练模型，否则无法生成有效结果。","https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FDepth-Anything-3\u002Fissues\u002F22",{"id":125,"question_zh":126,"answer_zh":127,"source_url":123},2025,"多视角几何对齐需要满足什么条件？","Umeyama 对齐算法要求至少包含 3 个位姿（N >= 3）。如果输入视图少于 3 张，可能无法执行对齐。",{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},2026,"3DGS 渲染结果出现错位该如何解决？","建议切换到更新后的模型版本。请将模型加载代码改为：`model = DepthAnything3.from_pretrained(\"depth-anything\u002FDA3NESTED-GIANT-LARGE-1.1\")`，这能改善少视图场景下的对齐效果。","https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FDepth-Anything-3\u002Fissues\u002F136",{"id":134,"question_zh":135,"answer_zh":136,"source_url":132},2027,"手动移除高斯位置偏移能否解决对齐问题？","不建议手动移除偏移。虽然移除后点云对齐会改善，但会导致新视角渲染质量下降。推荐使用官方更新的模型版本而非修改代码逻辑。",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},2028,"为什么深度图的背景区域会出现噪声或不平滑？","这是因为训练数据（如 Objaverse）的背景是无监督的，导致模型在纯色背景上产生无意义值。可以通过检查置信度图（confidence map）来过滤这些无效区域。","https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FDepth-Anything-3\u002Fissues\u002F33",{"id":143,"question_zh":144,"answer_zh":145,"source_url":141},2029,"如何将预测的深度图正确归一化并显示？","需先计算最小\u002F最大深度值进行归一化到 [0, 1]，再缩放到 0-255 并转换为 uint8 类型。示例代码：`normalized_depth = (depth_data - min_depth) \u002F (max_depth - min_depth)`，然后使用 `Image.fromarray(image_data, 'L')` 显示。",[]]
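
The metric-depth conversion described in the FAQ (`metric_depth = focal * net_output / 300.`) can be sketched as follows. This is a minimal illustration, not part of the DA3 API: the helper name `to_metric_depth` is ours, and it assumes `net_output` is the raw `[H, W]` prediction from `DA3METRIC-LARGE` and `K` a 3x3 pixel-unit intrinsic matrix.

```python
import numpy as np

def to_metric_depth(net_output, K):
    """Convert a DA3METRIC-LARGE network output to depth in meters.

    Per the FAQ: metric_depth = focal * net_output / 300, where `focal`
    is the focal length in pixels, taken here as the average of fx and
    fy from the intrinsic matrix K.
    """
    focal = 0.5 * (K[0, 0] + K[1, 1])  # mean of fx and fy
    return focal * np.asarray(net_output, dtype=np.float32) / 300.0
```

Note that this step is only needed for `DA3METRIC-LARGE`; `DA3NESTED-GIANT-LARGE` already outputs depth in meters.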
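
The normalization and confidence-filtering tips above can be combined into one small helper. This is a sketch under stated assumptions: `depth` and `conf` are `[H, W]` arrays as returned by `model.inference`, the function name is ours, and the 0.5 confidence cutoff is purely illustrative (pick a threshold suited to your data).

```python
import numpy as np

def depth_to_grayscale(depth, conf=None, conf_threshold=0.5):
    """Min-max normalize an [H, W] depth map to a uint8 grayscale array.

    Pixels below the (illustrative) confidence threshold are zeroed out,
    which suppresses the meaningless values on unsupervised backgrounds.
    The result can be displayed with PIL via Image.fromarray(arr, "L").
    """
    depth = np.asarray(depth, dtype=np.float32).copy()
    if conf is not None:
        depth[np.asarray(conf) < conf_threshold] = np.nan  # mask low-confidence pixels
    valid = np.isfinite(depth)
    d_min, d_max = np.nanmin(depth), np.nanmax(depth)
    out = np.zeros_like(depth)
    out[valid] = (depth[valid] - d_min) / max(float(d_max - d_min), 1e-8)
    return (out * 255.0).astype(np.uint8)
```

Applied per view, e.g. `depth_to_grayscale(prediction.depth[0], prediction.conf[0])`, this yields an image where near pixels are dark, far pixels are bright, and low-confidence regions are black.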