[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-google-deepmind--tapnet":3,"tool-google-deepmind--tapnet":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",151314,2,"2026-04-11T23:32:58",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 
人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":77,"owner_url":78,"languages":79,"stars":92,"forks":93,"last_commit_at":94,"license":95,"difficulty_score":10,"env_os":96,"env_gpu":97,"env_ram":98,"env_deps":99,"category_tags":111,"github_topics":113,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":119,"updated_at":120,"faqs":121,"releases":154},6830,"google-deepmind\u002Ftapnet","tapnet","Tracking Any Point (TAP)","tapnet 是谷歌 DeepMind 推出的开源项目，专注于解决视频中的“任意点跟踪”难题。它的核心能力是在视频中精准锁定并持续追踪用户指定的任意像素点，即使该点在运动过程中发生快速移动、形变或被物体暂时遮挡，也能保持轨迹的连贯与准确。\n\n这一技术有效克服了传统跟踪算法在复杂场景下容易丢失目标或产生漂移的痛点，为视频分析提供了高鲁棒性的基础支持。tapnet 特别适合计算机视觉研究人员、AI 开发者以及从事机器人控制、视频特效制作的专业人士使用。无论是需要构建新的跟踪模型，还是开发依赖运动信息的下游应用（如机器人模仿学习），都能从中获益。\n\n该项目不仅包含了表现卓越的 TAPIR 算法和最新的 TAPNext 系列模型，还引入了独特的 BootsTAP 自举训练策略，利用大量无标签视频显著提升了泛化能力。其中，TAPNext++ 更是实现了长达 40 倍的稳定跟踪时长，具备强大的重检测机制。此外，tapnet 还提供了 TAP-Vid 等多个权威基准数据集与评估指标，旨在推动整个领域向更精确、更高效的方向发展，是探索视频时空理解不可或缺的工具库。","# Tracking Any Point (TAP)\n\n[[`TAP-Vid`](https:\u002F\u002Ftapvid.github.io\u002F)] [[`TAPIR`](https:\u002F\u002Fdeepmind-tapir.github.io\u002F)] [[`RoboTAP`](https:\u002F\u002Frobotap.github.io\u002F)] [[`Blog Post`](https:\u002F\u002Fdeepmind-tapir.github.io\u002Fblogpost.html)] [[`BootsTAP`](https:\u002F\u002Fbootstap.github.io\u002F)] [[`TAPVid-3D`](https:\u002F\u002Ftapvid3d.github.io\u002F)] [[`TAPNext`](https:\u002F\u002Ftap-next.github.io\u002F)] [[`TRAJAN`](https:\u002F\u002Ftrajan-paper.github.io)] [[`TAPNext++`](https:\u002F\u002Ftap-next-plus-plus.github.io\u002F)]\n\nhttps:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fassets\u002F4534987\u002F9f66b81a-7efb-48e7-a59c-f5781c35bebc\n\n\u003C!-- disableFinding(LINE_OVER_80) -->\nWelcome to the official Google Deepmind repository for Tracking Any Point (TAP), home of the TAP-Vid and TAPVid-3D Datasets, our top-performing TAPIR model, and our RoboTAP extension.\n\n- [TAP-Vid](https:\u002F\u002Ftapvid.github.io) is a benchmark for models that perform this task, with a collection of ground-truth points for both real and synthetic videos.\n- [TAPIR](https:\u002F\u002Fdeepmind-tapir.github.io) is a two-stage algorithm which employs two stages: 1) a matching stage, which independently locates a suitable candidate point match for the query point on 
every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model is fast and surpasses all prior methods by a significant margin on the TAP-Vid benchmark.\n- [RoboTAP](https:\u002F\u002Frobotap.github.io) is a system which utilizes TAPIR point tracks to execute robotics manipulation tasks through efficient imitation in the real world. It also includes a dataset with ground-truth points annotated on real robotics manipulation videos.\n- [BootsTAP](https:\u002F\u002Fbootstap.github.io) (or Bootstrapped Training for TAP) uses a large dataset of unlabeled, real-world video to improve tracking accuracy. Specifically, the model is trained to give consistent predictions across different spatial transformations and corruptions of the video, as well as different choices of the query points. We apply it to TAPIR to create BootsTAPIR, which is architecturally similar to TAPIR but substantially outperforms it on TAP-Vid.\n- [TAPVid-3D](https:\u002F\u002Ftapvid3d.github.io) is a benchmark and set of metrics for models that perform the 3D point tracking task. The benchmark contains 1M+ computed ground-truth trajectories on 4,000+ real-world videos.\n- [TAPNext](https:\u002F\u002Ftap-next.github.io) is our latest, most capable, fastest, yet simplest tracker. It formulates the TAP problem as next token prediction and tracks points simply by propagating information through a network. Note that our best TAPNext checkpoint was fine-tuned using the BootsTAP procedure.\n- [TRAJAN](https:\u002F\u002Ftrajan-paper.github.io) is our first point TRAJectory AutoeNcoder (TRAJAN). TRAJAN conditions on a set of support point trajectories and reconstructs the trajectories of a held out set of query points. The embedding space learned by TRAJAN can be used to compare distributions of videos, to compare motion trajectories in different videos independent of object appearances, and to evaluate the realism and consistency of videos output by generative video models.\n- [TAPNext++](https:\u002F\u002Ftap-next-plus-plus.github.io) is an improved TAPNext checkpoint that has a 40x longer stable tracking performance, allows tracking through occlusions and shows strong re-detection capabilities. It has been fine-tuned on 1024-frame synthetic sequences with a variety of novel training strategies.\n\nThis repository contains the following:\n\n- [TAPNext \u002F TAPNext++ \u002F TAPIR \u002F BootsTAPIR Demos](#demos) for both online **colab demo** and offline **real-time demo** by cloning this repo\n- [TRAJAN Demo](#demos) for online **colab demo**\n- [TAP-Vid Benchmark](#tap-vid) for both evaluation **dataset** and evaluation **metrics**\n- [RoboTAP Benchmark](#roboTAP) for both evaluation **dataset** and point track based clustering code\n- [TAPVid-3D Benchmark](#tapvid-3d) for the evaluation **metrics** and sample **evaluation code** for the TAPVid-3D benchmark.\n- [Checkpoints](#checkpoints) for TAP-Net (the baseline presented in the TAP-Vid paper), TAPIR and BootsTAPIR **pre-trained** model weights in both **Jax** and **PyTorch**\n- [Instructions](#training) for **training** TAP-Net (the baseline presented in the TAP-Vid paper) and TAPIR on Kubric\n\n## Demos\n\nThe simplest way to run TAPNext \u002F TAPNext++ \u002F TAPIR \u002F BootsTAPIR is to use our colab demos online. You can also\nclone this repo and run on your own hardware, including a real-time demo.\n\n### Colab Demo\n\nYou can run colab demos to see how TAPIR works. 
You can also upload your own video and try point tracking with TAPIR.\nWe provide a few colab demos:\n\n\u003C!-- disableFinding(LINE_OVER_80) -->\n\u003C!-- disableFinding(IMAGE_ALT_TEXT_INACCESSIBLE) -->\n1. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fcolabs\u002Ftorch_tapnextpp_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"TAPNext++\"\u002F>\u003C\u002Fa> **TAPNext++**: This is a fine-tuned BootsTAPNext checkpoint capable of long-term tracking, occlusion tracking and re-detection. It has been fine-tuned on PointOdyssey and Kubric-1024.\n\u003C!-- disableFinding(LINE_OVER_80) -->\n2. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fcolabs\u002Ftapnext_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"BootsTAPNext\"\u002F>\u003C\u002Fa> **BootsTAPNext**: This is the most powerful TAPNext model that runs online (per-frame). This is the BootsTAPNext model reported in the paper.\n\u003C!-- disableFinding(LINE_OVER_80) -->\n\u003C!-- disableFinding(IMAGE_ALT_TEXT_INACCESSIBLE) -->\n3. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftorch_tapnext_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"BootsTAPNext PyTorch\"\u002F>\u003C\u002Fa> **BootsTAPNext PyTorch**: This is the most powerful TAPNext model re-implemented in PyTorch, which contains the exact architecture & weights as the Jax model.\n\u003C!-- disableFinding(LINE_OVER_80) -->\n\u003C!-- disableFinding(IMAGE_ALT_TEXT_INACCESSIBLE) -->\n4. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftapir_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Offline TAPIR\"\u002F>\u003C\u002Fa> **Standard TAPIR**: This is the most powerful TAPIR \u002F BootsTAPIR model that runs on a whole video at once. We mainly report the results of this model in the paper.\n\u003C!-- disableFinding(LINE_OVER_80) -->\n\u003C!-- disableFinding(IMAGE_ALT_TEXT_INACCESSIBLE) -->\n5. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Fcausal_tapir_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Online TAPIR\"\u002F>\u003C\u002Fa> **Online TAPIR**: This is the sequential causal TAPIR \u002F BootsTAPIR model that allows for online tracking on points, which can be run in real-time on a GPU platform.\n\u003C!-- disableFinding(LINE_OVER_80) -->\n\u003C!-- disableFinding(IMAGE_ALT_TEXT_INACCESSIBLE) -->\n6. 
\u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftapir_rainbow_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"TAPIR Rainbow Visualization\"\u002F>\u003C\u002Fa> **Rainbow Visualization**: This visualization is used in many of our teaser videos: it does automatic foreground\u002Fbackground segmentation and corrects the tracks for the camera motion, so you can visualize the paths objects take through real space.\n\u003C!-- disableFinding(LINE_OVER_80) -->\n\u003C!-- disableFinding(IMAGE_ALT_TEXT_INACCESSIBLE) -->\n7. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftorch_tapir_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Offline PyTorch TAPIR\"\u002F>\u003C\u002Fa> **Standard PyTorch TAPIR**: This is the TAPIR \u002F BootsTAPIR model re-implemented in PyTorch, which contains the exact architecture & weights as the Jax model.\n\u003C!-- disableFinding(LINE_OVER_80) -->\n\u003C!-- disableFinding(IMAGE_ALT_TEXT_INACCESSIBLE) -->\n8. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftorch_causal_tapir_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Online PyTorch TAPIR\"\u002F>\u003C\u002Fa> **Online PyTorch TAPIR**: This is the sequential causal BootsTAPIR model re-implemented in PyTorch, which contains the exact architecture & weights as the Jax model.\n\u003C!-- disableFinding(LINE_OVER_80) -->\n\u003C!-- disableFinding(IMAGE_ALT_TEXT_INACCESSIBLE) -->\n9. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftrajan_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"TRAJAN\"\u002F>\u003C\u002Fa> **TRAJAN**: This is the point trajectory autoencoder for reconstructing the motion of held out point trajectories conditioned on a set of input point trajectories.\n10. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftapir_clustering.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Point Clustering\"\u002F>\u003C\u002Fa> **RoboTAP**: This is the demo of the segmentation algorithm used in RoboTAP.\n11. 
\u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Fkubric_for_tapvid(3d).ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Kubric\"\u002F>\u003C\u002Fa> **Kubric for TAPVid(3D)**: This visualization illustrates how to generate groundtruth point tracks in 2D and 3D with world coordinate frame using Kubric dataset.\n\n### Live Demo\n\nClone the repository:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Ftapnet.git\n```\n\nSwitch to the project directory:\n\n```bash\ncd tapnet\n```\n\nInstall the `tapnet` python package (and its requirements for running inference):\n\n```bash\npip install .\n```\n\nDownload the checkpoint\n\n```bash\nmkdir checkpoints\nwget -P checkpoints https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Fcausal_tapir_checkpoint.npy\n```\n\nAdd current path (parent directory of where TapNet is installed)\nto ```PYTHONPATH```:\n\n```bash\nexport PYTHONPATH=`(cd ..\u002F && pwd)`:`pwd`:$PYTHONPATH\n```\n\nIf you want to use CUDA, make sure you install the drivers and a version\nof JAX that's compatible with your CUDA and CUDNN versions.\nRefer to\n[the jax manual](https:\u002F\u002Fgithub.com\u002Fjax-ml\u002Fjax#installation)\nto install the correct JAX version with CUDA.\n\nYou can then run a pretrained causal TAPIR model on a live camera and select points to track:\n\n```bash\ncd ..\npython3 .\u002Ftapnet\u002Flive_demo.py \\\n```\n\nIn our tests, we achieved ~17 fps on 480x480 images on a quadro RTX 4000 (a 2018 mobile GPU).\n\n## Benchmarks\n\nThis repository hosts three separate but related benchmarks: TAP-Vid, its later extension RoboTAP, and TAPVid-3D.\n\n### TAP-Vid\n\nhttps:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fassets\u002F4534987\u002Fff5fa5e3-ed37-4480-ad39-42a1e2744d8b\n\n[TAP-Vid](https:\u002F\u002Ftapvid.github.io) is a dataset of videos along with point tracks, either manually annotated or obtained from a simulator. The aim is to evaluate tracking of any trackable point on any solid physical surface. Algorithms receive a single query point on some frame, and must produce the rest of the track, i.e., including where that point has moved to (if visible), and whether it is visible, on every other frame. This requires point-level precision (unlike prior work on box and segment tracking) potentially on deformable surfaces (unlike structure from motion) over the long term (unlike optical flow) on potentially any object (i.e. class-agnostic, unlike prior class-specific keypoint tracking on humans).\n\nMore details on downloading, using, and evaluating on the **TAP-Vid benchmark** can be found in the corresponding [README](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Ftapnet\u002Ftapvid).\n\n### RoboTAP\n\n[RoboTAP](https:\u002F\u002Frobotap.github.io\u002F) is a following work of TAP-Vid and TAPIR that demonstrates point tracking models are important for robotics.\n\nThe [RoboTAP dataset](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Frobotap\u002Frobotap.zip) follows the same annotation format as TAP-Vid, but is released as an addition to TAP-Vid. In terms of domain, RoboTAP dataset is mostly similar to TAP-Vid-RGB-Stacking, with a key difference that all robotics videos are real and manually annotated. Video sources and object categories are also more diversified. 
The benchmark dataset includes 265 videos and is intended for evaluation purposes only. More details can be found in the TAP-Vid [README](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Ftapnet\u002Ftapvid). We also provide a [RoboTAP Colab Notebook](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftapir_clustering.ipynb) demo of the segmentation algorithm used in the paper.\n\n### TAPVid-3D\n\nTAPVid-3D is a dataset and benchmark for evaluating the task of long-range Tracking Any Point in 3D (TAP-3D).\n\nThe benchmark features 4,000+ real-world videos, along with their metric 3D position point trajectories. The dataset contains three different video sources, and spans a variety of object types, motion patterns, and indoor and outdoor environments. This repository folder contains the code to download and generate these annotations and dataset samples to view. Be aware that it has a separate license from TAP-Vid.\n\nMore details on downloading, using, and evaluating on the **TAPVid-3D benchmark** can be found in the corresponding [README](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Ftree\u002Fmain\u002Ftapnet\u002Ftapvid3d).\n\n### A Note on Coordinates\n\nIn our storage datasets, (x, y) coordinates are typically in normalized raster coordinates: i.e., (0, 0) is the upper-left corner of the upper-left pixel, and\n(1, 1) is the lower-right corner of the lower-right pixel. Our code, however, immediately converts these to regular raster coordinates, matching the output of\nthe Kubric reader: (0, 0) is the upper-left corner of the upper-left pixel, while (h, w) is the lower-right corner of the lower-right pixel, where h is the\nimage height in pixels, and w is the respective width.\n\nWhen working with 2D coordinates, we typically store them in the order (x, y). However, we typically work with 3D coordinates in the order (t, y, x), where\ny and x are raster coordinates as above, but t is in frame coordinates, i.e. 0 refers to the first frame, and 0.5 refers to halfway between the first and\nsecond frames. Please take care with this: one pixel error can make a difference according to our metrics.\n\n## Checkpoints\n\n[\u003Cimg alt=\"Checkpoints\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-deepmind_tapnet_readme_40a262bd0d17.png\" width=\"200px\">](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Ftapnet)\n\n`tapnet\u002Fcheckpoint\u002F` must contain a file checkpoint.npy that's loadable using our NumpyFileCheckpointer. You can download checkpoints below or on [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Ftapnet), which should closely match the ones used in the paper.\nNote: evaluation results in the table are reported under 256x256 inference resolution, but higher resolution can benefit results. 
For BootsTAPIR, we typically find the best results at 512x512 resolution, and for TAPIR, even higher resolutions than 512x512 can be beneficial.\n\nmodel|checkpoint|config|backbone|training resolution|DAVIS First (AJ)|DAVIS Strided (AJ)|Kinetics First (AJ)|RoboTAP First (AJ)\n:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:\nTAP-Net|[Jax](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Fcheckpoint.npy)|[tapnet_config.py](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fconfigs\u002Ftapnet_config.py)|TSM-ResNet18|256x256|33.0%|38.4%|38.5%|45.1%\nTAPIR|[Jax](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Ftapir_checkpoint_panning.npy) & [PyTorch](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Ftapir_checkpoint_panning.pt)|[tapir_config.py](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fconfigs\u002Ftapir_config.py)|ResNet18|256x256|58.5%|63.3%|50.0%|59.6%\nOnline TAPIR|[Jax](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Fcausal_tapir_checkpoint.npy)|[causal_tapir_config.py](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fconfigs\u002Fcausal_tapir_config.py)|ResNet18|256x256|56.2%|58.3%|51.2%|59.1%\nBootsTAPIR|[Jax](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Fbootstap\u002Fbootstapir_checkpoint_v2.npy) & [PyTorch](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Fbootstap\u002Fbootstapir_checkpoint_v2.pt)|[tapir_bootstrap_config.py](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fconfigs\u002Ftapir_bootstrap_config.py)|ResNet18 + 4 Convs|256x256 + 512x512|62.4%|67.4%|55.8%|69.2%\nOnline BootsTAPIR|[Jax](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Fbootstap\u002Fcausal_bootstapir_checkpoint.npy) & [PyTorch](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Fbootstap\u002Fcausal_bootstapir_checkpoint.pt)|[tapir_bootstrap_config.py](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fconfigs\u002Ftapir_bootstrap_config.py)|ResNet18 + 4 Convs|256x256 + 512x512|59.7%|61.2%|55.1%|69.1%\nTAPNext|[Jax](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Ftapnext\u002Fbootstapnext_ckpt.npz)|[tapnext_demo.ipynb](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fcolabs\u002Ftapnext_demo.ipynb)|TrecViT-B|256x256|65.25%|68.9%|57.3%|64.1%\nTAPNext++|[PyTorch](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Ftapnextpp\u002Ftapnextpp_ckpt.pt)|[torch_tapnextpp_demo.ipynb](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fcolabs\u002Ftorch_tapnextpp_demo.ipynb)|TrecViT-B|256x256|65.6%|-|53.9%|61.1%\nTRAJAN|[Jax](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Ftrajan\u002Ftrack_autoencoder_ckpt.npz)|[trajan_demo.ipynb](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fcolabs\u002Ftrajan_demo.ipynb)\n\n## Training\n\nWe provide a Jax training and evaluation framework for TAP-Net and TAPIR in the training directory; see the training [README](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Ftree\u002Fmain\u002Ftapnet\u002Ftraining).\n\nOther researchers have developed a [PyTorch training implementation for TAPIR](https:\u002F\u002Fgithub.com\u002Friponazad\u002Fechotracker), which may be of interest; however, this work is **not** 
affiliated with Google DeepMind, and its accuracy has not been verified by us.\n\n## Citing this Work\n\nPlease use the following bibtex entries to cite our work:\n\n```\n@article{doersch2022tap,\n  title={{TAP}-Vid: A Benchmark for Tracking Any Point in a Video},\n  author={Doersch, Carl and Gupta, Ankush and Markeeva, Larisa and Recasens, Adria and Smaira, Lucas and Aytar, Yusuf and Carreira, Joao and Zisserman, Andrew and Yang, Yi},\n  journal={Advances in Neural Information Processing Systems},\n  volume={35},\n  pages={13610--13626},\n  year={2022}\n}\n```\n```\n@inproceedings{doersch2023tapir,\n  title={{TAPIR}: Tracking any point with per-frame initialization and temporal refinement},\n  author={Doersch, Carl and Yang, Yi and Vecerik, Mel and Gokay, Dilara and Gupta, Ankush and Aytar, Yusuf and Carreira, Joao and Zisserman, Andrew},\n  booktitle={Proceedings of the IEEE\u002FCVF International Conference on Computer Vision},\n  pages={10061--10072},\n  year={2023}\n}\n```\n```\n@article{vecerik2023robotap,\n  title={{RoboTAP}: Tracking arbitrary points for few-shot visual imitation},\n  author={Vecerik, Mel and Doersch, Carl and Yang, Yi and Davchev, Todor and Aytar, Yusuf and Zhou, Guangyao and Hadsell, Raia and Agapito, Lourdes and Scholz, Jon},\n  journal={International Conference on Robotics and Automation},\n  pages={5397--5403},\n  year={2024}\n}\n```\n```\n@article{doersch2024bootstap,\n  title={{BootsTAP}: Bootstrapped Training for Tracking-Any-Point},\n  author={Doersch, Carl and Luc, Pauline and Yang, Yi and Gokay, Dilara and Koppula, Skanda and Gupta, Ankush and Heyward, Joseph and Rocco, Ignacio and Goroshin, Ross and Carreira, Jo{\\~a}o and Zisserman, Andrew},\n  journal={Asian Conference on Computer Vision},\n  year={2024}\n}\n```\n```\n@article{koppula2024tapvid,\n  title={{TAPVid}-{3D}: A Benchmark for Tracking Any Point in {3D}},\n  author={Koppula, Skanda and Rocco, Ignacio and Yang, Yi and Heyward, Joe and Carreira, Jo{\\~a}o and Zisserman, Andrew and Brostow, Gabriel and Doersch, Carl},\n  journal={Advances in Neural Information Processing Systems},\n  year={2024}\n}\n```\n```\n@article{zholus2025tapnext,\n  title={TAPNext: Tracking Any Point (TAP) as Next Token Prediction},\n  author={Zholus, Artem and Doersch, Carl and Yang, Yi and Koppula, Skanda and Patraucean, Viorica and He, Xu Owen and Rocco, Ignacio and Sajjadi, Mehdi S. M. and Chandar, Sarath and Goroshin, Ross},\n  journal={arXiv preprint arXiv:2504.05579},\n  year={2025}\n}\n```\n```\n@article{allen2025trajan,\n  title={Direct Motion Models for Assessing Generated Videos},\n  author={Allen, Kelsey and Doersch, Carl and Zhou, Guangyao and Suhail, Mohammed and Driess, Danny and Rocco, Ignacio and Rubanova, Yulia and Kipf, Thomas and Sajjadi, Mehdi S. M. and Murphy, Kevin and Carreira, Joao and van Steenkiste, Sjoerd},\n  journal={arXiv preprint},\n  year={2025}\n}\n```\n\n## License and Disclaimer\n\nCopyright 2022-2024 Google LLC\n\nSoftware and other materials specific to the TAPVid-3D benchmark are covered by the license outlined in tapvid3d\u002FLICENSE file.\n\nAll other software in this repository is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at:\n\nhttps:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0\n\nAll other non-software materials released here for the TAP-Vid datasets, i.e. 
the TAP-Vid annotations, as well as the RGB-Stacking videos and RoboTAP videos, are released under a [Creative Commons BY license](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby\u002F4.0\u002F). You may obtain a copy of the CC-BY license at:\nhttps:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby\u002F4.0\u002Flegalcode .\n\nThe original source videos of DAVIS come from the val set, and are also licensed under creative commons licenses per their creators; see the [DAVIS dataset](https:\u002F\u002Fdavischallenge.org\u002Fdavis2017\u002Fcode.html) for details. Kinetics videos are publicly available on YouTube, but subject to their own individual licenses. See the [Kinetics dataset webpage](https:\u002F\u002Fwww.deepmind.com\u002Fopen-source\u002Fkinetics) for details.\n\nUnless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.\n\nThis is not an official Google product.\n","# 跟踪任意点（TAP）\n\n[[`TAP-Vid`](https:\u002F\u002Ftapvid.github.io\u002F)] [[`TAPIR`](https:\u002F\u002Fdeepmind-tapir.github.io\u002F)] [[`RoboTAP`](https:\u002F\u002Frobotap.github.io\u002F)] [[`博客文章`](https:\u002F\u002Fdeepmind-tapir.github.io\u002Fblogpost.html)] [[`BootsTAP`](https:\u002F\u002Fbootstap.github.io\u002F)] [[`TAPVid-3D`](https:\u002F\u002Ftapvid3d.github.io\u002F)] [[`TAPNext`](https:\u002F\u002Ftap-next.github.io\u002F)] [[`TRAJAN`](https:\u002F\u002Ftrajan-paper.github.io)] [[`TAPNext++`](https:\u002F\u002Ftap-next-plus-plus.github.io\u002F)]\n\nhttps:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fassets\u002F4534987\u002F9f66b81a-7efb-48e7-a59c-f5781c35bebc\n\n\u003C!-- disableFinding(LINE_OVER_80) -->\n欢迎来到 Google DeepMind 官方的“跟踪任意点”（TAP）仓库，这里汇集了 TAP-Vid 和 TAPVid-3D 数据集、我们性能领先的 TAPIR 模型以及 RoboTAP 扩展。\n\n- [TAP-Vid](https:\u002F\u002Ftapvid.github.io) 是用于评估此类任务模型的基准，包含针对真实视频和合成视频的真实标注点集合。\n- [TAPIR](https:\u002F\u002Fdeepmind-tapir.github.io) 是一种两阶段算法，第一阶段为匹配阶段，独立地在每一帧中为查询点找到合适的候选匹配点；第二阶段为精炼阶段，基于局部相关性更新轨迹和查询特征。该模型速度快，在 TAP-Vid 基准上显著超越了所有先前方法。\n- [RoboTAP](https:\u002F\u002Frobotap.github.io) 是一个利用 TAPIR 点轨迹在现实世界中高效执行机器人操作任务的系统。它还包含一个数据集，其中对真实的机器人操作视频进行了真实标注点的注释。\n- [BootsTAP](https:\u002F\u002Fbootstap.github.io)（即 TAP 的自举训练）使用大量未标注的真实世界视频来提升跟踪精度。具体而言，该模型被训练为在视频的不同空间变换和损坏情况下，以及不同查询点选择下，都能给出一致的预测结果。我们将这一方法应用于 TAPIR，从而创建了 BootsTAPIR，其架构与 TAPIR 类似，但在 TAP-Vid 上表现显著优于 TAPIR。\n- [TAPVid-3D](https:\u002F\u002Ftapvid3d.github.io) 是用于评估 3D 点跟踪任务模型的基准及指标集合。该基准包含 4,000 多个真实世界视频中的 100 万条以上计算得出的真实轨迹。\n- [TAPNext](https:\u002F\u002Ftap-next.github.io) 是我们最新、功能最强、速度最快且最简单的跟踪器。它将 TAP 问题建模为下一个标记预测，并通过在网络中传播信息来简单地跟踪点。值得注意的是，我们最好的 TAPNext 检查点是使用 BootsTAP 流程进行微调的。\n- [TRAJAN](https:\u002F\u002Ftrajan-paper.github.io) 是我们的首个点轨迹自动编码器（TRAJAN）。TRAJAN 以一组支持点轨迹为条件，重建一组待测查询点的轨迹。TRAJAN 学习到的嵌入空间可用于比较视频分布、在不依赖物体外观的情况下比较不同视频中的运动轨迹，以及评估生成式视频模型输出视频的真实性和一致性。\n- [TAPNext++](https:\u002F\u002Ftap-next-plus-plus.github.io) 是一个改进版的 TAPNext 检查点，其稳定跟踪性能提升了 40 倍，能够穿越遮挡并展现出强大的重新检测能力。它已在包含多种新颖训练策略的 1024 帧合成序列上进行了微调。\n\n本仓库包含以下内容：\n\n- [TAPNext \u002F TAPNext++ \u002F TAPIR \u002F BootsTAPIR 演示](#demos)，可通过克隆此仓库实现在线 **Colab 演示** 和离线 **实时演示**。\n- [TRAJAN 演示](#demos)，用于在线 **Colab 演示**。\n- [TAP-Vid 基准](#tap-vid)，包括评估用的 **数据集** 和 **指标**。\n- [RoboTAP 基准](#roboTAP)，包括评估用的 **数据集** 和基于点轨迹的聚类代码。\n- [TAPVid-3D 
基准](#tapvid-3d)，提供评估用的 **指标** 以及 TAPVid-3D 基准的示例 **评估代码**。\n- [检查点](#checkpoints)，包含 TAP-Net（TAP-Vid 论文中提出的基线）、TAPIR 和 BootsTAPIR 的 **预训练** 模型权重，格式分别为 **Jax** 和 **PyTorch**。\n- [训练说明](#training)，介绍如何在 Kubric 上 **训练** TAP-Net（TAP-Vid 论文中提出的基线）和 TAPIR。\n\n## 演示\n\n运行 TAPNext \u002F TAPNext++ \u002F TAPIR \u002F BootsTAPIR 最简单的方式是使用我们的在线 Colab 演示。您也可以克隆本仓库并在自己的硬件上运行，包括实时演示。\n\n### Colab 示例\n\n您可以通过运行 Colab 示例来了解 TAPIR 的工作原理。您还可以上传自己的视频，并尝试使用 TAPIR 进行点跟踪。\n\n我们提供了几个 Colab 示例：\n\n\u003C!-- disableFinding(LINE_OVER_80) -->\n\u003C!-- disableFinding(IMAGE_ALT_TEXT_INACCESSIBLE) -->\n1. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fcolabs\u002Ftorch_tapnextpp_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"TAPNext++\"\u002F>\u003C\u002Fa> **TAPNext++**：这是一个经过微调的 BootsTAPNext 检查点，能够进行长期跟踪、遮挡跟踪和重新检测。它已在 PointOdyssey 和 Kubric-1024 数据集上进行了微调。\n\u003C!-- disableFinding(LINE_OVER_80) -->\n2. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fcolabs\u002Ftapnext_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"BootsTAPNext\"\u002F>\u003C\u002Fa> **BootsTAPNext**：这是功能最强大的 TAPNext 模型，可在每帧上在线运行。该模型即论文中所报道的 BootsTAPNext 模型。\n\u003C!-- disableFinding(LINE_OVER_80) -->\n\u003C!-- disableFinding(IMAGE_ALT_TEXT_INACCESSIBLE) -->\n3. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftorch_tapnext_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"BootsTAPNext PyTorch\"\u002F>\u003C\u002Fa> **BootsTAPNext PyTorch**：这是用 PyTorch 重新实现的功能最强大的 TAPNext 模型，其架构和权重与 Jax 版本完全一致。\n\u003C!-- disableFinding(LINE_OVER_80) -->\n\u003C!-- disableFinding(IMAGE_ALT_TEXT_INACCESSIBLE) -->\n4. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftapir_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Offline TAPIR\"\u002F>\u003C\u002Fa> **标准 TAPIR**：这是功能最强大的 TAPIR \u002F BootsTAPIR 模型，可一次性处理整段视频。我们在论文中主要报告了该模型的结果。\n\u003C!-- disableFinding(LINE_OVER_80) -->\n\u003C!-- disableFinding(IMAGE_ALT_TEXT_INACCESSIBLE) -->\n5. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Fcausal_tapir_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Online TAPIR\"\u002F>\u003C\u002Fa> **在线 TAPIR**：这是顺序因果的 TAPIR \u002F BootsTAPIR 模型，支持对点的在线跟踪，可在 GPU 平台上实时运行。\n\u003C!-- disableFinding(LINE_OVER_80) -->\n\u003C!-- disableFinding(IMAGE_ALT_TEXT_INACCESSIBLE) -->\n6. 
\u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftapir_rainbow_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"TAPIR 彩虹可视化\"\u002F>\u003C\u002Fa> **彩虹可视化**：这种可视化效果被广泛应用于我们的预告视频中：它能够自动进行前景\u002F背景分割，并校正因相机运动导致的轨迹偏移，从而直观展示物体在真实空间中的运动路径。\n\u003C!-- disableFinding(LINE_OVER_80) -->\n\u003C!-- disableFinding(IMAGE_ALT_TEXT_INACCESSIBLE) -->\n7. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftorch_tapir_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Offline PyTorch TAPIR\"\u002F>\u003C\u002Fa> **标准 PyTorch TAPIR**：这是用 PyTorch 重新实现的 TAPIR \u002F BootsTAPIR 模型，其架构和权重与 Jax 版本完全一致。\n\u003C!-- disableFinding(LINE_OVER_80) -->\n\u003C!-- disableFinding(IMAGE_ALT_TEXT_INACCESSIBLE) -->\n8. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftorch_causal_tapir_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Online PyTorch TAPIR\"\u002F>\u003C\u002Fa> **在线 PyTorch TAPIR**：这是用 PyTorch 重新实现的顺序因果 BootsTAPIR 模型，其架构和权重与 Jax 版本完全一致。\n\u003C!-- disableFinding(LINE_OVER_80) -->\n\u003C!-- disableFinding(IMAGE_ALT_TEXT_INACCESSIBLE) -->\n9. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftrajan_demo.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"TRAJAN\"\u002F>\u003C\u002Fa> **TRAJAN**：这是一种点轨迹自编码器，可根据一组输入点轨迹重建未见点轨迹的运动。\n10. \u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftapir_clustering.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"点聚类\"\u002F>\u003C\u002Fa> **RoboTAP**：这是 RoboTAP 中使用的分割算法演示。\n11. 
\u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Fkubric_for_tapvid(3d).ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Kubric\"\u002F>\u003C\u002Fa> **Kubric for TAPVid(3D)**：此可视化展示了如何使用 Kubric 数据集生成具有世界坐标系的 2D 和 3D 真实点轨迹。\n\n### 实时演示\n\n克隆仓库：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Ftapnet.git\n```\n\n切换到项目目录：\n\n```bash\ncd tapnet\n```\n\n安装 `tapnet` Python 包（以及运行推理所需的依赖）：\n\n```bash\npip install .\n```\n\n下载检查点：\n\n```bash\nmkdir checkpoints\nwget -P checkpoints https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Fcausal_tapir_checkpoint.npy\n```\n\n将当前路径（即 TapNet 安装目录的父目录）添加到 ```PYTHONPATH```：\n\n```bash\nexport PYTHONPATH=`(cd ..\u002F && pwd)`:`pwd`:$PYTHONPATH\n```\n\n如果您希望使用 CUDA，请确保已安装相应的驱动程序，并安装与您的 CUDA 和 CUDNN 版本兼容的 JAX 版本。请参考 [JAX 官方文档](https:\u002F\u002Fgithub.com\u002Fjax-ml\u002Fjax#installation)，以正确安装支持 CUDA 的 JAX 版本。\n\n随后，您可以在实时摄像头画面中运行预训练的因果 TAPIR 模型，并选择要跟踪的点：\n\n```bash\ncd ..\npython3 .\u002Ftapnet\u002Flive_demo.py \\\n```\n\n在我们的测试中，使用 Quadro RTX 4000（一款 2018 年发布的移动 GPU）处理 480x480 分辨率的图像时，帧率可达约 17 fps。\n\n## 基准测试\n\n该仓库包含三个相互关联但独立的基准测试：TAP-Vid、其后续扩展 RoboTAP 以及 TAPVid-3D。\n\n### TAP-Vid\n\nhttps:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fassets\u002F4534987\u002Fff5fa5e3-ed37-4480-ad39-42a1e2744d8b\n\n[TAP-Vid](https:\u002F\u002Ftapvid.github.io) 是一个包含视频及其点轨迹的数据集，这些轨迹可以是人工标注的，也可以是从模拟器中获取的。其目标是评估在任何固体物理表面上对可追踪点的跟踪性能。算法会接收某一帧上的单个查询点，并需要生成该点在整个视频序列中的完整轨迹，即包括该点移动到了哪里（如果可见）以及在其他每一帧中是否可见。这要求达到点级别的精度（不同于以往基于边界框和分割的跟踪工作），并且能够在可能的变形表面上进行长期跟踪（不同于结构光流法），同时适用于任何物体（即不依赖于特定类别，不同于以往针对人类等特定类别的关键点跟踪方法）。\n\n关于如何下载、使用及在 **TAP-Vid 基准测试** 上进行评估的更多详细信息，请参阅相应的 [README](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Ftapnet\u002Ftapvid)。\n\n### RoboTAP\n\n[RoboTAP](https:\u002F\u002Frobotap.github.io\u002F) 是 TAP-Vid 和 TAPIR 的后续工作，旨在展示点跟踪模型在机器人技术中的重要性。\n\n[RoboTAP 数据集](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Frobotap\u002Frobotap.zip) 采用与 TAP-Vid 相同的标注格式，但作为 TAP-Vid 的补充发布。从领域上看，RoboTAP 数据集与 TAP-Vid-RGB-Stacking 大致相似，主要区别在于所有机器人相关视频均为真实采集并由人工标注。此外，视频来源和物体类别也更加多样化。该基准数据集包含 265 个视频，仅用于评估目的。更多详情请参阅 TAP-Vid 的 [README](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Ftapnet\u002Ftapvid)。我们还提供了一个 [RoboTAP Colab Notebook](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftapir_clustering.ipynb)，演示了论文中使用的分割算法。\n\n### TAPVid-3D\n\nTAPVid-3D 是一个用于评估三维长距离任意点跟踪任务（TAP-3D）的数据集和基准测试。\n\n该基准测试包含 4,000 多段真实世界视频，以及它们的度量级 3D 位置点轨迹。数据集涵盖了三种不同的视频来源，涉及多种物体类型、运动模式以及室内外环境。此仓库文件夹包含用于下载和生成这些标注及数据集样本以供查看的代码。请注意，它与 TAP-Vid 使用的是不同的许可证。\n\n关于如何下载、使用及在 **TAPVid-3D 基准测试** 上进行评估的更多详细信息，请参阅相应的 [README](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Ftree\u002Fmain\u002Ftapnet\u002Ftapvid3d)。\n\n### 关于坐标的一点说明\n\n在我们的存储数据集中，(x, y) 坐标通常采用归一化的栅格坐标：即 (0, 0) 表示左上角像素的左上角，而 (1, 1) 表示右下角像素的右下角。然而，我们的代码会立即将这些坐标转换为常规的栅格坐标，与 Kubric 读取器的输出一致：(0, 0) 表示左上角像素的左上角，而 (h, w) 则表示右下角像素的右下角，其中 h 是图像的高度（以像素为单位），w 是对应的宽度。\n\n在处理 2D 坐标时，我们通常按 (x, y) 的顺序存储；而在处理 3D 坐标时，则通常按 (t, y, x) 的顺序存储，其中 y 和 x 仍为上述的栅格坐标，而 t 则是帧坐标，即 0 表示第一帧，0.5 表示第一帧和第二帧之间的时间点。请务必注意这一点：根据我们的指标，哪怕只差一个像素，结果也会有所不同。\n\n## 检查点\n\n[\u003Cimg alt=\"Checkpoints\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-deepmind_tapnet_readme_40a262bd0d17.png\" 
width=\"200px\">](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Ftapnet)\n\n`tapnet\u002Fcheckpoint\u002F` 目录下必须包含一个名为 checkpoint.npy 的文件，该文件可使用我们的 NumpyFileCheckpointer 加载。您可以在下方或在 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Ftapnet) 上下载检查点，这些检查点应与论文中使用的检查点高度一致。\n\n注：表格中的评估结果是在 256×256 推理分辨率下报告的，但更高的分辨率可能会带来更好的效果。对于 BootsTAPIR，我们通常发现 512×512 分辨率下的效果最佳；而对于 TAPIR，则甚至可以使用高于 512×512 的分辨率来进一步提升性能。\n\n模型|检查点|配置|骨干网络|训练分辨率|DAVIS First (AJ)|DAVIS Strided (AJ)|Kinetics First (AJ)|RoboTAP First (AJ)\n:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:\nTAP-Net|[Jax](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Fcheckpoint.npy)|[tapnet_config.py](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fconfigs\u002Ftapnet_config.py)|TSM-ResNet18|256×256|33.0%|38.4%|38.5%|45.1%\nTAPIR|[Jax](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Ftapir_checkpoint_panning.npy) & [PyTorch](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Ftapir_checkpoint_panning.pt)|[tapir_config.py](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fconfigs\u002Ftapir_config.py)|ResNet18|256×256|58.5%|63.3%|50.0%|59.6%\n在线 TAPIR|[Jax](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Fcausal_tapir_checkpoint.npy)|[causal_tapir_config.py](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fconfigs\u002Fcausal_tapir_config.py)|ResNet18|256×256|56.2%|58.3%|51.2%|59.1%\nBootsTAPIR|[Jax](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Fbootstap\u002Fbootstapir_checkpoint_v2.npy) & [PyTorch](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Fbootstap\u002Fbootstapir_checkpoint_v2.pt)|[tapir_bootstrap_config.py](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fconfigs\u002Ftapir_bootstrap_config.py)|ResNet18 + 4 层卷积|256×256 + 512×512|62.4%|67.4%|55.8%|69.2%\n在线 BootsTAPIR|[Jax](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Fbootstap\u002Fcausal_bootstapir_checkpoint.npy) & [PyTorch](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Fbootstap\u002Fcausal_bootstapir_checkpoint.pt)|[tapir_bootstrap_config.py](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fconfigs\u002Ftapir_bootstrap_config.py)|ResNet18 + 4 层卷积|256×256 + 512×512|59.7%|61.2%|55.1%|69.1%\nTAPNext|[Jax](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Ftapnext\u002Fbootstapnext_ckpt.npz)|[tapnext_demo.ipynb](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fcolabs\u002Ftapnext_demo.ipynb)|TrecViT-B|256×256|65.25%|68.9%|57.3%|64.1%\nTAPNext++|[PyTorch](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Ftapnextpp\u002Ftapnextpp_ckpt.pt)|[torch_tapnextpp_demo.ipynb](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fcolabs\u002Ftorch_tapnextpp_demo.ipynb)|TrecViT-B|256×256|65.6%|-|53.9%|61.1%\nTRAJAN|[Jax](https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Ftrajan\u002Ftrack_autoencoder_ckpt.npz)|[trajan_demo.ipynb](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fcolabs\u002Ftrajan_demo.ipynb)\n\n## 训练\n\n我们在训练目录中为 TAP-Net 和 TAPIR 提供了一个 Jax 训练与评估框架；详情请参阅训练 [README](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Ftree\u002Fmain\u002Ftapnet\u002Ftraining)。\n\n其他研究者开发了 [TAPIR 的 PyTorch 
训练实现](https:\u002F\u002Fgithub.com\u002Friponazad\u002Fechotracker)，可能也会引起您的兴趣；然而，这项工作 **并非** 由 Google DeepMind 开发，其准确性也尚未经过我们的验证。\n\n## 引用本工作\n\n请使用以下 BibTeX 条目来引用我们的工作：\n\n```\n@article{doersch2022tap,\n  title={{TAP}-Vid: A Benchmark for Tracking Any Point in a Video},\n  author={Doersch, Carl and Gupta, Ankush and Markeeva, Larisa and Recasens, Adria and Smaira, Lucas and Aytar, Yusuf and Carreira, Joao and Zisserman, Andrew and Yang, Yi},\n  journal={Advances in Neural Information Processing Systems},\n  volume={35},\n  pages={13610--13626},\n  year={2022}\n}\n```\n```\n@inproceedings{doersch2023tapir,\n  title={{TAPIR}: Tracking any point with per-frame initialization and temporal refinement},\n  author={Doersch, Carl and Yang, Yi and Vecerik, Mel and Gokay, Dilara and Gupta, Ankush and Aytar, Yusuf and Carreira, Joao and Zisserman, Andrew},\n  booktitle={Proceedings of the IEEE\u002FCVF International Conference on Computer Vision},\n  pages={10061--10072},\n  year={2023}\n}\n```\n```\n@article{vecerik2023robotap,\n  title={{RoboTAP}: Tracking arbitrary points for few-shot visual imitation},\n  author={Vecerik, Mel and Doersch, Carl and Yang, Yi and Davchev, Todor and Aytar, Yusuf and Zhou, Guangyao and Hadsell, Raia and Agapito, Lourdes and Scholz, Jon},\n  journal={International Conference on Robotics and Automation},\n  pages={5397--5403},\n  year={2024}\n}\n```\n```\n@article{doersch2024bootstap,\n  title={{BootsTAP}: Bootstrapped Training for Tracking-Any-Point},\n  author={Doersch, Carl and Luc, Pauline and Yang, Yi and Gokay, Dilara and Koppula, Skanda and Gupta, Ankush and Heyward, Joseph and Rocco, Ignacio and Goroshin, Ross and Carreira, Jo{\\~a}o and Zisserman, Andrew},\n  journal={Asian Conference on Computer Vision},\n  year={2024}\n}\n```\n```\n@article{koppula2024tapvid,\n  title={{TAPVid}-{3D}: A Benchmark for Tracking Any Point in {3D}},\n  author={Koppula, Skanda and Rocco, Ignacio and Yang, Yi and Heyward, Joe and Carreira, Jo{\\~a}o and Zisserman, Andrew and Brostow, Gabriel and Doersch, Carl},\n  journal={Advances in Neural Information Processing Systems},\n  year={2024}\n}\n```\n```\n@article{zholus2025tapnext,\n  title={TAPNext: Tracking Any Point (TAP) as Next Token Prediction},\n  author={Zholus, Artem and Doersch, Carl and Yang, Yi and Koppula, Skanda and Patraucean, Viorica and He, Xu Owen and Rocco, Ignacio and Sajjadi, Mehdi S. M. and Chandar, Sarath and Goroshin, Ross},\n  journal={arXiv preprint arXiv:2504.05579},\n  year={2025}\n}\n```\n```\n@article{allen2025trajan,\n  title={Direct Motion Models for Assessing Generated Videos},\n  author={Allen, Kelsey and Doersch, Carl and Zhou, Guangyao and Suhail, Mohammed and Driess, Danny and Rocco, Ignacio and Rubanova, Yulia and Kipf, Thomas and Sajjadi, Mehdi S. M. 
and Murphy, Kevin and Carreira, Joao and van Steenkiste, Sjoerd},\n  journal={arXiv preprint},\n  year={2025}\n}\n```\n\n## 许可与免责声明\n\n版权所有 © 2022–2024 Google LLC\n\n专属于 TAPVid-3D 基准的软件及其他材料受 tapvid3d\u002FLICENSE 文件中所列许可协议的约束。\n\n本仓库中的所有其他软件均采用 Apache License, Version 2.0（Apache 2.0）许可；除非符合 Apache 2.0 许可协议的规定，否则不得使用。您可以在以下网址获取 Apache 2.0 许可协议的副本：\n\nhttps:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0\n\n此处发布的所有非软件材料，包括 TAP-Vid 数据集的标注、RGB-Stacking 视频以及 RoboTAP 视频，均采用 [Creative Commons BY 许可](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby\u002F4.0\u002F)。您可以在以下网址获取 CC-BY 许可协议的法律文本：\n\nhttps:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby\u002F4.0\u002Flegalcode\n\nDAVIS 的原始视频来自验证集，同样根据其创作者的授权采用知识共享许可；详情请参阅 [DAVIS 数据集](https:\u002F\u002Fdavischallenge.org\u002Fdavis2017\u002Fcode.html)。Kinetics 视频在 YouTube 上公开可用，但受各自独立许可的约束。详情请参阅 [Kinetics 数据集网页](https:\u002F\u002Fwww.deepmind.com\u002Fopen-source\u002Fkinetics)。\n\n除非适用法律另有规定或双方另有书面约定，否则在此处依据 Apache 2.0 或 CC-BY 许可协议分发的所有软件和材料均按“现状”提供，不附带任何明示或默示的保证或条件。具体的权利与限制以相应许可协议中的条款为准。\n\n本项目并非 Google 官方产品。","# TAPNet 快速上手指南\n\nTAPNet 是 Google DeepMind 开源的“任意点跟踪”（Tracking Any Point）工具库，包含业界领先的 TAPIR、TAPNext 等模型，用于在视频中高精度跟踪任意物理表面上的点。\n\n## 环境准备\n\n*   **操作系统**: Linux (推荐) 或 macOS\n*   **Python**: 3.8+\n*   **深度学习框架**: 支持 **JAX** 或 **PyTorch**。\n    *   若使用 GPU 加速，请确保已安装对应的 CUDA 驱动和 cuDNN。\n    *   JAX 用户需根据 CUDA 版本安装对应的 `jax[cuda]` 版本。\n*   **硬件要求**:\n    *   离线处理：任意现代 GPU。\n    *   实时演示：建议具备中等算力 GPU（测试显示 Quadro RTX 4000 可达 ~17 FPS）。\n\n## 安装步骤\n\n### 1. 克隆仓库\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet.git\ncd tapnet\n```\n\n### 2. 安装依赖包\n直接通过 pip 安装当前目录下的包及其推理依赖：\n```bash\npip install .\n```\n> **提示**：国内用户若下载缓慢，可添加清华或阿里镜像源：\n> `pip install . -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`\n\n### 3. 配置环境变量\n将项目根目录添加到 `PYTHONPATH`，以便正确导入模块：\n```bash\nexport PYTHONPATH=$(cd ..\u002F && pwd):$(pwd):$PYTHONPATH\n```\n\n### 4. 
下载预训练模型\n创建检查点目录并下载因果型 TAPIR (Causal TAPIR) 模型权重（适用于实时跟踪）：\n```bash\nmkdir checkpoints\nwget -P checkpoints https:\u002F\u002Fstorage.googleapis.com\u002Fdm-tapnet\u002Fcausal_tapir_checkpoint.npy\n```\n> **注意**：若需其他模型（如 TAPNext++ 或 BootsTAPIR），请访问官方 Colab 示例或 HuggingFace 获取对应 `.npy` 权重文件。\n\n## 基本使用\n\n### 方式一：运行实时摄像头演示 (Live Demo)\n安装完成后，可直接调用预训练模型连接摄像头进行实时点跟踪。运行后在画面中点击即可选择跟踪点。\n\n```bash\npython3 .\u002Ftapnet\u002Flive_demo.py\n```\n\n### 方式二：使用 Colab 在线体验 (推荐新手)\n若本地环境配置困难，可直接使用 Google Colab 在线运行各种模型（包括 TAPNext++, BootsTAPIR, TRAJAN 等）：\n\n*   **TAPNext++** (长时跟踪、遮挡恢复): [打开 Colab](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmain\u002Fcolabs\u002Ftorch_tapnextpp_demo.ipynb)\n*   **Standard TAPIR** (离线全视频处理，精度最高): [打开 Colab](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Ftapir_demo.ipynb)\n*   **Online TAPIR** (实时逐帧处理): [打开 Colab](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fdeepmind\u002Ftapnet\u002Fblob\u002Fmaster\u002Fcolabs\u002Fcausal_tapir_demo.ipynb)\n*   **PyTorch 版本**: 仓库提供了与 JAX 架构完全一致的 PyTorch 实现，可在上述链接中选择带有 \"PyTorch\" 字样的 Notebook。\n\n### 方式三：代码调用示例 (Python)\n在 Python 脚本中加载模型并进行推理（以 TAPIR 为例）：\n\n```python\nimport jax.numpy as jnp\nfrom tapnet.models import tapir_model\n\n# 初始化模型\nmodel = tapir_model.TAPIR()\ncheckpoint_path = 'checkpoints\u002Fcausal_tapir_checkpoint.npy'\nstate = model.load_checkpoint(checkpoint_path)\n\n# 准备输入数据 (video: [B, T, H, W, 3], query_points: [B, N, 3])\n# query_points 格式为 [t, y, x]\nvideo = ... \nquery_points = ...\n\n# 执行跟踪\npredictions = model.predict(\n    video=video,\n    query_points=query_points,\n    state=state,\n    is_causal=True # 若使用因果模型进行实时\u002F在线跟踪设为 True\n)\n\n# predictions 包含坐标轨迹、可见性概率及置信度\n```","某自动驾驶研发团队正在处理一段复杂的城市路口监控视频，需要精确分析行人穿越马路时的运动轨迹以优化感知算法。\n\n### 没有 tapnet 时\n- **遮挡即丢失**：当行人被路边车辆短暂遮挡后，传统追踪算法往往直接丢失目标，无法在行人重新出现时恢复追踪，导致轨迹断裂。\n- **人工标注成本极高**：为了获取完整的真值数据用于模型训练，工程师不得不逐帧手动校正坐标，处理一分钟高清视频需耗费数小时。\n- **长时序稳定性差**：在长达数百帧的视频序列中，累积误差会导致追踪点逐渐漂移，最终偏离行人的实际身体部位，失去分析意义。\n- **动态场景适应弱**：面对摄像头抖动或背景中相似纹理的干扰，旧方法容易产生误匹配，将背景噪点错误识别为追踪目标。\n\n### 使用 tapnet 后\n- **强力重检测能力**：借助 TAPNext++ 的特性，即使行人被完全遮挡数十帧，tapnet 也能在其重现瞬间精准“找回”并延续轨迹，确保数据完整。\n- **自动化高效产出**：利用预训练的 BootsTAPIR 模型，团队可一键生成全视频的高密度像素级追踪点，将原本数小时的工作压缩至分钟级。\n- **超长序列稳定追踪**：tapnet 基于下一代令牌预测架构，能在 1024 帧以上的长视频中保持极低漂移率，真实还原行人每一步的细微动作。\n- **抗干扰鲁棒性强**：通过局部相关性细化机制，tapnet 能有效过滤背景杂波和相机抖动影响，牢牢锁定目标特征，即使在复杂光照下也表现稳定。\n\ntapnet 通过解决遮挡丢失和长时序漂移难题，将高难度的视频点追踪任务转化为高效自动化的标准流程，极大加速了视觉算法的迭代闭环。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-deepmind_tapnet_786b818a.png","google-deepmind","Google DeepMind","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fgoogle-deepmind_06b1dd17.png","",null,"https:\u002F\u002Fwww.deepmind.com\u002F","https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind",[80,84,88],{"name":81,"color":82,"percentage":83},"Jupyter Notebook","#DA5B0B",75.3,{"name":85,"color":86,"percentage":87},"Python","#3572A5",24.5,{"name":89,"color":90,"percentage":91},"Shell","#89e051",0.1,1837,176,"2026-04-10T04:53:17","Apache-2.0","Linux, macOS","若需运行实时演示或加速推理，需要支持 CUDA 的 NVIDIA GPU（测试环境为 Quadro RTX 4000）；CPU 亦可运行但速度较慢。需安装与系统 CUDA\u002FcuDNN 版本兼容的 JAX。","未说明",{"notes":100,"python":101,"dependencies":102},"该工具主要基于 Google DeepMind 的 JAX 框架开发，同时也提供了 PyTorch 复现版本。若使用 GPU 加速，必须手动安装与本地 CUDA 驱动匹配的 JAX 版本（README 未指定具体 CUDA 版本号，需参考 JAX 官方文档）。实时演示在较旧的移动 GPU (Quadro RTX 4000) 上可达约 17 
FPS。使用前需下载预训练检查点文件。","3.8+",[103,104,105,106,107,108,109,110],"jax","flax","optax","torch (可选，用于 PyTorch 版本模型)","numpy","opencv-python","dm-haiku","chex",[15,14,112],"其他",[114,115,116,117,118],"benchmark","point-tracking","robotics","computer-vision","deep-learning","2026-03-27T02:49:30.150509","2026-04-12T16:22:24.122205",[122,127,131,136,141,146,150],{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},30797,"TAPNext 在线跟踪在运行多帧后出现偏差，且 query_points 未更新，这是预期的行为吗？","是的，tracking_state.query_points 字段设计为固定大小的张量，不需要每次运行时手动更新。在线点跟踪的工作机制是：模型拥有所有查询点（可能在任意帧），但仅在视频对应的帧将它们提供给模型。例如，如果你有 5 个查询点，其中 3 个在第 0 帧，2 个分别在第 5 和第 10 帧，模型在前 5 帧不会使用后两个查询点，直到到达对应帧才会激活。如果跟踪误差在约 150 帧后出现，这是 TAPNext 的已知限制，部分解决方案是减少跟踪点的数量以延长有效跟踪时间。","https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fissues\u002F143",{"id":128,"question_zh":129,"answer_zh":130,"source_url":126},30798,"如何将 TAPNext 模型导出为 ONNX 格式？遇到字典类型输入不支持的错误怎么办？","目前直接将包含字典（如 tracking_state）的模型导出为 ONNX 会遇到 `RuntimeError: Only tuples, lists and Variables are supported...` 错误，因为 ONNX JIT 不推荐直接使用字典作为输入\u002F输出。虽然官方尚未提供完美的原生支持方案，但社区建议尝试将状态字典展平为元组或列表传入，或者在导出前后手动处理序列化逻辑。此外，确保在导出前调用 `model.eval()` 以启用推理模式（使用线性扫描加速）。",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},30799,"下载 Project Aria 数字孪生数据时，序列名称与 TapVid 3D 注释不匹配导致报错，如何解决？","Project Aria 的新序列名称增加了后缀（如 `M1292`），导致与原有的注释文件名不一致。解决方法是预处理 ARIA 数据，将其序列名称还原为注释使用的命名规范。此外，维护者已上传了修复后的 rc3 npz 文件，并更新了代码以使用均值计算轨迹和可见性来解决舍入误差问题。如果遇到特定序列（如 `Apartment_release_work_skeleton_seq136`）的深度\u002F分割文件损坏，需要重新生成该序列的 npz 文件。","https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fissues\u002F108",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},30800,"是否有官方提供的 TAPIR PyTorch 版本训练脚本？","官方目前暂无发布 PyTorch 训练代码的计划。但是，社区用户 @riponazad 提供了一个非官方的 PyTorch 实现仓库，官方已在 README 的训练部分链接了该仓库作为参考。用户可以查阅该第三方实现进行训练。","https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fissues\u002F90",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},30801,"运行 live_demo.py 时遇到 'All hk.Modules must be initialized inside an hk.transform' 错误，如何修复？","该错误已在最新的代码推送中修复。请注意，此次更新也将废弃的 `jax.tree_map` 替换为新引入的 `jax.tree.map`，因此现在运行代码需要非常新版本的 JAX。如果希望兼容旧版 JAX，可以安全地在代码库中执行查找\u002F替换操作，将 `jax.tree.map` 改回 `jax.tree_map`，因为项目未使用其他新的 JAX 特性。","https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Ftapnet\u002Fissues\u002F80",{"id":147,"question_zh":148,"answer_zh":149,"source_url":145},30802,"TAPIR 在测试视频中跟踪效果不佳，物体方向不同或纹理重复时表现不好，原因是什么？","这是 TAPIR 的已知弱点。由于训练数据 Kubric 中缺乏平面内旋转（in-plane rotation）样本，模型对此类变换的不变性较差。此外，如果在不同物体间复用纹理，模型可能会产生错误的匹配（虚假匹配）。在真实视频中，随机纹理（如木纹）通常不会完全重复。建议在可视化时不要绘制被遮挡的点，因为模型可能错误地标记它们为可见。未来的 BootsTAP 模型有望改善这些问题。",{"id":151,"question_zh":152,"answer_zh":153,"source_url":126},30803,"在 AOT 编译模型中使用 torch.nn.functional.interpolate 进行图像缩放时报错，有什么替代方案？","AOT 编译模型不支持直接作为输入的 `torch.nn.functional.interpolate` 操作。可靠的替代方案是使用 OpenCV 进行缩放。具体代码示例如下：先将 Tensor 转为 numpy 数组，使用 `cv2.resize` 进行缩放（插值模式设为 `cv2.INTER_AREA`），然后再转回 Tensor 并调整维度顺序移回 GPU。代码片段：`resized_img = torch.from_numpy(cv2.resize(padded_img.permute(1, 2, 0).cpu().numpy(), resized_shape, interpolation=cv2.INTER_AREA)).permute(2, 0, 1).cuda()`。",[]]