[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-z-x-yang--Segment-and-Track-Anything":3,"tool-z-x-yang--Segment-and-Track-Anything":64},[4,17,26,40,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,2,"2026-04-03T11:11:01",[13,14,15],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":23,"last_commit_at":32,"category_tags":33,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,34,35,36,15,37,38,13,39],"数据工具","视频","插件","其他","语言模型","音频",{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":10,"last_commit_at":46,"category_tags":47,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,38,37],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":10,"last_commit_at":54,"category_tags":55,"status":16},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 
等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74913,"2026-04-05T10:44:17",[38,14,13,37],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":23,"last_commit_at":62,"category_tags":63,"status":16},2471,"tesseract","tesseract-ocr\u002Ftesseract","Tesseract 是一款历史悠久且备受推崇的开源光学字符识别（OCR）引擎，最初由惠普实验室开发，后由 Google 维护，目前由全球社区共同贡献。它的核心功能是将图片中的文字转化为可编辑、可搜索的文本数据，有效解决了从扫描件、照片或 PDF 文档中提取文字信息的难题，是数字化归档和信息自动化的重要基础工具。\n\n在技术层面，Tesseract 展现了强大的适应能力。从版本 4 开始，它引入了基于长短期记忆网络（LSTM）的神经网络 OCR 引擎，显著提升了行识别的准确率；同时，为了兼顾旧有需求，它依然支持传统的字符模式识别引擎。Tesseract 原生支持 UTF-8 编码，开箱即用即可识别超过 100 种语言，并兼容 PNG、JPEG、TIFF 等多种常见图像格式。输出方面，它灵活支持纯文本、hOCR、PDF、TSV 等多种格式，方便后续数据处理。\n\nTesseract 主要面向开发者、研究人员以及需要构建文档处理流程的企业用户。由于它本身是一个命令行工具和库（libtesseract），不包含图形用户界面（GUI），因此最适合具备一定编程能力的技术人员集成到自动化脚本或应用程序中。",73286,"2026-04-03T01:56:45",[13,14],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":80,"owner_email":79,"owner_twitter":79,"owner_website":81,"owner_url":82,"languages":83,"stars":99,"forks":100,"last_commit_at":101,"license":102,"difficulty_score":10,"env_os":103,"env_gpu":104,"env_ram":103,"env_deps":105,"category_tags":113,"github_topics":114,"view_count":23,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":120,"updated_at":121,"faqs":122,"releases":157},3715,"z-x-yang\u002FSegment-and-Track-Anything","Segment-and-Track-Anything","An open-source project dedicated to tracking and segmenting any objects in videos, either automatically or interactively. The primary algorithms utilized include the Segment Anything Model (SAM) for key-frame segmentation and Associating Objects with Transformers (AOT) for efficient tracking and propagation purposes.","Segment-and-Track-Anything 是一款专注于视频对象分割与追踪的开源项目，旨在帮助用户轻松提取视频中任意目标的轮廓并持续锁定其运动轨迹。它有效解决了传统方法在处理复杂动态场景时难以兼顾精度与效率的难题，无论是自动检测新出现的目标，还是通过人工交互指定特定物体，都能实现高精度的逐帧处理。\n\n该项目特别适合计算机视觉研究人员、视频处理开发者以及需要精细视频分析的内容创作者使用。其核心亮点在于巧妙融合了两大前沿算法：利用 Segment Anything Model (SAM) 在关键帧上进行强大的自动或交互式分割，再结合 DeAOT 技术高效地将掩码传播至整个视频序列，实现多目标稳定追踪。此外，它还支持独特的“声音定位”功能，能根据音频线索自动追踪发声物体，并允许用户通过点击、画笔涂抹甚至文本提示等多种灵活方式与系统互动。配合友好的 Web 界面和 Colab 在线演示，用户无需深厚的编程背景也能快速上手，探索视频智能分析的无限可能。","# Segment and Track Anything (SAM-Track)\n**Online Demo:** [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1R10N70AJaslzADFqb-a5OihYkllWEVxB?usp=sharing)\n**Technical Report:** [![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FReport-arXiv:2305.06558-green)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.06558)\n\n**Tutorial:** [tutorial-v1.6 (audio)](.\u002Ftutorial\u002Ftutorial%20for%20WebUI-1.6-Version.md), [tutorial-v1.5 (Text)](.\u002Ftutorial\u002Ftutorial%20for%20WebUI-1.5-Version.md), [tutorial-v1.0 (Click & Brush)](.\u002Ftutorial\u002Ftutorial%20for%20WebUI-1.0-Version.md)\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_79958e4fb41e.gif\" width=\"880\">\n\u003C\u002Fp>\n\n**Segment and Track Anything** is an open-source project that focuses on the segmentation and tracking of any objects in videos, utilizing both automatic and interactive methods. 
The primary algorithms utilized include the [**SAM** (Segment Anything Models)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsegment-anything) for automatic\u002Finteractive key-frame segmentation and the [**DeAOT** (Decoupling features in Associating Objects with Transformers)](https:\u002F\u002Fgithub.com\u002Fyoxu515\u002Faot-benchmark) (NeurIPS 2022) for efficient multi-object tracking and propagation. The SAM-Track pipeline enables dynamic and automatic detection and segmentation of new objects by SAM, while DeAOT is responsible for tracking all identified objects.\n\n## :loudspeaker:New Features\n- [2024\u002F4\u002F23] We have added an audio-grounding feature that tracks the sound-making object within the video's soundtrack.\n- [2023\u002F5\u002F12] We have authored a technical report for SAM-Track.\n- [2023\u002F5\u002F7] We have added `demo_instseg.ipynb`, which uses Grounding-DINO to detect new objects in the key frames of a video. It can be applied in the fields of smart cities and autonomous driving.\n- [2023\u002F4\u002F29] We have added advanced arguments for AOT-L: `long_term_memory_gap` and `max_len_long_term`.\n   - `long_term_memory_gap` controls the frequency at which the AOT model adds new reference frames to its long-term memory. During mask propagation, AOT matches the current frame with the reference frames stored in the long-term memory.\n   - Setting the gap to a proper value helps to obtain better performance. To avoid memory explosion in long videos, we set a `max_len_long_term` value for the long-term memory storage, i.e., when the number of memory frames reaches the `max_len_long_term` value, the oldest memory frame will be discarded and a new frame will be added. A minimal sketch of this eviction policy appears after this list.\n\n- [2023\u002F4\u002F26] **Interactive WebUI 1.5-Version**: We have added new features based on the Interactive WebUI 1.0-Version.\n   - We have added a new form of interactivity to SAMTrack: text prompts.\n   - From now on, multiple objects that need to be tracked can be interactively added.\n   - Check out the [tutorial](.\u002Ftutorial\u002Ftutorial%20for%20WebUI-1.5-Version.md) for the Interactive WebUI 1.5-Version. More demos will be released in the next few days.\n- [2023\u002F4\u002F26] **Image-Sequence input**: The WebUI now has a new feature that allows for input of image sequences, which can be used to test video segmentation datasets. Get started with the [tutorial](.\u002Ftutorial\u002Ftutorial%20for%20Image-Sequence%20input.md) for Image-Sequence input.\n- [2023\u002F4\u002F25] **Online Demo:** You can easily use SAMTrack in [Colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1R10N70AJaslzADFqb-a5OihYkllWEVxB?usp=sharing) for visual tracking tasks.\n\n- [2023\u002F4\u002F23] **Interactive WebUI:** We have introduced a new WebUI that allows interactive user segmentation through strokes and clicks. 
Feel free to explore and have fun with the [tutorial](.\u002Ftutorial\u002Ftutorial%20for%20WebUI-1.0-Version.md)!\n    - [2023\u002F4\u002F24] **Tutorial V1.0:** Check out our new video tutorials!\n      - YouTube-Link: [Tutorial for Interactively modify single-object mask for first frame of video](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=DF0iFSsX8KY)、[Tutorial for Interactively add object by click](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UJvKPng9_DA)、[Tutorial for Interactively add object by stroke](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=m1oFavjIaCM).\n      - Bilibili Video Link: [Tutorial for Interactively modify single-object mask for first frame of video](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1tM4115791\u002F?spm_id_from=333.999.0.0)、[Tutorial for Interactively add object by click](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1Qs4y1A7d1\u002F)、[Tutorial for Interactively add object by stroke](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1Lm4y117J4\u002F?spm_id_from=333.999.0.0).\n    - 1.0-Version is a developer version; please feel free to contact us if you encounter any bugs :bug:.\n\n- [2023\u002F4\u002F17] **SAMTrack**: Automatically segment and track anything in video!\n
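\nAs promised above, here is a minimal, runnable sketch of how `long_term_memory_gap` and `max_len_long_term` interact; the class and names are illustrative stand-ins, not the actual AOT implementation:\n\n```python\nfrom collections import deque\n\n# Illustrative sketch only: NOT the AOT implementation, just the reference-frame\n# policy described in the bullet points above.\nclass LongTermMemorySketch:\n    def __init__(self, long_term_memory_gap, max_len_long_term):\n        self.gap = long_term_memory_gap    # add a reference frame every `gap` frames\n        self.max_len = max_len_long_term   # hard cap on stored reference frames\n        self.frames = deque()              # oldest reference frame sits on the left\n\n    def update(self, frame_idx, frame_feature):\n        if frame_idx % self.gap != 0:\n            return  # not a reference frame; long-term memory is unchanged\n        if len(self.frames) >= self.max_len:\n            self.frames.popleft()          # discard the oldest frame to bound memory\n        self.frames.append((frame_idx, frame_feature))\n\n# The gap and cap below are arbitrary demo values, not recommended settings.\nmemory = LongTermMemorySketch(long_term_memory_gap=25, max_len_long_term=4)\nfor idx in range(200):\n    memory.update(idx, frame_feature=None)  # real frame features are omitted here\nprint([i for i, _ in memory.frames])        # -> [100, 125, 150, 175]\n```\n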
\n## :fire:Demos\n\u003Cdiv align=center>\n\n[![Segment-and-Track-Anything Versatile Demo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_8fa0a0bdcd73.jpg)](https:\u002F\u002Fyoutu.be\u002FUPhtpf1k6HA \"Segment-and-Track-Anything Versatile Demo\")\n\u003C\u002Fdiv>\n\nThis video showcases the segmentation and tracking capabilities of SAM-Track in various scenarios, such as street views, AR, cells, animations, aerial shots, and more.\n\n## :calendar:TODO\n - [x] Colab notebook: Completed on April 25th, 2023.\n - [x] 1.0-Version Interactive WebUI: Completed on April 23rd, 2023.\n    - We will create a feature that enables users to interactively modify the mask for the initial video frame according to their needs. The interactive segmentation capabilities of Segment-and-Track-Anything are demonstrated in [Demo8](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Xyd54AngvV8&feature=youtu.be) and [Demo9](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=eZrdna8JkoQ).\n    - Bilibili Video Link: [Demo8](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1JL411v7uE\u002F), [Demo9](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1Qs4y1w763\u002F).\n - [x] 1.5-Version Interactive WebUI: Completed on April 26th, 2023.\n    - We will develop a function that allows interactive modification of multi-object masks for the first frame of a video. This function will be based on Version 1.0. YouTube: [Demo4](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UFtwFaOfx2I&feature=youtu.be), [Demo5](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=cK5MPFdJdSY&feature=youtu.be); Bilibili: [Demo4](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV17X4y127mJ\u002F), [Demo5](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1Pz4y1a7mC\u002F)\n    - Furthermore, we plan to include text prompts as an additional form of interaction. YouTube: [Demo1](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=5oieHqFIJPc&feature=youtu.be), [Demo2](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=nXfq17X6ohk); Bilibili: [Demo1](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1hg4y157yd\u002F?vd_source=fe3b5c0215d05cc44c8eb3d94abae3ca), [Demo2](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1RV4y1k7i5\u002F)\n - [ ] 2.x-Version Interactive WebUI\n    - In version 2.x, the segmentation model will offer two options: SAM and SEEM.\n    - We will develop a new function where the fixed-category object detection result can be displayed as a prompt.\n    - We will enable SAM-Track to add and modify objects during tracking. YouTube: [Demo6](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=l7hXM1a3nEA&feature=youtu.be), [Demo7](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=hPjw28Ul4cw&feature=youtu.be); Bilibili: [Demo6](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1nk4y1j7Am), [Demo7](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1mk4y1E78s\u002F?vd_source=fe3b5c0215d05cc44c8eb3d94abae3ca)\n\n**Demo1** showcases SAM-Track's ability to take the class of objects as a prompt. The user gives the category text 'panda' to enable instance-level segmentation and tracking of all objects belonging to this category.\n\u003Cdiv align=center>\n \n[![demo1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_3531043c1701.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=5oieHqFIJPc&feature=youtu.be \"demo1\")\n\u003C\u002Fdiv>\n\n**Demo2** showcases SAM-Track's ability to take a text description as a prompt. SAM-Track can segment and track target objects given the input 'panda on the far left'.\n\u003Cdiv align=center>\n \n[![demo1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_1f1385a26ccd.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=nXfq17X6ohk \"demo1\")\n\u003C\u002Fdiv>\n\n**Demo3** showcases SAM-Track's ability to track numerous objects at the same time. SAM-Track is capable of automatically detecting newly appearing objects.\n\u003Cdiv align=center>\n \n[![demo1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_f8478ff28e58.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=jMqFMq0tRP0 \"demo1\")\n\u003C\u002Fdiv>\n\n**Demo4** showcases SAM-Track's ability to take multiple modes of interaction as prompts. The user specified the human and the skateboard with a click and a brushstroke, respectively.\n\u003Cdiv align=center>\n \n[![demo1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_5689522ca907.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UFtwFaOfx2I&feature=youtu.be \"demo1\")\n\u003C\u002Fdiv>\n\n**Demo5** showcases SAM-Track's ability to refine the results of segment-everything. The user merges the tram as a whole with a single click.\n\u003Cdiv align=center>\n \n[![demo1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_b3de6537d8df.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=cK5MPFdJdSY&feature=youtu.be \"demo1\")\n\u003C\u002Fdiv>\n\n**Demo6** showcases SAM-Track's ability to add new objects during tracking. 
The user annotates another car by rolling back to an intermediate frame.\n\u003Cdiv align=center>\n \n[![demo1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_d3371f2673e6.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=l7hXM1a3nEA \"demo1\")\n\u003C\u002Fdiv>\n\n**Demo7** showcases SAM-Track's ability to refine the prediction during tracking. This feature is highly advantageous for segmentation and tracking in complex environments.\n\u003Cdiv align=center>\n\n[![demo1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_216751aa7c4d.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=hPjw28Ul4cw&feature=youtu.be \"demo1\")\n\u003C\u002Fdiv>\n\n**Demo8** showcases SAM-Track's ability to interactively segment and track individual objects. The user specified that SAM-Track should track a man playing street basketball.\n\u003Cdiv align=center>\n\n[![Interactive Segment-and-Track-Anything Demo1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_00fb6a276593.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Xyd54AngvV8 \"Interactive Segment-and-Track-Anything Demo1\")\n\u003C\u002Fdiv>\n\n**Demo9** showcases SAM-Track's ability to interactively add specified objects for tracking. The user customized the addition of objects to be tracked on top of the segmentation of everything in the scene using SAM-Track.\n\u003Cdiv align=center>\n \n[![Interactive Segment-and-Track-Anything Demo2](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_8e7262d4e4e2.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=eZrdna8JkoQ \"Interactive Segment-and-Track-Anything Demo2\")\n\u003C\u002Fdiv>\n\n## :computer:Getting Started\n### :bookmark_tabs:Requirements\n\nThe [Segment-Anything](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsegment-anything) repository has been cloned and renamed as sam, and the [aot-benchmark](https:\u002F\u002Fgithub.com\u002Fyoxu515\u002Faot-benchmark) repository has been cloned and renamed as aot.\n\nPlease check the dependency requirements in [SAM](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsegment-anything) and [DeAOT](https:\u002F\u002Fgithub.com\u002Fyoxu515\u002Faot-benchmark).\n\nThe implementation is tested under Python 3.9, with PyTorch 1.10 and torchvision 0.11. **We recommend an equivalent or higher PyTorch version**.\n\nUse `install.sh` to install the necessary libraries for SAM-Track:\n```\nbash script\u002Finstall.sh\n```\n\n### :star:Model Preparation\n\n**- Download the SAM model to the ckpt folder for running the code**\n\n1. SAM vit_b (default): https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fsegment_anything\u002Fsam_vit_b_01ec64.pth\n\n2. SAM vit_l: https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fsegment_anything\u002Fsam_vit_l_0b3195.pth\n\n3. 
SAM vit_h: https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fsegment_anything\u002Fsam_vit_h_4b8939.pth\n\n**- Download DEAOT model to ckpt folder for running the code**\n\n| Model      | Param (M) |                                             PRE_YTB_DAV                                        |     \n|:---------- |:---------:|:--------------------------------------------------------------------------------------------:  |\n| DeAOTT       |    7.2    | [gdrive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1ThWIZQS03cYWx1EKNN8MIMnJS5eRowzr\u002Fview?usp=sharing) |\n| DeAOTS       |    10.2   | [gdrive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1YwIAV5tBtn5spSFxKLBQBEQGwPHyQlHi\u002Fview?usp=sharing) |\n| DeAOTB       |    13.2   | [gdrive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1BHxsonnvJXylqHlZ1zJHHc-ymKyq-CFf\u002Fview?usp=sharing) |\n| DeAOTL       |    13.2   | [gdrive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F18elNz_wi9JyVBcIUYKhRdL08MA-FqHD5\u002Fview?usp=sharing) |\n| R50-DeAOTL   |    19.8   | [gdrive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1QoChMkTVxdYZ_eBlZhK2acq9KMQZccPJ\u002Fview?usp=sharing) |\n| SwinB-DeAOTL |    70.3   | [gdrive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1g4E-F0RPOx9Nd6J7tU9AE1TjsouL4oZq\u002Fview?usp=sharing) |\n\n**Download Grounding-DINO model to ckpt folder**\n\n| Name | Backbone | Training Data | Box AP (COCO) | Checkpoint | \n|------|----------|---------------|---------------|------------|\n| GroundingDINO-T | Swin-T | O365, GoldG, Cap4M | 48.4 (zero-shot) \u002F 57.2 (fine-tuned) | [Download](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FGroundingDINO\u002Freleases\u002Fdownload\u002Fv0.1.0-alpha\u002Fgroundingdino_swint_ogc.pth) | \n| GroundingDINO-B | Swin-B | COCO, O365, GoldG, Cap4M, OpenImages, ODinW-35, RefCOCO | 56.7 | [Download](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FGroundingDINO\u002Freleases\u002Fdownload\u002Fv0.1.0-alpha2\u002Fgroundingdino_swinb_cogcoor.pth) | \n\n\n**Download AST model to ast_master\u002Fpretrained_models, after cloning the AST repository**\n\n1. [Full AudioSet, 10 tstride, 10 fstride, with Weight Averaging (0.459 mAP)](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fca0b1v2nlxzyeb4\u002Faudioset_10_10_0.4593.pth?dl=1)\n2. [Full AudioSet, 10 tstride, 10 fstride, without Weight Averaging, Model 1 (0.450 mAP)](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002F1tv0hovue1bxupk\u002Faudioset_10_10_0.4495.pth?dl=1)\n3. [Full AudioSet, 10 tstride, 10 fstride, without Weight Averaging, Model 2  (0.448 mAP)](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002F6u5sikl4b9wo4u5\u002Faudioset_10_10_0.4483.pth?dl=1)\n4. [Full AudioSet, 10 tstride, 10 fstride, without Weight Averaging, Model 3  (0.448 mAP)](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fkt6i0v9fvfm1mbq\u002Faudioset_10_10_0.4475.pth?dl=1)\n5. [Full AudioSet, 12 tstride, 12 fstride, without Weight Averaging, Model (0.447 mAP)](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fsnfhx3tizr4nuc8\u002Faudioset_12_12_0.4467.pth?dl=1)\n6. [Full AudioSet, 14 tstride, 14 fstride, without Weight Averaging, Model (0.443 mAP)](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fz18s6pemtnxm4k7\u002Faudioset_14_14_0.4431.pth?dl=1)\n7. [Full AudioSet, 16 tstride, 16 fstride, without Weight Averaging, Model (0.442 mAP)](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fmdsa4t1xmcimia6\u002Faudioset_16_16_0.4422.pth?dl=1)\n\n8. 
[Speechcommands V2-35, 10 tstride, 10 fstride, without Weight Averaging, Model (98.12% accuracy on evaluation set)](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fq0tbqpwv44pquwy\u002Fspeechcommands_10_10_0.9812.pth?dl=1)\n\nThe default model is audioset_0.4593 ([audioset_0.4593.pth](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fcv4knew8mvbrnvq\u002Faudioset_0.4593.pth?dl=1)).\n\nYou can download the **default weights** using the command line as shown below.\n```\nbash script\u002Fdownload_ckpt.sh\n```\n\n### Running the Docker file\n\n```\ndocker build -f docker\u002FDockerfile -t myimage .\ndocker run myimage\n```\n\n### :heart:Run Demo\n- The video to be processed can be put in .\u002Fassets.\n- Then run **demo.ipynb** step by step to generate results.\n- The results will be saved as masks for each frame and a gif file for visualization.\n\nThe arguments for SAM-Track, DeAOT and SAM can be manually modified in model_args.py for the purpose of using other models or controlling the behavior of each model.\n\n### :muscle:WebUI App\nOur user-friendly visual interface allows you to easily obtain the results of your experiments. Simply launch it from the command line.\n\n```\npython app.py\n```\nUsers can upload the video directly on the UI and use SegTracker to automatically\u002Finteractively track objects within that video. We use a video of a man playing basketball as an example.\n\n![Interactive WebUI](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_c4c9612565a6.jpg)\n\nSegTracker-Parameters:\n - **aot_model**: used to select which version of DeAOT\u002FAOT to use for tracking and propagation.\n - **sam_gap**: used to control how often SAM is used to add newly appearing objects at specified frame intervals. Increasing it decreases the frequency of discovering new targets but significantly improves inference speed.\n - **points_per_side**: used to control the number of points per side used for generating masks by sampling a grid over the image. Increasing the size enhances the ability to detect small objects, but larger targets may be segmented into finer granularity.\n - **max_obj_num**: used to limit the maximum number of objects that SAM-Track can detect and track. A larger number of objects necessitates a greater utilization of memory, with approximately 16GB of memory capable of processing a maximum of 255 objects.\n\nUsage: To see the details, please refer to the [tutorial for the 1.0-Version WebUI](.\u002Ftutorial\u002Ftutorial%20for%20WebUI-1.0-Version.md).\n
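\nTo make these parameters concrete, here is a minimal, self-contained sketch of the detect-then-propagate loop they control. Every function below is an illustrative stand-in rather than SAM-Track's actual API; the real pipeline lives in demo.ipynb and the real arguments in model_args.py.\n\n```python\n# Illustrative stand-ins, NOT SAM-Track's API.\ndef segment_key_frame(frame, masks, points_per_side, max_obj_num):\n    # stand-in for SAM: sample a points_per_side x points_per_side grid and\n    # return masks for up to max_obj_num objects, including new ones\n    return masks if masks is not None else ['first-object-mask']\n\ndef propagate(frame, masks):\n    # stand-in for DeAOT: match the frame against its memory and carry every\n    # known mask forward by one frame\n    return masks\n\nsam_gap = 100         # run SAM on every 100th frame to discover new objects\npoints_per_side = 30  # denser grids find smaller objects but cost more time\nmax_obj_num = 255     # roughly 16GB of memory handles at most 255 objects\n\nmasks = None\nfor idx, frame in enumerate(range(300)):  # stand-in for decoded video frames\n    if idx % sam_gap == 0:\n        masks = segment_key_frame(frame, masks, points_per_side, max_obj_num)\n    else:\n        masks = propagate(frame, masks)\n```\n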
\n### :school:About us\nThank you for your interest in this project. The project is supervised by the ReLER Lab at Zhejiang University’s College of Computer Science and Technology. ReLER was established by Yang Yi, a Qiu Shi Distinguished Professor at Zhejiang University. Our dedicated team of contributors includes [Yangming Cheng](https:\u002F\u002Fgithub.com\u002Fyamy-cheng), Jiyuan Hu, [Yuanyou Xu](https:\u002F\u002Fgithub.com\u002Fyoxu515), [Liulei Li](https:\u002F\u002Fgithub.com\u002FlingorX), [Xiaodi Li](https:\u002F\u002Fgithub.com\u002FLiNO3Dy), [Zongxin Yang](https:\u002F\u002Fz-x-yang.github.io\u002F), [Wenguan Wang](https:\u002F\u002Fsites.google.com\u002Fview\u002Fwenguanwang) and [Yi Yang](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=RMSuNFwAAAAJ&hl=en).\n\n### :full_moon_with_face:Credits\nLicenses for borrowed code can be found in the [licenses.md](https:\u002F\u002Fgithub.com\u002Fz-x-yang\u002FSegment-and-Track-Anything\u002Fblob\u002Fmain\u002Flicenses.md) file.\n\n* DeAOT\u002FAOT - [https:\u002F\u002Fgithub.com\u002Fyoxu515\u002Faot-benchmark](https:\u002F\u002Fgithub.com\u002Fyoxu515\u002Faot-benchmark)\n* SAM - [https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsegment-anything](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsegment-anything)\n* Gradio (for building the WebUI) - [https:\u002F\u002Fgithub.com\u002Fgradio-app\u002Fgradio](https:\u002F\u002Fgithub.com\u002Fgradio-app\u002Fgradio)\n* Grounding-DINO - [https:\u002F\u002Fgithub.com\u002Fyamy-cheng\u002FGroundingDINO](https:\u002F\u002Fgithub.com\u002Fyamy-cheng\u002FGroundingDINO)\n* AST - [https:\u002F\u002Fgithub.com\u002FYuanGongND\u002Fast](https:\u002F\u002Fgithub.com\u002FYuanGongND\u002Fast)\n\n### License\nThe project is licensed under the [AGPL-3.0 license](https:\u002F\u002Fgithub.com\u002Fz-x-yang\u002FSegment-and-Track-Anything\u002Fblob\u002Fmain\u002FLICENSE.txt). To utilize or further develop this project for commercial purposes through proprietary means, permission must be granted by us (as well as the owners of any borrowed code).\n\n### Citations\nPlease consider citing the related paper(s) in your publications if they help your research.\n```\n@article{cheng2023segment,\n  title={Segment and Track Anything},\n  author={Cheng, Yangming and Li, Liulei and Xu, Yuanyou and Li, Xiaodi and Yang, Zongxin and Wang, Wenguan and Yang, Yi},\n  journal={arXiv preprint arXiv:2305.06558},\n  year={2023}\n}\n@article{kirillov2023segment,\n  title={Segment anything},\n  author={Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C and Lo, Wan-Yen and others},\n  journal={arXiv preprint arXiv:2304.02643},\n  year={2023}\n}\n@inproceedings{yang2022deaot,\n  title={Decoupling Features in Hierarchical Propagation for Video Object Segmentation},\n  author={Yang, Zongxin and Yang, Yi},\n  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},\n  year={2022}\n}\n@inproceedings{yang2021aot,\n  title={Associating Objects with Transformers for Video Object Segmentation},\n  author={Yang, Zongxin and Wei, Yunchao and Yang, Yi},\n  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},\n  year={2021}\n}\n@article{yang2024scalable,\n  title={Scalable video object segmentation with identification mechanism},\n  author={Yang, Zongxin and Miao, Jiaxu and Wei, Yunchao and Wang, Wenguan and Wang, Xiaohan and Yang, Yi},\n  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},\n  volume={46},\n  number={9},\n  pages={6247--6262},\n  year={2024},\n  publisher={IEEE}\n}\n@article{liu2023grounding,\n  title={Grounding dino: Marrying dino with grounded pre-training for open-set object detection},\n  author={Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others},\n  journal={arXiv preprint arXiv:2303.05499},\n  year={2023}\n}\n@inproceedings{gong21b_interspeech,\n  author={Yuan Gong and Yu-An Chung and James Glass},\n  title={AST: Audio Spectrogram Transformer},\n  booktitle={Proc. 
Interspeech 2021},\n  pages={571--575},\n  doi={10.21437\u002FInterspeech.2021-698},\n  year={2021}\n}\n```\n","# 任意对象分割与追踪（SAM-Track）\n**在线演示**：[![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1R10N70AJaslzADFqb-a5OihYkllWEVxB?usp=sharing)\n**技术报告**：[![](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FReport-arXiv:2305.06558-green)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.06558)\n\n**教程**：[tutorial-v1.6（音频）](.\u002Ftutorial\u002Ftutorial%20for%20WebUI-1.6-Version.md)、[tutorial-v1.5（文本）](.\u002Ftutorial\u002Ftutorial%20for%20WebUI-1.5-Version.md)、[tutorial-v1.0（点击与涂抹）](.\u002Ftutorial\u002Ftutorial%20for%20WebUI-1.0-Version.md)\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_79958e4fb41e.gif\" width=\"880\">\n\u003C\u002Fp>\n\n**Segment and Track Anything** 是一个开源项目，专注于视频中任意对象的分割与追踪，支持自动和交互式两种方式。主要算法包括用于关键帧自动\u002F交互式分割的 [**SAM**（Segment Anything Models）](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsegment-anything)，以及用于高效多目标追踪与传播的 [**DeAOT**（Decoupling features in Associating Objects with Transformers）](https:\u002F\u002Fgithub.com\u002Fyoxu515\u002Faot-benchmark)（NeurIPS 2022）。SAM-Track 流程能够通过 SAM 动态且自动地检测并分割新对象，而 DeAOT 则负责跟踪所有已识别的对象。\n\n## :loudspeaker:新功能\n- [2024\u002F4\u002F23] 我们新增了音频定位功能，可追踪视频配乐中的发声物体。\n- [2023\u002F5\u002F12] 我们撰写了 SAM-Track 的技术报告。\n- [2023\u002F5\u002F7] 我们添加了 `demo_instseg.ipynb`，该脚本利用 Grounding-DINO 在视频的关键帧中检测新对象，可用于智慧城市和自动驾驶等领域。\n- [2023\u002F4\u002F29] 我们为 AOT-L 增加了高级参数：`long_term_memory_gap` 和 `max_len_long_term`。\n   - `long_term_memory_gap` 控制 AOT 模型向其长期记忆库添加新参考帧的频率。在掩码传播过程中，AOT 会将当前帧与长期记忆库中的参考帧进行匹配。\n   - 合理设置间隔值有助于获得更好的性能。为避免在长视频中内存占用过大，我们还设置了长期记忆存储的最大长度 `max_len_long_term`，即当记忆帧数达到该值时，最旧的帧将被丢弃，并加入新的帧。\n\n- [2023\u002F4\u002F26] **交互式 WebUI 1.5 版**：我们在交互式 WebUI 1.0 版的基础上新增了多项功能。\n   - 我们为 SAMTrack 新增了一种交互方式——文本提示。\n   - 从现在起，用户可以交互式地添加多个需要追踪的对象。\n   - 请参阅 [教程](.\u002Ftutorial\u002Ftutorial%20for%20WebUI-1.5-Version.md) 了解交互式 WebUI 1.5 版的使用方法。未来几天还将发布更多演示。\n- [2023\u002F4\u002F26] **图像序列输入**：WebUI 现在新增了图像序列输入功能，可用于测试视频分割数据集。请参阅 [教程](.\u002Ftutorial\u002Ftutorial%20for%20Image-Sequence%20input.md) 了解如何使用图像序列输入功能。\n- [2023\u002F4\u002F25] **在线演示**：您可以在 [Colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1R10N70AJaslzADFqb-a5OihYkllWEVxB?usp=sharing) 中轻松使用 SAMTrack 进行视觉追踪任务。\n\n- [2023\u002F4\u002F23] **交互式 WebUI**：我们推出了全新的 WebUI，支持通过涂抹和点击实现交互式用户分割。欢迎体验并观看 [教程](.\u002Ftutorial\u002Ftutorial%20for%20WebUI-1.0-Version.md)！\n    - [2023\u002F4\u002F24] **教程 V1.0**：请查看我们的全新视频教程！\n      - YouTube 链接：[教程：交互式修改视频第一帧的单个对象掩码](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=DF0iFSsX8KY)、[教程：通过点击交互式添加对象](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UJvKPng9_DA)、[教程：通过涂抹交互式添加对象](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=m1oFavjIaCM)。\n      - Bilibili 视频链接：[教程：交互式修改视频第一帧的单个对象掩码](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1tM4115791\u002F?spm_id_from=333.999.0.0)、[教程：通过点击交互式添加对象](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1Qs4y1A7d1\u002F)、[教程：通过涂抹交互式添加对象](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1Lm4y117J4\u002F?spm_id_from=333.999.0.0)。\n    - 1.0 版本为开发者版本，如果您遇到任何问题，请随时联系我们 :bug:。\n\n- [2023\u002F4\u002F17] **SAMTrack**：自动分割并追踪视频中的任何对象！\n\n## :fire:演示视频\n\u003Cdiv align=center>\n\n[![Segment-and-Track-Anything 
多场景演示](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_8fa0a0bdcd73.jpg)](https:\u002F\u002Fyoutu.be\u002FUPhtpf1k6HA \"Segment-and-Track-Anything 多场景演示\")\n\u003C\u002Fdiv>\n\n本视频展示了 SAM-Track 在多种场景下的分割与追踪能力，例如街景、增强现实、细胞、动画、航拍等。\n\n## :calendar:待办事项\n - [x] Colab 笔记本：2023年4月25日完成。\n - [x] 1.0版本交互式WebUI：2023年4月23日完成。\n    - 我们将开发一项功能，允许用户根据需求对初始视频帧的掩码进行交互式修改。Segment-and-Track-Anything 的交互式分割能力在 [Demo8](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Xyd54AngvV8&feature=youtu.be) 和 [Demo9](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=eZrdna8JkoQ) 中得到了展示。\n    - Bilibili 视频链接：[Demo8](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1JL411v7uE\u002F)，[Demo9](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1Qs4y1w763\u002F)。\n - [x] 1.5版本交互式WebUI：2023年4月26日完成。\n    - 我们将开发一个功能，允许对视频第一帧的多对象掩码进行交互式修改。该功能将基于 1.0 版本实现。YouTube：[Demo4](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UFtwFaOfx2I&feature=youtu.be)，[Demo5](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=cK5MPFdJdSY&feature=youtu.be)；Bilibili：[Demo4](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV17X4y127mJ\u002F)，[Demo5](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1Pz4y1a7mC\u002F)。\n    - 此外，我们计划加入文本提示作为额外的交互方式。YouTube：[Demo1](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=5oieHqFIJPc&feature=youtu.be)，[Demo2](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=nXfq17X6ohk)；Bilibili：[Demo1](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1hg4y157yd\u002F?vd_source=fe3b5c0215d05cc44c8eb3d94abae3ca)，[Demo2](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1RV4y1k7i5\u002F)。\n - [ ] 2.x版本交互式WebUI\n    - 在 2.x 版本中，分割模型将提供两种选项：SAM 和 SEEM。\n    - 我们将开发一项新功能，可以将固定类别的目标检测结果作为提示显示。\n    - 我们将使 SAM-Track 在跟踪过程中支持添加和修改目标对象。YouTube：[Demo6](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=l7hXM1a3nEA&feature=youtu.be)，[Demo7](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=hPjw28Ul4cw&feature=youtu.be)；Bilibili：[Demo6](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1nk4y1j7Am)，[Demo7](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1mk4y1E78s\u002F?vd_source=fe3b5c0215d05cc44c8eb3d94abae3ca)\n\n**Demo1** 展示了 SAM-Track 能够以物体类别作为提示的能力。用户输入类别文本“panda”，即可实现对该类别下所有目标的实例级分割与跟踪。\n\u003Cdiv align=center>\n \n[![demo1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_3531043c1701.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=5oieHqFIJPc&feature=youtu.be \"demo1\")\n\u003C\u002Fdiv>\n\n**Demo2** 展示了 SAM-Track 能够以文本描述作为提示的能力。只需输入“最左边的熊猫”，SAM-Track 就能分割并跟踪目标对象。\n\u003Cdiv align=center>\n \n[![demo1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_1f1385a26ccd.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=nXfq17X6ohk \"demo1\")\n\u003C\u002Fdiv>\n\n\n**Demo3** 展示了 SAM-Track 同时跟踪多个目标的能力。它还能自动检测新出现的对象。\n\u003Cdiv align=center>\n \n[![demo1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_f8478ff28e58.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=jMqFMq0tRP0 \"demo1\")\n\u003C\u002Fdiv>\n\n**Demo4** 展示了 SAM-Track 能够接受多种交互方式作为提示的能力。用户分别用点击和笔刷标注了人和滑板。\n\u003Cdiv align=center>\n \n[![demo1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_5689522ca907.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UFtwFaOfx2I&feature=youtu.be \"demo1\")\n\u003C\u002Fdiv>\n\n\n**Demo5** 展示了 SAM-Track 能够对“分割一切”的结果进行优化。用户只需单击一下，就能将电车合并为一个整体。\n\u003Cdiv align=center>\n 
\n[![demo1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_b3de6537d8df.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=cK5MPFdJdSY&feature=youtu.be \"demo1\")\n\u003C\u002Fdiv>\n\n**Demo6** 展示了 SAM-Track 在跟踪过程中添加新对象的能力。用户通过回退到中间帧，标注了另一辆汽车。\n\u003Cdiv align=center>\n \n[![demo1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_d3371f2673e6.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=l7hXM1a3nEA \"demo1\")\n\u003C\u002Fdiv>\n\n**Demo7** 展示了 SAM-Track 在跟踪过程中不断优化预测的能力。这一特性对于复杂环境下的分割与跟踪尤为有利。\n\u003Cdiv align=center>\n\n[![demo1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_216751aa7c4d.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=hPjw28Ul4cw&feature=youtu.be \"demo1\")\n\u003C\u002Fdiv>\n\n**Demo8** 展示了 SAM-Track 能够交互式地分割并跟踪单个目标。用户指定 SAM-Track 跟踪一名正在打街头篮球的男子。\n\u003Cdiv align=center>\n\n[![Interactive Segment-and-Track-Anything Demo1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_00fb6a276593.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Xyd54AngvV8 \"Interactive Segment-and-Track-Anything Demo1\")\n\u003C\u002Fdiv>\n\n**Demo9** 展示了 SAM-Track 能够交互式地添加指定目标进行跟踪。用户在使用 SAM-Track 对场景中的所有内容进行分割的基础上，自定义了需要跟踪的对象。\n\u003Cdiv align=center>\n \n[![Interactive Segment-and-Track-Anything Demo2](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_8e7262d4e4e2.jpg)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=eZrdna8JkoQ \"Interactive Segment-and-Track-Anything Demo2\")\n\u003C\u002Fdiv>\n\n## :computer:开始使用\n### :bookmark_tabs:要求\n\n已克隆并重命名为 sam 的 [Segment-Anything](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsegment-anything) 仓库，以及已克隆并重命名为 aot 的 [aot-benchmark](https:\u002F\u002Fgithub.com\u002Fyoxu515\u002Faot-benchmark) 仓库。\n\n请检查 [SAM](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsegment-anything) 和 [DeAOT](https:\u002F\u002Fgithub.com\u002Fyoxu515\u002Faot-benchmark) 中的依赖项要求。\n\n该实现已在 Python 3.9、PyTorch 1.10 和 torchvision 0.11 下进行了测试。**建议使用相同或更高版本的 PyTorch**。\n\n使用 `install.sh` 脚本来安装 SAM-Track 所需的库：\n```\nbash script\u002Finstall.sh\n```\n\n### :star:模型准备\n\n**- 下载SAM模型到ckpt文件夹以运行代码**\n\n1. SAM vit_b（默认）：https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fsegment_anything\u002Fsam_vit_b_01ec64.pth\n\n2. SAM vit_l：https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fsegment_anything\u002Fsam_vit_l_0b3195.pth\n\n3. 
SAM vit_h：https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fsegment_anything\u002Fsam_vit_h_4b8939.pth\n\n**- 下载 DeAOT 模型到 ckpt 文件夹以运行代码**\n\n| 模型      | 参数（M） |                                             PRE_YTB_DAV                                        |\n|:---------- |:---------:|:--------------------------------------------------------------------------------------------:|\n| DeAOTT       |    7.2    | [gdrive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1ThWIZQS03cYWx1EKNN8MIMnJS5eRowzr\u002Fview?usp=sharing) |\n| DeAOTS       |    10.2   | [gdrive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1YwIAV5tBtn5spSFxKLBQBEQGwPHyQlHi\u002Fview?usp=sharing) |\n| DeAOTB       |    13.2   | [gdrive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1BHxsonnvJXylqHlZ1zJHHc-ymKyq-CFf\u002Fview?usp=sharing) |\n| DeAOTL       |    13.2   | [gdrive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F18elNz_wi9JyVBcIUYKhRdL08MA-FqHD5\u002Fview?usp=sharing) |\n| R50-DeAOTL   |    19.8   | [gdrive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1QoChMkTVxdYZ_eBlZhK2acq9KMQZccPJ\u002Fview?usp=sharing) |\n| SwinB-DeAOTL |    70.3   | [gdrive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1g4E-F0RPOx9Nd6J7tU9AE1TjsouL4oZq\u002Fview?usp=sharing) |\n\n**下载 Grounding-DINO 模型到 ckpt 文件夹**\n\n| 名称 | 主干网络 | 训练数据 | 边框 AP（COCO） | 检查点 |\n|------|----------|---------------|---------------|------------|\n| GroundingDINO-T | Swin-T | O365、GoldG、Cap4M | 48.4（零样本）\u002F 57.2（微调） | [下载](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FGroundingDINO\u002Freleases\u002Fdownload\u002Fv0.1.0-alpha\u002Fgroundingdino_swint_ogc.pth) |\n| GroundingDINO-B | Swin-B | COCO、O365、GoldG、Cap4M、OpenImages、ODinW-35、RefCOCO | 56.7 | [下载](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FGroundingDINO\u002Freleases\u002Fdownload\u002Fv0.1.0-alpha2\u002Fgroundingdino_swinb_cogcoor.pth) |\n\n**克隆 AST 仓库后，下载 AST 模型到 ast_master\u002Fpretrained_models**\n\n1. [完整AudioSet，10 tstride，10 fstride，带权重平均（0.459 mAP）](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fca0b1v2nlxzyeb4\u002Faudioset_10_10_0.4593.pth?dl=1)\n2. [完整AudioSet，10 tstride，10 fstride，不带权重平均，模型1（0.450 mAP）](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002F1tv0hovue1bxupk\u002Faudioset_10_10_0.4495.pth?dl=1)\n3. [完整AudioSet，10 tstride，10 fstride，不带权重平均，模型2（0.448 mAP）](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002F6u5sikl4b9wo4u5\u002Faudioset_10_10_0.4483.pth?dl=1)\n4. [完整AudioSet，10 tstride，10 fstride，不带权重平均，模型3（0.448 mAP）](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fkt6i0v9fvfm1mbq\u002Faudioset_10_10_0.4475.pth?dl=1)\n5. [完整AudioSet，12 tstride，12 fstride，不带权重平均，模型（0.447 mAP）](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fsnfhx3tizr4nuc8\u002Faudioset_12_12_0.4467.pth?dl=1)\n6. [完整AudioSet，14 tstride，14 fstride，不带权重平均，模型（0.443 mAP）](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fz18s6pemtnxm4k7\u002Faudioset_14_14_0.4431.pth?dl=1)\n7. [完整AudioSet，16 tstride，16 fstride，不带权重平均，模型（0.442 mAP）](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fmdsa4t1xmcimia6\u002Faudioset_16_16_0.4422.pth?dl=1)\n8. [Speechcommands V2-35，10 tstride，10 fstride，不带权重平均，模型（在评估集上准确率98.12%）](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fq0tbqpwv44pquwy\u002Fspeechcommands_10_10_0.9812.pth?dl=1)\n\n默认模型是 audioset_0.4593（[audioset_0.4593.pth](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fcv4knew8mvbrnvq\u002Faudioset_0.4593.pth?dl=1)）。\n\n您可以通过以下命令行下载**默认权重**。\n```\nbash script\u002Fdownload_ckpt.sh\n```\n\n### 运行 Docker 文件\n\n```\n
docker build -f docker\u002FDockerfile -t myimage .\ndocker run myimage\n```\n\n### :heart:运行演示\n- 待处理的视频可以放在 .\u002Fassets 中。\n- 然后逐步运行 **demo.ipynb** 以生成结果。\n- 结果将保存为每帧的掩码以及用于可视化的 gif 文件。\n\nSAM-Track、DeAOT 和 SAM 的参数可以在 model_args.py 中手动修改，以便使用其他模型或控制各模型的行为。\n\n### :muscle:WebUI 应用\n我们友好的可视化界面使您能够轻松获得实验结果。只需通过命令行启动即可：\n```\npython app.py\n```\n用户可以直接在界面上上传视频，并使用 SegTracker 对视频中的物体进行自动或交互式跟踪。我们以一段男子打篮球的视频为例。\n\n![交互式WebUI](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_readme_c4c9612565a6.jpg)\n\nSegTracker 参数：\n - **aot_model**：用于选择使用哪个版本的 DeAOT\u002FAOT 进行跟踪和传播。\n - **sam_gap**：用于控制 SAM 在指定帧间隔内添加新出现物体的频率。增加该值可降低发现新目标的频率，但会显著提高推理速度。\n - **points_per_side**：用于控制在图像上采样网格以生成掩码时每边使用的点数。增大该值可提高检测小物体的能力，但较大目标可能会被分割得更细。\n - **max_obj_num**：用于限制 SAM-Track 能够检测和跟踪的最大物体数量。物体数量越多，所需的内存也越大，大约 16GB 内存最多可处理 255 个物体。\n\n用法：欲了解更多细节，请参阅[1.0 版本 WebUI 教程](.\u002Ftutorial\u002Ftutorial%20for%20WebUI-1.0-Version.md)。\n\n### :school:关于我们\n感谢您对本项目的关注。该项目由浙江大学计算机科学与技术学院的 ReLER 实验室指导。ReLER 由浙江大学求是特聘教授杨毅创立。我们的贡献团队包括[程阳明](https:\u002F\u002Fgithub.com\u002Fyamy-cheng)、胡继元、[徐远优](https:\u002F\u002Fgithub.com\u002Fyoxu515)、[李柳磊](https:\u002F\u002Fgithub.com\u002FlingorX)、[李晓迪](https:\u002F\u002Fgithub.com\u002FLiNO3Dy)、[杨宗欣](https:\u002F\u002Fz-x-yang.github.io\u002F)、[王文冠](https:\u002F\u002Fsites.google.com\u002Fview\u002Fwenguanwang)以及[杨毅](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=RMSuNFwAAAAJ&hl=en)。\n\n### :full_moon_with_face:致谢\n借用代码的许可证可在 [licenses.md](https:\u002F\u002Fgithub.com\u002Fz-x-yang\u002FSegment-and-Track-Anything\u002Fblob\u002Fmain\u002Flicenses.md) 文件中找到。\n\n* DeAOT\u002FAOT - [https:\u002F\u002Fgithub.com\u002Fyoxu515\u002Faot-benchmark](https:\u002F\u002Fgithub.com\u002Fyoxu515\u002Faot-benchmark)\n* SAM - [https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsegment-anything](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsegment-anything)\n* Gradio（用于构建 WebUI） - [https:\u002F\u002Fgithub.com\u002Fgradio-app\u002Fgradio](https:\u002F\u002Fgithub.com\u002Fgradio-app\u002Fgradio)\n* Grounding-DINO - [https:\u002F\u002Fgithub.com\u002Fyamy-cheng\u002FGroundingDINO](https:\u002F\u002Fgithub.com\u002Fyamy-cheng\u002FGroundingDINO)\n* AST - [https:\u002F\u002Fgithub.com\u002FYuanGongND\u002Fast](https:\u002F\u002Fgithub.com\u002FYuanGongND\u002Fast)\n\n### 许可协议\n本项目采用 [AGPL-3.0 许可协议](https:\u002F\u002Fgithub.com\u002Fz-x-yang\u002FSegment-and-Track-Anything\u002Fblob\u002Fmain\u002FLICENSE.txt)进行授权。如需将本项目用于商业用途或以专有方式进一步开发，必须获得我们的许可（以及任何借用代码的版权所有者的许可）。\n\n### 引用\n如果您认为相关论文对您的研究有所帮助，请在您的出版物中引用这些论文。\n```\n@article{cheng2023segment,\n  title={Segment and Track Anything},\n  author={Cheng, Yangming and Li, Liulei and Xu, Yuanyou and Li, Xiaodi and Yang, Zongxin and Wang, Wenguan and Yang, Yi},\n  journal={arXiv preprint arXiv:2305.06558},\n  year={2023}\n}\n@article{kirillov2023segment,\n  title={Segment anything},\n  author={Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C and Lo, Wan-Yen and others},\n  journal={arXiv preprint arXiv:2304.02643},\n  year={2023}\n}\n@inproceedings{yang2022deaot,\n  title={Decoupling Features in Hierarchical Propagation for Video Object Segmentation},\n  author={Yang, Zongxin and Yang, Yi},\n  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},\n  year={2022}\n}\n@inproceedings{yang2021aot,\n  title={Associating Objects with Transformers for Video Object Segmentation},\n  author={Yang, Zongxin and 
Wei, Yunchao and Yang, Yi},\n  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},\n  year={2021}\n}\n@article{yang2024scalable,\n  title={Scalable video object segmentation with identification mechanism},\n  author={Yang, Zongxin and Miao, Jiaxu and Wei, Yunchao and Wang, Wenguan and Wang, Xiaohan and Yang, Yi},\n  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},\n  volume={46},\n  number={9},\n  pages={6247--6262},\n  year={2024},\n  publisher={IEEE}\n}\n@article{liu2023grounding,\n  title={Grounding dino: Marrying dino with grounded pre-training for open-set object detection},\n  author={Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others},\n  journal={arXiv preprint arXiv:2303.05499},\n  year={2023}\n}\n@inproceedings{gong21b_interspeech,\n  author={Yuan Gong and Yu-An Chung and James Glass},\n  title={AST: Audio Spectrogram Transformer},\n  booktitle={Proc. Interspeech 2021},\n  pages={571--575},\n  doi={10.21437\u002FInterspeech.2021-698},\n  year={2021}\n}\n```","# Segment-and-Track-Anything (SAM-Track) 快速上手指南\n\nSegment-and-Track-Anything 是一个开源项目，专注于视频中任意物体的分割与跟踪。它结合了 **SAM** (Segment Anything Models) 进行自动\u002F交互式关键帧分割，以及 **DeAOT** 进行高效的多目标跟踪与传播。\n\n## 1. 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux (推荐)\n*   **Python**: 3.9\n*   **PyTorch**: 1.10 或更高版本\n*   **TorchVision**: 0.11 或更高版本\n*   **GPU**: 推荐使用支持 CUDA 的 NVIDIA 显卡以获得最佳性能\n\n**前置依赖说明：**\n本项目依赖两个核心子模块，安装脚本会自动处理克隆和重命名：\n1.  `sam`: 基于 Facebook Research 的 [Segment-Anything](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsegment-anything)。\n2.  `aot`: 基于 [aot-benchmark](https:\u002F\u002Fgithub.com\u002Fyoxu515\u002Faot-benchmark) (DeAOT)。\n\n> **国内开发者提示**：如果访问 GitHub 或下载模型权重较慢，建议配置科学网络环境，或使用国内镜像源加速 Python 包安装（如清华源、阿里源）。\n\n## 2. 安装步骤\n\n### 步骤一：克隆项目\n首先克隆主仓库到本地：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fz-x-yang\u002FSegment-and-Track-Anything.git\ncd Segment-and-Track-Anything\n```\n\n### 步骤二：安装依赖库\n项目提供了自动化安装脚本，用于安装 SAM、DeAOT 及相关 Python 依赖。请直接运行以下命令：\n\n```bash\nbash script\u002Finstall.sh\n```\n\n*注意：该脚本会自动克隆并重命名所需的子模块。如果脚本执行过程中出现网络错误，请检查网络连接或手动克隆子模块。*\n\n### 步骤三：准备模型权重\n确保已下载必要的预训练模型权重（SAM checkpoint 和 DeAOT 权重）。可运行 `bash script\u002Fdownload_ckpt.sh` 下载默认权重；如果失败，请参考项目原文的 \"Model Preparation\" 部分手动下载并放置到 `ckpt\u002F` 目录。\n\n## 3. 基本使用\n\n### 方式一：使用 Colab 在线体验（推荐新手）\n如果您不想配置本地环境，可以直接使用 Google Colab 进行体验：\n*   **在线演示链接**: [Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1R10N70AJaslzADFqb-a5OihYkllWEVxB?usp=sharing)\n\n### 方式二：本地运行交互式 WebUI\n项目提供了功能强大的 Web 界面，支持点击、画笔涂抹以及**文本提示**等多种交互方式来分割和跟踪视频中的物体。\n\n1.  **启动 WebUI**：\n    在项目根目录下运行启动脚本（具体脚本名称视版本而定，通常为 `app.py`）：\n    ```bash\n    python app.py\n    # 或者\n    python webui.py\n    ```\n    *(注：如果上述命令无效，请查看项目根目录下的 `app.py` 或相关启动脚本)*\n\n2.  
**操作流程**：\n    *   在浏览器打开显示的本地地址（通常是 `http:\u002F\u002F127.0.0.1:7860`）。\n    *   **上传视频**：加载您需要处理的视频文件。\n    *   **选择交互模式**：\n        *   **Click & Brush**: 在第一帧通过点击或涂抹指定目标物体。\n        *   **Text Prompt**: 输入文本描述（例如 \"panda\", \"car on the left\"）来自动检测并跟踪目标。\n        *   **Audio Grounding**: (v1.6+ 特性) 根据声音定位并跟踪发声物体。\n    *   **运行跟踪**：点击运行按钮，系统将自动分割首帧并跟踪整个视频序列。\n\n### 方式三：命令行运行示例 (Jupyter Notebook)\n对于批量处理或特定场景（如自动驾驶），可以使用提供的 Notebook 脚本。\n\n运行基于 Grounding-DINO 自动检测新物体的示例：\n```bash\njupyter notebook demo_instseg.ipynb\n```\n在该 Notebook 中，您可以加载视频，利用 Grounding-DINO 在关键帧自动检测物体，并结合 DeAOT 进行全程跟踪。\n\n---\n**更多详细教程**：\n*   **WebUI v1.6 (音频交互)**: 查看 `.\u002Ftutorial\u002Ftutorial for WebUI-1.6-Version.md`\n*   **WebUI v1.5 (文本提示)**: 查看 `.\u002Ftutorial\u002Ftutorial for WebUI-1.5-Version.md`\n*   **WebUI v1.0 (点击与画笔)**: 查看 `.\u002Ftutorial\u002Ftutorial for WebUI-1.0-Version.md`","某野生动物保护团队正在处理数千小时红外相机拍摄的森林监控视频，需要统计特定珍稀动物的活动轨迹与种群数量。\n\n### 没有 Segment-and-Track-Anything 时\n- **人工标注成本极高**：分析师必须逐帧手动勾勒动物轮廓，一段几分钟的视频需耗费数小时，面对海量数据几乎无法完成。\n- **目标丢失频繁**：当动物被树枝遮挡或快速移动时，传统追踪算法极易跟丢目标，导致轨迹断裂，后续需人工反复修正。\n- **新目标识别困难**：视频中突然闯入的新个体无法被自动发现，往往需要人工全程盯守屏幕才能捕捉，漏检率高。\n- **多模态分析缺失**：仅凭视觉难以区分隐蔽在草丛中的动物，无法利用叫声等音频线索辅助定位发声源。\n\n### 使用 Segment-and-Track-Anything 后\n- **自动化高效分割**：利用 SAM 模型自动在关键帧生成高精度掩码，并通过 DeAOT 算法将结果传播至全视频，将处理时间从小时级缩短至分钟级。\n- **鲁棒性显著增强**：即使动物长时间被植被遮挡或发生剧烈形变，长短期记忆机制也能确保持续稳定追踪，无需人工干预修补轨迹。\n- **动态新增目标检测**：结合 Grounding-DINO，系统能自动识别并分割视频中后期新出现的动物个体，实现全天候无人值守监测。\n- **音画协同定位**：启用音频接地功能后，系统可直接根据动物叫声锁定并追踪发声主体，有效解决视觉盲区下的目标捕获难题。\n\nSegment-and-Track-Anything 通过将顶尖的分割与追踪算法深度融合，把原本耗时数周的视频分析工作压缩至瞬间完成，让科研人员能从繁琐的标注中解放出来专注于生态研究本身。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fz-x-yang_Segment-and-Track-Anything_e8ebbaac.png","z-x-yang","Zongxin Yang","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fz-x-yang_776a594e.jpg","I’m currently a postdoctoral researcher at Harvard University. 
",null,"Harvard University","https:\u002F\u002Fz-x-yang.github.io\u002F","https:\u002F\u002Fgithub.com\u002Fz-x-yang",[84,88,92,96],{"name":85,"color":86,"percentage":87},"Jupyter Notebook","#DA5B0B",95.5,{"name":89,"color":90,"percentage":91},"Python","#3572A5",4.4,{"name":93,"color":94,"percentage":95},"Shell","#89e051",0,{"name":97,"color":98,"percentage":95},"Dockerfile","#384d54",3119,356,"2026-04-04T20:09:32","AGPL-3.0","未说明","需要 NVIDIA GPU (基于 PyTorch 和 DeAOT\u002FSAM 架构推断)，具体显存大小和 CUDA 版本未在片段中明确说明",{"notes":106,"python":107,"dependencies":108},"该项目依赖两个子模块：Segment-Anything (SAM) 和 DeAOT。官方测试环境为 Python 3.9、PyTorch 1.10 和 TorchVision 0.11，建议使用同等或更高版本的 PyTorch。可通过运行 `bash script\u002Finstall.sh` 脚本安装必要的依赖库。","3.9",[109,110,111,112],"torch>=1.10","torchvision>=0.11","segment-anything (SAM)","aot-benchmark (DeAOT)",[35,14],[115,116,117,118,119],"interactive-segmentation","segment-anything","segment-anything-model","video-object-segmentation","visual-object-tracking","2026-03-27T02:49:30.150509","2026-04-06T07:15:05.911005",[123,128,133,138,143,148,153],{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},17024,"遇到 'ModuleNotFoundError: No module named spatial_correlation_sampler' 错误怎么办？","该错误通常是因为缺少依赖模块。请确保已正确安装所有依赖项。如果是混合模型，其帧率（FPS）可参考各组成部分：R50-DeAOT-L 跟踪部分的单对象 FPS 约为 20+。如需更快的模型，可查阅 AOT 的 model_zoo。","https:\u002F\u002Fgithub.com\u002Fz-x-yang\u002FSegment-and-Track-Anything\u002Fissues\u002F5",{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},17025,"如何使用 GroundingDINO 提供边界框并进行后续处理？","可以使用 GroundingDINO 提供边界框（bbox），然后结合当前代码进行下一步操作。AOT 利用分层长短期 Transformer 在帧间传播掩码以获取结果，它同时使用外观和位置信息进行跟踪，而不仅仅是位置。对于首次出现的物体，使用 SAM 获取分割结果来初始化 AOT。","https:\u002F\u002Fgithub.com\u002Fz-x-yang\u002FSegment-and-Track-Anything\u002Fissues\u002F7",{"id":134,"question_zh":135,"answer_zh":136,"source_url":137},17026,"运行 install.sh 脚本时遇到 'ModuleNotFoundError: No module named groundingdino' 错误如何解决？","解决方法有两种：1. 先在环境中安装 PyTorch，然后重新运行 `install.sh` 脚本；2. 
如果是在 Google Colab 中，可以在导入 groundingDino 之前，在 `tool\u002Fdetecter.py` 文件中添加以下代码以修正路径：\n```python\nimport sys\nsys.path.append('\u002Fcontent\u002FSegment-and-Track-Anything\u002Fsrc\u002Fgroundingdino')\n```","https:\u002F\u002Fgithub.com\u002Fz-x-yang\u002FSegment-and-Track-Anything\u002Fissues\u002F35",{"id":139,"question_zh":140,"answer_zh":141,"source_url":142},17027,"使用 WebUI 时点击 'Roll back' 或 'Chose this mask to refine' 后报错怎么办？","这是一个操作顺序问题。正确的步骤是：首先在上方的框中点击选择一个掩码，然后点击 'choose this mask refine'，最后在底部的框中进行调整。按照此顺序操作通常可以解决问题。","https:\u002F\u002Fgithub.com\u002Fz-x-yang\u002FSegment-and-Track-Anything\u002Fissues\u002F64",{"id":144,"question_zh":145,"answer_zh":146,"source_url":147},17028,"WebUI 上传视频后，'segment result'、'predicted video' 等板块一直显示错误怎么办？","这是因为 `SegTracker` 尚未初始化。您需要先点击或选择要跟踪的对象以初始化 SegTracker，之后才能正常生成预测结果。请参考官方教程中的演示步骤进行操作。","https:\u002F\u002Fgithub.com\u002Fz-x-yang\u002FSegment-and-Track-Anything\u002Fissues\u002F75",{"id":149,"question_zh":150,"answer_zh":151,"source_url":152},17029,"FFmpeg 报错或视频无法播放是什么原因？","这通常是因为输入视频的编码格式与 'opencv-python' 不兼容。请确保输入视频使用的是 opencv-python 支持的编码格式。此外，如果在浏览器中看不到播放按钮，请检查界面左下角，点击后输出视频将会显示。","https:\u002F\u002Fgithub.com\u002Fz-x-yang\u002FSegment-and-Track-Anything\u002Fissues\u002F12",{"id":154,"question_zh":155,"answer_zh":156,"source_url":132},17030,"AOT 跟踪算法是利用物体的形状还是位置进行跟踪？时间一致性如何保证？","AOT 同时使用外观（appearance）和位置（position）进行跟踪。它包含全局注意力机制（在整个图像中搜索物体）和局部注意力机制（在局部窗口中搜索），这些均基于外观匹配。虽然物体运动通常是局部的，但并非强制，因此即使位置发生剧烈变化，AOT 仍有可能跟踪到物体。关于更多细节，可参考 AOT 论文。",[158,163,168,173],{"id":159,"version":160,"summary_zh":161,"released_at":162},99287,"v1.6","交互式SAMTrack 1.6版本新增音频定位功能，可追踪视频音频轨道中的发声对象。","2024-04-25T07:42:22",{"id":164,"version":165,"summary_zh":166,"released_at":167},99288,"v1.5","交互式SAMTrack 1.5版本支持四种交互式选择待跟踪多目标的方式：全部、点击、笔画和文本。","2023-04-28T03:45:35",{"id":169,"version":170,"summary_zh":171,"released_at":172},99289,"v1.0","交互式SAMTrack 1.0版本推出全新的Web界面，使用户能够轻松选择目标对象进行交互式视频跟踪。","2023-04-25T06:12:38",{"id":174,"version":175,"summary_zh":176,"released_at":177},99290,"v0.5","Segment-and-Track-Anything 0.5 版本支持自动分割和跟踪。","2023-04-19T13:09:44"]