[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-UX-Decoder--Segment-Everything-Everywhere-All-At-Once":3,"tool-UX-Decoder--Segment-Everything-Everywhere-All-At-Once":62},[4,18,26,35,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,2,"2026-04-10T11:39:34",[14,15,13],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":32,"last_commit_at":41,"category_tags":42,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[43,13,15,14],"插件",{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 
架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[52,15,13,14],"语言模型",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[14,15,13,61],"视频",{"id":63,"github_repo":64,"name":65,"description_en":66,"description_zh":67,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":75,"owner_location":75,"owner_email":75,"owner_twitter":75,"owner_website":75,"owner_url":76,"languages":77,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":98,"env_os":99,"env_gpu":100,"env_ram":101,"env_deps":102,"category_tags":111,"github_topics":75,"view_count":32,"oss_zip_url":75,"oss_zip_packed_at":75,"status":17,"created_at":113,"updated_at":114,"faqs":115,"releases":146},6913,"UX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once","Segment-Everything-Everywhere-All-At-Once","[NeurIPS 2023] Official implementation of the paper \"Segment Everything Everywhere All at Once\"","SEEM（Segment Everything Everywhere All at Once）是一款强大的通用图像分割模型，旨在让用户能够“一次性”完成对图像中任意对象的精细化分割。它解决了传统分割工具交互方式单一、难以适应复杂场景的痛点，支持用户通过多种模态的提示词灵活操作。无论是点击、画框、涂鸦等视觉提示，还是输入文字、语音等语言指令，甚至混合使用多种提示方式，SEEM 都能精准理解并执行分割任务。\n\n这款工具特别适合计算机视觉研究人员、AI 开发者以及需要高效图像处理的设计师使用。研究人员可基于其开源代码探索多模态交互的新边界；开发者能将其集成到聊天机器人或编辑软件中，构建更智能的应用；设计师则能利用其灵活的交互方式快速提取素材，提升工作流效率。\n\nSEEM 的核心亮点在于其卓越的多模态融合能力与泛化性。作为 NeurIPS 2023 的接收论文成果，它不仅支持自定义提示组合，还能在无需额外训练的情况下适应各种新颖的交互需求。目前，SEEM 的技术已被应用于 LLaVA-Interactive 和 GPT-4V 的视觉定位增强等前沿项目中，展现了其在交互式图像编辑领域的巨大潜力。","# 👀*SEEM:* Segment Everything Everywhere All at Once\n\n:grapes: \\[[Read our arXiv Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.06718.pdf)\\] &nbsp; :apple: \\[[Try our Demo](http:\u002F\u002Fsemantic-sam.xyzou.net:6090\u002F)\\] \n\nWe introduce **SEEM** that can **S**egment **E**verything **E**verywhere with **M**ulti-modal prompts all at once. SEEM allows users to easily segment an image using prompts of different types including visual prompts (points, marks, boxes, scribbles and image segments) and language prompts (text and audio), etc. 
It can also work with any combination of prompts or generalize to custom prompts!\n\nby [Xueyan Zou*](https:\u002F\u002Fmaureenzou.github.io\u002F), [Jianwei Yang*](https:\u002F\u002Fjwyang.github.io\u002F), [Hao Zhang*](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=B8hPxMQAAAAJ&hl=en),  [Feng Li*](https:\u002F\u002Ffengli-ust.github.io\u002F), [Linjie Li](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=WR875gYAAAAJ&hl=en), [Jianfeng Wang](http:\u002F\u002Fjianfengwang.me\u002F), [Lijuan Wang](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=cDcWXuIAAAAJ&hl=zh-CN), [Jianfeng Gao^](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fpeople\u002Fjfgao\u002F?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fum%2Fpeople%2Fjfgao%2F), [Yong Jae Lee^](https:\u002F\u002Fpages.cs.wisc.edu\u002F~yongjaelee\u002F), in **NeurIPS 2023**.\n\nA brief introduction of all the generic and interactive segmentation tasks we can do!\n\n![SEEM design](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_a2f2edc48c74.png)\n\n## :rocket: Updates\n* **[2023.11.2]**  SEEM is applied in [LLaVA-Interactive](https:\u002F\u002Fllava-vl.github.io\u002Fllava-interactive\u002F): an all-in-one demo for Image Chat, Segmentation, Generation and Editing. Experience the future of interactive image editing with visual chat.\n[[Project Page](https:\u002F\u002Fllava-vl.github.io\u002Fllava-interactive\u002F)] [[Demo](https:\u002F\u002F6dd3-20-163-117-69.ngrok-free.app\u002F)] [[Code](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-Interactive-Demo)] [[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00571)]\n* **[2023.10.23]**  SEEM is used in [Set-of-Mark Prompting](https:\u002F\u002Fsom-gpt4v.github.io\u002F): a brand-new visual prompting technique for GPT-4V! It totally unleashes the extraordinary visual grounding power of GPT-4V!\n[[Project Page](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSoM)] [[Code](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSoM)] [[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.11441)]\n* **[2023.10.10]** We release the training [log](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fraw\u002Fmain\u002Fseem_v1_focall_unicl.log) for SEEM-Large-v1 and [log](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fraw\u002Fmain\u002Fseem_v1_focalt_unicl.log) for SEEM-Tiny-v1!\n* **[2023.10.04]** We are excited to release :white_check_mark: [training\u002Fevaluation\u002Fdemo code](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once\u002Fedit\u002Fv1.0\u002FREADME.md#bookmark_tabs-catalog), :white_check_mark: [new checkpoints](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once\u002Fedit\u002Fv1.0\u002FREADME.md#bookmark_tabs-catalog), and :white_check_mark: [comprehensive readmes](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once\u002Fedit\u002Fv1.0\u002FREADME.md#bookmark_tabs-catalog) for ***both X-Decoder and SEEM***!\n* **[2023.09.25]** Our work has been accepted to NeurIPS 2023!\n* **[2023.07.27]** We are excited to release our [X-Decoder](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FX-Decoder) training code! 
We will release its descendant SEEM training code very soon!\n* **[2023.07.10]** We release [Semantic-SAM](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSemantic-SAM), a universal image segmentation model to enable segment and recognize anything at any desired granularity. Code and checkpoint are available!\n* **[2023.05.02]** We have released the [SEEM Focal-L](https:\u002F\u002Fprojects4jw.blob.core.windows.net\u002Fx-decoder\u002Frelease\u002Fseem_focall_v1.pt) and [X-Decoder Focal-L](https:\u002F\u002Fprojects4jw.blob.core.windows.net\u002Fx-decoder\u002Frelease\u002Fxdecoder_focall_last.pt) checkpoints and [configs](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once\u002Fblob\u002Fmain\u002Fdemo_code\u002Fconfigs\u002Fseem\u002Fseem_focall_lang.yaml)!\n* **[2023.04.28]** We have updated the [ArXiv](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.06718.pdf) that shows *better interactive segmentation results than SAM*, which trained on x50 more data than us!\n* **[2023.04.26]** We have released the [Demo Code](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once\u002Ftree\u002Fmain\u002Fdemo_code) and [SEEM-Tiny Checkpoint](https:\u002F\u002Fprojects4jw.blob.core.windows.net\u002Fx-decoder\u002Frelease\u002Fseem_focalt_v1.pt)! Please try the One-Line Started!\n* **[2023.04.20]** SEEM Referring Video Segmentation is out! Please try the [Video Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fxdecoder\u002FSEEM) and take a look at the [NERF examples](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once#tulip-nerf-examples).\n\n## :bookmark_tabs: Catalog\nWe release the following contents for **both SEEM and X-Decoder**:exclamation:\n- [x] Demo Code\n- [x] Model Checkpoint\n- [x] Comprehensive User Guide\n- [x] Training Code\n- [x] Evaluation Code\n\n:point_right: **One-Line SEEM Demo with Linux:**\n```sh\ngit clone git@github.com:UX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once.git && sh assets\u002Fscripts\u002Frun_demo.sh\n```\n\n:round_pushpin: *[New]* **Getting Started:**\n\n* [INSTALL.md](assets\u002Freadmes\u002FINSTALL.md) \u003Cbr>\n* [DATASET.md](assets\u002Freadmes\u002FDATASET.md) \u003Cbr>\n* [TRAIN.md](assets\u002Freadmes\u002FTRAIN.md) \u003Cbr>\n* [EVAL.md](assets\u002Freadmes\u002FEVAL.md)\n\n:round_pushpin: *[New]* **Latest Checkpoints and Numbers:**\n|                 |                                                                                      |          | COCO |      |      | Ref-COCOg |      |      | VOC   |       | SBD   |       |\n|-----------------|---------------------------------------------------------------------------------------------|------------|------|------|------|-----------|------|------|-------|-------|-------|-------|\n| Method          |  Checkpoint                                                                                  | Backbone | PQ &uarr;  | mAP &uarr; | mIoU &uarr; | cIoU  &uarr; | mIoU &uarr; | AP50 &uarr; | NoC85 &darr; | NoC90 &darr;| NoC85 &darr;| NoC90 &darr;|\n| X-Decoder       |  [ckpt](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FX-Decoder\u002Fresolve\u002Fmain\u002Fxdecoder_focalt_last.pt) | Focal-T  | 50.8 | 39.5 | 62.4 | 57.6      | 63.2 | 71.6 | -     | -     | -     | -     |\n| X-Decoder-oq201 |  [ckpt](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FX-Decoder\u002Fresolve\u002Fmain\u002Fxdecoder_focall_last.pt) | Focal-L  | 56.5 | 46.7 | 67.2 | 62.8      | 67.5 | 
76.3 | -     | -     | -     | -     |\n| SEEM_v0            | [ckpt](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fresolve\u002Fmain\u002Fseem_focalt_v0.pt)      | Focal-T  | 50.6 | 39.4 | 60.9 | 58.5      | 63.5 | 71.6 | 3.54  | 4.59  | *     | *     |\n| SEEM_v0            |  -                                                                                           | Davit-d3 | 56.2 | 46.8 | 65.3 | 63.2      | 68.3 | 76.6 | 2.99  | 3.89  | 5.93  | 9.23  |\n| SEEM_v0      | [ckpt](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fresolve\u002Fmain\u002Fseem_focall_v0.pt)       | Focal-L  | 56.2 | 46.4 | 65.5 | 62.8      | 67.7 | 76.2 | 3.04  | 3.85  | *     | *     |\n| SEEM_v1      | [ckpt](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fresolve\u002Fmain\u002Fseem_samvitb_v1.pt)       | SAM-ViT-B  | 52.0 | 43.5 | 60.2 | 54.1      | 62.2 | 69.3 | 2.53  | 3.23  | *     | *     |\n| SEEM_v1       | [ckpt](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fresolve\u002Fmain\u002Fseem_samvitl_v1.pt)       | SAM-ViT-L  | 49.0 | 41.6 | 58.2 | 53.8      | 62.2 | 69.5 | 2.40  | 2.96  | *     | *     |\n| SEEM_v1      | [ckpt](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fresolve\u002Fmain\u002Fseem_focalt_v1.pt)\u002F[log](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fraw\u002Fmain\u002Fseem_v1_focalt_unicl.log)       | Focal-T  | 50.8 | 39.4 | 60.7 |   58.5    |  63.7 | 72.0 | 3.19  | 4.13  | *     | *     |\n| SEEM_v1      | [ckpt](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fblob\u002Fmain\u002Fseem_focall_v1.pt)\u002F[log](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fblob\u002Fmain\u002Fseem_v1_focall_unicl.log)      | Focal-L  | 56.1 | 46.3 | 65.8 |   62.4    |  67.8 | 76.0 | 2.66  | 3.44  | *     | *     |\n\n**SEEM_v0:** Supporting Single Interactive object training and inference \u003Cbr>\n**SEEM_v1:** Supporting Multiple Interactive objects training and inference\n\n\u003Cdiv  align=\"center\">    \n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_6b153287d17b.gif\" width=\"400\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_61110992b274.gif\" width=\"400\" \u002F>   \n\u003C\u002Fdiv>\n\n:fire: **Related projects:**\n\n* [FocalNet](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FFocalNet) and [DaViT](https:\u002F\u002Fgithub.com\u002Fdingmyu\u002Fdavit) : We used FocalNet and DaViT as the vision backbones.\n* [UniCL](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FUniCL) : We used unified contrastive learning technique for learning image-text representations.\n* [X-Decoder](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FX-Decoder) : We built SEEM based on X-Decoder which is a generalist decoder that can do multiple tasks with one model only.\n\n:fire: **Other projects you may find interesting:**\n* [Semantic-SAM](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSemantic-SAM), a universal image segmentation model to enable segment and recognize anything at any desired granularity\n* [OpenSeed](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FOpenSeeD) : Strong open-set segmentation methods.\n* [Grounding SAM](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FGrounded-Segment-Anything) : Combining Grounding DINO and Segment Anything; [Grounding 
DINO](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FGroundingDINO): A strong open-set detection model.\n* [X-GPT](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FX-Decoder\u002Ftree\u002Fxgpt) : Conversational Visual Agent supported by X-Decoder.\n* [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA) : Large Language and Vision Assistant.\n\n## :bulb: Highlights\nInspired by the appealing universal interface in LLMs, we are advocating a universal, interactive multi-modal interface for any type of segmentation with **ONE SINGLE MODEL**. We emphasize **4** important features of **SEEM** below.\n1. **Versatility**: works with various types of prompts, for example, clicks, boxes, polygons, scribbles, texts, and referring images;\n2. **Compositionality**: deals with any composition of prompts;\n3. **Interactivity**: interacts with the user over multiple rounds, thanks to the memory prompt of **SEEM** that stores the session history;\n4. **Semantic awareness**: gives a semantic label to any predicted mask;\n\n## :unicorn: How to use the demo\n- Try our default examples first;\n- Upload an image;\n- Select at least one type of prompt of your choice (if you want to use a referred region of another image, please check \"Example\" and upload another image in the referring image panel);\n- Remember to provide the actual prompt for each prompt type you select, otherwise you will get an error (e.g., remember to draw on the referring image);\n- Our model by default supports the **vocabulary** of the 80 COCO categories; others will be classified as 'others' or misclassified. If you want to segment using open-vocabulary labels, include the text label via the 'text' button after drawing scribbles.\n- Click \"Submit\" and wait for a few seconds.\n\n## :volcano: An interesting example\nAn example of Transformers. The referred image is the truck form of Optimus Prime. Our model can always segment Optimus Prime in target images no matter which form it is in. 
Thanks Hongyang Li for this fun example.\n\n\u003Cdiv  align=\"center\">    \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_577d7e716c0a.png\" width = \"700\" alt=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_577d7e716c0a.png\" align=center \u002F>\n\u003C\u002Fdiv>\n\n## :tulip: NERF Examples\n* Inspired by the example in [SA3D](https:\u002F\u002Fgithub.com\u002FJumpat\u002FSegmentAnythingin3D), we tried SEEM on NERF examples and it works well :)\n\n\u003Cdiv  align=\"center\">    \n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_6070b73edc7a.gif\" width=\"400\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_c3c8a414f655.gif\" width=\"400\" \u002F> \n\u003C\u002Fdiv>\n\n## :camping: Click, scribble to mask\nWith a simple click or stroke from the user, we can generate the masks and the corresponding category labels for them.\n\n![SEEM design](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_2807064b9d13.png)\n## :mountain_snow: Text to mask\nSEEM can generate the mask from text input provided by the user, enabling multi-modal interaction with humans.\n\n![example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_2dcbb49bd55c.png)\n\u003C!-- \n\u003Cdiv  align=\"center\">    \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_2dcbb49bd55c.png\" width = \"700\" alt=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_2dcbb49bd55c.png\" align=center \u002F>\n\u003C\u002Fdiv> -->\n\n## :mosque: Referring image to mask\nWith a simple click or stroke on the referring image, the model is able to segment objects with similar semantics in the target images.\n![example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_877c00c64fc2.png)\n\nSEEM understands spatial relationships very well. Look at the three zebras! The segmented zebras have similar positions to the referred zebras. For example, when the leftmost zebra is referred to on the upper row, the leftmost zebra on the bottom row is segmented.\n![example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_aec80088e12c.png)\n\n## :blossom: Referring image to video mask\nNo training on video data is needed; SEEM works perfectly for segmenting videos with whatever queries you specify!\n![example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_f6ee99f3b3c7.png)\n\n## :sunflower: Audio to mask\nWe use Whisper to turn audio into a text prompt to segment the object. 
Try it in our demo!\n\n\u003Cdiv  align=\"center\">    \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_c98b48f1c67f.png\" width = \"900\" alt=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_c98b48f1c67f.png\" align=center \u002F>\n\u003C\u002Fdiv>\n\n\u003C!-- ## 🔥 Combination of different prompts to mask -->\n\n## :deciduous_tree: Examples of different styles\nAn example of segmenting a meme.\n\u003Cdiv  align=\"center\">    \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_c9d1015e948e.png\" width = \"500\" alt=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_c9d1015e948e.png\" align=center \u002F>\n\u003C\u002Fdiv>\n\nAn example of segmenting trees in cartoon style.\n\u003Cdiv  align=\"center\">    \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_2eeed34f2838.png\" width = \"700\" alt=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_2eeed34f2838.png\" align=center \u002F>\n\u003C\u002Fdiv>\n\nAn example of segmenting a Minecraft image.\n\u003Cdiv  align=\"center\">    \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_def33406fed6.png\" width = \"700\" alt=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_def33406fed6.png\" align=center \u002F>\n\u003C\u002Fdiv>\n\u003C!-- ![example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_def33406fed6.png) -->\nAn example of using referring image on a popular teddy bear.\n\n![example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_bda006dfb90c.png)\n\n## Model\n![SEEM design](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_f7ad7fff214a.png)\n\n## Comparison with SAM\nIn the following figure, we compare the levels of interaction and semantics of three segmentation tasks (edge detection, open-set, and interactive segmentation). Open-set Segmentation usually requires a high level of semantics and does not require interaction. Compared with [SAM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.02643), SEEM covers a wider range of interaction and semantics levels.  For example, SAM only supports limited interaction types like points and boxes, while misses high-semantic tasks since it does not output semantic labels itself. The reasons are: First, SEEM has a unified prompt encoder that encodes all visual and language prompts into a joint representation space. In consequence, SEEM can support more general usages. It has potential to extend to custom prompts. 
Second, SEEM works very well on text to mask (grounding segmentation) and outputs semantic-aware predictions.\n\u003Cdiv  align=\"center\">    \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_49297ad5571a.jpg\" width = \"500\" alt=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_49297ad5571a.jpg\" align=center \u002F>\n\u003C\u002Fdiv>\n\u003C!-- This figure shows a comparison of our model with concurrent work SAM on the level of interactions and semantics. The x-axis and y-axis denote the level of interaction and semantics, respectively. Three segmentation tasks are shown, including Open-set Segmentation, Edge detection, and Interactive Segmentation. These tasks have different levels of interactions and semantics. For example, Open-set Segmentation usually requires a high level of semantics and does not require interaction. Compared with SAM, our model covers a wider range of interaction and semantics levels. For example, SAM only supports limited interaction types like points and boxes, while misses high-semantic tasks since it does not output semantic labels itself. Note that although we do not report edge detection results, our model can support it by simply converting masks to edges. -->\n\n## :cupid: Acknowledgements\n- We appreciate hugging face for the GPU support on demo!\n\n\n\u003C!-- ## Citation (update when paper is available on arxiv)\nIf you find this project helpful for your research, please consider citing the following BibTeX entry.\n```BibTex\n\n``` -->\n","# 👀*SEEM:* 一次性分割所有地方的一切\n\n:grapes: \\[[阅读我们的 arXiv 论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.06718.pdf)\\] &nbsp; :apple: \\[[体验我们的演示](http:\u002F\u002Fsemantic-sam.xyzou.net:6090\u002F)\\] \n\n我们提出了 **SEEM**，它能够通过多模态提示一次性在任何地方分割任何内容。SEEM 允许用户使用多种类型的提示轻松地对图像进行分割，包括视觉提示（点、标记、框、涂鸦和图像片段）以及语言提示（文本和音频）等。它还可以处理任意组合的提示，甚至可以推广到自定义提示！\n\n作者：[Xueyan Zou*](https:\u002F\u002Fmaureenzou.github.io\u002F)、[Jianwei Yang*](https:\u002F\u002Fjwyang.github.io\u002F)、[Hao Zhang*](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=B8hPxMQAAAAJ&hl=en)、[Feng Li*](https:\u002F\u002Ffengli-ust.github.io\u002F)、[Linjie Li](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=WR875gYAAAAJ&hl=en)、[Jianfeng Wang](http:\u002F\u002Fjianfengwang.me\u002F)、[Lijuan Wang](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=cDcWXuIAAAAJ&hl=zh-CN)、[Jianfeng Gao^](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fpeople\u002Fjfgao\u002F?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fum%2Fpeople%2Fjfgao%2F)、[Yong Jae Lee^](https:\u002F\u002Fpages.cs.wisc.edu\u002F~yongjaelee\u002F)，发表于 **NeurIPS 2023**。\n\n简要介绍我们可以完成的所有通用和交互式分割任务！\n\n![SEEM 设计](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_a2f2edc48c74.png)\n\n## :rocket: 最新动态\n* **[2023.11.2]**  SEEM 被应用于 [LLaVA-Interactive](https:\u002F\u002Fllava-vl.github.io\u002Fllava-interactive\u002F)：一个集图像聊天、分割、生成和编辑于一体的全能演示。通过视觉聊天体验交互式图像编辑的未来。\n[[项目页面](https:\u002F\u002Fllava-vl.github.io\u002Fllava-interactive\u002F)] [[演示](https:\u002F\u002F6dd3-20-163-117-69.ngrok-free.app\u002F)] [[代码](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-Interactive-Demo)] [[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.00571)]\n* **[2023.10.23]**  SEEM 被用于 [Set-of-Mark Prompting](https:\u002F\u002Fsom-gpt4v.github.io\u002F)：一种全新的针对 GPT-4V 的视觉提示技术！它彻底释放了 GPT-4V 
非凡的视觉定位能力！\n[[项目页面](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSoM)] [[代码](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSoM)] [[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.11441)]\n* **[2023.10.10]** 我们发布了 SEEM-Large-v1 的训练 [日志](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fraw\u002Fmain\u002Fseem_v1_focall_unicl.log) 和 SEEM-Tiny-v1 的 [日志](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fraw\u002Fmain\u002Fseem_v1_focalt_unicl.log)！\n* **[2023.10.04]** 我们很高兴地发布：:white_check_mark: ***X-Decoder 和 SEEM*** 的 [训练\u002F评估\u002F演示代码](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once\u002Fedit\u002Fv1.0\u002FREADME.md#bookmark_tabs-catalog)，:white_check_mark: [新的检查点](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once\u002Fedit\u002Fv1.0\u002FREADME.md#bookmark_tabs-catalog)，以及 :white_check_mark: [全面的说明文档](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once\u002Fedit\u002Fv1.0\u002FREADME.md#bookmark_tabs-catalog)！\n* **[2023.09.25]** 我们的论文已被 NeurIPS 2023 接受！\n* **[2023.07.27]** 我们很高兴地发布了我们的 [X-Decoder](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FX-Decoder) 训练代码！其衍生模型 SEEM 的训练代码也将很快发布！\n* **[2023.07.10]** 我们发布了 [Semantic-SAM](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSemantic-SAM)，这是一种通用的图像分割模型，能够在任何所需的粒度上实现对任何内容的分割和识别。代码和检查点现已可用！\n* **[2023.05.02]** 我们发布了 [SEEM Focal-L](https:\u002F\u002Fprojects4jw.blob.core.windows.net\u002Fx-decoder\u002Frelease\u002Fseem_focall_v1.pt) 和 [X-Decoder Focal-L](https:\u002F\u002Fprojects4jw.blob.core.windows.net\u002Fx-decoder\u002Frelease\u002Fxdecoder_focall_last.pt) 检查点以及 [配置文件](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once\u002Fblob\u002Fmain\u002Fdemo_code\u002Fconfigs\u002Fseem\u002Fseem_focall_lang.yaml)！\n* **[2023.04.28]** 我们更新了 [ArXiv](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.06718.pdf)，其中展示了 *比 SAM 更好的交互式分割效果*，而 SAM 的训练数据量是我们的 50 倍！\n* **[2023.04.26]** 我们发布了 [演示代码](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once\u002Ftree\u002Fmain\u002Fdemo_code) 和 [SEEM-Tiny 检查点](https:\u002F\u002Fprojects4jw.blob.core.windows.net\u002Fx-decoder\u002Frelease\u002Fseem_focalt_v1.pt)！请尝试一键启动的版本！\n* **[2023.04.20]** SEEM 引用视频分割功能上线！请试用 [视频演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fxdecoder\u002FSEEM) 并查看 [NERF 示例](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once#tulip-nerf-examples)。\n\n## :bookmark_tabs: 目录\n我们发布了以下内容，适用于 **SEEM 和 X-Decoder**！\n- [x] 演示代码\n- [x] 模型检查点\n- [x] 全面的用户指南\n- [x] 训练代码\n- [x] 评估代码\n\n:point_right: **Linux 下的一行 SEEM 演示：**\n```sh\ngit clone git@github.com:UX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once.git && sh assets\u002Fscripts\u002Frun_demo.sh\n```\n\n:round_pushpin: *[新]* **快速入门：**\n\n* [INSTALL.md](assets\u002Freadmes\u002FINSTALL.md) \u003Cbr>\n* [DATASET.md](assets\u002Freadmes\u002FDATASET.md) \u003Cbr>\n* [TRAIN.md](assets\u002Freadmes\u002FTRAIN.md) \u003Cbr>\n* [EVAL.md](assets\u002Freadmes\u002FEVAL.md)\n\n:round_pushpin: *[新]* **最新检查点与指标：**\n|                 |                                                                                      |          | COCO |      |      | Ref-COCOg |      |      | VOC   |       | SBD   |       
|\n|-----------------|---------------------------------------------------------------------------------------------|------------|------|------|------|-----------|------|------|-------|-------|-------|-------|\n| 方法          |  Checkpoint                                                                                  | Backbone | PQ &uarr;  | mAP &uarr; | mIoU &uarr; | cIoU  &uarr; | mIoU &uarr; | AP50 &uarr; | NoC85 &darr; | NoC90 &darr;| NoC85 &darr;| NoC90 &darr;|\n| X-Decoder       |  [ckpt](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FX-Decoder\u002Fresolve\u002Fmain\u002Fxdecoder_focalt_last.pt) | Focal-T  | 50.8 | 39.5 | 62.4 | 57.6      | 63.2 | 71.6 | -     | -     | -     | -     |\n| X-Decoder-oq201 |  [ckpt](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FX-Decoder\u002Fresolve\u002Fmain\u002Fxdecoder_focall_last.pt) | Focal-L  | 56.5 | 46.7 | 67.2 | 62.8      | 67.5 | 76.3 | -     | -     | -     | -     |\n| SEEM_v0            | [ckpt](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fresolve\u002Fmain\u002Fseem_focalt_v0.pt)      | Focal-T  | 50.6 | 39.4 | 60.9 | 58.5      | 63.5 | 71.6 | 3.54  | 4.59  | *     | *     |\n| SEEM_v0            |  -                                                                                           | Davit-d3 | 56.2 | 46.8 | 65.3 | 63.2      | 68.3 | 76.6 | 2.99  | 3.89  | 5.93  | 9.23  |\n| SEEM_v0      | [ckpt](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fresolve\u002Fmain\u002Fseem_focall_v0.pt)       | Focal-L  | 56.2 | 46.4 | 65.5 | 62.8      | 67.7 | 76.2 | 3.04  | 3.85  | *     | *     |\n| SEEM_v1      | [ckpt](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fresolve\u002Fmain\u002Fseem_samvitb_v1.pt)       | SAM-ViT-B  | 52.0 | 43.5 | 60.2 | 54.1      | 62.2 | 69.3 | 2.53  | 3.23  | *     | *     |\n| SEEM_v1       | [ckpt](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fresolve\u002Fmain\u002Fseem_samvitl_v1.pt)       | SAM-ViT-L  | 49.0 | 41.6 | 58.2 | 53.8      | 62.2 | 69.5 | 2.40  | 2.96  | *     | *     |\n| SEEM_v1      | [ckpt](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fresolve\u002Fmain\u002Fseem_focalt_v1.pt)\u002F[log](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fraw\u002Fmain\u002Fseem_v1_focalt_unicl.log)       | Focal-T  | 50.8 | 39.4 | 60.7 |   58.5    |  63.7 | 72.0 | 3.19  | 4.13  | *     | *     |\n| SEEM_v1      | [ckpt](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fblob\u002Fmain\u002Fseem_focall_v1.pt)\u002F[log](https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fblob\u002Fmain\u002Fseem_v1_focall_unicl.log)      | Focal-L  | 56.1 | 46.3 | 65.8 |   62.4    |  67.8 | 76.0 | 2.66  | 3.44  | *     | *     |\n\n**SEEM_v0:** 支持单个交互式对象的训练和推理 \u003Cbr>\n**SEEM_v1:** 支持多个交互式对象的训练和推理\n\n\u003Cdiv  align=\"center\">    \n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_6b153287d17b.gif\" width=\"400\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_61110992b274.gif\" width=\"400\" \u002F>   \n\u003C\u002Fdiv>\n\n:fire: **相关项目：**\n\n* [FocalNet](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FFocalNet) 和 [DaViT](https:\u002F\u002Fgithub.com\u002Fdingmyu\u002Fdavit) : 我们使用 FocalNet 和 DaViT 作为视觉骨干网络。\n* [UniCL](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FUniCL) : 我们使用统一对比学习技术来学习图像-文本表示。\n* 
[X-Decoder](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FX-Decoder) : 我们基于 X-Decoder 构建了 SEEM，X-Decoder 是一种通用解码器，可以仅用一个模型完成多项任务。\n\n:fire: **其他你可能会感兴趣的项目：**\n* [Semantic-SAM](https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSemantic-SAM)，一种通用图像分割模型，可在任何所需的粒度上实现对任何事物的分割和识别\n* [OpenSeed](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FOpenSeeD) ：强大的开放集分割方法。\n* [Grounding SAM](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FGrounded-Segment-Anything) ：结合 Grounding DINO 和 Segment Anything；[Grounding DINO](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FGroundingDINO)：一种强大的开放集检测模型。\n* [X-GPT](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FX-Decoder\u002Ftree\u002Fxgpt) ：由 X-Decoder 提供支持的对话式视觉智能体。\n* [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA) ：大型语言和视觉助手。\n\n## :bulb: 亮点\n受 LLM 中吸引人的通用接口启发，我们倡导使用**单一模型**为任何类型的分割构建通用、交互式的多模态接口。我们强调 **SEEM** 的以下 **4** 个重要特性。\n1. **多功能性**：可处理多种类型的提示，例如点击、框选、多边形、涂鸦、文本以及指代图像；\n2. **组合性**：能够处理任意组合的提示；\n3. **交互性**：通过 **SEEM** 的记忆提示功能存储会话历史，从而实现多轮交互；\n4. **语义感知**：为任何预测的掩码赋予语义标签；\n\n## :unicorn: 如何使用演示\n- 首先尝试我们的默认示例；\n- 上传一张图片；\n- 选择至少一种你想要使用的提示类型（如果你想使用另一张图片中的参考区域，请查看“示例”并在参考图片面板中上传另一张图片）；\n- 请务必为每种你选择的提示类型提供实际的提示，否则会出现错误（例如，在参考图片上绘制时要记得画出来）；\n- 我们的模型默认支持 COCO 80 类别的**词汇表**，其他类别将被归类为“其他”或出现分类错误。如果你想使用开放词汇标签进行分割，请在绘制涂鸦后在“文本”按钮中输入文本标签。\n- 点击“提交”并等待几秒钟。\n\n## :volcano: 一个有趣的例子\n变形金刚的一个例子。参考图是擎天柱的卡车形态。无论擎天柱处于何种形态，我们的模型总能在目标图像中将其分割出来。感谢 Hongyang Li 提供的这个有趣示例。\n\n\u003Cdiv  align=\"center\">    \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_577d7e716c0a.png\" width = \"700\" alt=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_577d7e716c0a.png\" align=center \u002F>\n\u003C\u002Fdiv>\n\n## :tulip: NERF 示例\n* 受到 [SA3D](https:\u002F\u002Fgithub.com\u002FJumpat\u002FSegmentAnythingin3D) 中示例的启发，我们在 NERF 示例上尝试了 SEEM，效果非常好 :)\n\n\u003Cdiv  align=\"center\">    \n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_6070b73edc7a.gif\" width=\"400\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_c3c8a414f655.gif\" width=\"400\" \u002F> \n\u003C\u002Fdiv>\n\n## :camping: 点击、涂鸦生成掩码\n只需用户简单点击或涂抹，我们就能生成对应的掩码及其类别标签。\n\n![SEEM 设计](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_2807064b9d13.png)\n## :mountain_snow: 文本转掩码\nSEEM 可以根据用户的文本输入生成掩码，实现与人类的多模态交互。\n\n![示例](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_2dcbb49bd55c.png)\n\u003C!-- \n\u003Cdiv  align=\"center\">    \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_2dcbb49bd55c.png\" width = \"700\" alt=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_2dcbb49bd55c.png\" align=center \u002F>\n\u003C\u002Fdiv> -->\n\n## :mosque: 引用图像转掩码\n只需在参考图像上简单点击或涂抹，模型就能在目标图像上分割出语义相似的对象。\n![示例](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_877c00c64fc2.png)\n\nSEEM 
对空间关系的理解非常到位。看看这三只斑马！分割出的斑马位置与参考斑马的位置非常接近。例如，当上排最左边的斑马被引用时，下排最左边的斑马也会被分割出来。\n![示例](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_aec80088e12c.png)\n\n## :blossom: 引用图像转视频掩码\n无需对视频数据进行训练，SEEM 能够完美地根据您指定的任何查询来分割视频！\n![示例](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_f6ee99f3b3c7.png)\n\n## :sunflower: 音频转掩码\n我们使用 Whisper 将音频转换为文本提示，从而分割目标物体。快来我们的演示中试试吧！\n\n\u003Cdiv  align=\"center\">    \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_c98b48f1c67f.png\" width = \"900\" alt=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_c98b48f1c67f.png\" align=center \u002F>\n\u003C\u002Fdiv>\n\n\u003C!-- ## 🔥 不同提示组合生成掩码 -->\n\n## :deciduous_tree: 多种风格示例\n一个分割表情包的例子。\n\u003Cdiv  align=\"center\">    \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_c9d1015e948e.png\" width = \"500\" alt=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_c9d1015e948e.png\" align=center \u002F>\n\u003C\u002Fdiv>\n\n一个分割卡通风格树木的例子。\n\u003Cdiv  align=\"center\">    \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_2eeed34f2838.png\" width = \"700\" alt=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_2eeed34f2838.png\" align=center \u002F>\n\u003C\u002Fdiv>\n\n一个分割 Minecraft 图片的例子。\n\u003Cdiv  align=\"center\">    \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_def33406fed6.png\" width = \"700\" alt=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_def33406fed6.png\" align=center \u002F>\n\u003C\u002Fdiv>\n\u003C!-- ![示例](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_def33406fed6.png) -->\n一个使用引用图像处理热门泰迪熊的例子。\n\n![示例](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_bda006dfb90c.png)\n\n## 模型\n![SEEM 设计](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_f7ad7fff214a.png)\n\n## 与 SAM 的对比\n在下图中，我们比较了三种分割任务（边缘检测、开放集分割和交互式分割）在交互性和语义性方面的差异。开放集分割通常需要较高的语义理解，且不需要交互；而与 [SAM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.02643) 相比，SEEM 覆盖的交互和语义范围更广。例如，SAM 仅支持点和框等有限的交互方式，且由于自身不输出语义标签，无法完成高语义的任务。其原因在于：首先，SEEM 拥有统一的提示编码器，能够将所有视觉和语言提示编码到一个联合表示空间中，因此可以支持更广泛的使用场景，并具备扩展至自定义提示的能力。其次，SEEM 在文本到掩码（接地分割）方面表现优异，并能输出具有语义感知的预测结果。\n\u003Cdiv  align=\"center\">    \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_49297ad5571a.jpg\" width = \"500\" alt=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_readme_49297ad5571a.jpg\" align=center \u002F>\n\u003C\u002Fdiv>\n\u003C!-- 该图展示了我们模型与同期工作 SAM 在交互性和语义性方面的对比。横轴和纵轴分别代表交互性和语义性的程度。图中展示了三种分割任务：开放集分割、边缘检测和交互式分割。这些任务在交互性和语义性方面各有不同。例如，开放集分割通常需要较高的语义理解，且无需交互。相比之下，我们的模型覆盖了更广泛的交互和语义层次。例如，SAM 
仅支持点和框等有限的交互方式，而由于自身不输出语义标签，无法完成高语义的任务。需要注意的是，虽然我们未报告边缘检测的结果，但通过将掩码转换为边缘线，我们的模型同样可以实现这一功能。 -->\n\n## :cupid: 致谢\n- 我们感谢 Hugging Face 在演示中提供的 GPU 支持！\n\n\n\u003C!-- ## 引用（待论文上线 arXiv 后更新）\n如果您觉得本项目对您的研究有所帮助，请考虑引用以下 BibTeX 条目。\n```BibTex\n\n``` -->","# SEEM 快速上手指南\n\n**SEEM (Segment Everything Everywhere All at Once)** 是一个通用的交互式多模态图像分割模型。它支持使用多种提示（如点击、框选、涂鸦、文本、参考图等）一次性分割图像中的任何对象，并能为分割结果提供语义标签。\n\n## 1. 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux (推荐 Ubuntu 18.04+)\n*   **Python**: 3.8 或更高版本\n*   **GPU**: 支持 CUDA 的 NVIDIA 显卡 (推荐显存 ≥ 8GB)\n*   **前置依赖**:\n    *   PyTorch (建议 1.10+)\n    *   torchvision\n    *   detectron2\n    *   opencv-python\n    *   timm\n    *   transformers\n\n> **注意**: 本项目依赖 `detectron2`，在 Windows 上安装较为复杂，强烈建议在 Linux 环境下运行。\n\n## 2. 安装步骤\n\n### 步骤一：克隆项目代码\n\n```bash\ngit clone git@github.com:UX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once.git\ncd Segment-Everything-Everywhere-All-At-Once\n```\n\n### 步骤二：创建虚拟环境并安装依赖\n\n建议使用 Conda 创建独立环境：\n\n```bash\nconda create -n seem python=3.8 -y\nconda activate seem\n```\n\n安装 PyTorch (请根据您的 CUDA 版本调整，以下为 CUDA 11.7 示例)：\n```bash\npip install torch==2.0.1 torchvision==0.15.2 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu117\n```\n\n安装其他核心依赖及 detectron2：\n```bash\npip install -r requirements.txt\npip install 'git+https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdetectron2.git'\n```\n\n> **国内加速提示**: 如果下载依赖较慢，可添加国内镜像源参数，例如：\n> `pip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`\n\n### 步骤三：下载预训练模型\n\n您需要下载预训练权重文件才能运行演示。以下是推荐的 **SEEM v1 (Focal-L)** 模型（性能最佳）：\n\n```bash\nmkdir checkpoints\nwget -O checkpoints\u002Fseem_focall_v1.pt https:\u002F\u002Fhuggingface.co\u002Fxdecoder\u002FSEEM\u002Fresolve\u002Fmain\u002Fseem_focall_v1.pt\n```\n\n> **备选方案**: 如果 HuggingFace 下载缓慢，可尝试使用国内镜像站或手动下载后上传至服务器。\n\n## 3. 基本使用\n\n### 方式一：一键启动演示 (One-Line Demo)\n\n项目提供了自动化脚本，配置好环境和模型后，可直接运行以下命令启动本地 Web 演示界面：\n\n```bash\nsh assets\u002Fscripts\u002Frun_demo.sh\n```\n\n运行成功后，终端会显示一个本地地址（通常是 `http:\u002F\u002F127.0.0.1:6090` 或类似端口）。在浏览器中打开该地址即可使用图形化界面。\n\n**图形界面操作简述：**\n1.  **上传图片**: 点击上传按钮选择待分割图片。\n2.  **选择提示类型**: 勾选您想使用的提示方式（如 `Point`, `Box`, `Scribble`, `Text` 等）。\n3.  **输入提示**:\n    *   **视觉提示**: 在图片上点击、画框或涂鸦。\n    *   **文本提示**: 在文本框输入物体名称（支持开放词汇，若未命中 COCO 80 类，需配合涂鸦使用文本标签）。\n    *   **参考图**: 若使用参考图分割，需在对应面板上传参考图并在其上标注。\n4.  
**提交**: 点击 \"Submit\" 按钮，等待几秒即可看到分割掩码和类别标签。\n\n### 方式二：命令行推理 (Python 脚本)\n\n如果您希望通过代码调用，可以参考 `demo_code` 目录下的示例。以下是一个最小化的 Python 调用逻辑示意：\n\n```python\nimport cv2\nfrom seem.modeling.BaseModel import BaseModel as Base\nfrom seem.utils.arguments import load_opt_from_config_file\nfrom seem.utils.misc import concat_all_gather\n\n# 加载配置文件和模型权重\nopt = load_opt_from_config_file('demo_code\u002Fconfigs\u002Fseem\u002Fseem_focall_lang.yaml')\nopt.model_weight = 'checkpoints\u002Fseem_focall_v1.pt'\n\n# 初始化模型\nmodel = Base(opt).eval().cuda()\n\n# 读取图像\nimage = cv2.imread('assets\u002Fimages\u002Fexample.jpg')\n\n# 构建提示 (示例：中心点提示 + 文本提示)\n# 具体 API 调用请参考 demo_code\u002Finference_seem.py 中的详细实现\n# prompts = {'points': [[h, w]], 'texts': ['object_name']} \n# outputs = model.inference(image, prompts)\n```\n\n> **提示**: 完整的代码示例请查阅项目根目录下的 `demo_code\u002Finference_seem.py` 文件，其中包含了处理不同提示类型（点、框、文本、参考图）的完整逻辑。\n\n---\n*更多高级功能（如训练、评估、视频分割）请参考项目仓库中的 `TRAIN.md` 和 `EVAL.md` 文档。*","某电商平台的视觉运营团队需要快速从海量商品图中提取特定属性（如“红色连衣裙”或“带logo 的运动鞋”）以构建精细化营销素材库。\n\n### 没有 Segment-Everything-Everywhere-All-At-Once 时\n- **交互方式单一**：传统分割模型仅支持点击或画框，无法直接通过输入“带有金色纽扣的外套”这样的自然语言指令进行定位，运营人员需反复手动调整选区。\n- **多任务切换繁琐**：处理不同粒度需求（如区分“整件衣服”与“衣领细节”）时，需加载多个专用模型或在不同工具间切换，工作流断裂且耗时。\n- **复杂场景识别率低**：面对遮挡、重叠或背景杂乱的商品图，基于几何提示的工具极易误判边界，导致后期人工修图成本高昂。\n- **定制化能力弱**：当遇到新出现的商品类别或特殊标记时，无法通过简单的草图或音频描述让模型即时理解并执行分割，必须重新训练模型。\n\n### 使用 Segment-Everything-Everywhere-All-At-Once 后\n- **多模态自由指令**：运营人员可直接输入文本描述、绘制草图甚至录制语音指令，Segment-Everything-Everywhere-All-At-Once 能瞬间精准锁定目标区域，实现“所说即所得”。\n- **统一模型全能处理**：无需切换模型，同一个 Segment-Everything-Everywhere-All-At-Once 即可同时响应从物体整体到局部细节的不同粒度分割需求，大幅提升流转效率。\n- **鲁棒性显著增强**：凭借强大的泛化能力，即使在商品堆叠或背景复杂的图片中，它也能结合语义理解准确剥离目标，输出高质量掩码，减少 90% 的人工修正时间。\n- **零样本即时适应**：面对新品类，只需提供简单的视觉或语言提示，Segment-Everything-Everywhere-All-At-Once 即可立即泛化处理，无需任何额外训练即可上线使用。\n\nSegment-Everything-Everywhere-All-At-Once 通过融合多模态提示与通用分割能力，将原本繁琐的图像抠图工作转化为直观的自然交互，彻底重构了视觉内容生产的效率边界。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUX-Decoder_Segment-Everything-Everywhere-All-At-Once_a2f2edc4.png","UX-Decoder","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FUX-Decoder_568fcd45.png",null,"https:\u002F\u002Fgithub.com\u002FUX-Decoder",[78,82,86,90],{"name":79,"color":80,"percentage":81},"Python","#3572A5",94.9,{"name":83,"color":84,"percentage":85},"Cuda","#3A4E3A",4.5,{"name":87,"color":88,"percentage":89},"C++","#f34b7d",0.5,{"name":91,"color":92,"percentage":93},"Shell","#89e051",0.1,4775,457,"2026-04-12T02:36:28","Apache-2.0",4,"Linux","需要 NVIDIA GPU（基于 FocalNet\u002FDaViT\u002FSAM 骨干网络），具体显存和 CUDA 版本未在提供的文本中说明，但通常此类模型需要支持 CUDA 的高性能显卡","未说明",{"notes":103,"python":101,"dependencies":104},"README 中未直接列出具体的版本号和环境配置细节，仅提供了安装指南链接 (INSTALL.md)。项目依赖 X-Decoder 架构，支持多种视觉骨干网络（如 FocalNet, DaViT, SAM-ViT）。演示脚本明确针对 Linux 环境 ('One-Line SEEM Demo with Linux')。建议查阅项目仓库中的 INSTALL.md 获取准确的 Python 版本、CUDA 版本及依赖库版本要求。",[105,106,107,108,109,110],"torch","transformers","detectron2","xdecoder","focalnet","davitt",[15,112],"其他","2026-03-27T02:49:30.150509","2026-04-13T00:23:11.229875",[116,121,126,131,136,141],{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},31146,"运行 Gradio Demo 或加载页面时出现 \"KeyError: 'dataset'\" 或 ASGI 异常错误怎么办？","这通常是由于 gradio_client 版本不兼容导致的。请尝试将 gradio_client 降级到 0.2.7 版本，或者将 gradio 版本更改为 3.37.0。具体命令如下：\n1. pip install gradio_client==0.2.7\n或者\n2. 
修改依赖文件使用 gradio==3.37.0","https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once\u002Fissues\u002F59",{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},31147,"运行训练脚本时遇到 detectron2 的 \"undefined symbol\" ImportError 错误如何解决？","这通常是因为 Docker 环境与最新版本的 PyTorch 不兼容。建议显式指定安装兼容的 PyTorch 版本（如 2.0.1）。请尝试使用以下命令安装：\npip install -I torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --user\n同时确保数据集路径配置正确，避免 FileNotFoundError。","https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once\u002Fissues\u002F83",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},31148,"项目是否发布了用于微调（Finetuning）的训练代码？","是的，维护者已经发布了适用于 SEEM 和 X-Decoder 的综合训练代码。您可以访问项目主仓库获取代码并开始使用。维护者表示会持续支持该代码库。","https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once\u002Fissues\u002F28",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},31149,"Gradio Demo 中使用的检查点（Checkpoint）与发布的模型文件一致吗？","是的，官方已发布 focal-large 检查点和配置文件（例如 seem_focalt_v1.pt），这些与 Demo 中使用的模型一致。用户可以直接下载并使用这些发布的权重进行推理或测试。","https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once\u002Fissues\u002F24",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},31150,"在遥感图像分割任务中，为什么预训练模型的效果远不如在线 Demo？","在线 Demo 可能使用了特定的未公开权重或针对特定任务（如全景分割）进行了优化，而公开的预训练模型（如基于 Davit-d3 骨干网络的模型）是通用版本。目前官方尚未完全开源 Demo 中使用的特定模型权重。建议尝试使用官方发布的 focal-large 检查点，或根据文档自行在特定数据集上进行微调以获得更好效果。","https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once\u002Fissues\u002F90",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},31151,"安装 requirements.txt 时遇到 mpi4py 构建失败（Building wheel for mpi4py error）怎么办？","mpi4py 构建失败通常是因为缺少系统级的 MPI 开发库。在安装 Python 包之前，请先在操作系统层面安装 MPI。例如在 Ubuntu\u002FDebian 上运行：sudo apt-get install libopenmpi-dev openmpi-bin；在 CentOS\u002FRHEL 上运行：sudo yum install openmpi-devel。安装完系统依赖后，再重新运行 pip install 命令。","https:\u002F\u002Fgithub.com\u002FUX-Decoder\u002FSegment-Everything-Everywhere-All-At-Once\u002Fissues\u002F133",[147],{"id":148,"version":149,"summary_zh":150,"released_at":151},223064,"v1.0.0","x解码器检查点","2023-06-01T21:58:16"]