[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-tue-mps--eomt":3,"tool-tue-mps--eomt":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",159636,2,"2026-04-17T23:33:34",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":77,"owner_email":77,"owner_twitter":77,"owner_website":78,"owner_url":79,"languages":80,"stars":89,"forks":90,"last_commit_at":91,"license":92,"difficulty_score":10,"env_os":93,"env_gpu":94,"env_ram":95,"env_deps":96,"category_tags":106,"github_topics":107,"view_count":32,"oss_zip_url":77,"oss_zip_packed_at":77,"status":17,"created_at":114,"updated_at":115,"faqs":116,"releases":147},8774,"tue-mps\u002Feomt","eomt","[CVPR 2025 Highlight] Official code and models for Encoder-only Mask Transformer (EoMT).","eomt 是一款荣获 CVPR 2025 高光推荐的开源图像分割模型，其核心理念是“你的 ViT 本质上就是分割模型”。它巧妙地将标准的 Vision Transformer（ViT）直接转化为高效的分割工具，无需额外添加适配器或复杂的解码器结构。\n\n传统图像分割方法往往依赖繁琐的特定任务组件，导致模型臃肿且推理缓慢。eomt 通过极简架构解决了这一痛点：它让 ViT 同时编码图像块和分割查询，在保持与最先进方法相当精度的同时，显著提升了运行速度（例如在 ViT-L 配置下速度可提升 4 倍）。近期更新更增加了对 DINOv3 主干网络的支持，进一步刷新了全景、实例及语义分割的性能基准。\n\n这款工具特别适合计算机视觉领域的研究人员和开发者使用。如果你正在寻求轻量级、高能效的分割方案，或者希望深入探索 Transformer 
架构在底层视觉任务中的潜力，eomt 提供了极佳的实践范例。其代码已集成至 Hugging Face Transformers，便于快速调用与实验。凭借“少即是多”的设计哲学，eomt 证明了在处理分割任务时，纯粹的 Transformer 架构足以胜任，为后续研究","eomt 是一款荣获 CVPR 2025 高光推荐的开源图像分割模型，其核心理念是“你的 ViT 本质上就是分割模型”。它巧妙地将标准的 Vision Transformer（ViT）直接转化为高效的分割工具，无需额外添加适配器或复杂的解码器结构。\n\n传统图像分割方法往往依赖繁琐的特定任务组件，导致模型臃肿且推理缓慢。eomt 通过极简架构解决了这一痛点：它让 ViT 同时编码图像块和分割查询，在保持与最先进方法相当精度的同时，显著提升了运行速度（例如在 ViT-L 配置下速度可提升 4 倍）。近期更新更增加了对 DINOv3 主干网络的支持，进一步刷新了全景、实例及语义分割的性能基准。\n\n这款工具特别适合计算机视觉领域的研究人员和开发者使用。如果你正在寻求轻量级、高能效的分割方案，或者希望深入探索 Transformer 架构在底层视觉任务中的潜力，eomt 提供了极佳的实践范例。其代码已集成至 Hugging Face Transformers，便于快速调用与实验。凭借“少即是多”的设计哲学，eomt 证明了在处理分割任务时，纯粹的 Transformer 架构足以胜任，为后续研究开辟了更简洁高效的技术路径。","# Your ViT is Secretly an Image Segmentation Model  \n**CVPR 2025 ✨ Highlight** · [📄 Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.19108)\n\n**[Tommie Kerssies](https:\u002F\u002Ftommiekerssies.com)\u003Csup>1\u003C\u002Fsup>, [Niccolò Cavagnero](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=Pr4XHRAAAAAJ)\u003Csup>2,*\u003C\u002Fsup>, [Alexander Hermans](https:\u002F\u002Fscholar.google.de\u002Fcitations?user=V0iMeYsAAAAJ)\u003Csup>3\u003C\u002Fsup>, [Narges Norouzi](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=q7sm490AAAAJ)\u003Csup>1\u003C\u002Fsup>, [Giuseppe Averta](https:\u002F\u002Fwww.giuseppeaverta.me\u002F)\u003Csup>2\u003C\u002Fsup>, [Bastian Leibe](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=ZcULDB0AAAAJ)\u003Csup>3\u003C\u002Fsup>, [Gijs Dubbelman](https:\u002F\u002Fscholar.google.nl\u002Fcitations?user=wy57br8AAAAJ)\u003Csup>1\u003C\u002Fsup>, [Daan de Geus](https:\u002F\u002Fddegeus.github.io)\u003Csup>1,3\u003C\u002Fsup>**\n\n¹ Eindhoven University of Technology  \n² Polytechnic of Turin  \n³ RWTH Aachen University  \n\\* Work done while visiting RWTH Aachen University\n\n## Overview\n\nWe present the **Encoder-only Mask Transformer (EoMT)**, a minimalist image segmentation model that repurposes a plain Vision Transformer (ViT) to jointly encode image patches and 
segmentation queries as tokens. No adapters. No decoders. Just the ViT.\n\nLeveraging large-scale pre-trained ViTs, EoMT achieves accuracy similar to state-of-the-art methods that rely on complex, task-specific components. At the same time, it is significantly faster thanks to its simplicity, for example up to 4× faster with ViT-L.  \n\nTurns out, *your ViT is secretly an image segmentation model*. EoMT shows that architectural complexity isn't necessary. For segmentation, a plain Transformer is all you need.\n\n## 🚀 NEW: PMT \n\nPresenting our latest model, [PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.25398).\n\nPMT reconciles EoMT's minimal philosophy with the need to preserve the features of frozen foundation models by mimicking the last layers of EoMT and VidEoMT with a simple and fast decoder.\n\nTake a [look](https:\u002F\u002Fgithub.com\u002Ftue-mps\u002Fpmt)!\n\n## 🚀 NEW: VidEoMT \n\n🔥 We're pleased to present our latest CVPR 2026 paper, [VidEoMT: Your ViT is Secretly Also a Video Segmentation Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.17807).\n\nVidEoMT extends the EoMT philosophy to the temporal domain, introducing an encoder-only video segmentation model that is up to 10× faster than competitors.\n\nGo check it [out](https:\u002F\u002Fgithub.com\u002Ftue-mps\u002Fvideomt)! \n\n\n## 🚀 NEW: DINOv3 Support\n\n🔥 We're excited to announce support for **DINOv3** backbones! 
Our new DINOv3-based EoMT models deliver improved performance across all segmentation tasks:\n\n- **Panoptic Segmentation**: Up to 58.9 PQ on COCO with EoMT-L at 1280×1280\n- **Instance Segmentation**: Up to 49.9 mAP on COCO with EoMT-L at 1280×1280  \n- **Semantic Segmentation**: Up to 59.5 mIoU on ADE20K with EoMT-L at 512×512\n\nAll of this, at the impressive speed of EoMT!\n\nCheck out our [DINOv3 Model Zoo](model_zoo\u002Fdinov3.md) for all available EoMT configurations and performance benchmarks.\n\nThanks to the [DINOv3](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdinov3) team for providing these powerful foundation models!\n\n## 🤗 Transformers\n\nEoMT with DINOv2 is also available on [Hugging Face Transformers](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fmodel_doc\u002Feomt). See available models [here](https:\u002F\u002Fhuggingface.co\u002Fmodels?library=transformers&other=eomt&sort=trending).\n\n## Installation\n\nIf you don't have Conda installed, install Miniconda and restart your shell:\n\n```bash\nwget https:\u002F\u002Frepo.anaconda.com\u002Fminiconda\u002FMiniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh\n```\n\nThen create the environment, activate it, and install the dependencies:\n\n```bash\nconda create -n eomt python==3.13.2\nconda activate eomt\npython3 -m pip install -r requirements.txt\n```\n\n[Weights & Biases](https:\u002F\u002Fwandb.ai\u002F) (wandb) is used for experiment logging and visualization. To enable wandb, log in to your account:\n\n```bash\nwandb login\n```\n\n## Data preparation\n\nDownload the datasets below depending on which datasets you plan to use.  \nYou do **not** need to unzip any of the downloaded files.  \nSimply place them in a directory of your choice and provide that path via the `--data.path` argument.  
\nThe code will read the `.zip` files directly.\n\n**COCO**\n```bash\nwget http:\u002F\u002Fimages.cocodataset.org\u002Fzips\u002Ftrain2017.zip\nwget http:\u002F\u002Fimages.cocodataset.org\u002Fzips\u002Fval2017.zip\nwget http:\u002F\u002Fimages.cocodataset.org\u002Fannotations\u002Fannotations_trainval2017.zip\nwget http:\u002F\u002Fimages.cocodataset.org\u002Fannotations\u002Fpanoptic_annotations_trainval2017.zip\n```\n\n**ADE20K**\n```bash\nwget http:\u002F\u002Fdata.csail.mit.edu\u002Fplaces\u002FADEchallenge\u002FADEChallengeData2016.zip\nwget http:\u002F\u002Fsceneparsing.csail.mit.edu\u002Fdata\u002FChallengeData2017\u002Fannotations_instance.tar\ntar -xf annotations_instance.tar\nzip -r -0 annotations_instance.zip annotations_instance\u002F\nrm -rf annotations_instance.tar\nrm -rf annotations_instance\n```\n\n**Cityscapes**\n```bash\nwget --keep-session-cookies --save-cookies=cookies.txt --post-data 'username=\u003Cyour_username>&password=\u003Cyour_password>&submit=Login' https:\u002F\u002Fwww.cityscapes-dataset.com\u002Flogin\u002F\nwget --load-cookies cookies.txt --content-disposition https:\u002F\u002Fwww.cityscapes-dataset.com\u002Ffile-handling\u002F?packageID=1\nwget --load-cookies cookies.txt --content-disposition https:\u002F\u002Fwww.cityscapes-dataset.com\u002Ffile-handling\u002F?packageID=3\n```\n\n🔧 Replace `\u003Cyour_username>` and `\u003Cyour_password>` with your actual [Cityscapes](https:\u002F\u002Fwww.cityscapes-dataset.com\u002F) login credentials.  \n\n## Usage\n\n### Training\n\nTo train EoMT from scratch, run:\n\n```bash\npython3 main.py fit \\\n  -c configs\u002Fdinov2\u002Fcoco\u002Fpanoptic\u002Feomt_large_640.yaml \\\n  --trainer.devices 4 \\\n  --data.batch_size 4 \\\n  --data.path \u002Fpath\u002Fto\u002Fdataset\n```\n\nThis command trains the `EoMT-L` model with a 640×640 input size on COCO panoptic segmentation using 4 GPUs. Each GPU processes a batch of 4 images, for a total batch size of 16. 
Switch to ```dinov3``` in the configuration path to enable the corresponding DINOv3 model.\n\n✅ Make sure the total batch size is `devices × batch_size = 16`  \n🔧 Replace `\u002Fpath\u002Fto\u002Fdataset` with the directory containing the dataset zip files.\n\n> This configuration takes ~6 hours on 4×NVIDIA H100 GPUs, each using ~26GB VRAM.\n\nTo fine-tune a pre-trained EoMT model, add:\n\n```bash\n  --model.ckpt_path \u002Fpath\u002Fto\u002Fpytorch_model.bin \\\n  --model.load_ckpt_class_head False\n```\n\n🔧 Replace `\u002Fpath\u002Fto\u002Fpytorch_model.bin` with the path to the checkpoint to fine-tune.  \n> `--model.load_ckpt_class_head False` skips loading the classification head when fine-tuning on a dataset with different classes.\n\n> **DINOv3 Models**: When using DINOv3-based configurations, the code expects delta weights relative to DINOv3 weights by default. To disable this behavior and use absolute weights instead, add `--model.delta_weights False`. \n\n### Evaluating\n\nTo evaluate a pre-trained EoMT model, run:\n\n```bash\npython3 main.py validate \\\n  -c configs\u002Fdinov2\u002Fcoco\u002Fpanoptic\u002Feomt_large_640.yaml \\\n  --model.network.masked_attn_enabled False \\\n  --trainer.devices 4 \\\n  --data.batch_size 4 \\\n  --data.path \u002Fpath\u002Fto\u002Fdataset \\\n  --model.ckpt_path \u002Fpath\u002Fto\u002Fpytorch_model.bin\n```\n\nThis command evaluates the same `EoMT-L` model using 4 GPUs with a batch size of 4 per GPU.\n\n🔧 Replace `\u002Fpath\u002Fto\u002Fdataset` with the directory containing the dataset zip files.  \n🔧 Replace `\u002Fpath\u002Fto\u002Fpytorch_model.bin` with the path to the checkpoint to evaluate.\n\nA [notebook](inference.ipynb) is available for quick inference and visualization with auto-downloaded pre-trained models.\n\n> **DINOv3 Models**: When using DINOv3-based configurations, the code expects delta weights relative to DINOv3 weights by default. 
To disable this behavior and use absolute weights instead, add `--model.delta_weights False`. \n\n## Model Zoo\n\nWe provide pre-trained weights for both DINOv2- and DINOv3-based EoMT models.\n\n- **[DINOv2 Models](model_zoo\u002Fdinov2.md)** - Original published results and pre-trained weights.\n- **[DINOv3 Models](model_zoo\u002Fdinov3.md)** - New DINOv3-based models and pre-trained weights.\n\n## Citation\nIf you find this work useful in your research, please cite it using the BibTeX entry below:\n\n```BibTeX\n@inproceedings{kerssies2025eomt,\n  author    = {Kerssies, Tommie and Cavagnero, Niccol\\`{o} and Hermans, Alexander and Norouzi, Narges and Averta, Giuseppe and Leibe, Bastian and Dubbelman, Gijs and {de Geus}, Daan},\n  title     = {{Your ViT is Secretly an Image Segmentation Model}},\n  booktitle = {Proceedings of the IEEE\u002FCVF Conference on Computer Vision and Pattern Recognition (CVPR)},\n  year      = {2025},\n}\n```\n\n## Acknowledgements\n\nThis project builds upon code from the following libraries and repositories:\n\n- [Hugging Face Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) (Apache-2.0 License)  \n- [PyTorch Image Models (timm)](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpytorch-image-models) (Apache-2.0 License)  \n- [PyTorch Lightning](https:\u002F\u002Fgithub.com\u002FLightning-AI\u002Fpytorch-lightning) (Apache-2.0 License)  \n- [TorchMetrics](https:\u002F\u002Fgithub.com\u002FLightning-AI\u002Ftorchmetrics) (Apache-2.0 License)  \n- [Mask2Former](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FMask2Former) (Apache-2.0 License)\n- [Detectron2](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdetectron2) (Apache-2.0 License)\n","# 你的 ViT 其实是个图像分割模型  \n**CVPR 2025 ✨ 亮点** · [📄 论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.19108)\n\n**[Tommie Kerssies](https:\u002F\u002Ftommiekerssies.com)\u003Csup>1\u003C\u002Fsup>, [Niccolò 
Cavagnero](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=Pr4XHRAAAAAJ)\u003Csup>2,*\u003C\u002Fsup>, [Alexander Hermans](https:\u002F\u002Fscholar.google.de\u002Fcitations?user=V0iMeYsAAAAJ)\u003Csup>3\u003C\u002Fsup>, [Narges Norouzi](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=q7sm490AAAAJ)\u003Csup>1\u003C\u002Fsup>, [Giuseppe Averta](https:\u002F\u002Fwww.giuseppeaverta.me\u002F)\u003Csup>2\u003C\u002Fsup>, [Bastian Leibe](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=ZcULDB0AAAAJ)\u003Csup>3\u003C\u002Fsup>, [Gijs Dubbelman](https:\u002F\u002Fscholar.google.nl\u002Fcitations?user=wy57br8AAAAJ)\u003Csup>1\u003C\u002Fsup>, [Daan de Geus](https:\u002F\u002Fddegeus.github.io)\u003Csup>1,3\u003C\u002Fsup>**\n\n¹ 埃因霍温理工大学  \n² 都灵理工大学  \n³ 亚琛工业大学  \n\\* 在亚琛工业大学访问期间完成的工作\n\n## 概述\n\n我们提出了 **仅编码器掩码 Transformer (EoMT)**，这是一种极简的图像分割模型，它将普通的 Vision Transformer (ViT) 改造成能够同时将图像块和分割查询编码为标记的架构。无需适配器，也不需要解码器——只需一个 ViT。\n\n借助大规模预训练的 ViT，EoMT 获得了与依赖复杂任务特定组件的最先进方法相当的精度。与此同时，由于其简洁性，它的速度显著更快，例如在使用 ViT-L 时，速度可提升至原来的 4 倍。\n\n事实证明，*你的 ViT 其实就是一个图像分割模型*。EoMT 表明，架构上的复杂性并非必要。对于分割任务来说，一个普通的 Transformer 就足够了。\n\n## 🚀 新：PMT \n\n隆重推出我们的最新模型，[PMT：用于图像和视频分割的纯掩码 Transformer，采用冻结视觉编码器](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.25398)。\n\nPMT 将 EoMT 的极简理念与保留冻结基础模型特征的需求相结合，通过使用简单快速的解码器来模拟 EoMT 和 VidEoMT 的最后几层。\n\n快来看看吧！[点此](https:\u002F\u002Fgithub.com\u002Ftue-mps\u002Fpmt)\n\n## 🚀 新：VidEoMT \n\n🔥 我们很高兴地推出最新的 CVPR 2026 论文，[VidEoMT：你的 ViT 其实也是一个视频分割模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.17807)。\n\nVidEoMT 将 EoMT 的理念扩展到时间维度，提出了一种仅编码器的视频分割模型，其速度比竞争对手快高达 10 倍。\n\n快去查看一下吧！[点此](https:\u002F\u002Fgithub.com\u002Ftue-mps\u002Fvideomt)\n\n## 🚀 新：DINOv3 支持\n\n🔥 我们很高兴宣布支持 **DINOv3** 主干网络！基于 DINOv3 的全新 EoMT 模型在所有分割任务中都带来了性能提升：\n\n- **全景分割**：使用 EoMT-L 在 1280×1280 分辨率下，在 COCO 数据集上达到 58.9 的 PQ\n- **实例分割**：使用 EoMT-L 在 1280×1280 分辨率下，在 COCO 数据集上达到 49.9 的 mAP  \n- **语义分割**：使用 EoMT-L 在 512×512 分辨率下，在 ADE20K 数据集上达到 59.5 的 mIoU\n\n这一切，都在 EoMT 出色的速度下实现！\n\n请查看我们的 
[DINOv3 模型库](model_zoo\u002Fdinov3.md)，了解所有可用的 EoMT 配置及性能基准。\n\n感谢 [DINOv3](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdinov3) 团队提供的这些强大的基础模型！\n\n## 🤗 Transformers\n\n搭载 DINOv2 的 EoMT 也已在 [Hugging Face Transformers](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fmodel_doc\u002Feomt) 上发布。可在 [这里](https:\u002F\u002Fhuggingface.co\u002Fmodels?library=transformers&other=eomt&sort=trending)查看可用模型。\n\n## 安装\n\n如果你尚未安装 Conda，请先安装 Miniconda 并重启终端：\n\n```bash\nwget https:\u002F\u002Frepo.anaconda.com\u002Fminiconda\u002FMiniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh\n```\n\n然后创建环境、激活并安装依赖项：\n\n```bash\nconda create -n eomt python==3.13.2\nconda activate eomt\npython3 -m pip install -r requirements.txt\n```\n\n[Weights & Biases](https:\u002F\u002Fwandb.ai\u002F)（wandb）用于实验记录和可视化。要启用 wandb，请登录你的账号：\n\n```bash\nwandb login\n```\n\n## 数据准备\n\n根据你计划使用的数据集，下载以下文件。  \n你**不需要**解压任何下载的文件。  \n只需将它们放置在你选择的目录中，并通过 `--data.path` 参数指定该路径。  \n代码会直接读取 `.zip` 文件。\n\n**COCO**\n```bash\nwget http:\u002F\u002Fimages.cocodataset.org\u002Fzips\u002Ftrain2017.zip\nwget http:\u002F\u002Fimages.cocodataset.org\u002Fzips\u002Fval2017.zip\nwget http:\u002F\u002Fimages.cocodataset.org\u002Fannotations\u002Fannotations_trainval2017.zip\nwget http:\u002F\u002Fimages.cocodataset.org\u002Fannotations\u002Fpanoptic_annotations_trainval2017.zip\n```\n\n**ADE20K**\n```bash\nwget http:\u002F\u002Fdata.csail.mit.edu\u002Fplaces\u002FADEchallenge\u002FADEChallengeData2016.zip\nwget http:\u002F\u002Fsceneparsing.csail.mit.edu\u002Fdata\u002FChallengeData2017\u002Fannotations_instance.tar\ntar -xf annotations_instance.tar\nzip -r -0 annotations_instance.zip annotations_instance\u002F\nrm -rf annotations_instance.tar\nrm -rf annotations_instance\n```\n\n**Cityscapes**\n```bash\nwget --keep-session-cookies --save-cookies=cookies.txt --post-data 'username=\u003Cyour_username>&password=\u003Cyour_password>&submit=Login' 
https:\u002F\u002Fwww.cityscapes-dataset.com\u002Flogin\u002F\nwget --load-cookies cookies.txt --content-disposition https:\u002F\u002Fwww.cityscapes-dataset.com\u002Ffile-handling\u002F?packageID=1\nwget --load-cookies cookies.txt --content-disposition https:\u002F\u002Fwww.cityscapes-dataset.com\u002Ffile-handling\u002F?packageID=3\n```\n\n🔧 请将 `\u003Cyour_username>` 和 `\u003Cyour_password>` 替换为你实际的 [Cityscapes](https:\u002F\u002Fwww.cityscapes-dataset.com\u002F) 登录凭证。  \n\n## 使用\n\n### 训练\n\n要从头开始训练 EoMT，请运行：\n\n```bash\npython3 main.py fit \\\n  -c configs\u002Fdinov2\u002Fcoco\u002Fpanoptic\u002Feomt_large_640.yaml \\\n  --trainer.devices 4 \\\n  --data.batch_size 4 \\\n  --data.path \u002Fpath\u002Fto\u002Fdataset\n```\n\n此命令将在 COCO 全景分割数据集上，使用 4 张 GPU，以 640×640 的输入尺寸训练 `EoMT-L` 模型。每张 GPU 处理 4 张图像的批次，总批次大小为 16。将配置路径中的 `dinov2` 更改为 `dinov3`，即可启用相应的 DINOv3 模型。\n\n✅ 确保总批次大小为 `devices × batch_size = 16`  \n🔧 请将 `\u002Fpath\u002Fto\u002Fdataset` 替换为包含数据集压缩包的目录。\n\n> 此配置在 4 张 NVIDIA H100 GPU 上大约需要 6 小时，每张 GPU 约占用 26GB 显存。\n\n若要微调预训练的 EoMT 模型，可添加：\n\n```bash\n  --model.ckpt_path \u002Fpath\u002Fto\u002Fpytorch_model.bin \\\n  --model.load_ckpt_class_head False\n```\n\n🔧 请将 `\u002Fpath\u002Fto\u002Fpytorch_model.bin` 替换为你要微调的检查点路径。  \n> `--model.load_ckpt_class_head False` 会在对具有不同类别数据集进行微调时跳过分类头的加载。\n\n> **DINOv3 模型**：当使用基于 DINOv3 的配置时，代码默认期望相对 DINOv3 权重的增量权重。若要禁用此行为而使用绝对权重，请添加 `--model.delta_weights False`。\n\n### 评估\n\n要评估一个预训练的 EoMT 模型，请运行以下命令：\n\n```bash\npython3 main.py validate \\\n  -c configs\u002Fdinov2\u002Fcoco\u002Fpanoptic\u002Feomt_large_640.yaml \\\n  --model.network.masked_attn_enabled False \\\n  --trainer.devices 4 \\\n  --data.batch_size 4 \\\n  --data.path \u002Fpath\u002Fto\u002Fdataset \\\n  --model.ckpt_path \u002Fpath\u002Fto\u002Fpytorch_model.bin\n```\n\n此命令使用 4 张 GPU，每张 GPU 的批大小为 4，来评估相同的 `EoMT-L` 模型。\n\n🔧 请将 `\u002Fpath\u002Fto\u002Fdataset` 替换为包含数据集压缩文件的目录。  \n🔧 请将 `\u002Fpath\u002Fto\u002Fpytorch_model.bin` 替换为要评估的检查点路径。\n\n我们提供了一个 
[notebook](inference.ipynb)，可用于快速推理和可视化，并自动下载预训练模型。\n\n> **DINOv3 模型**：当使用基于 DINOv3 的配置时，代码默认会期望相对于 DINOv3 权重的增量权重。若要禁用此行为并改用绝对权重，请添加 `--model.delta_weights False`。\n\n## 模型库\n\n我们提供了基于 DINOv2 和 DINOv3 的 EoMT 模型的预训练权重。\n\n- **[DINOv2 模型](model_zoo\u002Fdinov2.md)** - 原始发表的结果及预训练权重。\n- **[DINOv3 模型](model_zoo\u002Fdinov3.md)** - 新的基于 DINOv3 的模型及预训练权重。\n\n## 引用\n\n如果您在研究中使用了本工作，请使用以下 BibTeX 条目进行引用：\n\n```BibTeX\n@inproceedings{kerssies2025eomt,\n  author    = {Kerssies, Tommie and Cavagnero, Niccol\\`{o} and Hermans, Alexander and Norouzi, Narges and Averta, Giuseppe and Leibe, Bastian and Dubbelman, Gijs and {de Geus}, Daan},\n  title     = {{Your ViT is Secretly an Image Segmentation Model}},\n  booktitle = {Proceedings of the IEEE\u002FCVF Conference on Computer Vision and Pattern Recognition (CVPR)},\n  year      = {2025},\n}\n```\n\n## 致谢\n\n本项目基于以下库和仓库中的代码：\n\n- [Hugging Face Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)（Apache-2.0 许可证）  \n- [PyTorch Image Models (timm)](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpytorch-image-models)（Apache-2.0 许可证）  \n- [PyTorch Lightning](https:\u002F\u002Fgithub.com\u002FLightning-AI\u002Fpytorch-lightning)（Apache-2.0 许可证）  \n- [TorchMetrics](https:\u002F\u002Fgithub.com\u002FLightning-AI\u002Ftorchmetrics)（Apache-2.0 许可证）  \n- [Mask2Former](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FMask2Former)（Apache-2.0 许可证）  \n- [Detectron2](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdetectron2)（Apache-2.0 许可证）","# EoMT 快速上手指南\n\nEoMT (Encoder-only Mask Transformer) 是一个极简的图像分割模型，它直接将预训练的 Vision Transformer (ViT) 复用为分割模型，无需额外的适配器或解码器。该模型在保持与最先进方法相当精度的同时，显著提升了推理速度（例如 ViT-L 版本快达 4 倍）。\n\n## 环境准备\n\n*   **操作系统**: Linux (推荐)\n*   **Python**: 3.13.2 (官方推荐版本)\n*   **硬件**: NVIDIA GPU (训练示例需多卡，如 4×H100；推理可根据模型大小调整)\n*   **依赖管理**: Conda (Miniconda 或 Anaconda)\n*   **可选**: Weights & Biases (wandb) 账号用于实验日志记录\n\n## 安装步骤\n\n### 1. 
安装 Conda (如未安装)\n如果系统中没有 Conda，请先安装 Miniconda：\n```bash\nwget https:\u002F\u002Frepo.anaconda.com\u002Fminiconda\u002FMiniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh\n```\n*安装完成后请重启终端。*\n\n### 2. 创建并激活环境\n创建名为 `eomt` 的虚拟环境并安装依赖：\n```bash\nconda create -n eomt python==3.13.2\nconda activate eomt\npython3 -m pip install -r requirements.txt\n```\n> **提示**: 国内用户若下载依赖较慢，可添加清华或阿里镜像源：\n> `python3 -m pip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`\n\n### 3. 配置实验日志 (可选)\n如需使用 WandB 记录训练过程：\n```bash\nwandb login\n```\n\n## 基本使用\n\n### 1. 数据准备\n下载数据集（以 COCO 为例），**无需解压**，代码可直接读取 `.zip` 文件。将文件存放在同一目录下。\n\n```bash\n# 示例：下载 COCO 数据集\nwget http:\u002F\u002Fimages.cocodataset.org\u002Fzips\u002Ftrain2017.zip\nwget http:\u002F\u002Fimages.cocodataset.org\u002Fzips\u002Fval2017.zip\nwget http:\u002F\u002Fimages.cocodataset.org\u002Fannotations\u002Fannotations_trainval2017.zip\nwget http:\u002F\u002Fimages.cocodataset.org\u002Fannotations\u002Fpanoptic_annotations_trainval2017.zip\n```\n*注：ADE20K 和 Cityscapes 数据集请下载对应文件并放在同一目录。*\n\n### 2. 模型训练\n以下命令使用 4 张 GPU 从头训练 `EoMT-L` 模型（输入分辨率 640×640），基于 DINOv2  backbone 进行 COCO 全景分割任务。\n\n```bash\npython3 main.py fit \\\n  -c configs\u002Fdinov2\u002Fcoco\u002Fpanoptic\u002Feomt_large_640.yaml \\\n  --trainer.devices 4 \\\n  --data.batch_size 4 \\\n  --data.path \u002Fpath\u002Fto\u002Fdataset\n```\n*   **参数说明**:\n    *   `\u002Fpath\u002Fto\u002Fdataset`: 替换为存放上述 zip 文件的目录路径。\n    *   总 Batch Size = `devices` × `batch_size` (示例中为 16)。\n    *   若要使用 **DINOv3** 模型，请将配置路径中的 `dinov2` 改为 `dinov3`。\n    *   若使用 DINOv3 且希望加载绝对权重而非增量权重，请添加 `--model.delta_weights False`。\n\n### 3. 
模型评估\n使用预训练权重进行评估：\n\n```bash\npython3 main.py validate \\\n  -c configs\u002Fdinov2\u002Fcoco\u002Fpanoptic\u002Feomt_large_640.yaml \\\n  --model.network.masked_attn_enabled False \\\n  --trainer.devices 4 \\\n  --data.batch_size 4 \\\n  --data.path \u002Fpath\u002Fto\u002Fdataset \\\n  --model.ckpt_path \u002Fpath\u002Fto\u002Fpytorch_model.bin\n```\n*   **参数说明**:\n    *   `\u002Fpath\u002Fto\u002Fpytorch_model.bin`: 替换为预训练模型权重文件的路径。\n    *   `--model.network.masked_attn_enabled False`: 评估时通常关闭掩码注意力以加速。\n\n### 4. 快速推理 (可选)\n项目提供了一个 Jupyter Notebook 用于快速测试和可视化，会自动下载预训练模型：\n```bash\n# 启动 Jupyter 后打开 inference.ipynb\njupyter notebook inference.ipynb\n```","某自动驾驶感知团队正在开发实时道路场景分割系统，需要将摄像头采集的视频流精准划分为车道、车辆和行人等区域。\n\n### 没有 eomt 时\n- **架构臃肿复杂**：必须搭建包含独立编码器和重型解码器（如 Mask2Former）的复杂流水线，代码维护成本高且难以调试。\n- **推理延迟过高**：复杂的解码步骤导致处理单帧图像耗时较长，在高分辨率输入下难以满足车载芯片的实时性要求。\n- **预训练模型浪费**：现有的大规模预训练 ViT 特征无法直接利用，需额外训练大量适配器（Adapters）进行微调，消耗大量算力资源。\n- **视频处理瓶颈**：若扩展至视频分割，传统方案往往需要引入昂贵的时序模块，导致帧率进一步下降，无法流畅运行。\n\n### 使用 eomt 后\n- **架构极简统一**：直接复用纯 ViT 架构，将图像块与分割查询共同作为 Token 处理，移除了所有专用解码器和适配器，代码库大幅精简。\n- **推理速度飞跃**：得益于极简设计，在同等 ViT-L 骨干网络下，推理速度提升高达 4 倍，轻松实现高分辨率下的实时响应。\n- **高效迁移学习**：直接利用冻结的大规模预训练 ViT 权重即可达到业界领先的分割精度，显著降低了训练时间和数据需求。\n- **无缝扩展视频**：基于同一理念衍生的 VidEoMT 可直接处理视频流，比竞品快 10 倍，无需额外复杂的时序建模组件。\n\neomt 证明了简单的纯 Transformer 架构足以胜任复杂分割任务，让开发者能以最低的计算成本获得最先进的性能表现。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftue-mps_eomt_256fedc1.png","tue-mps","Mobile Perception Systems Lab at TU\u002Fe","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Ftue-mps_d81df147.png","The MPS lab focuses on technologies that allow moving sensor platforms to perceive their environment. 
",null,"https:\u002F\u002Fwww.tue-mps.org\u002F","https:\u002F\u002Fgithub.com\u002Ftue-mps",[81,85],{"name":82,"color":83,"percentage":84},"Jupyter Notebook","#DA5B0B",89.3,{"name":86,"color":87,"percentage":88},"Python","#3572A5",10.7,575,56,"2026-04-15T04:00:35","MIT","Linux","必需 NVIDIA GPU。示例配置：4×NVIDIA H100，单卡显存约 26GB。","未说明",{"notes":97,"python":98,"dependencies":99},"1. 官方安装指南仅提供了 Linux (x86_64) 的 Miniconda 下载命令，未明确支持 Windows 或 macOS。\n2. 训练示例显示在 4 张 H100 上运行约需 6 小时，单卡显存占用约 26GB。\n3. 数据集准备阶段无需解压文件，代码可直接读取 .zip 格式的数据包。\n4. 使用 DINOv3 模型时，默认加载相对于 DINOv3 权重的增量权重 (delta weights)，若需使用绝对权重需添加特定参数。\n5. 实验日志和可视化依赖 Weights & Biases (wandb)，需登录账号。","3.13.2",[100,101,102,103,104,105],"torch","pytorch-lightning","transformers","timm","torchmetrics","wandb",[35,15],[108,109,110,111,102,112,113,64],"image-segmentation","instance-segmentation","panoptic-segmentation","segmentation","vision-transformer","vit","2026-03-27T02:49:30.150509","2026-04-18T09:20:11.307426",[117,122,127,132,137,142],{"id":118,"question_zh":119,"answer_zh":120,"source_url":121},39359,"如何保存名为 pytorch_model.bin 的模型权重文件？","PyTorch Lightning 默认保存的检查点（checkpoint）包含训练相关状态（如 epoch 数、优化器状态等），而 Hugging Face 上的 `pytorch_model.bin` 仅包含模型权重。两者都可用于推理，但格式不同。若需加载仅含权重的文件，应直接从 Hugging Face 下载或使用 `torch.load()` 加载对应的权重文件；若需断点续训，则使用 Lightning 生成的完整 checkpoint。","https:\u002F\u002Fgithub.com\u002Ftue-mps\u002Feomt\u002Fissues\u002F62",{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},39360,"验证过程中出现 CUDA 显存不足（OOM）错误怎么办？","当测试集中包含高分辨率图像（如 4000×5000）且未启用 Flash Attention 时，验证阶段极易发生显存溢出。解决方案包括：1) 将图像调整为较小分辨率；2) 确保训练和测试的分辨率一致，避免性能下降；3) 检查输入尺寸设置是否正确（例如 mask_logits 形状应为 (queries, input_height\u002F4, input_width\u002F4)）。","https:\u002F\u002Fgithub.com\u002Ftue-mps\u002Feomt\u002Fissues\u002F12",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},39361,"为什么无法复现论文中声称的 128 FPS 推理速度？","官方测量的 FPS 数据是在未启用 Token Merging 的情况下获得的。如果您在推理时启用了 Token Merging，会导致速度下降。请确保在测量 FPS 时禁用该功能。此外，官方计划未来提供专门的 FPS 
测试脚本以供参考。","https:\u002F\u002Fgithub.com\u002Ftue-mps\u002Feomt\u002Fissues\u002F50",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},39362,"遇到 'AttributeError: module 'numpy' has no attribute 'NPY_OWNDATA'' 错误如何解决？","该错误通常由 pycocotools 版本不兼容引起。请将 pycocotools 降级或升级到 2.0.10 版本即可解决：`pip install pycocotools==2.0.10`。如果问题仍存在，建议从头重新安装整个代码库。","https:\u002F\u002Fgithub.com\u002Ftue-mps\u002Feomt\u002Fissues\u002F30",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},39363,"使用自定义数据集微调时训练损失不下降怎么办？","请确认以下几点：1) 使用的是最新版本的代码库，旧版本中 transforms.py 存在导致图像与标签错位的问题；2) 尝试先在标准数据集（如 COCO）上微调预训练模型，验证流程是否正常；3) 检查数据加载逻辑是否正确，特别是单类别数据集的处理方式。","https:\u002F\u002Fgithub.com\u002Ftue-mps\u002Feomt\u002Fissues\u002F10",{"id":143,"question_zh":144,"answer_zh":145,"source_url":146},39364,"如何将 EoMT 模型部署到 Hugging Face Hub？","作者已接受建议并将模型上传至 Hugging Face。用户可通过 `hf_hub_download` 直接下载模型文件，或使用 `PyTorchModelHubMixin` 类为自定义模型添加 `from_pretrained` 和 `push_to_hub` 功能。模型卡片中已链接相关论文，便于发现和复用。","https:\u002F\u002Fgithub.com\u002Ftue-mps\u002Feomt\u002Fissues\u002F1",[]]