[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-hustvl--YOLOS":3,"tool-hustvl--YOLOS":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":80,"owner_twitter":80,"owner_website":80,"owner_url":81,"languages":82,"stars":91,"forks":92,"last_commit_at":93,"license":94,"difficulty_score":95,"env_os":96,"env_gpu":97,"env_ram":98,"env_deps":99,"category_tags":108,"github_topics":109,"view_count":23,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":114,"updated_at":115,"faqs":116,"releases":147},2829,"hustvl\u002FYOLOS","YOLOS","[NeurIPS 2021] You Only Look at One Sequence","YOLOS 是一个基于视觉 Transformer（ViT）架构的开源目标检测模型，其核心理念是“你只需关注一个序列”。它旨在探索并验证标准的 ViT 模型在仅使用中等规模 ImageNet-1k 数据集预训练后，能否以最小的结构调整直接迁移到高难度的 COCO 目标检测任务中。\n\n传统目标检测器通常依赖大量针对二维空间设计的归纳偏置（如卷积操作或特定的锚框机制），而 YOLOS 打破了这一惯例。它将图像视为一系列固定大小的非重叠补丁序列，以纯粹的“序列到序列”方式完成检测任务。这种方法极大地减少了对二维先验知识的依赖，证明了 Transformer 架构从图像识别跨域到目标检测的强大通用性和迁移能力。\n\nYOLOS 特别适合人工智能研究人员、算法开发者以及对深度学习底层原理感兴趣的技术人员使用。对于希望理解 Transformer 在不同视觉任务中表现机制的研究者，或者想要尝试最小化归纳偏置检测方案的开发者来说，这是一个极具参考价值的基准模型。目前，YOLOS 已集成至 HuggingFace Transformers 库，方便用户直接调用和实验。虽然它并非专为追求极致工业级性能而设","YOLOS 是一个基于视觉 Transformer（ViT）架构的开源目标检测模型，其核心理念是“你只需关注一个序列”。它旨在探索并验证标准的 ViT 模型在仅使用中等规模 ImageNet-1k 数据集预训练后，能否以最小的结构调整直接迁移到高难度的 COCO 目标检测任务中。\n\n传统目标检测器通常依赖大量针对二维空间设计的归纳偏置（如卷积操作或特定的锚框机制），而 YOLOS 
打破了这一惯例。它将图像视为一系列固定大小的非重叠补丁序列，以纯粹的“序列到序列”方式完成检测任务。这种方法极大地减少了对二维先验知识的依赖，证明了 Transformer 架构从图像识别跨域到目标检测的强大通用性和迁移能力。\n\nYOLOS 特别适合人工智能研究人员、算法开发者以及对深度学习底层原理感兴趣的技术人员使用。对于希望理解 Transformer 在不同视觉任务中表现机制的研究者，或者想要尝试最小化归纳偏置检测方案的开发者来说，这是一个极具参考价值的基准模型。目前，YOLOS 已集成至 HuggingFace Transformers 库，方便用户直接调用和实验。虽然它并非专为追求极致工业级性能而设计，但其简洁的架构为重新思考视觉领域的 Transformer 应用提供了重要视角。","\u003Cdiv align=\"center\">   \n  \n# You Only :eyes: One Sequence\n\u003C\u002Fdiv>\n\n**TL;DR:**  We study the transferability of the vanilla ViT pre-trained on mid-sized ImageNet-1k to the more challenging COCO object detection benchmark.\n\n:man_technologist: This project is under active development :woman_technologist: :\n\n* **`May 4, 2022`:** :eyes:YOLOS is now available in [🤗HuggingFace Transformers](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fen\u002Fmodel_doc\u002Fyolos)!\n\n* **`Apr 8, 2022`:** If you like YOLOS, you might also like MIMDet ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.02964) \u002F [code & models](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FMIMDet))! MIMDet can efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for high-performance object detection (51.5 box AP and 46.0 mask AP on COCO using ViT-Base & Mask R-CNN).\n\n* **`Oct 28, 2021`:** YOLOS receives an update for [the NeurIPS 2021 camera-ready version](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.00666v3). We add MoCo-v3 self-supervised pre-training results, study the impact of detaching `[Det]` tokens, and add a new Discussion section.  
\n\n* **`Sep 29, 2021`:** **YOLOS is accepted to NeurIPS 2021!**\n\n* **`Jun 22, 2021`:**  We update our [manuscript](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.00666.pdf) on arXiv including discussion about position embeddings and more visualizations, check it out!\n\n* **`Jun 9, 2021`:**  We add a [notebook](VisualizeAttention.ipynb) to to visualize self-attention maps of `[Det]` tokens on different heads of the last layer, check it out!\n\n# \n\n> [**You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.00666)\n>\n> by [Yuxin Fang](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=_Lk0-fQAAAAJ&hl=en)\u003Csup>1\u003C\u002Fsup> \\*, Bencheng Liao\u003Csup>1\u003C\u002Fsup> \\*, [Xinggang Wang](https:\u002F\u002Fxinggangw.info\u002F)\u003Csup>1 :email:\u003C\u002Fsup>, [Jiemin Fang](https:\u002F\u002Fjaminfong.cn)\u003Csup>2, 1\u003C\u002Fsup>, Jiyang Qi\u003Csup>1\u003C\u002Fsup>, [Rui Wu](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=Z_ZkkbEAAAAJ&view_op=list_works&citft=1&email_for_op=2yuxinfang%40gmail.com&gmla=AJsN-F6AJfvX_wN_jDDdJOp33cW5LrvrAwATh1FFyrUxKD8H354RTN7gMFIXi4NTozHvdj1ITW1q5sNS3ED-3htZJpnUA9BraZa8Wnc_XSfCR37MriE77bh9KHFTKml-qPSgNTPdxwFl8KHxIgOWc_ZuJdvo8cbBWc_Ec3SBL6n7wsYYS2E1Wzm4kWwXQybOJCGjI8_EwHwwipOfkQR9I2C_Riq1gk1Y_JG3BQ3xrTy2fN_plPE37StUe_nOnrTjUz919wcMXKqW)\u003Csup>3\u003C\u002Fsup>, Jianwei Niu\u003Csup>3\u003C\u002Fsup>, [Wenyu Liu](http:\u002F\u002Feic.hust.edu.cn\u002Fprofessor\u002Fliuwenyu\u002F)\u003Csup>1\u003C\u002Fsup>.\n> \n> \u003Csup>1\u003C\u002Fsup> [School of EIC, HUST](http:\u002F\u002Feic.hust.edu.cn\u002FEnglish\u002FHome.htm), \u003Csup>2\u003C\u002Fsup> Institute of AI, HUST, \u003Csup>3\u003C\u002Fsup> [Horizon Robotics](https:\u002F\u002Fen.horizon.ai).\n> \n> (\\*) equal contribution, (\u003Csup>:email:\u003C\u002Fsup>) corresponding author.\n> \n> *arXiv technical report ([arXiv 
2106.00666](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.00666))*\n\n\u003Cbr>\n\n## You Only Look at One Sequence (YOLOS)\n\n### The Illustration of YOLOS\n![yolos](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhustvl_YOLOS_readme_4c150e1ad66c.png)\n\n### Highlights\n\nDirectly inherited from [ViT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.11929) ([DeiT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.12877)), YOLOS is not designed to be yet another high-performance object detector, but to unveil the versatility and transferability of Transformer from image recognition to object detection.\nConcretely, our main contributions are summarized as follows:\n\n* We use the mid-sized `ImageNet-1k` as the sole pre-training dataset, and show that a vanilla [ViT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.11929) ([DeiT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.12877)) can be successfully transferred to perform the challenging object detection task and produce competitive `COCO` results with the fewest possible modifications, _i.e._, by only looking at one sequence (YOLOS).\n\n* We demonstrate that 2D object detection can be accomplished in a pure sequence-to-sequence manner by taking a sequence of fixed-sized non-overlapping image patches as input. Among existing object detectors, YOLOS utilizes minimal 2D inductive biases. Moreover, it is feasible for YOLOS to perform object detection in any dimensional space unaware the exact spatial structure or geometry.\n\n* For [ViT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.11929) ([DeiT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.12877)), we find the object detection results are quite sensitive to the pre-train scheme and the detection performance is far from saturating. 
Therefore the proposed YOLOS can be used as a challenging benchmark task to evaluate different pre-training strategies for [ViT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.11929) ([DeiT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.12877)).\n\n* We also discuss the impacts as wel as the limitations of prevalent pre-train schemes and model scaling strategies for Transformer in vision through transferring to object detection.\n\n### Results\n|Model |Pre-train Epochs |  ViT (DeiT) Weight \u002F Log| Fine-tune Epochs | Eval Size | YOLOS Checkpoint \u002F Log | AP @ COCO val |\n| :------------: | :------------: | :------------: | :------------: | :------------: | :------------: | :------------: |\n|`YOLOS-Ti`|300|[FB](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fdeit\u002Fdeit_tiny_patch16_224-a1311bcf.pth)|300|512|[Baidu Drive](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F17kn_UX1LhsjRWxeWEwgWIw), [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1P2YbnAIsEOOheAPr3FGkAAD7pPuN-2Mn\u002Fview?usp=sharing) \u002F [Log](https:\u002F\u002Fgist.github.com\u002FYuxin-CV\u002Faaf4f835f5fdba4b58217f0e3131e9da)|28.7\n|`YOLOS-S`|200|[Baidu Drive](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1LsxtuxSGGj5szZssoyzr_Q), [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1waIu4QODBu79JuIwMvchpezrP4nd3NQr\u002Fview?usp=sharing) \u002F [Log](https:\u002F\u002Fgist.github.com\u002FYuxin-CV\u002F98168420dbcc5a0d1e656da83c6bf416)|150|800|[Baidu Drive](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1m39EKyO_7RdIYjDY4Ew_lw), [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1kfHJnC29MqEaizR-d57tzpAxQVhoYRlh\u002Fview?usp=sharing) \u002F [Log](https:\u002F\u002Fgist.github.com\u002FYuxin-CV\u002Fab06dd0d5034e501318de2e9aba9a6fb)|36.1\n|`YOLOS-S`|300|[FB](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fdeit\u002Fdeit_small_patch16_224-cd65a155.pth)|150|800|[Baidu 
Drive](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F12v6X-r4XhV5nEXF6yNfGRg), [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1GUB16Zt1BUsT-LeHa8oHTE2CwL7E92VY\u002Fview?usp=sharing) \u002F [Log](https:\u002F\u002Fgist.github.com\u002FYuxin-CV\u002F42d733e478c76f686f2b52cf50dfe59d)|36.1\n|`YOLOS-S (dWr)`|300|[Baidu Drive](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1XVfWJk5BFnxIQ3LQeAQypw), [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1uucdzz65lnv-vGFQunTgYSWl7ayJIDgn\u002Fview?usp=sharing) \u002F [Log](https:\u002F\u002Fgist.github.com\u002FYuxin-CV\u002Fe3beedccff156b0065f2eb559a4818d3)|150|800|[Baidu Drive](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1Xk2KbFadSwCOjo7gcoSG0w), [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1vBJVXqazsOoHHMZ6Vg6-MpAkYWstLczQ\u002Fview?usp=sharing) \u002F [Log](https:\u002F\u002Fgist.github.com\u002FYuxin-CV\u002F043ea5d27883a6ff1f105ad5d9ddaa46) |37.6\n|`YOLOS-B`|1000|[FB](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fdeit\u002Fdeit_base_distilled_patch16_384-d0272ac0.pth)|150|800|[Baidu Drive](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1IKGoAlwcdoV25cU5Cs-kew), [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1AUCedyYT2kxgHJNi3UA23P2UNTreGj3_\u002Fview?usp=sharing) \u002F [Log](https:\u002F\u002Fgist.github.com\u002FYuxin-CV\u002Fd5f7720a5868563619ddd64d61760e2f)|42.0\n\n**Notes**: \n\n- The access code for `Baidu Drive` is `yolo`. \n- The `FB` stands for model weights provided by DeiT ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.12877), [code](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdeit)). 
Thanks for their wonderful work.\n- We will update other models in the future, please stay tuned :) \n\n### Requirement\nThis codebase has been developed with Python 3.6, PyTorch 1.5+ and torchvision 0.6+:\n```setup\nconda install -c pytorch pytorch torchvision\n```\nInstall pycocotools (for evaluation on COCO) and scipy (for training):\n```setup\nconda install cython scipy\npip install -U 'git+https:\u002F\u002Fgithub.com\u002Fcocodataset\u002Fcocoapi.git#subdirectory=PythonAPI'\n```\n\n### Data preparation\nDownload and extract COCO 2017 train and val images with annotations from http:\u002F\u002Fcocodataset.org. We expect the directory structure to be the following:\n```\npath\u002Fto\u002Fcoco\u002F\n  annotations\u002F  # annotation json files\n  train2017\u002F    # train images\n  val2017\u002F      # val images\n```\n### Training\nBefore fine-tuning on COCO, you need to download the ImageNet pre-trained model to the `\u002Fpath\u002Fto\u002FYOLOS\u002F` directory.\n\u003Cdetails>\n\u003Csummary>To train the \u003Ccode>YOLOS-Ti\u003C\u002Fcode> model in the paper, run this command:\u003C\u002Fsummary>\n\u003Cpre>\u003Ccode>\npython -m torch.distributed.launch \\\n    --nproc_per_node=8 \\\n    --use_env main.py \\\n    --coco_path \u002Fpath\u002Fto\u002Fcoco \\\n    --batch_size 2 \\\n    --lr 5e-5 \\\n    --epochs 300 \\\n    --backbone_name tiny \\\n    --pre_trained \u002Fpath\u002Fto\u002Fdeit-tiny.pth \\\n    --eval_size 512 \\\n    --init_pe_size 800 1333 \\\n    --output_dir \u002Foutput\u002Fpath\u002Fbox_model\n\u003C\u002Fcode>\u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>To train the \u003Ccode>YOLOS-S\u003C\u002Fcode> model with the 200-epoch pre-trained DeiT-S in the paper, run this command:\u003C\u002Fsummary>\n\u003Cpre>\u003Ccode>\n\npython -m torch.distributed.launch \\\n    --nproc_per_node=8 \\\n    --use_env main.py \\\n    --coco_path \u002Fpath\u002Fto\u002Fcoco \\\n    --batch_size 1 \\\n    --lr 2.5e-5 \\\n    
--epochs 150 \\\n    --backbone_name small \\\n    --pre_trained \u002Fpath\u002Fto\u002Fdeit-small-200epoch.pth \\\n    --eval_size 800 \\\n    --init_pe_size 512 864 \\\n    --mid_pe_size 512 864 \\\n    --output_dir \u002Foutput\u002Fpath\u002Fbox_model\n\n\u003C\u002Fcode>\u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>To train the \u003Ccode>YOLOS-S\u003C\u002Fcode> model with the 300-epoch pre-trained DeiT-S in the paper, run this command:\u003C\u002Fsummary>\n\u003Cpre>\u003Ccode>\npython -m torch.distributed.launch \\\n    --nproc_per_node=8 \\\n    --use_env main.py \\\n    --coco_path \u002Fpath\u002Fto\u002Fcoco \\\n    --batch_size 1 \\\n    --lr 2.5e-5 \\\n    --epochs 150 \\\n    --backbone_name small \\\n    --pre_trained \u002Fpath\u002Fto\u002Fdeit-small-300epoch.pth \\\n    --eval_size 800 \\\n    --init_pe_size 512 864 \\\n    --mid_pe_size 512 864 \\\n    --output_dir \u002Foutput\u002Fpath\u002Fbox_model\n\n\u003C\u002Fcode>\u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>To train the \u003Ccode>YOLOS-S (dWr)\u003C\u002Fcode> model in the paper, run this command:\u003C\u002Fsummary>\n\u003Cpre>\u003Ccode>\npython -m torch.distributed.launch \\\n    --nproc_per_node=8 \\\n    --use_env main.py \\\n    --coco_path \u002Fpath\u002Fto\u002Fcoco \\\n    --batch_size 1 \\\n    --lr 2.5e-5 \\\n    --epochs 150 \\\n    --backbone_name small_dWr \\\n    --pre_trained \u002Fpath\u002Fto\u002Fdeit-small-dWr-scale.pth \\\n    --eval_size 800 \\\n    --init_pe_size 512 864 \\\n    --mid_pe_size 512 864 \\\n    --output_dir \u002Foutput\u002Fpath\u002Fbox_model\n\u003C\u002Fcode>\u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>To train the \u003Ccode>YOLOS-B\u003C\u002Fcode> model in the paper, run this command:\u003C\u002Fsummary>\n\u003Cpre>\u003Ccode>\npython -m torch.distributed.launch \\\n    --nproc_per_node=8 \\\n    --use_env main.py \\\n    --coco_path \u002Fpath\u002Fto\u002Fcoco \\\n    
--batch_size 1 \\\n    --lr 2.5e-5 \\\n    --epochs 150 \\\n    --backbone_name base \\\n    --pre_trained \u002Fpath\u002Fto\u002Fdeit-base.pth \\\n    --eval_size 800 \\\n    --init_pe_size 800 1344 \\\n    --mid_pe_size 800 1344 \\\n    --output_dir \u002Foutput\u002Fpath\u002Fbox_model\n\u003C\u002Fcode>\u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\n### Evaluation\n\nTo evaluate the `YOLOS-Ti` model on COCO, run:\n\n```eval\npython -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path \u002Fpath\u002Fto\u002Fcoco --batch_size 2 --backbone_name tiny --eval --eval_size 512 --init_pe_size 800 1333 --resume \u002Fpath\u002Fto\u002FYOLOS-Ti\n```\nTo evaluate the `YOLOS-S` model on COCO, run:\n```eval\npython -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path \u002Fpath\u002Fto\u002Fcoco --batch_size 1 --backbone_name small --eval --eval_size 800 --init_pe_size 512 864 --mid_pe_size 512 864 --resume \u002Fpath\u002Fto\u002FYOLOS-S\n```\nTo evaluate the `YOLOS-S (dWr)` model on COCO, run:\n```eval\npython -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path \u002Fpath\u002Fto\u002Fcoco --batch_size 1 --backbone_name small_dWr --eval --eval_size 800 --init_pe_size 512 864 --mid_pe_size 512 864 --resume '\u002Fpath\u002Fto\u002FYOLOS-S(dWr)'\n```\n\nTo evaluate the `YOLOS-B` model on COCO, run:\n```eval\npython -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path \u002Fpath\u002Fto\u002Fcoco --batch_size 1 --backbone_name base --eval --eval_size 800 --init_pe_size 800 1344 --mid_pe_size 800 1344 --resume \u002Fpath\u002Fto\u002FYOLOS-B\n```\n\n### Visualization\n\n* **Visualize box prediction and object categories distribution:**\n\n\n1. 
To Get visualization in the paper, you need the finetuned YOLOS models on COCO, run following command to get 100 Det-Toks prediction on COCO val split, then it will generate `\u002Fpath\u002Fto\u002FYOLOS\u002Fvisualization\u002Fmodelname-eval-800-eval-pred.json`\n```\npython cocoval_predjson_generation.py --coco_path \u002Fpath\u002Fto\u002Fcoco --batch_size 1 --backbone_name small --eval --eval_size 800 --init_pe_size 512 864 --mid_pe_size 512 864 --resume \u002Fpath\u002Fto\u002Fyolos-s-model.pth --output_dir .\u002Fvisualization\n```\n2. To get all ground truth object categories on all images from COCO val split, run following command to generate `\u002Fpath\u002Fto\u002FYOLOS\u002Fvisualization\u002Fcoco-valsplit-cls-dist.json`\n```\npython cocoval_gtclsjson_generation.py --coco_path \u002Fpath\u002Fto\u002Fcoco --batch_size 1 --output_dir .\u002Fvisualization\n```\n3. To visualize the distribution of Det-Toks' bboxs and categories, run following command to generate `.png` files in `\u002Fpath\u002Fto\u002FYOLOS\u002Fvisualization\u002F`\n```\n python visualize_dettoken_dist.py --visjson \u002Fpath\u002Fto\u002FYOLOS\u002Fvisualization\u002Fmodelname-eval-800-eval-pred.json --cococlsjson \u002Fpath\u002Fto\u002FYOLOS\u002Fvisualization\u002Fcoco-valsplit-cls-dist.json\n```\n![cls](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhustvl_YOLOS_readme_dfb1297a9cb1.png)\n![cls](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhustvl_YOLOS_readme_6eb61ca97cd9.png)\n\n\n* **Use [VisualizeAttention.ipynb](VisualizeAttention.ipynb) to visualize self-attention of `[Det]` tokens on different heads of the last layer:**\n\n![Det-Tok-41](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhustvl_YOLOS_readme_3fa1e53e740d.png)\n![Det-Tok-96](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhustvl_YOLOS_readme_40c09ba479aa.png)\n\n## Acknowledgement :heart:\nThis project is based on DETR ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.12872), 
[code](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdetr)), DeiT ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.12877), [code](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdeit)), DINO ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.14294), [code](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdino)) and [timm](https:\u002F\u002Fgithub.com\u002Frwightman\u002Fpytorch-image-models). Thanks for their wonderful works.\n\n\n\n## Citation\n\nIf you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :\n\n```BibTeX\n@article{YOLOS,\n  title={You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection},\n  author={Fang, Yuxin and Liao, Bencheng and Wang, Xinggang and Fang, Jiemin and Qi, Jiyang and Wu, Rui and Niu, Jianwei and Liu, Wenyu},\n  journal={arXiv preprint arXiv:2106.00666},\n  year={2021}\n}\n```\n","\u003Cdiv align=\"center\">   \n  \n# 你只 :eyes: 看一个序列\n\u003C\u002Fdiv>\n\n**简而言之：** 我们研究了在中等规模的 ImageNet-1k 数据集上预训练的普通 ViT 模型，在更具挑战性的 COCO 目标检测基准上的迁移能力。\n\n:man_technologist: 该项目目前处于积极开发中 :woman_technologist: :\n\n* **`2022年5月4日`:** :eyes:YOLOS 现已在 [🤗HuggingFace Transformers](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fen\u002Fmodel_doc\u002Fyolos) 中发布！\n\n* **`2022年4月8日`:** 如果你喜欢 YOLOS，你可能也会喜欢 MIMDet（[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.02964) \u002F [代码与模型](https:\u002F\u002Fgithub.com\u002Fhustvl\u002FMIMDet))！MIMDet 能够高效且有效地将基于掩码图像建模（MIM）预训练的普通视觉 Transformer（ViT）适配到高性能目标检测任务中（使用 ViT-Base 和 Mask R-CNN 在 COCO 上可达到 51.5 的 box AP 和 46.0 的 mask AP）。\n\n* **`2021年10月28日`:** YOLOS 针对 [NeurIPS 2021 的最终定稿版本](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.00666v3)进行了更新。我们加入了 MoCo-v3 自监督预训练的结果，研究了分离 `[Det]` tokens 的影响，并新增了一个讨论部分。\n\n* **`2021年9月29日`:** **YOLOS 已被 NeurIPS 2021 接收！**\n\n* **`2021年6月22日`:** 我们在 arXiv 上更新了我们的 
[论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.00666.pdf)，增加了关于位置嵌入的讨论和更多可视化内容，快来看看吧！\n\n* **`2021年6月9日`:** 我们添加了一个 [notebook](VisualizeAttention.ipynb)，用于可视化最后一层不同头上的 `[Det]` tokens 的自注意力图，欢迎查看！\n\n#\n\n> [**你只看一个序列：通过目标检测重新思考视觉中的 Transformer**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.00666)\n>\n> 作者：[Yuxin Fang](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=_Lk0-fQAAAAJ&hl=en)\u003Csup>1\u003C\u002Fsup> \\*, Bencheng Liao\u003Csup>1\u003C\u002Fsup> \\*, [Xinggang Wang](https:\u002F\u002Fxinggangw.info\u002F)\u003Csup>1 :email:\u003C\u002Fsup>, [Jiemin Fang](https:\u002F\u002Fjaminfong.cn)\u003Csup>2, 1\u003C\u002Fsup>, Jiyang Qi\u003Csup>1\u003C\u002Fsup>, [Rui Wu](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=Z_ZkkbEAAAAJ&view_op=list_works&citft=1&email_for_op=2yuxinfang%40gmail.com&gmla=AJsN-F6AJfvX_wN_jDDdJOp33cW5LrvrAwATh1FFyrUxKD8H354RTN7gMFIXi4NTozHvdj1ITW1q5sNS3ED-3htZJpnUA9BraZa8Wnc_XSfCR37MriE77bh9KHFTKml-qPSgNTPdxwFl8KHxIgOWc_ZuJdvo8cbBWc_Ec3SBL6n7wsYYS2E1Wzm4kWwXQybOJCGjI8_EwHwwipOfkQR9I2C_Riq1gk1Y_JG3BQ3xrTy2fN_plPE37StUe_nOnrTjUz919wcMXKqW)\u003Csup>3\u003C\u002Fsup>, Jianwei Niu\u003Csup>3\u003C\u002Fsup>, [Wenyu Liu](http:\u002F\u002Feic.hust.edu.cn\u002Fprofessor\u002Fliuwenyu\u002F)\u003Csup>1\u003C\u002Fsup>。\n> \n> \u003Csup>1\u003C\u002Fsup> [华中科技大学电子信息学院](http:\u002F\u002Feic.hust.edu.cn\u002FEnglish\u002FHome.htm), \u003Csup>2\u003C\u002Fsup> 华中科技大学人工智能研究院, \u003Csup>3\u003C\u002Fsup> [地平线机器人](https:\u002F\u002Fen.horizon.ai)。\n> \n> (\\*) 共同第一作者，(\u003Csup>:email:\u003C\u002Fsup>) 通讯作者。\n> \n> *arXiv 技术报告 ([arXiv 2106.00666](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.00666))*\n\n\u003Cbr>\n\n## 你只看一个序列（YOLOS）\n\n### YOLOS 的示意图\n![yolos](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhustvl_YOLOS_readme_4c150e1ad66c.png)\n\n### 亮点\n\nYOLOS 直接继承自 
[ViT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.11929)（[DeiT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.12877)），它并非旨在成为又一个高性能的目标检测器，而是为了揭示 Transformer 从图像识别到目标检测的通用性和迁移性。\n具体来说，我们的主要贡献总结如下：\n\n* 我们仅使用中等规模的 `ImageNet-1k` 作为唯一的预训练数据集，并证明普通的 [ViT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.11929)（[DeiT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.12877)）只需进行最少的修改，就能成功迁移到具有挑战性的目标检测任务中，以“只看一个序列”的方式（YOLOS），并取得与现有方法相当的 `COCO` 结果。\n\n* 我们展示了可以通过纯序列到序列的方式完成 2D 目标检测，即以固定大小、不重叠的图像块序列作为输入。在现有的目标检测器中，YOLOS 使用的 2D 归纳偏置最少。此外，YOLOS 还可以在任何维度的空间中执行目标检测，而无需了解精确的空间结构或几何形状。\n\n* 对于 [ViT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.11929)（[DeiT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.12877)），我们发现其目标检测结果对预训练方案非常敏感，且检测性能远未达到饱和状态。因此，提出的 YOLOS 可以作为一个具有挑战性的基准任务，用于评估 [ViT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.11929)（[DeiT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.12877)）的不同预训练策略。\n\n* 我们还通过目标检测任务的迁移，探讨了当前视觉领域中流行的预训练方案和模型缩放策略的影响及局限性。\n\n### 结果\n|模型 |预训练轮数 | ViT (DeiT) 权重 \u002F 日志| 微调轮数 | 评估尺寸 | YOLOS 检查点 \u002F 日志 | COCO 验证集上的 AP |\n| :------------: | :------------: | :------------: | :------------: | :------------: | :------------: | :------------: |\n|`YOLOS-Ti`|300|[FB](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fdeit\u002Fdeit_tiny_patch16_224-a1311bcf.pth)|300|512|[百度网盘](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F17kn_UX1LhsjRWxeWEwgWIw), [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1P2YbnAIsEOOheAPr3FGkAAD7pPuN-2Mn\u002Fview?usp=sharing) \u002F [日志](https:\u002F\u002Fgist.github.com\u002FYuxin-CV\u002Faaf4f835f5fdba4b58217f0e3131e9da)|28.7\n|`YOLOS-S`|200|[百度网盘](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1LsxtuxSGGj5szZssoyzr_Q), [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1waIu4QODBu79JuIwMvchpezrP4nd3NQr\u002Fview?usp=sharing) \u002F 
[日志](https:\u002F\u002Fgist.github.com\u002FYuxin-CV\u002F98168420dbcc5a0d1e656da83c6bf416)|150|800|[百度网盘](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1m39EKyO_7RdIYjDY4Ew_lw), [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1kfHJnC29MqEaizR-d57tzpAxQVhoYRlh\u002Fview?usp=sharing) \u002F [日志](https:\u002F\u002Fgist.github.com\u002FYuxin-CV\u002Fab06dd0d5034e501318de2e9aba9a6fb)|36.1\n|`YOLOS-S`|300|[FB](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fdeit\u002Fdeit_small_patch16_224-cd65a155.pth)|150|800|[百度网盘](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F12v6X-r4XhV5nEXF6yNfGRg), [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1GUB16Zt1BUsT-LeHa8oHTE2CwL7E92VY\u002Fview?usp=sharing) \u002F [日志](https:\u002F\u002Fgist.github.com\u002FYuxin-CV\u002F42d733e478c76f686f2b52cf50dfe59d)|36.1\n|`YOLOS-S (dWr)`|300|[百度网盘](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1XVfWJk5BFnxIQ3LQeAQypw), [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1uucdzz65lnv-vGFQunTgYSWl7ayJIDgn\u002Fview?usp=sharing) \u002F [日志](https:\u002F\u002Fgist.github.com\u002FYuxin-CV\u002Fe3beedccff156b0065f2eb559a4818d3)|150|800|[百度网盘](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1Xk2KbFadSwCOjo7gcoSG0w), [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1vBJVXqazsOoHHMZ6Vg6-MpAkYWstLczQ\u002Fview?usp=sharing) \u002F [日志](https:\u002F\u002Fgist.github.com\u002FYuxin-CV\u002F043ea5d27883a6ff1f105ad5d9ddaa46) |37.6\n|`YOLOS-B`|1000|[FB](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fdeit\u002Fdeit_base_distilled_patch16_384-d0272ac0.pth)|150|800|[百度网盘](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1IKGoAlwcdoV25cU5Cs-kew), [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1AUCedyYT2kxgHJNi3UA23P2UNTreGj3_\u002Fview?usp=sharing) \u002F [日志](https:\u002F\u002Fgist.github.com\u002FYuxin-CV\u002Fd5f7720a5868563619ddd64d61760e2f)|42.0\n\n**注释**: \n\n- `百度网盘`的提取码为`yolo`。 \n- `FB`代表由 DeiT 
提供的模型权重（[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.12877), [代码](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdeit)）。感谢他们的优秀工作。\n- 我们将在未来更新其他模型，请持续关注 :) \n\n### 要求\n本代码库使用 Python 3.6、PyTorch 1.5+ 和 torchvision 0.6+ 开发：\n```setup\nconda install -c pytorch pytorch torchvision\n```\n安装 pycocotools（用于在 COCO 数据集上进行评估）和 scipy（用于训练）：\n```setup\nconda install cython scipy\npip install -U 'git+https:\u002F\u002Fgithub.com\u002Fcocodataset\u002Fcocoapi.git#subdirectory=PythonAPI'\n```\n\n### 数据准备\n从 http:\u002F\u002Fcocodataset.org 下载并解压 COCO 2017 训练集和验证集的图像及其标注文件。我们期望目录结构如下：\n```\npath\u002Fto\u002Fcoco\u002F\n  annotations\u002F  # 标注 JSON 文件\n  train2017\u002F    # 训练图像\n  val2017\u002F      # 验证图像\n```\n### 训练\n在 COCO 数据集上进行微调之前，你需要将 ImageNet 预训练模型下载到 `\u002Fpath\u002Fto\u002FYOLOS\u002F` 目录下。\n\u003Cdetails>\n\u003Csummary>要训练论文中的 \u003Ccode>YOLOS-Ti\u003C\u002Fcode> 模型，请运行以下命令：\u003C\u002Fsummary>\n\u003Cpre>\u003Ccode>\npython -m torch.distributed.launch \\\n    --nproc_per_node=8 \\\n    --use_env main.py \\\n    --coco_path \u002Fpath\u002Fto\u002Fcoco \\\n    --batch_size 2 \\\n    --lr 5e-5 \\\n    --epochs 300 \\\n    --backbone_name tiny \\\n    --pre_trained \u002Fpath\u002Fto\u002Fdeit-tiny.pth \\\n    --eval_size 512 \\\n    --init_pe_size 800 1333 \\\n    --output_dir \u002Foutput\u002Fpath\u002Fbox_model\n\u003C\u002Fcode>\u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>要训练论文中使用 200 轮预训练 DeiT-S 的 \u003Ccode>YOLOS-S\u003C\u002Fcode> 模型，请运行以下命令：\u003C\u002Fsummary>\n\u003Cpre>\u003Ccode>\n\npython -m torch.distributed.launch \\\n    --nproc_per_node=8 \\\n    --use_env main.py \\\n    --coco_path \u002Fpath\u002Fto\u002Fcoco \\\n    --batch_size 1 \\\n    --lr 2.5e-5 \\\n    --epochs 150 \\\n    --backbone_name small \\\n    --pre_trained \u002Fpath\u002Fto\u002Fdeit-small-200epoch.pth \\\n    --eval_size 800 \\\n    --init_pe_size 512 864 \\\n    --mid_pe_size 512 864 \\\n    --output_dir 
\u002Foutput\u002Fpath\u002Fbox_model\n\n\u003C\u002Fcode>\u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>要训练论文中使用 300 轮预训练 DeiT-S 的 \u003Ccode>YOLOS-S\u003C\u002Fcode> 模型，请运行以下命令：\u003C\u002Fsummary>\n\u003Cpre>\u003Ccode>\npython -m torch.distributed.launch \\\n    --nproc_per_node=8 \\\n    --use_env main.py \\\n    --coco_path \u002Fpath\u002Fto\u002Fcoco \\\n    --batch_size 1 \\\n    --lr 2.5e-5 \\\n    --epochs 150 \\\n    --backbone_name small \\\n    --pre_trained \u002Fpath\u002Fto\u002Fdeit-small-300epoch.pth \\\n    --eval_size 800 \\\n    --init_pe_size 512 864 \\\n    --mid_pe_size 512 864 \\\n    --output_dir \u002Foutput\u002Fpath\u002Fbox_model\n\n\u003C\u002Fcode>\u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>要训练论文中的 \u003Ccode>YOLOS-S (dWr)\u003C\u002Fcode> 模型，请运行以下命令：\u003C\u002Fsummary>\n\u003Cpre>\u003Ccode>\npython -m torch.distributed.launch \\\n    --nproc_per_node=8 \\\n    --use_env main.py \\\n    --coco_path \u002Fpath\u002Fto\u002Fcoco \\\n    --batch_size 1 \\\n    --lr 2.5e-5 \\\n    --epochs 150 \\\n    --backbone_name small_dWr \\\n    --pre_trained \u002Fpath\u002Fto\u002Fdeit-small-dWr-scale.pth \\\n    --eval_size 800 \\\n    --init_pe_size 512 864 \\\n    --mid_pe_size 512 864 \\\n    --output_dir \u002Foutput\u002Fpath\u002Fbox_model\n\u003C\u002Fcode>\u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>要训练论文中的 \u003Ccode>YOLOS-B\u003C\u002Fcode> 模型，请运行以下命令：\u003C\u002Fsummary>\n\u003Cpre>\u003Ccode>\npython -m torch.distributed.launch \\\n    --nproc_per_node=8 \\\n    --use_env main.py \\\n    --coco_path \u002Fpath\u002Fto\u002Fcoco \\\n    --batch_size 1 \\\n    --lr 2.5e-5 \\\n    --epochs 150 \\\n    --backbone_name base \\\n    --pre_trained \u002Fpath\u002Fto\u002Fdeit-base.pth \\\n    --eval_size 800 \\\n    --init_pe_size 800 1344 \\\n    --mid_pe_size 800 1344 \\\n    --output_dir 
\u002Foutput\u002Fpath\u002Fbox_model\n\u003C\u002Fcode>\u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n### 评估\n\n要在 COCO 数据集上评估 `YOLOS-Ti` 模型，请运行以下命令：\n\n```eval\npython -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path \u002Fpath\u002Fto\u002Fcoco --batch_size 2 --backbone_name tiny --eval --eval_size 512 --init_pe_size 800 1333 --resume \u002Fpath\u002Fto\u002FYOLOS-Ti\n```\n\n要在 COCO 数据集上评估 `YOLOS-S` 模型，请运行以下命令：\n```eval\npython -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path \u002Fpath\u002Fto\u002Fcoco --batch_size 1 --backbone_name small --eval --eval_size 800 --init_pe_size 512 864 --mid_pe_size 512 864 --resume \u002Fpath\u002Fto\u002FYOLOS-S\n```\n\n要在 COCO 数据集上评估 `YOLOS-S (dWr)` 模型，请运行以下命令：\n```eval\npython -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path \u002Fpath\u002Fto\u002Fcoco --batch_size 1 --backbone_name small_dWr --eval --eval_size 800 --init_pe_size 512 864 --mid_pe_size 512 864 --resume '\u002Fpath\u002Fto\u002FYOLOS-S(dWr)'\n```\n\n要在 COCO 数据集上评估 `YOLOS-B` 模型，请运行以下命令：\n```eval\npython -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path \u002Fpath\u002Fto\u002Fcoco --batch_size 1 --backbone_name base --eval --eval_size 800 --init_pe_size 800 1344 --mid_pe_size 800 1344 --resume \u002Fpath\u002Fto\u002FYOLOS-B\n```\n\n### 可视化\n\n* **可视化边界框预测及目标类别分布：**\n\n1. 要获得论文中的可视化效果，您需要一个在 COCO 数据集上微调过的 YOLOS 模型。运行以下命令以获取 COCO 验证集上的 100 个 Det-Token 预测结果，随后会生成 `\u002Fpath\u002Fto\u002FYOLOS\u002Fvisualization\u002Fmodelname-eval-800-eval-pred.json` 文件：\n```\npython cocoval_predjson_generation.py --coco_path \u002Fpath\u002Fto\u002Fcoco --batch_size 1 --backbone_name small --eval --eval_size 800 --init_pe_size 512 864 --mid_pe_size 512 864 --resume \u002Fpath\u002Fto\u002Fyolos-s-model.pth --output_dir .\u002Fvisualization\n```\n2. 
要获取 COCO 验证集所有图像的全部真实标注类别，运行以下命令以生成 `\u002Fpath\u002Fto\u002FYOLOS\u002Fvisualization\u002Fcoco-valsplit-cls-dist.json` 文件：\n```\npython cocoval_gtclsjson_generation.py --coco_path \u002Fpath\u002Fto\u002Fcoco --batch_size 1 --output_dir .\u002Fvisualization\n```\n3. 要可视化 Det-Token 的边界框和类别的分布，运行以下命令以在 `\u002Fpath\u002Fto\u002FYOLOS\u002Fvisualization\u002F` 目录下生成 `.png` 文件：\n```\npython visualize_dettoken_dist.py --visjson \u002Fpath\u002Fto\u002FYOLOS\u002Fvisualization\u002Fmodelname-eval-800-eval-pred.json --cococlsjson \u002Fpath\u002Fto\u002FYOLOS\u002Fvisualization\u002Fcoco-valsplit-cls-dist.json\n```\n![cls](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhustvl_YOLOS_readme_dfb1297a9cb1.png)\n![cls](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhustvl_YOLOS_readme_6eb61ca97cd9.png)\n\n\n* **使用 [VisualizeAttention.ipynb](VisualizeAttention.ipynb) 可视化最后一层不同头上的 `[Det]` Token 自注意力：**\n\n![Det-Tok-41](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhustvl_YOLOS_readme_3fa1e53e740d.png)\n![Det-Tok-96](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhustvl_YOLOS_readme_40c09ba479aa.png)\n\n## 致谢 :heart:\n本项目基于 DETR（[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.12872)，[代码](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdetr)）、DeiT（[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.12877)，[代码](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdeit)）、DINO（[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.14294)，[代码](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdino)）以及 [timm](https:\u002F\u002Fgithub.com\u002Frwightman\u002Fpytorch-image-models)。感谢这些优秀的工作。\n\n\n\n## 引用\n\n如果您在研究中发现我们的论文和代码有所帮助，请考虑给项目点赞 :star: 并引用 :pencil: 我们：\n\n```BibTeX\n@article{YOLOS,\n  title={You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection},\n  author={Fang, Yuxin and Liao, Bencheng and Wang, Xinggang and Fang, Jiemin and Qi, Jiyang and Wu, Rui and Niu, Jianwei and Liu, Wenyu},\n  
journal={arXiv preprint arXiv:2106.00666},\n  year={2021}\n}\n```","# YOLOS 快速上手指南\n\nYOLOS (You Only Look at One Sequence) 是一个基于纯 Transformer 架构的目标检测模型。它直接复用预训练的 ViT\u002FDeiT 权重，仅需极少修改即可在 COCO 数据集上实现具有竞争力的检测效果，旨在探索 Transformer 从图像识别到目标检测的迁移能力。\n\n## 1. 环境准备\n\n本项目基于 Python 3.6+ 开发，依赖 PyTorch 1.5+ 和 torchvision 0.6+。\n\n**系统要求：**\n*   Python >= 3.6\n*   PyTorch >= 1.5\n*   torchvision >= 0.6\n*   CUDA 环境（推荐用于训练和推理）\n\n**前置依赖安装：**\n建议优先使用国内镜像源加速安装。\n\n```bash\n# 安装 PyTorch 和 torchvision (推荐使用清华或阿里镜像)\npip install torch torchvision -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n\n# 安装其他必要依赖\npip install cython scipy -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n\n# 安装 pycocotools (COCO 评估必备)\npip install -U 'git+https:\u002F\u002Fgithub.com\u002Fcocodataset\u002Fcocoapi.git#subdirectory=PythonAPI' -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n## 2. 数据准备\n\n下载 COCO 2017 数据集（训练集、验证集及标注文件），并解压至指定目录。目录结构需如下所示：\n\n```text\npath\u002Fto\u002Fcoco\u002F\n  annotations\u002F  # 包含 json 标注文件\n  train2017\u002F    # 训练图片\n  val2017\u002F      # 验证图片\n```\n\n> **提示**：请提前下载在 ImageNet 上预训练好的 DeiT\u002FViT 权重（如 `deit-tiny.pth`, `deit-small.pth` 等），并将其放置在项目目录下备用。\n\n## 3. 
基本使用\n\n### 模型训练 (Fine-tuning)\n\n在 COCO 数据集上进行微调前，请确保已下载对应的 ImageNet 预训练权重。以下以训练 **YOLOS-Ti** (Tiny) 模型为例，使用 8 张 GPU 进行分布式训练：\n\n```bash\npython -m torch.distributed.launch \\\n    --nproc_per_node=8 \\\n    --use_env main.py \\\n    --coco_path \u002Fpath\u002Fto\u002Fcoco \\\n    --batch_size 2 \\\n    --lr 5e-5 \\\n    --epochs 300 \\\n    --backbone_name tiny \\\n    --pre_trained \u002Fpath\u002Fto\u002Fdeit-tiny.pth \\\n    --eval_size 512 \\\n    --init_pe_size 800 1333 \\\n    --output_dir \u002Foutput\u002Fpath\u002Fbox_model\n```\n\n*注：请将 `\u002Fpath\u002Fto\u002Fcoco` 替换为实际数据路径，`\u002Fpath\u002Fto\u002Fdeit-tiny.pth` 替换为预训练权重路径。其他型号（如 `small`, `base`）只需调整 `--backbone_name`、`--pre_trained` 及相关尺寸参数。*\n\n### 模型评估 (Evaluation)\n\n使用训练好的检查点进行评估。以下以评估 **YOLOS-Ti** 为例：\n\n```bash\npython -m torch.distributed.launch \\\n    --nproc_per_node=8 \\\n    --use_env main.py \\\n    --coco_path \u002Fpath\u002Fto\u002Fcoco \\\n    --batch_size 2 \\\n    --backbone_name tiny \\\n    --eval \\\n    --eval_size 512 \\\n    --init_pe_size 800 1333 \\\n    --resume \u002Fpath\u002Fto\u002FYOLOS-Ti_checkpoint.pth\n```\n\n*注：`--resume` 后接具体的模型权重文件路径。*\n\n### 快速体验 (Hugging Face)\n\nYOLOS 已集成至 Hugging Face Transformers 库，可通过以下代码快速加载预训练模型进行推理（无需配置完整训练环境）：\n\n```python\nfrom transformers import YolosForObjectDetection, YolosImageProcessor\nimport torch\nfrom PIL import Image\nimport requests\n\n# 加载处理器和模型\nprocessor = YolosImageProcessor.from_pretrained(\"hustvl\u002Fyolos-base\")\nmodel = YolosForObjectDetection.from_pretrained(\"hustvl\u002Fyolos-base\")\n\n# 准备图像\nurl = \"http:\u002F\u002Fimages.cocodataset.org\u002Fval2017\u002F000000039769.jpg\"\nimage = Image.open(requests.get(url, stream=True).raw)\n\n# 推理\ninputs = processor(images=image, return_tensors=\"pt\")\noutputs = model(**inputs)\n\n# 后处理获取结果\ntarget_sizes = torch.tensor([image.size[::-1]])\nresults = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]\n\nfor score, 
label, box in zip(results[\"scores\"], results[\"labels\"], results[\"boxes\"]):\n    print(f\"Detected {model.config.id2label[label.item()]} with confidence {score.item():.2f} at box {box.tolist()}\")\n```","某自动驾驶初创团队的算法工程师正致力于将图像分类模型快速迁移到车辆与行人检测任务中，以加速原型车的路测迭代。\n\n### 没有 YOLOS 时\n- **架构改造复杂**：必须手动设计复杂的卷积骨干网络（如 ResNet）并搭配繁琐的特征金字塔（FPN）结构，代码耦合度高且难以维护。\n- **预训练依赖重**：为了在 COCO 数据集上获得可用精度，往往需要耗费大量算力在大规模私有数据集上进行长时间预训练，中小团队难以负担。\n- **归纳偏置限制**：传统检测器过度依赖人工设计的 2D 空间先验（如锚框机制），导致模型在面对极端视角或非典型物体时泛化能力不足。\n- **研发周期漫长**：从复现经典论文到调通检测流程，通常需要数周时间进行结构适配和超参数微调，严重拖慢实验验证速度。\n\n### 使用 YOLOS 后\n- **架构极简统一**：直接复用标准的 ViT 架构，仅需引入 `[Det]` 令牌即可将纯序列处理模式应用于检测，无需任何复杂的 2D 结构修改。\n- **迁移高效便捷**：直接使用在中等规模 ImageNet-1k 上预训练的 vanilla ViT 权重，只需极少调整即可在 COCO 基准上产出具有竞争力的检测结果。\n- **泛化能力增强**：摒弃了绝大多数人工设计的 2D 归纳偏置，让模型通过自注意力机制自主学习空间关系，显著提升了对新颖场景的适应力。\n- **迭代速度飞跃**：借助 HuggingFace Transformers 的现成支持，工程师可在几天内完成从分类模型到检测模型的验证闭环，大幅缩短研发路径。\n\nYOLOS 证明了无需专用架构，仅凭最纯粹的 Transformer 序列建模能力，就能以最低成本打破图像分类与目标检测之间的壁垒。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhustvl_YOLOS_3fa1e53e.png","hustvl","HUST Vision Lab","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fhustvl_3e2bf80d.png","HUST Vision Lab of the School of EIC in HUST. Lab Lead @xinggangw",null,"https:\u002F\u002Fgithub.com\u002Fhustvl",[83,87],{"name":84,"color":85,"percentage":86},"Jupyter Notebook","#DA5B0B",71.6,{"name":88,"color":89,"percentage":90},"Python","#3572A5",28.4,906,125,"2026-03-31T03:54:34","MIT",4,"Linux","必需 NVIDIA GPU。训练命令示例中指定 '--nproc_per_node=8'，表明推荐多卡环境（如 8 张显卡）。具体显存需求未说明，但考虑到 ViT 模型及 COCO 数据集训练，建议大显存显卡。","未说明",{"notes":100,"python":101,"dependencies":102},"1. 代码库基于 Python 3.6、PyTorch 1.5+ 和 torchvision 0.6+ 开发。\n2. 必须安装 pycocotools（用于 COCO 评估）和 scipy（用于训练）。\n3. 训练前需下载 ImageNet 预训练模型（如 DeiT 权重）。\n4. 训练和评估脚本使用 torch.distributed.launch，默认示例配置为 8 卡并行。\n5. 数据集需准备 COCO 2017 train\u002Fval 图像及标注文件，目录结构需符合特定格式。\n6. 
百度网盘资源访问密码为 'yolo'。","3.6+",[103,104,105,106,107],"torch>=1.5","torchvision>=0.6","pycocotools","scipy","cython",[26,14],[110,111,112,113],"vision-transformer","transformer","object-detection","computer-vision","2026-03-27T02:49:30.150509","2026-04-06T09:45:34.483131",[117,122,127,132,137,142],{"id":118,"question_zh":119,"answer_zh":120,"source_url":121},13089,"为什么 YOLOS-Small 模型有 3000 万参数，而 DeiT-S 只有 2200 万？","额外的参数主要来自于位置嵌入（Positional Embeddings, PE）。最初为了对齐 DETR 设置，我们在每个 Transformer 层都添加了随机初始化的 PE。但后来发现，仅在预训练的第一层将 PE 插值到更大尺寸（例如从 512x864 插值到 800x1344），而不在中间层添加其他 PE，可以在精度和参数量之间取得更好的平衡。采用这种配置后，参数量约为 24.6M（22.1M + 2.5M），精度为 36.6 AP；而旧配置的参数量高达 30.7M，精度仅为 36.1 AP。Tiny 版本模型已采用这种优化后的配置。","https:\u002F\u002Fgithub.com\u002Fhustvl\u002FYOLOS\u002Fissues\u002F3",{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},13090,"评估结果远低于论文报告的数值（例如 Base 模型预期 42.0 AP，实际仅得 13.8 AP），是什么原因？","这是因为代码库基于 DETR，继承了一个特性：评估时的 GPU 数量（num_GPU）和每张 GPU 的批次大小（batch_size_per_GPU）必须与训练时完全一致。如果训练时使用了多卡（如 8 卡）和特定 batch size，评估时也必须使用相同的配置。请确保运行评估命令时指定的 `--nproc_per_node` 和 `--batch_size` 参数与训练阶段保持一致。","https:\u002F\u002Fgithub.com\u002Fhustvl\u002FYOLOS\u002Fissues\u002F10",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},13091,"YOLOS 是否支持动态输入尺寸或改变长宽比（例如将输入改为 800x800）？","虽然模型在技术上可以处理不同尺寸的输入，但如果输入尺寸或长宽比与训练时不一致（例如训练使用的是特定比例，测试改为 1:1），会导致位置嵌入错位，从而造成精度下降。如果需要使用特定的长宽比（如 1:1），建议重新训练模型并开启尺度抖动（scale jittering）数据增强，以使模型适应不同的输入比例。","https:\u002F\u002Fgithub.com\u002Fhustvl\u002FYOLOS\u002Fissues\u002F12",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},13092,"为什么学习率调度器（Learning Rate Scheduler）是按 epoch 更新而不是按 batch 更新？","在图像识别任务中，如果使用余弦退火学习率调度器（cosine lr scheduler），通常惯例是按 epoch 步进（例如广泛使用的 timm 库）。虽然在 NLP 和语义分割任务中常按 iteration\u002Fbatch 步进，且按 batch 步进理论上更合理且不会导致更差的结果，但本项目目前遵循图像识别领域的常规做法，按 epoch 进行调整。","https:\u002F\u002Fgithub.com\u002Fhustvl\u002FYOLOS\u002Fissues\u002F19",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},13093,"项目是否支持自动混合精度（AMP）训练？开启后会加速吗？","代码中注释掉了 
AMP 相关代码。经过测试，使用 PyTorch 原生的 AMP 并没有观察到显著的加速效果，反而可能导致精度下降。因此，目前不建议在该项目中启用 AMP。","https:\u002F\u002Fgithub.com\u002Fhustvl\u002FYOLOS\u002Fissues\u002F18",{"id":143,"question_zh":144,"answer_zh":145,"source_url":146},13094,"在 Pascal VOC 数据集上测试效果不佳（AP 很低），可能是什么原因？","Pascal VOC 2007 的数据量相对较少，这很可能是导致性能不理想的主要原因。YOLOS 这类基于 Transformer 的模型通常需要大规模数据（如 COCO）进行预训练或训练才能发挥最佳性能。在小数据集上直接训练往往难以收敛到满意的结果，建议检查是否使用了合适的预训练权重或考虑数据增强策略。","https:\u002F\u002Fgithub.com\u002Fhustvl\u002FYOLOS\u002Fissues\u002F4",[]]