[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-microsoft--GLIP":3,"tool-microsoft--GLIP":62},[4,18,26,35,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,2,"2026-04-10T11:39:34",[14,15,13],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":32,"last_commit_at":41,"category_tags":42,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[43,13,15,14],"插件",{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 
开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[52,15,13,14],"语言模型",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[14,15,13,61],"视频",{"id":63,"github_repo":64,"name":65,"description_en":66,"description_zh":67,"ai_summary_zh":68,"readme_en":69,"readme_zh":70,"quickstart_zh":71,"use_case_zh":72,"hero_image_url":73,"owner_login":74,"owner_name":75,"owner_avatar_url":76,"owner_bio":77,"owner_company":78,"owner_location":78,"owner_email":79,"owner_twitter":80,"owner_website":81,"owner_url":82,"languages":83,"stars":100,"forks":101,"last_commit_at":102,"license":103,"difficulty_score":10,"env_os":104,"env_gpu":105,"env_ram":106,"env_deps":107,"category_tags":121,"github_topics":78,"view_count":32,"oss_zip_url":78,"oss_zip_packed_at":78,"status":17,"created_at":123,"updated_at":124,"faqs":125,"releases":156},7375,"microsoft\u002FGLIP","GLIP","Grounded Language-Image Pre-training","GLIP 是一款创新的开源人工智能模型，全称为“基于语言 - 图像预训练的定位模型”（Grounded Language-Image Pre-training）。它的核心能力是将自然语言理解与视觉目标检测深度融合，让用户能够直接用文字描述来查找图片中的特定物体，而无需预先针对这些物体进行专门的训练。\n\n传统检测模型通常只能识别固定类别的对象，一旦遇到训练数据中未出现的新类别便束手无策。GLIP 有效解决了这一局限，展现出强大的“零样本”和“少样本”迁移能力。这意味着即使面对从未见过的物体类别，只要用语言描述清楚，GLIP 也能精准定位。它在多个权威基准测试中表现卓越，甚至在未见过的数据集上也能超越许多经过监督训练的模型。\n\n该工具特别适合计算机视觉研究人员、AI 开发者以及需要处理开放集检测任务的技术团队使用。对于希望构建灵活识别系统或探索多模态大模型应用的从业者而言，GLIP 提供了宝贵的预训练代码、微调方案及评测基准。其独特的技术亮点在于统一了检测与接地（Grounding）任务，实现了从“图像到概念”的高效映射，并曾入选 CVPR 2022 最佳论文候选，代表了当前开放词汇目标检","GLIP 是一款创新的开源人工智能模型，全称为“基于语言 - 图像预训练的定位模型”（Grounded Language-Image Pre-training）。它的核心能力是将自然语言理解与视觉目标检测深度融合，让用户能够直接用文字描述来查找图片中的特定物体，而无需预先针对这些物体进行专门的训练。\n\n传统检测模型通常只能识别固定类别的对象，一旦遇到训练数据中未出现的新类别便束手无策。GLIP 有效解决了这一局限，展现出强大的“零样本”和“少样本”迁移能力。这意味着即使面对从未见过的物体类别，只要用语言描述清楚，GLIP 也能精准定位。它在多个权威基准测试中表现卓越，甚至在未见过的数据集上也能超越许多经过监督训练的模型。\n\n该工具特别适合计算机视觉研究人员、AI 开发者以及需要处理开放集检测任务的技术团队使用。对于希望构建灵活识别系统或探索多模态大模型应用的从业者而言，GLIP 提供了宝贵的预训练代码、微调方案及评测基准。其独特的技术亮点在于统一了检测与接地（Grounding）任务，实现了从“图像到概念”的高效映射，并曾入选 CVPR 2022 最佳论文候选，代表了当前开放词汇目标检测领域的前沿水平。","# GLIP: Grounded Language-Image Pre-training  \r\n\r\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_GLIP_readme_1d0a4662c6e2.png\" width=\"800\"> \r\n\r\n\r\n## Updates\r\n* 01\u002F17\u002F2023: From image understanding to image generation for open-set grounding? Check out [**GLIGEN (Grounded Language-to-Image Generation)**](https:\u002F\u002Fgligen.github.io\u002F)\r\n  - GLIGEN:  (box, concept) $\\rightarrow$ image  ||  GLIP:    image $\\rightarrow$ (box, concept)\r\n\r\n* 09\u002F19\u002F2022: GLIPv2 has been accepted to NeurIPS 2022 ([Updated Version](https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.05836)). 
\r\n\r\n* 09\u002F18\u002F2022: Organizing ECCV Workshop [*Computer Vision in the Wild (CVinW)*](https:\u002F\u002Fcomputer-vision-in-the-wild.github.io\u002Feccv-2022\u002F), where two challenges are hosted to evaluate the zero-shot, few-shot and full-shot performance of pre-trained vision models in downstream tasks:\r\n  - [``*Image Classification in the Wild (ICinW)*''](https:\u002F\u002Feval.ai\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F1832\u002Foverview) Challenge evaluates on 20 image classification tasks.\r\n  - [``*Object Detection in the Wild (ODinW)*''](https:\u002F\u002Feval.ai\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F1839\u002Foverview) Challenge evaluates on 35 object detection tasks.\r\n\r\n$\\qquad$ [ \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_GLIP_readme_17115bb33e09.png\" width=10%\u002F> [Workshop]](https:\u002F\u002Fcomputer-vision-in-the-wild.github.io\u002Feccv-2022\u002F)    $\\qquad$    [\u003Cimg src=\"https:\u002F\u002Fevalai.s3.amazonaws.com\u002Fmedia\u002Flogos\u002F4e939412-a9c0-46bd-9797-5ba0bd0a9095.jpg\" width=10%\u002F> [IC Challenge] ](https:\u002F\u002Feval.ai\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F1832\u002Foverview)\r\n$\\qquad$    [\u003Cimg src=\"https:\u002F\u002Fevalai.s3.amazonaws.com\u002Fmedia\u002Flogos\u002Fe3727105-2b29-4c9b-98a6-3d1191884eb5.jpg\" width=10%\u002F> [OD Challenge] ](https:\u002F\u002Feval.ai\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F1839\u002Foverview)\r\n\r\n\r\n* 09\u002F13\u002F2022: Updated [HuggingFace Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fhaotiz\u002Fglip-zeroshot-demo)! Feel free to give it a try!!!\r\n\r\n  -  Acknowledgement: Many thanks to the help from @[HuggingFace](https:\u002F\u002Fhuggingface.co\u002F) for a Space GPU upgrade to host the GLIP demo! \r\n\r\n* 06\u002F21\u002F2022: GLIP has been selected as a Best Paper Finalist at CVPR 2022!\r\n\r\n* 06\u002F16\u002F2022: ODinW benchmark released! GLIP-T A&B released!\r\n\r\n* 06\u002F13\u002F2022: GLIPv2 is on Arxiv https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.05836!\r\n\r\n* 04\u002F30\u002F2022: Updated [Colab Demo](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F12x7v-_miN7-SRiziK3Cx4ffJzstBJNqb?usp=sharing)!\r\n\r\n* 04\u002F14\u002F2022: GLIP has been accepted to CVPR 2022 as an oral presentation! First version of code and pre-trained models are released!\r\n\r\n* 12\u002F06\u002F2021: GLIP paper on arxiv https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.03857.\r\n\r\n* 11\u002F23\u002F2021: Project page built. \u003Cbr\u002F>\r\n\r\n## Introduction\r\nThis repository is the project page for [GLIP](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.03857).  GLIP demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. \r\n\r\n1. When directly evaluated on COCO and LVIS (without seeing any images in COCO), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines.\r\n2. After fine-tuned on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA.\r\n3. When transferred to 13 downstream object detection tasks, a few-shot GLIP rivals with a fully-supervised Dynamic Head.\r\n\r\nWe provide code for:\r\n\r\n1. **pre-training** GLIP on detection and grounding data;\r\n2. **zero-shot evaluating** GLIP on standard benchmarks (COCO, LVIS, Flickr30K) and custom COCO-formated datasets;\r\n3. 
**fine-tuning** GLIP on standard benchmarks (COCO) and custom COCO-formated datasets;\r\n4. **a Colab demo**.\r\n5. Toolkits for **the Object Detection in the Wild Benchmark (ODinW)** with 35 downstream detection tasks.\r\n\r\nPlease see respective sections for instructions.\r\n\r\n## Demo\r\nPlease see a Colab demo at [link](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F12x7v-_miN7-SRiziK3Cx4ffJzstBJNqb?usp=sharing)!\r\n\r\n## Installation and Setup\r\n\r\n***Environment***\r\nThis repo requires Pytorch>=1.9 and torchvision. We recommand using docker to setup the environment. You can use this pre-built docker image ``docker pull pengchuanzhang\u002Fmaskrcnn:ubuntu18-py3.7-cuda10.2-pytorch1.9`` or this one ``docker pull pengchuanzhang\u002Fpytorch:ubuntu20.04_torch1.9-cuda11.3-nccl2.9.9`` depending on your GPU.\r\n\r\nThen install the following packages:\r\n```\r\npip install einops shapely timm yacs tensorboardX ftfy prettytable pymongo\r\npip install transformers \r\npython setup.py build develop --user\r\n```\r\n\r\n***Backbone Checkpoints.*** Download the ImageNet pre-trained backbone checkpoints into the ``MODEL`` folder. \r\n```\r\nmkdir MODEL\r\nwget https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fmodels\u002Fswin_tiny_patch4_window7_224.pth -O swin_tiny_patch4_window7_224.pth\r\nwget https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fmodels\u002Fswin_large_patch4_window12_384_22k.pth -O swin_large_patch4_window12_384_22k.pth\r\n```\r\n\r\n\r\n## Model Zoo\r\n\r\n***Checkpoint host move.*** The checkpoint links expired. We are moving the checkpoints to https:\u002F\u002Fhuggingface.co\u002Fharold\u002FGLIP\u002Ftree\u002Fmain. Currently most checkpoints are available. 
Working to host the remaining checkpoints asap.\r\n\r\nModel | COCO [1] | LVIS [2] | LVIS [3] | ODinW [4] | Pre-Train Data | Config  | Weight\r\n-- | -- | -- | -- | -- | -- | -- | --\r\nGLIP-T (**A**) | 42.9 \u002F 52.9 | - | 14.2 | ~28.7 | O365 | [config](configs\u002Fpretrain\u002Fglip_A_Swin_T_O365.yaml) | [weight](https:\u002F\u002Fhuggingface.co\u002FGLIPModel\u002FGLIP\u002Fblob\u002Fmain\u002Fglip_a_tiny_o365.pth)\r\nGLIP-T (**B**) | 44.9 \u002F 53.8  | - | 13.5 | ~33.2 | O365 | [config](configs\u002Fpretrain\u002Fglip_Swin_T_O365.yaml) | [weight](https:\u002F\u002Fhuggingface.co\u002FGLIPModel\u002FGLIP\u002Fblob\u002Fmain\u002Fglip_tiny_model_o365.pth)\r\nGLIP-T (**C**) | 46.7 \u002F 55.1 | 14.3 | [17.7](https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fmodels\u002Fglip_tiny_model_o365_goldg_lvisbest.pth) | 44.4 | O365,GoldG | [config](configs\u002Fpretrain\u002Fglip_Swin_T_O365_GoldG.yaml) | [weight](https:\u002F\u002Fhuggingface.co\u002FGLIPModel\u002FGLIP\u002Fblob\u002Fmain\u002Fglip_tiny_model_o365_goldg.pth)\r\n**GLIP-T** [5]  | 46.6 \u002F 55.2  | 17.6  | [20.1](https:\u002F\u002Fhuggingface.co\u002FGLIPModel\u002FGLIP\u002Fblob\u002Fmain\u002Fglip_tiny_model_o365_goldg_cc_sbu_lvisbest.pth) | 42.7 | O365,GoldG,CC3M,SBU | [config](configs\u002Fpretrain\u002Fglip_Swin_T_O365_GoldG.yaml) [6] | [weight](https:\u002F\u002Fhuggingface.co\u002FGLIPModel\u002FGLIP\u002Fblob\u002Fmain\u002Fglip_tiny_model_o365_goldg_cc_sbu.pth)\r\n**GLIP-L** [7] | 51.4 \u002F 61.7 [8]  | 29.3 | [30.1](https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fmodels\u002Fglip_large_model_lvisbest.pth) | 51.2 | FourODs,GoldG,CC3M+12M,SBU | [config](configs\u002Fpretrain\u002Fglip_Swin_L.yaml) [9] | [weight](https:\u002F\u002Fhuggingface.co\u002FGLIPModel\u002FGLIP\u002Fblob\u002Fmain\u002Fglip_large_model.pth)\r\n\r\n[1] Zero-shot and fine-tuning performance on COCO val2017.\r\n\r\n[2] Zero-shot performance on LVIS minival (APr) with the last pre-trained checkpoint.\r\n\r\n[3] On LVIS, the model could overfit slightly during the pre-training course. Thus we reported two numbers on LVIS: the performance of the last checkpoint (LVIS[2]) and the performance of the best checkpoint during the pre-training course (LVIS[3]).\r\n\r\n[4] Zero-shot performance on the 13 ODinW datasets. The numbers reported in the GLIP paper is from the best checkpoint during the pre-training course, which may be slightly higher than the numbers for the released last checkpoint, similar to the case of LVIS.\r\n\r\n[5] GLIP-T released in this repo is pre-trained on Conceptual Captions 3M and SBU captions. It is referred in paper in Table 1 and in Appendix C.3. It differs slightly from the GLIP-T in the main paper in terms of downstream performance. We will release the pre-training support for using CC3M and SBU captions data in the next update.\r\n\r\n[6] This config is only intended for zero-shot evaluation and fine-tuning. Pre-training config with support for using CC3M and SBU captions data will be updated.\r\n\r\n[7] GLIP-L released in this repo is pre-trained on Conceptual Captions 3M+12M and SBU captions. It slightly outperforms the GLIP-L in the main paper because the model used to annotate the caption data are improved compared to the main paper. 
We will release the pre-training support for using CC3M+12M and SBU captions data in the next update.\r\n\r\n[8] Multi-scale testing used.\r\n\r\n[9] This config is only intended for zero-shot evaluation and fine-tuning. Pre-training config with support for using CC3M+12M and SBU captions data to be updated.\r\n\r\n\r\n## Pre-Training\r\n\r\n\r\n***Required Data.***  Prepare ``Objects365``, ``Flickr30K``, and ``MixedGrounding`` data as in [DATA.md](DATA.md). Support for training using caption data (Conceptual Captions and SBU captions) will be released soon.\r\n\r\n***Command.***\r\n\r\nPerform pre-training with the following command (please change the config-file accordingly; checkout model zoo for the corresponding config; change the ``{output_dir}`` to your desired output directory):\r\n\r\n```\r\npython -m torch.distributed.launch --nnodes 2 --nproc_per_node=16 tools\u002Ftrain_net.py \\\r\n    --config-file configs\u002Fpretrain\u002Fglip_Swin_T_O365_GoldG.yaml \\\r\n    --skip-test --use-tensorboard --override_output_dir {output_dir}\r\n```\r\n\r\nFor training GLIP-T models, we used `nnodes = 2`, `nproc_per_node=16` on 32GB V100 machines. For training GLIP-L models, we used `nnodes = 4`, `nproc_per_node=16` on 32GB V100 machines. Please setup the environment accordingly based on your local machine.\r\n\r\n\r\n## (Zero-Shot) Evaluation\r\n\r\n### COCO Evaluation\r\n\r\nPrepare ``COCO\u002Fval2017`` data as in [DATA.md](DATA.md). Set ``{config_file}``, ``{model_checkpoint}`` according to the ``Model Zoo``; set ``{output_dir}`` to a folder where the evaluation results will be stored.\r\n\r\n```\r\npython tools\u002Ftest_grounding_net.py --config-file {config_file} --weight {model_checkpoint} \\\r\n        TEST.IMS_PER_BATCH 1 \\\r\n        MODEL.DYHEAD.SCORE_AGG \"MEAN\" \\\r\n        TEST.EVAL_TASK detection \\\r\n        MODEL.DYHEAD.FUSE_CONFIG.MLM_LOSS False \\\r\n        OUTPUT_DIR {output_dir}\r\n```\r\n\r\n### LVIS Evaluation\r\n\r\nWe follow MDETR to evaluate with the [FixedAP](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.01066.pdf) criterion. Set ``{config_file}``, ``{model_checkpoint}`` according to the ``Model Zoo``. Prepare ``COCO\u002Fval2017`` data as in [DATA.md](DATA.md).\r\n\r\n```\r\npython -m torch.distributed.launch --nproc_per_node=4 \\\r\n        tools\u002Ftest_grounding_net.py \\\r\n        --config-file {config_file} \\\r\n        --task_config configs\u002Flvis\u002Fminival.yaml \\\r\n        --weight {model_checkpoint} \\\r\n        TEST.EVAL_TASK detection OUTPUT_DIR {output_dir} \r\n        TEST.CHUNKED_EVALUATION 40  TEST.IMS_PER_BATCH 4 SOLVER.IMS_PER_BATCH 4 TEST.MDETR_STYLE_AGGREGATE_CLASS_NUM 3000 MODEL.RETINANET.DETECTIONS_PER_IMG 300 MODEL.FCOS.DETECTIONS_PER_IMG 300 MODEL.ATSS.DETECTIONS_PER_IMG 300 MODEL.ROI_HEADS.DETECTIONS_PER_IMG 300\r\n```\r\nIf you wish to evaluate on Val 1.0, set ``--task_config`` to ``configs\u002Flvis\u002Fval.yaml``.\r\n\r\n\r\n### ODinW \u002F Custom Dataset Evaluation\r\n\r\nGLIP supports easy evaluation on a custom dataset. Currently, the code supports evaluation on [COCO-formatted](https:\u002F\u002Fcocodataset.org\u002F#format-data) dataset.\r\n\r\nWe will use the [Aquarium](https:\u002F\u002Fpublic.roboflow.com\u002Fobject-detection\u002Faquarium) dataset from ODinW as an example to show how to evaluate on a custom COCO-formatted dataset.\r\n\r\n1. Download the raw dataset from RoboFlow in the COCO format into ``DATASET\u002Fodinw\u002FAquarium``. 
Each train\u002Fval\u002Ftest split has a corresponding ``annotation`` file and a ``image`` folder. \r\n\r\n2. Remove the background class from the annotation file. This can be as simple as open \"_annotations.coco.json\" and remove the entry with \"id:0\" from \"categories\". For convenience, we provide the modified annotation files for  Aquarium:\r\n    ```\r\n    wget https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fodinw\u002FAquarium\u002FAquarium%20Combined.v2-raw-1024.coco\u002Ftest\u002Fannotations_without_background.json -O DATASET\u002Fodinw\u002FAquarium\u002FAquarium\\ Combined.v2-raw-1024.coco\u002Ftest\u002Fannotations_without_background.json\r\n    wget https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fodinw\u002FAquarium\u002FAquarium%20Combined.v2-raw-1024.coco\u002Ftrain\u002Fannotations_without_background.json -O DATASET\u002Fodinw\u002FAquarium\u002FAquarium\\ Combined.v2-raw-1024.coco\u002Ftrain\u002Fannotations_without_background.json\r\n    wget https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fodinw\u002FAquarium\u002FAquarium%20Combined.v2-raw-1024.coco\u002Fvalid\u002Fannotations_without_background.json -O DATASET\u002Fodinw\u002FAquarium\u002FAquarium\\ Combined.v2-raw-1024.coco\u002Fvalid\u002Fannotations_without_background.json\r\n    ```\r\n    \r\n4. Then create a yaml file as in ``configs\u002Fodinw_13\u002FAquarium_Aquarium_Combined.v2-raw-1024.coco.yaml``. A few fields to be noted in the yamls:\r\n\r\n    DATASET.CAPTION_PROMPT allows manually changing the prompt (the default prompt is simply concatnating all the categories);\r\n\r\n    MODELS.\\*.NUM_CLASSES need to be set to the number of categories in the dataset (including the background class). E.g., Aquarium has 7 non-background categories thus MODELS.\\*.NUM_CLASSES is set to 8;\r\n\r\n4. Run the following command to evaluate on the dataset. Set ``{config_file}``, ``{model_checkpoint}`` according to the ``Model Zoo``. Set {odinw_configs} to the path of the task yaml file we just prepared.\r\n\r\n```\r\npython tools\u002Ftest_grounding_net.py --config-file {config_file} --weight {model_checkpoint} \\\r\n      --task_config {odinw_configs} \\\r\n      TEST.IMS_PER_BATCH 1 SOLVER.IMS_PER_BATCH 1 \\\r\n      TEST.EVAL_TASK detection \\\r\n      DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding \\\r\n      DATALOADER.DISTRIBUTE_CHUNK_AMONG_NODE False \\\r\n      DATASETS.USE_OVERRIDE_CATEGORY True \\\r\n      DATASETS.USE_CAPTION_PROMPT True\r\n```\r\n\r\n### Flickr30K Evaluation\r\nPrepare ``Flickr30K`` data as in [DATA.md](DATA.md). Set ``{config_file}``, ``{model_checkpoint}`` according to the ``Model Zoo``.\r\n\r\n```\r\npython tools\u002Ftest_grounding_net.py \\\r\n        --config-file {config_file} \\\r\n        --task_config configs\u002Fflickr\u002Ftest.yaml,configs\u002Fflickr\u002Fval.yaml \\\r\n        --weight {model_checkpoint} \\\r\n        OUTPUT_DIR {output_dir} TEST.IMS_PER_BATCH 1 SOLVER.IMS_PER_BATCH 1 TEST.MDETR_STYLE_AGGREGATE_CLASS_NUM 100 TEST.EVAL_TASK grounding MODEL.DYHEAD.FUSE_CONFIG.MLM_LOSS False\r\n```\r\n\r\n\r\n\r\n## Fine-Tuning\r\n\r\n### COCO Fine-Tuning\r\nPrepare the ``COCO`` data as in [DATA.md](DATA.md). 
Set ``{config_file}``, ``{model_checkpoint}`` according to the ``Model Zoo``.\r\n\r\nBelow is the fine-tuning script for tuning the Tiny models:\r\n```\r\npython -m torch.distributed.launch --nproc_per_node=16 tools\u002Ftrain_net.py \\\r\n       --config-file {config_file} \\\r\n       --skip-test \\\r\n       MODEL.WEIGHT {model_checkpoint} \\\r\n       DATASETS.TRAIN '(\"coco_grounding_train\", )' \\\r\n       MODEL.BACKBONE.FREEZE_CONV_BODY_AT -1 SOLVER.IMS_PER_BATCH 32 SOLVER.USE_AMP True SOLVER.MAX_EPOCH 24 TEST.DURING_TRAINING False TEST.IMS_PER_BATCH 16 SOLVER.FIND_UNUSED_PARAMETERS False SOLVER.BASE_LR 0.00001 SOLVER.LANG_LR 0.00001 SOLVER.STEPS \\(0.67,0.89\\) DATASETS.DISABLE_SHUFFLE True MODEL.DYHEAD.SCORE_AGG \"MEAN\" TEST.EVAL_TASK detection\r\n```\r\n\r\nFor evaluation, please follow the instructions in ``COCO Evaluation``. Scripts for tuning the Large model will be released soon.\r\n\r\n### ODinW \u002F Custom Dataset Fine-Tuning\r\nPrepare the dataset as in ``ODinW \u002F Custom Dataset Evaluation``.\r\n\r\n#### Full Model Fine-Tuning\r\n\r\nFor tuning with 1\u002F3\u002F5\u002F10-shot, set {custom_shot_and_epoch_and_general_copy} to \"1_200_8\", \"3_200_4\", \"5_200_2\", \"10_200_1\", respectively.\r\n\r\nFor tuning with all the data, set {custom_shot_and_epoch_and_general_copy} to \"0_200_1\"; set SOLVER.STEP_PATIENCE to 2; set SOLVER.AUTO_TERMINATE_PATIENCE to 4.\r\n\r\n```\r\npython -m torch.distributed.launch --nproc_per_node=4 tools\u002Ffinetune.py \\\r\n      --config-file {config_file}  --ft-tasks {configs} --skip-test \\\r\n      --custom_shot_and_epoch_and_general_copy {custom_shot_and_epoch_and_general_copy} \\\r\n      --evaluate_only_best_on_test --push_both_val_and_test \\\r\n      MODEL.WEIGHT {model_checkpoint} \\\r\n      SOLVER.USE_AMP True TEST.DURING_TRAINING True TEST.IMS_PER_BATCH 4 SOLVER.IMS_PER_BATCH 4 SOLVER.WEIGHT_DECAY 0.05 TEST.EVAL_TASK detection DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding MODEL.BACKBONE.FREEZE_CONV_BODY_AT 2 MODEL.DYHEAD.USE_CHECKPOINT True SOLVER.FIND_UNUSED_PARAMETERS False SOLVER.TEST_WITH_INFERENCE True SOLVER.USE_AUTOSTEP True DATASETS.USE_OVERRIDE_CATEGORY True SOLVER.SEED 10 DATASETS.SHUFFLE_SEED 3 DATASETS.USE_CAPTION_PROMPT True DATASETS.DISABLE_SHUFFLE True \\\r\n      SOLVER.STEP_PATIENCE 3 SOLVER.CHECKPOINT_PER_EPOCH 1.0 SOLVER.AUTO_TERMINATE_PATIENCE 8 SOLVER.MODEL_EMA 0.0 SOLVER.TUNING_HIGHLEVEL_OVERRIDE full\r\n```\r\n\r\n#### Prompt Tuning\r\nFollow the command as in ``Full Model Fine-Tuning``. But set the following hyper-parameters:\r\n```\r\nSOLVER.WEIGHT_DECAY 0.25 \\\r\nSOLVER.BASE_LR 0.05 \\\r\nSOLVER.TUNING_HIGHLEVEL_OVERRIDE language_prompt_v2\r\n```\r\n\r\n\r\n## The Object Detection in the Wild Benchmark\r\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_GLIP_readme_23a90cd4e2af.png\" width=\"400\"> \r\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_GLIP_readme_bf00a788d883.png\" width=\"300\"> \r\n\r\nODinW was first proposed in GLIP and refined and formalized in [ELEVATER](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08790.pdf). GLIP used 13 downstream tasks while the full ODinW has 35 downstream tasks. It will be hosted as a challenge at [the CV in the Wild Workshop @ ECCV 2022](https:\u002F\u002Fcomputer-vision-in-the-wild.github.io\u002Feccv-2022\u002F). We hope our code encourage the community to participate in this challenge!\r\n\r\nODinW was introduced in GLIP and initially contained 13 datasets. 
We further expand the datasets by including more datasets from RoboFlow and the final version contains **35** datasets. \r\n\r\nTo distinguish between the two versions, we denote the version used by GLIP as ``ODinW-13`` and the version used by the CVinW workshop as ``ODinW-35``.\r\n\r\nThis repo also provides the necessary code to train and evaluate on ODinW. See instructions below.\r\n\r\n#### Download ODinW\r\nRoboFlow hosts all the original datasets. We are also hosting the datasets and provide a simple script the download all the data. \r\n```\r\npython odinw\u002Fdownload_datasets.py\r\n```\r\n\r\n``configs\u002Fodinw_35`` contain all the meta information of the datasets. ``configs\u002Fodinw_13`` are the datasets used by GLIP. Each dataset follows the coco detection format.\r\n\r\nAll ODinW datasets are in the COCO format; thus we can directly use the similar scripts to adapt and evaluate pre-trained models on ODinW. Below is a brief recap.\r\n\r\n#### (Zero-Shot) Evaluation\r\n``odinw_configs`` can be any of the configs from ``configs\u002Fodinw_14`` and ``configs\u002Fodinw_35``.\r\n```\r\npython tools\u002Ftest_grounding_net.py --config-file {config_file} --weight {model_checkpoint} \\\r\n      --task_config {odinw_configs} \\\r\n      TEST.IMS_PER_BATCH 1 SOLVER.IMS_PER_BATCH 1 \\\r\n      TEST.EVAL_TASK detection \\\r\n      DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding \\\r\n      DATALOADER.DISTRIBUTE_CHUNK_AMONG_NODE False \\\r\n      DATASETS.USE_OVERRIDE_CATEGORY True \\\r\n      DATASETS.USE_CAPTION_PROMPT True\r\n```\r\n\r\n#### Full-Model Fine-Tuning\r\n\r\nFor tuning with 1\u002F3\u002F5\u002F10-shot, set `{custom_shot_and_epoch_and_general_copy}` to \"1_200_8\", \"3_200_4\", \"5_200_2\", \"10_200_1\", respectively.\r\n\r\nFor tuning with all the data, set `{custom_shot_and_epoch_and_general_copy}` to \"0_200_1\"; set SOLVER.STEP_PATIENCE to 2; set SOLVER.AUTO_TERMINATE_PATIENCE to 4.\r\n\r\n```\r\npython -m torch.distributed.launch --nproc_per_node=4 tools\u002Ffinetune.py \\\r\n      --config-file {config_file}  --ft-tasks {odinw_configs} --skip-test \\\r\n      --custom_shot_and_epoch_and_general_copy {custom_shot_and_epoch_and_general_copy} \\\r\n      --evaluate_only_best_on_test --push_both_val_and_test \\\r\n      MODEL.WEIGHT {model_checkpoint} \\\r\n      SOLVER.USE_AMP True TEST.DURING_TRAINING True TEST.IMS_PER_BATCH 4 SOLVER.IMS_PER_BATCH 4 SOLVER.WEIGHT_DECAY 0.05 TEST.EVAL_TASK detection DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding MODEL.BACKBONE.FREEZE_CONV_BODY_AT 2 MODEL.DYHEAD.USE_CHECKPOINT True SOLVER.FIND_UNUSED_PARAMETERS False SOLVER.TEST_WITH_INFERENCE True SOLVER.USE_AUTOSTEP True DATASETS.USE_OVERRIDE_CATEGORY True SOLVER.SEED 10 DATASETS.SHUFFLE_SEED 3 DATASETS.USE_CAPTION_PROMPT True DATASETS.DISABLE_SHUFFLE True \\\r\n      SOLVER.STEP_PATIENCE 3 SOLVER.CHECKPOINT_PER_EPOCH 1.0 SOLVER.AUTO_TERMINATE_PATIENCE 8 SOLVER.MODEL_EMA 0.0 SOLVER.TUNING_HIGHLEVEL_OVERRIDE full\r\n```\r\n\r\n#### Prompt Tuning\r\n\r\nFor tuning with 1\u002F3\u002F5\u002F10-shot, set `{custom_shot_and_epoch_and_general_copy}` to \"1_200_8\", \"3_200_4\", \"5_200_2\", \"10_200_1\", respectively.\r\n\r\nFor tuning with all the data, set `{custom_shot_and_epoch_and_general_copy}` to \"0_200_1\"; set SOLVER.STEP_PATIENCE to 2; set SOLVER.AUTO_TERMINATE_PATIENCE to 4.\r\n\r\nFollow the command as in ``Full Model Fine-Tuning``. 
But set the following hyper-parameters:\r\n```\r\nSOLVER.WEIGHT_DECAY 0.25 \\\r\nSOLVER.BASE_LR 0.05 \\\r\nSOLVER.TUNING_HIGHLEVEL_OVERRIDE language_prompt_v2\r\n```\r\n\r\n#### Linear Probing\r\nFor tuning with 1\u002F3\u002F5\u002F10-shot, set `{custom_shot_and_epoch_and_general_copy}` to \"1_200_8\", \"3_200_4\", \"5_200_2\", \"10_200_1\", respectively.\r\n\r\nFor tuning with all the data, set `{custom_shot_and_epoch_and_general_copy}` to \"0_200_1\"; set SOLVER.STEP_PATIENCE to 2; set SOLVER.AUTO_TERMINATE_PATIENCE to 4.\r\n\r\nFollow the command as in ``Full Model Fine-Tuning``. But set the following hyper-parameters:\r\n```\r\nSOLVER.TUNING_HIGHLEVEL_OVERRIDE linear_prob\r\n```\r\n\r\n\r\n#### Knowledge-Augmented Inference\r\nGLIP also supports knowledge-augmented inference. Please see [our paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08790.pdf) for details. Here we provide an example on how to use external knowledge. Please download a specialized GLIP-A model for knowledge augmented inference ``wget https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fmodels\u002Fglip_a_tiny_o365_knowledge.pth -O MODEL\u002Fglip_a_tiny_o365_knowledge.pth``.\r\n\r\n```\r\npython tools\u002Ftest_grounding_net.py --config-file configs\u002Fpretrain\u002Fglip_A_Swin_T_O365.yaml --weight MODEL\u002Fglip_a_tiny_o365_knowledge.pth \\\r\n      --task_config {odinw_configs} \\\r\n      TEST.IMS_PER_BATCH 1 SOLVER.IMS_PER_BATCH 1 \\\r\n      TEST.EVAL_TASK detection \\\r\n      DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding \\\r\n      DATALOADER.DISTRIBUTE_CHUNK_AMONG_NODE False \\\r\n      DATASETS.USE_OVERRIDE_CATEGORY True \\\r\n      DATASETS.USE_CAPTION_PROMPT True \\\r\n      GLIPKNOW.KNOWLEDGE_FILE knowledge\u002Fodinw_benchmark35_knowledge_and_gpt3.yaml GLIPKNOW.KNOWLEDGE_TYPE gpt3_and_wiki GLIPKNOW.PARALLEL_LANGUAGE_INPUT True GLIPKNOW.LAN_FEATURE_AGG_TYPE first MODEL.DYHEAD.FUSE_CONFIG.USE_LAYER_SCALE True GLIPKNOW.GPT3_NUM 3 GLIPKNOW.WIKI_AND_GPT3 True\r\n```\r\n\r\n#### Submit Your Results to ODinw Leaderboard\r\n\r\nThe participant teams are encouraged to upload their results to [ODinW leaderboard](https:\u002F\u002Feval.ai\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F1839\u002Foverview) on EvalAI. From the perspective od data labeling cost, lowering the requirement of data requirement enables more scenarios, a varied number of tracks are considered in the challenge: zero-shot, few-shot, and full-shot. Please see the ODinW website for more details about each phase. \r\n\r\n1. For zero\u002Ffull shot setting, the required format for prediction json file is\r\n```\r\n{\r\n      \"dataset_name (e.g., 'WildFireSmoke')\":\r\n            [value]: value is following the COCO's \r\n            result format, which contains \r\n            [\"image_id\":xxx, \"category_id\":xxx, \r\n            \"bbox\":xxx, \"score\":xxx]\r\n}\r\n```\r\nPlease see one provided example for zero shot prediction file: [all_predictions_zeroshot.json](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1lO66zH141O_0pTiIhRC2lY5y2PxmxGOH\u002Fview?usp=sharing) and one full shot prediction file: [all_predictions_fullshot.json](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1-nLs2ZebfPoiA_qa_vvkbJD96V1RU7Vu\u002Fview?usp=sharing).\r\n\r\n2. For few shot (3-shot, according to the challenge description) setting, where three train-val subsets are generated with random seed [3, 30, 300], respectively. 
The required format for prediction json file is\r\n```\r\n{\r\n      \"dataset_name (e.g., \"WildFireSmoke\")\":{\r\n            \"rand_seed_num (e.g., \"30\")\":\r\n                  [value]: value is following the \r\n                  COCO's result format, which \r\n                  contains [\"image_id\":xxx, \r\n                  \"category_id\":xxx, \"bbox\":xxx, \r\n                  \"score\":xxx]\r\n     }\r\n}\r\n```\r\nPlease see one provided example for few shot prediction file: [all_predictions_3_shot.json](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F13pDjmSf0ZAZghgiDTONDF0ur5FP8AuLx\u002Fview?usp=sharing).\r\n\r\n\r\n\r\n## Citations\r\nPlease consider citing our papers if you use the code:\r\n```\r\n@inproceedings{li2021grounded,\r\n      title={Grounded Language-Image Pre-training},\r\n      author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},\r\n      year={2022},\r\n      booktitle={CVPR},\r\n}\r\n@article{zhang2022glipv2,\r\n  title={GLIPv2: Unifying Localization and Vision-Language Understanding},\r\n  author={Zhang, Haotian* and Zhang, Pengchuan* and Hu, Xiaowei and Chen, Yen-Chun and Li, Liunian Harold and Dai, Xiyang and Wang, Lijuan and Yuan, Lu and Hwang, Jenq-Neng and Gao, Jianfeng},\r\n  journal={arXiv preprint arXiv:2206.05836},\r\n  year={2022}\r\n}\r\n@article{li2022elevater,\r\n  title={ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models},\r\n  author={Li*, Chunyuan and Liu*, Haotian and Li, Liunian Harold and Zhang, Pengchuan and Aneja, Jyoti and Yang, Jianwei and Jin, Ping and Lee, Yong Jae and Hu, Houdong and Liu, Zicheng and others},\r\n  journal={arXiv preprint arXiv:2204.08790},\r\n  year={2022}\r\n}\r\n```\r\n","# GLIP：接地型语言-图像预训练  \n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_GLIP_readme_1d0a4662c6e2.png\" width=\"800\"> \r\n\r\n\r\n## 更新\r\n* 2023年1月17日：从图像理解到开放集接地的图像生成？请查看[**GLIGEN（接地型语言到图像生成）**](https:\u002F\u002Fgligen.github.io\u002F)  \r\n  - GLIGEN：(框, 概念) $\\rightarrow$ 图像  ||  GLIP：    图像 $\\rightarrow$ (框, 概念)\r\n\r\n* 2022年9月19日：GLIPv2已被NeurIPS 2022接收（[更新版本](https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.05836)）。\r\n\r\n* 2022年9月18日：组织ECCV研讨会[*野外计算机视觉（CVinW）*](https:\u002F\u002Fcomputer-vision-in-the-wild.github.io\u002Feccv-2022\u002F)，其中举办了两项挑战赛，用于评估预训练视觉模型在下游任务中的零样本、少样本和全样本性能：\r\n  - [``*野外图像分类（ICinW）*''](https:\u002F\u002Feval.ai\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F1832\u002Foverview) 挑战赛评估了20个图像分类任务。\r\n  - [``*野外目标检测（ODinW）*''](https:\u002F\u002Feval.ai\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F1839\u002Foverview) 挑战赛评估了35个目标检测任务。\r\n\r\n$\\qquad$ [ \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_GLIP_readme_17115bb33e09.png\" width=10%\u002F> [研讨会]](https:\u002F\u002Fcomputer-vision-in-the-wild.github.io\u002Feccv-2022\u002F)    $\\qquad$    [\u003Cimg src=\"https:\u002F\u002Fevalai.s3.amazonaws.com\u002Fmedia\u002Flogos\u002F4e939412-a9c0-46bd-9797-5ba0bd0a9095.jpg\" width=10%\u002F> [IC挑战] ](https:\u002F\u002Feval.ai\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F1832\u002Foverview)\r\n$\\qquad$    [\u003Cimg src=\"https:\u002F\u002Fevalai.s3.amazonaws.com\u002Fmedia\u002Flogos\u002Fe3727105-2b29-4c9b-98a6-3d1191884eb5.jpg\" width=10%\u002F> [OD挑战] 
](https:\u002F\u002Feval.ai\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F1839\u002Foverview)\r\n\r\n\r\n* 2022年9月13日：更新了[HuggingFace演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fhaotiz\u002Fglip-zeroshot-demo)! 欢迎大家试用!!!\r\n\r\n  - 致谢：非常感谢@[HuggingFace](https:\u002F\u002Fhuggingface.co\u002F)的帮助，他们升级了Space的GPU以支持GLIP演示！\r\n\r\n* 2022年6月21日：GLIP被选为CVPR 2022最佳论文决赛入围者！\r\n\r\n* 2022年6月16日：ODinW基准发布！GLIP-T A&B发布！\r\n\r\n* 2022年6月13日：GLIPv2已上线Arxiv https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.05836！\r\n\r\n* 2022年4月30日：更新了[Colab演示](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F12x7v-_miN7-SRiziK3Cx4ffJzstBJNqb?usp=sharing)!\r\n\r\n* 2022年4月14日：GLIP已被CVPR 2022接受为口头报告！首次发布了代码和预训练模型！\r\n\r\n* 2021年12月6日：GLIP论文上线arxiv https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.03857。\r\n\r\n* 2021年11月23日：项目页面搭建完毕。\u003Cbr\u002F>\r\n\r\n## 简介\r\n本仓库是[GLIP](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.03857)的项目页面。GLIP在各种对象级识别任务中表现出强大的零样本和少样本迁移能力。\r\n\r\n1. 直接在COCO和LVIS数据集上评估时（未见过COCO中的任何图像），GLIP分别取得了49.8 AP和26.9 AP的成绩，超越了许多有监督基线。\r\n2. 在COCO数据集上进行微调后，GLIP在验证集上达到了60.8 AP，在测试集上达到了61.5 AP，超过了之前的最先进水平。\r\n3. 当迁移到13个下游目标检测任务时，少样本的GLIP表现可与完全有监督的Dynamic Head相媲美。\r\n\r\n我们提供了以下内容的代码：\r\n\r\n1. **预训练** GLIP，使用检测和接地数据；\r\n2. **零样本评估** GLIP在标准基准（COCO、LVIS、Flickr30K）以及自定义COCO格式数据集上的表现；\r\n3. **微调** GLIP在标准基准（COCO）以及自定义COCO格式数据集上的表现；\r\n4. **一个Colab演示**；\r\n5. 用于**野外目标检测基准（ODinW）**的工具包，包含35个下游检测任务。\r\n\r\n请参阅相应章节获取详细说明。\r\n\r\n## 演示\r\n请参阅[链接](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F12x7v-_miN7-SRiziK3Cx4ffJzstBJNqb?usp=sharing)中的Colab演示！\r\n\r\n## 安装与设置\r\n\r\n***环境***\r\n本仓库需要Pytorch>=1.9和torchvision。我们建议使用docker来搭建环境。您可以根据自己的GPU选择使用以下预构建的docker镜像之一：``docker pull pengchuanzhang\u002Fmaskrcnn:ubuntu18-py3.7-cuda10.2-pytorch1.9`` 或 ``docker pull pengchuanzhang\u002Fpytorch:ubuntu20.04_torch1.9-cuda11.3-nccl2.9.9``。\r\n\r\n然后安装以下软件包：\r\n```\r\npip install einops shapely timm yacs tensorboardX ftfy prettytable pymongo\r\npip install transformers \r\npython setup.py build develop --user\r\n```\r\n\r\n***骨干网络检查点。*** 将ImageNet预训练的骨干网络检查点下载到``MODEL``文件夹中。\r\n```\r\nmkdir MODEL\r\nwget https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fmodels\u002Fswin_tiny_patch4_window7_224.pth -O swin_tiny_patch4_window7_224.pth\r\nwget https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fmodels\u002Fswin_large_patch4_window12_384_22k.pth -O swin_large_patch4_window12_384_22k.pth\r\n```\n\n## 模型 zoo\n\n***检查点主机迁移。*** 检查点链接已过期。我们正在将检查点迁移到 https:\u002F\u002Fhuggingface.co\u002Fharold\u002FGLIP\u002Ftree\u002Fmain。目前大多数检查点都可用。我们正尽快托管剩余的检查点。\n\n模型 | COCO [1] | LVIS [2] | LVIS [3] | ODinW [4] | 预训练数据 | 配置  | 权重\n-- | -- | -- | -- | -- | -- | -- | --\nGLIP-T (**A**) | 42.9 \u002F 52.9 | - | 14.2 | ~28.7 | O365 | [配置](configs\u002Fpretrain\u002Fglip_A_Swin_T_O365.yaml) | [权重](https:\u002F\u002Fhuggingface.co\u002FGLIPModel\u002FGLIP\u002Fblob\u002Fmain\u002Fglip_a_tiny_o365.pth)\nGLIP-T (**B**) | 44.9 \u002F 53.8  | - | 13.5 | ~33.2 | O365 | [配置](configs\u002Fpretrain\u002Fglip_Swin_T_O365.yaml) | [权重](https:\u002F\u002Fhuggingface.co\u002FGLIPModel\u002FGLIP\u002Fblob\u002Fmain\u002Fglip_tiny_model_o365.pth)\nGLIP-T (**C**) | 46.7 \u002F 55.1 | 14.3 | [17.7](https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fmodels\u002Fglip_tiny_model_o365_goldg_lvisbest.pth) | 44.4 | O365,GoldG | [配置](configs\u002Fpretrain\u002Fglip_Swin_T_O365_GoldG.yaml) | 
[权重](https:\u002F\u002Fhuggingface.co\u002FGLIPModel\u002FGLIP\u002Fblob\u002Fmain\u002Fglip_tiny_model_o365_goldg.pth)\n**GLIP-T** [5]  | 46.6 \u002F 55.2  | 17.6  | [20.1](https:\u002F\u002Fhuggingface.co\u002FGLIPModel\u002FGLIP\u002Fblob\u002Fmain\u002Fglip_tiny_model_o365_goldg_cc_sbu_lvisbest.pth) | 42.7 | O365,GoldG,CC3M,SBU | [配置](configs\u002Fpretrain\u002Fglip_Swin_T_O365_GoldG.yaml) [6] | [权重](https:\u002F\u002Fhuggingface.co\u002FGLIPModel\u002FGLIP\u002Fblob\u002Fmain\u002Fglip_tiny_model_o365_goldg_cc_sbu.pth)\n**GLIP-L** [7] | 51.4 \u002F 61.7 [8]  | 29.3 | [30.1](https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fmodels\u002Fglip_large_model_lvisbest.pth) | 51.2 | FourODs,GoldG,CC3M+12M,SBU | [配置](configs\u002Fpretrain\u002Fglip_Swin_L.yaml) [9] | [权重](https:\u002F\u002Fhuggingface.co\u002FGLIPModel\u002FGLIP\u002Fblob\u002Fmain\u002Fglip_large_model.pth)\n\n[1] 在 COCO val2017 上的零样本和微调性能。\n\n[2] 使用最后一个预训练检查点在 LVIS minival（APr）上的零样本性能。\n\n[3] 在 LVIS 上，模型在预训练过程中可能会出现轻微过拟合。因此，我们在 LVIS 上报告了两个数值：最后一个检查点的性能（LVIS[2]）以及预训练过程中最佳检查点的性能（LVIS[3]）。\n\n[4] 在 13 个 ODinW 数据集上的零样本性能。GLIP 论文中报告的数字来自预训练过程中的最佳检查点，这可能略高于发布的最后一个检查点的数字，类似于 LVIS 的情况。\n\n[5] 本仓库中发布的 GLIP-T 是在 Conceptual Captions 3M 和 SBU captions 上进行预训练的。它在论文的表 1 和附录 C.3 中有提及。其下游性能与主论文中的 GLIP-T 略有不同。我们将在下一次更新中发布支持使用 CC3M 和 SBU captions 数据进行预训练的功能。\n\n[6] 此配置仅用于零样本评估和微调。支持使用 CC3M 和 SBU captions 数据进行预训练的配置将会更新。\n\n[7] 本仓库中发布的 GLIP-L 是在 Conceptual Captions 3M+12M 和 SBU captions 上进行预训练的。它的性能略优于主论文中的 GLIP-L，因为用于标注这些 caption 数据的模型相比主论文有所改进。我们将在下一次更新中发布支持使用 CC3M+12M 和 SBU captions 数据进行预训练的功能。\n\n[8] 使用了多尺度测试。\n\n[9] 此配置仅用于零样本评估和微调。支持使用 CC3M+12M 和 SBU captions 数据进行预训练的配置将会更新。\n\n\n## 预训练\n\n\n***所需数据。*** 按照 [DATA.md](DATA.md) 准备 ``Objects365``、``Flickr30K`` 和 ``MixedGrounding`` 数据。支持使用 caption 数据（Conceptual Captions 和 SBU captions）进行训练的功能将很快推出。\n\n***命令。***\n\n使用以下命令进行预训练（请相应地更改配置文件；请查看模型 zoo 获取对应的配置；将 ``{output_dir}`` 更改为您希望的输出目录）：\n\n```\npython -m torch.distributed.launch --nnodes 2 --nproc_per_node=16 tools\u002Ftrain_net.py \\\n    --config-file configs\u002Fpretrain\u002Fglip_Swin_T_O365_GoldG.yaml \\\n    --skip-test --use-tensorboard --override_output_dir {output_dir}\n```\n\n对于训练 GLIP-T 模型，我们使用 `nnodes = 2`，`nproc_per_node=16`，在 32GB V100 机器上进行。对于训练 GLIP-L 模型，我们使用 `nnodes = 4`，`nproc_per_node=16`，在 32GB V100 机器上进行。请根据您的本地机器相应地设置环境。\n\n\n## （零样本）评估\n\n### COCO 评估\n\n按照 [DATA.md](DATA.md) 准备 ``COCO\u002Fval2017`` 数据。根据 ``模型 zoo`` 设置 ``{config_file}``、``{model_checkpoint}``；将 ``{output_dir}`` 设置为存储评估结果的文件夹。\n\n```\npython tools\u002Ftest_grounding_net.py --config-file {config_file} --weight {model_checkpoint} \\\n        TEST.IMS_PER_BATCH 1 \\\n        MODEL.DYHEAD.SCORE_AGG \"MEAN\" \\\n        TEST.EVAL_TASK detection \\\n        MODEL.DYHEAD.FUSE_CONFIG.MLM_LOSS False \\\n        OUTPUT_DIR {output_dir}\n```\n\n### LVIS 评估\n\n我们遵循 MDETR 的方法，使用 [FixedAP](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.01066.pdf) 标准进行评估。根据 ``模型 zoo`` 设置 ``{config_file}``、``{model_checkpoint}``。按照 [DATA.md](DATA.md) 准备 ``COCO\u002Fval2017`` 数据。\n\n```\npython -m torch.distributed.launch --nproc_per_node=4 \\\n        tools\u002Ftest_grounding_net.py \\\n        --config-file {config_file} \\\n        --task_config configs\u002Flvis\u002Fminival.yaml \\\n        --weight {model_checkpoint} \\\n        TEST.EVAL_TASK detection OUTPUT_DIR {output_dir} \n        TEST.CHUNKED_EVALUATION 40  TEST.IMS_PER_BATCH 4 SOLVER.IMS_PER_BATCH 4 TEST.MDETR_STYLE_AGGREGATE_CLASS_NUM 3000 MODEL.RETINANET.DETECTIONS_PER_IMG 300 
MODEL.FCOS.DETECTIONS_PER_IMG 300 MODEL.ATSS.DETECTIONS_PER_IMG 300 MODEL.ROI_HEADS.DETECTIONS_PER_IMG 300\n```\n\n如果您希望在 Val 1.0 上进行评估，请将 ``--task_config`` 设置为 ``configs\u002Flvis\u002Fval.yaml``。\n\n### ODinW \u002F 自定义数据集评估\n\nGLIP 支持在自定义数据集上轻松进行评估。目前，代码支持对 [COCO 格式](https:\u002F\u002Fcocodataset.org\u002F#format-data)的数据集进行评估。\n\n我们将以 ODinW 中的 [Aquarium](https:\u002F\u002Fpublic.roboflow.com\u002Fobject-detection\u002Faquarium) 数据集为例，展示如何在自定义的 COCO 格式数据集上进行评估。\n\n1. 从 RoboFlow 下载 COCO 格式的原始数据集，并将其存放在 ``DATASET\u002Fodinw\u002FAquarium`` 目录下。每个训练\u002F验证\u002F测试划分都对应一个 ``annotation`` 文件和一个 ``image`` 文件夹。\n\n2. 从标注文件中移除背景类。这可以通过打开 `_annotations.coco.json` 文件，将 `categories` 中 `id:0` 的条目删除来实现。为方便起见，我们提供了 Aquarium 数据集的修改后的标注文件：\n    ```\n    wget https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fodinw\u002FAquarium\u002FAquarium%20Combined.v2-raw-1024.coco\u002Ftest\u002Fannotations_without_background.json -O DATASET\u002Fodinw\u002FAquarium\u002FAquarium\\ Combined.v2-raw-1024.coco\u002Ftest\u002Fannotations_without_background.json\n    wget https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fodinw\u002FAquarium\u002FAquarium%20Combined.v2-raw-1024.coco\u002Ftrain\u002Fannotations_without_background.json -O DATASET\u002Fodinw\u002FAquarium\u002FAquarium\\ Combined.v2-raw-1024.coco\u002Ftrain\u002Fannotations_without_background.json\n    wget https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fodinw\u002FAquarium\u002FAquarium%20Combined.v2-raw-1024.coco\u002Fvalid\u002Fannotations_without_background.json -O DATASET\u002Fodinw\u002FAquarium\u002FAquarium\\ Combined.v2-raw-1024.coco\u002Fvalid\u002Fannotations_without_background.json\n    ```\n\n4. 然后创建一个 YAML 文件，如 ``configs\u002Fodinw_13\u002FAquarium_Aquarium_Combined.v2-raw-1024.coco.yaml``。YAML 文件中需要注意的几个字段：\n\n    `DATASET.CAPTION_PROMPT` 允许手动更改提示词（默认提示词只是将所有类别拼接在一起）；\n\n    `MODELS.*.NUM_CLASSES` 需要设置为数据集中类别的数量（包括背景类）。例如，Aquarium 数据集有 7 个非背景类别，因此 `MODELS.*.NUM_CLASSES` 设置为 8；\n\n4. 
运行以下命令来评估该数据集。根据“模型库”设置 ``{config_file}`` 和 ``{model_checkpoint}``。将 `{odinw_configs}` 设置为我们刚刚准备的任务 YAML 文件路径。\n\n```bash\npython tools\u002Ftest_grounding_net.py --config-file {config_file} --weight {model_checkpoint} \\\n      --task_config {odinw_configs} \\\n      TEST.IMS_PER_BATCH 1 SOLVER.IMS_PER_BATCH 1 \\\n      TEST.EVAL_TASK detection \\\n      DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding \\\n      DATALOADER.DISTRIBUTE_CHUNK_AMONG_NODE False \\\n      DATASETS.USE_OVERRIDE_CATEGORY True \\\n      DATASETS.USE_CAPTION_PROMPT True\n```\n\n### Flickr30K 评估\n按照 [DATA.md](DATA.md) 中的说明准备 ``Flickr30K`` 数据。根据“模型库”设置 ``{config_file}`` 和 ``{model_checkpoint}``。\n\n```bash\npython tools\u002Ftest_grounding_net.py \\\n        --config-file {config_file} \\\n        --task_config configs\u002Fflickr\u002Ftest.yaml,configs\u002Fflickr\u002Fval.yaml \\\n        --weight {model_checkpoint} \\\n        OUTPUT_DIR {output_dir} TEST.IMS_PER_BATCH 1 SOLVER.IMS_PER_BATCH 1 TEST.MDETR_STYLE_AGGREGATE_CLASS_NUM 100 TEST.EVAL_TASK grounding MODEL.DYHEAD.FUSE_CONFIG.MLM_LOSS False\n```\n\n## 微调\n\n### COCO 微调\n按照 [DATA.md](DATA.md) 中的说明准备 ``COCO`` 数据。根据“模型库”设置 ``{config_file}`` 和 ``{model_checkpoint}``。\n\n以下是微调 Tiny 模型的脚本：\n```bash\npython -m torch.distributed.launch --nproc_per_node=16 tools\u002Ftrain_net.py \\\n       --config-file {config_file} \\\n       --skip-test \\\n       MODEL.WEIGHT {model_checkpoint} \\\n       DATASETS.TRAIN '(\"coco_grounding_train\", )' \\\n       MODEL.BACKBONE.FREEZE_CONV_BODY_AT -1 SOLVER.IMS_PER_BATCH 32 SOLVER.USE_AMP True SOLVER.MAX_EPOCH 24 TEST.DURING_TRAINING False TEST.IMS_PER_BATCH 16 SOLVER.FIND_UNUSED_PARAMETERS False SOLVER.BASE_LR 0.00001 SOLVER.LANG_LR 0.00001 SOLVER.STEPS \\(0.67,0.89\\) DATASETS.DISABLE_SHUFFLE True MODEL.DYHEAD.SCORE_AGG \"MEAN\" TEST.EVAL_TASK detection\n```\n\n对于评估，请遵循“COCO 评估”中的说明。用于微调 Large 模型的脚本将很快发布。\n\n### ODinW \u002F 自定义数据集微调\n按照“ODinW \u002F 自定义数据集评估”中的说明准备数据集。\n\n#### 完整模型微调\n\n对于 1\u002F3\u002F5\u002F10 抽样微调，分别将 `{custom_shot_and_epoch_and_general_copy}` 设置为 “1_200_8”、“3_200_4”、“5_200_2”、“10_200_1”。\n\n对于使用全部数据的微调，将 `{custom_shot_and_epoch_and_general_copy}` 设置为 “0_200_1”；将 `SOLVER.STEP_PATIENCE` 设置为 2；将 `SOLVER.AUTO_TERMINATE_PATIENCE` 设置为 4。\n\n```bash\npython -m torch.distributed.launch --nproc_per_node=4 tools\u002Ffinetune.py \\\n      --config-file {config_file}  --ft-tasks {configs} --skip-test \\\n      --custom_shot_and_epoch_and_general_copy {custom_shot_and_epoch_and_general_copy} \\\n      --evaluate_only_best_on_test --push_both_val_and_test \\\n      MODEL.WEIGHT {model_checkpoint} \\\n      SOLVER.USE_AMP True TEST.DURING_TRAINING True TEST.IMS_PER_BATCH 4 SOLVER.IMS_PER_BATCH 4 SOLVER.WEIGHT_DECAY 0.05 TEST.EVAL_TASK detection DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding MODEL.BACKBONE.FREEZE_CONV_BODY_AT 2 MODEL.DYHEAD.USE_CHECKPOINT True SOLVER.FIND_UNUSED_PARAMETERS False SOLVER.TEST_WITH_INFERENCE True SOLVER.USE_AUTOSTEP True DATASETS.USE_OVERRIDE_CATEGORY True SOLVER.SEED 10 DATASETS.SHUFFLE_SEED 3 DATASETS.USE_CAPTION_PROMPT True DATASETS.DISABLE_SHUFFLE True \\\n      SOLVER.STEP_PATIENCE 3 SOLVER.CHECKPOINT_PER_EPOCH 1.0 SOLVER.AUTO_TERMINATE_PATIENCE 8 SOLVER.MODEL_EMA 0.0 SOLVER.TUNING_HIGHLEVEL_OVERRIDE full\n```\n\n#### 提示词微调\n按照“完整模型微调”的命令执行。但需设置以下超参数：\n```bash\nSOLVER.WEIGHT_DECAY 0.25 \\\nSOLVER.BASE_LR 0.05 \\\nSOLVER.TUNING_HIGHLEVEL_OVERRIDE language_prompt_v2\n```\n\n## 野外目标检测基准测试\n\u003Cimg 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_GLIP_readme_23a90cd4e2af.png\" width=\"400\"> \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_GLIP_readme_bf00a788d883.png\" width=\"300\"> \n\nODinW 最早由 GLIP 提出，并在 [ELEVATER](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08790.pdf) 中得到进一步完善和规范化。GLIP 使用了 13 个下游任务，而完整的 ODinW 则包含 35 个下游任务。该基准测试将以挑战赛的形式在 [2022 年 ECCV 野外计算机视觉研讨会](https:\u002F\u002Fcomputer-vision-in-the-wild.github.io\u002Feccv-2022\u002F)上举办。我们希望我们的代码能够激励社区积极参与这一挑战！\n\nODinW 最初在 GLIP 中被引入，当时仅包含 13 个数据集。随后，我们通过纳入更多来自 RoboFlow 的数据集对其进行了扩展，最终版本共包含 **35** 个数据集。\n\n为了区分这两个版本，我们将 GLIP 所使用的版本称为 ``ODinW-13``，而 CVinW 研讨会所使用的版本则称为 ``ODinW-35``。\n\n本仓库还提供了在 ODinW 上进行训练和评估所需的相关代码。具体说明如下。\n\n#### 下载 ODinW\nRoboFlow 托管了所有原始数据集。我们也在本地托管这些数据集，并提供了一个简单的脚本用于一次性下载全部数据。\n```bash\npython odinw\u002Fdownload_datasets.py\n```\n\n``configs\u002Fodinw_35`` 包含所有数据集的元信息。``configs\u002Fodinw_13`` 则是 GLIP 所使用的数据集。每个数据集均采用 COCO 目标检测格式。\n\n所有 ODinW 数据集都采用 COCO 格式；因此，我们可以直接使用类似的脚本，将预训练模型适配并评估于 ODinW 上。以下是简要回顾。\n\n#### （零样本）评估\n``odinw_configs`` 可以是 ``configs\u002Fodinw_14`` 或 ``configs\u002Fodinw_35`` 中的任意配置文件。\n```bash\npython tools\u002Ftest_grounding_net.py --config-file {config_file} --weight {model_checkpoint} \\\n      --task_config {odinw_configs} \\\n      TEST.IMS_PER_BATCH 1 SOLVER.IMS_PER_BATCH 1 \\\n      TEST.EVAL_TASK detection \\\n      DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding \\\n      DATALOADER.DISTRIBUTE_CHUNK_AMONG_NODE False \\\n      DATASETS.USE_OVERRIDE_CATEGORY True \\\n      DATASETS.USE_CAPTION_PROMPT True\n```\n\n#### 全模型微调\n\n对于 1\u002F3\u002F5\u002F10 抽样微调，分别将 `{custom_shot_and_epoch_and_general_copy}` 设置为 “1_200_8”、“3_200_4”、“5_200_2” 和 “10_200_1”。\n\n对于使用全部数据进行微调，将 `{custom_shot_and_epoch_and_general_copy}` 设置为 “0_200_1”；同时将 SOLVER.STEP_PATIENCE 设为 2，SOLVER.AUTO_TERMINATE_PATIENCE 设为 4。\n\n```bash\npython -m torch.distributed.launch --nproc_per_node=4 tools\u002Ffinetune.py \\\n      --config-file {config_file}  --ft-tasks {odinw_configs} --skip-test \\\n      --custom_shot_and_epoch_and_general_copy {custom_shot_and_epoch_and_general_copy} \\\n      --evaluate_only_best_on_test --push_both_val_and_test \\\n      MODEL.WEIGHT {model_checkpoint} \\\n      SOLVER.USE_AMP True TEST.DURING_TRAINING True TEST.IMS_PER_BATCH 4 SOLVER.IMS_PER_BATCH 4 SOLVER.WEIGHT_DECAY 0.05 TEST.EVAL_TASK detection DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding MODEL.BACKBONE.FREEZE_CONV_BODY_AT 2 MODEL.DYHEAD.USE_CHECKPOINT True SOLVER.FIND_UNUSED_PARAMETERS False SOLVER.TEST_WITH_INFERENCE True SOLVER.USE_AUTOSTEP True DATASETS.USE_OVERRIDE_CATEGORY True SOLVER.SEED 10 DATASETS.SHUFFLE_SEED 3 DATASETS.USE_CAPTION_PROMPT True DATASETS.DISABLE_SHUFFLE True \\\n      SOLVER.STEP_PATIENCE 3 SOLVER.CHECKPOINT_PER_EPOCH 1.0 SOLVER.AUTO_TERMINATE_PATIENCE 8 SOLVER.MODEL_EMA 0.0 SOLVER.TUNING_HIGHLEVEL_OVERRIDE full\n```\n\n#### 提示词微调\n\n对于 1\u002F3\u002F5\u002F10 抽样微调，分别将 `{custom_shot_and_epoch_and_general_copy}` 设置为 “1_200_8”、“3_200_4”、“5_200_2” 和 “10_200_1”。\n\n对于使用全部数据进行微调，将 `{custom_shot_and_epoch_and_general_copy}` 设置为 “0_200_1”；同时将 SOLVER.STEP_PATIENCE 设为 2，SOLVER.AUTO_TERMINATE_PATIENCE 设为 4。\n\n按照“全模型微调”的命令执行即可。但需设置以下超参数：\n```bash\nSOLVER.WEIGHT_DECAY 0.25 \\\nSOLVER.BASE_LR 0.05 \\\nSOLVER.TUNING_HIGHLEVEL_OVERRIDE language_prompt_v2\n```\n\n#### 线性探测\n\n对于 1\u002F3\u002F5\u002F10 抽样微调，分别将 `{custom_shot_and_epoch_and_general_copy}` 设置为 “1_200_8”、“3_200_4”、“5_200_2” 和 “10_200_1”。\n\n对于使用全部数据进行微调，将 `{custom_shot_and_epoch_and_general_copy}` 
设置为 “0_200_1”；同时将 SOLVER.STEP_PATIENCE 设为 2，SOLVER.AUTO_TERMINATE_PATIENCE 设为 4。\n\n按照“全模型微调”的命令执行即可。但需设置以下超参数：\n```bash\nSOLVER.TUNING_HIGHLEVEL_OVERRIDE linear_prob\n```\n\n#### 知识增强推理\nGLIP 还支持知识增强推理。详情请参阅我们的论文 [https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.08790.pdf]。这里我们提供一个如何使用外部知识的示例。请下载专门用于知识增强推理的 GLIP-A 模型：``wget https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fmodels\u002Fglip_a_tiny_o365_knowledge.pth -O MODEL\u002Fglip_a_tiny_o365_knowledge.pth``。\n\n```bash\npython tools\u002Ftest_grounding_net.py --config-file configs\u002Fpretrain\u002Fglip_A_Swin_T_O365.yaml --weight MODEL\u002Fglip_a_tiny_o365_knowledge.pth \\\n      --task_config {odinw_configs} \\\n      TEST.IMS_PER_BATCH 1 SOLVER.IMS_PER_BATCH 1 \\\n      TEST.EVAL_TASK detection \\\n      DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding \\\n      DATALOADER.DISTRIBUTE_CHUNK_AMONG_NODE False \\\n      DATASETS.USE_OVERRIDE_CATEGORY True \\\n      DATASETS.USE_CAPTION_PROMPT True \\\n      GLIPKNOW.KNOWLEDGE_FILE knowledge\u002Fodinw_benchmark35_knowledge_and_gpt3.yaml GLIPKNOW.KNOWLEDGE_TYPE gpt3_and_wiki GLIPKNOW.PARALLEL_LANGUAGE_INPUT True GLIPKNOW.LAN_FEATURE_AGG_TYPE first MODEL.DYHEAD.FUSE_CONFIG.USE_LAYER_SCALE True GLIPKNOW.GPT3_NUM 3 GLIPKNOW.WIKI_AND_GPT3 True\n```\n\n#### 提交您的结果至 ODinW 排行榜\n\n鼓励参赛团队将他们的结果上传至 EvalAI 上的 [ODinW 排行榜](https:\u002F\u002Feval.ai\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F1839\u002Foverview)。从数据标注成本的角度来看，降低对数据的需求能够支持更多应用场景。本次挑战设置了多个赛道：零样本、少样本和全样本。有关各阶段的详细信息，请参阅 ODinW 官网。\n\n1. 对于零样本\u002F全样本设置，预测 JSON 文件的格式要求为：\n```json\n{\n      \"数据集名称（例如 'WildFireSmoke'）\":\n            [value]: value 需遵循 COCO 的结果格式，即包含 [\"image_id\":xxx, \"category_id\":xxx, \"bbox\":xxx, \"score\":xxx]\n}\n```\n请参阅提供的零样本预测文件示例：[all_predictions_zeroshot.json](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1lO66zH141O_0pTiIhRC2lY5y2PxmxGOH\u002Fview?usp=sharing)，以及全样本预测文件示例：[all_predictions_fullshot.json](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1-nLs2ZebfPoiA_qa_vvkbJD96V1RU7Vu\u002Fview?usp=sharing)。\n\n2. 
对于少样本（根据挑战说明为 3 抽样）设置，将使用随机种子 [3, 30, 300] 分别生成三个训练-验证子集。预测 JSON 文件的格式要求为：\n```json\n{\n      \"数据集名称（例如 'WildFireSmoke'）\":{\n            \"随机种子编号（例如 '30'）\":\n                  [value]: value 需遵循 COCO 的结果格式，即包含 [\"image_id\":xxx, \"category_id\":xxx, \"bbox\":xxx, \"score\":xxx]\n     }\n}\n```\n请参阅提供的少样本预测文件示例：[all_predictions_3_shot.json](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F13pDjmSf0ZAZghgiDTONDF0ur5FP8AuLx\u002Fview?usp=sharing)。\n\n## 参考文献\n如果您使用了我们的代码，请考虑引用以下论文：\n``` \n@inproceedings{li2021grounded,\n      title={接地式语言-图像预训练},\n      author={Liunian Harold Li* 和 Pengchuan Zhang* 和 Haotian Zhang* 和 Jianwei Yang 和 Chunyuan Li 和 Yiwu Zhong 和 Lijuan Wang 和 Lu Yuan 和 Lei Zhang 和 Jenq-Neng Hwang 和 Kai-Wei Chang 和 Jianfeng Gao},\n      year={2022},\n      booktitle={CVPR},\n}\n@article{zhang2022glipv2,\n  title={GLIPv2：统一目标定位与视觉-语言理解},\n  author={Zhang, Haotian* 和 Zhang, Pengchuan* 和 Hu, Xiaowei 和 Chen, Yen-Chun 和 Li, Liunian Harold 和 Dai, Xiyang 和 Wang, Lijuan 和 Yuan, Lu 和 Hwang, Jenq-Neng 和 Gao, Jianfeng},\n  journal={arXiv 预印本 arXiv:2206.05836},\n  year={2022}\n}\n@article{li2022elevater,\n  title={ELEVATER：用于评估语言增强型视觉模型的基准和工具包},\n  author={Li*, Chunyuan 和 Liu*, Haotian 和 Li, Liunian Harold 和 Zhang, Pengchuan 和 Aneja, Jyoti 和 Yang, Jianwei 和 Jin, Ping 和 Lee, Yong Jae 和 Hu, Houdong 和 Liu, Zicheng 等},\n  journal={arXiv 预印本 arXiv:2204.08790},\n  year={2022}\n}\n```","# GLIP 快速上手指南\n\nGLIP (Grounded Language-Image Pre-training) 是一个强大的开源模型，具备卓越的零样本（Zero-shot）和少样本（Few-shot）迁移能力，适用于各类物体识别和定位任务。本指南将帮助您快速搭建环境并运行模型。\n\n## 1. 环境准备\n\n### 系统要求\n*   **PyTorch**: 版本 >= 1.9\n*   **Torchvision**: 需与 PyTorch 版本匹配\n*   **GPU**: 推荐 NVIDIA GPU（支持 CUDA）\n*   **操作系统**: Linux (推荐 Ubuntu 18.04\u002F20.04)\n\n### 前置依赖\n官方强烈建议使用 **Docker** 来配置环境，以避免依赖冲突。您可以根据显卡驱动选择以下预构建镜像：\n\n*   **CUDA 10.2 + PyTorch 1.9**:\n    ```bash\n    docker pull pengchuanzhang\u002Fmaskrcnn:ubuntu18-py3.7-cuda10.2-pytorch1.9\n    ```\n*   **CUDA 11.3 + PyTorch 1.9**:\n    ```bash\n    docker pull pengchuanzhang\u002Fpytorch:ubuntu20.04_torch1.9-cuda11.3-nccl2.9.9\n    ```\n\n> **提示**：国内用户若拉取 Docker 镜像较慢，可配置阿里云或腾讯云等国内 Docker 镜像加速器。\n\n## 2. 
安装步骤\n\n在启动 Docker 容器后，执行以下步骤完成安装：\n\n### 2.1 安装 Python 依赖包\n```bash\npip install einops shapely timm yacs tensorboardX ftfy prettytable pymongo\npip install transformers \npython setup.py build develop --user\n```\n> **加速建议**：国内用户建议使用清华源或阿里源加速 pip 安装：\n> `pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple \u003Cpackage_name>`\n\n### 2.2 下载预训练骨干网络 (Backbone)\n创建模型目录并下载 ImageNet 预训练的 Swin Transformer 权重：\n\n```bash\nmkdir MODEL\nwget https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fmodels\u002Fswin_tiny_patch4_window7_224.pth -O swin_tiny_patch4_window7_224.pth\nwget https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fmodels\u002Fswin_large_patch4_window12_384_22k.pth -O swin_large_patch4_window12_384_22k.pth\n```\n> **注意**：如果上述链接下载缓慢，请手动下载后上传至 `MODEL` 文件夹。\n> *   Swin-Tiny: [下载链接](https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fmodels\u002Fswin_tiny_patch4_window7_224.pth)\n> *   Swin-Large: [下载链接](https:\u002F\u002Fpenzhanwu2bbs.blob.core.windows.net\u002Fdata\u002FGLIPv1_Open\u002Fmodels\u002Fswin_large_patch4_window12_384_22k.pth)\n\n### 2.3 获取 GLIP 模型权重\n由于原仓库链接可能过期，推荐从 HuggingFace 下载最新权重：\n*   **GLIP-Tiny (通用版)**: [glip_tiny_model_o365_goldg_cc_sbu.pth](https:\u002F\u002Fhuggingface.co\u002FGLIPModel\u002FGLIP\u002Fblob\u002Fmain\u002Fglip_tiny_model_o365_goldg_cc_sbu.pth)\n*   **GLIP-Large (高性能版)**: [glip_large_model.pth](https:\u002F\u002Fhuggingface.co\u002FGLIPModel\u002FGLIP\u002Fblob\u002Fmain\u002Fglip_large_model.pth)\n\n将下载的权重文件放置在项目根目录或指定路径。\n\n## 3. 基本使用\n\nGLIP 的核心功能是**零样本物体检测**，即无需训练即可通过文本描述检测图像中的物体。\n\n### 3.1 准备数据\n确保您有一张待测试的图片（例如 `test.jpg`），并将其路径配置好。如果是评估标准数据集（如 COCO），请参考官方 `DATA.md` 整理数据格式。\n\n### 3.2 运行零样本检测\n以下命令演示如何使用配置文件和预训练权重对数据集进行评估（以检测任务为例）。\n\n**单卡\u002FCPU 调试模式示例：**\n```bash\npython tools\u002Ftest_grounding_net.py --config-file configs\u002Fpretrain\u002Fglip_Swin_T_O365_GoldG.yaml --weight glip_tiny_model_o365_goldg_cc_sbu.pth \\\n        TEST.IMS_PER_BATCH 1 \\\n        MODEL.DYHEAD.SCORE_AGG \"MEAN\" \\\n        TEST.EVAL_TASK detection \\\n        MODEL.DYHEAD.FUSE_CONFIG.MLM_LOSS False \\\n        OUTPUT_DIR .\u002Foutput_results\n```\n\n**参数说明：**\n*   `--config-file`: 对应模型的配置文件（在 Model Zoo 中查找）。\n*   `--weight`: 您下载的 `.pth` 权重文件路径。\n*   `TEST.EVAL_TASK detection`: 指定任务为物体检测。\n*   `OUTPUT_DIR`: 结果输出目录。\n\n### 3.3 自定义数据集评估\nGLIP 支持直接评估 COCO 格式的自定义数据集：\n1.  将数据集整理为 COCO 格式（包含 `images` 文件夹和 `annotations.json` 文件）。\n2.  确保 JSON 文件中不包含背景类（id: 0）。\n3.  修改配置文件中的数据路径指向您的数据集。\n4.  
运行上述 `test_grounding_net.py` 命令即可得到检测结果。\n\n### 3.4 在线体验\n如果您不想本地部署，可以直接尝试官方提供的 Demo：\n*   **HuggingFace Space**: [GLIP Zero-shot Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fhaotiz\u002Fglip-zeroshot-demo)\n*   **Google Colab**: [GLIP Colab Notebook](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F12x7v-_miN7-SRiziK3Cx4ffJzstBJNqb?usp=sharing)","某电商初创公司的算法团队正急需构建一个能识别海量长尾商品（如特定款式的手办、小众设计师家具）的自动上架系统，但面临标注数据极度匮乏的困境。\n\n### 没有 GLIP 时\n- **冷启动成本极高**：面对成千上万种未见过的商品类别，团队必须人工收集并逐帧标注数千张图片才能训练基础检测模型，耗时数周。\n- **泛化能力严重不足**：传统监督学习模型一旦遇到训练集中未包含的新品类（Zero-shot 场景），完全无法识别，导致新业务线无法快速展开。\n- **迭代响应迟缓**：每当运营部门提出新增一个细分品类（如“复古露营灯”），算法团队都需要重新经历数据采集、标注、训练的全流程，开发周期长达数天。\n- **语义理解割裂**：模型仅能识别预定义的类别 ID，无法直接理解自然语言描述（如“红色的、带轮子的行李箱”），难以支持灵活的搜索式检测需求。\n\n### 使用 GLIP 后\n- **实现零样本检测**：利用 GLIP 强大的图文预训练能力，团队无需任何新图片标注，直接输入商品名称文本即可在图像中精准定位从未见过的长尾商品。\n- **即时响应新需求**：针对新增品类，只需修改输入的自然语言提示词（Prompt），模型即可实时生效，将新业务上线时间从“周级”缩短至“分钟级”。\n- **支持自然语言交互**：GLIP 原生支持接地（Grounding）任务，开发人员可直接用复杂语句（如“找出桌面上所有未包装的易碎品”）进行检索，大幅提升了系统的灵活性。\n- **小样本微调即达 SOTA**：对于极少数需要高精度的核心品类，仅需提供少量示例图片进行微调，GLIP 的性能即可超越以往全监督训练的大型模型。\n\nGLIP 通过将语言理解与视觉检测深度融合，彻底打破了传统目标检测对大规模标注数据的依赖，让开放集物体识别变得像搜索引擎一样简单高效。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_GLIP_de7c5394.png","microsoft","Microsoft","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmicrosoft_4900709c.png","Open source projects and samples from Microsoft",null,"opensource@microsoft.com","OpenAtMicrosoft","https:\u002F\u002Fopensource.microsoft.com","https:\u002F\u002Fgithub.com\u002Fmicrosoft",[84,88,92,96],{"name":85,"color":86,"percentage":87},"Python","#3572A5",90.4,{"name":89,"color":90,"percentage":91},"Cuda","#3A4E3A",7.5,{"name":93,"color":94,"percentage":95},"C++","#f34b7d",1.5,{"name":97,"color":98,"percentage":99},"C","#555555",0.6,2584,216,"2026-04-13T09:38:22","MIT","Linux","必需 NVIDIA GPU。官方训练示例使用 32GB 显存的 V100 显卡（GLIP-T 需 2 节点共 32 卡，GLIP-L 需 4 节点共 64 卡）。推理\u002F评估需求未明确说明，但建议使用高显存显卡。支持 CUDA 10.2 或 CUDA 11.3（取决于选择的 Docker 镜像）。","未说明（鉴于分布式训练配置，建议服务器级大内存）",{"notes":108,"python":109,"dependencies":110},"强烈建议使用 Docker 部署环境，官方提供了基于 Ubuntu 18.04 (CUDA 10.2) 和 Ubuntu 20.04 (CUDA 11.3) 的预构建镜像。需要手动下载 ImageNet 预训练的 Swin Transformer 骨干网络权重到 MODEL 文件夹。部分预训练模型权重托管在 HuggingFace 上。自定义数据集评估需转换为 COCO 格式并移除背景类别。","3.7+",[111,112,113,114,115,116,117,118,119,120],"torch>=1.9","torchvision","einops","shapely","timm","yacs","tensorboardX","ftfy","prettytable","transformers",[15,122],"其他","2026-03-27T02:49:30.150509","2026-04-14T12:34:21.416427",[126,131,136,141,146,151],{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},33100,"运行模型时遇到 'RuntimeError: Not compiled with GPU support' 错误怎么办？","该错误通常由环境配置不匹配引起。解决方案如下：\n1. 尝试使用特定的虚拟环境版本：Pytorch 1.10.1, Torchvision 0.11.2, Torchaudio 0.10.1, CUDA 11.3。\n2. 设置 CUDA_HOME 环境变量指向正确的 CUDA 路径。\n3. 重新运行 setup.py 进行编译安装。\n4. 如果使用的是较新版本的 PyTorch（如 2.4.0），请参考相关 commit 修复兼容性问题。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FGLIP\u002Fissues\u002F41",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},33101,"GLIP 模型是否包含掩码预测（Mask Prediction）头？如何使用？","当前的 GLIP 演示代码中的 `show_mask_heatmaps` 参数主要用于快速处理像素分类，并不直接输出标准的 mask 字段。\n1. 训练确实需要 ground truth mask 作为监督信号。\n2. 在 GLIPv2 中，模型以“接地（grounding）”风格预测掩码，具体细节可参考相关论文。\n3. 
如果需要使用掩码功能，建议关注 GLIPv2 的更新或自行添加二值分割头并使用 GLIP 生成的提议框（proposal bbox）内的像素作为正样本进行监督训练。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FGLIP\u002Fissues\u002F28",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},33102,"在哪里可以找到微调（Fine-tune）的配置文件？","微调配置文件通常位于 `configs` 目录下。如果在配置中看到 `DATALOADER.DISTRIBUTE_CHUNK_AMONG_NODE` 选项，请注意：\n1. 对于常规微调或小规模数据，应将该选项设置为 `false`。\n2. 该选项原本设计用于数十 TB 级别的预训练数据，以便将数据分布到不同节点避免磁盘空间不足。\n3. 最新的 GLIP-L 配置中已移除此选项以避免混淆。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FGLIP\u002Fissues\u002F5",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},33103,"如何对单张图片进行推理并可视化结果？","可以通过以下方式实现单图推理和可视化：\n1. 参考社区提供的 `OD_Demo.ipynb` Notebook 文件，其中包含了完整的推理流程。\n2. 核心代码修改位于 `maskrcnn_benchmark\u002Fengine\u002Fpredictor_glip.py` 文件中。\n3. 如果需要精确匹配实体名称（避免子串匹配错误，如 'A' 匹配到 'As'），可以将 `predictor_glip.py` 中的正则匹配逻辑修改为使用单词边界匹配：`for m in re.finditer(rf'\\b{entity}\\b', caption.lower()):`。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FGLIP\u002Fissues\u002F4",{"id":147,"question_zh":148,"answer_zh":149,"source_url":150},33104,"Object365 数据集的 TSV 格式如何生成？旧版标注与新版本图片数量不一致怎么办？","关于 Object365 数据格式和版本问题：\n1. 仓库中提供的 `train.label.tsv` 基于 Object365 v1 版本。\n2. 由于 v2 版本删除了部分 v1 图片，导致数量不匹配。建议直接从 OpenDataLab 下载 Object365 v1 数据集，其图片数量与提供的标注文件一致。\n3. 如果无法访问特定的 Azure 存储链接，请尝试官方或其他镜像源下载 v1 数据。\n4. 官方目前没有提供绕过 TSV 格式直接进行预训练的脚本，建议按照 v1 数据结构准备数据。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FGLIP\u002Fissues\u002F20",{"id":152,"question_zh":153,"answer_zh":154,"source_url":155},33105,"HuggingFace 或 Gradio 演示中出现颜色识别错误（如橙色和蓝色混淆）的原因是什么？","演示中出现颜色识别错误（例如将橙色识别为蓝色）的原因是图像的 RGB 通道被错误地反转成了 BGR 格式。\n1. 维护者已确认该 Bug 并更新了演示代码。\n2. 如果您在本地复现此问题，请检查图像加载部分是否正确处理了通道顺序（OpenCV 默认读取为 BGR，而 PyTorch 模型通常需要 RGB）。\n3. 请使用更新后的 HuggingFace Demo 或本地代码以获得修正后的结果。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FGLIP\u002Fissues\u002F42",[]]