[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-ChenDelong1999--RemoteCLIP":3,"tool-ChenDelong1999--RemoteCLIP":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",142651,2,"2026-04-06T23:34:12",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107888,"2026-04-06T11:32:50",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 
效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":78,"owner_twitter":79,"owner_website":80,"owner_url":81,"languages":82,"stars":91,"forks":92,"last_commit_at":93,"license":94,"difficulty_score":32,"env_os":95,"env_gpu":96,"env_ram":95,"env_deps":97,"category_tags":105,"github_topics":106,"view_count":32,"oss_zip_url":110,"oss_zip_packed_at":110,"status":17,"created_at":111,"updated_at":112,"faqs":113,"releases":154},4920,"ChenDelong1999\u002FRemoteCLIP","RemoteCLIP","🛰️ Official repository of paper \"RemoteCLIP: A Vision Language Foundation Model for Remote Sensing\" (IEEE TGRS)","RemoteCLIP 是一款专为遥感领域打造的视觉 - 语言基础模型，旨在让计算机像人类一样“看懂”卫星图像并理解相关的文字描述。传统 AI 模型往往难以处理复杂的遥感数据，导致图像检索、自动标注等任务效果不佳。RemoteCLIP 通过在海量遥感图像与文本对上进行训练，成功打通了视觉与语言的界限，显著提升了跨模态检索的准确率，并能零样本迁移到地物分类、目标检测等多种下游任务中。\n\n该工具特别适合遥感领域的研究人员、AI 开发者以及地理信息系统的工程师使用。无论是需要构建高效的图像搜索系统，还是希望利用大模型能力自动解译地表信息，RemoteCLIP 都能提供强有力的支持。其独特亮点在于发布了专门的遥感预训练权重（支持 ResNet 和 ViT 架构），并兼容 OpenCLIP 格式，极大降低了部署门槛；同时，团队还开源了包含数万样本的高质量训练数据集（RET-3 等），为社区复现和二次开发提供了坚实基础。作为发表于 IEEE TGRS 的前沿成果，RemoteCLIP 正成为连接遥感大数据与智能解译的重要桥梁。","\u003Cdiv align=\"center\">\n\n## [RemoteCLIP🛰️: A Vision Language Foundation Model for Remote Sensing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.11029)\n\n[Fan Liu (刘凡)](https:\u002F\u002Fmultimodality.group\u002Fauthor\u002F%E5%88%98%E5%87%A1\u002F)✉ *\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_c61933e8f92a.png\" alt=\"Logo\" width=\"15\">, &nbsp; &nbsp; \n[Delong Chen (陈德龙)](https:\u002F\u002Fchendelong.world\u002F)✉ *\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_1c9090847e00.png\" alt=\"Logo\" width=\"10\">, &nbsp; &nbsp; \n[Zhangqingyun Guan (管张青云)](https:\u002F\u002Fgithub.com\u002Fgzqy1026)\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_c61933e8f92a.png\" alt=\"Logo\" width=\"15\">\n\n[Xiaocong Zhou (周晓聪)](https:\u002F\u002Fmultimodality.group\u002Fauthor\u002F%E5%91%A8%E6%99%93%E8%81%AA\u002F)\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_c61933e8f92a.png\" alt=\"Logo\" width=\"15\">, &nbsp; &nbsp; \n[Jiale Zhu 
(朱佳乐)](https:\u002F\u002Fmultimodality.group\u002Fauthor\u002F%E6%9C%B1%E4%BD%B3%E4%B9%90\u002F)\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_c61933e8f92a.png\" alt=\"Logo\" width=\"15\">, &nbsp; &nbsp; \n\n[Qiaolin Ye (业巧林)](https:\u002F\u002Fit.njfu.edu.cn\u002Fszdw\u002F20181224\u002Fi14059.html)\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_d0284a4a4a8b.png\" alt=\"Logo\" width=\"15\">, &nbsp; &nbsp; \nLiyong Fu (符利勇)\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_c90d51762866.jpg\" alt=\"Logo\" width=\"15\">, &nbsp; &nbsp; \n[Jun Zhou (周峻)](https:\u002F\u002Fexperts.griffith.edu.au\u002F7205-jun-zhou) \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_3d56b5beb288.png\" alt=\"Logo\" width=\"15\">\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_c00396a65f0f.png\" alt=\"Logo\" width=\"100\"> &nbsp; &nbsp;  &nbsp; &nbsp; \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_c3700059fff7.png\" alt=\"Logo\" width=\"100\"> &nbsp; &nbsp;  &nbsp; &nbsp; \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_90b5f4f4a5a2.jpg\" alt=\"Logo\" width=\"50\"> &nbsp; &nbsp;  &nbsp; &nbsp; \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_c90d51762866.jpg\" alt=\"Logo\" width=\"40\"> &nbsp; &nbsp;  &nbsp; &nbsp; \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_983d6cae5ac4.png\" alt=\"Logo\" width=\"90\">\n\n\\* *Equal Contribution*\n\n\u003C\u002Fdiv>\n\n\n### News\n\n- **2024\u002F04\u002F26**: The training dataset of RemoteCLIP (RET-3, SEG-4, DET-10) is released on 🤗HuggingFace, see [[gzqy1026\u002FRemoteCLIP](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fgzqy1026\u002FRemoteCLIP)].\n\n- **2024\u002F04\u002F03**: Our RemoteCLIP paper has been accepted by IEEE Transactions on Geoscience and Remote Sensing (TGRS) [[doi](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10504785)].\n\n- **2024\u002F03\u002F01**: RemoteCLIP joined the leaderboard on [paperswithcode.com](https:\u002F\u002Fpaperswithcode.com\u002Fpaper\u002Fremoteclip-a-vision-language-foundation-model) [![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fremoteclip-a-vision-language-foundation-model\u002Fcross-modal-retrieval-on-rsicd)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fcross-modal-retrieval-on-rsicd?p=remoteclip-a-vision-language-foundation-model) [![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fremoteclip-a-vision-language-foundation-model\u002Fcross-modal-retrieval-on-rsitmd)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fcross-modal-retrieval-on-rsitmd?p=remoteclip-a-vision-language-foundation-model)\n\n- **2023\u002F12\u002F01**: You can now auto-label remote sensing datasets with RemoteCLIP using the [`autodistill-remote-clip`](https:\u002F\u002Fgithub.com\u002Fautodistill\u002Fautodistill-remote-clip) extension in the [Autodistill](https:\u002F\u002Fgithub.com\u002Fautodistill\u002Fautodistill) framework, thanks [James Gallagher](https:\u002F\u002Fjamesg.blog\u002F) from 
Roboflow!\n  \n- **2023\u002F11\u002F07**: To facilitate reproducing RemoteCLIP's SOTA image-text retrieval results, we have prepared a `retrieval.py` script for retrieval evaluation on RSITMD, RSICD, and UCM datasets. Please see the [Retrieval Evaluation](#retrieval-evaluation) section for details.\n\n- **2023\u002F07\u002F27**: We make pretrained checkpoints of RemoteCLIP models (`ResNet-50`, `ViT-base-32`, and `ViT-large-14`) available! We converted the weights to the [`OpenCLIP`](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fopen_clip) format, such that loading and using RemoteCLIP is extremely easy! Please see the [Load RemoteCLIP](#load-remoteclip) section for details. We also provide a Jupyter Notebook [demo.ipynb](demo.ipynb). You can also [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FChenDelong1999\u002FRemoteCLIP\u002Fblob\u002Fmain\u002FRemoteCLIP_colab_demo.ipynb), thanks [Dr. Gordon McDonald](https:\u002F\u002Fgithub.com\u002Fgdmcdonald) from the University of Sydney!\n\n- **2023\u002F06\u002F19**: We propose RemoteCLIP, the first vision-language foundation model for remote sensing. The preprint of our RemoteCLIP paper is available on arxiv [[2306.11029]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.11029).\n\n### Introduction\n\nWelcome to the official repository of our paper \"[*RemoteCLIP: A Vision Language Foundation Model for Remote Sensing*](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.11029)\"! \n\nGeneral-purpose foundation models have become increasingly important in the field of artificial intelligence. While self-supervised learning (SSL) and Masked Image Modeling (MIM) have led to promising results in building such foundation models for remote sensing, these models primarily learn low-level features, require annotated data for fine-tuning, and are not applicable for retrieval and zero-shot applications due to the lack of language understanding. \n\n**In response to these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing that aims to learn robust visual features with rich semantics, as well as aligned text embeddings for seamless downstream application.** To address the scarcity of pre-training data, we leverage data scaling, converting heterogeneous annotations based on Box-to-Caption (B2C) and Mask-to-Box (M2B) conversion, and further incorporating UAV imagery, resulting in a 12x larger pretraining dataset. \n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_8280ae70a559.png)\n\nRemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, k-NN classification, few-shot classification, image-text retrieval, and object counting. Evaluations on 16 datasets, including a newly introduced RemoteCount benchmark to test the object counting ability, show that RemoteCLIP consistently outperforms baseline foundation models across different model scales. \n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_14cb0cb016e0.png)\n\n**Impressively, RemoteCLIP outperforms previous SoTA by 9.14% mean recall on the RSICD dataset and by 8.92% on the RSITMD dataset. 
For zero-shot classification, our RemoteCLIP outperforms the CLIP baseline by up to 6.39% average accuracy on 12 downstream datasets.**\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_23a37dc98ca6.png)\n\n\n### Load RemoteCLIP\n\nRemoteCLIP is trained with the [`ITRA`](https:\u002F\u002Fitra.readthedocs.io) codebase, and we have converted the pretrained checkpoints to [`OpenCLIP`](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fopen_clip) compatible format and uploaded them to [[this Huggingface Repo]](https:\u002F\u002Fhuggingface.co\u002Fchendelong\u002FRemoteCLIP\u002Ftree\u002Fmain), so that accessing the model is more convenient!\n\n- To load RemoteCLIP, please first prepare an environment with [OpenCLIP](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fopen_clip) installed, for example, by running this command:\n\n    ```bash\n    # https:\u002F\u002Fpypi.org\u002Fproject\u002Fopen-clip-torch\u002F\n    pip install open-clip-torch\n    ```\n\n- Then, download the pretrained checkpoint from [huggingface](https:\u002F\u002Fhuggingface.co\u002Fchendelong\u002FRemoteCLIP); you can clone the repo with Git LFS, or download it automatically via [huggingface_hub](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fhuggingface_hub):\n\n    ```python\n    from huggingface_hub import hf_hub_download\n\n    for model_name in ['RN50', 'ViT-B-32', 'ViT-L-14']:\n        checkpoint_path = hf_hub_download(\"chendelong\u002FRemoteCLIP\", f\"RemoteCLIP-{model_name}.pt\", cache_dir='checkpoints')\n        print(f'{model_name} is downloaded to {checkpoint_path}.')\n\n    ```\n\n- Now, you can initialize a CLIP model with `OpenCLIP`, then load the RemoteCLIP checkpoint with a few lines of code:\n\n    ```python\n    import torch, open_clip\n    from PIL import Image\n\n    model_name = 'ViT-L-14' # 'RN50' or 'ViT-B-32' or 'ViT-L-14'\n    model, _, preprocess = open_clip.create_model_and_transforms(model_name)\n    tokenizer = open_clip.get_tokenizer(model_name)\n\n    ckpt = torch.load(f\"path\u002Fto\u002Fyour\u002Fcheckpoints\u002FRemoteCLIP-{model_name}.pt\", map_location=\"cpu\")\n    message = model.load_state_dict(ckpt)\n    print(message)\n\n    model = model.cuda().eval()\n    ```\n\n- The following is an example of text-to-image retrieval with RemoteCLIP:\n\n    ```python\n    text_queries = [\n        \"A busy airport with many airplanes.\", \n        \"Satellite view of Hohai University.\", \n        \"A building next to a lake.\", \n        \"Many people in a stadium.\", \n        \"a cute cat\",\n        ]\n    text = tokenizer(text_queries)\n    image = preprocess(Image.open(\"assets\u002Fairport.jpg\")).unsqueeze(0)\n\n    with torch.no_grad(), torch.cuda.amp.autocast():\n        image_features = model.encode_image(image.cuda())\n        text_features = model.encode_text(text.cuda())\n        image_features \u002F= image_features.norm(dim=-1, keepdim=True)\n        text_features \u002F= text_features.norm(dim=-1, keepdim=True)\n\n        text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1).cpu().numpy()[0]\n\n    print(f'Predictions of {model_name}:')\n    for query, prob in zip(text_queries, text_probs):\n        print(f\"{query:\u003C40} {prob * 100:5.1f}%\")\n    ```\n\n    You should get the following output:\n    ```\n    Predictions of ViT-L-14:\n    A busy airport with many airplanes.      
100.0%\n    Satellite view of Hohai University.        0.0%\n    A building next to a lake.                 0.0%\n    Many people in a stadium.                  0.0%\n    a cute cat                                 0.0%\n    ```\n\n    \u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_3594d0e766b5.jpg\" alt=\"airport\" width=\"224\">\n    \u003C\u002Fdiv>\n\n    You can run the above code in [demo.ipynb](demo.ipynb), and you can also run it directly in Colab: [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FChenDelong1999\u002FRemoteCLIP\u002Fblob\u002Fmain\u002FRemoteCLIP_colab_demo.ipynb)\n\n\n### Retrieval Evaluation\n> To perform cross-modal retrieval with RemoteCLIP, we extract image and text representations on the test split, perform L-2 normalization, and retrieve the most similar samples based on the dot-product similarity measure. We show the retrieval recall of top-1 (R@1), top-5 (R@5), top-10 (R@10), and the mean recall of these values.\n\nWe have prepared a `retrieval.py` script to replicate the retrieval evaluation. Follow the steps below to evaluate the retrieval performance of RemoteCLIP on the RSITMD, RSICD, and UCM datasets:\n\n- To run the retrieval evaluation, please first install additional dependencies: `pip install clip_benchmark`.\n- Then download and extract image-text datasets [RSITMD](https:\u002F\u002Fgithub.com\u002Fxiaoyuan1996\u002FAMFMN\u002Fblob\u002Fmaster\u002FRSITMD\u002FREADME.md),\n[RSICD](https:\u002F\u002Fgithub.com\u002F201528014227051\u002FRSICD_optimal), and\n[UCM](https:\u002F\u002Faistudio.baidu.com\u002Fdatasetdetail\u002F90740).\n- Execute the following command to obtain image-to-text and text-to-image retrieval results:\n\n    ```bash\n    python retrieval.py \\\n      --model-name \"ViT-B-32\" \\\n      --retrieval-images-dir \"\u002Fpath\u002Fto\u002Frsitmd\u002Fimages\" \\\n      --retrieval-json-dir \"\u002Fpath\u002Fto\u002Fdataset_rsitmd.json\" \\\n      --remoteclip-path \"\u002Fpath\u002Fto\u002FRemoteCLIP_ViT-B-32.pt\"\n    ```\n\n\n### Acknowledgments\n\n- Thanks Wenwen Cai (蔡雯雯) for her efforts on the RemoteCount dataset.\n- Thanks [Dr. 
Gordon McDonald](https:\u002F\u002Fgithub.com\u002Fgdmcdonald) for making Jupyter Notebook available in Colab!\n- Thanks [James Gallagher](https:\u002F\u002Fjamesg.blog\u002F) for integrating RemoteCLIP into [autodistill](https:\u002F\u002Fgithub.com\u002Fautodistill\u002Fautodistill-remote-clip)!\n\n### Citation\n\nIf you find this work useful, please cite our paper as:\n\n```bibtex\n@article{remoteclip,\n  author       = {Fan Liu and\n                  Delong Chen and\n                  Zhangqingyun Guan and\n                  Xiaocong Zhou and\n                  Jiale Zhu and\n                  Qiaolin Ye and\n                  Liyong Fu and\n                  Jun Zhou},\n  title        = {RemoteCLIP: {A} Vision Language Foundation Model for Remote Sensing},\n  journal      = {{IEEE} Transactions on Geoscience and Remote Sensing},\n  volume       = {62},\n  pages        = {1--16},\n  year         = {2024},\n  url          = {https:\u002F\u002Fdoi.org\u002F10.1109\u002FTGRS.2024.3390838},\n  doi          = {10.1109\u002FTGRS.2024.3390838},\n}\n```\n","\u003Cdiv align=\"center\">\n\n## [RemoteCLIP🛰️：用于遥感的视觉语言基础模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.11029)\n\n[Fan Liu (刘凡)](https:\u002F\u002Fmultimodality.group\u002Fauthor\u002F%E5%88%98%E5%87%A1\u002F)✉ *\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_c61933e8f92a.png\" alt=\"Logo\" width=\"15\">, &nbsp; &nbsp; \n[Delong Chen (陈德龙)](https:\u002F\u002Fchendelong.world\u002F)✉ *\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_1c9090847e00.png\" alt=\"Logo\" width=\"10\">, &nbsp; &nbsp; \n[Zhangqingyun Guan (管张青云)](https:\u002F\u002Fgithub.com\u002Fgzqy1026)\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_c61933e8f92a.png\" alt=\"Logo\" width=\"15\">\n\n[Xiaocong Zhou (周晓聪)](https:\u002F\u002Fmultimodality.group\u002Fauthor\u002F%E5%91%A8%E6%99%93%E8%81%AA\u002F)\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_c61933e8f92a.png\" alt=\"Logo\" width=\"15\">, &nbsp; &nbsp; \n[Jiale Zhu (朱佳乐)](https:\u002F\u002Fmultimodality.group\u002Fauthor\u002F%E6%9C%B1%E4%BD%B3%E4%B9%90\u002F)\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_c61933e8f92a.png\" alt=\"Logo\" width=\"15\">, &nbsp; &nbsp; \n\n[Qiaolin Ye (业巧林)](https:\u002F\u002Fit.njfu.edu.cn\u002Fszdw\u002F20181224\u002Fi14059.html)\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_d0284a4a4a8b.png\" alt=\"Logo\" width=\"15\">, &nbsp; &nbsp; \nLiyong Fu (符利勇)\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_c90d51762866.jpg\" alt=\"Logo\" width=\"15\">, &nbsp; &nbsp; \n[Jun Zhou (周峻)](https:\u002F\u002Fexperts.griffith.edu.au\u002F7205-jun-zhou) \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_3d56b5beb288.png\" alt=\"Logo\" width=\"15\">\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_c00396a65f0f.png\" alt=\"Logo\" width=\"100\"> &nbsp; &nbsp;  &nbsp; &nbsp; \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_c3700059fff7.png\" alt=\"Logo\" width=\"100\"> &nbsp; &nbsp;  &nbsp; &nbsp; \n\u003Cimg 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_90b5f4f4a5a2.jpg\" alt=\"Logo\" width=\"50\"> &nbsp; &nbsp;  &nbsp; &nbsp; \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_c90d51762866.jpg\" alt=\"Logo\" width=\"40\"> &nbsp; &nbsp;  &nbsp; &nbsp; \n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_983d6cae5ac4.png\" alt=\"Logo\" width=\"90\">\n\n\\* *同等贡献*\n\n\u003C\u002Fdiv>\n\n\n### 新闻\n\n- **2024年4月26日**：RemoteCLIP的训练数据集（RET-3、SEG-4、DET-10）已在🤗HuggingFace上发布，详情请参见[[gzqy1026\u002FRemoteCLIP](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fgzqy1026\u002FRemoteCLIP)]。\n\n- **2024年4月3日**：我们的RemoteCLIP论文已被IEEE地球科学与遥感汇刊（TGRS）接收[[doi](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10504785)]。\n\n- **2024年3月1日**：RemoteCLIP已加入[paperswithcode.com](https:\u002F\u002Fpaperswithcode.com\u002Fpaper\u002Fremoteclip-a-vision-language-foundation-model)排行榜 [![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fremoteclip-a-vision-language-foundation-model\u002Fcross-modal-retrieval-on-rsicd)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fcross-modal-retrieval-on-rsicd?p=remoteclip-a-vision-language-foundation-model) [![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fremoteclip-a-vision-language-foundation-model\u002Fcross-modal-retrieval-on-rsitmd)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fcross-modal-retrieval-on-rsitmd?p=remoteclip-a-vision-language-foundation-model)\n\n- **2023年12月1日**：现在你可以使用RemoteCLIP通过[Autodistill](https:\u002F\u002Fgithub.com\u002Fautodistill\u002Fautodistill)框架中的[`autodistill-remote-clip`](https:\u002F\u002Fgithub.com\u002Fautodistill\u002Fautodistill-remote-clip)扩展来自动标注遥感数据集，感谢来自Roboflow的[James Gallagher](https:\u002F\u002Fjamesg.blog\u002F)！\n\n- **2023年11月7日**：为了便于复现RemoteCLIP在图像文本检索任务上的SOTA结果，我们准备了一个`retrieval.py`脚本，用于在RSITMD、RSICD和UCM数据集上进行检索评估。详细信息请参见[检索评估](#retrieval-evaluation)部分。\n\n- **2023年7月27日**：我们发布了RemoteCLIP模型的预训练检查点（`ResNet-50`、`ViT-base-32`和`ViT-large-14`）！我们将权重转换为[`OpenCLIP`](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fopen_clip)格式，使得加载和使用RemoteCLIP变得极其简单！详细信息请参见[加载RemoteCLIP](#load-remoteclip)部分。我们还提供了一个Jupyter Notebook演示文件[demo.ipynb](demo.ipynb)。你也可以[![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FChenDelong1999\u002FRemoteCLIP\u002Fblob\u002Fmain\u002FRemoteCLIP_colab_demo.ipynb)，感谢来自悉尼大学的[Gordon McDonald博士](https:\u002F\u002Fgithub.com\u002Fgdmcdonald)！\n\n- **2023年6月19日**：我们提出了RemoteCLIP，这是首个用于遥感的视觉语言基础模型。我们的RemoteCLIP论文预印本已在arxiv上发布[[2306.11029]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.11029)。\n\n### 简介\n\n欢迎来到我们论文“[*RemoteCLIP：用于遥感的视觉语言基础模型*](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.11029)”的官方仓库！\n\n通用基础模型在人工智能领域的重要性日益凸显。尽管自监督学习（SSL）和掩码图像建模（MIM）在构建遥感领域的此类基础模型方面取得了令人鼓舞的成绩，但这些模型主要学习低级特征，需要标注数据进行微调，并且由于缺乏语言理解能力，无法应用于检索和零样本任务。\n\n**针对这些局限性，我们提出了RemoteCLIP，这是首个用于遥感的视觉语言基础模型，旨在学习具有丰富语义的稳健视觉特征，以及对齐的文本嵌入，以实现无缝的下游应用。** 
为了解决预训练数据稀缺的问题，我们采用了数据扩增技术，通过基于Box-to-Caption（B2C）和Mask-to-Box（M2B）转换的异构标注转换，并进一步引入无人机影像，最终构建了一个规模扩大12倍的预训练数据集。\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_8280ae70a559.png)\n\nRemoteCLIP可应用于多种下游任务，包括零样本图像分类、线性探测、k-NN分类、少样本分类、图像文本检索以及目标计数等。在16个数据集上的评估，其中包括一个新引入的RemoteCount基准测试目标计数能力，表明RemoteCLIP在不同模型规模下均持续优于基线基础模型。\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_14cb0cb016e0.png)\n\n**令人印象深刻的是，RemoteCLIP在RSICD数据集上的平均召回率比之前的SoTA高出9.14%，而在RSITMD数据集上则高出8.92%。对于零样本分类任务，我们的RemoteCLIP在12个下游数据集上的平均准确率比CLIP基线高出多达6.39%。**\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_23a37dc98ca6.png)\n\n### 加载 RemoteCLIP\n\nRemoteCLIP 是使用 [`ITRA`](https:\u002F\u002Fitra.readthedocs.io) 代码库训练的，我们已将预训练检查点转换为与 [`OpenCLIP`](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fopen_clip) 兼容的格式，并上传至 [[此 Huggingface 仓库]](https:\u002F\u002Fhuggingface.co\u002Fchendelong\u002FRemoteCLIP\u002Ftree\u002Fmain)，以便更便捷地访问该模型！\n\n- 要加载 RemoteCILP，首先请准备一个安装了 [OpenCLIP](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fopen_clip) 的环境，例如通过运行以下命令：\n\n    ```bash\n    # https:\u002F\u002Fpypi.org\u002Fproject\u002Fopen-clip-torch\u002F\n    pip install open-clip-torch\n    ```\n\n- 然后，从 [huggingface](https:\u002F\u002Fhuggingface.co\u002Fchendelong\u002FRemoteCLIP) 下载预训练检查点，您可以使用 Git LFS 克隆仓库，或通过 [huggingface_hub](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fhuggingface_hub) 自动下载：\n\n    ```python\n    from huggingface_hub import hf_hub_download\n\n    for model_name in ['RN50', 'ViT-B-32', 'ViT-L-14']:\n        checkpoint_path = hf_hub_download(\"chendelong\u002FRemoteCLIP\", f\"RemoteCLIP-{model_name}.pt\", cache_dir='checkpoints')\n        print(f'{model_name} is downloaded to {checkpoint_path}.')\n\n    ```\n\n- 现在，您可以通过 `OpenCLIP` 初始化一个 CLIP 模型，然后用几行代码加载 RemoteCLIP 检查点：\n\n    ```python\n    import torch, open_clip\n    from PIL import Image\n\n    model_name = 'ViT-L-14' # 'RN50' 或 'ViT-B-32' 或 'ViT-L-14'\n    model, _, preprocess = open_clip.create_model_and_transforms(model_name)\n    tokenizer = open_clip.get_tokenizer(model_name)\n\n    ckpt = torch.load(f\"path\u002Fto\u002Fyour\u002Fcheckpoints\u002FRemoteCLIP-{model_name}.pt\", map_location=\"cpu\")\n    message = model.load_state_dict(ckpt)\n    print(message)\n\n    model = model.cuda().eval()\n    ```\n\n- 以下是使用 RemoteCILP 进行文本到图像检索的示例：\n\n    ```python\n    text_queries = [\n        \"繁忙的机场，有许多飞机。\", \n        \"河海大学的卫星视图。\", \n        \"湖边的一栋建筑。\", \n        \"体育场里有很多人。\", \n        \"一只可爱的小猫\",\n        ]\n    text = tokenizer(text_queries)\n    image = preprocess(Image.open(\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_3594d0e766b5.jpg\")).unsqueeze(0)\n\n    with torch.no_grad(), torch.cuda.amp.autocast():\n        image_features = model.encode_image(image.cuda())\n        text_features = model.encode_text(text.cuda())\n        image_features \u002F= image_features.norm(dim=-1, keepdim=True)\n        text_features \u002F= text_features.norm(dim=-1, keepdim=True)\n\n        text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1).cpu().numpy()[0]\n\n    print(f'Predictions of {model_name}:')\n    for query, prob in zip(text_queries, text_probs):\n        print(f\"{query:\u003C40} {prob * 100:5.1f}%\")\n    ```\n\n    您可能会得到如下输出：\n    ```\n    Predictions of RN50:\n    繁忙的机场，有许多飞机。      100.0%\n    河海大学的卫星视图。        0.0%\n    湖边的一栋建筑。                 0.0%\n   
 体育场里有很多人。                  0.0%\n    一只可爱的小猫                                 0.0%\n    ```\n\n    \u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_readme_3594d0e766b5.jpg\" alt=\"airport\" width=\"224\">\n    \u003C\u002Fdiv>\n\n    您可以在 [demo.ipynb](demo.ipynb) 中运行上述代码，也可以点击 [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FChenDelong1999\u002FRemoteCLIP\u002Fblob\u002Fmain\u002FRemoteCLIP_colab_demo.ipynb) 直接在 Colab 中运行！\n\n\n### 检索评估\n> 为了使用 RemoteCLIP 进行跨模态检索，我们在测试集上提取图像和文本特征，进行 L-2 归一化，并基于点积相似度度量检索最相似的样本。我们展示了 top-1（R@1）、top-5（R@5）、top-10（R@10）的检索准确率，以及这些值的平均准确率。\n\n我们准备了一个 `retrieval.py` 脚本用于复现检索评估。请按照以下步骤评估 RemoteCLIP 在 RSITMD、RSICD 和 UCM 数据集上的检索性能：\n\n- 要运行检索评估，首先请安装额外的依赖：`pip install clip_benchmark`。\n- 然后下载并解压图像-文本数据集 [RSITMD](https:\u002F\u002Fgithub.com\u002Fxiaoyuan1996\u002FAMFMN\u002Fblob\u002Fmaster\u002FRSITMD\u002FREADME.md)、[RSICD](https:\u002F\u002Fgithub.com\u002F201528014227051\u002FRSICD_optimal) 和 [UCM](https:\u002F\u002Faistudio.baidu.com\u002Fdatasetdetail\u002F90740)。\n- 执行以下命令以获得图像到文本和文本到图像的检索结果：\n\n    ```bash\n    python retrieval.py \\\n      --model-name \"ViT-B-32\" \\\n      --retrieval-images-dir \"\u002Fpath\u002Fto\u002Frsitmd\u002Fimages\" \\\n      --retrieval-json-dir \"\u002Fpath\u002Fto\u002Fdataset_rsitmd.json\" \\\n      --remoteclip-path \"\u002Fpath\u002Fto\u002FRemoteCLIP_ViT-B-32.pt\"\n    ```\n\n\n### 致谢\n\n- 感谢蔡雯雯在 RemoteCount 数据集方面所做的努力。\n- 感谢 [Dr. Gordon McDonald](https:\u002F\u002Fgithub.com\u002Fgdmcdonald) 让 Jupyter Notebook 可以在 Colab 中使用！\n- 感谢 [James Gallagher](https:\u002F\u002Fjamesg.blog\u002F) 将 RemoteCLIP 集成到 [autodistill](https:\u002F\u002Fgithub.com\u002Fautodistill\u002Fautodistill-remote-clip) 中！\n\n### 引用\n\n如果您觉得这项工作有用，请按以下格式引用我们的论文：\n\n```bibtex\n@article{remoteclip,\n  author       = {Fan Liu and\n                  Delong Chen and\n                  Zhangqingyun Guan and\n                  Xiaocong Zhou and\n                  Jiale Zhu and\n                  Qiaolin Ye and\n                  Liyong Fu and\n                  Jun Zhou},\n  title        = {RemoteCLIP: {A} Vision Language Foundation Model for Remote Sensing},\n  journal      = {{IEEE} Transactions on Geoscience and Remote Sensing},\n  volume       = {62},\n  pages        = {1--16},\n  year         = {2024},\n  url          = {https:\u002F\u002Fdoi.org\u002F10.1109\u002FTGRS.2024.3390838},\n  doi          = {10.1109\u002FTGRS.2024.3390838},\n}\n```","# RemoteCLIP 快速上手指南\n\nRemoteCLIP 是首个专为遥感领域设计的视觉 - 语言基础模型，支持零样本图像分类、图文检索、目标计数等下游任务。本指南将帮助您快速在本地环境中加载并使用该模型。\n\n## 环境准备\n\n*   **系统要求**：Linux \u002F macOS \u002F Windows (推荐 Linux)\n*   **Python 版本**：3.8 及以上\n*   **硬件要求**：推荐使用 NVIDIA GPU 以加速推理（需安装 CUDA 对应的 PyTorch 版本）\n*   **前置依赖**：\n    *   `torch` (PyTorch)\n    *   `open-clip-torch`\n    *   `huggingface_hub` (用于自动下载模型权重)\n    *   `Pillow` (图像处理)\n\n> **国内加速建议**：如果访问 Hugging Face 或 PyPI 较慢，建议配置国内镜像源。\n> *   pip 源：`pip config set global.index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`\n> *   Hugging Face 镜像：设置环境变量 `export HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com`\n\n## 安装步骤\n\n1.  **安装 OpenCLIP 及相关依赖**\n    通过 pip 安装核心库 `open-clip-torch`，它已包含所需的 torch 和 pillow 依赖（若未安装 torch 请先单独安装）。\n\n    ```bash\n    pip install open-clip-torch huggingface_hub\n    ```\n\n2.  
**下载预训练模型权重**\n    RemoteCLIP 提供了三种架构的权重：`RN50` (ResNet-50), `ViT-B-32`, `ViT-L-14`。以下脚本将自动从 Hugging Face 下载权重到本地 `checkpoints` 目录。\n\n    ```python\n    from huggingface_hub import hf_hub_download\n\n    # 可选模型名称：'RN50', 'ViT-B-32', 'ViT-L-14'\n    model_name = 'ViT-L-14' \n    \n    checkpoint_path = hf_hub_download(\n        \"chendelong\u002FRemoteCLIP\", \n        f\"RemoteCLIP-{model_name}.pt\", \n        cache_dir='checkpoints'\n    )\n    print(f'Model downloaded to: {checkpoint_path}')\n    ```\n\n## 基本使用\n\n以下示例演示如何加载模型并进行最简单的**图文检索**（计算一张机场图片与多个文本描述匹配置信度）。\n\n```python\nimport torch\nimport open_clip\nfrom PIL import Image\n\n# 1. 初始化模型\n# 可选：'RN50', 'ViT-B-32', 'ViT-L-14'\nmodel_name = 'ViT-L-14' \nmodel, _, preprocess = open_clip.create_model_and_transforms(model_name)\ntokenizer = open_clip.get_tokenizer(model_name)\n\n# 2. 加载 RemoteCLIP 权重\n# 请替换为实际下载的路径\nckpt_path = 'checkpoints\u002Fchendelong--RemoteCLIP\u002FRemoteCLIP-ViT-L-14.pt' \nckpt = torch.load(ckpt_path, map_location=\"cpu\")\nmodel.load_state_dict(ckpt)\n\n# 移至 GPU 并设为评估模式\nmodel = model.cuda().eval()\n\n# 3. 准备数据\ntext_queries = [\n    \"A busy airport with many airplanes.\", \n    \"Satellite view of Hohai University.\", \n    \"A building next to a lake.\", \n    \"Many people in a stadium.\", \n    \"a cute cat\",\n]\n# 编码文本\ntext = tokenizer(text_queries).cuda()\n\n# 加载并预处理图片 (示例使用 assets\u002Fairport.jpg，请替换为您的图片路径)\nimage = preprocess(Image.open(\"assets\u002Fairport.jpg\")).unsqueeze(0).cuda()\n\n# 4. 推理计算\nwith torch.no_grad(), torch.cuda.amp.autocast():\n    image_features = model.encode_image(image)\n    text_features = model.encode_text(text)\n    \n    # 归一化特征\n    image_features \u002F= image_features.norm(dim=-1, keepdim=True)\n    text_features \u002F= text_features.norm(dim=-1, keepdim=True)\n\n    # 计算相似度并转换为概率\n    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1).cpu().numpy()[0]\n\n# 5. 输出结果\nprint(f'Predictions of {model_name}:')\nfor query, prob in zip(text_queries, text_probs):\n    print(f\"{query:\u003C40} {prob * 100:5.1f}%\")\n```\n\n**预期输出示例：**\n```text\nPredictions of ViT-L-14:\nA busy airport with many airplanes.      100.0%\nSatellite view of Hohai University.        0.0%\nA building next to a lake.                 0.0%\nMany people in a stadium.                  0.0%\na cute cat                                 0.0%\n```","某省级自然资源监测中心急需从海量历史卫星影像中快速定位特定地物（如“洪水淹没的农田”或“新建的光伏电站”），以支持应急决策和规划审批。\n\n### 没有 RemoteCLIP 时\n- 分析师必须依赖传统分类模型，需先人工标注成千上万张样本才能训练识别新目标，耗时数周且无法应对突发需求。\n- 检索只能基于预设的固定类别标签，无法理解“被云层遮挡的港口”等复杂自然语言描述，导致大量相关影像被漏检。\n- 面对跨传感器（如光学与雷达）数据时，需分别建立多套处理流程，数据孤岛严重，难以进行统一的多模态关联分析。\n- 专家需逐目视解译筛选候选区域，人力成本极高且容易因疲劳产生主观误判，响应速度远滞后于灾害变化。\n\n### 使用 RemoteCLIP 后\n- 利用 RemoteCLIP 的零样本能力，直接输入文本指令即可在未见过的影像中精准定位目标，将新任务上线时间从数周缩短至分钟级。\n- 支持复杂的语义检索，系统能准确理解并找回符合“洪水淹没的农田”等长尾描述的图像，显著提升了检索召回率和准确性。\n- 凭借统一的视觉 - 语言基础模型架构，RemoteCLIP 天然兼容多源遥感数据，实现了跨模态数据的无缝对齐与联合分析。\n- 自动化生成初步解译结果供专家复核，大幅减少人工浏览工作量，使监测团队能将精力集中于高价值决策而非重复劳动。\n\nRemoteCLIP 通过将自然语言理解引入遥感领域，彻底打破了传统模型对标注数据的依赖，让卫星影像分析变得像搜索网页一样简单高效。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FChenDelong1999_RemoteCLIP_14cb0cb0.png","ChenDelong1999","Delong Chen (陈德龙)","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FChenDelong1999_2b21fe80.jpg","Ph.D. 
Student at HKUST & AMI Labs","The Hong Kong University of Science and Technology","Hong Kong SAR","delong.chen@connect.ust.hk","Delong0_0","http:\u002F\u002Fchendelong.world\u002F","https:\u002F\u002Fgithub.com\u002FChenDelong1999",[83,87],{"name":84,"color":85,"percentage":86},"Jupyter Notebook","#DA5B0B",97.3,{"name":88,"color":89,"percentage":90},"Python","#3572A5",2.7,530,27,"2026-04-06T05:45:03","Apache-2.0","未说明","需要 NVIDIA GPU (代码示例使用 .cuda())，具体显存大小取决于模型版本 (ResNet-50, ViT-B-32, ViT-L-14)，ViT-L-14 建议较大显存",{"notes":98,"python":95,"dependencies":99},"该工具基于 OpenCLIP 格式，需先安装 open-clip-torch。提供三种预训练模型权重 (RN50, ViT-B-32, ViT-L-14)。进行检索评估时需额外安装 clip_benchmark 并准备特定数据集 (RSITMD, RSICD, UCM)。代码示例显示支持混合精度推理 (torch.cuda.amp.autocast)。",[100,101,102,103,104],"open-clip-torch","torch","huggingface_hub","Pillow","clip_benchmark",[15,14],[107,108,109],"remote-sensing","vision-language","contrastive-language-image-pretraining",null,"2026-03-27T02:49:30.150509","2026-04-07T14:37:09.570901",[114,119,124,129,134,139,144,149],{"id":115,"question_zh":116,"answer_zh":117,"source_url":118},22329,"在 RSICD 数据集上进行检索评估时，代码无法正确运行或结果不准确怎么办？","对于 RSICD 数据集，每张图像有 5 个真实描述（ground truth captions）。如果检索模型匹配到其中任意一个，即视为成功检索。如果在评估时遇到问题，请尝试在应用 `get_dataset_collate_fn` 时使用 `'mscoco_captions'` 参数，这通常能解决评估结果打印不正确的问题。","https:\u002F\u002Fgithub.com\u002FChenDelong1999\u002FRemoteCLIP\u002Fissues\u002F7",{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},22328,"Zero-shot 推理结果与论文报告差距巨大，使用的是原始 CLIP 还是经过连续训练的模型？","论文中报告的是标准 top-1 准确率。如果您得到的结果差异很大（例如 0.195% vs 68.62%），请检查您的提示词模板（prompt template）是否正确，并确认是否使用了正确的评估代码。维护者建议分享您的 Zero-shot 推理代码以便排查具体原因。通常使用的是论文中描述的模板（如 'a satellite photo of {class name}'），但需确保实现细节一致。","https:\u002F\u002Fgithub.com\u002FChenDelong1999\u002FRemoteCLIP\u002Fissues\u002F18",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},22330,"Few-shot 线性探测（linear prob）实验中的训练集和测试集是如何划分的？","实验中并没有使用固定的划分。对于 Few-shot 设置（从 1-shot 到 32-shot），是从 80% 的训练分割中随机采样不同数量的样本作为支持集（support set），而测试集始终保持不变，使用的是剩余的 20% 测试分割。例如在 32-shot 情况下，每类仅使用 32 个样本进行学习，其余数据不参与该特定设置的学习但保留在原始训练池中，测试则统一在固定的 20% 测试集上进行。","https:\u002F\u002Fgithub.com\u002FChenDelong1999\u002FRemoteCLIP\u002Fissues\u002F3",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},22331,"线性探测（linear-prob）和 KNN 实验中，训练集和测试集具体是如何划分的？是随机划分吗？","由于许多分类数据集没有提供官方的训练\u002F验证\u002F测试划分，项目采用了自定义划分方法。对于每个类别，按照文件名排序（例如从 `001.jpg`, `002.jpg` 开始），使用前 80% 的样本作为训练集，剩余的 20% 作为测试集。这种划分方式是基于文件顺序而非完全随机打乱后划分。","https:\u002F\u002Fgithub.com\u002FChenDelong1999\u002FRemoteCLIP\u002Fissues\u002F2",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},22332,"如何获取 RemoteCLIP 使用的训练数据集？会开源吗？","目前数据集尚未完全公开。维护者计划在论文被接受后尽快开源数据集。届时可能只会发布生成的文本描述（captions）以及对应的数据集名称和图片 ID，而不是直接提供所有原始图像文件（因为涉及多个第三方数据集的版权授权问题）。","https:\u002F\u002Fgithub.com\u002FChenDelong1999\u002FRemoteCLIP\u002Fissues\u002F21",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},22333,"无法从 Huggingface 下载预训练模型权重，是否有替代下载方式？","如果您在中国大陆地区且无法访问 Huggingface（即使使用 VPN），维护者可以提供百度网盘（Baidu Netdisk）作为替代下载源。您可以直接在 Issue 中留言说明情况，维护者通常会提供百度网盘的链接和提取码。","https:\u002F\u002Fgithub.com\u002FChenDelong1999\u002FRemoteCLIP\u002Fissues\u002F16",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},22334,"RET-3 数据去重具体是如何操作的？能否提供去重后的文件列表？","去重是通过计算图像的 p-Hash 值完成的。如果两张图像的 p-Hash 值差异位数小于阈值 2，则视为重复图像。项目中已经移除了训练集中与 RSICD 测试集或 RSITMD 测试集重复的样本。维护者已上传了具体的去重文件名列表供下载，包括 `rsicd_duplicate.txt`, `rsitmd_duplicate.txt`, 和 `ucm_duplicate.txt`，这些文件列出了从训练集中移除的图像名称。参考代码逻辑可查阅 `DupImageDetection` 
仓库中的方法。","https:\u002F\u002Fgithub.com\u002FChenDelong1999\u002FRemoteCLIP\u002Fissues\u002F13",{"id":150,"question_zh":151,"answer_zh":152,"source_url":153},22335,"在训练过程中，一张图像对应多个 caption 时，如何避免其他 caption 被当作负样本造成干扰？","这是一个关于对比学习中正负样本构建的问题。虽然具体的内部实现细节在提供的评论截断部分未完全显示，但通常此类多模态模型在处理“一图多文”时，会将同一图像的所有对应 caption 都视为该图像的正样本（positive pairs），而在计算损失时，只有不同图像的图文对才会被视为负样本。建议在训练代码中检查 loss 计算部分，确保同一图像的不同描述没有被错误地推远。","https:\u002F\u002Fgithub.com\u002FChenDelong1999\u002FRemoteCLIP\u002Fissues\u002F25",[]]