[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-mbzuai-oryx--GeoChat":3,"tool-mbzuai-oryx--GeoChat":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",158594,2,"2026-04-16T23:34:05",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":77,"owner_url":78,"languages":79,"stars":88,"forks":89,"last_commit_at":90,"license":76,"difficulty_score":91,"env_os":92,"env_gpu":93,"env_ram":94,"env_deps":95,"category_tags":106,"github_topics":108,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":111,"updated_at":112,"faqs":113,"releases":142},8220,"mbzuai-oryx\u002FGeoChat","GeoChat","[CVPR 2024 🔥] GeoChat, the first grounded Large Vision Language Model for Remote Sensing","GeoChat 是首个专为遥感领域打造的地面化大型视觉语言模型，曾入选 CVPR 2024。它突破了通用视觉模型在专业领域的局限，能够深入理解高分辨率卫星及航空影像，不仅支持对整张图像的描述，更能精准定位并分析图像中的特定区域，实现“指哪打哪”的深度场景解读。\n\n针对遥感图像数据复杂、通用模型缺乏领域知识导致识别不准的痛点，GeoChat 基于 LLaVA-1.5 架构，利用全新构建的大规模遥感多模态指令数据集进行微调。这使得它在零样本设置下，即可出色完成图像与区域描述、视觉问答、场景分类、接地对话以及参照物体检测等多种任务。\n\n其核心技术亮点在于独特的“区域级推理”能力，能够将自然语言指令与图像中的具体地理坐标或区域紧密关联，大幅提升了人机交互的精确度。目前，GeoChat 
已开源代码、模型权重及评估脚本，非常适合从事遥感技术研究的学者、开发地理信息系统的工程师，以及希望探索多模态大模型在垂直领域应用的开发者使用。通过 GeoChat，专业人士可以更高效地从海量遥感数据中提取关键信息，推动智能对地观测技术的发展。","# GeoChat \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_15e0a7a62810.png\" height=\"40\">: Grounded Large Vision-Language Model for Remote Sensing [CVPR-2024]\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FwaxVImv.png\" alt=\"Oryx Video-ChatGPT\">\n\u003C\u002Fp>\n\n#### [Kartik Kuckreja](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fkartik-kuckreja-930531221\u002F)\\*, [Muhammad Sohail Danish](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fmuhammad-sohail-danish\u002F)\\*, [Muzammal Naseer](https:\u002F\u002Fmuzammal-naseer.com\u002F), [Abhijit Das](https:\u002F\u002Fsites.google.com\u002Fsite\u002Fdasabhijit2048\u002Fhome), [Salman Khan](https:\u002F\u002Fsalman-h-khan.github.io\u002F) and [Fahad Khan](https:\u002F\u002Fsites.google.com\u002Fview\u002Ffahadkhans\u002Fhome)\n\\* Equally contributing first authors\n\n#### **Mohamed bin Zayed University of AI, Birla Institute of Technology & Science, Australian National University, Linkoping University**\n\n[![Website](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Website-87CEEB)](https:\u002F\u002Fmbzuai-oryx.github.io\u002FGeoChat)\n[![paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-Paper-\u003CCOLOR>.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15826)\n[![video](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVideo-Presentation-F9D371)](https:\u002F\u002Fyoutu.be\u002FKOKtkkKpNDk)\n\n---\n\n## 📢 Latest Updates\n- Supplementary material for the accepted paper is available here: [Supplementary](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fblob\u002Fmain\u002Fdocs\u002Fgeochat_supp.pdf).\n- **Feb-28-24**: We open source the code, model, dataset, and evaluation scripts. \n- **Feb-27-24**: GeoChat has been accepted to **CVPR-24** 🎉. 
\n- **Nov-28-23**: GeoChat paper is released [arxiv link](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15826). 🔥🔥\n---\n\n\n\n## \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_15e0a7a62810.png\" height=\"40\">Overview\n\nGeoChat is the first grounded Large Vision Language Model, specifically tailored to Remote Sensing(RS) scenarios. Unlike general-domain models, GeoChat excels in handling high-resolution RS imagery, employing region-level reasoning for comprehensive scene interpretation. Leveraging a newly created RS multimodal dataset, GeoChat is fine-tuned using the LLaVA-1.5 architecture. This results in robust zero-shot performance across various RS tasks, including image and region captioning, visual question answering, scene classification, visually grounded conversations, and referring object detection.\n\n---\n## Contents\n- [Install](#install)\n- [Model Zoo](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md)\n- [Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002FGeoChat_Instruct\u002Fblob\u002Fmain\u002FGeoChat_Instruct.json)\n- [Train](#train)\n- [Evaluation](#evaluation)\n\n## Install\n\n1. Clone this repository and navigate to GeoChat folder\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat.git\ncd GeoChat\n```\n\n2. Install Package\n```Shell\nconda create -n geochat python=3.10 -y\nconda activate geochat\npip install --upgrade pip  # enable PEP 660 support\npip install -e .\n```\n\n3. 
Install additional packages for training\n```\npip install ninja\npip install flash-attn --no-build-isolation\n```\n\n### Upgrade to the latest code base\n\n```Shell\ngit pull\npip uninstall transformers\npip install -e .\n```\n\n## GeoChat Weights and Demo\nPlease check out our [Model Zoo](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md) for all public GeoChat checkpoints, and check [LoRA.md](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fblob\u002Fmain\u002Fdocs\u002FLoRA.md) for instructions on how to run the demo and training.\n\n## Train\n\nGeoChat training consists of visual instruction tuning using the GeoChat_Instruct dataset: 318k Vicuna-generated multimodal instruction-following data, finetuned over the pretrained weights of LLaVA-v1.5.\n\nWe train GeoChat on 3 A100 GPUs with 40GB memory. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.\n\n### Hyperparameters\nWe use a similar set of hyperparameters as Vicuna in finetuning. The hyperparameters are provided below.\n\n| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |\n| --- | ---: | ---: | ---: | ---: | ---: |\n| GeoChat-7B | 144 | 2e-5 | 1 | 2048 | 0 |\n\n### Pretrain (feature alignment)\n\nWe use the pretrained projector from LLaVA-v1.5, which is trained on a 558K subset of the LAION-CC-SBU dataset with BLIP captions. It takes around 3.5 hours for LLaVA-v1.5-7B.\n\n- `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.\n- `--vision_tower openai\u002Fclip-vit-large-patch14-336`: CLIP ViT-L\u002F14 336px.\n\n### Visual Instruction Tuning\n\n1. 
Prepare data\n\nPlease download the annotation of the final mixture of our instruction tuning data [GeoChat_Instruct.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002FGeoChat_Instruct\u002Fblob\u002Fmain\u002FGeoChat_Instruct.json), and download the split image zips from [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002FGeoChat_Instruct). Save the multiple image zips in a single folder and run the following command to merge them:\n```Shell\ncat images_parta* > images.zip\n```\nUnzip the images.zip file to a folder and give the folder's path in [finetune_lora.sh](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fblob\u002Fmain\u002Fscripts\u002Ffinetune_lora.sh).\n\n2. Start training!\n\nVisual instruction tuning takes more time due to the increased resolution of CLIP to 504x504. It takes around 25 hours to finetune GeoChat-7B on 3x A100 (40G).\n\nTraining script with DeepSpeed ZeRO-3: [`finetune_lora.sh`](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fblob\u002Fmain\u002Fscripts\u002Ffinetune_lora.sh).\n\nOptions to note:\n\n- `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.\n- `--vision_tower openai\u002Fclip-vit-large-patch14-336`: CLIP ViT-L\u002F14 336px.\n- `--image_aspect_ratio pad`: this pads the non-square images to square, instead of cropping them; it slightly reduces hallucination.\n- `--group_by_modality_length True`: this should only be used when your instruction tuning dataset contains both language (e.g. ShareGPT) and multimodal (e.g. LLaVA-Instruct) data.\n\n## Evaluation\n\nWe evaluate GeoChat on a diverse set of 7 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding. 
We do not evaluate using beam search to make the inference process consistent with the chat demo of real-time outputs.\nSee [Evaluation.md](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fblob\u002Fmain\u002Fdocs\u002FEvaluation.md).\n\n## 🏆 Contributions\n\n- **RS multimodal instruction following dataset.** We present a novel data generation pipeline that leverages existing object detection datasets to create short descriptions of the images, followed by using Vicuna-v1.5 to create conversations using the generated text alone. Further, we add visual question-answering and scene classification abilities using their corresponding datasets. This results in a total of 318k instruction pairs for the RS domain.\n- **GeoChat.** Leveraging our dataset, we finetune LLaVA-1.5 to create the remote sensing-domain vision-language model - GeoChat. Our LoRA fine-tuning is efficient and avoids forgetting the necessary context embedded in the fully-tuned LLaVA model, whose MLP projection is trained to align images into the word embedding space of the LLM (Vicuna-v1.5). This allows GeoChat to retain the conversation and instruction following abilities of LLaVA and extend its domain-knowledge to remote sensing tasks.\n\n- **Evaluation Benchmark.** We also address the lack of evaluation benchmarks to assess the capability of existing VLMs on remote-sensing conversations. To this end, we set up evaluation protocols for conversation grounding in RS, as well as a suite of tasks to allow comparisons with future efforts in this direction. We show various supervised as well as zero-shot evaluations for different remote sensing tasks, including image captioning, visual question answering and scene classification to demonstrate the generalisability of the GeoChat conversational VLM.\n\n---\n## 👁️💬 GeoChat : Grounded Large Vision-Language Model for Remote Sensing\n\nGeoChat can accomplish multiple tasks for remote-sensing (RS) image comprehension in a unified framework. 
Given suitable task tokens and user queries, the model can generate visually grounded responses (text with corresponding object locations - shown on top), visual question answering on images and regions (top left and bottom right, respectively) as well as scene classification (top right) and normal natural language conversations (bottom). This makes it the first RS VLM with grounding capability. \n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_fab239c0cb7c.png\" alt=\"GeoChat Overview\">\n\u003C\u002Fp>\n\n---\n\n## 🛰️ GeoChat : Architecture\n\nAn overview of GeoChat - the first grounded large vision-language model for remote sensing. Given an image input together with a user query, a visual backbone is first used to encode patch-level tokens at a higher resolution via interpolating positional encodings. A multi-layer perceptron (MLP) is used to adapt vision-tokens to language space suitable for input to a Large Language Model (Vicuna 1.5). Besides visual inputs, region locations can also be input to the model together with task-specific prompts that specify the desired task required by the user. Given this context, the LLM can generate natural language responses interleaved with corresponding object locations. GeoChat can perform multiple tasks as shown on top e.g., scene classification, image\u002Fregion captioning, VQA and grounded conversations.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_dbac6ccfb914.png\" alt=\"GeoChat Architectural\">\n\u003C\u002Fp>\n\n---\n\n## 🔍 RS Multimodal Instruction Dataset\n\nTypes of annotations available in the GeoChat instruction-set. For a given RS image, we obtain object attribute and relationship information, referring expressions and region captions along with their corresponding region annotations (shown over the image). 
This structured information is used to create the rich instruction-set with a total of 318k image-instruction pairs.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_fc583f209880.png\" alt=\"Dataset Annotation Pipeline\">\n\u003C\u002Fp>\n\n\n\n## 🤖 Qualitative results of GeoChat\n\nQualitative results of GeoChat. (\u003Cem>left-right\u003C\u002Fem>) Results are shown on grounding, referring object detection, and disaster\u002Fdamage detection. The user can provide task-specific tokens (e.g., \u003Cstrong>[grounding]\u003C\u002Fstrong>) to shape model responses according to the desired behavior. The model can generate textual responses (\u003Cem>right\u003C\u002Fem>), only visual grounding (\u003Cem>center\u003C\u002Fem>) and both text and object groundings interleaved together (\u003Cem>left\u003C\u002Fem>). The model can also specify object types, object counts, object attributes and object relationships.\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_a93f80ad47e3.png\" alt=\"Results_GCG\">\n\u003C\u002Fp>\n\n---\n\n## 🤖 Visual Question Answering\nQualitative examples for Visual Question Answering tasks. GeoChat is able to hold multi-turn conversations, based on various types of questions, including presence, count, complex comparisons and so on. It is able to detect objects and hold conversations against low resolution images as well.\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_a245c2f45a03.jpg\" alt=\"Visual Question Answering\">\n\u003C\u002Fp>\n\n---\n\n## 🤖 Scene Classification\nQualitative examples for scene classification. 
We give the model all the classes from the dataset and ask to choose only one.\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_00ea8922e5d7.jpg\" alt=\"Visual Question Answering\">\n\u003C\u002Fp>\n\n---\n\n## 🤖 Grounded Description\nWhen asked to describe the image with the special token '[grounding]', GeoChat outputs both the description of the image as well as the bounding boxes for all the objects detected.\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_6b5fff9404ff.jpg\" alt=\"Grounded Description\">\n\u003C\u002Fp>\n\n---\n\n## 🤖 Referring Expression\nWhen asked about an object as a referred expression, GeoChat is able to locate it and draw rotated bounding boxes around it correspondingly.\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_2bd39fd44201.jpg\" alt=\"Referring Expression\">\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_db099a5dace7.jpg\" alt=\"Referring Expression\">\n\u003C\u002Fp>\n\n---\n\n## 🤖 Region Caption\nQualitative examples for region-based captioning. Given a bounding box, GeoChat is able to provide brief descriptions about the area or the object covered by the bounding box.\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_034969196411.jpg\" alt=\"Region Caption\">\n\u003C\u002Fp>\n\n---\n\n## 📜 Citation\n```bibtex\n  @article{kuckreja2023geochat,\n          title={GeoChat: Grounded Large Vision-Language Model for Remote Sensing},\n          author={Kuckreja, Kartik and Danish, Muhammad S. 
and Naseer, Muzammal and Das, Abhijit and Khan, Salman and Khan, Fahad S.},\n          journal={The IEEE\u002FCVF Conference on Computer Vision and Pattern Recognition},\n          year={2024}\n  }\n```\n## 🙏 Acknowledgement\nWe are thankful to LLaVA and Vicuna for releasing their models and code as open-source contributions.\n\n---\n[\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_45d2297f2f63.png\" width=\"200\" height=\"100\">](https:\u002F\u002Fwww.ival-mbzuai.com)\n[\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_f7ee9d1ef19f.png\" width=\"100\" height=\"100\">](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx)\n[\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_5538daa7b5d2.png\" width=\"360\" height=\"85\">](https:\u002F\u002Fmbzuai.ac.ae)\n","# GeoChat \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_15e0a7a62810.png\" height=\"40\">：面向遥感的接地型大型视觉-语言模型 [CVPR-2024]\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FwaxVImv.png\" alt=\"Oryx Video-ChatGPT\">\n\u003C\u002Fp>\n\n#### [Kartik Kuckreja](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fkartik-kuckreja-930531221\u002F)\\*, [Muhammad Sohail Danish](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fmuhammad-sohail-danish\u002F)\\*, [Muzammal Naseer](https:\u002F\u002Fmuzammal-naseer.com\u002F), [Abhijit Das](https:\u002F\u002Fsites.google.com\u002Fsite\u002Fdasabhijit2048\u002Fhome), [Salman Khan](https:\u002F\u002Fsalman-h-khan.github.io\u002F) 和 [Fahad Khan](https:\u002F\u002Fsites.google.com\u002Fview\u002Ffahadkhans\u002Fhome)\n\\* 共同第一作者\n\n#### 
**穆罕默德·本·扎耶德人工智能大学、比尔拉理工学院与科学研究所、澳大利亚国立大学、林雪平大学**\n\n[![网站](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Website-87CEEB)](https:\u002F\u002Fmbzuai-oryx.github.io\u002FGeoChat)\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-Paper-\u003CCOLOR>.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15826)\n[![视频](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVideo-Presentation-F9D371)](https:\u002F\u002Fyoutu.be\u002FKOKtkkKpNDk)\n\n---\n\n## 📢 最新动态\n- 已接受论文的补充材料现已发布：[Supplementary](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fblob\u002Fmain\u002Fdocs\u002Fgeochat_supp.pdf)。\n- **2024年2月28日**：我们开源了代码、模型、数据集和评估脚本。\n- **2024年2月27日**：GeoChat已被 **CVPR-24** 接受 🎉。\n- **2023年11月28日**：GeoChat论文已发布 [arxiv链接](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15826)。🔥🔥\n---\n\n\n\n## \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_15e0a7a62810.png\" height=\"40\"> 概述\n\nGeoChat是首个专为遥感（RS）场景设计的接地型大型视觉-语言模型。与通用领域模型不同，GeoChat在处理高分辨率遥感图像方面表现出色，采用区域级推理来实现对场景的全面理解。借助新创建的遥感多模态数据集，GeoChat基于LLaVA-1.5架构进行了微调。这使得它在多种遥感任务中展现出强大的零样本性能，包括图像和区域描述、视觉问答、场景分类、视觉接地对话以及指代性目标检测。\n\n---\n## 目录\n- [安装](#install)\n- [模型库](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md)\n- [数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002FGeoChat_Instruct\u002Fblob\u002Fmain\u002FGeoChat_Instruct.json)\n- [训练](#train)\n- [评估](#evaluation)\n\n## 安装\n\n1. 克隆此仓库并进入GeoChat文件夹\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat.git\ncd GeoChat\n```\n\n2. 安装软件包\n```Shell\nconda create -n geochat python=3.10 -y\nconda activate geochat\npip install --upgrade pip  # 启用PEP 660支持\npip install -e .\n```\n\n3. 
安装用于训练的额外软件包\n```\npip install ninja\npip install flash-attn --no-build-isolation\n```\n\n### 升级到最新代码库\n\n```Shell\ngit pull\npip uninstall transformers\npip install -e .\n```\n\n## GeoChat权重与演示\n请查看我们的[模型库](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md)，以获取所有公开的GeoChat检查点，并参阅[LoRA.md](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fblob\u002Fmain\u002Fdocs\u002FLoRA.md)，了解如何运行演示和进行训练。\n\n## 训练\n\nGeoChat的训练包括使用GeoChat_Instruct数据集进行视觉指令微调：318k条由Vicuna生成的多模态指令遵循数据，在LLaVA-v1.5的预训练权重基础上进行微调。\n\n我们在3张配备40GB显存的A100 GPU上训练GeoChat。若使用较少的GPU进行训练，可相应降低`per_device_train_batch_size`并增加`gradient_accumulation_steps`。务必保持全局批次大小不变：`per_device_train_batch_size` × `gradient_accumulation_steps` × `num_gpus`。\n\n### 超参数\n我们在微调时采用了与Vicuna相似的超参数设置。以下同时列出了预训练和微调所使用的超参数。\n\n| 超参数 | 全局批次大小 | 学习率 | 轮数 | 最大长度 | 权重衰减 |\n| --- | ---: | ---: | ---: | ---: | ---: |\n| GeoChat-7B | 144 | 2e-5 | 1 | 2048 | 0 |\n\n### 预训练（特征对齐）\n\n我们使用LLaVAv1.5的预训练投影器，该投影器是在包含BLIP字幕的LAION-CC-SBU数据集558K子集上训练的。LLaVA-v1.5-7B大约需要3.5小时完成预训练。\n\n- `--mm_projector_type mlp2x_gelu`：两层MLP视觉-语言连接器。\n- `--vision_tower openai\u002Fclip-vit-large-patch14-336`：CLIP ViT-L\u002F14 336px。\n\n### 视觉指令微调\n\n1. 准备数据\n\n请下载我们指令微调数据最终混合体的标注文件[GeoChat_Instruct.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002FGeoChat_Instruct\u002Fblob\u002Fmain\u002FGeoChat_Instruct.json)，并从[Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002FGeoChat_Instruct)下载分割后的图像压缩包。将多个图像压缩包保存在一个文件夹中，然后运行以下命令将其合并：\n```Shell\ncat images_parta* > images.zip\n```\n解压images.zip文件至一个文件夹，并将该文件夹路径输入到[finetune_lora.sh](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fblob\u002Fmain\u002Fscripts\u002Ffinetune_lora.sh)中。\n\n2. 
开始训练！\n\n由于CLIP分辨率提升至504×504，视觉指令微调所需时间更长。在3张A100（40G）上微调GeoChat-7B大约需要~25小时。\n\n使用DeepSpeed ZeRO-3的训练脚本：[`finetune_lora.sh`](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fblob\u002Fmain\u002Fscripts\u002Ffinetune_lora.sh)。\n\n需要注意的选项：\n\n- `--mm_projector_type mlp2x_gelu`：两层MLP视觉-语言连接器。\n- `--vision_tower openai\u002Fclip-vit-large-patch14-336`：CLIP ViT-L\u002F14 336px。\n- `--image_aspect_ratio pad`：此选项会将非正方形图像填充为正方形，而不是裁剪它们；这样可以略微减少幻觉现象。\n- `--group_by_modality_length True`：仅当您的指令微调数据集同时包含语言（例如ShareGPT）和多模态（例如LLaVA-Instruct）内容时才应使用此选项。\n- \n## 评估\n\n我们基于一组多样化的7个基准测试对GeoChat进行了评估。为确保结果的可重复性，我们采用贪婪解码方式对模型进行评估。我们未使用束搜索进行评估，以使推理过程与实时输出的聊天演示保持一致。\n详情请参阅[Evaluation.md](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fblob\u002Fmain\u002Fdocs\u002FEvaluation.md)。\n\n## 🏆 贡献\n\n- **遥感多模态指令遵循数据集。** 我们提出了一种新颖的数据生成流程，利用现有的目标检测数据集为图像生成简短描述，随后使用Vicuna-v1.5仅基于这些文本创建对话。此外，我们还引入了视觉问答和场景分类能力，并分别使用相应的数据集进行补充。最终，我们为遥感领域构建了一个包含318,000对指令的数据集。\n  \n- **GeoChat。** 基于我们的数据集，我们对LLaVA-1.5进行了微调，从而创建了面向遥感领域的视觉-语言模型——GeoChat。我们的LoRA微调方法高效且不会遗忘全量微调的LLaVA模型中所嵌入的关键上下文信息；该模型的MLP投影层经过训练，能够将图像映射到LLM（Vicuna-v1.5）的词嵌入空间中。这使得GeoChat既能保留LLaVA的对话和指令遵循能力，又能将其领域知识扩展到遥感任务中。\n\n- **评估基准。** 我们还解决了现有视觉-语言模型在遥感对话任务上缺乏评估基准的问题。为此，我们制定了遥感场景下对话定位的评估协议，并设计了一系列任务，以便未来相关工作可以进行比较。我们展示了针对不同遥感任务的各种监督式及零样本评估结果，包括图像字幕生成、视觉问答和场景分类，以证明GeoChat这一对话型视觉-语言模型的泛化能力。\n\n---\n## 👁️💬 GeoChat：面向遥感的具身大型视觉-语言模型\n\nGeoChat能够在统一的框架内完成多项遥感图像理解任务。通过提供适当的任务标记和用户查询，该模型可以生成具有视觉定位的响应（文本中包含对应的目标位置，如图顶部所示）、针对图像及特定区域的视觉问答（分别位于左上角和右下角），以及场景分类（右上角）和常规自然语言对话（底部）。这使其成为首个具备具身能力的遥感视觉-语言模型。\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_fab239c0cb7c.png\" alt=\"GeoChat 概览\">\n\u003C\u002Fp>\n\n---\n\n## 🛰️ GeoChat：架构\n\nGeoChat的架构概览——这是首个面向遥感领域的具身大型视觉-语言模型。给定一张输入图像和用户查询后，首先会使用视觉骨干网络以更高分辨率编码补丁级特征，并通过插值位置编码来实现。随后，一个多层感知机（MLP）会将视觉特征适配到适合输入大型语言模型（Vicuna 
1.5）的语言空间中。除了视觉输入外，还可以将区域位置与特定于任务的提示一同输入模型，以明确用户所需执行的任务类型。在此上下文中，语言模型能够生成自然语言响应，并将相应目标的位置信息穿插其中。如图顶部所示，GeoChat可执行多种任务，例如场景分类、图像\u002F区域字幕生成、视觉问答以及具身对话。\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_dbac6ccfb914.png\" alt=\"GeoChat 架构\">\n\u003C\u002Fp>\n\n---\n\n## 🔍 遥感多模态指令数据集\n\nGeoChat指令集中包含的各类标注类型。对于给定的遥感图像，我们提取了对象属性与关系信息、指代表达和区域字幕，同时附带对应的区域标注（显示在图像上方）。这些结构化信息被用于构建丰富的指令集，总计包含318,000对图像-指令对。\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_fc583f209880.png\" alt=\"数据集标注流程\">\n\u003C\u002Fp>\n\n---\n\n## 🤖 GeoChat 定性结果\n\nGeoChat的定性结果。（从左至右）展示的内容包括场景定位、指代目标检测以及灾害\u002F损伤检测。用户可以通过提供特定于任务的标记（例如\u003Cstrong>[grounding]\u003C\u002Fstrong>）来引导模型生成符合需求的响应。模型可以生成纯文本响应（右图）、仅包含视觉定位的输出（中图），以及文本与目标定位信息交织的混合输出（左图）。此外，模型还能识别目标类型、数量、属性及相互关系。\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_a93f80ad47e3.png\" alt=\"Results_GCG\">\n\u003C\u002Fp>\n\n---\n\n## 🤖 视觉问答\n视觉问答任务的定性示例。GeoChat能够根据不同类型的问题进行多轮对话，问题涵盖是否存在、数量统计、复杂比较等。它甚至可以在低分辨率图像上检测目标并展开对话。\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_a245c2f45a03.jpg\" alt=\"视觉问答\">\n\u003C\u002Fp>\n\n---\n\n## 🤖 场景分类\n场景分类任务的定性示例。我们向模型提供了数据集中所有的类别，并要求其从中选择一个。\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_00ea8922e5d7.jpg\" alt=\"视觉问答\">\n\u003C\u002Fp>\n\n---\n\n## 🤖 具身描述\n当用户使用特殊标记“[grounding]”要求描述图像时，GeoChat不仅会生成图像描述，还会为所有检测到的物体绘制边界框。\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_6b5fff9404ff.jpg\" alt=\"具身描述\">\n\u003C\u002Fp>\n\n---\n\n## 🤖 指代表达\n当用户询问某个目标的指代表达时，GeoChat能够准确定位该目标，并为其绘制旋转的边界框。\n\u003Cp align=\"center\">\n  \u003Cimg 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_2bd39fd44201.jpg\" alt=\"指代表达\">\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_db099a5dace7.jpg\" alt=\"指代表达\">\n\u003C\u002Fp>\n\n---\n\n## 🤖 区域字幕\n基于区域的字幕生成任务的定性示例。给定一个边界框，GeoChat能够为该区域或框内覆盖的对象提供简短描述。\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_034969196411.jpg\" alt=\"区域字幕\">\n\u003C\u002Fp>\n\n---\n\n## 📜 引用\n```bibtex\n  @article{kuckreja2023geochat,\n          title={GeoChat: Grounded Large Vision-Language Model for Remote Sensing},\n          author={Kuckreja, Kartik and Danish, Muhammad S. and Naseer, Muzammal and Das, Abhijit and Khan, Salman and Khan, Fahad S.},\n          journal={The IEEE\u002FCVF Conference on Computer Vision and Pattern Recognition},\n          year={2024}\n  }\n```\n## 🙏 致谢\n我们感谢LLaVA和Vicuna开放其模型和代码，作为开源贡献。\n\n---\n[\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_45d2297f2f63.png\" width=\"200\" height=\"100\">](https:\u002F\u002Fwww.ival-mbzuai.com)\n[\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_f7ee9d1ef19f.png\" width=\"100\" height=\"100\">](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx)\n[\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_readme_5538daa7b5d2.png\" width=\"360\" height=\"85\">](https:\u002F\u002Fmbzuai.ac.ae)","# GeoChat 快速上手指南\n\nGeoChat 是首个专为遥感（Remote Sensing, RS）场景设计的**接地大型视觉语言模型**。基于 LLaVA-1.5 架构微调，它支持高分辨率遥感图像的图像描述、视觉问答、场景分类以及带有物体定位（Grounding）的多轮对话。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04+)\n*   **GPU**: 至少 1 张 NVIDIA GPU (训练推荐 3x A100 40GB，推理可根据模型大小调整)\n*   **CUDA**: 适配您的显卡驱动版本\n*   **Python**: 3.10\n*   **包管理器**: Conda (推荐)\n\n## 安装步骤\n\n### 1. 
克隆仓库\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat.git\ncd GeoChat\n```\n\n### 2. 创建并激活虚拟环境\n```bash\nconda create -n geochat python=3.10 -y\nconda activate geochat\n```\n\n### 3. 安装基础依赖\n启用 PEP 660 支持并安装项目包：\n```bash\npip install --upgrade pip\npip install -e .\n```\n\n### 4. 安装训练\u002F推理额外依赖\n为了支持高效训练和推理，需安装 `ninja` 和 `flash-attn`：\n```bash\npip install ninja\npip install flash-attn --no-build-isolation\n```\n> **提示**: 如果国内网络下载 `flash-attn` 较慢，可尝试使用清华源或阿里源加速：\n> `pip install flash-attn --no-build-isolation -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`\n\n### 5. 获取模型权重\n请访问官方 [Model Zoo](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fblob\u002Fmain\u002Fdocs\u002FMODEL_ZOO.md) 下载预训练权重。下载后请按照文档说明将权重放置在指定目录。\n\n## 基本使用\n\nGeoChat 的核心能力是通过特定的任务令牌（Task Tokens）来执行不同的遥感任务。以下是基于命令行脚本的最简使用逻辑。\n\n### 运行演示 (Demo)\n项目提供了基于 LoRA 的推理脚本。假设您已下载好模型权重并配置好路径，可以使用以下命令启动交互或批量推理（具体脚本路径参考 `scripts\u002Ffinetune_lora.sh` 中的推理部分或官方 Demo 文档）：\n\n通常推理命令结构如下（需根据实际下载的模型路径修改 `--model-path` 和 `--image-file`）：\n\n```bash\npython llava\u002Feval\u002Frun_geochat.py \\\n    --model-path \u002Fpath\u002Fto\u002Fgeochat-7b-checkpoint \\\n    --image-file \u002Fpath\u002Fto\u002Fremote_sensing_image.jpg \\\n    --query \"Describe this image.\" \\\n    --conv-mode geochat\n```\n\n### 核心任务示例\n\nGeoChat 通过在前缀添加特殊令牌来切换任务模式：\n\n1.  **接地描述 (Grounded Description)**\n    *   **输入提示**: `[grounding] Describe the image.`\n    *   **功能**: 生成图像描述文本，并同时输出检测到的物体边界框坐标。\n\n2.  **指代表达检测 (Referring Expression)**\n    *   **输入提示**: `Locate the [object description].`\n    *   **功能**: 根据文本描述定位物体，输出旋转边界框。\n\n3.  **视觉问答 (VQA)**\n    *   **输入提示**: `How many buildings are visible in this region?`\n    *   **功能**: 针对图像或特定区域回答问题。\n\n4.  **场景分类 (Scene Classification)**\n    *   **输入提示**: `Classify the scene into one of the following categories: [list of classes].`\n    *   **功能**: 对遥感场景进行分类。\n\n### 数据准备（如需微调）\n如果您计划使用自己的数据进行微调，需先准备指令数据集：\n1.  
从 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002FGeoChat_Instruct) 下载 `GeoChat_Instruct.json` 和图片压缩包。\n2.  合并图片分卷：\n    ```bash\n    cat images_parta* > images.zip\n    unzip images.zip -d .\u002Fimages_folder\n    ```\n3.  修改 `scripts\u002Ffinetune_lora.sh` 中的数据路径指向您的文件夹。\n\n---\n*注：更多详细参数配置、多卡训练设置及评估基准请参考项目根目录下的 `docs` 文件夹及官方 GitHub 仓库。*","某城市规划局的遥感分析师正急需从最新的高分辨率卫星影像中，快速提取特定受灾区域的建筑物损毁情况并生成详细报告。\n\n### 没有 GeoChat 时\n- 分析师必须依赖通用视觉模型，这些模型缺乏对遥感影像特有视角的理解，常将阴影误判为水体或道路。\n- 无法直接针对图像中的特定坐标区域进行提问，只能获取整张图的笼统描述，难以定位具体受损地块。\n- 需要人工逐帧标注并结合传统 GIS 软件进行二次分析，耗时数小时才能完成单一场景的评估。\n- 面对复杂的地理术语（如“不透水面”、“植被覆盖度”），通用模型往往无法准确理解指令意图，导致回答偏离专业需求。\n\n### 使用 GeoChat 后\n- GeoChat 凭借专为遥感训练的特性，能精准区分阴影与真实地物，显著降低了对高楼阴影和复杂地形的误判率。\n- 支持区域级推理，分析师可直接框选受灾街区询问“该区域内有多少栋完全倒塌的建筑”，获得精确到坐标的回答。\n- 实现了零样本下的多任务处理，一键生成包含场景分类、物体检测及自然语言描述的综合报告，将分析时间缩短至分钟级。\n- 能够流畅理解并运用专业地理术语进行多轮对话，如同拥有一位精通遥感知识的专家助手实时协同工作。\n\nGeoChat 通过将通用的视觉语言能力深度“落地”于遥感领域，彻底改变了从卫星影像到决策信息的转化效率。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbzuai-oryx_GeoChat_15e0a7a6.png","mbzuai-oryx","ORYX","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmbzuai-oryx_e1ef1b3c.jpg","A Library for Large Vision-Language Models",null,"https:\u002F\u002Fival-mbzuai.com","https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx",[80,84],{"name":81,"color":82,"percentage":83},"Python","#3572A5",97.5,{"name":85,"color":86,"percentage":87},"Shell","#89e051",2.5,708,62,"2026-04-16T03:47:58",4,"Linux","必需 NVIDIA GPU。训练推荐 3x A100 (40GB 显存)。依赖 flash-attn，通常仅支持 Linux 环境下的 NVIDIA 显卡。","未说明 (建议根据模型大小配置充足内存，训练需大显存)",{"notes":96,"python":97,"dependencies":98},"1. 必须使用 Conda 创建 Python 3.10 环境。2. 安装 flash-attn 时需添加 --no-build-isolation 参数。3. 训练基于 LLaVA-1.5 架构，视觉编码器使用 CLIP ViT-L\u002F14 336px。4. 官方训练脚本使用 DeepSpeed ZeRO-3 优化显存。5. 
推理或微调小批量时可减少 GPU 数量，但需调整 batch_size 和 gradient_accumulation_steps 以保持全局 batch size 一致。","3.10",[99,100,101,102,103,104,105],"torch","transformers","flash-attn","ninja","deepspeed","accelerate","peft",[35,15,107],"其他",[109,110],"remote-sensing","vlm","2026-03-27T02:49:30.150509","2026-04-17T09:54:07.714373",[114,119,124,129,134,138],{"id":115,"question_zh":116,"answer_zh":117,"source_url":118},36787,"微调时遇到 'size mismatch' 错误，提示 position_embedding 形状不匹配怎么办？","该错误通常是因为使用了错误的基座模型或视觉塔配置。请确保在训练脚本中明确指定正确的视觉塔和投影器参数。参考以下配置命令：\n\n```bash\npython -m torch.distributed.run \\\n    --nproc_per_node $GPUS_PER_NODE \\\n    geochat\u002Ftrain\u002Ftrain_mem.py \\\n    --lora_enable True \\\n    --model_name_or_path \u003Cpath_to_llava-v1.5-7b> \\\n    --version \u003Cprompt_version> \\\n    --data_path \u003Cpath_to_GeoChat_Instruct.json> \\\n    --image_folder \u003Cpath_to_images> \\\n    --vision_tower openai\u002Fclip-vit-large-patch14-336 \\\n    --mm_projector_type mlp2x_gelu \\\n    --pretrain_mm_mlp_adapter \u003Cpath_to_llava-v1.5-7b\u002Fmm_projector.bin> \\\n    --mm_vision_select_layer -2 \\\n    --mm_use_im_start_end False\n```\n\n关键点是需要显式传入 `--vision_tower` 和 `--pretrain_mm_mlp_adapter` 参数，并确保基座模型路径指向正确的 LLaVA v1.5 7B 模型。","https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fissues\u002F28",{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},36788,"运行评估脚本时报错 'AttributeError: NoneType object has no attribute preprocess' 如何解决？","此错误是因为代码未能正确识别模型类型，导致 `image_processor` 被赋值为 None。解决方法是确保你下载的 GeoChat 模型文件夹名称中包含 \"geochat\" 字样（不区分大小写）。\n\n例如，如果你将模型目录命名为 \"weights\"，程序无法识别；请将其重命名为 \"geochat\" 或包含 \"geochat\" 的名称（如 \"geochat-7B\"）。\n\n代码逻辑会检查模型路径中是否包含 'geochat' 字符串来决定是否加载图像处理器。","https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fissues\u002F23",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},36789,"本地运行 Demo 时出现 'AttributeError: NoneType object has no attribute image_mean' 错误","该错误同样是由于模型路径命名不当导致的。系统通过提取 `model_path` 中的 
`model_name` 来判断模型类型。如果 `model_name` 中不包含 \"geochat\" 字符串，`image_processor` 将被设为 None，从而引发此错误。\n\n解决方案：请检查启动 Demo 时指定的模型路径，确保其文件夹名称中包含 \"geochat\"（例如将文件夹重命名为 `geochat-lora` 或类似名称），以便代码能正确加载图像预处理组件。","https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fissues\u002F10",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},36790,"使用 finetune_lora.sh 训练时发生 RuntimeError: Size Mismatch 错误，如何修复？","这通常是因为配置文件中的 `model_type` 设置不正确。请检查你的基座模型和保存的检查点文件中的 `config.json`。\n\n如果 `model_type` 的值是 \"llava\"，请将其手动修改为 \"geochat\"。修改后重新运行训练脚本即可解决尺寸不匹配的问题。","https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FGeoChat\u002Fissues\u002F11",{"id":135,"question_zh":136,"answer_zh":137,"source_url":123},36791,"在进行场景分类（scene categorization）或 VQA 任务评估时，遇到相同的 'NoneType' preprocess 错误怎么办？","无论是场景分类 (`batch_geochat_scene.py`) 还是 VQA 任务 (`batch_geochat_vqa.py`)，出现此错误的原因一致：模型目录名称未包含关键字 \"geochat\"，导致 `load_pretrained_model` 函数未能初始化 `image_processor`。\n\n请统一将模型权重目录重命名为包含 \"geochat\" 的名称（例如 `geochat-7B`），然后重新运行评估脚本。维护者已更新仓库代码，建议同时拉取最新代码再试。",{"id":139,"question_zh":140,"answer_zh":141,"source_url":118},36792,"微调时应该使用哪个具体的基座模型和视觉塔配置？","官方推荐的基座模型是 LLaVA v1.5 7B，地址为：https:\u002F\u002Fhuggingface.co\u002Fliuhaotian\u002Fllava-v1.5-7b。\n\n在训练参数中，必须配合以下配置使用：\n1. `--vision_tower`: 设置为 `openai\u002Fclip-vit-large-patch14-336`\n2. `--mm_projector_type`: 设置为 `mlp2x_gelu`\n3. `--pretrain_mm_mlp_adapter`: 指向基座模型目录下的 `mm_projector.bin` 文件\n4. `--mm_vision_select_layer`: 设置为 `-2`\n\n确保这些参数与脚本中的其他路径配置一致，以避免架构不匹配的错误。",[]]