[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-IDEA-Research--Rex-Omni":3,"tool-IDEA-Research--Rex-Omni":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",160411,2,"2026-04-18T23:33:24",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":72,"owner_avatar_url":73,"owner_bio":74,"owner_company":75,"owner_location":75,"owner_email":75,"owner_twitter":75,"owner_website":76,"owner_url":77,"languages":78,"stars":103,"forks":104,"last_commit_at":105,"license":106,"difficulty_score":10,"env_os":107,"env_gpu":108,"env_ram":109,"env_deps":110,"category_tags":120,"github_topics":121,"view_count":32,"oss_zip_url":75,"oss_zip_packed_at":75,"status":17,"created_at":125,"updated_at":126,"faqs":127,"releases":164},9489,"IDEA-Research\u002FRex-Omni","Rex-Omni","[CVPR2026] Detect Anything via Next Point Prediction ","Rex-Omni 是一款由 IDEA 研究院开源的 30 亿参数多模态大语言模型，它创新性地将物体检测及多种视觉感知任务统一转化为简单的“下一点预测”问题。传统检测模型通常依赖复杂的锚框设计或专门的检测头，而 Rex-Omni 通过让模型像生成文本一样预测下一个坐标点，实现了端到端的通用检测，大幅简化了技术架构并提升了任务适应性。\n\n这款工具特别适合 AI 研究人员、算法工程师以及希望探索多模态大模型潜力的开发者使用。无论是需要快速部署检测功能的工程团队，还是致力于研究统一视觉范式的研究者，都能从中受益。普通用户也可通过其提供的在线 Demo 直观体验“指向即检测”的交互乐趣。\n\nRex-Omni 的技术亮点在于其独特的建模范式，不仅支持高精度的物体定位，还具备强大的泛化能力，可轻松扩展至各类视觉任务。项目提供了完整的训练、评估及微调代码，支持自定义数据集的 SFT 和 
GRPO 微调，甚至发布了量化版本以节省一半存储空间，降低了落地门槛。凭借简洁的 API 设计和活跃的社区更新，Rex-Omni 正成为连接自然语言理解与视觉感知的重要桥梁。","\n\u003Cdiv align=center>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_c2cf7d301876.png\" width=600 >\n\u003C\u002Fdiv>\n\n\u003Ch1 align=\"center\">Detect Anything via Next Point Prediction\u003C\u002Fh1>\n\n\u003Cdiv align=center>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Frex-omni.github.io\u002F\">\n    \u003Cimg\n      src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRexOmni-Website-BADFDB?style=flat-square&logo=deno&logoColor=violet&color=BADFDB\"\n      alt=\"RexThinker Website\"\n    \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.12798\">\n    \u003Cimg\n      src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRexOmni-Paper-Red%25red?logo=arxiv&logoColor=red&color=yellow\"\n      alt=\"RexThinker Paper on arXiv\"\n    \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FIDEA-Research\u002FRex-Omni-AWQ\">\n    \u003Cimg \n        src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRexOmni_AWQ-Weight-orange?logo=huggingface&logoColor=yellow\" \n        alt=\"RexThinker weight on Hugging Face\"\n    \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FIDEA-Research\u002FRex-Omni\">\n    \u003Cimg \n        src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRexOmni-Weight-orange?logo=huggingface&logoColor=yellow\" \n        alt=\"RexThinker weight on Hugging Face\"\n    \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FMountchicken\u002FRex-Omni\">\n    \u003Cimg\n      src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRexOmni-Demo-orange?logo=huggingface&logoColor=yellow\" \n      alt=\"RexThinker Demo on Hugging Face\"\n    \u002F>\n  \u003C\u002Fa>\n  \n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n> Rex-Omni is a 3B-parameter 
Multimodal Large Language Model (MLLM) that redefines object detection and a wide range of other visual perception tasks as a simple next-token prediction problem.\n\n\u003Cp align=\"center\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_bd9f4e0a5dde.png\" width=\"95%\">\u003C\u002Fp>\n\n# News 🎉\n- [2026-01-10] **Pointing Task Fine-tuning** is now supported! Train Rex-Omni on custom pointing datasets with SFT and GRPO. See [Fine-tuning Guide](finetuning\u002FREADME.md) for details.\n- [2025-10-31] We released the AWQ quantized version of Rex-Omni, which saves 50% of the storage space. [Rex-Omni-AWQ](https:\u002F\u002Fhuggingface.co\u002FIDEA-Research\u002FRex-Omni-AWQ)\n- [2025-10-29] Fine-tuning code is now [available](finetuning\u002FREADME.md).\n- [2025-10-17] Evaluation code and dataset are now [available](evaluation\u002FREADME.md).\n- [2025-10-15] Rex-Omni is released.\n\n# Table of Contents\n\n- [Breaking News: We've just open-sourced Resophy! 🎉](#breaking-news-weve-just-open-sourced-resophy-)\n- [News 🎉](#news-)\n- [Table of Contents](#table-of-contents)\n  - [TODO LIST 📝](#todo-list-)\n  - [1. Installation ⛳️](#1-installation-️)\n  - [2. Quick Start: Using Rex-Omni for Detection](#2-quick-start-using-rex-omni-for-detection)\n      - [Initialization parameters (RexOmniWrapper)](#initialization-parameters-rexomniwrapper)\n      - [Inference parameters (rex.inference)](#inference-parameters-rexinference)\n  - [3. Cookbooks](#3-cookbooks)\n  - [4. Applications of Rex-Omni](#4-applications-of-rex-omni)\n  - [5. Gradio Demo](#5-gradio-demo)\n    - [Quick Start](#quick-start)\n    - [Available Options](#available-options)\n  - [6. Evaluation](#6-evaluation)\n  - [7. Fine-tuning Rex-Omni](#7-fine-tuning-rex-omni)\n  - [8. LICENSE](#8-license)\n  - [9. Citation](#9-citation)\n\n\n## TODO LIST 📝\n- [x] Add Evaluation Code\n- [x] Add Fine-tuning Code\n- [x] Add Quantized Rex-Omni\n\n## 1. 
Installation ⛳️\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FRex-Omni.git\ncd Rex-Omni\nconda create -n rexomni python=3.10 -y\nconda activate rexomni\npip install torch==2.7.0 torchvision --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128\npip install -r requirements.txt\npip install -v -e .\n```\n\nTest Installation\n```bash\nCUDA_VISIBLE_DEVICES=1 python tutorials\u002Fdetection_example\u002Fdetection_example.py\n```\n\nIf the installation is successful, you will find a visualization of the detection results at `tutorials\u002Fdetection_example\u002Ftest_images\u002Fcafe_visualize.jpg`\n\n## 2. Quick Start: Using Rex-Omni for Detection\nBelow is a minimal example showing how to run object detection using the `rex_omni` package.\n\n```python\nfrom PIL import Image\nfrom rex_omni import RexOmniWrapper, RexOmniVisualize\n\n# 1) Initialize the wrapper (model loads internally)\nrex = RexOmniWrapper(\n    model_path=\"IDEA-Research\u002FRex-Omni\",   # HF repo or local path\n    backend=\"transformers\",                # or \"vllm\" for high-throughput inference\n    # Inference\u002Fgeneration controls (applied across backends)\n    max_tokens=2048,\n    temperature=0.0,\n    top_p=0.05,\n    top_k=1,\n    repetition_penalty=1.05,\n)\n\n# If you are using the AWQ quantized version of Rex-Omni, you can use the following code:\nrex = RexOmniWrapper(\n    model_path=\"IDEA-Research\u002FRex-Omni-AWQ\",\n    backend=\"vllm\",\n    quantization=\"awq\",\n    max_tokens=2048,\n    temperature=0.0,\n    top_p=0.05,\n    top_k=1,\n    repetition_penalty=1.05,\n)\n\n# 2) Prepare input\nimage = Image.open(\"tutorials\u002Fdetection_example\u002Ftest_images\u002Fcafe.jpg\").convert(\"RGB\")\ncategories = [\n    \"man\", \"woman\", \"yellow flower\", \"sofa\", \"robot-shope light\",\n    \"blanket\", \"microwave\", \"laptop\", \"cup\", \"white chair\", \"lamp\",\n]\n\n# 3) Run detection\nresults = rex.inference(images=image, 
task=\"detection\", categories=categories)\nresult = results[0]\n\n# 4) Visualize\nvis = RexOmniVisualize(\n    image=image,\n    predictions=result[\"extracted_predictions\"],\n    font_size=20,\n    draw_width=5,\n    show_labels=True,\n)\nvis.save(\"tutorials\u002Fdetection_example\u002Ftest_images\u002Fcafe_visualize.jpg\")\n```\n\n#### Initialization parameters (RexOmniWrapper)\n- **model_path**: Hugging Face repo ID or a local checkpoint directory for the Rex-Omni model.\n- **backend**: \"transformers\" or \"vllm\".\n  - **transformers**: easy to use, good baseline latency.\n  - **vllm**: high-throughput, low-latency inference. Requires the `vllm` package and a compatible environment.\n- **max_tokens**: Maximum number of tokens to generate for each output.\n- **temperature**: Sampling temperature; higher values increase randomness (0.0 = deterministic\u002Fgreedy).\n- **top_p**: Nucleus sampling parameter; the model samples from the smallest set of tokens with cumulative probability ≥ top_p.\n- **top_k**: Top-k sampling; restricts sampling to the k most likely tokens.\n- **repetition_penalty**: Penalizes repeated tokens; >1.0 discourages repetition.\n- Optional advanced settings (supported via kwargs when constructing the wrapper):\n  - Transformers: `torch_dtype`, `attn_implementation`, `device_map`, `trust_remote_code`, etc.\n  - vLLM: `tokenizer_mode`, `limit_mm_per_prompt`, `max_model_len`, `gpu_memory_utilization`, `tensor_parallel_size`, `trust_remote_code`, etc.\n\n#### Inference parameters (rex.inference)\n- **images**: A single `PIL.Image.Image` or a list of images for batch inference.\n- **task**: One of `\"detection\"`, `\"pointing\"`, `\"visual_prompting\"`, `\"keypoint\"`, `\"ocr_box\"`, `\"ocr_polygon\"`, `\"gui_grounding\"`, `\"gui_pointing\"`.\n- **categories**: List of category names\u002Fphrases to detect or extract, e.g., `[\"person\", \"cup\"]`. Used to build task prompts.\n- **keypoint_type**: Type of keypoints for the keypoint detection task. 
Options: \"person\", \"hand\", \"animal\".\n- **visual_prompt_boxes**: Reference bounding boxes for the visual prompting task. Format: [[x0, y0, x1, y1], ...] in absolute coordinates.\n\nReturns a list of dictionaries (one per input image). Each dictionary includes:\n- **raw_output**: The raw text generated by the LLM.\n- **extracted_predictions**: Structured predictions parsed from the raw output, grouped by category.\n  - For detection: `{category: [{\"type\": \"box\", \"coords\": [x0,y0,x1,y1]}, ...], ...}`\n  - For pointing:  `{category: [{\"type\": \"point\", \"coords\": [x0,y0]}, ...], ...}`\n  - For polygon: `{category: [{\"type\": \"polygon\", \"coords\": [x0,y0, ...]}, ...], ...}`\n  - For keypointing: structured JSON\n\nTips:\n- For best performance with vLLM, set `backend=\"vllm\"` and tune `gpu_memory_utilization` and `tensor_parallel_size` according to your GPUs.\n\n## 3. Cookbooks\n\nWe provide comprehensive tutorials for each supported task. Each tutorial includes both standalone Python scripts and interactive Jupyter notebooks.\n\n|       Task       |                                                                Applications                                                               |   Demo |                  Python Example                    |                     Notebook                     |\n|:----------------:|:-----------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------:|:------------------------------------------------:|:------------------------------------------------:|\n| Detection |                               `object detection`                                 | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_f3c591a7df5c.jpg\" width=\"240\"\u002F>  | [code](tutorials\u002Fdetection_example\u002Fdetection_example.py)   | 
[notebook](tutorials\u002Fdetection_example\u002F_full_notebook.ipynb) |\n|                  |                         `object referring`                      | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_9987e4919649.jpg\" width=\"240\"\u002F>   | [code](tutorials\u002Fdetection_example\u002Freferring_example.py)      |       [notebook](tutorials\u002Fdetection_example\u002F_full_notebook.ipynb)                                           |\n|                  | `gui grounding` | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_40882272148e.jpg\" width=\"240\"\u002F> | [code](tutorials\u002Fdetection_example\u002Fgui_grounding_example.py)  |       [notebook](tutorials\u002Fdetection_example\u002F_full_notebook.ipynb)                                           |\n|                  |                    `layout grounding`    |  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_9814a8f0a8b2.jpg\" width=\"240\"\u002F>        | [code](tutorials\u002Fdetection_example\u002Flayout_grouding_examle.py) |       [notebook](tutorials\u002Fdetection_example\u002F_full_notebook.ipynb)                                             |\n| Pointing |                               `object pointing`            |       \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_27ba68e34228.jpg\" width=\"240\"\u002F>           |   [code](tutorials\u002Fpointing_example\u002Fobject_pointing_example.py)   | [notebook](tutorials\u002Fpointing_example\u002F_full_notebook.ipynb) |\n|                  |                         `gui pointing`    |      \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_1e36b6a1daee.jpg\" width=\"240\"\u002F>              | [code](tutorials\u002Fpointing_example\u002Fgui_pointing_example.py)      |       
[notebook](tutorials\u002Fpointing_example\u002F_full_notebook.ipynb)                                           |\n|                  | `affordance pointing` | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_79d49f953414.jpg\" width=\"240\"\u002F> | [code](tutorials\u002Fpointing_example\u002Faffordance_pointing_example.py)  |       [notebook](tutorials\u002Fpointing_example\u002F_full_notebook.ipynb)                                           |\n| Visual prompting | `visual prompting` | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_ff86016ed730.jpg\" width=\"240\"\u002F> | [code](tutorials\u002Fvisual_prompting_example\u002Fvisual_prompt_example.py) | [notebook](tutorials\u002Fvisual_prompting_example\u002F_full_tutorial.ipynb) |\n| OCR | `ocr word box` | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_6b197d857baf.jpg\" width=\"240\"\u002F> | [code](tutorials\u002Focr_example\u002Focr_word_box_example.py) | [notebook](tutorials\u002Focr_example\u002F_full_tutorial.ipynb) |\n|                  | `ocr textline box` | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_876389de49da.jpg\" width=\"240\"\u002F> | [code](tutorials\u002Focr_example\u002Focr_textline_box_example.py) |       [notebook](tutorials\u002Focr_example\u002F_full_tutorial.ipynb)                                           |\n|                  | `ocr polygon` | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_51c017e3a1eb.jpg\" width=\"240\"\u002F> | [code](tutorials\u002Focr_example\u002Focr_polygon_example.py) |       [notebook](tutorials\u002Focr_example\u002F_full_tutorial.ipynb)                                           |\n| Keypointing | `person keypointing` | \u003Cimg 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_31082b38436c.jpg\" width=\"240\"\u002F> | [code](tutorials\u002Fkeypointing_example\u002Fperson_keypointing_example.py) | [notebook](tutorials\u002Fkeypointing_example\u002F_full_tutorial.ipynb)|\n|             | `animal keypointing`   |     \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_8d74c8bb4bd1.jpg\" width=\"240\"\u002F>                     |  [code](tutorials\u002Fkeypointing_example\u002Fanimal_keypointing_example.py)                                                |       [notebook](tutorials\u002Fkeypointing_example\u002F_full_tutorial.ipynb)                                           |\n| Other | `batch inference` |  | [code](tutorials\u002Fother_example\u002Fbatch_inference.py) ||\n\n## 4. Applications of Rex-Omni\n\nRex-Omni's unified detection framework enables seamless integration with other vision models.\n\n| Application | Description | Demo | Documentation |\n|:------------|:------------|:----:|:-------------:|\n| **Rex-Omni + SAM** | Combine language-driven detection with pixel-perfect segmentation. Rex-Omni detects objects → SAM generates precise masks | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_e16f880dc9ee.jpg\" width=\"500\"\u002F> | [README](applications\u002F_1_rexomni_sam\u002FREADME.md) |\n| **Grounding Data Engine** | Automatically generate phrase grounding annotations from image captions using spaCy and Rex-Omni. | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_ab122e564976.jpg\" width=\"500\"\u002F> | [README](applications\u002F_2_automatic_grounding_data_engine\u002FREADME.md) |\n\n\n## 5. 
Gradio Demo\n\n![img](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_985fa10a3b37.png)\n\nWe provide an interactive Gradio demo that allows you to test all Rex-Omni capabilities through a web interface.\n\n### Quick Start\n```bash\n# Launch the demo\nCUDA_VISIBLE_DEVICES=0 python app.py --model_path IDEA-Research\u002FRex-Omni\n\n# With custom settings\nCUDA_VISIBLE_DEVICES=0 python app.py \\\n    --model_path IDEA-Research\u002FRex-Omni \\\n    --backend vllm \\\n    --server_name 0.0.0.0 \\\n    --server_port 7890\n```\n\n### Available Options\n- `--model_path`: Model path or HuggingFace repo ID (default: \"IDEA-Research\u002FRex-Omni\")\n- `--backend`: Backend to use - \"transformers\" or \"vllm\" (default: \"transformers\")\n- `--server_name`: Server host address (default: \"192.168.81.138\")\n- `--server_port`: Server port (default: 5211)\n- `--temperature`: Sampling temperature (default: 0.0)\n- `--top_p`: Nucleus sampling parameter (default: 0.05)\n- `--max_tokens`: Maximum tokens to generate (default: 2048)\n\n## 6. Evaluation\nPlease refer to [Evaluation](evaluation\u002FREADME.md) for more details.\n\n## 7. Fine-tuning Rex-Omni\nPlease refer to [Fine-tuning Rex-Omni](finetuning\u002FREADME.md) for more details.\n\n## 8. LICENSE\n\nRex-Omni is licensed under the [IDEA License 1.0](LICENSE), Copyright (c) IDEA. All Rights Reserved. This model is based on Qwen, which is licensed under the [Qwen RESEARCH LICENSE AGREEMENT](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-VL-3B-Instruct\u002Fblob\u002Fmain\u002FLICENSE), Copyright (c) Alibaba Cloud. All Rights Reserved.\n\n\n## 9. Citation\nRex-Omni comes from a series of prior works. 
If you’re interested, you can take a look.\n\n- [RexThinker](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04034)\n- [RexSeek](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.08507)\n- [ChatRex](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18363)\n- [DINO-X](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14347)\n- [Grounding DINO 1.5](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.10300)\n- [T-Rex2](https:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-3-031-73414-4_3)\n- [T-Rex](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.13596)\n\n\n```text\n@misc{jiang2025detectpointprediction,\n      title={Detect Anything via Next Point Prediction}, \n      author={Qing Jiang and Junan Huo and Xingyu Chen and Yuda Xiong and Zhaoyang Zeng and Yihao Chen and Tianhe Ren and Junzhi Yu and Lei Zhang},\n      year={2025},\n      eprint={2510.12798},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.12798}, \n}\n\n\n```\n","\u003Cdiv align=center>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_c2cf7d301876.png\" width=600 >\n\u003C\u002Fdiv>\n\n\u003Ch1 align=\"center\">通过下一点预测实现万物检测\u003C\u002Fh1>\n\n\u003Cdiv align=center>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Frex-omni.github.io\u002F\">\n    \u003Cimg\n      src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRexOmni-Website-BADFDB?style=flat-square&logo=deno&logoColor=violet&color=BADFDB\"\n      alt=\"RexThinker 官网\"\n    \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.12798\">\n    \u003Cimg\n      src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRexOmni-Paper-Red%25red?logo=arxiv&logoColor=red&color=yellow\"\n      alt=\"RexThinker 在 arXiv 上的论文\"\n    \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FIDEA-Research\u002FRex-Omni-AWQ\">\n    \u003Cimg \n        
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRexOmni_AWQ-Weight-orange?logo=huggingface&logoColor=yellow\" \n        alt=\"RexThinker 的 AWQ 权重在 Hugging Face 上\"\n    \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FIDEA-Research\u002FRex-Omni\">\n    \u003Cimg \n        src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRexOmni-Weight-orange?logo=huggingface&logoColor=yellow\" \n        alt=\"RexThinker 的权重在 Hugging Face 上\"\n    \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FMountchicken\u002FRex-Omni\">\n    \u003Cimg\n      src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRexOmni-Demo-orange?logo=huggingface&logoColor=yellow\" \n      alt=\"RexThinker 的演示在 Hugging Face 上\"\n    \u002F>\n  \u003C\u002Fa>\n  \n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n> Rex-Omni 是一个拥有 30 亿参数的多模态大语言模型（MLLM），它将目标检测以及广泛的其他视觉感知任务重新定义为一个简单的下一个标记预测问题。\n\n\u003Cp align=\"center\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_bd9f4e0a5dde.png\" width=\"95%\">\u003C\u002Fp>\n\n# 新闻 🎉\n- [2026-01-10] 现已支持**指向任务微调**！使用 SFT 和 GRPO 在自定义指向数据集上训练 Rex-Omni。详情请参阅[微调指南](finetuning\u002FREADME.md)。\n- [2025-10-31] 我们发布了 Rex-Omni 的 AWQ 量化版本，存储空间节省了 50%。[Rex-Omni-AWQ](https:\u002F\u002Fhuggingface.co\u002FIDEA-Research\u002FRex-Omni-AWQ)\n- [2025-10-29] 微调代码现已[可用](finetuning\u002FREADME.md)。\n- [2025-10-17] 评估代码和数据集现已[可用](evaluation\u002FREADME.md)。\n- [2025-10-15] Rex-Omni 正式发布。\n\n# 目录\n\n- [突发新闻：我们刚刚开源了 Resophy！🎉](#breaking-news-weve-just-open-sourced-resophy-)\n- [新闻 🎉](#news-)\n- [目录](#table-of-contents)\n  - [待办事项清单 📝](#todo-list-)\n  - [1. 安装 ⛳️](#1-installation-️)\n  - [2. 快速入门：使用 Rex-Omni 进行检测](#2-quick-start-using-rex-omni-for-detection)\n      - [初始化参数（RexOmniWrapper）](#initialization-parameters-rexomniwrapper)\n      - [推理参数（rex.inference）](#inference-parameters-rexinference)\n  - [3. 食谱](#3-cookbooks)\n  - [4. 
Rex-Omni 的应用](#4-applications-of-rex-omni)\n  - [5. Gradio 演示](#5-gradio-demo)\n    - [快速入门](#quick-start)\n    - [可用选项](#available-options)\n  - [6. 评估](#6-evaluation)\n  - [7. 微调 Rex-Omni](#7-fine-tuning-rex-omni)\n  - [8. 许可证](#8-license)\n  - [9. 引用](#9-citation)\n\n\n## 待办事项清单 📝\n- [x] 添加评估代码\n- [x] 添加微调代码\n- [x] 添加量化版 Rex-Omni\n\n## 1. 安装 ⛳️\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FRex-Omni.git\ncd Rex-Omni\nconda create -n rexomni python=3.10 -y\nconda activate rexomni\npip install torch==2.7.0 torchvision --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128\npip install -r requirements.txt\npip install -v -e .\n```\n\n测试安装\n```bash\nCUDA_VISIBLE_DEVICES=1 python tutorials\u002Fdetection_example\u002Fdetection_example.py\n```\n\n如果安装成功，您将在 `tutorials\u002Fdetection_example\u002Ftest_images\u002Fcafe_visualize.jpg` 找到检测结果的可视化图。\n\n## 2. 快速入门：使用 Rex-Omni 进行检测\n以下是一个最小示例，展示了如何使用 `rex_omni` 包运行目标检测。\n\n```python\nfrom PIL import Image\nfrom rex_omni import RexOmniWrapper, RexOmniVisualize\n\n# 1) 初始化包装器（模型内部加载）\nrex = RexOmniWrapper(\n    model_path=\"IDEA-Research\u002FRex-Omni\",   # HF 仓库或本地路径\n    backend=\"transformers\",                # 或 \"vllm\" 用于高吞吐量推理\n    # 推理\u002F生成控制（适用于所有后端）\n    max_tokens=2048,\n    temperature=0.0,\n    top_p=0.05,\n    top_k=1,\n    repetition_penalty=1.05,\n)\n\n# 如果您使用的是 Rex-Omni 的 AWQ 量化版本，可以使用以下代码：\nrex = RexOmniWrapper(\n    model_path=\"IDEA-Research\u002FRex-Omni-AWQ\",\n    backend=\"vllm\",\n    quantization=\"awq\",\n    max_tokens=2048,\n    temperature=0.0,\n    top_p=0.05,\n    top_k=1,\n    repetition_penalty=1.05,\n)\n\n# 2) 准备输入\nimage = Image.open(\"tutorials\u002Fdetection_example\u002Ftest_images\u002Fcafe.jpg\").convert(\"RGB\")\ncategories = [\n    \"男人\", \"女人\", \"黄色花\", \"沙发\", \"机器人商店灯\",\n    \"毯子\", \"微波炉\", \"笔记本电脑\", \"杯子\", \"白色椅子\", \"台灯\",\n]\n\n# 3) 运行检测\nresults = rex.inference(images=image, task=\"detection\", 
categories=categories)\nresult = results[0]\n\n# 4) 可视化\nvis = RexOmniVisualize(\n    image=image,\n    predictions=result[\"extracted_predictions\"],\n    font_size=20,\n    draw_width=5,\n    show_labels=True,\n)\nvis.save(\"tutorials\u002Fdetection_example\u002Ftest_images\u002Fcafe_visualize.jpg\")\n```\n\n#### 初始化参数 (RexOmniWrapper)\n- **model_path**: Hugging Face 仓库 ID 或 Rex-Omni 模型的本地检查点目录。\n- **backend**: \"transformers\" 或 \"vllm\"。\n  - **transformers**: 易于使用，基准延迟表现良好。\n  - **vllm**: 高吞吐量、低延迟推理。需要安装 `vllm` 包并配置兼容环境。\n- **max_tokens**: 每次输出生成的最大标记数。\n- **temperature**: 采样温度；值越高随机性越大（0.0 表示确定性\u002F贪婪采样）。\n- **top_p**: 核采样参数；模型会从累积概率 ≥ top_p 的最小标记集合中进行采样。\n- **top_k**: Top-k 采样；将采样限制在最有可能的 k 个标记内。\n- **repetition_penalty**: 重复标记惩罚；>1.0 会抑制重复。\n- 可选高级设置（通过构造包装器时的 kwargs 支持）：\n  - Transformers：`torch_dtype`、`attn_implementation`、`device_map`、`trust_remote_code` 等。\n  - vLLM：`tokenizer_mode`、`limit_mm_per_prompt`、`max_model_len`、`gpu_memory_utilization`、`tensor_parallel_size`、`trust_remote_code` 等。\n\n#### 推理参数 (rex.inference)\n- **images**: 单张 `PIL.Image.Image` 对象或用于批量推理的图像列表。\n- **task**: 可选 `\"detection\"`、`\"pointing\"`、`\"visual_prompting\"`、`\"keypoint\"`、`\"ocr_box\"`、`\"ocr_polygon\"`、`\"gui_grounding\"`、`\"gui_pointing\"` 中的一项。\n- **categories**: 要检测或提取的类别名称\u002F短语列表，例如 `[\"person\", \"cup\"]`。用于构建任务提示。\n- **keypoint_type**: 关键点检测任务的关键点类型。选项有：\"person\"、\"hand\"、\"animal\"。\n- **visual_prompt_boxes**: 视觉提示任务的参考边界框。格式为绝对坐标系下的 [[x0, y0, x1, y1], ...]。\n\n返回一个字典列表（每个输入图像对应一个字典）。每个字典包含：\n- **raw_output**: LLM 生成的原始文本。\n- **extracted_predictions**: 从原始输出中解析出的结构化预测结果，按类别分组。\n  - 对于检测任务：`{category: [{\"type\": \"box\", \"coords\": [x0,y0,x1,y1]}, ...], ...}`\n  - 对于指向任务：`{category: [{\"type\": \"point\", \"coords\": [x0,y0]}, ...], ...}`\n  - 对于多边形任务：`{category: [{\"type\": \"polygon\", \"coords\": [x0,y0, ...]}, ...], ...}`\n  - 对于关键点任务：结构化的 JSON 数据。\n\n提示：\n- 为了在使用 vLLM 时获得最佳性能，请将 `backend=\"vllm\"`，并根据您的 GPU 调整 `gpu_memory_utilization` 和 
`tensor_parallel_size`。\n\n## 3. 教程\n\n我们为每项支持的任务提供了全面的教程。每个教程都包含独立的 Python 脚本和交互式的 Jupyter 笔记本。\n\n|       任务       |                                                                应用                                                                 |   演示 |                  Python 示例                    |                     笔记本                     |\n|:----------------:|:-----------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------:|:------------------------------------------------:|:------------------------------------------------:|\n| 目标检测 |                               `目标检测`                                 | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_f3c591a7df5c.jpg\" width=\"240\"\u002F>  | [code](tutorials\u002Fdetection_example\u002Fdetection_example.py)   | [notebook](tutorials\u002Fdetection_example\u002F_full_notebook.ipynb) |\n|                  |                         `对象指代`                      | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_9987e4919649.jpg\" width=\"240\"\u002F>   | [code](tutorials\u002Fdetection_example\u002Freferring_example.py)      |       [notebook](tutorials\u002Fdetection_example\u002F_full_notebook.ipynb)                                           |\n|                  | `GUI 实体定位` | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_40882272148e.jpg\" width=\"240\"\u002F> | [code](tutorials\u002Fdetection_example\u002Fgui_grounding_example.py)  |       [notebook](tutorials\u002Fdetection_example\u002F_full_notebook.ipynb)                                           |\n|                  |                    `布局实体定位`    |  \u003Cimg 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_9814a8f0a8b2.jpg\" width=\"240\"\u002F>        | [code](tutorials\u002Fdetection_example\u002Flayout_grouding_examle.py) |       [notebook](tutorials\u002Fdetection_example\u002F_full_notebook.ipynb)                                             |\n| 指向 |                               `物体指向`            |       \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_27ba68e34228.jpg\" width=\"240\"\u002F>           |   [code](tutorials\u002Fpointing_example\u002Fobject_pointing_example.py)   | [notebook](tutorials\u002Fpointing_example\u002F_full_notebook.ipynb) |\n|                  |                         `GUI 指向`    |      \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_1e36b6a1daee.jpg\" width=\"240\"\u002F>              | [code](tutorials\u002Fpointing_example\u002Fgui_pointing_example.py)      |       [notebook](tutorials\u002Fpointing_example\u002F_full_notebook.ipynb)                                           |\n|                  | `可供性指向` | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_79d49f953414.jpg\" width=\"240\"\u002F> | [code](tutorials\u002Fpointing_example\u002Faffordance_pointing_example.py)  |       [notebook](tutorials\u002Fpointing_example\u002F_full_notebook.ipynb)                                           |\n| 视觉提示 | `视觉提示` | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_ff86016ed730.jpg\" width=\"240\"\u002F> | [code](tutorials\u002Fvisual_prompting_example\u002Fvisual_prompt_example.py) | [notebook](tutorials\u002Fvisual_prompting_example\u002F_full_tutorial.ipynb) |\n| OCR | `OCR 单词框` | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_6b197d857baf.jpg\" width=\"240\"\u002F> | 
[code](tutorials\u002Focr_example\u002Focr_word_box_example.py) | [notebook](tutorials\u002Focr_example\u002F_full_tutorial.ipynb) |\n|                  | `OCR 文本行框` | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_876389de49da.jpg\" width=\"240\"\u002F> | [code](tutorials\u002Focr_example\u002Focr_textline_box_example.py) |       [notebook](tutorials\u002Focr_example\u002F_full_tutorial.ipynb)                                           |\n|                  | `OCR 多边形` | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_51c017e3a1eb.jpg\" width=\"240\"\u002F> | [code](tutorials\u002Focr_example\u002Focr_polygon_example.py) |       [notebook](tutorials\u002Focr_example\u002F_full_tutorial.ipynb)                                           |\n| 关键点检测 | `人体关键点检测` | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_31082b38436c.jpg\" width=\"240\"\u002F> | [code](tutorials\u002Fkeypointing_example\u002Fperson_keypointing_example.py) | [notebook](tutorials\u002Fkeypointing_example\u002F_full_tutorial.ipynb)|\n|             | `动物关键点检测`   |     \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_8d74c8bb4bd1.jpg\" width=\"240\"\u002F>                     |  [code](tutorials\u002Fkeypointing_example\u002Fanimal_keypointing_example.py)                                                |       [notebook](tutorials\u002Fkeypointing_example\u002F_full_tutorial.ipynb)                                           |\n| 其他 | `批量推理` |  | [code](tutorials\u002Fother_example\u002Fbatch_inference.py) ||\n\n## 4. 
Rex-Omni 的应用\n\nRex-Omni 的统一检测框架使其能够与其他视觉模型无缝集成。\n\n| 应用 | 描述 | 演示 | 文档 |\n|:------------|:------------|:----:|:-------------:|\n| **Rex-Omni + SAM** | 将语言驱动的目标检测与像素级精确分割相结合。Rex-Omni 检测物体 → SAM 生成精确的掩码 | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_e16f880dc9ee.jpg\" width=\"500\"\u002F> | [README](applications\u002F_1_rexomni_sam\u002FREADME.md) |\n| **实体定位数据引擎** | 使用 spaCy 和 Rex-Omni，根据图像标题自动生成短语实体定位标注。 | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_ab122e564976.jpg\" width=\"500\"\u002F> | [README](applications\u002F_2_automatic_grounding_data_engine\u002FREADME.md) |\n\n\n## 5. Gradio 演示\n\n![img](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_readme_985fa10a3b37.png)\n\n我们提供了一个交互式的 Gradio 演示，您可以通过网页界面测试 Rex-Omni 的所有功能。\n\n### 快速开始\n```bash\n# 启动演示\nCUDA_VISIBLE_DEVICES=0 python app.py --model_path IDEA-Research\u002FRex-Omni\n\n# 自定义设置\nCUDA_VISIBLE_DEVICES=0 python app.py \\\n    --model_path IDEA-Research\u002FRex-Omni \\\n    --backend vllm \\\n    --server_name 0.0.0.0 \\\n    --server_port 7890\n```\n\n### 可用选项\n- `--model_path`: 模型路径或 HuggingFace 仓库 ID（默认值：“IDEA-Research\u002FRex-Omni”）\n- `--backend`: 使用的后端 - “transformers” 或 “vllm”（默认值：“transformers”）\n- `--server_name`: 服务器主机地址（默认值：“192.168.81.138”）\n- `--server_port`: 服务器端口（默认值：5211）\n- `--temperature`: 采样温度（默认值：0.0）\n- `--top_p`: 核采样参数（默认值：0.05）\n- `--max_tokens`: 最大生成标记数（默认值：2048）\n\n## 6. 评估\n更多详情请参阅 [评估](evaluation\u002FREADME.md)。\n\n## 7. 微调 Rex-Omni\n更多详情请参阅 [微调 Rex-Omni](finetuning\u002FREADME.md)。\n\n## 8. 许可证\n\nRex-Omni 采用 [IDEA License 1.0](LICENSE) 许可证授权，版权所有 © IDEA。保留所有权利。本模型基于 Qwen，Qwen 采用 [Qwen RESEARCH LICENSE AGREEMENT](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-VL-3B-Instruct\u002Fblob\u002Fmain\u002FLICENSE) 许可证授权，版权所有 © 阿里云。保留所有权利。\n\n## 9. 
引用\nRex-Omni 源自一系列先前的工作。如果您感兴趣，可以查阅以下文献。\n\n- [RexThinker](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04034)\n- [RexSeek](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.08507)\n- [ChatRex](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18363)\n- [DINO-X](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14347)\n- [Grounding DINO 1.5](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.10300)\n- [T-Rex2](https:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-3-031-73414-4_3)\n- [T-Rex](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.13596)\n\n\n```text\n@misc{jiang2025detectpointprediction,\n      title={Detect Anything via Next Point Prediction}, \n      author={Qing Jiang and Junan Huo and Xingyu Chen and Yuda Xiong and Zhaoyang Zeng and Yihao Chen and Tianhe Ren and Junzhi Yu and Lei Zhang},\n      year={2025},\n      eprint={2510.12798},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.12798}, \n}\n\n\n```","# Rex-Omni 快速上手指南\n\nRex-Omni 是一个 30 亿参数的多模态大语言模型（MLLM），它将目标检测及多种视觉感知任务重新定义为简单的“下一个 token 预测”问题。本指南将帮助您快速在本地部署并运行该模型。\n\n## 1. 环境准备\n\n在开始之前，请确保您的系统满足以下要求：\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04+)\n*   **Python 版本**: 3.10\n*   **GPU**: 支持 CUDA 的 NVIDIA 显卡（建议显存 ≥ 16GB 以运行全精度模型，量化版可降低需求）\n*   **CUDA 版本**: 12.8 (对应 PyTorch 2.7.0)\n*   **依赖管理**: 已安装 `conda` 和 `git`\n\n## 2. 安装步骤\n\n请依次执行以下命令来克隆代码库、创建虚拟环境并安装依赖。\n\n```bash\n# 1. 克隆仓库\ngit clone https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FRex-Omni.git\ncd Rex-Omni\n\n# 2. 创建并激活 Conda 环境\nconda create -n rexomni python=3.10 -y\nconda activate rexomni\n\n# 3. 安装 PyTorch (指定 CUDA 12.8 版本)\npip install torch==2.7.0 torchvision --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128\n\n# 4. 安装项目依赖\npip install -r requirements.txt\n\n# 5. 
以可编辑模式安装 rex_omni 包\npip install -v -e .\n```\n\n**验证安装：**\n运行以下测试脚本，如果安装成功，将在 `tutorials\u002Fdetection_example\u002Ftest_images\u002F` 目录下生成可视化结果图 `cafe_visualize.jpg`。\n\n```bash\nCUDA_VISIBLE_DEVICES=1 python tutorials\u002Fdetection_example\u002Fdetection_example.py\n```\n\n## 3. 基本使用\n\n以下是最小化的 Python 示例，展示如何使用 `rex_omni` 进行物体检测。该示例支持加载全精度模型或 AWQ 量化版本。\n\n### 代码示例\n\n```python\nfrom PIL import Image\nfrom rex_omni import RexOmniWrapper, RexOmniVisualize\n\n# 1) 初始化模型包装器\n# 选项 A: 使用全精度模型 (后端可选 transformers 或 vllm)\nrex = RexOmniWrapper(\n    model_path=\"IDEA-Research\u002FRex-Omni\",   \n    backend=\"transformers\",                # 高吞吐量场景可改为 \"vllm\"\n    max_tokens=2048,\n    temperature=0.0,\n    top_p=0.05,\n    top_k=1,\n    repetition_penalty=1.05,\n)\n\n# 选项 B: 使用 AWQ 量化版本 (节省 50% 显存，需配合 vllm 后端)\n# rex = RexOmniWrapper(\n#     model_path=\"IDEA-Research\u002FRex-Omni-AWQ\",\n#     backend=\"vllm\",\n#     quantization=\"awq\",\n#     max_tokens=2048,\n#     temperature=0.0,\n#     top_p=0.05,\n#     top_k=1,\n#     repetition_penalty=1.05,\n# )\n\n# 2) 准备输入数据\nimage = Image.open(\"tutorials\u002Fdetection_example\u002Ftest_images\u002Fcafe.jpg\").convert(\"RGB\")\ncategories = [\n    \"man\", \"woman\", \"yellow flower\", \"sofa\", \"robot-shope light\",\n    \"blanket\", \"microwave\", \"laptop\", \"cup\", \"white chair\", \"lamp\",\n]\n\n# 3) 执行推理 (支持任务：detection, pointing, visual_prompting, keypoint, ocr_box 等)\nresults = rex.inference(images=image, task=\"detection\", categories=categories)\nresult = results[0]\n\n# 4) 可视化结果\nvis = RexOmniVisualize(\n    image=image,\n    predictions=result[\"extracted_predictions\"],\n    font_size=20,\n    draw_width=5,\n    show_labels=True,\n)\nvis.save(\"cafe_visualize.jpg\")\nprint(\"检测完成，结果已保存至 cafe_visualize.jpg\")\n```\n\n### 核心参数说明\n\n*   **model_path**: Hugging Face 仓库 ID 或本地模型路径。\n*   **backend**: \n    *   `\"transformers\"`: 易用，适合基线测试。\n    *   `\"vllm\"`: 高吞吐、低延迟，适合生产环境（需额外安装 `vllm`）。\n*   
**task**: 指定任务类型，包括 `\"detection\"` (检测), `\"pointing\"` (指点), `\"keypoint\"` (关键点), `\"ocr_box\"` (文字框) 等。\n*   **categories**: 需要检测或提取的类别名称列表。\n\n> **提示**: 若使用 VLLM 后端以获得最佳性能，建议根据显卡情况调整 `gpu_memory_utilization` 和 `tensor_parallel_size` 参数。","某电商平台的视觉算法团队正致力于升级其商品自动上架系统，需要从海量卖家上传的杂乱图片中精准提取商品主体及其关键属性。\n\n### 没有 Rex-Omni 时\n- **多模型堆砌复杂**：团队需分别部署目标检测、实例分割和关键点定位等多个独立模型，导致推理链路冗长，显存占用极高。\n- **长尾场景识别差**：面对非标准拍摄角度或罕见新品类，传统检测框难以适应，常出现漏检或框选不准，需大量人工复核。\n- **定制开发成本高**：每当新增一种细粒度检测需求（如识别服装的具体纽扣位置），都需要重新标注数据并训练专用模型，周期长达数周。\n- **交互能力缺失**：系统无法理解“找出图中左侧红色的那个包”这类自然语言指令，只能机械地输出所有检测结果，缺乏灵活性。\n\n### 使用 Rex-Omni 后\n- **统一架构降本增效**：Rex-Omni 将检测、分割等任务统一转化为“下一点预测”问题，单个 3B 参数模型即可搞定所有视觉感知任务，显存占用降低 60%。\n- **泛化能力显著提升**：得益于多模态大模型的语义理解力，Rex-Omni 能精准捕捉模糊描述下的商品主体，即使在复杂背景下也能实现像素级精准定位。\n- **零样本快速适配**：面对新类目，只需通过简单的提示词（Prompt）或少量示例微调，Rex-Omni 即可立即生效，新业务上线时间从周级缩短至小时级。\n- **支持自然语言交互**：运营人员可直接输入“标记出所有打折商品的标签位置”，Rex-Omni 即刻理解意图并返回对应坐标，大幅提升了人机协作效率。\n\nRex-Omni 通过将复杂的视觉感知重构为简单的序列预测，真正实现了“一个模型解决所有检测难题”，让电商图像处理变得既智能又轻盈。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FIDEA-Research_Rex-Omni_985fa10a.png","IDEA-Research","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FIDEA-Research_b8b3359e.png","The International Digital Economy Academy (“IDEA”). ",null,"www.idea.edu.cn","https:\u002F\u002Fgithub.com\u002FIDEA-Research",[79,83,87,91,95,99],{"name":80,"color":81,"percentage":82},"Jupyter Notebook","#DA5B0B",91.4,{"name":84,"color":85,"percentage":86},"Python","#3572A5",7.1,{"name":88,"color":89,"percentage":90},"C","#555555",0.9,{"name":92,"color":93,"percentage":94},"C++","#f34b7d",0.6,{"name":96,"color":97,"percentage":98},"Shell","#89e051",0.1,{"name":100,"color":101,"percentage":102},"Jinja","#a52a22",0,1307,90,"2026-04-18T09:32:30","NOASSERTION","Linux","必需 NVIDIA GPU。安装命令指定 CUDA 12.8 (cu128)。支持 vLLM 后端进行高吞吐推理，需根据显卡调整 gpu_memory_utilization 和 tensor_parallel_size。模型为 3B 参数，量化版 (AWQ) 可节省 50% 显存。","未说明",{"notes":111,"python":112,"dependencies":113},"1. 官方安装示例仅针对 Linux 环境 (使用 CUDA 12.8)。2. 
提供两种后端：'transformers' (易用) 和 'vllm' (高性能，需额外配置)。3. 支持 AWQ 量化版本以减半存储和显存需求。4. 支持多种任务包括检测、指点、OCR、关键点检测等。5. 需通过 Hugging Face 下载模型权重。","3.10",[114,115,116,117,118,119],"torch==2.7.0","torchvision","transformers","vllm (可选，用于高吞吐推理)","PIL (Pillow)","gradio (用于 Demo)",[35,15],[122,123,124],"mllm","object-detection","open-set","2026-03-27T02:49:30.150509","2026-04-19T15:46:30.377240",[128,133,137,141,146,151,155,159],{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},42553,"经过 SFT 和 GRPO 微调后，目标检测效果很差（误检、漏检、框偏差大）是什么原因？","主要原因可能是训练数据中缺乏负样本（Negative Prompts）。如果在推理时输入了全部类别，但训练过程中只包含存在目标的类别（GT），模型会产生幻觉，即无论目标是否存在都强行输出检测框。\n解决方案：\n1. 在训练数据中加入背景类别作为负样本（即图中不存在的类别，bbox 设为 null 或 None）。\n2. 尝试调整正负样本比例至 1:1，这有助于改善幻觉问题，让模型学会在目标不存在时输出 None。\n3. 如果 SFT 后模型已具备输出 None 的能力但表现不佳，可以继续使用 GRPO 进行强化学习修复。","https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FRex-Omni\u002Fissues\u002F41",{"id":134,"question_zh":135,"answer_zh":136,"source_url":132},42554,"如何在训练数据中正确添加负样本以解决模型幻觉问题？","添加负样本时，可以基于同一张训练图像生成两条数据记录：一条包含真实存在的目标（正样本），另一条包含该图像中不存在的类别（负样本）。\n数据格式示例：\n- 正样本：{\"image_name\": \"xxx.png\", \"annotation\": {\"boxes\": [{\"bbox\": [x1, y1, x2, y2], \"phrase\": \"目标名称\"}]}}\n- 负样本：{\"image_name\": \"xxx.png\", \"annotation\": {\"boxes\": [{\"bbox\": null, \"phrase\": \"不存在的目标名称\"}]}}\n注意：负样本中的 bbox 应设为 null，表示该类别在当前图像中不存在。通过这种方式训练，可以让模型学习到何时不应输出检测框。",{"id":138,"question_zh":139,"answer_zh":140,"source_url":132},42555,"SFT 微调后 Loss 收敛到 1.3 左右是否正常？","是正常的。对于目标检测任务，尤其是使用 Object365 等大规模数据集进行 SFT 微调时，Loss 收敛到 1.3 左右是合理的现象。通常训练 1-2 个 epoch 即可，关键在于数据量要足够大。如果 Loss 在此数值附近震荡且不再显著下降，通常不代表训练失败，建议结合验证集的实际检测效果（如 mAP）来评估模型性能，而非单纯依赖 Loss 数值。",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},42556,"运行 GRPO 训练时遇到 CUDA Out of Memory (显存爆炸) 怎么办？","GRPO 训练对显存要求较高，即使使用双卡 A6000 (48G) 也可能爆显存。日志显示在 sharding manager 同步权重或 vllm offload 阶段显存激增。\n建议尝试以下优化：\n1. 减少采样数量：检查采样参数中的 'n' 值（默认为 8），尝试减小该值以减少并行生成的序列数。\n2. 调整 batch size：减小 --per_device_train_batch_size。\n3. 
开启 DeepSpeed ZeRO 优化：确保使用了合适的 DeepSpeed 配置文件（如 zero2.json 或 zero3.json）来分片优化器和状态。\n4. 限制最大生成长度：适当减小 --max_model_len 或生成时的 max_tokens 参数。","https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FRex-Omni\u002Fissues\u002F110",{"id":147,"question_zh":148,"answer_zh":149,"source_url":150},42557,"如何为新任务添加特殊的坐标 Token（Special Tokens）？","通常做法是将词表最后的 1000 个 token 替换为离散坐标 token（对应 0-1000 的坐标值）。\n具体操作步骤（以修改 tokenizer 文件为例）：\n1. 修改 tokenizer.json：更新 vocab 和 merges 规则，将最后 1000 个条目替换为坐标 token（如 \u003C0>, \u003C1>...\u003C1000>）。\n2. 修改 tokenizer_config.json：确保 special tokens 配置正确，有时需要将相关标记的 special 属性设为 false 以避免预处理冲突。\n3. 如果使用非官方框架（如 ms-swift），可能需要编写脚本自动完成上述替换，并确保模型初始化时加载了修改后的词表。","https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FRex-Omni\u002Fissues\u002F22",{"id":152,"question_zh":153,"answer_zh":154,"source_url":150},42558,"目标检测任务训练 Loss 降到 2.0 左右就不再下降，且 Token Accuracy 只有 50%，如何解决？","这是密集场景或多目标检测任务中的常见现象，Loss 在 2.0 左右震荡通常是正常的，不一定代表训练不充分。\n建议策略：\n1. 两阶段微调：先使用大型通用数据集（如 Object365）进行第一阶段微调，使模型掌握通用的检测能力；然后再使用特定的密集场景数据进行第二阶段微调。\n2. 检查重复和漏框问题：如果存在大量重复预测或漏检，可能是解码策略问题，可尝试调整 NMS 阈值或在后处理阶段优化。\n3. 使用 GRPO 优化：对于框偏移或难以收敛的问题，可以在 SFT 之后使用 GRPO（强化学习）进一步微调，以提升定位精度。",{"id":156,"question_zh":157,"answer_zh":158,"source_url":150},42559,"为什么坐标数字需要用尖括号 \u003C> 包起来作为 Special Token？","将数字用尖括号（如 \u003C100>）包起来是为了将其定义为独立的特殊 Token，防止分词器（Tokenizer）将数字拆解为多个子词（subwords），从而保证坐标预测的原子性和准确性。如果直接使用纯数字（如 10-1000），分词器可能会根据 BPE 规则将其切分，导致模型难以直接回归精确的坐标值。因此，需要在词表中明确定义这些带括号的标记为特殊 Token。",{"id":160,"question_zh":161,"answer_zh":162,"source_url":163},42560,"添加负样本训练后仍然出现严重幻觉（硬凑出不存在的目标），如何进一步改善？","即使添加了负样本，如果正负比例不平衡或训练强度不够，幻觉仍可能存在。\n改进措施：\n1. 严格平衡正负样本比例：确保训练集中正样本和负样本的比例接近 1:1，强迫模型频繁面对“无目标”的情况。\n2. 增加负样本多样性：负样本中应包含各种常见的但不存在于当前图像中的类别，覆盖长尾分布。\n3. 启用 GRPO 训练：SFT 主要让模型学会输出格式和基本的 None 概念，而 GRPO 可以通过奖励机制（Reward Model）惩罚错误的检测框，奖励正确的“无检测”行为，从而显著降低幻觉率。","https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FRex-Omni\u002Fissues\u002F50",[]]