[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-yfzhang114--Thyme":3,"tool-yfzhang114--Thyme":61},[4,18,26,36,44,52],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",141543,2,"2026-04-06T11:32:54",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107888,"2026-04-06T11:32:50",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 
架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":53,"name":54,"github_repo":55,"description_zh":56,"stars":57,"difficulty_score":10,"last_commit_at":58,"category_tags":59,"status":17},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[14,15,13,60],"视频",{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":78,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":107,"forks":108,"last_commit_at":109,"license":110,"difficulty_score":111,"env_os":112,"env_gpu":113,"env_ram":114,"env_deps":115,"category_tags":129,"github_topics":78,"view_count":32,"oss_zip_url":78,"oss_zip_packed_at":78,"status":17,"created_at":131,"updated_at":132,"faqs":133,"releases":174},4530,"yfzhang114\u002FThyme","Thyme","✨✨ [ICLR 2026] Think Beyond Images","Thyme 是一款突破传统视觉交互范式的开源 AI 模型，旨在让智能体“超越图像本身”进行思考。它不再局限于被动地识别图片内容，而是能够自主生成并执行多样化的图像处理代码与计算操作，从而主动分析高分辨率图像或解决复杂的推理难题。\n\n这一工具主要解决了现有多模态模型在面对精细感知任务和复杂逻辑推导时能力不足的问题。通过引入创新的“监督微调 + 强化学习”两阶段训练策略，并结合独有的 GRPO-ATS 算法，Thyme 在探索推理路径与确保代码执行精度之间取得了极佳平衡，显著提升了处理高难度视觉任务的准确率。\n\nThyme 特别适合人工智能研究人员、大模型开发者以及需要处理复杂视觉数据分析的工程团队使用。无论是希望深入研究视觉推理机制的学者，还是致力于构建下一代具身智能或自动化分析系统的工程师，都能利用其开放的 SFT 与 RL 模型及数据集进行二次开发或评估。作为入选 ICLR 2026 的前沿成果，Thyme 为多模态领域提供了一套从数据准备、模型训练到效果评估的完整解决方案，助力社区探索视觉智能的新边界。","\u003Cp align=\"center\" width=\"40%\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyfzhang114_Thyme_readme_8148a5627742.png\"  width=\"50%\" height=\"50%\">\n\u003C\u002Fp>\n\n\u003Cfont size=7>\u003Cdiv align='center' >  \n[[📖 Home Page](https:\u002F\u002Fthyme-vl.github.io\u002F)] \n[[📖 Technique Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.11630)] \n\n[[📊 Thyme SFT Model](https:\u002F\u002Fhuggingface.co\u002FKwai-Keye\u002FThyme-SFT)] \n[[📊 Thyme RL Model](https:\u002F\u002Fhuggingface.co\u002FKwai-Keye\u002FThyme-RL)] \n[[📝 SFT Data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FKwai-Keye\u002FThyme-SFT)] \n[[📝 RL Data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FKwai-Keye\u002FThyme-RL)] \n\u003C\u002Fdiv>\u003C\u002Ffont>\n\n## 🔥 News\n* **`2025.09.04`** 🌟 Thyme is supported by [VLMEvalKit](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit) and [LMMs-Eval](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval). Feel free to use it without hesitation!\n* **`2025.08.18`** 🌟 We are excited to introduce **Thyme: Think Beyond Images**. 
Thyme transcends traditional ``thinking with images'' paradigms by autonomously generating and executing diverse image processing and computational operations through executable code, significantly enhancing performance on high-resolution perception and complex reasoning tasks. Leveraging a novel two-stage training strategy that combines supervised fine-tuning with reinforcement learning and empowered by the innovative GRPO-ATS algorithm, Thyme achieves a sophisticated balance between reasoning exploration and code execution precision.\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyfzhang114_Thyme_readme_3a2ee1a20d2c.png\"  width=\"100%\" height=\"100%\">\n\u003C\u002Fp>\n\n\n---\n# Table of Contents\n\n1. [Quick Start](#1-quick-start)\n2. [Data Preparation](#2-data-preparation)\n3. [Supervised Fine-Tuning (Thyme-SFT)](#3-supervised-fine-tuning-thyme-sft)\n4. [Reinforcement Learning (Thyme-RL)](#4-reinforcement-learning-thyme-rl)\n5. [Evaluation](#5-evaluation)\n6. [Usage Example: How to use Thyme](#6-usage-example-how-to-use-thyme)\n7. [Citation](#citation)\n8. [Related Projects](#related-projects)\n\n---\n\n## 1. Quick Start\n\n### 1.1 Clone the Repository\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FThyme.git\ncd Thyme\n```\n\n### 1.2 Environment Setup & Dependency Installation\n\nWe recommend creating a Conda environment for isolation and installing dependencies as follows:\n\n```bash\nconda create -n Thyme python=3.10 -y\nconda activate Thyme\n\npip install -e .\npip install \"sglang[all]\" -U\npip install \"vllm>=0.5.1\" \"transformers\u003C4.55\" \"trl\u003C0.21\" -U\npip install \"lmdeploy>=0.5,\u003C0.9\" -U --no-deps\npip install autoawq -U --no-deps\npip install auto_gptq optimum bitsandbytes \"gradio\u003C5.33\" -U\npip install git+https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fms-swift.git\npip install timm -U\npip install \"deepspeed\u003C0.17\" -U\npip install qwen_vl_utils qwen_omni_utils decord librosa icecream soundfile -U\npip install liger_kernel nvitop pre-commit math_verify py-spy -U\npip install wandb\n\npip install flash-attn --no-build-isolation --use-pep517\n\n```\n\n---\n\n## 2. Data Preparation\n\n### 2.1 Download Dataset\n\nObtain the training data from the [HuggingFace Dataset Page](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FKwai-Keye\u002FThyme-SFT). The SFT dataset consists of three splits:\n\n* `wo_thinking_thyme_single_round`: Single-turn image operation data\n* `2round`: Multi-turn dialogue data\n* `computation`: Annealing data used for computational tasks\n\nEach sample’s `image` field is a list containing the original and processed images.\n\n### 2.2 Process Images and Update Paths\n\nBefore training, ensure all referenced images are downloaded and saved locally. 
Update the dataset files (e.g., `.jsonl`) by replacing image URLs or remote paths with local absolute paths, for example:\n\n```json\n\"image\": [\n  \"\u002Fpath\u002Fto\u002Foriginal_images\u002F0904.0709_0.jpg\",\n  \"\u002Fpath\u002Fto\u002Fprocessed_images\u002F0904.0709_0_6349.jpg\"\n]\n```\n\n```\n#!\u002Fusr\u002Fbin\u002Fenv python3\nimport os, json, base64\nfrom pathlib import Path\nfrom concurrent.futures import ThreadPoolExecutor, as_completed\nfrom datasets import load_dataset\nfrom tqdm import tqdm\nimport io\nfrom PIL import Image\n\nHF_DATA_DIR = \".\u002Fdata\u002FThyme-SFT\"\nROOT_OUT    = Path(\"Thyme_sft_data\")\nIMG_ROOT    = '.\u002Fdata\u002FThyme_sft_data\u002Fimg'\nJSONL_ROOT  = ROOT_OUT \u002F \"jsonl\"\nSPLITS      = [\"wo_thinking_thyme_single_round\", \"2round\", \"computation\"] #,  \nMAX_WORKERS = os.cpu_count()      # Can be adjusted based on machine specs\n\n# IMG_ROOT.mkdir(parents=True, exist_ok=True)\nJSONL_ROOT.mkdir(parents=True, exist_ok=True)\n\n# ----------- Thread pool task -----------\ndef save_one_image(args):\n    \"\"\"\n    Decode Base64 string, handle transparency and save image as JPEG.\n\n    Args:\n        args (tuple): Tuple containing (b64_str, save_path).\n    \"\"\"\n    b64_str, save_path = args\n    if os.path.exists(save_path):\n        return save_path\n\n    try:\n        # 1. Decode Base64 to get raw binary data\n        image_bytes = base64.b64decode(b64_str)\n\n        # 2. Use Pillow to open image from binary data\n        with Image.open(io.BytesIO(image_bytes)) as img:\n            # 3. Handle transparency (key step)\n            # Check if image mode needs transparency handling.\n            # 'P' mode may contain transparency, 'LA' is grayscale+transparency.\n            # 'RGBA' is the most common mode with transparency.\n            if img.mode in (\"RGBA\", \"LA\", \"P\"):\n                # To uniformly handle all transparency cases, first convert image to RGBA mode.\n                # If image is in 'P' mode with transparency, conversion will result in correct RGBA image.\n                img = img.convert(\"RGBA\")\n\n                # Create a white background base image\n                background = Image.new(\"RGB\", img.size, (255, 255, 255))\n\n                # Paste original image onto the background.\n                # At this point img is already in RGBA mode, so it can serve as its own mask.\n                # Pillow will automatically use its Alpha channel.\n                background.paste(img, (0, 0), img)\n                img = background # Now img is the merged RGB image\n\n            # If image mode is not RGB (e.g., 'L', 'CMYK', etc.), convert to RGB\n            elif img.mode != \"RGB\":\n                img = img.convert(\"RGB\")\n\n            # 4. Save image in JPEG format\n            # JPEG doesn't support transparency, so background filling is necessary.\n            img.save(save_path, \"jpeg\", quality=95) # Recommend adding quality parameter\n\n        return str(save_path)\n\n    except Exception as e:\n        # Add exception handling for debugging which image caused the problem\n        print(f\"Error processing image for {save_path}: {e}\")\n        return None\n\n# ----------- Main processing -----------\nfor split in SPLITS:\n    print(f\"\\n>>> Processing split : {split}  (max_workers={MAX_WORKERS})\")\n    # 3. 
Write jsonl, check if already exists\n    jsonl_path = JSONL_ROOT \u002F f\"{split}.jsonl\"\n    if not jsonl_path.exists():  # Only write if jsonl file doesn't exist\n        print(f\"  JSONL  -> {jsonl_path}\")\n    else:\n        print(f\"  JSONL already exists: {jsonl_path}\")\n\n    ds = load_dataset(HF_DATA_DIR, split=split)\n\n    img_dir = IMG_ROOT + '\u002F' + split\n    # img_dir.mkdir(exist_ok=True)\n\n    # 1. First collect all tasks to be saved\n    tasks = []               # (b64_str, save_path)\n    records = []             # For writing jsonl\n    for sample_idx, sample in enumerate(ds):\n        img_paths = []\n        for img_idx, b64_img in enumerate(sample[\"image\"], start=1):\n            img_name = f\"{sample_idx+1:08d}_{img_idx:02d}.jpg\"\n            img_path = img_dir + '\u002F' + img_name\n            tasks.append((b64_img, img_path))\n            img_paths.append(str(img_path))\n        records.append({\n            \"image\": img_paths,\n            \"question\": sample[\"question\"],\n            \"response\": sample[\"response\"]\n        })\n\n    # 2. Execute with multi-threading\n    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:\n        # Only save images that don't already exist\n        saved_images = list(tqdm(pool.map(save_one_image, tasks),\n                                 total=len(tasks), desc=\"Saving images\"))\n    \n    # Filter out items that returned None (i.e., files that already existed)\n    saved_images = [img for img in saved_images if img is not None]\n\n    with open(jsonl_path, \"w\", encoding=\"utf-8\") as f:\n        for rec in records:\n            f.write(json.dumps(rec, ensure_ascii=False) + \"\\n\")\n    \n    print(f\"  Images -> {img_dir}  ({len(saved_images)} files)\")\n\nprint(\"\\nAll done (multi-threaded)!\")\n\n```\n\n### 2.3 File Path Conversion for System Integration\n\nIn every question, there is a specified file path that needs to be converted into the correct system path for use in our platform. The following steps outline the process for handling these paths.\n\n#### Conversion Process:\n\n1. **Original Path Format**:\n\n   * Example: `\"User Image Path: \\\"\u002Fmllm_hdd\u002Fyfzhang\u002Fdata\u002Ftemp_processed_images\u002Fcauldron_dvqa_images_dvqa_00110792.png_rotated_image_318.png\\\"\"`\n\n2. **Transformation**:\n\n   * Extract the filename from the original path.\n   * Convert it into the first element of the `image` array in the system.\n   * This element will represent the correct file path for the system.\n\n3. **Response Path Conversion**:\n\n   * Similarly, ensure that any corresponding absolute paths provided in the `response` are transformed to match the system format as described.\n\n---\n\n## 3. Supervised Fine-Tuning (Thyme-SFT)\n\n### 3.1 Training Data Format\n\nTraining samples follow this JSON format example (full dataset includes similar structures):\n\n```json\n{\n  \"image\": [\"\u002Fpath\u002Fto\u002Foriginal.jpg\", \"\u002Fpath\u002Fto\u002Fprocessed.jpg\"],\n  \"question\": \"\u003Cimage>\\nBased on the top-right graph, describe the behavior of P(z) as z approaches zero. 
Options:\\n...\",\n  \"response\": \"\u003Cthink>Detailed reasoning and executable code...\u003C\u002Fthink>\u003Canswer>B\u003C\u002Fanswer>\"\n}\n```\n\n### 3.2 Configure Training Paths\n\nSet these variables in your training script or environment:\n\n* `DATASET`: Path to your training dataset\n* `SAVE_PATH`: Directory to save the trained model\n* `Model`: Path to your model\n\n### 3.3 Run Training\n\nExecute the training scripts:\n\n```bash\nsh scripts\u002Fsft_stage1.sh   # Stage 1: Supervised fine-tuning\nsh scripts\u002Fsft_stage2.sh   # Stage 2: Computational Data Annealing\n```\n\n**Note:** Since the computational data contains samples with and without images, this stage requires these two types of data to be processed and input separately. Please ensure your script handles this accordingly.\n\n---\n\n## 4. Reinforcement Learning (Thyme-RL)\n\n### 4.1 RL Dataset Format\n\nEach RL data sample follows this structure:\n\n```json\n{\n  \"messages\": [\n    {\n      \"role\": \"system\",\n      \"content\": \"You are a helpful assistant. Solve the problem step-by-step...\"\n    },\n    {\n      \"role\": \"user\",\n      \"content\": \"\u003Cimage>\\nQuestion text...\\n\"\n    }\n  ],\n  \"images\": [\"\u002Fpath\u002Fto\u002Fimage.jpg\"],\n  \"solution\": \"black\",\n  \"question\": \"What is the color of the glasses frame of the man in the white shirt?\"\n}\n```\n\n### 4.2 Reward Model Configuration\n\n* Reward functions are implemented in `examples\u002Ftrain\u002Fgrpo\u002Fplugin\u002Fagent_rm.py`, supporting `fmt_orm`, `vqa_orm`, and `cst_orm` by default.\n* For multi-node training, configure `REWARD_API_ADDRESS` and `QWEN_API_PORT` in `.deepspeed_env`.\n* Single-node training can use the default environment variables.\n\nExample asynchronous function calling the reward model API:\n\n```python\nasync def llm_openai_api(messages, ip=\"0.0.0.0\", host=\"8080\", temperature=0.1, max_tokens=256, top_p=None, n=1):\n    openai_api_base = f\"http:\u002F\u002F{ip}:{host}\u002Fv1\"\n    async with httpx.AsyncClient(timeout=httpx.Timeout(600.0)) as client:\n        model = \"\u002Fmllm_hdd\u002Fyfzhang\u002Fmodels\u002FQwen2.5-VL-72B-Instruct-AWQ\"\n        resp = await client.post(\n            f\"{openai_api_base}\u002Fchat\u002Fcompletions\",\n            headers={\"Content-Type\": \"application\u002Fjson\"},\n            json={\n                \"model\": model,\n                \"messages\": messages,\n                \"temperature\": temperature,\n                \"max_tokens\": max_tokens,\n                \"top_p\": top_p,\n                \"n\": n,\n            },\n        )\n        resp.raise_for_status()\n        response_data = resp.json()\n        return [choice[\"message\"][\"content\"] for choice in response_data[\"choices\"]]\n```\n\n### 4.3 Launch RL Training\n\nRun the script:\n\n```bash\nsh scripts\u002Frl.sh\n```\n\nParameters explained:\n\n* `--O3 true` activates Thyme training configuration with multi-turn dialogue and sandboxed iterative output processing.\n* `--stop_words ...` defines output termination tokens; enabling GRPO-ATS with sampling temperature set to 0 when `\u003Ccode>` is a stop word.\n\n\n## 5. 
Evaluation\n\nWe recommend using [VLMEvalKit](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit) for automatic evaluation on benchmarks.\n\n\n### 5.1  Install VLMEvalKit\n```shell\ncd eval\u002FVLMEvalKit\npip install -e .\n```\n\n### 5.2 Evaluate on Multiple Benchmarks\n    \nFor example, to evaluate on `V*` and `HRBench` with 8 GPUs, simply run (under `eval\u002FVLMEvalKit` directory):\n```shell\ntorchrun --nproc-per-node=8 run.py --data VStarBench HRBench4K HRBench8K --model Thyme-7B --verbose\n```\n  \n\n## 6. Usage Example: How to use Thyme\n\nWe provide a simple script (`eval\u002Finference.py`) to inference on custom cases. Simply run:\n```shell\npython -m eval.inference\n```\nYou may modify the user question and image path for your own cases:\n```python\n# --- 3. Prepare Input Data (Example) ---\n# Example question and responses (replace with your actual data)\n# Example case (sampled from MME-RealWorld-Lite bench)\nquestion_text = (\"Question: What is the plate number of the blue car in the picture?\\nOptions:\\n\"\n                 \"A. S OT 911\\n\"\n                 \"B. S TQ 119\\n\"\n                 \"C. S QT 911\\n\"\n                 \"D. B QT 119\\n\"\n                 \"E. This image doesn't feature the plate number.\\n\"\n                 \"Please select the correct answer from the options above.\")\nimage_path = \"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyfzhang114_Thyme_readme_22adc671318e.jpg\"\n```\n\u003Cdetails>\n\u003Csummary>Visualization of the Provided Example\u003C\u002Fsummary>\n\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyfzhang114_Thyme_readme_22adc671318e.jpg\" alt=\"Sample picture\" width=\"100%\">\n\n### ❓ User Prompt\n```\nQuestion: What is the plate number of the blue car in the picture?\nOptions:\nA. S OT 911\nB. S TQ 119\nC. S QT 911\nD. B QT 119\nE. This image doesn't feature the plate number.\nPlease select the correct answer from the options above.\n```\n\n---\n\n### 🤖 Model Response\n> \\\u003Cthink>To determine the plate number of the blue car in the image, we need to focus on the license plate located near the bottom front of the vehicle. The plate is relatively small compared to the overall size of the car and the image, so zooming in on this specific area will help us clearly read the characters on the plate.\u003Cbr>\nThe coordinates for the license plate appear to be around the lower front section of the car, specifically near the center of the lower part of the image. 
Cropping and zooming in on this region will enhance the visibility of the text on the plate, making it easier to identify the correct option.\u003Cbr>\nHere is the Python code to crop and zoom in on the license plate area:\u003Cbr>\n\n> \\\u003Ccode>\n```python\nimport cv2\nimport os\nimport numpy as np\nfrom uuid import uuid4\n\n# Load the image\nimage_path = \"\u002Fhttps:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyfzhang114_Thyme_readme_22adc671318e.jpg\"\nimage = cv2.imread(image_path)\n\n# Define the coordinates for the license plate area (inferred from the image content)\nx1, y1, x2, y2 = 1250, 1030, 1360, 1070\n\n# Crop the image based on the defined coordinates\ncropped_image = image[y1:y2, x1:x2]\n\n# Apply a zoom factor to enhance the cropped region\nzoom_factor = 2\nresized_image = cv2.resize(cropped_image, None, fx=zoom_factor, fy=zoom_factor, interpolation=cv2.INTER_CUBIC)\n\n# Save the processed image with a random suffix\nrandom_suffix = str(uuid4())[:8]\nprocessed_filename = f\"17127_{random_suffix}.jpg\"\nprocessed_path = os.path.join(\"\u002Fmnt\u002Fdata\u002Ftemp_processed_images\u002F\", processed_filename)\ncv2.imwrite(processed_path, resized_image)\n\n# Print the path of the saved cropped image for further processing\nprint(processed_path)\n```\n> \\\u003C\u002Fcode>\n\n> \\\u003Csandbox_output>\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyfzhang114_Thyme_readme_3d5a386319e2.jpg\" alt=\"Sample processed picture\" width=\"30%\">\\\u003C\u002Fsandbox_output>\n\n>Upon examining the cropped and zoomed-in image of the license plate, it becomes clear that the characters are \"S QT 911\". This matches option C. Therefore, the correct answer is C. S QT 911.\\\u003C\u002Fthink>\u003Cbr>\n\n\\\u003Canswer> **C. 
S QT 911** \\\u003C\u002Fanswer>\n\n\u003C\u002Fdetails>\n\n## Citation\n\nIf you find Thyme useful in your research or applications, please cite our paper:\n\n```bibtex\n@article{zhang2025thyme,\n  title={Thyme: Think Beyond Images},\n  author={Zhang, Yi-Fan and Lu, Xingyu and Yin, Shukang and Fu, Chaoyou and Chen, Wei and Hu, Xiao and Wen, Bin and Jiang, Kaiyu and Liu, Changyi and Zhang, Tianke and others},\n  journal={arXiv preprint arXiv:2508.11630},\n  year={2025}\n}\n```\n\n## Related Projects\nExplore other related work from our team:\n\n- [Kwai Keye-VL](https:\u002F\u002Fgithub.com\u002FKwai-Keye\u002FKeye)\n- [R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002Fr1_reward)\n- [MM-RLHF: The Next Step Forward in Multimodal LLM Alignment](https:\u002F\u002Fmm-rlhf.github.io\u002F)\n- [MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FMME-RealWorld)\n- [MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15296)\n- [Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FSliME)\n- [VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA)\n","\u003Cp align=\"center\" width=\"40%\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyfzhang114_Thyme_readme_8148a5627742.png\"  width=\"50%\" height=\"50%\">\n\u003C\u002Fp>\n\n\u003Cfont size=7>\u003Cdiv align='center' >  \n[[📖 首页](https:\u002F\u002Fthyme-vl.github.io\u002F)] \n[[📖 技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.11630)] \n\n[[📊 Thyme SFT 模型](https:\u002F\u002Fhuggingface.co\u002FKwai-Keye\u002FThyme-SFT)] \n[[📊 Thyme RL 模型](https:\u002F\u002Fhuggingface.co\u002FKwai-Keye\u002FThyme-RL)] \n[[📝 SFT 数据](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FKwai-Keye\u002FThyme-SFT)] \n[[📝 RL 数据](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FKwai-Keye\u002FThyme-RL)] \n\u003C\u002Fdiv>\u003C\u002Ffont>\n\n## 🔥 最新消息\n* **`2025.09.04`** 🌟 Thyme 已被 [VLMEvalKit](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit) 和 [LMMs-Eval](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval) 支持。欢迎随时使用！\n* **`2025.08.18`** 🌟 我们非常高兴地推出 **Thyme: 超越图像的思考**。Thyme 不再局限于传统的“以图像为媒介的思考”模式，而是通过可执行代码自主生成并执行多样化的图像处理与计算操作，从而显著提升在高分辨率感知和复杂推理任务上的表现。借助结合监督微调与强化学习的创新两阶段训练策略，并由新颖的 GRPO-ATS 算法赋能，Thyme 在推理探索与代码执行精度之间实现了精妙的平衡。\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyfzhang114_Thyme_readme_3a2ee1a20d2c.png\"  width=\"100%\" height=\"100%\">\n\u003C\u002Fp>\n\n\n---\n# 目录\n\n1. [快速入门](#1-quick-start)\n2. [数据准备](#2-data-preparation)\n3. [监督微调 (Thyme-SFT)](#3-supervised-fine-tuning-thyme-sft)\n4. [强化学习 (Thyme-RL)](#4-reinforcement-learning-thyme-rl)\n5. [评估](#5-evaluation)\n6. [使用示例：如何使用 Thyme](#6-usage-example-how-to-use-thyme)\n7. [引用](#citation)\n8. [相关项目](#related-projects)\n\n---\n\n## 1. 
快速入门\n\n### 1.1 克隆仓库\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FThyme.git\ncd Thyme\n```\n\n### 1.2 环境设置与依赖安装\n\n我们建议创建一个 Conda 环境以实现隔离，并按以下步骤安装依赖：\n\n```bash\nconda create -n Thyme python=3.10 -y\nconda activate Thyme\n\npip install -e .\npip install \"sglang[all]\" -U\npip install \"vllm>=0.5.1\" \"transformers\u003C4.55\" \"trl\u003C0.21\" -U\npip install \"lmdeploy>=0.5,\u003C0.9\" -U --no-deps\npip install autoawq -U --no-deps\npip install auto_gptq optimum bitsandbytes \"gradio\u003C5.33\" -U\npip install git+https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fms-swift.git\npip install timm -U\npip install \"deepspeed\u003C0.17\" -U\npip install qwen_vl_utils qwen_omni_utils decord librosa icecream soundfile -U\npip install liger_kernel nvitop pre-commit math_verify py-spy -U\npip install wandb\n\npip install flash-attn --no-build-isolation --use-pep517\n\n```\n\n---\n\n## 2. 数据准备\n\n### 2.1 下载数据集\n\n从 [HuggingFace 数据集页面](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FKwai-Keye\u002FThyme-SFT) 获取训练数据。SFT 数据集包含三个子集：\n\n* `wo_thinking_thyme_single_round`: 单轮图像操作数据\n* `2round`: 多轮对话数据\n* `computation`: 用于计算任务的退火数据\n\n每个样本的 `image` 字段是一个列表，包含原始图像和处理后的图像。\n\n### 2.2 处理图像并更新路径\n\n在训练之前，请确保所有引用的图像均已下载并保存到本地。通过将图像 URL 或远程路径替换为本地绝对路径来更新数据文件（例如 `.jsonl`），例如：\n\n```json\n\"image\": [\n  \"\u002Fpath\u002Fto\u002Foriginal_images\u002F0904.0709_0.jpg\",\n  \"\u002Fpath\u002Fto\u002Fprocessed_images\u002F0904.0709_0_6349.jpg\"\n]\n```\n\n```\n#!\u002Fusr\u002Fbin\u002Fenv python3\nimport os, json, base64\nfrom pathlib import Path\nfrom concurrent.futures import ThreadPoolExecutor, as_completed\nfrom datasets import load_dataset\nfrom tqdm import tqdm\nimport io\nfrom PIL import Image\n\nHF_DATA_DIR = \".\u002Fdata\u002FThyme-SFT\"\nROOT_OUT    = Path(\"Thyme_sft_data\")\nIMG_ROOT    = '.\u002Fdata\u002FThyme_sft_data\u002Fimg'\nJSONL_ROOT  = ROOT_OUT \u002F \"jsonl\"\nSPLITS      = [\"wo_thinking_thyme_single_round\", \"2round\", \"computation\"] #,  \nMAX_WORKERS = os.cpu_count()      # 可根据机器配置调整\n\n# IMG_ROOT.mkdir(parents=True, exist_ok=True)\nJSONL_ROOT.mkdir(parents=True, exist_ok=True)\n\n# ----------- 线程池任务 -----------\ndef save_one_image(args):\n    \"\"\"\n    解码 Base64 字符串，处理透明度并以 JPEG 格式保存图像。\n\n    Args:\n        args (tuple): 包含 (b64_str, save_path) 的元组。\n    \"\"\"\n    b64_str, save_path = args\n    if os.path.exists(save_path):\n        return save_path\n\n    try:\n        # 1. 解码 Base64 以获取原始二进制数据\n        image_bytes = base64.b64decode(b64_str)\n\n        # 2. 使用 Pillow 从二进制数据打开图像\n        with Image.open(io.BytesIO(image_bytes)) as img:\n            # 3. 处理透明度（关键步骤）\n            # 检查图像模式是否需要处理透明度。\n            # 'P' 模式可能包含透明度，'LA' 是灰度+透明度。\n            # 'RGBA' 是最常见的带有透明度的模式。\n            if img.mode in (\"RGBA\", \"LA\", \"P\"):\n                # 为了统一处理所有透明度情况，首先将图像转换为 RGBA 模式。\n                # 如果图像是带有透明度的 'P' 模式，转换后会得到正确的 RGBA 图像。\n                img = img.convert(\"RGBA\")\n\n                # 创建白色背景的基础图像\n                background = Image.new(\"RGB\", img.size, (255, 255, 255))\n\n                # 将原图粘贴到背景上。\n                # 此时 img 已经是 RGBA 模式，因此可以作为自己的遮罩。\n                # Pillow 会自动使用其 Alpha 通道。\n                background.paste(img, (0, 0), img)\n                img = background # 现在 img 是合并后的 RGB 图像\n\n            # 如果图像模式不是 RGB（例如 'L', 'CMYK' 等），则将其转换为 RGB\n            elif img.mode != \"RGB\":\n                img = img.convert(\"RGB\")\n\n            # 4. 
以 JPEG 格式保存图像\n            # JPEG 不支持透明度，因此需要填充背景。\n            img.save(save_path, \"jpeg\", quality=95) # 建议添加质量参数\n\n        return str(save_path)\n\n    except Exception as e:\n        # 添加异常处理以便调试是哪张图像导致了问题\n        print(f\"处理 {save_path} 的图像时出错：{e}\")\n        return None\n\n# ----------- 主处理 -----------\nfor split in SPLITS:\n    print(f\"\\n>>> 处理分片 : {split}  (max_workers={MAX_WORKERS})\")\n    # 3. 写入 jsonl 文件，检查是否已存在\n    jsonl_path = JSONL_ROOT \u002F f\"{split}.jsonl\"\n    if not jsonl_path.exists():  # 只有在 jsonl 文件不存在时才写入\n        print(f\"  JSONL  -> {jsonl_path}\")\n    else:\n        print(f\"  JSONL 已存在: {jsonl_path}\")\n\n    ds = load_dataset(HF_DATA_DIR, split=split)\n\n    img_dir = IMG_ROOT + '\u002F' + split\n    # img_dir.mkdir(exist_ok=True)\n\n    # 1. 首先收集所有需要保存的任务\n    tasks = []               # (b64_str, save_path)\n    records = []             # 用于写入 jsonl\n    for sample_idx, sample in enumerate(ds):\n        img_paths = []\n        for img_idx, b64_img in enumerate(sample[\"image\"], start=1):\n            img_name = f\"{sample_idx+1:08d}_{img_idx:02d}.jpg\"\n            img_path = img_dir + '\u002F' + img_name\n            tasks.append((b64_img, img_path))\n            img_paths.append(str(img_path))\n        records.append({\n            \"image\": img_paths,\n            \"question\": sample[\"question\"],\n            \"response\": sample[\"response\"]\n        })\n\n    # 2. 使用多线程执行\n    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:\n        # 只保存尚未存在的图片\n        saved_images = list(tqdm(pool.map(save_one_image, tasks),\n                                 total=len(tasks), desc=\"保存图片\"))\n    \n    # 过滤掉返回 None 的项（即已经存在的文件）\n    saved_images = [img for img in saved_images if img is not None]\n\n    with open(jsonl_path, \"w\", encoding=\"utf-8\") as f:\n        for rec in records:\n            f.write(json.dumps(rec, ensure_ascii=False) + \"\\n\")\n    \n    print(f\"  图片 -> {img_dir}  ({len(saved_images)} 张)\")\n\nprint(\"\\n全部完成（多线程）！\")\n\n```\n\n### 2.3 文件路径转换以实现系统集成\n\n在每个问题中，都包含一个需要转换为适合我们平台使用的正确系统路径的指定文件路径。以下步骤概述了处理这些路径的过程。\n\n#### 转换过程：\n\n1. **原始路径格式**：\n\n   * 示例：`\"用户图像路径: \\\"\u002Fmllm_hdd\u002Fyfzhang\u002Fdata\u002Ftemp_processed_images\u002Fcauldron_dvqa_images_dvqa_00110792.png_rotated_image_318.png\\\"\"`\n\n2. **转换**：\n\n   * 从原始路径中提取文件名。\n   * 将其转换为系统中 `image` 数组的第一个元素。\n   * 该元素将代表系统中的正确文件路径。\n\n3. **响应路径转换**：\n\n   * 同样地，确保在 `response` 中提供的任何相应绝对路径都被转换为符合系统格式的形式。\n\n---\n\n## 3. 监督微调（Thyme-SFT）\n\n### 3.1 训练数据格式\n\n训练样本遵循以下 JSON 格式示例（完整数据集包含类似结构）：\n\n```json\n{\n  \"image\": [\"\u002Fpath\u002Fto\u002Foriginal.jpg\", \"\u002Fpath\u002Fto\u002Fprocessed.jpg\"],\n  \"question\": \"\u003Cimage>\\n根据右上角的图表，描述 P(z) 在 z 趋近于零时的行为。选项：\\n...\",\n  \"response\": \"\u003Cthink>详细推理和可执行代码...\u003C\u002Fthink>\u003Canswer>B\u003C\u002Fanswer>\"\n}\n```\n\n### 3.2 配置训练路径\n\n在您的训练脚本或环境中设置以下变量：\n\n* `DATASET`: 您的训练数据集路径\n* `SAVE_PATH`: 保存训练模型的目录\n* `Model`: 您的模型路径\n\n### 3.3 运行训练\n\n执行训练脚本：\n\n```bash\nsh scripts\u002Fsft_stage1.sh   # 第一阶段：监督微调\nsh scripts\u002Fsft_stage2.sh   # 第二阶段：计算数据退火\n```\n\n**注意**：由于计算数据包含带图和不带图的样本，这一阶段需要分别处理并输入这两种类型的数据。请确保您的脚本能够妥善处理这一点。\n\n---\n\n## 4. 
强化学习（Thyme-RL）\n\n### 4.1 RL 数据集格式\n\n每个 RL 数据样本遵循以下结构：\n\n```json\n{\n  \"messages\": [\n    {\n      \"role\": \"system\",\n      \"content\": \"您是一位乐于助人的助手。请逐步解决问题...\"\n    },\n    {\n      \"role\": \"user\",\n      \"content\": \"\u003Cimage>\\n问题文本...\\n\"\n    }\n  ],\n  \"images\": [\"\u002Fpath\u002Fto\u002Fimage.jpg\"],\n  \"solution\": \"black\",\n  \"question\": \"穿白衬衫的男士眼镜框是什么颜色？\"\n}\n```\n\n### 4.2 奖励模型配置\n\n* 奖励函数实现于 `examples\u002Ftrain\u002Fgrpo\u002Fplugin\u002Fagent_rm.py` 中，默认支持 `fmt_orm`、`vqa_orm` 和 `cst_orm`。\n* 多节点训练需在 `.deepspeed_env` 中配置 `REWARD_API_ADDRESS` 和 `QWEN_API_PORT`。\n* 单节点训练可使用默认环境变量。\n\n调用奖励模型 API 的异步函数示例：\n\n```python\nasync def llm_openai_api(messages, ip=\"0.0.0.0\", host=\"8080\", temperature=0.1, max_tokens=256, top_p=None, n=1):\n    openai_api_base = f\"http:\u002F\u002F{ip}:{host}\u002Fv1\"\n    async with httpx.AsyncClient(timeout=httpx.Timeout(600.0)) as client:\n        model = \"\u002Fmllm_hdd\u002Fyfzhang\u002Fmodels\u002FQwen2.5-VL-72B-Instruct-AWQ\"\n        resp = await client.post(\n            f\"{openai_api_base}\u002Fchat\u002Fcompletions\",\n            headers={\"Content-Type\": \"application\u002Fjson\"},\n            json={\n                \"model\": model,\n                \"messages\": messages,\n                \"temperature\": temperature,\n                \"max_tokens\": max_tokens,\n                \"top_p\": top_p,\n                \"n\": n,\n            },\n        )\n        resp.raise_for_status()\n        response_data = resp.json()\n        return [choice[\"message\"][\"content\"] for choice in response_data[\"choices\"]]\n```\n\n### 4.3 启动 RL 训练\n\n运行脚本：\n\n```bash\nsh scripts\u002Frl.sh\n```\n\n参数说明：\n\n* `--O3 true` 启用 Thyme 训练配置，包含多轮对话与沙箱迭代输出处理。\n* `--stop_words ...` 定义输出终止标记；当 `\u003Ccode>` 作为停止词时启用 GRPO-ATS，并将采样温度设为 0。\n\n\n## 5. 评估\n\n我们推荐使用 [VLMEvalKit](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit) 在各基准上进行自动评估。\n\n### 5.1 安装 VLMEvalKit\n```shell\ncd eval\u002FVLMEvalKit\npip install -e .\n```\n\n### 5.2 在多个基准上评估\n\n例如，要在 `V*` 和 `HRBench` 上使用 8 张 GPU 进行评估，只需在 `eval\u002FVLMEvalKit` 目录下运行：\n```shell\ntorchrun --nproc-per-node=8 run.py --data VStarBench HRBench4K HRBench8K --model Thyme-7B --verbose\n```\n\n## 6. 使用示例：如何使用 Thyme\n\n我们提供了一个简单脚本（`eval\u002Finference.py`）用于对自定义案例进行推理。只需运行：\n```shell\npython -m eval.inference\n```\n您可以修改用户问题和图像路径以适配自己的案例：\n```python\n# --- 3. 准备输入数据（示例） ---\n# 示例问题与回答（请替换为您的实际数据）\n# 示例案例（采样自 MME-RealWorld-Lite 基准）\nquestion_text = (\"问题：图片中蓝色汽车的车牌号是多少？\\n选项：\\n\"\n                 \"A. S OT 911\\n\"\n                 \"B. S TQ 119\\n\"\n                 \"C. S QT 911\\n\"\n                 \"D. B QT 119\\n\"\n                 \"E. 该图片未显示车牌号。\\n\"\n                 \"请从以上选项中选择正确答案。\")\nimage_path = \"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyfzhang114_Thyme_readme_22adc671318e.jpg\"\n```\n\u003Cdetails>\n\u003Csummary>所提供示例的可视化\u003C\u002Fsummary>\n\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyfzhang114_Thyme_readme_22adc671318e.jpg\" alt=\"示例图片\" width=\"100%\">\n\n### ❓ 用户提示\n```\n问题：图片中蓝色汽车的车牌号是多少？\n选项：\nA. S OT 911\nB. S TQ 119\nC. S QT 911\nD. B QT 119\nE. 该图片未显示车牌号。\n请从以上选项中选择正确答案。\n```\n\n---\n\n### 🤖 模型响应\n> \\\u003Cthink>为了确定图片中蓝色汽车的车牌号，我们需要重点关注位于车辆前部下方的车牌。与整辆车及图片的整体尺寸相比，车牌相对较小，因此对该区域进行放大有助于清晰地读取车牌上的字符。\u003Cbr>\n车牌的大致位置在车的前部下方，具体来说是图像下半部分的中央附近。通过裁剪并放大这一区域，可以提高车牌文字的可见度，从而更容易识别出正确选项。\u003Cbr>\n以下是用于裁剪并放大车牌区域的 Python 代码：\u003Cbr>\n\n> \\\u003Ccode>\n```python\nimport cv2\nimport os\nimport numpy as np\nfrom uuid import uuid4\n\n# 加载图片\nimage_path = \"\u002Fhttps:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyfzhang114_Thyme_readme_22adc671318e.jpg\"\nimage = cv2.imread(image_path)\n\n# 定义车牌区域的坐标（根据图片内容推断）\nx1, y1, x2, y2 = 1250, 1030, 1360, 1070\n\n# 根据定义的坐标裁剪图片\ncropped_image = image[y1:y2, x1:x2]\n\n# 应用缩放因子以增强裁剪后的区域\nzoom_factor = 2\nresized_image = cv2.resize(cropped_image, None, fx=zoom_factor, fy=zoom_factor, interpolation=cv2.INTER_CUBIC)\n\n# 使用随机后缀保存处理后的图片\nrandom_suffix = str(uuid4())[:8]\nprocessed_filename = f\"17127_{random_suffix}.jpg\"\nprocessed_path = os.path.join(\"\u002Fmnt\u002Fdata\u002Ftemp_processed_images\u002F\", processed_filename)\ncv2.imwrite(processed_path, resized_image)\n\n# 打印保存的裁剪图片路径，以便进一步处理\nprint(processed_path)\n```\n> \\\u003C\u002Fcode>\n\n> \\\u003Csandbox_output>\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyfzhang114_Thyme_readme_3d5a386319e2.jpg\" alt=\"示例处理图片\" width=\"30%\">\\\u003C\u002Fsandbox_output>\n\n>通过观察裁剪并放大的车牌图片，可以清楚地看到车牌上的字符为“S QT 911”。这与选项 C 完全一致。因此，正确答案是 C. S QT 911。\\\u003C\u002Fthink>\u003Cbr>\n\n\\\u003Canswer> **C. 
S QT 911** \\\u003C\u002Fanswer>\n\n\u003C\u002Fdetails>\n\n## 引用\n\n如果您在研究或应用中发现 Thyme 有用，请引用我们的论文：\n\n```bibtex\n@article{zhang2025thyme,\n  title={Thyme: Think Beyond Images},\n  author={Zhang, Yi-Fan and Lu, Xingyu and Yin, Shukang and Fu, Chaoyou and Chen, Wei and Hu, Xiao and Wen, Bin and Jiang, Kaiyu and Liu, Changyi and Zhang, Tianke and others},\n  journal={arXiv preprint arXiv:2508.11630},\n  year={2025}\n}\n```\n\n## 相关项目\n探索我们团队的其他相关工作：\n\n- [Kwai Keye-VL](https:\u002F\u002Fgithub.com\u002FKwai-Keye\u002FKeye)\n- [R1-Reward: 通过稳定强化学习训练多模态奖励模型](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002Fr1_reward)\n- [MM-RLHF: 多模态大模型对齐的下一步发展](https:\u002F\u002Fmm-rlhf.github.io\u002F)\n- [MME-RealWorld: 您的多模态大模型能否应对连人类都难以处理的高分辨率真实场景？](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FMME-RealWorld)\n- [MME-Survey: 多模态大模型评估的全面综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15296)\n- [超越 LLaVA-HD：深入研究高分辨率大型多模态模型](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FSliME)\n- [VITA-1.5: 向 GPT-4o 级别的实时视觉与语音交互迈进](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA)","# Thyme 快速上手指南\n\nThyme 是一款超越传统“图像思维”范式的多模态大模型。它能够通过生成和执行可执行代码，自主完成多样化的图像处理与计算操作，显著提升高分辨率感知和复杂推理任务的性能。\n\n## 1. 环境准备\n\n### 系统要求\n*   **操作系统**: Linux (推荐 Ubuntu 20.04+)\n*   **Python**: 3.10\n*   **GPU**: 支持 CUDA 的 NVIDIA 显卡（建议显存 24GB+ 以运行大模型）\n*   **编译器**: GCC, G++\n\n### 前置依赖\n确保已安装 `conda` 和 `git`。\n\n## 2. 安装步骤\n\n### 2.1 克隆仓库\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FThyme.git\ncd Thyme\n```\n\n### 2.2 创建并激活 Conda 环境\n```bash\nconda create -n Thyme python=3.10 -y\nconda activate Thyme\n```\n\n### 2.3 安装依赖库\n依次执行以下命令安装核心依赖。为确保兼容性，请严格按照顺序安装：\n\n```bash\n# 安装基础包\npip install -e .\n\n# 安装推理与训练框架\npip install \"sglang[all]\" -U\npip install \"vllm>=0.5.1\" \"transformers\u003C4.55\" \"trl\u003C0.21\" -U\npip install \"lmdeploy>=0.5,\u003C0.9\" -U --no-deps\npip install autoawq -U --no-deps\n\n# 安装量化工具与界面库\npip install auto_gptq optimum bitsandbytes \"gradio\u003C5.33\" -U\n\n# 安装 ModelScope Swift 及相关工具\npip install git+https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fms-swift.git\npip install timm -U\npip install \"deepspeed\u003C0.17\" -U\n\n# 安装多媒体处理与辅助库\npip install qwen_vl_utils qwen_omni_utils decord librosa icecream soundfile -U\npip install liger_kernel nvitop pre-commit math_verify py-spy -U\npip install wandb\n\n# 安装 Flash Attention (需确保系统已安装对应版本的 CUDA Toolkit)\npip install flash-attn --no-build-isolation --use-pep517\n```\n\n> **提示**：国内用户若下载缓慢，可在 `pip` 命令后添加 `-i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple` 使用清华镜像源。\n\n## 3. 
数据准备\n\n### 3.1 下载数据集\n从 HuggingFace 下载 SFT 训练数据（包含单轮、多轮及计算任务三个分集）：\n*   数据集地址：[Kwai-Keye\u002FThyme-SFT](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FKwai-Keye\u002FThyme-SFT)\n\n### 3.2 图像处理与路径本地化\n训练前需将数据集中的 Base64 图片解码并保存至本地，同时更新 JSONL 文件中的图片路径为绝对路径。\n\n可以使用项目提供的脚本自动处理（需先修改脚本中的 `HF_DATA_DIR` 和输出路径）：\n\n```python\n#!\u002Fusr\u002Fbin\u002Fenv python3\nimport os, json, base64\nfrom pathlib import Path\nfrom concurrent.futures import ThreadPoolExecutor, as_completed\nfrom datasets import load_dataset\nfrom tqdm import tqdm\nimport io\nfrom PIL import Image\n\n# 配置路径\nHF_DATA_DIR = \".\u002Fdata\u002FThyme-SFT\" # 替换为实际下载路径或 HF 标识符\nROOT_OUT    = Path(\"Thyme_sft_data\")\nIMG_ROOT    = ROOT_OUT \u002F \"img\"\nJSONL_ROOT  = ROOT_OUT \u002F \"jsonl\"\nSPLITS      = [\"wo_thinking_thyme_single_round\", \"2round\", \"computation\"]\nMAX_WORKERS = os.cpu_count()\n\nIMG_ROOT.mkdir(parents=True, exist_ok=True)\nJSONL_ROOT.mkdir(parents=True, exist_ok=True)\n\ndef save_one_image(args):\n    b64_str, save_path = args\n    if os.path.exists(save_path):\n        return save_path\n    try:\n        image_bytes = base64.b64decode(b64_str)\n        with Image.open(io.BytesIO(image_bytes)) as img:\n            if img.mode in (\"RGBA\", \"LA\", \"P\"):\n                img = img.convert(\"RGBA\")\n                background = Image.new(\"RGB\", img.size, (255, 255, 255))\n                background.paste(img, (0, 0), img)\n                img = background\n            elif img.mode != \"RGB\":\n                img = img.convert(\"RGB\")\n            img.save(save_path, \"jpeg\", quality=95)\n        return str(save_path)\n    except Exception as e:\n        print(f\"Error processing image for {save_path}: {e}\")\n        return None\n\nfor split in SPLITS:\n    print(f\"\\n>>> Processing split : {split}\")\n    jsonl_path = JSONL_ROOT \u002F f\"{split}.jsonl\"\n    \n    # 加载数据集\n    ds = load_dataset(HF_DATA_DIR, split=split)\n    img_dir = IMG_ROOT \u002F split\n    img_dir.mkdir(exist_ok=True)\n\n    tasks = []\n    records = []\n    \n    for sample_idx, sample in enumerate(ds):\n        img_paths = []\n        for img_idx, b64_img in enumerate(sample[\"image\"], start=1):\n            img_name = f\"{sample_idx+1:08d}_{img_idx:02d}.jpg\"\n            img_path = img_dir \u002F img_name\n            tasks.append((b64_img, str(img_path)))\n            img_paths.append(str(img_path))\n        records.append({\n            \"image\": img_paths,\n            \"question\": sample[\"question\"],\n            \"response\": sample[\"response\"]\n        })\n\n    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:\n        list(tqdm(pool.map(save_one_image, tasks), total=len(tasks), desc=\"Saving images\"))\n\n    with open(jsonl_path, \"w\", encoding=\"utf-8\") as f:\n        for rec in records:\n            f.write(json.dumps(rec, ensure_ascii=False) + \"\\n\")\n    \n    print(f\"  Done: {img_dir}\")\n\nprint(\"\\nAll done!\")\n```\n\n### 3.3 路径格式修正\n确保数据集中的问题（question）和回答（response）中涉及的文件路径已转换为上述生成的本地绝对路径。\n\n## 4. 
基本使用\n\n### 4.1 监督微调 (SFT)\nThyme 的 SFT 分为两个阶段：基础指令微调与计算数据退火。\n\n**配置文件变量：**\n*   `DATASET`: 处理后数据的 JSONL 路径\n*   `SAVE_PATH`: 模型保存路径\n*   `Model`: 基座模型路径\n\n**启动训练：**\n```bash\n# 阶段 1: 监督微调\nsh scripts\u002Fsft_stage1.sh\n\n# 阶段 2: 计算数据退火 (注意：此阶段需分别处理含图和不含图的数据)\nsh scripts\u002Fsft_stage2.sh\n```\n\n### 4.2 强化学习 (RL)\n基于 GRPO-ATS 算法进行强化学习训练。\n\n**奖励模型配置：**\n*   默认支持 `fmt_orm`, `vqa_orm`, `cst_orm`。\n*   多节点训练需在 `.deepspeed_env` 中配置 `REWARD_API_ADDRESS` 和 `QWEN_API_PORT`。\n\n**启动训练：**\n参考 `scripts\u002Frl_train.sh` (需根据实际环境调整 Reward Model API 地址)。\n\n### 4.3 推理示例\n训练完成后，可使用 `vllm` 或 `sglang` 加载模型进行推理。数据输入格式如下：\n\n```json\n{\n  \"image\": [\"\u002Fpath\u002Fto\u002Foriginal.jpg\", \"\u002Fpath\u002Fto\u002Fprocessed.jpg\"],\n  \"question\": \"\u003Cimage>\\nBased on the top-right graph, describe the behavior of P(z)...\",\n  \"response\": \"\u003Cthink>Reasoning and code...\u003C\u002Fthink>\u003Canswer>B\u003C\u002Fanswer>\"\n}\n```\n\n模型将自动识别 `\u003Cthink>` 标签内的代码并执行，最终输出 `\u003Canswer>` 标签内的结果。","某医疗影像分析团队正在处理一批高分辨率病理切片，需要从中精准计数细胞数量并计算特定区域的异常密度，以辅助医生进行癌症早期筛查。\n\n### 没有 Thyme 时\n- **分辨率受限导致漏检**：传统多模态模型受限于上下文窗口，必须压缩输入图片，导致微小细胞特征丢失，计数准确率大幅下降。\n- **缺乏精确计算能力**：模型仅能给出模糊的定性描述（如“很多”或“少量”），无法执行具体的数学运算得出精确密度数值。\n- **流程割裂效率低下**：开发人员需先让模型识别区域，再人工编写脚本裁剪图像并调用外部库计算，环节繁琐且容易出错。\n- **复杂推理易幻觉**：在面对重叠细胞或复杂背景时，模型常因无法“深思”而产生幻觉，编造不存在的检测结果。\n\n### 使用 Thyme 后\n- **代码驱动的高清感知**：Thyme 自主生成可执行代码，动态对高分辨率切片进行分块处理和放大操作，确保每一个微小细胞都被精准捕捉。\n- **自动执行精确计算**：模型直接编写并运行 Python 脚本完成细胞计数与密度公式运算，输出结果精确到小数点后两位，消除估算误差。\n- **端到端自主闭环**：从图像预处理、特征提取到最终数值计算，Thyme 在一个对话回合内自动完成所有步骤，无需人工干预中间流程。\n- **深度推理抑制幻觉**：借助 GRPO-ATS 算法强化的推理探索能力，Thyme 能在执行代码前规划最佳策略，有效区分重叠细胞与背景噪声。\n\nThyme 通过“超越图像”的代码执行能力，将多模态大模型从单纯的“看图说话”升级为能独立解决复杂视觉计算任务的专业智能体。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyfzhang114_Thyme_8148a562.png","yfzhang114","Yi-Fan Zhang","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fyfzhang114_7c5f6c92.jpg","The final year Ph.D. student @CASIA, NLPR, advised by Prof. Tieniu Tan. Formerly @microsoft Research Asia and @Alibaba DAMO Academy.","State Key Laboratory of Pattern Recognition","Haidian Beijing",null,"https:\u002F\u002Fyfzhang114.github.io\u002F","https:\u002F\u002Fgithub.com\u002Fyfzhang114",[82,86,90,94,98,101,104],{"name":83,"color":84,"percentage":85},"Python","#3572A5",99.4,{"name":87,"color":88,"percentage":89},"Jupyter Notebook","#DA5B0B",0.3,{"name":91,"color":92,"percentage":93},"Shell","#89e051",0.2,{"name":95,"color":96,"percentage":97},"CSS","#663399",0,{"name":99,"color":100,"percentage":97},"Makefile","#427819",{"name":102,"color":103,"percentage":97},"HTML","#e34c26",{"name":105,"color":106,"percentage":97},"JavaScript","#f1e05a",589,36,"2026-04-03T09:48:20","Apache-2.0",4,"Linux","必需 NVIDIA GPU，需支持 flash-attn 和 deepspeed，具体显存未说明（建议大显存以运行 72B 模型及 RL 训练），CUDA 版本未明确指定但需兼容 torch 和 flash-attn","未说明",{"notes":116,"python":117,"dependencies":118},"1. 强烈建议使用 Conda 创建隔离环境。2. 安装 flash-attn 时需添加 --no-build-isolation --use-pep517 参数。3. 部分库（如 lmdeploy, autoawq）安装时需使用 --no-deps 以避免冲突。4. 训练前需自行下载数据集并将图片 Base64 解码保存为本地路径，同时更新 JSONL 文件中的路径指向。5. 
强化学习阶段可能需要配置多节点环境变量或奖励模型 API 地址。","3.10",[119,120,121,122,123,124,125,126,127,128],"sglang[all]","vllm>=0.5.1","transformers\u003C4.55","trl\u003C0.21","lmdeploy>=0.5,\u003C0.9","deepspeed\u003C0.17","flash-attn","bitsandbytes","gradio\u003C5.33","ms-swift",[35,15,130],"其他","2026-03-27T02:49:30.150509","2026-04-07T00:51:48.558504",[134,139,144,149,154,159,164,169],{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},20621,"如何获取并转换 Thyme 数据集为 JSONL 格式？","你需要从 Hugging Face 下载数据集，然后使用脚本将图像解码并保存。维护者提供了一个 Python 脚本示例，主要步骤包括：\n1. 使用 `datasets.load_dataset` 加载数据。\n2. 遍历数据，解码 Base64 编码的图像字符串。\n3. 处理图像透明度并保存为 JPEG 格式到指定目录。\n4. 将处理后的数据写入 JSONL 文件。\n关键代码片段涉及 `base64.b64decode` 解码图像和使用 `PIL.Image` 保存。请根据实际路径修改 `HF_DATA_DIR` 和 `ROOT_OUT` 变量。","https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FThyme\u002Fissues\u002F2",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},20622,"SFT 第二阶段的数据需要如何处理？是否必须将含图和纯文本数据分开？","是的，必须手动将数据集拆分为两个独立的文件：一个包含带图像的样本（如 `vqa.jsonl`），另一个包含纯文本样本（如 `textqa.jsonl`）。\n原因是 Swift 训练框架不支持在同一个输入文件中混合图像 - 文本和纯文本示例。在 SFT 阶段，应通过单独的数据参数传入，例如：`--data xxx\u002Fvqa.jsonl xxx\u002Ftextqa.jsonl`，以确保数据加载器正确处理不同的模态。","https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FThyme\u002Fissues\u002F14",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},20623,"README 中提到的“系统集成文件路径转换”具体策略是什么？是否有官方脚本？","关于路径转换的策略如下：\n1. **重写范围**：think 块、代码块以及问题中出现的所有路径都应被处理，原则是对原始图像路径和保存\u002F处理后的图像路径进行直接替换。\n2. **路径类型建议**：建议将所有路径转换为相对路径，以避免绝对路径带来的机器依赖问题。\n3. **映射策略**：即使只有一个图像，也建议使用列表结构。对于代码生成的处理路径，通常只需规范化目录前缀，或者根据样本中的 `image[1]` 替换整个处理后的路径。\n4. **格式兼容性**：转换器应能处理带有 Markdown 装饰（如 `**`）的文本标签模式（例如 `User Image Path: \"...\"`）。","https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FThyme\u002Fissues\u002F7",{"id":150,"question_zh":151,"answer_zh":152,"source_url":153},20624,"Thyme-RL 数据集与 DeepEyes 数据集有什么关系？","Thyme-RL 数据集是基于 DeepEyes 的 47k 数据集构建的。具体来说，团队对 DeepEyes 的 RL 数据进行了过滤，并额外添加了更难感知（Perception）数据以提升模型能力。","https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FThyme\u002Fissues\u002F8",{"id":155,"question_zh":156,"answer_zh":157,"source_url":158},20625,"论文实验中 POPE 数据集相关的 aAcc, qAcc, fAcc 指标具体含义是什么？","这些指标来源于 HallucinationBench（幻觉基准测试），而非标准的 POPE 论文定义。它们用于评估模型在幻觉检测任务上的不同维度表现，具体细节需参考 HallucinationBench 的相关文档或论文补充材料。","https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FThyme\u002Fissues\u002F10",{"id":160,"question_zh":161,"answer_zh":162,"source_url":163},20626,"是否支持在更大的模型（如 Qwen2.5-VL-72B-Instruct）上应用该方法？效果如何？","由于资源限制，团队尚未在更大规模的模型上进行实验。但在理论上，更大规模的模型能够更稳定地生成代码，并有望提供更好的性能表现。","https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FThyme\u002Fissues\u002F9",{"id":165,"question_zh":166,"answer_zh":167,"source_url":168},20627,"项目中的 sandbox（沙箱）代码在哪里可以找到？","Sandbox 相关的代码位于 `swift\u002Ftrainers\u002Fsandbox.py` 文件中。","https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FThyme\u002Fissues\u002F13",{"id":170,"question_zh":171,"answer_zh":172,"source_url":173},20628,"是否计划支持 lmms-eval 评测框架？","这是一个很好的建议。团队目前正在整理代码，准备向官方的 vlmevalkit 和 lmmseval 仓库提交 PR，以正式支持该评测框架。","https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FThyme\u002Fissues\u002F5",[]]