[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-zai-org--GLM-V":3,"tool-zai-org--GLM-V":64},[4,17,26,40,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,2,"2026-04-03T11:11:01",[13,14,15],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":23,"last_commit_at":32,"category_tags":33,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,34,35,36,15,37,38,13,39],"数据工具","视频","插件","其他","语言模型","音频",{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":10,"last_commit_at":46,"category_tags":47,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,38,37],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":10,"last_commit_at":54,"category_tags":55,"status":16},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 
既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74939,"2026-04-05T23:16:38",[38,14,13,37],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":23,"last_commit_at":62,"category_tags":63,"status":16},2471,"tesseract","tesseract-ocr\u002Ftesseract","Tesseract 是一款历史悠久且备受推崇的开源光学字符识别（OCR）引擎，最初由惠普实验室开发，后由 Google 维护，目前由全球社区共同贡献。它的核心功能是将图片中的文字转化为可编辑、可搜索的文本数据，有效解决了从扫描件、照片或 PDF 文档中提取文字信息的难题，是数字化归档和信息自动化的重要基础工具。\n\n在技术层面，Tesseract 展现了强大的适应能力。从版本 4 开始，它引入了基于长短期记忆网络（LSTM）的神经网络 OCR 引擎，显著提升了行识别的准确率；同时，为了兼顾旧有需求，它依然支持传统的字符模式识别引擎。Tesseract 原生支持 UTF-8 编码，开箱即用即可识别超过 100 种语言，并兼容 PNG、JPEG、TIFF 等多种常见图像格式。输出方面，它灵活支持纯文本、hOCR、PDF、TSV 等多种格式，方便后续数据处理。\n\nTesseract 主要面向开发者、研究人员以及需要构建文档处理流程的企业用户。由于它本身是一个命令行工具和库（libtesseract），不包含图形用户界面（GUI），因此最适合具备一定编程能力的技术人员集成到自动化脚本或应用程序中",73286,"2026-04-03T01:56:45",[13,14],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":79,"owner_twitter":80,"owner_website":81,"owner_url":82,"languages":83,"stars":88,"forks":89,"last_commit_at":90,"license":91,"difficulty_score":10,"env_os":92,"env_gpu":93,"env_ram":92,"env_deps":94,"category_tags":100,"github_topics":101,"view_count":106,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":107,"updated_at":108,"faqs":109,"releases":138},529,"zai-org\u002FGLM-V","GLM-V","GLM-4.6V\u002F4.5V\u002F4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning","GLM-V 是智谱 AI 开源的一系列视觉语言模型，包含 GLM-4.6V、4.5V 和 4.1V 等版本。它旨在突破传统多模态模型仅具备基础感知的局限，通过可扩展的强化学习技术，赋予模型深度推理、长上下文理解及复杂问题解决的能力。\n\n对于开发者而言，GLM-V 提供了构建智能应用的核心引擎；研究人员可借此探索多模态技术的前沿边界；而设计师或工程师则能利用其衍生的 UI2Code 代码生成、Glyph 长文本压缩等专项技能提升工作效率。项目不仅开放了完整的算法实现与预训练权重，还发布了桌面助手 Demo 及奖励系统代码，支持本地部署与二次开发。无论是通过 API 集成还是在线体验，GLM-V 都致力于降低多模态大模型的使用门槛，助力社区共同创造更智能的创新应用。","# GLM-V\n\n[中文阅读.](.\u002FREADME_zh.md)\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=resources\u002Flogo.svg width=\"40%\"\u002F>\n\u003C\u002Fdiv>\n\u003Cp align=\"center\">\n    👋 Join our \u003Ca href=\"resources\u002FWECHAT.md\" target=\"_blank\">WeChat\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Ft.co\u002Fb6zGxJvzzS\" target=\"_blank\">Discord\u003C\u002Fa> communities.\n    \u003Cbr>\n    📖 Check out the GLM-4.6V \u003Ca href=\"https:\u002F\u002Fz.ai\u002Fblog\u002Fglm-4.6v\" target=\"_blank\">blog\u003C\u002Fa> and GLM-4.5V & GLM-4.1V \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01006\" target=\"_blank\">paper\u003C\u002Fa>.\n    \u003Cbr>\n    📍 Try \u003Ca href=\"https:\u002F\u002Fchat.z.ai\u002F\" target=\"_blank\">online\u003C\u002Fa> or use the \u003Ca href=\"https:\u002F\u002Fdocs.z.ai\u002Fguides\u002Fvlm\u002Fglm-4.6v\" target=\"_blank\">API\u003C\u002Fa>.\n\u003C\u002Fp>\n\n## Introduction\n\nVision-language models (VLMs) have become a key cornerstone of intelligent systems. 
As real-world AI tasks grow\nincreasingly complex, VLMs urgently need to enhance reasoning capabilities beyond basic multimodal perception —\nimproving accuracy, comprehensiveness, and intelligence — to enable complex problem solving, long-context understanding,\nand multimodal agents.\n\nThrough our open-source work, we aim to explore the technological frontier together with the community while empowering\nmore developers to create exciting and innovative applications.\n\n**This open-source repository contains our `GLM-4.6V`, `GLM-4.5V` and `GLM-4.1V` series models.** For performance and\ndetails, see [Model Overview](#model-overview). For known issues,\nsee [Remaining Issues](#remaining-issues).\n\n## Project Updates\n\n- **News**: `2026\u002F03\u002F28`: We have released multiple GLM-V related Skills, covering several specialized areas\n  such as GLM-V-Grounding and GLM-V-Prompt-Gen. You are welcome to try them [here](skills).\n- **News**: `2025\u002F11\u002F10`: We released **UI2Code^N**, an RL-enhanced UI coding model with UI-to-code, UI-polish, and\n  UI-edit capabilities. The model is trained on `GLM-4.1V-Base`. Check it\n  out [here](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FUI2Code_N).\n- **News**: `2025\u002F10\u002F27`: We’ve released **Glyph**, a framework for scaling the context length through visual-text\n  compression; the Glyph model is trained on `GLM-4.1V-Base`. Check it\n  out [here](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGlyph).\n- **News**: `2025\u002F08\u002F11`: We released **GLM-4.5V** with significant improvements across multiple benchmarks. We also\n  open-sourced our handcrafted **desktop assistant app** for debugging. Once connected to GLM-4.5V, it can capture\n  visual information from your PC screen via screenshots or screen recordings. Feel free to try it out or customize it\n  into your own multimodal assistant. 
Click [here](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fzai-org\u002FGLM-4.5V-Demo-App) to download\n  the installer or [build from source](examples\u002Fvllm-chat-helper\u002FREADME.md)!\n- **News**: `2025\u002F07\u002F16`: We have open-sourced the **VLM Reward System** used to train GLM-4.1V-Thinking. View\n  the [code repository](glmv_reward) and run it locally: `python examples\u002Freward_system_demo.py`.\n- **News**: `2025\u002F07\u002F01`: We released **GLM-4.1V-9B-Thinking** and\n  its [technical report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01006).\n\n## Model Implementation Code\n\n- GLM-4.5V and GLM-4.6V model algorithm: see the full implementation\n  in [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Ftree\u002Fmain\u002Fsrc\u002Ftransformers\u002Fmodels\u002Fglm4v_moe).\n- GLM-4.1V-9B-Thinking model algorithm: see the full implementation\n  in [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Ftree\u002Fmain\u002Fsrc\u002Ftransformers\u002Fmodels\u002Fglm4v).\n- Both models share identical multimodal preprocessing, but use different conversation templates — please distinguish\n  carefully.\n\n## Model Downloads\n\n| Model                | Download Links                                                                                                                                       | Type             |\n|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|------------------|\n| GLM-4.6V             | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.6V)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.6V)                         | Hybrid Reasoning |\n| GLM-4.6V-FP8         | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.6V-FP8)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.6V-FP8)                 | Hybrid Reasoning |\n| GLM-4.6V-Flash       | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.6V-Flash)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.6V-Flash)             | Hybrid Reasoning |\n| GLM-4.5V             | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.5V)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.5V)                         | Hybrid Reasoning |\n| GLM-4.5V-FP8         | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.5V-FP8)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.5V-FP8)                 | Hybrid Reasoning |\n| GLM-4.1V-9B-Thinking | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.1V-9B-Thinking)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.1V-9B-Thinking) | Reasoning        |\n| GLM-4.1V-9B-Base     | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.1V-9B-Base)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.1V-9B-Base)         | Base             |\n\n+ Hugging Face provides GGUF format model weights. You can download the GGUF format models of GLM-V from [here](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fggml-org\u002Fglm-v).
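\n\nFor scripted downloads, here is a minimal sketch using the `huggingface_hub` client (our illustration, not part of this repository; any repo ID from the table above works the same way):\n\n```python\n# Sketch: fetch GLM-4.6V weights from Hugging Face into a local directory.\nfrom huggingface_hub import snapshot_download\n\nlocal_dir = snapshot_download(\n    repo_id=\"zai-org\u002FGLM-4.6V\",  # any repo ID from the table above\n    local_dir=\".\u002FGLM-4.6V\",  # omit to use the default HF cache\n)\nprint(f\"Weights downloaded to {local_dir}\")\n```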
\n\n## Use Cases\n\n### Grounding\n\nGLM-4.5V \u002F GLM-4.6V \u002F GLM-4.1V are equipped with precise grounding capabilities. Given a prompt that requests the location of a specific object, the model\nis able to reason step-by-step and identify the bounding boxes of the target object. The query prompt supports\ncomplex descriptions of the target object as well as specified output formats, for example:\n>\n> - Help me to locate \u003Cexpr> in the image and give me its bounding boxes.\n> - Please pinpoint the bounding box [[x1,y1,x2,y2], …] in the image as per the given description. \u003Cexpr>\n\nHere, `\u003Cexpr>` is the description of the target object. The output bounding box is a quadruple $$[x_1,y_1,x_2,y_2]$$\ncomposed of the coordinates of the top-left and bottom-right corners, where each value is normalized by the image\nwidth (for x) or height (for y) and scaled by 1000.\n\nIn the response, the special tokens `\u003C|begin_of_box|>` and `\u003C|end_of_box|>` are used to mark the image bounding box in\nthe answer. The bracket style may vary ([], [[]], (), \u003C>, etc.), but the meaning is the same: to enclose the coordinates\nof the box.
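\n\nAs an illustration (our own sketch, not a utility shipped in this repository), a box parsed from between `\u003C|begin_of_box|>` and `\u003C|end_of_box|>` can be mapped back to pixel coordinates like this:\n\n```python\n# Sketch: convert a GLM-V grounding box from the 0-1000 normalized range\n# described above back to pixel coordinates.\ndef box_to_pixels(box, image_width, image_height):\n    x1, y1, x2, y2 = box\n    return (\n        round(x1 \u002F 1000 * image_width),\n        round(y1 \u002F 1000 * image_height),\n        round(x2 \u002F 1000 * image_width),\n        round(y2 \u002F 1000 * image_height),\n    )\n\n# e.g. a model output of [103, 58, 512, 472] on a 1920x1080 image\nprint(box_to_pixels([103, 58, 512, 472], 1920, 1080))  # (198, 63, 983, 510)\n```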
\n\n### GUI Agent\n\n- `examples\u002Fgui-agent`: Demonstrates prompt construction and output handling for GUI Agents, including strategies for\n  mobile, PC, and web. Prompt templates differ between GLM-4.1V and GLM-4.5V.\n\n### Quick Demo\n\n- `examples\u002Fvlm-helper`: A desktop assistant for GLM multimodal models (mainly GLM-4.5V, compatible with GLM-4.1V),\n  supporting text, images, videos, PDFs, PPTs, and more. Connects to the GLM multimodal API for intelligent services\n  across scenarios. Download the [installer](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fzai-org\u002FGLM-4.5V-Demo-App)\n  or [build from source](examples\u002Fvlm-helper\u002FREADME.md).\n\n## Quick Start\n\n### Environment Installation\n\n```bash\npip install -r requirements.txt\n```\n\n- vLLM and SGLang dependencies may conflict, so it is recommended to install only one of them in each environment.\n- Please note that after installation, you should verify the version of `transformers` and ensure it is upgraded to `5.2.0` or above.\n\n### transformers\n\n- `trans_infer_cli.py`: CLI for continuous conversations using the `transformers` backend.\n- `trans_infer_gradio.py`: Gradio web interface with multimodal input (images, videos, PDFs, PPTs) using the `transformers`\n  backend.\n- `trans_infer_bench`: Academic reproduction script for `GLM-4.1V-9B-Thinking`. It forces reasoning truncation at length\n  `8192` and requests direct answers afterward. Includes a video input example; modify it for other cases.\n\n### vLLM\n\n```bash\nvllm serve zai-org\u002FGLM-4.6V \\\n     --tensor-parallel-size 4 \\\n     --tool-call-parser glm45 \\\n     --reasoning-parser glm45 \\\n     --enable-auto-tool-choice \\\n     --served-model-name glm-4.6v \\\n     --allowed-local-media-path \u002F \\\n     --mm-encoder-tp-mode data \\\n     --mm-processor-cache-type shm\n```\n\nFor more details, check [vLLM Recipes](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Frecipes\u002Fblob\u002Fmain\u002FGLM\u002FGLM-V.md).\n\n### SGLang\n\n```shell\nsglang serve --model-path zai-org\u002FGLM-4.6V \\\n     --tp-size 4 \\\n     --tool-call-parser glm45 \\\n     --reasoning-parser glm45 \\\n     --served-model-name glm-4.6v \\\n     --mm-enable-dp-encoder \\\n     --port 8000 \\\n     --host 0.0.0.0\n```\n\nNotes:\n\n- We recommend increasing `SGLANG_VLM_CACHE_SIZE_MB` (e.g., `1024`) to provide sufficient cache space for video\n  understanding.\n- With both `vLLM` and `SGLang`, thinking mode is enabled by default. To disable it, add:\n  `extra_body={\"chat_template_kwargs\": {\"enable_thinking\": False}}`\n- You can configure a thinking budget to limit the model’s maximum reasoning span. Add\n\n    ```python\n  from sglang.srt.sampling.custom_logit_processor import Glm4MoeThinkingBudgetLogitProcessor\n    ```\n\n  and\n\n    ```python\n  extra_body={\n            \"custom_logit_processor\": Glm4MoeThinkingBudgetLogitProcessor().to_str(),\n            \"custom_params\": {\n                \"thinking_budget\": 8192, # max reasoning length in tokens\n            },\n        },\n    ```
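\n\nAs a usage sketch (our illustration: it assumes the OpenAI-compatible endpoint that the serving commands above expose on `localhost:8000`, and a placeholder image URL):\n\n```python\n# Sketch: query a locally served GLM-4.6V through the OpenAI-compatible API.\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http:\u002F\u002Flocalhost:8000\u002Fv1\", api_key=\"EMPTY\")\nresponse = client.chat.completions.create(\n    model=\"glm-4.6v\",  # matches --served-model-name above\n    messages=[{\n        \"role\": \"user\",\n        \"content\": [\n            {\"type\": \"image_url\", \"image_url\": {\"url\": \"https:\u002F\u002Fexample.com\u002Fcat.jpg\"}},\n            {\"type\": \"text\", \"text\": \"Describe this image.\"},\n        ],\n    }],\n    # Optional: disable thinking mode, as described in the notes above.\n    extra_body={\"chat_template_kwargs\": {\"enable_thinking\": False}},\n)\nprint(response.choices[0].message.content)\n```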
\n\n### xLLM\n\nCheck [here](examples\u002FAscend_NPU\u002FREADME_zh.md) for detailed instructions.\n\n## Integration with Other Automation Tools\n\n### Midscene.js\n\n[Midscene.js](https:\u002F\u002Fmidscenejs.com\u002Fen\u002Findex.html) is an open-source UI automation SDK driven by vision models, supporting multi-platform automation through JavaScript or Yaml-format process syntax.\n\nMidscene.js has completed integration with GLM-V models. You can quickly experience GLM-V through the [Midscene.js Integration Guide](https:\u002F\u002Fmidscenejs.com\u002Fmodel-common-config.html#glm-v).\n\nHere are two examples to help you get started quickly:\n\n- [Call Midscene.js via TypeScript scripts](.\u002Fexamples\u002Fmidscene-ts-demo)\n- [Experience Midscene.js via Yaml scripts](.\u002Fexamples\u002Fmidscene-yaml-demo)\n\n## Model Fine-tuning\n\n[LLaMA-Factory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory) already supports fine-tuning for GLM-4.5V &\nGLM-4.1V-9B-Thinking models. Below is an example of dataset construction using two images. You should organize your\ndataset into `finetune.json` in the following format; this is an example for fine-tuning GLM-4.1V-9B.\n\n```json\n[\n  {\n    \"messages\": [\n      {\n        \"content\": \"\u003Cimage>Who are they?\",\n        \"role\": \"user\"\n      },\n      {\n        \"content\": \"\u003Cthink>\\nUser asked me to observe the image and find the answer. I know they are Kane and Goretzka from Bayern Munich.\u003C\u002Fthink>\\n\u003Canswer>They're Kane and Goretzka from Bayern Munich.\u003C\u002Fanswer>\",\n        \"role\": \"assistant\"\n      },\n      {\n        \"content\": \"\u003Cimage>What are they doing?\",\n        \"role\": \"user\"\n      },\n      {\n        \"content\": \"\u003Cthink>\\nI need to observe what these people are doing. Oh, they are celebrating on the soccer field.\u003C\u002Fthink>\\n\u003Canswer>They are celebrating on the soccer field.\u003C\u002Fanswer>\",\n        \"role\": \"assistant\"\n      }\n    ],\n    \"images\": [\n      \"mllm_demo_data\u002F1.jpg\",\n      \"mllm_demo_data\u002F2.jpg\"\n    ]\n  }\n]\n```\n\n1. The content inside `\u003Cthink> ... \u003C\u002Fthink>` will **not** be stored as conversation history or in fine-tuning data.\n2. The `\u003Cimage>` tag will be replaced with the corresponding image information.\n3. For the GLM-4.5V model, the `\u003Canswer>` and `\u003C\u002Fanswer>` tags should be removed.\n\nThen, you can fine-tune following the standard LLaMA-Factory procedure.\n\n## Model Overview\n\n### GLM-4.6V\n\nThe GLM-4.6V series includes two versions: GLM-4.6V (106B), a foundation model designed for cloud and high-performance\ncluster scenarios,\nand GLM-4.6V-Flash (9B), a lightweight model optimized for local deployment and low-latency applications.\nGLM-4.6V scales its context window to 128k tokens in training,\nand achieves SoTA performance in visual understanding among models of similar parameter scales.\nCrucially, we integrate native Function Calling capabilities for the first time.\nThis effectively bridges the gap between \"visual perception\" and \"executable action\",\nproviding a unified technical foundation for multimodal agents in real-world business scenarios.\n\n![GLM-4.6V Benchmarks](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fzai-org_GLM-V_readme_1a46ac87e303.jpeg)\n\nBeyond achieving SoTA performance across major multimodal benchmarks at comparable model scales, GLM-4.6V introduces\nseveral key features:\n\n- **Native Multimodal Function Calling**\nEnables native vision-driven tool use. Images, screenshots, and document pages can be passed directly as tool inputs without text conversion, while visual outputs (charts, search images, rendered pages) are interpreted and integrated into the reasoning chain. This closes the loop from perception to understanding to execution.\n\n- **Interleaved Image-Text Content Generation**\nSupports high-quality mixed media creation from complex multimodal inputs. GLM-4.6V takes a multimodal context—spanning documents, user inputs, and tool-retrieved images—and synthesizes coherent, interleaved image-text content tailored to the task. During generation it can actively call search and retrieval tools to gather and curate additional text and visuals, producing rich, visually grounded content.\n\n- **Multimodal Document Understanding**\nGLM-4.6V can process up to 128K tokens of multi-document or long-document input, directly interpreting richly formatted pages as images. It understands text, layout, charts, tables, and figures jointly, enabling accurate comprehension of complex, image-heavy documents without requiring prior conversion to plain text.\n\n- **Frontend Replication & Visual Editing**\nReconstructs pixel-accurate HTML\u002FCSS from UI screenshots and supports natural-language-driven edits. 
It detects layout, components, and styles visually, generates clean code, and applies iterative visual modifications through simple user instructions.\n\n### GLM-4.5V\n\nGLM-4.5V is based on ZhipuAI’s GLM-4.5-Air.\nIt continues the technical approach of GLM-4.1V-Thinking, achieving SOTA performance among models of the same scale on\n42 public vision-language benchmarks.\nIt covers common tasks such as image, video, and document understanding, as well as GUI agent operations.\n\nBeyond benchmark performance, GLM-4.5V focuses on real-world usability. Through efficient hybrid training, it can handle\ndiverse types of visual content, enabling full-spectrum vision reasoning, including:\n\n- **Image reasoning** (scene understanding, complex multi-image analysis, spatial recognition)\n- **Video understanding** (long video segmentation and event recognition)\n- **GUI tasks** (screen reading, icon recognition, desktop operation assistance)\n- **Complex chart & long document parsing** (research report analysis, information extraction)\n- **Grounding** (precise visual element localization)\n\nThe model also introduces a **Thinking Mode** switch, allowing users to balance between quick responses and deep\nreasoning. This switch works the same as in the `GLM-4.5` language model.\n\n### GLM-4.1V-9B\n\nBuilt on the [GLM-4-9B-0414](https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-4) foundation model, the **GLM-4.1V-9B-Thinking** model\nintroduces a reasoning paradigm and uses RLCS (Reinforcement Learning with Curriculum Sampling) to comprehensively\nenhance model capabilities.\nIt achieves the strongest performance among 10B-level VLMs and matches or surpasses the much larger Qwen-2.5-VL-72B in\n18 benchmark tasks.\n\nWe also open-sourced the base model **GLM-4.1V-9B-Base** to support researchers in exploring the limits of\nvision-language model capabilities.\n\n![rl](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fzai-org_GLM-V_readme_f0a780d472b2.jpeg)\n\nCompared with the previous generation CogVLM2 and GLM-4V series, **GLM-4.1V-Thinking** brings:\n\n1. The series’ first reasoning-focused model, excelling in multiple domains beyond mathematics.\n2. **64k** context length support.\n3. Support for **any aspect ratio** and up to **4k** image resolution.\n4. A bilingual (Chinese\u002FEnglish) open-source version.\n\nGLM-4.1V-9B-Thinking integrates the **Chain-of-Thought** reasoning mechanism, improving accuracy, richness, and\ninterpretability.\nIt leads on 23 out of 28 benchmark tasks at the 10B parameter scale, and outperforms Qwen-2.5-VL-72B on 18 tasks despite\nits smaller size.\n\n## Remaining Issues\n\nSince the open-sourcing of GLM-4.1V, we have received extensive feedback from the community and are well aware that the model still has many shortcomings. In subsequent iterations, we attempted to address several common issues — such as repetitive thinking outputs and formatting errors — which have been mitigated to some extent in this new version.\n\nHowever, the model still has several limitations and issues that we will fix as soon as possible:\n\n1. Pure text QA capabilities still have significant room for improvement. In this development cycle, our primary focus was on visual multimodal scenarios, and we will enhance pure text abilities in upcoming updates.\n2. The model may still overthink or even repeat itself in certain cases, especially when dealing with complex prompts.\n3. In some situations, the model may restate the answer again at the end.\n4. 
There remain certain perception limitations, such as counting accuracy and identifying specific individuals, which still require improvement.\n\nThank you for your patience and understanding. We also welcome feedback and suggestions in the issue section — we will respond and improve as much as we can!\n\n## Citation\n\nIf you use this model, please cite the following paper:\n\n```bibtex\n@misc{vteam2025glm45vglm41vthinkingversatilemultimodal,\n      title={GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning},\n      author={V Team and Wenyi Hong and Wenmeng Yu and Xiaotao Gu and Guo Wang and Guobing Gan and Haomiao Tang and Jiale Cheng and Ji Qi and Junhui Ji and Lihang Pan and Shuaiqi Duan and Weihan Wang and Yan Wang and Yean Cheng and Zehai He and Zhe Su and Zhen Yang and Ziyang Pan and Aohan Zeng and Baoxu Wang and Bin Chen and Boyan Shi and Changyu Pang and Chenhui Zhang and Da Yin and Fan Yang and Guoqing Chen and Jiazheng Xu and Jiale Zhu and Jiali Chen and Jing Chen and Jinhao Chen and Jinghao Lin and Jinjiang Wang and Junjie Chen and Leqi Lei and Letian Gong and Leyi Pan and Mingdao Liu and Mingde Xu and Mingzhi Zhang and Qinkai Zheng and Sheng Yang and Shi Zhong and Shiyu Huang and Shuyuan Zhao and Siyan Xue and Shangqin Tu and Shengbiao Meng and Tianshu Zhang and Tianwei Luo and Tianxiang Hao and Tianyu Tong and Wenkai Li and Wei Jia and Xiao Liu and Xiaohan Zhang and Xin Lyu and Xinyue Fan and Xuancheng Huang and Yanling Wang and Yadong Xue and Yanfeng Wang and Yanzi Wang and Yifan An and Yifan Du and Yiming Shi and Yiheng Huang and Yilin Niu and Yuan Wang and Yuanchang Yue and Yuchen Li and Yutao Zhang and Yuting Wang and Yu Wang and Yuxuan Zhang and Zhao Xue and Zhenyu Hou and Zhengxiao Du and Zihan Wang and Peng Zhang and Debing Liu and Bin Xu and Juanzi Li and Minlie Huang and Yuxiao Dong and Jie Tang},\n      year={2025},\n      eprint={2507.01006},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01006},\n}\n```\n","# GLM-V\n\n[中文阅读。](.\u002FREADME_zh.md)\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=resources\u002Flogo.svg width=\"40%\"\u002F>\n\u003C\u002Fdiv>\n\u003Cp align=\"center\">\n    👋 加入我们的 \u003Ca href=\"resources\u002FWECHAT.md\" target=\"_blank\">微信\u003C\u002Fa> 和 \u003Ca href=\"https:\u002F\u002Ft.co\u002Fb6zGxJvzzS\" target=\"_blank\">Discord\u003C\u002Fa> 社区。\n    \u003Cbr>\n    📖 查看 GLM-4.6V \u003Ca href=\"https:\u002F\u002Fz.ai\u002Fblog\u002Fglm-4.6v\" target=\"_blank\">博客\u003C\u002Fa> 以及 GLM-4.5V & GLM-4.1V \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01006\" target=\"_blank\">论文\u003C\u002Fa>。\n    \u003Cbr>\n    📍 尝试 \u003Ca href=\"https:\u002F\u002Fchat.z.ai\u002F\" target=\"_blank\">在线版\u003C\u002Fa> 或使用 \u003Ca href=\"https:\u002F\u002Fdocs.z.ai\u002Fguides\u002Fvlm\u002Fglm-4.6v\" target=\"_blank\">API\u003C\u002Fa>。\n\u003C\u002Fp>\n\n## 简介\n\n视觉语言模型（Vision-Language Models, VLMs）已成为智能系统的关键基石。随着现实世界 AI 任务日益复杂，VLMs 迫切需要增强超越基本多模态感知的推理能力——提高准确性、全面性和智能性——以实现复杂问题解决、长上下文理解和多模态智能体。\n\n通过我们的开源工作，我们旨在与社区共同探索技术前沿，同时赋能更多开发者创建令人兴奋且创新的应用。\n\n**本开源仓库包含我们的 `GLM-4.6V`、`GLM-4.5V` 和 `GLM-4.1V` 系列模型。** 有关性能和详细信息，请参阅 [模型概览](#model-overview)。有关已知问题，请参阅 [已知问题与待解决问题](#fixed-and-remaining-issues)。\n\n## 项目更新\n\n- **新闻**: `2026\u002F03\u002F28`: 我们发布了多个 GLM-V 相关技能，涵盖多个专业领域，如 GLM-V-Grounding 和 GLM-V-Prompt-Gen。欢迎在此尝试 [这里](skills)。\n- **新闻**: `2025\u002F11\u002F10`: 我们发布了 **UI2Code^N**，这是一个具有 UI 转代码、UI 润色和 UI 
编辑能力的强化学习增强（RL-enhanced）UI 编码模型。该模型基于 `GLM-4.1V-Base` 训练。请在此查看 [这里](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FUI2Code_N)。\n- **新闻**: `2025\u002F10\u002F27`: 我们发布了 **Glyph**，这是一个通过图文压缩扩展上下文长度的框架，glyph 模型基于 `GLM-4.1V-Base` 训练。请在此查看 [这里](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGlyph)。\n- **新闻**: `2025\u002F08\u002F11`: 我们发布了 **GLM-4.5V**，在多个基准测试中均有显著提升。我们还开源了用于调试的手制 **桌面助手应用**。连接到 GLM-4.5V 后，它可以通过截图或屏幕录制捕获您 PC 屏幕上的视觉信息。欢迎试用或将其定制为您自己的多模态助手。点击 [这里](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fzai-org\u002FGLM-4.5V-Demo-App) 下载安装程序，或 [从源码构建](examples\u002Fvllm-chat-helper\u002FREADME.md)!\n- **新闻**: `2025\u002F07\u002F16`: 我们开源了用于训练 GLM-4.1V-Thinking 的 **VLM 奖励系统**。查看 [代码仓库](glmv_reward) 并在本地运行：`python examples\u002Freward_system_demo.py`。\n- **新闻**: `2025\u002F07\u002F01`: 我们发布了 **GLM-4.1V-9B-Thinking** 及其 [技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01006)。\n\n## 模型实现代码\n\n- GLM-4.5V 和 GLM-4.6V 模型算法：参见 [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Ftree\u002Fmain\u002Fsrc\u002Ftransformers\u002Fmodels\u002Fglm4v_moe) 中的完整实现。\n- GLM-4.1V-9B-Thinking 模型算法：参见 [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Ftree\u002Fmain\u002Fsrc\u002Ftransformers\u002Fmodels\u002Fglm4v) 中的完整实现。\n- 两个模型共享相同的多模态预处理，但使用不同的对话模板——请仔细区分。\n\n## 模型下载\n\n| 模型                | 下载链接                                                                                                                                                                                                                       | 类型             |\n|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------|\n| GLM-4.6V             | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.6V)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.6V)                         | 混合推理 |\n| GLM-4.6V-FP8         | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.6V-FP8)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.6V-FP8)                 | 混合推理 |\n| GLM-4.6V-Flash       | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.6V-Flash)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.6V-Flash)             | 混合推理 |\n| GLM-4.5V             | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.5V)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.5V)                         | 混合推理 |\n| GLM-4.5V-FP8         | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.5V-FP8)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.5V-FP8)                 | 混合推理 |\n| GLM-4.1V-9B-Thinking | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.1V-9B-Thinking)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.1V-9B-Thinking) | 推理        |\n| GLM-4.1V-9B-Base     | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.1V-9B-Base)\u003Cbr>[🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.1V-9B-Base)         | 基础             |\n\n\n+ Hugging Face 提供 GGUF 格式的模型权重。您可以从 
[这里](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fggml-org\u002Fglm-v) 下载 GLM-V 的 GGUF 格式模型。\n\n## 使用案例\n\n### 定位 (Grounding)\n\nGLM-4.5V \u002F GLM-4.6V \u002F GLM-4.1V 具备精确的定位能力。给定一个请求特定对象位置的提示，模型能够逐步推理并识别目标对象的边界框。查询提示支持对目标对象的复杂描述以及指定的输出格式，例如：\n>\n> - Help me to locate \u003Cexpr> in the image and give me its bounding boxes.\n> - Please pinpoint the bounding box [[x1,y1,x2,y2], …] in the image as per the given description. \u003Cexpr>\n\n此处，`\u003Cexpr>` 是目标对象的描述。输出的边界框是一个四元组 $$[x_1,y_1,x_2,y_2]$$，由左上角和右下角的坐标组成，其中每个值分别由图像宽度（对于 x）或高度（对于 y）归一化并乘以 1000。\n\n在响应中，特殊标记 `\u003C|begin_of_box|>` 和 `\u003C|end_of_box|>` 用于在答案中标记图像边界框。括号样式可能有所不同（`[]`, `[[]]`, `()`, `\u003C>` 等），但含义相同：即包围框的坐标。\n\n### GUI 智能体 (GUI Agent)\n\n- `examples\u002Fgui-agent`: 演示了 GUI 智能体的提示构建和输出处理，包括针对移动、PC 和 Web 的策略。GLM-4.1V 和 GLM-4.5V 之间的提示模板不同。\n\n### 快速演示\n\n- `examples\u002Fvlm-helper`: 一个用于 GLM 多模态模型（主要是 GLM-4.5V，兼容 GLM-4.1V）的桌面助手，支持文本、图像、视频、PDF、PPT 等。连接到 GLM 多模态 API（应用程序接口）以提供跨场景的智能服务。下载 [安装程序](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fzai-org\u002FGLM-4.5V-Demo-App) 或 [从源代码构建](examples\u002Fvlm-helper\u002FREADME.md)。\n\n## 快速开始\n\n### 环境安装\n\n```bash\npip install -r requirements.txt\n```\n\n- vLLM 和 SGLang 依赖可能冲突，因此建议每个环境中只安装其中之一。\n- 请注意，安装后应验证 `transformers` 的版本，并确保其升级到 `5.2.0` 或更高版本。\n\n### transformers\n\n- `trans_infer_cli.py`: 用于使用 `transformers` 后端进行连续对话的 CLI（命令行界面）。\n- `trans_infer_gradio.py`: 使用 `transformers` 后端的多模态输入（图像、视频、PDF、PPT）Gradio Web 界面。\n- `trans_infer_bench`: `GLM-4.1V-9B-Thinking` 的学术复现脚本。它在长度 `8192` 处强制截断推理并随后请求直接答案。包含视频输入示例；其他情况请修改。\n\n### vLLM\n\n```bash\nvllm serve zai-org\u002FGLM-4.6V \\\n     --tensor-parallel-size 4 \\\n     --tool-call-parser glm45 \\\n     --reasoning-parser glm45 \\\n     --enable-auto-tool-choice \\\n     --served-model-name glm-4.6v \\\n     --allowed-local-media-path \u002F \\\n     --mm-encoder-tp-mode data \\\n     --mm-processor-cache-type shm\n```\n\n更多详情，请查看 [vLLM Recipes](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Frecipes\u002Fblob\u002Fmain\u002FGLM\u002FGLM-V.md)。\n\n### SGLang\n\n```shell\nsglang serve --model-path zai-org\u002FGLM-4.6V \\\n     --tp-size 4 \\\n     --tool-call-parser glm45 \\\n     --reasoning-parser glm45 \\\n     --served-model-name glm-4.6v \\\n     --mm-enable-dp-encoder \\\n     --port 8000 \\\n     --host 0.0.0.0\n```\n\n注意：\n\n- 我们建议增加 `SGLANG_VLM_CACHE_SIZE_MB`（例如 `1024`），以为视频理解提供足够的缓存空间。\n- 当使用 `vLLM` 和 `SGLang` 时，思考模式默认启用。要禁用思考开关，添加：`extra_body={\"chat_template_kwargs\": {\"enable_thinking\": False}}`\n- 您可以配置思考预算以限制模型的最大推理跨度。添加\n    ```python\n  from sglang.srt.sampling.custom_logit_processor import Glm4MoeThinkingBudgetLogitProcessor\n    ```\n  以及\n    ```python\n  extra_body={\n            \"custom_logit_processor\": Glm4MoeThinkingBudgetLogitProcessor().to_str(),\n            \"custom_params\": {\n                \"thinking_budget\": 8192, # max reasoning length in tokens\n            },\n        },\n    ```\n\n### xLLM\n\n详细指令请查看 [此处](examples\u002FAscend_NPU\u002FREADME_zh.md)。\n\n## 与其他自动化工具集成\n\n### Midscene.js\n\n[Midscene.js](https:\u002F\u002Fmidscenejs.com\u002Fen\u002Findex.html) 是一个由视觉模型驱动的开源 UI 自动化 SDK（软件开发工具包），支持通过 JavaScript 或 Yaml 格式的过程语法实现多平台自动化。\n\nMidscene.js 已完成与 GLM-V 模型的集成。您可以通过 [Midscene.js 集成指南](https:\u002F\u002Fmidscenejs.com\u002Fmodel-common-config.html#glm-v) 快速体验 GLM-V。\n\n以下是两个帮助您快速入门的示例：\n\n- [通过 TypeScript 脚本调用 Midscene.js](.\u002Fexamples\u002Fmidscene-ts-demo)\n- [通过 Yaml 脚本体验 Midscene.js](.\u002Fexamples\u002Fmidscene-yaml-demo)\n\n## 
模型微调\n\n[LLaMA-Factory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory) 已支持对 GLM-4.5V 及 GLM-4.1V-9B-Thinking 模型进行微调。下面是使用两张图片构建数据集的示例。您应将数据集组织为以下格式的 `finetune.json`，这是针对 GLM-4.1V-9B 微调的示例。\n\n```json\n[\n  {\n    \"messages\": [\n      {\n        \"content\": \"\u003Cimage>Who are they?\",\n        \"role\": \"user\"\n      },\n      {\n        \"content\": \"\u003Cthink>\\nUser asked me to observe the image and find the answer. I know they are Kane and Goretzka from Bayern Munich.\u003C\u002Fthink>\\n\u003Canswer>They're Kane and Goretzka from Bayern Munich.\u003C\u002Fanswer>\",\n        \"role\": \"assistant\"\n      },\n      {\n        \"content\": \"\u003Cimage>What are they doing?\",\n        \"role\": \"user\"\n      },\n      {\n        \"content\": \"\u003Cthink>\\nI need to observe what these people are doing. Oh, they are celebrating on the soccer field.\u003C\u002Fthink>\\n\u003Canswer>They are celebrating on the soccer field.\u003C\u002Fanswer>\",\n        \"role\": \"assistant\"\n      }\n    ],\n    \"images\": [\n      \"mllm_demo_data\u002F1.jpg\",\n      \"mllm_demo_data\u002F2.jpg\"\n    ]\n  }\n]\n```\n\n1. `\u003Cthink> ... \u003C\u002Fthink>` 内部的内容**不会**作为对话历史或微调数据保存。\n2. `\u003Cimage>` 标签将被替换为相应的图像信息。\n3. 对于 GLM-4.5V 模型，应移除 `\u003Canswer>` 和 `\u003C\u002Fanswer>` 标签。\n\n然后，您可以按照标准的 LLaMA-Factory 流程进行微调。\n\n## 模型概览\n\n### GLM-4.6V\n\nGLM-4.6V 系列模型包含两个版本：GLM-4.6V（106B），这是一个专为云端和高性能集群场景设计的基础模型；以及 GLM-4.6V-Flash（9B），这是一个针对本地部署和低延迟应用优化的轻量级模型。GLM-4.6V 在训练中将其上下文窗口扩展至 128k tokens，并在相似参数规模的模型中实现了视觉理解领域的 SoTA（最先进）性能。关键的是，我们首次集成了原生的 Function Calling（函数调用）能力。这有效地弥合了“视觉感知”与“可执行操作”之间的差距，为现实世界业务场景中的多模态智能体（agents）提供了统一的技术基础。\n\n![GLM-4.6V Benchmarks](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fzai-org_GLM-V_readme_1a46ac87e303.jpeg)\n\n此外，在可比模型规模下，它在主要多模态基准测试中也达到了 SoTA 性能。GLM-4.6V 引入了几个关键特性：\n\n- **原生多模态函数调用**\n  支持原生的视觉驱动工具使用。图像、截图和文档页面可以直接作为工具输入传递，无需转换为文本，同时视觉输出（图表、搜索图片、渲染页面）会被解析并整合到推理链中。这实现了从感知到理解再到执行的闭环。\n\n- **图文交错内容生成**\n  支持从复杂的多模态输入进行高质量混合媒体创作。GLM-4.6V 接收涵盖文档、用户输入和工具检索图像的多模态上下文，并合成适合任务的连贯的、交错的图文内容。在生成过程中，它可以主动调用搜索和检索工具来收集和整理额外的文本和视觉内容，生成丰富且基于视觉的内容。\n\n- **多模态文档理解**\n  GLM-4.6V 可以处理多达 128K tokens 的多文档或长文档输入，直接将格式丰富的页面作为图像进行解析。它联合理解文本、布局、图表、表格和图片，能够准确理解复杂的、以图像为主的文档，而无需预先转换为纯文本。\n\n- **前端复制与视觉编辑**\n  从 UI 截图重建像素级精确的 HTML\u002FCSS，并支持自然语言驱动的编辑。它通过视觉检测布局、组件和样式，生成干净的代码，并通过简单的用户指令应用迭代视觉修改。\n\n### GLM-4.5V\n\nGLM-4.5V 基于智谱 AI 的 GLM-4.5-Air。它延续了 GLM-4.1V-Thinking 的技术路线，在 42 个公开视觉-语言基准测试中，在同规模模型中取得了 SOTA 性能。它涵盖了常见任务，如图像、视频和文档理解，以及 GUI 智能体操作。\n\n除了基准性能外，GLM-4.5V 还注重实际可用性。通过高效的混合训练，它能够处理多种类型的视觉内容，实现全谱系的视觉推理，包括：\n\n- **图像推理**（场景理解、复杂多图分析、空间识别）\n- **视频理解**（长视频分割与事件识别）\n- **GUI 任务**（屏幕阅读、图标识别、桌面操作辅助）\n- **复杂图表与长文档解析**（研究报告分析、信息提取）\n- **Grounding（视觉定位）**（精确视觉元素定位）\n\n该模型还引入了一个 **Thinking Mode**（思考模式）开关，允许用户在快速响应和深度推理之间进行平衡。此开关的工作方式与 `GLM-4.5` 语言模型相同。\n\n### GLM-4.1V-9B\n\n基于 [GLM-4-9B-0414](https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-4) 基础模型构建的 **GLM-4.1V-9B-Thinking** 模型引入了一种推理范式，并使用 RLCS（课程采样强化学习）全面增强模型能力。它在 10B 级别的 VLM（视觉语言模型）中实现了最强的性能，并在 18 项基准任务中与更大的 Qwen-2.5-VL-72B 持平或超越。\n\n我们还开源了基础模型 **GLM-4.1V-9B-Base**，以支持研究人员探索视觉语言模型能力的极限。\n\n![rl](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fzai-org_GLM-V_readme_f0a780d472b2.jpeg)\n\n与上一代 CogVLM2 和 GLM-4V 系列相比，**GLM-4.1V-Thinking** 带来了：\n\n1. 该系列首个专注于推理的模型，在数学之外的多个领域表现出色。\n2. 支持 **64k** 上下文长度。\n3. 支持 **任意宽高比** 和高达 **4k** 图像分辨率。\n4. 
提供双语（中文\u002F英文）开源版本。\n\nGLM-4.1V-9B-Thinking 集成了 **Chain-of-Thought**（思维链）推理机制，提高了准确性、丰富性和可解释性。在 10B 参数规模下，它在 28 项基准任务中的 23 项上领先，尽管体积更小，但在 18 项任务上优于 Qwen-2.5-VL-72B。\n\n## 遗留问题\n\n自 GLM-4.1V 开源以来，我们收到了社区的广泛反馈，并且清楚地意识到该模型仍存在许多不足。在后续迭代中，我们尝试解决了一些常见问题——例如重复的思考输出和格式错误——这些问题在新版本中得到了一定程度的缓解。\n\n然而，该模型仍有一些局限性和问题，我们将尽快修复：\n\n1. 纯文本问答能力仍有很大提升空间。在本开发周期中，我们的主要重点是视觉多模态场景，我们将在未来的更新中增强纯文本能力。\n2. 在某些情况下，模型可能仍然会过度思考甚至重复自己，尤其是在处理复杂提示时。\n3. 在某些情况下，模型可能会在结尾处再次重述答案。\n4. 仍然存在某些感知局限性，例如计数准确性和识别特定个人，这些仍需改进。\n\n感谢您的耐心和理解。我们也欢迎在 Issue 部分提供反馈和建议——我们将尽可能回应和改进！\n\n## 引用\n\n如果您使用了本模型，请引用以下论文：\n\n```bibtex\n@misc{vteam2025glm45vglm41vthinkingversatilemultimodal,\n      title={GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning},\n      author={V Team and Wenyi Hong and Wenmeng Yu and Xiaotao Gu and Guo Wang and Guobing Gan and Haomiao Tang and Jiale Cheng and Ji Qi and Junhui Ji and Lihang Pan and Shuaiqi Duan and Weihan Wang and Yan Wang and Yean Cheng and Zehai He and Zhe Su and Zhen Yang and Ziyang Pan and Aohan Zeng and Baoxu Wang and Bin Chen and Boyan Shi and Changyu Pang and Chenhui Zhang and Da Yin and Fan Yang and Guoqing Chen and Jiazheng Xu and Jiale Zhu and Jiali Chen and Jing Chen and Jinhao Chen and Jinghao Lin and Jinjiang Wang and Junjie Chen and Leqi Lei and Letian Gong and Leyi Pan and Mingdao Liu and Mingde Xu and Mingzhi Zhang and Qinkai Zheng and Sheng Yang and Shi Zhong and Shiyu Huang and Shuyuan Zhao and Siyan Xue and Shangqin Tu and Shengbiao Meng and Tianshu Zhang and Tianwei Luo and Tianxiang Hao and Tianyu Tong and Wenkai Li and Wei Jia and Xiao Liu and Xiaohan Zhang and Xin Lyu and Xinyue Fan and Xuancheng Huang and Yanling Wang and Yadong Xue and Yanfeng Wang and Yanzi Wang and Yifan An and Yifan Du and Yiming Shi and Yiheng Huang and Yilin Niu and Yuan Wang and Yuanchang Yue and Yuchen Li and Yutao Zhang and Yuting Wang and Yu Wang and Yuxuan Zhang and Zhao Xue and Zhenyu Hou and Zhengxiao Du and Zihan Wang and Peng Zhang and Debing Liu and Bin Xu and Juanzi Li and Minlie Huang and Yuxiao Dong and Jie Tang},\n      year={2025},\n      eprint={2507.01006},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01006},\n}\n```","# GLM-V 快速上手指南\n\nGLM-V 是智谱 AI 开源的视觉语言模型（VLM）系列，包含 GLM-4.6V、GLM-4.5V 和 GLM-4.1V 等版本，具备强大的多模态感知与推理能力。本指南将帮助您快速完成环境搭建与模型部署。\n\n## 1. 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n- **系统要求**：建议配备 NVIDIA GPU（显存根据模型大小而定）。\n- **Python 版本**：推荐 Python 3.8 及以上。\n- **依赖库**：\n  - 必须安装 `transformers`，且版本需升级至 **5.2.0** 或更高。\n  - 推理引擎二选一：`vLLM` 或 `SGLang`（两者依赖可能冲突，请勿同时安装）。\n\n## 2. 安装步骤\n\n### 2.1 克隆仓库并安装基础依赖\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-V.git\ncd GLM-V\npip install -r requirements.txt\n```\n\n> **注意**：安装完成后，请检查 `transformers` 版本是否满足要求。\n\n### 2.2 下载模型权重\n\n推荐使用国内镜像源加速下载。以 **GLM-4.6V** 为例：\n\n- **Hugging Face**: [链接](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.6V)\n- **ModelScope (推荐)**: [链接](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002FGLM-4.6V)\n\n其他版本（如 GLM-4.5V、GLM-4.1V）也可在 ModelScope 上找到对应资源。
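\n\n如果希望以脚本方式拉取权重，下面给出一个基于 ModelScope SDK 的最小示例（仅为示意，`cache_dir` 为假设的本地目录，仓库 ID 以上文链接为准）：\n\n```python\n# 示意：通过 ModelScope SDK 下载 GLM-4.6V 权重\nfrom modelscope import snapshot_download\n\nmodel_dir = snapshot_download(\"ZhipuAI\u002FGLM-4.6V\", cache_dir=\".\u002Fmodels\")  # cache_dir 为假设值\nprint(\"模型已下载至:\", model_dir)\n```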
\n\n## 3. 基本使用\n\n### 3.1 本地快速测试 (Transformers)\n\n如果您希望快速体验模型效果，可使用内置的 Gradio 界面或命令行工具。\n\n**启动 Web 界面：**\n```bash\npython trans_infer_gradio.py\n```\n该脚本支持文本、图片、视频、PDF 等多模态输入。\n\n**启动命令行交互：**\n```bash\npython trans_infer_cli.py\n```\n\n### 3.2 高性能服务部署 (vLLM)\n\n如需通过 API 提供服务，建议使用 vLLM 进行部署。以下为启动示例：\n\n```bash\nvllm serve zai-org\u002FGLM-4.6V \\\n     --tensor-parallel-size 4 \\\n     --tool-call-parser glm45 \\\n     --reasoning-parser glm45 \\\n     --enable-auto-tool-choice \\\n     --served-model-name glm-4.6v \\\n     --allowed-local-media-path \u002F \\\n     --mm-encoder-tp-mode data \\\n     --mm-processor-cache-type shm\n```\n\n> **提示**：更多详细配置可参考 [vLLM Recipes](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Frecipes\u002Fblob\u002Fmain\u002FGLM\u002FGLM-V.md)。\n\n### 3.3 高性能服务部署 (SGLang)\n\n若选择 SGLang 作为推理后端，请使用以下命令：\n\n```shell\nsglang serve --model-path zai-org\u002FGLM-4.6V \\\n     --tp-size 4 \\\n     --tool-call-parser glm45 \\\n     --reasoning-parser glm45 \\\n     --served-model-name glm-4.6v \\\n     --mm-enable-dp-encoder \\\n     --port 8000 \\\n     --host 0.0.0.0\n```\n\n> **注意**：\n> - 建议增加环境变量 `SGLANG_VLM_CACHE_SIZE_MB`（例如设置为 `1024`）以优化视频理解缓存。\n> - 默认开启思考模式，如需关闭可在请求中添加 `extra_body={\"chat_template_kwargs\": {\"enable_thinking\": False}}`。","某电商团队前端工程师小张在紧急迭代项目中，需要将一份包含复杂图表和动态效果的设计稿快速转化为可交互的 React 代码。\n\n### 没有 GLM-V 时\n- 手动编写 HTML\u002FCSS 结构耗时极长，且容易遗漏细微的间距与圆角细节。\n- 难以从静态图片中准确推断出悬停、点击等动态交互逻辑，需反复询问设计师。\n- 处理多图表组合布局时，样式调整极其繁琐，经常需要反复调试才能对齐。\n- 遇到特殊图标或字体时需额外寻找素材资源，频繁打断开发流程与思路。\n\n### 使用 GLM-V 后\n- GLM-V 直接识别图片生成基础 React 组件代码框架，大幅缩短搭建时间。\n- 强大的推理能力帮助还原复杂的响应式布局逻辑，自动适配不同屏幕尺寸。\n- 自动补全缺失的 CSS 样式与配色方案，减少人工调试与修改的时间成本。\n- 支持长上下文理解，一次性处理整页设计图无需裁剪，保持整体视觉一致性。\n\nGLM-V 通过深度视觉推理将设计到代码的转化效率提升数倍，显著降低重复劳动。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fzai-org_GLM-V_a434df55.png","zai-org","Z.ai","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fzai-org_f0d5ae80.png","ChatGLM, GLM-4.5, CogVLM, CodeGeeX, CogView, CogVideoX | CogDL, AMiner | Zhipu.ai (Z.ai)",null,"Zai_org","https:\u002F\u002Fwww.zhipuai.cn\u002Fen","https:\u002F\u002Fgithub.com\u002Fzai-org",[84],{"name":85,"color":86,"percentage":87},"Python","#3572A5",100,2257,161,"2026-04-05T21:00:45","Apache-2.0","未说明","需 GPU 支持（示例使用 tensor-parallel-size 4），具体型号和显存未说明",{"notes":95,"python":92,"dependencies":96},"vLLM 与 SGLang 依赖冲突，建议每个环境只安装其中一个；transformers 需升级至 5.2.0 或以上；支持 GGUF 格式模型下载；推理模式默认开启 Thinking，可通过参数关闭或限制长度。",[97,98,99],"transformers>=5.2.0","vLLM","SGLang",[14,35],[102,103,104,105],"image2text","video-understanding","vlm","reasoning",7,"2026-03-27T02:49:30.150509","2026-04-06T08:17:40.939054",[110,115,120,125,129,134],{"id":111,"question_zh":112,"answer_zh":113,"source_url":114},2130,"如何正确安装依赖环境以避免主分支代码不稳定问题？","由于 transformers 和 vLLM 的主分支可能不稳定，建议按照指定 commit 进行源代码安装，不要使用任何 pip 发行版或镜像站。具体依赖如下：\nsetuptools>=80.9.0\nsetuptools_scm>=8.3.1\ngit+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers.git@91221da2f1f68df9eb97c980a7206b14c4d3a9b0\ngit+https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm.git@220aee902a291209f2975d4cd02dadcc6749ffe6\ntorchvision>=0.22.0\ngradio>=5.35.0\npre-commit>=4.2.0\nPyMuPDF>=1.26.1\nav>=14.4.0\naccelerate>=1.6.0\nspaces>=0.37.1","https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-V\u002Fissues\u002F18",{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},2131,"使用 Docker 部署 GLM-4.5V 模型时推荐哪些启动参数？","推荐使用 llmnet\u002Fvllm-preview 
镜像，并在启动命令中包含以下关键参数以支持工具调用和推理解析：\n--tool-call-parser=glm45\n--reasoning-parser=glm45\n--enable-auto-tool-choice\n--gpu-memory-utilization=0.92\n--max-num-seqs=16\n--enable-prefix-caching\n--disable-log-requests\n--allowed-local-media-path=\u002Ftmp\n--media-io-kwargs={\"video\":{\"num_frames\":-1}}","https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-V\u002Fissues\u002F121",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},2132,"运行 GLM-4.5V 时提示‘无法识别框架 glm4V-moe’如何解决？","这通常是 vLLM 库版本过旧导致的。解决方法是更新 vLLM 到最新的 commit 并进行编译安装。同时检查张量并行设置，该模型有 12 个注意力头，建议 tensor_parallel_size 设置为 4 以适配硬件。","https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-V\u002Fissues\u002F142",{"id":126,"question_zh":127,"answer_zh":128,"source_url":119},2133,"为什么 8 卡 GPU 无法运行 FP8 或 bf16 模式？","由于模型架构存在 12 头的限制，8 卡配置无法跑通 FP8 或 bf16 模式。实测 4x48G 显卡可以运行 FP8。建议根据 GPU 数量调整并行策略，例如使用 4 卡配置。",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},2134,"遇到 `ValueError: ... does not support tensor parallel yet!` 错误该怎么办？","需要升级 vLLM 到预发布版本并更新 transformers 库。请执行以下命令：\npip install -U vllm --pre --extra-index-url https:\u002F\u002Fwheels.vllm.ai\u002Fnightly\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers","https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-V\u002Fissues\u002F143",{"id":135,"question_zh":136,"answer_zh":137,"source_url":133},2135,"卸载旧版 vLLM 后出现 `ImportError: cannot import name 'PretrainedConfig'` 如何处理？","这是依赖冲突导致的导入错误。建议不要随意卸载，应通过升级版本解决。若必须重装，请确保 transformers 版本与 vLLM 兼容，最好参考 Issue #18 中的指定 commit 进行源码安装，避免使用 pip 直接安装。",[]]