[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Blaizzy--mlx-vlm":3,"tool-Blaizzy--mlx-vlm":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",156033,2,"2026-04-14T23:32:00",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":78,"owner_email":79,"owner_twitter":80,"owner_website":81,"owner_url":82,"languages":83,"stars":88,"forks":89,"last_commit_at":90,"license":91,"difficulty_score":32,"env_os":92,"env_gpu":93,"env_ram":94,"env_deps":95,"category_tags":103,"github_topics":104,"view_count":32,"oss_zip_url":117,"oss_zip_packed_at":117,"status":17,"created_at":118,"updated_at":119,"faqs":120,"releases":152},7633,"Blaizzy\u002Fmlx-vlm","mlx-vlm","MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.","mlx-vlm 是一款专为 Mac 用户打造的开源工具包，旨在让视觉语言模型（VLM）及多模态模型（支持图像、音频和视频）的推理与微调变得简单高效。它基于 Apple 的 MLX 框架开发，解决了在 macOS 本地运行大型多模态 AI 模型时常见的配置复杂、依赖繁琐及性能优化不足等痛点。\n\n无论是希望快速体验最新多模态能力的普通用户，还是需要定制模型的研究人员与开发者，都能从中受益。普通用户可通过命令行或图形界面轻松实现“看图说话”、“听音描述”甚至图文音混合交互；开发者则能利用其提供的 Python 接口进行深度集成，或对 DeepSeek-OCR、Phi-4、Gemma 等主流模型进行微调训练。\n\nmlx-vlm 的技术亮点在于针对 Apple Silicon 芯片的深度优化，支持激活值量化、TurboQuant KV 
缓存加速以及视觉特征缓存等高级功能，显著提升了推理速度并降低了显存占用。此外，它还原生支持多轮多图对话及独特的“思考预算”控制，允许用户在处理复杂推理任务时精确管理模型的思考过程。只需一条 pip 命令即可完成安装，是 Mac 平台上探索前沿多模态 AI 的理想入","mlx-vlm 是一款专为 Mac 用户打造的开源工具包，旨在让视觉语言模型（VLM）及多模态模型（支持图像、音频和视频）的推理与微调变得简单高效。它基于 Apple 的 MLX 框架开发，解决了在 macOS 本地运行大型多模态 AI 模型时常见的配置复杂、依赖繁琐及性能优化不足等痛点。\n\n无论是希望快速体验最新多模态能力的普通用户，还是需要定制模型的研究人员与开发者，都能从中受益。普通用户可通过命令行或图形界面轻松实现“看图说话”、“听音描述”甚至图文音混合交互；开发者则能利用其提供的 Python 接口进行深度集成，或对 DeepSeek-OCR、Phi-4、Gemma 等主流模型进行微调训练。\n\nmlx-vlm 的技术亮点在于针对 Apple Silicon 芯片的深度优化，支持激活值量化、TurboQuant KV 缓存加速以及视觉特征缓存等高级功能，显著提升了推理速度并降低了显存占用。此外，它还原生支持多轮多图对话及独特的“思考预算”控制，允许用户在处理复杂推理任务时精确管理模型的思考过程。只需一条 pip 命令即可完成安装，是 Mac 平台上探索前沿多模态 AI 的理想入口。","[![Upload Python Package](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Factions\u002Fworkflows\u002Fpython-publish.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Factions\u002Fworkflows\u002Fpython-publish.yml)\n# MLX-VLM\n\nMLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) and Omni Models (VLMs with audio and video support) on your Mac using MLX.\n\n## Table of Contents\n- [Installation](#installation)\n- [Usage](#usage)\n  - [Command Line Interface (CLI)](#command-line-interface-cli)\n    - [Thinking Budget](#thinking-budget)\n  - [Chat UI with Gradio](#chat-ui-with-gradio)\n  - [Python Script](#python-script)\n- [Activation Quantization (CUDA)](#activation-quantization-cuda)\n- [Multi-Image Chat Support](#multi-image-chat-support)\n  - [Supported Models](#supported-models)\n  - [Usage Examples](#usage-examples)\n- [Model-Specific Documentation](#model-specific-documentation)\n- [Vision Feature Caching](#vision-feature-caching)\n- [TurboQuant KV Cache](#turboquant-kv-cache)\n- [Fine-tuning](#fine-tuning)\n\n## Model-Specific Documentation\n\nSome models have detailed documentation with prompt formats, examples, and best practices:\n\n| Model | Documentation |\n|-------|---------------|\n| DeepSeek-OCR | 
[Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fdeepseekocr\u002FREADME.md) |\n| DeepSeek-OCR-2 | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fdeepseekocr_2\u002FREADME.md) |\n| DOTS-OCR | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fdots_ocr\u002FREADME.md) |\n| DOTS-MOCR | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fdots_ocr\u002FREADME.md) |\n| GLM-OCR | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fglm_ocr\u002FREADME.md) |\n| Phi-4 Reasoning Vision | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fphi4_siglip\u002FREADME.md) |\n| MiniCPM-o | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fminicpmo\u002FREADME.md) |\n| Phi-4 Multimodal | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fphi4mm\u002FREADME.md) |\n| MolmoPoint | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fmolmo_point\u002FREADME.md) |\n| Moondream3 | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fmoondream3\u002FREADME.md) |\n| Gemma 4 | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fgemma4\u002FREADME.md) |\n| Falcon-OCR | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Ffalcon_ocr\u002FREADME.md) |\n| Granite Vision 3.2 | 
[Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fgranite_vision\u002FREADME.md) |\n| Granite 4.0 Vision | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fgranite4_vision\u002FREADME.md) |\n\n## Installation\n\nThe easiest way to get started is to install the `mlx-vlm` package using pip:\n\n```sh\npip install -U mlx-vlm\n```\n\n## Usage\n\n### Command Line Interface (CLI)\n\nGenerate output from a model using the CLI:\n\n```sh\n# Text generation\nmlx_vlm.generate --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt \"Hello, how are you?\"\n\n# Generation from an image\nmlx_vlm.generate --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit --max-tokens 100 --temperature 0.0 --image http:\u002F\u002Fimages.cocodataset.org\u002Fval2017\u002F000000039769.jpg\n\n# Generation from audio (New)\nmlx_vlm.generate --model mlx-community\u002Fgemma-3n-E2B-it-4bit --max-tokens 100 --prompt \"Describe what you hear\" --audio \u002Fpath\u002Fto\u002Faudio.wav\n\n# Multi-modal generation (Image + Audio)\nmlx_vlm.generate --model mlx-community\u002Fgemma-3n-E2B-it-4bit --max-tokens 100 --prompt \"Describe what you see and hear\" --image \u002Fpath\u002Fto\u002Fimage.jpg --audio \u002Fpath\u002Fto\u002Faudio.wav\n```\n\n#### Thinking Budget\n\nFor thinking models (e.g., Qwen3.5), you can limit the number of tokens spent in the thinking block:\n\n```sh\nmlx_vlm.generate --model mlx-community\u002FQwen3.5-2B-4bit \\\n  --thinking-budget 50 \\\n  --thinking-start-token \"\u003Cthink>\" \\\n  --thinking-end-token \"\u003C\u002Fthink>\" \\\n  --enable-thinking \\\n  --prompt \"Solve 2+2\"\n```\n\n| Flag | Description |\n|------|-------------|\n| `--enable-thinking` | Activate thinking mode in the chat template |\n| `--thinking-budget` | Max tokens allowed inside the thinking block |\n| `--thinking-start-token` | Token that opens a thinking block 
(default: `\u003Cthink>`) |\n| `--thinking-end-token` | Token that closes a thinking block (default: `\u003C\u002Fthink>`) |\n\nWhen the budget is exceeded, the model is forced to emit `\\n\u003C\u002Fthink>` and transition to the answer. If `--enable-thinking` is passed but the model's chat template does not support it, the budget is applied only if the model generates the start token on its own.\n\n### Chat UI with Gradio\n\nLaunch a chat interface using Gradio:\n\n```sh\nmlx_vlm.chat_ui --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit\n```\n\n### Python Script\n\nHere's an example of how to use MLX-VLM in a Python script:\n\n```python\nimport mlx.core as mx\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\nfrom mlx_vlm.utils import load_config\n\n# Load the model\nmodel_path = \"mlx-community\u002FQwen2-VL-2B-Instruct-4bit\"\nmodel, processor = load(model_path)\nconfig = load_config(model_path)\n\n# Prepare input\nimage = [\"http:\u002F\u002Fimages.cocodataset.org\u002Fval2017\u002F000000039769.jpg\"]\n# image = [Image.open(\"...\")] can also be used with PIL.Image.Image objects\nprompt = \"Describe this image.\"\n\n# Apply chat template\nformatted_prompt = apply_chat_template(\n    processor, config, prompt, num_images=len(image)\n)\n\n# Generate output\noutput = generate(model, processor, formatted_prompt, image, verbose=False)\nprint(output)\n```\n\n#### Audio Example\n\n```python\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\nfrom mlx_vlm.utils import load_config\n\n# Load model with audio support\nmodel_path = \"mlx-community\u002Fgemma-3n-E2B-it-4bit\"\nmodel, processor = load(model_path)\nconfig = model.config\n\n# Prepare audio input\naudio = [\"\u002Fpath\u002Fto\u002Faudio1.wav\", \"\u002Fpath\u002Fto\u002Faudio2.mp3\"]\nprompt = \"Describe what you hear in these audio files.\"\n\n# Apply chat template with audio\nformatted_prompt = apply_chat_template(\n    
processor, config, prompt, num_audios=len(audio)\n)\n\n# Generate output with audio\noutput = generate(model, processor, formatted_prompt, audio=audio, verbose=False)\nprint(output)\n```\n\n#### Multi-Modal Example (Image + Audio)\n\n```python\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\nfrom mlx_vlm.utils import load_config\n\n# Load multi-modal model\nmodel_path = \"mlx-community\u002Fgemma-3n-E2B-it-4bit\"\nmodel, processor = load(model_path)\nconfig = model.config\n\n# Prepare inputs\nimage = [\"\u002Fpath\u002Fto\u002Fimage.jpg\"]\naudio = [\"\u002Fpath\u002Fto\u002Faudio.wav\"]\nprompt = \"\"\n\n# Apply chat template\nformatted_prompt = apply_chat_template(\n    processor, config, prompt,\n    num_images=len(image),\n    num_audios=len(audio)\n)\n\n# Generate output\noutput = generate(model, processor, formatted_prompt, image, audio=audio, verbose=False)\nprint(output)\n```\n\n### Server (FastAPI)\n\nStart the server:\n```sh\nmlx_vlm.server --port 8080\n\n# Preload a model at startup (Hugging Face repo or local path)\nmlx_vlm.server --model \u003Chf_repo_or_local_path>\n\n# Preload a model with adapter\nmlx_vlm.server --model \u003Chf_repo_or_local_path> --adapter-path \u003Cadapter_path>\n\n# With trust remote code enabled (required for some models)\nmlx_vlm.server --trust-remote-code\n```\n\n#### Server Options\n\n- `--model`: Preload a model at server startup, accepts a Hugging Face repo ID or local path (optional, loads lazily on first request if omitted)\n- `--adapter-path`: Path for adapter weights to use with the preloaded model\n- `--host`: Host address (default: `0.0.0.0`)\n- `--port`: Port number (default: `8080`)\n- `--trust-remote-code`: Trust remote code when loading models from Hugging Face Hub\n- `--kv-bits`: Number of bits for KV cache quantization (e.g. 
`3.5` for TurboQuant)\n- `--kv-quant-scheme`: KV cache quantization backend (`uniform` or `turboquant`)\n\nYou can also set trust remote code via environment variable:\n```sh\nMLX_TRUST_REMOTE_CODE=true mlx_vlm.server\n```\n\nThe server provides multiple endpoints for different use cases and supports dynamic model loading\u002Funloading with caching (one model at a time).\n\n#### Available Endpoints\n\n- `\u002Fmodels` and `\u002Fv1\u002Fmodels` - List models available locally\n- `\u002Fchat\u002Fcompletions` and `\u002Fv1\u002Fchat\u002Fcompletions` - OpenAI-compatible chat-style interaction endpoint with support for images, audio, and text\n- `\u002Fresponses` and `\u002Fv1\u002Fresponses` - OpenAI-compatible responses endpoint\n- `\u002Fhealth` - Check server status\n- `\u002Funload` - Unload current model from memory\n\n#### Usage Examples\n\n##### List available models\n\n```sh\ncurl \"http:\u002F\u002Flocalhost:8080\u002Fmodels\"\n```\n\n##### Text Input\n\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fchat\u002Fcompletions\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002FQwen2-VL-2B-Instruct-4bit\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": \"Hello, how are you\"\n      }\n    ],\n    \"stream\": true,\n    \"max_tokens\": 100\n  }'\n```\n\n##### Image Input\n\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fchat\u002Fcompletions\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002FQwen2.5-VL-32B-Instruct-8bit\",\n    \"messages\":\n    [\n      {\n        \"role\": \"system\",\n        \"content\": \"You are a helpful assistant.\"\n      },\n      {\n        \"role\": \"user\",\n        \"content\": [\n          {\n            \"type\": \"text\",\n            \"text\": \"This is today's chart for energy demand in California. 
Can you provide an analysis of the chart and comment on the implications for renewable energy in California?\"\n          },\n          {\n            \"type\": \"input_image\",\n            \"image_url\": \"\u002Fpath\u002Fto\u002Frepo\u002Fexamples\u002Fimages\u002Frenewables_california.png\"\n          }\n        ]\n      }\n    ],\n    \"stream\": true,\n    \"max_tokens\": 1000\n  }'\n```\n\n##### Audio Support (New)\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fgenerate\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002Fgemma-3n-E2B-it-4bit\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": [\n          { \"type\": \"text\", \"text\": \"Describe what you hear in these audio files\" },\n          { \"type\": \"input_audio\", \"input_audio\": \"\u002Fpath\u002Fto\u002Faudio1.wav\" },\n          { \"type\": \"input_audio\", \"input_audio\": \"https:\u002F\u002Fexample.com\u002Faudio2.mp3\" }\n        ]\n      }\n    ],\n    \"stream\": true,\n    \"max_tokens\": 500\n  }'\n```\n\n##### Multi-Modal (Image + Audio)\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fgenerate\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002Fgemma-3n-E2B-it-4bit\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": [\n          {\"type\": \"input_image\", \"image_url\": \"\u002Fpath\u002Fto\u002Fimage.jpg\"},\n          {\"type\": \"input_audio\", \"input_audio\": \"\u002Fpath\u002Fto\u002Faudio.wav\"}\n        ]\n      }\n    ],\n    \"max_tokens\": 100\n  }'\n```\n\n##### Responses Endpoint\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fresponses\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002FQwen2-VL-2B-Instruct-4bit\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": [\n          {\"type\": 
\"input_text\", \"text\": \"What is in this image?\"},\n          {\"type\": \"input_image\", \"image_url\": \"\u002Fpath\u002Fto\u002Fimage.jpg\"}\n        ]\n      }\n    ],\n    \"max_tokens\": 100\n  }'\n```\n\n#### Request Parameters\n\n- `model`: Model identifier (required)\n- `messages`: Chat messages for chat\u002FOpenAI endpoints\n- `max_tokens`: Maximum tokens to generate\n- `temperature`: Sampling temperature\n- `top_p`: Top-p sampling parameter\n- `top_k`: Top-k sampling cutoff\n- `min_p`: Min-p sampling threshold\n- `repetition_penalty`: Penalty applied to repeated tokens\n- `stream`: Enable streaming responses\n\n\n## Activation Quantization (CUDA)\n\nWhen running on NVIDIA GPUs with MLX CUDA, models quantized with `mxfp8` or `nvfp4` modes require activation quantization to work properly. This converts `QuantizedLinear` layers to `QQLinear` layers which quantize both weights and activations.\n\n### Command Line\n\nUse the `-qa` or `--quantize-activations` flag:\n\n```sh\nmlx_vlm.generate --model \u002Fpath\u002Fto\u002Fmxfp8-model --prompt \"Describe this image\" --image \u002Fpath\u002Fto\u002Fimage.jpg -qa\n```\n\n### Python API\n\nPass `quantize_activations=True` to the `load` function:\n\n```python\nfrom mlx_vlm import load, generate\n\n# Load with activation quantization enabled\nmodel, processor = load(\n    \"path\u002Fto\u002Fmxfp8-quantized-model\",\n    quantize_activations=True\n)\n\n# Generate as usual\noutput = generate(model, processor, \"Describe this image\", image=[\"image.jpg\"])\n```\n\n### Supported Quantization Modes\n\n- `mxfp8` - 8-bit MX floating point\n- `nvfp4` - 4-bit NVIDIA floating point\n\n> **Note**: This feature is required for mxfp\u002Fnvfp quantized models on CUDA. On Apple Silicon (Metal), these models work without the flag.\n\n## Multi-Image Chat Support\n\nMLX-VLM supports analyzing multiple images simultaneously with select models. 
This feature enables more complex visual reasoning tasks and comprehensive analysis across multiple images in a single conversation.\n\n\n### Usage Examples\n\n#### Python Script\n\n```python\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\nfrom mlx_vlm.utils import load_config\n\nmodel_path = \"mlx-community\u002FQwen2-VL-2B-Instruct-4bit\"\nmodel, processor = load(model_path)\nconfig = model.config\n\nimages = [\"path\u002Fto\u002Fimage1.jpg\", \"path\u002Fto\u002Fimage2.jpg\"]\nprompt = \"Compare these two images.\"\n\nformatted_prompt = apply_chat_template(\n    processor, config, prompt, num_images=len(images)\n)\n\noutput = generate(model, processor, formatted_prompt, images, verbose=False)\nprint(output)\n```\n\n#### Command Line\n\n```sh\nmlx_vlm.generate --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt \"Compare these images\" --image path\u002Fto\u002Fimage1.jpg path\u002Fto\u002Fimage2.jpg\n```\n\nThese examples demonstrate how to use multiple images with MLX-VLM for more complex visual reasoning tasks.\n\n## Video Understanding\n\nMLX-VLM also supports video analysis such as captioning, summarization, and more, with select models.\n\n### Supported Models\n\nThe following models support video chat:\n\n1. Qwen2-VL\n2. Qwen2.5-VL\n3. Idefics3\n4. LLaVA\n\nMore models are coming soon.\n\n### Usage Examples\n\n#### Command Line\n```sh\nmlx_vlm.video_generate --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt \"Describe this video\" --video path\u002Fto\u002Fvideo.mp4 --max-pixels 224 224 --fps 1.0\n```\n\n## Vision Feature Caching\n\nIn multi-turn conversations about an image, the vision encoder runs on every turn even though the image hasn't changed. `VisionFeatureCache` stores projected vision features in an LRU cache keyed by image path, so the expensive vision encoder is only called once per unique image.\n\n### How It Works\n\n1. 
**First turn (cache miss)** -- `encode_image()` runs the full vision pipeline (vision tower + projector), stores the result in the cache, and passes it to the language model.\n2. **Subsequent turns (cache hit)** -- the cached features are passed directly via `cached_image_features`, skipping the vision encoder entirely.\n3. **Image switch** -- when the image changes, it's a new cache key so features are computed and cached. Switching back to a previous image is a cache hit.\n\nThe cache holds up to 8 entries (configurable) and uses LRU eviction.\n\n### CLI\n\nAll chat interfaces use `VisionFeatureCache` automatically:\n\n```sh\n# Gradio chat UI\npython -m mlx_vlm.chat_ui --model google\u002Fgemma-4-26b-a4b-it\n\n# Interactive chat with Rich UI (load images with \u002Fimage command)\npython -m mlx_vlm.chat --model google\u002Fgemma-4-26b-a4b-it\n\n# Inline chat mode\npython -m mlx_vlm.generate \\\n  --model google\u002Fgemma-4-26b-a4b-it \\\n  --image path\u002Fto\u002Fimage.jpg \\\n  --chat \\\n  --max-tokens 200\n```\n\n### Python\n\n```python\nfrom mlx_vlm import load, stream_generate, VisionFeatureCache\nfrom mlx_vlm.prompt_utils import apply_chat_template\n\nmodel, processor = load(\"google\u002Fgemma-4-26b-a4b-it\")\ncache = VisionFeatureCache()\n\nimage = \"path\u002Fto\u002Fimage.jpg\"\n\n# Turn 1 -- cache miss, encodes image\nprompt1 = apply_chat_template(processor, model.config, \"Describe this image.\", num_images=1)\nfor chunk in stream_generate(model, processor, prompt1, image=[image],\n                              max_tokens=200, vision_cache=cache):\n    print(chunk.text, end=\"\")\n\n# Turn 2 -- cache hit, skips vision encoder\nprompt2 = apply_chat_template(processor, model.config, \"What colors do you see?\", num_images=1)\nfor chunk in stream_generate(model, processor, prompt2, image=[image],\n                              max_tokens=200, vision_cache=cache):\n    print(chunk.text, end=\"\")\n```\n\n### Server\n\nThe server caches vision features 
automatically across requests for the same image. No configuration needed -- the cache is created when a model loads and cleared on unload.\n\n```sh\nmlx_vlm.server --model google\u002Fgemma-4-26b-a4b-it\n```\n\nMulti-turn conversations via `\u002Fv1\u002Fchat\u002Fcompletions` (streaming and non-streaming) and `\u002Fresponses` all benefit. The same image sent across multiple requests will only be encoded once.\n\n### Performance\n\nTested on `google\u002Fgemma-4-26b-a4b-it` over 10 multi-turn conversation turns:\n\n| Metric | Without Cache | With Cache |\n|--------|--------------|------------|\n| Prompt TPS | ~48 | ~550-825 |\n| Speedup | -- | **11x+** |\n| Peak Memory | 52.66 GB | 52.66 GB (flat) |\n\nGeneration speed (~31 tok\u002Fs) and memory are unaffected -- only prompt processing gets faster.\n\n## TurboQuant KV Cache\n\nTurboQuant compresses the KV cache during generation, enabling longer context lengths with less memory while maintaining quality.\n\n### Quick Start\n\n```sh\n# 3.5-bit KV cache quantization (3-bit keys + 4-bit values)\nmlx_vlm generate \\\n  --model mlx-community\u002FQwen3.5-4B-4bit \\\n  --kv-bits 3.5 \\\n  --kv-quant-scheme turboquant \\\n  --prompt \"Your long prompt here...\"\n```\n\n```python\nfrom mlx_vlm import generate\n\nresult = generate(\n    model, processor, prompt,\n    kv_bits=3.5,\n    kv_quant_scheme=\"turboquant\",\n    max_tokens=256,\n)\n```\n\n```sh\n# Server with TurboQuant\nmlx_vlm server \\\n  --model google\u002Fgemma-4-26b-a4b-it \\\n  --kv-bits 3.5 \\\n  --kv-quant-scheme turboquant\n```\n\n### How It Works\n\nTurboQuant uses random rotation + codebook quantization ([arXiv:2504.19874](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.19874)) to compress KV cache entries from 16-bit to 2-4 bits per dimension:\n\n- **Keys & Values**: MSE codebook quantization with Hadamard rotation\n- **Fractional bits** (e.g. 
3.5): uses lower bits for keys, higher for values (3-bit K + 4-bit V)\n\nCustom Metal kernels fuse score computation and value aggregation directly on packed quantized data, avoiding full dequantization during decode.\n\n### Performance\n\nTested on Qwen3.5-4B-4bit at 128k context:\n\n| Metric | Baseline | TurboQuant 3.5-bit |\n|--------|----------|-------------------|\n| KV Memory | 4.1 GB | 0.97 GB (**76% reduction**) |\n| Peak Memory | 18.3 GB | 17.3 GB (**-1.0 GB**) |\n\nAt 512k+ contexts, TurboQuant's per-layer attention is **faster than FP16 SDPA** due to reduced memory bandwidth requirements.\n\nTested on gemma-4-31b-it at 128k context:\n\n| Metric | Baseline | TurboQuant 3.5-bit |\n|--------|----------|-------------------|\n| KV Memory | 13.3 GB | 4.9 GB (**63% reduction**) |\n| Peak Memory | 75.2 GB | 65.8 GB (**-9.4 GB**) |\n\n### Supported Bit Widths\n\n| Bits | Compression | Best For |\n|------|------------|----------|\n| 2 | ~8x | Maximum compression, some quality loss |\n| 3 | ~5x | Good balance of quality and compression |\n| 3.5 | ~4.5x | Recommended default (3-bit keys + 4-bit values) |\n| 4 | ~4x | Best quality, moderate compression |\n\n### Compatibility\n\nTurboQuant automatically quantizes `KVCache` layers (global attention). 
Models with `RotatingKVCache` (sliding window) or `ArraysCache` (MLA\u002Fabsorbed keys) keep their native cache format for those layers since they are already memory-efficient.\n\n## Fine-tuning\n\nMLX-VLM supports fine-tuning models with LoRA and QLoRA.\n\n### LoRA & QLoRA\n\nTo learn more about LoRA, please refer to the [LoRA.md](.\u002Fmlx_vlm\u002FLORA.MD) file.\n","[![上传 Python 包](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Factions\u002Fworkflows\u002Fpython-publish.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Factions\u002Fworkflows\u002Fpython-publish.yml)\n# MLX-VLM\n\nMLX-VLM 是一个用于在您的 Mac 上使用 MLX 对视觉语言模型（VLM）和全能模型（支持音频和视频的 VLM）进行推理和微调的软件包。\n\n## 目录\n- [安装](#installation)\n- [使用](#usage)\n  - [命令行界面 (CLI)](#command-line-interface-cli)\n    - [思考预算](#thinking-budget)\n  - [Gradio 聊天界面](#chat-ui-with-gradio)\n  - [Python 脚本](#python-script)\n- [激活量化 (CUDA)](#activation-quantization-cuda)\n- [多图像聊天支持](#multi-image-chat-support)\n  - [支持的模型](#supported-models)\n  - [使用示例](#usage-examples)\n- [模型专用文档](#model-specific-documentation)\n- [视觉特征缓存](#vision-feature-caching)\n- [TurboQuant KV 缓存](#turboquant-kv-cache)\n- [微调](#fine-tuning)\n\n## 模型专用文档\n\n部分模型提供了详细的文档，包括提示格式、示例和最佳实践：\n\n| 模型 | 文档 |\n|-------|---------------|\n| DeepSeek-OCR | [文档](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fdeepseekocr\u002FREADME.md) |\n| DeepSeek-OCR-2 | [文档](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fdeepseekocr_2\u002FREADME.md) |\n| DOTS-OCR | [文档](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fdots_ocr\u002FREADME.md) |\n| DOTS-MOCR | [文档](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fdots_ocr\u002FREADME.md) |\n| GLM-OCR | 
[文档](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fglm_ocr\u002FREADME.md) |\n| Phi-4 Reasoning Vision | [文档](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fphi4_siglip\u002FREADME.md) |\n| MiniCPM-o | [文档](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fminicpmo\u002FREADME.md) |\n| Phi-4 多模态 | [文档](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fphi4mm\u002FREADME.md) |\n| MolmoPoint | [文档](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fmolmo_point\u002FREADME.md) |\n| Moondream3 | [文档](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fmoondream3\u002FREADME.md) |\n| Gemma 4 | [文档](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fgemma4\u002FREADME.md) |\n| Falcon-OCR | [文档](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Ffalcon_ocr\u002FREADME.md) |\n| Granite Vision 3.2 | [文档](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fgranite_vision\u002FREADME.md) |\n| Granite 4.0 Vision | [文档](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fgranite4_vision\u002FREADME.md) |\n\n## 安装\n\n开始使用的最简单方法是使用 pip 安装 `mlx-vlm` 包：\n\n```sh\npip install -U mlx-vlm\n```\n\n## 使用\n\n### 命令行界面 (CLI)\n\n使用 CLI 从模型生成输出：\n\n```sh\n# 文本生成\nmlx_vlm.generate --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt \"你好，最近怎么样？\"\n\n# 图像生成\nmlx_vlm.generate --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit --max-tokens 100 --temperature 0.0 --image 
http:\u002F\u002Fimages.cocodataset.org\u002Fval2017\u002F000000039769.jpg\n\n# 音频生成（新功能）\nmlx_vlm.generate --model mlx-community\u002Fgemma-3n-E2B-it-4bit --max-tokens 100 --prompt \"描述你听到的内容\" --audio \u002Fpath\u002Fto\u002Faudio.wav\n\n# 多模态生成（图像 + 音频）\nmlx_vlm.generate --model mlx-community\u002Fgemma-3n-E2B-it-4bit --max-tokens 100 --prompt \"描述你看到和听到的内容\" --image \u002Fpath\u002Fto\u002Fimage.jpg --audio \u002Fpath\u002Fto\u002Faudio.wav\n```\n\n#### 思考预算\n\n对于具备思考能力的模型（例如 Qwen3.5），您可以限制思考块中消耗的标记数量：\n\n```sh\nmlx_vlm.generate --model mlx-community\u002FQwen3.5-2B-4bit \\\n  --thinking-budget 50 \\\n  --thinking-start-token \"\u003Cthink>\" \\\n  --thinking-end-token \"\u003C\u002Fthink>\" \\\n  --enable-thinking \\\n  --prompt \"计算 2+2\"\n```\n\n| 标志 | 描述 |\n|------|-------------|\n| `--enable-thinking` | 在聊天模板中启用思考模式 |\n| `--thinking-budget` | 思考块内允许的最大标记数 |\n| `--thinking-start-token` | 打开思考块的标记（默认：`\u003Cthink>`） |\n| `--thinking-end-token` | 关闭思考块的标记（默认：`\u003C\u002Fthink>`） |\n\n当超过预算时，模型会强制输出 `\\n\u003C\u002Fthink>` 并过渡到回答。如果传递了 `--enable-thinking` 但模型的聊天模板不支持该功能，则只有当模型自行生成起始标记时，预算才会生效。\n\n### Gradio 聊天界面\n\n使用 Gradio 启动聊天界面：\n\n```sh\nmlx_vlm.chat_ui --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit\n```\n\n### Python 脚本\n\n以下是一个在 Python 脚本中使用 MLX-VLM 的示例：\n\n```python\nimport mlx.core as mx\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\nfrom mlx_vlm.utils import load_config\n\n# 加载模型\nmodel_path = \"mlx-community\u002FQwen2-VL-2B-Instruct-4bit\"\nmodel, processor = load(model_path)\nconfig = load_config(model_path)\n\n# 准备输入\nimage = [\"http:\u002F\u002Fimages.cocodataset.org\u002Fval2017\u002F000000039769.jpg\"]\n# 也可以使用 PIL.Image.Image 对象：image = [Image.open(\"...\")]\nprompt = \"请描述这张图片。\"\n\n# 应用聊天模板\nformatted_prompt = apply_chat_template(\n    processor, config, prompt, num_images=len(image)\n)\n\n# 生成输出\noutput = generate(model, processor, formatted_prompt, image, 
verbose=False)\nprint(output)\n```\n\n#### 音频示例\n\n```python\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\nfrom mlx_vlm.utils import load_config\n\n# 加载支持音频的模型\nmodel_path = \"mlx-community\u002Fgemma-3n-E2B-it-4bit\"\nmodel, processor = load(model_path)\nconfig = model.config\n\n# 准备音频输入\naudio = [\"\u002Fpath\u002Fto\u002Faudio1.wav\", \"\u002Fpath\u002Fto\u002Faudio2.mp3\"]\nprompt = \"请描述这些音频文件中你听到的内容。\"\n\n# 应用包含音频的聊天模板\nformatted_prompt = apply_chat_template(\n    processor, config, prompt, num_audios=len(audio)\n)\n\n# 生成包含音频的输出\noutput = generate(model, processor, formatted_prompt, audio=audio, verbose=False)\nprint(output)\n```\n\n#### 多模态示例（图像 + 音频）\n\n```python\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\nfrom mlx_vlm.utils import load_config\n\n# 加载多模态模型\nmodel_path = \"mlx-community\u002Fgemma-3n-E2B-it-4bit\"\nmodel, processor = load(model_path)\nconfig = model.config\n\n# 准备输入\nimage = [\"\u002Fpath\u002Fto\u002Fimage.jpg\"]\naudio = [\"\u002Fpath\u002Fto\u002Faudio.wav\"]\nprompt = \"\"\n\n# 应用聊天模板\nformatted_prompt = apply_chat_template(\n    processor, config, prompt,\n    num_images=len(image),\n    num_audios=len(audio)\n)\n\n# 生成输出\noutput = generate(model, processor, formatted_prompt, image, audio=audio, verbose=False)\nprint(output)\n```\n\n### 服务器 (FastAPI)\n\n启动服务器：\n```sh\nmlx_vlm.server --port 8080\n\n# 在启动时预加载模型（Hugging Face 仓库或本地路径）\nmlx_vlm.server --model \u003Chf_repo_or_local_path>\n\n# 使用适配器预加载模型\nmlx_vlm.server --model \u003Chf_repo_or_local_path> --adapter-path \u003Cadapter_path>\n\n# 启用信任远程代码（某些模型需要）\nmlx_vlm.server --trust-remote-code\n```\n\n#### 服务器选项\n\n- `--model`: 在服务器启动时预加载模型，接受 Hugging Face 仓库 ID 或本地路径（可选；若省略，则在首次请求时按需加载）\n- `--adapter-path`: 用于与预加载模型一起使用的适配器权重路径\n- `--host`: 主机地址（默认：`0.0.0.0`）\n- `--port`: 端口号（默认：`8080`）\n- `--trust-remote-code`: 从 Hugging Face Hub 加载模型时信任远程代码\n- `--kv-bits`: KV 缓存量化位数（例如，`3.5` 表示 TurboQuant）\n- 
`--kv-quant-scheme`: KV 缓存量化后端（`uniform` 或 `turboquant`）\n\n您也可以通过环境变量设置信任远程代码：\n```sh\nMLX_TRUST_REMOTE_CODE=true mlx_vlm.server\n```\n\n该服务器提供多个端点以满足不同使用场景，并支持动态加载和卸载模型，同时具备缓存功能（一次仅能加载一个模型）。\n\n#### 可用端点\n\n- `\u002Fmodels` 和 `\u002Fv1\u002Fmodels`：列出本地可用的模型\n- `\u002Fchat\u002Fcompletions` 和 `\u002Fv1\u002Fchat\u002Fcompletions`：兼容 OpenAI 的聊天式交互端点，支持图像、音频和文本\n- `\u002Fresponses` 和 `\u002Fv1\u002Fresponses`：兼容 OpenAI 的响应端点\n- `\u002Fhealth`：检查服务器状态\n- `\u002Funload`：从内存中卸载当前模型\n\n#### 使用示例\n\n##### 列出可用模型\n\n```sh\ncurl \"http:\u002F\u002Flocalhost:8080\u002Fmodels\"\n```\n\n##### 文本输入\n\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fchat\u002Fcompletions\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002FQwen2-VL-2B-Instruct-4bit\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": \"你好，最近怎么样\"\n      }\n    ],\n    \"stream\": true,\n    \"max_tokens\": 100\n  }'\n```\n\n##### 图像输入\n\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fchat\u002Fcompletions\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002FQwen2.5-VL-32B-Instruct-8bit\",\n    \"messages\":\n    [\n      {\n        \"role\": \"system\",\n        \"content\": \"你是一个有用的助手。\"\n      },\n      {\n        \"role\": \"user\",\n        \"content\": [\n          {\n            \"type\": \"text\",\n            \"text\": \"这是加利福尼亚州今日的能源需求图表。你能对这张图进行分析，并谈谈它对加州可再生能源的影响吗？\"\n          },\n          {\n            \"type\": \"input_image\",\n            \"image_url\": \"\u002Fpath\u002Fto\u002Frepo\u002Fexamples\u002Fimages\u002Frenewables_california.png\"\n          }\n        ]\n      }\n    ],\n    \"stream\": true,\n    \"max_tokens\": 1000\n  }'\n```\n\n##### 音频支持（新增）\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fgenerate\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": 
\"mlx-community\u002Fgemma-3n-E2B-it-4bit\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": [\n          { \"type\": \"text\", \"text\": \"请描述这些音频文件中听到的内容\" },\n          { \"type\": \"input_audio\", \"input_audio\": \"\u002Fpath\u002Fto\u002Faudio1.wav\" },\n          { \"type\": \"input_audio\", \"input_audio\": \"https:\u002F\u002Fexample.com\u002Faudio2.mp3\" }\n        ]\n      }\n    ],\n    \"stream\": true,\n    \"max_tokens\": 500\n  }'\n```\n\n##### 多模态（图像 + 音频）\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fgenerate\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002Fgemma-3n-E2B-it-4bit\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": [\n          {\"type\": \"input_image\", \"image_url\": \"\u002Fpath\u002Fto\u002Fimage.jpg\"},\n          {\"type\": \"input_audio\", \"input_audio\": \"\u002Fpath\u002Fto\u002Faudio.wav\"}\n        ]\n      }\n    ],\n    \"max_tokens\": 100\n  }'\n```\n\n##### 响应端点\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fresponses\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002FQwen2-VL-2B-Instruct-4bit\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": [\n          {\"type\": \"input_text\", \"text\": \"这张图片里有什么？\"},\n          {\"type\": \"input_image\", \"image_url\": \"\u002Fpath\u002Fto\u002Fimage.jpg\"}\n        ]\n      }\n    ],\n    \"max_tokens\": 100\n  }'\n```\n\n#### 请求参数\n\n- `model`: 模型标识符（必填）\n- `messages`: 用于聊天\u002FOpenAI 兼容端点的对话消息\n- `max_tokens`: 最大生成标记数\n- `temperature`: 采样温度\n- `top_p`: Top-p 采样参数\n- `top_k`: Top-k 采样截断值\n- `min_p`: Min-p 采样阈值\n- `repetition_penalty`: 重复标记惩罚因子\n- `stream`: 启用流式响应\n\n\n## 激活量化（CUDA）\n\n当在配备 MLX CUDA 的 NVIDIA GPU 上运行时，使用 `mxfp8` 或 `nvfp4` 模式量化的模型需要启用激活量化才能正常工作。这会将 `QuantizedLinear` 层转换为 `QQLinear` 层，从而对权重和激活都进行量化。\n\n### 命令行\n\n使用 `-qa` 或 
`--quantize-activations` 标志：\n\n```sh\nmlx_vlm.generate --model \u002Fpath\u002Fto\u002Fmxfp8-model --prompt \"描述这张图片\" --image \u002Fpath\u002Fto\u002Fimage.jpg -qa\n```\n\n### Python API\n\n在 `load` 函数中传入 `quantize_activations=True`：\n\n```python\nfrom mlx_vlm import load, generate\n\n# 启用激活量化后加载模型\nmodel, processor = load(\n    \"path\u002Fto\u002Fmxfp8-quantized-model\",\n    quantize_activations=True\n)\n\n# 正常生成\noutput = generate(model, processor, \"描述这张图片\", image=[\"image.jpg\"])\n```\n\n### 支持的量化模式\n\n- `mxfp8`：8 位 MX 浮点数\n- `nvfp4`：4 位 NVIDIA 浮点数\n\n> **注意**：此功能是 CUDA 上 mxfp\u002Fnvfp 量化模型所必需的。而在 Apple Silicon（Metal）上，这些模型无需该标志即可正常工作。\n\n## 多图像聊天支持\n\nMLX-VLM 支持使用部分模型同时分析多张图像。此功能能够实现更复杂的视觉推理任务，并在单次对话中对多张图像进行全面分析。\n\n\n### 使用示例\n\n#### Python 脚本\n\n```python\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\nfrom mlx_vlm.utils import load_config\n\nmodel_path = \"mlx-community\u002FQwen2-VL-2B-Instruct-4bit\"\nmodel, processor = load(model_path)\nconfig = model.config\n\nimages = [\"path\u002Fto\u002Fimage1.jpg\", \"path\u002Fto\u002Fimage2.jpg\"]\nprompt = \"比较这两张图片。\"\n\nformatted_prompt = apply_chat_template(\n    processor, config, prompt, num_images=len(images)\n)\n\noutput = generate(model, processor, formatted_prompt, images, verbose=False)\nprint(output)\n```\n\n#### 命令行\n\n```sh\nmlx_vlm.generate --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt \"比较这些图片\" --image path\u002Fto\u002Fimage1.jpg path\u002Fto\u002Fimage2.jpg\n```\n\n## 视频理解\n\nMLX-VLM 还支持视频分析，例如字幕生成、摘要提取等，适用于部分模型。\n\n### 支持的模型\n\n以下模型支持视频聊天：\n\n1. Qwen2-VL\n2. Qwen2.5-VL\n3. Idefics3\n4. 
LLaVA\n\n更多模型即将推出。\n\n### 使用示例\n\n#### 命令行\n```sh\nmlx_vlm.video_generate --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt \"描述这段视频\" --video path\u002Fto\u002Fvideo.mp4 --max-pixels 224 224 --fps 1.0\n```\n\n\n这些示例展示了如何使用 MLX-VLM 处理多张图片，以完成更复杂的视觉推理任务。\n\n## 视觉特征缓存\n\n在关于同一张图片的多轮对话中，即使图片没有变化，视觉编码器也会在每一轮都运行。`VisionFeatureCache` 会将投影后的视觉特征存储在一个基于图片路径的 LRU 缓存中，这样昂贵的视觉编码器就只需对每张唯一图片调用一次。\n\n### 工作原理\n\n1. **第一轮（缓存未命中）** —— `encode_image()` 会运行完整的视觉处理流程（视觉塔 + 投影器），并将结果存储在缓存中，然后传递给语言模型。\n2. **后续轮次（缓存命中）** —— 直接通过 `cached_image_features` 传递缓存中的特征，完全跳过视觉编码器。\n3. **图片切换** —— 当图片发生变化时，会生成一个新的缓存键，因此需要重新计算并缓存特征。如果再次切换回之前的图片，则会命中缓存。\n\n缓存最多可容纳 8 个条目（可配置），并采用 LRU 策略进行淘汰。\n\n### 命令行界面\n\n所有聊天界面都会自动使用 `VisionFeatureCache`：\n\n```sh\n# Gradio 聊天界面\npython -m mlx_vlm.chat_ui --model google\u002Fgemma-4-26b-a4b-it\n\n# 带有 Rich UI 的交互式聊天（使用 \u002Fimage 命令加载图片）\npython -m mlx_vlm.chat --model google\u002Fgemma-4-26b-a4b-it\n\n# 内联聊天模式\npython -m mlx_vlm.generate \\\n  --model google\u002Fgemma-4-26b-a4b-it \\\n  --image path\u002Fto\u002Fimage.jpg \\\n  --chat \\\n  --max-tokens 200\n```\n\n### Python 示例\n\n```python\nfrom mlx_vlm import load, stream_generate, VisionFeatureCache\nfrom mlx_vlm.prompt_utils import apply_chat_template\n\nmodel, processor = load(\"google\u002Fgemma-4-26b-a4b-it\")\ncache = VisionFeatureCache()\n\nimage = \"path\u002Fto\u002Fimage.jpg\"\n\n# 第一轮 —— 缓存未命中，编码图片\nprompt1 = apply_chat_template(processor, model.config, \"请描述这张图片。\", num_images=1)\nfor chunk in stream_generate(model, processor, prompt1, image=[image],\n                              max_tokens=200, vision_cache=cache):\n    print(chunk.text, end=\"\")\n\n# 第二轮 —— 缓存命中，跳过视觉编码器\nprompt2 = apply_chat_template(processor, model.config, \"你看到了哪些颜色？\", num_images=1)\nfor chunk in stream_generate(model, processor, prompt2, image=[image],\n                              max_tokens=200, vision_cache=cache):\n    print(chunk.text, end=\"\")\n```\n\n### 
服务器端\n\n服务器会自动为同一张图片的多个请求缓存视觉特征。无需任何配置——缓存会在模型加载时创建，并在卸载时清空。\n\n```sh\nmlx_vlm.server --model google\u002Fgemma-4-26b-a4b-it\n```\n\n通过 `\u002Fv1\u002Fchat\u002Fcompletions`（流式和非流式）以及 `\u002Fresponses` 接口进行的多轮对话都将受益于这一功能。同一张图片在多个请求中只会被编码一次。\n\n### 性能\n\n在 `google\u002Fgemma-4-26b-a4b-it` 模型上进行了 10 轮多轮对话测试：\n\n| 指标         | 无缓存       | 有缓存       |\n|--------------|-------------|-------------|\n| 提示词 TPS   | ~48         | ~550-825    |\n| 加速倍数     | --          | **11 倍以上** |\n| 峰值内存     | 52.66 GB    | 52.66 GB（持平） |\n\n生成速度（约 31 tok\u002Fs）和内存占用不受影响——只有提示词的处理速度得到了提升。\n\n## TurboQuant KV 缓存\n\nTurboQuant 在生成过程中压缩 KV 缓存，从而在保持质量的同时，以更少的内存实现更长的上下文长度。\n\n### 快速入门\n\n```sh\n# 3.5 位 KV 缓存量化（3 位键 + 4 位值）\nmlx_vlm generate \\\n  --model mlx-community\u002FQwen3.5-4B-4bit \\\n  --kv-bits 3.5 \\\n  --kv-quant-scheme turboquant \\\n  --prompt \"您的长提示在这里...\"\n```\n\n```python\nfrom mlx_vlm import generate\n\nresult = generate(\n    model, processor, prompt,\n    kv_bits=3.5,\n    kv_quant_scheme=\"turboquant\",\n    max_tokens=256,\n)\n```\n\n```sh\n# 启用 TurboQuant 的服务器\nmlx_vlm server \\\n  --model google\u002Fgemma-4-26b-a4b-it \\\n  --kv-bits 3.5 \\\n  --kv-quant-scheme turboquant\n```\n\n### 工作原理\n\nTurboQuant 使用随机旋转 + 码本量化（[arXiv:2504.19874](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.19874)）将 KV 缓存条目从 16 位压缩到每维 2–4 位：\n\n- **键与值**：采用 Hadamard 旋转的 MSE 码本量化\n- **分数位**（如 3.5 位）：使用较低的位数表示键，较高的位数表示值（3 位键 + 4 位值）\n\n自定义 Metal 内核可以直接在打包的量化数据上融合打分计算和值聚合，避免在解码过程中进行完全的反量化。\n\n### 性能\n\n在 Qwen3.5-4B-4bit 模型上，使用 128k 上下文长度进行测试：\n\n| 指标           | 基线        | TurboQuant 3.5 位 |\n|----------------|-------------|------------------|\n| KV 内存        | 4.1 GB      | 0.97 GB（**减少 76%**） |\n| 峰值内存       | 18.3 GB     | 17.3 GB（**减少 1.0 GB**） |\n\n在 512k+ 上下文长度下，由于减少了内存带宽需求，TurboQuant 的逐层注意力计算速度**比 FP16 SDPA 更快**。\n\n在 gemma-4-31b-it 模型上，使用 128k 上下文长度进行测试：\n\n| 指标           | 基线        | TurboQuant 3.5 位 |\n|----------------|-------------|------------------|\n| KV 内存        | 13.3 GB    
 | 4.9 GB（**减少 63%**） |\n| 峰值内存       | 75.2 GB     | 65.8 GB（**减少 9.4 GB**） |\n\n### 支持的位宽\n\n| 位数 | 压缩比 | 最佳用途 |\n|------|--------|----------|\n| 2    | ~8x    | 最大压缩，质量略有损失 |\n| 3    | ~5x    | 质量与压缩之间的良好平衡 |\n| 3.5  | ~4.5x  | 推荐默认设置（3 位键 + 4 位值） |\n| 4    | ~4x    | 质量最佳，压缩适度 |\n\n### 兼容性\n\nTurboQuant 会自动量化 `KVCache` 层（全局注意力）。对于具有 `RotatingKVCache`（滑动窗口）或 `ArraysCache`（MLA\u002F吸收键）的模型，它们的原生缓存格式将保持不变，因为这些层本身已经非常高效地利用了内存。\n\n# 微调\n\nMLX-VLM 支持使用 LoRA 和 QLoRA 对模型进行微调。\n\n## LoRA 和 QLoRA\n\n要了解更多关于 LoRA 的信息，请参阅 [LoRA.md](.\u002Fmlx_vlm\u002FLORA.MD) 文件。","# MLX-VLM 快速上手指南\n\nMLX-VLM 是一个专为 Mac 设计的开源工具包，支持在本地运行和微调视觉语言模型（VLM）及多模态模型（支持音频和视频）。它基于 Apple MLX 框架，能够高效利用 Mac 的 GPU 进行推理。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**: macOS (推荐 macOS 14.0 及以上)\n- **硬件**: Apple Silicon 芯片 (M1, M2, M3 系列等)\n- **Python 版本**: Python 3.9 - 3.12\n\n### 前置依赖\n确保已安装 `pip` 和基础开发工具。如果是首次使用 MLX 相关工具，建议先更新 pip：\n\n```bash\npython3 -m pip install --upgrade pip\n```\n\n> **注意**：本工具主要针对 Apple Silicon Mac 优化。虽然部分功能支持 NVIDIA GPU (CUDA)，但核心体验建议在 Mac 上进行。\n\n## 安装步骤\n\n通过 pip 直接安装最新版本的 `mlx-vlm`：\n\n```bash\npip install -U mlx-vlm\n```\n\n如果需要从国内镜像源加速安装（推荐中国开发者使用）：\n\n```bash\npip install -U mlx-vlm -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n## 基本使用\n\n安装完成后，你可以通过命令行、Python 脚本或启动本地服务器来使用模型。\n\n### 1. 
命令行界面 (CLI)\n\n这是最简单的使用方式，支持文本、图像、音频及多模态输入。\n\n**纯文本生成：**\n```sh\nmlx_vlm.generate --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt \"Hello, how are you?\"\n```\n\n**图像理解（传入图片 URL 或本地路径）：**\n```sh\nmlx_vlm.generate --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit --max-tokens 100 --temperature 0.0 --image http:\u002F\u002Fimages.cocodataset.org\u002Fval2017\u002F000000039769.jpg\n```\n\n**音频分析（需支持音频的模型）：**\n```sh\nmlx_vlm.generate --model mlx-community\u002Fgemma-3n-E2B-it-4bit --max-tokens 100 --prompt \"Describe what you hear\" --audio \u002Fpath\u002Fto\u002Faudio.wav\n```\n\n**多模态输入（图像 + 音频）：**\n```sh\nmlx_vlm.generate --model mlx-community\u002Fgemma-3n-E2B-it-4bit --max-tokens 100 --prompt \"Describe what you see and hear\" --image \u002Fpath\u002Fto\u002Fimage.jpg --audio \u002Fpath\u002Fto\u002Faudio.wav\n```\n\n### 2. 启动聊天界面 (Gradio)\n\n快速启动一个网页版的聊天交互界面：\n\n```sh\nmlx_vlm.chat_ui --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit\n```\n启动后，终端会显示本地访问地址（通常是 `http:\u002F\u002F127.0.0.1:7860`），在浏览器打开即可对话。\n\n### 3. Python 脚本调用\n\n在你的项目中集成 MLX-VLM：\n\n```python\nimport mlx.core as mx\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\nfrom mlx_vlm.utils import load_config\n\n# 加载模型\nmodel_path = \"mlx-community\u002FQwen2-VL-2B-Instruct-4bit\"\nmodel, processor = load(model_path)\nconfig = load_config(model_path)\n\n# 准备输入 (支持本地路径或 URL)\nimage = [\"http:\u002F\u002Fimages.cocodataset.org\u002Fval2017\u002F000000039769.jpg\"]\nprompt = \"Describe this image.\"\n\n# 应用聊天模板\nformatted_prompt = apply_chat_template(\n    processor, config, prompt, num_images=len(image)\n)\n\n# 生成输出\noutput = generate(model, processor, formatted_prompt, image, verbose=False)\nprint(output)\n```\n\n### 4. 
启动 API 服务器 (FastAPI)\n\n将模型部署为兼容 OpenAI 格式的 API 服务：\n\n```sh\n# 启动默认服务\nmlx_vlm.server --port 8080\n\n# 预加载特定模型\nmlx_vlm.server --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit --port 8080\n```\n\n**调用示例 (curl)：**\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fchat\u002Fcompletions\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002FQwen2-VL-2B-Instruct-4bit\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": \"Hello, how are you\"\n      }\n    ],\n    \"stream\": true,\n    \"max_tokens\": 100\n  }'\n```\n\n> **提示**：更多高级功能（如思维链预算控制 `thinking-budget`、量化配置等）请参考官方完整文档。","一位独立开发者需要在配备 Apple Silicon 芯片的 MacBook Pro 上，快速构建一个能同时理解截图、文档照片及语音备注的多模态辅助工具，用于自动化整理每日工作日志。\n\n### 没有 mlx-vlm 时\n- **硬件闲置与依赖冲突**：无法直接利用 Mac 强大的统一内存运行大型视觉语言模型（VLM），被迫依赖昂贵的云端 GPU 或在配置复杂的 Linux 环境中挣扎。\n- **多模态处理割裂**：处理图像、文本和音频需要分别调用不同的 API 或模型，难以在一个流程中实现“看图听音写总结”的连贯操作。\n- **推理延迟高昂**：云端请求受网络波动影响大，且按次计费成本高，导致本地实时交互体验卡顿，无法流畅进行多轮对话。\n- **微调门槛极高**：若想针对特定业务场景（如识别特定格式发票）优化模型，缺乏在本地高效进行微调的工具链，数据隐私也难以保障。\n\n### 使用 mlx-vlm 后\n- **原生性能释放**：通过一行 `pip install` 即可在 Mac 上直接运行量化后的 VLM 和 Omni 模型，充分利用 MLX 框架加速，实现低延迟的本地推理。\n- **全模态一键融合**：利用 CLI 或 Python 脚本，单条命令即可同时输入图像和音频（如 `--image` 加 `--audio` 参数），让模型直接输出综合描述，流程丝滑顺畅。\n- **成本归零与隐私安全**：所有计算均在本地完成，无需联网发送敏感数据，彻底消除云 API 费用，且支持离线环境下的稳定运行。\n- **便捷定制化能力**：内置细调（Fine-tuning）功能，开发者可轻松使用私有数据集在本地优化模型，使其更精准地适应特定办公场景。\n\nmlx-vlm 将 Mac 变成了强大的本地多模态 AI 工作站，让开发者能以零成本、高隐私的方式轻松落地复杂的视觉与听觉理解应用。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FBlaizzy_mlx-vlm_1df44ee7.png","Blaizzy","Prince Canuma","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FBlaizzy_6caa5b9d.png","MLOps | LLMs | VLMs | Audio LMs | Ex - ML Research Engineer @arcee-ai 
","kulissiwa.com","Poland","prince.gdt@gmail.com","Prince_Canuma","medium.com\u002F@prince.canuma","https:\u002F\u002Fgithub.com\u002FBlaizzy",[84],{"name":85,"color":86,"percentage":87},"Python","#3572A5",100,4338,472,"2026-04-14T17:58:25","MIT","macOS","非必需。主要设计用于 Apple Silicon (M1\u002FM2\u002FM3 等) Mac 的统一内存架构。若使用 NVIDIA GPU 需配合 MLX CUDA 后端，并可能需要激活量化 (Activation Quantization)。","未说明 (取决于所选模型大小，利用 macOS 统一内存)",{"notes":96,"python":97,"dependencies":98},"该工具专为 macOS 上的 Apple Silicon 芯片优化，利用 MLX 框架进行推理和微调。支持视觉语言模型 (VLM) 及包含音频\u002F视频支持的全能模型。若在 NVIDIA GPU 上运行 (实验性)，需启用激活量化以支持 mxfp8 或 nvfp4 量化模式。安装可通过 pip 直接完成。","未说明",[99,64,100,101,102],"mlx","gradio","fastapi","numpy",[35,14],[105,106,99,107,108,109,110,111,112,113,114,115,116],"llava","llm","vision-transformer","apple-silicon","idefics","local-ai","paligemma","vision-framework","vision-language-model","florence2","molmo","pixtral",null,"2026-03-27T02:49:30.150509","2026-04-15T10:59:57.094222",[121,126,130,135,140,144,148],{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},34183,"mlx-vlm 框架具体支持哪些模型？","目前并没有一份详尽的文档列出所有支持的模型。维护者建议参考其在 Hugging Face 上发布的模型列表作为测试通过的示例。对于新添加的模型类型，通常会在发布时提供测试用的具体模型示例。如果遇到问题，建议从全新的代码克隆开始尝试，避免文件同步工具（如 iCloud）导致的配置文件缺失或损坏。","https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fissues\u002F80",{"id":127,"question_zh":128,"answer_zh":129,"source_url":125},34184,"为什么加载某些模型（如 Pixtral）时会报错找不到 config.json？","这通常是因为模型未被完全支持或配置文件路径错误。部分用户反馈，即使仓库中有 config.json，如果 model_type 不匹配（例如强行将 Pixtral 替换为 
llava）也会报错。建议不要手动修改模型类型，而是等待框架原生支持该模型架构。此外，确保本地缓存目录（~\u002F.cache\u002Fhuggingface\u002F）中的模型文件完整，必要时删除缓存重新下载。",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},34185,"如何高效地向维护者报告模型运行错误或异常输出？","维护者建议不要发布冗长的完整日志或所有模型的测试结果。请仅提交“失败”或“可疑”的案例，并附上简短的描述。如果能提供具体的输入图片和对应的错误输出（而非成功的案例），将极大帮助维护者快速定位和诊断问题。长篇大论的线程很难在每次发布时被逐一检查。","https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fissues\u002F147",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},34186,"不同模型在生成图像描述时的性能和表现有何差异？","根据社区测试：\n1. SmolVLM 系列：速度快（约 3-4 秒），内存占用低，能生成简洁准确的描述。\n2. PaliGemma 系列：速度中等，有时输出过于简略（如仅输出关键词）。\n3. Phi-3.5-Vision：速度较慢，内存占用高，描述详细但可能缺乏特定细节。\n4. Llama-3.2-11B\u002F90B：在普通 Mac 上极慢或无法运行（需 90GB+ 内存），且生成的描述不一定准确。\n5. Florence-2：在某些配置下可能产生乱码。\n建议根据硬件资源选择 SmolVLM 用于快速任务，或更大模型用于需要详细推理的任务。","https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fissues\u002F375",{"id":141,"question_zh":142,"answer_zh":143,"source_url":139},34187,"为什么有些模型生成的输出包含奇怪标签（如 \u003Cend_of_utterance>）或格式混乱？","这是由于不同模型的训练数据和输出格式不一致导致的。有些模型倾向于输出 Markdown 格式，有些则包含特殊的结束符标签。这属于模型本身的特性而非框架错误。用户可以通过调整提示词（Prompt）或在后处理步骤中编写脚本过滤掉这些特殊标签和格式化符号来获得更干净的文本。",{"id":145,"question_zh":146,"answer_zh":147,"source_url":134},34188,"在 Apple Silicon Mac 上运行大型视觉语言模型有什么硬件限制？","在配备 128GB 内存的 Mac 上，部分超大模型（如 Llama-3.2-90B 或未量化的 Llama-3.2-11B）仍然可能因内存不足无法运行或极其缓慢。例如，Llama-3.2-11B 原生版本可能占用超过 90GB 内存。建议使用经过量化（如 4bit, 6bit, 8bit）的社区版本（如 mlx-community 下的模型），或者选择参数量较小的模型（如 SmolVLM, Phi-3.5-mini）以获得流畅体验。",{"id":149,"question_zh":150,"answer_zh":151,"source_url":139},34189,"是否有工具可以批量测试多个模型的性能并生成对比报告？","社区用户开发了自定义脚本（如 check_models.py），可以扫描文件夹中的图片，批量运行不同模型，并记录耗时、内存峰值（Active\u002FCache\u002FPeak Mem）及输出结果。生成的报告可导出为 Markdown 或 HTML 格式。维护者表示欢迎此类工具，并建议将其整理为 CSV 格式或带有筛选功能的下拉列表，以便更快地过滤失败案例进行调试。",[153,158,163,168,173,178,183,188,193,198,203,208,213,218,223,228,233,238,243,248],{"id":154,"version":155,"summary_zh":156,"released_at":157},264077,"v0.4.4","## 变更内容\n* 修复 KV 共享模型的 Gemma 4 分块预填充问题，由 @Blaizzy 在 
https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F901 中完成\n* 修复 Gemma 4 视觉与文本结合时性能下降及处理器配置缺失的问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F906 中完成\n* 修复 Falcon-Perception 300M 模型问题，并将 generate_perception 功能移至模型内部，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F910 中完成\n* 修复 Gemma 4 工具解析器对嵌套参数的支持问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F916 中完成\n* 添加 VisionFeatureCache 用于多轮图像缓存，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F913 中完成\n* 修复 video_generate 和 smolvlm_video_generate CLI 命令失效的问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F919 中完成\n* 优化 TurboQuant Metal 内核：在保持 89% KV 缓存节省的前提下，性能提升至基准的 0.85 至 1.90 倍，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F909 中完成\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.4.3...v0.4.4","2026-04-04T15:18:28",{"id":159,"version":160,"summary_zh":161,"released_at":162},264078,"v0.4.3","## 变更内容\n* 添加 SAM 3.1，包含对象多路复用功能，并由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F880 中优化了实时推理流水线。\n* 添加 Falcon-OCR 模型支持，由 @Griffintaur 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F879 中实现。\n* 添加 RF-DETR 检测与分割模型，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F884 中实现。\n* 添加 Granite Vision 3.2 和 4.0 模型，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F885 中实现。\n* 为 CUDA 支持向视觉模型推理流水线添加 wired_limit 参数，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F887 中完成。\n* 添加 Falcon Perception 模型，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F888 中实现。\n* 修复（服务器）：默认禁用 Uvicorn 的自动重载功能以防止内存泄漏，由 @futurepitcher 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F883 
中完成。\n* 添加 Gemma 4 模型支持（视觉、音频、MoE），由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F890 中实现。\n* 修复 Gemma 4 嵌入缩放问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F893 中完成。\n* 添加 Turbo Quant 功能，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F858 中实现。\n* 移除 TurboQuant 基准测试相关文件，并添加 README 文档说明，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F894 中完成。\n\n## 新贡献者\n* @Griffintaur 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F879 中完成了首次贡献。\n* @futurepitcher 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F883 中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.4.2...v0.4.3","2026-04-02T21:14:40",{"id":164,"version":165,"summary_zh":166,"released_at":167},264079,"v0.4.2","## 变更内容\n* 确保模型卡片中包含最少的元数据，由 @pcuenca 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F853 中实现\n* (修复 qwen3_5)：为 qwen3_5 模型类型注册 AutoProcessor 补丁，由 @mdstaff 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F859 中实现\n* 为 qwen3_5_moe 自动处理器打补丁，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F867 中实现\n* 将 Qwen3.5 的 RMSNorm 输出转换为输入的数据类型，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F868 中实现\n* 修复 CLI 和服务器中的思考默认设置，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F869 中实现\n* 修复在没有 Torch 的情况下加载 LFM2-VL 处理器的问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F872 中实现\n* 修复 Magistral 图像标记扩展问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F873 中实现\n* 添加 Facebook Sam3，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F875 中实现\n* [Sam3] 实时视频中仅对掩码进行标注绘制，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F876 中实现\n* 
添加 DOTS-MOCR 处理器支持，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F874 中实现\n* 修复 PaliGemma 处理器关键字参数路由问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F877 中实现\n\n## 新贡献者\n* @mdstaff 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F859 中完成了首次贡献\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.4.1...v0.4.2","2026-03-28T21:21:31",{"id":169,"version":170,"summary_zh":171,"released_at":172},264080,"v0.4.1","## 变更内容\n* 功能（服务端）：添加 `--model` 和 `--adapter-path` 标志，用于启动时的预加载，由 @auggie246 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F811 中实现。\n* 修复（qwen3_5）：在连续批处理下防止批次维度不匹配的问题，由 @kol22 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F813 中实现。\n* 向服务端接口添加通用采样参数，由 @spicyneuron 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F814 中实现。\n* 按照 OpenAI 规范，为每个工具调用添加一个“索引”，由 @viktike 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F818 中实现。\n* 修复 Qwen3-Omni（qwen3_omni_moe）集成问题，由 @howeirdo 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F820 中实现。\n* 添加 Mistral4 模型，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F827 中实现。\n* 清理服务端及生成参数、验证逻辑和默认值，由 @spicyneuron 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F829 中实现。\n* 将自定义处理器设为默认，并移除对 PyTorch 的依赖，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F821 中实现。\n* 添加 Molmo 点模型，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F844 中实现。\n\n## 新贡献者\n* @auggie246 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F811 中完成了首次贡献。\n* @howeirdo 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F820 
中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.4.0...v0.4.1","2026-03-21T14:25:29",{"id":174,"version":175,"summary_zh":176,"released_at":177},264081,"v0.4.0","## 变更内容\n* 由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F751 中修复 gemma3n 的短提示问题\n* 由 @Goekdeniz-Guelmez 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F499 中添加全量权重微调功能\n* 由 @will-lms 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F753 中使用 _position_ids 初始化 qwen3_5_moe.LanguageModel\n* 由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F760 中修复批量推理以使用 InputEmbeddingsFeatures\n* 修复 (qwen3_vl): 当新图像\u002F视频到达时重置 _position_ids，由 @kol22 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F756 中完成\n* 修复：将 13 处 bare except 替换为 except Exception，由 @haosenwang1018 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F765 中完成\n* 为 Qwen3-VL-MoE 的图像输入重置位置 ID，由 @kol22 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F761 中完成\n* 修复仅文本的单批次问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F771 中完成\n* 为新的训练器更新 LORA.md，由 @Goekdeniz-Guelmez 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F775 中完成\n* 支持 OpenAI 兼容端点的 \u002Fv1\u002F 前缀，由 @spicyneuron 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F783 中完成\n* CORS 中间件，由 @viktike 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F766 中实现\n* fx：README 中的 curl 错误，由 @yinzhidong 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F788 中修复\n* Streaming chatml 响应增强，由 @viktike 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F764 中完成\n* 向服务器添加 KV 缓存（量化）参数，由 @viktike 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F776 中完成\n* 修复 mllama 训练并增加 mapp 中的更多模型，由 @Goekdeniz-Guelmez 在 
https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F777 中完成\n* 添加思考预算和标志位，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F789 中完成\n* 修复：在 chat_ui 入口处为 gradio 添加导入保护，由 @RtYkk 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F787 中完成\n* 服务器中的工具调用功能，由 @viktike 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F773 中实现\n* 添加 Minicpm-o-2.5，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F791 中完成\n* 为服务器添加可选的 prefill-step-size 命令行参数，由 @viktike 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F792 中完成\n* qwen3_omni_moe：修复 visual_embeds_multiscale 的 UnboundLocalError，由 @ronaldseoh 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F794 中完成\n* 添加 Phi-4-reasoning-vision-15B (phi4-siglip)，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F796 中完成\n* 添加 phi4mm，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F797 中完成\n* 添加 ensure_fused_sdpa 函数以优化注意力计算，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F800 中完成\n* [Internvl_chat] 为 LanguageModel 添加可选的 kwargs 参数，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F801 中完成\n* 保护加载图像和音频的功能，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F803 中完成\n* [Qwen3.5 MoE] 修复量化谓词，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F804 中完成\n* 添加 orpo，由 @Goekdeniz-Guelmez 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F795 中完成\n* 添加 moondream3，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm 中完成","2026-03-07T19:01:10",{"id":179,"version":180,"summary_zh":181,"released_at":182},264082,"v0.3.12","## 变更内容\n* [MODEL] 由 @JJJYmmm 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F722 中添加对 Qwen3.5 系列的支持\n* qwen3_omni_moe：修复解构错误，由 
@Nixuge 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F723 中完成\n* 文档：澄清部分模型需要 [torch] 依赖以使用 torchvision，由 @Mr-Neutr0n 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F716 中完成\n* 使用 mlx-lm 提供的 logits 处理器和采样器，由 @AG6GR 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F724 中完成\n* 修复当配置中使用 image_token_id 时 LoRA 训练崩溃的问题，由 @JesseRod329 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F720 中完成\n* video_generate.py：在 is_video_file 中处理大写扩展名，由 @Nixuge 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F719 中完成\n* 尊重量化配置，由 @pcuenca 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F692 中完成\n* [Ministral] 为模型权重添加 FP8 反量化功能，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F727 中完成\n* 修复 LFM2.5VL 处理器兼容性补丁中的图像占位符数量问题，由 @mattjcly 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F725 中完成\n* 为 QQLinear 层添加激活量化支持，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F728 中完成\n* 重构 utils.py 中的处理器返回类型，改为使用 ProcessorMixin，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F729 中完成\n* 修复 glm（4v 和 4v moe）的广播问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F731 中完成\n* [FastVLM] 修复 dtype 转换问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F733 中完成\n* 修复 kimi 的广播问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F734 中完成\n* 为 fastvlm 添加自定义处理器，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F736 中完成\n* 修复 paligemma 的预填充和数值精度问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F737 中完成\n* 修复 idefics3 llama3 和 smovlm 的问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F738 中完成\n* 文档（README）：修复 curl 示例中的 JSON 格式问题，由 @chsdwn 在 
https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F739 中完成\n* 修复 qwen3.5 moe 的分块预填充问题，由 @JJJYmmm 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F742 中完成\n* 修复 florence-2（处理器和推理）的问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F743 中完成\n* 修复 Qwen3.5 的类型转换和谓词问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F744 中完成\n* [Qwen2, 2.5] 修复视觉溢出问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F745 中完成\n* [Ministral3] 修复多图像问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F747 中完成\n* [Phiv3] 修复 dtype 转换问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F748 中完成\n* 添加 dots-ocr 功能，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F749 中完成\n\n## 新贡献者\n* @Nixuge 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F723 中完成了首次贡献\n* @Mr-Neutr0n 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F716 中完成了首次贡献\n* @AG6GR 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F724 中完成了首次贡献\n* @JesseRod329 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F720 中完成了首次贡献\n* @chsdwn 完成了其首…","2026-02-16T23:38:54",{"id":184,"version":185,"summary_zh":186,"released_at":187},264083,"v0.3.11","## 变更内容\n* @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F694 中重构了 _generate_batch 函数中的输入嵌入处理逻辑。\n* [Qwen2-VL] @hturbe 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F704 中修复了 Vision 模块中错误的注意力掩码。\n* @Blaizzy 和 @mikolaj92 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F706 中新增了 GLM-OCR 功能。\n* 修复：当模型加载失败时，记录底层 ImportError 错误，由 @antonvice 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F710 中完成。\n* @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F699 
中修复了奇怪的限制问题和 deepstack 评估，并添加了预填充步长参数。\n* [PaddleOCR] @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F712 中修复了硬编码的处理器配置。\n* [SmolVLM] @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F713 中重构了 Attention 类，使其能够动态计算 n_kv_heads。\n\n## 新贡献者\n* @hturbe 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F704 中完成了首次贡献。\n* @mikolaj92 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F706 中完成了首次贡献。\n* @antonvice 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F710 中完成了首次贡献。\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.3.10...v0.3.11","2026-02-04T20:11:02",{"id":189,"version":190,"summary_zh":191,"released_at":192},264084,"v0.3.10","## 变更内容\n* 修复 lighton 的 qk_norm，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F615 中完成\n* 修复 GLM-4.6V 的视觉类型，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F616 中完成\n* 添加对 rope 参数的支持 [GLM-4.6V MoE]，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F617 中完成\n* 增强聊天 UI，由 @ivanfioravanti 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F619 中完成\n* 添加对 TokenizersBackend 的支持，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F621 中完成\n* UI 改进，由 @ivanfioravanti 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F620 中完成\n* 文档：修复 README 中 \u002Fchat\u002Fcompletion 端点的拼写错误，由 @zenyr 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F623 中完成\n* 添加对 HunyuanOCR (hunyuan_vl) 的支持，由 @voxmenthe 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F604 中完成\n* 添加批量生成功能，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F538 中完成\n* 添加 autoflake（移除未使用的导入），由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F628 中完成\n* 版本升级至 
0.3.10，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F632 中完成\n* 修复 qwen3_vl_moe 的 kwargs 转发问题，由 @mattjcly 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F633 中完成\n* 添加对 Jina VLM 的支持，由 @hanxiao 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F631 中完成\n* 修复：在文件复制时跳过 index.json，以防止覆盖，由 @mzau 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F638 中完成\n* 添加 Molmo2 支持，由 @aliyovic 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F639 中完成\n* 修复 glm46v，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F629 中完成\n* 将 qwen3omni 迁移到 MLX，由 @hellopahe 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F598 中完成\n* 修复音频加载问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F642 中完成\n* 修复来自批量生成 PR 的 pixtral 仅文本回归问题，由 @mzau 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F644 中完成\n* 修复：当提示中没有图片时，修复 mlx-vlm.generate --chat 功能，由 @Deekshith-Dade 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F648 中完成\n* 增强 load_config，使其包含 generation_config.json 并提取 eos…，由 @cubist38 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F650 中完成\n* 添加 LFM2.5-VL，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F653 中完成\n* 添加 MXFP4 量化支持，由 @zhutao100 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F514 中完成\n* 添加对 nvfp4 和 mxfp8 的支持，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F657 中完成\n* 修复服务器端的音频、聊天功能以及信任远程代码的问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F660 中完成\n* 修复分词器问题，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F662 中完成\n* 将 kwargs 传递到 get_model_path 中的 snapshot_download，由 @Deekshith-Dade 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F663 
中完成\n* 撤销“将 kwargs 传递到 get_model_path 中的 snapshot_download”操作，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F666 中完成\n* 【模型】添加 PaddleOCR-VL 模型支持，由 @zhang-prog 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F656 中完成\n* 移除 _has_bytelevel_pretokenizer，由 @pcuenca 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F668 中完成\n* 新增结构化输出功能，由 @cubist38 在 https","2026-01-28T22:25:15",{"id":194,"version":195,"summary_zh":196,"released_at":197},264085,"v0.3.9","## 变更内容\n* 修复 qwen3_vl 中的 ValueError: 图像特征与图像标记不匹配，由 @ziya32 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F608 中完成\n* 新增 ministral3 模型，由 @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F611 中添加\n\n## 新贡献者\n* @ziya32 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F608 中完成了首次贡献\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.3.8...v0.3.9","2025-12-03T21:47:51",{"id":199,"version":200,"summary_zh":201,"released_at":202},264086,"v0.3.8","## 变更内容\n* @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F589 中添加了 chat_ui 命令\n* @Blaizzy 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F592 中修复了使用 OpenAI SDK 调用 \u002Fchat\u002Fcompletions 时出现的 422 错误 #590\n* @awni 在 https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F599 中修复了 qwen3_vl 问题\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.3.7...v0.3.8","2025-11-27T01:34:08",{"id":204,"version":205,"summary_zh":206,"released_at":207},264087,"v0.3.7","## What's Changed\r\n* update readme with new openai endpoints details by @mguella in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F585\r\n* Fix lighton OCR empty result on main by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F587\r\n* Add model revision loading by 
@Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F588\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.3.6...v0.3.7","2025-11-17T16:04:18",{"id":209,"version":210,"summary_zh":211,"released_at":212},264088,"v0.3.6","## What's Changed\r\n* Fix: Input cast error by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F559\r\n* Fix qwen-vl position ids (2.5 & 3) by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F564\r\n* Add Evals by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F563\r\n* Fix Qwen3-VL Attention by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F566\r\n* Update evals init by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F567\r\n* Fix Qwen3-VL multi image reshape by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F569\r\n* Add processor args to DeepSeekOCR by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F570\r\n* host and port params for server by @mguella in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F568\r\n* FastVLM by @pcuenca in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F502\r\n* Add support for InternVL3 by @iRonJ in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F540\r\n* Add Z.ai GLM-4.1v by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F572\r\n* Make rope deltas private  by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F573\r\n* Add example notebook for interleaving text and images in prompts by @Copilot in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F574\r\n* [Bugfix] fix mrope in qwen2vl and qwen2.5vl by @JJJYmmm in 
https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F576\r\n* Add lighton-ocr by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F550\r\n* changed image parameter instead of files in stream_generate by @Manikandan-t in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F521\r\n* Remove auto config loading by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F577\r\n* [BugFix][Qwen3VL] fix deepstack and multi-image inference by @JJJYmmm in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F581\r\n* openai compatible endpoints by @mguella in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F580\r\n* Bump version to 0.3.6 by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F582\r\n\r\n## New Contributors\r\n* @mguella made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F568\r\n* @iRonJ made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F540\r\n* @Copilot made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F574\r\n* @JJJYmmm made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F576\r\n* @Manikandan-t made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F521\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.3.5...v0.3.6","2025-11-14T22:34:00",{"id":214,"version":215,"summary_zh":216,"released_at":217},264089,"v0.3.5","## What's Changed\r\n* Remove docs actions temporarily by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F536\r\n* Fix load_image for URLs by @pcuenca in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F534\r\n* Add Deepseek ocr by @Blaizzy 
in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F541\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.3.4...v0.3.5","2025-10-26T22:43:08",{"id":219,"version":220,"summary_zh":221,"released_at":222},264090,"v0.3.4","## What's Changed\r\n* Add n_kv_heads property used in LM Studio to glm4v_moe.LanguageModel by @hehua2008 in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F472\r\n* Fix: rope rotation (GLM-4.5v) by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F481\r\n* Remove scipy dep by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F482\r\n* Add CUDA and CPU as optional deps by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F483\r\n* Fix Deepseek vl2 chat template by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F488\r\n* Fix deepseek-vl default chat template by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F490\r\n* Fix smolvlm video generate by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F491\r\n* Fix base64 encoded images by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F493\r\n* Map Apriel configs to pixtral and fix prompt formatting by @ivanfioravanti in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F518\r\n* Fix video understanding NB by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F519\r\n* Fix fine-tuning bug in trainer.py by @avishekjana in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F473\r\n* Bump minimum required Python version to 3.10 by @dokterbob in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F485\r\n* Add Qwen3-VL (Dense & MoE) by @Blaizzy & @vincentamato in 
https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F528\r\n* Fix video Qwen3-VL by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F529\r\n* Fix Qwen3 VL (Dense) Sanitize by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F531\r\n* Bump version to 0.3.4 by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F535\r\n\r\n## New Contributors\r\n* @hehua2008 made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F472\r\n* @ivanfioravanti made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F518\r\n* @avishekjana made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F473\r\n* @dokterbob made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F485\r\n* @vincentamato made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F529\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.3.3...v0.3.4","2025-10-14T08:00:34",{"id":224,"version":225,"summary_zh":226,"released_at":227},264091,"v0.3.3","## What's Changed\r\n* fix changelog task by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F441\r\n* Add LFM2-VL by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F460\r\n* [llava_next] Fix config inheritance by @neilmehta24 in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F448\r\n* Kimi_VL: Fix activation args by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F465\r\n* Add GLM-4-5V by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F458\r\n* External access to LFM2-VL merge-input-IDs method by @christian-lms in 
https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F466\r\n* Add Command-A-Vision by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F467\r\n* [kernels] Use a header for bicubic_interpolate for compatibility with macOS \u003C 15 by @mattjcly in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F469\r\n\r\n## New Contributors\r\n* @christian-lms made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F466\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.3.2...v0.3.3","2025-08-20T14:52:15",{"id":229,"version":230,"summary_zh":231,"released_at":232},264092,"v0.3.2","## What's Changed\r\n* Fix energy calc in omni.py by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F427\r\n* Feat(build): migrate to `pyproject.toml` by @SauravMaheshkar in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F282\r\n* Fix quant predicate by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F430\r\n* Load dependencies from requirements.txt by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F431\r\n* Fix broken wheel builds caused by ambiguous package spec by @neilmehta24 in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F432\r\n* Make UI and audio dependencies optional by @zhnext in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F433\r\n* Cleanup: Refactor config by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F437\r\n* Add cuda support by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F438\r\n* Fix\u002Fserver module exposure by @zhnext in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F434\r\n* Support text only training by @Goekdeniz-Guelmez in 
https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F424\r\n* Fix phi3_v and molmo mask by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F440\r\n\r\n## New Contributors\r\n* @zhnext made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F433\r\n* @Goekdeniz-Guelmez made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F424\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.3.1...v0.3.2","2025-07-22T16:12:40",{"id":234,"version":235,"summary_zh":236,"released_at":237},264093,"v0.3.1","## What's Changed\r\n* fix(chat-ui): Fix imports, blocking Chat-ui server start by @zenyr in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F420\r\n* Fix server empty image for Gemma3n by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F422\r\n* [Gemma3n] Add hooks for image embedding computation by @will-lms in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F407\r\n* [gemma3n] Fix OCR after weight re-upload by @neilmehta24 in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F425\r\n* Add gemma3n omni example by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F426\r\n\r\n## New Contributors\r\n* @zenyr made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F420\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.3.0...v0.3.1","2025-07-12T16:21:01",{"id":239,"version":240,"summary_zh":241,"released_at":242},264094,"v0.3.0","## What's Changed\r\n* [gemma3n] Correctly scale text embeddings for quantized gemma3n conversions by @neilmehta24 in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F397\r\n* smolvlm_video example: fix typo in system prompt by 
@pcuenca in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F389\r\n* Fix gemma3n pixel casting by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F398\r\n* Fix audio model check and prompt utils by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F395\r\n* Add KV Quantization by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F401\r\n* Fix Gemma3n multi-task merging and update LM by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F405\r\n* [gemma3n] Fix vision encoder implementation of EdgeResidual and UniversalInvertedResidual by @neilmehta24 in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F410\r\n* fix: Remove unnecessary unicode_escape decoding for Chinese text input by @nicekate in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F403\r\n* Add support for Mixed Quant by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F413\r\n* Fix gemma3n Vision OCR + LM only responses by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F414\r\n* Fix generate signature by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F416\r\n* Add support for audio modality in server by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F417\r\n* Update server, readme and misc by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F418\r\n\r\n## New Contributors\r\n* @nicekate made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F403\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.2.0...v0.3.0","2025-07-05T17:29:18",{"id":244,"version":245,"summary_zh":246,"released_at":247},264095,"v0.2.0","## What's Changed\r\n* Fix Gemma 3 config by 
@DePasqualeOrg in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F384\r\n* Add option to save LoRA adapter after each training epoch by @breynolds007 in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F385\r\n* De-duplicate Mixtral3's merge_input_ids_with_image_features by @will-lms in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F390\r\n* Add support for Gemma3n (Omni) by @FL33TW00D, @pcuenca and @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F391\r\n* Add support for audio by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F391\r\n\r\n## New Contributors\r\n* @DePasqualeOrg made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F384\r\n* @will-lms made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F390\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.1.27...v0.2.0","2025-06-26T18:38:07",{"id":249,"version":250,"summary_zh":251,"released_at":252},264096,"v0.1.27","## What's Changed\r\n* Fix README POST request typo by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F358\r\n* Fix add_eos_token_ids early exit by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F356\r\n* Add max_position_embeddings to TextConfig dataclasses by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F360\r\n* Fix revision argument not passed in load by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F361\r\n* Add initial MkDocs setup by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F362\r\n* Update deploy-docs.yml by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F363\r\n* Update update-changelog.yml by @Blaizzy in 
https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F364\r\n* Update mkdocs configuration by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F365\r\n* Remove hardcoded pytorch install error message by @mattjcly in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F369\r\n* Update processor tests by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F370\r\n* Fix Kimi-VL Chat template by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F376\r\n* Fix multi images understand for Qwen2 and Qwen2.5 VL by @qnguyen3 in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F377\r\n* Qwen2.5 vl fix  by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F378\r\n* Bump mlx to v0.26.0 by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F381\r\n* Bump by @Blaizzy in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F382\r\n\r\n## New Contributors\r\n* @prncvrm made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F319\r\n* @qnguyen3 made their first contribution in https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fpull\u002F377\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fcompare\u002Fv0.1.26...v0.1.27","2025-06-08T17:32:29"]