[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-turboderp-org--exllamav3":3,"tool-turboderp-org--exllamav3":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":79,"owner_twitter":80,"owner_website":79,"owner_url":81,"languages":82,"stars":99,"forks":100,"last_commit_at":101,"license":102,"difficulty_score":103,"env_os":104,"env_gpu":105,"env_ram":106,"env_deps":107,"category_tags":117,"github_topics":79,"view_count":23,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":118,"updated_at":119,"faqs":120,"releases":149},2084,"turboderp-org\u002Fexllamav3","exllamav3","An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs ","ExLlamaV3 是一款专为现代消费级显卡打造的高性能大语言模型（LLM）推理库，旨在让用户在本地设备上高效运行各类先进 AI 模型。它核心解决了在有限硬件资源下，如何平衡模型运行速度与精度的难题，通过全新的 EXL3 量化格式（基于 QTIP 技术），显著降低了显存占用并提升了推理速度。\n\n该工具特别适合希望在个人电脑上部署私有化 AI 服务的开发者、技术研究人员以及资深极客用户。无论是构建本地知识库、开发智能应用，还是进行模型实验，ExLlamaV3 都能提供强有力的支持。其独特亮点包括灵活的张量并行与专家并行策略，能够充分利用多卡或多 GPU 环境；支持动态批处理以提升并发能力；并兼容 Hugging Face Transformers 生态，方便集成。此外，它还原生支持多种主流架构（如 Llama 3、Qwen 3.5、Gemma 2 等）及多模态任务，并可通过 TabbyAPI 快速搭建兼容 OpenAI 标准的本地服务接口。虽然部分高级功能（如 LoRA 微调、ROCm 支持）仍在完善中，但 ExLlamaV3 已为本地大模型推理树立了新的效率标杆。","\n# \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fturboderp-org_exllamav3_readme_a7bbb05a1f1e.png\" width=\"40\"> ExLlamaV3\n\nExLlamaV3 is an inference library for running local LLMs on modern consumer GPUs. 
Headline features:\n\n- New [EXL3](doc\u002Fexl3.md) quantization format based on QTIP\n- Flexible tensor-parallel and expert-parallel inference for consumer hardware setups\n- OpenAI-compatible server provided via [TabbyAPI](https:\u002F\u002Fgithub.com\u002Ftheroyallab\u002FtabbyAPI\u002F) \n- Continuous, dynamic batching\n- HF Transformers plugin (see [here](examples\u002Ftransformers_integration.py))\n- HF model support (see [supported architectures](#architecture-support))\n- Speculative decoding\n- 2-8 bit cache quantization\n- Multimodal support\n\nThe official and recommended backend server for ExLlamaV3 is [TabbyAPI](https:\u002F\u002Fgithub.com\u002Ftheroyallab\u002FtabbyAPI\u002F), which provides an OpenAI-compatible API for local or remote inference, with extended features like HF model downloading, embedding model support and support for HF Jinja2 chat templates.\n\n### ⚠️ Important\n\n- **Qwen3-Next** and **Qwen3.5** can take advantage of [Flash Linear Attention](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention), though this requires\n  Triton, and performance can be shaky due to the sporadic JIT compilation it imposes. [causal-conv1d](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fcausal-conv1d) is\n  supported and recommended but not required.\n- **Qwen3-Next** and **Qwen3.5** currently do not support tensor\u002Fexpert parallelism.\n\n## Architecture support\n\n- **AFM** (ArceeForCausalLM)\n- **Apertus** (ApertursForCausalLM)\n- **Command-R** etc. (CohereForCausalLM)\n- **Command-A**, **Command-R7B**, **Command-R+** etc. (Cohere2ForCausalLM)\n- **DeciLM**, **Nemotron** (DeciLMForCausalLM)\n- **dots.llm1** (Dots1ForCausalLM)\n- **ERNIE 4.5** (Ernie4_5_ForCausalLM, Ernie4_5_MoeForCausalLM)\n- **EXAONE 4.0** (Exaone4ForCausalLM)\n- **Gemma 2** (Gemma2ForCausalLM)\n- **Gemma 3** (Gemma3ForCausalLM, Gemma3ForConditionalGeneration) *- multimodal*\n- **GLM 4**, **GLM 4.5**, **GLM 4.5-Air**, **GLM 4.6** (Glm4ForCausalLM, Glm4MoeForCausalLM)\n- **GLM 4.1V**, **GLM 4.5V** (Glm4vForConditionalGeneration, Glm4vMoeForConditionalGeneration) *- multimodal*\n- **HyperCLOVAX** (HyperCLOVAXForCausalLM, HCXVisionV2ForCausalLM) *- multimodal*\n- **IQuest-Coder** (IQuestCoderForCausalLM)\n- **Llama**, **Llama 2**, **Llama 3**, **Llama 3.1-Nemotron** etc. (LlamaForCausalLM)\n- **MiMo-RL** (MiMoForCausalLM)\n- **MiniMax-M2** (MiniMaxM2ForCausalLM)\n- **Mistral**, **Ministral 3**, **Devstral 2** etc. 
(MistralForCausalLM, Mistral3ForConditionalGeneration) *- multimodal*\n- **Mixtral** (MixtralForCausalLM)\n- **NanoChat** (NanoChatForCausalLM)\n- **Olmo 3.1** (Olmo3ForCausalLM)\n- **Olmo-Hybrid** (OlmoHybridForCausalLM)\n- **Phi3**, **Phi4** (Phi3ForCausalLM)\n- **Qwen 2**, **Qwen 2.5**, **Qwen 2.5 VL** (Qwen2ForCausalLM, Qwen2_5_VLForConditionalGeneration) *- multimodal*\n- **Qwen 3** (Qwen3ForCausalLM, Qwen3MoeForCausalLM)\n- **Qwen 3-Next** (Qwen3NextForCausalLM)\n- **Qwen 3-VL** (Qwen3VLForConditionalGeneration)  *- multimodal*\n- **Qwen 3-VL MoE** (Qwen3VLMoeForConditionalGeneration) *- multimodal*\n- **Qwen 3.5** (Qwen3_5ForConditionalGeneration) *- multimodal*\n- **Qwen 3.5 MoE** (Qwen3_5MoeForConditionalGeneration) *- multimodal*\n- **Seed-OSS** (SeedOssForCausalLM)\n- **SmolLM** (SmolLM3ForCausalLM)\n- **SolarOpen** (SolarOpenForCausalLM)\n- **Step 3.5 Flash** (Step3p5ForCausalLM)\n\nAlways adding more, stay tuned.\n\n\n## What's missing?\n\nCurrently on the to-do list:\n\n- Lots of optimization\n- LoRA support\n- ROCm support\n- More sampling functions\n- More quantization modes (FP4 etc.)\n\nAs for what is implemented, expect that some things may be a little broken at first. Please be patient, raise issues and\u002For contribute. 👉👈 \n\n\n## How to?\n\n[TabbyAPI](https:\u002F\u002Fgithub.com\u002Ftheroyallab\u002FtabbyAPI\u002F) has a startup script that manages and installs prerequisites if you want to get started quickly with inference in an OAI-compatible client. \n\nOtherwise, start by making sure you have the appropriate version of [PyTorch](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F) installed (CUDA 12.4 or later) since the Torch dependency is not automatically handled by `pip`. Then pick a method below:\n\n### Method 1: Installing from prebuilt wheel (recommended if you're unsure)\n\nPick a wheel from the [releases page](https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Freleases), then e.g.:\n\n```sh\npip install https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Freleases\u002Fdownload\u002Fv0.0.6\u002Fexllamav3-0.0.6+cu128.torch2.8.0-cp313-cp313-linux_x86_64.whl\n```\n\n### Method 2: Installing from PyPi:\n\n```sh\npip install exllamav3\n```\nNote that the PyPi package does not contain a prebuilt extension and requires the CUDA toolkit and build prerequisites (i.e. VS Build Tools on Windows, gcc on Linux, `python-dev` headers etc.).    \n\n### Method 3: Building from source\n\n```sh\n# Clone the repo\ngit clone https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\ncd exllamav3\n\n# (Optional) switch to dev branch for latest in-progress features\ngit checkout dev\n\n# Install requirements (make sure you install Torch separately)\npip install -r requirements.txt\n```\n\nAt this point you should be able to run the conversion, eval and example scripts from the main repo directory, e.g. `python convert.py -i ...`\n\nTo install the library for the active venv, run from the repo directory:\n\n```sh\npip install .\n```\n\nRelevant env variables for building:\n- `MAX_JOBS`: by default ninja may launch too many processes and run out of system memory for compilation. Set this to a reasonable value like 4 in that case.  \n- `EXLLAMA_NOCOMPILE`: set to install the library without compiling the C++\u002FCUDA extension. 
Torch will build\u002Fload it at runtime instead.\n\n\n## Conversion\n\nTo convert a model to EXL3 format, use:\n\n```sh\n# Convert model\npython convert.py -i \u003Cinput_dir> -o \u003Coutput_dir> -w \u003Cworking_dir> -b \u003Cbitrate>\n\n# Resume an interrupted quant job\npython convert.py -w \u003Cworking_dir> -r\n\n# More options\npython convert.py -h\n```\n\nThe working directory is temporary storage for state checkpoints and for storing quantized tensors until the converted model can be compiled. It should have enough free space to store an entire copy of the output model. Note that while EXL2 conversion by default resumes an interrupted job when pointed to an existing folder, EXL3 needs you to explicitly resume with the `-r`\u002F`--resume` argument.    \n\nSee [here](doc\u002Fconvert.md) for more information.\n\n\n## Examples\n\nA number of example scripts are provided to showcase the features of the backend and generator. Some of them have hardcoded model paths and should be edited before you run them, but there is a simple CLI chatbot that you can start with:\n\n```sh\npython examples\u002Fchat.py -m \u003Cinput_dir> -mode \u003Cprompt_mode> \n\n# E.g.:\npython examples\u002Fchat.py -m \u002Fmnt\u002Fmodels\u002Fllama3.1-8b-instruct-exl3 -mode llama3\n\n# Wealth of options\npython examples\u002Fchat.py -h\n```\n\n## EXL3 quantization\n\n\u003Cdiv align=\"center\">\n    \u003Ca href=\"doc\u002Fexl3.md\" target=\"_blank\">\n        \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fturboderp-org_exllamav3_readme_67ebe6aa545b.png\" width=\"640\">\n    \u003C\u002Fa>\n\u003C\u002Fdiv>\n\nDespite their amazing achievements, most SOTA quantization techniques remain cumbersome or even prohibitively expensive to use. For instance, **AQLM** quantization of a 70B model takes around **720 GPU-hours** on an A100 server, costing $850 US at the time of writing. ExLlamaV3 aims to address this with the **EXL3** format, which is a streamlined variant of [**QTIP**](https:\u002F\u002Fgithub.com\u002FCornell-RelaxML\u002Fqtip) from Cornell RelaxML. The conversion process is designed to be simple and efficient and requires only an input model (in HF format) and a target bitrate. By computing Hessians on the fly and thanks to a fused Viterbi kernel, the quantizer can convert a model in a single step, taking a couple of minutes for smaller models, up to a few hours for larger ones (70B+) (on a single RTX 4090 or equivalent GPU.)\n\nThe [Marlin](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fmarlin)-inspired GEMM kernel achieves roughly memory-bound latency under optimal conditions (4bpw, RTX 4090), though it still needs some work to achieve the same efficiency on Ampere GPUs and to remain memory-bound at lower bitrates.\n\nSince converted models largely retain the original file structure (unlike **EXL2** which renames some tensors in its quest to turn every model into a Llama variant), it will be possible to extend **EXL3** support to other frameworks like HF Transformers and vLLM.\n\nThere are some benchmark results [here](doc\u002Fexl3.md), and a full writeup on the format is coming soon.\n\nFun fact: Llama-3.1-70B-EXL3 is coherent at 1.6 bpw. With the output layer quantized to 3 bpw and a 4096-token cache, inference is possible in under 16 GB of VRAM. 
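\n\nAs a quick, non-authoritative sketch of the conversion workflow described above (the input path, working directory and the 4.0 bpw target are placeholder choices, not project defaults), a full quantize-and-test pass uses only the documented scripts:\n\n```sh\n# Quantize a HF-format model to 4.0 bpw EXL3 (example paths, adjust to your setup)\npython convert.py -i \u002Fmnt\u002Fmodels\u002Fllama3.1-8b-instruct -o \u002Fmnt\u002Fmodels\u002Fllama3.1-8b-instruct-exl3 -w \u002Ftmp\u002Fexl3_work -b 4.0\n\n# If the job is interrupted, resume it explicitly (EXL3 does not auto-resume like EXL2)\npython convert.py -w \u002Ftmp\u002Fexl3_work -r\n\n# Smoke-test the result with the bundled CLI chatbot\npython examples\u002Fchat.py -m \u002Fmnt\u002Fmodels\u002Fllama3.1-8b-instruct-exl3 -mode llama3\n```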
\n\n\n### Community\n\nYou are always welcome to join the [ExLlama discord server](https:\u002F\u002Fdiscord.gg\u002FNSFwVuCjRq) ←🎮  \n\n\n### 🤗 HuggingFace repos\n\nA selection of EXL3-quantized models is available [here](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fturboderp\u002Fexl3-models-67f2dfe530f05cb9f596d21a). Also shout out the following lovely people:\n \n- [ArtusDev](https:\u002F\u002Fhuggingface.co\u002FArtusDev)\n- [MikeRoz](https:\u002F\u002Fhuggingface.co\u002FMikeRoz) \n- [MetaphoricalCode](https:\u002F\u002Fhuggingface.co\u002FMetaphoricalCode) \n- [Ready.Art](https:\u002F\u002Fhuggingface.co\u002FReadyArt) \n- [isogen](https:\u002F\u002Fhuggingface.co\u002Fisogen\u002Fmodels)\n\n\n## Acknowledgements\n\nThis project owes its existence to a wonderful community of FOSS developers and some very generous supporters (🐈❤️!) The following projects in particular deserve a special mention:\n\n- [TabbyAPI](https:\u002F\u002Fgithub.com\u002Ftheroyallab\u002FtabbyAPI\u002F)\n- [PyTorch](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch)\n- [FlashAttention](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention)\n- [QTIP](https:\u002F\u002Fgithub.com\u002FCornell-RelaxML\u002Fqtip)\n- [Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)\n- [Marlin](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fmarlin)\n- [Flash Linear Attention](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention)","# \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fturboderp-org_exllamav3_readme_a7bbb05a1f1e.png\" width=\"40\"> ExLlamaV3\n\nExLlamaV3 是一款用于在现代消费级 GPU 上运行本地大模型的推理库。其主要特性包括：\n\n- 基于 QTIP 的全新 [EXL3](doc\u002Fexl3.md) 量化格式\n- 针对消费级硬件配置的灵活张量并行与专家并行推理\n- 通过 [TabbyAPI](https:\u002F\u002Fgithub.com\u002Ftheroyallab\u002FtabbyAPI\u002F) 提供的 OpenAI 兼容服务器\n- 连续动态批处理\n- Hugging Face Transformers 插件（参见 [此处](examples\u002Ftransformers_integration.py)）\n- 支持 Hugging Face 模型（参见 [支持的架构](#architecture-support)）\n- 推测解码\n- 2 至 8 位缓存量化\n- 多模态支持\n\nExLlamaV3 官方推荐的后端服务器是 [TabbyAPI](https:\u002F\u002Fgithub.com\u002Ftheroyallab\u002FtabbyAPI\u002F)，它提供了一个兼容 OpenAI 的 API，可用于本地或远程推理，并具备扩展功能，如 Hugging Face 模型下载、嵌入模型支持以及对 Hugging Face Jinja2 聊天模板的支持。\n\n### ⚠️ 重要提示\n\n- **Qwen3-Next** 和 **Qwen3.5** 可以利用 [Flash Linear Attention](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention)，但此功能需要 Triton 支持，且由于其不稳定的 JIT 编译机制，性能可能不够稳定。[causal-conv1d](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fcausal-conv1d) 是受支持且推荐的替代方案，但并非必需。\n- **Qwen3-Next** 和 **Qwen3.5** 目前尚不支持张量\u002F专家并行。\n\n## 架构支持\n\n- **AFM** (ArceeForCausalLM)\n- **Apertus** (ApertursForCausalLM)\n- **Command-R** 等（CohereForCausalLM）\n- **Command-A**、**Command-R7B**、**Command-R+** 等（Cohere2ForCausalLM）\n- **DeciLM**、**Nemotron**（DeciLMForCausalLM）\n- **dots.llm1**（Dots1ForCausalLM）\n- **ERNIE 4.5**（Ernie4_5_ForCausalLM、Ernie4_5_MoeForCausalLM）\n- **EXAONE 4.0**（Exaone4ForCausalLM）\n- **Gemma 2**（Gemma2ForCausalLM）\n- **Gemma 3**（Gemma3ForCausalLM、Gemma3ForConditionalGeneration）*- 多模态*\n- **GLM 4**、**GLM 4.5**、**GLM 4.5-Air**、**GLM 4.6**（Glm4ForCausalLM、Glm4MoeForCausalLM）\n- **GLM 4.1V**、**GLM 4.5V**（Glm4vForConditionalGeneration、Glm4vMoeForConditionalGeneration）*- 多模态*\n- **HyperCLOVAX**（HyperCLOVAXForCausalLM、HCXVisionV2ForCausalLM）*- 多模态*\n- **IQuest-Coder**（IQuestCoderForCausalLM）\n- **Llama**、**Llama 2**、**Llama 3**、**Llama 3.1-Nemotron** 等（LlamaForCausalLM）\n- **MiMo-RL**（MiMoForCausalLM）\n- **MiniMax-M2**（MiniMaxM2ForCausalLM）\n- **Mistral**、**Ministral 
3**、**Devstral 2** 等（MistralForCausalLM、Mistral3ForConditionalGeneration）*- 多模态*\n- **Mixtral**（MixtralForCausalLM）\n- **NanoChat**（NanoChatForCausalLM）\n- **Olmo 3.1**（Olmo3ForCausalLM）\n- **Olmo-Hybrid**（OlmoHybridForCausalLM）\n- **Phi3**、**Phi4**（Phi3ForCausalLM）\n- **Qwen 2**、**Qwen 2.5**、**Qwen 2.5 VL**（Qwen2ForCausalLM、Qwen2_5_VLForConditionalGeneration）*- 多模态*\n- **Qwen 3**（Qwen3ForCausalLM、Qwen3MoeForCausalLM）\n- **Qwen 3-Next**（Qwen3NextForCausalLM）\n- **Qwen 3-VL**（Qwen3VLForConditionalGeneration）*- 多模态*\n- **Qwen 3-VL MoE**（Qwen3VLMoeForConditionalGeneration）*- 多模态*\n- **Qwen 3.5**（Qwen3_5ForConditionalGeneration）*- 多模态*\n- **Qwen 3.5 MoE**（Qwen3_5MoeForConditionalGeneration）*- 多模态*\n- **Seed-OSS**（SeedOssForCausalLM）\n- **SmolLM**（SmolLM3ForCausalLM）\n- **SolarOpen**（SolarOpenForCausalLM）\n- **Step 3.5 Flash**（Step3p5ForCausalLM）\n\n我们仍在不断添加更多支持，请持续关注。\n\n## 尚未实现的功能？\n\n目前的待办事项包括：\n\n- 大量优化\n- LoRA 支持\n- ROCm 支持\n- 更多采样函数\n- 更多量化模式（如 FP4 等）\n\n至于已实现的部分，初期可能会存在一些小问题。请耐心等待，如有疑问或建议，欢迎提交 Issue 或参与贡献。👉👈\n\n## 如何使用？\n\n如果您希望快速在兼容 OAI 的客户端中开始推理，可以使用 [TabbyAPI](https:\u002F\u002Fgithub.com\u002Ftheroyallab\u002FtabbyAPI\u002F) 提供的启动脚本，该脚本会自动管理并安装所需依赖。\n\n否则，请先确保已安装适当版本的 [PyTorch](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F)（CUDA 12.4 或更高版本），因为 `pip` 不会自动处理 Torch 的依赖关系。然后选择以下方法之一：\n\n### 方法 1：从预编译的 wheel 安装（推荐给不确定如何操作的用户）\n\n从 [releases 页面](https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Freleases) 下载合适的 wheel 文件，例如：\n\n```sh\npip install https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Freleases\u002Fdownload\u002Fv0.0.6\u002Fexllamav3-0.0.6+cu128.torch2.8.0-cp313-cp313-linux_x86_64.whl\n```\n\n### 方法 2：从 PyPI 安装：\n\n```sh\npip install exllamav3\n```\n\n请注意，PyPI 包不包含预编译的扩展，因此需要 CUDA 工具包及构建所需的依赖项（例如 Windows 上的 VS Build Tools、Linux 上的 gcc、`python-dev` 头文件等）。\n\n### 方法 3：从源代码编译：\n\n```sh\n# 克隆仓库\ngit clone https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\ncd exllamav3\n\n# （可选）切换到 dev 分支以获取最新的开发中功能\ngit checkout dev\n\n# 安装依赖项（请务必单独安装 Torch）\npip install -r requirements.txt\n```\n\n此时，您应该能够从主仓库目录运行转换、评估和示例脚本，例如 `python convert.py -i ...`。\n\n要将库安装到当前的虚拟环境中，请在仓库目录下执行：\n\n```sh\npip install .\n```\n\n与构建相关的环境变量：\n- `MAX_JOBS`：默认情况下，ninja 可能会启动过多进程而导致系统内存不足。在这种情况下，可以将其设置为一个合理的值，例如 4。\n- `EXLLAMA_NOCOMPILE`：设置此变量可在不编译 C++\u002FCUDA 扩展的情况下安装库。Torch 将在运行时自行构建并加载该扩展。\n\n## 模型转换\n\n要将模型转换为 EXL3 格式，请使用以下命令：\n\n```sh\n# 转换模型\npython convert.py -i \u003Cinput_dir> -o \u003Coutput_dir> -w \u003Cworking_dir> -b \u003Cbitrate>\n\n# 继续中断的量化任务\npython convert.py -w \u003Cworking_dir> -r\n\n# 更多选项\npython convert.py -h\n```\n\n工作目录是用于存储状态检查点以及量化张量的临时存储空间，直到转换后的模型可以完成编译。该目录应有足够的可用空间来存放整个输出模型的副本。需要注意的是，虽然 EXL2 转换默认会在指向现有文件夹时恢复中断的任务，但 EXL3 需要您显式地使用 `-r`\u002F`--resume` 参数来恢复任务。\n\n更多信息请参阅 [这里](doc\u002Fconvert.md)。\n\n## 示例\n\n我们提供了一系列示例脚本，用于展示后端和生成器的各项功能。其中一些脚本硬编码了模型路径，您在运行之前需要进行编辑；不过，也有一个简单的 CLI 聊天机器人可供您立即上手：\n\n```sh\npython examples\u002Fchat.py -m \u003Cinput_dir> -mode \u003Cprompt_mode>\n\n# 例如：\npython examples\u002Fchat.py -m \u002Fmnt\u002Fmodels\u002Fllama3.1-8b-instruct-exl3 -mode llama3\n\n# 丰富的选项\npython examples\u002Fchat.py -h\n```\n\n## EXL3 量化\n\n\u003Cdiv align=\"center\">\n    \u003Ca href=\"doc\u002Fexl3.md\" target=\"_blank\">\n        \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fturboderp-org_exllamav3_readme_67ebe6aa545b.png\" width=\"640\">\n    \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n尽管取得了令人瞩目的成就，大多数最先进的量化技术仍然使用起来繁琐，甚至成本高昂到难以承受。例如，对一个700亿参数模型进行 **AQLM** 量化，在一台 A100 服务器上大约需要 **720 GPU 
小时**，按撰写本文时的定价计算，费用高达850美元。ExLlamaV3 旨在通过 **EXL3** 格式来解决这一问题，该格式是康奈尔 RelaxML 团队的 [**QTIP**](https:\u002F\u002Fgithub.com\u002FCornell-RelaxML\u002Fqtip) 的一种精简变体。转换过程设计得简单高效，仅需输入一个 HF 格式的模型和目标比特率即可。通过实时计算海森矩阵，并借助融合的维特比内核，量化器可以在单步中完成模型转换：较小的模型只需几分钟，而较大的模型（700亿参数以上）则可能需要几小时（在单块 RTX 4090 或同等性能的 GPU 上）。\n\n受 [Marlin](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fmarlin) 启发的 GEMM 内核，在最佳条件下（4bpw，RTX 4090）能够达到接近内存带宽限制的延迟，不过它仍需进一步优化，以在 Ampere 架构的 GPU 上实现同样高效的性能，并在较低比特率下保持内存受限的状态。\n\n由于转换后的模型基本保留了原始文件结构（不同于 **EXL2**，后者为了将所有模型统一为 Llama 变体而重命名部分张量），未来有望将 **EXL3** 支持扩展到其他框架，如 HF Transformers 和 vLLM。\n\n一些基准测试结果可以在这里找到：[doc\u002Fexl3.md]，关于该格式的完整说明也将很快发布。\n\n有趣的是：Llama-3.1-70B-EXL3 在 1.6 bpw 的情况下依然保持连贯性。如果将输出层量化至 3 bpw，并配备 4096 个 token 的缓存，推理所需的显存便可控制在 16 GB 以内。\n\n\n### 社区\n\n欢迎随时加入 ExLlama 的 Discord 服务器：[discord.gg\u002FNSFwVuCjRq] ←🎮  \n\n\n### 🤗 HuggingFace 仓库\n\n精选的 EXL3 量化模型已在此处提供：[huggingface.co\u002Fcollections\u002Fturboderp\u002Fexl3-models-67f2dfe530f05cb9f596d21a]。同时也要感谢以下几位优秀的贡献者：\n\n- [ArtusDev](https:\u002F\u002Fhuggingface.co\u002FArtusDev)\n- [MikeRoz](https:\u002F\u002Fhuggingface.co\u002FMikeRoz) \n- [MetaphoricalCode](https:\u002F\u002Fhuggingface.co\u002FMetaphoricalCode) \n- [Ready.Art](https:\u002F\u002Fhuggingface.co\u002FReadyArt) \n- [isogen](https:\u002F\u002Fhuggingface.co\u002Fisogen\u002Fmodels)\n\n\n## 致谢\n\n本项目得以实现，离不开一群优秀的开源开发者社区以及几位非常慷慨的支持者（🐈❤️！）。特别要感谢以下项目：\n\n- [TabbyAPI](https:\u002F\u002Fgithub.com\u002Ftheroyallab\u002FtabbyAPI\u002F)\n- [PyTorch](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch)\n- [FlashAttention](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention)\n- [QTIP](https:\u002F\u002Fgithub.com\u002FCornell-RelaxML\u002Fqtip)\n- [Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)\n- [Marlin](https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fmarlin)\n- [Flash Linear Attention](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention)","# ExLlamaV3 快速上手指南\n\nExLlamaV3 是一个专为现代消费级 GPU 设计的大语言模型（LLM）本地推理库。它引入了全新的 **EXL3** 量化格式，支持灵活的张量并行\u002F专家并行、连续动态批处理、推测解码以及多模态模型推理。官方推荐的配套服务端是 [TabbyAPI](https:\u002F\u002Fgithub.com\u002Ftheroyallab\u002FtabbyAPI\u002F)，可提供兼容 OpenAI 的 API 接口。\n\n## 环境准备\n\n在开始之前，请确保您的系统满足以下要求：\n\n*   **操作系统**：Linux (推荐) 或 Windows。\n*   **GPU**：支持 CUDA 的现代 NVIDIA 显卡（推荐 RTX 30\u002F40 系列或更高）。\n*   **Python**：建议 Python 3.10+。\n*   **PyTorch**：必须预先安装 **CUDA 12.4 或更高版本** 的 PyTorch。`pip` 不会自动处理此依赖。\n    *   安装命令示例（根据实际环境调整）：\n        ```bash\n        pip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu124\n        ```\n*   **编译工具（仅源码安装需要）**：\n    *   Linux: `gcc`, `python-dev` headers, `ninja`。\n    *   Windows: Visual Studio Build Tools。\n    *   *注：若内存有限，构建时可设置环境变量 `MAX_JOBS=4` 防止编译进程过多导致内存溢出。*\n\n## 安装步骤\n\n推荐优先使用预编译包进行安装，以避免复杂的编译环境问题。\n\n### 方法一：安装预编译 Wheel（推荐）\n\n访问 [Releases 页面](https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Freleases) 下载与您环境（CUDA 版本、Python 版本、系统架构）匹配的 `.whl` 文件，然后运行：\n\n```bash\npip install https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Freleases\u002Fdownload\u002Fv0.0.6\u002Fexllamav3-0.0.6+cu128.torch2.8.0-cp313-cp313-linux_x86_64.whl\n```\n*(请将上述 URL 替换为您实际下载的文件链接)*\n\n### 方法二：从 PyPI 安装\n\n此方法不包含预编译扩展，需要本地具备完整的 CUDA Toolkit 和编译环境。\n\n```bash\npip install exllamav3\n```\n\n### 方法三：从源码构建\n\n适用于需要最新开发版功能或自定义构建的用户。\n\n```bash\n# 克隆仓库\ngit clone https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\ncd exllamav3\n\n# (可选) 切换到 dev 分支获取最新功能\ngit checkout 
dev\n\n# 安装依赖（确保已单独安装 Torch）\npip install -r requirements.txt\n\n# 安装库到当前虚拟环境\npip install .\n```\n\n## 基本使用\n\nExLlamaV3 的核心工作流分为两步：**模型量化转换** 和 **推理运行**。\n\n### 1. 模型量化转换 (EXL3 格式)\n\n将 Hugging Face 格式的模型转换为高效的 EXL3 格式。您需要指定输入目录、输出目录、工作目录（用于临时存储）和目标比特率。\n\n```bash\n# 执行转换\n# \u003Cinput_dir>: 原始 HF 模型路径\n# \u003Coutput_dir>: 转换后模型保存路径\n# \u003Cworking_dir>: 临时工作目录（需足够空间存放完整模型副本）\n# \u003Cbitrate>: 目标比特率 (例如 4.0)\npython convert.py -i \u003Cinput_dir> -o \u003Coutput_dir> -w \u003Cworking_dir> -b \u003Cbitrate>\n\n# 如果转换中断，可使用 -r 参数恢复任务\npython convert.py -w \u003Cworking_dir> -r\n```\n\n> **注意**：EXL3 格式转换不像 EXL2 那样自动检测断点，中断后必须显式添加 `-r` 或 `--resume` 参数才能继续。\n\n### 2. 运行推理示例\n\n库中提供了简单的命令行聊天脚本用于测试。\n\n```bash\n# 启动聊天机器人\n# -m: 已转换好的 EXL3 模型路径\n# -mode: 提示词模板模式 (如 llama3, qwen 等)\npython examples\u002Fchat.py -m \u002Fmnt\u002Fmodels\u002Fllama3.1-8b-instruct-exl3 -mode llama3\n\n# 查看可用选项\npython examples\u002Fchat.py -h\n```\n\n### 3. 生产环境部署 (推荐)\n\n对于需要 OpenAI 兼容 API、动态批处理或远程服务的场景，强烈建议搭配 **TabbyAPI** 使用：\n\n1.  克隆并安装 [TabbyAPI](https:\u002F\u002Fgithub.com\u002Ftheroyallab\u002FtabbyAPI\u002F)。\n2.  TabbyAPI 内置了启动脚本，可自动管理依赖并提供丰富的配置选项（如模型下载、Embedding 支持、Jinja2 模板等）。\n\n---\n*更多高级功能（如多模态支持、张量并行配置）及支持的模型架构列表，请参阅项目官方文档。*","一位独立开发者试图在单张 RTX 4090 显卡上部署最新的 Qwen3.5-MoE 多模态大模型，以构建一个能实时分析图表并回答业务数据的本地智能助手。\n\n### 没有 exllamav3 时\n- **显存爆满无法运行**：原始模型权重过大，即使使用常规量化，单卡显存仍无法容纳 Qwen3.5-MoE 的庞大参数，导致程序直接崩溃。\n- **推理速度极慢**：勉强通过分片或多卡方案运行时，由于缺乏针对消费级显卡的专家并行（Expert-Parallel）优化，生成每个字都需要数秒，完全无法交互。\n- **多模态支持缺失**：现有的本地推理后端对 Qwen3.5-VL 等新架构的多模态输入支持不完善，上传图片后模型无法正确识别视觉内容。\n- **集成开发困难**：缺乏标准的 OpenAI 接口，前端应用需要编写大量自定义代码才能连接本地模型，维护成本极高。\n\n### 使用 exllamav3 后\n- **单卡流畅运行**：借助全新的 EXL3 量化格式，exllamav3 将模型体积大幅压缩，成功让 Qwen3.5-MoE 在单张 RTX 4090 上完整加载且精度损失极小。\n- **响应实时化**：利用其灵活的张量并行和动态批处理技术，令牌生成速度提升至每秒数十个 token，实现了近乎实时的对话体验。\n- **原生多模态解析**：exllamav3 原生支持 Qwen3.5-VL 架构，用户直接发送业务图表，模型即可精准提取数据并进行深度分析。\n- **无缝对接应用**：配合 TabbyAPI 后端，直接提供标准的 OpenAI 兼容接口，开发者无需修改任何前端代码即可接入本地高性能模型。\n\nexllamav3 通过极致的量化算法与架构优化，打破了消费级显卡运行顶级多模态大模型的硬件壁垒，让本地私有化部署变得高效且触手可及。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fturboderp-org_exllamav3_ebf289df.png","turboderp-org","Turboderp","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fturboderp-org_26158f1d.jpg","I'm an organization now.",null,"turboderp_","https:\u002F\u002Fgithub.com\u002Fturboderp-org",[83,87,91,95],{"name":84,"color":85,"percentage":86},"Python","#3572A5",57.1,{"name":88,"color":89,"percentage":90},"Cuda","#3A4E3A",39.4,{"name":92,"color":93,"percentage":94},"C++","#f34b7d",3.5,{"name":96,"color":97,"percentage":98},"C","#555555",0,743,77,"2026-04-05T08:24:00","MIT",4,"Linux, Windows","必需 NVIDIA GPU (现代消费级显卡)，需安装 CUDA 12.4 或更高版本。显存需求取决于模型大小和量化位率 (例如：70B 模型在 1.6 bpw 量化下可在 16GB 显存运行)。不支持 ROCm (AMD GPU)。","未说明 (但在编译时若并行任务过多可能导致系统内存不足，建议根据模型大小预留充足内存)",{"notes":108,"python":109,"dependencies":110},"1. Windows 用户从源码安装需安装 VS Build Tools，Linux 用户需 gcc 和 python-dev 头文件。2. 推荐通过预编译 wheel 安装以避免复杂的编译环境配置。3. 转换模型时需要临时工作目录，其可用空间需至少能容纳一份完整的输出模型副本。4. Qwen3-Next 和 Qwen3.5 模型目前不支持张量并行\u002F专家并行。5. 
编译时可设置 MAX_JOBS 环境变量 (如设为 4) 防止内存溢出。","3.13 (示例中提及 cp313，具体支持范围未详述，但通常需较新版本以匹配 PyTorch)",[111,112,113,114,115,116],"torch>=2.8.0 (需单独安装，匹配 CUDA 12.4+)","ninja (用于编译)","CUDA Toolkit (若从源码安装)","tabbyAPI (推荐的后端服务器)","flash-linear-attention (可选，用于 Qwen3-Next\u002F3.5)","causal-conv1d (可选，推荐用于 Qwen3-Next\u002F3.5)",[26,13,54],"2026-03-27T02:49:30.150509","2026-04-06T06:46:07.128364",[121,126,131,136,141,145],{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},9502,"为什么在 Windows 上运行 Qwen3-Next 模型时 GPU 利用率低且生成速度慢？","这通常是由 `flash-linear-attention` 库引起的瓶颈。维护者指出，CPU 开销限制了速度。解决方案是安装最新的主分支版本（main branch）的 `flash-linear-attention`，或者等待官方优化。有用户反馈安装最新版后，Windows 上的 GPU 利用率可稳定在 35% 左右，生成速度提升至约 28 tokens\u002Fs。此外，维护者提到已编写自定义 CUDA 内核以绕过 Triton 的 CPU 开销，但在某些配置下 CPU 仍是瓶颈。","https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fissues\u002F84",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},9503,"加载 Gemma 3 模型时报错提示架构 'Gemma3ForConditionalGeneration' 不在列表中怎么办？","该问题通常是因为使用的 ExLlamaV3 版本过旧，尚未包含对 Gemma 3 架构的支持。请确保拉取最新的开发分支（dev branch）代码或更新到最新版本。有用户确认在更新到 TabbyAPI 的主分支最新版后问题解决。注意：目前的实现可能会忽略 `vision_config` 部分。","https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fissues\u002F39",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},9504,"Qwen3-VL 模型的 EXL3 量化版本提取图像坐标不准确是什么原因？","坐标提取不准确通常是由于依赖库版本不匹配导致的，特别是 `numpy`、`torch` 或 `pillow` 的版本差异。有用户在将依赖项升级到特定版本（如 numpy 2.10）后解决了该问题，使得量化模型的表现与原始模型一致。建议检查并统一运行环境与训练\u002F测试环境的依赖版本。","https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fissues\u002F155",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},9505,"在 Windows 上转换模型时遇到 'RuntimeError: Error opening file' 错误如何解决？","此错误通常与 Windows 系统下的文件路径长度限制、权限问题或特定的 CUDA\u002FJIT 编译冲突有关。由于该问题难以复现且可能与特定硬件驱动（如移动版的 RTX 3070）或模型结构（如 GLM4）的兼容性有关，建议尝试以下方法：1. 确保使用较短的文件路径；2. 以管理员身份运行脚本；3. 尝试量化其他模型（如 Gemma3-4B）以排除特定模型文件的损坏；4. 检查是否有杀毒软件拦截了文件访问。","https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fissues\u002F33",{"id":142,"question_zh":143,"answer_zh":144,"source_url":130},9506,"ExLlamaV3 量化后的模型在低比特率（如 4.0 bpw）下精度损失严重吗？","根据用户反馈，ExLlamaV3 (EXL3) 在低比特率下的表现远优于 ExLlamaV2 (EXL2)。特别是在 4.0 bpw 的设置下，EXL3 在实际生产任务中的准确性有显著提升，其表现差异比图表曲线显示的还要大。如果遇到精度问题，建议优先检查是否误用了 EXL2 的实现或配置，并确保使用的是最新的 EXL3 代码库。",{"id":146,"question_zh":147,"answer_zh":148,"source_url":125},9507,"如何提升 ExLlamaV3 在单 Token 生成时的性能瓶颈？","单 Token 生成的主要瓶颈通常在于 CPU 开销，尤其是当使用基于 Triton 的实现时。维护者已经针对 Ada 架构等开发了专用的 CUDA 内核来替代 `flash-linear-attention` 以减少 CPU 开销。如果遭遇性能瓶颈，建议：1. 确保安装了包含最新 CUDA 内核优化的开发版 ExLlamaV3；2. 检查系统 CPU 是否存在其他负载；3. 
在 Windows 上特别注意系统调度可能带来的额外开销。",[150,155,160,165,170,175,180,185,190,195,200,205,210,215,220,225,230,235,240,245],{"id":151,"version":152,"summary_zh":153,"released_at":154},106904,"v0.0.28","- Fix regression breaking inference for GLM4.5-Air and related models\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.27...v0.0.28","2026-03-30T20:05:37",{"id":156,"version":157,"summary_zh":158,"released_at":159},106905,"v0.0.27","- New and more robust allocation strategy for non-integer bitrates\r\n- Added `-hq` argument to quantizer (explanation  [here](doc\u002Fconvert.md))\r\n- Fix bug causing prompt caching to fail on recurrent models for certain combinations of prompt length and chunk size\r\n- Fix broken output when using repetition penalties without decay range (affecting some OAI clients via TabbyAPI)\r\n- Fix issue allowing recurrent state to fall out of sync with K\u002FV cache\r\n- Support more features in Nanochat, for some reason\r\n- Other fixes and QoL improvements\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.26...v0.0.27","2026-03-26T01:44:31",{"id":161,"version":162,"summary_zh":163,"released_at":164},106906,"v0.0.26","- Fused expert kernel for improved prompt and batch throughput on MoE models\r\n- Support OlmoHybridForCausalLM\r\n- Fix non-integer bitrates when quantizing models with a very large MLP layers\r\n- Minor bugfixes\r\n- QoL improvements\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.25...v0.0.26","2026-03-16T18:57:11",{"id":166,"version":167,"summary_zh":168,"released_at":169},106907,"v0.0.25","- Add Qwen3_5ForCausalLM and Qwen3_5MoeForCausalLM\r\n- Support Qwen3.5 finetunes saved entirely in BF16 format\r\n- Correct tensor format for Qwen3.5 models with split experts (support REAPed models)\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.24...v0.0.25","2026-03-11T22:50:52",{"id":171,"version":172,"summary_zh":173,"released_at":174},106908,"v0.0.24","- Faster MoE routing with graphs\r\n- Fix regression breaking GLM 4.7\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.23...v0.0.24","2026-03-08T19:42:50",{"id":176,"version":177,"summary_zh":178,"released_at":179},106909,"v0.0.23","- Support **Qwen 3.5** (Qwen3_5ForConditionalGeneration, Qwen3_5MoeForConditionalGeneration)\r\n- Support **Step 3.5** (Step3p5ForCausalLM)\r\n- Enable tensor-P support for Minimax-M2\r\n- Switch quantizer to use out_scales by default\r\n- Include Torch 2.10 wheels\r\n- Various bugfixes, optimizations and QoL improvements\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.22...v0.0.23","2026-03-05T15:53:11",{"id":181,"version":182,"summary_zh":183,"released_at":184},106910,"v0.0.22","- Fix regression causing models with preserved bf16 tensors (multimodal specifically) to fail quantization\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.21...v0.0.22","2026-02-10T16:51:55",{"id":186,"version":187,"summary_zh":188,"released_at":189},106911,"v0.0.21","- Fix regression affecting Qwen3-Next\r\n- Avoid using `safetensors` lib during quantization (fixes OoM errors sometimes)\r\n\r\n**Full Changelog**: 
https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.20...v0.0.21","2026-02-09T21:21:17",{"id":191,"version":192,"summary_zh":193,"released_at":194},106912,"v0.0.20","- Support Qwen2_5_VLForConditionalGeneration\r\n- Fix ComboSampler regression\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.19...v0.0.20","2026-01-19T22:23:50",{"id":196,"version":197,"summary_zh":198,"released_at":199},106913,"v0.0.19","- Support Olmo3ForCausalLM\r\n- Support HyperCLOVAXForCausalLM (and HCXVisionV2ForCausalLM)\r\n- Support SolarOpenForCausalLM\r\n- Support NanoChatForCausalLM\r\n- Better support for quantizing from FP8 weights\r\n- Fix memory leak for large job queues on Qwen3-Next\r\n- Add Adaptive-P sampler\r\n- QoL improvements and bugfixes\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.18...v0.0.19","2026-01-16T22:30:36",{"id":201,"version":202,"summary_zh":203,"released_at":204},106914,"v0.0.18","- Fixes for GLM-4.6V-Flash\r\n- Support Ministral3 text-only models (e.g. Devstral-3-123B)\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.17...v0.0.18","2025-12-10T09:47:37",{"id":206,"version":207,"summary_zh":208,"released_at":209},106915,"v0.0.17","- Fix Mistral3 implementation (supports Ministral models now)\r\n- Fix for REAPed models with arbitrary number of experts\r\n- Various other fixes\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.16...v0.0.17","2025-12-07T16:51:15",{"id":211,"version":212,"summary_zh":213,"released_at":214},106916,"v0.0.16","- Fix regression breaking tensor-parallel inference\r\n- Allow TP text-model to work with vision tower\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.15...v0.0.16","2025-11-25T16:57:42",{"id":216,"version":217,"summary_zh":218,"released_at":219},106917,"v0.0.15","- Support Glm4vForConditionalGeneration\r\n- Support Glm4vMoeForConditionalGeneration\r\n- Fix some tokenizer issues\r\n- QoL improvements\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.14...v0.0.15","2025-11-16T12:55:11",{"id":221,"version":222,"summary_zh":223,"released_at":224},106918,"v0.0.14","- Fix small regression in Gemma and Mistral vision towers.\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.13...v0.0.14","2025-11-10T00:38:23",{"id":226,"version":227,"summary_zh":228,"released_at":229},106919,"v0.0.13","- Support Qwen3-VL and Qwen3-VL MoE\r\n- Minor bugfixes\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.12...v0.0.13","2025-11-09T22:04:30",{"id":231,"version":232,"summary_zh":233,"released_at":234},106920,"v0.0.12","- Support MiniMaxM2ForCausalLM\r\n- Graphs (reduce CPU overhead)\r\n- Misc. 
optimizations\r\n- Allow loading FP8 tensors (for quantization only, converted to FP16 on-the-fly)\r\n- Fix some bugs\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.11...v0.0.12","2025-11-01T17:27:25",{"id":236,"version":237,"summary_zh":238,"released_at":239},106921,"v0.0.11","- Fix issue with TP loading of models quantized since v0.0.9+\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.10...v0.0.11","2025-10-17T15:35:11",{"id":241,"version":242,"summary_zh":243,"released_at":244},106922,"v0.0.10","- Fix issue preventing AsyncGenerator from working with new requeue option\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.9...v0.0.10","2025-10-15T12:51:36",{"id":246,"version":247,"summary_zh":248,"released_at":249},106923,"v0.0.9","- Lock MCG and MUL1 multipliers, no longer flag as experimental\r\n- Switch to MCG codebook by default to new models (use `--codebook 3inst` for previous default)\r\n- Add more calibration data\r\n- Increase default calibration size to 250 rows (use `--cal_rows 100` for previous default)\r\n- Fix quantized cache for bsz > 1\r\n- Fix kernel selection on A100\r\n- A few more TP-related fixes\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fturboderp-org\u002Fexllamav3\u002Fcompare\u002Fv0.0.8...v0.0.9","2025-10-13T21:42:08"]