[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-dropbox--hqq":3,"tool-dropbox--hqq":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",143909,2,"2026-04-07T11:33:18",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107888,"2026-04-06T11:32:50",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 
AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":77,"owner_url":78,"languages":79,"stars":92,"forks":93,"last_commit_at":94,"license":95,"difficulty_score":10,"env_os":96,"env_gpu":97,"env_ram":98,"env_deps":99,"category_tags":107,"github_topics":108,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":112,"updated_at":113,"faqs":114,"releases":143},5039,"dropbox\u002Fhqq","hqq","Official implementation of Half-Quadratic Quantization (HQQ)","hqq 是半二次量化（Half-Quadratic Quantization）技术的官方实现，专为高效压缩大型人工智能模型而设计。它核心解决了传统量化方法依赖校准数据、耗时较长且兼容性有限的痛点，让用户无需任何校准样本，即可在几分钟内完成对超大语言模型或视觉模型的量化处理。\n\n这款工具非常适合需要部署大模型的开发者、研究人员以及希望降低显存占用的工程团队。hqq 支持从 8 位到极端的 1 位等多种精度选择，并兼容 PyTorch 生态中的 PEFT 微调训练与 `torch.compile` 加速。其独特的技术亮点在于反量化过程仅为线性运算，这意味着它能无缝对接各类优化的 CUDA\u002FTriton 内核，显著提升推理速度。此外，进阶版本 HQQ+ 还引入了可训练的低秩适配器，进一步提升了低比特下的模型表现。对于追求速度与精度平衡的用户，推荐尝试 4 位精度配合特定分组设置，既能大幅节省显存，又能保持优异的模型质量。","## Half-Quadratic Quantization (HQQ)\nThis repository contains the official implementation of Half-Quadratic Quantization (\u003Cb>HQQ\u003C\u002Fb>) presented in our articles: \n* HQQ: https:\u002F\u002Fdropbox.github.io\u002Fhqq_blog\u002F\n* HQQ+: https:\u002F\u002Fdropbox.github.io\u002F1bit_blog\u002F\n\n### What is HQQ?\n\u003Cb>HQQ\u003C\u002Fb> is a fast and accurate model quantizer that skips the need for calibration data. Quantize the largest models, without calibration data, in just a few minutes at most 🚀.\n\n\u003Cdetails>\n  \u003Csummary>FAQ \u003C\u002Fsummary>\n \u003Cb> Why should I use HQQ instead of other quantization methods? \u003C\u002Fb>\u003Cbr>\n\u003Cul>\n\u003Cli> HQQ is very fast to quantize models.\u003C\u002Fli>\n\u003Cli> It supports 8,4,3,2,1 bits.\u003C\u002Fli>\n\u003Cli> You can use it on any model (LLMs, Vision, etc.).\u003C\u002Fli>\n\u003Cli> The dequantization step is a linear operation, which means that HQQ is compatible with various optimized CUDA\u002FTriton kernels.\u003C\u002Fli>\n\u003Cli> HQQ is compatible with peft training.\u003C\u002Fli>\n\u003Cli> We try to make HQQ fully compatible with `torch.compile` for faster inference and training.\u003C\u002Fli>\n\u003C\u002Ful>\n  \n  \u003Cb>What is the quality of the quantized models? 
\u003C\u002Fb>\u003Cbr>\n  We have detailed benchmarks on both language and vision models. Please refer to our blog posts: \u003Ca href=\"https:\u002F\u002Fdropbox.github.io\u002Fhqq_blog\u002F\">HQQ\u003C\u002Fa>, \u003Ca href=\"https:\u002F\u002Fdropbox.github.io\u002F1bit_blog\u002F\">HQQ+\u003C\u002Fa>.\u003Cbr> \n\n  \u003Cb>What is the speed of the quantized models?\u003C\u002Fb>\u003Cbr>\n  4-bit models with `axis=1` can use optimized inference fused kernels. Moreover, we focus on making hqq fully compatible with `torch.compile` which speeds-up both training and inference. For more details, please refer to the backend section below. \u003Cbr>\n\n  \u003Cb>What quantization settings should I use?\u003C\u002Fb>\u003Cbr>\n  You should start with `nbits=4, group_size=64, axis=1`. These settings offer a good balance between quality, vram usage and speed. If you want better results with the same vram usage, switch to `axis=0` and use the ATEN backend, but this setting is not supported for fast inference. \u003Cbr>\n  \n  \u003Cb>What does the `axis` parameter mean? \u003C\u002Fb>\u003Cbr>\n  The `axis` parameter is the axis along which grouping is performed. In general `axis=0` gives better results than `axis=1`, especially at lower bits. However, the optimized inference runtime only supports `axis=1` for the moment.\u003Cbr>\n  \n  \u003Cb>What is the difference between HQQ and HQQ+?\u003C\u002Fb>\u003Cbr>\n  HQQ+ is HQQ with trainable low-rank adapters to improve the quantization quality at lower bits.\u003Cbr>\n\n\u003C\u002Fdetails>\n\n### Installation \nFirst, make sure you have a Pytorch 2 version that matches your CUDA version: https:\u002F\u002Fpytorch.org\u002F\n\nYou can install hqq via  \n```\n#latest stable version\npip install hqq;\n\n#Latest updates - recommended\npip install git+https:\u002F\u002Fgithub.com\u002Fdropbox\u002Fhqq.git; \n\n#Disable building the CUDA kernels for the aten backend\nDISABLE_CUDA=1 pip install ...\n```\n\nAlternatively, clone the repo and run ```pip install .``` from this current folder. 
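A quick way to sanity-check the install is to quantize a single linear layer and compare its output with the original. The snippet below is a minimal sketch, not part of the library's examples; the layer sizes and the CUDA device are illustrative assumptions:
```Python
# Minimal post-install smoke test (illustrative layer sizes; assumes a CUDA device).
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

layer = torch.nn.Linear(4096, 4096, bias=False).half().cuda()
x     = torch.randn(2, 4096, dtype=torch.float16, device='cuda')
with torch.no_grad():
    y_ref = layer(x)  # reference output before quantization

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)  # recommended starting settings (see FAQ)
hqq_layer    = HQQLinear(layer, quant_config=quant_config, compute_dtype=torch.float16,
                         device='cuda', initialize=True, del_orig=False)

with torch.no_grad():
    y_q = hqq_layer(x)  # forward pass through the 4-bit layer

print((y_ref - y_q).abs().mean())  # expect a small quantization error
```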
\n\n### Basic Usage\nTo perform quantization with HQQ, you simply need to replace the linear layers ( ```torch.nn.Linear```) as follows:\n```Python\nfrom hqq.core.quantize import *\n#Quantization settings\nquant_config = BaseQuantizeConfig(nbits=4, group_size=64)\n\n#Replace your linear layer \nhqq_layer = HQQLinear(your_linear_layer, #torch.nn.Linear or None \n                      quant_config=quant_config, #quantization configuration\n                      compute_dtype=torch.float16, #compute dtype\n                      device='cuda', #cuda device\n                      initialize=True, #Use False to quantize later\n                      del_orig=True #if True, delete the original layer\n                      )\n\nW_r = hqq_layer.dequantize() #dequantize()\nW_q = hqq_layer.unpack(dtype=torch.uint8) #unpack\ny   = hqq_layer(x) #forward-pass\n```\n\nThe quantization parameters are set as follows:\n\n- ```nbits``` (int): supports 8, 4, 3, 2, 1 bits.\n- ```group_size``` (int): no restrictions as long as ```weight.numel()``` is divisible by the ```group_size```.\n- ```view_as_float``` (bool): if True, the quantized parameter is viewed as float instead of an int type.\n\n### Usage with Models\n#### Transformers 🤗\nFor usage with HF's transformers, see the example below from the \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fen\u002Fquantization#hqq\">documentation\u003C\u002Fa>:\n```Python\nfrom transformers import AutoModelForCausalLM, HqqConfig\n\n# All linear layers will use the same quantization config\nquant_config = HqqConfig(nbits=4, group_size=64)\n\n# Load and quantize\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_id, \n    torch_dtype=torch.float16, \n    device_map=\"cuda\", \n    quantization_config=quant_config\n)\n```\nYou can save\u002Fload quantized models as regular transformers models via `save_pretrained` \u002F `from_pretrained`.\n\n#### HQQ Lib\nYou can also utilize the HQQ library to quantize transformers models:\n```Python\n#Load the model on CPU\nfrom transformers import AutoModelForCausalLM\nmodel = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype)\n\n#Quantize\nfrom hqq.models.hf.base import AutoHQQHFModel\nquant_config = BaseQuantizeConfig(nbits=4, group_size=64) \nAutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device=device)\n```\nYou can save\u002Fload quantized models as follows:\n```Python\nfrom hqq.models.hf.base import AutoHQQHFModel\n\n#Save: Make sure to save the model BEFORE any patching\nAutoHQQHFModel.save_quantized(model, save_dir)\n\n#Save as safetensors (to be load via transformers or vllm)\nAutoHQQHFModel.save_to_safetensors(model, save_dir)\n\n#Load\nmodel = AutoHQQHFModel.from_quantized(save_dir)\n```\n\n❗ Note that models saved via the hqq lib are not compatible with `.from_pretrained()`\n\n### Backends\n#### Native Backends\nThe following native dequantization backends can be used by the `HQQLinear` module:\n```Python\nHQQLinear.set_backend(HQQBackend.PYTORCH)          #Pytorch backend - Default\nHQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)  #Compiled Pytorch\nHQQLinear.set_backend(HQQBackend.ATEN)             #Aten\u002FCUDA backend - only axis=0 supported\n```\n❗ Note that ```HQQBackend.ATEN```  only supports `axis=0`. \n\n#### Optimized Inference\nWe support external backends for faster inference with fused kernels. 
You can enable one of the backends after the model was quantized as follows:\n```Python\nfrom hqq.utils.patching import prepare_for_inference\n\n#Pytorch backend that makes the model compatible with fullgraph torch.compile: works with any settings\n#prepare_for_inference(model)\n\n#Gemlite backend: nbits=4\u002F2\u002F1, compute_dtype=float16, axis=1\nprepare_for_inference(model, backend=\"gemlite\") \n\n#Torchao's tiny_gemm backend (fast for batch-size\u003C4): nbits=4, compute_dtype=bfloat16, axis=1\n#prepare_for_inference(model, backend=\"torchao_int4\") \n```\nNote that these backends only work with `axis=1`. Additional restrictions apply regarding the group-size values depending on the backend. You should expect ~158 tokens\u002Fsec with a Llama3-8B 4-bit quantized model on a 4090 RTX.\n\nWhen a quantization config is not supported by the specified inference backend, hqq will fallback to the native backend. \n\n### Custom Quantization Configurations ⚙️\nYou can set up various quantization configurations for different layers by specifying the settings for each layer name:\n#### Transformers 🤗\n```Python\n# Each linear layer with the same tag will use a dedicated quantization config\nq4_config = {'nbits':4, 'group_size':64}\nq3_config = {'nbits':3, 'group_size':32}\n\nquant_config  = HqqConfig(dynamic_config={\n  'self_attn.q_proj':q4_config,\n  'self_attn.k_proj':q4_config,\n  'self_attn.v_proj':q4_config,\n  'self_attn.o_proj':q4_config,\n\n  'mlp.gate_proj':q3_config,\n  'mlp.up_proj'  :q3_config,\n  'mlp.down_proj':q3_config,\n})\n```\n#### HQQ lib\n```Python\nfrom hqq.core.quantize import *\nq4_config    = BaseQuantizeConfig(nbits=4, group_size=64) \nq3_config    = BaseQuantizeConfig(nbits=3, group_size=32)\n\nquant_config = {'self_attn.q_proj':q4_config,\n  'self_attn.k_proj':q4_config,\n  'self_attn.v_proj':q4_config,\n  'self_attn.o_proj':q4_config,\n\n  'mlp.gate_proj':q3_config,\n  'mlp.up_proj'  :q3_config,\n  'mlp.down_proj':q3_config,\n}\n```\n\n### VLLM\nYou can use HQQ in \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\u002F\">vllm\u003C\u002Fa>. 
Make sure to install \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fdropbox\u002Fgemlite\u002F\">GemLite\u003C\u002Fa> before using the backend.\n\n```Python\n#Or you can quantize on-the-fly\nfrom hqq.utils.vllm import set_vllm_onthefly_hqq_quant\nskip_modules = ['lm_head', 'visual', 'vision']\n\n#Select one of the following modes:\n\n#INT\u002FFP format\nset_vllm_onthefly_hqq_quant(weight_bits=8, group_size=None, quant_mode='int8_weightonly', skip_modules=skip_modules) #A16W8 - INT8 weight only\nset_vllm_onthefly_hqq_quant(weight_bits=4, group_size=128, quant_mode='int4_weightonly', skip_modules=skip_modules) #A16W4 - HQQ weight only\nset_vllm_onthefly_hqq_quant(weight_bits=8, quant_mode='int8_dynamic', skip_modules=skip_modules) #A8W8 - INT8 x INT8 dynamic\nset_vllm_onthefly_hqq_quant(weight_bits=8, quant_mode='fp8_dynamic', skip_modules=skip_modules) #A8W8 - FP8 x FP8 dynamic\n\n#MXFP format\nset_vllm_onthefly_hqq_quant(weight_bits=8, group_size=None, quant_mode='mxfp8_dynamic', skip_modules=skip_modules) #A8W8 - MXFP8 x MXPF8 - post_scale=True\nset_vllm_onthefly_hqq_quant(weight_bits=8, group_size=32, quant_mode='mxfp8_dynamic', skip_modules=skip_modules) #A8W8 - MXFP8 x MXPF8- post_scale=False\nset_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp4_weightonly', skip_modules=skip_modules) #A16W4 - MXFP4 weight-only\nset_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp8_dynamic', skip_modules=skip_modules) #A8W4 - MXFP8 x MXFP4 dynamic\nset_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp4_dynamic', skip_modules=skip_modules) #A4W4 - MXPF4 x MXPF4 dynamic\nset_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='nvfp4_dynamic', skip_modules=skip_modules) #A4W4 - NVFP4 x NVFP4 dynamic\n\n\nllm = LLM(model=\"meta-llama\u002FLlama-3.2-3B-Instruct\", max_model_len=4096, gpu_memory_utilization=0.80, dtype=torch.float16)\n```\n\n### Peft Training\nPeft training is directly supported in the HuggingFace's \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fpeft\u002Fv0.12.0\u002Fen\u002Fdeveloper_guides\u002Fquantization#hqq-quantization\"> peft library\u003C\u002Fa>. 
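For the peft-library route, the standard `LoraConfig` / `get_peft_model` flow applies on top of an HQQ-quantized transformers model. The snippet below is a minimal sketch under that assumption: `model` is a model loaded with `HqqConfig` as in the Transformers example above, and the adapter hyper-parameters and target modules are illustrative choices, not recommendations.
```Python
# Sketch: attach LoRA adapters with the peft library on top of an HQQ-quantized model.
# Hyper-parameters and target modules below are illustrative only.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)  # wraps the quantized model with trainable adapters
model.print_trainable_parameters()
# ...then train with your usual Trainer / training loop; only the LoRA weights are updated.
```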
If you still want to use hqq-lib's peft utilities, here's how: \n\n```Python\n#First, quantize\u002Fload a quantized HQQ model\nfrom hqq.core.peft import PeftUtils\n\nbase_lora_params = {'lora_type':'default', 'r':32, 'lora_alpha':64, 'dropout':0.05, 'train_dtype':torch.float32}\nlora_params      = {'self_attn.q_proj': base_lora_params,\n                    'self_attn.k_proj': base_lora_params,\n                    'self_attn.v_proj': base_lora_params,\n                    'self_attn.o_proj': base_lora_params,\n                    'mlp.gate_proj'   : None,\n                    'mlp.up_proj'     : None,\n                    'mlp.down_proj'   : None}\n\n\n#Add LoRA to linear\u002FHQQ modules\nPeftUtils.add_lora(model, lora_params)\n\n#Optional: set your backend\nHQQLinear.set_backend(HQQBackend.ATEN if axis==0 else HQQBackend.PYTORCH_COMPILE)\n\n#Train ....\n\n#Convert LoRA weights to the same model dtype for faster inference\nmodel.eval()\nPeftUtils.cast_lora_weights(model, dtype=compute_dtype)\n\n#Save LoRA weights\nPeftUtils.save_lora_weights(model, filename)\n\n#Load LoRA weights: automatically calls add_lora \nPeftUtils.load_lora_weights(model, filename)\n```\n\nWe provide a complete example to train a model with HQQ\u002FLoRA that you can find in ```examples\u002Fhqq_plus.py```.\n\nIf you want to use multi-gpu training via FSDP, check out this awesome repo by Answer.AI: https:\u002F\u002Fgithub.com\u002FAnswerDotAI\u002Ffsdp_qlora\n\n### Examples \nWe provide a variety of examples demonstrating model quantization across different backends within the ```examples```  directory.\n\n### Citation 📜\n```\n@misc{badri2023hqq,\ntitle  = {Half-Quadratic Quantization of Large Machine Learning Models},\nurl    = {https:\u002F\u002Fdropbox.github.io\u002Fhqq_blog\u002F},\nauthor = {Hicham Badri and Appu Shaji},\nmonth  = {November},\nyear   = {2023}\n}\n```\n","## 半二次量化 (HQQ)\n本仓库包含我们在以下文章中提出的半二次量化的官方实现：\n* HQQ：https:\u002F\u002Fdropbox.github.io\u002Fhqq_blog\u002F\n* HQQ+：https:\u002F\u002Fdropbox.github.io\u002F1bit_blog\u002F\n\n### 什么是HQQ？\n\u003Cb>HQQ\u003C\u002Fb> 是一种快速且精确的模型量化工具，无需校准数据即可使用。您可以在短短几分钟内对最大规模的模型进行量化，完全不需要校准数据 🚀。\n\n\u003Cdetails>\n  \u003Csummary>常见问题解答\u003C\u002Fsummary>\n  \u003Cb>为什么我应该选择HQQ而不是其他量化方法？\u003C\u002Fb>\u003Cbr>\n  \u003Cul>\n    \u003Cli>HQQ 量化模型的速度非常快。\u003C\u002Fli>\n    \u003Cli>它支持 8、4、3、2、1 位量化。\u003C\u002Fli>\n    \u003Cli>您可以将其应用于任何模型（如大语言模型、视觉模型等）。\u003C\u002Fli>\n    \u003Cli>反量化步骤是一个线性操作，这意味着 HQQ 可以与各种优化的 CUDA\u002FTriton 内核兼容。\u003C\u002Fli>\n    \u003Cli>HQQ 与 PEFT 训练兼容。\u003C\u002Fli>\n    \u003Cli>我们正在努力使 HQQ 完全兼容 `torch.compile`，以加速推理和训练。\u003C\u002Fli>\n  \u003C\u002Ful>\n\n  \u003Cb>量化后的模型质量如何？\u003C\u002Fb>\u003Cbr>\n  我们针对语言和视觉模型进行了详细的基准测试。请参阅我们的博客文章：\u003Ca href=\"https:\u002F\u002Fdropbox.github.io\u002Fhqq_blog\u002F\">HQQ\u003C\u002Fa> 和 \u003Ca href=\"https:\u002F\u002Fdropbox.github.io\u002F1bit_blog\u002F\">HQQ+\u003C\u002Fa>。\u003Cbr>\n\n  \u003Cb>量化后的模型运行速度如何？\u003C\u002Fb>\u003Cbr>\n  使用 `axis=1` 的 4 位模型可以利用优化的融合推理内核。此外，我们正致力于让 HQQ 完全兼容 `torch.compile`，从而加速训练和推理。更多细节请参阅下方的后端部分。\u003Cbr>\n\n  \u003Cb>我应该使用哪些量化设置？\u003C\u002Fb>\u003Cbr>\n  建议从 `nbits=4, group_size=64, axis=1` 开始。这些设置在质量、显存占用和速度之间提供了良好的平衡。如果您希望在相同显存占用下获得更好的效果，可以切换到 `axis=0` 并使用 ATEN 后端，但该设置目前不支持快速推理。\u003Cbr>\n\n  \u003Cb>`axis` 参数是什么意思？\u003C\u002Fb>\u003Cbr>\n  `axis` 参数指定了分组操作沿哪个维度进行。通常情况下，`axis=0` 比 `axis=1` 效果更好，尤其是在低比特量化时。然而，目前优化的推理运行时仅支持 `axis=1`。\u003Cbr>\n\n  \u003Cb>HQQ 和 HQQ+ 有什么区别？\u003C\u002Fb>\u003Cbr>\n  HQQ+ 是在 HQQ 
的基础上加入了可训练的低秩适配器，以提升低比特量化下的模型质量。\u003Cbr>\n\u003C\u002Fdetails>\n\n### 安装\n首先，请确保您安装的 PyTorch 2 版本与您的 CUDA 版本匹配：https:\u002F\u002Fpytorch.org\u002F\n\n您可以通过以下方式安装 hqq：\n```\n# 最新稳定版\npip install hqq;\n\n# 最新更新 - 推荐\npip install git+https:\u002F\u002Fgithub.com\u002Fdropbox\u002Fhqq.git;\n\n# 禁用构建 ATEN 后端的 CUDA 内核\nDISABLE_CUDA=1 pip install ...\n```\n\n或者，您可以克隆仓库，并在此目录下运行 `pip install .`。\n\n### 基本用法\n要使用 HQQ 进行量化，只需将线性层（`torch.nn.Linear`）替换为如下代码：\n```Python\nfrom hqq.core.quantize import *\n# 量化配置\nquant_config = BaseQuantizeConfig(nbits=4, group_size=64)\n\n# 替换您的线性层\nhqq_layer = HQQLinear(your_linear_layer, # torch.nn.Linear 或 None\n                      quant_config=quant_config, # 量化配置\n                      compute_dtype=torch.float16, # 计算数据类型\n                      device='cuda', # CUDA 设备\n                      initialize=True, # 使用 False 可稍后再量化\n                      del_orig=True # 如果为 True，则删除原始层\n                      )\n\nW_r = hqq_layer.dequantize() # 反量化\nW_q = hqq_layer.unpack(dtype=torch.uint8) # 解包\ny   = hqq_layer(x) # 前向传播\n```\n\n量化参数的设置如下：\n- `nbits`（整数）：支持 8、4、3、2、1 位。\n- `group_size`（整数）：只要 `weight.numel()` 能被 `group_size` 整除即可，无限制。\n- `view_as_float`（布尔值）：如果为 True，则将量化后的参数视为浮点数，而非整数类型。\n\n### 与模型结合使用\n#### Transformers 🤗\n关于与 Hugging Face Transformers 结合使用的说明，请参阅 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fen\u002Fquantization#hqq\">文档\u003C\u002Fa>中的示例：\n```Python\nfrom transformers import AutoModelForCausalLM, HqqConfig\n\n# 所有线性层将使用相同的量化配置\nquant_config = HqqConfig(nbits=4, group_size=64)\n\n# 加载并量化\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_id,\n    torch_dtype=torch.float16,\n    device_map=\"cuda\",\n    quantization_config=quant_config\n)\n```\n您可以像普通 Transformers 模型一样使用 `save_pretrained` 和 `from_pretrained` 来保存和加载量化后的模型。\n\n#### HQQ 库\n您也可以使用 HQQ 库来量化 Transformers 模型：\n```Python\n# 在 CPU 上加载模型\nfrom transformers import AutoModelForCausalLM\nmodel = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype)\n\n# 量化\nfrom hqq.models.hf.base import AutoHQQHFModel\nquant_config = BaseQuantizeConfig(nbits=4, group_size=64)\nAutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device=device)\n```\n您可以按照以下方式保存和加载量化后的模型：\n```Python\nfrom hqq.models.hf.base import AutoHQQHFModel\n\n# 保存：务必在任何补丁之前保存模型\nAutoHQQHFModel.save_quantized(model, save_dir)\n\n# 以 safetensors 格式保存（可通过 Transformers 或 VLLM 加载）\nAutoHQQHFModel.save_to_safetensors(model, save_dir)\n\n# 加载\nmodel = AutoHQQHFModel.from_quantized(save_dir)\n```\n\n❗ 请注意，通过 HQQ 库保存的模型与 `.from_pretrained()` 不兼容。\n\n### 后端\n#### 原生后端\n`HQQLinear` 模块可以使用以下原生反量化后端：\n```Python\nHQQLinear.set_backend(HQQBackend.PYTORCH)          # PyTorch 后端 - 默认\nHQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)  # 编译后的 PyTorch\nHQQLinear.set_backend(HQQBackend.ATEN)             # Aten\u002FCUDA 后端 - 仅支持 axis=0\n```\n❗ 注意，`HQQBackend.ATEN` 仅支持 `axis=0`。\n\n#### 优化推理\n我们支持外部后端以实现更快的推理，并采用融合内核。在模型完成量化后，您可以按如下方式启用其中一个后端：\n```Python\nfrom hqq.utils.patching import prepare_for_inference\n\n# PyTorch 后端，使模型与 fullgraph torch.compile 兼容：适用于任何设置\n# prepare_for_inference(model)\n\n# Gemlite 后端：nbits=4\u002F2\u002F1，compute_dtype=float16，axis=1\nprepare_for_inference(model, backend=\"gemlite\") \n\n# Torchao 的 tiny_gemm 后端（batch-size\u003C4 时速度快）：nbits=4，compute_dtype=bfloat16，axis=1\n# prepare_for_inference(model, backend=\"torchao_int4\")\n```\n请注意，这些后端仅适用于 `axis=1`。此外，根据不同的后端，还存在关于组大小值的额外限制。在 RTX 4090 上，使用 Llama3-8B 的 4 
位量化模型时，预计每秒可处理约 158 个 token。\n\n当指定的推理后端不支持某个量化配置时，hqq 将回退到原生后端。\n\n### 自定义量化配置 ⚙️\n您可以通过为每个层名称指定设置，为不同层配置各种量化方案：\n#### Transformers 🤗\n```Python\n# 具有相同标签的每个线性层将使用专用的量化配置\nq4_config = {'nbits':4, 'group_size':64}\nq3_config = {'nbits':3, 'group_size':32}\n\nquant_config  = HqqConfig(dynamic_config={\n  'self_attn.q_proj':q4_config,\n  'self_attn.k_proj':q4_config,\n  'self_attn.v_proj':q4_config,\n  'self_attn.o_proj':q4_config,\n\n  'mlp.gate_proj':q3_config,\n  'mlp.up_proj'  :q3_config,\n  'mlp.down_proj':q3_config,\n})\n```\n#### HQQ 库\n```Python\nfrom hqq.core.quantize import *\nq4_config    = BaseQuantizeConfig(nbits=4, group_size=64) \nq3_config    = BaseQuantizeConfig(nbits=3, group_size=32)\n\nquant_config = {'self_attn.q_proj':q4_config,\n  'self_attn.k_proj':q4_config,\n  'self_attn.v_proj':q4_config,\n  'self_attn.o_proj':q4_config,\n\n  'mlp.gate_proj':q3_config,\n  'mlp.up_proj'  :q3_config,\n  'mlp.down_proj':q3_config,\n}\n```\n\n### VLLM\n您可以在 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\u002F\">vllm\u003C\u002Fa> 中使用 HQQ。请确保在使用该后端之前安装 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fdropbox\u002Fgemlite\u002F\">GemLite\u003C\u002Fa>。\n\n```Python\n# 或者您也可以进行在线量化\nfrom hqq.utils.vllm import set_vllm_onthefly_hqq_quant\nskip_modules = ['lm_head', 'visual', 'vision']\n\n# 选择以下模式之一：\n\n# INT\u002FFP 格式\nset_vllm_onthefly_hqq_quant(weight_bits=8, group_size=None, quant_mode='int8_weightonly', skip_modules=skip_modules) #A16W8 - 仅 INT8 权重\nset_vllm_onthefly_hqq_quant(weight_bits=4, group_size=128, quant_mode='int4_weightonly', skip_modules=skip_modules) #A16W4 - 仅 HQQ 权重\nset_vllm_onthefly_hqq_quant(weight_bits=8, quant_mode='int8_dynamic', skip_modules=skip_modules) #A8W8 - 动态 INT8 x INT8\nset_vllm_onthefly_hqq_quant(weight_bits=8, quant_mode='fp8_dynamic', skip_modules=skip_modules) #A8W8 - 动态 FP8 x FP8\n\n# MXFP 格式\nset_vllm_onthefly_hqq_quant(weight_bits=8, group_size=None, quant_mode='mxfp8_dynamic', skip_modules=skip_modules) #A8W8 - MXFP8 x MXPF8 - post_scale=True\nset_vllm_onthefly_hqq_quant(weight_bits=8, group_size=32, quant_mode='mxfp8_dynamic', skip_modules=skip_modules) #A8W8 - MXFP8 x MXPF8 - post_scale=False\nset_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp4_weightonly', skip_modules=skip_modules) #A16W4 - 仅 MXFP4 权重\nset_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp8_dynamic', skip_modules=skip_modules) #A8W4 - 动态 MXFP8 x MXFP4\nset_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp4_dynamic', skip_modules=skip_modules) #A4W4 - 动态 MXPF4 x MXPF4\nset_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='nvfp4_dynamic', skip_modules=skip_modules) #A4W4 - 动态 NVFP4 x NVFP4\n\n\nllm = LLM(model=\"meta-llama\u002FLlama-3.2-3B-Instruct\", max_model_len=4096, gpu_memory_utilization=0.80, dtype=torch.float16)\n```\n\n### Peft 训练\nPeft 训练在 HuggingFace 的 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fpeft\u002Fv0.12.0\u002Fen\u002Fdeveloper_guides\u002Fquantization#hqq-quantization\">peft 库\u003C\u002Fa>中得到直接支持。如果您仍想使用 hqq-lib 的 peft 工具，操作方法如下：\n\n```Python\n# 首先，量化或加载一个已量化的 HQQ 模型\nfrom hqq.core.peft import PeftUtils\n\nbase_lora_params = {'lora_type':'default', 'r':32, 'lora_alpha':64, 'dropout':0.05, 'train_dtype':torch.float32}\nlora_params      = {'self_attn.q_proj': base_lora_params,\n                    'self_attn.k_proj': base_lora_params,\n                    'self_attn.v_proj': base_lora_params,\n                    'self_attn.o_proj': base_lora_params,\n                    'mlp.gate_proj'   : 
None,\n                    'mlp.up_proj'     : None,\n                    'mlp.down_proj'   : None}\n\n\n# 将 LoRA 添加到线性\u002FHQQ 模块\nPeftUtils.add_lora(model, lora_params)\n\n# 可选：设置您的后端\nHQQLinear.set_backend(HQQBackend.ATEN if axis==0 else HQQBackend.PYTORCH_COMPILE)\n\n# 开始训练....\n\n# 将 LoRA 权重转换为与模型相同的数据类型，以加快推理速度\nmodel.eval()\nPeftUtils.cast_lora_weights(model, dtype=compute_dtype)\n\n# 保存 LoRA 权重\nPeftUtils.save_lora_weights(model, filename)\n\n# 加载 LoRA 权重：会自动调用 add_lora\nPeftUtils.load_lora_weights(model, filename)\n```\n\n我们提供了一个完整的示例，展示如何使用 HQQ\u002FLoRA 训练模型，您可以在 `examples\u002Fhqq_plus.py` 中找到该示例。\n\n如果您希望通过 FSDP 进行多 GPU 训练，请查看 Answer.AI 提供的优秀仓库：https:\u002F\u002Fgithub.com\u002FAnswerDotAI\u002Ffsdp_qlora\n\n### 示例\n我们在 `examples` 目录中提供了多种示例，展示了在不同后端上对模型进行量化的过程。\n\n### 引用 📜\n```\n@misc{badri2023hqq,\ntitle  = {Half-Quadratic Quantization of Large Machine Learning Models},\nurl    = {https:\u002F\u002Fdropbox.github.io\u002Fhqq_blog\u002F},\nauthor = {Hicham Badri and Appu Shaji},\nmonth  = {November},\nyear   = {2023}\n}\n```","# HQQ 快速上手指南\n\nHQQ (Half-Quadratic Quantization) 是一款快速且高精度的模型量化工具，最大特点是不需要校准数据即可对大型模型（如 LLM、视觉模型）进行量化。支持 8\u002F4\u002F3\u002F2\u002F1 bit 多种精度，并兼容 PEFT 训练和 `torch.compile`。\n\n## 环境准备\n\n*   **操作系统**: Linux \u002F Windows \u002F macOS\n*   **Python**: 建议 3.8+\n*   **PyTorch**: 版本需 >= 2.0，且必须与本地 CUDA 版本匹配。\n    *   安装前请确认环境：访问 [PyTorch 官网](https:\u002F\u002Fpytorch.org\u002F) 获取对应命令。\n*   **硬件**: 推荐使用 NVIDIA GPU 以获得最佳推理速度（支持 CUDA 后端）。\n\n## 安装步骤\n\n推荐使用 pip 直接安装最新稳定版或开发版。国内用户若遇网络问题，可配置清华或阿里镜像源。\n\n**方式一：安装最新稳定版**\n```bash\npip install hqq\n# 国内加速示例\n# pip install hqq -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n**方式二：安装最新开发版（推荐，包含最新优化）**\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fdropbox\u002Fhqq.git\n# 国内加速示例\n# pip install git+https:\u002F\u002Fgithub.com\u002Fdropbox\u002Fhqq.git -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n**方式三：禁用 CUDA 内核构建（仅使用纯 PyTorch 后端）**\n如果编译环境缺少 CUDA 工具链，可使用此命令：\n```bash\nDISABLE_CUDA=1 pip install hqq\n```\n\n或者克隆仓库后安装：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fdropbox\u002Fhqq.git\ncd hqq\npip install .\n```\n\n## 基本使用\n\n### 1. 核心用法：替换线性层\n这是最基础的使用方式，直接将 `torch.nn.Linear` 替换为 `HQQLinear`。\n\n```python\nimport torch\nfrom hqq.core.quantize import BaseQuantizeConfig, HQQLinear\n\n# 定义量化配置：4-bit, 分组大小 64 (推荐起始配置)\nquant_config = BaseQuantizeConfig(nbits=4, group_size=64)\n\n# 假设你有一个原始的 linear 层\n# your_linear_layer = torch.nn.Linear(512, 512) \n\n# 创建 HQQ 线性层并进行量化\nhqq_layer = HQQLinear(\n    your_linear_layer,          # 原始 torch.nn.Linear 层，若为 None 则稍后初始化\n    quant_config=quant_config,  # 量化配置\n    compute_dtype=torch.float16,# 计算数据类型\n    device='cuda',              # 设备\n    initialize=True,            # True: 立即量化; False: 稍后量化\n    del_orig=True               # True: 量化后删除原始层以节省显存\n)\n\n# 使用前向传播 (自动反量化计算)\n# x = torch.randn(1, 512).half().cuda()\n# y = hqq_layer(x)\n\n# 手动反量化获取权重 (可选)\n# W_r = hqq_layer.dequantize() \n```\n\n### 2. 结合 Hugging Face Transformers\n快速加载并量化预训练模型。\n\n```python\nfrom transformers import AutoModelForCausalLM, HqqConfig\nimport torch\n\nmodel_id = \"meta-llama\u002FLlama-2-7b-hf\" # 示例模型\n\n# 配置量化参数\nquant_config = HqqConfig(nbits=4, group_size=64)\n\n# 直接加载并量化模型\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_id, \n    torch_dtype=torch.float16, \n    device_map=\"cuda\", \n    quantization_config=quant_config\n)\n\n# 此时 model 已完成量化，可直接用于推理或保存\n# model.save_pretrained(\".\u002Fquantized_model\")\n```\n\n### 3. 
进阶：使用 HQQ 库量化已加载的模型\n如果你已经在一个 CPU 或 GPU 上加载了模型，可以使用 HQQ 专用接口进行转换。\n\n```python\nfrom transformers import AutoModelForCausalLM\nfrom hqq.models.hf.base import AutoHQQHFModel\nfrom hqq.core.quantize import BaseQuantizeConfig\nimport torch\n\ncompute_dtype = torch.float16\ndevice = \"cuda\"\nmodel_id = \"meta-llama\u002FLlama-2-7b-hf\"\n\n# 1. 在 CPU 上加载原始模型\nmodel = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype)\n\n# 2. 执行量化\nquant_config = BaseQuantizeConfig(nbits=4, group_size=64) \nAutoHQQHFModel.quantize_model(\n    model, \n    quant_config=quant_config, \n    compute_dtype=compute_dtype, \n    device=device\n)\n\n# 3. 保存量化后的模型 (注意：此处不能使用标准的 from_pretrained 加载)\nsave_dir = \".\u002Fhqq_quantized_model\"\nAutoHQQHFModel.save_to_safetensors(model, save_dir)\n\n# 4. 加载量化模型\n# loaded_model = AutoHQQHFModel.from_quantized(save_dir)\n```\n\n### 4. 优化推理速度 (可选)\n量化完成后，可通过切换后端或使用融合内核进一步提升推理速度。需确保量化配置为 `axis=1`。\n\n```python\nfrom hqq.utils.patching import prepare_for_inference\n\n# 启用 Gemlite 后端 (推荐用于 4-bit, axis=1, float16)\n# 需先安装 gemlite: pip install git+https:\u002F\u002Fgithub.com\u002Fdropbox\u002Fgemlite.git\nprepare_for_inference(model, backend=\"gemlite\")\n\n# 或者启用 torch.compile 优化\n# prepare_for_inference(model) \n```","某初创团队试图在单张消费级显卡（如 RTX 3090）上部署参数量巨大的多模态大模型，以构建实时的智能客服系统。\n\n### 没有 hqq 时\n- **显存严重不足**：原始模型权重过大，直接加载即导致显存溢出（OOM），无法启动服务。\n- **校准数据缺失**：传统量化方法依赖大量代表性校准数据集，团队缺乏此类数据且收集清洗耗时数周。\n- **量化过程缓慢**：现有工具对超大模型进行量化往往需要数小时甚至更久，严重拖慢迭代节奏。\n- **精度损失不可控**：强行降低位宽（如 4-bit）后，模型在复杂对话中的逻辑推理能力显著下降，回答质量堪忧。\n\n### 使用 hqq 后\n- **极速低显存部署**：hqq 支持无需校准数据的 4-bit 量化，几分钟内即可将模型压缩至单卡显存范围内并成功运行。\n- **零数据依赖**：完全跳过校准步骤，直接对任意架构模型（LLM 或视觉模型）进行量化，立即投入使用。\n- **推理速度提升**：配合 `axis=1` 设置与优化的 CUDA 内核，量化后的模型在保持高精度的同时，实现了流畅的实时响应。\n- **灵活精度平衡**：通过调整 `nbits` 和 `group_size` 参数，团队在不增加显存负担的前提下，微调出了兼顾速度与智能度的最佳配置。\n\nhqq 通过无校准的快速量化技术，让资源受限的团队也能低成本、高效率地落地超大规模 AI 模型。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fdropbox_hqq_c4ea1d60.png","dropbox","Dropbox","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fdropbox_7b475138.png","",null,"https:\u002F\u002Fdropbox.com\u002F","https:\u002F\u002Fgithub.com\u002Fdropbox",[80,84,88],{"name":81,"color":82,"percentage":83},"Python","#3572A5",91.5,{"name":85,"color":86,"percentage":87},"Cuda","#3A4E3A",6,{"name":89,"color":90,"percentage":91},"C++","#f34b7d",2.4,925,90,"2026-04-06T09:40:55","Apache-2.0","Linux","必需 NVIDIA GPU。安装需匹配 CUDA 版本（具体版本未说明，需自行对照 PyTorch 2.x 要求）。优化推理后端（如 GemLite）仅支持 axis=1 配置。示例中提到在 RTX 4090 上运行 Llama3-8B 4-bit 模型。","未说明",{"notes":100,"python":98,"dependencies":101},"1. 必须安装与当前 CUDA 版本匹配的 PyTorch 2.x 版本。\n2. 默认使用 PyTorch 后端，若需禁用 CUDA 内核构建可设置环境变量 DISABLE_CUDA=1。\n3. ATEN 后端仅支持 axis=0 配置，而优化的推理后端（GemLite, TorchAO）仅支持 axis=1 配置。\n4. 支持 1\u002F2\u002F3\u002F4\u002F8 bit 量化，推荐初始设置为 nbits=4, group_size=64, axis=1。\n5. 若使用 vLLM 集成，需预先安装 GemLite 库。\n6. 
支持 PEFT\u002FLoRA 微调训练。",[102,103,104,105,106],"torch>=2.0","transformers","peft","vllm (可选)","gemlite (可选，用于加速)",[14,35],[109,110,111],"machine-learning","quantization","llm","2026-03-27T02:49:30.150509","2026-04-07T22:49:53.339544",[115,120,125,130,135,139],{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},22911,"微调 1-bit 量化模型时损失不下降且生成文本质量差，可能是什么原因？","这通常不是 HQQ 库本身的问题，而是环境或依赖版本不兼容导致的。有用户反馈在更新 `trl` 库到最新版本（通过 `pip install trl --upgrade`）后问题解决。请检查您的 `trl` 版本是否与当前 HQQ 版本兼容，建议安装最新版重试。","https:\u002F\u002Fgithub.com\u002Fdropbox\u002Fhqq\u002Fissues\u002F115",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},22912,"为什么对模型进行 4-bit 量化后推理速度反而变慢了？","量化后变慢通常是因为错误地使用了 `torch.compile`。编译仅在输入形状静态时有效，如果在预填充阶段（prefill）或输入长度\u002F批次大小变化频繁时使用编译，会导致反复重新编译，从而显著增加耗时。解决方案：对于动态输入场景，不要手动调用 `torch.compile`，仅使用 Flash Attention 2 并量化模型即可；对于 Qwen2-VL 等模型，新版 transformers 已在 `generate` 中自动处理编译，无需外部编译。","https:\u002F\u002Fgithub.com\u002Fdropbox\u002Fhqq\u002Fissues\u002F138",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},22913,"如何在 Colab T4 上正确运行量化后的 Llama-2-7b 模型并进行流式推理？","确保使用正确的加载方式和推理配置。参考成功运行的代码示例：使用 `HQQModelForCausalLM.from_quantized` 加载模型，设置 `tokenizer.add_bos_token = False` 和 `tokenizer.add_eos_token = False`，若没有 pad_token 需手动添加。推理时使用 `TextIteratorStreamer` 配合多线程实现流式输出。注意不要将 `.to(cuda)` 错误地用在字符串返回值上（应为 `device='cuda'` 参数传递）。具体可参考社区提供的成功 Notebook 示例。","https:\u002F\u002Fgithub.com\u002Fdropbox\u002Fhqq\u002Fissues\u002F135",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},22914,"运行 HQQ 示例脚本时遇到 'cache_size_limit reached' 警告或错误，如何解决？","该警告通常与 tokenizer 的聊天模板未设置有关，但不影响核心功能。可以通过设置环境变量 `os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"` 避免并行化问题。如果使用的是较新版本的 transformers，建议显式指定聊天模板或升级到支持自动模板的版本。此外，预热（warm-up）阶段并不需要聊天模板，可跳过相关步骤直接测试推理。","https:\u002F\u002Fgithub.com\u002Fdropbox\u002Fhqq\u002Fissues\u002F129",{"id":136,"question_zh":137,"answer_zh":138,"source_url":124},22915,"使用 HQQ 量化多模态模型（如 MiniCPM-V 或 Qwen2-VL）时需要注意什么？","只需量化语言模型部分（如 `model.llm`），视觉编码器（VPM）和重采样器（resampler）应保持原精度并移至 CUDA。对于 Qwen2-VL，升级 `transformers>=4.47.1` 可修复 m-rope 相关问题，并且新版已内置自动编译逻辑，无需手动调用 `torch.compile`。推荐使用 `attn_implementation=\"sdpa\"` 或 `\"flash_attention_2\"` 提升效率，避免在动态输入场景下强制编译。",{"id":140,"question_zh":141,"answer_zh":142,"source_url":134},22916,"HQQ 支持哪些后端？如果 Marlin 或 BitBlas 导入失败怎么办？","HQQ 支持多种后端包括 PYTORCH、PYTORCH_COMPILE、MARLIN、BITBLAS 等。若出现 \"failed to import the Marlin backend\" 或 \"BitBlas backend\" 警告，表示这些可选后端未正确安装，但不影响基础功能。如需使用，请根据提示访问对应 GitHub 项目（https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002Fmarlin 或 https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FBitBLAS）按说明安装。默认情况下，HQQ 会自动回退到 PyTorch 后端继续运行。",[144,149,154,159,164,169,174,179,184,189,194,199,204,209,214,219,224,229,234,239],{"id":145,"version":146,"summary_zh":147,"released_at":148},136677,"v0.2.8.post1","小型 TOML 文件补丁","2025-10-20T15:39:37",{"id":150,"version":151,"summary_zh":152,"released_at":153},136678,"v0.2.8","Bug 修复：\n- 修复在新版本 Transformer 模型中静态缓存的初始化问题\n- 添加 mxfp vLLM 补丁工具\n- 改进 Transformer 模型的 CUDA 图和编译设置\n","2025-08-18T10:55:42",{"id":155,"version":156,"summary_zh":157,"released_at":158},136679,"0.2.7.post1","错误修复：\n- 修复生成过程中的 HIP 图：https:\u002F\u002Fgithub.com\u002Fmobiusml\u002Fhqq\u002Fcommit\u002Fbc8f4c7d778a0cdbfe115299ea7253ed28948d31\n- 修复 `HQQLinear` 在线性输入为 None 时的问题：https:\u002F\u002Fgithub.com\u002Fmobiusml\u002Fhqq\u002Fcommit\u002F3b86ac950f699a4ca3584cb18bea023b2f5e1da9","2025-06-12T15:16:46",{"id":160,"version":161,"summary_zh":162,"released_at":163},136680,"0.2.7","- 修复当 `max - min` 非常小时出现的 
`nan` 错误：https:\u002F\u002Fgithub.com\u002Fmobiusml\u002Fhqq\u002Fcommit\u002F373cbea93892cb491a3c072e0036a37848926404\n- 添加 `DISABLE_CUDA=1` 环境变量，以禁用为 aten 后端构建 CUDA 内核。这可以加快 pip 构建速度。https:\u002F\u002Fgithub.com\u002Fmobiusml\u002Fhqq\u002Fcommit\u002F861f6906a2ebf4c864603d7eebd2091b9beb2a77\n- 改进内存使用情况：https:\u002F\u002Fgithub.com\u002Fmobiusml\u002Fhqq\u002Fcommit\u002Fa566c78961ea408c747ad2a9bd4f3a9235ff3b70\n- 修复 vLLM 的 PyTorch 回退逻辑：https:\u002F\u002Fgithub.com\u002Fmobiusml\u002Fhqq\u002Fcommit\u002Fd3f14b494eb9939e05a7aba854796eab13da3d3b","2025-06-02T08:07:03",{"id":165,"version":166,"summary_zh":167,"released_at":168},136681,"0.2.6","- 修复 CUDA 构建问题\n- 为 hqq_aten 添加 `torchcompile()` 支持\n- 为 vllm\u002Fhqq 添加 bfloat16 支持\n- 更新 vllm 工具，以支持 `hqq_gemlite` 和 `hqq_torch` 别名\n- 修复 vLLM v1 的问题\n- 将 `save_to_safetensors` 扩展至 VLM\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fmobiusml\u002Fhqq\u002Fcompare\u002Fv0.2.5...0.2.6","2025-05-13T11:05:53",{"id":170,"version":171,"summary_zh":172,"released_at":173},136682,"v0.2.5","-修复后端中的 `.name`\n-在 VLLM 打补丁时跳过 GemLite 中无效的输入\u002F输出特征尺寸\n-通过 GemLite 实现更快的 VLLM 打包","2025-03-17T15:24:26",{"id":175,"version":176,"summary_zh":177,"released_at":178},136683,"0.2.3.post1","错误修复：\n- 检查状态字典中的 `W_q`，以修复 PEFT 相关问题 https:\u002F\u002Fgithub.com\u002Fmobiusml\u002Fhqq\u002Fissues\u002F151\n- 修复与 `AutoHQQHFModel.save_to_safetensors` 相关的错误","2025-02-20T11:12:25",{"id":180,"version":181,"summary_zh":182,"released_at":183},136684,"0.2.3","* 通过打补丁实现 VLLM 支持——GemLite 后端 + 即时量化\n* 增加对 Aria 的支持\n* 增加加载量化 SequenceClassification 模型的功能\n* 通过（自定义 CUDA 图、SDPA 数学后端等）实现更快的解码\n* 修复与 torch compile 及 hf_generator 相关、针对较新版本 Transformers 的 bug\n* 修复无分组量化模型保存相关的 bug\n* 修复大型量化模型保存相关的 bug\n* 更新示例代码\n* 增加对 HQQLinear `.to(device)` 的支持\n","2025-02-17T08:43:36",{"id":185,"version":186,"summary_zh":187,"released_at":188},136685,"0.2.2","## HQQ v0.2.2\r\n\r\n- 支持在不使用 `HFGenerator` 的情况下进行静态缓存编译\r\n- 修复与 `torch.compile` 相关的各类问题","2024-09-12T15:23:16",{"id":190,"version":191,"summary_zh":192,"released_at":193},136686,"0.2.1","## HQQ v0.2.1\n\n- 为未初始化的层添加了 `HQQLinear.state_dict()` 方法。主要用于支持 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fpull\u002F33141。","2024-08-29T16:25:23",{"id":195,"version":196,"summary_zh":197,"released_at":198},136687,"0.2.0","## HQQ v0.2.0\r\n\r\n-  Bug fixes\r\n- Safetensors support for transformers via https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fpull\u002F33141\r\n- `quant_scale`, `quant_zero` and `offload_meta` are now deprecated. 
You can still use them with the hqq lib, but you can't use them with the transformers lib\r\n","2024-08-28T10:05:11",{"id":200,"version":201,"summary_zh":202,"released_at":203},136688,"v0.1.8","## HQQ v0.1.8\r\n\r\n-  Add BitBlas backend support \r\n- Simpler HQQLinear from weights `HQQLinear.from_weights(W, bias, etc.)`\r\n- Fix memory leak while swaping layers for the TorchAO Backend\r\n- Add `HQQLinear.unpack()` call ","2024-07-11T12:00:37",{"id":205,"version":206,"summary_zh":207,"released_at":208},136689,"v0.1.7.post3","## HQQ v0.1.7.post3\r\n\r\n-  Enable CPU quantization and runtime\r\n- `_load_state_dict` fix\r\n- fix `extra_repr` in `HQQLinear`\r\n- fix `from_quantized` bugs\r\n- fix `|` typing\r\n- fix 3-bit `axis=1` slicing bug\r\n- add 5\u002F6 bit for testing","2024-05-28T07:48:18",{"id":210,"version":211,"summary_zh":212,"released_at":213},136690,"0.1.7.post2","## HQQ v0.1.7.post2\r\n- Various bug fixes, especially with `AutoHQQHFModel` and the patching logic, to make it work with any transformers model.\r\n- Readme refactoring.\r\n- Whisper example.\r\n","2024-05-06T16:41:06",{"id":215,"version":216,"summary_zh":217,"released_at":218},136691,"0.1.7","## HQQ v0.1.7\r\n-  Faster inference with torchao \u002F marlin 4-bit kernels\r\n- Multi-gpu support for `model.quantize()`\r\n- Custom HF generator\r\n- Various bug fixes\u002Fimprovements\r\n","2024-04-24T08:59:54",{"id":220,"version":221,"summary_zh":222,"released_at":223},136692,"0.1.6.post2","## HQQ v0.1.6.post2\r\nSame as \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmobiusml\u002Fhqq\u002Freleases\u002Ftag\u002F0.1.6\">v0.1.6\u003C\u002Fa> with ```setup.py``` fixes:\r\n\r\n- ```find_packages``` fix: https:\u002F\u002Fgithub.com\u002Fmobiusml\u002Fhqq\u002Fpull\u002F25 \r\n- Auto-build CUDA kernels via pypi package: https:\u002F\u002Fgithub.com\u002Fmobiusml\u002Fhqq\u002Fpull\u002F26\r\n","2024-03-19T18:24:14",{"id":225,"version":226,"summary_zh":227,"released_at":228},136693,"0.1.6.post1","## HQQ v0.1.6.post1\r\nSame as \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmobiusml\u002Fhqq\u002Freleases\u002Ftag\u002F0.1.6\">v0.1.6\u003C\u002Fa> with a  ```find_packages``` fix https:\u002F\u002Fgithub.com\u002Fmobiusml\u002Fhqq\u002Fpull\u002F25 ","2024-03-19T15:16:03",{"id":230,"version":231,"summary_zh":232,"released_at":233},136694,"0.1.6","## HQQ v0.1.6\r\nUse v0.1.6.post1 instead, unless you clone the repo first then install. \r\n\r\n### Features\r\n- Quantize on target device.\r\n- Meta-offloading uses pinned memory for faster\u002Fasync transfers.\r\n- Loading saved LoRA weights automatically adds LoRA modules if not already present.\r\n- ```pip install``` automatically compiles the CUDA kernels now.\r\n- CUDA backend automatically detected and used when available.\r\n- You can quantize any HF model automatically via ```AutoHQQHFModel```.\r\n- Faster meta-offloading with CUDA streams (experimental).\r\n- Int8 matmul (experimental).\r\n- Shared memory CUDA kernels (experimental).\r\n\r\n### Bugs\r\n- Fix Peft bias dtype.\r\n- Removed auto backend setting in LoRA.\r\n- All ```HQQLinear``` dtype\u002Fdevice-related overloads now return self which should solve a couple of issues.\r\n\r\n### Other\r\n- Refactor backends (using backprop backends by default now).\r\n- Added typing.\r\n- Ruff fix and reformat all Python files.\r\n- Refactor ATEN for reference tensors.\r\n\r\n### Issues\r\n- Using CUDA streams for offloading is faster but uses more memory (~+700MB with Llama2-7B 2-bit\u002Fgs=16) . 
In fact, sometimes it's almost  as fast as keeping data on the GPU, so worth looking into this. \r\n- Shared memory CUDA kernels are a bit slower than without for some reason.\r\n- The block size setting doesn't have much influence on the speed.\r\n- Int8 matmul is slower than fp16 with the current \"placeholder\" implementation, it should be done on the Aten\u002FCUDA side. \r\n","2024-03-19T13:35:57",{"id":235,"version":236,"summary_zh":237,"released_at":238},136695,"0.1.5","## HQQ v0.1.5\r\n### New features\r\n- Added support for multi-gpu FSDP QLoRA training  (https:\u002F\u002Fgithub.com\u002Fmobiusml\u002Fhqq\u002Fpull\u002F17)\r\n\r\n### Issues\r\n- ```torch.compile``` and the ```PYTORCH_COMPILE``` backend break with ```view_as_float=True```. No known solution for the moment.\r\n- A bit slower inference with   ```view_as_float=True```. Solution: after training, the user can revert back to in bitpacking. \r\n","2024-03-01T10:50:55",{"id":240,"version":241,"summary_zh":242,"released_at":243},136696,"0.1.4","## HQQ v0.1.4\r\n### New features\r\n- Added 1-bit support with CUDA dequant kernels.\r\n","2024-02-28T09:55:30"]