[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Vahe1994--SpQR":3,"tool-Vahe1994--SpQR":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",150037,2,"2026-04-10T23:33:47",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":65,"owner_company":75,"owner_location":65,"owner_email":65,"owner_twitter":65,"owner_website":65,"owner_url":76,"languages":77,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":98,"env_os":99,"env_gpu":100,"env_ram":101,"env_deps":102,"category_tags":109,"github_topics":65,"view_count":32,"oss_zip_url":65,"oss_zip_packed_at":65,"status":17,"created_at":110,"updated_at":111,"faqs":112,"releases":143},6432,"Vahe1994\u002FSpQR","SpQR",null,"SpQR 是一款专为大型语言模型（LLM）设计的开源权重压缩工具，旨在实现“近乎无损”的模型瘦身。它主要解决了大模型因参数量巨大而导致的存储成本高、显存占用大及部署困难等痛点，让用户能在有限的硬件资源下运行更强大的模型。\n\n该工具特别适合 AI 研究人员、算法工程师以及需要在本地或边缘设备部署大模型的开发者使用。其核心亮点在于提出了一种独特的“稀疏 - 量化”混合表示法：通过智能识别并单独处理权重中的异常值（Outliers），对其余部分进行高精度量化。这种策略使得 SpQR 在将模型压缩至极低比特（如 4-bit）的同时，仍能保持与原始浮点模型极其接近的 perplexity（困惑度）表现，显著优于传统量化方案。\n\n目前，SpQR 已支持 LLaMA、Falcon 和 OPT 等主流模型家族。虽然运行完整压缩流程需要较高的显存配置（如 A100），但它也提供了激活卸载等选项，使在消费级显卡上评估压缩后模型性能成为可能。对于追求极致压缩率且不愿牺牲模型智能的用户而言，SpQR 是一个值得深入探索的技术选择。","# SpQR model 
compression\n\n    \nThis repository accompanies the research paper \"[SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.03078)\".\n\n# Installation\n\n### Packages\n\nTo run SpQR with `falcon` make sure that you have `torch>=2.0.0` with `CUDA` support.\n\nInstall packages from `requirements.txt`:\n```bash\npip install -r requirements.txt\n```\n\n__Note:__ the results reported in the arXiv paper were obtained using the `4.28.dev0` version of `transformers`, commit id [`464d420775`](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Farchive\u002F464d420775653885760e30d24d3703e14f4e8a14.zip).\n\n\n### Loading \u002F caching datasets and tokenizer\n\nThe script needs to download and locally cache the relevant tokenizer and datasets. \nThey will be saved in the default Huggingface Datasets directory unless an alternative location is provided via environment variables.\nSee the [relevant Datasets documentation section](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fmain\u002Fen\u002Fcache#cache-directory).\n### Models\n\nThis repository is expected to work with models of the `LLaMA`, `Falcon` and `OPT` families so far.\n\n#### Data\n\nFor quantization with SpQR it is recommended to use a subset of the data the model \nwas trained on. For example, for quantization of `LLaMA` models we recommend using a subset\nof [RedPajama](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftogethercomputer\u002FRedPajama-Data-1T-Sample), and for `Falcon` quantization, [RefinedWeb](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftiiuae\u002Ffalcon-refinedweb). Both subsets are stored in the `data` directory: \n* `data\u002Fred_pajama_n=1024.pth`\n* `data\u002Frefined_web_n=128.pth`\n  \n**Note** These subsets are already processed with the corresponding model tokenizer. Using them with a different model will lead to\nunexpected behavior.\n\n For `OPT`, following the GPTQ paper, we recommend using `c4`. 
\n\n### W&B logging\n\nFor the sake of convenience one can optionally log the data to `Weights and Biases` service (wandb).\nRun `pip install wandb` for W&B logging.\nSpecify `$WANDB_ENTITY`, `$WANDB_PROJECT`, `$WANDB_NAME` environment variables prior to running experiments. use `--wandb` argument to enable logging\n\n# Launching\n\n### GPU and RAM requirements\nThis code was developed and tested using a single A100 GPU with 80GB GPU RAM. It may successfully run on GPUs with 32GB+ VRAM for perplexity evaluation of up to `LLaMA-65B` and `Falcon-40B` models. \nWith `--offload activations` option, the model perplexity may be evaluated on machines with less VRAM: 24GB+ for Llama 65B and 6GB+ for Llama 7B.\nThe perplexity testing code also requires RAM amount sufficient to hold uncompressed model weights (e.g. ~130GB for Llama65B) and testing datasets.\nFor `Language Model Evaluation Harness` evaluation one needs to have enough memory to load whole model\non one or several devices + activation tensors.\n\n### Model downloading\nThe code requires the LLaMA model to be downloaded in Huggingface format and saved locally. The scripts below assume that `$TRANSFORMERS_CACHE` variable points to the Huggingface Transformers cache folder.\n\n### Perplexity benchmarks:\nThis script compresses the model and then tests its performance in terms of perplexity using WikiText2, C4, and Penn Treebank datasets. \n\nThe command to launch the script should look like this: \n\n```\nexport MODEL_PATH=\u003CPATH_TO_MODEL_DIR>\nexport DATASET=\u003CINSERT DATASET NAME OR PATH TO CUSTOM DATA>\n\npython main.py $MODEL_PATH $DATASET \\\n    --wbits 4 \\\n    --groupsize 16 \\\n    --perchannel \\\n    --qq_scale_bits 3 \\\n    --qq_zero_bits 3 \\\n    --qq_groupsize 16 \\\n    --outlier_threshold=0.2 \\\n    --permutation_order act_order \\\n    --percdamp 1e0 \\\n    --nsamples 128 \n```\nThe command above runs near-lossless compression as described in the article. 
Adjusting the above parameters allows for tighter compression with a slightly greater loss. \n\nNote the launch arguments:\n- `\u003CPATH_TO_MODEL_DIR>` - path to model folder, which contains `config.json `\n- `one of [c4, ptb, wikitext2, pajama, refinedweb, none]` -- name of dataset to use for compression, or path to an alternative preprocessed and tokenized dataset.\n- `--wbits 3` -- number of bits for quantized weights representation\n- `--groupsize 16` -- size of first-order groups for compression\n- `--qq_groupsize 16` -- size of second-order (quantized) groups for compression\n- `--qq_scale_bits 3 --qq_zero_bits 3` -- bit sizes for quantizing first order weights' scale and zeros.\n- `--offload activations` -- moves activations to RAM when not used. Reduces VRAM usage while slowing work by ~10%. \nrun `python main.py --help` for more details on command line arguments, including compression parameters.\n- `--save --load` -- path to save\u002Fload quantized model.\n### LM Evaluation Harness benchmark.\n\nTo perform zero-shot evaluation, we use [Language Model Evaluation Harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) framework with slight modifications. This repository contains a copy of LM Evaluation Harness repo from early 2023 in `lm-eval-harness` folder. \n#### Installation\nBefore running the code make sure that you have all the requirements and dependencies of `lm-eval-harness` installed. To install them run:\n```\npip install -r lm-evaluation-harness\u002Frequirements.txt\n```\n#### Execution\n\nThe main script launching the evaluation procedure is `lmeval.py` .\n\nNote. Current version of the script support only LLaMA\u002FFalcon quantization. Therefore, set:\n* `--model=hf-causal`\n* `--model_args pretrained=$MODEL_PATH` where `$MODEL_PATH` has to be one of the LLaMA models\n  \n`--quantization_args` - list of comma separated arguments for quantizer. 
For details and options\nrefer to `spqr_config.py`.\n\nBelow is an example of a benchmark launch.\n\n```\nexport MODEL_PATH=\u003CINSERT PATH_TO_MODEL_DIR>\nexport DATASET=\u003CINSERT DATASET NAME OR PATH TO CUSTOM DATA>\n\npython lmeval.py \\\n    --model hf-causal \\\n    --model_args pretrained=$MODEL_PATH,dtype=float16,use_accelerate=True \\\n    --quantization_args dataset=$DATASET,wbits=4,groupsize=16,perchannel=True,qq_scale_bits=3,qq_zero_bits=3,qq_groupsize=16,percdamp=1.0,outlier_threshold=0.2,simplified_outliers=False,nsamples=128,offload_activations=True \\\n    --tasks winogrande,piqa,hellaswag,arc_easy,arc_challenge \\\n    --batch_size 1\n```\n\nPerformance and runtime notes:\n* For large models (LLaMA-30B, LLaMA-65B) specify `max_memory_per_gpu={value}GIB` so that 15-20GIB of GPU memory stay free on each GPU to store activations for calibration. \n* `offload_activations=True` slightly reduces peak memory consumption \n* Typically `LLaMA-30B` requires 1-2 A100 GPUs with 80GB of memory and `LLaMA-65B` requires 3 A100 GPUs with 80GB each.\n* With enough spare GPU memory, one can raise the batch size to accelerate the evaluation process.\n\n\n## Inference\n\nThis repository also contains an efficient CUDA kernel implementation of the \nSpQR matvec. The file `inference_demo.py` contains a demo of this functionality \nby running end-to-end model inference. Below is an example of how to launch it.\n\n```bash\nusage: inference_demo.py [-h] [--pretrained_model_path PRETRAINED_MODEL_PATH] [--compressed_model_path COMPRESSED_MODEL_PATH] --execution_mode {0,1}\n\noptions:\n  -h, --help            show this help message and exit\n  --pretrained_model_path PRETRAINED_MODEL_PATH\n                        Path to the model to the pretrained model\n  --compressed_model_path COMPRESSED_MODEL_PATH\n                        Path to the compressed .pt model\n  --execution_mode {0,1}\n                        If set to 0, will evaluate the dense pretrained model. 
If set to 1, will evaluate the spqr-quantized model\n```\n\nThis script also reports the mean and median time of the forward() passes and the total inference execution time. \n\n# Pre-Requisites for Running the Conversion Scripts, Tests and Benchmarks\n\nIn order to run the benchmark and test suite you need to build the sources used by these scripts.\nYou can do so by running the following command:\n\n```bash\n\u002Fbin\u002Fbash scripts\u002Fbuild.sh \n```\n\nwhich simply runs the `setup.py` script.\n\n# Conversion From Legacy to Optimized SPQR Storage\n\nSpQR produces tensors stored in int8. To run the efficient inference kernels, \none must convert the tensors produced by SpQR (legacy tensors) into the optimized storage format used by \nthe CUDA kernel. In order to do so, run the following script:\n\n```bash\nusage: convert_legacy_model_format.py [-h] --base_model BASE_MODEL --legacy_model_path LEGACY_MODEL_PATH [--sparse_strategy {csr,ptcsr,optimize_latency}] [--save_pt SAVE_PT] [--save_per_layer SAVE_PER_LAYER]\n\noptions:\n  -h, --help            show this help message and exit\n  --base_model BASE_MODEL\n                        path or name of the unquantized model\n  --legacy_model_path LEGACY_MODEL_PATH\n                        path to legacy model\n  --sparse_strategy {csr,ptcsr,optimize_latency}\n                        Sparse strategy storage. Options: csr, ptcsr, auto. 
CSR - Compressed Sparse Rows PTCSR - Alternative storage format optimize_latency - Use the current GPU to determine the optimal storage format to reduce\n                        kernel latency\n  --save_pt SAVE_PT     Save the converted quantized .pt model here\n  --save_per_layer SAVE_PER_LAYER\n                        Save the converted quantized m\n```\n\n# Hugging Face Conversion\n\nTo convert a model into a Hugging Face compatible format, use the `convert_to_hf.py` script:\n\n```bash\nusage: convert_to_hf.py [-h] [--model MODEL] [--config_path CONFIG_PATH] [--in_path_pt IN_PATH_PT] [--out_path OUT_PATH] [--save_safetensors] [--trust_remote_code] [--load_model] [--save_tokenizer]\n\noptions:\n  -h, --help            show this help message and exit\n  --model MODEL         Path to the model to base config on, as in AutoConfig.from_pretrained()\n  --config_path CONFIG_PATH\n                        Path to the model to base config on, as in AutoConfig.from_pretrained()\n  --in_path_pt IN_PATH_PT\n                        Path of the checkpoint to convert\n  --out_path OUT_PATH   Path to save HF compatible checkpoint to\n  --save_safetensors    Whether to save in safetensors format\n  --trust_remote_code   Whether to trust remote code\n  --load_model          Whether to load model\n  --save_tokenizer      Whether to save tokenizer\n```\n\n# Benchmarks (matvec kernel)\n\nTo run the matvec benchmark suite, run:\n\n```bash \nbench_spqr.py [-h] --tensor_path TENSOR_PATH [--ptcsr_path PTCSR_PATH] [--output_path OUTPUT_PATH]\n\noptions:\n  -h, --help            show this help message and exit\n  --tensor_path TENSOR_PATH\n                        Path to folder containing the tensors of the formmodel_path\u002F 0\u002F tensor0 tensor1\n  --ptcsr_path PTCSR_PATH\n                        Path to folder containing the tensors of the formmodel_path\u002F 0\u002F tensor0 tensor1\n  --output_path OUTPUT_PATH\n                        Path to results *.csv 
file.\n\n```\n\nMake sure that the `\u003Ctensor_path>` and the optional `\u003Cptcsr_path.` point to a folder containing quantized matrices produced by the `convert_legacy_model_format.py` script.\nUse `\u003Ccuda_device_id>` to set the cuda device during benchmark. The script outputs the results in `\u003Cresults_output>`.\n\n# Tests\n\nIn order to run the unittest, simply execute:\n\n```bash\npython3 tests\u002Ftest.py\n```\n\n\n## Citation\n```\n@misc{dettmers2023spqr,\n      title={SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression}, \n      author={Tim Dettmers and Ruslan Svirschevski and Vage Egiazarian and Denis Kuznedelev and Elias Frantar and Saleh Ashkboos and Alexander Borzunov and Torsten Hoefler and Dan Alistarh},\n      year={2023},\n      eprint={2306.03078},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n","# SpQR 模型压缩\n\n它伴随的研究论文是“SpQR：用于近无损 LLM 权重压缩的稀疏量化表示”（https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.03078）。\n\n# 安装\n\n### 软件包\n\n要使用 `falcon` 运行 SpQR，请确保您已安装支持 `CUDA` 的 `torch>=2.0.0`。\n\n从 `requirements.txt` 安装软件包：\n```bash\npip install -r requirements.txt\n```\n\n__注意：__ ArXiv 论文中报告的结果是使用 `transformers` 的 `4.28.dev0` 版本获得的，其提交 ID 为 [`464d420775`](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Farchive\u002F464d420775653885760e30d24d3703e14f4e8a14.zip)。\n\n### 加载\u002F缓存数据集和分词器\n\n该脚本需要下载并本地缓存相关的分词器和数据集。它们将被保存在默认的 Huggingface 数据集目录中，除非通过环境变量指定了其他位置。请参阅 [相关数据集文档部分](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fmain\u002Fen\u002Fcache#cache-directory)。\n\n### 模型\n\n目前，此仓库预计可与 `LLaMA`、`Falcon` 和 `OPT` 系列模型配合使用。\n\n#### 数据\n\n对于使用 SpQR 进行量化，建议使用模型训练时所用数据的一个子集。例如，对于 `LLaMA` 模型的量化，我们推荐使用 [RedPajama](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftogethercomputer\u002FRedPajama-Data-1T-Sample) 的子集；而对于 `Falcon` 的量化，则推荐使用 [RefinedWeb](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftiiuae\u002Ffalcon-refinedweb)。这两个子集都存储在 `data` 目录中：\n* 
`data\u002Fred_pajama_n=1024.pth`\n* `data\u002Frefined_web_n=128.pth`\n\n**注意** 这些子集已经过相应模型分词器的处理。如果用于其他模型，可能会导致意外行为。\n\n对于 `OPT`，根据 GPTQ 论文的建议，我们推荐使用 `c4`。\n\n### W&B 日志记录\n\n为方便起见，可以选择将数据记录到 `Weights and Biases` 服务（wandb）。运行 `pip install wandb` 即可启用 W&B 日志记录。在运行实验之前，请指定 `$WANDB_ENTITY`、`$WANDB_PROJECT` 和 `$WANDB_NAME` 环境变量，并使用 `--wandb` 参数启用日志记录。\n\n# 启动\n\n### GPU 和 RAM 要求\n\n此代码是在配备 80GB 显存的单个 A100 GPU 上开发和测试的。它也可以在具有 32GB 以上显存的 GPU 上成功运行，用于评估高达 `LLaMA-65B` 和 `Falcon-40B` 模型的困惑度。使用 `--offload activations` 选项后，可以在显存较少的机器上评估模型的困惑度：Llama 65B 需要 24GB 以上，Llama 7B 则需要 6GB 以上。\n\n困惑度测试代码还需要足够的内存来容纳未压缩的模型权重（例如，Llama 65B 约需 130GB）以及测试数据集。对于 `Language Model Evaluation Harness` 评估，需要有足够的内存来将整个模型加载到一台或多台设备上，再加上激活张量。\n\n### 模型下载\n\n代码要求以 Huggingface 格式下载 LLaMA 模型并本地保存。以下脚本假定 `$TRANSFORMERS_CACHE` 变量指向 Huggingface Transformers 的缓存文件夹。\n\n### 困惑度基准测试\n此脚本会先压缩模型，然后使用 WikiText2、C4 和 Penn Treebank 数据集测试其困惑度性能。\n\n启动脚本的命令应如下所示：\n\n```\nexport MODEL_PATH=\u003C模型目录路径>\nexport DATASET=\u003C插入数据集名称或自定义数据路径>\n\npython main.py $MODEL_PATH $DATASET \\\n    --wbits 4 \\\n    --groupsize 16 \\\n    --perchannel \\\n    --qq_scale_bits 3 \\\n    --qq_zero_bits 3 \\\n    --qq_groupsize 16 \\\n    --outlier_threshold=0.2 \\\n    --permutation_order act_order \\\n    --percdamp 1e0 \\\n    --nsamples 128 \n```\n\n上述命令执行了文章中描述的近无损压缩。调整这些参数可以实现更紧密的压缩，但会带来略微更大的损失。\n\n请注意启动参数：\n- `\u003C模型目录路径>` — 包含 `config.json` 文件的模型文件夹路径。\n- `[c4, ptb, wikitext2, pajama, refinedweb, none]` 中的一个 — 用于压缩的数据集名称，或预处理并分词过的替代数据集路径。\n- `--wbits 3` — 量化权重表示的位数。\n- `--groupsize 16` — 用于压缩的一阶分组大小。\n- `--qq_groupsize 16` — 用于压缩的二阶（量化）分组大小。\n- `--qq_scale_bits 3 --qq_zero_bits 3` — 用于量化一阶权重尺度和零值的位数。\n- `--offload activations` — 在不使用激活时将其移至 RAM。这会减少显存占用，但会使运行速度降低约 10%。运行 `python main.py --help` 可获取有关命令行参数的更多详细信息，包括压缩参数。\n- `--save --load` — 用于保存\u002F加载量化模型的路径。\n\n### LM Evaluation Harness 基准测试\n\n为了进行零样本评估，我们使用经过轻微修改的 [Language Model Evaluation 
Harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) 框架。此仓库在 `lm-eval-harness` 文件夹中包含 2023 年初版本的 LM Evaluation Harness 代码库。\n\n#### 安装\n在运行代码之前，请确保已安装 `lm-eval-harness` 的所有要求和依赖项。运行以下命令进行安装：\n```\npip install -r lm-evaluation-harness\u002Frequirements.txt\n```\n\n#### 执行\n启动评估程序的主要脚本是 `lmeval.py`。\n\n注意：当前版本的脚本仅支持 LLaMA\u002FFalcon 的量化。因此，请设置：\n* `--model=hf-causal`\n* `--model_args pretrained=$MODEL_PATH`，其中 `$MODEL_PATH` 必须是 LLaMA 模型之一。\n\n`--quantization_args` — 量化器的逗号分隔参数列表。有关详细信息和选项，请参阅 `spqr_config.py`。\n\n以下是基准测试启动示例：\n\n```\nexport MODEL_PATH=\u003C插入模型目录路径>\nexport DATASET=\u003C插入数据集名称或自定义数据路径>\n\npython lmeval.py \\\n    --model hf-causal \\\n    --model_args pretrained=$MODEL_PATH,dtype=float16,use_accelerate=True \\\n    --quantization_args dataset=$DATASET,wbits=4,groupsize=16,perchannel=True,qq_scale_bits=3,qq_zero_bits=3,qq_groupsize=16,percdamp=1.0,outlier_threshold=0.2,simplified_outliers=False,nsamples=128,offload_activations=True \\\n    --tasks winogrande,piqa,hellaswag,arc_easy,arc_challenge \\\n    --batch_size 1\n```\n\n性能和运行注意事项：\n* 对于大型模型（LLaMA-30B、LLaMA-65B），请指定 `max_memory_per_gpu={value}GIB`，以便每块 GPU 保留 15-20GIB 的空闲显存，用于存储校准所需的激活数据。\n* `offload_activations=True` 可略微降低峰值内存消耗。\n* 通常，LLaMA-30B 需要 1-2 块 80Gb 显存的 A100 GPU，而 LLaMA-65B 则需要 3 块 80Gb 显存的 A100 GPU。\n* 如果有充足的空闲显存，可以提高批量大小以加快评估进程。\n\n## 推理\n\n该仓库还包含 SpQR 矩阵-向量乘法的高效 CUDA 内核实现。文件 `inference_demo.py` 包含一个端到端模型推理的演示。以下是运行该脚本的示例。\n\n```bash\n用法: inference_demo.py [-h] [--pretrained_model_path PRETRAINED_MODEL_PATH] [--compressed_model_path COMPRESSED_MODEL_PATH] --execution_mode {0,1}\n\n选项:\n  -h, --help            显示此帮助消息并退出\n  --pretrained_model_path PRETRAINED_MODEL_PATH\n                        预训练模型的路径\n  --compressed_model_path COMPRESSED_MODEL_PATH\n                        压缩后的 .pt 模型路径\n  --execution_mode {0,1}\n                        如果设置为 0，则评估密集的预训练模型。如果设置为 1，则评估 SpQR 量化模型\n```\n\n该脚本还会报告 forward() 步骤的平均和中位数时间，以及总的推理执行时间。\n\n# 
运行转换脚本、测试和基准测试的先决条件\n\n为了运行基准测试和测试套件，您需要构建这些脚本所使用的源代码。可以通过运行以下命令来完成：\n\n```bash\n\u002Fbin\u002Fbash scripts\u002Fbuild.sh \n```\n\n该命令会简单地运行 `setup.py` 脚本。\n\n# 从旧版到优化 SPQR 存储格式的转换\n\n在运行 SpQR 并生成以 int8 存储的张量后，为了使用高效的推理内核，必须将 SpQR 生成的张量（旧版张量）转换为 CUDA 内核所使用的优化存储格式。为此，请运行以下脚本：\n\n```bash\n用法: convert_legacy_model_format.py [-h] --base_model BASE_MODEL --legacy_model_path LEGACY_MODEL_PATH [--sparse_strategy {csr,ptcsr,optimize_latency}] [--save_pt SAVE_PT] [--save_per_layer SAVE_PER_LAYER]\n\n选项:\n  -h, --help            显示此帮助消息并退出\n  --base_model BASE_MODEL\n                        未量化模型的路径或名称\n  --legacy_model_path LEGACY_MODEL_PATH\n                        旧版模型的路径\n  --sparse_strategy {csr,ptcsr,optimize_latency}\n                        稀疏存储策略。选项：csr、ptcsr、auto。CSR 表示压缩稀疏行格式；PTCSR 表示另一种存储格式；optimize_latency 表示使用当前 GPU 自动确定最优存储格式以减少内核延迟。\n  --save_pt SAVE_PT     将转换后的量化 .pt 模型保存至此处\n  --save_per_layer SAVE_PER_LAYER\n                        将转换后的量化 m\n```\n\n# Hugging Face 格式转换\n\n要将模型转换为 Hugging Face 兼容格式，请使用 `convert_to_hf.py` 脚本：\n\n```bash\n用法: convert_to_hf.py [-h] [--model MODEL] [--config_path CONFIG_PATH] [--in_path_pt IN_PATH_PT] [--out_path OUT_PATH] [--save_safetensors] [--trust_remote_code] [--load_model] [--save_tokenizer]\n\n选项:\n  -h, --help            显示此帮助消息并退出\n  --model MODEL         用于基于 AutoConfig.from_pretrained() 构建配置的基础模型路径\n  --config_path CONFIG_PATH\n                        用于基于 AutoConfig.from_pretrained() 构建配置的基础模型路径\n  --in_path_pt IN_PATH_PT\n                        要转换的检查点路径\n  --out_path OUT_PATH   保存 Hugging Face 兼容检查点的路径\n  --save_safetensors    是否以 safetensors 格式保存\n  --trust_remote_code   是否信任远程代码\n  --load_model          是否加载模型\n  --save_tokenizer      是否保存分词器\n```\n\n# 基准测试（matvec 内核）\n\n要运行 matvec 基准测试套件，应执行以下命令：\n\n```bash\nbench_spqr.py [-h] --tensor_path TENSOR_PATH [--ptcsr_path PTCSR_PATH] [--output_path OUTPUT_PATH]\n\n选项:\n  -h, --help            显示此帮助消息并退出\n  --tensor_path TENSOR_PATH\n                        包含形如 
model_path\u002F0\u002Ftensor0 tensor1 的张量的文件夹路径\n  --ptcsr_path PTCSR_PATH\n                        包含形如 model_path\u002F0\u002Ftensor0 tensor1 的张量的文件夹路径\n  --output_path OUTPUT_PATH\n                        结果 *.csv 文件的保存路径。\n```\n\n请确保 `\u003Ctensor_path>` 和可选的 `\u003Cptcsr_path>` 指向由 `convert_legacy_model_format.py` 脚本生成的量化矩阵所在的文件夹。使用 `\u003Ccuda_device_id>` 设置基准测试期间的 CUDA 设备。脚本会将结果输出到 `\u003Cresults_output>` 中。\n\n# 测试\n\n要运行单元测试，只需执行以下命令：\n\n```bash\npython3 tests\u002Ftest.py\n```\n\n\n## 引用\n```\n@misc{dettmers2023spqr,\n      title={SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression}, \n      author={Tim Dettmers and Ruslan Svirschevski and Vage Egiazarian and Denis Kuznedelev and Elias Frantar and Saleh Ashkboos and Alexander Borzunov and Torsten Hoefler and Dan Alistarh},\n      year={2023},\n      eprint={2306.03078},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```","# SpQR 快速上手指南\n\nSpQR 是一种用于大语言模型（LLM）权重的稀疏量化表示方法，旨在实现近乎无损的模型压缩。本指南帮助中国开发者快速部署并使用该工具。\n\n## 环境准备\n\n### 系统要求\n- **GPU**: 推荐单张 NVIDIA A100 (80GB)。\n  - 最小显存需求：\n    - 困惑度评估：32GB+ (支持 LLaMA-65B\u002FFalcon-40B)。\n    - 开启 `--offload activations` 后：24GB+ (LLaMA-65B) 或 6GB+ (LLaMA-7B)。\n- **内存 (RAM)**: 需足够容纳未压缩的模型权重（例如 LLaMA-65B 约需 130GB）及测试数据集。\n- **软件依赖**:\n  - Python 环境\n  - `torch >= 2.0.0` (必须支持 CUDA)\n  - `transformers` (论文结果基于版本 `4.28.dev0`)\n\n### 前置依赖\n确保已安装 CUDA 驱动，并配置好 Hugging Face 缓存目录（可选通过环境变量自定义）。\n\n## 安装步骤\n\n1. **克隆仓库** (假设已获取源码)\n2. **安装 Python 依赖**\n   ```bash\n   pip install -r requirements.txt\n   ```\n   > **提示**: 如需使用 W&B 日志记录，请额外运行 `pip install wandb`。\n\n3. **构建底层组件** (运行基准测试和转换脚本前必需)\n   ```bash\n   \u002Fbin\u002Fbash scripts\u002Fbuild.sh\n   ```\n\n4. 
**准备数据与模型**\n   - **模型**: 下载 Hugging Face 格式的 LLaMA、Falcon 或 OPT 模型到本地。\n   - **数据集**: 脚本会自动下载并缓存分词器及数据集。\n     - LLaMA 推荐使用 [RedPajama](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftogethercomputer\u002FRedPajama-Data-1T-Sample) 子集。\n     - Falcon 推荐使用 [RefinedWeb](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftiiuae\u002Ffalcon-refinedweb) 子集。\n     - OPT 推荐使用 `c4`。\n     - *注*: 仓库 `data` 目录下已提供预处理好的子集文件，请勿混用不同模型的分词数据。\n\n## 基本使用\n\n以下示例展示如何对模型进行近无损压缩并评估其困惑度（Perplexity）。\n\n### 1. 执行模型压缩与困惑度测试\n\n设置模型路径和数据集名称，运行主脚本：\n\n```bash\nexport MODEL_PATH=\u003CPATH_TO_MODEL_DIR>\nexport DATASET=\u003CINSERT DATASET NAME OR PATH TO CUSTOM DATA>\n\npython main.py $MODEL_PATH $DATASET \\\n    --wbits 4 \\\n    --groupsize 16 \\\n    --perchannel \\\n    --qq_scale_bits 3 \\\n    --qq_zero_bits 3 \\\n    --qq_groupsize 16 \\\n    --outlier_threshold=0.2 \\\n    --permutation_order act_order \\\n    --percdamp 1e0 \\\n    --nsamples 128\n```\n\n**参数说明**:\n- `\u003CPATH_TO_MODEL_DIR>`: 包含 `config.json` 的模型文件夹路径。\n- `\u003CDATASET>`: 数据集名称 (`c4`, `ptb`, `wikitext2`, `pajama`, `refinedweb`) 或自定义预处理数据路径。\n- `--wbits`: 量化权重位数 (如 3 或 4)。\n- `--offload activations`: 若显存不足可添加此参数，将激活值移至内存（速度降低约 10%）。\n- `--save --load`: 可用于保存或加载量化后的模型。\n\n### 2. 模型格式转换 (优化推理)\n\nSpQR 生成的初始张量为 int8 格式，为了使用高效的 CUDA 内核进行推理，需转换为优化存储格式：\n\n```bash\npython convert_legacy_model_format.py \\\n    --base_model \u003CPATH_TO_UNQUANTIZED_MODEL> \\\n    --legacy_model_path \u003CPATH_TO_LEGACY_SPQR_MODEL> \\\n    --sparse_strategy optimize_latency \\\n    --save_pt \u003CPATH_TO_SAVE_CONVERTED_PT>\n```\n\n### 3. 端到端推理演示\n\n使用转换后的模型进行推理测试，对比原始模型与量化模型的性能：\n\n```bash\npython inference_demo.py \\\n    --pretrained_model_path \u003CPATH_TO_PRETRAINED_MODEL> \\\n    --compressed_model_path \u003CPATH_TO_COMPRESSED_PT_MODEL> \\\n    --execution_mode 1\n```\n- `--execution_mode 0`: 评估原始稠密模型。\n- `--execution_mode 1`: 评估 SpQR 量化模型。\n\n### 4. 
(可选) 零样本任务评估\n\n使用修改版的 LM Evaluation Harness 进行基准测试：\n\n```bash\nexport MODEL_PATH=\u003CINSERT PATH_TO_MODEL_DIR>\nexport DATASET=\u003CINSERT DATASET NAME OR PATH TO CUSTOM DATA>\n\npython lmeval.py \\\n    --model hf-causal \\\n    --model_args pretrained=$MODEL_PATH,dtype=float16,use_accelerate=True \\\n    --quantization_args dataset=$DATASET,wbits=4,groupsize=16,perchannel=True,qq_scale_bits=3,qq_zero_bits=3,qq_groupsize=16,percdamp=1.0,outlier_threshold=0.2,simplified_outliers=False,nsamples=128,offload_activations=True \\\n    --tasks winogrande,piqa,hellaswag,arc_easy,arc_challenge \\\n    --batch_size 1\n```\n*注意：大模型（如 LLaMA-30B\u002F65B）可能需要多卡并行，请根据显存情况调整 `max_memory_per_gpu`。*","某 AI 初创团队试图将 650 亿参数的 LLaMA 大模型部署到单张消费级显卡上，以构建低成本的垂直领域客服助手。\n\n### 没有 SpQR 时\n- **显存严重不足**：原始模型权重占用超过 130GB 内存，即便使用高端 A100 也需多卡并行，完全无法在仅 24GB 显存的 RTX 4090 上运行。\n- **推理延迟过高**：为了强行适配硬件而采用粗糙的量化方案，导致模型“智力”大幅下降，回答逻辑混乱，无法满足客服场景的准确性要求。\n- **部署成本高昂**：被迫租用昂贵的云端多 GPU 集群，使得单次对话成本居高不下，商业落地几乎无利可图。\n\n### 使用 SpQR 后\n- **实现单卡部署**：利用 SpQR 的稀疏量化技术，将 65B 模型压缩至近无损状态，成功配合 `--offload activations` 参数在 24GB 显存设备上流畅运行。\n- **保持卓越性能**：通过保留关键异常值（outliers）的精度，模型在 WikiText2 等基准测试中的困惑度（Perplexity）几乎未受影响，客服回答依然精准自然。\n- **大幅降低成本**：无需购买昂贵服务器，直接基于现有消费级硬件完成私有化部署，将推理成本降低了 90% 以上。\n\nSpQR 通过独特的稀疏 - 量化混合表示法，打破了大模型对昂贵算力的依赖，让顶级智商的 LLM 真正走进了普通开发者的本地环境。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FVahe1994_SpQR_f15fde8f.png","Vahe1994","Egiazarian Vage","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FVahe1994_624b4177.png","ISTA","https:\u002F\u002Fgithub.com\u002FVahe1994",[78,82,86,90],{"name":79,"color":80,"percentage":81},"Python","#3572A5",92.9,{"name":83,"color":84,"percentage":85},"Cuda","#3A4E3A",4.2,{"name":87,"color":88,"percentage":89},"C++","#f34b7d",2.8,{"name":91,"color":92,"percentage":93},"Shell","#89e051",0,553,43,"2026-03-17T19:27:14","Apache-2.0",4,"未说明","必需 NVIDIA GPU。开发测试环境为单卡 A100 (80GB)。评估 LLaMA-65B\u002FFalcon-40B 需 32GB+ 显存；开启 '--offload activations' 后，LLaMA-65B 需 
24GB+，LLaMA-7B 需 6GB+。大型模型零样本评估可能需要多卡（如 LLaMA-65B 需 3 张 80GB A100）。需要 CUDA 支持。","需足够容纳未压缩模型权重及数据集。例如 LLaMA-65B 约需 130GB+ 系统内存。",{"notes":103,"python":99,"dependencies":104},"1. 论文结果基于特定版本的 transformers (4.28.dev0) 获得。2. 支持 LLaMA、Falcon 和 OPT 系列模型。3. 量化建议使用模型训练数据的子集（如 RedPajama 或 RefinedWeb），且数据需经对应分词器处理。4. 运行基准测试前需执行 'scripts\u002Fbuild.sh' 构建源码。5. 量化后的模型需转换为优化格式（使用 convert_legacy_model_format.py）才能运行高效推理内核。6. 首次运行需下载并缓存数据集和分词器。",[105,106,107,108],"torch>=2.0.0","transformers==4.28.dev0","datasets","wandb (可选)",[35,14],"2026-03-27T02:49:30.150509","2026-04-11T08:11:51.043359",[113,118,123,128,133,138],{"id":114,"question_zh":115,"answer_zh":116,"source_url":117},29099,"SpQR 的推理代码（inference code）发布了吗？在哪里可以找到？","推理内核（inference kernel）现已在 SpQR 端可用。对于 Hugging Face 集成，请参考 PR：https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Fpull\u002F34976。","https:\u002F\u002Fgithub.com\u002FVahe1994\u002FSpQR\u002Fissues\u002F12",{"id":119,"question_zh":120,"answer_zh":121,"source_url":122},29100,"为什么代码中使用了权重排列（permutation）？保存量化模型时是否需要保存排列顺序？","对于某些模型，排列顺序至关重要，使用 `actorder` 通常能带来约 0.1 ppl 的性能提升。保存排列顺序会增加极小的内存开销（每个参数小于 0.01 bits）。如果不保存排列顺序，推理时需要反向排列激活值（de-permute activations），或者作为变通方案，可以通过设置 identity 选项跳过激活值的排列（但这可能会略微影响精度）。目前没有其他方法可以在不保存排列顺序的情况下进行标准推理。","https:\u002F\u002Fgithub.com\u002FVahe1994\u002FSpQR\u002Fissues\u002F18",{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},29101,"我可以保存压缩后的模型用于直接推理或结合 LoRA 适配器使用吗？","项目已发布模型保存功能的草案 PR（https:\u002F\u002Fgithub.com\u002FVahe1994\u002FSpQR\u002Fpull\u002F32），该功能几乎完成但当时尚未完全测试。早期版本仅发布了评估代码，高效推理和完整保存功能随后通过更新和 Hugging Face 集成提供。","https:\u002F\u002Fgithub.com\u002FVahe1994\u002FSpQR\u002Fissues\u002F1",{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},29102,"加载 LLaMa 30B 模型时出现错误，如何解决？","这通常是由于 safetensor 
文件损坏或下载不完整导致的。尝试重新下载该模型文件通常可以解决此问题。","https:\u002F\u002Fgithub.com\u002FVahe1994\u002FSpQR\u002Fissues\u002F43",{"id":134,"question_zh":135,"answer_zh":136,"source_url":137},29103,"为什么当前版本不支持保存压缩模型（抛出 NotImplementedError）？","早期版本主要专注于发布评估代码以验证压缩模型的质量，因此暂时禁用了保存功能。后续更新中已通过新的 PR 添加了模型保存支持，请确保使用最新代码或参考相关的保存功能 PR。","https:\u002F\u002Fgithub.com\u002FVahe1994\u002FSpQR\u002Fissues\u002F11",{"id":139,"question_zh":140,"answer_zh":141,"source_url":142},29104,"在调试代码时发现 outlier mask（异常值掩码）的位置与量化权重不对应，原因是什么？","这是预期行为。`unstructured_outlier_mask` 从 `SPQRUtil.quantize()` 返回时不会进行反向排列。代码中已有相关注释说明（见 spqr_engine.py 第 209 行附近），这在研究和分析代码时需要注意，但不影响程序的实际功能。","https:\u002F\u002Fgithub.com\u002FVahe1994\u002FSpQR\u002Fissues\u002F30",[]]