[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-cedrickchee--awesome-ml-model-compression":3,"similar-cedrickchee--awesome-ml-model-compression":49},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":8,"readme_en":9,"readme_zh":10,"quickstart_zh":11,"use_case_zh":12,"hero_image_url":13,"owner_login":14,"owner_name":15,"owner_avatar_url":16,"owner_bio":17,"owner_company":18,"owner_location":19,"owner_email":20,"owner_twitter":21,"owner_website":22,"owner_url":23,"languages":20,"stars":24,"forks":25,"last_commit_at":26,"license":27,"difficulty_score":28,"env_os":29,"env_gpu":30,"env_ram":30,"env_deps":31,"category_tags":34,"github_topics":36,"view_count":43,"oss_zip_url":20,"oss_zip_packed_at":20,"status":44,"created_at":45,"updated_at":46,"faqs":47,"releases":48},9233,"cedrickchee\u002Fawesome-ml-model-compression","awesome-ml-model-compression","Awesome machine learning model compression research papers, quantization, tools, and learning material.","awesome-ml-model-compression 是一个专为机器学习模型压缩与加速领域打造的精选资源库。随着深度学习模型日益庞大，如何在保持高精度的同时降低计算成本和存储占用，成为落地应用的关键难题。这份清单系统地整理了该领域的核心解决方案，涵盖量化、剪枝、知识蒸馏、二值化及低秩近似等主流技术路径。\n\n它汇集了从经典综述到前沿架构（如 MobileNet、SqueezeNet）的研究论文，提供了实用的工具库、框架链接，以及教程文章和视频演讲。无论是希望深入探索算法原理的研究人员，还是需要将大模型部署到移动端或嵌入式设备的开发者，都能在这里快速找到理论依据和工程实现参考。其独特亮点在于分类清晰、内容全面，不仅包含通用的压缩策略，还特别收录了关于 FP8 格式等最新硬件友好型技术的探讨，帮助用户紧跟行业趋势。通过整合学术研究与工程实践，awesome-ml-model-compression 致力于让模型变得更小、更快、更高效，是进入模型优化领域的理想入门指南。","# Awesome ML Model Compression [![Awesome](https:\u002F\u002Fawesome.re\u002Fbadge.svg)](https:\u002F\u002Fawesome.re)\n\nAn awesome style list that curates the best machine learning model compression and acceleration research papers, articles, tutorials, libraries, tools and more. PRs are welcome!\n\n# Contents\n\n- [Papers](#papers)\n  - [General](#general)\n  - [Architecture](#architecture)\n  - [Quantization](#quantization)\n  - [Binarization](#binarization)\n  - [Pruning](#pruning)\n  - [Distillation](#distillation)\n  - [Low Rank Approximation](#low-rank-approximation)\n  - [Offloading](#offloading)\n  - [Parallelism](#parallelism)\n- [Articles](#articles)\n  - [Howtos](#howtos)\n  - [Assorted](#assorted)\n  - [Reference](#reference)\n  - [Blogs](#blogs)\n- [Tools](#tools)\n  - [Libraries](#libraries)\n  - [Frameworks](#frameworks)\n- [Videos](#videos)\n  - [Talks](#talks)\n  - [Training & tutorials](#training--tutorials)\n\n---\n\n## Papers\n\n### General\n\n- [A Survey of Model Compression and Acceleration for Deep Neural Networks](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.09282)\n- [Model compression as constrained optimization, with application to neural nets. Part I: general framework](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.01209)\n- [Model compression as constrained optimization, with application to neural nets. Part II: quantization](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.04319)\n- [Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.08962)\n- [FP8 Formats for Deep Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.05433) by NVIDIA, Arm, and Intel, 2022 - FP8 delivered the performance of INT8 with accuracy of FP16. 
### Architecture

- [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861)
- [MobileNetV2: Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation](https://arxiv.org/abs/1801.04381)
- [Xception: Deep Learning with Depthwise Separable Convolutions](https://arxiv.org/abs/1610.02357)
- [ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices](https://arxiv.org/abs/1707.01083)
- [SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size](https://arxiv.org/abs/1602.07360)
- [Fast YOLO: A Fast You Only Look Once System for Real-time Embedded Object Detection in Video](https://arxiv.org/abs/1709.05943)
- [AddressNet: Shift-based Primitives for Efficient Convolutional Neural Networks](https://arxiv.org/abs/1809.08458)
- [ResNeXt: Aggregated Residual Transformations for Deep Neural Networks](https://arxiv.org/abs/1611.05431)
- [ResBinNet: Residual Binary Neural Network](https://arxiv.org/abs/1711.01243)
- [Residual Attention Network for Image Classification](https://arxiv.org/abs/1704.06904)
- [Squeezedet: Unified, small, low power fully convolutional neural networks](https://arxiv.org/abs/1612.01051)
- [SEP-Nets: Small and Effective Pattern Networks](https://arxiv.org/abs/1706.03912)
- [Dynamic Capacity Networks](https://arxiv.org/abs/1511.07838)
- [Learning Infinite Layer Networks Without the Kernel Trick](https://arxiv.org/abs/1606.05316v2)
- [Efficient Sparse-Winograd Convolutional Neural Networks](https://openreview.net/pdf?id=r1rqJyHKg)
- [DSD: Dense-Sparse-Dense Training for Deep Neural Networks](https://openreview.net/pdf?id=HyoST_9xl)
- [Coordinating Filters for Faster Deep Neural Networks](https://arxiv.org/abs/1703.09746v3)
- [Deep Networks with Stochastic Depth](https://arxiv.org/abs/1603.09382)

### Quantization

- [Quantized Convolutional Neural Networks for Mobile Devices](https://arxiv.org/abs/1512.06473)
- [Towards the Limit of Network Quantization](https://arxiv.org/abs/1612.01543)
- [Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations](https://arxiv.org/abs/1609.07061)
- [Compressing Deep Convolutional Networks using Vector Quantization](https://arxiv.org/abs/1412.6115)
- [Trained Ternary Quantization](https://arxiv.org/abs/1612.01064)
- [The ZipML Framework for Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning](https://arxiv.org/abs/1611.05402)
- [ShiftCNN: Generalized Low-Precision Architecture for Inference of Convolutional Neural Networks](https://arxiv.org/abs/1706.02393)
- [Deep Learning with Low Precision by Half-wave Gaussian Quantization](https://arxiv.org/abs/1702.00953)
- [Loss-aware Binarization of Deep Networks](https://arxiv.org/abs/1611.01600)
- [Quantize weights and activations in Recurrent Neural Networks](https://arxiv.org/abs/1611.10176)
- [Fixed-Point Performance Analysis of Recurrent Neural Networks](https://arxiv.org/abs/1512.01322)
- [And the bit goes down: Revisiting the quantization of neural networks](https://arxiv.org/abs/1907.05686)
- [8-bit Optimizers via Block-wise Quantization](https://arxiv.org/abs/2110.02861)
- [LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale](https://arxiv.org/abs/2208.07339) [[blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)]
- [SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](https://arxiv.org/abs/2211.10438) by MIT and NVIDIA (2022) [[code](https://github.com/mit-han-lab/smoothquant)]
- [ZeroQuant: Efficient and Affordable Post-training Quantization for Large Transformer-based Models](https://arxiv.org/abs/2206.01861) by Microsoft (2022) [[code](https://github.com/microsoft/DeepSpeed)]
- [nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models](https://arxiv.org/abs/2206.09557) by NAVER CLOVA and Pohang University of Science and Technology, Korea (2022)
- [MKQ-BERT: Quantized BERT with 4-bits Weights and Activations](https://arxiv.org/abs/2203.13483) by Tencent AIPD (2022)
- [Understanding and Overcoming the Challenges of Efficient Transformer Quantization](https://arxiv.org/abs/2109.12948) by Qualcomm AI Research (2021) [[code](https://github.com/qualcomm-ai-research/transformer-quantization)]
- [Mesa: A Memory-saving **Training** Framework for Transformers](https://arxiv.org/abs/2111.11124v1) by Monash University (2021)
- [The case for 4-bit precision: k-bit Inference Scaling Laws](https://arxiv.org/abs/2212.09720) by Tim Dettmers et al. (2022) - Overall, their findings show that **4-bit precision is almost universally optimal for total model bits and zero-shot accuracy**.
- [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323) by Elias Frantar et al., 2022.
- Other lists:
  - [htqin/awesome-model-quantization](https://github.com/htqin/awesome-model-quantization)
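Most of the post-training methods above build on the same primitive: map a floating-point tensor onto a small integer grid and keep a scale factor alongside it. Below is a minimal, generic sketch of symmetric round-to-nearest INT8 quantization with a per-tensor absmax scale; it is an illustration of that primitive, not the code of any specific paper listed here.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric round-to-nearest INT8 quantization with a per-tensor scale."""
    scale = np.abs(x).max() / 127.0                      # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize_int8(q, scale)).max())       # error is bounded by ~scale / 2
```

Methods such as LLM.int8() and SmoothQuant refine this step (per-channel or block-wise scales, special handling of activation outliers) rather than replacing it.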
### Binarization

- [Binarized Convolutional Neural Networks with Separable Filters for Efficient Hardware Acceleration](https://arxiv.org/abs/1707.04693)
- [Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1](https://arxiv.org/abs/1602.02830)
- [Local Binary Convolutional Neural Networks](https://arxiv.org/abs/1608.06049)
- [XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks](https://arxiv.org/abs/1603.05279) (see the sketch after this list)
- [DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients](https://arxiv.org/abs/1606.06160)
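The binarization papers above replace full-precision weights with a single sign bit plus a real-valued scale. A minimal sketch of the XNOR-Net-style approximation W ≈ α · sign(W), with α set to the mean absolute weight, is shown below; this is an illustration of the idea under that simplification (the paper computes a separate scale per filter), not the authors' implementation.

```python
import numpy as np

def binarize(w: np.ndarray):
    """Approximate w as alpha * sign(w), where alpha = mean(|w|).
    Storage drops from 32 bits to about 1 bit per weight plus one scale.
    Note: a single per-tensor scale is used here for brevity; XNOR-Net
    uses one scale per output filter."""
    alpha = float(np.abs(w).mean())
    b = np.where(w >= 0, 1.0, -1.0).astype(np.float32)
    return b, alpha

w = np.random.randn(64, 64).astype(np.float32)
b, alpha = binarize(w)
print(alpha, float(np.mean(np.abs(w - alpha * b))))   # reconstruction error of the 1-bit approximation
```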
### Pruning

- [Faster CNNs with Direct Sparse Convolutions and Guided Pruning](https://arxiv.org/abs/1608.01409)
- [Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding](https://arxiv.org/abs/1510.00149)
- [Pruning Convolutional Neural Networks for Resource Efficient Inference](https://arxiv.org/abs/1611.06440)
- [Pruning Filters for Efficient ConvNets](https://arxiv.org/abs/1608.08710)
- [Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning](https://arxiv.org/abs/1611.05128)
- [Learning to Prune: Exploring the Frontier of Fast and Accurate Parsing](http://www.cs.jhu.edu/~jason/papers/vieira+eisner.tacl17.pdf)
- [Fine-Pruning: Joint Fine-Tuning and Compression of a Convolutional Network with Bayesian Optimization](https://arxiv.org/abs/1707.09102)
- [Learning both Weights and Connections for Efficient Neural Networks](https://arxiv.org/abs/1506.02626)
- [ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression](https://arxiv.org/abs/1707.06342)
- [Data-Driven Sparse Structure Selection for Deep Neural Networks](https://arxiv.org/abs/1707.01213)
- [Soft Weight-Sharing for Neural Network Compression](https://arxiv.org/abs/1702.04008)
- [Dynamic Network Surgery for Efficient DNNs](https://arxiv.org/abs/1608.04493)
- [Channel pruning for accelerating very deep neural networks](http://openaccess.thecvf.com/content_ICCV_2017/papers/He_Channel_Pruning_for_ICCV_2017_paper.pdf)
- [AMC: AutoML for model compression and acceleration on mobile devices](http://openaccess.thecvf.com/content_ECCV_2018/papers/Yihui_He_AMC_Automated_Model_ECCV_2018_paper.pdf)
- [ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA](https://arxiv.org/abs/1612.00694)
- [Massive Language Models Can Be Accurately Pruned in One-Shot (2023)](https://arxiv.org/abs/2301.00774) - Pruning methods: post-training, layer-wise. Quantization methods: joint sparsification & post-training quantization.
    > They propose `SparseGPT`, the first accurate one-shot pruning method which works efficiently at the scale of models with 10-100 billion parameters. `SparseGPT` works by reducing the pruning problem to an extremely large-scale instance of sparse regression. It is based on a new approximate sparse regression solver, used to solve a layer-wise compression problem, which is efficient enough to execute in a few hours on the largest openly available GPT models (175B parameters), using a single GPU. At the same time, SparseGPT is accurate enough to lose only negligible accuracy post-pruning, without any fine-tuning.
- [UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers](https://arxiv.org/abs/2301.13741) by Tsinghua University et al. (ICML 2023) [[Code](https://github.com/sdc17/UPop)]
- [A Simple and Effective Pruning Approach for Large Language Models](https://arxiv.org/abs/2306.11695) by CMU, Meta AI Research et al. (May 2024) - The popular approach known as magnitude pruning removes the smallest weights in a network based on the assumption that weights closest to 0 can be set to 0 with the least impact on performance. In LLMs, the magnitudes of a subset of outputs from an intermediate layer may be up to 20x larger than those of other outputs of the same layer. Removing the weights that are multiplied by these large outputs — even weights close to zero — could significantly degrade performance. Thus, a pruning technique that considers both weights and intermediate-layer outputs can accelerate a network with less impact on performance (see the sketch below). Why it matters: the ability to compress models without affecting their performance is becoming more important as mobile devices and personal computers become powerful enough to run them. [Code: [Wanda](https://github.com/locuslab/wanda)]
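The sketch below contrasts plain magnitude pruning with a Wanda-style score that also weighs each weight by the norm of its input feature, computed from a small batch of calibration activations. It is a paraphrase of the idea described in the entry above, not the authors' code: the real method operates layer by layer on calibration data and compares weights per output row, as approximated here.

```python
import numpy as np

def magnitude_mask(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Keep the largest-|w| entries globally; zero out the rest."""
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w), axis=None)[k]
    return (np.abs(w) >= threshold).astype(w.dtype)

def wanda_mask(w: np.ndarray, x: np.ndarray, sparsity: float) -> np.ndarray:
    """Wanda-style score: |weight| * L2 norm of the matching input feature,
    where x holds calibration activations of shape (n_samples, in_features).
    Weights are compared within each output row."""
    score = np.abs(w) * np.linalg.norm(x, axis=0)   # broadcasts over rows
    k = int(w.shape[1] * sparsity)
    mask = np.ones_like(w)
    for i in range(w.shape[0]):                     # prune the lowest-scoring weights per row
        mask[i, np.argsort(score[i])[:k]] = 0.0
    return mask

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
x = rng.normal(size=(128, 16))
x[:, 0] *= 20.0                                     # an "outlier" input feature, as described above
print(magnitude_mask(w, 0.5)[:, 0].sum())           # magnitude pruning drops some of these weights
print(wanda_mask(w, x, 0.5)[:, 0].sum())            # the activation-aware score keeps (almost) all of them
```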
### Distillation

- [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531)
- [Deep Model Compression: Distilling Knowledge from Noisy Teachers](https://arxiv.org/abs/1610.09650)
- [Learning Efficient Object Detection Models with Knowledge Distillation](http://papers.nips.cc/paper/6676-learning-efficient-object-detection-models-with-knowledge-distillation.pdf)
- [Data-Free Knowledge Distillation For Deep Neural Networks](https://arxiv.org/abs/1710.07535)
- [Knowledge Projection for Effective Design of Thinner and Faster Deep Neural Networks](https://arxiv.org/abs/1710.09505)
- [Moonshine: Distilling with Cheap Convolutions](https://arxiv.org/abs/1711.02613)
- [Model Distillation with Knowledge Transfer from Face Classification to Alignment and Verification](https://arxiv.org/abs/1709.02929)
- [Like What You Like: Knowledge Distill via Neuron Selectivity Transfer](https://arxiv.org/abs/1707.01219)
- [Sequence-Level Knowledge Distillation](https://arxiv.org/abs/1606.07947)
- [Learning Loss for Knowledge Distillation with Conditional Adversarial Networks](https://arxiv.org/abs/1709.00513)
- [Dark knowledge](http://www.ttic.edu/dl/dark14.pdf)
- [DarkRank: Accelerating Deep Metric Learning via Cross Sample Similarities Transfer](https://arxiv.org/abs/1707.01220)
- [FitNets: Hints for Thin Deep Nets](https://arxiv.org/abs/1412.6550)
- [MobileID: Face Model Compression by Distilling Knowledge from Neurons](https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/11977)
- [Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer](https://arxiv.org/abs/1612.03928)

### Low Rank Approximation

- [Speeding up convolutional neural networks with low rank expansions](http://www.robots.ox.ac.uk/~vgg/publications/2014/Jaderberg14b/jaderberg14b.pdf)
- [Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications](https://arxiv.org/abs/1511.06530)
- [Convolutional neural networks with low-rank regularization](https://arxiv.org/abs/1511.06067)
- [Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation](https://arxiv.org/abs/1404.0736)
- [Accelerating Very Deep Convolutional Networks for Classification and Detection](https://arxiv.org/abs/1505.06798)
- [Efficient and Accurate Approximations of Nonlinear Convolutional Networks](https://arxiv.org/abs/1411.4229)
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) - Low-rank adapters were proposed for GPT-like models by Hu et al. (see the sketch below)
- [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314) by Tim Dettmers et al. (2023)
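The sketch below illustrates the low-rank adapter idea from the LoRA paper: the pretrained weight W is kept frozen and augmented with a trainable rank-r update scaled by alpha/r, so only the two small factors are fine-tuned. This is a minimal illustration under those assumptions, not the reference implementation.

```python
import numpy as np

class LoRALinear:
    """y = x @ W.T + (alpha / r) * (x @ A.T) @ B.T, with W frozen and only A, B trainable."""
    def __init__(self, weight: np.ndarray, r: int = 8, alpha: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        out_features, in_features = weight.shape
        self.weight = weight                                     # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(r, in_features))   # trainable, small random init
        self.B = np.zeros((out_features, r))                     # trainable, zero init: no change at start
        self.scaling = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.weight.T + self.scaling * (x @ self.A.T) @ self.B.T

layer = LoRALinear(np.random.randn(32, 64), r=8)
x = np.random.randn(4, 64)
print(layer(x).shape)   # (4, 32); trainable parameters: r * (in + out) instead of in * out
```

QLoRA keeps the same adapters but stores the frozen base weights in a quantized 4-bit format.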
### Offloading

Recent years have witnessed the emergence of systems that are specialized for LLM inference, such as FasterTransformer (NVIDIA, 2022), PaLM inference (Pope et al., 2022), Deepspeed-Inference (Aminabadi et al., 2022), Accelerate (HuggingFace, 2022), LightSeq (Wang et al., 2021), and TurboTransformers (Fang et al., 2021).

To enable LLM inference on easily accessible hardware, offloading is an essential technique — to our knowledge, among current systems, only Deepspeed-Inference and Huggingface Accelerate include such functionality.

- [FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU](https://raw.githubusercontent.com/FMInference/FlexGen/main/docs/paper.pdf) by HazyResearch@Stanford et al., 2023. [[Tweet]](https://archive.is/2bqSy)

### Parallelism

Compression methods for model acceleration (i.e., model parallelism) papers:

- [Does compressing activations help model parallel training? (2023)](https://arxiv.org/abs/2301.02654) - They present the first empirical study on the effectiveness of compression algorithms (pruning-based, learning-based, and quantization-based, using a Transformer architecture) for improving the communication speed of model parallelism. **Summary:** 1) activation compression is not the same as gradient compression; 2) training setups matter a lot; 3) don't compress early layers' activations.

## Articles

Content published on the Web.

### Howtos

- [How to Quantize Neural Networks with TensorFlow](https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/)
- [🤗 PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware (2023)](https://huggingface.co/blog/peft) - The Hugging Face PEFT library enables using the most popular and performant models from Transformers coupled with the simplicity and scalability of Accelerate. Currently supported PEFT methods: LoRA, prefix tuning, prompt tuning, and P-Tuning (which employs trainable continuous prompt embeddings). They'll be exploring more PEFT methods, such as (IA)3 and bottleneck adapters. Results: the number of parameters needed to fine-tune Flan-T5-XXL is now 9.4M, about 7x fewer than AlexNet (source: [Tweet](https://twitter.com/dmvaldman/status/1624143468003221504)).

### Assorted

- [Why the Future of Machine Learning is Tiny](https://petewarden.com/2018/06/11/why-the-future-of-machine-learning-is-tiny/)
- [Deep Learning Model Compression for Image Analysis: Methods and Architectures](https://medium.com/comet-app/deep-learning-model-compression-for-image-analysis-methods-and-architectures-398f82b0c06f)
- [A foolproof way to shrink deep learning models](https://news.mit.edu/2020/foolproof-way-shrink-deep-learning-models-0430) by MIT (Alex Renda et al.) - A pruning algorithm: train to completion, globally prune the 20% of weights with the lowest magnitudes (the weakest connections), retrain with **learning rate rewinding** (rewinding the schedule to the original, early-training rate), and repeat iteratively until the desired sparsity is reached; the model ends up as small as you want (a sketch of one such loop follows below).
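A minimal sketch of that prune-and-rewind loop over a dictionary of weight arrays is shown below. The `train` callable and the learning-rate schedule are placeholders for whatever training setup is in use; this illustrates the recipe described in the article entry above, not MIT's code.

```python
import numpy as np

def global_magnitude_prune(weights: dict, fraction: float = 0.2) -> dict:
    """Zero out `fraction` of the currently non-zero weights with the smallest magnitude, globally."""
    nonzero_mags = np.concatenate([np.abs(w[w != 0]).ravel() for w in weights.values()])
    threshold = np.quantile(nonzero_mags, fraction)
    return {name: np.where(np.abs(w) < threshold, 0.0, w) for name, w in weights.items()}

def iterative_prune(weights: dict, train, original_lr_schedule, target_sparsity: float = 0.9) -> dict:
    """Repeat: prune 20% of the remaining weights, then retrain with the learning rate
    rewound to the original (early-training) schedule, until the target sparsity is hit.
    `train(weights, lr_schedule)` stands in for the user's own training loop."""
    def sparsity(ws):
        total = sum(w.size for w in ws.values())
        zeros = sum(int((w == 0).sum()) for w in ws.values())
        return zeros / total

    while sparsity(weights) < target_sparsity:
        weights = global_magnitude_prune(weights, fraction=0.2)
        weights = train(weights, original_lr_schedule)   # learning rate rewinding happens here
    return weights

# Demo with a no-op "train" step, just to show the sparsity schedule (1 - 0.8**n after n rounds):
weights = {"layer1": np.random.randn(100, 100), "layer2": np.random.randn(100, 10)}
pruned = iterative_prune(weights, train=lambda w, lr: w, original_lr_schedule=None)
print({name: round(float((w == 0).mean()), 3) for name, w in pruned.items()})
```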
### Reference

### Blogs

- [A Visual Guide to Quantization](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization) - Demystifying the Compression of Large Language Models, by Maarten Grootendorst (Jul 2024) - An approachable primer on quantization and on the quantization methods widely supported in tools and libraries, including GPTQ, GGUF, and BitNet (1-bit).
- [Overview of natively supported quantization schemes in 🤗 Transformers](https://huggingface.co/blog/overview-quantization-transformers) (Sept 2023)
- [TensorFlow Model Optimization Toolkit — Pruning API](https://medium.com/tensorflow/tensorflow-model-optimization-toolkit-pruning-api-42cac9157a6a?linkId=67380711)
- [Compressing neural networks for image classification and detection](https://ai.facebook.com/blog/compressing-neural-networks-for-image-classification-and-detection/) - Facebook AI researchers have developed a new method for reducing the memory footprint of neural networks by quantizing their weights while maintaining a short inference time. They manage to get a 76.1% top-1 ResNet-50 that fits in 5 MB and also compress a Mask R-CNN to within 6 MB.
- [All The Ways You Can Compress BERT](http://mitchgordon.me/machine/learning/2019/11/18/all-the-ways-to-compress-BERT.html) - An overview of different compression methods for large NLP models (BERT), comparing their results across different characteristics.
- [Deep Learning Model Compression](https://rachitsingh.com/deep-learning-model-compression/) methods.
- [Do We Really Need Model Compression](http://mitchgordon.me/machine/learning/2020/01/13/do-we-really-need-model-compression.html) in the future?
- Quantization: [Breakdown of Nvidia H100s for Transformer Inferencing](https://carolchen.me/blog/h100-inferencing/) by Carol Chen, ML ops at Cohere.
  > Transformer Engine utilizes FP8 and FP16 together to reduce memory usage and increase performance while still maintaining accuracy for large language models.
- [Comparison between quantization techniques and formats for LLMs](https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/) (Oct 2023) - A detailed comparison between GGUF (llama.cpp), GPTQ, AWQ, EXL2, q4_K_M, q4_K_S, and load_in_4bit: perplexity, VRAM, speed, model size, and loading time.
- [Which Quantization Method Is Best for You?: GGUF, GPTQ, or AWQ](https://archive.is/Yy9tE) (Jan 2024) - A gentle introduction to three prominent quantization methods — GPTQ, AWQ, and GGUF.
- [Comparing Quantized Performance in Llama Models](https://www.lesswrong.com/posts/qmPXQbyYA66DuJbht/comparing-quantized-performance-in-llama-models) (Jul 2024) - 8-bit quantization seems fine; for 4-bit it depends. It covers different quantization schemes including GGUF, [HQQ](https://mobiusml.github.io/hqq_blog/) (Half-Quadratic Quantization), AWQ, GPTQ, and BnB.
## Tools

### Libraries

- [TensorFlow Model Optimization Toolkit](https://github.com/tensorflow/model-optimization). Accompanying blog post: [TensorFlow Model Optimization Toolkit — Pruning API](https://medium.com/tensorflow/tensorflow-model-optimization-toolkit-pruning-api-42cac9157a6a?linkId=67380711)
- [XNNPACK](https://github.com/google/xnnpack) is a highly optimized library of floating-point neural network inference operators for ARM, WebAssembly, and x86 (SSE2 level) platforms. It is based on the QNNPACK library; however, unlike QNNPACK, XNNPACK focuses entirely on floating-point operators.
- [Bitsandbytes](https://github.com/facebookresearch/bitsandbytes) is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers and quantization functions (see the usage sketch below).
- [NNCP](https://bellard.org/nncp/) - An experiment to build a practical lossless data compressor with neural networks. The latest version uses a Transformer model (slower but best ratio); LSTM (faster) is also available.
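As a quick illustration of the bitsandbytes entry above, swapping a full-precision optimizer for its 8-bit counterpart is a one-line change in a PyTorch training script. This is a minimal sketch assuming `torch` and `bitsandbytes` are installed and a CUDA device is available; consult the library's documentation for the supported options.

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).cuda()

# Drop-in replacement for torch.optim.Adam: optimizer state is stored in 8-bit
# (block-wise quantized), which is where most of the memory saving comes from.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

x = torch.randn(32, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```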
### Frameworks

### Paper Implementations

- [facebookresearch/kill-the-bits](https://github.com/facebookresearch/kill-the-bits) - code and compressed models for the paper "And the bit goes down: Revisiting the quantization of neural networks" by Facebook AI Research.

## Videos

### Talks

### Training & tutorials

## License

I am providing code and resources in this repository to you under an open source license. Because this is my personal repository, the license you receive to my code and resources is from me and not my employer.

* Code: [MIT](LICENSE) license. Copyright 2022- Cedric Chee
* Text content: [Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)](http://creativecommons.org/licenses/by-sa/4.0/)
# awesome-ml-model-compression Quick Start Guide

`awesome-ml-model-compression` is not a library or framework you install directly; it is a **curated resource list (awesome list)** that gathers the best research papers, tutorials, open-source libraries, and tools for machine learning model compression and acceleration.

This guide shows how to use the list to quickly find a suitable tool. As a worked example it uses `bitsandbytes` (which implements LLM.int8()), one of the popular quantization libraries recommended in the list, and walks through environment preparation, installation, and basic usage, including mirror configuration for users in mainland China.

## 1. Environment preparation

Before starting with model compression, make sure the following basic requirements are met:

*   **Operating system**: Linux (Ubuntu 20.04+ recommended) or macOS. Windows users should use WSL2.
*   **Python version**: 3.8 - 3.11
*   **Hardware**:
    *   **CPU**: any modern multi-core processor.
    *   **GPU**: an NVIDIA GPU (Compute Capability 7.0+) is recommended to speed up inference and quantization. 8 GB+ of VRAM is suggested (less may be enough when running quantized versions of large language models).
*   **Prerequisites**:
    *   PyTorch (2.0+ recommended)
    *   CUDA Toolkit (matching your PyTorch build)
    *   Git

## 2. Installation

Since the repository is a resource list, first clone it so you can browse the latest tools, then install the specific compression library the list points you to. The steps below use **`bitsandbytes`** (8-bit/4-bit quantization) as the example.

### 2.1 Get the resource list
Clone the repository so you can consult the latest paper and tool links at any time:

```bash
git clone https://github.com/cedrickchee/awesome-ml-model-compression.git
cd awesome-ml-model-compression
```

### 2.2 Install a compression library (bitsandbytes as the example)
For faster downloads in mainland China, install dependencies through a domestic mirror (such as the Tsinghua or Alibaba PyPI mirror).

**Step 1: Configure a pip mirror**
```bash
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
```

**Step 2: Install PyTorch (CUDA build)**
*Pick the CUDA version that matches your GPU driver; CUDA 11.8 is used here as an example:*
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

**Step 3: Install bitsandbytes**
```bash
pip install bitsandbytes
```
*(Note: for other tools such as `NNCF`, `TensorRT`, or `DeepSpeed`, follow the official documentation of the corresponding project listed in the repository's `Tools` section.)*

## 3. Basic usage

The example below uses `bitsandbytes` to load a 4-bit quantized large language model (the most typical model-compression scenario), which significantly reduces VRAM usage.

**Prerequisite**: `transformers` and `accelerate` are installed.
```bash
pip install transformers accelerate
```

**Code example (Python):**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 1. Define the 4-bit quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit data type
    bnb_4bit_compute_dtype=torch.float16,   # compute dtype
    bnb_4bit_use_double_quant=True,         # double quantization for extra memory savings
)

# 2. Pick a model (e.g. Llama-2-7b)
model_id = "meta-llama/Llama-2-7b-hf"

# 3. Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4. Load the quantized model
# VRAM usage drops by roughly 75% compared with half precision
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",                      # place layers on the GPU automatically
    trust_remote_code=True,
)

# 5. Quick inference test
input_text = "Once upon a time in the world of AI,"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Next steps**:
Go back to the README of the `awesome-ml-model-compression` repository and browse the **Papers** and **Tools** sections to explore more specialized tools for your specific needs (pruning, distillation, or architecture-specific optimization).

---

An edge-computing team is trying to deploy a high-accuracy object-detection model on compute-constrained industrial inspection drones that need to recognize defects in real time.

### Without awesome-ml-model-compression
- **Selection paralysis**: faced with a flood of compression papers (quantization, pruning, distillation), the team struggles to shortlist mobile-friendly architectures such as MobileNet or SqueezeNet and loses weeks to surveying.
- **Accuracy/speed imbalance**: after naive model trimming, inference gets faster but detection accuracy drops sharply, with no mature reference such as FP8 quantization that balances performance and precision.
- **Broken toolchain**: with no single library for low-rank approximation or binarization, algorithms have to be reimplemented from scratch, so the development schedule slips and the prototype misses its deadline.
- **Fragmented knowledge**: team members hunt through scattered blogs and videos for tutorials, with no systematic how-to guide and high communication overhead.

### With awesome-ml-model-compression
- **Precise technical targeting**: the "Architecture" and "Quantization" sections point directly to a MobileNetV2-plus-FP8 route, and the technical plan is settled the same day.
- **Balanced performance**: following the FP8 papers from NVIDIA and others in the list, the team keeps FP16-level accuracy while reaching INT8-level inference speed.
- **Ready-to-use ecosystem**: the mature frameworks and libraries recommended in the "Tools" section get the pruning and distillation pipelines running quickly, cutting a month of algorithm validation down to three days.
- **Systematic learning**: the curated tutorials and videos bring the team to a shared understanding and close the loop from theory to engineering efficiently.

By aggregating state-of-the-art research and practical tools in one place, awesome-ml-model-compression cuts the cost of exploring model compression from months down to days, so edge AI deployment is no longer blocked by compute limits.