[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-airaria--TextBrewer":3,"tool-airaria--TextBrewer":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":80,"owner_email":81,"owner_twitter":79,"owner_website":82,"owner_url":83,"languages":84,"stars":89,"forks":90,"last_commit_at":91,"license":92,"difficulty_score":23,"env_os":93,"env_gpu":94,"env_ram":93,"env_deps":95,"category_tags":105,"github_topics":106,"view_count":23,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":112,"updated_at":113,"faqs":114,"releases":144},3688,"airaria\u002FTextBrewer","TextBrewer","A PyTorch-based knowledge distillation toolkit for natural language processing","TextBrewer 是一款基于 PyTorch 构建的自然语言处理知识蒸馏工具包，旨在帮助开发者轻松压缩大型预训练模型。在深度学习应用中，庞大的模型往往导致推理速度慢、内存占用高，难以在资源受限的设备上部署。TextBrewer 通过“知识蒸馏”技术，让一个小巧的“学生模型”向强大的“教师模型”学习，从而在几乎不牺牲性能的前提下，显著提升推理速度并降低显存消耗。\n\n该工具特别适合 NLP 领域的研究人员和工程开发者使用。它整合了来自自然语言处理和计算机视觉领域的多种先进蒸馏算法，提供了一个灵活易用的框架，让用户能快速复现前沿方法。TextBrewer 的独特亮点在于其高度的灵活性：支持为学生和教师模型输入不同的数据批次，这意味着它可以处理词汇表不同的模型间蒸馏（例如从 RoBERTa 蒸馏到 BERT）；同时支持预计算并缓存教师模型的输出，进一步加速训练过程。此外，它还提供了如 BERT-EMD 等自适应层匹配机制，无需人工指定复杂的层级对应关系。无论是希望优化模型部署的工程师，还是探索蒸馏新算法的研究者，TextBrewer 都能提供高效、专业的技术支持。"," [**English**](README.md) | [**中文说明**](README_ZH.md)\n\n\u003Cp align=\"center\">\n    \u003Cbr>\n    \u003Cimg 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fairaria_TextBrewer_readme_8cf43e588d08.png\" width=\"500\"\u002F>\n    \u003Cbr>\n\u003Cp>\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Fblob\u002Fmaster\u002FLICENSE\">\n        \u003Cimg alt=\"GitHub\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fairaria\u002FTextBrewer.svg?color=blue&style=flat-square\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Ftextbrewer.readthedocs.io\u002F\">\n        \u003Cimg alt=\"Documentation\" src=\"https:\u002F\u002Fimg.shields.io\u002Fwebsite?down_message=offline&label=Documentation&up_message=online&url=https%3A%2F%2Ftextbrewer.readthedocs.io\">\n    \u003C\u002Fa>    \n    \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Ftextbrewer\">\n        \u003Cimg alt=\"PyPI\" src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Ftextbrewer\">\n    \u003C\u002Fa>    \n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Freleases\">\n        \u003Cimg alt=\"GitHub release\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002Fairaria\u002FTextBrewer?include_prereleases\">\n    \u003C\u002Fa>\n\u003C\u002Fp>\n\n**TextBrewer** is a PyTorch-based model distillation toolkit for natural language processing. 
It includes various distillation techniques from both NLP and CV field and provides an easy-to-use distillation framework, which allows users to quickly experiment with the state-of-the-art distillation methods to compress the model with a relatively small sacrifice in the performance, increasing the inference speed and reducing the memory usage.\n\nCheck our paper through [ACL Anthology](https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2020.acl-demos.2\u002F) or [arXiv pre-print](https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.12620).\n\n[Full Documentation](https:\u002F\u002Ftextbrewer.readthedocs.io\u002F)\n\n## News\n\n**Dec 17, 2021**\n\n* **We have released a model pruning toolkit TextPruner**. Check https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextPruner\n\n**Oct 24, 2021**\n\n* **We propose the first pre-trained language model that specifically focusing on Chinese minority languages. Check：https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-Minority-PLM**\n\n\n**Jul 8, 2021**\n\n* **New examples with Transformers 4**\n  * The current examples (exmaples\u002F) have been written with old versions of Transformers and they may cause some confusions and bugs. We rewrite the examples with Transformers 4 in Jupyter Notebooks, which are easy to follow and learn.\n  * The new examples can be found at [examples\u002Fnotebook_examples](examples\u002Fnotebook_examples\u002F). See [Examples](#examples) for details.\n\n**Mar 1, 2021**\n\n* **BERT-EMD and custom distiller**\n\n  * We added an experiment with [BERT-EMD](https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2020.emnlp-main.242\u002F) in the [MNLI exmaple](examples\u002Fmnli_example\u002F). BERT-EMD allows each intermediate student layer to learn from any intermediate teacher layers adaptively, based on optimizing Earth Mover’s Distance. So there is no need to specify the matching scheme. \n  * We have written a new [EMDDistiller](examples\u002Fmnli_example\u002Fdistiller_emd.py) to perform BERT-EMD. 
It demonstrates how to write a custom distiller.\n\n* **updated MNLI example**\n\n  * We removed the pretrained_pytorch_bert and used the transformers library instead in all the MNLI examples.\n\n\u003Cdetails>\n\u003Csummary>Click here to see old news\u003C\u002Fsummary>\n\n**Nov 11, 2020**\n\n* **Updated to 0.2.1**:\n  * **More flexible distillation**: Supports feeding different batches to the student and teacher. It means the batches for the student and teacher no longer need to be the same. It can be used for distilling models with different vocabularies (e.g., from RoBERTa to BERT).\n  * **Faster distillation**: Users now can pre-compute and cache the teacher outputs, then feed the cache to the distiller to save teacher's forward pass time.\n  \n    See [Feed Different batches to Student and Teacher, Feed Cached Values](https:\u002F\u002Ftextbrewer.readthedocs.io\u002Fen\u002Flatest\u002FConcepts.html#feed-different-batches-to-student-and-teacher-feed-cached-values) for details of the above features.\n  \n  * `MultiTaskDistiller` now supports intermediate feature matching loss.\n  * Tensorboard now records more detailed losses (KD loss, hard label loss, matching losses...).\n\n  See details in [releases](https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Freleases\u002Ftag\u002Fv0.2.1).\n\n**August 27, 2020**\n\n**We are happy to announce that our model is on top of GLUE benchmark, check [leaderboard](https:\u002F\u002Fgluebenchmark.com\u002Fleaderboard).**\n\n\n\n**Aug 24, 2020**\n\n* **Updated to 0.2.0.1**:\n  * fixed bugs in `MultiTaskDistiller` and training loops.\n\n**Jul 29, 2020**\n\n* **Updated to 0.2.0**:\n    * Added support for distributed data-parallel training with `DistributedDataParallel`: `TrainingConfig` now accepts the `local_rank` argument. See the documentation of `TrainingConfig` for details.\n* Added an example of distillation on the Chinese NER task to demonstrate distributed data-parallel training. 
See [examples\u002Fmsra_ner_example](examples\u002Fmsra_ner_example).\n\n**Jul 14, 2020**\n* **Updated to 0.1.10**:\n    * Now supports mixed precision training with Apex! Just set `fp16` to `True` in `TrainingConfig`. See the documentation of `TrainingConfig` for details.\n    * Added `data_parallel` option in `TrainingConfig` to enable data parallel training and mixed precision training to work together.\n\n**Apr 26, 2020**\n\n* Added Chinese NER task (MSRA NER) results.\n* Added results for distilling to T12-nano model, which has a similar structure to Electra-small.\n* Updated some results of CoNLL-2003, CMRC 2018 and DRCD.\n\n**Apr 22, 2020**\n\n* **Updated to 0.1.9** (added cache option which speeds up distillation; fixed some bugs). See details in [releases](https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Freleases\u002Ftag\u002Fv0.1.9).\n* Added experimental results for distilling Electra-base to Electra-small on Chinese tasks.\n* TextBrewer has been accepted by [ACL 2020](http:\u002F\u002Facl2020.org) as a demo paper, please use our new [bib entry](#Citation).\n\n**Mar 17, 2020**\n\n* Added CoNLL-2003 English NER distillation example. See [examples\u002Fconll2003_example](examples\u002Fconll2003_example).\n\n**Mar 11, 2020**\n\n* **Updated to 0.1.8** (Improvements on TrainingConfig and train method). See details in [releases](https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Freleases\u002Ftag\u002Fv0.1.8).\n\n**Mar 2, 2020**\n\n* Initial public version 0.1.7 has been released. 
See details in [releases](https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Freleases\u002Ftag\u002Fv0.1.7).\n\n\u003C\u002Fdetails>\n\n## Table of Contents\n\n\u003C!-- TOC -->\n\n| Section | Contents |\n|-|-|\n| [Introduction](#introduction) | Introduction to TextBrewer |\n| [Installation](#installation) | How to install |\n| [Workflow](#workflow) | Two stages of TextBrewer workflow |\n| [Quickstart](#quickstart) | Example: distilling BERT-base to a 3-layer BERT |\n| [Experiments](#experiments) | Distillation experiments on typical English and Chinese datasets |\n| [Core Concepts](#core-concepts) | Brief explanations of the core concepts in TextBrewer |\n| [FAQ](#faq) | Frequently asked questions |\n| [Known Issues](#known-issues) | Known issues |\n| [Citation](#citation) | Citation to TextBrewer |\n| [Follow Us](#follow-us) | - |\n\n\u003C!-- \u002FTOC -->\n\n## Introduction\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fairaria_TextBrewer_readme_e3d7f37cb5f5.png)\n\n**TextBrewer** is designed for the knowledge distillation of NLP models. It provides various distillation methods and offers a distillation framework for quickly setting up experiments. 
\n\nThe main features of **TextBrewer** are:\n\n* Wide-support: it supports various model architectures (especially **transformer**-based models)\n* Flexibility: design your own distillation scheme by combining different techniques; it also supports user-defined loss functions, modules, etc.\n* Easy-to-use: users don't need to modify the model architectures\n* Built for NLP: it is suitable for a wide variety of NLP tasks: text classification, machine reading comprehension, sequence labeling, ...\n\n**TextBrewer** currently is shipped with the following distillation techniques: \n\n* Mixed soft-label and hard-label training\n* Dynamic loss weight adjustment and temperature adjustment\n* Various distillation loss functions: hidden states MSE, attention-matrix-based loss, neuron selectivity transfer, ...\n* Freely adding intermediate features matching losses\n* Multi-teacher distillation\n* ...\n\n**TextBrewer** includes:\n\n1. **Distillers**: the cores of distillation. Different distillers perform different distillation modes. There are GeneralDistiller, MultiTeacherDistiller, BasicTrainer, etc. \n2. **Configurations and presets**: Configuration classes for training and distillation, and predefined distillation loss functions and strategies. \n3. **Utilities**: auxiliary tools such as model parameters analysis. \n\n\nTo start distillation, users need to provide\n\n1. the models (the trained **teacher** model and the un-trained **student** model)\n2. datasets and experiment configurations \n\n**TextBrewer** has achieved impressive results on several typical NLP tasks. 
See [Experiments](#experiments).\n\nSee [Full Documentation](https:\u002F\u002Ftextbrewer.readthedocs.io\u002F) for detailed usage.\n\n### Architecture\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fairaria_TextBrewer_readme_e3d7f37cb5f5.png)\n\n## Installation\n\n* Requirements\n  * Python >= 3.6\n  * PyTorch >= 1.1.0\n  * TensorboardX or Tensorboard\n  * NumPy\n  * tqdm\n  * Transformers >= 2.0 (optional, used by some examples)\n  * Apex == 0.1.0 (optional, mixed precision training)\n\n* Install from PyPI\n\n  ```shell\n  pip install textbrewer\n  ```\n\n* Install from the GitHub source\n\n  ```shell\n  git clone https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer.git\n  pip install .\u002FTextBrewer\n  ```\n\n## Workflow\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fairaria_TextBrewer_readme_d63482bac52e.png)\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fairaria_TextBrewer_readme_8f4c999d0479.png)\n\n* **Stage 1**: Preparation:\n  1. Train the teacher model\n  2. Define and initialize the student model\n  3. Construct a dataloader, an optimizer, and a learning rate scheduler\n\n* **Stage 2**: Distillation with TextBrewer:\n  1. Construct a **TrainingConfig** and a **DistillationConfig**, initialize a **distiller**\n  2. Define an **adaptor** and a **callback**. The **adaptor** is used for adaptation of model inputs and outputs. The **callback** is called by the distiller during training\n  3. 
Call the **train** method of the **distiller**\n\n\n## Quickstart\n\nHere we show the usage of TextBrewer by distilling BERT-base to a 3-layer BERT.\n\nBefore distillation, we assume users have provided:\n\n* A trained teacher model `teacher_model` (BERT-base) and a to-be-trained student model `student_model` (3-layer BERT).\n* A `dataloader` of the dataset, an `optimizer`, and a learning rate builder or class `scheduler_class` with its args dict `scheduler_args`.\n\nDistill with TextBrewer:\n\n```python \nimport textbrewer\nfrom textbrewer import GeneralDistiller\nfrom textbrewer import TrainingConfig, DistillationConfig\n\n# Show the statistics of model parameters\nprint(\"\\nteacher_model's parameters:\")\nresult, _ = textbrewer.utils.display_parameters(teacher_model,max_level=3)\nprint (result)\n\nprint(\"student_model's parameters:\")\nresult, _ = textbrewer.utils.display_parameters(student_model,max_level=3)\nprint (result)\n\n# Define an adaptor for interpreting the model inputs and outputs\ndef simple_adaptor(batch, model_outputs):\n    # The second and third elements of model outputs are the logits and hidden states\n    return {'logits': model_outputs[1],\n            'hidden': model_outputs[2]}\n\n# Training configuration \ntrain_config = TrainingConfig()\n# Distillation configuration\n# Matching different layers of the student and the teacher\ndistill_config = DistillationConfig(\n    intermediate_matches=[    \n     {'layer_T':0, 'layer_S':0, 'feature':'hidden', 'loss': 'hidden_mse','weight' : 1},\n     {'layer_T':8, 'layer_S':2, 'feature':'hidden', 'loss': 'hidden_mse','weight' : 1}])\n\n# Build distiller\ndistiller = GeneralDistiller(\n    train_config=train_config, distill_config = distill_config,\n    model_T = teacher_model, model_S = student_model, \n    adaptor_T = simple_adaptor, adaptor_S = simple_adaptor)\n\n# Start!\nwith distiller:\n    distiller.train(optimizer, dataloader, num_epochs=1, scheduler_class=scheduler_class, scheduler_args = 
scheduler_args, callback=None)\n```\n\n### **Examples**\n\n* **Notebook examples with Transformers 4**\n  * [examples\u002Fnotebook\\_examples\u002Fsst2.ipynb](examples\u002Fnotebook\\_examples\u002Fsst2.ipynb) (English): training and distilling BERT on SST-2, an English sentence classification task.\n  * [examples\u002Fnotebook\\_examples\u002Fmsra_ner.ipynb](examples\u002Fnotebook\\_examples\u002Fmsra_ner.ipynb) (Chinese): training and distilling BERT on MSRA NER, a Chinese sequence labeling task.\n  * [examples\u002Fnotebook\\_examples\u002Fsqaudv1.1.ipynb](examples\u002Fnotebook\\_examples\u002Fsqaudv1.1.ipynb) (English): training and distilling BERT on SQuAD 1.1, an English MRC task.\n\n* [examples\u002Frandom_token_example](examples\u002Frandom_token_example) : a simple runnable toy example which demonstrates the usage of TextBrewer. This example performs distillation on the text classification task with random tokens as inputs.\n* [examples\u002Fcmrc2018\\_example](examples\u002Fcmrc2018_example) (Chinese): distillation on CMRC 2018, a Chinese MRC task, using DRCD as data augmentation.\n* [examples\u002Fmnli\\_example](examples\u002Fmnli_example) (English): distillation on MNLI, an English sentence-pair classification task. This example also shows how to perform multi-teacher distillation.\n* [examples\u002Fconll2003_example](examples\u002Fconll2003_example) (English): distillation on CoNLL-2003 English NER task, which is in form of sequence labeling.\n* [examples\u002Fmsra_ner_example](examples\u002Fmsra_ner_example) (Chinese): This example distills a Chinese-ELECTRA-base model on the MSRA NER task with distributed data-parallel training(single node, multi-GPU).\n\n\n## Experiments\n\nWe have performed distillation experiments on several typical English and Chinese NLP datasets. 
The setups and configurations are listed below.\n\n### Models\n\n* For English tasks, the teacher model is [**BERT-base-cased**](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbert).\n* For Chinese tasks, the teacher models are [**RoBERTa-wwm-ext**](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-BERT-wwm) and [**Electra-base**](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-ELECTRA) released by the Joint Laboratory of HIT and iFLYTEK Research.\n\nWe have tested different student models. To compare with public results, the student models are built with standard transformer blocks except for BiGRU which is a single-layer bidirectional GRU. The architectures are listed below. Note that the number of parameters includes the embedding layer but does not include the output layer of each specific task. \n\n#### English models\n\n| Model                 | \\#Layers | Hidden size | Feed-forward size | \\#Params | Relative size |\n| :--------------------- | --------- | ----------- | ----------------- | -------- | ------------- |\n| BERT-base-cased (teacher) | 12        | 768         | 3072              | 108M     | 100%          |\n| T6 (student)              | 6         | 768         | 3072              | 65M      | 60%           |\n| T3 (student)              | 3         | 768         | 3072              | 44M      | 41%           |\n| T3-small (student)        | 3         | 384         | 1536              | 17M      | 16%           |\n| T4-Tiny (student)         | 4         | 312         | 1200              | 14M      | 13%           |\n| T12-nano (student)        | 12        | 256         | 1024              | 17M      | 16%           |\n| BiGRU (student)           | -         | 768         | -                 | 31M      | 29%           |\n\n#### Chinese models\n\n| Model                 | \\#Layers | Hidden size | Feed-forward size | \\#Params | Relative size   |\n| :--------------------- | --------- | ----------- | ----------------- | -------- | 
------------- |\n| RoBERTa-wwm-ext (teacher) | 12        | 768         | 3072              | 102M      | 100%          |\n| Electra-base (teacher)    | 12        | 768         | 3072              | 102M      | 100%          |\n| T3 (student)              | 3         | 768         | 3072              | 38M       | 37%           |\n| T3-small (student)        | 3         | 384         | 1536              | 14M       | 14%           |\n| T4-Tiny (student)         | 4         | 312         | 1200              | 11M       | 11%           |\n| Electra-small (student)   | 12        | 256         | 1024              | 12M       | 12%           |\n\n* T6 architecture is the same as [DistilBERT\u003Csup>[1]\u003C\u002Fsup>](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.01108), [BERT\u003Csub>6\u003C\u002Fsub>-PKD\u003Csup>[2]\u003C\u002Fsup>](https:\u002F\u002Farxiv.org\u002Fabs\u002F1908.09355), and  [BERT-of-Theseus\u003Csup>[3]\u003C\u002Fsup>](https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.02925).\n* T4-tiny architecture is the same as [TinyBERT\u003Csup>[4]\u003C\u002Fsup>](https:\u002F\u002Farxiv.org\u002Fabs\u002F1909.10351).\n* T3 architecure is the same as [BERT\u003Csub>3\u003C\u002Fsub>-PKD\u003Csup>[2]\u003C\u002Fsup>](https:\u002F\u002Farxiv.org\u002Fabs\u002F1908.09355).\n\n### Distillation Configurations\n\n```python\ndistill_config = DistillationConfig(temperature = 8, intermediate_matches = matches)\n# Others arguments take the default values\n```\n\n`matches` are differnt for different models:\n\n| Model        | matches                                             |\n| :--------    | --------------------------------------------------- |\n| BiGRU        | None                                                |\n| T6           | L6_hidden_mse + L6_hidden_smmd                      |\n| T3           | L3_hidden_mse + L3_hidden_smmd                      |\n| T3-small     | L3n_hidden_mse + L3_hidden_smmd                     |\n| T4-Tiny      | L4t_hidden_mse + 
L4_hidden_smmd                     |\n| T12-nano     | small_hidden_mse + small_hidden_smmd                |\n| Electra-small| small_hidden_mse + small_hidden_smmd                |\n\nThe definitions of matches are at [examples\u002Fmatches\u002Fmatches.py](examples\u002Fmatches\u002Fmatches.py).\n\nWe use GeneralDistiller in all the distillation experiments.\n\n### Training Configurations\n\n* Learning rate is 1e-4 (unless otherwise specified).  \n* We train all the models for 30~60 epochs.\n\n### Results on English Datasets\n\nWe experiment on the following typical English datasets:\n\n| Dataset    | Task type | Metrics | \\#Train | \\#Dev | Note |\n| :---------- | -------- | ------- | ------- | ---- | ---- |\n| [**MNLI**](https:\u002F\u002Fwww.nyu.edu\u002Fprojects\u002Fbowman\u002Fmultinli\u002F)       | text classification | m\u002Fmm Acc | 393K    | 20K  | sentence-pair 3-class classification |\n| [**SQuAD 1.1**](https:\u002F\u002Frajpurkar.github.io\u002FSQuAD-explorer\u002F)   | reading comprehension | EM\u002FF1   | 88K     | 11K  | span-extraction machine reading comprehension |\n| [**CoNLL-2003**](https:\u002F\u002Fwww.clips.uantwerpen.be\u002Fconll2003\u002Fner) | sequence labeling | F1      | 23K     | 6K   | named entity recognition |\n\nWe list the public results from [DistilBERT](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.01108), [BERT-PKD](https:\u002F\u002Farxiv.org\u002Fabs\u002F1908.09355), [BERT-of-Theseus](https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.02925), [TinyBERT](https:\u002F\u002Farxiv.org\u002Fabs\u002F1909.10351) and our results below for comparison.\n\nPublic results:\n\n| Model (public) | MNLI  | SQuAD  | CoNLL-2003 |\n| :-------------  | --------------- | ------------- | --------------- |\n| DistilBERT (T6)    | 81.6 \u002F 81.1 | 78.1 \u002F 86.2   | -               |\n| BERT\u003Csub>6\u003C\u002Fsub>-PKD (T6)     | 81.5 \u002F 81.0     | 77.1 \u002F 85.3   | -|\n| BERT-of-Theseus (T6) | 82.4\u002F  82.1   | -        | -    
            |\n| BERT\u003Csub>3\u003C\u002Fsub>-PKD (T3)     | 76.7 \u002F 76.3     | -             | -|\n| TinyBERT (T4-tiny) | 82.8 \u002F 82.9                | 72.7 \u002F 82.1   | -|\n\nOur results:\n\n| Model (ours) | MNLI  | SQuAD  | CoNLL-2003 |\n| :-------------  | --------------- | ------------- | --------------- |\n| **BERT-base-cased** (teacher) | 83.7 \u002F 84.0     | 81.5 \u002F 88.6   | 91.1  |\n| BiGRU          | -               | -             | 85.3            |\n| T6             | 83.5 \u002F 84.0     | 80.8 \u002F 88.1   | 90.7            |\n| T3             | 81.8 \u002F 82.7     | 76.4 \u002F 84.9   | 87.5            |\n| T3-small       | 81.3 \u002F 81.7     | 72.3 \u002F 81.4   | 78.6            |\n| T4-tiny        | 82.0 \u002F 82.6     | 75.2 \u002F 84.0   | 89.1            |\n| T12-nano       | 83.2 \u002F 83.9     | 79.0 \u002F 86.6   | 89.6            |\n\n**Note**:\n\n1. The equivalent model structures of public models are shown in the brackets after their names. \n2. When distilling to T4-tiny, NewsQA is used for data augmentation on SQuAD and HotpotQA is used for data augmentation on CoNLL-2003.\n3. 
When distilling to T12-nano, HotpotQA is used for data augmentation on CoNLL-2003.\n\n\n\n### Results on Chinese Datasets\n\nWe experiment on the following typical Chinese datasets:\n\n\n| Dataset | Task type | Metrics | \\#Train | \\#Dev | Note |\n| :------- | ---- | ------- | ------- | ---- | ---- |\n| [**XNLI**](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbert\u002Fblob\u002Fmaster\u002Fmultilingual.md) | text classification | Acc | 393K | 2.5K | Chinese translation version of MNLI |\n| [**LCQMC**](http:\u002F\u002Ficrc.hitsz.edu.cn\u002Finfo\u002F1037\u002F1146.htm) | text classification | Acc | 239K | 8.8K | sentence-pair matching, binary classification |\n| [**CMRC 2018**](https:\u002F\u002Fgithub.com\u002Fymcui\u002Fcmrc2018) | reading comprehension | EM\u002FF1 | 10K | 3.4K | span-extraction machine reading comprehension |\n| [**DRCD**](https:\u002F\u002Fgithub.com\u002FDRCKnowledgeTeam\u002FDRCD) | reading comprehension | EM\u002FF1 | 27K | 3.5K | span-extraction machine reading comprehension (Traditional Chinese) |\n| [**MSRA NER**](https:\u002F\u002Ffaculty.washington.edu\u002Flevow\u002Fpapers\u002Fsighan06.pdf) | sequence labeling | F1 | 45K | 3.4K (#Test) | Chinese named entity recognition |\n\nThe results are listed below.\n\n| Model           | XNLI | LCQMC | CMRC 2018 | DRCD |\n| :--------------- | ---------- | ----------- | ---------------- | ------------ |\n| **RoBERTa-wwm-ext** (teacher) | 79.9       | 89.4        | 68.8 \u002F 86.4      | 86.5 \u002F 92.5  |\n| T3          | 78.4       | 89.0        | 66.4 \u002F 84.2      | 78.2 \u002F 86.4  |\n| T3-small    | 76.0       | 88.1        | 58.0 \u002F 79.3      | 75.8 \u002F 84.8  |\n| T4-tiny     | 76.2       | 88.4        | 61.8 \u002F 81.8      | 77.3 \u002F 86.1  |\n\n| Model                       | XNLI       | LCQMC       | CMRC 2018        | DRCD        | MSRA NER |\n| :---------------------------| ---------- | ----------- | ---------------- | ------------|----------|\n| 
**Electra-base** (teacher)  | 77.8       | 89.8        | 65.6 \u002F 84.7     | 86.9 \u002F 92.3  | 95.14    |\n| Electra-small               | 77.7       | 89.3        | 66.5 \u002F 84.9     | 85.5 \u002F 91.3  | 93.48    |\n\n\n**Note**:\n\n1. Learning rate decay is not used in distillation on CMRC 2018 and DRCD.\n2. CMRC 2018 and DRCD take each other as the augmentation dataset in the distillation.\n3. The settings of training Electra-base teacher model can be found at [**Chinese-ELECTRA**](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-ELECTRA).\n4. Electra-small student model is initialized with the [pretrained weights](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-ELECTRA).\n\n## Core Concepts\n\n### Configurations\n\n* `TrainingConfig`: configuration related to general deep learning model training\n* `DistillationConfig`: configuration related to distillation methods\n\n### Distillers\n\nDistillers are in charge of conducting the actual experiments. The following distillers are available:\n\n* `BasicDistiller`: **single-teacher single-task** distillation, provides basic distillation strategies.\n* `GeneralDistiller` (Recommended): **single-teacher single-task** distillation, supports intermediate feature matching. **Recommended most of the time**.\n* `MultiTeacherDistiller`: **multi-teacher** distillation, which distills multiple teacher models (of the same task) into a single student model. **This class doesn't support intermediate feature matching.**\n* `MultiTaskDistiller`: **multi-task** distillation, which distills multiple teacher models (of different tasks) into a single student.\n* `BasicTrainer`: supervised training of a single model on a labeled dataset, not for distillation. 
**It can be used to train a teacher model**.\n\n\n### User-Defined Functions\n\nIn TextBrewer, there are two functions that should be implemented by users: **callback** and **adaptor**.\n\n####  **Callback** \n\nAt each checkpoint, after saving the student model, the callback function will be called by the distiller. A callback can be used to evaluate the performance of the student model at each checkpoint.\n\n#### Adaptor\nIt converts the model inputs and outputs to the specified format so that they can be recognized by the distiller and distillation losses can be computed. At each training step, the batch and model outputs are passed to the adaptor; the adaptor re-organizes the data and returns a dictionary.\n\nFor more details, see the explanations in [Full Documentation](https:\u002F\u002Ftextbrewer.readthedocs.io\u002F).\n\n## FAQ\n\n**Q**: How to initialize the student model?\n\n**A**: The student model could be randomly initialized (i.e., with no prior knowledge) or be initialized by pre-trained weights.\nFor example, when distilling a BERT-base model to a 3-layer BERT, you could initialize the student model with [RBT3](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-BERT-wwm) (for Chinese tasks) or the first three layers of BERT (for English tasks) to avoid the cold start problem. \nWe recommend that users use pre-trained student models whenever possible to fully take advantage of large-scale pre-training.\n\n**Q**: How to set training hyperparameters for the distillation experiments?\n\n**A**: Knowledge distillation usually requires more training epochs and a larger learning rate than training on the labeled dataset. For example, training SQuAD on BERT-base usually takes 3 epochs with lr=3e-5; however, distillation takes 30~50 epochs with lr=1e-4. 
**These conclusions are based on our experiments, and you are advised to try them on your own data**.\n\n**Q**: My teacher model and student model take different inputs (they do not share vocabularies), so how can I distill?\n\n**A**: You need to feed different batches to the teacher and the student. See the section [Feed Different batches to Student and Teacher, Feed Cached Values](https:\u002F\u002Ftextbrewer.readthedocs.io\u002Fen\u002Flatest\u002FConcepts.html#feed-different-batches-to-student-and-teacher-feed-cached-values) in the full documentation.\n\n**Q**: I have stored the logits from my teacher model. Can I use them in the distillation to save the forward pass time?\n\n**A**: Yes, see the section [Feed Different batches to Student and Teacher, Feed Cached Values](https:\u002F\u002Ftextbrewer.readthedocs.io\u002Fen\u002Flatest\u002FConcepts.html#feed-different-batches-to-student-and-teacher-feed-cached-values) in the full documentation.\n\n## Known Issues\n\n* ~~Multi-GPU training support is only available through `DataParallel` currently.~~\n* Multi-label classification is not supported.\n\n## Citation\n\nIf you find TextBrewer helpful, please cite [our paper](https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2020.acl-demos.2\u002F):\n```bibtex\n@InProceedings{textbrewer-acl2020-demo,\n    title = \"{T}ext{B}rewer: {A}n {O}pen-{S}ource {K}nowledge {D}istillation {T}oolkit for {N}atural {L}anguage {P}rocessing\",\n    author = \"Yang, Ziqing and Cui, Yiming and Chen, Zhipeng and Che, Wanxiang and Liu, Ting and Wang, Shijin and Hu, Guoping\",\n    booktitle = \"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations\",\n    year = \"2020\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2020.acl-demos.2\",\n    pages = \"9--16\",\n}\n```\n\n## Follow Us\nFollow our official WeChat account to stay up to date with our latest 
technologies!\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fairaria_TextBrewer_readme_a467d80e8c8d.jpg)\n","[**English**](README.md) | [**中文说明**](README_ZH.md)\n\n\u003Cp align=\"center\">\n    \u003Cbr>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fairaria_TextBrewer_readme_8cf43e588d08.png\" width=\"500\"\u002F>\n    \u003Cbr>\n\u003Cp>\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Fblob\u002Fmaster\u002FLICENSE\">\n        \u003Cimg alt=\"GitHub\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fairaria\u002FTextBrewer.svg?color=blue&style=flat-square\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Ftextbrewer.readthedocs.io\u002F\">\n        \u003Cimg alt=\"Documentation\" src=\"https:\u002F\u002Fimg.shields.io\u002Fwebsite?down_message=offline&label=Documentation&up_message=online&url=https%3A%2F%2Ftextbrewer.readthedocs.io\">\n    \u003C\u002Fa>    \n    \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Ftextbrewer\">\n        \u003Cimg alt=\"PyPI\" src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Ftextbrewer\">\n    \u003C\u002Fa>    \n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Freleases\">\n        \u003Cimg alt=\"GitHub release\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002Fairaria\u002FTextBrewer?include_prereleases\">\n    \u003C\u002Fa>\n\u003C\u002Fp>\n\n**TextBrewer** is a PyTorch-based model distillation toolkit for natural language processing. It brings together distillation techniques from both the NLP and computer vision fields and provides an easy-to-use distillation framework, so that users can quickly try state-of-the-art distillation methods and compress models with a relatively small loss in performance, increasing inference speed and reducing memory usage.\n\nYou can read our paper through [ACL Anthology](https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2020.acl-demos.2\u002F) or the [arXiv pre-print](https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.12620).\n\n[Full Documentation](https:\u002F\u002Ftextbrewer.readthedocs.io\u002F)\n\n## News\n\n**Dec 17, 2021**\n\n* **We have released the model pruning toolkit 
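TextPruner**. See https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextPruner\n\n**Oct 24, 2021**\n\n* **We propose the first pre-trained language model that specifically targets Chinese minority languages. See https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-Minority-PLM**\n\n\n**Jul 8, 2021**\n\n* **New examples with Transformers 4**\n  * The current examples (examples\u002F) were written with older versions of Transformers and may cause confusion and bugs. We have rewritten the examples with Transformers 4 in Jupyter notebooks, which are easier to follow and learn from.\n  * The new examples can be found at [examples\u002Fnotebook_examples](examples\u002Fnotebook_examples\u002F). See [Examples](#examples) for details.\n\n**Mar 1, 2021**\n\n* **BERT-EMD and custom distiller**\n\n  * We added an experiment with [BERT-EMD](https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2020.emnlp-main.242\u002F) to the [MNLI example](examples\u002Fmnli_example\u002F). BERT-EMD lets each intermediate student layer learn adaptively from any intermediate teacher layer by optimizing the Earth Mover's Distance, so there is no need to specify a matching scheme.\n  * We wrote a new [EMDDistiller](examples\u002Fmnli_example\u002Fdistiller_emd.py) to perform BERT-EMD. It demonstrates how to write a custom distiller.\n\n* **Updated MNLI example**\n\n  * We removed pretrained_pytorch_bert and now use the transformers library in all MNLI examples.\n\n\u003Cdetails>\n\u003Csummary>Click here to see old news\u003C\u002Fsummary>\n\n**Nov 11, 2020**\n\n* **Updated to 0.2.1**:\n  * **More flexible distillation**: supports feeding different batches to the student and the teacher. The batches for the student and the teacher no longer need to be the same, which makes it possible to distill between models with different vocabularies (e.g., from RoBERTa to BERT).\n  * **Faster distillation**: users can now pre-compute and cache the teacher's outputs, then feed the cache to the distiller to save the teacher's forward pass time.\n  \n    For details of the features above, see [Feed Different batches to Student and Teacher, Feed Cached Values](https:\u002F\u002Ftextbrewer.readthedocs.io\u002Fen\u002Flatest\u002FConcepts.html#feed-different-batches-to-student-and-teacher-feed-cached-values).\n\n  * `MultiTaskDistiller` now supports intermediate feature matching loss.\n  * TensorBoard now records more detailed losses (KD loss, hard label loss, matching losses, etc.).\n\n  See the [release notes](https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Freleases\u002Ftag\u002Fv0.2.1) for details.\n\n**Aug 27, 2020**\n\n**We are happy to announce that our model ranks at the top of the GLUE benchmark; see the [leaderboard](https:\u002F\u002Fgluebenchmark.com\u002Fleaderboard).**\n\n\n\n**Aug 24, 2020**\n\n* **Updated to 0.2.0.1**:\n  * Fixed bugs in `MultiTaskDistiller` and the training loops.\n\n**Jul 29, 2020**\n\n* **Updated to 0.2.0**:\n    * Added support for distributed data-parallel training with `DistributedDataParallel`: `TrainingConfig` now accepts the `local_rank` argument. See the documentation of `TrainingConfig` for details.\n  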
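  * Added an example of distillation on a Chinese NER task to demonstrate distributed data-parallel training. See [examples\u002Fmsra_ner_example](examples\u002Fmsra_ner_example).\n\n**Jul 14, 2020**\n* **Updated to 0.1.10**:\n    * Now supports mixed precision training with Apex! Just set `fp16` to `True` in `TrainingConfig`. See the documentation of `TrainingConfig` for details.\n    * Added the `data_parallel` option to `TrainingConfig` so that data parallel training and mixed precision training can be used together.\n\n**Apr 26, 2020**\n\n* Added results for the Chinese NER task (MSRA NER).\n* Added results for distilling to the T12-nano model, which has a structure similar to Electra-small.\n* Updated some results on CoNLL-2003, CMRC 2018, and DRCD.\n\n**Apr 22, 2020**\n\n* **Updated to 0.1.9** (added a caching option that speeds up distillation; fixed some bugs). See the [release notes](https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Freleases\u002Ftag\u002Fv0.1.9) for details.\n* Added experimental results for distilling Electra-base to Electra-small on Chinese tasks.\n* TextBrewer has been accepted by [ACL 2020](http:\u002F\u002Facl2020.org) as a demo paper; please use our new [bib entry](#Citation).\n\n**Mar 17, 2020**\n\n* Added a CoNLL-2003 English NER distillation example. See [examples\u002Fconll2003_example](examples\u002Fconll2003_example).\n\n**Mar 11, 2020**\n\n* **Updated to 0.1.8** (improvements to TrainingConfig and the train method). See the [release notes](https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Freleases\u002Ftag\u002Fv0.1.8) for details.\n\n**Mar 2, 2020**\n\n* Initial public version 0.1.7 released. See the [release notes](https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Freleases\u002Ftag\u002Fv0.1.7) for details.\n\n\u003C\u002Fdetails>\n\n## Table of Contents\n\n\u003C!-- TOC -->\n\n| Section | Contents |\n|-|-|\n| [Introduction](#introduction) | Introduction to TextBrewer |\n| [Installation](#installation) | How to install |\n| [Workflow](#workflow) | The two stages of the TextBrewer workflow |\n| [Quickstart](#quickstart) | Example: distilling BERT-base to a 3-layer BERT |\n| [Experiments](#experiments) | Distillation experiments on typical English and Chinese datasets |\n| [Core Concepts](#core-concepts) | Brief explanations of the core concepts in TextBrewer |\n| [FAQ](#faq) | Frequently asked questions |\n| [Known Issues](#known-issues) | Known issues |\n| [Citation](#citation) | How to cite TextBrewer |\n| [Follow Us](#follow-us) | - |\n\n\u003C!-- \u002FTOC -->\n\n## Introduction\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fairaria_TextBrewer_readme_e3d7f37cb5f5.png)\n\n**TextBrewer** is designed for the knowledge distillation of NLP models. It provides various distillation methods and a distillation framework for quickly setting up experiments.\n\nThe main features of **TextBrewer** are:\n\n* Wide support: it supports various model architectures (especially **Transformer**-based models).\n* 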
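Flexibility: users can design their own distillation scheme by combining different techniques; custom loss functions, modules, etc. are also supported.\n* Ease of use: users don't need to modify the model architectures.\n* Built for NLP: it is suitable for a wide variety of NLP tasks, such as text classification, machine reading comprehension, and sequence labeling.\n\n**TextBrewer** currently provides the following distillation techniques:\n\n* Mixed soft-label and hard-label training\n* Dynamic adjustment of loss weights and temperature\n* Various distillation loss functions: hidden state MSE, attention-matrix-based loss, neuron selectivity transfer, etc.\n* Freely adding intermediate feature matching losses\n* Multi-teacher distillation\n* ...\n\n**TextBrewer** includes:\n\n1. **Distillers**: the core components of distillation. Different distillers perform different distillation modes, e.g. GeneralDistiller, MultiTeacherDistiller, and BasicTrainer.\n2. **Configurations and presets**: configuration classes for training and distillation, plus predefined distillation loss functions and strategies.\n3. **Utilities**: auxiliary tools such as model parameter analysis.\n\nTo start distillation, users need to provide:\n\n1. The models (a trained **teacher** model and an untrained **student** model).\n2. The datasets and experiment configurations.\n\n**TextBrewer** has achieved impressive results on several typical NLP tasks. See [Experiments](#experiments) for details.\n\nSee the [Full Documentation](https:\u002F\u002Ftextbrewer.readthedocs.io\u002F) for detailed usage.\n\n### Architecture\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fairaria_TextBrewer_readme_e3d7f37cb5f5.png)\n\n## Installation\n\n* Requirements\n  * Python >= 3.6\n  * PyTorch >= 1.1.0\n  * TensorboardX or Tensorboard\n  * NumPy\n  * tqdm\n  * Transformers >= 2.0 (optional, used by some examples)\n  * Apex == 0.1.0 (optional, mixed precision training)\n\n* Install from PyPI\n\n  ```shell\n  pip install textbrewer\n  ```\n\n* Install from the GitHub source\n\n  ```shell\n  git clone https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer.git\n  pip install .\u002FTextBrewer\n  ```\n\n## Workflow\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fairaria_TextBrewer_readme_d63482bac52e.png)\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fairaria_TextBrewer_readme_8f4c999d0479.png)\n\n* **Stage 1**: Preparation:\n  1. Train the teacher model.\n  2. Define and initialize the student model.\n  3. Construct a dataloader, an optimizer, and a learning rate scheduler.\n\n* **Stage 2**: Distillation with TextBrewer:\n  1. Construct a **TrainingConfig** and a **DistillationConfig**, and initialize a **distiller**.\n  2. Define an **adaptor** and a **callback**. The **adaptor** adapts the model inputs and outputs, and the **callback** is called by the distiller during training.\n  3. 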
Call the **train** method of the **distiller**.\n\n\n## Quickstart\n\nHere we show the usage of TextBrewer by distilling BERT-base to a 3-layer BERT.\n\nBefore distillation, we assume users have provided:\n\n* A trained teacher model `teacher_model` (BERT-base) and a to-be-trained student model `student_model` (3-layer BERT).\n* A `dataloader` for the dataset, an `optimizer`, and a learning rate scheduler class `scheduler_class` with its argument dict `scheduler_args`.\n\nDistill with TextBrewer:\n\n```python\nimport textbrewer\nfrom textbrewer import GeneralDistiller\nfrom textbrewer import TrainingConfig, DistillationConfig\n\n# Show the statistics of model parameters\nprint(\"\\nteacher_model's parameters:\")\nresult, _ = textbrewer.utils.display_parameters(teacher_model,max_level=3)\nprint(result)\n\nprint(\"student_model's parameters:\")\nresult, _ = textbrewer.utils.display_parameters(student_model,max_level=3)\nprint(result)\n\n# Define an adaptor for interpreting the model inputs and outputs\ndef simple_adaptor(batch, model_outputs):\n    # The second and third elements of the model outputs are the logits and hidden states\n    return {'logits': model_outputs[1],\n            'hidden': model_outputs[2]}\n\n# Training configuration\ntrain_config = TrainingConfig()\n# Distillation configuration\n# Match different layers of the student and the teacher\ndistill_config = DistillationConfig(\n    intermediate_matches=[    \n     {'layer_T':0, 'layer_S':0, 'feature':'hidden', 'loss': 'hidden_mse','weight' : 1},\n     {'layer_T':8, 'layer_S':2, 'feature':'hidden', 'loss': 'hidden_mse','weight' : 1}])\n\n# Build the distiller\ndistiller = GeneralDistiller(\n    train_config=train_config, distill_config = distill_config,\n    model_T = teacher_model, model_S = student_model, \n    adaptor_T = simple_adaptor, adaptor_S = simple_adaptor)\n\n# Start!\nwith distiller:\n    distiller.train(optimizer, dataloader, num_epochs=1, scheduler_class=scheduler_class, scheduler_args = scheduler_args, callback=None)\n```\n\n### **Examples**\n\n* **Notebook examples with Transformers 4**\n  * [examples\u002Fnotebook\\_examples\u002Fsst2.ipynb](examples\u002Fnotebook\\_examples\u002Fsst2.ipynb) (English): train and distill BERT on SST-2, an English sentence classification task.\n  * [examples\u002Fnotebook\\_examples\u002Fmsra_ner.ipynb](examples\u002Fnotebook\\_examples\u002Fmsra_ner.ipynb) (Chinese): on MSRA NER, train and distill 
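BERT. MSRA NER is a Chinese sequence labeling task.\n  * [examples\u002Fnotebook\\_examples\u002Fsqaudv1.1.ipynb](examples\u002Fnotebook\\_examples\u002Fsqaudv1.1.ipynb) (English): train and distill BERT on SQuAD 1.1, an English machine reading comprehension task.\n\n* [examples\u002Frandom_token_example](examples\u002Frandom_token_example): a simple runnable toy example that demonstrates the usage of TextBrewer. It performs distillation on a text classification task with random tokens as inputs.\n* [examples\u002Fcmrc2018\\_example](examples\u002Fcmrc2018_example) (Chinese): distillation on CMRC 2018, a Chinese machine reading comprehension task, using DRCD as data augmentation.\n* [examples\u002Fmnli\\_example](examples\u002Fmnli_example) (English): distillation on MNLI, an English sentence-pair classification task. This example also shows how to perform multi-teacher distillation.\n* [examples\u002Fconll2003_example](examples\u002Fconll2003_example) (English): distillation on CoNLL-2003 English NER, a sequence labeling task.\n* [examples\u002Fmsra_ner_example](examples\u002Fmsra_ner_example) (Chinese): this example distills a Chinese-ELECTRA-base model on the MSRA NER task with distributed data-parallel training (single node, multiple GPUs).\n\n\n## Experiments\n\nWe have performed distillation experiments on several typical English and Chinese NLP datasets. The setups and configurations are listed below.\n\n### Models\n\n* For English tasks, the teacher model is [**BERT-base-cased**](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbert).\n* For Chinese tasks, the teacher models are [**RoBERTa-wwm-ext**](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-BERT-wwm) and [**Electra-base**](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-ELECTRA), released by the Joint Laboratory of HIT and iFLYTEK Research.\n\nWe have tested different student models. To compare with public results, the student models are built with standard Transformer blocks, except for BiGRU, which is a single-layer bidirectional GRU. The architectures are listed below. Note that the number of parameters includes the embedding layer but not the output layer of each specific task.\n\n#### English models\n\n| Model                 | \\#Layers | Hidden size | Feed-forward size | \\#Params | Relative size |\n| :--------------------- | --------- | ----------- | ----------------- | -------- | ------------- |\n| BERT-base-cased (teacher) | 12        | 768         | 3072              | 108M     | 100%          |\n| T6 (student)              | 6         | 768         | 3072              | 65M      | 60%           |\n| T3 (student)              | 3         | 768         | 3072              | 44M      | 41%           |\n| T3-small (student)        | 3         | 384         | 1536              | 17M      | 16%           |\n| T4-Tiny (student)         | 4         | 312         | 1200              | 14M      | 13%           |\n| 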
T12-nano (student)        | 12        | 256         | 1024              | 17M      | 16%           |\n| BiGRU (student)           | -         | 768         | -                 | 31M      | 29%           |\n\n#### Chinese models\n\n| Model                 | \\#Layers | Hidden size | Feed-forward size | \\#Params | Relative size |\n| :--------------------- | --------- | ----------- | ----------------- | -------- | ------------- |\n| RoBERTa-wwm-ext (teacher) | 12        | 768         | 3072              | 102M      | 100%          |\n| Electra-base (teacher)    | 12        | 768         | 3072              | 102M      | 100%          |\n| T3 (student)              | 3         | 768         | 3072              | 38M       | 37%           |\n| T3-small (student)        | 3         | 384         | 1536              | 14M       | 14%           |\n| T4-Tiny (student)         | 4         | 312         | 1200              | 11M       | 11%           |\n| Electra-small (student)   | 12        | 256         | 1024              | 12M       | 12%           |\n\n* The T6 architecture is the same as [DistilBERT\u003Csup>[1]\u003C\u002Fsup>](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.01108), [BERT\u003Csub>6\u003C\u002Fsub>-PKD\u003Csup>[2]\u003C\u002Fsup>](https:\u002F\u002Farxiv.org\u002Fabs\u002F1908.09355), and [BERT-of-Theseus\u003Csup>[3]\u003C\u002Fsup>](https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.02925).\n* The T4-tiny architecture is the same as [TinyBERT\u003Csup>[4]\u003C\u002Fsup>](https:\u002F\u002Farxiv.org\u002Fabs\u002F1909.10351).\n* The T3 architecture is the same as [BERT\u003Csub>3\u003C\u002Fsub>-PKD\u003Csup>[2]\u003C\u002Fsup>](https:\u002F\u002Farxiv.org\u002Fabs\u002F1908.09355).\n\n### Distillation Configurations\n\n```python\ndistill_config = DistillationConfig(temperature = 8, intermediate_matches = matches)\n# Other arguments take the default values\n```\n\n`matches` differ across models:\n\n| Model        | matches                                             |\n| :--------    | --------------------------------------------------- |\n| BiGRU        | None                                                |\n| T6           | L6_hidden_mse + L6_hidden_smmd                      
|\n| T3           | L3_hidden_mse + L3_hidden_smmd                      |\n| T3-small     | L3n_hidden_mse + L3_hidden_smmd                     |\n| T4-Tiny      | L4t_hidden_mse + L4_hidden_smmd                     |\n| T12-nano     | small_hidden_mse + small_hidden_smmd                |\n| Electra-small| small_hidden_mse + small_hidden_smmd                |\n\nThe definitions of `matches` are in [examples\u002Fmatches\u002Fmatches.py](examples\u002Fmatches\u002Fmatches.py).\n\nWe use GeneralDistiller in all the distillation experiments.\n\n### Training Configurations\n\n* The learning rate is 1e-4 (unless otherwise specified).  \n* We train all the models for 30~60 epochs.\n\n### Results on English Datasets\n\nWe experiment on the following typical English datasets:\n\n| Dataset    | Task type | Metrics | \\#Train | \\#Dev | Note |\n| :---------- | -------- | ------- | ------- | ---- | ---- |\n| [**MNLI**](https:\u002F\u002Fwww.nyu.edu\u002Fprojects\u002Fbowman\u002Fmultinli\u002F)       | text classification | m\u002Fmm Acc | 393k    | 20k  | sentence-pair 3-class classification |\n| [**SQuAD 1.1**](https:\u002F\u002Frajpurkar.github.io\u002FSQuAD-explorer\u002F)   | reading comprehension | EM\u002FF1   | 88k     | 11k  | span-extraction machine reading comprehension |\n| [**CoNLL-2003**](https:\u002F\u002Fwww.clips.uantwerpen.be\u002Fconll2003\u002Fner) | sequence labeling | F1      | 23k     | 6k   | named entity recognition |\n\nFor comparison, we list the public results from [DistilBERT](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.01108), [BERT-PKD](https:\u002F\u002Farxiv.org\u002Fabs\u002F1908.09355), [BERT-of-Theseus](https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.02925), and [TinyBERT](https:\u002F\u002Farxiv.org\u002Fabs\u002F1909.10351) together with our results below.\n\nPublic results:\n\n| Model (public) | MNLI  | SQuAD  | CoNLL-2003 |\n| :-------------  | --------------- | ------------- | --------------- |\n| DistilBERT (T6)    | 81.6 \u002F 81.1 | 78.1 \u002F 86.2   | -               |\n| BERT\u003Csub>6\u003C\u002Fsub>-PKD (T6)     | 81.5 \u002F 81.0     | 77.1 \u002F 85.3   | -|\n| BERT-of-Theseus (T6) | 82.4 \u002F 82.1   | -        | -                |\n| BERT\u003Csub>3\u003C\u002Fsub>-PKD (T3)     | 76.7 \u002F 76.3     | -             | -|\n| TinyBERT (T4-tiny) | 82.8 \u002F 82.9                | 72.7 \u002F 82.1   | 
-|\n\nOur results:\n\n| Model (ours) | MNLI  | SQuAD  | CoNLL-2003 |\n| :-------------  | --------------- | ------------- | --------------- |\n| **BERT-base-cased** (teacher) | 83.7 \u002F 84.0     | 81.5 \u002F 88.6   | 91.1  |\n| BiGRU          | -               | -             | 85.3            |\n| T6             | 83.5 \u002F 84.0     | 80.8 \u002F 88.1   | 90.7            |\n| T3             | 81.8 \u002F 82.7     | 76.4 \u002F 84.9   | 87.5            |\n| T3-small       | 81.3 \u002F 81.7     | 72.3 \u002F 81.4   | 78.6            |\n| T4-tiny        | 82.0 \u002F 82.6     | 75.2 \u002F 84.0   | 89.1            |\n| T12-nano       | 83.2 \u002F 83.9     | 79.0 \u002F 86.6   | 89.6            |\n\n**Note**:\n\n1. The equivalent model structures of the public models are given in the brackets after their names.\n2. When distilling to T4-tiny, NewsQA is used for data augmentation on SQuAD and HotpotQA is used for data augmentation on CoNLL-2003.\n3. When distilling to T12-nano, HotpotQA is used for data augmentation on CoNLL-2003.\n\n### Results on Chinese Datasets\n\nWe experiment on the following typical Chinese datasets:\n\n\n| Dataset | Task type | Metrics | \\#Train | \\#Dev | Note |\n| :------- | ---- | ------- | ------- | ---- | ---- |\n| [**XNLI**](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbert\u002Fblob\u002Fmaster\u002Fmultilingual.md) | text classification | Acc | 393k | 2.5k | Chinese translation of MNLI |\n| [**LCQMC**](http:\u002F\u002Ficrc.hitsz.edu.cn\u002Finfo\u002F1037\u002F1146.htm) | text classification | Acc | 239k | 8.8k | sentence-pair matching, binary classification |\n| [**CMRC 2018**](https:\u002F\u002Fgithub.com\u002Fymcui\u002Fcmrc2018) | reading comprehension | EM\u002FF1 | 10k | 3.4k | span-extraction machine reading comprehension |\n| [**DRCD**](https:\u002F\u002Fgithub.com\u002FDRCKnowledgeTeam\u002FDRCD) | reading comprehension | EM\u002FF1 | 27k | 3.5k | span-extraction machine reading comprehension (Traditional Chinese) |\n| [**MSRA NER**](https:\u002F\u002Ffaculty.washington.edu\u002Flevow\u002Fpapers\u002Fsighan06.pdf) | sequence labeling | F1 | 45k | 3.4k (\\#Test) | Chinese named entity recognition |\n\nThe results are listed below.\n\n| Model           | XNLI | LCQMC | CMRC 2018 | DRCD |\n| :--------------- | ---------- | ----------- | ---------------- | ------------ |\n| **RoBERTa-wwm-ext** (teacher) | 79.9       | 89.4        | 68.8 \u002F 86.4      | 86.5 \u002F 92.5  |\n| T3          | 78.4       | 89.0        | 66.4 \u002F 
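84.2      | 78.2 \u002F 86.4  |\n| T3-small    | 76.0       | 88.1        | 58.0 \u002F 79.3      | 75.8 \u002F 84.8  |\n| T4-tiny     | 76.2       | 88.4        | 61.8 \u002F 81.8      | 77.3 \u002F 86.1  |\n\n| Model                       | XNLI       | LCQMC       | CMRC 2018        | DRCD        | MSRA NER |\n| :---------------------------| ---------- | ----------- | ---------------- | ------------|----------|\n| **Electra-base** (teacher) | 77.8       | 89.8        | 65.6 \u002F 84.7     | 86.9 \u002F 92.3  | 95.14    |\n| Electra-small               | 77.7       | 89.3        | 66.5 \u002F 84.9     | 85.5 \u002F 91.3  | 93.48    |\n\n\n**Note**:\n\n1. Learning rate decay is not used in distillation on CMRC 2018 and DRCD.\n2. CMRC 2018 and DRCD take each other as the augmentation dataset in distillation.\n3. The training settings of the Electra-base teacher model can be found at [**Chinese-ELECTRA**](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-ELECTRA).\n4. The Electra-small student model is initialized with the [pretrained weights](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-ELECTRA).\n\n## Core Concepts\n\n### Configurations\n\n* `TrainingConfig`: configuration related to general deep learning model training\n* `DistillationConfig`: configuration related to distillation methods\n\n### Distillers\n\nDistillers are in charge of conducting the actual experiments. The following distillers are available:\n\n* `BasicDistiller`: **single-teacher single-task** distillation, provides basic distillation strategies.\n* `GeneralDistiller` (Recommended): **single-teacher single-task** distillation, supports intermediate feature matching. **Recommended most of the time**.\n* `MultiTeacherDistiller`: **multi-teacher** distillation, which distills multiple teacher models (of the same task) into a single student model. **This class doesn't support intermediate feature matching.**\n* `MultiTaskDistiller`: **multi-task** distillation, which distills multiple teacher models (of different tasks) into a single student model.\n* `BasicTrainer`: supervised training of a single model on a labeled dataset, not for distillation. **It can be used to train a teacher model**.\n\n\n### User-Defined Functions\n\nIn TextBrewer, there are two functions that should be implemented by users: **callback** and **adaptor**.\n\n####  **Callback** \n\nAt each checkpoint, after saving the student model, the callback function is called by the distiller. A callback can be used to evaluate the performance of the student model at each checkpoint.\n\n#### Adaptor\nIt converts the model inputs and outputs to the specified format so that they can be recognized by the distiller and the distillation losses can be computed. At each training step, the batch and the model outputs are passed to the adaptor; the adaptor re-organizes the data and returns a dictionary.\n\nFor more details, see the explanations in the [Full Documentation](https:\u002F\u002Ftextbrewer.readthedocs.io\u002F).\n\n## FAQ\n\n**Q**: How to initialize the student model?\n\n**A**: The student model can be randomly initialized (i.e., with no prior knowledge) or initialized with pre-trained weights.\nFor example, when distilling a BERT-base model to a 3-layer BERT, you could initialize the student model with [RBT3](https:\u002F\u002Fgithub.com\u002Fymcui\u002FChinese-BERT-wwm) (for Chinese tasks) or the first three layers of BERT (for English tasks) to avoid the cold-start problem.\nWe recommend that users use pre-trained student models whenever possible to fully take advantage of large-scale pre-training.\n\n**Q**: 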
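How to set training hyperparameters for the distillation experiments?\n\n**A**: Knowledge distillation usually requires more training epochs and a larger learning rate than training on the labeled dataset. For example, training BERT-base on SQuAD usually takes 3 epochs with lr=3e-5; however, distillation takes 30~50 epochs with lr=1e-4. **These conclusions are based on our experiments, and you are advised to try them on your own data**.\n\n**Q**: My teacher model and student model take different inputs (they do not share vocabularies), so how can I distill?\n\n**A**: You need to feed different batches to the teacher and the student. See the section [Feed Different batches to Student and Teacher, Feed Cached Values](https:\u002F\u002Ftextbrewer.readthedocs.io\u002Fen\u002Flatest\u002FConcepts.html#feed-different-batches-to-student-and-teacher-feed-cached-values) in the full documentation.\n\n**Q**: I have stored the logits from my teacher model. Can I use them in the distillation to save the forward pass time?\n\n**A**: Yes, see the section [Feed Different batches to Student and Teacher, Feed Cached Values](https:\u002F\u002Ftextbrewer.readthedocs.io\u002Fen\u002Flatest\u002FConcepts.html#feed-different-batches-to-student-and-teacher-feed-cached-values) in the full documentation.\n\n## Known Issues\n\n* ~~Multi-GPU training support is only available through `DataParallel` currently.~~\n* Multi-label classification is not supported.\n\n## Citation\n\nIf you find TextBrewer helpful, please cite our paper:\n```bibtex\n@InProceedings{textbrewer-acl2020-demo,\n    title = \"{T}ext{B}rewer: {A}n {O}pen-{S}ource {K}nowledge {D}istillation {T}oolkit for {N}atural {L}anguage {P}rocessing\",\n    author = \"Yang, Ziqing and Cui, Yiming and Chen, Zhipeng and Che, Wanxiang and Liu, Ting and Wang, Shijin and Hu, Guoping\",\n    booktitle = \"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations\",\n    year = \"2020\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2020.acl-demos.2\",\n    pages = \"9--16\",\n}\n```\n\n## Follow Us\nFollow our official WeChat account to stay up to date with our latest technologies!\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fairaria_TextBrewer_readme_a467d80e8c8d.jpg)","# TextBrewer Quick Start Guide\n\nTextBrewer is a PyTorch-based model distillation toolkit for natural language processing (NLP). It integrates a variety of distillation techniques and provides an easy-to-use framework that helps users compress models with a relatively small loss in performance, increasing inference speed and reducing GPU memory usage.\n\n## Environment Setup\n\nBefore you start, make sure your development environment meets the following requirements:\n\n*   **Operating system**: Linux \u002F macOS \u002F Windows\n*   **Python**: >= 3.6\n*   **Core dependencies**:\n    *   PyTorch >= 1.1.0\n    *   NumPy\n    *   tqdm\n    *   TensorboardX or Tensorboard\n*   **Optional dependencies**:\n    *   Transformers >= 2.0 (for loading pre-trained models; the latest version is recommended)\n    *   Apex == 0.1.0 (if mixed precision training is needed)\n\n## Installation\n\nYou can install directly from PyPI or from the GitHub source. Users in mainland China may want to use the Tsinghua or Alibaba mirror to speed up downloads.\n\n### Option 1: Install from PyPI (recommended)\n\n```bash\npip install textbrewer -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### Option 2: Install from the GitHub source\n\n```bash\ngit clone 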
https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer.git\ncd TextBrewer\npip install . -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n## Basic Usage\n\nThe TextBrewer workflow has two main stages: the **preparation stage** (define the models, data, and optimizer) and the **distillation stage** (configure the distillation strategy and start training).\n\nBelow is a minimal example of distilling `BERT-base` (the teacher model) to a `3-layer BERT` (the student model).\n\n### 1. Prerequisites\nAssume you have already prepared the following objects:\n*   `teacher_model`: the trained teacher model\n*   `student_model`: the student model to be trained\n*   `dataloader`: the data loader\n*   `optimizer`: the optimizer\n*   `scheduler_class` and `scheduler_args`: the learning rate scheduler class and its arguments\n\n### 2. Write the distillation code\n\n```python\nimport textbrewer\nfrom textbrewer import GeneralDistiller\nfrom textbrewer import TrainingConfig, DistillationConfig\n\n# 1. (Optional) Show model parameter statistics\nprint(\"\\nteacher_model's parameters:\")\nresult, _ = textbrewer.utils.display_parameters(teacher_model, max_level=3)\nprint(result)\n\nprint(\"student_model's parameters:\")\nresult, _ = textbrewer.utils.display_parameters(student_model, max_level=3)\nprint(result)\n\n# 2. Define the adaptor\n# The adaptor adapts the model inputs and outputs and extracts the required logits and hidden states\ndef simple_adaptor(batch, model_outputs):\n    # Assume model_outputs[1] is the logits and model_outputs[2] is the hidden states\n    return {'logits': model_outputs[1],\n            'hidden': model_outputs[2]}\n\n# 3. Configure training and distillation\n# Training configuration (default values)\ntrain_config = TrainingConfig()\n\n# Distillation configuration\n# Define the intermediate layer matching scheme: e.g. match student layer 0 to teacher layer 0, and student layer 2 to teacher layer 8\ndistill_config = DistillationConfig(\n    intermediate_matches=[    \n     {'layer_T': 0, 'layer_S': 0, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},\n     {'layer_T': 8, 'layer_S': 2, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1}\n    ])\n\n# 4. Build the distiller\ndistiller = GeneralDistiller(\n    train_config=train_config, \n    distill_config=distill_config,\n    model_T=teacher_model, \n    model_S=student_model, \n    adaptor_T=simple_adaptor, \n    adaptor_S=simple_adaptor\n)\n\n# 5. 
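Start distillation training\nwith distiller:\n    distiller.train(\n        optimizer, \n        dataloader, \n        num_epochs=1, \n        scheduler_class=scheduler_class, \n        scheduler_args=scheduler_args, \n        callback=None\n    )\n```\n\n### Key Points\n*   **Adaptor**: required. It tells TextBrewer how to extract from your model outputs the key tensors used for computing losses (such as logits, hidden states, and attention matrices).\n*   **DistillationConfig**: the `intermediate_matches` list flexibly defines the correspondence between teacher layers and student layers and the type of loss function.\n*   **GeneralDistiller**: the most commonly used distiller class, suitable for the single-teacher single-student setting.\n\nFor more advanced usage (such as multi-teacher distillation, custom loss functions, and caching teacher outputs to speed up training), refer to the [official documentation](https:\u002F\u002Ftextbrewer.readthedocs.io\u002F) or see the `examples\u002Fnotebook_examples` directory in the project.","The algorithm team at an e-commerce company needs to deploy a high-accuracy large BERT model in a mobile customer-service chatbot that answers user inquiries in real time, but it is constrained by the compute and memory available on phones.\n\n### Without TextBrewer\n- **Very high cost of manual implementation**: engineers have to write knowledge distillation code from scratch and reproduce complex intermediate feature matching or attention matrix alignment logic, with a development cycle of several weeks.\n- **Compression quality is hard to tune**: without a unified framework that supports multiple distillation strategies (such as BERT-EMD), it is hard to experiment with different schemes quickly, so the small model loses too much accuracy.\n- **Inefficient training**: every iteration runs a forward pass through both the teacher and the student, and teacher outputs cannot be cached, which leads to high GPU memory usage and slow training.\n- **Cross-architecture transfer is difficult**: if the teacher and the student use different vocabularies (e.g., distilling from RoBERTa to BERT), data alignment has to be handled by hand, which is error-prone.\n\n### With TextBrewer\n- **Ready-to-use distillation framework**: calling the built-in Distiller modules directly, a few lines of code are enough to configure advanced distillation algorithms, cutting development time to a few days.\n- **Balance between performance and size**: with techniques such as adaptive layer matching, the model shrinks by 70% while giving up less than 2% of intent recognition accuracy.\n- **Faster training**: pre-computing and caching the teacher outputs greatly reduces repeated computation, raising training speed by 40% and lowering GPU memory usage by 50%.\n- **Flexible support for heterogeneous models**: native support for feeding different batches to the student and the teacher makes cross-vocabulary and cross-architecture transfer straightforward, without extra data processing.\n\nWith TextBrewer, the team slimmed down a large language model and deployed it efficiently to on-device hardware at very low engineering cost, improving user experience while reducing resource consumption.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fairaria_TextBrewer_ad1f9ee7.png","airaria","Ziqing Yang","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fairaria_ef40fba3.jpg","What I cannot create, I do not understand",null,"Shanghai","yangziqing@163.com","airaria.github.io","https:\u002F\u002Fgithub.com\u002Fairaria",[85],{"name":86,"color":87,"percentage":88},"Python","#3572A5",100,1695,246,"2026-04-02T08:39:56","Apache-2.0","Not specified","Not specified (depends on PyTorch; mixed precision training requires Apex)",{"notes":96,"python":97,"dependencies":98},"This tool is mainly used for NLP model distillation. Although the README does not specify an operating system, as a PyTorch project it is generally compatible with Linux, macOS, and Windows. A GPU is not strictly required, but for large-scale model distillation or mixed precision training (which requires Apex), we recommend a CUDA-capable 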
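NVIDIA GPU. The exact GPU memory and CUDA version depend on the size of the chosen teacher\u002Fstudent models and on PyTorch version compatibility.",">=3.6",[99,100,101,102,103,104],"torch>=1.1.0","tensorboardX or tensorboard","numpy","tqdm","transformers>=2.0 (optional)","apex==0.1.0 (optional, for mixed precision training)",[13,51,26],[107,108,109,110,111],"bert","pytorch","nlp","knowledge","distillation","2026-03-27T02:49:30.150509","2026-04-06T05:32:27.314243",[115,120,125,130,135,140],{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},16906,"Where can I get the MNLI dataset?","Use the MultiNLI matched\u002Fmismatched datasets from the GLUE benchmark. Available at: https:\u002F\u002Fgluebenchmark.com\u002Ftasks","https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Fissues\u002F67",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},16907,"I cannot reproduce the official results on the MSRA NER task and the F1 score is low. What should I do?","Check the following settings:\n1. Model choice: use Chinese-ELECTRA-base as the teacher and Chinese-ELECTRA-small as the student.\n2. Hyperparameters: try setting the learning rate to 1e-4 and the number of epochs to 60, which can significantly improve the results.\n3. Environment: make sure to use Pytorch==1.10.0 and Transformers==4.28.0 (or try Transformers 3.X in an older environment).\nReference scripts: ner_ElectraTrain_dist.sh and ner_ElectraDistill_dist.sh in examples\u002Fmsra_ner_example.","https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Fissues\u002F49",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},16908,"How do I resolve errors or compatibility problems when running notebook_examples\u002Fmsra_ner.ipynb?","This notebook example relies on many features of the datasets library and has not been kept up to date, so it may have compatibility problems. We recommend referring to and running the corresponding Python scripts instead: https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Ftree\u002Fmaster\u002Fexamples\u002Fmsra_ner_example. Also make sure the Transformers version is compatible (for 4.25.1 or later, confirm that it works).","https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Fissues\u002F112",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},16909,"Adding the hard loss during distillation makes the model worse (especially with a randomly initialized student). Why?","If the student model is randomly initialized, introducing the hard loss directly may make training unstable or cause a sharp drop in performance, because hard labels can give misleading gradients before the model has converged.\nSuggestions:\n1. If you only need to train with the hard loss, use textbrewer.BasicTrainer rather than GeneralDistiller for verification.\n2. If distillation is required, try pre-training the student with the hard loss for a number of steps first, or use a very small hard_loss_weight (note that even a small value may affect performance).\n3. 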
Check the code logic and confirm that training directly with BasicTrainer is consistent with running GeneralDistiller with only the hard loss kept.","https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Fissues\u002F39",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},16910,"How do I resolve data loading or version compatibility problems when distilling BERT-wwm-ext?","This problem is usually caused by a Transformers version that is too new. Solutions include:\n1. Downgrade Transformers to a compatible version.\n2. Change the data loading so that the transform function returns data as a dictionary, for example:\n   return {'input_ids': self.all_input_ids[index], 'attention_mask': self.all_attention_mask[index], 'labels': self.all_labels[index]}","https:\u002F\u002Fgithub.com\u002Fairaria\u002FTextBrewer\u002Fissues\u002F30",{"id":141,"question_zh":142,"answer_zh":143,"source_url":129},16911,"How do I correctly configure intermediate layer matching (intermediate_matches) in DistillationConfig?","When configuring intermediate layer matching, explicitly specify the teacher layer (layer_T), the student layer (layer_S), the feature type (feature), and the loss function (loss). For example, a common configuration for the MSRA NER task is:\nintermediate_matches=[\n  {\"layer_T\":0, \"layer_S\":0, \"feature\":\"hidden\", \"loss\":\"hidden_mse\", \"weight\":1},\n  {\"layer_T\":4, \"layer_S\":1, \"feature\":\"hidden\", \"loss\":\"hidden_mse\", \"weight\":1},\n  {\"layer_T\":8, \"layer_S\":2, \"feature\":\"hidden\", \"loss\":\"hidden_mse\", \"weight\":1},\n  {\"layer_T\":12,\"layer_S\":3, \"feature\":\"hidden\", \"loss\":\"hidden_mse\", \"weight\":1}\n]\nThe adaptor must also return the corresponding hidden_states, for example: return {\"logits\":model_outputs.logits, 'hidden': model_outputs.hidden_states}.",[145,150,155,160,165,170,175],{"id":146,"version":147,"summary_zh":148,"released_at":149},99155,"v0.2.1.post1","# New Features\n\n  * **More flexible distillation**: supports feeding different batches to the student and the teacher. The batches for the student and the teacher no longer need to be the same. This can be used to distill models with different vocabularies (e.g., from RoBERTa to BERT). See the documentation for details.\n\n  * **Faster distillation**: users can pre-compute and cache the teacher outputs, then feed the cache to the distiller to save the teacher's forward pass time. See the documentation for details.\n\n# Improvements\n\n* `MultiTaskDistiller` is now a subclass of `GeneralDistiller` and supports intermediate feature matching loss.\n* TensorBoard now records more detailed losses (KD loss, hard label loss, matching losses, etc.).\n* `pkd_loss` now accepts (batch_size, length, hidden_size) or (batch_size, hidden_size) 
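tensors. In the latter case, the loss is computed directly on the input tensors, without taking the hidden states at the first position.","2020-11-11T02:38:09",{"id":151,"version":152,"summary_zh":153,"released_at":154},99156,"v0.2.0.1","# Bug Fixes\n\n* Fixed bugs in `MultiTaskDistiller`.\n* Fixed the endless training loop when training with `num_steps`. Distillers now stop correctly.","2020-08-24T08:40:54",{"id":156,"version":157,"summary_zh":158,"released_at":159},99157,"v0.2.0","# New Features\n  \n* Now supports distributed data-parallel training with `torch.nn.parallel.DistributedDataParallel`! You can pass `local_rank` to `TrainingConfig` to set up the distributed training environment. See the official PyTorch documentation for the detailed usage of `DistributedDataParallel`.\n\n* We also added a new example (a Chinese NER task) to demonstrate how to use TextBrewer with distributed data-parallel training.","2020-07-30T01:15:48",{"id":161,"version":162,"summary_zh":163,"released_at":164},99158,"v0.1.10","# New Features\n  \n  * Now supports mixed precision training with Apex! Just set `fp16` to `True` in `TrainingConfig`. See the documentation of `TrainingConfig` for details.\n* Added the `data_parallel` option to `TrainingConfig` to enable data parallel training within TextBrewer.","2020-07-16T01:49:31",{"id":166,"version":167,"summary_zh":168,"released_at":169},99159,"v0.1.9","# New Features\n\n* Added an option `is_caching_logits` to `DistillationConfig`. If `is_caching_logits` is True, the distiller caches the batches and the output logits of the teacher model, so that those logits are computed only once. This speeds up distillation. This feature is **only available** for `BasicDistiller` and `MultiTeacherDistiller`. **Be cautious about setting it to True on large datasets, since it stores the batches and logits in memory.**\n\n# Improvements\n\n* Added a new argument `max_grad_norm` to the distillers' `train` method. It sets the strength of gradient clipping. The default is -1, i.e., no gradient clipping.\n* Added new arguments `scheduler_class` and `scheduler_args` to the distillers' `train` method. The old `scheduler` argument may cause convergence problems and is deprecated in favor of `scheduler_class` and `scheduler_args`. See the documentation for details.\n* Removed the `print` in `display_parameters`. It no longer prints the statistics directly to the screen.\n\n# Bug Fixes\n\n* Fixed the wrong call of `zero_grad()`.\n","2020-04-21T01:06:45",{"id":171,"version":172,"summary_zh":173,"released_at":174},99160,"v0.1.8","# Improvements:\n\n* `TrainingConfig.log_dir` can be set to `None` to disable TensorBoard.\n* Added an attribute `print_freq` to the distiller to control the frequency of logging.\n* Added a new argument `num_steps` to the `train` method of the distiller. If `num_steps` is specified, the distiller ignores `num_epochs` and supports dataloaders of unknown size (i.e., without a `__len__` attribute).\n* The distillers' `train` method has a new argument, 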
`batch_postprocessor`, for post-processing each batch.","2020-03-11T01:07:22",{"id":176,"version":177,"summary_zh":178,"released_at":179},99161,"v0.1.7","This is the first release of TextBrewer.","2020-03-03T02:57:09"]