[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-e-p-armstrong--augmentoolkit":3,"tool-e-p-armstrong--augmentoolkit":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":79,"owner_twitter":79,"owner_website":80,"owner_url":81,"languages":82,"stars":102,"forks":103,"last_commit_at":104,"license":105,"difficulty_score":10,"env_os":106,"env_gpu":107,"env_ram":108,"env_deps":109,"category_tags":115,"github_topics":116,"view_count":10,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":120,"updated_at":121,"faqs":122,"releases":150},870,"e-p-armstrong\u002Faugmentoolkit","augmentoolkit","Create Custom LLMs","augmentoolkit 是一款专为打造领域专家型 AI 而设计的开源工具。通过简单的文档上传操作，它能自动生成高质量的数据集，进而更新大模型的“知识库”，使其成为你指定领域的专家。这有效解决了通用大模型知识截止早、无法理解私有数据或特定小众领域内容的问题。\n\n无论你是需要追踪前沿科研论文的学者，希望 AI 深刻理解个人兴趣的研究者，还是想为虚构世界构建 lore 专家的创作者，augmentoolkit 都能满足需求。它特别适合开发者、技术人员及有一定动手能力的普通用户使用。\n\n技术层面，augmentoolkit 支持在本地计算机离线运行，无需外部 API 密钥即可生成数据，极大保障了数据安全与隐私。此外，它还能自动创建 RAG 就绪数据集并启动推理服务器，兼容 Deepseek、Llama 等主流开源模型，且支持多 GPU 并行加速。作为 MIT 许可的项目，它高度可定制，是构建专属智能助手的高效选择。","# Augmentoolkit - Data for Domain-expert AI\nAugmentoolkit creates domain-expert datasets that update an AI's brain (basically, its knowledge cutoff), so that the AI becomes an expert in an area of your choosing.\n\nYou upload documents, and press a button. And get a fully trained custom LLM. 
Now every aspect of your AI's behavior and understanding is under your control. Better still, Augmentoolkit **optionally works offline on your computer** -- no external API key required* for datagen† on most hardware.\n\nMaybe you want AI to know the latest research papers in your field, or perhaps you want an LLM that understands your passion deeply and has learned from the same sources as you. Possibly, you dream of creating a lore expert for your favorite obscure fictional universe. Whatever the application is, Augmentoolkit lets you take text and make an LLM's brain inherently learn the information contained within. It also automatically creates a RAG-ready dataset (and can start up an inference server) if you want some traditional grounding as well.\n\nGet started now (the interface will guide you through generating your first dataset):\n\n(Be sure to use Python 3.11 when creating the virtual environment to be sure this'll work)\n\n### MacOS (interface)\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit.git\ncd augmentoolkit # Python == 3.11\nbash macos.sh # NOTE: Will attempt to install valkey via brew if not found.\n# bash local_macos.sh # use this command if doing local dataset generation\n```\n\n### Linux (interface)\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit.git\ncd augmentoolkit # Python == 3.11\nbash linux.sh # NOTE: will build Valkey from source if a Redis\u002FValkey server is not running\n```\n\n**Or for local inference**\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit.git\ncd augmentoolkit # Python == 3.11\nbash local_linux.sh normal # or you can write \"small\" or a custom model name to serve the quantized version (for more consumer hardware) or a model of your choice, respectively. 
See the quickstart page linked just a bit farther down for a full reference here.\n```\n\nIf you have multiple GPUs, run `local_linux.sh` with the `--tensor-parallelism N` argument. N == the number of GPUs you have (typically a power of two: 1, 2, 4, 8, etc.).\n\n> [!IMPORTANT]\n>\n> Please star the repo.\n\n### Windows (interface)\n> [!NOTE]\n>\n> If you're on Windows, your best bet is to use [WSL](https:\u002F\u002Flearn.microsoft.com\u002Fen-us\u002Fwindows\u002Fwsl\u002Finstall). [The CLI is easier to get running on Windows honestly.](docs\u002Fquickstart.md#windows-cli)\n\n\u003Csub>*Note that datagen can take a while on a lot of hardware; don't expect fast datagen on an old Mac, for instance. And for training you will need either a powerful machine of your own, or to rent one (the latter is done automatically for you if you so choose).\u003C\u002Fsub>\n\n\u003Csub>†If you want data to generate faster you *can* use an open-source LLM API, and the quickstart encourages you to. In addition to its custom dataset generation model, Augmentoolkit is optimized for open source LLMs like Deepseek or Llama.\u003C\u002Fsub>\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fe-p-armstrong_augmentoolkit_readme_9f2fe32fdff0.png)\n\nAugmentoolkit, now that it is on its 3.0 version, has been refined and improved through over a year of professional application and experimentation. It is now the best way in the world to create domain expert LLMs, and it's MIT licensed.\n\nIf you use this project and like it, please consider starring the repo! It's also designed to be extremely customizable, so consider **forking** Augmentoolkit!\n\n> [!IMPORTANT]\n>\n> The below links contain very useful information. 
There is a table of contents that links to extensive documentation pages for any conceivable part of the project, a bit further down.\n\n[Help Videos](#video-tutorials) I walk through how to do all the cool stuff in this project starting from scratch, including training LLMs with the data and configs you get (takes 10 minutes). Check out the help videos if you want further guidance!\n\n[Community](https:\u002F\u002Fdiscord.gg\u002Fs6PBfsaVzu) If you have questions, if you are training models, if you are building cool new pipelines or extensions on top of Augmentoolkit's code, or if you just want to hang out, I'd love to see you on the Augmentoolkit discord! It's also a good place to reach me.\n\n[Newsletter](https:\u002F\u002Fpromptingweekly.substack.com\u002F) I write about model training and data generation over on a Substack. Totally free -- I just want to help people be able to use the tool better.\n\n[Contact](#contact) I'm doing all kinds of things around this project. If you're interested in the mission and the business of bringing custom, personally-aligned AI to everyone, let's get in touch!\n\n[Build](docs\u002Fpipeline_primer.md) Augmentoolkit is meant to be the go-to tool for people experimenting with training LLMs, whether they're a hobbyist or a professional. To that end, building new pipelines is as simple as writing Python functions (while adhering to about 2 mostly optional conventions). Efficient explainers and pipeline templates are provided for you to build your own dataset generation pipelines, and by extension, your own datasets and your own completely custom LLMs.\n\nAll configs are fully annotated with comments and placeholders to help you understand them as you fill them out.\n\n## Documentation Pages\n\n> [!NOTE]\n>\n> This documentation page (the main README) has an [important note](#note-for-when-you-start-training) about model training for facts that you should read regardless of your experience level.\n\n1. 
[Quickstart](docs\u002Fquickstart.md)\n    - [CLI](docs\u002Fquickstart.md#the-cli)\n    - [Interface](docs\u002Fquickstart.md#macos-interface)\n1. [Video Help](#video-tutorials) \u003C!-- Pages still todo -->\n    - [Generate Data and Train an LLM! (12 mins)](#train-a-model-on-your-own-data-in-13-minutes)\n    - [Detailed exploration (Interface)](#interface-deep-dive)\n    - [Detailed exploration (CLI)](#cli-and-code-structure-deep-dive)\n1. [Vision](docs\u002Fvision.md)\n    - [What is it? (Technical)](docs\u002Fvision.md#what-is-it-technical)\n    - [What is it? (General)](docs\u002Fvision.md#what-is-it-general)\n    - [Ideas Presented](docs\u002Fvision.md#ideas-presented-and-hypotheses)\n    - [Goals](docs\u002Fvision.md#goals)\n1. Longstart (customization and development guide links)\n    - [Pipelines Available by Default]()\n        - [Compositions]()\n            - [Complete Factual Generation](docs\u002Fcomplete_factual_datagen.md)\n            - [Meta Datagen](docs\u002Fmeta.md)\n        - [Pipelines]()\n            - [Multi-Source Recall Factual Datagen](docs\u002Fmulti_source_facts.md)\n            - [Single-Source Recall Factual Datagen](docs\u002Fsingle_source_recall.md)\n            - [Representation Variation](docs\u002Frepresentation_variation.md)\n            - [Traditional Classifier Bootstrapper](docs\u002Fclassifier_bootstrapper.md)\n            - [Generic Data Rephrase](docs\u002Fgeneric_data_rephrase.md)\n            - [GRPO (experimental)](docs\u002Fgrpo.md)\n            - [Correction Data (loss-masked mistakes)](docs\u002Fcorrections.md)\n            - [Rag Data (preparing for enhanced recall)](docs\u002Frag_data.md)\n            - [Debug (health check)](docs\u002Fdebug.md)\n            - [Starting Point (build your own pipeline!)](docs\u002Fexample.md)\n            - [RPToolkit](docs\u002Frptoolkit.md)\n        - [Utility]()\n            - [Basic LLM Server](docs\u002Fbasic_server.md)\n            - [RAG LLM 
Server](docs\u002Frag_server.md)\n            - **[Discord Hosting](docs\u002Fdiscord.md)**\n        - [Config Common Fields](docs\u002Fconfig_common_fields.md)\n        \u003C!-- - [Training an LLM Walkthrough]() -->\n    - [Understand the Tool In Detail]()\n        - [Project Structure](docs\u002Fproject_structure.md)\n        - [CLI]()\n            - [Flows](docs\u002FCLI_flows.md)\n                - [Upload Documents](docs\u002FCLI_flows.md#upload-documents)\n                - [Starting Runs](docs\u002FCLI_flows.md#starting-runs)\n                - [Observing Runs](docs\u002FCLI_flows.md#observing-results)\n                - [Getting Your Results Back](docs\u002FCLI_flows.md#getting-your-results-back)\n        - [Interface]()\n            - [Flows](docs\u002Finterface_flows.md)\n                - [Upload Documents](docs\u002Finterface_flows.md#upload-documents)\n                - [Starting Runs](docs\u002Finterface_flows.md#starting-runs)\n                - [Observing Runs](docs\u002Finterface_flows.md#observing-runs)\n                - [Getting Your Results Back](docs\u002Finterface_flows.md#getting-your-results-back)\n        - [Training with Axolotl Concepts](docs\u002Faxolotl_concepts.md)\n    - [Customize and Develop]()\n        - [New Pipeline Primer](docs\u002Fpipeline_primer.md)\n        - [Abstractions Primer](docs\u002Fabstractions_primer.md)\n        - [Conventions Commandments](docs\u002Fpipeline_conventions.md)\n        - [Reminder That Conventions Are Minimal and You Can Just Code and It Will Probably Work](docs\u002Fconventions_reminder.md)\n1. [Discord!](#discord)\n1. [Updates and Training\u002FDatagen Tips Blog! Stay in the loop!](#training-and-datagen-tips-blog)\n1. [Contributing!](#contributing)\n1. 
[Contact & Client Work](#contact)\n\n**If you're familiar with LLMs and want a more jargonful rundown of what Augmentoolkit is and what makes it cool, check out [this section](docs\u002Fvision.md)**\n\nCite:\n[![DOI](https:\u002F\u002Fzenodo.org\u002Fbadge\u002F726083337.svg)](https:\u002F\u002Fzenodo.org\u002Fdoi\u002F10.5281\u002Fzenodo.11525927)\n\n\n> [!NOTE]\n>\n> If you don't want to do model training, but just want to generate a dataset, turn `do_train` off in your dataset generation config.\n\n### Video Tutorials\n\n#### [Train a Model on your Own Data in 13 Minutes](https:\u002F\u002Fyoutu.be\u002FE9TyyZzIMyY)\n\n#### [Interface Deep Dive!](https:\u002F\u002Fyoutu.be\u002FM-OFVwHPfeU)\n\n#### [CLI and Code Structure Deep Dive!](https:\u002F\u002Fyoutu.be\u002FcEkgw7sYqMw)\n\n^ This one is useful if you're going to make modifications to the code\n\n### Benefits\n**Augmentoolkit makes LLM data easy.**\n- **Cheap:** Augmentoolkit pipelines use open-source LLMs, and so can be run on consumer hardware for hardly any cost, or cheaply via APIs like Deepinfra *(the \"local\" prompt sets should enable usage of most pipelines by reasoning models, too)*\n- **Effortless:** Any Augmentoolkit pipeline can be run through the graphical interface, which is launched with a single start script and is a first-class citizen in Augmentoolkit 3 (in fact, the recommended way to run Augmentoolkit). Alternatively, you can make data by putting some files in a folder and then running a Python script. Previously-started runs are continued automatically, so you don't need to worry about interruptions costing you time and\u002For money.\n- **Fast:** when using APIs, you can quickly generate millions of trainable tokens. Fully async code lets you get results quickly. Reading and chunking caches ensure that even large-scale workloads are quick to use. 
Models are automatically trained after the data is ready, and are even automatically downloaded and prepared for inference on your local machine. All the hard or annoying parts of the process have been automated and made efficient. In the past creating datasets and iterating and testing and learning could have taken a skilled person months; now, anyone can press a button, come back in a day, and chat with a newly-trained model.\n- **Innovative, Effective Approach to Factual Training:** Augmentoolkit has a production-tested method of creating domain-expert LLMs that can understand entirely new subjects. Many separate pipelines are composed together to produce quality datasets that teach capabilities such as answering factual questions, acknowledging when something is not known by the model, correcting mistakes, etc. You can be confident in getting high-quality specialist models when you use Augmentoolkit.\n\nWe've also done our best to **facilitate the step after you generate your data -- training your LLM:**\n- **Production-Scale:** Datasets that are gigabytes-large have been generated with Augmentoolkit -- it is battle-hardened, it works at scale without annoying inefficiencies costing immense time, and it is ready for the stresses of production.\n- **Train an AI for the cost of a dinner:** you can generate data on your own hardware for what is basically free. Augmentoolkit can then automatically perform a full finetune of an AI, on your own data, for a tiny sum of money (roughly $20 for the finetuning part of the process).\n- **Create your LLM in less than a day:** with a fully automated process for turning documents into datasets, and only a single button-click needed to kick off training, making a subject matter expert LLM is *fast* (especially when you use API for the dataset generation). 
Iterate quickly and cheaply.\n- **When you use the same recipe, you get the same bread:** Augmentoolkit datasets have been used successfully for professional consulting projects. Video documentation is linked in this README that shows exactly how to use this tool to do the same. The code, settings, and prompts you need are all here. Examples, templates, comments, marked-out placeholders, and extensive documentation are all available.\n- **Train AI with confidence, *especially* if it's your first time:** between the battle-tested process, extensive video docs, in-depth README, and Discord community, you can be confident you'll get a good LLM out of this.\n\n**Do it all locally**\nWith a custom-trained 7b model built to run these pipelines specifically, Augmentoolkit can generate data on consumer hardware, and can do so at incredible scale, with incredible parallelism, when on higher-performance computers. Budget does not need to be a constraint -- just passion and time. Of course, if you want immediate results\u002Fspeed, you can use an API too.\n\nFinally, **using the model you create should be easy and valuable:**\n- **AI that understands your facts:** For the professionals and the passionate: training an LLM with Augmentoolkit's Complete Factual Datagen \"composition\" pipeline creates an assistant that understands the big picture of the data you're training on. If RAG is like giving an LLM an open-book test on a textbook it hasn't read before, then training on Augmentoolkit data gives it some time to study before the test as well. This pipeline has been battle-tested in consulting projects across different industries. 
Compared to earlier versions of Augmentoolkit, Augmentoolkit's 3.0 version generates a wide variety of different domain data, and it even automatically balances this data with the generic data it uses.\n- **Individual Alignment:** Use GRPO (the same algorithm that made Deepseek R1 as good as it is) to align a model to any task imaginable without modifying any code. Augmentoolkit adopts an innovative approach of letting you use an LLM as a reward function -- you write a prompt that grades certain outputs higher, and then those reward scores teach the model to behave more like that in the future. Want your model to do a task better? Explain what \"better\" is and then the model will learn it. Want your model to be more emotional and human-like? Explain how to grade responses based on their emotional content, and the model will [learn it](https:\u002F\u002Fhuggingface.co\u002FHeralax\u002Fllama-gRPo-emotions-nothoughts). Want your model to write like a pirate? Explain in your grading prompt what makes a good pirate-like response, and the model will learn it. You can also change code and use traditional reward functions if you want to. The GRPO pipeline is experimental and in beta, but early results are promising.\n- **Make sense of massive data without using human annotators:** For the heavy-duty ML professionals: if you have a large dataset with tons of unlabelled text (like the Enron emails dataset, IMDB, or fineweb, etc.) you can now write a sentence or two that describes two classes which exist in that data. Augmentoolkit's classifier creator pipeline will then use an LLM to make a full classification dataset, based on a subset of the input data and your specified classes; it'll then train a classifier, evaluate it, take more data, and retrain, in a loop, until validation loss is below a specified threshold. Classifiers trained using this pipeline seem to achieve similar performance to classifiers trained on human-labelled data. 
Be advised that data is not yet automatically balanced between different labels.\n- **AI inspired by your favorite fiction:** For the creatives and entertainers: using RPToolkit, you can create detailed and varied multi-turn roleplaying data with the themes of any story you can think of. If you're creating custom AI for creative or entertainment purposes, you can now specialize it in any genre you want. Want a depressing and dark specialist in mecha stories? Feed in some stories and you can get a ton of data for that. How about an AI writer of wholesome slice of life? You can get data for that too. Create as broad or as narrow of a writing AI as you want from whatever inspiration you can find.\n\n*Clarification: Augmentoolkit, the project, has multiple pipelines: the original pipeline (QA), RPtoolkit (rich multiturn roleplaying data), and the classifier creator. If it is said that \"Augmentoolkit can make [some kind of data]\" then I mean that one of Augmentoolkit's pipelines can do so.*\n\n### NOTE For When You Start Training\n\nFactual finetuning requires a certain number of optimizer steps to stick. If training is where the LLM's brain \"moves\" towards a place where it understands your new domain, \"optimizer steps\" are the number of times the LLM moves. **If your dataset is small you may not have enough optimizer steps for the LLM to learn the new domain well.**\n\nBecause of this, ironically, it can be easier to teach LLMs large new domains rather than small ones, with training. However, there are tools at your disposal for turning a small dataset into a large one when you use Augmentoolkit.\n\nIn [complete factual dataset](docs\u002Fcomplete_factual_datagen.md) you have the `number_of_factual_sft_generations_to_do` setting for the whole pipeline, and the `variation_generation_counts` which you can customize per input dir. 
The one that is customized per dir makes the data from a specific input dir represented more in the continued pretraining data; the other setting increases the overall amount of SFT data made from all input dirs together. With these two levers you can make a small dataset as large as you need — though some of the data may be very similar, you can still scale it up in this way to teach it to an LLM without catastrophic drawbacks.\n\nAs a \"break glass in case of emergency\" option, if your dataset is exceptionally small, you may want to consider turning sample packing off. This can be done by modifying the pretrain and finetune kwargs to set sample packing off (do this in the complete factual datagen config).\n\n```yaml\nother_pretrain_kwargs: {sample_packing: False}\nother_finetune_kwargs: {sample_packing: False}\n```\n\nTurning off sample packing has not been tested with the current iteration of Augmentoolkit's settings yet, so success with that emergency approach cannot be guaranteed for extremely small datasets, but since the main problem with extremely small datasets is a lack of per-epoch optimizer steps causing the LLM to not learn the data enough, *theoretically*, this should work.\n\nMost of the configuration of Augmentoolkit you'll do, besides changing input\u002Foutput paths for different models, is probably going to be related to the optimizer step. With very large input datasets you'll want to reduce things that increase the optimizer step because otherwise you'll be training for a long time, whereas with very small ones you'll have to pull out tricks to increase it. 
That's why this has a section and is marked out as important -- be cognizant of the size of your dataset as you create it!\n\nIf you have any questions about your specific use case, consider heading over to the [Discord](https:\u002F\u002Fdiscord.gg\u002Fs6PBfsaVzu)\n\n### Temporary Announcement if You've Been Here Before\nIf you're returning to look at this repo after a while, I want to make a few things clear!\n\nFirstly, a *lot* has changed. I disappeared for six months; half of that time was spent on research and the other half on building. I wanted the tool to fundamentally work better. Now, Augmentoolkit reliably produces great domain experts across datasets of different sizes. It can even teach a model something that it has not seen at all during pretraining. This experimentation cost thousands of dollars out of pocket, but I believe it was worth it, since anyone can now make domain experts about arbitrary subjects with very little technical experience.\n\nSecondly, things are *much* easier to use. The interface is robust and is not a second-class citizen anymore. Start scripts, automatically generated and balanced training configs, better error messages, and a host of other improvements should make Augmentoolkit much nicer to use.\n\nNot much code from the original survived, though making older pipelines fit into the project as it is now is pretty simple ([see the new pipeline example and you'll get a picture of what they look like now](\u002Fdocs\u002Fexample.md)). Also, if you have custom prompts from before, they should work with the new pipelines without modification. The older pipeline compared to this one is like a rat compared to a human -- they're technically related and have a lot of the same DNA, but the human is much more evolved and more capable. I hope you enjoy using the new project and getting great results.\n\nThe bad news is that since so much has changed, some new bugs probably got introduced. Please report bugs so that they can be fixed. 
I had not been keeping too close an eye on the issues these past 4 months since the entire project was being ripped apart anyway -- now that it is in a more final form, and frankly now that I have better discipline with this sort of thing, I'll be focused on the Discord and GitHub issues to correct any mistakes that you point out. If any of the new documentation is unclear about parts of the project, please let me know. And, if you have [custom pipelines]() that you want to add, or bugfixes, please check out [Contributing]() and make a PR!\n\n### Discord\n\nCustom-built models are (usually) not meant to be enjoyed only by their creators. There's a [new feature in Augmentoolkit](docs\u002Fdiscord.md) where you can easily make your custom models into Discord bots! Now you can share your custom AI creations with your friends or community! Also, all the code runs on your own computer so no worry about recurring costs.\n\nSpeaking of Discord...\n\nAugmentoolkit is partly about democratizing dataset generation, so community is hugely important to the project! There's a Discord server where you can get help with custom model creation, as well as share new pipelines, prompt sets, or projects you're creating! [Come hang out and be part of a useful community of like-minded people!](https:\u002F\u002Fdiscord.gg\u002Fs6PBfsaVzu)\n\n### Training and Datagen Tips Blog\n\nI write about model training and data generation over on a [free substack](https:\u002F\u002Fpromptingweekly.substack.com\u002F)! 
If you want read access to my brain as I continue to experiment and explore with dataset generation, consider signing up to up your model creation game. If you're planning on building your own dataset generation pipelines using the tools and abstractions provided by Augmentoolkit, some of the advice there might be very useful.\n\nNow that the new Augmentoolkit version is out, I finally have time to post again (and new ideas to post as well).\n\n### Contributing!\n\nPRs for bugfixes, new pipelines, and improvements are welcome! If you have an experiment you're proud of, consider opening a PR. The rules are pretty standard:\n\n- Contributors can open PRs\n- Collaborators can push to branches and merge PRs into the master branch\n- Collaborators may either be chosen depending on contributions, or may be chosen internally within Augmentoolkit (the company)\n- [The example pipeline and its documentation](docs\u002Fexample.md) contain useful information for making your own pipeline. You are encouraged to fork Augmentoolkit and experiment!\n- Code with the style you want, just test thoroughly before making a PR\n    - Caveat: failing silently or continuing is worse than explicitly erroring if an impossible state is reached\n    - Asserts are your friend\n    - `black .` makes even MY code look nicely formatted; it can do the same for yours.\n\n### Useful Commands for Datagen and Training Workflows\n\nCopy-paste these when appropriate or use them as a reference.\n\nCopy files over to a different computer (such as a GPU instance on runpod):\n\n```\nscp -P [port] -r .\u002Foutputs\u002Fyour-output-dir\u002Fpretraining_run root@123.456.78.9:\u002Fworkspace\u002Faxolotl\n```\n\nKick off a training run:\n```\naccelerate launch -m axolotl.cli.train [your_config].yaml\n```\n\nConvert and quantize with llama.cpp (pass the directory of the model to convert):\n```\npython ~\u002Fllama.cpp\u002Fconvert_hf_to_gguf.py [model_dir] --outtype q8_0\n```\n\n## Contact!\n\n- Email me at: evanpeterarmstrong@gmail.com (NOTE: my inbox is flooded, your 
message might not get through, prefer booking a call for serious discussions)\n- [For serious and urgent discussions, we can schedule a call!](https:\u002F\u002Fcalendly.com\u002Fevanpeterarmstrong\u002F30min)\n- [I'm pretty active on the Augmentoolkit discord server and a bunch of other AI discords. Find me as @heralax!](https:\u002F\u002Fdiscord.gg\u002Fs6PBfsaVzu)\n- [I sometimes post stuff on X\u002FTwitter](https:\u002F\u002Ftwitter.com\u002Fe_p_armstrong)\n- [Substack! I am finally posting again.](https:\u002F\u002Fpromptingweekly.substack.com\u002F)\n- [YouTube -- The source of the help videos](https:\u002F\u002Fwww.youtube.com\u002F@Heralax)\n- [Let's connect on LinkedIn!](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fevan-armstrong-1a84b3200\u002F)\n\nIf you have a company or organization that wants to serve custom domain-expert AI to internal users (to get employees the information they need to do their jobs great) or to external users (for instance, to answer community questions or increase product awareness) then we should [get in touch](https:\u002F\u002Fcalendly.com\u002Fevanpeterarmstrong\u002Fdiscovery-call).\n\nAlso, if you have an **AI Chat Wrapper Startup or Company** that is currently being extorted by OpenAI, [we should also talk](https:\u002F\u002Fcalendly.com\u002Fevanpeterarmstrong\u002Fdiscovery-call), because I'm fairly certain **I can save you a ton on API costs while maintaining or improving answer quality**. Not only will I use this tool to produce a quality model, but I also have the means to run it at scale, which is not a trivial problem to solve.\n\nThe Augmentoolkit project is going to continue to be developed. It has been consistently developed for a long time -- the many-months gap from this major update to the previous one is because I was busy researching and developing the techniques, and later, preparing this release itself (this update has been in the works a long time). 
I am open to and would appreciate organizations sponsoring this open-source project -- I'd like nothing more than to research and build the tools for creating custom LLMs all day!\n\nI am also working on ambitious commercial solutions involving Augmentoolkit and Augmentoolkit tech. This project is part of a larger master-plan. If you're an investor, I'm very open to discussions here! My [Calendly](https:\u002F\u002Fcalendly.com\u002Fevanpeterarmstrong\u002F30min) is always open.\n\nCurrent Generalized AI means hallucinations, platitudes, and sycophancy; domain-expert AI writes with your knowledge and understanding, and is aligned to **your** tastes. As more people use the same AI, more of the world sounds the same and thinks the same--it is my belief that individually-customized LLMs are the only way to avoid this world of slop.\n","# Augmentoolkit - 领域专家 AI 的数据\nAugmentoolkit 创建领域专家数据集，用于更新 AI 的大脑（本质上是刷新其知识截止日期），从而使 AI 成为您选择领域的专家。\n\n您上传文档，点击按钮。即可获得一个完全训练好的自定义大语言模型 (LLM)。现在，AI 行为的每一个方面及其理解能力都在您的掌控之中。更棒的是，Augmentoolkit **可选择在您的计算机上离线工作** —— 在大多数硬件上进行数据生成 (datagen) 无需外部 API 密钥*†。\n\n也许您希望 AI 了解您所在领域的最新研究论文，或者您可能想要一个深刻理解您的热情、并从与您相同的来源学习的 LLM。也有可能，您梦想为您喜爱的冷门虚构宇宙创建一个背景设定专家。无论应用场景如何，Augmentoolkit 都能让您利用文本，使 LLM 的大脑内在地学习其中包含的信息。此外，如果您也需要一些传统的事实依据，它还可以自动创建一个就绪的检索增强生成 (RAG) 数据集（并可以启动一个推理服务器）。\n\n立即开始（界面将引导您生成第一个数据集）：\n\n（创建虚拟环境时请务必使用 Python 3.11 以确保正常运行）\n\n### MacOS (界面)\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit.git\ncd augmentoolkit # Python == 3.11\nbash macos.sh # NOTE: Will attempt to install valkey via brew if not found.\n# bash local_macos.sh # use this command if doing local dataset generation\n```\n\n### Linux (界面)\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit.git\ncd augmentoolkit # Python == 3.11\nbash linux.sh # NOTE: will build Valkey from source if a Redis\u002FValkey server is not running\n```\n\n**或者用于本地推理**\n```bash\ngit clone 
https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit.git\ncd augmentoolkit # Python == 3.11\nbash local_linux.sh normal # or you can write \"small\" or a custom model name to serve the quantized version (for more consumer hardware) or a model of your choice, respectively. See the quickstart page linked just a bit farther down for a full reference here.\n```\n\n如果您有多个 GPU，请使用 `--tensor-parallelism N` 参数运行 `local_linux.sh`。N == 您拥有的 GPU 数量（偶数）。所以：1, 2, 4, 8... 等等。\n\n> [!IMPORTANT]\n>\n> 请给仓库点个 Star。\n\n### Windows (界面)\n> [!NOTE]\n>\n> 如果您使用的是 Windows，最好的选择是使用 [WSL](https:\u002F\u002Flearn.microsoft.com\u002Fen-us\u002Fwindows\u002Fwsl\u002Finstall)。[老实说，CLI 在 Windows 上更容易运行。](docs\u002Fquickstart.md#windows-cli)\n\n\u003Csub>*注意，数据生成在很多硬件上可能需要一段时间，例如不要指望在旧款 Mac 上能实现快速数据生成。对于训练，您需要要么拥有自己的强大机器，要么租用（如果您选择后者，系统会自动为您完成）。\u003C\u002Fsub>\n\n\u003Csub>†如果您希望数据生成更快，您*可以*使用开源 LLM API，且快速入门指南也鼓励您这样做。除了其自定义数据集生成模型外，Augmentoolkit 还针对 Deepseek 或 Llama 等开源 LLM 进行了优化。\u003C\u002Fsub>\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fe-p-armstrong_augmentoolkit_readme_9f2fe32fdff0.png)\n\nAugmentoolkit 现已更新至 3.0 版本，经过了一年多的专业应用和实验的打磨与改进。它是目前世界上创建领域专家 LLM 的最佳方式，并且采用 MIT 许可证。\n\n如果您使用了这个项目并喜欢它，请考虑给仓库点个 Star！它的设计也非常易于定制，因此请考虑 **Fork（分支）** Augmentoolkit！\n\n> [!IMPORTANT]\n>\n> 下面的链接包含非常有用的信息。稍后下方有一个目录，链接到项目中任何可想象部分的详细文档页面。\n\n[帮助视频](#video-tutorials) 我将从零开始演示如何完成本项目中的所有酷炫操作，包括使用您获得的数据和配置训练 LLM（耗时 10 分钟）。如果您需要更多指导，请查看帮助视频！\n\n[社区](https:\u002F\u002Fdiscord.gg\u002Fs6PBfsaVzu) 如果您有问题，如果您正在训练模型，如果您正在基于 Augmentoolkit 的代码构建新的酷炫流水线或扩展，或者只是想聊聊，我很高兴在 Augmentoolkit Discord 上看到您！这也是联系我的好地方。\n\n[通讯](https:\u002F\u002Fpromptingweekly.substack.com\u002F) 我在 Substack 上撰写关于模型训练和数据生成的文章。完全免费，我只是想帮助大家更好地使用该工具。\n\n[联系](#contact) 我正在围绕这个项目做各种事情，如果您对将定制、个人对齐的 AI 带给每个人的使命和业务感兴趣，请联系我们！\n\n[构建](docs\u002Fpipeline_primer.md) Augmentoolkit 旨在成为人们实验训练 LLM 的首选工具，无论是爱好者还是专业人士。为此，构建新流水线就像编写 Python 函数一样简单（同时遵守大约 2 个主要是可选的约定）。我们为您提供高效的解释器和流水线模板，以便您构建自己的数据集生成流水线，进而构建您自己的数据集和您自己完全定制的 
LLM。\n\n所有配置文件都带有完整的注释和占位符，以帮助您在使用时理解它们。\n\n## 文档页面\n\n> [!NOTE]\n>\n> 请注意，此文档页面（主 README）包含关于事实模型训练的 [重要说明](#note-for-when-you-start-training)，无论您的经验水平如何都应阅读。\n\n1. [快速开始](docs\u002Fquickstart.md)\n    - [命令行界面 (CLI)](docs\u002Fquickstart.md#the-cli)\n    - [界面](docs\u002Fquickstart.md#macos-interface)\n1. [视频帮助](#video-tutorials) \u003C!-- Pages still todo -->\n    - [使用自己的数据训练大语言模型 (LLM)！(12 分钟)](#train-a-model-on-your-own-data-in-13-minutes)\n    - [详细探索 (界面)](#interface-deep-dive)\n    - [详细探索 (CLI)](#cli-and-code-structure-deep-dive)\n1. [愿景](docs\u002Fvision.md)\n    - [它是什么？(技术)](docs\u002Fvision.md#what-is-it-technical)\n    - [它是什么？(通用)](docs\u002Fvision.md#what-is-it-general)\n    - [提出的想法](docs\u002Fvision.md#ideas-presented-and-hypotheses)\n    - [目标](docs\u002Fvision.md#goals)\n1. 深度指南 (自定义和开发指南链接)\n    - [默认可用的流水线]()\n        - [组合]()\n            - [完整事实生成](docs\u002Fcomplete_factual_datagen.md)\n            - [元数据生成 (Datagen)](docs\u002Fmeta.md)\n        - [流水线]()\n            - [多源回忆事实数据生成](docs\u002Fmulti_source_facts.md)\n            - [单源回忆事实数据生成](docs\u002Fsingle_source_recall.md)\n            - [表示变异](docs\u002Frepresentation_variation.md)\n            - [传统分类器引导程序 (Bootstrapper)](docs\u002Fclassifier_bootstrapper.md)\n            - [通用数据改写](docs\u002Fgeneric_data_rephrase.md)\n            - [GRPO (实验性)](docs\u002Fgrpo.md)\n            - [修正数据（损失掩码错误）](docs\u002Fcorrections.md)\n            - [RAG 数据（为增强回忆做准备）](docs\u002Frag_data.md)\n            - [调试（健康检查）](docs\u002Fdebug.md)\n            - [起点（构建你自己的流水线！）](docs\u002Fexample.md)\n            - [RPToolkit](docs\u002Frptoolkit.md)\n        - [工具]()\n            - [基础 LLM 服务器](docs\u002Fbasic_server.md)\n            - [RAG LLM 服务器](docs\u002Frag_server.md)\n            - **[Discord 托管](docs\u002Fdiscord.md)**\n        - [配置 (Config) 通用字段](docs\u002Fconfig_common_fields.md)\n        \u003C!-- - [Training an LLM Walkthrough]() -->\n    - [详细理解该工具]()\n        - 
[项目结构](docs\u002Fproject_structure.md)\n        - [命令行界面 (CLI)]()\n            - [流程](docs\u002FCLI_flows.md)\n                - [上传文档](docs\u002FCLI_flows.md#upload-documents)\n                - [启动运行](docs\u002FCLI_flows.md#starting-runs)\n                - [观察运行](docs\u002FCLI_flows.md#observing-results)\n                - [获取结果](docs\u002FCLI_flows.md#getting-your-results-back)\n        - [界面]()\n            - [流程](docs\u002Finterface_flows.md)\n                - [上传文档](docs\u002Finterface_flows.md#upload-documents)\n                - [启动运行](docs\u002Finterface_flows.md#starting-runs)\n                - [观察运行](docs\u002Finterface_flows.md#observing-runs)\n                - [获取结果](docs\u002Finterface_flows.md#getting-your-results-back)\n        - [使用 Axolotl 概念进行训练](docs\u002Faxolotl_concepts.md)\n    - [自定义与开发]()\n        - [新流水线入门](docs\u002Fpipeline_primer.md)\n        - [抽象入门](docs\u002Fabstractions_primer.md)\n        - [约定准则](docs\u002Fpipeline_conventions.md)\n        - [提醒：约定是极简的，您可以直接编码，它很可能能工作](docs\u002Fconventions_reminder.md)\n1. [Discord！](#discord)\n1. [更新和训练\u002F数据生成 (Datagen) 技巧博客！保持关注！](#training-and-datagen-tips-blog)\n1. [贡献！](#contributing)\n1. 
[联系与客户项目](#contact)\n\n**如果您熟悉大语言模型 (LLM) 并希望了解 Augmentoolkit 是什么以及它为何出色的更技术性概述，请查看 [本节](docs\u002Fvision.md)**\n\n引用：\n[![DOI](https:\u002F\u002Fzenodo.org\u002Fbadge\u002F726083337.svg)](https:\u002F\u002Fzenodo.org\u002Fdoi\u002F10.5281\u002Fzenodo.11525927)\n\n\n> [!NOTE]\n>\n> 如果您不想进行模型训练，而只想生成数据集，请在您的数据生成配置 (Config) 中关闭 `do_train`。\n\n### 视频教程\n\n#### [13 分钟内使用自己的数据训练模型](https:\u002F\u002Fyoutu.be\u002FE9TyyZzIMyY)\n\n#### [界面深度探索！](https:\u002F\u002Fyoutu.be\u002FM-OFVwHPfeU)\n\n#### [CLI 和代码结构深度探索！](https:\u002F\u002Fyoutu.be\u002FcEkgw7sYqMw)\n\n^ 如果您打算修改代码，这个很有用\n\n### 优势\n**Augmentoolkit 让 LLM 数据处理变得简单。**\n- **廉价：** Augmentoolkit 流水线使用开源 LLM，因此可以在消费级硬件上以极低的成本运行，或者通过 Deepinfra 等应用程序接口 (API) 廉价运行 *(“本地”提示词集也应使推理模型能够使用大多数流水线)*\n- **轻松：** 任何 Augmentoolkit 流水线都可以通过直观的界面运行，只需运行启动脚本即可。或者，您可以将一些文件放入文件夹，然后运行 Python 脚本来生成数据。如果这还不够，您还可以使用图形用户界面 (GUI)，现在是 Augmentoolkit 3 中的一等公民（事实上，也是运行 Augmentoolkit 的推荐方式）。之前启动的运行会自动继续，因此您无需担心中断会耗费您的时间和\u002F或金钱。\n- **快速：** 使用 API 时，您可以快速生成数百万个可训练令牌 (Tokens)。完全异步 (Async) 的代码让您能快速获得结果。读取和分块 (Chunking) 缓存确保即使是大规模工作负载也能快速使用。数据准备好后，模型会自动训练，甚至会自动下载并准备在您的本地机器上进行推理 (Inference)。过程中所有困难或烦人的部分都已自动化并高效化。过去创建数据集、迭代、测试和学习可能需要熟练人员数月；现在，任何人都可以按下一个按钮，一天后回来，与新训练的模型聊天。\n- **事实性训练的创新有效方法：** Augmentoolkit 拥有一种经过生产验证的方法，用于创建能够理解全新主题的领域专家 LLM。许多独立的流水线组合在一起，产生高质量的数据集，教授诸如回答事实性问题、承认模型未知内容、纠正错误等能力。使用 Augmentoolkit 时，您可以确信能获得高质量的专家模型。\n\n我们已竭尽全力**简化生成数据后的步骤——训练你的大语言模型 (LLM)：**\n- **生产级规模：** 使用 Augmentoolkit 已经生成了高达数 GB 的数据集——它经过了实战检验，能够大规模运行而不会因恼人的低效而浪费大量时间，并且能够承受生产环境的压力。\n- **用吃顿晚饭的代价训练一个 AI：** 你可以基本免费地用自己的硬件生成数据。然后 Augmentoolkit 可以自动使用你自己的数据对 AI 进行全量微调，费用极低（微调部分的费用大约只需 20 美元）。\n- **在一天内创建你的 LLM：** 通过全自动化的流程将文档转化为数据集，并且只需点击一下按钮即可启动训练，构建领域专家级 LLM 的速度非常快（特别是当你使用 API 生成数据集时）。实现快速且廉价的迭代。\n- **使用相同的配方，你总会得到同样的面包：** Augmentoolkit 的数据集已成功用于专业咨询项目。本 README 中链接的视频文档详细展示了如何完全一样地使用本工具。你所需要的代码、设置和提示词（prompts）就在这里。示例、模板、注释、标记出的占位符以及详尽的说明文档一应俱全。\n- **放心大胆地训练 AI，尤其是如果你是第一次尝试：** 凭借经过实战检验的流程、丰富的视频文档、深入的 README 以及 Discord 社区的支持，你可以确信能从中获得一个优质的 
LLM。\n\n**全部在本地完成**\n借助专门为此类流水线构建的定制训练的 7b 模型，Augmentoolkit 可以在消费级硬件上生成数据，并且在高性能计算机上可以实现惊人的规模和并行度。预算不应成为限制条件——只需要热情和时间的投入。当然，如果你需要即时的结果\u002F速度，也可以使用 API（应用程序编程接口）。\n\n最后，**使用你创建的模型应该既简单又有价值：**\n- **理解你事实数据的 AI：** 面向专业人士和爱好者：使用 Augmentoolkit 的 Complete Factual Datagen“组合”流水线训练 LLM，可以创建一个理解你所训练数据宏观图景的助手。如果说 RAG（检索增强生成）就像是给 LLM 一场关于它从未读过的教科书的开卷考试，那么在 Augmentoolkit 数据上进行训练则像是给了它在考试前一些复习时间。该流水线已在不同行业的咨询项目中经过实战检验。与早期版本相比，Augmentoolkit 3.0 版本生成了各种各样的不同领域数据，并且会自动将这些数据与其使用的通用数据进行平衡。\n- **个体对齐 (Individual Alignment)：** 使用 GRPO（使 Deepseek R1 如此出色的同一算法）来调整模型以适应任何想象得到的任务，而无需修改任何代码。Augmentoolkit 采用了一种创新的方法，允许你将 LLM 用作奖励函数（reward function）——你编写一个提示词来给某些输出打更高的分数，然后这些奖励分数会教导模型在未来更多地表现出类似的行为。希望你的模型更好地执行任务吗？解释一下什么是“更好”，然后模型就会学会它。希望你的模型更富有人情味和情感吗？解释一下如何根据情感内容给回复打分，模型就会 [学会它](https:\u002F\u002Fhuggingface.co\u002FHeralax\u002Fllama-gRPo-emotions-nothoughts)。希望你的模型写起东西像海盗一样吗？在你的评分提示词中解释什么样的回复是好的海盗风格回复，模型就会学会它。你也可以根据需要修改代码并使用传统的奖励函数。GRPO 流水线目前处于实验阶段并处于测试版（beta），但初步结果令人鼓舞。\n- **在不使用人工标注的情况下理解海量数据：** 面向重度机器学习（ML）专业人士：如果你有一个包含大量未标注文本的大型数据集（如 Enron 邮件数据集、IMDb 或 fineweb 等），你现在可以写一两句话来描述其中存在的两个类别。Augmentoolkit 的分类器创建流水线随后将使用 LLM 基于输入数据的子集和你指定的类别来制作完整的分类数据集；然后它会训练一个分类器并评估它，再获取更多数据并重新训练，如此循环，直到验证损失（validation loss）低于指定阈值。使用此流水线训练出的分类器似乎能达到与在人类标注数据上训练出的分类器相似的性能。请注意，目前数据在不同标签之间尚未实现自动平衡。\n- **受你喜爱的小说启发的 AI：** 面向创意工作者和娱乐从业者：使用 RPToolkit，你可以围绕你能想到的任何故事主题，创建详细多样的多轮角色扮演数据。如果你是为了创作或娱乐目的创建自定义 AI，现在你可以将其专业化到任何你想要的类型。想要一个专攻机甲故事的阴郁黑暗专家吗？输入一些故事，你就可以获得大量相关数据。那来一个温馨日常生活的 AI 作家呢？你也可以为这个获取数据。从你能找到的任何灵感出发，创建范围可宽可窄的写作 AI。\n\n*澄清：Augmentoolkit 项目包含多个流水线：原始流水线（QA）、RPToolkit（丰富的多轮角色扮演数据）和分类器创建器。如果提到\"Augmentoolkit 可以生成 [某种数据]\"，我的意思是 Augmentoolkit 的某个流水线可以做到这一点。*\n\n### 开始训练时的注意事项\n\n事实性微调 (Factual finetuning) 需要一定数量的优化器步数 (optimizer steps) 才能稳固生效。如果将训练视为大语言模型 (LLM) 的“大脑”向理解你新领域的方向“移动”的过程，那么“优化器步数”就是 LLM 移动的次数。**如果你的数据集很小，可能没有足够的优化器步数让 LLM 很好地学习新领域。**\n\n因此，讽刺的是，通过训练，教 LLM 大型新领域可能比小型领域更容易。不过，当你使用 Augmentoolkit 时，有一些工具可以将小数据集转化为大数据集。\n\n在 [完整事实数据集](docs\u002Fcomplete_factual_datagen.md) 中，你有针对整个流程的 
`number_of_factual_sft_generations_to_do` 设置，以及可以按输入目录自定义的 `variation_generation_counts`。按目录自定义的设置会让特定输入目录的数据在继续预训练数据中占比更多；另一个设置则增加所有输入目录共同生成的监督微调 (SFT) 数据的总量。利用这两个杠杆，你可以将小数据集扩大到所需的大小——尽管部分数据可能非常相似，但你仍然可以通过这种方式将其放大并教给 LLM，而不会造成灾难性的后果。\n\n作为一个“紧急情况下打破玻璃”的选项，如果你的数据集特别小，你可能需要考虑关闭样本打包 (sample packing)。这可以通过修改预训练和微调的关键字参数 (kwargs) 来实现（请在完整事实数据生成配置中进行此操作）：\n\n```yaml\nother_pretrain_kwargs: {sample_packing: False}\nother_finetune_kwargs: {sample_packing: False}\n```\n\n关闭样本打包尚未在当前版本的 Augmentoolkit 设置中进行测试，因此对于极小的数据集，这种紧急方法的成功无法保证，但由于极小数据集的主要问题在于每个轮次 (epoch) 的优化器步数不足导致 LLM 未能充分学习数据，*理论上*，这应该有效。\n\n除了为不同模型更改输入\u002F输出路径外，你对 Augmentoolkit 进行的大部分配置可能都与优化器步数有关。对于非常大的输入数据集，你希望减少那些会增加优化器步数的因素，否则训练时间会很长；而对于非常小的数据集，你则需要使出浑身解数来增加它。这就是为什么这里有专门的部分并将其标记为重要——在创建数据集时要意识到其大小！\n\n如果你对你的具体用例有任何疑问，请考虑前往 [Discord](https:\u002F\u002Fdiscord.gg\u002Fs6PBfsaVzu)。\n\n### 如果您之前来过这里的临时公告\n如果你过了一段时间回来查看这个仓库，我想澄清几件事！\n\n首先，发生了*很多*变化。我消失了六个月，一半时间花在研究上，另一半时间花在构建上。我希望这个工具能在根本上运作得更好。现在，Augmentoolkit 能够可靠地在不同大小的数据集上生成优秀的领域专家。它甚至可以教会模型一些它在预训练期间完全未见过的内容。这次实验自费花费了数千美元，但我认为这是值得的，因为现在任何人都可以用很少的技术经验制作关于任意主题的领域专家。\n\n其次，事情变得*容易*多了。接口很健壮，不再是二等公民了。启动脚本、自动生成且平衡的训练配置、更好的错误消息以及其他众多改进应该会让 Augmentoolkit 更易用。\n\n原始代码留存不多，但让旧管道 (pipeline) 适应现在的项目非常简单（[查看新的管道示例，你就会了解它们现在的样子](\u002Fdocs\u002Fexample.md)）。此外，如果你之前有自定义提示词 (prompts)，它们应该可以在新管道中无需修改即可工作。旧的管道与这一个相比就像老鼠与人一样——它们在技术上是相关的，有很多相同的 DNA，但人类进化得更完善，能力更强。希望你享受使用新项目并获得很好的结果。\n\n坏消息是，由于变化如此之大，可能引入了一些新 Bug。请报告 Bug 以便修复。过去 4 个月我没有太密切关注问题，反正整个项目都在被拆解重组——现在它的形态已经更接近最终版本，坦白说，既然我现在对这类事情有更好的自律，我将专注于 Discord 和 GitHub 问题，纠正你们指出的任何错误。如果新文档对项目某些部分的说明不清楚，请告诉我。另外，如果你有想添加的 [自定义管道]() 或 Bug 修复，请查看 [贡献指南]() 并提交拉取请求 (PR)！\n \u003C!-- 前一版本留存不多，这就像现代猿观察老鼠——技术上相关，大部分 DNA 相同，但猿类进化得更完善且更有能力。此外，由于一切皆不同且项目更大，一些累积的修复可能会消失。请报告 Bug 以便修复。 -->\n\n### Discord\n\n定制构建的模型（通常）不仅仅供创建者自己享用。Augmentoolkit 有一个 [新功能](docs\u002Fdiscord.md)，你可以轻松地将自定义模型变成 Discord 机器人！现在你可以与朋友或社区分享你的自定义 AI 创作！此外，所有代码都在你自己的电脑上运行，所以不用担心持续费用。\n\n说到 Discord...\n\nAugmentoolkit 部分是为了普及数据集生成，因此社区对项目至关重要！有一个 Discord 
服务器，你可以在那里获得自定义模型创建的帮助，也可以分享新的管道、提示词集或你正在创建的项目！[来逛逛，成为这个志同道合、乐于助人的社区的一员吧！](https:\u002F\u002Fdiscord.gg\u002Fs6PBfsaVzu)\n\n### 训练和数据生成技巧博客\n\n我在一个 [免费的 Substack (博客平台)](https:\u002F\u002Fpromptingweekly.substack.com\u002F) 上撰写关于模型训练和数据生成的文章！如果你想在我继续实验和探索数据集生成时阅读我的思路，考虑订阅以提升你的模型创建水平。如果你计划使用 Augmentoolkit 提供的工具和抽象来构建自己的数据集生成管道，那里的一些建议可能会非常有用。\n\n现在新版本 Augmentoolkit 已经发布，我终于有时间再次发帖了（也有新的想法要分享）。\n\n### 贡献！\n\n欢迎提交用于修复错误（bugfixes）、新流水线（pipelines）和改进的 PR（Pull Request，拉取请求）！如果您有引以为傲的实验，请考虑提交一个 PR。规则相当标准：\n\n- 贡献者可以提交 PR\n- 协作者可以向分支推送代码并将 PR 合并到主分支（master branch）\n- 协作者可能根据贡献被选中，也可能由 Augmentoolkit（公司）内部选定\n- [示例流水线及其文档](docs\u002Fexample.md) 包含制作您自己流水线的有用信息。鼓励您 Fork（复刻）Augmentoolkit 并进行实验！\n- 按您喜欢的风格编写代码，只需在提交 PR 前充分测试即可\n    - 注意：如果达到不可能状态，静默失败或继续运行比明确报错更糟糕\n    - 断言（Asserts）是您的朋友\n    - `black .` 甚至能让我的代码看起来格式美观，它也能对您的代码做到同样效果。\n\n### 数据生成 (Datagen) 和训练工作流的有用命令\n\n在适当时复制粘贴这些命令，或将其作为参考。\n\n将文件复制到另一台计算机（例如 RunPod 上的 GPU 实例）：\n\n```\nscp -P [port] -r .\u002Foutputs\u002Fyour-output-dir\u002Fpretraining_run root@123.456.78.9:\u002Fworkspace\u002Faxolotl\n```\n\n启动训练任务：\n```\naccelerate launch -m axolotl.cli.train [your_config].yaml\n```\n\n使用 llama.cpp 进行转换和量化 (quantize)（该脚本需将模型目录作为第一个参数）：\n```\npython ~\u002Fllama.cpp\u002Fconvert_hf_to_gguf.py [model_dir] --outtype q8_0\n```\n\n## 联系方式！\n\n- 通过电子邮件联系我：evanpeterarmstrong@gmail.com（注意：我的收件箱消息泛滥，您的消息可能无法送达；严肃的讨论建议预约通话）\n- [对于严肃且紧急的讨论，我们可以安排通话！](https:\u002F\u002Fcalendly.com\u002Fevanpeterarmstrong\u002F30min)\n- [我在 Augmentoolkit Discord 服务器以及许多其他 AI Discord 上非常活跃。搜索 @heralax 找到我！](https:\u002F\u002Fdiscord.gg\u002Fs6PBfsaVzu)\n- [我有时会在 X\u002FTwitter 上发布内容](https:\u002F\u002Ftwitter.com\u002Fe_p_armstrong)\n- [Substack！我终于又开始发文了。](https:\u002F\u002Fpromptingweekly.substack.com\u002F)\n- [YouTube —— 帮助视频的来源](https:\u002F\u002Fwww.youtube.com\u002F@Heralax)\n- [让我们在 LinkedIn 上建立联系！](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fevan-armstrong-1a84b3200\u002F)\n\n如果您所在的公司或组织希望为内部用户（让员工获得做好工作所需的信息）或外部用户（例如回答社区问题或提高产品知名度）提供定制的领域专家 
AI，那么我们应该[取得联系](https:\u002F\u002Fcalendly.com\u002Fevanpeterarmstrong\u002Fdiscovery-call)。\n\n此外，如果您是一家目前正受到 OpenAI 勒索的**AI 聊天封装 (Chat Wrapper) 初创公司或企业**，[我们也应该谈谈](https:\u002F\u002Fcalendly.com\u002Fevanpeterarmstrong\u002Fdiscovery-call)，因为我相当确定**我可以在保持或提高回答质量的同时为您节省大量 API 成本**。我不仅会使用此工具生产高质量模型，还拥有大规模运行的手段，这并非一个容易解决的问题。\n\nAugmentoolkit 项目将继续开发。它已经持续开发了很长时间——从上一次更新到这次重大更新的数月间隔是因为我正忙于研究和开发技术，随后准备此次发布本身（这次更新已经筹备了很久）。我愿意并感谢组织赞助这个开源项目——我最想做的就是整天研究并构建创建定制 LLM（大语言模型）的工具！\n\n我也正在致力于涉及 Augmentoolkit 及其技术的雄心勃勃的商业解决方案。该项目是一个更大总体规划的一部分。如果您是投资者，我非常乐意就此进行讨论！我的 [Calendly](https:\u002F\u002Fcalendly.com\u002Fevanpeterarmstrong\u002F30min) 随时开放。\n\n当前的通用 AI (Generalized AI) 意味着幻觉、陈词滥调和阿谀奉承；领域专家 AI 则运用您的知识、理解力，并与**您的**品味保持一致。随着越来越多的人使用相同的 AI，世界听起来越来越相似，思考方式也越来越一致——我相信，个性化定制的 LLM 是避免这种充满垃圾内容 (slop) 世界的唯一途径。","# Augmentoolkit 快速上手指南\n\nAugmentoolkit 是一个用于创建领域专家级数据集的工具，可帮助你基于自定义文档训练专属的 LLM。它支持离线运行，无需外部 API 密钥即可在大多数硬件上生成数据，并能自动构建 RAG 就绪的数据集。\n\n## 1. 环境准备\n\n在开始之前，请确保你的开发环境满足以下要求：\n\n- **操作系统**：macOS、Linux 或 Windows（推荐通过 WSL 运行）。\n- **Python 版本**：**必须使用 Python 3.11**。请在创建虚拟环境时特别注意此版本要求，以确保兼容性。\n- **依赖管理**：工具会自动处理部分依赖（如通过 brew 安装 valkey），但需要系统已安装 `git`。\n- **网络环境**：虽然支持离线工作流，但初始克隆仓库和首次依赖拉取需要网络连接。\n\n> **注意**：Windows 用户强烈建议使用 [WSL](https:\u002F\u002Flearn.microsoft.com\u002Fen-us\u002Fwindows\u002Fwsl\u002Finstall) 以获得最佳体验。\n\n## 2. 安装步骤\n\n请根据你的操作系统选择对应的启动脚本。所有操作均需在终端中执行。\n\n### 通用克隆命令\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit.git\ncd augmentoolkit\n```\n\n### 按系统执行启动脚本\n\n#### macOS\n```bash\nbash macos.sh\n```\n> 如果本地生成数据集，可使用：`bash local_macos.sh`\n\n#### Linux\n```bash\nbash linux.sh\n```\n> 如果没有运行 Redis\u002FValkey 服务，脚本将尝试从源码构建 Valkey。\n\n#### 本地推理 (Local Inference)\n如果你希望直接进行本地推理，可以使用以下命令（支持指定模型大小）：\n```bash\nbash local_linux.sh normal\n```\n*注：可以将 `normal` 替换为 `small` 以使用量化版本，或填入其他自定义模型名称。*\n\n#### 多 GPU 支持\n如果你拥有多个 GPU，可以在运行 `local_linux.sh` 时添加 `--tensor-parallelism N` 参数（N 为 GPU 数量，需为偶数，如 2, 4, 8 等）。\n\n## 3. 
基本使用\n\nAugmentoolkit 设计有直观的界面来引导用户完成首个数据集的生成。\n\n1.  **启动界面**：运行上述对应系统的 `.sh` 脚本后，工具将启动配置好的环境。\n2.  **上传文档**：通过提供的界面上传你的领域文档。\n3.  **生成数据与训练**：\n    - 点击按钮即可开始数据处理。\n    - 系统会自动创建 RAG 就绪的数据集。\n    - 如果需要训练模型，确保在数据生成配置中开启训练选项（训练由配置中的 `do_train` 开关控制，设为 `true` 即开启）。\n4.  **结果获取**：\n    - 处理完成后，你可以获得一个经过微调的 LLM 或高质量数据集。\n    - 工具甚至可以直接启动推理服务器供你测试。\n\n> **提示**：如果你只想要生成数据集而不想训练模型，请在数据生成配置中将 `do_train` 设置为 `false`（即关闭）。","某医疗科研团队希望构建一个能深度理解特定罕见病领域文献与内部实验数据的 AI 助手，用于加速新药研发讨论与文献综述。\n\n### 没有 augmentoolkit 时\n- 通用大模型缺乏垂直领域知识，对专业医学术语理解偏差大，容易产生事实性幻觉。\n- 依赖外部 API 调用，敏感患者数据和未公开实验细节面临隐私泄露与合规隐患。\n- 传统 RAG 方案检索延迟高，且难以将新知识真正融入模型参数，导致回答生硬。\n- 每次更新知识库需重新清洗数据并调整向量库，IT 运维负担沉重且响应慢。\n\n### 使用 augmentoolkit 后\n- 直接上传 PDF 论文与实验报告，一键生成领域专家级定制模型，术语理解精准且逻辑连贯。\n- 支持本地离线运行数据生成与推理，无需外部 API Key，彻底保障核心数据不出内网。\n- 自动创建 RAG 就绪数据集，既保留传统检索能力，又实现关键知识内化于模型权重。\n- 新增文档即可快速迭代模型认知，无需重复繁琐的工程配置流程，大幅缩短上线周期。\n\naugmentoolkit 让科研人员能够以最低成本、最高安全性打造完全受控的私有化领域 AI 专家。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fe-p-armstrong_augmentoolkit_9f2fe32f.png","e-p-armstrong","Evan Armstrong","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fe-p-armstrong_b62bf3d5.png","Author and Super Hacka",null,"https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit","https:\u002F\u002Fgithub.com\u002Fe-p-armstrong",[83,87,91,95,99],{"name":84,"color":85,"percentage":86},"Python","#3572A5",72.2,{"name":88,"color":89,"percentage":90},"JavaScript","#f1e05a",21.2,{"name":92,"color":93,"percentage":94},"Shell","#89e051",6.5,{"name":96,"color":97,"percentage":98},"CSS","#663399",0,{"name":100,"color":101,"percentage":98},"HTML","#e34c26",1822,244,"2026-04-01T20:51:43","MIT","Linux, macOS, Windows (推荐 WSL)","未明确指定具体型号及显存大小，本地训练建议高性能机器或租用云资源，支持多 GPU 张量并行，消费级硬件可运行量化版","未说明",{"notes":110,"python":111,"dependencies":112},"必须使用 Python 3.11 创建虚拟环境；Windows 用户强烈建议使用 WSL；界面模式需安装 Valkey（Mac 通过 brew，Linux 可源码编译）；数据生成可选离线运行但旧硬件较慢；训练支持本地或云端租用；支持多卡并行 
(--tensor-parallelism)。","3.11",[113,114],"valkey","axolotl",[14,15,51,13,26],[117,118,119],"ai","dataset-generation","finetuning-llms","2026-03-27T02:49:30.150509","2026-04-06T05:35:42.885268",[123,128,132,137,141,145],{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},3741,"使用 Ollama 时 API 请求返回 404 Not Found 错误怎么办？","请检查 API 端点路径配置。日志显示请求了 `\u002Fv1\u002Fcompletions`，但 Ollama 通常需要使用 `\u002Fv1\u002Fchat\u002Fcompletions`。请确保配置中的 BASE_URL 和路径正确，有用户反馈修改链接后问题解决。","https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit\u002Fissues\u002F5",{"id":129,"question_zh":130,"answer_zh":131,"source_url":127},3742,"Augmentoolkit 支持直接读取 PDF 文件作为输入吗？","目前不支持。该工具需要纯文本输入。建议先使用外部工具（如 marker）将 PDF 转换为 Markdown 或文本格式后再导入项目进行处理。",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},3743,"WebUI 启动时报错 UnicodeDecodeError 如何解决？","这是 Gradio 界面可能遇到的编码问题。建议放弃 WebUI，改用命令行模式运行，执行命令：`python run_augmentoolkit.py`，这样可以绕过相关错误。","https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit\u002Fissues\u002F48",{"id":138,"question_zh":139,"answer_zh":140,"source_url":136},3744,"运行过程中内存占用过高（例如达到 30GB）是否正常？","这可能不是正常现象，可能是代码效率问题或 Gradio 缓存了大量历史记录导致的。维护者承认代码有时可能低效，建议尝试 CLI 模式以减少内存压力。",{"id":142,"question_zh":143,"answer_zh":144,"source_url":136},3745,"遇到 \"ERROR - Error in Generation Step: 'NoneType' object has no attribute 'group'\" 是否需要停止程序？","不需要。维护者确认这不是导致程序停止的错误，仅表示模型某次输出格式有误，流水线通常会继续运行，可以忽略此警告。",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},3746,"在 create_conversation 阶段出现 'NoneType' object has no attribute 'strip' 错误怎么办？","这是一个已知 Bug，维护者已在后续版本中修复。请确保拉取并更新到最新代码版本，或者检查相关的正则表达式配置是否正确。","https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit\u002Fissues\u002F6",[151,156,161,166,171],{"id":152,"version":153,"summary_zh":154,"released_at":155},103306,"v1.0.0","I am creating an official release for the Augmentoolkit project, which allows for QA dataset generation using open source models.\r\n\r\n## What's Changed\r\n* first \"release\" on 
GitHub, with all features and bugfixes\r\n* APIs, Local Models, OpenAI, Gemini all supported\r\n* simplification and rewrite by @darkacorn in https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit\u002Fpull\u002F2\r\n* Gradio Web UI + Extended Input Folder by @cocktailpeanut in https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit\u002Fpull\u002F16\r\n* feat: add gemini api support by @alexandreteles in https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit\u002Fpull\u002F18\r\n\r\n## New Contributors\r\n* @darkacorn made their first contribution in https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit\u002Fpull\u002F2\r\n* @cocktailpeanut made their first contribution in https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit\u002Fpull\u002F16\r\n* @alexandreteles made their first contribution in https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit\u002Fpull\u002F18\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit\u002Fcommits\u002Fv1.0.0","2024-06-08T10:08:18",{"id":157,"version":158,"summary_zh":159,"released_at":160},103302,"v3.0.0","Augmentoolkit 3.0 is essentially an entirely new project.\r\n\r\nBefore we had 3 pipelines. Now we have 16.\r\n\r\nBefore we just generated data. Now it automatically trains whole LLMs with autogenerated training configs. Datagen can be done locally, efficiently, on consumer hardware, thanks to a custom-trained dataset generation model.\r\n\r\nThe factual finetuning process's quality has been completely revolutionized during development -- three separate times, each building on the one before it.\r\n\r\nA full changelog is impractical, since everything is changed. Every abstraction has been improved. Every way in which the tool is used has been streamlined and improved. Every pipeline is better. 
Every outcome is higher-quality and more efficiently delivered.\r\n\r\nInstead of a changelog, refer to the documentation, since diffs don't mean much when the project has been effectively rewritten from the ground up.\r\n\r\nHowever, if you've forked the project before to build your own data pipelines, do not despair -- porting pipelines to New Augmentoolkit is easy, and there are the pipeline conventions, abstractions primer, and new pipeline primer in the documentation (docs\u002F...) to guide you through the process. Alternatively, you can get help on the [Discord](https:\u002F\u002Fdiscord.gg\u002Fn9AUXdma).\r\n\r\nAugmentoolkit is now the best way in the world to make custom data, and by extension, custom models.\r\n\r\nHappy Hacking!","2025-06-12T08:03:39",{"id":162,"version":163,"summary_zh":164,"released_at":165},103303,"v2.5.0","This is a tagged release with the final Augmentoolkit update before 3.0.\r\n\r\n3.0 changes literally everything and is not at all backwards compatible with previous versions (though incorporating new pipelines into 3.0 isn't terribly hard).\r\n\r\nStill, if you've forked off of the commits from 6 months ago before I disappeared to go write 3.0, this is the tagged version you want for compatibility.\r\n\r\nThat said, I seriously recommend migrating to the newest release.","2025-06-12T07:43:16",{"id":167,"version":168,"summary_zh":169,"released_at":170},103304,"v2.0.0","- New pipeline: RPToolkit. Generate RP data from any conceivable fictional story!\r\n- Augmentoolkit is no longer one isolated pipeline; it is now an extensible and modular project that can support any number of pipelines.  
Multiple pipeline executions can be scheduled in sequence.\r\n- Complete refactor, with new abstractions, cleaner code, fewer bugs.\r\n- New interface, with Streamlit, optimized for the new workflow.\r\n- Massively overhauled documentation with a large number of tutorial videos.\r\n\r\n## What's Changed\r\n* chore: Update file encoding to utf-8 for consistency by @1Etherl in https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit\u002Fpull\u002F45\r\n* Usability overhaul by @e-p-armstrong in https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit\u002Fpull\u002F49\r\n\r\n## New Contributors\r\n* @1Etherl made their first contribution in https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit\u002Fpull\u002F45\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fe-p-armstrong\u002Faugmentoolkit\u002Fcompare\u002Fv1.5.0...v2.0.0","2024-09-12T23:03:50",{"id":172,"version":173,"summary_zh":174,"released_at":175},103305,"v1.5.0","There's been a massive update: Augmentoolkit has been enhanced with a third pipeline. 
This one is specialized around making data at scale easier to work with, and giving you a tool to sort through it all: you can now make the dataset for, and train, any conceivable binary classification model quickly and at basically no cost.\r\n\r\nSome other features made between now and the last release are also included here.\r\n\r\n- **New pipeline: classifier creator.** Generates data for, trains, evaluates, and iterates on a small compute-efficient binary classification model — all within a single script.\r\n  - Allows painless **classification of massive amounts of unlabelled data** using any conceivable labels.\r\n  - **Achieves results comparable to classifiers trained on human-labelled data.**\r\n  - Extremely cost-efficient (a classifier costs less than a coffee even when using APIs)\r\n  - **Fast** (takes less than an hour to generate the data and train the classifier; frankly, depending on your settings, **often less than ten minutes**).\r\n  - Fully documented\r\n  - Configurable: change the base classifier model you train on, set a cap on the maximum number of iterations you will perform, and classify based on any labels imaginable\r\n- **Pure synthetic data pipeline** (EXPERIMENTAL): Don't have an input text? Describe the kind of conversations you want, and Augmentoolkit will use random combinations of labels and features to make a diversity of synthetic interactions. Useful for aligning the style of the model; not so good for adding facts.\r\n  - This pipeline first generates a pipeline for the specific type of conversations the user describes, then runs that pipeline. Currently the generated pipeline needs slightly better prompts to be usable without modification. 
The pure synthetic pipeline can therefore still be used productively, but you'll have to polish up the .\u002Fpure_synthetic_pipeline\u002Fprompts folder's contents first.\r\n- Overhauls to generation for **improved model training performance.**\r\n- **Prompt overrides for Augmentoolkit's default mode** out of the box: generate long-response data, \"negative data\".\r\n- **Improved local generation workflow:** no longer does local generation rely on two separate files. Now it uses the main `processing.py`; what section you're working through is controlled through `config.yaml`.\r\n- Miscellaneous fixes and improvements.\r\n- **Axolotl training configs** provided as part of the repo so that getting started creating your own LLM is easier.","2024-07-09T20:41:04"]