[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Tencent--PatrickStar":3,"tool-Tencent--PatrickStar":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":79,"owner_twitter":79,"owner_website":80,"owner_url":81,"languages":82,"stars":91,"forks":92,"last_commit_at":93,"license":94,"difficulty_score":10,"env_os":95,"env_gpu":96,"env_ram":97,"env_deps":98,"category_tags":104,"github_topics":105,"view_count":23,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":111,"updated_at":112,"faqs":113,"releases":144},2205,"Tencent\u002FPatrickStar","PatrickStar","PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP and democratizes AI for everyone.","PatrickStar 是一款专为自然语言处理（NLP）领域设计的开源训练框架，旨在让更大、更快、更绿色的预训练模型变得触手可及。它核心解决了大模型训练中常见的“显存溢出”难题：传统方法往往需要堆砌大量 GPU 才能运行超大参数模型，而 PatrickStar 通过创新的异构训练技术，能够动态调度 CPU 和 GPU 内存资源。\n\n其独特的技术亮点在于引入了基于\"Chunk\"的动态内存管理机制。与现有方案静态划分数据不同，PatrickStar 能将非计算部分的模型数据智能卸载至 CPU，仅将当前计算所需部分保留在 GPU 中。这种设计不仅极大提升了显存利用率，还优化了多卡通信效率。实测数据显示，在同等硬件条件下，PatrickStar 能训练的模型规模是主流方案 DeepSpeed 的两倍以上；甚至在仅用 32 张 A100 GPU 的集群上，就能成功训练高达 1750 亿参数的 GPT-3 模型，大幅降低了算力门槛。\n\n这款工具非常适合 AI 研究人员、算法工程师以及希望探索大模型但受限于硬件资源的开发者使用。基于 PyTorch 构建，PatrickStar 易于集成到现有项目中","PatrickStar 是一款专为自然语言处理（NLP）领域设计的开源训练框架，旨在让更大、更快、更绿色的预训练模型变得触手可及。它核心解决了大模型训练中常见的“显存溢出”难题：传统方法往往需要堆砌大量 GPU 才能运行超大参数模型，而 PatrickStar 
通过创新的异构训练技术，能够动态调度 CPU 和 GPU 内存资源。\n\n其独特的技术亮点在于引入了基于\"Chunk\"的动态内存管理机制。与现有方案静态划分数据不同，PatrickStar 能将非计算部分的模型数据智能卸载至 CPU，仅将当前计算所需部分保留在 GPU 中。这种设计不仅极大提升了显存利用率，还优化了多卡通信效率。实测数据显示，在同等硬件条件下，PatrickStar 能训练的模型规模是主流方案 DeepSpeed 的两倍以上；甚至在仅用 32 张 A100 GPU 的集群上，就能成功训练高达 1750 亿参数的 GPT-3 模型，大幅降低了算力门槛。\n\n这款工具非常适合 AI 研究人员、算法工程师以及希望探索大模型但受限于硬件资源的开发者使用。基于 PyTorch 构建，PatrickStar 易于集成到现有项目中，帮助团队以更低的成本高效完成从微调到预训练的全流程，真正推动大模型技术的普惠化。","## PatrickStar: Parallel Training of Large Language Models via a Chunk-based Memory Management\n\n![logo](.\u002Flogo.png)\n\n### Recent Progress\nSee [CHANGE_LOG.md](.\u002FCHANGE_LOG.md).\n\n### Meeting PatrickStar\nPre-Trained Models (PTMs) are becoming a focus of both NLP research and industry applications. However, training PTMs requires enormous hardware resources, which makes it accessible to only a small portion of the AI community. Now, **PatrickStar will make PTM training available to everyone!**\n\nOut-of-memory (OOM) errors are the nightmare of every engineer training PTMs. We often have to add more GPUs to store the model parameters to prevent such errors. PatrickStar offers a better solution to this problem. With **heterogeneous training** (DeepSpeed ZeRO Stage 3 also uses it), PatrickStar can make full use of both CPU and GPU memory, so that you can train larger models with fewer GPUs.\n\n### System Design\nThe idea behind PatrickStar is as follows. The non-model data (mainly activations) varies during training, but current heterogeneous training solutions **statically** split the model data between CPU and GPU. To make better use of the GPU, PatrickStar proposes **dynamic** memory scheduling with the help of a chunk-based memory management module. PatrickStar's memory management supports offloading everything except the part of the model currently being computed to the CPU, which saves GPU memory. 
In addition, chunk-based memory management is efficient for collective communication when scaling to multiple GPUs.\nSee the paper and [this doc](.\u002FINSIDE.md) for the idea behind PatrickStar.\n\n### Results\nIn our experiments, PatrickStar v0.4.3 is able to train an **18 billion** (18B) parameter model with 8x Tesla V100 GPUs and 240GB GPU memory on a WeChat data center node, whose network topology is described [here](.\u002Fdoc\u002Fyard_network_fabric.md). The largest model PatrickStar can train is over twice as large as the largest one DeepSpeed can train. PatrickStar also performs better on models of the same size. In the figure below, `pstar` denotes PatrickStar v0.4.3 and `deeps` denotes DeepSpeed v0.4.3 running the official [DeepSpeed example](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeedExamples\u002Fblob\u002Fmaster\u002FMegatron-LM-v1.1.5-ZeRO3\u002Fexamples\u002Fds_pretrain_gpt2-zero3.sh) at ZeRO stage 3 with activation optimizations enabled by default.\n\n![alt perf](.\u002Fdoc\u002Fmgpu_scalability.png \"performance testing result\")\n\nWe also evaluated PatrickStar v0.4.3 on a single node of an A100 SuperPod. It can train a 68B model on 8x A100 with 1TB CPU memory, which is over 6x larger than what DeepSpeed v0.5.7 can train. Beyond model scale, PatrickStar is also considerably more efficient than DeepSpeed. The benchmark scripts are [here](.\u002Fexamples\u002Fbenchmark).\n\n![alt perf](.\u002Fdoc\u002Fone_node_perf_a100.png \"performance testing result on SuperNode\")\n\nDetailed benchmark results on the WeChat AI data center and NVIDIA SuperPod are posted in this [Google Doc](https:\u002F\u002Fdocs.google.com\u002Fspreadsheets\u002Fd\u002F136CWc_jA_2zC4h1r-6dzD4PrOvp6aw6uCDchEyQv6sE\u002Fedit?usp=sharing).\n\n\nWe scaled PatrickStar to multiple machines (nodes) on SuperPod.\nWe succeeded in training GPT3-175B on 32 GPUs. 
As far as we know, it is the first work\nto run GPT3 on such a small GPU cluster.\nMicrosoft used 10,000 V100 GPUs to pretrain GPT3.\nNow you can finetune it or even pretrain your own on 32 A100 GPUs, amazing!\n\n![alt perf](.\u002Fdoc\u002Fm_node_superpod.png \"performance testing result on multiple nodes of SuperNode\")\n\n\nWe've also trained the [CLUE-GPT2](https:\u002F\u002Fhuggingface.co\u002Fuer\u002Fgpt2-chinese-cluecorpussmall) model with PatrickStar; the loss and accuracy curves are shown below:\n\n![CLUE-GPT2](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencent_PatrickStar_readme_e72281b9e986.png)\n\n### Installation\n```bash\npip install .\n```\n\nNote that PatrickStar requires GCC 7 or higher. You can also use NVIDIA NGC images; the following image has been tested:\n\n```bash\ndocker pull nvcr.io\u002Fnvidia\u002Fpytorch:21.06-py3\n```\n\n### Usage\nPatrickStar is based on PyTorch, making it easy to migrate a PyTorch project. Here is an example of PatrickStar:\n\n```python\nfrom patrickstar.runtime import initialize_engine\n\nconfig = {\n    \"optimizer\": {\n        \"type\": \"Adam\",\n        \"params\": {\n            \"lr\": 0.001,\n            \"betas\": (0.9, 0.999),\n            \"eps\": 1e-6,\n            \"weight_decay\": 0,\n            \"use_hybrid_adam\": True,\n        },\n    },\n    \"fp16\": {  # loss scaler params\n        \"enabled\": True,\n        \"loss_scale\": 0,\n        \"initial_scale_power\": 2 ** 3,\n        \"loss_scale_window\": 1000,\n        \"hysteresis\": 2,\n        \"min_loss_scale\": 1,\n    },\n    \"default_chunk_size\": 64 * 1024 * 1024,\n    \"release_after_init\": True,\n    \"use_cpu_embedding\": False,\n    \"client\": {\n        \"mem_tracer\": {\n            \"use_async_mem_monitor\": args.with_async_mem_monitor,\n        }\n    },\n}\n\ndef model_func():\n    # MyModel is a derived class for torch.nn.Module\n    return MyModel(...)\n\nmodel, optimizer = initialize_engine(model_func=model_func, 
local_rank=0, config=config)\n\n...\n\nfor data in dataloader:\n    optimizer.zero_grad()\n\n    loss = model(data)\n    model.backward(loss)\n    optimizer.step()\n```\n\nWe use the same `config` format as the [DeepSpeed configuration JSON](https:\u002F\u002Fwww.deepspeed.ai\u002Fdocs\u002Fconfig-json\u002F#optimizer-parameters), which mainly covers the optimizer parameters, the loss scaler parameters, and some PatrickStar-specific options.\n\nFor a detailed explanation of the above example, please check the guide [here](.\u002FGUIDE.md).\n\nFor more examples, please check [here](.\u002Fexamples).\n\nA quick-start benchmark script is [here](.\u002Fexamples\u002Frun_transformers.sh). It runs on randomly generated data, so you do not need to prepare real data. It also demonstrates all of PatrickStar's optimization techniques. For more optimization tricks when running the benchmark, see [Optimization Options](.\u002Fdoc\u002Foptimization_options.md).\n\n\n### License\nBSD 3-Clause License\n\n### Cite Us\n```\n@article{fang2021patrickstar,\n  title={PatrickStar: Parallel Training of Pre-trained Models via a Chunk-based Memory Management},\n  author={Fang, Jiarui and Yu, Yang and Zhu, Zilin and Li, Shenggui and You, Yang and Zhou, Jie},\n  journal={arXiv preprint arXiv:2108.05818},\n  year={2021}\n}\n@article{fang2022parallel,\n  title={Parallel Training of Pre-Trained Models via Chunk-Based Dynamic Memory Management},\n  author={Fang, Jiarui and Zhu, Zilin and Li, Shenggui and Su, Hui and Yu, Yang and Zhou, Jie and You, Yang},\n  journal={IEEE Transactions on Parallel and Distributed Systems},\n  volume={34},\n  number={1},\n  pages={304--315},\n  year={2022},\n  publisher={IEEE}\n}\n```\n\n### Contact Us\n{jiaruifang, zilinzhu, josephyu}@tencent.com\n\nPowered by WeChat AI Team, Tencent NLP Oteam.\n","## PatrickStar：基于分块内存管理的大规模语言模型并行训练\n\n![logo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencent_PatrickStar_readme_4cd47b02e26e.png)\n\n### 最新进展\n请参阅 
[CHANGE_LOG.md](.\u002FCHANGE_LOG.md)。\n\n### 认识 PatrickStar\n预训练模型（PTM）正成为自然语言处理研究和工业应用的热点。然而，训练这些模型需要巨大的硬件资源，使得只有少数人工智能领域的从业者能够进行。现在，**PatrickStar 将让每个人都能轻松训练 PTM！**\n\n显存不足错误（OOM）是每一位训练 PTM 的工程师的噩梦。我们通常不得不增加 GPU 数量来存储模型参数以避免此类错误。PatrickStar 为这一问题提供了一种更好的解决方案。通过 **异构训练**（DeepSpeed Zero Stage 3 也采用了该技术），PatrickStar 能够充分利用 CPU 和 GPU 的显存，从而用更少的 GPU 来训练更大的模型。\n\n### 系统设计\nPatrick 的核心思想如下：在训练过程中，非模型数据（主要是激活值）是动态变化的，而现有的异构训练方案却将模型数据 **静态地** 分配到 CPU 和 GPU 上。为了更好地利用 GPU，PatrickStar 提出了一种基于分块内存管理模块的 **动态** 内存调度机制。PatrickStar 的内存管理可以将除当前计算部分之外的所有内容卸载到 CPU 上，从而节省 GPU 显存。此外，分块内存管理在扩展到多 GPU 时，对于集体通信也非常高效。有关 PatrickStar 的设计理念，请参阅论文和 [这篇文档](.\u002FINSIDE.md)。\n\n### 结果\n实验中，PatrickStar v0.4.3 在微信数据中心节点上，使用 8 块 Tesla V100 GPU 和 240GB 显存，成功训练了一个拥有 **180 亿**（18B）参数的模型，其网络拓扑结构如 [此处](.\u002Fdoc\u002Fyard_network_fabric.md) 所示。PatrickStar 比 DeepSpeed 大了两倍多，且在相同规模的模型上性能也更优。图中的 pstar 表示 PatrickStar v0.4.3 的性能，而 deeps 则表示使用官方示例 [DeepSpeed 示例](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeedExamples\u002Fblob\u002Fmaster\u002FMegatron-LM-v1.1.5-ZeRO3\u002Fexamples\u002Fds_pretrain_gpt2-zero3.sh) 中默认开启激活优化的 ZeRO3 阶段的 DeepSpeed v0.4.3 的性能。\n\n![alt perf](.\u002Fdoc\u002Fmgpu_scalability.png \"性能测试结果\")\n\n我们还在 A100 SuperPod 的单个节点上评估了 PatrickStar v0.4.3。它可以在 8 块 A100 GPU 和 1TB CPU 内存的配置下训练一个 680 亿参数的模型，这比 DeepSpeed v0.5.7 大了 6 倍以上。除了模型规模之外，PatrickStar 的效率也远超 DeepSpeed。基准测试脚本见 [这里](.\u002Fexamples\u002Fbenchmark)。\n\n![alt perf](.\u002Fdoc\u002Fone_node_perf_a100.png \"SuperNode 上的性能测试结果\")\n\n我们在微信 AI 数据中心和 NVIDIA SuperPod 上的详细基准测试结果已发布在该 [Google 文档](https:\u002F\u002Fdocs.google.com\u002Fspreadsheets\u002Fd\u002F136CWc_jA_2zC4h1r-6dzD4PrOvp6aw6uCDchEyQv6sE\u002Fedit?usp=sharing) 中。\n\n将 PatrickStar 扩展到 SuperPod 的多台机器（节点）上。\n我们成功地在 32 块 GPU 上训练了一个 GPT3-175B 模型。据我们所知，这是首次在如此小型的 GPU 集群上运行 GPT3 的工作。\n微软曾使用 10,000 块 V100 来训练 GPT3。而现在，你只需 32 块 A100 GPU 就能微调甚至预训练属于自己的 GPT3，真是太棒了！\n\n![alt perf](.\u002Fdoc\u002Fm_node_superpod.png \"SuperNode 多节点上的性能测试结果\")\n\n我们还使用 PatrickStar 训练了 
[CLUE-GPT2](https:\u002F\u002Fhuggingface.co\u002Fuer\u002Fgpt2-chinese-cluecorpussmall) 模型，其损失和准确率曲线如下所示：\n\n![CLUE-GPT2](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencent_PatrickStar_readme_e72281b9e986.png)\n\n### 安装\n```bash\npip install .\n```\n\n请注意，PatrickStar 需要 GCC 7 或更高版本。你也可以使用 NVIDIA NGC 镜像，以下镜像是经过测试的：\n\n```bash\ndocker pull nvcr.io\u002Fnvidia\u002Fpytorch:21.06-py3\n```\n\n### 使用\nPatrickStar 基于 PyTorch，因此很容易将现有的 PyTorch 项目迁移到 PatrickStar 上。以下是一个 PatrickStar 的示例：\n\n```python\nfrom patrickstar.runtime import initialize_engine\n\nconfig = {\n    \"optimizer\": {\n        \"type\": \"Adam\",\n        \"params\": {\n            \"lr\": 0.001,\n            \"betas\": (0.9, 0.999),\n            \"eps\": 1e-6,\n            \"weight_decay\": 0,\n            \"use_hybrid_adam\": True,\n        },\n    },\n    \"fp16\": {  # loss scaler 参数\n        \"enabled\": True,\n        \"loss_scale\": 0,\n        \"initial_scale_power\": 2 ** 3,\n        \"loss_scale_window\": 1000,\n        \"hysteresis\": 2,\n        \"min_loss_scale\": 1,\n    },\n    \"default_chunk_size\": 64 * 1024 * 1024,\n    \"release_after_init\": True,\n    \"use_cpu_embedding\": False,\n    \"client\": {\n        \"mem_tracer\": {\n            \"use_async_mem_monitor\": args.with_async_mem_monitor,\n        }\n    },\n}\n\ndef model_func():\n    # MyModel 是 torch.nn.Module 的子类\n    return MyModel(...)\n\nmodel, optimizer = initialize_engine(model_func=model_func, local_rank=0, config=config)\n\n...\n\nfor data in dataloader:\n    optimizer.zero_grad()\n\n    loss = model(data)\n    model.backward(loss)\n    optimizer.step()\n```\n\n我们使用的 `config` 格式与 [DeepSpeed 配置 JSON](https:\u002F\u002Fwww.deepspeed.ai\u002Fdocs\u002Fconfig-json\u002F#optimizer-parameters) 相同，主要包括优化器参数、损失缩放器参数以及一些 PatrickStar 特有的配置。\n\n有关上述示例的详细说明，请参阅指南 [这里](.\u002FGUIDE.md)。\n\n更多示例请查看 [这里](.\u002Fexamples)。\n\n一个快速入门的基准测试脚本见 [这里](.\u002Fexamples\u002Frun_transformers.sh)。该脚本使用随机生成的数据运行，因此无需准备真实数据。它还演示了 
PatrickStar 的所有优化技术。有关运行基准测试的更多优化技巧，请参阅 [优化选项](.\u002Fdoc\u002Foptimization_options.md)。\n\n### 许可证\nBSD 3-Clause 许可证\n\n### 引用我们\n```\n@article{fang2021patrickstar,\n  title={PatrickStar: 基于分块内存管理的预训练模型并行训练},\n  author={Fang, Jiarui and Yu, Yang and Zhu, Zilin and Li, Shenggui and You, Yang and Zhou, Jie},\n  journal={arXiv preprint arXiv:2108.05818},\n  year={2021}\n}\n@article{fang2022parallel,\n  title={基于分块动态内存管理的预训练模型并行训练},\n  author={Fang, Jiarui and Zhu, Zilin and Li, Shenggui and Su, Hui and Yu, Yang and Zhou, Jie and You, Yang},\n  journal={IEEE Transactions on Parallel and Distributed Systems},\n  volume={34},\n  number={1},\n  pages={304--315},\n  year={2022},\n  publisher={IEEE}\n}\n```\n\n### 联系我们\n{jiaruifang, zilinzhu, josephyu}@tencent.com\n\n由微信 AI 团队和腾讯 NLP Oteam 提供支持。","# PatrickStar 快速上手指南\n\nPatrickStar 是一款基于分块内存管理（Chunk-based Memory Management）的大语言模型并行训练工具。它通过动态调度 CPU 和 GPU 显存，显著降低显存占用，使开发者能够使用更少的显卡训练更大的模型（例如在 8 张 A100 上训练 68B 模型，或在 32 张 GPU 上训练 GPT-3 175B）。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**：Linux\n*   **编译器**：GCC 版本 7 或更高（必需）\n*   **深度学习框架**：PyTorch\n*   **推荐镜像**：为了方便配置环境，推荐使用 NVIDIA NGC 提供的 PyTorch 镜像（已预装兼容的驱动和库）：\n    ```bash\n    docker pull nvcr.io\u002Fnvidia\u002Fpytorch:21.06-py3\n    ```\n\n## 安装步骤\n\n您可以直接通过源码进行安装。建议在国内网络环境下配置 pip 国内镜像源以加速下载。\n\n```bash\n# 使用国内镜像源安装（推荐）\npip install . 
-i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n> **注意**：如果在构建过程中遇到编译错误，请再次确认 `gcc --version` 是否 >= 7。\n\n## 基本使用\n\nPatrickStar 基于 PyTorch 构建，迁移现有项目非常简单。其配置格式与 DeepSpeed 兼容，主要包含优化器、混合精度训练参数及 PatrickStar 特有的内存管理配置。\n\n以下是一个最小化的使用示例：\n\n```python\nfrom patrickstar.runtime import initialize_engine\n\n# 配置字典\nconfig = {\n    \"optimizer\": {\n        \"type\": \"Adam\",\n        \"params\": {\n            \"lr\": 0.001,\n            \"betas\": (0.9, 0.999),\n            \"eps\": 1e-6,\n            \"weight_decay\": 0,\n            \"use_hybrid_adam\": True,\n        },\n    },\n    \"fp16\": {  # 混合精度训练参数\n        \"enabled\": True,\n        \"loss_scale\": 0,\n        \"initial_scale_power\": 2 ** 3,\n        \"loss_scale_window\": 1000,\n        \"hysteresis\": 2,\n        \"min_loss_scale\": 1,\n    },\n    \"default_chunk_size\": 64 * 1024 * 1024, # 默认分块大小\n    \"release_after_init\": True,\n    \"use_cpu_embedding\": False,\n    \"client\": {\n        \"mem_tracer\": {\n            # 是否启用异步内存监控\n            \"use_async_mem_monitor\": args.with_async_mem_monitor, \n        }\n    },\n}\n\ndef model_func():\n    # MyModel 是继承自 torch.nn.Module 的自定义模型类\n    return MyModel(...)\n\n# 初始化引擎\nmodel, optimizer = initialize_engine(model_func=model_func, local_rank=0, config=config)\n\n# 标准训练循环\nfor data in dataloader:\n    optimizer.zero_grad()\n\n    loss = model(data)\n    model.backward(loss)\n    optimizer.step()\n```\n\n### 快速测试\n如果您想立即验证安装并查看性能基准，可以运行官方提供的快速基准测试脚本（该脚本使用随机生成的数据，无需准备真实数据集）：\n\n```bash\n.\u002Fexamples\u002Frun_transformers.sh\n```\n\n更多详细的优化技巧和完整示例，请参考项目目录下的 `examples` 文件夹及 `doc\u002Foptimization_options.md` 文档。","某中型 AI 实验室的研究团队试图在有限的算力预算下，基于自有行业数据从头预训练一个 180 亿参数的大语言模型。\n\n### 没有 PatrickStar 时\n- **硬件门槛过高**：受限于显存容量，团队必须筹集数十张高端 GPU 才能勉强启动训练，远超实验室预算。\n- **频繁遭遇显存溢出**：在尝试静态混合训练方案时，常因激活值波动导致 OOM（显存溢出）错误，训练过程极不稳定。\n- **资源利用率低下**：现有的异构训练策略静态划分内存，导致 GPU 在等待数据时空转，而 CPU 内存却大量闲置。\n- **扩展性受限**：由于通信效率瓶颈，增加更多节点并未带来线性的速度提升，反而增加了运维复杂度。\n\n### 使用 
PatrickStar 后\n- **大幅降低硬件需求**：凭借动态分块内存管理，团队仅用 8 张 V100 显卡配合大容量 CPU 内存，便成功跑通了 18B 模型的训练。\n- **彻底解决 OOM 难题**：PatrickStar 能将非计算部分的模型数据动态卸载至 CPU，确保 GPU 只处理当前计算任务，训练全程稳定无崩溃。\n- **最大化异构算力**：通过动态调度机制，充分榨干了 CPU 与 GPU 的联合内存潜力，使单节点可支撑的模型规模超越 DeepSpeed 两倍以上。\n- **高效多机扩展**：在 32 张 A100 的集群上成功复现了 GPT-3 175B 的训练，证明了其在小规模集群上训练超大规模模型的卓越扩展性。\n\nPatrickStar 通过智能的动态内存调度，打破了显存墙的限制，让中小团队也能以低成本驾驭千亿级大模型的训练。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencent_PatrickStar_4cd47b02.png","Tencent","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FTencent_f7e55588.png","",null,"https:\u002F\u002Fopensource.tencent.com","https:\u002F\u002Fgithub.com\u002FTencent",[83,87],{"name":84,"color":85,"percentage":86},"Python","#3572A5",89.4,{"name":88,"color":89,"percentage":90},"C++","#f34b7d",10.6,778,59,"2026-03-15T14:00:12","BSD-3-Clause","Linux","必需 NVIDIA GPU。测试环境包括 Tesla V100 (8x, 共 240GB 显存) 和 A100 (8x)。支持异构训练，可利用 CPU 内存扩展显存容量。","未说明具体最低值，但大规模训练推荐大内存（测试中使用了 1TB CPU 内存来训练 68B 模型）。",{"notes":99,"python":100,"dependencies":101},"1. 编译需要 GCC 版本 7 或更高。2. 推荐使用官方测试过的 NVIDIA NGC Docker 镜像 (nvcr.io\u002Fnvidia\u002Fpytorch:21.06-py3)。3. 该工具核心特性是动态内存调度，可将非计算部分的模型数据卸载到 CPU，从而用较少的 GPU 训练更大的模型（如单节点 8xA100 + 1TB 内存可训练 68B 模型）。4. 配置格式兼容 DeepSpeed JSON。","3.x (基于提供的 Docker 镜像 nvcr.io\u002Fnvidia\u002Fpytorch:21.06-py3 推断)",[102,103],"PyTorch","GCC >= 7",[15,13,26],[106,107,108,109,110],"nlp","pretrained-models","gpt","bert","pytorch","2026-03-27T02:49:30.150509","2026-04-06T08:09:10.376056",[114,119,124,129,134,139],{"id":115,"question_zh":116,"answer_zh":117,"source_url":118},10296,"如何优化 CPU Embedding 的内存占用，避免每个进程都存储完整参数？","目前每个进程存储一份完整参数确实浪费内存。推荐两种方案：\n1. **模型并行方案（推荐）**：将 Embedding 矩阵按 hidden dim 切分 N 份，前向传播时对 input_ids 进行 all-to-all 通信，反向传播时对 activations 进行 all-to-all 通信。\n2. 
**单进程计算方案**：仅由 rank=0 进程负责存储和计算 Embedding 矩阵。前向时 rank=0 收集所有 input_ids，反向时收集所有 activations。此方案通信量最少。\n可根据实际需求选择作为用户可选配置项。","https:\u002F\u002Fgithub.com\u002FTencent\u002FPatrickStar\u002Fissues\u002F45",{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},10297,"C++ Adam 优化器速度较慢或耗时增加的原因是什么？","Adam 计算时间增加通常是因为计入了 fp16 到 fp32 的类型转换时间。如果怀疑是 Loss Scale 导致的性能瓶颈（因为它相当于对所有参数求和），可以尝试去掉 Loss Scale 进行测试。另外，建议将梯度从 fp16 到 fp32 的转换过程直接放入 CPU Adam 内部执行，这样可以减少额外的数据拷贝和存储开销，提升整体效率。","https:\u002F\u002Fgithub.com\u002FTencent\u002FPatrickStar\u002Fissues\u002F67",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},10298,"训练过程中报错 \"RuntimeError: chunk move failed\" 且提示 CPU 内存不足怎么办？","该错误表明 CPU 没有足够的连续内存空间移动 chunk。即使总内存使用率不高，也可能因碎片化导致失败。解决方法包括：\n1. 调整启动参数，尝试使用 `--with_mem_cache` 和 `--with_static_partition` 选项。\n2. 检查并调整 `chunk_size` 设置，不同的优化选项会影响模型规模和执行效率。\n3. 参考官方文档中的优化选项说明：https:\u002F\u002Fgithub.com\u002FTencent\u002FPatrickStar\u002Fblob\u002Fmaster\u002Fdoc\u002Foptimization_options.md","https:\u002F\u002Fgithub.com\u002FTencent\u002FPatrickStar\u002Fissues\u002F308",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},10299,"PatrickStar 是否支持在 Megatron-DeepSpeed (MDS) 框架中使用？","目前 PatrickStar 主要作为独立框架或插件思路开发。虽然官方认为 Megatron+DeepSpeed 并非终极方案（PatrickStar 结合 ZeroDP 可在更少显卡上训练超大模型如 GPT3-175B），但团队正在考虑将 PatrickStar 改进为插件形式以支持更多用户。用户可以通过参考现有的 T5 示例代码，手动将 PatrickStar 应用到基于 Megatron-DeepSpeed 的模型训练中。","https:\u002F\u002Fgithub.com\u002FTencent\u002FPatrickStar\u002Fissues\u002F293",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},10300,"如何进一步减少内存消耗，实现 FWD+BWD+ADAM 的融合优化？","可以通过只保留 param fp32 来缩减内存 footprint：\n1. **前向传播 (FWD)**：临时分配 submodule 需要的 param fp16，从 param fp32 拷贝数据，计算完毕后立即释放。\n2. 
**反向传播 (BWD)**：需要时再从 param fp32 转化产生 grad fp16，随即开始 Adam 计算更新 param fp32，之后丢弃 grad fp16。\n这种融合方式可将总内存消耗降低至接近优化器状态大小。实现时需将 grad 从 fp16 到 fp32 的转换逻辑直接集成到 CPU Adam 中，以避免中间结果存储问题。","https:\u002F\u002Fgithub.com\u002FTencent\u002FPatrickStar\u002Fissues\u002F26",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},10301,"如何支持部分模型由 Chunk 管理，部分由 PyTorch 原生管理（混合计算图）？","可以定义一个混合计算图，其中一部分由基于 chunk 的内存管理器管理，其余部分由 PyTorch 原生管理。具体实现类似于 CPU Embedding 的优化方案：\n1. 对于未由 chunk 管理的部分（如 MoE 结构中的特定层），可以添加一个作用域（scope），在该作用域内创建的所有参数标记为 `TORCH_BASED`。\n2. 这样未管理部分的反向传播只需调用标准的 `loss.backward()`，而管理部分调用 `model.backward(loss)`，互不干扰且能利用 PyTorch 原生的 autograd 机制。","https:\u002F\u002Fgithub.com\u002FTencent\u002FPatrickStar\u002Fissues\u002F188",[145,150,155,160,165,170,175],{"id":146,"version":147,"summary_zh":148,"released_at":149},107428,"v0.4.6","Evaluate on 8 nodes of SuperPod. Fix bugs in multi-GPU mem tracer.","2021-12-23T11:38:09",{"id":151,"version":152,"summary_zh":153,"released_at":154},107429,"v0.4.5","Refactor the files in example and add chunk size searching.","2021-12-13T06:51:40",{"id":156,"version":157,"summary_zh":158,"released_at":159},107430,"v0.4.4","The system is successfully evaluated on a multi-node system.\r\nThe benchmark scripts are integrated with memory-centric tiling borrowed from DeepSpeed.\r\nIt trains an 18B model on WeChat Yard.","2021-12-08T03:19:00",{"id":161,"version":162,"summary_zh":163,"released_at":164},107431,"v0.4.3","PatrickStar is evaluated on 8xA100 SuperNode.\r\n1. Fix async copy bug in chunk move.\r\n2. Add Memory Allocation Cache\r\n3. 
Memory Saving Communication.","2021-11-27T13:10:17",{"id":166,"version":167,"summary_zh":168,"released_at":169},107432,"v0.4.2","Refactored memory tracer.\r\n","2021-11-24T07:24:07",{"id":171,"version":172,"summary_zh":173,"released_at":174},107433,"v0.3.0","The initial open source version 🎉🎉🎉","2021-11-08T02:58:36",{"id":176,"version":177,"summary_zh":178,"released_at":179},107434,"v0.1.0","Single-node, single-GPU version. Uses eager mode for chunk schema scheduling. Performance is poor because of the large CPU-GPU data movement overhead.","2021-05-10T08:04:49"]