[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-hpcaitech--ColossalAI":3,"tool-hpcaitech--ColossalAI":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",157379,2,"2026-04-15T23:32:42",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":77,"owner_twitter":76,"owner_website":78,"owner_url":79,"languages":80,"stars":109,"forks":110,"last_commit_at":111,"license":112,"difficulty_score":113,"env_os":114,"env_gpu":115,"env_ram":114,"env_deps":116,"category_tags":121,"github_topics":122,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":135,"updated_at":136,"faqs":137,"releases":166},7928,"hpcaitech\u002FColossalAI","ColossalAI","Making large AI models cheaper, faster and more accessible","ColossalAI 是一个致力于让大型人工智能模型训练与推理变得更经济、高效且易于获取的开源系统。它主要解决了大模型在开发过程中面临的显存受限、训练速度缓慢以及硬件成本高昂等核心痛点，通过先进的并行策略和系统优化，让用户能在有限的计算资源上运行参数量巨大的模型。\n\n这款工具非常适合 AI 研究人员、算法工程师以及希望深入探索大模型技术的开发者使用。无论是进行前沿学术研究，还是构建企业级 AI 应用，ColossalAI 都能提供强有力的支持。其独特的技术亮点在于集成了多种高效的并行训练技术（如张量并行、流水线并行及序列并行），并针对主流硬件进行了深度适配与加速。此外，它还提供了丰富的预置示例和友好的文档，帮助用户快速上手，轻松实现从模型微调到大规模部署的全流程。借助 ColossalAI，用户无需从零构建复杂的底层架构，即可显著降低算力门槛，将更多精力聚焦于模型创新与应用落地。","# Colossal-AI\n\u003Cdiv id=\"top\" align=\"center\">\n\n   [![logo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_a0721a485ad2.png)](https:\u002F\u002Fwww.colossalai.org\u002F)\n\n   Colossal-AI: Making large AI models cheaper, faster, and more accessible\n\n   \u003Ch3> \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.14883\"> Paper \u003C\u002Fa> |\n   \u003Ca href=\"https:\u002F\u002Fwww.colossalai.org\u002F\"> Documentation \u003C\u002Fa> |\n   \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\"> Examples \u003C\u002Fa> |\n   \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fdiscussions\"> Forum \u003C\u002Fa> |\n   \u003Ca href=\"https:\u002F\u002Fcolossalai.org\u002Fzh-Hans\u002Fdocs\u002Fget_started\u002Fbonus\u002F\">GPU Cloud Playground \u003C\u002Fa> |\n   \u003Ca href=\"https:\u002F\u002Fhpc-ai.com\u002Fblog\"> Blog \u003C\u002Fa>\u003C\u002Fh3>\n\n   [![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhpcaitech\u002FColossalAI?style=social)](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fstargazers)\n   [![Build](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Factions\u002Fworkflows\u002Fbuild_on_schedule.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Factions\u002Fworkflows\u002Fbuild_on_schedule.yml)\n   [![Documentation](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_6bf48b3e9a6d.png)](https:\u002F\u002Fcolossalai.readthedocs.io\u002Fen\u002Flatest\u002F?badge=latest)\n   [![CodeFactor](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_9ee0cb95ac54.png)](https:\u002F\u002Fwww.codefactor.io\u002Frepository\u002Fgithub\u002Fhpcaitech\u002Fcolossalai)\n   [![HuggingFace badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97HuggingFace-Join-yellow)](https:\u002F\u002Fhuggingface.co\u002Fhpcai-tech)\n   [![slack badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSlack-join-blueviolet?logo=slack&amp)](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002Fpublic_assets\u002Ftree\u002Fmain\u002Fcolossalai\u002Fcontact\u002Fslack)\n   [![WeChat badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F微信-加入-green?logo=wechat&amp)](https:\u002F\u002Fraw.githubusercontent.com\u002Fhpcaitech\u002Fpublic_assets\u002Fmain\u002Fcolossalai\u002Fimg\u002FWeChat.png)\n\n\n   | [English](README.md) | [中文](docs\u002FREADME-zh-Hans.md) |\n\n\u003C\u002Fdiv>\n\n## Instantly Run Colossal-AI on Enterprise-Grade GPUs\n\nSkip the setup. Access a powerful, pre-configured Colossal-AI environment on [**HPC-AI Cloud**](https:\u002F\u002Fhpc-ai.com\u002F?utm_source=github&utm_medium=social&utm_campaign=promotion-colossalai).\n\nTrain your models and scale your AI workload in one click!\n\n* **NVIDIA Blackwell B200s**: Experience the next generation of AI performance ([See Benchmarks](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fb200)). Now available on cloud from **$2.47\u002Fhr**.\n* **Cost-Effective H200 Cluster**: Get premier performance with on-demand rental from just **$1.99\u002Fhr**.\n\n[**Get Started Now & Claim Your Free Credits →**](https:\u002F\u002Fhpc-ai.com\u002F?utm_source=github&utm_medium=social&utm_campaign=promotion-colossalai)\n\n\u003Cdiv align=\"center\">\n   \u003Ca href=\"https:\u002F\u002Fhpc-ai.com\u002F?utm_source=github&utm_medium=social&utm_campaign=promotion-colossalai\">\n   \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_ded760bdf45d.png\" width=\"850\" \u002F>\n   \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n## Instant Access Top Open Models at Half the Cost\n\nSkip the hassle. Access powerful, long-context LLMs seamlessly through [**HPC-AI Model APIs**](https:\u002F\u002Fhpc-ai.com\u002Fmodel-apis?utm_source=github&utm_medium=social&utm_campaign=promotion-colossalai).\n\nBuild your AI agents, chatbots, and RAG applications with HPC-AI Model APIs!\n\n* **Latest & Greatest Models**: Experience state-of-the-art performance with Kimi 2.5, MiniMax 2.5, and GLM 5.1. Perfect for massive 2M+ context windows and complex coding tasks.\n\n* **Unbeatable Pricing**: Stop overpaying for API endpoints. Get premier inference speed at up to 50% cheaper than OpenRouter.\n\n[**Get Started Now & Claim Your $4 Free Credits →**](https:\u002F\u002Fwww.hpc-ai.com\u002Faccount\u002Fsignup?redirectUrl=\u002Fmodels-console\u002Fmodels&invitation_code=HPCAI-MAPI&utm_source=google&utm_medium=social&utm_id=newlaunch)\n\n\u003Cdiv align=\"center\">\n   \u003Ca href=\"https:\u002F\u002Fhpc-ai.com\u002Fmodel-apis?utm_source=github&utm_medium=social&utm_campaign=promotion-colossalai\">\n   \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_bfe1818377ba.png\" width=\"850\" \u002F>\n   \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n### Colossal-AI Benchmark\n\nTo see how these performance gains translate to real-world applications, we conducted a large language model training benchmark using Colossal-AI on Llama-like models. The tests were run on both 8-card and 16-card configurations for 7B and 70B models, respectively.\n\n|              GPU              |  GPUs  | Model Size |    Parallelism    | Batch Size per DP | Seqlen | Throughput | TFLOPS\u002FGPU  | Peak Mem(MiB)  |\n| :-----------------------------: | :--------: | :-------------: | :------------------: | :-----------: | :--------------: | :-------------: | :-------------: | :-------------: |\n|         H200            |     8     |      7B       |   zero2(dp8)     | 36 |        4096     |       17.13 samp\u002Fs     |       534.18     |       119040.02     |\n|         H200            |     16     |      70B       |   zero2     | 48 |        4096     |       3.27 samp\u002Fs     |       469.1     |       150032.23     |\n|         B200            |     8     |      7B       |   zero1(dp2)+tp2+pp4     | 128 |        4096     |       25.83 samp\u002Fs     |       805.69     |       100119.77     |\n|         H200            |     16     |      70B       |   zero1(dp2)+tp2+pp4     | 128 |        4096     |       5.66 samp\u002Fs     |       811.79     |       100072.02     |\n\nThe results from the Colossal-AI benchmark provide the most practical insight. For the 7B model on 8 cards, the **B200 achieved a 50% higher throughput** and a significant increase in TFLOPS per GPU. For the 70B model on 16 cards, the B200 again demonstrated a clear advantage, with **over 70% higher throughput and TFLOPS per GPU**. These numbers show that the B200's performance gains translate directly to faster training times for large-scale models.\n\n## Latest News\n* [2025\u002F02] [DeepSeek 671B Fine-Tuning Guide Revealed—Unlock the Upgraded DeepSeek Suite with One Click, AI Players Ecstatic!](https:\u002F\u002Fcompany.hpc-ai.com\u002Fblog\u002Fshocking-release-deepseek-671b-fine-tuning-guide-revealed-unlock-the-upgraded-deepseek-suite-with-one-click-ai-players-ecstatic)\n* [2024\u002F12] [The development cost of video generation models has saved by 50%! Open-source solutions are now available with H200 GPU vouchers](https:\u002F\u002Fcompany.hpc-ai.com\u002Fblog\u002Fthe-development-cost-of-video-generation-models-has-saved-by-50-open-source-solutions-are-now-available-with-h200-gpu-vouchers) [[code]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora\u002Fblob\u002Fmain\u002Fscripts\u002Ftrain.py) [[vouchers]](https:\u002F\u002Fcolossalai.org\u002Fzh-Hans\u002Fdocs\u002Fget_started\u002Fbonus\u002F)\n* [2024\u002F10] [How to build a low-cost Sora-like app? Solutions for you](https:\u002F\u002Fcompany.hpc-ai.com\u002Fblog\u002Fhow-to-build-a-low-cost-sora-like-app-solutions-for-you)\n* [2024\u002F09] [Singapore Startup HPC-AI Tech Secures 50 Million USD in Series A Funding to Build the Video Generation AI Model and GPU Platform](https:\u002F\u002Fcompany.hpc-ai.com\u002Fblog\u002Fsingapore-startup-hpc-ai-tech-secures-50-million-usd-in-series-a-funding-to-build-the-video-generation-ai-model-and-gpu-platform)\n* [2024\u002F09] [Reducing AI Large Model Training Costs by 30% Requires Just a Single Line of Code From FP8 Mixed Precision Training Upgrades](https:\u002F\u002Fcompany.hpc-ai.com\u002Fblog\u002Freducing-ai-large-model-training-costs-by-30-requires-just-a-single-line-of-code-from-fp8-mixed-precision-training-upgrades)\n* [2024\u002F06] [Open-Sora Continues Open Source: Generate Any 16-Second 720p HD Video with One Click, Model Weights Ready to Use](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fopen-sora-from-hpc-ai-tech-team-continues-open-source-generate-any-16-second-720p-hd-video-with-one-click-model-weights-ready-to-use)\n* [2024\u002F05] [Large AI Models Inference Speed Doubled, Colossal-Inference Open Source Release](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fcolossal-inference)\n* [2024\u002F04] [Open-Sora Unveils Major Upgrade: Embracing Open Source with Single-Shot 16-Second Video Generation and 720p Resolution](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fopen-soras-comprehensive-upgrade-unveiled-embracing-16-second-video-generation-and-720p-resolution-in-open-source)\n* [2024\u002F04] [Most cost-effective solutions for inference, fine-tuning and pretraining, tailored to LLaMA3 series](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fmost-cost-effective-solutions-for-inference-fine-tuning-and-pretraining-tailored-to-llama3-series)\n\n## Table of Contents\n\u003Cul>\n \u003Cli>\u003Ca href=\"#Why-Colossal-AI\">Why Colossal-AI\u003C\u002Fa> \u003C\u002Fli>\n \u003Cli>\u003Ca href=\"#Features\">Features\u003C\u002Fa> \u003C\u002Fli>\n \u003Cli>\n   \u003Ca href=\"#Colossal-AI-in-the-Real-World\">Colossal-AI for Real World Applications\u003C\u002Fa>\n   \u003Cul>\n     \u003Cli>\u003Ca href=\"#Open-Sora\">Open-Sora: Revealing Complete Model Parameters, Training Details, and Everything for Sora-like Video Generation Models\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#Colossal-LLaMA-2\">Colossal-LLaMA-2: One Half-Day of Training Using a Few Hundred Dollars Yields Similar Results to Mainstream Large Models, Open-Source and Commercial-Free Domain-Specific Llm Solution\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#ColossalChat\">ColossalChat: An Open-Source Solution for Cloning ChatGPT With a Complete RLHF Pipeline\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#AIGC\">AIGC: Acceleration of Stable Diffusion\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#Biomedicine\">Biomedicine: Acceleration of AlphaFold Protein Structure\u003C\u002Fa>\u003C\u002Fli>\n   \u003C\u002Ful>\n \u003C\u002Fli>\n \u003Cli>\n   \u003Ca href=\"#Parallel-Training-Demo\">Parallel Training Demo\u003C\u002Fa>\n   \u003Cul>\n     \u003Cli>\u003Ca href=\"#LLaMA3\">LLaMA 1\u002F2\u002F3 \u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#MoE\">MoE\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#GPT-3\">GPT-3\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#GPT-2\">GPT-2\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#BERT\">BERT\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#PaLM\">PaLM\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#OPT\">OPT\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#ViT\">ViT\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#Recommendation-System-Models\">Recommendation System Models\u003C\u002Fa>\u003C\u002Fli>\n   \u003C\u002Ful>\n \u003C\u002Fli>\n \u003Cli>\n   \u003Ca href=\"#Single-GPU-Training-Demo\">Single GPU Training Demo\u003C\u002Fa>\n   \u003Cul>\n     \u003Cli>\u003Ca href=\"#GPT-2-Single\">GPT-2\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#PaLM-Single\">PaLM\u003C\u002Fa>\u003C\u002Fli>\n   \u003C\u002Ful>\n \u003C\u002Fli>\n \u003Cli>\n   \u003Ca href=\"#Inference\">Inference\u003C\u002Fa>\n   \u003Cul>\n     \u003Cli>\u003Ca href=\"#Colossal-Inference\">Colossal-Inference: Large AI  Models Inference Speed Doubled\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#Grok-1\">Grok-1: 314B model of PyTorch + HuggingFace Inference\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#SwiftInfer\">SwiftInfer:Breaks the Length Limit of LLM for Multi-Round Conversations with 46% Acceleration\u003C\u002Fa>\u003C\u002Fli>\n   \u003C\u002Ful>\n \u003C\u002Fli>\n \u003Cli>\n   \u003Ca href=\"#Installation\">Installation\u003C\u002Fa>\n   \u003Cul>\n     \u003Cli>\u003Ca href=\"#PyPI\">PyPI\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#Install-From-Source\">Install From Source\u003C\u002Fa>\u003C\u002Fli>\n   \u003C\u002Ful>\n \u003C\u002Fli>\n \u003Cli>\u003Ca href=\"#Use-Docker\">Use Docker\u003C\u002Fa>\u003C\u002Fli>\n \u003Cli>\u003Ca href=\"#Community\">Community\u003C\u002Fa>\u003C\u002Fli>\n \u003Cli>\u003Ca href=\"#Contributing\">Contributing\u003C\u002Fa>\u003C\u002Fli>\n \u003Cli>\u003Ca href=\"#Cite-Us\">Cite Us\u003C\u002Fa>\u003C\u002Fli>\n\u003C\u002Ful>\n\n## Why Colossal-AI\n\u003Cdiv align=\"center\">\n   \u003Ca href=\"https:\u002F\u002Fyoutu.be\u002FKnXSfjqkKN0\">\n   \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_7f26a73cebb4.png\" width=\"600\" \u002F>\n   \u003C\u002Fa>\n\n   Prof. James Demmel (UC Berkeley): Colossal-AI makes training AI models efficient, easy, and scalable.\n\u003C\u002Fdiv>\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">back to top\u003C\u002Fa>)\u003C\u002Fp>\n\n## Features\n\nColossal-AI provides a collection of parallel components for you. We aim to support you to write your\ndistributed deep learning models just like how you write your model on your laptop. We provide user-friendly tools to kickstart\ndistributed training and inference in a few lines.\n\n- Parallelism strategies\n  - Data Parallelism\n  - Pipeline Parallelism\n  - 1D, [2D](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.05343), [2.5D](https:\u002F\u002Farxiv.org\u002Fabs\u002F2105.14500), [3D](https:\u002F\u002Farxiv.org\u002Fabs\u002F2105.14450) Tensor Parallelism\n  - [Sequence Parallelism](https:\u002F\u002Farxiv.org\u002Fabs\u002F2105.13120)\n  - [Zero Redundancy Optimizer (ZeRO)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.02054)\n  - [Auto-Parallelism](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.02599)\n\n- Heterogeneous Memory Management\n  - [PatrickStar](https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.05818)\n\n- Friendly Usage\n  - Parallelism based on the configuration file\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">back to top\u003C\u002Fa>)\u003C\u002Fp>\n\n## Colossal-AI in the Real World\n### Open-Sora\n\n[Open-Sora](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora)：Revealing Complete Model Parameters, Training Details, and Everything for Sora-like Video Generation Models\n[[code]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora)\n[[blog]](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fopen-sora-from-hpc-ai-tech-team-continues-open-source-generate-any-16-second-720p-hd-video-with-one-click-model-weights-ready-to-use)\n[[Model weights]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora?tab=readme-ov-file#model-weights)\n[[Demo]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora?tab=readme-ov-file#-latest-demo)\n[[GPU Cloud Playground]](https:\u002F\u002Fcloud.luchentech.com\u002F)\n[[OpenSora Image]](https:\u002F\u002Fcloud.luchentech.com\u002Fdoc\u002Fdocs\u002Fimage\u002Fopen-sora\u002F)\n\n\u003Cdiv align=\"center\">\n   \u003Ca href=\"https:\u002F\u002Fyoutu.be\u002FilMQpU71ddI?si=J4JSPzZ03ycYmlki\">\n   \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_5a7cad61e54a.png\" width=\"700\" \u002F>\n   \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">back to top\u003C\u002Fa>)\u003C\u002Fp>\n\n### Colossal-LLaMA-2\n\n[[GPU Cloud Playground]](https:\u002F\u002Fcloud.luchentech.com\u002F)\n[[LLaMA3 Image]](https:\u002F\u002Fcloud.luchentech.com\u002Fdoc\u002Fdocs\u002Fimage\u002Fllama)\n\n- 7B: One half-day of training using a few hundred dollars yields similar results to mainstream large models, open-source and commercial-free domain-specific LLM solution.\n[[code]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fapplications\u002FColossal-LLaMA-2)\n[[blog]](https:\u002F\u002Fwww.hpc-ai.tech\u002Fblog\u002Fone-half-day-of-training-using-a-few-hundred-dollars-yields-similar-results-to-mainstream-large-models-open-source-and-commercial-free-domain-specific-llm-solution)\n[[HuggingFace model weights]](https:\u002F\u002Fhuggingface.co\u002Fhpcai-tech\u002FColossal-LLaMA-2-7b-base)\n[[Modelscope model weights]](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fcolossalai\u002FColossal-LLaMA-2-7b-base\u002Fsummary)\n\n- 13B: Construct refined 13B private model with just $5000 USD.\n[[code]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fapplications\u002FColossal-LLaMA-2)\n[[blog]](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fcolossal-llama-2-13b)\n[[HuggingFace model weights]](https:\u002F\u002Fhuggingface.co\u002Fhpcai-tech\u002FColossal-LLaMA-2-13b-base)\n[[Modelscope model weights]](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fcolossalai\u002FColossal-LLaMA-2-13b-base\u002Fsummary)\n\n|              Model              |  Backbone  | Tokens Consumed |     MMLU (5-shot)    | CMMLU (5-shot)| AGIEval (5-shot) | GAOKAO (0-shot) | CEval (5-shot)  |\n| :-----------------------------: | :--------: | :-------------: | :------------------: | :-----------: | :--------------: | :-------------: | :-------------: |\n|          Baichuan-7B            |     -      |      1.2T       |    42.32 (42.30)     | 44.53 (44.02) |        38.72     |       36.74     |       42.80     |\n|       Baichuan-13B-Base         |     -      |      1.4T       |    50.51 (51.60)     | 55.73 (55.30) |        47.20     |       51.41     |       53.60     |\n|       Baichuan2-7B-Base         |     -      |      2.6T       |    46.97 (54.16)     | 57.67 (57.07) |        45.76     |       52.60     |       54.00     |\n|       Baichuan2-13B-Base        |     -      |      2.6T       |    54.84 (59.17)     | 62.62 (61.97) |        52.08     |       58.25     |       58.10     |\n|           ChatGLM-6B            |     -      |      1.0T       |    39.67 (40.63)     |   41.17 (-)   |        40.10     |       36.53     |       38.90     |\n|          ChatGLM2-6B            |     -      |      1.4T       |    44.74 (45.46)     |   49.40 (-)   |        46.36     |       45.49     |       51.70     |\n|          InternLM-7B            |     -      |      1.6T       |    46.70 (51.00)     |   52.00 (-)   |        44.77     |       61.64     |       52.80     |\n|            Qwen-7B              |     -      |      2.2T       |    54.29 (56.70)     | 56.03 (58.80) |        52.47     |       56.42     |       59.60     |\n|           Llama-2-7B            |     -      |      2.0T       |    44.47 (45.30)     |   32.97 (-)   |        32.60     |       25.46     |         -       |\n| Linly-AI\u002FChinese-LLaMA-2-7B-hf  | Llama-2-7B |      1.0T       |        37.43         |     29.92     |        32.00     |       27.57     |         -       |\n| wenge-research\u002Fyayi-7b-llama2   | Llama-2-7B |        -        |        38.56         |     31.52     |        30.99     |       25.95     |         -       |\n| ziqingyang\u002Fchinese-llama-2-7b   | Llama-2-7B |        -        |        33.86         |     34.69     |        34.52     |       25.18     |        34.2     |\n| TigerResearch\u002Ftigerbot-7b-base  | Llama-2-7B |      0.3T       |        43.73         |     42.04     |        37.64     |       30.61     |         -       |\n|  LinkSoul\u002FChinese-Llama-2-7b    | Llama-2-7B |        -        |        48.41         |     38.31     |        38.45     |       27.72     |         -       |\n|       FlagAlpha\u002FAtom-7B         | Llama-2-7B |      0.1T       |        49.96         |     41.10     |        39.83     |       33.00     |         -       |\n| IDEA-CCNL\u002FZiya-LLaMA-13B-v1.1   | Llama-13B  |      0.11T      |        50.25         |     40.99     |        40.04     |       30.54     |         -       |\n|  **Colossal-LLaMA-2-7b-base**   | Llama-2-7B |   **0.0085T**   |        53.06         |     49.89     |        51.48     |       58.82     |        50.2     |\n|  **Colossal-LLaMA-2-13b-base**  | Llama-2-13B |   **0.025T**    |        56.42         |     61.80     |        54.69     |       69.53     |        60.3     |\n\n\n### ColossalChat\n\n\u003Cdiv align=\"center\">\n   \u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=HcTiHzApHm0\">\n   \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_640918410b82.png\" width=\"700\" \u002F>\n   \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n[ColossalChat](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fapplications\u002FChat): An open-source solution for cloning [ChatGPT](https:\u002F\u002Fopenai.com\u002Fblog\u002Fchatgpt\u002F) with a complete RLHF pipeline.\n[[code]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fapplications\u002FChat)\n[[blog]](https:\u002F\u002Fmedium.com\u002F@yangyou_berkeley\u002Fcolossalchat-an-open-source-solution-for-cloning-chatgpt-with-a-complete-rlhf-pipeline-5edf08fb538b)\n[[demo]](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=HcTiHzApHm0)\n[[tutorial]](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=-qFBZFmOJfg)\n\n\u003Cp id=\"ColossalChat-Speed\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_3eeda076eea4.jpg\" width=450\u002F>\n\u003C\u002Fp>\n\n- Up to 10 times faster for RLHF PPO Stage3 Training\n\n\u003Cp id=\"ColossalChat_scaling\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_ee6ffde32c45.png\" width=800\u002F>\n\u003C\u002Fp>\n\n- Up to 7.73 times faster for single server training and 1.42 times faster for single-GPU inference\n\n\u003Cp id=\"ColossalChat-1GPU\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_e8180b57912c.jpg\" width=450\u002F>\n\u003C\u002Fp>\n\n- Up to 10.3x growth in model capacity on one GPU\n- A mini demo training process requires only 1.62GB of GPU memory (any consumer-grade GPU)\n\n\u003Cp id=\"ColossalChat-LoRA\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_b55a7973e16f.jpg\" width=600\u002F>\n\u003C\u002Fp>\n\n- Increase the capacity of the fine-tuning model by up to 3.7 times on a single GPU\n- Keep at a sufficiently high running speed\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">back to top\u003C\u002Fa>)\u003C\u002Fp>\n\n\n### AIGC\nAcceleration of AIGC (AI-Generated Content) models such as [Stable Diffusion v1](https:\u002F\u002Fgithub.com\u002FCompVis\u002Fstable-diffusion) and [Stable Diffusion v2](https:\u002F\u002Fgithub.com\u002FStability-AI\u002Fstablediffusion).\n\u003Cp id=\"diffusion_train\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_27e3a110551e.png\" width=800\u002F>\n\u003C\u002Fp>\n\n- [Training](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Fimages\u002Fdiffusion): Reduce Stable Diffusion memory consumption by up to 5.6x and hardware cost by up to 46x (from A100 to RTX3060).\n\n\u003Cp id=\"diffusion_demo\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_d3ff4911276b.png\" width=800\u002F>\n\u003C\u002Fp>\n\n- [DreamBooth Fine-tuning](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Fimages\u002Fdreambooth): Personalize your model using just 3-5 images of the desired subject.\n\n\u003Cp id=\"inference-sd\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_a13e5c6f5274.jpg\" width=800\u002F>\n\u003C\u002Fp>\n\n- [Inference](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Fimages\u002Fdiffusion): Reduce inference GPU memory consumption by 2.5x.\n\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">back to top\u003C\u002Fa>)\u003C\u002Fp>\n\n### Biomedicine\nAcceleration of [AlphaFold Protein Structure](https:\u002F\u002Falphafold.ebi.ac.uk\u002F)\n\n\u003Cp id=\"FastFold\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_6684dcef2cbd.jpg\" width=800\u002F>\n\u003C\u002Fp>\n\n- [FastFold](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FFastFold): Accelerating training and inference on GPU Clusters, faster data processing, inference sequence containing more than 10000 residues.\n\n\u003Cp id=\"FastFold-Intel\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_a6ab1aef6829.jpg\" width=600\u002F>\n\u003C\u002Fp>\n\n- [FastFold with Intel](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FFastFold): 3x inference acceleration and 39% cost reduce.\n\n\u003Cp id=\"xTrimoMultimer\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_39ccf437e152.jpg\" width=800\u002F>\n\u003C\u002Fp>\n\n- [xTrimoMultimer](https:\u002F\u002Fgithub.com\u002Fbiomap-research\u002FxTrimoMultimer): accelerating structure prediction of protein monomers and multimer by 11x.\n\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">back to top\u003C\u002Fa>)\u003C\u002Fp>\n\n## Parallel Training Demo\n### LLaMA3\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_c0c2dfafa4fa.png\" width=600\u002F>\n\u003C\u002Fp>\n\n- 70 billion parameter LLaMA3 model training accelerated by 18%\n[[code]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Flanguage\u002Fllama)\n[[GPU Cloud Playground]](https:\u002F\u002Fcloud.luchentech.com\u002F)\n[[LLaMA3 Image]](https:\u002F\u002Fcloud.luchentech.com\u002Fdoc\u002Fdocs\u002Fimage\u002Fllama)\n\n### LLaMA2\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_652ab748b460.png\" width=600\u002F>\n\u003C\u002Fp>\n\n- 70 billion parameter LLaMA2 model training accelerated by 195%\n[[code]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Flanguage\u002Fllama)\n[[blog]](https:\u002F\u002Fwww.hpc-ai.tech\u002Fblog\u002F70b-llama2-training)\n\n### LLaMA1\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_8cb7944afde8.png\" width=600\u002F>\n\u003C\u002Fp>\n\n- 65-billion-parameter large model pretraining accelerated by 38%\n[[code]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Flanguage\u002Fllama)\n[[blog]](https:\u002F\u002Fwww.hpc-ai.tech\u002Fblog\u002Flarge-model-pretraining)\n\n### MoE\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_f05769580ec9.png\" width=800\u002F>\n\u003C\u002Fp>\n\n- Enhanced MoE parallelism, Open-source MoE model training can be 9 times more efficient\n[[code]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Flanguage\u002Fopenmoe)\n[[blog]](https:\u002F\u002Fwww.hpc-ai.tech\u002Fblog\u002Fenhanced-moe-parallelism-open-source-moe-model-training-can-be-9-times-more-efficient)\n\n### GPT-3\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_f81e4bfe23c8.png\" width=700\u002F>\n\u003C\u002Fp>\n\n- Save 50% GPU resources and 10.7% acceleration\n\n### GPT-2\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_10d0568f335d.png\" width=800\u002F>\n\n- 11x lower GPU memory consumption, and superlinear scaling efficiency with Tensor Parallelism\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_8c61dfd6f7b1.png\" width=800>\n\n- 24x larger model size on the same hardware\n- over 3x acceleration\n### BERT\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_0d385886a2f0.png\" width=800\u002F>\n\n- 2x faster training, or 50% longer sequence length\n\n### PaLM\n- [PaLM-colossalai](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FPaLM-colossalai): Scalable implementation of Google's Pathways Language Model ([PaLM](https:\u002F\u002Fai.googleblog.com\u002F2022\u002F04\u002Fpathways-language-model-palm-scaling-to.html)).\n\n### OPT\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_6f0c18c02a9e.png\" width=800\u002F>\n\n- [Open Pretrained Transformer (OPT)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fmetaseq), a 175-Billion parameter AI language model released by Meta, which stimulates AI programmers to perform various downstream tasks and application deployments because of public pre-trained model weights.\n- 45% speedup fine-tuning OPT at low cost in lines. [[Example]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Flanguage\u002Fopt) [[Online Serving]](https:\u002F\u002Fcolossalai.org\u002Fdocs\u002Fadvanced_tutorials\u002Fopt_service)\n\nPlease visit our [documentation](https:\u002F\u002Fwww.colossalai.org\u002F) and [examples](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples) for more details.\n\n### ViT\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_815210098a77.png\" width=\"450\" \u002F>\n\u003C\u002Fp>\n\n- 14x larger batch size, and 5x faster training for Tensor Parallelism = 64\n\n### Recommendation System Models\n- [Cached Embedding](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FCachedEmbedding), utilize software cache to train larger embedding tables with a smaller GPU memory budget.\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">back to top\u003C\u002Fa>)\u003C\u002Fp>\n\n## Single GPU Training Demo\n\n### GPT-2\n\u003Cp id=\"GPT-2-Single\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_5e25975df53f.png\" width=450\u002F>\n\u003C\u002Fp>\n\n- 20x larger model size on the same hardware\n\n\u003Cp id=\"GPT-2-NVME\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_0e3ab8cb57d5.png\" width=800\u002F>\n\u003C\u002Fp>\n\n- 120x larger model size on the same hardware (RTX 3080)\n\n### PaLM\n\u003Cp id=\"PaLM-Single\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_ca0b3186f2ff.png\" width=450\u002F>\n\u003C\u002Fp>\n\n- 34x larger model size on the same hardware\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">back to top\u003C\u002Fa>)\u003C\u002Fp>\n\n\n## Inference\n### Colossal-Inference\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_08e0d611d749.png\" width=1000\u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_1a74bb9944f9.png\" width=1000\u002F>\n\u003C\u002Fp>\n\n - Large AI models inference speed doubled, compared to the offline inference performance of vLLM in some cases.\n[[code]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fcolossalai\u002Finference)\n[[blog]](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fcolossal-inference)\n[[GPU Cloud Playground]](https:\u002F\u002Fcloud.luchentech.com\u002F)\n[[LLaMA3 Image]](https:\u002F\u002Fcloud.luchentech.com\u002Fdoc\u002Fdocs\u002Fimage\u002Fllama)\n\n### Grok-1\n\u003Cp id=\"Grok-1\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_b4b105effe65.jpg\" width=600\u002F>\n\u003C\u002Fp>\n\n - 314 Billion Parameter Grok-1 Inference Accelerated by 3.8x, an easy-to-use Python + PyTorch + HuggingFace version for Inference.\n\n[[code]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Flanguage\u002Fgrok-1)\n[[blog]](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002F314-billion-parameter-grok-1-inference-accelerated-by-3.8x-efficient-and-easy-to-use-pytorchhuggingface-version-is-here)\n[[HuggingFace Grok-1 PyTorch model weights]](https:\u002F\u002Fhuggingface.co\u002Fhpcai-tech\u002Fgrok-1)\n[[ModelScope Grok-1 PyTorch model weights]](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fcolossalai\u002Fgrok-1-pytorch\u002Fsummary)\n\n### SwiftInfer\n\u003Cp id=\"SwiftInfer\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_b78d37b53500.jpg\" width=800\u002F>\n\u003C\u002Fp>\n\n- [SwiftInfer](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FSwiftInfer): Inference performance improved by 46%, open source solution breaks the length limit of LLM for multi-round conversations\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">back to top\u003C\u002Fa>)\u003C\u002Fp>\n\n## Installation\n\nRequirements:\n- PyTorch >= 2.2\n- Python >= 3.7\n- CUDA >= 11.0\n- [NVIDIA GPU Compute Capability](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-gpus) >= 7.0 (V100\u002FRTX20 and higher)\n- Linux OS\n\nIf you encounter any problem with installation, you may want to raise an [issue](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fissues\u002Fnew\u002Fchoose) in this repository.\n\n### Install from PyPI\n\nYou can easily install Colossal-AI with the following command. **By default, we do not build PyTorch extensions during installation.**\n\n```bash\npip install colossalai\n```\n\n**Note: only Linux is supported for now.**\n\nHowever, if you want to build the PyTorch extensions during installation, you can set `BUILD_EXT=1`.\n\n```bash\nBUILD_EXT=1 pip install colossalai\n```\n\n**Otherwise, CUDA kernels will be built during runtime when you actually need them.**\n\nWe also keep releasing the nightly version to PyPI every week. This allows you to access the unreleased features and bug fixes in the main branch.\nInstallation can be made via\n\n```bash\npip install colossalai-nightly\n```\n\n### Download From Source\n\n> The version of Colossal-AI will be in line with the main branch of the repository. Feel free to raise an issue if you encounter any problems. :)\n\n```shell\ngit clone https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI.git\ncd ColossalAI\n\n# install colossalai\npip install .\n```\n\nBy default, we do not compile CUDA\u002FC++ kernels. ColossalAI will build them during runtime.\nIf you want to install and enable CUDA kernel fusion (compulsory installation when using fused optimizer):\n\n```shell\nBUILD_EXT=1 pip install .\n```\n\nFor Users with CUDA 10.2, you can still build ColossalAI from source. However, you need to manually download the cub library and copy it to the corresponding directory.\n\n```bash\n# clone the repository\ngit clone https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI.git\ncd ColossalAI\n\n# download the cub library\nwget https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcub\u002Farchive\u002Frefs\u002Ftags\u002F1.8.0.zip\nunzip 1.8.0.zip\ncp -r cub-1.8.0\u002Fcub\u002F colossalai\u002Fkernel\u002Fcuda_native\u002Fcsrc\u002Fkernels\u002Finclude\u002F\n\n# install\nBUILD_EXT=1 pip install .\n```\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">back to top\u003C\u002Fa>)\u003C\u002Fp>\n\n## Use Docker\n\n### Pull from DockerHub\n\nYou can directly pull the docker image from our [DockerHub page](https:\u002F\u002Fhub.docker.com\u002Fr\u002Fhpcaitech\u002Fcolossalai). The image is automatically uploaded upon release.\n\n\n### Build On Your Own\n\nRun the following command to build a docker image from Dockerfile provided.\n\n> Building Colossal-AI from scratch requires GPU support, you need to use Nvidia Docker Runtime as the default when doing `docker build`. More details can be found [here](https:\u002F\u002Fstackoverflow.com\u002Fquestions\u002F59691207\u002Fdocker-build-with-nvidia-runtime).\n> We recommend you install Colossal-AI from our [project page](https:\u002F\u002Fwww.colossalai.org) directly.\n\n\n```bash\ncd ColossalAI\ndocker build -t colossalai .\u002Fdocker\n```\n\nRun the following command to start the docker container in interactive mode.\n\n```bash\ndocker run -ti --gpus all --rm --ipc=host colossalai bash\n```\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">back to top\u003C\u002Fa>)\u003C\u002Fp>\n\n## Community\n\nJoin the Colossal-AI community on [Forum](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fdiscussions),\n[Slack](https:\u002F\u002Fjoin.slack.com\u002Ft\u002Fcolossalaiworkspace\u002Fshared_invite\u002Fzt-z7b26eeb-CBp7jouvu~r0~lcFzX832w),\nand [WeChat(微信)](https:\u002F\u002Fraw.githubusercontent.com\u002Fhpcaitech\u002Fpublic_assets\u002Fmain\u002Fcolossalai\u002Fimg\u002FWeChat.png \"qrcode\") to share your suggestions, feedback, and questions with our engineering team.\n\n## Contributing\nReferring to the successful attempts of [BLOOM](https:\u002F\u002Fbigscience.huggingface.co\u002F) and [Stable Diffusion](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FStable_Diffusion), any and all developers and partners with computing powers, datasets, models are welcome to join and build the Colossal-AI community, making efforts towards the era of big AI models!\n\nYou may contact us or participate in the following ways:\n1. [Leaving a Star ⭐](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fstargazers) to show your like and support. Thanks!\n2. Posting an [issue](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fissues\u002Fnew\u002Fchoose), or submitting a PR on GitHub follow the guideline in [Contributing](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fblob\u002Fmain\u002FCONTRIBUTING.md)\n3. Send your official proposal to email contact@hpcaitech.com\n\nThanks so much to all of our amazing contributors!\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_668ee4138888.png\"  width=\"800px\"\u002F>\n\u003C\u002Fa>\n\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">back to top\u003C\u002Fa>)\u003C\u002Fp>\n\n\n## CI\u002FCD\n\nWe leverage the power of [GitHub Actions](https:\u002F\u002Fgithub.com\u002Ffeatures\u002Factions) to automate our development, release and deployment workflows. Please check out this [documentation](.github\u002Fworkflows\u002FREADME.md) on how the automated workflows are operated.\n\n\n## Cite Us\n\nThis project is inspired by some related projects (some by our team and some by other organizations). We would like to credit these amazing projects as listed in the [Reference List](.\u002Fdocs\u002FREFERENCE.md).\n\nTo cite this project, you can use the following BibTeX citation.\n\n```\n@inproceedings{10.1145\u002F3605573.3605613,\nauthor = {Li, Shenggui and Liu, Hongxin and Bian, Zhengda and Fang, Jiarui and Huang, Haichen and Liu, Yuliang and Wang, Boxiang and You, Yang},\ntitle = {Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training},\nyear = {2023},\nisbn = {9798400708435},\npublisher = {Association for Computing Machinery},\naddress = {New York, NY, USA},\nurl = {https:\u002F\u002Fdoi.org\u002F10.1145\u002F3605573.3605613},\ndoi = {10.1145\u002F3605573.3605613},\nabstract = {The success of Transformer models has pushed the deep learning model scale to billions of parameters, but the memory limitation of a single GPU has led to an urgent need for training on multi-GPU clusters. However, the best practice for choosing the optimal parallel strategy is still lacking, as it requires domain expertise in both deep learning and parallel computing. The Colossal-AI system addressed the above challenge by introducing a unified interface to scale your sequential code of model training to distributed environments. It supports parallel training methods such as data, pipeline, tensor, and sequence parallelism and is integrated with heterogeneous training and zero redundancy optimizer. Compared to the baseline system, Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.},\nbooktitle = {Proceedings of the 52nd International Conference on Parallel Processing},\npages = {766–775},\nnumpages = {10},\nkeywords = {datasets, gaze detection, text tagging, neural networks},\nlocation = {Salt Lake City, UT, USA},\nseries = {ICPP '23}\n}\n```\n\nColossal-AI has been accepted as official tutorial by top conferences [NeurIPS](https:\u002F\u002Fnips.cc\u002F), [SC](https:\u002F\u002Fsc22.supercomputing.org\u002F), [AAAI](https:\u002F\u002Faaai.org\u002FConferences\u002FAAAI-23\u002F),\n[PPoPP](https:\u002F\u002Fppopp23.sigplan.org\u002F), [CVPR](https:\u002F\u002Fcvpr2023.thecvf.com\u002F), [ISC](https:\u002F\u002Fwww.isc-hpc.com\u002F), [NVIDIA GTC](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcspring23-S51482\u002F) ,etc.\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">back to top\u003C\u002Fa>)\u003C\u002Fp>\n","# Colossal-AI\n\u003Cdiv id=\"top\" align=\"center\">\n\n   [![logo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_a0721a485ad2.png)](https:\u002F\u002Fwww.colossalai.org\u002F)\n\n   Colossal-AI：让大型AI模型更经济、更快、更易用\n\n   \u003Ch3> \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.14883\"> 论文 \u003C\u002Fa> |\n   \u003Ca href=\"https:\u002F\u002Fwww.colossalai.org\u002F\"> 文档 \u003C\u002Fa> |\n   \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\"> 示例 \u003C\u002Fa> |\n   \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fdiscussions\"> 论坛 \u003C\u002Fa> |\n   \u003Ca href=\"https:\u002F\u002Fcolossalai.org\u002Fzh-Hans\u002Fdocs\u002Fget_started\u002Fbonus\u002F\"> GPU云平台 \u003C\u002Fa> |\n   \u003Ca href=\"https:\u002F\u002Fhpc-ai.com\u002Fblog\"> 博客 \u003C\u002Fa>\u003C\u002Fh3>\n\n   [![GitHub仓库星标数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhpcaitech\u002FColossalAI?style=social)](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fstargazers)\n   [![构建状态](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Factions\u002Fworkflows\u002Fbuild_on_schedule.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Factions\u002Fworkflows\u002Fbuild_on_schedule.yml)\n   [![文档](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_6bf48b3e9a6d.png)](https:\u002F\u002Fcolossalai.readthedocs.io\u002Fen\u002Flatest\u002F?badge=latest)\n   [![CodeFactor](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_9ee0cb95ac54.png)](https:\u002F\u002Fwww.codefactor.io\u002Frepository\u002Fgithub\u002Fhpcaitech\u002Fcolossalai)\n   [![HuggingFace徽章](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97HuggingFace-Join-yellow)](https:\u002F\u002Fhuggingface.co\u002Fhpcai-tech)\n   [![Slack徽章](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSlack-join-blueviolet?logo=slack&amp)](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002Fpublic_assets\u002Ftree\u002Fmain\u002Fcolossalai\u002Fcontact\u002Fslack)\n   [![微信徽章](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F微信-加入-green?logo=wechat&amp)](https:\u002F\u002Fraw.githubusercontent.com\u002Fhpcaitech\u002Fpublic_assets\u002Fmain\u002Fcolossalai\u002Fimg\u002FWeChat.png)\n\n\n   | [English](README.md) | [中文](docs\u002FREADME-zh-Hans.md) |\n\n\u003C\u002Fdiv>\n\n## 在企业级GPU上即刻运行Colossal-AI\n\n无需繁琐的配置。在[**HPC-AI云**](https:\u002F\u002Fhpc-ai.com\u002F?utm_source=github&utm_medium=social&utm_campaign=promotion-colossalai)上，您即可访问强大且预配置好的Colossal-AI环境。\n\n只需点击一下，即可训练您的模型并扩展AI工作负载！\n\n* **NVIDIA Blackwell B200s**：体验下一代AI性能（[查看基准测试结果](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fb200)）。现可在云端以低至**每小时2.47美元**的价格使用。\n* **高性价比H200集群**：按需租赁，仅需**每小时1.99美元**，即可享受顶级性能。\n\n[**立即开始并领取免费额度 →**](https:\u002F\u002Fhpc-ai.com\u002F?utm_source=github&utm_medium=social&utm_campaign=promotion-colossalai)\n\n\u003Cdiv align=\"center\">\n   \u003Ca href=\"https:\u002F\u002Fhpc-ai.com\u002F?utm_source=github&utm_medium=social&utm_campaign=promotion-colossalai\">\n   \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_ded760bdf45d.png\" width=\"850\" \u002F>\n   \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n## 以半价畅享顶尖开源模型\n\n省去麻烦。通过[**HPC-AI模型API**](https:\u002F\u002Fhpc-ai.com\u002Fmodel-apis?utm_source=github&utm_medium=social&utm_campaign=promotion-colossalai)，您可以无缝访问强大的长上下文LLM。\n\n使用HPC-AI模型API构建您的AI智能体、聊天机器人和RAG应用吧！\n\n* **最新最全模型**：体验Kimi 2.5、MiniMax 2.5和GLM 5.1等最先进的性能。非常适合处理超过200万token的超大上下文窗口及复杂编码任务。\n\n* **无与伦比的价格**：不再为API端点支付过高费用。以比OpenRouter低至50%的价格获得顶级推理速度。\n\n[**立即开始并领取4美元免费额度 →**](https:\u002F\u002Fwww.hpc-ai.com\u002Faccount\u002Fsignup?redirectUrl=\u002Fmodels-console\u002Fmodels&invitation_code=HPCAI-MAPI&utm_source=google&utm_medium=social&utm_id=newlaunch)\n\n\u003Cdiv align=\"center\">\n   \u003Ca href=\"https:\u002F\u002Fhpc-ai.com\u002Fmodel-apis?utm_source=github&utm_medium=social&utm_campaign=promotion-colossalai\">\n   \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_bfe1818377ba.png\" width=\"850\" \u002F>\n   \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n### Colossal-AI基准测试\n\n为验证这些性能提升在实际应用中的效果，我们使用Colossal-AI对类似Llama的模型进行了大规模语言模型训练基准测试。测试分别在8卡和16卡配置下进行，对应7B和70B规模的模型。\n\n|              GPU              |  GPUs  | 模型大小 |    并行策略    | 每个数据并行组的批量大小 | 序列长度 | 吞吐量 | TFLOPS\u002FGPU  | 峰值显存(MiB)  |\n| :-----------------------------: | :--------: | :-------------: | :------------------: | :-----------: | :--------------: | :-------------: | :-------------: | :-------------: |\n|         H200            |     8     |      7B       |   zero2(dp8)     | 36 |        4096     |       17.13 样本\u002F秒     |       534.18     |       119040.02     |\n|         H200            |     16     |      70B       |   zero2     | 48 |        4096     |       3.27 样本\u002F秒     |       469.1     |       150032.23     |\n|         B200            |     8     |      7B       |   zero1(dp2)+tp2+pp4     | 128 |        4096     |       25.83 样本\u002F秒     |       805.69     |       100119.77     |\n|         H200            |     16     |      70B       |   zero1(dp2)+tp2+pp4     | 128 |        4096     |       5.66 样本\u002F秒     |       811.79     |       100072.02     |\n\nColossal-AI基准测试的结果提供了极具实用价值的洞察。对于8卡上的7B模型，**B200的吞吐量高出50%**，且每GPU的TFLOPS显著提升。而对于16卡上的70B模型，B200同样展现出明显优势，其**吞吐量和每GPU的TFLOPS均高出70%以上**。这些数据表明，B200的性能提升能够直接转化为大规模模型更短的训练时间。\n\n## 最新消息\n* [2025\u002F02] [DeepSeek 671B 微调指南曝光——一键解锁升级版 DeepSeek 套件，AI 爱好者欣喜若狂！](https:\u002F\u002Fcompany.hpc-ai.com\u002Fblog\u002Fshocking-release-deepseek-671b-fine-tuning-guide-revealed-unlock-the-upgraded-deepseek-suite-with-one-click-ai-players-ecstatic)\n* [2024\u002F12] [视频生成模型的开发成本节省了50%！现提供开源解决方案，并附赠 H200 GPU 代金券](https:\u002F\u002Fcompany.hpc-ai.com\u002Fblog\u002Fthe-development-cost-of-video-generation-models-has-saved-by-50-open-source-solutions-are-now-available-with-h200-gpu-vouchers) [[代码]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora\u002Fblob\u002Fmain\u002Fscripts\u002Ftrain.py) [[代金券]](https:\u002F\u002Fcolossalai.org\u002Fzh-Hans\u002Fdocs\u002Fget_started\u002Fbonus\u002F)\n* [2024\u002F10] [如何构建低成本的 Sora 类应用？为您提供的解决方案](https:\u002F\u002Fcompany.hpc-ai.com\u002Fblog\u002Fhow-to-build-a-low-cost-sora-like-app-solutions-for-you)\n* [2024\u002F09] [新加坡初创公司 HPC-AI Tech 获得 5000 万美元 A 轮融资，用于打造视频生成 AI 模型和 GPU 平台](https:\u002F\u002Fcompany.hpc-ai.com\u002Fblog\u002Fsingapore-startup-hpc-ai-tech-secures-50-million-usd-in-series-a-funding-to-build-the-video-generation-ai-model-and-gpu-platform)\n* [2024\u002F09] [通过 FP8 混合精度训练升级，仅需一行代码即可将 AI 大模型训练成本降低 30%](https:\u002F\u002Fcompany.hpc-ai.com\u002Fblog\u002Freducing-ai-large-model-training-costs-by-30-requires-just-a-single-line-of-code-from-fp8-mixed-precision-training-upgrades)\n* [2024\u002F06] [Open-Sora 继续开源：一键生成任意 16 秒 720p 高清视频，模型权重即用](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fopen-sora-from-hpc-ai-tech-team-continues-open-source-generate-any-16-second-720p-hd-video-with-one-click-model-weights-ready-to-use)\n* [2024\u002F05] [大型 AI 模型推理速度翻倍，Colossal-Inference 开源发布](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fcolossal-inference)\n* [2024\u002F04] [Open-Sora 全面升级：拥抱开源，支持单次生成 16 秒、720p 分辨率视频](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fopen-soras-comprehensive-upgrade-unveiled-embracing-16-second-video-generation-and-720p-resolution-in-open-source)\n* [2024\u002F04] [针对 LLaMA3 系列量身定制的最具性价比的推理、微调和预训练解决方案](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fmost-cost-effective-solutions-for-inference-fine-tuning-and-pretraining-tailored-to-llama3-series)\n\n## 目录\n\u003Cul>\n \u003Cli>\u003Ca href=\"#Why-Colossal-AI\">为什么选择 Colossal-AI\u003C\u002Fa> \u003C\u002Fli>\n \u003Cli>\u003Ca href=\"#Features\">功能特性\u003C\u002Fa> \u003C\u002Fli>\n \u003Cli>\n   \u003Ca href=\"#Colossal-AI-in-the-Real-World\">Colossal-AI 在实际应用中的表现\u003C\u002Fa>\n   \u003Cul>\n     \u003Cli>\u003Ca href=\"#Open-Sora\">Open-Sora：公开完整模型参数、训练细节及所有 Sora 类视频生成模型相关信息\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#Colossal-LLaMA-2\">Colossal-LLaMA-2：只需花费几百美元进行半天训练，效果即可媲美主流大模型；开源且无商业限制的领域专用 LLM 解决方案\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#ColossalChat\">ColossalChat：包含完整 RLHF 流程的开源 ChatGPT 克隆解决方案\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#AIGC\">AIGC：Stable Diffusion 加速\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#Biomedicine\">生物医学：AlphaFold 蛋白质结构预测加速\u003C\u002Fa>\u003C\u002Fli>\n   \u003C\u002Ful>\n \u003C\u002Fli>\n \u003Cli>\n   \u003Ca href=\"#Parallel-Training-Demo\">并行训练演示\u003C\u002Fa>\n   \u003Cul>\n     \u003Cli>\u003Ca href=\"#LLaMA3\">LLaMA 1\u002F2\u002F3 \u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#MoE\">MoE\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#GPT-3\">GPT-3\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#GPT-2\">GPT-2\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#BERT\">BERT\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#PaLM\">PaLM\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#OPT\">OPT\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#ViT\">ViT\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#Recommendation-System-Models\">推荐系统模型\u003C\u002Fa>\u003C\u002Fli>\n   \u003C\u002Ful>\n \u003C\u002Fli>\n \u003Cli>\n   \u003Ca href=\"#Single-GPU-Training-Demo\">单 GPU 训练演示\u003C\u002Fa>\n   \u003Cul>\n     \u003Cli>\u003Ca href=\"#GPT-2-Single\">GPT-2\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#PaLM-Single\">PaLM\u003C\u002Fa>\u003C\u002Fli>\n   \u003C\u002Ful>\n \u003C\u002Fli>\n \u003Cli>\n   \u003Ca href=\"#Inference\">推理\u003C\u002Fa>\n   \u003Cul>\n     \u003Cli>\u003Ca href=\"#Colossal-Inference\">Colossal-Inference：大型 AI 模型推理速度翻倍\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#Grok-1\">Grok-1：PyTorch + HuggingFace 的 314B 模型推理\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#SwiftInfer\">SwiftInfer：以 46% 的加速打破 LLM 多轮对话的长度限制\u003C\u002Fa>\u003C\u002Fli>\n   \u003C\u002Ful>\n \u003C\u002Fli>\n \u003Cli>\n   \u003Ca href=\"#Installation\">安装\u003C\u002Fa>\n   \u003Cul>\n     \u003Cli>\u003Ca href=\"#PyPI\">PyPI\u003C\u002Fa>\u003C\u002Fli>\n     \u003Cli>\u003Ca href=\"#Install-From-Source\">从源码安装\u003C\u002Fa>\u003C\u002Fli>\n   \u003C\u002Ful>\n \u003C\u002Fli>\n \u003Cli>\u003Ca href=\"#Use-Docker\">使用 Docker\u003C\u002Fa>\u003C\u002Fli>\n \u003Cli>\u003Ca href=\"#Community\">社区\u003C\u002Fa>\u003C\u002Fli>\n \u003Cli>\u003Ca href=\"#Contributing\">贡献代码\u003C\u002Fa>\u003C\u002Fli>\n \u003Cli>\u003Ca href=\"#Cite-Us\">引用我们\u003C\u002Fa>\u003C\u002Fli>\n\u003C\u002Ful>\n\n## 为什么选择 Colossal-AI\n\u003Cdiv align=\"center\">\n   \u003Ca href=\"https:\u002F\u002Fyoutu.be\u002FKnXSfjqkKN0\">\n   \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_7f26a73cebb4.png\" width=\"600\" \u002F>\n   \u003C\u002Fa>\n\n   詹姆斯·德梅尔教授（加州大学伯克利分校）：Colossal-AI 使 AI 模型的训练高效、简单且可扩展。\n\u003C\u002Fdiv>\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">返回顶部\u003C\u002Fa>)\u003C\u002Fp>\n\n## 功能特性\n\nColossal-AI 为您提供一系列并行化组件。我们的目标是让您像在笔记本电脑上编写模型一样轻松地编写分布式深度学习模型。我们提供友好的工具，只需几行代码即可启动分布式训练和推理。\n\n- 并行策略：\n  - 数据并行\n  - 流水线并行\n  - 1D、[2D](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.05343)、[2.5D](https:\u002F\u002Farxiv.org\u002Fabs\u002F2105.14500)、[3D](https:\u002F\u002Farxiv.org\u002Fabs\u002F2105.14450) 张量并行\n  - [序列并行](https:\u002F\u002Farxiv.org\u002Fabs\u002F2105.13120)\n  - [Zero Redundancy Optimizer (ZeRO)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.02054)\n  - [自动并行](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.02599)\n\n- 异构内存管理：\n  - [PatrickStar](https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.05818)\n\n- 友好易用：\n  - 基于配置文件的并行化\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">返回顶部\u003C\u002Fa>)\u003C\u002Fp>\n\n## Colossal-AI 在现实世界中的应用\n\n### Open-Sora\n\n[Open-Sora](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora)：揭秘完整模型参数、训练细节以及所有与Sora类似视频生成模型相关的内容\n[[代码]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora)\n[[博客]](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fopen-sora-from-hpc-ai-tech-team-continues-open-source-generate-any-16-second-720p-hd-video-with-one-click-model-weights-ready-to-use)\n[[模型权重]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora?tab=readme-ov-file#model-weights)\n[[演示]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FOpen-Sora?tab=readme-ov-file#-latest-demo)\n[[GPU云平台]](https:\u002F\u002Fcloud.luchentech.com\u002F)\n[[OpenSora图像]](https:\u002F\u002Fcloud.luchentech.com\u002Fdoc\u002Fdocs\u002Fimage\u002Fopen-sora\u002F)\n\n\u003Cdiv align=\"center\">\n   \u003Ca href=\"https:\u002F\u002Fyoutu.be\u002FilMQpU71ddI?si=J4JSPzZ03ycYmlki\">\n   \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_5a7cad61e54a.png\" width=\"700\" \u002F>\n   \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">返回顶部\u003C\u002Fa>)\u003C\u002Fp>\n\n### Colossal-LLaMA-2\n\n[[GPU云平台]](https:\u002F\u002Fcloud.luchentech.com\u002F)\n[[LLaMA3图像]](https:\u002F\u002Fcloud.luchentech.com\u002Fdoc\u002Fdocs\u002Fimage\u002Fllama)\n\n- 7B：仅需几百美元、半天的训练，即可获得与主流大模型相当的效果，是一款开源且无商业限制的领域专用LLM解决方案。\n[[代码]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fapplications\u002FColossal-LLaMA-2)\n[[博客]](https:\u002F\u002Fwww.hpc-ai.tech\u002Fblog\u002Fone-half-day-of-training-using-a-few-hundred-dollars-yields-similar-results-to-mainstream-large-models-open-source-and-commercial-free-domain-specific-llm-solution)\n[[HuggingFace模型权重]](https:\u002F\u002Fhuggingface.co\u002Fhpcai-tech\u002FColossal-LLaMA-2-7b-base)\n[[Modelscope模型权重]](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fcolossalai\u002FColossal-LLaMA-2-7b-base\u002Fsummary)\n\n- 13B：仅需5000美元，即可构建出性能优异的13B规模私有模型。\n[[代码]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fapplications\u002FColossal-LLaMA-2)\n[[博客]](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fcolossal-llama-2-13b)\n[[HuggingFace模型权重]](https:\u002F\u002Fhuggingface.co\u002Fhpcai-tech\u002FColossal-LLaMA-2-13b-base)\n[[Modelscope模型权重]](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fcolossalai\u002FColossal-LLaMA-2-13b-base\u002Fsummary)\n\n|              模型              |  主干网络  | 消耗的token数 |     MMLU (5-shot)    | CMMLU (5-shot)| AGIEval (5-shot) | GAOKAO (0-shot) | CEval (5-shot)  |\n| :-----------------------------: | :--------: | :-------------: | :------------------: | :-----------: | :--------------: | :-------------: | :-------------: |\n|          Baichuan-7B            |     -      |      1.2T       |    42.32 (42.30)     | 44.53 (44.02) |        38.72     |       36.74     |       42.80     |\n|       Baichuan-13B-Base         |     -      |      1.4T       |    50.51 (51.60)     | 55.73 (55.30) |        47.20     |       51.41     |       53.60     |\n|       Baichuan2-7B-Base         |     -      |      2.6T       |    46.97 (54.16)     | 57.67 (57.07) |        45.76     |       52.60     |       54.00     |\n|       Baichuan2-13B-Base        |     -      |      2.6T       |    54.84 (59.17)     | 62.62 (61.97) |        52.08     |       58.25     |       58.10     |\n|           ChatGLM-6B            |     -      |      1.0T       |    39.67 (40.63)     |   41.17 (-)   |        40.10     |       36.53     |       38.90     |\n|          ChatGLM2-6B            |     -      |      1.4T       |    44.74 (45.46)     |   49.40 (-)   |        46.36     |       45.49     |       51.70     |\n|          InternLM-7B            |     -      |      1.6T       |    46.70 (51.00)     |   52.00 (-)   |        44.77     |       61.64     |       52.80     |\n|            Qwen-7B              |     -      |      2.2T       |    54.29 (56.70)     | 56.03 (58.80) |        52.47     |       56.42     |       59.60     |\n|           Llama-2-7B            |     -      |      2.0T       |    44.47 (45.30)     |   32.97 (-)   |        32.60     |       25.46     |         -       |\n| Linly-AI\u002FChinese-LLaMA-2-7B-hf  | Llama-2-7B |      1.0T       |        37.43         |     29.92     |        32.00     |       27.57     |         -       |\n| wenge-research\u002Fyayi-7b-llama2   | Llama-2-7B |        -        |        38.56         |     31.52     |        30.99     |       25.95     |         -       |\n| ziqingyang\u002Fchinese-llama-2-7b   | Llama-2-7B |        -        |        33.86         |     34.69     |        34.52     |       25.18     |        34.2     |\n| TigerResearch\u002Ftigerbot-7b-base  | Llama-2-7B |      0.3T       |        43.73         |     42.04     |        37.64     |       30.61     |         -       |\n|  LinkSoul\u002FChinese-Llama-2-7b    | Llama-2-7B |        -        |        48.41         |     38.31     |        38.45     |       27.72     |         -       |\n|       FlagAlpha\u002FAtom-7B         | Llama-2-7B |      0.1T       |        49.96         |     41.10     |        39.83     |       33.00     |         -       |\n| IDEA-CCNL\u002FZiya-LLaMA-13B-v1.1   | Llama-13B  |      0.11T      |        50.25         |     40.99     |        40.04     |       30.54     |         -       |\n|  **Colossal-LLaMA-2-7b-base**   | Llama-2-7B |   **0.0085T**   |        53.06         |     49.89     |        51.48     |       58.82     |        50.2     |\n|  **Colossal-LLaMA-2-13b-base**  | Llama-2-13B |   **0.025T**    |        56.42         |     61.80     |        54.69     |       69.53     |        60.3     |\n\n### ColossalChat\n\n\u003Cdiv align=\"center\">\n   \u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=HcTiHzApHm0\">\n   \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_640918410b82.png\" width=\"700\" \u002F>\n   \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n[ColossalChat](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fapplications\u002FChat): 一个开源解决方案，用于克隆 [ChatGPT](https:\u002F\u002Fopenai.com\u002Fblog\u002Fchatgpt\u002F)，并配备完整的 RLHF 流程。\n[[代码]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fapplications\u002FChat)\n[[博客]](https:\u002F\u002Fmedium.com\u002F@yangyou_berkeley\u002Fcolossalchat-an-open-source-solution-for-cloning-chatgpt-with-a-complete-rlhf-pipeline-5edf08fb538b)\n[[演示]](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=HcTiHzApHm0)\n[[教程]](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=-qFBZFmOJfg)\n\n\u003Cp id=\"ColossalChat-Speed\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_3eeda076eea4.jpg\" width=450\u002F>\n\u003C\u002Fp>\n\n- RLHF PPO Stage3 训练速度最高可提升至10倍\n\n\u003Cp id=\"ColossalChat_scaling\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_ee6ffde32c45.png\" width=800\u002F>\n\u003C\u002Fp>\n\n- 单服务器训练速度最高可提升至7.73倍，单GPU推理速度最高可提升至1.42倍\n\n\u003Cp id=\"ColossalChat-1GPU\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_e8180b57912c.jpg\" width=450\u002F>\n\u003C\u002Fp>\n\n- 在单个GPU上，模型容量最高可增长10.3倍\n- 一次小型演示训练过程仅需1.62GB显存（任何消费级GPU均可）\n\n\u003Cp id=\"ColossalChat-LoRA\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_b55a7973e16f.jpg\" width=600\u002F>\n\u003C\u002Fp>\n\n- 在单个GPU上，微调模型的容量最高可提升至3.7倍\n- 同时保持足够高的运行速度\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">返回顶部\u003C\u002Fa>)\u003C\u002Fp>\n\n\n### AIGC\n加速AIGC（人工智能生成内容）模型，例如 [Stable Diffusion v1](https:\u002F\u002Fgithub.com\u002FCompVis\u002Fstable-diffusion) 和 [Stable Diffusion v2](https:\u002F\u002Fgithub.com\u002FStability-AI\u002Fstablediffusion)。\n\u003Cp id=\"diffusion_train\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_27e3a110551e.png\" width=800\u002F>\n\u003C\u002Fp>\n\n- [训练](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Fimages\u002Fdiffusion): 将Stable Diffusion的显存消耗降低至多5.6倍，硬件成本降低至多46倍（从A100降至RTX3060）。\n\n\u003Cp id=\"diffusion_demo\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_d3ff4911276b.png\" width=800\u002F>\n\u003C\u002Fp>\n\n- [DreamBooth微调](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Fimages\u002Fdreambooth): 仅需3–5张目标对象的照片即可个性化您的模型。\n\n\u003Cp id=\"inference-sd\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_a13e5c6f5274.jpg\" width=800\u002F>\n\u003C\u002Fp>\n\n- [推理](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Fimages\u002Fdiffusion): 将推理过程中的显存消耗减少2.5倍。\n\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">返回顶部\u003C\u002Fa>)\u003C\u002Fp>\n\n### 生物医药\n加速 [AlphaFold蛋白质结构预测](https:\u002F\u002Falphafold.ebi.ac.uk\u002F)\n\n\u003Cp id=\"FastFold\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_6684dcef2cbd.jpg\" width=800\u002F>\n\u003C\u002Fp>\n\n- [FastFold](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FFastFold): 加速GPU集群上的训练和推理，提升数据处理速度，支持超过10000个残基的序列推理。\n\n\u003Cp id=\"FastFold-Intel\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_a6ab1aef6829.jpg\" width=600\u002F>\n\u003C\u002Fp>\n\n- [FastFold与Intel结合](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FFastFold): 推理速度提升3倍，成本降低39%。\n\n\u003Cp id=\"xTrimoMultimer\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_39ccf437e152.jpg\" width=800\u002F>\n\u003C\u002Fp>\n\n- [xTrimoMultimer](https:\u002F\u002Fgithub.com\u002Fbiomap-research\u002FxTrimoMultimer): 将蛋白质单体和多聚体的结构预测速度提升11倍。\n\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">返回顶部\u003C\u002Fa>)\u003C\u002Fp>\n\n## 并行训练演示\n### LLaMA3\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_c0c2dfafa4fa.png\" width=600\u002F>\n\u003C\u002Fp>\n\n- 700亿参数的LLaMA3模型训练加速18%\n[[代码]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Flanguage\u002Fllama)\n[[GPU云平台]](https:\u002F\u002Fcloud.luchentech.com\u002F)\n[[LLaMA3图像]](https:\u002F\u002Fcloud.luchentech.com\u002Fdoc\u002Fdocs\u002Fimage\u002Fllama)\n\n### LLaMA2\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_652ab748b460.png\" width=600\u002F>\n\u003C\u002Fp>\n\n- 700亿参数的LLaMA2模型训练加速195%\n[[代码]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Flanguage\u002Fllama)\n[[博客]](https:\u002F\u002Fwww.hpc-ai.tech\u002Fblog\u002F70b-llama2-training)\n\n### LLaMA1\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_8cb7944afde8.png\" width=600\u002F>\n\u003C\u002Fp>\n\n- 650亿参数的大模型预训练加速38%\n[[代码]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Flanguage\u002Fllama)\n[[博客]](https:\u002F\u002Fwww.hpc-ai.tech\u002Fblog\u002Flarge-model-pretraining)\n\n### MoE\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_f05769580ec9.png\" width=800\u002F>\n\u003C\u002Fp>\n\n- 增强的MoE并行性，开源MoE模型训练效率可提高9倍\n[[代码]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Flanguage\u002Fopenmoe)\n[[博客]](https:\u002F\u002Fwww.hpc-ai.tech\u002Fblog\u002Fenhanced-moe-parallelism-open-source-moe-model-training-can-be-9-times-more-efficient)\n\n### GPT-3\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_f81e4bfe23c8.png\" width=700\u002F>\n\u003C\u002Fp>\n\n- 节省50%的GPU资源，并实现10.7%的加速\n\n### GPT-2\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_10d0568f335d.png\" width=800\u002F>\n\n- 显存消耗降低11倍，且采用张量并行时具有超线性扩展效率\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_8c61dfd6f7b1.png\" width=800>\n\n- 在相同硬件条件下，模型规模扩大24倍\n- 加速超过3倍\n\n### BERT\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_0d385886a2f0.png\" width=800\u002F>\n\n- 训练速度提升2倍，或序列长度延长50%\n\n### PaLM\n- [PaLM-colossalai](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FPaLM-colossalai)：谷歌Pathways语言模型（[PaLM](https:\u002F\u002Fai.googleblog.com\u002F2022\u002F04\u002Fpathways-language-model-palm-scaling-to.html)）的可扩展实现。\n\n### OPT\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_6f0c18c02a9e.png\" width=800\u002F>\n\n- [Open Pretrained Transformer (OPT)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fmetaseq)，由Meta发布的1750亿参数AI语言模型，其公开的预训练权重激发了AI开发者进行各种下游任务和应用部署。\n- 以较低的代码成本实现OPT微调速度提升45%。[[示例]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Flanguage\u002Fopt) [[在线推理]](https:\u002F\u002Fcolossalai.org\u002Fdocs\u002Fadvanced_tutorials\u002Fopt_service)\n\n更多详情请访问我们的[文档](https:\u002F\u002Fwww.colossalai.org\u002F)和[示例](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples)。\n\n### ViT\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_815210098a77.png\" width=\"450\" \u002F>\n\u003C\u002Fp>\n\n- 对于张量并行度为64的情况，批量大小扩大14倍，训练速度提升5倍。\n\n### 推荐系统模型\n- [Cached Embedding](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FCachedEmbedding)，利用软件缓存技术，在较小的GPU显存预算下训练更大的嵌入表。\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">返回顶部\u003C\u002Fa>)\u003C\u002Fp>\n\n## 单GPU训练演示\n\n### GPT-2\n\u003Cp id=\"GPT-2-Single\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_5e25975df53f.png\" width=450\u002F>\n\u003C\u002Fp>\n\n- 在相同硬件上，模型规模扩大20倍。\n\n\u003Cp id=\"GPT-2-NVME\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_0e3ab8cb57d5.png\" width=800\u002F>\n\u003C\u002Fp>\n\n- 在相同硬件（RTX 3080）上，模型规模扩大120倍。\n\n### PaLM\n\u003Cp id=\"PaLM-Single\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_ca0b3186f2ff.png\" width=450\u002F>\n\u003C\u002Fp>\n\n- 在相同硬件上，模型规模扩大34倍。\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">返回顶部\u003C\u002Fa>)\u003C\u002Fp>\n\n\n## 推理\n### Colossal-Inference\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_08e0d611d749.png\" width=1000\u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_1a74bb9944f9.png\" width=1000\u002F>\n\u003C\u002Fp>\n\n - 在某些情况下，大型AI模型的推理速度相比vLLM的离线推理性能提升了一倍。\n[[代码]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fcolossalai\u002Finference)\n[[博客]](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002Fcolossal-inference)\n[[GPU云平台]](https:\u002F\u002Fcloud.luchentech.com\u002F)\n[[LLaMA3图像]](https:\u002F\u002Fcloud.luchentech.com\u002Fdoc\u002Fdocs\u002Fimage\u002Fllama)\n\n### Grok-1\n\u003Cp id=\"Grok-1\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_b4b105effe65.jpg\" width=600\u002F>\n\u003C\u002Fp>\n\n - 3140亿参数的Grok-1推理加速3.8倍，提供易于使用的Python + PyTorch + HuggingFace版本用于推理。\n\n[[代码]](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fexamples\u002Flanguage\u002Fgrok-1)\n[[博客]](https:\u002F\u002Fhpc-ai.com\u002Fblog\u002F314-billion-parameter-grok-1-inference-accelerated-by-3.8x-efficient-and-easy-to-use-pytorchhuggingface-version-is-here)\n[[HuggingFace Grok-1 PyTorch模型权重]](https:\u002F\u002Fhuggingface.co\u002Fhpcai-tech\u002Fgrok-1)\n[[ModelScope Grok-1 PyTorch模型权重]](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fcolossalai\u002Fgrok-1-pytorch\u002Fsummary)\n\n### SwiftInfer\n\u003Cp id=\"SwiftInfer\" align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_b78d37b53500.jpg\" width=800\u002F>\n\u003C\u002Fp>\n\n- [SwiftInfer](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FSwiftInfer)：推理性能提升46%，开源解决方案突破了LLM在多轮对话中的长度限制。\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">返回顶部\u003C\u002Fa>)\u003C\u002Fp>\n\n## 安装\n\n要求：\n- PyTorch ≥ 2.2\n- Python ≥ 3.7\n- CUDA ≥ 11.0\n- [NVIDIA GPU计算能力](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-gpus) ≥ 7.0（V100\u002FRTX20及以上）\n- Linux操作系统\n\n如果在安装过程中遇到任何问题，您可以在本仓库中提交[issue](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fissues\u002Fnew\u002Fchoose)。\n\n### 通过PyPI安装\n\n您可以使用以下命令轻松安装Colossal-AI。**默认情况下，我们在安装时不会构建PyTorch扩展。**\n\n```bash\npip install colossalai\n```\n\n**注意：目前仅支持Linux系统。**\n\n然而，如果您希望在安装时构建PyTorch扩展，可以设置`BUILD_EXT=1`。\n\n```bash\nBUILD_EXT=1 pip install colossalai\n```\n\n**否则，CUDA内核将在您实际需要时于运行时构建。**\n\n我们每周还会向PyPI发布夜间版本，使您能够体验主分支中尚未发布的功能和错误修复。\n可通过以下命令进行安装：\n\n```bash\npip install colossalai-nightly\n```\n\n### 从源码下载\n\n> Colossal-AI的版本将与仓库的主分支保持一致。如遇任何问题，请随时提出issue。:)\n\n```shell\ngit clone https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI.git\ncd ColossalAI\n\n# 安装colossalai\npip install .\n```\n\n默认情况下，我们不会编译CUDA\u002FC++内核。ColossalAI会在运行时构建它们。\n如果您希望安装并启用CUDA内核融合（使用融合优化器时必须安装）：\n\n```shell\nBUILD_EXT=1 pip install .\n```\n\n对于使用CUDA 10.2的用户，仍然可以从源码构建ColossalAI。不过，您需要手动下载cub库并将其复制到相应目录。\n\n```bash\n# 克隆仓库\ngit clone https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI.git\ncd ColossalAI\n\n# 下载cub库\nwget https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcub\u002Farchive\u002Frefs\u002Ftags\u002F1.8.0.zip\nunzip 1.8.0.zip\ncp -r cub-1.8.0\u002Fcub\u002F colossalai\u002Fkernel\u002Fcuda_native\u002Fcsrc\u002Fkernels\u002Finclude\u002F\n\n# 安装\nBUILD_EXT=1 pip install .\n```\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">返回顶部\u003C\u002Fa>)\u003C\u002Fp>\n\n## 使用Docker\n\n### 从DockerHub拉取\n\n您可以直接从我们的[DockerHub页面](https:\u002F\u002Fhub.docker.com\u002Fr\u002Fhpcaitech\u002Fcolossalai)拉取Docker镜像。每次发布时，镜像都会自动上传。\n\n### Build On Your Own\n\nRun the following command to build a docker image from Dockerfile provided.\n\n> Building Colossal-AI from scratch requires GPU support, you need to use Nvidia Docker Runtime as the default when doing `docker build`. More details can be found [here](https:\u002F\u002Fstackoverflow.com\u002Fquestions\u002F59691207\u002Fdocker-build-with-nvidia-runtime).\n> We recommend you install Colossal-AI from our [project page](https:\u002F\u002Fwww.colossalai.org) directly.\n\n\n```bash\ncd ColossalAI\ndocker build -t colossalai .\u002Fdocker\n```\n\nRun the following command to start the docker container in interactive mode.\n\n```bash\ndocker run -ti --gpus all --rm --ipc=host colossalai bash\n```\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">back to top\u003C\u002Fa>)\u003C\u002Fp>\n\n## Community\n\nJoin the Colossal-AI community on [Forum](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fdiscussions),\n[Slack](https:\u002F\u002Fjoin.slack.com\u002Ft\u002Fcolossalaiworkspace\u002Fshared_invite\u002Fzt-z7b26eeb-CBp7jouvu~r0~lcFzX832w),\nand [WeChat(微信)](https:\u002F\u002Fraw.githubusercontent.com\u002Fhpcaitech\u002Fpublic_assets\u002Fmain\u002Fcolossalai\u002Fimg\u002FWeChat.png \"qrcode\") to share your suggestions, feedback, and questions with our engineering team.\n\n## Contributing\nReferring to the successful attempts of [BLOOM](https:\u002F\u002Fbigscience.huggingface.co\u002F) and [Stable Diffusion](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FStable_Diffusion), any and all developers and partners with computing powers, datasets, models are welcome to join and build the Colossal-AI community, making efforts towards the era of big AI models!\n\nYou may contact us or participate in the following ways:\n1. [Leaving a Star ⭐](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fstargazers) to show your like and support. Thanks!\n2. Posting an [issue](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fissues\u002Fnew\u002Fchoose), or submitting a PR on GitHub follow the guideline in [Contributing](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fblob\u002Fmain\u002FCONTRIBUTING.md)\n3. Send your official proposal to email contact@hpcaitech.com\n\nThanks so much to all of our amazing contributors!\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_readme_668ee4138888.png\"  width=\"800px\"\u002F>\n\u003C\u002Fa>\n\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">back to top\u003C\u002Fa>)\u003C\u002Fp>\n\n\n## CI\u002FCD\n\nWe leverage the power of [GitHub Actions](https:\u002F\u002Fgithub.com\u002Ffeatures\u002Factions) to automate our development, release and deployment workflows. Please check out this [documentation](.github\u002Fworkflows\u002FREADME.md) on how the automated workflows are operated.\n\n\n## Cite Us\n\nThis project is inspired by some related projects (some by our team and some by other organizations). We would like to credit these amazing projects as listed in the [Reference List](.\u002Fdocs\u002FREFERENCE.md).\n\nTo cite this project, you can use the following BibTeX citation.\n\n```\n@inproceedings{10.1145\u002F3605573.3605613,\nauthor = {Li, Shenggui and Liu, Hongxin and Bian, Zhengda and Fang, Jiarui and Huang, Haichen and Liu, Yuliang and Wang, Boxiang and You, Yang},\ntitle = {Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training},\nyear = {2023},\nisbn = {9798400708435},\npublisher = {Association for Computing Machinery},\naddress = {New York, NY, USA},\nurl = {https:\u002F\u002Fdoi.org\u002F10.1145\u002F3605573.3605613},\ndoi = {10.1145\u002F3605573.3605613},\nabstract = {The success of Transformer models has pushed the deep learning model scale to billions of parameters, but the memory limitation of a single GPU has led to an urgent need for training on multi-GPU clusters. However, the best practice for choosing the optimal parallel strategy is still lacking, as it requires domain expertise in both deep learning and parallel computing. The Colossal-AI system addressed the above challenge by introducing a unified interface to scale your sequential code of model training to distributed environments. It supports parallel training methods such as data, pipeline, tensor, and sequence parallelism and is integrated with heterogeneous training and zero redundancy optimizer. Compared to the baseline system, Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.},\nbooktitle = {Proceedings of the 52nd International Conference on Parallel Processing},\npages = {766–775},\nnumpages = {10},\nkeywords = {datasets, gaze detection, text tagging, neural networks},\nlocation = {Salt Lake City, UT, USA},\nseries = {ICPP '23}\n}\n```\n\nColossal-AI has been accepted as official tutorial by top conferences [NeurIPS](https:\u002F\u002Fnips.cc\u002F), [SC](https:\u002F\u002Fsc22.supercomputing.org\u002F), [AAAI](https:\u002F\u002Faaai.org\u002FConferences\u002FAAAI-23\u002F),\n[PPoPP](https:\u002F\u002Fppopp23.sigplan.org\u002F), [CVPR](https:\u002F\u002Fcvpr2023.thecvf.com\u002F), [ISC](https:\u002F\u002Fwww.isc-hpc.com\u002F), [NVIDIA GTC](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fon-demand\u002Fsession\u002Fgtcspring23-S51482\u002F) ,etc.\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">back to top\u003C\u002Fa>)\u003C\u002Fp>","# ColossalAI 快速上手指南\n\nColossalAI 是一个旨在让大型 AI 模型训练更便宜、更快、更易用的开源系统。它提供了一套并行组件，让你能像在本机上编写模型一样轻松编写分布式深度学习模型。\n\n## 环境准备\n\n在开始之前，请确保你的开发环境满足以下要求：\n\n*   **操作系统**: Linux (推荐 Ubuntu 18.04\u002F20.04\u002F22.04)\n*   **Python**: 3.8 - 3.10\n*   **CUDA**: 11.0 或更高版本 (根据显卡驱动和 PyTorch 版本匹配)\n*   **PyTorch**: 1.12 或更高版本\n*   **硬件**: 至少一张 NVIDIA GPU (支持多卡及集群分布式训练)\n\n> **提示**：国内开发者若遇到网络问题，建议在安装依赖时配置清华源或阿里源加速。\n\n## 安装步骤\n\n你可以通过 PyPI 直接安装，或者从源码安装以获取最新功能。\n\n### 方式一：通过 PyPI 安装（推荐）\n\n这是最快捷的安装方式。\n\n```bash\npip install colossalai\n```\n\n**国内加速安装：**\n```bash\npip install colossalai -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 方式二：从源码安装\n\n如果你需要最新的功能或进行二次开发，建议从源码安装。\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI.git\ncd ColossalAI\npip install -e .\n```\n\n**国内加速克隆：**\n```bash\ngit clone https:\u002F\u002Fgitee.com\u002Fhpcaitech\u002FColossalAI.git  # 如果存在镜像\n# 或者使用 git 代理加速\n```\n\n### 方式三：使用 Docker（最省心）\n\n为了避免环境配置冲突，可以直接拉取预配置好的 Docker 镜像。\n\n```bash\ndocker pull hpcaitech\u002Fcolossalai:latest\ndocker run --gpus all -it hpcaitech\u002Fcolossalai:latest\n```\n\n## 基本使用\n\nColossalAI 的核心优势在于只需几行代码即可启动分布式训练。以下是一个基于配置文件启动并行训练的最简示例。\n\n### 1. 准备配置文件 (`config.py`)\n\nColossalAI 允许通过配置文件定义并行策略（如数据并行、流水线并行、张量并行等）。\n\n```python\nfrom colossalai.context import ParallelMode\nfrom colossalai.core import global_context as gpc\nfrom colossalai.utils import get_dataloader\n\n# 定义并行配置\nCONFIG = dict(\n    parallel=dict(\n        pipeline=1,\n        tensor=dict(\n            size=2,\n            mode='1d',\n        ),\n        data=1,\n    ),\n    fp16=dict(\n        mode='auto',\n    ),\n    gradient_accumulation=1,\n    clip_grad_norm=1.0,\n)\n```\n\n### 2. 编写训练脚本 (`train.py`)\n\n在你的训练脚本中，引入 `colossalai` 并初始化上下文，即可自动应用并行策略。\n\n```python\nimport torch\nimport torch.nn as nn\nfrom torch.utils.data import DataLoader\nfrom torchvision.datasets import CIFAR10\nfrom torchvision.transforms import ToTensor, Normalize, Compose\n\nimport colossalai\nfrom colossalai.core import global_context as gpc\nfrom colossalai.logging import disable_existing_loggers\nfrom colossalai.utils import get_dataloader\nfrom colossalai.trainer import Trainer, TrainerHook\nfrom colossalai.context import ParallelMode\nfrom colossalai.nn.optimizer import HybridAdam\n\n# 简单的模型定义\nclass SimpleModel(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.flatten = nn.Flatten()\n        self.linear = nn.Linear(3072, 10)\n\n    def forward(self, x):\n        x = self.flatten(x)\n        return self.linear(x)\n\ndef main():\n    # 1. 初始化 ColossalAI 上下文 (自动读取环境变量中的 rank, world_size 等)\n    colossalai.launch(config=CONFIG, rank=gpc.get_global_rank(), world_size=gpc.get_world_size(), host='localhost', port=29500, backend='nccl')\n    \n    # 2. 构建模型、优化器和数据加载器\n    model = SimpleModel().cuda()\n    optimizer = HybridAdam(model.parameters(), lr=1e-3)\n    \n    transform = Compose([ToTensor(), Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])\n    train_dataset = CIFAR10(root='.\u002Fdata', train=True, download=True, transform=transform)\n    train_dataloader = get_dataloader(dataset=train_dataset, batch_size=64, shuffle=True, drop_last=True)\n\n    # 3. 创建 Trainer\n    trainer = Trainer(model=model, optimizer=optimizer, criterion=torch.nn.CrossEntropyLoss())\n\n    # 4. 开始训练\n    trainer.fit(train_dataloader=train_dataloader, epochs=1)\n\nif __name__ == '__main__':\n    main()\n```\n\n### 3. 启动训练\n\n使用 `colossalai run` 命令启动分布式任务。以下示例是在单机 2 张 GPU 上运行：\n\n```bash\ncolossalai run --nproc_per_node 2 train.py\n```\n\n如果是多机多卡环境，可以使用 `colossalai run` 配合主机列表，或使用 Slurm 等调度系统启动。\n\n---\n现在你已经成功运行了第一个 ColossalAI 分布式训练任务！你可以前往 [官方文档](https:\u002F\u002Fcolossalai.org\u002Fzh-Hans\u002Fdocs\u002Fget_started\u002F) 探索更多高级功能，如大模型微调 (LLaMA, ChatGLM 等) 和推理加速。","某金融科技公司算法团队需要在有限的预算下，基于开源基座模型训练一个拥有 700 亿参数、支持长上下文的专业风控大模型。\n\n### 没有 ColossalAI 时\n- **硬件门槛极高**：传统并行策略无法将超大模型装入单卡显存，团队被迫采购昂贵的多节点 GPU 集群，初期投入成本激增。\n- **开发周期漫长**：手动编写分布式训练代码（如 ZeRO、流水线并行）耗时数周，且极易出现通信死锁或显存溢出错误，调试困难。\n- **训练效率低下**：由于缺乏优化的算子融合与通信调度，GPU 利用率长期低于 40%，原本预计两周的训练任务往往拖延至一个月以上。\n- **长序列支持受限**：面对金融研报等超长文本，现有框架难以高效处理长上下文，频繁报错或被迫截断关键信息。\n\n### 使用 ColossalAI 后\n- **低成本启动**：利用其自动并行技术与显存优化机制，团队仅用少量消费级显卡即可启动 70B 模型训练，硬件成本降低 60%。\n- **极速落地**：通过几行配置代码即可开启 3D 并行训练，无需底层重构，模型上线时间从数周缩短至 2 天。\n- **性能显著提升**：内置的高效算子与通信优化使 GPU 利用率提升至 85% 以上，训练速度提升 3 倍，按期交付模型。\n- **无缝长文处理**：原生支持超长序列并行计算，轻松处理百万级 token 上下文，完整保留风控所需的细节特征。\n\nColossalAI 通过极致的系统优化，让中小团队也能以低廉成本和敏捷速度驾驭超大规模 AI 模型的训练与应用。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhpcaitech_ColossalAI_a0721a48.png","hpcaitech","HPC-AI Tech","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fhpcaitech_930ec3e0.jpg","We are a global team to help you train and deploy your AI models",null,"service@hpc-ai.com","https:\u002F\u002Fhpc-ai.com\u002F","https:\u002F\u002Fgithub.com\u002Fhpcaitech",[81,85,89,93,97,101,105],{"name":82,"color":83,"percentage":84},"Python","#3572A5",93.2,{"name":86,"color":87,"percentage":88},"Cuda","#3A4E3A",2.5,{"name":90,"color":91,"percentage":92},"HTML","#e34c26",1.9,{"name":94,"color":95,"percentage":96},"C++","#f34b7d",1.2,{"name":98,"color":99,"percentage":100},"Shell","#89e051",0.9,{"name":102,"color":103,"percentage":104},"C","#555555",0.2,{"name":106,"color":107,"percentage":108},"Dockerfile","#384d54",0,41366,4516,"2026-04-15T18:30:52","Apache-2.0",4,"未说明","需要 NVIDIA GPU（基准测试提及 H200, B200），支持多卡并行（8 卡\u002F16 卡配置），显存需求视模型规模而定（7B 模型约 12GB+，70B 模型需更大显存或并行策略），CUDA 版本未明确说明",{"notes":117,"python":114,"dependencies":118},"该工具专注于大规模 AI 模型的分布式训练与推理，支持数据并行、流水线并行、张量并行等多种策略。官方推荐使用 Docker 部署或通过 HPC-AI Cloud 直接使用预配置环境。具体依赖版本及安装步骤需参考文档中的'Installation'章节（当前提供的文本中未包含详细版本号）。",[119,120],"torch","transformers",[15,14,13,16],[123,124,125,126,127,128,129,130,131,132,133,134],"deep-learning","hpc","large-scale","data-parallelism","pipeline-parallelism","model-parallelism","ai","big-model","distributed-computing","inference","heterogeneous-training","foundation-models","2026-03-27T02:49:30.150509","2026-04-16T08:19:16.120719",[138,143,148,153,158,162],{"id":139,"question_zh":140,"answer_zh":141,"source_url":142},35488,"为什么无法使用 ColossalAI 的聊天网站 (chat.colossalai.org)？","该网站是一个限时试用项目，目前已经关闭，不再响应任何语言的请求。'colossal' 是仓库名称，代表一个用于加速神经网络训练和推理的框架。如果未来重新开放试用，官方会发布相关新闻。","https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fissues\u002F4181",{"id":144,"question_zh":145,"answer_zh":146,"source_url":147},35489,"运行时报错 'No module named colossalai._C.cpu_adam' 或找不到共享对象文件怎么办？","这通常是因为 GCC 版本过低或环境不兼容导致的。请检查并更新 GCC 版本，确保符合安装要求。另外，如果报错提示缺少 'cublas_v2.h' 文件，可以尝试将该文件复制到 CUDA 目录中解决。建议参考官方最新的安装文档重新配置环境。","https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fissues\u002F2743",{"id":149,"question_zh":150,"answer_zh":151,"source_url":152},35490,"为什么使用 ColossalAI 的 AMP (混合精度训练) 比原生 PyTorch AMP 消耗更多的显存？","当使用 `colossalai.initialize` 初始化并配合 Engine 进行训练时，即使配置了 `AMP_TYPE.TORCH`，显存占用也可能高于直接使用 `colossalai.amp.convert_to_torch_amp` 包装模型的情况。这是因为 ColossalAI 的 Engine 包含额外的功能封装。如果追求最小显存占用且不需要 Engine 的高级功能，可以直接使用 `colossalai.amp.convert_to_torch_amp` 手动包装 model、optimizer 和 criterion 进行训练。","https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fissues\u002F1083",{"id":154,"question_zh":155,"answer_zh":156,"source_url":157},35491,"单机多卡训练时出现 'The client socket has failed to connect' 或 'Name or service not known' 错误如何解决？","该错误通常发生在分布式训练初始化阶段，表明节点间无法通过主机名建立连接。这可能是因为主机名解析失败或网络配置问题。请检查 `\u002Fetc\u002Fhosts` 文件确保主机名能正确解析到本地 IP，或者尝试在启动命令中明确指定网络接口。如果是容器化环境，需确保容器间网络互通且主机名配置正确。","https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fissues\u002F3215",{"id":159,"question_zh":160,"answer_zh":161,"source_url":147},35492,"遇到编译错误或 C++ 扩展加载失败时，应该检查哪些环境因素？","首先检查 GCC 版本是否过低，ColossalAI 的 C++ 扩展（如 cpu_adam）需要较新版本的编译器支持。其次检查 CUDA 环境是否完整，例如确认是否存在 `cublas_v2.h` 等头文件。如果缺失关键文件，可能需要重新安装 CUDA Toolkit 或手动复制缺失文件到相应目录。最后，确保 Python 环境与安装的 PyTorch 版本及 CUDA 版本完全兼容。",{"id":163,"question_zh":164,"answer_zh":165,"source_url":152},35493,"如何正确地在非 Engine 模式下使用混合精度训练 (FP16)？","如果不使用 `colossalai.initialize` 和 Engine，可以手动调用 `colossalai.amp.convert_to_torch_amp(model, optimizer, criterion)` 来包装模型、优化器和损失函数。之后按照标准的 PyTorch 训练循环执行：计算输出 -> 计算损失 -> `optimizer.backward(loss)` -> `optimizer.step()` -> `optimizer.zero_grad()`。这种方式通常比通过 Engine 启动更节省显存。",[167,172,177,182,187,192,197,202,207,212,217,222,227,232,237,242,247,252,257,262],{"id":168,"version":169,"summary_zh":170,"released_at":171},280629,"v0.5.0","## 变更内容\n* [HotFix] 更新加载 LoRA 模型的 README；由 @duanjunwen 在 https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fpull\u002F6240 中完成\n* 更新 README.md；由 @Yanjia0 在 https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fpull\u002F6268 中完成\n* [ci] 更新 CI 配置；由 @flybird11111 在 https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fpull\u002F6254 中完成\n* [upgrade] 升级 transformers 库；由 @flybird11111 在 https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fpull\u002F6320 中完成\n* [release] 发布新版本；由 @flybird11111 在 https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fpull\u002F6330 中完成\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fcompare\u002Fv0.4.9...v0.5.0","2025-06-04T06:00:47",{"id":173,"version":174,"summary_zh":175,"released_at":176},280630,"v0.4.9","## 变更内容\n\n### 发布 \n- [release] 更新版本 (#6236) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n\n### 热修复 \n- [hotfix] 修复 LoRA 加载问题 (#6231) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n\n### 其他 \n- [misc] 更新 PyTorch 版本 (#6206) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n\n### 对话 \n- 合并 pull request #6208，来自 hpcaitech\u002Fgrpo_dev 由 [YeAnbang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FYeAnbang)\n\n### Pre-commit.ci \n- [pre-commit.ci] 由 pre-commit.com 的钩子自动修复 由 [pre-commit-ci[bot]](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fpre-commit-ci%5Bbot%5D)\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fcompare\u002Fv0.4.9...v0.4.8","2025-03-04T01:51:48",{"id":178,"version":179,"summary_zh":180,"released_at":181},280631,"v0.4.8","## 变更内容\n\n### 发布 \n- [release] 更新版本 (#6195) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n\n### 文档 \n- [doc] DeepSeek V3\u002FR1 新闻 (#6199) 由 [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell) 提交\n\n### 应用 \n- [application] 添加 LoRA SFT 示例数据 (#6198) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n- [application] 更新 README (#6196) 由 [Tong Li](https:\u002F\u002Fapi.github.com\u002Fusers\u002FTongLi3701) 提交\n- [application] 添加 LoRA SFT 示例 (#6192) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n\n### Pre-commit.ci \n- 为 PPO 添加 GRPO 并支持 RLVR (#6186) 由 [YeAnbang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FYeAnbang) 提交\n\n### Checkpointio \n- [checkpointio] 修复异步 IO 问题 (#6189) 由 [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111) 提交\n- [checkpointio] 修复 3D 模型的检查点问题 (#6187) 由 [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111) 提交\n- [checkpointio] 如果张量同时被填充和分布式，则在去填充之前先收集张量 (#6168) 由 [Lemon Qin](https:\u002F\u002Fapi.github.com\u002Fusers\u002FLemon-412) 提交\n- [checkpointio] 支持加载时的重叠 pinning (#6177) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n\n### 紧急修复 \n- [hotfix] 修复零优化保存问题 (#6191) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n- [hotfix] 修复 sp+dp 的混合检查点IO 问题 (#6184) 由 [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111) 提交\n\n### Shardformer \n- [shardformer] 为 DeepSeek V3 支持流水线，并优化 LoRA 保存 (#6188) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n- [shardformer] 为 DeepSeek V3 支持 EP (#6185) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n\n### CI \n- [CI] 使用共享辅助函数清理分布式优化测试 (#6125) 由 [Wenxuan Tan](https:\u002F\u002Fapi.github.com\u002Fusers\u002FEdenzzzz) 提交\n\n### 问题模板 \n- [Issue template] 添加复选框，要求提供复现错误的详细信息 (#6104) 由 [Wenxuan Tan](https:\u002F\u002Fapi.github.com\u002Fusers\u002FEdenzzzz) 提交\n\n### 推理 \n- [Inference] 修复 README 中的示例 (#6178) 由 [Guangyao Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGuangyaoZhang) 提交\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fcompare\u002Fv0.4.8...v0.4.7","2025-02-20T03:37:37",{"id":183,"version":184,"summary_zh":185,"released_at":186},280632,"v0.4.7","## 变更内容\n\n### 发布 \n- [release] 更新版本 (#6174) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n\n### Pre-commit.ci \n- [pre-commit.ci] pre-commit 自动更新 (#6113) 由 [pre-commit-ci[bot]](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fpre-commit-ci%5Bbot%5D)\n\n### Sharderformer \n- [Sharderformer] 在 Sharderformer 策略中支持 zbv (#6150) 由 [duanjunwen](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fduanjunwen)\n\n### Checkpointio \n- [checkpointio] 支持非阻塞的 pin 加载 (#6172) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n- [checkpointio] 为 3d 支持 asyncio (#6152) 由 [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\n- [checkpointio] 修复 async io (#6155) 由 [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\n- [checkpointio] 支持调试日志 (#6153) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n- [checkpointio] 修复 zero 优化器异步保存时的内存问题 (#6151) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n- 合并 pull request #6149 from ver217\u002Fhotfix\u002Fckpt 由 [Wang Binluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fwangbluo)\n- [checkpointio] 禁用缓冲 由 [ver217](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n- [checkpointio] 修复 pinned state dict 由 [ver217](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n- [checkpointio] 修复大小计算 由 [ver217](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n- [checkpointio] 修复性能问题 (#6139) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n- [checkpointio] 支持异步模型保存 (#6131) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n\n### 新闻 \n- [news] 发布适用于 Sora 的 ColossalAI (#6166) 由 [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\n\n### 热修复 \n- [hotfix] 提升兼容性 (#6165) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n- [Hotfix] 修复归一化问题 (#6163) 由 [duanjunwen](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fduanjunwen)\n- [hotfix] 修复 zero 通信缓冲区初始化问题 (#6154) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n- [hotfix] 修复 flash attn window_size 错误 (#6132) 由 [duanjunwen](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fduanjunwen)\n\n### 文档 \n- [doc] 添加奖励事件 (#6164) 由 [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\n- [doc] 更新云链接 (#6148) 由 [Sze-qq](https:\u002F\u002Fapi.github.com\u002Fusers\u002FSze-qq)\n- [doc] 添加 HPC 云介绍 (#6147) 由 [Sze-qq](https:\u002F\u002Fapi.github.com\u002Fusers\u002FSze-qq)\n\n### 设备 \n- [Device] 支持 NPU (#6159) 由 [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\n\n### 修复 \n- [fix] 修复由 perf 版本引起的 bug (#6156) 由 [duanjunwen](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fduanjunwen)\n- [fix] 多节点反向传播速度变慢问题 (#6134) 由 [Hanks](https:\u002F\u002Fapi.github.com\u002Fusers\u002FBurkeHulk)\n\n### Optim \n- [optim] 修复 adam 加载问题 (#6146) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n\n### Zerobubble \n- [Zerobubble] 合并 main 分支 (#6142) 由 [duanjunwen](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fduanjunwen)\n\n### Async io \n- [async io] 支持 async io (#6137) 由 [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybi","2025-01-03T03:53:16",{"id":188,"version":189,"summary_zh":190,"released_at":191},280633,"v0.4.6","## 变更内容\n\n### 发布 \n- [release] 更新版本 (#6109)，作者：[Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n\n### Pre-commit.ci \n- [pre-commit.ci] 自动更新 pre-commit (#6078)，作者：[pre-commit-ci[bot]](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fpre-commit-ci%5Bbot%5D)\n\n### Checkpointio \n- [checkpointio] 修复混合插件模型保存问题 (#6106)，作者：[Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n\n### Mcts \n- [MCTS] 添加自精炼 MCTS (#6098)，作者：[Tong Li](https:\u002F\u002Fapi.github.com\u002Fusers\u002FTongLi3701)\n\n### 文档 \n- [doc] Sora 解决方案新闻 (#6100)，作者：[binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\n\n### 扩展 \n- [extension] 修复编译检查问题 (#6099)，作者：[Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n\n### 热修复 \n- 合并 Pull Request #6096，来自 BurkeHulk 的分支 hotfix\u002Flora_ckpt，作者：[Hanks](https:\u002F\u002Fapi.github.com\u002Fusers\u002FBurkeHulk)\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fcompare\u002Fv0.4.6...v0.4.5","2024-11-04T09:28:04",{"id":193,"version":194,"summary_zh":195,"released_at":196},280634,"v0.4.5","## 变更内容\n\n### 发布 \n- [release] 更新版本 (#6094) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n\n### 杂项 \n- [misc] 适配 PyTorch API 升级并移除旧版导入 (#6093) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n\n### FP8 \n- [fp8] 添加回退机制，并使编译选项可配置 (#6092) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n\n### 构建任务 \n- [chore] 重构 由 [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\n\n### 检查点 \n- [ckpt] 添加 safetensors 工具 由 [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\n\n### 流水线 \n- [pipeline] 修复多输出的反向传播问题 (#6090) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n\n### 环形注意力 \n- [Ring Attention] 改进注释 (#6085) 由 [Wenxuan Tan](https:\u002F\u002Fapi.github.com\u002Fusers\u002FEdenzzzz)\n- 合并拉取请求 #6071，来自 wangbluo 的 ring_attention 分支 由 [Wang Binluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fwangbluo)\n\n### Coati \n- [Coati] 使用 PP 训练 DPO (#6054) 由 [Tong Li](https:\u002F\u002Fapi.github.com\u002Fusers\u002FTongLi3701)\n\n### Shardformer \n- [shardformer] 优化序列并行 (#6086) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n- [shardformer] 修复一维线性层的行操作，并支持融合 QKV 线性层的不均匀划分 (#6084) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fcompare\u002Fv0.4.5...v0.4.4","2024-10-21T02:21:19",{"id":198,"version":199,"summary_zh":200,"released_at":201},280635,"v0.4.4","## 变更内容\n\n### 发布 \n- [release] 更新版本 (#6062) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n\n### Colossaleval \n- [ColossalEval] 支持 vllm (#6056) 由 [Camille Zhong](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCamille7777) 提交\n\n### Moe \n- [moe] 添加 shared_expert 的并行策略，并修复 deepseek 的测试 (#6063) 由 [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw) 提交\n\n### Sp \n- 合并 pull request #6064，来自 wangbluo\u002Ffix_attn 由 [Wang Binluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fwangbluo) 提交\n- 合并 pull request #6061，来自 wangbluo\u002Fsp_fix 由 [Wang Binluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fwangbluo) 提交\n\n### 文档 \n- [doc] FP8 训练与通信文档 (#6050) 由 [Guangyao Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGuangyaoZhang) 提交\n- [doc] 更新 sp 文档 (#6055) 由 [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111) 提交\n\n### Fp8 \n- [fp8] 禁用节点内 all_gather。禁用冗余的 fp8 all_gather (#6059) 由 [Guangyao Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGuangyaoZhang) 提交\n- [fp8] 修复 mixtral 中缺失的 fp8_comm 标志位 (#6057) 由 [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw) 提交\n- [fp8] 热修复 backward hook (#6053) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n\n### Pre-commit.ci \n- [pre-commit.ci] 来自 pre-commit.com 钩子的自动修复 由 [pre-commit-ci[bot]](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fpre-commit-ci%5Bbot%5D) 提交\n\n### 热修复 \n- [hotfix] moe 混合并行基准测试及后续修复 (#6048) 由 [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw) 提交\n\n### 功能 \n- [Feature] 在 SP 中拆分交叉熵计算 (#5959) 由 [Wenxuan Tan](https:\u002F\u002Fapi.github.com\u002Fusers\u002FEdenzzzz) 提交\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fcompare\u002Fv0.4.4...v0.4.3","2024-09-19T02:53:35",{"id":203,"version":204,"summary_zh":205,"released_at":206},280636,"v0.4.3","## 变更内容\r\n\r\n### 发布 \r\n- [release] 更新版本 (#6041) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\r\n\r\n### FP8 \r\n- [fp8] 禁用节点内 all_to_all_fp8 (#6045) 由 [Hanks](https:\u002F\u002Fapi.github.com\u002Fusers\u002FBurkeHulk) 提交\r\n- [fp8] 修复线性钩子 (#6046) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\r\n- [fp8] 优化 all-gather (#6043) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\r\n- [FP8] 对 scale 进行 unsqueeze 处理，使其兼容 torch.compile (#6040) 由 [Guangyao Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGuangyaoZhang) 提交\r\n- 合并 pull request #6012 from hpcaitech\u002Ffeature\u002Ffp8_comm 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\r\n- 合并 pull request #6033 from wangbluo\u002Ffix 由 [Wang Binluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fwangbluo) 提交\r\n- 合并 pull request #6024 from wangbluo\u002Ffix_merge 由 [Wang Binluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fwangbluo) 提交\r\n- 合并 pull request #6023 from wangbluo\u002Ffp8_merge 由 [Wang Binluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fwangbluo) 提交\r\n- [fp8] 将 feature\u002Ffp8_comm 合并到 Colossalai 的 main 分支 (#6016) 由 [Wang Binluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fwangbluo) 提交\r\n- [fp8] Zero 支持 fp8 线性层。(#6006) 由 [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111) 提交\r\n- [fp8] 为 MoeHybridParallelPlugin 添加 use_fp8 选项 (#6009) 由 [Wang Binluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fwangbluo) 提交\r\n- [fp8] 更新 reduce-scatter 测试 (#6002) 由 [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111) 提交\r\n- [fp8] 通过 [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw) 提升线性层性能\r\n- [fp8] 将 torch.compile 用于 linear_fp8 更新至 >= 2.4.0 (#6004) 由 [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw) 提交\r\n- [fp8] 支持异步 FP8 通信 (#5997) 由 [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111) 提交\r\n- [fp8] 使用 compile 重构 fp8 线性层 (#5993) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\r\n- [fp8] 支持混合并行插件 (#5982) 由 [Wang Binluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fwangbluo) 提交\r\n- [fp8] Moe 支持 fp8 通信 (#5977) 由 [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111) 提交\r\n- [fp8] 使用 torch compile (torch >= 2.3.0) (#5979) 由 [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw) 提交\r\n- [fp8] 支持 gemini 插件 (#5978) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\r\n- [fp8] 为混合并行插件支持 fp8 amp (#5975) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\r\n- [fp8] 添加 fp8 线性层 (#5967) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\r\n- [fp8] 支持 all2all fp8 (#5953) 由 [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111) 提交\r\n- [FP8] rebase main (#5963) 由 [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111) 提交\r\n- 合并 pull request #5961 from ver217\u002Ffeature\u002Fzeor-fp8 由 [Hanks](https:\u002F\u002Fapi.github.com\u002Fusers\u002FBurkeHulk) 提交\r\n- [fp8] 为低级别 zero 添加 fp8 通信功能 由 [ver217](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\r\n\r\n### 热修复 \r\n- [Hotfix] 移除已弃用的安装步骤 (#6042) 由 [Tong Li](https:\u002F\u002Fapi.github.com\u002Fusers\u002FTongLi3701) 提交\r\n- [Hotfix] 修复 llama 前向替换中的 bug","2024-09-10T02:39:50",{"id":208,"version":209,"summary_zh":210,"released_at":211},280637,"v0.4.2","## 变更内容\n\n### 发布 \n- [release] 更新版本 (#5952) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n\n### Zero \n- [zero] 修复 master 参数 (#5951) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n\n### 功能 \n- [Feat] 为扩散推理添加 Distrifusion 加速支持 (#5895) 由 [Runyu Lu](https:\u002F\u002Fapi.github.com\u002Fusers\u002FLRY89757) 提交\n\n### Shardformer \n- [shardformer] 修复注意力掩码 (#5947) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n- [shardformer] 修复注意力掩码 (#5945) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n\n### 聊天 \n- 合并 pull request #5922 from hpcaitech\u002Fkto 由 [YeAnbang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FYeAnbang) 提交\n\n### 特性 \n- [Feature] 添加开关以控制每个 epoch 结束后是否需要保存模型检查点 (#5941) 由 [zhurunhua](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fzhurunhua) 提交\n\n### 热修复 \n- [Hotfix] 修复 ZeRO 拼写错误 #5936 由 [Edenzzzz](https:\u002F\u002Fapi.github.com\u002Fusers\u002FEdenzzzz) 提交\n\n### 修复 bug \n- [FIX BUG] 将环境参数转换为整数 (#5934) 由 [Gao, Ruiyuan](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflymin) 提交\n- [FIX BUG] UnboundLocalError: 无法访问未关联值的局部变量 'default_conversation' (#5931) 由 [zhurunhua](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fzhurunhua) 提交\n\n### Colossalchat \n- [ColossalChat] 为 ColossalChat 进行热修复 (#5910) 由 [Tong Li](https:\u002F\u002Fapi.github.com\u002Fusers\u002FTongLi3701) 提交\n\n### 示例 \n- [Examples] 为 OPT 和 GPT 示例添加懒加载初始化 (#5924) 由 [Edenzzzz](https:\u002F\u002Fapi.github.com\u002Fusers\u002FEdenzzzz) 提交\n\n### 插件 \n- [plugin] 支持混合并行中的 all-gather 重叠 (#5919) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fcompare\u002Fv0.4.2...v0.4.1","2024-07-31T02:06:47",{"id":213,"version":214,"summary_zh":215,"released_at":216},280638,"v0.4.1","## 变更内容\n\n### 发布 \n- [release] 更新版本 (#5912) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n\n### 杂项 \n- [misc] 支持 PyTorch 2.3 (#5893) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n\n### 兼容性 \n- [compatibility] 支持 PyTorch 2.2 (#5875) 由 [Guangyao Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGuangyaoZhang) 提交\n\n### 对话 \n- 合并 pull request #5901 from hpcaitech\u002Fcolossalchat 由 [YeAnbang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FYeAnbang) 合并\n- 合并 pull request #5850 from hpcaitech\u002Frlhf_SimPO 由 [YeAnbang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FYeAnbang) 合并\n\n### Shardformer \n- [ShardFormer] 修复 Qwen2 SP (#5903) 由 [Guangyao Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGuangyaoZhang) 提交\n- [ShardFormer] 为 Command-R、Qwen2 和 ChatGLM 添加 Ulysses 序列并行支持 (#5897) 由 [Guangyao Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGuangyaoZhang) 提交\n- [shardformer] 支持 DeepseekMoE (#5871) 由 [Haze188](https:\u002F\u002Fapi.github.com\u002Fusers\u002FHz188) 提交\n- [shardformer] 修复 MoE 相关问题 (#5883) 由 [Wang Binluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fwangbluo) 提交\n- [Shardformer] 将 Qwen2 的建模方式改为梯度检查点风格 (#5874) 由 [Jianghai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1) 提交\n- [shardformer] 删除 xformers (#5859) 由 [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111) 提交\n\n### 自动并行 \n- [Auto Parallel]: 将算子内计划生成速度提升 44% (#5446) 由 [Stephan Kö](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fstephankoe) 提交\n\n### Zero \n- [zero] 支持 all-gather 重叠 (#5898) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n\n### Pre-commit.ci \n- [pre-commit.ci] 由 pre-commit.com 钩子自动修复 由 [pre-commit-ci[bot]](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fpre-commit-ci%5Bbot%5D) 提交\n- [pre-commit.ci] pre-commit 自动更新 (#5878) 由 [pre-commit-ci[bot]](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fpre-commit-ci%5Bbot%5D) 提交\n- [pre-commit.ci] pre-commit 自动更新 (#5572) 由 [pre-commit-ci[bot]](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fpre-commit-ci%5Bbot%5D) 提交\n\n### 功能 \n- [Feature] 为 Llama 启用 PP + SP (#5868) 由 [Edenzzzz](https:\u002F\u002Fapi.github.com\u002Fusers\u002FEdenzzzz) 提交\n\n### 紧急修复 \n- [HotFix] 为 #5838 添加 CI、import 和 requirements 测试 (#5892) 由 [Runyu Lu](https:\u002F\u002Fapi.github.com\u002Fusers\u002FLRY89757) 提交\n- [Hotfix] 修复 OPT 梯度检查点前向传播问题 由 [Edenzzzz](https:\u002F\u002Fapi.github.com\u002Fusers\u002FEdenzzzz) 提交\n- [hotfix] 修复大张量超出 TensorBucket 最大容量的 bug (#5879) 由 [Haze188](https:\u002F\u002Fapi.github.com\u002Fusers\u002FHz188) 提交\n\n### 新特性 \n- [Feat] 支持扩散模型 (PixArtAlpha\u002FStableDiffusion3) (#5838) 由 [Runyu Lu](https:\u002F\u002Fapi.github.com\u002Fusers\u002FLRY89757) 提交\n\n### 紧急修复 \n- [Hoxfix] 修复 CUDA_DEVICE_MAX_CONNECTIONS 以支持通信重叠 由 [Edenzzzz](https:\u002F\u002Fapi.github.com\u002Fusers\u002FEdenzzzz) 提交\n\n### 量化 \n- [quant] 修复 bitsandbytes 版本检查 (#5882) 由 [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217) 提交\n\n### 文档 \n- [doc] 更新 Llama + SP 的兼容性；修复分布式优化表格 由 [Edenzzzz](https:\u002F\u002Fapi.github.com\u002Fusers\u002FEdenzzzz) 提交\n\n### MoE\u002FZero \n- [MoE\u002FZeRO] MoE 重构","2024-07-17T09:30:39",{"id":218,"version":219,"summary_zh":220,"released_at":221},280639,"v0.4.0","## What's Changed \r\n\r\n### Release \r\n- [release] update version (#5864) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Inference \r\n- [Inference]Lazy Init Support (#5785) by [Runyu Lu](https:\u002F\u002Fapi.github.com\u002Fusers\u002FLRY89757)\r\n\r\n### Shardformer \r\n- [shardformer] Support the T5ForTokenClassification model (#5816) by [Guangyao Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGuangyaoZhang)\r\n\r\n### Zero \r\n- [zero] use bucket during allgather (#5860) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Gemini \r\n- [gemini] fixes for benchmarking (#5847) by [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [gemini] fix missing return (#5845) by [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n\r\n### Feature \r\n- [Feature] optimize PP overlap (#5735) by [Edenzzzz](https:\u002F\u002Fapi.github.com\u002Fusers\u002FEdenzzzz)\r\n\r\n### Doc \r\n- [doc] add GPU cloud playground (#5851) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n- [doc] fix open sora model weight link (#5848) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n- [doc] opensora v1.2 news (#5846) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fcompare\u002Fv0.4.0...v0.3.9","2024-06-28T02:51:35",{"id":223,"version":224,"summary_zh":225,"released_at":226},280640,"v0.3.9","## What's Changed \r\n\r\n### Release \r\n- [release] update version (#5833) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Fix \r\n- [Fix] Fix spec-dec Glide LlamaModel for compatibility with transformers (#5837) by [Yuanheng Zhao](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyuanheng-zhao)\r\n\r\n### Shardformer \r\n-  [shardformer] Change atol in test command-r weight-check to pass pytest (#5835) by [Guangyao Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGuangyaoZhang)\r\n- Merge pull request #5818 from GuangyaoZhang\u002Fcommand-r by [Guangyao Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGuangyaoZhang)\r\n- [shardformer] upgrade transformers to 4.39.3 (#5815) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] fix modeling of bloom and falcon (#5796) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [shardformer] fix import (#5788) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Devops \r\n- [devops] Remove building on PR when edited to avoid skip issue (#5836) by [Guangyao Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGuangyaoZhang)\r\n- [devops] fix docker ci (#5780) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Launch \r\n- [launch] Support IPv4 host initialization in launch (#5822) by [Kai Lv](https:\u002F\u002Fapi.github.com\u002Fusers\u002FKaiLv69)\r\n\r\n### Misc \r\n- [misc] Add dist optim to doc sidebar  (#5806) by [Edenzzzz](https:\u002F\u002Fapi.github.com\u002Fusers\u002FEdenzzzz)\r\n- [misc] update requirements (#5787) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [misc] fix dist logger (#5782) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [misc] Accelerate CI for zero and dist optim (#5758) by [Edenzzzz](https:\u002F\u002Fapi.github.com\u002Fusers\u002FEdenzzzz)\r\n- [misc] update dockerfile (#5776) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Pre-commit.ci \r\n- [pre-commit.ci] auto fixes from pre-commit.com hooks by [pre-commit-ci[bot]](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fpre-commit-ci%5Bbot%5D)\r\n- [pre-commit.ci] auto fixes from pre-commit.com hooks by [pre-commit-ci[bot]](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fpre-commit-ci%5Bbot%5D)\r\n- [pre-commit.ci] auto fixes from pre-commit.com hooks by [pre-commit-ci[bot]](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fpre-commit-ci%5Bbot%5D)\r\n\r\n### Gemini \r\n- [gemini] quick fix on possible async operation (#5803) by [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [Gemini] Use async stream to prefetch and h2d data moving (#5781) by [Haze188](https:\u002F\u002Fapi.github.com\u002Fusers\u002FHz188)\r\n- [gemini] optimize reduce scatter d2h copy (#5760) by [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n\r\n### Inference \r\n- [Inference] Fix flash-attn import and add model test (#5794) by [Li Xingjian](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fchar-1ee)\r\n- [Inference]refactor baichuan (#5791) by [Runyu Lu](https:\u002F\u002Fapi.github.com\u002Fusers\u002FLRY89757)\r\n- Merge pull request #5771 from char-1ee\u002Frefactor\u002Fmodeling by [Li Xingjian](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fchar-1ee)\r\n- [Inference]Add Streaming LLM (#5745) by [yuehuayingxueluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyuehuayingxueluo)\r\n\r\n### Test \r\n- [test] fix qwen2 pytest distLarge (#5797) by [Guangyao Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGuangyaoZhang)\r\n- [test] fix chatglm test kit (#5793) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [test] Fix\u002Ffix testcase (#5770) by [duanjunwen](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fduanjunwen)\r\n\r\n### Colossalchat \r\n- Merge pull request #5759 from hpcaitech\u002Fcolossalchat_upgrade by [YeAnbang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FYeAnbang)\r\n\r\n### Install \r\n- [install]fix setup (#5786) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n\r\n### Hotfix \r\n- [hotfix] fix testcase in test_fx\u002Ftest_tracer (#5779) by [duanjunwen](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fduanjunwen)\r\n- [hotfix] fix llama flash attention forward (#5777) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [Hotfix] Add missing init file in inference.executor (#5774) by [Yuanheng Zhao](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyuanheng-zhao)\r\n\r\n### Test\u002Fci \r\n- [Test\u002FCI] remove test cases to reduce CI duration (#5753) by [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n\r\n### Ci\u002Ftests \r\n- [CI\u002Ftests] simplify some test case to reduce testing time (#5755) by [Haze188](https:\u002F\u002Fapi.github.com\u002Fusers\u002FHz188)\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fcompare\u002Fv0.3.9...v0.3.8","2024-06-20T05:35:03",{"id":228,"version":229,"summary_zh":230,"released_at":231},280641,"v0.3.8","## What's Changed \r\n\r\n### Release \r\n- [release] update version (#5752) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Fix\u002Fexample \r\n- [Fix\u002FExample] Fix Llama Inference Loading Data Type (#5763) by [Yuanheng Zhao](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyuanheng-zhao)\r\n\r\n### Gemini \r\n- Merge pull request #5749 from hpcaitech\u002Fprefetch by [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- Merge pull request #5754 from Hz188\u002Fprefetch by [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [Gemini] add some code for reduce-scatter overlap, chunk prefetch in llama benchmark. (#5751) by [Haze188](https:\u002F\u002Fapi.github.com\u002Fusers\u002FHz188)\r\n- [gemini] async grad chunk reduce (all-reduce&reduce-scatter) (#5713) by [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- Merge pull request #5733 from Hz188\u002Ffeature\u002Fprefetch by [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- Merge pull request #5731 from botbw\u002Fprefetch by [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [gemini] init auto policy prefetch by [hxwang](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- Merge pull request #5722 from botbw\u002Fprefetch by [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [gemini] maxprefetch means maximum work to keep by [hxwang](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [gemini] use compute_chunk to find next chunk by [hxwang](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [gemini] prefetch chunks by [hxwang](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [gemini]remove registered gradients hooks (#5696) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n\r\n### Chore \r\n- [chore] refactor profiler utils by [hxwang](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [chore] remove unnecessary assert since compute list might not be recorded by [hxwang](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [chore] remove unnecessary test & changes by [hxwang](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- Merge pull request #5738 from botbw\u002Fprefetch by [Haze188](https:\u002F\u002Fapi.github.com\u002Fusers\u002FHz188)\r\n- [chore] fix init error by [hxwang](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [chore] Update placement_policy.py by [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [chore] remove debugging info by [hxwang](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [chore] remove print by [hxwang](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [chore] refactor & sync by [hxwang](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [chore] sync by [hxwang](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n\r\n### Bug \r\n- [bug] continue fix by [hxwang](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [bug] workaround for idx fix by [hxwang](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n- [bug] fix early return (#5740) by [botbw](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbotbw)\r\n\r\n### Bugs \r\n- [bugs] fix args.profile=False DummyProfiler errro by [genghaozhe](https:\u002F\u002Fapi.github.com\u002Fusers\u002FHz188)\r\n\r\n### Inference \r\n- [inference] Fix running time of test_continuous_batching (#5750) by [Yuanheng Zhao](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyuanheng-zhao)\r\n- [Inference]Fix readme and example for API server (#5742) by [Jianghai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1)\r\n- [inference] release (#5747) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n- [Inference] Fix Inference Generation Config and Sampling (#5710) by [Yuanheng Zhao](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyuanheng-zhao)\r\n- [Inference] Fix API server, test and example (#5712) by [Jianghai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1)\r\n- [Inference] Delete duplicated copy_vector (#5716) by [傅剑寒](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCourtesy-Xs)\r\n- [Inference]Adapt repetition_penalty and no_repeat_ngram_size (#5708) by [yuehuayingxueluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyuehuayingxueluo)\r\n- [Inference] Add example test_ci script by [CjhHa1](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1)\r\n- [Inference] Fix bugs and docs for feat\u002Fonline-server (#5598) by [Jianghai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1)\r\n- [Inference] resolve rebase conflicts by [CjhHa1](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1)\r\n- [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432) by [Jianghai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1)\r\n- [Inference] ADD  async and sync Api server using FastAPI (#5396) by [Jianghai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1)\r\n- [Inference] Support the logic related to ignoring EOS token (#5693) by [yuehuayingxueluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyuehuayingxueluo)\r\n- [Inference]Adapt temperature processing logic (#5689) by [yuehuayingxueluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyuehuayingxueluo)\r\n- [Inference] Remove unnecessary float4_ and rename float8_ to float8 (#5679) by [Steve Luo](https:\u002F\u002Fapi.github.com\u002Fusers\u002FSunflowerAries)\r\n- [Inference] Fix quant bits order (#5681) by [傅剑寒](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCourtesy-Xs)\r\n- [inference]Add alibi to flash attn function (#5678) by [yuehuayingxueluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyuehuayingxueluo)\r\n- [Inference] Adapt Baichuan2-13B TP (#5659) by [yuehuayingxueluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyuehuayingxueluo)\r\n\r\n### Feature \r\n- [Feature] auto-cast optimizers to distributed version (#5746) by [Edenzzzz](https:\u002F\u002Fapi.githu","2024-05-31T11:41:19",{"id":233,"version":234,"summary_zh":235,"released_at":236},280642,"v0.3.7","## What's Changed \r\n\r\n### Release \r\n- [release] update version (#5654) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [release] grok-1 inference benchmark (#5500) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n- [release] grok-1 314b inference (#5490) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n\r\n### Hotfix \r\n- [hotfix] add soft link to support required files (#5661) by [Tong Li](https:\u002F\u002Fapi.github.com\u002Fusers\u002FTongLi3701)\r\n- [hotfix] Fixed fused layernorm bug without apex (#5609) by [Edenzzzz](https:\u002F\u002Fapi.github.com\u002Fusers\u002FEdenzzzz)\r\n- [hotfix] Fix examples no pad token & auto parallel codegen bug; (#5606) by [Edenzzzz](https:\u002F\u002Fapi.github.com\u002Fusers\u002FEdenzzzz)\r\n- [hotfix] fix typo s\u002Fget_defualt_parser \u002Fget_default_parser (#5548) by [digger yu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdigger-yu)\r\n- [hotfix] quick fixes to make legacy tutorials runnable (#5559) by [Edenzzzz](https:\u002F\u002Fapi.github.com\u002Fusers\u002FEdenzzzz)\r\n- [hotfix] set return_outputs=False in examples and polish code (#5404) by [Wenhao Chen](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCWHer)\r\n- [hotfix] fix typo s\u002Fkeywrods\u002Fkeywords etc. (#5429) by [digger yu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdigger-yu)\r\n\r\n### News \r\n- [news] llama3 and open-sora v1.1 (#5655) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n\r\n### Lazyinit \r\n- [lazyinit] skip whisper test (#5653) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Shardformer \r\n- [shardformer] refactor pipeline grad ckpt config (#5646) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [shardformer] fix chatglm implementation (#5644) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [shardformer] remove useless code (#5645) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] update transformers (#5583) by [Wang Binluo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fwangbluo)\r\n- [shardformer] fix pipeline grad ckpt (#5620) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [shardformer] refactor embedding resize (#5603) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] Sequence Parallelism Optimization (#5533) by [Zhongkai Zhao](https:\u002F\u002Fapi.github.com\u002Fusers\u002FKKZ20)\r\n- [shardformer] fix pipeline forward error if custom layer distribution is used (#5189) by [Insu Jang](https:\u002F\u002Fapi.github.com\u002Fusers\u002Finsujang)\r\n- [shardformer] update colo attention to support custom mask (#5510) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [shardformer]Fix lm parallel. (#5480) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] fix gathering output when using tensor parallelism (#5431) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n\r\n### Fix \r\n- [Fix]: implement thread-safety singleton to avoid deadlock for very large-scale training scenarios (#5625) by [Season](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fzigzagcai)\r\n- [fix] fix typo s\u002Fmuiti-node \u002Fmulti-node etc. (#5448) by [digger yu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdigger-yu)\r\n- [Fix] Grok-1 use tokenizer from the same pretrained path (#5532) by [Yuanheng Zhao](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyuanheng-zhao)\r\n- [fix] fix grok-1 example typo (#5506) by [Yuanheng Zhao](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyuanheng-zhao)\r\n\r\n### Coloattention \r\n- [coloattention]modify coloattention (#5627) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n\r\n### Example \r\n- [example] llama3 (#5631) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n- [example] update Grok-1 inference (#5495) by [Yuanheng Zhao](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyuanheng-zhao)\r\n- [example] add grok-1 inference (#5485) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Exampe \r\n- [exampe] update llama example (#5626) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Feature \r\n- [Feature] Support LLaMA-3 CPT and ST (#5619) by [Tong Li](https:\u002F\u002Fapi.github.com\u002Fusers\u002FTongLi3701)\r\n\r\n### Zero \r\n- [zero] support multiple (partial) backward passes (#5596) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Doc \r\n- [doc] fix ColossalMoE readme (#5599) by [Camille Zhong](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCamille7777)\r\n- [doc] update open-sora demo (#5479) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n- [doc] release Open-Sora 1.0 with model weights (#5468) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n\r\n### Devops \r\n- [devops] remove post commit ci (#5566) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [devops] fix example test ci (#5504) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [devops] fix compatibility (#5444) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Shardformer, pipeline \r\n- [shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous shard policy for llama (#5508) by [Wenhao Chen](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCWHer)\r\n\r\n### Colossalchat \r\n- [ColossalChat] Update RLHF V2 (#5286) by [YeAnbang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FYeAnbang)\r\n\r\n### Format \r\n- [format] applied code formatting o","2024-04-27T11:00:32",{"id":238,"version":239,"summary_zh":240,"released_at":241},280643,"v0.3.6","## What's Changed \r\n\r\n### Release \r\n- [release] update version (#5411) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Colossal-llama2 \r\n- [colossal-llama2] add stream chat examlple for chat version model (#5428) by [Camille Zhong](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCamille7777)\r\n\r\n### Hotfix \r\n- [hotfix] fix stable diffusion inference bug. (#5289) by [Youngon](https:\u002F\u002Fapi.github.com\u002Fusers\u002FYoungon)\r\n- [hotfix] fix typo change MoECheckpintIO to MoECheckpointIO (#5335) by [digger yu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdigger-yu)\r\n- [hotfix] fix typo change enabel to enable under colossalai\u002Fshardformer\u002F (#5317) by [digger yu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdigger-yu)\r\n- [hotfix] fix typo change _descrption to _description (#5331) by [digger yu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdigger-yu)\r\n- [hotfix] fix typo of openmoe model source (#5403) by [Luo Yihang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FLuo-Yihang)\r\n- [hotfix] fix sd vit import error (#5420) by [MickeyCHAN](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdanyow-cheung)\r\n- [hotfix] Fix wrong import in meta_registry (#5392) by [Stephan Kölker](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fstephankoe)\r\n- [hotfix] fix variable type for top_p (#5313) by [CZYCW](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCZYCW)\r\n\r\n### Doc \r\n- [doc] Fix typo s\u002Finfered\u002Finferred\u002F (#5288) by [hugo-syn](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fhugo-syn)\r\n- [doc] update some translations with README-zh-Hans.md (#5382) by [digger yu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdigger-yu)\r\n- [doc] sora release (#5425) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n- [doc] fix blog link by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n- [doc] fix blog link by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n- [doc] updated installation command (#5389) by [Frank Lee](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFrankLeeeee)\r\n- [doc] Fix typo (#5361) by [yixiaoer](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyixiaoer)\r\n\r\n### Eval-hotfix \r\n- [eval-hotfix] set few_shot_data to None when few shot is disabled (#5422) by [Dongruixuan Li](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fstarcatmeow)\r\n\r\n### Devops \r\n- [devops] fix extention building (#5427) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Example \r\n- [example]add gpt2 benchmark example script. (#5295) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [example] reuse flash attn patch (#5400) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Workflow \r\n- [workflow] added pypi channel (#5412) by [Frank Lee](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFrankLeeeee)\r\n\r\n### Shardformer \r\n- [shardformer]gather llama logits (#5398) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n\r\n### Setup \r\n- [setup] fixed nightly release (#5388) by [Frank Lee](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFrankLeeeee)\r\n\r\n### Fsdp \r\n- [fsdp] impl save\u002Fload shard model\u002Foptimizer (#5357) by [QinLuo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fericxsun)\r\n\r\n### Extension \r\n- [extension] hotfix jit extension setup (#5402) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Llama \r\n- [llama] fix training and inference scripts (#5384) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Fcompare\u002Fv0.3.6...v0.3.5","2024-03-07T15:38:25",{"id":243,"version":244,"summary_zh":245,"released_at":246},280644,"v0.3.5","## What's Changed \r\n\r\n### Release \r\n- [release] update version (#5380) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Llama \r\n- Merge pull request #5377 from hpcaitech\u002Fexample\u002Fllama-npu by [Frank Lee](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFrankLeeeee)\r\n- [llama] fix memory issue (#5371) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [llama] polish training script and fix optim ckpt (#5368) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [llama] fix neftune & pbar with start_step (#5364) by [Camille Zhong](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCamille7777)\r\n- [llama] add flash attn patch for npu (#5362) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [llama] update training script (#5360) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [llama] fix dataloader for hybrid parallel (#5358) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Moe \r\n- [moe] fix tests by [ver217](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [moe] fix mixtral optim checkpoint (#5344) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [moe] fix mixtral forward default value (#5329) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [moe] fix mixtral checkpoint io (#5314) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [moe] support mixtral (#5309) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [moe] update capacity computing (#5253) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [moe] init mixtral impl by [Xuanlei Zhao](https:\u002F\u002Fapi.github.com\u002Fusers\u002Foahzxl)\r\n- [moe]: fix ep\u002Ftp tests, add hierarchical all2all (#4982) by [Wenhao Chen](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCWHer)\r\n- [moe] support optimizer checkpoint (#5015) by [Xuanlei Zhao](https:\u002F\u002Fapi.github.com\u002Fusers\u002Foahzxl)\r\n- [moe] merge moe into main (#4978) by [Xuanlei Zhao](https:\u002F\u002Fapi.github.com\u002Fusers\u002Foahzxl)\r\n\r\n### Lr-scheduler \r\n- [lr-scheduler] fix load state dict and add test (#5369) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Eval \r\n-  [eval] update llama npu eval (#5366) by [Camille Zhong](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCamille7777)\r\n\r\n### Gemini \r\n- [gemini] fix param op hook when output is tuple (#5355) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [gemini]  hotfix NaN loss while using Gemini + tensor_parallel (#5150) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [gemini]fix gemini optimzer, saving Shardformer in Gemini got list assignment index out of range (#5085) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [gemini] gemini support extra-dp (#5043) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [gemini] gemini support tensor parallelism. (#4942) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n\r\n### Fix \r\n- [fix] remove unnecessary dp_size assert  (#5351) by [Wenhao Chen](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCWHer)\r\n\r\n### Checkpointio \r\n- [checkpointio] fix gemini and hybrid parallel optim checkpoint (#5347) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Chat \r\n- [Chat] fix sft loss nan (#5345) by [YeAnbang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FYeAnbang)\r\n\r\n### Extension \r\n- [extension] fixed exception catch (#5342) by [Frank Lee](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFrankLeeeee)\r\n\r\n### Doc \r\n- [doc] added docs for extensions (#5324) by [Frank Lee](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFrankLeeeee)\r\n- [doc] add llama2-13B disyplay (#5285) by [Desperado-Jia](https:\u002F\u002Fapi.github.com\u002Fusers\u002FDesperado-Jia)\r\n- [doc] fix doc typo (#5256) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n- [doc] fix typo in Colossal-LLaMA-2\u002FREADME.md (#5247) by [digger yu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdigger-yu)\r\n- [doc] SwiftInfer release (#5236) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n- [doc] add Colossal-LLaMA-2-13B (#5234) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n- [doc] Make leaderboard format more uniform and good-looking (#5231) by [JIMMY ZHAO](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fzhimin-z)\r\n- [doc] Update README.md of Colossal-LLAMA2 (#5233) by [Camille Zhong](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCamille7777)\r\n- [doc] Update required third-party library list for testing and torch comptibility checking (#5207) by [Zhongkai Zhao](https:\u002F\u002Fapi.github.com\u002Fusers\u002FKKZ20)\r\n- [doc] update pytorch version in documents. (#5177) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [doc] fix colossalqa document (#5146) by [Michelle](https:\u002F\u002Fapi.github.com\u002Fusers\u002FMichelleMa8)\r\n- [doc] updated paper citation (#5131) by [Frank Lee](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFrankLeeeee)\r\n- [doc] add moe news (#5128) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n\r\n### Tests \r\n- [tests] fix t5 test. (#5322) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n\r\n### Accelerator \r\n- Merge pull request #5321 from FrankLeeeee\u002Fhotfix\u002Faccelerator-api by [Frank Lee](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFrankLeeeee)\r\n- [accelerator] fixed npu api by [FrankLeeeee](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFrankLeeeee)\r\n- [accelerator] ","2024-02-23T08:46:07",{"id":248,"version":249,"summary_zh":250,"released_at":251},280645,"v0.3.4","## What's Changed \r\n\r\n### Release \r\n- [release] update version (#4995) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Pipeline inference \r\n- [Pipeline Inference] Merge pp with tp (#4993) by [Bin Jia](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFoolPlayer)\r\n- [Pipeline inference] Combine kvcache with pipeline inference (#4938) by [Bin Jia](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFoolPlayer)\r\n- [Pipeline Inference] Sync pipeline inference branch to main  (#4820) by [Bin Jia](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFoolPlayer)\r\n\r\n### Doc \r\n- [doc] add supported feature diagram for hybrid parallel plugin (#4996) by [ppt0011](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fppt0011)\r\n- [doc]Update doc for colossal-inference (#4989) by [Cuiqing Li (李崔卿)](https:\u002F\u002Fapi.github.com\u002Fusers\u002Ftiandiao123)\r\n- Merge pull request #4889 from ppt0011\u002Fmain by [ppt0011](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fppt0011)\r\n- [doc] add reminder for issue encountered with hybrid adam by [ppt0011](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fppt0011)\r\n- [doc] update advanced tutorials, training gpt with hybrid parallelism (#4866) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- Merge pull request #4858 from Shawlleyw\u002Fmain by [ppt0011](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fppt0011)\r\n- [doc] update slack link (#4823) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n- [doc] add lazy init docs (#4808) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- Merge pull request #4805 from TongLi3701\u002Fdocs\u002Ffix by [Desperado-Jia](https:\u002F\u002Fapi.github.com\u002Fusers\u002FDesperado-Jia)\r\n- [doc] polish shardformer doc (#4779) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [doc] add llama2 domain-specific solution news (#4789) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n\r\n### Hotfix \r\n- [hotfix] fix the bug of repeatedly storing param group (#4951) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [hotfix] Fix the bug where process groups were not being properly released. (#4940) by [littsk](https:\u002F\u002Fapi.github.com\u002Fusers\u002Flittsk)\r\n- [hotfix] fix torch 2.0 compatibility (#4936) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [hotfix] fix lr scheduler bug in torch 2.0 (#4864) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [hotfix] fix bug in sequence parallel test (#4887) by [littsk](https:\u002F\u002Fapi.github.com\u002Fusers\u002Flittsk)\r\n- [hotfix] Correct several erroneous code comments (#4794) by [littsk](https:\u002F\u002Fapi.github.com\u002Fusers\u002Flittsk)\r\n- [hotfix] fix norm type error in zero optimizer (#4795) by [littsk](https:\u002F\u002Fapi.github.com\u002Fusers\u002Flittsk)\r\n- [hotfix] change llama2 Colossal-LLaMA-2 script filename (#4800) by [Chandler-Bing](https:\u002F\u002Fapi.github.com\u002Fusers\u002FChandler-Bing)\r\n\r\n### Kernels \r\n- [Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention  (#4965) by [Cuiqing Li](https:\u002F\u002Fapi.github.com\u002Fusers\u002Ftiandiao123)\r\n\r\n### Inference \r\n- [Inference] Dynamic Batching Inference, online and offline (#4953) by [Jianghai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1)\r\n- [Inference]ADD Bench Chatglm2 script (#4963) by [Jianghai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1)\r\n- [inference] add reference and fix some bugs (#4937) by [Xu Kai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FXu-Kai)\r\n- [inference] Add smmoothquant for llama (#4904) by [Xu Kai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FXu-Kai)\r\n- [inference] add llama2 support (#4898) by [Xu Kai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FXu-Kai)\r\n- [inference]fix import bug and delete down useless init (#4830) by [Jianghai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1)\r\n\r\n### Test \r\n- [test] merge old components to test to model zoo (#4945) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [test] add no master test for low level zero plugin (#4934) by [Zhongkai Zhao](https:\u002F\u002Fapi.github.com\u002Fusers\u002FKKZ20)\r\n- Merge pull request #4856 from KKZ20\u002Ftest\u002Fmodel_support_for_low_level_zero by [ppt0011](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fppt0011)\r\n- [test] modify model supporting part of low_level_zero plugin (including correspoding docs) by Zhongkai Zhao\r\n\r\n### Refactor \r\n- [Refactor] Integrated some lightllm kernels into token-attention  (#4946) by [Cuiqing Li](https:\u002F\u002Fapi.github.com\u002Fusers\u002Ftiandiao123)\r\n\r\n### Nfc \r\n- [nfc] fix some typo with colossalai\u002F docs\u002F etc. (#4920) by [digger yu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdigger-yu)\r\n- [nfc] fix minor typo in README (#4846) by [Blagoy Simandoff](https:\u002F\u002Fapi.github.com\u002Fusers\u002FblagoySimandov)\r\n- [NFC] polish code style (#4799) by [Camille Zhong](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCamille7777)\r\n- [NFC] polish colossalai\u002Finference\u002Fquant\u002Fgptq\u002Fcai_gptq\u002F__init__.py code style (#4792) by [Michelle](https:\u002F\u002Fapi.github.com\u002Fusers\u002FMichelleMa8)\r\n\r\n### Format \r\n- [format] applied code formatting on changed files in pull request 4820 (#4886) by [github-actions[bot]](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fgithub-actions%5Bbot%5D)\r\n- [format] applied code formatting on changed files in pull request 4908 (#4918) by [github-actions[bot]](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fgithub-actions%5Bbot%5D)\r\n- [format] applied code formatting on changed files i","2023-11-01T05:57:35",{"id":253,"version":254,"summary_zh":255,"released_at":256},280646,"v0.3.3","## What's Changed \r\n\r\n### Release \r\n- [release] update version (#4775) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Inference \r\n- [inference] chatglm2 infer demo (#4724) by [Jianghai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1)\r\n\r\n### Feature \r\n- [feature] add gptq for inference (#4754) by [Xu Kai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FXu-Kai)\r\n- [Feature] The first PR to Add TP inference engine, kv-cache manager and related kernels for our inference system (#4577) by [Cuiqing Li](https:\u002F\u002Fapi.github.com\u002Fusers\u002Ftiandiao123)\r\n\r\n### Bug \r\n- [bug] Fix the version check bug in colossalai run when generating the cmd. (#4713) by [littsk](https:\u002F\u002Fapi.github.com\u002Fusers\u002Flittsk)\r\n- [bug] fix get_default_parser in examples (#4764) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n\r\n### Lazy \r\n- [lazy] support torch 2.0 (#4763) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Chat \r\n- [chat]: add lora merge weights config (#4766) by [Wenhao Chen](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCWHer)\r\n- [chat]: update rm, add wandb and fix bugs (#4471) by [Wenhao Chen](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCWHer)\r\n\r\n### Doc \r\n- [doc] add shardformer doc to sidebar (#4768) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [doc] clean up outdated docs (#4765) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- Merge pull request #4757 from ppt0011\u002Fmain by [ppt0011](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fppt0011)\r\n- [doc] put native colossalai plugins first in description section by [Pengtai Xu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fppt0011)\r\n- [doc] add model examples for each plugin by [Pengtai Xu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fppt0011)\r\n- [doc] put individual plugin explanation in front by [Pengtai Xu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fppt0011)\r\n- [doc] explain suitable use case for each plugin by [Pengtai Xu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fppt0011)\r\n- [doc] explaination of loading large pretrained models (#4741) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [doc] polish shardformer doc (#4735) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [doc] add shardformer support matrix\u002Fupdate tensor parallel documents (#4728) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [doc] Add user document for Shardformer (#4702) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [doc] fix llama2 code link (#4726) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n- [doc] add potential solution for OOM in llama2 example (#4699) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [doc] Update booster user documents. (#4669) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n\r\n### Shardformer \r\n- [shardformer] fix master param sync for hybrid plugin\u002Frewrite unwrapping logic (#4758) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [shardformer] add custom policy in hybrid parallel plugin (#4718) by [Xuanlei Zhao](https:\u002F\u002Fapi.github.com\u002Fusers\u002Foahzxl)\r\n- [shardformer] update seq parallel document (#4730) by [Bin Jia](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFoolPlayer)\r\n- [shardformer] update pipeline parallel document (#4725) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] to fix whisper test failed due to significant accuracy differences. (#4710) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] fix GPT2DoubleHeadsModel (#4703) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] update shardformer readme (#4689) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer]fix gpt2 double head (#4663) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] update llama2\u002Fopt finetune example and fix llama2 policy (#4645) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] Support customized policy for llamav2 based model with HybridParallelPlugin (#4624) by [eric8607242](https:\u002F\u002Fapi.github.com\u002Fusers\u002Feric8607242)\r\n\r\n### Misc \r\n- [misc] update pre-commit and run all files (#4752) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Format \r\n- [format] applied code formatting on changed files in pull request 4743 (#4750) by [github-actions[bot]](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fgithub-actions%5Bbot%5D)\r\n- [format] applied code formatting on changed files in pull request 4726 (#4727) by [github-actions[bot]](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fgithub-actions%5Bbot%5D)\r\n\r\n### Legacy \r\n- [legacy] clean up legacy code (#4743) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- Merge pull request #4738 from ppt0011\u002Fmain by [ppt0011](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fppt0011)\r\n- [legacy] remove deterministic data loader test by [Pengtai Xu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fppt0011)\r\n- [legacy] move communication and nn to legacy and refactor logger (#4671) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Kernel \r\n- [kernel] update triton init #4740 (#4740) by [Xuanlei Zhao](https:\u002F\u002Fapi.github.com\u002Fusers\u002Foahzxl)\r\n\r\n### Example \r\n- [exam","2023-09-22T10:30:08",{"id":258,"version":259,"summary_zh":260,"released_at":261},280647,"v0.3.2","## What's Changed \r\n\r\n### Release \r\n- [release] update version (#4623) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Shardformer \r\n- Merge pull request #4612 from hpcaitech\u002Ffeature\u002Fshardformer by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [shardformer] update shardformer readme (#4617) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] Add overlap optional for HybridParallelPlugin (#4615) by [Bin Jia](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFoolPlayer)\r\n- [shardformer] update bert finetune example with HybridParallelPlugin (#4584) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] Pytree fix (#4533) by [Jianghai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1)\r\n- [shardformer] support from_pretrained when loading model with HybridParallelPlugin (#4575) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [shardformer] fix submodule replacement bug when enabling pp (#4544) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [shardformer] support pp+tp+zero1 tests (#4531) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] fix opt test hanging (#4521) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] Add overlap support for gpt2 (#4535) by [Bin Jia](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFoolPlayer)\r\n- [shardformer] fix emerged bugs after updating transformers (#4526) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [shardformer] zero1+pp and the corresponding tests (#4517) by [Jianghai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1)\r\n- [shardformer] support sharded checkpoint IO for models of HybridParallelPlugin  (#4506) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [shardformer] opt fix. (#4514) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] vit\u002Fllama\u002Ft5 ignore the sequence parallelism flag and some fix. (#4498) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] tests for 3d parallel (#4493) by [Jianghai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1)\r\n- [shardformer] chatglm support sequence parallel (#4482) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] support tp+zero for shardformer (#4472) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [shardformer] Pipeline\u002Fwhisper (#4456) by [Jianghai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1)\r\n- [shardformer] bert support sequence parallel. (#4455) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] bloom support sequence parallel (#4465) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] support interleaved pipeline (#4448) by [LuGY](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGy-Lu)\r\n- [shardformer] support DDP in HybridPlugin\u002Fadd tp+dp tests (#4446) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [shardformer] fix import by [ver217](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [shardformer] fix embedding by [ver217](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n- [shardformer] update bloom\u002Fllama\u002Fvit\u002Fchatglm tests (#4420) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer]update t5 tests for using all optimizations. (#4407) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] update tests for all optimization (#4413) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] rewrite tests for opt\u002Fbloom\u002Fllama\u002Fvit\u002Fchatglm (#4395) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [shardformer]fix, test gpt2 for AMP+TP (#4403) by [flybird11111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] test all optimizations (#4399) by [flybird1111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] update shardformer to use flash attention 2 (#4392) by [flybird1111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [Shardformer] Merge flash attention branch to pipeline branch (#4362) by [flybird1111](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fflybird11111)\r\n- [shardformer] add util functions for shardformer tests\u002Ffix sync_shared_param (#4366) by [Baizhou Zhang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFridge003)\r\n- [shardformer] support Blip2 (#4243) by [FoolPlayer](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFoolPlayer)\r\n- [shardformer] support ChatGLMForConditionalGeneration & add fusedlayernorm for vit by [klhhhhh](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fklhhhhh)\r\n- [shardformer] pre-commit check files by [klhhhhh](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fklhhhhh)\r\n- [shardformer] register without auto policy by [klhhhhh](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fklhhhhh)\r\n- [shardformer] ChatGLM support layernorm sharding by [klhhhhh](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fklhhhhh)\r\n- [shardformer] delete some file by [klhhhhh](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fklhhhhh)\r\n- [shardformer] support chatglm without layernorm by [klhhhhh](https:\u002F\u002Fapi.github.com\u002Fuse","2023-09-06T15:42:16",{"id":263,"version":264,"summary_zh":265,"released_at":266},280648,"v0.3.1","## What's Changed \r\n\r\n### Release \r\n- [release] update version (#4332) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Chat \r\n- [chat] fix compute_approx_kl (#4338) by [Wenhao Chen](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCWHer)\r\n- [chat] removed cache file (#4155) by [Frank Lee](https:\u002F\u002Fapi.github.com\u002Fusers\u002FFrankLeeeee)\r\n- [chat] use official transformers and fix some issues (#4117) by [Wenhao Chen](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCWHer)\r\n- [chat] remove naive strategy and split colossalai strategy (#4094) by [Wenhao Chen](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCWHer)\r\n- [chat] refactor trainer class (#4080) by [Wenhao Chen](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCWHer)\r\n- [chat]: fix chat evaluation possible bug (#4064) by [Michelle](https:\u002F\u002Fapi.github.com\u002Fusers\u002FMichelleMa8)\r\n- [chat] refactor strategy class with booster api (#3987) by [Wenhao Chen](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCWHer)\r\n- [chat] refactor actor class (#3968) by [Wenhao Chen](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCWHer)\r\n- [chat] add distributed PPO trainer (#3740) by [Hongxin Liu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fver217)\r\n\r\n### Zero \r\n- [zero] optimize the optimizer step time (#4221) by [LuGY](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGy-Lu)\r\n- [zero] support shard optimizer state dict of zero (#4194) by [LuGY](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGy-Lu)\r\n- [zero] add state dict for low level zero (#4179) by [LuGY](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGy-Lu)\r\n- [zero] allow passing process group to zero12 (#4153) by [LuGY](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGy-Lu)\r\n- [zero]support no_sync method for zero1 plugin (#4138) by [LuGY](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGy-Lu)\r\n- [zero] refactor low level zero for shard evenly (#4030) by [LuGY](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGy-Lu)\r\n\r\n### Nfc \r\n- [NFC] polish applications\u002FChat\u002Fcoati\u002Fmodels\u002Futils.py codestyle (#4277) by [yuxuan-lou](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyuxuan-lou)\r\n- [NFC] polish applications\u002FChat\u002Fcoati\u002Ftrainer\u002Fstrategies\u002Fbase.py code style (#4278) by [Zirui Zhu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fziruizhu)\r\n- [NFC] polish applications\u002FChat\u002Fcoati\u002Fmodels\u002Fgeneration.py code style (#4275) by [RichardoLuo](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyangluo7)\r\n- [NFC] polish applications\u002FChat\u002Finference\u002Fserver.py code style (#4274) by [Yuanchen](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fchengeharrison)\r\n- [NFC] fix format of application\u002FChat\u002Fcoati\u002Ftrainer\u002Futils.py (#4273) by [アマデウス](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fkurisusnowdeng)\r\n- [NFC] polish applications\u002FChat\u002Fexamples\u002Ftrain_reward_model.py code style (#4271) by [Xu Kai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FXu-Kai)\r\n- [NFC] fix: format (#4270) by [dayellow](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdayellow)\r\n- [NFC] polish runtime_preparation_pass style (#4266) by [Wenhao Chen](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCWHer)\r\n- [NFC] polish unary_elementwise_generator.py code style (#4267) by [YeAnbang](https:\u002F\u002Fapi.github.com\u002Fusers\u002FYeAnbang)\r\n- [NFC] polish applications\u002FChat\u002Fcoati\u002Ftrainer\u002Fbase.py code style (#4260) by [shenggan](https:\u002F\u002Fapi.github.com\u002Fusers\u002FShenggan)\r\n- [NFC] polish applications\u002FChat\u002Fcoati\u002Fdataset\u002Fsft_dataset.py code style (#4259) by [Zheng Zangwei (Alex Zheng)](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fzhengzangw)\r\n- [NFC] polish colossalai\u002Fbooster\u002Fplugin\u002Flow_level_zero_plugin.py code style (#4256) by [梁爽](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fsupercooledith)\r\n- [NFC] polish colossalai\u002Fauto_parallel\u002Foffload\u002Famp_optimizer.py code style (#4255) by [Yanjia0](https:\u002F\u002Fapi.github.com\u002Fusers\u002FYanjia0)\r\n- [NFC] polish colossalai\u002Fcli\u002Fbenchmark\u002Futils.py code style (#4254) by [ocd_with_naming](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fyuanheng-zhao)\r\n- [NFC] policy applications\u002FChat\u002Fexamples\u002Fray\u002Fmmmt_prompt.py code style (#4250) by [CZYCW](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCZYCW)\r\n- [NFC] polish applications\u002FChat\u002Fcoati\u002Fmodels\u002Fbase\u002Factor.py code style (#4248) by [Junming Wu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fjason524w)\r\n- [NFC] polish applications\u002FChat\u002Finference\u002Frequirements.txt code style (#4265) by [Camille Zhong](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCamille7777)\r\n- [NFC] Fix format for mixed precision (#4253) by [Jianghai](https:\u002F\u002Fapi.github.com\u002Fusers\u002FCjhHa1)\r\n- [nfc]fix ColossalaiOptimizer is not defined (#4122) by [digger yu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdigger-yu)\r\n- [nfc] fix dim not defined and fix typo (#3991) by [digger yu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdigger-yu)\r\n- [nfc] fix typo colossalai\u002Fzero (#3923) by [digger yu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdigger-yu)\r\n- [nfc]fix typo colossalai\u002Fpipeline tensor nn (#3899) by [digger yu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdigger-yu)\r\n- [nfc] fix typo colossalai\u002Fnn  (#3887) by [digger yu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdigger-yu)\r\n- [nfc] fix typo colossalai\u002Fcli fx kernel (#3847) by [digger yu](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fdigger-yu)\r\n\r\n### Example \r\n- Fix\u002Fformat (#4261) by [Michelle](https:\u002F\u002Fapi.github.com\u002Fusers\u002FMichelleMa8)\r\n- [example] add llama pretraining (#4257) by [binmakeswell](https:\u002F\u002Fapi.github.com\u002Fusers\u002Fbinmakeswell)\r\n- [example] fix bucket size in example of gpt gemini (#4028) by [LuGY](https:\u002F\u002Fapi.github.com\u002Fusers\u002FGy-Lu)\r\n- [example] update ViT example using booster api (#3940) ","2023-08-01T07:02:44"]