[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-EleutherAI--gpt-neox":3,"tool-EleutherAI--gpt-neox":65},[4,17,27,35,48,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",152630,2,"2026-04-12T23:33:54",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":10,"last_commit_at":33,"category_tags":34,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 
都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,"2026-04-10T11:13:16",[26,43,44,45,14,46,15,13,47],"数据工具","视频","插件","其他","音频",{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":54,"last_commit_at":55,"category_tags":56,"status":16},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 
社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[15,43,46],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":54,"last_commit_at":63,"category_tags":64,"status":16},6590,"gpt4all","nomic-ai\u002Fgpt4all","GPT4All 是一款让普通电脑也能轻松运行大型语言模型（LLM）的开源工具。它的核心目标是打破算力壁垒，让用户无需依赖昂贵的显卡（GPU）或云端 API，即可在普通的笔记本电脑和台式机上私密、离线地部署和使用大模型。\n\n对于担心数据隐私、希望完全掌控本地数据的企业用户、研究人员以及技术爱好者来说，GPT4All 提供了理想的解决方案。它解决了传统大模型必须联网调用或需要高端硬件才能运行的痛点，让日常设备也能成为强大的 AI 助手。无论是希望构建本地知识库的开发者，还是单纯想体验私有化 AI 聊天的普通用户，都能从中受益。\n\n技术上，GPT4All 基于高效的 `llama.cpp` 后端，支持多种主流模型架构（包括最新的 DeepSeek R1 蒸馏模型），并采用 GGUF 格式优化推理速度。它不仅提供界面友好的桌面客户端，支持 Windows、macOS 和 Linux 等多平台一键安装，还为开发者提供了便捷的 Python 库，可轻松集成到 LangChain 等生态中。通过简单的下载和配置，用户即可立即开始探索本地大模型的无限可能。",77307,"2026-04-11T06:52:37",[15,13],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":71,"readme_en":72,"readme_zh":73,"quickstart_zh":74,"use_case_zh":75,"hero_image_url":76,"owner_login":77,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":81,"owner_twitter":82,"owner_website":83,"owner_url":84,"languages":85,"stars":114,"forks":115,"last_commit_at":116,"license":117,"difficulty_score":118,"env_os":119,"env_gpu":120,"env_ram":121,"env_deps":122,"category_tags":133,"github_topics":134,"view_count":10,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":139,"updated_at":140,"faqs":141,"releases":172},7027,"EleutherAI\u002Fgpt-neox","gpt-neox","An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries","GPT-NeoX 是 EleutherAI 推出的开源库，专为在 GPU 集群上从头训练超大规模语言模型而设计。它基于 NVIDIA 的 Megatron 框架，并深度融合了 DeepSpeed 技术，旨在解决数十亿参数级模型在分布式训练中面临的显存受限、效率低下及硬件适配复杂等核心难题。\n\n这款工具特别适合需要研发巨型基础模型的研究机构、高校实验室及企业算法团队。如果您仅需使用现成模型进行推理或微调小型模型，Hugging Face transformers 可能是更轻便的选择；但若您致力于探索模型训练的边界，GPT-NeoX 则是理想伙伴。\n\n其独特亮点在于卓越的硬件兼容性与前沿架构支持。它不仅能在 AWS、CoreWeave 等云平台运行，还能完美适配 Summit、Frontier、LUMI 等顶级超级计算机，支持 Slurm、MPI 等多种调度系统。技术上，它集成了 
ZeRO 优化、3D 并行策略、旋转位置编码（RoPE）、Flash Attention 以及最新的 Transformer Engine 加速，并提供 Pythia、LLaMA、Falcon 等主流架构的配置模","GPT-NeoX 是 EleutherAI 推出的开源库，专为在 GPU 集群上从头训练超大规模语言模型而设计。它基于 NVIDIA 的 Megatron 框架，并深度融合了 DeepSpeed 技术，旨在解决数十亿参数级模型在分布式训练中面临的显存受限、效率低下及硬件适配复杂等核心难题。\n\n这款工具特别适合需要研发巨型基础模型的研究机构、高校实验室及企业算法团队。如果您仅需使用现成模型进行推理或微调小型模型，Hugging Face transformers 可能是更轻便的选择；但若您致力于探索模型训练的边界，GPT-NeoX 则是理想伙伴。\n\n其独特亮点在于卓越的硬件兼容性与前沿架构支持。它不仅能在 AWS、CoreWeave 等云平台运行，还能完美适配 Summit、Frontier、LUMI 等顶级超级计算机，支持 Slurm、MPI 等多种调度系统。技术上，它集成了 ZeRO 优化、3D 并行策略、旋转位置编码（RoPE）、Flash Attention 以及最新的 Transformer Engine 加速，并提供 Pythia、LLaMA、Falcon 等主流架构的配置模板。此外，它还原生支持 DPO 偏好学习与 RWKV 架构，并能无缝对接 Hugging Face 生态及 WandB 等监控工具，帮助开发者高效推进大模型研究。","[![GitHub issues](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002FEleutherAI\u002Fgpt-neox)](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fissues)\n[\u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fwandb\u002Fassets\u002Fmain\u002Fwandb-github-badge-28.svg\" alt=\"Weights & Biases monitoring\" height=20>](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fneox)\n\n# GPT-NeoX\n\nThis repository records [EleutherAI](https:\u002F\u002Fwww.eleuther.ai)'s library for training large-scale language models on GPUs. Our current framework is based on NVIDIA's [Megatron Language Model](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) and has been augmented with techniques from [DeepSpeed](https:\u002F\u002Fwww.deepspeed.ai) as well as some novel optimizations. We aim to make this repo a centralized and accessible place to gather techniques for training large-scale autoregressive language models, and accelerate research into large-scale training. 
This library is in widespread use in [academic, industry, and government labs](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox#adoption-and-publications), including by researchers at Oak Ridge National Lab, CarperAI, Stability AI, Together.ai, Korea University, Carnegie Mellon University, and the University of Tokyo among others. Uniquely among similar libraries GPT-NeoX supports a wide variety of systems and hardwares, including launching via Slurm, MPI, and the IBM Job Step Manager, and has been run at scale on [AWS](https:\u002F\u002Faws.amazon.com\u002F), [CoreWeave](https:\u002F\u002Fwww.coreweave.com\u002F), [ORNL Summit](https:\u002F\u002Fwww.olcf.ornl.gov\u002Fsummit\u002F), [ORNL Frontier](https:\u002F\u002Fwww.olcf.ornl.gov\u002Ffrontier\u002F),  [LUMI](https:\u002F\u002Fwww.lumi-supercomputer.eu\u002F), and others.\n\n**If you are not looking to train models with billions of parameters from scratch, this is likely the wrong library to use. For generic inference needs, we recommend you use the Hugging Face `transformers` library instead which supports GPT-NeoX models.**\n\n## Why GPT-NeoX?\n\nGPT-NeoX leverages many of the same features and technologies as the popular Megatron-DeepSpeed library but with substantially increased usability and novel optimizations. 
Major features include:\n* Distributed training with ZeRO and 3D parallelism\n* A wide variety of systems and hardwares, including launching via Slurm, MPI, and the IBM Job Step Manager, and has been run at scale on [AWS](https:\u002F\u002Faws.amazon.com\u002F), [CoreWeave](https:\u002F\u002Fwww.coreweave.com\u002F), Oak Ridge's [Summit](https:\u002F\u002Fwww.olcf.ornl.gov\u002Fsummit\u002F) and [Frontier](https:\u002F\u002Fwww.olcf.ornl.gov\u002Ffrontier\u002F),  [Pacific Northwest National Laboratory](https:\u002F\u002Fhpc.pnl.gov\u002Findex.shtml), Argonne's [Polaris](https:\u002F\u002Fdocs.alcf.anl.gov\u002Fpolaris\u002Fdata-science-workflows\u002Fapplications\u002Fgpt-neox\u002F), [LUMI](https:\u002F\u002Fwww.lumi-supercomputer.eu\u002F), and more.\n* Cutting edge architectural innovations including rotary and alibi positional embeddings, parallel feedforward attention layers, and flash attention.\n* Predefined configurations for popular architectures including Pythia, PaLM, Falcon, and LLaMA 1 \\& 2\n* Curriculum Learning\n* Easy connections with the open source ecosystem, including Hugging Face's [tokenizers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftokenizers) and [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002F) libraries, monitor experiments via [WandB](https:\u002F\u002Fwandb.ai\u002Fsite)\u002F[Comet](https:\u002F\u002Fwww.comet.com\u002Fsite\u002F)\u002FTensorBoard, and evaluation via our [Language Model Evaluation Harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness).\n\n## News\n**[10\u002F9\u002F2024]** We now support [Transformer Engine](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTransformerEngine) integration\n\n**[9\u002F9\u002F2024]** We now support preference learning via [DPO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18290), [KTO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01306), and reward modeling\n\n**[9\u002F9\u002F2024]** We now support integration 
with [Comet ML](https:\u002F\u002Fwww.comet.com\u002Fsite\u002F), a machine learning monitoring platform\n\n**[5\u002F21\u002F2024]** We now support [RWKV](https:\u002F\u002Fwww.rwkv.com\u002F) with pipeline parallelism!. See the PRs for [RWKV](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fpull\u002F1198) and [RWKV+pipeline](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fpull\u002F1221)\n\n**[3\u002F21\u002F2024]** We now support Mixture-of-Experts (MoE)\n\n**[3\u002F17\u002F2024]** We now support AMD MI250X GPUs\n\n**[3\u002F15\u002F2024]** We now support [Mamba](https:\u002F\u002Fgithub.com\u002Fstate-spaces\u002Fmamba) with tensor parallelism! See [the PR](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fpull\u002F1184)\n\n**[8\u002F10\u002F2023]** We now support checkpointing with AWS S3! Activate with the `s3_path` config option (for more detail, see [the PR](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fpull\u002F1010))\n\n**[9\u002F20\u002F2023]** As of https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fpull\u002F1035, we have deprecated Flash Attention 0.x and 1.x, and migrated support to Flash Attention 2.x. 
We don't believe this will cause problems, but if you have a specific use-case that requires old flash support using the latest GPT-NeoX, please raise an issue.\n\n**[8\u002F10\u002F2023]** We have experimental support for LLaMA 2 and Flash Attention v2 supported in our [math-lm](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fmath-lm) project that will be upstreamed later this month.\n\n**[5\u002F17\u002F2023]** After fixing some miscellaneous bugs we now fully support bf16.\n\n**[4\u002F11\u002F2023]** We have upgraded our Flash Attention implementation to now support Alibi positional embeddings.\n\n**[3\u002F9\u002F2023]** We have released GPT-NeoX 2.0.0, an upgraded version built on the latest DeepSpeed which will be regularly synced with going forward.\n\n## Versions\n\nPrior to 3\u002F9\u002F2023, GPT-NeoX relied on [DeeperSpeed](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002FDeeperSpeed), which was based on an old version of DeepSpeed (0.3.15). In order to migrate to the latest upstream DeepSpeed version while allowing users to access the old versions of GPT-NeoX and DeeperSpeed, we have introduced two versioned releases for both libraries:\n\n- Version 2.0 of [GPT-NeoX](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Freleases\u002Ftag\u002Fv2.0) and [DeeperSpeed](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002FDeeperSpeed\u002Freleases\u002Ftag\u002Fv2.0) are the latest versions built on the latest DeepSpeed, and will be maintained going forward.\n- Version 1.0 of [GPT-NeoX](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Freleases\u002Ftag\u002Fv1.0) and [DeeperSpeed](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002FDeeperSpeed\u002Freleases\u002Ftag\u002Fv1.0) maintain snapshots of the old stable versions that [GPT-NeoX-20B](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.06745) and the [Pythia Suite](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia) were trained on.\n\n# Contents\n\n- [GPT-NeoX](#gpt-neox)\n  
* [Why GPT-NeoX?](#why-gpt-neox)\n  * [News](#news)\n  * [Versions](#versions)\n- [Contents](#contents)\n- [Quick Start](#quick-start)\n  * [Environment and Dependencies](#environment-and-dependencies)\n    + [Host Setup](#host-setup)\n    + [Flash Attention](#flash-attention)\n    + [Transformer Engine](#transformer-engine)\n    + [Multi-Node Launching](#multi-node-launching)\n    + [Containerized Setup](#containerized-setup)\n  * [Usage](#usage)\n- [Configuration](#configuration)\n  * [Mixture of Experts](#mixture-of-experts)\n- [Datasets](#datasets)\n  * [Preconfigured Datasets](#preconfigured-datasets)\n  * [Using Custom Data](#using-custom-data)\n- [Training and Finetuning](#training-and-finetuning)\n  * [Pretrained Models](#pretrained-models)\n    + [GPT-NeoX-20B](#gpt-neox-20b)\n    + [Pythia](#pythia)\n    + [Polyglot](#polyglot)\n- [Inference](#inference)\n- [Evaluation](#evaluation)\n- [Exporting to Hugging Face](#exporting-to-hugging-face)\n- [Monitoring](#monitoring)\n  * [Weights and Biases](#weights-and-biases)\n  * [TensorBoard](#tensorboard)\n- [Running on multi-node](#running-on-multi-node)\n- [Profiling](#profiling)\n- [Adoption and Publications](#adoption-and-publications)\n  * [Publications](#publications)\n  * [Models](#models)\n    + [English LLMs](#english-llms)\n    + [Non-English LLMs](#non-english-llms)\n    + [Code Models](#code-models)\n    + [Other Modalities](#other-modalities)\n- [Administrative Notes](#administrative-notes)\n  * [Citing GPT-NeoX](#citing-gpt-neox)\n  * [Contributing](#contributing)\n  * [Licensing](#licensing)\n  * [Acknowledgements](#acknowledgements)\n\n# Quick Start\n\n## Environment and Dependencies\n\n### Host Setup\n\nThis codebase has primarily been developed and tested for Python 3.8-3.10 and PyTorch 1.8-2.0. 
This is not a strict requirement, and other versions and combinations of libraries may work.\n\nTo install the remaining basic dependencies, run:\n\n```bash\npip install -r requirements\u002Frequirements.txt\npip install -r requirements\u002Frequirements-wandb.txt # optional, if logging using WandB\npip install -r requirements\u002Frequirements-tensorboard.txt # optional, if logging via tensorboard\npip install -r requirements\u002Frequirements-comet.txt # optional, if logging via Comet\n```\n\nfrom the repository root.\n\n> [!Warning]\n> Our codebase relies on [DeeperSpeed](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002FDeeperSpeed), our fork of the [DeepSpeed](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) library with some added changes. We strongly recommend using Anaconda, a virtual machine, or some other form of environment isolation before continuing. Failure to do so may cause other repositories that rely on DeepSpeed to break.\n\n### Fused Kernels\nWe now support AMD GPUs (MI100, MI250X) through JIT fused-kernel compilation. Fused kernels are built and loaded as needed. To avoid waiting at job launch, you can pre-build them manually from a Python interpreter started in the repository root:\n\n```python\n# Pre-build the fused kernels ahead of job launch\nfrom megatron.fused_kernels import load\nload()\n```\nThis automatically adapts the build process to the GPU vendor (AMD or NVIDIA) without platform-specific code changes. To test the fused kernels, run `pytest tests\u002Fmodel\u002Ftest_fused_kernels.py`.\n\n### Flash Attention\n\nTo use [Flash-Attention](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fflash-attention), install the additional dependencies in `.\u002Frequirements\u002Frequirements-flashattention.txt` or use a PyTorch NGC container with it pre-installed (note that functionality is not guaranteed using versions different from our requirements file). 
Then set the attention type in your configuration accordingly (see [configs](.\u002Fconfigs\u002F)). This can provide significant speed-ups over regular attention on certain GPU architectures, including Ampere GPUs (such as A100s); see the repository for more details.\n\n### Transformer Engine\n\nTo use [Transformer Engine (TE)](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTransformerEngine), install the additional dependencies in `.\u002Frequirements\u002Frequirements-transformer-engine.txt` or use a PyTorch NGC container with it pre-installed (note that functionality is not guaranteed using versions different from our requirements file). See [this config](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Fconfigs\u002F1-3B-transformer-engine.yml) for an example of using TE on a 1.3B model. This can provide significant speed-ups over regular attention on certain GPU architectures, including Ampere and Hopper GPUs; see the repository for more details.\n\nTE provides very efficient kernels for both A100 and H100 GPUs. We've run some sample ablations on A100:\n\n[Figure: sample TE ablations on A100]\n\nand H100:\n\n[Figure: sample TE ablations on H100]\n\n### Multi-Node Launching\n\nNeoX and Deep(er)Speed support training across multiple nodes, and you can choose among several launchers to orchestrate multi-node jobs.\n\nIn general, there needs to be a \"hostfile\" somewhere accessible with the format:\n\n```bash\nnode1_ip slots=8\nnode2_ip slots=8\n```\n\nwhere the first column contains the IP address of each node in your setup and the number of slots is the number of GPUs that node has access to. In your config you must pass in the path to the hostfile with `\"hostfile\": \"\u002Fpath\u002Fto\u002Fhostfile\"`. 
Alternatively, the path to the hostfile can be set in the environment variable `DLTS_HOSTFILE`.\n\n#### pdsh\n\n`pdsh` is the default launcher. If you're using `pdsh`, all you must do (besides ensuring that `pdsh` is installed in your environment) is set `{\"launcher\": \"pdsh\"}` in your config files.\n\n#### MPI\n\nIf using MPI, you must specify the MPI library (DeepSpeed\u002FGPT-NeoX currently supports `mvapich`, `openmpi`, `mpich`, and `impi`, though `openmpi` is the most commonly used and tested) and pass the `deepspeed_mpi` flag in your config file:\n\n```json\n{\n    \"launcher\": \"openmpi\",\n    \"deepspeed_mpi\": true\n}\n```\n\nWith your environment properly set up and the correct configuration files in place, you can use `deepy.py` like a normal Python script and start (for example) a training job with:\n\n`python3 deepy.py train.py \u002Fpath\u002Fto\u002Fconfigs\u002Fmy_model.yml`\n\n#### Slurm\n\nUsing Slurm can be slightly more involved. As with MPI, you must add the following to your config:\n\n```json\n{\n    \"launcher\": \"slurm\",\n    \"deepspeed_slurm\": true\n}\n```\nIf you do not have ssh access to the compute nodes in your Slurm cluster, you also need to add `{\"no_ssh_check\": true}`.\n\n#### (Advanced) Custom Launching\n\nThere are many cases where the default launching options above are not sufficient:\n\n- Many clusters have their own unique job scheduler or specific MPI\u002FSlurm arguments necessary for launching jobs, such as [Summit JSRun](https:\u002F\u002Fdocs.olcf.ornl.gov\u002Fsystems\u002Fsummit_user_guide.html#job-launcher-jsrun) or [LLNL Flux](https:\u002F\u002Fcomputing.llnl.gov\u002Fprojects\u002Fflux-building-framework-resource-management).\n- While the default Slurm\u002FMPI\u002Fpdsh options above are enough for most job runs, advanced users may want to add arguments for optimization or debugging purposes.\n\nIn these cases, you will need to modify the DeepSpeed [multinode 
runner](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed\u002Fblob\u002F17957728c0362bf8ae70feca308e491e55ef9feb\u002Fdeepspeed\u002Flauncher\u002Fmultinode_runner.py) utility to support your usecase. Broadly, these enhancements fall under two categories:\n\n##### 1. Adding a Launcher (e.g. [JSRun](https:\u002F\u002Fdocs.olcf.ornl.gov\u002Fsystems\u002Fsummit_user_guide.html#job-launcher-jsrun), [Flux](https:\u002F\u002Fcomputing.llnl.gov\u002Fprojects\u002Fflux-building-framework-resource-management), etc)\n\nIn this case, you must add a new multinode runner class to `deepspeed\u002Flauncher\u002Fmultinode_runner.py` and expose it as a configuration option in GPT-NeoX. Examples on how we did this for [Summit JSRun](https:\u002F\u002Fdocs.olcf.ornl.gov\u002Fsystems\u002Fsummit_user_guide.html#job-launcher-jsrun) are in [this DeeperSpeed commit](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002FDeeperSpeed\u002Fcommit\u002F9aed6c8500d7c492d85c5c88687322dbda70e370) and [this GPT-NeoX commit](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fcommit\u002F3782c7ae60f8624e566e3879b89bb09e8b59b869), respectively.\n\n##### 2. Modifying Run Command or Environment Variables\n\nWe have encountered many cases where we wish to modify the MPI\u002FSlurm run command for an optimization or to debug (e.g. to modify the [Slurm srun CPU binding](https:\u002F\u002Fslurm.schedmd.com\u002Fsrun.html#OPT_cpu-bind) or to tag MPI logs with the rank). In this case, you must modify the multinode runner class' run command under its `get_cmd` method (e.g. [mpirun_cmd](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed\u002Fblob\u002F17957728c0362bf8ae70feca308e491e55ef9feb\u002Fdeepspeed\u002Flauncher\u002Fmultinode_runner.py#L135-L147) for OpenMPI). 
Examples of how we did this to provide optimized and rank-tagged run commands using Slurm and OpenMPI for the Stability cluster are in [this DeeperSpeed branch](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed\u002Fcompare\u002Fmaster...EleutherAI:DeeperSpeed:v2.0-stability).\n\n\n#### Hostfile Generation\n\nIn general you will not be able to use a single fixed hostfile, so you need a script that generates one dynamically when your job starts. An example script that generates a hostfile using [Slurm](https:\u002F\u002Fslurm.schedmd.com\u002Fdocumentation.html) with 8 GPUs per node is:\n\n```bash\n#!\u002Fbin\u002Fbash\nGPUS_PER_NODE=8\nmkdir -p \u002Fsample\u002Fpath\u002Fto\u002Fhostfiles\n# need to add the current slurm jobid to hostfile name so that we don't add to previous hostfile\nhostfile=\u002Fsample\u002Fpath\u002Fto\u002Fhostfiles\u002Fhosts_$SLURM_JOBID\n# be extra sure we aren't appending to a previous hostfile\nrm $hostfile &> \u002Fdev\u002Fnull\n# loop over the node names\nfor i in `scontrol show hostnames $SLURM_NODELIST`\ndo\n    # add a line to the hostfile\n    echo $i slots=$GPUS_PER_NODE >>$hostfile\ndone\n```\n\n`$SLURM_JOBID` and `$SLURM_NODELIST` are environment variables that Slurm sets for you. See the [sbatch documentation](https:\u002F\u002Fslurm.schedmd.com\u002Fsbatch.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES) for a full list of the Slurm environment variables set at job creation time.\n\n#### Job Launching\n\nThen you can create an [sbatch](https:\u002F\u002Fslurm.schedmd.com\u002Fsbatch.html) script from which to kick off your GPT-NeoX job. 
A bare-bones sbatch script on a Slurm-based cluster with 8 GPUs per node would look like this:\n\n```bash\n#!\u002Fbin\u002Fbash\n#SBATCH --job-name=\"neox\"\n#SBATCH --partition=your-partition\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=8\n#SBATCH --gres=gpu:8\n\n# Some potentially useful distributed environment variables\nexport HOSTNAMES=`scontrol show hostnames \"$SLURM_JOB_NODELIST\"`\nexport MASTER_ADDR=$(scontrol show hostnames \"$SLURM_JOB_NODELIST\" | head -n 1)\nexport MASTER_PORT=12802\nexport COUNT_NODE=`scontrol show hostnames \"$SLURM_JOB_NODELIST\" | wc -l`\n\n# Your hostfile creation script from above\n.\u002Fwrite_hostfile.sh\n# Tell DeepSpeed where to find our generated hostfile via DLTS_HOSTFILE\nexport DLTS_HOSTFILE=\u002Fsample\u002Fpath\u002Fto\u002Fhostfiles\u002Fhosts_$SLURM_JOBID\n\n# Launch training\npython3 deepy.py train.py \u002Fsample\u002Fpath\u002Fto\u002Fyour\u002Fconfigs\u002Fmy_model.yml\n\n```\n\nYou can then kick off a training run with `sbatch my_sbatch_script.sh`\n\n\n### Containerized Setup\n\nWe provide containers for [Apptainer](#apptainer) (formerly Singularity) and [Docker](#docker) under [containers](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Fcontainers\u002F).\n\n#### Docker\n\nIf you prefer to run NeoX in a Docker container, we provide a Dockerfile and docker-compose configuration.\n\nRequirements to run the container are to have appropriate GPU drivers, an up-to-date installation of Docker, and [nvidia-container-toolkit](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html) installed. 
To test if your installation is good you can use their \"sample workload\", which is:\n\n```\ndocker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi\n```\n\nProvided that will run, you need to export NEOX_DATA_PATH and NEOX_CHECKPOINT_PATH in your environment to specify your data directory and directory for storing and loading checkpoints:\n\n```\nexport NEOX_DATA_PATH=\u002Fmnt\u002Fsda\u002Fdata\u002Fenwiki8 #or wherever your data is stored on your system\nexport NEOX_CHECKPOINT_PATH=\u002Fmnt\u002Fsda\u002Fcheckpoints\n```\n\nAnd then, from the `gpt-neox\u002Fcontainers\u002Fdocker` directory, you can build the image and run a shell in a container with\n\n```\ndocker compose run gpt-neox bash\n```\n\nAfter the build, you should be able to do this:\n```\nmchorse@537851ed67de:~$ echo $(pwd)\n\u002Fhome\u002Fmchorse\nmchorse@537851ed67de:~$ ls -al\ntotal 48\ndrwxr-xr-x  1 mchorse mchorse 4096 Jan  8 05:33 .\ndrwxr-xr-x  1 root    root    4096 Jan  8 04:09 ..\n-rw-r--r--  1 mchorse mchorse  220 Feb 25  2020 .bash_logout\n-rw-r--r--  1 mchorse mchorse 3972 Jan  8 04:09 .bashrc\ndrwxr-xr-x  4 mchorse mchorse 4096 Jan  8 05:35 .cache\ndrwx------  3 mchorse mchorse 4096 Jan  8 05:33 .nv\n-rw-r--r--  1 mchorse mchorse  807 Feb 25  2020 .profile\ndrwxr-xr-x  2 root    root    4096 Jan  8 04:09 .ssh\ndrwxrwxr-x  8 mchorse mchorse 4096 Jan  8 05:35 chk\ndrwxrwxrwx  6 root    root    4096 Jan  7 17:02 data\ndrwxr-xr-x 11 mchorse mchorse 4096 Jan  8 03:52 gpt-neox\n```\n\nFor a long-running job, you should run\n\n```\ndocker compose up -d\n```\n\nto run the container in detached mode, and then, in a separate terminal session, run\n\n```\ndocker compose exec gpt-neox bash\n```\n\nYou can then run any job you want from inside the container.\n\nConcerns when running for a long time or in detached mode include\n - You will have to terminate the container manually when you are no longer using it\n - If you want processes to continue running when your shell session ends, you 
will need to background them.\n - If you then want logging, you will have to make sure to pipe logs to disk, and set up wandb and\u002For Comet logging.\n\nIf you prefer to run the prebuilt container image from Docker Hub, you can run the docker compose commands with ```-f docker-compose-dockerhub.yml``` instead, e.g.,\n\n```\ndocker compose -f containers\u002Fdocker\u002Fdocker-compose-dockerhub.yml run gpt-neox bash\n```\n\n#### Singularity\u002FApptainer\n\nWe also support Apptainer (formerly Singularity) deployments. Some users find Apptainer useful on systems that don't provide root access, such as shared HPC systems at national labs and universities.\n\nTo run the container you need appropriate GPU drivers and an up-to-date installation of Apptainer. You can build an image from our Apptainer definition file by running:\n\n```\ncd containers\u002Fapptainer\u002F\napptainer build gpt-neox.sif gpt-neox.def\n```\n\nYou can use your new image in a few ways:\n\n1. Run a shell inside the container:\n```\napptainer shell --nv --bind \u002Fpath\u002Fto\u002Fdata:\u002Fdata,\u002Fpath\u002Fto\u002Fcode:\u002Fcode gpt-neox.sif\n```\n\nUse the `--nv` flag to enable NVIDIA GPU support.\n\n2. 
Execute a command directly:\n```\napptainer exec --nv gpt-neox.sif python your_script.py\n```\n\nFor this method, you'll need to make the requirements files and fused_kernels directory available during the build process, either by:\n\n- Using the `--bind` option during build\n- Adding them to the definition file using the `%files` section\n- Copying them into specific locations during build\n\nApptainer\u002FSingularity containers run with the user's own UID\u002FGID by default, so some of the user creation parts may be redundant.\nBy default, your home directory is automatically mounted in Apptainer\u002FSingularity containers, which differs from Docker behavior.\nFor more info on Apptainer deployment, we suggest consulting [their userguide](https:\u002F\u002Fapptainer.org\u002Fdocs\u002Fuser\u002F1.0\u002Findex.html)\n\n## Usage\n\nAll functionality should be launched using `deepy.py`, a wrapper around the `deepspeed` launcher.\n\nWe currently offer three main functions:\n1. `train.py` is used for training and finetuning models.\n2. `eval.py` is used to evaluate a trained model using the [language model evaluation harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness).\n3. `generate.py` is used to sample text from a trained model.\n\nwhich can be launched with:\n\n```bash\n.\u002Fdeepy.py [script.py] [.\u002Fpath\u002Fto\u002Fconfig_1.yml] [.\u002Fpath\u002Fto\u002Fconfig_2.yml] ... [.\u002Fpath\u002Fto\u002Fconfig_n.yml]\n```\n\nFor example, to launch training you can run\n```bash\n.\u002Fdeepy.py train.py .\u002Fconfigs\u002F20B.yml .\u002Fconfigs\u002Flocal_cluster.yml\n```\n\nFor more details on each entry point, see the [Training and Finetuning](#training-and-finetuning), [Inference](#inference) and [Evaluation](#evaluation) respectively.\n\n# Configuration\n\nGPT-NeoX parameters are defined in a YAML configuration file which is passed to the deepy.py launcher. 
We have provided some example .yml files in [configs](.\u002Fconfigs\u002F), showing a diverse array of features and model sizes.\n\nThese files are generally complete, but non-optimal. For example, depending on your specific GPU configuration, you may need to change some settings such as `pipe-parallel-size`, `model-parallel-size` to increase or decrease the degree of parallelisation, `train_micro_batch_size_per_gpu` or `gradient-accumulation-steps` to modify batch size related settings, or the `zero_optimization` dict to modify how optimizer states are parallelised across workers.\n\nFor a more detailed guide to the features available and how to configure them, see [the configuration README](configs\u002FREADME.md), and for documentation of every possible argument, see [configs\u002Fneox_arguments.md](configs\u002Fneox_arguments.md).\n\n## Mixture of Experts\n\nGPT-NeoX includes support for Dropless Mixture of Experts (DMoE) through the `megablocks` library. It is compatible with both existing Megatron Tensor Parallelism and DeepSpeed Pipeline Parallel setups.\n\nThis implementation leverages the existing Tensor Parallel Group to also shard the expert weights.\nIt uses Sinkhorn routing to avoid the need for a load balancing loss.\n\nFor an example of a basic complete configuration, see configs\u002F125M-dmoe.yml.\n\nMost MoE related configuration arguments are prefixed with `moe`. The bare minimum addition to your configuration to enable MoE is as follows:\n\n```yaml\nmoe_num_experts: 1 # 1 disables MoE. 
8 is a common value.\n```\n\n# Datasets\n\n## Preconfigured Datasets\n\nSeveral preconfigured datasets are available, including most components from [the Pile](https:\u002F\u002Farxiv.org\u002Fabs\u002F2101.00027), as well as the Pile train set itself, for straightforward tokenization using the `prepare_data.py` entry point.\n\nFor example, to download and tokenize the enwik8 dataset with the GPT2 Tokenizer, saving it to `.\u002Fdata`, you can run:\n\n```\npython prepare_data.py -d .\u002Fdata\n```\n\nor a single shard of the Pile (`pile_subset`) with the GPT-NeoX-20B tokenizer (assuming you have it saved at `.\u002F20B_checkpoints\u002F20B_tokenizer.json`):\n\n```\npython prepare_data.py -d .\u002Fdata -t HFTokenizer --vocab-file .\u002F20B_checkpoints\u002F20B_tokenizer.json pile_subset\n```\n\nThe tokenized data will be saved out to two files: `[data-dir]\u002F[dataset-name]\u002F[dataset-name]_text_document.bin` and `[data-dir]\u002F[dataset-name]\u002F[dataset-name]_text_document.idx`. You will need to add the prefix that both these files share to your training configuration file under the `data-path` field. For example:\n\n```yaml\n  \"data-path\": \".\u002Fdata\u002Fenwik8\u002Fenwik8_text_document\",\n```\n\n## Using Custom Data\n\nTo prepare your own dataset for training with custom data, format it as one large [jsonl](https:\u002F\u002Fjsonlines.org\u002F)-formatted file with each item in the list of dictionaries being a separate document. The document text should be grouped under one JSON key, i.e. `\"text\"`. 
Any auxiliary data stored in other fields will not be used.\n\nNext make sure to download the GPT2 tokenizer vocab, and merge files from the following links:\n\n- Vocab: https:\u002F\u002Fs3.amazonaws.com\u002Fmodels.huggingface.co\u002Fbert\u002Fgpt2-vocab.json\n- Merge: https:\u002F\u002Fs3.amazonaws.com\u002Fmodels.huggingface.co\u002Fbert\u002Fgpt2-merges.txt\n\nOr use the 20B tokenizer (for which only a single Vocab file is needed):\n\n- Vocab: https:\u002F\u002Fthe-eye.eu\u002Fpublic\u002FAI\u002Fmodels\u002FGPT-NeoX-20B\u002Fslim_weights\u002F20B_tokenizer.json\n\n(alternatively, you can provide any tokenizer file that can be loaded by Hugging Face's tokenizers library with the `Tokenizer.from_pretrained()` command)\n\nYou can now pretokenize your data using `tools\u002Fdatasets\u002Fpreprocess_data.py`, the arguments for which are detailed below:\n\n```\nusage: preprocess_data.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]] [--num-docs NUM_DOCS] --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer} [--vocab-file VOCAB_FILE] [--merge-file MERGE_FILE] [--append-eod] [--ftfy] --output-prefix OUTPUT_PREFIX\n                          [--dataset-impl {lazy,cached,mmap}] [--workers WORKERS] [--log-interval LOG_INTERVAL]\n\noptional arguments:\n  -h, --help            show this help message and exit\n\ninput data:\n  --input INPUT         Path to input jsonl files or lmd archive(s) - if using multiple archives, put them in a comma separated list\n  --jsonl-keys JSONL_KEYS [JSONL_KEYS ...]\n                        space separate listed of keys to extract from jsonl. 
Default: text\n  --num-docs NUM_DOCS   Optional: Number of documents in the input data (if known) for an accurate progress bar.\n\ntokenizer:\n  --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer}\n                        What type of tokenizer to use.\n  --vocab-file VOCAB_FILE\n                        Path to the vocab file\n  --merge-file MERGE_FILE\n                        Path to the BPE merge file (if necessary).\n  --append-eod          Append an \u003Ceod> token to the end of a document.\n  --ftfy                Use ftfy to clean text\n\noutput data:\n  --output-prefix OUTPUT_PREFIX\n                        Path to binary output file without suffix\n  --dataset-impl {lazy,cached,mmap}\n                        Dataset implementation to use. Default: mmap\n\nruntime:\n  --workers WORKERS     Number of worker processes to launch\n  --log-interval LOG_INTERVAL\n                        Interval between progress updates\n\n```\n\nFor example:\n\n```bash\npython tools\u002Fdatasets\u002Fpreprocess_data.py \\\n            --input .\u002Fdata\u002Fmydataset.jsonl.zst \\\n            --output-prefix .\u002Fdata\u002Fmydataset \\\n            --vocab .\u002Fdata\u002Fgpt2-vocab.json \\\n            --merge-file gpt2-merges.txt \\\n            --dataset-impl mmap \\\n            --tokenizer-type GPT2BPETokenizer \\\n            --append-eod\n```\n\nYou would then run training with the following settings added to your configuration file:\n\n```yaml\n  \"data-path\": \"data\u002Fmydataset_text_document\",\n```\n\n# Training and Finetuning\n\nTraining is launched using `deepy.py`, a wrapper around DeepSpeed's launcher, which launches the same script in parallel across many GPUs \u002F nodes.\n\nThe general usage pattern is:\n\n```bash\npython .\u002Fdeepy.py train.py [path\u002Fto\u002Fconfig1.yml] [path\u002Fto\u002Fconfig2.yml] ...\n```\n\nYou can pass in an arbitrary number of configs which will all be merged at runtime.\n\nYou can also 
optionally pass in a config prefix, which will assume all your configs are in the same folder and append that prefix to their path.\n\nFor example:\n\n```bash\npython .\u002Fdeepy.py train.py -d configs 125M.yml local_setup.yml\n```\n\nThis will deploy the `train.py` script on all nodes with one process per GPU. The worker nodes and number of GPUs are specified in the `\u002Fjob\u002Fhostfile` file (see [parameter documentation](configs\u002FREADME.md)), or can simply be passed in as the `num_gpus` arg if running on a single node setup.\n\nAlthough this is not strictly necessary, we find it useful to define the model parameters in one config file (e.g `configs\u002F125M.yml`) and the data path parameters in another (e.g `configs\u002Flocal_setup.yml`).\n\n\n## Pretrained Models\n\n### GPT-NeoX-20B\n\nGPT-NeoX-20B is a 20 billion parameter autoregressive language model trained on [the Pile](https:\u002F\u002Farxiv.org\u002Fabs\u002F2101.00027). Technical details about GPT-NeoX-20B can be found in [the associated paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.06745). 
The configuration file for this model is available at [`.\u002Fconfigs\u002F20B.yml`](.\u002Fconfigs\u002F20B.yml) and is also included in the download links below.\n\n[Slim weights](https:\u002F\u002Fthe-eye.eu\u002Fpublic\u002FAI\u002Fmodels\u002FGPT-NeoX-20B\u002Fslim_weights\u002F) - (No optimizer states, for inference or finetuning, 39GB)\n\nTo download from the command line to a folder named `20B_checkpoints`, use the following command:\n\n```bash\nwget --cut-dirs=5 -nH -r --no-parent --reject \"index.html*\" https:\u002F\u002Fthe-eye.eu\u002Fpublic\u002FAI\u002Fmodels\u002FGPT-NeoX-20B\u002Fslim_weights\u002F -P 20B_checkpoints\n```\n\n[Full weights](https:\u002F\u002Fthe-eye.eu\u002Fpublic\u002FAI\u002Fmodels\u002FGPT-NeoX-20B\u002Ffull_weights\u002F) - (Including optimizer states, 268GB)\n\nTo download from the command line to a folder named `20B_checkpoints`, use the following command:\n\n```bash\nwget --cut-dirs=5 -nH -r --no-parent --reject \"index.html*\" https:\u002F\u002Fthe-eye.eu\u002Fpublic\u002FAI\u002Fmodels\u002FGPT-NeoX-20B\u002Ffull_weights\u002F -P 20B_checkpoints\n```\n\nWeights can alternatively be downloaded using a BitTorrent client. Torrent files can be downloaded here: [slim weights](https:\u002F\u002Fthe-eye.eu\u002Fpublic\u002FAI\u002Fmodels\u002FGPT-NeoX-20B\u002Fslim_weights.torrent), [full weights](https:\u002F\u002Fthe-eye.eu\u002Fpublic\u002FAI\u002Fmodels\u002FGPT-NeoX-20B\u002Ffull_weights.torrent).\n\nWe additionally have 150 checkpoints saved throughout training, one every 1,000 steps. 
We are working on figuring out how to best serve these at scale, but in the meantime people interested in working with the partially trained checkpoints can email us at contact@eleuther.ai to arrange access.\n\n### Pythia\n\nThe Pythia Scaling Suite is a suite of models ranging from 70M parameters to 12B parameters trained on [the Pile](https:\u002F\u002Fpile.eleuther.ai) intended to promote research on interpretability and training dynamics of large language models. Further details about the project and links to the models can be found [in the paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.01373) and [on the project's GitHub](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia).\n\n### Polyglot\n\nThe Polyglot Project is an effort to train powerful non-English pretrained language models to promote the accessibility of this technology to researchers outside the dominant powerhouses of machine learning. EleutherAI has trained and released 1.3B, 3.8B, and 5.8B parameter Korean language models, the largest of which outperforms all other publicly available language models on Korean language tasks. Further details about the project and links to the models can be found [here](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpolyglot).\n\n# Inference\n\n**For most uses we recommend deploying models trained using the GPT-NeoX library via the Hugging Face Transformers library, which is better optimized for inference.**\n\nWe support three types of generation from a pretrained model:\n1. Unconditional generation\n2. Conditional generation based on an input read from a file\n3. 
Interactive generation, which allows for multiple rounds of back-and-forth between a user and the language model via a command line interface\n\nAll three types of text generation can be launched via `python .\u002Fdeepy.py generate.py -d configs 125M.yml local_setup.yml text_generation.yml` with the appropriate values set in `configs\u002Ftext_generation.yml`.\n\n# Evaluation\n\nGPT-NeoX supports evaluation on downstream tasks through the [language model evaluation harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness).\n\nTo evaluate a trained model on the evaluation harness, simply run:\n\n```bash\npython .\u002Fdeepy.py eval.py -d configs your_configs.yml --eval_tasks task1 task2 ... taskn\n```\n\nwhere `--eval_tasks` is a space-separated list of evaluation tasks, e.g. `--eval_tasks lambada hellaswag piqa sciq`. For details of all tasks available, refer to the [lm-evaluation-harness repo](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness).\n\n# Exporting to Hugging Face\n\nGPT-NeoX is optimized heavily for training only, and GPT-NeoX model checkpoints are not compatible out of the box with other deep learning libraries. 
To make models easily loadable and shareable with end users, and for further exporting to various other frameworks, GPT-NeoX supports checkpoint conversion to the [Hugging Face Transformers](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.03771) format.\n\nThough NeoX supports a number of different architectural configurations, including AliBi positional embeddings, not all of these configurations map cleanly onto the supported configurations within Hugging Face Transformers.\n\nNeoX supports export of compatible models into the following architectures:\n- GPTNeoXForCausalLM\n- LlamaForCausalLM\n- MistralForCausalLM\n\nTraining a model which does not fit into one of these Hugging Face Transformers architectures cleanly will require writing custom modeling code for the exported model.\n\nTo convert a GPT-NeoX library checkpoint to Hugging Face-loadable format, run:\n```bash\npython .\u002Ftools\u002Fckpts\u002Fconvert_neox_to_hf.py --input_dir \u002Fpath\u002Fto\u002Fmodel\u002Fglobal_stepXXX --config_file your_config.yml --output_dir hf_model\u002Fsave\u002Flocation --precision {auto,fp16,bf16,fp32} --architecture {neox,mistral,llama}\n```\n\nThen to upload a model to [the Hugging Face Hub](https:\u002F\u002Fhuggingface.co\u002F), run:\n```bash\nhuggingface-cli login\npython .\u002Ftools\u002Fckpts\u002Fupload.py\n```\nand input the requested information, including HF hub user token.\n\n### Importing Models Into GPT-NeoX\n\nNeoX supplies several utilities for converting a pretrained model checkpoint into a format that can be trained within the library.\n\nThe following models or model families can be loaded in GPT-NeoX:\n- Llama 1\n- Llama 2\n- CodeLlama\n- Mistral-7b-v0.1\n\nWe provide two utilities for converting from two different checkpoint formats into a format compatible with GPT-NeoX.\n\nTo convert a Llama 1 or Llama 2 checkpoint distributed by Meta AI from its original file format (downloadable 
[here](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fllama) or [here](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FLlama-2-7b)) into the GPT-NeoX library, run\n\n```\npython tools\u002Fckpts\u002Fconvert_raw_llama_weights_to_neox.py --input_dir \u002Fpath\u002Fto\u002Fmodel\u002Fparent\u002Fdir\u002F7B --model_size 7B --output_dir \u002Fpath\u002Fto\u002Fsave\u002Fckpt --num_output_shards \u003CTENSOR_PARALLEL_SIZE> (--pipeline_parallel if pipeline-parallel-size >= 1)\n```\n\n\nTo convert from a Hugging Face model into a NeoX-loadable format, run `tools\u002Fckpts\u002Fconvert_hf_to_sequential.py`. See documentation within that file for further options.\n\n\n# Monitoring\n\nIn addition to storing logs locally, we provide built-in support for three popular experiment monitoring frameworks: [Weights & Biases](https:\u002F\u002Fwandb.ai\u002Fsite), [TensorBoard](https:\u002F\u002Fwww.tensorflow.org\u002Ftensorboard\u002F), and [Comet](https:\u002F\u002Fwww.comet.com\u002Fsite).\n\n## Weights and Biases\n\n[Weights & Biases](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fneox) is a machine learning monitoring platform. To use wandb to monitor your gpt-neox experiments:\n1. Create an account at https:\u002F\u002Fwandb.ai\u002Fsite to generate your API key.\n2. Log into Weights & Biases on your machine by executing `wandb login`; your runs will then be recorded automatically.\n3. Dependencies required for wandb monitoring can be found in and installed from `.\u002Frequirements\u002Frequirements-wandb.txt`.\n4. There are two optional fields associated with Weights & Biases: \u003Ccode>\u003Cvar>wandb_group\u003C\u002Fvar>\u003C\u002Fcode> allows you to name the run group and \u003Ccode>\u003Cvar>wandb_team\u003C\u002Fvar>\u003C\u002Fcode> allows you to assign your runs to an organization or team account. 
An example config is provided in `.\u002Fconfigs\u002Flocal_setup_wandb.yml`.\n\n## TensorBoard\n\nWe support using TensorBoard via the \u003Ccode>\u003Cvar>tensorboard-dir\u003C\u002Fvar>\u003C\u002Fcode> field. Dependencies required for TensorBoard monitoring can be found in and installed from  `.\u002Frequirements\u002Frequirements-tensorboard.txt`.\n\n## Comet\n\n[Comet](https:\u002F\u002Fwww.comet.com\u002Fsite) is a machine learning monitoring platform. To use comet to monitor your gpt-neox experiments:\n1. Create an account at https:\u002F\u002Fwww.comet.com\u002Flogin to generate your API key.\n2. Once generated, link your API key at runtime by running `comet login` or passing `export COMET_API_KEY=\u003Cyour-key-here>`\n3. Install `comet_ml` and any dependency libraries via `pip install -r requirements\u002Frequirements-comet.txt`\n4. Enable Comet with `use_comet: True`. You can also customize where data is being logged with `comet_workspace` and `comet_project`. A full example config with comet enabled is provided in `configs\u002Flocal_setup_comet.yml`.\n5. 
Run your experiment, and monitor metrics in the Comet workspace that you specified!\n\n# Running on multi-node\n\nIf you need to supply a hostfile for use with the MPI-based DeepSpeed launcher, you can set the environment variable `DLTS_HOSTFILE` to point to the hostfile.\n\n# Profiling\n\nWe support profiling with Nsight Systems, the PyTorch Profiler, and PyTorch Memory Profiling.\n\n## Nsight Systems Profiling\n\nTo use Nsight Systems profiling, set config options `profile`, `profile_step_start`, and `profile_step_stop` (see [here](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Fconfigs\u002Fneox_arguments.md) for argument usage, and [here](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Fconfigs\u002Fprof.yml) for a sample config).\n\nTo populate nsys metrics, launch training with:\n\n```\nnsys profile -s none -t nvtx,cuda -o \u003Cpath\u002Fto\u002Fprofiling\u002Foutput> --force-overwrite true \\\n--capture-range=cudaProfilerApi --capture-range-end=stop python $TRAIN_PATH\u002Fdeepy.py \\\n$TRAIN_PATH\u002Ftrain.py --conf_dir configs \u003Cconfig files>\n```\n\nThe generated output file can then be viewed with the Nsight Systems GUI:\n\n![nsight-prof](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FEleutherAI_gpt-neox_readme_cce154b3c298.png)\n\n## PyTorch Profiling\n\nTo use the built-in PyTorch profiler, set config options `profile`, `profile_step_start`, and `profile_step_stop` (see [here](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Fconfigs\u002Fneox_arguments.md) for argument usage, and [here](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Fconfigs\u002Fprof.yml) for a sample config).\n\nThe PyTorch profiler will save traces to your `tensorboard` log directory.  
You can view these traces within\nTensorBoard by following the steps [here](https:\u002F\u002Fpytorch.org\u002Ftutorials\u002Fintermediate\u002Ftensorboard_profiler_tutorial.html).\n\n![torch-prof](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FEleutherAI_gpt-neox_readme_73117eb7bcb1.png)\n\n## PyTorch Memory Profiling\n\nTo use PyTorch Memory Profiling, set config options `memory_profiling` and `memory_profiling_path` (see [here](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Fconfigs\u002Fneox_arguments.md) for argument usage, and [here](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Fconfigs\u002Fprof.yml) for a sample config).\n\n![mem-prof](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FEleutherAI_gpt-neox_readme_3625efaf8b43.png)\n\nView the generated profile with the [_memory_viz.py](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fblob\u002Fmain\u002Ftorch\u002Fcuda\u002F_memory_viz.py) script. Run with:\n\n```\npython _memory_viz.py trace_plot \u003Cgenerated_profile> -o trace.html\n```\n\n# Adoption and Publications\n\nThe GPT-NeoX library has been widely adopted by academic and industry researchers and ported to many HPC systems.\n\nIf you have found this library useful in your research, please reach out and let us know! We would love to add you to our lists.\n\n## Publications\n\nEleutherAI and our collaborators have used it in the following publications:\n - **Sid Black**, **Stella Biderman**, **Eric Hallahan**, **Quentin Anthony**, **Leo Gao**, **Laurence Golding**, **Horace He**, **Connor Leahy**, **Kyle McDonell**, **Jason Phang**, **Michael Pieler**, **Shivanshu Purohit**, **Laria Reynolds**, **Jon Tow**, **Ben Wang**, and **Samuel Weinbach**. 
\"[GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.06745).\" In *Proceedings of the ACL Workshop on Challenges \\& Perspectives in Creating Large Language Models*, 2022.\n - **Stella Biderman**, **Hailey Schoelkopf**, **Quentin Anthony**, **Herbie Bradley**, **Kyle O'Brien**, **Eric Hallahan**, **Mohammad Aflah Khan**, **Shivanshu Purohit**, **USVSN Sai Prashanth**, Edward Raff, **Aviya Skowron**, **Lintang Sutawika**, **Oskar van der Wal**. \"[Pythia: A suite for analyzing large language models across training and scaling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.01373).\" In _International Conference on Machine Learning_, pp. 2397-2430. _PMLR_, 2023.\n - Zhangir Azerbayev, Bartosz Piotrowski, **Hailey Schoelkopf**, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. \"[Proofnet: Autoformalizing and formally proving undergraduate-level mathematics](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.12433).\" *arXiv preprint arXiv:2302.12433*, 2023.\n - **Stella Biderman**, **USVSN Sai Prashanth**, **Lintang Sutawika**, **Hailey Schoelkopf**, **Quentin Anthony**, **Shivanshu Purohit**, and Edward Raff. \"[Emergent and predictable memorization in large language models.](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.11158)\" In _Neural Information Processing Systems_, 2023.\n - **Hyunwoong Ko**, **Kichang Yang**, **Minho Ryu**, **Taekyoon Choi**, **Seungmu Yang**, and Sungho Park. \"[A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.02254).\" *arXiv preprint arXiv:2306.02254*, 2023.\n - Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats Leon Richter, **Quentin Anthony**, Eugene Belilovsky, Irina Rish, and Timothée Lesort. 
\"[Continual Pre-Training of Large Language Models: How to re-warm your model?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.04014)\" In _Workshop on Efficient Systems for Foundation Models @ ICML_, 2023.\n - **Zhangir Azerbayev**, **Hailey Schoelkopf**, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, **Stella Biderman**, and Sean Welleck. \"[Llemma: An open language model for mathematics](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.10631).\" In _Math-AI Workshop @ NeurIPS_, 2023.\n - Alexander Havrilla, Maksym Zhuravinskyi, Duy Phung, Aman Tiwari, Jonathan Tow, **Stella Biderman**, **Quentin Anthony**, and **Louis Castricato**. \"[trlX: A Framework for Large Scale Reinforcement Learning from Human Feedback](https:\u002F\u002Faclanthology.org\u002F2023.emnlp-main.530\u002F).\" In _Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing_, 2023.\n - **Quentin Anthony**, **Jacob Hatef**, Deepak Narayanan, **Stella Biderman**, Stas Bekman, Junqi Yin, Aamir Shafi, Hari Subramoni, and Dhabaleswar Panda. \"[The Case for Co-Designing Model Architectures with Hardware](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.14489).\" In _arXiv preprint_, 2024.\n - Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, **Quentin Anthony**, Timothée Lesort, Eugene Belilovsky, Irina Rish. \"[Simple and Scalable Strategies to Continually Pre-train Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.08763).\" In _arXiv preprint_, 2024.\n - Junqi Yin, Avishek Bose, Guojing Cong, Isaac Lyngaas, **Quentin Anthony**. \"[Comparative Study of Large Language Model Architectures on Frontier](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.00691).\" In _arXiv preprint_, 2024.\n\nThe following publications by other research groups use this library:\n- Ta-Chung Chi, Ting-Han Fan, Peter J. Ramadge, and Alexander Rudnicky. 
\"[KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.09921).\" In *Advances in Neural Information Processing Systems* 35, 2022.\n- Sameera Horawalavithana, Ellyn Ayton, Shivam Sharma, Scott Howland, Megha Subramanian, Scott Vasquez, Robin Cosbey, Maria Glenski, and Svitlana Volkova. \"[Foundation Models of Scientific Knowledge for Chemistry: Opportunities, Challenges and Lessons Learned](https:\u002F\u002Faclanthology.org\u002F2022.bigscience-1.12\u002F).\" In *Proceedings of the ACL Workshop on Challenges \\& Perspectives in Creating Large Language Models*, 2022.\n- Sophia Kolak, Ruben Martins, Claire Le Goues, and Vincent J. Hellendoorn. \"[Patch Generation with Language Models: Feasibility and Scaling Behavior](https:\u002F\u002Fpar.nsf.gov\u002Fbiblio\u002F10340618).\" In *Proceedings of the Deep Learning for Code Workshop at ICLR*, 2022.\n- Frank F. Xu, Uri Alon, Graham Neubig, and Vincent J. Hellendoorn. \"[A Systematic Evaluation of Large Language Models of Code](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.13169).\" In *Proceedings of the ICLR Workshop on Deep Learning For Code*, 2022.\n- Byung-Doh Oh and William Schuler. \"[Transformer-Based LM Surprisal Predicts Human Reading Times Best with About Two Billion Training Tokens](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.11389).\" In *Findings of the Association for Computational Linguistics*, 2023.\n- Ta-Chung Chi, Ting-Han Fan, Alexander Rudnicky, and Peter Ramadge. \"[Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis](https:\u002F\u002Faclanthology.org\u002F2023.acl-long.756\u002F).\" In _Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)_, pp. 13522-13537, 2023.\n- Ta-Chung Chi, Ting-Han Fan, Li-Wei Chen, Alexander Rudnicky, and Peter Ramadge. 
\"[Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings](https:\u002F\u002Faclanthology.org\u002F2023.acl-short.102\u002F).\" In _Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)_, pp. 13522-13537, 2023.\n- Xidong Feng, Yicheng Luo, Ziyan Wang, Hongrui Tang, Mengyue Yang, Kun Shao, David Mguni, Yali Du, and Jun Wang. \"[ChessGPT: Bridging Policy Learning and Language Modeling.](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.09200)\" _arXiv preprint arXiv:2306.09200_, 2023.\n- Orion Walker Dollar, Sameera Horawalavithana, Scott Vasquez, W. James Pfaendtner, and Svitlana Volkova. \"[MolJET: Multimodal Joint Embedding Transformer for Conditional de novo Molecular Design and Multi-Property Optimization.](https:\u002F\u002Fopenreview.net\u002Fpdf?id=7UudBVsIrr)\" _preprint under review_, 2023.\n- Jean Kaddour and Qi Liu. \"[Text Data Augmentation in Low-Resource Settings via Fine-Tuning of Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01119).\" _arXiv:2310.01119_, 2023.\n- Alon Albalak, Liangming Pan, Colin Raffel, and William Yang Wang. \"[Efficient Online Data Mixing For Language Model Pre-Training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.02406).\" In _NeurIPS Workshop on R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models_, 2023.\n- Eghbal A. Hosseini and Evelina Fedorenko. \"[Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2023.11.05.564832v1).\" In _Neural Information Processing Systems_, 2023.\n- Junqi Yin, Sajal Dash, Feiyi Wang, and Mallikarjun Shankar. \"[FORGE: Pre-Training Open Foundation Models for Science](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3581784.3613215).\" 
In _Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis_, 1-13, 2023.\n- Peng Di, Jianguo Li, Hang Yu, Wei Jiang, Wenting Cai, Yang Cao, Chaoyu Chen, Dajun Chen, Hongwei Chen, Liang Chen, Gang Fan, Jie Gong, Zi Gong, Wen Hu, Tingting Guo, Zhichao Lei, Ting Li, Zheng Li, Ming Liang, Cong Liao, Bingchang Liu, Jiachen Liu, Zhiwei Liu, Shaojun Lu, Min Shen, Guangpei Wang, Huan Wang, Zhi Wang, Zhaogui Xu, Jiawei Yang, Qing Ye, Gehao Zhang, Yu Zhang, Zelin Zhao, Xunjin Zheng, Hailian Zhou, Lifu Zhu, and Xianying Zhu. \"[CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06266).\" In _arXiv preprint arXiv:2310.06266_, 2023.\n- Nikitha Rao, Kush Jain, Uri Alon, Claire Le Goues, and Vincent J Hellendoorn. \"[CAT-LM Training Language Models on Aligned Code And Tests](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01602).\" In _38th IEEE\u002FACM International Conference on Automated Software Engineering (ASE)_, pp. 409-420. IEEE, 2023.\n- Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, Ricardo Bianchini. \"[POLCA: Power Oversubscription in LLM Cloud Providers](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.12908).\" In _arXiv preprint_, 2023.\n- Junqi Yin, Sajal Dash, John Gounley, Feiyi Wang, and Georgia Tourassi. \"[Evaluation of pre-training large language models on leadership-class supercomputers](https:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007\u002Fs11227-023-05479-7).\" In _the Journal of Supercomputing_ 79, no. 18, 2023.\n- Tal Kadosh, Niranjan Hasabnis, Vy A. 
Vo, Nadav Schneider, Neva Krien, Mihai Capota, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, and Gal Oren. \"[Domain-Specific Code Language Models: Unraveling the Potential for HPC Codes and Tasks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13322).\" In _arXiv preprint_, 2023.\n- Guobin Shen, Dongcheng Zhao, Yiting Dong, Yang Li, Jindong Li, Kang Sun, and Yi Zeng. \"[Astrocyte-Enabled Advancements in Spiking Neural Networks for Large Language Modeling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.07625).\" In _arXiv preprint_, 2023.\n- Eghbal A. Hosseini, Martin A. Schrimpf, Yian Zhang, Samuel Bowman, Noga Zaslavsky, and Evelina Fedorenko. \"[Artificial neural network language models align neurally and behaviorally with humans even after a developmentally realistic amount of training.](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2022.10.04.510681)\" In _Neurobiology of Language_, 2024.\n- Xiongye Xiao, Chenyu Zhou, Heng Ping, Defu Cao, Yaxing Li, Yizhuo Zhou, Shixuan Li, and Paul Bogdan. \"[Exploring Neuron Interactions and Emergence in LLMs: From the Multifractal Analysis Perspective](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.09099).\" In _arXiv preprint_, 2024.\n- Zhiyuan Zeng, Qipeng Guo, Zhaoye Fei, Zhangyue Yin, Yunhua Zhou, Linyang Li, Tianxiang Sun, Hang Yan, Dahua Lin, and Xipeng Qiu. 
\"[Turn Waste into Worth: Rectifying Top-k Router of MoE](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12399).\" In _arXiv preprint_, 2024.\n\n## Models\nThe following models were trained using this library:\n\n### English LLMs\n- EleutherAI's [GPT-NeoX-20B](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fgpt-neox-20b) and [Pythia (70M through 12B)](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia)\n- CarperAI's [FIM-NeoX-1.3B](https:\u002F\u002Fhuggingface.co\u002FCarperAI\u002FFIM-NeoX-1.3B)\n- StabilityAI's [StableLM (3B and 7B)](https:\u002F\u002Fgithub.com\u002FStability-AI\u002FStableLM)\n- Together.ai's [RedPajama-INCITE (3B and 7B)](https:\u002F\u002Ftogether.ai\u002Fblog\u002Fredpajama-models-v1)\n- Carnegie Mellon University's [proofGPT (1.3B and 6.7B)](https:\u002F\u002Fhuggingface.co\u002Fhoskinson-center\u002FproofGPT-v0.1-6.7B)\n- Dampish's [StellarX (2.8B and 4B)](https:\u002F\u002Fhuggingface.co\u002FDampish\u002FStellarX-4B-V0.2)\n- Chinese Academy of Sciences's [AstroSNN (1.5B)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.07625)\n\n### Non-English LLMs\n- EleutherAI's [Polyglot-Ko (1.3B through 12.8B)](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpolyglot) (Korean)\n- Korea University's [KULLM-Polyglot (5.8B and 12.8B)](https:\u002F\u002Fgithub.com\u002Fnlpai-lab\u002FKULLM) (Korean)\n- Stability AI's [Japanese Stable LM (7B)](https:\u002F\u002Fhuggingface.co\u002Fstabilityai\u002Fjapanese-stablelm-base-alpha-7b) (Japanese)\n- LearnItAnyway's [LLaVA-Polyglot-Ko (1.3B)](https:\u002F\u002Fhuggingface.co\u002FLearnItAnyway\u002Fllava-polyglot-ko-1.3b-hf) (Korean)\n- Rinna Co.'s [japanese-gpt-neox-3.6b](https:\u002F\u002Fhuggingface.co\u002Frinna\u002Fjapanese-gpt-neox-3.6b) (Japanese) and [bilingual-gpt-neox-4b](https:\u002F\u002Fhuggingface.co\u002Frinna\u002Fbilingual-gpt-neox-4b) (English \u002F Japanese)\n- CyberAgent's [Open-CALM (125M through 7B)](https:\u002F\u002Fhuggingface.co\u002Fcyberagent\u002Fopen-calm-7b) 
(Japanese)\n- The Hungarian Research Centre for Linguistics's [PULI GPTrio (6.7B)](https:\u002F\u002Fhuggingface.co\u002FNYTK\u002FPULI-GPTrio) (Hungarian \u002F English \u002F Chinese)\n- The University of Tokyo's [weblab-10b](https:\u002F\u002Fhuggingface.co\u002FKojima777\u002Fweblab-10b) and [weblab-10b-instruct](https:\u002F\u002Fhuggingface.co\u002FKojima777\u002Fweblab-10b-instruction-sft) (Japanese)\n- nolano.ai's [Hi-NOLIN (9B)](https:\u002F\u002Fblog.nolano.ai\u002FHi-NOLIN\u002F) (English, Hindi)\n- Renmin University of China's [YuLan (12B)](https:\u002F\u002Fhuggingface.co\u002Fyulan-team\u002FYuLan-Base-12b) (English, Chinese)\n- The Basque Center for Language Technology's [Latxa (70B)](https:\u002F\u002Fhuggingface.co\u002FHiTZ\u002Flatxa-70b-v1.2) (Basque)\n\n### Code Models\n- Carnegie Mellon University's [PolyCoder (160M through 2.7B)](https:\u002F\u002Fgithub.com\u002FVHellendoorn\u002FCode-LMs) and [CAT-LM (2.7B)](https:\u002F\u002Fhuggingface.co\u002Fnikitharao\u002Fcatlm)\n- StabilityAI's [StableCode (1.3B)](https:\u002F\u002Fstability.ai\u002Fblog\u002Fstablecode-llm-generative-ai-coding) and [StableCode-Completion-Alpha (3B)](https:\u002F\u002Fstability.ai\u002Fblog\u002Fstablecode-llm-generative-ai-coding)\n- CodeFuse AI's [CodeFuse (13B)](https:\u002F\u002Fhuggingface.co\u002Fcodefuse-ai\u002FCodeFuse-13B)\n\n### AI for Science\n- EleutherAI's [LLeMMA (34B)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.10631)\n- Oak Ridge National Lab's [FORGE (26B)](https:\u002F\u002Fgithub.com\u002Fat-aaims\u002Fforge)\n- Oak Ridge National Lab's [Unnamed Material Science Domain Models (7B)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.00691)\n- Pacific Northwest National Lab's [MolJet (undisclosed size)](https:\u002F\u002Fopenreview.net\u002Fpdf?id=7UudBVsIrr)\n\n### Other Modalities\n-  Rinna Co.'s [PSLM (7B)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.12428) (speech \u002F text)\n-  University College London's 
[ChessGPT-3B](https:\u002F\u002Fhuggingface.co\u002FWaterhorse\u002Fchessgpt-base-v1)\n-  Gretel's [Text-to-Table (3B)](https:\u002F\u002Fhuggingface.co\u002Fgretelai\u002Ftext2table)\n\n# Administrative Notes\n\n## Citing GPT-NeoX\n\nIf you have found the GPT-NeoX library helpful in your work, you can cite this repository as\n\n```bibtex\n@software{gpt-neox-library,\n  title = {{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}},\n  author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Phang, Jason and Purohit, Shivanshu and Schoelkopf, Hailey and Stander, Dashiell and Songz, Tri and Tigges, Curt and Thérien, Benjamin and Wang, Phil and Weinbach, Samuel},\n  url = {https:\u002F\u002Fwww.github.com\u002Feleutherai\u002Fgpt-neox},\n  doi = {10.5281\u002Fzenodo.5879544},\n  month = {9},\n  year = {2023},\n  version = {2.0.0},\n}\n```\n\nTo cite the 20 billion parameter model named `GPT-NeoX-20B`, please use\n\n```bibtex\n@inproceedings{gpt-neox-20b,\n  title={{GPT-NeoX-20B}: An Open-Source Autoregressive Language Model},\n  author={Black, Sid and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, USVSN Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel},\n  booktitle={Proceedings of the ACL Workshop on Challenges \\& Perspectives in Creating Large Language Models},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.06745},\n  year={2022}\n}\n```\n\n## Contributing\nGPT-NeoX is built by the open-source AI community, and relies on our amazing contributors! 
Please see our\n[contributing](CONTRIBUTING.md) guide for more details on our CLA, code formatting, testing,\netc.\n\n## Licensing\n\nThis repository hosts code that is part of EleutherAI's GPT-NeoX project. Copyright (c) 2024, EleutherAI. Licensed under the Apache License:\n\n    Licensed under the Apache License, Version 2.0 (the \"License\");\n    you may not use this file except in compliance with the License.\n    You may obtain a copy of the License at\n\n        http:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0\n\n    Unless required by applicable law or agreed to in writing, software\n    distributed under the License is distributed on an \"AS IS\" BASIS,\n    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n    See the License for the specific language governing permissions and\n    limitations under the License.\n\nThis repository is based off code written by NVIDIA that is licensed under the Apache License, Version 2.0. In accordance with the Apache License, all files that are modifications of code originally written by NVIDIA maintain a NVIDIA copyright header. All files that do not contain such a header are the exclusive copyright of EleutherAI. When the NVIDIA code has been modified from its original version, that fact is noted in the copyright header. All derivative works of this repository must preserve these headers under the terms of the Apache License.\n\nThis repository also contains code written by a number of other authors. Such contributions are marked and the relevant licensing is included where appropriate.\n\nFor full terms, see the `LICENSE` file. If you have any questions, comments, or concerns about licensing please email us at contact@eleuther.ai.\n\n## Acknowledgements\n\nWe run our experiments on a Kubernetes cluster provided by [CoreWeave](https:\u002F\u002Fcoreweave.com\u002F) and a Slurm cluster provided by [Stability AI](https:\u002F\u002Fstability.ai). 
We are thankful to the DeepSpeed team for their advice and consultation.\n","[![GitHub issues](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002FEleutherAI\u002Fgpt-neox)](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fissues)\n[\u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fwandb\u002Fassets\u002Fmain\u002Fwandb-github-badge-28.svg\" alt=\"Weights & Biases 监控\" height=20>](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fneox)\n\n# GPT-NeoX\n\n本仓库记录了 EleutherAI（https:\u002F\u002Fwww.eleuther.ai）用于在 GPU 上训练大规模语言模型的库。我们当前的框架基于 NVIDIA 的 Megatron 语言模型（https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM），并结合了 DeepSpeed（https:\u002F\u002Fwww.deepspeed.ai）的技术以及一些新颖的优化方法。我们的目标是将此仓库打造成为一个集中且易于访问的平台，汇集训练大规模自回归语言模型的技术，并加速大规模训练领域的研究。该库已被广泛应用于学术界、工业界和政府实验室中（https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox#adoption-and-publications），包括橡树岭国家实验室、CarperAI、Stability AI、Together.ai、韩国大学、卡内基梅隆大学、东京大学等机构的研究人员。与其他类似库相比，GPT-NeoX 的独特之处在于其对多种系统和硬件的支持，例如可通过 Slurm、MPI 和 IBM Job Step Manager 启动，并且已经在 AWS（https:\u002F\u002Faws.amazon.com\u002F）、CoreWeave（https:\u002F\u002Fwww.coreweave.com\u002F）、橡树岭国家实验室的 Summit 超级计算机（https:\u002F\u002Fwww.olcf.ornl.gov\u002Fsummit\u002F）、Frontier 超级计算机（https:\u002F\u002Fwww.olcf.ornl.gov\u002Ffrontier\u002F）、LUMI 超级计算机（https:\u002F\u002Fwww.lumi-supercomputer.eu\u002F）等平台上成功运行。\n\n**如果您并非打算从头开始训练拥有数十亿参数的模型，那么这可能并不是适合您的库。对于一般的推理需求，我们建议您使用 Hugging Face 的 `transformers` 库，它同样支持 GPT-NeoX 模型。**\n\n## 为什么选择 GPT-NeoX？\n\nGPT-NeoX 借鉴了广受欢迎的 Megatron-DeepSpeed 库中的许多功能和技术，但在易用性和创新性优化方面有了显著提升。主要特性包括：\n* 使用 ZeRO 和 3D 并行化进行分布式训练\n* 支持多种系统和硬件，包括通过 Slurm、MPI 和 IBM Job Step Manager 启动；已在 AWS（https:\u002F\u002Faws.amazon.com\u002F）、CoreWeave（https:\u002F\u002Fwww.coreweave.com\u002F）、橡树岭国家实验室的 Summit（https:\u002F\u002Fwww.olcf.ornl.gov\u002Fsummit\u002F）和 Frontier（https:\u002F\u002Fwww.olcf.ornl.gov\u002Ffrontier\u002F）超级计算机、太平洋西北国家实验室（https:\u002F\u002Fhpc.pnl.gov\u002Findex.shtml）、阿贡国家实验室的 Polaris 
超级计算机（https:\u002F\u002Fdocs.alcf.anl.gov\u002Fpolaris\u002Fdata-science-workflows\u002Fapplications\u002Fgpt-neox\u002F）、LUMI 超级计算机（https:\u002F\u002Fwww.lumi-supercomputer.eu\u002F）等多个平台大规模运行。\n* 包含旋转位置嵌入、Alibi 位置嵌入、并行前馈注意力层以及 Flash Attention 等前沿架构创新。\n* 预定义了 Pythia、PaLM、Falcon 以及 LLaMA 1 和 2 等流行架构的配置。\n* 课程式学习（Curriculum Learning）。\n* 与开源生态系统的无缝对接，包括 Hugging Face 的 `tokenizers`（https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftokenizers）和 `transformers`（https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers）库；可通过 WandB（https:\u002F\u002Fwandb.ai\u002Fsite\u002F）、Comet（https:\u002F\u002Fwww.comet.com\u002Fsite\u002F）、TensorBoard 等工具监控实验；并通过我们的语言模型评估框架（https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness）进行模型评估。\n\n## 最新动态\n**[2024年10月9日]** 现已支持 Transformer Engine（https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTransformerEngine）集成。\n\n**[2024年9月9日]** 现已支持通过 DPO（https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18290）、KTO（https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01306）以及奖励建模进行偏好学习。\n\n**[2024年9月9日]** 现已支持与机器学习监控平台 Comet ML（https:\u002F\u002Fwww.comet.com\u002Fsite\u002F）集成。\n\n**[2024年5月21日]** 现已支持 RWKV（https:\u002F\u002Fwww.rwkv.com\u002F）的流水线并行化！详情请参阅关于 RWKV 的 PR（https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fpull\u002F1198）和 RWKV+流水线并行化的 PR（https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fpull\u002F1221）。\n\n**[2024年3月21日]** 现已支持专家混合（MoE）。\n\n**[2024年3月17日]** 现已支持 AMD MI250X GPU。\n\n**[2024年3月15日]** 现已支持 Mamba（https:\u002F\u002Fgithub.com\u002Fstate-spaces\u002Fmamba）的张量并行化！详情请参阅相关 PR（https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fpull\u002F1184）。\n\n**[2023年8月10日]** 现已支持使用 AWS S3 进行检查点保存！可通过 `s3_path` 配置选项启用（更多详情请参阅 PR：https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fpull\u002F1010）。\n\n**[2023年9月20日]** 自 https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fpull\u002F1035 起，我们已弃用 Flash Attention 0.x 和 1.x 版本，并将支持迁移至 Flash Attention 2.x 
版本。我们认为此举不会带来问题，但如果您有特定用例需要旧版 Flash Attention 支持，同时又希望使用最新版本的 GPT-NeoX，请提交 issue。\n\n**[2023年8月10日]** 我们在 math-lm 项目（https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fmath-lm）中提供了 LLaMA 2 和 Flash Attention v2 的实验性支持，相关内容将于本月晚些时候合并到主分支。\n\n**[2023年5月17日]** 在修复了一些小 bug 后，我们现在完全支持 bf16 数据类型。\n\n**[2023年4月11日]** 我们的 Flash Attention 实现现已升级，支持 Alibi 位置嵌入。\n\n**[2023年3月9日]** 我们发布了 GPT-NeoX 2.0.0 版本，这是一个基于最新 DeepSpeed 构建的升级版本，并将保持定期同步更新。\n\n## 版本说明\n\n在 2023年3月9日之前，GPT-NeoX 依赖于 DeeperSpeed（https:\u002F\u002Fgithub.com\u002FEleutherAI\u002FDeeperSpeed），而 DeeperSpeed 则基于旧版 DeepSpeed（0.3.15）。为了迁移到最新的 DeepSpeed 主干版本，同时允许用户继续使用旧版 GPT-NeoX 和 DeeperSpeed，我们为这两个库分别推出了两个版本：\n\n- [GPT-NeoX](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Freleases\u002Ftag\u002Fv2.0) 和 [DeeperSpeed](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002FDeeperSpeed\u002Freleases\u002Ftag\u002Fv2.0) 的 2.0 版本是基于最新 DeepSpeed 构建的最新版本，未来将继续维护。\n- [GPT-NeoX](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Freleases\u002Ftag\u002Fv1.0) 和 [DeeperSpeed](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002FDeeperSpeed\u002Freleases\u002Ftag\u002Fv1.0) 的 1.0 版本则保留了旧稳定版本的快照，这些版本曾被用于训练 [GPT-NeoX-20B](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.06745) 和 [Pythia 套件](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia)。\n\n# 目录\n\n- [GPT-NeoX](#gpt-neox)\n  * [为什么选择GPT-NeoX？](#why-gpt-neox)\n  * [新闻](#news)\n  * [版本](#versions)\n- [目录](#contents)\n- [快速入门](#quick-start)\n  * [环境与依赖](#environment-and-dependencies)\n    + [主机设置](#host-setup)\n    + [Flash Attention](#flash-attention)\n    + [Transformer Engine](#transformer-engine)\n    + [多节点启动](#multi-node-launching)\n    + [容器化部署](#containerized-setup)\n  * [使用方法](#usage)\n- [配置](#configuration)\n    * [专家混合模型](#mixture-of-experts)\n- [数据集](#datasets)\n  * [预配置数据集](#preconfigured-datasets)\n  * [使用自定义数据](#using-custom-data)\n- [训练与微调](#training-and-finetuning)\n  * [预训练模型](#pretrained-models)\n    + [GPT-NeoX-20B](#gpt-neox-20b)\n    
+ [Pythia](#pythia)\n    + [Polyglot](#polyglot)\n- [推理](#inference)\n- [评估](#evaluation)\n- [导出到Hugging Face](#exporting-to-hugging-face)\n- [监控](#monitoring)\n  * [Weights & Biases](#weights-and-biases)\n  * [TensorBoard](#tensorboard)\n- [多节点运行](#running-on-multi-node)\n- [性能分析](#profiling)\n- [应用与论文](#adoption-and-publications)\n  * [论文](#publications)\n  * [模型](#models)\n    + [英文大模型](#english-llms)\n    + [非英文大模型](#non-english-llms)\n    + [代码模型](#code-models)\n    + [其他模态](#other-modalities)\n- [管理说明](#administrative-notes)\n  * [引用GPT-NeoX](#citing-gpt-neox)\n  * [贡献](#contributing)\n  * [许可](#licensing)\n  * [致谢](#acknowledgements)\n\n# 快速入门\n\n## 环境与依赖\n\n### 主机设置\n\n本代码库主要针对 Python 3.8–3.10 和 PyTorch 1.8–2.0 进行开发和测试。但这并非严格要求，其他版本及库的组合也可能适用。\n\n要安装其余基本依赖，请在仓库根目录下运行：\n\n```bash\npip install -r requirements\u002Frequirements.txt\npip install -r requirements\u002Frequirements-wandb.txt # 可选，若使用 WandB 日志记录\npip install -r requirements\u002Frequirements-tensorboard.txt # 可选，若使用 TensorBoard 日志记录\npip install -r requirements\u002Frequirements-comet.txt # 可选，若使用 Comet 日志记录\n```\n\n> [!WARNING]\n> 我们的代码库依赖于 [DeeperSpeed](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002FDeeperSpeed)，这是我们对 [DeepSpeed](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) 库稍作修改后的分支。强烈建议在继续之前使用 Anaconda、虚拟机或其他形式的环境隔离，否则可能会影响依赖 DeepSpeed 的其他项目正常运行。\n\n### 融合内核\n我们现在通过 JIT 融合内核编译支持 AMD GPU（MI100、MI250X）。融合内核会根据需要构建并加载。为避免作业启动时的等待，您也可以在 Python 解释器中手动预先构建：\n\n```python\nfrom megatron.fused_kernels import load\nload()\n```\n这将自动适应不同 GPU 厂商（AMD、NVIDIA）的构建过程，而无需针对特定平台进行代码修改。要进一步使用 `pytest` 测试融合内核，请运行 `pytest tests\u002Fmodel\u002Ftest_fused_kernels.py`。\n\n### Flash Attention\n\n要使用 [Flash-Attention](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fflash-attention)，请安装 `.\u002Frequirements\u002Frequirements-flashattention.txt` 中的额外依赖，或直接使用已预装该功能的 PyTorch NGC 容器（请注意，使用与我们 requirements 文件不同的版本可能无法保证功能正常）。然后在您的配置中相应地设置注意力类型（参见 [configs](.\u002Fconfigs\u002F)）。在某些 GPU 架构上，例如 
Ampere 架构的 A100 等，Flash Attention 可以显著提升常规注意力机制的速度；更多详情请参阅仓库文档。\n\n### Transformer Engine\n\n要使用 [Transformer Engine (TE)](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTransformerEngine)，请安装 `.\u002Frequirements\u002Frequirements-transformer-engine.txt` 中的额外依赖，或使用已预装该功能的 PyTorch NGC 容器（请注意，使用与我们 requirements 文件不同的版本可能无法保证功能正常）。可参考 [此配置](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Fconfigs\u002F1-3B-transformer-engine.yml) 了解如何在 13 亿参数模型上使用 TE。在某些 GPU 架构上，包括 Ampere 和 Hopper 架构，TE 可以显著提升常规注意力机制的速度；更多信息请参阅仓库文档。\n\nTE 为 A100 和 H100 GPU 提供了非常高效的内核。我们已在 A100 上进行了若干示例消融实验：\n\n[图：A100 上的 TE 消融实验结果]\n\n以及 H100：\n\n[图：H100 上的 TE 消融实验结果]\n\n### 多节点启动\n\nNeoX 和 Deep(er)Speed 支持在多个不同节点上进行训练，并且您可以选择多种不同的启动器来编排多节点任务。\n\n通常情况下，需要在一个可访问的位置提供一个“hostfile”，其格式如下：\n\n```bash\nnode1_ip slots=8\nnode2_ip slots=8\n```\n\n其中第一列是您设置中每个节点的 IP 地址，而槽位数表示该节点可访问的 GPU 数量。在您的配置中，必须通过 `\"hostfile\": \"\u002Fpath\u002Fto\u002Fhostfile\"` 传递 hostfile 的路径。此外，也可以将 hostfile 的路径设置在环境变量 `DLTS_HOSTFILE` 中。\n\n#### pdsh\n\n`pdsh` 是默认的启动器。如果您使用 `pdsh`，除了确保您的环境中已安装 `pdsh` 外，只需在配置文件中设置 `{\"launcher\": \"pdsh\"}` 即可。\n\n#### MPI\n\n如果使用 MPI，则必须指定 MPI 库（目前 DeepSpeed\u002FGPT-NeoX 支持 `mvapich`、`openmpi`、`mpich` 和 `impi`，但 `openmpi` 是最常用且经过充分测试的），并在配置文件中添加 `deepspeed_mpi` 标志：\n\n```json\n{\n    \"launcher\": \"openmpi\",\n    \"deepspeed_mpi\": true\n}\n```\n\n在正确设置好环境并准备好相应的配置文件后，您可以像普通 Python 脚本一样使用 `deepy.py` 来启动训练任务，例如：\n\n`python3 deepy.py train.py \u002Fpath\u002Fto\u002Fconfigs\u002Fmy_model.yml`\n\n#### Slurm\n\n使用 Slurm 可能会稍微复杂一些。与 MPI 类似，您也需要在配置中添加以下内容：\n\n```json\n{\n    \"launcher\": \"slurm\",\n    \"deepspeed_slurm\": true\n}\n```\n\n如果您没有 SSH 访问权限来连接到 Slurm 集群中的计算节点，则需要添加 `{\"no_ssh_check\": true}`。\n\n#### （高级）自定义启动\n\n在许多情况下，上述默认的启动选项并不足以满足需求：\n\n- 许多集群有自己的独特作业调度器，或者有特定的 MPI\u002FSlurm 参数用于启动作业，例如 [Summit JSRun](https:\u002F\u002Fdocs.olcf.ornl.gov\u002Fsystems\u002Fsummit_user_guide.html#job-launcher-jsrun) 或 [LLNL 
Flux](https:\u002F\u002Fcomputing.llnl.gov\u002Fprojects\u002Fflux-building-framework-resource-management)。\n- 虽然上述 Slurm\u002FMPI\u002Fpdsh 默认选项对于大多数作业运行已经足够，但高级用户可能希望添加参数以进行优化或调试。\n\n在这种情况下，您需要修改 DeepSpeed 的 [multinode runner](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed\u002Fblob\u002F17957728c0362bf8ae70feca308e491e55ef9feb\u002Fdeepspeed\u002Flauncher\u002Fmultinode_runner.py) 工具，以支持您的用例。这些增强大致可以分为两类：\n\n##### 1. 添加新的启动器（例如 [JSRun](https:\u002F\u002Fdocs.olcf.ornl.gov\u002Fsystems\u002Fsummit_user_guide.html#job-launcher-jsrun)、[Flux](https:\u002F\u002Fcomputing.llnl.gov\u002Fprojects\u002Fflux-building-framework-resource-management) 等）\n\n在这种情况下，您需要在 `deepspeed\u002Flauncher\u002Fmultinode_runner.py` 中添加一个新的多节点运行器类，并将其作为 GPT-NeoX 中的一个配置选项公开。我们为 [Summit JSRun](https:\u002F\u002Fdocs.olcf.ornl.gov\u002Fsystems\u002Fsummit_user_guide.html#job-launcher-jsrun) 实现的例子分别见 [DeeperSpeed 的这个提交](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002FDeeperSpeed\u002Fcommit\u002F9aed6c8500d7c492d85c5c88687322dbda70e370) 和 [GPT-NeoX 的这个提交](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fcommit\u002F3782c7ae60f8624e566e3879b89bb09e8b59b869)。\n\n##### 2. 
修改运行命令或环境变量\n\n我们遇到过许多需要修改 MPI\u002FSlurm 运行命令的情况，可能是为了优化或调试（例如修改 [Slurm srun CPU 绑定](https:\u002F\u002Fslurm.schedmd.com\u002Fsrun.html#OPT_cpu-bind) 或为 MPI 日志添加进程 rank 标签）。在这种情况下，您需要修改多节点运行器类的 `get_cmd` 方法中的运行命令（例如 OpenMPI 的 [mpirun_cmd](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed\u002Fblob\u002F17957728c0362bf8ae70feca308e491e55ef9feb\u002Fdeepspeed\u002Flauncher\u002Fmultinode_runner.py#L135-L147)）。我们使用 Slurm 和 OpenMPI 为 Stability 集群提供优化且带有 rank 标签的运行命令的例子，见 [DeeperSpeed 的这个分支](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed\u002Fcompare\u002Fmaster...EleutherAI:DeeperSpeed:v2.0-stability)。\n\n#### Hostfile 生成\n\n一般来说，您无法使用一个固定的 hostfile，因此需要编写一个脚本来在作业启动时动态生成 hostfile。以下是一个使用 [Slurm](https:\u002F\u002Fslurm.schedmd.com\u002Fdocumentation.html) 并假设每个节点有 8 个 GPU 的动态生成 hostfile 的示例脚本：\n\n```bash\n#!\u002Fbin\u002Fbash\nGPUS_PER_NODE=8\nmkdir -p \u002Fsample\u002Fpath\u002Fto\u002Fhostfiles\n# 将当前的 Slurm 作业 ID 添加到 hostfile 名称中，以避免追加到之前的 hostfile\nhostfile=\u002Fsample\u002Fpath\u002Fto\u002Fhostfiles\u002Fhosts_$SLURM_JOBID\n# 确保不会追加到之前的 hostfile\nrm $hostfile &> \u002Fdev\u002Fnull\n# 遍历节点名称\nfor i in `scontrol show hostnames $SLURM_NODELIST`\ndo\n    # 向 hostfile 中添加一行\n    echo $i slots=$GPUS_PER_NODE >>$hostfile\ndone\n```\n\n`$SLURM_JOBID` 和 `$SLURM_NODELIST` 是 Slurm 自动为您创建的环境变量。有关作业创建时可用的完整 Slurm 环境变量列表，请参阅 [sbatch 文档](https:\u002F\u002Fslurm.schedmd.com\u002Fsbatch.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES)。\n\n#### 作业启动\n\n然后，您可以创建一个 [sbatch](https:\u002F\u002Fslurm.schedmd.com\u002Fsbatch.html) 脚本来启动您的 GPT-NeoX 作业。在一个基于 Slurm 的集群上，假设每个节点有 8 个 GPU，一个最基本的 sbatch 脚本可能如下所示：\n\n```bash\n#!\u002Fbin\u002Fbash\n#SBATCH --job-name=\"neox\"\n#SBATCH --partition=your-partition\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=8\n#SBATCH --gres=gpu:8\n\n# 一些可能有用的分布式环境变量\nexport HOSTNAMES=`scontrol show hostnames \"$SLURM_JOB_NODELIST\"`\nexport MASTER_ADDR=$(scontrol show hostnames \"$SLURM_JOB_NODELIST\" | head -n 1)\nexport 
MASTER_PORT=12802\nexport COUNT_NODE=`scontrol show hostnames \"$SLURM_JOB_NODELIST\" | wc -l`\n\n# 上面提到的 hostfile 生成脚本\n.\u002Fwrite_hostfile.sh\n# 通过 DLTS_HOSTFILE 告诉 DeepSpeed 我们生成的 hostfile 的位置\nexport DLTS_HOSTFILE=\u002Fsample\u002Fpath\u002Fto\u002Fhostfiles\u002Fhosts_$SLURM_JOBID\n\n# 启动训练\npython3 deepy.py train.py \u002Fsample\u002Fpath\u002Fto\u002Fyour\u002Fconfigs\u002Fmy_model.yml\n\n```\n\n之后，您可以通过执行 `sbatch my_sbatch_script.sh` 来启动训练任务。\n\n### 容器化部署\n\n我们在 [containers](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Fcontainers\u002F) 目录下提供了适用于 [Apptainer](#apptainer)（原名 Singularity）和 [Docker](#docker) 的容器镜像。\n\n#### Docker\n\n如果您希望通过 Docker 容器运行 NeoX，我们提供了一个 Dockerfile 和 docker-compose 配置文件。\n\n运行该容器的必要条件包括：安装合适的 GPU 驱动程序、最新版本的 Docker 以及 [nvidia-container-toolkit](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html)。要测试您的环境是否配置正确，可以使用 NVIDIA 提供的“示例工作负载”：\n\n```\ndocker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi\n```\n\n如果上述命令能够成功执行，则需要在您的环境中设置 `NEOX_DATA_PATH` 和 `NEOX_CHECKPOINT_PATH` 环境变量，以指定数据目录和检查点存储路径：\n\n```\nexport NEOX_DATA_PATH=\u002Fmnt\u002Fsda\u002Fdata\u002Fenwiki8 # 或您系统中存放数据的实际路径\nexport NEOX_CHECKPOINT_PATH=\u002Fmnt\u002Fsda\u002Fcheckpoints\n```\n\n然后，在 `gpt-neox\u002Fcontainers\u002Fdocker` 目录下，您可以构建镜像并在容器中启动一个 shell：\n\n```\ndocker compose run gpt-neox bash\n```\n\n构建完成后，您应该能够看到类似以下的输出：\n\n```\nmchorse@537851ed67de:~$ echo $(pwd)\n\u002Fhome\u002Fmchorse\nmchorse@537851ed67de:~$ ls -al\ntotal 48\ndrwxr-xr-x  1 mchorse mchorse 4096 Jan  8 05:33 .\ndrwxr-xr-x  1 root    root    4096 Jan  8 04:09 ..\n-rw-r--r--  1 mchorse mchorse  220 Feb 25  2020 .bash_logout\n-rw-r--r--  1 mchorse mchorse 3972 Jan  8 04:09 .bashrc\ndrwxr-xr-x  4 mchorse mchorse 4096 Jan  8 05:35 .cache\ndrwx------  3 mchorse mchorse 4096 Jan  8 05:33 .nv\n-rw-r--r--  1 mchorse mchorse  807 Feb 25  2020 .profile\ndrwxr-xr-x  2 root    root    
4096 Jan  8 04:09 .ssh\ndrwxrwxr-x  8 mchorse mchorse 4096 Jan  8 05:35 chk\ndrwxrwxrwx  6 root    root    4096 Jan  7 17:02 data\ndrwxr-xr-x 11 mchorse mchorse 4096 Jan  8 03:52 gpt-neox\n```\n\n对于长时间运行的任务，您应使用以下命令以分离模式运行容器：\n\n```\ndocker compose up -d\n```\n\n随后，在另一个终端会话中，您可以执行以下命令进入正在运行的容器：\n\n```\ndocker compose exec gpt-neox bash\n```\n\n之后，您就可以在容器内运行所需的任何任务。\n\n长时间运行或以分离模式运行时需要注意以下几点：\n- 当您不再使用容器时，必须手动停止它。\n- 如果希望在您的 shell 会话结束后继续运行某些进程，您需要将它们置于后台运行。\n- 如果需要日志记录功能，则需确保将日志输出重定向到磁盘，并配置 WandB 和\u002F或 Comet 日志记录。\n\n如果您更倾向于直接使用 Docker Hub 上的预构建镜像，可以在运行 docker-compose 命令时添加 `-f docker-compose-dockerhub.yml` 参数，例如：\n\n```\ndocker compose run -f containers\u002Fdocker\u002Fdocker-compose-dockerhub.yml gpt-neox bash\n```\n\n#### Singularity\u002FApptainer\n\n我们同样支持 Apptainer（原名 Singularity）部署。部分用户发现 Apptainer 在无法获得 root 权限的系统上非常有用，例如国家实验室和大学中的共享高性能计算集群。\n\n运行该容器的必要条件包括安装合适的 GPU 驱动程序以及最新版本的 Apptainer。您可以通过运行以下命令从我们的 Apptainer 文件构建镜像：\n\n```\ncd containers\u002Fapptainer\u002F\napptainer build gpt-neox.sif gpt-neox.def\n```\n\n构建完成后，您可以通过以下几种方式使用新镜像：\n1. 在容器内启动一个 shell：\n```\napptainer shell --nv --bind \u002Fpath\u002Fto\u002Fdata:\u002Fdata,\u002Fpath\u002Fto\u002Fcode:\u002Fcode gpt-neox.sif\n```\n\n使用 `--nv` 标志启用 NVIDIA GPU 支持。\n2. 直接执行一条命令：\n```\napptainer exec --nv gpt-neox.sif python your_script.py\n```\n\n对于第二种方法，您需要在构建过程中使依赖文件和 `fused_kernels` 目录可用，具体可通过以下方式实现：\n- 在构建时使用 `--bind` 选项；\n- 将这些文件添加到定义文件的 `%files` 部分；\n- 在构建过程中将其复制到特定位置。\n\n默认情况下，Apptainer\u002FSingularity 容器会使用用户的 UID\u002FGID 运行，因此部分用户创建步骤可能是多余的。此外，默认情况下，用户的主目录会被自动挂载到 Apptainer\u002FSingularity 容器中，这与 Docker 的行为有所不同。有关 Apptainer 部署的更多信息，建议参考其[用户指南](https:\u002F\u002Fapptainer.org\u002Fdocs\u002Fuser\u002F1.0\u002Findex.html)。\n\n## 使用方法\n\n所有功能都应通过 `deepy.py` 启动，它是对 `deepspeed` 启动脚本的封装。\n\n目前我们提供三个主要功能：\n1. `train.py` 用于训练和微调模型；\n2. `eval.py` 用于使用 [语言模型评估框架](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness)评估已训练好的模型；\n3. 
`generate.py` 用于从已训练好的模型中采样文本。\n\n这些功能可以通过以下命令启动：\n\n```bash\n.\u002Fdeepy.py [script.py] [.\u002Fpath\u002Fto\u002Fconfig_1.yml] [.\u002Fpath\u002Fto\u002Fconfig_2.yml] ... [.\u002Fpath\u002Fto\u002Fconfig_n.yml]\n```\n\n例如，要启动训练，您可以运行：\n```bash\n.\u002Fdeepy.py train.py .\u002Fconfigs\u002F20B.yml .\u002Fconfigs\u002Flocal_cluster.yml\n```\n\n有关每个入口点的详细信息，请参阅相应的[训练与微调](#training-and-finetuning)、[推理](#inference)和[评估](#evaluation)部分。\n\n# 配置\n\nGPT-NeoX 的参数由 YAML 配置文件定义，并传递给 `deepy.py` 启动脚本。我们在 [configs](.\u002Fconfigs\u002F) 目录中提供了一些示例 `.yml` 文件，展示了多种功能和不同规模的模型。\n\n这些配置文件通常是完整的，但可能并非最优。例如，根据您的 GPU 配置，您可能需要调整一些设置，如 `pipe-parallel-size` 和 `model-parallel-size` 来增加或减少并行度，或者调整 `train_micro_batch_size_per_gpu` 和 `gradient-accumulation-steps` 来修改批次大小相关的参数，又或者调整 `zero_optimization` 字典来改变优化器状态在各工作节点之间的并行化方式。\n\n有关可用功能及其配置方法的详细指南，请参阅 [配置说明文档](configs\u002FREADME.md)，而所有可能参数的详细说明则可在 [configs\u002Fneox_arguments.md](configs\u002Fneox_arguments.md) 中找到。\n\n## 混合专家模型\n\nGPT-NeoX 通过 `megablocks` 库支持无丢弃混合专家模型（DMoE）。它与现有的 Megatron 张量并行和 DeepSpeed 流水线并行设置兼容。\n\n该实现利用现有的张量并行组来同时划分专家权重，并采用 Sinkhorn 路由算法避免负载均衡损失的引入。\n\n有关基本完整配置的示例，请参阅 `configs\u002F125M-dmoe.yml` 文件。\n\n大多数与 MoE 相关的配置参数都以 `moe` 为前缀。要启用 MoE，您至少需要在配置文件中添加以下内容：\n\n```yaml\nmoe_num_experts: 1 # 1 表示禁用 MoE。8 是常见值。\n```\n\n# 数据集\n\n## 预配置数据集\n\n提供了多个预配置的数据集，包括 [The Pile](https:\u002F\u002Farxiv.org\u002Fabs\u002F2101.00027) 中的大多数组件，以及 The Pile 的训练集本身，以便使用 `prepare_data.py` 入口点进行直接的分词处理。\n\n例如，要下载并使用 GPT2 分词器对 enwik8 数据集进行分词，然后将其保存到 `.\u002Fdata` 目录下，可以运行以下命令：\n\n```\npython prepare_data.py -d .\u002Fdata\n```\n\n或者对 The Pile 的单个分片 (`pile_subset`) 使用 GPT-NeoX-20B 分词器（假设你已将其保存在 `.\u002F20B_checkpoints\u002F20B_tokenizer.json`）：\n\n```\npython prepare_data.py -d .\u002Fdata -t HFTokenizer --vocab-file .\u002F20B_checkpoints\u002F20B_tokenizer.json pile_subset\n```\n\n分词后的数据将被保存为两个文件：`[data-dir]\u002F[dataset-name]\u002F[dataset-name]_text_document.bin` 和 
`[data-dir]\u002F[dataset-name]\u002F[dataset-name]_text_document.idx`。你需要将这两个文件共有的前缀添加到你的训练配置文件中的 `data-path` 字段中。例如：\n\n```yaml\n  \"data-path\": \".\u002Fdata\u002Fenwik8\u002Fenwik8_text_document\",\n```\n\n## 使用自定义数据\n\n要准备自己的数据集以用于自定义数据的训练，需将其格式化为一个大型的 [jsonl](https:\u002F\u002Fjsonlines.org\u002F) 格式文件，其中列表中的每个字典项代表一个单独的文档。文档文本应归于一个 JSON 键下，即 `\"text\"`。存储在其他字段中的任何辅助数据将不会被使用。\n\n接下来，请确保下载 GPT2 分词器的词汇表，并从以下链接合并文件：\n\n- 词汇表：https:\u002F\u002Fs3.amazonaws.com\u002Fmodels.huggingface.co\u002Fbert\u002Fgpt2-vocab.json\n- 合并文件：https:\u002F\u002Fs3.amazonaws.com\u002Fmodels.huggingface.co\u002Fbert\u002Fgpt2-merges.txt\n\n或者使用 20B 分词器（仅需一个词汇表文件）：\n\n- 词汇表：https:\u002F\u002Fthe-eye.eu\u002Fpublic\u002FAI\u002Fmodels\u002FGPT-NeoX-20B\u002Fslim_weights\u002F20B_tokenizer.json\n\n（此外，你也可以提供任何能够通过 Hugging Face 的分词器库使用 `Tokenizer.from_pretrained()` 命令加载的分词器文件）\n\n现在你可以使用 `tools\u002Fdatasets\u002Fpreprocess_data.py` 对数据进行预分词，其参数说明如下：\n\n```\n用法：preprocess_data.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]] [--num-docs NUM_DOCS] --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer} [--vocab-file VOCAB_FILE] [--merge-file MERGE_FILE] [--append-eod] [--ftfy] --output-prefix OUTPUT_PREFIX\n                          [--dataset-impl {lazy,cached,mmap}] [--workers WORKERS] [--log-interval LOG_INTERVAL]\n\n可选参数：\n  -h, --help            显示此帮助信息并退出\n\n输入数据：\n  --input INPUT         输入 jsonl 文件或 lmd 归档文件的路径；如果使用多个归档文件，请用逗号分隔。\n  --jsonl-keys JSONL_KEYS [JSONL_KEYS ...]\n                        从 jsonl 中提取的键名列表，以空格分隔。默认值：text\n  --num-docs NUM_DOCS   可选：输入数据中的文档数量（如果已知），以便显示准确的进度条。\n\n分词器：\n  --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer}\n                        要使用的分词器类型。\n  --vocab-file VOCAB_FILE\n                        词汇表文件的路径。\n  --merge-file MERGE_FILE\n                        BPE 合并文件的路径（如果需要）。\n  --append-eod          在文档末尾添加 \u003Ceod> 标记。\n  --ftfy                使用 ftfy 
清理文本。\n\n输出数据：\n  --output-prefix OUTPUT_PREFIX\n                        二进制输出文件的路径，不带后缀。\n  --dataset-impl {lazy,cached,mmap}\n                        要使用的数据集实现方式。默认：mmap。\n\n运行时：\n  --workers WORKERS     启动的工作进程数量。\n  --log-interval LOG_INTERVAL\n                        进度更新的时间间隔。\n\n```\n\n例如：\n\n```bash\npython tools\u002Fdatasets\u002Fpreprocess_data.py \\\n            --input .\u002Fdata\u002Fmydataset.jsonl.zst \\\n            --output-prefix .\u002Fdata\u002Fmydataset \\\n            --vocab-file .\u002Fdata\u002Fgpt2-vocab.json \\\n            --merge-file gpt2-merges.txt \\\n            --dataset-impl mmap \\\n            --tokenizer-type GPT2BPETokenizer \\\n            --append-eod\n```\n\n随后，你可以在配置文件中添加以下设置来启动训练：\n\n```yaml\n  \"data-path\": \"data\u002Fmydataset_text_document\",\n```\n\n# 训练与微调\n\n训练通过 `deepy.py` 启动，它是 DeepSpeed 启动器的封装工具，可在多块 GPU 或多个节点上并行启动相同的脚本。\n\n一般的使用模式是：\n\n```bash\npython .\u002Fdeepy.py train.py [path\u002Fto\u002Fconfig1.yml] [path\u002Fto\u002Fconfig2.yml] ...\n```\n\n你可以传递任意数量的配置文件，它们将在运行时被合并。\n\n你还可以选择性地传递一个配置文件目录前缀，这样系统会假定所有配置文件都在同一个文件夹中，并将该前缀加到它们的路径之前。\n\n例如：\n\n```bash\npython .\u002Fdeepy.py train.py -d configs 125M.yml local_setup.yml\n```\n\n这将在所有节点上部署 `train.py` 脚本，每块 GPU 上运行一个进程。工作节点和 GPU 数量在 `\u002Fjob\u002Fhostfile` 文件中指定（参见 [参数文档](configs\u002FREADME.md)），或者如果是在单节点设置中运行，可以直接通过 `num_gpus` 参数指定。\n\n虽然这不是严格必需的，但我们发现将模型参数定义在一个配置文件中（如 `configs\u002F125M.yml`），而将数据路径参数定义在另一个配置文件中（如 `configs\u002Flocal_setup.yml`）会很有帮助。\n\n## 预训练模型\n\n### GPT-NeoX-20B\n\nGPT-NeoX-20B 是一个拥有 200 亿参数的自回归语言模型，基于 [The Pile](https:\u002F\u002Farxiv.org\u002Fabs\u002F2101.00027) 数据集训练而成。有关 GPT-NeoX-20B 的技术细节，请参阅[相关论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.06745)。该模型的配置文件既可在 [`.\u002Fconfigs\u002F20B.yml`](.\u002Fconfigs\u002F20B.yml) 中找到，也包含在下方的下载链接中。\n\n[精简权重](https:\u002F\u002Fthe-eye.eu\u002Fpublic\u002FAI\u002Fmodels\u002FGPT-NeoX-20B\u002Fslim_weights\u002F) - （不含优化器状态，适用于推理或微调，39GB）\n\n若要通过命令行将文件下载到名为 `20B_checkpoints` 
的文件夹中，可使用以下命令：\n\n```bash\nwget --cut-dirs=5 -nH -r --no-parent --reject \"index.html*\" https:\u002F\u002Fthe-eye.eu\u002Fpublic\u002FAI\u002Fmodels\u002FGPT-NeoX-20B\u002Fslim_weights\u002F -P 20B_checkpoints\n```\n\n[完整权重](https:\u002F\u002Fthe-eye.eu\u002Fpublic\u002FAI\u002Fmodels\u002FGPT-NeoX-20B\u002Ffull_weights\u002F) - （包含优化器状态，268GB）\n\n若要通过命令行将文件下载到名为 `20B_checkpoints` 的文件夹中，可使用以下命令：\n\n```bash\nwget --cut-dirs=5 -nH -r --no-parent --reject \"index.html*\" https:\u002F\u002Fthe-eye.eu\u002Fpublic\u002FAI\u002Fmodels\u002FGPT-NeoX-20B\u002Ffull_weights\u002F -P 20B_checkpoints\n```\n\n此外，也可以使用 BitTorrent 客户端下载权重。种子文件可在此处下载：[精简权重](https:\u002F\u002Fthe-eye.eu\u002Fpublic\u002FAI\u002Fmodels\u002FGPT-NeoX-20B\u002Fslim_weights.torrent)，[完整权重](https:\u002F\u002Fthe-eye.eu\u002Fpublic\u002FAI\u002Fmodels\u002FGPT-NeoX-20B\u002Ffull_weights.torrent)。\n\n我们还在训练过程中保存了 150 个检查点，每 1,000 步保存一个。目前我们正在研究如何以最佳方式大规模提供这些检查点，但在此期间，有兴趣使用部分训练好的检查点的研究人员可以通过 contact@eleuther.ai 与我们联系，以安排访问权限。\n\n### Pythia\n\nPythia 扩展系列是一套从 7000 万参数到 120 亿参数不等的模型，基于 [The Pile](https:\u002F\u002Fpile.eleuther.ai) 训练而成，旨在促进对大型语言模型可解释性和训练动态的研究。有关该项目的更多详细信息及模型链接，请参阅[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.01373)和[项目 GitHub 页面](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia)。\n\n### Polyglot\n\nPolyglot 项目致力于训练强大的非英语预训练语言模型，以提升机器学习领域之外的研究人员对该技术的可及性。EleutherAI 已经训练并发布了 13 亿、38 亿和 58 亿参数的韩语语言模型，其中最大的模型在韩语任务上表现优于所有其他公开可用的语言模型。有关该项目的更多详细信息及模型链接，请参阅[此处](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpolyglot)。\n\n# 推理\n\n**对于大多数用途，我们建议通过 Hugging Face Transformers 库部署使用 GPT-NeoX 库训练的模型，因为该库针对推理进行了更好的优化。**\n\n我们支持三种类型的预训练模型生成：\n1. 无条件生成\n2. 基于从文件读取的输入进行的条件生成\n3. 
交互式生成，允许用户通过命令行界面与语言模型进行多轮对话。\n\n这三种文本生成方式均可通过 `python .\u002Fdeepy.py generate.py -d configs 125M.yml local_setup.yml text_generation.yml` 启动，并在 `configs\u002Ftext_generation.yml` 中设置相应的参数。\n\n# 评估\n\nGPT-NeoX 支持通过 [语言模型评估框架](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) 对下游任务进行评估。\n\n要使用评估框架评估已训练的模型，只需运行：\n\n```bash\npython .\u002Fdeepy.py eval.py -d configs your_configs.yml --eval_tasks task1 task2 ... taskn\n```\n\n其中 `--eval_tasks` 是由空格分隔的评估任务列表，例如 `--eval_tasks lambada hellaswag piqa sciq`。有关所有可用任务的详细信息，请参阅 [lm-evaluation-harness 仓库](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness)。\n\n# 导出至 Hugging Face\n\nGPT-NeoX 主要针对训练进行了高度优化，因此 GPT-NeoX 模型检查点无法直接与其他深度学习库兼容。为了使模型易于加载和与最终用户共享，并进一步导出到其他框架，GPT-NeoX 支持将检查点转换为 [Hugging Face Transformers](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.03771) 格式。\n\n尽管 NeoX 支持多种不同的架构配置，包括 AliBi 位置嵌入，但并非所有这些配置都能无缝映射到 Hugging Face Transformers 支持的架构中。\n\nNeoX 支持将兼容的模型导出为以下架构：\n- GPTNeoXForCausalLM\n- LlamaForCausalLM\n- MistralForCausalLM\n\n如果训练的模型无法完全适配上述 Hugging Face Transformers 架构之一，则需要为导出的模型编写自定义建模代码。\n\n要将 GPT-NeoX 库中的检查点转换为 Hugging Face 可加载格式，可运行：\n\n```bash\npython .\u002Ftools\u002Fckpts\u002Fconvert_neox_to_hf.py --input_dir \u002Fpath\u002Fto\u002Fmodel\u002Fglobal_stepXXX --config_file your_config.yml --output_dir hf_model\u002Fsave\u002Flocation --precision {auto,fp16,bf16,fp32} --architecture {neox,mistral,llama}\n```\n\n然后，要将模型上传到 [Hugging Face Hub](https:\u002F\u002Fhuggingface.co\u002F)，可运行：\n\n```bash\nhuggingface-cli login\npython .\u002Ftools\u002Fckpts\u002Fupload.py\n```\n\n并输入所需信息，包括 Hugging Face 用户令牌。\n\n### 将模型导入 GPT-NeoX\n\nNeoX 提供了若干工具，用于将预训练模型检查点转换为可在该库中训练的格式。\n\n以下模型或模型家族可以加载到 GPT-NeoX 中：\n- Llama 1\n- Llama 2\n- CodeLlama\n- Mistral-7b-v0.1\n\n我们提供了两种工具，分别用于将两种不同格式的检查点转换为与 GPT-NeoX 兼容的格式。\n\n要将 Meta AI 发布的 Llama 1 或 Llama 2 检查点从其原始文件格式（可从 [这里](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fllama) 或 
[这里](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FLlama-2-7b) 下载）转换为 GPT-NeoX 库格式，可运行：\n\n```bash\npython tools\u002Fckpts\u002Fconvert_raw_llama_weights_to_neox.py --input_dir \u002Fpath\u002Fto\u002Fmodel\u002Fparent\u002Fdir\u002F7B --model_size 7B --output_dir \u002Fpath\u002Fto\u002Fsave\u002Fckpt --num_output_shards \u003CTENSOR_PARALLEL_SIZE> (--pipeline_parallel if pipeline-parallel-size >= 1)\n```\n\n要将 Hugging Face 模型转换为 NeoX 可加载格式，可运行 `tools\u002Fckpts\u002Fconvert_hf_to_sequential.py`。更多选项请参阅该文件中的文档说明。\n\n# 监控\n\n除了在本地存储日志外，我们还内置支持三种流行的实验监控框架：[Weights & Biases](https:\u002F\u002Fwandb.ai\u002Fsite)、[TensorBoard](https:\u002F\u002Fwww.tensorflow.org\u002Ftensorboard\u002F) 和 [Comet](https:\u002F\u002Fwww.comet.com\u002Fsite)。\n\n## Weights & Biases\n\n我们使用 [Weights & Biases](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fneox) 记录实验，它是一个机器学习监控平台。要使用 Wandb 监控你的 GPT-NeoX 实验，请按照以下步骤操作：\n\n1. 在 https:\u002F\u002Fwandb.ai\u002Fsite 上创建一个账户以生成你的 API 密钥。\n2. 在你的机器上登录 Weights & Biases（执行 `wandb login` 即可），之后你的运行会被自动记录。\n3. Wandb 监控所需的依赖项列于 `.\u002Frequirements\u002Frequirements-wandb.txt`，可据此安装。示例配置文件位于 `.\u002Fconfigs\u002Flocal_setup_wandb.yml`。\n4. Weights & Biases 有两个可选字段：`wandb_group` 用于为一组运行命名，`wandb_team` 用于将运行归属到某个组织或团队账户。示例配置文件同样位于 `.\u002Fconfigs\u002Flocal_setup_wandb.yml`。\n\n## TensorBoard\n\n我们支持通过 `tensorboard-dir` 字段使用 TensorBoard。TensorBoard 监控所需的依赖项列于 `.\u002Frequirements\u002Frequirements-tensorboard.txt`，可据此安装。\n\n## Comet\n\n[Comet](https:\u002F\u002Fwww.comet.com\u002Fsite) 是一个机器学习监控平台。要使用 Comet 监控你的 GPT-NeoX 实验，请按照以下步骤操作：\n\n1. 在 https:\u002F\u002Fwww.comet.com\u002Flogin 上创建一个账户以生成你的 API 密钥。\n2. 生成后，在运行时通过执行 `comet login` 或者设置环境变量 `export COMET_API_KEY=\u003Cyour-key-here>` 来关联你的 API 密钥。\n3. 
使用 `pip install -r requirements\u002Frequirements-comet.txt` 安装 `comet_ml` 及其依赖库。\n4. 启用 Comet：设置 `use_comet: True`。你还可以通过 `comet_workspace` 和 `comet_project` 自定义数据的记录位置。启用 Comet 的完整示例配置文件位于 `configs\u002Flocal_setup_comet.yml`。\n5. 运行你的实验，并在你指定的 Comet 工作空间中监控指标！\n\n# 多节点运行\n\n如果你需要为基于 MPI 的 DeepSpeed 启动器提供主机文件，可以设置环境变量 `DLTS_HOSTFILE` 指向该主机文件。\n\n# 性能分析\n\n我们支持使用 Nsight Systems、PyTorch Profiler 和 PyTorch 内存分析工具进行性能分析。\n\n## Nsight Systems 性能分析\n\n要使用 Nsight Systems 进行性能分析，需设置配置选项 `profile`、`profile_step_start` 和 `profile_step_stop`（有关参数用法请参阅 [这里](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Fconfigs\u002Fneox_arguments.md)，示例配置请参阅 [这里](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Fconfigs\u002Fprof.yml)）。\n\n要生成 nsys 指标，可以使用以下命令启动训练：\n\n```\nnsys profile -s none -t nvtx,cuda -o \u003Cpath\u002Fto\u002Fprofiling\u002Foutput> --force-overwrite true \\\n--capture-range=cudaProfilerApi --capture-range-end=stop python $TRAIN_PATH\u002Fdeepy.py \\\n$TRAIN_PATH\u002Ftrain.py --conf_dir configs \u003Cconfig files>\n```\n\n生成的输出文件随后可以通过 Nsight Systems GUI 查看：\n\n![nsight-prof](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FEleutherAI_gpt-neox_readme_cce154b3c298.png)\n\n## PyTorch 性能分析\n\n要使用 PyTorch 内置的性能分析工具，需设置配置选项 `profile`、`profile_step_start` 和 `profile_step_stop`（有关参数用法请参阅 [这里](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Fconfigs\u002Fneox_arguments.md)，示例配置请参阅 [这里](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Fconfigs\u002Fprof.yml)）。\n\nPyTorch 性能分析工具会将跟踪信息保存到你的 TensorBoard 日志目录中。你可以按照 [这里](https:\u002F\u002Fpytorch.org\u002Ftutorials\u002Fintermediate\u002Ftensorboard_profiler_tutorial.html) 的步骤，在 TensorBoard 中查看这些跟踪信息。\n\n![torch-prof](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FEleutherAI_gpt-neox_readme_73117eb7bcb1.png)\n\n## PyTorch 内存分析\n\n要使用 PyTorch 内存分析工具，需设置配置选项 `memory_profiling` 和 
`memory_profiling_path`（有关参数用法请参阅 [这里](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Fconfigs\u002Fneox_arguments.md)，示例配置请参阅 [这里](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002Fmain\u002Fconfigs\u002Fprof.yml)）。\n\n![mem-prof](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FEleutherAI_gpt-neox_readme_3625efaf8b43.png)\n\n使用 [memory_viz.py](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fblob\u002Fmain\u002Ftorch\u002Fcuda\u002F_memory_viz.py) 脚本查看生成的内存分析报告。运行命令如下：\n\n```\npython _memory_viz.py trace_plot \u003Cgenerated_profile> -o trace.html\n```\n\n# 采用与发表论文\n\nGPT-NeoX 库已被学术界和工业界的众多研究人员广泛采用，并移植到了许多高性能计算系统上。\n\n如果这个库对你的研究有所帮助，欢迎联系并告知我们！我们非常乐意将你加入下面的名单。\n\n## 出版物\n\nEleutherAI 及其合作者已在以下出版物中使用了该库：\n- **Sid Black**、**Stella Biderman**、**Eric Hallahan**、**Quentin Anthony**、**Leo Gao**、**Laurence Golding**、**Horace He**、**Connor Leahy**、**Kyle McDonell**、**Jason Phang**、**Michael Pieler**、**Shivanshu Purohit**、**Laria Reynolds**、**Jon Tow**、**Ben Wang** 和 **Samuel Weinbach**。“GPT-NeoX-20B：一个开源的自回归语言模型”。载于《ACL 大型语言模型构建中的挑战与展望研讨会论文集》，2022 年。\n- **Stella Biderman**、**Hailey Schoelkopf**、**Quentin Anthony**、**Herbie Bradley**、**Kyle O'Brien**、**Eric Hallahan**、**Mohammad Aflah Khan**、**Shivanshu Purohit**、**USVSN Sai Prashanth**、Edward Raff、**Aviya Skowron**、**Lintang Sutawika**、**Oskar van der Wal**。“Pythia：一套用于跨训练与规模扩展分析大型语言模型的工具”。载于《国际机器学习大会》，第 2397–2430 页。PMLR，2023 年。\n- Zhangir Azerbayev、Bartosz Piotrowski、**Hailey Schoelkopf**、Edward W. 
Ayers、Dragomir Radev 和 Jeremy Avigad。“Proofnet：自动形式化并形式化证明本科水平数学”。*arXiv 预印本 arXiv:2302.12433*，2023 年。\n- **Stella Biderman**、**USVSN Sai Prashanth**、**Lintang Sutawika**、**Hailey Schoelkopf**、**Quentin Anthony**、**Shivanshu Purohit** 和 Edward Raff。“大型语言模型中的涌现式与可预测性记忆”。载于《神经信息处理系统》，2023 年。\n- **Hyunwoong Ko**、**Kichang Yang**、**Minho Ryu**、**Taekyoon Choi**、**Seungmu Yang** 和 Sungho Park。“Polyglot-Ko 技术报告：开源大规模韩语语言模型”。*arXiv 预印本 arXiv:2306.02254*，2023 年。\n- Kshitij Gupta、Benjamin Thérien、Adam Ibrahim、Mats Leon Richter、**Quentin Anthony**、Eugene Belilovsky、Irina Rish 和 Timothée Lesort。“大型语言模型的持续预训练：如何重新‘唤醒’你的模型？”载于《ICML 基础模型高效系统研讨会》，2023 年。\n- **Zhangir Azerbayev**、**Hailey Schoelkopf**、Keiran Paster、Marco Dos Santos、Stephen McAleer、Albert Q Jiang、Jia Deng、**Stella Biderman** 和 Sean Welleck。“Llemma：一个面向数学的开源语言模型”。载于《NeurIPS 数学—人工智能研讨会》，2023 年。\n- Alexander Havrilla、Maksym Zhuravinskyi、Duy Phung、Aman Tiwari、Jonathan Tow、**Stella Biderman**、**Quentin Anthony** 和 **Louis Castricato**。“trlX：一个用于大规模人类反馈强化学习的框架”。载于《2023 年自然语言处理经验方法会议论文集》，2023 年。\n- **Quentin Anthony**、**Jacob Hatef**、Deepak Narayanan、**Stella Biderman**、Stas Bekman、Junqi Yin、Aamir Shafi、Hari Subramoni 和 Dhabaleswar Panda。“与硬件协同设计模型架构的理由”。载于 *arXiv 预印本*，2024 年。\n- Adam Ibrahim、Benjamin Thérien、Kshitij Gupta、Mats L. 
Richter、**Quentin Anthony**、Timothée Lesort、Eugene Belilovsky 和 Irina Rish。“简单且可扩展的策略以持续预训练大型语言模型”。载于 *arXiv 预印本*，2024 年。\n- Junqi Yin、Avishek Bose、Guojing Cong、Isaac Lyngaas、**Quentin Anthony**。“前沿领域大型语言模型架构的比较研究”。载于 *arXiv 预印本*，2024 年。\n\n其他研究团队的以下出版物使用了该库：\n- 蔡忠志、范廷翰、彼得·拉马奇和亚历山大·鲁德尼茨基。\"[KERPLE：用于长度外推的核化相对位置嵌入](https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.09921)。\" 载于《神经信息处理系统进展》第35卷，2022年。\n- 萨米拉·霍拉瓦拉维塔纳、埃琳·艾顿、希瓦姆·夏尔马、斯科特·豪兰德、梅加·苏布拉马尼安、斯科特·巴斯克斯、罗宾·科斯比、玛丽亚·格伦斯基和斯维特兰娜·沃尔科娃。\"[化学领域的科学知识基础模型：机遇、挑战与经验教训](https:\u002F\u002Faclanthology.org\u002F2022.bigscience-1.12\u002F)。\" 载于《ACL大型语言模型创建中的挑战与展望研讨会论文集》，2022年。\n- 索菲娅·科拉克、鲁本·马丁斯、克莱尔·勒古埃斯和文森特·J·赫伦多恩。\"[利用语言模型生成代码片段：可行性与规模效应](https:\u002F\u002Fpar.nsf.gov\u002Fbiblio\u002F10340618)\"。 载于《ICLR深度学习与代码研讨会论文集》，2022年。\n- 弗兰克·F·徐、乌里·阿隆、格雷厄姆·纽比格和文森特·J·赫伦多恩。\"[大型代码语言模型的系统性评估](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.13169)。\" 载于《ICLR深度学习与代码研讨会论文集》，2022年。\n- 普永道和威廉·舒勒。\"[基于Transformer的语言模型惊喜度在约20亿训练token时对人类阅读时间的预测效果最佳](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.11389)。\" 载于《计算语言学协会研究成果》，2023年。\n- 蔡忠志、范廷翰、亚历山大·鲁德尼茨基和彼得·拉马奇。\"[通过感受野分析视角剖析Transformer的长度外推能力](https:\u002F\u002Faclanthology.org\u002F2023.acl-long.756\u002F)。\" 载于《第61届计算语言学协会年会论文集（第一卷：长文）》，第13522–13537页，2023年。\n- 蔡忠志、范廷翰、陈立伟、亚历山大·鲁德尼茨基和彼得·拉马奇。\"[无位置嵌入的Transformer语言模型中，潜在的位置信息蕴含于自注意力的方差之中](https:\u002F\u002Faclanthology.org\u002F2023.acl-short.102\u002F)。\" 载于《第61届计算语言学协会年会论文集（第二卷：短文）》，第13522–13537页，2023年。\n- 冯锡东、罗一成、王子言、唐宏瑞、杨梦月、邵坤、大卫·姆古尼、杜雅莉和王军。\"[ChessGPT：连接策略学习与语言建模](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.09200)。\"_arXiv预印本 arXiv:2306.09200_\"，2023年。\n- 奥里昂·沃克·多勒、萨米拉·霍拉瓦拉维塔纳、斯科特·巴斯克斯、W·詹姆斯·普芬德特纳和斯维特兰娜·沃尔科娃。\"[MolJET：用于条件性从头分子设计及多属性优化的多模态联合嵌入Transformer](https:\u002F\u002Fopenreview.net\u002Fpdf?id=7UudBVsIrr)。\"_正在审阅中的预印本_\"，2023年。\n- 让·卡杜尔和刘琦。\"[通过微调大型语言模型实现低资源环境下的文本数据增强](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01119)。\"_arXiv:2310.01119_\"，2023年。\n- 
阿隆·阿尔巴拉克、潘梁明、科林·拉菲尔和威廉·杨·王。\"[面向语言模型预训练的高效在线数据混合](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.02406)。\" 载于《NeurIPS关于R0-FoMo的研讨会：大型基础模型中少样本与零样本学习的鲁棒性》，2023年。\n- 埃格巴尔·A·侯赛尼和埃韦丽娜·费多连科。\"[大型语言模型隐式地学会将神经句法轨迹拉直，从而构建自然语言的预测性表征](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2023.11.05.564832v1)。\" 载于《神经信息处理系统》，2023年。\n- 尹俊琪、萨贾尔·达什、王飞翼和马利卡尔君·尚卡尔。\"[FORGE：面向科学的开放基础模型预训练](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3581784.3613215)。\" 载于《高性能计算、网络、存储与分析国际会议论文集》，第1–13页，2023年。\n- 让·卡杜尔和刘琦。\"[通过微调大型语言模型实现低资源环境下的文本数据增强](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01119)。\" 载于\"_arXiv预印本 arXiv:2310.01119_\"，2023年。\n- 彭迪、李建国、于航、蒋伟、蔡文婷、曹阳、陈超宇、陈大钧、陈洪伟、陈亮、樊刚、龚杰、龚梓、胡文、郭婷婷、雷志超、李婷、李正、梁明、廖聪、刘炳昌、刘嘉晨、刘志伟、陆绍军、沈敏、王广培、王欢、王志、许兆贵、杨佳伟、叶青、张戈浩、张宇、赵泽林、郑训进、周海莲、朱立夫和朱贤英。\"[CodeFuse-13B：一个预训练的多语言代码大型语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06266)。\" 载于\"_arXiv预印本 arXiv:2310.06266_\"，2023年。\n- 尼基塔·拉奥、库什·贾因、乌里·阿隆、克莱尔·勒古埃斯和文森特·J·赫伦多恩。\"[CAT-LM：在对齐的代码和测试上训练语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01602)。\" 载于第38届IEEE\u002FACM自动化软件工程国际会议（ASE），第409–420页。IEEE，2023年。\n- 普拉蒂尤什·帕特尔、伊莎·丘克塞、张超杰、伊尼戈·戈伊里、布里杰什·瓦里耶尔、尼提什·马哈林甘和里卡多·比安奇尼。\"[POLCA：LLM云服务提供商中的功率超额订阅](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.12908)。\" 载于\"_arXiv预印本_\"，2023年。\n- 尹俊琪、萨贾尔·达什、约翰·古恩利、王飞翼和乔治娅·图拉斯西。\"[在领导级超级计算机上预训练大型语言模型的评估](https:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007\u002Fs11227-023-05479-7)。\" 载于《超级计算杂志》第79卷第18期，2023年。\n- 塔尔·卡多什、尼兰詹·哈萨布尼斯、Vy A. 
伍、纳达夫·施耐德、内娃·克里恩、米哈伊·卡波塔、阿卜杜勒·瓦赛、内斯林·艾哈迈德、泰德·威尔克、盖伊·塔米尔、尤瓦尔·平特尔、蒂莫西·马特森和加尔·奥伦。\"[领域专用代码语言模型：挖掘HPC代码与任务的潜力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13322)。\" 载于\"_arXiv预印本_\"，2023年。\n- 沈国斌、赵东成、董怡婷、李洋、李金东、孙康和曾毅。\"[星形胶质细胞助力脉冲神经网络在大型语言模型中的发展](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.07625)。\" 载于\"_arXiv预印本_\"，2023年。\n- 埃格巴尔·A·侯赛尼、马丁·A·施林普夫、张彦、塞缪尔·鲍曼、诺加·扎斯拉夫斯基和埃韦丽娜·费多连科。\"[人工神经网络语言模型即使经过发育阶段上较为现实的训练量，其神经活动与行为表现仍能与人类保持一致](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2022.10.04.510681)。\" 载于《语言神经生物学》，2024年。\n- 肖雄业、周辰宇、彭恒、曹德富、李亚星、周义卓、李世轩和保罗·博格丹。\"[从多重分形分析视角探索LLM中的神经元交互与涌现现象](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.09099)。\" 载于\"_arXiv预印本_\"，2024年。\n- 曾志远、郭启鹏、费兆业、尹章悦、周云华、李林阳、孙天翔、严航、林大华和邱锡鹏。\"[变废为宝：修正MoE的Top-k路由器](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12399)。\" 载于\"_arXiv预印本_\"，2024年。\n\n\n\n## 模型\n以下模型是使用该库训练的：\n\n### 英语大模型\n- EleutherAI 的 [GPT-NeoX-20B](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fgpt-neox-20b) 和 [Pythia（70M 至 13B）](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia)\n- CarperAI 的 [FIM-NeoX-1.3B](https:\u002F\u002Fhuggingface.co\u002FCarperAI\u002FFIM-NeoX-1.3B)\n- StabilityAI 的 [StableLM（3B 和 7B）](https:\u002F\u002Fgithub.com\u002FStability-AI\u002FStableLM)\n- Together.ai 的 [RedPajama-INCITE（3B 和 7B）](https:\u002F\u002Ftogether.ai\u002Fblog\u002Fredpajama-models-v1)\n- 卡内基梅隆大学的 [proofGPT（1.3B 和 6.7B）](https:\u002F\u002Fhuggingface.co\u002Fhoskinson-center\u002FproofGPT-v0.1-6.7B)\n- Dampish 的 [StellarX（2.8B 和 4B）](https:\u002F\u002Fhuggingface.co\u002FDampish\u002FStellarX-4B-V0.2)\n- 中国科学院的 [AstroSNN（1.5B）](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.07625)\n\n### 非英语大模型\n- EleutherAI 的 [Polyglot-Ko（1.3B 至 12.8B）](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpolyglot)（韩语）\n- 高丽大学的 [KULLM-Polyglot（5.8B 和 12.8B）](https:\u002F\u002Fgithub.com\u002Fnlpai-lab\u002FKULLM)（韩语）\n- Stability AI 的 [日语 Stable 
LM（7B）](https:\u002F\u002Fhuggingface.co\u002Fstabilityai\u002Fjapanese-stablelm-base-alpha-7b)（日语）\n- LearnItAnyway 的 [LLaVA-Polyglot-Ko（1.3B）](https:\u002F\u002Fhuggingface.co\u002FLearnItAnyway\u002Fllava-polyglot-ko-1.3b-hf)（韩语）\n- Rinna 公司的 [japanese-gpt-neox-3.6b](https:\u002F\u002Fhuggingface.co\u002Frinna\u002Fjapanese-gpt-neox-3.6b)（日语）和 [bilingual-gpt-neox-4b](https:\u002F\u002Fhuggingface.co\u002Frinna\u002Fbilingual-gpt-neox-4b)（英日双语）\n- CyberAgent 的 [OpenCALM（125M 至 7B）](https:\u002F\u002Fhuggingface.co\u002Fcyberagent\u002Fopen-calm-7b)（日语）\n- 匈牙利语言学研究中心的 [PULI GPTrio（6.7B）](https:\u002F\u002Fhuggingface.co\u002FNYTK\u002FPULI-GPTrio)（匈牙利语 \u002F 英语 \u002F 中文）\n- 东京大学的 [weblab-10b](https:\u002F\u002Fhuggingface.co\u002FKojima777\u002Fweblab-10b) 和 [weblab-10b-instruct](https:\u002F\u002Fhuggingface.co\u002FKojima777\u002Fweblab-10b-instruction-sft)（日语）\n- nolano.ai 的 [Hi-NOLIN（9B）](https:\u002F\u002Fblog.nolano.ai\u002FHi-NOLIN\u002F)（英语、印地语）\n- 中国人民大学的 [YuLan（12B）](https:\u002F\u002Fhuggingface.co\u002Fyulan-team\u002FYuLan-Base-12b)（英语、中文）\n- 巴斯克语言技术中心的 [Latxa（70B）](https:\u002F\u002Fhuggingface.co\u002FHiTZ\u002Flatxa-70b-v1.2)（巴斯克语）\n\n### 代码模型\n- 卡内基梅隆大学的 [PolyCoder（160M 至 2.7B）](https:\u002F\u002Fgithub.com\u002FVHellendoorn\u002FCode-LMs) 和 [CAT-LM（2.7B）](https:\u002F\u002Fhuggingface.co\u002Fnikitharao\u002Fcatlm)\n- StabilityAI 的 [StableCode（1.3B）](https:\u002F\u002Fstability.ai\u002Fblog\u002Fstablecode-llm-generative-ai-coding) 和 [StableCode-Completion-Alpha（3B）](https:\u002F\u002Fstability.ai\u002Fblog\u002Fstablecode-llm-generative-ai-coding)\n- CodeFuse AI 的 [CodeFuse（13B）](https:\u002F\u002Fhuggingface.co\u002Fcodefuse-ai\u002FCodeFuse-13B)\n\n### 科学领域的人工智能\n- EleutherAI 的 [LLeMMA（34B）](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.10631)\n- 橡树岭国家实验室的 [FORGE（26B）](https:\u002F\u002Fgithub.com\u002Fat-aaims\u002Fforge)\n- 橡树岭国家实验室的 [未命名材料科学领域模型（7B）](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.00691)\n- 太平洋西北国家实验室的 
[MolJet（规模未公开）](https:\u002F\u002Fopenreview.net\u002Fpdf?id=7UudBVsIrr)\n\n### 其他模态\n- Rinna 公司的 [PSLM（7B）](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.12428)（语音 \u002F 文本）\n- 伦敦大学学院的 [ChessGPT-3B](https:\u002F\u002Fhuggingface.co\u002FWaterhorse\u002Fchessgpt-base-v1)\n- Gretel 的 [Text-to-Table（3B）](https:\u002F\u002Fhuggingface.co\u002Fgretelai\u002Ftext2table)\n\n# 行政说明\n\n## 引用 GPT-NeoX\n如果您在工作中发现 GPT-NeoX 库很有帮助，可以按以下方式引用该仓库：\n\n```bibtex\n@software{gpt-neox-library,\n  title = {{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}},\n  author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Phang, Jason and Purohit, Shivanshu and Schoelkopf, Hailey and Stander, Dashiell and Songz, Tri and Tigges, Curt and Thérien, Benjamin and Wang, Phil and Weinbach, Samuel},\n  url = {https:\u002F\u002Fwww.github.com\u002Feleutherai\u002Fgpt-neox},\n  doi = {10.5281\u002Fzenodo.5879544},\n  month = {9},\n  year = {2023},\n  version = {2.0.0},\n}\n```\n\n要引用参数量为 200 亿的模型 `GPT-NeoX-20B`，请使用：\n\n```bibtex\n@inproceedings{gpt-neox-20b,\n  title={{GPT-NeoX-20B}: An Open-Source Autoregressive Language Model},\n  author={Black, Sid and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, USVSN Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel},\n  booktitle={Proceedings of the ACL Workshop on Challenges \\& Perspectives in Creating Large Language Models},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.06745},\n  year={2022}\n}\n```\n\n## 贡献\nGPT-NeoX 由开源人工智能社区构建，离不开我们优秀的贡献者！有关我们的 CLA、代码格式化、测试等方面的详细信息，请参阅我们的[贡献指南](CONTRIBUTING.md)。\n\n## 许可协议\n本仓库托管的是 EleutherAI GPT-NeoX 项目的一部分代码。版权所有 © 2024, 
EleutherAI。根据 Apache 许可证授权：\n\n    根据 Apache 许可证第 2.0 版（“许可证”）进行授权；\n    除非遵守许可证条款，否则不得使用此文件。\n    您可以在以下网址获取许可证副本：\n\n        http:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0\n\n    除非适用法律要求或书面同意，否则根据“原样”基础分发软件，\n    不提供任何形式的保证或条件。\n    请参阅许可证以了解具体的权限和限制。\n\n本仓库基于 NVIDIA 编写的代码，并根据 Apache 许可证第 2.0 版授权。根据 Apache 许可证的规定，所有对 NVIDIA 原始代码进行修改的文件均保留 NVIDIA 的版权声明。而未包含此类声明的文件则完全归 EleutherAI 所有。当 NVIDIA 的代码被修改时，会在版权声明中注明这一点。本仓库的所有衍生作品都必须按照 Apache 许可证的要求保留这些声明。\n\n此外，本仓库还包含其他作者编写的代码。相关贡献均已标注，并在适当情况下附上了相应的许可信息。\n\n完整条款请参阅 `LICENSE` 文件。如您对许可有任何疑问、意见或顾虑，请发送电子邮件至 contact@eleuther.ai。\n\n## 致谢\n我们使用 [CoreWeave](https:\u002F\u002Fcoreweave.com\u002F) 提供的 Kubernetes 集群以及 [Stability AI](https:\u002F\u002Fstability.ai) 提供的 Slurm 集群进行实验。同时，我们也感谢 DeepSpeed 团队提供的建议和咨询。","# GPT-NeoX 快速上手指南\n\nGPT-NeoX 是 EleutherAI 开发的用于在 GPU 上训练大规模语言模型的库。它基于 NVIDIA 的 Megatron-LM，并融合了 DeepSpeed 技术及多种新颖优化。**注意：本库专为从头训练数十亿参数级别的模型设计；若仅需通用推理，建议使用 Hugging Face `transformers` 库。**\n\n## 环境准备\n\n### 系统要求\n*   **Python**: 推荐 3.8 - 3.10\n*   **PyTorch**: 推荐 1.8 - 2.0\n*   **GPU**: 支持 NVIDIA (Ampere\u002FHopper 等) 及 AMD (MI100, MI250X) GPU。\n*   **多节点支持**: 支持 Slurm, MPI, IBM Job Step Manager 等启动器。\n\n### 前置依赖与警告\n本库依赖 EleutherAI  fork 的 **[DeeperSpeed](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002FDeeperSpeed)**（基于 DeepSpeed 修改版）。\n> **重要警告**：强烈建议在 Anaconda、虚拟环境或容器中进行隔离安装。直接在全局环境安装可能会破坏其他依赖标准 DeepSpeed 的项目。\n\n若需使用以下加速特性，请提前准备：\n*   **Flash Attention**: 适用于 Ampere (如 A100) 及更新架构，可显著提升速度。\n*   **Transformer Engine**: 适用于 NVIDIA A100\u002FH100，提供高效内核。\n*   **多节点训练**: 需准备 `hostfile` 文件，格式如下：\n    ```bash\n    node1_ip slots=8\n    node2_ip slots=8\n    ```\n\n## 安装步骤\n\n1.  **克隆仓库**\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox.git\n    cd gpt-neox\n    ```\n\n2.  
**安装基础依赖**\n    建议先配置国内镜像源（如清华源）以加速下载：\n    ```bash\n    pip config set global.index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n    ```\n    \n    执行安装命令：\n    ```bash\n    pip install -r requirements\u002Frequirements.txt\n    ```\n\n3.  **安装可选监控依赖**（按需选择）\n    ```bash\n    # WandB 监控\n    pip install -r requirements\u002Frequirements-wandb.txt\n    \n    # TensorBoard 监控\n    pip install -r requirements\u002Frequirements-tensorboard.txt\n    \n    # Comet ML 监控\n    pip install -r requirements\u002Frequirements-comet.txt\n    ```\n\n4.  **安装加速组件**（可选但推荐）\n    *   **Flash Attention**:\n        ```bash\n        pip install -r requirements\u002Frequirements-flashattention.txt\n        ```\n    *   **Transformer Engine**:\n        ```bash\n        pip install -r requirements\u002Frequirements-transformer-engine.txt\n        ```\n    *   **AMD GPU 融合内核预编译** (避免启动时等待):\n        ```bash\n        python -c \"from megatron.fused_kernels import load; load()\"\n        ```\n\n## 基本使用\n\nGPT-NeoX 通过配置文件驱动训练。以下是启动训练的最简流程：\n\n1.  **准备配置文件**\n    在 `configs\u002F` 目录下选择一个预设配置（例如 `1-3B.yml` 或针对特定架构的配置），或复制一份作为模板进行修改。关键配置项包括数据路径、模型参数及并行策略。\n\n2.  **启动训练**\n    使用仓库自带的 `deepy.py` 启动训练任务（它封装了 `deepspeed` 启动器）。以下是运行示例：\n\n    ```bash\n    python .\u002Fdeepy.py train.py -d configs 125M.yml local_setup.yml\n    ```\n\n    *   `-d`（即 `--conf_dir`）: 配置文件所在目录。\n    *   其后的位置参数: 一个或多个配置文件名，运行时会合并。\n\n3.  **多节点启动示例**\n    若需多节点训练，确保已设置 `hostfile` 路径（通过环境变量 `DLTS_HOSTFILE` 或配置文件中的 `\"hostfile\"` 字段），并使用相应的启动器（默认为 `pdsh`），然后以同样的方式启动：\n\n    ```bash\n    export DLTS_HOSTFILE=\u002Fpath\u002Fto\u002Fhostfile\n    python .\u002Fdeepy.py train.py -d configs ...\n    ```\n\n**提示**：训练完成后，模型可通过内置工具导出为 Hugging Face 格式以便后续推理使用。","某国家级实验室的研究团队正试图在 Frontier 超级计算机上从头训练一个拥有数百亿参数的开源大语言模型，以探索特定科学领域的知识边界。\n\n### 没有 gpt-neox 时\n- **硬件适配困难**：团队需耗费数周手动修改代码以适配 Slurm 调度器和 MPI 环境，难以直接利用 Frontier 等顶级超算资源。\n- **显存瓶颈限制**：缺乏高效的 ZeRO 优化和 3D 并行策略，导致单卡显存无法容纳巨大模型，被迫缩小模型规模或降低批次大小。\n- **训练稳定性差**：缺少内置的 Curriculum Learning（课程学习）和新型位置编码（如 Rotary\u002FAlibi），模型在长序列训练中容易发散或收敛缓慢。\n- **生态割裂严重**：实验监控、分词器处理与评估流程分散在不同工具链中，数据对接繁琐，难以复现和追踪实验结果。\n\n### 使用 gpt-neox 后\n- **无缝超算集成**：gpt-neox 原生支持 Slurm、MPI 及 IBM Job Step Manager，团队可直接在 Frontier 和 Summit 上启动大规模分布式训练。\n- **突破显存限制**：借助 DeepSpeed 的 ZeRO 技术和 3D 并行架构，成功在有限显存下运行百亿参数模型，显著提升了训练效率。\n- **加速收敛与创新**：直接调用预置的 Pythia 或 LLaMA 架构配置，结合 Flash Attention 和课程学习策略，大幅缩短了模型达到最优性能的时间。\n- **全流程生态打通**：一键对接 Hugging Face 分词器与 Transformers 库，并通过 WandB 或 Comet ML 实时监控实验，实现了从训练到评估的闭环管理。\n\ngpt-neox 将原本需要数月攻坚的基础设施难题转化为可配置的选项，让研究人员能专注于算法创新而非工程修补。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FEleutherAI_gpt-neox_1b9cd590.png","EleutherAI","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FEleutherAI_cadf9bbb.png","",null,"contact@eleuther.ai","AIEleuther","www.eleuther.ai","https:\u002F\u002Fgithub.com\u002FEleutherAI",[86,90,94,98,102,106,110],{"name":87,"color":88,"percentage":89},"Python","#3572A5",85.4,{"name":91,"color":92,"percentage":93},"C++","#f34b7d",11.2,{"name":95,"color":96,"percentage":97},"Cuda","#3A4E3A",2.5,{"name":99,"color":100,"percentage":101},"Shell","#89e051",0.5,{"name":103,"color":104,"percentage":105},"Dockerfile","#384d54",0.3,{"name":107,"color":108,"percentage":109},"C","#555555",0.1,{"name":111,"color":112,"percentage":113},"Makefile","#427819",0,7409,1105,"2026-04-11T22:36:17","Apache-2.0",5,"Linux","必需。主要支持 NVIDIA GPU (如 A100, H100, Ampere\u002FHopper 架构)；已支持 AMD GPU (MI100, MI250X)。需通过 JIT 编译融合内核。未明确具体显存大小，但工具专为十亿级参数模型训练设计，通常需要高显存多卡环境。","未说明",{"notes":123,"python":124,"dependencies":125},"该库主要用于从头训练十亿级参数的大模型，若仅需推理建议使用 Hugging Face transformers。强烈建议使用 Anaconda 或虚拟机隔离环境，因其依赖定制的 
DeeperSpeed 分支，可能与其他依赖标准 DeepSpeed 的库冲突。支持多种启动器 (Slurm, MPI, pdsh) 及多节点分布式训练。","3.8-3.10",[126,127,128,129,130,131,132],"PyTorch 1.8-2.0","DeepSpeed (基于 EleutherAI\u002FDeeperSpeed 分支)","Megatron-LM","Flash Attention 2.x (可选)","NVIDIA Transformer Engine (可选)","Hugging Face tokenizers","WandB \u002F Comet \u002F TensorBoard (可选监控)",[15],[135,136,137,138],"deepspeed-library","gpt-3","transformers","language-model","2026-03-27T02:49:30.150509","2026-04-13T13:46:17.714216",[142,147,152,157,162,167],{"id":143,"question_zh":144,"answer_zh":145,"source_url":146},31628,"如何将 Llama 权重转换为 GPT-NeoX 格式以避免加载错误？","运行转换脚本时需注意参数与配置文件的一致性：\n1. 如果运行 `convert_raw_llama_weights_to_neox.py` 时**未**传递 `--pipeline_parallel` 参数，必须在 YAML 配置文件中设置 `pipe-parallel-size: 0`。\n2. 如果**传递**了 `--pipeline_parallel` 参数，必须在 YAML 配置文件中设置 `pipe-parallel-size` 为大于等于 1 的值（例如 1）。\n\n原因：当流水线并行大小设为 0 时，检查点的保存\u002F加载格式不同（尝试从 \"module\" 键加载）；而使用流水线并行时，权重是按层文件保存和加载的。不匹配会导致找不到权重文件的错误。","https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fissues\u002F971",{"id":148,"question_zh":149,"answer_zh":150,"source_url":151},31629,"为什么开启流水线并行（Pipeline Parallelism）后训练速度变慢？","流水线并行（PP）会引入节点间通信开销，可能导致速度显著下降。优化建议如下：\n1. 尝试减小微批次大小（micro batch size），甚至设为 1。\n2. 如果显存允许，尝试减少或移除流水线并行（设置 `pipe-parallel-size=1` 或 0），仅使用模型并行（Model Parallelism）。\n3. 参考数据：在 512 张 A100 GPU 上训练 175B 模型，序列长度 2048 时，优化后可达到 100+ TFLOPs；若微批次过大或 PP 设置不当，TFLOPs 可能降至 33 左右。","https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fissues\u002F903",{"id":153,"question_zh":154,"answer_zh":155,"source_url":156},31630,"如何在多节点集群上进行分布式训练？","在多节点（Multi-node）环境下训练时，需要正确配置通信地址和端口。虽然具体脚本取决于集群调度系统（如 SLURM），但核心思路类似于 `torch.distributed.launch`：\n1. 需要指定 `master_addr`（主节点地址）和 `master_port`（主节点端口）。\n2. 确保所有节点能够通过网络互相访问指定的端口。\n3. 
如果使用 SLURM 等调度器，需编写相应的提交脚本来自动分配这些参数并启动多节点任务。目前社区中有关于如何使用 SLURM 运行多节点训练的讨论。","https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fissues\u002F734",{"id":158,"question_zh":159,"answer_zh":160,"source_url":161},31631,"单节点 8 张 A100 GPU 无法运行 13B 或 20B 模型（显存不足 OOM）怎么办？","单节点 8 张 A100（假设每张 40GB，共 320GB）可能不足以直接训练或微调大模型，因为除了权重外，还需要存储优化器状态、梯度以及激活值。\n1. 对于 20B 模型，权重和优化器状态总和约为 268GB，这还未包含激活值和临时缓冲区，因此单节点极易 OOM。\n2. 解决方案包括：\n   - 增加节点数量（多节点训练）。\n   - 启用 DeepSpeed ZeRO Stage 3 并结合 CPU Offload（将部分状态卸载到 CPU 内存）。\n   - 减小微批次大小（micro batch size）。\n   - 使用流水线并行（Pipeline Parallelism）将模型切分到更多 GPU 上。","https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fissues\u002F409",{"id":163,"question_zh":164,"answer_zh":165,"source_url":166},31632,"合并检查点后出现 'size mismatch for word_embeddings.weight' 错误如何解决？","该错误通常是由于词汇表大小（Vocab Size）不匹配导致的。例如，检查点中的权重形状为 `[50432, 6144]`，而当前模型期望的形状为 `[50304, 6144]`。\n解决方法：\n1. 确认使用的分词器（Tokenizer）与预训练模型是否一致。GPT-NeoX 通常使用特定的分词器，其词汇表大小固定。\n2. 检查合并脚本（如 `merge20b.py`）的输入目录是否包含了正确的分词器文件或配置。\n3. 如果是自定义数据集训练，需确保初始化模型时的词汇表大小与检查点完全一致，否则无法直接加载权重。","https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fissues\u002F612",{"id":168,"question_zh":169,"answer_zh":170,"source_url":171},31633,"安装依赖时遇到 'No matching distribution found for triton' 错误怎么办？","这通常是因为指定的 Triton 版本（如 0.4.2 或 1.0.0）与当前的 Python 版本、PyTorch 版本或操作系统不兼容。\n解决步骤：\n1. 检查当前安装的 Triton 版本：运行 `pip show triton`。\n2. 确认 PyTorch 版本是否与所需的 Triton 版本匹配。Triton 对 PyTorch 版本有严格依赖。\n3. 尝试升级或降级 Triton 到与当前环境兼容的版本，或者使用项目推荐的特定 requirements 文件（有时需要手动编译安装特定版本的 Triton）。\n4. 
确保使用的是受支持的 Python 版本（如 3.8 或 3.9），较新的 Python 版本可能缺乏预编译包。","https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fissues\u002F628",[173,178,183],{"id":174,"version":175,"summary_zh":176,"released_at":177},238884,"v2.0","借助 GPT-NeoX 2.0，我们现在支持上游的 DeepSpeed。这使得我们可以使用 DeepSpeed 的新功能，例如 [课程学习](https:\u002F\u002Fwww.deepspeed.ai\u002Ftutorials\u002Fcurriculum-learning\u002F)、[通信日志记录](https:\u002F\u002Fwww.deepspeed.ai\u002Ftutorials\u002Fcomms-logging\u002F) 和 [自动调优](https:\u002F\u002Fwww.deepspeed.ai\u002Ftutorials\u002Fautotuning\u002F)。\n\n对于上游 DeepSpeed 中任何与 GPT-NeoX 2.0 根本不兼容的更改，我们采取以下措施：\n- 尝试向上游 DeepSpeed 提交拉取请求；\n- 将该拉取请求暂存于 [DeeperSpeed 2.x](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002FDeeperSpeed\u002Freleases\u002Ftag\u002Fv2.0)，以确保始终有一个与 GPT-Neox 2.x 兼容的 DeepSpeed 版本。\n\n因此，除非您的用例依赖于尚未合并到 [DeeperSpeed 2.x](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002FDeeperSpeed\u002Freleases\u002Ftag\u002Fv2.0) 的特定上游 DeepSpeed 功能，否则我们建议使用 [DeeperSpeed 2.x](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002FDeeperSpeed\u002Freleases\u002Ftag\u002Fv2.0)。\n\n## 变更内容\n* 在 https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fpull\u002F704 中添加了 Mup 支持；\n* 在 https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fpull\u002F746 中更新了 deepspeed_main；\n* 在 https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fpull\u002F663 中添加了对最新版 DeepSpeed 的支持；\n* 在 https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fpull\u002F695 中添加了课程学习支持；\n* 在 https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fpull\u002F739 中添加了自动调优支持。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fcompare\u002Fv1.0...v2.0","2023-03-10T00:26:35",{"id":179,"version":180,"summary_zh":181,"released_at":182},238885,"v1.0","这是 GPT-NeoX 基于旧版 DeeperSpeed（0.3.15）的遗留版本。我们仅建议在以下情况下使用此版本：您正在加载基于旧版 DeeperSpeed 的模型（例如 GPT-J、GPT-NeoX20B、Pythia 系列等）。\n\n本版本与 v2.x 的主要区别在于所支持的 DeepSpeed 版本。如果您使用的是 2.x 版本，我们假定您正在使用 DeepSpeed 的最新版本或 DeeperSpeed 
2.x 版本。","2023-03-09T17:11:32",{"id":184,"version":185,"summary_zh":80,"released_at":186},238886,"legacy_gptj_residual.1.0.0","2022-05-16T14:24:47"]