[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-EleutherAI--pythia":3,"tool-EleutherAI--pythia":65},[4,17,25,39,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":10,"last_commit_at":23,"category_tags":24,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":26,"name":27,"github_repo":28,"description_zh":29,"stars":30,"difficulty_score":10,"last_commit_at":31,"category_tags":32,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[33,34,35,36,14,37,15,13,38],"图像","数据工具","视频","插件","其他","音频",{"id":40,"name":41,"github_repo":42,"description_zh":43,"stars":44,"difficulty_score":45,"last_commit_at":46,"category_tags":47,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[14,33,13,15,37],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":45,"last_commit_at":54,"category_tags":55,"status":16},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 
Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74939,"2026-04-05T23:16:38",[15,33,13,37],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":62,"last_commit_at":63,"category_tags":64,"status":16},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,1,"2026-04-03T21:50:24",[13,37],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":80,"owner_twitter":81,"owner_website":82,"owner_url":83,"languages":84,"stars":97,"forks":98,"last_commit_at":99,"license":100,"difficulty_score":101,"env_os":102,"env_gpu":103,"env_ram":104,"env_deps":105,"category_tags":111,"github_topics":79,"view_count":10,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":112,"updated_at":113,"faqs":114,"releases":143},4210,"EleutherAI\u002Fpythia","pythia","The hub for EleutherAI's work on interpretability and learning dynamics","Pythia 是 EleutherAI 推出的开源项目，旨在成为研究大型语言模型可解释性与学习动态的核心枢纽。它并非单一模型，而是一整套经过精心设计的自回归 Transformer 模型系列，专门用于揭示知识在训练过程中如何产生与演变。\n\n针对现有模型套件难以支持深度机理研究的痛点，Pythia 提供了前所未有的透明度与复现性。其核心突破在于：所有模型均基于完全相同的数据顺序训练，并保留了多达 154 个训练检查点。这一独特设计让研究者能够像观看“延时摄影”一样，细致观察模型从随机初始化到成熟的全过程，从而进行因果干预分析和学习轨迹追踪。此外，项目公开了全部代码、数据及论文结果，确保每一项结论都可被独立验证。\n\nPythia 主要面向 AI 研究人员、算法工程师及对大模型内部机制感兴趣的开发者。无论是探究记忆涌现规律、分析训练稳定性，还是评估伦理风险，Pythia 都为这些前沿课题提供了坚实的基础设施。作为该领域的先驱，Pythia 的成功实践已激发了包括 OLMo 和 Amber 在内的多个后续开源项目，极大地推动了社区对大模型“黑盒”内部的探索进程。","# Pythia: Interpreting Transformers Across Time and Scale\n\nThis repository is for EleutherAI's project *Pythia* which combines interpretability analysis and scaling laws to understand how knowledge develops and evolves during training in autoregressive transformers. For detailed info on the models, their training, and their properties, please see our paper [Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.01373).\n\nThe Pythia suite was developed with the explicit purpose of enabling research in interpretability, learning dynamics, and ethics and transparency for which existing model suites were inadequate. The key features of the Pythia suite are:\n1. All models, data, and code used in the paper are publicly released, enabling full reproducibility of results. All results in our paper have been independently verified by at least one other lab.\n2. All models feature 154 checkpoints saved throughout training, enabling the study of learning dynamics of LLMs.\n3. All models were trained on the same data in the same order, enabling researchers to explore causal interventions on the training process.\n\nAt time of release, Pythia was the only model suite in the world to meet these desiderata. 
In fact, the 154 checkpoints we released for our 12B parameter models represented more partially trained checkpoints for each model than the rest of the world had ever released for all 12B+ models combined. Our work has inspired several others to create similar projects, including LLM360's [Amber](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06550) and [K2-65B](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.07124), AI2's [OLMo](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.00838), and Zyphra's [BlackMamba](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01771).\n\nAside from the Pythia suite itself, this repository also acts as a hub containing information, code, and reproducibility instructions for the following papers:\n* Emergent and Predictable Memorization in Large Language Models [[code](\u002Fpredictable-memorization)] [[paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.11158)]\n* PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs [[code](\u002Fpolypythias)] [[paper](https:\u002F\u002Fopenreview.net\u002Fforum?id=bmrYu2Ekdz)]\n\n## Changelog\n\n[March 10, 2025] Added info for the PolyPythias paper.\n\n[July 9, 2024] Substantially revamped the readme, including better historical contextualization and promoting lots of cool research people have done with Pythia. Also added links to subsequently trained models.\n\n[November 2, 2023] We have added 14M and 31M models at the request of some researchers. We plan on training deduped versions of these models in the future.\n\n[April 3, 2023] We have released a new version of all Pythia models, fixing various inconsistencies in the original suite. Please see Appendix B in [the Pythia paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.01373) for details on the changes. The old models (\"v0\") remain available [here](https:\u002F\u002Fhuggingface.co\u002Fmodels?other=pythia_v0) and may be useful for ablation studies.\n\n[January 20, 2023] On January 20, 2023, we chose to rename the Pythia model suite to include both embedding layer and unembedding layer parameters in our total parameter counts, in line with many other model suites and because we believe this convention better reflects the on-device memory usage of these models. We also discovered that due to a typo one of our models was smaller than we thought, and replaced it with a model of the intended size. 
See [here](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-410m-deduped#naming-convention-and-parameter-count) for more details.\n\n## Table of contents\n\n- [Models](#models)\n  * [Multiple random seeds](#multiple-random-seeds)\n  * [Changelog](#changelog)\n- [Using Pythia](#using-pythia)\n  * [Quickstart](#quickstart)\n  * [Reproducing Training](#reproducing-training)\n  * [Exploring the Dataset](#exploring-the-dataset)\n  * [Pythia Paper Replication](#pythia-paper-replication)\n- [Benchmark Scores](#benchmark-scores)\n- [Research Building on Pythia](#research-building-on-pythia)\n  * [Language model internals](#language-model-internals)\n  * [Learning dynamics](#learning-dynamics)\n  * [How training data determines model behavior](#how-training-data-determines-model-behavior)\n  * [Security, auditing, and compliance research](#security-auditing-and-compliance-research)\n- [Citation Details](#citation-details)\n- [License](#license)\n\n## Models\n\nWe train and release a suite of 8 model sizes on the Pile ([paper](https:\u002F\u002Fpile.eleuther.ai\u002F), [datasheet](https:\u002F\u002Farxiv.org\u002Fabs\u002F2201.07311)) as well as the Pile with deduplication applied. All 8 model sizes are trained on the exact same data, in the exact same order. Each model saw 299,892,736,000 ~= 300B tokens during training. This corresponds to just under 1 epoch on the Pile for \"standard\" models, and ~= 1.5 epochs on the deduped Pile (which contains 207B tokens in 1 epoch). All models are trained with mixed precision, using fp16 for all models except `EleutherAI\u002Fpythia-1b` which trained with bf16, because in fp16 the model experienced an irreconcilable loss spike late in training.\n\nAfter our initial release, we trained 14M and 31M parameter models at the request of alignment researchers interested in scaling sparse autoencoders.\n\n| Params | n_layers | d_model | n_heads | d_head | Batch Size | Learning Rate | Hugging Face Checkpoints                                                |\n| ------ | -------- | ------- | ------- | ------ | ---------- | ------------- | ---------------------------------------------------------- |\n| 14M    | 6        | 128     | 4       | 32     | 2M         | 1.0e-3          | [Standard](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-14m)  |\n| 31M    | 6        | 256     | 8       | 32     | 2M         | 1.0e-3          | [Standard](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-31m) |\n| 70M    | 6        | 512     | 8       | 64     | 2M         | 1.0e-3          | [Standard](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-70m), [Deduped](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-70m-deduped)  |\n| 160M   | 12       | 768     | 12      | 64     | 2M         | 6.0e-4          | [Standard](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-160m), [Deduped](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-160m-deduped)|\n| 410M   | 24       | 1024    | 16      | 64     | 2M         | 3.0e-4          | [Standard](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-410m), [Deduped](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-410m-deduped)|\n| 1B     | 16       | 2048    | 8       | 256    | 2M         | 3.0e-4          | [Standard](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-1b), [Deduped](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-1b-deduped)    |\n| 1.4B   | 24       | 2048    | 16      | 128    | 2M         | 
2.0e-4          | [Standard](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-1.4b), [Deduped](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-1.4b-deduped)|\n| 2.8B   | 32       | 2560    | 32      | 80     | 2M         | 1.6e-4        | [Standard](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-2.8b), [Deduped](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-2.8b-deduped)|\n| 6.9B   | 32       | 4096    | 32      | 128    | 2M         | 1.2e-4        | [Standard](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-6.9b), [Deduped](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-6.9b-deduped)|\n| 12B    | 36       | 5120    | 40      | 128    | 2M         | 1.2e-4        | [Standard](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-12b), [Deduped](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-12b-deduped)  |\n\n\nTo promote research on the learning dynamics of LLMs we make 154 checkpoints available for each model, representing steps 0 (initialization), 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1000, and then every 1,000 subsequent steps. We also upload the pre-tokenized data files and a script to reconstruct the dataloader as seen during training for all models. See [Reproducing Training](#reproducing-training) section for more details.\n\nConfig files used to train these models with the [GPT-NeoX library](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox) can be found at the `models\u002F` directory within this repository, as well as in the GPT-NeoX library itself.\n\nWe made a mistake while originally training these models resulting in some inconsistencies across runs. We reran the entire model suite with these inconsistencies fixed and the original runs are available under the name `EleutherAI\u002Fpythia-160m-v0`. See the Pythia paper for further details on how the v0 models differ from the main suite.\n\nThe loss curves for all models are contained in our (messy!) wandb project [here](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia).\n\nA rough and partial correspondence between models and wandb runs is given by:\n| Model | Wandb |\n| --------- | --------- |\n| Pythia-2.8b | [Link](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia\u002Fgroups\u002F2.7B%20New_36751euw?workspace=user-schoelkopf) |\n| Pythia-2.8b-deduped | [Link](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia\u002Fgroups\u002F2.7B%20Deduped%20New_1ygfbs9n?workspace=user-schoelkopf) |\n| Pythia-1b | [Link](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia\u002Fgroups\u002F800M%20Pythia_1zw5etef\u002Fworkspace) |\n| Pythia-1.4b | [Link](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia\u002Fgroups\u002FPythia%201.3B_lepj8rtx\u002Fworkspace?workspace=user-schoelkopf) |\n| Pythia-1.4b-deduped | [Link](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia\u002Fgroups\u002F1.3B%20Dedup_10v5wko4\u002Fworkspace?workspace=user-schoelkopf) |\n| Pythia-160m | [Link](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia\u002Fgroups\u002FPythia%20125M_1mpgqyzx\u002Fworkspace?workspace=user-schoelkopf) |\n| Pythia-160m-deduped | [Link](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia\u002Fgroups\u002F125M%20Dedup_ym78zh5k\u002Fworkspace?workspace=user-schoelkopf) |\n\n### Multiple random seeds\n\nThe random seed used to train the Pythia models is the GPT-NeoX default: 1234. 
To enable research into how randomness affects model behavior, we have been training more models with different random seeds. We have currently trained and released the following models using each random seed from 1 to 9.\n\n- Pythia 14M\n- Pythia 31M\n- Pythia 70M\n- Pythia 160M\n- Pythia 410M\n\nAll of these models are the _standard_ Pythia models, not the ones trained on the deduplicated Pile. Combined with the originally released models they represent ten otherwise identical variants using different random seeds. They can be found on HuggingFace using the naming pattern `https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-[size]-seed[num]`. For example, `https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-160m-seed7`. Note that the models trained with seed 1234 do not have a seed specified in their URL.\n\nRuns replicating the smaller Pythia models across multiple seeds are at: https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia-extra-seeds\n\n#### Errata\n\nAs noted [in this issue](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia\u002Fissues\u002F135), the 6.9B and 12B models accidentally used a different initialization due to not having the initialization value specified in the config file.\n\n## Using Pythia\n\n### Quickstart\n\nAll Pythia models are hosted on [the Huggingface hub](https:\u002F\u002Fhuggingface.co\u002FEleutherAI). They can be loaded and used via the following code (shown for the 3000-step `pythia-70M-deduped` model checkpoint):\n\n```python\nfrom transformers import GPTNeoXForCausalLM, AutoTokenizer\n\nmodel = GPTNeoXForCausalLM.from_pretrained(\n  \"EleutherAI\u002Fpythia-70m-deduped\",\n  revision=\"step3000\",\n  cache_dir=\".\u002Fpythia-70m-deduped\u002Fstep3000\",\n)\n\ntokenizer = AutoTokenizer.from_pretrained(\n  \"EleutherAI\u002Fpythia-70m-deduped\",\n  revision=\"step3000\",\n  cache_dir=\".\u002Fpythia-70m-deduped\u002Fstep3000\",\n)\n\ninputs = tokenizer(\"Hello, I am\", return_tensors=\"pt\")\ntokens = model.generate(**inputs)\nprint(tokenizer.decode(tokens[0]))\n```\n\nAll models were trained for the equivalent of 143,000 steps at a batch size of 2,097,152 tokens (143,000 × 2,097,152 = 299,892,736,000, the ~300B-token total given above). Revision\u002Fbranch `step143000` corresponds exactly to the model checkpoint on the `main` branch of each model.
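\n\nThe 154 checkpoints live on per-model branches named `step0` through `step143000`. If you want to enumerate them programmatically, here is a minimal sketch using the standard `list_repo_refs` helper from `huggingface_hub` (the filtering and sorting below are our own illustration, not project code):\n\n```python\nfrom huggingface_hub import list_repo_refs\n\n# Each checkpoint is a branch of the model repo: \"step0\", \"step1\", ..., \"step143000\".\nrefs = list_repo_refs(\"EleutherAI\u002Fpythia-70m-deduped\")\nsteps = sorted(\n    int(ref.name.removeprefix(\"step\"))\n    for ref in refs.branches\n    if ref.name.startswith(\"step\")\n)\nprint(len(steps), steps[-1])  # expect 154 checkpoints, the last one at step 143000\n```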
\n\nWe additionally have all model checkpoints in the format accepted by the [GPT-NeoX library](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox), with final-step checkpoints+optimizer states downloadable from the Hugging Face Hub at `EleutherAI\u002Fneox-ckpt-pythia-xxx-deduped-v1`, but we do not serve them for all steps at scale due to the size of the optimizer states and anticipated lower demand. If you would like to perform analysis using the intermediate models within the GPT-NeoX codebase, or would like the optimizer states for other steps, please email hailey@eleuther.ai and stella@eleuther.ai.\n\n> ❗ `pythia-{size}-v0` models on Huggingface of sizes `160m, 410m, 1.4b` were trained with a batch size of 4M tokens across 71500 steps and checkpointed every 500 steps. The step names on Huggingface for these v0 models are renamed for consistency with all 2M batch models, so the model checkpoint labeled `step1000` of `pythia-1.4b-v0` was actually step 500, but it has seen the same number of tokens as the other step1000 checkpoints.\n\n### Reproducing Training\n\n_(Expanded reproduction instructions provided by @BaruchG.)_\n\nWe provide the training data for replication of our training runs. The [GPT-NeoX library](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox) requires the pre-tokenized training data in the form of 2 memory-mapped numpy arrays: a `.bin` and `.idx` file. We provide these files via the Hugging Face hub. To download and use the deduplicated Pile training data:\n```bash\ngit lfs clone https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FEleutherAI\u002Fpythia_deduped_pile_idxmaps\n\n# Optionally, to ensure against corrupt files\npython utils\u002Fchecksum_shards.py\n\npython utils\u002Funshard_memmap.py --input_file .\u002Fpythia_deduped_pile_idxmaps\u002Fpile_0.87_deduped_text_document-00000-of-00082.bin --num_shards 83 --output_dir .\u002Fpythia_pile_idxmaps\u002F\n\n# The correct sha256 for the full file is 0cd548efd15974d5cca78f9baddbd59220ca675535dcfc0c350087c79f504693\n# This can be checked with sha256sum .\u002Fpythia_pile_idxmaps\u002F*\n```\nThis will take over a day to run, though it should not require more than 5 GB of RAM. We recommend downloading this rather than retokenizing the Pile from scratch in order to guarantee preservation of the data order seen by the Pythia models. In addition to the training data, you will need to make a local copy of the tokenizer we used to train our models. You can find it [here](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia\u002Fblob\u002Fmain\u002Futils\u002F20B_tokenizer.json).\n\nNext you will need to set up the training environment:\n```\ngit clone https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox.git\ncd gpt-neox\ngit checkout v1.0\npip install -r requirements\u002Frequirements-flashattention.txt\nwget https:\u002F\u002Fraw.githubusercontent.com\u002FEleutherAI\u002Fpythia\u002Fmain\u002Fmodels\u002F160M\u002Fpythia-160m-deduped.yml\ndocker build -t pythia:latest .\n```\nAfter the container finishes building, run the container using the following command (from the root of the GPT-NeoX repo, with your pythia yaml accessible from within that folder):\n```\ndocker run --runtime=nvidia --rm -it -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 --shm-size=1g --ulimit memlock=-1 --mount type=bind,src=$PWD,dst=\u002Fgpt-neox -v $(pwd):\u002Fworkspace\u002F pythia:latest bash\n```\nYou can use the `-v` argument to mount additional volumes for the dataset and the YAML file if they are not accessible from within the Docker container.\n\nChange the data path and tokenizer path lines as follows:\n```\n  \"train-data-paths\": [\"\u002Ffsx\u002Fpile\u002Fpile_20B_tokenizer_text_document\"], # point this to your folder which was generated in step 1 containing the .bin and .idx file\n  \"valid-data-paths\": [\"\u002Ffsx\u002Fpile\u002Fpile_20B_tokenizer_text_document\"], # point this to your folder which was generated in step 1 containing the .bin and .idx file\n  \"test-data-paths\": [\"\u002Ffsx\u002Fpile\u002Fpile_20B_tokenizer_text_document\"], # point this to your folder which was generated in step 1 containing the .bin and .idx file\n\n  \"tokenizer-type\": \"HFTokenizer\",\n  \"vocab-file\": \"\u002Ffsx\u002Fpile\u002F20B_tokenizer.json\", # point this to the tokenizer retrieved in step 2\n```\nDepending on how much VRAM you have available you may need to adjust the batch sizes. The total batch size is calculated via `Total GPUs * train_micro_batch_size_per_gpu * gradient_accumulation_steps \u002F (pipe-parallel-size * model-parallel-size)` and needs to be kept at 1024 to match the Pythia training batch size.
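\n\nFor example, a hypothetical single-node setup (the numbers are illustrative; pick values that fit your hardware):\n\n```python\n# Sketch of the batch-size bookkeeping with assumed values, not project code.\ntotal_gpus = 8\ntrain_micro_batch_size_per_gpu = 32\ngradient_accumulation_steps = 4\npipe_parallel_size = 1\nmodel_parallel_size = 1\n\ntotal_batch = (\n    total_gpus * train_micro_batch_size_per_gpu * gradient_accumulation_steps\n) \u002F\u002F (pipe_parallel_size * model_parallel_size)\nassert total_batch == 1024  # must match the Pythia training batch size\n```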
\n\nYou can adjust these two values accordingly:\n```\n   \"train_micro_batch_size_per_gpu\": XXX, # make this a value that will fit within your GPU memory\n   \"gradient_accumulation_steps\": 1, # adjust this value so that the total batch size comes out to 1024.\n```\nIf you would like your weights to be saved, add that information to the yaml file as well. For example, to save in the checkpoints folder, at the bottom you can add:\n```\n  \"launcher\": \"slurm\",\n  \"deepspeed_slurm\": false,\n\n  \"save\": \"checkpoints\",\n  \"load\": \"checkpoints\",\n  \"checkpoint_validation_with_forward_pass\": False,\n}\n```\nMake sure these are the paths as seen from inside your Docker container, and if you want the weights to persist, make sure that they are accessible from outside the container, for example in \u002Fworkspace\u002F.\n\nYou should now be able to start training your model by running:\n```\npython deepy.py train.py pythia-160m-deduped.yml  2>&1 | tee output.txt\n```\nThe output will be saved to output.txt; if you don't want that, drop the trailing `2>&1 | tee output.txt`.\n\nIn order to convert your model to the Hugging Face `transformers` format, you can use the script `tools\u002Fconvert_to_hf.py` from within the GPT-NeoX library. You may have to add `from typing import List` to the top of the file and change the line [here](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002F71df4d5017f9f4919566a11454fe3a507ffdc632\u002Ftools\u002Fconvert_to_hf.py#L44) from `list[torch.Tensor]` to `List[torch.Tensor]`. You can then run the script like this to convert the weights at step 143000:\n```\npython tools\u002Fconvert_to_hf.py --input_dir checkpoints\u002Fglobal_step143000\u002F --config_file checkpoints2\u002Fglobal_step143000\u002Fconfigs\u002Fpythia-70m.yml --output_dir .\u002Foutput\u002F\n```\nThis should output a file structure similar to the one found at https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-70m-deduped\u002Ftree\u002Fmain.\n\n> ❗ Sometimes people find that they don't end up with the right tokenizer for reasons we have been unable to debug. If your `tokenizer_config.json` looks different from the one [here](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-70m-deduped\u002Fblob\u002Fmain\u002Ftokenizer_config.json) or your `special_tokens_map.json` looks different from the one [here](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-70m-deduped\u002Fblob\u002Fmain\u002Fspecial_tokens_map.json), you may need to replace them with the ones on Huggingface.\n\nTo run evaluations using our evaluation library, install the containers [here](https:\u002F\u002Fhub.docker.com\u002Fr\u002Fhuggingface\u002Ftransformers-pytorch-gpu\u002Ftags) (tested with the 4.28 and 4.29 versions). After setting up that docker container, run:\n```\ngit clone https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\ncd lm-evaluation-harness\npip install -e .\n```\nas outlined in the Harness repository.
You should then be able to run the benchmark by pointing it at your weights (which should be in your container) by running a command similar to this:\n```\npython3 main.py --model hf-causal-experimental  --model_args pretrained=..\u002Fgpt-neox\u002Foutput\u002F --tasks lambada_openai,piqa,winogrande,arc_easy,sciq,wikitext --device cuda:0\n```\n\n### Exploring the Dataset\n\nWe provide a tool to view particular portions of the training dataloader used by all models during training, at `utils\u002Fbatch_viewer.py`.\n\nFirst, we need to clone the Pythia repository:\n```\ngit clone https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia\n```\nNext, we must install dependencies:\n```\npip install torch==1.13.0+cu117 -f https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Ftorch\u002F\npip install numpy tqdm huggingface_hub\n```\n\nNext, we must download the appropriate dataset. We provide preshuffled versions of the duped and deduped Pile. Download the appropriate one using Huggingface's utilities as follows:\n\n> Tip: Make sure to replace `path\u002Fto\u002F*` with the paths where you intend to save datasets downloaded from Huggingface.\n- To download the standard version, use\n  ```py\n  from huggingface_hub import hf_hub_download\n  hf_hub_download(repo_id=\"EleutherAI\u002Fpile-standard-pythia-preshuffled\", repo_type=\"dataset\", cache_dir=\"path\u002Fto\u002Flocal\u002Ffolder\")\n  ```\n- To download the deduped version, use\n  ```py\n  from huggingface_hub import hf_hub_download\n  hf_hub_download(repo_id=\"EleutherAI\u002Fpile-deduped-pythia-preshuffled\", repo_type=\"dataset\", cache_dir=\"path\u002Fto\u002Flocal\u002Ffolder\")\n  ```\n\nYou can now merge the files by using the script `utils\u002Funshard_memmap.py`:\n\n```sh\npython3 utils\u002Funshard_memmap.py --input_file \"path\u002Fto\u002Flocal\u002Ffolder\u002Fdocument-00000-of-00020.bin\" --num_shards 21 --output_dir \"path\u002Fto\u002Fmerged\u002Ffolder\u002F\"\n```\n\nMake sure to also copy the index file to the merged folder, using the command\n```sh\ncp path\u002Fto\u002Flocal\u002Ffolder\u002Fdocument.idx path\u002Fto\u002Fmerged\u002Ffolder\u002Fdocument.idx\n```\n\nNow, we're all set up to run `utils\u002Fbatch_viewer.py`!\n\n```sh\npython3 utils\u002Fbatch_viewer.py \\\n  --start_iteration 0 \\\n  --end_iteration 1000 \\\n  --load_path path\u002Fto\u002Fmerged\u002Ffolder\u002Fdocument \\\n  --save_path path\u002Fto\u002Fsave\u002Ffolder\u002F \\\n  --conf_dir utils\u002Fdummy_config.yml\n```\n\nThis will save a separate file containing all the indices as a numpy array.\n\nYou can now load this using numpy:\n\n```py\nimport numpy as np\n\nindices = np.load(\"path\u002Fto\u002Fsave\u002Ffolder\u002Findicies.npy\")\n```\n\nThe loaded array contains tokenized sequences of integers with shape (None, 2049), where each integer corresponds to a unique token index. Note that documents are concatenated and separated by an `EOD` token, so each sample or batch may not start with an EOD token. During training, target tokens are left shifted by 1; thus, a model of sequence length 2048 requires 2049-token sequences for training (for more info, refer to [this comment](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia\u002Fissues\u002F123#issuecomment-1791136253)).
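\n\nA minimal sketch of how these 2049-token rows split into inputs and left-shifted targets (illustrative only; the file name follows the example above):\n\n```py\nimport numpy as np\n\n# Each row holds 2049 token ids: 2048 inputs plus one extra token for the shifted targets.\nbatch = np.load(\"path\u002Fto\u002Fsave\u002Ffolder\u002Findicies.npy\")\n\ninputs = batch[:, :-1]   # what the model sees, length 2048\ntargets = batch[:, 1:]   # the same tokens shifted left by one position\nassert inputs.shape[1] == targets.shape[1] == 2048\n```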
\n\n### Pythia Paper Replication\n\nWe provide further information for those interested in replicating the case studies performed in the Pythia suite paper in the `case-studies\u002F` folder of this repository.\n\n### Benchmark Scores\n\nWe also provide benchmark 0-shot and 5-shot results on a variety of NLP datasets:\n\n- ARC-challenge (`arc_challenge`)\n- ARC-easy (`arc_easy`)\n- BLiMP (`blimp_*`)\n- Lambada (`lambada_openai`)\n- LogiQA (`logiqa`)\n- MMLU (`hendrycksTest*`)\n- PiQA (`piqa`)\n- SciQ (`sciq`)\n- Wikitext (`wikitext`)\n- Winogrande (`winogrande`)\n- WSC (`wsc`)\n\nEvaluations were performed in GPT-NeoX using the [LM Evaluation Harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) and are viewable by model and step at `evals\u002Fpythia-v1\u002F*\u002F*` in this repository. **Warning:** All evaluations were run using the **to-do** commit of the language model evaluation harness some time ago and may not be reproducible with the current version.\n\n## Research Building on Pythia\n\nOur primary goal with the Pythia project is to enable research on topics including interpretability and learning dynamics at EleutherAI and in the community writ large. Here we document select papers using our models, focusing on work that is uniquely empowered by the Pythia suite and would be less feasible or infeasible with models released by other organizations. For a larger list of papers citing Pythia, see [here](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FPythia%3A-A-Suite-for-Analyzing-Large-Language-Models-Biderman-Schoelkopf\u002Fbe55e8ec4213868db08f2c3168ae666001bea4b8#citing-papers).\n\n### Language model internals\n\n- Belrose, et al. \"[Eliciting latent predictions from transformers with the tuned lens](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.08112).\" _arXiv preprint arXiv:2303.08112_ (2023). **EleutherAI Paper**\n- Brown, et al. \"[Understanding the Inner Workings of Language Models Through Representation Dissimilarity](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.14993).\" _Conference on Empirical Methods in Natural Language Processing_ (2023).\n- Feng and Steinhardt. \"[How do Language Models Bind Entities in Context?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.17191).\" _International Conference on Learning Representations_ (2023).\n- Garde, Kran, and Barez. \"[DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01870).\" _arXiv preprint arXiv:2310.01870_ (2023).\n- Gurnee, et al. \"[Finding Neurons in a Haystack: Case Studies with Sparse Probing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.01610).\" _Transactions of Machine Learning Research_ (2023).\n- Stolfo, Belinkov, and Sachan. \"[Understanding Arithmetic Reasoning in Language Models using Causal Mediation Analysis](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.15054).\" _Conference on Empirical Methods in Natural Language Processing_ (2023).\n\n### Learning dynamics\n\n- Gupta, et al. \"[Continual Pre-Training of Large Language Models: How to re-warm your model?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.04014).\" _Workshop on Efficient Systems for Foundation Models @ ICML_ (2023).\n- Michaelov and Bergen. \"[Emergent inabilities? Inverse scaling over the course of pretraining](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14681).\" _Findings of the Association for Computational Linguistics: EMNLP_ (2023).
- Sanyal, et al. \"[Understanding the Effectiveness of Early Weight Averaging for Training Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.03241).\" _arXiv preprint arXiv:2306.03241_ (2023).\n- Tian, et al. \"[JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00535).\" _arXiv preprint arXiv:2310.00535_ (2023).\n- Ye, et al. \"[Language Versatilists vs. Specialists: An Empirical Revisiting on Multilingual Transfer Ability](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.06688).\" _arXiv preprint arXiv:2306.06688_ (2023).\n- Belrose, et al. \"[Neural Networks Learn Statistics of Increasing Complexity](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.04362).\" _International Conference on Learning Representations_ (2024). **EleutherAI Paper**\n- Godey, et al. \"[Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.07647).\" _arXiv preprint arXiv:2404.07647_ (2024).\n- Singh, et al. \"[Hallmarks of Optimization Trajectories in Neural Networks: Directional Exploration and Redundancy](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.07379).\" _arXiv preprint arXiv:2403.07379_ (2024).\n- Tigges, et al. \"[Stability and Generalizability of Language Model Mechanisms Across Training and Scale](https:\u002F\u002Fopenreview.net\u002Fforum?id=1WeLXvaNJP).\" _Mechanistic Interpretability Workshop @ ICML_ (2024). **EleutherAI Paper**\n- Diehl Martinez, et al. \"[Tending Towards Stability: Convergence Challenges in Small Language Models](https:\u002F\u002Faclanthology.org\u002F2024.findings-emnlp.187\u002F).\" _Findings of the Association for Computational Linguistics: EMNLP_ (2024).\n\n### How training data determines model behavior\n\n- Roger. \"[Large Language Models Sometimes Generate Purely Negatively-Reinforced Text](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.07567).\" _arXiv preprint arXiv:2306.07567_ (2023).\n- Oh, et al. \"[Frequency Explains the Inverse Correlation of Large Language Models’ Size, Training Data Amount, and Surprisal’s Fit to Reading Times](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.02255).\" _arXiv preprint arXiv:2402.02255_ (2024).\n- Liu, et al. \"[On Training Data Influence of GPT Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.07840).\" _arXiv preprint arXiv:2404.07840_ (2024).\n- Lesci, et al. \"[Causal Estimation of Memorisation Profiles](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.04327).\" _Association for Computational Linguistics_ (2024).\n\n### Security, auditing, and compliance research\n\n- Ippolito, et al. \"[Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System](https:\u002F\u002Faclanthology.org\u002F2023.inlg-main.28\u002F).\" _International Natural Language Generation Conference_ (2023).\n- Biderman, et al. \"[Emergent and predictable memorization in large language models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.11158).\" _Neural Information Processing Systems_ (2023). **EleutherAI Paper**\n- Choi, Shavit, and Duvenaud. \"[Tools for Verifying Neural Models' Training Data](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.00682).\" _Neural Information Processing Systems_ (2023).\n- Li, et al. \"[MoPe: Model Perturbation-based Privacy Attacks on Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.14369).\" _Conference on Empirical Methods in Natural Language Processing_ (2023).\n- Min, et al. \"[SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.04430).\" _International Conference on Learning Representations_ (2024).\n- Pawelczyk, et al. \"[Machine Unlearning Fails to Remove Data Poisoning Attacks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.17216).\" _arXiv preprint arXiv:2406.17216_ (2024).\n- Prashanth, et al. \"[Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.17746).\" _arXiv preprint arXiv:2406.17746_ (2024). **EleutherAI Paper**\n- Duan, et al. \"[Do Membership Inference Attacks Work on Large Language Models?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.07841).\" _Conference on Language Modeling_ (2024).\n\n## Citation Details\n\nIf you use the Pythia models in your research, please cite our paper via:\n\n```\n@inproceedings{biderman2023pythia,\n  title={Pythia: A suite for analyzing large language models across training and scaling},\n  author={Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin Gregory and Bradley, Herbie and O’Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth, USVSN Sai and Raff, Edward and others},\n  booktitle={International Conference on Machine Learning},\n  pages={2397--2430},\n  year={2023},\n  organization={PMLR}\n}\n```\nIf you use data or results from other papers found in this repository, please cite the corresponding papers. Citation information can be found in the respective READMEs and is also reproduced below for convenience:\n```\n@inproceedings{biderman2023emergent,\n      title={Emergent and Predictable Memorization in Large Language Models},\n      author={Biderman, Stella and Prashanth, USVSN Sai and Sutawika, Lintang and Schoelkopf, Hailey and Anthony, Quentin and Purohit, Shivanshu and Raff, Edward},\n      booktitle={Advances in Neural Information Processing Systems},\n      year={2023}\n}\n\n@inproceedings{van2025polypythias,\n      title={PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs},\n      author={van der Wal, Oskar and Lesci, Pietro and M{\\\"u}ller-Eberstein, Max and Saphra, Naomi and Schoelkopf, Hailey and Zuidema, Willem and Biderman, Stella},\n      booktitle={{The Thirteenth International Conference on Learning Representations}},\n      year={2025}\n}\n```\nIf you are interested in citing our training data, training library, or evaluation library you can do so with the following:\n\n```\n@article{gao2020pile,\n  title={The pile: An 800gb dataset of diverse text for language modeling},\n  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},\n  journal={arXiv preprint arXiv:2101.00027},\n  year={2020}\n}\n\n@article{biderman2022datasheet,\n  title={Datasheet for the pile},\n  author={Biderman, Stella and Bicheno, Kieran and Gao, Leo},\n  journal={arXiv preprint arXiv:2201.07311},\n  year={2022}\n}\n\n@software{gpt-neox-library,\n  title = {{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}},\n  author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and 
Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Phang, Jason and Purohit, Shivanshu and Schoelkopf, Hailey and Stander, Dashiell and Songz, Tri and Tigges, Curt and Thérien, Benjamin and Wang, Phil and Weinbach, Samuel},\n  url = {https:\u002F\u002Fwww.github.com\u002Feleutherai\u002Fgpt-neox},\n  doi = {10.5281\u002Fzenodo.5879544},\n  month = {9},\n  year = {2023},\n  version = {2.0.0},\n}\n\n@misc{eval-harness,\n  author       = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},\n  title        = {A framework for few-shot language model evaluation},\n  month        = sep,\n  year         = 2021,\n  publisher    = {Zenodo},\n  version      = {v0.0.1},\n  doi          = {10.5281\u002Fzenodo.5371628},\n  url          = {https:\u002F\u002Fdoi.org\u002F10.5281\u002Fzenodo.5371628}\n}\n```\n\n## License\nThe following license applies to all code in this GitHub repo, as well as the Pythia models and any other copyrightable artifacts contained in this repository.\n\n```\n   Copyright 2024 EleutherAI\n\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License.\n```\n","# Pythia：跨时间和规模的Transformer模型可解释性研究\n\n本仓库是EleutherAI的*Pythia*项目，该项目结合了可解释性分析和规模定律，旨在理解自回归Transformer模型在训练过程中知识的发展与演变。有关模型、训练过程及其特性的详细信息，请参阅我们的论文《Pythia：一个用于分析大规模语言模型训练与规模效应的工具集》（https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.01373）。\n\nPythia工具集的开发初衷是为了支持可解释性、学习动态以及伦理与透明性方面的研究，而现有的模型工具集在这方面存在不足。Pythia工具集的主要特点包括：\n1. 论文中使用的所有模型、数据和代码均已公开发布，确保结果的完全可复现。我们论文中的所有结果均经过至少一个其他实验室的独立验证。\n2. 所有模型在训练过程中保存了154个检查点，便于研究大型语言模型的学习动态。\n3. 
所有模型均基于相同的数据并按相同的顺序进行训练，使研究人员能够探索对训练过程的因果干预。\n\n截至发布时，Pythia是全球唯一满足上述要求的模型工具集。事实上，我们为120亿参数模型发布的154个部分训练检查点，数量超过了当时全球其他机构为所有120亿+参数模型所发布的检查点总数。我们的工作启发了多个类似项目，包括LLM360的[Amber](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06550)和[K2-65B](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.07124)、AI2的[OLMo](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.00838)，以及Zyphra的[BlackMamba](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01771)。\n\n除Pythia工具集本身外，本仓库还作为信息、代码及可复现性说明的中心，涵盖以下论文的相关内容：\n* 大型语言模型中的涌现式与可预测性记忆[[代码](\u002Fpredictable-memorization)] [[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.11158)]\n* PolyPythias：五十次语言模型预训练运行中的稳定性与异常值[[代码](\u002Fpolypythias)] [[论文](https:\u002F\u002Fopenreview.net\u002Fforum?id=bmrYu2Ekdz)]\n\n## 更新日志\n\n[2025年3月10日] 添加了PolyPythias论文的相关信息。\n\n[2024年7月9日] 大幅更新了README文件，包括更完善的历史背景介绍，并展示了众多研究者利用Pythia开展的精彩工作。同时新增了后续训练模型的链接。\n\n[2023年11月2日] 根据部分研究人员的要求，我们增加了1400万和3100万参数的模型。未来计划训练这些模型的去重版本。\n\n[2023年4月3日] 我们发布了所有Pythia模型的新版本，修复了原始工具集中存在的多种不一致之处。具体更改详情请参见《Pythia论文》（https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.01373）附录B。旧版模型（“v0”）仍可在[此处](https:\u002F\u002Fhuggingface.co\u002Fmodels?other=pythia_v0)获取，可能适用于消融实验。\n\n[2023年1月20日] 为与许多其他模型工具集保持一致，并认为该惯例更能反映这些模型的设备端内存占用情况，我们决定将Pythia模型工具集的命名方式调整为同时包含嵌入层和解嵌入层的参数，以计算总参数量。此外，我们还发现由于笔误，其中一款模型的实际参数量小于预期，已用目标参数量的模型进行了替换。更多详情请参见[此处](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-410m-deduped#naming-convention-and-parameter-count)。\n\n## 目录\n\n- [模型](#models)\n  * [多随机种子](#multiple-random-seeds)\n  * [更新日志](#changelog)\n- [使用Pythia](#using-pythia)\n  * [快速入门](#quickstart)\n  * [复现训练](#reproducing-training)\n  * [探索数据集](#exploring-the-dataset)\n  * [Pythia论文复现](#pythia-paper-replication)\n- [基准测试分数](#benchmark-scores)\n- [基于Pythia的研究](#research-building-on-pythia)\n  * [语言模型内部机制](#language-model-internals)\n  * [学习动态](#learning-dynamics)\n  * [训练数据如何决定模型行为](#how-training-data-determines-model-behavior)\n  * [安全、审计与合规研究](#security-auditing-and-compliance-research)\n- [引用信息](#citation-details)\n- [许可证](#license)\n\n## 模型\n\n我们在 Pile 数据集（[论文](https:\u002F\u002Fpile.eleuther.ai\u002F)、[数据表](https:\u002F\u002Farxiv.org\u002Fabs\u002F2201.07311)）以及去重后的 Pile 数据集上训练并发布了 8 种不同规模的模型系列。所有 8 种规模的模型均使用完全相同的数据，并按照完全相同的顺序进行训练。每种模型在训练过程中共处理了约 299,892,736,000 个 token，即接近 3000 亿个 token。对于“标准”模型而言，这相当于在 Pile 数据集中略少于 1 个 epoch；而对于去重后的 Pile 数据集，则约为 1.5 个 epoch（该数据集 1 个 epoch 包含 2070 亿个 token）。所有模型均采用混合精度训练，除 `EleutherAI\u002Fpythia-1b` 使用 bf16 外，其余模型均使用 fp16。这是因为 `pythia-1b` 在 fp16 训练时，于训练后期出现了无法弥合的损失激增。\n\n在首次发布之后，应关注稀疏自编码器扩展的研究人员的要求，我们又训练了 1400 万和 3100 万参数的模型。\n\n| 参数量 | 层数 | 模型维度 \\(d_{\\text{model}}\\) | 注意力头数 \\(n_{\\text{heads}}\\) | 每个注意力头的维度 \\(d_{\\text{head}}\\) | 批量大小 | 学习率 | Hugging Face 检查点                                                |\n| ------ | -------- | ------- | ------- | ------ | ---------- | ------------- | ---------------------------------------------------------- |\n| 14M    | 6        | 128     | 4       | 32     | 2M         | 1.0e-3          | [标准](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-14m)  |\n| 31M    | 6        | 256     | 8       | 32     | 2M         | 1.0e-3          | [标准](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-31m) |\n| 70M    | 6        | 512     | 8       | 64     | 2M         | 1.0e-3          | [标准](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-70m), [去重版](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-70m-deduped)  |\n| 160M   | 12       | 768     | 12      | 64     | 2M         | 6.0e-4          | 
[标准](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-160m), [去重版](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-160m-deduped)|\n| 410M   | 24       | 1024    | 16      | 64     | 2M         | 3.0e-4          | [标准](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-410m), [去重版](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-410m-deduped)|\n| 1B     | 16       | 2048    | 8       | 256    | 2M         | 3.0e-4          | [标准](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-1b), [去重版](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-1b-deduped)    |\n| 1.4B   | 24       | 2048    | 16      | 128    | 2M         | 2.0e-4          | [标准](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-1.4b), [去重版](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-1.4b-deduped)|\n| 2.8B   | 32       | 2560    | 32      | 80     | 2M         | 1.6e-4        | [标准](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-2.8b), [去重版](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-2.8b-deduped)|\n| 6.9B   | 32       | 4096    | 32      | 128    | 2M         | 1.2e-4        | [标准](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-6.9b), [去重版](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-6.9b-deduped)|\n| 12B    | 36       | 5120    | 40      | 128    | 2M         | 1.2e-4        | [标准](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-12b), [去重版](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-12b-deduped)  |\n\n\n为了促进对大型语言模型学习动态的研究，我们为每个模型提供了 154 个检查点，分别对应第 0 步（初始化）、第 1 步、第 2 步、第 4 步、第 8 步、第 16 步、第 32 步、第 64 步、第 128 步、第 256 步、第 512 步、第 1000 步，以及之后每 1000 步的一个检查点。我们还上传了所有模型的预分词数据文件以及用于重建训练期间所见数据加载器的脚本。更多详情请参阅“重现训练”部分。\n\n用于使用 [GPT-NeoX 库](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox) 训练这些模型的配置文件，可在本仓库的 `models\u002F` 目录中找到，也可在 GPT-NeoX 库本身中找到。\n\n我们在最初训练这些模型时犯了一个错误，导致不同运行之间存在一些不一致之处。我们随后重新运行了整个模型系列，修复了这些不一致问题，原始运行则以 `EleutherAI\u002Fpythia-160m-v0` 的名称提供。有关 v0 版本模型与主系列有何不同的详细信息，请参阅 Pythia 论文。\n\n所有模型的损失曲线都包含在我们混乱的 wandb 项目中：[这里](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia)。\n\n模型与 wandb 运行之间的大致对应关系如下：\n| 模型 | Wandb |\n| --------- | --------- |\n| Pythia-2.8b | [链接](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia\u002Fgroups\u002F2.7B%20New_36751euw?workspace=user-schoelkopf) |\n| Pythia-2.8b-deduped | [链接](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia\u002Fgroups\u002F2.7B%20Deduped%20New_1ygfbs9n?workspace=user-schoelkopf) |\n| Pythia-1b | [链接](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia\u002Fgroups\u002F800M%20Pythia_1zw5etef\u002Fworkspace) |\n| Pythia-1.4b | [链接](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia\u002Fgroups\u002FPythia%201.3B_lepj8rtx\u002Fworkspace?workspace=user-schoelkopf) |\n| Pythia-1.4b-deduped | [链接](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia\u002Fgroups\u002F1.3B%20Dedup_10v5wko4\u002Fworkspace?workspace=user-schoelkopf) |\n| Pythia-160m | [链接](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia\u002Fgroups\u002FPythia%20125M_1mpgqyzx\u002Fworkspace?workspace=user-schoelkopf) |\n| Pythia-160m-deduped | [链接](https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia\u002Fgroups\u002F125M%20Dedup_ym78zh5k\u002Fworkspace?workspace=user-schoelkopf) |\n\n### 多个随机种子\n\n用于训练 Pythia 模型的随机种子是 GPT-NeoX 的默认值：1234。为了支持研究随机性如何影响模型行为，我们又使用不同的随机种子训练了更多模型。目前，我们已使用从 1 到 9 的每个随机种子训练并发布了以下模型：\n\n- Pythia 14M\n- Pythia 31M\n- Pythia 70M\n- Pythia 160M\n- Pythia 
410M\n\n所有这些模型均为“标准”版本的 Pythia，而非在去重后的 Pile 数据集上训练的版本。结合最初发布的模型，它们构成了十种使用不同随机种子但其他方面完全相同的变体。您可以在 Hugging Face 上通过命名模式 `https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-[size]-seed[num]` 找到这些模型。例如，`https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-160m-seed7`。请注意，使用随机种子 1234 训练的模型在其 URL 中并未指定种子编号。\n\n跨多个随机种子复制小型 Pythia 模型的运行记录位于：https:\u002F\u002Fwandb.ai\u002Feleutherai\u002Fpythia-extra-seeds\n\n#### 更正\n\n如 [此 issue](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia\u002Fissues\u002F135) 所述，由于配置文件中未指定初始化值，6.9B 和 12B 模型意外地使用了不同的初始化方法。\n\n## 使用 Pythia\n\n### 快速入门\n\n所有 Pythia 模型都托管在 [Hugging Face 模型库](https:\u002F\u002Fhuggingface.co\u002FEleutherAI) 上。可以通过以下代码加载并使用它们（以 3000 步的 `pythia-70M-deduped` 模型检查点为例）：\n\n```python\nfrom transformers import GPTNeoXForCausalLM, AutoTokenizer\n\nmodel = GPTNeoXForCausalLM.from_pretrained(\n  \"EleutherAI\u002Fpythia-70m-deduped\",\n  revision=\"step3000\",\n  cache_dir=\".\u002Fpythia-70m-deduped\u002Fstep3000\",\n)\n\ntokenizer = AutoTokenizer.from_pretrained(\n  \"EleutherAI\u002Fpythia-70m-deduped\",\n  revision=\"step3000\",\n  cache_dir=\".\u002Fpythia-70m-deduped\u002Fstep3000\",\n)\n\ninputs = tokenizer(\"Hello, I am\", return_tensors=\"pt\")\ntokens = model.generate(**inputs)\nprint(tokenizer.decode(tokens[0]))\n```\n\n所有模型均以 2,097,152 个标记为批次大小，训练了相当于 143,000 步的内容。分支 `step143000` 完全对应于每个模型在 `main` 分支上的模型检查点。\n\n此外，我们还以 [GPT-NeoX 库](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox) 接受的格式提供了所有模型检查点，最终步数的检查点及优化器状态可从 Hugging Face 模型库的 `EleutherAI\u002Fneox-ckpt-pythia-xxx-deduped-v1` 下载。但由于优化器状态文件体积较大且预计需求较低，我们并未大规模提供所有步骤的优化器状态。如果您希望在 GPT-NeoX 代码库中使用中间模型进行分析，或需要其他步骤的优化器状态，请发送邮件至 hailey@eleuther.ai 和 stella@eleuther.ai。\n\n> ❗ Hugging Face 上的 `pythia-{size}-v0` 模型（尺寸为 `160m、410m、1.4b`）以 400 万个标记为批次大小，训练了 71,500 步，每 500 步保存一次检查点。这些 v0 模型在 Hugging Face 上的步数名称已更改为与所有 200 万批次模型保持一致，因此 `pythia-1.4b-v0` 中标记为 `step1000` 的模型检查点实际上对应的是第 500 步，但其处理的标记数量与其他 `step1000` 检查点相同。\n\n### 重现训练\n\n_（由 @BaruchG 提供的扩展重现说明）。_\n\n我们提供了用于复现训练过程的训练数据。[GPT-NeoX 库](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox) 需要预先分词的训练数据，形式为两个内存映射的 NumPy 数组：`.bin` 和 `.idx` 文件。我们通过 Hugging Face 模型库提供这些文件。要下载并使用去重后的 Pile 训练数据：\n```bash\ngit lfs clone https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FEleutherAI\u002Fpythia_deduped_pile_idxmaps\n\n# 可选：为确保文件未损坏\npython utils\u002Fchecksum_shards.py\n\npython utils\u002Funshard_memmap.py --input_file .\u002Fpythia_deduped_pile_idxmaps\u002Fpile_0.87_deduped_text_document-00000-of-00082.bin --num_shards 83 --output_dir .\u002Fpythia_pile_idxmaps\u002F\n\n# 整个文件的正确 sha256 值为 0cd548efd15974d5cca78f9baddbd59220ca675535dcfc0c350087c79f504693\n\n# 可以使用 sha256sum .\u002Fpythia_pile_idxmaps\u002F* 来校验\n``` \n这将需要超过一天的时间来运行，但应该不会占用超过 5 GB 的内存。我们建议下载此数据集，而不是从头开始对 Pile 数据重新分词，以确保保留 Pythia 模型所见的数据顺序。除了训练数据之外，您还需要本地复制我们用于训练模型的分词器。您可以[在这里](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia\u002Fblob\u002Fmain\u002Futils\u002F20B_tokenizer.json)找到它。\n\n接下来，您需要设置训练环境：\n```\ngit clone https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox.git\ncd gpt-neox\ngit checkout v1.0\npip install -r requirements\u002Frequirements-flashattention.txt\nwget https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia\u002Fblob\u002Fmain\u002Fmodels\u002F160M\u002Fpythia-160m-deduped.yml\ndocker build -t pythia:latest .\n```\n容器构建完成后，使用以下命令运行容器（从 GPT-NeoX 仓库的根目录启动，确保您的 pythia yaml 文件在该目录下可访问）：\n```\ndocker run --runtime=nvidia --rm -it -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 --shm-size=1g --ulimit memlock=-1 --mount 
type=bind,src=$PWD,dst=\u002Fgpt-neox -v $(pwd):\u002Fworkspace\u002F pythia:latest bash\n```\n如果数据集和 YAML 文件无法在 Docker 容器内访问，可以使用 `-v` 参数挂载更多卷。\n\n请按如下方式修改数据路径和分词器路径：\n```\n  \"train-data-paths\": [\"\u002Ffsx\u002Fpile\u002Fpile_20B_tokenizer_text_document\"], # 指向步骤 1 中生成的包含 .bin 和 .idx 文件的文件夹\n  \"valid-data-paths\": [\"\u002Ffsx\u002Fpile\u002Fpile_20B_tokenizer_text_document\"], # 指向步骤 1 中生成的包含 .bin 和 .idx 文件的文件夹\n  \"test-data-paths\": [\"\u002Ffsx\u002Fpile\u002Fpile_20B_tokenizer_text_document\"], # 指向步骤 1 中生成的包含 .bin 和 .idx 文件的文件夹\n\n  \"tokenizer-type\": \"HFTokenizer\",\n  \"vocab-file\": \"\u002Ffsx\u002Fpile\u002F20B_tokenizer.json\", # 指向步骤 2 中获取的分词器\n```\n根据您可用的显存大小，可能需要调整批量大小。总批量大小的计算公式为 `Total GPUs * train_micro_batch_size_per_gpu * gradient_accumulation_steps \u002F (pipe-parallel-size * model-parallel-size)`，需保持为 1024，以匹配 Pythia 的训练批量大小。因此您需要：\n```\n   \"train_micro_batch_size_per_gpu\": XXX, # 设置一个适合您 GPU 显存的值\n   \"gradient_accumulation_steps\": 1, # 调整此值以使总批量大小达到 1024。\n```\n如果您希望保存模型权重，也可以在 YAML 文件中添加相关配置。例如，要将模型保存到 checkpoints 文件夹，可以在文件末尾添加：\n```\n  \"launcher\": \"slurm\",\n  \"deepspeed_slurm\": false,\n\n  \"save\": \"checkpoints\",\n  \"load\": \"checkpoints\",\n  \"checkpoint_validation_with_forward_pass\": False,\n}\n```\n请确保路径是 Docker 容器内的路径；如果希望权重能够持久化，请确保它们可以从容器外部访问，例如放在 \u002Fworkspace\u002F 目录下。\n\n现在您可以开始训练模型了，运行以下命令：\n```\npython deepy.py train.py pythia-160m-deduped.yml  2>&1 | tee output.txt\n```\n输出将被保存到 output.txt 文件中，如果您不需要保存输出，可以删除结尾部分。\n\n为了将您的模型转换为 Hugging Face `transformers` 格式，可以使用 GPT-NeoX 库中的脚本 `tools\u002Fconvert_to_hf.py`。您可能需要在文件顶部添加 `from typing import List`，并将[此处](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fgpt-neox\u002Fblob\u002F71df4d5017f9f4919566a11454fe3a507ffdc632\u002Ftools\u002Fconvert_to_hf.py#L44)的类型从 `list[torch.Tensor]` 改为 `List[torch.Tensor]`。然后您可以运行以下命令来转换第 143000 步的权重：\n```\npython tools\u002Fconvert_to_hf.py --input_dir checkpoints\u002Fglobal_step143000\u002F --config_file checkpoints2\u002Fglobal_step 143000\u002Fconfigs\u002Fpythia-70m.yml --output_dir .\u002Foutput\u002F \n```\n这将生成一个与 https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-70m-deduped\u002Ftree\u002Fmain 类似的文件结构。\n\n> ❗ 有时人们会发现最终得到的分词器并不正确，而我们目前尚无法调试出原因。如果您的 `tokenizer_config.json` 与[这里](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-70m-deduped\u002Fblob\u002Fmain\u002Ftokenizer_config.json)的不一致，且 `special_tokens_map.json` 也与[这里](https:\u002F\u002Fhuggingface.co\u002FEleutherAI\u002Fpythia-70m-deduped\u002Fblob\u002Fmain\u002Fspecial_tokens_map.json)的不同，则可能需要将其替换为 Hugging Face 上的版本。\n\n要使用我们的评估库进行评估，请安装[这里的容器](https:\u002F\u002Fhub.docker.com\u002Fr\u002Fhuggingface\u002Ftransformers-pytorch-gpu\u002Ftags)（已测试过 4.28 和 4.29 版本）。设置好 Docker 容器后，执行以下操作：\n```\ngit clone https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\ncd lm-evaluation-harness\npip install -e .\n```\n如 Harness 仓库中的说明所示。随后，您就可以通过指向您的模型权重（应在容器内）来运行基准测试，命令类似于：\n```\npython3 main.py --model hf-causal-experimental  --model_args pretrained=..\u002Fgpt-neox\u002Foutput\u002F --tasks lambada_openai,piqa,winogrande,arc_easy,sciq,wikitext --device cuda:0\n```\n\n### 数据集探索\n\n我们提供了一个工具，用于查看所有模型在训练过程中使用的训练数据加载器中的特定部分，该工具位于 `utils\u002Fbatch_viewer.py`。\n\n首先，我们需要克隆 Pythia 仓库：\n```\ngit clone https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia\n```\n接下来，我们必须安装依赖项：\n```\npip install torch==1.13.0+cu117 -f https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Ftorch\u002F\npip install numpy tqdm huggingface_hub\n```\n\n然后，我们需要下载相应的数据集。我们提供了重复和去重后的 
Pile 数据集的预打乱版本。可以使用 Hugging Face 的工具按如下方式下载合适的版本：\n\n> 提示：请确保将 `path\u002Fto\u002F*` 替换为你打算保存从 Hugging Face 下载的数据集的适当路径。\n- 要下载标准版本，使用以下代码：\n  ```py\n  from huggingface_hub import hf_hub_download\n  hf_hub_download(repo_id=\"EleutherAI\u002Fpile-standard-pythia-preshuffled\", repo_type=\"dataset\", cache_dir=\"path\u002Fto\u002Flocal\u002Ffolder\")\n  ```\n- 要下载去重版本，使用以下代码：\n  ```py\n  from huggingface_hub import hf_hub_download\n  hf_hub_download(repo_id=\"EleutherAI\u002Fpile-deduped-pythia-preshuffled\", repo_type=\"dataset\", cache_dir=\"path\u002Fto\u002Flocal\u002Ffolder\")\n  ```\n\n现在，你可以使用脚本 `utils\u002Funshard_memmap.py` 来合并文件：\n\n```sh\npython3 utils\u002Funshard_memmap.py --input_file \"path\u002Fto\u002Flocal\u002Ffolder\u002Fdocument-00000-of-00020.bin\" --num_shards 21 --output_dir \"path\u002Fto\u002Fmerged\u002Ffolder\u002F\"\n```\n\n同时，别忘了将索引文件复制到合并后的文件夹中，使用以下命令：\n```sh\ncp path\u002Fto\u002Flocal\u002Ffolder\u002Fdocument.idx path\u002Fto\u002Fmerged\u002Ffolder\u002Fdocument.idx\n```\n\n现在，我们已经准备好运行 `utils\u002Fbatch_viewer.py` 了！\n\n```sh\npython3 utils\u002Fbatch_viewer.py \\\n  --start_iteration 0 \\\n  --end_iteration 1000 \\\n  --load_path path\u002Fto\u002Fmerged\u002Ffolder\u002Fdocument \\\n  --save_path path\u002Fto\u002Fsave\u002Ffolder\u002F \\\n  --conf_dir utils\u002Fdummy_config.yml \n```\n\n这将会保存一个单独的文件，其中包含所有的索引，以 NumPy 数组的形式存储。\n\n之后，你可以使用 NumPy 加载这些索引：\n\n```py\nimport numpy as np\n\nindicies = np.load(\"path\u002Fto\u002Fsave\u002Ffolder\u002Findicies.npy\")\n```\n\n这些索引包含了大小为 (None, 2049) 的整数标记序列，其中每个整数对应一个唯一的标记索引。\n请注意，文档是被连接在一起的，并且由一个 `EOD` 标记分隔。因此，每个样本或批次可能并不以 `EOD` 标记开始。在训练过程中，目标标记会被左移一位。因此，对于一个序列长度为 2048 的模型来说，需要 2049 长度的序列来进行训练（更多信息请参阅 [此评论](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Fpythia\u002Fissues\u002F123#issuecomment-1791136253)）。\n\n### Pythia 论文复现\n\n我们为那些有兴趣复现 Pythia 系列论文中案例研究的人，在本仓库的 `case-studies\u002F` 文件夹中提供了进一步的信息。\n\n### 基准测试分数\n\n我们还提供了在多种 NLP 数据集上的零样本和五样本基准测试结果：\n\n- ARC-challenge (`arc_challenge`)\n- ARC-easy (`arc_easy`)\n- BLiMP (`blimp_*`)\n- Lambada (`lambada_openai`)\n- LogiQA (`logiqa`)\n- MMLU (`hendrycksTest*`)\n- PiQA (`piqa`)\n- SciQ (`sciq`)\n- Wikitext (`wikitext`)\n- Winogrande (`winogrande`)\n- WSC (`wsc`)\n\n评估是在 GPT-NeoX 中使用 [LM Evaluation Harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) 进行的，并且可以在本仓库的 `evals\u002Fpythia-v1\u002F*\u002F*` 目录下按模型和步骤查看。**警告：** 所有评估都是在语言模型评估工具包的“待办”提交版本上进行的，距今已有近一年时间，因此可能无法用当前版本重现。\n\n## 基于 Pythia 的研究\n\n我们开展 Pythia 项目的主要目的是为了促进 EleutherAI 及更广泛社区内关于可解释性、学习动态等主题的研究。在此，我们记录了一些使用我们模型的研究论文，重点介绍了那些因 Pythia 系列而得以实现，而在其他组织发布的模型上则难以甚至不可能完成的工作。如需查看更多引用 Pythia 的论文，请参阅 [此处](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FPythia%3A-A-Suite-for-Analyzing-Large-Language-Models-Biderman-Schoelkopf\u002Fbe55e8ec4213868db08f2c3168ae666001bea4b8#citing-papers)。\n\n### 语言模型内部机制\n\n- Belrose 等人：“[通过调优后的镜头从 Transformer 模型中提取潜在预测](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.08112)”。_arXiv 预印本 arXiv:2303.08112_（2023年）。**EleutherAI 论文**\n- Brown 等人：“[通过表示差异性理解语言模型的内部运作](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.14993)”。_自然语言处理经验方法会议_（2023年）。\n- Feng 和 Steinhardt：“[语言模型如何在上下文中绑定实体？](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.17191)”。_国际学习表征会议_（2023年）。\n- Garde、Kran 和 Barez：“[DeepDecipher：访问并研究大型语言模型中的神经元激活情况](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.01870)”。_arXiv 预印本 arXiv:2310.01870_（2023年）。\n- Gurnee 等人：“[在干草堆中寻找神经元：稀疏探测案例研究](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.01610)”。_机器学习研究汇刊_（2023年）。\n- Stolfo、Belinkov 和 
### Pythia Paper Replication

For those interested in replicating the case studies from the Pythia suite paper, we provide further information in the `case-studies/` folder of this repository.

### Benchmark Scores

We also provide zero- and five-shot benchmark results on a variety of NLP datasets:

- ARC-challenge (`arc_challenge`)
- ARC-easy (`arc_easy`)
- BLiMP (`blimp_*`)
- Lambada (`lambada_openai`)
- LogiQA (`logiqa`)
- MMLU (`hendrycksTest*`)
- PiQA (`piqa`)
- SciQ (`sciq`)
- Wikitext (`wikitext`)
- Winogrande (`winogrande`)
- WSC (`wsc`)

Evaluations were performed in GPT-NeoX using the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), and can be browsed by model and step under `evals/pythia-v1/*/*` in this repository. **Warning:** all evaluations were run on a commit of the LM Evaluation Harness that is now almost a year old, so they may not be reproducible with the current version.

## Research Building on Pythia

Our primary goal with the Pythia project is to enable research on interpretability, learning dynamics, and related topics, both at EleutherAI and in the broader community. Here we document research papers that use our models, highlighting work that was made possible by the Pythia suite and that would have been difficult or impossible with models released by other organizations. For more papers citing Pythia, see [here](https://www.semanticscholar.org/paper/Pythia%3A-A-Suite-for-Analyzing-Large-Language-Models-Biderman-Schoelkopf/be55e8ec4213868db08f2c3168ae666001bea4b8#citing-papers).

### Language Model Internals

- Belrose et al. "[Eliciting Latent Predictions from Transformers with the Tuned Lens](https://arxiv.org/abs/2303.08112)". _arXiv preprint arXiv:2303.08112_ (2023). **EleutherAI paper**
- Brown et al. "[Understanding the Inner Workings of Language Models Through Representation Dissimilarity](https://arxiv.org/abs/2310.14993)". _Empirical Methods in Natural Language Processing_ (2023).
- Feng and Steinhardt. "[How Do Language Models Bind Entities in Context?](https://arxiv.org/abs/2310.17191)". _International Conference on Learning Representations_ (2023).
- Garde, Kran, and Barez. "[DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models](https://arxiv.org/abs/2310.01870)". _arXiv preprint arXiv:2310.01870_ (2023).
- Gurnee et al. "[Finding Neurons in a Haystack: Case Studies with Sparse Probing](https://arxiv.org/abs/2305.01610)". _Transactions on Machine Learning Research_ (2023).
- Stolfo, Belinkov, and Sachan. "[A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis](https://arxiv.org/abs/2305.15054)". _Empirical Methods in Natural Language Processing_ (2023).

### Learning Dynamics

- Gupta et al. "Continual Pre-Training of Large Language Models: How to (Re)warm Your Model?" (2023). _Workshop on Efficient Systems for Foundation Models @ ICML_.
- Michaelov and Bergen. "Emergent Inabilities? Inverse Scaling Over the Course of Pretraining" (2023). _Findings of the Association for Computational Linguistics: EMNLP_.
- Sanyal et al. "Understanding the Effectiveness of Early Weight Averaging for Training Large Language Models" (2023). _arXiv preprint arXiv:2306.03241_.
- Tian et al. "JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention" (2023). _arXiv preprint arXiv:2310.00535_.
- Ye et al. "Language Versatilists vs. Specialists: An Empirical Revisiting on Multilingual Transfer Ability" (2023). _arXiv preprint arXiv:2306.06688_.
- Belrose et al. "Neural Networks Learn Statistics of Increasing Complexity" (2024). _International Conference on Learning Representations_. **EleutherAI paper**
- Godey et al. "Why Do Small Language Models Underperform? Studying Language Model Saturation via the Softmax Bottleneck" (2024). _arXiv preprint arXiv:2404.07647_.
- Singh et al. "Hallmarks of Optimization Trajectories in Neural Networks: Directional Exploration and Redundancy" (2024). _arXiv preprint arXiv:2403.07379_.
- Tigges et al. "LLM Circuit Analyses Are Consistent Across Training and Scale" (2024). _ICML Workshop on Mechanistic Interpretability_. **EleutherAI paper**
- Diehl Martinez et al. "Tending Towards Stability: Convergence Challenges in Small Language Models" (2024). _Findings of the Association for Computational Linguistics: EMNLP_.

### How Training Data Determines Model Behavior

- Roger. "Large Language Models Sometimes Generate Purely Negatively-Reinforced Text" (2023). _arXiv preprint arXiv:2306.07567_.
- Oh et al. "Frequency Explains the Inverse Correlation of Large Language Models' Size, Training Data Amount, and Surprisal's Fit to Reading Times" (2024). _arXiv preprint arXiv:2402.02255_.
- Liu et al. "On Training Data Influence of GPT Models" (2024). _arXiv preprint arXiv:2404.07840_.
- Lesci et al. "Causal Estimation of Memorisation Profiles" (2024). _Association for Computational Linguistics_.

### Security, Auditing, and Compliance Research

- Ippolito et al. "Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System" (2023). _International Natural Language Generation Conference_.
- Biderman et al. "Emergent and Predictable Memorization in Large Language Models" (2023). _Neural Information Processing Systems_. **EleutherAI paper**
- Choi, Shavit, and Duvenaud. "Tools for Verifying Neural Models' Training Data" (2023). _Neural Information Processing Systems_.
- Li et al. "MoPe: Model Perturbation-based Privacy Attacks on Language Models" (2023). _Empirical Methods in Natural Language Processing_.
- Min et al. "SILO Language Models: Isolating Legal Risk in a Nonparametric Datastore" (2024). _International Conference on Learning Representations_.
- Pawelczyk et al. "Machine Unlearning Fails to Remove Data Poisoning Attacks" (2024). _arXiv preprint arXiv:2406.17216_.
- Prashanth et al. "Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon" (2024). _arXiv preprint arXiv:2406.17746_. **EleutherAI paper**
- Duan et al. "Do Membership Inference Attacks Work on Large Language Models?" (2024). _Conference on Language Modeling_.

## Citation Details

If you use the Pythia models in your research, please cite our paper:

```
@inproceedings{biderman2023pythia,
  title={Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling},
  author={Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin Gregory and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth, USVSN Sai and Raff, Edward and others},
  booktitle={International Conference on Machine Learning},
  pages={2397--2430},
  year={2023},
  organization={PMLR}
}
```

If you use data or results from other papers in this repository, please cite the corresponding papers. Citation information can be found in the respective READMEs and, for convenience, is reproduced here:
```
@inproceedings{biderman2023emergent,
      title={Emergent and Predictable Memorization in Large Language Models},
      author={Biderman, Stella and Prashanth, USVSN Sai and Sutawika, Lintang and Schoelkopf, Hailey and Anthony, Quentin and Purohit, Shivanshu and Raff, Edward},
      booktitle={Advances in Neural Information Processing Systems},
      year={2023}
}

@inproceedings{van2025polypythias,
      title={PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs},
      author={van der Wal, Oskar and Lesci, Pietro and M{\"u}ller-Eberstein, Max and Saphra, Naomi and Schoelkopf, Hailey and Zuidema, Willem and Biderman, Stella},
      booktitle={{The Thirteenth International Conference on Learning Representations}},
      year={2025}
}
```

If you would like to cite our training data, training library, or evaluation library, you can use the following citations:

```
@article{gao2020pile,
  title={The Pile: An 800GB Dataset of Diverse Text for Language Modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}

@article{biderman2022datasheet,
  title={Datasheet for the Pile},
  author={Biderman, Stella and Bicheno, Kieran and Gao, Leo},
  journal={arXiv preprint arXiv:2201.07311},
  year={2022}
}

@software{gpt-neox-library,
  title = {{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}},
  author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Phang, Jason and Purohit,
Shivanshu and Schoelkopf, Hailey and Stander, Dashiell and Songz, Tri and Tigges, Curt and Thérien, Benjamin and Wang, Phil and Weinbach, Samuel},
  url = {https://www.github.com/eleutherai/gpt-neox},
  doi = {10.5281/zenodo.5879544},
  month = {9},
  year = {2023},
  version = {2.0.0},
}

@misc{eval-harness,
  author       = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = sep,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {v0.0.1},
  doi          = {10.5281/zenodo.5371628},
  url          = {https://doi.org/10.5281/zenodo.5371628}
}
```

## License
The following license applies to all code in this GitHub repository, as well as to the Pythia models and any other copyrightable artifacts in this repository.

```
   Copyright 2024 EleutherAI

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
```
","# Pythia Quickstart Guide

Pythia is EleutherAI's open-source family of autoregressive Transformer models, built to study, through interpretability analysis and scaling laws, how knowledge develops and evolves while large language models train. The suite covers model sizes from 14M to 12B parameters, and every model comes with 154 saved training checkpoints, supporting fine-grained study of learning dynamics.

## Environment Setup

### System Requirements
- **OS**: Linux (recommended), macOS, Windows
- **Python**: 3.8 or later
- **GPU**: A CUDA-capable NVIDIA GPU is recommended (memory requirements depend on model size; the 70M/160M models run on consumer GPUs, while the 12B model needs multiple GPUs or a high-memory setup)

### Prerequisites
The main dependencies are the `transformers` and `torch` libraries. It is recommended to upgrade pip first and then install the base deep-learning framework.

```bash
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

> **Note**: the command above uses the official PyTorch CUDA 11.8 index. For other CUDA versions or a CPU-only build, visit the [PyTorch website](https://pytorch.org/) for the matching command. Users in mainland China can speed up installation with the Tsinghua mirror:
> ```bash
> pip install torch torchvision torchaudio --index-url https://pypi.tuna.tsinghua.edu.cn/simple
> ```

## Installation

Install the Hugging Face `transformers` library to load and use the Pythia models.

```bash
pip install transformers accelerate
```

**Mirror option for mainland China**:
Install from the Tsinghua University open-source mirror for faster downloads:

```bash
pip install transformers accelerate -i https://pypi.tuna.tsinghua.edu.cn/simple
```

## Basic Usage

All Pythia models are hosted on the Hugging Face Hub. You can easily load a model at a specific training step (checkpoint) for inference or analysis.

The following example loads the `pythia-70m-deduped` checkpoint from step 3000 and generates text:

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the model and tokenizer, pinning `revision` to a specific training step
model = GPTNeoXForCausalLM.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

tokenizer = AutoTokenizer.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

# Prepare the input and generate text
inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
print(tokenizer.decode(tokens[0]))
```

### Key Notes
- **Model selection**: replace `"EleutherAI/pythia-70m-deduped"` with any other size (e.g. `pythia-1b`, `pythia-2.8b`).
- **Checkpoint selection**: `revision="step3000"` can be changed to any available checkpoint step (e.g. `step0`, `step1000`, `step143000`). `step143000` corresponds to the final trained model and is equivalent to the `main` branch.
- **Deduplicated variants**: prefer the models with the `-deduped` suffix; they were trained on the deduplicated Pile and are generally more stable.
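Because every checkpoint is exposed as a Hugging Face revision, a first learning-dynamics experiment is simply a matter of loading the same model at two different steps and comparing its behavior on the same text. A minimal sketch (the two steps and the prompt are arbitrary illustrative choices):

```python
import torch
from transformers import GPTNeoXForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-deduped")
inputs = tokenizer("The capital of France is", return_tensors="pt")

# Compare language-modeling loss on the same text early and late in training.
for step in ["step1000", "step143000"]:
    model = GPTNeoXForCausalLM.from_pretrained(
        "EleutherAI/pythia-70m-deduped", revision=step
    )
    model.eval()
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    print(f"{step}: loss = {loss.item():.3f}")
```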
### Available Models
| Parameters | Standard HuggingFace ID | Deduplicated HuggingFace ID |
| :--- | :--- | :--- |
| 70M | `EleutherAI/pythia-70m` | `EleutherAI/pythia-70m-deduped` |
| 160M | `EleutherAI/pythia-160m` | `EleutherAI/pythia-160m-deduped` |
| 410M | `EleutherAI/pythia-410m` | `EleutherAI/pythia-410m-deduped` |
| 1B | `EleutherAI/pythia-1b` | `EleutherAI/pythia-1b-deduped` |
| 1.4B | `EleutherAI/pythia-1.4b` | `EleutherAI/pythia-1.4b-deduped` |
| 2.8B | `EleutherAI/pythia-2.8b` | `EleutherAI/pythia-2.8b-deduped` |
| 6.9B | `EleutherAI/pythia-6.9b` | `EleutherAI/pythia-6.9b-deduped` |
| 12B | `EleutherAI/pythia-12b` | `EleutherAI/pythia-12b-deduped` |
","An AI safety research team is investigating when, during training, large language models begin to rote-memorize sensitive data, so that interventions against privacy leaks can be designed.

### Without pythia
- Researchers can only access a model's final version, like seeing just the ending of a film with no way to scrub backwards, so the specific training stage where memorization forms cannot be located.
- Without control models trained on identical data in an identical order, it is hard to tell whether a difference stems from the data itself or from random initialization, making causal inference all but impossible.
- Reproducing others' papers on learning dynamics is extremely difficult, because no one had previously released such densely spaced intermediate checkpoints, so experimental results could not be independently verified.
- Research on models at the ten-billion-parameter scale stalls, because the total number of intermediate checkpoints available for analysis worldwide is smaller than what a single pythia model provides.

### With pythia
- The team uses the 154 consecutive checkpoints pythia provides to precisely chart the transition curve from learning patterns to rote memorization, pinning down the best window for intervention.
- Using groups of differently sized models trained on the same data in the same order, the researchers run causal intervention experiments that cleanly demonstrate the direct contribution of specific data fragments to emergent memorization.
- With fully open-source code, data, and weights, the team reproduces a top-conference result within a week and quickly iterates a new defense algorithm on top of it.
- With intermediate training states at unprecedented granularity, the team completes the first full-lifecycle interpretability analysis of a 12B-parameter model, filling a gap in the field.

By providing fully transparent, high-density snapshots of the training process, pythia turns large-model research from black-box guesswork into a science of precise observation and intervention.","https://oss.gittoolsai.com/images/EleutherAI_pythia_f12aa1bc.png","EleutherAI","https://oss.gittoolsai.com/avatars/EleutherAI_cadf9bbb.png","",null,"contact@eleuther.ai","AIEleuther","www.eleuther.ai","https://github.com/EleutherAI",[85,89,93],{"name":86,"color":87,"percentage":88},"Jupyter Notebook","#DA5B0B",98.4,{"name":90,"color":91,"percentage":92},"Python","#3572A5",1.5,{"name":94,"color":95,"percentage":96},"Shell","#89e051",0,2763,210,"2026-04-05T13:43:22","Apache-2.0",4,"Not specified","Training requires multiple NVIDIA GPUs with mixed-precision support (fp16/bf16); inference relies on the transformers library, with VRAM needs depending on model size (14M-12B); 24GB+ VRAM is recommended for large models such as the 12B","Not specified (training requires large amounts of memory to support the 2M-token batch size)",{"notes":106,"python":102,"dependencies":107},"This project primarily provides pretrained model weights for research rather than a standalone training-script package. The models are built on the GPT-NeoX architecture and can be loaded directly with the Hugging Face transformers library. The raw training data comes from The Pile. All models ship with 154 checkpoints for studying learning dynamics. Some larger models (e.g. the 1B) require bf16 precision to avoid loss spikes.",[108,109,110],"transformers","GPT-NeoX","torch",[15,37],"2026-03-27T02:49:30.150509","2026-04-06T14:01:40.549352",[115,120,125,130,135,139],{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},19179,"Why is the Pythia training sequence length 2049 rather than 2048?","The sequence length is 2049 because it accounts for the end-of-document (EOD) token. The standard (non-deduplicated) Pythia models were trained with an EOD token, while most deduplicated (deduped) models were trained without one, with a single exception: the Pythia 2.8B deduped model was trained with the EOD token, and using it without one degrades performance significantly in some settings. In addition, the 14M and 31M models may have had a data-loading pipeline issue during training that causes anomalies when predicting the first token of a document.","https://github.com/EleutherAI/pythia/issues/123"
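The shift-by-one setup described in this answer is easy to make concrete. A minimal NumPy illustration, with a toy array standing in for a real 2049-token training sample:

```python
import numpy as np

# A toy stand-in for one stored training sample of length 2049.
sample = np.arange(2049)

# The model consumes 2048 input tokens and predicts the next token at
# each position, so the targets are the same sequence shifted by one.
inputs, targets = sample[:-1], sample[1:]
assert inputs.shape == targets.shape == (2048,)
```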
,{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},19180,"How do I reproduce the Pythia evaluation results? Why do my numbers differ from the official ones?","The maintainers have confirmed that the evaluation results in the repository are correct. If your reproduction diverges noticeably from the official numbers (particularly on the Lambada OpenAI task, for example at the step143000 checkpoint), this may be due to instability in the evaluation task itself, or to a degree of 'catastrophic forgetting' late in training (after step 100,000). It is recommended to validate with the official evaluation scripts and configurations, making sure the checkpoint revision and evaluation task settings match the official setup.","https://github.com/EleutherAI/pythia/issues/118",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},19181,"Where are the preprocessing scripts used to generate the Pythia training data (the .bin and .idx files)?","The project only provides the preprocessed data; the original preprocessing scripts that produced the .bin and .idx files were not released. Note that the datasets on Hugging Face (such as EleutherAI/the_pile_deduplicated) do not share the same shuffle order as the LFS pre-tokenized dataset used to train Pythia. If you need full control over the data pipeline, consider processing the FineWeb dataset from scratch; if you must use the Pile with specific EOD-token handling, the only option is to tokenize and preprocess the data from scratch.","https://github.com/EleutherAI/pythia/issues/69",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},19182,"Does data memorized early in training predict that it will also be memorized later? Do models forget content they memorized early on?","Yes, correlation heatmaps show a positive correlation between early and late memorization. However, tracking shows that models of all sizes tend to forget some of what they memorized at earlier checkpoints. As training proceeds, growth in total memorization comes mainly from newly memorized data points, while data points memorized early show a slight tendency to be forgotten. The relevant analysis code and notebooks have been committed to the `analysis/memorization` directory for reference.","https://github.com/EleutherAI/pythia/issues/29",{"id":136,"question_zh":137,"answer_zh":138,"source_url":119},19183,"Were the deduplicated Pythia models trained with an EOD token?","The vast majority of deduplicated (deduped) Pythia models were trained without an EOD token. There is one important exception: the Pythia 2.8B deduped model was trained with the EOD token, so if that particular model is not given an EOD token at inference time, its performance degrades significantly under some conditions. For the other deduped models, adding an EOD token is generally unnecessary.",{"id":140,"question_zh":141,"answer_zh":142,"source_url":119},19184,"What datasets and advice are there for retraining a Pythia-like model?","With a sufficient budget, consider a Llama-style architecture (with GQA, rotary position embeddings, and similar features) trained on the FineWeb dataset. On a limited budget, consider a sample of the Minipile or FineWeb-Edu datasets (roughly 10 billion tokens). Both options let you process the dataset from scratch and retain full control. If you insist on using the Pile and need EOD tokens included, you must tokenize the dataset from scratch, because the existing pre-tokenized versions may not match your specific requirements.",[]]