[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-princeton-nlp--SimPO":3,"tool-princeton-nlp--SimPO":65},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",160784,2,"2026-04-19T11:32:54",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":10,"last_commit_at":33,"category_tags":34,"status":16},8553,"spec-kit","github\u002Fspec-kit","Spec Kit 是一款专为提升软件开发效率而设计的开源工具包，旨在帮助团队快速落地“规格驱动开发”（Spec-Driven Development）模式。传统开发中，需求文档往往与代码实现脱节，导致沟通成本高且结果不可控；而 Spec Kit 通过将规格说明书转化为可执行的指令，让 AI 直接依据明确的业务场景生成高质量代码，从而减少从零开始的随意编码，确保产出结果的可预测性。\n\n该工具特别适合希望利用 AI 辅助编程的开发者、技术负责人及初创团队。无论是启动全新项目还是在现有工程中引入规范化流程，用户只需通过简单的命令行操作，即可初始化项目并集成主流的 AI 编程助手。其核心技术亮点在于“规格即代码”的理念，支持社区扩展与预设模板，允许用户根据特定技术栈定制开发流程。此外，Spec Kit 强调官方维护的安全性，提供稳定的版本管理，帮助开发者在享受 AI 红利的同时，依然牢牢掌握架构设计的主动权，真正实现从“凭感觉写代码”到“按规格建系统”的转变。",88749,"2026-04-17T09:48:14",[15,26,14,13],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":10,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 
多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85267,"2026-04-18T11:00:28",[26,51,52,53,14,54,15,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":62,"last_commit_at":63,"category_tags":64,"status":16},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为“NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[15,51,54],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":80,"owner_twitter":80,"owner_website":81,"owner_url":82,"languages":83,"stars":92,"forks":93,"last_commit_at":94,"license":95,"difficulty_score":23,"env_os":96,"env_gpu":97,"env_ram":96,"env_deps":98,"category_tags":107,"github_topics":108,"view_count":10,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":113,"updated_at":114,"faqs":115,"releases":146},9715,"princeton-nlp\u002FSimPO","SimPO","[NeurIPS 2024] SimPO: Simple Preference Optimization with a Reference-Free Reward","SimPO 是一款面向大语言模型对齐的高效开源算法，旨在通过更简洁的方式优化模型对人类偏好的遵循能力。它主要解决了现有主流方法（如 DPO）依赖参考模型导致计算资源消耗大、训练流程复杂的问题。SimPO 创新性地提出了一种“无参考奖励”机制，在无需额外参考模型的情况下，直接利用偏好数据进行优化，从而显著降低了显存占用和训练成本。\n\n该工具特别适合 AI 研究人员和开发者使用，尤其是那些希望在有限算力下复现前沿对齐效果，或需要将大模型快速适配到特定垂直领域的团队。其核心技术亮点在于极简的架构设计与卓越的性能表现：在 AlpacaEval 2、MT-Bench 等多个权威基准测试中，SimPO 的表现均超越了 DPO 及其最新变体，甚至曾登顶排行榜首位。此外，项目提供了完善的训练脚本、超参数调优建议及基于 Llama3 和 Gemma2 的预训练模型检查点，极大地提升了实验的可复现性。无论是进行学术研究还是工程落地，SimPO 都为构建高质量、高对齐度的语言模型提供了一条轻量且强大的技术路径。","# Simple Preference Optimization (SimPO)\n\nThis repository contains the code and released models for our paper [SimPO: Simple Preference Optimization with a Reference-Free Reward](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.14734). We propose a simpler and more effective preference optimization algorithm than DPO (Direct Preference Optimization) without using a reference model. SimPO outperforms DPO and its latest variants across AlpacaEval 2, MT-Bench, and Arena-Hard benchmarks under various settings. Please find all the released model checkpoints at [this link](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fprinceton-nlp\u002Fsimpo-66500741a5a066eb7d445889). 
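\n\nAt its core, SimPO scores each response by the length-normalized log probability that the policy itself assigns to it, scaled by `beta`, and asks the winning response to beat the losing one by a target margin `gamma`. The sketch below illustrates that objective (our own minimal rendering for exposition, not the repository's `scripts\u002Fsimpo_trainer.py`; the `*_logps` inputs are assumed to be summed token log-probabilities under the policy):\n\n```python\nimport torch.nn.functional as F\n\ndef simpo_loss(chosen_logps, rejected_logps, chosen_lens, rejected_lens,\n               beta=2.5, gamma_beta_ratio=0.55):\n    # Length-normalized, beta-scaled implicit rewards -- computed from the\n    # policy alone, so no reference model is ever loaded.\n    chosen_reward = beta * chosen_logps \u002F chosen_lens\n    rejected_reward = beta * rejected_logps \u002F rejected_lens\n    # The target reward margin gamma is parameterized as gamma_beta_ratio * beta.\n    gamma = gamma_beta_ratio * beta\n    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()\n```\n\nBecause the reward depends only on the policy, training holds a single model in memory, which is where SimPO's memory and compute savings over DPO come from.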
\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fprinceton-nlp_SimPO_readme_3a5bb0450f8d.png\" width=\"1000px\">\u003C\u002Fimg>\n\n## 🆕 Changelog \n- **[2024.10.12]** To facilitate reproducibility, we release the training curves for Llama3-Instruct and Gemma2-IT:\n  - [Llama3-Instruct-SimPO](https:\u002F\u002Fwandb.ai\u002Fyumeng0818\u002Fsimpo\u002Fruns\u002Fzoesxyuj)\n  - [Llama3-Instruct-SimPO v0.2](https:\u002F\u002Fwandb.ai\u002Fyumeng0818\u002Fsimpo\u002Fruns\u002Fzvv56fcj)\n  - [Gemma2-IT-SimPO](https:\u002F\u002Fwandb.ai\u002Fyumeng0818\u002Fsimpo\u002Fruns\u002F4w25j650)\n- **[2024.07.17]** We released a new SimPO model [gemma-2-9b-it-SimPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002Fgemma-2-9b-it-SimPO) by fine-tuning Google's gemma-2 9B model using on-policy [UltraFeedback data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fprinceton-nlp\u002Fgemma2-ultrafeedback-armorm) annotated by [ArmoRM](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002FArmoRM-Llama3-8B-v0.1), achieving a **72.4** LC win rate on AlpacaEval 2 (**#[1 on the Leaderboard](https:\u002F\u002Ftatsu-lab.github.io\u002Falpaca_eval\u002F)** 🎉🎉) and a **59.1** win rate on Arena-Hard! Please find the training script [here](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Fblob\u002Fmain\u002Ftraining_configs\u002Fgemma-2-9b-it-simpo.yaml) and the data generation scripts [here](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Ftree\u002Fmain\u002Fon_policy_data_gen)!\n- **[2024.07.08]** We updated our paper ([v2](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.14734v2))\n  - Additional baselines (RRHF, SLiC-HF, CPO) \n  - New Llama3-Instruct setting (v0.2) with [ArmoRM](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002FArmoRM-Llama3-8B-v0.1) as the preference label annotator, yielding a better-performing model, [Llama-3-Instruct-8B-SimPO-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-SimPO-v0.2), with a **53.7** LC win rate on AlpacaEval 2 and a **36.5** win rate on Arena-Hard ([training script](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Fblob\u002Fmain\u002Ftraining_configs\u002Fllama-3-8b-instruct-simpo-v2.yaml))!\n  - [SimPO trainer](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Fblob\u002Fmain\u002Fscripts\u002Fsimpo_trainer.py) update for better reproducibility. The hyperparameter `gamma` changed to `gamma_beta_ratio` for easier tuning.\n\n## 🔗 Quick Links\n- [SimPO: Simple Preference Optimization with a Reference-Free Reward](#simple-preference-optimization-simpo)\n  - [Changelog](#-changelog)\n  - [Tips for Running SimPO](#tips-for-running-simpo)\n  - [Released Models](#released-models)\n  - [Install Requirements](#install-requirements)\n  - [Training scripts](#training-scripts)\n  - [Evaluation](#evaluation)\n  - [Bugs or Questions?](#bugs-or-questions)\n  - [Citation](#citation)\n\n## Tips for Running SimPO\nGiven the various inquiries about SimPO, we provide a list of tips to help you reproduce our paper results and achieve better outcomes for running SimPO on your own tasks. \n\n### Environment\nWe provide an [environment file](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Fblob\u002Fmain\u002Fenvironment.yml) including the python package versions we used in our experiments. For optimal reproducibility, we recommend using the same package versions. 
However, please note that results may still vary due to differences in hardware configurations and CUDA versions, etc.\n\n### Hyperparameter tuning\nHyperparameter tuning is crucial for SimPO (and other preference optimization algorithms in general). The three main hyperparameters of SimPO to focus on are `learning_rate`, `beta`, and `gamma` (we recommend keeping the total batch size fixed at 128).\n- `learning_rate`: It is the most critical hyperparameter for preference optimization. A large learning rate (e.g., 1e-5) can significantly degrade performance, causing the model to produce incoherent sentences or completely repetitive responses. We recommend grid searching over 3e-7, 5e-7, 8e-7, and 1e-6, if resources allow. **We find that a smaller learning rate (e.g., 5e-7) is more suitable for reasoning intensive domains like math for both DPO and SimPO.**\n- `beta`: Beta controls the reward scaling between winning and losing responses. SimPO requires a much larger `beta` than DPO. In our preprint, we used a beta of `2.0` or `2.5`, but in many cases, an even larger beta (e.g., `10`) could yield better results.\n- `gamma`: Gamma controls the target reward margin. We suggest tuning the ratio of gamma to beta (i.e., `gamma \u002F beta`). We recommend using `0.5` as a starting point for `gamma_beta_ratio` and grid searching between `0` and `1`. A well-tuned `gamma_beta_ratio` can provide a modest improvement, but it is not as critical as other hyperparameters.\n\nWe used the following hyperparameters for training the released models (note that in our latest update, we changed the hyperparameter `gamma` to `gamma_beta_ratio` as the latter is normalized and easier to tune under different `beta` values).\n| Setting           | β   | γ\u002Fβ   | Learning rate |\n|-------------------|-----|-----|----------------|\n| Mistral-Base      | 2.0 | 0.8 | 3e-7           |\n| Mistral-Instruct  | 2.5 | 0.1 | 5e-7           |\n| Llama3-Base       | 2.0 | 0.5 | 6e-7           |\n| Llama3-Instruct   | 2.5 | 0.55 | 1e-6           |\n| Llama3-Instruct v0.2   | 10 | 0.3 | 1e-6           |\n| Gemma             | 10 | 0.5 | 8e-7 |  \n\nFor DPO, the best hyperparameters for each setting are as follows.\n| Setting                  | β | Learning Rate |\n|------------------------|------|---------------|\n| Mistral-Base           | 0.01 | 5e-7      |\n| Mistral-Instruct       | 0.01 | 5e-7      |\n| Llama3-Base            | 0.01 | 5e-7      |\n| Llama3-Instruct        | 0.01 | 7e-7      |\n| Llama3-Instruct v0.2   | 0.01 | 3e-7      |\n| Gemma             | 0.01 | 5e-7 |  \n\n\n### Training and evaluation consistency in BOS\nOur released Llama3 models use the initial version of the Llama3 tokenizer (prior to this [PR](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FMeta-Llama-3-8B-Instruct\u002Fcommit\u002F339ce92d052f002cdbac4a4bd551d1c61dd8345e)). We have found that the updated Llama3 tokenizer with vLLM occasionally introduces two BOS tokens, which can affect evaluation results. 
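A quick way to catch this (a minimal sketch, assuming the Hugging Face `AutoTokenizer` API; whether the duplicate BOS appears depends on your tokenizer and vLLM versions) is to count BOS tokens after applying the chat template:\n\n```python\nfrom transformers import AutoTokenizer\n\ntokenizer = AutoTokenizer.from_pretrained(\"meta-llama\u002FMeta-Llama-3-8B-Instruct\")\nmessages = [{\"role\": \"user\", \"content\": \"Hello!\"}]\n# apply_chat_template returns token ids when tokenize=True (the default)\nids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)\nn_bos = sum(1 for t in ids if t == tokenizer.bos_token_id)\nassert n_bos == 1, f\"expected exactly one BOS token, found {n_bos}\"\n```\n\n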
Therefore, please ensure that only one BOS token is included in the prompt after applying the Llama3 chat template during any evaluation.\n\n*Notably, if you are training Llama3 and evaluating the trained models on AlpacaEval 2 and Arena-Hard using the templates provided in this repo, please make sure to use the pre-update Llama3 tokenizer (i.e., the one before the PR).*\n\n### Reproducing AlpacaEval 2 numbers\nPlease make sure that you use `alpaca-eval==0.6.2` and the [model configurations](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Ftree\u002Fmain\u002Feval\u002Falpacaeval2\u002Fconfigs) in our repo to successfully reproduce the AlpacaEval 2 results. AlpacaEval underwent a major revision to its vLLM decoding in `0.6.3`, which causes a discrepancy with our experiments. \n\n### Adding an extra SFT loss\nThe [CPO_SIMPO](https:\u002F\u002Fgithub.com\u002Ffe1ixxu\u002FCPO_SIMPO\u002Ftree\u002Fmain) repository ran preliminary experiments and observed that in some cases, adding an additional SFT loss can help improve results. In our own experiments, the SFT regularization helps preserve the reasoning ability (e.g., GSM8K) but degrades chat performance. If you'd like to apply SFT regularization, you can set `sft_weight` to be a positive value (by default it's 0).\n\n\n## Released Models\n\n### Gemma  \nWe release the following two models that are built on top of the strong [google\u002Fgemma-2-9b-it](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fgemma-2-9b-it) model by training DPO and SimPO on the on-policy dataset [princeton-nlp\u002Fgemma2-ultrafeedback-armorm](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fprinceton-nlp\u002Fgemma2-ultrafeedback-armorm). For GSM and MMLU, we use the [ZeroEval](https:\u002F\u002Fgithub.com\u002Fyuchenlin\u002FZeroEval) repository, which aims to evaluate instruction-tuned LLMs (i.e., chat models instead of base models) for their zero-shot performance on reasoning and knowledge-heavy tasks. More results on [WildBench](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fallenai\u002FWildBench) are coming soon. \n\n|               models                    | AE2 LC | AE2 WR | AE2 Length |  AH  | AH Length |  GSM | GSM Length | MMLU | MMLU Length |\n|-----------------------------------|:------:|:------:|:----------:|:----:|:---------:|:----:|:----------:|:----:|:-----------:|\n|        [google\u002Fgemma-2-9b-it](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fgemma-2-9b-it)       |  51.1  |  38.1  |    1571    | 40.8 |    545    | 87.4 |     395    | 72.7 |     515     |\n|  [princeton-nlp\u002Fgemma-2-9b-it-DPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002Fgemma-2-9b-it-DPO)  |  67.8  |  65.4  |    2016    | 58.9 |    717    | 88.5 |     392    | 72.2 |     624     |\n| [princeton-nlp\u002Fgemma-2-9b-it-SimPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002Fgemma-2-9b-it-SimPO) |  72.4  |  65.9  |    1833    | 59.1 |    693    | 88.0 |     341    | 72.2 |     441     |\n\n- Compared to the llama3 models, we found that the gemma models exhibit significantly less catastrophic forgetting on math tasks (e.g., GSM) and MMLU, despite the ultrafeedback dataset having limited math-related data. 
This demonstrates that the [google\u002Fgemma-2-9b-it](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fgemma-2-9b-it) model is more suitable for continued preference optimization.\n- SimPO and DPO perform comparably across all benchmarks, but SimPO is inherently simpler and less resource-intensive.\n\n\n### v0.2\nWe found that using a strong reward model for annotating preference optimization datasets is crucial. In this iteration, we have reannotated the dataset [princeton-nlp\u002Fllama3-ultrafeedback-armorm](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fprinceton-nlp\u002Fllama3-ultrafeedback-armorm) using a more powerful reward model, [RLHFlow\u002FArmoRM-Llama3-8B-v0.1](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002FArmoRM-Llama3-8B-v0.1). As a result, the v0.2 models demonstrate significantly improved performance compared to the v0.1 models. \n\n**Caveat**: We have observed that the SimPO v0.2 model often struggles with generating outputs that require adherence to specific structures, such as JSON. This issue arises from a combination of factors: the llama3-instruct model's tendency to forget and the large learning rate (e.g., 1e-6) used during training, which causes deviation from the original model. To address this, we developed SimPO models based on the [google\u002Fgemma-2-9b-it](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fgemma-2-9b-it) model. We found that changing the initial model significantly mitigates the forgetting issue and reduces the impact of the learning rate.\n\n| models                       |                                                                                                           | AE2 LC | AE2 WR |  AH  |\n|------------------------------|-----------------------------------------------------------------------------------------------------------|:------:|:------:|:----:|\n| Llama 3 Instruct 8B RRHF v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-RRHF-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-RRHF-v0.2) |  37.9  |  31.6  | 28.8 |\n| Llama 3 Instruct 8B SLiC-HF v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-SLiC-HF-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-SLiC-HF-v0.2) |  33.9  |  32.5  | 29.3 |\n| Llama 3 Instruct 8B DPO v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-DPO-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-DPO-v0.2) |  48.2  |  47.5  | 35.2 |\n| Llama 3 Instruct 8B IPO v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-IPO-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-IPO-v0.2) |  46.8  |  42.4  | 36.6 |\n| Llama 3 Instruct 8B CPO v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-CPO-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-CPO-v0.2) |  34.1  |  36.4  | 30.9 |\n| Llama 3 Instruct 8B KTO v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-KTO-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-KTO-v0.2) |  34.1  |  32.1  | 27.3 |\n| Llama 3 Instruct 8B ORPO v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-ORPO-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-ORPO-v0.2) |  38.1  |  33.8  | 28.2 |\n| Llama 3 Instruct 8B R-DPO v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-RDPO-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-RDPO-v0.2) |  48.0  |  45.8  | 35.1 |\n| Llama 3 Instruct 8B SimPO v0.2 | 
[princeton-nlp\u002FLlama-3-Instruct-8B-SimPO-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-SimPO-v0.2) |  53.7  |  47.5  | 36.5 |\n\n### v0.1\nBelow is the complete list of models evaluated in our preprint. We used the [HuggingFaceH4\u002Fultrafeedback_binarized](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceH4\u002Fultrafeedback_binarized) dataset to train the Mistral Base and Llama3 Base models, the [princeton-nlp\u002Fmistral-instruct-ultrafeedback](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fprinceton-nlp\u002Fmistral-instruct-ultrafeedback) dataset to train the Mistral Instruct models, and the [princeton-nlp\u002Fllama3-ultrafeedback](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fprinceton-nlp\u002Fllama3-ultrafeedback) dataset to train the Llama3 Instruct models. The latter two datasets are annotated by the [llm-blender\u002FPairRM](https:\u002F\u002Fhuggingface.co\u002Fllm-blender\u002FPairRM) model.\n\nmodels                       |                                                                                                           | AE2 LC | AE2 WR |  AH  |\n|------------------------------|-----------------------------------------------------------------------------------------------------------|:------:|:------:|:----:|\n| Mistral Base 7B SFT          | [alignment-handbook\u002Fzephyr-7b-sft-full](https:\u002F\u002Fhuggingface.co\u002Falignment-handbook\u002Fzephyr-7b-sft-full)     |   8.4  |   6.2  |  1.3 |\n| Mistral Base 7B RRHF         | [princeton-nlp\u002FMistral-7B-Base-SFT-RRHF](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Base-SFT-RRHF) |  11.6  |  10.2  |  6.9 |\n| Mistral Base 7B SLiC-HF      | [princeton-nlp\u002FMistral-7B-Base-SFT-SLiC-HF](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Base-SFT-SLiC-HF) |  10.9  |   8.9  |  7.3 |\n| Mistral Base 7B DPO (Zephyr) | [princeton-nlp\u002FMistral-7B-Base-SFT-DPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Base-SFT-DPO)     |  15.1  |  12.5  | 10.4 |\n| Mistral Base 7B IPO          | [princeton-nlp\u002FMistral-7B-Base-SFT-IPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Base-SFT-IPO)     |  11.8  |   9.4  |  7.5 |\n| Mistral Base 7B CPO          | [princeton-nlp\u002FMistral-7B-Base-SFT-CPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Base-SFT-CPO)     |  9.8  |   8.9  |  6.9 |\n| Mistral Base 7B KTO          | [princeton-nlp\u002FMistral-7B-Base-SFT-KTO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Base-SFT-KTO)     |  13.1  |   9.1  |  5.6 |\n| Mistral Base 7B ORPO         | [kaist-ai\u002Fmistral-orpo-beta](https:\u002F\u002Fhuggingface.co\u002Fkaist-ai\u002Fmistral-orpo-beta)                           |  14.7  |  12.2  |  7.0 |\n| Mistral Base 7B R-DPO        | [princeton-nlp\u002FMistral-7B-Base-SFT-RDPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Base-SFT-RDPO)   |  17.4  |  12.8  |  9.9 |\n| Mistral Base 7B SimPO        | [princeton-nlp\u002FMistral-7B-Base-SFT-SimPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Base-SFT-SimPO) |  21.4  |  20.8  | 16.6 |\n| Mistral Instruct 7B SFT      | [mistralai\u002FMistral-7B-Instruct-v0.2](https:\u002F\u002Fhuggingface.co\u002Fmistralai\u002FMistral-7B-Instruct-v0.2)           |  17.1  |  14.7  | 12.6 |\n| Mistral Instruct 7B RRHF     | 
[princeton-nlp\u002FMistral-7B-Instruct-RRHF](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-RRHF) |  25.3  |  24.8  | 18.1 |\n| Mistral Instruct 7B SLiC-HF  | [princeton-nlp\u002FMistral-7B-Instruct-SLiC-HF](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-SLiC-HF) |  24.1  |  24.6  | 18.9 |\n| Mistral Instruct 7B DPO      | [princeton-nlp\u002FMistral-7B-Instruct-DPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-DPO)     |  26.8  |  24.9  | 16.3 |\n| Mistral Instruct 7B IPO      | [princeton-nlp\u002FMistral-7B-Instruct-IPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-IPO)     |  20.3  |  20.3  | 16.2 |\n| Mistral Instruct 7B CPO      | [princeton-nlp\u002FMistral-7B-Instruct-CPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-CPO)     |  23.8  |  28.8  | 22.6 |\n| Mistral Instruct 7B KTO      | [princeton-nlp\u002FMistral-7B-Instruct-KTO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-KTO)     |  24.5  |  23.6  | 17.9 |\n| Mistral Instruct 7B ORPO     | [princeton-nlp\u002FMistral-7B-Instruct-ORPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-ORPO)   |  24.5  |  24.9  | 20.8 |\n| Mistral Instruct 7B R-DPO    | [princeton-nlp\u002FMistral-7B-Instruct-RDPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-RDPO)   |  27.3  |  24.5  | 16.1 |\n| Mistral Instruct 7B SimPO    | [princeton-nlp\u002FMistral-7B-Instruct-SimPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-SimPO) |  32.1  |  34.8  | 21.0 |\n| Llama3 Base 8B SFT           | [princeton-nlp\u002FLlama-3-Base-8B-SFT](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SFT)             |   6.2  |   4.6  |  3.3 |\n| Llama3 Base 8B RRHF          | [princeton-nlp\u002FLlama-3-Base-8B-RRHF](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-RRHF)           |  10.8  |   8.1  |  6.6 |\n| Llama3 Base 8B SLiC-HF       | [princeton-nlp\u002FLlama-3-Base-8B-SLiC-HF](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SLiC-HF)     |  12.1  |  10.1  | 10.3 |\n| Llama3 Base 8B DPO           | [princeton-nlp\u002FLlama-3-Base-8B-SFT-DPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SFT-DPO)     |  18.2  |  15.5  | 15.9 |\n| Llama3 Base 8B IPO           | [princeton-nlp\u002FLlama-3-Base-8B-SFT-IPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SFT-IPO)     |  14.4  |  14.2  | 17.8 |\n| Llama3 Base 8B CPO           | [princeton-nlp\u002FLlama-3-Base-8B-SFT-CPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SFT-CPO)     |  10.8  |  8.1  | 5.8 |\n| Llama3 Base 8B KTO           | [princeton-nlp\u002FLlama-3-Base-8B-SFT-KTO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SFT-KTO)     |  14.2  |  12.4  | 12.5 |\n| Llama3 Base 8B ORPO          | [princeton-nlp\u002FLlama-3-Base-8B-SFT-ORPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SFT-ORPO)   |  12.2  |  10.6  | 10.8 |\n| Llama3 Base 8B R-DPO         | [princeton-nlp\u002FLlama-3-Base-8B-SFT-RDPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SFT-RDPO)   |  17.6  |  14.4  | 17.2 |\n| Llama3 Base 8B SimPO         | 
[princeton-nlp\u002FLlama-3-Base-8B-SFT-SimPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SFT-SimPO) |  22.0  |  20.3  | 23.4 |\n| Llama3 Instruct 8B SFT       | [meta-llama\u002FMeta-Llama-3-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FMeta-Llama-3-8B-Instruct)         |  26.0  |  25.3  | 22.3 |\n| Llama3 Instruct 8B RRHF      | [princeton-nlp\u002FLlama-3-Instruct-8B-RRHF](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-RRHF) |  31.3  |  28.4  | 26.5 |\n| Llama3 Instruct 8B SLiC-HF   | [princeton-nlp\u002FLlama-3-Instruct-8B-SLiC-HF](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-SLiC-HF) |  26.9  |  27.5  | 26.2 |\n| Llama3 Instruct 8B DPO       | [princeton-nlp\u002FLlama-3-Instruct-8B-DPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-DPO)     |  40.3  |  37.9  | 32.6 |\n| Llama3 Instruct 8B IPO       | [princeton-nlp\u002FLlama-3-Instruct-8B-IPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-IPO)     |  35.6  |  35.6  | 30.5 |\n| Llama3 Instruct 8B CPO       | [princeton-nlp\u002FLlama-3-Instruct-8B-CPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-CPO)     |  33.1  |  31.8  | 26.4 |\n| Llama3 Instruct 8B KTO       | [princeton-nlp\u002FLlama-3-Instruct-8B-KTO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-KTO)     |  33.1  |  31.8  | 26.4 |\n| Llama3 Instruct 8B ORPO      | [princeton-nlp\u002FLlama-3-Instruct-8B-ORPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-ORPO)   |  28.5  |  27.4  | 25.8 |\n| Llama3 Instruct 8B R-DPO     | [princeton-nlp\u002FLlama-3-Instruct-8B-RDPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-RDPO)   |  41.1  |  37.8  | 33.1 |\n| Llama3 Instruct 8B SimPO     | [princeton-nlp\u002FLlama-3-Instruct-8B-SimPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-SimPO) |  44.7  |  40.5  | 33.8 |\n\n\n### Use our models for inference\nPlease refer to the [generate.py](generate.py) script for detailed instructions on loading the model with the appropriate chat template.\n\n## Install Requirements\n\nOur codebase is built upon the [alignment-handbook repo](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Falignment-handbook). The following steps will guide you through the installation process.\n\nFirst, create a Python virtual environment using, e.g., Conda:\n```shell\nconda create -n handbook python=3.10 && conda activate handbook\n```\n\nNext, install PyTorch `v2.2.2`. Since this is hardware-dependent, we\ndirect you to the [PyTorch Installation Page](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F).\n\nYou can then install the remaining package dependencies of [alignment-handbook](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Falignment-handbook) as follows:\n\n```shell\ngit clone https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Falignment-handbook.git\ncd .\u002Falignment-handbook\u002F\npython -m pip install .\n```\n\nYou will also need Flash Attention 2 installed, which can be done by running:\n\n```shell\npython -m pip install flash-attn --no-build-isolation\n```\n\n## Training Scripts\n\nWe provide five training config files for the training setups reported in our paper. The training configs are set for 4xH100 GPUs. 
You may need to adjust `num_processes` and `per_device_train_batch_size` based on your computation environment. \n\n* Mistral-Base:\n```shell\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs\u002Fdeepspeed_zero3.yaml scripts\u002Frun_simpo.py training_configs\u002Fmistral-7b-base-simpo.yaml\n```\n* Mistral-Instruct:\n```shell\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs\u002Fdeepspeed_zero3.yaml scripts\u002Frun_simpo.py training_configs\u002Fmistral-7b-instruct-simpo.yaml\n```\n* Llama3-Base:\n```shell\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs\u002Fdeepspeed_zero3.yaml scripts\u002Frun_simpo.py training_configs\u002Fllama-3-8b-base-simpo.yaml\n```\n* Llama3-Instruct:\n```shell\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs\u002Fdeepspeed_zero3.yaml scripts\u002Frun_simpo.py training_configs\u002Fllama-3-8b-instruct-simpo.yaml\n```\n* Llama3-Instruct v0.2:\n```shell\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs\u002Fdeepspeed_zero3.yaml scripts\u002Frun_simpo.py training_configs\u002Fllama-3-8b-instruct-simpo-v2.yaml\n```\n\n## Evaluation\n\nWe follow the official implementation for evaluation on AlpacaEval 2, Arena-Hard, and MT-Bench, as follows (more details can be found under [the eval directory](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Ftree\u002Fmain\u002Feval)):\n\n* AlpacaEval 2: Please refer to the [AlpacaEval repo](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval) for evaluation.\n\n* Arena-Hard: Please refer to the [Arena-Hard-Auto repo](https:\u002F\u002Fgithub.com\u002Flm-sys\u002Farena-hard-auto) for evaluation.\n\n* MT-Bench: Please refer to the [FastChat repo](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat) for evaluation.\n\n## Bugs or Questions?\nIf you have any questions related to the code or the paper, feel free to email Yu (yumeng5@virginia.edu). If you encounter any problems when using the code, or want to report a bug, feel free to open an issue! 
Please try to specify the problem with details so we can help you better and quicker!\n\n## Citation\nPlease cite our paper if you find the repo helpful in your work:\n\n```bibtex\n@inproceedings{meng2024simpo,\n   title={SimPO: Simple Preference Optimization with a Reference-Free Reward},\n   author={Meng, Yu and Xia, Mengzhou and Chen, Danqi},\n   booktitle={Advances in Neural Information Processing Systems (NeurIPS)},\n   year={2024}\n}\n```\n","# 简单偏好优化 (SimPO)\n\n本仓库包含我们论文《SimPO：无需参考模型的简单偏好优化》（[arXiv:2405.14734](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.14734)）的代码及已发布的模型。我们提出了一种比DPO（直接偏好优化）更简单、更有效的偏好优化算法，且无需使用参考模型。在AlpacaEval 2、MT-Bench和Arena-Hard等多个基准测试中，SimPO在不同设置下均优于DPO及其最新变体。所有已发布的模型检查点请见[此链接](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fprinceton-nlp\u002Fsimpo-66500741a5a066eb7d445889)。\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fprinceton-nlp_SimPO_readme_3a5bb0450f8d.png\" width=\"1000px\">\u003C\u002Fimg>\n\n## 🆕 更改日志\n- **[2024.10.12]** 为便于复现，我们发布了Llama3-Instruct和Gemma2-IT的训练曲线：\n  - [Llama3-Instruct-SimPO](https:\u002F\u002Fwandb.ai\u002Fyumeng0818\u002Fsimpo\u002Fruns\u002Fzoesxyuj)\n  - [Llama3-Instruct-SimPO v0.2](https:\u002F\u002Fwandb.ai\u002Fyumeng0818\u002Fsimpo\u002Fruns\u002Fzvv56fcj)\n  - [Gemma2-IT-SimPO](https:\u002F\u002Fwandb.ai\u002Fyumeng0818\u002Fsimpo\u002Fruns\u002F4w25j650)\n- **[2024.07.17]** 我们发布了一个新的SimPO模型[gemma-2-9b-it-SimPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002Fgemma-2-9b-it-SimPO)，该模型基于Google的gemma-2 9B模型，使用由[ArmoRM](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002FArmoRM-Llama3-8B-v0.1)标注的on-policy [UltraFeedback数据](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fprinceton-nlp\u002Fgemma2-ultrafeedback-armorm)进行微调，在AlpacaEval 2上的LC胜率达到了**72.4**分（**#[排行榜第一](https:\u002F\u002Ftatsu-lab.github.io\u002Falpaca_eval\u002F)** 🎉🎉），在Arena-Hard上的胜率也达到了**59.1**分！训练脚本请见[这里](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Fblob\u002Fmain\u002Ftraining_configs\u002Fgemma-2-9b-it-simpo.yaml)，数据生成脚本请见[这里](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Ftree\u002Fmain\u002Fon_policy_data_gen)！\n- **[2024.07.08]** 我们更新了论文（[v2](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.14734v2)）：\n  - 增加了更多基线方法（RRHF、SLiC-HF、CPO）。\n  - 引入了新的Llama3-Instruct设置（v0.2），采用[ArmoRM](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002FArmoRM-Llama3-8B-v0.1)作为偏好标签标注者，训练出表现更优的模型[Llama-3-Instruct-8B-SimPO-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-SimPO-v0.2)，其在AlpacaEval 2上的LC胜率达到了**53.7**分，在Arena-Hard上的胜率则为**36.5**分（训练脚本请见[这里](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Fblob\u002Fmain\u002Ftraining_configs\u002Fllama-3-8b-instruct-simpo-v2.yaml)）！\n  - 更新了[SimPO训练器](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Fblob\u002Fmain\u002Fscripts\u002Fsimpo_trainer.py)，以提高复现性。超参数`gamma`现已改为`gamma_beta_ratio`，以便于调参。\n\n## 🔗 快速链接\n- [SimPO：无需参考模型的简单偏好优化](#simple-preference-optimization-simpo)\n  - [更改日志](#-changelog)\n  - [运行SimPO的技巧](#tips-for-running-simpo)\n  - [已发布模型](#released-models)\n  - [安装依赖](#install-requirements)\n  - [训练脚本](#training-scripts)\n  - [评估](#evaluation)\n  - [遇到问题或疑问？](#bugs-or-questions)\n  - [引用](#citation)\n\n## 运行SimPO的技巧\n鉴于大家对SimPO的诸多询问，我们整理了一份技巧清单，旨在帮助您复现我们的论文结果，并在自己的任务上取得更好的效果。\n\n### 
环境配置\n我们提供了一个[环境文件](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Fblob\u002Fmain\u002Fenvironment.yml)，其中包含了我们在实验中使用的Python包版本。为了获得最佳的复现效果，建议您使用相同的包版本。不过，请注意，由于硬件配置和CUDA版本等差异，结果仍可能存在波动。\n\n### 超参数调优\n超参数调优对于SimPO（以及一般的偏好优化算法）至关重要。SimPO的三个主要超参数是`learning_rate`、`beta`和`gamma`（我们建议将总批次大小固定为128）。\n- `learning_rate`：这是偏好优化中最关键的超参数。过大的学习率（如1e-5）会显著降低性能，导致模型生成不连贯的句子或完全重复的回答。如果资源允许，我们建议在3e-7、5e-7、8e-7和1e-6之间进行网格搜索。**我们发现，对于数学等需要大量推理的任务，较小的学习率（如5e-7）更适合DPO和SimPO。**\n- `beta`：Beta控制着获胜和失败回复之间的奖励缩放比例。与DPO相比，SimPO需要更大的`beta`值。在我们的预印本中，我们使用了`2.0`或`2.5`的beta值，但在许多情况下，更大的beta值（如`10`）可能会带来更好的效果。\n- `gamma`：Gamma控制目标奖励的边际。我们建议调整`gamma`与`beta`的比值（即`gamma \u002F beta`）。初始可以将`gamma_beta_ratio`设为0.5，并在0到1之间进行网格搜索。一个调优得当的`gamma_beta_ratio`可以带来适度的提升，但其重要性不如其他超参数。\n\n我们用于训练已发布模型的超参数如下（请注意，在最新更新中，我们将超参数`gamma`改为了`gamma_beta_ratio`，因为后者经过归一化处理，在不同的`beta`值下更容易调优）。\n| 设置           | β   | γ\u002Fβ   | 学习率 |\n|-------------------|-----|-----|----------------|\n| Mistral-Base      | 2.0 | 0.8 | 3e-7           |\n| Mistral-Instruct  | 2.5 | 0.1 | 5e-7           |\n| Llama3-Base       | 2.0 | 0.5 | 6e-7           |\n| Llama3-Instruct   | 2.5 | 0.55 | 1e-6           |\n| Llama3-Instruct v0.2   | 10 | 0.3 | 1e-6           |\n| Gemma             | 10 | 0.5 | 8e-7 |  \n\n对于DPO，各设置的最佳超参数如下：\n| 设置                  | β | 学习率 |\n|------------------------|------|---------------|\n| Mistral-Base           | 0.01 | 5e-7      |\n| Mistral-Instruct       | 0.01 | 5e-7      |\n| Llama3-Base            | 0.01 | 5e-7      |\n| Llama3-Instruct        | 0.01 | 7e-7      |\n| Llama3-Instruct v0.2   | 0.01 | 3e-7      |\n| Gemma             | 0.01 | 5e-7 |  \n\n\n### BOS一致性：训练与评估\n我们发布的Llama3模型使用的是Llama3分词器的初始版本（在此次[PR](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FMeta-Llama-3-8B-Instruct\u002Fcommit\u002F339ce92d052f002cdbac4a4bd551d1c61dd8345e)之前）。我们发现，使用vLLM时，更新后的Llama3分词器有时会引入两个BOS标记，这可能会影响评估结果。因此，在任何评估过程中，应用Llama3聊天模板后，请确保提示中仅包含一个BOS标记。\n\n*值得注意的是，如果您正在训练Llama3，并使用本仓库提供的模板在AlpacaEval 2和Arena-Hard上评估训练好的模型，请务必使用更新前的Llama3分词器（即PR之前的版本）。*\n\n### 复现 AlpacaEval 2 的结果\n请确保使用 `alpaca-eval==0.6.2` 以及我们仓库中的 [模型配置](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Ftree\u002Fmain\u002Feval\u002Falpacaeval2\u002Fconfigs)，才能成功复现 AlpacaEval 2 的结果。自 `0.6.3` 版本以来，AlpacaEval 对 vLLM 解码进行了重大修订，这会导致与我们实验结果的差异。\n\n### 添加额外的 SFT 损失\n[CPO_SIMPO](https:\u002F\u002Fgithub.com\u002Ffe1ixxu\u002FCPO_SIMPO\u002Ftree\u002Fmain) 仓库进行了初步实验，发现有时添加额外的 SFT 损失可以帮助提升效果。而在我们自己的实验中，SFT 正则化有助于保持推理能力（例如 GSM8K），但会降低聊天性能。如果您希望应用 SFT 正则化，可以将 `sft_weight` 设置为一个正数值（默认值为 0）。\n\n\n\n## 已发布的模型\n\n### Gemma  \n我们发布了以下两个模型，它们基于强大的 [google\u002Fgemma-2-9b-it](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fgemma-2-9b-it) 模型，在 on-policy 数据集 [princeton-nlp\u002Fgemma2-ultrafeedback-armorm](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fprinceton-nlp\u002Fgemma2-ultrafeedback-armorm) 上分别通过 DPO 和 SimPO 训练而成。对于 GSM 和 MMLU 任务，我们使用了 [ZeroEval](https:\u002F\u002Fgithub.com\u002Fyuchenlin\u002FZeroEval) 仓库，该仓库旨在评估经过指令微调的 LLM（即聊天模型而非基础模型）在零样本条件下处理推理和知识密集型任务的能力。更多关于 [WildBench](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fallenai\u002FWildBench) 的结果即将发布。\n\n|               模型                    | AE2 LC | AE2 WR | AE2 长度 |  AH  | AH 长度 |  GSM | GSM 长度 | MMLU | MMLU 长度 |\n|-----------------------------------|:------:|:------:|:----------:|:----:|:---------:|:----:|:----------:|:----:|:-----------:|\n|        [google\u002Fgemma-2-9b-it](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fgemma-2-9b-it)       |  51.1  |  38.1  |    1571    | 40.8 |  
  545    | 87.4 |     395    | 72.7 |     515     |\n|  [princeton-nlp\u002Fgemma-2-9b-it-DPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002Fgemma-2-9b-it-DPO)  |  67.8  |  65.4  |    2016    | 58.9 |    717    | 88.5 |     392    | 72.2 |     624     |\n| [princeton-nlp\u002Fgemma-2-9b-it-SimPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002Fgemma-2-9b-it-SimPO) |  72.4  |  65.9  |    1833    | 59.1 |    693    | 88.0 |     341    | 72.2 |     441     |\n\n- 与 llama3 模型相比，我们发现 gemma 模型在数学任务（如 GSM）和 MMLU 上表现出显著更少的灾难性遗忘现象，尽管 ultrafeedback 数据集中的数学相关数据有限。这表明 [google\u002Fgemma-2-9b-it](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fgemma-2-9b-it) 模型更适合进行持续的偏好优化。\n- 在所有基准测试中，SimPO 和 DPO 的表现相当，但 SimPO 本质上更为简单且资源消耗更低。\n\n\n### v0.2\n我们发现，使用强大的奖励模型来标注偏好优化数据集至关重要。在这一版本中，我们使用更强大的奖励模型 [RLHFlow\u002FArmoRM-Llama3-8B-v0.1](https:\u002F\u002Fhuggingface.co\u002FRLHFlow\u002FArmoRM-Llama3-8B-v0.1) 对数据集 [princeton-nlp\u002Fllama3-ultrafeedback-armorm](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fprinceton-nlp\u002Fllama3-ultrafeedback-armorm) 进行了重新标注。因此，v0.2 版本的模型相比 v0.1 版本在性能上有了显著提升。\n\n**注意**：我们观察到，SimPO v0.2 模型在生成需要遵循特定结构的输出时（如 JSON）往往存在困难。这一问题是由多种因素共同造成的：llama3-instruct 模型本身容易遗忘，以及训练过程中使用的较大学习率（例如 1e-6）导致模型偏离原始状态。为此，我们基于 [google\u002Fgemma-2-9b-it](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fgemma-2-9b-it) 开发了 SimPO 模型。我们发现，更换初始模型可以显著缓解遗忘问题，并降低学习率的影响。\n\n| 模型                       |                                                                                                           | AE2 LC | AE2 WR |  AH  |\n|------------------------------|-----------------------------------------------------------------------------------------------------------|:------:|:------:|:----:|\n| Llama 3 Instruct 8B RRHF v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-RRHF-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-RRHF-v0.2) |  37.9  |  31.6  | 28.8 |\n| Llama 3 Instruct 8B SLiC-HF v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-SLiC-HF-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-SLiC-HF-v0.2) |  33.9  |  32.5  | 29.3 |\n| Llama 3 Instruct 8B DPO v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-DPO-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-DPO-v0.2) |  48.2  |  47.5  | 35.2 |\n| Llama 3 Instruct 8B IPO v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-IPO-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-IPO-v0.2) |  46.8  |  42.4  | 36.6 |\n| Llama 3 Instruct 8B CPO v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-CPO-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-CPO-v0.2) |  34.1  |  36.4  | 30.9 |\n| Llama 3 Instruct 8B KTO v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-KTO-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-KTO-v0.2) |  34.1  |  32.1  | 27.3 |\n| Llama 3 Instruct 8B ORPO v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-ORPO-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-ORPO-v0.2) |  38.1  |  33.8  | 28.2 |\n| Llama 3 Instruct 8B R-DPO v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-RDPO-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-RDPO-v0.2) |  48.0  |  45.8  | 35.1 |\n| Llama 3 Instruct 8B SimPO v0.2 | [princeton-nlp\u002FLlama-3-Instruct-8B-SimPO-v0.2](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-SimPO-v0.2) |  53.7  |  47.5  | 36.5 |\n\n### 
v0.1\n以下是我们在预印本中评估的完整模型列表。我们使用 [HuggingFaceH4\u002Fultrafeedback_binarized](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceH4\u002Fultrafeedback_binarized) 数据集训练了 Mistral Base 和 Llama3 Base 模型；使用 [princeton-nlp\u002Fmistral-instruct-ultrafeedback](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fprinceton-nlp\u002Fmistral-instruct-ultrafeedback) 数据集训练了 Mistral Instruct 模型；而 Llama3 Instruct 模型则是在 [princeton-nlp\u002Fllama3-ultrafeedback](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fprinceton-nlp\u002Fllama3-ultrafeedback) 数据集上训练的。后两个数据集均由 [llm-blender\u002FPairRM](https:\u002F\u002Fhuggingface.co\u002Fllm-blender\u002FPairRM) 模型标注。\n\n模型                       |                                                                                                           | AE2 LC | AE2 WR |  AH  |\n|------------------------------|-----------------------------------------------------------------------------------------------------------|:------:|:------:|:----:|\n| Mistral Base 7B SFT          | [alignment-handbook\u002Fzephyr-7b-sft-full](https:\u002F\u002Fhuggingface.co\u002Falignment-handbook\u002Fzephyr-7b-sft-full)     |   8.4  |   6.2  |  1.3 |\n| Mistral Base 7B RRHF         | [princeton-nlp\u002FMistral-7B-Base-SFT-RRHF](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Base-SFT-RRHF) |  11.6  |  10.2  |  6.9 |\n| Mistral Base 7B SLiC-HF      | [princeton-nlp\u002FMistral-7B-Base-SFT-SLiC-HF](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Base-SFT-SLiC-HF) |  10.9  |   8.9  |  7.3 |\n| Mistral Base 7B DPO (Zephyr) | [princeton-nlp\u002FMistral-7B-Base-SFT-DPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Base-SFT-DPO)     |  15.1  |  12.5  | 10.4 |\n| Mistral Base 7B IPO          | [princeton-nlp\u002FMistral-7B-Base-SFT-IPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Base-SFT-IPO)     |  11.8  |   9.4  |  7.5 |\n| Mistral Base 7B CPO          | [princeton-nlp\u002FMistral-7B-Base-SFT-CPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Base-SFT-CPO)     |  9.8  |   8.9  |  6.9 |\n| Mistral Base 7B KTO          | [princeton-nlp\u002FMistral-7B-Base-SFT-KTO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Base-SFT-KTO)     |  13.1  |   9.1  |  5.6 |\n| Mistral Base 7B ORPO         | [kaist-ai\u002Fmistral-orpo-beta](https:\u002F\u002Fhuggingface.co\u002Fkaist-ai\u002Fmistral-orpo-beta)                           |  14.7  |  12.2  |  7.0 |\n| Mistral Base 7B R-DPO        | [princeton-nlp\u002FMistral-7B-Base-SFT-RDPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Base-SFT-RDPO)   |  17.4  |  12.8  |  9.9 |\n| Mistral Base 7B SimPO        | [princeton-nlp\u002FMistral-7B-Base-SFT-SimPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Base-SFT-SimPO) |  21.4  |  20.8  | 16.6 |\n| Mistral Instruct 7B SFT      | [mistralai\u002FMistral-7B-Instruct-v0.2](https:\u002F\u002Fhuggingface.co\u002Fmistralai\u002FMistral-7B-Instruct-v0.2)           |  17.1  |  14.7  | 12.6 |\n| Mistral Instruct 7B RRHF     | [princeton-nlp\u002FMistral-7B-Instruct-RRHF](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-RRHF) |  25.3  |  24.8  | 18.1 |\n| Mistral Instruct 7B SLiC-HF  | [princeton-nlp\u002FMistral-7B-Instruct-SLiC-HF](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-SLiC-HF) |  24.1  |  24.6  | 18.9 |\n| Mistral Instruct 7B DPO      | 
[princeton-nlp\u002FMistral-7B-Instruct-DPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-DPO)     |  26.8  |  24.9  | 16.3 |\n| Mistral Instruct 7B IPO      | [princeton-nlp\u002FMistral-7B-Instruct-IPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-IPO)     |  20.3  |  20.3  | 16.2 |\n| Mistral Instruct 7B CPO      | [princeton-nlp\u002FMistral-7B-Instruct-CPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-CPO)     |  23.8  |  28.8  | 22.6 |\n| Mistral Instruct 7B KTO      | [princeton-nlp\u002FMistral-7B-Instruct-KTO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-KTO)     |  24.5  |  23.6  | 17.9 |\n| Mistral Instruct 7B ORPO     | [princeton-nlp\u002FMistral-7B-Instruct-ORPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-ORPO)   |  24.5  |  24.9  | 20.8 |\n| Mistral Instruct 7B R-DPO    | [princeton-nlp\u002FMistral-7B-Instruct-RDPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-RDPO)   |  27.3  |  24.5  | 16.1 |\n| Mistral Instruct 7B SimPO    | [princeton-nlp\u002FMistral-7B-Instruct-SimPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FMistral-7B-Instruct-SimPO) |  32.1  |  34.8  | 21.0 |\n| Llama3 Base 8B SFT           | [princeton-nlp\u002FLlama-3-Base-8B-SFT](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SFT)             |   6.2  |   4.6  |  3.3 |\n| Llama3 Base 8B RRHF          | [princeton-nlp\u002FLlama-3-Base-8B-RRHF](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-RRHF)           |  10.8  |   8.1  |  6.6 |\n| Llama3 Base 8B SLiC-HF       | [princeton-nlp\u002FLlama-3-Base-8B-SLiC-HF](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SLiC-HF)     |  12.1  |  10.1  | 10.3 |\n| Llama3 Base 8B DPO           | [princeton-nlp\u002FLlama-3-Base-8B-SFT-DPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SFT-DPO)     |  18.2  |  15.5  | 15.9 |\n| Llama3 Base 8B IPO           | [princeton-nlp\u002FLlama-3-Base-8B-SFT-IPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SFT-IPO)     |  14.4  |  14.2  | 17.8 |\n| Llama3 Base 8B CPO           | [princeton-nlp\u002FLlama-3-Base-8B-SFT-CPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SFT-CPO)     |  10.8  |   8.1  | 5.8 |\n| Llama3 Base 8B KTO           | [princeton-nlp\u002FLlama-3-Base-8B-SFT-KTO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SFT-KTO)     |  14.2  |  12.4  | 12.5 |\n| Llama3 Base 8B ORPO          | [princeton-nlp\u002FLlama-3-Base-8B-SFT-ORPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SFT-ORPO)   |  12.2  |  10.6  | 10.8 |\n| Llama3 Base 8B R-DPO         | [princeton-nlp\u002FLlama-3-Base-8B-SFT-RDPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SFT-RDPO)   |  17.6  |  14.4  | 17.2 |\n| Llama3 Base 8B SimPO         | [princeton-nlp\u002FLlama-3-Base-8B-SFT-SimPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Base-8B-SFT-SimPO) |  22.0  |  20.3  | 23.4 |\n| Llama3 Instruct 8B SFT       | [meta-llama\u002FMeta-Llama-3-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FMeta-Llama-3-8B-Instruct)         |  26.0  |  25.3  | 22.3 |\n| Llama3 Instruct 8B RRHF      | 
[princeton-nlp\u002FLlama-3-Instruct-8B-RRHF](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-RRHF) |  31.3  |  28.4  | 26.5 |\n| Llama3 Instruct 8B SLiC-HF   | [princeton-nlp\u002FLlama-3-Instruct-8B-SLiC-HF](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-SLiC-HF) |  26.9  |  27.5  | 26.2 |\n| Llama3 Instruct 8B DPO       | [princeton-nlp\u002FLlama-3-Instruct-8B-DPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-DPO)     |  40.3  |  37.9  | 32.6 |\n| Llama3 Instruct 8B IPO       | [princeton-nlp\u002FLlama-3-Instruct-8B-IPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-IPO)     |  35.6  |  35.6  | 30.5 |\n| Llama3 Instruct 8B CPO       | [princeton-nlp\u002FLlama-3-Instruct-8B-CPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-CPO)     |  33.1  |  31.8  | 26.4 |\n| Llama3 Instruct 8B KTO       | [princeton-nlp\u002FLlama-3-Instruct-8B-KTO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-KTO)     |  33.1  |  31.8  | 26.4 |\n| Llama3 Instruct 8B ORPO      | [princeton-nlp\u002FLlama-3-Instruct-8B-ORPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-ORPO)   |  28.5  |  27.4  | 25.8 |\n| Llama3 Instruct 8B R-DPO     | [princeton-nlp\u002FLlama-3-Instruct-8B-RDPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-RDPO)   |  41.1  |  37.8  | 33.1 |\n| Llama3 Instruct 8B SimPO     | [princeton-nlp\u002FLlama-3-Instruct-8B-SimPO](https:\u002F\u002Fhuggingface.co\u002Fprinceton-nlp\u002FLlama-3-Instruct-8B-SimPO) |  44.7  |  40.5  | 33.8 |\n\n### 使用我们的模型进行推理\n请参阅 [generate.py](generate.py) 脚本，以获取使用适当聊天模板加载模型的详细说明。\n\n## 安装依赖\n\n我们的代码库基于 [alignment-handbook 仓库](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Falignment-handbook) 构建。以下步骤将指导您完成安装过程。\n\n首先，使用 Conda 等工具创建一个 Python 虚拟环境：\n```shell\nconda create -n handbook python=3.10 && conda activate handbook\n```\n\n接下来，安装 PyTorch `v2.2.2`。由于这与硬件相关，我们建议您访问 [PyTorch 安装页面](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F)。\n\n然后，您可以按照以下步骤安装 [alignment-handbook](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Falignment-handbook) 的其余包依赖项：\n\n```shell\ngit clone https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Falignment-handbook.git\ncd .\u002Falignment-handbook\u002F\npython -m pip install .\n```\n\n您还需要安装 Flash Attention 2，可以通过运行以下命令完成：\n```shell\npython -m pip install flash-attn --no-build-isolation\n```\n\n## 训练脚本\n\n我们提供了五份训练配置文件，分别对应论文中报告的训练设置。这些配置适用于 4 张 H100 GPU。根据您的计算环境，可能需要调整 `num_processes` 和 `per_device_train_batch_size` 参数。\n\n* Mistral-Base：\n```shell\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs\u002Fdeepspeed_zero3.yaml scripts\u002Frun_simpo.py training_configs\u002Fmistral-7b-base-simpo.yaml\n```\n* Mistral-Instruct：\n```shell\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs\u002Fdeepspeed_zero3.yaml scripts\u002Frun_simpo.py training_configs\u002Fmistral-7b-instruct-simpo.yaml\n```\n* Llama3-Base：\n```shell\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs\u002Fdeepspeed_zero3.yaml scripts\u002Frun_simpo.py training_configs\u002Fllama-3-8b-base-simpo.yaml\n```\n* Llama3-Instruct：\n```shell\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs\u002Fdeepspeed_zero3.yaml scripts\u002Frun_simpo.py training_configs\u002Fllama-3-8b-instruct-simpo.yaml\n```\n* 
Llama3-Instruct v0.2：\n```shell\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs\u002Fdeepspeed_zero3.yaml scripts\u002Frun_simpo.py training_configs\u002Fllama-3-8b-instruct-simpo-v2.yaml\n```\n\n## 评估\n\n我们遵循官方实现对 AlpacaEval 2、Arena-Hard 和 MT-Bench 进行评估，具体如下（更多细节请参阅 [eval 目录](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Ftree\u002Fmain\u002Feval)）：\n\n* AlpacaEval 2：请参考 [AlpacaEval 仓库](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval) 进行评估。\n* Arena-Hard：请参考 [Arena-Hard-Auto 仓库](https:\u002F\u002Fgithub.com\u002Flm-sys\u002Farena-hard-auto) 进行评估。\n* MT-Bench：请参考 [FastChat 仓库](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat) 进行评估。\n\n## 发现问题或有疑问？\n如果您对代码或论文有任何疑问，请随时发送邮件至 Yu（yumeng5@virginia.edu）。如果在使用代码时遇到任何问题，或希望报告 bug，请随时提交 issue！请尽量详细描述问题，以便我们能够更快更好地帮助您！\n\n## 引用\n如果您在工作中觉得本仓库有所帮助，请引用我们的论文（BibTeX 条目保留英文原文以便检索）：\n\n```bibtex\n@inproceedings{meng2024simpo,\n   title={SimPO: Simple Preference Optimization with a Reference-Free Reward},\n   author={Meng, Yu and Xia, Mengzhou and Chen, Danqi},\n   booktitle={Advances in Neural Information Processing Systems (NeurIPS)},\n   year={2024}\n}\n```","# SimPO 快速上手指南\n\nSimPO (Simple Preference Optimization) 是一种无需参考模型（Reference-Free）的偏好优化算法。相比 DPO，它更简单且高效，在 AlpacaEval 2、MT-Bench 和 Arena-Hard 等基准测试中表现优异。\n\n## 环境准备\n\n### 系统要求\n- **Python**: 建议 Python 3.10+\n- **GPU**: 推荐使用支持 CUDA 的 NVIDIA GPU（显存需求取决于模型大小，训练大模型建议 24GB+）\n- **CUDA**: 确保已安装与 PyTorch 版本匹配的 CUDA 驱动\n\n### 前置依赖\n项目提供了完整的 `environment.yml` 文件以确保实验可复现性。强烈建议使用该文件创建环境，以避免版本冲突。\n\n## 安装步骤\n\n1. **克隆仓库**\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO.git\n   cd SimPO\n   ```\n\n2. **创建并激活 Conda 环境**\n   使用官方提供的环境配置文件安装依赖：\n   ```bash\n   conda env create -f environment.yml\n   conda activate simpo\n   ```\n   *注：如果网络较慢，可在创建环境后手动配置 pip\u002Fconda 国内镜像源加速后续包的安装。*\n\n3. **验证安装**\n   确保核心库（如 `transformers`, `accelerate`, `trl` 等）已正确安装且版本匹配。\n\n## 基本使用\n\nSimPO 的核心在于调整关键超参数进行训练。以下是基于现有脚本启动训练的最简流程。\n\n### 1. 准备数据与配置\n确保你已准备好偏好数据集（包含 chosen 和 rejected 回答）。项目提供了针对 Llama3 和 Gemma 系列的示例配置文件，位于 `training_configs\u002F` 目录下。\n\n### 2. 关键超参数设置\n在运行训练前，请重点关注以下三个超参数（建议在固定总 batch size 为 128 的前提下调整）：\n\n- **`learning_rate`**: 最关键参数。过大（如 1e-5）会导致模型崩溃。\n  - 推荐网格搜索范围：`3e-7`, `5e-7`, `8e-7`, `1e-6`。\n  - 数学推理任务建议使用较小值（如 `5e-7`）。\n- **`beta`**: 控制奖励缩放。SimPO 需要的值远大于 DPO。\n  - 推荐起始值：`2.0` 或 `2.5`，部分场景下 `10` 效果更好。\n- **`gamma_beta_ratio`**: 控制目标奖励边际（即 $\gamma \u002F \beta$）。\n  - 推荐起始值：`0.5`，搜索范围 `0` 到 `1`。\n\n**参考配置表：**\n\n| 模型设定 | beta | gamma\u002Fbeta | learning_rate |\n| :--- | :--- | :--- | :--- |\n| Mistral-Base | 2.0 | 0.8 | 3e-7 |\n| Llama3-Instruct | 2.5 | 0.55 | 1e-6 |\n| Gemma-2-IT | 10 | 0.5 | 8e-7 |\n\n### 3. 启动训练\n训练需通过 accelerate 启动 `scripts\u002Frun_simpo.py`（与上文 README 的官方方式一致）。以下以 Llama3-Instruct v0.2 配置为例：\n\n```bash\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs\u002Fdeepspeed_zero3.yaml scripts\u002Frun_simpo.py training_configs\u002Fllama-3-8b-instruct-simpo-v2.yaml\n```\n\n*注意：若需添加 SFT 损失以防止推理能力遗忘，可在配置中将 `sft_weight` 设为正值（默认为 0）。*
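\n\n训练完成后，可以先做一次快速的推理冒烟测试（以下为示意代码，假设使用标准的 transformers 聊天模板 API；官方的加载方式请参考仓库中的 generate.py）：\n\n```python\n# 示意：加载已发布的 SimPO 检查点并生成一条回复\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nmodel_id = \"princeton-nlp\u002FLlama-3-Instruct-8B-SimPO-v0.2\"\ntokenizer = AutoTokenizer.from_pretrained(model_id)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_id, torch_dtype=torch.bfloat16, device_map=\"auto\"\n)\n\nmessages = [{\"role\": \"user\", \"content\": \"用一句话介绍 SimPO。\"}]\ninput_ids = tokenizer.apply_chat_template(\n    messages, add_generation_prompt=True, return_tensors=\"pt\"\n).to(model.device)\noutputs = model.generate(input_ids, max_new_tokens=128, do_sample=False)\nprint(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))\n```\n\n### 4. 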
评估注意事项\n- **Tokenizer 版本**: 对于 Llama3 模型，评估时（特别是在 AlpacaEval 2 和 Arena-Hard 上）请务必使用**更新前**的 Llama3 tokenizer，以确保 Prompt 中仅包含一个 BOS token，避免结果偏差。\n- **AlpacaEval 复现**: 若要复现论文中的 AlpacaEval 2 分数，必须使用 `alpaca-eval==0.6.2` 版本及仓库内 `eval\u002Falpacaeval2\u002Fconfigs` 下的配置文件。","某初创团队正在基于 Llama 3 构建垂直领域的智能客服助手，急需通过人类反馈强化学习（RLHF）提升回答的精准度与亲和力。\n\n### 没有 SimPO 时\n- **显存资源紧张**：传统 DPO 算法训练时必须额外加载一个庞大的参考模型（Reference Model），导致显存占用激增，团队被迫缩减批次大小或升级昂贵硬件。\n- **调参复杂低效**：参考模型的引入增加了超参数耦合度，工程师需花费数天时间平衡策略模型与参考模型的权重，难以快速收敛到最优状态。\n- **效果提升瓶颈**：在有限的计算资源下，模型在复杂指令遵循和长文本逻辑上表现平平，AlpacaEval 评测胜率长期停滞，无法满足上线标准。\n\n### 使用 SimPO 后\n- **大幅降低门槛**：SimPO 采用无参考奖励机制，直接移除了对参考模型的需求，显存占用显著下降，团队得以在单卡环境下使用更大批次进行训练。\n- **训练稳定快捷**：去除了冗余组件后，超参数调节变得直观简单，模型收敛速度加快，原本需要一周的迭代周期缩短至两天内完成。\n- **性能显著跃升**：得益于更高效的优化目标，微调后的客服模型在 AlpacaEval 2 和 Arena-Hard 基准测试中胜率大幅提升，回答更加自然流畅且逻辑严密。\n\nSimPO 通过精简架构消除了参考模型依赖，让中小团队也能以更低成本、更高效率打造出顶尖水平的对齐模型。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fprinceton-nlp_SimPO_3a5bb045.png","princeton-nlp","Princeton Natural Language Processing","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fprinceton-nlp_9459cd72.png","",null,"http:\u002F\u002Fnlp.cs.princeton.edu","https:\u002F\u002Fgithub.com\u002Fprinceton-nlp",[84,88],{"name":85,"color":86,"percentage":87},"Python","#3572A5",98.1,{"name":89,"color":90,"percentage":91},"Jinja","#a52a22",1.9,951,75,"2026-04-15T12:54:01","MIT","未说明","需要 NVIDIA GPU（提及 CUDA 版本差异会影响结果），具体显存大小未说明，但建议总 batch size 固定为 128",{"notes":99,"python":100,"dependencies":101},"1. 强烈建议使用项目提供的 environment.yml 文件创建环境以确保复现性。\n2. 超参数调优至关重要：学习率推荐在 3e-7 到 1e-6 之间搜索；SimPO 的 beta 值通常比 DPO 大得多（如 2.0-10）；gamma 建议通过 gamma\u002Fbeta 比率（0-1）进行调节。\n3. Llama3 模型评估时需注意分词器版本：必须使用更新前的 Llama3 tokenizer，确保 prompt 中仅包含一个 BOS token，否则会影响评测结果。\n4. 复现 AlpacaEval 2 结果必须使用 alpaca-eval==0.6.2 版本，更高版本会导致解码差异。\n5. 
训练数学推理任务时，建议使用较小的学习率（如 5e-7）。","未说明（需参考 environment.yml 文件）",[102,103,104,105,106],"alpaca-eval==0.6.2","vllm","transformers","torch","accelerate",[15],[109,110,111,112],"alignment","large-language-models","preference-alignment","rlhf","2026-03-27T02:49:30.150509","2026-04-20T04:05:11.608024",[116,121,126,131,136,141],{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},43638,"训练过程中出现模型崩溃（输出大量重复内容），且接受\u002F拒绝奖励迅速下降，该如何解决？","这通常与超参数设置有关。用户反馈在使用 `batch_size=128`, `max_step=1000`, `simpo_gamma=0.5`, `beta=2.5` 时出现了该问题。虽然维护者未直接给出固定参数，但社区讨论指出 SimPO + LoRA 可能在某些基准（如 AlpacaEval LC）上表现优于全量微调，建议尝试调整学习率或使用 LoRA 进行微调以观察是否改善稳定性。此外，需检查数据集格式是否正确，并确认是否在公共数据集（如 DPO-En-Zh-20k）上也复现了该问题以排除数据因素。","https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Fissues\u002F12",{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},43639,"无法复现论文中 Mistral-7B-Instruct-DPO 的结果，可能是什么原因？","主要原因可能是 Tokenizer 版本不一致。维护者确认之前使用了旧版本的 Tokenizer。解决方案是确保训练和评估时使用与 `mistral-common` 对齐的最新 Tokenizer 版本。此外，不同随机种子会导致结果略有波动（例如 LC Win Rate 在 30.0-30.7 之间，而论文报告为 32.1），且生成长度也可能稍长。建议使用官方更新的训练脚本并检查 Tokenizer 配置。","https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Fissues\u002F38",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},43640,"SimPO 结合 LoRA 微调的效果如何？是否有推荐参数？","社区用户反馈 SimPO + LoRA 效果良好。有用户使用 LoRA rank (dimension) 为 256，基于 GitHub 上传的 SFT checkpoint 对 llama-3-8b-base 进行训练，在 MT-Bench (GPT-4 评判) 上获得了约 7.8 分，与全量微调效果相当甚至更好。另一位用户指出，在使用 LoRA 微调时，通过调整学习率可以进一步提升效果。建议尝试 LoRA 维度 256 并配合学习率调优。","https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Fissues\u002F39",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},43641,"按照论文参数进行 SFT 训练却无法达到报告的评估结果，SFT 数据处理细节是什么？","对于多轮对话数据集（如 HuggingFaceH4\u002Fultrachat_200k），需要明确标签处理方式。维护者指出，对于 Llama3-Base 模型的 SFT 阶段，由于基础模型没有 `chat_template`，需要手动定义它；而在随后的偏好优化阶段，由于起始模型已是包含 `chat_template` 的 SFT 模型，则无需重新定义。请确保数据处理逻辑（如是否截取最后一轮回复作为目标）与官方实现一致。","https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Fissues\u002F27",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},43642,"定量复现论文结果时性能下降，特别是 AlpacaEval 得分较低，常见原因有哪些？","性能下降的主要原因通常是训练和评估过程中 BOS (Begin of Sentence) token 处理不一致。维护者更新了训练脚本以消除因 `trl` 包版本不同导致的 BOS token 处理差异。此外，在使用特定版本的 Llama3-Instruct tokenizer 进行评估（如 AlpacaEval 2 和 Arena-Hard）时，可能会错误地添加冗余的 BOS token。解决方案包括：1. 使用官方最新的训练脚本；2. 检查并确保训练与评估时的 BOS token 策略一致；3. 验证软件环境（如 vllm 和 alpaca_eval 版本）并移除不支持的参数（如 `torch_dtype: 'bfloat16'` 或 `do_sample: True`）。","https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Fissues\u002F21",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},43643,"使用 DeepSpeed Zero3 训练时显存占用过高（单卡 70GB），即使 batch size 设为 1 也无法缓解，如何解决？","该问题可能与特定的 DeepSpeed 配置或依赖包版本有关。虽然具体解决方案在提供的评论片段中被截断，但通常此类高显存占用问题可以通过以下方式排查：1. 检查 `accelerate_configs\u002Fdeepspeed_zero3.yaml` 配置文件，确保 `offload_optimizer` 或 `offload_param` 已正确启用以将部分状态卸载到 CPU；2. 确认使用的 `deepspeed` 和 `accelerate` 库版本与项目推荐版本一致；3. 尝试减小 `max_seq_length` 或检查是否加载了不必要的额外模块。建议参考项目中其他成功运行的配置示例。","https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FSimPO\u002Fissues\u002F9",[]]