[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-allenai--reward-bench":3,"tool-allenai--reward-bench":65},[4,17,27,35,48,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",150037,2,"2026-04-10T23:33:47",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":10,"last_commit_at":33,"category_tags":34,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 
都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,"2026-04-10T11:13:16",[26,43,44,45,14,46,15,13,47],"数据工具","视频","插件","其他","音频",{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":54,"last_commit_at":55,"category_tags":56,"status":16},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 
社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[15,43,46],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":54,"last_commit_at":63,"category_tags":64,"status":16},5773,"cs-video-courses","Developer-Y\u002Fcs-video-courses","cs-video-courses 是一个精心整理的计算机科学视频课程清单，旨在为自学者提供系统化的学习路径。它汇集了全球知名高校（如加州大学伯克利分校、新南威尔士大学等）的完整课程录像，涵盖从编程基础、数据结构与算法，到操作系统、分布式系统、数据库等核心领域，并深入延伸至人工智能、机器学习、量子计算及区块链等前沿方向。\n\n面对网络上零散且质量参差不齐的教学资源，cs-video-courses 解决了学习者难以找到成体系、高难度大学级别课程的痛点。该项目严格筛选内容，仅收录真正的大学层级课程，排除了碎片化的简短教程或商业广告，确保用户能接触到严谨的学术内容。\n\n这份清单特别适合希望夯实计算机基础的开发者、需要补充特定领域知识的研究人员，以及渴望像在校生一样系统学习计算机科学的自学者。其独特的技术亮点在于分类极其详尽，不仅包含传统的软件工程与网络安全，还细分了生成式 AI、大语言模型、计算生物学等新兴学科，并直接链接至官方视频播放列表，让用户能一站式获取高质量的教育资源，免费享受世界顶尖大学的课堂体验。",79792,"2026-04-08T22:03:59",[46,26,43,13],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":71,"readme_en":72,"readme_zh":73,"quickstart_zh":74,"use_case_zh":75,"hero_image_url":76,"owner_login":77,"owner_name":78,"owner_avatar_url":79,"owner_bio":80,"owner_company":81,"owner_location":81,"owner_email":82,"owner_twitter":81,"owner_website":83,"owner_url":84,"languages":85,"stars":101,"forks":102,"last_commit_at":103,"license":104,"difficulty_score":10,"env_os":105,"env_gpu":106,"env_ram":105,"env_deps":107,"category_tags":115,"github_topics":116,"view_count":10,"oss_zip_url":81,"oss_zip_packed_at":81,"status":16,"created_at":119,"updated_at":120,"faqs":121,"releases":151},4969,"allenai\u002Freward-bench","reward-bench","RewardBench: the first evaluation tool for reward models.","RewardBench 是首个专为评估“奖励模型”（Reward Models）设计的开源基准测试工具。在大语言模型的对齐训练中，奖励模型负责判断回答的优劣，但长期以来缺乏统一、公平的评估标准。RewardBench 正是为了解决这一痛点而生，它提供了一套标准化的数据集和评测流程，能够客观衡量不同奖励模型在能力与安全性的表现。\n\n无论是采用传统训练方式，还是基于直接偏好优化（DPO）、KTO 等新技术的隐式奖励模型，都能通过 RewardBench 进行公平对比。其技术亮点在于内置了多样化的推理代码，支持 Starling、PairRM 等多种主流模型架构，并针对生成式模型提供了灵活的评分与排序机制。最新的 V2 版本更引入了复杂的\"Best-of-4\"及多选项（Ties）测试场景，显著提升了评估难度与区分度。\n\n这款工具非常适合 AI 研究人员、大模型开发者以及算法工程师使用。如果你正在训练自己的奖励模型，或需要为项目选型最合适的对齐方案，RewardBench 
能帮助你快速验证效果、分析短板，从而更高效地推动模型迭代。通过统一的评测框架，它让社区内的模型比较变得透明且可复现，是推动","RewardBench 是首个专为评估“奖励模型”（Reward Models）设计的开源基准测试工具。在大语言模型的对齐训练中，奖励模型负责判断回答的优劣，但长期以来缺乏统一、公平的评估标准。RewardBench 正是为了解决这一痛点而生，它提供了一套标准化的数据集和评测流程，能够客观衡量不同奖励模型在能力与安全性的表现。\n\n无论是采用传统训练方式，还是基于直接偏好优化（DPO）、KTO 等新技术的隐式奖励模型，都能通过 RewardBench 进行公平对比。其技术亮点在于内置了多样化的推理代码，支持 Starling、PairRM 等多种主流模型架构，并针对生成式模型提供了灵活的评分与排序机制。最新的 V2 版本更引入了复杂的\"Best-of-4\"及多选项（Ties）测试场景，显著提升了评估难度与区分度。\n\n这款工具非常适合 AI 研究人员、大模型开发者以及算法工程师使用。如果你正在训练自己的奖励模型，或需要为项目选型最合适的对齐方案，RewardBench 能帮助你快速验证效果、分析短板，从而更高效地推动模型迭代。通过统一的评测框架，它让社区内的模型比较变得透明且可复现，是推动大模型对齐技术发展的重要基础设施。","\u003Cdiv align=\"center\">\n  \u003Ch1>RewardBench: Evaluating Reward Models\u003C\u002Fh1>\n  \u003Cp> V2 (\u003Cstrong>NEW!\u003C\u002Fstrong>):\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fallenai\u002Freward-bench\">Leaderboard\u003C\u002Fa> 📐 |\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Freward-bench-2\">Eval. Dataset\u003C\u002Fa> |\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Freward-bench-2-results\">Results\u003C\u002Fa> 📊 | \n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fallenai\u002Freward-bench-2-683d2612a4b3e38a3e53bb51\">Trained Models\u003C\u002Fa> 🏆 | \n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01937\"> Paper📝 \u003C\u002Fa>\n\u003C\u002Fp>\n\n  \u003Cp> V1:\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fallenai\u002Freward-bench\">Leaderboard\u003C\u002Fa> 📐 |\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Freward-bench\">Eval. 
Dataset\u003C\u002Fa> |\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Fpreference-test-sets\">Existing Test Sets\u003C\u002Fa> |\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Freward-bench-results\">Results\u003C\u002Fa> 📊 |\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.13787\"> Paper📝\u003C\u002Fa>\n\u003C\u002Fp>\n  \u003Cimg width=\"1280\" alt=\"Github RewardBench Logo\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fallenai_reward-bench_readme_d73af2e3db78.png\" style=\"margin-left:'auto' margin-right:'auto' display:'block' \"\u002F>\n\u003C\u002Fdiv>\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fblob\u002Fmain\u002FLICENSE\">\n    \u003Cimg alt=\"GitHub License\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fallenai\u002Freward-bench\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Frewardbench\u002F\">\n    \u003Cimg alt=\"PyPI\" src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Frewardbench\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\n---\n\n**RewardBench** is a benchmark designed to evaluate the capabilities and safety of reward models (including those trained with Direct Preference Optimization, DPO).\nThe repository includes the following:\n* Common inference code for a variety of reward models (Starling, PairRM, OpenAssistant, DPO, and more).\n* Common dataset formatting and tests for fair reward model inference.\n* Analysis and visualization tools.\n\nThe three primary scripts to generate results (more in `scripts\u002F`):\n1. `scripts\u002Frun_rm.py`: Run evaluations for reward models.\n2. `scripts\u002Frun_dpo.py`: Run evaluations for direct preference optimization (DPO) models (and other models using implicit rewards, such as KTO).\n3. 
`scripts\u002Frun_v2.py`: Run evaluations for RewardBench 2, with special data handling for best-of-4 and Ties data.\n\n## Quick Usage\nRewardBench lets you quickly evaluate any reward model on any preference set.\nIt also detects when an instruction dataset is passed (one that has `messages` but no `chosen`\u002F`rejected`); for these, only the model outputs are logged (not accuracy).\n\n### Installation\n\n**With UV (recommended):**\n```bash\nuv pip install rewardbench\n\n# For generative models (LLM-as-judge, vLLM, API providers)\nuv pip install rewardbench[generative]\n```\n\n**With pip:**\n```bash\npip install rewardbench\n\n# For generative models\npip install rewardbench[generative]\n```\n\n**For development:**\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench.git\ncd reward-bench\nuv sync                      # base install\nuv sync --extra generative   # with generative support\n```\n**To run RewardBench 2, you can run the following command, substituting the model you would like to run and adding any additional model-specific parameters, which can be found in the [eval configs](https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fblob\u002Fmain\u002Fscripts\u002Fconfigs\u002Feval_configs.yaml) in `scripts\u002Fconfigs\u002Feval_configs.yaml`**\n```\npython scripts\u002Frun_v2.py --model={yourmodel}\n```\n\nGenerative models can be run on RewardBench 2 either with a rankings-based prompt (comparing 4 responses in one go, the default) or a ratings-based prompt (scoring each response separately then recombining, run with the `--score_w_ratings` flag). Note that our Ties subset, new in RewardBench 2, can have 20 or more completions to score per prompt, so the code enforces that it runs in the ratings setting. For more information, see `scripts\u002Frun_generative_v2.py`. 
To add a custom prompt for your model, feel free to open a PR.\n```\npython scripts\u002Frun_generative_v2.py --model={yourmodel}\n```\n\nOr, to run RewardBench instead, run the following:\n```\nrewardbench --model={yourmodel} --dataset={yourdataset} --batch_size=8\n```\nFor a DPO model, pass --ref_model={} and the script will automatically route there.\nAutomatically uses Tokenizers chat templates, but can also use fastchat conv templates.\n\nTo run the core Reward Bench evaluation set, run:\n```\nrewardbench --model={yourmodel}\n```\n\nExamples:\n1. Normal operation\n```\nrewardbench --model=OpenAssistant\u002Freward-model-deberta-v3-large-v2 --dataset=allenai\u002Fultrafeedback_binarized_cleaned --split=test_gen --chat_template=raw\n```\n2. DPO model from local dataset (note `--load_json`)\n```\nrewardbench --model=Qwen\u002FQwen1.5-0.5B-Chat --ref_model=Qwen\u002FQwen1.5-0.5B --dataset=\u002Fnet\u002Fnfs.cirrascale\u002Fallennlp\u002Fjacobm\u002Fherm\u002Fdata\u002Fberkeley-nectar-binarized-preferences-random-rejected.jsonl --load_json\n```\n\n**Generative RMs** can be run after installing with `[generative]` extra (see Installation above):\n```\nrewardbench-gen --model={}\n```\nFor more information, see `scripts\u002Frun_generative.py`.\nLocal models require vLLM. API models support OpenAI, Anthropic, Google Gemini, and Together.\n\n### Logging\n\nThe CLI comes with multiple advanced saving features for **model outputs** and **accuracy scores**. 
\nThese can be tied in metadata to reward models you own or uploaded as separate datasets to HuggingFace, such as for rejection sampling.\nFor example, the following command does both:\n```\nrewardbench --model vwxyzjn\u002Freward_modeling__EleutherAI_pythia-14m --batch_size 128 --tokenizer=EleutherAI\u002Fpythia-14m --push_results_to_hub --upload_model_metadata_to_hf --chat_template raw\n```\nOr, for an instruction dataset:\n```\nrewardbench --model vwxyzjn\u002Freward_modeling__EleutherAI_pythia-14m --dataset HuggingFaceH4\u002Fno_robots --split test --batch_size 128 --tokenizer=EleutherAI\u002Fpythia-14m --push_results_to_hub --chat_template raw\n```\n(Note that chat templates only need to be specified for older models.)\n\nThe key flags are:\n* `--push_results_to_hub`, which uploads a dataset of scores and correctness.\n* `--upload_model_metadata_to_hf`, which adds results directly to the model.\n\nFor an example of a model with accuracy metadata, look [here](https:\u002F\u002Fhuggingface.co\u002Fvwxyzjn\u002Frm_zephyr_new).\nFor an example of the outputs from a preference dataset, look [here](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fnatolambert\u002Frewardbench_eval_2339270924_2339270924), and for instructions, look [here](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fnatolambert\u002Frewardbench_eval_0329290924).\n\nThis currently only works with DPO models for preference datasets, such as:\n```\nrewardbench --model Qwen\u002FQwen1.5-0.5B-Chat --ref_model Qwen\u002FQwen1.5-0.5B  --batch_size 128 --tokenizer=EleutherAI\u002Fpythia-14m --push_results_to_hub --upload_model_metadata_to_hf --chat_template raw\n```\nOpen an issue if you would like complete functionality.\n\n## Full Installation\nTo install from source, please install `torch` on your system, and then install the following requirements.\n```\npip install -e .\n```\nOptionally, for generative scripts, run:\n```\npip install -e \".[generative]\"\n```\nAdd the following to your 
`.bashrc`:\n```\nexport HF_TOKEN=\"{your_token}\"\n```\n\n## Training\n\nFor training, we recommend using [`open-instruct`](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fopen-instruct).\n\n## Contribute Your Model\n\nFor now, in order to contribute your model to the leaderboard, open an issue with the model name on HuggingFace (you can still evaluate local models with RewardBench, see below).\nIf custom code is needed, please open a PR that enables it in our inference stack (see [`rewardbench\u002Fmodels`](https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Ftree\u002Fmain\u002Frewardbench\u002Fmodels) for more information).\n\n# Evaluating Models\n\nFor reference configs, see `scripts\u002Fconfigs\u002Feval_configs.yaml`.\nFor reference on Chat Templates, many models follow the base \u002F sft model terminology [here](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat\u002Fblob\u002Fmain\u002Ffastchat\u002Fconversation.py).\nA small model for debugging is available at `natolambert\u002Fgpt2-dummy-rm`.\n\nThe core scripts automatically evaluate our core evaluation set. 
To run these on [existing preference sets](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Fpref-test-sets), add the argument `--pref_sets`.\n\n## Running Reward Models\n\nTo run individual models with `scripts\u002Frun_rm.py`, use any of the following examples:\n```\npython scripts\u002Frun_rm.py --model=openbmb\u002FUltraRM-13b --chat_template=openbmb --batch_size=8\npython scripts\u002Frun_rm.py --model=OpenAssistant\u002Foasst-rm-2.1-pythia-1.4b-epoch-2.5 --chat_template=oasst_pythia\npython scripts\u002Frun_rm.py --model=PKU-Alignment\u002Fbeaver-7b-v1.0-cost --chat_template=pku-align --batch_size=16\npython scripts\u002Frun_rm.py --model=IDEA-CCNL\u002FZiya-LLaMA-7B-Reward --batch_size=32 --trust_remote_code --chat_template=Ziya\n```\n\nTo run these models with AI2 infrastructure, run:\n```\npython scripts\u002Fsubmit_eval_jobs.py\n```\nOr for example, the best of N sweep on the non-default image:\n```\npython scripts\u002Fsubmit_eval_jobs.py --eval_on_bon --image=nathanl\u002Fherm_bon\n``` \nNote: for AI2 users, you must set `beaker secret write HF_TOKEN \u003Cyour_write_token_here>` to make the scripts work.\n\nModels using the default abstraction `AutoModelForSequenceClassification.from_pretrained` can also be loaded locally. Expanding this functionality is TODO. E.g.\n```\npython scripts\u002Frun_rm.py --model=\u002Fnet\u002Fnfs.cirrascale\u002Fallennlp\u002Fhamishi\u002FEasyLM\u002Frm_13b_3ep --chat_template=tulu --batch_size=8\n```\n\n## Running DPO Models\n\nAnd for DPO:\n```\npython scripts\u002Frun_dpo.py --model=stabilityai\u002Fstablelm-zephyr-3b --ref_model=stabilityai\u002Fstablelm-3b-4e1t --batch_size=8\npython scripts\u002Frun_dpo.py --model=stabilityai\u002Fstablelm-2-zephyr-1_6b --ref_model=stabilityai\u002Fstablelm-2-1_6b --batch_size=16\n```\n\n## Ensembling RMs\nFor reward models already in RewardBench, you can run an offline ensemble test to approximate using multiple reward models in your system. 
To try this, you can run:\n```\npython analysis\u002Frun_ensemble_offline.py --models sfairXC\u002FFsfairX-LLaMA3-RM-v0.1 openbmb\u002FEurus-RM-7b Nexusflow\u002FStarling-RM-34B\n```\n\n## Running Generative RMs (LLM-as-a-judge)\nLocal and API models are supported. For example, run OpenAI's models like:\n```\npython scripts\u002Frun_generative.py --model=gpt-3.5-turbo-0125\n```\nLocal models are loaded from HuggingFace, though some are also available via Together's API. Run Llama 3 locally with:\n```\npython scripts\u002Frun_generative.py --model=meta-llama\u002FLlama-3-70b-chat-hf --force_local\n```\nOr, via Together's API:\n```\npython scripts\u002Frun_generative.py --model=meta-llama\u002FLlama-3-70b-chat-hf\n```\n\nWe are adding support for generative ensembles (only via API for now), run with:\n```\npython scripts\u002Frun_generative.py --model gpt-3.5-turbo-0125 claude-3-sonnet-20240229 meta-llama\u002FLlama-3-70b-chat-hf\n```\nNote: the number of models must be odd and greater than 1.\n\n## Creating Best of N (BoN) rankings\n\nTo create the ranking across the dataset, run (best_of 8 being placeholder, 16 should be fine as eval logic will handle lower best of N numbers):\n```\npython scripts\u002Frun_bon.py --model=OpenAssistant\u002Foasst-rm-2.1-pythia-1.4b-epoch-2.5 --chat_template=oasst_pythia --best_of=8 --debug\n```\n## Getting Leaderboard Section Scores\n\n**Important**: We use prompt-weighted scores for the sections Chat, Chat Hard, Safety, and Reasoning (with math equalized to code here) to avoid assigning too much credit to small subsets (e.g. MT Bench ones). 
Use the following code to compute the scores for each category, assuming `RewardBench` is installed:\n```\nfrom rewardbench.constants import EXAMPLE_COUNTS, SUBSET_MAPPING\nfrom rewardbench.utils import calculate_scores_per_section\n\nmetrics = {\n  \"alpacaeval-easy\": 0.5,\n  \"alpacaeval-hard\": 0.7052631578947368,\n  \"alpacaeval-length\": 0.5894736842105263,\n  \"chat_template\": \"tokenizer\",\n  \"donotanswer\": 0.8235294117647058,\n  \"hep-cpp\": 0.6280487804878049,\n  \"hep-go\": 0.6341463414634146,\n  \"hep-java\": 0.7073170731707317,\n  \"hep-js\": 0.6646341463414634,\n  \"hep-python\": 0.5487804878048781,\n  \"hep-rust\": 0.6463414634146342,\n  \"llmbar-adver-GPTInst\": 0.391304347826087,\n  \"llmbar-adver-GPTOut\": 0.46808510638297873,\n  \"llmbar-adver-manual\": 0.3695652173913043,\n  \"llmbar-adver-neighbor\": 0.43283582089552236,\n  \"llmbar-natural\": 0.52,\n  \"math-prm\": 0.2953020134228188,\n  \"model\": \"PKU-Alignment\u002Fbeaver-7b-v1.0-cost\",\n  \"model_type\": \"Seq. 
Classifier\",\n  \"mt-bench-easy\": 0.5714285714285714,\n  \"mt-bench-hard\": 0.5405405405405406,\n  \"mt-bench-med\": 0.725,\n  \"refusals-dangerous\": 0.97,\n  \"refusals-offensive\": 1,\n  \"xstest-should-refuse\": 1,\n  \"xstest-should-respond\": 0.284\n}\n\n# Calculate and print the scores per section\nscores_per_section = calculate_scores_per_section(EXAMPLE_COUNTS, SUBSET_MAPPING, metrics)\nprint(scores_per_section)\n```\n\n## Repository structure\n\n```\n├── README.md                   \u003C- The top-level README for researchers using this project\n├── analysis\u002F                   \u003C- Directory of tools to analyze RewardBench results or other reward model properties\n├── rewardbench\u002F                \u003C- Core utils and modeling files\n|   ├── models\u002F                     ├── Standalone files for running existing reward models\n|   └── *.py                        └── RewardBench tools and utilities\n├── scripts\u002F                    \u003C- Scripts and configs to evaluate reward models\n├── tests                       \u003C- Unit tests\n├── Dockerfile                  \u003C- Build file for reproducible and scaleable research at AI2\n├── LICENSE\n├── Makefile                    \u003C- Makefile with commands like `make style`\n└── setup.py                    \u003C- Makes project pip installable (pip install -e .) so `alignment` can be imported\n```\n\n## Maintenance\n\nThis section is designed for AI2 usage, but may help others evaluating models with Docker.\n\n### Docker Images\n\nTwo Docker images are available:\n\n| Image | Dockerfile | Use Case | Build Time |\n|-------|------------|----------|------------|\n| `rewardbench` | `Dockerfile` | Reward models, API-based LLM judges | ~5-10 min |\n| `rewardbench-vllm` | `Dockerfile.vllm` | Local LLM inference via vLLM | ~45 min |\n\nThe base image uses torch ≤2.8 with prebuilt flash-attn wheels. 
The vllm image uses torch 2.9 (required by vllm) and builds flash-attn from source.\n\nTo build locally:\n```bash\n# Base image (fast)\ndocker build -t rewardbench . --platform linux\u002Famd64\n\n# vLLM image (slow, includes local LLM inference)\ndocker build -f Dockerfile.vllm -t rewardbench-vllm . --platform linux\u002Famd64\n```\n\nImages are automatically built and pushed to Beaker on merge to main:\n- `nathanl\u002Frewardbench_auto`: Base image\n- `nathanl\u002Frewardbench_vllm_auto`: vLLM image\n\n## Citation\nPlease cite our work with the following:\n```\n@misc{lambert2024rewardbench,\n      title={RewardBench: Evaluating Reward Models for Language Modeling}, \n      author={Nathan Lambert and Valentina Pyatkin and Jacob Morrison and LJ Miranda and Bill Yuchen Lin and Khyathi Chandu and Nouha Dziri and Sachin Kumar and Tom Zick and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi},\n      year={2024},\n      eprint={2403.13787},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG}\n}\n```\n\n```\n@misc{malik2025rewardbench2advancingreward,\n      title={RewardBench 2: Advancing Reward Model Evaluation}, \n      author={Saumya Malik and Valentina Pyatkin and Sander Land and Jacob Morrison and Noah A. 
Smith and Hannaneh Hajishirzi and Nathan Lambert},\n      year={2025},\n      eprint={2506.01937},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01937}, \n}\n```\n","\u003Cdiv align=\"center\">\n  \u003Ch1>RewardBench：评估奖励模型\u003C\u002Fh1>\n  \u003Cp> V2（\u003Cstrong>全新！\u003C\u002Fstrong>）：\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fallenai\u002Freward-bench\">排行榜\u003C\u002Fa> 📐 |\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Freward-bench-2\">评估数据集\u003C\u002Fa> |\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Freward-bench-2-results\">结果\u003C\u002Fa> 📊 | \n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fallenai\u002Freward-bench-2-683d2612a4b3e38a3e53bb51\">训练好的模型\u003C\u002Fa> 🏆 | \n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01937\">论文📝 \u003C\u002Fa>\n\u003C\u002Fp>\n\n  \u003Cp> V1：\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fallenai\u002Freward-bench\">排行榜\u003C\u002Fa> 📐 |\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Freward-bench\">评估数据集\u003C\u002Fa> |\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Fpreference-test-sets\">现有测试集\u003C\u002Fa> |\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Freward-bench-results\">结果\u003C\u002Fa> 📊 |\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.13787\">论文📝\u003C\u002Fa>\n\u003C\u002Fp>\n  \u003Cimg width=\"1280\" alt=\"Github RewardBench Logo\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fallenai_reward-bench_readme_d73af2e3db78.png\" style=\"margin-left:'auto' margin-right:'auto' display:'block' \"\u002F>\n\u003C\u002Fdiv>\n\u003Cp align=\"center\">\n  \u003Ca 
href=\"https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fblob\u002Fmain\u002FLICENSE\">\n    \u003Cimg alt=\"GitHub License\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fallenai\u002Freward-bench\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Frewardbench\u002F\">\n    \u003Cimg alt=\"PyPI\" src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Frewardbench\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\n---\n\n**RewardBench** 是一个用于评估奖励模型能力与安全性的基准测试平台，涵盖通过直接偏好优化（DPO）等方法训练的模型。该仓库包含以下内容：\n* 针对多种奖励模型（如 Starling、PairRM、OpenAssistant、DPO 等）的通用推理代码。\n* 用于公平评估奖励模型的通用数据格式化与测试工具。\n* 分析与可视化工具。\n\n以下是生成结果的三个主要脚本（更多脚本位于 `scripts\u002F` 目录中）：\n1. `scripts\u002Frun_rm.py`：运行奖励模型的评估。\n2. `scripts\u002Frun_dpo.py`：运行直接偏好优化（DPO）模型及其他使用隐式奖励的模型（如 KTO）的评估。\n3. `scripts\u002Frun_v2.py`：运行 RewardBench 2 的评估，特别处理四选一及平局数据。\n\n## 快速使用\nRewardBench 可让您快速评估任意奖励模型在任何偏好数据集上的表现。它还能检测是否传入了指令数据集（通过检查是否存在 `chosen`\u002F`rejected` 字段以及 `messages` 字段），对于这类数据集，仅会记录模型输出，而不计算准确率。\n\n### 安装\n\n**推荐使用 UV：**\n```bash\nuv pip install rewardbench\n\n# 对于生成式模型（LLM-as-judge、vLLM、API 提供者）\nuv pip install rewardbench[generative]\n```\n\n**使用 pip：**\n```bash\npip install rewardbench\n\n# 对于生成式模型\npip install rewardbench[generative]\n```\n\n**开发模式：**\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench.git\ncd reward-bench\nuv sync                      # 基础安装\nuv sync --extra generative   # 添加生成式支持\n```\n**要运行 RewardBench 2，您可以执行以下命令，替换为您想要运行的模型，并添加任何特定于模型的参数，这些参数可在 `scripts\u002Fconfigs\u002Feval_configs.yaml` 文件中的 [评估配置](https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fblob\u002Fmain\u002Fscripts\u002Fconfigs\u002Feval_configs.yaml) 中找到：**\n```\npython scripts\u002Frun_v2.py --model={yourmodel}\n```\n\n生成式模型可以在 RewardBench 2 上以排名式提示（一次比较 4 条响应，默认设置）或评分式提示（分别对每条响应打分后再综合，使用 `--score_w_ratings` 标志运行）方式运行。请注意，RewardBench 2 新增的“平局”子集每个提示最多有 20 多条完成内容需要评分，因此代码强制要求以评分模式运行。更多信息请参阅 
`scripts\u002Frun_generative_v2.py`。如需为您的模型添加自定义提示，欢迎提交 PR。\n```\npython scripts\u002Frun_generative_v2.py --model={yourmodel}\n```\n\n或者，若要运行 RewardBench，则可执行以下命令：\n```\nrewardbench --model={yourmodel} --dataset={yourdataset} --batch_size=8\n```\n对于 DPO 模型，请传入 `--ref_model={}`，脚本将自动进行相应处理。默认使用 Tokenizers 的聊天模板，也可选择 fastchat 的对话模板。\n\n要运行核心 Reward Bench 评估集，只需执行：\n```\nrewardbench --model={yourmodel}\n```\n\n示例：\n1. 正常操作\n```\nrewardbench --model=OpenAssistant\u002Freward-model-deberta-v3-large-v2 --dataset=allenai\u002Fultrafeedback_binarized_cleaned --split=test_gen --chat_template=raw\n```\n2. 使用本地数据集的 DPO 模型（注意 `--load_json` 参数）\n```\nrewardbench --model=Qwen\u002FQwen1.5-0.5B-Chat --ref_model=Qwen\u002FQwen1.5-0.5B --dataset=\u002Fnet\u002Fnfs.cirrascale\u002Fallennlp\u002Fjacobm\u002Fherm\u002Fdata\u002Fberkeley-nectar-binarized-preferences-random-rejected.jsonl --load_json\n```\n\n**生成式奖励模型**可在安装时加上 `[generative]` 选项后运行（见上文安装部分）：\n```\nrewardbench-gen --model={}\n```\n更多信息请参阅 `scripts\u002Frun_generative.py`。本地模型需要 vLLM 支持，而 API 模型则支持 OpenAI、Anthropic、Google Gemini 和 Together。\n\n### 日志记录\n\nCLI 提供多项高级保存功能，可用于记录 **模型输出** 和 **准确率分数**。这些日志可以与您拥有的奖励模型元数据关联，或作为单独的数据集上传至 HuggingFace，例如用于拒绝采样场景。例如，以下命令同时完成这两项操作：\n```\nrewardbench --model vwxyzjn\u002Freward_modeling__EleutherAI_pythia-14m --batch_size 128 --tokenizer=EleutherAI\u002Fpythia-14m --push_results_to_hub --upload_model_metadata_to_hf --chat_template raw\n```\n或者，针对指令数据集：\n```\nrewardbench --model vwxyzjn\u002Freward_modeling__EleutherAI_pythia-14m --dataset HuggingFaceH4\u002Fno_robots --split test --batch_size 128 --tokenizer=EleutherAI\u002Fpythia-14m --push_results_to_hub --chat_template raw\n```\n（注意，仅较旧的模型才需要指定聊天模板）\n\n关键命令包括：\n* `--push_results_to_hub`：将分数和正确性结果上传为数据集。\n* `--upload_model_metadata_to_hf`：直接将结果添加到模型元数据中。\n\n有关带有准确率元数据的模型示例，请参阅 [此处](https:\u002F\u002Fhuggingface.co\u002Fvwxyzjn\u002Frm_zephyr_new)。关于偏好数据集输出的示例，请参阅 
[此处](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fnatolambert\u002Frewardbench_eval_2339270924_2339270924)，相关说明请参阅 [此处](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fnatolambert\u002Frewardbench_eval_0329290924)。\n\n目前，此功能仅适用于偏好数据集中的 DPO 模型，例如：\n```\nrewardbench --model Qwen\u002FQwen1.5-0.5B-Chat --ref_model Qwen\u002FQwen1.5-0.5B  --batch_size 128 --tokenizer=EleutherAI\u002Fpythia-14m --push_results_to_hub --upload_model_metadata_to_hf --chat_template raw\n```\n如需完整功能，请提交问题。\n\n## 完整安装\n要从源代码安装，请先在您的系统上安装 `torch`，然后安装以下依赖项。\n```\npip install -e .\n```\n可选地，对于生成脚本，运行：\n```\npip install -e \".[generative]\"\n```\n将以下内容添加到您的 `.bashrc` 文件中：\n```\nexport HF_TOKEN=\"{your_token}\"\n```\n\n## 训练\n\n对于训练，我们建议使用 [`open-instruct`](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fopen-instruct)。\n\n## 贡献您的模型\n\n目前，为了将您的模型提交到排行榜，您需要在 HuggingFace 上以模型名称创建一个议题（您仍然可以使用 RewardBench 评估本地模型，详见下文）。如果需要自定义代码，请提交一个 Pull Request，在我们的推理栈中启用该功能（更多信息请参阅 [`rewardbench\u002Fmodels`](https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Ftree\u002Fmain\u002Frewardbench\u002Fmodels)）。\n\n# 模型评估\n\n参考配置文件请见 `scripts\u002Fconfigs\u002Feval_configs.yaml`。关于聊天模板的参考，许多模型遵循 [这里](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat\u002Fblob\u002Fmain\u002Ffastchat\u002Fconversation.py) 的基础或 SFT 模型术语。一个用于调试的小型模型可在 `natolambert\u002Fgpt2-dummy-rm` 找到。\n\n核心脚本会自动评估我们的核心评估集。要在 [现有偏好数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fallenai\u002Fpref-test-sets) 上运行这些脚本，需添加参数 `--pref_sets`。\n\n## 运行奖励模型\n\n要使用 `scripts\u002Frun_rm.py` 运行单个模型，可以使用以下示例之一：\n```\npython scripts\u002Frun_rm.py --model=openbmb\u002FUltraRM-13b --chat_template=openbmb --batch_size=8\npython scripts\u002Frun_rm.py --model=OpenAssistant\u002Foasst-rm-2.1-pythia-1.4b-epoch-2.5 --chat_template=oasst_pythia\npython scripts\u002Frun_rm.py --model=PKU-Alignment\u002Fbeaver-7b-v1.0-cost --chat_template=pku-align --batch_size=16\npython scripts\u002Frun_rm.py --model=IDEA-CCNL\u002FZiya-LLaMA-7B-Reward 
--batch_size=32 --trust_remote_code --chat_template=Ziya\n```\n\n要在 AI2 基础设施上运行这些模型，执行：\n```\npython scripts\u002Fsubmit_eval_jobs.py\n```\n或者，例如，在非默认镜像上进行 N 次最佳选择评估：\n```\npython scripts\u002Fsubmit_eval_jobs.py --eval_on_bon --image=nathanl\u002Fherm_bon\n```\n\n注意：对于 AI2 用户，必须运行 `beaker secret write HF_TOKEN \u003Cyour_write_token_here>`，才能使脚本正常工作。\n\n使用默认抽象 `AutoModelForSequenceClassification.from_pretrained` 的模型也可以在本地加载。扩展此功能目前是待办事项。例如：\n```\npython scripts\u002Frun_rm.py --model=\u002Fnet\u002Fnfs.cirrascale\u002Fallennlp\u002Fhamishi\u002FEasyLM\u002Frm_13b_3ep --chat_template=tulu --batch_size=8\n```\n\n## 运行 DPO 模型\n\n对于 DPO：\n```\npython scripts\u002Frun_dpo.py --model=stabilityai\u002Fstablelm-zephyr-3b --ref_model=stabilityai\u002Fstablelm-3b-4e1t --batch_size=8\npython scripts\u002Frun_dpo.py --model=stabilityai\u002Fstablelm-2-zephyr-1_6b --ref_model=stabilityai\u002Fstablelm-2-1_6b --batch_size=16\n```\n\n## 奖励模型集成\n对于已加入 RewardBench 的奖励模型，您可以运行离线集成测试，以近似在您的系统中使用多个奖励模型的效果。要尝试此操作，可以运行：\n```\npython analysis\u002Frun_ensemble_offline.py --models sfairXC\u002FFsfairX-LLaMA3-RM-v0.1 openbmb\u002FEurus-RM-7b Nexusflow\u002FStarling-RM-34B\n```\n\n## 运行生成式奖励模型（LLM 作为评判者）\n支持本地和 API 模型。例如，运行 OpenAI 的模型：\n```\npython scripts\u002Frun_generative.py --model=gpt-3.5-turbo-0125\n```\n\n本地模型从 HuggingFace 加载，但部分模型也可通过 Together 的 API 获取。要在本地运行 Llama 3：\n```\npython scripts\u002Frun_generative.py --model=meta-llama\u002FLlama-3-70b-chat-hf --force_local\n```\n\n或者，通过 Together 的 API 运行：\n```\npython scripts\u002Frun_generative.py --model=meta-llama\u002FLlama-3-70b-chat-hf\n```\n\n我们正在增加对生成式集成的支持（目前仅通过 API），运行方式如下：\n```\npython scripts\u002Frun_generative.py --model gpt-3.5-turbo-0125 claude-3-sonnet-20240229 meta-llama\u002FLlama-3-70b-chat-hf\n```\n\n注意：这些模型的数量必须是大于 1 的奇数。\n\n## 创建 N 次最佳选择排名\n\n要为整个数据集创建排名，运行（best_of 8 仅为占位符，16 应该足够，因为评估逻辑会处理较低的 best of N 数值）：\n```\npython scripts\u002Frun_bon.py --model=OpenAssistant\u002Foasst-rm-2.1-pythia-1.4b-epoch-2.5 
--chat_template=oasst_pythia --best_of=8 --debug\n```\n\n## 获取排行榜各部分得分\n\n**重要提示**：我们使用按提示加权的分数来计算聊天、困难聊天、安全性和推理（此处数学与代码同等对待）等部分的得分，以避免过多地偏向于小规模子集（如 MT Bench 子集）。假设已安装 `RewardBench`，可以使用以下代码计算每个类别得分：\n```\nfrom rewardbench.constants import EXAMPLE_COUNTS, SUBSET_MAPPING\nfrom rewardbench.utils import calculate_scores_per_section\n\nmetrics = {\n  \"alpacaeval-easy\": 0.5,\n  \"alpacaeval-hard\": 0.7052631578947368,\n  \"alpacaeval-length\": 0.5894736842105263,\n  \"chat_template\": \"tokenizer\",\n  \"donotanswer\": 0.8235294117647058,\n  \"hep-cpp\": 0.6280487804878049,\n  \"hep-go\": 0.6341463414634146,\n  \"hep-java\": 0.7073170731707317,\n  \"hep-js\": 0.6646341463414634,\n  \"hep-python\": 0.5487804878048781,\n  \"hep-rust\": 0.6463414634146342,\n  \"llmbar-adver-GPTInst\": 0.391304347826087,\n  \"llmbar-adver-GPTOut\": 0.46808510638297873,\n  \"llmbar-adver-manual\": 0.3695652173913043,\n  \"llmbar-adver-neighbor\": 0.43283582089552236,\n  \"llmbar-natural\": 0.52,\n  \"math-prm\": 0.2953020134228188,\n  \"model\": \"PKU-Alignment\u002Fbeaver-7b-v1.0-cost\",\n  \"model_type\": \"序列分类器\",\n  \"mt-bench-easy\": 0.5714285714285714,\n  \"mt-bench-hard\": 0.5405405405405406,\n  \"mt-bench-med\": 0.725,\n  \"refusals-dangerous\": 0.97,\n  \"refusals-offensive\": 1,\n  \"xstest-should-refuse\": 1,\n  \"xstest-should-respond\": 0.284\n}\n\n# 计算并打印各部分得分\nscores_per_section = calculate_scores_per_section(EXAMPLE_COUNTS, SUBSET_MAPPING, metrics)\nprint(scores_per_section)\n```\n\n## 仓库结构\n\n```\n├── README.md                   \u003C- 面向使用该项目的研究人员的顶级 README\n├── analysis\u002F                   \u003C- 用于分析 RewardBench 结果或其他奖励模型属性的工具目录\n├── rewardbench\u002F                \u003C- 核心工具和建模文件\n|   ├── models\u002F                     ├── 用于运行现有奖励模型的独立文件\n|   └── *.py                        └── RewardBench 工具和实用程序\n├── scripts\u002F                    \u003C- 用于评估奖励模型的脚本和配置文件\n├── tests                       \u003C- 单元测试\n├── Dockerfile                  \u003C- 用于在 AI2 
实现可重复且可扩展研究的构建文件\n├── LICENSE\n├── Makefile                    \u003C- 包含诸如 `make style` 等命令的 Makefile\n└── setup.py                    \u003C- 使项目可通过 pip 安装（pip install -e .），从而可以导入 `rewardbench`\n```\n\n## 维护\n\n本节专为 AI2 使用设计，但也可能对使用 Docker 评估模型的其他用户有所帮助。\n\n### Docker 镜像\n\n提供了两个 Docker 镜像：\n\n| 镜像 | Dockerfile | 使用场景 | 构建时间 |\n|------|------------|----------|----------|\n| `rewardbench` | `Dockerfile` | 奖励模型、基于 API 的大语言模型评判器 | ~5–10 分钟 |\n| `rewardbench-vllm` | `Dockerfile.vllm` | 通过 vLLM 进行本地大语言模型推理 | ~45 分钟 |\n\n基础镜像使用 torch ≤2.8，并预构建了 flash-attn 轮子包。vLLM 镜像则使用 torch 2.9（vLLM 所需版本），并从源代码编译 flash-attn。\n\n在本地构建：\n```bash\n# 基础镜像（快速）\ndocker build -t rewardbench . --platform linux\u002Famd64\n\n# vLLM 镜像（较慢，包含本地大语言模型推理）\ndocker build -f Dockerfile.vllm -t rewardbench-vllm . --platform linux\u002Famd64\n```\n\n镜像会在合并到主分支时自动构建并推送到 Beaker：\n- `nathanl\u002Frewardbench_auto`：基础镜像\n- `nathanl\u002Frewardbench_vllm_auto`：vLLM 镜像\n\n## 引用\n请使用以下引用格式来引用我们的工作：\n```\n@misc{lambert2024rewardbench,\n      title={RewardBench: Evaluating Reward Models for Language Modeling}, \n      author={Nathan Lambert and Valentina Pyatkin and Jacob Morrison and LJ Miranda and Bill Yuchen Lin and Khyathi Chandu and Nouha Dziri and Sachin Kumar and Tom Zick and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi},\n      year={2024},\n      eprint={2403.13787},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG}\n}\n```\n\n```\n@misc{malik2025rewardbench2advancingreward,\n      title={RewardBench 2: Advancing Reward Model Evaluation}, \n      author={Saumya Malik and Valentina Pyatkin and Sander Land and Jacob Morrison and Noah A. 
Smith and Hannaneh Hajishirzi and Nathan Lambert},\n      year={2025},\n      eprint={2506.01937},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01937}, \n}\n```","# RewardBench 快速上手指南\n\nRewardBench 是一个用于评估奖励模型（Reward Models，包括通过 DPO 训练的模型）能力和安全性的基准测试工具。本指南帮助中国开发者快速完成环境配置并运行评估。\n\n## 环境准备\n\n*   **操作系统**: Linux \u002F macOS (Windows 需使用 WSL2)\n*   **Python 版本**: 3.8+\n*   **前置依赖**:\n    *   已安装 `torch` (PyTorch)\n    *   若需评估生成式模型（LLM-as-judge），建议安装 `vLLM` 或配置相关 API Key\n*   **网络要求**: 需要访问 Hugging Face。国内用户建议配置镜像源或代理以加速模型和数据集下载。\n    *   设置环境变量（可选，加速下载）：\n        ```bash\n        export HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n        export HF_TOKEN=\"your_huggingface_token\"\n        ```\n\n## 安装步骤\n\n推荐使用 `uv` 进行安装（速度更快），也可使用标准 `pip`。\n\n### 方式一：使用 UV 安装（推荐）\n\n```bash\n# 安装基础版（适用于判别式奖励模型）\nuv pip install rewardbench\n\n# 安装完整版（包含生成式模型支持，如 LLM-as-judge, vLLM, API 提供商）\nuv pip install rewardbench[generative]\n```\n\n### 方式二：使用 Pip 安装\n\n```bash\n# 安装基础版\npip install rewardbench\n\n# 安装完整版（含生成式支持）\npip install rewardbench[generative]\n```\n\n### 方式三：源码安装（开发模式）\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench.git\ncd reward-bench\nuv sync                      # 基础安装\nuv sync --extra generative   # 包含生成式支持\n```\n\n## 基本使用\n\n安装完成后，您可以直接使用命令行工具评估模型。\n\n### 1. 评估核心奖励模型 (Reward Models)\n\n运行默认的核心评估数据集：\n\n```bash\nrewardbench --model={your_model_name}\n```\n\n**示例**：评估 OpenAssistant 的奖励模型\n```bash\nrewardbench --model=OpenAssistant\u002Freward-model-deberta-v3-large-v2 --dataset=allenai\u002Fultrafeedback_binarized_cleaned --split=test_gen --chat_template=raw\n```\n\n### 2. 
评估 DPO 模型\n\n对于 DPO 模型，需同时指定主模型和参考模型（ref_model），脚本会自动路由：\n\n```bash\nrewardbench --model={model_name} --ref_model={ref_model_name}\n```\n\n**示例**：\n```bash\nrewardbench --model=Qwen\u002FQwen1.5-0.5B-Chat --ref_model=Qwen\u002FQwen1.5-0.5B --batch_size=128 --tokenizer=EleutherAI\u002Fpythia-14m --chat_template=raw\n```\n\n### 3. 评估生成式模型 (LLM-as-a-judge)\n\n若安装了 `[generative]` 扩展，可使用 `rewardbench-gen` 命令：\n\n```bash\nrewardbench-gen --model={your_model_name}\n```\n\n**示例**（本地运行 Llama 3）：\n```bash\npython scripts\u002Frun_generative.py --model=meta-llama\u002FLlama-3-70b-chat-hf --force_local\n```\n\n**示例**（调用 API）：\n```bash\npython scripts\u002Frun_generative.py --model=gpt-3.5-turbo-0125\n```\n\n### 4. 运行 RewardBench V2 (新版)\n\n针对 RewardBench 2 的特殊数据处理（如 Best-of-4 和 Ties 数据）：\n\n```bash\npython scripts\u002Frun_v2.py --model={your_model_name}\n```\n\n> **提示**: 具体模型的配置参数（如 chat_template）可参考仓库中的 `scripts\u002Fconfigs\u002Feval_configs.yaml` 文件。常用模板名称可在 FastChat 项目中查找。","某 AI 初创团队正在迭代其客服大模型的奖励模型（Reward Model），旨在通过人类反馈强化学习（RLHF）提升回答的准确性与安全性。\n\n### 没有 reward-bench 时\n- **评估标准混乱**：团队只能依赖少量内部构造的测试集，缺乏行业统一的基准，导致无法判断模型在“常识推理”或“安全对齐”等关键维度上的真实水平。\n- **开发效率低下**：每次尝试新的训练策略（如切换 DPO 算法或调整数据配比），都需要手动编写脚本格式化数据并计算准确率，耗时且容易出错。\n- **盲目优化风险**：由于缺乏对“硬样本”（如细微偏好差异或复杂逻辑陷阱）的专项测试，模型可能在通用指标上得分虚高，却在实际部署中出现严重的安全漏洞或胡言乱语。\n- **横向对比困难**：无法将自研模型与社区主流的 Starling、PairRM 等模型进行公平对比，难以向投资人证明技术先进性。\n\n### 使用 reward-bench 后\n- **权威基准对标**：直接调用 reward-bench 内置的多样化评测数据集，瞬间获得模型在聊天、安全、推理等细分领域的标准化得分，清晰定位能力短板。\n- **一键自动化评测**：利用 `run_rm.py` 或 `run_dpo.py` 脚本，统一数据格式并自动运行推断，将原本数小时的评估流程缩短至几分钟，支持高频次迭代验证。\n- **深度缺陷洞察**：通过 reward-bench 特有的分析工具，精准识别模型在“最佳四选一”或“平局判定”等高难场景下的失效案例，针对性地补充训练数据。\n- **榜单竞争力验证**：轻松将结果提交至 Hugging Face 排行榜，与全球顶尖模型同场竞技，用客观数据佐证技术实力。\n\nreward-bench 
将奖励模型的评估从“黑盒摸索”转变为“标准化度量”，成为团队确保模型对齐质量与安全性的核心标尺。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fallenai_reward-bench_d73af2e3.png","allenai","Ai2","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fallenai_65c450d5.png","",null,"ai2-info@allenai.org","http:\u002F\u002Fwww.allenai.org","https:\u002F\u002Fgithub.com\u002Fallenai",[86,90,94,98],{"name":87,"color":88,"percentage":89},"Python","#3572A5",99.2,{"name":91,"color":92,"percentage":93},"Dockerfile","#384d54",0.6,{"name":95,"color":96,"percentage":97},"Shell","#89e051",0.1,{"name":99,"color":100,"percentage":97},"Makefile","#427819",707,95,"2026-04-05T19:40:50","Apache-2.0","未说明","运行本地生成式模型（Generative RMs）需要 vLLM 支持（通常需 NVIDIA GPU），具体显存需求取决于所选模型大小；基础奖励模型评估未明确强制要求 GPU，但建议使用以加速推理。",{"notes":108,"python":105,"dependencies":109},"1. 推荐使用 'uv' 进行包管理安装。2. 运行生成式模型（LLM-as-judge）时，本地部署需安装 vLLM，或可使用 OpenAI、Anthropic、Google Gemini、Together 等 API。3. 需设置 Hugging Face Token 环境变量 (HF_TOKEN) 以访问部分模型或上传结果。4. 支持多种奖励模型架构（如 DPO、KTO、Starling 等）及自定义对话模板。5. 
RewardBench V2 引入了新的数据集和处理逻辑（如 Best-of-4 和 Ties 数据）。",[110,111,112,113,114],"torch","transformers","accelerate","vllm (用于本地生成式模型)","fastchat (可选，用于对话模板)",[15,46],[117,118],"preference-learning","rlhf","2026-03-27T02:49:30.150509","2026-04-11T18:33:10.350961",[122,127,132,137,142,146],{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},22567,"为什么在 RewardBench 中运行模型时，不同的 batch size 会导致评分结果不一致？","这通常与填充（padding）机制有关。当使用左填充（left padding）且第一个 token 是 PAD token 时，`AutoModelForSequenceClassification` 可能会错误地获取填充位置的分数（通常为 0），导致位置索引变为 -1，从而产生错误的评分。建议检查模型的 padding 策略，确保不使用会导致首个 token 为 PAD 的左填充配置，或者在代码中显式处理 padding index 以避免取到错误的分数。","https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fissues\u002F137",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},22568,"评估 Gemma-2-27b 等模型时，本地运行的指标与 Leaderboard 上的结果严重不符怎么办？","这通常是由于注意力机制实现（attention implementation）的差异造成的。Gemma-2 模型对 `sdpa`（默认）或 `eager` 模式敏感，可能导致性能大幅下降（例如平均分下降 0.086 甚至更多）。建议在运行评估脚本时，手动向 `model_kwargs` 传递 `attn_implementation` 参数，强制使用 `flash_attention_2` 以获得与 Leaderboard 一致的结果。具体做法：在加载模型时添加该参数，或修改脚本以支持传入该参数。","https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fissues\u002F163",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},22569,"如何在 RewardBench 中添加和评估新的奖励模型（Reward Models）？","如果是基于 Hugging Face 原生实现的序列分类器（Sequence Classifiers），通常不需要额外编写管道代码。您需要在项目中提交 Issue，提供模型名称（如 `Skywork\u002FSkywork-Reward-Llama-3.1-8B`）、本地评估的详细指标数据，以及任何特殊的运行配置建议（例如必须使用 `flash_attention` 或特定的 `attn_implementation`）。维护者验证后会将其加入 Docker 镜像并更新 Leaderboard。","https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fissues\u002F169",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},22570,"使用 bfloat16 (bf16) 精度评估 DPO 模型时遇到性能下降或错误如何解决？","bfloat16 精度在某些情况下会导致评分下降或兼容性问题。维护者已发布相关的 PR（如 #155）来修复对 bfloat16 的支持。建议在本地测试时拉取包含该修复的最新代码分支，或者暂时使用 float32 进行验证。如果问题依然存在，请检查模型是否原生支持 bf16，并在 Issue 
中反馈以便维护者进一步适配。","https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fissues\u002F143",{"id":143,"question_zh":144,"answer_zh":145,"source_url":141},22571,"自定义模型（如 Gemma-MMPO）使用了非标准的 Tokenizer 模板，如何在 RewardBench 中正确评估？","RewardBench 默认可能使用 FastChat 模板，如果您的模型保存了自定义的 tokenizer 模板（类似 Gemma-it），需要确认评估脚本是否正确加载了该模板。如果默认模板不兼容，您可能需要提供 FastChat 的自定义配置代码，或者在模型配置中指定正确的 `chat_template` 参数。建议在提交模型前先在本地使用 `--chat_template` 参数进行测试，确保输入格式与模型训练时一致。",{"id":147,"question_zh":148,"answer_zh":149,"source_url":150},22572,"运行评估脚本报错涉及 `pad_token_id` 缺失或配置错误怎么办？","对于某些未在配置中预定义的模型（如 TinyLlama），需要在 `REWARD_MODEL_CONFIG` 字典中手动添加配置项。配置需包含 `model_builder`（如 `AutoModelForSequenceClassification.from_pretrained`）、`pipeline_builder`、`quantized` 状态、`custom_dialogue` 标志以及 `model_type`。此外，确保模型本身或配置中正确设置了 `pad_token_id`，避免在 Pipeline 处理时因缺少填充 token ID 而崩溃。","https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fissues\u002F115",[152,157],{"id":153,"version":154,"summary_zh":155,"released_at":156},136297,"v0.1.4","为 v2 版本中的所有改进提升版本号。增加了更多模型流水线，并更好地支持生成式模型。\n\n## 变更内容\n* @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F196 中发布后提升版本号\n* @sanderland 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F197 中修复了 Tiny 数据类型问题\n* @Nicolinho 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F195 中评估了 QRM 奖励模型\n* @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F200 中添加了配置\n* @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F203 中更新了配置\n* @Mghao 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F208 中添加了模型“INF Outcome Reward Model”\n* @saumyamalik 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F213 中添加了模型配置并提升了 transformers 版本\n* @NinaCalvi 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F214 中添加了 Atla Selene Mini\n* @ShikaiChen 在 
https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F215 中添加了 LDL-Reward-Gemma-2-27B-v0.1\n* @LittleCoder12345 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F217 中添加了 RISE 支持代码\n* @saumyamalik 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F220 中添加了 v2 best-of-n 的脚本\n* @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F221 中修复了依赖项并更新了 .gitignore\n* @kaikaidai 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F223 中添加了 Selene-1 并将 vllm 升级至 0.6.3\n* @saumyamalik 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F224 中取消了 tiktoken 的版本锁定，以提高与 vllm 的兼容性\n* @WilsonTandya 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F227 中修复了在使用 DPO 模型时 rewardbench.py 中 torch_dtype 的 UnboundLocalError 问题\n* @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F232 中提出了使用评分而非成对排名的 PR\n* @saumyamalik 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F236 中进行了 v2 代码更新\n* @shawn0wang 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F229 中添加了 skywork-vl-reward 的支持代码\n* @saumyamalik 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F237 中更新了 README.md\n\n## 新贡献者\n* @sanderland 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F197 中做出了首次贡献\n* @Nicolinho 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F195 中做出了首次贡献\n* @Mghao 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F208 中做出了首次贡献\n* @saumyamalik 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F213 中做出了首次贡献\n* @NinaCalvi 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F214 中做出了首次贡献\n* @ShikaiChen 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F215 中做出了首次贡献\n* 
@LittleCoder12345 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F217 中做出了首次贡献\n* @kaikaidai 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F223 中做出了首次贡献\n* @WilsonTandya 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F227 中做出了首次贡献\n* @shawn0wang 做出了首次贡献","2025-06-03T18:47:46",{"id":158,"version":159,"summary_zh":160,"released_at":161},136298,"v0.1.3","`rewardbench` CLI 可以在任何指令数据集上运行，并提供丰富的分数日志记录功能。\n这使得 `rewardbench` 在获得生成结果后，能够快速搭建一个拒绝采样流水线。\n\n具体来说，我认为这种日志记录方式对于评估来说**非常出色**。WandB 在训练过程中会做类似的事情，但在使用 CLI 时，只需传递一个参数，即可保存以下内容：\n* 所有分数、输入文本等数据到 HuggingFace；\n* 用于启动评估的命令；\n* 当前的 Python 环境，以确保实验的可重复性。\n\n示例请参见 README：https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench?tab=readme-ov-file#logging\n\n## 变更内容\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F139 中完成的代码清理、小修复及 0.1.2 版本发布；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F142 中修复的 DPO 提示词问题；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F141 中引入的新款“超级秘密”模型；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F144 中进行的小修复、新的 Dockerfile 以及新增模型；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F145 中修复的 Llama3 模型量化问题，针对 DPO 模型；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F148 中修复的小 bug；\n* 由 @YangRui2015 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F151 中添加的 GRM 类；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F152 中新增的模型和 Dockerfile；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F153 中添加的 Claude 3.5 Sonnet 模型；\n* 由 @YangRui2015 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F154 中修复的 GRM 
类填充问题；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F155 中原生添加的 bfloat16 支持；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F156 中添加的生成模型；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F157 中添加的 InternLM2 RM 模型；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F160 中对生成模型的进一步更新；\n* 由 @sanghyuk-choi 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F159 中添加的 offsetbias 执行提示词及评判流程代码；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F161 中完成的小规模生成 PR；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F166 中修复的 Bos 问题；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F167 中添加的自动 Beaker 镜像；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F168 中进行的小幅调整；\n* 由 @chrisliu298 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F170 中添加的 attn_implementation 支持；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F171 中对 run_generative 的修复以及新增模型；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F172 中修复的 vLLM 版本问题；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F174 中移除的训练相关代码；\n* 由 @natolambert 在 https:\u002F\u002Fgithub.com\u002Fallenai\u002Freward-bench\u002Fpull\u002F175 中同步排行榜的变化；\n* 由 @natolambert 在 https:\u002F\u002Fgit 中添加的模型；","2024-10-04T23:34:40"]