[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-tatsu-lab--alpaca_eval":3,"tool-tatsu-lab--alpaca_eval":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 
50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":79,"owner_twitter":79,"owner_website":79,"owner_url":80,"languages":81,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":23,"env_os":94,"env_gpu":94,"env_ram":94,"env_deps":95,"category_tags":99,"github_topics":100,"view_count":109,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":110,"updated_at":111,"faqs":112,"releases":142},735,"tatsu-lab\u002Falpaca_eval","alpaca_eval","An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.","AlpacaEval 是一款专为指令跟随型语言模型打造的自动评测工具。它利用大模型模拟人类偏好，替代了传统耗时、昂贵且难以复现的人工评估流程。对于致力于模型迭代的开发者和研究人员而言，这能极大提升效率。\n\n相比其他基准测试，AlpacaEval 表现卓越：运行时间少于 3 分钟，成本低于 10 美元，且与权威的人类基准 ChatBot Arena 评分的相关性高达 0.98。其最新的 2.0 版本默认采用长度控制的胜率指标，不仅进一步提升了评估准确度，还有效防止了模型通过生成长文本“刷分”，确保结果公平可靠。\n\n无论是验证微调效果、对比不同模型性能，还是构建自定义排行榜，AlpacaEval 都能提供快速、高质量且可复现的数据支持，是 AI 研发中值得信赖的自动化评估方案。","# \u003Ca href=\"https:\u002F\u002Ftatsu-lab.github.io\u002Falpaca_eval\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_10aad1a5d6d4.png\" width=\"35\">\u003C\u002Fa> [AlpacaEval](https:\u002F\u002Ftatsu-lab.github.io\u002Falpaca_eval\u002F) : An Automatic Evaluator for Instruction-following Language Models\n\n[![Code License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode%20License-Apache_2.0-green.svg)](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_farm\u002Fblob\u002Fmain\u002FLICENSE)\n[![Data License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FData%20License-CC%20By%20NC%204.0-red.svg)](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_farm\u002Fblob\u002Fmain\u002FDATA_LICENSE)\n[![Python 3.10+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10+-blue.svg)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002Frelease\u002Fpython-3100\u002F)\n[![discord](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdiscord-server-blue?logo=discord&logoColor=white)](https:\u002F\u002Fdiscord.gg\u002FGJMxJSVZZM)\n\n\n**AlpacaEval 2.0 with length-controlled win-rates** ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.04475)) has a spearman correlation of **0.98** with [ChatBot Arena](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flmsys\u002Fchatbot-arena-leaderboard) while costing less than **$10** of OpenAI 
credits and running in less than 3 minutes. Our goal is to have a benchmark for chat LLMs that is: fast (\u003C 5min), cheap (\u003C $10), and highly correlated with humans (0.98). Here's a comparison with other benchmarks:\n\n\u003Cp float=\"left\" align=\"middle\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_86b6a7d16a94.png\" alt=\"LC AlpacaEval is the most highly correlated benchmark with Chat Arena.\" width=\"500\"\u002F>\n\u003C\u002Fp>\n\n---\n\nUpdates:\n\n:tada: **Length-controlled Win Rates** are out and used by default! This increases the correlation with ChatBot Arena from 0.93 to 0.98, while significantly decreasing length gameability. The raw win rates are still shown on the website and the CLI. More details [here](#length-controlled-win-rates).\n\n:tada: **AlpacaEval 2.0** is out and used by default! We improved the auto-annotator (better and cheaper) and use GPT-4 preview as baseline. More details [here](#alpacaeval-20). For the old version, set your environment variable `IS_ALPACA_EVAL_2=False`.\n\n---\n\n\u003Cdetails open>\n  \u003Csummary>\u003Cb>Table of Contents\u003C\u002Fb>\u003C\u002Fsummary>\n\n1. [Overview](#overview)\n2. [Quick Start](#quick-start)\n3. [Leaderboards and how to interpret them](#leaderboards-and-how-to-interpret-them)\n    - [Models](#models)\n    - [Evaluators](#evaluators)\n4. [Use-cases](#use-cases)\n    - [Evaluating a model](#evaluating-a-model)\n    - [Making a new leaderboard](#making-a-new-leaderboard)\n    - [Making a new evaluator](#making-a-new-evaluator)\n5. [Contributing](#contributing)\n    - [Contributing a model](#contributing-a-model)\n    - [Contributing an evaluator](#contributing-an-evaluator)\n    - [Contributing an eval set](#contributing-an-eval-set)\n    - [Contributing a completion function](#contributing-a-completion-function)\n6. [Limitations](#limitations)\n7. [Analysis](#additional-analysis-and-plots)\n    - [Length-controlled AlpacaEval](#length-controlled-alpacaeval-lcae)\n    - [Analyzing an evaluator](#analyzing-an-evaluator)\n    - [Analyzing an eval set](#analyzing-an-eval-set)\n8. [Citation](#citation)\n9. [Additional information](#additional-information)\n   - [Length-controlled win rates](#length-controlled-win-rates)\n   - [AlpacaEval 2.0](#alpacaeval-20)\n   - [Data Release](#data-release)\n   - [Differences with AlpacaFarm](#differences-with-alpacafarm)\n   - [Related work](#related-work)\n   - [Interpreting annotations](#interpreting-annotations)\n   - [Major updates](#major-updates)\n\n\u003C\u002Fdetails>\n\n# Overview\n\n\nEvaluation of instruction-following models (e.g., ChatGPT) typically requires human interactions. This is\ntime-consuming, expensive, and hard to replicate. AlpacaEval is an LLM-based automatic evaluation that is fast, cheap,\nreplicable, and validated against 20K human annotations.\nIt is particularly useful for model development.\nAlthough we improved over prior automatic evaluation pipelines, there are still fundamental [limitations](#limitations) like the preference for longer outputs.\nAlpacaEval provides the following:\n\n- [**Leaderboard**](https:\u002F\u002Ftatsu-lab.github.io\u002Falpaca_eval\u002F): a leaderboard of common models on the AlpacaEval\n  evaluation set. **Caution**: Automatic evaluators (e.g. GPT-4) may be biased towards models that generate longer outputs and\u002For that were fine-tuned on the model underlying the evaluator (e.g. 
GPT-4).\n- [**Automatic evaluator**](#evaluators): an automatic evaluator that has high agreement with humans (validated on 20K\n  annotations). We evaluate a\n  model by\n  measuring the fraction of times a powerful LLM (e.g. GPT-4) prefers the outputs from that model\n  over\n  outputs from a reference model. Our evaluators enable caching and output randomization by default.\n- [**Toolkit for building automatic evaluators**](#analysis): a simple interface for\n  building advanced automatic evaluators (e.g. with caching, batching, or multi-annotators) and analyzing them (quality,\n  price, speed, statistical power, bias, variance etc).\n- [**Human evaluation data**](#data-release): 20K human preferences between a given and reference model\n  on the [AlpacaFarm](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_farm\u002Ftree\u002Fmain)\n  evaluation set. 2.5K of these are cross-annotations (4 humans annotating the same 650 examples).\n- [**AlpacaEval dataset**](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Falpaca_eval.json): a simplification\n  of [AlpacaFarm's](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_farm\u002Ftree\u002Fmain) evaluation set, where \"instructions\" and \"inputs\" are merged into one field, and reference outputs are longer. [Details here](#data-release).\n\n\n\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>When to use and not use AlpacaEval?\u003C\u002Fb>\u003C\u002Fsummary>\n\n**When to use AlpacaEval?**\nOur automatic evaluator is a quick and cheap proxy for human evaluation of simple\ninstruction-following tasks.\nIt is useful if you\nhave to run many evaluations quickly, e.g., during model development.\n\n**When not to use AlpacaEval?**\nAs any other automatic evaluator, AlpacaEval should **not replace human evaluation in\nhigh-stake decision-making**, e.g., to decide on model release. In particular, AlpacaEval is limited by the fact\nthat (1) the instructions in the eval set might not be representative of advanced usage of LLMs; (2) automatic\nevaluators may have biases such as favoring style over\nfactuality of the answer; and (3) AlpacaEval does not measure the risks that a model could cause.\nDetails in [limitations](#limitations).\n\n\u003C\u002Fdetails>\n\n\n# Quick Start\n\nTo install the stable release, run\n\n```bash\npip install alpaca-eval\n```\n\nTo install the nightly version, run\n\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\n```\n\nThen you can use it as follows:\n\n```bash\nexport OPENAI_API_KEY=\u003Cyour_api_key> # for more complex configs, e.g. using Azure or switching clients see client_configs\u002FREADME.md \nalpaca_eval --model_outputs 'example\u002Foutputs.json' \n```\n\nThis will print the leaderboard to the console, and save both the leaderboard and the annotations to the same directory as the `model_outputs` file. Important parameters are the following:\n\n- **model_outputs** : A path to a json file for the outputs of the model to add to the leaderboard. Each dictionary\n  should\n  contain the keys `instruction` and `output`.\n- **annotators_config**: This is the annotator to use. We recommend using `weighted_alpaca_eval_gpt4_turbo` (\n  default for AlpacaEval 2.0), which has a\n  high agreement rate with our human annotation data, large context size, and is pretty cheap. For a comparison of all annotators see [here](#evaluators).\n- **reference_outputs**:  The outputs of the reference model. Same format as `model_outputs`. 
By default, this\n  is `gpt4_turbo` for AlpacaEval 2.0.\n- **output_path**: Path for saving annotations and leaderboard.\n\nIf you don't have the model outputs, you can\nuse [`evaluate_from_model`](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain#evaluating-a-model) and\npass a local path or a name of a\nHuggingFace\nmodel, or a model from a standard API (OpenAI, Anthropic, Cohere, google, ...). Other commands:\n\n\u003Cdetails open>\n  \u003Csummary>\u003Ccode>>>> alpaca_eval -- --help\u003C\u002Fcode>\u003C\u002Fsummary>\n\n```\nSYNOPSIS\n    alpaca_eval COMMAND\n\nCOMMANDS\n    COMMAND is one of the following:\n\n     evaluate\n       Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.\n\n     evaluate_from_model\n       Evaluate a model from HuggingFace or an API provider. This is a wrapper around `evaluate` which includes generating from a desired model.\n\n     make_leaderboard\n       Precompute and save an entire leaderboard for a given dataset \u002F evaluator \u002F set of models generations.\n\n     analyze_evaluators\n       Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).\n```\n\n\u003C\u002Fdetails>\n\nFor more information about each function use `alpaca_eval \u003Ccommand> -- --help`.\n\n# Leaderboards and how to interpret them\n\n## Models\n\nOur leaderboards are computed on the [AlpacaEval dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca_eval).\nWe precomputed the leaderboard for important models using different baseline models and autoannotators. \nOur two main leaderboards (\"AlpacaEval 2.0\" and \"AlpacaEval\") can be found\n[on this page](https:\u002F\u002Ftatsu-lab.github.io\u002Falpaca_eval\u002F).\n\"AlpacaEval 2.0\" uses `weighted_alpaca_eval_gpt4_turbo` for the annotator and `gpt4_turbo` for the baseline.\n\"AlpacaEval\" uses `alpaca_eval_gpt4` for the annotator and `text_davinci_003` for the baseline.\nFor all precomputed leaderboards see [here](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fleaderboards).\nLater we also show how to [add your model](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval#evaluating-a-model) to the\nleaderboard and how to make\na [new leaderboard for your evaluator\u002Fdataset](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval#making-a-new-leaderboard).\nSee [here](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fmodels_configs) for the configs of all\nmodels that are available out of the box.\n\n**AlpacaEval minimal leaderboard**:\n\n|                       | Win Rate | Std Error |\n|:----------------------|---------:|----------:|\n| gpt4                  |     95.3 |       0.7 |\n| claude                |     88.4 |       1.1 |\n| chatgpt               |     86.1 |       1.2 |\n| guanaco-65b           |     71.8 |       1.6 |\n| vicuna-13b            |     70.4 |       1.6 |\n| text_davinci_003      |     50.0 |       0.0 |\n| alpaca-farm-ppo-human |     41.2 |       1.7 |\n| alpaca-7b             |     26.5 |       1.5 |\n| text_davinci_001      |     15.2 |       1.2 |\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>How exactly are those metrics computed?\u003C\u002Fb>\u003C\u002Fsummary>\n\n**Win Rate**: the win rate measures the fraction of time the model's output is preferred over the reference's outputs 
(`text-davinci-003` for AlpacaEval and `gpt4_turbo` for AlpacaEval 2.0).\nMore specifically, to compute the win rate we collect pairs of outputs of the desired model on every instruction from\nthe AlpacaEval dataset.\nWe then pair each output with the output of our reference model (e.g. `text-davinci-003`) on the same instruction.\nWe then ask our automatic evaluator which output it prefers.\nSee [AlpacaEval's](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Falpaca_eval_gpt4)\nand [AlpacaEval 2.0's](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Fweighted_alpaca_eval_gpt4_turbo) prompts and configs; in particular, we randomize the order of\noutputs to avoid position bias.\nWe then average the preferences over all instructions in the dataset to get the win rate of the model over the baseline.\nIf both outputs are exactly the same we use a half preference for both models.\n\n**Standard error**: this is the standard error (normalized by N-1) of the win rate, i.e., the preferences averaged over\nthe different instructions.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>Details about our auto-annotator: \u003Ccode>alpaca_eval_gpt4\u003C\u002Fcode>\u003C\u002Fb>\u003C\u002Fsummary>\n\nOur `alpaca_eval_gpt4` (\nsee [configs](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Falpaca_eval_gpt4\u002Fconfigs.yaml#L5))\nannotator averages over preferences, where preferences are obtained as follows:\n\n1. it takes in an instruction and a pair of outputs (from the desired model and the reference model)\n2. if a preference for this triple was already computed, it returns it (i.e. it uses caching)\n3. it randomizes the order of the outputs to avoid position bias\n4. it formats the instruction and outputs into\n   the [following zero-shot prompt](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Falpaca_eval_gpt4\u002Falpaca_eval.txt),\n   which asks to order the outputs in order of preference\n5. it completes the prompt using GPT4 with `temperature=0`\n6. 
it parses the preference from the completions and returns it\n\nThe annotator is a mix between (and was highly influenced by) [AlpacaFarm](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_farm)\nand [Aviary](https:\u002F\u002Fgithub.com\u002Fray-project\u002Faviary\u002Ftree\u002Fmaster) evaluators.\nIn particular, we use the same code as for AlpacaFarm (caching\u002Frandomization\u002Fhyperparameters) but use a ranking prompt\nsimilar to that of Aviary.\nWe make changes to Aviary's prompt to decrease the bias for longer outputs.\nDetails in [Related work](#related-work).\n\nFor AlpacaEval 2.0 we use `weighted_alpaca_eval_gpt4_turbo`, which uses logprobs to compute continuous preference and uses GPT4_turbo as model (\nsee [configs](#https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Fweighted_alpaca_eval_gpt4_turbo\u002Fconfigs.yaml)).\n\n\u003C\u002Fdetails>\n\n\n\n## Evaluators\n\nWe evaluate different automatic annotators on the AlpacaEval set by comparing to\n2.5K [human annotations](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Falpaca_farm_human_crossannotations.json)\nwe collected (~650 instructions each with 4 human annotations).\nBelow we show metrics for our suggested evaluators (`weighted_alpaca_eval_gpt4_turbo`,`alpaca_eval_gpt4`), for prior\nautomatic\nevaluators ([`alpaca_farm_greedy_gpt4`](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_farm),[`aviary_gpt4`](https:\u002F\u002Faviary.anyscale.com\u002F),[`lmsys_gpt4`](https:\u002F\u002Fchat.lmsys.org\u002F)),\nfor humans (`humans`), and for different base models with essentially the same\nprompt (`gpt4`,`claude`,`text_davinci_003`,`chatgpt_fn`,`guanaco_33b`, `chatgpt`).\nSee [here](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs) for the configs of all\nevaluators that are available out of the box and their associated metrics.\n\n|                                 |   Human agreement |   Price [$\u002F1000 examples] |   Time [seconds\u002F1000 examples] |   Spearman corr. |   Pearson corr. |   Bias |   Variance |   Proba. 
prefer longer |\n|:--------------------------------|------------------:|--------------------------:|-------------------------------:|-----------------:|----------------:|-------:|-----------:|-----------------------:|\n| alpaca_eval_gpt4                |              69.2 |                      13.6 |                           1455 |             0.97 |            0.93 |   28.4 |       14.6 |                   0.68 |\n| alpaca_eval_cot_gpt4_turbo_fn   |              68.6 |                       6.3 |                           1989 |             0.97 |            0.90 |   29.3 |       18.4 |                   0.67 |\n| alpaca_eval_gpt4_turbo_fn       |              68.1 |                       5.5 |                            864 |             0.93 |            0.82 |   30.2 |       15.6 |                   0.65 |\n| alpaca_eval_llama3_70b_fn       |              67.5 |                       0.4 |                            209 |             0.90 |            0.86 |   32.3 |       8.2 |                   0.79 |\n| gpt4                            |              66.9 |                      12.5 |                           1037 |             0.88 |            0.87 |   31.5 |       14.6 |                   0.65 |\n| alpaca_farm_greedy_gpt4         |              66.4 |                      15.3 |                            878 |             0.85 |            0.75 |   30.2 |       19.3 |                   0.60 |\n| alpaca_eval_cot_gpt4_turbo_fn |              65.7 |                       4.3 |                            228 |             0.78 |            0.77 |   33.9 |       23.7 |                   0.61 |\n| humans                          |              65.7 |                     300.0 |                          36800 |             1.00 |            1.00 |    0.0 |       34.3 |                   0.64 |\n| claude                          |              65.3 |                       3.3 |                            173 |             0.93 |            0.90 |   32.4 |       18.5 |                   0.66 |\n| lmsys_gpt4                      |              65.3 |                      13.9 |                          17982 |             0.98 |            0.97 |   31.6 |       15.9 |                   0.74 |\n| text_davinci_003                |              64.1 |                       8.7 |                            121 |             0.85 |            0.83 |   33.8 |       22.7 |                   0.70 |\n| longest                         |              62.2 |                       0.0 |                              0 |             0.27 |            0.56 |   37.8 |        0.0 |                   1.00 |\n| chatgpt                         |              57.3 |                       0.8 |                            285 |             0.72 |            0.71 |   39.4 |       34.1 |                   0.59 |\n\n\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>How exactly are those metrics computed?\u003C\u002Fb>\u003C\u002Fsummary>\n\nWe now explain in words how we compute the metrics in the table\nabove. 
[The code is here](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Ff05cbd651b79ac93906b19d01fe443b45828b0f2\u002Fsrc\u002Falpaca_eval\u002Fanalyze.py#L366).\n\n**Human agreement**: this measures the agreement between the current annotator and the majority preferences of\nhumans on\nour\n~650 annotations from\nour [cross-annotation set](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Falpaca_farm_human_crossannotations.json),\nwhich contains 4 human annotations per example.\nTo estimate the agreement between a single human (`humans` row in the table above) and the majority of humans, we take\none of the 4 annotations and compute the accuracy that it has when predicting the mode of the other 3 annotations.\nWe then average this accuracy over all 4 annotations and over the 650 instructions to get the human agreement, i.e., we\ncompute the expected (over humans and samples)\nleave-one-out agreement.\nIf the mode is not unique, we take one of the modes at random.\nWe perform exactly the same computation for the automatic annotators, so that the final numbers are comparable.\n\n**Price [$\u002F1000 examples]**: this is the average price of every 1000 annotations.\nFor humans, it is the price that [we paid Mechanical Turkers](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14387) to collect those\nannotations ($21\u002Fhour).\nIf the price depends on the machine used to compute the annotations (e.g. Guanaco) we leave it empty.\n\n**Time [seconds\u002F1000 examples]**: this is the average time it takes to compute 1000 annotations.\nFor humans, it is the estimated median time that each Mechanical Turker took to annotate 1000 examples.\nFor automatic annotators, it is the average time that it took us when running the annotations. Note that this can depend\non API limits that are different for different users and the number of requests that the clusters are\nprocessing.\n\n**Spearman corr.**: this measures the Spearman correlation between a leaderboard computed with the auto-annotator's preference and the leaderboard computed with human preferences. As with `Human agreement`, we use the human annotations from AlpacaFarm but we now consider the method-level agreement rather than only the sample-wise agreement with humans. Note that we only have 9 models and so the correlation is not very reliable. 
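\n\nTo make the agreement and correlation computations above more concrete, here is a minimal illustrative sketch (it is not the repository's `analyze.py`, which is linked above). The per-model win rates in it are placeholders: the auto-annotator numbers are taken loosely from the minimal leaderboard shown earlier, and the human numbers are made up.\n\n```python\n# Illustrative sketch only; see analyze.py in the repository for the real implementation.\nfrom statistics import mean, mode\nfrom scipy.stats import spearmanr\n\n# Leave-one-out agreement on one example with 4 annotations\n# (1 = prefers output_1, 2 = prefers output_2).\nannotations = [1, 1, 2, 1]\nagreement = mean(\n    float(held_out == mode(annotations[:i] + annotations[i + 1:]))\n    for i, held_out in enumerate(annotations)\n)\n\n# Method-level Spearman correlation between an auto-annotated and a human leaderboard.\nauto_win_rates = {'gpt4': 95.3, 'claude': 88.4, 'chatgpt': 86.1, 'alpaca-7b': 26.5}\nhuman_win_rates = {'gpt4': 93.0, 'claude': 90.0, 'chatgpt': 82.0, 'alpaca-7b': 30.0}  # made-up numbers\nmodels = sorted(auto_win_rates)\ncorr, _ = spearmanr([auto_win_rates[m] for m in models], [human_win_rates[m] for m in models])\nprint(agreement, corr)\n```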
\n\n**Pearson corr.**: same as with `Spearman corr.` but with Pearson correlation.\n\n**Bias**: agreement between the most likely human label and the most likely automatic one.\nFor automatic annotators we estimate it by sampling 4 different annotations for each example.\nThe randomness here comes from the order of the outputs in the prompt, sampling from the LLM, and if applicable the\norder of the instruction in the batch and the choice of annotator in the pool.\nWe then take the mode of the 4 annotations and compute the accuracy of the mode when predicting the mode of the 4 human\nannotations.\nNote that this is likely an overestimate of the real bias that we would get if we had an \"infinite\" number of\ncross-annotations.\nA low bias means that the annotator has in expectation the same preferences as humans.\nFor the case of humans, the bias is zero by definition.\nNote that this is related to, but not the same as, the standard statistical bias, because we take the mode instead of average over\nannotations and we consider 0-1 loss instead of squared loss.\n\n\n**Variance**: the expected agreement between a single automatic preference and the most likely one.\nWe estimate it the same way as we estimated \"human agreement\" for humans, i.e., we take the expected leave-one-out error\nwhen predicting the mode of the 3 annotations using the 4th annotation.\nA low variance means that the annotator is consistent with its preference, i.e., if you sample from it with different\nseeds it will give the same result.\nAs with the bias, this is not exactly the standard statistical variance, because we take the mode instead of average\nover annotations and we\nconsider 0-1 loss instead of squared loss.\n\nNote that the \"human agreement\" is tightly related to the bias and variance. In particular, the variance\nmeasures the error due to the fact that we only use a single annotation while the bias aims to measure the irreducible\nerror\nfor the current annotator.\n\n**Proba. prefer longer**: this is the probability that the annotator prefers the longer output when one of the two\noutputs is significantly longer than the other (more than 30 characters difference).\n\nIn the [full table](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002FREADME.md) we\nalso provide the following metrics:\n\n**Proba. prefer lists**: this is the probability that the annotator prefers the output that contains a list\u002Fbullet\npoints when one output does but not the other.\n\n**Proba. prefer 1**: this is the probability that the annotator prefers the first of the pair of outputs. All our\nproposed annotators randomize over outputs in the prompt, so this should be 0.5. Prior annotators, such as `lmsys`\nand `aviary`, do not.\n\n**# parsed**: this is the number of examples that the annotator was able to parse.\n\nNote that if the variance and bias are empty, it means that we only performed a single annotation for each of the 648 examples\ndue to resource (time and price) constraints. 
This explains why the #parsed is 648, otherwise it should be 2592.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>Tips for choosing evaluators\u003C\u002Fb>\u003C\u002Fsummary>\n\nOverall we recommend using `annotators_config=weighted_alpaca_eval_gpt4_turbo` if you want high agreement with humans, and\n`annotators_config=chatgpt_fn` if you are on a tight budget.\n\nWhen choosing an annotator, we recommend considering the following (the first three are obvious):\n\n- `\"Human agreement [%]\"`\n- `\"Price [$\u002F1000 examples]\"`\n- `\"Time [seconds\u002F1000 examples]\"`\n- `\"* corr.\"` approx. > 0.7. It is important that the correlation is not too low, but we do not recommend using it as the main metric as the correlation is computed on only 9 models. \n- `\"Proba. prefer longer\"` approx. \u003C 0.7. Indeed, we found that the preferences of human annotators have a\n  strong bias for longer answers (as shown by the\n  high [performance=62.2](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002FREADME.md)\n  of\n  the `\"longest\"` evaluator that always\n  prefers the longest output). This suggests that it might be more of a bias of the human annotators. In order to avoid\n  having leaderboards with strong biases for length, we suggest using automatic annotators with less than 0.7 \"Proba.\n  prefer longer\".\n- `\"Variance\"` approx. \u003C 0.2. We believe that a good evaluator should have as little variance as possible so that\n  results are mostly reproducible. Note that variance can be desirable in the case where we are simulating humans\n  as shown in [AlpacaFarm](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14387).\n\nWe filtered the annotators that do not satisfy those requirements in the table above (besides humans \u002F ChatGPT \u002F 003 \u002F\nlmsys for\nreference purposes). For\nall\nresults see [here](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002FREADME.md).\nIn general, we found `weighted_alpaca_eval_gpt4_turbo` to be a good trade-off between quality \u002F price \u002F time \u002F\nvariance \u002F length bias.\n\n\u003C\u002Fdetails>\n\nThe above metrics are computed with respect to annotations from crowd-workers. Although useful, those annotations are\nnot perfect, e.g., crowd-workers often favor style\nover\nfactuality. We thus recommend that users validate automatic evaluators on their own instructions and human annotations.\nDetails in [limitations](#limitations).\n\n# Use-cases\n\n\n## Evaluating a model\n\n\u003Cdetails>\n  \u003Csummary>\u003Ccode>>>> alpaca_eval evaluate -- --help\u003C\u002Fcode>\u003C\u002Fsummary>\n\n```\nNAME\n    alpaca_eval evaluate - Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.\n\nSYNOPSIS\n    alpaca_eval evaluate \u003Cflags>\n\nDESCRIPTION\n    Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.\n\nFLAGS\n    --model_outputs=MODEL_OUTPUTS\n        Type: Optional[Union]\n        Default: None\n        The outputs of the model to add to the leaderboard. Accepts data (list of dictionary, pd.dataframe, datasets.Dataset) or a path to read those (json, csv, tsv) or a function to generate those. Each dictionary (or row of dataframe) should contain the keys that are formatted in the prompts. E.g. 
by default `instruction` and `output` with optional `input`. If None, we just print the leaderboard.\n    -r, --reference_outputs=REFERENCE_OUTPUTS\n        Type: Union\n        Default: \u003Cfunc...\n        The outputs of the reference model. Same format as `model_outputs`. If None, the reference outputs are a specific set of Davinci 003 outputs on the AlpacaEval set:\n    --annotators_config=ANNOTATORS_CONFIG\n        Type: Union\n        Default: 'alpaca_eval_gpt4_turbo_fn'\n        The path the (or list of dict of) the annotator's config file. For details see the docstring of `PairwiseAnnotator`.\n    -n, --name=NAME\n        Type: Optional[Optional]\n        Default: None\n        The name of the model to add to the leaderboard. If None we check if `generator is in model_outputs` if not we use \"Current model\".\n    -o, --output_path=OUTPUT_PATH\n        Type: Union\n        Default: 'auto'\n        Path to the directory where the new leaderboard and the annotations should be stored. If None we don't save. If `auto` we use `model_outputs` if it is a path, and otherwise use the directory from which we call the script.\n    -p, --precomputed_leaderboard=PRECOMPUTED_LEADERBOARD\n        Type: Union\n        Default: 'auto'\n        The precomputed leaderboard or a path to it (json, csv, or tsv). The leaderboard should contain at least the column `win_rate`. If `auto` we will try to use the corresponding leaderboard for the reference outputs (only if in CORRESPONDING_OUTPUTS_LEADERBOARDS). If `None` we won't add other models from the leaderboard.\n    --is_overwrite_leaderboard=IS_OVERWRITE_LEADERBOARD\n        Type: bool\n        Default: False\n        Whether to overwrite the leaderboard if the model is already in it.\n    -l, --leaderboard_mode_to_print=LEADERBOARD_MODE_TO_PRINT\n        Type: Optional\n        Default: 'minimal'\n        The mode of the leaderboard to use. Only used if the precomputed leaderboard has a column `mode`, in which case it will filter the leaderboard by this mode. If None keeps all.\n    -c, --current_leaderboard_mode=CURRENT_LEADERBOARD_MODE\n        Type: str\n        Default: 'community'\n        The mode of the leaderboard for the current method.\n    --is_return_instead_of_print=IS_RETURN_INSTEAD_OF_PRINT\n        Type: bool\n        Default: False\n        Whether to return the metrics instead of printing the results.\n    -f, --fn_metric=FN_METRIC\n        Type: Union\n        Default: 'pairwise_to_winrate'\n        The function or function name in `metrics.py` that will be used to convert preference to metrics. The function should take a sequence of preferences (0 for draw, 1 for base win, 2 when the model to compare wins) and return a dictionary of metrics and the key by which to sort the leaderboard.\n    -s, --sort_by=SORT_BY\n        Type: str\n        Default: 'win_rate'\n        The key by which to sort the leaderboard.\n    --is_cache_leaderboard=IS_CACHE_LEADERBOARD\n        Type: Optional[Optional]\n        Default: None\n        Whether to save the result leaderboard to `precomputed_leaderboard`. If None we save only if max_instances not None. A preferred way of adding models to the leaderboard is to set `precomputed_leaderboard` to the previously saved leaderboard at `\u003Coutput_path>\u002Fleaderboard.csv`.\n    --max_instances=MAX_INSTANCES\n        Type: Optional[Optional]\n        Default: None\n        The maximum number of instances to annotate. 
Useful for testing.\n    --annotation_kwargs=ANNOTATION_KWARGS\n        Type: Optional[Optional]\n        Default: None\n        Additional arguments to pass to `PairwiseAnnotator.annotate_head2head`.\n    -A, --Annotator=ANNOTATOR\n        Default: \u003Cclass 'alpaca_eval.annotators.pairwise_evaluator.PairwiseAn...\n        The annotator class to use.\n    Additional flags are accepted.\n        Additional arguments to pass to `PairwiseAnnotator`.\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ccode>>>> alpaca_eval evaluate_from_model -- --help\u003C\u002Fcode>\u003C\u002Fsummary>\n\n```\nNAME\n    alpaca_eval evaluate_from_model - Evaluate a model from HuggingFace or an API provider. This is a wrapper around `evaluate` which includes generating from a desired model.\n\nSYNOPSIS\n    alpaca_eval evaluate_from_model MODEL_CONFIGS \u003Cflags>\n\nDESCRIPTION\n    Evaluate a model from HuggingFace or an API provider. This is a wrapper around `evaluate` which includes generating from a desired model.\n\nPOSITIONAL ARGUMENTS\n    MODEL_CONFIGS\n        Type: Union\n        A dictionary or path (relative to `models_configs`) to a yaml file containing the configuration of the model to decode from. If a directory,we search for 'configs.yaml' in it. The keys in the first dictionary should be the generator's name, and the value should be a dictionary of the generator's configuration which should have the\n\nFLAGS\n    -r, --reference_model_configs=REFERENCE_MODEL_CONFIGS\n        Type: Optional[Union]\n        Default: None\n        Same as in `model_configs` but for the reference model. If None, we use the default Davinci003 outputs.\n    -e, --evaluation_dataset=EVALUATION_DATASET\n        Type: Union\n        Default: \u003Cfunc...\n        Path to the evaluation dataset or a function that returns a dataframe. If None, we use the default evaluation\n    -a, --annotators_config=ANNOTATORS_CONFIG\n        Type: Union\n        Default: 'alpaca_eval_gpt4_turbo_fn'\n        Path to the annotators configuration or a dictionary. If None, we use the default annotators configuration.\n    -o, --output_path=OUTPUT_PATH\n        Type: Union\n        Default: 'auto'\n        Path to save the generations, annotations and leaderboard. If auto saves at `results\u002F\u003Cmodel_name>`\n    -m, --max_instances=MAX_INSTANCES\n        Type: Optional[int]\n        Default: None\n        Maximum number of instances to generate and evaluate. If None, we evaluate all instances.\n    --is_strip_output=IS_STRIP_OUTPUT\n        Type: bool\n        Default: True\n        Whether to strip trailing and leading whitespaces from the outputs.\n    --is_load_outputs=IS_LOAD_OUTPUTS\n        Type: bool\n        Default: True\n        Whether to try to load outputs from the output path. If True and outputs exist we only generate outputs for instructions that don't have outputs yet.\n    -c, --chunksize=CHUNKSIZE\n        Type: int\n        Default: 64\n        Number of instances to generate before saving. If None, we save after all generations.\n    Additional flags are accepted.\n        Other kwargs to `evaluate`\n\nNOTES\n    You can also use flags syntax for POSITIONAL ARGUMENTS\n```\n\n\u003C\u002Fdetails>\n\nTo evaluate a model you need to:\n\n1. Choose an evaluation set and compute outputs specified as `model_outputs`. By default, we use\n   the 805 examples from [AlpacaEval](#data-release). 
To compute outputs on AlpacaEval use:\n\n```python\nimport datasets\n\neval_set = datasets.load_dataset(\"tatsu-lab\u002Falpaca_eval\", \"alpaca_eval\")[\"eval\"]\nfor example in eval_set:\n    # generate here is a placeholder for your model's generations\n    example[\"output\"] = generate(example[\"instruction\"])\n    example[\"generator\"] = \"my_model\" # name of your model\n```\n\nIf your model is a HuggingFace model or comes from a standard API provider (OpenAI, Anthropic, Cohere), you can\ndirectly use `alpaca_eval evaluate_from_model` to also take care of generating outputs.\n\n2. Compute the reference outputs `reference_outputs`. By default, we use precomputed outputs of [`gpt4_turbo` on\n   AlpacaEval](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca_eval).\n   If you\n   want to use a different model or a different dataset follow the same steps as (1.).\n3. Choose an evaluator specified via `annotators_config`. We recommend using `alpaca_eval_gpt4_turbo_fn`. For other options and comparisons\n   see [this table](#evaluators). Depending on the evaluator you might need to\n   set the appropriate API_KEY in your environment\n   or in the [client_configs](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fclient_configs).\n\nRunning all together:\n\n```bash\nalpaca_eval --model_outputs 'example\u002Foutputs.json' \\\n  --annotators_config 'alpaca_eval_gpt4_turbo_fn'\n```\n\nIf you don't have decoded outputs, you can use `evaluate_from_model` which takes care of decoding (model and reference)\nfor you.\nHere's an\nexample:\n\n```bash\n# need a GPU for local models\nalpaca_eval evaluate_from_model \\\n  --model_configs 'oasst_pythia_12b' \\\n  --annotators_config 'alpaca_eval_gpt4_turbo_fn'      \n```\n\nHere the `model_configs` and `reference_model_configs` (optional) are paths to a directory that specifies the prompt,\nthe model\nprovider (here HuggingFace) and decoding parameters.\nSee [this directory](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fmodels_configs) for examples.\nFor all model providers that are available out-of-the-box\nsee [here](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fdecoders).\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>Information about annotators\u003C\u002Fb>\u003C\u002Fsummary>\n\n- **Caching**: by default all annotations are cached on\n  disk at `caching_path`. Annotations are thus never recomputed, which makes annotations faster, cheaper and allows for\n  reproducibility. This helps even when evaluating different models as many models\n  have\n  the same outputs.\n- **Output randomization** by default, we randomize over the examples of outputs, as we found that annotators tend to\n  prefer the first examples\n  they see.\n- **Batching** we provide code and examples to batch annotations, which decreases cost and time for annotations if the\n  prompt is long. See for\n  example [alpaca_farm_greedy_gpt4](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Falpaca_farm_greedy_gpt4).\n- **Pool of annotators** we provide code and examples to evaluate using a pool of automatic annotators, which is helpful\n  for replicating the variance of [human annotations](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14387). 
See for\n  example [alpaca_farm](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Falpaca_farm).\n- **Seeding based on instructions** For reproducibility and more fair comparison between models, we seed all\n  randomness (output order, order in batches,\n  examples for each annotator in a pool) based on the instruction.\n\n\u003C\u002Fdetails>\n\n\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">Making a new leaderboard\u003C\u002Fh2>\u003C\u002Fsummary>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ccode>>>> alpaca_eval make_leaderboard -- --help\u003C\u002Fcode>\u003C\u002Fsummary>\n\n```\nNAME\n    alpaca_eval make_leaderboard - Precompute and save an entire leaderboard for a given dataset \u002F evaluator \u002F set of models generations.\n\nSYNOPSIS\n    alpaca_eval make_leaderboard \u003Cflags>\n\nDESCRIPTION\n    Precompute and save an entire leaderboard for a given dataset \u002F evaluator \u002F set of models generations.\n\nFLAGS\n    --leaderboard_path=LEADERBOARD_PATH\n        Type: Optional[Union]\n        Default: None\n        The path to save the leaderboard to. The leaderboard will be saved as a csv file, if it already exists it will\n    --annotators_config=ANNOTATORS_CONFIG\n        Type: Union\n        Default: 'alpaca_eval_gpt4_turbo_fn'\n        The path the (or list of dict of) the annotator's config file.\n    --all_model_outputs=ALL_MODEL_OUTPUTS\n        Type: Union\n        Default: \u003Cfu...\n        The outputs of all models to add to the leaderboard. Accepts data (list of dictionary, pd.dataframe, datasets.Dataset) or a path to read those (json, csv, tsv potentially with globbing) or a function to generate those. If the path contains a globbing pattern, we will read all files matching the pattern and concatenate them. Each dictionary (or row of dataframe) should contain the keys that are formatted in the prompts. E.g. by default `instruction` and `output` with optional `input`. It should also contain a column `generator` with the name of the current model.\n    -r, --reference_outputs=REFERENCE_OUTPUTS\n        Type: Union\n        Default: \u003Cfunc...\n        The outputs of the reference model. Same format as `all_model_outputs` but without needing `generator`. By default, the reference outputs are the 003 outputs on AlpacaEval set.\n    -f, --fn_add_to_leaderboard=FN_ADD_TO_LEADERBOARD\n        Type: Callable\n        Default: 'evaluate'\n        The function to use to add a model to the leaderboard. If a string, it should be the name of a function in `main.py`. 
The function should take the arguments: `model_outputs`, `annotators_config`, `name`, `precomputed_leaderboard`, `is_return_instead_of_print`, `reference_outputs`.\n    --leaderboard_mode=LEADERBOARD_MODE\n        Type: str\n        Default: 'verified'\n        The mode of the leaderboard to save all new entries with.\n    -i, --is_return_instead_of_print=IS_RETURN_INSTEAD_OF_PRINT\n        Type: bool\n        Default: False\n        Whether to return the metrics instead of printing the results.\n    Additional flags are accepted.\n        Additional arguments to pass to `fn_add_to_leaderboard`.\n```\n\n\u003C\u002Fdetails>\n\nIf you want to make a new leaderboard using a single command (rather than multiple `alpaca_eval` calls), for your\ndesired evaluation\nset and evaluators, you can use the following:\n\n```bash\nalpaca_eval make_leaderboard \\\n  --leaderboard_path \u003Cpath_to_save_leaderboard> \\\n  --all_model_outputs \u003Cmodel_outputs_path> \\\n  --reference_outputs \u003Creference_outputs_path> \\\n  --annotators_config \u003Cpath_to_config.yaml>\n```\n\nwhere:\n\n- `leaderboard_path`: path to save the leaderboard to. The leaderboard will be saved as a csv file, if it already exists\n  it will append.\n- `all_model_outputs` : The json path to the outputs of all models to add to the leaderboard (as a single file or by\n  globbing multiple files). Each dictionary should contain\n  the keys (`instruction` and `output`) that are formatted in the prompts and a column `generator` with the name of the\n  current model. As an example\n  see [this file](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Falpaca_eval_all_outputs.json).\n- `reference_outputs` the path to the outputs of the reference model. Each dictionary should contain\n  the keys (`instruction` and `output`) that are formatted in the prompts. By\n  default, the reference outputs are the 003 outputs on AlpacaEval set.\n- `annotators_config`: The path to the annotator's config file. 
Defaults to `alpaca_eval_gpt4`.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">Making a new evaluator\u003C\u002Fh2>\u003C\u002Fsummary>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ccode>>>> alpaca_eval analyze_evaluators -- --help\u003C\u002Fcode>\u003C\u002Fsummary>\n\n```\nNAME\n    alpaca_eval analyze_evaluators - Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).\n\nSYNOPSIS\n    alpaca_eval analyze_evaluators \u003Cflags>\n\nDESCRIPTION\n    Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).\n\nFLAGS\n    --annotators_config=ANNOTATORS_CONFIG\n        Type: Union\n        Default: 'alpaca_eval_gpt4_turbo_fn'\n        The path the (or list of dict of) the annotator's config file.\n    -A, --Annotator=ANNOTATOR\n        Default: \u003Cclass 'alpaca_eval.annotators.pairwise_evaluator.PairwiseAn...\n        The annotator class to use.\n    --analyzer_kwargs=ANALYZER_KWARGS\n        Type: Optional[Optional]\n        Default: None\n        Additional arguments to pass to the analyzer.\n    -p, --precomputed_leaderboard=PRECOMPUTED_LEADERBOARD\n        Type: Union\n        Default: PosixPath('\u002FUsers\u002Fyanndubois\u002FDesktop\u002FGitHub\u002Falpaca_eval\u002Fsrc\u002F...\n        The precomputed (meta)leaderboard of annotators or a path to it (json, csv, or tsv).\n    --is_save_leaderboard=IS_SAVE_LEADERBOARD\n        Type: bool\n        Default: False\n        Whether to save the leaderboard (ie analyzed results).\n    --is_return_instead_of_print=IS_RETURN_INSTEAD_OF_PRINT\n        Type: bool\n        Default: False\n        Whether to return the leaderboard (ie analyzed results). If True, it will not print the results.\n    --is_overwrite_leaderboard=IS_OVERWRITE_LEADERBOARD\n        Type: bool\n        Default: False\n        Whether to overwrite the leaderboard if it already exists.\n    -m, --max_instances=MAX_INSTANCES\n        Type: Optional[Optional]\n        Default: None\n        The maximum number of instances to analyze.\n    --is_single_annotator=IS_SINGLE_ANNOTATOR\n        Type: bool\n        Default: False\n        Whether to analyze a single annotator. If True, will not be able to estimate the annotator's bias.\n    -l, --leaderboard_mode_to_print=LEADERBOARD_MODE_TO_PRINT\n        Type: str\n        Default: 'minimal'\n        The mode of the leaderboard to print.\n    -c, --current_leaderboard_mode=CURRENT_LEADERBOARD_MODE\n        Type: str\n        Default: 'minimal'\n        The mode of the leaderboard to save all new entries with.\n    -o, --output_path=OUTPUT_PATH\n        Type: Union\n        Default: 'auto'\n        Path to save the leaderboard and annotataions. If None, we don't save.\n    Additional flags are accepted.\n        Additional arguments to pass to `Annotator`.\n```\n\n\u003C\u002Fdetails>\n\nAlpacaEval provides a simple way of making new evaluators. All you need is to make a new `configs.yaml` configuration\nfile, which you will then pass\nas `--annotators_config \u003Cpath_to_config.yaml>` to `alpaca_eval`.\nHere are some ways you can make a new evaluator:\n\n- **Changing the prompt**: Write a new prompt in a text file and specify the path in `prompt_template` of the\n  configuration file. Paths are relative to the configuration file.\n- **Changing decoding parameters**: Specify the desired parameters in `completions_kwargs` in the configuration file. 
To\n  see all available parameters refer to the docstrings of the corresponding\n  function [in this file](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fdecoders\u002F__init__.py)\n  specified by `fn_completions`\n  in the configuration file.\n- **Changing the model**: Specify the desired model in `model_name` and the corresponding\n  prompt in `prompt_template`. If the model comes from another provider you\n  will\n  have\n  to change `fn_completions` which maps to the corresponding function\n  in [this file](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fdecoders\u002F__init__.py). We\n  provide `fn_completions` functions to use models from OpenAI, Anthropic, Cohere, or HuggingFace. To\n  install packages needed for\n  all providers\n  use `pip install alpaca_eval[all]`.\n\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>Other parameters in the configuration file\u003C\u002Fb>\u003C\u002Fsummary>\n\nThe easiest way is to check the docstrings\nof [`SinglePairwiseAnnotator`](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fannotators\u002Fpairwise_evaluator.py#L537).\nHere are some important ones:\n\n```\nParameters\n----------\nprompt_template : path\n    A prompt that will be given to `fn_prompter` or path to the prompts. Path is relative to\n    `evaluators_configs\u002F`\n\nfn_completion_parser : callable or str\n    Function in `completion_parsers.py` to use for parsing the completions into preferences. For each completion,\n    the number of preferences should be equal to the batch_size if not we set all the preferences in that batch to\n    NaN.\n\ncompletion_parser_kwargs : dict\n    Kwargs for fn_completion_parser.\n\nfn_completions : callable or str\n    Function in `decoders.py` to use for decoding the output.\n\ncompletions_kwargs : dict\n    kwargs for fn_completions. E.g. model_name, max_tokens, temperature, top_p, top_k, stop_seq.\n\nis_randomize_output_order : bool\n    Whether to randomize output_1, output_2 when formatting.\n\nbatch_size : int\n    Number of examples that will be added in a single prompt.\n```\n\n\u003C\u002Fdetails>\n\nOnce you have made the evaluator, you can also analyze it and add it to the _evaluator's_ [leaderboard](#evaluators) using the\nfollowing command:\n\n```bash\nalpaca_eval analyze_evaluators --annotators_config '\u003Cpath_to_config.yaml>'    \n```\n\nTo estimate the bias and variance, this evaluates every example with 4 seeds, i.e., 2.5K\nevaluations.\nIf you want a cheaper evaluation you can use a single seed using `--is_single_annotator True` which will skip the\nestimation of bias and variance.\n\n\u003C\u002Fdetails>\n\n# Contributing\n\nWe are accepting PRs for new models, evaluators, and eval sets, in addition to bug fixes.\nWe will update the [leaderboard website](https:\u002F\u002Ftatsu-lab.github.io\u002Falpaca_eval\u002F) regularly with new community\ncontributions.\nWe have also created a [support discord](https:\u002F\u002Fdiscord.gg\u002FGJMxJSVZZM) for AlpacaEval in case you run into any issues\nand\nwish to ask for help from the community.\n\nTo get started, please first fork the repo and install the package from source with `pip install -e .`\n\n## Contributing a model\n\nFirst, you'll need to add a model config definition in the [models_configs](src\u002Falpaca_eval\u002Fmodels_configs\u002F) folder. 
As\nan example, you can look at\nthe [falcon-7b-instruct yaml](src\u002Falpaca_eval\u002Fmodels_configs\u002Ffalcon-7b-instruct\u002Fconfigs.yaml). Please make sure the\nfolder name and key name in the yaml match exactly.\n\nThen, please follow the steps in [Evaluating a model](#evaluating-a-model) to run inference on the model to produce\noutputs on the eval set and score the model according to one of the evaluators.\nAn example command may look like:\n\n```sh\nalpaca_eval evaluate_from_model \\\n  --model_configs 'falcon-7b-instruct'\n```\n\nAfter running this command, you should have generated an outputs json and a new entry in the corresponding [leaderboard\nfile](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fleaderboards\u002Fdata_AlpacaEval). Please make a PR\nwith the\nconfig, outputs file, and updated leaderboard.\n\nConcretely you should do something like:\n\n1. Fork the repository in github\n2. Clone the forked repository `git clone \u003CURL>`\n3. Make a model config at `src\u002Falpaca_eval\u002Fmodels_configs\u002F\u003Cmodel_name>` and evaluate it `evaluate_from_model --model_configs '\u003Cmodel_name>'`\n4. Add the model configs, output, and leaderboard entry to the forked repository\n```sh\ngit add src\u002Falpaca_eval\u002Fmodels_configs\u002F\u003Cmodel_name> # add the model config\ngit add src\u002Falpaca_eval\u002Fleaderboards\u002F # add the actual leaderboard entry\ngit add src\u002Falpaca_eval\u002Fmetrics\u002Fweights # add the weights for LC\ngit add -f results\u002F\u003Cmodel_name>\u002Fmodel_outputs.json # force add the outputs on the dataset\ngit add -f results\u002F\u003Cmodel_name>\u002F*\u002Fannotations.json # force add the evaluations from the annotators\ngit commit -m \"Add \u003Cmodel_name> to AlpacaEval\"\ngit push\n``` \n5. Create a [pull request on AlpacaEval](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpulls)\n\nNote: if you are generating outputs outside of AlpacaEval you should still add a model config but with `fn_completions: null`. \nSee [this config](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fmodels_configs\u002Fdolphin-2.2.1-mistral-7b\u002Fconfigs.yaml) for an example.\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch3 tabindex=\"-1\" dir=\"auto\">Getting your model verified\u003C\u002Fh3>\u003C\u002Fsummary>\n\n\u003Cp align=\"center\">\n\u003Cimg align=\"center\" alt=\"verified.png\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_af9131fb80f7.png\" width=\"500\"\u002F>\n\u003C\u002Fp>\n\nA verified result in AlpacaEval indicates that a core maintainer has decoded the outputs from the model and performed the evaluation. Unfortunately, we, the AlpacaEval maintainers, lack the resources to verify all the models and so we will only do that for models that are in the top-5 of the leaderboard. We apologize for any inconvenience this may cause and appreciate your understanding. To have your model verified, please follow the steps below:\n\n1. Contact `@yann`  on Discord, or email us if you have our email, providing a brief rationale for why your model should be verified.\n2. Await our response and approval before proceeding.\n3. Prepare a script to decode from your model that does not require a GPU, typically the same script used for your model contribution. 
It should run using `alpaca_eval evaluate_from_model --model_configs '\u003Cyour_model_name>'` without requiring a local GPU.\n4. Generate temporary API keys for running the script and share them with us. Specifically, we need the keys for both decoding your model and for evaluation (e.g., OpenAI or Anthropic key).\n5. We will execute `alpaca_eval evaluate_from_model --model_configs '\u003Cyour_model_name>'`, update the results, and inform you so that you can revoke the temporary keys.\n\nNote that we will not re-evaluate the same model. Due to sampling variance, the results might slightly differ from your initial ones. We will replace your previous community results with the verified ones. \n\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">Contributing an evaluator\u003C\u002Fh2>\u003C\u002Fsummary>\n\nPlease first follow the directions in [Making a new evaluator](#making-a-new-evaluator).\nOnce you're created the annotator config, we ask that you create a new leaderboard for the annotator by evaluating the\nminimal set of models. The outputs for these models can be found by\ndownloading [alpaca_eval_all_outputs.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Falpaca_eval_all_outputs.json).\n\n```bash\nalpaca_eval make_leaderboard \\\n  --leaderboard_path src\u002Falpaca_eval\u002Fleaderboards\u002Fdata_AlpacaEval\u002F\u003Cevaluator>_leaderboard.csv \\\n  --all_model_outputs alpaca_eval_all_outputs.json \\\n  --annotators_config \u003Cevaluator_config>\n```\n\nThen, please create a PR with the annotator config and leaderboard csv.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">Contributing an eval set\u003C\u002Fh2>\u003C\u002Fsummary>\n\nTo contribute a new eval set, you'll first need to specify a set of textual instructions.\nThen, you'll need to specify a set of reference outputs (model win-rates are computed against this reference).\nFor ease of use, you may use the default [text-davinci-003](src\u002Falpaca_eval\u002Fmodels_configs\u002Ftext_davinci_003\u002F) reference\nconfig.\n\nPlace these together into a json, where each entry specifies the fields `instruction`, `output`, and `generator`. You\ncan look to [alpaca_eval.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Falpaca_eval.json) as a\nguide (the `dataset` field is not necessary).\n\nFinally, we ask that you create a minimal leaderboard on this new evaluation set. You can do this with the following:\n\n```bash\nalpaca_eval make_leaderboard \\\n  --leaderboard_path \u003Csrc\u002Falpaca_eval\u002Fleaderboards\u002Fdata_AlpacaEval\u002Fyour_leaderboard_name.csv> \\\n  --all_model_outputs alpaca_eval_all_outputs.json \\\n  --reference_outputs \u003Cpath_to_json_file>\n```\n\nPlease submit a PR with the eval set json and corresponding leaderboard csv.\n\n\n\u003C\u002Fdetails>\n\n\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">Contributing a completion function\u003C\u002Fh2>\u003C\u002Fsummary>\n\nCurrently, we allow different completion functions, e.g., `openai`, `anthropic`, `huggingface_local`, `huggingface_hub_api` ... If you want to contribute a new completion function \u002F API with which to perform inference then follow those steps:\n1. add a file \u003Cname>.py with a function  `\u003Cname>_completions(prompts : Sequence[str], model_name :str, ... 
)`  in the [decoder folder](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fdecoders). This function should take the prompts + kwargs as arguments and return the completions. Please look at other completion functions in the directory for templates, e.g. [huggingface_local_completions](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fdecoders\u002Fhuggingface_local.py) or [anthropic](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fdecoders\u002Fanthropic.py).\n2. add `\u003Cname>_completions` and dependencies in [__init__](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fdecoders\u002F__init__.py). Again, you can follow the example of [huggingface_local_completions](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fdecoders\u002F__init__.py#L30)\n3. update optional dependencies in [setup.py](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsetup.py)\n4. add a model you want to evaluate in the [models configs](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fmodels_configs)\n5. evaluate your model using `alpaca_eval evaluate_from_model --model_configs '\u003Cmodel_configs>'`\n6. (optional) push the results from the previous step to the AlpacaEval leaderboard following [those steps](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain#contributing-a-model)\n\nFeel free to start a PR early; we'll be able to provide some help in the process!\n\n\u003C\u002Fdetails>\n\n# Limitations\n\nThe AlpacaEval evaluation pipeline, like other current evaluators, has important limitations and should therefore not be\nused as a replacement for human evaluation in important settings, such as deciding whether a model is ready to be\ndeployed.\nThose limitations can broadly be clustered into 3 categories:\n\n1. **Instructions might not be representative of real-usage**: the AlpacaEval set contains examples from a variety of\n   datasets ([self-instruct](https:\u002F\u002Fgithub.com\u002Fyizhongw\u002Fself-instruct),\n   [open-assistant](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenAssistant\u002Foasst1\u002Fviewer\u002FOpenAssistant--oasst1\u002Fvalidation), [vicuna](https:\u002F\u002Flmsys.org\u002Fblog\u002F2023-03-30-vicuna\u002F), [koala](https:\u002F\u002Fgithub.com\u002Farnav-gudibande\u002Fkoala-test-set), [hh-rlhf](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAnthropic\u002Fhh-rlhf\u002Fviewer\u002FAnthropic--hh-rlhf\u002Ftest))\n   which might not be representative of real-usage and advanced applications of better models like GPT4. This likely makes the best closed models (GPT4 \u002F Claude \u002F ChatGPT \u002F ...) seem more similar to the open models than they are. Indeed, those closed models seem to be pretrained\u002Ffinetuned on much more diverse data.
See for example [this blog](https:\u002F\u002Fmedium.com\u002F@marcotcr\u002Fexploring-chatgpt-vs-open-source-models-on-slightly-harder-tasks-aa0395c31610)\n   for preliminary results on more complex instructions.\n   Note, however, that in [AlpacaFarm](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14387) we showed that win-rates on our evaluation set\n   are highly correlated (0.97 R2) with win-rates on instructions from user interactions with the Alpaca Demo.\n   Furthermore, the AlpacaEval leaderboard shows a larger\n   gap between the open models and OpenAI models than other leaderboards\n   (e.g. [lmsys](https:\u002F\u002Flmsys.org\u002Fblog\u002F2023-03-30-vicuna\u002F)).\n\n2. **Biases of automatic annotators**: the raw automatic annotators seem to have implicit biases. In particular, we found\n   that they tend to prefer longer outputs and outputs that contain lists (e.g. 0.68 \u002F 0.69 for `alpaca_eval_gpt4`\n   and 0.62 \u002F 0.58 for `claude`).\n   Although we found that humans have similar biases (0.64 \u002F 0.61), we believe that this could be more of a limitation\n   of the human annotation pipeline we used rather than a true human bias. More generally, through qualitative analysis, we\n   found that automatic annotators give more importance to the style\n   of the output than to its content (e.g. factuality).\n   Finally, we found that automatic evaluators tend to prefer outputs from models that are similar (likely trained on\n   the same data), as suggested by the big difference between ChatGPT\u002FGPT4 on `claude`'s and `alpaca_eval_gpt4`'s\n   leaderboards. Note that the length bias is partially mitigated in our length-controlled win-rates.\n3. **Lack of safety evaluation**: importantly, AlpacaEval only evaluates the instruction-following capabilities of\n   models rather than the harm that they could cause (e.g. toxic behavior or bias). As a result, the small gap between\n   current ChatGPT and the best open source models **should not** be interpreted as meaning that the latter are ready to be\n   deployed.\n\nBeyond those limitations of the evaluation pipeline, there are also limitations in our validation of the\nevaluators and in our [proposed approach](#analyzing-an-eval-set) to selecting evaluation sets.\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>Limitations about our validation pipeline\u003C\u002Fb>\u003C\u002Fsummary>\n\nFirst, our validation of evaluators based on human cross-annotations suffers from the following limitations: (1) we\nqualitatively found that our crowd-workers tend to also favor style, such as length and presence of lists, over\nfactuality;\n(2) this does not validate whether win-rates against a reference model are a good evaluation strategy in the first place;\n(3) preferences from 16 crowd-workers are not representative of the preferences of all humans.\n\nSecond, our suggested approach to selecting evaluation sets based on statistical power suffers from the following\nlimitations: (1) statistical power does not ensure the right direction, e.g. you can have an unnatural set of\ninstructions where Alpaca \"performs\" better than a better model; and\n(2) this can push users to select data to support the hypothesis that they want to validate.\n\n\u003C\u002Fdetails>\n\n\n# Additional analysis and plots\n\n\n[\u002F\u002F]: # (AlpacaEval provides a few visualization tools to help you analyze and improve your automatic evaluation pipeline.
We)\n\n[\u002F\u002F]: # (briefly explain)\n\n[\u002F\u002F]: # (them here and provide)\n\n[\u002F\u002F]: # (notebooks for more analysis. )\n\n[\u002F\u002F]: # (For a description of all the metrics we consider)\n\n[\u002F\u002F]: # (refer to [How exactly are those metrics computed?]&#40;https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval#evaluators&#41;)\n\n## Length-controlled AlpacaEval (LCAE)\n\n\n**Length-controlled AlpacaEval Visualizations:**\n[![analyzing an evaluator](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fnotebooks\u002Ffigured_length_controlled.ipynb)\n\n**Length-controlled AlpacaEval Development:**\n[![analyzing an evaluator](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fnotebooks\u002Flength_controlled.ipynb)\n\nThe notebook shows the different options that we considered for mitigating the length bias of automatic annotators.\n\nHere we briefly summarize the main results:\n- **LCAE increases the correlation with Chat Arena to 0.98** from 0.94 for AlpacaEval 2.0. This makes LCAE the most highly correlated benchmark with Chat Arena, as seen in the plot below.\n\n\n\u003Cp float=\"left\" align=\"middle\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_3729ec9ba6f0.png\" alt=\"LC AlpacaEval is the most highly correlated benchmark with Chat Arena.\" width=\"500\"\u002F>\n\u003C\u002Fp>\n\n- **LCAE decreases length gameability**: one of the major issues of AlpacaEval is that you can increase your win-rate by increasing the length of your outputs. For example, in AlpacaEval 2.0 the win-rate for the baseline (50%) increases to 64% when prompted to “give as much detail as possible” and decreases to 23% when prompted to “be as concise as possible while still providing all the necessary information to answer the question”. More generally, the relative length gameability was ~21% for AlpacaEval and decreases to ~6% for LCAE, so it's 3x less gameable through prompt length. This is shown in the plot below.\n\n\n\u003Cp float=\"left\" align=\"middle\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_68c997c99880.png\" alt=\"LC AlpacaEval decreases length gameability of the benchmark.\" width=\"500\"\u002F>\n\u003C\u002Fp>\n\n- **We can predict performance for different baselines**: another benefit of using a GLM to control for length bias is that we now have a model that can predict the win-rate of a model against different baselines. In particular, our GLM has many nice properties, for example `win_rate(m,b) = 1 - win_rate(b,m) \\in [0,1]` and `win_rate(m,m) = 0.5`. This is shown in the plot below.\n\n\n\u003Cp float=\"left\" align=\"middle\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_0f60e430f35c.png\" alt=\"Predicted win rate for different baselines\" width=\"500\"\u002F>\n\u003C\u002Fp>\n\n\nFinally, note that we are only controlling for length bias. There are other known biases that we are not controlling for, such as the fact that auto-annotators prefer outputs similar to those of their underlying model. Although we could control for that, in practice we have found it to be less of an issue than length bias.
This is for two reasons: (1) it mostly affects a single model in the leaderboard, because fine-tuning on outputs from the auto-annotator doesn't seem to impact the win-rate as much; and (2) the bias is actually less strong than one might think. For example, we show below a subset of the leaderboards auto-annotated by three different models, and we see that the ranking of models is exactly the same. In particular, `claude-3-opus` prefers `gpt4_preview`, and `mistral-large` prefers the former two.\n\n\u003Cp float=\"left\" align=\"middle\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_a2bfd8f7f18b.png\" alt=\"Leaderboard by different auto-annotators\" width=\"500\"\u002F>\n\u003C\u002Fp>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">Analyzing an evaluator\u003C\u002Fh2>\u003C\u002Fsummary>\n\n[\u002F\u002F]: # (## Analyzing an evaluator)\n\n**Caution**: all the following results are about AlpacaEval 1.0 and have not been updated since.\n\n**Analyzing evaluators:**\n[![analyzing an evaluator](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fnotebooks\u002Fanalyzing_annotators.ipynb)\n\nAs we saw in [the evaluator's leaderboard](#evaluators), there are many metrics to consider when selecting an evaluator,\ne.g. the quality, price, and speed. To assist with the selection of the evaluator, we provide a few functions to plot those\nmetrics.\nThe following shows, for example, the price\u002Ftime\u002Fagreement of the different evaluators.\n\n![plot_quality_vs_price_and_time.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_c8f5171445cc.png)\n\nHere we see that `alpaca_eval_gpt4` performs very well and is better than humans on all the considered metrics.\n\nPreviously we only considered the agreement with human annotators overall.\nAn additional validation that one could do is checking whether making a leaderboard using our\nautomatic annotator gives similar results to a leaderboard from humans.\nTo enable such analysis, we release [human\nannotations](#data-release) of outputs from 22 methods from [AlpacaFarm](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_farm), i.e.,\n22*805 = ~18K annotations.
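A rough sketch of how one might turn these released annotations into per-model win-rates; the file name here is a placeholder for a local copy of the released data, and the columns are assumed to follow the layout documented under \"Interpreting annotations\" below (`preference` between 1 and 2, `generator_2` being the evaluated model):\n\n```python\n# Sketch only (not the repo's analysis code): per-model human win-rates from the\n# released pairwise annotations, assuming `preference` in [1, 2] and a `generator_2`\n# column naming the evaluated model.\nimport pandas as pd\n\nhuman = pd.read_json(\"human_annotations.json\")  # placeholder path to the released annotations\nhuman[\"win\"] = human[\"preference\"] - 1  # 0 = reference preferred, 1 = model preferred, 0.5 = tie\nhuman_win_rates = 100 * human.groupby(\"generator_2\")[\"win\"].mean()\nprint(human_win_rates.sort_values(ascending=False))\n```\n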
As a result we\ncan\ntest\nthe correlation between the win-rates of the 22 models as evaluated by the humans and our automatic annotator.\nNote that this is arguably a better way of selecting an automatic evaluator than using \"human agreement [%]\" but is\nexpensive given that it requires 18K\nannotations.\nThe plot below shows such correlation for the `alpaca_eval_gpt4` evaluator.\n\n\u003Cp float=\"left\" align=\"middle\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_aa8801bfd20b.png\" alt=\"Correlation between humans and alpaca_eval_gpt4\" width=\"400\"\u002F>\n\u003C\u002Fp>\n\nWe see that the `alpaca_eval_gpt4` leaderboard is highly correlated (0.94 Pearson correlation) to the leaderboard from\nhumans, which further\nsuggests that automatic evaluation is a good proxy for human evaluation.\nFor the code and more analysis,\nsee [this notebook](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fnotebooks\u002Fanalyzing_annotators.ipynb), or the\ncolab notebook above.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">Analyzing an eval set\u003C\u002Fh2>\u003C\u002Fsummary>\n\n[\u002F\u002F]: # (## Analyzing an eval set)\n\n**Caution**: all the following results are about AlpacaEval 1.0 and have not been updated since.\n\n**Making evaluation sets:**\n[![analyzing an evaluator](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fnotebooks\u002Fanalyzing_evalset.ipynb)\n\nWhen creating an evaluation set there are two main factors to consider: how much data to use? and what data?\n\nOne way of answering those question is by considering a leaderboard of models that you believe are of different\nquality and checking what and how much data is needed to distinguish between them in a statistically significant way.\nWe will do so below using a paired t-test to test if the difference in win-rates between every pair of models\nis\nstatistically significant.\n\nFirst, let us consider the question of how much data to use.\nBelow we show the number of random samples needed from AlpacaEval for the paired t-test to give a p-value \u003C 0.05 for\neach pair of models in the minimal `alpaca_eval_gpt4`\nleaderboard.\nGrey cells correspond to pairs that are not significantly different on the 805 samples.\ny- and x-axis are ordered by the win-rate of the first and second model respectively.\n\n\n\u003Cp float=\"left\" align=\"middle\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_2173d751396e.png\" alt=\"Number of samples needed to distinguish pairs in the Claude leaderboard\" width=\"500\"\u002F>\n\u003C\u002Fp>\n\nWe see that most models can already be distinguished with 50 samples, and that 150 samples allows distinguishing the\nmajority of pairs (74 out of 78). This suggests that we can decrease the evaluation set size by a factor of\n4 when testing two models that have similar performance gaps as those on the\nminimal `alpaca_eval_gpt4` [leaderboard](#models).\n\nThe second question is what data to use. Again we can try to answer this question from a statistical power perspective:\nwhat data allows to best distinguish between models. 
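The test itself is simple. Below is a minimal sketch of such a paired t-test (using scipy; the per-instruction preferences are random placeholders standing in for two models' annotations against the same reference on the same 805 instructions):\n\n```python\n# Minimal sketch of the paired t-test used in this section (not the notebook's exact code).\nimport numpy as np\nfrom scipy import stats\n\nrng = np.random.default_rng(0)\npref_a = rng.uniform(1, 2, size=805)  # placeholder: model A's per-instruction preferences\npref_b = rng.uniform(1, 2, size=805)  # placeholder: model B's per-instruction preferences\n\n# paired test: both models are evaluated on the same instructions\nt_stat, p_value = stats.ttest_rel(pref_a, pref_b)\nprint(f\"t={t_stat:.2f}, p={p_value:.3g}\")  # p \u003C 0.05 => the win-rates differ significantly\n```\n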
Let's consider this for all the datasets that are part of\nAlpacaEval, but let us control for the size of the evaluation sets as we only care about the quality of the data. The\nfollowing plot shows the p-values from the paired t-test of each pairs of models on 80 examples of each subset of\nAlpacaEval.\n\n![plot_paired_ttests_per_dataset.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_78d595bcdb16.png)\n\nWe see for example that the self-instruct dataset yields the least statistical power, which suggests that one could\nremove this dataset from the evaluation set.\nThe exact reason should be analyzed in future work.\nFor the code and more analysis\nsee [this notebook](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fnotebooks\u002Fanalyzing_evalset.ipynb), or the\ncolab notebook above.\n\n\u003C\u002Fdetails>\n\n# Citation\n\nPlease consider citing the following depending on what you are using and referring to:\n- **Code, results, and general benchmark**: `alpaca_eval` (this repo). Specify whether you are using AlpacaEval or AlpacaEval 2.0. For length-controlled win-rates see below.\n- **Length-controlled (LC) win rates**: `alpaca_eval_length`.\n- **Human annotations**: `dubois2023alpacafarm` ([AlpacaFarm](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14387))\n- **AlpacaEval evaluation set**: `alpaca_eval`  and [self-instruct](https:\u002F\u002Fgithub.com\u002Fyizhongw\u002Fself-instruct),\n[open-assistant](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenAssistant\u002Foasst1\u002Fviewer\u002FOpenAssistant--oasst1\u002Fvalidation), [vicuna](https:\u002F\u002Flmsys.org\u002Fblog\u002F2023-03-30-vicuna\u002F), [koala](https:\u002F\u002Fgithub.com\u002Farnav-gudibande\u002Fkoala-test-set), [hh-rlhf](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAnthropic\u002Fhh-rlhf\u002Fviewer\u002FAnthropic--hh-rlhf\u002Ftest).\n\nHere are the bibtex entries:\n\n```\n@misc{alpaca_eval,\n  author = {Xuechen Li and Tianyi Zhang and Yann Dubois and Rohan Taori and Ishaan Gulrajani and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto },\n  title = {AlpacaEval: An Automatic Evaluator of Instruction-following Models},\n  year = {2023},\n  month = {5},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval}}\n}\n```\n\n```\n@article{dubois2024length,\n  title={Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators},\n  author={Dubois, Yann and Galambosi, Bal{\\'a}zs and Liang, Percy and Hashimoto, Tatsunori B},\n  journal={arXiv preprint arXiv:2404.04475},\n  year={2024}\n}\n```\n\n```\n@misc{dubois2023alpacafarm,\n  title={AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback}, \n  author={Yann Dubois and Xuechen Li and Rohan Taori and Tianyi Zhang and Ishaan Gulrajani and Jimmy Ba and Carlos Guestrin and Percy Liang and Tatsunori B. 
Hashimoto},\n  year={2023},\n  eprint={2305.14387},\n  archivePrefix={arXiv},\n  primaryClass={cs.LG}\n}\n```\n\n# More information\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">Length-Controlled Win Rates\u003C\u002Fh2>\u003C\u002Fsummary>\n\nLength-controlled (LC) win-rates are a debiased version of the win-rates that controls for the length of the outputs.\n\nThe main idea is that for each model we fit a logistic regression to predict the preference of the autoannotator given: (1) the instruction, (2) the model, and (3) the difference of length between the baseline and model output.\nGiven such a logistic regression we can then try to predict the counterfactual \"what would the preference be if the model's output had the same length as the baseline\" by setting the length difference to 0.\nBy averaging over this length-controlled preference, we then obtain the length-controlled win-rate.\nThe exact form of the logistic regression is chosen such that the interpretation of LC win rates is similar to that of the raw win rates; for example, for any models `m1` and `m2` we have `win_rate(m1, m2) = 1 - win_rate(m2, m1) \\in [0,1]` and `win_rate(m1, m1) = 0.5`.\nLength-controlled win-rates increase the correlation between AlpacaEval's leaderboard and Chat Arena from **0.93 to 0.98 Spearman correlation, while significantly decreasing the length gameability of the annotator**.\nFor more information and results about length-controlled win-rates see [this notebook](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fnotebooks\u002Flength_controlled.ipynb).\n\nThis idea of estimating the controlled direct effect, by predicting the outcome while conditioning on the mediator (the length difference), is common in statistical inference.\n\nTo get LC win rates on previously annotated models, you can use the following command:\n\n```bash\npip install -U alpaca_eval\nalpaca_eval --model_outputs … --is_recompute_metrics_only True\n```\n\n\u003C\u002Fdetails>\n\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">AlpacaEval 2.0\u003C\u002Fh2>\u003C\u002Fsummary>\n\nAlpacaEval 2.0 is a new version of AlpacaEval. Here are the differences:\n- **reference: `gpt4_turbo`**: we upgraded the baseline from `text-davinci-003` to `gpt4_turbo` to make the benchmark more challenging and have a metric that better reflects the current state of the art.\n- **annotator: `weighted_alpaca_eval_gpt4_turbo`**: we improved the annotator in quality and price. First, we use the `gpt4_turbo` model for annotating, which is approximately 2x cheaper than `gpt4`. Second, we changed the prompt such that the model outputs a single token, which further reduced cost and latency. Finally, instead of using a binary preference, we use the logprobs to compute a continuous preference, which gives the final weighted win-rate. Note that the latter two changes had the surprising effect of decreasing the annotators' length bias.\n\nBy default, AlpacaEval 2.0 is used starting from `pip install alpaca_eval==0.5`.
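To make the \"weighted\" part concrete, here is a rough illustration (an assumption about the mechanism, not the actual `weighted_alpaca_eval_gpt4_turbo` parser) of how the logprobs of the two possible verdict tokens can be turned into a continuous preference and a weighted win-rate:\n\n```python\n# Rough illustration of a logprob-weighted preference; the actual annotator's prompt,\n# verdict tokens, and parsing may differ.\nimport math\n\n\ndef continuous_preference(logprob_verdict_1: float, logprob_verdict_2: float) -> float:\n    \"\"\"Preference in [1, 2]; 2 means output_2 is certainly preferred.\"\"\"\n    p1, p2 = math.exp(logprob_verdict_1), math.exp(logprob_verdict_2)\n    return 1.0 + p2 \u002F (p1 + p2)  # renormalize over the two verdict tokens\n\n\n# weighted win-rate over a few examples: mean of (preference - 1), as a percentage\nprefs = [continuous_preference(-0.1, -2.4), continuous_preference(-3.0, -0.05)]\nprint(100 * sum(p - 1 for p in prefs) \u002F len(prefs))\n```\n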
If you wish to use the old configs by default, you can set `IS_ALPACA_EVAL_2=False` in your environment.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">Data Release\u003C\u002Fh2>\u003C\u002Fsummary>\n\nAs part of AlpacaEval, we release the following data:\n\n- **Human annotations (17701)** in order to develop and understand automatic evaluators, we release all the human\n  pairwise\n  evaluation that we collected for AlpacaFarm. This contains comparisons between 22 models with the `text-davinci-003`\n  reference on the AlpacaFarm evaluation set. Annotations are from a pool of 16 crowd workers on Amazon Mechanical Turk.\n  The different models are: 6 from OpenAI, 2 SFT models from AlpacaFarm, 13 RLHF methods from AlpacaFarm, and LLaMA 7B.\n- **Human cross-annotations (2596)** in order to further analyze automatic evaluators we selected (via stratified\n  sampling\n  across models and datasets) 650 examples from the AlpacaFarm evaluation set and collected 4 human annotations per\n  example.\n- **AlpacaEval set (805)** we made slight modifications\u002Fsimplification of the AlpacaFarm evaluation set. In particular,\n  we first merged\n  the instruction and input fields into a single instruction field. This affects 1\u002F4 of the examples in the AlpacaFarm\n  evaluation set, all of which are from the [self-instruct evaluation set](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.10560). Second we\n  regenerated the text-davinci-003 reference outputs without limiting the length of its outputs.\n\nFor more details about the human annotations refer to the [AlpacaFarm paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14387).\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">Differences with AlpacaFarm\u003C\u002Fh2>\u003C\u002Fsummary>\n\nAlpacaEval is an improvement and simplification of the automatic pairwise preference simulator\nfrom [AlpacaFarm](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_farm).\nOutside AlpacaFarm, you should be using AlpacaEval.\nHere are the main differences:\n\n- **AlpacaEval merges instructions and inputs**: The AlpacaEval evaluation is the same as the AlpacaFarm evaluation\n  except that the instruction and input fields are merged as `{instruction}\\n\\n{input}`. This affects 1\u002F4 of the\n  examples in the AlpacaFarm evaluation set (the [self-instruct](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.10560) subset).\n  This simplification provides a more fair comparison for models that were not trained by distinguishing between\n  the two fields.\n- **AlpacaEval handles longer generations**: Models in AlpacaFarm were limited to a maximum number of 300 tokens for\n  generations. We\n  change this number to 2000 for AlpacaEval. Note that this also affects the reference generations (`text-davinci-003`),\n  so the results on AlpacaEval are not comparable to those on AlpacaFarm even for examples that had no input\n  field.\n- **AlpacaEval removes intra- and inter-annotator variance**: The AlpacaFarm simulator replicates human annotation in\n  terms of both mode behavior and diversity.\n  In particular, AlpacaFarm's simulator uses a pool of models and prompts and adds noise to replicate human intra- and\n  inter-annotator variance.\n  If the goal is to use an automatic annotator for evaluation or simply training better models, then this variance\n  may not be desirable. The default annotators in AlpacaEval thus don't have this variance. 
We give the option to add it\n  back by\n  using `--anotators_config 'alpaca_farm'` and `--p_label_flip 0.25` when creating an evaluator.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">Related work\u003C\u002Fh2>\u003C\u002Fsummary>\n\nThere have been several work that propose new automatic annotators for instruction-following models. Here we list the\nones that we are aware of and discuss how they differ from ours. We evaluated all of those\nin [our evaluator's leaderboard](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval#evaluators).\n\n- **Vicuna\u002Flmsys** The lmsys annotator (`lmsys_gpt4`) evaluates the pair by asking the annotator a score from 1-10 for\n  each output, and then selecting the output with the highest score as preferred. They do not randomize over output\n  order and they ask an explanation _after_ the score. Overall, we found that this annotator has strong bias towards\n  longer outputs (0.74) and relatively low correlation with human annotations (63.2).\n- **AlpacaFarm** The best AlpacaFarm annotator (`alpaca_farm_greedy_gpt4`) evaluates the pair by directly asking the\n  annotator\n  which output it prefers. Furthermore, it batches 5 examples together to amortize the length of the prompt and\n  randomizes the order of outputs. Overall, we\n  found that this annotator has much less bias towards longer outputs (0.60) and is faster (878 seconds\u002F1000 examples)\n  than others. It has a\n  slightly higher correlation with the majority of human annotations (66.4) than humans themselves (65.7).\n  However, it is more expensive ($15.3\u002F1000 examples) and doesn't work with very long outputs given the batching.\n- **Aviary** The Aviary annotator (`aviary_gpt4`) asks the annotator to order the output by its preference, rather than\n  simply selecting the preferred output. It does not randomize the order of outputs and uses high temperature for\n  decoding (0.9). Overall, we found that this annotator has relatively strong bias towards longer outputs (0.70) and\n  very high\n  correlation with human annotations (69.1). By decreasing the temperature and randomizing the order of outputs,\n  we [further improved](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002FREADME.md)\n  the correlation to 69.8 (`improved_aviary_gpt4`) but this further increased the length bias to 0.73.\n\nOur `alpaca_eval_gpt4` is a mix between the AlpacaFarm and Aviary annotators. It asks the annotator to order the outputs\nby preference, but it uses temperature 0, randomizes over outputs, and made some modifications to the prompt to decrease\nlength bias to 0.68.\n\nOther related work include recent papers which analyze automatic evaluators.\nFor example:\n\n- [AlpacaFarm Appx C](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14387)\n  and [Large Language Models are not Fair Evaluators](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.17926v1) both found that automatic\n  annotators have\n  a position bias.\n- [AlpacaFarm Sec. 5.2.](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14387)\n  and [The False Promise of Imitating Proprietary LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.15717) both found that\n  automatic\n  annotators favor style (e.g. 
use of list, tone, word choice, length) over factuality.\n\n\u003C\u002Fdetails>\n\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">Interpreting annotations\u003C\u002Fh2>\u003C\u002Fsummary>\n\nFor all models you can find the auto-annotations under `results\u002F\u003Cmodel_name>\u002F*\u002Fannotations.json`. The annotations have the following columns:\n- `instruction`: the prompt\n- `generator_1`: the baseline model\n- `output_1`: the output of the baseline model\n- `generator_2`: the model being evaluated\n- `output_2`: the output of the model being evaluated\n- `annotator`: the auto-annotator\n- `preference`: the result of the auto-annotator. This is a float between 1 and 2. Closer to 1 means that the auto-annotator prefers `output_1`, closer to 2 means that it prefers `output_2`. For AlpacaEval 2.0, `preference-1` corresponds to the probability of `output_1` being preferred. For AlpacaEval 1.0, `preference` is 1 if `output_1` is preferred, 2 if `output_2` is preferred, and 1.5 if they are the same. The win rate is always`(preference -1).mean()`.\n- `raw_completion`: the raw output of the auto-annotator. This is field contains the completions before de-randomization of the order between `output_1` and `output_2`! It is thus much harder to interpret, see below for more information.\n\n**Chain of thought**\n\nFor some annotators, e.g. `alpaca_eval_cot_gpt4_turbo_fn` we use **chain of thought reasoning** to make the models preferences more interpretable. Those can then be found under `concise_explanation`. To interpret them, you should also look at `referenced_models` which translates the temporary model name (in the prompt) to the actual output. Below, we provide more explanation as to what is happening behind the scenes.\n\nYou can check the `raw_annotations[\"concise_explanation]` column in `annotations.json` (e.g. [here](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fresults\u002Fgpt4\u002Falpaca_eval_cot_gpt4_turbo_fn\u002Fannotations.json)) which contains the chain of thought reasoning of the auto annotator. Note that the raw_annotations is not modified by the randomization of the order of the outputs. In particular, `\"m\"` and `\"M\"` can sometime refer to the first model (the reference) and sometime to the second model (the model being evaluated). To understand which model is being referred to, you should use the column `preference` and `ordered_models`. To make it easier we add a column `\"referenced_models\"` mapping the model names to the corresponding outputs. For example in the following annotation we see that the preference is 1.0 (i.e. `output_1`) and corresponds to model `M` in `concise_explanation` (see `ordered_models`).  \n\n```json \n{\n  \"instruction\": \"How did US states get their names?\",\n  \"output_1\": \"The names of U.S. states are derived from a variety of languages and sources, including indigenous American languages, European languages (primarily English, Spanish, and French), as well as other historical influences. Here is a brief overview of how some states got their names:\\n\\n1. 
**Indigenous American Languages**: Many state names come from Native American words or names of tribes, often from the languages of the indigenous peoples who lived in or near those areas.\\n   - **Alabama**: Possibly derived from the Choctaw language, meaning \\\"thicket clearers.\\\"\\n   - **Connecticut**: From a Mohegan-Pequot word meaning \\\"long tidal river.\\\"\\n   - **Massachusetts**: [...]\",\n  \"generator_1\": \"gpt4_1106_preview\",\n  \"dataset\": \"helpful_base\",\n  \"output_2\": \"The names of the 50 U.S. states come from a variety of sources, including Native American languages, European languages, and historical figures. Here's a brief overview of how some states got their names:\\n\\n1. Native American origins: Many states have names derived from Native American languages. For example, Alabama comes from the Choctaw word \\\"Albah amo,\\\" meaning \\\"plant gatherers\\\" or \\\"herb gatherers.\\\" Similarly, the name Mississippi comes from the Ojibwe word \\\"Misi-ziibi,\\\" meaning \\\"great river.\\\"\\n\\n2. European languages: [...].\",\n  \"generator_2\": \"gpt4\",\n  \"annotator\": \"alpaca_eval_cot_gpt4_turbo_fn\",\n  \"preference\": 1.0,\n  \"raw_completion\": {\n    \"concise_explanation\": \"Model M provided a more detailed and structured response, including bold headings for each category and a wider range of examples. It also included additional categories such as 'Other European Languages' and 'Combination of Languages and Influences', which added depth to the explanation. Model m's response was accurate but less comprehensive and lacked the clear structure found in Model M's output.\",\n    \"ordered_models\": [\n      {\n        \"model\": \"M\",\n        \"rank\": 1\n      },\n      {\n        \"model\": \"m\",\n        \"rank\": 2\n      }\n    ]\n  },\n  \"referenced_models\": {\n    \"M\": \"output_1\",\n    \"m\": \"output_2\"\n  }\n}\n```\n\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">Major updates\u003C\u002Fh2>\u003C\u002Fsummary>\n\n- 12th March 2024: updated to use length-controlled (LC) win rates. This is a debiased version of the win-rates that control for the length of the outputs. \n- 3rd January 2024: updated to AlpacaEval 2.0, which uses GPT4-turbo as baseline and annotator.\n- 2nd January 2024: added Azure API and more general way of setting client configs. 
See [here](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fclient_configs\u002FREADME.md)\n- 19th June 2023: add leaderboard `chatgpt_fn` that anyone can use (no waiting lists).\n- 19th June 2023: update to\n  use [OpenAI's function calling](https:\u002F\u002Fopenai.com\u002Fblog\u002Ffunction-calling-and-other-api-updates).\n  Example: [`chatgpt_fn`](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Fchatgpt_fn)\n  or [`alpaca_eval_gpt4_fn`](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Falpaca_eval_gpt4_fn).\n\n\u003C\u002Fdetails>\n","# \u003Ca href=\"https:\u002F\u002Ftatsu-lab.github.io\u002Falpaca_eval\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_10aad1a5d6d4.png\" width=\"35\">\u003C\u002Fa> [AlpacaEval](https:\u002F\u002Ftatsu-lab.github.io\u002Falpaca_eval\u002F) : 面向指令遵循语言模型的自动评估器\n\n[![Code License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode%20License-Apache_2.0-green.svg)](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_farm\u002Fblob\u002Fmain\u002FLICENSE)\n[![Data License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FData%20License-CC%20By%20NC%204.0-red.svg)](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_farm\u002Fblob\u002Fmain\u002FDATA_LICENSE)\n[![Python 3.10+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10+-blue.svg)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002Frelease\u002Fpython-3100\u002F)\n[![discord](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdiscord-server-blue?logo=discord&logoColor=white)](https:\u002F\u002Fdiscord.gg\u002FGJMxJSVZZM)\n\n\n**带有长度控制胜率（length-controlled win-rates）的 AlpacaEval 2.0**（[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.04475)）与 [ChatBot Arena](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flmsys\u002Fchatbot-arena-leaderboard) 的 Spearman 相关系数达到 **0.98**，同时运行成本低于 **10 美元** 的 OpenAI 积分，且耗时少于 3 分钟。我们的目标是建立一个聊天大语言模型（chat LLMs）基准，具备：快速（\u003C 5 分钟）、廉价（\u003C 10 美元）以及与人类高度相关（0.98）的特点。以下是与其他基准的比较：\n\n\u003Cp float=\"left\" align=\"middle\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_86b6a7d16a94.png\" alt=\"LC AlpacaEval is the most highly correlated benchmark with Chat Arena.\" width=\"500\"\u002F>\n\u003C\u002Fp>\n\n---\n\n更新：\n\n:tada: **长度控制胜率（Length-controlled Win Rates）** 已发布并默认启用！这将与 ChatBot Arena 的相关性从 0.93 提高到 0.98，同时显著降低了长度操纵的可能性。原始胜率仍会在网站和命令行界面（CLI）中显示。更多详情 [在此](#length-controlled-win-rates)。\n\n:tada: **AlpacaEval 2.0** 已发布并默认启用！我们改进了自动标注器（更好且更便宜），并使用 GPT-4 preview 作为基线。更多详情 [在此](#alpacaeval-20)。对于旧版本，请设置环境变量 `IS_ALPACA_EVAL_2=False`。\n\n---\n\n\u003Cdetails open>\n  \u003Csummary>\u003Cb>目录\u003C\u002Fb>\u003C\u002Fsummary>\n\n1. [概述](#overview)\n2. [快速开始](#quick-start)\n2. [排行榜及如何解读](#leaderboards-and-how-to-interpret-them)\n    - [模型](#models)\n    - [评估器](#evaluators)\n3. [使用场景](#use-cases)\n    - [评估模型](#evaluating-a-model)\n    - [制作新排行榜](#making-a-new-leaderboard)\n    - [制作新评估器](#making-a-new-evaluator)\n4. [贡献](#contributing)\n    - [贡献模型](#contributing-a-model)\n    - [贡献评估器](#contributing-an-evaluator)\n    - [贡献评估集](#contributing-an-eval-set)\n    - [贡献补全函数](#contributing-a-completion-function)\n5. [局限性](#limitations)\n6. 
[分析](#additional-analysis-and-plots)\n    - [长度控制 AlpacaEval](#length-controlled-alpacaeval-lcae)\n    - [分析评估器](#analyzing-an-evaluator)\n    - [分析评估集](#analyzing-an-eval-set)\n7. [引用](#citation)\n8. [附加信息](#additional-information)\n   - [长度控制胜率](#length-controlled-win-rates)\n   - [AlpacaEval 2.0](#alpacaeval-20)\n   - [数据发布](#data-release)\n   - [与 AlpacaFarm 的区别](#differences-with-alpacafarm)\n   - [相关工作](#related-work)\n   - [解读标注](#interpreting-annotations)\n   - [重大更新](#major-updates)\n\n\u003C\u002Fdetails>\n\n# 概述\n\n指令遵循模型（例如 ChatGPT）的评估通常需要人工交互。这既耗时又昂贵，且难以复现。AlpacaEval 是一种基于大语言模型（LLM）的自动评估方法，具有快速、廉价、可复现的特点，并经过 2 万条人工标注的验证。它特别适用于模型开发。尽管我们在之前的自动评估流程上有所改进，但仍存在基本的 [局限性](#limitations)，例如对更长输出的偏好。AlpacaEval 提供以下内容：\n\n- [**排行榜**](https:\u002F\u002Ftatsu-lab.github.io\u002Falpaca_eval\u002F)：AlpacaEval 评估集上常见模型的排行榜。**注意**：自动评估器（例如 GPT-4）可能会偏向于生成更长输出和\u002F或在评估器底层模型（例如 GPT-4）上微调过的模型。\n- [**自动评估器**](#evaluators)：一个与人类高度一致的自动评估器（在 2 万条标注上进行了验证）。我们通过测量强大的 LLM（例如 GPT-4）偏好该模型输出而非参考模型输出的频率来评估模型。我们的评估器默认启用了缓存和输出随机化功能。\n- [**构建自动评估器的工具包**](#analysis)：用于构建高级自动评估器（例如带缓存、批处理或多标注者）并分析它们（质量、价格、速度、统计效力、偏差、方差等）的简单接口。\n- [**人工评估数据**](#data-release)：AlpacaFarm 评估集上给定模型与参考模型之间的 2 万条人工偏好数据。其中 2500 条是交叉标注（4 名人类标注同一组 650 个示例）。\n- [**AlpacaEval 数据集**](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Falpaca_eval.json)：AlpacaFarm 评估集的简化版，其中“指令”和“输入”合并为一个字段，且参考输出更长。[详情在此](#data-release)。\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>何时使用和不使用 AlpacaEval？\u003C\u002Fb>\u003C\u002Fsummary>\n\n**何时使用 AlpacaEval？**\n我们的自动评估器是简单指令遵循任务的人工评估的快速且廉价的代理。如果您需要快速运行大量评估，例如在模型开发期间，它非常有用。\n\n**何时不使用 AlpacaEval？**\n像任何其他自动评估器一样，AlpacaEval **不应取代高风险决策中的人工评估**，例如决定模型发布。特别是，AlpacaEval 受限于以下事实：(1) 评估集中的指令可能无法代表 LLM 的高级使用；(2) 自动评估器可能存在偏见，例如偏好回答的风格而非事实准确性；以及 (3) AlpacaEval 不衡量模型可能造成的风险。详情见 [局限性](#limitations)。\n\n\u003C\u002Fdetails>\n\n# 快速开始\n\n要安装稳定版，请运行\n\n```bash\npip install alpaca-eval\n```\n\n要安装夜间版本（开发版），请运行\n\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\n```\n\n然后你可以按如下方式使用：\n\n```bash\nexport OPENAI_API_KEY=\u003Cyour_api_key> # for more complex configs, e.g. using Azure or switching clients see client_configs\u002FREADME.md \nalpaca_eval --model_outputs 'example\u002Foutputs.json' \n```\n\n这将把排行榜 (leaderboard) 打印到控制台，并将排行榜和标注结果保存到与 `model_outputs` 文件相同的目录中。重要参数如下：\n\n- **model_outputs**：要添加到排行榜的模型输出的 JSON 文件路径。每个字典 (dictionary) 应包含 `instruction` 和 `output` 键。\n- **annotators_config**：这是要使用的标注器 (annotator)。我们推荐使用 `weighted_alpaca_eval_gpt4_turbo`（AlpacaEval 2.0 的默认值），它与我们的人工标注数据有较高的一致性，上下文大小 (context size) 较大，且成本较低。关于所有标注器的比较，请参见 [此处](#evaluators)。\n- **reference_outputs**：参考模型的输出。格式与 `model_outputs` 相同。默认情况下，对于 AlpacaEval 2.0，这是 `gpt4_turbo`。\n- **output_path**：保存标注结果和排行榜的路径。\n\n如果您没有模型输出，可以使用 [`evaluate_from_model`](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain#evaluating-a-model)，并传入本地路径或 HuggingFace 模型名称，或者来自标准 API (应用程序编程接口) 的模型（OpenAI、Anthropic、Cohere、Google 等）。其他命令：\n\n\u003Cdetails open>\n  \u003Csummary>\u003Ccode>>>> alpaca_eval -- --help\u003C\u002Fcode>\u003C\u002Fsummary>\n\n```\nSYNOPSIS\n    alpaca_eval COMMAND\n\nCOMMANDS\n    COMMAND is one of the following:\n\n     evaluate\n       Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.\n\n     evaluate_from_model\n       Evaluate a model from HuggingFace or an API provider. 
This is a wrapper around `evaluate` which includes generating from a desired model.\n\n     make_leaderboard\n       Precompute and save an entire leaderboard for a given dataset \u002F evaluator \u002F set of models generations.\n\n     analyze_evaluators\n       Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).\n```\n\n\u003C\u002Fdetails>\n\n有关每个函数的更多信息，请使用 `alpaca_eval \u003Ccommand> -- --help`。\n\n# 排行榜及如何解读它们\n\n## 模型\n\n我们的排行榜是基于 [AlpacaEval 数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca_eval) 计算的。\n我们使用不同的基线（baseline）模型和自动标注器（autoannotators）预先计算了重要模型的排行榜。\n我们的两个主要排行榜（\"AlpacaEval 2.0\" 和 \"AlpacaEval\"）可以在 [此页面](https:\u002F\u002Ftatsu-lab.github.io\u002Falpaca_eval\u002F) 找到。\n\"AlpacaEval 2.0\" 使用 `weighted_alpaca_eval_gpt4_turbo` 作为标注器（annotator），使用 `gpt4_turbo` 作为基线（baseline）。\n\"AlpacaEval\" 使用 `alpaca_eval_gpt4` 作为标注器（annotator），使用 `text_davinci_003` 作为基线（baseline）。\n所有预计算的排行榜请参见 [此处](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fleaderboards)。\n稍后我们还将展示如何 [添加您的模型](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval#evaluating-a-model) 到排行榜，以及如何为您的评估器\u002F数据集创建一个 [新的排行榜](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval#making-a-new-leaderboard)。\n有关开箱即用（out of the box）的所有模型的配置，请参见 [此处](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fmodels_configs)。\n\n**AlpacaEval 最小化排行榜**：\n\n|                       | 胜率 (Win Rate) | 标准误 (Std Error) |\n|:----------------------|---------:|----------:|\n| gpt4                  |     95.3 |       0.7 |\n| claude                |     88.4 |       1.1 |\n| chatgpt               |     86.1 |       1.2 |\n| guanaco-65b           |     71.8 |       1.6 |\n| vicuna-13b            |     70.4 |       1.6 |\n| text_davinci_003      |     50.0 |       0.0 |\n| alpaca-farm-ppo-human |     41.2 |       1.7 |\n| alpaca-7b             |     26.5 |       1.5 |\n| text_davinci_001      |     15.2 |       1.2 |\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>这些指标是如何精确计算的？\u003C\u002Fb>\u003C\u002Fsummary>\n\n**胜率（Win Rate）**：胜率衡量的是模型输出优于参考输出的频率（AlpacaEval 中为 `test-davinci-003`，AlpacaEval 2.0 中为 `gpt4_turbo`）。\n更具体地说，为了计算胜率，我们从 AlpacaEval 数据集中收集目标模型在每个指令上的输出对。\n然后我们将每个输出与参考模型（例如 `text-davinci-003`）在同一指令上的输出配对。\n然后我们询问自动评估器它们更喜欢哪个输出。\n请参阅 [AlpacaEval 的](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Falpaca_eval_gpt4) 和 [AlpacaEval 2.0 的](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Fweighted_alpaca_eval_gpt4_turbo) 提示词（prompts）和配置，特别是我们会随机化输出的顺序以避免位置偏差。\n然后我们对数据集中所有指令的偏好进行平均，以获得模型相对于基线（baseline）的胜率。\n如果两个输出完全相同，我们为两个模型各计一半偏好。\n\n**标准误（Standard error）**：这是胜率的标准误（以 N-1 归一化），即跨不同指令的平均偏好。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>关于我们的自动标注器（auto-annotator）：\u003Ccode>alpaca_eval_gpt4\u003C\u002Fcode> 的详细信息\u003C\u002Fb>\u003C\u002Fsummary>\n\n我们的 `alpaca_eval_gpt4`（见 [配置](#https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Falpaca_eval_gpt4\u002Fconfigs.yaml#L5)）标注器会对偏好取平均，其中偏好获取方式如下：\n\n1. 它接收一个指令和一对输出（来自目标模型和参考模型）\n2. 如果该三元组（triple）的偏好已计算过，则返回它（即使用缓存）\n3. 它随机化输出的顺序以避免位置偏差\n4. 
它将指令和输出格式化为 [以下零样本提示词（zero-shot prompt）](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Falpaca_eval_gpt4\u002Falpaca_eval.txt)，要求按偏好顺序排列输出\n5. 它使用 `temperature=0` 通过 GPT4 完成提示词\n6. 它从补全结果中解析偏好并返回\n\n该标注器是 [AlpacaFarm](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_farm) 和 [Aviary](https:\u002F\u002Fgithub.com\u002Fray-project\u002Faviary\u002Ftree\u002Fmaster) 评估器的混合体（并深受其影响）。\n特别是，我们使用了与 AlpacaFarm 相同的代码（缓存\u002F随机化\u002F超参数），但使用了类似于 Aviary 的排序提示词。\n我们对 Aviary 的提示词进行了修改，以减少对较长输出的偏差。\n详情见 [相关工作](#related-work)。\n\n对于 AlpacaEval 2.0，我们使用 `weighted_alpaca_eval_gpt4_turbo`，它使用 logprobs（对数概率）来计算连续偏好，并使用 GPT4_turbo 作为模型（见 [配置](#https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Fweighted_alpaca_eval_gpt4_turbo\u002Fconfigs.yaml)）。\n\n\u003C\u002Fdetails>\n\n\n\n## 评估器\n\n我们通过与我们收集的 2.5K [人工标注](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Falpaca_farm_human_crossannotations.json) 进行比较，在 AlpacaEval 集上评估不同的自动标注器（~650 个指令，每个有 4 个人工标注）。\n下面我们展示了我们建议的评估器（`weighted_alpaca_eval_gpt4_turbo`,`alpaca_eval_gpt4`）、先前自动评估器（[`alpaca_farm_greedy_gpt4`](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_farm),[`aviary_gpt4`](https:\u002F\u002Faviary.anyscale.com\u002F),[`lmsys_gpt4`](https:\u002F\u002Fchat.lmsys.org\u002F)）、人类（`humans`）以及具有基本相同提示词的不同基础模型（`gpt4`,`claude`,`text_davinci_003`,`chatgpt_fn`,`guanaco_33b`, `chatgpt`）的指标。\n有关开箱即用（out of the box）的所有评估器及其相关指标的配置，请参见 [此处](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs)。\n\n|                                 |   人类一致性 |   价格 [$\u002F1000 示例] |   时间 [秒\u002F1000 示例] |   Spearman 相关系数 |   Pearson 相关系数 |   偏差 |   方差 |   偏好更长回答的概率 |\n|:--------------------------------|------------------:|--------------------------:|-------------------------------:|-----------------:|----------------:|-------:|-----------:|-----------------------:|\n| alpaca_eval_gpt4                |              69.2 |                      13.6 |                           1455 |             0.97 |            0.93 |   28.4 |       14.6 |                   0.68 |\n| alpaca_eval_cot_gpt4_turbo_fn   |              68.6 |                       6.3 |                           1989 |             0.97 |            0.90 |   29.3 |       18.4 |                   0.67 |\n| alpaca_eval_gpt4_turbo_fn       |              68.1 |                       5.5 |                            864 |             0.93 |            0.82 |   30.2 |       15.6 |                   0.65 |\n| alpaca_eval_llama3_70b_fn       |              67.5 |                       0.4 |                            209 |             0.90 |            0.86 |   32.3 |        8.2 |                   0.79 |\n| gpt4                            |              66.9 |                      12.5 |                           1037 |             0.88 |            0.87 |   31.5 |       14.6 |                   0.65 |\n| alpaca_farm_greedy_gpt4         |              66.4 |                      15.3 |                            878 |             0.85 |            0.75 |   30.2 |       19.3 |                   0.60 |\n| alpaca_eval_cot_gpt4_turbo_fn |              65.7 |                       4.3 |                            228 |             0.78 |            0.77 |   33.9 |       23.7 |                   0.61 |\n| humans               
           |              65.7 |                     300.0 |                          36800 |             1.00 |            1.00 |    0.0 |       34.3 |                   0.64 |\n| claude                          |              65.3 |                       3.3 |                            173 |             0.93 |            0.90 |   32.4 |       18.5 |                   0.66 |\n| lmsys_gpt4                      |              65.3 |                      13.9 |                          17982 |             0.98 |            0.97 |   31.6 |       15.9 |                   0.74 |\n| text_davinci_003                |              64.1 |                       8.7 |                            121 |             0.85 |            0.83 |   33.8 |       22.7 |                   0.70 |\n| longest                         |              62.2 |                       0.0 |                              0 |             0.27 |            0.56 |   37.8 |        0.0 |                   1.00 |\n| chatgpt                         |              57.3 |                       0.8 |                            285 |             0.72 |            0.71 |   39.4 |       34.1 |                   0.59 |\n\n\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>这些指标是如何精确计算的？\u003C\u002Fb>\u003C\u002Fsummary>\n\n我们现在用文字说明如何计算上表中的指标。[代码在此处](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Ff05cbd651b79ac93906b19d01fe443b45828b0f2\u002Fsrc\u002Falpaca_eval\u002Fanalyze.py#L366)。\n\n**人类一致性**：这衡量了当前标注者与来自我们 [交叉标注集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Falpaca_farm_human_crossannotations.json) 的约 650 条标注中人类多数偏好之间的一致性，该数据集每个示例包含 4 条人工标注。\n为了估计单个标注者（上表中 `humans` 行）与人类多数之间的一致性，我们选取 4 条标注中的一条，并计算其在预测其余 3 条标注的众数 (mode) 时的准确率。\n然后，我们在所有 4 条标注和 650 个指令上平均此准确率以获得人类一致性，即，我们计算期望（在人类和样本上）的留一法 (leave-one-out) 一致性。\n如果众数不唯一，我们随机选择一个众数。\n我们对自动标注器执行完全相同的计算，以便最终数字具有可比性。\n\n**价格 [$\u002F1000 示例]**：这是每 1000 次标注的平均价格。\n对于人类，这是我们支付给 [Mechanical Turkers (众包工人)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14387) 以收集这些标注的价格（每小时 21 美元）。\n如果价格取决于用于计算标注的机器（例如 Guanaco），我们将其留空。\n\n**时间 [秒\u002F1000 示例]**：这是计算 1000 次标注所需的平均时间。\n对于人类，这是每位 Mechanical Turker (众包工人) 标注 1000 个示例所花费的中位时间的估计值。\n对于自动标注器，这是我们运行标注时花费的平均时间。请注意，这可能取决于不同用户不同的 API 限制以及集群正在处理的请求数量。\n\n**Spearman 相关系数**：这衡量了使用自动标注器偏好计算的排行榜与使用人类偏好计算的排行榜之间的 Spearman 相关性 (Spearman correlation)。与人类一致性一样，我们使用来自 AlpacaFarm 的人工标注，但现在我们考虑的是方法级别的一致性，而不仅仅是与样本层面的一致性。请注意，我们仅使用了 9 个模型，因此相关性不太可靠。 \n\n**Pearson 相关系数**：与 Spearman 相关系数相同，但使用 Pearson 相关性 (Pearson correlation)。\n\n**偏差**：最可能的人类标签与最可能的自动标签之间的一致性。\n对于自动标注器，我们通过为每个示例采样 4 条不同的标注来估计它。\n这里的随机性来自于提示词 (prompt) 中输出的顺序、从 LLM (大语言模型) 采样，以及如果适用的话，批次中指令的顺序和池中标注器的选择。\n然后，我们取这 4 条标注的众数，并计算该众数在预测 4 条人类标注的众数时的准确率。\n请注意，这可能是如果我们拥有“无限”数量的交叉标注时会得到的真实偏差的高估。\n低偏差意味着标注者在期望上与人类具有相同的偏好。\n对于人类的情况，根据定义偏差为零。\n请注意，这与标准统计偏差相关但不等同，因为我们取众数而不是对标注求平均，并且我们考虑 0-1 损失 (0-1 loss) 而不是平方损失 (squared loss)。\n\n**Variance (方差)**：单个自动偏好与最可能偏好之间的预期一致性。\n我们估算它的方式与估算人类的\"人类一致性 (human agreement)\"相同，即，在使用第 4 个标注预测 3 个标注的众数 (mode) 时，取留一法 (leave one out) 误差的期望值。\n低方差意味着标注器 (annotator) 与其自身偏好一致，即，如果你用不同的种子 (seeds) 从它采样，它会给出相同的结果。\n与偏差 (bias) 类似，这并非标准的统计方差，因为我们取的是标注的众数而非平均值，并且考虑的是 0-1 损失而非平方损失。\n\n请注意，\"人类一致性 (human agreement)\"与偏差和方差密切相关。特别是，方差衡量了由于我们仅使用单个标注而产生的误差，而偏差旨在衡量当前标注器的不可约误差。\n\n**Proba. 
prefer longer (偏好更长概率)**：当两个输出中一个显著长于另一个（超过 30 个字符差异）时，标注器偏好较长输出的概率。\n\n在 [完整表格](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002FREADME.md) 中，我们还提供了以下指标：\n\n**Proba. prefer lists (偏好列表概率)**：当一个输出包含列表\u002F项目符号而另一个不包含时，标注器偏好包含该输出的概率。\n\n**Proba. prefer 1 (偏好第一个概率)**：标注器偏好成对输出中第一个的概率。我们提出的所有标注器在提示词 (prompt) 中对输出进行了随机化，因此这应该是 0.5。先前的标注器，如 `lmsys` 和 `aviary`，则不是这样。\n\n**# parsed (解析数量)**：这是标注器能够解析的示例数量。\n\n请注意，如果方差和偏差为空，这意味着由于资源（时间和价格）限制，每个 648 个示例仅执行了一次单次标注。这解释了为什么 #parsed 是 648，否则它应该是 2592。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>Tips for choosing evaluators\u003C\u002Fb>\u003C\u002Fsummary>\n\n总体而言，如果您希望与人类高度一致，建议使用 `annotators_config=weighted_alpaca_eval_gpt4_turbo`；如果您预算紧张，建议使用 `annotators_config=chatgpt_fn`。\n\n在选择标注器时，我们建议您考虑以下几点（前三点显而易见）：\n\n- `\"人类一致性 (%)\"`\n- `\"价格 [$\u002F1000 个示例]\"`\n- `\"时间 [秒\u002F1000 个示例]\"`\n- `\"* 相关性 (* corr.)\"` 约 > 0.7。相关性不宜过低很重要，但我们不建议将其作为主要指标，因为相关性仅基于 9 个模型计算。 \n- `\"Proba. prefer longer (偏好更长概率)\"` 约 \u003C 0.7。确实，我们发现大多数人类标注者的偏好都强烈偏向更长的答案（正如始终偏好最长输出的 `\"longest\"` 评估器的 [performance=62.2](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002FREADME.md) 所示）。这表明这可能是人类标注者的一种偏差。为了避免排行榜出现强烈的长度偏差，我们建议使用“偏好更长概率”低于 0.7 的自动标注器。\n- `\"Variance (方差)\"` 约 \u003C 0.2。我们相信一个好的评估器 (evaluator) 应尽可能少方差，以便结果主要是可复现的。注意，在模拟人类的情况下（如 [AlpacaFarm](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14387) 所示），方差可能是可取的。\n\n我们在上表中过滤了不满足这些要求的标注器（除了用于参考目的的人类 \u002F ChatGPT \u002F 003 \u002F lmsys）。查看所有结果请参见 [此处](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002FREADME.md)。\n总体而言，我们发现 `weighted_alpaca_eval_gpt4_turbo` 在质量\u002F价格\u002F时间\u002F方差\u002F长度偏差之间是一个很好的权衡。\n\n\u003C\u002Fdetails>\n\n上述指标是基于众包工作者 (crowd-workers) 的标注计算的。虽然有用，但这些标注并不完美，例如，众包工作者往往更看重风格而非事实性 (factuality)。因此，我们建议用户在自己的指令和人类标注上验证自动评估器。详情见 [局限性](#limitations)。\n\n\n\n# 使用场景\n\n\n## 评估模型\n\n\u003Cdetails>\n  \u003Csummary>\u003Ccode>>>> alpaca_eval evaluate -- --help\u003C\u002Fcode>\u003C\u002Fsummary>\n\n```\nNAME\n    alpaca_eval evaluate - Evaluate a model based on its outputs. This is the default entrypoint if no command is specified.\n\nSYNOPSIS\n    alpaca_eval evaluate \u003Cflags>\n\nDESCRIPTION\n    Evaluate a model based on its outputs. 
This is the default entrypoint if no command is specified.\n```\n\n标志位\n    --model_outputs=MODEL_OUTPUTS\n        类型：Optional[Union]\n        默认值：None\n        要添加到排行榜的模型输出。接受数据（字典列表、pd.dataframe、datasets.Dataset）或读取这些数据的路径（json, csv, tsv）或生成这些数据的函数。每个字典（或 DataFrame 的行）应包含在提示词中格式化的键。例如默认情况下为 `instruction` 和 `output`，可选 `input`。如果为 None，我们仅打印排行榜。\n    -r, --reference_outputs=REFERENCE_OUTPUTS\n        类型：Union\n        默认值：\u003Cfunc...\n        参考模型的输出。与 `model_outputs` 格式相同。如果为 None，参考输出是 AlpacaEval 集上的一组特定 Davinci 003 输出：\n    --annotators_config=ANNOTATORS_CONFIG\n        类型：Union\n        默认值：'alpaca_eval_gpt4_turbo_fn'\n        标注器配置文件的路径（或字典列表）。详情见 `PairwiseAnnotator` 的文档字符串。\n    -n, --name=NAME\n        类型：Optional[Optional]\n        默认值：None\n        要添加到排行榜的模型名称。如果为 None，我们检查 `generator` 是否在 `model_outputs` 中，否则使用 \"Current model\"。\n    -o, --output_path=OUTPUT_PATH\n        类型：Union\n        默认值：'auto'\n        存储新排行榜和标注结果的目录路径。如果为 None，我们不保存。如果为 `auto`，如果 `model_outputs` 是路径则使用它，否则使用调用脚本的目录。\n    -p, --precomputed_leaderboard=PRECOMPUTED_LEADERBOARD\n        类型：Union\n        默认值：'auto'\n        预计算的排行榜或其路径（json, csv, 或 tsv）。排行榜应至少包含列 `win_rate`。如果为 `auto`，我们将尝试使用对应于参考输出的排行榜（仅在 CORRESPONDING_OUTPUTS_LEADERBOARDS 中存在时）。如果为 `None`，我们不会添加排行榜中的其他模型。\n    --is_overwrite_leaderboard=IS_OVERWRITE_LEADERBOARD\n        类型：bool\n        默认值：False\n        如果模型已在排行榜中，是否覆盖排行榜。\n    -l, --leaderboard_mode_to_print=LEADERBOARD_MODE_TO_PRINT\n        类型：Optional\n        默认值：'minimal'\n        使用的排行榜模式。仅当预计算排行榜有 `mode` 列时使用，此时将按此模式过滤排行榜。如果为 None，保留所有。\n    -c, --current_leaderboard_mode=CURRENT_LEADERBOARD_MODE\n        类型：str\n        默认值：'community'\n        当前方法的排行榜模式。\n    --is_return_instead_of_print=IS_RETURN_INSTEAD_OF_PRINT\n        类型：bool\n        默认值：False\n        是否返回指标而不是打印结果。\n    -f, --fn_metric=FN_METRIC\n        类型：Union\n        默认值：'pairwise_to_winrate'\n        `metrics.py` 中将偏好转换为指标的函数或函数名。该函数应接受一个偏好序列（0 表示平局，1 表示基础模型获胜，2 表示要比较的模型获胜）并返回一个指标字典以及用于排序排行榜的键。\n    -s, --sort_by=SORT_BY\n        类型：str\n        默认值：'win_rate'\n        用于排序排行榜的键。\n    --is_cache_leaderboard=IS_CACHE_LEADERBOARD\n        类型：Optional[Optional]\n        默认值：None\n        是否将结果排行榜保存到 `precomputed_leaderboard`。如果为 None，仅当 max_instances 不为 None 时保存。向排行榜添加模型的首选方法是将 `precomputed_leaderboard` 设置为之前保存在 `\u003Coutput_path>\u002Fleaderboard.csv` 的排行榜。\n    --max_instances=MAX_INSTANCES\n        类型：Optional[Optional]\n        默认值：None\n        要标注的最大实例数。用于测试。\n    --annotation_kwargs=ANNOTATION_KWARGS\n        类型：Optional[Optional]\n        默认值：None\n        传递给 `PairwiseAnnotator.annotate_head2head` 的其他参数。\n    -A, --Annotator=ANNOTATOR\n        默认值：\u003Cclass 'alpaca_eval.annotators.pairwise_evaluator.PairwiseAn...\n        要使用的标注器类。\n    接受额外的标志。\n        传递给 `PairwiseAnnotator` 的其他参数。\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ccode>>>> alpaca_eval evaluate_from_model -- --help\u003C\u002Fcode>\u003C\u002Fsummary>\n\n```\n名称\n    alpaca_eval evaluate_from_model - 评估来自 HuggingFace 或 API 提供商的模型。这是 `evaluate` 的包装器，包括从所需模型生成内容。\n\n概要\n    alpaca_eval evaluate_from_model MODEL_CONFIGS \u003C标志>\n\n描述\n    评估来自 HuggingFace 或 API 提供商的模型。这是 `evaluate` 的包装器，包括从所需模型生成内容。\n\n位置参数\n    MODEL_CONFIGS\n        类型：Union\n        字典或路径（相对于 `models_configs`），指向包含解码模型配置的 yaml 文件。如果是目录，我们在其中搜索 'configs.yaml'。第一个字典中的键应为生成器的名称，值应为生成器配置的字典，该配置应具有\n\n标志\n    -r, --reference_model_configs=REFERENCE_MODEL_CONFIGS\n        类型：Optional[Union]\n        默认值：None\n        与 `model_configs` 相同，但用于参考模型。如果为 
None，我们使用默认的 Davinci003 输出。\n    -e, --evaluation_dataset=EVALUATION_DATASET\n        类型：Union\n        默认值：\u003Cfunc...\n        评估数据集的路径或返回 DataFrame 的函数。如果为 None，我们使用默认的评估\n    -a, --annotators_config=ANNOTATORS_CONFIG\n        类型：Union\n        默认值：'alpaca_eval_gpt4_turbo_fn'\n        标注器配置的路径或字典。如果为 None，我们使用默认的标注器配置。\n    -o, --output_path=OUTPUT_PATH\n        类型：Union\n        默认值：'auto'\n        保存生成内容、标注和排行榜的路径。如果 auto 则保存在 `results\u002F\u003Cmodel_name>`\n    -m, --max_instances=MAX_INSTANCES\n        类型：Optional[int]\n        默认值：None\n        生成和评估的最大实例数。如果为 None，我们评估所有实例。\n    --is_strip_output=IS_STRIP_OUTPUT\n        类型：bool\n        默认值：True\n        是否去除输出尾部及首部的空白字符。\n    --is_load_outputs=IS_LOAD_OUTPUTS\n        类型：bool\n        默认值：True\n        是否尝试从输出路径加载输出。如果为 True 且输出存在，我们仅为尚未有输出的指令生成输出。\n    -c, --chunksize=CHUNKSIZE\n        类型：int\n        默认值：64\n        保存前生成的实例数。如果为 None，在所有生成完成后保存。\n    接受额外的标志。\n        传递给 `evaluate` 的其他关键字参数\n\n备注\n    您也可以对位置参数使用标志语法\n```\n\n\u003C\u002Fdetails>\n\n要评估模型，您需要：\n\n1. 选择一个评估集并计算指定的 `model_outputs`（模型输出）。默认情况下，我们使用来自 [AlpacaEval](#data-release) 的 805 个示例。要在 AlpacaEval 上计算输出，请使用：\n\n```python\nimport datasets\n\neval_set = datasets.load_dataset(\"tatsu-lab\u002Falpaca_eval\", \"alpaca_eval\")[\"eval\"]\nfor example in eval_set:\n    # generate here is a placeholder for your models generations\n    example[\"output\"] = generate(example[\"instruction\"])\n    example[\"generator\"] = \"my_model\" # name of your model\n```\n\n如果您的模型是 HuggingFace 模型或来自标准 API 提供商（OpenAI, Anthropic, Cohere），那么您可以直接使用 `alpaca_eval evaluate_from_model` 来处理生成输出的任务。\n\n2. 计算参考输出 `reference_outputs`（参考输出）。默认情况下，我们使用 [`gpt4_turbo` 在 AlpacaEval 上的预计算输出](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca_eval)。\n如果您想使用不同的模型或不同的数据集，请遵循与 (1.) 相同的步骤。\n3. 
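补充一点：直接迭代 `datasets.Dataset` 得到的是字典副本，对 `example` 的赋值不会写回数据集，因此需要显式收集结果并保存成文件。下面的示意片段（`generate` 与 `my_model` 沿用上面代码中的占位符）展示了如何按 `instruction` / `output` / `generator` 的格式保存输出，保存的文件随后即可作为 `--model_outputs` 传入：

```python
import json

import datasets

eval_set = datasets.load_dataset("tatsu-lab/alpaca_eval", "alpaca_eval")["eval"]

model_outputs = []
for example in eval_set:
    model_outputs.append({
        "instruction": example["instruction"],
        "output": generate(example["instruction"]),  # generate 仍是占位符，请替换为你的生成函数
        "generator": "my_model",                     # 你的模型名称
    })

with open("my_model_outputs.json", "w") as f:
    json.dump(model_outputs, f, ensure_ascii=False, indent=2)

# 之后运行：alpaca_eval --model_outputs 'my_model_outputs.json'
```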
通过 `annotators_config`（标注器配置）指定一个评估器。我们推荐使用 `alpaca_eval_gpt4_turbo_fn`。对于其他选项和比较，请参阅 [此表](#evaluators)。根据评估器的不同，您可能需要在环境中设置适当的 `API_KEY`，或者在 [client_configs](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fclient_configs) 中设置。\n\n一起运行：\n\n```bash\nalpaca_eval --model_outputs 'example\u002Foutputs.json' \\\n  --annotators_config 'alpaca_eval_gpt4_turbo_fn'\n```\n\n如果您没有解码后的输出，可以使用 `evaluate_from_model`，它将为您处理解码（模型和参考）工作。\n这是一个示例：\n\n```bash\n\n\n# need a GPU for local models\nalpaca_eval evaluate_from_model \\\n  --model_configs 'oasst_pythia_12b' \\\n  --annotators_config 'alpaca_eval_gpt4_turbo_fn'      \n```\n\n这里 `model_configs` 和 `reference_model_configs`（可选）是指定 `prompt`（提示词）、模型提供商（此处为 HuggingFace）和解码参数的目录路径。\n示例请参阅 [此目录](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fmodels_configs)。\n对于所有开箱即用的模型提供商，请参阅 [此处](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fdecoders)。\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>关于标注者的信息\u003C\u002Fb>\u003C\u002Fsummary>\n\n- **缓存（Caching）**：默认情况下，所有注释都缓存到磁盘上的 `caching_path` 处。因此注释永远不会重新计算，这使得注释更快、更便宜，并允许可复现性。即使评估不同的模型也有帮助，因为许多模型具有相同的输出。\n- **输出随机化（Output randomization）**：默认情况下，我们对输出示例进行随机打乱，因为我们发现标注者倾向于偏好他们看到的第一个示例。\n- **批处理（Batching）**：我们提供代码和示例来批量处理注释，如果 `prompt` 较长，这可以降低注释的成本和时间。例如参见 [alpaca_farm_greedy_gpt4](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Falpaca_farm_greedy_gpt4)。\n- **标注者池（Pool of annotators）**：我们提供代码和示例来使用自动标注者池进行评估，这对于复制 [人类标注](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14387) 的方差很有帮助。例如参见 [alpaca_farm](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Falpaca_farm)。\n- **基于指令的种子设定（Seeding based on instructions）**：为了可复现性和模型之间更公平的比较，我们基于指令对所有随机性（输出顺序、批次中的顺序、池中每个标注者的示例）进行种子设定。\n\n\u003C\u002Fdetails>\n\n\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">制作新的排行榜\u003C\u002Fh2>\u003C\u002Fsummary>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ccode>>>> alpaca_eval make_leaderboard -- --help\u003C\u002Fcode>\u003C\u002Fsummary>\n\n```\nNAME\n    alpaca_eval make_leaderboard - Precompute and save an entire leaderboard for a given dataset \u002F evaluator \u002F set of models generations.\n\nSYNOPSIS\n    alpaca_eval make_leaderboard \u003Cflags>\n\nDESCRIPTION\n    Precompute and save an entire leaderboard for a given dataset \u002F evaluator \u002F set of models generations.\n\nFLAGS\n    --leaderboard_path=LEADERBOARD_PATH\n        Type: Optional[Union]\n        Default: None\n        The path to save the leaderboard to. The leaderboard will be saved as a csv file, if it already exists it will\n    --annotators_config=ANNOTATORS_CONFIG\n        Type: Union\n        Default: 'alpaca_eval_gpt4_turbo_fn'\n        The path the (or list of dict of) the annotator's config file.\n    --all_model_outputs=ALL_MODEL_OUTPUTS\n        Type: Union\n        Default: \u003Cfu...\n        The outputs of all models to add to the leaderboard. Accepts data (list of dictionary, pd.dataframe, datasets.Dataset) or a path to read those (json, csv, tsv potentially with globbing) or a function to generate those. If the path contains a globbing pattern, we will read all files matching the pattern and concatenate them. 
Each dictionary (or row of dataframe) should contain the keys that are formatted in the prompts. E.g. by default `instruction` and `output` with optional `input`. It should also contain a column `generator` with the name of the current model.\n    -r, --reference_outputs=REFERENCE_OUTPUTS\n        Type: Union\n        Default: \u003Cfunc...\n        The outputs of the reference model. Same format as `all_model_outputs` but without needing `generator`. By default, the reference outputs are the 003 outputs on AlpacaEval set.\n    -f, --fn_add_to_leaderboard=FN_ADD_TO_LEADERBOARD\n        Type: Callable\n        Default: 'evaluate'\n        The function to use to add a model to the leaderboard. If a string, it should be the name of a function in `main.py`. The function should take the arguments: `model_outputs`, `annotators_config`, `name`, `precomputed_leaderboard`, `is_return_instead_of_print`, `reference_outputs`.\n    --leaderboard_mode=LEADERBOARD_MODE\n        Type: str\n        Default: 'verified'\n        The mode of the leaderboard to save all new entries with.\n    -i, --is_return_instead_of_print=IS_RETURN_INSTEAD_OF_PRINT\n        Type: bool\n        Default: False\n        Whether to return the metrics instead of printing the results.\n    Additional flags are accepted.\n        Additional arguments to pass to `fn_add_to_leaderboard`.\n```\n\n\u003C\u002Fdetails>\n\n如果您想使用单个命令（而不是多次 `alpaca_eval` 调用）为您的目标评估集和评估器制作新的排行榜，可以使用以下内容：\n\n```bash\nalpaca_eval make_leaderboard \\\n  --leaderboard_path \u003Cpath_to_save_leaderboard> \\\n  --all_model_outputs \u003Cmodel_outputs_path> \\\n  --reference_outputs \u003Creference_outputs_path> \\\n  --annotators_config \u003Cpath_to_config.yaml>\n```\n\n其中：\n\n- `leaderboard_path`: 保存排行榜 (leaderboard) 的路径。排行榜将保存为 csv 文件，如果文件已存在则会追加内容。\n- `all_model_outputs`：要添加到排行榜 (leaderboard) 的所有模型输出的 json 路径（可以是单个文件或通过通配符匹配多个文件）。每个字典应包含在提示词 (prompt) 中格式化的键 (`instruction` 和 `output`) 以及一个名为 `generator` 的列，其中包含当前模型的名字。例如请参见 [此文件](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Falpaca_eval_all_outputs.json)。\n- `reference_outputs`：参考模型的输出路径。每个字典应包含在提示词 (prompt) 中格式化的键 (`instruction` 和 `output`)。默认情况下，参考输出是 AlpacaEval 集上的 003 输出。\n- `annotators_config`：标注器 (annotator) 配置文件的路径。默认为 `alpaca_eval_gpt4`。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">创建新的评估器\u003C\u002Fh2>\u003C\u002Fsummary>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ccode>>>> alpaca_eval analyze_evaluators -- --help\u003C\u002Fcode>\u003C\u002Fsummary>\n\n```\nNAME\n    alpaca_eval analyze_evaluators - Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).\n\nSYNOPSIS\n    alpaca_eval analyze_evaluators \u003Cflags>\n\nDESCRIPTION\n    Analyze an evaluator and populates the evaluators leaderboard (agreement with human, speed, price,...).\n\nFLAGS\n    --annotators_config=ANNOTATORS_CONFIG\n        Type: Union\n        Default: 'alpaca_eval_gpt4_turbo_fn'\n        The path the (or list of dict of) the annotator's config file.\n    -A, --Annotator=ANNOTATOR\n        Default: \u003Cclass 'alpaca_eval.annotators.pairwise_evaluator.PairwiseAn...\n        The annotator class to use.\n    --analyzer_kwargs=ANALYZER_KWARGS\n        Type: Optional[Optional]\n        Default: None\n        Additional arguments to pass to the analyzer.\n    -p, --precomputed_leaderboard=PRECOMPUTED_LEADERBOARD\n        Type: Union\n        Default: 
PosixPath('\u002FUsers\u002Fyanndubois\u002FDesktop\u002FGitHub\u002Falpaca_eval\u002Fsrc\u002F...\n        The precomputed (meta)leaderboard of annotators or a path to it (json, csv, or tsv).\n    --is_save_leaderboard=IS_SAVE_LEADERBOARD\n        Type: bool\n        Default: False\n        Whether to save the leaderboard (ie analyzed results).\n    --is_return_instead_of_print=IS_RETURN_INSTEAD_OF_PRINT\n        Type: bool\n        Default: False\n        Whether to return the leaderboard (ie analyzed results). If True, it will not print the results.\n    --is_overwrite_leaderboard=IS_OVERWRITE_LEADERBOARD\n        Type: bool\n        Default: False\n        Whether to overwrite the leaderboard if it already exists.\n    -m, --max_instances=MAX_INSTANCES\n        Type: Optional[Optional]\n        Default: None\n        The maximum number of instances to analyze.\n    --is_single_annotator=IS_SINGLE_ANNOTATOR\n        Type: bool\n        Default: False\n        Whether to analyze a single annotator. If True, will not be able to estimate the annotator's bias.\n    -l, --leaderboard_mode_to_print=LEADERBOARD_MODE_TO_PRINT\n        Type: str\n        Default: 'minimal'\n        The mode of the leaderboard to print.\n    -c, --current_leaderboard_mode=CURRENT_LEADERBOARD_MODE\n        Type: str\n        Default: 'minimal'\n        The mode of the leaderboard to save all new entries with.\n    -o, --output_path=OUTPUT_PATH\n        Type: Union\n        Default: 'auto'\n        Path to save the leaderboard and annotataions. If None, we don't save.\n    Additional flags are accepted.\n        Additional arguments to pass to `Annotator`.\n```\n\n\u003C\u002Fdetails>\n\nAlpacaEval 提供了一种简单的方法来创建新的评估器 (evaluator)。你只需要创建一个新 的 `configs.yaml` 配置文件，然后将其作为 `--annotators_config \u003Cpath_to_config.yaml>` 传递给 `alpaca_eval`。\n以下是创建新评估器的几种方法：\n\n- **更改提示词 (prompt)**：在文本文件中编写新提示词，并在配置文件的 `prompt_template` 中指定路径。路径相对于配置文件。\n- **更改解码参数**：在配置文件的 `completions_kwargs` 中指定所需的参数。有关所有可用参数的详细信息，请参阅由配置文件中 `fn_completions` 指定的对应函数的文档字符串 [在此文件中](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fdecoders\u002F__init__.py)。\n- **更改模型**：在 `model_name` 中指定所需的模型，并在 `prompt_template` 中指定相应的提示词。如果模型来自其他提供商，你需要更改 `fn_completions`，该函数映射到 [此文件](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fdecoders\u002F__init__.py) 中的相应函数。我们提供了用于使用 OpenAI、Anthropic、Cohere 或 HuggingFace 模型的 `fn_completions` 函数。要安装所有提供商所需的包，请使用 `pip install alpaca_eval[all]`。\n\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>配置文件中的其他参数\u003C\u002Fb>\u003C\u002Fsummary>\n\n最简单的方法是查看 [`SinglePairwiseAnnotator`](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fannotators\u002Fpairwise_evaluator.py#L537) 的文档字符串。\n这里有一些重要的参数：\n\n```\nParameters\n----------\nprompt_template : path\n    A prompt that will be given to `fn_prompter` or path to the prompts. Path is relative to\n    `evaluators_configs\u002F`\n\nfn_completion_parser : callable or str\n    Function in `completion_parsers.py` to use for parsing the completions into preferences. 
For each completion,\n    the number of preferences should be equal to the batch_size if not we set all the preferences in that batch to\n    NaN.\n\ncompletion_parser_kwargs : dict\n    Kwargs for fn_completion_parser.\n\nfn_completions : callable or str\n    Function in `decoders.py` to use for decoding the output.\n\ncompletions_kwargs : dict\n    kwargs for fn_completions. E.g. model_name, max_tokens, temperature, top_p, top_k, stop_seq.\n\nis_randomize_output_order : bool\n    Whether to randomize output_1, output_2 when formatting.\n\nbatch_size : int\n    Number of examples that will be added in a single prompt.\n```\n\n\u003C\u002Fdetails>\n\n创建好评估器后，你还可以分析它并使用以下命令将其添加到 _评估器_ 的 [排行榜](#evaluators) 中：\n\n```bash\nalpaca_eval analyze_evaluators --annotators_config '\u003Cpath_to_config.yaml>'    \n```\n\n为了估计偏差 (bias) 和方差 (variance)，此操作使用 4 个随机种子 (seeds) 评估每个示例，即 2.5K 次评估。\n如果你想要更便宜的评价，可以使用 `--is_single_annotator True` 使用单个种子，这将跳过偏差和方差的估计。\n\n\u003C\u002Fdetails>\n\n# 参与贡献\n\n我们接受针对新模型、评估器（Evaluator）和评估集（Eval Set）的 PR（拉取请求），以及错误修复。我们将定期更新 [排行榜网站](https:\u002F\u002Ftatsu-lab.github.io\u002Falpaca_eval\u002F) 以纳入新的社区贡献。我们还为 AlpacaEval 创建了一个 [支持 Discord](https:\u002F\u002Fdiscord.gg\u002FGJMxJSVZZM)，如果您遇到任何问题并希望向社区寻求帮助，可以使用它。\n\n要开始贡献，请先 fork 该仓库，并从源码安装包 `pip install -e .`。\n\n## 贡献模型\n\n首先，您需要在 [models_configs](src\u002Falpaca_eval\u002Fmodels_configs\u002F) 文件夹中添加一个模型配置定义。作为示例，您可以查看 [falcon-7b-instruct yaml](src\u002Falpaca_eval\u002Fmodels_configs\u002Ffalcon-7b-instruct\u002Fconfigs.yaml)。请确保文件夹名称和 yaml 中的键名完全匹配。\n\n然后，请按照 [评估模型](#evaluating-a-model) 中的步骤运行模型的推理（inference），在评估集上生成输出，并根据其中一个评估器对模型进行评分。示例命令可能如下所示：\n\n```sh\nalpaca_eval evaluate_from_model \\\n  --model_configs 'falcon-7b-instruct'\n```\n\n运行此命令后，您应该生成了一个 outputs json 文件以及对应的 [排行榜文件](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fleaderboards\u002Fdata_AlpacaEval) 中的新条目。请提交包含配置、输出文件和更新后的排行榜的 PR。\n\n具体来说，您应该执行以下操作：\n\n1. 在 github 上 fork 仓库\n2. 克隆 fork 后的仓库 `git clone \u003CURL>`\n3. 在 `src\u002Falpaca_eval\u002Fmodels_configs\u002F\u003Cmodel_name>` 处创建模型配置并对其进行评估 `evaluate_from_model --model_configs '\u003Cmodel_name>'`\n4. 将模型配置、输出和排行榜条目添加到 fork 后的仓库中\n```sh\ngit add src\u002Falpaca_eval\u002Fmodels_configs\u002F\u003Cmodel_name> # add the model config\ngit add src\u002Falpaca_eval\u002Fleaderboards\u002F # add the actual leaderboard entry\ngit add src\u002Falpaca_eval\u002Fmetrics\u002Fweights # add the weights for LC\ngit add -f results\u002F\u003Cmodel_name>\u002Fmodel_outputs.json # force add the outputs on the dataset\ngit add -f results\u002F\u003Cmodel_name>\u002F*\u002Fannotations.json # force add the evaluations from the annotators\ngit commit -m \"Add \u003Cmodel_name> to AlpacaEval\"\ngit push\n``` \n5. 
在 [AlpacaEval 上创建拉取请求](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpulls)\n\n注意：如果您在 AlpacaEval 之外生成输出，您仍然需要添加模型配置，但使用 `fn_completions: null`。请参阅 [此配置](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fmodels_configs\u002Fdolphin-2.2.1-mistral-7b\u002Fconfigs.yaml) 作为示例。\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch3 tabindex=\"-1\" dir=\"auto\">获取您的模型验证\u003C\u002Fh3>\u003C\u002Fsummary>\n\n\u003Cp align=\"center\">\n\u003Cimg align=\"center\" alt=\"verified.png\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_af9131fb80f7.png\" width=\"500\"\u002F>\n\u003C\u002Fp>\n\nAlpacaEval 中的已验证结果表示核心维护者已解码模型的输出并执行了评估。不幸的是，我们 AlpacaEval 维护者缺乏资源来验证所有模型，因此我们仅会对排行榜前 5 名的模型进行验证。对于由此可能造成的任何不便我们深表歉意，感谢您的理解。若要验证您的模型，请遵循以下步骤：\n\n1. 通过 Discord 联系 `@yann`，或者如果您有我们的电子邮件，请给我们发邮件，简要说明为什么您的模型应该被验证。\n2. 等待我们的回复和批准后再继续。\n3. 准备一个从您的模型解码的脚本，该脚本不需要 GPU，通常与您贡献模型时使用的脚本相同。它应能在不使用本地 GPU 的情况下运行 `alpaca_eval evaluate_from_model --model_configs '\u003Cyour_model_name>'`。\n4. 生成用于运行脚本的临时 API 密钥（API Keys）并与我们分享。具体来说，我们需要解码您的模型和进行评估所需的密钥（例如，OpenAI 或 Anthropic 密钥）。\n5. 我们将执行 `alpaca_eval evaluate_from_model --model_configs '\u003Cyour_model_name>'`，更新结果，并通知您以便您可以撤销临时密钥。\n\n请注意，我们不会重新评估同一个模型。由于采样方差（Sampling Variance），结果可能与您的初始结果略有不同。我们将用已验证的结果替换您之前的社区结果。 \n\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">贡献评估器\u003C\u002Fh2>\u003C\u002Fsummary>\n\n请首先遵循 [制作新评估器](#making-a-new-evaluator) 中的指示。一旦您创建了标注器（annotator）配置，我们要求您通过评估最小集合的模型为该标注器创建一个新的排行榜。这些模型的输出可以通过下载 [alpaca_eval_all_outputs.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Falpaca_eval_all_outputs.json) 找到。\n\n```bash\nalpaca_eval make_leaderboard \\\n  --leaderboard_path src\u002Falpaca_eval\u002Fleaderboards\u002Fdata_AlpacaEval\u002F\u003Cevaluator>_leaderboard.csv \\\n  --all_model_outputs alpaca_eval_all_outputs.json \\\n  --annotators_config \u003Cevaluator_config>\n```\n\n然后，请提交包含标注器配置和排行榜 csv 的 PR。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">贡献评估集\u003C\u002Fh2>\u003C\u002Fsummary>\n\n要贡献新的评估集，您首先需要指定一组文本指令。然后，您需要指定一组参考输出（模型胜率是相对于此参考计算的）。为了方便使用，您可以使用默认的 [text-davinci-003](src\u002Falpaca_eval\u002Fmodels_configs\u002Ftext_davinci_003\u002F) 参考配置。\n\n将这些放在一起放入一个 json 文件中，其中每个条目指定 `instruction`、`output` 和 `generator` 字段。您可以参考 [alpaca_eval.json](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Falpaca_eval.json) 作为指南（`dataset` 字段不是必需的）。\n\n最后，我们要求您在这个新的评估集上创建一个最小排行榜。您可以使用以下内容完成此操作：\n\n```bash\nalpaca_eval make_leaderboard \\\n  --leaderboard_path \u003Csrc\u002Falpaca_eval\u002Fleaderboards\u002Fdata_AlpacaEval\u002Fyour_leaderboard_name.csv> \\\n  --all_model_outputs alpaca_eval_all_outputs.json \\\n  --reference_outputs \u003Cpath_to_json_file>\n```\n\n请提交包含评估集 json 和相应排行榜 csv 的 PR。\n\n\n\u003C\u002Fdetails>\n\n\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">贡献补全函数\u003C\u002Fh2>\u003C\u002Fsummary>\n\n目前，我们支持不同的补全函数（completion function），例如 `openai`, `anthropic`, `huggingface_local`, `huggingface_hub_api` ... 如果您想贡献一个新的补全函数\u002FAPI 来进行推理（inference），请遵循以下步骤：\n1. 在 [decoder 文件夹](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fdecoders) 中添加一个名为 `\u003Cname>.py` 的文件，其中包含函数 `\u003Cname>_completions(prompts : Sequence[str], model_name :str, ... 
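下面是一个高度简化的骨架（`my_api_completions` 与 `call_my_api` 均为假设的名字；返回字典中除补全结果外还应包含哪些字段，请以 decoders 目录下现有实现为准）：

```python
from typing import Sequence


def call_my_api(prompt: str, model: str, **kwargs) -> str:
    """假设的占位函数：请替换为对你自己的推理服务 / API 的真实调用。"""
    return f"[{model}] dummy completion for: {prompt[:30]}"


def my_api_completions(prompts: Sequence[str], model_name: str, **decoding_kwargs) -> dict:
    """示意骨架：接收 prompts 与解码参数，逐条生成补全并返回。"""
    completions = [call_my_api(p, model=model_name, **decoding_kwargs) for p in prompts]
    return {"completions": completions}
```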
)`。该函数应以 prompts 和 kwargs 作为参数并返回补全结果（completions）。请查看目录中的其他补全函数作为模板。例如 [huggingface_local_completions](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fdecoders\u002Fhuggingface_local.py) 或 [anthropic](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fdecoders\u002Fanthropic.py)。\n2. 在 [__init__](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fdecoders\u002F__init__.py) 中添加 `\u003Cname>_completions` 及其依赖项。同样您可以参考 [huggingface_local_completions](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fdecoders\u002F__init__.py#L30) 的示例。\n3. 更新 [setup.py](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsetup.py) 中的可选依赖项。\n4. 在 [models configs](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fmodels_configs) 中添加您想要评估的模型。\n5. 使用 `alpaca_eval evaluate_from_model --model_configs '\u003Cmodel_configs>'` 评估您的模型。\n6. （可选）按照 [这些步骤](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain#contributing-a-model) 将上一个模型的结果推送到 AlpacaEval 排行榜。\n\n随时可以尽早开始提交 PR (Pull Request)，我们将能在过程中提供帮助！ \n\n\u003C\u002Fdetails>\n\n\n\n# 局限性\n\nAlpacaEval 评估流程与其他当前的评估器一样，存在重要的局限性，因此不应在重要场景中替代人工评估，例如决定模型是否准备好部署。这些局限性大致可分为三类：\n\n1. **指令可能无法代表真实使用情况**：AlpacaEval 集包含来自各种数据集的示例（[self-instruct](https:\u002F\u002Fgithub.com\u002Fyizhongw\u002Fself-instruct), [open-assistant](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenAssistant\u002Foasst1\u002Fviewer\u002FOpenAssistant--oasst1\u002Fvalidation), [vicuna](https:\u002F\u002Flmsys.org\u002Fblog\u002F2023-03-30-vicuna\u002F), [koala](https:\u002F\u002Fgithub.com\u002Farnav-gudibande\u002Fkoala-test-set), [hh-rlhf](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAnthropic\u002Fhh-rlhf\u002Fviewer\u002FAnthropic--hh-rlhf\u002Ftest)），这些可能无法代表真实使用情况以及像 GPT4 这样更好模型的先进应用。这可能导致最好的闭源模型（GPT4 \u002F Claude \u002F ChatGPT \u002F ...）看起来比实际情况更接近开源模型。事实上，这些闭源模型似乎是在更多样化的数据上预训练\u002F微调的。例如参见 [此博客](https:\u002F\u002Fmedium.com\u002F@marcotcr\u002Fexploring-chatgpt-vs-open-source-models-on-slightly-harder-tasks-aa0395c31610) 关于更复杂指令的初步结果。然而，请注意，在 [AlpacaFarm](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14387) 中，我们展示了我们在评估集上的胜率（win-rates）与 Alpaca Demo 用户交互指令的胜率高度相关（0.97 R2）。此外，AlpacaEval 排行榜显示的开源模型和 OpenAI 模型之间的差距比其他排行榜（例如 [lmsys](https:\u002F\u002Flmsys.org\u002Fblog\u002F2023-03-30-vicuna\u002F)）更大。\n\n2. **自动标注者的偏见**：原始自动标注者似乎存在隐性偏见。特别是，我们发现它们倾向于偏好更长的输出和包含列表的输出（例如 `alpaca_eval_gpt4` 为 0.68 \u002F 0.69，`claude` 为 0.62 \u002F 0.58）。虽然我们发现人类也有类似的偏见（0.64 \u002F 0.61），但我们认为这可能更多是我们使用的人工标注流程的局限性，而非真正的人类偏见。更普遍地说，通过定性分析，我们发现自动标注者更重视输出的风格而非其内容（例如事实性）。最后，我们发现自动评估者倾向于偏好来自相似模型（很可能是在相同数据上训练的）的输出，正如 `claude` 和 `alpaca_eval_gpt4` 排行榜上 ChatGPT\u002FGPT4 的巨大差异所表明的那样。注意，长度控制在我们的长度控制胜率中部分缓解了长度偏见。\n3. 
**缺乏安全评估**：重要的是，AlpacaEval 仅评估模型的指令遵循能力，而不是它们可能造成的危害（例如有毒行为或偏见）。因此，当前 ChatGPT 和最佳开源模型之间的小差距**不应**被解释为后者已准备好部署。\n\n除了上述关于评估流程的局限性外，关于我们对评估器的验证以及我们 [提出的方法](#analyzing-an-eval-set) 来选择评估集也存在局限性。\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>关于我们验证流程的局限性\u003C\u002Fb>\u003C\u002Fb>\u003C\u002Fsummary>\n\n首先，基于人工交叉标注的评估器验证存在以下局限性：(1) 我们定性发现，我们的众包工作者也倾向于偏好风格（如长度和列表的存在）而非事实性；(2) 这并不能首先验证针对参考模型的胜率是否是一种好的评估策略；(3) 16 名众包工作者的偏好不能代表所有人类的偏好。\n\n其次，我们建议的基于统计功效选择评估集的方法存在以下局限性：(1) 统计功效不能确保正确的方向，例如，您可能有一组不自然的指令，其中 Alpaca 的表现“优于”更好的模型；(2) 这可能会促使用户选择数据以支持他们想要验证的假设。\n\n\u003C\u002Fdetails>\n\n\n# 额外分析和图表\n\n\n[\u002F\u002F]: # (AlpacaEval provides a few visualization tools to help you analyze and improve your automatic evaluation pipeline. We)\n\n[\u002F\u002F]: # (briefly explain)\n\n[\u002F\u002F]: # (them here and provide)\n\n[\u002F\u002F]: # (notebooks for more analysis. )\n\n[\u002F\u002F]: # (For a description of all the metrics we consider)\n\n[\u002F\u002F]: # (refer to [How exactly are those metrics computed?]&#40;https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval#evaluators&#41;)\n\n## 长度控制的 AlpacaEval (LCAE)\n\n\n**长度控制的 AlpacaEval 可视化：**\n[![analyzing an evaluator](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fnotebooks\u002Ffigured_length_controlled.ipynb)\n\n**长度控制的 AlpacaEval 开发：**\n[![analyzing an evaluator](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fnotebooks\u002Flength_controlled.ipynb)\n\n该笔记本展示了我们考虑过的用于减轻自动标注器长度偏差的不同选项。\n\n在此我们简要总结主要结果。即：\n- **LCAE（长度控制版 AlpacaEval）将 AlpacaEval 2.0 与 Chat Arena 的相关性从 0.94 提高到了 0.98**。这使得 LCAE 成为与 Chat Arena 相关性最高的基准，如下图所示。\n\n\n\u003Cp float=\"left\" align=\"middle\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_3729ec9ba6f0.png\" alt=\"LC AlpacaEval 是与 Chat Arena 相关性最高的基准。\" width=\"500\"\u002F>\n\u003C\u002Fp>\n\n- **LCAE 降低了长度可操纵性**。AlpacaEval 的主要问题之一是，你可以通过增加输出长度来提高胜率。例如，在 AlpacaEval 2.0 中，当提示“尽可能提供详细信息”时，基线的胜率（50%）增加到 64%，而当提示“在提供回答问题所需的所有必要信息的同时尽可能简洁”时，胜率下降到 23%。更一般地说，AlpacaEval 的相对长度可操纵性约为 21%，而 LCAE 下降至约 6%，因此通过提示词长度进行操纵的可能性降低了 3 倍。如下图所示。  \n\n\n\u003Cp float=\"left\" align=\"middle\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_68c997c99880.png\" alt=\"LC AlpacaEval 降低了基准的长度可操纵性。\" width=\"500\"\u002F>\n\u003C\u002Fp>\n\n- **我们可以预测不同基线的性能**。使用 GLM（广义线性模型）来控制长度偏差的另一个好处在于，我们现在拥有一个模型，可以预测模型在不同基线下的胜率。特别是，我们的 GLM 具有许多优良属性，例如 `win_rate(m,b) = 1 - win_rate(b,m) \\in [0,1]` 和 `win_rate(m,m) = 0.5`。如下图所示。\n\n\n\u003Cp float=\"left\" align=\"middle\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_0f60e430f35c.png\" alt=\"不同基线的预测胜率\" width=\"500\"\u002F>\n\u003C\u002Fp>\n\n\n最后，请注意我们仅控制了长度偏差。还有其他已知的偏差我们没有控制，例如自动标注器倾向于偏好与其模型相似的输出。虽然我们可以控制这一点，但在实践中我们发现这不如长度偏差那么严重。原因有二：(1) 这主要是因为排行榜上主要是单个模型，因为在自动标注器的输出上进行微调似乎并没有像想象中那样显著影响胜率；(2) 这种偏差实际上比人们想象的要弱。例如，我们在下面展示了由三个不同模型自动标注的排行榜子集，我们看到模型的排名完全相同。特别是，`claude-3-opus` 偏好 `gpt4_preview`，而 `mistral-large` 偏好前两者。\n\n\u003Cp float=\"left\" align=\"middle\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_a2bfd8f7f18b.png\" alt=\"不同自动标注器的排行榜\" width=\"500\"\u002F>\n\u003C\u002Fp>\n\n\u003Cdetails>\n  
\u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">分析评估器\u003C\u002Fh2>\u003C\u002Fsummary>\n\n[\u002F\u002F]: # (## Analyzing an evaluator)\n\n**注意**：以下所有结果均关于 AlpacaEval 1.0，且此后未更新。\n\n**分析评估器：**\n[![analyzing an evaluator](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fnotebooks\u002Fanalyzing_annotators.ipynb)\n\n正如我们在 [评估器排行榜](#evaluators) 中所见，在选择评估器时有许多指标需要考虑，例如质量、价格和速度。为了协助选择评估器，我们提供了一些函数来绘制这些指标。下图例如显示了不同评估器的价格\u002F时间\u002F一致性。\n\n![plot_quality_vs_price_and_time.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_c8f5171445cc.png)\n\n在这里我们看到 `alpaca_eval_gpt4` 表现非常好，在所有考虑的指标上都优于人类。\n\n此前我们只考虑了与人类标注者的整体一致性。可以进行的额外验证是检查使用我们的自动标注器制作排行榜是否与人类制作的排行榜给出相似的结果。为了支持此类分析，我们发布了来自 [AlpacaFarm](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_farm) 的 22 种方法的输出的 [人类标注](#data-release) => 22*805 = ~18K 标注。因此，我们可以测试 22 个模型的胜率在人类评估和我们自动标注器评估之间的相关性。请注意，这可以说是比使用“人类一致性 [%]\"更好的选择自动评估器的方法，但由于需要 18K 标注，成本较高。下图显示了 `alpaca_eval_gpt4` 评估器的此类相关性。\n\n\u003Cp float=\"left\" align=\"middle\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_aa8801bfd20b.png\" alt=\"人类与 alpaca_eval_gpt4 之间的相关性\" width=\"400\"\u002F>\n\u003C\u002Fp>\n\n我们看到 `alpaca_eval_gpt4` 排行榜与人类排行榜高度相关（0.94 皮尔逊相关系数），这进一步表明自动评估是人类评估的良好代理。有关代码和更多分析，请参阅 [此笔记本](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fnotebooks\u002Fanalyzing_annotators.ipynb)，或上面的 Colab 笔记本。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">分析评估集\u003C\u002Fh2>\u003C\u002Fsummary>\n\n[\u002F\u002F]: # (## Analyzing an eval set)\n\n**注意**：以下所有结果均关于 AlpacaEval 1.0，且此后未更新。\n\n**制作评估集：**\n[![analyzing an evaluator](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fnotebooks\u002Fanalyzing_evalset.ipynb)\n\n创建评估集时有两个主要因素需要考虑：使用多少数据？以及什么数据？\n\n回答这些问题的一种方法是考虑一个你认为质量不同的模型排行榜，并检查需要多少数据才能在统计上显著地区分它们。我们将在下面使用配对 t 检验来测试每对模型之间的胜率差异是否具有统计学意义。\n\n首先，让我们考虑使用多少数据的问题。下面我们展示了从 AlpacaEval 中需要的随机样本数量，以便在最小的 `alpaca_eval_gpt4` 排行榜中，每对模型的配对 t 检验给出 P 值 \u003C 0.05。灰色单元格对应于在 805 个样本中没有显著差异的对。y 轴和 x 轴分别按第一个和第二个模型的胜率排序。\n\n\n\u003Cp float=\"left\" align=\"middle\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_2173d751396e.png\" alt=\"区分 Claude 排行榜中对所需的样本数量\" width=\"500\"\u002F>\n\u003C\u002Fp>\n\n我们可以看到，大多数模型仅用 50 个样本即可区分，而 150 个样本允许区分绝大多数配对（78 对中的 74 对）。这表明在测试具有与最小 `alpaca_eval_gpt4` [排行榜](#models) 上相似性能差距的两个模型时，我们可以将评估集大小减少为原来的四分之一。\n\n第二个问题是什么数据可以使用。同样，我们可以从统计功效 (statistical power) 的角度来回答这个问题：什么样的数据最能区分模型。让我们考虑 AlpacaEval 包含的所有数据集，但让我们控制评估集的大小，因为我们只关心数据的质量。下图显示了 AlpacaEval 每个子集的 80 个示例上，每对模型的成对 t 检验 (paired t-test) 的 P 值 (p-values)。\n\n![plot_paired_ttests_per_dataset.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_readme_78d595bcdb16.png)\n\n例如，我们看到 self-instruct 数据集产生的统计功效最低，这表明可以从评估集中移除该数据集。确切原因应在未来的工作中进行分析。关于代码和更多分析，请参阅 [此笔记本](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fnotebooks\u002Fanalyzing_evalset.ipynb)，或上面的 colab 笔记本。\n\n\u003C\u002Fdetails>\n\n\n\n# 引用\n\n请根据您使用的内容和引用的内容考虑引用以下内容：\n- **代码、结果和通用基准**：`alpaca_eval`（本仓库）。指定您使用的是 AlpacaEval 还是 AlpacaEval 2.0。有关长度控制的胜率，见下文。\n- **长度控制 (LC) 
胜率**：`alpaca_eval_length`。\n- **人工标注**：`dubois2023alpacafarm` ([AlpacaFarm](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14387))\n- **AlpacaEval 评估集**：`alpaca_eval` 以及 [self-instruct](https:\u002F\u002Fgithub.com\u002Fyizhongw\u002Fself-instruct), [open-assistant](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenAssistant\u002Foasst1\u002Fviewer\u002FOpenAssistant--oasst1\u002Fvalidation), [vicuna](https:\u002F\u002Flmsys.org\u002Fblog\u002F2023-03-30-vicuna\u002F), [koala](https:\u002F\u002Fgithub.com\u002Farnav-gudibande\u002Fkoala-test-set), [hh-rlhf](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAnthropic\u002Fhh-rlhf\u002Fviewer\u002FAnthropic--hh-rlhf\u002Ftest)。\n\n以下是 bibtex 条目：\n\n```\n@misc{alpaca_eval,\n  author = {Xuechen Li and Tianyi Zhang and Yann Dubois and Rohan Taori and Ishaan Gulrajani and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto },\n  title = {AlpacaEval: An Automatic Evaluator of Instruction-following Models},\n  year = {2023},\n  month = {5},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval}}\n}\n```\n\n```\n@article{dubois2024length,\n  title={Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators},\n  author={Dubois, Yann and Galambosi, Bal{\\'a}zs and Liang, Percy and Hashimoto, Tatsunori B},\n  journal={arXiv preprint arXiv:2404.04475},\n  year={2024}\n}\n```\n\n```\n@misc{dubois2023alpacafarm,\n  title={AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback}, \n  author={Yann Dubois and Xuechen Li and Rohan Taori and Tianyi Zhang and Ishaan Gulrajani and Jimmy Ba and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},\n  year={2023},\n  eprint={2305.14387},\n  archivePrefix={arXiv},\n  primaryClass={cs.LG}\n}\n```\n\n# 更多信息\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">长度控制胜率\u003C\u002Fh2>\u003C\u002Fsummary>\n\n长度控制 (LC) 胜率是胜率的去偏版本，它控制了输出的长度。\n\n主要思想是，对于每个模型，我们将拟合一个逻辑回归 (logistic regression) 来预测自动标注器 (autoannotator) 的偏好，给定条件为：(1) 指令，(2) 模型，以及 (3) 基线 (baseline) 与模型输出之间的长度差异。\n有了这样的逻辑回归，我们可以通过将长度差异设置为 0 来尝试预测反事实 (counterfactual) 情况：“如果模型的输出长度与基线相同，偏好会是多少”。\n通过对这种长度控制的偏好进行平均，我们便得到了长度控制的胜率。\n逻辑回归的确切形式被设定为使 LC 胜率的解释类似于原始胜率，例如对于任何模型 `m1` 和 `m2`，我们有 `win_rate(m1, m2) = 1 - win_rate(m2, m1) \\in [0,100]` 且 `win_rate(m1, m1) = 0.5`。\n长度控制胜率将 AlpacaEval 排行榜 (leaderboard) 与 Chat Arena 之间的相关性从 **0.93 提高到 0.98 斯皮尔曼相关系数 (Spearman correlation)**，同时显著降低了标注器的长度可操纵性 (length gameability)。\n有关长度控制胜率的更多信息和结果，请参阅 [此笔记本](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fnotebooks\u002Flength_controlled.ipynb)。\n\n这种通过预测结果并条件化于中介变量 (mediator)（长度差异）来估计受控直接效应 (controlled direct effect) 的想法，在统计推断中很常见。\n\n要获取先前标注模型的 LC 胜率，您可以使用以下命令：\n\n```bash\npip install -U alpaca_eval\nalpaca_eval --model_outputs … --is_recompute_metrics_only True\n```\n\n\u003C\u002Fdetails>\n\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">AlpacaEval 2.0\u003C\u002Fh2>\u003C\u002Fsummary>\n\nAlpacaEval 2.0 是 AlpacaEval 的新版本。以下是区别：\n- **参考模型：`gpt4_turbo`**：我们将基线 (baseline) 从 `text-davinci-003` 升级到 `gpt4_turbo`，以使基准 (benchmark) 更具挑战性，并获得更能反映当前最先进水平 (state of the art) 的指标。\n- **标注器：`weighted_alpaca_eval_gpt4_turbo`**：我们在质量和价格方面改进了标注器。首先，我们使用 `gpt4_turbo` 模型进行标注，其成本约为 `gpt4` 的一半。其次，我们更改了提示词，使模型输出单个 token，这进一步降低了成本并提高了速度。最后，我们没有使用二元偏好，而是使用了 logprobs (对数概率) 来计算连续偏好，从而得出最终的加权胜率。请注意，后两项变化产生了意想不到的效果，即减少了标注器的长度偏差。\n\n默认情况下，`pip install alpaca_eval==0.5` 将使用 
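作为对上述思路的一个极简示意（仅保留长度差这一个特征，忽略了实际实现中与指令和模型相关的项；数据为虚构），可以这样理解"把长度差置为 0 再取平均"的反事实校正：

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# 虚构数据：y = 自动标注器是否偏好被评估模型（1 是 / 0 否），
# delta_len = 被评估模型输出长度 - 基线输出长度（已标准化；均值为正表示该模型整体输出更长）
n = 805
delta_len = rng.normal(0.8, 1.0, size=(n, 1))
true_p = 1 / (1 + np.exp(-(0.3 + 1.2 * delta_len[:, 0])))  # 构造一个偏好长输出的"标注器"
y = rng.binomial(1, true_p)

glm = LogisticRegression().fit(delta_len, y)

raw_win_rate = y.mean()
# 反事实：把长度差全部设为 0，再对预测出的偏好取平均，得到（简化版的）长度控制胜率
lc_win_rate = glm.predict_proba(np.zeros_like(delta_len))[:, 1].mean()
print(f"raw win rate ≈ {raw_win_rate:.3f}, length-controlled win rate ≈ {lc_win_rate:.3f}")
```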
AlpacaEval 2.0。如果您希望默认使用旧配置，可以在环境中设置 `IS_ALPACA_EVAL_2=False`。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">数据发布\u003C\u002Fh2>\u003C\u002Fsummary>\n\n作为 AlpacaEval 的一部分，我们发布以下数据：\n\n- **人工标注数据（17701）** 为了开发和理解自动评估器（automatic evaluators），我们发布了为 AlpacaFarm 收集的所有人类成对评估数据。这包含了在 AlpacaFarm 评估集上，22 个模型与 `text-davinci-003` 参考模型之间的比较。标注来自 Amazon Mechanical Turk 上的 16 名众包工人池。涉及的模型包括：6 个来自 OpenAI，2 个来自 AlpacaFarm 的 SFT（监督微调）模型，13 个来自 AlpacaFarm 的 RLHF（基于人类反馈的强化学习）方法，以及 LLaMA 7B。\n- **人工交叉标注（2596）** 为了进一步分析自动评估器，我们从 AlpacaFarm 评估集中选择了 650 个示例（通过跨模型和数据层的分层采样），并为每个示例收集了 4 个人工标注。\n- **AlpacaEval 数据集（805）** 我们对 AlpacaFarm 评估集进行了轻微的修改\u002F简化。首先，我们将 instruction（指令）和 input（输入）字段合并为单个 instruction 字段。这影响了 AlpacaFarm 评估集中 1\u002F4 的示例，这些示例全部来自 [self-instruct 评估集](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.10560)。其次，我们重新生成了 text-davinci-003 参考输出，不再限制其输出长度。\n\n关于人工标注的更多详情，请参阅 [AlpacaFarm 论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14387)。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">与 AlpacaFarm 的区别\u003C\u002Fh2>\u003C\u002Fsummary>\n\nAlpacaEval 是对 [AlpacaFarm](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_farm) 中的自动成对偏好模拟器的改进和简化。除了 AlpacaFarm 之外，你应该使用 AlpacaEval。主要区别如下：\n\n- **AlpacaEval 合并了指令和输入**：AlpacaEval 评估与 AlpacaFarm 评估相同，只是 instruction（指令）和 input（输入）字段被合并为 `{instruction}\\n\\n{input}`。这影响了 AlpacaFarm 评估集中 1\u002F4 的示例（[self-instruct](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.10560) 子集）。这种简化为那些未通过区分这两个字段进行训练的模型提供了更公平的比较。\n- **AlpacaEval 处理更长的生成**：AlpacaFarm 中的模型在生成时限制最大 token（词元）数为 300。我们将此数字更改为 AlpacaEval 的 2000。注意，这也影响了参考生成（`text-davinci-003`），因此即使对于没有 input 字段的示例，AlpacaEval 上的结果也与 AlpacaFarm 上的结果不可比。\n- **AlpacaEval 消除了标注者内部和标注者之间的方差**：AlpacaFarm 模拟器在模式行为和多样性方面复制了人工标注。特别是，AlpacaFarm 的模拟器使用模型池和提示词（prompt），并添加噪声以复制人工标注的内部和之间方差。如果目标是使用自动标注器进行评估或简单地训练更好的模型，那么这种方差可能并不理想。因此，AlpacaEval 的默认标注器没有这种方差。我们通过使用 `--anotators_config 'alpaca_farm'` 和 `--p_label_flip 0.25` 创建评估器时，提供选项将其加回。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">相关工作\u003C\u002Fh2>\u003C\u002Fsummary>\n\n已有几项工作提出了用于指令跟随模型的新自动标注器。这里我们列出我们所知的并讨论它们与我们的区别。我们在 [我们的评估器排行榜](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval#evaluators) 中评估了所有这些。\n\n- **Vicuna\u002Flmsys** lmsys 标注器（`lmsys_gpt4`）通过询问标注器对每个输出的评分（1-10 分），然后选择得分最高的输出作为首选来评估成对输出。他们不对输出顺序进行随机化，并且在评分后要求解释。总体而言，我们发现该标注器对较长输出有强烈的 bias（偏差）（0.74），且与人工标注的 correlation（相关性）相对较低（63.2）。\n- **AlpacaFarm** 最好的 AlpacaFarm 标注器（`alpaca_farm_greedy_gpt4`）通过直接询问标注器它更喜欢哪个输出来评估成对输出。此外，它将 5 个示例 batch（批处理）在一起以分摊提示词的长度，并对输出顺序进行随机化。总体而言，我们发现该标注器对较长输出的 bias（偏差）较小（0.60），且比其他标注器更快（每 1000 个示例 878 秒）。它与大多数人工标注的 correlation（相关性）略高（66.4），高于人类本身（65.7）。然而，它的成本更高（每 1000 个示例 $15.3），并且由于 batching（批处理）原因无法处理非常长的输出。\n- **Aviary** Aviary 标注器（`aviary_gpt4`）要求标注器按偏好对输出进行排序，而不仅仅是选择首选输出。它不对输出顺序进行随机化，并使用较高的 temperature（温度参数）进行解码（0.9）。总体而言，我们发现该标注器对较长输出有相对较强的 bias（偏差）（0.70），且与人工标注的 correlation（相关性）非常高（69.1）。通过降低 temperature（温度参数）和随机化输出顺序，我们 [进一步提高了](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fblob\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002FREADME.md) correlation（相关性）至 69.8（`improved_aviary_gpt4`），但这进一步将长度 bias（偏差）增加到了 0.73。\n\n我们的 `alpaca_eval_gpt4` 是 AlpacaFarm 和 Aviary 标注器的混合体。它要求标注器按偏好对输出进行排序，但它使用 temperature（温度参数）0，对输出进行随机化，并对 prompt（提示词）进行了一些修改，将长度 bias（偏差）降低到 0.68。\n\n其他相关工作包括最近分析自动评估器的论文。例如：\n\n- [AlpacaFarm Appx C](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14387)\n  和 [Large Language 
Models are not Fair Evaluators](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.17926v1) 都发现自动标注器存在 position bias（位置偏差）。\n- [AlpacaFarm Sec. 5.2.](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14387)\n  和 [The False Promise of Imitating Proprietary LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.15717) 都发现自动标注器偏爱 style（风格，例如列表的使用、语调、措辞、长度）而非 factuality（事实性）。\n\n\u003C\u002Fdetails>\n\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">解读标注\u003C\u002Fh2>\u003C\u002Fsummary>\n\n\u003C\u002Fdetails>\n\n对于所有模型，您可以在 `results\u002F\u003Cmodel_name>\u002F*\u002Fannotations.json` 下找到 auto-annotations（自动标注）。这些标注包含以下列：\n- `instruction`: prompt（提示词）\n- `generator_1`: baseline model（基线模型）\n- `output_1`: 基线模型的输出\n- `generator_2`: 被评估的模型\n- `output_2`: 被评估模型的输出\n- `annotator`: auto-annotator（自动标注器）\n- `preference`: auto-annotator（自动标注器）的结果。这是一个介于 1 和 2 之间的 float（浮点数）。越接近 1 表示 auto-annotator（自动标注器）更偏好 `output_1`，越接近 2 表示它更偏好 `output_2`。对于 AlpacaEval 2.0，`preference-1` 对应于 `output_1` 被选中的概率。对于 AlpacaEval 1.0，如果 `output_1` 被偏好则 `preference` 为 1，如果 `output_2` 被偏好则为 2，如果两者相同则为 1.5。win rate（胜率）始终是 `(preference -1).mean()`。\n- `raw_completion`: auto-annotator（自动标注器）的原始输出。此字段包含在 `output_1` 和 `output_2` 顺序去随机化之前的补全内容！因此更难解读，详见下文。\n\n**Chain of thought（思维链）**\n\n对于某些标注器，例如 `alpaca_eval_cot_gpt4_turbo_fn`，我们使用 Chain of thought reasoning（思维链推理）使模型的偏好更具可解释性。这些内容可以在 `concise_explanation` 下找到。要解读它们，您还应该查看 `referenced_models`，它将临时模型名称（在提示词中）转换为实际输出。下面，我们将提供更多关于背后机制的解释。\n\n您可以检查 `annotations.json` 中的 `raw_annotations[\"concise_explanation]` 列（例如 [此处](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fresults\u002Fgpt4\u002Falpaca_eval_cot_gpt4_turbo_fn\u002Fannotations.json)），其中包含自动标注器的 Chain of thought reasoning（思维链推理）。请注意，`raw_annotations` 不受输出顺序随机化的影响。特别是，`\"m\"` 和 `\"M\"` 有时指第一个模型（参考模型），有时指第二个模型（被评估模型）。要了解指的是哪个模型，您应该使用 `preference` 和 `ordered_models` 列。为了使其更容易理解，我们添加了一个 `\"referenced_models\"` 列，将模型名称映射到相应的输出。例如，在下面的标注中，我们看到偏好值为 1.0（即 `output_1`），并且对应于 `concise_explanation` 中的模型 `M`（参见 `ordered_models`）。\n\n```json \n{\n  \"instruction\": \"How did US states get their names?\",\n  \"output_1\": \"The names of U.S. states are derived from a variety of languages and sources, including indigenous American languages, European languages (primarily English, Spanish, and French), as well as other historical influences. Here is a brief overview of how some states got their names:\\n\\n1. **Indigenous American Languages**: Many state names come from Native American words or names of tribes, often from the languages of the indigenous peoples who lived in or near those areas.\\n   - **Alabama**: Possibly derived from the Choctaw language, meaning \\\"thicket clearers.\\\"\\n   - **Connecticut**: From a Mohegan-Pequot word meaning \\\"long tidal river.\\\"\\n   - **Massachusetts\": [...]\",\n  \"generator_1\": \"gpt4_1106_preview\",\n  \"dataset\": \"helpful_base\",\n  \"output_2\": \"The names of the 50 U.S. states come from a variety of sources, including Native American languages, European languages, and historical figures. Here's a brief overview of how some states got their names:\\n\\n1. Native American origins: Many states have names derived from Native American languages. For example, Alabama comes from the Choctaw word \\\"Albah amo,\\\" meaning \\\"plant gatherers\\\" or \\\"herb gatherers.\\\" Similarly, the name Mississippi comes from the Ojibwe word \\\"Misi-ziibi,\\\" meaning \\\"great river.\\\"\\n\\n2. 
European languages: [...].\",\n  \"generator_2\": \"gpt4\",\n  \"annotator\": \"alpaca_eval_cot_gpt4_turbo_fn\",\n  \"preference\": 1.0,\n  \"raw_completion\": {\n    \"concise_explanation\": \"Model M provided a more detailed and structured response, including bold headings for each category and a wider range of examples. It also included additional categories such as 'Other European Languages' and 'Combination of Languages and Influences', which added depth to the explanation. Model m's response was accurate but less comprehensive and lacked the clear structure found in Model M's output.\",\n    \"ordered_models\": [\n      {\n        \"model\": \"M\",\n        \"rank\": 1\n      },\n      {\n        \"model\": \"m\",\n        \"rank\": 2\n      }\n    ]\n  },\n  \"referenced_models\": {\n    \"M\": \"output_1\",\n    \"m\": \"output_2\"\n  }\n}\n```\n\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Ch2 tabindex=\"-1\" dir=\"auto\">主要更新\u003C\u002Fh2>\u003C\u002Fsummary>\n\n- 2024 年 3 月 12 日：更新为使用 length-controlled (LC)（长度控制）win rates（胜率）。这是控制输出长度的 win-rates（胜率）的无偏版本。\n- 2024 年 1 月 3 日：更新至 AlpacaEval 2.0，该版本使用 GPT4-turbo 作为 baseline（基线）和 annotator（标注器）。\n- 2024 年 1 月 2 日：添加了 Azure API 以及更通用的客户端配置设置方式。参见 [此处](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fclient_configs\u002FREADME.md)\n- 2023 年 6 月 19 日：添加任何人都可以使用的排行榜 `chatgpt_fn`（无需等待列表）。\n- 2023 年 6 月 19 日：更新以使用 [OpenAI 的 function calling（函数调用）](https:\u002F\u002Fopenai.com\u002Fblog\u002Ffunction-calling-and-other-api-updates)。示例：[`chatgpt_fn`](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Fchatgpt_fn) 或 [`alpaca_eval_gpt4_fn`](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Ftree\u002Fmain\u002Fsrc\u002Falpaca_eval\u002Fevaluators_configs\u002Falpaca_eval_gpt4_fn)。\n\n\u003C\u002Fdetails>","# AlpacaEval 快速上手指南\n\nAlpacaEval 是一个用于指令跟随语言模型的自动评估工具，旨在提供快速、低成本且与人类评估高度相关的基准测试。\n\n## 环境准备\n\n- **系统要求**：Python 3.10 或更高版本\n- **前置依赖**：\n  - 需要配置 OpenAI API Key（默认评估器基于 GPT-4）\n  - 如需评估本地模型，需确保能访问 HuggingFace 或相关 API\n- **网络建议**：国内开发者建议在安装时配置国内 pip 镜像源以加速下载。\n\n## 安装步骤\n\n通过 pip 安装稳定版：\n\n```bash\npip install alpaca-eval\n```\n\n如需安装开发版（nightly version）：\n\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\n```\n\n## 基本使用\n\n### 1. 配置 API 密钥\n\n在使用前，请导出您的 OpenAI API Key：\n\n```bash\nexport OPENAI_API_KEY=\u003Cyour_api_key>\n```\n\n### 2. 评估模型输出文件\n\n如果您已有模型生成的输出文件（JSON 格式），可直接运行评估命令。文件需包含 `instruction` 和 `output` 字段。\n\n```bash\nalpaca_eval --model_outputs 'example\u002Foutputs.json' \n```\n\n此命令将打印排行榜到控制台，并将结果保存至同一目录。\n\n### 3. 
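评估完成后，除了查看打印出的排行榜，也可以直接从保存下来的标注文件复核胜率（如上文「解读标注」一节所述，胜率就是 `(preference - 1).mean()`）。下面是一个示意片段，其中文件路径仅为示例，实际位置取决于 `--output_path`、标注器与模型名称：

```python
import json

import pandas as pd

# 路径仅为示例：annotations.json 通常位于 results/<model_name>/*/ 之下
with open("results/my_model/weighted_alpaca_eval_gpt4_turbo/annotations.json") as f:
    df = pd.DataFrame(json.load(f))

# preference 越接近 2，表示被评估模型（output_2）越常被偏好；胜率即 (preference - 1) 的均值
win_rate = (df["preference"] - 1).mean()
print(f"win rate vs. reference = {win_rate:.1%}")
```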
直接评估模型\n\n若没有预先生成的输出，可直接指定模型名称（支持 HuggingFace 模型或标准 API 提供商）：\n\n```bash\nalpaca_eval evaluate_from_model \u003Cmodel_name_or_path>\n```\n\n### 常用参数说明\n\n- **`--model_outputs`**：待评估模型输出的 JSON 文件路径。\n- **`--annotators_config`**：选择评估器，默认为 `weighted_alpaca_eval_gpt4_turbo`（推荐）。\n- **`--reference_outputs`**：参考模型的输出，默认使用 `gpt4_turbo`。\n- **`--output_path`**：保存评估结果的路径。\n\n更多命令选项可运行以下查看帮助：\n\n```bash\nalpaca_eval --help\n```","某 AI 初创团队在微调开源大模型时，需要频繁对比多个检查点（checkpoint）的指令遵循能力，以便决定下一步优化方向。\n\n### 没有 alpaca_eval 时\n- 依赖人工标注或众包平台，评估一个版本耗时数天，严重拖慢迭代节奏。\n- 聘请专业评测人员成本高昂，单次完整测试预算可能超过数百美元。\n- 人工评分存在主观偏差，不同评审员对同一回答的打分差异较大，难以横向对比。\n- 缺乏自动化基准，无法快速验证新策略是否真的提升了模型表现。\n\n### 使用 alpaca_eval 后\n- alpaca_eval 能在 3 分钟内自动完成全量测试，将评估周期从数天压缩至分钟级。\n- 仅需少量 API 调用费用，单次运行成本低于 10 美元，极大降低了试错门槛。\n- 其评分结果与 ChatBot Arena 的人类偏好相关性高达 0.98，数据可信度媲美人工。\n- 内置长度控制机制，有效防止模型通过堆砌字数来刷高分，确保评估公平性。\n\nalpaca_eval 以极低的成本和极高的效率，为模型开发提供了可信赖的自动化评估基准。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftatsu-lab_alpaca_eval_c8f51714.png","tatsu-lab","Tatsu's shared repositories","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Ftatsu-lab_d160c91d.png","Tatsu's shared repos",null,"https:\u002F\u002Fgithub.com\u002Ftatsu-lab",[82,86],{"name":83,"color":84,"percentage":85},"Jupyter Notebook","#DA5B0B",91,{"name":87,"color":88,"percentage":89},"Python","#3572A5",9,1963,306,"2026-04-03T12:06:39","Apache-2.0","未说明",{"notes":96,"python":97,"dependencies":98},"需设置 OPENAI_API_KEY 环境变量；核心评估功能依赖外部 API（如 GPT-4）；运行速度快（\u003C5 分钟），成本低（\u003C$10）；支持评估本地 HuggingFace 模型或标准 API 模型；无需本地 GPU 即可完成基于 API 的自动评估流程。","3.10+",[94],[54,13,26],[101,102,103,104,105,106,107,108],"deep-learning","evaluation","foundation-models","instruction-following","large-language-models","leaderboard","nlp","rlhf",5,"2026-03-27T02:49:30.150509","2026-04-06T08:52:40.268111",[113,118,123,127,132,137],{"id":114,"question_zh":115,"answer_zh":116,"source_url":117},3114,"为什么 AlpacaEval 2.0 评估结果显示胜率异常高或出现 GLM 警告？","这通常是因为使用了非官方的 API 部署或错误的模型配置。建议直接使用 OpenAI 官方 API 进行测试。如果使用的是内部版本（如微软），请确保与 OpenAI 兼容。GLM 警告（长度控制胜率与原始胜率差异大）可能意味着 GLM 失效，需检查模型和 API 设置。","https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fissues\u002F310",{"id":119,"question_zh":120,"answer_zh":121,"source_url":122},3115,"如何将自定义脚本生成的评估结果转换为 AlpacaEval 所需的 annotations.json 格式？","可以转换，但需要保留特定字段（instruction, output_1, generator_1, output_2, generator_2, annotator, raw_completion）。维护者提供了参考代码，指出需要确保数据框包含这些列才能正确生成结果，并可直接利用现有脚本来完成格式转换。","https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fissues\u002F263",{"id":124,"question_zh":125,"answer_zh":126,"source_url":122},3116,"在使用现有评估文件时，是否需要处理 raw_completion 中的随机化顺序？","是的，必须处理随机化。LLM judge 后需要确认 token 是否始终对应 output_1 还是随顺序切换。可以在 raw_completion 中直接修改 logprobs 中的 token，或者在获得偏好后统一处理。如果不撤销随机化，可能导致结果不准确。",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},3117,"使用 alpaca_eval 命令时 GPT-4 API 报错且仅计算了部分样本，如何解决？","这通常是因为使用了非官方的 API 基础地址（如实验室提供的代理 URL）。解决方案是切换到官方 OpenAI GPT-4 API，问题即可解决。之前使用实验室 API 时曾遇到此问题，切换后恢复正常。","https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fissues\u002F273",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},3118,"解析 AlpacaEval 2.0 结果时失败，提示缺失 ordered_models 或 rank 字段，原因是什么？","这通常是由于中间接口（intermediary interface）的问题，而非 OpenAI 本身。尝试使用基础接口（base interface）测试一个示例，或者检查是否使用了非标准的 API 转发服务。如果是中间件问题，维护者建议排查接口层。","https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fissues\u002F204",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},3119,"AlpacaEval 2.0 新增的长度控制指标（Length-Controlled 
Metric）有什么作用？","该指标用于消除 GPT-4 对长度的偏见。它计算响应长于\u002F短于参考响应时的胜率，然后取平均值（balance_win_rate）。新指标具有数学性质，可预测任意两模型的胜率（长度校正或不校正）及 ELO 评分。","https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fissues\u002F225",[143,148,153,158,163,168,173,178,183,188,193,198,203,208,213,218,223,228,233,238],{"id":144,"version":145,"summary_zh":146,"released_at":147},112342,"v0.6.6","## What's Changed\n* [ENH] add strict decoding OAI by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F394\n* Add blendaxai-gm-l6-vo31 to AlpacaEval by @ym-blendax-ai in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F399\n* Added Llama3-PBM-Nova-70B model by @PKU-Baichuan in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F395\n* Add evaluator weighted_alpaca_eval_gpt-4o-mini-2024-07-18 by @tongyx361 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F401\n* Add Shopee-SlimMoA-v1 to AlpacaEval by @LLM-Alignment-sh in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F398\n* [ENH] add metadata to completion: date, version,... by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F402\n* Add REBEL-Llama-3-8B-Instruct-Armo to AlpacaEval by @ZhaolinGao in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F403\n* Add Llama-3-8B-Instruct-SkillMix to AlpacaEval by @parksimon0808 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F405\n* Updated HF Link in model_configs for Llama-3-8B-Instruct-SkillMix by @parksimon0808 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F409\n* Add SelfMoA_gemma-2-9b-it-SimPO,  SelfMoA_gemma-2-9b-it-WPO-HB to AlpacaEval by @wenzhe-li in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F411\n* add Self-taught-llama3.1-70B-dpo as a evaluator by @tianlu-wang in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F412\n* Add GPO-Llama-3-8B-Instruct-GPM-2B and SPPO-Llama-3-8B-Instruct-GPM-2… by @xukp20 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F413\n* Add NullModel to AlpacaEval by @xszheng2020 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F414\n* Add Llama-3-Instruct-8B-RainbowPO to AlpacaEval by @hanyang1999 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F416\n* add example for Llama3 vllm server by @cameron-chen in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F404\n* Add FuseChat-3.0 models to AlpacaEval by @yangzy39 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F426\n* Add TOA to AlpacaEval by @oceanypt in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F428\n* [BUG] tool_calls by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F429\n\n## New Contributors\n* @PKU-Baichuan made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F395\n* @LLM-Alignment-sh made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F398\n* @parksimon0808 made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F405\n* @wenzhe-li made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F411\n* 
@tianlu-wang made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F412\n* @xukp20 made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F413\n* @xszheng2020 made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F414\n* @hanyang1999 made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F416\n* @cameron-chen made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F404\n* @yangzy39 made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F426\n* @oceanypt made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F428\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.6.5...v0.6.6","2024-12-27T21:47:43",{"id":149,"version":150,"summary_zh":151,"released_at":152},112343,"v0.6.5","## What's Changed\n* Add Llama-3-Instruct-8B-WPO-HB-v2 to AlpacaEval by @wzhouad in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F377\n* [ENH] add llama 3.1 by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F378\n* [ENH] add example for LLama 3 vllm by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F381\n* Add Infinity-Instruct-7M-0729-Llama3_1-70B, Infinity-Instruct-7M-0729-Llama3_1-8B, Infinity-Instruct-7M-0729-mistral-7B to AlpacaEval by @cszhengyh in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F383\n* Add gemma-2-9b-it-WPO-HB to AlpacaEval by @wzhouad in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F384\n* Add link to gemma-2-9b-it-WPO-HB by @wzhouad in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F385\n* Change the name of the Infinity-Instruct-7M-0729-Models to Infinity-Instruct-7M-Gen-Models by @cszhengyh in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F387\n* Add blendaxai-gm-l3-v35 to AlpacaEval by @ym-blendax-ai in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F389\n* [ENH] OpenAI use tools instead of functions by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F391\n* [ENH] enable base_dir to be a list by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F392\n* [ENH] add mistral v0.3, Qwen2 70b, gtp4 mini by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F393\n\n## New Contributors\n* @wzhouad made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F377\n* @ym-blendax-ai made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F389\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.6.4...v0.6.5","2024-08-17T23:39:20",{"id":154,"version":155,"summary_zh":156,"released_at":157},112344,"v0.6.4","## What's Changed\n* Add SPPO-Llama-3-Instruct-8B-PairRM to AlpacaEval by @Edward-Sun in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F354\n* Add Infinity-Instruct-3M-0613-Llama3-70B to AlpacaEval by @cszhengyh in 
https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F358\n* Add SPPO-Gemma-2-9B-It-PairRM to AlpacaEval by @angelahzyuan in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F359\n* Add Infinity-Instruct-3M-0625-Models to AlpacaEval by @cszhengyh in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F364\n* Add Higgs Llama3-70B V2 Results by @sxjscience in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F367\n* Added Ghost 8B Beta (d0x5) model by @lh0x00 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F366\n* Add gemma-2-9b-it-SimPO and gemma-2-9b-it-DPO to AlpacaEval by @xiamengzhou in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F368\n* [ENH] add CI test for unwanted files by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F369\n* update model links by @xiamengzhou in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F370\n* [ENH] add the code to compute instruction_following by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F371\n* [ENH] adding simplified glm by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F372\n* [BUG] backward compatibility vllm do_sample -> use_beam_search by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F373\n\n## New Contributors\n* @angelahzyuan made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F359\n* @sxjscience made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F367\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.6.3...v0.6.4","2024-07-18T18:01:23",{"id":159,"version":160,"summary_zh":161,"released_at":162},112345,"v0.6.3","## What's Changed\n* Add the evaluation result for our latest model by @hendrydong in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F286\n* Add Ghost 7B Alpha to AlpacaEval by @lh0x00 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F288\n* Add link for FsfairX-Zephyr-Chat-v0.1 by @hendrydong in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F289\n* add Qwen1.5-110B-Chat self-report results by @Lukeming-tsinghua in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F291\n* [ENH] verifying all the qwens by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F292\n* Enable analyzing evaluators\u002Fannotators on data without multiple generator models by @rdnfn in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F293\n* Add Storm-7B to AlpacaEval by @yifan123 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F294\n* Use verified by default by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F297\n* Add SPPO-Mistral7B-PairRM to AlpacaEval by @Edward-Sun in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F298\n* Add ExPO results to AlpacaEval by @chujiezheng in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F299\n* Fix typo in README.md by @tongyx361 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F302\n* Add 
Yi-Large Preview to AlpacaEval by @HyperdriveHustle in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F304\n* \"Add Mistral-7B+RAHF-DUAL+LoRA to AlpacaEval\" by @LiuAmber in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F307\n* [verified] Yi-large by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F309\n* [ADD] GPT4-o by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F311\n* [ENH] add LC SEM by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F317\n* llama3 evaluator by @zhuang-li in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F314\n* Update README.md by @zhuang-li in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F315\n* [CLEAN] move evaluators lb llama3 by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F318\n* [ENH] vicuna 1.5 by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F319\n* Add Llama-3-Instruct-8B-SimPO to AlpacaEval by @xiamengzhou in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F320\n* [ENH] Use multi threading instead of processing by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F321\n* Add Aligner 2B+GPT-4 Turbo (04\u002F09) Results by @AlignInc in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F324\n* Add REBEL-Llama-3-8B-Instruct to AlpacaEval by @ZhaolinGao in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F326\n* [ENH&BUG] improve VLLM by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F330\n* Add ExPO + `Llama-3-Instruct-8B-SimPO` results by @chujiezheng in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F331\n* fix model link by @chujiezheng in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F332\n* Add merlinite-7B-AOT to AlpacaEval by @imelnyk in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F334\n* [BUG] fix bs in VLLM and add chatml by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F338\n* Add Together-MoA, Together-MoA-Lite to AlpacaEval by @IsThatYou in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F342\n* Add Nanbeige2-16B-Chat to AlpacaEval by @yuani114 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F345\n* Add claude-3-5-sonnet-20240620 to AlpacaEval by @MarjovanLier in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F348\n* [BUG] trust repo alpaca_eval by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F349\n* Add OpenPipe Mixture of Agents model to Alpaca Eval by @saum7800 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F347\n* Add Storm-7B, Storm-7B (best-of-64) to AlpacaEval by @yifan123 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F344\n* Add Infinity-Instruct-3M-0613-Mistral-7B to AlpacaEval by @cszhengyh in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F351\n\n## New Contributors\n* @hendrydong made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F286\n* @lh0x00 made their first 
contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F288\n* @yifan123 made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F294\n* @Edward-Sun made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F298\n* @chujiezheng made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F299\n* @tongyx361 made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F302\n* @LiuAmber made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F307\n* @zhuang-li made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F314\n* @xiamengzhou made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F320\n* @ZhaolinGao made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F326\n* @imelnyk made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F334\n* @IsThatYou made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F342\n* @MarjovanLier made their first cont","2024-06-24T00:58:37",{"id":164,"version":165,"summary_zh":166,"released_at":167},112346,"v0.6.2","## What's Changed\n* [BUG] backward compatibility with AF by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F278\n* Add Nanbeige-Plus-Chat-v0.1 to AlpacaEval by @yuani114 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F279\n* Update README.md by @Dominic789654 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F280\n* [BUG] revert to GPT4 preview 1106 by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F283\n* Add support for analyzing evaluators with custom cross-annotations by @rdnfn in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F281\n* [ENH] llama3 by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F285\n\n## New Contributors\n* @Dominic789654 made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F280\n* @rdnfn made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F281\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.6.1...v0.6.2","2024-04-19T06:28:02",{"id":169,"version":170,"summary_zh":171,"released_at":172},112347,"v0.6.1","## What's Changed\n* Add Aligner-2B+Qwen1.5-72B-Chat & Aligner-2B+Claude3 Opus to AlpacaEval by @AlignInc in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F259\n* Supplement for Aligner by @AlignInc in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F261\n* Add Ein-70B-v0.1 to AlpacaEval by @bin-bi in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F262\n* Add TempNet-LLaMA2-Chat to AlpacaEval by @xumao-nju in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F264\n* Add Conifer-7B-DPO to AlpacaEval by @liulixin29 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F267\n* Updating link 
to a super fast demo! by @kyleliang919 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F268\n* Add Nanbeige2-8B-Chat to AlpacaEval by @yuani114 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F274\n* [ENH] adding drbx and gpt4 turbo by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F275\n\n## New Contributors\n* @AlignInc made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F259\n* @bin-bi made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F262\n* @xumao-nju made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F264\n* @liulixin29 made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F267\n* @yuani114 made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F274\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.6...v0.6.1","2024-04-13T05:40:49",{"id":174,"version":175,"summary_zh":176,"released_at":177},112348,"v0.6","## What's Changed\n* [DATA] Add Gemma by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F242\n* [NOTEBOOK] adding final length correction notebook. by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F244\n* add Mistral-7B-ReMax-v0.1 by @liziniu in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F245\n* [ENH] add claude 3 by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F247\n* [ENH] add contextual by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F250\n* [ENH] add mistral large by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F251\n* Add Samba-CoE-v0.2 to AlpacaEval by @kyleliang919 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F253\n* Add Samba-CoE-v0.2-best-of-16 to AlpacaEval by @kyleliang919 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F256\n* Add Mistral-ORPO-Beta to AlpacaEval by @jiwooya1000 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F257\n* Yann\u002Flength correction by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F258\n\n## New Contributors\n* @liziniu made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F245\n* @kyleliang919 made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F253\n* @jiwooya1000 made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F257\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.5.4...v0.6","2024-03-20T02:50:29",{"id":179,"version":180,"summary_zh":181,"released_at":182},112349,"v0.5.4","## What's Changed\n* Add Qwen1.5-72B-Chat to AlpacaEval by @Lukeming-tsinghua in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F226\n* Add claude-instant-1.2, deepseek-llm-67b-chat, wizardlm-70b, Qwen-14B-Chat  (config + outputs without annotations) by @gblazex in 
https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F228\n* [DATA] Adding annotations for the arena models by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F229\n* Update README.md - Add missing \"Y\" to \"ou\" by @yoderj in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F230\n* [DEV] Analyzing length-controlled metrics. by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F231\n* [DOC] add annotation interpretation by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F232\n* [DATA] add results from the Arena openai models  by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F234\n* update ELO for llama-2-13b-chat-hf by @gblazex in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F235\n* [NOTEBOOK] add length-corrected GLM  by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F237\n* [ENH] add inverse mapper to make sure in and out types are the same by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F240\n* [ENH] update to allow AF to use AE by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F241\n\n## New Contributors\n* @Lukeming-tsinghua made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F226\n* @yoderj made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F230\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.5.3...v0.5.4","2024-02-24T08:56:20",{"id":184,"version":185,"summary_zh":186,"released_at":187},112350,"v0.5.3","## What's Changed\n* [ENH] add mistral-medium by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F205\n* [ENH] add internlm2-chat-20b-ppo by @C1rN09 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F207\n* prettify \"pretty_name\" of internlm2 by @C1rN09 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F208\n* [ENH] add outputs & configs form dolphin 2.2.1 by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F209\n* Add PairRM 0.4B + Yi-34B-Chat to AlpacaEval 2.0 by @jdf-prog in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F210\n* dolphin 2.1.1 configs.yaml by @gblazex in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F212\n* Update README.md (small typo) by @xwinxu in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F213\n* [TEST]: fix ordering of df by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F214\n* Add Snorkel-Mistral-PairRM-DPO (best-of-16) to Alpaca Eval 2.0 by @viethoangtranduong in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F215\n* update InternLM2 chat template by @C1rN09 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F216\n* Add Starling-LM-7B-alpha, vicuna-13b-v1.5, vicuna-7b-v1.5 to AlpacaEval (config + outputs without annotations) by @gblazex in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F217\n* [RES] add 3 models for arena correlations by @YannDubs in 
https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F218\n* Add xwinlm-70b-v0.3 to AlpacaEval by @nbl97 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F221\n* [ENH] add referenced_models locally by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F224\n\n## New Contributors\n* @C1rN09 made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F207\n* @gblazex made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F212\n* @xwinxu made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F213\n* @viethoangtranduong made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F215\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.5.2...v0.5.3","2024-02-01T08:54:42",{"id":189,"version":190,"summary_zh":191,"released_at":192},112351,"v0.5.2","## What's Changed\n* [BUG] force openai >1.5.0 by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F202\n* [WIP] precompute all leaderboard for AE2 by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F199\n* [ENH] add OpenHermes by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F203\n\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.5.1...v0.5.2","2024-01-10T23:57:34",{"id":194,"version":195,"summary_zh":196,"released_at":197},112352,"v0.5.1","## What's Changed\n* [BUG] fix no OAI org id set by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F200\n\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.5.0...v0.5.1","2024-01-10T06:16:16",{"id":199,"version":200,"summary_zh":201,"released_at":202},112353,"v0.5.0","## What's Changed\n* Fix mssg check by @Muennighoff in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F174\n* Add MiniChat-1.5-3B to AlpacaEval and Fix MiniChat-3B by @GeneZC in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F176\n* Add 01-ai\u002FYi-34B-Chat to AlpacaEval by @HyperdriveHustle in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F175\n* feat:  add way to verify results by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F177\n* show img in readme by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F178\n* Add PairRM best-of-16 to AlpacaEval by @jdf-prog in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F181\n* Verify Yi by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F182\n* chore: add phi-2 sft by @lxuechen in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F184\n* add cut-13b by @wwxu21 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F186\n* chore: add phi-2 dpo by @lxuechen in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F185\n* Support phi2, Support SOLAR 10.7B LMCocktail by @yhyu13 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F183\n* Update openai.py by @Muennighoff in 
https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F188\n* chore: add link for phi-2-sft by @lxuechen in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F190\n* chore: fix links by @lxuechen in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F191\n* Add deita-7b-v1.0 model by @VPeterV in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F192\n* [ENH] Azure OAI client & more general way of switching between client configs by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F193\n* [ENH] Weighted win rates by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F189\n* [ENH] new models: Gemini \u002F claude2.1 \u002F mistral \u002F mixtral \u002F .. by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F195\n* [ENH] alpaca_eval 2.0 by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F196\n\n## New Contributors\n* @Muennighoff made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F174\n* @HyperdriveHustle made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F175\n* @jdf-prog made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F181\n* @lxuechen made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F184\n* @wwxu21 made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F186\n* @yhyu13 made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F183\n* @VPeterV made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F192\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.3.6...v0.5.0","2024-01-10T02:32:32",{"id":204,"version":205,"summary_zh":206,"released_at":207},112354,"v0.3.6","## What's Changed\n* feat: verify all the cohere model & use it as eval by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F170\n* Add Tulu 2 models to AlpacaEval by @hamishivi in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F171\n\n## New Contributors\n* @hamishivi made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F171\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.3.5...v0.3.6","2023-11-24T22:50:12",{"id":209,"version":210,"summary_zh":211,"released_at":212},112355,"v0.3.5","## What's Changed\n* [WIP] GPT4 turbo as evaluator by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F160\n* [ENH] add GPT4 turbo as evaluator in README by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F165\n* Add minichat-3b to AlpacaEval by @GeneZC in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F167\n* fix: filter openai spam filter by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F169\n\n## New Contributors\n* @GeneZC made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F167\n\n**Full 
Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.3.3...v0.3.5","2023-11-16T23:19:28",{"id":214,"version":215,"summary_zh":216,"released_at":217},112356,"vv0.3.4","## What's Changed\r\n* [WIP] GPT4 turbo as evaluator by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F160\r\n* [ENH] add GPT4 turbo as evaluator in README by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F165\r\n* Add minichat-3b to AlpacaEval by @GeneZC in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F167\r\n* fix: filter openai spam filter by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F169\r\n\r\n## New Contributors\r\n* @GeneZC made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F167\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.3.3...vv0.3.4","2023-11-16T23:14:28",{"id":219,"version":220,"summary_zh":221,"released_at":222},112357,"v0.3.3","## What's Changed\n* Gpt4 turbo by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F159\n\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.3.2...v0.3.3","2023-11-08T08:25:00",{"id":224,"version":225,"summary_zh":226,"released_at":227},112358,"v0.3.2","## What's Changed\n* add UltraLM-13b-V2.0\u002FUltraLM-13b-V2.0-best-of-16\u002FUltraLM-13b-best-of-16 to AlpacaEval by @lifan-yuan in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F139\n* Add annotations & fix leaderboard by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F142\n* refresh Cohere by @sanderland in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F141\n* Add PlatoLM-7B to AlpacaEval  by @renatz in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F143\n* Add evo-7b to AlpacaEval by @zfang in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F144\n* Add NEFTune models to AlpacaEval by @neelsjain in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F146\n* Add claude2-alpaca-13b, recycled-wizardlm-7b-v1.0, recycled-wizardlm-… by @MingLiiii in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F147\n* Add CausalLM\u002F14B to AlpacaEval by @CausalLM in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F148\n* Add Zephyr 7B evals by @lewtun in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F152\n* Add Evo v2 7B by @zfang in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F153\n* Add decoder for calling Anthropic models via Amazon Bedrock by @billcai in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F151\n* cohere update by @sanderland in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F155\n* feat:  upgrade to openai 1.0.0 by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F157\n\n## New Contributors\n* @lifan-yuan made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F139\n* @renatz made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F143\n* @zfang 
made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F144\n* @neelsjain made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F146\n* @MingLiiii made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F147\n* @CausalLM made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F148\n* @lewtun made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F152\n* @billcai made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F151\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.3.1...v0.3.2","2023-11-08T07:18:20",{"id":229,"version":230,"summary_zh":231,"released_at":232},112359,"v0.3.1","## What's Changed\n* Add results of Xwin-LM  by @nbl97 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F135\n* [ENH] add gpt 3.5 instruct by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F137\n\n## New Contributors\n* @nbl97 made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F135\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.3.0...v0.3.1","2023-09-19T20:58:08",{"id":234,"version":235,"summary_zh":236,"released_at":237},112360,"v0.3.0","## What's Changed\n* [ENH] add fixed gpt4 version annotator by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F127\n* Add openbuddy-llama2-13b-v11.1 by @44670 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F129\n* [ENH] add max concurrency oai by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F131\n\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.2.9...v0.3.0","2023-09-01T05:30:03",{"id":239,"version":240,"summary_zh":241,"released_at":242},112361,"v0.2.9","## What's Changed\n* Ensure primary keys are string & decrease processes for OpenAI by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F116\n* Add JinaChat to the leaderboards by @jupyterjazz in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F117\n* [BUG] jina chat error in configs by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F118\n* Add Humpback to AlpacaEval by @xianxl in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F120\n* update Humpback results by @xianxl in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F121\n* add link to Humpback paper by @xianxl in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F122\n* Add `vllm` decoder for model inference by @44670 in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F124\n* [ENH] return `completions_all` and allow sequence of max_tokens by @YannDubs in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F125\n\n## New Contributors\n* @jupyterjazz made their first contribution in https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F117\n* @xianxl made their first contribution in 
https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fpull\u002F120\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval\u002Fcompare\u002Fv0.2.8...v0.2.9","2023-08-23T02:46:49"]